Investigating the Relationship Between Text Vectorization Cosine Similarity and Classification Performance
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11108167/ |
| Summary: | In recent years, optimizing classification pipelines has become increasingly critical due to the growing volume of textual data and the computational challenges associated with exhaustive hyperparameter tuning. This paper proposes a similarity-based approach for selecting the most promising vectorization configurations for Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Word2Vec by analyzing the average cosine similarity of the generated vectors. The method preselects configurations that yield more diverse textual representations (lower average similarity), on the hypothesis that increased diversity in text representations enhances the discriminative capacity of classification models. Experimental evaluations conducted on five different datasets demonstrate that the similarity-based approach achieves accuracy and F1-Score results very close to those obtained via exhaustive search, with notable reductions in processing time. Correlation analyses further reveal a strong inverse relationship between vector similarity and model performance for BoW and TF-IDF, and a moderate relationship for Word2Vec. These findings validate the efficacy of the proposed method as a practical alternative to exhaustive hyperparameter search in vectorization pipelines, offering significant benefits for applications where exhaustive exploration is infeasible. |
| ISSN: | 2169-3536 |
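
The preselection idea described in the summary can be illustrated with a small sketch. This is not the authors' implementation, which this record does not show. It is a minimal pure-Python illustration, assuming whitespace tokenization, a common smoothed IDF variant, and three hypothetical configurations (`bow_counts`, `bow_binary`, `tfidf`). All function names are invented for illustration, and Word2Vec is omitted because it requires a trained embedding model. Configurations are ranked by the mean pairwise cosine similarity of their document vectors, lowest (most diverse) first.

```python
import math
from collections import Counter
from itertools import combinations


def bow_vectors(docs, binary=False):
    """Bag-of-Words vectors over a shared vocabulary (hypothetical helper)."""
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        if binary:
            vectors.append([float(min(counts[t], 1)) for t in vocab])
        else:
            vectors.append([float(counts[t]) for t in vocab])
    return vectors


def tfidf_vectors(docs):
    """TF-IDF vectors using a smoothed IDF (an assumption, not the paper's setting)."""
    counts = bow_vectors(docs)
    n_docs = len(docs)
    n_terms = len(counts[0])
    doc_freq = [sum(1 for vec in counts if vec[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1.0 for df in doc_freq]
    return [[tf * w for tf, w in zip(vec, idf)] for vec in counts]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of document vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def rank_configurations(docs, configs):
    """Return (name, mean similarity) pairs, most diverse configuration first."""
    scores = {name: mean_pairwise_cosine(fn(docs)) for name, fn in configs.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Toy corpus and candidate configurations, both invented for illustration.
docs = [
    "the movie was great and the cast was great",
    "the movie was dull and far too long",
    "the team won the match in extra time",
    "the match was long and the team was tired",
]
configs = {
    "bow_counts": bow_vectors,
    "bow_binary": lambda d: bow_vectors(d, binary=True),
    "tfidf": tfidf_vectors,
}
ranking = rank_configurations(docs, configs)

for name, score in ranking:
    print(f"{name}: mean cosine similarity = {score:.3f}")
```

Under the summary's hypothesis, only the top-ranked configurations would then be passed on to classifier training, instead of training a model for every candidate. On a realistic corpus the pairwise computation grows quadratically with the number of documents, so sampling document pairs would likely be needed.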