Investigating the Relationship Between Text Vectorization Cosine Similarity and Classification Performance

Bibliographic Details
Main Authors: Fernando Rezende Zagatti, Gilson Yuuji Shimizu, Daniel Lucredio, Helena de Medeiros Caseli
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11108167/
Description
Summary: In recent years, optimizing classification pipelines has become increasingly critical because of the growing volume of textual data and the computational cost of exhaustive hyperparameter tuning. This paper proposes a similarity-based approach for selecting the most promising vectorization configurations for Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Word2Vec by analyzing the average cosine similarity of the generated vectors. The method preselects configurations that yield more diverse textual representations, on the hypothesis that increased diversity in text representations enhances the discriminative capacity of classification models. Experimental evaluations on five datasets show that the similarity-based approach achieves accuracy and F1-Score results very close to those obtained via exhaustive search, with notable reductions in processing time. Correlation analyses further reveal a strong inverse relationship between vector similarity and model performance for BoW and TF-IDF, and a moderate relationship for Word2Vec. These findings validate the efficacy of the proposed method as a practical alternative to exhaustive hyperparameter search in vectorization pipelines, offering significant benefits for applications where exhaustive exploration is infeasible.
ISSN: 2169-3536
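
Illustration: the selection idea described in the summary (rank vectorization configurations by the average cosine similarity of the vectors they produce, and prefer the lowest) can be sketched roughly as follows. This is not the authors' implementation. It assumes a toy whitespace-tokenized bag-of-words vectorizer whose only hyperparameter is the n-gram size, and every function name (bow_vector, cosine, average_similarity, rank_configs) is hypothetical. The actual configuration space and datasets used in the article are not given in this record.

```python
from collections import Counter
from itertools import combinations
import math


def bow_vector(text, ngram=1):
    """Count word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + ngram]) for i in range(len(tokens) - ngram + 1)]
    return Counter(grams)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def average_similarity(vectors):
    """Mean cosine similarity over all unordered pairs (needs >= 2 vectors)."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def rank_configs(texts, ngram_sizes):
    """Return (ngram, avg_similarity) pairs, lowest similarity first."""
    scores = []
    for n in ngram_sizes:
        vectors = [bow_vector(t, ngram=n) for t in texts]
        scores.append((n, average_similarity(vectors)))
    return sorted(scores, key=lambda item: item[1])


# Toy corpus, made up for illustration only.
texts = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
ranking = rank_configs(texts, ngram_sizes=[1, 2])
best_ngram, best_score = ranking[0]
print(ranking)
```

Under the hypothesis stated in the summary, the first entry of ranking would be the configuration to try first with a classifier, so that only a preselected subset of configurations needs to be trained and evaluated instead of all of them.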