Gibbs-BERTopic: A Hybrid Approach for Short Text Topic Modeling

Bibliographic Details
Main Authors: Yan Zhu, Yueying Liu
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10930480/
Description
Summary: Online reviews are a rich, direct source of user needs, and topic modeling can analyze them to uncover user preferences and requirements. However, reviews are short and unstructured, and their high dimensionality, noise, and complex semantic structure pose challenges for traditional topic modeling techniques. BERTopic uses high-dimensional embeddings to capture the semantic information in text, but it suffers from sparsity, noise, and uncertainty on small-scale data. To reduce information barriers and improve the robustness and accuracy of topic models, we propose an enhanced topic model, Gibbs-BERTopic, a hybrid method that integrates Gibbs sampling with BERTopic. Gibbs-BERTopic takes BERTopic's initial topic modeling results and applies Gibbs sampling to iteratively adjust the topic distribution, improving the quality of the topic model. Experimental results show that Gibbs-BERTopic significantly outperforms the original BERTopic model, with an increase of over 20% in silhouette coefficient and an improvement of nearly 20% in topic coherence. These results indicate that Gibbs-BERTopic effectively reduces the impact of noise and data sparsity on topic extraction, optimizes the topic distribution, and improves topic clustering, yielding topics that better reflect the actual characteristics and latent structure of the data.
ISSN: 2169-3536
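
The summary describes refining BERTopic's initial topic assignments with Gibbs sampling but does not give the sampling equations. The sketch below illustrates the general idea only and should not be read as the authors' method: it assumes a Dirichlet-multinomial mixture with one topic per document (the collapsed Gibbs update used by GSDMM-style short-text models). The function name `gibbs_refine`, the hyperparameters `alpha` and `beta`, and the toy data are all hypothetical. Documents are lists of integer word ids, and an initial label of -1 (BERTopic's outlier label) is replaced by a random topic.

```python
import numpy as np

def gibbs_refine(docs, init_labels, n_topics, vocab_size,
                 alpha=0.1, beta=0.1, n_iters=20, seed=0):
    """Refine one-topic-per-document labels with collapsed Gibbs sampling.

    Assumption: a Dirichlet-multinomial mixture model, not the paper's
    exact formulation, which the summary does not specify.
    """
    rng = np.random.default_rng(seed)
    labels = np.array(init_labels, dtype=int)
    # Outliers (label -1) get a random starting topic.
    for d in range(len(docs)):
        if labels[d] < 0:
            labels[d] = rng.integers(n_topics)

    doc_count = np.zeros(n_topics)                 # documents per topic
    word_count = np.zeros((n_topics, vocab_size))  # word occurrences per topic
    total = np.zeros(n_topics)                     # total words per topic
    for doc, k in zip(docs, labels):
        doc_count[k] += 1
        total[k] += len(doc)
        for w in doc:
            word_count[k, w] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            k_old = labels[d]
            # Remove document d from its current topic.
            doc_count[k_old] -= 1
            total[k_old] -= len(doc)
            for w in doc:
                word_count[k_old, w] -= 1

            # Log of the conditional p(z_d = k | all other assignments).
            log_p = np.log(doc_count + alpha)
            seen = {}
            for i, w in enumerate(doc):
                j = seen.get(w, 0)  # earlier occurrences of w in this doc
                log_p += np.log(word_count[:, w] + beta + j)
                log_p -= np.log(total + vocab_size * beta + i)
                seen[w] = j + 1

            p = np.exp(log_p - log_p.max())
            p /= p.sum()
            k_new = rng.choice(n_topics, p=p)

            # Add document d to the sampled topic.
            labels[d] = k_new
            doc_count[k_new] += 1
            total[k_new] += len(doc)
            for w in doc:
                word_count[k_new, w] += 1
    return labels

# Toy corpus: word ids 0-2 and 3-5 form two groups; the initial labels are noisy.
docs = [[0, 1, 2], [0, 1, 1], [1, 2, 0], [3, 4, 5], [4, 5, 3], [5, 3, 4]]
init = [0, 0, 1, 1, 1, -1]
refined = gibbs_refine(docs, init, n_topics=2, vocab_size=6)
print(refined)
```

In practice the initial labels would come from a fitted BERTopic model, for example the topic list returned by `BERTopic().fit_transform(texts)`, and the refined labels could then be compared with the originals using a clustering metric such as the silhouette coefficient.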