MIRoBERTa: Mental Illness Text Classification With Transfer Learning on Subreddits
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2024-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/10815727/ |
| Summary: | Social media has emerged as a critical resource for text classification, with Reddit prominent among these platforms. In addition to serving as a space for users to share thoughts openly, Reddit is a substantial repository of diverse expressions, encompassing discourse on mental health. This study employs the Reddit PullPush application programming interface to gather posts related to seven common mental illnesses: attention-deficit/hyperactivity disorder, anxiety, bipolar disorder, borderline personality disorder, depression, obsessive-compulsive disorder, and post-traumatic stress disorder. A robust set of 54 161 submissions across 11 subreddits was assembled, with a “none” class indicating the absence of mental illness. The study comparatively evaluates three traditional machine learning models, two powerful bidirectional deep learning models, and four transformer models: bidirectional encoder representations from transformers (BERT), robustly optimized BERT pretraining approach (RoBERTa), BigBIRD, and the long-document transformer (Longformer) on a multiclass text classification task. The BigBIRD and Longformer approaches outperformed the standard transformers, BERT and RoBERTa: BigBIRD achieved the highest accuracy and F1-score of 0.840, whereas the RoBERTa model had a slightly lower accuracy and F1-score of 0.834. We further pretrain the RoBERTa-base and BERT-base-uncased models with a fill-mask task on a public mental illness corpus to improve domain language understanding before fine-tuning. The resulting model, mental illness RoBERTa (MIRoBERTa), outperformed all other models on the text classification task with an accuracy of 0.847 and an F1-score of 0.847. Additionally, mental illness BERT (MIBERT) surpassed existing domain-specific pretrained models with an accuracy of 0.835 and an F1-score of 0.835. We also explore the effectiveness of ensemble techniques by combining the domain-adapted models with their original variants. Finally, we analyze word importance to identify the terms that contribute most significantly to the models' classification decisions. |
| ISSN: | 2169-3536 |
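
The summary describes a two-stage recipe: continued fill-mask (masked-language-model) pretraining of RoBERTa-base on a mental-illness Reddit corpus, followed by fine-tuning for eight-way classification (seven disorders plus “none”). The sketch below illustrates that recipe with the Hugging Face `transformers` and `datasets` libraries; the placeholder posts, output paths, label scheme, and hyperparameters are illustrative assumptions, not the authors' settings.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
    DataCollatorForLanguageModeling,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Placeholder corpus; the study gathered 54,161 subreddit submissions via PullPush.
texts = [
    "I can't focus on anything for more than a minute lately.",
    "Just sharing photos from my weekend hike, feeling great.",
]
labels = [0, 7]  # assumed label scheme: 0 = ADHD, ..., 7 = "none"

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# ---- Stage 1: domain-adaptive pretraining with the fill-mask (masked-LM) objective ----
mlm_dataset = Dataset.from_dict({"text": texts}).map(
    tokenize, batched=True, remove_columns=["text"]
)
mlm_trainer = Trainer(
    model=AutoModelForMaskedLM.from_pretrained("roberta-base"),
    args=TrainingArguments(output_dir="miroberta-mlm", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=mlm_dataset,
    # 15% dynamic token masking, the standard RoBERTa-style masked-LM setup
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)
mlm_trainer.train()
mlm_trainer.save_model("miroberta-mlm")  # assumed output path for the domain-adapted encoder

# ---- Stage 2: fine-tune the adapted encoder for 8-way classification ----
clf_dataset = Dataset.from_dict({"text": texts, "labels": labels}).map(
    tokenize, batched=True, remove_columns=["text"]
)
clf_trainer = Trainer(
    model=AutoModelForSequenceClassification.from_pretrained("miroberta-mlm", num_labels=8),
    args=TrainingArguments(output_dir="miroberta-clf", num_train_epochs=4,
                           per_device_train_batch_size=8),
    train_dataset=clf_dataset,
    data_collator=DataCollatorWithPadding(tokenizer),  # pad each batch to its longest example
)
clf_trainer.train()
```

In a similar spirit, the ensemble mentioned in the summary could be approximated by averaging the softmax probabilities of the domain-adapted and original fine-tuned classifiers; the paper's exact ensembling strategy is not specified in this record.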