DAugSindhi: a data augmentation approach for enhancing Sindhi language text classification
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer, 2025-06-01 |
| Series: | Discover Data |
| Subjects: | |
| Online Access: | https://doi.org/10.1007/s44248-025-00040-8 |
| Summary: | Abstract Sindhi, a low-resource language spoken by millions, faces significant challenges in Natural Language Processing (NLP) due to the scarcity of annotated datasets. This paper presents DAugSindhi, a study focused on enhancing Sindhi text classification through data augmentation techniques. These methods aim to address data scarcity by artificially expanding the dataset to improve model performance. The study explores various augmentation techniques, including Easy Data Augmentation (EDA), Back Translation, Paraphrasing, and Text Generation using Large Language Models (LLMs). EDA methods like synonym replacement, random insertion, random deletion, and random swapping introduce semantic diversity, while Back Translation creates contextual variations. Paraphrasing and Text Generation leverage LLMs to produce enriched and diverse text samples. Experiments were conducted on a Sindhi dataset of 3364 news articles categorized into sports, entertainment, and technology. A multilingual BERT model (bert-base-multilingual-cased) was fine-tuned for binary and multi-class classification tasks. Results revealed that EDA, particularly Random Deletion, outperformed all other techniques, achieving a 99% F1 score for binary classification. Back Translation and Paraphrasing also delivered substantial improvements, highlighting their utility in low-resource settings. This work establishes a robust baseline for Sindhi text classification, demonstrating that simple augmentation techniques can significantly enhance NLP applications for underrepresented languages. Future work will explore hybrid strategies and larger datasets to further advance Sindhi NLP. |
|---|---|
| ISSN: | 2731-6955 |
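The abstract names four Easy Data Augmentation (EDA) operations. As a minimal sketch of how two of them work (random deletion and random swap), the snippet below operates on token lists; the function names and parameters are illustrative, not taken from the paper.

```python
import random

def random_deletion(tokens, p=0.1, seed=None):
    """EDA random deletion: drop each token independently with probability p.

    Illustrative implementation, not the paper's code. A single-token
    input is returned unchanged, and at least one token is always kept.
    """
    rng = random.Random(seed)
    if len(tokens) <= 1:
        return tokens[:]
    kept = [t for t in tokens if rng.random() > p]
    # Guard against deleting everything from a short sentence.
    return kept if kept else [rng.choice(tokens)]

def random_swap(tokens, n=1, seed=None):
    """EDA random swap: exchange two randomly chosen positions, n times."""
    rng = random.Random(seed)
    out = tokens[:]
    for _ in range(n):
        if len(out) < 2:
            break
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out
```

Each call produces a perturbed copy of the sentence; in an EDA pipeline such copies are added to the training set alongside the originals to expand a small labeled corpus.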