DAugSindhi: a data augmentation approach for enhancing Sindhi language text classification

Bibliographic Details
Main Authors: Raja Vavekanand, Bhagwan Das, Teerath Kumar
Format: Article
Language: English
Published: Springer 2025-06-01
Series: Discover Data
Online Access: https://doi.org/10.1007/s44248-025-00040-8
Description
Summary: Sindhi, a low-resource language spoken by millions, faces significant challenges in Natural Language Processing (NLP) due to the scarcity of annotated datasets. This paper presents DAugSindhi, a study focused on enhancing Sindhi text classification through data augmentation techniques. These methods aim to address data scarcity by artificially expanding the dataset to improve model performance. The study explores various augmentation techniques, including Easy Data Augmentation (EDA), Back Translation, Paraphrasing, and Text Generation using Large Language Models (LLMs). EDA methods like synonym replacement, random insertion, random deletion, and random swapping introduce semantic diversity, while Back Translation creates contextual variations. Paraphrasing and Text Generation leverage LLMs to produce enriched and diverse text samples. Experiments were conducted on a Sindhi dataset of 3364 news articles categorized into sports, entertainment, and technology. A multilingual BERT model (bert-base-multilingual-cased) was fine-tuned for binary and multi-class classification tasks. Results revealed that EDA, particularly Random Deletion, outperformed all other techniques, achieving a 99% F1 score for binary classification. Back Translation and Paraphrasing also delivered substantial improvements, highlighting their utility in low-resource settings. This work establishes a robust baseline for Sindhi text classification, demonstrating that simple augmentation techniques can significantly enhance NLP applications for underrepresented languages. Future work will explore hybrid strategies and larger datasets to further advance Sindhi NLP.
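The random deletion operation highlighted in the summary can be sketched as follows. This is a minimal illustration of the general EDA technique, not the authors' implementation: the function name, the per-token deletion probability `p`, and the sample sentence are assumptions introduced here for clarity.

```python
import random

def random_deletion(tokens, p=0.1, seed=None):
    """EDA-style random deletion: drop each token with probability p.

    Illustrative sketch only; parameter defaults are assumptions,
    not values from the DAugSindhi paper.
    """
    rng = random.Random(seed)
    if len(tokens) <= 1:
        return tokens[:]  # never delete a one-token input
    kept = [t for t in tokens if rng.random() > p]
    # Guarantee at least one token survives the deletion pass
    return kept if kept else [rng.choice(tokens)]

# Works on any whitespace-tokenized text, including Sindhi script
# (hypothetical example sentence):
augmented = random_deletion("هي هڪ مثال جملو آهي".split(), p=0.2, seed=0)
```

Each augmented copy is added to the training set alongside the original, which is how such operations artificially expand a small dataset.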
ISSN: 2731-6955