Optimized Identification of Sentence-Level Multiclass Events on Urdu-Language-Text Using Machine Learning Techniques

Bibliographic Details
Main Authors: Somia Ali, Uzma Jamil, Muhammad Younas, Bushra Zafar, Muhammad Kashif Hanif
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
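Subjects: Sentence level classification; deep learning; machine learning; Urdu language; event classification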
Online Access: https://ieeexplore.ieee.org/document/10816408/
author Somia Ali
Uzma Jamil
Muhammad Younas
Bushra Zafar
Muhammad Kashif Hanif
collection DOAJ
description In today’s digital world, social media platforms generate a plethora of unstructured information. However, for low-resource languages like Urdu, well-structured data for specific tasks such as event classification is scarce. Urdu, a language prominent in South Asia, has a complex morphological structure with unique features but lacks standard linguistic resources such as datasets. Long-text classification demands more effort than short-text classification because of its expansive vocabulary, information redundancy, and noise. Text processing is an active research area, and many machine learning and deep learning techniques are widely applied to it. Multiclass classification has been used to classify text in different languages for various purposes. In this research, multiclass classification for the Urdu language was performed on a text dataset of 103,771 sentences taken from five different social media platforms: Geo News, Samaa News, Dawn News, Express News, and Urdu Blogs. We used sentence-level classification to assign sentences to 16 event categories: terrorist attacks, national news, sports, entertainment, politics, safety, earthquakes, fraud and corruption, sexual assault, weather, accidents, forces, inflation, murder and death, education, and international news. Deep learning, transformer-based, and machine learning classifiers were used for event classification. The SMFCNN classifier achieved an accuracy of 88.29%, and the proposed transformer-based XLM-R+ model performed better still, with an accuracy of 89.8%. Our results were compared with previously reported techniques that used traditional models, highlighting the significant improvements offered by our approaches. The novelty of this research lies in the inclusion of 16 event categories to broaden coverage and in the implementation of the SMFCNN and transformer-based algorithms. This study highlights the potential of deep learning and transformer-based models to enhance the accuracy and generalizability of multiclass classification in low-resource languages such as Urdu.
format Article
id doaj-art-85a31264aebc4544bcb6ccbc5e210396
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-85a31264aebc4544bcb6ccbc5e210396
2025-01-03T00:02:06Z
eng
IEEE
IEEE Access
2169-3536
2025-01-01
Volume 13, pages 1-25
DOI 10.1109/ACCESS.2024.3522992
Document 10816408
Optimized Identification of Sentence-Level Multiclass Events on Urdu-Language-Text Using Machine Learning Techniques
Somia Ali, https://orcid.org/0000-0002-3717-2012
Uzma Jamil, https://orcid.org/0000-0003-4555-2389
Muhammad Younas, https://orcid.org/0000-0003-4161-7843
Bushra Zafar, https://orcid.org/0000-0002-8869-3037
Muhammad Kashif Hanif, https://orcid.org/0000-0001-5150-2228
Department of Computer Science, Government College University Faisalabad, Faisalabad, Pakistan (all five authors)
https://ieeexplore.ieee.org/document/10816408/
title Optimized Identification of Sentence-Level Multiclass Events on Urdu-Language-Text Using Machine Learning Techniques
topic Sentence level classification
deep learning
machine learning
Urdu language
event classification
url https://ieeexplore.ieee.org/document/10816408/