Roman urdu hate speech detection using hybrid machine learning models and hyperparameter optimization

Bibliographic Details
Main Authors:	Waqar Ashiq, Samra Kanwal, Adnan Rafique, Muhammad Waqas, Tahir Khurshaid, Elizabeth Caro Montero, Alicia Bustamante Alonso, Imran Ashraf
Format:	Article
Language:	English
Published:	Nature Portfolio 2024-11-01
Series:	Scientific Reports
Subjects:	Hate speech detection, Deep learning, Model optimization, Urdu text classification
Online Access:	https://doi.org/10.1038/s41598-024-79106-7

author	Waqar Ashiq, Samra Kanwal, Adnan Rafique, Muhammad Waqas, Tahir Khurshaid, Elizabeth Caro Montero, Alicia Bustamante Alonso, Imran Ashraf
collection	DOAJ
description	Abstract With the rapid increase in the number of social media users, cyberbullying and hate speech have become growing problems in recent years. Automatic hate speech detection (HSD) from text is an emerging research problem in natural language processing (NLP). Researchers have developed various approaches to automatic hate speech detection using different corpora in various languages; however, research on the Urdu language is rather scarce. This study addresses the HSD task on Twitter using Roman Urdu text. The contribution of this research is the development of a hybrid model for Roman Urdu HSD, which has not been previously explored. The novel hybrid model integrates deep learning (DL) and transformer models for automatic feature extraction with machine learning algorithms (MLAs) for classification. To further enhance model performance, we employ several hyperparameter optimization (HPO) techniques, including Grid Search (GS), Randomized Search (RS), and Bayesian Optimization with Gaussian Processes (BOGP). Evaluation is carried out on two publicly available benchmark Roman Urdu corpora: the HS-RU-20 corpus and the RUHSOLD hate speech corpus. Results show that the Multilingual BERT (MBERT) feature learner, paired with a Support Vector Machine (SVM) classifier and optimized using RS, achieves state-of-the-art performance. On the HS-RU-20 corpus, this model attained an accuracy of 0.93 and an F1 score of 0.95 for the Neutral-Hostile classification task, and an accuracy of 0.89 with an F1 score of 0.88 for the Hate Speech-Offensive task. On the RUHSOLD corpus, the same model achieved an accuracy of 0.95 and an F1 score of 0.94 for the Coarse-grained task, alongside an accuracy of 0.87 and an F1 score of 0.84 for the Fine-grained task. These results demonstrate the effectiveness of our hybrid approach for Roman Urdu hate speech detection.
format	Article
id	doaj-art-41f3ccb688e849e3a29f8bc3b3179dc3
institution	Kabale University
issn	2045-2322
language	English
publishDate	2024-11-01
publisher	Nature Portfolio
record_format	Article
series	Scientific Reports
spelling	doaj-art-41f3ccb688e849e3a29f8bc3b3179dc3; 2024-11-24T12:25:15Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2024-11-01; vol. 14, no. 1, pp. 1-22; 10.1038/s41598-024-79106-7; Roman urdu hate speech detection using hybrid machine learning models and hyperparameter optimization; Waqar Ashiq (Department of Software Engineering, University of Management and Technology), Samra Kanwal (Department of Computer Science, University of Management and Technology), Adnan Rafique (School of Information and Communications Technology, University of Tasmania), Muhammad Waqas (Department of Mathematics, University of Education), Tahir Khurshaid (Department of Electrical Engineering, Yeungnam University), Elizabeth Caro Montero (Universidad Europea del Atlantico), Alicia Bustamante Alonso (Universidad Europea del Atlantico), Imran Ashraf (Department of Information and Communication Engineering, Yeungnam University); https://doi.org/10.1038/s41598-024-79106-7; Hate speech detection, Deep learning, Model optimization, Urdu text classification
title	Roman urdu hate speech detection using hybrid machine learning models and hyperparameter optimization
topic	Hate speech detection, Deep learning, Model optimization, Urdu text classification
url	https://doi.org/10.1038/s41598-024-79106-7

Similar Items