Tokenizers for African Languages

Despite remarkable progress in the field of natural language processing (NLP), there remains a large performance gap between high-resource languages (HRLs) and low-resource languages (LRLs). African languages mainly belong to the LRLs, and one of the major contributing factors to the performance gap is tokenization, which plays a crucial role in NLP performance in general. Many recent studies on African languages rely on multilingual or general-purpose tokenizers that are not optimized for African languages, which can lead to suboptimal performance in downstream NLP tasks. In this paper, we systematically analyze the performance of language-specific tokenizers for three African languages: Swahili, Hausa, and Yoruba. From experimental results on two classification tasks (i.e., sentiment classification and news classification), we found that language-specific tokenizers for African languages consistently outperformed other monolingual tokenizers, with performance gaps of up to 5.43% in sentiment classification and 4.58% in news classification. We also found that multilingual tokenizers generally work well if they are trained on many African languages rather than on global HRLs. For instance, African multilingual tokenizers outperformed global multilingual tokenizers by an average of 1.70% in sentiment classification and 1.41% in news classification; the largest observed improvement was 2.61% in news classification using Logistic Regression (LR). Based on these results, we suggest a method for choosing tokenizers when analyzing data or developing models for African languages.
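As context for the abstract, the following is a minimal, hypothetical sketch of the kind of comparison the paper studies: training a language-specific BPE tokenizer on a monolingual Swahili corpus with the Hugging Face tokenizers library and comparing its fertility (subword tokens per word) against a global multilingual tokenizer. The corpus file name, vocabulary size, model choices, and the fertility heuristic are all illustrative assumptions, not the authors' actual setup.

# Hypothetical sketch (not the paper's setup): train a language-specific
# BPE tokenizer on a monolingual Swahili corpus and compare its fertility
# against a global multilingual tokenizer.
# Requires: pip install tokenizers transformers
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import AutoTokenizer

# Language-specific tokenizer: BPE trained from scratch on Swahili text.
# "swahili_corpus.txt" and the vocabulary size are illustrative assumptions.
swahili_tok = Tokenizer(models.BPE(unk_token="[UNK]"))
swahili_tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=30000, special_tokens=["[UNK]", "[PAD]"])
swahili_tok.train(files=["swahili_corpus.txt"], trainer=trainer)

# Global multilingual tokenizer for comparison.
multi_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def fertility(tokenize, sentences):
    # Average number of subword tokens per whitespace-separated word;
    # lower fertility usually means the tokenizer fits the language better.
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words

sample = ["Habari za asubuhi, karibu sana."]  # toy Swahili example
print("language-specific:", fertility(lambda s: swahili_tok.encode(s).tokens, sample))
print("multilingual:", fertility(lambda s: multi_tok.tokenize(s), sample))

Lower fertility on held-out text is one plausible signal when choosing among candidate tokenizers; the selection method the authors actually propose is described in the article itself.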

Bibliographic Details
Main Authors: Goodwill Erasmo Ndomba (ORCID: 0009-0009-6842-3548), Medard Edmund Mswahili (ORCID: 0000-0002-6893-6281), Young-Seob Jeong (ORCID: 0000-0002-9441-2940)
Affiliation: Department of Computer Engineering, Chungbuk National University, Cheongju, Republic of Korea (all authors)
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access, vol. 13, pp. 1046-1054
DOI: 10.1109/ACCESS.2024.3522285
ISSN: 2169-3536
Subjects: Tokenizer; classification; African language; Swahili; Yoruba; Hausa
Online Access: https://ieeexplore.ieee.org/document/10815724/