Tokenizers for African Languages

Despite remarkable progress in the field of natural language processing (NLP), there remains a large performance gap between high-resource languages (HRLs) and low-resource languages (LRLs). African languages mainly belong to the LRLs, and one of the major contributing factors to the performance gap is tokenization, which plays a crucial role in NLP performance in general. Many recent studies on African languages rely on multilingual or general-purpose tokenizers that are not optimized for African languages, which can lead to suboptimal performance in downstream NLP tasks. In this paper, we systematically analyze the performance of language-specific tokenizers for three African languages: Swahili, Hausa, and Yoruba. From experimental results on two classification tasks (i.e., sentiment classification and news classification), we found that language-specific tokenizers for African languages consistently outperformed other monolingual tokenizers, with performance gaps of up to 5.43% in sentiment classification and 4.58% in news classification. We also found that multilingual tokenizers generally work well if they are trained on many African languages rather than on global HRLs. For instance, African multilingual tokenizers outperformed global multilingual tokenizers by an average of 1.70% in sentiment classification and 1.41% in news classification; the largest observed improvement was 2.61% in news classification using Logistic Regression (LR). Based on these results, we suggest a method for choosing tokenizers when analyzing data or developing models for African languages.
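As context for the abstract, the following is a minimal, hypothetical sketch of the kind of comparison the paper studies: training a language-specific BPE tokenizer on a monolingual Swahili corpus with the Hugging Face tokenizers library and comparing its fertility (subword tokens per word) against a global multilingual tokenizer. The corpus file name, vocabulary size, model choices, and the fertility heuristic are all illustrative assumptions, not the authors' actual setup.

# Hypothetical sketch (not the paper's setup): train a language-specific
# BPE tokenizer on a monolingual Swahili corpus and compare its fertility
# against a global multilingual tokenizer.
# Requires: pip install tokenizers transformers
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import AutoTokenizer

# Language-specific tokenizer: BPE trained from scratch on Swahili text.
# "swahili_corpus.txt" and the vocabulary size are illustrative assumptions.
swahili_tok = Tokenizer(models.BPE(unk_token="[UNK]"))
swahili_tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=30000, special_tokens=["[UNK]", "[PAD]"])
swahili_tok.train(files=["swahili_corpus.txt"], trainer=trainer)

# Global multilingual tokenizer for comparison.
multi_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def fertility(tokenize, sentences):
    # Average number of subword tokens per whitespace-separated word;
    # lower fertility usually means the tokenizer fits the language better.
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words

sample = ["Habari za asubuhi, karibu sana."]  # toy Swahili example
print("language-specific:", fertility(lambda s: swahili_tok.encode(s).tokens, sample))
print("multilingual:", fertility(lambda s: multi_tok.tokenize(s), sample))

Lower fertility on held-out text is one plausible signal when choosing among candidate tokenizers; the selection method the authors actually propose is described in the article itself.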

Bibliographic Details
Main Authors: Goodwill Erasmo Ndomba (ORCID: 0009-0009-6842-3548), Medard Edmund Mswahili (ORCID: 0000-0002-6893-6281), Young-Seob Jeong (ORCID: 0000-0002-9441-2940)
Affiliation: Department of Computer Engineering, Chungbuk National University, Cheongju, Republic of Korea (all authors)
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access, vol. 13, pp. 1046-1054
DOI: 10.1109/ACCESS.2024.3522285
ISSN: 2169-3536
Subjects: Tokenizer; classification; African language; Swahili; Yoruba; Hausa
Online Access: https://ieeexplore.ieee.org/document/10815724/