Tokenizers for African Languages
Despite rapid progress in natural language processing (NLP), a wide performance gap persists between high-resource languages (HRLs) and low-resource languages (LRLs). African languages mostly fall among the LRLs, and one major contributor to this gap is tokenization, which plays a crucial role in NLP performance in general. Many recent studies on African languages rely on multilingual or general-purpose tokenizers that are not optimized for African languages, which may lead to suboptimal performance in downstream NLP tasks. In this paper, we systematically analyze the performance of language-specific tokenizers for three African languages: Swahili, Hausa, and Yoruba. Through experiments on two classification tasks (i.e., sentiment classification and news classification), we found that language-specific tokenizers for African languages consistently outperformed other monolingual tokenizers, with performance gaps of up to 5.43% in sentiment classification and 4.58% in news classification. We also found that multilingual tokenizers generally work well if they are trained on many African languages rather than on global HRLs: African multilingual tokenizers outperformed global multilingual tokenizers by an average of 1.70% in sentiment classification and 1.41% in news classification, with the largest observed improvement being 2.61% in news classification using Logistic Regression (LR). Based on these results, we suggest a method for choosing tokenizers when analyzing data or developing models for African languages.
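The record does not say which tooling the authors used, but a minimal sketch of the workflow the abstract describes — training a language-specific subword tokenizer and feeding its tokens to a Logistic Regression classifier — could look like the following. The Hugging Face `tokenizers` library, scikit-learn, the corpus file name, the vocabulary size, and the toy labeled examples are all illustrative assumptions, not details from the paper:

```python
# A minimal sketch (not the authors' actual pipeline): train a
# language-specific BPE tokenizer on monolingual text and feed its
# subword tokens to a TF-IDF + Logistic Regression classifier.
# `swahili_corpus.txt`, the vocabulary size, and the toy labeled
# examples below are illustrative placeholders.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 1) Train a Swahili-specific tokenizer from raw text.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=30000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["swahili_corpus.txt"], trainer=trainer)

# 2) Use the tokenizer's subword output as the analyzer for TF-IDF
#    features, then classify with Logistic Regression (LR), one of the
#    classifiers named in the abstract.
def subword_analyzer(text):
    return tokenizer.encode(text).tokens

clf = make_pipeline(
    TfidfVectorizer(analyzer=subword_analyzer),
    LogisticRegression(max_iter=1000),
)

# Toy sentiment-classification data, purely for illustration.
train_texts = ["Habari njema sana leo", "Hali ni mbaya kabisa"]
train_labels = ["positive", "negative"]
clf.fit(train_texts, train_labels)
print(clf.predict(["Habari nzuri"]))
```

Retraining the same tokenizer on a mixed corpus of several African languages would give the "African multilingual" variant that the abstract reports outperforming global multilingual tokenizers.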
Main Authors: Goodwill Erasmo Ndomba, Medard Edmund Mswahili, Young-Seob Jeong
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access
Subjects: Tokenizer; classification; African language; Swahili; Yoruba; Hausa
Online Access: https://ieeexplore.ieee.org/document/10815724/
_version_ | 1841563293771104256 |
author | Goodwill Erasmo Ndomba; Medard Edmund Mswahili; Young-Seob Jeong |
author_facet | Goodwill Erasmo Ndomba; Medard Edmund Mswahili; Young-Seob Jeong |
author_sort | Goodwill Erasmo Ndomba |
collection | DOAJ |
description | Despite rapid progress in natural language processing (NLP), a wide performance gap persists between high-resource languages (HRLs) and low-resource languages (LRLs). African languages mostly fall among the LRLs, and one major contributor to this gap is tokenization, which plays a crucial role in NLP performance in general. Many recent studies on African languages rely on multilingual or general-purpose tokenizers that are not optimized for African languages, which may lead to suboptimal performance in downstream NLP tasks. In this paper, we systematically analyze the performance of language-specific tokenizers for three African languages: Swahili, Hausa, and Yoruba. Through experiments on two classification tasks (i.e., sentiment classification and news classification), we found that language-specific tokenizers for African languages consistently outperformed other monolingual tokenizers, with performance gaps of up to 5.43% in sentiment classification and 4.58% in news classification. We also found that multilingual tokenizers generally work well if they are trained on many African languages rather than on global HRLs: African multilingual tokenizers outperformed global multilingual tokenizers by an average of 1.70% in sentiment classification and 1.41% in news classification, with the largest observed improvement being 2.61% in news classification using Logistic Regression (LR). Based on these results, we suggest a method for choosing tokenizers when analyzing data or developing models for African languages. |
format | Article |
id | doaj-art-65657f26a76145d38fc1ba9f86912b0f |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-65657f26a76145d38fc1ba9f86912b0f; 2025-01-03T00:01:52Z; eng; IEEE; IEEE Access; 2169-3536; 2025-01-01; vol. 13, pp. 1046-1054; doi:10.1109/ACCESS.2024.3522285; article 10815724; Tokenizers for African Languages; Goodwill Erasmo Ndomba (https://orcid.org/0009-0009-6842-3548); Medard Edmund Mswahili (https://orcid.org/0000-0002-6893-6281); Young-Seob Jeong (https://orcid.org/0000-0002-9441-2940); all three authors: Department of Computer Engineering, Chungbuk National University, Cheongju, Republic of Korea; https://ieeexplore.ieee.org/document/10815724/; Tokenizer; classification; African language; Swahili; Yoruba; Hausa |
spellingShingle | Goodwill Erasmo Ndomba; Medard Edmund Mswahili; Young-Seob Jeong; Tokenizers for African Languages; IEEE Access; Tokenizer; classification; African language; Swahili; Yoruba; Hausa |
title | Tokenizers for African Languages |
title_full | Tokenizers for African Languages |
title_fullStr | Tokenizers for African Languages |
title_full_unstemmed | Tokenizers for African Languages |
title_short | Tokenizers for African Languages |
title_sort | tokenizers for african languages |
topic | Tokenizer; classification; African language; Swahili; Yoruba; Hausa |
url | https://ieeexplore.ieee.org/document/10815724/ |
work_keys_str_mv | AT goodwillerasmondomba tokenizersforafricanlanguages AT medardedmundmswahili tokenizersforafricanlanguages AT youngseobjeong tokenizersforafricanlanguages |