Tokenizers for African Languages

Bibliographic Details
Main Authors: Goodwill Erasmo Ndomba, Medard Edmund Mswahili, Young-Seob Jeong
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/10815724/
Description
Summary: Despite remarkable progress in natural language processing (NLP), a large performance gap persists between high-resource languages (HRLs) and low-resource languages (LRLs). African languages fall mainly among the LRLs, and one major contributor to this gap is tokenization, which plays a crucial role in NLP performance in general. Many recent studies on African languages rely on multilingual or general-purpose tokenizers that are not optimized for African languages, which may lead to suboptimal performance on downstream NLP tasks. In this paper, we systematically analyze the performance of language-specific tokenizers for three African languages: Swahili, Hausa, and Yoruba. Through experiments on two classification tasks (i.e., sentiment classification and news classification), we found that language-specific tokenizers for African languages consistently outperformed other monolingual tokenizers, with performance gaps of up to 5.43% in sentiment classification and 4.58% in news classification. We also found that multilingual tokenizers generally work well if they are trained on many African languages rather than on global HRLs. For instance, African multilingual tokenizers outperformed global multilingual tokenizers by an average of 1.70% in sentiment classification and 1.41% in news classification; the largest observed improvement was 2.61% in news classification using Logistic Regression (LR). Based on these results, we suggest a method for choosing tokenizers when analyzing data or developing models for African languages.
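The abstract contrasts language-specific tokenizers (trained on one African language) with multilingual ones. As an illustration of what "training a language-specific tokenizer" involves, below is a minimal byte-pair-encoding (BPE) training loop in pure Python. This is a didactic sketch only, not the tokenizers evaluated in the paper; the function names and the tiny Swahili toy corpus are this example's own.

```python
from collections import Counter

def get_pair_counts(words):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    # Replace every occurrence of the chosen pair with its merged symbol.
    merged = " ".join(pair)
    joined = "".join(pair)
    new_words = Counter()
    for word, freq in words.items():
        new_words[word.replace(merged, joined)] += freq
    return new_words

def train_bpe(corpus, num_merges):
    # Start from character-level symbols, with an end-of-word marker per word.
    words = Counter(" ".join(w) + " </w>" for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(best, words)
        merges.append(best)
    return merges

# Toy Swahili-like corpus: the most frequent pair ("i", "</w>") is merged first.
merges = train_bpe("habari habari gani", 1)
```

A language-specific tokenizer in this sense is simply one whose merge table was learned from a monolingual corpus, so frequent subwords of that language (rather than of English or a global mix) become single tokens.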
ISSN: 2169-3536