Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts?

Introduction: Numerous algorithms have been proposed for the task of automatic morpheme segmentation of Russian words. Due to the differences in task formulation and datasets utilized, comparing the quality of these algorithms is challenging. It is unclear whether the errors in the models are due t...

Bibliographic Details
Main Authors:	Дмитрий Алексеевич Морозов, Тимур Александрович Гарипов, Ольга Николаевна Ляшевская, Светлана Олеговна Савчук, Борис Леонидович Иомдин, Анна Валерьевна Глазкова
Format:	Article
Language:	English
Published:	National Research University Higher School of Economics, 2024-12-01
Series:	Journal of Language and Education
Subjects:	automatic morpheme segmentation; Russian language morphology; dictionary expansion; morphological analysis; natural language processing; expert-level performance
Online Access:	https://jle.hse.ru/article/view/22237

_version_	1841555960396513280
author	Дмитрий Алексеевич Морозов, Тимур Александрович Гарипов, Ольга Николаевна Ляшевская, Светлана Олеговна Савчук, Борис Леонидович Иомдин, Анна Валерьевна Глазкова
author_sort	Дмитрий Алексеевич Морозов
collection	DOAJ
description	Introduction: Numerous algorithms have been proposed for automatic morpheme segmentation of Russian words. Because they differ in task formulation and in the datasets used, their quality is hard to compare. It is unclear whether model errors stem from weaknesses of the algorithms themselves or from errors and inconsistencies in the morpheme dictionaries. It therefore remains uncertain whether any algorithm can be used to expand existing morpheme dictionaries automatically. Purpose: To compare existing morpheme segmentation algorithms for Russian and to analyze their applicability to the automatic augmentation of existing morpheme dictionaries. Results: We compared several state-of-the-art machine learning algorithms on three datasets built around different segmentation paradigms. Two experiments were carried out, each using five-fold cross-validation. In the first, we randomly partitioned the dataset into five subsets. In the second, we grouped all words sharing the same root into a single subset and excluded words that contained multiple roots. In each cross-validation round, models were trained on four subsets and evaluated on the remaining one. In both experiments, the algorithms based on ensembles of convolutional neural networks consistently performed best. However, accuracy declined notably on words containing unfamiliar roots. On a randomly selected set of words, the performance of these algorithms was comparable to that of human experts. Conclusion: Although automatic methods have, on average, reached a quality close to expert level, their lack of semantic consideration makes it impossible to use them for automatic dictionary expansion without expert validation. Further research should address the key issues identified: poor performance on unknown roots and on acronyms. When only a small number of unfamiliar roots is expected in the test data, an ensemble of convolutional neural networks should be used. The results can inform the development of morpheme-oriented tokenizers and systems for analyzing text complexity.
format	Article
id	doaj-art-2f306e2e1f60450e93cfec60ec0694e3
institution	Kabale University
issn	2411-7390
language	English
publishDate	2024-12-01
publisher	National Research University Higher School of Economics
record_format	Article
series	Journal of Language and Education
spelling	doaj-art-2f306e2e1f60450e93cfec60ec0694e3; 2025-01-07T16:17:17Z; eng; National Research University Higher School of Economics; Journal of Language and Education; 2411-7390; 2024-12-01; 10; 4; 10.17323/jle.2024.22237; Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts?; Дмитрий Алексеевич Морозов (Novosibirsk State University, Novosibirsk, Russia); Тимур Александрович Гарипов (Novosibirsk State University, Novosibirsk, Russia); Ольга Николаевна Ляшевская (HSE University, Moscow, Russia; Vinogradov Russian Language Institute, Russian Academy of Sciences, Moscow, Russia); Светлана Олеговна Савчук (Vinogradov Russian Language Institute, Russian Academy of Sciences, Moscow, Russia); Борис Леонидович Иомдин (independent researcher); Анна Валерьевна Глазкова (University of Tyumen, Tyumen, Russia)
title	Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts?
topic	automatic morpheme segmentation; Russian language morphology; dictionary expansion; morphological analysis; natural language processing; expert-level performance
url	https://jle.hse.ru/article/view/22237
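
The two cross-validation setups summarized in the description field (a random five-way partition, and a partition that keeps all words with the same root in one subset while dropping multi-root words) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the greedy fold balancing, and the toy word-to-root annotations are hypothetical and are not taken from the article or its datasets.

```python
import random
from collections import defaultdict

# Hypothetical toy annotations (word -> list of roots), for illustration only.
TOY_WORD_ROOTS = {
    "домик": ["дом"], "домашний": ["дом"],
    "лесник": ["лес"], "лесной": ["лес"],
    "водный": ["вод"], "подводный": ["вод"],
    "ходить": ["ход"], "переход": ["ход"],
    "садик": ["сад"], "садовый": ["сад"],
    "столик": ["стол"], "настольный": ["стол"],
    "пароход": ["пар", "ход"],  # multi-root word, excluded in the grouped setup
}


def random_folds(words, k=5, seed=0):
    """Random setup: shuffle the words and deal them into k folds."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def root_grouped_folds(word_roots, k=5, seed=0):
    """Grouped setup: every word sharing a root lands in the same fold.

    Words annotated with more than one root are dropped. Fold sizes are
    balanced greedily, which is an assumption of this sketch.
    """
    by_root = defaultdict(list)
    for word, roots in word_roots.items():
        if len(roots) == 1:
            by_root[roots[0]].append(word)

    roots = sorted(by_root)                # deterministic starting order
    random.Random(seed).shuffle(roots)
    roots.sort(key=lambda r: len(by_root[r]), reverse=True)  # largest first

    folds = [[] for _ in range(k)]
    for root in roots:
        # Put the whole root family into the currently smallest fold.
        min(folds, key=len).extend(by_root[root])
    return folds


def train_test_splits(folds):
    """Yield (train, test) pairs: train on k-1 folds, test on the held-out one."""
    for i, test in enumerate(folds):
        train = [w for j, fold in enumerate(folds) if j != i for w in fold]
        yield train, test
```

In the grouped setup, no root seen in a test fold appears in the corresponding training folds, which is the condition under which the description reports a notable drop in accuracy.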


Similar Items