A BERT-Based Classification Model: The Case of Russian Fairy Tales

Bibliographic Details
Main Authors: Валерий Дмитриевич Соловьев, Марина Ивановна Солнышкина, Andrey Ten, Николай Аркадиевич Прокопьев
Format: Article
Language: English
Published: National Research University Higher School of Economics, 2024-12-01
Series: Journal of Language and Education
Subjects: Bert model; fairy tales; Text classification; Neural networks
Online Access: https://jle.hse.ru/article/view/24030
_version_ 1841555987296681984
author Валерий Дмитриевич Соловьев
Марина Ивановна Солнышкина
Andrey Ten
Николай Аркадиевич Прокопьев
author_facet Валерий Дмитриевич Соловьев
Марина Ивановна Солнышкина
Andrey Ten
Николай Аркадиевич Прокопьев
author_sort Валерий Дмитриевич Соловьев
collection DOAJ
description Introduction: Automatic profiling and genre classification are crucial for text suitability assessment and as such have been in high demand in education, information retrieval, sentiment analysis, and machine translation for over a decade. Of all genres, fairy tales are one of the most challenging and valuable objects of study due to their heterogeneity and a wide range of implicit idiosyncrasies. Traditional classification methods, including stylometric and parametric algorithms, are not only labour-intensive and time-consuming but also struggle to identify suitable classification discriminants. Research in the area is scarce, and its findings remain controversial and debatable. Purpose: Our study aims to fill this void and offers an algorithm for sorting Russian fairy tales into classes based on pre-set parameters. We present a BERT-based classification model for Russian fairy tales, test the hypothesis that BERT can classify Russian texts, and verify it on a representative corpus of 743 Russian fairy tales. Method: We pre-train BERT on a collection of documents from three classes and fine-tune it for the specific application task. Focusing on tokenization and embedding design as the key components of BERT’s text processing, the study also evaluates the standard benchmarks used to train classification models and analyses complex cases, possible errors, and improvement strategies, thereby raising classification accuracy. Model performance is evaluated using the loss function, prediction accuracy, precision, and recall. Results: We validated BERT’s potential for Russian text classification and its ability to enhance the performance and quality of existing NLP models. Our experiments with cointegrated/rubert-tiny, ai-forever/ruBert-base, and DeepPavlov/rubert-base-cased-sentence on different classification tasks show that our models achieve state-of-the-art results, with the best accuracy of 95.9% for cointegrated/rubert-tiny, outperforming the other two models by a good margin. The classification accuracy achieved by AI is thus high enough to compete with human expertise. Conclusion: The findings highlight the importance of fine-tuning for classification models. BERT demonstrates great potential for improving NLP technologies, contributing to the quality of automatic text analysis, and offering new opportunities for research and application in a wide range of areas, including the identification and arrangement of all types of content-relevant texts, thus supporting decision making. The designed and validated algorithm can be scaled to the classification of discourse as complex and ambiguous as fiction, thereby improving our understanding of text-specific categories. Considerably bigger datasets are required for these purposes.
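The fine-tuning procedure summarised in the Method section corresponds to the standard Hugging Face Transformers workflow for sequence classification. The sketch below is illustrative only and is not the authors' released code: the CSV file names, the "text"/"label" column names, the three-way label encoding, and the hyperparameters are assumptions; only the checkpoint name cointegrated/rubert-tiny and the reported metrics (accuracy, precision, recall) come from the abstract.

```python
# Minimal sketch: fine-tuning a small Russian BERT checkpoint for
# three-class fairy-tale classification (assumed data layout and hyperparameters).
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "cointegrated/rubert-tiny"   # best-performing checkpoint per the abstract
NUM_CLASSES = 3                           # three classes of fairy tales

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_CLASSES)

# Hypothetical CSV files with columns "text" and "label" (0, 1, 2).
dataset = load_dataset("csv", data_files={"train": "tales_train.csv", "test": "tales_test.csv"})

def tokenize(batch):
    # BERT-style subword tokenization; longer tales are truncated to 512 tokens.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

dataset = dataset.map(tokenize, batched=True)

# Metrics named in the abstract: accuracy, precision, recall (macro-averaged here).
accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "precision": precision.compute(predictions=preds, references=labels, average="macro")["precision"],
        "recall": recall.compute(predictions=preds, references=labels, average="macro")["recall"],
    }

args = TrainingArguments(
    output_dir="rubert-fairy-tales",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate())   # loss, accuracy, precision, recall on the test split
```

The same script can be rerun with ai-forever/ruBert-base or DeepPavlov/rubert-base-cased-sentence by changing MODEL_NAME, which is how the three checkpoints compared in the Results section could be evaluated under identical conditions.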
format Article
id doaj-art-8a7282830d98448a8c152438583db459
institution Kabale University
issn 2411-7390
language English
publishDate 2024-12-01
publisher National Research University Higher School of Economics
record_format Article
series Journal of Language and Education
spelling doaj-art-8a7282830d98448a8c152438583db459 2025-01-07T16:17:12Z eng National Research University Higher School of Economics Journal of Language and Education 2411-7390 2024-12-01 Vol. 10, No. 4 10.17323/jle.2024.24030 A BERT-Based Classification Model: The Case of Russian Fairy Tales Валерий Дмитриевич Соловьев (Kazan Federal University, Kazan, Russia) Марина Ивановна Солнышкина (Kazan Federal University, Kazan, Russia) Andrey Ten (Nobilis.Team, Kazan, Russia) Николай Аркадиевич Прокопьев (TAS Institute of Applied Semiotics, Kazan, Russia) https://jle.hse.ru/article/view/24030 Bert model; fairy tales; Text classification; Neural networks
spellingShingle Валерий Дмитриевич Соловьев
Марина Ивановна Солнышкина
Andrey Ten
Николай Аркадиевич Прокопьев
A BERT-Based Classification Model: The Case of Russian Fairy Tales
Journal of Language and Education
Bert model
fairy tales
Text classification
Neural networks
title A BERT-Based Classification Model: The Case of Russian Fairy Tales
title_full A BERT-Based Classification Model: The Case of Russian Fairy Tales
title_fullStr A BERT-Based Classification Model: The Case of Russian Fairy Tales
title_full_unstemmed A BERT-Based Classification Model: The Case of Russian Fairy Tales
title_short A BERT-Based Classification Model: The Case of Russian Fairy Tales
title_sort bert based classification model the case of russian fairy tales
topic Bert model
fairy tales
Text classification
Neural networks
url https://jle.hse.ru/article/view/24030
work_keys_str_mv AT valerijdmitrievičsolovʹev abertbasedclassificationmodelthecaseofrussianfairytales
AT marinaivanovnasolnyškina abertbasedclassificationmodelthecaseofrussianfairytales
AT andreyten abertbasedclassificationmodelthecaseofrussianfairytales
AT nikolajarkadievičprokopʹev abertbasedclassificationmodelthecaseofrussianfairytales
AT valerijdmitrievičsolovʹev bertbasedclassificationmodelthecaseofrussianfairytales
AT marinaivanovnasolnyškina bertbasedclassificationmodelthecaseofrussianfairytales
AT andreyten bertbasedclassificationmodelthecaseofrussianfairytales
AT nikolajarkadievičprokopʹev bertbasedclassificationmodelthecaseofrussianfairytales