A BERT-Based Classification Model: The Case of Russian Fairy Tales

Bibliographic Details
Main Authors: Валерий Дмитриевич Соловьев, Марина Ивановна Солнышкина, Andrey Ten, Николай Аркадиевич Прокопьев
Format: Article
Language: English
Published: National Research University Higher School of Economics, 2024-12-01
Series: Journal of Language and Education
Subjects: Bert model; fairy tales; Text classification; Neural networks
Online Access: https://jle.hse.ru/article/view/24030
_version_ 1841555987296681984
author Валерий Дмитриевич Соловьев
Марина Ивановна Солнышкина
Andrey Ten
Николай Аркадиевич Прокопьев
author_facet Валерий Дмитриевич Соловьев
Марина Ивановна Солнышкина
Andrey Ten
Николай Аркадиевич Прокопьев
author_sort Валерий Дмитриевич Соловьев
collection DOAJ
description Introduction: Automatic profiling and genre classification are crucial for text suitability assessment and as such have been in high demand in education, information retrieval, sentiment analysis, and machine translation for over a decade. Of all genres, fairy tales are one of the most challenging and valuable objects of study due to their heterogeneity and a wide range of implicit idiosyncrasies. Traditional classification methods, including stylometric and parametric algorithms, are not only labour-intensive and time-consuming but also struggle to identify suitable classification discriminants. Research in the area is scarce, and its findings remain controversial and debatable. Purpose: Our study aims to fill this void and offers an algorithm for sorting Russian fairy tales into classes based on pre-set parameters. We present a BERT-based classification model for Russian fairy tales, test the hypothesis that BERT can classify Russian texts, and verify it on a representative corpus of 743 Russian fairy tales. Method: We pre-train BERT on a collection of documents from three classes and fine-tune it for the specific application task. Focusing on tokenization and embedding design as the key components of BERT’s text processing, the study also evaluates the standard benchmarks used to train classification models and analyses complex cases, possible errors, and improvement strategies, thereby raising classification accuracy. Model performance is evaluated using the loss function, prediction accuracy, precision, and recall. Results: We validated BERT’s potential for Russian text classification and its ability to enhance the performance and quality of existing NLP models. Our experiments with cointegrated/rubert-tiny, ai-forever/ruBert-base, and DeepPavlov/rubert-base-cased-sentence on different classification tasks show that our models achieve state-of-the-art results, with the best accuracy of 95.9% for cointegrated/rubert-tiny, outperforming the other two models by a good margin. The classification accuracy achieved by AI is thus high enough to compete with human expertise. Conclusion: The findings highlight the importance of fine-tuning for classification models. BERT demonstrates great potential for improving NLP technologies, contributing to the quality of automatic text analysis, and offering new opportunities for research and application in a wide range of areas, including the identification and arrangement of all types of content-relevant texts, thus supporting decision making. The designed and validated algorithm can be scaled to the classification of discourse as complex and ambiguous as fiction, thereby improving our understanding of text-specific categories. Considerably bigger datasets are required for these purposes.
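The fine-tuning procedure summarised in the Method section corresponds to the standard Hugging Face Transformers workflow for sequence classification. The sketch below is illustrative only and is not the authors' released code: the CSV file names, the "text"/"label" column names, the three-way label encoding, and the hyperparameters are assumptions; only the checkpoint name cointegrated/rubert-tiny and the reported metrics (accuracy, precision, recall) come from the abstract.

```python
# Minimal sketch: fine-tuning a small Russian BERT checkpoint for
# three-class fairy-tale classification (assumed data layout and hyperparameters).
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "cointegrated/rubert-tiny"   # best-performing checkpoint per the abstract
NUM_CLASSES = 3                           # three classes of fairy tales

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_CLASSES)

# Hypothetical CSV files with columns "text" and "label" (0, 1, 2).
dataset = load_dataset("csv", data_files={"train": "tales_train.csv", "test": "tales_test.csv"})

def tokenize(batch):
    # BERT-style subword tokenization; longer tales are truncated to 512 tokens.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

dataset = dataset.map(tokenize, batched=True)

# Metrics named in the abstract: accuracy, precision, recall (macro-averaged here).
accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "precision": precision.compute(predictions=preds, references=labels, average="macro")["precision"],
        "recall": recall.compute(predictions=preds, references=labels, average="macro")["recall"],
    }

args = TrainingArguments(
    output_dir="rubert-fairy-tales",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate())   # loss, accuracy, precision, recall on the test split
```

The same script can be rerun with ai-forever/ruBert-base or DeepPavlov/rubert-base-cased-sentence by changing MODEL_NAME, which is how the three checkpoints compared in the Results section could be evaluated under identical conditions.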
format Article
id doaj-art-8a7282830d98448a8c152438583db459
institution Kabale University
issn 2411-7390
language English
publishDate 2024-12-01
publisher National Research University Higher School of Economics
record_format Article
series Journal of Language and Education
spelling doaj-art-8a7282830d98448a8c152438583db459 2025-01-07T16:17:12Z eng National Research University Higher School of Economics Journal of Language and Education 2411-7390 2024-12-01 Vol. 10, No. 4 10.17323/jle.2024.24030 A BERT-Based Classification Model: The Case of Russian Fairy Tales Валерий Дмитриевич Соловьев (Kazan Federal University, Kazan, Russia) Марина Ивановна Солнышкина (Kazan Federal University, Kazan, Russia) Andrey Ten (Nobilis.Team, Kazan, Russia) Николай Аркадиевич Прокопьев (TAS Institute of Applied Semiotics, Kazan, Russia) https://jle.hse.ru/article/view/24030 Bert model; fairy tales; Text classification; Neural networks
spellingShingle Валерий Дмитриевич Соловьев
Марина Ивановна Солнышкина
Andrey Ten
Николай Аркадиевич Прокопьев
A BERT-Based Classification Model: The Case of Russian Fairy Tales
Journal of Language and Education
Bert model
fairy tales
Text classification
Neural networks
title A BERT-Based Classification Model: The Case of Russian Fairy Tales
title_full A BERT-Based Classification Model: The Case of Russian Fairy Tales
title_fullStr A BERT-Based Classification Model: The Case of Russian Fairy Tales
title_full_unstemmed A BERT-Based Classification Model: The Case of Russian Fairy Tales
title_short A BERT-Based Classification Model: The Case of Russian Fairy Tales
title_sort bert based classification model the case of russian fairy tales
topic Bert model
fairy tales
Text classification
Neural networks
url https://jle.hse.ru/article/view/24030
work_keys_str_mv AT valerijdmitrievičsolovʹev abertbasedclassificationmodelthecaseofrussianfairytales
AT marinaivanovnasolnyškina abertbasedclassificationmodelthecaseofrussianfairytales
AT andreyten abertbasedclassificationmodelthecaseofrussianfairytales
AT nikolajarkadievičprokopʹev abertbasedclassificationmodelthecaseofrussianfairytales
AT valerijdmitrievičsolovʹev bertbasedclassificationmodelthecaseofrussianfairytales
AT marinaivanovnasolnyškina bertbasedclassificationmodelthecaseofrussianfairytales
AT andreyten bertbasedclassificationmodelthecaseofrussianfairytales
AT nikolajarkadievičprokopʹev bertbasedclassificationmodelthecaseofrussianfairytales