ATTRIBUTION OF MEDIA TEXTS BASED ON A TRAINED NATURAL LANGUAGE MODEL AND LINGUISTIC ASSESSMENT OF IDENTIFICATION QUALITY

Bibliographic Details
Main Authors: Vladimir A. Klyachin, Ekaterina V. Khizhnyakova
Format: Article
Language: English
Published: Volgograd State University 2024-11-01
Series: Vestnik Volgogradskogo Gosudarstvennogo Universiteta. Seriâ 2. Âzykoznanie
Subjects:
Online Access: https://l.jvolsu.com/index.php/en/archive-en/928-science-journal-of-volsu-linguistics-2024-vol-23-no-5/artificial-intelligence-potential-in-natural-language-processing-and-machine-translation/2840-klyachin-v-a-khizhnyakova-e-v-attribution-of-media-texts-based-on-a-trained-natural-language-model-and-linguistic-assessment-of-identification-quality
collection DOAJ
description Effective systems for filtering media texts are needed to support the development of artificial intelligence systems, namely large language models, which should be trained on “correct” text samples that contain no signs of disinformation, infodemic or unreliability. The article presents the results of automatically detecting high-quality media texts, as well as text samples with infodemic features, using a natural language model trained on a manually labeled corpus. Experts labeled the corpus manually on the basis of a parameterization of the text content. The goal of the work is to build a model of the language of media messages, assess its quality, and identify detection errors caused by the linguistic characteristics of the texts. Creating such a model is a precondition for increasing the efficiency and quality of artificial intelligence systems. Test use of the trained natural language model shows that it can filter media texts with fairly high accuracy. The support vector machine method proved the most effective. The share of incorrectly recognized informative texts, that is, texts meeting the criteria of reliability and novelty, is low at 6.2 percent. The share of incorrectly recognized uninformative texts is approximately 3.9 percent, which indicates fairly high efficiency of the developed model. Errors in the detection of informative texts are associated with the use of proper names (anthroponyms, toponyms) and numerals in the headings. Misclassified texts containing signs of fakes and misinformation typically use statements with speech verbs, which are also common in informative texts.
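The two error shares quoted in the description can be read as per-class misclassification rates. The sketch below shows that arithmetic under the assumption that each share is the number of misclassified texts of a class divided by the total number of texts of that class. The function name and the counts are hypothetical: the record does not give the corpus size, so the counts are chosen only so that the reported percentages come out.

```python
def misclassification_share(misclassified: int, total: int) -> float:
    """Return the percentage of texts of one class that were assigned to the other class."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= misclassified <= total:
        raise ValueError("misclassified must lie between 0 and total")
    return 100.0 * misclassified / total


# Hypothetical counts (NOT from the article), picked to match the reported shares.
informative_share = misclassification_share(62, 1000)    # informative texts labeled uninformative
uninformative_share = misclassification_share(39, 1000)  # uninformative texts labeled informative

print(f"informative texts misclassified:   {informative_share:.1f} percent")
print(f"uninformative texts misclassified: {uninformative_share:.1f} percent")
```

Reporting the two shares separately, rather than a single overall accuracy, keeps errors on informative texts distinct from errors on uninformative ones, which matches how the description attributes each error type to different linguistic features.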
id doaj-art-2fa95fe0548242c3bda28e88517cb78a
institution Kabale University
issn 1998-9911
2409-1979
volume 23
issue 5
pages 31-46
doi 10.15688/jvolsu2.2024.5.3
orcid Vladimir A. Klyachin https://orcid.org/0000-0003-1922-7849
Ekaterina V. Khizhnyakova https://orcid.org/0000-0002-7914-9988
affiliation Volgograd State University, Volgograd, Russia
topic media text
neural network
language model
machine learning method
corpus
automatic detection