Using Encyclopedic Texts for Training and Inference of Artificial Neural Networks

Bibliographic Details
Main Author: Vladimir Medeyko
Format: Article
Language: Russian
Published: The Fund for Promotion of Internet media, IT education, human development «League Internet Media» 2024-07-01
Series: Современные информационные технологии и IT-образование
Online Access: http://sitito.cs.msu.ru/index.php/SITITO/article/view/1144
Description
Summary: The article explores the use of texts from online encyclopedias, such as Wikipedia and RUWIKI, for training the artificial neural networks (ANNs) that underlie large language models (LLMs) and for processing queries to them. The focus is on the relevance and quality of training datasets and on the accuracy and bias of generated responses. ANNs based on the “transformer” architecture show exceptional capabilities across a range of natural language processing tasks. They also have several limitations, including hallucination, in which a model generates nonexistent or false statements. These problems can be attributed to the quality of the training datasets, the specifics of model training, and the way queries are processed. Encyclopedias, especially Wikipedia, are widely used for training ANNs because they are open and their information is structured. Despite Wikipedia's multilingualism and accessibility, however, the quality of its articles varies significantly, which complicates training and increases the risk of hallucinations. As a supplement to existing datasets, the article proposes RUWIKI, a new online encyclopedia in the languages of the peoples of Russia. RUWIKI is being created with the participation of experts and focuses on the accuracy of information. Its articles undergo thorough verification and annotation, which improves training dataset quality and reduces the risk of hallucinations. The article also mentions other projects aimed at building accurate and systematic information bases, such as the “Ark of Knowledge” and the online edition of the Great Russian Encyclopedia (GRE). It emphasizes the importance of creating regional online encyclopedias to improve training dataset quality and reduce the legal risks of using LLMs. Doing so should improve the accuracy and relevance of ANN responses, which is especially important for users in different regions and languages.
ISSN: 2411-1473