Identification of Scientific Texts Generated by Large Language Models Using Machine Learning

Bibliographic Details
Main Authors: David Soto-Osorio, Grigori Sidorov, Liliana Chanona-Hernández, Blanca Cecilia López-Ramírez
Format: Article
Language: English
Published: MDPI AG 2024-12-01
Series: Computers
Online Access: https://www.mdpi.com/2073-431X/13/12/346
Description
Summary: Large language models (LLMs) are tools that help with a variety of activities, from producing well-structured text to quickly looking up information. Because these technologies are so easily accessible, many people use them for their own benefit without properly citing the original author. Students are particularly affected: they may opt for a quick answer rather than understanding a topic in depth, which considerably weakens their basic writing, editing, and reading comprehension skills. We therefore propose a model to identify texts produced by LLMs. We use natural language processing (NLP) and machine-learning algorithms to recognize texts that mask LLM misuse through different types of adversarial attack, such as paraphrasing or translation from one language to another. The main contribution of this work is the identification of texts generated by large language models. To this end, several experiments were carried out in search of the best results, evaluated with the F1, accuracy, recall, and precision metrics, together with PCA and t-SNE diagrams to visualize the classification of each text.
ISSN: 2073-431X
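
The Summary names accuracy, precision, recall, and F1 as the metrics used to compare experiments. As a minimal sketch of how those four values relate for a two-class detector, the following computes them from scratch. It is not the authors' code: the label encoding (1 = LLM-generated, 0 = human-written), the `binary_metrics` helper, the `BinaryMetrics` container, and the toy labels are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BinaryMetrics:
    """Hypothetical container for the four metrics named in the Summary."""
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> BinaryMetrics:
    """Compute accuracy, precision, recall, and F1 for one positive class.

    Assumption: `positive` marks LLM-generated text; every other label
    is treated as human-written.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is empty (no predicted or
    # no actual positives) instead of dividing by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return BinaryMetrics(accuracy, precision, recall, f1)


if __name__ == "__main__":
    # Toy labels, invented for illustration only (not data from the article).
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    print(binary_metrics(y_true, y_pred))
```

In practice a library routine such as scikit-learn's `precision_recall_fscore_support` would typically replace this helper; the hand-written version is only meant to make the definitions explicit.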