Identification of Scientific Texts Generated by Large Language Models Using Machine Learning

Large language models (LLMs) are tools that help us in a variety of activities, from creating well-structured texts to quickly consulting information. But as these new technologies are so easily accessible, many people use them for their own benefit without properly citing the original author, or in...

Bibliographic Details
Main Authors: David Soto-Osorio, Grigori Sidorov, Liliana Chanona-Hernández, Blanca Cecilia López-Ramírez
Format: Article
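Language: English
Published: MDPI AG 2024-12-01
Series: Computers
Subjects: large language models, natural language processing, text generation, machine learning, text classification, adversarial attacks
Online Access: https://www.mdpi.com/2073-431X/13/12/346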
_version_ 1846105223188185088
author David Soto-Osorio
Grigori Sidorov
Liliana Chanona-Hernández
Blanca Cecilia López-Ramírez
collection DOAJ
description Large language models (LLMs) are tools that help with a variety of activities, from producing well-structured texts to quickly looking up information. Because these technologies are so easily accessible, many people use them for their own benefit without properly citing the original author. Students in particular may opt for a quick answer over understanding a topic in depth, which can considerably reduce their basic writing, editing, and reading comprehension skills. We therefore propose a model to identify texts produced by LLMs. We use natural language processing (NLP) and machine-learning algorithms to recognize texts that mask LLM misuse through different types of adversarial attack, such as paraphrasing or translation from one language to another. The main contribution of this work is the identification of texts generated by large language models. To this end, several experiments were carried out in search of the best results, evaluated with the F1, accuracy, recall, and precision metrics, together with PCA and t-SNE diagrams to visualize the classification of each text.
format Article
id doaj-art-9fcd3c5d0c024fa493dd6ac14cf9b40c
institution Kabale University
issn 2073-431X
language English
publishDate 2024-12-01
publisher MDPI AG
record_format Article
series Computers
spelling doaj-art-9fcd3c5d0c024fa493dd6ac14cf9b40c
2024-12-27T14:19:06Z
eng
MDPI AG
Computers
2073-431X
2024-12-01
13
12
346
10.3390/computers13120346
Identification of Scientific Texts Generated by Large Language Models Using Machine Learning
David Soto-Osorio (0)
Grigori Sidorov (1)
Liliana Chanona-Hernández (2)
Blanca Cecilia López-Ramírez (3)
(0) Computing Research Center, Instituto Politécnico Nacional, Av. Juan de Dios Bátiz S/N, Nueva Industrial Vallejo, Ciudad de México 07700, Mexico
(1) Computing Research Center, Instituto Politécnico Nacional, Av. Juan de Dios Bátiz S/N, Nueva Industrial Vallejo, Ciudad de México 07700, Mexico
(2) Escuela Superior de Ingeniería Mecánica y Eléctrica, Unidad Zacatenco, Instituto Politécnico Nacional, Av. Luis Enrique Erro, S/N, Ciudad de México 07700, Mexico
(3) Tecnológico Nacional de México/I.T. de Roque, Celaya 38110, Mexico
https://www.mdpi.com/2073-431X/13/12/346
large language models
natural language processing
text generation
machine learning
text classification
adversarial attacks
title Identification of Scientific Texts Generated by Large Language Models Using Machine Learning
topic large language models
natural language processing
text generation
machine learning
text classification
adversarial attacks
url https://www.mdpi.com/2073-431X/13/12/346