Identification of Scientific Texts Generated by Large Language Models Using Machine Learning

Large language models (LLMs) are tools that help with a variety of activities, from drafting well-structured texts to quickly looking up information. But because these technologies are so easily accessible, many people use them for their own benefit without properly citing the original author, or in...

Bibliographic Details
Main Authors:	David Soto-Osorio, Grigori Sidorov, Liliana Chanona-Hernández, Blanca Cecilia López-Ramírez
Format:	Article
Language:	English
Published:	MDPI AG 2024-12-01
Series:	Computers
Subjects:	large language models; natural language processing; text generation; machine learning; text classification; adversarial attacks
Online Access:	https://www.mdpi.com/2073-431X/13/12/346

_version_	1846105223188185088
author	David Soto-Osorio, Grigori Sidorov, Liliana Chanona-Hernández, Blanca Cecilia López-Ramírez
author_facet	David Soto-Osorio, Grigori Sidorov, Liliana Chanona-Hernández, Blanca Cecilia López-Ramírez
author_sort	David Soto-Osorio
collection	DOAJ
description	Large language models (LLMs) are tools that help with a variety of activities, from drafting well-structured texts to quickly looking up information. Because these technologies are so easily accessible, many people use them for their own benefit without properly citing the original author. Students are particularly exposed: they may opt for a quick answer over understanding a topic in depth, which considerably reduces their basic writing, editing, and reading comprehension skills. We therefore propose a model to identify texts produced by LLMs. It uses natural language processing (NLP) and machine-learning algorithms to recognize texts that mask LLM misuse through different types of adversarial attack, such as paraphrasing or translation from one language to another. The main contribution of this work is the identification of texts generated by large language models. To this end, several experiments were carried out in search of the best results on the F1, accuracy, recall, and precision metrics, together with PCA and t-SNE diagrams to visualize the classification of each text.
format	Article
id	doaj-art-9fcd3c5d0c024fa493dd6ac14cf9b40c
institution	Kabale University
issn	2073-431X
language	English
publishDate	2024-12-01
publisher	MDPI AG
record_format	Article
series	Computers
spelling	doaj-art-9fcd3c5d0c024fa493dd6ac14cf9b40c; 2024-12-27T14:19:06Z; eng; MDPI AG; Computers; 2073-431X; 2024-12-01; volume 13, issue 12, article 346; 10.3390/computers13120346; Identification of Scientific Texts Generated by Large Language Models Using Machine Learning; David Soto-Osorio [0]; Grigori Sidorov [1]; Liliana Chanona-Hernández [2]; Blanca Cecilia López-Ramírez [3]; [0] Computing Research Center, Instituto Politécnico Nacional, Av. Juan de Dios Bátiz S/N, Nueva Industrial Vallejo, Ciudad de México 07700, Mexico; [1] Computing Research Center, Instituto Politécnico Nacional, Av. Juan de Dios Bátiz S/N, Nueva Industrial Vallejo, Ciudad de México 07700, Mexico; [2] Escuela Superior de Ingeniería Mecánica y Eléctrica, Unidad Zacatenco, Instituto Politécnico Nacional, Av. Luis Enrique Erro, S/N, Ciudad de México 07700, Mexico; [3] Tecnológico Nacional de México/I.T. de Roque, Celaya 38110, Mexico; https://www.mdpi.com/2073-431X/13/12/346; large language models; natural language processing; text generation; machine learning; text classification; adversarial attacks
title	Identification of Scientific Texts Generated by Large Language Models Using Machine Learning
topic	large language models; natural language processing; text generation; machine learning; text classification; adversarial attacks
url	https://www.mdpi.com/2073-431X/13/12/346

Similar Items