Identification of Scientific Texts Generated by Large Language Models Using Machine Learning
Large language models (LLMs) are tools that help us in a variety of activities, from creating well-structured texts to quickly consulting information. But as these new technologies are so easily accessible, many people use them for their own benefit without properly citing the original author, or in...
| Main Authors: | David Soto-Osorio, Grigori Sidorov, Liliana Chanona-Hernández, Blanca Cecilia López-Ramírez |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2024-12-01 |
| Series: | Computers |
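| Subjects: | large language models; natural language processing; text generation; machine learning; text classification; adversarial attacks |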
| Online Access: | https://www.mdpi.com/2073-431X/13/12/346 |
| _version_ | 1846105223188185088 |
|---|---|
| author | David Soto-Osorio; Grigori Sidorov; Liliana Chanona-Hernández; Blanca Cecilia López-Ramírez |
| author_facet | David Soto-Osorio; Grigori Sidorov; Liliana Chanona-Hernández; Blanca Cecilia López-Ramírez |
| author_sort | David Soto-Osorio |
| collection | DOAJ |
| description | Large language models (LLMs) are tools that help us in a variety of activities, from creating well-structured texts to quickly consulting information. But because these new technologies are so easily accessible, many people use them for their own benefit without properly citing the original author. In other cases, students may be seriously affected, because they may opt for a quick answer over understanding a specific topic in depth, which considerably reduces their basic writing, editing and reading comprehension skills. Therefore, we propose a model to identify texts produced by LLMs. To do so, we use natural language processing (NLP) and machine-learning algorithms to recognize texts that mask LLM misuse through different types of adversarial attack, such as paraphrasing or translation from one language to another. The main contribution of this work is the identification of texts generated by large language models; to this end, several experiments were run in search of the best results as measured by the F1, accuracy, recall and precision metrics, together with PCA and t-SNE diagrams to visualize the classification of each text. |
| format | Article |
| id | doaj-art-9fcd3c5d0c024fa493dd6ac14cf9b40c |
| institution | Kabale University |
| issn | 2073-431X |
| language | English |
| publishDate | 2024-12-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Computers |
| spelling | doaj-art-9fcd3c5d0c024fa493dd6ac14cf9b40c; 2024-12-27T14:19:06Z; eng; MDPI AG; Computers; 2073-431X; 2024-12-01; 13; 12; 346; 10.3390/computers13120346; Identification of Scientific Texts Generated by Large Language Models Using Machine Learning; David Soto-Osorio [0]; Grigori Sidorov [1]; Liliana Chanona-Hernández [2]; Blanca Cecilia López-Ramírez [3]; [0] Computing Research Center, Instituto Politécnico Nacional, Av. Juan de Dios Bátiz S/N, Nueva Industrial Vallejo, Ciudad de México 07700, Mexico; [1] Computing Research Center, Instituto Politécnico Nacional, Av. Juan de Dios Bátiz S/N, Nueva Industrial Vallejo, Ciudad de México 07700, Mexico; [2] Escuela Superior de Ingeneria Mecanica y Electrica, Unidad Zacatenco, Instituto Politécnico Nacional, Av. Luis Enrique Erro, S/N, Ciudad de México 07700, Mexico; [3] Tecnológico Nacional de México/I.T. de Roque, Celaya 38110, Mexico; Large language models (LLMs) are tools that help us in a variety of activities, from creating well-structured texts to quickly consulting information. But because these new technologies are so easily accessible, many people use them for their own benefit without properly citing the original author. In other cases, students may be seriously affected, because they may opt for a quick answer over understanding a specific topic in depth, which considerably reduces their basic writing, editing and reading comprehension skills. Therefore, we propose a model to identify texts produced by LLMs. To do so, we use natural language processing (NLP) and machine-learning algorithms to recognize texts that mask LLM misuse through different types of adversarial attack, such as paraphrasing or translation from one language to another. The main contribution of this work is the identification of texts generated by large language models; to this end, several experiments were run in search of the best results as measured by the F1, accuracy, recall and precision metrics, together with PCA and t-SNE diagrams to visualize the classification of each text.; https://www.mdpi.com/2073-431X/13/12/346; large language models; natural language processing; text generation; machine learning; text classification; adversarial attacks |
| spellingShingle | David Soto-Osorio; Grigori Sidorov; Liliana Chanona-Hernández; Blanca Cecilia López-Ramírez; Identification of Scientific Texts Generated by Large Language Models Using Machine Learning; Computers; large language models; natural language processing; text generation; machine learning; text classification; adversarial attacks |
| title | Identification of Scientific Texts Generated by Large Language Models Using Machine Learning |
| title_full | Identification of Scientific Texts Generated by Large Language Models Using Machine Learning |
| title_fullStr | Identification of Scientific Texts Generated by Large Language Models Using Machine Learning |
| title_full_unstemmed | Identification of Scientific Texts Generated by Large Language Models Using Machine Learning |
| title_short | Identification of Scientific Texts Generated by Large Language Models Using Machine Learning |
| title_sort | identification of scientific texts generated by large language models using machine learning |
| topic | large language models; natural language processing; text generation; machine learning; text classification; adversarial attacks |
| url | https://www.mdpi.com/2073-431X/13/12/346 |