Impact of retrieval augmented generation and large language model complexity on undergraduate exams created and taken by AI agents
The capabilities of large language models (LLMs) have advanced to the point where entire textbooks can be queried using retrieval-augmented generation (RAG), enabling AI to integrate external, up-to-date information into its responses. This study evaluates the ability of two OpenAI models, GPT-3.5 Turbo and GPT-4 Turbo, to create and answer exam questions based on an undergraduate textbook. Fourteen exams were created, each with four true-false, four multiple-choice, and two short-answer questions derived from an open-source Pacific Studies textbook. Model performance was evaluated with and without access to the source material using text-similarity metrics such as ROUGE-1, cosine similarity, and word embeddings. Fifty-six exam scores were analyzed, revealing that RAG-assisted models significantly outperformed those relying solely on pre-trained knowledge. GPT-4 Turbo also consistently outperformed GPT-3.5 Turbo in accuracy and coherence, especially in short-answer responses. These findings demonstrate the potential of LLMs in automating exam generation while maintaining assessment quality. However, they also underscore the need for policy frameworks that promote fairness, transparency, and accessibility. Given regulatory considerations outlined in the European Union AI Act and the NIST AI Risk Management Framework, institutions using AI in education must establish governance protocols, bias mitigation strategies, and human oversight measures. The results of this study contribute to ongoing discussions on responsibly integrating AI in education, advocating for institutional policies that support AI-assisted assessment while preserving academic integrity. The empirical results suggest not only performance benefits but also actionable governance mechanisms, such as verifiable retrieval pipelines and oversight protocols, that can guide institutional policies.
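The abstract's evaluation step, scoring a model's exam answers against the textbook's reference answers with text-similarity measures, can be illustrated with a minimal sketch. This is not the authors' pipeline: the reference and candidate strings below are invented for illustration, and plain token-count vectors stand in for the word embeddings the study mentions, so the example stays dependency-free.

```python
# Minimal sketch of two text-similarity measures named in the abstract:
# ROUGE-1 F1 and cosine similarity. Token-count vectors are used in
# place of word embeddings; the example answers are hypothetical.
from collections import Counter
import math

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: unigram overlap between reference and candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # shared unigrams (min counts)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def cosine_similarity(reference: str, candidate: str) -> float:
    """Cosine similarity between token-count vectors of the two texts."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    dot = sum(ref[t] * cand[t] for t in ref)
    norm = math.sqrt(sum(v * v for v in ref.values())) * \
           math.sqrt(sum(v * v for v in cand.values()))
    return dot / norm if norm else 0.0

# Score a model's short answer against a reference answer.
reference = "Wayfinding relies on star paths, swell patterns, and island blocks."
candidate = "Navigators use star paths and swell patterns to locate islands."
print(f"ROUGE-1 F1: {rouge1_f1(reference, candidate):.3f}")
print(f"Cosine:     {cosine_similarity(reference, candidate):.3f}")
```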
Saved in:
| Main Authors: | Erick Tyndall, Colleen Gayheart, Alexandre Some, Joseph Genz, Torrey Wagner, Brent Langhals |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Cambridge University Press, 2025-01-01 |
| Series: | Data & Policy |
| Subjects: | academic examinations; artificial intelligence; generative pre-training transformer; large language models; retrieval augmented generation |
| Online Access: | https://www.cambridge.org/core/product/identifier/S2632324925100242/type/journal_article |
| _version_ | 1849229097218080768 |
|---|---|
| author | Erick Tyndall; Colleen Gayheart; Alexandre Some; Joseph Genz; Torrey Wagner; Brent Langhals |
| author_sort | Erick Tyndall |
| collection | DOAJ |
| description | The capabilities of large language models (LLMs) have advanced to the point where entire textbooks can be queried using retrieval-augmented generation (RAG), enabling AI to integrate external, up-to-date information into its responses. This study evaluates the ability of two OpenAI models, GPT-3.5 Turbo and GPT-4 Turbo, to create and answer exam questions based on an undergraduate textbook. Fourteen exams were created, each with four true-false, four multiple-choice, and two short-answer questions derived from an open-source Pacific Studies textbook. Model performance was evaluated with and without access to the source material using text-similarity metrics such as ROUGE-1, cosine similarity, and word embeddings. Fifty-six exam scores were analyzed, revealing that RAG-assisted models significantly outperformed those relying solely on pre-trained knowledge. GPT-4 Turbo also consistently outperformed GPT-3.5 Turbo in accuracy and coherence, especially in short-answer responses. These findings demonstrate the potential of LLMs in automating exam generation while maintaining assessment quality. However, they also underscore the need for policy frameworks that promote fairness, transparency, and accessibility. Given regulatory considerations outlined in the European Union AI Act and the NIST AI Risk Management Framework, institutions using AI in education must establish governance protocols, bias mitigation strategies, and human oversight measures. The results of this study contribute to ongoing discussions on responsibly integrating AI in education, advocating for institutional policies that support AI-assisted assessment while preserving academic integrity. The empirical results suggest not only performance benefits but also actionable governance mechanisms, such as verifiable retrieval pipelines and oversight protocols, that can guide institutional policies. |
| format | Article |
| id | doaj-art-6d16c66f93a3404a8357b1c6e1a43d8b |
| institution | Kabale University |
| issn | 2632-3249 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | Cambridge University Press |
| record_format | Article |
| series | Data & Policy |
| spelling | doaj-art-6d16c66f93a3404a8357b1c6e1a43d8b (harvested 2025-08-22T06:20:16Z); eng; Cambridge University Press; Data & Policy; 2632-3249; 2025-01-01; vol. 7; doi:10.1017/dap.2025.10024. Erick Tyndall (https://orcid.org/0009-0009-7148-9652), Colleen Gayheart, Alexandre Some, Torrey Wagner, and Brent Langhals: Department of Systems Engineering and Management, Air Force Institute of Technology (https://ror.org/03f9f1d95), Wright-Patterson Air Force Base, Ohio, USA. Joseph Genz: Department of Anthropology, University of Hawaiʻi at Hilo, Hilo, Hawaii, USA. |
| title | Impact of retrieval augmented generation and large language model complexity on undergraduate exams created and taken by AI agents |
| topic | academic examinations artificial intelligence generative pre-training transformer large language models retrieval augmented generation |
| url | https://www.cambridge.org/core/product/identifier/S2632324925100242/type/journal_article |