Impact of retrieval augmented generation and large language model complexity on undergraduate exams created and taken by AI agents

The capabilities of large language models (LLMs) have advanced to the point where entire textbooks can be queried using retrieval-augmented generation (RAG), enabling AI to integrate external, up-to-date information into its responses. This study evaluates the ability of two OpenAI models, GPT-3.5 Turbo and GPT-4 Turbo, to create and answer exam questions based on an undergraduate textbook. Fourteen exams were created, each with four true-false, four multiple-choice, and two short-answer questions derived from an open-source Pacific Studies textbook. Model performance was evaluated with and without access to the source material using text-similarity metrics such as ROUGE-1, cosine similarity, and word embeddings. Fifty-six exam scores were analyzed, revealing that RAG-assisted models significantly outperformed those relying solely on pre-trained knowledge. GPT-4 Turbo also consistently outperformed GPT-3.5 Turbo in accuracy and coherence, especially in short-answer responses. These findings demonstrate the potential of LLMs to automate exam generation while maintaining assessment quality. However, they also underscore the need for policy frameworks that promote fairness, transparency, and accessibility. Given regulatory considerations outlined in the European Union AI Act and the NIST AI Risk Management Framework, institutions using AI in education must establish governance protocols, bias-mitigation strategies, and human oversight measures. The results contribute to ongoing discussions on responsibly integrating AI in education, advocating for institutional policies that support AI-assisted assessment while preserving academic integrity, and they point to actionable governance mechanisms, such as verifiable retrieval pipelines and oversight protocols, that can guide those policies.
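
The retrieval step described in the abstract can be sketched in a few lines. The example below is a minimal illustration, assuming the openai Python SDK (v1+), an OPENAI_API_KEY in the environment, and illustrative model names; the helper names, prompt wording, and top-3 cutoff are hypothetical choices, not details taken from the paper. The same grounding context can be supplied whether the model is drafting exam questions or answering them.

# Minimal RAG sketch (assumed details, not the paper's exact pipeline):
# embed textbook chunks, retrieve those most similar to a question,
# and ask the model to answer from the retrieved excerpts only.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def top_k_chunks(question: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 3) -> list[str]:
    q = embed([question])[0]
    # Cosine similarity between the question and every textbook chunk.
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

def answer_with_rag(question: str, chunks: list[str], chunk_vecs: np.ndarray) -> str:
    context = "\n\n".join(top_k_chunks(question, chunks, chunk_vecs))
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided textbook excerpts."},
            {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

Dropping the excerpt context from the prompt reproduces the study's no-RAG condition, in which the model must rely on its pre-trained knowledge alone.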

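Two of the similarity metrics named in the abstract are straightforward to compute from scratch. The self-contained sketch below scores a candidate answer against a reference answer using ROUGE-1 F1 (clipped unigram overlap) and bag-of-words cosine similarity; the whitespace tokenizer and example sentences are illustrative assumptions, and the study's word-embedding comparison would use dense vectors rather than raw counts.

# Hedged sketch of two text-similarity metrics; tokenization is assumed.
from collections import Counter
import math

def tokens(text: str) -> list[str]:
    return text.lower().split()

def rouge1_f1(candidate: str, reference: str) -> float:
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def cosine_sim(candidate: str, reference: str) -> float:
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    dot = sum(cand[w] * ref[w] for w in cand)
    norm = math.sqrt(sum(v * v for v in cand.values())) * math.sqrt(sum(v * v for v in ref.values()))
    return dot / norm if norm else 0.0

reference = "Ocean voyaging connected the islands of the Pacific long before European contact."
candidate = "Long-distance voyaging linked Pacific islands well before Europeans arrived."
print(f"ROUGE-1 F1: {rouge1_f1(candidate, reference):.2f}, cosine: {cosine_sim(candidate, reference):.2f}")
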
Bibliographic Details
Main Authors: Erick Tyndall, Colleen Gayheart, Alexandre Some, Joseph Genz, Torrey Wagner, Brent Langhals
Format: Article
Language: English
Published: Cambridge University Press, 2025-01-01
Series: Data & Policy
Subjects: academic examinations; artificial intelligence; generative pre-training transformer; large language models; retrieval augmented generation
Online Access:https://www.cambridge.org/core/product/identifier/S2632324925100242/type/journal_article
Collection: DOAJ
ISSN: 2632-3249
DOI: 10.1017/dap.2025.10024
Published in: Data & Policy, Volume 7 (2025)
Author Affiliations:
Erick Tyndall (ORCID: https://orcid.org/0009-0009-7148-9652), Colleen Gayheart, Alexandre Some, Torrey Wagner, Brent Langhals: Department of Systems Engineering and Management, Air Force Institute of Technology (ROR: https://ror.org/03f9f1d95), Wright-Patterson Air Force Base, Ohio, USA
Joseph Genz: Department of Anthropology, University of Hawaiʻi at Hilo, Hilo, Hawaii, USA