Impact of retrieval augmented generation and large language model complexity on undergraduate exams created and taken by AI agents
The capabilities of large language models (LLMs) have advanced to the point where entire textbooks can be queried using retrieval-augmented generation (RAG), enabling AI to integrate external, up-to-date information into its responses. This study evaluates the ability of two OpenAI models, GPT-3.5 Turbo and GPT-4 Turbo, to create and answer exam questions based on an undergraduate textbook. Fourteen exams were created, each with four true-false, four multiple-choice, and two short-answer questions derived from an open-source Pacific Studies textbook. Model performance was evaluated with and without access to the source material using text-similarity metrics such as ROUGE-1, cosine similarity, and word embeddings. Fifty-six exam scores were analyzed, revealing that RAG-assisted models significantly outperformed those relying solely on pre-trained knowledge. GPT-4 Turbo also consistently outperformed GPT-3.5 Turbo in accuracy and coherence, especially in short-answer responses. These findings demonstrate the potential of LLMs in automating exam generation while maintaining assessment quality. However, they also underscore the need for policy frameworks that promote fairness, transparency, and accessibility. Given regulatory considerations outlined in the European Union AI Act and the NIST AI Risk Management Framework, institutions using AI in education must establish governance protocols, bias mitigation strategies, and human oversight measures. The results of this study contribute to ongoing discussions on responsibly integrating AI in education, advocating for institutional policies that support AI-assisted assessment while preserving academic integrity. The empirical results suggest not only performance benefits but also actionable governance mechanisms, such as verifiable retrieval pipelines and oversight protocols, that can guide institutional policies.
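The abstract's evaluation step, scoring a model's exam answers against the textbook's reference answers with text-similarity measures, can be illustrated with a minimal sketch. This is not the authors' pipeline: the reference and candidate strings below are invented for illustration, and plain token-count vectors stand in for the word embeddings the study mentions, so the example stays dependency-free.

```python
# Minimal sketch of two text-similarity measures named in the abstract:
# ROUGE-1 F1 and cosine similarity. Token-count vectors are used in
# place of word embeddings; the example answers are hypothetical.
from collections import Counter
import math

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: unigram overlap between reference and candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # shared unigrams (min counts)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def cosine_similarity(reference: str, candidate: str) -> float:
    """Cosine similarity between token-count vectors of the two texts."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    dot = sum(ref[t] * cand[t] for t in ref)
    norm = math.sqrt(sum(v * v for v in ref.values())) * \
           math.sqrt(sum(v * v for v in cand.values()))
    return dot / norm if norm else 0.0

# Score a model's short answer against a reference answer.
reference = "Wayfinding relies on star paths, swell patterns, and island blocks."
candidate = "Navigators use star paths and swell patterns to locate islands."
print(f"ROUGE-1 F1: {rouge1_f1(reference, candidate):.3f}")
print(f"Cosine:     {cosine_similarity(reference, candidate):.3f}")
```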
Saved in:
| Main Authors: | Erick Tyndall, Colleen Gayheart, Alexandre Some, Joseph Genz, Torrey Wagner, Brent Langhals |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Cambridge University Press, 2025-01-01 |
| Series: | Data & Policy |
| Subjects: | academic examinations; artificial intelligence; generative pre-training transformer; large language models; retrieval augmented generation |
| Online Access: | https://www.cambridge.org/core/product/identifier/S2632324925100242/type/journal_article |
| _version_ | 1849229097218080768 |
|---|---|
| author | Erick Tyndall; Colleen Gayheart; Alexandre Some; Joseph Genz; Torrey Wagner; Brent Langhals |
| author_sort | Erick Tyndall |
| collection | DOAJ |
| description | The capabilities of large language models (LLMs) have advanced to the point where entire textbooks can be queried using retrieval-augmented generation (RAG), enabling AI to integrate external, up-to-date information into its responses. This study evaluates the ability of two OpenAI models, GPT-3.5 Turbo and GPT-4 Turbo, to create and answer exam questions based on an undergraduate textbook. Fourteen exams were created, each with four true-false, four multiple-choice, and two short-answer questions derived from an open-source Pacific Studies textbook. Model performance was evaluated with and without access to the source material using text-similarity metrics such as ROUGE-1, cosine similarity, and word embeddings. Fifty-six exam scores were analyzed, revealing that RAG-assisted models significantly outperformed those relying solely on pre-trained knowledge. GPT-4 Turbo also consistently outperformed GPT-3.5 Turbo in accuracy and coherence, especially in short-answer responses. These findings demonstrate the potential of LLMs in automating exam generation while maintaining assessment quality. However, they also underscore the need for policy frameworks that promote fairness, transparency, and accessibility. Given regulatory considerations outlined in the European Union AI Act and the NIST AI Risk Management Framework, institutions using AI in education must establish governance protocols, bias mitigation strategies, and human oversight measures. The results of this study contribute to ongoing discussions on responsibly integrating AI in education, advocating for institutional policies that support AI-assisted assessment while preserving academic integrity. The empirical results suggest not only performance benefits but also actionable governance mechanisms, such as verifiable retrieval pipelines and oversight protocols, that can guide institutional policies. |
| format | Article |
| id | doaj-art-6d16c66f93a3404a8357b1c6e1a43d8b |
| institution | Kabale University |
| issn | 2632-3249 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | Cambridge University Press |
| record_format | Article |
| series | Data & Policy |
| spelling | doaj-art-6d16c66f93a3404a8357b1c6e1a43d8b (harvested 2025-08-22T06:20:16Z); eng; Cambridge University Press; Data & Policy; 2632-3249; 2025-01-01; vol. 7; doi:10.1017/dap.2025.10024. Erick Tyndall (https://orcid.org/0009-0009-7148-9652), Colleen Gayheart, Alexandre Some, Torrey Wagner, and Brent Langhals: Department of Systems Engineering and Management, Air Force Institute of Technology (https://ror.org/03f9f1d95), Wright-Patterson Air Force Base, Ohio, USA. Joseph Genz: Department of Anthropology, University of Hawaiʻi at Hilo, Hilo, Hawaii, USA. |
| title | Impact of retrieval augmented generation and large language model complexity on undergraduate exams created and taken by AI agents |
| topic | academic examinations artificial intelligence generative pre-training transformer large language models retrieval augmented generation |
| url | https://www.cambridge.org/core/product/identifier/S2632324925100242/type/journal_article |