Artificial intelligence assisted automated short answer question scoring tool shows high correlation with human examiner markings

Bibliographic Details
Main Authors: H.M.T.W. Seneviratne, S.S. Manathunga
Format: Article
Language: English
Published: BMC 2025-08-01
Series: BMC Medical Education
Online Access: https://doi.org/10.1186/s12909-025-07718-2
Abstract

Background: Optimizing the skill of answering short answer questions (SAQs) among medical undergraduates with personalized feedback is challenging. With increasing student numbers and staff shortages, the task is becoming practically difficult. We therefore aimed to develop an automated SAQ scoring tool (ASST) that uses artificial intelligence (AI) to evaluate written answers and provide feedback.

Methods: This study investigated the use of large language models (LLMs) for automated SAQ scoring that follows rubrics. A rubric is a set of instructor-provided guidelines or criteria used to evaluate or grade assignments. We focused on short answer questions from the Systematic Pharmacology course, with model responses and rubrics shared with the LLM. The LLM analyzed student answers by extracting key parts, scoring them against the rubric criteria, and providing feedback. The evaluation relied on GPT-4, and the final score was determined by averaging the results of five sampled runs. To validate the method, human examiners graded the same answers, and the marks were compared to measure correlation.

Results: Across 30 student answers, ASST scores correlated highly with independent human examiner markings (correlation coefficients of 0.93 and 0.96). An intra-class correlation coefficient of 0.94 indicates excellent inter-rater reliability between the LLM and the human examiners.

Conclusion: Markings from the AI-assisted automated SAQ scoring tool correlate highly with human examiner markings, which shows promise for transparent, flexible automated grading with granular feedback. By enabling criteria-based assessment, this approach could have valuable applications in education, reducing the grading burden on instructors while providing students with more granular, actionable feedback on their written work.
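
The scoring procedure described in the abstract (rubric and model answer shared with GPT-4, final mark averaged over five sampled runs) can be sketched roughly as follows. This is a minimal illustration that assumes the OpenAI Chat Completions API and a JSON reply format; the prompt wording, rubric layout, and parsing are illustrative assumptions, not the authors' published implementation.

    import json
    from statistics import mean

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    SYSTEM_PROMPT = (
        "You are an examiner. Score the student's answer strictly against the "
        'rubric and reply with JSON only: {"score": <number>, "feedback": "<string>"}.'
    )

    def score_answer(question, rubric, model_answer, student_answer, n_runs=5):
        """Score one SAQ answer; the final mark is the mean over n_runs samples."""
        user_prompt = (
            f"Question:\n{question}\n\nRubric:\n{rubric}\n\n"
            f"Model answer:\n{model_answer}\n\nStudent answer:\n{student_answer}"
        )
        response = client.chat.completions.create(
            model="gpt-4",
            n=n_runs,          # five sampled runs, as in the study
            temperature=1.0,   # assumed; averaging only helps if sampling varies
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_prompt},
            ],
        )
        # A production grader would validate the JSON before trusting it.
        runs = [json.loads(choice.message.content) for choice in response.choices]
        return {
            "score": mean(r["score"] for r in runs),    # averaged final mark
            "feedback": [r["feedback"] for r in runs],  # per-run granular feedback
        }

The reported agreement statistics (Pearson correlations against each examiner and an intra-class correlation coefficient) can be computed with standard tools, for example SciPy and pingouin; the score vectors below are hypothetical placeholders, not the study's data.

    import pandas as pd
    import pingouin as pg
    from scipy.stats import pearsonr

    # Hypothetical marks for five answers, for illustration only.
    llm       = [7.2, 5.8, 9.0, 6.4, 8.1]
    examiner1 = [7.0, 6.0, 9.0, 6.5, 8.0]
    examiner2 = [7.5, 5.5, 8.5, 6.0, 8.5]

    r1, _ = pearsonr(llm, examiner1)  # correlation with examiner 1
    r2, _ = pearsonr(llm, examiner2)  # correlation with examiner 2

    # Long format: one row per (answer, rater) pair, as pingouin expects.
    long = pd.DataFrame({
        "answer": list(range(5)) * 3,
        "rater": ["llm"] * 5 + ["examiner1"] * 5 + ["examiner2"] * 5,
        "score": llm + examiner1 + examiner2,
    })
    icc = pg.intraclass_corr(data=long, targets="answer",
                             raters="rater", ratings="score")
    print(r1, r2)
    print(icc[["Type", "ICC"]])
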
ISSN:1472-6920