ChatGPT-4 versus human-generated multiple-choice questions - A study from a medical college in Pakistan
Main Authors:
Format: Article
Language: English
Published: Shalamar Medical & Dental College, Lahore, Pakistan, 2024-12-01
Series: Journal of Shalamar Medical & Dental College
Subjects:
Online Access: https://journal.smdc.edu.pk/index.php/journal/article/view/253
Summary: Background: There has been growing interest in using artificial intelligence (AI)-generated multiple-choice questions (MCQs) to supplement traditional assessments. Although AI is claimed to generate higher-order questions, few studies have focused on undergraduate medical education assessment in Pakistan.
Objective: To compare the quality of human-developed versus ChatGPT-4-generated MCQs for the final-year MBBS written MCQ examination.
Methods: This observational study compared ChatGPT-4-generated and human-developed MCQs in four specialties: Pediatrics, Obstetrics and Gynecology (Ob/Gyn), Surgery, and Medicine. Based on the table of specifications, 204 MCQs were generated with ChatGPT-4 and 196 MCQs were retrieved from the medical college's question bank. Both sets of MCQs were anonymized, and MCQ quality was scored using a checklist based on National Board of Medical Examiners criteria. Data were analyzed using SPSS version 23; Mann-Whitney U and chi-square tests were applied (see the sketch after this summary).
Results: Of 400 MCQs, 396 were included in the final review; four did not conform to the table of specifications. Total scores did not differ significantly between human-developed and ChatGPT-4-generated MCQs (p=0.12). However, human-developed MCQs performed significantly better than ChatGPT-4-generated MCQs in Ob/Gyn (p=0.03). Human-developed MCQs also scored better on the checklist item "stem includes necessary details for answering the question" in Ob/Gyn and Pediatrics (p < 0.05), as well as on "Is the item appropriate for the cover-the-options rule?" (i.e., the stem should be answerable even with the options covered) in Surgery.
Conclusion: With well-structured, specific prompting, ChatGPT-4 has the potential to assist in developing MCQs for medical examinations. However, ChatGPT-4 has limitations where in-depth, contextual item generation is required.
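The abstract reports that the analysis was run in SPSS version 23. As an illustration only, the minimal Python sketch below shows how the two reported tests could be applied: a Mann-Whitney U test on per-item quality scores and a chi-square test on a binary checklist criterion. All scores and counts here are hypothetical placeholders, not the study's data.

```python
# Illustrative sketch only - the study itself used SPSS version 23.
from scipy.stats import mannwhitneyu, chi2_contingency

# Hypothetical total quality scores, one value per MCQ
human_scores = [12, 11, 13, 10, 12, 14, 11, 13]
chatgpt_scores = [11, 10, 12, 11, 10, 13, 12, 10]

# Mann-Whitney U: non-parametric comparison of the two score distributions
u_stat, p_value = mannwhitneyu(human_scores, chatgpt_scores, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.3f}")

# Chi-square: compare how often each group meets a binary checklist criterion,
# e.g. "stem includes necessary details for answering the question".
# Rows: human, ChatGPT-4; columns: criterion met, not met (hypothetical counts)
contingency = [[45, 5], [38, 12]]
chi2, p_chi, dof, expected = chi2_contingency(contingency)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p_chi:.3f}")
```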
ISSN: 2789-3669, 2789-3677