Comparative performance of ChatGPT, Gemini, and final-year emergency medicine clerkship students in answering multiple-choice questions: implications for the use of AI in medical education
Abstract

Background: The integration of artificial intelligence (AI) into medical education has gained significant attention, particularly with the emergence of advanced language models such as ChatGPT and Gemini. While these tools show promise for answering multiple-choice questions (MCQs), their efficacy in specialized domains such as the Emergency Medicine (EM) clerkship remains underexplored. This study aimed to evaluate and compare the accuracy of ChatGPT, Gemini, and final-year EM clerkship students in answering text-only and image-based MCQs, and to assess AI's potential as a supplementary tool in medical education.

Methods: In this proof-of-concept study, a comparative analysis was conducted using 160 MCQs from an EM clerkship curriculum, comprising 98 text-only and 62 image-based questions. The performance of the free versions of ChatGPT (4.0) and Gemini (1.5) was assessed alongside that of 125 final-year EM clerkship students. Responses were categorized as "correct", "incorrect", or "unanswered", and statistical analysis was performed using IBM SPSS Statistics (Version 26.0) to compare accuracy across groups and question types.

Results: Significant performance differences were observed across the three groups (χ² = 42.7, p < 0.001). Final-year EM students demonstrated the highest overall accuracy at 79.4%, outperforming both ChatGPT (72.5%) and Gemini (54.4%). Students excelled on text-only MCQs (89.8%) and performed robustly on image-based questions (62.9%). ChatGPT performed strongly on text-only items (83.7%) but less accurately on image-based questions (54.8%). Gemini performed moderately on text-only questions (73.5%) but struggled with image-based content, achieving only 24.2% accuracy. Pairwise comparisons confirmed that students outperformed both AI models across all formats (p < 0.01), with the widest gap, 38.7 percentage points, observed between students and Gemini on image-based questions. All AI "unable to answer" responses were treated as incorrect in the analysis.

Conclusion: This proof-of-concept study demonstrates that while AI shows promise as a supplementary educational tool, it cannot yet replace traditional training methods, particularly in domains requiring visual interpretation and clinical reasoning. ChatGPT's strong performance on text-based questions highlights its utility, but its limitations on image-based tasks underscore the need for improvement. Gemini's lower accuracy further illustrates the challenges current AI models face in processing visually complex medical content. Future research should focus on enhancing AI's multimodal capabilities to improve its applicability in medical education and assessment.
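The Results report an overall chi-square comparison of accuracy across the three groups, computed in SPSS. As a rough, non-authoritative illustration of that kind of test, the sketch below builds a 3×2 contingency table of correct/incorrect counts reconstructed from the reported overall accuracies over the 160 questions (with "unable to answer" counted as incorrect, as in the study) and runs a chi-square test with SciPy. The reconstructed counts, the per-question aggregation for the student group, and the use of SciPy rather than SPSS are all assumptions for illustration, so the resulting statistic will not necessarily match the paper's reported χ² = 42.7.

```python
# Illustrative sketch only: counts are reconstructed from the reported overall
# accuracies (79.4%, 72.5%, 54.4%) over 160 MCQs and are NOT the authors' raw data.
from scipy.stats import chi2_contingency

N_QUESTIONS = 160
overall_accuracy = {
    "Students": 0.794,  # final-year EM clerkship students (aggregate)
    "ChatGPT": 0.725,   # free ChatGPT (4.0); "unable to answer" counted as incorrect
    "Gemini": 0.544,    # free Gemini (1.5); "unable to answer" counted as incorrect
}

# Build a 3x2 table of (correct, incorrect) counts per group.
table = [
    [round(acc * N_QUESTIONS), N_QUESTIONS - round(acc * N_QUESTIONS)]
    for acc in overall_accuracy.values()
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.4g}")
```

The same approach extends to the text-only versus image-based comparisons by building separate tables from the per-format accuracies.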
| Main Authors: | Shaikha Nasser Al-Thani, Shahzad Anjum, Zain Ali Bhutta, Sarah Bashir, Muhammad Azhar Majeed, Anfal Sher Khan, Khalid Bashir |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC, 2025-08-01 |
| Series: | International Journal of Emergency Medicine |
| Online Access: | https://doi.org/10.1186/s12245-025-00949-6 |
| ISSN: | 1865-1380 |
| Collection: | DOAJ |
| Author Affiliations: | Department of Emergency Medicine, Hamad Medical Corporation (Al-Thani, Anjum, Bhutta, Majeed, K. Bashir); University of Aberdeen (S. Bashir); Weill Cornell Medicine (A. S. Khan) |