The performance of ChatGPT on medical image-based assessments and implications for medical education

Abstract
Background: Generative artificial intelligence (AI) tools like ChatGPT (OpenAI) have garnered significant attention for their potential in fields such as medical education; however, the performance of large language and vision models on medical test items involving images remains underexplored, limiting their broader educational utility. This study aimed to evaluate the performance of GPT-4 and GPT-4 Omni (GPT-4o), accessed via the ChatGPT platform, on image-based United States Medical Licensing Examination (USMLE) sample items and to explore the implications for medical education.
Methods: We identified all image-based questions from the USMLE Step 1 and Step 2 Clinical Knowledge sample item sets. Prompt engineering techniques were applied to generate responses from GPT-4 and GPT-4o. Each model was tested independently, with accuracy calculated as the proportion of correct answers. In addition, we explored the application of these models in case-based teaching scenarios involving medical images.
Results: A total of 38 image-based questions spanning multiple medical disciplines, including dermatology, cardiology, and gastroenterology, were included in the analysis. GPT-4 achieved an accuracy of 73.4% (95% CI, 57.0% to 85.5%), while GPT-4o achieved a numerically higher accuracy of 89.5% (95% CI, 74.4% to 96.1%); the difference was not statistically significant (P = 0.137). The two models showed substantial disagreement in their classification of question complexity. In exploratory case-based teaching scenarios, GPT-4o was able to analyze and revise incorrect responses with logical reasoning. It also demonstrated potential to assist educators in designing structured lesson plans focused on core clinical knowledge areas, though human oversight remained essential.
Conclusion: This study demonstrates that GPT models can accurately answer image-based medical examination questions, with GPT-4o exhibiting numerically higher performance. Prompt engineering further enables their use in instructional planning. While these models hold promise for enhancing medical education, expert supervision remains critical to ensure the accuracy and reliability of AI-generated content.

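The Methods describe prompt engineering applied to image-based items submitted to GPT-4 and GPT-4o via the ChatGPT platform, but no scripts are given. The sketch below shows how a comparable workflow could be scripted against the OpenAI Chat Completions API instead of the web interface; the prompt wording, model identifiers, and file paths are illustrative assumptions, not the authors' protocol.

```python
"""Sketch: submitting image-based MCQ items programmatically (assumed workflow)."""
import base64
from pathlib import Path

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical instruction prompt; the study's actual prompts are not published here.
SYSTEM_PROMPT = (
    "You are answering a USMLE-style multiple-choice question. "
    "Examine the image, reason step by step, then state the single best "
    "answer as one option letter on the final line."
)

def ask_image_question(model: str, question: str, image_path: Path) -> str:
    """Send one image-based item to the chosen model and return its reply."""
    image_b64 = base64.b64encode(image_path.read_bytes()).decode()
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce run-to-run variation
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            },
        ],
    )
    return response.choices[0].message.content

# Hypothetical usage for the two models compared in the study:
# reply = ask_image_question("gpt-4o", item_text, Path("item_01.png"))
```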
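The Results report per-model accuracy with 95% confidence intervals and a comparison P-value, but the abstract does not name the CI method or significance test. A minimal sketch follows, assuming Wilson score intervals and an exact McNemar test on discordant pairs; the per-item outcomes are placeholders roughly consistent with the reported accuracies, so the printed P-value is illustrative only and will not reproduce the paper's value.

```python
"""Sketch: accuracy, Wilson 95% CIs, and an exact McNemar comparison (assumed methods)."""
import math

def wilson_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar P-value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) * 0.5 ** n
    return min(1.0, 2 * tail)

# Placeholder per-item outcomes for 38 image-based questions (True = correct).
gpt4_correct = [True] * 28 + [False] * 10
gpt4o_correct = [True] * 34 + [False] * 4

for name, results in (("GPT-4", gpt4_correct), ("GPT-4o", gpt4o_correct)):
    k, n = sum(results), len(results)
    lo, hi = wilson_ci(k, n)
    print(f"{name}: {k}/{n} = {k / n:.1%} (95% CI {lo:.1%} to {hi:.1%})")

# Discordant pairs: items one model answered correctly and the other did not.
b = sum(a and not o for a, o in zip(gpt4_correct, gpt4o_correct))
c = sum(o and not a for a, o in zip(gpt4_correct, gpt4o_correct))
print(f"Exact McNemar P = {mcnemar_exact(b, c):.3f}")
```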
Bibliographic Details
Main Authors: Xiang Yang, Wei Chen (Department of Neurosurgery, West China Hospital, Sichuan University)
Format: Article
Language: English
Published: BMC 2025-08-01
Series:BMC Medical Education
Subjects: Medical education; Medical examination; Large language models; ChatGPT
ISSN: 1472-6920
Online Access:https://doi.org/10.1186/s12909-025-07752-0