Performance of ChatGPT and Radiology Residents on Ultrasonography Board-Style Questions

Bibliographic Details
Main Authors: Jiale Xu, MD; Shujun Xia, MD; Qing Hua, MD; Zihan Mei, MD; Yiqing Hou, MD; Minyan Wei, MD; Limei Lai, MD; Yixuan Yang, MD; Jianqiao Zhou, MD
Format: Article
Language: English
Published: Editorial Office of Advanced Ultrasound in Diagnosis and Therapy 2024-12-01
Series: Advanced Ultrasound in Diagnosis and Therapy
Subjects:
Online Access: https://www.journaladvancedultrasound.com/fileup/2576-2516/PDF/1731398325260-1300381136.pdf
Description
Summary: Objective: This study aims to assess the performance of the Chat Generative Pre-Trained Transformer (ChatGPT), specifically versions GPT-3.5 and GPT-4, on ultrasonography board-style questions, and to compare it with the performance of third-year radiology residents on the identical set of questions. Methods: The study, conducted from May 19 to May 30, 2023, used 134 multiple-choice questions sourced from a commercial question bank for the American Registry for Diagnostic Medical Sonography (ARDMS) examinations and submitted to ChatGPT (both the GPT-3.5 and GPT-4 versions). ChatGPT's responses were evaluated overall, by topic, and by GPT version. The identical question set was assigned to three third-year radiology residents, enabling a direct comparison of their performance with ChatGPT's. Results: GPT-4 correctly answered 82.1% of questions (110 of 134), significantly surpassing GPT-3.5 (P = 0.003), which correctly answered 66.4% of questions (89 of 134). Although GPT-3.5's performance was statistically indistinguishable from the residents' average performance (66.7%, 89.3 of 134) (P = 0.969), question-answering accuracy differed significantly between GPT-4 and the residents (P = 0.004). Conclusions: ChatGPT demonstrated significant competency on ultrasonography board-style questions, with GPT-4 markedly surpassing both its predecessor GPT-3.5 and the radiology residents.
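The abstract reports a significant accuracy gap between GPT-4 (110 of 134 correct) and GPT-3.5 (89 of 134 correct) with P = 0.003. The record does not state which statistical test the authors used, but as a hedged sketch, a standard Pearson chi-square test for two independent proportions (without continuity correction) reproduces a P value of this magnitude from the reported counts. All function and variable names below are illustrative, not from the paper.

```python
# Illustrative recomputation of the GPT-4 vs GPT-3.5 comparison as a Pearson
# chi-square test for two independent proportions (no continuity correction).
# Assumption: the paper does not specify its test; this sketch only shows that
# the reported counts are consistent with the reported P value.
import math

def chi2_two_proportions(correct_a, n_a, correct_b, n_b):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 table."""
    table = [[correct_a, n_a - correct_a],
             [correct_b, n_b - correct_b]]
    total = n_a + n_b
    col_correct = correct_a + correct_b
    col_wrong = total - col_correct
    chi2 = 0.0
    for row, n in zip(table, (n_a, n_b)):
        for observed, col_total in zip(row, (col_correct, col_wrong)):
            expected = n * col_total / total
            chi2 += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df,
    # via the complementary error function.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# GPT-4: 110/134 correct (82.1%); GPT-3.5: 89/134 correct (66.4%)
chi2, p = chi2_two_proportions(110, 134, 89, 134)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")  # p is approximately 0.003
```

The same routine applied to the GPT-4 vs. residents comparison would be expected to yield a similarly small P value, consistent with the reported P = 0.004, though the residents' fractional average (89.3 of 134) suggests the authors pooled or averaged three separate scores rather than testing a single 2x2 table.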
ISSN: 2576-2516