ChatGPT4’s diagnostic accuracy in inpatient neurology: A retrospective cohort study

Bibliographic Details
Main Authors: Sebastian Cano-Besquet, Tyler Rice-Canetto, Hadi Abou-El-Hassan, Simon Alarcon, Jason Zimmerman, Leo Issagholian, Nasser Salomon, Ivan Rojas, Joseph Dhahbi, Michael M. Neeki
Format: Article
Language: English
Published: Elsevier, 2024-12-01
Series: Heliyon
ISSN: 2405-8440
Subjects: Large language models (LLMs); ChatGPT-4 (CG4); Diagnostic accuracy; Inpatient neurology; Differential diagnoses; Treatment recommendations
Online Access: http://www.sciencedirect.com/science/article/pii/S2405844024169952
Abstract

Background: Large language models (LLMs) such as ChatGPT-4 (CG4) are proving to be valuable tools in medicine, not only for facilitating administrative tasks but also for augmenting medical decision-making. LLMs have previously been tested for diagnostic accuracy against expert-generated questions and standardized test data. In those studies, CG4 consistently outperformed alternative LLMs, including ChatGPT-3.5 (no longer publicly available) and Google Bard (now known as "Google Gemini"). The next logical step was to explore CG4's accuracy within a specific clinical domain. Our study evaluated the diagnostic accuracy of CG4 within an inpatient neurology consultation service.

Methods: We reviewed all patients listed on the daily neurology consultation roster at Arrowhead Regional Medical Center in Colton, CA, on every day surveyed until we reached 51 patients, to obtain a complete and representative sample of the patient population. Using a HIPAA-compliant methodology, ChatGPT-4 received patient data from the Epic EHR as input and was asked to provide, for each patient, an initial differential diagnosis list, investigations and recommended actions, a final diagnosis, and a treatment plan. A comprehensiveness scale (an ordinal scale from 0 to 3) was then used to rate how well the consultant and CG4 initial diagnoses matched the consultants' final diagnoses. In this proof-of-concept study, we assumed that the neurology consultants' final diagnoses were accurate. We used non-parametric bootstrap resampling to construct 95% confidence intervals around mean scores, and Fisher's exact test, the Wilcoxon rank-sum test, and ordinal logistic regression models to compare performance between the consultant and CG4 groups.

Findings: CG4 demonstrated diagnostic accuracy comparable to that of consultant neurologists. The most frequent comprehensiveness score in both groups was "3," achieved 43 times by consultant neurologists and 31 times by CG4. The mean comprehensiveness scores were 2.75 (95% CI: 2.49–2.90) for the consultant group and 2.57 (95% CI: 2.31–2.67) for the CG4 group. The success rate for comprehensive diagnoses (a score of "2" or "3") was 94.1% (95% CI: 84.1%–98.0%) for consultants and 96.1% (95% CI: 86.8%–98.9%) for CG4, with no statistically significant difference in success rates (p = 1.00). The Wilcoxon rank-sum test indicated that the consultant group was more likely to provide more comprehensive diagnoses (W = 1583, p = 0.02). Ordinal logistic regression identified significant predictors of diagnostic accuracy, with the consultant group showing an odds ratio of 3.68 (95% CI: 1.28–10.55) for higher-value outcomes. Notably, combining CG4's initial diagnoses with the consultants' would have achieved comprehensive diagnoses in all cases, corresponding to a number needed to treat (NNT) of 17 to attain one additional comprehensive diagnosis.

Interpretation: Our findings suggest that CG4 can serve as a valuable diagnostic tool in inpatient neurology, providing comprehensive and accurate initial diagnoses comparable to those of consultant neurologists. CG4 might contribute to better patient outcomes by aiding diagnosis and treatment recommendations, potentially reducing missed diagnoses and speeding the diagnostic process. Continuous strategies and evaluations to improve LLMs' accuracy remain crucial. Further studies with larger sample sizes and independent third-party evaluations are recommended to confirm these findings and assess the impact of LLMs on patient health.
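
The following is a minimal sketch of the statistical workflow the Methods section describes: bootstrap confidence intervals for the mean comprehensiveness score, Fisher's exact test on the dichotomized outcome, a rank-sum test on the ordinal scores, ordinal logistic regression with rater group as the lone predictor, and the NNT arithmetic. It is written in Python with NumPy/SciPy/statsmodels as an assumption of convenience (the reported W statistic suggests the authors used R's wilcox.test). The score vectors are an illustrative reconstruction consistent with the counts reported above (43 vs. 31 scores of "3"; 48/51 vs. 49/51 scoring 2 or higher), NOT the study's patient-level data, so the outputs only approximate the published statistics.

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)

# Hypothetical score vectors consistent with the reported counts, not real data.
consultant = np.array([3] * 43 + [2] * 5 + [1] * 1 + [0] * 2)  # mean ~2.75, 48/51 >= 2
cg4 = np.array([3] * 31 + [2] * 18 + [1] * 2)                  # mean ~2.57, 49/51 >= 2

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05):
    # Non-parametric bootstrap: resample with replacement and take the
    # 2.5th/97.5th percentiles of the resampled means.
    means = np.array([rng.choice(scores, scores.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), lo, hi

for name, s in (("consultant", consultant), ("CG4", cg4)):
    m, lo, hi = bootstrap_ci(s)
    print(f"{name}: mean {m:.2f} (95% CI {lo:.2f}-{hi:.2f})")

# Fisher's exact test on the dichotomized outcome (comprehensive = score >= 2).
table = [[int((consultant >= 2).sum()), int((consultant < 2).sum())],
         [int((cg4 >= 2).sum()), int((cg4 < 2).sum())]]
odds, p_fisher = stats.fisher_exact(table)
print(f"Fisher's exact p = {p_fisher:.2f}")

# Rank-sum test on the full ordinal scores. SciPy reports a z statistic,
# while R's wilcox.test reports W, hence the different-looking statistic.
z, p_ranksum = stats.ranksums(consultant, cg4)
print(f"rank-sum z = {z:.2f}, p = {p_ranksum:.3f}")

# Ordinal logistic regression with rater group as the only predictor;
# exponentiating the group coefficient gives the consultant odds ratio.
scores = np.concatenate([consultant, cg4])
group = np.r_[np.ones(consultant.size), np.zeros(cg4.size)]  # 1 = consultant
endog = pd.Series(pd.Categorical(scores, categories=[0, 1, 2, 3], ordered=True))
fit = OrderedModel(endog, group[:, None], distr="logit").fit(method="bfgs", disp=False)
print(f"consultant odds ratio = {np.exp(np.asarray(fit.params)[0]):.2f}")

# NNT arithmetic from the abstract: pooling both groups' initial diagnoses
# reportedly covers all 51 cases vs. 48/51 for consultants alone, so the
# absolute gain is 3/51 and NNT = 1 / (3/51) = 17.
print(f"NNT = {1 / ((51 - 48) / 51):.0f}")

With these reconstructed inputs the point estimates land close to the published values; the exact confidence intervals and p-values depend on the true patient-level scores, which are not reproduced here.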
Published In: Heliyon, Vol. 10, Iss. 24, Article e40964 (December 2024)

Author Affiliations
Sebastian Cano-Besquet (corresponding author), Tyler Rice-Canetto, Simon Alarcon, Jason Zimmerman, Leo Issagholian, Nasser Salomon, Ivan Rojas, Joseph Dhahbi: California University of Science and Medicine, 1501 Violet St, Colton, CA 92324, USA
Hadi Abou-El-Hassan, Michael M. Neeki: Department of Emergency Medicine, Arrowhead Regional Medical Center, 400 N. Pepper Ave, Colton, CA 92324, USA
Corresponding author address: California University of Science and Medicine, 26000 W Lugonia Ave, Apt 2419, Redlands, CA 92374, USA