Visual-textual integration in LLMs for medical diagnosis: A preliminary quantitative analysis

Background and aim: Visual data from images is essential for many medical diagnoses. This study evaluates the performance of multimodal Large Language Models (LLMs) in integrating textual and visual information for diagnostic purposes.

Methods: We tested GPT-4o and Claude Sonnet 3.5 on 120 clinical vignettes with and without accompanying images. Each vignette included patient demographics, a chief concern, and relevant medical history. Vignettes were paired with either clinical or radiological images from two sources: 100 images from the OPENi database and 20 images from recent NEJM challenges, ensuring they were not in the LLMs' training sets. Three primary care physicians served as a human benchmark. We analyzed diagnostic accuracy and the models' explanations for a subset of cases.

Results: LLMs outperformed physicians in text-only scenarios (GPT-4o: 70.8%, Claude Sonnet 3.5: 59.5%, physicians: 39.5%; p < 0.001, Bonferroni-adjusted). With image integration, all improved, but physicians showed the largest gain (GPT-4o: 84.5%, p < 0.001; Claude Sonnet 3.5: 67.3%, p = 0.060; physicians: 78.8%, p < 0.001; all Bonferroni-adjusted). LLMs altered their explanatory reasoning in 45–60% of cases when images were provided.

Conclusion: Multimodal LLMs showed higher diagnostic accuracy than physicians in text-only scenarios, even in cases designed to require visual interpretation, suggesting that while images can enhance diagnostic accuracy, they may not be essential in every instance. Although adding images further improved LLM performance, the magnitude of this improvement was smaller than that observed in physicians. These findings suggest that enhanced visual data processing may be needed for LLMs to achieve the degree of image-related performance gains seen in human examiners.
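
Note: The abstract reports Bonferroni-adjusted p-values for each reader's accuracy with versus without images, but it does not state which statistical test was used. The following minimal Python sketch, using entirely synthetic per-vignette correctness data and assuming McNemar's test as one plausible paired comparison, illustrates how such an adjusted comparison could be computed; the data, function name, and choice of test are illustrative assumptions, not the authors' method.

    # Illustrative sketch only; synthetic data, assumed test (McNemar), not the study's actual analysis.
    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    rng = np.random.default_rng(0)
    n_cases = 120  # number of clinical vignettes in the study

    # Hypothetical 0/1 correctness per vignette for one reader, text-only vs. text+image
    text_only = rng.integers(0, 2, n_cases)
    with_image = np.clip(text_only + rng.integers(0, 2, n_cases), 0, 1)

    def paired_accuracy_test(a, b):
        """Return accuracy under each condition and the McNemar p-value for the paired difference."""
        table = np.array([[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
                          [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]])
        res = mcnemar(table, exact=True)
        return a.mean(), b.mean(), res.pvalue

    acc_text, acc_img, p = paired_accuracy_test(text_only, with_image)
    n_comparisons = 3  # e.g., GPT-4o, Claude Sonnet 3.5, and the physician group
    p_bonferroni = min(1.0, p * n_comparisons)
    print(f"text-only {acc_text:.1%}, with images {acc_img:.1%}, adjusted p = {p_bonferroni:.3f}")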

Bibliographic Details
Main Authors: Reem Agbareia, Mahmud Omar, Shelly Soffer, Benjamin S. Glicksberg, Girish N. Nadkarni, Eyal Klang
Format: Article
Language: English
Published: Elsevier, 2025-01-01
Series: Computational and Structural Biotechnology Journal, vol. 27, pp. 184–189
ISSN: 2001-0370
Subjects: Artificial intelligence; Medical diagnosis; Multimodal learning; Large language models; Visual data integration
Online Access: http://www.sciencedirect.com/science/article/pii/S2001037024004379

Author Affiliations
Reem Agbareia: Ophthalmology Department, Hadassah Medical Center, Jerusalem, Israel
Mahmud Omar: Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA (corresponding author)
Shelly Soffer: Institute of Hematology, Davidoff Cancer Center, Rabin Medical Center, Petah-Tikva, Israel
Benjamin S. Glicksberg: Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
Girish N. Nadkarni: Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
Eyal Klang: Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA