Visual-textual integration in LLMs for medical diagnosis: A preliminary quantitative analysis

Background and aim: Visual data from images is essential for many medical diagnoses. This study evaluates the performance of multimodal Large Language Models (LLMs) in integrating textual and visual information for diagnostic purposes.

Methods: We tested GPT-4o and Claude Sonnet 3.5 on 120 clinical vignettes with and without accompanying images. Each vignette included patient demographics, a chief concern, and relevant medical history. Vignettes were paired with either clinical or radiological images from two sources: 100 images from the OPENi database and 20 images from recent NEJM challenges, ensuring they were not in the LLMs' training sets. Three primary care physicians served as a human benchmark. We analyzed diagnostic accuracy and the models' explanations for a subset of cases.

Results: LLMs outperformed physicians in text-only scenarios (GPT-4o: 70.8%, Claude Sonnet 3.5: 59.5%, physicians: 39.5%; p < 0.001, Bonferroni-adjusted). With image integration, all improved, but physicians showed the largest gain (GPT-4o: 84.5%, p < 0.001; Claude Sonnet 3.5: 67.3%, p = 0.060; physicians: 78.8%, p < 0.001; all Bonferroni-adjusted). LLMs altered their explanatory reasoning in 45–60% of cases when images were provided.

Conclusion: Multimodal LLMs showed higher diagnostic accuracy than physicians in text-only scenarios, even in cases designed to require visual interpretation, suggesting that while images can enhance diagnostic accuracy, they may not be essential in every instance. Although adding images further improved LLM performance, the magnitude of this improvement was smaller than that observed in physicians. These findings suggest that enhanced visual data processing may be needed for LLMs to achieve the degree of image-related performance gains seen in human examiners.
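
Note: The abstract reports Bonferroni-adjusted p-values for each reader's accuracy with versus without images, but it does not state which statistical test was used. The following minimal Python sketch, using entirely synthetic per-vignette correctness data and assuming McNemar's test as one plausible paired comparison, illustrates how such an adjusted comparison could be computed; the data, function name, and choice of test are illustrative assumptions, not the authors' method.

    # Illustrative sketch only; synthetic data, assumed test (McNemar), not the study's actual analysis.
    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    rng = np.random.default_rng(0)
    n_cases = 120  # number of clinical vignettes in the study

    # Hypothetical 0/1 correctness per vignette for one reader, text-only vs. text+image
    text_only = rng.integers(0, 2, n_cases)
    with_image = np.clip(text_only + rng.integers(0, 2, n_cases), 0, 1)

    def paired_accuracy_test(a, b):
        """Return accuracy under each condition and the McNemar p-value for the paired difference."""
        table = np.array([[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
                          [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]])
        res = mcnemar(table, exact=True)
        return a.mean(), b.mean(), res.pvalue

    acc_text, acc_img, p = paired_accuracy_test(text_only, with_image)
    n_comparisons = 3  # e.g., GPT-4o, Claude Sonnet 3.5, and the physician group
    p_bonferroni = min(1.0, p * n_comparisons)
    print(f"text-only {acc_text:.1%}, with images {acc_img:.1%}, adjusted p = {p_bonferroni:.3f}")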

Bibliographic Details
Main Authors: Reem Agbareia, Mahmud Omar, Shelly Soffer, Benjamin S. Glicksberg, Girish N. Nadkarni, Eyal Klang
Format: Article
Language: English
Published: Elsevier, 2025-01-01
Series: Computational and Structural Biotechnology Journal, vol. 27, pp. 184–189
ISSN: 2001-0370
Subjects: Artificial intelligence; Medical diagnosis; Multimodal learning; Large language models; Visual data integration
Online Access: http://www.sciencedirect.com/science/article/pii/S2001037024004379

Author Affiliations
Reem Agbareia: Ophthalmology Department, Hadassah Medical Center, Jerusalem, Israel
Mahmud Omar: Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA (corresponding author)
Shelly Soffer: Institute of Hematology, Davidoff Cancer Center, Rabin Medical Center, Petah-Tikva, Israel
Benjamin S. Glicksberg: Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
Girish N. Nadkarni: Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
Eyal Klang: Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA