Visual-textual integration in LLMs for medical diagnosis: A preliminary quantitative analysis
Background and aim: Visual data from images is essential for many medical diagnoses. This study evaluates the performance of multimodal Large Language Models (LLMs) in integrating textual and visual information for diagnostic purposes. Methods: We tested GPT-4o and Claude Sonnet 3.5 on 120 clinical vignettes with and without accompanying images. Each vignette included patient demographics, a chief concern, and relevant medical history. Vignettes were paired with either clinical or radiological images from two sources: 100 images from the OPENi database and 20 images from recent NEJM challenges, ensuring they were not in the LLMs' training sets. Three primary care physicians served as a human benchmark. We analyzed diagnostic accuracy and the models' explanations for a subset of cases. Results: LLMs outperformed physicians in text-only scenarios (GPT-4o: 70.8 %, Claude Sonnet 3.5: 59.5 %, Physicians: 39.5 %, p < 0.001, Bonferroni-adjusted). With image integration, all improved, but physicians showed the largest gain (GPT-4o: 84.5 %, p < 0.001; Claude Sonnet 3.5: 67.3 %, p = 0.060; Physicians: 78.8 %, p < 0.001, all Bonferroni-adjusted). LLMs altered their explanatory reasoning in 45–60 % of cases when images were provided. Conclusion: Multimodal LLMs showed higher diagnostic accuracy than physicians in text-only scenarios, even in cases designed to require visual interpretation, suggesting that while images can enhance diagnostic accuracy, they may not be essential in every instance. Although adding images further improved LLM performance, the magnitude of this improvement was smaller than that observed in physicians. These findings suggest that enhanced visual data processing may be needed for LLMs to achieve the degree of image-related performance gains seen in human examiners.
Main Authors: | Reem Agbareia, Mahmud Omar, Shelly Soffer, Benjamin S. Glicksberg, Girish N. Nadkarni, Eyal Klang |
Format: | Article |
Language: | English |
Published: | Elsevier, 2025-01-01 |
Series: | Computational and Structural Biotechnology Journal |
Subjects: | Artificial intelligence; Medical diagnosis; Multimodal learning; Large language models; Visual data integration |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2001037024004379 |
ISSN: | 2001-0370 |
Author affiliations: | Reem Agbareia: Ophthalmology Department, Hadassah Medical Center, Jerusalem, Israel. Mahmud Omar (corresponding author): Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA. Shelly Soffer: Institute of Hematology, Davidoff Cancer Center, Rabin Medical Center, Petah-Tikva, Israel. Benjamin S. Glicksberg, Girish N. Nadkarni, Eyal Klang: Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA. |
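The Methods summary above describes querying GPT-4o and Claude Sonnet 3.5 with each vignette both with and without its paired image. The sketch below shows what such a with/without-image query could look like, assuming the OpenAI Python SDK as the interface (the record does not specify how the models were accessed); the vignette text, prompt wording, file name, and the `ask` helper are illustrative placeholders, not the study's protocol.

```python
# Hypothetical sketch of a text-only vs. text-plus-image query to a multimodal model.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

VIGNETTE = (
    "Demographics: 54-year-old man. Chief concern: progressive dyspnea. "
    "History: 30 pack-year smoking history."
)  # placeholder vignette, not a case from the study
PROMPT = "Provide the single most likely diagnosis."

def ask(vignette: str, image_path: str | None = None, model: str = "gpt-4o") -> str:
    """Query the model with the vignette text and, optionally, an attached image."""
    content = [{"type": "text", "text": f"{vignette}\n\n{PROMPT}"}]
    if image_path is not None:
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        content.append(
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
        )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content

text_only_answer = ask(VIGNETTE)
with_image_answer = ask(VIGNETTE, image_path="chest_xray.png")  # placeholder path
```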
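The Results summary reports pairwise differences in diagnostic accuracy with Bonferroni-adjusted p-values. The sketch below shows one way such a comparison could be run, assuming a two-proportion z-test (the record does not state which test the authors used); the counts in `correct` and the number of comparisons used for the adjustment are illustrative placeholders, not the study's raw data.

```python
# Hypothetical sketch of a Bonferroni-adjusted pairwise accuracy comparison.
from statsmodels.stats.proportion import proportions_ztest

N_CASES = 120  # vignettes per condition, as described in the Methods

# Placeholder counts of correct diagnoses in the text-only condition.
correct = {"gpt-4o": 85, "physicians": 47}

comparisons = [("gpt-4o", "physicians")]
n_comparisons = 3  # assumed number of pairwise comparisons being adjusted for

for a, b in comparisons:
    stat, p = proportions_ztest(
        count=[correct[a], correct[b]],
        nobs=[N_CASES, N_CASES],
    )
    p_bonferroni = min(1.0, p * n_comparisons)  # Bonferroni adjustment
    print(f"{a} vs {b}: z = {stat:.2f}, adjusted p = {p_bonferroni:.4f}")
```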