Performance of vision language models for optic disc swelling identification on fundus photographs

Bibliographic Details
Main Authors: Kelvin Zhenghao Li, Tuyet Thao Nguyen, Heather E. Moss
Format: Article
Language: English
Published: Frontiers Media S.A., 2025-08-01
Series: Frontiers in Digital Health
Subjects: vision language model; disc swelling; papilledema; prompt engineering; artificial intelligence; machine learning
Online Access: https://www.frontiersin.org/articles/10.3389/fdgth.2025.1660887/full
collection DOAJ
description
Introduction: Vision language models (VLMs) combine image analysis capabilities with large language models (LLMs). Because of their multimodal capabilities, VLMs offer a clinical advantage over image classification models for the diagnosis of optic disc swelling by allowing consideration of clinical context. In this study, we compare the performance of non-specialty-trained VLMs with different prompts in the classification of optic disc swelling on fundus photographs.
Methods: A diagnostic test accuracy study was conducted using an open-source dataset. Five different prompts (increasing in context) were used with each of five different VLMs (Llama 3.2-vision, LLaVA-Med, LLaVA, GPT-4o, and DeepSeek-4V), resulting in 25 prompt-model pairs. The performance of the VLMs in classifying photographs with and without optic disc swelling was measured using Youden's index (YI), F1 score, and accuracy rate.
Results: A total of 779 images of normal optic discs and 295 images of swollen discs were obtained from an open-source image database. Among the 25 prompt-model pairs, valid response rates ranged from 7.8% to 100% (median 93.6%). Diagnostic performance ranged from 0.00 to 0.231 for YI (median 0.042), 0.00 to 0.716 for F1 score (median 0.401), and 27.5% to 70.5% for accuracy rate (median 58.8%). The best-performing prompt-model pair was GPT-4o with role-playing, Chain-of-Thought, and few-shot prompting. On average, Llama 3.2-vision performed best (average YI across prompts 0.181). There was no consistent relationship between the amount of information given in the prompt and model performance.
Conclusions: Non-specialty-trained VLMs could classify photographs of swollen and normal optic discs better than chance, with performance varying by model. Increasing prompt complexity did not consistently improve performance. Specialty-specific VLMs may be necessary to improve ophthalmic image analysis performance.
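The abstract scores each prompt-model pair with Youden's index (YI), F1 score, and accuracy rate. As a minimal sketch (not the study's code), these three metrics can be computed from the counts of a 2x2 confusion matrix:

```python
# Illustrative sketch: the three reported metrics computed from binary
# classification counts (tp/fp/tn/fn for "swollen" as the positive class).

def youden_index(tp, fp, tn, fn):
    """Youden's index J = sensitivity + specificity - 1 (chance level: 0)."""
    sensitivity = tp / (tp + fn)  # true-positive rate
    specificity = tn / (tn + fp)  # true-negative rate
    return sensitivity + specificity - 1

def f1_score(tp, fp, tn, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def accuracy(tp, fp, tn, fn):
    """Fraction of all images classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)
```

A YI of 0 corresponds to chance-level discrimination, which is why the reported best value of 0.231 indicates performance only modestly better than chance.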
institution Kabale University
issn 2673-253X
doi 10.3389/fdgth.2025.1660887
Author affiliations:
Kelvin Zhenghao Li: Department of Ophthalmology, Stanford University, Palo Alto, CA, United States; Department of Ophthalmology, Tan Tock Seng Hospital, Singapore, Singapore; Centre of AI in Medicine, Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
Tuyet Thao Nguyen: Department of Ophthalmology, Stanford University, Palo Alto, CA, United States; School of Medicine, University of California Davis, Sacramento, CA, United States
Heather E. Moss: Department of Ophthalmology, Stanford University, Palo Alto, CA, United States; Department of Neurology & Neurological Sciences, Stanford University, Palo Alto, CA, United States
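The best-performing pair combined role-playing, Chain-of-Thought, and few-shot prompting. The wording below and the `build_prompt` helper are hypothetical illustrations of how such a layered prompt can be assembled, not the study's actual prompts:

```python
# Hypothetical sketch of a role-play + Chain-of-Thought + few-shot prompt.
# Image names and phrasing are invented for illustration only.

FEW_SHOT_EXAMPLES = [
    ("example_swollen_disc.jpg", "swollen"),
    ("example_normal_disc.jpg", "normal"),
]

def build_prompt(examples):
    lines = [
        "You are an experienced neuro-ophthalmologist.",          # role-playing
        "Think step by step about the disc margins, vessel "
        "obscuration, and disc elevation before answering.",      # Chain-of-Thought
    ]
    for image_name, label in examples:                            # few-shot examples
        lines.append(f"Example image {image_name}: the optic disc is {label}.")
    lines.append("Now classify the attached fundus photograph as "
                 "'swollen' or 'normal'.")
    return "\n".join(lines)
```

Layering strategies this way is common in prompt engineering, though, as the Results note, more prompt context did not consistently improve performance in this study.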