Performance of vision language models for optic disc swelling identification on fundus photographs

Bibliographic Details
Main Authors: Kelvin Zhenghao Li, Tuyet Thao Nguyen, Heather E. Moss
Format: Article
Language: English
Published: Frontiers Media S.A., 2025-08-01
Series: Frontiers in Digital Health
Subjects: vision language model; disc swelling; papilledema; prompt engineering; artificial intelligence; machine learning
Online Access: https://www.frontiersin.org/articles/10.3389/fdgth.2025.1660887/full
collection DOAJ
description
Introduction: Vision language models (VLMs) combine image analysis capabilities with large language models (LLMs). Because of their multimodal capabilities, VLMs offer a clinical advantage over image classification models for the diagnosis of optic disc swelling by allowing consideration of clinical context. In this study, we compare the performance of non-specialty-trained VLMs with different prompts in the classification of optic disc swelling on fundus photographs.
Methods: A diagnostic test accuracy study was conducted using an open-source dataset. Five different prompts (increasing in context) were used with each of five different VLMs (Llama 3.2-vision, LLaVA-Med, LLaVA, GPT-4o, and DeepSeek-4V), resulting in 25 prompt-model pairs. The performance of the VLMs in classifying photographs with and without optic disc swelling was measured using Youden's index (YI), F1 score, and accuracy rate.
Results: A total of 779 images of normal optic discs and 295 images of swollen discs were obtained from an open-source image database. Among the 25 prompt-model pairs, valid response rates ranged from 7.8% to 100% (median 93.6%). Diagnostic performance ranged from 0.00 to 0.231 for YI (median 0.042), 0.00 to 0.716 for F1 score (median 0.401), and 27.5% to 70.5% for accuracy rate (median 58.8%). The best-performing prompt-model pair was GPT-4o with role-playing, Chain-of-Thought, and few-shot prompting. On average, Llama 3.2-vision performed best (average YI across prompts 0.181). There was no consistent relationship between the amount of information given in the prompt and model performance.
Conclusions: Non-specialty-trained VLMs could classify photographs of swollen and normal optic discs better than chance, with performance varying by model. Increasing prompt complexity did not consistently improve performance. Specialty-specific VLMs may be necessary to improve ophthalmic image analysis performance.
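The abstract scores each prompt-model pair with Youden's index (YI), F1 score, and accuracy rate. As a minimal sketch (not the study's code), these three metrics can be computed from the counts of a 2x2 confusion matrix:

```python
# Illustrative sketch: the three reported metrics computed from binary
# classification counts (tp/fp/tn/fn for "swollen" as the positive class).

def youden_index(tp, fp, tn, fn):
    """Youden's index J = sensitivity + specificity - 1 (chance level: 0)."""
    sensitivity = tp / (tp + fn)  # true-positive rate
    specificity = tn / (tn + fp)  # true-negative rate
    return sensitivity + specificity - 1

def f1_score(tp, fp, tn, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def accuracy(tp, fp, tn, fn):
    """Fraction of all images classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)
```

A YI of 0 corresponds to chance-level discrimination, which is why the reported best value of 0.231 indicates performance only modestly better than chance.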
institution Kabale University
issn 2673-253X
doi 10.3389/fdgth.2025.1660887
Author affiliations:
Kelvin Zhenghao Li: Department of Ophthalmology, Stanford University, Palo Alto, CA, United States; Department of Ophthalmology, Tan Tock Seng Hospital, Singapore, Singapore; Centre of AI in Medicine, Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
Tuyet Thao Nguyen: Department of Ophthalmology, Stanford University, Palo Alto, CA, United States; School of Medicine, University of California Davis, Sacramento, CA, United States
Heather E. Moss: Department of Ophthalmology, Stanford University, Palo Alto, CA, United States; Department of Neurology & Neurological Sciences, Stanford University, Palo Alto, CA, United States
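The best-performing pair combined role-playing, Chain-of-Thought, and few-shot prompting. The wording below and the `build_prompt` helper are hypothetical illustrations of how such a layered prompt can be assembled, not the study's actual prompts:

```python
# Hypothetical sketch of a role-play + Chain-of-Thought + few-shot prompt.
# Image names and phrasing are invented for illustration only.

FEW_SHOT_EXAMPLES = [
    ("example_swollen_disc.jpg", "swollen"),
    ("example_normal_disc.jpg", "normal"),
]

def build_prompt(examples):
    lines = [
        "You are an experienced neuro-ophthalmologist.",          # role-playing
        "Think step by step about the disc margins, vessel "
        "obscuration, and disc elevation before answering.",      # Chain-of-Thought
    ]
    for image_name, label in examples:                            # few-shot examples
        lines.append(f"Example image {image_name}: the optic disc is {label}.")
    lines.append("Now classify the attached fundus photograph as "
                 "'swollen' or 'normal'.")
    return "\n".join(lines)
```

Layering strategies this way is common in prompt engineering, though, as the Results note, more prompt context did not consistently improve performance in this study.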