Performance of vision language models for optic disc swelling identification on fundus photographs
Introduction: Vision language models (VLMs) combine image analysis capabilities with large language models (LLMs). Because of their multimodal capabilities, VLMs offer a clinical advantage over image classification models for the diagnosis of optic disc swelling by allowing a consideration of clinical...
| Main Authors: | Kelvin Zhenghao Li, Tuyet Thao Nguyen, Heather E. Moss |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Frontiers Media S.A., 2025-08-01 |
| Series: | Frontiers in Digital Health |
| Subjects: | vision language model; disc swelling; papilledema; prompt engineering; artificial intelligence; machine learning |
| Online Access: | https://www.frontiersin.org/articles/10.3389/fdgth.2025.1660887/full |
| author | Kelvin Zhenghao Li; Tuyet Thao Nguyen; Heather E. Moss |
|---|---|
| collection | DOAJ |
| description | Introduction: Vision language models (VLMs) combine image analysis capabilities with large language models (LLMs). Because of their multimodal capabilities, VLMs offer a clinical advantage over image classification models for the diagnosis of optic disc swelling by allowing consideration of clinical context. In this study, we compare the performance of non-specialty-trained VLMs with different prompts in the classification of optic disc swelling on fundus photographs. Methods: A diagnostic test accuracy study was conducted using an open-source dataset. Five different prompts (increasing in context) were used with each of five different VLMs (Llama 3.2-vision, LLaVA-Med, LLaVA, GPT-4o, and DeepSeek-4V), resulting in 25 prompt-model pairs. The performance of the VLMs in classifying photographs with and without optic disc swelling was measured using Youden's index (YI), F1 score, and accuracy rate. Results: A total of 779 images of normal optic discs and 295 images of swollen discs were obtained from an open-source image database. Among the 25 prompt-model pairs, valid response rates ranged from 7.8% to 100% (median 93.6%). Diagnostic performance ranged from 0.00 to 0.231 for YI (median 0.042), 0.00 to 0.716 for F1 score (median 0.401), and 27.5% to 70.5% for accuracy rate (median 58.8%). The best-performing prompt-model pair was GPT-4o with role-playing, Chain-of-Thought, and few-shot prompting. On average, Llama 3.2-vision performed the best (average YI across prompts 0.181). There was no consistent relationship between the amount of information given in the prompt and model performance. Conclusions: Non-specialty-trained VLMs could classify photographs of swollen and normal optic discs better than chance, with performance varying by model. Increasing prompt complexity did not consistently improve performance. Specialty-specific VLMs may be necessary to improve ophthalmic image analysis performance. |
| format | Article |
| id | doaj-art-047f3c5afcd0485ba40b3b6ffef1e6f5 |
| institution | Kabale University |
| issn | 2673-253X |
| language | English |
| publishDate | 2025-08-01 |
| publisher | Frontiers Media S.A. |
| record_format | Article |
| series | Frontiers in Digital Health |
| affiliations | Kelvin Zhenghao Li: Department of Ophthalmology, Stanford University, Palo Alto, CA, United States; Department of Ophthalmology, Tan Tock Seng Hospital, Singapore, Singapore; Centre of AI in Medicine, Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore. Tuyet Thao Nguyen: Department of Ophthalmology, Stanford University, Palo Alto, CA, United States; School of Medicine, University of California Davis, Sacramento, CA, United States. Heather E. Moss: Department of Ophthalmology, Stanford University, Palo Alto, CA, United States; Department of Neurology & Neurological Sciences, Stanford University, Palo Alto, CA, United States. |
| doi | 10.3389/fdgth.2025.1660887 |
| title | Performance of vision language models for optic disc swelling identification on fundus photographs |
| topic | vision language model; disc swelling; papilledema; prompt engineering; artificial intelligence; machine learning |
| url | https://www.frontiersin.org/articles/10.3389/fdgth.2025.1660887/full |
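The abstract reports diagnostic performance as Youden's index (sensitivity + specificity − 1), F1 score, and accuracy rate. As a minimal sketch of how these three metrics are derived from a binary confusion matrix — using invented counts for illustration only, not results from the study:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute Youden's index, F1 score, and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true-positive rate (recall)
    specificity = tn / (tn + fp)   # true-negative rate
    precision = tp / (tp + fp)
    youden = sensitivity + specificity - 1          # 0 = chance level, 1 = perfect
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return youden, f1, accuracy

# Hypothetical counts for a 1,074-image set (779 normal, 295 swollen):
yi, f1, acc = classification_metrics(tp=150, fp=200, tn=579, fn=145)
```

Youden's index is the relevant headline metric here because, unlike raw accuracy, it is not inflated by the class imbalance (779 normal vs. 295 swollen discs): a model that labels every image "normal" scores 72.5% accuracy but a Youden's index of 0.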