Performance of large language models on veterinary undergraduate multiple-choice examinations: a comparative evaluation

The integration of artificial intelligence, particularly large language models (LLMs), into veterinary education and practice presents promising opportunities, yet their performance in veterinary-specific contexts remains understudied. This research comparatively evaluated the performance of nine advanced LLMs (ChatGPT o1Pro, ChatGPT 4o, ChatGPT 4.5, Grok 3, Gemini 2, Copilot, DeepSeek R1, Qwen 2.5 Max, and Kimi 1.5) on 250 multiple-choice questions (MCQs) sourced from a veterinary undergraduate final qualifying examination. Questions spanned various species, clinical topics, and reasoning stages, and included both text-based and image-based formats. ChatGPT o1Pro and ChatGPT 4.5 achieved the highest overall performance, with correct response rates of 90.4% and 90.8%, respectively, demonstrating strong agreement with the gold standard across most categories, while Kimi 1.5 showed the lowest performance at 64.8%. Performance consistently declined with increasing question difficulty and was generally lower for image-based than for text-based questions. OpenAI models excelled in visual interpretation relative to results reported in previous studies. Disparities in performance were observed across specific clinical reasoning stages and veterinary subdomains, highlighting areas for targeted improvement. This study underscores the promising role of LLMs as supportive tools for quality assurance in veterinary assessment design and identifies key factors influencing their performance, including question difficulty, format, and domain-specific training data.
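
As a quick illustration of how the headline figures map onto the 250-question set, the sketch below back-calculates per-model correct-answer counts from the abstract's percentages and recomputes the rates. This is a minimal, hypothetical example: the counts are inferred from the reported rates (e.g., 226/250 = 90.4%), not taken from the study's data, and the model names are used only as dictionary labels.

```python
# Back-of-the-envelope check of the correct-response rates quoted in the
# abstract. Counts are inferred from the reported percentages and are
# NOT the authors' raw data.
TOTAL_QUESTIONS = 250

correct_counts = {
    "ChatGPT 4.5": 227,    # 227/250 = 90.8%
    "ChatGPT o1Pro": 226,  # 226/250 = 90.4%
    "Kimi 1.5": 162,       # 162/250 = 64.8%
}

# Print models from best to worst by correct-answer count.
for model, n_correct in sorted(correct_counts.items(),
                               key=lambda kv: kv[1], reverse=True):
    rate = 100 * n_correct / TOTAL_QUESTIONS
    print(f"{model}: {n_correct}/{TOTAL_QUESTIONS} correct -> {rate:.1f}%")
```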

Bibliographic Details
Main Authors: Santiago Alonso Sousa, Syed Saad Ul Hassan Bukhari, Paulo Vinicius Steagall, Paweł M. Bęczkowski, Antonio Giuliano, Kate J. Flay
Format: Article
Language: English
Published: Frontiers Media S.A., 2025-08-01
Series: Frontiers in Veterinary Science, Vol. 12
ISSN: 2297-1769
DOI: 10.3389/fvets.2025.1616566
Subjects: veterinary education; artificial intelligence; assessment; large language models; comparative analysis; quality assurance
Online Access:https://www.frontiersin.org/articles/10.3389/fvets.2025.1616566/full

Author affiliations:
Santiago Alonso Sousa, Syed Saad Ul Hassan Bukhari, Paweł M. Bęczkowski, Antonio Giuliano, and Kate J. Flay: Department of Veterinary Clinical Sciences, Jockey Club College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Kowloon, Hong Kong SAR, China
Paulo Vinicius Steagall: Department of Veterinary Clinical Sciences, Jockey Club College of Veterinary Medicine and Life Sciences, and Centre for Animal Health and Welfare, City University of Hong Kong, Kowloon, Hong Kong SAR, China