Performance of large language models on veterinary undergraduate multiple-choice examinations: a comparative evaluation
The integration of artificial intelligence, particularly large language models (LLMs), into veterinary education and practice presents promising opportunities, yet LLM performance in veterinary-specific contexts remains understudied. This study comparatively evaluated nine advanced LLMs (ChatGPT o1Pro, ChatGPT 4o, ChatGPT 4.5, Grok 3, Gemini 2, Copilot, DeepSeek R1, Qwen 2.5 Max, and Kimi 1.5) on 250 multiple-choice questions (MCQs) sourced from a veterinary undergraduate final qualifying examination. Questions spanned multiple species, clinical topics, and clinical reasoning stages, and included both text-based and image-based formats. ChatGPT o1Pro and ChatGPT 4.5 achieved the highest overall performance, with correct response rates of 90.4% and 90.8%, respectively, and strong agreement with the gold standard across most categories, while Kimi 1.5 performed lowest at 64.8%. Performance declined consistently as question difficulty increased and was generally lower for image-based than for text-based questions, although the OpenAI models handled visual interpretation markedly better than models evaluated in previous studies. Performance also varied across specific clinical reasoning stages and veterinary subdomains, highlighting areas for targeted improvement. These findings underscore the promise of LLMs as supportive tools for quality assurance in veterinary assessment design and identify key factors influencing their performance: question difficulty, question format, and domain-specific training data.
| Main Authors: | Santiago Alonso Sousa, Syed Saad Ul Hassan Bukhari, Paulo Vinicius Steagall, Paweł M. Bęczkowski, Antonio Giuliano, Kate J. Flay |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Frontiers Media S.A., 2025-08-01 |
| Series: | Frontiers in Veterinary Science |
| Subjects: | veterinary education; artificial intelligence; assessment; large language models; comparative analysis; quality assurance |
| Online Access: | https://www.frontiersin.org/articles/10.3389/fvets.2025.1616566/full |
| _version_ | 1846173873294278656 |
|---|---|
| author | Santiago Alonso Sousa; Syed Saad Ul Hassan Bukhari; Paulo Vinicius Steagall; Paweł M. Bęczkowski; Antonio Giuliano; Kate J. Flay |
| author_sort | Santiago Alonso Sousa |
| collection | DOAJ |
| description | The integration of artificial intelligence, particularly large language models (LLMs), into veterinary education and practice presents promising opportunities, yet LLM performance in veterinary-specific contexts remains understudied. This study comparatively evaluated nine advanced LLMs (ChatGPT o1Pro, ChatGPT 4o, ChatGPT 4.5, Grok 3, Gemini 2, Copilot, DeepSeek R1, Qwen 2.5 Max, and Kimi 1.5) on 250 multiple-choice questions (MCQs) sourced from a veterinary undergraduate final qualifying examination. Questions spanned multiple species, clinical topics, and clinical reasoning stages, and included both text-based and image-based formats. ChatGPT o1Pro and ChatGPT 4.5 achieved the highest overall performance, with correct response rates of 90.4% and 90.8%, respectively, and strong agreement with the gold standard across most categories, while Kimi 1.5 performed lowest at 64.8%. Performance declined consistently as question difficulty increased and was generally lower for image-based than for text-based questions, although the OpenAI models handled visual interpretation markedly better than models evaluated in previous studies. Performance also varied across specific clinical reasoning stages and veterinary subdomains, highlighting areas for targeted improvement. These findings underscore the promise of LLMs as supportive tools for quality assurance in veterinary assessment design and identify key factors influencing their performance: question difficulty, question format, and domain-specific training data. |
| format | Article |
| id | doaj-art-8f7dd4aca2214934b4d081946f1f45f0 |
| institution | Kabale University |
| issn | 2297-1769 |
| language | English |
| publishDate | 2025-08-01 |
| publisher | Frontiers Media S.A. |
| record_format | Article |
| series | Frontiers in Veterinary Science |
| spelling | doaj-art-8f7dd4aca2214934b4d081946f1f45f0; 2025-08-26T13:04:30Z; eng; Frontiers Media S.A.; Frontiers in Veterinary Science; 2297-1769; 2025-08-01; vol. 12; doi: 10.3389/fvets.2025.1616566; article 1616566. Affiliations: Santiago Alonso Sousa, Syed Saad Ul Hassan Bukhari, Paulo Vinicius Steagall, Paweł M. Bęczkowski, Antonio Giuliano, and Kate J. Flay: Department of Veterinary Clinical Sciences, Jockey Club College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Kowloon, Hong Kong SAR, China; Paulo Vinicius Steagall also: Centre for Animal Health and Welfare, City University of Hong Kong, Kowloon, Hong Kong SAR, China. |
| title | Performance of large language models on veterinary undergraduate multiple-choice examinations: a comparative evaluation |
| topic | veterinary education; artificial intelligence; assessment; large language models; comparative analysis; quality assurance |
| url | https://www.frontiersin.org/articles/10.3389/fvets.2025.1616566/full |