Evaluation of LLMs accuracy and consistency in the registered dietitian exam through prompt engineering and knowledge retrieval

Bibliographic Details
Main Authors: Iman Azimi, Mohan Qi, Li Wang, Amir M. Rahmani, Youlin Li
Format: Article
Language: English
Published: Nature Portfolio 2025-01-01
Series: Scientific Reports
Subjects:
Online Access: https://doi.org/10.1038/s41598-024-85003-w
author Iman Azimi
collection DOAJ
description Abstract Large language models (LLMs) are fundamentally transforming human-facing applications in the health and well-being domains: boosting patient engagement, accelerating clinical decision-making, and facilitating medical education. Although state-of-the-art LLMs have shown superior performance in several conversational applications, evaluations within nutrition and diet applications are still insufficient. In this paper, we propose to employ the Registered Dietitian (RD) exam to conduct a standard and comprehensive evaluation of state-of-the-art LLMs, GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, assessing both accuracy and consistency in nutrition queries. Our evaluation includes 1050 RD exam questions encompassing several nutrition topics and proficiency levels. In addition, for the first time, we examine the impact of Zero-Shot (ZS), Chain of Thought (CoT), Chain of Thought with Self Consistency (CoT-SC), and Retrieval Augmented Prompting (RAP) on both accuracy and consistency of the responses. Our findings revealed that while these LLMs obtained acceptable overall performance, their results varied considerably with different prompts and question domains. GPT-4o with CoT-SC prompting outperformed the other approaches, whereas Gemini 1.5 Pro with ZS recorded the highest consistency. For GPT-4o and Claude 3.5, CoT improved the accuracy, and CoT-SC improved both accuracy and consistency. RAP was particularly effective for GPT-4o to answer Expert level questions. Consequently, choosing the appropriate LLM and prompting technique, tailored to the proficiency level and specific domain, can mitigate errors and potential risks in diet and nutrition chatbots.
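The abstract reports that Chain of Thought with Self-Consistency (CoT-SC) improved both accuracy and consistency for GPT-4o and Claude 3.5. A minimal sketch of the CoT-SC idea follows: sample several independent reasoning chains for the same exam question and majority-vote the final answers, with the vote share doubling as a rough consistency score. The function and the stub LLM below are illustrative assumptions, not the paper's actual implementation or prompts.

```python
from collections import Counter

def cot_sc_answer(sample_fn, question, n_samples=5):
    """Chain of Thought with Self-Consistency (sketch):
    sample n_samples independent chain-of-thought completions and
    return the majority final answer plus its vote share."""
    answers = [sample_fn(question) for _ in range(n_samples)]
    choice, votes = Counter(answers).most_common(1)[0]
    consistency = votes / n_samples  # fraction agreeing with the modal answer
    return choice, consistency

# Deterministic stand-in for an LLM call that would return a
# multiple-choice letter after chain-of-thought reasoning.
_samples = iter(["A", "A", "B", "A", "C"])

def stub_llm(question):
    return next(_samples)

answer, consistency = cot_sc_answer(stub_llm, "Which nutrient ...?", n_samples=5)
# answer == "A", consistency == 0.6
```

In practice `sample_fn` would call the model at a nonzero temperature with a CoT prompt and parse the final letter choice; the paper's accuracy/consistency metrics over the 1050 RD exam questions would then aggregate these per-question results.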
format Article
id doaj-art-322fde9b4a304aeaa483a9f4323c932b
institution Kabale University
issn 2045-2322
language English
publishDate 2025-01-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling Nature Portfolio, Scientific Reports, ISSN 2045-2322, 2025-01-01, vol. 15, https://doi.org/10.1038/s41598-024-85003-w
author affiliations:
Iman Azimi: Department of Engineering, iHealth Labs
Mohan Qi: Department of Engineering, iHealth Labs
Li Wang: Department of Clinical Research, iHealth Labs
Amir M. Rahmani: School of Nursing and Department of Computer Science, University of California Irvine
Youlin Li: Department of Engineering, iHealth Labs
title Evaluation of LLMs accuracy and consistency in the registered dietitian exam through prompt engineering and knowledge retrieval
topic Large Language Models
Registered Dietitian
Nutrition
Prompt Engineering
Knowledge Retrieval
url https://doi.org/10.1038/s41598-024-85003-w