Evaluation of LLMs accuracy and consistency in the registered dietitian exam through prompt engineering and knowledge retrieval
Abstract: Large language models (LLMs) are fundamentally transforming human-facing applications in the health and well-being domains: boosting patient engagement, accelerating clinical decision-making, and facilitating medical education. Although state-of-the-art LLMs have shown superior performance...
Main Authors: | Iman Azimi, Mohan Qi, Li Wang, Amir M. Rahmani, Youlin Li |
---|---|
Format: | Article |
Language: | English |
Published: | Nature Portfolio, 2025-01-01 |
Series: | Scientific Reports |
Subjects: | Large Language Models; Registered Dietitian; Nutrition; Prompt Engineering; Knowledge Retrieval |
Online Access: | https://doi.org/10.1038/s41598-024-85003-w |
_version_ | 1841544649412444160 |
---|---|
author | Iman Azimi; Mohan Qi; Li Wang; Amir M. Rahmani; Youlin Li |
author_facet | Iman Azimi; Mohan Qi; Li Wang; Amir M. Rahmani; Youlin Li |
author_sort | Iman Azimi |
collection | DOAJ |
description | Abstract Large language models (LLMs) are fundamentally transforming human-facing applications in the health and well-being domains: boosting patient engagement, accelerating clinical decision-making, and facilitating medical education. Although state-of-the-art LLMs have shown superior performance in several conversational applications, evaluations within nutrition and diet applications are still insufficient. In this paper, we propose to employ the Registered Dietitian (RD) exam to conduct a standard and comprehensive evaluation of state-of-the-art LLMs, GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, assessing both accuracy and consistency in nutrition queries. Our evaluation includes 1050 RD exam questions encompassing several nutrition topics and proficiency levels. In addition, for the first time, we examine the impact of Zero-Shot (ZS), Chain of Thought (CoT), Chain of Thought with Self Consistency (CoT-SC), and Retrieval Augmented Prompting (RAP) on both accuracy and consistency of the responses. Our findings revealed that while these LLMs obtained acceptable overall performance, their results varied considerably with different prompts and question domains. GPT-4o with CoT-SC prompting outperformed the other approaches, whereas Gemini 1.5 Pro with ZS recorded the highest consistency. For GPT-4o and Claude 3.5, CoT improved the accuracy, and CoT-SC improved both accuracy and consistency. RAP was particularly effective for GPT-4o to answer Expert level questions. Consequently, choosing the appropriate LLM and prompting technique, tailored to the proficiency level and specific domain, can mitigate errors and potential risks in diet and nutrition chatbots. |
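Of the prompting strategies the description names, Chain of Thought with Self-Consistency (CoT-SC) is the one that drove GPT-4o's top accuracy. A minimal sketch of the idea, assuming a hypothetical `ask_llm` chat-completion callable (not an API from the paper), samples several reasoned answers and majority-votes:

```python
from collections import Counter

def cot_sc_answer(ask_llm, question, options, n_samples=5):
    """Chain of Thought with Self-Consistency: sample several
    chain-of-thought answers to a multiple-choice question and
    return the majority-vote final answer.

    `ask_llm` is a hypothetical callable (prompt -> option letter);
    in practice it would wrap a chat API with temperature > 0 so
    the sampled reasoning paths differ.
    """
    prompt = (
        f"{question}\nOptions: {', '.join(options)}\n"
        "Think step by step, then answer with a single option letter."
    )
    # Sample n reasoning paths; keep only the final answers.
    votes = [ask_llm(prompt) for _ in range(n_samples)]
    # Majority vote over the sampled final answers.
    return Counter(votes).most_common(1)[0][0]
```

Self-consistency trades n extra model calls for robustness: a single stray reasoning path is outvoted, which is consistent with the abstract's finding that CoT-SC improved both accuracy and answer consistency for GPT-4o and Claude 3.5.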
format | Article |
id | doaj-art-322fde9b4a304aeaa483a9f4323c932b |
institution | Kabale University |
issn | 2045-2322 |
language | English |
publishDate | 2025-01-01 |
publisher | Nature Portfolio |
record_format | Article |
series | Scientific Reports |
spelling | doaj-art-322fde9b4a304aeaa483a9f4323c932b 2025-01-12T12:23:18Z eng Nature Portfolio Scientific Reports 2045-2322 2025-01-01 vol. 15, iss. 1, pp. 1-15 10.1038/s41598-024-85003-w Evaluation of LLMs accuracy and consistency in the registered dietitian exam through prompt engineering and knowledge retrieval. Iman Azimi (Department of Engineering, iHealth Labs); Mohan Qi (Department of Engineering, iHealth Labs); Li Wang (Department of Clinical Research, iHealth Labs); Amir M. Rahmani (School of Nursing and Department of Computer Science, University of California Irvine); Youlin Li (Department of Engineering, iHealth Labs). Abstract as in the description field above. https://doi.org/10.1038/s41598-024-85003-w Large Language Models; Registered Dietitian; Nutrition; Prompt Engineering; Knowledge Retrieval |
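The record reports two metrics, accuracy and consistency over the 1050 RD exam questions. A minimal sketch of plausible definitions, assuming consistency means the exact-agreement rate across repeated runs (an assumption; the record does not give the paper's formula):

```python
def accuracy(answers, key):
    """Fraction of questions answered correctly in one run.

    `answers` and `key` are equal-length lists of option letters.
    """
    return sum(a == k for a, k in zip(answers, key)) / len(key)

def consistency(runs):
    """Fraction of questions that receive the identical answer in
    every repeated run. `runs` is a list of per-run answer lists
    (e.g. the same 1050 questions asked several times).
    """
    n_questions = len(runs[0])
    # A question is "stable" if all runs gave the same answer to it.
    stable = sum(
        1 for i in range(n_questions)
        if len({run[i] for run in runs}) == 1
    )
    return stable / n_questions
```

Under these definitions a model can be consistent without being accurate (it repeats the same wrong answer), which is why the abstract treats the two metrics separately, e.g. Gemini 1.5 Pro with Zero-Shot topping consistency while GPT-4o with CoT-SC topped accuracy.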
title | Evaluation of LLMs accuracy and consistency in the registered dietitian exam through prompt engineering and knowledge retrieval |
topic | Large Language Models; Registered Dietitian; Nutrition; Prompt Engineering; Knowledge Retrieval |
url | https://doi.org/10.1038/s41598-024-85003-w |