Evaluating Large Language Models in extracting cognitive exam dates and scores.

Ensuring the reliability of Large Language Models (LLMs) in clinical tasks is crucial. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests such as the MMSE and CDR. Our data consisted of 135,307 clinical notes (January 12th, 2010 to May 24th, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed by ChatGPT (GPT-4) and LlaMA-2, and 22 experts reviewed the responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. We used 20 notes for fine-tuning and training the reviewers. The remaining 722 were assigned to reviewers, with 309 notes each assigned to two reviewers simultaneously. Inter-rater agreement (Fleiss' Kappa), precision, recall, true/false negative rates, and accuracy were calculated. Our study follows TRIPOD reporting guidelines for model validation. For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), a true-negative rate of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR information extraction, precision was lower overall: ChatGPT (vs. LlaMA-2) achieved accuracy of 87.1% (vs. 74.5%), sensitivity of 84.3% (vs. 39.7%), a true-negative rate of 99.8% (vs. 98.4%), and precision of 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on double-reviewed notes. LlaMA-2 errors included 27 cases of total hallucination, 19 cases of reporting other scores instead of MMSE, 25 missed scores, and 23 cases of reporting only the wrong date. In comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases of reporting the wrong test instead of MMSE, and 19 cases of reporting a wrong date. In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy and outperformed LlaMA-2. LLMs could benefit dementia research and clinical care by identifying patients eligible for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations.
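
For reference, the accuracy, sensitivity, true-negative rate, and precision quoted above follow the standard confusion-matrix definitions. The short Python sketch below illustrates how such metrics are computed from reviewer-adjudicated counts; it is not the authors' code, and the counts shown are hypothetical placeholders rather than study data.

    # Minimal sketch, not the authors' code: standard confusion-matrix metrics
    # of the kind used to compare the two models. Counts are hypothetical.
    def extraction_metrics(tp, fp, tn, fn):
        """Return accuracy, sensitivity (recall), true-negative rate, precision."""
        accuracy = (tp + tn) / (tp + fp + tn + fn)
        sensitivity = tp / (tp + fn)           # recall
        true_negative_rate = tn / (tn + fp)    # specificity
        precision = tp / (tp + fp)
        return accuracy, sensitivity, true_negative_rate, precision

    acc, sens, tnr, prec = extraction_metrics(tp=90, fp=19, tn=48, fn=10)
    print(f"accuracy={acc:.1%}  sensitivity={sens:.1%}  TNR={tnr:.1%}  precision={prec:.1%}")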

Bibliographic Details
Main Authors: Hao Zhang, Neil Jethani, Simon Jones, Nicholas Genes, Vincent J Major, Ian S Jaffe, Anthony B Cardillo, Noah Heilenbach, Nadia Fazal Ali, Luke J Bonanni, Andrew J Clayburn, Zain Khera, Erica C Sadler, Jaideep Prasad, Jamie Schlacter, Kevin Liu, Benjamin Silva, Sophie Montgomery, Eric J Kim, Jacob Lester, Theodore M Hill, Alba Avoricani, Ethan Chervonski, James Davydov, William Small, Eesha Chakravartty, Himanshu Grover, John A Dodson, Abraham A Brody, Yindalon Aphinyanaphongs, Arjun Masurkar, Narges Razavian
Format: Article
Language: English
Published: Public Library of Science (PLoS), 2024-12-01
Series: PLOS Digital Health
ISSN: 2767-3170
Online Access: https://doi.org/10.1371/journal.pdig.0000685