Evaluating Large Language Models in extracting cognitive exam dates and scores.
| Main Authors: | Hao Zhang, Neil Jethani, Simon Jones, Nicholas Genes, Vincent J Major, Ian S Jaffe, Anthony B Cardillo, Noah Heilenbach, Nadia Fazal Ali, Luke J Bonanni, Andrew J Clayburn, Zain Khera, Erica C Sadler, Jaideep Prasad, Jamie Schlacter, Kevin Liu, Benjamin Silva, Sophie Montgomery, Eric J Kim, Jacob Lester, Theodore M Hill, Alba Avoricani, Ethan Chervonski, James Davydov, William Small, Eesha Chakravartty, Himanshu Grover, John A Dodson, Abraham A Brody, Yindalon Aphinyanaphongs, Arjun Masurkar, Narges Razavian |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Public Library of Science (PLoS), 2024-12-01 |
| Series: | PLOS Digital Health |
| Online Access: | https://doi.org/10.1371/journal.pdig.0000685 |
| author | Hao Zhang, Neil Jethani, Simon Jones, Nicholas Genes, Vincent J Major, Ian S Jaffe, Anthony B Cardillo, Noah Heilenbach, Nadia Fazal Ali, Luke J Bonanni, Andrew J Clayburn, Zain Khera, Erica C Sadler, Jaideep Prasad, Jamie Schlacter, Kevin Liu, Benjamin Silva, Sophie Montgomery, Eric J Kim, Jacob Lester, Theodore M Hill, Alba Avoricani, Ethan Chervonski, James Davydov, William Small, Eesha Chakravartty, Himanshu Grover, John A Dodson, Abraham A Brody, Yindalon Aphinyanaphongs, Arjun Masurkar, Narges Razavian |
|---|---|
| collection | DOAJ |
| description | Ensuring the reliability of Large Language Models (LLMs) in clinical tasks is crucial. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests such as the MMSE and CDR. Our data consisted of 135,307 clinical notes (Jan 12th, 2010 to May 24th, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed by ChatGPT (GPT-4) and LlaMA-2, and 22 experts reviewed the responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. We used 20 notes for fine-tuning and for training the reviewers. The remaining 722 notes were assigned to reviewers, 309 of which were each assigned to two reviewers simultaneously. Inter-rater agreement (Fleiss' Kappa), precision, recall, true/false negative rates, and accuracy were calculated. Our study follows TRIPOD reporting guidelines for model validation. For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), true-negative rates of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR information extraction, precision was lower for both models: ChatGPT (vs. LlaMA-2) achieved accuracy of 87.1% (vs. 74.5%), sensitivity of 84.3% (vs. 39.7%), true-negative rates of 99.8% (vs. 98.4%), and precision of 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on double-reviewed notes. LlaMA-2's errors included 27 cases of total hallucination, 19 cases of reporting other scores instead of the MMSE, 25 missed scores, and 23 cases of reporting only a wrong date. In comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases of reporting the wrong test instead of the MMSE, and 19 cases of reporting a wrong date. In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy and better performance than LlaMA-2. The use of LLMs could benefit dementia research and clinical care by identifying patients eligible for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations. |
| format | Article |
| id | doaj-art-a75d72ec94b64755a5f737a01e709bd9 |
| institution | Kabale University |
| issn | 2767-3170 |
| language | English |
| publishDate | 2024-12-01 |
| publisher | Public Library of Science (PLoS) |
| record_format | Article |
| series | PLOS Digital Health |
| title | Evaluating Large Language Models in extracting cognitive exam dates and scores. |
| url | https://doi.org/10.1371/journal.pdig.0000685 |
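
As a rough illustration of the per-note evaluation metrics named in the abstract (accuracy, sensitivity, true-negative rate, and precision), the sketch below computes them from a confusion matrix of reviewer-adjudicated extractions. This is a minimal sketch, not the authors' code; the counts are hypothetical placeholders, and inter-rater agreement (Fleiss' Kappa) would be computed separately from the reviewers' label matrix.

```python
# Minimal sketch of the confusion-matrix metrics reported in the abstract.
# The counts below are placeholders for illustration, not the study's data.

from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # model-extracted score/date confirmed correct by reviewers
    fp: int  # model reported a score/date the note does not support
    tn: int  # model correctly reported that no score/date is present
    fn: int  # model missed a score/date that reviewers found in the note

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.tn + self.fn)

    def sensitivity(self) -> float:  # recall / true-positive rate
        return self.tp / (self.tp + self.fn)

    def true_negative_rate(self) -> float:  # specificity
        return self.tn / (self.tn + self.fp)

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)


if __name__ == "__main__":
    # Hypothetical counts for a single test (e.g., MMSE extraction).
    c = Confusion(tp=180, fp=38, tn=96, fn=21)
    print(f"accuracy={c.accuracy():.3f}  sensitivity={c.sensitivity():.3f}  "
          f"TNR={c.true_negative_rate():.3f}  precision={c.precision():.3f}")
```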