Evaluating a large language model’s ability to answer clinicians’ requests for evidence summaries

Objective: This study investigated the performance of a generative artificial intelligence (AI) tool using GPT-4 in answering clinical questions in comparison with medical librarians’ gold-standard evidence syntheses. Methods: Questions were extracted from an in-house database of clinical evidence requests previously answered by medical librarians. Questions with multiple parts were subdivided into individual topics. A standardized prompt was developed using the COSTAR framework. Librarians submitted each question into aiChat, an internally managed chat tool using GPT-4, and recorded the responses. The summaries generated by aiChat were evaluated on whether they contained the critical elements used in the established gold-standard summary of the librarian. A subset of questions was randomly selected for verification of references provided by aiChat. Results: Of the 216 evaluated questions, aiChat’s response was assessed as “correct” for 180 (83.3%) questions, “partially correct” for 35 (16.2%) questions, and “incorrect” for 1 (0.5%) question. No significant differences were observed in question ratings by question category (p=0.73). For a subset of 30% (n=66) of questions, 162 references were provided in the aiChat summaries, and 60 (37%) were confirmed as nonfabricated. Conclusions: Overall, the performance of a generative AI tool was promising. However, many included references could not be independently verified, and attempts were not made to assess whether any additional concepts introduced by aiChat were factually accurate. Thus, we envision this being the first of a series of investigations designed to further our understanding of how current and future versions of generative AI can be used and integrated into medical librarians’ workflow.

Bibliographic Details
Main Authors: Mallory N. Blasingame, Taneya Y. Koonce, Annette M. Williams, Dario A. Giuse, Jing Su, Poppy A. Krump, Nunzia Bettinsoli Giuse
Format: Article
Language: English
Published: University Library System, University of Pittsburgh, 2025-01-01
Series: Journal of the Medical Library Association
ISSN: 1536-5050, 1558-9439
DOI: 10.5195/jmla.2025.1985
Subjects: Large Language Models; LLMs; Generative AI; Artificial Intelligence; Evidence Synthesis; Library Science
Online Access: http://jmla.pitt.edu/ojs/jmla/article/view/1985

Author Affiliations:
Mallory N. Blasingame, Taneya Y. Koonce, Annette M. Williams, Jing Su, Poppy A. Krump: Center for Knowledge Management, Vanderbilt University Medical Center, Nashville, TN, United States
Dario A. Giuse: Department of Biomedical Informatics, Vanderbilt University School of Medicine, Vanderbilt University Medical Center, Nashville, TN, United States
Nunzia Bettinsoli Giuse: Professor of Biomedical Informatics and Professor of Medicine; Vice President for Knowledge Management; and Director, Center for Knowledge Management, Vanderbilt University Medical Center, Nashville, TN