Accelerating clinical evidence synthesis with large language models
Abstract Clinical evidence synthesis largely relies on systematic reviews (SRs) of clinical studies from the medical literature. Here, we propose a generative artificial intelligence (AI) pipeline named TrialMind to streamline the study search, study screening, and data extraction tasks in SRs. We chose published SRs to build TrialReviewBench, which contains 100 SRs and 2,220 clinical studies. For study search, TrialMind achieves high recall rates (0.711–0.834 vs. a human baseline of 0.138–0.232). For study screening, it outperforms previous document ranking methods by a 1.5–2.6-fold margin. For data extraction, it outperforms GPT-4's accuracy by 16–32%. In a pilot study, human-AI collaboration with TrialMind improved recall by 71.4% and reduced screening time by 44.2%; in data extraction, accuracy increased by 23.5% with a 63.4% time reduction. Medical experts preferred TrialMind's synthesized evidence over GPT-4's in 62.5–100% of cases. These findings show the promise of accelerating clinical evidence synthesis through human-AI collaboration.
| Main Authors: | Zifeng Wang, Lang Cao, Benjamin Danek, Qiao Jin, Zhiyong Lu, Jimeng Sun |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-08-01 |
| Series: | npj Digital Medicine |
| Online Access: | https://doi.org/10.1038/s41746-025-01840-7 |
| Field | Value |
|---|---|
| author | Zifeng Wang; Lang Cao; Benjamin Danek; Qiao Jin; Zhiyong Lu; Jimeng Sun |
| author_sort | Zifeng Wang |
| collection | DOAJ |
| description | Abstract Clinical evidence synthesis largely relies on systematic reviews (SRs) of clinical studies from the medical literature. Here, we propose a generative artificial intelligence (AI) pipeline named TrialMind to streamline the study search, study screening, and data extraction tasks in SRs. We chose published SRs to build TrialReviewBench, which contains 100 SRs and 2,220 clinical studies. For study search, TrialMind achieves high recall rates (0.711–0.834 vs. a human baseline of 0.138–0.232). For study screening, it outperforms previous document ranking methods by a 1.5–2.6-fold margin. For data extraction, it outperforms GPT-4's accuracy by 16–32%. In a pilot study, human-AI collaboration with TrialMind improved recall by 71.4% and reduced screening time by 44.2%; in data extraction, accuracy increased by 23.5% with a 63.4% time reduction. Medical experts preferred TrialMind's synthesized evidence over GPT-4's in 62.5–100% of cases. These findings show the promise of accelerating clinical evidence synthesis through human-AI collaboration. |
| format | Article |
| id | doaj-art-f1f5e5546e9f4484b4f1b46c6b959eb0 |
| institution | DOAJ |
| issn | 2398-6352 |
| language | English |
| publishDate | 2025-08-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | npj Digital Medicine |
| spelling | npj Digital Medicine, Nature Portfolio, ISSN 2398-6352, 2025-08-01, vol. 8, iss. 1, pp. 1–14, doi:10.1038/s41746-025-01840-7. Accelerating clinical evidence synthesis with large language models. Zifeng Wang, Lang Cao, Benjamin Danek, Jimeng Sun (Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign); Qiao Jin, Zhiyong Lu (Division of Intramural Research, National Library of Medicine, National Institutes of Health). https://doi.org/10.1038/s41746-025-01840-7 |
| title | Accelerating clinical evidence synthesis with large language models |
| url | https://doi.org/10.1038/s41746-025-01840-7 |