Improving automated deep phenotyping through large language models using retrieval-augmented generation

Bibliographic Details
Main Authors:	Brandon T. Garcia, Lauren Westerfield, Priya Yelemali, Nikhita Gogate, E. Andres Rivera-Munoz, Haowei Du, Moez Dawood, Angad Jolly, James R. Lupski, Jennifer E. Posey
Format:	Article
Language:	English
Published:	BMC 2025-08-01
Series:	Genome Medicine
Subjects:	Large language models (LLMs); Retrieval-augmented generation (RAG); Phenotyping; Human Phenotype Ontology (HPO); Natural language processing (NLP); Clinical genomics
Online Access:	https://doi.org/10.1186/s13073-025-01521-w

Description
Summary:	Background: Diagnosing rare genetic disorders relies on precise phenotypic and genotypic analysis, with the Human Phenotype Ontology (HPO) providing a standardized language for capturing clinical phenotypes. Rule-based HPO extraction tools use concept recognition to identify phenotypes automatically, but they often assign phenotypes incompletely and require significant manual review. While large language models (LLMs) hold promise for more context-driven phenotype extraction, they are prone to errors and “hallucinations,” making them less reliable without further refinement. We present RAG-HPO, a Python-based tool that uses retrieval-augmented generation (RAG) to improve the accuracy of HPO term assignment by LLMs. This approach bypasses the limitations of baseline models and eliminates the need for time- and resource-intensive fine-tuning. RAG-HPO integrates a dynamic vector database, containing > 54,000 phenotypic phrases mapped to HPO IDs, which allows real-time retrieval and contextual matching. The RAG-HPO workflow begins by extracting phenotypic phrases from clinical text via an LLM and then matching them by semantic similarity to entries in the database. The best term matches are returned to the LLM as context for the final HPO term assignment of each phrase.
Results: Performance was benchmarked on 112 published case reports with 1792 manually assigned HPO terms and compared to Doc2HPO, ClinPhen, and FastHPOCR. In evaluations, RAG-HPO + LLaMa-3.1 70B achieved a mean precision of 0.81, recall of 0.76, and an F1 score of 0.78, significantly surpassing conventional tools (p < 0.00001). RAG-HPO returned 1648 terms, of which 19.1% (315) were false positives that did not exactly match our manually annotated standard. Among these, < 1% (1/315) represented hallucinations, and 1.3% (4/315) represented terms with no ontological relationship to the desired target; the remaining false positives (95.2%, 300/315) were broader ancestor terms of the target term, which may still be relevant to users in many contexts.
Conclusions: RAG-HPO is a user-friendly, adaptable tool designed for secure evaluation of clinical text, and it outperforms standard HPO-matching tools in precision, recall, and F1. Its enhanced precision and recall represent a substantial advancement in phenotypic analysis, accelerating the identification of genetic mechanisms underlying rare diseases and driving progress in genetic research and clinical genomics. RAG-HPO is available at https://github.com/PoseyPod/RAG-HPO .
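The retrieval step described in the summary (match an extracted phrase to database entries by semantic similarity, then hand the best candidates back to the LLM as context) can be illustrated with a minimal sketch. This is not the RAG-HPO implementation. The six-entry phrase table, the character-trigram `embed` function, and the helper names `retrieve` and `build_context` are all assumptions made for illustration; a real system would use a learned embedding model and a vector database, and would send the resulting context to an LLM.

```python
import math
from collections import Counter

# Toy stand-in for the phrase database (the summary describes > 54,000 phrases).
# The phrase-to-ID pairs below are a small illustrative sample.
PHRASE_DB = [
    ("seizure", "HP:0001250"),
    ("epileptic seizure", "HP:0001250"),
    ("microcephaly", "HP:0000252"),
    ("small head circumference", "HP:0000252"),
    ("global developmental delay", "HP:0001263"),
    ("intellectual disability", "HP:0001249"),
]


def embed(text, n=3):
    """Hypothetical embedding: character trigram counts, not a learned model."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(phrase, k=3):
    """Return the k database entries most similar to the extracted phrase."""
    query = embed(phrase)
    scored = [(cosine(query, embed(entry)), entry, hpo_id)
              for entry, hpo_id in PHRASE_DB]
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:k]


def build_context(phrase, k=3):
    """Format the candidates as context for the final term-assignment prompt."""
    lines = [f"Phrase: {phrase}", "Candidate HPO terms:"]
    lines += [f"- {entry} ({hpo_id})" for _, entry, hpo_id in retrieve(phrase, k)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_context("seizures"))
```

Restricting the final assignment to retrieved candidates is what limits invented identifiers in this design: the LLM chooses among terms that exist in the database rather than recalling IDs from memory.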
ISSN:	1756-994X

Similar Items