Improving automated deep phenotyping through large language models using retrieval-augmented generation
Abstract Background Diagnosing rare genetic disorders relies on precise phenotypic and genotypic analysis, with the Human Phenotype Ontology (HPO) providing a standardized language for capturing clinical phenotypes. Rule-based HPO extraction tools use concept recognition to automatically identify ph...
Saved in:
| Main Authors: | , , , , , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: |
BMC
2025-08-01
|
| Series: | Genome Medicine |
| Subjects: | |
| Online Access: | https://doi.org/10.1186/s13073-025-01521-w |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| Summary: | Abstract Background Diagnosing rare genetic disorders relies on precise phenotypic and genotypic analysis, with the Human Phenotype Ontology (HPO) providing a standardized language for capturing clinical phenotypes. Rule-based HPO extraction tools use concept recognition to automatically identify phenotypes, but they often struggle with incomplete phenotype assignment, requiring significant manual review. While large language models (LLMs) hold promise for more context-driven phenotype extraction, they are prone to errors and “hallucinations,” making them less reliable without further refinement. We present RAG-HPO, a Python-based tool that leverages retrieval-augmented generation (RAG) to elevate accuracy of HPO term assignment by LLM. This approach bypasses the limitations of baseline models and eliminates the need for time- and resource-intensive fine-tuning. RAG-HPO integrates a dynamic vector database, containing > 54,000 phenotypic phrases mapped to HPO IDs, which allows real-time retrieval and contextual matching. The RAG-HPO workflow begins by extracting phenotypic phrases from clinical text via an LLM and then matching them via semantic similarity to entries within the database. The best term matches are returned to the LLM as context for final HPO term assignment of each phrase. Results Performance was benchmarked on 112 published case reports with 1792 manually assigned HPO terms and compared to Doc2HPO, ClinPhen, and FastHPOCR. In evaluations, RAG-HPO + LLaMa-3.1 70B achieved a mean precision of 0.81, recall of 0.76, and an F1 score of 0.78—significantly surpassing conventional tools (p < 0.00001). RAG-HPO returned 1648 terms, of which 19.1% (315) were false positives that did not exactly match our manually annotated standard. Among these, < 1% (1/315) represented hallucinations, and 1.3% (4/315) represented terms with no ontological relationship to the desired target; the remaining false positives (95.2%, 300/315) were broader ancestor terms of the target term, which may still be relevant to users in many contexts. Conclusions RAG-HPO is a user-friendly, adaptable tool designed for secure evaluation of clinical text and outperforms standard HPO-matching tools in precision, recall, and F1. Its enhanced precision and recall represent a substantial advancement in phenotypic analysis, accelerating the identification of genetic mechanisms underlying rare diseases and driving progress in genetic research and clinical genomics. RAG-HPO is available at https://github.com/PoseyPod/RAG-HPO . |
|---|---|
| ISSN: | 1756-994X |