Annotated corpus for traditional formula-disease relationships in biomedical articles

Bibliographic Details
Main Authors: Sangjun Yea, Ho Jang, Soyoung Kim, Sanghun Lee, Jaeuk U. Kim
Format: Article
Language: English
Published: Nature Portfolio 2025-01-01
Series: Scientific Data
Online Access: https://doi.org/10.1038/s41597-025-04377-2
author Sangjun Yea
Ho Jang
Soyoung Kim
Sanghun Lee
Jaeuk U. Kim
collection DOAJ
description Abstract The Traditional Formula (TF), a combination of herbs prepared in accordance with traditional medicine principles, is increasingly garnering global attention as an alternative to modern medicine. Specifically, there is growing interest in exploring TF’s therapeutic effects across various diseases. A significant portion of the state-of-the-art knowledge regarding the relationship between TF and disease is found in scientific publications, where manual knowledge extraction is impractical. Thus, Natural Language Processing (NLP) is being employed to efficiently and accurately search and extract crucial knowledge from unstructured literature. However, the absence of a high-quality manually annotated corpus focusing on TF-disease relationships hampers the use of NLP in the fields of traditional medicine and modern biomedical science. This article introduces the Traditional Formula-Disease Relationship (TFDR) corpus, a manually annotated corpus designed to facilitate the automatic extraction of TF-disease relationships from biomedical literature. The TFDR corpus includes information gleaned from 740 PubMed abstracts, encompassing a total of 6,211 TF mentions, 7,166 disease mentions, and 1,109 relationships between them encapsulated within 744 key-sentences.
format Article
id doaj-art-2214fd8bb09e416fbd3ac04eabc5b7f7
institution Kabale University
issn 2052-4463
language English
publishDate 2025-01-01
publisher Nature Portfolio
record_format Article
series Scientific Data
spelling doaj-art-2214fd8bb09e416fbd3ac04eabc5b7f7
2025-01-12T12:07:31Z
eng
Nature Portfolio
Scientific Data
2052-4463
2025-01-01
vol. 12, no. 1, pp. 1-11
10.1038/s41597-025-04377-2
Annotated corpus for traditional formula-disease relationships in biomedical articles
Sangjun Yea (Korean medicine data division, Korea Institute of Oriental Medicine)
Ho Jang (Korean medicine data division, Korea Institute of Oriental Medicine)
Soyoung Kim (Korean medicine data division, Korea Institute of Oriental Medicine)
Sanghun Lee (Korean medicine data division, Korea Institute of Oriental Medicine)
Jaeuk U. Kim (Korean convergence medical science, University of Science and Technology)
https://doi.org/10.1038/s41597-025-04377-2
title Annotated corpus for traditional formula-disease relationships in biomedical articles
url https://doi.org/10.1038/s41597-025-04377-2