Hybrid natural language processing tool for semantic annotation of medical texts in Spanish
Abstract. Background: Natural language processing (NLP) enables the extraction of information embedded within unstructured texts, such as clinical case reports and trial eligibility criteria. By identifying relevant medical concepts, NLP facilitates the generation of structured and actionable data, su...
Main Authors: | Leonardo Campillos-Llanos, Ana Valverde-Mateos, Adrián Capllonch-Carrión |
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2025-01-01 |
Series: | BMC Bioinformatics |
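Subjects: | Medical natural language processing; Medical text mining tool; Named entity recognition; Deep learning in healthcare; Clinical trials; Spanish medical NLP |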
Online Access: | https://doi.org/10.1186/s12859-024-05949-6 |
_version_ | 1841544225254014976 |
---|---|
author | Leonardo Campillos-Llanos Ana Valverde-Mateos Adrián Capllonch-Carrión |
author_facet | Leonardo Campillos-Llanos Ana Valverde-Mateos Adrián Capllonch-Carrión |
author_sort | Leonardo Campillos-Llanos |
collection | DOAJ |
description | Abstract. Background: Natural language processing (NLP) enables the extraction of information embedded within unstructured texts, such as clinical case reports and trial eligibility criteria. By identifying relevant medical concepts, NLP facilitates the generation of structured and actionable data, supporting complex tasks like cohort identification and the analysis of clinical records. To accomplish those tasks, we introduce a deep learning-based and lexicon-based named entity recognition (NER) tool for texts in Spanish. It performs medical NER and normalization, medication information extraction and detection of temporal entities, negation and speculation, and temporality or experiencer attributes (Age, Contraindicated, Negated, Speculated, Hypothetical, Future, Family_member, Patient and Other). We built the tool with a dedicated lexicon and rules adapted from NegEx and HeidelTime. Using these resources, we annotated a corpus of 1200 texts, with high inter-annotator agreement (average F1 = 0.841 ± 0.045 for entities, and average F1 = 0.881 ± 0.032 for attributes). We used this corpus to train Transformer-based models (RoBERTa-based models, mBERT and mDeBERTa). We integrated them with the dictionary-based system in a hybrid tool, and distribute the models via the Hugging Face hub. For an internal validation, we used a held-out test set and conducted an error analysis. For an external validation, eight medical professionals evaluated the system by revising the annotation of 200 new texts not used in development. Results: In the internal validation, the models yielded F1 values up to 0.915. In the external validation with 100 clinical trials, the tool achieved an average F1 score of 0.858 (± 0.032); and in 100 anonymized clinical cases, it achieved an average F1 score of 0.910 (± 0.019). Conclusions: The tool is available at https://claramed.csic.es/medspaner. We also release the code (https://github.com/lcampillos/medspaner) and the annotated corpus to train the models. |
format | Article |
id | doaj-art-05063cd79bde4b24837ed3fc0f2b6a5a |
institution | Kabale University |
issn | 1471-2105 |
language | English |
publishDate | 2025-01-01 |
publisher | BMC |
record_format | Article |
series | BMC Bioinformatics |
spelling | doaj-art-05063cd79bde4b24837ed3fc0f2b6a5a; 2025-01-12T12:41:52Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2025-01-01; 261139; 10.1186/s12859-024-05949-6; Hybrid natural language processing tool for semantic annotation of medical texts in Spanish; Leonardo Campillos-Llanos [0]; Ana Valverde-Mateos [1]; Adrián Capllonch-Carrión [2]; [0] ILLA - CSIC (Spanish National Research Council); [1] Medical Terminology Unit, Spanish Royal Academy of Medicine; [2] Centro de Salud Retiro, Hospital Universitario Gregorio Marañón; Abstract. Background: Natural language processing (NLP) enables the extraction of information embedded within unstructured texts, such as clinical case reports and trial eligibility criteria. By identifying relevant medical concepts, NLP facilitates the generation of structured and actionable data, supporting complex tasks like cohort identification and the analysis of clinical records. To accomplish those tasks, we introduce a deep learning-based and lexicon-based named entity recognition (NER) tool for texts in Spanish. It performs medical NER and normalization, medication information extraction and detection of temporal entities, negation and speculation, and temporality or experiencer attributes (Age, Contraindicated, Negated, Speculated, Hypothetical, Future, Family_member, Patient and Other). We built the tool with a dedicated lexicon and rules adapted from NegEx and HeidelTime. Using these resources, we annotated a corpus of 1200 texts, with high inter-annotator agreement (average F1 = 0.841 ± 0.045 for entities, and average F1 = 0.881 ± 0.032 for attributes). We used this corpus to train Transformer-based models (RoBERTa-based models, mBERT and mDeBERTa). We integrated them with the dictionary-based system in a hybrid tool, and distribute the models via the Hugging Face hub. For an internal validation, we used a held-out test set and conducted an error analysis. For an external validation, eight medical professionals evaluated the system by revising the annotation of 200 new texts not used in development. Results: In the internal validation, the models yielded F1 values up to 0.915. In the external validation with 100 clinical trials, the tool achieved an average F1 score of 0.858 (± 0.032); and in 100 anonymized clinical cases, it achieved an average F1 score of 0.910 (± 0.019). Conclusions: The tool is available at https://claramed.csic.es/medspaner. We also release the code (https://github.com/lcampillos/medspaner) and the annotated corpus to train the models.; https://doi.org/10.1186/s12859-024-05949-6; Medical natural language processing; Medical text mining tool; Named entity recognition; Deep learning in healthcare; Clinical trials; Spanish medical NLP |
spellingShingle | Leonardo Campillos-Llanos Ana Valverde-Mateos Adrián Capllonch-Carrión Hybrid natural language processing tool for semantic annotation of medical texts in Spanish BMC Bioinformatics Medical natural language processing Medical text mining tool Named entity recognition Deep learning in healthcare Clinical trials Spanish medical NLP |
title | Hybrid natural language processing tool for semantic annotation of medical texts in Spanish |
title_full | Hybrid natural language processing tool for semantic annotation of medical texts in Spanish |
title_fullStr | Hybrid natural language processing tool for semantic annotation of medical texts in Spanish |
title_full_unstemmed | Hybrid natural language processing tool for semantic annotation of medical texts in Spanish |
title_short | Hybrid natural language processing tool for semantic annotation of medical texts in Spanish |
title_sort | hybrid natural language processing tool for semantic annotation of medical texts in spanish |
topic | Medical natural language processing Medical text mining tool Named entity recognition Deep learning in healthcare Clinical trials Spanish medical NLP |
url | https://doi.org/10.1186/s12859-024-05949-6 |
work_keys_str_mv | AT leonardocampillosllanos hybridnaturallanguageprocessingtoolforsemanticannotationofmedicaltextsinspanish AT anavalverdemateos hybridnaturallanguageprocessingtoolforsemanticannotationofmedicaltextsinspanish AT adriancapllonchcarrion hybridnaturallanguageprocessingtoolforsemanticannotationofmedicaltextsinspanish |