Hybrid natural language processing tool for semantic annotation of medical texts in Spanish

Bibliographic Details
Main Authors: Leonardo Campillos-Llanos, Ana Valverde-Mateos, Adrián Capllonch-Carrión
Format: Article
Language: English
Published: BMC 2025-01-01
Series: BMC Bioinformatics
Online Access: https://doi.org/10.1186/s12859-024-05949-6
collection DOAJ
description Abstract
Background: Natural language processing (NLP) enables the extraction of information embedded within unstructured texts, such as clinical case reports and trial eligibility criteria. By identifying relevant medical concepts, NLP facilitates the generation of structured and actionable data, supporting complex tasks like cohort identification and the analysis of clinical records. To accomplish those tasks, we introduce a deep learning-based and lexicon-based named entity recognition (NER) tool for texts in Spanish. It performs medical NER and normalization, medication information extraction and detection of temporal entities, negation and speculation, and temporality or experiencer attributes (Age, Contraindicated, Negated, Speculated, Hypothetical, Future, Family_member, Patient and Other). We built the tool with a dedicated lexicon and rules adapted from NegEx and HeidelTime. Using these resources, we annotated a corpus of 1200 texts, with high inter-annotator agreement (average F1 = 0.841 ± 0.045 for entities, and average F1 = 0.881 ± 0.032 for attributes). We used this corpus to train Transformer-based models (RoBERTa-based models, mBERT and mDeBERTa). We integrated them with the dictionary-based system in a hybrid tool, and we distribute the models via the Hugging Face hub. For an internal validation, we used a held-out test set and conducted an error analysis. For an external validation, eight medical professionals evaluated the system by revising the annotation of 200 new texts not used in development.
Results: In the internal validation, the models yielded F1 values up to 0.915. In the external validation with 100 clinical trials, the tool achieved an average F1 score of 0.858 (± 0.032); and in 100 anonymized clinical cases, it achieved an average F1 score of 0.910 (± 0.019).
Conclusions: The tool is available at https://claramed.csic.es/medspaner. We also release the code (https://github.com/lcampillos/medspaner) and the annotated corpus to train the models.
id doaj-art-05063cd79bde4b24837ed3fc0f2b6a5a
institution Kabale University
issn 1471-2105
spelling BMC Bioinformatics, volume 26, issue 1, pages 1-39
Leonardo Campillos-Llanos: ILLA - CSIC (Spanish National Research Council)
Ana Valverde-Mateos: Medical Terminology Unit, Spanish Royal Academy of Medicine
Adrián Capllonch-Carrión: Centro de Salud Retiro, Hospital Universitario Gregorio Marañón
topic Medical natural language processing
Medical text mining tool
Named entity recognition
Deep learning in healthcare
Clinical trials
Spanish medical NLP