Hybrid natural language processing tool for semantic annotation of medical texts in Spanish

Abstract Background Natural language processing (NLP) enables the extraction of information embedded within unstructured texts, such as clinical case reports and trial eligibility criteria. By identifying relevant medical concepts, NLP facilitates the generation of structured and actionable data, su...

Bibliographic Details
Main Authors:	Leonardo Campillos-Llanos, Ana Valverde-Mateos, Adrián Capllonch-Carrión
Format:	Article
Language:	English
Published:	BMC 2025-01-01
Series:	BMC Bioinformatics
Subjects:	Medical natural language processing; Medical text mining tool; Named entity recognition; Deep learning in healthcare; Clinical trials; Spanish medical NLP
Online Access:	https://doi.org/10.1186/s12859-024-05949-6

_version_	1841544225254014976
author	Leonardo Campillos-Llanos, Ana Valverde-Mateos, Adrián Capllonch-Carrión
collection	DOAJ
description	Abstract. Background: Natural language processing (NLP) enables the extraction of information embedded within unstructured texts, such as clinical case reports and trial eligibility criteria. By identifying relevant medical concepts, NLP facilitates the generation of structured and actionable data, supporting complex tasks like cohort identification and the analysis of clinical records. To accomplish those tasks, we introduce a deep learning-based and lexicon-based named entity recognition (NER) tool for texts in Spanish. It performs medical NER and normalization, medication information extraction and detection of temporal entities, negation and speculation, and temporality or experiencer attributes (Age, Contraindicated, Negated, Speculated, Hypothetical, Future, Family_member, Patient and Other). We built the tool with a dedicated lexicon and rules adapted from NegEx and HeidelTime. Using these resources, we annotated a corpus of 1200 texts, with high inter-annotator agreement (average F1 = 0.841 ± 0.045 for entities, and average F1 = 0.881 ± 0.032 for attributes). We used this corpus to train Transformer-based models (RoBERTa-based models, mBERT and mDeBERTa). We integrated them with the dictionary-based system in a hybrid tool, and distribute the models via the Hugging Face hub. For an internal validation, we used a held-out test set and conducted an error analysis. For an external validation, eight medical professionals evaluated the system by revising the annotation of 200 new texts not used in development. Results: In the internal validation, the models yielded F1 values up to 0.915. In the external validation with 100 clinical trials, the tool achieved an average F1 score of 0.858 (± 0.032); and in 100 anonymized clinical cases, it achieved an average F1 score of 0.910 (± 0.019). Conclusions: The tool is available at https://claramed.csic.es/medspaner. We also release the code (https://github.com/lcampillos/medspaner) and the annotated corpus to train the models.
format	Article
id	doaj-art-05063cd79bde4b24837ed3fc0f2b6a5a
institution	Kabale University
issn	1471-2105
language	English
publishDate	2025-01-01
publisher	BMC
record_format	Article
series	BMC Bioinformatics
spelling	doaj-art-05063cd79bde4b24837ed3fc0f2b6a5a | 2025-01-12T12:41:52Z | eng | BMC | BMC Bioinformatics | 1471-2105 | 2025-01-01 | 261139 | 10.1186/s12859-024-05949-6 | Hybrid natural language processing tool for semantic annotation of medical texts in Spanish | Leonardo Campillos-Llanos (0); Ana Valverde-Mateos (1); Adrián Capllonch-Carrión (2) | ILLA - CSIC (Spanish National Research Council); Medical Terminology Unit, Spanish Royal Academy of Medicine; Centro de Salud Retiro, Hospital Universitario Gregorio Marañon | https://doi.org/10.1186/s12859-024-05949-6 | Medical natural language processing; Medical text mining tool; Named entity recognition; Deep learning in healthcare; Clinical trials; Spanish medical NLP
title	Hybrid natural language processing tool for semantic annotation of medical texts in Spanish
topic	Medical natural language processing; Medical text mining tool; Named entity recognition; Deep learning in healthcare; Clinical trials; Spanish medical NLP
url	https://doi.org/10.1186/s12859-024-05949-6

Similar Items