Automated sample annotation for diabetes mellitus in healthcare integrated biobanking

Bibliographic Details
Main Authors: Johannes Stolp, Christoph Weber, Danny Ammon, André Scherag, Claudia Fischer, Christof Kloos, Gunter Wolf, P. Christian Schulze, Utz Settmacher, Michael Bauer, Andreas Stallmach, Michael Kiehntopf, Boris Betz
Format: Article
Language: English
Published: Elsevier 2024-12-01
Series: Computational and Structural Biotechnology Journal
Online Access: http://www.sciencedirect.com/science/article/pii/S2001037024003520
Description
Summary: Healthcare integrated biobanking describes the annotation and collection of residual samples from hospitalized patients for research purposes. The central idea of the current work is to establish an automated workflow for sample annotation, selection and storage for diabetes mellitus. This is challenging because data are incomplete at the time of sample selection. The study evaluates a two-step procedure, based on machine learning (ML) and natural language processing (NLP), for timely and precise sample annotation for diabetes mellitus. Electronic health record data of 785 persons were extracted from the hospital information system. In the first step, a conditional inference forest (CIF) model was trained and tested on laboratory values from the first 72 h of the hospital stay using test (n = 550) and training (n = 235) data sets. Its performance was compared with a simple laboratory cut-off classifier (LCC) and a logistic regression (LR) model. Algorithms based on laboratory values, ICD-10 codes or information extracted from discharge summaries by NLP software (NLP-DS) were evaluated as a second (review) step designed to increase the precision of annotations. For the first step, recall/precision/F1-score/accuracy were 71 %/86 %/0.78/0.82 for CIF and 77 %/70 %/0.74/0.75 for LR, compared to 73 %/68 %/0.70/0.72 for LCC. NLP-DS was the best-performing second (review) step (93 %/100 %/0.97/0.97). Combining first-step models with NLP-DS increased precision to 100 % for all procedures (66 %/100 %/0.80/0.85 for CIF&NLP-DS, 72 %/100 %/0.84/0.87 for LR&NLP-DS and 66 %/100 %/0.80/0.85 for LCC&NLP-DS). NLP-DS removed more samples for LR&NLP-DS and LCC&NLP-DS (removal rates of 35 % and 38 % of initially selected samples) than for CIF&NLP-DS (removal rate of 20 %). The developed two-step procedure is an efficient, implementable method for timely and precise annotation of samples from diabetic hospitalized patients.
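The reported metrics and the two-step logic can be illustrated with a short Python sketch. It is not the study's code: the function names (`f1_score`, `metrics`, `two_step_select`) and the toy labels are assumptions made for illustration, and the review step is modelled simply as keeping a sample only when both steps flag it. F1 is the harmonic mean of precision and recall, so recomputing it from the rounded percentages in the summary reproduces most reported values; small differences of 0.01 (for example for LR) are expected from rounding.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Recall, precision, F1 and accuracy from binary labels (1 = diabetes)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "f1": f1_score(recall, precision),
        "accuracy": (tp + tn) / len(y_true),
    }


def two_step_select(first_step: list[int], review: list[int]) -> list[int]:
    """Keep a sample only if the first step selected it AND the review confirms it."""
    return [int(a == 1 and b == 1) for a, b in zip(first_step, review)]


# Rounded recall/precision pairs taken from the summary above.
reported = {
    "CIF": (0.71, 0.86),
    "LCC": (0.73, 0.68),
    "CIF&NLP-DS": (0.66, 1.00),
    "LR&NLP-DS": (0.72, 1.00),
}
for name, (recall, precision) in reported.items():
    print(f"{name}: F1 = {f1_score(recall, precision):.2f}")

# Toy labels (hypothetical, NOT study data) showing the effect of a review step.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
first_step = [1, 1, 1, 0, 1, 1, 0, 0]   # selects two false positives
review = [1, 1, 0, 1, 0, 0, 0, 0]       # review step with no false positives
combined = two_step_select(first_step, review)

# Share of initially selected samples that the review step removes.
removal_rate = (sum(first_step) - sum(combined)) / sum(first_step)
print(metrics(y_true, first_step))
print(metrics(y_true, combined), f"removal rate = {removal_rate:.2f}")
```

In the toy example the review step raises precision from 0.6 to 1.0 while recall drops from 0.75 to 0.5, the same kind of trade-off the summary reports when NLP-DS is applied after a first-step model.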
ISSN:2001-0370