Use of n-grams and K-means clustering to classify data from free text bone marrow reports

Natural language processing (NLP) has been used to extract information from and summarize medical reports. Currently, the most advanced NLP models require large training datasets of accurately labeled medical text. An approach to creating these large datasets is to use low resource intensive classic...

Bibliographic Details
Main Author: Richard F. Xiang
Format: Article
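Language: English
Published: Elsevier 2024-12-01
Series: Journal of Pathology Informatics
Subjects: Hematologic pathology; Bone marrow; K-means clustering; n-grams; Machine learning; Natural language processing
Online Access: http://www.sciencedirect.com/science/article/pii/S2153353923001724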
author Richard F. Xiang
collection DOAJ
description Natural language processing (NLP) has been used to extract information from and summarize medical reports. Currently, the most advanced NLP models require large training datasets of accurately labeled medical text. An approach to creating these large datasets is to use low resource intensive classical NLP algorithms. In this manuscript, we examined how an automated classical NLP algorithm was able to classify portions of bone marrow report text into their appropriate sections. A total of 1480 bone marrow reports were extracted from the laboratory information system of a tertiary healthcare network. The free text of these bone marrow reports was preprocessed by separating the reports into text blocks and then removing the section headers. A natural language processing algorithm involving n-grams and K-means clustering was used to classify the text blocks into their appropriate bone marrow sections. The impact of token replacement of numerical values, accession numbers, and clusters of differentiation, varying the number of centroids (1–19) and n-grams (1–5), and utilizing an ensemble algorithm were assessed. The optimal NLP model was found to employ an ensemble algorithm that incorporated token replacement, utilized 1-gram or bag of words, and 10 centroids for K-means clustering. This optimal model was able to classify text blocks with an accuracy of 89%, suggesting that classical NLP models can accurately classify portions of marrow report text.
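The pipeline named in the description (token replacement, n-gram counting, K-means clustering) can be illustrated with a minimal sketch. This is not the author's implementation: the placeholder tokens (`<acc>`, `<cd>`, `<num>`), the regular expressions, the accession-number format, the toy text blocks, and the function names are all assumptions made for illustration, and the ensemble step and the 10-centroid setting reported in the abstract are not reproduced.

```python
import math
import random
import re
from collections import Counter


def replace_tokens(text):
    """Replace variable values with placeholder tokens (patterns are assumed)."""
    text = text.lower()
    text = re.sub(r"\b[a-z]{1,3}-\d{2}-\d+\b", "<acc>", text)  # hypothetical accession format
    text = re.sub(r"\bcd\d+\b", "<cd>", text)                  # clusters of differentiation
    text = re.sub(r"\b\d+(?:\.\d+)?\b", "<num>", text)         # numerical values
    return text


def ngrams(text, n=1):
    """Count word n-grams; n=1 gives a bag of words."""
    tokens = re.findall(r"<\w+>|\w+", text)
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def vectorize(blocks, n=1):
    """Turn text blocks into dense count vectors over a shared vocabulary."""
    counts = [ngrams(replace_tokens(b), n) for b in blocks]
    vocab = sorted(set().union(*counts))
    return [[c[t] for t in vocab] for c in counts], vocab


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm with Euclidean distance."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Move each centroid to the mean of its members (empty clusters stay put).
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy text blocks (invented for this sketch, not taken from the study data).
blocks = [
    "Blasts 5%, CD34 positive",
    "Blasts 12%, CD117 positive",
    "Specimen BM-23-1234 received",
    "Specimen BM-24-0042 received",
]
vectors, vocab = vectorize(blocks, n=1)
labels, _ = kmeans(vectors, k=2)
print(labels)
```

After token replacement the first two blocks become identical, as do the last two, so each pair lands in the same cluster. Mapping clusters to report sections would need an additional labeling step that this sketch leaves out.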
format Article
id doaj-art-c86d09fefc36442d9aaafa5b84a339c4
institution Kabale University
issn 2153-3539
language English
publishDate 2024-12-01
publisher Elsevier
record_format Article
series Journal of Pathology Informatics
spelling doaj-art-c86d09fefc36442d9aaafa5b84a339c4; 2024-12-15T06:15:08Z; eng; Elsevier; Journal of Pathology Informatics; 2153-3539; 2024-12-01; 15; 100358; Use of n-grams and K-means clustering to classify data from free text bone marrow reports; Richard F. Xiang (corresponding author), Department of Pathology and Laboratory Medicine, Dalhousie University, Halifax, Nova Scotia, Canada; http://www.sciencedirect.com/science/article/pii/S2153353923001724; Hematologic pathology; Bone marrow; K-means clustering; n-grams; Machine learning; Natural language processing
title Use of n-grams and K-means clustering to classify data from free text bone marrow reports
topic Hematologic pathology
Bone marrow
K-means clustering
n-grams
Machine learning
Natural language processing
url http://www.sciencedirect.com/science/article/pii/S2153353923001724