Use of n-grams and K-means clustering to classify data from free text bone marrow reports

Natural language processing (NLP) has been used to extract information from and summarize medical reports. Currently, the most advanced NLP models require large training datasets of accurately labeled medical text. One approach to creating these large datasets is to use classical NLP algorithms with low resource requirements.
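
The description field of this record states that the study replaced numerical values, accession numbers, and clusters of differentiation with tokens and varied the n-gram size from 1 to 5. A minimal sketch of those two preprocessing steps is given here for orientation. The regular expressions, the placeholder names (`ACCTOKEN`, `CDTOKEN`, `NUMTOKEN`), and the accession-number format are assumptions made for illustration; the record does not give the author's actual rules.

```python
import re

# Hypothetical patterns: the record does not specify the real formats.
ACCESSION = re.compile(r"\b[A-Z]{1,3}\d{2}-\d+\b")  # e.g. "BM23-1042" (made-up format)
CD_MARKER = re.compile(r"\bCD\d+[a-z]?\b")          # e.g. "CD34", "CD79a"
NUMBER = re.compile(r"\b\d+(?:\.\d+)?%?")           # e.g. "12", "3.5", "40%"


def replace_tokens(text):
    """Replace accession numbers, CD markers, and numbers with placeholders.

    Order matters: accession numbers and CD markers contain digits, so they
    are replaced before the generic number pattern runs.
    """
    text = ACCESSION.sub("ACCTOKEN", text)
    text = CD_MARKER.sub("CDTOKEN", text)
    text = NUMBER.sub("NUMTOKEN", text)
    return text


def ngrams(text, n):
    """Return the word n-grams of a text block as space-joined strings."""
    words = re.findall(r"[a-z]+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```

With `n = 1` the function returns single words, which corresponds to the bag-of-words setting that the description reports as optimal.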

Bibliographic Details
Main Author:	Richard F. Xiang
Format:	Article
Language:	English
Published:	Elsevier 2024-12-01
Series:	Journal of Pathology Informatics
Subjects:	Hematologic pathology; Bone marrow; K-means clustering; n-grams; Machine learning; Natural language processing
Online Access:	http://www.sciencedirect.com/science/article/pii/S2153353923001724

_version_	1846122135914807296
collection	DOAJ
description	Natural language processing (NLP) has been used to extract information from and summarize medical reports. Currently, the most advanced NLP models require large training datasets of accurately labeled medical text. One approach to creating these large datasets is to use classical NLP algorithms with low resource requirements. In this manuscript, we examined how well an automated classical NLP algorithm classified portions of bone marrow report text into their appropriate sections. A total of 1480 bone marrow reports were extracted from the laboratory information system of a tertiary healthcare network. The free text of these bone marrow reports was preprocessed by separating the reports into text blocks and then removing the section headers. A natural language processing algorithm involving n-grams and K-means clustering was used to classify the text blocks into their appropriate bone marrow sections. We assessed the impact of token replacement of numerical values, accession numbers, and clusters of differentiation; of varying the number of centroids (1–19) and the n-gram size (1–5); and of using an ensemble algorithm. The optimal NLP model employed an ensemble algorithm that incorporated token replacement, used 1-grams (bag of words), and used 10 centroids for K-means clustering. This optimal model classified text blocks with an accuracy of 89%, suggesting that classical NLP models can accurately classify portions of marrow report text.
id	doaj-art-c86d09fefc36442d9aaafa5b84a339c4
institution	Kabale University
issn	2153-3539
spelling	doaj-art-c86d09fefc36442d9aaafa5b84a339c4; 2024-12-15T06:15:08Z; eng; Elsevier; Journal of Pathology Informatics; 2153-3539; 2024-12-01; 15; 100358; Use of n-grams and K-means clustering to classify data from free text bone marrow reports; Richard F. Xiang (corresponding author), Department of Pathology and Laboratory Medicine, Dalhousie University, Halifax, Nova Scotia, Canada; http://www.sciencedirect.com/science/article/pii/S2153353923001724; Hematologic pathology; Bone marrow; K-means clustering; n-grams; Machine learning; Natural language processing
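
The description field above reports that text blocks were grouped with K-means clustering over n-gram features and that classification accuracy was then measured. The following sketch shows one way such a step could look, using only the Python standard library: count vectors over a shared vocabulary, a plain Lloyd's-algorithm K-means, and a majority vote that names each cluster after the most common known section label among its members. The function names, the majority-vote labeling, and the fixed random seed are assumptions made for illustration; the record does not describe the author's implementation, and the ensemble algorithm mentioned in the description is not covered here.

```python
import random
from collections import Counter


def to_vectors(token_lists):
    """Turn lists of n-gram strings into count vectors over a shared vocabulary."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([float(counts[t]) for t in vocab])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per input vector."""
    rng = random.Random(seed)
    # Initial centroids are k distinct input vectors chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def label_clusters(assignment, labels):
    """Name each cluster after the most common known label among its members."""
    by_cluster = {}
    for a, lab in zip(assignment, labels):
        by_cluster.setdefault(a, Counter())[lab] += 1
    return {a: c.most_common(1)[0][0] for a, c in by_cluster.items()}
```

In use, each text block would first be converted to a list of n-grams, the lists passed to `to_vectors`, and the result clustered with `kmeans(vectors, k)`, where `k` is the number of centroids (the description reports testing 1 to 19). Accuracy can then be estimated by comparing each block's cluster name from `label_clusters` with its true section.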

Similar Items