BMT: A Cross-Validated ThinPrep Pap Cervical Cytology Dataset for Machine Learning Model Training and Validation

Abstract In the past several years, a few cervical Pap smear datasets have been published for use in clinical training. However, most publicly available datasets consist of pre-segmented single cell images, contain on-image annotations that must be manually edited out, or are prepared using the conv...

Bibliographic Details
Main Authors: E. Celeste Welch, Chenhao Lu, C. James Sung, Cunxian Zhang, Anubhav Tripathi, Joyce Ou
Format: Article
Language: English
Published: Nature Portfolio 2024-12-01
Series: Scientific Data
Online Access: https://doi.org/10.1038/s41597-024-04328-3
_version_ 1846101409476378624
author E. Celeste Welch
Chenhao Lu
C. James Sung
Cunxian Zhang
Anubhav Tripathi
Joyce Ou
collection DOAJ
description Abstract In the past several years, a few cervical Pap smear datasets have been published for use in clinical training. However, most publicly available datasets consist of pre-segmented single cell images, contain on-image annotations that must be manually edited out, or are prepared using the conventional Pap smear method. Multicellular liquid Pap image datasets are a more accurate reflection of current cervical screening techniques. While a multicellular liquid SurePath™ dataset has been created, machine learning models struggle to classify a test image set when it is prepared differently from the training set due to visual differences. Therefore, this dataset of multicellular Pap smear images prepared with the more common ThinPrep® protocol is presented as a helpful resource for training and testing artificial intelligence models, particularly for future application in cervical dysplasia diagnosis. The “Brown Multicellular ThinPrep” (BMT) dataset is the first publicly available multicellular ThinPrep® dataset, consisting of 600 clinically vetted images collected from 180 Pap smear slides from 180 patients, classified into three key diagnostic categories.
format Article
id doaj-art-6752cd08518f40d68dfcadd532a1da21
institution Kabale University
issn 2052-4463
language English
publishDate 2024-12-01
publisher Nature Portfolio
record_format Article
series Scientific Data
spelling doaj-art-6752cd08518f40d68dfcadd532a1da21
2024-12-29T12:10:04Z
eng
Nature Portfolio
Scientific Data
2052-4463
2024-12-01
11118
10.1038/s41597-024-04328-3
BMT: A Cross-Validated ThinPrep Pap Cervical Cytology Dataset for Machine Learning Model Training and Validation
E. Celeste Welch (Center for Biomedical Engineering, School of Engineering, Brown University)
Chenhao Lu (Department of Computer Science, Brown University)
C. James Sung (Department of Pathology and Laboratory Medicine, Alpert Medical School, Brown University)
Cunxian Zhang (Department of Pathology and Laboratory Medicine, Alpert Medical School, Brown University)
Anubhav Tripathi (Center for Biomedical Engineering, School of Engineering, Brown University)
Joyce Ou (Department of Pathology and Laboratory Medicine, Alpert Medical School, Brown University)
https://doi.org/10.1038/s41597-024-04328-3
title BMT: A Cross-Validated ThinPrep Pap Cervical Cytology Dataset for Machine Learning Model Training and Validation
url https://doi.org/10.1038/s41597-024-04328-3