BMT: A Cross-Validated ThinPrep Pap Cervical Cytology Dataset for Machine Learning Model Training and Validation

Bibliographic Details
Main Authors: E. Celeste Welch, Chenhao Lu, C. James Sung, Cunxian Zhang, Anubhav Tripathi, Joyce Ou
Format: Article
Language: English
Published: Nature Portfolio 2024-12-01
Series: Scientific Data
Online Access: https://doi.org/10.1038/s41597-024-04328-3
Description
Summary: In the past several years, a few cervical Pap smear datasets have been published for use in clinical training. However, most publicly available datasets consist of pre-segmented single-cell images, contain on-image annotations that must be manually edited out, or are prepared using the conventional Pap smear method. Multicellular liquid Pap image datasets more accurately reflect current cervical screening techniques. While a multicellular liquid SurePath™ dataset has been created, machine learning models struggle to classify a test image set that was prepared differently from the training set, because the two preparations differ visually. This dataset of multicellular Pap smear images prepared with the more common ThinPrep® protocol is therefore presented as a resource for training and testing artificial intelligence models, particularly for future application in cervical dysplasia diagnosis. The “Brown Multicellular ThinPrep” (BMT) dataset is the first publicly available multicellular ThinPrep® dataset. It consists of 600 clinically vetted images collected from 180 Pap smear slides from 180 patients, classified into three key diagnostic categories.
ISSN: 2052-4463