Presenting the framework of the whole slide image file Babel fish: An OCR-based file labeling tool

Introduction: Metadata extraction from digitized slides or whole slide image files is a frequent, laborious, and tedious task. In this work, we present a tool to automatically extract all relevant slide information, such as case number, year, slide number, block number, and staining from the macro-i...

Bibliographic Details
Main Authors: Nils Englert, Constantin Schwab, Maximilian Legnar, Cleo-Aron Weis
Format: Article
Language: English
Published: Elsevier 2024-12-01
Series: Journal of Pathology Informatics
Subjects: DICOM; Digital pathology; Optical character recognition; Automatization
Online Access: http://www.sciencedirect.com/science/article/pii/S2153353924000415
_version_ 1846122125755154432
author Nils Englert
Constantin Schwab
Maximilian Legnar
Cleo-Aron Weis
author_sort Nils Englert
collection DOAJ
description Introduction: Metadata extraction from digitized slides or whole slide image files is a frequent, laborious, and tedious task. In this work, we present a tool to automatically extract all relevant slide information, such as case number, year, slide number, block number, and staining, from the macro-images of the scanned slide. We named the tool Babel fish as it helps translate relevant information printed on the slide. It builds in certain basic assumptions, for example, about where on the slide each piece of information is located. These assumptions can be adapted to local conventions. The extracted metadata can then be used to sort digital slides into databases or to link them with associated case IDs from laboratory information systems. Material and methods: The tool is based on optical character recognition (OCR). For most information, the easyOCR tool is used. For the block number and for cases with insufficient results in the first OCR round, a second OCR with pytesseract is applied. Two datasets are used: one for tool development with 342 slides, and another for testing with 110 slides. Results: For the testing set, the overall accuracy for retrieving all relevant information per slide is 0.982. Of note, the accuracy for most information parts is 1.000, whereas the accuracy for the block number detection is 0.982. Conclusion: The Babel fish tool can be used to rename vast amounts of whole slide image files in an image analysis pipeline. Furthermore, it could be an essential part of DICOM conversion pipelines, as it extracts relevant metadata like case number, year, block ID, and staining.
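
The two-round OCR idea in the description can be illustrated with a small Python sketch. It is not the authors' implementation: the validation rule `BLOCK_PATTERN`, the helper names `read_with_fallback` and `build_filename`, the file naming scheme, and the `.svs` extension are all assumptions made for illustration. It also models only the "insufficient first-round result" branch, with "insufficient" defined here as failing a hypothetical regular expression. The two engine wrappers use `easyocr.Reader.readtext` and `pytesseract.image_to_string`; they import their packages lazily, so the rest of the sketch runs without them.

```python
import re
from typing import Callable, Optional

# An OCR engine is modelled as any callable that maps an image to text.
OcrEngine = Callable[[object], str]

# Hypothetical rule for a plausible block number: 1-3 digits, optionally
# followed by one uppercase letter. Real label conventions will differ.
BLOCK_PATTERN = re.compile(r"\d{1,3}[A-Z]?")


def easyocr_engine(image) -> str:
    """First-round OCR with EasyOCR (requires the easyocr package)."""
    import easyocr
    # Building a Reader is slow; a real tool would create it once and reuse it.
    reader = easyocr.Reader(["en"], gpu=False)
    # detail=0 returns only the recognised text fragments as a list of str.
    return " ".join(reader.readtext(image, detail=0))


def tesseract_engine(image) -> str:
    """Second-round OCR with pytesseract (requires pytesseract + Tesseract)."""
    import pytesseract
    # "--psm 7" asks Tesseract to treat the crop as a single line of text.
    return pytesseract.image_to_string(image, config="--psm 7")


def read_with_fallback(image, primary: OcrEngine, fallback: OcrEngine,
                       pattern=BLOCK_PATTERN) -> Optional[str]:
    """Return the first engine result that fully matches the pattern.

    The fallback engine is only called when the primary result is rejected.
    Returns None when neither engine yields a valid value.
    """
    for engine in (primary, fallback):
        text = engine(image).strip()
        if pattern.fullmatch(text):
            return text
    return None


def build_filename(meta: dict, extension: str = ".svs") -> str:
    """Assemble a file name from extracted metadata (hypothetical scheme)."""
    return (f"{meta['year']}_{meta['case']}_{meta['block']}_"
            f"{meta['slide']}_{meta['stain']}{extension}")
```

With both packages installed, a call such as `read_with_fallback(label_crop, easyocr_engine, tesseract_engine)` would try EasyOCR first and fall back to Tesseract, where `label_crop` is a hypothetical cropped macro-image region. Passing the engines as callables keeps the fallback logic testable with simple stand-in functions.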
format Article
id doaj-art-ac2dea88bf1445c5a41fb8eb17f2fd6a
institution Kabale University
issn 2153-3539
language English
publishDate 2024-12-01
publisher Elsevier
record_format Article
series Journal of Pathology Informatics
spelling doaj-art-ac2dea88bf1445c5a41fb8eb17f2fd6a | 2024-12-15T06:15:22Z | eng | Elsevier | Journal of Pathology Informatics | 2153-3539 | 2024-12-01 | 15 | 100402
Presenting the framework of the whole slide image file Babel fish: An OCR-based file labeling tool
Nils Englert (0), Constantin Schwab (1), Maximilian Legnar (2), Cleo-Aron Weis (3)
(0) Section Computational Pathology Heidelberg, Institute of Pathology Heidelberg, University Hospital Heidelberg, University of Heidelberg, Heidelberg, Germany
(1) Institute of Pathology Heidelberg, University Hospital Heidelberg, University of Heidelberg, Heidelberg, Germany
(2) Section Computational Pathology Heidelberg, Institute of Pathology Heidelberg, University Hospital Heidelberg, University of Heidelberg, Heidelberg, Germany
(3) Section Computational Pathology Heidelberg, Institute of Pathology Heidelberg, University Hospital Heidelberg, University of Heidelberg, Heidelberg, Germany; Interdisciplinary Center for Scientific Computing (IWR), Heidelberg University, Heidelberg, Germany; Corresponding author.
title Presenting the framework of the whole slide image file Babel fish: An OCR-based file labeling tool
title_sort presenting the framework of the whole slide image file babel fish an ocr based file labeling tool
topic DICOM
Digital pathology
Optical character recognition
Automatization
url http://www.sciencedirect.com/science/article/pii/S2153353924000415