Presenting the framework of the whole slide image file Babel fish: An OCR-based file labeling tool
Introduction: Metadata extraction from digitized slides or whole slide image files is a frequent, laborious, and tedious task. In this work, we present a tool to automatically extract all relevant slide information, such as case number, year, slide number, block number, and staining from the macro-i...
| Main Authors: | Nils Englert, Constantin Schwab, Maximilian Legnar, Cleo-Aron Weis |
|---|---|
| Format: | Article | 
| Language: | English | 
| Published: | Elsevier 2024-12-01 |
| Series: | Journal of Pathology Informatics | 
| Subjects: | DICOM; Digital pathology; Optical character recognition; Automatization |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2153353924000415 | 
| _version_ | 1846122125755154432 | 
|---|---|
| author | Nils Englert Constantin Schwab Maximilian Legnar Cleo-Aron Weis | 
| author_facet | Nils Englert Constantin Schwab Maximilian Legnar Cleo-Aron Weis | 
| author_sort | Nils Englert | 
| collection | DOAJ | 
| description | Introduction: Metadata extraction from digitized slides or whole slide image files is a frequent, laborious, and tedious task. In this work, we present a tool to automatically extract all relevant slide information, such as case number, year, slide number, block number, and staining, from the macro-images of the scanned slide. We named the tool Babel fish as it helps translate relevant information printed on the slide. It encodes certain basic assumptions, for example about where particular information is located; these can be adapted to the respective location. The extracted metadata can then be used to sort digital slides into databases or to link them with associated case IDs from laboratory information systems. Material and methods: The tool is based on optical character recognition (OCR). For most information, the easyOCR tool is used. For the block number and for cases with insufficient results in the first OCR round, a second OCR pass with pytesseract is applied. Two datasets are used: one with 342 slides for tool development and another with 110 slides for testing. Results: For the testing set, the overall accuracy for retrieving all relevant information per slide is 0.982. Of note, the accuracy for most information parts is 1.000, whereas the accuracy for block number detection is 0.982. Conclusion: The Babel fish tool can be used to rename vast amounts of whole slide image files in an image analysis pipeline. Furthermore, it could be an essential part of DICOM conversion pipelines, as it extracts relevant metadata like case number, year, block ID, and staining. | 
| format | Article | 
| id | doaj-art-ac2dea88bf1445c5a41fb8eb17f2fd6a | 
| institution | Kabale University | 
| issn | 2153-3539 | 
| language | English | 
| publishDate | 2024-12-01 | 
| publisher | Elsevier | 
| record_format | Article | 
| series | Journal of Pathology Informatics | 
| spelling | doaj-art-ac2dea88bf1445c5a41fb8eb17f2fd6a; 2024-12-15T06:15:22Z; eng; Elsevier; Journal of Pathology Informatics; 2153-3539; 2024-12-01; 15; 100402; Presenting the framework of the whole slide image file Babel fish: An OCR-based file labeling tool; Nils Englert (0), Constantin Schwab (1), Maximilian Legnar (2), Cleo-Aron Weis (3); 0: Section Computational Pathology Heidelberg, Institute of Pathology Heidelberg, University Hospital Heidelberg, University of Heidelberg, Heidelberg, Germany; 1: Institute of Pathology Heidelberg, University Hospital Heidelberg, University of Heidelberg, Heidelberg, Germany; 2: Section Computational Pathology Heidelberg, Institute of Pathology Heidelberg, University Hospital Heidelberg, University of Heidelberg, Heidelberg, Germany; 3: Section Computational Pathology Heidelberg, Institute of Pathology Heidelberg, University Hospital Heidelberg, University of Heidelberg, Heidelberg, Germany, and Interdisciplinary Center for Scientific Computing (IWR), Heidelberg University, Heidelberg, Germany (corresponding author); Introduction: Metadata extraction from digitized slides or whole slide image files is a frequent, laborious, and tedious task. In this work, we present a tool to automatically extract all relevant slide information, such as case number, year, slide number, block number, and staining, from the macro-images of the scanned slide. We named the tool Babel fish as it helps translate relevant information printed on the slide. It encodes certain basic assumptions, for example about where particular information is located; these can be adapted to the respective location. The extracted metadata can then be used to sort digital slides into databases or to link them with associated case IDs from laboratory information systems. Material and methods: The tool is based on optical character recognition (OCR). For most information, the easyOCR tool is used. For the block number and for cases with insufficient results in the first OCR round, a second OCR pass with pytesseract is applied. Two datasets are used: one with 342 slides for tool development and another with 110 slides for testing. Results: For the testing set, the overall accuracy for retrieving all relevant information per slide is 0.982. Of note, the accuracy for most information parts is 1.000, whereas the accuracy for block number detection is 0.982. Conclusion: The Babel fish tool can be used to rename vast amounts of whole slide image files in an image analysis pipeline. Furthermore, it could be an essential part of DICOM conversion pipelines, as it extracts relevant metadata like case number, year, block ID, and staining.; http://www.sciencedirect.com/science/article/pii/S2153353924000415; DICOM; Digital pathology; Optical character recognition; Automatization | 
| spellingShingle | Nils Englert Constantin Schwab Maximilian Legnar Cleo-Aron Weis Presenting the framework of the whole slide image file Babel fish: An OCR-based file labeling tool Journal of Pathology Informatics DICOM Digital pathology Optical character recognition Automatization | 
| title | Presenting the framework of the whole slide image file Babel fish: An OCR-based file labeling tool | 
| title_full | Presenting the framework of the whole slide image file Babel fish: An OCR-based file labeling tool | 
| title_fullStr | Presenting the framework of the whole slide image file Babel fish: An OCR-based file labeling tool | 
| title_full_unstemmed | Presenting the framework of the whole slide image file Babel fish: An OCR-based file labeling tool | 
| title_short | Presenting the framework of the whole slide image file Babel fish: An OCR-based file labeling tool | 
| title_sort | presenting the framework of the whole slide image file babel fish an ocr based file labeling tool | 
| topic | DICOM Digital pathology Optical character recognition Automatization | 
| url | http://www.sciencedirect.com/science/article/pii/S2153353924000415 | 
| work_keys_str_mv | AT nilsenglert presentingtheframeworkofthewholeslideimagefilebabelfishanocrbasedfilelabelingtool AT constantinschwab presentingtheframeworkofthewholeslideimagefilebabelfishanocrbasedfilelabelingtool AT maximilianlegnar presentingtheframeworkofthewholeslideimagefilebabelfishanocrbasedfilelabelingtool AT cleoaronweis presentingtheframeworkofthewholeslideimagefilebabelfishanocrbasedfilelabelingtool | 
 
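The two-stage OCR approach summarized in the description can be illustrated with a minimal sketch. It is not the authors' implementation. The label layout (`<case>/<year> <block> <stain>`), the regular expression, the helper names, and the `.svs` suffix are all hypothetical assumptions made for illustration. The sketch covers only the fallback for insufficient first-round results, not the separate block-number pass, and omits the slide number. The two engine wrappers rely on `easyocr.Reader.readtext` and `pytesseract.image_to_string` and import those libraries lazily, so the parsing logic can be tried without them installed.

```python
import re
from typing import Callable, Optional

# Hypothetical label layout: "<case>/<year> <block> <stain>", e.g. "12345/23 A1 HE".
# Real slide labels differ between laboratories and need their own pattern.
LABEL_PATTERN = re.compile(
    r"(?P<case>\d+)\s*/\s*(?P<year>\d{2,4})\s+(?P<block>[A-Z]\d*)\s+(?P<stain>\w+)"
)


def parse_label(text: str) -> Optional[dict]:
    """Return the label fields found in OCR text, or None if nothing matches."""
    match = LABEL_PATTERN.search(text)
    return match.groupdict() if match else None


def read_label(image, primary: Callable, fallback: Callable) -> Optional[dict]:
    """Try the primary OCR engine first; use the fallback only if parsing fails."""
    for engine in (primary, fallback):
        fields = parse_label(engine(image))
        if fields is not None:
            return fields
    return None


def easyocr_engine(image) -> str:
    """First OCR round; easyocr is imported lazily so the sketch loads without it."""
    import easyocr
    reader = easyocr.Reader(["en"])
    # detail=0 makes readtext return plain strings instead of (box, text, score).
    return " ".join(reader.readtext(image, detail=0))


def pytesseract_engine(image) -> str:
    """Second OCR round, used when the first result cannot be parsed."""
    import pytesseract
    return pytesseract.image_to_string(image)


def build_filename(fields: dict, suffix: str = ".svs") -> str:
    """Assemble a file name from the extracted fields (naming scheme is assumed)."""
    return "{case}_{year}_{block}_{stain}{suffix}".format(suffix=suffix, **fields)
```

With both libraries installed, `read_label(macro_image, easyocr_engine, pytesseract_engine)` would run the two rounds on a macro-image, where `macro_image` is a placeholder for an image array or file path. Passing the engines as callables keeps the fallback logic independent of any particular OCR library.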
       