Clinicians’ Agreement on Extrapulmonary Radiographic Findings in Chest X-Rays Using a Diagnostic Labelling Scheme

Bibliographic Details
Main Authors: Lea Marie Pehrson, Dana Li, Alyas Mayar, Marco Fraccaro, Rasmus Bonnevie, Peter Jagd Sørensen, Alexander Malcom Rykkje, Tobias Thostrup Andersen, Henrik Steglich-Arnholm, Dorte Marianne Rohde Stærk, Lotte Borgwardt, Sune Darkner, Jonathan Frederik Carlsen, Michael Bachmann Nielsen, Silvia Ingala
Format: Article
Language: English
Published: MDPI AG 2025-04-01
Series: Diagnostics
Online Access: https://www.mdpi.com/2075-4418/15/7/902
Description
Summary:
Objective: Reliable reading and annotation of chest X-ray (CXR) images are essential for both clinical decision-making and AI model development. While most of the literature emphasizes pulmonary findings, this study evaluates the consistency and reliability of annotations for extrapulmonary findings using a labelling scheme.
Methods: Six clinicians with varying experience levels (novice, intermediate, and experienced) annotated 100 CXR images using a diagnostic labelling scheme in two rounds separated by a three-week washout period. Annotation consistency was assessed using Randolph’s free-marginal kappa (RK), prevalence- and bias-adjusted kappa (PABAK), proportion positive agreement (PPA), and proportion negative agreement (PNA). Pairwise comparisons and McNemar’s test were used to assess inter-reader and intra-reader agreement.
Results: PABAK values indicated high overall grouped labelling agreement (novice: 0.86, intermediate: 0.90, experienced: 0.91). PNA values demonstrated strong agreement on negative findings, while PPA values showed moderate-to-low consistency in positive findings. Significant differences in specific agreement emerged between novice and experienced clinicians for eight labels, but there were no significant variations in RK across experience levels. McNemar’s test confirmed annotation stability between rounds.
Conclusions: This study demonstrates that clinician annotations of extrapulmonary findings in CXR are consistent and reliable across different experience levels when a pre-defined diagnostic labelling scheme is used. These insights aid in optimizing training strategies for both clinicians and AI models.
ISSN: 2075-4418
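
The agreement statistics named in the summary (RK, PABAK, PPA, PNA, and McNemar's test) can be illustrated with a minimal Python sketch. It uses the standard textbook definitions for a binary label and is not taken from the article, so the authors' exact computation (for example, how readers were pooled) may differ. The function names and the example counts are hypothetical and chosen only for illustration; they are not study data.

```python
from collections import Counter
from math import comb


def specific_agreement(a, b, c, d):
    """PABAK, PPA and PNA from a 2x2 table for two readers on one label.

    a: both readers positive, d: both readers negative,
    b: only reader 1 positive, c: only reader 2 positive.
    """
    n = a + b + c + d
    p_o = (a + d) / n                      # observed agreement
    pos_den = 2 * a + b + c
    neg_den = 2 * d + b + c
    return {
        "pabak": 2 * p_o - 1,              # kappa with chance agreement fixed at 0.5
        "ppa": 2 * a / pos_den if pos_den else float("nan"),
        "pna": 2 * d / neg_den if neg_den else float("nan"),
    }


def randolph_kappa(ratings, n_categories=2):
    """Randolph's free-marginal multirater kappa.

    ratings: one list per image holding each reader's category
    (at least two readers per image).
    """
    per_item = []
    for item in ratings:
        r = len(item)
        counts = Counter(item)
        # share of reader pairs that agree on this image
        per_item.append(sum(v * (v - 1) for v in counts.values()) / (r * (r - 1)))
    p_o = sum(per_item) / len(per_item)
    p_e = 1 / n_categories                 # free-marginal chance agreement
    return (p_o - p_e) / (1 - p_e)


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts for one label read by two clinicians on 100 images.
stats = specific_agreement(a=5, b=3, c=2, d=90)
p_value = mcnemar_exact(b=3, c=2)
```

With few positive cases, PNA stays high while PPA drops, which matches the pattern the summary reports: strong agreement on negative findings and weaker agreement on positive ones.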