The Venus score for the assessment of the quality and trustworthiness of biomedical datasets

Abstract Biomedical datasets are the mainstays of computational biology and health informatics projects, and can be found on multiple data platforms online or obtained from wet-lab biologists and physicians. The quality and the trustworthiness of these datasets, however, can sometimes be poor, produ...

Bibliographic Details
Main Authors: Davide Chicco, Alessandro Fabris, Giuseppe Jurman
Format: Article
Language: English
Published: BMC 2025-01-01
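Series: BioData Mining
Subjects: Biomedical data quality; Data trustworthiness; Data documentation; Medical data; Health informatics; Bioinformatics
Online Access: https://doi.org/10.1186/s13040-024-00412-x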
author Davide Chicco
Alessandro Fabris
Giuseppe Jurman
collection DOAJ
description Abstract Biomedical datasets are the mainstays of computational biology and health informatics projects, and can be found on multiple data platforms online or obtained from wet-lab biologists and physicians. The quality and the trustworthiness of these datasets, however, can sometimes be poor, producing bad results in turn, which can harm patients and data subjects. To address this problem, policy-makers, researchers, and consortia have proposed diverse regulations, guidelines, and scores to assess the quality and increase the reliability of datasets. Although generally useful, these instruments are often incomplete and impractical. The guidelines of Datasheets for Datasets, in particular, are too numerous; the requirements of the Kaggle Dataset Usability Score focus on non-scientific requisites (for example, including a cover image); and the European Union Artificial Intelligence Act (EU AI Act) sets forth sparse and general data governance requirements, which we tailored to datasets for biomedical AI. Against this backdrop, we introduce our new Venus score to assess the data quality and trustworthiness of biomedical datasets. Our score ranges from 0 to 10 and consists of ten questions that anyone developing a bioinformatics, medical informatics, or cheminformatics dataset should answer before release. In this study, we first describe the EU AI Act, Datasheets for Datasets, and the Kaggle Dataset Usability Score, presenting their requirements and their drawbacks. To do so, we reverse-engineer the weights of the influential Kaggle Score for the first time and report them in this study. We distill the most important data governance requirements into ten questions tailored to the biomedical domain, which together constitute the Venus score.
We apply the Venus score to twelve datasets from multiple subdomains, including electronic health records, medical imaging, microarray and bulk RNA-seq gene expression, cheminformatics, physiologic electrogram signals, and medical text. Analyzing the results, we surface fine-grained strengths and weaknesses of popular datasets, as well as aggregate trends. Most notably, we find a widespread tendency to gloss over sources of data inaccuracy and noise, which may hinder the reliable exploitation of data and, consequently, research results. Overall, our results confirm the applicability and utility of the Venus score to assess the trustworthiness of biomedical data.
format Article
id doaj-art-9768c9285a7546d2b814218329f39c05
institution Kabale University
issn 1756-0381
language English
publishDate 2025-01-01
publisher BMC
record_format Article
series BioData Mining
spelling doaj-art-9768c9285a7546d2b814218329f39c05
2025-01-12T12:10:31Z
eng
BMC
BioData Mining
1756-0381
2025-01-01
181131
10.1186/s13040-024-00412-x
Davide Chicco (Università di Milano-Bicocca & University of Toronto)
Alessandro Fabris (Max Planck Institute for Security and Privacy)
Giuseppe Jurman (Fondazione Bruno Kessler)
title The Venus score for the assessment of the quality and trustworthiness of biomedical datasets
topic Biomedical data quality
Data trustworthiness
Data documentation
Medical data
Health informatics
Bioinformatics
url https://doi.org/10.1186/s13040-024-00412-x