Optimizing imputation strategies for mass spectrometry-based proteomics considering intensity and missing value rates

Missing values (MVs) in omic datasets affect the power, accuracy, and consistency of statistical and functional analyses. In mass spectrometry (MS)-based proteomics, MVs can arise for several reasons: peptides could be below instrumental detection limits, peptides or proteins might be absent or d...

Bibliographic Details
Main Authors: Yuming Shi, Huan Zhong, Jason C. Rogalski, Leonard J. Foster
Format: Article
Language: English
Published: Elsevier 2025-01-01
Series: Computational and Structural Biotechnology Journal
Subjects: Missing value; Missing value imputation; Proteomics; Mass spectrometry
Online Access: http://www.sciencedirect.com/science/article/pii/S200103702500162X
author Yuming Shi
Huan Zhong
Jason C. Rogalski
Leonard J. Foster
collection DOAJ
description Missing values (MVs) in omic datasets affect the power, accuracy, and consistency of statistical and functional analyses. In mass spectrometry (MS)-based proteomics, MVs can arise for several reasons: peptides could be below instrumental detection limits, peptides or proteins might be absent or depleted from the sample for biological or technical reasons, or data processing could fail to detect a real signal. Several statistical and machine-learning methods have been described for imputing MVs in proteomics, such as Bayesian PCA estimation, random forest, and collaborative filtering. However, these approaches typically do not account for the underlying causes of MVs and treat all missing data uniformly, potentially introducing biases that affect the biological validity of the conclusions drawn from the imputed datasets. We found a strong negative correlation between the proportion of MVs and the average intensity for the individual protein, with more abundant proteins having fewer, but rarely zero, MVs. We divided the peptides from all proteins into nine bins based on their intensities and proportion of MVs. Assuming the causes of MVs could differ between regions, we then investigated the optimal imputation method in each bin, using normalized root mean square error (NRMSE), and found that the optimal imputation method varies across bins. A mix-imputed dataset was assembled using the optimal imputation method from each bin, and it was confirmed to exhibit low deviation from the original unimputed dataset, demonstrating that mixing the optimal imputation method from each bin is a reliable strategy.
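
The bin-wise strategy outlined in the description can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 3 x 3 quantile grid, the function names (`assign_bins`, `best_method_per_bin`, `mix_impute`), the row-mean and row-minimum imputers (simple stand-ins for methods such as Bayesian PCA or random forest), the hold-out-one-value-per-row masking scheme, the NRMSE normalization by the standard deviation of the held-out values, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np

def nrmse(true, imputed):
    """RMSE divided by the std of the true values (one common NRMSE
    convention, assumed here). Falls back to plain RMSE if the std is 0."""
    true = np.asarray(true, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    rmse = np.sqrt(np.mean((true - imputed) ** 2))
    sd = np.std(true)
    return float(rmse / sd) if sd > 0 else float(rmse)

def assign_bins(X, n=3):
    """Label each row (peptide) with one of n*n bins, from quantiles of its
    mean observed intensity and its missing-value proportion.
    Assumes every row has at least one observed value."""
    mv_rate = np.isnan(X).mean(axis=1)
    mean_int = np.nanmean(X, axis=1)
    q = np.linspace(0, 1, n + 1)[1:-1]          # inner quantile cut points
    i_bin = np.digitize(mean_int, np.quantile(mean_int, q))
    m_bin = np.digitize(mv_rate, np.quantile(mv_rate, q))
    return i_bin * n + m_bin

# Stand-in imputers (hypothetical; the record names BPCA, random forest, etc.)
def impute_row_mean(X):
    return np.where(np.isnan(X), np.nanmean(X, axis=1, keepdims=True), X)

def impute_row_min(X):
    return np.where(np.isnan(X), np.nanmin(X, axis=1, keepdims=True), X)

def best_method_per_bin(X, bins, methods, rng):
    """Within each bin, hide one observed value per row (rows with >= 2
    observed values only), impute, and keep the method with lowest NRMSE."""
    best = {}
    for b in np.unique(bins):
        sub = X[bins == b]
        masked = sub.copy()
        held = []                                # (row, col) of hidden values
        for r in range(sub.shape[0]):
            obs = np.flatnonzero(~np.isnan(sub[r]))
            if obs.size >= 2:
                c = rng.choice(obs)
                masked[r, c] = np.nan
                held.append((r, c))
        if not held:
            continue                             # nothing to score in this bin
        rows, cols = map(np.array, zip(*held))
        truth = sub[rows, cols]
        scores = {name: nrmse(truth, f(masked)[rows, cols])
                  for name, f in methods.items()}
        best[b] = min(scores, key=scores.get)
    return best

def mix_impute(X, bins, methods, best):
    """Impute each bin with its selected method (first method as fallback)."""
    out = X.copy()
    default = next(iter(methods))
    for b in np.unique(bins):
        idx = bins == b
        out[idx] = methods[best.get(b, default)](X[idx])
    return out

# Synthetic log-intensity matrix: 60 peptides x 8 samples, ~20% missing.
rng = np.random.default_rng(0)
X = rng.normal(20.0, 2.0, size=(60, 8))
mask = rng.random(X.shape) < 0.2
mask[:, 0] = False                               # keep >= 1 observed value per row
X[mask] = np.nan

methods = {"row_mean": impute_row_mean, "row_min": impute_row_min}
bins = assign_bins(X)
best = best_method_per_bin(X, bins, methods, rng)
mixed = mix_impute(X, bins, methods, best)
```

Here `best` maps each bin label to the name of the lowest-NRMSE stand-in method, and `mixed` is the assembled matrix in which observed values are left unchanged and each bin's gaps are filled by its selected method.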
format Article
id doaj-art-e3586cebf7ed43a6aa3daab35d8dd4a4
institution Kabale University
issn 2001-0370
language English
publishDate 2025-01-01
publisher Elsevier
record_format Article
series Computational and Structural Biotechnology Journal
spelling doaj-art-e3586cebf7ed43a6aa3daab35d8dd4a4
2025-08-20T03:49:03Z
eng
Elsevier
Computational and Structural Biotechnology Journal, ISSN 2001-0370
2025-01-01, volume 27, pages 1818-1826
doi:10.1016/j.csbj.2025.04.041
Optimizing imputation strategies for mass spectrometry-based proteomics considering intensity and missing value rates
Yuming Shi, Huan Zhong, Jason C. Rogalski, Leonard J. Foster (corresponding author)
Department of Biochemistry and Molecular Biology, Michael Smith Laboratories, Life Sciences Institute, University of British Columbia, Vancouver, BC, V6T 1Z4 Canada (all four authors)
http://www.sciencedirect.com/science/article/pii/S200103702500162X
Keywords: Missing value; Missing value imputation; Proteomics; Mass spectrometry
title Optimizing imputation strategies for mass spectrometry-based proteomics considering intensity and missing value rates
topic Missing value
Missing value imputation
Proteomics
Mass spectrometry
url http://www.sciencedirect.com/science/article/pii/S200103702500162X