Optimizing imputation strategies for mass spectrometry-based proteomics considering intensity and missing value rates

Missing values (MVs) in omic datasets affect the power, accuracy, and consistency of statistical and functional analyses. In mass spectrometry (MS)-based proteomics, MVs can arise for several reasons: peptides could be below instrumental detection limits, peptides or proteins might be absent or d...

Bibliographic Details
Main Authors:	Yuming Shi, Huan Zhong, Jason C. Rogalski, Leonard J. Foster
Format:	Article
Language:	English
Published:	Elsevier 2025-01-01
Series:	Computational and Structural Biotechnology Journal
Subjects:	Missing value; Missing value imputation; Proteomics; Mass spectrometry
Online Access:	http://www.sciencedirect.com/science/article/pii/S200103702500162X

_version_	1849323351581917184
author	Yuming Shi, Huan Zhong, Jason C. Rogalski, Leonard J. Foster
collection	DOAJ
description	Missing values (MVs) in omic datasets affect the power, accuracy, and consistency of statistical and functional analyses. In mass spectrometry (MS)-based proteomics, MVs can arise for several reasons: peptides could be below instrumental detection limits, peptides or proteins might be absent or depleted from the sample for biological or technical reasons, or data processing could fail to detect a real signal. Several statistical and machine-learning methods have been described for imputing MVs in proteomics, such as Bayesian PCA estimation, random forest, and collaborative filtering. However, these approaches typically do not account for the underlying causes of MVs and treat all missing data uniformly, potentially introducing biases that affect the biological validity of the conclusions drawn from the imputed datasets. We found a strong negative correlation between the proportion of MVs and the average intensity of individual proteins, with more abundant proteins having fewer, but rarely zero, MVs. We divided the peptides from all proteins into nine bins based on their intensities and proportions of MVs. Assuming that the causes of MVs could differ across these regions, we then investigated the optimal imputation method in each bin, using normalized root mean square error (NRMSE), and found that the optimal imputation method varies across bins. A mix-imputed dataset was assembled using the optimal imputation method from each bin, and it was confirmed to exhibit low deviation from the original unimputed dataset, demonstrating that mixing the optimal imputation methods from each bin is a reliable strategy.
format	Article
id	doaj-art-e3586cebf7ed43a6aa3daab35d8dd4a4
institution	Kabale University
issn	2001-0370
language	English
publishDate	2025-01-01
publisher	Elsevier
record_format	Article
series	Computational and Structural Biotechnology Journal
spelling	doaj-art-e3586cebf7ed43a6aa3daab35d8dd4a4; 2025-08-20T03:49:03Z; eng; Elsevier; Computational and Structural Biotechnology Journal; 2001-0370; 2025-01-01; 27; 1818; 1826; 10.1016/j.csbj.2025.04.041; Optimizing imputation strategies for mass spectrometry-based proteomics considering intensity and missing value rates; Yuming Shi (0), Huan Zhong (1), Jason C. Rogalski (2), Leonard J. Foster (3, corresponding author); Department of Biochemistry and Molecular Biology, Michael Smith Laboratories, Life Sciences Institute, University of British Columbia, Vancouver, BC, V6T 1Z4 Canada (all authors); http://www.sciencedirect.com/science/article/pii/S200103702500162X; Missing value; Missing value imputation; Proteomics; Mass spectrometry
title	Optimizing imputation strategies for mass spectrometry-based proteomics considering intensity and missing value rates
topic	Missing value; Missing value imputation; Proteomics; Mass spectrometry
url	http://www.sciencedirect.com/science/article/pii/S200103702500162X

Optimizing imputation strategies for mass spectrometry-based proteomics considering intensity and missing value rates
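
The procedure summarized in the description (bin peptides by intensity and MV proportion, score candidate imputation methods per bin by NRMSE, then assemble a mix-imputed dataset) might be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the 3 x 3 quantile grid, the NRMSE normalization by the standard deviation of the true values, the hide-one-observed-value scoring scheme, and all function names are hypothetical, and the three simple imputers are stand-ins for methods such as Bayesian PCA or random forest. Every peptide row is assumed to have at least one observed value.

```python
import numpy as np

def nrmse(truth, estimate):
    """RMSE scaled by the SD of the true values (one common convention, assumed here)."""
    scale = np.std(truth) or 1.0  # fall back to plain RMSE if the SD is zero
    return float(np.sqrt(np.mean((truth - estimate) ** 2)) / scale)

def assign_bins(X, n=3):
    """Label each row 0..n*n-1 using quantile cuts of mean intensity and MV rate."""
    mean_int = np.nanmean(X, axis=1)
    mv_rate = np.isnan(X).mean(axis=1)
    qs = np.linspace(0, 1, n + 1)[1:-1]
    i = np.searchsorted(np.quantile(mean_int, qs), mean_int, side="right")
    j = np.searchsorted(np.quantile(mv_rate, qs), mv_rate, side="right")
    return i * n + j

# Stand-in imputers (assumptions); each fills NaNs and leaves observed values alone.
def impute_row_mean(X):
    return np.where(np.isnan(X), np.nanmean(X, axis=1, keepdims=True), X)

def impute_row_min(X):
    return np.where(np.isnan(X), np.nanmin(X, axis=1, keepdims=True), X)

def impute_bin_median(X):
    return np.where(np.isnan(X), np.nanmedian(X), X)

def hide_one_per_row(X, rng):
    """Hide one observed value in each row that has at least two observed values."""
    hidden = np.zeros(X.shape, dtype=bool)
    for r in range(X.shape[0]):
        cols = np.flatnonzero(~np.isnan(X[r]))
        if cols.size >= 2:
            hidden[r, rng.choice(cols)] = True
    masked = X.copy()
    masked[hidden] = np.nan
    return masked, hidden

def mix_impute(X, methods, seed=0):
    """Pick the lowest-NRMSE method per bin, then impute each bin with its winner."""
    rng = np.random.default_rng(seed)
    bins = assign_bins(X)
    masked, hidden = hide_one_per_row(X, rng)
    result, chosen = X.copy(), {}
    for b in np.unique(bins):
        rows = bins == b
        h = hidden[rows]
        if h.any():
            scores = {name: nrmse(X[rows][h], f(masked[rows])[h])
                      for name, f in methods.items()}
            best = min(scores, key=scores.get)
        else:
            best = next(iter(methods))  # nothing to score; use the first method
        chosen[int(b)] = best
        result[rows] = methods[best](X[rows])
    return result, chosen
```

Calling `mix_impute(X, {"row_mean": impute_row_mean, "row_min": impute_row_min, "bin_median": impute_bin_median})` on a peptides-by-samples matrix of log intensities with `NaN` for MVs returns the filled matrix and a mapping from bin label to the method chosen for that bin.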

Similar Items