Optimizing imputation strategies for mass spectrometry-based proteomics considering intensity and missing value rates

Bibliographic Details
Main Authors: Yuming Shi, Huan Zhong, Jason C. Rogalski, Leonard J. Foster
Format: Article
Language: English
Published: Elsevier 2025-01-01
Series: Computational and Structural Biotechnology Journal
Online Access: http://www.sciencedirect.com/science/article/pii/S200103702500162X
Description
Summary: Missing values (MVs) in omic datasets affect the power, accuracy, and consistency of statistical and functional analyses. In mass spectrometry (MS)-based proteomics, MVs can arise for several reasons: peptides may fall below instrumental detection limits, peptides or proteins may be absent or depleted from the sample for biological or technical reasons, or data processing may fail to detect a real signal. Several statistical and machine-learning methods have been described for imputing MVs in proteomics, such as Bayesian PCA estimation, random forest, and collaborative filtering. However, these approaches typically do not account for the underlying causes of MVs and treat all missing data uniformly, potentially introducing biases that affect the biological validity of conclusions drawn from the imputed datasets. We found a strong negative correlation between the proportion of MVs and the average intensity of individual proteins, with more abundant proteins having fewer, but rarely zero, MVs. We divided the peptides from all proteins into nine bins based on their intensities and proportions of MVs. Assuming that the causes of MVs could differ between these regions, we then evaluated imputation methods in each bin using normalized root mean square error (NRMSE) and found that the optimal method varies across bins. A mix-imputed dataset was assembled using the optimal imputation method from each bin, and it was confirmed to exhibit low deviation from the original unimputed dataset, demonstrating that combining the optimal imputation method from each bin is a reliable strategy.
ISSN: 2001-0370
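
The workflow outlined in the summary (bin by intensity and MV proportion, score imputation methods per bin with NRMSE, then assemble a mix-imputed matrix) can be illustrated with a minimal sketch. This is not the authors' implementation, and several details are assumptions because the record does not specify them: the nine bins are formed here as a 3 x 3 grid of tertiles, NRMSE is taken as RMSE divided by the standard deviation of the true values, scoring is done by hiding a random fraction of observed values, and the two imputers (`impute_row_mean`, `impute_row_min`) are simple stand-ins for the methods named in the summary. All function names are hypothetical. The input is assumed to be a peptides x samples matrix of log intensities with NaN for MVs and at least one observed value per row.

```python
import numpy as np

def nrmse(true_vals, imputed_vals):
    """RMSE divided by the std of the true values (one common definition; an assumption)."""
    true_vals = np.asarray(true_vals, dtype=float)
    imputed_vals = np.asarray(imputed_vals, dtype=float)
    rmse = np.sqrt(np.mean((true_vals - imputed_vals) ** 2))
    sd = np.std(true_vals)
    return rmse / sd if sd > 0 else np.inf

def impute_row_mean(X):
    """Stand-in imputer: fill each MV with the mean of its row's observed values."""
    observed = ~np.isnan(X)
    counts = observed.sum(axis=1)
    sums = np.where(observed, X, 0.0).sum(axis=1)
    # Rows with no observed values fall back to the global mean.
    means = np.where(counts > 0, sums / np.maximum(counts, 1), X[observed].mean())
    return np.where(observed, X, means[:, None])

def impute_row_min(X):
    """Stand-in imputer: fill each MV with the minimum of its row's observed values."""
    observed = ~np.isnan(X)
    mins = np.where(observed, X, np.inf).min(axis=1)
    # Rows with no observed values fall back to the global minimum.
    mins = np.where(np.isfinite(mins), mins, X[observed].min())
    return np.where(observed, X, mins[:, None])

METHODS = {"row_mean": impute_row_mean, "row_min": impute_row_min}

def assign_bins(X):
    """Label each row 0..8 on a 3 x 3 grid of tertiles (assumed binning scheme)."""
    observed = ~np.isnan(X)
    mean_int = np.where(observed, X, 0.0).sum(axis=1) / np.maximum(observed.sum(axis=1), 1)
    mv_rate = 1.0 - observed.mean(axis=1)
    int_level = np.digitize(mean_int, np.quantile(mean_int, [1 / 3, 2 / 3]))
    mv_level = np.digitize(mv_rate, np.quantile(mv_rate, [1 / 3, 2 / 3]))
    return 3 * int_level + mv_level

def best_method_per_bin(X, bins, methods, mask_frac=0.1, seed=0):
    """Hide a fraction of observed values, impute, and keep the lowest-NRMSE method per bin."""
    rng = np.random.default_rng(seed)
    holdout = ~np.isnan(X) & (rng.random(X.shape) < mask_frac)
    X_masked = X.copy()
    X_masked[holdout] = np.nan
    imputed = {name: f(X_masked) for name, f in methods.items()}
    choice = {}
    for b in np.unique(bins):
        cells = holdout & (bins == b)[:, None]
        if not cells.any():
            continue  # no held-out values in this bin; the caller's default applies
        scores = {name: nrmse(X[cells], imp[cells]) for name, imp in imputed.items()}
        choice[int(b)] = min(scores, key=scores.get)
    return choice

def mix_impute(X, bins, methods, choice, default="row_mean"):
    """Assemble a matrix in which each bin's rows come from that bin's chosen method."""
    imputed = {name: f(X) for name, f in methods.items()}
    out = X.copy()
    for b in np.unique(bins):
        rows = bins == b
        out[rows] = imputed[choice.get(int(b), default)][rows]
    return out
```

A typical call sequence would be `bins = assign_bins(X)`, then `choice = best_method_per_bin(X, bins, METHODS)`, then `mix_impute(X, bins, METHODS, choice)`. Real imputers such as Bayesian PCA or random forest use information across rows, so in practice they would be fitted on the full matrix before the per-bin selection step.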