Optimizing imputation strategies for mass spectrometry-based proteomics considering intensity and missing value rates
Missing values (MVs) in omic datasets affect the power, accuracy, and consistency of statistical and functional analyses. In mass spectrometry (MS)-based proteomics, MVs can arise for several reasons: peptides could be below instrumental detection limits, peptides or proteins might be absent or d...
| Main Authors: | Yuming Shi, Huan Zhong, Jason C. Rogalski, Leonard J. Foster |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-01-01 |
| Series: | Computational and Structural Biotechnology Journal |
| Subjects: | Missing value; Missing value imputation; Proteomics; Mass spectrometry |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S200103702500162X |
| _version_ | 1849323351581917184 |
|---|---|
| author | Yuming Shi; Huan Zhong; Jason C. Rogalski; Leonard J. Foster |
| author_facet | Yuming Shi; Huan Zhong; Jason C. Rogalski; Leonard J. Foster |
| author_sort | Yuming Shi |
| collection | DOAJ |
| description | Missing values (MVs) in omic datasets affect the power, accuracy, and consistency of statistical and functional analyses. In mass spectrometry (MS)-based proteomics, MVs can arise for several reasons: peptides could be below instrumental detection limits, peptides or proteins might be absent or depleted from the sample for biological or technical reasons, or data processing could fail to detect a real signal. Several statistical and machine-learning methods have been described for imputing MVs in proteomics, such as Bayesian PCA estimation, random forest, and collaborative filtering. However, these approaches typically do not account for the underlying causes of MVs and treat all missing data uniformly, potentially introducing biases that affect the biological validity of the conclusions drawn from the imputed datasets. We found a strong negative correlation between the proportion of MVs and the average intensity of individual proteins, with more abundant proteins having fewer, but rarely zero, MVs. We divided the peptides from all proteins into nine bins based on their intensities and proportions of MVs. Assuming that the causes of MVs could differ between bins, we then investigated the optimal imputation method in each bin, using normalized root mean square error (NRMSE), and found that the optimal imputation method varies across bins. A mix-imputed dataset was assembled using the optimal imputation method from each bin, and it was confirmed to exhibit low deviation from the original unimputed dataset, demonstrating that mixing the optimal imputation methods from each bin is a reliable strategy. |
| format | Article |
| id | doaj-art-e3586cebf7ed43a6aa3daab35d8dd4a4 |
| institution | Kabale University |
| issn | 2001-0370 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | Elsevier |
| record_format | Article |
| series | Computational and Structural Biotechnology Journal |
| spelling | doaj-art-e3586cebf7ed43a6aa3daab35d8dd4a4; 2025-08-20T03:49:03Z; eng; Elsevier; Computational and Structural Biotechnology Journal; ISSN 2001-0370; 2025-01-01; vol. 27, pp. 1818-1826; doi:10.1016/j.csbj.2025.04.041; Optimizing imputation strategies for mass spectrometry-based proteomics considering intensity and missing value rates; Yuming Shi, Huan Zhong, Jason C. Rogalski, Leonard J. Foster (corresponding author), all of the Department of Biochemistry and Molecular Biology, Michael Smith Laboratories, Life Sciences Institute, University of British Columbia, Vancouver, BC, V6T 1Z4 Canada; http://www.sciencedirect.com/science/article/pii/S200103702500162X; Missing value; Missing value imputation; Proteomics; Mass spectrometry |
| title | Optimizing imputation strategies for mass spectrometry-based proteomics considering intensity and missing value rates |
| topic | Missing value; Missing value imputation; Proteomics; Mass spectrometry |
| url | http://www.sciencedirect.com/science/article/pii/S200103702500162X |
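
The per-bin selection described in the abstract (bin peptides by intensity and MV proportion, score candidate imputation methods by NRMSE, keep the best method per bin) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The record does not say how the nine bins are delimited, so a 3 x 3 grid with made-up cut points (`INTENSITY_CUTS`, `MV_CUTS`) is assumed. The two imputers are simple stand-ins rather than the Bayesian PCA, random forest, or collaborative filtering methods named in the abstract. NRMSE is taken here as RMSE divided by the standard deviation of the true values, which is one common convention and may differ from the article's definition. All function names are hypothetical.

```python
import math
import random
from statistics import mean, pvariance

# Assumed cut points: the record does not state the real bin boundaries.
INTENSITY_CUTS = (20.0, 25.0)  # hypothetical log2-intensity boundaries
MV_CUTS = (0.2, 0.5)           # hypothetical MV-proportion boundaries


def level(x, cuts):
    """Return 0, 1 or 2: the number of cut points that x reaches or exceeds."""
    return sum(x >= c for c in cuts)


def assign_bin(row):
    """Map one peptide (intensities, None = missing) to an (intensity, MV) bin."""
    observed = [v for v in row if v is not None]
    mv_rate = (len(row) - len(observed)) / len(row)
    return level(mean(observed), INTENSITY_CUTS), level(mv_rate, MV_CUTS)


def nrmse(truth, imputed):
    """RMSE normalized by the population standard deviation of the true values."""
    mse = mean([(t - i) ** 2 for t, i in zip(truth, imputed)])
    return math.sqrt(mse / pvariance(truth))


# Stand-in imputers for illustration only (not the methods from the article).
IMPUTERS = {
    "row_mean": lambda kept: mean(kept),
    "row_min": lambda kept: min(kept),
}


def best_method_per_bin(data, mask_fraction=0.3, seed=0):
    """Hide some observed values, impute them, and keep the lowest-NRMSE method per bin."""
    rng = random.Random(seed)
    truth = {}    # bin -> hidden true values
    guesses = {}  # bin -> method name -> imputed values (same order as truth)
    for row in data.values():
        observed_idx = [k for k, v in enumerate(row) if v is not None]
        if len(observed_idx) < 3:
            continue  # too few observations to hide one and still impute
        b = assign_bin(row)
        n_hide = min(len(observed_idx) - 1,
                     max(1, int(mask_fraction * len(observed_idx))))
        hidden = set(rng.sample(observed_idx, n_hide))
        kept = [row[k] for k in observed_idx if k not in hidden]
        for k in hidden:
            truth.setdefault(b, []).append(row[k])
            for name, impute in IMPUTERS.items():
                guesses.setdefault(b, {}).setdefault(name, []).append(impute(kept))
    result = {}
    for b, t in truth.items():
        if len(t) < 2 or pvariance(t) == 0:
            continue  # NRMSE is undefined without spread in the true values
        scores = {name: nrmse(t, g) for name, g in guesses[b].items()}
        result[b] = min(scores, key=scores.get)
    return result
```

Calling `best_method_per_bin(data)` on a dictionary that maps peptide identifiers to intensity lists returns one method name per populated bin. A mix-imputed dataset in the sense of the abstract would then apply each bin's selected method to the peptides in that bin.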