Extending gene set variation analysis with a reference dataset to stabilize scores
Abstract Background Biological pathways are sets of genes that jointly drive biological processes. Rather than analyzing genes individually, it is common practice to summarize sets of related genes using gene set variation analysis (GSVA). In short, GSVA summarizes a set of genes into a single score...
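| Main Authors: | Lorin Towle-Miller, William Jordan, Alexandre Lockhart, Johannes Freudenburg, Aman Virmani, Mandy Bergquist, Jeffrey Miecznikowski, Will Powley |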
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC, 2025-07-01 |
| Series: | BMC Genomics |
| Subjects: | Gene signatures; Sequencing; Pathway analysis; Bayesian analysis |
| Online Access: | https://doi.org/10.1186/s12864-025-11769-6 |
| _version_ | 1849238745769836544 |
|---|---|
| author | Lorin Towle-Miller; William Jordan; Alexandre Lockhart; Johannes Freudenburg; Aman Virmani; Mandy Bergquist; Jeffrey Miecznikowski; Will Powley |
| author_sort | Lorin Towle-Miller |
| collection | DOAJ |
| description | Abstract Background Biological pathways are sets of genes that jointly drive biological processes. Rather than analyzing genes individually, it is common practice to summarize sets of related genes using gene set variation analysis (GSVA). In short, GSVA summarizes a set of genes into a single score bounded between -1 and 1, where negative values suggest downregulation and positive values suggest upregulation. Although this interpretation is simple in theory, it depends on unbiased estimation of individual gene distributions. In the current version of GSVA, gene distributions are estimated using the input dataset (i.e., the scores are calculated based on the gene distributions from the same dataset). This becomes a major issue when study data does not adequately represent the full distribution of the population. For example, if RNA-seq data was collected on an imbalanced sample (e.g., more disease samples than healthy controls), it would be difficult to discern abnormalities in pathway activity since the gene distributions were estimated on a biased population. Therefore, we propose reference stabilizing GSVA (rsGSVA), a solution to this commonly ignored limitation by using reference datasets to estimate the gene distributions for a more stable GSVA score. Results rsGSVA shows comparable power to classic GSVA, singscore, and ssGSEA under ideal settings while demonstrating stable scores on sample subsets. An application on irritable bowel disease highlights interpretational advantages of rsGSVA to other methods in up/down regulation of inflammation signatures. Conclusions The rsGSVA technique enhances the GSVA functionality by incorporating a reference dataset. This integration of a reference dataset makes the enrichment scores independent of the input distribution and ensures their stability and reproducibility, even as samples are added or removed. |
| format | Article |
| id | doaj-art-18c51a6c521f4e2dabb6a8ab5244f61c |
| institution | Kabale University |
| issn | 1471-2164 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | BMC |
| record_format | Article |
| series | BMC Genomics |
| spelling | doaj-art-18c51a6c521f4e2dabb6a8ab5244f61c; 2025-08-20T04:01:25Z; eng; BMC; BMC Genomics; 1471-2164; 2025-07-01; 261111; 10.1186/s12864-025-11769-6; Extending gene set variation analysis with a reference dataset to stabilize scores; Lorin Towle-Miller (GSK, Biostatistics); William Jordan (GSK, Computational Biology); Alexandre Lockhart (GSK, Biostatistics); Johannes Freudenburg (GSK, Computational Biology); Aman Virmani (GSK, Computational Biology); Mandy Bergquist (GSK, Biostatistics); Jeffrey Miecznikowski (Biostatistics Department, University at Buffalo); Will Powley (GSK, Biostatistics); https://doi.org/10.1186/s12864-025-11769-6; Gene signatures; Sequencing; Pathway analysis; Bayesian analysis |
| title | Extending gene set variation analysis with a reference dataset to stabilize scores |
| topic | Gene signatures; Sequencing; Pathway analysis; Bayesian analysis |
| url | https://doi.org/10.1186/s12864-025-11769-6 |
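
Illustrative note: the abstract's central point is that a sample's score changes when gene distributions are estimated from the input dataset, but not when they are estimated from a fixed reference dataset. The Python sketch below illustrates that idea only. It is not the authors' rsGSVA implementation and not the GSVA package API. The function names (`ecdf_position`, `toy_set_score`), the simulated data, and the scoring rule are all assumptions made for illustration. In particular, the score is a simplified stand-in: the mean empirical-CDF position of the set's genes, rescaled to [-1, 1], rather than GSVA's kernel-based estimation and random-walk statistic.

```python
import numpy as np


def ecdf_position(value, background):
    """Fraction of background observations <= value (empirical CDF)."""
    background = np.sort(np.asarray(background, dtype=float))
    return np.searchsorted(background, value, side="right") / background.size


def toy_set_score(sample, gene_idx, background):
    """Toy gene-set score in [-1, 1] (hypothetical, not the GSVA statistic).

    sample:     1-D array of expression values, one per gene
    gene_idx:   indices of the genes in the set
    background: 2-D array (genes x samples) used to estimate gene distributions
    """
    positions = np.array(
        [ecdf_position(sample[g], background[g]) for g in gene_idx]
    )
    # Map the mean CDF position from [0, 1] to [-1, 1].
    return float(2.0 * positions.mean() - 1.0)


# Simulated data (assumption): 20 genes, the first 5 form the gene set.
rng = np.random.default_rng(0)
n_genes = 20
gene_set = np.arange(5)

reference = rng.normal(0.0, 1.0, size=(n_genes, 200))  # fixed reference cohort
disease = rng.normal(0.0, 1.0, size=(n_genes, 30))
disease[gene_set] += 2.0                                # set is upregulated
healthy = rng.normal(0.0, 1.0, size=(n_genes, 3))
study = np.hstack([disease, healthy])                   # imbalanced study

sample = study[:, 0]  # one disease sample

# Distributions estimated from the input data: depends on which samples are present.
self_full = toy_set_score(sample, gene_set, study)
self_subset = toy_set_score(sample, gene_set, study[:, :10])

# Distributions estimated from the reference: unaffected by study composition.
ref_score = toy_set_score(sample, gene_set, reference)

print("input-based, all samples:", self_full)
print("input-based, subset     :", self_subset)
print("reference-based         :", ref_score)
```

Because the reference-based call never reads the study matrix, adding or removing study samples cannot change that sample's score, which is the stability property the abstract describes. The input-based calls generally give different values for the full study and the subset.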