Extending gene set variation analysis with a reference dataset to stabilize scores

Bibliographic Details
Main Authors: Lorin Towle-Miller, William Jordan, Alexandre Lockhart, Johannes Freudenburg, Aman Virmani, Mandy Bergquist, Jeffrey Miecznikowski, Will Powley
Format: Article
Language: English
Published: BMC 2025-07-01
Series: BMC Genomics
Subjects: Gene signatures, Sequencing, Pathway analysis, Bayesian analysis
Online Access: https://doi.org/10.1186/s12864-025-11769-6
collection DOAJ
description Abstract
Background: Biological pathways are sets of genes that jointly drive biological processes. Rather than analyzing genes individually, it is common practice to summarize sets of related genes using gene set variation analysis (GSVA). In short, GSVA summarizes a set of genes into a single score bounded between -1 and 1, where negative values suggest downregulation and positive values suggest upregulation. Although this interpretation is simple in theory, it depends on unbiased estimation of individual gene distributions. In the current version of GSVA, gene distributions are estimated using the input dataset (i.e., the scores are calculated based on the gene distributions from the same dataset). This becomes a major issue when study data do not adequately represent the full distribution of the population. For example, if RNA-seq data were collected on an imbalanced sample (e.g., more disease samples than healthy controls), it would be difficult to discern abnormalities in pathway activity, since the gene distributions were estimated on a biased population. Therefore, we propose reference stabilizing GSVA (rsGSVA), which addresses this commonly ignored limitation by using reference datasets to estimate the gene distributions for a more stable GSVA score.
Results: rsGSVA shows comparable power to classic GSVA, singscore, and ssGSEA under ideal settings while demonstrating stable scores on sample subsets. An application on irritable bowel disease highlights interpretational advantages of rsGSVA over other methods in up/down regulation of inflammation signatures.
Conclusions: The rsGSVA technique enhances the GSVA functionality by incorporating a reference dataset. This integration of a reference dataset makes the enrichment scores independent of the input distribution and ensures their stability and reproducibility, even as samples are added or removed.
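The dependence on the input dataset described in the abstract can be illustrated with a toy sketch. This is not the authors' rsGSVA implementation and not the GSVA algorithm itself (GSVA uses kernel-estimated gene distributions and a Kolmogorov-Smirnov-like random walk statistic); it is a simplified stand-in that assumes a per-gene empirical CDF and averages the centered values over a gene set. All names (`ecdf_scores`, `toy_set_score`, the simulated `reference`, `healthy`, and `disease` cohorts) are hypothetical and the data are simulated.

```python
import numpy as np

def ecdf_scores(values, background):
    """Per gene, the fraction of background samples <= each value.

    values: (genes, samples); background: (genes, background_samples).
    """
    return (background[:, None, :] <= values[:, :, None]).mean(axis=2)

def toy_set_score(expr, gene_idx, background):
    """Toy gene set score in [-1, 1]: mean of 2*ECDF - 1 over the set's genes.

    A simplified stand-in for a GSVA-style score, for illustration only.
    """
    centered = 2.0 * ecdf_scores(expr[gene_idx], background[gene_idx]) - 1.0
    return centered.mean(axis=0)

rng = np.random.default_rng(0)
n_genes = 20
gene_idx = np.arange(5)  # hypothetical gene set: the first five genes

# Simulated cohorts (assumptions): a large healthy reference and an
# imbalanced study with 5 healthy and 45 disease samples.
reference = rng.normal(0.0, 1.0, size=(n_genes, 200))
healthy = rng.normal(0.0, 1.0, size=(n_genes, 5))
disease = rng.normal(0.0, 1.0, size=(n_genes, 45))
disease[gene_idx] += 2.0  # the gene set is upregulated in disease
study = np.hstack([healthy, disease])

# Gene distributions estimated from the input dataset vs. from the reference.
self_scores = toy_set_score(study, gene_idx, study)
ref_scores = toy_set_score(study, gene_idx, reference)

# Re-score after dropping the healthy samples from the study.
subset = study[:, 5:]
self_subset = toy_set_score(subset, gene_idx, subset)
ref_subset = toy_set_score(subset, gene_idx, reference)

print("input-based, disease mean:    ", self_scores[5:].mean())
print("reference-based, disease mean:", ref_scores[5:].mean())
print("reference scores unchanged on subset:",
      np.allclose(ref_subset, ref_scores[5:]))
print("input-based scores unchanged on subset:",
      np.allclose(self_subset, self_scores[5:]))
```

In this sketch the reference-based scores of the disease samples do not change when the healthy samples are dropped, because the background distribution is fixed, whereas the input-based scores shift with the composition of the study.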
id doaj-art-18c51a6c521f4e2dabb6a8ab5244f61c
institution Kabale University
issn 1471-2164
affiliation Lorin Towle-Miller: GSK, Biostatistics
William Jordan: GSK, Computational Biology
Alexandre Lockhart: GSK, Biostatistics
Johannes Freudenburg: GSK, Computational Biology
Aman Virmani: GSK, Computational Biology
Mandy Bergquist: GSK, Biostatistics
Jeffrey Miecznikowski: Biostatistics Department, University at Buffalo
Will Powley: GSK, Biostatistics
topic Gene signatures
Sequencing
Pathway analysis
Bayesian analysis