Extending gene set variation analysis with a reference dataset to stabilize scores

Abstract Background Biological pathways are sets of genes that jointly drive biological processes. Rather than analyzing genes individually, it is common practice to summarize sets of related genes using gene set variation analysis (GSVA). In short, GSVA summarizes a set of genes into a single score...

Bibliographic Details
Main Authors:	Lorin Towle-Miller, William Jordan, Alexandre Lockhart, Johannes Freudenburg, Aman Virmani, Mandy Bergquist, Jeffrey Miecznikowski, Will Powley
Format:	Article
Language:	English
Published:	BMC 2025-07-01
Series:	BMC Genomics
Subjects:	Gene signatures, Sequencing, Pathway analysis, Bayesian analysis
Online Access:	https://doi.org/10.1186/s12864-025-11769-6

collection	DOAJ
description	Abstract. Background: Biological pathways are sets of genes that jointly drive biological processes. Rather than analyzing genes individually, it is common practice to summarize sets of related genes using gene set variation analysis (GSVA). In short, GSVA summarizes a set of genes into a single score bounded between -1 and 1, where negative values suggest downregulation and positive values suggest upregulation. Although this interpretation is simple in theory, it depends on unbiased estimation of individual gene distributions. In the current version of GSVA, gene distributions are estimated from the input dataset (i.e., the scores are calculated from the gene distributions of the same dataset being scored). This becomes a major issue when the study data do not adequately represent the full distribution of the population. For example, if RNA-seq data were collected on an imbalanced sample (e.g., more disease samples than healthy controls), it would be difficult to discern abnormalities in pathway activity, since the gene distributions were estimated on a biased population. Therefore, we propose reference stabilizing GSVA (rsGSVA), which addresses this commonly ignored limitation by using reference datasets to estimate the gene distributions, yielding a more stable GSVA score. Results: rsGSVA shows comparable power to classic GSVA, singscore, and ssGSEA under ideal settings while demonstrating stable scores on sample subsets. An application to irritable bowel disease highlights the interpretational advantages of rsGSVA over other methods for up/down regulation of inflammation signatures. Conclusions: The rsGSVA technique enhances GSVA functionality by incorporating a reference dataset. This integration makes the enrichment scores independent of the input distribution and ensures their stability and reproducibility, even as samples are added or removed.
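The dependence on the input dataset that the abstract describes can be illustrated with a toy sketch. This is not the authors' rsGSVA implementation and not GSVA itself: it replaces GSVA's kernel-based distribution estimate with a plain empirical CDF for a single gene and omits the later ranking and enrichment steps. The function name `ecdf_position` and all expression values are hypothetical.

```python
from bisect import bisect_right


def ecdf_position(value, background):
    """Fraction of background values <= value (plain empirical CDF)."""
    ordered = sorted(background)
    return bisect_right(ordered, value) / len(ordered)


# Hypothetical expression values for one gene (illustrative only).
reference = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # assumed balanced reference
study = [6.0, 7.0, 7.5, 8.0, 3.0]  # imbalanced study: mostly high-expression samples

sample = 6.0

# Scored against its own (imbalanced) study, the sample looks low.
within_study = ecdf_position(sample, study)

# Scored against the reference, the same sample looks high.
against_reference = ecdf_position(sample, reference)

# Dropping one study sample shifts the study-based value,
# while the reference-based value does not depend on the study at all.
reduced_study = [v for v in study if v != 3.0]
within_reduced = ecdf_position(sample, reduced_study)

print(within_study, within_reduced, against_reference)
```

In this sketch the study-based position moves when a sample is removed, whereas the reference-based position stays fixed, which is the stability property the abstract attributes to using a reference dataset.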
id	doaj-art-18c51a6c521f4e2dabb6a8ab5244f61c
institution	Kabale University
issn	1471-2164
spelling	doaj-art-18c51a6c521f4e2dabb6a8ab5244f61c; 2025-08-20T04:01:25Z; eng; BMC; BMC Genomics; 1471-2164; 2025-07-01; 261111; 10.1186/s12864-025-11769-6; Extending gene set variation analysis with a reference dataset to stabilize scores; Lorin Towle-Miller (GSK, Biostatistics); William Jordan (GSK, Computational Biology); Alexandre Lockhart (GSK, Biostatistics); Johannes Freudenburg (GSK, Computational Biology); Aman Virmani (GSK, Computational Biology); Mandy Bergquist (GSK, Biostatistics); Jeffrey Miecznikowski (Biostatistics Department, University at Buffalo); Will Powley (GSK, Biostatistics); https://doi.org/10.1186/s12864-025-11769-6; Gene signatures; Sequencing; Pathway analysis; Bayesian analysis
