Regularized sequence-context mutational trees capture variation in mutation rates across the human genome.

Bibliographic Details
Main Authors: Christopher J Adams, Mitchell Conery, Benjamin J Auerbach, Shane T Jensen, Iain Mathieson, Benjamin F Voight
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2023-07-01
Series: PLoS Genetics
Online Access: https://journals.plos.org/plosgenetics/article/file?id=10.1371/journal.pgen.1010807&type=printable
collection DOAJ
description Germline mutation is the mechanism by which genetic variation in a population is created. Inferences derived from mutation rate models are fundamental to many population genetics methods. Previous models have demonstrated that nucleotides flanking polymorphic sites (the local sequence context) explain variation in the probability that a site is polymorphic. However, these models have limitations as the size of the local sequence context window expands. These include a lack of robustness to data sparsity at typical sample sizes, a lack of regularization to generate parsimonious models, and a lack of quantified uncertainty in estimated rates to facilitate comparison between models. To address these limitations, we developed Baymer, a regularized Bayesian hierarchical tree model that captures the heterogeneous effect of sequence contexts on polymorphism probabilities. Baymer implements an adaptive Metropolis-within-Gibbs Markov Chain Monte Carlo sampling scheme to estimate the posterior distributions of sequence-context-based probabilities that a site is polymorphic. We show that Baymer accurately infers polymorphism probabilities and well-calibrated posterior distributions, robustly handles data sparsity, appropriately regularizes to return parsimonious models, and scales computationally at least up to 9-mer context windows. We demonstrate application of Baymer in three ways: first, identifying differences in polymorphism probabilities between continental populations in the 1000 Genomes Phase 3 dataset; second, examining, in a sparse data setting, the use of polymorphism models as a proxy for de novo mutation probabilities as a function of variant age, sequence context window size, and demographic history; and third, comparing model concordance between different great ape species. We find a shared context-dependent mutation rate architecture underlying our models, enabling a transfer-learning-inspired strategy for modeling germline mutations. In summary, Baymer is an accurate polymorphism probability estimation algorithm that automatically adapts to data sparsity at different sequence context levels, thereby making efficient use of the available data.
id doaj-art-46701a5fce6b4e619ee0f573d01e5c25
institution Kabale University
issn 1553-7390
1553-7404
citation PLoS Genetics, vol. 19, no. 7, e1010807 (2023-07-01). DOI: 10.1371/journal.pgen.1010807