Transformers significantly improve splice site prediction

Bibliographic Details
Main Authors: Benedikt A. Jónsson, Gísli H. Halldórsson, Steinþór Árdal, Sölvi Rögnvaldsson, Eyþór Einarsson, Patrick Sulem, Daníel F. Guðbjartsson, Páll Melsted, Kári Stefánsson, Magnús Ö. Úlfarsson
Format: Article
Language: English
Published: Nature Portfolio 2024-12-01
Series: Communications Biology
Online Access: https://doi.org/10.1038/s42003-024-07298-9
Collection: DOAJ
Abstract: Mutations that affect RNA splicing significantly impact human diversity and disease. Here we present a method using transformers, a type of machine learning model, to detect splicing from raw 45,000-nucleotide sequences. We generate embeddings with residual neural networks and apply hard attention to select splice site candidates, enabling efficient training on long sequences. Our method surpasses the leading tool, SpliceAI, in detecting splice sites in GENCODE and ENSEMBL annotations. Using extensive RNA sequencing data from an Icelandic cohort of 17,848 individuals and the Genotype-Tissue Expression (GTEx) project, our method demonstrates superior performance in detecting splice junctions compared to SpliceAI-10k (PR-AUC = 0.834 vs. PR-AUC = 0.820) and is more effective at identifying disease-related splice variants in ClinVar (PR-AUC = 0.997 vs. PR-AUC = 0.996). These advancements hold promise for improving genetic research and clinical diagnostics, potentially leading to better understanding and treatment of splicing-related diseases.
Record ID: doaj-art-edb255f68b4b469996e1fb41ad38a6e0
Institution: Kabale University
ISSN: 2399-3642
Affiliation (all ten authors): deCODE Genetics/Amgen Inc.
Volume 7, Issue 1, Pages 1-9
Record timestamp: 2024-12-22T12:41:54Z
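
Illustrative note on the abstract: the record says only that hard attention is applied "to select splice site candidates, enabling efficient training on long sequences". One common way to realize hard attention is top-k selection over per-position scores, so that later attention layers see k positions instead of all 45,000; standard self-attention cost grows quadratically with the number of positions, which is why such a reduction helps. The sketch below shows that idea on toy data. It is not the authors' implementation: the top-k rule, the function names `select_candidates` and `gather`, and the scores and embeddings are all assumptions made for illustration.

```python
from typing import List, Sequence


def select_candidates(scores: Sequence[float], k: int) -> List[int]:
    """Hard attention as top-k selection (an assumed rule, not the paper's):
    keep the k highest-scoring positions, returned in sequence order."""
    if k <= 0:
        return []
    # Rank positions by score, highest first; ties keep the earlier position.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])


def gather(embeddings: Sequence[Sequence[float]],
           indices: Sequence[int]) -> List[List[float]]:
    """Collect the embedding vectors at the selected positions."""
    return [list(embeddings[i]) for i in indices]


# Toy input: 8 positions with hypothetical candidate scores and 2-d embeddings.
scores = [0.01, 0.90, 0.05, 0.02, 0.75, 0.03, 0.60, 0.04]
embeddings = [[float(i), float(i) * 0.5] for i in range(len(scores))]

kept = select_candidates(scores, k=3)
selected = gather(embeddings, kept)
print(kept)           # positions that would be passed on to the attention layers
print(len(selected))  # number of vectors kept out of the original 8
```

In a setup like this, only the vectors in `selected` would be fed to the transformer layers, which is what makes long input windows tractable.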