SurVIndel2: improving copy number variant calling from next-generation sequencing using hidden split reads


Bibliographic Details
Main Authors: Ramesh Rajaby, Wing-Kin Sung
Format: Article
Language: English
Published: Nature Portfolio 2024-12-01
Series: Nature Communications
Online Access: https://doi.org/10.1038/s41467-024-53087-7
collection DOAJ
description Abstract: Deletions and tandem duplications (commonly called CNVs) represent the majority of structural variations in a human genome. They can be identified using short reads, but because they frequently occur in repetitive regions, existing methods fail to detect most of them. This is because CNVs in repetitive regions often do not produce the evidence needed by existing short-read-based callers (split reads, discordant pairs or read depth change). Here, we introduce a new short-read-based CNV caller named SurVIndel2. SurVIndel2 builds on statistical techniques we previously developed, but also employs a novel type of evidence, hidden split reads, that can uncover many CNVs missed by existing algorithms. We use public benchmarks to show that SurVIndel2 outperforms other popular callers on both human and non-human datasets. Then, we demonstrate the practical utility of the method by generating a catalogue of CNVs for the 1000 Genomes Project that contains hundreds of thousands of CNVs missing from the most recent public catalogue. We also show that SurVIndel2 is able to complement small indels predicted by Google DeepVariant, and that the two tools, used in tandem, produce a remarkably complete catalogue of variants in an individual. Finally, we characterise how the limitations of current sequencing technologies contribute significantly to the missing CNVs.
id doaj-art-13a792885ae24f3184e46d22f7419c22
institution Kabale University
issn 2041-1723
spelling doaj-art-13a792885ae24f3184e46d22f7419c22; 2024-12-08T12:37:16Z; eng; Nature Portfolio; Nature Communications; 2041-1723; 2024-12-01; volume 15, issue 1, pages 1-16; 10.1038/s41467-024-53087-7; SurVIndel2: improving copy number variant calling from next-generation sequencing using hidden split reads; Ramesh Rajaby (Department of Chemical Pathology, The Chinese University of Hong Kong); Wing-Kin Sung (Department of Chemical Pathology, The Chinese University of Hong Kong); https://doi.org/10.1038/s41467-024-53087-7