Maptcha: an efficient parallel workflow for hybrid genome scaffolding

Abstract. Background: Genome assembly, which involves reconstructing a target genome, relies on scaffolding methods to organize and link partially assembled fragments. The rapid evolution of long read sequencing technologies toward more accurate long reads, coupled with the continued use of short read...

Bibliographic Details
Main Authors:	Oieswarya Bhowmik, Tazin Rahman, Ananth Kalyanaraman
Format:	Article
Language:	English
Published:	BMC 2024-08-01
Series:	BMC Bioinformatics
Subjects:	Genome assembly, Hybrid scaffolding, Long read mapping, Sketching
Online Access:	https://doi.org/10.1186/s12859-024-05878-4

_version_	1846136777571565568
author	Oieswarya Bhowmik, Tazin Rahman, Ananth Kalyanaraman
author_sort	Oieswarya Bhowmik
collection	DOAJ
description	Abstract. Background: Genome assembly, which involves reconstructing a target genome, relies on scaffolding methods to organize and link partially assembled fragments. The rapid evolution of long read sequencing technologies toward more accurate long reads, coupled with the continued use of short read technologies, has created a unique need for hybrid assembly workflows. The construction of accurate genomic scaffolds in hybrid workflows is complicated due to scale, sequencing technology diversity (e.g., short vs. long reads, contigs or partial assemblies), and repetitive regions within a target genome. Results: In this paper, we present a new parallel workflow for hybrid genome scaffolding that would allow combining pre-constructed partial assemblies with newly sequenced long reads toward an improved assembly. More specifically, the workflow, called Maptcha, is aimed at generating long scaffolds of a target genome from two sets of input sequences: an already constructed partial assembly of contigs, and a set of newly sequenced long reads. Our scaffolding approach internally uses an alignment-free mapping step to build a ⟨contig, contig⟩ graph using long reads as linking information. Subsequently, this graph is used to generate scaffolds. We present and evaluate a graph-theoretic “wiring” heuristic to perform this scaffolding step. To enable efficient workload management in a parallel setting, we use a batching technique that partitions the scaffolding tasks so that the more expensive alignment-based assembly step at the end can be efficiently parallelized. This step also allows the use of any standalone assembler for generating the final scaffolds. Conclusions: Our experiments with Maptcha on a variety of input genomes, and comparison against two state-of-the-art hybrid scaffolders demonstrate that Maptcha is able to generate longer and more accurate scaffolds substantially faster. In almost all cases, the scaffolds produced by Maptcha are at least an order of magnitude longer (in some cases two orders) than the scaffolds produced by state-of-the-art tools. Maptcha runs significantly faster too, reducing time-to-solution from hours to minutes for most input cases. We also performed a coverage experiment by varying the sequencing coverage depth for long reads, which demonstrated the potential of Maptcha to generate significantly longer scaffolds in low coverage settings (1×–10×).
format	Article
id	doaj-art-8855adf1018045b6a1925844aeb6bc17
institution	Kabale University
issn	1471-2105
language	English
publishDate	2024-08-01
publisher	BMC
record_format	Article
series	BMC Bioinformatics
spelling	doaj-art-8855adf1018045b6a1925844aeb6bc17; 2024-12-08T12:47:34Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2024-08-01; 251127; 10.1186/s12859-024-05878-4; Maptcha: an efficient parallel workflow for hybrid genome scaffolding; Oieswarya Bhowmik (School of Electrical Engineering and Computer Science, Washington State University); Tazin Rahman (School of Electrical Engineering and Computer Science, Washington State University); Ananth Kalyanaraman (School of Electrical Engineering and Computer Science, Washington State University); https://doi.org/10.1186/s12859-024-05878-4; Genome assembly; Hybrid scaffolding; Long read mapping; Sketching
title	Maptcha: an efficient parallel workflow for hybrid genome scaffolding
topic	Genome assembly, Hybrid scaffolding, Long read mapping, Sketching
url	https://doi.org/10.1186/s12859-024-05878-4

Maptcha: an efficient parallel workflow for hybrid genome scaffolding
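
The abstract describes an alignment-free mapping step that builds a ⟨contig, contig⟩ graph with long reads as the linking information. The toy sketch below illustrates that general idea only; it is not Maptcha's implementation. Everything in it is an assumption made for illustration: the function names (`kmers`, `sketch`, `map_contigs_to_reads`, `build_contig_graph`), the bottom-k sketch using CRC32 as the hash, the containment threshold, and the parameter defaults. It also ignores reverse complements and sequencing errors, which any real scaffolder has to handle.

```python
import zlib
from collections import Counter
from itertools import combinations


def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq, k, size):
    """Bottom-`size` sketch: the k-mers with the smallest CRC32 hash values."""
    ranked = sorted(kmers(seq, k), key=lambda km: (zlib.crc32(km.encode()), km))
    return set(ranked[:size])


def map_contigs_to_reads(contigs, reads, k=15, size=20, min_containment=0.5):
    """Map each contig to every long read that contains enough of its sketch.

    No base-level alignment is performed: a contig "hits" a read when at least
    `min_containment` of the contig's sketch k-mers also occur in the read.
    Returns {read_id: [contig_id, ...]}.
    """
    read_kmers = {rid: kmers(seq, k) for rid, seq in reads.items()}
    hits = {rid: [] for rid in reads}
    for cid, cseq in contigs.items():
        sk = sketch(cseq, k, size)
        if not sk:
            continue  # contig shorter than k, so it has no k-mers to compare
        for rid, rk in read_kmers.items():
            if len(sk & rk) / len(sk) >= min_containment:
                hits[rid].append(cid)
    return hits


def build_contig_graph(hits):
    """Count, for each contig pair, how many long reads link the two contigs."""
    edges = Counter()
    for cids in hits.values():
        for a, b in combinations(sorted(cids), 2):
            edges[(a, b)] += 1
    return edges
```

In this sketch, two contigs are joined by an edge whenever a single long read contains both of them, and the edge weight counts the supporting reads. A scaffolding heuristic such as the "wiring" step mentioned in the abstract would then operate on a graph of this kind; that step is not sketched here because the record gives no details of it.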

Similar Items