A generative adversarial network for multiple reads reconstruction in DNA storage

Bibliographic Details
Main Authors: Xiaodong Zheng, Ranze Xie, Xiangyu Yao, Yanqing Su, Ling Chu, Peng Xu, Wenbin Liu
Format: Article
Language: English
Published: Nature Portfolio 2024-12-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-024-83806-5
Description
Summary: DNA storage is widely considered a promising solution to the data explosion problem. However, the synthesis, PCR, and sequencing processes usually produce erroneous reads involving base insertions, deletions, and substitutions. The situation is even more serious with third-generation sequencing technologies. Unlike previous error-correction and multiple sequence alignment methods, we first transform the multiple reads into a noisy image and then construct a conditional generative adversarial network to produce a “smooth” image that corresponds to the consensus sequence. Results on two real datasets demonstrate that our model can completely reconstruct the tested sequences at error rates as high as 5.9%. This means that the proposed DNA-GAN can be applied to third-generation nanopore sequencing environments, whereas the transformer-based models have been tested only on next-generation sequencing datasets. Furthermore, DNA-GAN exhibits excellent robustness even when as many as 20% of the clusters are contaminated with irrelevant reads. To the best of our knowledge, this work is the first to use a GAN for multi-read reconstruction in DNA-based storage.
ISSN: 2045-2322
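
Illustration: the summary describes turning a cluster of reads into a noisy image before a conditional GAN produces the consensus. This record does not specify the encoding the article uses, so the following is only a minimal sketch under assumptions: a one-hot layout with one row per read and one channel per base, and hypothetical helper names (`reads_to_image`, `column_majority`). The per-column vote is a naive stand-in for the denoising step, not the DNA-GAN model.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order; not taken from the article


def reads_to_image(reads, width):
    """Stack reads into a (n_reads, width, 4) one-hot array.

    Reads longer than `width` are truncated; shorter reads are
    zero-padded on the right. Unknown symbols are left as all zeros.
    """
    image = np.zeros((len(reads), width, len(BASES)), dtype=np.float32)
    for row, read in enumerate(reads):
        for col, base in enumerate(read[:width]):
            idx = BASES.find(base)
            if idx >= 0:
                image[row, col, idx] = 1.0
    return image


def column_majority(image):
    """Naive baseline: pick the most frequent base in each column."""
    counts = image.sum(axis=0)  # shape (width, 4)
    return "".join(BASES[i] for i in counts.argmax(axis=1))


# Toy cluster: one substitution (third read) and one truncated read.
reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTA"]
image = reads_to_image(reads, width=6)
consensus = column_majority(image)
print(image.shape, consensus)
```

A column-wise vote like this handles isolated substitutions but not insertions or deletions, which shift every later column of a read. That limitation is the kind of noise the summary says the conditional GAN is meant to smooth.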