Predict the degree of secondary structures of the encoding sequences in DNA storage by deep learning model
| Main Authors: | , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-07-01 |
| Series: | Scientific Reports |
| Subjects: | |
| Online Access: | https://doi.org/10.1038/s41598-025-05717-3 |
| Summary: | DNA storage is widely considered a promising alternative medium for exponentially growing data. However, inherently complex secondary structures severely compromise synthesis, PCR amplification, and sequencing, interfering with reliable information recovery. In large-scale storage applications, effectively circumventing these negative effects is a critical problem. Because secondary structures are formed by contiguous bases with reverse-complementary relations and are accompanied by the release of free energy, we construct a BiLSTM-Transformer model with k-mer embedding to predict the free energy of sequences and then screen out the sequences with high values. K-mer embedding captures the characteristics of contiguous base pairings through overlapping short subsequences, which facilitates free-energy prediction. Simulation results show that the BiLSTM-Transformer model with k-mer embedding achieves better prediction performance than the other deep learning models compared. Application to a real dataset shows that the proposed model can screen out the top high-risk sequences, which are prone to more read errors and fewer retrieved copies in real DNA storage. Screening for top high-risk sequences can therefore serve as a proactive step to prevent severe secondary structures, providing a solution for reliable information retrieval. |
| ISSN: | 2045-2322 |
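
The summary describes k-mer embedding as splitting each sequence into overlapping short subsequences, and secondary structures as arising from reverse-complementary runs of bases. The following is a minimal Python sketch of those two ideas only, not the authors' implementation: the function names (`kmer_tokens`, `kmer_vocab`, `encode`, `reverse_complement`), the choice of `k=3`, and the stride of 1 are all assumptions made for illustration. The integer indices it produces are the kind of input an embedding layer would typically consume.

```python
from itertools import product


def kmer_tokens(seq, k=3):
    """Split a DNA sequence into overlapping k-mers with stride 1 (assumed)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_vocab(k=3, alphabet="ACGT"):
    """Map every possible k-mer over the alphabet to an integer index."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}


def encode(seq, k=3):
    """Convert a sequence to k-mer indices (hypothetical embedding input)."""
    vocab = kmer_vocab(k)
    return [vocab[token] for token in kmer_tokens(seq, k)]


def reverse_complement(seq):
    """Return the reverse complement, the relation that underlies base pairing."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(seq.upper()))
```

For example, `kmer_tokens("ACGTAC")` yields the four overlapping 3-mers `ACG`, `CGT`, `GTA`, and `TAC`, so adjacent tokens share two bases and local pairing context is preserved across token boundaries.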