Optimizing sequence data analysis using convolution neural network for the prediction of CNV bait positions
Abstract Background Accurate prediction of copy number variations (CNVs) from targeted capture next-generation sequencing (NGS) data relies on effective normalization of read coverage profiles. The normalization process is particularly challenging due to hidden systemic biases such as GC bias, which...
| Main Authors: | Zoltán Maróti, Peter Juma Ochieng, József Dombi, Miklós Krész, Tibor Kalmár |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC, 2024-12-01 |
| Series: | BMC Bioinformatics |
| Subjects: | Targeted capture; Oligo capture baits; Copy number variation; Machine learning |
| Online Access: | https://doi.org/10.1186/s12859-024-06006-y |
| _version_ | 1846100829239508992 |
|---|---|
| author | Zoltán Maróti; Peter Juma Ochieng; József Dombi; Miklós Krész; Tibor Kalmár |
| author_sort | Zoltán Maróti |
| collection | DOAJ |
| description | Abstract. Background: Accurate prediction of copy number variations (CNVs) from targeted capture next-generation sequencing (NGS) data relies on effective normalization of read coverage profiles. The normalization process is particularly challenging due to hidden systemic biases such as GC bias, which can significantly affect the sensitivity and specificity of CNV detection. In many cases, the kit manifests provide only the genome coordinates of the targeted regions, and the exact bait design of the oligo capture baits is not available. Although the on-target regions significantly overlap with the bait design, a lack of adequate information allows less accurate normalization of the coverage data. In this study, we propose a novel approach that utilizes a 1D convolution neural network (CNN) model to predict the positions of capture baits in complex whole-exome sequencing (WES) kits. By accurately identifying the exact positions of bait coordinates, our model enables precise normalization of GC bias across target regions, thereby allowing better CNV data normalization. Results: We evaluated the optimal hyperparameters, model architecture, and complexity to predict the likely positions of the oligo capture baits. Our analysis shows that the CNN models outperform the Dense NN for bait predictions. Batch normalization is the most important parameter for the stable training of CNN models. Our results indicate that the spatiality of the data plays an important role in the prediction performance. We have shown that combined input data, including experimental coverage, on-target information, and sequence data, are critical for bait prediction. Furthermore, comparison with the on-target information indicated that the CNN models performed better in predicting bait positions that exhibited a high degree of overlap (>90%) with the true bait positions. Conclusions: This study highlights the potential of utilizing CNN-based approaches to optimize coverage data analysis and improve copy number data normalization. Subsequent CNV detection based on these predicted coordinates facilitates more accurate measurement of coverage profiles and better normalization for GC bias. As a result, this approach could reduce systemic bias and improve the sensitivity and specificity of CNV detection in genomic studies. |
| format | Article |
| id | doaj-art-5f408970a43d4515b4259ceefb4363e5 |
| institution | Kabale University |
| issn | 1471-2105 |
| language | English |
| publishDate | 2024-12-01 |
| publisher | BMC |
| record_format | Article |
| series | BMC Bioinformatics |
| spelling | doaj-art-5f408970a43d4515b4259ceefb4363e5; 2024-12-29T12:49:50Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2024-12-01; vol. 25, no. 1, pp. 1-20; 10.1186/s12859-024-06006-y; Optimizing sequence data analysis using convolution neural network for the prediction of CNV bait positions; Zoltán Maróti (Albert Szent-Györgyi Health Centre, University of Szeged); Peter Juma Ochieng (Interdisciplinary Research Development and Innovation Center of Excellence, Institute of Informatics, University of Szeged); József Dombi (HUN-REN SZTE Research Group on Artificial Intelligence, University of Szeged); Miklós Krész (InnoRenew CoE); Tibor Kalmár (Albert Szent-Györgyi Health Centre, University of Szeged); https://doi.org/10.1186/s12859-024-06006-y; Targeted capture; Oligo capture baits; Copy number variation; Machine learning |
| title | Optimizing sequence data analysis using convolution neural network for the prediction of CNV bait positions |
| topic | Targeted capture; Oligo capture baits; Copy number variation; Machine learning |
| url | https://doi.org/10.1186/s12859-024-06006-y |
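
The abstract motivates bait prediction by the need to normalize GC bias over the correct intervals. As a rough illustration of what such a normalization step can look like once interval sequences are known, the following minimal sketch applies a generic median-per-GC-bin correction. It is not the authors' method: the function names `gc_fraction` and `gc_normalize`, the `n_bins` parameter, and the toy data are assumptions made for this example only.

```python
from statistics import median


def gc_fraction(seq):
    """Fraction of G/C among called bases (A, C, G, T); 0.0 if none are called."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def gc_normalize(intervals, n_bins=20):
    """Rescale coverage so every GC bin has the same median coverage.

    intervals: list of (sequence, mean_coverage) pairs, one per bait or target.
    n_bins: number of equal-width GC bins (hypothetical default).
    Returns the normalized coverage values in input order.
    """
    # Assign each interval to a GC bin; GC == 1.0 falls into the last bin.
    keys = [min(int(gc_fraction(seq) * n_bins), n_bins - 1) for seq, _ in intervals]

    # Collect coverage values per bin and take the median of each bin.
    per_bin = {}
    for key, (_, cov) in zip(keys, intervals):
        per_bin.setdefault(key, []).append(cov)
    bin_median = {key: median(values) for key, values in per_bin.items()}

    # The overall median is the level every bin is rescaled to.
    overall = median([cov for _, cov in intervals])

    normalized = []
    for key, (_, cov) in zip(keys, intervals):
        m = bin_median[key]
        normalized.append(cov * overall / m if m > 0 else 0.0)
    return normalized


# Toy data: GC-rich intervals show twice the coverage of AT-rich ones.
toy = [("ATAT", 50.0), ("ATTA", 50.0), ("GCGC", 100.0), ("GGCC", 100.0)]
print(gc_normalize(toy))
```

In this toy case the coverage difference is entirely explained by GC content, so the correction brings all four intervals to the same level. With real data, the quality of the correction depends on computing GC content over the true bait intervals, which is the gap the record's abstract says bait prediction is meant to close.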