Optimizing sequence data analysis using convolution neural network for the prediction of CNV bait positions

Abstract Background Accurate prediction of copy number variations (CNVs) from targeted capture next-generation sequencing (NGS) data relies on effective normalization of read coverage profiles. The normalization process is particularly challenging due to hidden systemic biases such as GC bias, which...

Bibliographic Details
Main Authors: Zoltán Maróti, Peter Juma Ochieng, József Dombi, Miklós Krész, Tibor Kalmár
Format: Article
Language: English
Published: BMC 2024-12-01
Series: BMC Bioinformatics
Subjects: Targeted capture; Oligo capture baits; Copy number variation; Machine learning
Online Access: https://doi.org/10.1186/s12859-024-06006-y
author Zoltán Maróti
Peter Juma Ochieng
József Dombi
Miklós Krész
Tibor Kalmár
collection DOAJ
description Abstract
Background: Accurate prediction of copy number variations (CNVs) from targeted capture next-generation sequencing (NGS) data relies on effective normalization of read coverage profiles. The normalization process is particularly challenging due to hidden systemic biases such as GC bias, which can significantly affect the sensitivity and specificity of CNV detection. In many cases, the kit manifests provide only the genome coordinates of the targeted regions, and the exact design of the oligo capture baits is not available. Although the on-target regions significantly overlap with the bait design, the lack of exact bait information permits only a less accurate normalization of the coverage data. In this study, we propose a novel approach that uses a 1D convolutional neural network (CNN) model to predict the positions of capture baits in complex whole-exome sequencing (WES) kits. By accurately identifying the bait coordinates, our model enables precise normalization of GC bias across target regions, thereby allowing better CNV data normalization.
Results: We evaluated the optimal hyperparameters, model architecture, and complexity to predict the likely positions of the oligo capture baits. Our analysis shows that the CNN models outperform the Dense NN for bait prediction. Batch normalization is the most important parameter for the stable training of CNN models. Our results indicate that the spatiality of the data plays an important role in prediction performance. We show that combined input data, including experimental coverage, on-target information, and sequence data, are critical for bait prediction. Furthermore, comparison with the on-target information indicated that the CNN models performed better, predicting bait positions with a high degree of overlap (>90%) with the true bait positions.
Conclusions: This study highlights the potential of CNN-based approaches to optimize coverage data analysis and improve copy number data normalization. Subsequent CNV detection based on these predicted coordinates facilitates more accurate measurement of coverage profiles and better normalization for GC bias. As a result, this approach could reduce systemic bias and improve the sensitivity and specificity of CNV detection in genomic studies.
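The abstract states that knowing the bait coordinates allows GC bias to be normalized across target regions, but it does not say how. The sketch below illustrates one common, generic scheme (dividing each interval's coverage by the median coverage of intervals with similar GC content). It is not taken from the article and is not claimed to be the authors' method. The names gc_fraction and gc_normalize, the (sequence, mean_coverage) input layout, and the choice of 10 GC bins are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import median


def gc_fraction(seq):
    """Return the G+C fraction among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Sequences made only of ambiguous bases (e.g. N) get 0.0 by convention here.
    return gc / total if total else 0.0


def gc_normalize(intervals, n_bins=10):
    """Normalize coverage by the median coverage of each GC bin.

    intervals: list of (sequence, mean_coverage) pairs, one per bait
    interval (hypothetical layout, assumed for this sketch).
    Returns one normalized coverage value per interval, in input order.
    """
    bin_of = []
    coverage_by_bin = defaultdict(list)
    for seq, cov in intervals:
        # Map GC fraction in [0, 1] to a bin index in [0, n_bins - 1].
        k = min(int(gc_fraction(seq) * n_bins), n_bins - 1)
        bin_of.append(k)
        coverage_by_bin[k].append(cov)

    medians = {k: median(v) for k, v in coverage_by_bin.items()}

    normalized = []
    for (_, cov), k in zip(intervals, bin_of):
        m = medians[k]
        # Guard against bins whose median coverage is zero.
        normalized.append(cov / m if m > 0 else 0.0)
    return normalized


if __name__ == "__main__":
    example = [("AAAA", 10.0), ("AATT", 20.0), ("GGCC", 100.0), ("GCGC", 50.0)]
    print(gc_normalize(example))
```

In this toy input the two AT-rich intervals share one bin and the two GC-rich intervals share another, so each value is scaled relative to intervals of similar GC content rather than to the global median. Real pipelines would typically compute GC content from a reference genome over the predicted bait coordinates and use many more intervals per bin.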
format Article
id doaj-art-5f408970a43d4515b4259ceefb4363e5
institution Kabale University
issn 1471-2105
language English
publishDate 2024-12-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj-art-5f408970a43d4515b4259ceefb4363e5; 2024-12-29T12:49:50Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2024-12-01; 25; 1; 1; 20; 10.1186/s12859-024-06006-y
Zoltán Maróti (0): Albert Szent-Györgyi Health Centre, University of Szeged
Peter Juma Ochieng (1): Interdisciplinary Research Development and Innovation Center of Excellence, Institute of Informatics, University of Szeged
József Dombi (2): HUN-REN SZTE Research Group on Artificial Intelligence, University of Szeged
Miklós Krész (3): InnoRenew CoE
Tibor Kalmár (4): Albert Szent-Györgyi Health Centre, University of Szeged
title Optimizing sequence data analysis using convolution neural network for the prediction of CNV bait positions
topic Targeted capture
Oligo capture baits
Copy number variation
Machine learning
url https://doi.org/10.1186/s12859-024-06006-y