Optimizing sequence data analysis using convolution neural network for the prediction of CNV bait positions

Abstract Background Accurate prediction of copy number variations (CNVs) from targeted capture next-generation sequencing (NGS) data relies on effective normalization of read coverage profiles. The normalization process is particularly challenging due to hidden systemic biases such as GC bias, which...

Bibliographic Details
Main Authors:	Zoltán Maróti, Peter Juma Ochieng, József Dombi, Miklós Krész, Tibor Kalmár
Format:	Article
Language:	English
Published:	BMC 2024-12-01
Series:	BMC Bioinformatics
Subjects:	Targeted capture; Oligo capture baits; Copy number variation; Machine learning
Online Access:	https://doi.org/10.1186/s12859-024-06006-y

author	Zoltán Maróti; Peter Juma Ochieng; József Dombi; Miklós Krész; Tibor Kalmár
collection	DOAJ
description	Abstract. Background: Accurate prediction of copy number variations (CNVs) from targeted capture next-generation sequencing (NGS) data relies on effective normalization of read coverage profiles. The normalization process is particularly challenging due to hidden systemic biases such as GC bias, which can significantly affect the sensitivity and specificity of CNV detection. In many cases, the kit manifests provide only the genome coordinates of the targeted regions, and the exact bait design of the oligo capture baits is not available. Although the on-target regions significantly overlap with the bait design, a lack of adequate information allows less accurate normalization of the coverage data. In this study, we propose a novel approach that utilizes a 1D convolution neural network (CNN) model to predict the positions of capture baits in complex whole-exome sequencing (WES) kits. By accurately identifying the exact positions of bait coordinates, our model enables precise normalization of GC bias across target regions, thereby allowing better CNV data normalization. Results: We evaluated the optimal hyperparameters, model architecture, and complexity to predict the likely positions of the oligo capture baits. Our analysis shows that the CNN models outperform the Dense NN for bait predictions. Batch normalization is the most important parameter for the stable training of CNN models. Our results indicate that the spatiality of the data plays an important role in the prediction performance. We have shown that combined input data, including experimental coverage, on-target information, and sequence data, are critical for bait prediction. Furthermore, comparison with the on-target information indicated that the CNN models performed better in predicting bait positions that exhibited a high degree of overlap (>90%) with the true bait positions. Conclusions: This study highlights the potential of utilizing CNN-based approaches to optimize coverage data analysis and improve copy number data normalization. Subsequent CNV detection based on these predicted coordinates facilitates more accurate measurement of coverage profiles and better normalization for GC bias. As a result, this approach could reduce systemic bias and improve the sensitivity and specificity of CNV detection in genomic studies.
format	Article
id	doaj-art-5f408970a43d4515b4259ceefb4363e5
institution	Kabale University
issn	1471-2105
language	English
publishDate	2024-12-01
publisher	BMC
record_format	Article
series	BMC Bioinformatics
spelling	doaj-art-5f408970a43d4515b4259ceefb4363e5 | 2024-12-29T12:49:50Z | eng | BMC | BMC Bioinformatics | 1471-2105 | 2024-12-01 | 25 | 1 | 1 | 20 | 10.1186/s12859-024-06006-y | Optimizing sequence data analysis using convolution neural network for the prediction of CNV bait positions | Zoltán Maróti (Albert Szent-Györgyi Health Centre, University of Szeged); Peter Juma Ochieng (Interdisciplinary Research Development and Innovation Center of Excellence, Institute of Informatics, University of Szeged); József Dombi (HUN-REN SZTE Research Group on Artificial Intelligence, University of Szeged); Miklós Krész (InnoRenew CoE); Tibor Kalmár (Albert Szent-Györgyi Health Centre, University of Szeged) | https://doi.org/10.1186/s12859-024-06006-y | Targeted capture; Oligo capture baits; Copy number variation; Machine learning
title	Optimizing sequence data analysis using convolution neural network for the prediction of CNV bait positions
topic	Targeted capture; Oligo capture baits; Copy number variation; Machine learning
url	https://doi.org/10.1186/s12859-024-06006-y
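
The abstract refers to two quantities that are easy to illustrate: how much a predicted bait overlaps the true bait, and the GC content of the sequence under an interval. The following is a minimal sketch only, not the authors' method. It assumes half-open (start, end) coordinates and defines overlap as the share of the true bait covered by the prediction, which may differ from the definition used in the article. The function names, the toy sequence, and all coordinates are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G/C among unambiguous A/C/G/T bases (0.0 if there are none)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def overlap_fraction(pred, true):
    """Share of the true interval covered by pred.

    Both intervals are half-open (start, end) pairs on the same contig.
    This definition is an assumption made for illustration.
    """
    start = max(pred[0], true[0])
    end = min(pred[1], true[1])
    covered = max(0, end - start)
    length = true[1] - true[0]
    return covered / length if length > 0 else 0.0


# Hypothetical toy data: a short reference and three intervals on it.
reference = "ATGCGCGCTATATAGCGCGGCCATAT"
true_bait = (4, 24)        # unknown in practice when only targets are published
predicted_bait = (5, 24)   # stand-in for a model prediction
on_target = (8, 20)        # stand-in for the kit's target region

for name, interval in [("predicted", predicted_bait), ("on-target", on_target)]:
    s, e = interval
    print(name,
          "overlap with true bait:", overlap_fraction(interval, true_bait),
          "GC fraction:", round(gc_fraction(reference[s:e]), 3))
```

In this toy setup the predicted interval covers more of the true bait than the on-target interval does, so the GC fraction computed over it is closer to the GC fraction of the actual bait sequence. That is the intuition behind normalizing GC bias on bait coordinates instead of target coordinates.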

Similar Items