A novel double machine learning approach for detecting early breast cancer using advanced feature selection and dimensionality reduction techniques

Abstract In this paper, three Double Machine Learning (DML) models are proposed to improve the accuracy of breast cancer detection on a breast cancer detection dataset. The DML models learn the primary features using machine learning and deep learning models. Then,...

Bibliographic Details
Main Authors: Suganya Athisayamani, Tamilazhagan S, A. Robert Singh, Jae-Yong Hwang, Gyanendra Prasad Joshi
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
Series: Scientific Reports
Subjects: Breast Cancer, Machine Learning, SVM, Random Forest, Decision Tree, KNN
Online Access: https://doi.org/10.1038/s41598-025-06426-7
_version_ 1849238673897291776
author Suganya Athisayamani
Tamilazhagan S
A. Robert Singh
Jae-Yong Hwang
Gyanendra Prasad Joshi
author_sort Suganya Athisayamani
collection DOAJ
description Abstract In this paper, three Double Machine Learning (DML) models are proposed to improve the accuracy of breast cancer detection on a breast cancer detection dataset. Each DML model learns the primary features with a machine learning model and a deep learning model, and a meta-classifier then fuses these features to achieve the best classification performance. The first DML model combines the interpretability of Random Forest (RF) with the deep learning capabilities of a Feedforward Neural Network (FNN). RF processes structured features and provides class probabilities and feature importance scores, while the FNN learns non-linear relationships and generates embeddings. These outputs are fused into a combined feature vector, which a meta-classifier uses for the final predictions. This approach captures both structured features and non-linear patterns, making it suitable for datasets with complex dependencies. The second model pairs eXtreme Gradient Boosting (XGBoost), a highly efficient boosting algorithm for tabular data, with an Artificial Neural Network (ANN). XGBoost optimizes decision tree ensembles and provides class probabilities, while the ANN processes numerical data to learn deeper representations. A meta-classifier then uses the fused outputs of XGBoost and the ANN for the final predictions. This model is particularly effective for datasets that combine structured features (handled by XGBoost) with numerical features (handled by the ANN). The third model integrates LightGBM, a fast and scalable gradient-boosting framework, with an ANN, which is well suited to analyzing sequential data. LightGBM processes structured features to provide probabilities and importance scores, while the ANN learns temporal dependencies from sequential data. The outputs of LightGBM and the ANN are concatenated and passed to a meta-classifier for decision-making. This model is ideal for datasets with both static features (LightGBM) and continuous data (ANN), such as time-series datasets or datasets with sequential dependencies. Combined with dimensionality reduction (PCA) and feature selection, these DML models significantly improve the performance of breast cancer detection systems by leveraging both structured and sequential data, reaching an accuracy of 0.99.
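The first model in the abstract (RF plus FNN, fused by a meta-classifier after feature selection and PCA) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: scikit-learn's bundled Wisconsin breast cancer data stands in for the unnamed dataset, `MLPClassifier` stands in for the FNN, logistic regression is an assumed choice of meta-classifier, and only class probabilities are fused (the abstract also mentions importance scores and embeddings). All hyperparameters are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Wisconsin dataset stands in for the paper's data,
# which the abstract does not identify.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Shared preprocessing: scaling, univariate feature selection, then PCA.
# k=20 and n_components=10 are illustrative values, not the paper's.
prep = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=20),
    PCA(n_components=10, random_state=0),
)
Z_train = prep.fit_transform(X_train, y_train)
Z_test = prep.transform(X_test)

# Base learners: a tree ensemble and a small feedforward network.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
fnn = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0)

# Out-of-fold probabilities for the training rows, so the meta-classifier
# is not fitted on predictions the base models made for rows they had seen.
oof = [
    cross_val_predict(model, Z_train, y_train, cv=5, method="predict_proba")
    for model in (rf, fnn)
]
meta_train = np.hstack(oof)  # fused feature vector: 2 models x 2 classes

# Refit the base learners on all training rows to score the held-out rows.
rf.fit(Z_train, y_train)
fnn.fit(Z_train, y_train)
meta_test = np.hstack([rf.predict_proba(Z_test), fnn.predict_proba(Z_test)])

# Assumed meta-classifier: logistic regression over the fused probabilities.
meta = LogisticRegression(max_iter=1000).fit(meta_train, y_train)
accuracy = meta.score(meta_test, y_test)
print(f"held-out accuracy of the stacked model: {accuracy:.3f}")
```

Swapping `RandomForestClassifier` for a gradient-boosting estimator would give the same shape as the second and third models; the fusion and meta-classifier steps would stay unchanged.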
format Article
id doaj-art-2e86fb4cd52c4d81a1569c8981dfbf24
institution Kabale University
issn 2045-2322
language English
publishDate 2025-07-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-2e86fb4cd52c4d81a1569c8981dfbf24
2025-08-20T04:01:26Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2025-07-01
151123
10.1038/s41598-025-06426-7
A novel double machine learning approach for detecting early breast cancer using advanced feature selection and dimensionality reduction techniques
Suganya Athisayamani 0
Tamilazhagan S 1
A. Robert Singh 2
Jae-Yong Hwang 3
Gyanendra Prasad Joshi 4
Department of Computing Technologies, SRM Institute of Science and Technology
School of Computing, Sastra Deemed to be University
Department of Computational Intelligence, SRM Institute of Science and Technology
Department of Information & Communication Engineering, Daejeon University
Department of AI Software, Kangwon National University
https://doi.org/10.1038/s41598-025-06426-7
Breast Cancer
Machine Learning
SVM
Random Forest
Decision Tree
KNN
title A novel double machine learning approach for detecting early breast cancer using advanced feature selection and dimensionality reduction techniques
topic Breast Cancer
Machine Learning
SVM
Random Forest
Decision Tree
KNN
url https://doi.org/10.1038/s41598-025-06426-7