Predicting drug-target interactions using machine learning with improved data balancing and feature engineering

Bibliographic Details
Main Authors:	Md. Alamin Talukder, Mohsin Kazi, Ammar Alazab
Format:	Article
Language:	English
Published:	Nature Portfolio 2025-06-01
Series:	Scientific Reports
Subjects:	Drug-Target interaction, Generative adversarial networks, Machine learning, Random forest classifier, Data imbalance, Computational drug discovery
Online Access:	https://doi.org/10.1038/s41598-025-03932-6

author	Md. Alamin Talukder, Mohsin Kazi, Ammar Alazab
collection	DOAJ
description	Abstract: Drug-Target Interaction (DTI) prediction is a vital task in drug discovery, yet it faces significant challenges such as data imbalance and the complexity of biochemical representations. This study makes several contributions to address these issues, introducing a novel hybrid framework that combines advanced machine learning (ML) and deep learning (DL) techniques. The framework leverages comprehensive feature engineering, utilizing MACCS keys to extract structural drug features and amino acid/dipeptide compositions to represent target biomolecular properties. This dual feature extraction method enables a deeper understanding of chemical and biological interactions, enhancing predictive accuracy. To address data imbalance, Generative Adversarial Networks (GANs) are employed to create synthetic data for the minority class, effectively reducing false negatives and improving the sensitivity of the predictive model. The Random Forest Classifier (RFC) is utilized to make precise DTI predictions, optimized for handling high-dimensional data. The proposed framework’s scalability and robustness were validated across diverse datasets, including BindingDB-Kd, BindingDB-Ki, and BindingDB-IC50. For the BindingDB-Kd dataset, the GAN+RFC model achieved remarkable performance metrics: accuracy of 97.46%, precision of 97.49%, sensitivity of 97.46%, specificity of 98.82%, F1-score of 97.46%, and ROC-AUC of 99.42%. Similarly, for the BindingDB-Ki dataset, the model attained an accuracy of 91.69%, precision of 91.74%, sensitivity of 91.69%, specificity of 93.40%, F1-score of 91.69%, and ROC-AUC of 97.32%. On the BindingDB-IC50 dataset, the model achieved an accuracy of 95.40%, precision of 95.41%, sensitivity of 95.40%, specificity of 96.42%, F1-score of 95.39%, and ROC-AUC of 98.97%. These results demonstrate the efficacy of the GAN-based approach in capturing complex patterns, significantly improving DTI prediction outcomes. In conclusion, the proposed GAN-based hybrid framework sets a new benchmark in computational drug discovery by addressing critical challenges in DTI prediction. Its robust performance, scalability, and generalizability contribute substantially to therapeutic development and pharmaceutical research.
format	Article
id	doaj-art-cf74be9223c849f3a7cd8ba09f366be4
institution	Kabale University
issn	2045-2322
language	English
publishDate	2025-06-01
publisher	Nature Portfolio
series	Scientific Reports
affiliations	Md. Alamin Talukder: Department of Computer Science and Engineering, International University of Business Agriculture and Technology; Mohsin Kazi: Department of Pharmaceutics, College of Pharmacy, King Saud University; Ammar Alazab: Cyber Security and Digital Technology, Victoria University
title	Predicting drug-target interactions using machine learning with improved data balancing and feature engineering
topic	Drug-Target interaction, Generative adversarial networks, Machine learning, Random forest classifier, Data imbalance, Computational drug discovery
url	https://doi.org/10.1038/s41598-025-03932-6

Similar Items