IMPLEMENTATION OF BALANCING DATA METHOD USING SMOTETOMEK IN DIABETES CLASSIFICATION USING XGBOOST

Bibliographic Details
Main Authors: Fatwa Ratantja Kusumajati, Basuki Rahmat, Achmad Junaidi
Format: Article
Language: English
Published: Informatics Department, Engineering Faculty 2024-12-01
Series: Jurnal Ilmiah Kursor: Menuju Solusi Teknologi Informasi
Online Access: http://www.kursorjournal.org/index.php/kursor/article/view/410
Description
Summary: In this research, the XGBoost algorithm and the SMOTETomek approach are employed to improve the accuracy of diabetes classification. The study uses 2,000 patient records, comprising demographic and medical information, sourced from Kaggle. The dataset contains the variables pregnancies, glucose level, blood pressure, skin thickness, insulin level, Body Mass Index (BMI), diabetes pedigree function, and age, together with a binary outcome label in which 0 indicates a patient without diabetes and 1 indicates a patient with diabetes. Diabetes is a significant public health concern in Indonesia. A key challenge in this study was the imbalanced nature of the dataset, which contains far more non-diabetic samples than diabetic samples. To address this class imbalance, the researchers applied the SMOTETomek method, which combines SMOTE (Synthetic Minority Over-sampling Technique) with Tomek links: SMOTE oversamples the minority class, while Tomek links remove overlapping borderline samples, balancing the class distribution. SMOTETomek achieved higher accuracy (95.01%) than both SMOTE and the original data (92.13% each), highlighting the benefit of combining SMOTE with Tomek links when training XGBoost. During testing, SMOTETomek slightly reduced the minority-class accuracy (0.97 versus 0.99 for SMOTE and the original data) but maintained strong F1-score and precision, indicating effective handling of the data imbalance despite a minor trade-off.
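
The resampling-plus-classification pipeline described in the summary can be illustrated with a short Python sketch using the imbalanced-learn and xgboost libraries. This is a minimal sketch, not the authors' exact setup: the file name (diabetes.csv), the train/test split, and the XGBoost hyperparameters are illustrative assumptions; the column names follow the variables listed in the summary.

# Minimal sketch: balance the diabetes data with SMOTETomek, then train XGBoost.
# Assumed inputs: a CSV file "diabetes.csv" with the feature columns named in the
# summary and a binary "Outcome" label (0 = non-diabetic, 1 = diabetic).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from imblearn.combine import SMOTETomek
from xgboost import XGBClassifier

# Load features and the binary label.
df = pd.read_csv("diabetes.csv")
X = df.drop(columns=["Outcome"])
y = df["Outcome"]

# Hold out a test set before resampling so evaluation reflects the original distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# SMOTETomek: SMOTE oversamples the minority (diabetic) class, then Tomek links
# remove overlapping majority/minority pairs near the class boundary.
resampler = SMOTETomek(random_state=42)
X_res, y_res = resampler.fit_resample(X_train, y_train)

# Train an XGBoost classifier on the balanced training data (hyperparameters are illustrative).
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                      eval_metric="logloss", random_state=42)
model.fit(X_res, y_res)

# Evaluate: overall accuracy plus per-class precision, recall, and F1-score.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=2))

Note that resampling is applied only to the training split, so the reported test metrics reflect the original, imbalanced class distribution rather than the synthetically balanced one.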
ISSN: 0216-0544
2301-6914