IMPLEMENTATION OF BALANCING DATA METHOD USING SMOTETOMEK IN DIABETES CLASSIFICATION USING XGBOOST
In this research, XGBoost algorithm and the SMOTETomek approach are employed with the objective of enhancing the accuracy of diabetes classification. The study utilises 2,000 patient data points, comprising demographic and medical information, sourced from Kaggle. The dataset employed in this study...
Saved in:
Main Authors: | , , |
---|---|
Format: | Article |
Language: | English |
Published: |
Informatics Department, Engineering Faculty
2024-12-01
|
Series: | Jurnal Ilmiah Kursor: Menuju Solusi Teknologi Informasi |
Subjects: | |
Online Access: | http://www.kursorjournal.org/index.php/kursor/article/view/410 |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | In this research, XGBoost algorithm and the SMOTETomek approach are employed with the objective of enhancing the accuracy of diabetes classification. The study utilises 2,000 patient data points, comprising demographic and medical information, sourced from Kaggle. The dataset employed in this study comprises a number of variables, including pregnancies, glucose levels, blood pressure, skin thickness, insulin levels, Body Mass Index (BMI), diabetes pedigree function, age, and an outcome variable. The latter is a binary classification label, taking on the values 0 and 1. A value of 0 indicates that the patient is not affected by diabetes, whereas a value of 1 indicates that the patient has diabetes. Diabetes represents a significant public health concern in Indonesia. A significant challenge in this study was the imbalanced nature of the dataset, which included a disproportionate number of non-diabetic samples relative to diabetic samples. To address this class imbalance, the researchers employed the SMOTETomek method. SMOTETomek integrates the SMOTE (Synthetic Minority Over-sampling Technique) and Tomek links algorithms to oversample the minority class and remove borderline samples, thereby balancing the class distributions. The SMOTETomek method achieved higher accuracy (95.01%) than SMOTE and the original data (both 92.13%), highlighting the benefits of combining SMOTE with Tomek Links for XGBoost. During testing, SMOTETomek slightly reduced the minority class accuracy (0.97 vs. 0.99 for SMOTE and original data) but maintained strong F1-score and precision, indicating effective handling of data imbalance despite minor trade-offs.
|
---|---|
ISSN: | 0216-0544 2301-6914 |