IMPLEMENTATION OF BALANCING DATA METHOD USING SMOTETOMEK IN DIABETES CLASSIFICATION USING XGBOOST
In this research, the XGBoost algorithm and the SMOTETomek approach are employed to improve the accuracy of diabetes classification. The study uses 2,000 patient data points, comprising demographic and medical information, sourced from Kaggle. The dataset employed in this study...
Main Authors: | Fatwa Ratantja Kusumajati, Basuki Rahmat, Achmad Junaidi |
---|---|
Format: | Article |
Language: | English |
Published: | Informatics Department, Engineering Faculty, 2024-12-01 |
Series: | Jurnal Ilmiah Kursor: Menuju Solusi Teknologi Informasi |
Subjects: | Balancing Data; Diabetes Classification; SMOTETomek; XGBOOST |
Online Access: | http://www.kursorjournal.org/index.php/kursor/article/view/410 |
_version_ | 1841544154960625664 |
---|---|
author | Fatwa Ratantja Kusumajati; Basuki Rahmat; Achmad Junaidi |
author_facet | Fatwa Ratantja Kusumajati; Basuki Rahmat; Achmad Junaidi |
author_sort | Fatwa Ratantja Kusumajati |
collection | DOAJ |
description |
In this research, the XGBoost algorithm and the SMOTETomek approach are employed to improve the accuracy of diabetes classification. The study uses 2,000 patient data points, comprising demographic and medical information, sourced from Kaggle. The dataset comprises a number of variables, including pregnancies, glucose level, blood pressure, skin thickness, insulin level, Body Mass Index (BMI), diabetes pedigree function, age, and an outcome variable. The outcome is a binary classification label: 0 indicates that the patient does not have diabetes, and 1 indicates that the patient has diabetes. Diabetes represents a significant public health concern in Indonesia. A key challenge in this study was the imbalanced nature of the dataset, which contained disproportionately more non-diabetic samples than diabetic samples. To address this class imbalance, the researchers employed the SMOTETomek method, which combines SMOTE (Synthetic Minority Over-sampling Technique) with Tomek links: SMOTE oversamples the minority class, and Tomek links removes borderline samples, thereby balancing the class distribution. With XGBoost, SMOTETomek achieved higher accuracy (95.01%) than SMOTE and the original data (both 92.13%), highlighting the benefit of combining SMOTE with Tomek links. During testing, SMOTETomek slightly reduced the minority class accuracy (0.97 vs. 0.99 for SMOTE and the original data) but maintained a strong F1-score and precision, indicating effective handling of data imbalance despite minor trade-offs.
|
format | Article |
id | doaj-art-b5fe845077e94968bddf2defa74f071c |
institution | Kabale University |
issn | 0216-0544 2301-6914 |
language | English |
publishDate | 2024-12-01 |
publisher | Informatics Department, Engineering Faculty |
record_format | Article |
series | Jurnal Ilmiah Kursor: Menuju Solusi Teknologi Informasi |
spelling | doaj-art-b5fe845077e94968bddf2defa74f071c; 2025-01-12T15:53:13Z; eng; Informatics Department, Engineering Faculty; Jurnal Ilmiah Kursor: Menuju Solusi Teknologi Informasi; 0216-0544; 2301-6914; 2024-12-01; 12; 4; 10.21107/kursor.v12i4.410; IMPLEMENTATION OF BALANCING DATA METHOD USING SMOTETOMEK IN DIABETES CLASSIFICATION USING XGBOOST; Fatwa Ratantja Kusumajati [0]; Basuki Rahmat [1]; Achmad Junaidi [2]; [0] UPN "Veteran" Jawa Timur; [1] UPN "Veteran" Jawa Timur; [2] UPN "Veteran" Jawa Timur; In this research, XGBoost algorithm and the SMOTETomek approach are employed with the objective of enhancing the accuracy of diabetes classification. The study utilises 2,000 patient data points, comprising demographic and medical information, sourced from Kaggle. The dataset employed in this study comprises a number of variables, including pregnancies, glucose levels, blood pressure, skin thickness, insulin levels, Body Mass Index (BMI), diabetes pedigree function, age, and an outcome variable. The latter is a binary classification label, taking on the values 0 and 1. A value of 0 indicates that the patient is not affected by diabetes, whereas a value of 1 indicates that the patient has diabetes. Diabetes represents a significant public health concern in Indonesia. A significant challenge in this study was the imbalanced nature of the dataset, which included a disproportionate number of non-diabetic samples relative to diabetic samples. To address this class imbalance, the researchers employed the SMOTETomek method. SMOTETomek integrates the SMOTE (Synthetic Minority Over-sampling Technique) and Tomek links algorithms to oversample the minority class and remove borderline samples, thereby balancing the class distributions. The SMOTETomek method achieved higher accuracy (95.01%) than SMOTE and the original data (both 92.13%), highlighting the benefits of combining SMOTE with Tomek Links for XGBoost. During testing, SMOTETomek slightly reduced the minority class accuracy (0.97 vs. 0.99 for SMOTE and original data) but maintained strong F1-score and precision, indicating effective handling of data imbalance despite minor trade-offs.; http://www.kursorjournal.org/index.php/kursor/article/view/410; Balancing Data; Diabetes Classification; SMOTETomek; XGBOOST |
spellingShingle | Fatwa Ratantja Kusumajati; Basuki Rahmat; Achmad Junaidi; IMPLEMENTATION OF BALANCING DATA METHOD USING SMOTETOMEK IN DIABETES CLASSIFICATION USING XGBOOST; Jurnal Ilmiah Kursor: Menuju Solusi Teknologi Informasi; Balancing Data; Diabetes Classification; SMOTETomek; XGBOOST |
title | IMPLEMENTATION OF BALANCING DATA METHOD USING SMOTETOMEK IN DIABETES CLASSIFICATION USING XGBOOST |
title_full | IMPLEMENTATION OF BALANCING DATA METHOD USING SMOTETOMEK IN DIABETES CLASSIFICATION USING XGBOOST |
title_fullStr | IMPLEMENTATION OF BALANCING DATA METHOD USING SMOTETOMEK IN DIABETES CLASSIFICATION USING XGBOOST |
title_full_unstemmed | IMPLEMENTATION OF BALANCING DATA METHOD USING SMOTETOMEK IN DIABETES CLASSIFICATION USING XGBOOST |
title_short | IMPLEMENTATION OF BALANCING DATA METHOD USING SMOTETOMEK IN DIABETES CLASSIFICATION USING XGBOOST |
title_sort | implementation of balancing data method using smotetomek in diabetes classification using xgboost |
topic | Balancing Data; Diabetes Classification; SMOTETomek; XGBOOST |
url | http://www.kursorjournal.org/index.php/kursor/article/view/410 |
work_keys_str_mv | AT fatwaratantjakusumajati implementationofbalancingdatamethodusingsmotetomekindiabetesclassificationusingxgboost AT basukirahmat implementationofbalancingdatamethodusingsmotetomekindiabetesclassificationusingxgboost AT achmadjunaidi implementationofbalancingdatamethodusingsmotetomekindiabetesclassificationusingxgboost |
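
The resampling step described in the abstract (SMOTE oversampling of the minority class followed by removal of Tomek links) can be illustrated with a minimal sketch. This is not the authors' code: the record does not say which implementation they used, and in practice a library such as imbalanced-learn (`imblearn.combine.SMOTETomek`) would normally be preferred. The function names, the toy (glucose, BMI) values, `k=3`, and the choice to drop both members of each Tomek link are all assumptions made for illustration.

```python
import random
from math import dist


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each interpolated between a minority
    point and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        neighbours = sorted(
            (q for q in minority if q is not p), key=lambda q: dist(p, q)
        )[:k]
        q = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from p to q
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(p, q)))
    return synthetic


def nearest(i, X):
    """Index of the point closest to X[i], excluding X[i] itself."""
    return min((j for j in range(len(X)) if j != i), key=lambda j: dist(X[i], X[j]))


def tomek_links(X, y):
    """Pairs (i, j) of opposite-class points that are each other's nearest neighbour."""
    links = []
    for i in range(len(X)):
        j = nearest(i, X)
        if i < j and y[i] != y[j] and nearest(j, X) == i:
            links.append((i, j))
    return links


def smote_tomek(X, y, minority_label=1, k=3, seed=0):
    """Oversample the minority class to parity, then drop both points of
    every Tomek link (an assumed convention for this sketch)."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = max(len(X) - 2 * len(minority), 0)  # majority count minus minority count
    new_points = smote(minority, n_new, k, rng)
    X_res = list(X) + new_points
    y_res = list(y) + [minority_label] * len(new_points)
    drop = {i for link in tomek_links(X_res, y_res) for i in link}
    X_out = [x for i, x in enumerate(X_res) if i not in drop]
    y_out = [label for i, label in enumerate(y_res) if i not in drop]
    return X_out, y_out


# Hypothetical toy data: (glucose, BMI) pairs, six non-diabetic and three diabetic.
X = [(85, 22.1), (90, 24.0), (95, 23.5), (100, 26.2), (105, 25.0), (110, 27.3),
     (150, 33.0), (160, 35.5), (170, 31.8)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

X_bal, y_bal = smote_tomek(X, y)
print("class counts after resampling:", y_bal.count(0), y_bal.count(1))
```

The resampled `X_bal`, `y_bal` would then be passed to a classifier such as XGBoost. Resampling is usually applied to the training split only, so that the test set keeps the original class distribution.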