Clustering and classification for dry bean feature imbalanced data

Abstract The traditional machine learning methods such as decision tree (DT), random forest (RF), and support vector machine (SVM) have low classification performance. This paper proposes an algorithm for the dry bean dataset and obesity levels dataset that can balance the minority class and the maj...

Bibliographic Details
Main Authors: Chou-Yuan Lee, Wei Wang, Jian-Qiong Huang
Format: Article
Language: English
Published: Nature Portfolio 2024-12-01
Series: Scientific Reports
Subjects: K-means; BLSMOTE; Decision tree; Random forest; Support vector machine; Imbalanced data
Online Access: https://doi.org/10.1038/s41598-024-82253-6
author Chou-Yuan Lee
Wei Wang
Jian-Qiong Huang
collection DOAJ
description Abstract The traditional machine learning methods such as decision tree (DT), random forest (RF), and support vector machine (SVM) have low classification performance. This paper proposes an algorithm for the dry bean dataset and the obesity levels dataset that balances the minority class and the majority class and adds a clustering function to improve the classification accuracy of traditional machine learning methods, along with performance indicators such as precision, recall, F1-score, and area under the curve (AUC), on imbalanced data. The key idea is to combine the borderline-synthetic minority oversampling technique (BLSMOTE), which generates new samples from minority class samples on the class boundary to reduce the impact of noise on model building, with K-means clustering, which divides data into groups according to similarities or common features. The results show that the proposed algorithm BLSMOTE + K-means + SVM is superior to the other traditional machine learning methods in classification accuracy and the other performance indicators. BLSMOTE + K-means + DT generates decision rules for the dry bean dataset and the obesity levels dataset, and BLSMOTE + K-means + RF ranks the importance of explanatory variables. These experimental results can provide scientific evidence for decision-makers.
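The oversampling step described in the abstract can be illustrated with a minimal sketch. It follows the general Borderline-SMOTE idea (find minority samples near the class boundary, then interpolate toward nearby minority samples) and is not the authors' implementation. The function name `borderline_smote`, the parameters `m`, `k`, and `n_new`, and the toy data are all assumptions made for illustration; the K-means and classifier stages of the proposed pipeline are omitted.

```python
import math
import random


def borderline_smote(X, y, minority, m=5, k=3, n_new=4, seed=0):
    """Return (synthetic_points, danger_indices) for the given minority label.

    Hypothetical helper for illustration only; not the paper's code.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]

    def nearest(i, pool, count):
        # Indices from pool (excluding i), closest to X[i] first (Euclidean).
        return sorted((j for j in pool if j != i),
                      key=lambda j: math.dist(X[i], X[j]))[:count]

    # Step 1: a minority point is borderline ("in danger") when at least half,
    # but not all, of its m nearest neighbours belong to other classes.
    # Points whose m neighbours are all from other classes are skipped as noise.
    danger = []
    for i in minority_idx:
        n_other = sum(1 for j in nearest(i, range(len(X)), m) if y[j] != minority)
        if m / 2 <= n_other < m:
            danger.append(i)

    # Step 2: each synthetic sample is a random interpolation between a
    # borderline point and one of its k nearest minority neighbours.
    synthetic = []
    if danger and len(minority_idx) > 1:
        for _ in range(n_new):
            i = rng.choice(danger)
            j = rng.choice(nearest(i, minority_idx, k))
            gap = rng.random()
            synthetic.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic, danger


# Toy two-feature data (assumed values): seven majority points, five minority.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [1.5, 1.5], [1, 2], [2, 1],
     [2, 2], [3, 3], [3, 2.5], [2.5, 3], [3.5, 3.5]]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

synthetic, danger = borderline_smote(X, y, minority=1)
print("borderline minority indices:", danger)
print("number of synthetic samples:", len(synthetic))
```

In this toy set only the minority point at (2, 2) sits next to the majority class, so all synthetic samples are generated around it. In practice a maintained implementation, such as the `BorderlineSMOTE` class in the imbalanced-learn package, would normally be preferred over a hand-written version.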
format Article
id doaj-art-5f3e4d8ab21144e5a0236452c53d4580
institution Kabale University
issn 2045-2322
language English
publishDate 2024-12-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-5f3e4d8ab21144e5a0236452c53d4580
2024-12-29T12:16:48Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2024-12-01
Volume 14, issue 1, pages 1-19
10.1038/s41598-024-82253-6
Clustering and classification for dry bean feature imbalanced data
Chou-Yuan Lee (School of Big Data, Fuzhou University of International Studies and Trade)
Wei Wang (School of Software, Yunnan University)
Jian-Qiong Huang (School of Big Data, Fuzhou University of International Studies and Trade)
https://doi.org/10.1038/s41598-024-82253-6
K-means
BLSMOTE
Decision tree
Random forest
Support vector machine
Imbalanced data
title Clustering and classification for dry bean feature imbalanced data
topic K-means
BLSMOTE
Decision tree
Random forest
Support vector machine
Imbalanced data
url https://doi.org/10.1038/s41598-024-82253-6