Clustering and classification for dry bean feature imbalanced data

Abstract The traditional machine learning methods such as decision tree (DT), random forest (RF), and support vector machine (SVM) have low classification performance. This paper proposes an algorithm for the dry bean dataset and obesity levels dataset that can balance the minority class and the maj...

Bibliographic Details
Main Authors:	Chou-Yuan Lee, Wei Wang, Jian-Qiong Huang
Format:	Article
Language:	English
Published:	Nature Portfolio 2024-12-01
Series:	Scientific Reports
Subjects:	K-means; BLSMOTE; Decision tree; Random forest; Support vector machine; Imbalanced data
Online Access:	https://doi.org/10.1038/s41598-024-82253-6

_version_	1846101335217274880
author	Chou-Yuan Lee; Wei Wang; Jian-Qiong Huang
author_sort	Chou-Yuan Lee
collection	DOAJ
description	Abstract The traditional machine learning methods such as decision tree (DT), random forest (RF), and support vector machine (SVM) have low classification performance. This paper proposes an algorithm for the dry bean dataset and obesity levels dataset that can balance the minority class and the majority class and has a clustering function to improve the traditional machine learning classification accuracy and various performance indicators such as precision, recall, f1-score, and area under the curve (AUC) for imbalanced data. The key idea is to use the advantages of borderline-synthetic minority oversampling technique (BLSMOTE) to generate new samples using samples on the boundary of minority class samples to reduce the impact of noise on model building, and the advantages of K-means clustering to divide data into different groups according to similarities or common features. The results show that the proposed algorithm BLSMOTE + K-means + SVM is superior to other traditional machine learning methods in classification and various performance indicators. The BLSMOTE + K-means + DT generates decision rules for the dry bean dataset and the obesity levels dataset, and the BLSMOTE + K-means + RF ranks the importance of explanatory variables. These experimental results can provide scientific evidence for decision-makers.
format	Article
id	doaj-art-5f3e4d8ab21144e5a0236452c53d4580
institution	Kabale University
issn	2045-2322
language	English
publishDate	2024-12-01
publisher	Nature Portfolio
record_format	Article
series	Scientific Reports
spelling	doaj-art-5f3e4d8ab21144e5a0236452c53d4580 | 2024-12-29T12:16:48Z | eng | Nature Portfolio | Scientific Reports | 2045-2322 | 2024-12-01 | 14 | 1 | 1 | 19 | 10.1038/s41598-024-82253-6 | Clustering and classification for dry bean feature imbalanced data | Chou-Yuan Lee (0); Wei Wang (1); Jian-Qiong Huang (2) | (0) School of Big Data, Fuzhou University of International Studies and Trade; (1) School of Software, Yunnan University; (2) School of Big Data, Fuzhou University of International Studies and Trade | https://doi.org/10.1038/s41598-024-82253-6 | K-means; BLSMOTE; Decision tree; Random forest; Support vector machine; Imbalanced data
title	Clustering and classification for dry bean feature imbalanced data
topic	K-means; BLSMOTE; Decision tree; Random forest; Support vector machine; Imbalanced data
url	https://doi.org/10.1038/s41598-024-82253-6
work_keys_str_mv	AT chouyuanlee clusteringandclassificationfordrybeanfeatureimbalanceddata AT weiwang clusteringandclassificationfordrybeanfeatureimbalanceddata AT jianqionghuang clusteringandclassificationfordrybeanfeatureimbalanceddata
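
The abstract in this record names borderline-SMOTE (BLSMOTE) as the oversampling step: new minority samples are generated from minority points that sit on the class boundary. The following is a minimal pure-Python sketch of that one step, written from the general published description of Borderline-SMOTE rather than from the authors' code. The function names, the parameters m, k and n_new, and the toy data are all illustrative assumptions. The K-means and classifier stages are left out because the record does not say how they are combined.

```python
import math
import random


def nearest(point, pool, k):
    """Return the k samples in pool closest to point, skipping point itself."""
    others = [s for s in pool if s is not point]
    return sorted(others, key=lambda s: math.dist(point[0], s[0]))[:k]


def borderline_smote(samples, minority_label, m=3, k=1, n_new=1, seed=0):
    """Generate synthetic minority samples from borderline ("danger") points.

    samples is a list of (features, label) pairs. m, k and n_new are
    illustrative parameters, not values taken from the paper.
    """
    rng = random.Random(seed)
    minority = [s for s in samples if s[1] == minority_label]
    synthetic = []
    for p in minority:
        neighbours = nearest(p, samples, m)
        n_major = sum(1 for q in neighbours if q[1] != minority_label)
        # Keep only "danger" points: at least half of the neighbours are
        # majority samples, but not all of them (all-majority counts as noise).
        if not (len(neighbours) / 2 <= n_major < len(neighbours)):
            continue
        mates = nearest(p, minority, k)  # nearest minority neighbours
        for _ in range(n_new if mates else 0):
            q = rng.choice(mates)
            gap = rng.random()
            # Interpolate between the danger point and a minority neighbour.
            features = tuple(a + gap * (b - a) for a, b in zip(p[0], q[0]))
            synthetic.append((features, minority_label))
    return synthetic


# Hypothetical toy data: label 1 is the minority class, label 0 the majority.
toy = [
    ((0.0, 0.0), 1), ((0.2, 0.1), 1), ((0.1, -0.1), 1),  # far from the boundary
    ((2.5, 0.0), 1), ((2.4, 0.1), 1),                    # on the boundary
    ((2.6, 0.0), 0), ((2.9, 0.3), 0), ((3.0, 0.0), 0),
    ((3.1, -0.2), 0), ((3.2, 0.1), 0), ((3.3, 0.0), 0),
    ((3.5, 0.2), 0), ((3.6, -0.1), 0), ((3.8, 0.0), 0), ((4.0, 0.1), 0),
]
new_points = borderline_smote(toy, minority_label=1, m=3, k=1, n_new=2)
```

In this toy run only the two boundary points qualify as danger points, so every synthetic sample lies on the segment between them. In practice the synthetic samples would be appended to the training set before any clustering or classification; the imbalanced-learn package offers a maintained implementation as `imblearn.over_sampling.BorderlineSMOTE`.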

Similar Items