Clustering and classification for dry bean feature imbalanced data
Abstract The traditional machine learning methods such as decision tree (DT), random forest (RF), and support vector machine (SVM) have low classification performance. This paper proposes an algorithm for the dry bean dataset and obesity levels dataset that can balance the minority class and the maj...
| Main Authors: | Chou-Yuan Lee, Wei Wang, Jian-Qiong Huang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2024-12-01 |
| Series: | Scientific Reports |
| Subjects: | K-means, BLSMOTE, Decision tree, Random forest, Support vector machine, Imbalanced data |
| Online Access: | https://doi.org/10.1038/s41598-024-82253-6 |
| _version_ | 1846101335217274880 |
|---|---|
| author | Chou-Yuan Lee, Wei Wang, Jian-Qiong Huang |
| collection | DOAJ |
| description | Abstract: Traditional machine learning methods such as decision tree (DT), random forest (RF), and support vector machine (SVM) classify imbalanced data poorly. This paper proposes an algorithm, evaluated on the dry bean dataset and the obesity levels dataset, that balances the minority and majority classes and adds a clustering step, with the aim of improving classification accuracy and other performance indicators (precision, recall, F1-score, and area under the curve (AUC)) on imbalanced data. The key idea is to combine two techniques: the borderline synthetic minority oversampling technique (BLSMOTE), which generates new samples from minority-class samples near the class boundary and so reduces the impact of noise on model building, and K-means clustering, which divides the data into groups according to similarities or common features. The results show that the proposed BLSMOTE + K-means + SVM algorithm outperforms the traditional machine learning methods in classification accuracy and the other performance indicators. BLSMOTE + K-means + DT generates decision rules for the dry bean dataset and the obesity levels dataset, and BLSMOTE + K-means + RF ranks the importance of the explanatory variables. These experimental results can provide scientific evidence for decision-makers. |
| format | Article |
| id | doaj-art-5f3e4d8ab21144e5a0236452c53d4580 |
| institution | Kabale University |
| issn | 2045-2322 |
| language | English |
| publishDate | 2024-12-01 |
| publisher | Nature Portfolio |
| series | Scientific Reports |
| spelling | doaj-art-5f3e4d8ab21144e5a0236452c53d4580; 2024-12-29T12:16:48Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2024-12-01; vol. 14, no. 1, pp. 1-19; 10.1038/s41598-024-82253-6; Clustering and classification for dry bean feature imbalanced data; Chou-Yuan Lee (School of Big Data, Fuzhou University of International Studies and Trade); Wei Wang (School of Software, Yunnan University); Jian-Qiong Huang (School of Big Data, Fuzhou University of International Studies and Trade); https://doi.org/10.1038/s41598-024-82253-6 |
| title | Clustering and classification for dry bean feature imbalanced data |
| topic | K-means, BLSMOTE, Decision tree, Random forest, Support vector machine, Imbalanced data |
| url | https://doi.org/10.1038/s41598-024-82253-6 |
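
The oversampling step described in the abstract (new minority samples generated from samples on the class boundary) can be illustrated with a minimal, standard-library sketch of the generic Borderline-SMOTE idea. This is not the authors' implementation: the function names `nearest_indices`, `danger_indices`, and `borderline_smote`, the single neighbour count `k` used for both boundary detection and interpolation, the default number of new samples, and the toy one-feature data are all assumptions made for illustration. The K-means and classifier stages are left out because the abstract does not say how they are combined.

```python
import math
import random


def nearest_indices(X, i, pool, k):
    """Indices from `pool` of the k points nearest to X[i], excluding i itself."""
    others = [j for j in pool if j != i]
    return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]


def danger_indices(X, y, minority, k=5):
    """Minority samples whose k nearest neighbours are mostly, but not all, majority."""
    everyone = range(len(X))
    danger = []
    for i in everyone:
        if y[i] != minority:
            continue
        n_major = sum(y[j] != minority for j in nearest_indices(X, i, everyone, k))
        if k / 2 <= n_major < k:  # n_major == k is treated as noise and skipped
            danger.append(i)
    return danger


def borderline_smote(X, y, minority, k=5, n_new=None, seed=0):
    """Return synthetic minority samples interpolated from borderline samples."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    danger = danger_indices(X, y, minority, k)
    if n_new is None:
        # Assumption: top the minority class up to the size of the rest of the data.
        n_new = max(0, len(X) - 2 * len(minority_idx))
    synthetic = []
    if not danger or len(minority_idx) < 2:
        return synthetic
    while len(synthetic) < n_new:
        i = rng.choice(danger)  # a borderline minority sample
        j = rng.choice(nearest_indices(X, i, minority_idx, k))  # a minority neighbour
        gap = rng.random()  # position along the segment from X[i] to X[j]
        synthetic.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic


# Toy data (hypothetical): six majority points, four minority points, one feature.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [5.4], [5.8], [9.0], [9.5]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
synthetic = borderline_smote(X, y, minority=1, k=3)
print(danger_indices(X, y, minority=1, k=3), synthetic)
```

In this toy data only the two minority points nearest the majority cluster are selected as borderline, so every synthetic sample is seeded from the class boundary rather than from the well-separated minority points. In practice a maintained implementation such as `BorderlineSMOTE` in imbalanced-learn would normally be preferred over a hand-rolled version.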