SMOTE-LOF for noise identification in imbalanced data classification
Imbalanced data typically refers to a condition in which the data samples in a problem are not equally distributed, leading to the underrepresentation of one or more classes in the dataset. These underrepresented classes are referred to as the minority, while the overrepresented one...
| Main Authors: | Asniar, Nur Ulfa Maulidevi, Kridanto Surendro |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer, 2022-06-01 |
| Series: | Journal of King Saud University: Computer and Information Sciences |
| Subjects: | Imbalanced data; SMOTE; Noisy data; Outliers; Predictive accuracy |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S1319157821000161 |
| _version_ | 1849324840573468672 |
|---|---|
| author | Asniar; Nur Ulfa Maulidevi; Kridanto Surendro |
| collection | DOAJ |
| description | Imbalanced data refers to a condition in which the data samples in a problem are not equally distributed across classes, so that one or more classes are underrepresented in the dataset. The underrepresented classes are referred to as the minority, while the overrepresented ones are called the majority. This unequal distribution reduces a classifier's ability to predict the minority classes accurately, which leads to costly classification errors. Currently, the standard framework for handling imbalanced data learning is the Synthetic Minority Oversampling Technique (SMOTE). However, SMOTE can produce synthetic minority samples that are considered noise and that also fall within the majority classes. Therefore, this study aims to improve SMOTE by adding the Local Outlier Factor (LOF) to identify noise among the synthetic minority samples produced when handling imbalanced data. The proposed method is called SMOTE-LOF. Experiments were carried out on imbalanced datasets, and the results were compared with those of SMOTE. SMOTE-LOF produced better accuracy and F-measure than SMOTE. On a dataset with a large number of samples and a smaller imbalance ratio, SMOTE-LOF also produced a better AUC than SMOTE. However, on a dataset with a smaller number of samples, SMOTE's AUC was arguably better. Therefore, future research should use different datasets that vary in both the number of samples and the imbalance ratio. |
| format | Article |
| id | doaj-art-2e491e90c5b04f2388a228b92c3fa7c3 |
| institution | Kabale University |
| issn | 1319-1578 |
| language | English |
| publishDate | 2022-06-01 |
| publisher | Springer |
| record_format | Article |
| series | Journal of King Saud University: Computer and Information Sciences |
| spelling | doaj-art-2e491e90c5b04f2388a228b92c3fa7c3; 2025-08-20T03:48:35Z; eng; Springer; Journal of King Saud University: Computer and Information Sciences; ISSN 1319-1578; 2022-06-01; vol. 34, no. 6, pp. 3413-3423; DOI 10.1016/j.jksuci.2021.01.014; SMOTE-LOF for noise identification in imbalanced data classification; Asniar (School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, Indonesia; School of Applied Science, Telkom University, Jalan Telekomunikasi, Terusan Buah Batu, Bandung, Indonesia; corresponding author); Nur Ulfa Maulidevi (School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, Indonesia; PUI-PT AI-VLB (Artificial Intelligence for Vision, Natural Language Processing & Big Data Analytics), Indonesia); Kridanto Surendro (School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, Indonesia); http://www.sciencedirect.com/science/article/pii/S1319157821000161; Imbalanced data; SMOTE; Noisy data; Outliers; Predictive accuracy |
| title | SMOTE-LOF for noise identification in imbalanced data classification |
| topic | Imbalanced data; SMOTE; Noisy data; Outliers; Predictive accuracy |
| url | http://www.sciencedirect.com/science/article/pii/S1319157821000161 |
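
The description above outlines the idea of oversampling with SMOTE and then using the Local Outlier Factor (LOF) to flag noisy synthetic samples. The record does not give the algorithm's details, so the following is only a minimal standard-library sketch under stated assumptions: LOF is computed over the minority pool (original plus synthetic points), synthetic points scoring above a hypothetical `threshold` are dropped, and each neighbourhood holds exactly `k` points (ties are not expanded). The function names and parameters are illustrative and are not taken from the article.

```python
import math
import random


def knn(points, idx, k):
    """Indices of the k nearest neighbours of points[idx], excluding itself."""
    others = (j for j in range(len(points)) if j != idx)
    return sorted(others, key=lambda j: math.dist(points[idx], points[j]))[:k]


def smote(minority, n_new, k, rng):
    """SMOTE-style oversampling: interpolate towards a random near neighbour."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(knn(minority, i, k))
        gap = rng.random()  # position along the segment between the two points
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(minority[i], minority[j])))
    return synthetic


def lof_scores(points, k):
    """LOF per point; values near 1 are inliers, much larger values outliers."""
    n = len(points)
    neigh = [knn(points, i, k) for i in range(n)]
    k_dist = [math.dist(points[i], points[neigh[i][-1]]) for i in range(n)]

    def local_reach_density(i):
        reach = [max(k_dist[j], math.dist(points[i], points[j]))
                 for j in neigh[i]]
        return len(reach) / max(sum(reach), 1e-12)  # guard against duplicates

    lrd = [local_reach_density(i) for i in range(n)]
    return [sum(lrd[j] for j in neigh[i]) / (k * lrd[i]) for i in range(n)]


def smote_lof(minority, n_new, k=2, threshold=1.5, seed=0):
    """Assumed pipeline: oversample, score with LOF, keep low-score synthetics."""
    rng = random.Random(seed)
    synthetic = smote(minority, n_new, k, rng)
    scores = lof_scores(list(minority) + synthetic, k)
    synthetic_scores = scores[len(minority):]
    return [p for p, s in zip(synthetic, synthetic_scores) if s <= threshold]


# Toy data: a tight square cluster plus one distant point.
demo_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
demo_scores = lof_scores(demo_points, k=2)  # last score is far above 1

# Oversample only the square cluster and keep the synthetic points LOF accepts.
kept = smote_lof(demo_points[:4], n_new=10)
```

On the toy data the four cluster points receive a score of 1 and the distant point a much larger one, which is the behaviour the filtering step relies on. The value of `threshold` is a tuning choice in this sketch, not a figure reported in the record.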