SMOTE-LOF for noise identification in imbalanced data classification

Imbalanced data typically refers to a condition in which the data samples in a given problem are not equally distributed across classes, leaving one or more classes underrepresented in the dataset. These underrepresented classes are referred to as the minority, while the overrepresented one...

Bibliographic Details
Main Authors: Asniar, Nur Ulfa Maulidevi, Kridanto Surendro
Format: Article
Language: English
Published: Springer 2022-06-01
Series: Journal of King Saud University: Computer and Information Sciences
Subjects: Imbalanced data; SMOTE; Noisy data; Outliers; Predictive accuracy
Online Access: http://www.sciencedirect.com/science/article/pii/S1319157821000161
_version_ 1849324840573468672
author Asniar
Nur Ulfa Maulidevi
Kridanto Surendro
collection DOAJ
description Imbalanced data typically refers to a condition in which the data samples in a given problem are not equally distributed across classes, leaving one or more classes underrepresented in the dataset. These underrepresented classes are referred to as the minority, while the overrepresented ones are called the majority. The unequal distribution of data impairs a classifier's ability to predict the minority classes accurately, which incurs various misclassification costs. Currently, the standard approach to learning from imbalanced data is the Synthetic Minority Oversampling Technique (SMOTE). However, SMOTE can produce synthetic minority samples that amount to noise because they also fall among the majority classes. Therefore, this study aims to improve SMOTE by adding the Local Outlier Factor (LOF) to identify noise among the synthetic minority samples it produces. The proposed method is called SMOTE-LOF; it was evaluated on imbalanced datasets and its results were compared with those of SMOTE. The results showed that SMOTE-LOF produces better accuracy and F-measure than SMOTE. On a dataset with a large number of samples and a smaller imbalance ratio, SMOTE-LOF also produced a better AUC than SMOTE. However, on a dataset with a smaller number of samples, SMOTE arguably achieves the better AUC. Therefore, future research should use different datasets covering varied combinations of sample count and imbalance ratio.
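The idea in the description above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the record does not state the neighbourhood size, the LOF cut-off, or which samples LOF is computed over, so `k`, `threshold`, the choice to score synthetic points against the oversampled minority class, and all function names are assumptions made for this example.

```python
import math
import random


def neighbors(i, pts, k):
    """(distance, index) pairs for the k nearest other points to pts[i]."""
    ranked = sorted((math.dist(pts[i], p), j) for j, p in enumerate(pts) if j != i)
    return ranked[:k]


def smote(minority, n_new, k, rng):
    """Create n_new points by interpolating between a random minority
    sample and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        _, j = rng.choice(neighbors(i, minority, k))
        gap = rng.random()  # position along the segment between the two samples
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(minority[i], minority[j]))
        )
    return synthetic


def lof_scores(pts, k):
    """Local Outlier Factor of every point (ties at the k-th distance are
    cut at exactly k neighbours, a simplification)."""
    nn = [neighbors(i, pts, k) for i in range(len(pts))]
    k_dist = [n[-1][0] for n in nn]
    lrd = []  # local reachability density
    for i in range(len(pts)):
        reach = [max(k_dist[j], d) for d, j in nn[i]]
        lrd.append(1.0 / max(sum(reach) / len(reach), 1e-12))
    return [
        sum(lrd[j] for _, j in nn[i]) / (len(nn[i]) * lrd[i])
        for i in range(len(pts))
    ]


def smote_lof(minority, n_new, k=3, threshold=1.5, seed=0):
    """Oversample with SMOTE, then split the synthetic points into those
    kept and those flagged as noise (LOF above the assumed threshold)."""
    rng = random.Random(seed)
    synthetic = smote(minority, n_new, k, rng)
    combined = list(minority) + synthetic
    scores = lof_scores(combined, k)[len(minority):]
    kept = [p for p, s in zip(synthetic, scores) if s <= threshold]
    noise = [p for p, s in zip(synthetic, scores) if s > threshold]
    return kept, noise


if __name__ == "__main__":
    # Hypothetical toy data: a tight minority cluster plus one stray sample.
    minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
    kept, noise = smote_lof(minority, n_new=8, k=2)
    print(len(kept), "synthetic samples kept,", len(noise), "flagged as noise")
```

A LOF close to 1 means a point is about as densely surrounded as its neighbours, while larger values mark points in sparser regions; the cut-off of 1.5 used here is only a placeholder and would need tuning per dataset.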
format Article
id doaj-art-2e491e90c5b04f2388a228b92c3fa7c3
institution Kabale University
issn 1319-1578
language English
publishDate 2022-06-01
publisher Springer
record_format Article
series Journal of King Saud University: Computer and Information Sciences
spelling doaj-art-2e491e90c5b04f2388a228b92c3fa7c3 2025-08-20T03:48:35Z eng Springer Journal of King Saud University: Computer and Information Sciences 1319-1578 2022-06-01 34(6): 3413-3423 10.1016/j.jksuci.2021.01.014
SMOTE-LOF for noise identification in imbalanced data classification
Asniar (corresponding author): School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, Indonesia; School of Applied Science, Telkom University, Jalan Telekomunikasi, Terusan Buah Batu, Bandung, Indonesia
Nur Ulfa Maulidevi: School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, Indonesia; PUI-PT AI-VLB (Artificial Intelligence for Vision, Natural Language Processing & Big Data Analytics), Indonesia
Kridanto Surendro: School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, Indonesia
http://www.sciencedirect.com/science/article/pii/S1319157821000161
Imbalanced data; SMOTE; Noisy data; Outliers; Predictive accuracy
title SMOTE-LOF for noise identification in imbalanced data classification
topic Imbalanced data
SMOTE
Noisy data
Outliers
Predictive accuracy
url http://www.sciencedirect.com/science/article/pii/S1319157821000161