SMOTE-LOF for noise identification in imbalanced data classification

Imbalanced data typically refers to a condition in which the data samples in a given problem are not equally distributed across classes, thereby leading to the underrepresentation of one or more classes in the dataset. These underrepresented classes are referred to as the minority, while the overrepresented one...

Bibliographic Details
Main Authors:	Asniar, Nur Ulfa Maulidevi, Kridanto Surendro
Format:	Article
Language:	English
Published:	Springer 2022-06-01
Series:	Journal of King Saud University: Computer and Information Sciences
Subjects:	Imbalanced data; SMOTE; Noisy data; Outliers; Predictive accuracy
Online Access:	http://www.sciencedirect.com/science/article/pii/S1319157821000161

author	Asniar; Nur Ulfa Maulidevi; Kridanto Surendro
collection	DOAJ
description	Imbalanced data typically refers to a condition in which the data samples in a given problem are not equally distributed across classes, thereby leading to the underrepresentation of one or more classes in the dataset. These underrepresented classes are referred to as the minority, while the overrepresented ones are called the majority. The unequal distribution of data impairs a classifier's ability to predict the minority classes accurately, thereby incurring various misclassification costs. Currently, the standard approach to learning from imbalanced data is the Synthetic Minority Oversampling Technique (SMOTE). However, SMOTE can produce synthetic minority samples that are considered noise because they also fall within the majority classes. Therefore, this study aims to improve SMOTE by adding the Local Outlier Factor (LOF) to identify noise among the synthetic minority samples produced when handling imbalanced data. The proposed method is called SMOTE-LOF. Experiments were carried out on imbalanced datasets, and the results were compared with those of SMOTE. The results showed that SMOTE-LOF produces better accuracy and F-measure than SMOTE. On a dataset with a large number of samples and a smaller imbalance ratio, SMOTE-LOF also produced a better AUC than SMOTE. However, on a dataset with a smaller number of samples, SMOTE's AUC was arguably better. Therefore, future research should use different datasets with varying combinations of sample count and imbalance ratio.
format	Article
id	doaj-art-2e491e90c5b04f2388a228b92c3fa7c3
institution	Kabale University
issn	1319-1578
language	English
publishDate	2022-06-01
publisher	Springer
record_format	Article
series	Journal of King Saud University: Computer and Information Sciences
spelling	doaj-art-2e491e90c5b04f2388a228b92c3fa7c3 | 2025-08-20T03:48:35Z | eng | Springer | Journal of King Saud University: Computer and Information Sciences | 1319-1578 | 2022-06-01 | 34 | 6 | 3413 | 3423 | 10.1016/j.jksuci.2021.01.014 | SMOTE-LOF for noise identification in imbalanced data classification | Asniar (0); Nur Ulfa Maulidevi (1); Kridanto Surendro (2) | (0) School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, Indonesia; School of Applied Science, Telkom University, Jalan Telekomunikasi, Terusan Buah Batu, Bandung, Indonesia; Corresponding author. | (1) School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, Indonesia; PUI-PT AI-VLB (Artificial Intelligence for Vision, Natural Language Processing & Big Data Analytics), Indonesia | (2) School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, Indonesia | http://www.sciencedirect.com/science/article/pii/S1319157821000161 | Imbalanced data; SMOTE; Noisy data; Outliers; Predictive accuracy
title	SMOTE-LOF for noise identification in imbalanced data classification
topic	Imbalanced data; SMOTE; Noisy data; Outliers; Predictive accuracy
url	http://www.sciencedirect.com/science/article/pii/S1319157821000161
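
The description in this record outlines SMOTE-LOF only at a high level: oversample the minority class with SMOTE, then use the Local Outlier Factor to identify noisy synthetic samples. The following is a minimal standard-library sketch of that idea, not the authors' implementation. The function names, the neighbour count `k`, the `threshold` of 1.5, and the choice to score synthetic points against the oversampled minority set are all illustrative assumptions that do not come from the record.

```python
import math
import random


def knn(points, idx, k):
    """Indices of the k nearest neighbours of points[idx], excluding itself."""
    dists = sorted(
        (math.dist(points[idx], p), j) for j, p in enumerate(points) if j != idx
    )
    return [j for _, j in dists[:k]]


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating towards minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(knn(minority, i, k))
        gap = rng.random()  # position along the segment from point i to point j
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(minority[i], minority[j]))
        )
    return synthetic


def lof_scores(points, k=3):
    """Local Outlier Factor of every point; values well above 1 suggest outliers."""
    neigh = [knn(points, i, k) for i in range(len(points))]
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [math.dist(points[i], points[n[-1]]) for i, n in enumerate(neigh)]

    def lrd(i):
        # Local reachability density: inverse of the mean reachability distance.
        reach = [max(k_dist[j], math.dist(points[i], points[j])) for j in neigh[i]]
        return 1.0 / max(sum(reach) / len(reach), 1e-12)  # guard against duplicates

    density = [lrd(i) for i in range(len(points))]
    return [
        sum(density[j] for j in neigh[i]) / len(neigh[i]) / density[i]
        for i in range(len(points))
    ]


def smote_lof(minority, n_new, k=3, threshold=1.5, seed=0):
    """Oversample with SMOTE, then drop synthetic points whose LOF exceeds threshold.

    Assumption: LOF is computed over the original plus synthetic minority points,
    and only synthetic points are ever removed.
    """
    synthetic = smote(minority, n_new, k, seed)
    scores = lof_scores(list(minority) + synthetic, k)
    kept = [p for p, s in zip(synthetic, scores[len(minority):]) if s <= threshold]
    return list(minority) + kept


if __name__ == "__main__":
    # Hypothetical toy data: a tight minority cluster plus one distant point.
    minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
    balanced = smote_lof(minority, n_new=20)
    print(len(minority), "original +", len(balanced) - len(minority), "synthetic kept")
```

In practice the filtered minority set would be recombined with the majority class before training a classifier. The threshold trades off how aggressively sparse synthetic points are discarded and would need tuning per dataset.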

Similar Items