Big Data Cleaning Based on Improved CLOF and Random Forest for Distribution Networks

To improve data quality, this paper studies a big data cleaning method for distribution networks. First, the Local Outlier Factor (LOF) algorithm based on DBSCAN clustering is used to detect outliers. However, due to the difficulty in determining the LOF threshold, a method of d...

Bibliographic Details
Main Authors:	Jie Liu, Yijia Cao, Yong Li, Yixiu Guo, Wei Deng
Format:	Article
Language:	English
Published:	China Electric Power Research Institute 2024-01-01
Series:	CSEE Journal of Power and Energy Systems
Subjects:	Data cleaning; DBSCAN; LOF; missing data imputation; outliers detection; Random Forest
Online Access:	https://ieeexplore.ieee.org/document/9299499/

author	Jie Liu, Yijia Cao, Yong Li, Yixiu Guo, Wei Deng
collection	DOAJ
description	To improve data quality, this paper studies a big data cleaning method for distribution networks. First, the Local Outlier Factor (LOF) algorithm based on DBSCAN clustering is used to detect outliers. Because the LOF threshold is difficult to determine, a method is proposed that calculates the threshold dynamically based on transformer district and time. In addition, the LOF algorithm is combined with a statistical distribution method to reduce the misjudgment rate. To address the diversity and complexity of missing-data patterns in power big data, the Random Forest imputation algorithm is improved so that it can be applied to various forms of missing data, especially block-wise missing data and even some completely missing horizontal or vertical data. The data come from 44 real transformer districts on a 10 kV line in a distribution network. Experimental results show that the outlier detection is accurate and suitable for multidimensional power big data of any shape. The improved Random Forest imputation algorithm is suitable for all missing forms, with higher imputation accuracy and better model stability. Comparing network loss prediction on data cleaned with this method against data from which outliers and missing values were simply removed shows that the proposed cleaning method improves prediction accuracy by nearly 4%. Additionally, as the proportion of bad data increases, the gap between the prediction accuracy of cleaned data and that of uncleaned data becomes more significant.
format	Article
id	doaj-art-50874733a50a40baaa418b6ee4b22570
institution	Kabale University
issn	2096-0042
language	English
publishDate	2024-01-01
publisher	China Electric Power Research Institute
record_format	Article
series	CSEE Journal of Power and Energy Systems
spelling	doaj-art-50874733a50a40baaa418b6ee4b22570; 2025-01-16T00:02:18Z; eng; China Electric Power Research Institute; CSEE Journal of Power and Energy Systems; ISSN 2096-0042; 2024-01-01; vol. 10, no. 6, pp. 2528-2538; DOI 10.17775/CSEEJPES.2020.04080; IEEE document 9299499; Big Data Cleaning Based on Improved CLOF and Random Forest for Distribution Networks; Jie Liu, Yijia Cao, Yong Li, Yixiu Guo (College of Electrical and Information Engineering, Hunan University, Changsha, China, 410082); Wei Deng (State Grid Hunan Electric Power Company Limited Research Institute, Changsha, China, 410007); https://ieeexplore.ieee.org/document/9299499/; Data cleaning; DBSCAN; LOF; missing data imputation; outliers detection; Random Forest
title	Big Data Cleaning Based on Improved CLOF and Random Forest for Distribution Networks
topic	Data cleaning; DBSCAN; LOF; missing data imputation; outliers detection; Random Forest
url	https://ieeexplore.ieee.org/document/9299499/
work_keys_str_mv	AT jieliu bigdatacleaningbasedonimprovedclofandrandomforestfordistributionnetworks AT yijiacao bigdatacleaningbasedonimprovedclofandrandomforestfordistributionnetworks AT yongli bigdatacleaningbasedonimprovedclofandrandomforestfordistributionnetworks AT yixiuguo bigdatacleaningbasedonimprovedclofandrandomforestfordistributionnetworks AT weideng bigdatacleaningbasedonimprovedclofandrandomforestfordistributionnetworks
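
The description above names LOF-based outlier detection with a threshold calculated dynamically per transformer district and time, but this record does not give the formula. The following is a minimal sketch of the general idea only, not the authors' method: it computes plain LOF scores and, as an assumption, sets each group's threshold to the mean plus `n_sigma` standard deviations of that group's scores. The DBSCAN pre-clustering step is omitted, and the names `lof_scores`, `flag_outliers`, `n_sigma`, and the sample data are all hypothetical.

```python
import math


def lof_scores(points, k=3):
    """Plain O(n^2) Local Outlier Factor; scores well above 1 suggest outliers.

    Simplifications: exactly k neighbours are used (ties are not expanded),
    and the data must contain more than k points with no k+1 identical ones.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Indices of the k nearest neighbours of each point, excluding itself.
    neighbours = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [dist[i][nb[-1]] for i, nb in enumerate(neighbours)]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i, nb in enumerate(neighbours):
        mean_reach = sum(max(k_dist[j], dist[i][j]) for j in nb) / len(nb)
        lrd.append(1.0 / mean_reach)
    # LOF: mean density of the neighbours divided by the point's own density.
    return [
        sum(lrd[j] for j in nb) / len(nb) / lrd[i]
        for i, nb in enumerate(neighbours)
    ]


def flag_outliers(groups, k=3, n_sigma=2.0):
    """Flag points per group using a threshold derived from that group.

    `groups` maps a label, e.g. a (district, period) pair, to a list of points.
    The mean + n_sigma * std rule is an assumption made for illustration.
    """
    flagged = {}
    for label, points in groups.items():
        scores = lof_scores(points, k)
        mu = sum(scores) / len(scores)
        sigma = math.sqrt(sum((s - mu) ** 2 for s in scores) / len(scores))
        threshold = mu + n_sigma * sigma
        flagged[label] = [p for p, s in zip(points, scores) if s > threshold]
    return flagged


# Hypothetical measurements: a tight cluster plus one distant point.
district_a = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
              (0, 0.5), (0.5, 0), (1, 0.5), (5, 5)]
# Same pattern on a 100x larger scale, standing in for a second district.
district_b = [(100 * x, 100 * y) for x, y in district_a]

groups = {"district_A": district_a, "district_B": district_b}
flagged = flag_outliers(groups)
print(flagged)
```

Because each group derives its own threshold from its own score distribution, no single global LOF cut-off has to be chosen by hand, which is the difficulty the description points to.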

Similar Items