Distributed K-Means algorithm based on a Spark optimization sample.

To address the instability and performance issues of the classical K-Means algorithm when dealing with massive datasets, we propose SOSK-Means, an improved K-Means algorithm based on Spark optimization. SOSK-Means incorporates several key modifications to enhance the clustering process. Firstly, a we...

Bibliographic Details
Main Authors:	Yongan Feng, Jiapeng Zou, Wanjun Liu, Fu Lv
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2024-01-01
Series:	PLoS ONE
Online Access:	https://doi.org/10.1371/journal.pone.0308993

author	Yongan Feng Jiapeng Zou Wanjun Liu Fu Lv
collection	DOAJ
description	To address the instability and performance issues of the classical K-Means algorithm on massive datasets, we propose SOSK-Means, an improved K-Means algorithm based on Spark optimization. SOSK-Means modifies the clustering process in several ways. First, a weighted jump-bank approach enables efficient random sampling and preclustering; by incorporating weights and jump pointers, it improves the quality of the initial centers and reduces sensitivity to their selection. Second, a weighted max-min distance with variance is used to calculate distances, taking both weight and variance information into account, which lets SOSK-Means identify clusters that are farther apart and denser and so improves clustering accuracy. The best initial centers are then selected using the mean square error criterion, so that they better represent the distribution and structure of the dataset. During iteration, a novel distance comparison method reduces computation time. Additionally, SOSK-Means uses a Directed Acyclic Graph (DAG) to optimize performance through distributed strategies, leveraging the capabilities of the Spark framework. Experimental results show that SOSK-Means significantly improves computational speed while maintaining high computational accuracy.
format	Article
id	doaj-art-fee2b3bd41a645298a41b288256149ff
institution	Kabale University
issn	1932-6203
language	English
publishDate	2024-01-01
publisher	Public Library of Science (PLoS)
record_format	Article
series	PLoS ONE
spelling	doaj-art-fee2b3bd41a645298a41b288256149ff | 2025-01-08T05:32:41Z | eng | Public Library of Science (PLoS) | PLoS ONE | 1932-6203 | 2024-01-01 | 19 | 12 | e0308993 | 10.1371/journal.pone.0308993 | Distributed K-Means algorithm based on a Spark optimization sample. | Yongan Feng | Jiapeng Zou | Wanjun Liu | Fu Lv | https://doi.org/10.1371/journal.pone.0308993
title	Distributed K-Means algorithm based on a Spark optimization sample.
url	https://doi.org/10.1371/journal.pone.0308993
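
The abstract describes seeding K-Means from random samples with a max-min distance rule and then keeping the candidate centers with the lowest mean square error. The record gives no formulas, so the following is only a minimal sketch of that general idea in plain Python. It uses the ordinary, unweighted max-min rule rather than the authors' weighted variant with variance, and it does not model the jump-bank sampling or the Spark execution. All function names and parameters (`max_min_seeds`, `mse`, `best_seeds`, `sample_size`, `trials`) are hypothetical and chosen for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def max_min_seeds(points, k, rng):
    """Pick k centers: a random first one, then repeatedly the point whose
    distance to its nearest chosen center is largest (plain max-min rule,
    an assumption standing in for the paper's weighted version)."""
    centers = [rng.choice(points)]
    nearest = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        # Update each point's distance to its closest chosen center.
        nearest = [min(d, sq_dist(p, points[idx])) for d, p in zip(nearest, points)]
    return centers


def mse(points, centers):
    """Mean squared distance from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points) / len(points)


def best_seeds(points, k, sample_size, trials, seed=0):
    """Draw several random samples, seed each with max-min, and keep the
    candidate center set with the lowest MSE on the full dataset."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(trials):
        sample = rng.sample(points, min(sample_size, len(points)))
        candidates.append(max_min_seeds(sample, k, rng))
    return min(candidates, key=lambda c: mse(points, c))
```

Seeding on a sample keeps the quadratic-looking max-min scan cheap, while scoring candidates against the full dataset is a single pass per candidate; in a distributed setting that scoring pass is the part that would naturally be spread across workers.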
