Distributed K-Means algorithm based on a Spark optimization sample.

To address the instability and performance issues of the classical K-Means algorithm when dealing with massive datasets, we propose SOSK-Means, an improved K-Means algorithm based on Spark optimization. SOSK-Means incorporates several key modifications to enhance the clustering process. Firstly, a we...

Bibliographic Details
Main Authors: Yongan Feng, Jiapeng Zou, Wanjun Liu, Fu Lv
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2024-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0308993
collection DOAJ
description To address the instability and performance problems of classical K-Means on massive datasets, we propose SOSK-Means, an improved K-Means algorithm based on Spark optimization. SOSK-Means modifies the clustering process in several ways. First, a weighted jump-bank approach enables efficient random sampling and preclustering; by incorporating weights and jump pointers, it improves the quality of the initial centers and reduces sensitivity to their selection. Second, distances are calculated with a weighted max-min distance with variance, which considers both weight and variance information and lets SOSK-Means identify clusters that are farther apart and denser, enhancing clustering accuracy. The best initial centers are then selected using the mean square error criterion, so that they better represent the distribution and structure of the dataset. During iteration, a novel distance comparison method reduces computation time. Additionally, SOSK-Means uses a Directed Acyclic Graph (DAG) to optimize performance through distributed strategies, leveraging the capabilities of the Spark framework. Experimental results show that SOSK-Means significantly improves computational speed while maintaining high computational accuracy.
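
The initialization idea in the description (sample the data, pick centers that are far apart, keep the candidate set with the lowest mean square error) can be illustrated with a minimal single-machine sketch. It uses the plain, classical max-min (farthest-first) distance and does not reproduce the paper's weighted jump-bank sampling, its variance term, or any Spark code. All function and parameter names below are hypothetical and chosen for illustration only.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def max_min_centers(points, k):
    """Plain max-min seeding (assumption: unweighted, no variance term).

    Starts from the first point, then repeatedly adds the point whose
    distance to its nearest already-chosen center is largest.
    """
    centers = [points[0]]
    nearest = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        # Update each point's distance to its nearest chosen center.
        nearest = [min(d, sq_dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return centers


def mean_squared_error(points, centers):
    """Mean squared distance from each point to its closest center."""
    total = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return total / len(points)


def best_initial_centers(points, k, sample_size, trials, seed=0):
    """Draw several random samples, seed each with max-min, and keep
    the candidate center set with the lowest MSE on the full data."""
    rng = random.Random(seed)
    best, best_mse = None, float("inf")
    for _ in range(trials):
        sample = rng.sample(points, min(sample_size, len(points)))
        centers = max_min_centers(sample, k)
        mse = mean_squared_error(points, centers)
        if mse < best_mse:
            best, best_mse = centers, mse
    return best, best_mse


if __name__ == "__main__":
    # Three small, well-separated groups of 2-D points.
    data = [(cx + dx, cy + dy)
            for cx, cy in [(0, 0), (10, 0), (0, 10)]
            for dx in (-0.5, 0.0, 0.5)
            for dy in (-0.5, 0.0, 0.5)]
    centers, mse = best_initial_centers(data, k=3, sample_size=12, trials=5)
    print(centers, mse)
```

The returned centers would then serve as the starting point for ordinary K-Means iterations. In a distributed setting, the MSE evaluation is the step that would naturally be computed per partition and aggregated.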
id doaj-art-fee2b3bd41a645298a41b288256149ff
institution Kabale University
issn 1932-6203