Structured Inverted-File k-Means Clustering for High-Dimensional Sparse Data

03/30/2021
by Kazuo Aoyama, et al.

This paper presents SIVF, an architecture-friendly k-means clustering algorithm for large-scale, high-dimensional sparse data sets. The time efficiency of an algorithm is often measured by the number of costly operations it performs, such as similarity calculations. In practice, however, it depends greatly on how well the algorithm adapts to the architecture of the computer system on which it is executed. Our proposed SIVF employs an invariant centroid-pair based filter (ICP) to decrease the number of similarity calculations between a data object and the centroids of all the clusters. To maximize the ICP performance, SIVF uses an inverted file for the centroid set that is structured so as to reduce pipeline hazards. Our experiments on real large-scale document data sets demonstrate that SIVF operates at higher speed and with lower memory consumption than existing algorithms. Our performance analysis reveals that SIVF achieves the higher speed by suppressing performance-degrading factors, namely cache misses and branch mispredictions, rather than by performing fewer similarity calculations.
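
The abstract does not describe the data structure in detail, so the following is only a minimal sketch of the general idea of an inverted file over a centroid set, not of SIVF itself. It assumes sparse vectors stored as dictionaries from term id to weight and a dot-product similarity; both are assumptions made for illustration. The names `build_inverted_file` and `assign` are hypothetical, and the sketch includes neither the ICP filter nor the structured layout that the paper credits for reducing pipeline hazards.

```python
from collections import defaultdict


def build_inverted_file(centroids):
    """Map each term id to the list of (centroid id, weight) pairs that use it.

    `centroids` is a list of sparse vectors, each a dict {term_id: weight}.
    This dict-based layout is an illustrative assumption, not the paper's.
    """
    inverted = defaultdict(list)
    for cid, centroid in enumerate(centroids):
        for term, weight in centroid.items():
            if weight != 0.0:
                inverted[term].append((cid, weight))
    return inverted


def assign(obj, inverted, k):
    """Return (best centroid id, similarity) for one sparse object.

    Only the posting lists of terms present in `obj` are visited, so
    centroids that share no term with the object cost nothing.
    """
    sims = [0.0] * k
    for term, value in obj.items():
        for cid, weight in inverted.get(term, ()):
            sims[cid] += value * weight  # accumulate partial dot products
    best = max(range(k), key=sims.__getitem__)
    return best, sims[best]


# Tiny hypothetical example: three centroids over four term ids.
centroids = [{0: 0.6, 1: 0.8}, {1: 0.6, 2: 0.8}, {3: 1.0}]
inverted = build_inverted_file(centroids)
obj = {1: 0.6, 2: 0.8}
best, sim = assign(obj, inverted, len(centroids))
print(best, round(sim, 6))
```

In this toy layout the inner loop walks posting lists of varying length and writes to scattered positions in `sims`, which is the kind of access pattern where cache misses and branch mispredictions can dominate the cost, independent of how many similarity values are computed.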
