BETULA: Numerically Stable CF-Trees for BIRCH Clustering

06/23/2020
by Andreas Lang, et al.

BIRCH is a widely known clustering approach that has influenced much subsequent research and many commercial products. Its key contribution is the Clustering Feature tree (CF-Tree), a compressed representation of the input data. As new data arrives, the tree is eventually rebuilt to increase the compression. Afterward, the leaves of the tree are used for clustering. Because of this data compression, the method is very scalable. The idea has been adopted, for example, for k-means, data stream, and density-based clustering. The clustering features used by BIRCH are simple summary statistics that can easily be updated with new data: the number of points, the linear sums, and the sum of squared values. Unfortunately, the way BIRCH then uses the sum of squares is prone to catastrophic cancellation. We introduce a replacement cluster feature that does not have this numeric problem, is not much more expensive to maintain, and makes many computations simpler and hence more efficient. These cluster features can also easily be used in other work derived from BIRCH, such as algorithms for streaming data. In the experiments, we demonstrate the numerical problem and compare the performance of the original algorithm with that of the improved cluster features.
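
To make the cancellation problem concrete, here is a minimal one-dimensional sketch in Python. It is not taken from the paper. The function names are hypothetical, and the replacement feature shown (count, mean, and sum of squared deviations, updated Welford-style) is an assumption about what a numerically stable cluster feature could look like, since the abstract does not spell out its exact form.

```python
def birch_variance(points):
    """Variance from the classic BIRCH summary (N, LS, SS) in one dimension."""
    n = len(points)
    ls = sum(points)                  # linear sum
    ss = sum(x * x for x in points)   # sum of squared values
    # Subtracting two nearly equal large numbers: catastrophic cancellation.
    return ss / n - (ls / n) ** 2


def stable_feature(points):
    """Assumed stable summary (n, mean, S), with S the sum of squared deviations."""
    n, mean, s = 0, 0.0, 0.0
    for x in points:
        n += 1
        delta = x - mean
        mean += delta / n
        s += delta * (x - mean)       # Welford-style incremental update
    return n, mean, s


def merge(a, b):
    """Combine two (n, mean, S) summaries without revisiting the raw points."""
    na, ma, sa = a
    nb, mb, sb = b
    n = na + nb
    delta = mb - ma
    mean = ma + delta * nb / n
    s = sa + sb + delta * delta * na * nb / n
    return n, mean, s


# Ten points with a tiny spread but a large offset from the origin.
points = [1e9 + 0.1 * i for i in range(10)]
TRUE_VARIANCE = 0.0825  # population variance of 0.0, 0.1, ..., 0.9

naive = birch_variance(points)
n, mean, s = stable_feature(points)
stable = s / n

print("sum-of-squares variance:", naive)
print("deviation-based variance:", stable)
```

With an offset this large, the squared values are around 1e18, where adjacent double-precision numbers are more than a hundred apart, so the first estimate cannot resolve a variance below 0.1. The deviation-based summary only ever works with small differences from the running mean, and `merge` shows that such summaries can still be combined, which is what a tree of cluster features needs.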

research
04/06/2017

An Online Hierarchical Algorithm for Extreme Clustering

Many modern clustering methods scale well to a large number of data item...
research
10/02/2017

Clustering Stream Data by Exploring the Evolution of Density Mountain

Stream clustering is a fundamental problem in many streaming data analys...
research
05/20/2016

Statistical Inference for Cluster Trees

A cluster tree provides a highly-interpretable summary of a density func...
research
05/08/2021

Parameterized Complexity of Feature Selection for Categorical Data Clustering

We develop new algorithmic methods with provable guarantees for feature ...
research
04/26/2022

Polylogarithmic Sketches for Clustering

Given n points in ℓ_p^d, we consider the problem of partitioning points ...
research
11/19/2021

An Asymptotic Equivalence between the Mean-Shift Algorithm and the Cluster Tree

Two important nonparametric approaches to clustering emerged in the 1970...
