Polylogarithmic Sketches for Clustering
Given n points in ℓ_p^d, we consider the problem of partitioning the points into k clusters with associated centers. The cost of a clustering is the sum of the p-th powers of the distances from each point to its cluster center. For p ∈ [1,2], we design sketches of size poly(log(nd),k,1/ϵ) from which the cost of the optimal clustering can be estimated to within a factor of 1+ϵ, even though the compressed representation does not contain enough information to recover the cluster centers or the partition into clusters. This leads to a streaming algorithm for estimating the clustering cost using space poly(log(nd),k,1/ϵ). We also obtain a distributed-memory algorithm, in which the n points are arbitrarily partitioned among m machines, each of which sends information to a central party that then computes an approximation of the clustering cost. Prior to this work, no such streaming or distributed-memory algorithm was known with sublinear dependence on d for p ∈ [1,2).
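To make the objective concrete, the following minimal sketch evaluates the cost defined above for a fixed set of candidate centers, assigning each point to its nearest center. It is not the sketching algorithm of the paper: it reads every coordinate of every point, which is exactly what the sketches avoid. The function name `clustering_cost` and its parameters are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def clustering_cost(
    points: Sequence[Sequence[float]],
    centers: Sequence[Sequence[float]],
    p: float,
) -> float:
    """Sum over points of the p-th power of the l_p distance to the nearest center.

    Illustrative only: computes the exact objective for given centers,
    not the optimal cost and not a sketch-based estimate.
    """
    total = 0.0
    for x in points:
        # ||x - c||_p^p equals sum_i |x_i - c_i|^p, so no p-th root is needed.
        total += min(
            sum(abs(xi - ci) ** p for xi, ci in zip(x, c)) for c in centers
        )
    return total


# Four points on a line, two centers placed midway in each pair.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
centers = [(0.5, 0.0), (10.5, 0.0)]
cost_p2 = clustering_cost(points, centers, p=2)  # each point contributes 0.5**2
cost_p1 = clustering_cost(points, centers, p=1)  # each point contributes 0.5
```

The optimal clustering cost is the minimum of this quantity over all choices of k centers; the sketches in the abstract estimate that minimum to within a factor of 1+ϵ without storing the points themselves.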