Sets Clustering

by Ibrahim Jubran, et al.

The input to the sets-k-means problem is an integer k ≥ 1 and a set P = {P_1, ..., P_n} of sets in R^d. The goal is to compute a set C of k centers (points) in R^d that minimizes the sum ∑_{i=1}^{n} min_{p ∈ P_i, c ∈ C} ‖p − c‖^2 of squared distances to these sets. An ε-core-set for this problem is a weighted subset of P that approximates this sum up to a 1 ± ε factor, for every set C of k centers in R^d. We prove that such a core-set of O(log^2 n) sets always exists, and can be computed in O(n log n) time, for every input P and every fixed d, k ≥ 1 and ε ∈ (0,1). The result generalizes easily to any metric space, to distances raised to the power of z > 0, and to M-estimators that handle outliers. Applying an inefficient but optimal algorithm to this core-set yields the first PTAS (1+ε approximation) for the sets-k-means problem that runs in time near-linear in n. This is the first such result even for sets-mean on the plane (k = 1, d = 2). Open source code and experimental results for document classification and facility location are also provided.
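To make the objective concrete, the following is a minimal sketch of how the sets-k-means cost could be evaluated for a given set of centers. It is not the authors' released code. It assumes every input set is finite and given as a list of coordinate tuples; the names `squared_distance` and `sets_kmeans_cost`, and the optional `weights` argument (one way a weighted core-set might be evaluated), are hypothetical and chosen for illustration only.

```python
from typing import Optional, Sequence

Point = Sequence[float]


def squared_distance(p: Point, c: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def sets_kmeans_cost(
    sets: Sequence[Sequence[Point]],
    centers: Sequence[Point],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Sum, over the input sets, of the smallest squared distance between
    any point of the set and any center.

    `weights` is an illustrative assumption: one non-negative weight per
    set, defaulting to 1 for the unweighted objective.
    """
    if weights is None:
        weights = [1.0] * len(sets)
    return sum(
        # Each set contributes only its closest (point, center) pair.
        w * min(squared_distance(p, c) for p in point_set for c in centers)
        for w, point_set in zip(weights, sets)
    )


# Three sets in the plane and k = 2 centers.
sets = [
    [(0.0, 0.0), (10.0, 10.0)],
    [(1.0, 0.0), (5.0, 5.0)],
    [(9.0, 9.0)],
]
centers = [(0.0, 0.0), (10.0, 10.0)]

cost = sets_kmeans_cost(sets, centers)  # 0 + 1 + 2 = 3.0
```

Each set pays only for its single closest point-center pair, which is what distinguishes this objective from ordinary k-means, where every point pays for its own nearest center. Under the definition above, an ε-core-set would be a weighted subset of the sets whose weighted cost stays within a 1 ± ε factor of the full cost for every choice of centers.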



