ε-Coresets for Clustering (with Outliers) in Doubling Metrics
We study the problem of constructing ε-coresets for the (k, z)-clustering problem in a doubling metric M(X, d). An ε-coreset is a weighted subset S ⊆ X with weight function w : S → ℝ_{≥0}, such that for any k-subset C ∈ [X]^k, it holds that ∑_{x ∈ S} w(x) · d^z(x, C) ∈ (1 ± ε) · ∑_{x ∈ X} d^z(x, C). We present an efficient algorithm that constructs an ε-coreset for the (k, z)-clustering problem in M(X, d), where the size of the coreset depends only on the parameters k, z, ε and the doubling dimension ddim(M). To the best of our knowledge, this is the first efficient ε-coreset construction of size independent of |X| for general clustering problems in doubling metrics. To this end, we establish the first relation between the doubling dimension of M(X, d) and the shattering dimension (or VC-dimension) of the range space induced by the distance d. Such a relation was not known before, since one can easily construct instances in which neither one can be bounded by (some function of) the other. Surprisingly, we show that if we allow a small (1 ± ε)-distortion of the distance function d and consider the notion of τ-error probabilistic shattering dimension, we can prove an upper bound of O(ddim(M) · (1/ε) + 1/τ) on the probabilistic shattering dimension, even for weighted doubling metrics. We believe this new relation is of independent interest and may find other applications. We also study robust coresets and centroid sets in doubling metrics. Our robust coreset construction leads to new results in clustering and property testing, and the centroid sets can be used to accelerate local search algorithms for clustering problems.
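To make the ε-coreset condition concrete, the following minimal sketch checks it by brute force on a tiny point set. It is only an illustration of the definition, not the construction from the paper: it assumes Euclidean points (one simple example of a doubling metric), unit weights on X, and the function names `cost` and `is_eps_coreset` are hypothetical. It enumerates every k-subset of X as a candidate center set, so it is practical only for very small inputs.

```python
import itertools
import math


def cost(points, weights, centers, z):
    """Weighted (k, z)-clustering cost: sum of w(x) * d(x, C)^z over all points."""
    return sum(
        w * min(math.dist(x, c) for c in centers) ** z
        for x, w in zip(points, weights)
    )


def is_eps_coreset(X, S, w, k, z, eps, tol=1e-12):
    """Brute-force check of the epsilon-coreset condition over all k-subsets of X.

    Assumption: X is a list of coordinate tuples in Euclidean space, each
    carrying unit weight; S is a list of points from X with weights w.
    """
    unit = [1.0] * len(X)
    # Candidate center sets are k-subsets of the distinct points of X.
    for C in itertools.combinations(sorted(set(X)), k):
        full = cost(X, unit, C, z)
        approx = cost(S, w, C, z)
        if not ((1 - eps) * full - tol <= approx <= (1 + eps) * full + tol):
            return False
    return True


# Illustrative data: two duplicated points and one far point.
X = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (5.0, 0.0)]

# Merging duplicates and doubling their weights preserves every cost exactly.
good = is_eps_coreset(X, [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)], [2.0, 2.0, 1.0],
                      k=2, z=2, eps=0.1)

# Dropping (1, 0) entirely fails, e.g. for centers {(0, 0), (5, 0)}.
bad = is_eps_coreset(X, [(0.0, 0.0), (5.0, 0.0)], [4.0, 1.0],
                     k=2, z=2, eps=0.1)
```

In this toy example the first candidate is an exact coreset because it only merges identical points, while the second reports zero cost for a center set whose true cost is positive, which no ε < 1 can tolerate.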