Sensitivity Sampling Over Dynamic Geometric Data Streams with Applications to k-Clustering
Sensitivity-based sampling is crucial for constructing nearly optimal coresets for k-means / k-median clustering. In this paper, we provide a novel data structure that enables sensitivity sampling over a dynamic data stream, where points from a high-dimensional discrete Euclidean space can be either inserted or deleted. Based on this data structure, we provide a one-pass coreset construction for k-means clustering using space Õ(k poly(d)) over d-dimensional geometric dynamic data streams. The previous best known result applies only to k-median [Braverman, Frahling, Lang, Sohler, Yang '17] and cannot be directly generalized to k-means to obtain algorithms with space nearly linear in k. To the best of our knowledge, our algorithm is the first dynamic geometric data stream algorithm for k-means using space polynomial in the dimension and nearly optimal in k. We further show that our data structure for maintaining a coreset can be extended into a unified approach for more general classes of k-clustering, including k-median, M-estimator clustering, and clusterings with a more general set of cost functions over distances. For all these tasks, the space and time of our algorithm are similar to those for k-means, differing only by a poly(d) factor.
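For readers unfamiliar with the technique, the following is a minimal sketch of sensitivity sampling for k-means in the offline, static setting. It is not the dynamic streaming data structure of the paper. It assumes a set of rough centers is already available and uses a commonly stated form of the sensitivity upper bound (the point's share of the total cost plus the inverse of its cluster size, constants omitted). The function names `sq_dist`, `sensitivity_upper_bounds`, and `sensitivity_sample` are hypothetical and chosen for illustration only.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sensitivity_upper_bounds(points, centers):
    """Illustrative upper bound on each point's k-means sensitivity.

    Assumed form (constants omitted): cost share + 1 / cluster size,
    where clusters are induced by the given rough centers.
    """
    # Assign every point to its nearest rough center.
    assign = [
        min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        for p in points
    ]
    costs = [sq_dist(p, centers[a]) for p, a in zip(points, assign)]
    total = sum(costs)
    sizes = [0] * len(centers)
    for a in assign:
        sizes[a] += 1
    return [
        (c / total if total > 0 else 0.0) + 1.0 / sizes[a]
        for c, a in zip(costs, assign)
    ]


def sensitivity_sample(points, centers, m, rng):
    """Draw m points with probability proportional to the bounds above.

    Each sampled point gets weight 1 / (m * probability), so the weighted
    sample cost is an unbiased estimate of the full cost for any centers.
    """
    s = sensitivity_upper_bounds(points, centers)
    total_s = sum(s)
    probs = [x / total_s for x in s]
    idx = rng.choices(range(len(points)), weights=probs, k=m)
    return [(points[i], 1.0 / (m * probs[i])) for i in idx]


# Toy input: two well-separated groups and one rough center per group.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = [(0, 0), (10, 10)]
coreset = sensitivity_sample(points, centers, m=4, rng=random.Random(0))
```

Points far from their center or in small clusters receive higher sampling probability and correspondingly smaller weights, which is what keeps the weighted estimate unbiased.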