
Fully Dynamic k-Center Clustering in Doubling Metrics

08/11/2019
by   Gramoz Goranci, et al.

In the k-center clustering problem, we are given a set of n points in a metric space and a parameter k ≤ n. The goal is to select k designated points, referred to as centers, such that the maximum distance of any point to its closest center is minimized. This notion of clustering is of fundamental importance and has been extensively studied. We study a dynamic variant of the k-center clustering problem, where the goal is to maintain a clustering with small approximation ratio while supporting an intermixed update sequence of insertions and deletions of points with small update time. Moreover, the data structure should be able to support the following queries for any given point: (1) report whether this point is a center or (2) determine the cluster this point is assigned to. We present a deterministic dynamic algorithm for the k-center clustering problem that achieves a (2+ϵ)-approximation with O(2^{O(κ)} log Δ log log Δ · ϵ^{-1} ln ϵ^{-1}) update time and O(log Δ) query time, where κ bounds the doubling dimension of the metric and Δ is the aspect ratio. Our update and query times are independent of the number of centers k, and are poly-logarithmic when the metric has constant doubling dimension and the aspect ratio is bounded by a polynomial.

1 Introduction

The massive increase in the amount of data produced over the last few decades has motivated the study of different tools for analysing and computing specific properties of the data. One of the most extensively studied analytical tools is clustering, where the goal is to group the data into clusters of “close” data points. Clustering is a fundamental problem in computer science and it has found a wide range of applications in unsupervised learning, classification, community detection, image segmentation and databases (see e.g. [6, 19, 21]).

A natural definition of clustering is the k-center clustering, where given a set of n points in a metric space and a parameter k ≤ n, the goal is to select k designated points, referred to as centers, such that their cost, defined as the maximum distance of any point to its closest center, is minimized. As finding the optimal k-center clustering is NP-hard [15], the focus has been on studying the approximate version of this problem. For a parameter α ≥ 1, an α-approximation to the k-center clustering problem is an algorithm that outputs k centers such that their cost is within α times the cost of the optimal solution. There is a simple 2-approximate k-center clustering algorithm by Gonzalez [9] that runs in O(nk) time: repeatedly pick the point furthest away from the current set of centers as the next center to be added. The problem of finding a (2−ϵ)-approximate k-center clustering, for any ϵ > 0, is known to be NP-hard [9].
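The greedy rule of Gonzalez described above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the paper: the function name `gonzalez_k_center`, the use of Euclidean distance via `math.dist`, and the choice of the first input point as the initial center are all assumptions made for the example.

```python
import math

def gonzalez_k_center(points, k, dist=math.dist):
    """Farthest-first traversal; returns (centers, cost)."""
    if not points or k <= 0:
        raise ValueError("need at least one point and k >= 1")
    # Assumption: the first input point serves as the arbitrary first center.
    centers = [points[0]]
    # nearest[i] = distance from points[i] to its closest chosen center
    nearest = [dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        # Pick the point that is furthest from the current set of centers.
        far = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[far])
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], dist(p, points[far]))
    return centers, max(nearest)
```

Each of the at most k rounds scans all n points once, which matches the O(nk) running time quoted above.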

In many real-world applications, including social networks and the Internet, the data is subject to frequent updates over time. For example, thousands of Google searches, YouTube video uploads and Twitter posts are generated every second. However, most traditional clustering algorithms are not capable of capturing the dynamic nature of the data, and complete reclustering from scratch is often used to obtain the desired clustering guarantees.

To address the above challenges, in this paper we study a dynamic variant of the k-center clustering problem, where the goal is to maintain a clustering with small approximation ratio while supporting an intermixed update sequence of insertions and deletions of points with small time per update. Additionally, for any given point we want to report whether this point is a center or determine the cluster this point is assigned to. When only insertions of points are allowed, also known as the incremental setting, Charikar et al. [2] designed an 8-approximation algorithm with O(k log k) amortized time per point insertion. This result was later improved to a (2+ϵ)-approximation by McCutchen and Khuller [17]. Recently, Chan et al. [1] studied the model that supports both point insertions and deletions, referred to as the fully-dynamic setting. Their dynamic algorithm is randomized and achieves a (2+ϵ)-approximation with O(k² · ϵ^{-1} · log Δ) update time per operation, where Δ is the aspect ratio of the underlying metric space.

It is an open question whether there are fully-dynamic algorithms that achieve smaller running time (ideally independent of k) while still keeping the same approximation guarantee. We study such data structures for metric spaces with “limited expansion”. More specifically, we consider the well-studied notion of doubling dimension. The doubling dimension of a metric space is said to be bounded by κ if any ball of radius r in this metric can be covered by 2^κ balls of radius r/2 [16]. This notion can be thought of as a generalization of the Euclidean dimension since ℝ^d has doubling dimension Θ(d).

The k-center clustering problem has been studied in the low dimensional regime from both the static and dynamic perspective. Feder and Greene [4] showed that if the input points are taken from ℝ^d, there is a 2-approximation to the optimal clustering that can be implemented in O(n log k) time. They also showed that computing an approximation better than 1.822 is NP-hard, even when restricted to Euclidean spaces. For metrics of bounded doubling dimension, Har-Peled and Mendel [13] devised an algorithm that achieves a 2-approximation and runs in 2^{O(κ)} n log n time. In the dynamic setting, Har-Peled [12] implicitly gave a fully-dynamic algorithm for metrics with bounded doubling dimension that reports a (2+ϵ)-clustering at any time while supporting insertions and deletions of points in O(poly(k, ϵ^{-1}, log n)) time, where poly(·) is a fixed-degree polynomial in the input parameters.

One drawback shared by the above dynamic algorithms for the k-center clustering is that the update time is dependent on the number of centers k. This is particularly undesirable in applications where k is relatively large. For example, one application where this is justified is the distribution of servers on the Internet, where thousands of servers are heading towards millions of routers. Moreover, this dependency on k seems inherent in the state-of-the-art dynamic algorithms; for example, the algorithm due to Chan et al. [1] requires examining the set of current centers upon insertion of a point, while the algorithm due to Har-Peled [12] employs the notion of coresets, which in turn requires dependency on the number of centers.

In this paper we present a dynamic algorithm for metrics with bounded doubling dimension that achieves a (2+ϵ) approximation ratio for the k-center clustering problem (thus matching the approximation ratio of the dynamic algorithm in general metric spaces [1]) while supporting insertions and deletions of points in time independent of the number of centers k and poly-logarithmic in the aspect ratio Δ. Our algorithm is deterministic and thus works against an adaptive adversary.

Theorem 1.1.

There is a fully-dynamic algorithm for the k-center clustering problem, where points are taken from a metric space with doubling dimension κ, such that at any time the cost of the maintained solution is within a factor (2+ϵ) of the cost of the optimal solution and the insertions and deletions of points are supported in O(2^{O(κ)} log Δ log log Δ · ϵ^{-1} ln ϵ^{-1}) update time. For any given point, queries about whether this point is a center or reporting the cluster this point is assigned to can be answered in O(1) and O(log Δ) time, respectively.

Remark.

Recently and independently of our work, Schmidt and Sohler [20] gave an -approximate fully-dynamic algorithm for the hierarchical k-center clustering with and expected amortized insertion and deletion time, respectively, and query time, where points come from the discrete space with being a constant. This result implies a dynamic algorithm for the k-center clustering problem with the same guarantees. In comparison with their result, our algorithm (i) achieves a better and almost tight approximation, (ii) is deterministic and maintains comparable running time guarantees, and (iii) applies to any metric with bounded doubling dimension.

Other related work.

For an in-depth overview on clustering and its wide applicability we refer the reader to two excellent surveys [19, 11]. Here we briefly discuss closely related variants of the k-center clustering problem such as the kinetic and the streaming model. In the kinetic setting, the goal is to efficiently maintain a clustering under the continuous motion of the data points. Gao et al. [8] showed an algorithm that achieves an 8-approximation factor. Their result was subsequently improved to a (4+ϵ) guarantee by Friedler and Mount [7]. In the streaming setting, Cohen-Addad et al. [3] designed a (6+ϵ)-approximation algorithm with an expected update time of . However, their algorithm only works in the sliding window model and does not support arbitrary insertions and deletions of points.

There has been growing interest in designing provably dynamic algorithms for clustering problems with different objectives. Two recent examples include works on dynamically maintaining expander decompositions [18] and low-diameter decompositions [5]. For application of such algorithms we refer the reader to these papers and the references therein.

Technical overview.

In the static setting, a well-known approach for designing approximation algorithms for the k-center clustering problem is exploiting the notion of r-nets. Given a metric space (M, d) and a parameter r ≥ 0, an r-net Y_r is a set of points, referred to as centers, satisfying (a) the covering property, i.e., for every point x ∈ M there exists a point y ∈ Y_r within distance at most r and (b) the separating property, i.e., all distinct points y, y' ∈ Y_r are at distance strictly larger than r. Restricting the set of possible radii to powers of (1+ϵ) in [d_min, d_max] allows us to consider only O(ϵ^{-1} log Δ) different r-nets, where Δ is the aspect ratio, defined as the ratio between the maximum and the minimum pair-wise distance in M. The union over all such r-nets naturally defines a hierarchy Π. It can be shown that the smallest r in Π such that the size of the r-net is at most k yields a feasible k-center clustering whose cost is within a factor (2+ϵ) of the optimal one (see e.g., [1]).
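The static approach outlined above can be illustrated with a short Python sketch. It is a simplified illustration under assumptions, not the construction analysed in the paper: the names `greedy_r_net` and `k_center_via_nets` are hypothetical, distances are Euclidean via `math.dist`, each net is recomputed from scratch by a greedy scan, and `d_min` is assumed to be a positive lower bound on the pairwise distances.

```python
import math

def greedy_r_net(points, r, dist=math.dist):
    """Greedy r-net: net points are pairwise more than r apart, and every
    input point lies within distance r of some net point."""
    net = []
    for p in points:
        if all(dist(p, y) > r for y in net):
            net.append(p)
    return net

def k_center_via_nets(points, k, eps, d_min, dist=math.dist):
    """Try radii d_min * (1 + eps)**j for j = 0, 1, ... and return the first
    radius whose greedy net has at most k points, together with that net."""
    r = d_min
    while True:
        net = greedy_r_net(points, r, dist)
        if len(net) <= k:
            return r, net
        r *= 1 + eps
```

The loop stops at the latest once r reaches the largest pairwise distance, because the greedy net then consists of a single point. The returned radius r upper-bounds the cost by the covering property. The previous radius r/(1+ϵ) had more than k net points that are pairwise more than r/(1+ϵ) apart, so two of them share an optimal center and r < 2(1+ϵ)·OPT.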

A natural attempt to extend the above static algorithm to the incremental setting is to maintain the hierarchy Π under insertions of points. In fact, Chan et al. [1] follow this idea to obtain a simple incremental algorithm that has a linear dependency on the number of centers k. We show how to remove this dependency in metrics with bounded doubling dimension and maintain the hierarchy under deletions of points. Concretely, our algorithm employs navigating nets, which can be thought of as a union over slightly modified r-nets with slightly larger constants in the covering and packing properties. Navigating nets were introduced by Krauthgamer and Lee [16] to build an efficient data-structure for the nearest-neighbor search problem. We observe that their data-structure can be slightly extended to a dynamic algorithm for the k-center clustering problem that achieves a constant-factor approximation with similar update time guarantees to those in [16]. Similar to [17], maintaining a carefully defined collection of navigating nets allows us to bring down the approximation factor to (2+ϵ) while increasing the running time by a factor of O(ϵ^{-1} ln ϵ^{-1}).

Similar hierarchical structures have been recently employed for solving the dynamic sum-of-radii clustering problem [14] and the dynamic facility location problem [10]. In comparison to our result that achieves a (2+ϵ)-approximation, the first work proves an approximation factor that has exponential dependency on the doubling dimension while the second one achieves a very large constant. Moreover, while our data-structure supports arbitrary insertions of points, both data-structures support only updates to a specific subset of points in the metric space.

2 Preliminaries

In the k-center clustering problem, we are given a set P of points equipped with some metric d and an integer parameter k > 0. The goal is to find a set C = {c_1, …, c_k} of k points (centers) so as to minimize the quantity φ(C) = max_{p ∈ P} d(p, C), where d(p, C) = min_{c ∈ C} d(p, c). We let OPT denote the cost of the optimal solution.
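As a small worked example of the objective, the following Python sketch evaluates the cost of a candidate set of centers, that is, the largest distance from any point to its closest center. The function name `clustering_cost` and the Euclidean metric (`math.dist`) are assumptions chosen for illustration; any metric could be passed in through `dist`.

```python
import math

def clustering_cost(points, centers, dist=math.dist):
    """Maximum over all points of the distance to the closest center."""
    return max(min(dist(p, c) for c in centers) for p in points)

points = [(0, 0), (1, 0), (10, 0), (11, 0)]
print(clustering_cost(points, [(0, 0), (11, 0)]))  # two well-placed centers
print(clustering_cost(points, [(0, 0), (1, 0)]))   # both centers on one side
```

In this instance the first choice of centers has cost 1 and the second has cost 10, so the first is the better 2-center solution.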

In the dynamic version of this problem, the set P evolves over time and queries can be asked. Concretely, at each timestep t, either a new point is added to P, a point is removed from P, or one of the following queries is made for a given point p ∈ P: (i) decide whether p is a center in the current solution, and (ii) find the center to which p is assigned. The goal is to maintain the set of centers after each update so as to maintain a small factor approximation to the optimal solution.

Let d_min and d_max be lower and upper bounds on the minimum and the maximum distance between any two points that are ever inserted. For each point x and radius r ≥ 0, we let B(x, r) be the set of all points in P that are within distance r from x, i.e., B(x, r) = {y ∈ P : d(x, y) ≤ r}.

The metric spaces that we consider throughout satisfy the following property.

Definition 2.1 (Doubling Dimension).

The doubling dimension of a metric space (M, d) is said to be bounded by κ if any ball B(x, r) in (M, d) can be covered by 2^κ balls of radius r/2.

3 Fully dynamic k-center clustering using navigating nets

In this section, we present a fully-dynamic algorithm for the k-center clustering problem that achieves a (2+ϵ)-approximation with a running time not depending on the number of clusters k. Our construction is based on navigating nets of Krauthgamer and Lee [16] and a scaling technique of McCutchen and Khuller [17].

We start by reviewing some notation from [16].

r-nets and Navigating nets.

Let (M, d) be a metric space. For a given parameter r > 0, a subset Y ⊆ M is an r-net of M if the following properties hold:

  1. (separating) For every pair of distinct points x, y ∈ Y we have that d(x, y) ≥ r and,

  2. (covering) M ⊆ ⋃_{y ∈ Y} B(y, r).

Let α > 1 be a constant and let Γ = {α^i : i ∈ ℤ} be a set of scales. Let Y_r = P for all r ≤ d_min, and for all r ∈ Γ, define Y_r to be an r-net of Y_{r/α}. A navigating net Π is defined as the union of all Y_r for all r ∈ Γ. We refer to the elements in Y_r as centers.

Note that for every scale r > d_max the set Y_r contains only one element due to the separating property. A navigating net Π keeps track of (i) the smallest scale r_min defined by r_min = max{r ∈ Γ : Y_r = P}, and (ii) the largest scale r_max defined by r_max = min{r ∈ Γ : |Y_r| = 1}. All scales r ∈ Γ such that r ∈ [r_min, r_max] are referred to as nontrivial scales.
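To make the nested structure concrete, here is a static Python sketch that builds the nets of a navigating net level by level, each level being a greedy net of the level below it at a scale larger by a factor `alpha`. It only illustrates the definition and is not the dynamic data structure of [16]: the names `r_net_of` and `build_navigating_net` are hypothetical, the metric is Euclidean (`math.dist`), and the input points are assumed to be distinct with pairwise distances at least `d_min` > 0.

```python
import math

def r_net_of(candidates, r, dist=math.dist):
    """Greedy r-net of `candidates`: chosen points are pairwise at distance
    at least r, and every candidate is within distance r of a chosen point."""
    net = []
    for p in candidates:
        if all(dist(p, y) >= r for y in net):
            net.append(p)
    return net

def build_navigating_net(points, alpha, d_min, dist=math.dist):
    """Return {i: Y_i}, where Y_i is the net at scale alpha**i."""
    # Find the largest exponent i with alpha**i <= d_min; there Y_i = P.
    i = 0
    while alpha ** i > d_min:
        i -= 1
    while alpha ** (i + 1) <= d_min:
        i += 1
    levels = {i: list(points)}
    # Coarsen level by level until a single center remains.
    while len(levels[i]) > 1:
        levels[i + 1] = r_net_of(levels[i], alpha ** (i + 1), dist)
        i += 1
    return levels
```

The smallest and largest keys of the returned dictionary play the role of the smallest and largest nontrivial scales in this sketch.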

3.1 Navigating Nets with differing base distances

In what follows, we describe how to obtain a (2+ϵ)-approximation for the k-center clustering problem by maintaining m navigating nets in parallel. This technique was originally introduced by McCutchen and Khuller [17] for improving the approximation ratio of the incremental doubling algorithm for the k-center problem due to Charikar et al. [2].

The key idea behind the construction is that instead of maintaining one navigating net, we maintain m navigating nets with differing base distances. The navigating nets differ only in the corresponding set of scales Γ which is used to define them. More concretely, for each integer 1 ≤ p ≤ m, let Γ^p = {α^{i + p/m} : i ∈ ℤ}.

Let Y^p_r = P for all r ≤ d_min and for all r ∈ Γ^p, let Y^p_r be an r-net of Y^p_{r/α}. A navigating net Π^p is defined as the union over all Y^p_r for r ∈ Γ^p. As above, we maintain r^p_min and r^p_max, such that Y^p_{r^p_min} = P and |Y^p_{r^p_max}| = 1, respectively. Due to the construction of the sets Γ^p, there is an α^{j/m}-net for all positive integers j.

We obtain a k-center solution for the current set of points as follows: For each navigating net Π^p, define i* to be the index such that the r-net Y^p_r with r = α^{i* + p/m} has at most k centers and Y^p_{r/α} has more than k centers. Define cost_p = α/(α−1) · α^{i* + p/m} for all p. We compare the costs of all navigating nets and pick the navigating net Π^{p*} with minimal cost cost_{p*}. The set of centers of the corresponding net in Π^{p*} is the output k-center solution.
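The selection rule can be sketched as follows. The sketch assumes that the m navigating nets use scales of the form alpha**(i + p/m) for p = 1, …, m and that the cost bound attached to a net at scale r is alpha/(alpha − 1) · r; both are assumptions of this illustration. The names `nets_for_offset` and `select_centers` are hypothetical, the nets are rebuilt statically with a greedy scan instead of being maintained dynamically, and the metric is Euclidean.

```python
import math

def nets_for_offset(points, alpha, offset, d_min, dist=math.dist):
    """Yield (r, Y_r) for r = alpha**(i + offset), smallest scale first."""
    i = 0
    while alpha ** (i + offset) > d_min:
        i -= 1
    while alpha ** (i + 1 + offset) <= d_min:
        i += 1
    net = list(points)              # at scales <= d_min the net is all of P
    yield alpha ** (i + offset), net
    while len(net) > 1:
        i += 1
        r = alpha ** (i + offset)
        coarser = []
        for p in net:               # greedy r-net of the previous level
            if all(dist(p, y) >= r for y in coarser):
                coarser.append(p)
        net = coarser
        yield r, net

def select_centers(points, k, alpha, m, d_min):
    """Return (cost bound, scale, centers) of the best of the m nets."""
    best = None
    for p in range(1, m + 1):
        for r, net in nets_for_offset(points, alpha, p / m, d_min):
            if len(net) <= k:       # first scale with at most k centers
                bound = alpha / (alpha - 1) * r
                if best is None or bound < best[0]:
                    best = (bound, r, net)
                break
    return best
```

Because the nets of one hierarchy are nested, their sizes never increase with the scale, so the first scale with at most k centers is exactly the threshold index described above.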

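The next lemma proves that every point x ∈ P is within a distance of α/(α−1) · r of a center in Y^p_r.

Lemma 3.1.

For every x ∈ P, 1 ≤ p ≤ m and r ∈ Γ^p there is a center c ∈ Y^p_r such that d(x, c) ≤ α/(α−1) · r.

Proof.

By construction, the set Y^p_r is an r-net of Y^p_{r/α} and all elements of Y^p_{r/α} are within distance r to a center in Y^p_r. Similarly, the elements of Y^p_{r/α²} are within distance r/α to a center in Y^p_{r/α} and so on. Note that the set Y^p_{r^p_min} contains all points currently in P and thus the distance of every point in P to some center in Y^p_r is bounded by a geometric series.

Formally, let x ∈ P be arbitrary and let c be its ancestor in Y^p_r. Then the distance between x and c is bounded as follows:

d(x, c) ≤ r + r/α + r/α² + … ≤ r · Σ_{j ≥ 0} α^{−j} = α/(α−1) · r. ∎

The above lemma shows an upper bound on the cost of the output k-center solution: if r* denotes the scale of the chosen net, then the cost of the solution is at most cost_{p*} = α/(α−1) · r*. The next lemma proves that cost_{p*} has the desired approximation guarantee, i.e., cost_{p*} ≤ (2+ϵ) · OPT.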

Lemma 3.2.

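If α = O(ϵ^{-1}) and m = O(ϵ^{-1} ln ϵ^{-1}) then cost_{p*} ≤ (2+ϵ) · OPT.

Proof.

We set r* = α^{i* + p*/m}, where cost_{p*} = α/(α−1) · r* for some 1 ≤ p* ≤ m. For comparison, consider level r' = r*/α^{1/m} and the corresponding r'-net Y_{r'}, which belongs to one of the m navigating nets. Note that we returned Y_{r*} instead of Y_{r'} as a solution even though r' < r*. Consequently, |Y_{r'}| > k. Because |Y_{r'}| > k, at least two points y_1, y_2 ∈ Y_{r'} are assigned to the same center c in the optimal solution. By the separation property we get that d(y_1, y_2) ≥ r'. Using the triangle inequality we obtain

r*/α^{1/m} ≤ d(y_1, y_2) ≤ d(y_1, c) + d(c, y_2) ≤ 2 · OPT

and thus r* ≤ 2 · α^{1/m} · OPT. To obtain the desired approximation we compare our result with OPT:

cost_{p*} = α/(α−1) · r* ≤ 2 · α^{1/m} · α/(α−1) · OPT.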
It remains to show that . Set . Clearly, and because . Moreover note that . The latter holds for any , which in turn implies that .∎
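The following Python sketch gives a numeric sanity check of this kind of parameter choice. It assumes that the approximation factor has the form 2 · α^{1/m} · α/(α − 1), which is what one obtains when adjacent scales across the m nets differ by a factor α^{1/m} and the covering radius at scale r is α/(α − 1) · r. The concrete choices α = 2/ϵ and m = ⌈ln α / ln(1+ϵ)⌉ are assumptions of this sketch and are not claimed to be the constants used in the paper; the function names are hypothetical.

```python
import math

def approximation_factor(alpha, m):
    """Assumed form of the factor: 2 * alpha**(1/m) * alpha / (alpha - 1)."""
    return 2 * alpha ** (1 / m) * alpha / (alpha - 1)

def parameters_for(eps):
    """Illustrative parameter choice for 0 < eps <= 1 (an assumption)."""
    alpha = 2 / eps                                      # alpha/(alpha-1) <= 1+eps
    m = math.ceil(math.log(alpha) / math.log(1 + eps))   # alpha**(1/m) <= 1+eps
    return alpha, m

for eps in (1.0, 0.5, 0.1, 0.01):
    alpha, m = parameters_for(eps)
    print(eps, alpha, m, approximation_factor(alpha, m))
```

With these choices both α/(α − 1) and α^{1/m} are at most 1+ϵ for 0 < ϵ ≤ 1, so the factor is at most 2(1+ϵ)² ≤ 2 + 6ϵ, and rescaling ϵ by a constant gives a bound of the form 2+ϵ. The number of nets m grows like ϵ^{-1} ln ϵ^{-1}, in line with the running-time overhead stated in the technical overview.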

3.2 Fully dynamic k-center clustering

In this section, we present the details of the data structure presented in Section 3.1.

Data structure.

Our data-structure needs to (1) maintain the m navigating nets and (2) answer queries about our current solution to the given k-center clustering problem.

For (1) we use the data structure described in [16]: Let and : For the navigating net we do not store the sets explicitly. Instead, for every nontrivial scale and every we store the navigation list which contains nearby points to in the -net , i.e., where . Additionally, for each and each , we store the largest scale such that but we do not store any navigation list where and .

For (2), we also maintain the reverse information. Specifically, for every in and nontrivial scale we maintain which contains all the points in the -net whose navigation list contains , i.e., . We maintain each in a min-heap data structure, where each element is stored with the distance . It is well known that constructing such a min-heap takes time and the insert and delete operations can be supported in logarithmic time. Let be the closest point to in . The min-heap allows us to extract in time. Note that due to the covering property the closest point to is also the closest point to in .
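A reverse list of this kind can be sketched with Python's `heapq` module. The class name `ReverseList` and its methods are hypothetical, and the sketch uses lazy deletion for brevity: deleted entries are discarded only when they reach the top of the heap, which gives amortized rather than worst-case logarithmic bounds. A heap with explicit position tracking would be needed to match the worst-case guarantees mentioned in the text.

```python
import heapq

class ReverseList:
    """Min-heap of (distance, point) pairs with lazy deletion."""

    def __init__(self):
        self._heap = []
        self._alive = {}            # point -> distance currently stored

    def insert(self, point, distance):
        self._alive[point] = distance
        heapq.heappush(self._heap, (distance, point))

    def delete(self, point):
        # Only mark as deleted; the stale heap entry is skipped later.
        self._alive.pop(point, None)

    def closest(self):
        """Return (point, distance) of the nearest stored point, or None."""
        while self._heap:
            distance, point = self._heap[0]
            if self._alive.get(point) == distance:
                return point, distance
            heapq.heappop(self._heap)   # discard stale entry
        return None
```

For example, after inserting points at distances 3, 1 and 2, `closest()` reports the point at distance 1; once that point is deleted it reports the point at distance 2.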

Additionally, we maintain a counter for each scale and navigating net . Also, for all navigating nets , we maintain the largest scale such that and . We store and .

Preprocessing.

Consider the construction of a single navigating net . We start by inserting the points using the routine described in [16, Chapter 2.5] whose running time is . Additionally we construct the lists for every , and scale . We do this during the insert operation which takes care of the lists . Due to Lemma 2.2 in [16] every navigation list has size and due to Lemma 2.3 in [16] every navigating net has only nontrivial scales. Consequently, the sum of all navigation lists in a navigating net is of size . Notice that because the sets store the reverse information of the sets . Since there are navigating nets, the latter yields a construction time of .

Handling Point Updates and Queries.

To handle point insertions and deletions in the navigating nets, we invoke the routines described in [16, Chapters 2.5-2.6] for all the navigating nets. We also keep track of the counters and sets when we handle the insertion and deletions of points in the navigating nets. While updating the counters we simultaneously keep track of for all navigating nets and maintain .

We next discuss the query operations that our data-structure supports. First, we answer the query whether a given point is a center by simply checking if the list exists. Second, given a point we return its corresponding center in as follows: First we check if is a center. If not, we consider . Note that for some . Then we repeatedly determine the navigation list where is the center in which contains within radius using the min-heap . Then we set until . Once we arrive at the list we return as the center is assigned to.
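The two queries can be illustrated with a simplified Python sketch. It replaces the navigation lists and min-heaps by two hypothetical dictionaries: `top_level[x]` is the highest level whose net still contains x, and `parent[(x, i)]` is the closest net point to x one level above level i. The names `is_center`, `assigned_center` and `target_level` (the level of the net that forms the current solution) are assumptions of this illustration.

```python
def is_center(x, top_level, target_level):
    """x is a center iff it survives up to the level of the solution."""
    return top_level[x] >= target_level

def assigned_center(x, top_level, parent, target_level):
    """Climb the hierarchy from x until a center of the solution is reached."""
    current = x
    while top_level[current] < target_level:
        # Move to the closest net point one level above the last level
        # that still contains `current`.
        current = parent[(current, top_level[current])]
    return current

# Hierarchy on the line: level 0 = {0,1,2,3,8}, level 1 = {0,2,8},
# levels 2 and 3 = {0,8}, level 4 = {0}; the solution sits at level 2.
top_level = {0: 4, 1: 0, 2: 1, 3: 0, 8: 3}
parent = {(1, 0): 0, (3, 0): 2, (2, 1): 0, (8, 3): 0}
print(assigned_center(3, top_level, parent, 2))  # climbs 3 -> 2 -> 0
```

Every step moves to a point that lives on a strictly higher level, so the number of iterations is bounded by the number of nontrivial scales.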

The correctness of the maintained hierarchies follows from the correctness in [16]. Due to Lemma 3.2 the output set of centers is a feasible solution to the k-center problem whose cost is guaranteed to be within (2+ϵ) times the optimum cost.

We finally analyze the running time of the update and query operations. The time for handling a point insertion and a point deletion in a single navigating net is O(2^{O(κ)} log Δ log log Δ) (Theorem 2.5 in [16]). Since we maintain O(ϵ^{-1} ln ϵ^{-1}) navigating nets, the overall time to handle a point insertion or deletion is O(2^{O(κ)} log Δ log log Δ · ϵ^{-1} ln ϵ^{-1}). It is straightforward to see that maintaining the counters and min-heaps in all navigating nets can also be done in the same time per update. Determining if a point is a center can be done in O(1) time. Determining the center of a given point takes O(log Δ) time because there are O(log Δ) nontrivial scales (Lemma 2.3 in [16]) and thus there are O(log Δ) iterations in the lookup algorithm until the scale of the output solution is reached.

Combining the above guarantees yields Theorem 1.1.

References

  • [1] T.-H. Hubert Chan, Arnaud Guerquin, and Mauro Sozio. Fully dynamic k-center clustering. In International World Wide Web Conference (WWW), pages 579–587, 2018.
  • [2] Moses Charikar, Chandra Chekuri, Tomás Feder, and Rajeev Motwani. Incremental clustering and dynamic information retrieval. SIAM J. Comput., 33(6):1417–1440, 2004. announced at STOC’97.
  • [3] Vincent Cohen-Addad, Chris Schwiegelshohn, and Christian Sohler. Diameter and k-center in sliding windows. In International Colloquium on Automata, Languages, and Programming (ICALP), pages 19:1–19:12, 2016.
  • [4] Tomás Feder and Daniel H. Greene. Optimal algorithms for approximate clustering. In Symposium on Theory of Computing (STOC), pages 434–444, 1988.
  • [5] Sebastian Forster and Gramoz Goranci. Dynamic low-stretch trees via dynamic low-diameter decompositions. In Symposium on Theory of Computing (STOC) 2019, pages 377–388, 2019.
  • [6] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3):75–174, 2010.
  • [7] Sorelle A. Friedler and David M. Mount. Approximation algorithm for the kinetic robust k-center problem. Comput. Geom., 43(6-7):572–586, 2010.
  • [8] Jie Gao, Leonidas J. Guibas, and An Thai Nguyen. Deformable spanners and applications. Comput. Geom., 35(1-2):2–19, 2006.
  • [9] T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci., 38:293–306, 1985.
  • [10] Gramoz Goranci, Monika Henzinger, and Dariusz Leniowski. A tree structure for dynamic facility location. In European Symposium on Algorithms (ESA), pages 39:1–39:13, 2018.
  • [11] Pierre Hansen and Brigitte Jaumard. Cluster analysis and mathematical programming. Math. Program., 79:191–215, 1997.
  • [12] Sariel Har-Peled. Clustering motion. Discrete & Computational Geometry, 31(4):545–565, 2004. announced at FOCS’04.
  • [13] Sariel Har-Peled and Manor Mendel. Fast construction of nets in low-dimensional metrics and their applications. SIAM J. Comput., 35(5):1148–1184, 2006. announced at SoCG’04.
  • [14] Monika Henzinger, Dariusz Leniowski, and Claire Mathieu. Dynamic clustering to minimize the sum of radii. In European Symposium on Algorithms (ESA), pages 48:1–48:10, 2017.
  • [15] O. Kariv and S. L. Hakimi. An algorithmic approach to network location problems. i: The p-centers. SIAM Journal on Applied Mathematics, 37(3):513–538, 1979.
  • [16] Robert Krauthgamer and James R. Lee. Navigating nets: simple algorithms for proximity search. In Symposium on Discrete Algorithms (SODA), pages 798–807, 2004.
  • [17] Richard Matthew McCutchen and Samir Khuller. Streaming algorithms for k-center clustering with outliers and with anonymity. In APPROX-RANDOM, pages 165–178, 2008. URL: https://doi.org/10.1007/978-3-540-85363-3_14, doi:10.1007/978-3-540-85363-3_14.
  • [18] Thatchaphol Saranurak and Di Wang. Expander decomposition and pruning: Faster, stronger, and simpler. In Symposium on Discrete Algorithms (SODA) 2019, pages 2616–2635, 2019.
  • [19] Satu Elisa Schaeffer. Graph clustering. Computer Science Review, 1(1):27–64, 2007.
  • [20] Melanie Schmidt and Christian Sohler. Fully dynamic hierarchical diameter k-clustering and k-center. CoRR, abs/1908.02645, 2019.
  • [21] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888–905, 2000.