
Computing Euclidean k-Center over Sliding Windows

In the Euclidean k-center problem in the sliding window model, input points are given in a data stream and the goal is to find the k smallest congruent balls whose union covers the N most recent points of the stream. In this model, input points may be examined only once and the amount of space that can be used to store relevant information is limited. Cohen-Addad et al. [cohen-2016] gave a (6+ϵ)-approximation for the metric k-center problem using O((k/ϵ) log α) points, where α is the ratio of the largest and smallest distance and is assumed to be known in advance. In this paper, we present a (3+ϵ)-approximation algorithm for the Euclidean 1-center problem using O((1/ϵ) log α) points. We also present an algorithm for the Euclidean k-center problem that maintains a coreset of size O(k). Given any c-approximation for the coreset, where c is a positive real number, our algorithm gives a (c+2√3+ϵ)-approximation for the Euclidean k-center problem using O((k/ϵ) log α) points. For example, by using the 2-approximation [feder-greene-1988] for the coreset, our algorithm gives a (2+2√3+ϵ)-approximation (≈ 5.464) using O(k log k) time. This improves on the approximation factor of (6+ϵ) by Cohen-Addad et al. [cohen-2016] with the same space complexity and smaller update time per point. Moreover, we remove the assumption that α is known in advance. Our idea can be adapted to the metric diameter problem and the metric k-center problem to remove the same assumption. For low dimensional Euclidean space, we give an approximation algorithm that guarantees an even better approximation factor.

1 Introduction and problem statement

The k-center problem, which asks for the k smallest congruent balls containing a set of input points, is a fundamental problem arising in many real-world applications, including machine learning, data mining, and image processing. The growth of the Internet and of computing power has led to a significant increase in the amount of data collected and used by applications over the last decades. For huge data sets, however, it is hard to guarantee reasonable processing time and memory usage. To cope with this difficulty, data stream models have received considerable attention in both theory and applications.

In the streaming model, it is important to design an algorithm whose space complexity does not depend on the size of the input, since the memory size is typically much smaller than the input size. In this paper, we consider the single-pass streaming model [chan-pathak-2014], where elements in the data stream are allowed to be examined only once and only a limited amount of information can be stored. The insertion-only stream model is well studied, but also more flexible settings like dynamic streams and the sliding window model have received some attention for many clustering problems. In the dynamic stream model input points can be removed arbitrarily and in the sliding window model older input is deleted as new elements arrive.

In this paper, we consider the Euclidean k-center problem for a sliding window, which contains the N most recent points.
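To make the window semantics concrete, here is a minimal sketch of a brute-force sliding window. The class name `SlidingWindow` and its methods are hypothetical and not part of the paper. It stores all N alive points, which is exactly the cost the streaming algorithms in this paper avoid; it only illustrates which points count as alive.

```python
from collections import deque


class SlidingWindow:
    """Brute-force baseline: keeps the N most recent points (O(N) space)."""

    def __init__(self, n):
        # A deque with maxlen drops the oldest point automatically on overflow.
        self.points = deque(maxlen=n)

    def insert(self, p):
        self.points.append(p)

    def alive(self):
        # Points currently inside the window, oldest first.
        return list(self.points)


w = SlidingWindow(3)
for p in [(0, 0), (1, 0), (2, 0), (3, 0)]:
    w.insert(p)
print(w.alive())  # [(1, 0), (2, 0), (3, 0)]
```

Once the fourth point arrives, the first one is no longer alive and must not influence the reported solution.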

1.1 Previous Work

The k-center problem in the static setting. The Euclidean k-center problem has been extensively studied in the literature. If k is part of the input, the k-center problem is NP-hard [garey-johnson-1979], even in the plane [megiddo-supowit-1984]. In fact, it is known to be NP-hard to approximate the k-center problem with a factor smaller than 2 for arbitrary metric spaces [fowler-paterson-1981], and with a factor smaller than 1.822 for the Euclidean space [bern-eppstein-1996]. If the Euclidean dimension d is not fixed, the problem is NP-hard for fixed k [megiddo-1990]. Agarwal and Procopiuc [agarwal-procopiuc-2002] gave an exact algorithm that runs in n^{O(k^{1-1/d})} time for the L_2-metric and the L_∞-metric. Feder and Greene [feder-greene-1988] gave a 2-approximation that runs in O(n log k) time for any L_p-metric.

For small k and d, better solutions are available. The 1-center problem in fixed dimensions is known to be of LP-type and can be solved in linear time [chazelle-matousek-2015]. For the Euclidean 2-center problem in the plane, the best known algorithm is given by Chan [chan-1999], which runs deterministically in O(n log² n log² log n) time using O(n) space.
The -center problem in the streaming model. In the streaming model, where only a single pass over the input is allowed, McCutchen and Khuller [mccutchen-khuller-2008], and independently Guha [guha-2009], presented algorithms to maintain a -approximation for -centers in any dimension using O() space and O() update time. For , Zarrabi-Zadeh and Chan [zarrabi-chan-2006] presented a simple algorithm achieving an approximation factor of using only O() space. Agarwal and Sharathkumar [agarwal-shara-2010] improved the approximation factor to using O() space and O() update time. The approximation factor of their algorithm was later improved to by Chan and Pathak [chan-pathak-2014]. For , Kim and Ahn [kim-ahn-2015] gave a -approximation using O() space and O() update time. Their algorithm extends to any fixed with the same approximation factor. If and are fixed, Zarrabi-Zadeh [zarrabi-2008] gave an -approximation using O() space and O() update time.
In the sliding window model, Cohen-Addad et al. [cohen-2016] have recently obtained results for a variant of the -center problem that returns

centers, but not the radius. Under the assumption that the algorithm cannot create new points, they show that any randomized sliding window algorithm achieving an approximation factor less than 4 with probability greater than

for metric -center problem requires points. For , they give a -approximation using O( points and O( update time per point where is the ratio of the largest and smallest distance and is assumed to be known in advance. For general , they gave a -approximation using O( points and O( points.
Problems in the sliding window model. Many problems have been studied in the literature [braverman-ostrovsky-2010, braverman-2016]. Several algorithms have been proposed for the diameter problem [feigenbaum-2004, chan-Sadjad-2006, cohen-2016]. Chan and Sadjad [chan-Sadjad-2006] give a -approximation using points. For higher dimensions, Cohen-Addad et al. [cohen-2016] give a -approximation using points and update time per point.

1.2 Detailed Comparison with [cohen-2016]

For the 1-center problem, when the value of α is known, our algorithm is a modification of the diameter algorithm of Cohen-Addad et al. [cohen-2016]. In addition to the center point returned by the algorithm of Cohen-Addad et al., we maintain a second center point, such that all alive points are contained in two small balls centered at these two points. This way we obtain a better approximation factor. For the case of unknown α, we bound the number of solutions we maintain, thereby removing the assumption that the value of α is known in advance. We show that it is sufficient to maintain O((1/ϵ) log α) solutions.

A coreset is a small portion of the data such that running a clustering algorithm on it generates a clustering that is approximately optimal for the whole data. For the k-center problem, we find a coreset of size O(k) which guarantees a (c+2√3+ϵ)-approximation by using a c-approximation for the coreset. We use the observation that no three points within the Euclidean unit ball have all pairwise distances of more than √3.

1.3 Our Contribution

For the 1-center problem, we obtain a (3+ϵ)-approximation that works without knowing the parameter α in advance by adding a carefully chosen point. The parameter α is the ratio of the largest and smallest possible distance between the points. Our algorithm maintains O((1/ϵ) log α) points and requires O((1/ϵ) log α) update time per point. We also remove the assumption that α is known in advance. Because α changes when the sliding window moves, finding a proper estimate of α is difficult. Removing this assumption is therefore important for implementing an algorithm in the streaming model. Our idea is general enough to be adapted to the algorithms of Cohen-Addad et al. for the metric diameter problem and the metric k-center problem.

In the static model, finding a -approximation for the -center problem is known and easy, however, in the streaming model finding a feasible radius is difficult because we do not know all the points. Therefore our result is non-trivial.

For the k-center problem, our algorithm finds a coreset of size O(k) using O((k/ϵ) log α) points and O() update time per point. Our algorithm gives a (c+2√3+ϵ)-approximation for the Euclidean k-center problem by using any given c-approximation for the coreset, where c is a positive real number. By using the exact algorithm [agarwal-procopiuc-2002] for the coreset, our algorithm gives a (1+2√3+ϵ)-approximation (≈ 4.464) using O) time per point. By using the 2-approximation [feder-greene-1988] for the coreset, our algorithm gives a (2+2√3+ϵ)-approximation (≈ 5.464) using O(k log k) time per point. Both algorithms for the k-center problem improve on the approximation factor of (6+ϵ) by Cohen-Addad et al. [cohen-2016] with the same space complexity.
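The factors quoted above follow from the expression c + 2√3 + ϵ given in the abstract. The short check below evaluates it for the two choices of c discussed in the text. The helper name `approximation_factor` is hypothetical and used only for this illustration.

```python
import math


def approximation_factor(c, eps=0.0):
    """Factor c + 2*sqrt(3) + eps obtained from a c-approximation on the coreset."""
    return c + 2 * math.sqrt(3) + eps


# c = 1: exact algorithm on the coreset; c = 2: 2-approximation on the coreset.
print(round(approximation_factor(1), 3))  # 4.464
print(round(approximation_factor(2), 3))  # 5.464
```

Both values are below the factor 6 + ϵ of the earlier sliding window algorithm.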

For low dimensional Euclidean space, better approximation is available. Our algorithm finds a coreset of size maintaining O() points and O) update time per point where is an trade-off parameter, which is a positive integer, and is the doubling constant. We give a -approximation by using any given -approximation for the coreset. We can get a -approximation by using this result.

2 Preliminaries

Let be a set of points in -dimensional Euclidean space. In the sliding window model, the points in arrive one by one, and are allowed to be examined only once. The points in are labeled in order of their arrival. That is, is a point in that arrives in the -th step. We denote by a subset of points in that are the last points, that is, . We call a point alive if . Let be the index of in the insertion order. (.) Let be the diameter of the points set . Let be the closest pair of the points set . Let be the ratio between the diameter and the minimum non-zero distance between any two points in . Let be the Euclidean distance between and . Let denote a ball of radius centered at . Let be the radius of optimal solution.

3 The 1-Center Problem

1: first point of the stream
2: null
3:for all elements of the stream do
4:     if  is deleted then
5:         if null and  then
6:              ; ;
7:               null;          
8:         if null and  then
9:              ; ;
10:               null;          
11:         if null then
12:              ; ;               
13:     INSERT()
14:     
15:procedure INSERT(p)
16:     if  null then
17:         if  then
18:              ,; ;
19:         else if  then
20:              ; ;          
21:     else
22:         if  then
23:              ,; ;
24:         else if  then
25:              ; ; ;
26:         else if  then
27:              if  then
28:                  ; ; ;                             
Algorithm 1 Diameter()

First we give a -approximation by using the algorithm for the diameter problem. The details of this can be found in the Appendix A.

Now we explain our (3+ϵ)-approximation algorithm. Our idea in this section is to carefully adapt Algorithm 1 of Cohen-Addad et al. [cohen-2016], originally designed for the diameter problem. To improve readability, we sketch their algorithm and its properties before explaining our modifications. Their algorithm consists of two parts: a fixed parameter algorithm and a way to maintain parameters. We only explain the first part.

For a given estimate , their algorithm maintains four points , , , and . Their algorithm returns either two points and () such that or a point . If their algorithm returns one point, then they show that in this case.

They show that the following two invariants are satisfied. The invariants are:

Invariant 1

If then the following statements hold:
a) For any alive points with , we have .
b) For any point with , we have .

Invariant 2

If , then the following statements hold:
a) .
b) For any point with , we have .
c) For any point with , we have .
d) If then for any point with , we have .

Now we explain our algorithm. Our algorithm consists of two parts: a fixed parameter algorithm and a way to maintain parameters.

Given an estimate , we use their fixed parameter algorithm while maintaining one additional point. We call our algorithm . We maintain a bridge point, and the following invariant states the property of this point. The main idea is that two balls and contain all alive points if our algorithm returns a point where is a bridge point. See Figure 1 (b). This invariant is a variant of Invariant 1 and is given implicitly in the proofs of Lemma 3 and Lemma 4 of [cohen-2016]. The proof of this invariant can be found in Appendix B.

Lemma 1 (Invariant 3)

If then the following statement holds:
There exists a bridge point such that for any alive point with .

By Invariants 1 and 3, we can find a solution. The proof of the following lemma can be found in Appendix B.

Lemma 2

If then contains where .

Now we explain the way to maintain estimates. First we explain the case when the number of estimates is unbounded and then we explain the case when the number of estimates is bounded.

We maintain estimates that form an exponential sequence to the base (1+ϵ), such that any value between the distance of the closest pair and the diameter is (1+ϵ)-approximated. For each power of (1+ϵ), we run one instance of our fixed parameter algorithm.
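The estimate sequence can be sketched as follows, assuming the base is (1+ϵ) and that the closest-pair distance `d_min` and the diameter `d_max` are known for the purpose of the illustration (the paper's point is precisely that they need not be known in advance). The function name `estimate_ladder` is hypothetical.

```python
import math


def estimate_ladder(d_min, d_max, eps):
    """Powers of (1 + eps) that bracket every value in [d_min, d_max]."""
    base = 1.0 + eps
    lo = math.floor(math.log(d_min, base))  # largest power not above d_min
    hi = math.ceil(math.log(d_max, base))   # smallest power not below d_max
    return [base ** i for i in range(lo, hi + 1)]


ladder = estimate_ladder(1.0, 100.0, 0.5)
print(len(ladder))  # 13
```

The length of the list grows like log(d_max/d_min) / log(1+ϵ), which is O((1/ϵ) log α) for small ϵ, matching the number of instances run in parallel.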

Figure 1: (a) and are diameter and . Therefore (b) contains

Now we explain the way to bound the number of estimates. Let be an estimates set containing for all integer . We modify some solutions of our algorithm such that the new solutions also satisfy all invariants and maintain the specific points. Let () be an estimate in such that, for any estimate (), and () maintain the same points and the same solution. By this property, we know all the solutions in by just maintaining estimates between and .

To find proper and , we use the following witnesses. The proof of this Lemma can be found in the Appendix B.

Lemma 3

If , then maintains and . It returns and as a solution. If , then we can maintain , , , , and . We return as a solution. All the invariants hold for the solution.

We set the largest estimate satisfying as . Let be the length of our solution. Because by Invariant 1, we set the smallest estimate satisfying as .

When a constant number of points are inserted, we maintain all estimates between and by using evidences for each direction. Then we update and . We first update the direction decreasing the number of estimates if it is possible. Because is larger than and is smaller than , the number of estimates is .

Among the estimates, we choose the smallest estimate that returns a point . Note that returns two points, which means . Since , . Therefore our solution guarantees a (3+ϵ)-approximation. See Figure 1 (b).

Note that takes a constant time to update.

Therefore we obtain the following Theorem.

Theorem 3.1

Given a set of points with a window of size N and a value ϵ > 0, our algorithm guarantees a (3+ϵ)-approximation to the Euclidean 1-center problem in the sliding window model maintaining O((1/ϵ) log α) points and requiring O((1/ϵ) log α) update time per point in arbitrary dimensions.

4 The k-Center Problem

Our algorithm is similar to Algorithm 2 of Cohen-Addad et al. [cohen-2016]. The main difference is that we implicitly maintain O() balls of radius by observing a property of the Euclidean ball. Our algorithm maintains a coreset and does not compute a solution for each update. When a query is given, our algorithm computes a solution from the coreset. Our algorithm consists of two parts: a fixed parameter algorithm and a way to maintain parameters.

We call our algorithm 4kCoreSet. For a given an estimate optimal radius , our algorithm gives a coreset of at most points such that, for any alive point and its nearest point in , . If our algorithm maintains satisfying the condition, we call the coreset feasible. Otherwise we call the coreset infeasible. If our algorithm maintains a feasible coreset. Otherwise our algorithm may maintain a feasible coreset or an infeasible coreset. If , then we can compute an -approximation by computing -approximation of -center problem for the O() points. For small , we compute an -approximation solution by computing an optimal solution for the points. For large , we compute an -approximation solution by computing a -approximation solution for the points. We maintain estimates and one of them satisfies the condition.

Figure 2: An example for the -center problem . The centers of balls are active center points and the radius is . The centers of blue balls are in and the centers of green balls are not in . The red points are the representative points. The dashed red ball is and it contains the green ball whose representative point is .

A high level description of our algorithm is as follows. We implicitly maintain at most balls with radius such that the balls contain all points in . We maintain a representative point per each ball and return the representative points as our coreset. In Figure 2, we give an example of our solution for the -center problem.

We explain our algorithm more precisely. We maintain a set of at most active center points. For each active center point , we maintain a representative point within radius . We maintain a set for the representative points. When a new point arrives, we first remove the points in and that have been deleted. Note that the representative point of a deleted active center point can remain in . Then we choose all active center points from whose distance from is at most , and if there are such points we update their representative points to . Otherwise we add to . If , we remove the oldest active center point . In this case, we mark the coreset as infeasible until is deleted, and we delete all representative points that are as old as or older than the removed active center point. We do this to bound the number of points we maintain. To decide whether our coreset is feasible, we maintain the feasible time FT and the current time CT. We set and . If , then this solution is infeasible.

If our coreset is feasible, then all points in are in the union of at most balls we maintain implicitly. Moreover, each ball contains a representative point in . Let be a set of congruent balls such that is contained in the union of balls in . We enlarge the balls by , then the enlarged balls contain all points in .

1:
2:
3:for all element of the stream do
4:     
5:     if  is deleted then
6:               
7:     if  is deleted then
8:         DeleteActive(a)      
9:     Insert(p)
10:procedure DeleteActive(a)
11:     if  is not deleted then
12:         
13:         for  do
14:              if  then
15:                                               
16:     
17:procedure Insert(p)
18:     
19:     if  then
20:         if  then
21:               oldest point in
22:              DeleteActive()          
23:         
24:         
25:         
26:     else
27:         for all  do
28:                             
Algorithm 2 CoreSet()
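The prose description of the fixed parameter algorithm can be sketched in Python as follows. This is an illustration under stated assumptions, not a transcription of Algorithm 2: the cap of 2k active centers and the distance threshold √3·r are inferred from the counting argument of Lemma 7 and from the 2√3 term in the abstract, and the class name `CoresetSketch` and all field names are hypothetical.

```python
import math


class CoresetSketch:
    """Coreset maintenance for one radius estimate r (illustrative only)."""

    def __init__(self, k, r, window):
        self.max_centers = 2 * k           # ASSUMPTION: cap on active centers
        self.threshold = math.sqrt(3) * r  # ASSUMPTION: center/representative distance
        self.window = window               # window size N
        self.now = 0                       # current time CT
        self.feasible_from = 0             # feasible time FT
        self.active = {}                   # center time -> (center, rep time, rep)
        self.orphans = {}                  # rep time -> rep whose center is gone

    def _expired(self, t):
        return t <= self.now - self.window

    def insert(self, p):
        self.now += 1
        # Drop centers and representatives that slid out of the window.
        for t in [t for t in self.active if self._expired(t)]:
            _, rep_t, rep = self.active.pop(t)
            if not self._expired(rep_t):
                self.orphans[rep_t] = rep  # representative outlives its center
        self.orphans = {t: q for t, q in self.orphans.items()
                        if not self._expired(t)}
        near = [t for t, (a, _, _) in self.active.items()
                if math.dist(a, p) <= self.threshold]
        if near:
            for t in near:                 # p becomes the new representative
                self.active[t] = (self.active[t][0], self.now, p)
            return
        if len(self.active) == self.max_centers:
            oldest = min(self.active)      # evict the oldest active center
            _, rep_t, rep = self.active.pop(oldest)
            if rep_t > oldest:
                self.orphans[rep_t] = rep
            self.orphans = {t: q for t, q in self.orphans.items() if t > oldest}
            # Infeasible until the evicted center would have expired anyway.
            self.feasible_from = max(self.feasible_from, oldest + self.window)
        self.active[self.now] = (p, self.now, p)

    def feasible(self):
        return self.now >= self.feasible_from

    def coreset(self):
        reps = {rep_t: rep for _, rep_t, rep in self.active.values()}
        reps.update(self.orphans)
        return [reps[t] for t in sorted(reps)]


cs = CoresetSketch(k=1, r=1.0, window=4)
for p in [(0, 0), (1, 0), (10, 0), (10.5, 0)]:
    cs.insert(p)
print(cs.coreset(), cs.feasible())  # [(1, 0), (10.5, 0)] True
```

With k = 1 and r = 1 the two far-apart clusters already fill the assumed cap of two active centers; a third distant point forces an eviction, and the instance reports itself infeasible until the evicted center leaves the window, which signals that this estimate r is too small.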

We start our analysis by giving the space bound.

Lemma 4

CoreSet() uses O() space and update time per point.

Proof

The number of active center points in is at most and the number of representative points corresponding to the active center points is at most . What remains to be shown is that the number of representative points whose active center point is not in is at most . This result comes from Lemma 7 of Cohen-Addad et al. [cohen-2016], but for completeness we explain it.

Let be th active center point inserted in . Note that after an active center point is removed from , is not updated. Therefore, for all . Assume we currently have where . If is deleted, then is also deleted for all . If is not deleted, then we removed it from and we also removed all representative points older than by line 13 to line 15 of CoreSet(). Therefore, we removed from for all .

By the above reasoning, and , and the space bound holds.

Now we bound the update time. Removing points in the main procedure takes . The procedure INSERT takes .

Now we show that our algorithm returns a feasible solution when . We need the following technical lemma to show it.

Lemma 5

Let be a unit ball in for . There are no three points , , and such that the points are contained in and all of their pairwise distances are larger than √3.

Proof

In order to derive a contradiction, we assume that there exist such points , , and . Then we choose the plane passing through the points. Let be the 2-dimensional ball that is the intersection of and the plane. Note that the radius of is at most 1 and the convex hull of the points lies in . Then we can move the three points such that their pairwise distances are √3 and those points are contained in the convex hull of the original points. Let those points be , , and . Because , , and are on the boundary of when is a unit ball, at least one of the original points lies outside of . See Figure 3. This is a contradiction.
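The extremal configuration in this proof is an equilateral triangle inscribed in a unit circle, whose side length is √3. The snippet below checks this numerically and also samples random triples in the unit disk as a sanity check of the lemma in the plane. The constant √3 is taken from the 2√3 term of the abstract, and the helper names `min_pairwise` and `random_in_disk` are hypothetical.

```python
import math
import random

SQRT3 = math.sqrt(3)


def min_pairwise(a, b, c):
    """Smallest of the three pairwise Euclidean distances."""
    return min(math.dist(a, b), math.dist(b, c), math.dist(a, c))


def random_in_disk(rng):
    """Uniform point in the unit disk via rejection sampling."""
    while True:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            return (x, y)


# Equilateral triangle inscribed in the unit circle: all sides equal sqrt(3).
tri = [(math.cos(2 * math.pi * i / 3), math.sin(2 * math.pi * i / 3))
       for i in range(3)]
print(min_pairwise(*tri), SQRT3)

# No sampled triple should have all pairwise distances above sqrt(3).
rng = random.Random(0)
worst = max(min_pairwise(*(random_in_disk(rng) for _ in range(3)))
            for _ in range(10000))
print(worst <= SQRT3)
```

A random search is of course not a proof; it only illustrates that the inscribed equilateral triangle is the tight case.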

By Lemma 5, the following lemma holds.

Figure 3: , , and are on the boundary of .

Lemma 6

Let be a unit ball in for . For any points and in and . Then .

Now we show that our algorithm maintains a feasible coreset when .

Lemma 7

If , then CoreSet() maintains a feasible coreset.

Proof

In CoreSet(), we maintain at most active center points in . Assume we currently have where .

If , then this lemma holds. If , then we will show it is impossible when . In order to derive a contradiction, we assume that is alive. Since is alive, we have at least active center points whose pairwise distance is larger than . By Lemma 6 and the distances between the points, at most two of the points lie in an optimal ball. Therefore the number of the points is at most 2k, but we have more than 2k, which is a contradiction.

Now we explain the way to maintain estimates. We proceed as in Section 3. We maintain estimates that form an exponential sequence to the base (1+ϵ), such that any value between the distance of the closest pair and the diameter is (1+ϵ)-approximated. For each power of (1+ϵ), we run 4kCoreSet. We choose the smallest estimate that returns a feasible solution.

Let () be an estimate in such that, for any estimate (), 4kCoreSet and 4kCoreSet (4kCoreSet) maintain the same coreset. To bound the number of estimates, we use the same idea as in Section 3. Let be an estimates set containing for all integer . Let () be an estimate in such that for any estimate () 4kCoreSet (4kCoreSet) and 4kCoreSet maintain the same coreset. We set to the witness. For any estimate , balls are disjoint where , therefore is an infeasible coreset until . We set to the witness. Because contains all points in , we maintain as the coreset for . We set where is the estimate of the solution of our -center problem.

Combining these lemmas and ideas, we obtain the following theorem.

Theorem 4.1

Given a set of points with a window of size N and a value ϵ > 0, our algorithm maintains at most points such that for any point there is a point such that . Our algorithm maintains O((k/ϵ) log α) points and requires O update time per point. To compute a (c+2√3+ϵ)-approximate solution to the Euclidean k-center problem we need a c-approximate solution of the k-center problem for O(k) points.

Proof

By Lemma 7, CoreSet() returns a feasible coreset when . Now we show that our algorithm is a (c+2√3+ϵ)-approximation when . Our algorithm gives a set of at most points. For any alive point , there exists a point such that by Lemma 7. Let be a c-approximate solution of the k-center problem for . For any alive point , the distance from its closest center in is at most . Therefore, the union of balls contains all points in and it is a (c+2√3+ϵ)-approximation.

For finding , we use a 2-approximation of the k-center problem for . Note that the closest pair are in an optimal ball and other optimal balls contain exactly one point. Therefore, the radius is between and . It takes O(k log k) time by the algorithm of Feder and Greene [feder-greene-1988].
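The query step, running a c-approximation on the coreset and enlarging the balls, can be sketched as follows. This sketch swaps in Gonzalez's farthest-first traversal, a simple 2-approximation for k-center, in place of the Feder and Greene algorithm cited above; it has the same factor c = 2 but runs in O(k·|S|) time on a coreset S rather than O(k log k). The slack 2√3·r is an assumption inferred from the abstract, and the function names are hypothetical.

```python
import math


def farthest_first(points, k):
    """Gonzalez's greedy 2-approximation: returns (centers, covering radius)."""
    centers = [points[0]]
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        # The next center is the point farthest from the current centers.
        i = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[i])
        dist = [min(d, math.dist(p, points[i])) for d, p in zip(dist, points)]
    return centers, max(dist)


def window_solution(coreset, k, r):
    """Centers from the coreset plus a radius that also covers the window."""
    centers, rho = farthest_first(coreset, k)
    # ASSUMPTION: every alive point is within 2*sqrt(3)*r of a coreset point.
    return centers, rho + 2 * math.sqrt(3) * r


centers, radius = window_solution([(1, 0), (10.5, 0), (20, 0)], k=2, r=1.0)
print(centers)  # [(1, 0), (20, 0)]
```

By the triangle inequality, an alive point within 2√3·r of a coreset point, which in turn is within the covering radius of some chosen center, is covered by the enlarged ball around that center.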

The memory usage of the algorithm consists of O() per instance of 4kCoreSet and estimates.

CoreSet() takes update time per point and estimates. To bound the number of estimates it takes time per point.

5 The k-Center Problem in Low Dimension

In this section, we have the following theorem. The details of this section can be found in the Appendix C.

Theorem 5.1

Given a set of points with a window of size and a value , our algorithm maintains points such that for any point there is a point such that . Our algorithm maintains O points and requires O) update time per point. To compute a -approximation to the Euclidean -center problem we need a -approximation of the -center problem algorithm for the coreset.

References

Appendices

A The -approximation for -center problem

First we give a -approximation by using the algorithm for the diameter problem. The metric diameter problem is to find two points of maximum distance among a set of points lying in some metric space. The algorithm of Cohen-Addad et al. [cohen-2016] returns two points and such that their distance is at least . We choose the last point as the center and as the radius. We give an example that gives -approximation. See Figure 1 (a). In the figure, the length of the diameter is .

B Proofs of lemmas in Section 3

Lemma 8 (Invariant 3)

If then the following statement holds:
There exists a bridge point such that for any alive point with .

Proof

We consider the situation when Invariant 3 holds and a new point is inserted. We will show that Invariant 3 is satisfied after the update. Let () be the point () before is inserted. Let , and be defined in a similar way.

There are 3 cases when and .
Case 1 : and is deleted.
Case 2 : and , and is deleted.
Case 3 : and , and is deleted.

We explain how to update the bridge point for each case. Note that for any case is deleted.
Case 1) In this case, their algorithm sets . We need to show that for any point with , there is a point such that . By Invariant 1.b), for any point with , we have . We set .
Case 2) In this case, their algorithm sets . We need to show that for any point with , there is a point such that . Since and , is the oldest alive point. And, by Invariant 2.c), for any point with , we have . We set .
Case 3) In this case, their algorithm sets . We need to show that for any point with , there is a point such that . By Invariant 2.b), for any point with , we have . We set .

Lemma 9

If then contains where .

Proof

By Invariant 3, we have a bridge point , for any alive point that is older than , satisfying . By Invariant 1, for any with . by Invariant 3. We choose the mid point of and . Then for any alive point . See Figure 1 (b).

Lemma 10

If , then maintains and . It returns and as a solution. If , then we can maintain , , , , and . We return as a solution. All the invariants hold for the solution.

Proof

First we prove the case when . After updating , we set to by line 14 of Algorithm 1. When is inserted our algorithm calls Insert. In the procedure, whether is null or not, . So we set and to and to . Since is not null, we do not maintain . In line 14, we set to .

When , we set , and to , to null, and to . Note that Invariant 1 and Invariant 3 hold because .

C The k-Center Problem in Low Dimension

In this section, we explain a better approximation when the dimension of the Euclidean space is low. The idea is similar to Section 4. Our algorithm consists of two parts: a fixed parameter algorithm and a way to maintain parameters. We focus on the fixed parameter algorithm. We implicitly maintain small balls that contain all alive points. For a given radius of balls, we will bound the number of balls and the approximation factor of our algorithm by using a property of doubling metric spaces.

The doubling dimension [assouad-1983, heinonen-2001] of a metric space is the smallest such that every ball of radius is covered by balls of radius at most . It is well-known that a point set in -dimensional Euclidean metric has doubling dimension . In [verger-2005], Verger-Gaugry gives a constant (equation (4) in Theorem 1.2) such that .

APX balls
1 O()
3 O()
5 O()
8 O()
10 O()
O()
Table 1: Relation between the approximation factor and the number of balls

We will use the property of doubling metric space in the following way. A unit ball is contained in balls with radius . We can use this idea recursively. A ball with radius is contained in balls with radius . Therefore balls with radius contain a unit ball. By lemma 6, a ball with radius is contained in the union of two balls with radius .

The basic idea of our algorithm is similar to our algorithm in Section 4. The difference is that we maintain a set of at most active center points and the distance between an active center point and its representative point is at most . If , then our algorithm returns a feasible solution and it guarantees a -approximation by using a -approximation for the coreset. We maintain estimates in the same way as in Section 3. To bound the number of estimates, we set and .

Theorem 0..2

Given a set of points with a window of size and a value , our algorithm maintains points such that for any point there is a point such that . Our algorithm maintains O points and requires O) update time per point. To compute a -approximation to the Euclidean -center problem we need a -approximation of the -center problem algorithm for the coreset.

To get -approximation, inequality holds and we get by modifying the inequality.