Coresets for Minimum Enclosing Balls over Sliding Windows

05/09/2019 ∙ by Yanhao Wang, et al.

Coresets are important tools to generate concise summaries of massive datasets for approximate analysis. A coreset is a small subset of points extracted from the original point set such that certain geometric properties are preserved with provable guarantees. This paper investigates the problem of maintaining a coreset to preserve the minimum enclosing ball (MEB) for a sliding window of points that are continuously updated in a data stream. Although the problem has been extensively studied in batch and append-only streaming settings, no efficient sliding-window solution is available yet. In this work, we first introduce an algorithm, called AOMEB, to build a coreset for MEB in an append-only stream. AOMEB improves the practical performance of the state-of-the-art algorithm while having the same approximation ratio. Furthermore, using AOMEB as a building block, we propose two novel algorithms, namely SWMEB and SWMEB+, to maintain coresets for MEB over the sliding window with constant approximation ratios. The proposed algorithms also support coresets for MEB in a reproducing kernel Hilbert space (RKHS). Finally, extensive experiments on real-world and synthetic datasets demonstrate that SWMEB and SWMEB+ achieve speedups of up to four orders of magnitude over the state-of-the-art batch algorithm while providing coresets for MEB with rather small errors compared to the optimal ones.


Code Repositories

SW-MEB

A library for minimum enclosing ball over sliding windows



1 Introduction

Unprecedented growth of data poses significant challenges in designing algorithms that can scale to massive datasets. Algorithms with superlinear complexity often become infeasible on datasets with millions or billions of points. Coresets are an effective approach to tackling the challenges of big data analysis. A coreset is a small subset extracted from the original point set such that certain geometric properties are preserved with provable guarantees [3]. Instead of processing the original dataset, one can perform the computation on its coreset with little loss of accuracy. Coreset approximation has been shown to be effective for various types of problems, e.g., $k$-median and $k$-means clustering [20, 3, 9], non-negative matrix factorization (NMF) [17], kernel density estimation (KDE) [34, 26], and many others [6, 21, 10].

Coresets for minimum enclosing balls (MEB) [5, 33, 22, 4, 1, 32, 23] have received significant attention due to their wide applications in clustering [5, 22, 4], support vector machines [28], kernel regression [31], fuzzy inference [13], shape fitting [23], and approximate furthest neighbor search [25]. Given a set of points $P$, the minimum enclosing ball of $P$, denoted by $\mathrm{MEB}(P)$, is the smallest ball that contains all points in $P$. A subset $S \subseteq P$ is a $(1+\varepsilon)$-coreset for $\mathrm{MEB}(P)$ if the distance between the center of $\mathrm{MEB}(S)$ and any point in $P$ is within $(1+\varepsilon)\cdot r$, where $r$ is the radius of $\mathrm{MEB}(S)$ [4]. Existing studies [22, 4] show that there always exists a $(1+\varepsilon)$-coreset of size $O(1/\varepsilon)$ ($0 < \varepsilon < 1$) for the MEB of any point set, which is independent of the dataset size and dimension.

Most existing methods [5, 22, 4, 32, 23] of coresets for MEB focus on the batch setting and must keep all points in memory when constructing the coresets. In many applications, such as network monitoring, financial analysis, and sensor data mining, one needs to process data in the streaming model [24] where the input points arrive one at a time and cannot be stored entirely. There have been several methods [2, 33, 1, 11] to maintain coresets for MEB in data streams. The state-of-the-art streaming algorithm [1] can maintain a -coreset for MEB with a single pass through the input dataset. However, these algorithms only consider the append-only scenario where new points are continuously added to, but old ones are never deleted from, the stream. Hence, they fail to capture recency in time-sensitive applications since the computation may be performed on outdated data. To meet the recency requirement, the sliding window model [15, 8, 16, 9] that only considers the most recent points in the stream at any time is a popular approach for real-time analytics. One can trivially adapt append-only methods for the sliding window model but a complete coreset reconstruction is deemed inevitable once an expired point is deleted. To the best of our knowledge, there is no existing algorithm that can maintain coresets for MEB over the sliding window efficiently.

Our Results. In this paper, we investigate the problem of maintaining coresets for MEB in the sliding window model. In particular, our results are summarized as follows.

  • In Section 3.1, we present the AOMEB algorithm to maintain a -coreset of size with computation time per point for the MEB of an append-only stream, where is the dimension of the points and is the ratio of the maximum and minimum distances between any two points in the input dataset. AOMEB shows better empirical performance than the algorithm in [1] while having the same approximation ratio.

  • In Section 3.2, using AOMEB as a building block, we propose the SWMEB algorithm for coreset maintenance over a sliding window of the most recent points at time . SWMEB divides into equal-length partitions. On each partition, it maintains a sequence of indices where each index corresponds to an instance of AOMEB. Theoretically, SWMEB can return a -coreset for with time and space complexity, where is the window size.

  • In Section 3.3, we propose the SWMEB+ algorithm to improve upon SWMEB. SWMEB+ only maintains one sequence of indices, as well as the corresponding AOMEB instances, over . By keeping fewer indices, SWMEB+ is more efficient than SWMEB in terms of time and space. Specifically, it only stores points with processing time per point, both of which are independent of . At the same time, it can still return a -coreset for .

  • In Section 3.4, we generalize our proposed algorithms to maintain coresets for MEB in a reproducing kernel Hilbert space (RKHS).

  • In Section 4, we conduct extensive experiments on real-world and synthetic datasets to evaluate the performance of our proposed algorithms. The experimental results demonstrate that (1) AOMEB outperforms the state-of-the-art streaming algorithm [1] in terms of coreset quality and efficiency; (2) SWMEB and SWMEB+ can return coresets for MEB with rather small errors (mostly within ), which are competitive with AOMEB and other streaming algorithms; (3) SWMEB and SWMEB+ achieve 2 to 4 orders of magnitude speedups over batch algorithms while running between 10 and 150 times faster than AOMEB; (4) SWMEB+ further improves the efficiency of SWMEB by up to 14 times while providing coresets with similar or even better quality.

2 Preliminaries and Related Work

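Coresets for MEB. For two $m$-dimensional points $p$, $q$, the Euclidean distance between $p$ and $q$ is denoted by $d(p, q)$. An $m$-dimensional (closed) ball with center $c$ and radius $r$ is defined as $B(c, r) = \{p \in \mathbb{R}^m : d(c, p) \le r\}$. We use $c(B)$ and $r(B)$ to denote the center and radius of ball $B$. The $(1+\varepsilon)$-expansion of ball $B$, denoted as $(1+\varepsilon)\cdot B$, is a ball centered at $c(B)$ with radius $(1+\varepsilon)\cdot r(B)$, i.e., $(1+\varepsilon)\cdot B = B(c(B), (1+\varepsilon)\cdot r(B))$.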

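Given a set of points $P \subset \mathbb{R}^m$, the minimum enclosing ball of $P$, denoted as $\mathrm{MEB}(P)$, is the smallest ball that contains all points in $P$. The center and radius of $\mathrm{MEB}(P)$ are represented by $c^*(P)$ and $r^*(P)$. For a parameter $\varepsilon > 0$, a ball $B$ is a $(1+\varepsilon)$-approximate MEB of $P$ if $P \subset B$ and $r(B) \le (1+\varepsilon)\cdot r^*(P)$. A subset $S \subseteq P$ is a $(1+\varepsilon)$-coreset for $\mathrm{MEB}(P)$, or $(1+\varepsilon)$-$\mathrm{Coreset}(P)$ for brevity, if $P \subset (1+\varepsilon)\cdot \mathrm{MEB}(S)$. Since $S \subseteq P$ and $r^*(S) \le r^*(P)$, $(1+\varepsilon)\cdot \mathrm{MEB}(S)$ is always a $(1+\varepsilon)$-approximate MEB of $P$.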

Sliding Window Model. This work focuses on maintaining coresets for MEB in append-only streaming and sliding window settings. For a sequence of (possibly infinite) points $P = \langle p_1, p_2, \ldots \rangle$ arriving continuously as a data stream where $p_t$ is the $t$-th point, we first consider the problem of maintaining a coreset for the MEB of the points $p_1, \ldots, p_t$ observed so far at any time $t$.

Furthermore, we consider the count-based sliding window on the stream $P$: given a window size $N$, the sliding window $W_t$ [15] at any time $t$ always contains the latest $N$ points, i.e., $W_t = \langle p_{t'}, \ldots, p_t \rangle$ where $t' = \max(1, t - N + 1)$. We consider the problem of maintaining a coreset for $\mathrm{MEB}(W_t)$ at any time $t$. (Footnote 1: In this paper, we focus on the count-based sliding window model, but our proposed approaches can be trivially extended to the time-based sliding window model [15].)
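The count-based window defined above can be illustrated with a short Python sketch. The function name `sliding_windows` and the string labels standing in for points are hypothetical and chosen only for illustration; the sketch merely materializes the latest `window_size` points at each time step and says nothing about how a coreset is maintained.

```python
from collections import deque


def sliding_windows(stream, window_size):
    """Yield (t, window) pairs; window holds the latest `window_size` points at time t."""
    window = deque(maxlen=window_size)  # the oldest point is evicted automatically
    for t, point in enumerate(stream, start=1):
        window.append(point)
        yield t, list(window)


# Hypothetical stream of five labelled points with window size N = 3.
windows = dict(sliding_windows(["p1", "p2", "p3", "p4", "p5"], window_size=3))
```

Before the window fills up (t < N) it simply contains every point seen so far, which matches the max(1, t - N + 1) start index in the definition.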

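Input : A set of points $P$, a parameter $\varepsilon$
Output : A coreset $S$ for $\mathrm{MEB}(P)$
1 $a \leftarrow \arg\max_{p \in P} d(p_1, p)$, $b \leftarrow \arg\max_{p \in P} d(a, p)$;
2 $S \leftarrow \{a, b\}$;
3 Initialize $B(c, r)$ with $c \leftarrow (a + b)/2$, $r \leftarrow d(a, b)/2$;
4 while $\exists p \in P : p \notin (1+\varepsilon)\cdot B$ do
5       $q \leftarrow \arg\max_{p \in P} d(c, p)$, $S \leftarrow S \cup \{q\}$;
6       Update $B(c, r)$ such that $B = \mathrm{MEB}(S)$;
7 end
8 return $S$;
Algorithm 1 CoreMEB [4]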

Related Work. We review the literature on MEB computation and coresets for MEB. Gärtner [19] and Fischer et al. [18] propose two pivoting algorithms that resemble the simplex method of linear programming for computing exact MEBs. Both algorithms have an exponential complexity w.r.t. the dimension $m$ and thus are not scalable for large datasets with high dimensions. Subsequently, a line of research work [5, 22, 4, 32, 23] studies the problem of building coresets to approximate MEBs. They propose efficient batch algorithms for constructing a $(1+\varepsilon)$-coreset of any point set $P$. The basic scheme used in these algorithms is presented in Algorithm 1. First of all, it selects the point $a$ furthest from an arbitrary point $p_1 \in P$ and the point $b$ furthest from $a$ out of $P$, using $S = \{a, b\}$ as the initial coreset (Lines 1–2). The center and radius of $\mathrm{MEB}(S)$ can be computed from $a$ and $b$ directly (Line 3). Then, it iteratively picks the point $q$ furthest from the current center $c$, adds $q$ to $S$, and updates $B$ so that $B$ is $\mathrm{MEB}(S)$, until no point in $P$ is outside of the $(1+\varepsilon)$-expansion of $B$ (Lines 4–6). Finally, it returns $S$ as a coreset for $\mathrm{MEB}(P)$ (Line 8). Theoretically, Algorithm 1 terminates in $O(1/\varepsilon)$ iterations and returns a $(1+\varepsilon)$-coreset of $P$ of size $O(1/\varepsilon)$ [4]. Compared with exact MEB solvers [19, 18], coreset-based approaches run in linear time w.r.t. the dataset size and dimension, and achieve better performance on high-dimensional data. Nevertheless, they must store all points in memory and process them in multiple passes, which makes them unsuitable for data stream applications.
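The batch scheme described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of [4]: the helper `approx_meb` is a hypothetical stand-in that approximates the MEB of the small coreset with the simple Bădoiu-Clarkson center-update iteration instead of an exact solver, and the names `core_meb`, `approx_meb`, and `iters` are chosen here for illustration only.

```python
import math


def approx_meb(points, iters=2000):
    """Approximate MEB of `points` (stand-in solver, NOT exact).

    Repeatedly moves the center a shrinking step toward the furthest point
    (Badoiu-Clarkson style iteration). The returned radius is the largest
    distance from the final center, so the returned ball always encloses `points`.
    """
    c = list(points[0])
    for i in range(1, iters + 1):
        q = max(points, key=lambda p: math.dist(c, p))
        c = [ci + (qi - ci) / (i + 1) for ci, qi in zip(c, q)]
    r = max(math.dist(c, p) for p in points)
    return c, r


def core_meb(points, eps):
    """Grow a coreset until every point lies in the (1+eps)-expansion of its ball."""
    a = max(points, key=lambda p: math.dist(points[0], p))  # furthest from the first point
    b = max(points, key=lambda p: math.dist(a, p))          # furthest from a
    coreset = [a, b]
    c = [(x + y) / 2 for x, y in zip(a, b)]  # MEB of two points: midpoint
    r = math.dist(a, b) / 2                  # and half their distance
    while True:
        q = max(points, key=lambda p: math.dist(c, p))
        if math.dist(c, q) <= (1 + eps) * r:
            return coreset, c, r             # no point outside the expansion
        coreset.append(q)
        c, r = approx_meb(coreset)           # assumption: approximate MEB(S)


# Hypothetical 2-D input.
pts = [(0.0, 0.0), (1.0, 0.0), (0.5, 2.0), (0.5, 0.5), (0.4, 1.0)]
S, center, radius = core_meb(pts, eps=0.01)
```

Because the stand-in solver returns a ball that encloses the whole coreset, a point that is already in the coreset can never be selected again, so the loop terminates after at most as many rounds as there are input points.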

Several methods are proposed to approximate MEBs or coresets for MEB in streaming and dynamic settings. Agarwal et al. [2] and Chan [11] propose algorithms to build -coresets for MEB in append-only streams. Though working well in low dimensions, both algorithms become impractical for higher dimensions (i.e., ) due to complexity. Zarrabi-Zadeh and Chan [33] propose a -approximate algorithm to compute MEBs in append-only streams. Agarwal and Sharathkumar [1] design a data structure that can maintain -coresets for MEB and -approximate MEBs over append-only streams. Chan and Pathak [12] propose a method for maintaining -approximate MEBs in the dynamic setting, which supports the insertions and deletions of random points. To the best of our knowledge, none of the existing methods can maintain coresets for MEB over the sliding window efficiently. All of them have to store the entire window of points and recompute from scratch for every window slide, which is expensive in terms of time and space.

3 Our Algorithms

In this section we present our algorithms to maintain coresets for MEB. We first introduce a -approximate append-only streaming algorithm, called AOMEB, in Section 3.1. Using AOMEB as a building block, we propose the SWMEB algorithm with the same -approximation ratio in Section 3.2. Furthermore, we propose a more efficient SWMEB+ algorithm that retains a constant approximation ratio in Section 3.3.

3.1 The AOMEB Algorithm

The AOMEB algorithm is inspired by CoreMEB [4] (see Algorithm 1) to work in the append-only streaming model. Compared with CoreMEB, which can access the entire dataset and optimally select the furthest point into the coreset at each iteration, AOMEB is restricted to process the dataset in a single pass and determine whether to include a point into the coreset or discard it immediately after seeing it. Therefore, AOMEB adopts a greedy strategy for coreset maintenance: adding a new point to the coreset once it is outside of the MEB w.r.t. the current coreset.

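Input : A set of points $P = \langle p_1, \ldots, p_n \rangle$, a parameter $\varepsilon_1$
Output : A coreset $S_n$ for $\mathrm{MEB}(P)$
1 $S_1 \leftarrow \{p_1\}$ and initialize $B_1$ with $B(p_1, 0)$;
2 for $t \leftarrow 2, \ldots, n$ do
3       if $p_t \notin (1+\varepsilon_1)\cdot B_{t-1}$ then
4             $S_t \leftarrow S_{t-1} \cup \{p_t\}$;
5             Update $B_t$ to $\mathrm{MEB}(S_t)$;
7       else
8             $S_t \leftarrow S_{t-1}$ and $B_t \leftarrow B_{t-1}$;
9       end
10 end
11 return $S_n$;
Algorithm 2 AOMEB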

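The pseudo code of AOMEB is presented in Algorithm 2. First of all, it takes $\{p_1\}$ as the initial coreset with $B(p_1, 0)$ as its MEB (Line 1). Then, it performs a one-pass scan over the point set, using the procedure in Lines 2–8 for each point $p_t$: It first computes the distance between $p_t$ and the center of the current ball $B_{t-1}$. If $p_t$ lies inside the $(1+\varepsilon_1)$-expansion of $B_{t-1}$, no update is needed; otherwise, it adds $p_t$ to the coreset and updates the ball to the MEB of the enlarged coreset. Finally, after processing all points in $P$, it returns the current coreset as the coreset for $\mathrm{MEB}(P)$ (Line 11).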
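The one-pass greedy procedure above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the ball of the coreset is recomputed with a hypothetical stand-in solver `approx_meb` (a Bădoiu-Clarkson style iteration that only approximates the MEB), and the names `aomeb`, `approx_meb`, and `eps1` are assumptions made for this example.

```python
import math


def approx_meb(points, iters=2000):
    """Approximate MEB of `points` (stand-in solver, NOT exact)."""
    c = list(points[0])
    for i in range(1, iters + 1):
        q = max(points, key=lambda p: math.dist(c, p))
        c = [ci + (qi - ci) / (i + 1) for ci, qi in zip(c, q)]
    r = max(math.dist(c, p) for p in points)  # ball always encloses `points`
    return c, r


def aomeb(stream, eps1):
    """Single pass: keep a point only if it falls outside the expanded current ball."""
    coreset, c, r = [], None, 0.0
    for p in stream:
        if c is None:
            coreset, c, r = [p], list(p), 0.0  # first point: ball of radius 0
        elif math.dist(c, p) > (1 + eps1) * r:
            coreset.append(p)                  # p is outside the expansion
            c, r = approx_meb(coreset)         # assumption: approximate MEB(S)
        # otherwise the coreset and its ball stay unchanged
    return coreset, c, r


# Hypothetical 2-D stream.
stream = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.1), (4.0, 0.0), (2.0, 0.5)]
S, center, radius = aomeb(stream, eps1=0.1)
```

Each arriving point costs one distance computation unless it triggers an update, in which case the ball is recomputed on the coreset only, never on the full stream.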

Theoretical Analysis. Next, we provide an analysis of the approximation ratio and complexity of AOMEB. It is noted that the greedy strategy of AOMEB is also adopted by existing streaming algorithms, i.e., SSMEB [33] and blurred ball cover (BBC) [1]. Nevertheless, the update procedure is different: SSMEB uses a simple geometric method to enlarge the MEB such that both the previous MEB and the new point are contained while AOMEB and BBC recompute the MEB once the coreset is updated. As a result, AOMEB and BBC are less efficient than SSMEB but ensure a better approximation ratio. Compared with BBC, which keeps the “archives” of MEBs for previous coresets, AOMEB only maintains one MEB w.r.t.  at time . Therefore, AOMEB is more efficient than BBC in practice. Next, we will prove that AOMEB has the same -approximation as BBC. First of all, we present the hemisphere property [5] that forms the basis of our analysis.

Lemma 1 (Hemisphere Property [5]).

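For a set of points $P$, any closed half-space that contains the center $c^*(P)$ of $\mathrm{MEB}(P)$ must contain at least one point $p \in P$ such that $d(c^*(P), p) = r^*(P)$, where $r^*(P)$ is the radius of $\mathrm{MEB}(P)$.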

The proof of Lemma 1 can be found in Section 2 of [5]. Based on Lemma 1, we can analyze the complexity and approximation ratio of AOMEB theoretically.

Theorem 1.

For any , it holds that .

Proof.

If , then . We discuss two cases of separately. If , then

If , then let

be a hyperplane passing through

with as its normal. Let be the closed half-space, bounded by , that does not contain . According to Lemma 1, there must exist a point such that . Thus,

In addition, as . Therefore, we prove that in both cases. ∎

Theorem 2.

For any , it holds that .

Proof.

For any , we have either or . If , it is obvious that . If , we have . According to Lemma 1, there must exist a point such that . Therefore,

We conclude that . ∎

Theorem 2 indicates that AOMEB returns a - where for an arbitrary point set . According to Theorem 1, the radius of increases by times whenever a new point is added to . After processing and , the coreset contains both points with where . In addition, the radius of is bounded by where . Therefore, where and the size of is . Finally, the update procedure for each point spends time to compute and time to update .

3.2 The SWMEB Algorithm

In this subsection, we present the SWMEB algorithm for coreset maintenance over the sliding window . The basic idea is to adapt AOMEB for the sliding window model by keeping multiple AOMEB instances with different starting points over . However, the key problem is to identify the appropriate indices, i.e., starting points, for these instances. A naive scheme, i.e., creating a set of indices that are evenly distributed over , cannot give any approximation guarantee of coreset quality. Therefore, we design a partition-based scheme for index maintenance in SWMEB: dividing into equal-length partitions and keeping a sequence of indices on each partition such that at least one instance can provide an approximate coreset for at any time .

Figure 1: An illustration of the SWMEB algorithm. Two arrows indicate the order in which the points in are processed by .

The procedure of SWMEB is illustrated in Figure 1. It divides into partitions of equal length . It keeps a sequence of indices from the end to the beginning of each partition . As slides over time, old points in expire (colored in grey) while new points are temporarily stored in a buffer . The index on , which is the closest to the beginning of , will be deleted once it expires. When the size of reaches , it will delete and shift remaining partitions as all points in must have expired. Then, it creates a new partition for the points in and the indices on . Moreover, each index corresponds to an AOMEB instance that processes at any time . Specifically, will process the points from the end of to when is created and then update for each point till . Finally, the coreset is always provided by .

Input : A sequence of points , the window size , the partition size , two parameters
Output : A coreset for
1 Initialize ;
2 for  do
3       , ;
4       if  then
5             if  then
6                   , create a new partition ;
7                  
8             else
9                   Drop , shift (as well as the indices on ) to for , and create a new partition ;
10                  
11             Initialize an instance of Algorithm 2, , ;
12             for  do
13                   processes with Line 22 of Algorithm 2 and maintains a coreset and its MEB ;
14                   if  then
15                         ;
16                         ;
17                         after processing ;
18                        
19                  
20            ;
21            
22       if   then
23             , terminate , and ;
24            
25       for  and  do
26             processes with Line 22 of Algorithm 2 and maintains a coreset and its MEB ;
27            
28       return as the coreset for ;
29      
Algorithm 3 SWMEB

The pseudo code of SWMEB is presented in Algorithm 3. For initialization, the latest partition ID is set to and the buffer as well as the indices are set to (Line 3). Then, it processes all points in the stream one by one with the procedure of Lines 33, which can be separated into four phases as follows.

  • Phase 1 (Lines 33): After adding a new point to , it checks the size of . If , a new partition will be created for . When , it increases by and creates a new partition . Otherwise, must have expired and thus is dropped. Then, the partitions (and the indices on each partition) are shifted to and a new partition is created.

  • Phase 2 (Lines 33): Next, it creates the indices and corresponding AOMEB instances on . It runs an AOMEB instance to process each point in in reverse order from to . Initially, the number of indices on and the radius w.r.t. the latest index are . We denote the coreset maintained by after processing as . Then, is represented by with radius . If , it will update to , add a new index to , and use the snapshot of after processing as . After the indices on are created, will be reset for new incoming points.

  • Phase 3 (Lines 33): It checks whether , i.e., the earliest index on , has expired. If so, it will delete from and terminate accordingly.

  • Phase 4 (Lines 33): For each index with and at time , it updates the corresponding AOMEB instance by processing .

Finally, it always returns from as the coreset for at time (Line 3).

Theoretical Analysis. In the following, we will first prove the approximation ratio of returned by SWMEB for . Then, we discuss the time and space complexity of SWMEB.

We first prove the following lemma that will be used in subsequent analyses.

Lemma 2.

For any two point sets such that , it must hold that .

Proof.

Obviously, Lemma 2 must hold when . When , we consider a hyperplane passing through with as its normal. Let be the closed half-space, bounded by , that does not contain . According to Lemma 1, there must exist such that . In addition, for . Finally, as , , and is the normal of , we acquire . Thus, it holds that and we conclude the proof. ∎

Theorem 3.

For any , it holds that where .

Proof.

According to Algorithm 3, the instance is always used to return the coreset . Since has already processed the points from to , these points must be contained in according to Theorem 2. Thus, we only need to consider the points from to . To create the indices on , we process the points of with an AOMEB instance (see Line 3). Here we use and to denote the coresets of this instance and the corresponding MEBs after processing and () respectively. For each point , it holds that and . In addition, as , we have from Lemma 2. Therefore,

In addition, according to Lemma 1, there must exist a point such that . Let , we have

We prove that and thus conclude the proof. ∎

Theorem 3 shows that returned by SWMEB is a - where at any time . To analyze the complexity of SWMEB, we first consider the number of indices in . For each partition, SWMEB maintains indices where . Thus, contains indices and the number of points stored by SWMEB is . Furthermore, the time of SWMEB to update a point comprises (1) the time to maintain the instance w.r.t. each index in for and (2) the amortized time to create the indices for each partition. Overall, the time complexity of SWMEB to update each point is . As , the number of points maintained by SWMEB is minimal when . In this case, the number of points stored by SWMEB is and the time complexity of SWMEB to update one point is .

3.3 The SWMEB+ Algorithm

Figure 2: An illustration of the SWMEB+ algorithm.
Input : A sequence of points , the window size , two parameters
Output : A coreset for
1 Initialize ;
2 for  do
3       , and ;
4       Initialize an instance of Algorithm 2;
5       while  do
6             , terminate ;
7             Shift the remaining indices in , ;
8            
9      for  do
10             processes with Line 22 of Algorithm 2, maintaining a coreset and its MEB ;
11            
12      while  do
13             , terminate ;
14             Shift the remaining indices in , ;
15            
16      if  then
17             return as the coreset for ;
18            
19       else
20             return as the coreset for ;
21            
22      
Algorithm 4 SWMEB+

In this subsection we present the SWMEB+ algorithm that improves upon SWMEB in terms of time and space while still achieving a constant approximation ratio. The basic idea of SWMEB+ is illustrated in Figure 2. Unlike SWMEB, SWMEB+ only maintains a single sequence of indices over . Then, each index also corresponds to an AOMEB instance that processes a substream of points from to . We use for the coreset returned by at time and centered at with radius for . Furthermore, SWMEB+ maintains the indices based on the radii of the MEBs. Specifically, given any , for three neighboring indices , if , then is considered as a good approximation for and thus can be deleted. In this way, the radii of the MEBs gradually decrease from to , with the ratios of any two neighboring indices close to . Any window starting between and is approximated by . Finally, SWMEB+ keeps at most one expired index (and must be ) in to track the upper bound for the radius of . The AOMEB instance corresponding to the first non-expired index ( or ) provides the coreset for .

The pseudo code of SWMEB+ is presented in Algorithm 4. In the initialization phase, and are set to and respectively (Line 4). Then, all points in are processed one by one with the procedure of Lines 44, which includes four phases as follows.

  • Phase 1 (Lines 44): Upon the arrival of at time , it creates a new index and adds to ; accordingly, an AOMEB instance w.r.t.  is initialized to process the substream beginning at .

  • Phase 2 (Lines 44): When there exists more than one expired index (i.e., earlier than the beginning of ), it deletes the first index and terminates until there is only one expired index. Note that it shifts the remaining indices after deletion to always guarantee is the -th index of .

  • Phase 3 (Lines 44): For each , it updates the instance for . The update procedure follows Line 22 of Algorithm 2. After the update, maintains a coreset and its MEB by processing a stream .

  • Phase 4 (Lines 44): It executes a scan of from to to delete the indices that can be approximated by their successors. For each (), it checks the radii and of and . If , then it deletes the index from , terminates , and shifts the remaining indices accordingly.

After performing the above procedure, it returns either (when has not expired) or (when has expired) as the coreset for at time .
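The radius-based index pruning of Phase 4 can be sketched in Python as follows. The exact deletion condition is not legible in the text above, so the test used here, namely that the radius at index i+2 is at least (1 - eps2) times the radius at index i, is an assumption in the spirit of Smooth Histograms [8]; the function name `prune_indices` and the parameter name `eps2` are likewise hypothetical.

```python
def prune_indices(radii, eps2):
    """Return the positions of the indices kept after one pruning scan.

    radii[i] is the radius of the ball maintained by the i-th index,
    ordered from the earliest starting point to the latest.
    ASSUMED rule: drop the middle of three neighbouring indices when the
    third radius is still within a (1 - eps2) factor of the first.
    """
    kept = list(range(len(radii)))
    i = 0
    while i + 2 < len(kept):
        if radii[kept[i + 2]] >= (1 - eps2) * radii[kept[i]]:
            del kept[i + 1]  # middle index is well approximated; re-check the same i
        else:
            i += 1
    return kept


# Hypothetical radii of six AOMEB instances, earliest index first.
kept = prune_indices([10.0, 9.8, 9.5, 5.0, 4.9, 1.0], eps2=0.1)
```

In a full implementation, deleting an index would also terminate the AOMEB instance attached to it; the sketch tracks positions only.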

Figure 3: An illustration of Example 1.

Theoretical Analysis. The strategy of index maintenance based on the ratios of radii is inspired by Smooth Histograms [8] for estimating stream statistics over sliding windows. However, Smooth Histograms cannot be applied to our problem because it requires an oracle to provide a -approximate function value in any append-only stream [16] but any practical solution (i.e., [1] and AOMEB) only gives a -approximation for of an append-only stream . In addition, Smooth Histograms are also used for submodular maximization in the sliding window model [16, 29, 30]. Nevertheless, such an extension is still not applicable for our problem because the radius function is not submodular in the view of set functions, which is shown by Example 1. In the following, we will prove that SWMEB+ still has a constant approximation ratio by an analysis that is different from [8, 16].

Example 1.

A function is submodular if for any set and point . In Figure 3, for and ,