Fast Online Clustering with Randomized Skeleton Sets

06/10/2015
by   Krzysztof Choromanski, et al.

We present a new fast online clustering algorithm that reliably recovers arbitrary-shaped data clusters in high throughput data streams. Unlike the existing state-of-the-art online clustering methods based on k-means or k-medoid, it does not make any restrictive generative assumptions. In addition, in contrast to existing nonparametric clustering techniques such as DBScan or DenStream, it gives provable theoretical guarantees. To achieve fast clustering, we propose to represent each cluster by a skeleton set which is updated continuously as new data is seen. A skeleton set consists of weighted samples from the data, where weights encode local densities. The size of each skeleton set is adapted according to the cluster geometry. The proposed technique automatically detects the number of clusters and is robust to outliers. The algorithm works on infinite data streams, where more than one pass over the data is not feasible. We provide theoretical guarantees on the quality of the clustering and also demonstrate its advantage over the existing state-of-the-art on several datasets.


1 Introduction

Online clustering in massive data streams is becoming important as data in a variety of fields, including social media, finance and web applications, arrives as a high throughput stream. In social networks, detecting and tracking clusters or communities is important for analyzing evolutionary patterns. Similarly, online clustering can lead to spam and fraud detection in web applications, such as detecting unusual mass activities in email services or online reviews. There exist several challenges in developing a good clustering algorithm in a high throughput online scenario. In real-world applications, the number and shape of the clusters are typically unknown. The existing state-of-the-art online clustering methods with provable theoretical guarantees are primarily based on k-means or k-median/medoid, which assume a priori knowledge of the number of clusters and inherently make strong generative assumptions about the clusters. These assumptions force the retrieved clusters to be convex, leading to poor clustering for many real-world tasks. There exist several nonparametric techniques that do not make simplistic generative assumptions, but they are mostly based on heuristics and lack theoretical guarantees. Moreover, in a true online scenario one needs to deal with continuous streams, precluding the multiple passes over the data that are possible for the finite-size streams commonly assumed by many techniques. Potential drift in the data distribution over time is another practical difficulty that needs to be handled effectively. Finally, the clustering procedure should be efficient both in space and time to be able to handle massive data streams.

In this paper we propose a novel Skeleton-based Online Clustering (SOC) algorithm to address the above challenges. The basic idea of SOC is to represent each cluster via a compact skeleton set which faithfully captures the geometry of the cluster. Each skeleton set maintains a small random sample from the corresponding cluster in an online fashion, and is updated quickly as new data points arrive. Each skeleton point is weighted according to the local density around it. The number of skeleton points is automatically adapted to the structure of the cluster in such a way that more complicated shapes are approximated by more skeleton points. The skeleton sets are updated by a random procedure which provides robustness in the presence of outliers. The proposed algorithm automatically recovers the correct number of clusters in the data with high probability as more and more data is seen. The update strategy of the skeleton sets also allows the clustering method to automatically adapt to any drift in the data distribution. In SOC, clusters can be merged as well as split over time. We also provide theoretical guarantees on the quality of clusters obtained from the proposed method.

1.1 Related work

In comparison to the huge literature on offline clustering, work on online clustering has been somewhat limited. Most of the existing online clustering algorithms that have theoretical guarantees fall under model-based techniques such as k-means, k-median or k-medoid (Guha et al., 2003; Ailon et al., 2009; Shindler et al., 2011; Bagirov et al., 2011). They assume a specific cluster shape, such as spheres, which trivially leads to a compact representation using just a few parameters, e.g., center, radius and the number of points. However, as discussed before, these model-based algorithms fail to capture arbitrary clusters in the data and can perform poorly.

There exist several nonparametric clustering methods where no assumption is made about the cluster shapes. Popular among them are DBScan (Ester et al., 1996), CluStream (Aggarwal et al., 2003), and DenStream (Cao et al., 2006). Recent surveys describe several variants of these algorithms (de Andrade Silva et al., 2013; Amini et al., 2014). The DenStream and CluStream methods create microclusters based on local densities, and combine them to form bigger clusters over time. However, these methods need to periodically perform offline partitioning of all the microclusters to form the clusters, which is not suitable for online clustering of massive data streams. The Leader-Follower algorithm is another popular method with several variants (Duda et al., 2000; Shah & Zaman, 2010). These techniques typically encode every cluster by one center which is updated continuously as new points belonging to the cluster are detected. Such a cluster representation is not rich enough to encode more complex clusters. Overall, the main drawback of the above nonparametric online clustering algorithms is that they are mostly based on heuristics and lack any theoretical guarantees. They also require extensive hand-tuning of parameters. In (Shah & Zaman, 2010), the authors assume each cluster to be a clique in order to provide theoretical guarantees, which is very restrictive in real-world settings.

Another popular method used in the context of incremental clustering is the doubling algorithm (Charikar et al., 1997). Its standard version encodes every cluster by just one point. Furthermore, even though it allows for merging clusters, it does not permit splitting them. We implement a variant of the method where, instead of one center, several centers are kept per cluster. As we will show in the experimental section, this purely deterministic approach, despite having some theoretical guarantees, is too sensitive to outliers.

Our proposed SOC algorithm shares a few similarities with two existing techniques: the CURE algorithm (Guha et al., 2001) and core-sets (Bādoiu et al., 2002). In CURE, similarly to SOC, each cluster is represented by a random sample of data instead of just one center in order to handle arbitrary cluster shapes. CURE, however, is an offline hierarchical agglomerative clustering approach with running time quadratic in the size of the data, which is too slow for online applications. In core-set based clustering, the aim is to encode a complicated cluster shape via a compact sample of points. The existing state-of-the-art algorithms that use the idea of the core-set (Gonzalez, 1985; Alon et al., 2000; Har-peled & Varadarajan, 2001; Bādoiu et al., 2002) are computationally too intensive to be useful for online clustering in practice. The running time is exponential in the number of stored skeleton points. Furthermore, the variants that give provable theoretical guarantees are inherently offline methods that often require several passes over the data to produce good-quality clustering. For instance, the algorithm presented in (Bādoiu et al., 2002) needs to be rerun times, where is the size of the core-set and is the dataset size.

Nonparametric graph-based techniques such as spectral clustering can recover arbitrary-shaped clusters, but they are appropriate mainly for the offline setting (Ng et al., 2001). Moreover, they also assume a priori knowledge of the number of clusters. Several relaxations such as iterative biclustering have been proposed to overcome the need to know the number of clusters a priori, but these methods cannot be extended to an online setting. Recently, there has been some work on incremental spectral clustering which essentially iteratively modifies the graph Laplacian (Ning et al., 2010; Langone et al., 2014; Jia, 2012; Chia et al., 2009). In a true online setting, however, even building a good initial graph Laplacian is infeasible, either due to the lack of enough data or due to computational bottlenecks.

2 Clustering Framework

In this work, since we are interested in retrieving arbitrary-shaped clusters, it is important to first define what constitutes a good cluster. Traditional model-based techniques assume a global distance measure (e.g., the distance used in k-means), which restricts one to convex-shaped clusters. Instead, the proposed algorithm works in a nonparametric setting where clusters are defined by an intuitive notion of paths with 'short edges'. Two points are more likely to belong to the same cluster if there exist many paths in the data neighborhood graph such that any two consecutive points on each path are not "far" from each other. Clearly, the overall distance between two points can be large, yet they can still be in the same cluster. Such a setting enables us to consider many complicated cluster shapes. To emphasize, the neighborhood graph serves only to explain the cluster definition we implicitly use in this work; we do not need to explicitly construct a graph in our approach.
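To make this notion concrete, here is a minimal Python sketch (not the authors' implementation) of the short-edge-path criterion: two points are linked if a chain of data points connects them with every hop shorter than a threshold; the threshold name tau and the brute-force neighbor search are illustrative assumptions.

import numpy as np
from collections import deque

def same_cluster(points, i, j, tau):
    """Check whether points[i] and points[j] are linked by a path whose
    consecutive steps are all shorter than tau (a 'short-edge' path)."""
    n = len(points)
    seen = {i}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if u == j:
            return True
        for v in range(n):
            if v not in seen and np.linalg.norm(points[u] - points[v]) < tau:
                seen.add(v)
                queue.append(v)
    return False

# Two points far apart in Euclidean distance can still share a cluster
# if intermediate points form a chain of short edges between them.
pts = np.array([[0.0, 0.0], [0.4, 0.0], [0.8, 0.0], [1.2, 0.0], [5.0, 5.0]])
print(same_cluster(pts, 0, 3, tau=0.5))  # True: chained by short edges
print(same_cluster(pts, 0, 4, tau=0.5))  # False: no short-edge path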

2.1 Skeleton-based Online Clustering (SOC) Algorithm

The key idea behind the SOC algorithm is to represent each cluster via a set of pseudorandom samples called the skeleton set. The algorithm stores and constantly updates a collection of skeleton sets . Note that the size of this collection corresponds to the number of clusters, which may change over time as new data is seen. Each skeleton set represents a cluster and consists of a sample of all the points belonging to that cluster up to time , together with some carefully chosen random numbers and weights . Thus, a skeleton set is the set of elements of the form: for . Sometimes we will also call a skeleton set, which should be clear from the context. Let us define a map . We denote by the weight of the skeleton point and by the corresponding random number. Weights encode the local density around a skeleton point. We denote by the sum of weights of all the skeleton points representing the th cluster and by the sum of weights for all the points belonging to set . The skeleton set is updated in such a way that at any given time skeleton points are pseudouniformly distributed in the entire cluster. As mentioned before, the number of skeleton points of the th cluster (alternative notation: , where stands for the skeleton set) is not fixed and can vary over time. When a cluster arises, it is initialized with skeleton points. In the algorithm we take , and in the theoretical section we show how lower bounds on can be translated into strict provable guarantees regarding the quality of the produced clustering. The algorithm tries to maintain a skeleton set in such a way that between any two skeleton points there exists a path of relatively short edges consisting entirely of other skeleton points from the same skeleton set. If the cluster grows and the average distance between skeleton points is too large, the number of skeleton points is increased. This number never exceeds , a given upper bound determining how much memory the user is willing to allocate to encode any cluster.
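As an illustration only (the names below are ours, not the paper's notation), a skeleton set can be thought of as a list of (point, random number, weight) triples together with its total weight:

from dataclasses import dataclass, field
import random
import numpy as np

@dataclass
class SkeletonPoint:
    point: np.ndarray   # location of the skeleton sample
    rnd: float          # random number attached by the algorithm
    weight: float = 1.0 # local density encoded as a weight

@dataclass
class SkeletonSet:
    points: list = field(default_factory=list)  # list of SkeletonPoint

    def total_weight(self) -> float:
        # Sum of weights of all skeleton points representing the cluster.
        return sum(sp.weight for sp in self.points)

    def add(self, x: np.ndarray) -> None:
        # Attach a fresh random number when a point enters the skeleton.
        self.points.append(SkeletonPoint(point=x, rnd=random.random()))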

The overall algorithm works by first initializing the number of samples stored in each skeleton set. In this work, without loss of generality, we assume that each skeleton set initially has only one sample. We propose two variants of the algorithm: one where only lazy cluster-merging is performed (MergeOnlySOC), and another where merging can be more aggressive since splitting clusters is also allowed (MergeSplitSOC). As we will see in the experimental section, the latter produces a good approximation of the ground-truth clustering in fewer steps, but at the cost of the extra time needed to check and perform splitting. If the MergeSplitSOC variant is turned on, then the algorithm keeps a set of undirected graphs . Each element of is associated with a different skeleton set and encodes the topology of connections between points in that set. We denote by the element of associated with the skeleton set .

Input: Infinite data stream Output: Cluster assignment for the observed data  Pick ; while true do

       for each do
             if exists such that then
                  
            end
      end Read next   Compute for each for each do
             if then
                  
            end
      end if then
             ;
      else
             ;
      end
end
Algorithm 1 SOC clustering - main procedure

The overview of the SOC algorithm is given in Algorithm 1. If splitting is turned on, at each time step t, given the existing skeleton set for each cluster, the algorithm checks whether any cluster should be split. Then, as a new data point arrives in the stream, a ball of the given radius centered at that point is created. Next, the intersection of this ball with each existing skeleton set is computed. If the weight of the skeleton points in this intersection exceeds the threshold, the new point is assumed to belong to that cluster and the corresponding skeleton set is updated. Note that it is possible that multiple clusters claim the point; in that case, all those clusters are considered for merging. If the intersection of the ball with all the skeleton sets is empty, then a new singleton cluster is created.
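The following Python sketch, reusing the illustrative SkeletonSet/SkeletonPoint classes above, shows the shape of this assignment step; the exact threshold rule is not reproduced in the text, so comparing the intersected weight against a constant theta is an assumption.

import numpy as np

def claiming_clusters(x, skeletons, r, theta):
    """Find the skeleton sets that 'claim' the new point x (sketch): those
    whose skeleton points inside the ball of radius r around x carry enough
    weight. The threshold theta and its use here are assumptions."""
    claimed = []
    for s in skeletons:
        w = sum(sp.weight for sp in s.points
                if np.linalg.norm(sp.point - x) <= r)
        if w >= theta:
            claimed.append(s)
    return claimed

# In the main loop: no claiming cluster -> AddSingleton; one or more
# claiming clusters -> Merge, with x acting as the linking point.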

We will start by describing the MergeOnlySOC variant (i.e., we assume that the splitting-related procedures CheckSplit and UpdatedGraph in Algorithm 1 and Algorithm 2 are turned off). The MergeSplitSOC variant will be discussed later, in Section 2.2.

Input: Datapoint , subset , current clustering , family of graphs , radius Output: Updated after merging with clusters from   Denote: , ,
  Compute: for   Let:
if or then

       
end Denote by the skeleton point of the skeleton set of   initialize: for do
        ;
;
;
;
end if and then
        generate according to
else
        Let , where be such that: if then
               , where is generated according to
       end
end Let
Algorithm 2 SOC clustering - Merge subprocedure

2.1.1 Subprocedure Merge

The goal of the Merge subprocedure is to merge a new point with one or more clusters (Algorithm 2). When is assigned to multiple clusters, it basically acts as a linking point to merge them, resulting in a unified cluster. This step is crucial in the online clustering scenario, as points from the same cluster may be assigned to different clusters initially, but as more evidence builds up from the new data, one can combine these clusters to recover the true underlying cluster. In the Merge subprocedure the skeleton set is updated when a new cluster is constructed. The skeleton sets representing clusters that will be merged are given as input. Before describing the Merge procedure let us introduce an important subroutine. Let be a skeleton set of size for some . We denote by the extension of obtained by adding to exactly more triples according to the following procedure. Each newly added triple has weight . Each newly added skeleton point is chosen independently at random from the set in such a way that skeleton point is chosen with probability . The corresponding random number is generated by a pseudorandom number generator with seed . The seed can be initialized randomly, or alternatively it can be chosen as a function of the skeleton point by conceptually partitioning the entire input space with a grid of length , and using the id of the cell occupied by in this grid. The latter procedure is useful for infinite streams to avoid correlated random sequences for far away points in the space. The subroutine can also be run when its first argument is a single point . In this case the output is a skeleton set of the form , where is a sequence of random numbers generated according to .
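A possible rendering of this extension subroutine, continuing the illustrative classes above, is sketched below; the weight assigned to the newly added triples and the exact seeding scheme are assumptions.

import copy
import random

def extend(skeleton, m, new_weight=0.0):
    """Extend a skeleton set to m triples by weighted random sampling
    (sketch). Existing points are sampled with probability proportional to
    their weights; each sampled copy is paired with a random number drawn
    from a generator seeded by the point itself (the paper also mentions a
    grid-cell-based seed for infinite streams). new_weight is an assumption."""
    out = copy.deepcopy(skeleton)
    weights = [sp.weight for sp in skeleton.points]
    while len(out.points) < m:
        src = random.choices(skeleton.points, weights=weights, k=1)[0]
        # Point-dependent seed (illustrative); the count keeps seeds distinct.
        seed = src.point.tobytes() + str(len(out.points)).encode()
        rng = random.Random(seed)
        out.points.append(SkeletonPoint(point=src.point.copy(),
                                        rnd=rng.random(),
                                        weight=new_weight))
    return out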

The Merge algorithm computes an average weighted distance from the new point to those skeleton points from that reside in . Next, two cases are considered. If or the number of skeleton points in all the skeleton sets to be merged is then the algorithm decides not to increase the size of the merged skeleton set (since the linking point is relatively close to the merged clusters or the union of skeleton sets under consideration is already saturated). Denote by the minimum of and the total number of skeleton points in all skeleton sets to be merged. The merged skeleton set will be of size . Let us describe how the skeleton point of the newly formed cluster for will be computed. First, each contributing skeleton set is extended to size by weighted random sampling from it (see procedure described above). This is also the case for (we treat as a skeleton set consisting of copies of ).

We take the skeleton points from all the clusters to be merged and , and choose the new point and the corresponding value as shown in Algorithm 2, using the random sequence generated for . Each newly added point has to contribute to the weight distribution in the cluster. If point is not in the new skeleton set, then the closest point from the skeleton set is found and its weight is increased by one ( contributes to the total weight of ). The new skeleton set replaces the skeleton sets of all the clusters that are merged.

Now let us assume that and that the total number of all skeleton points in the skeleton sets to be merged is smaller than . Intuitively speaking, this means that the cluster has grown too much (the local density around the linking point is too small) and thus the number of skeleton points encoding the cluster has to increase (since the pool of skeleton points that can be used to represent the cluster has not yet been used entirely). If this is the case, then the same procedure as in the previous case is conducted, but the skeleton set corresponding to is excluded. Finally, is added as the last skeleton-singleton of weight (and with a corresponding random number selected according to a given random number generator) to the newly formed cluster.
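The size decision described in the two cases above could be sketched as follows; since the exact quantities did not survive extraction, all names and the comparison are assumptions.

def merged_skeleton_size(avg_dist, dist_threshold, pooled_size, s_max):
    """Sketch of the size decision when clusters are merged around a linking
    point. If the linking point is close to the merged clusters, or the pool
    of available skeleton points is already saturated, the merged skeleton
    keeps size min(s_max, pooled_size); otherwise the cluster has grown and
    the linking point itself is appended as one extra skeleton point."""
    if avg_dist <= dist_threshold or pooled_size >= s_max:
        return min(s_max, pooled_size)
    return pooled_size + 1  # pooled_size < s_max here, so this stays <= s_max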

2.1.2 Subprocedure AddSingleton

The procedure AddSingleton adds a new cluster consisting of just when no existing cluster is found to be close enough to based on the intersection with the skeleton sets (Algorithm 3). Next, a skeleton set for this new singleton cluster is created. Since a skeleton set aims to cover uniformly the entire mass of the cluster using random samples, point is repeated times to form the skeleton set for the cluster-singleton.

Furthermore, a sequence of random values is generated from , and each copy of in the skeleton set is assigned one of the values and weight to build the triples and complete the skeleton. In the proposed implementation we initialize each newly created cluster-singleton with only one skeleton point, thus (see Algorithm 3). For the newly created skeleton set an undirected graph-singleton is created.

Input: Datapoint , current clustering , family of graphs Output: Updated version of after adding cluster-singleton   Generate a random number according to

Algorithm 3 SOC clustering - AddSingleton subprocedure
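A minimal sketch of the AddSingleton subprocedure, continuing the illustrative classes above (the initial number of copies and the unit weight are assumptions), might look like this:

import random
import numpy as np

def add_singleton(x, skeletons, init_size=1):
    """Sketch of AddSingleton: create a new cluster containing only x. The
    skeleton set consists of init_size copies of x, each with its own random
    number; init_size = 1 mirrors the implementation described above."""
    s = SkeletonSet()
    for _ in range(init_size):
        s.points.append(SkeletonPoint(point=np.asarray(x, dtype=float).copy(),
                                      rnd=random.random(), weight=1.0))
    skeletons.append(s)
    return s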

2.2 Splitting clusters

We now describe the cluster splitting procedure in the MergeSplitSOC variant of the algorithm. It is handled by two additional procedures: CheckSplit and UpdatedGraph.

CheckSplit determines whether a given skeleton set should be split by looking for a breaking point , which is a skeleton point whose weight is at most half of the average weight of the points within the skeleton set. If such a point is found, the algorithm determines whether the cluster should be split as follows. First, all the points from are deleted from the corresponding graph of the skeleton set. A connected-component analysis is then conducted on the remaining graph. If more than one connected component is found, it means that is indeed a breaking point; the cluster is split so that each connected component forms a new cluster, and the points in the connected components constitute the new skeleton sets.
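A small Python sketch of this split test on an adjacency-dict representation of the skeleton graph (an illustrative data structure, not the paper's):

def should_split(graph, breaking_point):
    """Sketch of the split test: delete the candidate breaking point from the
    skeleton graph, run a connected-component search on what remains, and
    split iff more than one component is found. `graph` is an adjacency
    dict {vertex: set of neighbouring vertices}."""
    remaining = {v: nbrs - {breaking_point}
                 for v, nbrs in graph.items() if v != breaking_point}
    components, unseen = [], set(remaining)
    while unseen:
        comp, stack = set(), [unseen.pop()]
        while stack:
            u = stack.pop()
            comp.add(u)
            for w in remaining[u]:
                if w in unseen:
                    unseen.remove(w)
                    stack.append(w)
        components.append(comp)
    return len(components) > 1, components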

The UpdatedGraph procedure is shown in Algorithm 5, and is responsible for constructing a graph for the newly formed cluster and for replacing all the graphs corresponding to the merged skeleton sets with it. The graph is constructed by combining all elements of corresponding to the skeleton sets that need to be merged. Those graphs are combined by adding edges between skeleton points from the newly constructed skeleton set that are in the close neighborhood of the linking point . In the description of UpdatedGraph we denote by the graph with vertex set and edge set .

Input: Family of graphs , skeleton point , radius , skeleton set corresponding to , current clustering Output: Updated version of and   Let   Let   Run CC algorithm on to obtain if then

        Denote by the subset of corresponding to for   Update: and 
end
Algorithm 4 SOC clustering - CheckSplit subroutine

Input: Family of graphs , skeleton point , radius , subfamily , skeleton set Output: Updated version of   Let   Let   Let be a graph obtained from by adding edges from   Update:

Algorithm 5 SOC clustering - UpdatedGraph subroutine

3 Theoretical analysis

In this section we provide theoretical results regarding the SOC algorithm for the clustering model described in Sec. 2. We start by introducing the general mathematical model we analyze. It is one of the many variants of planted partition models used to construct data with hard clustering and outliers. Notice that our algorithm does not require the input to be produced according to this model. In particular, we do not use any specific parameters of the model in the algorithm.

For a set of -dimensional data, we assume it contains disjoint compact sets , which are called cores and denoted . The cores are called -separable if the minimum distance between any two cores is greater than , i.e.: . These cores can have arbitrary shapes, giving rise to the observed clusters such that the points in the cluster come from core with high probability and from the rest of the space with low probability. Formally, given a set of probabilities , points in the cluster are sampled from the core with probability and from outside with probability , where . It is important to note that even though the cores are separable, this is no longer the case for the clusters, due to the presence of "outliers". In other words, short-edge paths between points from different clusters may exist, but not many in expectation. We call the clustering model presented above a -model, where and . This is a quite general model that admits a good-quality clustering of the cores (because of -separability). However, due to the outliers, the task of recovering the clusters is nontrivial even in an offline setting. Simple heuristics such as connected components cannot be used to recover the clusters in the offline mode. The online setting brings additional algorithmic and computational challenges. Below, we give details of the proposed clustering mechanism.
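For intuition, a data stream in the spirit of this model could be generated as follows (a hedged sketch: the core samplers, the uniform outlier box and all parameter names are illustrative assumptions, not the paper's construction):

import numpy as np

def make_stream(core_samplers, cluster_probs, p_core, box=10.0, seed=0):
    """Sketch of a planted-partition-style generator: each new point is
    attributed to a cluster, drawn from the corresponding (arbitrary-shaped)
    core with probability p_core, and from a surrounding 'outlier' region
    otherwise."""
    rng = np.random.default_rng(seed)
    ids = list(range(len(core_samplers)))
    while True:
        c = int(rng.choice(ids, p=cluster_probs))
        if rng.random() < p_core:
            yield c, core_samplers[c](rng)           # point from the core
        else:
            yield c, rng.uniform(-box, box, size=2)  # outlier

# Example: two well-separated disc-shaped cores in the plane.
samplers = [lambda r: r.normal([0.0, 0.0], 0.3),
            lambda r: r.normal([5.0, 5.0], 0.3)]
stream = make_stream(samplers, cluster_probs=[0.5, 0.5], p_core=0.9)
cluster_id, point = next(stream)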

We need the following definition.

Definition 3.1.

A set is said to be -coverable if it can be covered by balls, each of radius r.
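In symbols, using hedged placeholder notation since the original symbols did not survive extraction (k stands for the elided number of covering balls):

% Hedged notation: S is the set, r the radius, k the (elided) number of balls,
% and c_1, ..., c_k their centers.
S \;\subseteq\; \bigcup_{i=1}^{k} B(c_i, r),
\qquad B(c, r) = \{\, x : \|x - c\| \le r \,\}.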

Let be the probability that a new point in the data stream belongs to cluster . Fix a covering of with (for every core ). Let be an arbitrary ball from the covering . Furthermore, let be a lower bound on the probability that a new point is from set () given it belongs to core , which can be expressed as . Denote , and , where , and . Here is the probability that a new point came from a fixed ball of the covering of given that it belongs to cluster . Similarly, is the probability that a new point is an outlier given that it belongs to cluster . Since outliers are expected to be less frequent than points from the cluster cores, . Denote and .

Since we keep at most samples in a skeleton set, most of the cluster points are not in this set. The error made by the algorithm on a new point is defined as follows: suppose comes from the core and gets assigned to a cluster that contains points from other cores as well, or there exists another cluster that also contains points from . Note that this is a strict definition of error since, in an online setting, transient overclustering is expected due to the lack of enough data in the early phase. We say that the algorithm reaches the saturation phase when each skeleton set reaches size . We are ready to state the main results of our analysis regarding the MergeOnlySOC version of the algorithm.

Theorem 3.1.

Assume that we are given a dataset constructed according to the -model with cores with outliers. Cores are -separable. Assume that each core of the cluster is -coverable. Let be the number of all the points seen by the algorithm after the saturation phase has been reached. Then with probability at least for , and the SOC algorithm will not merge clusters containing points from different cores in the saturation phase if they were not merged earlier.

Theorem 3.1 gives upper bounds on the minimal number of skeleton points per cluster ensuring that MergeOnlySOC does not undercluster. As we will see in the experimental section, the number of skeleton points needed in practice is much lower than this bound. We also have the following.

Theorem 3.2.

Under the assumptions from Theorem 3.1, with probability at least , the SOC algorithm will not make any errors on points coming from core after points from the corresponding cluster have been seen in the saturation phase.

Theorem 3.2 says that under a reasonable assumption regarding the number of initial skeleton points per cluster, and after a short initial subphase of the saturation phase, the algorithm MergeOnlySOC classifies correctly all points coming from cores. In other words, we obtain an upper bound on the rate of convergence of the number of clusters produced by MergeOnlySOC to the ground-truth value.

The proofs of Theorem 3.1 and Theorem 3.2 will be given in the Appendix. Below we give a very short introduction and present a useful lemma that we will rely on later.

Figure 1: Merge scenario as in Lemma 3.1. Data point merges two clusters: and . The intersection for some must consist only of outliers (points marked red). The other intersection may contain points from the core (points marked green).

3.1 Merging Lemma

Since both theorems concern the saturation phase of the algorithm, whenever we talk about the algorithm in the theoretical analysis we in fact mean its saturation phase. Without loss of generality we can also assume in our theoretical analysis that each skeleton point has weight one. Indeed, a point that has weight may be equivalently treated as a collection of skeleton points of weight one (see the description of the algorithm). Let us formulate the following lemma:

Lemma 3.1.

Let us assume that at time the algorithm merges two clusters and such that contains a point from and contains a point from for some . Then either at least of all skeleton points of at time are outliers, or at least of all skeleton points of at time are outliers.

Figure 2: Clustering results for the different methods. The left column shows four different synthetic datasets. The resulting clusters for each method are marked using different colors. Note that DenStream is a hybrid online-offline technique while SOC is a purely online method.
Proof.

The lemma follows from the definition of , according to which any two points taken from different cores are at least distance apart. If two clusters and are merged at time , then there exists a data point (a merger) and a ball such that contains at least skeleton points of and at least skeleton points of (see Fig. 1). But cannot contain points from different cores, since the cores are -separable. Thus at least one of the two clusters has at least skeleton points in that are outliers. ∎

4 Experiments

We evaluated the performance of the proposed SOC algorithm using four synthetic datasets as shown in Fig. 2 (left column). The sets contain data points in 20 dimensions. The first two dimensions were randomly drawn from predefined clusters, as shown in the figure, while the other 18 dimensions were random noise. For the data sets B1 and B2, 1000 data points were randomly drawn from each of the two banana-shaped clusters for the first two dimensions. Then 1000 (for B1) and 2000 (for B2) outliers were randomly generated from a vicinity of the shapes, respectively. For the data sets L1 and L2, 500 data points were randomly drawn from each of the four letter-shaped clusters. Then 500 (for L1) and 2500 (for L2) outliers were randomly generated from a vicinity of the shapes, respectively. The values in the other 18 dimensions for all the data were Gaussian white noise with a standard deviation of 0.01. All the data points were then randomly permuted so that their order in the data stream would not affect the results. Examples of the datasets plotted in the first two dimensions are shown in Fig. 2 (left column). We used the MergeSplitSOC version of the algorithm since it provided faster convergence of the number of clusters under the same quality guarantees.

We compared the SOC method with several state-of-the-art nonparametric clustering methods, namely DBScan (Ester et al., 1996), the Leader-Follower algorithm (Duda et al., 2000), the Doubling algorithm (Charikar et al., 1997), and DenStream (Cao et al., 2006). The clustering quality was quantitatively evaluated using the average clustering purity, defined as

purity = \frac{1}{K} \sum_{i=1}^{K} \frac{|C_i^d|}{|C_i|}    (1)

where K is the number of clusters, |C_i| is the number of points in cluster i, and |C_i^d| is the number of points in cluster i that carry the dominant class label.
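A short Python sketch of this metric, matching the average-purity definition above (names are illustrative):

from collections import Counter

def average_purity(assignments, labels):
    """Average clustering purity (sketch of Eq. (1)): for each predicted
    cluster, the fraction of its points carrying the dominant true label,
    averaged over clusters."""
    clusters = {}
    for c, y in zip(assignments, labels):
        clusters.setdefault(c, []).append(y)
    return sum(Counter(ys).most_common(1)[0][1] / len(ys)
               for ys in clusters.values()) / len(clusters)

print(average_purity([0, 0, 0, 1, 1], ['a', 'a', 'b', 'b', 'b']))  # 0.8333...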

Fig. 2 shows the comparative results of the SOC method as well as several other methods. Fig. 3 shows the clustering purity of all the methods. The SOC method requires two parameters ( and ). In all the experiments, we selected and . All the results were produced using the best choice of parameters for each method. We note that parameter tuning was not trivial because most of the methods require at least two parameters.

Results showed that the SOC method was able to cluster the data well, even though it slightly over-clustered on the datasets B1 and B2. The Leader-Follower algorithm as well as streaming DBScan simply do not handle this type of data. SOC obtains results similar to those of the DenStream algorithm (it produced slight overclustering, but obtained almost 100% purity). DBScan worked well on the banana sets, but it failed to cluster L2, where the outliers outnumbered the true clusters. The Doubling algorithm failed to work on the noisier data sets (B2 and L2); for the other two datasets, its clustering purity was low, probably because of the noise in the other 18 dimensions. The Leader-Follower method worked fine for L1 and L2, but poorly for B1 and B2, mostly because of the nature of the method and partly because of the noise in the other 18 dimensions. The standard variant of the Leader-Follower uses only a small number of centers per cluster, and when the clusters are not contained in convex, well-separable regions, the recognition is very poor. DenStream worked well in all the cases.

Figure 3: The clustering purity for different methods on the four synthetic data sets.
(a) B1
(b) L1
Figure 4: The skeleton points for the data sets (a) B1 and (b) L1. The data points are colored blue, and the skeleton points for different clusters are marked with different colors.

Our method is faster than DenStream (the running times per point varied from to microseconds for SOC and were above microseconds for DenStream). The reason is that DenStream is not a purely online approach and periodically performs offline clustering, which is relatively expensive. DenStream has another serious drawback: the id of the cluster to which a newly arriving point is assigned is computed from the most recent snapshot of the offline clustering, not on the fly. Thus the accuracy of the method depends heavily on a special parameter determining how quickly the data distribution evolves over time. Since parameter tuning is always nontrivial for density-based methods, this extra parameter adds another layer of difficulty. We do not need this parameter in our approach.

The SOC method slightly overclustered on B1 and B2 because of the online nature of the algorithm and the presence of the outliers. Nonetheless, it was able to correctly discard the outliers and produced results with high purity. This is because, even though outliers can become part of the skeleton set of a cluster, they are typically replaced by true cluster points eventually, as true cluster points have a higher density and arrive at a higher rate than the outliers.

Fig. 4 shows the skeleton points generated by SOC for two data sets. The maximum number of skeleton points used was only a few hundred. The number of skeleton points grows gradually as clusters become bigger. We set the upper bound on the number of skeleton points per cluster to , but this bound was never reached.

5 Conclusions

We have presented a new, truly online clustering algorithm which can recover arbitrary-shaped clusters in the presence of outliers from massive data streams. Each cluster is represented efficiently by a skeleton set, which is continuously updated to dynamically adapt to the changing data distribution. The proposed technique is theoretically sound as well as fast and space-efficient in practice. It produced good-quality clusters in various experiments with nonconvex clusters. It outperforms several online approaches on many datasets and produces results similar to the most effective hybrid methods that combine online and offline steps (such as DenStream). In the future, we would like to investigate other methods for updating skeletons within the given framework in an online fashion, since this mechanism is crucial for the effectiveness of the presented approach. Another interesting direction is to study the maximal number of clusters created during the execution of the algorithm; a more precise bound would provide a more accurate theoretical estimate of the memory usage.

References

  • Aggarwal et al. (2003) Aggarwal, Charu C., Han, Jiawei, Wang, Jianyong, and Yu, Philip S. A framework for clustering evolving data streams. In VLDB, pp. 81–92, 2003.
  • Ailon et al. (2009) Ailon, N., Jaiswal, R., and Monteleoni, C. Streaming k-means approximation. Neural Processing Systems Conference, pp. 10–18, 2009.
  • Alon et al. (2000) Alon, Noga, Dar, Seannie, Parnas, Michal, and Ron, Dana. Testing of clustering. In 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, 12-14 November 2000, Redondo Beach, California, USA, pp. 240–250, 2000.
  • Amini et al. (2014) Amini, Amineh, Teh, Ying Wah, and Saboohi, Hadi. On density-based data streams clustering algorithms: A survey. J. Comput. Sci. Technol., 29(1):116–141, 2014.
  • Bādoiu et al. (2002) Bādoiu, Mihai, Har-Peled, Sariel, and Indyk, Piotr. Approximate clustering via core-sets. In Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, STOC '02, pp. 250–257. ACM, 2002.
  • Bagirov et al. (2011) Bagirov, A., Ugon, J., and Webb, D. Fast modified global k-means algorithm for incremental cluster recognition. Pattern Recognition, 44:866–876, 2011.
  • Cao et al. (2006) Cao, Feng, Ester, Martin, Qian, Weining, and Zhou, Aoying. Density-based clustering over an evolving data stream with noise. In Proceedings of the Sixth SIAM International Conference on Data Mining, April 20-22, 2006, Bethesda, MD, USA, pp. 328–339, 2006.
  • Charikar et al. (1997) Charikar, Moses, Chekuri, Chandra, Feder, Tomás, and Motwani, Rajeev. Incremental clustering and dynamic information retrieval. In Proceedings of the Twenty-ninth Annual ACM Symposium on Theory of Computing, STOC ’97, pp. 626–635. ACM, 1997.
  • Chia et al. (2009) Chia, Y., Song, X., Zhou, D., Hino, K., and Tseng, B. On evolutionary spectral clustering. TKDD, 3, 2009.
  • de Andrade Silva et al. (2013) de Andrade Silva, Jonathan, Faria, Elaine R., Barros, Rodrigo C., Hruschka, Eduardo R., de Carvalho, André Carlos Ponce Leon Ferreira, and Gama, João. Data stream clustering: A survey. ACM Comput. Surv., 46(1):13, 2013.
  • Duda et al. (2000) Duda, Richard O., Hart, Peter E., and Stork, David G. Pattern Classification (2Nd Edition). Wiley-Interscience, 2000.
  • Ester et al. (1996) Ester, Martin, Kriegel, Hans-Peter, Sander, Jörg, and Xu, Xiaowei. A density-based algorithm for discovering clusters in large spatial databases with noise. pp. 226–231. AAAI Press, 1996.
  • Gonzalez (1985) Gonzalez, Teofilo F. Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci., 38:293–306, 1985.
  • Guha et al. (2003) Guha, S., Meyerson, A., Mishra, N., Motwani, R., and O’Callaghan, L. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15:515–528, 2003.
  • Guha et al. (2001) Guha, Sudipto, Rastogi, Rajeev, and Shim, Kyuseok. Cure: An efficient clustering algorithm for large databases. Inf. Syst., 26(1):35–58, 2001.
  • Har-peled & Varadarajan (2001) Har-peled, Sariel and Varadarajan, Kasturi R. Approximate shape fitting via linearization. In In Proc. 42nd Annu. IEEE Sympos. Found. Comput. Sci, pp. 66–73, 2001.
  • Jia (2012) Jia, Y. Online spectral clustering on network streams. Dissertation at the Department of Electrical Engineering and Computer Science, University of Kansas, 2012.
  • Langone et al. (2014) Langone, R., Agudelo, O., Moor, B. De, and Suykens, J. Incremental kernel spectral clustering for online learning of non-stationary data. Neurocomputing, 2014.
  • Ng et al. (2001) Ng, Andrew Y., Jordan, Michael I., and Weiss, Yair. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada], pp. 849–856, 2001.
  • Ning et al. (2010) Ning, H., Xu, W., Chi, Y., Gong, Y., and Huang, T. Incremental spectral clustering by efficiently updating the eigen-system. Pattern Recognition, pp. 113–127, 2010.
  • Shah & Zaman (2010) Shah, Devavrat and Zaman, Tauhid. Community detection in networks: The leader-follower algorithm. CoRR, abs/1011.0774, 2010.
  • Shindler et al. (2011) Shindler, M., Wong, A., and Meyerson, A. Fast and accurate k-means for large datasets. Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems Conference, pp. 2375–2383, 2011.

6 Appendix

6.1 Proof of Theorem 3.1

Proof.

From Lemma 3.1 we conclude that in order to find an upper bound on the probability of the event that the algorithm will at some point merge "wrong clusters", it suffices to find an upper bound on the probability that the algorithm will at some point produce a cluster with at least skeleton points that are outliers and at least one nonoutlier . Denote this latter event by . Denote by the intersection of with the event that comes from the core . Note that .

For a set of points we say that a point dominates if the following is true: at least one of the random numbers assigned to by the algorithm is smaller than all corresponding random numbers assigned to data points from . Denote the set of all data points that are outliers as . Let us fix a core and some constant . Denote by the time stamps at which the first , , … points from are collected. Let us first find a lower bound on the probability of the following event : for every , in every ball of the -covering of , there are at least points from that are dominating the set of all the points collected up to time . Fix some ball of the covering of . By the definition of , we know that on average points from arrived up to time . By Azuma's inequality, we know that this number is tightly concentrated around its average, i.e., for every the probability that for a fixed and fixed ball this number is less than is at most . In fact, by using a more general version of Azuma's inequality, we can get rid of a fixed and have the same upper bound for all s simultaneously. Consider the following event : for every ball up to time for every at least data points from that ball have arrived. Thus, using the union bound over all the balls of the covering, we conclude that happens with probability at least . Now we analyze conditioned on . For any fixed , the average number of dominating points in the fixed ball of the covering of is at least . As previously, one can easily note that the actual number is tightly concentrated around the average. Taking and using Azuma's inequality once again, we derive an upper bound of the form on the probability that a fixed ball of the covering at some fixed time contains fewer dominating points than we assumed above. Using this and taking the union bound over all and all balls of the covering we get: , where stands for the complement of an event . Let denote the event that among the first points from there will be at most outliers, where and is some fixed constant. From Chernoff's inequality we get: . Now assume that happens and that points from have already arrived. Denote by the following event: for all balls of the covering of at any time no skeleton points in any ball are outliers. Note that if holds, then after points have been seen, at least new points from that are outliers must be seen up to time . Fix again a ball of the covering of . Let us take the next points coming from for (we will take in such a way that ). For a fixed , the probability that out of those points there are more than outliers is, by Chernoff's bound, at most . Denote by the following event: for all there are at most outliers out of those points. By the union bound we have: . Assume that holds and fix . The probability that after new points have been seen, one of the outliers will become a skeleton point in a fixed ball, based on the fixed random number that was assigned to it by the algorithm, is at most . The probability that this will be the case for at least outliers is at most . Thus, if we take the union bound, we can conclude that . Notice that . Thus we obtain . We conclude that . Thus, taking the union bound over all the cores, we obtain: . If we fix and , then by bounding each term of the RHS expression in the last inequality with , we obtain the lower bound on as in the statement of the theorem. Since we have already noticed that it suffices to find an appropriate upper bound on , this concludes the proof. ∎

6.2 Proof of Theorem 3.2

We will use notation from the proof of Theorem 3.1.

Proof.

Fix some . Note that we have already proved that the event does not hold with probability at most . Also note that from the definition of we know that if holds, then the number of clusters computed by the algorithm that contain at least one point from will not increase after the time when the first data points from the corresponding cluster have been seen. Let us assume that holds. Notice that by the time every ball of the covering gets at least one new data point, there will be just one cluster computed by the algorithm with points from the core . When this is the case, no further errors regarding new points from the core will be made. From Azuma's inequality and the union bound we immediately get that the number of extra data points coming from that need to be seen to populate every ball of the covering with at least one of them is more than with probability at most . We conclude that with probability at most , after points of have already been seen, the algorithm will still make mistakes on new data points from . If we now upper-bound every term of the above sum by and solve for and , then we get: , . Taking the number of points as in the previous theorem concludes the proof. ∎