S-RASTER: Contraction Clustering for Evolving Data Streams

11/21/2019 ∙ by Gregor Ulm, et al.

Contraction Clustering (RASTER) is a very fast algorithm for density-based clustering, which requires only a single pass. It can process arbitrary amounts of data in linear time and in constant memory, quickly identifying approximate clusters. It also exhibits good scalability in the presence of multiple CPU cores. Yet, RASTER is limited to batch processing. In contrast, S-RASTER is an adaptation of RASTER to the stream processing paradigm that is able to identify clusters in evolving data streams. This algorithm retains the main benefits of its parent algorithm, i.e. single-pass linear time cost and constant memory requirements for each discrete time step in the sliding window. The sliding window is efficiently pruned, and clustering is still performed in linear time. Like RASTER, S-RASTER trades off an often negligible amount of precision for speed. It is therefore very well suited to real-world scenarios where clustering does not happen continually but only periodically. We describe the algorithm, including a discussion of implementation details.


1 Introduction

Clustering is a standard method for data analysis. Many clustering methods have been proposed [11]. Some of the most well-known clustering algorithms are DBSCAN [7], k-means clustering [10], and CLIQUE [1, 2]. Yet, they have in common that they do not perform well when clustering big data, i.e. data that far exceeds available main memory. Based on a real-world challenge we faced in industry, i.e. clustering large amounts of geospatial data, we developed Contraction Clustering (RASTER), a very fast linear-time clustering algorithm for identifying approximate density-based clusters. The original presentation focused on sequential processing of batch data [13] and was followed by a description of a parallel version of this algorithm [14]. A key aspect of RASTER is that it does not exhaustively cluster its input but instead identifies the approximate location of clusters in linear time. As it only requires constant space, it is eminently suitable for clustering big data. The variant RASTER′ retains its input and still runs in linear time while requiring only a single pass. Of course, it cannot operate in constant memory.

A common motivation for stream processing is that data does not fit into working memory and therefore cannot be retained. This is not a concern for RASTER, as it can process an arbitrary amount of data in limited working memory. One could therefore divide a stream of data into discrete batches and cluster them consecutively. Yet, this approach does not address the problem that, in a given stream of data, any density-based cluster may only be temporary. To solve this problem, this paper presents Contraction Clustering for evolving data streams (S-RASTER). This algorithm has been designed for identifying density-based clusters in infinite data streams within a sliding window. S-RASTER is not a replacement for RASTER but a complement, enabling this pair of algorithms to efficiently cluster data regardless of whether it is available as a batch or a stream.

In the remainder of this paper, we provide relevant background in Sect. 2, which contains a brief recapitulation of RASTER and the motivating use case for S-RASTER, i.e. identifying evolving hubs in streams of GPS data. In Sect. 3 we provide a detailed description of S-RASTER, including a complete specification in pseudocode; source code is available at the following repository: https://gitlab.com/fraunhofer_chalmers_centre/s-raster. After a brief theoretical evaluation in Sect. 4, we highlight related work in Sect. 5 and future work in Sect. 6.

(a) Input data
(b) Significant tiles
(c) Clustered tiles
(d) Clustered points
Figure 1: High-level visualization of RASTER (best viewed in color). The original input is shown in a), followed by projection to tiles in b), where only significant tiles that contain more than τ values are retained. Tile-based clusters, assuming a minimum cluster size of μ, are visualized in c), corresponding to RASTER. Clusters as collections of points are shown in d), which corresponds to the variant RASTER′.

2 Background

In this section, we give a brief presentation of the sequential RASTER algorithm in Subsect. 2.1. This is followed by a description of the motivating problem behind S-RASTER, i.e. the identification of so-called hubs within a sliding window, in Subsect. 2.2.

2.1 Raster

Input: data points, precision prec, threshold τ, distance δ, minimum cluster size μ
Output: set of clusters clusters
1: acc := empty hashmap ▷ tile → count
2: clusters := ∅ ▷ set of sets
3: for p in points do
4:    (x, y) := project(p, prec)
5:    if (x, y) ∈ keys of acc then
6:       acc[(x, y)] += 1
7:    else
8:       acc[(x, y)] := 1
9: for t in keys of acc do
10:   if acc[t] < τ then
11:      remove t from acc
12: σ := keys of acc ▷ significant tiles
13: while σ ≠ ∅ do ▷ clustering: lls. 12–24
14:    t := pop(σ)
15:    cluster := ∅
16:    visit := {t}
17:    while visit ≠ ∅ do
18:       u := pop(visit)
19:       add u to cluster
20:       ns := neighbors(u, σ, δ) ▷ significant tiles within distance δ
21:       remove ns from σ ▷ cf. ln. 13
22:       add ns to visit
23:    if size of cluster ≥ μ then
24:       add cluster to clusters
Algorithm 1 RASTER

In this subsection, we provide a brief description of RASTER [13, 14]. This algorithm approximately identifies density-based clusters very quickly (cf. Alg. 1). The main idea is to project data points to tiles and keep track of the number of points that are projected to each tile. Only tiles to which more than a predefined threshold number τ of data points have been projected are retained. These are referred to as significant tiles σ, which are subsequently clustered by exhaustive lookup of neighboring tiles in a depth-first manner. Clustering continues for as long as there are significant tiles left. To do so, the algorithm selects an arbitrary tile as the seed of a new cluster. This cluster is grown iteratively by looking up all neighboring tiles within a given Manhattan or Chebyshev distance δ. Each such lookup takes only O(1) time, as the location of all potential neighbors is known due to their position in the grid. Only clusters that contain more than a predefined number μ of significant tiles are kept.
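To make the procedure concrete, the following is a minimal Python sketch of RASTER under the assumptions above. The function name, the use of a Counter, and the explicit Chebyshev neighborhood loop are our choices for illustration, not the reference implementation.

```python
from collections import Counter

def raster(points, prec, tau, delta, mu):
    """Minimal sketch of RASTER: project, prune, then cluster significant tiles."""
    # Project every point to an integer tile and count points per tile.
    acc = Counter((int(x * 10**prec), int(y * 10**prec)) for x, y in points)
    # Keep only significant tiles, i.e. tiles with at least tau points.
    sigma = {t for t, c in acc.items() if c >= tau}
    clusters = []
    while sigma:                      # grow one cluster per outer iteration
        cluster, visit = set(), [sigma.pop()]
        while visit:                  # depth-first growth
            x, y = visit.pop()
            cluster.add((x, y))
            # Chebyshev neighborhood of radius delta; each lookup is O(1).
            for dx in range(-delta, delta + 1):
                for dy in range(-delta, delta + 1):
                    if (x + dx, y + dy) in sigma:
                        sigma.remove((x + dx, y + dy))
                        visit.append((x + dx, y + dy))
        if len(cluster) >= mu:        # discard clusters that are too small
            clusters.append(cluster)
    return clusters
```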

The projection operation consists of reducing the precision of the input by scaling a floating-point number to an integer. GPS data is inherently imprecise, yet it is stored in floating-point format with maximum precision, which is potentially misleading, considering that consumer-grade GPS is only accurate to within about ten meters under ideal conditions [6]. Consequently, these data suggest a level of precision they do not possess. Thus, by dropping a few decimal digits, we do not lose much, if any, information. Furthermore, vehicle GPS data is sent by vehicles that may, in the case of trucks with an attached trailer, be more than 25 meters long. To cluster such coordinates, we can truncate even further. For instance, if we concluded that a precision of around 11.1 meters is sufficient, we would truncate a GPS coordinate to four decimal places. To avoid issues pertaining to working with floating-point numbers, the input is instead scaled to an integer. Before generating the final output, all significant tiles of the resulting clusters are scaled back to floating-point numbers.
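As a sketch, the scaling step can be expressed as follows; the function names and the example coordinate are ours, chosen purely for illustration (0.0001 degrees of latitude corresponds to roughly 11.1 meters):

```python
def project(value: float, prec: int) -> int:
    # Scale to an integer; prec = 4 keeps four decimal places,
    # i.e. roughly 11.1 m of latitude per tile.
    return int(value * 10**prec)

def unproject(value: int, prec: int) -> float:
    # Scale a tile coordinate back to a floating-point number.
    return value / 10**prec

# Hypothetical coordinate, truncated to ~11.1 m precision:
assert project(48.85837, 4) == 488583
assert unproject(488583, 4) == 48.8583
```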

RASTER is a single-pass linear-time algorithm. However, in a big data context, its biggest benefit is that it only requires constant memory, assuming a finite range of inputs. This is the case with GPS data. It is therefore possible to process an arbitrary amount of data on a resource-constrained workstation with this algorithm. We have also shown that it can be effectively parallelized [14]. A variation of this algorithm that retains its inputs is referred to as RASTER′. It is less suited for big data applications. However, it is effective for general-purpose density-based clustering and very competitive against standard clustering methods; cf. Appendix A in [14].

2.2 Identifying Evolving Hubs

RASTER was designed for finite batches of GPS traces of commercial vehicles. The goal was to identify hubs, i.e. locations where many vehicles come to a halt, for instance vans at delivery points or buses at bus stops. After identifying all hubs in a data set, it is possible to construct vehicular networks. However, what if the data does not represent a static reality? It is a common observation that the location of hubs changes over time. A bus stop may get moved or abolished, for instance. This motivates adapting RASTER so that it is able to detect hubs over time, maintaining them within a sliding window W. The length of W depends on the actual use case. With GPS traces of infrequent but important deliveries, many months may be necessary. Yet, with daily deliveries, a few days would suffice to detect new hubs as well as discard old ones.

Figure 2: The flow graph of S-RASTER for evolving data streams. The input source node s distributes values arbitrarily to projection nodes π. In turn, they send projected values to accumulation nodes α. These determine significant tiles for the chosen sliding window. Should a tile become significant, or a once significant tile no longer be significant, a corresponding update is sent to the clustering node c. Node c periodically performs clustering of significant tiles, which is a very fast operation.

3 S-Raster

This section starts with a concise general description of S-RASTER in Sect. 3.1, followed by a detailed specification in Sect. 3.2. Afterwards, we highlight some implementation details in Sect. 3.3 and outline, in Sect. 3.4, how S-RASTER has to be modified to retain its inputs, which makes this algorithm applicable to different use cases.

3.1 Idea

S-RASTER (cf. Fig. 2) performs, for an indefinite amount of time, projection and accumulation continually, and clustering periodically. Projection nodes π receive their input from the source node s. Each incoming (x, y), where x, y ∈ ℝ, is surjected to (x_π, y_π), where x_π, y_π ∈ ℤ. Together with a period indicator z, these values are sent to accumulation nodes α. The identifier z designates a period with a fixed size, e.g. one day, and is non-strictly increasing. Each α-node maintains a sliding window of length n, which is constructed from multisets m_z. Each such multiset keeps running totals of how many times the input was surjected to any given tuple (x_π, y_π) in the chosen period. The sliding window ending at period z is defined as W_z = ⊎_{i = z−n+1}^{z} m_i. It contains the set of significant tiles σ = {t ∈ W_z : count(t) ≥ τ}, where count(t) indicates the number of appearances of t in the window and τ the threshold for a significant tile. If a tile becomes significant, its corresponding value is forwarded to the clustering node c. Whenever z advances to the next period z + 1, the oldest entry m_{z−n+1} is removed from the window. Furthermore, all affected running totals are adjusted, which may lead to some significant tiles no longer being significant. If so, node c receives corresponding updates to likewise remove those entries. Node c keeps track of all significant tiles, which it clusters whenever z advances (cf. Alg. 1, lls. 12–24). Call the resulting set of clusters K. The resulting clusters are defined as {k ∈ K : |k| ≥ μ}, where μ is the minimum cluster size. Each (x_π, y_π) in each cluster is finally projected back to (x_c, y_c), where x_c, y_c ∈ ℝ. Together with cluster and period identifiers, these values are sent to the sink node.
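The window bookkeeping can be illustrated with a toy example (all values below are made up): each period contributes one multiset of tile counts, and significance is determined on their sum.

```python
from collections import Counter

# One multiset of tile counts per period; window length n = 3, tau = 4.
m = {
    0: Counter({(7, 2): 3, (5, 5): 1}),
    1: Counter({(7, 2): 2}),
    2: Counter({(5, 5): 1}),
}
window = sum(m.values(), Counter())   # W_2 = m_0 + m_1 + m_2
sigma = {t for t, c in window.items() if c >= 4}
print(sigma)                          # {(7, 2)}, since 3 + 2 >= 4
```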

3.2 Detailed Description

S-RASTER processes the data stream coming from source s in Fig. 2 and outputs clusters to the sink. There are three different kinds of nodes: projection nodes π project points to tiles, accumulation nodes α determine significant tiles within the sliding window, and one clustering node c outputs, for each period, all identified clusters. Below, we describe the three kinds of nodes in detail.

Input: stream of tuples (x, y, z), where x, y are coordinates and z is the current period; precision prec
Output: stream of tuples (x_π, y_π, z)
1: x_π := project(x, prec) ▷ projection of x with prec
2: y_π := project(y, prec) ▷ projection of y with prec
3: send (x_π, y_π, z) to an α-node
Algorithm 2 Projection node π
Projection.

Nodes labeled with π project incoming values to tiles with the specified precision (cf. Alg. 2). In general, the input (x, y, z) is transformed into (x_π, y_π, z). The period z is a non-strictly increasing integer, for instance uniquely identifying each day. Projection is a stateless operation that can be executed in parallel. It is irrelevant which node performs projection on which input value, as they are interchangeable. However, the input of the subsequent accumulator nodes needs to be grouped by x_π values, for instance by assigning values within a certain segment of the longitude range of the input to a designated α-node. Yet, strict segmentation is not necessary, as long as it is ensured that every unique surjected value is sent to the same α-node.
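A simple way to satisfy this requirement is to key the stream on the projected x-coordinate; the partitioning function below is our illustration, not part of the algorithm:

```python
def route(x_pi: int, num_alpha_nodes: int) -> int:
    # The same x_pi always yields the same alpha-node, so every unique
    # surjected value is counted by exactly one accumulator.
    return x_pi % num_alpha_nodes
```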

Accumulation.

The accumulator nodes α keep track of the number of counts per projected tile for the sliding window (cf. Alg. 3). The input consists of a stream of tuples (x_π, y_π, z) as well as the size n of the sliding window and the threshold value τ. A global variable z_c keeps track of the current period. Each α-node maintains two persistent data structures. For reasons of efficiency, the hashmap totals records, for each tile, how many points were projected to it within the current window. Tiles become significant once their associated count reaches τ. In addition, the hashmap window records the counts per tile for each period in the window. Given that the input z is non-strictly increasing, there are two cases to consider:

  (a) z = z_c, i.e. the period is unchanged. In this case, the count of tile (x_π, y_π) in totals and its corresponding value in window are incremented by 1. If the total count for (x_π, y_π) has just reached τ, an update is sent to the c-node, containing (x_π, y_π) and the flag 1.

  (b) z > z_c, i.e. the current input belongs to a later period. Now the sliding window needs to be advanced, which means that the oldest entry gets pruned. But first, an update is sent to the c-node with z and the flag 0. Afterwards, the entries in the hashmap window have to be adjusted: the entry for the oldest period is removed and, for each coordinate pair and its associated count in that entry, the corresponding entry in the hashmap totals is adjusted downward, as the totals should no longer include these values. In case the associated value of a coordinate pair in totals drops below τ, an update is sent to c, consisting of (x_π, y_π) and the flag −1. Should a value in totals reach 0, the corresponding key-value pair is removed. Afterwards, the steps outlined in case (a) are performed.

Regarding significant tiles, only status changes are communicated from the α-nodes to c, which is much more efficient than sending a stream of constant status updates for each tile.
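The following Python sketch mirrors the two cases above; the class name, identifiers, and the callback interface are our own (cf. Alg. 3 for the authoritative pseudocode):

```python
from collections import Counter, defaultdict

class AccumulatorNode:
    """Sketch of an alpha-node: counts per tile over a sliding window."""
    def __init__(self, n, tau, send):
        self.n, self.tau = n, tau
        self.send = send                      # callback to the c-node
        self.window = defaultdict(Counter)    # period z -> tile -> count
        self.totals = Counter()               # tile -> count over the window
        self.z_c = None                       # current period

    def process(self, x_pi, y_pi, z):
        tile = (x_pi, y_pi)
        if self.z_c is not None and z > self.z_c:       # case (b): new period
            self.send((0, z))                           # trigger re-clustering
            for z_old in [p for p in self.window if p <= z - self.n]:
                for t, c in self.window.pop(z_old).items():
                    if self.totals[t] >= self.tau > self.totals[t] - c:
                        self.send((-1, t))              # no longer significant
                    self.totals[t] -= c
                    if self.totals[t] == 0:
                        del self.totals[t]
        self.z_c = z if self.z_c is None else max(self.z_c, z)
        self.window[z][tile] += 1                       # case (a): count point
        self.totals[tile] += 1
        if self.totals[tile] == self.tau:
            self.send((1, tile))                        # tile became significant
```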

Input: stream of tuples (x_π, y_π, z); sliding window size n; threshold τ
Output: stream of tuples (flag, val), where flag ∈ {−1, 0, 1} and val is a tile (x_π, y_π) or a period z
1: window := empty hashmap ▷ period → (tile → count)
2: totals := empty hashmap ▷ tile → count
3: z_c := 0 ▷ Period count
4: for (x_π, y_π, z) in stream do
5:    if z > z_c then ▷ New period
6:       send (0, z) to node c ▷ Re-cluster
7:       z_c := z
8:       z_o := z_c − n ▷ Oldest entry
9:       if z_o ∈ keys of window then ▷ Prune
10:         vals := window[z_o]
11:         for (x, y) in keys of vals do
12:            c_w := vals[(x, y)]
13:            c_t := totals[(x, y)]
14:            totals[(x, y)] := c_t − c_w
15:            if c_t ≥ τ and c_t − c_w < τ then
16:               send (−1, (x, y)) to node c ▷ Remove
17:            if totals[(x, y)] = 0 then
18:               remove entry (x, y) from totals
19:         remove entry z_o from window
20:   if (x_π, y_π) ∈ keys of totals then
21:      totals[(x_π, y_π)] += 1
22:   else
23:      totals[(x_π, y_π)] := 1
24:   if (x_π, y_π) ∈ keys of window[z] then
25:      window[z][(x_π, y_π)] += 1
26:   else
27:      window[z][(x_π, y_π)] := 1
28:   if totals[(x_π, y_π)] = τ then
29:      send (1, (x_π, y_π)) to node c ▷ Add
Algorithm 3 Accumulation node α
Clustering.

The clustering node c (cf. Alg. 4) takes as input a precision value prec, which is identical to the one used in the π-nodes, the minimum cluster size μ, and a stream consisting of tuples of a flag and a value val. This node keeps track of the significant tiles σ of the current sliding window, based on the updates received from all α-nodes. If flag = 1, the associated coordinate pair is added to σ. On the other hand, if flag = −1, the tile is removed from σ. Thus, σ is synchronized with the information stored in all α-nodes. Lastly, if flag = 0, the associated value of the input tuple represents a period identifier z. This is interpreted as the beginning of that period and, conversely, the end of period z − 1. Node c now clusters the set of significant tiles σ (cf. Alg. 1, lls. 12–24) and produces an output stream that represents the clusters found within the current sliding window. In this stream, each projected coordinate pair is assigned period and cluster identifiers, which leads, after re-scaling the coordinate pairs to floating-point numbers (x_c, y_c), to an output stream of tuples of the format (x_c, y_c, id_cluster, z).

Input: stream of tuples (flag, val), where flag ∈ {−1, 0, 1} and val is a tile (x_π, y_π) or a period z; precision prec; minimum cluster size μ
Output: stream of tuples (x_c, y_c, id_cluster, z)
1: σ := ∅ ▷ Significant tiles
2: for (flag, val) in stream do
3:    if flag = 1 then
4:       add val to σ
5:    if flag = −1 then
6:       remove val from σ
7:    if flag = 0 then ▷ Next period
8:       z := val
9:       clusters := cluster(σ, μ) ▷ cf. Alg. 1, lls. 12–24
10:      id_cluster := 0 ▷ Cluster ID
11:      for cluster in clusters do
12:         for (x_π, y_π) in cluster do
13:            (x_c, y_c) := scale(x_π, y_π, prec) ▷ Int to Float
14:            send (x_c, y_c, id_cluster, z)
15:         id_cluster += 1
Algorithm 4 Clustering node c

3.3 Implementation Details

The previous description is idealized. Yet, our software solution has to take the vagaries of real-world data into account. Below, we therefore highlight two relevant practical aspects that the formal definition of our algorithm does not capture.

Out-of-Order Processing.

Tuples are assigned a timestamp at the source, which is projected to a period identifier z. Even if the input stream is in order, parallelizing the π operator could lead to out-of-order arrival of some tuples at the c node, i.e. the latter could receive a notification about the start of period z + 1, cluster all significant tiles as they were recorded up to the seeming end of z, but receive further tuples pertaining to z from other α-nodes afterwards. One solution is to simply ignore these values, as their number should be minuscule. Commonly used periods, e.g. one day, are quite large, and the expected inconsistencies are confined to their beginning and end. Thus, it may make sense to set the start of a period to a time where very little data is generated, e.g. 3:00 a.m. for a 24-hour period when processing data of commercial vehicles. Alternatively, clustering could be triggered with a delay of one full period, where the chosen length of the period exceeds the expected delay.
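The delayed-trigger alternative can be sketched as follows; this is a design of ours for illustration, not the paper's published implementation:

```python
class DelayedTrigger:
    """Cluster a period only after a strictly later period is announced,
    giving straggler tuples a full period to arrive."""
    def __init__(self, cluster_fn):
        self.cluster_fn = cluster_fn
        self.latest = None            # most recent period announced so far

    def announce(self, z: int) -> None:
        if self.latest is not None and z > self.latest:
            self.cluster_fn(self.latest)   # safe: this period is now at
                                           # least one full period old
        self.latest = z if self.latest is None else max(self.latest, z)
```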

Interpolation.

The sliding window also has to advance when there are missing period identifiers in the data. It is not sufficient to simply remove the oldest key-value pair. Instead, missing periods need to be interpolated: if a gap greater than one period between the current and the last encountered period is detected, the algorithm advances the sliding window as many times as needed, one period at a time. This is omitted from Alg. 3 for the sake of brevity but is part of our published implementation.
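In code, advancing over a gap amounts to a loop like the following; advance_one_period is a hypothetical helper that prunes the then-oldest window entry:

```python
def advance_to(z_current: int, z_last: int, advance_one_period) -> None:
    # Interpolate missing periods: advance once per skipped period,
    # so the oldest window entry is pruned step by step.
    for z in range(z_last + 1, z_current + 1):
        advance_one_period(z)
```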

3.4 Retaining data points with S-RASTER

There are use cases where it is desirable to identify clusters not only based on their significant tiles but also based on the data points that were projected to those tiles (cf. Figs. 1c and 1d). In the following, we refer to the variant of S-RASTER that retains relevant input data as S-RASTER′. The required changes are minor. With the goal of keeping the overall design of the algorithm unchanged, first the π-nodes have to be modified to produce a stream of tuples (x_π, y_π, x, y, z), i.e. each tuple retains the original input coordinates. In the α-nodes, the hashmaps totals and window have to be changed to retain sets of unscaled coordinates per projected pair (x_π, y_π). Counts are given by the size of those sets. This assumes that determining the size of a set is an O(1) operation in the implementation language, for instance because the underlying object maintains this value as a variable. In case a tile becomes significant, each α-node sends not just the tile but also a bounded stream that includes all coordinate pairs that were projected to it to c. After a tile has become significant, every additional point that maps to it also has to be forwarded to c. Lastly, in the c node, the set of significant tiles σ has to be turned into a hashmap that maps projected tiles to their corresponding points, and the output stream has to be modified to return, for each point that is part of a cluster, the tuple (x, y, id_cluster, z).
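A sketch of the changed α-node state follows, assuming Python sets (whose len() is O(1)); the tuple shapes and names are ours:

```python
from collections import defaultdict

window = defaultdict(lambda: defaultdict(set))  # z -> tile -> set of (x, y)
totals = defaultdict(set)                       # tile -> set of (x, y)

def add_point(x, y, x_pi, y_pi, z, tau, send):
    tile = (x_pi, y_pi)
    window[z][tile].add((x, y))
    before = len(totals[tile])
    totals[tile].add((x, y))
    if len(totals[tile]) == before:
        return                                  # duplicate point, nothing new
    if len(totals[tile]) == tau:
        # Tile just became significant: ship all retained points.
        send((1, tile, frozenset(totals[tile])))
    elif len(totals[tile]) > tau:
        # Tile already significant: forward only the additional point.
        send((1, tile, frozenset({(x, y)})))
```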

4 Evaluation

As we have shown previously, RASTER is very fast and generates results more quickly than competing algorithms. The caveat is that the resulting clusters may not include elements at the periphery that other methods include. Yet, the big benefit of our algorithm is its very high throughput: it can process an arbitrary amount of data in linear time and constant memory. This makes it possible to use a standard workstation for workloads that, with other algorithms, would necessitate a much more expensive and time-consuming cloud computing setup. The same holds for S-RASTER, as it is an adaptation of RASTER that does not change its fundamental properties. Consequently, S-RASTER, just like RASTER, requires only constant memory, needs only a single pass, i.e. every element of the input is processed only once, and performs clustering in linear time. We have made a prototype implementation of S-RASTER publicly available. However, a thorough comparative evaluation is currently missing. That being said, based on the specification we provided, there is no reason to assume that the performance benefits RASTER shows on data batches would not also be evident with S-RASTER on evolving data streams. Our reasoning is outlined below.

Runtime.

RASTER is a single-pass linear-time algorithm. The same is true for S-RASTER: data is continually processed as a stream in which every input is processed only a single time, implying linear time. Clustering does not happen continually but periodically. Furthermore, clustering significant tiles is a very fast procedure; as the cost of this operation is quite small, it can be viewed as a constant. On a related note, the performance impact of periodic clustering in the c node can be mitigated by running this node on a separate CPU core, which is straightforward in the context of stream processing.

Memory.

RASTER needs constant memory to keep track of all counts per tile of the entire (finite) grid. In contrast, S-RASTER uses a sliding window of a fixed length n. In the worst case, each tile that RASTER stores once is stored n times in S-RASTER, i.e. once per period in the sliding window, which happens if, for each discrete period z, at least one point is projected to every tile. As n is a fixed constant, S-RASTER maintains the key property of RASTER of using constant memory.

5 Related Work

There are several prominent streaming algorithms for density-based clustering. DUC-STREAM performs clustering based on dense unit detection [8]. Its biggest drawback, compared to S-RASTER, is that it has not been designed for handling evolving data streams. D-Stream I is a combined online/offline algorithm [5]. While it also uses a grid, its computations are more expensive than those S-RASTER performs, and its runtime and memory requirements are suboptimal. DD-Stream likewise performs density-based clustering in grids and likewise uses computationally expensive methods for clustering [9]. Neither D-Stream I nor DD-Stream is suited for handling big data. D-Stream II can handle evolving data and clusters more efficiently than D-Stream I [12], but it is not clear whether it is suitable for big data. What furthermore distinguishes S-RASTER is that it can be effectively parallelized, which is not the case for some other standard algorithms in this domain.

None of the aforementioned clustering algorithms are quite comparable to S-RASTER, however, as they retain their input. Also, their clustering methods are generally more computationally costly. Thus, S-RASTER requires less time and memory. S-RASTER′ is closer to those algorithms, as it retains relevant input data. Since it does not retain all input, S-RASTER′ is only inefficient with regard to memory use in highly artificial scenarios, namely when at most one point is projected to any square in the grid, which implies that the algorithm parameters were poorly chosen, as the basic assumption is that many points are surjected to each significant tile. Ignoring such pathological cases, S-RASTER′ is very memory efficient. In addition, it benefits from a more efficient approach to clustering. For the purpose of identifying density-based clusters in evolving data streams, S-RASTER and S-RASTER′ are arguably more memory efficient than competing algorithms and also less computationally intensive. Yet, these advantages come at the cost of reduced precision.

6 Future Work

We may release an implementation of S-RASTER for use in Massive Online Analysis (MOA) [3]. In addition, we may release a complete stand-alone implementation of this algorithm that can be fully integrated into a standard stream processing engine such as Apache Flink [4] or Apache Spark [15, 16]. We are keen to perform a comparative evaluation of S-RASTER against other algorithms for clustering evolving data streams, preferably in a standard framework such as the aforementioned MOA. This work should also include an evaluation of the parallel speed-up on many-core CPUs.

Acknowledgments

This work was carried out in the Fraunhofer Cluster of Excellence Cognitive Internet Technologies.

References

  • [1] Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. pp. 94–105. SIGMOD ’98, ACM, New York, NY, USA (1998)

  • [2] Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data. Data Mining and Knowledge Discovery 11(1), 5–33 (Jul 2005)
  • [3] Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: Moa: Massive online analysis. Journal of Machine Learning Research 11(May), 1601–1604 (2010)
  • [4] Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., Tzoumas, K.: Apache flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 36(4) (2015)
  • [5] Chen, Y., Tu, L.: Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 133–142. ACM (2007)
  • [6] van Diggelen, F., Enge, P.: The world’s first gps mooc and worldwide laboratory using smartphones. In: Proceedings of the 28th International Technical Meeting of The Satellite Division of the Institute of Navigation (ION GNSS+ 2015). pp. 361–369. ION (2015)
  • [7] Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: SIGKDD Conference on Knowledge Discovery and Data Mining. vol. 96, pp. 226–231 (1996)
  • [8] Gao, J., Li, J., Zhang, Z., Tan, P.N.: An incremental data stream clustering algorithm based on dense units detection. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. pp. 420–425. Springer (2005)
  • [9] Jia, C., Tan, C., Yong, A.: A grid and density-based clustering algorithm for processing data stream. In: 2008 Second International Conference on Genetic and Evolutionary Computing. pp. 517–521. IEEE (2008)

  • [10] MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. vol. 1, pp. 281–297. Oakland, CA, USA (1967)

  • [11] Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Transactions on Neural Networks 16(3), 645–678 (May 2005). https://doi.org/10.1109/TNN.2005.845141
  • [12] Tu, L., Chen, Y.: Stream data clustering based on grid density and attraction. ACM Transactions on Knowledge Discovery from Data (TKDD) 3(3),  12 (2009)
  • [13] Ulm, G., Gustavsson, E., Jirstrand, M.: Contraction Clustering (RASTER). In: Nicosia, G., Pardalos, P., Giuffrida, G., Umeton, R. (eds.) Machine Learning, Optimization, and Big Data. pp. 63–75. Springer International Publishing, Cham, Switzerland (2018)
  • [14] Ulm, G., Smith, S., Nilsson, A., Gustavsson, E., Jirstrand, M.: Contraction clustering (raster): A very fast big data algorithm for sequential and parallel density-based clustering in linear time, constant memory, and a single pass (2019)
  • [15] Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: Cluster computing with working sets. HotCloud 10(10-10),  95 (2010)
  • [16] Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M.J., et al.: Apache spark: a unified engine for big data processing. Communications of the ACM 59(11), 56–65 (2016)