Local Pair and Bundle Discovery over Co-Evolving Time Series

04/19/2021
by   Georgios Chatzigeorgakidis, et al.

Time series exploration and mining have many applications across several industrial and scientific domains. In this paper, we consider the problem of detecting locally similar pairs and groups, called bundles, over co-evolving time series. These are pairs or groups of subsequences whose values do not differ by more than ϵ for at least δ consecutive timestamps, thus indicating common local patterns and trends. We first present a baseline algorithm that performs a sweep line scan across all timestamps to identify matches. Then, we propose a filter-verification technique that only examines candidate matches at judiciously chosen checkpoints across time. Specifically, we introduce two block scanning algorithms for discovering local pairs and bundles respectively, which leverage the potential of checkpoints to aggressively prune the search space. We experimentally evaluate our methods against real-world and synthetic datasets, demonstrating a speed-up in execution time by an order of magnitude over the baseline. This paper has been published in the 16th International Symposium on Spatial and Temporal Databases (SSTD19).


1 Introduction

Time series are important in many applications both in industry, e.g. in energy and finance, as well as in science, e.g. in astronomy and biology [13]. Therefore, efficient management and mining of time series is a task of critical importance, but also highly challenging due to the large volume and complex nature of this data.

Co-evolving time series are time series that are time-aligned, i.e. they contain observation values at the same timestamps all along their duration. In this work, we focus on discovering all pairs or groups (called bundles) of locally similar co-evolving time series and extracting the subsequences where this local similarity occurs. We consider two time series as locally similar if the pairwise distance of their values per timestamp is at most $\epsilon$ for a time interval that lasts at least $\delta$ consecutive timestamps.

Several research efforts have focused on similarity search over time series to detect insightful patterns within a single or across a set of time series [18, 23, 12, 14]. However, to the best of our knowledge, the problem of discovering pairs or groups of similar time-aligned subsequences within a set of co-evolving time series has been overlooked.

Discovering such pairs and bundles is useful in various applications. For instance, public utility companies employ smart meters to collect time series measuring consumption per household (e.g., for water or electricity). Identifying such bundles of time series (i.e., a number of similar subsequences over certain time intervals) can reveal similar patterns of consumption among users, allowing for more personalized billing schemes. In finance, examining time series of stock prices can identify pairs or bundles of stocks trending similarly at competitive prices over some trading period (hours or days), hence offering precious insight for possible future investments.

Figure 1: Example of a pair and a bundle of locally similar time series.

Figure 1 illustrates an example comprising four time series depicted with different colors. We observe that from timestamp 1 to 5 the values of two of the series are very close to each other, thus forming a locally similar pair. Similarly, from timestamp 8 to 12, the values of three of the series are close to each other, forming a bundle with three members. Note that values in each qualifying subsequence may fluctuate along a bundle as long as they remain close to the respective values per timestamp of the other members in that bundle.

A real-world example is depicted in Figure 2. These two time series represent per-hour average water consumption during a day of the week for two different households. We can observe that their respective values per timestamp (at granularity of hours, in this example) are very close to each other during a certain time period (hours 2-11), but are farther apart in the rest. Hence, an algorithm that measures the global similarity between two time series might not consider this pair as similar; however, the subsequences inside the gray strip are clearly pairwise similar, and might indicate an interesting pattern. Identifying such local similarities within a sufficiently long time interval is our focus in this paper.

Figure 2: Example of a pair of locally (but not globally) similar time series.

Furthermore, Figure 3 depicts several bundles of locally similar time series detected by our algorithms in a real-world dataset containing smart water meter measurements. The detected bundles represent different per-hour average water consumption patterns during a week. There is a wider pattern detected among 6 households during the first 30 hours of the week indicating reduced consumption (probably no permanent residence). The orange and yellow patterns indicate different morning routines during the third and fourth day of the week. The green and purple patterns represent a reduction in consumption during the late hours of the fourth and sixth day, respectively, with some intermediate consumption taking place during the night. Finally, the shorter red and light blue bundles suggest different evening patterns for two other days (respectively, decreasing and increasing consumption).

Figure 3: Discovered bundles of locally similar time series in a water consumption dataset.

Discovering all possible pairs and bundles of locally similar time series, along with the corresponding subsequences, within large sets is a computationally expensive process. To find matches, a filter-verification technique can be applied. At each timestamp, the filtering step can discover candidate pairs or groups having values close to each other; then, the verification step is invoked to determine whether each such candidate satisfies the required conditions, essentially whether this match occurs throughout a sufficiently large time interval. However, both the filtering and the verification steps are expensive. The computational cost becomes especially high for the case of bundle discovery, as it has to examine all possible subsets of locally similar time series that could form a bundle. Hence, such an exhaustive search is prohibitive when the number and/or the length of the time series is large.

In this paper, we employ a value discretization approach that divides the value axis into ranges equal to the value difference threshold $\epsilon$, in order to reduce the number of candidate pairs or bundles that need to be checked per timestamp. Leveraging this, we first propose two sweep line scan algorithms, for pair and bundle discovery respectively, which operate according to the aforementioned filter-verification strategy. However, this process still incurs an excessive number of comparisons, as it needs to scan all values at every timestamp. To overcome this, we introduce a more aggressive filtering that only checks at selected checkpoints across time, while ensuring that no false negatives ever occur. This approach yields significant savings in computation cost, as we need to examine candidate matches only at those checkpoints instead of at all timestamps. To further reduce the number of examined candidates, we propose a strategy that judiciously places these checkpoints across the time axis. We then exploit these optimizations by introducing two more efficient algorithms that significantly reduce the execution cost for both pair and bundle discovery.

The bundle discovery problem we address in this paper resembles the problem of flock discovery in moving objects, where the goal is to identify sufficiently large groups of objects that move close to each other over a sufficiently long period of time [8, 1, 21, 20]. In fact, the baseline algorithm we describe can be viewed as an adaptation of the algorithm presented in [21]. However, to the best of our knowledge, ours is the first work to address the problems of locally similar pair and bundle discovery over co-evolving time series. Specifically, our main contributions can be summarized as follows:

  • We introduce the problems of local pair and bundle discovery over co-evolving time series.

  • We suggest an aggressive checkpoint-based pruning method that drastically reduces the candidate pairs and bundles that need to be verified, significantly improving performance.

  • We conduct an extensive experimental evaluation using both real-world and synthetic time series, showing that our algorithms outperform the respective sweep line baselines.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the problems. Sections 4 and 5 introduce our algorithms for pair and bundle discovery, respectively. Section 6 reports our experimental results, and finally Section 7 concludes the paper.

2 Related Work

Time series similarity. Similarity search over time series has attracted a lot of research interest [6]. One well-studied family of approaches includes wavelet-based methods [4], which rely on the Discrete Wavelet Transform [7] to reduce the dimensionality of time series and generate an index using the coefficients of the transformed sequences. The Symbolic Aggregate Approximation (SAX) representation [11] has led to the design of a series of indices, including iSAX [19], iSAX 2.0 [2], iSAX2+ [3], ADS+ [24], Coconut [10], DPiSAX [22], and ParIS [17]. However, these indices support similarity search over complete time series, i.e. whole-matching. Recently, the ULISSE index was proposed [12], which is the first index that can answer similarity search queries of variable length.

Moreover, many approaches have been proposed for subsequence matching. In this problem, a query subsequence is provided and the goal is to identify matches of this subsequence across one or more time series, typically of large length. The UCR suite [18] offers a framework comprising four different optimizations regarding subsequence similarity search. In computing full-similarity-joins over large collections of time series, i.e., to detect for each possible subsequence its nearest neighbor, the matrix profile [23] keeps track of Euclidean distances among each pair within a similarity join set (i.e., a set containing pairs of each subsequence with its nearest neighbor).

The problem we address in this paper differs from the above settings. Instead of identifying matches of a query subsequence against one, or more time series, we are interested in discovering locally similar pairs and bundles of time-aligned subsequences within a given collection of time series.

Time series clustering. Our work also relates to clustering of time series, where methods perform either partitioning or density-based clustering. In the former class, algorithms typically partition the time series into $k$ clusters. Similarly to the iterative refinement employed in $k$-means, the $k$-Shape partitioning algorithm [15, 16] aims to preserve the shapes of time series assigned to each cluster by considering the shape-based distance, a normalized version of the cross-correlation measure between time series. In contrast, density-based clustering methods are able to identify clusters of time series with arbitrary shapes. YADING [5] is a highly efficient and accurate such algorithm, which consists of three steps: it first samples the input time series, also employing PAA (Piecewise Aggregate Approximation) to reduce the dimensionality, then applies multi-density clustering over the samples, and finally assigns the rest of the input to the identified clusters. However, clustering methods consider time series in their entirety and not matching subsequences as we consider in this work.

Discovery of movement patterns in trajectories. Our work also relates to approaches for discovering clusters of moving objects, in particular a type of movement pattern that is referred to as flocks [8]. A flock is a group of at least a given number of objects moving together within a circular disk of a given diameter for at least a given number of consecutive timestamps. Finding an exact flock is NP-hard, hence this work suggests an approximate solution to find the maximal flock from a set of trajectories using computational geometry concepts. In [1], another approximate solution for detecting all flocks is based on a skip-quadtree that indexes sub-trajectories. Flock discovery over streaming positions from moving objects was addressed in [21]. This exact solution discovers flock disks that cover a set of points at each timestamp. Their flock discovery algorithm finds candidate flocks per timestamp and joins them with the candidate ones from the previous timestamps, reporting a flock as a result when it meets the time constraint. An improvement over this technique was presented in [20], using a plane sweeping technique to accelerate detection of object candidates per flock at each timestamp, while an inverted index speeds up comparisons between candidate disks across time. In our setting, detection of bundles is similar to flocks, thus for our baseline method we adapt the algorithm from [21].

3 Problem Definition

A time series is a time-ordered sequence $T = \langle T_1, \dots, T_n \rangle$, where $T_i$ is the value at the $i$-th timestamp and $n$ is the length of the series (i.e., the number of timestamps). We consider a set $\mathcal{T}$ of co-evolving time series, so all time series are time-aligned and each series has a value at each of the $n$ timestamps. Given a set of such co-evolving time series, our goal is to find pairs of time series that have similar values locally over some time intervals of significant duration. More specifically:

Definition 1 (Locally Similar Time Series)

Two co-evolving time series $T$ and $T'$ are locally similar if there exists a time interval $I$ spanning at least $\delta$ consecutive timestamps such that at every timestamp $i \in I$ their corresponding values do not differ by more than a given threshold $\epsilon$, i.e., $|T_i - T'_i| \le \epsilon$ for all $i \in I$.

Note that threshold $\epsilon$ expresses the maximum tolerable deviation per timestamp between two time series, so it actually concerns the absolute difference of their corresponding values. We wish to find all such pairs of time series, so the problem is actually a self-join over the dataset, specifying as join criteria the distance threshold $\epsilon$ and the minimum time duration $\delta$ of qualifying pairs. More formally:

Problem 1 (Pair Discovery over Time Series)

Given a set $\mathcal{T}$ of co-evolving time series of equal duration $n$, a distance threshold $\epsilon$, and a time duration threshold of $\delta$ timestamps, retrieve all pairs of locally similar time series along with the corresponding time intervals.

For example, in Figure 4, the detected pairs for a specified threshold $\epsilon$ and duration $\delta$ would be the locally similar time series within the grey ribbons. Since two time series might be locally similar in more than one interval, their matching subsequences are considered as different pairs, one for each interval. For instance, in Figure 4 the green and red time series yield two matching pairs in different time intervals.

Figure 4: Pair discovery over a set of time series.
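Definition 1 and Problem 1 can be made concrete for a single pair of series. The following is a minimal sketch, not the paper's algorithm: it scans two time-aligned series once and reports every maximal interval in which their values stay within `eps` of each other for at least `delta` consecutive timestamps. The function name, the 0-based inclusive indices, and the sample data are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def locally_similar_intervals(a: Sequence[float], b: Sequence[float],
                              eps: float, delta: int) -> List[Tuple[int, int]]:
    """Return maximal inclusive (start, end) intervals of length >= delta
    in which |a[i] - b[i]| <= eps holds at every timestamp i."""
    if len(a) != len(b):
        raise ValueError("co-evolving series must be time-aligned")
    intervals = []
    start = None  # start of the current run of close values, if any
    for i, (x, y) in enumerate(zip(a, b)):
        if abs(x - y) <= eps:
            if start is None:
                start = i
        else:
            # The run (if any) ended at i - 1; keep it only if long enough.
            if start is not None and i - start >= delta:
                intervals.append((start, i - 1))
            start = None
    # A run may still be open at the end of the series.
    if start is not None and len(a) - start >= delta:
        intervals.append((start, len(a) - 1))
    return intervals


a = [1.0, 1.1, 1.2, 5.0, 2.0, 2.1, 2.2, 2.3]
b = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]
print(locally_similar_intervals(a, b, eps=0.5, delta=3))
```

With these sample series the pair matches in two separate intervals, which mirrors the remark above that one pair of time series can yield several results.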

The above problem can be extended to the detection of groups, also called bundles, of co-evolving time series. Instead of pairs, each such bundle of time series contains at least a pre-defined number of members, which are pairwise locally similar to each other over a time interval of sufficient duration. This problem can be formulated as follows:

Problem 2 (Local Bundle Discovery over Time Series)

Given a set $\mathcal{T}$ of co-evolving time series of equal length $n$, a minimum bundle size $\mu$, a maximum value difference $\epsilon$, and a minimum time duration of $\delta$ timestamps, retrieve all groups $G$ of time series such that:

  • Each group $G$ contains at least $\mu$ time series.

  • Within each group $G$, all pairs of time series are locally similar with respect to $\epsilon$ and $\delta$.

  • Each group $G$ is maximal, i.e., there is no other group $G' \supset G$ that also forms a bundle for the same time interval.

An illustration of the above problem is shown in Figure 5. Each grey band covers the subsequences of at least the minimum number $\mu$ of time series that constitute a bundle. These subsequences are pairwise locally similar for a specified distance $\epsilon$ and duration $\delta$.

Figure 5: Bundle discovery over a set of time series.
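The first two conditions of Problem 2 are easy to test for a given group and interval. The sketch below relies on a simple observation that holds for one-dimensional values: all members are pairwise within `eps` at a timestamp exactly when the largest and smallest member values differ by at most `eps`. The function name and its arguments (including `mu` for the minimum bundle size) are hypothetical, and the maximality condition is deliberately not checked.

```python
from typing import Sequence


def is_bundle(series: Sequence[Sequence[float]], members: Sequence[int],
              start: int, end: int, eps: float, delta: int, mu: int) -> bool:
    """Check size, duration and pairwise closeness of a candidate bundle
    over the inclusive interval [start, end]. Maximality is not checked."""
    if len(members) < mu or end - start + 1 < delta:
        return False
    for t in range(start, end + 1):
        values = [series[m][t] for m in members]
        # Pairwise closeness in 1-D is equivalent to a bounded value spread.
        if max(values) - min(values) > eps:
            return False
    return True


series = [
    [1.0, 1.1, 1.2, 1.3],
    [1.2, 1.2, 1.3, 1.2],
    [1.4, 1.3, 1.1, 1.5],
    [9.0, 9.0, 9.0, 9.0],
]
print(is_bundle(series, [0, 1, 2], 0, 3, eps=0.5, delta=3, mu=3))
```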

4 Pair Discovery

In this Section, we propose two solutions for the pair discovery problem. The first (Section 4.2) is a baseline algorithm that uses a sweep line to scan the co-evolving time series throughout their duration, while validating and keeping all the pairs that satisfy the given constraints employing a value discretization scheme per timestamp (Section 4.1). The second method (Section 4.4) employs an optimization that reduces the number of pairs to consider by judiciously probing candidates at selected timestamp values (referred to as checkpoints, Section 4.3). This significantly prunes the search space without missing any qualifying results.

4.1 Value Discretization

To reduce the candidate pairs that need to be checked at each timestamp $t$, we discretize the values of all time series at $t$ into bins, i.e., several consecutive value ranges, each one of size $\epsilon$. Time series with values within the same bin at timestamp $t$ form candidate pairs, but we also need to check adjacent bins for additional candidate pairs whose values differ by at most $\epsilon$. Time series having values in non-adjacent bins are certainly farther apart than $\epsilon$ at that specific timestamp $t$, so we can avoid these checks.

To detect all candidate pairs and avoid cross-checking every two adjacent bins, we consider a value range of size $\epsilon$, whose upper endpoint coincides with each value under consideration at time $t$. Then, all values of time series contained within this range form candidate pairs (see Figure 6). Obviously, values contained in the same bin can form candidate pairs. Then, we can cross-check each value in a bin with the values in the adjacent bin for additional candidates with value difference at most $\epsilon$, as indicated by the red (right) range.

Figure 6: Discretization of time series values at timestamp $t$.

Thus, at each timestamp $t$, the process of finding all the pairs consists of the following two steps: (1) Filtering - Search among time series values in adjacent bins to detect candidate pairs using the aforementioned search method. (2) Verification - For each candidate pair of time series, check the similarity of their respective values at successive timestamps as long as this pair still satisfies the matching conditions (or until the end of the time series data is reached). This step resembles a kind of “horizontal expansion” along the time axis in an attempt to eagerly verify and report pairs.
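The filtering step at a single timestamp can be sketched as follows. This is an illustrative reading of the discretization idea rather than the paper's exact procedure: values are hashed into bins of width `eps`, and each value is compared only against the values in its own bin and in the next higher bin, so every pair of adjacent bins is examined once. The function name and the convention that `values[k]` holds the value of series `k` at the timestamp are assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple


def candidate_pairs_at(values: Sequence[float], eps: float) -> Set[Tuple[int, int]]:
    """Return all index pairs (i, j), i < j, with |values[i] - values[j]| <= eps,
    comparing only values that fall in the same or in adjacent bins."""
    bins: Dict[int, List[int]] = defaultdict(list)
    for k, v in enumerate(values):
        bins[math.floor(v / eps)].append(k)

    pairs: Set[Tuple[int, int]] = set()
    for b, members in bins.items():
        # Same bin plus the next higher bin; the lower neighbour is handled
        # when that bin itself is visited.
        neighbours = members + bins.get(b + 1, [])
        for pos, i in enumerate(members):
            for j in neighbours[pos + 1:]:
                if abs(values[i] - values[j]) <= eps:
                    pairs.add((min(i, j), max(i, j)))
    return pairs


print(sorted(candidate_pairs_at([0.2, 0.9, 1.5, 3.7, 2.8, 5.0], eps=1.0)))
```

Values in non-adjacent bins are never compared, which is where the saving over an all-pairs check comes from.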

4.2 Pair Discovery Using Sweep Line

A baseline method for performing pair discovery over a set of time series is to check all the candidate pairs formed at each timestamp, and verify whether the minimum duration constraint is satisfied. Algorithm 1 describes this procedure. Pair discovery (Line 5) considers a time duration (as a set of consecutive timestamps) to check for results. Initially, this duration covers all $n$ timestamps, where $n$ is the total duration of all time series data. For each timestamp $t$, we obtain only the subset of time series whose values are contained in two adjacent bins (Line 7). Based on these values at $t$, we obtain all candidate pairs with respect to the threshold $\epsilon$ (Line 9). For each such pair, if it is not already part of the resulting pairs at that specific timestamp $t$, we verify it by first expanding it horizontally (Lines 10-12) and checking whether this pair meets the duration constraint $\delta$ (Line 13) along subsequent timestamps. If so, we add it to the reported results (Line 14).

Concerning the horizontal expansion, we iterate over the subsequent timestamps (after the current one – Line 17) and we stop when the threshold $\epsilon$ is violated (Lines 18–19). Afterwards, we mark the start of this pair with the current timestamp $t$, whereas its end is marked by the timestamp at which the threshold is crossed, and we return this pair (Lines 20-22).

Input: Set $\mathcal{T}$ of co-evolving time series of length $n$
Parameters: Threshold $\epsilon$, min duration $\delta$
Output: List with all locally similar pairs of time series
Algorithm 1 Sweep line scan pair discovery
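The sweep line procedure described above can be sketched in Python as follows. This is a hedged reconstruction from the prose, not the original listing: the function names are hypothetical, timestamps are 0-based, and the per-timestamp filter uses a sort-and-scan over the values as a stand-in for the bin-based search of Section 4.1. A dictionary `covered_until` records how far each pair has already been expanded, so a pair is not verified again at timestamps it is known to cover.

```python
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[int, int]


def candidates_at(series: Sequence[Sequence[float]], t: int, eps: float) -> List[Pair]:
    """Filtering: pairs whose values at timestamp t differ by at most eps."""
    order = sorted(range(len(series)), key=lambda k: series[k][t])
    out = []
    for a in range(len(order)):
        for b in range(a + 1, len(order)):
            i, j = order[a], order[b]
            if series[j][t] - series[i][t] > eps:
                break  # values are sorted, so later ones are even farther
            out.append((min(i, j), max(i, j)))
    return out


def expand_forward(series, pair: Pair, t: int, eps: float) -> int:
    """Verification: last timestamp >= t up to which the pair stays within eps."""
    i, j = pair
    end = t
    while end + 1 < len(series[i]) and abs(series[i][end + 1] - series[j][end + 1]) <= eps:
        end += 1
    return end


def sweep_line_pairs(series, eps: float, delta: int) -> List[Tuple[int, int, int, int]]:
    """Return tuples (i, j, start, end) for all locally similar pairs."""
    results = []
    covered_until: Dict[Pair, int] = {}
    for t in range(len(series[0])):
        for pair in candidates_at(series, t, eps):
            if covered_until.get(pair, -1) >= t:
                continue  # already expanded through this timestamp
            end = expand_forward(series, pair, t, eps)
            covered_until[pair] = end
            if end - t + 1 >= delta:
                results.append((pair[0], pair[1], t, end))
    return results
```

Because every expansion runs until the threshold is violated, a pair is re-examined only at the start of a new run of close values, so each reported interval is maximal.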

However, searching over all timestamps in such an exhaustive manner can be expensive, particularly for long time series. Next, we present an optimization that identifies candidate pairs at selected timestamps only, so that only those pairs require verification.

4.3 Optimized Filtering at Checkpoints

To prune the search space, we consider checkpoints along the time axis, so that searching for candidate pairs will be performed at these specific timestamps only. If the temporal span between two successive checkpoints does not exceed the minimal duration threshold $\delta$, we can ensure no false negatives, since any qualifying pair starting at an intermediate timestamp between two checkpoints will surely be detected at least on the second one. Figure 7 shows an example of a set of time series with checkpoints placed along the time axis every $\delta$ timestamps.

Figure 7: Checkpoints placed every $\delta$ timestamps.

Assume a set of checkpoints placed at time interval $\delta$ from each other, as depicted in Figure 8(a). Consider a checkpoint at timestamp $c$ and a qualifying pair of duration $\delta$ starting at timestamp $c - \delta + 1$. This pair cannot have a smaller duration, otherwise it would not meet constraint $\delta$. Consequently, the pair will be detectable on the checkpoint at $c$, as shown in the figure. Similarly, if a qualifying pair ends at timestamp $c + \delta - 1$ (Figure 8(b)), it will be detected at the checkpoint at $c$. Hence, all pairs around a checkpoint at $c$ can be detected as candidates when we check their values at $c$. Thus, we reach the following observation.

Lemma 1 (Checkpoint Covering Interval)

Let the interval between successive checkpoints not exceed $\delta$. Considering a checkpoint placed at timestamp $c$, all qualifying pairs starting at or after $c - \delta + 1$ and ending at or before $c + \delta - 1$ will satisfy all matching constraints at timestamp $c$.

(a) Pair with a starting point before $c$.
(b) Pair with an ending point after $c$.
Figure 8: A qualifying pair will be detected on a checkpoint.

This lemma entails that it suffices to check for candidate pairs only at checkpoints, i.e., every $\delta$ timestamps. We denote the set of checkpoints as $C$. Since we skip timestamps, and in order to avoid false negatives, we now have to verify pairs with a horizontal expansion (as in Section 4.2), but towards both directions, i.e., before and after a given checkpoint. Overall, at each checkpoint the optimized process performs: (1) Filtering - Search among the values of time series in adjacent bins to detect candidate pairs. (2) Verification - For each candidate pair, perform a two-way horizontal expansion across the time axis.
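The covering argument can be checked by brute force on small inputs. The sketch below assumes 0-based timestamps and a hypothetical `offset` for the first checkpoint; it places checkpoints at a fixed spacing and tests whether every window of `delta` consecutive timestamps contains at least one of them, which is the property that rules out false negatives.

```python
from typing import Iterable, List


def checkpoints(n: int, delta: int, offset: int = 0) -> List[int]:
    """Checkpoints every delta timestamps, the first one at `offset`."""
    if not 0 <= offset < delta:
        raise ValueError("offset must lie in [0, delta)")
    return list(range(offset, n, delta))


def every_window_hits(n: int, delta: int, cps: Iterable[int]) -> bool:
    """True if each window of delta consecutive timestamps in [0, n)
    contains at least one checkpoint."""
    cp = set(cps)
    return all(any(t in cp for t in range(s, s + delta))
               for s in range(n - delta + 1))


# Spacing equal to delta always covers; a wider spacing leaves gaps.
print(every_window_hits(20, 4, checkpoints(20, 4, offset=2)))
print(every_window_hits(20, 4, range(0, 20, 5)))
```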

Improving placement of checkpoints

Depending on the dataset, the default checkpoint placement might yield an increased number of candidate pairs that would incur too many (and perhaps unnecessary) verifications. Intuitively, if the time series values at a specific timestamp $t$ are placed in a more “scattered” manner over the bins, then fewer candidates would be generated. This is because the values of time series at $t$ would differ from each other by more than $\epsilon$ and thus can be pruned as described in Section 4.1. Figure 9 depicts such a case of sub-optimal placement of checkpoints, where the second checkpoint is placed at a rather dense area and, as a result, six candidate pairs are considered. We can remedy this issue by shifting all checkpoints together either to the left or to the right, while maintaining their temporal span of $\delta$. As shown in Figure 10, all three checkpoints are collectively shifted to the left, avoiding the dense area and reducing the total number of candidate pairs. An extra checkpoint can be inserted before the first or after the last one, so as to guarantee that there is no interval longer than $\delta$ without checkpoints.

Figure 9: Sub-optimal checkpoint placement.
Figure 10: Improved checkpoint placement.

Clearly, the placement of the set of checkpoints influences the number of candidate pairs. We wish to find the best such placement, i.e., the one that yields the fewest candidate pairs. The number of candidates depends on the cardinality of the bins (i.e., the number of values in each one, as shown in Figure 6) at any particular checkpoint $c$. Given that $N$ is the total number of time series in the dataset (and hence the number of values at each checkpoint), we can identify the most populated bin at checkpoint $c$ by calculating the maximal density $d_c = \max(B_c) / N$, where $B_c$ represents the set of bin cardinalities at checkpoint $c$. Therefore, for a given configuration $C$ of checkpoints, we can estimate an overall cost by taking the sum of such maximal densities over all checkpoints, i.e., $\sum_{c \in C} d_c$. The lower this total cost, the smaller the cardinality per bin at each checkpoint and, thus, the fewer the candidates that will be generated. Consequently, we seek the configuration with the minimum cost. To do so, we shift all checkpoints together to the right, one timestamp at a time, estimate the cost again, and repeat for every shift smaller than the checkpoint spacing. This procedure is illustrated in Figure 11, where the checkpoints symbolized with similar lines belong to the same set as we move them to the right in order to identify the best placement (indicated with the thickest vertical dashed lines).

Figure 11: Best checkpoint placement (shown as thick vertical lines).
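The search for a good placement can be sketched as a loop over all shifts. The code below is a minimal illustration under stated assumptions: the maximal density at a checkpoint is taken to be the size of its fullest bin divided by the number of series, the cost of a configuration is the sum of these densities, and checkpoints sit every `delta` timestamps starting at a 0-based `offset`. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def max_density(values: Sequence[float], eps: float) -> float:
    """Fraction of the values that fall into the most populated bin of width eps."""
    counts = Counter(math.floor(v / eps) for v in values)
    return max(counts.values()) / len(values)


def best_offset(series: Sequence[Sequence[float]], eps: float, delta: int) -> Tuple[int, float]:
    """Try every shift 0..delta-1 and return (offset, cost) with the lowest cost."""
    n = len(series[0])
    best, best_cost = 0, float("inf")
    for offset in range(delta):
        cost = sum(max_density([s[t] for s in series], eps)
                   for t in range(offset, n, delta))
        if cost < best_cost:
            best, best_cost = offset, cost
    return best, best_cost


series = [
    [0.1, 0.1, 0.2, 0.3],
    [0.5, 3.5, 0.6, 4.2],
    [0.9, 7.5, 0.7, 8.9],
]
print(best_offset(series, eps=1.0, delta=2))
```

In the sample data all values crowd into one bin at even timestamps and scatter at odd ones, so the shifted placement has the lower cost. When the series length is not a multiple of `delta`, different shifts contain different numbers of checkpoints, which this sketch does not normalize for.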
Input: Set $\mathcal{T}$ of co-evolving time series of length $n$
Parameters: Threshold $\epsilon$, min duration $\delta$
Output: A list containing all the locally similar time series
Algorithm 2 Checkpoint scan pair discovery

4.4 Pair Discovery Using Checkpoints

After identifying the best checkpoint placement, we can discover pairs of locally similar time series by applying the exhaustive algorithm presented in Section 4.2, but iterating over the defined checkpoints instead of all timestamps. To speed up the verification step, we introduce an optimization that reduces the number of checks. For each candidate pair found at a checkpoint (which may have started at a previous timestamp), we first expand its verification to the left. In case the current duration of the pair is still less than $\delta$, we jump to the timestamp at which the pair would reach duration $\delta$, in order to eagerly prune candidates that will not last at least $\delta$. There, we check directly whether the values of these two time series qualify, and we continue this check backwards in time. If at an intermediate timestamp the constraint is not met, we can stop the verification process and discard the candidate pair.

The procedure for pair discovery is listed in Algorithm 2. Initially, we calculate the best possible checkpoint set (Line 2). Then, we run the procedure described in Section 4.2 but instead of probing over all timestamps we iterate over the resulting checkpoint set (Line 3).

Regarding verification, we first move towards the leftmost endpoint (Line 6) and detect the starting timestamp of the pair (Line 9). Then, we iterate from the jump timestamp (the one at which the pair would reach the minimum duration $\delta$) back towards the initial timestamp (Line 11). If the pair does not qualify at some timestamp during this interval (Line 12), we set that timestamp as its final timestamp and return the pair (Line 14). Otherwise, we continue from the jump timestamp towards the rightmost timestamp until the pair ceases to qualify (Lines 15-19).
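The two-way verification with the eager jump can be sketched for a single candidate pair as follows. This is an interpretation of the prose rather than the original listing: the function name is hypothetical, timestamps are 0-based, the jump target is assumed to be `start + delta - 1`, and a pair that fails the eager check is simply discarded by returning `None`. The backward scan stops just after the checkpoint, because the stretch from the detected start up to the checkpoint was already verified by the left expansion.

```python
from typing import Optional, Sequence, Tuple


def verify_at_checkpoint(a: Sequence[float], b: Sequence[float], c: int,
                         eps: float, delta: int) -> Optional[Tuple[int, int]]:
    """Return the maximal inclusive (start, end) interval around checkpoint c
    in which a and b stay within eps, or None if it is shorter than delta."""
    n = len(a)

    def close(t: int) -> bool:
        return abs(a[t] - b[t]) <= eps

    if not close(c):
        return None
    # 1. Expand to the left to find where the pair starts.
    start = c
    while start > 0 and close(start - 1):
        start -= 1
    # 2. Eager pruning: jump to where the pair would reach duration delta
    #    and walk backwards to the checkpoint; any violation discards it.
    jump = start + delta - 1
    if jump >= n:
        return None
    for t in range(jump, c, -1):
        if not close(t):
            return None
    # 3. Expand to the right until the pair ceases to qualify.
    end = max(jump, c)
    while end + 1 < n and close(end + 1):
        end += 1
    return start, end


a = [1.0, 1.1, 1.2, 5.0, 2.0, 2.1, 2.2, 2.3]
b = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]
print(verify_at_checkpoint(a, b, c=5, eps=0.5, delta=4))
```

Starting the check at the jump timestamp means a pair that breaks shortly before reaching the minimum duration tends to be rejected after few comparisons.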

5 Bundle Discovery

We now consider the bundle discovery problem. We propose two algorithms: an exhaustive one using a sweep line (Section 5.1), and one using checkpoints (Section 5.2), following the same principles as in pair discovery.

5.1 Bundle Discovery Using Sweep Line

To exhaustively detect bundles of time series, we can follow a similar procedure employing a sweep line as in Section 4.2. However, this time we detect candidate bundles at each timestamp before verifying whether constraints concerning minimal duration and minimum membership are satisfied. Essentially, this can be thought of as an adaptation of the flock discovery approach in [21] to the 1-dimensional setting in the case of time series.

Algorithm 3 describes this exhaustive process. Similarly to pair discovery, at each timestamp $t$ (Line 5) we obtain only the values of time series contained in adjacent bins (Lines 6–7). Then, each such value is considered the origin of a search range of size $\epsilon$, which returns a candidate group at time $t$ (Line 9). Of course, such a candidate group may have been already included in the result bundles previously, during a horizontal expansion at a previous timestamp. In this case, its examination is skipped. Otherwise, if this group contains at least $\mu$ members, we proceed to verify it as a candidate bundle over subsequent timestamps via horizontal expansion (Lines 10-12). As we will see next, this expansion may return one or more candidate bundles; each one is checked against the duration constraint $\delta$ before adding it to the result bundles (Lines 13-16).

Regarding the verification step of a candidate bundle, we apply its horizontal expansion over all subsequent timestamps (Line 19). For each member of such a candidate bundle (Lines 21-28), we find the group of time series having values within range $\epsilon$ at time $t$. Many such new groups may be created, as each member may yield one group. Such a group may become a new candidate bundle if it satisfies the minimum membership constraint; if so, it is added to the resulting bundles with an updated duration (Lines 24–27). As we look at subsequent timestamps, the same bundle may be added to the results multiple times, but with increasing duration. In the end, we eliminate duplicates and only keep the one with the longest duration (Line 28). Expansion stops once no new candidate bundles can be found in the next timestamp (Lines 29–30).
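The per-timestamp detection of candidate groups can be sketched as a sliding window over the sorted values. This is an illustrative assumption about how the search range could be realized, not the original listing: each value in turn acts as the upper endpoint of a range of size `eps`, the series whose values fall inside that range form one group, and only groups with at least `mu` members that are not contained in a larger group are kept. The function name and arguments are hypothetical.

```python
from typing import List, Sequence, Set


def candidate_groups_at(values: Sequence[float], eps: float, mu: int) -> List[Set[int]]:
    """Maximal groups of series indices whose values at one timestamp span
    at most eps, restricted to groups with at least mu members."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    groups = set()
    lo = 0
    for hi in range(len(order)):
        # Shrink the window until it fits in a range of size eps whose
        # upper endpoint is the current value.
        while values[order[hi]] - values[order[lo]] > eps:
            lo += 1
        group = frozenset(order[lo:hi + 1])
        if len(group) >= mu:
            groups.add(group)
    # Keep only groups that are not strictly contained in another group.
    return [set(g) for g in groups if not any(g < h for h in groups)]


values = [1.0, 1.2, 1.4, 1.8, 2.1, 5.0]
print(sorted(sorted(g) for g in candidate_groups_at(values, eps=0.5, mu=2)))
```

A series may belong to several overlapping groups, which is why the expansion step described above can yield more than one candidate bundle.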

Algorithm 3: Sweep line scan bundle discovery
Input: Set of co-evolving time series of a given length
Parameters: Threshold ϵ, minimum duration δ, minimum number of members
Output: A list containing all the discovered bundles

Algorithm 4: Checkpoint scan bundle discovery
Input: Set of co-evolving time series of a given length
Parameters: Threshold ϵ, minimum duration δ, minimum number of members
Output: A list containing all the discovered bundles

5.2 Bundle Discovery Using Checkpoints

For bundle discovery using checkpoints, we apply a similar sweep line approach, but this time we only filter at checkpoints and then verify in both directions along the time axis. Algorithm 4 describes this procedure. After initializing the checkpoints (Line 2), we run the bundle discovery process as in Algorithm 3, but looking only at checkpoints (Line 3) instead of all timestamps. For the two-way horizontal expansion, we first examine all candidate bundles detected at the current checkpoint towards the origin of the time axis (Line 7). This is done because a qualifying bundle could have started earlier, before the current checkpoint. Afterwards, we apply the same eager pruning strategy as in Section 4.4: we verify each such candidate bundle by jumping forward in time and then continuing backwards to its currently known start (Lines 8–10). Among the candidate bundles returned from the forward verification (Line 11), we keep only those that satisfy the minimal duration constraint δ (Line 12). These are further verified forward in time, obtaining all subsequent qualifying bundles (Line 13).
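The filter-and-verify idea can be sketched in Python under several simplifying assumptions that are not taken from Algorithm 4 itself. Checkpoints are assumed to be placed every delta timestamps, so that any run of delta consecutive timestamps contains at least one checkpoint. Candidate groups are kept with fixed membership during verification (no splitting into subgroups), and the eager pruning jump is omitted. All function and parameter names (`checkpoints`, `together`, `candidate_groups`, `checkpoint_bundles`, `mu`) are hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple


def checkpoints(n_steps: int, delta: int) -> List[int]:
    """Assumed placement: one checkpoint every delta timestamps, so every
    window of delta consecutive timestamps contains at least one of them."""
    return list(range(delta - 1, n_steps, delta))


def together(values: List[List[float]], members: FrozenSet[int],
             t: int, eps: float) -> bool:
    """True if all members differ pairwise by at most eps at time t."""
    vals = [values[s][t] for s in members]
    return max(vals) - min(vals) <= eps


def candidate_groups(values: List[List[float]], t: int,
                     eps: float, mu: int) -> Set[FrozenSet[int]]:
    """Filter step: maximal groups with at least mu members at time t,
    using the assumed search range [v, v + eps] around each value v."""
    ids = range(len(values))
    groups = {
        frozenset(r for r in ids
                  if values[s][t] <= values[r][t] <= values[s][t] + eps)
        for s in ids
    }
    return {g for g in groups
            if len(g) >= mu and not any(g < h for h in groups)}


def checkpoint_bundles(values: List[List[float]], eps: float, delta: int,
                       mu: int) -> List[Tuple[Tuple[int, ...], int, int]]:
    """Return bundles as (member ids, start, end), with end inclusive."""
    n_steps = len(values[0])
    found = set()
    for c in checkpoints(n_steps, delta):
        for group in candidate_groups(values, c, eps, mu):
            # Verify backwards: the bundle may have started before c.
            start = c
            while start > 0 and together(values, group, start - 1, eps):
                start -= 1
            # Verify forwards from the checkpoint.
            end = c
            while end + 1 < n_steps and together(values, group, end + 1, eps):
                end += 1
            if end - start + 1 >= delta:
                # The set removes duplicates found from several checkpoints.
                found.add((tuple(sorted(group)), start, end))
    return sorted(found)
```

Because groups are examined only at checkpoints, the filtering cost in this sketch grows with the number of checkpoints rather than with the number of timestamps. A subgroup that is never maximal at any checkpoint is not reported here, which is a limitation of the simplification rather than of the method in the paper.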

6 Experimental Evaluation

6.1 Experimental Setup

We evaluate the performance of our methods on pair and bundle discovery both qualitatively and quantitatively. We compare our checkpoint (CP) scan approaches for each problem with the respective sweep line (SL) methods. We use a real-world and a synthetic dataset as listed in Table 1. Next, we describe their characteristics.

| Dataset | Size (number of time series) | Time series length (timestamps) |
|---|---|---|
| Water | 822 | 168 |
| Synthetic | 50,000 | 1,000 |

Table 1: Datasets used in the experiments.

DAIAD Water Consumption (Water). Courtesy of the DAIAD project (http://daiad.eu/), we acquired a time series dataset of hourly water consumption for 822 households in Alicante, Spain, from 1/1/2015 to 20/1/2017. In order to get a more representative dataset for our tests, we first calculated the weekly time series (168 timestamps, one value per hour from Monday to Sunday) per household by averaging corresponding hourly values over the entire period.
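The weekly averaging step can be sketched as follows. This is only an illustration of the aggregation described above, not the authors' preprocessing code: the input format (pairs of timestamp and value), the function name `weekly_profile`, and the use of NaN for hour-of-week slots without readings are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def weekly_profile(readings: Iterable[Tuple[datetime, float]]) -> List[float]:
    """Average hourly readings into 168 values, one per hour of the week.
    Index 0 is Monday 00:00 and index 167 is Sunday 23:00."""
    sums = [0.0] * 168
    counts = [0] * 168
    for ts, value in readings:
        slot = ts.weekday() * 24 + ts.hour  # weekday() is 0 for Monday
        sums[slot] += value
        counts[slot] += 1
    # Slots with no readings are marked as NaN (an assumed convention).
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


if __name__ == "__main__":
    monday = datetime(2015, 1, 5)  # a Monday
    two_weeks = [(monday + timedelta(hours=h), float(h)) for h in range(336)]
    print(len(weekly_profile(two_weeks)))
```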

Synthetic Dataset. We generated a synthetic dataset of 50,000 time series, each with a length of 1,000 timestamps. So, this dataset contains 50 million data points in total. The dataset was generated in a similar manner to the synthetic dataset used in [9].

All experiments were conducted on a Dell PowerEdge M910 with 4 Intel Xeon E7-4830 CPUs, each containing 8 cores clocked at 2.13GHz, 256 GB RAM and a total storage space of 900 GB.

6.2 Evaluation Results

We conducted two sets of experiments, using the water and synthetic datasets respectively. The water dataset was used for qualitative and quantitative assessment of our methods on pair and bundle discovery, while the synthetic dataset was used to evaluate their efficiency in terms of execution time.

6.2.1 Pair and Bundle Discovery over Real Data

We performed several experiments using the water dataset with various parameter values to detect pairs and bundles using both the SL and CP approaches. The water dataset was z-normalized in order to eliminate larger amplitude discrepancies among the time series and focus on structural similarity.
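For reference, a normalization of this kind rescales a series to zero mean and unit standard deviation. The short Python sketch below assumes that each time series is normalized separately and that the population standard deviation is used; neither detail is stated in the text, and the function name `z_normalize` is hypothetical.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def z_normalize(series: Sequence[float]) -> List[float]:
    """Rescale a series to zero mean and unit (population) standard deviation."""
    mean = fmean(series)
    std = pstdev(series)
    if std == 0:
        # A constant series has no structure to preserve; map it to zeros.
        return [0.0] * len(series)
    return [(x - mean) / std for x in series]
```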

To evaluate our methods against different parameters, we performed preliminary tests to extract ranges of values where the algorithms would return a reasonable number of results. Table 2 lists the range of values for the parameters used for bundle and pair discovery tests (recall that the minimum membership parameter is not applicable in pair discovery); default values are in bold. The minimum duration δ is expressed as a percentage of the duration of the time series, the threshold ϵ is expressed as a percentage of the value range (i.e., the difference between the largest and smallest values encountered across the dataset), and the minimum membership is expressed as a percentage of the number of time series in the dataset.

| Parameter | Values |
|---|---|
| Minimum duration δ (% of time series length, i.e., 168) | 4%, 5%, 6%, 7%, 8% |
| Threshold ϵ (% of value range, i.e., approx. 11.4) | 4%, 5%, 6%, 7%, 8% |
| Minimum members (% of dataset size, i.e., 822) | 0.5%, 0.75%, 1%, 1.25%, 1.5% |

Table 2: Parameters for tests over the water dataset
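Since all parameters are given as percentages, it can help to see what they amount to in absolute terms. The small Python helper below is an illustrative sketch: the function name and the dictionary keys are hypothetical, the chosen percentages are simply values taken from the listed ranges (not necessarily the defaults), and the results are left unrounded because the rounding rule used in the experiments is not stated.

```python
from typing import Dict


def absolute_parameters(delta_pct: float, eps_pct: float, members_pct: float,
                        length: int, value_range: float,
                        n_series: int) -> Dict[str, float]:
    """Convert percentage-based parameters into absolute (unrounded) values."""
    return {
        "delta": delta_pct / 100 * length,        # timestamps
        "eps": eps_pct / 100 * value_range,       # same units as the values
        "members": members_pct / 100 * n_series,  # number of time series
    }


if __name__ == "__main__":
    # Water dataset: 168 timestamps, value range approx. 11.4, 822 series.
    print(absolute_parameters(5, 5, 1, length=168, value_range=11.4, n_series=822))
```

For example, a minimum duration of 5% on the water dataset corresponds to roughly 8.4 of the 168 hourly timestamps, and a minimum membership of 1% to roughly 8 of the 822 households.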
Figure 12: Assessment against real data for varying δ: (a) bundle discovery execution time, (b) bundle discovery results, (c) pair discovery execution time, (d) pair discovery results.

Figure 12 depicts the results when varying the minimal duration δ. In bundle discovery, the CP algorithm outperforms SL by up to an order of magnitude in terms of execution time (Figure 12(a)). As δ gets larger, performance improves further, since fewer checkpoints are specified and thus fewer candidates need verification. In contrast, the SL approach performs similarly irrespective of δ, as time series must be checked at all timestamps. Figure 12(b) shows that the number of detected bundles drops as δ increases, which is expected, since fewer bundles can last longer. In this plot, the blue bars indicate the maximum duration among the detected bundles, while the orange bars indicate the largest detected bundle in terms of membership. The maximum bundle size is drastically reduced as the number of results diminishes, while the maximum duration among bundles remains the same as δ increases, because the longest bundle is the same in all these results.

Regarding pair discovery, since it is an overall faster process, the differences in terms of efficiency are smaller, but still apparent. In this case, the execution time (Figure 12(c)) drops more abruptly for both the SL and CP methods, since fewer subsequences qualify as pairs. The number of results (Figure 12(d)) is now naturally much larger, as far more pairs are expected to be verified if bundles exist. The same holds for the maximum duration among pairs, which tend to last longer than bundles. Since a pair is actually a bundle with two members, it is easier to find local similarity over longer intervals between two subsequences than among a larger number of them. The maximum duration, as in bundle discovery, remains the same as δ increases, since it corresponds to the same pair in the results.

Figure 13: Assessment against real data for varying ϵ: (a) bundle discovery execution time, (b) bundle discovery results, (c) pair discovery execution time, (d) pair discovery results.

Increasing the threshold ϵ for bundle discovery incurs slightly more execution cost for both the SL and CP approaches. This is due to the increased number of bundles that need to be verified. Nonetheless, the difference in cost between the two approaches remains at up to an order of magnitude, as shown in Figure 13(a). As expected, the number of results also increases (Figure 13(b)). So does the maximum bundle duration, which is also expected: more qualifying bundles mean a higher probability of finding longer ones. The maximum bundle size (i.e., membership) also increases, as more time series can form a bundle when a wider deviation between their respective values is allowed.

Regarding pair discovery, the results for varying ϵ are again similar to those for bundle discovery. For small values of ϵ, up to 7% of the value range, the CP algorithm returns results almost instantly. The SL approach is at least five times slower, with its performance deteriorating more rapidly as ϵ increases. Again, as in bundle discovery, the number of results grows with greater ϵ, as does the maximum duration among pairs, especially for ϵ equal to 8% of the value range.

Figure 14: Assessment against real data for varying minimum membership: (a) bundle discovery execution time, (b) bundle discovery results.

When varying the minimum membership parameter in bundle discovery (Figure 14), the results regarding execution time are again very similar to the rest of the tests, as indicated in Figure 14(a). Again, the CP algorithm outperforms SL by up to one order of magnitude. The execution time decreases very slightly for larger values of this parameter in both algorithms, as more candidate bundles are pruned. Results from CP are reported almost instantly, since the number of time series is rather small and scanning through the limited number of checkpoints is very fast. By contrast, the performance of SL does not improve drastically as the minimum membership gets larger, since filtering and verification have to be repeated at every timestamp. The number of results (Figure 14(b)) is reduced as the minimum membership increases, which is expected, as fewer bundles are detected with a larger membership. The maximum duration among bundles also decreases; interestingly, the maximum size detected among bundles increases as the number of results diminishes, due to the growing required number of members per bundle.

6.2.2 Efficiency against Synthetic Data

To evaluate the efficiency of our methods, we used the synthetic dataset. Regarding parameter values, as in the previous experiments, we performed preliminary tests to extract ranges of values where the algorithms return a reasonable number of results. Table 3 lists the range of values for all parameters used in these efficiency tests for bundle and pair discovery, with the default values emphasized in bold (again, the minimum membership parameter is not applicable in pair discovery).

| Parameter | Values |
|---|---|
| Dataset size (number of time series) | 10,000, 20,000, 30,000, 40,000, 50,000 |
| Time series length (timestamps) | 600, 700, 800, 900, 1,000 |
| Minimum duration δ (% of time series length) | 2.5% |
| Threshold ϵ (% of value range) | 0.2% |
| Minimum members (% of dataset size) | 1.25% |

Table 3: Parameters for tests against synthetic data

Figure 15: Efficiency with varying numbers of time series: (a) bundle discovery, (b) pair discovery.

Figure 15 depicts the performance comparison between the CP and SL algorithms for bundle and pair discovery. We omit cases where the execution of an algorithm took more than 15 hours (cutoff). As illustrated in Figure 15(a), an increase in the dataset size leads to a very abrupt deterioration in the performance of the SL algorithm, which needs up to several hours of execution for 20,000 time series. For larger dataset sizes, its execution time was significantly longer than the cutoff time. On the other hand, CP reports results in all cases. Its execution time increases for larger dataset sizes, but it manages to finish in a few hours in the worst case (for 50,000 time series). It is worth noting that the minimum membership parameter is more relaxed for larger dataset sizes, as it is expressed as a percentage of the total number of time series in the dataset. This explains the slight improvement in performance for dataset sizes of 20,000 and 30,000 for the CP method. In general, CP is more than an order of magnitude faster than SL, which also holds for pair discovery, as illustrated in Figure 15(b). As the number of time series in the dataset grows, it is natural that more pairs will be detected, hence the linear increase in the execution cost of the CP algorithm. Of course, the baseline SL requires more time, as it must check many more combinations of time series.

Figure 16: Efficiency with varying length of time series: (a) bundle discovery, (b) pair discovery.

For time series of increasing length (Figure 16(a)), the CP algorithm for bundle discovery again consistently outperforms SL. As in previous experiments, δ is expressed as a percentage of the time series length. We observe that the execution time initially decreases for both algorithms, as more bundles are pruned. However, as the time series length (and thus δ) gets larger, the performance of both algorithms slightly worsens. Only in the case of 1,000 timestamps does the execution time start to drop for the CP algorithm, due to the even larger δ. This is not the case with the SL method, which has to evaluate more timestamps. Similar observations hold for pair discovery, with the CP algorithm significantly outperforming SL in all cases (Figure 16(b)).

7 Conclusions

In this paper, we addressed the problems of pair and bundle discovery over co-evolving time series, according to their local similarity. We introduced two efficient algorithms for pair and bundle discovery that utilize checkpoints appropriately placed across the time axis in order to aggressively prune the search space and avoid an expensive exhaustive search at each timestamp. Our methods successfully detect locally similar subsequences of co-evolving time series, as demonstrated in our experimental evaluation over real-world and synthetic data. They were also an order of magnitude faster than baseline methods that apply a sweep line approach, confirming their effectiveness in pair and bundle discovery. In the future, we plan to relax the strict conditions regarding the threshold parameters and further improve the scalability of our algorithms to extend their applicability over very large time series datasets, both in terms of cardinality and in terms of length.

References

  • [1] M. Benkert, J. Gudmundsson, F. Hübner, and T. Wolle. Reporting flock patterns. Computational Geometry, 41(3):111–125, 2008.
  • [2] A. Camerra, T. Palpanas, J. Shieh, and E. J. Keogh. iSAX 2.0: Indexing and mining one billion time series. In ICDM, 2010.
  • [3] A. Camerra, J. Shieh, T. Palpanas, T. Rakthanmanon, and E. J. Keogh. Beyond one billion time series: indexing and mining very large time series collections with iSAX2+. KAIS, 39(1):123–151, 2014.
  • [4] K. Chan and A. W. Fu. Efficient time series matching by wavelets. In ICDE, pages 126–133, 1999.
  • [5] R. Ding, Q. Wang, Y. Dang, Q. Fu, H. Zhang, and D. Zhang. Yading: Fast clustering of large-scale time series data. Proc. VLDB Endow., 8(5):473–484, Jan. 2015.
  • [6] K. Echihabi, K. Zoumpatianos, T. Palpanas, and H. Benbrahim. The Lernaean hydra of data series similarity search: An experimental evaluation of the state of the art. PVLDB, 12(2):112–127, 2018.
  • [7] A. Graps. An introduction to wavelets. IEEE Comput. Sci. Eng., 2(2):50–61, 1995.
  • [8] J. Gudmundsson and M. van Kreveld. Computing longest duration flocks in trajectory data. In ACM GIS, pages 35–42, 2006.
  • [9] E. J. Keogh and M. J. Pazzani. An indexing scheme for fast similarity search in large time series databases. In SSDBM, pages 56–67, 1999.
  • [10] H. Kondylakis, N. Dayan, K. Zoumpatianos, and T. Palpanas. Coconut: A scalable bottom-up approach for building data series indexes. PVLDB, 11(6):677–690, 2018.
  • [11] J. Lin, E. J. Keogh, L. Wei, and S. Lonardi. Experiencing SAX: a novel symbolic representation of time series. DAMI, 15(2):107–144, 2007.
  • [12] M. Linardi and T. Palpanas. Scalable, variable-length similarity search in data series: The ulisse approach. PVLDB, 11(13):2236–2248, 2018.
  • [13] T. Palpanas. Data series management: The road to big sequence analytics. SIGMOD Record, 44(2):47–52, 2015.
  • [14] T. Palpanas. Big sequence management: A glimpse of the past, the present, and the future. In SOFSEM, pages 63–80, 2016.
  • [15] J. Paparrizos and L. Gravano. k-shape: Efficient and accurate clustering of time series. In SIGMOD, pages 1855–1870, 2015.
  • [16] J. Paparrizos and L. Gravano. Fast and accurate time-series clustering. ACM Trans. Database Syst., 42(2):8:1–8:49, 2017.
  • [17] B. Peng, P. Fatourou, and T. Palpanas. Paris: The next destination for fast data series indexing and query answering. In IEEE BigData, 2018.
  • [18] T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria, and E. Keogh. Searching and mining trillions of time series subsequences under dynamic time warping. In SIGKDD, pages 262–270, 2012.
  • [19] J. Shieh and E. J. Keogh. iSAX: indexing and mining terabyte sized time series. In SIGKDD, pages 623–631, 2008.
  • [20] P. S. Tanaka, M. R. Vieira, and D. S. Kaster. Efficient algorithms to discover flock patterns in trajectories. In GeoInfo, pages 56–67, 2015.
  • [21] M. R. Vieira, P. Bakalov, and V. J. Tsotras. On-line discovery of flock patterns in spatio-temporal data. In SIGSPATIAL, pages 286–295, 2009.
  • [22] D.-E. Yagoubi, R. Akbarinia, F. Masseglia, and T. Palpanas. Massively distributed time series indexing and querying. TKDE (to appear), 2018.
  • [23] C.-C. M. Yeh, Y. Zhu, L. Ulanova, N. Begum, Y. Ding, H. A. Dau, D. F. Silva, A. Mueen, and E. Keogh. Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In ICDM, 2016.
  • [24] K. Zoumpatianos, S. Idreos, and T. Palpanas. Indexing for interactive exploration of big data series. In SIGMOD, pages 1555–1566, 2014.