Local Similarity Search on Geolocated Time Series Using Hybrid Indexing

04/19/2021
by   Georgios Chatzigeorgakidis, et al.

Geolocated time series, i.e., time series associated with certain locations, abound in many modern applications. In this paper, we consider hybrid queries for retrieving geolocated time series based on filters that combine spatial distance and time series similarity. For the latter, unlike existing work, we allow filtering based on local similarity, which is computed based on subsequences rather than the entire length of each series, thus allowing the discovery of more fine-grained trends and patterns. To efficiently support such queries, we first leverage the state-of-the-art BTSR-tree index, which utilizes bounds over both the locations and the shapes of time series to prune the search space. Moreover, we propose optimizations that check at specific timestamps to identify candidate time series that may exceed the required local similarity threshold. To further increase pruning power, we introduce the SBTSR-tree index, an extension to BTSR-tree, which additionally segments the time series temporally, allowing the construction of tighter bounds. Our experimental results on several real-world datasets demonstrate that SBTSR-tree can provide answers much faster for all examined query types. This paper has been published in the 27th International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL 2019).


1 Introduction

A time series is a time-ordered sequence of data points. Time series are ubiquitous in many application domains. They can represent various types of measurements, such as user check-ins at various Points of Interest, energy consumption in smart buildings, PM2.5 particle concentration measured by air pollution sensors, etc. Analyzing and mining time series data is highly important for discovering trends and patterns in such phenomena, and has attracted extensive research interest over the last years [8, 13, 20].

However, what is usually overlooked is that the phenomena represented by time series are often also associated with geographic locations, e.g., time series generated by sensors installed at fixed positions. In such cases, spatial distance also plays an important role in the analysis, since discovery of trends and patterns may depend not only on time series similarity but also on geographic proximity. Motivated by this observation, in previous work [6, 5] we introduced the concept of geolocated time series and we proposed hybrid indexing techniques that efficiently support the retrieval of time series based on both spatial distance and time series similarity.

In particular, we introduced the BTSR-tree [6], a hybrid index that first builds an R-tree over the locations of the time series data. It then enhances each node with appropriate upper- and lower-bounding time series (MBTS) that enclose the subset of time series represented by it. Combining MBTSs and MBRs, the query evaluation algorithm can simultaneously prune the search space based on time series similarity and spatial distance while traversing the index. To further increase its pruning power, the BTSR-tree groups together similar time series within each node to derive tighter bounds.

This existing approach for hybrid search over geolocated time series using the BTSR-tree supports only global time series similarity, i.e., similarity measured across the entire length of the time series. Specifically, as in other works in this area [8, 11, 2, 3], the distance between two time series is measured by aggregating the pairwise Euclidean distance of their respective values across the entire sequences. However, in many cases, more fine-grained trends and patterns may exist, which are missed under this global similarity measure. For example, consider two time series representing the hourly energy consumption of two nearby buildings over a week, and assume that the two buildings exhibit a similar consumption pattern during working days but a different one on weekends. A query imposing a similarity threshold over the entire week would fail to identify these two geolocated time series as similar. However, it may be useful to discover that there is a period of up to 5 days during which these two time series are actually similar.

Motivated by this observation, in this work we extend our previous approach on hybrid queries over geolocated time series to support local similarity of time series, thus allowing more flexible and fine-grained queries and analyses. The local similarity score between two time series $T$ and $T'$ is defined as the maximum number of consecutive timestamps during which the respective values of $T$ and $T'$ do not differ by more than a user-specified threshold $\epsilon$. Notice that, compared to global similarity, this condition is more relaxed, in the sense that it is applied to subsequences shorter than $T$ and $T'$, but at the same time stricter, in the sense that the threshold is required to be satisfied at each individual timestamp during the selected period rather than on the aggregate distance over all timestamps.

Combining this local similarity constraint with a filter on spatial distance leads to a novel set of hybrid queries. Figure 1 shows an example with a query time series $T_q$ searching over a set of time series for those within radius $\rho$ from its location and also locally similar to $T_q$. In particular, with respect to a given margin $\epsilon$, results should also be locally similar to $T_q$ for at least 5 consecutive timestamps. Two time series qualify; their local similarity to $T_q$ is illustrated in the bottom and top charts, respectively.

Figure 1: Retrieving geolocated time series based on spatial distance and local similarity.

It turns out that such hybrid queries involving local similarity can still be evaluated using the BTSR-tree index. We first present a baseline method employing a sweep-line algorithm to check for local similarity, and then describe how this can be optimized by using appropriately placed checkpoints, based on the local similarity score threshold specified by the query, in order to skip unnecessary comparisons. Despite the fact that this saves some computations, the resulting time savings are relatively small, since the number of index nodes that need to be probed is not essentially reduced. To overcome this problem, we introduce an improvement to the BTSR-tree index, which is based on temporally segmenting the time series bounds within each node and deriving tighter bounds per segment. Once the time series bounds in each node become more fine-grained, pruning the search space for local similarity queries proves much more effective.

Summarizing, our main contributions are as follows:

  • We extend our previous work on hybrid queries for geolocated time series to support local time series similarity. We consider both range and top-$k$ queries, including combined criteria for spatial distance and local time series distance.

  • We present how such queries can be answered efficiently exploiting the previously introduced BTSR-tree index.

  • To achieve greater savings in execution time by further reducing node accesses, we propose an enhanced variant of the BTSR-tree, called SBTSR-tree, which additionally employs temporal segmentation in each node to derive tighter, more fine-grained time series bounds.

  • We experimentally evaluate our methods using real-world datasets from different application domains, showing that the BTSR-tree can efficiently handle hybrid queries under local similarity search, while the SBTSR-tree achieves even higher performance due to the additional temporal segmentation.

The remainder of the paper is structured as follows. Section 2 reviews related work. Section 3 formally defines the problem. Section 4 presents how query evaluation under local time series similarity can be executed using the BTSR-tree, while Section 5 presents the enhanced SBTSR-tree. Section 6 reports our experimental results, and Section 7 concludes the paper.

2 Related Work

Similarity search over time series has given rise to a wide range of algorithmic approaches; a detailed survey with experimental evaluation is available in [8]. Initially, the focus was mostly on wavelet-based methods [4] to reduce the dimensionality of time series and generate an index based on the transformed sequences. In contrast, state-of-the-art approaches for time series indexing are based on the Symbolic Aggregate Approximation (SAX) representation [11]. The first index in this family was iSAX [17], offering multi-resolution representations for time series. Further extensions like iSAX 2.0 [2], iSAX2+ [3], ADS+ [21], Coconut [10], DPiSAX [18], and ParIS [14] provided a wide range of advanced capabilities. These indices support global similarity search, i.e., the similarity score is computed over the entire length of the compared time series, as opposed to local similarity, which considers similar subsequences. The most recent addition to this iSAX-based family is ULISSE [12], which can answer similarity search queries of varying length. However, this still differs from our setting, since in ULISSE the goal is to build an index that supports similarity search for queries of any length within a given range of lengths. Furthermore, none of the aforementioned approaches supports geolocated time series, and thus they cannot efficiently process hybrid queries combining conditions on spatial distance and time series similarity.

The problem of subsequence matching over time series is to identify matches of a (relatively short) query subsequence across one or more (relatively long) time series. The UCR suite [15] offers a framework comprising various optimizations regarding subsequence similarity search. Matrix Profile [19] includes methods for detecting, for each subsequence of a time series, its nearest neighbor subsequence, by keeping track of Euclidean distances among candidate pairs. Applying such approaches in our setting is not straightforward. First, they involve Euclidean or DTW distances, which are different from our definition of local similarity score, hence the pruning heuristics do not hold in our case. Second, they do not consider geolocated time series, thus spatial filtering has to be carried out independently, which reduces pruning opportunities.

To the best of our knowledge, the only index that supports searching over geolocated time series is the BTSR-tree [6, 5]. This hybrid index follows a rationale similar to that of spatio-textual indices [7], which facilitate the evaluation of queries combining location-based predicates with keyword search. In a similar spirit, the BTSR-tree is a spatial-first index based on the R-tree that additionally computes bounds on the similarity of time series instead of a textual similarity between documents. Apart from an MBR, each node also stores bounds over the time series indexed in its subtree. Thus, it offers increased pruning capabilities for range and top-$k$ queries involving both time series similarity and spatial proximity. In the current work, we show how the BTSR-tree can be used for another family of hybrid queries involving local similarity of time series. Furthermore, we introduce a variant structure, called SBTSR-tree, which constructs tighter bounds over temporally segmented time series to offer stronger pruning power.

3 Local Similarity Search on Geolocated Time Series

Next, we briefly present some background on geolocated time series and the BTSR-tree index, and then formally define the problem.

3.1 Preliminaries

Geolocated Time Series. A time series is a time-ordered sequence of values $T = \{T_1, \ldots, T_w\}$, where $T_i$ is the value at the $i$-th timestamp and $w$ is the length of the series. A geolocated time series is additionally characterized by a location, denoted by $T.loc$. The spatial distance between two geolocated time series is the Euclidean distance of their respective locations.

The BTSR-tree Index. In [6], we have introduced the BTSR-tree index, which is based on the notion of Minimum Bounding Time Series (MBTS). In the same way that an MBR encloses a set of geometries, an MBTS encloses a set of time series using a pair of bounds that fully contain all of them. Figure 2 depicts an example of two MBTSs for two disjoint sets of time series. Formally, given a set of time series $\mathcal{T}$, each of length $w$, its MBTS $B = (B^{\uparrow}, B^{\downarrow})$ consists of an upper bounding time series $B^{\uparrow}$ and a lower bounding time series $B^{\downarrow}$, constructed by respectively selecting the maximum and minimum of values at each timestamp $i$ among all time series in set $\mathcal{T}$ as follows:

$$B^{\uparrow}_i = \max_{T \in \mathcal{T}} T_i, \qquad B^{\downarrow}_i = \min_{T \in \mathcal{T}} T_i, \qquad i = 1, \ldots, w \tag{1}$$
Figure 2: MBTS constructed for two sets of time series.
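As an illustration of Eq. 1, the following minimal Python sketch computes the upper and lower bounding series for a set of equal-length time series. The function name `mbts` and the list-based representation are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def mbts(series: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Return (upper, lower) bounding series of a set of equal-length series."""
    if not series:
        raise ValueError("need at least one time series")
    w = len(series[0])
    if any(len(s) != w for s in series):
        raise ValueError("all time series must have the same length")
    # Per-timestamp maximum and minimum over all series in the set.
    upper = [max(s[i] for s in series) for i in range(w)]
    lower = [min(s[i] for s in series) for i in range(w)]
    return upper, lower


upper, lower = mbts([[1, 5, 3], [2, 4, 6], [0, 7, 2]])
```

By construction, every input series lies between `lower` and `upper` at every timestamp.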

A BTSR-tree index is initialized as an R-tree [9] built on the spatial attributes of the given geolocated time series dataset, as depicted in the example of Figure 3. Besides MBRs, each node is enhanced to also store MBTSs, shown as colored strips per node in Figure 3(c). This enables efficient pruning of the search space when evaluating hybrid queries combining time series similarity with spatial proximity. For each child, a node stores a pre-specified number of MBTSs. Each MBTS is calculated according to Eq. 1. Construction and maintenance of the BTSR-tree follow the procedures of the R-tree for data insertion, deletion and node splitting. Objects (i.e., geolocated time series) are inserted into leaf nodes and any resulting changes are propagated upwards. Once the nodes have been populated, the MBTSs of each node are calculated bottom-up, relying on $k$-means clustering of the time series according to their Euclidean distance in the time series domain. The example in Figure 2 depicts the MBTSs (as two bands with a thick outline) obtained for a set of time series (shown as thin polylines). In a BTSR-tree, each parent node receives all the MBTSs of its children and computes its own MBTSs. The process continues upwards, until reaching the root.

(a) Sample dataset with MBRs over locations
(b) Spatial-only R-tree index
(c) Hybrid BTSR-tree index
Figure 3: The BTSR-tree index.

3.2 Problem Definition

We first define the local similarity between time series, and then present the query variants we consider in this paper.

Definition 1 (Local Time Series Similarity)

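The local similarity score $\sigma_\epsilon(T, T')$ between two time series $T$ and $T'$ is the maximum count of consecutive timestamps during which the respective values of $T$ and $T'$ do not differ by more than a given margin $\epsilon$, i.e., $\sigma_\epsilon(T, T') = |I|$, where $I$ is the longest consecutive time interval such that $|T_i - T'_i| \le \epsilon$ for every $i \in I$.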
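Definition 1 amounts to finding the longest run of consecutive timestamps at which two series stay within the margin. A minimal sketch in Python, assuming equal-length series stored as lists (the function name `local_similarity` and the parameter name `eps` are chosen for this example only):

```python
from typing import Sequence


def local_similarity(a: Sequence[float], b: Sequence[float], eps: float) -> int:
    """Length of the longest run of timestamps where |a[i] - b[i]| <= eps."""
    if len(a) != len(b):
        raise ValueError("time series must have the same length")
    best = run = 0
    for x, y in zip(a, b):
        if abs(x - y) <= eps:
            run += 1          # the current run continues
            best = max(best, run)
        else:
            run = 0           # the run is broken at this timestamp
    return best


score = local_similarity([1, 2, 3, 4, 5, 6], [1, 2, 9, 4, 5, 6.4], eps=0.5)
```

In this example the series agree at the first two timestamps, diverge at the third, and agree again at the last three, so the score is 3.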

In this work, our goal is to efficiently support hybrid queries on geolocated time series that retrieve the results based both on spatial proximity and local similarity. Specifically, we focus on the following types of queries (hereafter referred to as LS-queries):

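  • Range LS-query: Given a geolocated time series $T_q$, a margin $\epsilon$, a local similarity threshold $\delta$ and a radius $\rho$, retrieve every geolocated time series $T$ such that $T$ is located within range $\rho$ from $T_q$, i.e., $dist(T_q.loc, T.loc) \le \rho$, and has local similarity score to $T_q$ at least $\delta$, i.e., $\sigma_\epsilon(T_q, T) \ge \delta$.

  • $k$-nearest-neighbor LS-query: Given a geolocated time series $T_q$, a margin $\epsilon$ and a threshold $\delta$, retrieve the spatial $k$-nearest neighbors to $T_q$ that also have local similarity score to $T_q$ at least $\delta$.

  • Top-$k$ LS-query: Given a geolocated time series $T_q$, a margin $\epsilon$ and a radius $\rho$, retrieve the top-$k$ geolocated time series that have the highest local similarity score to $T_q$ with respect to $\epsilon$ and are located within range $\rho$ from $T_q$.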

Example 1

Figure 1 depicts an example of the range LS-query. Given the geolocated time series $T_q$ as query, we seek the spatially close ones (i.e., within a circle of radius $\rho$) that are also locally similar within margin $\epsilon$ for at least $\delta$ timestamps. In this example, despite five geolocated time series being within range, only two of them qualify for the final result, since these are the ones that are also locally similar to $T_q$ for at least one time interval of length at least $\delta$.

4 LS-Queries Using the BTSR-tree

A straightforward approach for answering LS-queries would be to use a spatial index to first filter by spatial distance and then perform a sequential scan across each result to filter out those having local similarity score below the given threshold. This suffers from generating an unnecessarily large number of intermediate results which are then discarded. Instead, we propose to process LS-queries by leveraging the BTSR-tree index [6], which can prune the search space simultaneously according to both criteria.

While traversing the BTSR-tree, spatial filtering is performed at each node $N$ by computing the bounding distance $mindist(T_q, N)$ between the location of $T_q$ and the MBR of $N$, as in R-trees [16].

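For time series similarity, we exploit the MBTSs stored within each node. Considering an MBTS $B$ at a node $N$, we calculate its distance from $T_q$ at each timestamp $i$ as:

$$d_i(T_q, B) = \begin{cases} 0 & \text{if } B^{\downarrow}_i \le T_{q,i} \le B^{\uparrow}_i \\ \min\left(|T_{q,i} - B^{\uparrow}_i|,\ |T_{q,i} - B^{\downarrow}_i|\right) & \text{otherwise} \end{cases} \tag{2}$$

where $B^{\uparrow}_i$ and $B^{\downarrow}_i$ are the upper and lower values of the MBTS at timestamp $i$. By definition of MBTS, no time series indexed under $B$ can differ from $T_q$ by less than $d_i(T_q, B)$ at timestamp $i$. Hence, only at those timestamps where $d_i(T_q, B) \le \epsilon$ is it possible that a time series indexed under $B$ is locally similar to $T_q$. Subsequently, we can compute a local similarity bound $\bar{\sigma}_\epsilon(T_q, B)$:

$$\bar{\sigma}_\epsilon(T_q, B) = \max \{\, j' - j + 1 \;:\; 1 \le j \le j' \le w,\ d_i(T_q, B) \le \epsilon \ \text{ for all } i \in [j, j'] \,\} \tag{3}$$

that reflects the maximum interval of consecutive timestamps where the distance computed by Eq. 2 does not exceed margin $\epsilon$. This value is an upper bound of the local similarity scores of $T_q$ with any time series enclosed in this MBTS. Figure 4 shows that $T_q$ deviates from the given MBTS by no more than $\epsilon$ during two intervals, a longer one and a shorter one (shown as square points). So, the local similarity bound for this MBTS is the length of the longer interval.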

Figure 4: Local similarity check against an MBTS.
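The per-timestamp distance of Eq. 2 and the bound of Eq. 3 can be sketched as follows. This is an illustrative sweep over the time axis, not the authors' implementation; the function names and the representation of an MBTS as two lists (`lower`, `upper`) are assumptions.

```python
from typing import Sequence


def mbts_distance(q_i: float, lo_i: float, up_i: float) -> float:
    """Distance of a query value from the band [lo_i, up_i] at one timestamp."""
    if lo_i <= q_i <= up_i:
        return 0.0
    return min(abs(q_i - up_i), abs(q_i - lo_i))


def local_similarity_bound(q: Sequence[float], lower: Sequence[float],
                           upper: Sequence[float], eps: float) -> int:
    """Longest run of timestamps where the query is within eps of the band."""
    best = run = 0
    for q_i, lo_i, up_i in zip(q, lower, upper):
        if mbts_distance(q_i, lo_i, up_i) <= eps:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


bound = local_similarity_bound(
    q=[5, 5, 5, 5, 5],
    lower=[0, 4, 6, 9, 4],
    upper=[2, 6, 7, 10, 5],
    eps=1.0,
)
```

Here the query is within the margin of the band only at the second, third and fifth timestamps, so the bound is 2; no series enclosed by this band can have a local similarity score above 2 with the query.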

By construction, the MBTSs of a child node have tighter bounds than those of its parent as we descend the BTSR-tree. It is easy to verify that, for an MBTS $B'$ of a child node and the MBTS $B$ of its parent that encloses it,

$$\bar{\sigma}_\epsilon(T_q, B') \le \bar{\sigma}_\epsilon(T_q, B) \tag{4}$$

hence local similarity bounds can only diminish when descending the index. This bound provides a useful pruning condition during search with a cutoff threshold $\delta$. Any node where all its MBTSs have local similarity bound below $\delta$ can be safely pruned.

Next, we describe a baseline approach that employs a sequential scan over MBTSs, and then we present an optimization that prioritizes selected checkpoints to avoid many point-wise comparisons.

4.1 Sweep Line Approach

We explain how the BTSR-tree can be used, in conjunction with a simple sweep-line algorithm, to answer each of the three LS-queries, taking advantage of the two types of bounds described above: the spatial bound $mindist$ and the local similarity bound $\bar{\sigma}_\epsilon$.

Range LS-query: We traverse the BTSR-tree starting from its root. At each inner node $N$, we first check whether $mindist(T_q, N) \le \rho$. If so, we employ a sweep line across the time axis to compute the local similarity bound for every MBTS included in $N$. If all resulting bounds are below $\delta$, the subtree under $N$ is pruned. Otherwise, the search continues at the children. Upon reaching a leaf node, we fetch the geolocated time series contained therein, and verify the query constraints against each one. Each $T$ such that $dist(T_q.loc, T.loc) \le \rho$ and $\sigma_\epsilon(T_q, T) \ge \delta$ is added to the results.

$k$-nearest-neighbor LS-query: We maintain a priority queue $Q$ containing both inner nodes (sorted by ascending $mindist$) and geolocated time series (sorted by ascending spatial distance to $T_q$). We start by adding the root of the BTSR-tree to $Q$. In each iteration, we retrieve the top element from $Q$. If it is an inner node, we visit its children to calculate local similarity bounds according to Eq. 3. For any child $N'$ for which the bound of at least one of its MBTSs satisfies threshold $\delta$, we search the subtree of $N'$. Then, we calculate the corresponding spatial distance ($mindist$ for a node or Euclidean distance for a geolocated time series $T$) and insert it back to $Q$. Once we encounter a geolocated time series at the top of $Q$, we add it to the results. The process terminates once $k$ geolocated time series have been obtained.

Top-$k$ LS-query: This query is evaluated similarly to the previous one, with two differences. The first difference is that the priority queue $Q$ is now sorted based on local similarity bounds in descending order, instead of spatial distance bounds in ascending order. The second is that before inserting an item (node or time series) to $Q$, its spatial distance ($mindist$ or exact) is calculated, and if it is higher than $\rho$ the item is skipped. The traversal starts again from the root, and terminates once $k$ time series have been retrieved from the top of $Q$. These are the top-$k$ results with respect to local similarity (if another time series had higher local similarity, it would have been retrieved from $Q$ first), and they are located within range $\rho$ from $T_q$ (otherwise, they would not have been admitted to $Q$).
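To make the traversal for the range LS-query concrete, here is a simplified Python sketch. It is not the authors' implementation: the `Node` and `GeoSeries` classes, their field names, and the hand-built tree are all hypothetical, index construction is omitted, and each MBTS is stored as a `(lower, upper)` pair of lists.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Series = List[float]


@dataclass
class GeoSeries:
    loc: Tuple[float, float]
    values: Series


@dataclass
class Node:
    mbr: Tuple[float, float, float, float]        # (xmin, ymin, xmax, ymax)
    mbts: List[Tuple[Series, Series]]             # (lower, upper) pairs
    children: List["Node"] = field(default_factory=list)
    entries: List[GeoSeries] = field(default_factory=list)  # leaves only


def mindist(p, mbr):
    """Smallest Euclidean distance from point p to rectangle mbr."""
    dx = max(mbr[0] - p[0], 0.0, p[0] - mbr[2])
    dy = max(mbr[1] - p[1], 0.0, p[1] - mbr[3])
    return math.hypot(dx, dy)


def longest_run(flags):
    best = run = 0
    for ok in flags:
        run = run + 1 if ok else 0
        best = max(best, run)
    return best


def bound(q, lower, upper, eps):
    # Query within eps of the band [lo, up], i.e. Eq. 2 distance <= eps.
    return longest_run(lo - eps <= v <= up + eps
                       for v, lo, up in zip(q, lower, upper))


def score(q, t, eps):
    return longest_run(abs(a - b) <= eps for a, b in zip(q, t))


def range_query(root, q, eps, delta, rho):
    results, stack = [], [root]
    while stack:
        node = stack.pop()
        if mindist(q.loc, node.mbr) > rho:
            continue                      # pruned on spatial distance
        if all(bound(q.values, lo, up, eps) < delta for lo, up in node.mbts):
            continue                      # pruned on local similarity bound
        stack.extend(node.children)
        for t in node.entries:            # non-empty only at leaves
            if (math.dist(q.loc, t.loc) <= rho
                    and score(q.values, t.values, eps) >= delta):
                results.append(t)
    return results


a = GeoSeries((0, 0), [1, 2, 3, 4])
b = GeoSeries((1, 1), [1, 9, 3, 9])
c = GeoSeries((10, 10), [1, 2, 3, 4])
leaf1 = Node((0, 0, 1, 1), [([1, 2, 3, 4], [1, 9, 3, 9])], entries=[a, b])
leaf2 = Node((10, 10, 10, 10), [([1, 2, 3, 4], [1, 2, 3, 4])], entries=[c])
root = Node((0, 0, 10, 10), [([1, 2, 3, 4], [1, 9, 3, 9])],
            children=[leaf1, leaf2])
query = GeoSeries((0, 0), [1, 2, 3, 4.2])
hits = range_query(root, query, eps=0.5, delta=3, rho=2.0)
```

In this toy setup, `leaf2` is discarded by the spatial check alone, while within `leaf1` only series `a` satisfies both constraints.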

4.2 Checkpoint Approach

The drawback of the sweep-line approach is that it needs to perform a comparison for each individual timestamp to eventually determine the exact or maximum local similarity of a given time series or node, respectively. In the following, we explain how we can use checkpoints along the time axis to avoid this exhaustive search. These checkpoints prioritize specific timestamps when checking for candidate matches to eagerly filter out non-qualifying items.

Assume a query with local similarity threshold $\delta$. We can place checkpoints at every $\delta$ timestamps, and only apply the local similarity filter (i.e., whether the distance is at most $\epsilon$) at those. If no checkpoint satisfies the condition, this item can be safely pruned since it cannot have local similarity to $T_q$ of at least $\delta$ (as this would require the condition to be true for at least $\delta$ consecutive timestamps, thus crossing at least one checkpoint).

(a) Checkpoints placed every $\delta$ timestamps.
(b) Local similarity interval starting before a checkpoint.
(c) Local similarity interval ending after a checkpoint.
Figure 5: Local time series similarity via checkpoints.

Figure 5(a) shows an example with checkpoints placed along the time axis every $\delta$ timestamps. For clarity, we consider a single time series $T$. Assume a checkpoint at timestamp $c$ and an interval of minimal duration $\delta$, starting fewer than $\delta$ timestamps before $c$, over which local similarity with query $T_q$ is asserted, as shown with the grey strip in Figure 5(b). This interval cannot have smaller duration, as it would not satisfy the constraint $\delta$. Thus, the local similarity condition will evaluate to true at checkpoint $c$. Similarly, if such an interval ends fewer than $\delta$ timestamps after a checkpoint (Figure 5(c)), it will be detected at that checkpoint. This observation entails that it suffices to check for local similarity only at checkpoints, i.e., every $\delta$ timestamps. We denote the set of checkpoints as $C$, determined at query time. If a checkpoint satisfies the condition, then we need to scan both forward and backward from it to determine the actual local similarity score, i.e., to find the exact extent of the time interval for which the condition holds.

Figure 6 exemplifies the use of checkpoints for comparing $T_q$ to an MBTS of a node. Instead of sequentially performing 11 comparisons until verifying that the local similarity score is at least $\delta$ (i.e., stopping the verification as soon as the threshold is reached), we check only around the checkpoints. At the leftmost checkpoint, no local similarity is found ($T_q$ is farther than $\epsilon$ from the MBTS), so we skip directly to the next checkpoint. Since $T_q$ differs by less than $\epsilon$ from the MBTS at that checkpoint, we need to compare values backward and forward, up to the previous and next checkpoint, respectively. This requires only 6 comparisons instead of 11 to decide that this node may contain candidates. Next, we describe how probing with checkpoints is applied during evaluation of LS-queries.

Figure 6: Local similarity with an MBTS using checkpoints.
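The checkpoint idea can be sketched for a pair of raw time series as below. This is an illustrative version under stated assumptions: timestamps are 0-based, checkpoints sit at indices $\delta - 1, 2\delta - 1, \ldots$ (so that any window of $\delta$ consecutive timestamps contains exactly one), and the function name `reaches_threshold` is made up for this example. The same probing applies to an MBTS by replacing the point-wise difference with the distance of Eq. 2.

```python
from typing import Sequence


def reaches_threshold(q: Sequence[float], t: Sequence[float],
                      eps: float, delta: int) -> bool:
    """True if q and t stay within eps for at least delta consecutive timestamps."""
    if delta < 1:
        raise ValueError("delta must be at least 1")
    if len(q) != len(t):
        raise ValueError("time series must have the same length")
    w = len(q)

    def close(i: int) -> bool:
        return abs(q[i] - t[i]) <= eps

    # Probe only every delta-th timestamp.
    for c in range(delta - 1, w, delta):
        if not close(c):
            continue                      # skip directly to the next checkpoint
        count = 1
        i = c - 1                         # scan backward from the checkpoint
        while count < delta and i >= 0 and close(i):
            count += 1
            i -= 1
        i = c + 1                         # scan forward from the checkpoint
        while count < delta and i < w and close(i):
            count += 1
            i += 1
        if count >= delta:
            return True
    return False


q = [0] * 12
t = [0, 9, 0, 0, 0, 0, 0, 9, 0, 0, 9, 0]
found_5 = reaches_threshold(q, t, eps=1.0, delta=5)
found_6 = reaches_threshold(q, t, eps=1.0, delta=6)
```

The longest matching run in this example spans five timestamps, so the check succeeds for a threshold of 5 and fails for 6, in both cases without comparing every timestamp.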

Range LS-query: Algorithm 1 outlines the procedure. Initially, we obtain the children of the root node in a list $L$ and place the checkpoints every $\delta$ timestamps (Lines 1-3). We iterate over each item $N$ in this list. If $N$ is an inner node, we have to examine whether both constraints with respect to $\rho$ and $\delta$ are met for each of its children. Verification of MBTSs against query $T_q$ will be discussed shortly. If this is the case, we traverse the sub-tree of each child in the same manner, by adding it to the list (Lines 7-11), thus descending the tree. If the examined node is a leaf (Line 12), we iterate over each contained time series $T$ to check the constraints on spatial distance ($\rho$) and local similarity ($\delta$). If $T$ qualifies, it is added to the results (Lines 13-15). Note that now the calculation of local similarity scores for geolocated time series is based on checkpoints (Line 14), as discussed above.

Verification of MBTSs against the local similarity constraints is applied using checkpoints (Lines 17-38). This verification concerns each MBTS $B$ in a given node $N$. At each checkpoint $c$, we first verify whether its distance to query $T_q$ is at most $\epsilon$ (Line 20). If so, we first scan backward to inspect whether there are at least $\delta$ consecutive timestamps where $T_q$ deviates by at most $\epsilon$ from this MBTS (Lines 22-29). Similarly, we probe forward from checkpoint $c$ (Lines 30-37). In either case, once local similarity no longer holds at a timestamp, probing skips to the next checkpoint. If the check fails for all checkpoints of all MBTSs, then this node cannot contain any results (Line 38).

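 1  R ← ∅
 2  L ← children of the root node
 3  C ← checkpoints placed every δ timestamps
 4  while L ≠ ∅ do
 5      N ← L.removeFirst()
 6      if N is not leaf then
 7          foreach child N′ of N do
 8              if mindist(T_q, N′) ≤ ρ then
 9                  ok ← Verify(T_q, N′, ε, δ, C)
10                  if ok then
11                      L.add(N′)
12      else
13          foreach T ∈ N do
14              if dist(T_q, T) ≤ ρ and σ_ε(T_q, T) ≥ δ (computed using C) then
15                  R ← R ∪ {T}
16  return R
17  Procedure Verify(T_q, N, ε, δ, C)
18      foreach MBTS B ∈ N do
19          foreach checkpoint c ∈ C do
20              if d_c(T_q, B) ≤ ε then
21                  count ← 0; b ← c + 1; f ← c
22                  while True do
23                      b ← b − 1
24                      if b ≥ 1 and d_b(T_q, B) ≤ ε then
25                          count ← count + 1
26                          if count ≥ δ then
27                              return True
28                      else
29                          break
30                  while True do
31                      f ← f + 1
32                      if f ≤ w and d_f(T_q, B) ≤ ε then
33                          count ← count + 1
34                          if count ≥ δ then
35                              return True
36                      else
37                          break
38      return False
Algorithm 1: Range LS-query using checkpoints (listing reconstructed from the description in the text)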

$k$-nearest-neighbor LS-query: We follow a similar procedure to the one described in Section 4.1 for this query, employing the same verification process over MBTSs and time series as in Algorithm 1. Algorithm 2 describes the procedure. We start by adding the root node to a priority queue $Q$ based on spatial distance (Line 2). After determining the checkpoints using the given $\delta$ (Line 3), we iteratively retrieve elements from $Q$ (Line 5). Then, three cases may occur:

  1. If this element is a time series (Lines 6-9), it is guaranteed to be a result, given that $Q$ is sorted based on spatial distance from $T_q$. Indeed, any subsequent element must be located farther than the current one. When the result list $R$ obtains the required number $k$ of results, the search terminates.

  2. If the element is a leaf node (Lines 10-14), we obtain each time series $T$ contained in this leaf, and verify the local similarity score of $T$ against $\delta$. If the condition is met, we calculate the spatial distance of candidate $T$ from query $T_q$ and push $T$ into the priority queue $Q$ along with its spatial distance.

  3. If the element is an inner node, we iterate over its children and only push back to the queue the ones whose MBTSs are verified against $\epsilon$ and $\delta$ using checkpoints (Lines 15-19).

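 1  R ← ∅
 2  Q ← {root}  (priority queue sorted by ascending spatial distance)
 3  C ← checkpoints placed every δ timestamps
 4  while Q is not empty do
 5      e ← Q.poll()
 6      if e is raw then
 7          R ← R ∪ {e}
 8          if |R| = k then
 9              return R
10      else if e is leaf then
11          foreach T ∈ e do
12              if σ_ε(T_q, T) ≥ δ (computed using C) then
13                  d ← dist(T_q, T)
14                  Q.push(T, d)
15      else
16          foreach child N′ of e do
17              if Verify(T_q, N′, ε, δ, C) then
18                  d ← mindist(T_q, N′)
19                  Q.push(N′, d)
20  return R
Algorithm 2: k-nearest-neighbor LS-query using checkpoints (listing reconstructed from the description in the text)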

Top-$k$ LS-query: The procedure for this query is listed in Algorithm 3. Notice that for employing checkpoints, we need a local similarity threshold $\delta$, so as to determine their placement, but this query does not specify a fixed $\delta$. To be able to obtain one during search, we now maintain two priority queues: $Q$ holds nodes sorted by local similarity bounds (Eq. 3), while $R$ keeps up to $k$ geolocated time series sorted by local similarity scores (as in Def. 1). We initially set $\delta = 1$, so checkpoints are trivially placed at every timestamp. This implies that computation of local similarity scores with $\delta = 1$ is equivalent to the sweep line approach. However, $\delta$ increases with the detection of qualifying results, hence checkpoints will progressively get placed more sparsely. The search starts by adding the BTSR-tree root to $Q$ (Line 2). We iteratively poll the top element from $Q$, and there are two possible cases:

  1. The top element is a leaf node. Then, we iterate over the contained time series and add the ones that satisfy the spatial condition ($dist(T_q, T) \le \rho$) to $R$, along with their corresponding local similarity score if it exceeds the current value of $\delta$ (Lines 8-12). Once $R$ exceeds capacity $k$, its last element is evicted to make room for the newly inserted one and $\delta$ is updated according to the local similarity score of the $k$-th element in $R$. In this case, the placement of checkpoints is re-adjusted according to the increased $\delta$ value (Lines 13-16).

  2. The top element is an inner node. In this case, we iterate over each child $N'$ and check if $mindist(T_q, N') \le \rho$. If $N'$ qualifies, we calculate the local similarity bound of all its MBTSs using checkpoints. If the maximum among these bounds is at least $\delta$, then $N'$ is inserted to $Q$ with this maximum score (Lines 17-25).

The process terminates once the top element in $Q$ has local similarity bound less than $\delta$ (Lines 6-7). The result is the contents of $R$.

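 1  R ← ∅  (priority queue sorted by descending local similarity score)
 2  Q ← {root}  (priority queue sorted by descending local similarity bound)
 3  δ ← 1
 4  C ← checkpoints placed every δ timestamps
 5  while Q is not empty do
 6      if the bound of Q.top() < δ then
 7          break
 8      if N ← Q.poll() is leaf then
 9          foreach T ∈ N do
10              if dist(T_q, T) ≤ ρ then
11                  if σ_ε(T_q, T) > δ (computed using C) then
12                      R.push(T, σ_ε(T_q, T))
13              if |R| > k then
14                  R.removeLast()
15                  δ ← local similarity score of the k-th element in R
16                  C ← checkpoints placed every δ timestamps
17      else
18          foreach child N′ of N do
19              if mindist(T_q, N′) ≤ ρ then
20                  max ← 0
21                  foreach MBTS B ∈ N′ do
22                      if bound of B w.r.t. T_q (computed using C) > max then
23                          max ← bound of B
24                  if max ≥ δ then
25                      Q.push(N′, max)
26  return R
Algorithm 3: Top-k LS-query using checkpoints (listing reconstructed from the description in the text)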
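The way the threshold $\delta$ tightens while a top-$k$ query runs can be illustrated in isolation. The sketch below is a deliberate simplification: it scans a flat list of candidates rather than traversing the index, it omits the spatial filter and the checkpoint probing, and the function names are assumptions. As in the description above, a candidate is admitted only if its score exceeds the current $\delta$, and $\delta$ is raised once more than $k$ results have been collected.

```python
import heapq
from typing import List, Sequence, Tuple


def local_similarity(a: Sequence[float], b: Sequence[float], eps: float) -> int:
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if abs(x - y) <= eps else 0
        best = max(best, run)
    return best


def top_k_local(q: Sequence[float], candidates: Sequence[Sequence[float]],
                eps: float, k: int) -> Tuple[List[Tuple[int, int]], int]:
    """Return ([(score, candidate index)], final delta), best score first."""
    heap: List[Tuple[int, int]] = []   # min-heap: weakest kept result on top
    delta = 1                          # initial threshold, as in the text
    for idx, t in enumerate(candidates):
        s = local_similarity(q, t, eps)
        if s <= delta:
            continue                   # does not exceed the current threshold
        heapq.heappush(heap, (s, idx))
        if len(heap) > k:
            heapq.heappop(heap)        # evict the weakest result
            delta = heap[0][0]         # score of the k-th best result so far
    return sorted(heap, reverse=True), delta


q = [0] * 6
candidates = [
    [0, 0, 0, 9, 0, 0],   # score 3
    [0, 0, 0, 0, 0, 0],   # score 6
    [9, 0, 9, 0, 9, 0],   # score 1
    [0, 0, 0, 0, 9, 0],   # score 4
    [9, 9, 0, 0, 9, 9],   # score 2
]
results, final_delta = top_k_local(q, candidates, eps=0.5, k=2)
```

After the fourth candidate is processed the threshold rises to 4, so the last candidate (score 2) is rejected without entering the result heap. In the indexed setting, the same rising threshold is what allows checkpoints to be placed more sparsely and more nodes to be pruned.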

5 The SBTSR-tree Index

5.1 Index Structure

The BTSR-tree index uses $k$-means clustering to cluster the time series under each node and then stores the MBTSs of those clusters. However, clustering entire time series typically generates many overlapping MBTSs, incurring much dead space. This has a negative impact on the pruning power of the index, especially when considering local similarities. Figure 7(a) depicts such a case of six time series indexed in a node. A $k$-means clustering will form the depicted MBTSs, denoted with shaded colors. As a result, the dark area represents the overlap between the resulting MBTSs and actually makes those bounds less tight. Hence, such MBTSs inflate estimates for local similarity bounds, and thus lead to unnecessarily descending further down the index.

To reduce the amount of overlap within the MBTSs of nodes, we introduce an extended version of the BTSR-tree, named SBTSR-tree. The SBTSR-tree attempts to eliminate as much overlap as possible through segmentation of the time series. Figure 7(b) depicts the intuition. If we segment the time series before applying k-means, the resulting MBTSs for each segment tend to be tighter, eliminating the excessive overlap seen in Figure 7(a). The SBTSR-tree is built similarly to the BTSR-tree. The only difference is that the MBTSs of each node are calculated per segment. In this method, we assume a pre-defined number of segments, but segmentation is orthogonal to our problem and can be carried out by applying existing methods like [1]. Ultimately, the SBTSR-tree allows for more aggressive pruning when traversing the index.

Figure 7: Segmenting time series yields tighter MBTSs. (a) Example of a node's MBTSs. (b) Segmenting can eliminate whitespace.
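A small numeric example may help illustrate why per-segment MBTSs can be tighter. The sketch below is an illustration under stated assumptions: the helper names `mbts`, `area`, and `total_area` are hypothetical, an MBTS is modelled simply as the pointwise lower and upper envelope of a group of series, and the groupings are chosen by hand as a stand-in for the output of k-means.

```python
from typing import List, Sequence, Tuple

Series = Sequence[float]
Bound = Tuple[List[float], List[float]]


def mbts(group: List[Series]) -> Bound:
    """Pointwise lower and upper envelope of a group of equal-length series."""
    lower = [min(values) for values in zip(*group)]
    upper = [max(values) for values in zip(*group)]
    return lower, upper


def area(bound: Bound) -> float:
    """Sum of the envelope widths over all timestamps (a proxy for dead space)."""
    lower, upper = bound
    return sum(u - l for l, u in zip(lower, upper))


def total_area(groups: List[List[Series]]) -> float:
    """Total envelope area over a list of groups."""
    return sum(area(mbts(g)) for g in groups)


# Four toy series: similar in pairs, but the pairs differ between the two halves.
a = [0, 0, 0, 0]
b = [0, 0, 10, 10]
c = [10, 10, 0, 0]
d = [10, 10, 10, 10]

# Grouping entire series (hand-picked stand-in for k-means with two clusters).
whole = total_area([[a, b], [c, d]])

# Grouping each half separately: {a, b} and {c, d} first, then {a, c} and {b, d}.
first_half = total_area([[a[:2], b[:2]], [c[:2], d[:2]]])
second_half = total_area([[a[2:], c[2:]], [b[2:], d[2:]]])
segmented = first_half + second_half

print(whole, segmented)
```

In this toy case the per-segment envelopes collapse onto the series themselves, whereas the whole-series envelopes must span the full value range in one of the two halves.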

5.2 Cross-Segment Continuity Via Bit-Vectors

A downside of the segmentation approach is the loss of MBTS continuity across time, which results in MBTSs enclosing different time series in neighboring segments. For example, in Figure 7(b), no MBTS in the right segment contains exactly the same time series as an MBTS in the left segment, a fact which hinders the calculation of local similarity on the segment boundaries (the vertical line). To overcome this, we introduce a bit-vector along each MBTS of a segment, having one bit for each MBTS created. If a bit in the bit-vector of a given MBTS is set in the current segment, this indicates that this MBTS encloses at least one common time series with the corresponding MBTS in the next segment. In the example shown in Figure 7(b), the bit-vector of one MBTS indicates common time series with two MBTSs in the next segment, while that of another signifies common time series with only one. This way, to calculate local similarity, we can easily identify all the MBTSs that share common time series between two successive segments.
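The bit-vector construction can be sketched as follows. This is a minimal illustration, assuming that each MBTS is represented by the set of identifiers of the time series it encloses and that bit j of a vector refers to the j-th MBTS of the next segment; the function names `link_bit_vectors` and `linked` are hypothetical.

```python
from typing import List, Set


def link_bit_vectors(current: List[Set[str]], following: List[Set[str]]) -> List[int]:
    """For each MBTS in the current segment, build an integer bit-vector whose
    bit j is set when it shares at least one series with MBTS j of the next segment."""
    vectors = []
    for members in current:
        bits = 0
        for j, other in enumerate(following):
            if members & other:          # non-empty intersection of member sets
                bits |= 1 << j
        vectors.append(bits)
    return vectors


def linked(bits: int, count: int) -> List[int]:
    """Indices of the next-segment MBTSs flagged in a bit-vector."""
    return [j for j in range(count) if (bits >> j) & 1]


# Toy example: two MBTSs per segment, with series regrouped across the boundary.
segment_1 = [{"a", "b"}, {"c"}]
segment_2 = [{"a"}, {"b", "c"}]
vectors = link_bit_vectors(segment_1, segment_2)
print([linked(v, len(segment_2)) for v in vectors])
```

When a scan crosses a segment boundary, only the MBTSs returned by `linked` need to be examined in the neighboring segment.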

To evaluate LS-queries, traversal of the SBTSR-tree index follows a similar rationale to the procedure in Section 4.2. For each checkpoint, we first obtain the segment in which it falls, and we scan each MBTS leftward and rightward from the checkpoint, as discussed in Section 4.2. If we cross the border to another segment, the available bit-vectors directly identify the MBTSs that need to be examined in this neighboring segment. This propagates until the local similarity constraints are satisfied. Figure 8 illustrates an example of a node verification. Let us consider a predetermined number of three segments and the corresponding MBTSs of each segment for that node. Suppose that there exists a checkpoint in the second segment. To verify whether this node satisfies the local similarity constraints, we start from the checkpoint and check leftwards whether the similarity condition holds for each timestamp. If the currently examined timestamp falls in the first segment, we fetch the corresponding MBTSs and bit-vectors and continue checking the condition in both MBTSs (green shaded), as their bit-vectors both indicate common members with the first one in segment 2. A similar procedure is followed rightwards, where we only have to check the first MBTS, according to the bit-vectors.

Figure 8: Example of verifying an SBTSR-tree node.
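The leftward and rightward scan from a checkpoint can be sketched for a single MBTS as follows. This is a simplified illustration under stated assumptions and not the procedure of Section 4.2 itself: the names `within`, `may_qualify`, `eps`, and `delta` are hypothetical, the similarity condition is assumed to be that the query value lies within `eps` of the MBTS band, the length constraint is assumed to require at least `delta` consecutive timestamps, and segment boundaries are not modelled.

```python
from typing import Sequence


def within(q: float, lo: float, hi: float, eps: float) -> bool:
    """True if the query value lies within eps of the band [lo, hi]."""
    return lo - eps <= q <= hi + eps


def may_qualify(query: Sequence[float], lower: Sequence[float],
                upper: Sequence[float], checkpoint: int,
                eps: float, delta: int) -> bool:
    """Check whether a run of at least delta consecutive timestamps that
    contains the checkpoint keeps the query within eps of the band."""
    if not within(query[checkpoint], lower[checkpoint], upper[checkpoint], eps):
        return False
    length = 1
    t = checkpoint - 1                   # scan leftwards
    while t >= 0 and length < delta and within(query[t], lower[t], upper[t], eps):
        length += 1
        t -= 1
    t = checkpoint + 1                   # scan rightwards
    while t < len(query) and length < delta and within(query[t], lower[t], upper[t], eps):
        length += 1
        t += 1
    return length >= delta


query = [0, 0, 5, 5, 5, 0]
lower = [9, 9, 4, 4, 4, 9]
upper = [10, 10, 6, 6, 6, 10]
print(may_qualify(query, lower, upper, checkpoint=3, eps=0.5, delta=3))
```

The scan stops as soon as the required length is reached, so only a few timestamps around each checkpoint need to be inspected.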

6 Experimental Evaluation

Next, we report results from a comprehensive evaluation of our methods against real-world datasets.

6.1 Experimental Setup

6.1.1 Datasets

We use three real-world datasets (Table 1) selected from different application domains, containing diverse types of geolocated time series, as detailed below:

Dataset   Area (km²)   Number of locations   Length of time series   Default query parameters
Flickr    Earth        414,967               96                      30%   7.5%   20   30
Crime     392,000      362,215               76                      30%   7.5%   25   30
Taxi      2,500        417,960               168                     30%   10%    20   30
Table 1: Datasets and parameters used in the experiments.

UK historical crime data (Crime). Contains time series representing the temporal variation in the number of crime incidents reported across England and Wales over 76 months (December 2010 to March 2017). We generated time series over a grid with cell size 200 meters applied on the original data (https://data.police.uk/data/). For each month, we counted the incidents located within each cell.

Flickr geotagged photos (Flickr). Contains time series data extracted from geolocated Flickr images between 2006 and 2013 over the entire planet (https://code.flickr.net/category/geo/). To get meaningful geolocated time series, we partitioned the space by a uniform grid of cells (each one spanning the same extent in decimal degrees in each dimension) and counted the number of photos contained in every cell each month. We excluded empty cells (e.g., in the oceans). Each time series conveys the visit pattern (in terms of number of photos taken per month) of that region over this period.

NYC taxi drop-offs (Taxi). Contains time series extracted from yellow taxi rides in New York City during 2015. The original data (http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml) provide pick-up and drop-off locations, as well as corresponding timestamps for each ride. For each month, we generated time series by applying a uniform spatial grid over the entire city (cell side was 200 meters) and counting all drop-offs therein for each day of the week at the time granularity of one hour. Thus, we obtained the number of drop-offs for 168 (7 × 24) time intervals in every cell, which essentially captures the weekly fluctuation of taxi destinations there. Without loss of generality, the centroid of each cell is used as the geolocation of the corresponding time series.
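The aggregation described for the Taxi dataset can be sketched as below. This is a minimal illustration and not the authors' preprocessing code: the function names `weekly_series` and `centroid` are hypothetical, drop-off coordinates are assumed to be already projected to meters, and all input records are aggregated together (any per-month split is left out).

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple

CELL_SIZE_M = 200.0      # grid cell side, as stated in the text
SLOTS = 7 * 24           # one slot per hour of each day of the week

Cell = Tuple[int, int]


def weekly_series(dropoffs: Iterable[Tuple[float, float, datetime]]) -> Dict[Cell, List[int]]:
    """Count drop-offs per grid cell and per hour-of-week slot."""
    series: Dict[Cell, List[int]] = defaultdict(lambda: [0] * SLOTS)
    for x, y, ts in dropoffs:
        cell = (int(x // CELL_SIZE_M), int(y // CELL_SIZE_M))
        slot = ts.weekday() * 24 + ts.hour   # Monday 00:00 maps to slot 0
        series[cell][slot] += 1
    return dict(series)


def centroid(cell: Cell) -> Tuple[float, float]:
    """Centroid of a grid cell, used as the location of its time series."""
    return ((cell[0] + 0.5) * CELL_SIZE_M, (cell[1] + 0.5) * CELL_SIZE_M)


records = [
    (10.0, 10.0, datetime(2015, 1, 5, 8, 30)),    # a Monday morning
    (150.0, 190.0, datetime(2015, 1, 5, 8, 59)),  # same cell, same hour
    (250.0, 10.0, datetime(2015, 1, 11, 23, 0)),  # a Sunday night, next cell
]
result = weekly_series(records)
print(sorted(result), centroid((0, 0)))
```

Each resulting list has 168 entries, matching the time series length reported for this dataset in Table 1.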

Synthetic. To test scalability, we augmented the Flickr dataset by slightly moving each location in a random manner and altering each time series value by a random number within a bounded range. We produced three additional synthetic datasets, containing two, three, and four times the number of time series of the original dataset.
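One possible way to generate such augmented datasets is sketched below. It is an assumption-laden illustration: the function name `augment` and the parameters `max_shift` and `max_noise` are hypothetical, and uniform perturbations are assumed since the exact distribution and ranges used in the experiments are not reproduced here.

```python
import random
from typing import List, Tuple

Entry = Tuple[Tuple[float, float], List[float]]   # (location, time series values)


def augment(dataset: List[Entry], copies: int, max_shift: float,
            max_noise: float, seed: int = 0) -> List[Entry]:
    """Return the original entries plus `copies` perturbed replicas of each one."""
    rng = random.Random(seed)            # fixed seed for reproducibility
    out = list(dataset)
    for _ in range(copies):
        for (x, y), values in dataset:
            location = (x + rng.uniform(-max_shift, max_shift),
                        y + rng.uniform(-max_shift, max_shift))
            noisy = [v + rng.uniform(-max_noise, max_noise) for v in values]
            out.append((location, noisy))
    return out


base = [((0.0, 0.0), [1.0, 2.0, 3.0]), ((5.0, 5.0), [4.0, 4.0, 4.0])]
four_times = augment(base, copies=3, max_shift=0.1, max_noise=0.5)
print(len(four_times))
```

With `copies` set to 1, 2, and 3, the output is two, three, and four times the size of the input, respectively.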

6.1.2 Index and Query Parameters

To ensure that the performance benefits observed in the experiments stem only from pruning, we fixed the index parameters. The minimum and maximum number of entries stored in each node are set to fixed values. For both BTSR-tree and SBTSR-tree, the number of MBTSs is set to 10 and, for SBTSR-tree, the number of segments is also set to 10. The query parameters involve the spatial distance threshold and the local similarity thresholds. The values of these parameters are set differently for each dataset, based on their characteristics; default values are shown in Table 1. The value of the spatial distance threshold is set relatively, by setting the covered area as a percentage of the total area. Similarly, the value-difference threshold is set as a percentage of the maximum difference between the observed values.
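The relative parameter settings can be turned into absolute values along the following lines. This is only a sketch under explicit assumptions: the function names are hypothetical, the value-difference threshold is taken as a fraction of the spread between the largest and smallest observed values, and the spatial range is assumed to be circular, so that the radius follows from the covered area.

```python
import math
from typing import Iterable


def value_threshold(observed: Iterable[float], fraction: float) -> float:
    """Fraction of the maximum difference between the observed values."""
    values = list(observed)
    return fraction * (max(values) - min(values))


def radius_from_area(total_area: float, fraction: float) -> float:
    """Radius of a circle covering the given fraction of the total area
    (assumes a circular spatial range, which the text does not state)."""
    return math.sqrt(fraction * total_area / math.pi)


eps = value_threshold([2.0, 10.0, 4.0], 0.075)   # 7.5% of a spread of 8
radius = radius_from_area(2500.0, 0.30)          # 30% of 2,500 km² (Taxi row of Table 1)
print(eps, radius)
```

Expressing both thresholds relative to the data keeps the default settings comparable across datasets with very different extents and value ranges.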

6.1.3 Evaluation Setting

Each experiment is performed using a randomly selected workload of 100 queries for each dataset, and we report the average response time. All indices are held in memory, while the leaves contain pointers to files with geolocated time series stored on disk. All methods were developed in Java. Tests were executed on a server with 4 CPUs, each containing 8 cores clocked at 2.13GHz, and 256 GB RAM running Debian Linux.

6.2 Query Performance

We compare the average per-query execution time for all three queries using the sweep line and checkpoint methods on BTSR-tree and the checkpoint method on SBTSR-tree.

Figure 9: Query performance for varying thresholds. Panels: (b), (c) Crime; (d), (e) Flickr; (f), (g) Taxi.

Figure 10: Per column, one query type for a varying parameter, with panels (b)-(d) Crime, (f)-(h) Flickr, and (j)-(l) Taxi; remaining panels: scalability.

6.2.1

Figure 9 illustrates the query performance for two varying thresholds, and the first column of Figure 10 for a third, on all three datasets. It is apparent that the SBTSR-tree with the checkpoint approach outperforms the rest in all cases. Its superior pruning power is attributed to the segmentation, which yields tighter bounds within the nodes and consequently fewer disk accesses. The sweep line and checkpoint methods over BTSR-tree perform similarly in all cases. Both methods access the same nodes, but the checkpoint approach needs to examine significantly fewer values across time in order to determine local similarities. However, since all local similarity calculations take place in memory, this computation cost makes little difference compared to the savings from the fewer node accesses required with the SBTSR-tree.

More specifically, for the Crime dataset, relaxing the spatial distance threshold (Figure 9(b)) has a negative impact on all three methods, as more nodes have to be accessed and pruning depends mostly on the local similarity thresholds. SBTSR-tree increasingly outperforms the rest as the spatial distance threshold increases, due to its more aggressive pruning on local similarity. For the case of an increasing value-difference threshold (Figure 9(c)), the result is the opposite, as this relaxes the local similarity parameter and more nodes get accessed. For very large values, pruning is solely based on spatial distance and all approaches perform similarly. Finally, increasing the minimum subsequence length threshold (Figure 10(b)) also increases the difference in performance among the three approaches, while it also reduces the average query response time. This is due to large numbers of subsequences qualifying for small values of this threshold, resulting in more node accesses. As the threshold increases, pruning improves more rapidly in the case of SBTSR-tree due to its tighter bounds.

The results are similar but with larger differences for the Flickr dataset (Figures 9(d), 9(e) and 10(f)). Intuitively, the less periodicity in a dataset, the greater the benefit from segmentation; if the time series in the dataset exhibit periodicity, the bounds obtained by applying k-means clustering on the whole sequences will be relatively tighter than otherwise. The Flickr dataset, due to its nature, is more random than the Crime dataset, which explains the larger differences. This explanation is also supported by the results for the Taxi dataset, illustrated in Figures 9(f), 9(g) and 10(j). Despite a similar behavior when varying all thresholds, the differences in average query response time among the different approaches are smaller than in the Crime and Flickr datasets, due to the high daily periodicity of taxi drop-offs.

6.2.2

Figures 10(c), 10(g) and 10(k) depict the results for this query on the three datasets. As k increases, more nodes have to be traversed in order to fetch the additional results, and the execution time increases for all methods. Nevertheless, SBTSR-tree still clearly outperforms the other two algorithms.

6.2.3

Finally, Figures 10(d), 10(h) and 10(l) depict the results for this query. In this case, the performance deterioration as k increases is less abrupt, especially for the Crime dataset, as the top-k results are usually located close to each other and are retrieved quickly. Again, the largest and smallest differences are observed on the Flickr and Taxi datasets, respectively.

6.3 Scalability

We performed a scalability evaluation for all three queries using the Flickr-based synthetic datasets, again measuring the average query response time for the same query workload. The results for increasing dataset size (up to four times) are depicted in Figure 10. In all cases, the SBTSR-tree-based approach scales better, especially in the top-k queries (Figures 10(i) and 10(m)), where the larger difference observed in Figures 10(g) and 10(h) is further amplified.

7 Conclusions

We have studied three variants of hybrid queries on geolocated time series, involving both range and top-k search, and combining spatial distance with local time series similarity. The latter measures the similarity of time series over subsequences instead of their entire length, and thus enables the identification of more fine-grained trends and patterns. The queries are evaluated by hybrid index structures, in order to allow for simultaneous pruning by both criteria. We first discussed query evaluation using the previously proposed BTSR-tree, and then extended it to derive the SBTSR-tree, which exhibits even better performance by using temporal segmentation of time series to derive tighter bounds. Our evaluation against several real-world datasets has shown that SBTSR-tree can compute results much faster for all query variants.

References

  • [1] E. Bingham, A. Gionis, N. Haiminen, H. Hiisilä, H. Mannila, and E. Terzi (2006) Segmentation and dimensionality reduction. In SIAM, pp. 372–383.
  • [2] A. Camerra, T. Palpanas, J. Shieh, and E. J. Keogh (2010) iSAX 2.0: indexing and mining one billion time series. In ICDM, pp. 58–67.
  • [3] A. Camerra, J. Shieh, T. Palpanas, T. Rakthanmanon, and E. J. Keogh (2014) Beyond one billion time series: indexing and mining very large time series collections with iSAX2+. Knowl. Inf. Syst. 39 (1), pp. 123–151.
  • [4] K. Chan and A. W. Fu (1999) Efficient time series matching by wavelets. In ICDE, pp. 126–133.
  • [5] G. Chatzigeorgakidis, K. Patroumpas, D. Skoutas, S. Athanasiou, and S. Skiadopoulos (2018) Scalable hybrid similarity join over geolocated time series. In SIGSPATIAL, pp. 119–128.
  • [6] G. Chatzigeorgakidis, D. Skoutas, K. Patroumpas, S. Athanasiou, and S. Skiadopoulos (2017) Indexing geolocated time series data. In SIGSPATIAL, pp. 19:1–19:10.
  • [7] L. Chen, G. Cong, C. S. Jensen, and D. Wu (2013) Spatial keyword query processing: an experimental evaluation. PVLDB 6 (3), pp. 217–228.
  • [8] K. Echihabi, K. Zoumpatianos, T. Palpanas, and H. Benbrahim (2018) The lernaean hydra of data series similarity search: an experimental evaluation of the state of the art. PVLDB 12 (2), pp. 112–127.
  • [9] A. Guttman (1984) R-trees: a dynamic index structure for spatial searching. In SIGMOD, pp. 47–57.
  • [10] H. Kondylakis, N. Dayan, K. Zoumpatianos, and T. Palpanas (2018) Coconut: a scalable bottom-up approach for building data series indexes. PVLDB 11 (6), pp. 677–690.
  • [11] J. Lin, E. J. Keogh, L. Wei, and S. Lonardi (2007) Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov. 15 (2), pp. 107–144.
  • [12] M. Linardi and T. Palpanas (2018) Scalable, variable-length similarity search in data series: the ulisse approach. PVLDB 11 (13), pp. 2236–2248.
  • [13] M. Linardi, Y. Zhu, T. Palpanas, and E. J. Keogh (2018) VALMOD: a suite for easy and exact detection of variable length motifs in data series. In SIGMOD, pp. 1757–1760.
  • [14] B. Peng, P. Fatourou, and T. Palpanas (2018) ParIS: the next destination for fast data series indexing and query answering. In IEEE BigData.
  • [15] T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria, and E. Keogh (2012) Searching and mining trillions of time series subsequences under dynamic time warping. In SIGKDD, pp. 262–270.
  • [16] N. Roussopoulos, S. Kelley, and F. Vincent (1995) Nearest neighbor queries. In SIGMOD, pp. 71–79.
  • [17] J. Shieh and E. J. Keogh (2008) iSAX: indexing and mining terabyte sized time series. In SIGKDD, pp. 623–631.
  • [18] D. Yagoubi, R. Akbarinia, F. Masseglia, and T. Palpanas (2018) Massively distributed time series indexing and querying. TKDE (to appear).
  • [19] C. M. Yeh, Y. Zhu, L. Ulanova, N. Begum, Y. Ding, H. A. Dau, D. F. Silva, A. Mueen, and E. Keogh (2016) Matrix profile i: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In ICDM.
  • [20] C. M. Yeh, Y. Zhu, L. Ulanova, N. Begum, Y. Ding, H. A. Dau, Z. Zimmerman, D. F. Silva, A. Mueen, and E. J. Keogh (2018) Time series joins, motifs, discords and shapelets: a unifying view that exploits the matrix profile. Data Min. Knowl. Discov. 32 (1), pp. 83–123.
  • [21] K. Zoumpatianos, S. Idreos, and T. Palpanas (2014) Indexing for interactive exploration of big data series. In SIGMOD, pp. 1555–1566.