Discovering Subdimensional Motifs of Different Lengths in Large-Scale Multivariate Time Series

11/20/2019
by   Yifeng Gao, et al.
George Mason University

Detecting repeating patterns of different lengths in time series, also called variable-length motifs, has received a great amount of attention from researchers and practitioners. Despite the significant progress that has been made in recent single dimensional variable-length motif discovery work, detecting variable-length subdimensional motifs, patterns that simultaneously occur in only a subset of dimensions of a multivariate time series, remains a difficult task. The main challenge is scalability. On the one hand, the brute-force enumeration solution, which searches for motifs of all possible lengths, is very time consuming even for single dimensional time series. On the other hand, previous work shows that index-based fixed-length approximate motif discovery algorithms such as random projection are not suitable for detecting variable-length motifs due to their memory requirements. In this paper, we introduce an approximate variable-length subdimensional motif discovery algorithm called Collaborative HIerarchy based Motif Enumeration (CHIME) to efficiently detect variable-length subdimensional motifs, given a minimum motif length, in large-scale multivariate time series. We show that the memory cost of the approach is significantly smaller than that of random projection. Moreover, the speed of the proposed algorithm is significantly faster than that of state-of-the-art algorithms. We demonstrate that CHIME can efficiently detect meaningful variable-length subdimensional motifs in large real world multivariate time series datasets.



I Introduction

Detecting repeating patterns of various lengths, also called variable-length motifs, in time series has received a great amount of attention [1][2][3][4]. Since motifs of different lengths can naturally co-exist in a time series, detecting variable-length motifs is often a necessary step for many real-world applications such as classification [2], anomaly detection [5], and data visualization [6].

In contrast to the significant progress that has been made in recent single dimensional variable-length motif discovery work [2][3][7], little progress has been made in detecting variable-length subdimensional motifs [8][9], i.e., patterns that simultaneously occur in only a subset of all dimensions of a multivariate time series.

Existing approaches [10][8][9] to subdimensional motif discovery still only detect motifs of a specified length, possibly suggested by domain experts. These approaches may fit well in applications where domain knowledge is available and a good motif length can be specified by the user; we instead aim at solving the problem in a more general setting, in which the correct motif length is not known or motifs of various lengths co-exist in the data.

We illustrate the limitation of fixed-length motif discovery in Figure 1. In the figure, there are two subdimensional motifs of lengths 200 and 400, respectively. The first subdimensional motif (labeled in red with a green box) occurs in the first two dimensions, while the second subdimensional motif occurs in a different subset of dimensions. Since the motifs have different lengths, even in the best case, existing (fixed-length) subdimensional motif discovery approaches [10][8][9] can only detect one of these motifs correctly, and only if the proper length is provided. When the lengths are unknown, they would need to try different lengths, a process that is very time consuming. To overcome this limitation, our proposed approach is designed to discover motifs of various lengths in a single run; that is, all subdimensional motifs can be discovered even with significant length differences between them.

Fig. 1: Example of two subdimensional motifs. One, labeled with a red line, has length 200; the other has length 400. Each occurs in a different subset of dimensions.

Subdimensional motifs of different lengths widely exist in real world time series data. For example, the Caltrans Performance Measurement System (PeMS) dataset [11], which records traffic information in major California cities over time, can contain patterns at different spatial and temporal scales, ranging from rush-hour traffic patterns lasting a few hours to a stable 5-day weekly pattern. In meteorological data analysis, a motif with a duration of several hours may benefit short-term forecasting, whereas a motif that lasts several months may represent some seasonal phenomenon [9]. These motifs can co-exist in the same time series even though they differ significantly in length and span different dimensions. Detecting them is often necessary for analyzing the data [2][12].

The main challenge in detecting variable-length subdimensional motifs is scalability. The brute-force enumeration solution, which searches for motifs of all possible lengths, is very time consuming. While there exist search techniques for detecting single dimensional variable-length motifs [2][12][13], they cannot be generalized to subdimensional motif discovery since they do not address the problem of searching for the relevant dimensions. As a result, how to efficiently detect variable-length subdimensional motifs remains a difficult problem.

In this paper, we introduce an algorithm called Collaborative HIerarchy based Motif Enumeration (CHIME) to efficiently detect variable-length subdimensional motifs given a minimum motif length. To that end, CHIME combines repeating symbols detected from each dimension, and uses the symbol matching results in all dimensions collaboratively to avoid redundant searches. While the algorithm does not guarantee an exact solution, it can find motifs over a considerably large length range in multivariate time series with high accuracy. Even on a fifty-dimensional time series of length one million, the algorithm executes in less than one hour. This is significantly faster than the state-of-the-art approaches, which would take days to detect motifs of a single fixed length. The efficient search makes variable-length subdimensional motif discovery in large-scale multivariate time series a feasible task. To summarize, our work has the following contributions:

  • The algorithm can discover subdimensional motifs of a large length range.

  • The speed is significantly faster than that of the state-of-the-art algorithms.

  • Even though CHIME is an approximate algorithm, the experiments show that the algorithm can detect motifs with high accuracy.

  • The experiments also show that CHIME can discover meaningful motifs in real world datasets, and can potentially benefit other multivariate time series data mining tasks such as classification.

The rest of the paper is organized as follows: Sec. II discusses related work and challenges in detecting variable-length subdimensional motifs. Sec. III introduces the problem definition and notations used in the paper. Sec. IV describes the discretization technique and introduces our algorithm. The experimental results are shown in Sec. V, and we conclude in Sec. VI.

II Related Work

We start by introducing recent fixed-length multivariate time series motif discovery work.

Tanaka et al. [14] introduce a smart approach that compresses a multivariate time series into a univariate time series to detect motifs at a low cost. Minnen et al. [10] and Berlin et al. [15] utilize density-based approaches to locate potential motif regions. All three approaches above, however, are designed for finding patterns that match in all dimensions; therefore, their performance degrades if irrelevant dimensions exist.

Another work by Minnen et al. [8] is perhaps the first to point out that finding patterns that match in all dimensions is not very useful in some real-world applications. The authors show that explainable motifs often occur in only a subset of the dimensions. They introduce a fast approach that utilizes random projection and the Symbolic Aggregate approXimation (SAX) representation to efficiently find approximate subdimensional motifs. Following a similar idea, Wang et al. [16] introduce a subdimensional motif discovery approach based on constructing a suffix tree. Recently, Yeh et al. [9] introduced an algorithm named mSTAMP that utilizes state-of-the-art motif discovery results along with a Minimum Description Length (MDL) metric to locate subdimensional motifs. The approach can detect high-quality, meaningful subdimensional motifs in many real-world datasets.

All the algorithms described above require the user to pre-define the motif length. However, recent work has shown that detecting motifs of different, possibly unknown lengths is necessary in many real-world applications [3][13][2][12]. We next discuss recent variable-length subdimensional motif discovery algorithms.

Presently, work on variable-length subdimensional motif discovery is very limited. We found only one such work, by Balasubramanian et al. [6]. The authors utilize a grammar induction based framework [1] to find variable-length subdimensional motifs in health-care time series. However, their approach requires storing every combination of the co-occurring single dimensional motifs in memory. So while it can achieve high accuracy, the approach is only suitable for low-dimensional time series. In contrast, our proposed algorithm is able to scale to high-dimensional, large time series.

III Notation & Definition

We start with fundamental definitions related to time series:

A Single Dimensional (Univariate) Time Series is a set of observations ordered by time.

A Multivariate Time Series is a set of co-evolving single dimensional time series.

A Subsequence of a multivariate time series is a contiguous set of points in one of its univariate time series, starting from a given position and having a given length. Typically, the subsequence is much shorter than the full time series.

Subsequences can be extracted from a time series via a sliding window. In many applications, we are interested in finding similar “shapes.” Therefore, motif discovery is more meaningful when it is offset- and amplitude-invariant. This can be achieved by normalizing each subsequence prior to the search for motifs. Z-normalization is a procedure that normalizes the mean and standard deviation of all points in the subsequence to zero and one, respectively.
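As a minimal illustration (our own sketch, not the authors' implementation), z-normalization of a single subsequence can be written as follows; the small epsilon guard for near-constant subsequences is our assumption:

```python
import numpy as np

def z_normalize(subsequence, eps=1e-8):
    """Shift and scale a subsequence to zero mean and unit standard deviation."""
    x = np.asarray(subsequence, dtype=float)
    std = x.std()
    if std < eps:                      # near-constant subsequence: avoid dividing by ~0
        return x - x.mean()
    return (x - x.mean()) / std
```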

Given a start and an end position, a Multidimensional Subsequence consisting of the co-evolving subsequences from all dimensions can be extracted; its dimension set contains all possible dimensions.

As demonstrated in previous work [8][9], in many cases only a subset of the co-evolving subsequences is relevant (i.e., contains repeating patterns). We therefore adopt the concept of subdimensional subsequence [9] used in previous work.

A Subdimensional Subsequence is the set of subsequences, among the co-evolving subsequences that start and end at the same positions in the multivariate time series, that belong to the relevant dimensions; an indicator vector stores the list of relevant dimensions, and the size of this list is the number of relevant dimensions.

We next introduce the definition of subdimensional motif used in this work. Given a distance function and two subdimensional subsequences with the same indicator vector and the same length, a distance vector records the distance between the two subsequences in each relevant dimension. In this work, we define a Subdimensional Motif to be a set of subdimensional subsequences such that, for each of them, the average value of the distance vector between it and a seed subdimensional subsequence is less than a motif threshold function of the motif length, where the motif length is at least the minimum motif length. Each such subdimensional subsequence is said to be an instance of the subdimensional motif.

III-A Problem Definition

Finally, we introduce the problem that the proposed approach aims to address.

Variable-Length Subdimensional Time Series Motif Discovery Problem: Given a minimum motif length and a multivariate time series, find all variable-length subdimensional time series motifs that exist in the time series.

Clearly, finding an exact solution for this problem in a large-scale multivariate time series is costly. However, discovering even a subset of variable-length motifs can already benefit many real-world applications compared with fixed-length motifs [2][1][7]. Therefore, given a minimum motif length, our proposed algorithm aims to find an approximate set of high quality variable-length motifs with lengths no smaller than the minimum, and rank them based on some interest measure.

It is worth noting that determining a general interest measure for the variable-length subdimensional motif discovery problem is non-trivial [9][8][12]. Long motifs are likely to have larger distances than short motifs even after normalizing the distances by length, and motifs that span fewer dimensions may contain more interesting local patterns than those spanning most or all of the dimensions. Therefore, the ranking of subdimensional motifs should depend on the application. The reader may refer to [9][8][12] for details on various motif evaluation approaches for different goals.

In our experiments, since most state-of-the-art approaches evaluate motifs based on similarity, when comparing with those techniques we rank the motifs based on the most similar pair of subsequences within each detected motif, and we include the execution time of the pairwise comparison [17][2][9].

IV Proposed Method

In this section, we first describe Symbolic Aggregate approXimation (SAX), the discretization approach we use to represent each subsequence. Then we describe the Multivariate Time Series Numerosity Reduction process, the pre-processing step that aims to remove neighboring, similar subsequences to avoid over-counting a pattern. We then introduce the proposed method.

IV-A Symbolic Aggregate approXimation (SAX)

Discretization [18] is a common step in many time series motif discovery algorithms [18][3][19][8][6][1], as it often significantly improves their efficiency. In this section, we describe Symbolic Aggregate approXimation (SAX), a popular technique used to discretize univariate time series.

Given a normalized subsequence and a reduced dimension (PAA) size, SAX converts the raw subsequence into a lower-dimensional PAA (Piecewise Aggregate Approximation) representation by dividing the subsequence into equal-sized windows and computing the average value of the points within each window. Intuitively, the PAA coefficient vector [18] consists of the average values of these equal-sized segments of the input subsequence, and can be considered an approximate representation of the original subsequence.

Given a PAA coefficient vector, the algorithm then maps each coefficient to a symbol from an alphabet of a given size according to a breakpoint table [18], defined such that the regions are approximately equi-probable under a Gaussian distribution. This maximizes the chance that the symbols occur with approximately equal probability. The resulting symbols form a SAX word. Figure 2 summarizes the process. The bold flat lines represent the values of the PAA coefficients computed from their respective segments of the subsequence; the breakpoint table for alphabet sizes from 2 to 4 is also shown in the figure. Since the alphabet size is set to 3 in the example, the second column of the table gives the breakpoints, based on which three regions are generated. PAA coefficients falling into these three regions are represented by the symbols a, b, and c, respectively. In this example, the SAX word abca is formed.

Fig. 2: Example of generating the SAX word “abca” with PAA size 4 and alphabet size 3.
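For concreteness, the PAA-plus-breakpoint mapping described above can be sketched as below. This is an illustrative implementation, not CHIME's code; the helper names are ours, and the breakpoints are the standard Gaussian quantiles from [18].

```python
import numpy as np
from scipy.stats import norm

def paa(subsequence, w):
    """Piecewise Aggregate Approximation: the means of w (near) equal-sized segments."""
    segments = np.array_split(np.asarray(subsequence, dtype=float), w)
    return np.array([seg.mean() for seg in segments])

def sax_word(subsequence, w, a):
    """Map a z-normalized subsequence to a SAX word of length w over an alphabet of size a."""
    breakpoints = norm.ppf(np.arange(1, a) / a)      # equi-probable regions under N(0, 1)
    symbols = np.searchsorted(breakpoints, paa(subsequence, w))
    return "".join(chr(ord("a") + s) for s in symbols)

# With w = 4 and a = 3 the breakpoints are roughly [-0.43, 0.43]: PAA coefficients below
# -0.43 map to 'a', those between the breakpoints to 'b', and those above 0.43 to 'c',
# which reproduces words such as "abca" in Figure 2.
```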

IV-A1 Fast Computation of SAX for Subsequences of Any Length

Since the proposed approach uses SAX words heavily, it is very important to reduce the time cost of discretization. We next describe a fast algorithm to compute the SAX word for a univariate subsequence of any length [3]. We first pre-compute two vectors of statistical features for every dimension of the input multivariate time series. Given a subsequence of a given length in a given dimension, its SAX representation can then be computed by Algorithm 1. In the algorithm, the mean and variance of the subsequence are computed in constant time (Lines 3-4). The cost of computing the PAA coefficients (Lines 5-7) is proportional to the PAA size and does not depend on the subsequence length. The PAA coefficients are converted to a SAX word based on the pre-defined breakpoint table described in the previous subsection, which again takes time independent of the subsequence length. The overall cost of the whole process is therefore independent of the subsequence length. Since the statistics vectors can be computed during pre-processing in a single pass over the time series, the cost of computing a SAX word for a subsequence of arbitrary length during the motif discovery process does not depend on the subsequence length. As demonstrated in [18][3], the PAA size and alphabet size should be very small compared with the subsequence length, so the time cost of computing a SAX word is significantly reduced.

1:Input: precomputed statistics vectors of the given dimension, PAA size, subsequence (start position, length, dimension)
2:Output: SAX word of the subsequence
3:compute the mean of the subsequence from the precomputed statistics
4:compute the standard deviation of the subsequence from the precomputed statistics
5:initialize the PAA coefficient vector
6:for every PAA segment do
7:   compute the z-normalized segment average from the precomputed statistics {using the start and end point of each PAA segment}
8:end for
9:return ConvertPAAToSAX(PAA coefficient vector)
Algorithm 1 Fast SAX Computation (FastSAX)
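The sketch below illustrates the idea behind Algorithm 1 under one common realization: the precomputed statistics are assumed to be cumulative sums of the values and of their squares (our assumption, consistent with the constant-time mean and variance in Lines 3-4), so that each SAX word costs time proportional to the PAA size regardless of the subsequence length.

```python
import numpy as np
from scipy.stats import norm

class FastSAX:
    """Length-independent SAX computation for one dimension (illustrative sketch)."""

    def __init__(self, series, w, a):
        x = np.asarray(series, dtype=float)
        self.S = np.concatenate(([0.0], np.cumsum(x)))          # S[i]  = sum of x[:i]
        self.S2 = np.concatenate(([0.0], np.cumsum(x ** 2)))    # S2[i] = sum of x[:i]**2
        self.w = w
        self.breakpoints = norm.ppf(np.arange(1, a) / a)

    def _segment_mean(self, i, j):
        return (self.S[j] - self.S[i]) / (j - i)

    def sax(self, start, length):
        """SAX word of the z-normalized subsequence series[start:start+length];
        assumes length >= w so every PAA segment is non-empty."""
        end = start + length
        mean = self._segment_mean(start, end)
        var = (self.S2[end] - self.S2[start]) / length - mean ** 2
        std = np.sqrt(max(var, 1e-12))
        # one PAA coefficient per segment, z-normalized on the fly (Lines 5-7)
        bounds = np.linspace(start, end, self.w + 1).round().astype(int)
        coeffs = [(self._segment_mean(bounds[k], bounds[k + 1]) - mean) / std
                  for k in range(self.w)]
        symbols = np.searchsorted(self.breakpoints, coeffs)     # ConvertPAAToSAX (Line 9)
        return "".join(chr(ord("a") + s) for s in symbols)
```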

IV-B PAA Distance-based Multivariate Time Series Numerosity Reduction

In practice, neighboring subsequences extracted via a sliding window are similar to each other since they are off by only one point. To avoid over-counting a pattern, and to allow variable-length pattern discovery, numerosity reduction is often used to further compress the word sequence in previous univariate variable-length motif discovery work [1][3][7].

Concretely, given a multivariate time series, and similar to previous work [1][3][7], the PAA distance-based numerosity reduction process conducts a left-to-right scan through the sliding-window subsequences and only records a multivariate subsequence when its PAA distance to the most recently recorded subsequence is greater than a threshold in at least one dimension. Given two subsequences, the PAA distance is the distance between their PAA coefficient vectors multiplied by a compensating factor; it has been shown to lower-bound the actual Euclidean distance [18], and the reader can refer to [18] for a detailed explanation. Similar to previous work [7][3], we use a fixed threshold when computing the PAA distance.
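A minimal sketch of this scan is shown below. The data layout (a precomputed array of PAA coefficients per dimension and sliding-window position) and the explicit `threshold` parameter are our assumptions; the PAA distance itself follows the lower-bounding formula of [18].

```python
import numpy as np

def paa_distance(p1, p2, n, w):
    """PAA distance between two length-n subsequences given their w PAA coefficients;
    it lower-bounds the Euclidean distance between the raw subsequences [18]."""
    return np.sqrt(n / w) * np.linalg.norm(np.asarray(p1) - np.asarray(p2))

def numerosity_reduction(paa_coeffs, n, threshold):
    """Left-to-right scan that keeps a window only when its PAA distance to the most
    recently kept window exceeds `threshold` in at least one dimension.

    paa_coeffs: array of shape (D, num_windows, w) with the PAA coefficients of every
    sliding-window subsequence in every dimension (an assumed layout).
    Returns the list of kept window start indices."""
    D, num_windows, w = paa_coeffs.shape
    kept = [0]                                    # always keep the first window
    for i in range(1, num_windows):
        last = kept[-1]
        if any(paa_distance(paa_coeffs[d, i], paa_coeffs[d, last], n, w) > threshold
               for d in range(D)):
            kept.append(i)
    return kept
```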

IV-C Collaborative Hierarchy based Motif Enumeration (CHIME)

In this section, we describe our proposed method.

IV-C1 Basic Data Structure

Since the proposed work heavily relies on recursively building up a hierarchical structure from the numerosity-reduced subsequences, we store all subsequences remaining after numerosity reduction in linked lists. Specifically, for each single dimensional time series, the reduced subsequences are stored in a linked list as shown in Fig. 3.

Fig. 3: Data Structure used for proposed approach.

Each node stores a subsequence and is connected to two nodes representing the subsequences that appear immediately before and after it in the reduced sequence. The edge that connects a node with its successor is denoted as the next edge. The algorithm checks every subsequence connected by the next edge, and merges the connected nodes to generate longer motifs.

Once the data structure is built, CHIME conducts a left-to-right scan over every node in each linked list and calls the Collaborative Enumerator, the algorithm used to enumerate variable-length subdimensional motifs starting from that node.
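The sketch below shows one way the per-dimension linked lists could be represented; the field names and the builder function are illustrative, not CHIME's actual data structures.

```python
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class Node:
    """One numerosity-reduced subsequence plus its links (illustrative field names)."""
    dim: int                        # dimension index
    start: int                      # start position in the raw time series
    length: int                     # subsequence length
    prev: Optional["Node"] = None
    next: Optional["Node"] = None   # the "next" edge followed during enumeration
    visited: bool = False           # set when covered by a longer matched subsequence

def build_linked_lists(kept_positions_per_dim: List[List[int]], window_length: int):
    """Build one linked list of nodes per dimension from the numerosity-reduced
    start positions produced in Sec. IV-B; returns the list of head nodes."""
    heads = []
    for d, positions in enumerate(kept_positions_per_dim):
        head, prev = None, None
        for p in positions:
            node = Node(dim=d, start=p, length=window_length, prev=prev)
            if prev is None:
                head = node
            else:
                prev.next = node
            prev = node
        heads.append(head)
    return heads
```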

IV-C2 Collaborative Enumerator

The Collaborative Enumerator is described in Alg. 2. Intuitively, given a node (Lines 1-2), the enumerator recursively grows the length of the subsequence using a greedy enumeration step and a dimension matching step. These two steps collaboratively avoid redundant enumeration, which saves space, and form potential subdimensional motif candidates (stored in the candidate motif set). The Collaborative Enumerator has four major steps: a SAX word matching step (Lines 3-9); a greedy enumeration step (Lines 10-12); a dimension matching step (Lines 13-15); and a step that updates the motif set (Lines 16-17). After these four steps, the enumerator recursively calls itself to enumerate longer motifs (Lines 18-21).

1:Input: Node: node, Motif set: M
2:Output: Updated Motif Set M {SAX Word Matching Step}
3:mergedNode = Merge(node, node.next)
4:word = 1D-SAX(mergedNode)
5:if SAXTable.NotExist(word, mergedNode.dim, mergedNode.length) then
6:   SAXTable.put(word, mergedNode.dim, mergedNode.length, mergedNode)
7:   return M
8:end if
9:matchedNode = SAXTable.getSimLen(word, mergedNode.dim); {Greedy Enumeration Step}
10:(nodeA, nodeB) = Enumeration(mergedNode, matchedNode)
11:InsertNode(nodeA);
12:InsertNode(nodeB) {Dimension Matching Step}
13:indicator = MatchDimension(nodeA, nodeB)
14:LabelFirstNode(nodeA, indicator)
15:matchedSeqs = GenMatchedSeq(nodeA, nodeB, indicator)
16:form the pair of matching subdimensional subsequences and their concatenated word set (wordSet)
17:M.put(wordSet, length, subdimensional subsequence pair) {Recursively Enumerate Long Subsequence}
18:CollabEnum(nodeA, M);
19:if !nodeB.isEnum() then
20:   CollabEnum(nodeB, M);
21:end if
22:return M
Algorithm 2 Collaborative Enumerator (CollabEnum)

IV-C3 SAX Word Matching

In this step, the Collaborative Enumerator attempts to detect a pair of matching subsequences based on their SAX word representation. Specifically, given a node, we first merge the subsequences stored in the node and in its successor, and compute a SAX word for the merged subsequence via the FastSAX algorithm introduced in Sec. IV-A1. A new node is generated to represent the merged subsequence (Lines 3-4). The SAX word, along with the length of the merged subsequence and its dimension, is inserted into a SAX word table, SAXTable, if the same SAX word representing some subsequence(s) of similar length in the same dimension does not already exist in SAXTable (Lines 5-6). Otherwise, the enumerator has found a pair of matching subsequences in that dimension, in which case the algorithm retrieves the node representing the matched subsequence from SAXTable (Line 9).

Intuitively, this step checks whether any subsequence stored in SAXTable is similar to the newly formed long subsequence. If one is found, the algorithm calls the local enumeration and dimension matching steps to detect motifs (see below). Otherwise, it puts the subsequence into SAXTable for future matching.
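One possible organization of SAXTable is sketched below. Keying on (word, dimension, length bucket) captures the "same word, same dimension, similar length" condition; the specific length-tolerance rule is our assumption for illustration.

```python
import math

class SAXTable:
    """Sketch of the SAX-word table used in the matching step (Alg. 2, Lines 5-9)."""

    def __init__(self, length_tolerance=0.1):
        self.table = {}
        self.tol = length_tolerance

    def _bucket(self, length):
        # lengths within roughly `tol` of each other fall into the same bucket
        return int(math.log(length) / math.log(1.0 + self.tol))

    def lookup_or_insert(self, word, dim, length, node):
        """Return the previously stored node on a match; otherwise store `node`
        (NotExist + put in Alg. 2) and return None."""
        key = (word, dim, self._bucket(length))
        if key in self.table:
            return self.table[key]            # a matching subsequence was seen before
        self.table[key] = node
        return None
```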

IV-C4 Local Enumeration

If the algorithm finds matching subsequences in the previous step, CHIME then conducts a local greedy enumeration step to expand the two subsequences simultaneously as much as possible to find the longest matching pair. The pair is obtained by continuing to merge nodes via the next edge; the process stops when the two subsequences are represented by different SAX words (Line 10). Two new nodes are generated to represent the expanded subsequences and are inserted into the linked lists for future enumeration (Lines 11-12).

Upon insertion of the two new nodes, the edges are updated accordingly, as shown in Fig. 4. Intuitively, the newly inserted nodes allow us to re-use the detected matching subsequences to reduce the cost of generating long subsequences.

Fig. 4: Illustration of Updating Edges

An example is shown in Fig. 5. The algorithm iteratively merges nodes via the next edge in the first and second iterations since, in both iterations, the SAX words that represent the two newly generated subsequences are identical (add in the first iteration and adb in the second). In the third iteration, the SAX words for the two subsequences differ (adc and adb, respectively), so the algorithm stops the enumeration. Two nodes (the green and red nodes) representing the green and red long subsequences are formed based on the merged nodes.

In the local enumeration step, all merged short subsequences completely overlap with the longest matched subsequences. As demonstrated in [2][3], these covered subsequences are redundant, so CHIME skips them without generating nodes in order to reduce memory cost.

Fig. 5: Example of Local Enumeration Step
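Building on the Node sketch from Sec. IV-C1, the greedy expansion can be sketched as follows. The `sax_of` callback stands in for FastSAX (Sec. IV-A1), and `merge_with_next` is a simplified stand-in for CHIME's node merging and edge update (Fig. 4).

```python
def merge_with_next(node):
    """Create a longer node covering `node` and its successor (simplified re-linking)."""
    nxt = node.next
    return Node(dim=node.dim, start=node.start,
                length=(nxt.start + nxt.length) - node.start,
                prev=node.prev, next=nxt.next)

def local_enumeration(node_a, node_b, sax_of):
    """Greedily merge each matched subsequence with its successor while the two merged
    subsequences still share the same SAX word; return the two expanded nodes."""
    while node_a.next is not None and node_b.next is not None:
        cand_a = merge_with_next(node_a)
        cand_b = merge_with_next(node_b)
        if sax_of(cand_a) != sax_of(cand_b):
            break                              # SAX words diverge: stop the expansion
        node_a, node_b = cand_a, cand_b
    return node_a, node_b
```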

IV-C5 Dimension Matching

After the local enumeration step, the algorithm conducts a dimension matching step. In this step, for each remaining dimension, the enumerator compares the pair of SAX words generated from the subsequences located at the same positions, and with the same length, as the two matched subsequences. An indicator vector stores the dimensions in which matching subsequences are found, and a set of nodes storing all newly matched subsequences is generated (Line 13). The first nodes of the matched subsequences are marked as “visited” to avoid revisiting them in the future.

To clarify how the dimension matching process works, let us consider the example shown in Fig. 6. Suppose a pair of matching subsequences has been found in one dimension of a three-dimensional time series. In the dimension matching step, CHIME checks each of the two remaining dimensions to see whether the pair of subsequences at the same locations as the matched pair also has matching SAX words. In this example, CHIME finds that the SAX words match in one of the remaining dimensions (the subsequences share the same word). An indicator vector and the nodes representing the newly found matching subsequence pair (brown nodes) are therefore formed. The nodes representing the first covered subsequences (blank nodes with ‘x’) are marked as visited. The two brown nodes are stored in SAXTable for future enumeration.

Through the dimension matching process, CHIME can directly find matching long subsequences without going through the process of merging all the blank nodes shown in Fig. 6, which reduces the cost. In this example, a trivial solution would need to repeat SAX word matching 4 times to find the long subsequences in that dimension, whereas CHIME only does it twice.

Fig. 6: Example of Dimension Search Step.
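The dimension matching step can be sketched as below; `sax_of_range(d, start, length)` is an assumed helper returning the FastSAX word of the subsequence of the given length starting at `start` in dimension d.

```python
def match_dimensions(start_a, start_b, length, seed_dim, num_dims, sax_of_range):
    """Return the indicator vector of relevant dimensions for a matched pair of
    subsequences starting at start_a and start_b (the seed dimension is included)."""
    indicator = [d == seed_dim for d in range(num_dims)]
    for d in range(num_dims):
        if d == seed_dim:
            continue
        # compare the SAX words of the two aligned subsequences in dimension d
        if sax_of_range(d, start_a, length) == sax_of_range(d, start_b, length):
            indicator[d] = True
    return indicator
```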

IV-C6 Update Motif Set

In this step, CHIME first forms a pair of matching subdimensional subsequences based on the indicator vector and all matching subsequences (Line 16). The two can be considered instances of a subdimensional motif since their SAX words match in every relevant dimension. CHIME then computes a discrete representation, wordSet, by concatenating all the SAX words along with their dimension indices to represent the two subdimensional subsequences (e.g., in the example in Fig. 6, a word set is generated from the matching words in the two relevant dimensions). The pair is then put into the candidate motif set along with the hash value generated from wordSet and its length (Line 17). Intuitively, since all subdimensional subsequences represented by the same wordSet and with similar lengths are considered instances of the same motif, these subdimensional subsequences are placed into the same bucket to form a motif candidate. In the post-processing step, a pairwise comparison is conducted to filter out any false positive candidates.
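A minimal sketch of the candidate motif set follows: instances that share the same wordSet and have similar lengths land in the same bucket. The length-bucketing rule, like the one in the SAXTable sketch, is an assumption used only for illustration.

```python
import math
from collections import defaultdict

class MotifSet:
    """Buckets of candidate subdimensional motif instances keyed by their wordSet."""

    def __init__(self, length_tolerance=0.1):
        self.buckets = defaultdict(list)
        self.tol = length_tolerance

    def _key(self, word_per_dim, length):
        # word_per_dim: {relevant dimension index -> SAX word}
        word_set = tuple(sorted(word_per_dim.items()))
        length_bucket = int(math.log(length) / math.log(1.0 + self.tol))
        return (word_set, length_bucket)

    def put(self, word_per_dim, length, instance):
        """Add one subdimensional subsequence (e.g. its start position and indicator
        vector) to the bucket of its candidate motif."""
        self.buckets[self._key(word_per_dim, length)].append(instance)
```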

IV-C7 Recursive Enumeration

Finally, CHIME recursively calls itself to enumerate longer motifs. More specifically, the algorithm calls itself on the two newly generated nodes to continue the enumeration. Note that since the matched subsequence may already have been enumerated into a longer subsequence in a previous step, the recursive call on the matched node is only made if it has not been enumerated before. The algorithm stops when no SAX word match is detected (Lines 19-21).

IV-C8 High-level Framework

1:Input: multivariate time series; Output: motif set M
2:M = {}; SAXTable = {};
3:for all nodes in each linked list, scanned from left to right, for every dimension from 1 to D do
4:   if !node.isVisited() then
5:      M = CollabEnum(node, M)
6:   end if
7:end for
8:M = RemoveFalsePositive(M);
Algorithm 3 High-level Framework

The overall framework is shown in Alg. 3. Intuitively, CHIME conducts a left-to-right scan through every linked list and calls the Collaborative Enumerator to detect motifs. During the scan, CHIME utilizes the recorded nodes to avoid repeated enumeration of already observed nodes. Specifically, for each node, CHIME only calls the enumerator on the node to start the enumeration process if the node has not been visited (Line 5). If the node was previously visited, the corresponding subsequence has already been processed, so CHIME skips it to avoid redundant work. Finally, after all nodes are enumerated, a post-processing step is conducted to remove all false positive motif instances stored in the candidate set (Line 8). Since we compare the scalability of CHIME with the state-of-the-art approach [9], we compute the distances between pairs of instances that share the same representation and have similar lengths, and we rank the motifs by distance in ascending order per dimension size and per motif length. Note that this cost is the same as that of filtering out all false positive instances detected by CHIME per the definition of subdimensional motif used in this work.
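The post-processing filter for one candidate bucket could look like the sketch below, which follows the motif definition in Sec. III: the first instance serves as the seed, and an instance is kept when its average z-normalized distance over the relevant dimensions stays below the motif threshold for that length. The array layout and the choice of the first instance as the seed are our assumptions.

```python
import numpy as np

def z_norm_dist(x, y):
    """z-normalized Euclidean distance between two equal-length 1-D subsequences."""
    x = (x - x.mean()) / (x.std() + 1e-8)
    y = (y - y.mean()) / (y.std() + 1e-8)
    return np.linalg.norm(x - y)

def filter_bucket(series, starts, indicator, length, threshold):
    """Keep the instances of one candidate bucket whose average distance to the seed,
    over the relevant dimensions, is below the motif threshold for this length.

    series: array of shape (D, m); starts: candidate start positions;
    indicator: boolean list of relevant dimensions."""
    dims = [d for d, flag in enumerate(indicator) if flag]
    seed = starts[0]
    kept = [seed]
    for s in starts[1:]:
        dists = [z_norm_dist(series[d, seed:seed + length],
                             series[d, s:s + length]) for d in dims]
        if np.mean(dists) < threshold:
            kept.append(s)
    return kept
```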

IV-D Time Complexity

The time complexity of CHIME is dominated by the pairwise comparison in post-processing, which in the worst case is quadratic in the number of candidate instances, similar to state-of-the-art motif enumeration approaches. However, since CHIME utilizes the symbolic representation to avoid redundantly enumerating many subsequences, the experiments show that the algorithm can indeed detect subdimensional motifs in a very small amount of time. It can also handle million-point multivariate time series, which none of the existing approaches can handle due to the time cost.

IV-E Comparison with Sequence Matching Approaches

One existing work [6] utilizes a sequence matching approach (e.g. [20][21]) to detect variable-length subdimensional motifs. Beyond the fact that the approach cannot handle high dimensional multivariate time series due to its exponential time and space requirements with respect to the number of dimensions, it is worth noting that even if its time cost could be reduced to a reasonable level, the algorithm still could not fulfill the task introduced in this paper.

The problem is twofold. First, long subsequences are typically represented by overwhelmingly long sequences of symbols. For example, consider the two 1-D subsequences of length 200 [11] shown in Figure 7. Visually, these two time series look very similar. The sequence representations obtained by using a sliding window of length 20 (10% of the subsequence length) are shown on the right. We can see that the two SAX sequences, including their lengths, look very different despite the striking similarity between the time series. Since the previous approach [6] is based on SAX sequence matching [21], it would not be able to find the motifs unless a warping-robust matching process were conducted, which is too costly given the complexity. In contrast, CHIME recomputes the SAX word for the whole subsequence and can represent each of these two 1-D subsequences by a single SAX word, which is more robust to noise.

Fig. 7: Example of two similar subsequences and the corresponding SAX sequences.

Second, the mean and variance of a short subsequence used to generate the SAX word sequence can significantly differ from those of a long subsequence [2][12]. Since the shape can be largely affected by the mean and variance, the SAX word sequence based representation may fail to capture the actual similarity between two subsequences. As a result, SAX word sequence matching based approaches such as [13][1][7] can only detect motifs in a small length range, whereas CHIME does not have this problem.

V Experiments

We perform a series of experiments to evaluate the accuracy and speed of CHIME. All the experiments are conducted on a laptop with 16 GB of RAM and a 2.5 GHz quad-core processor. The executable software and datasets used in the experiments can be found at https://github.com/flash121123/CHIME. Unless otherwise noted, in all evaluation experiments the two SAX parameters are set to 5 and 6, respectively, the minimum motif length is 300, and the motif threshold function is fixed.

We first demonstrate that existing work may not be suitable for variable-length subdimensional motif discovery. As shown in previous work [2][12][7], index-based fixed-length approximate motif discovery algorithms such as random projection [19] are not suitable for detecting variable-length motifs due to their memory requirements. This is because such an algorithm needs to generate a discrete representation for every subsequence of every length tested, which quickly becomes impractical even with a small enumeration range [2][7]. To demonstrate the significant difference in memory requirements between the two algorithms, we conduct a simple experiment. We compare the ratio between the number of subsequences stored in memory for matching motifs and the product of the number of subsequences and the enumeration range:

ratio = (number of subsequences stored in memory) / (number of subsequences × enumeration range)    (1)

Since the random projection approach [8] discretizes and keeps track of all tested subsequences of every length, its ratio is equal to 1.

The ratios on three different types of datasets (random walk data, traffic speed data [11], and EEG data from http://bbci.de/competition/iv/), along with the data size and the enumeration range (measured by the number of distinct motif lengths detected after removing all false positives), are shown in Table I. On all three datasets, the memory cost of CHIME is up to three orders of magnitude lower than that of random projection. This property allows CHIME to detect variable-length subdimensional motifs in large scale datasets.

Dataset       | Size | Enumeration Range | CHIME  | Random Projection
Random Walk   |      | 8291              | 0.001  | 1
Traffic Speed |      | 4177              | 0.0014 | 1
EEG           |      | 1674              | 0.004  | 1
TABLE I: CHIME vs. Random Projection in Memory Cost

Since there is no comparable approximate variable-length subdimensional motif discovery approach that can solve the problem at the tested scale, we compare with the state-of-the-art variable-length motif discovery solution introduced by Nunthanid et al. [22]: the algorithm conducts a brute-force enumeration process to enumerate motifs of different lengths via a fixed-length motif discovery approach. We demonstrate that the scale of the problems handled by CHIME is too large to obtain an exact solution.

V-A Detecting Planted Motifs in Random Walk Time Series

Fig. 8: Overlap scores for the planted motif experiments. Panels (a)-(c): overall score, length overlap, and dimension overlap vs. motif length; panels (d)-(f): the same three metrics vs. the number of relevant dimensions.

We first tested CHIME in a planted motif experiment [7][3][2] to demonstrate its ability to detect subdimensional motifs with high accuracy when the minimum length is much shorter than the actual motif length.

We planted a subdimensional motif with 10 instances into a twenty-dimensional random walk time series of 1 million points, at random positions and in random dimensions. The motif shapes are generated by a function with random parameters, and we add 5% random noise to every instance of the motif. The mean and variance of each instance are also randomly generated. CHIME is expected to find at least a pair of non-overlapping subsequences that highly overlap with the actual planted instances. Similar to previous work [3][7], we evaluate the performance by the overlap rate with the actual planted intervals and with the relevant dimensions. We also report an overall overlap score computed as the geometric mean of the two metrics; a high overall score indicates that the algorithm found a motif highly overlapping with the planted motif in most of the relevant dimensions.

V-A1 Planted Motifs of Different Lengths

We first tested CHIME with planted motifs of lengths 3000, 4500, and 6000. The number of relevant dimensions for the planted motif is set to 5. We repeat the experiment 10 times for each motif length. The boxplot of the overall overlap score for each motif length is shown in Fig. 8(a): CHIME consistently achieves overall overlap scores above 0.8 for all three motif lengths. Moreover, according to Fig. 8(b)-(c), the algorithm also maintains overlap rates above 0.8 in both length and dimension.

V-A2 Planted Motifs with Different Numbers of Relevant Dimensions

In this experiment, we tested CHIME with planted motifs for which the number of relevant dimensions equals 3, 5, 10, and 15. The motif length is set to 4500. Similar to the previous experiment, we repeated the experiment 10 times for each dimension setting, and the results are shown in Fig. 8(d)-(f). We observe that the median overall overlap score remains at approximately 0.9. While the length and dimension overlap rates decrease as the number of relevant dimensions increases, the medians of both metrics are still over 0.8. The results indicate that even when motifs span a large number of relevant dimensions, CHIME can still find most of the relevant dimensions.

V-B Scalability

In this subsection, we conduct experiments to evaluate the scalability with respect to the length and the number of dimensions of the multivariate time series. Since there is no approximate variable-length motif discovery algorithm that can handle data of the size tested in this experiment, we report the execution time of the state-of-the-art fixed-length multidimensional motif discovery algorithm [9], implemented with STOMP [23], as the baseline. The code is provided by the authors and written in C. In the test cases where the algorithm takes more than 24 hours to complete, we estimate the execution time from the first 1000 iterations (the estimation approach used in previous work [23]). We also report the estimated brute-force motif enumeration time when using the framework described in [22], the classical approach used in variable-length motif discovery. The estimated time is computed as the fixed-length motif discovery time multiplied by the enumeration range, since STOMP's execution time is invariant to the motif length [23].

Time Series Length           | 200K     | 400K      | 600K      | 800K      | 1 million
Enumeration Range            | 2724     | 4499      | 5915      | 7319      | 8290
Dimension Range              | 5        | 6         | 6         | 6         | 6
CHIME                        | 3.93 min | 10.03 min | 15.1 min  | 22.5 min  | 28.05 min
Post-processing              | 54 sec   | 3.4 min   | 11 min    | 19.5 min  | 31 min
Fixed-length Motif Discovery | 3.6 hr   | 11.6 hr   | 1.04 days | 2.08 days | 12 days
Estimated Brute Force Time   | 1.19 yr  | 5.95 yr   | 16.8 yr   | 41.68 yr  | 272 yr
TABLE II: Execution Time vs. Time Series Length in a 50-dimensional Time Series

V-B1 Scalability over Time Series Length

We tested the scalability of CHIME on a fifty-dimensional random walk time series of one million points. The growth of the execution time, enumeration length range, and dimension range as the length increases is shown in Table II. The enumeration length and dimension range are measured by the number of distinct motif lengths and dimensions detected after removing all false positive instances based on the motif threshold function. According to the table, the execution time of CHIME grows much more slowly than that of the tested state-of-the-art approaches. In the largest case, the algorithm takes 28 minutes to complete, and then another 31 minutes for the pairwise distance comparisons, whereas the estimated execution time for the fixed-length motif discovery approach is 12 days. The enumeration length range and dimension range also grow as the length of the time series grows; in the largest case, the enumeration range covers almost 9000 different lengths. Considering the estimated execution time of the brute force approach, the length range enumerated by CHIME is too large for the state-of-the-art algorithm to obtain an exact solution. In contrast, CHIME provides an alternative way to efficiently detect approximate variable-length motifs at this data scale.

Dimension                    | 25        | 50          | 100      | 200
CHIME (Total Time)           | 1.1 min   | 3.4 min     | 10.5 min | 33.5 min
Enumeration Range            | 2058      | 2252        | 2334     | 2419
Fixed-length Motif Discovery | 1.72 hr   | 3.45 hr     | 6.9 hr   | 13.8 hr
Estimated Brute Force Time   | 58.9 days | 129.49 days | 268 days | 1.52 yr
TABLE III: Execution Time vs. Dimension Size

V-B2 Scalability Over Number of Dimensions

We then tested the scalability of CHIME on a 200-dimensional random walk time series of 160,000 points. The growth of the execution time as the number of dimensions increases is shown in Table III. The execution time of CHIME grows faster than that of the fixed-length motif discovery approach. However, the estimated brute-force execution time is still infeasible due to the large enumeration range, whereas CHIME takes less than one hour to complete.

V-C Parameter Analysis

We tested CHIME with the two SAX parameters varied from 4 to 8 and from 5 to 15, respectively, on a 300,000-point, 50-dimensional random walk time series. The enumeration length range, dimension range, and execution time for all parameter combinations are shown in Fig. 9(a)-(c), respectively. All three values increase as the parameters decrease. This is because the number of distinct SAX words grows as the PAA size and alphabet size increase, which reduces the chance that two SAX words match. CHIME can therefore process the time series faster at the cost of a reduced enumeration length (and dimension) range. The user can set both parameters to balance the search ability and the execution time.

Fig. 9: Parameter experiments: (a) enumeration length range, (b) dimension range, (c) execution time.

V-D Case Studies

In this section, we show that CHIME can find high quality motifs in several large-scale real world multivariate time series datasets.

Fig. 10: Two examples of subdimensional motifs found in PAMAP2 (two instances shown in blue and red; one irrelevant dimension is shown in black and grey as an example).

V-D1 PAMAP2 Physical Activity Monitoring Time Series

We first tested CHIME on the PAMAP2 Physical Activity Monitoring dataset [24]. We used all available high resolution acceleration signals recorded from the hands, chest, and ankles to generate a 9-dimensional time series of 1.3 million points in length. We ran CHIME with the minimum motif length equal to 3 sec (300 points).

Two examples of detected subdimensional motifs are shown in Fig. 10(a)(b). The lengths of Motif A and Motif B are 12.5 sec and 8 sec, respectively. Motif A consists of two different signals (the y and z axes of acceleration), both recorded from hand motion. According to the labels, the two occurrences of Motif A correspond to the action of ironing, an activity that relies mostly on the hands. Motif B consists of all 3 signals recorded from the chest. According to the annotations, both instances coincide with the subjects performing a walking activity, which is periodic. Both motifs can explain the patterns that occur in different types of activities, and such information can provide useful insights for behavioral studies. Clearly, since the two motifs have a significant length gap (8 sec vs. 12.5 sec), repeatedly executing a fixed-length subdimensional motif discovery approach [9] to enumerate longer motifs would be very time consuming given the data size (9 dimensions by 1.3 million points).

V-D2 Electric Power Demand Time Series

Fig. 11: Two examples of subdimensional motifs found in the Power Usage Time Series (two instances shown in blue and red).

Interpreting the behavior of electric power usage has many potential applications [9]. Recent work shows that motifs can be used to understand activities in this type of data. In this experiment, we apply CHIME to the 1 million point electric power usage dataset [25]. The multivariate time series consists of the power usage (in watts) of eight different appliances, including Washing Machine, Dryer, Dishwasher, Computer Site, Television Site, Combination Microwave, Kettle and Toaster, and is recorded from October 2013 to January 2014. We set the minimum motif length equal to 200 (20 min). Two examples of subdimensional motifs are shown in Fig. 11.

The first motif, of length 2000, is a subdimensional motif consisting of two appliances: the power usage time series collected from the computer site and the television site. The motif represents a power usage pattern in which the user turns on the computer first and then turns on the TV. CHIME successfully captures this repeating pattern even when the minimum motif enumeration length is much smaller than the length of the detected pattern. CHIME also discovers the second motif, of length 300, which consists of two appliances: Kettle and Toaster. Both appliances' power usage patterns are much shorter and only last several minutes. The two motifs have significantly different lengths and represent different, meaningful power usage patterns. Since the time series is large (8 x 1 million in size), repeatedly running a fixed-length motif discovery algorithm such as [9] would be very time consuming.

V-D3 Interpretable Time Series Classification via Motifs

Fig. 12: Three examples of shapelets selected via forward feature selection based on CHIME's result.

One popular application of motif discovery is learning interpretable time series classification models using motifs or shapelets [26][27]. It has been demonstrated in previous work [26][2] that motifs that can be used to distinguish between different classes often have different lengths.

In this experiment, we tested our algorithm on the EigenWorms dataset, which contains the trajectories of approximately 400 worms. Each trajectory consists of approximately 18,000 sample points, and each record contains 6 dimensions representing 6 different worm movement features. We ran CHIME on the data with the minimum motif length set to 300 sample points. Similar to previous work on univariate time series classification [26], after identifying the motifs, we transformed each original time series into a distance feature vector by computing the closest-match distance between the time series and each of the detected subdimensional motifs. A forward feature selection process is then applied to select the features that achieve the best accuracy with a simple decision tree classifier. We find that by using the three motifs shown in Figure 12, the decision tree already achieves approximately 70% accuracy. By further increasing the number of motifs, we find that six motifs are enough to achieve 77% accuracy on the dataset, whereas a 1-Nearest-Neighbor classifier with Dynamic Time Warping only achieves 60% accuracy according to Bagnall et al. [28].

VI Conclusion

We introduce a new algorithm, CHIME, to detect approximate subdimensional motifs of different lengths in multivariate time series. CHIME can handle large-size multivariate time series that the state-of-the-art exact algorithms cannot handle efficiently. We show that CHIME can detect subdimensional motifs successfully even when the motif length is considerably larger than the minimum length. In the case studies, we demonstrate that the motifs found by CHIME are meaningful and can potentially have significant impacts in various applications.

References

  • [1] Y. Li, J. Lin, and T. Oates, “Visualizing variable-length time series motifs,” in Proceedings of the 2012 SIAM international conference on data mining.   SIAM, 2012, pp. 895–906.
  • [2] A. Mueen, “Enumeration of time series motifs of all lengths,” in 13th International Conference on Data Mining (ICDM), 2013.   IEEE, 2013, pp. 547–556.
  • [3] Y. Gao and J. Lin, “Efficient discovery of variable-length time series motifs with large length range in million scale time series,” in 2017 IEEE 17th International Conference on Data Mining (ICDM), 2017.
  • [4] Y. Gao, Q. Li, X. Li, J. Lin, and H. Rangwala, “Trajviz: a tool for visualizing patterns and anomalies in trajectory,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases.   Springer, 2017, pp. 428–431.
  • [5] X. Wang, J. Lin, N. Patel, and M. Braun, “A self-learning and online algorithm for time series anomaly detection, with application in cpu manufacturing,” in Proceedings of the 25th ACM International on Conference on Information and Knowledge Management.   ACM, 2016, pp. 1823–1832.
  • [6] A. Balasubramanian, J. Wang, and B. Prabhakaran, “Discovering multidimensional motifs in physiological signals for personalized healthcare,” IEEE journal of selected topics in signal processing, vol. 10, no. 5, pp. 832–841, 2016.
  • [7] Y. Gao and J. Lin, “Exploring variable-length time series motifs in one hundred million length scale,” Data Mining and Knowledge Discovery, May 2018.
  • [8] D. Minnen, C. Isbell, I. Essa, and T. Starner, “Detecting subdimensional motifs: An efficient algorithm for generalized multivariate pattern discovery,” in Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on.   IEEE, 2007, pp. 601–606.
  • [9] C.-C. M. Yeh, N. Kavantzas, and E. Keogh, “Matrix profile vi: Meaningful multidimensional motif discovery,” in Data Mining (ICDM), 2017 IEEE International Conference on.   IEEE, 2017, pp. 565–574.
  • [10] D. Minnen, T. Starner, I. Essa, and C. Isbell, “Discovering characteristic actions from on-body sensor data,” in Wearable computers, 2006 10th IEEE international symposium on.   IEEE, 2006, pp. 11–18.
  • [11] “pems.dot.ca.gov.”
  • [12] M. Linardi, Y. Zhu, T. Palpanas, and E. Keogh, “Matrix profile x: Valmod - scalable discovery of variable-length motifs in data series,” in SIGMOD.   ACM, 2018.
  • [13] Y. Gao, J. Lin, and H. Rangwala, “Iterative grammar-based framework for discovering variable-length time series motifs,” in ICMLA, 2016 15th IEEE International Conference on.   IEEE, 2016, pp. 7–12.
  • [14] Y. Tanaka, K. Iwamoto, and K. Uehara, “Discovery of time-series motif from multi-dimensional data based on mdl principle,” Machine Learning, vol. 58, no. 2, pp. 269–300, 2005.
  • [15] E. Berlin and K. Van Laerhoven, “Detecting leisure activities with dense motif discovery,” in Proceedings of the 2012 ACM Conference on Ubiquitous Computing.   ACM, 2012, pp. 250–259.
  • [16] L. Wang and et al., “A tree-construction search approach for multivariate time series motifs discovery,” Pattern Recognition Letters, vol. 31, no. 9, pp. 869–875, 2010.
  • [17] A. Mueen, E. J. Keogh, Q. Zhu, S. Cash, and M. B. Westover, “Exact discovery of time series motifs.” in SDM.   SIAM, 2009, pp. 473–484.
  • [18] J. Lin, E. Keogh, L. Wei, and S. Lonardi, “Experiencing sax: a novel symbolic representation of time series,” Data Mining and knowledge discovery, vol. 15, no. 2, pp. 107–144, 2007.
  • [19] B. Chiu and et al., “Probabilistic discovery of time series motifs,” in Proceedings of the 9th ACM SIGKDD international conference on Knowledge discovery and data mining, 2003, pp. 493–498.
  • [20] A. Balasubramanian and B. Prabhakaran, “Flexible exploration and visualization of motifs in biomedical sensor data,” in Proc. of Workshop on Data Mining for Healthcare, in conjunction with ACM KDD, 2013.
  • [21] C. G. Nevill-Manning and I. H. Witten, “Identifying hierarchical structure in sequences: A linear-time algorithm,” J. Artif. Intell. Res. (JAIR), vol. 7, pp. 67–82, 1997.
  • [22] P. Nunthanid, V. Niennattrakul, and C. A. Ratanamahatana, “Discovery of variable length time series motif,” in Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2011 8th International Conference on.   IEEE, 2011, pp. 472–475.
  • [23] Y. Zhu, Z. Zimmerman, N. S. Senobari, C.-C. M. Yeh, G. Funning, A. Mueen, P. Brisk, and E. Keogh, “Matrix profile ii: Exploiting a novel algorithm and gpus to break the one hundred million barrier for time series motifs and joins,” in Data Mining (ICDM), 2016 IEEE 16th International Conference on.   IEEE, 2016, pp. 739–748.
  • [24] A. Reiss and D. Stricker, “Introducing a new benchmarked dataset for activity monitoring,” in Wearable Computers (ISWC), 2012 16th International Symposium on.   IEEE, 2012, pp. 108–109.
  • [25] D. Murray, J. Liao, L. Stankovic, V. Stankovic, R. Hauxwell-Baldwin, C. Wilson, M. Coleman, T. Kane, and S. Firth, A data management platform for personalised real-time energy feedback, 8 2015.
  • [26] X. Wang, J. Lin, P. Senin, T. Oates, S. Gandhi, A. P. Boedihardjo, C. Chen, and S. Frankenstein, “Rpm: Representative pattern mining for efficient time series classification,” pp. 185–196, 2016.
  • [27] Y. Gao and J. Lin, “Hime: discovering variable-length motifs in large-scale time series,” Knowledge and Information Systems, pp. 1–30, 2018.
  • [28] A. Bagnall, H. A. Dau, J. Lines, M. Flynn, J. Large, A. Bostrom, P. Southam, and E. Keogh, “The uea multivariate time series classification archive, 2018,” arXiv preprint arXiv:1811.00075, 2018.