Efficient Discovery of Variable-length Time Series Motifs with Large Length Range in Million Scale Time Series

02/13/2018
by   Yifeng Gao, et al.
George Mason University

Detecting repeated variable-length patterns, also called variable-length motifs, has received a great amount of attention in recent years. The current state-of-the-art algorithm utilizes a fixed-length motif discovery algorithm as a subroutine to enumerate variable-length motifs. As a result, it may take hours or days to execute when the enumeration range is large. In this work, we introduce an approximate algorithm called HierarchIcal based Motif Enumeration (HIME) to detect variable-length motifs with a large enumeration range in million-scale time series. We show in the experiments that the scalability of the proposed algorithm is significantly better than that of the state-of-the-art algorithm. Moreover, the motif length range detected by HIME is considerably larger than that of previous sequence-matching based approximate variable-length motif discovery approaches. We demonstrate that HIME can efficiently detect meaningful variable-length motifs in long, real-world time series.


I Introduction

The task of finding repetitive similar patterns in time series data, known as motif discovery, has received a great amount of attention in the past decade. It has been used as an important subroutine in many time series data mining tasks and applications such as association rule mining [27], data visualization [26], classification [30], clustering [3], anomaly detection [26][8], and activity recognition [15].

Most existing motif discovery algorithms [5][20][11][12] treat the motif length as an important piece of domain knowledge that must be specified by the user [32]. We argue that finding motifs of different, previously unknown lengths, a topic that has been largely glossed over in the literature, is a significant obstacle that prevents motifs from achieving their full potential as a useful time series data mining primitive.

To demonstrate the limitations of fixed-length motif discovery, we show an example on a subset of a dishwasher electric power demand time series (Fig. 1, top). A basic wash cycle takes approximately half an hour (approximately 350 sample points). By setting the motif length equal to one basic cycle, we can find a frequently repeating pattern, representing a wash cycle, as shown in Fig. 1 (Motif A). However, after running our algorithm, we also found a frequent motif (Motif B) of length 1534 (approximately 2.5 hours), and a rare motif (Motif C) of length 2791 (approximately 4.6 hours) that occurred only a few times in the time series. The discovery of these two long motifs is surprising given their lengths compared to that of the basic wash cycle. Since the lengths of these three motifs are significantly different, fixed-length motif discovery algorithms cannot discover all three motifs in a single run. In contrast, all three motifs are found by our algorithm in just a few seconds of execution time. Previous work also notes that variable-length motifs are valuable features for many time series data mining tasks [30][6][18].

Fig. 1: Motifs of different lengths detected by our algorithm in a snippet of dataset from [22].

The greatest challenge in detecting variable-length motifs is scalability. We demonstrate this problem by considering the task of finding motifs of lengths ranging from 300 to 10300 in a ten-million-length time series. A brute-force algorithm requires an enormous number of distance calls to solve this problem; the computational complexity is similar to that of finding a fixed-length motif in a 1-billion-length time series. In other words, the computational complexity of variable-length motif discovery is 10 times larger than the current largest fixed-length motif discovery problem solved in [32] (which takes 12.13 days with GPU speed-up). As a result, even though the state-of-the-art algorithm can achieve a 95% pruning rate, enumerating motifs of lengths 64 to 1024 in an EEG time series of length 160,000 still takes 16 hours to complete [18].

Million-scale time series exist in many domains. For example, a 2-year power demand dataset used in [22] is about 7 million in length. The EOG time series recorded in [7] is about 8 million in length. In both datasets, the state-of-the-art algorithms are too costly to apply. A series of approximate variable-length time series motif discovery algorithms based on grammar induction have been introduced [9][26], following a three-step process. In the first step, the time series is converted to a discrete string sequence via a sliding window. In the second step, grammar induction (e.g. via Sequitur [23]) is applied on the discretized time series to quickly detect repeated string sequences. The third step maps the repeated strings back to the time series subsequences that the strings represent. While the scalability of the grammar-based approach is considerably better than that of the state-of-the-art algorithms, the length range of motifs detected can still be limited. Long motifs are often discretized into unnecessarily long and non-identical string sequences due to the typically high amount of noise in the data, and since grammar induction relies on identical sequence matching, existing grammar-based algorithms have difficulty finding long motifs.

In this paper, we introduce a greedy algorithm named HierarchIcal based Motif Enumeration (HIME) for detecting variable-length motifs. Given a minimum motif length, HIME automatically finds motifs that can be much longer than this minimum, based on a symbol table that records discretized representations of variable-length subsequences. While the algorithm is not exact, in the experiments it can find motifs of lengths from 300 to 3000 in a time series of length one million with an execution time of about 1 minute. This is considerably faster than the state-of-the-art algorithms. Such scalability makes searching for variable-length motifs in million-scale time series a feasible task. Moreover, compared with sequence-matching based motif discovery approaches such as [9][6], HIME considerably increases the motif enumeration range that can be detected.

To summarize, our work has the following unique features:

  • It can discover motifs over a much larger length range, including long motifs that sequence-matching based approaches [9][6] fail to find.

  • Its scalability is significantly better than that of state-of-the-art algorithms.

The rest of the paper is organized as follows: Section II discusses related work in fixed- and variable-length motif discovery. Section III introduces the problem definition and notations used in this paper. Section IV introduces our proposed HIME algorithm. An adaptive parameter selection approach is described in Section V. The experimental results are shown in Section VI, and we conclude in Section VII.

II Related Work

While the classic time series motif discovery problem focuses on finding the most frequent patterns [12], much recent research has defined motifs as the most similar pair of subsequences (which we refer to as pair-motifs hereinafter) [19][20][27][10]. In addition, some work has focused on fast approximate motif discovery [5][14]. The motifs discovered by these algorithms may contain subsequences that are not truly similar; such false positives can be removed during post-processing. Likewise, similar subsequences may be missed.

The authors in [10] introduce an algorithm named Quick-Motif, which achieves 3 orders of magnitude speedup compared with the traditional state-of-the-art fixed-length motif discovery algorithm [20]. In a recent work, [31] introduces an anytime algorithm called STAMP, which utilizes a fast similarity search algorithm [21] to find the exact pair-motif of a given length. In [32], the authors further introduce an algorithm called STOMP, which reduces the time complexity of STAMP by a factor of O(log n). In [5], the authors utilize random projection to detect fixed-length approximate motifs. The work in [2] focuses on detecting fixed-length approximate motifs with limited memory.

Some recent research has focused on efficiently finding variable-length motifs. In [24], the authors introduce an algorithm named VLMD to find exact K pair-motifs by calling a fixed-length exact motif discovery algorithm (i.e. MK [20]) for every possible length within a range. In [18], the authors introduce an algorithm called MOEN, which uses a lower bound to speed up the process of enumerating motif lengths. Other works such as [16][17] follow a similar idea to [18]. However, a common drawback of these algorithms is that they all conduct an incremental enumeration process starting from the smallest length. Since finding fixed-length motifs is already very costly in million-scale time series, such an approach may be impractical for large-scale variable-length motif discovery.

In [9][26], the authors introduce a framework, Grammar Induction based motif discovery, which uses grammar induction to find approximate motifs of variable lengths via identifying repeated discrete sequences without enumerating all possible lengths. It is considerably faster than existing algorithms. However, since the idea of the algorithm is based on symbolic sequence matching, the length range of motifs detected may be limited for some real world applications (e.g. the approach cannot detect motifs shown in Fig. 1 in a single run).

Other works such as [29][4] also focus on variable-length motif discovery; however, these techniques have shortcomings that prevent them from being useful in general cases. In [29], the subsequences are not normalized, which makes it difficult to find patterns that are similar but have different amplitudes. The work proposed in [4] also includes a discretization step, but the subsequences are non-overlapping. As a result, it does not consider every possible pattern candidate, and thus may miss some true patterns [9][26].

III Notations & Problem Definition

We start with fundamental definitions related to time series:

A Time Series T = t_1, t_2, ..., t_n is a set of n observations ordered by time.

A Subsequence of a time series T is a contiguous set of points in T starting at position i and ending at position j, with length L = j - i + 1. Typically, L is much smaller than n.

Subsequences can be extracted from T via a sliding window.

In many applications, we are interested in finding similar “shapes.” Therefore, motif discovery is more meaningful when it is offset- and amplitude-invariant. This can be achieved by normalizing each subsequence prior to the search for motifs.

z-normalization is a procedure that normalizes the mean and standard deviation of a subsequence to zero and one, respectively.

Given a time series T and a distance function dist, in this work we define a Time Series Motif to be a set of subsequences in T such that the distance between each of these subsequences and a seed subsequence is less than a length-dependent motif threshold R(L), where L is the length of the motif detected and L is at least the minimum motif length. Each subsequence in a motif is said to be an instance of the motif.

More specifically, given a minimum motif length, the algorithm finds pairs of subsequences of length L (L varies across pairs), where L is at least the minimum motif length and each pair satisfies the motif definition with threshold R(L). Note that, as mentioned in [2], once a pair is detected, all instances of the motif can be found by using similarity query algorithms [21][25].
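To make this instance-retrieval step concrete, here is a minimal sketch in Python that scans the series with a naive sliding window under z-normalized Euclidean distance; the function names and the brute-force scan are our own illustration, while the paper instead points to much faster similarity-search algorithms [21][25].

import numpy as np

def znorm(x, eps=1e-8):
    # z-normalize a subsequence: zero mean, unit standard deviation
    s = x.std()
    return (x - x.mean()) / (s if s > eps else 1.0)

def find_instances(ts, seed_start, length, threshold):
    # Return the start positions of every subsequence whose z-normalized
    # Euclidean distance to the seed subsequence is below the threshold.
    # This naive scan is O(n * length); [21][25] describe far faster methods.
    ts = np.asarray(ts, dtype=float)
    seed = znorm(ts[seed_start:seed_start + length])
    hits = []
    for i in range(len(ts) - length + 1):
        if np.linalg.norm(seed - znorm(ts[i:i + length])) < threshold:
            hits.append(i)
    return hits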

III-A Problem Definition

Unlike fixed-length motif discovery, determining a general interestingness measure (e.g. frequency or similarity) for the variable-length motif discovery problem is non-trivial [9][24]. For example, some low-frequency motifs may be more interesting than high-frequency motifs, and long motifs are likely to have a larger distance than short motifs even after normalizing by length (e.g. Motif C versus Motif A in Fig. 1). Since motif discovery can be used as a subroutine for real-world applications, instead of ranking motifs based on some interestingness measure, the proposed algorithm aims to detect seed subsequences with at least one pair of subsequences that satisfies the motif constraint stated above. The reader may refer to [18][30][26] for details on various motif evaluation approaches for different goals. Since most state-of-the-art approaches evaluate motifs on similarity, when comparing with them in the experimental section, we include the execution time of ranking the motifs based on similarity [20][18].

Fig. 2: Example of normalized distance growth as a function of length

The difficulty of obtaining an exact solution for variable-length motifs can be illustrated by the simple example shown in Fig. 2. Suppose two instances of a motif, shown in Fig. 2(b), appear in a time series (Fig. 2(a)) at two different positions. Fig. 2(c) shows the normalized Euclidean distance between the two subsequences for all lengths ranging from 2 to 6000. We can see from Fig. 2(c) that the growth of the distance is nonlinear. That is, the distance between a pair of short subsequences does not necessarily share similar behavior with the distance between long subsequences if the length difference is large. Therefore, the state-of-the-art algorithms need to repeat the fixed-length motif discovery algorithm many times during the enumeration process to find optimal motifs, which significantly increases the time cost.

IV Proposed Method

In this section, we introduce our proposed method.

IV-A Discretization

Discretization of a time series into a symbolic representation is often a necessary pre-processing step for efficient motif discovery [11][5][26][2]. Since our proposed work also utilizes symbolic representations of subsequences to speed up the motif detection process, we first describe Symbolic Aggregate approXimation (SAX) [12], a widely used discretization technique for time series data mining. Given a z-normalized time series (a subsequence in our case), SAX first converts it to a Piecewise Aggregate Approximation (PAA) representation [11] of size w (i.e. w segments). Then, the PAA coefficients are mapped to symbols with alphabet size a according to a breakpoints table [12], defined such that the regions are approximately equiprobable under a Gaussian distribution. These w symbols form a SAX word. Fig. 3 illustrates the discretization process; the pre-defined breakpoints table [12] is shown in Fig. 3 (bottom).

Fig. 3: Example of Generating SAX word
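As a concrete illustration of the discretization above, the following Python sketch computes a SAX word for one subsequence, assuming the subsequence length is divisible by the PAA size w; the function name and the Gaussian-quantile breakpoints generated with scipy are our own stand-in for the pre-defined table in Fig. 3.

import numpy as np
from scipy.stats import norm

def sax_word(subseq, w, a):
    # Discretize one subsequence into a SAX word of w symbols over an
    # alphabet of size a; breakpoints are equiprobable N(0,1) quantiles [12].
    x = np.asarray(subseq, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-8)        # z-normalize
    paa = x.reshape(w, -1).mean(axis=1)          # PAA: mean of each of w segments
    breakpoints = norm.ppf(np.arange(1, a) / a)  # a - 1 Gaussian breakpoints
    symbols = np.searchsorted(breakpoints, paa)  # map each coefficient to a bin
    return "".join(chr(ord("a") + s) for s in symbols)

For example, sax_word(np.sin(np.linspace(0, 6, 300)), w=6, a=4) returns a 6-letter word over the alphabet {a, b, c, d}.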

IV-B Fast Computation of SAX

Since SAX words are heavily used in the proposed algorithm, it is very important to reduce the time cost of discretization. We introduce a fast way to compute SAX words from subsequences of different lengths. Two vectors of statistical features (prefix sums of the time series and of its squared values) are first computed from the input time series T. Given a subsequence of length L, its SAX representation can be computed by Algorithm 1. The algorithm uses a fast PAA computation approach [10][25] to compute the PAA coefficients with a time complexity that does not depend on the subsequence length (Lines 5-7). In Line 8, the PAA coefficients are converted to a SAX word based on the pre-defined breakpoints table. Since the two feature vectors can be computed during pre-processing, the cost of computing a SAX word for a subsequence of arbitrary length during the motif discovery process depends only on the PAA size w and the alphabet size a. As demonstrated in [12], w and a should be very small compared with the subsequence length, so the time cost of computing a SAX word is reduced from linear in the subsequence length to a small constant in practice (w, a << L).

1: Input: prefix-sum vectors of T and of T squared, PAA size w, subsequence S (start position i, length L)
2: Output: SAX word of S
3: Compute the sum and squared sum of S from the prefix-sum vectors
4: Compute the mean mu and standard deviation sigma of S
5: for every PAA segment do
6:    Compute the segment mean from the prefix sums and z-normalize it with mu and sigma to obtain the PAA coefficient
7: end for
8: return ConvertToSAX(PAA coefficients)
Algorithm 1 Fast SAX Computation (FastSAX)
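Below is a minimal Python sketch of the same idea, assuming the two precomputed feature vectors are prefix sums of the series and of its squares and that the subsequence is at least as long as the PAA size; the helper names are ours, not the paper's.

import numpy as np
from scipy.stats import norm

def precompute(ts):
    # Prefix sums of the series and of its squares, computed once in O(n);
    # these play the role of the two statistical feature vectors in Algorithm 1.
    ts = np.asarray(ts, dtype=float)
    return (np.concatenate(([0.0], np.cumsum(ts))),
            np.concatenate(([0.0], np.cumsum(ts * ts))))

def fast_sax(csum, csum2, start, length, w, a):
    # SAX word of ts[start:start+length]; the cost depends only on w and a,
    # not on the subsequence length.
    total = csum[start + length] - csum[start]
    total2 = csum2[start + length] - csum2[start]
    mu = total / length
    sigma = max(np.sqrt(max(total2 / length - mu * mu, 0.0)), 1e-8)
    # boundaries of the w PAA segments, rounded to sample positions
    bounds = start + np.round(np.linspace(0, length, w + 1)).astype(int)
    seg_mean = (csum[bounds[1:]] - csum[bounds[:-1]]) / (bounds[1:] - bounds[:-1])
    paa = (seg_mean - mu) / sigma                  # z-normalized PAA coefficients
    breakpoints = norm.ppf(np.arange(1, a) / a)
    return "".join(chr(ord("a") + s) for s in np.searchsorted(breakpoints, paa))

Once the prefix sums are built for a series, fast_sax can be called for arbitrary start positions and lengths at negligible cost, which is what the recursive enumeration in Section IV-E relies on.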

IV-C Lower-bound based Numerosity Reduction

In practice, neighboring subsequences are similar to each other since they are offset by only one point. To reduce the cost of detecting long motifs, Numerosity Reduction (NR) is used to avoid unnecessary similarity comparisons. Different from previous work [9][6], where a subsequence is skipped if its SAX representation is identical to the last recorded one, we use the PAA distance to remove consecutive similar subsequences: a subsequence is ignored if the lower-bounding PAA distance between neighboring subsequences is less than a small threshold. Since we want a tight lower bound, we use a large PAA size; in this work, we use a PAA size of 32 for numerosity reduction.
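The sketch below illustrates this lower-bound based skipping rule, assuming the subsequence length is divisible by the PAA size; the threshold value and the helper names are our own illustrative choices, and the lower bound follows the standard PAA form, sqrt(L/w) times the Euclidean distance between PAA coefficients.

import numpy as np

def paa(x, w):
    # z-normalized PAA coefficients (length assumed divisible by w)
    x = np.asarray(x, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-8)
    return x.reshape(w, -1).mean(axis=1)

def numerosity_reduce(ts, length, w=32, threshold=0.5):
    # Keep a subsequence only if its lower-bounding PAA distance to the last
    # kept subsequence reaches the threshold (the value 0.5 is illustrative).
    ts = np.asarray(ts, dtype=float)
    kept = [0]
    last = paa(ts[0:length], w)
    for i in range(1, len(ts) - length + 1):
        cur = paa(ts[i:i + length], w)
        lb = np.sqrt(length / w) * np.linalg.norm(cur - last)  # PAA lower bound
        if lb >= threshold:
            kept.append(i)
            last = cur
    return kept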

IV-D Induction Graph

Fig. 4: Illustration of Induction Graph

The number of SAX words that need to be tested to form long motifs can be unnecessarily large if numerosity reduction is not employed. See Fig. 4 (top): the time series is converted to a sequence of SAX words, one per subsequence, and the i-th node in the graph represents the i-th subsequence in the time series. Without numerosity reduction, every single subsequence/SAX word is kept. However, numerosity reduction may cause some patterns to be missed, since some subsequences are skipped. In the example shown in Fig. 4 (middle), several nodes are skipped due to numerosity reduction, and patterns involving the skipped subsequences may be missed. To mitigate the drawbacks of both scenarios, we introduce the Induction Graph (Fig. 4 (bottom)), a graph structure that helps determine the order of scanning and enumeration of motif candidates during motif discovery. Each node has 2 outgoing edges, a next edge and a forward edge (the black and yellow arrows in Fig. 4 (bottom), respectively): the next edge points to the node representing the immediately adjacent subsequence, and the forward edge points to the next non-similar subsequence determined by numerosity reduction (see previous section). The Induction Graph can be stored by recording only the nodes connected by forward edges; for example, in Fig. 4, all edges and nodes in the Induction Graph can be reconstructed if we record the 4 nodes retained by numerosity reduction. In the rest of the paper, we also refer to the nodes reached by following the next edge and the forward edge in the reverse direction.
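One possible in-memory layout for an Induction Graph node is sketched below; the field names are ours and merely illustrate the next/forward edges and their reverse links described above.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # One Induction Graph node: the subsequence it represents plus its two
    # outgoing edges (next = adjacent subsequence, forward = next
    # non-similar subsequence after numerosity reduction) and reverse links.
    start: int                           # start position of the subsequence
    length: int                          # subsequence length
    sax: str                             # SAX word of the subsequence
    next: Optional["Node"] = None        # next edge
    forward: Optional["Node"] = None     # forward edge
    prev: Optional["Node"] = None        # reverse of a next edge
    backward: Optional["Node"] = None    # reverse of a forward edge

Storing only the nodes reached by forward edges is enough to rebuild the whole graph, as noted above.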

IV-E Hierarchical based Motif Enumeration (HIME)

Hierarchical based Motif Enumeration (HIME) is described in Algorithm 2. Intuitively, the algorithm makes a left-to-right pass through all nodes via the next edges of the Induction Graph G. For each node, the algorithm recursively executes two major functions: "RecursiveEnumeration" (Lines 16-17) and "RemoveCoveredMotif" (Line 13). In the "RecursiveEnumeration" step, SAX words are formed to represent variable-length subsequences and to detect repeating subsequences. "RemoveCoveredMotif" removes short motifs that are completely covered by longer motifs, in order to maintain a small motif set at low cost. The motifs detected by the algorithm are stored in MotifSet (Line 12). Finally, post-processing is applied to remove trivial and false-positive motif candidates (Line 20). Note that HIME can also utilize numerosity reduction by examining only the nodes reachable via the forward edges of G.

1: Input: Induction Graph G
2: Parameters: PAA size w, alphabet size a
3: Output: Motif Set
4: VLSAXTable[SAX word, Length][Location] = {}
5: for each node N in G, from left to right do   {compute the SAX word for the merged long subsequence}
6:    N_new = Merge(N, N.forward);  W = FastSAX(N_new, w, a);   {check whether the same SAX word is already recorded for a subsequence of similar length}
7:    if !VLSAXTable.exist(W, N_new.Length) then
8:       VLSAXTable.put(W, N_new.Length, N_new.Location);
9:    else   {retrieve the subsequence with the matching SAX word}
10:      N_match = VLSAXTable.getSimLengthSeq(W);   {update Induction Graph G}
11:      InsertMotifNode(G, N_new, N_match);
12:      UpdateMotifSet(MotifSet, N_new, N_match);   {greedily remove covered motifs}
13:      RemoveCoveredMotif(VLSAXTable, N_new);   {enumerate longer motifs based on N_new and its neighbors}
14:      N_L = Merge(N_new.prev, N_new);
15:      N_R = Merge(N_new, N_new.forward);
16:      RecursiveEnumeration(N_L);
17:      RecursiveEnumeration(N_R);
18:   end if
19: end for
20: return RemoveTrivialAndFalsePositive(MotifSet);
Algorithm 2 Hierarchical based Motif Enumeration (HIME)

IV-E1 SAX-based Recursive Enumeration

In this step, Algorithm 2 recursively executes Lines 6-18 to detect variable-length motifs. First, HIME computes a SAX word W and generates a new node N_new to represent the long subsequence obtained by merging two short subsequences (Line 6). The new SAX word W, along with the total length of the merged subsequence and the location of the subsequence, is inserted into a SAX word table, VLSAXTable, if the same SAX word representing some subsequence(s) of similar length does not already exist in the table (Lines 7-8). If W already exists in VLSAXTable for some subsequence(s) of similar length, we have found a motif match. The algorithm retrieves the matching node N_match based on the location of the matching subsequence (Line 10) and inserts N_new into the graph (Line 11). N_new and N_match are now instances of a motif. The next edge and forward edge of N_new, together with the corresponding reverse links, are updated so that N_new is properly connected to its neighboring nodes in the graph. Inserting the new node allows us to re-use the detected motifs to reduce the cost of enumerating long motifs. The algorithm then recursively tests the two new longer subsequences generated by further merging N_new with its left and right neighbors, respectively, until the condition in Line 7 is satisfied (Lines 16-17). Such a greedy recursive strategy can efficiently generate a low hierarchy structure for matching long repeating patterns [23].
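A minimal sketch of the VLSAXTable lookup used in Lines 7-10 is given below; the class layout, the 10% relative length tolerance, and the method names are our assumptions for illustrating the (SAX word, similar length) matching, not the paper's exact rule.

from collections import defaultdict

class VLSAXTable:
    # Hash table keyed by SAX word; a lookup hits only if the same word was
    # already recorded for a subsequence of similar length.
    def __init__(self, tol=0.1):
        self.tol = tol                       # relative length tolerance (illustrative)
        self.table = defaultdict(list)       # word -> list of (length, location)

    def get_similar_length_seq(self, word, length):
        for rec_len, rec_loc in self.table[word]:
            if abs(rec_len - length) <= self.tol * length:
                return rec_len, rec_loc      # motif match found
        return None

    def put(self, word, length, location):
        self.table[word].append((length, location))

    def remove(self, word, length):
        # used when a covered (shorter) motif's entries are dropped
        self.table[word] = [(l, p) for l, p in self.table[word] if l != length]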

Fig. 5: Example showing the greedy removal of covered motif

IV-E2 Removing Covered Motifs

Some short motifs may completely overlap with long motifs. As demonstrated in [24][18], these short motifs ("covered motifs") are redundant, so during the motif enumeration process we run a fast procedure (Line 13) to remove the entries for covered motifs from VLSAXTable and MotifSet. Fig. 5 illustrates the process. Each (SAX word, length) pair in the figure denotes the SAX word that represents a subsequence of the given length; subsequences having the same pair are instances of the same candidate motif and are labeled in the same color. In the example, the algorithm first finds the short motif (Fig. 5(b)) based on two matching SAX words of similar length. The algorithm then expands the matched subsequence using the recursive enumeration strategy described above and forms a longer subsequence, for which a match is also found (labeled in green in Fig. 5(c)). Since the longer word is generated from the shorter one and both represent the same subsequence (one simply being a longer version), all instances of the motif represented by the short word may overlap with the instances of the long motif. Therefore, the algorithm removes the subsequences associated with the short word from VLSAXTable and the corresponding candidate motif from the motif set (Fig. 5(d)). While "RemoveCoveredMotif" cannot remove all covered motifs, it minimizes the time and space cost during motif discovery. Any remaining covered motifs can be removed using the algorithm described in [18] after the motif set is obtained.

IV-F Comparison with Sequence-Matching Based Approximate Motif Discovery Approaches

Compared with SAX-based sequence matching approximate motif discovery approaches (e.g. [9][6]), HIME has two unique advantages.

Fig. 6: An example showing the difficulty for Grammar Induction approach to find long motifs

First, these existing approaches can detect a long motif only if the SAX sequences representing the motif instances are identical. To illustrate the difficulty of generating the same sequence for a long motif, an example is shown in Fig. 6. The two subsequences of length 500 in Fig. 6 are instances of a motif discovered by our algorithm in an EPG time series [18]. If we set the minimum length to 50, these two subsequences form two SAX sequences of lengths 27 and 33, respectively, with numerosity reduction. Clearly, the two subsequences are converted to overwhelmingly long word sequences that are not similar to one another except for the first two words. Therefore, sequence-matching based approaches cannot detect them. In contrast, HIME forms a single SAX word for each subsequence via the recursive enumeration process, and finds motifs using only the SAX word representing the subsequence. In this example, since the two subsequences indeed have the same SAX representation, HIME is able to discover them.

Second, the mean and the variance of a subsequence can affect its shape dramatically [18]. However, each SAX word in a word sequence is generated using only the mean and the variance of a small subsequence. Therefore, when the word sequence is long, the mean and the variance of the short subsequences may differ significantly from the overall mean and variance of the whole subsequence. So even if two subsequences are similar to each other, their respective word sequences may be dissimilar. HIME avoids this problem by quickly re-computing the SAX word via Algorithm 1, which re-normalizes the entire subsequence each time.

In the experiments, we demonstrate that HIME significantly increases the enumeration range compared to sequence-matching based approaches.

V Parameter Selection

Fig. 7: Fast Computing Multi-resolution SAX

Since the performance of the algorithm depends on the SAX parameters (PAA size w and alphabet size a), in order to alleviate the burden of parameter selection, we introduce an approach to adaptively choose the alphabet size a.

We first introduce a fast approach to compute multi-resolution SAX at a cost similar to that of fixed-resolution SAX. Given a maximum alphabet size, our approach first gathers all SAX breakpoints for alphabet sizes from 2 up to this maximum. For each interval between two adjacent breakpoints, a symbol sequence containing the corresponding symbol at every resolution is recorded. An example is shown in Fig. 7: the breakpoints are denoted by "x," and the symbol sequence for each interval is shown below it. By using binary search to determine which interval a PAA coefficient belongs to, we can find its SAX representations in all resolutions at low cost, since the time complexity of the binary search is logarithmic in the number of intervals. Given the PAA values highlighted in yellow as an example, the algorithm computes 3 different resolutions of SAX words by simply concatenating three symbol sequences; determining each symbol sequence takes only 3 binary searches. As mentioned in [26], for motif discovery the alphabet size is generally chosen to be under 20, which involves only 128 distinct breakpoints. So the cost of computing a single multi-resolution SAX word with alphabet sizes up to 20 remains very low.
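The Python sketch below shows one way to realize this multi-resolution lookup, assuming Gaussian-quantile breakpoints and our own helper names: breakpoints of all alphabet sizes up to the maximum are merged, each interval stores its symbol at every resolution, and a single binary search recovers the symbols for all resolutions at once.

import numpy as np
from scipy.stats import norm

def build_multires_table(a_max):
    # Merge the breakpoints of every alphabet size 2..a_max and record, for
    # each resulting interval, the symbol index at every resolution.
    cuts = np.array(sorted({float(b) for a in range(2, a_max + 1)
                            for b in norm.ppf(np.arange(1, a) / a)}))
    # the midpoint of each interval decides its symbol at each resolution
    mids = np.concatenate(([cuts[0] - 1.0], (cuts[:-1] + cuts[1:]) / 2, [cuts[-1] + 1.0]))
    symbols = {a: np.searchsorted(norm.ppf(np.arange(1, a) / a), mids)
               for a in range(2, a_max + 1)}
    return cuts, symbols

def multires_symbols(paa_value, cuts, symbols):
    # One binary search yields the symbol of a PAA coefficient at every
    # resolution from 2 to a_max.
    idx = int(np.searchsorted(cuts, paa_value))
    return {a: int(sym[idx]) for a, sym in symbols.items()}

Usage: cuts, table = build_multires_table(8); multires_symbols(0.3, cuts, table) returns the symbol of the coefficient 0.3 for every alphabet size from 2 to 8.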

The adaptive parameter selection procedure is outlined in Algorithm 3. The algorithm repeatedly samples random pairs of subsequences of the minimum motif length until the alphabet size is determined. For each sampled pair, a suitable alphabet size that makes the tightness of the lower bound of the SAX words (the ratio between the SAX lower-bound distance and the true distance) close to 0.5 is recorded. The threshold (0.5) provides a way to balance the tightness of the representation and the number of distinct SAX words. Since the tightness of the lower bound monotonically increases as the alphabet size increases [12], binary search (Lines 4-6) can find a proper alphabet size without examining all resolutions. Intuitively, the tightness of the lower bound reflects the approximation quality of the SAX words and can be used to select a suitable resolution. The average of the recorded alphabet sizes is selected as the parameter for HIME once the parameter selection algorithm converges (e.g. when the change in the average is less than 0.01). By pre-computing the SAX words of all resolutions for each sampled pair, BinarySearchResolution in Algorithm 3 needs only a logarithmic number of steps to find a suitable alphabet size. In our experiments, the average often converges after about one thousand samplings, which takes less than one second.

1: Input: Time series T
2: Output: Alphabet size a
3: while a has not converged do
4:    Randomly sample a pair of subsequences S1, S2 from T
5:    a_i = BinarySearchResolution(S1, S2);
6:    a = Average of all recorded a_i;
7: end while
8: return a
Algorithm 3 Determining Alphabet Size for HIME
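A runnable Python sketch of this selection procedure follows; it assumes the standard SAX MINDIST lower bound [12] for the tightness computation, a fixed number of samples instead of an explicit convergence test, and a sampled length divisible by the PAA size. All function names are ours.

import numpy as np
from scipy.stats import norm

def mindist(word1, word2, a, length, w):
    # SAX MINDIST lower bound on the z-normalized Euclidean distance [12]
    beta = norm.ppf(np.arange(1, a) / a)
    d = 0.0
    for r, c in zip(word1, word2):
        if abs(r - c) > 1:
            cell = beta[max(r, c) - 1] - beta[min(r, c)]
            d += cell * cell
    return np.sqrt(length / w) * np.sqrt(d)

def tlb(s1, s2, a, w):
    # tightness of lower bound: MINDIST divided by the true distance
    z1 = (s1 - s1.mean()) / (s1.std() + 1e-8)
    z2 = (s2 - s2.mean()) / (s2.std() + 1e-8)
    true = np.linalg.norm(z1 - z2) + 1e-8
    bp = norm.ppf(np.arange(1, a) / a)
    w1 = np.searchsorted(bp, z1.reshape(w, -1).mean(axis=1))
    w2 = np.searchsorted(bp, z2.reshape(w, -1).mean(axis=1))
    return mindist(w1, w2, a, len(s1), w) / true

def pick_alphabet(ts, lmin, w=6, a_max=20, n_samples=1000, target=0.5, seed=0):
    # Average, over random subsequence pairs, the smallest alphabet size whose
    # tightness of lower bound reaches the target; the binary search exploits
    # the monotonicity of the tightness in the alphabet size.
    ts = np.asarray(ts, dtype=float)
    rng = np.random.default_rng(seed)
    picks = []
    for _ in range(n_samples):
        i, j = rng.integers(0, len(ts) - lmin, size=2)
        s1, s2 = ts[i:i + lmin], ts[j:j + lmin]
        lo, hi = 2, a_max
        while lo < hi:
            mid = (lo + hi) // 2
            if tlb(s1, s2, mid, w) >= target:
                hi = mid
            else:
                lo = mid + 1
        picks.append(lo)
    return int(round(float(np.mean(picks))))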

VI Evaluation Experiments

We perform a series of experiments to evaluate the performance of HIME. All experiments are conducted on a laptop with 16 GB of RAM and a 2.5 GHz quad-core processor. The executable software and datasets used in the experiments can be found at http://bit.ly/2rvBETV. Note that the goal of the experiments is to demonstrate the contributions and potential applications of efficiently finding motifs with large length differences, which existing algorithms may have difficulty finding.

We first test 2 different state-of-the-art enumeration approaches [24][18] on the task of detecting motifs of lengths from 300 to 2300 in three real-world time series of length 160,000, to demonstrate the scalability problem. The first approach iteratively calls a fixed-length motif discovery algorithm (e.g. MK [20]) to find the most similar subsequences at different lengths [24]; we choose the currently fastest algorithm, Quick-Motif [10], to maximize its scalability (denoted ItrQuick-Motif). The second approach [18] (MOEN) applies an all-pairwise similarity search algorithm and a lower bound to prune some lengths that do not need to be tested; we choose the fastest algorithms, STOMP [32] and STAMP [31], to be used with the MOEN framework. Since STAMP is slower than STOMP in all test cases, we only show the execution time for motif enumeration based on STOMP (denoted MOEN-STOMP). All execution times of STAMP and STOMP are based on the C code provided by the authors. Table I shows the execution times of the fixed-length motif discovery algorithms, of MOEN (using code provided by the author), and the estimated execution times if Quick-Motif and STOMP are applied in the two enumeration approaches, respectively (calculated based on the number of times the fixed-length algorithm is called in each framework). From the results, ItrQuick-Motif is very costly compared to the MOEN framework. Although MOEN reaches high pruning rates on the 3 datasets, even with the latest all-pair similarity comparison algorithm, MOEN-STOMP still takes hours to detect the motifs. In contrast, HIME takes only a few seconds including post-processing. State-of-the-art algorithms are therefore hard to apply in the remaining experiments, since the smallest time series used has a length of 1 million. Instead of directly comparing scalability, we therefore compare against STOMP with an assumed 99% pruning rate; as shown in the table, a 99% pruning rate is hard to reach even in a periodic ECG time series.

Similar to ItrQuick-Motif, directly using fixed-length approximate motif discovery [2][5][13] is also costly. For example, even if an approximate motif discovery algorithm takes only one second to find fixed-length motifs, finding motifs in the length range [300, 2300] still requires approximately 33 minutes. In long time series, the discretization step alone in [5][13] may take more than one second. A similar problem exists for the fixed-length anytime version of the state-of-the-art motif discovery algorithm, STAMP [31]: STAMP can examine fewer than 50 out of 16 million subsequences in one second.

Algorithm \ Dataset       Electric Power   EEG          ECG
STAMP (fixed)             12.15 min        12 min       13.7 min
STOMP (fixed)             4.15 min         5.19 min     4.05 min
Quick-Motif (fixed)       8.2 min          7.8 min      30 sec
MOEN                      3.1 days         1.7 days     16 hr
ItrQuick-Motif (est.)     11.3 days        10.08 days   16.6 hr
MOEN-STOMP (est.)         13.8 hr          8.65 hr      2.7 hr
HIME + Post-Processing    23 sec           21 sec       47 sec
TABLE I: Time required to detect motifs of lengths from 300 to 2300 in time series of length 160,000. Some of the times are estimates only.

In all experiments unless otherwise noted, the alphabet size is set using Algorithm 3, the PAA size is set to 6, the minimum length is 300, and the motif threshold function described in Section III is used. The relation between execution time, enumeration range and parameter setting is discussed in Section VI-C.

VI-A Detecting Planted Motifs in Random Walk Time Series

We first test HIME in a planted motif experiment to demonstrate its ability to detect motifs with high accuracy. We choose a grammar (Sequitur) based motif discovery algorithm [9] as the baseline since it can find variable-length motifs in million-scale time series and utilizes a hierarchical identification process similar to the proposed algorithm. We planted 4 motifs of different lengths, 10 instances each, into a random walk time series of length 3 million. With only 10 instances in 3 million points, each motif appears very rarely and can be considered a rare motif [2]. The shapes of the motifs are generated from a waveform with randomly chosen parameters, and the lengths of the planted motifs are 1500, 3000, 6000 and 12000. We added 5% random noise to every instance of the motifs; the mean and variance of each instance are also randomly generated. HIME is expected to find, for each motif, at least one pair of non-overlapping subsequences that highly overlap with the actual planted instances.

Fig. 8: Overlapping rates compared with Sequitur for detecting planted motifs of different lengths

The performance of both algorithms is measured by the overlapping rate between the subsequences found and the ground-truth locations. Since the grammar-based algorithm also infers motifs from the discretized SAX word sequence, we conduct a grid search over its PAA size and alphabet size (covering the grid-search area mentioned in [6]) and only report its best performance.

The experimental result is shown in Fig. 8. For Sequitur, the best overlapping rate (0.45 on average) is reached only for the shortest planted motif length; for all other motif lengths, the overlapping rates are between 0.1 and 0.3. In contrast, HIME consistently achieves overlapping rates above 0.8 for motif lengths 1500, 3000 and 6000. The performance decreases when detecting the motif of length 12000; however, HIME can still achieve overlapping rates above 0.8 in 6 out of 10 cases for this length.

VI-A1 Performance vs. Number of Instances

Fig. 9: Overlapping rates vs. number of motif instances for motif length 6000
# of Subsequences             1 million    2 million    4 million         8 million          16 million
Motif Length Range Detected   [300, 3020]  [300, 3643]  [300, 5104]       [300, 7775]        [300, 10795]
STOMP (fixed length)          2 hr         16.6 hr      5.01 days (est.)  18.25 days (est.)  68 days (est.)
STOMP, 99% pruning rate       2.25 days    22.7 days    0.64 yr (est.)    3.65 yr (est.)     19.9 yr (est.)
HIME                          1.2 min      2.9 min      8.4 min           30 min             1.6 hr
Post-processing               0.4 min      0.8 min      3.1 min           15 min             0.9 hr
Total                         1.6 min      3.7 min      11.5 min          45 min             2.5 hr
TABLE II: Scalability compared with the estimated state-of-the-art solution (without numerosity reduction)
Time Series Length                  1 million           2 million           4 million           8 million           16 million
Sequitur-Best (Enumeration Range)   506 ([300, 806])    535 ([300, 835])    586 ([300, 886])    586 ([300, 886])    586 ([300, 886])
HIME (Enumeration Range)            2307 ([300, 2607])  2731 ([300, 3031])  3705 ([300, 4005])  5965 ([300, 6265])  8811 ([300, 9111])
Sequitur-Best (Time)                3.6 sec             12 sec              30 sec              2 min               6 min
HIME (Time)                         12 sec              30 sec              1.4 min             9 min               40 min
TABLE III: Enumeration range growth compared with Sequitur (with numerosity reduction)

We conducted an experiment to illustrate the relationship between accuracy and the number of instances repeated in the time series. We use the planted motif of length 6000 and vary the number of instances in each run. The overlapping rate versus the number of repeated instances is shown in Fig. 9. According to Fig. 9, the accuracy grows rapidly as the number of repeated instances increases. The algorithm successfully detects the planted motifs with very high overlapping rate (above 80% overlap with the ground truth) when the number of instances is above 6. Note that a motif repeating only 6 times in a million-scale time series is considered rare relative to the length of the time series. Therefore, although HIME is a greedy approximate motif discovery algorithm that does not guarantee an exact solution, since the algorithm utilizes all short subsequences overlapping the instances to detect long motifs, the chance that a motif is detected increases with the number of motif instances. Also, motifs often repeat many times in a long time series (e.g. a motif appearing with probability 10^-4 may appear 100 times in a time series of length one million), so the chance of a motif being detected by HIME is high.

VI-B Scalability

In this subsection, the scalability of the algorithm is tested on a 16-million-length random walk time series.

VI-B1 Execution Time vs. Data Size

(a) Parameters vs. algorithm execution time
(b) Parameters vs. enumeration range
Fig. 10: Parameter experiments on a random walk time series

We use STOMP [32] with a 99% pruning rate to demonstrate the necessity of an approximate approach in million-scale (or larger) time series. In the test cases where STOMP takes more than 24 hours, we estimate the execution time from the first 100 iterations (the same estimation approach used in [31]). The enumeration range of HIME is measured after removing all false-positive motifs based on the motif threshold function. The results are shown in Table II. The running time of HIME grows significantly more slowly than that of the state-of-the-art algorithms. In the largest test case, HIME takes approximately 2.5 hours to process 16 million subsequences, and the length range of motifs found is from 300 to 10795. The estimated execution time for STOMP is 68 days, so even with a 99% pruning rate, it may take about 20 years to enumerate the same length range as HIME (though for STOMP, the solution would be exact). Note that the GPU version of STOMP [32], which achieves approximately 150 times speedup, can be used to increase the scalability of STOMP; however, it may still take 47 days. The problem size is simply too large for state-of-the-art algorithms to obtain the exact solution. In contrast, HIME provides an alternative way to efficiently detect approximate variable-length motifs over a large length range at this scale.

VI-B2 Enumeration Range vs. Data Size

We next demonstrate that HIME can efficiently detect motifs over a growing enumeration range as the length of the time series increases. The performance is compared with that of Sequitur (the grammar-based approach). Similar to the previous experiment, we use grid search to find the best parameters for Sequitur so that its enumeration range is maximized. In order to make a fair comparison, HIME uses the same numerosity reduction strategy as Sequitur in this experiment instead of the Induction Graph approach introduced in Section IV-D; this way, both algorithms process the same input. The results are shown in Table III. Sequitur's enumeration range stops growing after reaching length 886. In contrast, HIME's enumeration range continues to increase as the time series length grows. Therefore, in large-scale time series, HIME can detect significantly longer motifs than Sequitur; when the length of the time series reaches 16 million, the enumeration range of HIME is one order of magnitude larger than that of Sequitur. We also measure the execution time of both algorithms including post-processing. According to the results, HIME's overall running time is higher than Sequitur's. However, considering that HIME's enumeration range is 4-15 times larger, HIME is more efficient than Sequitur per unit of enumerated motif length.

VI-C Parameter Analysis

In this subsection, we demonstrate that the parameters of HIME are easy to choose. Recall that Algorithm 3 can determine the alphabet size given a range, so we only need to set one parameter for HIME.

We test HIME with alphabet sizes from 4 to 16 and PAA sizes from 5 to 15 on a 1-million-length random walk time series. The execution time and enumeration range for all parameter combinations are shown in Fig. 10(a) and Fig. 10(b), respectively. According to Fig. 10, HIME's execution time and enumeration range increase as the two parameters decrease. This is because small parameter values allow easier word matching, but the looser representation also requires the algorithm to spend extra time comparing and filtering out false positives. As the number of distinct SAX words increases with larger parameter values, both the execution time and the enumeration range of the algorithm decrease. The parameter combinations chosen by Algorithm 3 for different PAA sizes are labeled in red in the figure. According to the figure, Algorithm 3 tends to select an alphabet size that trades off execution time and enumeration range with roughly equal priority. When Algorithm 3 is used to determine the alphabet size, the change in execution time and enumeration range as the PAA size increases follows the same trend as when both parameters are set manually. So the user needs to select only the PAA size, unless they want to enumerate a very large length range or to reduce the time cost on a very long time series; in those cases, both parameters can be set to balance the enumeration range and the execution time.

VI-D Case Studies

In this section, we show that HIME can find high-quality motifs in several real-world million-scale time series. As demonstrated in the previous sections, existing algorithms are too costly to be applied to problems of this scale.

VI-D1 Variable-Length Similar Subsequence Discovery in a DNA Sequence

As shown in previous work, converting a DNA sequence to a time series [25] can help researchers understand structural similarity [32][25] between DNA subsequences. In this case study, we use HIME to find repeating subsequences in the human hgY chromosome. We convert hgY to a 26-million-length time series based on the algorithm described in [25]. We choose the PAA size so that the algorithm can finish the search process within half an hour, and set the minimum motif length to 1000.

The motif density curve [26] is shown in Fig. 11 (top). We observe a region that contains a large amount of repeated subsequences. According to [28], this region is the longest "ampliconic region" (labeled in red) in the Y chromosome. As explained in [28], the ampliconic region, or "the ampliconic segments, are composed largely of sequences that exhibit marked similarity —as much as 99.9% identity over tens or hundreds of kilobases—to other sequences in the MSY."

Two examples of long motifs discovered by our algorithm are shown in Fig. 11 (bottom). The two motifs indicate similarity between long DNA subsequences of lengths 18,000 and 22,000, respectively. Note that previous work [28] used a sliding window of length 2000 to find similar DNA sequence patterns. We show the first 2000 points of the long motifs in the black boxes. For both motifs, the distance between the first 2000 points of the two instances is greater than the motif threshold; in other words, they would not be discovered by a fixed-length motif discovery algorithm with motif length 2000. This experiment indicates that, by enabling variable-length motif discovery in long time series, HIME provides an opportunity to discover potentially surprising patterns that are otherwise hard to detect.

Fig. 11: Examples of time series motifs found in human Y-Chromosome

VI-D2 Discovering Motifs in Bird Soundtracks

Existing work [2][27] shows that motifs can be used to find repeated calls in bird soundtracks. In this experiment, we test our approach on 600 recordings of Rufous-capped, Spix's and Azara's Spinetail birds from [1]. We use the second Mel-Frequency Cepstral Coefficient (MFCC), sampled at 250 Hz, to form a 5-million-length time series. HIME is applied to this time series with the minimum length equal to 0.5 sec.

(a) Rufous-capped Spinetail
(b) Spix's Spinetail
(c) Azara's Spinetail
Fig. 12: Examples of motifs detected by HIME in Bird Soundtrack

Three examples of motifs, of lengths 3.4, 2.0 and 4.5 seconds, are shown in Fig. 12. By using the query algorithm [21] to retrieve all motif instances from the closest pair detected, we find that the instances of each of these three motifs come from the same bird species. Four recordings of each species are shown in Fig. 12 (right), with the locations of the motifs labeled in red. These 3 motifs also reveal some sound patterns unique to each bird. For example, Fig. 12(a) shows that the Rufous-capped Spinetail may stay silent for 1.5 seconds after a call; Fig. 12(b) shows that the Spix's Spinetail tends to leave a 0.5-second gap between calls; and Fig. 12(c) indicates that the Azara's Spinetail often produces three consecutive calls. Such information provides useful insights that can help researchers understand bird behavior.

VI-D3 Revealing Rare Patterns in a Long Electric Power Usage Time Series

In this case study, we show that by enabling motif discovery over a large length range, HIME can detect long and rare patterns in real-world time series. HIME is applied to a 7.4-million-length freezer electric power usage time series recorded in [22], with the minimum length approximately equal to a 1-hour period. An example of a long pattern is shown in Fig. 13. HIME detects the repeating subsequences shown in Fig. 13 (top, middle; approximately 10 hours). Using the query algorithm [21], we find that even the most similar other subsequence (Fig. 13 (bottom)) already shows an obvious difference (highlighted by the red box), which suggests that this is a rare motif. Moreover, the two subsequences found by HIME are 1 month and 7 days apart in the time series, and hence hard to find in shorter recordings.

Fig. 13: Example of rare motif in electric power usage data

VII Conclusion

We introduce a new algorithm, HIME, to detect time series motifs over a large range of lengths in million-scale time series that state-of-the-art algorithms cannot handle efficiently. HIME can process a million subsequences in less than 1 minute, which is much faster than any existing algorithm to date. Compared with sequence-matching based approaches such as grammar induction-based motif discovery algorithms, HIME achieves an order of magnitude increase in search range. In the case studies, we demonstrate that the motifs found by HIME are meaningful and can potentially have significant impact in various applications.

References

  • [1] www.xeno-canto.org.
  • [2] N. Begum and E. Keogh. Rare time series motif discovery from unbounded streams. Proceedings of the VLDB Endowment, 8(2):149–160, 2014.
  • [3] K. Buza and L. Schmidt-Thieme. Motif-based classification of time series with bayesian networks and svms. In Advances in Data Analysis, Data Handling and Business Intelligence, pages 105–114. Springer, 2010.
  • [4] N. Castro and P. J. Azevedo. Multiresolution motif discovery in time series. In SDM, pages 665–676. SIAM, 2010.
  • [5] B. Chiu, E. Keogh, and S. Lonardi. Probabilistic discovery of time series motifs. In Proceedings of the 9th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 493–498, 2003.
  • [6] Y. Gao, J. Lin, and H. Rangwala. Iterative grammar-based framework for discovering variable-length time series motifs. In Machine Learning and Applications (ICMLA), 2016 15th IEEE International Conference on, pages 7–12. IEEE, 2016.
  • [7] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley. Physiobank, physiotoolkit, and physionet components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220, 2000.
  • [8] E. Keogh, J. Lin, and A. Fu. Hot sax: Efficiently finding the most unusual time series subsequence. In Fifth IEEE International Conference on Data Mining (ICDM'05), 8 pp. IEEE, 2005.
  • [9] Y. Li, J. Lin, and T. Oates. Visualizing variable-length time series motifs. In SDM, pages 895–906. SIAM, 2012.
  • [10] Y. Li, M. L. Yiu, Z. Gong, et al. Quick-motif: An efficient and scalable framework for exact motif discovery. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pages 579–590. IEEE, 2015.
  • [11] J. Lin, E. Keogh, S. Lonardi, and P. Patel. Finding motifs in time series. In the 2nd Workshop on Temporal Data Mining, at the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002.
  • [12] J. Lin, E. Keogh, L. Wei, and S. Lonardi. Experiencing sax: a novel symbolic representation of time series. Data Mining and knowledge discovery, 15(2):107–144, 2007.
  • [13] B. Liu, J. Li, C. Chen, W. Tan, Q. Chen, and M. Zhou. Efficient motif discovery for large-scale time series in healthcare. IEEE Transactions on Industrial Informatics, 11(3):583–590, 2015.
  • [14] J. Meng, J. Yuan, M. Hans, and Y. Wu. Mining motifs from human motion. In Proc. of EUROGRAPHICS, volume 8, 2008.
  • [15] D. Minnen, T. Starner, I. Essa, and C. Isbell. Discovering characteristic actions from on-body sensor data. In Wearable computers, 2006 10th IEEE international symposium on, pages 11–18. IEEE, 2006.
  • [16] Y. Mohammad and T. Nishida. Exact discovery of length-range motifs. In Intelligent Information and Database Systems, pages 23–32. Springer, 2014.
  • [17] Y. Mohammad and T. Nishida. Scale invariant multi-length motif discovery. In Modern Advances in Applied Intelligence, pages 417–426. Springer, 2014.
  • [18] A. Mueen. Enumeration of time series motifs of all lengths. In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pages 547–556. IEEE, 2013.
  • [19] A. Mueen and E. Keogh. Online discovery and maintenance of time series motifs. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1089–1098. ACM, 2010.
  • [20] A. Mueen, E. J. Keogh, Q. Zhu, S. Cash, and M. B. Westover. Exact discovery of time series motifs. In SDM, pages 473–484. SIAM, 2009.
  • [21] A. Mueen, Y. Zhu, M. Yeh, K. Kamgar, K. Viswanathan, C. Gupta, and E. Keogh. The fastest similarity search algorithm for time series subsequences under euclidean distance, August 2015.
  • [22] D. Murray, J. Liao, L. Stankovic, V. Stankovic, R. Hauxwell-Baldwin, C. Wilson, M. Coleman, T. Kane, and S. Firth. A data management platform for personalised real-time energy feedback. August 2015.
  • [23] C. G. Nevill-Manning and I. H. Witten. Identifying hierarchical structure in sequences: A linear-time algorithm. J. Artif. Intell. Res. (JAIR), 7:67–82, 1997.
  • [24] P. Nunthanid, V. Niennattrakul, and C. A. Ratanamahatana. Discovery of variable length time series motif. In Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), 2011 8th International Conference on, pages 472–475. IEEE, 2011.
  • [25] T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria, and E. Keogh. Searching and mining trillions of time series subsequences under dynamic time warping. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 262–270. ACM, 2012.
  • [26] P. Senin, J. Lin, X. Wang, T. Oates, S. Gandhi, A. P. Boedihardjo, C. Chen, S. Frankenstein, and M. Lerner. Grammarviz 2.0: a tool for grammar-based pattern discovery in time series. In Machine Learning and Knowledge Discovery in Databases, pages 468–472. Springer, 2014.
  • [27] M. Shokoohi-Yekta, Y. Chen, B. Campana, B. Hu, J. Zakaria, and E. Keogh. Discovery of meaningful rules in time series. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1085–1094. ACM, 2015.
  • [28] H. Skaletsky, T. Kuroda-Kawaguchi, P. J. Minx, H. S. Cordum, L. Hillier, L. G. Brown, S. Repping, T. Pyntikova, J. Ali, T. Bieri, et al. The male-specific region of the human y chromosome is a mosaic of discrete sequence classes. Nature, 423(6942):825–837, 2003.
  • [29] H. Tang and S. S. Liao. Discovering original motifs with different lengths from time series. Knowledge-Based Systems, 21(7):666–671, 2008.
  • [30] X. Wang, J. Lin, P. Senin, T. Oates, S. Gandhi, A. P. Boedihardjo, C. Chen, and S. Frankenstein. Rpm: Representative pattern mining for efficient time series classification. In EDBT, pages 185–196, 2016.
  • [31] C.-C. M. Yeh, Y. Zhu, L. Ulanova, N. Begum, Y. Ding, H. A. Dau, D. F. Silva, A. Mueen, and E. Keogh. Matrix profile i: All pairs similarity joins for time series: A unifying view that includes motifs, discords and shapelets. In IEEE ICDM, 2016.
  • [32] Y. Zhu, Z. Zimmerman, N. S. Senobari, C.-C. M. Yeh, G. Funning, A. Mueen, P. Brisk, and E. Keogh. Matrix profile ii: Exploiting a novel algorithm and gpus to break the one hundred million barrier for time series motifs and joins. In IEEE ICDM, 2016.