The task of finding repeated, similar patterns in time series data, known as motif discovery, has received a great deal of attention in the past decade. It has been used as an important subroutine in many time series data mining tasks and applications such as association rule mining [26], classification , clustering [26], and activity recognition .
Most existing motif discovery algorithms  treat the motif length as important domain knowledge that must be supplied by the user . We argue that the difficulty of finding motifs of different, previously unknown lengths, a topic that has been largely glossed over in the literature, is a significant obstacle that prevents motifs from achieving their full potential as a useful time series data mining primitive.
To demonstrate the limitations of fixed-length motif discovery, we show an example on a subset of a dishwasher electric power demand time series (Fig. 1, top). A basic wash cycle takes approximately half an hour (approximately 350 sample points). By setting the motif length equal to one basic cycle, we can find a frequently repeating pattern, representing a wash cycle, as shown in Fig. 1 (Motif A). However, after running our algorithm, we also found one frequent motif (Motif B) of length 1534 (approximately 2.5 hours), and a rare motif (Motif C) of length 2791 (approximately 4.6 hours) that occurred only a few times in the time series. The discovery of these two long motifs is surprising given their lengths relative to that of the basic wash cycle. Since the lengths of these three motifs are significantly different, fixed-length motif discovery algorithms cannot discover all three motifs in a single run. In contrast, all three motifs are found by our algorithm within several seconds of execution time. Previous work also notes that variable-length motifs are valuable features for many time series data mining tasks .
The greatest challenge in detecting variable-length motifs is scalability. We demonstrate this problem by finding motifs of lengths ranging from 300 to 10300 in a ten-million length time series. The brute force algorithm requires distance calls to solve this problem. In this case, the computational complexity is similar to that of finding fixed-length motif in a 1-billion length time series. In other words, the computational complexity of variable-length motif discovery is 10 times larger than the current largest fixed-length motif discovery problem solved in  (which takes 12.13 days with GPU speed-up). As a result, even though the state-of-the-art algorithm can achieve 95% pruning rate, enumerating motifs from lengths 64 to 1024 in an EEG time series of length 160,000 still takes 16 hours to complete .
Million-scale time series exist in many domains. For example, a 2-year power demand dataset used in  is about 7 million in length. The EOG time series recorded in  is about 8 million in length. For both datasets, the state-of-the-art algorithms are too costly to apply. A series of approximate variable-length time series motif discovery algorithms based on grammar induction have been introduced  following a three-step process. In the first step, the time series is converted to a discrete string sequence via a sliding window. In the second step, grammar induction (e.g. via Sequitur ) is applied to the discretized time series to quickly detect repeated string sequences. The third step maps the repeated strings back to the time series subsequences that the strings represent. While the scalability of the grammar-based approach is considerably better than that of the state-of-the-art algorithms, the length range of motifs detected can still be limited. This is because long motifs are often discretized into unnecessarily long and non-identical string sequences due to the typically high amount of noise in the data. Since grammar induction relies on identical sequence matching, existing grammar-based algorithms have difficulty finding long motifs.
In this paper, we introduce a greedy algorithm named HierarchIcal based Motif Enumeration (HIME) for detecting variable-length motifs. Given a minimum motif length , HIME automatically finds motifs that are (much) larger than based on a symbol table which records discretized representation of variable-length subsequences. While the algorithm is not an exact algorithm, in the experiments, it can find motifs of lengths from 300 to 3000 in a time series of length one million, with execution time of about 1 minute. This is considerably faster than the state-of-the-art algorithms. Such good scalability makes searching for variable-length motifs in million scale time series a feasible task. Moreover, compared with sequence matching based motif discovery approaches such as , HIME considerably increases the motif enumeration range that can be detected.
To summarize, our work has the following unique features:
The scalability is significantly better than state-of-the-art algorithms.
The rest of the paper is organized as follows: Section II discusses related work in fixed- and variable-length motif discovery. Section III introduces the problem definition and notations used in this paper. Section IV introduces our proposed HIME algorithm. An adaptive parameter selection approach is described in Section V. The experimental results are shown in Section VI, and we conclude in Section VII.
II Related Work
While the classic time series motif discovery problem focuses on finding the most frequent patterns , much recent research has defined motifs as the most similar pair of subsequences (which we refer to as pair-motifs hereinafter) . In addition, some research work has focused on fast approximate motif discovery . The motifs discovered by these algorithms may contain subsequences that are not truly similar. These false positive subsequences can be removed during post-processing. Likewise, similar subsequences may be missed.
The authors in  introduce an algorithm named Quick-Motif, which achieves a speed-up of 3 orders of magnitude compared with the traditional state-of-the-art fixed-length motif discovery algorithm . In a recent work,  introduces an anytime algorithm called STAMP, which utilizes a fast similarity search algorithm  to find the exact pair-motif of a given length. In , the authors further introduce an algorithm called STOMP, which can reduce the time complexity of STAMP by . In , the authors utilize random projection to detect fixed-length approximate motifs. The work in  focuses on detecting fixed-length approximate motifs with limited memory.
Some recent research work has focused on efficiently finding variable-length motifs. In , the authors introduce an algorithm named VLMD to find exact K pair-motifs by calling a fixed-length exact motif discovery algorithm (i.e. MK ) with all possible lengths within a range. In , the authors introduce an algorithm called MOEN, which uses a lower bound to speed up the process of enumerating motif lengths. Other works such as  follow an idea similar to that of . However, a common drawback of these algorithms is that they all conduct an incremental enumeration process starting from the smallest length. Since finding fixed-length motifs is already very costly in million-scale time series, such an approach may be impractical for large-scale variable-length motif discovery.
In , the authors introduce a framework, Grammar Induction based motif discovery, which uses grammar induction to find approximate motifs of variable lengths via identifying repeated discrete sequences without enumerating all possible lengths. It is considerably faster than existing algorithms. However, since the idea of the algorithm is based on symbolic sequence matching, the length range of motifs detected may be limited for some real world applications (e.g. the approach cannot detect motifs shown in Fig. 1 in a single run).
Other works such as  also focus on variable-length motif discovery; however, these techniques have some shortcomings that prevent them from being useful in general cases. In , the subsequences are not normalized, which makes it difficult to find patterns that are similar but with different amplitudes. The work proposed in  also consists of a discretization step, but the subsequences are non-overlapping. As a result, it does not consider every possible pattern candidate, and thus may miss some true patterns .
III Notations & Problem Definition
We start with fundamental definitions related to time series:
Time Series is a set of observations ordered by time.
Subsequence of a time series is a contiguous set of points in the time series starting from position and ending at with length . Typically, and .
Subsequences can be extracted from via a sliding window.
In many applications, we are interested in finding similar “shapes.” Therefore, motif discovery is more meaningful when it is offset- and amplitude-invariant. This can be achieved by normalizing each subsequence prior to the search for motifs.
Z-normalization is a procedure that normalizes the mean and standard deviation of the subsequence to zero and one, respectively.
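As a minimal sketch of this normalization step (not taken from the paper; the function name `z_normalize` and the `eps` guard for constant subsequences are assumptions made for illustration), the following Python shows that two subsequences with the same shape but different offset and amplitude become identical after normalization:

```python
import math

def z_normalize(subsequence, eps=1e-12):
    """Return a copy of `subsequence` rescaled to mean 0 and standard deviation 1."""
    n = len(subsequence)
    mean = sum(subsequence) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in subsequence) / n)
    if std < eps:
        # Assumption: a constant subsequence carries no shape, so map it to zeros.
        return [0.0] * n
    return [(x - mean) / std for x in subsequence]

a = z_normalize([1.0, 2.0, 3.0, 4.0])
b = z_normalize([10.0, 30.0, 50.0, 70.0])  # same shape, shifted and scaled
print(max(abs(x - y) for x, y in zip(a, b)) < 1e-9)  # True
```

Because the comparison is made after normalization, any later distance or discretization step sees only the shape of the subsequence.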
Given a time series and a distance function , in this work, we define a Time Series Motif to be a set of subsequences in such that the distance between each of these subsequences and a seed subsequence is less than a motif threshold function , where is the length of the motif detected, (the minimum motif length). Each subsequence in a motif is said to be an instance of the motif.
More specifically, given a minimum motif length , the algorithm finds pairs of subsequences with length ( is varied for different pairs), where and each pair satisfies the motif definition with threshold . Note that, as mentioned in , once the pair is detected, all instances of the motif can be found by using similarity query algorithms .
III-A Problem Definition
Unlike fixed-length motif discovery, determining a general measure (e.g. frequency or similarity) for the variable-length motif discovery problem is non-trivial . For example, some low-frequency motifs may be more interesting than high-frequency motifs; some long motifs are likely to have a large distance compared with short motifs even after normalizing by length (e.g. Motif C versus Motif A in Fig. 1). Since motif discovery can be used as a subroutine for real-world applications, instead of ranking motifs based on some interest measure, the proposed algorithm aims to detect a “seed subsequence” with at least a pair of subsequences that satisfy the motif constraint stated above. The reader may refer to  for details on various motif evaluation approaches for different goals. Since most state-of-the-art approaches evaluate motifs on similarity, when comparing with them in the experimental section, we include the execution time of ranking the motifs based on similarity .
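The motif definition above can be illustrated with a brute-force sketch. This is only an illustration under stated assumptions: `threshold` is a hypothetical callable standing in for the motif threshold function (its actual form is not assumed here), the distance is taken to be the Euclidean distance between z-normalized subsequences, and trivial (overlapping) matches are not filtered out.

```python
import math

def znorm(values):
    """Z-normalize a list of numbers (all zeros if the input is constant)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std if std > 0 else 0.0 for v in values]

def distance(a, b):
    """Euclidean distance between the z-normalized versions of a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(znorm(a), znorm(b))))

def motif_instances(series, seed_start, length, threshold):
    """Start positions whose subsequence lies within threshold(length) of the seed."""
    seed = series[seed_start:seed_start + length]
    limit = threshold(length)  # hypothetical length-dependent threshold
    hits = []
    for start in range(len(series) - length + 1):
        if distance(seed, series[start:start + length]) < limit:
            hits.append(start)
    return hits

series = [0, 1, 2, 1] * 3
print(motif_instances(series, seed_start=0, length=4, threshold=lambda n: 0.01 * n))
# [0, 4, 8]
```

The scan costs one distance computation per candidate position, which is why a similarity query is only run once a promising pair has been found.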
The difficulty of obtaining an exact solution for variable-length motifs can be illustrated by a simple example shown in Fig. 2. Suppose two instances of a motif, shown in Fig. 2(b), appear in a time series (Fig. 2(a)) at the and positions, respectively. Fig. 2(c) shows the growth of normalized Euclidean distances between subsequences and for all lengths ranging from 2 to 6000. We can see from Fig. 2(c) that the growth of the distance is nonlinear. That is, the distance between a pair of short subsequences does not necessarily behave like the distance between long subsequences if the length difference is large. Therefore, the state-of-the-art algorithms need to repeat the fixed-length motif discovery algorithm many times during the enumeration process to find optimal motifs, which significantly increases the time cost.
IV Proposed Method
In this section, we introduce our proposed method.
Discretization of time series into a symbolic representation is often a necessary pre-processing step for efficient motif discovery . Since our proposed work also utilizes a symbolic representation of subsequences to speed up the motif detection process, we first describe Symbolic Aggregate approXimation (SAX) , a widely used discretization technique for time series data mining. Given a z-normalized time series (a subsequence in our case), SAX first converts it to a Piecewise Aggregate Approximation (PAA) representation  of size (i.e. segments). Then, the PAA coefficients are mapped to symbols with alphabet size according to a breakpoints table. The resulting symbols form a SAX word. Fig. 3 illustrates the discretization process. The pre-defined breakpoints table  for alphabet size up to is shown in Fig. 3 (bottom).
IV-B Fast Computation of SAX
Since SAX words are heavily used in the proposed algorithm, it is very important to reduce the time cost of discretization. We introduce a fast way to compute SAX words from subsequences of different lengths. Two vectors of statistical features, and , are first computed based on the input time series . Given a subsequence of length , its SAX representation can be computed by Algorithm 1. The algorithm uses a fast PAA computation approach  to compute PAA coefficients with time complexity which is not dependent on subsequence length (Lines 5-7). In Line 8, the PAA coefficients are converted to a SAX word based on the pre-defined breakpoints table, which costs . The cost of Algorithm 1 is thus . Since can be computed during pre-processing, the cost of computing a SAX word for an arbitrary-length subsequence during the motif discovery process is . As demonstrated in , and should be very small compared with the subsequence length. So the time cost to compute a SAX word is reduced from to ().
IV-C Lower-bound based Numerosity Reduction
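The discretization just described can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the helper names are hypothetical, the breakpoints are computed as equiprobable quantiles of the standard normal distribution rather than read from a table, and the subsequence length is assumed to be divisible by the PAA size.

```python
import math
from bisect import bisect_right
from statistics import NormalDist

def breakpoints(alphabet_size):
    """Cut points splitting the standard normal distribution into equiprobable regions."""
    return [NormalDist().inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]

def znorm(values):
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std if std > 0 else 0.0 for v in values]

def paa(values, segments):
    """Mean of each of `segments` equal-width chunks (length must be divisible)."""
    size = len(values) // segments
    return [sum(values[i * size:(i + 1) * size]) / size for i in range(segments)]

def sax_word(subsequence, segments, alphabet_size):
    """Z-normalize, reduce to PAA coefficients, then map each coefficient to a letter."""
    cuts = breakpoints(alphabet_size)
    coeffs = paa(znorm(subsequence), segments)
    return "".join(chr(ord("a") + bisect_right(cuts, c)) for c in coeffs)

print(sax_word([1, 2, 3, 4, 5, 6, 7, 8], segments=4, alphabet_size=4))          # abcd
print(sax_word([10, 20, 30, 40, 50, 60, 70, 80], segments=4, alphabet_size=4))  # abcd
```

Both inputs map to the same word because normalization removes the offset and amplitude before the symbols are assigned.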
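One way to obtain PAA coefficients at a cost independent of the subsequence length is to precompute running sums. The sketch below assumes that the two feature vectors are the cumulative sum of the values and the cumulative sum of their squares; this is an assumption for illustration, and the function names are hypothetical. The subsequence length is assumed to be divisible by the number of segments.

```python
import math

def prefix_sums(series):
    """Running sums of x and x^2, each with a leading zero."""
    s, s2 = [0.0], [0.0]
    for x in series:
        s.append(s[-1] + x)
        s2.append(s2[-1] + x * x)
    return s, s2

def fast_znorm_paa(s, s2, start, length, segments):
    """PAA coefficients of the z-normalized subsequence, one O(1) step per segment."""
    mean = (s[start + length] - s[start]) / length
    var = max((s2[start + length] - s2[start]) / length - mean * mean, 0.0)
    std = math.sqrt(var)
    seg = length // segments
    coeffs = []
    for k in range(segments):
        lo = start + k * seg
        seg_mean = (s[lo + seg] - s[lo]) / seg
        # Normalizing the segment mean equals averaging the normalized points.
        coeffs.append((seg_mean - mean) / std if std > 0 else 0.0)
    return coeffs

series = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
s, s2 = prefix_sums(series)
print(fast_znorm_paa(s, s2, start=2, length=8, segments=4))
```

The prefix sums are built once in a single pass; after that, every subsequence of any length is summarized with a number of operations that depends only on the PAA size.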
In practice, neighboring subsequences are similar to each other since they are off by one point. To reduce the cost of detecting long motifs, Numerosity Reduction (NR) is used to avoid unnecessary similarity comparisons. Unlike previous work , where a subsequence is skipped if its SAX representation is identical to the last recorded one, we use the PAA distance to remove consecutive similar subsequences. A subsequence is ignored if the lower-bounding PAA distance between neighboring subsequences is less than . Since we want a tight lower bound, we use a large PAA size. In this work, we use a PAA size of 32 for numerosity reduction.
IV-D Induction Graph
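A minimal sketch of distance-based numerosity reduction follows. It assumes that each candidate is compared with the most recently kept subsequence, and `min_dist` is a hypothetical parameter standing in for the threshold; neither detail is confirmed by the text. The lower bound used is the standard PAA bound, the Euclidean distance between PAA vectors scaled by the square root of the segment width.

```python
import math

def paa_lower_bound(paa_a, paa_b, length):
    """Lower bound on the Euclidean distance between two subsequences of `length` points."""
    w = len(paa_a)
    return math.sqrt(length / w) * math.sqrt(
        sum((x - y) ** 2 for x, y in zip(paa_a, paa_b))
    )

def numerosity_reduction(paa_vectors, length, min_dist):
    """Indices of subsequences kept; near-duplicates of the last kept one are skipped."""
    kept = []
    for i, vec in enumerate(paa_vectors):
        if not kept or paa_lower_bound(paa_vectors[kept[-1]], vec, length) >= min_dist:
            kept.append(i)
    return kept

vectors = [[0.0, 0.0], [0.01, 0.0], [1.0, 1.0], [1.0, 1.01], [0.0, 0.0]]
print(numerosity_reduction(vectors, length=4, min_dist=0.5))  # [0, 2, 4]
```

A larger PAA size makes the bound tighter, so fewer genuinely different neighbors are discarded by mistake.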
The number of SAX words that need to be tested to form long motifs can be unnecessarily large if numerosity reduction is not employed. See Fig. 4 (top). The time series is converted to a sequence of SAX words: , , …, . The node in the graph represents the subsequence in the time series. Without numerosity reduction, every single subsequence/SAX word is kept. However, numerosity reduction may cause some patterns to be missed since some subsequences are skipped. In the example shown in Fig. 4 (middle), , and are skipped due to numerosity reduction, and some patterns involving the skipped subsequences may be missed. To mitigate the drawbacks of both scenarios, we introduce the Induction Graph (Fig. 4 (bottom)), a graph structure that helps determine the order of scanning and enumeration of motif candidates during motif discovery. Each node contains 2 edges (next edge and forward edge). Two nodes connected by these two edges are denoted as and (in Fig. 4 (bottom), these are the black and yellow arrows, respectively). is the node representing the subsequence and is the next non-similar subsequence determined by numerosity reduction (see previous section). The Induction Graph can be stored by only recording the nodes connected by . For example, in Fig. 4, all edges and nodes in the Induction Graph can be reconstructed if we record 4 nodes: . In the rest of the paper, the nodes connected by the reverse direction of the next edge and the forward edge are denoted as and respectively.
IV-E Hierarchical based Motif Enumeration (HIME)
Hierarchical based Motif Enumeration (HIME) is described in Algorithm 2. Intuitively, the algorithm conducts a left-to-right pass through all nodes via the next edge in Induction Graph . For each node, the algorithm recursively executes two major functions: “RecursiveEnumeration” (Lines 16-17) and “RemoveCoveredMotif” (Line 13). In the “RecursiveEnumeration” step, SAX words are formed to represent variable-length subsequences and to detect repeating subsequences. “RemoveCoveredMotif” removes short motifs that are completely covered by longer motifs, keeping the motif set small at a low cost. The motifs detected by the algorithm are stored in (Line 12). Finally, post-processing is applied to remove trivial and false-positive motif candidates (Line 20). Note that HIME can also utilize numerosity reduction by only examining the nodes recorded in of .
IV-E1 SAX based Recursive Enumeration
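The following sketch shows one possible reading of the Induction Graph, under the assumption that the next edge points to the adjacent subsequence and the forward edge points to the first later subsequence kept by numerosity reduction. The class and function names are hypothetical, and reverse edges are omitted for brevity.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    index: int                     # start position of the subsequence
    next: Optional[int] = None     # next edge: the adjacent subsequence
    forward: Optional[int] = None  # forward edge: next subsequence kept by numerosity reduction

def build_induction_graph(num_subsequences: int, kept: List[int]) -> List[Node]:
    """Rebuild all nodes and edges from the sorted list of kept positions alone."""
    nodes = [Node(i) for i in range(num_subsequences)]
    for node in nodes:
        if node.index + 1 < num_subsequences:
            node.next = node.index + 1
        j = bisect_right(kept, node.index)  # first kept position strictly after this node
        if j < len(kept):
            node.forward = kept[j]
    return nodes

graph = build_induction_graph(6, kept=[0, 3, 5])
print([(n.index, n.next, n.forward) for n in graph])
```

Only the kept positions need to be stored, since both edge types can be recomputed from them, which matches the storage remark in the text.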
In this step, Algorithm 2 recursively executes Lines 6-18 to detect variable-length motifs. First, HIME computes a SAX word and generates a new node to represent the long subsequence obtained by merging two short subsequences and (Line 6). The new SAX word , along with the total length of the merged subsequence and the location of the subsequence, is inserted into a SAX word table, VLSAXTable, if the same SAX word representing some subsequence(s) of similar length does not already exist in the table (Lines 7-8). If already exists in VLSAXTable for some subsequence(s) of similar length, we have found a motif match. The algorithm gets based on the location of the matching subsequence (Line 10) and inserts into the graph (Line 11). and are now instances of a motif. The next edge and forward edge of are updated based on ’s next edge and forward edge respectively. The reverse links are connected to and respectively. The insertion of the new nodes allows us to re-use the detected motifs to reduce the cost of enumerating long motifs. The algorithm then recursively tests two newly generated long subsequences represented by and (generated by further merging with , and with , respectively) until Line 7 is satisfied (Lines 16-17). Such a greedy recursive strategy can efficiently generate a shallow hierarchical structure for matching long repeating patterns .
IV-E2 Removing Covered Motifs
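The table lookup at the heart of this step can be sketched as a dictionary keyed by SAX word. This is a simplified illustration rather than the paper's data structure: the class layout is hypothetical, and "similar length" is modelled by a hypothetical `length_tolerance` parameter because the actual criterion is not stated here.

```python
class VLSAXTable:
    """Maps a SAX word to the (length, start) pairs of subsequences it represents."""

    def __init__(self, length_tolerance):
        self.length_tolerance = length_tolerance  # hypothetical "similar length" criterion
        self.entries = {}

    def lookup_or_insert(self, word, length, start):
        """Return a matching (length, start) entry, or record the new one and return None."""
        for other_length, other_start in self.entries.get(word, []):
            if abs(other_length - length) <= self.length_tolerance:
                return (other_length, other_start)  # candidate motif match
        self.entries.setdefault(word, []).append((length, start))
        return None

table = VLSAXTable(length_tolerance=10)
print(table.lookup_or_insert("abca", 300, 0))     # None: first occurrence is stored
print(table.lookup_or_insert("abca", 305, 1000))  # (300, 0): same word, similar length
print(table.lookup_or_insert("abca", 600, 2000))  # None: same word, very different length
```

A match returned here is only a candidate; as the text notes, false positives are removed later during post-processing.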
Some short motifs may completely overlap with long motifs. As demonstrated in , these short motifs (“covered motifs”) are redundant. So during the motif enumeration process, we conduct a fast process (Line 13) to remove the entries for covered motifs from VLSAXTable and MotifSet. An example is shown in Fig. 5 to illustrate the process. The pair (e.g. ) in the figure denotes the SAX word that represents a subsequence of length . The subsequences having the same pair are instances of the same candidate motif, and are labeled in the same color in the figure. In the example, the algorithm first finds the short motif (Fig. 5(b)) based on ( matches ). The algorithm then expands using the recursive enumeration strategy mentioned above and forms a longer subsequence represented by . A match is also found (labeled in green in Fig. 5(c)). Since is generated from and they both represent the same subsequence (except one is a longer version), all instances of the motif represented by may overlap with all instances of the long motif represented by . Therefore, the algorithm removes subsequences associated with from VLSAXTable and the corresponding candidate motif from the motif set (Fig. 5(d)). While “RemoveCoveredMotif” cannot remove all covered motifs, it is used to minimize the time and space cost during motif discovery. All covered motifs can be removed using the algorithm described in  after getting the motif set.
IV-F Comparison with Sequence-Matching Based Approximate Motif Discovery Approaches
First, these existing approaches can detect a long motif only if the SAX sequences representing the motif instances are identical. To illustrate the difficulty of generating the same sequence for a long motif, an example is shown in Fig. 6. The two subsequences of length 500 in Fig. 6 are instances of a motif discovered by our algorithm in an EPG time series . If we set the minimum length to be 50, these two subsequences form two SAX sequences of lengths 27 and 33, respectively, with numerosity reduction. Clearly, the two subsequences are converted to overwhelmingly long word sequences that are not similar to one another except for the first two words. Therefore, sequence-matching based approaches cannot detect them. In contrast, HIME would form a single SAX word for each subsequence via the recursive enumeration process, and find motifs using only the SAX word representing the subsequence. In this example, since the two subsequences indeed have the same SAX representation, HIME is able to discover them.
Second, the mean and the variance of a subsequence can affect the shape dramatically. However, the generation of SAX words in the word sequence only depends on the mean and the variance of a small subsequence. Therefore, when the word sequence is long, the mean and the variance of short subsequences may significantly differ from the overall mean and variance of the whole subsequence. So even if the subsequences are similar to each other, their respective word sequences may be dissimilar. HIME avoids this problem by quickly recomputing the SAX word via Algorithm 1, which re-normalizes the subsequence every time.
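The notion of a covered motif can be made concrete with a simple containment check. The sketch below is a brute-force illustration, not the fast in-line removal performed by the algorithm: each motif is represented as a list of hypothetical `(start, length)` instances, and a motif is dropped when every instance lies inside some instance of a strictly longer motif.

```python
def is_covered(short, long):
    """True if every (start, length) instance of `short` lies inside an instance of `long`."""
    return all(
        any(ls <= s and s + n <= ls + ln for ls, ln in long)
        for s, n in short
    )

def remove_covered(motifs):
    """Keep only motifs that are not covered by a strictly longer motif."""
    kept = []
    for i, motif in enumerate(motifs):
        longest = max(n for _, n in motif)
        covered = any(
            j != i
            and min(n for _, n in other) > longest
            and is_covered(motif, other)
            for j, other in enumerate(motifs)
        )
        if not covered:
            kept.append(motif)
    return kept

short = [(100, 300), (5000, 300)]
long = [(50, 900), (4950, 900)]
unrelated = [(20000, 300), (30000, 300)]
print(remove_covered([short, long, unrelated]) == [long, unrelated])  # True
```

This all-pairs check is quadratic in the number of motifs, which illustrates why a cheap partial removal during enumeration is attractive.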
In the experiments, we demonstrate that HIME significantly increases the enumeration range compared to sequence-matching based approaches.
V Parameter Selection
Since the performance of the algorithm depends on the SAX parameters and , in order to alleviate the burden of parameter selection, we introduce an approach to adaptively choose parameter .
We first introduce a fast approach to compute multi-resolution SAX at a cost similar to that of fixed-resolution SAX. Given a maximum alphabet size , our approach first gathers all SAX breakpoints for alphabet sizes from 2 up to . For each interval between any two breakpoints, a symbol sequence containing the corresponding symbols up to resolution is recorded. An example with is shown in Fig. 7. The breakpoints are denoted by “x,” and the symbol sequence for each interval is shown in . By using binary search to determine which interval the PAA coefficient belongs to, we can find its SAX representations in all resolutions from to at a low cost. Since the number of intervals is bounded by , the time complexity of binary search is bounded by . Given the PAA values denoted in yellow as an example, the algorithm computes 3 different resolutions of SAX words by simply concatenating three symbol sequences (shown in ). Determining each symbol sequence only takes 3 binary searches. As mentioned in , for motif discovery , generally is chosen to be at most 20, which yields only 128 distinct intervals (127 distinct breakpoints). So computing a single multi-resolution SAX word with PAA size and up to 20 costs only .
The Adaptive Parameter Selection Algorithm is outlined in Algorithm 3. The algorithm randomly samples pairs of subsequences (,) of length until the alphabet size is determined. For each pair sampled, a suitable that makes the tightness of lower bound (computed by ) of the SAX word close to 0.5 is recorded. The threshold (0.5) provides a way to balance the tightness of the representation and the number of distinct SAX words. Since monotonically increases as increases , using binary search (Lines 4-6), we can find the proper without examining all resolutions. Intuitively, controls the approximation quality of SAX words and can be used to select a suitable resolution. The average alphabet size is selected to be the parameter for the HIME algorithm after the parameter selection algorithm converges (e.g. when the change of is less than 0.01). By pre-computing SAX words of all resolutions for (,), BinarySearchResolution in Algorithm 3 only costs to search for a suitable from 2 to . In the experiments, often converges after one thousand samplings, which takes less than one second.
VI Evaluation Experiments
We perform a series of experiments to evaluate the performance of HIME. All the experiments are conducted on a laptop with 16 GB of RAM and a 2.5 GHz quad-core processor. The executable software and datasets used in the experiments can be found at http://bit.ly/2rvBETV. Note that the goal of the experiments is to demonstrate the contributions and potential applications of efficiently finding motifs with large length differences that existing algorithms may have difficulty finding.
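The multi-resolution lookup can be sketched as follows, assuming the usual equiprobable Gaussian breakpoints; the function names are hypothetical. Breakpoints of all alphabet sizes are merged into one sorted list, each interval stores its symbol index at every resolution, and a single binary search then yields the symbols for all resolutions at once. For a maximum alphabet size of 20 the sketch produces 127 distinct breakpoints and therefore 128 intervals.

```python
import math
from bisect import bisect_right
from fractions import Fraction
from statistics import NormalDist

def build_multires_table(max_alphabet):
    """Return (cuts, table); table[i][a - 2] is the symbol index of interval i at alphabet size a."""
    # Exact fractions deduplicate shared breakpoints such as 1/2 == 2/4.
    probs = sorted({Fraction(k, a) for a in range(2, max_alphabet + 1) for k in range(1, a)})
    cuts = [NormalDist().inv_cdf(float(p)) for p in probs]
    table = []
    for i in range(len(probs) + 1):
        lower = probs[i - 1] if i > 0 else Fraction(0)
        # Symbol index = number of size-a breakpoints at or below the interval's lower edge.
        table.append([math.floor(lower * a) for a in range(2, max_alphabet + 1)])
    return cuts, table

def multires_symbols(value, cuts, table):
    """Symbol indices of `value` at every alphabet size, found with one binary search."""
    return table[bisect_right(cuts, value)]

cuts, table = build_multires_table(20)
print(len(cuts), len(table))                     # 127 128
print(multires_symbols(0.5, cuts, table)[:3])    # [1, 2, 2] for alphabet sizes 2, 3, 4
```

The table is built once, so the per-coefficient cost during discretization is a single binary search over at most 128 intervals.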
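The binary search over alphabet sizes can be sketched for a single sampled pair. This sketch makes several assumptions: the tightness of lower bound is taken to be the SAX MINDIST divided by the true z-normalized Euclidean distance, the search returns the smallest alphabet size whose tightness reaches the target, and the averaging over many sampled pairs is omitted. All function names are hypothetical.

```python
import math
from bisect import bisect_right
from statistics import NormalDist

def znorm(values):
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std if std > 0 else 0.0 for v in values]

def paa(values, segments):
    size = len(values) // segments  # assumes the length is divisible by `segments`
    return [sum(values[i * size:(i + 1) * size]) / size for i in range(segments)]

def tightness(a, b, segments, alphabet_size):
    """SAX MINDIST of the two words divided by the true distance (between 0 and 1)."""
    za, zb = znorm(a), znorm(b)
    true_dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(za, zb)))
    if true_dist == 0:
        return 1.0
    cuts = [NormalDist().inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]
    word_a = [bisect_right(cuts, c) for c in paa(za, segments)]
    word_b = [bisect_right(cuts, c) for c in paa(zb, segments)]
    total = 0.0
    for r, c in zip(word_a, word_b):
        lo, hi = min(r, c), max(r, c)
        if hi - lo >= 2:  # adjacent or equal symbols contribute zero
            total += (cuts[hi - 1] - cuts[lo]) ** 2
    return math.sqrt(len(a) / segments) * math.sqrt(total) / true_dist

def choose_alphabet(a, b, segments, max_alphabet=20, target=0.5):
    """Binary search for an alphabet size whose tightness reaches `target`."""
    lo, hi = 2, max_alphabet
    while lo < hi:
        mid = (lo + hi) // 2
        if tightness(a, b, segments, mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo

x = [math.sin(i / 5) for i in range(60)]
y = [math.cos(i / 7) for i in range(60)]
print(choose_alphabet(x, y, segments=6))
```

The search relies on the tightness growing with the alphabet size, as stated in the text; if that does not hold for a particular pair, the result is still a valid alphabet size but not necessarily the smallest one meeting the target.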
We first test 2 different state-of-the-art enumeration approaches  in the task of detecting motifs of length from 300 to 2300 in three real-world time series of length 160,000 to demonstrate the scalability problem. The first approach iteratively calls a fixed-length motif discovery algorithm (e.g. MK ) to find the most similar subsequences of different lengths . We choose the current fastest algorithm, Quick-Motif , to maximize its scalability (denoted as ItrQuick-Motif). The second approach  (MOEN) applies an all-pairs similarity search algorithm and a lower bound to prune some lengths that do not need to be tested. We choose the fastest algorithms, STOMP  and STAMP , to be used with the MOEN framework. Since STAMP is slower than STOMP in all test cases, we only show the execution time for motif enumeration based on STOMP (denoted MOEN-STOMP). All execution times of STAMP and STOMP are based on the C code provided by the authors. Table I shows the execution time of the fixed-length motif discovery algorithms, MOEN (using code provided by the author) and the estimated execution time if Quick-Motif and STOMP are applied in the two enumeration approaches respectively (calculated based on the number of times the fixed-length algorithm is called in the framework). From the results, ItrQuick-Motif is very costly compared to the MOEN framework. Although MOEN reaches , and pruning rates on the 3 datasets respectively, even using the latest all-pairs similarity comparison algorithm, MOEN-STOMP still takes hours to detect the motifs. In contrast, HIME takes only a few seconds including post-processing. State-of-the-art algorithms are therefore hard to apply in the remaining experiments, since the smallest time series used has a length of 1 million. Instead of directly comparing scalability, we compare our performance with STOMP at a fixed pruning rate. As shown in the table, 99% pruning is hard to reach even in periodic ECG time series.
Similar to ItrQuick-Motif, directly using fixed-length approximate motif discovery  is also costly. For example, even if an approximate motif discovery algorithm only takes one second to find fixed-length motifs, finding motifs in a length range of [300, 2300] still requires approximately 33 min. In long time series, the discretization step alone in  may take more than one second. A similar problem also exists in the fixed-length anytime version of the state-of-the-art motif discovery algorithm, STAMP . STAMP can only examine fewer than 50 out of 16 million subsequences in one second.
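Table I: Execution time on three real-world time series of length 160,000 (motif length range 300 to 2300 for the enumeration approaches)

| Algorithm | Dataset 1 | Dataset 2 | Dataset 3 |
|---|---|---|---|
| STAMP (fixed) | 12.15 min | 12 min | 13.7 min |
| STOMP (fixed) | 4.15 min | 5.19 min | 4.05 min |
| Quick-Motif (fixed) | 8.2 min | 7.8 min | 30 sec |
| MOEN | 3.1 days | 1.7 days | 16 hr |
| ItrQuick-Motif (est.) | 11.3 days | 10.08 days | 16.6 hr |
| MOEN-STOMP (est.) | 13.8 hr | 8.65 hr | 2.7 hr |
| HIME+Post-Processing | 23 sec | 21 sec | 47 sec |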
In all experiments unless noted, the alphabet size is set using Algorithm 3. The PAA size is set to 6. The minimum length is 300 and motif threshold is . The relation between execution time, enumeration range and parameter setting is discussed in Sec VI.C.
VI-A Detecting Planted Motifs in Random Walk Time Series
We first test HIME in a planted motif experiment to demonstrate its ability to detect motifs with high accuracy. We choose a grammar (Sequitur) based motif discovery algorithm  as the baseline since it can find variable-length motifs in million-scale time series and utilizes a hierarchical identification process similar to that of the proposed algorithm. We planted 4 motifs of different lengths, 10 instances each, into a random walk time series of length 3 million. The probability that an instance of the motif appears in the time series is smaller than , so each can be considered a rare motif. The shapes of the motifs are generated by using with random parameters , and . The lengths of the motifs are 1500, 3000, 6000 and 12000. We added 5% random noise to every instance of the motifs. The mean and variance are also randomly generated. HIME is expected to find at least a pair of non-overlapping subsequences for each motif that highly overlap with the actual planted instances.
The performance of both algorithms is measured by the overlapping rate between the subsequences found and the ground truth locations. Since the grammar-based algorithm also infers motifs based on discretized SAX word sequence, we conduct a grid search for parameter selection of and (covering the grid search area mentioned in ) and only report the best performance.
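The exact formula for the overlapping rate is not spelled out in this passage. As one plausible reading, the sketch below measures the overlap between a detected subsequence and a ground-truth instance as intersection over union of their index ranges; this choice and the function name are assumptions made for illustration.

```python
def overlap_rate(found, truth):
    """Intersection over union of two half-open index ranges given as (start, end)."""
    intersection = max(0, min(found[1], truth[1]) - max(found[0], truth[0]))
    union = (found[1] - found[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0

print(overlap_rate((100, 1600), (400, 1900)))   # 1200 / 1800
print(overlap_rate((0, 1500), (0, 1500)))       # 1.0
print(overlap_rate((0, 1500), (2000, 3500)))    # 0.0
```

Under this reading, a rate above 0.8 means the detected subsequence and the planted instance differ by only a small shift or length difference.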
The experimental result is shown in Fig. 8. For Sequitur, the best overlapping rate (0.45 on average) is reached when . For all other motif lengths, the overlapping rates are between 0.1 and 0.3. In contrast, HIME consistently gets overlapping rates above 0.8 for motif lengths 1500, 3000 and 6000. The performance decreases when detecting motif of length 12000; however, HIME still can achieve overlapping rates above 0.8 in 6 out of 10 cases for motif length 12000.
VI-A1 Performance vs. Number of Instances
Table II: Execution time vs. data size on random walk time series

| # of Subsequences | 1 million | 2 million | 4 million | 8 million | 16 million |
|---|---|---|---|---|---|
| Motif Length Range Detected | [300 3020] | [300 3643] | [300 5104] | [300 7775] | [300 10795] |
| STOMP (fixed length) | 2 hr | 16.6 hr | 5.01 days (est.) | 18.25 days (est.) | 68 days (est.) |
| STOMP 99% pruning rate | 2.25 days | 22.7 days | 0.64 yr (est.) | 3.65 yr (est.) | 19.9 yr (est.) |
| HIME | 1.2 min | 2.9 min | 8.4 min | 30 min | 1.6 hr |
| Post-processing | 0.4 min | 0.8 min | 3.1 min | 15 min | 0.9 hr |
| Total | 1.6 min | 3.7 min | 11.5 min | 45 min | 2.5 hr |
Table III: Enumeration range and execution time vs. data size for Sequitur and HIME

| Time Series Length | 1 million | 2 million | 4 million | 8 million | 16 million |
|---|---|---|---|---|---|
| Sequitur-Best (Enumeration Range) | 506 ([300 806]) | 535 ([300 835]) | 586 ([300 886]) | 586 ([300 886]) | 586 ([300 886]) |
| HIME (Enumeration Range) | 2307 ([300 2607]) | 2731 ([300 3031]) | 3705 ([300 4005]) | 5965 ([300 6265]) | 8811 ([300 9111]) |
| Sequitur-Best (Execution Time) | 3.6 sec | 12 sec | 30 sec | 2 min | 6 min |
| HIME (Execution Time) | 12 sec | 30 sec | 1.4 min | 9 min | 40 min |
We conducted an experiment to illustrate the relationship between accuracy and the number of motif instances in the time series. We use a planted motif of length 6000 and adjust the number of instances in each experiment. The overlapping rate vs. the number of repeated instances is shown in Fig. 9. According to Fig. 9, the accuracy grows rapidly as the number of repeated instances increases. The algorithm can successfully detect the planted motifs with a very high overlapping rate (above 80% overlap with the ground truth) when the number of instances is above 6. Note that a motif repeating only 6 times in a million-scale time series is considered a rare motif given the length of the time series. Therefore, although HIME is a greedy approximate motif discovery algorithm that does not guarantee an exact solution, since the algorithm utilizes all short subsequences overlapping the instances to detect long motifs, the chance that the motif is detected increases as the number of motif instances increases. Also, motifs often repeat multiple times in a long time series (e.g. if a motif appears with a probability of 0.0001 in a time series of length one million, it may appear 100 times), so the chance of the motif being detected by HIME is high.
In this subsection, the scalability of the algorithm is tested on a random walk time series of length 16 million.
VI-B1 Execution Time vs. Data Size
We use STOMP  with 99% pruning rate to demonstrate the necessity of an approximate approach in million-scale (or larger) time series. In the test case where STOMP takes more than 24 hours, we estimate the execution time by the first 100 iterations (the same estimation approach used in ). The enumeration range of HIME is measured after removing all false positive motifs based on motif threshold function. The result is shown in Table II. The running time for HIME grows significantly slower compared to state-of-the-art algorithms. In the largest test case, HIME takes approximately 2.5 hr to process 16 million subsequences and the length range of motifs found is from 300 to 10795. The estimated execution time for STOMP is 68 days. So even with 99% pruning rate, it may take 20 years to enumerate the same length range as HIME (though for STOMP, the solution would be exact). Note that GPU version of STOMP , which can achieve approximately 150 times speedup, can be used to increase the scalability of STOMP. However, it may still take 47 days. The problem size is simply too large for state-of-the-art algorithms to get the exact solution. In contrast, HIME provides an alternative way to efficiently detect approximate variable-length motifs of large length range in this scale of time series.
VI-B2 Enumeration Range vs. Data Size
We next demonstrate that HIME can efficiently detect motifs over a large enumeration range as the length of the time series increases. The performance is compared with that of Sequitur (a grammar-based approach). As in the previous experiment, we use grid search to find the parameters for Sequitur that maximize its enumeration range. To make a fair comparison, HIME uses the same numerosity reduction strategy as Sequitur in this experiment instead of the Induction Graph approach introduced in Section IV.D. This way, both algorithms process the same input. The results are shown in Table III. Sequitur's enumeration range stops growing after reaching length 886. In contrast, HIME's enumeration range continues to increase as the time series length grows. Therefore, in large-scale time series, HIME can detect significantly longer motifs than Sequitur. When the length of the time series reaches 16 million, the enumeration range of HIME is one order of magnitude larger than that of Sequitur. We also measure the execution time of both algorithms, including post-processing. According to the results, HIME's overall running time is longer than Sequitur's. However, considering that HIME's enumeration range is 4 to 15 times larger than Sequitur's, HIME is more efficient than Sequitur per unit of motif enumeration length.
VI-C Parameter Analysis
In this subsection, we demonstrate that the parameters of HIME are easy to choose. Recall that Algorithm 3 can help determine the alphabet size given a range, so we only need to set one parameter for HIME.
We test HIME with one parameter varied from 4 to 16 and the other from 5 to 15 (the two parameters being the PAA size and the alphabet size) on a 1-million-length random walk time series. The execution time and enumeration range of all parameter combinations are shown in Fig. 10(a) and Fig. 10(b), respectively. According to Fig. 10, HIME's execution time and enumeration range both increase as the two parameters decrease. This is because small parameter values allow easier word matching, but a looser representation also requires the algorithm to spend extra time comparing and filtering out false positives. As the number of distinct SAX words grows with larger parameter values, the execution time and enumeration range of the algorithm are reduced. The parameter combination chosen by Algorithm 3 for each setting of the user-specified parameter is labeled in red in the figure. According to the figure, Algorithm 3 tends to select a value that trades off execution time and enumeration range with equal priority. When Algorithm 3 is used to determine the alphabet size, the change in execution time and enumeration range as the remaining parameter increases matches the results obtained by setting both parameters manually. Users therefore need to select only one parameter, unless they want to enumerate a particularly large length range or reduce the time cost for a very long time series. In these cases, users can set both parameters to balance the enumeration range and execution time.
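The effect of a looser representation on word matching can be illustrated with a plain SAX discretization. The sketch below is standard SAX (z-normalization, PAA, equiprobable Gaussian breakpoints), not HIME itself; the names `sax_word`, `w` (PAA size) and `a` (alphabet size), as well as the toy subsequences, are assumptions made for illustration.

```python
from statistics import NormalDist, mean, pstdev


def sax_word(series, w, a):
    """Convert a subsequence into a SAX word (requires len(series) >= w)."""
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]
    n = len(z)
    # PAA: average the z-normalized values inside each of the w segments.
    paa = [mean(z[i * n // w:(i + 1) * n // w]) for i in range(w)]
    # Breakpoints that split N(0, 1) into a equiprobable regions.
    cuts = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    # Each PAA value maps to the letter indexed by the number of cuts below it.
    return "".join(chr(ord("a") + sum(v > c for c in cuts)) for v in paa)


s1 = [0, 1, 2, 3, 4, 5, 6, 7]
s2 = [0, 1, 2, 3, 4, 5, 7, 6]  # same shape except for the last two points

coarse = (sax_word(s1, 2, 3), sax_word(s2, 2, 3))
fine = (sax_word(s1, 8, 8), sax_word(s2, 8, 8))
print(coarse, fine)
```

With the coarse setting both subsequences map to the same word, so they would be matched and later need verification; with the fine setting the words differ, so fewer candidates are generated.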
VI-D Case Studies
In this section, we show that HIME can find high-quality motifs in several real-world million-scale time series. As demonstrated in the previous section, existing algorithms are too costly to apply to problems of this scale.
VI-D1 Variable-Length Similar Subsequence Discovery in DNA Sequence
As shown in previous work, converting a DNA sequence to a time series can help researchers understand the structural similarity between DNA subsequences. In this case study, we use HIME to find repeating subsequences in the human hgY chromosome. We convert hgY to a 26-million-length time series using a previously described conversion algorithm. We set the PAA size so that the algorithm can finish the search process within half an hour. We set the minimum motif length to 1000.
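The conversion algorithm itself is not reproduced in this section. As a rough illustration, the sketch below uses one commonly used cumulative-walk conversion in which each base moves the series up or down by a fixed step; the step values, the handling of unknown bases, and the name `dna_to_series` are assumptions and may differ from the conversion actually used in the experiment.

```python
# Assumed step per nucleotide; other mappings are possible.
STEP = {"A": 2, "G": 1, "C": -1, "T": -2}


def dna_to_series(seq):
    """Turn a DNA string into a time series by accumulating per-base steps."""
    series = [0]
    for base in seq.upper():
        if base in STEP:
            series.append(series[-1] + STEP[base])
        # Bases outside A/C/G/T (e.g. 'N') are skipped in this sketch.
    return series


print(dna_to_series("AGCT"))
```

Because the series is a running sum, repeated DNA subsequences produce subsequences with the same shape up to a vertical offset, which z-normalization removes.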
The motif density curve is shown in Fig. 11.top. We observe a region that contains a large number of repeated subsequences. According to prior work, this region is the longest "ampliconic region" (labeled in red) in the Y chromosome. As explained there, the ampliconic region, or "the ampliconic segments, are composed largely of sequences that exhibit marked similarity—as much as 99.9% identity over tens or hundreds of kilobases—to other sequences in the MSY".
Two examples of long motifs discovered by our algorithm are shown in Fig. 11.bottom. The two motifs indicate similarity between pairs of long DNA subsequences of length 18,000 and 22,000, respectively. Note that previous work set a sliding window of length 2000 to find similar DNA sequence patterns. We show the first 2000 points of the long motifs in the black box. For both pairs of subsequences, the intra-distance over these 2000 points is greater than the motif threshold. In other words, they would not be discovered by a fixed-length motif discovery algorithm with motif length 2000. The experiment indicates that, by enabling variable-length motif discovery in long time series, HIME provides an opportunity to discover potentially surprising patterns that are otherwise hard to detect.
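The definition of the motif density curve is given in the cited work and not in this section. One plausible reading, assumed in the sketch below, is that the density at each position is the number of discovered motif instances covering that position; the function name `motif_density` and the toy intervals are hypothetical.

```python
def motif_density(n, intervals):
    """Count, for each of n positions, how many intervals cover it.

    intervals holds (start, end) pairs with an exclusive end.
    Uses a difference array, so the cost is O(n + number of intervals).
    """
    diff = [0] * (n + 1)
    for start, end in intervals:
        diff[start] += 1   # coverage begins at start
        diff[end] -= 1     # and stops just before end
    density, running = [], 0
    for d in diff[:n]:
        running += d
        density.append(running)
    return density


curve = motif_density(6, [(0, 3), (2, 5)])
print(curve)
```

Regions where many motif instances pile up, such as a highly repetitive stretch of a chromosome, appear as peaks in such a curve.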
VI-D2 Discovering Motifs in Bird Soundtrack
Existing work shows that motifs can be used to find repeated calls in bird soundtracks. In this experiment, we test our approach on 600 recordings of Rufous Capped, Spix and Azard spinetail birds. We use the second Mel-frequency cepstral coefficient (MFCC), sampled at 250 Hz, to form a 5-million-length time series. HIME is applied to this time series with the minimum length set to 0.5 sec.
Three examples of discovered motifs, with lengths of 3.4, 2.0 and 4.5 seconds, are shown in Fig. 12. By using the query algorithm to retrieve all motif instances from the closest pair detected, we find that the similar subsequences of each of these three motifs belong to the same bird species. Four recordings of each bird species are shown in Fig. 12.right. The locations of the motifs are labeled in red. These 3 motifs also reveal some unique sound patterns for each bird. For example, Fig. 12(a) shows that Rufous Capped may remain silent for 1.5 sec after a call. Fig. 12(b) shows that Spix Spinetail tends to leave a 0.5 sec gap between consecutive calls. Fig. 12(c) indicates that Azard Spinetail often generates three consecutive calls. Such information provides useful insights that can help researchers understand bird behaviors.
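Since HIME operates on sample indices rather than seconds, it can help to translate the durations quoted above into subsequence lengths. The short sketch below does this conversion under the stated 250 Hz rate; the helper name `seconds_to_points` is hypothetical.

```python
SAMPLE_RATE_HZ = 250  # rate of the MFCC time series stated in the text


def seconds_to_points(seconds, rate=SAMPLE_RATE_HZ):
    """Convert a duration in seconds to a number of time series points."""
    return round(seconds * rate)


min_length = seconds_to_points(0.5)                      # minimum motif length
motif_lengths = [seconds_to_points(s) for s in (3.4, 2.0, 4.5)]
total_hours = 5_000_000 / SAMPLE_RATE_HZ / 3600          # audio covered

print(min_length, motif_lengths, round(total_hours, 2))
```

In other words, the minimum length corresponds to 125 points, the three example motifs to 850, 500 and 1125 points, and the whole series to about 5.6 hours of audio.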
VI-D3 Revealing Rare Patterns in Long Electric Power Usage Time Series
In this case study, we show that by enabling motif discovery over a large length range, HIME can detect long and rare patterns in real-world time series. HIME is applied to a 7.4-million-length freezer electric power usage time series, with the minimum length set to approximately 1 hour. An example of a long pattern is shown in Fig. 13. HIME detects the repeating subsequences shown in Fig. 13 (top, middle; approximately 10 hours). Using the query algorithm, we find that even the similar subsequence retrieved (Fig. 13 (bottom)) already shows an obvious difference in its sequence (marked by the red box). This may indicate that this is a rare motif. Moreover, the two subsequences found by HIME are 1 month and 7 days apart in the time series, and hence would be hard to find in a shorter time series.
We introduce a new algorithm named HIME to detect time series motifs over a large range of lengths in million-scale time series that state-of-the-art algorithms cannot handle efficiently. HIME can process one million subsequences in less than 1 minute, which is much faster than any existing algorithm to date. Compared with sequence-matching-based approaches such as grammar induction-based motif discovery algorithms, HIME achieves an order of magnitude increase in search range. In the case studies, we demonstrate that the motifs found by HIME are meaningful and can potentially have a significant impact in various applications.
-  www.xeno-canto.org.
-  N. Begum and E. Keogh. Rare time series motif discovery from unbounded streams. Proceedings of the VLDB Endowment, 8(2):149–160, 2014.
-  K. Buza and L. Schmidt-Thieme. Motif-based classification of time series with bayesian networks and svms. In Advances in Data Analysis, Data Handling and Business Intelligence, pages 105–114. Springer, 2010.
-  N. Castro and P. J. Azevedo. Multiresolution motif discovery in time series. In SDM, pages 665–676. SIAM, 2010.
-  B. Chiu, E. Keogh, and S. Lonardi. Probabilistic discovery of time series motifs. In Proceedings of the 9th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 493–498, 2003.
-  Y. Gao, J. Lin, and H. Rangwala. Iterative grammar-based framework for discovering variable-length time series motifs. In Machine Learning and Applications (ICMLA), 2016 15th IEEE International Conference on, pages 7–12. IEEE, 2016.
-  A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley. Physiobank, physiotoolkit, and physionet components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220, 2000.
-  E. Keogh, J. Lin, and A. Fu. Hot sax: Efficiently finding the most unusual time series subsequence. In Fifth IEEE International Conference on Data Mining (ICDM’05), pages 8–pp. IEEE, 2005.
-  Y. Li, J. Lin, and T. Oates. Visualizing variable-length time series motifs. In SDM, pages 895–906. SIAM, 2012.
-  Y. Li, M. L. Yiu, Z. Gong, et al. Quick-motif: An efficient and scalable framework for exact motif discovery. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pages 579–590. IEEE, 2015.
-  J. Lin, E. Keogh, S. Lonardi, and P. Patel. Finding motifs in time series. In the 2nd Workshop on Temporal Data Mining, at the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002.
-  J. Lin, E. Keogh, L. Wei, and S. Lonardi. Experiencing sax: a novel symbolic representation of time series. Data Mining and knowledge discovery, 15(2):107–144, 2007.
-  B. Liu, J. Li, C. Chen, W. Tan, Q. Chen, and M. Zhou. Efficient motif discovery for large-scale time series in healthcare. IEEE Transactions on Industrial Informatics, 11(3):583–590, 2015.
-  J. Meng, J. Yuan, M. Hans, and Y. Wu. Mining motifs from human motion. In Proc. of EUROGRAPHICS, volume 8, 2008.
-  D. Minnen, T. Starner, I. Essa, and C. Isbell. Discovering characteristic actions from on-body sensor data. In Wearable computers, 2006 10th IEEE international symposium on, pages 11–18. IEEE, 2006.
-  Y. Mohammad and T. Nishida. Exact discovery of length-range motifs. In Intelligent Information and Database Systems, pages 23–32. Springer, 2014.
-  Y. Mohammad and T. Nishida. Scale invariant multi-length motif discovery. In Modern Advances in Applied Intelligence, pages 417–426. Springer, 2014.
-  A. Mueen. Enumeration of time series motifs of all lengths. In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pages 547–556. IEEE, 2013.
-  A. Mueen and E. Keogh. Online discovery and maintenance of time series motifs. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1089–1098. ACM, 2010.
-  A. Mueen, E. J. Keogh, Q. Zhu, S. Cash, and M. B. Westover. Exact discovery of time series motifs. In SDM, pages 473–484. SIAM, 2009.
-  A. Mueen, Y. Zhu, M. Yeh, K. Kamgar, K. Viswanathan, C. Gupta, and E. Keogh. The fastest similarity search algorithm for time series subsequences under euclidean distance, August 2015.
-  D. Murray, J. Liao, L. Stankovic, V. Stankovic, R. Hauxwell-Baldwin, C. Wilson, M. Coleman, T. Kane, and S. Firth. A data management platform for personalised real-time energy feedback. 8 2015.
-  C. G. Nevill-Manning and I. H. Witten. Identifying hierarchical structure in sequences: A linear-time algorithm. J. Artif. Intell. Res. (JAIR), 7:67–82, 1997.
-  P. Nunthanid, V. Niennattrakul, and C. A. Ratanamahatana. Discovery of variable length time series motif. In Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), 2011 8th International Conference on, pages 472–475. IEEE, 2011.
-  T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria, and E. Keogh. Searching and mining trillions of time series subsequences under dynamic time warping. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 262–270. ACM, 2012.
-  P. Senin, J. Lin, X. Wang, T. Oates, S. Gandhi, A. P. Boedihardjo, C. Chen, S. Frankenstein, and M. Lerner. Grammarviz 2.0: a tool for grammar-based pattern discovery in time series. In Machine Learning and Knowledge Discovery in Databases, pages 468–472. Springer, 2014.
-  M. Shokoohi-Yekta, Y. Chen, B. Campana, B. Hu, J. Zakaria, and E. Keogh. Discovery of meaningful rules in time series. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1085–1094. ACM, 2015.
-  H. Skaletsky, T. Kuroda-Kawaguchi, P. J. Minx, H. S. Cordum, L. Hillier, L. G. Brown, S. Repping, T. Pyntikova, J. Ali, T. Bieri, et al. The male-specific region of the human y chromosome is a mosaic of discrete sequence classes. Nature, 423(6942):825–837, 2003.
-  H. Tang and S. S. Liao. Discovering original motifs with different lengths from time series. Knowledge-Based Systems, 21(7):666–671, 2008.
-  X. Wang, J. Lin, P. Senin, T. Oates, S. Gandhi, A. P. Boedihardjo, C. Chen, and S. Frankenstein. Rpm: Representative pattern mining for efficient time series classification. pages 185–196, 2016.
-  C.-C. M. Yeh, Y. Zhu, L. Ulanova, N. Begum, Y. Ding, H. A. Dau, D. F. Silva, A. Mueen, and E. Keogh. Matrix profile i: All pairs similarity joins for time series: A unifying view that includes motifs, discords and shapelets. In IEEE ICDM, 2016.
-  Y. Zhu, Z. Zimmerman, N. S. Senobari, C.-C. M. Yeh, G. Funning, A. Mueen, P. Brisk, and E. Keogh. Matrix profile ii: Exploiting a novel algorithm and gpus to break the one hundred million barrier for time series motifs and joins.