I Introduction
Detecting repeating patterns of various lengths, also called variable-length motifs, in time series has received a great amount of attention [1][2][3][4]. Since motifs of different lengths can naturally coexist in a time series, detecting variable-length motifs is often a necessary step in real-world applications such as classification [2][5][6].
Contrary to the significant progress made in recent single-dimensional variable-length motif discovery work [2][3][7], little progress has been made in detecting variable-length subdimensional motifs [8][9], i.e., patterns that occur simultaneously in only a subset of the dimensions of a multivariate time series.
Existing approaches [10][8][9] in subdimensional motif discovery still only detect motifs of a specified length, possibly suggested by domain experts. While in some applications, these approaches may fit well if domain knowledge is available and a good motif length can be specified by the user, we aim at solving the problem in a more general case — when the correct motif length is not known, or motifs of various lengths coexist in the data.
We illustrate the limitation of fixed-length motif discovery in Figure 1. In the figure, there are two subdimensional motifs of lengths 200 and 400, respectively. The first subdimensional motif (labeled in red with a green box) occurs in the first two dimensions, and the second subdimensional motif occurs in a second pair of dimensions. Since the motifs have different lengths, even in the best case, existing (fixed-length) subdimensional motif discovery approaches [10][8][9] can detect only one of these motifs correctly, and only if the proper length is provided. When the lengths are unknown, they would need to try different lengths, a process that is very time consuming. To overcome this limitation, our proposed approach is designed to discover motifs of various lengths in a single run; that is, all subdimensional motifs can be discovered even with significant length differences between them.
Subdimensional motifs of different lengths widely exist in real-world time series data. For example, the Caltrans Performance Measurement (PEMS) dataset [11], which records traffic information in major California cities over time, can have patterns at different spatial and temporal scales, ranging from several hours of rush-hour traffic patterns to a 5-day stable weekly pattern. In meteorology data analysis, a motif with a duration of several hours may benefit short-term forecasting, whereas a motif that lasts several months may represent some seasonal phenomenon [9]. These motifs can coexist in the same time series even though they differ significantly in length and span different dimensions. Detecting them is often necessary for analyzing the data [2][12].
The main challenge in detecting variable-length subdimensional motifs is scalability. The brute-force enumeration solution, which searches for motifs of all possible lengths, is very time consuming. While there exist some search techniques for detecting single-dimensional variable-length motifs [2][12][13], they cannot be generalized to detecting subdimensional motifs since they do not address the problem of searching for relevant dimensions. As a result, how to efficiently detect variable-length subdimensional motifs remains a difficult problem.
In this paper, we introduce an algorithm called Collaborative HIerarchy based Motif Enumeration (CHIME) to efficiently detect variable-length subdimensional motifs given a minimum motif length. Towards that end, CHIME combines repeating symbols detected in each dimension, and uses the symbol matching results in all dimensions collaboratively to avoid redundant searches. While the algorithm does not guarantee an exact solution, it can find motifs over a considerably large length range in multivariate time series with high accuracy. Even in a fifty-dimensional time series with a length of one million, the algorithm executes in less than one hour. This is significantly faster than the state-of-the-art approaches, which would take days to detect motifs of a single fixed length. The efficient search makes variable-length subdimensional motif discovery in large-scale multivariate time series a feasible task. To summarize, our work has the following contributions:

The algorithm can discover subdimensional motifs of a large length range.

The speed is significantly faster than that of the state-of-the-art algorithms.

Even though CHIME is an approximate algorithm, the experiments show that the algorithm can detect motifs with high accuracy.

The experiments also show that CHIME can discover meaningful motifs in real-world datasets, and can potentially benefit other multivariate time series data mining tasks such as classification.
The rest of the paper is organized as follows: Sec. II discusses related work and challenges in detecting variable-length subdimensional motifs. Sec. III introduces the problem definition and notations used in the paper. Sec. IV describes the discretization technique and introduces our algorithm. The experimental results are shown in Sec. V, and we conclude in Sec. VI.
II Related Work
We start by introducing recent fixed-length multivariate time series motif discovery work.
Tanaka et al. [14] introduce a smart approach that compresses multivariate time series into a univariate time series to detect motifs at a low cost. Minnen et al. and Berlin et al. [10][15] utilize a density based approach to locate potential motif areas. All three approaches above, however, are designed for finding patterns that match in all dimensions. Therefore, the performance of these approaches will degrade if there exist some irrelevant dimensions.
Another work by Minnen et al. [8] is perhaps the first to point out that finding patterns that match in all dimensions is not very useful in some real-world applications. The authors show that explainable motifs often occur in only some subset(s) of all dimensions. They introduce a fast approach that utilizes random projection and the Symbolic Aggregate approXimation (SAX) representation to efficiently find approximate subdimensional motifs. Following a similar idea, Wang et al. [16] introduce a subdimensional motif discovery approach based on constructing a suffix tree. Recently, Yeh et al. [9] introduced an algorithm named mSTAMP that utilizes state-of-the-art motif discovery results along with the Minimum Description Length (MDL) metric to locate subdimensional motifs. The approach can detect high-quality, meaningful subdimensional motifs in many real-world datasets.
All the algorithms described above require the user to predefine the motif length. However, recent work has shown that detecting motifs of different, possibly unknown lengths is necessary in many real-world applications [3][13][2][12]. We next discuss recent variable-length subdimensional motif discovery algorithms.
Presently, work on variable-length subdimensional motif discovery is very limited. We only found one such work, by Balasubramanian et al. [6]. The authors utilize a Grammar Induction based framework [1] to find variable-length subdimensional motifs in healthcare time series. However, their approach requires storing every combination of the co-occurring single-dimensional motifs in memory. So while it can achieve high accuracy, the approach is only suitable for low-dimensional time series. In contrast, our proposed algorithm is able to scale to high-dimensional, large-size time series.
III Notation & Definition
We start with fundamental definitions related to time series:
Single Dimensional (Univariate) Time Series is a set of observations ordered by time.
Multivariate Time Series is a set of co-evolving single dimensional time series.
Subsequence of a multivariate time series is a contiguous set of points in the univariate time series starting from position with length . Typically, and .
Subsequences can be extracted from a time series via a sliding window. In many applications, we are interested in finding similar “shapes.” Therefore, motif discovery is more meaningful when it is offset- and amplitude-invariant. This can be achieved by normalizing each subsequence prior to the search for motifs. Z-normalization is a procedure that normalizes the mean and standard deviation of all points in the subsequence to zero and one, respectively.
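As an illustration of why z-normalization makes matching offset- and amplitude-invariant, the following minimal sketch normalizes two windows that share a shape but differ in offset and scale. The function name `z_normalize` and the guard for flat windows (`eps`) are assumptions made for this example, not part of the original method description.

```python
import math

def z_normalize(subsequence, eps=1e-8):
    """Rescale a subsequence to zero mean and unit standard deviation."""
    n = len(subsequence)
    mean = sum(subsequence) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in subsequence) / n)
    if std < eps:
        # Flat window: return zeros instead of dividing by ~0 (assumption of this sketch).
        return [0.0] * n
    return [(x - mean) / std for x in subsequence]

# Same shape, different offset and amplitude.
a = z_normalize([1.0, 2.0, 3.0, 2.0])
b = z_normalize([10.0, 30.0, 50.0, 30.0])
same_shape = all(abs(x - y) < 1e-9 for x, y in zip(a, b))
```

After normalization the two windows coincide, so a distance computed on them reflects shape only.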
Given a start and an end position , respectively, a Multidimensional Subsequence consisting of subsequences can be extracted. We denote it as where D contains all possible dimensions .
As demonstrated in previous work [8][9], in many cases only a subset of the co-evolving subsequences is relevant (i.e., has repeating patterns). We therefore adopt the concept of subdimensional subsequence used in previous work [9].
The Subdimensional Subsequence is a set of subsequences among subsequences that start from position and end at in multivariate time series , where is an indicator vector that stores the list of relevant dimensions and is the number of relevant dimensions.
We next introduce the definition of subdimensional motif used in this work. Given a distance function and two subdimensional subsequences and with the same indicator and the same length, a vector is used to record the distance value where . In this work, we define a Subdimensional Motif to be a set of subdimensional subsequences in such that the average value of the distance vector between the subdimensional subsequences and a seed subdimensional subsequence is less than the motif threshold function , where is the length of the motif detected, (the minimum motif length). Each such subdimensional subsequence is said to be an instance of the subdimensional motif.
III-A Problem Definition
Finally, we introduce the problem that the proposed approach aims to address.
Variable-Length Subdimensional Time Series Motif Discovery Problem: Given a minimum motif length and a multivariate time series, the problem aims to find all variable-length subdimensional time series motifs that exist in the series.
Clearly, finding an exact solution to this problem in a large-scale multivariate time series is costly. However, discovering even a subset of variable-length motifs can already benefit many real-world applications compared with fixed-length motifs [2][1][7]. Therefore, given a minimum motif length, our proposed algorithm aims to find an approximate set of high-quality variable-length motifs with lengths larger than the minimum, and to rank them based on some interest measure.
It is worth noting that determining a general interest measure for the variable-length subdimensional motif discovery problem is non-trivial [9][8][12]. Some long motifs are likely to have larger distances than short motifs even after normalizing the distances by length. Some motifs that span fewer dimensions may contain more interesting local patterns than those spanning most or all of the dimensions. Therefore, the ranking of subdimensional motifs should be based on the application. The reader may refer to [9][8][12] for details on various motif evaluation approaches for different goals.
IV Proposed Method
In this section, we first describe Symbolic Aggregate approXimation (SAX), the discretization approach we use to represent each subsequence. Then we describe the Multivariate Time Series Numerosity Reduction process, the preprocessing step that aims to remove neighboring, similar subsequences to avoid overcounting a pattern. We then introduce the proposed method.
IV-A Symbolic Aggregate approXimation (SAX)
Discretization [18] is a common step in many time series motif discovery algorithms [18][3][19][8][6][1], since it often improves the efficiency of the algorithms significantly. In this section, we describe Symbolic Aggregate approXimation (SAX), a popular technique used to discretize univariate time series.
Given a normalized subsequence and a reduced dimension size, SAX converts the raw subsequence into a lower-dimensional PAA (Piecewise Aggregate Approximation) representation by dividing the subsequence into equal-sized windows and computing the average value of the points within each window. Intuitively, the PAA coefficients vector [18] is a vector with one entry per window, consisting of the average values from the equal-sized segments of the input subsequence. PAA coefficients can be considered an approximate representation of the original subsequence.
Given a PAA coefficients vector, the algorithm then maps it to symbols from an alphabet of a given size according to a breakpoint table [18], defined such that the regions are approximately equiprobable under a Gaussian distribution. This maximizes the chance that the symbols occur with approximately equal probability. These symbols form a SAX word. Figure 2 summarizes the process. The bold flat lines represent the values of the PAA coefficients computed from their respective segments in the subsequence; the breakpoint table for alphabet sizes from 2 to 4 is also shown in the figure. Since the alphabet size is set to 3 in this example, the second column shows the breakpoints, based on which three regions are generated. The PAA coefficients falling into these three regions are represented by three distinct symbols, respectively, and together these symbols form the SAX word.
IV-A1 Fast Computation of SAX for Subsequence of Any Length
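As a baseline for the fast variant discussed in this subsection, the direct PAA-to-symbol conversion of Sec. IV-A can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `paa`, `breakpoints`, and `sax_word` are hypothetical, the input is assumed to be already z-normalized, and its length is assumed to be divisible by the PAA size `w`. Breakpoints are taken as Gaussian quantiles, which yields equiprobable regions.

```python
from bisect import bisect_right
from statistics import NormalDist

def paa(values, w):
    """Average `values` over w equal-sized windows (assumes len(values) % w == 0)."""
    seg = len(values) // w
    return [sum(values[i * seg:(i + 1) * seg]) / seg for i in range(w)]

def breakpoints(a):
    """Cut points splitting a standard Gaussian into `a` equiprobable regions."""
    return [NormalDist().inv_cdf(i / a) for i in range(1, a)]

def sax_word(normalized, w, a):
    """Map each PAA coefficient to the letter of the region it falls into."""
    cuts = breakpoints(a)
    return "".join(chr(ord("a") + bisect_right(cuts, c)) for c in paa(normalized, w))

# Three PAA segments, alphabet size 3: low, middle and high regions.
word = sax_word([-1.2, -1.0, 0.1, -0.1, 1.0, 1.2], w=3, a=3)
```

With an alphabet size of 3 the two breakpoints are roughly -0.43 and 0.43, so the three segment averages above fall into three different regions.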
Since the proposed approach uses SAX words heavily, it is very important to reduce the time cost of discretization. We next describe a fast algorithm to compute the SAX word for a univariate subsequence of any length [3]. We first precompute two vectors of statistical features for every time series in the input multivariate time series. Given a subsequence of a given length in a given dimension, the SAX representation can be computed by Algorithm 1. In the algorithm, the mean and variance of the subsequence are computed in constant time (Lines 3-4). The cost of computing the PAA coefficients (Lines 5-7) does not depend on the subsequence length. The PAA coefficients are converted to a SAX word based on the predefined breakpoint table described in the previous subsection. Since the two feature vectors can be computed during preprocessing, the cost of computing a SAX word for an arbitrary-length subsequence during the motif discovery process is independent of the subsequence length. As demonstrated in [18][3], the SAX parameters should be very small compared with the subsequence length, so the time cost of computing a SAX word is significantly reduced.
IV-B PAA Distance-based Multivariate Time Series Numerosity Reduction
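Because numerosity reduction operates on the discretized form of every sliding-window subsequence, it may help to make the constant-time statistics of Sec. IV-A1 concrete first. The sketch below assumes that the two precomputed feature vectors are running sums of the values and of their squares; the source does not spell this out, so treat it as one plausible realization. All function names are hypothetical.

```python
import math
from itertools import accumulate

def prefix_sums(series):
    """Running sums of x and x^2, each with a leading 0 for easy differencing."""
    s1 = [0.0] + list(accumulate(series))
    s2 = [0.0] + list(accumulate(x * x for x in series))
    return s1, s2

def window_mean_std(s1, s2, start, length):
    """Mean and standard deviation of series[start:start+length] in O(1)."""
    mean = (s1[start + length] - s1[start]) / length
    mean_sq = (s2[start + length] - s2[start]) / length
    return mean, math.sqrt(max(mean_sq - mean * mean, 0.0))

def fast_paa(s1, start, length, w, mean, std):
    """Z-normalized PAA coefficients in O(w); assumes length % w == 0 and std > 0."""
    seg = length // w
    coeffs = []
    for k in range(w):
        lo = start + k * seg
        seg_mean = (s1[lo + seg] - s1[lo]) / seg
        coeffs.append((seg_mean - mean) / std)
    return coeffs

s1, s2 = prefix_sums([1, 2, 3, 4, 5, 6, 7, 8])
mean, std = window_mean_std(s1, s2, start=2, length=4)   # window [3, 4, 5, 6]
coeffs = fast_paa(s1, 2, 4, w=2, mean=mean, std=std)
```

Each coefficient is obtained from two lookups in the running-sum vector, so the work per SAX word depends on the PAA size rather than on the window length.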
In practice, neighboring subsequences extracted via a sliding window are similar to each other since they are off by one point. To avoid over-counting a pattern, and to allow variable-length pattern discovery, Numerosity Reduction is often used to further compress the word sequence in previous univariate variable-length motif discovery work [1][3][7].
Concretely, given a multivariate time series, similar to previous work [1][3][7], the PAA Distance-based Numerosity Reduction process conducts a left-to-right scan through the series and records a multivariate subsequence only if the PAA distance [18] between it and the most recently recorded subsequence is greater than a threshold in at least one dimension. (Footnote 1: Given two subsequences, the PAA distance is computed as the distance between the PAA coefficient vectors of the two subsequences multiplied by a scaling factor. The PAA distance has been shown to lower bound the actual Euclidean distance [18]; the reader can refer to [18] for a detailed explanation.) The parameter used for computing the PAA distance is set as in previous work [7][3].
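The scan described above can be sketched as follows. The PAA distance uses the standard scaling factor sqrt(n/w) from the SAX literature, where n is the subsequence length and w the PAA size; the `threshold` argument and the input layout (`paa_per_position[t][d]` holds the PAA vector of the window starting at position `t` in dimension `d`) are assumptions of this example.

```python
import math

def paa_distance(p, q, n):
    """sqrt(n / w) times the Euclidean distance between two PAA vectors of size w."""
    w = len(p)
    return math.sqrt(n / w) * math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def numerosity_reduce(paa_per_position, n, threshold):
    """Left-to-right scan; keep a window only if it differs from the most recently
    kept window by more than `threshold` in at least one dimension."""
    kept = []
    last = None
    for t, vectors in enumerate(paa_per_position):
        if last is None or any(
            paa_distance(p, q, n) > threshold for p, q in zip(vectors, last)
        ):
            kept.append(t)
            last = vectors
    return kept

# Four window positions, two dimensions, PAA size 2, window length 4.
positions = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.1, 0.0], [0.0, 0.0]],   # nearly identical to position 0: dropped
    [[0.0, 0.0], [1.0, 0.0]],   # dimension 1 changed: kept
    [[0.0, 0.0], [1.0, 0.1]],   # nearly identical to position 2: dropped
]
kept = numerosity_reduce(positions, n=4, threshold=0.5)
```

Only the start positions in `kept` would go on to become nodes in the per-dimension lists.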
IV-C Collaborative HIerarchy based Motif Enumeration (CHIME)
In this section, we describe our proposed method.
IV-C1 Basic Data Structure
Since the proposed method relies heavily on recursively building up a hierarchical structure from the numerosity-reduced subsequences, we store all subsequences after Numerosity Reduction in linked lists. Specifically, for each single dimensional time series, the reduced subsequences are stored in a linked list, as shown in Fig. 3.
Each node stores a subsequence and is connected to two nodes that represent the subsequences appearing immediately before and after it in the reduced sequence. The edge that connects a node with its successor is denoted as the next edge. The algorithm will check every subsequence connected by the next edge, and merge the connected nodes to generate longer motifs.
Once the data structure is built, CHIME conducts a left-to-right scan over every node in the linked lists and calls the Collaborative Enumerator, the algorithm used to enumerate variable-length subdimensional motifs starting from a given node.
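The per-dimension linked list can be pictured with the following minimal sketch. The class name `Node`, its fields, and the helper `merge_with_next` are hypothetical and only illustrate how two nodes joined by a next edge yield a longer subsequence; the original data structure may store additional information.

```python
class Node:
    """One numerosity-reduced subsequence in a single dimension."""
    def __init__(self, dim, start, length):
        self.dim = dim
        self.start = start
        self.length = length
        self.prev = None      # subsequence kept immediately before this one
        self.next = None      # subsequence kept immediately after (the next edge)
        self.visited = False

def build_list(dim, windows):
    """windows: (start, length) pairs kept by numerosity reduction, in time order."""
    head = tail = None
    for start, length in windows:
        node = Node(dim, start, length)
        if tail is None:
            head = node
        else:
            tail.next = node
            node.prev = tail
        tail = node
    return head

def merge_with_next(node):
    """New node spanning `node` and its successor along the next edge."""
    nxt = node.next
    end = nxt.start + nxt.length
    return Node(node.dim, node.start, end - node.start)

head = build_list(0, [(0, 300), (7, 300), (15, 300)])
merged = merge_with_next(head)   # covers positions 0 .. 307
```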
IV-C2 Collaborative Enumerator
The Collaborative Enumerator is described in Alg. 2. Intuitively, given a node (Lines 1-2), the enumerator recursively grows the length of the subsequence by using a greedy enumeration step and a dimension matching step. These two steps collaboratively avoid redundant enumeration, which saves space, and form potential subdimensional motif candidates (stored in the candidate motif set). The Collaborative Enumerator has four major steps: a SAX word matching step (Lines 3-9); a greedy enumeration step (Lines 10-12); a dimension matching step (Lines 13-15); and a step that updates the motif set (Lines 16-17). After these four steps, the enumerator recursively calls itself to enumerate longer motifs (Lines 18-21).
IV-C3 SAX Word Matching
In this step, the Collaborative Enumerator attempts to detect a pair of matching subsequences based on their SAX word representation. Specifically, given a node, we first merge the two subsequences stored in the node and in its successor along the next edge, and compute a SAX word via the FastSAX algorithm introduced in Sec. IV-A1. A new node is generated to represent the merged subsequence (Lines 3-4). The SAX word, along with the length of the merged subsequence and its dimension, is inserted into a SAX word table, SAXTable, if the same SAX word representing some subsequence(s) of similar length in the same dimension does not already exist in SAXTable (Lines 5-6). Otherwise, the enumerator has found a pair of matching subsequences in that dimension, in which case the algorithm gets the node representing the matched subsequence from the SAXTable (Line 9).
Intuitively, this step looks to see whether any subsequence stored in SAXTable is similar to the newly formed long subsequence. If it finds one successfully, then the algorithm calls local enumeration and dimension matching steps to detect motifs (see below). Otherwise it puts the subsequence into SAXTable for future matching.
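A dictionary-based sketch of this lookup-or-insert behavior is given below. How "similar length" is decided is not specified in the text, so the relative tolerance `length_tolerance` is an assumption of this example, as are the function name and the table layout.

```python
def match_or_insert(sax_table, word, dim, length, node, length_tolerance=0.1):
    """Return a stored node with the same SAX word and dimension and a similar
    length; otherwise store `node` for future matching and return None."""
    candidates = sax_table.setdefault((dim, word), [])
    for stored_length, stored_node in candidates:
        if abs(stored_length - length) <= length_tolerance * max(stored_length, length):
            return stored_node
    candidates.append((length, node))
    return None

table = {}
first = match_or_insert(table, "adb", 0, 300, "n1")     # nothing stored yet
second = match_or_insert(table, "adb", 0, 310, "n2")    # same word, similar length
other_dim = match_or_insert(table, "adb", 1, 300, "n3") # same word, other dimension
too_long = match_or_insert(table, "adb", 0, 600, "n4")  # same word, very different length
```

In this sketch a hit returns the earlier node, which is what triggers the enumeration and dimension matching steps; a miss simply records the new subsequence.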
IV-C4 Local Enumeration
If the algorithm finds matching subsequences in the previous step, CHIME then conducts a local greedy enumeration step that expands the two subsequences simultaneously as far as possible to find the longest pair of matching subsequences. This pair is obtained by repeatedly merging nodes via the next edge. The process stops when the two subsequences are represented by different SAX words (Line 10). Two nodes are generated to represent the expanded subsequences, and they are inserted into the linked list for future enumeration (Lines 11-12).
Upon insertion of the two new nodes, the edges are updated accordingly, as shown in Fig. 4. Intuitively, the newly inserted nodes allow us to reuse the detected matching subsequences to reduce the cost of generating long subsequences.
An example is shown in Fig. 5. The algorithm iteratively merges nodes via the next edge in the first and second iterations since, in both iterations, the SAX words that represent the two newly generated subsequences are identical (add for the first iteration, and adb for the second iteration). In the third iteration, the SAX words for the two subsequences are different (adc and adb, respectively), so the algorithm stops the enumeration. Two nodes (the green and red nodes), which represent the green and red long subsequences, are formed based on the merged nodes.
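The greedy expansion in this example can be sketched as a loop that extends both spans by one node at a time while their SAX words agree. The callback `word_of(index, size)`, which returns the SAX word of the span covering `size` consecutive nodes starting at node `index`, is a hypothetical stand-in for recomputing the word of the merged subsequence; the non-overlap check `i + size < j` is also an assumption of this sketch.

```python
def greedy_expand(i, j, n_nodes, word_of):
    """Grow two spans starting at node indices i < j along the next edge while
    their SAX words stay identical. Returns the number of nodes in each span."""
    size = 1
    while (i + size < j                      # keep the two spans disjoint
           and j + size < n_nodes            # stay inside the list
           and word_of(i, size + 1) == word_of(j, size + 1)):
        size += 1
    return size

# Toy words mirroring the example: two matching iterations, then a mismatch.
words = {
    (0, 2): "add", (5, 2): "add",
    (0, 3): "adb", (5, 3): "adb",
    (0, 4): "adc", (5, 4): "adb",
}
span = greedy_expand(0, 5, 12, lambda idx, size: words[(idx, size)])
```

Here the expansion accepts the first two merges and stops at the third, so each expanded span covers three of the original nodes.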
IV-C5 Dimension Matching
After the local enumeration step, the algorithm conducts a dimension matching step. In this step, for each other dimension, the enumerator compares the pair of SAX words generated from the subsequences at the same locations, and with the same lengths, as the two expanded subsequences. An indicator vector stores the dimensions in which matching subsequences are found, and a set of nodes storing all newly matched subsequences is generated (Line 13). The first nodes of the matching subsequences are marked as “visited” to avoid revisiting them in the future.
To clarify how the dimension search process works, let us consider the example shown in Fig. 6. Suppose two nodes represent a pair of matching subsequences in one dimension of a three-dimensional time series. In the dimension matching step, CHIME checks each of the remaining two dimensions to see whether the pair of subsequences at the same locations also have matching SAX words. In this example, CHIME finds that the SAX words match in one of them (the subsequences share the same word). So an indicator vector and the nodes representing the newly found matching subsequence pair (brown nodes) are formed. The node representing the first covered subsequence (blank nodes with ‘x’) is marked as visited. These two brown nodes are stored in SAXTable for future enumeration.
Through the dimension matching process, CHIME can directly find matching long subsequences without going through the process of merging all the blank nodes shown in Fig. 6, which reduces the cost. In this example, a trivial solution would need to repeat SAX word matching 4 times to find the long subsequences in that dimension, whereas CHIME only does it 2 times.
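A compact sketch of the dimension matching step is shown below. The callback `sax_of(dim, start, length)` is a hypothetical stand-in for computing the SAX word of a subsequence in a given dimension, and the toy words are made up for illustration.

```python
def dimension_matching(seed_dim, start_a, start_b, length, n_dims, sax_of):
    """Indicator vector of dimensions in which the two subsequences at the same
    locations (start_a and start_b, same length) share a SAX word."""
    indicator = [0] * n_dims
    indicator[seed_dim] = 1   # the seed dimension matched in the previous steps
    for d in range(n_dims):
        if d != seed_dim and sax_of(d, start_a, length) == sax_of(d, start_b, length):
            indicator[d] = 1
    return indicator

# Three dimensions; the pair found in dimension 0 also matches in dimension 1.
toy_words = {
    (0, 100): "abc", (0, 900): "abc",
    (1, 100): "acb", (1, 900): "acb",
    (2, 100): "aab", (2, 900): "cba",
}
indicator = dimension_matching(
    0, 100, 900, 1200, 3, lambda d, start, length: toy_words[(d, start)]
)
```

One word comparison per remaining dimension replaces the node-by-node merging that would otherwise be repeated in each of those dimensions.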
IV-C6 Update Motif Set
In this step, CHIME first forms a pair of matching subdimensional subsequences from the two expanded subsequences and all matching subsequences found in the other dimensions (Line 16). These two subdimensional subsequences can be considered as two instances of a subdimensional motif since their SAX words match in the relevant dimensions. CHIME therefore computes a discrete representation, wordSet, by concatenating all the SAX words along with their dimension indices to represent these two subdimensional subsequences (e.g., for the example in Fig. 6, a SAX word sequence is generated). The two instances are then put into the candidate motif set along with the hash value generated from wordSet (Line 18) and their length. Intuitively, since all subdimensional subsequences that are represented by the same wordSet and have similar lengths are considered instances of the same motif, these subdimensional subsequences will be put into the same bucket of the candidate motif set to form a motif candidate. In the post-processing step, a pairwise comparison is conducted to filter out any false positive candidates.
IV-C7 Recursive Enumeration
Finally, CHIME recursively calls itself to enumerate longer motifs. More specifically, the algorithm calls itself on the two newly generated nodes to continue enumerating motifs. Note that since the matched subsequence may already have been enumerated into a long subsequence in a previous step, CHIME only conducts the recursive enumeration on a node if it has not been done before. The algorithm stops when no SAX word match is detected (Lines 19-21).
IV-C8 High-level Framework
The overall framework is shown in Alg. 3. Intuitively, CHIME conducts a left-to-right scan through every linked list and calls the Collaborative Enumerator to detect motifs. During the scan, CHIME utilizes the recorded nodes to avoid repeating the enumeration of already observed node(s). Specifically, CHIME only calls the enumerator to start the enumeration process from a node if that node has not been visited (Line 5). If the node was previously visited, the corresponding subsequence has already been processed, so CHIME skips the node to avoid redundant work. Finally, after all nodes are enumerated, a post-processing step is conducted to remove all false positive instances of motifs stored in the candidate motif set (Line 8). In the proposed work, since we compare the scalability of CHIME with the state-of-the-art approach [9], we compute the distances between pairs of instances that share the same representation and have similar lengths. We then rank the motifs by distance in ascending order per dimension size per motif length. Note that the cost is the same as filtering out all false positive instances detected by CHIME per the definition of subdimensional motif in this work.
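The bucketing idea can be sketched with a dictionary keyed by the wordSet string and a coarse length bucket. The key format, the `bucket` width used to decide that two lengths are "similar", and the function names are all assumptions of this example rather than details given in the text.

```python
from collections import defaultdict

def word_set_key(words_by_dim):
    """Concatenate dimension indices and SAX words in a fixed (sorted) order."""
    return "|".join(f"{d}:{w}" for d, w in sorted(words_by_dim.items()))

def add_candidate(motif_set, words_by_dim, length, instances, bucket=50):
    """Group instances that share a wordSet and fall into the same length bucket."""
    key = (word_set_key(words_by_dim), length // bucket)
    motif_set[key].extend(instances)
    return key

motif_set = defaultdict(list)
k1 = add_candidate(motif_set, {1: "acb", 0: "abc"}, 1210, [(100, 1310), (900, 2110)])
k2 = add_candidate(motif_set, {0: "abc", 1: "acb"}, 1230, [(4000, 5230)])
```

Both calls land in the same bucket, so the three instances form one motif candidate that the pairwise post-processing comparison would then verify.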
IV-D Time Complexity
The time complexity of CHIME is dominated by the pairwise comparison in post-processing, whose worst-case cost is similar to that of state-of-the-art motif enumeration approaches. However, since CHIME utilizes the symbolic representation to avoid enumerating many subsequences redundantly, we show in the experiments that the algorithm can detect subdimensional motifs in a very small amount of time. It can also handle multivariate time series with a million points, which none of the existing approaches can handle due to the time cost.
IV-E Comparison with Sequence Matching Approach
One existing work [6] utilizes a sequence matching approach (e.g., [20][21]) to detect variable-length subdimensional motifs. While that approach cannot handle high-dimensional multivariate time series because its time and space requirements grow exponentially with the number of dimensions, it is worth noting that even if its time cost could be reduced to a reasonable level, it still could not fulfill the task introduced in this paper.
The problem is twofold. First, long subsequences are typically represented by overwhelmingly long sequences of symbols. For example, consider the two 1D subsequences of length 200 [11] shown in Figure 7. Visually, these two time series look very similar. The sequence representations of the time series, obtained by using a sliding window of length 20 (10% of the subsequence length), are shown on the right. We can see that the two SAX sequences, including their lengths, look very different despite the striking similarity between the time series. Since the previous approach [6] is based on SAX sequence matching [21], it would not be able to find the motifs unless a warping-robust matching process is conducted, which is too costly given the complexity. In contrast, CHIME recomputes the SAX words and can represent each of these two 1D sequences by a single SAX word, which is robust to noise.
Second, the mean and variance of a short subsequence used to generate the SAX word sequence can differ significantly from those of a long subsequence [2][12]. Since the shape can be largely affected by the mean and variance, the representation based on SAX word sequences may fail to capture the actual similarity between two subsequences. As a result, approaches based on SAX word sequence matching, such as [13][1][7], can only detect motifs in a small length range, whereas CHIME does not have this problem.
V Experiments
We perform a series of experiments to evaluate the accuracy and speed of CHIME. All the experiments are conducted on a 16 GB RAM laptop with quad core processor of 2.5 GHz. The executable software and datasets used in the experiments can be found in https://github.com/flash121123/CHIME. In all evaluation experiments unless noted, the PAA parameters and are set to 5 and 6, respectively. The minimum length is 300 and motif threshold is .
We first demonstrate that existing work may not be suitable for variable-length subdimensional motif discovery. As shown in previous work [2][12][7], index-based fixed-length approximate motif discovery algorithms such as random projection [19] are not suitable for detecting variable-length motifs due to their memory requirements. This is because the algorithm needs to generate a discrete representation for every subsequence for every length tested, which quickly becomes impractical even with a small enumeration range [2][7]. To demonstrate the significant difference in memory requirement between the two algorithms, we conduct a simple experiment. We compute the ratio between the number of subsequences stored in memory for matching motifs and the product of the number of subsequences and the enumeration range:
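\[
\text{Ratio} = \frac{\text{number of subsequences stored in memory}}{\text{number of subsequences} \times \text{enumeration range}} \tag{1}
\]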
Since the random projection approach [8] discretizes and keeps track of all tested subsequences of every length, its ratio is equal to 1.
The ratios for three different types of datasets, namely random walk data, traffic speed data [11], and EEG data (http://bbci.de/competition/iv/), along with the data size and the enumeration range (measured by the number of distinct motif lengths detected after removing all false positives), are shown in Table 1. In all three datasets, the memory cost of CHIME is three orders of magnitude lower than that of random projection. This property allows CHIME to detect variable-length subdimensional motifs in large-scale datasets.
Dataset | Size | Enumeration Range | CHIME | Random Projection
Random Walk | | 8291 | 0.001 | 1
Traffic Speed | | 4177 | 0.0014 | 1
EEG | | 1674 | 0.004 | 1
Since there is no comparable approximate variable-length subdimensional motif discovery approach that can solve the problem at the tested scale, we compare with the state-of-the-art variable-length motif discovery solution introduced by Nunthanid et al. [22], which conducts a brute-force enumeration process to enumerate motifs of different lengths via a fixed-length motif discovery approach. We demonstrate that the scale of the problems handled by CHIME is too large for an exact solution to be obtained.
V-A Detecting Planted Motifs in Random Walk Time Series
We first tested CHIME in a planted motif experiment [7][3][2] to demonstrate its ability to detect subdimensional motifs with high accuracy when the minimum length is much shorter than the actual motif length.
We planted a subdimensional motif with 10 instances into a twenty-dimensional random walk time series of length 1 million points, at random positions and in random dimensions. The shapes of the motifs are generated by a parametric function with random parameters. We added 5% random noise to every instance of the motifs. The mean and variance are also randomly generated. CHIME is expected to find at least a pair of non-overlapping subsequences that highly overlap with the actual planted instances. Similar to previous work [3][7], we evaluate the performance by the overlapping rate with the actual planted intervals and relevant dimensions. We also evaluate the performance via an overall overlap score computed as the geometric average of both metrics. A high overall overlap score indicates that the algorithm found a motif that highly overlaps with the planted motif in most of the relevant dimensions.
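To make the overall overlap score concrete, the sketch below combines a length overlap rate and a dimension overlap rate through their geometric average. The text does not define how each overlap rate is computed, so the intersection-over-union definitions used here are assumptions, as are the function names and the example numbers.

```python
import math

def interval_overlap(found, planted):
    """Intersection over union of two (start, end) intervals (assumed definition)."""
    inter = max(0, min(found[1], planted[1]) - max(found[0], planted[0]))
    union = max(found[1], planted[1]) - min(found[0], planted[0])
    return inter / union

def dimension_overlap(found_dims, planted_dims):
    """Intersection over union of two sets of dimensions (assumed definition)."""
    found_dims, planted_dims = set(found_dims), set(planted_dims)
    return len(found_dims & planted_dims) / len(found_dims | planted_dims)

def overall_score(length_rate, dim_rate):
    """Geometric average of the two overlap rates."""
    return math.sqrt(length_rate * dim_rate)

length_rate = interval_overlap((1000, 5400), (1100, 5600))
dim_rate = dimension_overlap([2, 5, 7, 11], [2, 5, 7, 11, 13])
score = overall_score(length_rate, dim_rate)
```

Because the score is a geometric average, a detection that covers the planted interval well but misses most relevant dimensions (or the reverse) receives a low overall score.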
V-A1 Planted Motifs of Different Lengths
We first tested CHIME with planted motifs of lengths 3000, 4500, and 6000. The number of relevant dimensions for the planted motif is set to 5. We repeat the experiment 10 times for each motif length. The boxplot of the overall overlapping rate for each motif length is shown in Fig. 8(a). CHIME consistently achieves overall overlapping rates above 0.8 for all three motif lengths. According to Fig. 8(b)-(c), the algorithm also maintains an overlapping rate of over 0.8 in both length and dimension.
V-A2 Planted Motifs of Different Number of Relevant Dimensions
In this experiment, we tested CHIME with planted motifs for which the number of relevant dimensions equals 3, 5, 10, and 15. The motif length is set to 4500. Similar to the previous experiment, we repeated the experiment 10 times for each dimension setting, and the results are shown in Fig. 8(d)-(f). We observe that the median overall overlapping score stays at approximately 0.9. While the length and dimension overlapping rates decrease as the number of relevant dimensions increases, the medians of both metrics are still over 0.8. The result indicates that even if motifs contain a large number of relevant dimensions, CHIME can still find most of them.
V-B Scalability
In this subsection, we conduct experiments to evaluate scalability over the length and the dimensionality of the multivariate time series. Since no approximate variable-length motif discovery algorithm can handle the size of the data tested in this experiment, we report the execution time of the state-of-the-art fixed-length multidimensional motif discovery algorithm [9], with STOMP [23] as the base approach. The code is provided by the authors and written in C. In test cases where the algorithm takes more than 24 hours to complete, we estimate the execution time from the first 1000 iterations (the estimation approach used in previous work [23]). We also report the estimated brute-force motif enumeration time when using the framework described in [22], the classical approach to variable-length motif discovery. The estimated time is computed as the fixed-length motif discovery time multiplied by the enumeration range, since STOMP’s execution time is invariant to motif length [23].

| Time Series Length | 200K | 400K | 600K | 800K | 1 million |
| --- | --- | --- | --- | --- | --- |
| Enumeration Range | 2724 | 4499 | 5915 | 7319 | 8290 |
| Dimension Range | 5 | 6 | 6 | 6 | 6 |
| CHIME | 3.93 min | 10.03 min | 15.1 min | 22.5 min | 28.05 min |
| Postprocessing | 54 sec | 3.4 min | 11 min | 19.5 min | 31 min |
| Fixed-length Motif Discovery | 3.6 hr | 11.6 hr | 1.04 days | 2.08 days | 12 days |
| Estimated Brute Force Time | 1.19 yr | 5.95 yr | 16.8 yr | 41.68 yr | 272 yr |
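The two estimates described above are simple arithmetic, sketched below. The brute-force figure is one fixed-length run per candidate length; the long-run figure is a linear extrapolation from a timed prefix of iterations. The function names, the assumption that every iteration costs about the same, and the 365-day year used for unit conversion are illustrative choices, not details taken from the experiments.

```python
def estimate_brute_force_days(fixed_length_days, enumeration_range):
    """Time to run a fixed-length search once for every candidate length.

    Valid only if the fixed-length run time does not depend on motif length.
    """
    return fixed_length_days * enumeration_range


def extrapolate_total_seconds(prefix_seconds, prefix_iterations, total_iterations):
    """Linear extrapolation of a full run from its first few iterations.

    Assumes (for this sketch) that each iteration costs about the same.
    """
    return prefix_seconds * total_iterations / prefix_iterations


DAYS_PER_YEAR = 365  # assumed conversion factor

# Values from the 1-million-point column: 12 days per run, 8290 lengths.
brute_force_days = estimate_brute_force_days(12, 8290)
brute_force_years = brute_force_days / DAYS_PER_YEAR
print(f"{brute_force_days} days, about {brute_force_years:.1f} years")

# Hypothetical timing: 1000 iterations took 50 s out of 20000 iterations.
print(extrapolate_total_seconds(50.0, 1000, 20000), "seconds estimated")
```

For the 1-million-point column this lands close to the 272 yr reported in the table.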
VB1 Scalability over Time Series Length
We tested the scalability of CHIME on a random walk time series of fifty dimensions and one million points. The growth of the execution time, the enumeration length range, and the dimension range as the length increases is shown in Table 2. The enumeration length range and dimension range are measured by the number of distinct motif lengths and dimensions detected after removing all false positive instances based on the motif threshold function. According to the table, the execution time of CHIME grows much more slowly than that of the tested state-of-the-art approach. In the largest case, the algorithm takes 28 min to complete, followed by another 31 min of pairwise distance comparisons, whereas the estimated execution time for the fixed-length motif discovery approach is 12 days. The enumeration length range and dimension range also grow with the length of the time series; in the largest case, the enumeration range covers 8290 different lengths. Given the estimated execution time of the brute-force approach, the length range enumerated by CHIME is too large for the state-of-the-art algorithm to produce an exact solution. In contrast, CHIME provides an efficient way to detect approximate variable-length motifs in data of this scale.
| Dimension | 25 | 50 | 100 | 200 |
| --- | --- | --- | --- | --- |
| CHIME (Total Time) | 1.1 min | 3.4 min | 10.5 min | 33.5 min |
| Enumeration Range | 2058 | 2252 | 2334 | 2419 |
| Fixed-length Motif Discovery Approach | 1.72 hr | 3.45 hr | 6.9 hr | 13.8 hr |
| Estimated Brute Force Time | 58.9 days | 129.49 days | 268 days | 1.52 yr |
VB2 Scalability Over Number of Dimensions
We then tested the scalability of CHIME on a 200-dimensional random walk time series of length 160,000. The growth of the execution time as the number of dimensions increases is shown in Table 3. The execution time of CHIME grows faster than that of the fixed-length motif discovery approach. However, the estimated brute-force execution time is still infeasible due to the large enumeration range, whereas CHIME takes less than one hour to complete.
VC Parameters Analysis
We tested CHIME with parameters from 4 to 8 and from 5 to 15 on a 50-dimensional random walk time series of length 300,000. The enumeration length range, dimension range, and execution time for all parameter combinations are shown in Fig. 9(a)-(c), respectively. All three values increase as and decrease. This is because the number of distinct SAX words increases as and increase, which reduces the chance that SAX words match. With larger parameter values, CHIME can therefore process the time series faster at the cost of a reduced enumeration length (dimension) range. The user can set both parameters and to balance search ability against execution time.
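To make the effect of the two parameters concrete, the sketch below counts the distinct SAX words available for a given alphabet size and word length, and the chance that two words coincide if every word were equally likely. It assumes that the two tuned parameters are the SAX alphabet size and the SAX word length, and that words are uniformly distributed and independent; neither assumption is confirmed by the experiment description, and the function names are hypothetical.

```python
from fractions import Fraction


def sax_word_count(alphabet_size, word_length):
    """Number of distinct SAX words: one of `alphabet_size` symbols per position."""
    return alphabet_size ** word_length


def random_match_probability(alphabet_size, word_length):
    """Chance that two words match under a uniform, independent model (assumed)."""
    return Fraction(1, sax_word_count(alphabet_size, word_length))


# Parameter pairs chosen only for illustration.
for alphabet_size, word_length in [(4, 5), (6, 10), (8, 15)]:
    count = sax_word_count(alphabet_size, word_length)
    prob = float(random_match_probability(alphabet_size, word_length))
    print(f"alphabet={alphabet_size} length={word_length} "
          f"words={count} match_probability={prob:.3e}")
```

Real SAX words are far from uniformly distributed, so the absolute probabilities are not meaningful; the sketch only shows the direction of the trend, namely that the word space grows quickly and matches become rarer as either parameter increases.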
VD Case Studies
In this section, we show that CHIME can find high-quality motifs in several real-world, large-scale multivariate time series.
VD1 PAMAP2 Physical Activity Monitoring Time Series
We first tested CHIME on the PAMAP2 Physical Activity Monitoring dataset [24]. We used all available high-resolution acceleration signals recorded from the hand, chest, and ankle to generate a 9-dimensional time series of 1.3 million points. We tested CHIME with the minimum motif length set to 3 sec (300 points).
Two examples of detected subdimensional motifs are shown in Fig. 10(a)(b). The lengths of Motif A and Motif B are 12.5 sec and 8 sec, respectively. Motif A consists of two different signals (the y and z axes of acceleration), both recorded from hand motion. According to the labels, the two occurrences of Motif A correspond to ironing, an activity that relies mostly on the hands. Motif B consists of all 3 signals recorded from the chest. According to the annotations, both instances coincide with the subjects walking, a periodic activity. Both motifs help explain the patterns that occur in different types of activities, and such information can provide useful insights for behavioral studies. Since the two motifs differ significantly in length (8 sec vs. 12.5 sec), repeatedly executing a fixed-length subdimensional motif discovery approach [9] to enumerate longer motifs would be very time consuming given the size of the data (9 dimensions of 1.3 million points each).
VD2 Electric Power Demand Time Series
Interpreting electric power usage behavior has many potential applications [9]. Recent work shows that motifs can be used to understand activities in this type of data. In this experiment, we apply CHIME to the 1-million-point electric power usage dataset [25]. The multivariate time series consists of the power usage (in watts) of eight appliances: Washing Machine, Dryer, Dishwasher, Computer Site, Television Site, Combination Microwave, Kettle, and Toaster. It was recorded from October 2013 to January 2014. We set the minimum motif length to 200 (20 min). Two examples of subdimensional motifs are shown in Fig. 11.
The first motif, of length 2000, is a subdimensional motif consisting of two appliances: the power usage time series collected from the computer site and the television site. The motif represents a usage pattern in which the user turns on the computer first and then turns on the TV. CHIME successfully captures this repeating pattern even though the minimum motif enumeration length is much smaller than the length of the detected pattern. CHIME also discovers a second motif, of length 300, consisting of two appliances: Kettle and Toaster. Both appliances’ power usage patterns are much shorter and last only several minutes. The two motifs have significantly different lengths and represent different, meaningful power usage patterns. Since the time series is large (8 × 1 million in size), repeatedly running a fixed-length motif discovery algorithm such as [9] would be very time consuming.
VD3 Interpretable Time Series Classification Via Motif
One popular application of motif discovery is learning interpretable time series classification models using motifs or shapelets [26][27]. Previous work [26][2] has demonstrated that the motifs that distinguish between different classes often have different lengths.
In this experiment, we tested our algorithm on the EigenWorms dataset, which collects the trajectories of approximately 400 worms. Each trajectory consists of approximately 18000 sample points, and each record contains 6 dimensions representing 6 different worm movement features. We ran CHIME on the data with the minimum motif length set to 300 sample points. Similar to previous work on univariate time series classification [26], after identifying the motifs, we transformed each original time series into a distance feature vector by computing the closest match distance between the time series and each of the detected subdimensional motifs. A forward feature selection process is then applied to select the features that achieve the best accuracy with a simple decision tree classifier. We find that by using the three motifs shown in Figure 12, the decision tree already achieves approximately 70% accuracy. By further increasing the number of motifs, we find that six motifs suffice to achieve 77% accuracy on the dataset, whereas a 1-Nearest-Neighbor classifier with Dynamic Time Warping achieves only 60% accuracy according to Bagnall et al. [28].
VI Conclusion
We introduce a new algorithm, CHIME, to detect approximate subdimensional motifs of different lengths in multivariate time series. CHIME can handle large multivariate time series that state-of-the-art exact algorithms cannot handle efficiently. We show that CHIME can detect subdimensional motifs successfully even when the motif length is considerably larger than the minimum length. In the case studies, we demonstrate that the motifs found by CHIME are meaningful and can potentially have a significant impact in various applications.
References
 [1] Y. Li, J. Lin, and T. Oates, “Visualizing variable-length time series motifs,” in Proceedings of the 2012 SIAM international conference on data mining. SIAM, 2012, pp. 895–906.
 [2] A. Mueen, “Enumeration of time series motifs of all lengths,” in 13th International Conference on Data Mining (ICDM), 2013. IEEE, 2013, pp. 547–556.
 [3] Y. Gao and J. Lin, “Efficient discovery of variable-length time series motifs with large length range in million scale time series,” in 2017 IEEE 17th International Conference on Data Mining (ICDM), 2017.
 [4] Y. Gao, Q. Li, X. Li, J. Lin, and H. Rangwala, “Trajviz: a tool for visualizing patterns and anomalies in trajectory,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2017, pp. 428–431.
 [5] X. Wang, J. Lin, N. Patel, and M. Braun, “A self-learning and online algorithm for time series anomaly detection, with application in cpu manufacturing,” in Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 2016, pp. 1823–1832.
 [6] A. Balasubramanian, J. Wang, and B. Prabhakaran, “Discovering multidimensional motifs in physiological signals for personalized healthcare,” IEEE journal of selected topics in signal processing, vol. 10, no. 5, pp. 832–841, 2016.
 [7] Y. Gao and J. Lin, “Exploring variable-length time series motifs in one hundred million length scale,” Data Mining and Knowledge Discovery, May 2018.
 [8] D. Minnen, C. Isbell, I. Essa, and T. Starner, “Detecting subdimensional motifs: An efficient algorithm for generalized multivariate pattern discovery,” in Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on. IEEE, 2007, pp. 601–606.
 [9] C.-C. M. Yeh, N. Kavantzas, and E. Keogh, “Matrix profile vi: Meaningful multidimensional motif discovery,” in Data Mining (ICDM), 2017 IEEE International Conference on. IEEE, 2017, pp. 565–574.
 [10] D. Minnen, T. Starner, I. Essa, and C. Isbell, “Discovering characteristic actions from on-body sensor data,” in Wearable computers, 2006 10th IEEE international symposium on. IEEE, 2006, pp. 11–18.
 [11] “pems.dot.ca.gov.”
 [12] M. Linardi, Y. Zhu, T. Palpanas, and E. Keogh, “Matrix profile x: Valmod - scalable discovery of variable-length motifs in data series,” in SIGMOD. ACM, 2018.
 [13] Y. Gao, J. Lin, and H. Rangwala, “Iterative grammar-based framework for discovering variable-length time series motifs,” in ICMLA, 2016 15th IEEE International Conference on. IEEE, 2016, pp. 7–12.
 [14] Y. Tanaka, K. Iwamoto, and K. Uehara, “Discovery of time-series motif from multi-dimensional data based on mdl principle,” Machine Learning, vol. 58, no. 2, pp. 269–300, 2005.
 [15] E. Berlin and K. Van Laerhoven, “Detecting leisure activities with dense motif discovery,” in Proceedings of the 2012 ACM Conference on Ubiquitous Computing. ACM, 2012, pp. 250–259.
 [16] L. Wang et al., “A tree-construction search approach for multivariate time series motifs discovery,” Pattern Recognition Letters, vol. 31, no. 9, pp. 869–875, 2010.
 [17] A. Mueen, E. J. Keogh, Q. Zhu, S. Cash, and M. B. Westover, “Exact discovery of time series motifs.” in SDM. SIAM, 2009, pp. 473–484.
 [18] J. Lin, E. Keogh, L. Wei, and S. Lonardi, “Experiencing sax: a novel symbolic representation of time series,” Data Mining and knowledge discovery, vol. 15, no. 2, pp. 107–144, 2007.
 [19] B. Chiu et al., “Probabilistic discovery of time series motifs,” in Proceedings of the 9th ACM SIGKDD international conference on Knowledge discovery and data mining, 2003, pp. 493–498.
 [20] A. Balasubramanian and B. Prabhakaran, “Flexible exploration and visualization of motifs in biomedical sensor data,” in Proc. of Workshop on Data Mining for Healthcare, in conjunction with ACM KDD, 2013.
 [21] C. G. Nevill-Manning and I. H. Witten, “Identifying hierarchical structure in sequences: A linear-time algorithm,” J. Artif. Intell. Res.(JAIR), vol. 7, pp. 67–82, 1997.
 [22] P. Nunthanid, V. Niennattrakul, and C. A. Ratanamahatana, “Discovery of variable length time series motif,” in Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2011 8th International Conference on. IEEE, 2011, pp. 472–475.
 [23] Y. Zhu, Z. Zimmerman, N. S. Senobari, C.-C. M. Yeh, G. Funning, A. Mueen, P. Brisk, and E. Keogh, “Matrix profile ii: Exploiting a novel algorithm and gpus to break the one hundred million barrier for time series motifs and joins,” in Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, 2016, pp. 739–748.
 [24] A. Reiss and D. Stricker, “Introducing a new benchmarked dataset for activity monitoring,” in Wearable Computers (ISWC), 2012 16th International Symposium on. IEEE, 2012, pp. 108–109.
 [25] D. Murray, J. Liao, L. Stankovic, V. Stankovic, R. Hauxwell-Baldwin, C. Wilson, M. Coleman, T. Kane, and S. Firth, “A data management platform for personalised real-time energy feedback,” Aug. 2015.
 [26] X. Wang, J. Lin, P. Senin, T. Oates, S. Gandhi, A. P. Boedihardjo, C. Chen, and S. Frankenstein, “Rpm: Representative pattern mining for efficient time series classification,” pp. 185–196, 2016.
 [27] Y. Gao and J. Lin, “Hime: discovering variable-length motifs in large-scale time series,” Knowledge and Information Systems, pp. 1–30, 2018.
 [28] A. Bagnall, H. A. Dau, J. Lines, M. Flynn, J. Large, A. Bostrom, P. Southam, and E. Keogh, “The uea multivariate time series classification archive, 2018,” arXiv preprint arXiv:1811.00075, 2018.