1 Introduction
Over the last two decades, time series classification and clustering have received considerable interest in many applications such as genomic research (Leslie et al., 2002), image alignment (Peng et al., 2015b, a), speech recognition (Cuturi et al., 2007; Shimodaira et al., 2001), and motion detection (Li and Prakash, 2011). One of the main challenges in time series data stems from the fact that there are no explicit features in sequences (Xing et al., 2010). Therefore, a number of feature representation methods have been proposed recently, among which approaches deriving features from phase-dependent intervals (Deng et al., 2013; Baydogan et al., 2013), phase-independent shapelets (Ye and Keogh, 2009; Rakthanmanon and Keogh, 2013), and dictionary-based bags of patterns (Senin and Malinchik, 2013; Schäfer, 2015) have gained much popularity due to their highly competitive performance (Bagnall et al., 2016). However, since these approaches only consider local patterns rather than global properties, the effectiveness of the features depends heavily on the underlying characteristics of the sequences, which may vary significantly across applications. More importantly, these approaches are typically not a good first choice for large-scale time series due to their quadratic complexity in the number and/or length of the time series.
Another family of research defines a distance function to measure the similarity between a pair of time series. Although Euclidean distance is a widely used option and has been shown to be competitive with other more complex similarity measures (Wang et al., 2013), various elastic distance measures designed to address the temporal dynamics and time shifts are more appropriate (Xing et al., 2010; Kate, 2016). Among them, dynamic time warping (DTW) (Berndt and Clifford, 1994)
is the standard elastic distance measure for time series. Interestingly, a 1NN classifier with DTW has been demonstrated to be the gold-standard benchmark, and has proved difficult to beat consistently
(Wang et al., 2013; Bagnall et al., 2016). Recently, a thread of research has attempted to directly use the pairwise DTW distances as features (Hayashi et al., 2005; Gudmundsson et al., 2008; Kate, 2016; Lei et al., 2017). However, the majority of these approaches have quadratic complexity in both the number and length of time series, in terms of both computation and memory requirements.
Despite the successes of various explicit feature designs, kernel methods hold great promise for learning nonlinear models by implicitly transforming a simple representation into a high-dimensional feature space (Rahimi and Recht, 2007; Chen et al., 2016; Wu et al., 2016; Yen et al., 2014). The main obstacle to applying kernel methods to time series stems from two distinct characteristics of time series: (a) variable length; and (b) dynamic time scaling and shifts. Since elastic distance measures such as DTW take these two issues into account, there have been several attempts to apply DTW directly as a similarity measure in a kernel-based classification model (Shimodaira et al., 2001; Gudmundsson et al., 2008). Unfortunately, the DTW distance does not correspond to a valid positive-definite (p.d.) kernel, and thus its direct use leads to an indefinite kernel matrix that neither corresponds to a loss minimization problem nor yields a convex optimization problem (Bahlmann et al., 2002; Cuturi et al., 2007). To overcome these difficulties, a family of global alignment kernels has been proposed that takes a soft minimum over all possible alignments in DTW to give a p.d. kernel (Cuturi et al., 2007; Cuturi, 2011; Marteau and Gibet, 2015). However, the effectiveness of global alignment kernels is impaired by the diagonal dominance of the resulting kernel matrix. Moreover, the quadratic complexity in both the number and length of time series makes them hard to scale.
In this paper, inspired by the latest advancements in learning kernels from distances (Wu et al., 2018), we study Random Warping Series (RWS), a generic framework for generating vector representations of time series, in which we construct a family of p.d. kernels from an explicit feature map given by the DTW between the original time series and a distribution of random series. To admit efficient computation of the kernel, we give a random-features approximation that uniformly converges to the proposed kernel using a finite number of random series drawn from the distribution. The RWS technique is fully parallelizable, and highly extensible in the sense that the building-block DTW can be replaced by recently proposed elastic distance measures such as CID (Batista et al., 2014) and DTDC (Górecki and Łuczak, 2014). With a small number of random series, RWS substantially reduces the computational complexity of existing DTW-based techniques from quadratic to linear in both the number and the length of time series, with a corresponding reduction in memory consumption. We also extend the existing analysis of random features to handle time series of unbounded length, showing that $O(1/\epsilon^2)$ random series (up to logarithmic factors) suffice for uniform convergence to $\epsilon$ precision of the exact kernel. We evaluate RWS on 16 real-world datasets, on which it consistently outperforms or matches state-of-the-art baselines in terms of both testing accuracy and runtime. In particular, RWS often achieves orders-of-magnitude speedup over other methods at the same accuracy.
2 DTW and Global Alignment Kernels
We first introduce the widely used techniques of DTW and nearest-neighbor DTW (1NN-DTW), and then describe existing global alignment kernels for time series and their disadvantages.
Time Series Alignment and 1NN-DTW. Let $\mathcal{X}$ be the domain of input time series and $X = \{x_i\}_{i=1}^N$ be a set of $N$ time series, where each time series $x_i$ has length $\ell_i$ and takes numeric values in $\mathbb{R}$. A special challenge in time series lies in the fact that series may have different lengths, and a signal may be generated with time shifts and at different scales, yet exhibit a similar pattern. To take these factors into account, an alignment $\alpha$ (also called a warping function) is often introduced to provide a better distance/similarity measure between two time series $x$ and $y$ of lengths $\ell_x$ and $\ell_y$, respectively. Specifically, an alignment of length $p$ between $x$ and $y$ is a pair of increasing vectors $(\alpha^1, \alpha^2)$ such that $1 = \alpha^1(1) \le \cdots \le \alpha^1(p) = \ell_x$ and $1 = \alpha^2(1) \le \cdots \le \alpha^2(p) = \ell_y$, with unitary increments and no simultaneous repetitions. The set of all alignments between $x$ and $y$ is denoted $\mathcal{A}(x, y)$. In the literature of DTW (Berndt and Clifford, 1994), the DTW distance between $x$ and $y$ is defined as follows in its simplest form:
$$\mathrm{DTW}(x, y) = \min_{\alpha \in \mathcal{A}(x, y)} \Delta_\alpha(x, y). \qquad (1)$$
Here $\Delta_\alpha(x, y) = \sum_{t=1}^{p} \tau\big(x_{\alpha^1(t)}, y_{\alpha^2(t)}\big)$ is a dissimilarity measure between $x$ and $y$ under the alignment $\alpha$, built from a ground cost $\tau$. Typically, Dynamic Programming (DP) is employed to find the optimal alignment and thereby compute the DTW distance. The ground cost $\tau$ could be any commonly used distance, such as the squared Euclidean distance. To accelerate the computation and improve performance, a Sakoe-Chiba band is often used to constrain the search window size of DTW (Sakoe and Chiba, 1978; Rakthanmanon et al., 2012).
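The DP recurrence and the Sakoe-Chiba band can be sketched as follows (a minimal Python illustration of the standard algorithm, not the optimized routine used in the experiments; function and parameter names are ours):

```python
import numpy as np

def dtw(x, y, window=None):
    """DTW distance between two 1-D series via dynamic programming.

    The ground cost is the squared Euclidean distance; `window` is an
    optional Sakoe-Chiba band half-width (None = unconstrained).
    """
    n, m = len(x), len(y)
    w = max(window, abs(n - m)) if window is not None else max(n, m)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # extend the cheapest of the three admissible predecessor paths
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

For example, `dtw([1, 2, 3], [1, 2, 2, 3])` is 0: the alignment may repeat an element of the shorter series to absorb the extra point, which is exactly the elasticity that plain Euclidean distance lacks.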
DTW has been widely used for time series classification in combination with the 1NN algorithm, and this combination has been shown to be exceptionally difficult to beat (Wang et al., 2013; Bagnall et al., 2016). However, there are two disadvantages of 1NN-DTW. First, the method incurs a high computational cost: computing DTW between all pairs of $N$ time series requires $O(N^2)$ evaluations, where each unconstrained evaluation of DTW takes $O(\ell^2)$ computation for series of length $\ell$. Second, nearest-neighbor methods often suffer from high variance. For example, if a class label is determined by a small portion of the time series, nearest-neighbor identification based on similarity with the whole series will be ineffective due to noise and irrelevant information.
Existing Global Alignment Kernels. To take advantage of DTW in other prediction methods based on Empirical Risk Minimization
(ERM) such as SVM and Logistic Regression, a thread of research has been trying to derive a
valid p.d. kernel that resembles the DTW distance. A framework for designing such a kernel is the time-series global-alignment kernel proposed in (Cuturi et al., 2007) and further explored in (Cuturi, 2011). The kernel replaces the minimum in (1) with a soft minimum that sums over all possible DTW alignments between two series $x$, $y$:
$$k(x, y) = \sum_{\alpha \in \mathcal{A}(x, y)} e^{-\Delta_\alpha(x, y)} = \sum_{\alpha \in \mathcal{A}(x, y)} \prod_{t=1}^{p} \kappa\big(x_{\alpha^1(t)}, y_{\alpha^2(t)}\big) \qquad (2)$$
where $\kappa$ is a local similarity function induced from the divergence $\tau$ as $\kappa = e^{-\tau}$. The function (2) is a p.d. kernel when $\kappa$ satisfies certain conditions (Cuturi et al., 2007). However, it is known that a soft minimum can be orders of magnitude larger than the minimum when summing over exponentially many terms, which results in a serious diagonal dominance problem for the kernel (2). In other words, the kernel value $k(x, x)$ between a series and itself is orders of magnitude larger than the value $k(x, y)$ between different series. Thus in practice one must take the log of the kernel (2), even though this operation is known to break the p.d. property (Cuturi, 2011). In addition, the evaluation of kernel (2) requires running DP over all pairs of samples and thus incurs a high complexity of $O(N^2 \ell^2)$.
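The soft-minimum recurrence, and the diagonal dominance it causes, can be illustrated with a small sketch (our own Python illustration with a Gaussian local kernel; not code from the cited papers):

```python
import numpy as np

def ga_kernel(x, y, sigma=1.0):
    """Global alignment kernel: sums over all DTW alignments via the DP
    recurrence M[i, j] = kappa(i, j) * (M[i-1, j] + M[i, j-1] + M[i-1, j-1]),
    with a Gaussian local similarity kappa = exp(-tau)."""
    n, m = len(x), len(y)
    M = np.zeros((n + 1, m + 1))
    M[0, 0] = 1.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            kappa = np.exp(-((x[i - 1] - y[j - 1]) ** 2) / (2 * sigma ** 2))
            M[i, j] = kappa * (M[i - 1, j] + M[i, j - 1] + M[i - 1, j - 1])
    return M[n, m]
```

For instance, for two length-8 series that differ by 3 everywhere, `ga_kernel(x, x)` sums a weight of 1 over exponentially many alignments while `ga_kernel(x, y)` sums exponentially small weights, so the diagonal entry dwarfs the off-diagonal one by many orders of magnitude.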
3 Novel Time-Series Kernels via Alignments to Random Series
In this section, we study a new approach to building a family of p.d. kernels for time series based on DTW, inspired by the latest advancements in learning kernels from distances (Wu et al., 2018).
Formally, the kernel is defined by integrating a feature map over a distribution $p(\omega)$ of random time series $\omega \in \Omega$, with each feature $\phi_\omega(x)$ produced by alignments between the original time series $x$ and the random series $\omega$:
$$k(x, y) := \int_{\Omega} p(\omega)\, \phi_\omega(x)\, \phi_\omega(y)\, d\omega. \qquad (3)$$
The kernel (3) enjoys several advantages. First, (3) is a p.d. kernel by its construction.
Proposition 1.
The kernel (3) is positive definite, that is, $\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j\, k(x_i, x_j) \ge 0$ for any $\{x_i\}_{i=1}^n \subseteq \mathcal{X}$ and any $\{c_i\}_{i=1}^n \subseteq \mathbb{R}$.
Proof. Writing $k(x, y) = \int_{\Omega} p(\omega)\, \phi_\omega(x)\, \phi_\omega(y)\, d\omega$ as in (3), we have $\sum_{i,j} c_i c_j\, k(x_i, x_j) = \int_{\Omega} p(\omega) \big( \sum_i c_i\, \phi_\omega(x_i) \big)^2 d\omega \ge 0$. ∎
Secondly, by choosing the feature map
$$\phi_\omega(x) = \mathrm{DTW}(x, \omega), \qquad (4)$$
one can avoid the diagonal dominance problem of the kernel matrix, since the kernel value between two time series $x$ and $y$ depends only on the correlation of $x$ and $y$ under their optimal alignments to the random series $\omega$. It thus avoids the dominance of the diagonal terms caused by the summation over exponentially many alignments. We can interpret a random series $\omega$ of length $D$ as a possible shape of a time series, defined by $D$ segments, each associated with a random number. Figure 1 gives an example of a short random series, which divides a time series into segments and outputs a dissimilarity score as the feature $\phi_\omega(x)$. The third advantage of (3) is its computational efficiency due to a simple random-features approximation. Although the kernel function (3) seems hard to compute, we show that there is a low-dimensional representation $z(x)$ of each series $x$, by which one can efficiently find an approximation within $\epsilon$ precision of the exact kernel (3). This is in contrast to the global-alignment kernel (2), where although one can evaluate the kernel matrix exactly in $O(N^2 \ell^2)$ time, it is unclear how to efficiently find a low-rank approximation.
3.1 Computation of Random Warping Series
Although the kernel (3) does not yield a simple analytic form, it naturally admits a random approximation of the form $k(x, y) \approx \langle z(x), z(y) \rangle$ via a simple Monte Carlo (MC) method. The feature vector $z(x) = \frac{1}{\sqrt{R}}\big[\phi_{\omega_1}(x), \ldots, \phi_{\omega_R}(x)\big]$ is computed using the dissimilarity measure DTW, where $\{\omega_i\}_{i=1}^R$ is a set of $R$ random series of variable length, with each value drawn from a distribution $p(\omega)$. In principle, the dissimilarity could be any elastic distance measure, but without loss of generality we consider DTW since it has proved to be the most successful metric for time series (Wang et al., 2013; Xi et al., 2006).
Algorithm 1 summarizes the procedure to generate feature vectors for raw time series. Several comments are worth making here. First, the distribution $p(\omega)$ plays an important role in capturing the global properties of the original time series. Since we explicitly define a kernel from this distribution, we are free to search for the distribution that best fits the data in the underlying application. In our experiments, we find that a Gaussian distribution is generally applicable to time series from various applications; specifically, its parameter $\sigma$ should be chosen to capture the characteristics of the time series in $X$. Second, as shown in Figure 1, a short random warping series can typically identify local as well as global patterns in raw time series. This suggests that there exist optimal alignments that allow short random series to segment raw time series into discriminative features. In practice, there is no prior information about this optimal alignment, and thus we uniformly sample the length of each random series between 1 and a small maximum length $D$ to give an unbiased estimate; a small $D$ is used in our experiments. An additional benefit is that random series of variable lengths may simultaneously identify multi-scale patterns hidden in the raw time series.
In addition to giving a practical way to approximate the proposed kernel, these random series bring the double benefit of reduced computation and memory consumption. For the family of global alignment kernels (Cuturi et al., 2007; Cuturi, 2011), computing the dense kernel matrix requires $O(N^2)$ evaluations of DTW, each of which usually takes $O(\ell^2)$ computation based on DP, and both the original time series and the resulting $N \times N$ kernel matrix must be stored. In contrast, our RWS approximation requires computation and storage that are only linear in both $N$ and $\ell$, given that $R$ is a small constant. This dramatic reduction in both computation and memory empowers much more efficient training and testing when combined with ERM classifiers such as SVM.
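The feature-generation procedure (Algorithm 1 itself is not reproduced here) can be sketched roughly as follows, assuming the feature for each random series is simply the DTW dissimilarity of the input to that series, scaled by 1/sqrt(R); parameter names and defaults are ours:

```python
import numpy as np

def dtw(x, y):
    # unconstrained DTW with squared-Euclidean ground cost (plain DP)
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def rws_features(series_list, R=32, d_max=30, sigma=1.0, seed=0):
    """Map each time series to an R-dimensional vector of DTW dissimilarities
    to R random series whose lengths are sampled uniformly in [1, d_max] and
    whose values are i.i.d. Gaussian with standard deviation sigma."""
    rng = np.random.default_rng(seed)
    omegas = [rng.normal(0.0, sigma, rng.integers(1, d_max + 1))
              for _ in range(R)]
    Z = np.array([[dtw(x, w) for w in omegas] for x in series_list])
    return Z / np.sqrt(R)  # so that <z(x), z(y)> averages over random series
```

Note that, unlike the quadratic-cost kernel matrix of global alignment kernels, this produces an explicit $N \times R$ feature matrix in time linear in the number of series.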
3.2 Convergence of Random Warping Series
In the following, we extend the standard convergence analysis of Random Features (RF) (Rahimi and Recht, 2007) from a kernel between two fixed-dimensional vectors to a kernel function measuring similarity between two time series of variable lengths. Note that (Wu et al., 2018) proposed a general analysis for any distance-based kernel through the covering number w.r.t. the distance, which, however, does not apply directly here since DTW is not a distance metric.
Let $A^1$ and $A^2$ be $p \times \ell_x$ and $p \times D$ binary matrices that map each element of $x$ and $\omega$, respectively, to an element of a DTW alignment path. The feature map of RWS can be expressed as
$$\phi_\omega(x) = \min_{(A^1, A^2) \in \mathcal{A}(x, \omega)} \Delta\big(A^1 x,\; A^2 \omega\big). \qquad (5)$$
Note that in practice one can often convert a similarity function into a dissimilarity function to fit into the above setting. The goal is to approximate the kernel $k(x, y)$ via the sampling approximation $\hat{k}(x, y) = \langle z(x), z(y) \rangle$ with $\{\omega_i\}_{i=1}^R$ drawn i.i.d. from $p(\omega)$. Note we have $\mathbb{E}[\hat{k}(x, y)] = k(x, y)$. The question is how many samples $R$ are needed to guarantee
$$\sup_{x, y \in \mathcal{X}} \big|\hat{k}(x, y) - k(x, y)\big| \le \epsilon. \qquad (6)$$
In the standard analysis of RF, the required sample size is $\Omega\big(\frac{d}{\epsilon^2} \log \frac{\mathrm{diam}(\mathcal{X})}{\epsilon}\big)$, where $\mathcal{X}$ comprises all $d$-dimensional vectors of diameter $\mathrm{diam}(\mathcal{X})$. The standard analysis does not apply to our case for two reasons: (a) our domain contains time series of different lengths, and (b) our kernel involves a minimization (5) over all possible DTW alignments, and thus is not shift-invariant as required in (Rahimi and Recht, 2007). To obtain a uniform convergence bound that can potentially handle time series of unbounded length, we introduce the notion of minimum shape-preserving length.
Definition 1.
The Minimum Shape-Preserving Length (MSPL) of tolerance $\epsilon$, denoted $D_\epsilon$, is the smallest $D$ such that
$$\max_{x \in \mathcal{X}} \; \min_{z \in \mathbb{R}^{D}} \; \min_{(A^1, A^2) \in \mathcal{A}(x, z)} \big\| A^1 x - A^2 z \big\|_2 \le \epsilon, \qquad (7)$$
where $\mathcal{A}(x, z)$ is the set of possible alignments between $x$ and $z$ considered by DTW, and $A^1, A^2$ are the corresponding 0-1 alignment matrices (reducing to an identity matrix when no element is repeated).
In other words, $D_\epsilon$ defines the smallest length one can compress a time series to with approximation error no more than $\epsilon$, measured by DTW in the $\ell_2$ distance. The following then gives the number of RWS samples required to guarantee a uniform convergence over all possible inputs $x, y \in \mathcal{X}$.
Theorem 1.
Assume the ground metric $\tau$ is bounded, $0 \le \tau \le B$, and Lipschitz-continuous w.r.t. its arguments with parameter $L$, where $B, L > 0$. The RWS approximation with $R$ random series satisfies
$$\Pr\Big[\sup_{x, y \in \mathcal{X}} \big|\hat{k}(x, y) - k(x, y)\big| \ge \epsilon\Big] \le \kappa_1 \Big(\frac{R_{\mathcal{X}}}{\epsilon}\Big)^{2 D_{\epsilon}} \exp\Big(-\frac{R \epsilon^2}{\kappa_2}\Big) \qquad (8)$$
for constants $\kappa_1, \kappa_2$ depending on $B$ and $L$, where $R_{\mathcal{X}}$ is the radius of the time series domain in the $\ell_2$ norm and $D_\epsilon$ is the MSPL with precision $\epsilon$.
Proof Sketch.
Let $s_i(x, y) = \phi_{\omega_i}(x)\,\phi_{\omega_i}(y)$. We have $\mathbb{E}[s_i(x, y)] = k(x, y)$ and $|s_i(x, y)| \le B^2$ by the boundedness of the function $\phi$. Then by the Hoeffding inequality, we have
$$\Pr\big[\,|\hat{k}(x, y) - k(x, y)| \ge \epsilon\,\big] \le 2 \exp\Big(-\frac{R \epsilon^2}{2 B^4}\Big) \qquad (9)$$
for a given pair $(x, y)$. To get a uniform bound that holds for all pairs of series $x, y \in \mathcal{X}$, consider the pair of compressed series $\tilde{x}, \tilde{y}$ of minimum shape-preserving length under precision $\epsilon$. We have an $r$-net with $T = (2 R_{\mathcal{X}} / r)^{D_\epsilon}$ anchors that covers the $D_\epsilon$-dimensional ball of radius $R_{\mathcal{X}}$. Then through a union bound and (9), we have
$$\Pr\Big[\sup_{\tilde{x}, \tilde{y}} \big|\hat{k}(\tilde{x}, \tilde{y}) - k(\tilde{x}, \tilde{y})\big| \ge \epsilon\Big] \le 2\, T^2 \exp\Big(-\frac{R \epsilon^2}{2 B^4}\Big). \qquad (10)$$
Let $\mathcal{B}$ be the $D_\epsilon$-dimensional ball. Given any time series $x$ of arbitrary length, we can first find $\tilde{x} \in \mathcal{B}$ with $\mathrm{DTW}(x, \tilde{x}) \le \epsilon$, and then find an anchor $x'$ of the net such that $\|\tilde{x} - x'\|_2 \le r$. By the result of Lemma 1 (see Appendix 6.1), the closeness of $x'$ to $x$ implies the closeness of $\hat{k}$ and $k$ evaluated at $(x', y')$ to their values at $(x, y)$, which leads to
$$\big|\hat{k}(x, y) - \hat{k}(x', y')\big| \le L(\epsilon + r), \qquad \big|k(x, y) - k(x', y')\big| \le L(\epsilon + r). \qquad (11)$$
Combining (10) and (11), we have
$$\Pr\Big[\sup_{x, y \in \mathcal{X}} \big|\hat{k}(x, y) - k(x, y)\big| \ge \epsilon\Big] \le 2\Big(\frac{2 R_{\mathcal{X}}}{r}\Big)^{2 D_\epsilon} \exp\Big(-\frac{R \epsilon^2}{8 B^4}\Big) + \kappa\, r \qquad (12)$$
for a constant $\kappa$ depending on $L$. This is of the form $\kappa_1 r^{-2 D_\epsilon} + \kappa_2 r$. Choosing $r = (\kappa_1/\kappa_2)^{1/(2 D_\epsilon + 1)}$ to balance the two terms in (12), the RHS becomes $2\, \kappa_1^{1/(2 D_\epsilon + 1)} \kappa_2^{2 D_\epsilon/(2 D_\epsilon + 1)}$. This yields the result.
∎
Theorem 1 shows that, to guarantee $\sup_{x, y} |\hat{k}(x, y) - k(x, y)| \le \epsilon$ with probability $1 - \delta$, it suffices to have $R = \Omega\big(\frac{1}{\epsilon^2}\big(D_\epsilon \log \frac{R_{\mathcal{X}}}{\epsilon} + \log \frac{1}{\delta}\big)\big)$ random series. In practice, the constants $B$ and $L$ are not particularly large due to the normalization of the series and of the dissimilarity function $\tau$. The main factor determining the rate of convergence is the shape-preserving length $D_\epsilon$. Note that for problems with time series length bounded by $\ell_{\max}$, we always have $D_\epsilon \le \ell_{\max}$, which means the number of features required would be only of order $\widetilde{O}(\ell_{\max}/\epsilon^2)$.
4 Experiments
We conduct experiments to demonstrate the efficiency and effectiveness of RWS, comparing against 9 baselines on 16 real-world datasets from the widely used UCR time-series classification archive (Chen et al., 2015), as shown in Table 1. We evaluate RWS on datasets of varying number and length of series with two goals: 1) competitive or better accuracy on small problems; 2) matching or outperforming other methods in both performance and runtime on medium- and large-scale tasks. We implement our method in Matlab and use a C MEX function^1 for the computationally expensive DTW component. For the other methods we use the same routine to promote a fair runtime comparison, where the window size of DTW is set similarly to (Lei et al., 2017; Paparrizos and Gravano, 2015). More details about the datasets and parameter settings are in Appendix 6.2.
^1 https://www.mathworks.com/matlabcentral/fileexchange/43156-dynamic-time-warping--dtw
Table 1: The 16 UCR datasets used in the experiments.
Name  #Classes  #Train  #Test  Length
Beef  5  30  30  470 
DPTW  6  400  139  80 
IPD  2  67  1,029  24 
PPOAG  3  400  205  80 
MPOC  2  600  291  80 
POC  2  1,800  858  80 
LKA  3  375  375  720 
IWBS  11  220  1,980  256 
TWOP  4  1,000  4,000  128 
ECG5T  5  500  4,500  140 
CHCO  3  467  3,840  166 
Wafer  2  1,000  6,174  152 
MALLAT  8  55  2,345  1,024 
FordB  2  3636  810  500 
NIFECG  42  1,800  1,965  750 
HO  2  370  1,000  2,709 
Table 2: Comparing feature representations: testing accuracy and runtime.
Classifier  RWS  TSEigen  TSMC
Dataset  Accu  Time  Accu  Time  Accu  Time
Beef  0.733  0.3  0.633  2.1  0.433  0.6 
DPTW  0.79  0.5  0.738  7.1  0.738  1.5 
IPD  0.969  0.3  0.911  8.6  0.80  1.7 
PPOAG  0.868  0.4  0.82  8.9  0.82  1.8 
MPOC  0.711  0.8  0.653  19.3  0.653  2.4 
POC  0.711  2.4  0.686  172.3  0.66  8.2 
LKA  0.792  7.3  0.528  401.5  0.525  39.5 
IWBS  0.619  8.9  0.633  784.6  0.57  31.9 
TWOP  0.999  4.4  0.976  1395  0.946  32.8 
ECG5T  0.933  10.6  0.932  1554  0.918  36.0 
CHCO  0.572  6.3  0.529  1668  0.402  45.7 
Wafer  0.993  9.6  0.89  3475  0.89  59.3 
MALLAT  0.937  33.9  0.898  7982  0.888  282.6 
FordB  0.727  43.5  0.704  10069  0.686  216.3 
NIFECG  0.907  19.8  0.867  10890  0.582  265 
HO  0.843  43.3  0.845  46509  0.82  979.1 
4.1 Effects of $\sigma$, $R$, and $D$ on RWS
Setup. We first perform experiments to investigate the characteristics of the RWS method by varying the kernel parameter $\sigma$, the rank $R$, and the length $D$ of the random series. Due to limited space we show only typical results; see Appendix 6.3 for the complete set.
Effects of $\sigma$. It is well known that the choice of the kernel parameter determines the quality of various kernels. Figure 2 shows that in most cases the training and testing performance curves agree well, in the sense that they consistently increase at the beginning, stabilize around $\sigma = 1$ (which corresponds to the standard normal distribution), and finally decrease. In a few cases like NIFECG, the optimal performance is slightly shifted away from $\sigma = 1$. This observation is favorable, since it suggests that one may easily tune our approach over a small interval around $\sigma = 1$ for good performance.
Effects of $R$. We evaluate the training and testing performance when varying the rank $R$ from 4 to 512 with $\sigma$ and $D$ fixed. Figure 3 shows that the training and testing accuracy generally converge rapidly when increasing $R$ from a very small to a relatively large number, and then slowly saturate to the optimal performance. Empirically, this behavior is most favorable because the performance of RWS is relatively stable even for small $R$. More importantly, it confirms our analysis in Theorem 1 that the RWS approximation converges (rapidly) to the exact kernel.
Effects of $D$. We investigate the effect of the length $D$ of the random series on training and testing performance. As hinted at earlier, a key insight behind the proposed time-series kernel is the assumption that a random series of short length can effectively segment raw time series in a way that captures their patterns. Figure 4 shows that although testing accuracy fluctuates when varying $D$ from 10 to 100, near-peak performance is achieved with small $D$ in most cases.
Table 3: Comparing time-series classification methods: testing accuracy and runtime.
Classifier  RWS(LR)  RWS(SR)  1NN-DTW  1NN-DTW^{opt}  TGAK  DTWF
Dataset  Accu  Time  Accu  Time  Accu  Time  Accu  Time  Accu  Time  Accu  Time
Beef  0.767  0.8  0.733  0.3  0.567  1.1  0.633  0.3  0.633  24.7  0.60  3.7 
DPTW  0.865  4.2  0.80  0.2  0.73  1.4  0.718  0.8  0.738  27.9  0.77  3.0 
IPD  0.965  1.0  0.962  0.4  0.947  55.3  0.962  56.0  0.739  3.7  0.953  0.5 
PPOAG  0.868  0.3  0.859  0.2  0.776  2.0  0.785  1.2  0.854  118.2  0.829  9.7 
MPOC  0.773  6.8  0.708  0.8  0.635  4.4  0.663  2.7  0.627  117.3  0.653  10.2 
POC  0.815  38.2  0.746  4.7  0.721  36.9  0.751  20.1  0.613  2373  0.79  202.7 
LKA  0.84  54.9  0.816  13.6  0.712  97.7  0.837  573.6  0.645  13484  0.80  1220 
IWBS  0.641  132.4  0.619  8.8  0.504  70.9  0.589  36.1  0.126  2413  0.609  260.3 
TWOP  1  16.1  0.999  4.4  1  222.2  1  157.5  0.269  5690  1  481.7 
ECG5T  0.94  9.2  0.934  4.9  0.928  137.8  0.928  70.1  0.927  2822  0.933  278.3 
CHCO  0.777  189.1  0.683  48.1  0.627  160.8  0.627  57.0  0.545  3122  0.666  333.6 
Wafer  0.995  143.6  0.993  9.6  0.986  412.3  0.996  210.1  0.896  11172  0.994  980.5 
MALLAT  0.952  72.8  0.937  33.8  0.937  150.3  0.925  65.5  0.257  11882  0.915  988.4 
FordB  0.793  543.8  0.62  5.6  0.589  1476  0.581  577.6  N/A  N/A  0.83  8402 
NIFECG  0.936  140.2  0.903  20.0  0.845  2699  0.857  1432  N/A  N/A  0.906  32493 
HO  0.871  336.9  0.834  41.9  0.816  4883  0.807  5837  N/A  N/A  0.898  40407 
Table 4: Comparing time-series clustering methods: NMI and runtime.
Clustering  RWS(LR)  RWS(SR)  KMeans-DTW  CLDS  KShape
Dataset  NMI  Time  NMI  Time  NMI  Time  NMI  Time  NMI  Time
Beef  0.29  1.1  0.27  1.0  0.25  377  0.24  61.3  0.22  1.8 
DPTW  0.52  0.6  0.56  0.5  0.55  182  0.55  176.8  0.45  14.9 
PPOAG  0.56  0.5  0.54  0.2  0.44  105.4  0.55  191.1  0.27  40.2 
IWBS  0.43  43.9  0.36  6.3  0.37  5676  0.38  1109  0.43  377.6 
TWOP  0.23  11.2  0.3  4.7  0.12  1960  0.02  1312  0.4  292.1 
ECG5T  0.46  25.7  0.4  7.0  0.48  2539  0.37  1308  0.35  360.7 
MALLAT  0.92  48.2  0.91  25.4  0.72  95218  0.92  2448  0.75  900.4 
NIFECG  0.71  346.1  0.68  43.7  0.63  101473  0.67  3442  0.73  5387 
4.2 Comparing Feature Representations
Baselines and Setup. We compare our approach with two recently developed methods: 1) TSEigen (Hayashi et al., 2005): learns a low-rank feature representation of a similarity matrix computed using the DTW distance, through Singular Value Decomposition (Wu and Stathopoulos, 2015; Wu et al., 2017); 2) TSMC (Lei et al., 2017): a recently proposed similarity-preserving representation for a DTW-based similarity matrix using a matrix-completion approach. We set the same representation size for all methods. We employ the linear SVM implemented in LIBLINEAR (Fan et al., 2008), since it separates the effectiveness of the feature representation from the power of nonlinear learning solvers.
Results. Table 2 clearly demonstrates the significant advantages of our approach over the other representations in terms of both classification accuracy and computational time. Indeed, TSMC improves computational efficiency over TSEigen without a large loss of accuracy, as claimed in (Lei et al., 2017). However, RWS achieves both higher accuracy and faster training and testing than TSMC and TSEigen. The improved accuracy of RWS suggests that a truly p.d. time-series kernel admits better feature representations than those obtained from a similarity or (non-p.d.) kernel matrix. In addition, the improved computational time illustrates the effectiveness of using random series to approximate the exact kernel.
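The pipeline above pairs a feature representation with a linear classifier. As a concrete illustration, the sketch below trains a regularized least-squares one-vs-rest classifier on a feature matrix; this is a cheap stand-in of our own for the LIBLINEAR SVM used in the experiments, not the actual solver:

```python
import numpy as np

def fit_linear_classifier(Z, y, reg=1e-3):
    """One-vs-rest regularized least-squares on feature matrix Z.

    Each class gets +/-1 targets and a ridge-regression weight vector;
    a constant bias feature is appended to Z."""
    Zb = np.hstack([Z, np.ones((len(Z), 1))])
    classes = np.unique(y)
    Y = np.where(np.asarray(y)[:, None] == classes[None, :], 1.0, -1.0)
    W = np.linalg.solve(Zb.T @ Zb + reg * np.eye(Zb.shape[1]), Zb.T @ Y)
    return classes, W

def predict(model, Z):
    classes, W = model
    Zb = np.hstack([Z, np.ones((len(Z), 1))])
    return classes[np.argmax(Zb @ W, axis=1)]
```

Because the classifier is linear in the features, its accuracy directly reflects the quality of the representation, which is the point of the comparison in Table 2.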
4.3 Comparing TimeSeries Classification
Baselines. We now compare our method with other state-of-the-art time-series classification methods that also take advantage of the DTW distance or employ DTW-like kernels: 1) 1NN-DTW: uses a fixed window size; 2) 1NN-DTW^{opt}: uses the optimal window size found by leave-one-out cross-validation from the test data in (Chen et al., 2015); 3) DTWF (Kate, 2016): a recently proposed method that combines DTW with and without constraints, together with SAX (Lin et al., 2007), as features; 4) TGAK (Cuturi, 2011): a fast triangular global alignment kernel for time series; 5) RWS(LR): RWS with a large rank, achieving the best accuracy at the cost of more computation; 6) RWS(SR): RWS with a small rank, obtaining comparable accuracy in less time. We conduct a grid search for the important parameters of each method as suggested in (Kate, 2016; Cuturi, 2011).
Results. Table 3 shows that RWS consistently outperforms or matches the other state-of-the-art methods in testing accuracy while requiring significantly less computational time. First, RWS(SR) achieves better or similar performance compared to 1NN-DTW and 1NN-DTW^{opt} on all datasets. This is a strong sign that our learned feature representation is effective, since with it even a linear SVM can beat this well-recognized benchmark. Meanwhile, the clear computational advantage of RWS over 1NN-DTW can be observed when the number or the length of the time series becomes large. This is not surprising, since RWS reduces the complexity from quadratic to linear in both the number and length of time series. Second, RWS performs much better than the family of time-series kernels represented by TGAK, which suggests that the soft minimum over all alignment distances does not capture hidden patterns of time series well. Third, DTWF shows a significant performance difference compared to 1NN-DTW, which is consistent with the results reported in (Kate, 2016). However, compared to DTWF, RWS(LR) still shows a clear accuracy advantage on 11 of the 16 datasets while achieving one to two orders of magnitude speedup. More importantly, RWS supports a trade-off between accuracy and runtime, a feature highly desirable in real applications with varying priorities and constraints.
4.4 Comparing TimeSeries Clustering
Baselines. We compare our method against several time-series clustering baselines: 1) KMeans-DTW (Petitjean et al., 2011; Paparrizos and Gravano, 2015): K-Means with DTW, accelerated with a lower-bounding approach (Keogh, 2002); 2) CLDS (Li and Prakash, 2011): learns a feature representation with hidden variables through complex-valued linear dynamical systems; 3) KShape (Paparrizos and Gravano, 2015): a recently proposed clustering method demonstrated to outperform state-of-the-art clustering approaches in accuracy and computational time; 4) RWS(LR); 5) RWS(SR). We combine our learned feature representation with the classic K-Means algorithm (Hartigan and Wong, 1979). We measure performance with a commonly used clustering metric, the normalized mutual information (NMI, ranging between 0 and 1), where a higher value indicates better clustering.
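To make the setup concrete, a minimal version of this pipeline (our own sketch: plain K-Means on feature vectors plus an NMI implementation; the initialization scheme and names are ours, not those of the cited methods) could look like:

```python
import numpy as np

def kmeans(Z, k, iters=50):
    """Plain K-Means on feature vectors Z, with a simple deterministic
    initialization (k evenly spaced input points as initial centers)."""
    C = Z[np.linspace(0, len(Z) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # assign each point to its nearest center, then recompute means
        labels = np.argmin(((Z[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = Z[labels == j].mean(axis=0)
    return labels

def nmi(a, b):
    """Normalized mutual information between two label assignments."""
    a, b = np.asarray(a), np.asarray(b)
    ua, ub = np.unique(a), np.unique(b)
    P = np.array([[np.mean((a == i) & (b == j)) for j in ub] for i in ua])
    pa, pb = P.sum(axis=1), P.sum(axis=0)
    nz = P > 0
    mi = (P[nz] * np.log(P[nz] / np.outer(pa, pb)[nz])).sum()
    ha = -(pa * np.log(pa)).sum()
    hb = -(pb * np.log(pb)).sum()
    return mi / max(np.sqrt(ha * hb), 1e-12)
```

NMI is invariant to label permutations, so a clustering that recovers the ground-truth partition under any relabeling scores 1.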
Results. Table 4 shows that RWS provides similar or better performance than KMeans-DTW and is typically substantially faster when the number or the length of the time series becomes large. In addition, RWS consistently outperforms CLDS in terms of both accuracy and runtime. Interestingly, even compared to the state-of-the-art method KShape, RWS still yields a clear advantage in accuracy: 5 wins, 1 tie, and 2 losses over the 8 datasets. Beyond accuracy, RWS is also more computationally efficient than KShape.
5 Conclusions and Future Work
In this work, we have studied an effective and scalable positive-definite time-series kernel for large-scale time series problems based on the RWS approximation; the feature embedding generated by the technique is applicable to most learning problems. There are several interesting directions for future work, including: i) studying the effects of different random time-series distributions, and ii) exploring more elastic dissimilarity measures between time series, such as CID and DTDC.
References
 Bagnall et al. [2016] Anthony Bagnall, Jason Lines, Aaron Bostrom, James Large, and Eamonn Keogh. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery, pages 1–55, 2016.
 Bahlmann et al. [2002] Claus Bahlmann, Bernard Haasdonk, and Hans Burkhardt. Online handwriting recognition with support vector machines: a kernel approach. In Frontiers in Handwriting Recognition, pages 49–54. IEEE, 2002.
 Batista et al. [2014] Gustavo EAPA Batista, Eamonn J Keogh, Oben Moses Tataw, and Vinicius MA De Souza. CID: an efficient complexity-invariant distance for time series. Data Mining and Knowledge Discovery, 28(3):634–669, 2014.
 Baydogan et al. [2013] Mustafa Gokce Baydogan, George Runger, and Eugene Tuv. A bagoffeatures framework to classify time series. IEEE transactions on pattern analysis and machine intelligence, 35(11):2796–2802, 2013.
 Berndt and Clifford [1994] Donald J Berndt and James Clifford. Using dynamic time warping to find patterns in time series. In KDD workshop, volume 10, pages 359–370. Seattle, WA, 1994.

 Chen et al. [2016] Jie Chen, Lingfei Wu, Kartik Audhkhasi, Brian Kingsbury, and Bhuvana Ramabhadran. Efficient one-vs-one kernel ridge regression for speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 2454–2458. IEEE, 2016.
 Chen et al. [2015] Yanping Chen, Eamonn Keogh, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, and Gustavo Batista. The UCR time series classification archive, July 2015. www.cs.ucr.edu/~eamonn/time_series_data/.

 Cuturi [2011] Marco Cuturi. Fast global alignment kernels. In Proceedings of the 28th International Conference on Machine Learning, pages 929–936, 2011.
 Cuturi et al. [2007] Marco Cuturi, Jean-Philippe Vert, Oystein Birkenes, and Tomoko Matsui. A kernel for time series based on global alignments. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing, volume 2, pages II–413. IEEE, 2007.

 Deng et al. [2013] Houtao Deng, George Runger, Eugene Tuv, and Martyanov Vladimir. A time series forest for classification and feature extraction. Information Sciences, 239:142–153, 2013.
 Fan et al. [2008] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871–1874, 2008.
 Górecki and Łuczak [2014] Tomasz Górecki and Maciej Łuczak. Non-isometric transforms in time series classification using DTW. Knowledge-Based Systems, 61:98–108, 2014.

 Gudmundsson et al. [2008] Steinn Gudmundsson, Thomas Philip Runarsson, and Sven Sigurdsson. Support vector machines and dynamic time warping for time series. In 2008 IEEE International Joint Conference on Neural Networks, pages 2772–2776. IEEE, 2008.
 Hartigan and Wong [1979] John A Hartigan and Manchek A Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979.
 Hayashi et al. [2005] Akira Hayashi, Yuko Mizuhara, and Nobuo Suematsu. Embedding time series data for classification. In International Workshop on Machine Learning and Data Mining in Pattern Recognition, pages 356–365. Springer, 2005.
 Kate [2016] Rohit J Kate. Using dynamic time warping distances as features for improved time series classification. Data Mining and Knowledge Discovery, 30(2):283–312, 2016.
 Keogh [2002] Eamonn Keogh. Exact indexing of dynamic time warping. In Proceedings of the 28th international conference on Very Large Data Bases, pages 406–417. VLDB Endowment, 2002.
 Lei et al. [2017] Qi Lei, Jinfeng Yi, Roman Vaculin, Lingfei Wu, and Inderjit S. Dhillon. Similarity preserving representation learning for time series analysis. https://arxiv.org/abs/1702.03584, 2017.
Leslie et al. [2002] Christina S. Leslie, Eleazar Eskin, and William Stafford Noble. The spectrum kernel: A string kernel for SVM protein classification. In Pacific Symposium on Biocomputing, volume 7, pages 566–575, 2002.
 Li and Prakash [2011] Lei Li and B Aditya Prakash. Time series clustering: Complex is simpler! In Proceedings of the 28th International Conference on Machine Learning, pages 185–192, 2011.
Lin et al. [2007] Jessica Lin, Eamonn Keogh, Li Wei, and Stefano Lonardi. Experiencing SAX: a novel symbolic representation of time series. Data Mining and Knowledge Discovery, 15(2):107–144, 2007.
Marteau and Gibet [2015] Pierre-François Marteau and Sylvie Gibet. On recursive edit distance kernels with application to time series classification. IEEE Transactions on Neural Networks and Learning Systems, 26(6):1121–1133, 2015.
Paparrizos and Gravano [2015] John Paparrizos and Luis Gravano. k-Shape: Efficient and accurate clustering of time series. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1855–1870. ACM, 2015.
Peng et al. [2015a] Xi Peng, Junzhou Huang, Qiong Hu, Shaoting Zhang, Ahmed Elgammal, and Dimitris Metaxas. From circle to 3-sphere: Head pose estimation by instance parameterization. Computer Vision and Image Understanding, 136:92–102, 2015a.
Peng et al. [2015b] Xi Peng, Shaoting Zhang, Yu Yang, and Dimitris N. Metaxas. PIEFA: Personalized incremental and ensemble face alignment. In Proceedings of the IEEE International Conference on Computer Vision, pages 3880–3888, 2015b.
 Petitjean et al. [2011] François Petitjean, Alain Ketterlin, and Pierre Gançarski. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, 44(3):678–693, 2011.
Rahimi and Recht [2007] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, volume 3, page 5, 2007.
 Rakthanmanon and Keogh [2013] Thanawin Rakthanmanon and Eamonn Keogh. Fast shapelets: A scalable algorithm for discovering time series shapelets. In Proceedings of the 2013 SIAM International Conference on Data Mining, pages 668–676. SIAM, 2013.
 Rakthanmanon et al. [2012] Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Brandon Westover, Qiang Zhu, Jesin Zakaria, and Eamonn Keogh. Searching and mining trillions of time series subsequences under dynamic time warping. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 262–270. ACM, 2012.
Sakoe and Chiba [1978] Hiroaki Sakoe and Seibi Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, 1978.
Schäfer [2015] Patrick Schäfer. The BOSS is concerned with time series classification in the presence of noise. Data Mining and Knowledge Discovery, 29(6):1505–1530, 2015.
Senin and Malinchik [2013] Pavel Senin and Sergey Malinchik. SAX-VSM: Interpretable time series classification using SAX and vector space model. In Proceedings of the 13th IEEE International Conference on Data Mining, pages 1175–1180. IEEE, 2013.
Shimodaira et al. [2001] Hiroshi Shimodaira, Ken-ichi Noma, Mitsuru Nakai, Shigeki Sagayama, et al. Dynamic time-alignment kernel in support vector machine. In Advances in Neural Information Processing Systems, volume 2, pages 921–928, 2001.
 Wang et al. [2013] Xiaoyue Wang, Abdullah Mueen, Hui Ding, Goce Trajcevski, Peter Scheuermann, and Eamonn Keogh. Experimental comparison of representation methods and distance measures for time series data. Data Mining and Knowledge Discovery, pages 1–35, 2013.
Wu and Stathopoulos [2015] Lingfei Wu and Andreas Stathopoulos. A preconditioned hybrid SVD method for accurately computing singular triplets of large matrices. SIAM Journal on Scientific Computing, 37(5):S365–S388, 2015.
Wu et al. [2016] Lingfei Wu, Ian E. H. Yen, Jie Chen, and Rui Yan. Revisiting random binning features: Fast convergence and strong parallelizability. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1265–1274. ACM, 2016.
Wu et al. [2017] Lingfei Wu, Eloy Romero, and Andreas Stathopoulos. PRIMME_SVDS: A high-performance preconditioned SVD solver for accurate large-scale computations. SIAM Journal on Scientific Computing, 39(5):S248–S271, 2017.
Wu et al. [2018] Lingfei Wu, Ian En-Hsu Yen, Fangli Xu, Pradeep Ravikumar, and Michael Witbrock. D2KE: From distance to kernel and embedding. https://arxiv.org/abs/1802.04956, 2018.
 Xi et al. [2006] Xiaopeng Xi, Eamonn Keogh, Christian Shelton, Li Wei, and Chotirat Ann Ratanamahatana. Fast time series classification using numerosity reduction. In Proceedings of the 23rd international conference on Machine learning, pages 1033–1040. ACM, 2006.
Xing et al. [2010] Zhengzheng Xing, Jian Pei, and Eamonn Keogh. A brief survey on sequence classification. ACM SIGKDD Explorations Newsletter, 12(1):40–48, 2010.
 Ye and Keogh [2009] Lexiang Ye and Eamonn Keogh. Time series shapelets: a new primitive for data mining. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 947–956. ACM, 2009.
Yen et al. [2014] Ian E. H. Yen, Ting-Wei Lin, Shou-De Lin, Pradeep Ravikumar, and Inderjit S. Dhillon. Sparse random features algorithm as coordinate descent in Hilbert space. In Advances in Neural Information Processing Systems, 2014.
6 Appendix
6.1 Proof of Lemma 1
Lemma 1 (Lipschitz parameter).
Let be Lipschitz-continuous w.r.t. with parameter . Then given and , we have that
satisfies
with probability at least , where is the variance of the Lipschitz parameter averaged over samples.
Proof.
Consider an arbitrary pair of series of the same length. From the Lipschitz-continuity of , we have
for some . Letting and be the minimizers of and respectively, we have
and
Therefore, and
where . By a similar argument, we have and . Then, since and , let . By Chebyshev's inequality, we have
∎
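For concreteness, the concluding Chebyshev step can be written as follows; here L_1, …, L_R denote the per-sample Lipschitz parameters with mean L̄ and variance σ², and these symbol names are our own shorthand rather than the paper's original notation:

```latex
\Pr\left[\;\Bigl|\frac{1}{R}\sum_{i=1}^{R} L_i - \bar{L}\Bigr| \ge \epsilon\;\right]
\;\le\; \frac{\sigma^{2}}{R\,\epsilon^{2}},
```

so choosing \(\epsilon = \sigma/\sqrt{R\delta}\) bounds the deviation of the empirical average by \(\sigma/\sqrt{R\delta}\) with probability at least \(1-\delta\), matching the form of the lemma's high-probability statement.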
6.2 Experimental Settings and Parameters for RWS
As shown in Table 1, we choose 16 datasets that come from various applications, including ECG, sensor, image, spectro, simulated, and device, and that have varying numbers of classes, varying numbers of time series, and a wide range of time-series lengths. For all experiments, we generate random documents from a uniform distribution with mean centered in the Word2Vec embedding space, since we observe the best performance with this setting. We perform 10-fold cross-validation on the training set of each dataset to search for the best parameters , and , as well as the parameter for LIBLINEAR. We fix , and vary in the range [10 20 30 40 50 60 70 80 90 100], in the range [1e-4 1e-3 3e-3 1e-2 3e-2 0.10 0.14 0.19 0.28 0.39 0.56 0.79 1.12 1.58 2.23 3.16 4.46 6.30 8.91 10 31.62 1e2 3e2 1e3 1e4], and in the range [1e-5 1e-4 1e-3 1e-2 1e-1 1 1e1 1e2 1e3 1e4 1e5], respectively, in all experiments. All computations were carried out on a DELL dual-socket system with Intel Xeon processors 272 at 2.93GHz, for a total of 16 cores and 250 GB of memory, running the SUSE Linux operating system.
Name  Classes  Train  Test  Length  App

Beef  5  30  30  470  Spectro 
DPTW  6  400  139  80  Image 
IPD  2  67  1,029  24  Sensor 
PPOAG  3  400  205  80  Image 
MPOC  2  600  291  80  Image 
POC  2  1,800  858  80  Image 
LKA  3  375  375  720  Device 
IWBS  11  220  1,980  256  Sensor 
TWOP  4  1,000  4,000  128  Simulated 
ECG5T  5  500  4,500  140  ECG 
CHCO  3  467  3,840  166  Simulated 
Wafer  2  1,000  6,174  152  Sensor 
MALLAT  8  55  2,345  1,024  Simulated 
FordB  2  3,636  810  500  Sensor 
NIFECG  42  1,800  1,965  750  ECG 
HO  2  370  1,000  2,709  Image 
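The 10-fold grid search described above can be sketched as an exhaustive scan over the parameter grids. The sketch below is illustrative only: `cross_val_score` is a hypothetical stand-in for the real RWS-feature + LIBLINEAR evaluation, its toy surrogate objective exists solely to make the sketch runnable, and the grids are abbreviated versions of those listed in Sec. 6.2.

```python
from itertools import product

# Abbreviated parameter grids mirroring Sec. 6.2: D (number of random
# series), a kernel parameter, and LIBLINEAR's cost parameter C.
D_grid = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
sigma_grid = [1e-4, 1e-3, 3e-3, 1e-2, 3e-2, 0.10, 0.14, 0.19, 0.28, 0.39]
C_grid = [10.0 ** p for p in range(-5, 6)]

def cross_val_score(D, sigma, C):
    """Hypothetical stand-in for 10-fold CV accuracy of RWS + LIBLINEAR.

    The real version would embed the training series with D random warping
    series and train a linear classifier; here a toy surrogate peaks at
    (50, 0.1, 1.0) so the sketch has a well-defined answer.
    """
    return -((D - 50) ** 2) - (sigma - 0.1) ** 2 - (C - 1.0) ** 2

# Exhaustive grid search: keep the configuration with the best CV score.
best = max(product(D_grid, sigma_grid, C_grid),
           key=lambda cfg: cross_val_score(*cfg))
print(best)
```

In the actual experiments the winning configuration is of course selected per dataset from the full grids, not from this toy surrogate.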
6.3 More Results on Effects of , and on Random Features
To fully investigate the behavior of the WME method, we study the effect of the kernel parameter , the number of random documents , and the length of random documents on the training and testing accuracy of all 16 datasets. Clearly, the training and testing accuracy converge rapidly to those of the exact kernels when varying R from 4 to 512, which confirms our analysis in Theorem 1. When varying D from 10 to 100, we can see that in the majority of cases it generally yields near-peak performance, except on FordB.
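This rapid convergence in R is the general behavior of random-feature approximations (Rahimi and Recht [2007]). The following illustration uses generic random Fourier features for an RBF kernel on toy vectors — not the RWS features themselves — to show the approximation error shrinking as R grows:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.5
X = rng.standard_normal((20, 4))  # 20 toy vectors standing in for embedded series

def exact_rbf(X, gamma):
    """Exact Gaussian kernel matrix exp(-gamma * ||x - y||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def rff(X, R, gamma, rng):
    """Random Fourier features: z(x) = sqrt(2/R) * cos(Wx + b)."""
    W = rng.standard_normal((X.shape[1], R)) * np.sqrt(2 * gamma)
    b = rng.uniform(0, 2 * np.pi, R)
    return np.sqrt(2.0 / R) * np.cos(X @ W + b)

K = exact_rbf(X, gamma)
errs = []
for R in (4, 64, 1024):
    Z = rff(X, R, gamma, np.random.default_rng(1))
    errs.append(np.abs(Z @ Z.T - K).max())
print(errs)  # max-entry approximation error shrinks as R grows
```

The Monte Carlo error of `Z @ Z.T` decays like O(1/sqrt(R)), which is the same qualitative convergence observed when varying R from 4 to 512 above.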
6.4 Parameters and Settings on Comparisons of Feature Representations
For TSEigen Hayashi et al. [2005], we implemented this method in Matlab, where we apply SVD to compute the largest dominant components of the similarity matrix computed using DTW. For TSMC Lei et al. [2017], we used their open-source code on GitHub: https://github.com/cecilialeiqi/SPIRAL. Since the default rank of TSMC is 32, we keep all methods consistent with this setting to make a fair comparison. For all methods, we choose the parameter by 10-fold cross-validation on the training data in LIBLINEAR on all 16 datasets.
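The SVD-based embedding step can be sketched as follows. This is a minimal NumPy illustration, not the Matlab implementation: a toy low-rank symmetric matrix stands in for the real DTW similarity matrix, and the rank is scaled down from the paper's 32 to fit the toy size.

```python
import numpy as np

# Toy symmetric "similarity" matrix; the real pipeline fills S[i, j]
# with a DTW-derived similarity between series i and j.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))
S = A @ A.T  # symmetric, positive semi-definite, rank 3

k = 3  # embedding rank (32 in the paper's setting)
U, s, _ = np.linalg.svd(S)
emb = U[:, :k] * np.sqrt(s[:k])  # n x k feature matrix, one row per series

# Sanity check: the top-k factors reproduce S up to its rank.
recon = emb @ emb.T
print(np.abs(recon - S).max())
```

Each row of `emb` then serves as a fixed-length feature vector for the linear classifier.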
6.5 Parameters and Settings on Comparisons for LargeScale Classification
For 1NN-DTW and 1NN-DTW^{opt}, we implemented them using Matlab's internal fitcknn with DTW, using the same C MEX file (https://www.mathworks.com/matlabcentral/fileexchange/43156-dynamic-time-warping--dtw) as our method RWS. Although our implementations may not be highly optimized, we believe the runtime comparisons among these methods are reasonably fair. For DTWF Kate [2016], we used their open-source code (https://people.uwm.edu/katerj/timeseries/). To make a fair comparison with other methods, we set the window size as . The feature representation generated by DTWF combines SAX, DTW, and DTW_R, where we use the recommended parameter ranges = [8 16 24 32 40 48 56 64 72 80 96 112 128 144 160], = [4 8], and = [3 4 5 6 7 8 9] for cross-validation. For TGAK Cuturi [2011], we used their open-source code (http://marcocuturi.net/GA.html) for the experiments. We choose the recommended window size due to a good trade-off between testing accuracy and computational time. We also perform cross-validation to search for a good kernel parameter in the range [0.01, 0.033, 0.066, 0.1, 0.33, 0.66, 1, 3.3, 6.6, 10] and the LIBLINEAR parameter in the range [1e-5 1e-4 1e-3 1e-2 1e-1 1 1e1 1e2 1e3 1e4 1e5 1e6].
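For reference, the windowed DTW distance underlying these baselines (a Sakoe–Chiba band constraining |i − j| ≤ window) can be sketched in a few lines. This is an illustrative NumPy version, not the optimized C MEX implementation used in the experiments:

```python
import numpy as np

def dtw(a, b, window=None):
    """DTW distance between 1-D series with an optional Sakoe-Chiba band."""
    n, m = len(a), len(b)
    # Widen the band to at least |n - m| so an alignment path always exists.
    w = max(window if window is not None else max(n, m), abs(n - m))
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # Standard recurrence: insertion, deletion, or match step.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

x = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
y = np.array([0.0, 0.0, 1.0, 2.0, 1.0])
print(dtw(x, y))            # warping absorbs the one-step shift → 1.0
print(dtw(x, y, window=0))  # window 0 forces the diagonal, i.e. Euclidean → 2.0
```

Shrinking the window trades alignment flexibility for the large speedups the experiments exploit, since only O(n·w) cells of the table are filled.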
6.6 Parameters and Settings on Comparisons for LargeScale Clustering
For KMeans-DTW Petitjean et al. [2011], we used the publicly available Python code (https://github.com/alexminnaar/time-series-classification-and-clustering), which also implements the LB_Keogh lower bound with DTW. However, the efficiency of the Python code may be significantly worse than that of the C MEX file of DTW we used, which could be the reason we observed a larger speedup margin compared to 1NN-DTW. Nevertheless, note that the computational complexity of RWS over KMeans-DTW reduces from quadratic to linear. For CLDS Li and Prakash [2011], we used the open-source code published by the authors (http://www.cs.cmu.edu/~./leili/software.html). We choose the parameter by cross-validation while using the recommended parameters for generating the representations on all datasets. For K-Shape Paparrizos and Gravano [2015], we used the publicly available Python code (https://github.com/Mic92/kshape). Similarly, we choose the parameter by cross-validation while using the recommended parameters for generating the representations on all datasets.
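The source of the linear-versus-quadratic gap is that, once each series is embedded as a fixed-length vector, clustering reduces to ordinary k-means, whose cost per iteration is linear in the number of series. A minimal NumPy sketch of this step; the deterministic farthest-point initialization is our own choice for reproducibility, not part of any of the compared implementations:

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Lloyd-style k-means on embedded series (O(n*k) per iteration)."""
    # Farthest-point initialization: deterministic and well-spread.
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute centers.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

# Two well-separated toy clusters standing in for RWS embeddings.
X = np.vstack([np.zeros((10, 2)), np.ones((10, 2)) * 5])
labels = kmeans(X, 2)
print(labels)
```

In contrast, KMeans-DTW must evaluate DTW between series and evolving centroids, which is where the quadratic cost in series length (and the need for LB_Keogh pruning) comes from.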