1 Introduction
1.1 Dynamic time warping
Time series such as audio, video, and other sensory signals represent collections of time-dependent values that may vary in speed (see Fig. 1). Since the Euclidean distance is sensitive to such variations, its application to data mining tasks on time series may give unsatisfactory results [6, 8, 24]. Consequently, the preferred approaches to comparing time series apply elastic transformations that filter out the variations in speed. Among various techniques, one of the most common elastic transformations is dynamic time warping (DTW) [20].
Dynamic time warping is based on the concept of a warping path. A warping path determines how to stretch two given time series $x$ and $y$ to warped time series $x'$ and $y'$ under certain constraints. The cost of warping $x$ and $y$ along a warping path measures how dissimilar the warped time series $x'$ and $y'$ are. There are exponentially many different warping paths [1], each of which determines the cost of warping time series $x$ and $y$. An optimal warping path of $x$ and $y$ is a warping path with minimum cost. Optimal warping paths exist but are not unique in general (see Fig. 2).
1.2 The problem of non-uniqueness
Recent research is directed towards extending standard statistical concepts and machine learning methods to time series spaces endowed with the DTW distance. Examples include time series averaging [3, 5, 16, 17, 21], k-means clustering [10, 14, 18, 22], learning vector quantization [12, 23], and warped-linear classifiers [11, 13]. The lowest common denominator of these approaches is that they repeatedly update one or more parameter time series. In addition, update directions are based on optimal warping paths such that the following properties hold:
- If an optimal warping path is unique, then the update direction is well-defined.
- If an optimal warping path is non-unique, then there are several update directions.
Non-uniqueness of optimal warping paths complicates the algorithmic design of learning methods in DTW spaces and their theoretical analysis. In some situations, non-uniqueness may result in adverse effects. For example, repulsive updating in learning vector quantization finds a theoretical justification only in cases where the corresponding optimal warping path is unique [12].
Given the problems caused by non-uniqueness, it is desirable that optimal warping paths are unique almost everywhere. In this case, non-unique optimal warping paths occur exceptionally and are easier to handle, as we will show shortly. Therefore, we are interested in how prevalent unique optimal warping paths are.
1.3 Almost everywhere
The colloquial term “almost everywhere” has a precise measure-theoretic meaning. A measure quantifies the size of a set. It generalizes the concepts of length, area, and volume of a solid body defined in one, two, and three dimensions, respectively. The term “almost everywhere” finds its roots in the notion of a “negligible set”. Negligible sets are sets contained in a set of measure zero. For example, the function
$$f: \mathbb{R} \to \mathbb{R}, \quad f(u) = \begin{cases} 1 & u = 0 \\ 0 & u \neq 0 \end{cases}$$
is discontinuous on the negligible set $\{0\}$ with measure zero. We say the function $f$ is continuous almost everywhere, because the set where $f$ is not continuous is negligible. More generally, a property $\mathcal{Q}$ is said to be true almost everywhere if the set where $\mathcal{Q}$ is false is negligible. The property that an optimal warping path is unique almost everywhere means that the set of all pairs of time series with non-unique optimal warping path is negligible.
When working in a measure space, a negligible set contains the exceptional cases we can handle or even do not care about and often ignore. For example, we do not care about the behavior of the above function $f$ on its negligible set $\{0\}$ when computing its Lebesgue integral over $\mathbb{R}$. Another example is that cost functions of some machine learning methods in Euclidean spaces, such as k-means or learning vector quantization, are non-differentiable on a negligible set. In such cases, it is common practice to ignore such points or to resort to subgradient methods.
1.4 Contributions
Consider the following property $\mathcal{Q}$ on the set $\mathcal{X} = \mathbb{F}^m \times \mathbb{F}^n$ of pairs of time series of length $m$ and $n$: a pair of time series $x$ and $y$ satisfies $\mathcal{Q}$ if there are two different (not necessarily optimal) warping paths between $x$ and $y$ with identical costs. Under the assumption of a squared-error local cost function, the main result of this article is Theorem 1:

Property $\mathcal{Q}$ is negligible on $\mathcal{X}$.

Direct consequences of Theorem 1 are that (i) optimal warping paths are unique almost everywhere, and (ii) property $\mathcal{Q}$ holds only on the union of exponentially many zero sets of quadratic forms. The results hold for univariate as well as multivariate time series.
An implication is that adverse effects in learning caused by non-unique optimal warping paths are exceptional cases that can be safely handled. For example, learning amounts to (stochastic) gradient descent update rules almost everywhere.
2 Background
This section first introduces warping paths and then defines the notions of negligible and almost everywhere from measure theory.
2.1 Time Series and Warping Paths
We first define time series. Let $\mathbb{F} = \mathbb{R}^d$ denote the $d$-dimensional Euclidean space. A $d$-variate time series of length $n$ is a sequence $x = (x_1, \ldots, x_n)$ consisting of elements $x_i \in \mathbb{F}$. By $\mathbb{F}^n$ we denote the set of all time series of length $n$ with elements from $\mathbb{F}$.
Next, we describe warping paths. Let $[n] = \{1, \ldots, n\}$, where $n \in \mathbb{N}$. An $(m \times n)$-lattice is a set of the form $\mathcal{L}_{m,n} = [m] \times [n]$. A warping path in lattice $\mathcal{L}_{m,n}$ is a sequence $p = (p_1, \ldots, p_L)$ of $L$ points $p_l = (i_l, j_l) \in \mathcal{L}_{m,n}$ such that
1. $p_1 = (1, 1)$ and $p_L = (m, n)$, and
2. $p_{l+1} - p_l \in \{(1, 0), (0, 1), (1, 1)\}$
for all $l \in [L - 1]$.
The first condition is called the boundary condition and the second one the step condition.
By $\mathcal{P}_{m,n}$ we denote the set of all warping paths in $\mathcal{L}_{m,n}$. A warping path departs at the upper left corner $(1, 1)$ and ends at the lower right corner $(m, n)$ of the lattice. Only east $(0, 1)$, south $(1, 0)$, and southeast $(1, 1)$ steps are allowed to move from a given point $p_l$ to the next point $p_{l+1}$.
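To make the boundary and step conditions concrete, the following Python sketch (our own illustration, not part of the original exposition; all names are ours) enumerates all warping paths of a small lattice by depth-first search. The count coincides with the Delannoy numbers discussed in Section 3.

```python
from typing import List, Tuple

def warping_paths(m: int, n: int) -> List[List[Tuple[int, int]]]:
    """Enumerate all warping paths in an (m x n)-lattice.

    A warping path starts at (1, 1), ends at (m, n), and moves only
    by the steps (1, 0), (0, 1), and (1, 1).
    """
    paths: List[List[Tuple[int, int]]] = []

    def extend(path: List[Tuple[int, int]]) -> None:
        i, j = path[-1]
        if (i, j) == (m, n):                      # boundary condition met
            paths.append(list(path))
            return
        for di, dj in ((1, 0), (0, 1), (1, 1)):   # step condition
            if i + di <= m and j + dj <= n:
                path.append((i + di, j + dj))
                extend(path)
                path.pop()

    extend([(1, 1)])
    return paths

# The number of warping paths equals the Delannoy number D(m, n),
# e.g. D(3, 3) = 13 (cf. Table 1).
print(len(warping_paths(3, 3)))  # -> 13
```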
Finally, we introduce optimal warping paths. A warping path $p \in \mathcal{P}_{m,n}$ defines an alignment (warping) between time series $x = (x_1, \ldots, x_m)$ and $y = (y_1, \ldots, y_n)$ by relating elements $x_i$ and $y_j$ if $(i, j) \in p$. The cost of aligning time series $x$ and $y$ along warping path $p$ is defined by
$$C_p(x, y) = \sum_{(i, j) \in p} \lVert x_i - y_j \rVert^2,$$
where $\lVert \cdot \rVert$ denotes the Euclidean norm on $\mathbb{F}$. A warping path $p^*$ between $x$ and $y$ is optimal if
$$C_{p^*}(x, y) \leq C_p(x, y) \quad \text{for all } p \in \mathcal{P}_{m,n}.$$
By $\mathcal{P}^*(x, y)$ we denote the set of all optimal warping paths between time series $x$ and $y$. The DTW distance is defined by
$$\delta(x, y) = \min\left\{ \sqrt{C_p(x, y)} \,:\, p \in \mathcal{P}_{m,n} \right\}.$$
Observe that $\delta(x, y) = \sqrt{C_p(x, y)}$ for all $p \in \mathcal{P}^*(x, y)$.
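In practice, the DTW distance and one optimal warping path are computed by dynamic programming over the lattice. The following sketch (our own illustration, not the authors' implementation) uses the squared-error local cost assumed throughout this article; note that the backtracking step breaks ties arbitrarily, which is precisely where non-uniqueness of optimal warping paths enters. Later sketches in this article reuse this function.

```python
import numpy as np

def dtw(x: np.ndarray, y: np.ndarray):
    """DTW with squared-error local cost.

    Returns (delta, path), where delta = min_p sqrt(C_p(x, y)) and
    path is one optimal warping path as a list of 1-based index pairs.
    """
    m, n = len(x), len(y)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.sum((np.atleast_1d(x[i - 1]) - np.atleast_1d(y[j - 1])) ** 2)
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # Backtrack one optimal warping path; ties are broken arbitrarily.
    path, (i, j) = [], (m, n)
    while (i, j) != (1, 1):
        path.append((i, j))
        moves = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((q for q in moves if q[0] >= 1 and q[1] >= 1),
                   key=lambda q: D[q])
    path.append((1, 1))
    return float(np.sqrt(D[m, n])), path[::-1]

delta, path = dtw(np.array([1.0, 2.0, 3.0]), np.array([1.0, 3.0]))
print(delta, path)  # -> 1.0 and one of the two optimal warping paths
```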
2.2 Measure-Theoretic Concepts
We introduce the necessary measure-theoretic concepts to define the notions of negligible and almost everywhere. For details, we refer to [9].
One issue in measure theory is that not every subset of a given set is measurable. A family of measurable subsets of a set $X$ is called a $\sigma$-algebra in $X$. A measure is a function that assigns a non-negative value to every measurable subset of $X$ such that certain conditions are satisfied. To introduce these concepts formally, we assume that $\mathcal{P}(X)$ denotes the power set of a set $X$, that is, the set of all subsets of $X$. A system $\mathcal{A} \subseteq \mathcal{P}(X)$ is called a $\sigma$-algebra in $X$ if it has the following properties:
1. $X \in \mathcal{A}$,
2. $A \in \mathcal{A}$ implies $X \setminus A \in \mathcal{A}$,
3. $A_1, A_2, \ldots \in \mathcal{A}$ implies $\bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$.
A measure on $\mathcal{A}$ is a function $\mu: \mathcal{A} \to [0, +\infty]$ that satisfies the following properties:
1. $\mu(A) \geq 0$ for all $A \in \mathcal{A}$,
2. $\mu(\emptyset) = 0$,
3. for a countable collection of pairwise disjoint sets $A_1, A_2, \ldots \in \mathcal{A}$, we have
$$\mu\left( \bigcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} \mu\left( A_i \right).$$
A triple $(X, \mathcal{A}, \mu)$ consisting of a set $X$, a $\sigma$-algebra $\mathcal{A}$ in $X$, and a measure $\mu$ on $\mathcal{A}$ is called a measure space. The Borel $\sigma$-algebra $\mathcal{B}$ in $\mathbb{R}^d$ is the $\sigma$-algebra generated by the open sets of $\mathbb{R}^d$. The Lebesgue measure $\lambda$ on $\mathcal{B}$ generalizes the concept of volume of a box in $\mathbb{R}^d$. The triple $(\mathbb{R}^d, \mathcal{B}, \lambda)$ is called Lebesgue–Borel measure space.
Let $(X, \mathcal{A}, \mu)$ be a measure space, where $X$ is a set, $\mathcal{A}$ is a $\sigma$-algebra in $X$, and $\mu$ is a measure defined on $\mathcal{A}$. A set $N \subseteq X$ is negligible if there is a set $A \in \mathcal{A}$ such that $N \subseteq A$ and $\mu(A) = 0$. A property of $X$ is said to hold almost everywhere if the set of points in $X$ where this property fails is negligible.
3 Results
We first show that optimal warping paths are unique almost everywhere. Then we geometrically describe the location of the non-unique set. Finally, we discuss the implications of the proposed results on learning in DTW spaces.
Let $\mathcal{X} = \mathbb{F}^m \times \mathbb{F}^n$ be the set of all pairs $(x, y)$ of time series, where $x$ has length $m$ and $y$ has length $n$. We regard the set $\mathcal{X}$ as a Euclidean space and assume the Lebesgue–Borel measure space $(\mathcal{X}, \mathcal{B}, \lambda)$ (see Remark 7 for an explanation of why we regard $\mathcal{X}$ as a Euclidean space). The multi-optimal-path set of $\mathcal{X}$ is defined by
$$\mathcal{X}^* = \left\{ (x, y) \in \mathcal{X} \,:\, \lvert \mathcal{P}^*(x, y) \rvert > 1 \right\}.$$
This set consists of all pairs with non-unique optimal warping path. To assert that $\mathcal{X}^*$ is negligible, we show that $\mathcal{X}^*$ is a subset of a set of measure zero. For this, consider the multi-path set
$$\mathcal{X}' = \left\{ (x, y) \in \mathcal{X} \,:\, C_p(x, y) = C_q(x, y) \text{ for some } p, q \in \mathcal{P}_{m,n} \text{ with } p \neq q \right\}.$$
The set $\mathcal{X}'$ consists of all pairs that can be aligned along different warping paths with identical cost. Obviously, the set $\mathcal{X}^*$ is a subset of $\mathcal{X}'$. The next theorem states that $\mathcal{X}'$ is a set of measure zero.
Theorem 1.
Let $(\mathcal{X}, \mathcal{B}, \lambda)$ be the Lebesgue–Borel measure space. Then $\lambda(\mathcal{X}') = 0$.
From Theorem 1 and $\mathcal{X}^* \subseteq \mathcal{X}'$ it immediately follows that $\mathcal{X}^*$ is negligible.
Corollary 2.
Under the assumptions of Theorem 1, the set $\mathcal{X}^*$ is negligible.
Thus, optimal warping paths are unique almost everywhere in $\mathcal{X}$. Even more generally: the property that all warping paths have pairwise different costs holds almost everywhere in $\mathcal{X}$.
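Corollary 2 can be illustrated numerically: sampling pairs of time series from a continuous distribution and comparing the costs of all warping paths essentially never produces ties. The following sketch is our own experiment (it reuses the `warping_paths` enumeration sketch from Section 2.1) and is not part of the formal results.

```python
import itertools
import numpy as np

def path_costs(x: np.ndarray, y: np.ndarray, paths) -> list:
    """Cost C_p(x, y) of every warping path p of the (m x n)-lattice."""
    return [sum((x[i - 1] - y[j - 1]) ** 2 for i, j in p) for p in paths]

rng = np.random.default_rng(0)
m, n = 4, 3
paths = warping_paths(m, n)   # D(4, 3) = 25 warping paths
ties = 0
for _ in range(1000):
    x, y = rng.normal(size=m), rng.normal(size=n)
    costs = path_costs(x, y, paths)
    ties += any(abs(a - b) < 1e-12
                for a, b in itertools.combinations(costs, 2))
print(ties)  # expected: 0 -- ties occur on a negligible set only
```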
We describe the geometric form of the multi-path set $\mathcal{X}'$. For this we identify $\mathcal{X}$ with the Euclidean space $\mathbb{R}^N$, where $N = d(m + n)$. Thus, pairs of time series $(x, y)$ are summarized to points of $\mathbb{R}^N$, henceforth denoted as $z = (x, y)$. By $\mathbb{R}^{N \times N}$ we denote the set of all $(N \times N)$-matrices with elements from $\mathbb{R}$. Finally, the zero set of a function $f: \mathbb{R}^N \to \mathbb{R}$ is of the form
$$\mathcal{Z}(f) = \left\{ z \in \mathbb{R}^N \,:\, f(z) = 0 \right\}.$$
From the proof of Theorem 1 it directly follows that the set $\mathcal{X}'$ is the union of zero sets of quadratic forms.
Corollary 3.
Under the assumptions of Theorem 1, there is an integer $r$ and symmetric matrices $S_1, \ldots, S_r \in \mathbb{R}^{N \times N}$ such that
$$\mathcal{X}' = \bigcup_{i = 1}^{r} \mathcal{Z}(f_i), \quad \text{where } f_i(z) = z^\mathsf{T} S_i\, z.$$
The number of zero sets in Corollary 3 grows exponentially in $m$ and $n$. From the proof of Theorem 1 follows that $r \leq \binom{D(m,n)}{2}$, where
$$D(m, n) = D(m-1, n) + D(m, n-1) + D(m-1, n-1), \qquad D(m, 1) = D(1, n) = 1,$$
is the Delannoy number [1]. The Delannoy number $D(m, n)$ counts the number of all warping paths in an $(m \times n)$-lattice. Table 1 presents the first Delannoy numbers up to $m = 10$ and $n = 9$. We see that there are more than half a million warping paths in a $(10 \times 9)$-lattice, showing that $\mathcal{X}'$ is the union of more than 178 billion zero sets. For two time series of length 20, the number of warping paths is $D(20, 20) = 45{,}849{,}429{,}914{,}943$, which is more than 45 trillion. Thus, the multi-path set of two time series of length 20 is the union of more than an octillion zero sets, that is, more than $10^{27}$. An open question is the number of zero sets that form the multi-optimal-path set $\mathcal{X}^*$. The example in Figure 3 indicates that the multi-optimal-path set $\mathcal{X}^*$ can be much smaller than the multi-path set $\mathcal{X}'$.
Table 1: Delannoy numbers $D(m, n)$ for $m = 1, \ldots, 10$ and $n = 1, \ldots, 9$.

 m\n |  1    2     3      4      5       6       7        8        9
-----+-----------------------------------------------------------------
  1  |  1    1     1      1      1       1       1        1        1
  2  |  1    3     5      7      9      11      13       15       17
  3  |  1    5    13     25     41      61      85      113      145
  4  |  1    7    25     63    129     231     377      575      833
  5  |  1    9    41    129    321     681   1,289    2,241    3,649
  6  |  1   11    61    231    681   1,683   3,653    7,183   13,073
  7  |  1   13    85    377  1,289   3,653   8,989   19,825   40,081
  8  |  1   15   113    575  2,241   7,183  19,825   48,639  108,545
  9  |  1   17   145    833  3,649  13,073  40,081  108,545  265,729
 10  |  1   19   181  1,159  5,641  22,363  75,517  224,143  598,417
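The table can be generated directly from the recurrence stated above; the following sketch (our own illustration) reproduces Table 1 and the counts used in the text.

```python
def delannoy(m: int, n: int) -> int:
    """Number of warping paths in an (m x n)-lattice (Delannoy number)."""
    D = [[1] * (n + 1) for _ in range(m + 1)]   # D(m, 1) = D(1, n) = 1
    for i in range(2, m + 1):
        for j in range(2, n + 1):
            D[i][j] = D[i - 1][j] + D[i][j - 1] + D[i - 1][j - 1]
    return D[m][n]

print(delannoy(10, 9))   # -> 598417, the last entry of Table 1
print(delannoy(20, 20))  # -> 45849429914943 paths for two length-20 series
```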
3.1 Discussion
We discuss the implications of Theorem 1 for learning in DTW spaces.
3.1.1 Learning
This section shows that almost-everywhere uniqueness implies almost-everywhere differentiability of the underlying cost function.
To convey the line of argument, it is sufficient to restrict to the problem of averaging time series as a representative for other, more complex learning problems. In contrast to computing the average in Euclidean spaces, time series averaging is a non-trivial task whose complexity class is currently unknown [3].
Let $S = \left( x^{(1)}, \ldots, x^{(K)} \right)$ be a sample of $K$ time series, possibly of varying length. Consider the cost function of the form
$$J(z) = \frac{1}{K} \sum_{k=1}^{K} \ell\left( \delta\left( x^{(k)}, z \right) \right),$$
where
$$\ell: \mathbb{R} \to \mathbb{R}$$
is a loss function. Common loss functions for averaging time series are the identity $\ell(u) = u$ and the squared loss $\ell(u) = u^2$. The goal of time series averaging is to find a time series $z \in \mathbb{F}^n$ of length $n$ that minimizes the cost $J(z)$. The challenge of time series averaging is to minimize the non-differentiable cost function $J$. We show that almost-everywhere uniqueness of optimal warping paths implies almost-everywhere differentiability of the cost function $J$ and provides a stochastic (incremental) update rule.
We express the DTW distance as a parametrized function. Suppose that $x \in \mathbb{F}^m$ is a time series. Then the parametrized DTW function restricted to the set $\mathbb{F}^n$ is of the form
$$\delta_x: \mathbb{F}^n \to \mathbb{R}, \quad z \mapsto \delta(x, z).$$
We have the following result:
Proposition 4.
Suppose that $x \in \mathbb{F}^m$ and $z \in \mathbb{F}^n$ are two time series with unique optimal warping path. Then the function $\delta_x$ is differentiable at $z$ and its gradient is a time series of the form
$$\nabla \delta_x(z) = \left( \frac{1}{\delta(x, z)} \sum_{i \,:\, (i, j) \in p} \left( z_j - x_i \right) \right)_{j \in [n]},$$
where $p$ is an optimal warping path.
The proof follows from [19] after reducing $\delta_x$ to a piecewise smooth function. By construction, the cost $C_p(x, z)$ of warping $x$ and $z$ along warping path $p$ is differentiable as a function of the second argument $z$. Non-differentiability of $\delta_x$ is caused by non-uniqueness of an optimal warping path between $x$ and $z$. In this case we have
$$\delta(x, z) = \sqrt{C_p(x, z)} = \sqrt{C_q(x, z)},$$
where $p \neq q$ are two distinct optimal warping paths. Then it can happen that
$$\nabla_z \sqrt{C_p(x, z)} \neq \nabla_z \sqrt{C_q(x, z)},$$
showing that $\delta_x$ is non-differentiable at $z$.
Next, suppose that the loss function $\ell$ is differentiable and an optimal warping path between time series $x^{(k)}$ and $z$ is unique. Then the individual cost $J_k(z) = \ell\left( \delta\left( x^{(k)}, z \right) \right)$ is also differentiable at $z$ with gradient
$$\nabla J_k(z) = \ell'\left( \delta\left( x^{(k)}, z \right) \right) \cdot \nabla \delta_{x^{(k)}}(z).$$
Differentiability of $J_k$ gives rise to a stochastic (incremental) update rule of the form
$$z \leftarrow z - \eta \, \nabla J_k(z), \qquad (1)$$
where $\eta > 0$ is the step size and the gradient $\nabla J_k(z)$ is computed via an optimal warping path $p$ between $x^{(k)}$ and $z$ according to Proposition 4. We can also apply the stochastic update rule (1) in cases where an optimal warping path between $x^{(k)}$ and $z$ is not unique. In this case, we first (randomly) select an optimal warping path from the set $\mathcal{P}^*(x^{(k)}, z)$. Then we update $z$ by applying update rule (1).
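A minimal sketch of update rule (1) for averaging univariate time series under the squared loss $\ell(u) = u^2$, for which $\nabla J_k(z)$ has components $2 \sum_{i : (i,j) \in p} (z_j - x_i^{(k)})$. The function reuses the `dtw` sketch from Section 2.1; the diminishing step-size schedule and all names are our own choices, not prescribed by the text.

```python
import numpy as np

def ssg_mean(sample, z0, epochs=50, eta0=0.1, seed=1):
    """Stochastic (sub)gradient time series averaging via update rule (1).

    At non-differentiable points the backtracking in dtw() simply
    selects one optimal warping path, as described above.
    """
    rng = np.random.default_rng(seed)
    z = z0.astype(float).copy()
    for epoch in range(epochs):
        eta = eta0 / (1 + epoch)            # diminishing step size
        for k in rng.permutation(len(sample)):
            _, path = dtw(sample[k], z)     # one optimal warping path
            grad = np.zeros_like(z)
            for i, j in path:               # d/dz_j of sum (x_i - z_j)^2
                grad[j - 1] += 2.0 * (z[j - 1] - sample[k][i - 1])
            z -= eta * grad                 # update rule (1)
    return z

sample = [np.array([0.0, 1.0, 2.0, 1.0]),
          np.array([0.0, 2.0, 1.0, 0.0])]
print(ssg_mean(sample, z0=np.mean(sample, axis=0)))
```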
Updating at non-differentiable points according to rule (1) is not well-defined. In addition, it is unclear whether the update directions are always directions of descent for learning problems in general. The next result confines both issues to a negligible set.
Corollary 5.
Suppose that $x \in \mathbb{F}^m$ and that $\ell$ is a differentiable loss function. Then the functions
$$\delta_x: \mathbb{F}^n \to \mathbb{R} \quad \text{and} \quad \ell \circ \delta_x: \mathbb{F}^n \to \mathbb{R}$$
are differentiable almost everywhere on $\mathbb{F}^n$.
Corollary 5 directly follows from Prop. 4 together with Theorem 1. In summary, almost-everywhere uniqueness of optimal warping paths implies almost-everywhere differentiability of the individual cost $J_k$. The latter in turn implies that update rule (1) is a well-defined stochastic gradient step almost everywhere.
The arguments in this section essentially carry over to other learning problems in DTW spaces such as k-means, self-organizing maps, and learning vector quantization. We assume that it should not be a problem to transfer the proposed results to learning based on DTW similarity scores as applied in warped-linear classifiers [11, 13].
3.1.2 Learning Vector Quantization in DTW Spaces
Learning vector quantization (LVQ) is a supervised classification scheme introduced by Kohonen [15]. A basic principle shared by most LVQ variants is the margin-growth principle [12]. This principle justifies the different learning rules and corresponds to stochastic gradient update rules if a differentiable cost function exists. Like the k-means algorithm, the LVQ scheme has been generalized to DTW spaces [12, 23]. In this section, we illustrate that a unique optimal warping path is a necessary condition to satisfy the margin-growth principle in DTW spaces, as proved in [12], Theorem 12.
As a representative example, we describe LVQ1, the simplest of all LVQ algorithms [15]. Let $\mathbb{F} = \mathbb{R}^d$ be the $d$-dimensional Euclidean space and let $\mathcal{Y} = \{1, \ldots, C\}$ be a set consisting of $C$ class labels. The LVQ1 algorithm assumes a codebook $\mathcal{C} = \{(w_1, y_1), \ldots, (w_K, y_K)\}$ of prototypes $w_k \in \mathbb{F}$ with corresponding class labels $y_k \in \mathcal{Y}$. As a classifier, LVQ1 assigns an input point $x \in \mathbb{F}$ to the class $y_{k^*}$ of its closest prototype $w_{k^*}$, where
$$k^* \in \arg\min_{k \in [K]} \; \lVert x - w_k \rVert^2.$$
LVQ1 learns a codebook on the basis of a training set $\mathcal{D} \subseteq \mathbb{F} \times \mathcal{Y}$. After initialization of the codebook $\mathcal{C}$, the algorithm repeats the following steps until termination: (i) randomly select a training example $(x, c) \in \mathcal{D}$; (ii) determine the prototype $w_{k^*}$ closest to $x$; and (iii) attract $w_{k^*}$ to $x$ if their class labels agree and repel $w_{k^*}$ from $x$ otherwise. Step (iii) adjusts $w_{k^*}$ according to the rule
$$w_{k^*} \leftarrow w_{k^*} \pm \eta \left( x - w_{k^*} \right),$$
where $\eta$ is the learning rate, the sign is positive if the labels of $x$ and $w_{k^*}$ agree ($c = y_{k^*}$), and negative otherwise ($c \neq y_{k^*}$). The update rule guarantees that adjusting $w_{k^*}$ makes an incorrect classification of $x$ more insecure. Formally, if the learning rate $\eta$ is bounded by some threshold, the LVQ1 update rule guarantees to increase the hypothesis margin
$$\mu(x) = \frac{1}{2} \left( \lVert x - w^- \rVert - \lVert x - w^+ \rVert \right),$$
where $w^+$ ($w^-$) is the closest prototype of $x$ with the same (different) class label.
The different variants of LVQ have been extended to DTW spaces by replacing the squared Euclidean distance with the squared DTW distance [12, 23]. The update rule is based on an optimal warping path between the current input time series and its closest prototype. In asymmetric learning [12], the margin-growth principle always holds for the attractive force; for the repulsive force it holds only when the optimal warping path is unique (as a necessary condition).
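For concreteness, the following sketch shows one LVQ1 step under warping in the spirit of [12, 23]; it is our own illustration (reusing the `dtw` sketch from Section 2.1, with hypothetical variable names), and the repulsive branch is theoretically justified only when the optimal warping path is unique, as discussed above.

```python
import numpy as np

def lvq1_step(prototypes, labels, x, c, eta=0.05):
    """One LVQ1 step in DTW space (sketch).

    prototypes: list of 1D arrays; labels: their class labels;
    x: input time series; c: class label of x.
    """
    # (ii) find the prototype closest to x w.r.t. the squared DTW distance
    dists = [dtw(x, w)[0] ** 2 for w in prototypes]
    k = int(np.argmin(dists))
    _, path = dtw(x, prototypes[k])          # one optimal warping path
    sign = 1.0 if labels[k] == c else -1.0   # attract or repel
    # (iii) move each element of the prototype toward / away from all
    # elements of x aligned with it by the warping path
    update = np.zeros_like(prototypes[k])
    for i, j in path:
        update[j - 1] += x[i - 1] - prototypes[k][j - 1]
    prototypes[k] += sign * eta * update
    return k
```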
3.1.3 Comments
We conclude this section with two remarks.
Remark 6.
Proposition 4 states that uniqueness of an optimal warping path between $x$ and $z$ implies differentiability of $\delta_x$ at $z$. The converse statement does not hold, in general. A more general approach to arrive at Prop. 4 and Corollary 5 is as follows: First, show that a function $f$ is locally Lipschitz continuous (llc). Then invoke Rademacher's Theorem [7] to assert almost-everywhere differentiability of $f$. By the rules of calculus of llc functions, we have:
- $\delta_x$ is llc on $\mathbb{F}^n$, because the minimum of continuously differentiable functions is llc.
- If the loss $\ell$ is llc, then $\ell \circ \delta_x$ is llc, because the composition of llc functions is llc. ∎
Remark 7.
Measure-theoretic, geometric, and analytical concepts are all based on Euclidean spaces rather than DTW spaces. The reason is that contemporary learning algorithms are formulated in such a way that the current solution $z$ and the input time series $x$ are first projected into the Euclidean space via optimally warping both to the same length. Then an update step is performed, and finally the updated solution is projected back to the DTW space. Therefore, to understand this form of learning under warping, we study the DTW distance as a function restricted to the Euclidean space $\mathbb{F}^m \times \mathbb{F}^n$, where $m$ is the length of $x$ and $n$ is the length of $z$. ∎
4 Conclusion
The multi-path set is negligible and corresponds to the union of zero sets of exponentially many quadratic forms. As a subset of the multi-path set, the multi-optimal-path set is also negligible. Therefore optimal warping paths are unique almost everywhere. The implications of the proposed results are that adverse effects on learning in DTW spaces caused by non-unique optimal warping paths can be controlled, and that learning in DTW spaces amounts to minimizing the respective cost function by (stochastic) gradient descent almost everywhere.
Acknowledgements.
B. Jain was funded by the DFG Sachbeihilfe JA 2109/4-1.
Appendix A Proofs
The appendix presents the proofs of Theorem 1 and Proposition 4. We first consider the univariate case ($d = 1$) in Sections A.1 and A.2. Section A.1 introduces a representation that is more convenient for proving the main results of this contribution and derives some auxiliary results. Section A.2 proves the proposed results for the univariate case. Finally, Section A.3 generalizes the proofs to the multivariate case.
A.1 Preliminaries
We assume that elements are from $\mathbb{R}$, that is, $\mathbb{F} = \mathbb{R}$. We write $x \in \mathbb{R}^m$ instead of $x \in \mathbb{F}^m$ to denote a time series of length $m$. By $e_i$ we denote the $i$-th standard basis vector of appropriate dimension with elements
$$e_{i,j} = \begin{cases} 1 & i = j \\ 0 & i \neq j. \end{cases}$$
Definition 8.
Let $p = (p_1, \ldots, p_L)$ be a warping path with points $p_l = (i_l, j_l)$. Then
$$\Phi = \left( e_{i_1}, \ldots, e_{i_L} \right)^\mathsf{T} \in \mathbb{R}^{L \times m} \quad \text{and} \quad \Psi = \left( e_{j_1}, \ldots, e_{j_L} \right)^\mathsf{T} \in \mathbb{R}^{L \times n}$$
is the pair of embedding matrices induced by warping path $p$.
The embedding matrices have full column rank due to the boundary and step conditions of the warping path. Thus, we can regard the embedding matrices of warping path $p$ as injective linear maps $\Phi: \mathbb{R}^m \to \mathbb{R}^L$ and $\Psi: \mathbb{R}^n \to \mathbb{R}^L$ that embed time series $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$ into $\mathbb{R}^L$ by matrix multiplication $\Phi x$ and $\Psi y$. We can express the cost of aligning time series $x$ and $y$ along warping path $p$ by the squared Euclidean distance between their induced embeddings.
Proposition 9.
Let $\Phi$ and $\Psi$ be the embeddings induced by warping path $p \in \mathcal{P}_{m,n}$. Then
$$C_p(x, y) = \lVert \Phi x - \Psi y \rVert^2$$
for all $x \in \mathbb{R}^m$ and all $y \in \mathbb{R}^n$.
Proof.
[21], Proposition A.2. ∎
Next, we define the warping and valence matrices of a warping path.
Definition 10.
Let $\Phi$ and $\Psi$ be the pair of embedding matrices induced by warping path $p$. Then the valence matrix $V$ and the warping matrix $W$ of warping path $p$ are defined by
$$V = \Phi^\mathsf{T} \Phi \in \mathbb{R}^{m \times m} \quad \text{and} \quad W = \Phi^\mathsf{T} \Psi \in \mathbb{R}^{m \times n}.$$
The definition of valence and warping matrix is oriented in the following sense: the warping matrix $W$ warps time series $y$ to the time axis of time series $x$. The diagonal elements of the valence matrix $V$ count the number of elements of $y$ warped onto the same element of $x$. Alternatively, we can define the complementary valence and warping matrices of $p$ by
$$\bar{V} = \Psi^\mathsf{T} \Psi \in \mathbb{R}^{n \times n} \quad \text{and} \quad \bar{W} = \Psi^\mathsf{T} \Phi \in \mathbb{R}^{n \times m}.$$
The complementary warping matrix $\bar{W}$ warps time series $x$ to the time axis of time series $y$. The diagonal elements of the complementary valence matrix $\bar{V}$ count the number of elements of $x$ warped onto the same element of $y$.
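The following sketch (our own illustration) constructs the embedding, valence, and warping matrices of a given warping path and verifies Proposition 9 numerically.

```python
import numpy as np

def embeddings(path, m, n):
    """Embedding matrices (Phi, Psi) of a warping path (Definition 8)."""
    L = len(path)
    Phi, Psi = np.zeros((L, m)), np.zeros((L, n))
    for l, (i, j) in enumerate(path):
        Phi[l, i - 1] = 1.0   # l-th row of Phi is the basis vector e_i
        Psi[l, j - 1] = 1.0   # l-th row of Psi is the basis vector e_j
    return Phi, Psi

# Example: a warping path in the (3 x 2)-lattice
p = [(1, 1), (2, 1), (3, 2)]
Phi, Psi = embeddings(p, 3, 2)
x, y = np.array([1.0, 2.0, 3.0]), np.array([1.0, 3.0])

cost = sum((x[i - 1] - y[j - 1]) ** 2 for i, j in p)
assert np.isclose(cost, np.sum((Phi @ x - Psi @ y) ** 2))  # Proposition 9

V, W = Phi.T @ Phi, Phi.T @ Psi    # valence and warping matrix (Def. 10)
print(np.diag(V))  # -> [1. 1. 1.]: elements of y aligned to each x_i
```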
Let $p$ be a warping path of length $L$ with induced embedding matrices $\Phi$ and $\Psi$. The aggregated embedding matrix induced by warping path $p$ is defined by
$$\Theta = \left( \Phi, \; -\Psi \right) \in \mathbb{R}^{L \times N},$$
where $N = m + n$. Then the symmetric matrix $\Theta^\mathsf{T} \Theta$ is of the form
$$\Theta^\mathsf{T} \Theta = \begin{pmatrix} \Phi^\mathsf{T} \Phi & -\Phi^\mathsf{T} \Psi \\ -\Psi^\mathsf{T} \Phi & \Psi^\mathsf{T} \Psi \end{pmatrix} = \begin{pmatrix} V & -W \\ -\bar{W} & \bar{V} \end{pmatrix}.$$
We use the following notation: a pair of time series $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$ is summarized to the point $z = (x, y) \in \mathbb{R}^N$.
The next result expresses the cost $C_p(x, y)$ by the matrix $\Theta$.
Lemma 11.
Let $\Theta$ be the aggregated embedding matrix induced by warping path $p \in \mathcal{P}_{m,n}$. Then we have
$$C_p(x, y) = z^\mathsf{T} \left( \Theta^\mathsf{T} \Theta \right) z$$
for all $z = (x, y) \in \mathbb{R}^N$.
Proof.
Suppose that $\Theta = (\Phi, -\Psi)$, where $\Phi$ and $\Psi$ are the embedding matrices induced by $p$. Let $z = (x, y) \in \mathbb{R}^N$. Then we have
$$z^\mathsf{T} \left( \Theta^\mathsf{T} \Theta \right) z = \lVert \Theta z \rVert^2 = \lVert \Phi x - \Psi y \rVert^2 = C_p(x, y),$$
where the last equality follows from Proposition 9.
∎
The last auxiliary result shows that the zero set of a nonzero quadratic form has measure zero.
Lemma 12.
Let matrix $S \in \mathbb{R}^{N \times N}$ be nonzero and symmetric. Then
$$\lambda\left( \mathcal{Z}(f) \right) = 0 \quad \text{for } f(z) = z^\mathsf{T} S z,$$
where $\lambda$ is the Lebesgue measure on $\mathbb{R}^N$.
Proof.
Since $S$ is symmetric, there is an orthogonal matrix $U$ such that $D = U^\mathsf{T} S U$ is a diagonal matrix. Consider the function
$$q: \mathbb{R}^N \to \mathbb{R}, \quad u \mapsto u^\mathsf{T} D u = \sum_{i=1}^{N} d_i u_i^2,$$
where the $d_i$ are the diagonal elements of $D$. Since $S$ is nonzero, there is at least one $d_i \neq 0$. Hence, $q$ is a nonzero polynomial on $\mathbb{R}^N$. Then the set $\mathcal{Z}(q)$ is measurable and has measure zero [4].
We show that the set $\mathcal{Z}(f)$ is also a set of measure zero. Consider the linear map
$$\varphi: \mathbb{R}^N \to \mathbb{R}^N, \quad u \mapsto U u.$$
First, we show that $\varphi(\mathcal{Z}(q)) = \mathcal{Z}(f)$.
- $\subseteq$: Suppose that $u \in \mathcal{Z}(q)$. With $z = \varphi(u) = U u$ we have
$$f(z) = z^\mathsf{T} S z = u^\mathsf{T} U^\mathsf{T} S U u = u^\mathsf{T} D u = q(u) = 0.$$
This shows that $z \in \mathcal{Z}(f)$. From $z = \varphi(u)$ follows that $\varphi(\mathcal{Z}(q)) \subseteq \mathcal{Z}(f)$.
- $\supseteq$: Let $z \in \mathcal{Z}(f)$. Then $z = U u$ for some $u \in \mathbb{R}^N$, since $U$ is invertible. Hence, $z = \varphi(u)$ and we have
$$q(u) = u^\mathsf{T} D u = u^\mathsf{T} U^\mathsf{T} S U u = z^\mathsf{T} S z = f(z) = 0.$$
This shows that $z \in \varphi(\mathcal{Z}(q))$.
Next, we show that $\lambda(\mathcal{Z}(f)) = 0$. Observe that the linear map $\varphi$ is continuously differentiable on a measurable set with Jacobian determinant $\det U$. Applying [2], Prop. 3.7.3 gives
$$\lambda\left( \varphi(\mathcal{Z}(q)) \right) \leq \int_{\mathcal{Z}(q)} \lvert \det U \rvert \, d\lambda.$$
Since $U$ is orthogonal, we have $\lvert \det U \rvert = 1$. Thus, we find that
$$\lambda\left( \mathcal{Z}(f) \right) = \lambda\left( \varphi(\mathcal{Z}(q)) \right) \leq \int_{\mathcal{Z}(q)} 1 \, d\lambda = \lambda\left( \mathcal{Z}(q) \right) = 0.$$
Finally, the assertion follows from $\lambda(\mathcal{Z}(f)) \geq 0$. ∎
A.2 Proof of Theorem 1 and Proposition 4
This section assumes the univariate case ($d = 1$).
Proof of Theorem 1:
Suppose that $\mathcal{P}_{m,n} = \{p_1, \ldots, p_D\}$, where $D$ denotes the number of warping paths in $\mathcal{L}_{m,n}$. We use the following notations for all $k \in [D]$:
- $\Theta_k$ denotes the aggregated embedding matrix induced by warping path $p_k$.
- $V_k$ and $W_k$ are the valence and warping matrices of $p_k$.
- $\bar{V}_k$ and $\bar{W}_k$ are the complementary valence and warping matrices of $p_k$.
- $C_k(z)$ with $z = (x, y)$ denotes the cost of aligning $x$ and $y$ along warping path $p_k$.
For every $k, l \in [D]$ with $k \neq l$ and for every $z \in \mathbb{R}^N$, we have
$$C_k(z) - C_l(z) = z^\mathsf{T} \left( \Theta_k^\mathsf{T} \Theta_k \right) z - z^\mathsf{T} \left( \Theta_l^\mathsf{T} \Theta_l \right) z = z^\mathsf{T} S_{kl} \, z,$$
where $S_{kl} = \Theta_k^\mathsf{T} \Theta_k - \Theta_l^\mathsf{T} \Theta_l$ is a symmetric matrix of the form
$$S_{kl} = \begin{pmatrix} V_k - V_l & W_l - W_k \\ \bar{W}_l - \bar{W}_k & \bar{V}_k - \bar{V}_l \end{pmatrix}.$$
For $k \neq l$ the warping paths $p_k$ and $p_l$ are different, implying that the warping matrices $W_k$ and $W_l$, resp., are also different. Hence, $S_{kl}$ is nonzero and from Lemma 12 follows that the zero set $\mathcal{Z}_{kl}$ of the quadratic form $z \mapsto z^\mathsf{T} S_{kl} \, z$ has measure zero. Then the union
$$\mathcal{Z} = \bigcup_{1 \leq k < l \leq D} \mathcal{Z}_{kl}$$
of finitely many measure zero sets also has measure zero. It remains to show that $\mathcal{X}' = \mathcal{Z}$.
- $\subseteq$: Suppose that $z = (x, y) \in \mathcal{X}'$. Then there are indices $k, l \in [D]$ with $k \neq l$ such that the costs of aligning $x$ and $y$ along warping paths $p_k$ and $p_l$ are identical, that is $C_k(z) = C_l(z)$. Setting the costs equal gives
$$z^\mathsf{T} S_{kl} \, z = C_k(z) - C_l(z) = 0.$$
Hence, $z \in \mathcal{Z}_{kl}$ and therefore $z \in \mathcal{Z}$.
- $\supseteq$: Let $z = (x, y) \in \mathcal{Z}$. Then there is a set $\mathcal{Z}_{kl}$ containing $z$. From $z^\mathsf{T} S_{kl} \, z = 0$ follows that $p_k$ and $p_l$ are two different warping paths between $x$ and $y$ with identical costs. Hence, $z \in \mathcal{X}'$. This proves the assertion. ∎
Proof of Proposition 4:
To show the proposition, we first define the notion of a piecewise smooth function. A function $f: \mathbb{R}^n \to \mathbb{R}$ is piecewise smooth if it is continuous on $\mathbb{R}^n$ and for each $z \in \mathbb{R}^n$ there is a neighborhood $U$ of $z$ and a finite collection of continuously differentiable functions $f_1, \ldots, f_r: U \to \mathbb{R}$ such that
$$f(u) \in \left\{ f_1(u), \ldots, f_r(u) \right\}$$
for all $u \in U$.
Proof.
We show that the function $\delta_x$ is piecewise smooth. The function $\delta_x$ is continuous, because all cost functions $C_p(x, \cdot)$ are continuous and continuity is closed under the min-operation. In addition, the functions $f_p(\cdot) = \sqrt{C_p(x, \cdot)}$ are continuously differentiable as functions in the second argument. Let $z \in \mathbb{F}^n$ and let $U$ be a neighborhood of $z$. Consider the index set
$$\mathcal{I} = \left\{ p \in \mathcal{P}_{m,n} \,:\, C_p(x, u) = \delta^2(x, u) \text{ for some } u \in U \right\}.$$
By construction, we have $\delta_x(u) \in \{ f_p(u) : p \in \mathcal{I} \}$ for all $u \in U$. This shows that $\delta_x$ is piecewise smooth. Then the assertion follows from [19], Lemma 2. ∎
A.3 Generalization to Multivariate Time Series
We briefly sketch how to generalize the results from the univariate to the multivariate case. The basic idea is to reduce the multivariate case to the univariate case. In the following, we assume that $x \in \mathbb{F}^m$ and $y \in \mathbb{F}^n$ are two $d$-variate time series and $p \in \mathcal{P}_{m,n}$ is a warping path between $x$ and $y$ with elements $p_l = (i_l, j_l)$.
First observe that a $d$-variate time series $x$ consists of $d$ individual component time series $x^{(1)}, \ldots, x^{(d)} \in \mathbb{R}^m$. Next, we construct the embeddings of a warping path. The $d$-variate time warping embeddings $\Phi_d$ and $\Psi_d$ induced by $p$ are maps of the form
$$\Phi_d: \mathbb{F}^m \to \mathbb{F}^L \quad \text{and} \quad \Psi_d: \mathbb{F}^n \to \mathbb{F}^L.$$
The maps $\Phi_d$ and $\Psi_d$ can be written as
$$\Phi_d(x) = \left( \Phi x^{(1)}, \ldots, \Phi x^{(d)} \right) \quad \text{and} \quad \Psi_d(y) = \left( \Psi y^{(1)}, \ldots, \Psi y^{(d)} \right),$$
where $\Phi$ and $\Psi$ are the embedding matrices induced by $p$. Since $\Phi$ and $\Psi$ are linear, the maps $\Phi_d$ and $\Psi_d$ are also linear maps. We show the multivariate formulation of Prop. 9.
Proposition 13.
Let $\Phi_d$ and $\Psi_d$ be the $d$-variate embeddings induced by warping path $p \in \mathcal{P}_{m,n}$. Then
$$C_p(x, y) = \lVert \Phi_d(x) - \Psi_d(y) \rVert^2$$
for all $x \in \mathbb{F}^m$ and all $y \in \mathbb{F}^n$.
Proof.
The assertion follows from
$$C_p(x, y) = \sum_{(i, j) \in p} \lVert x_i - y_j \rVert^2 = \sum_{(i, j) \in p} \sum_{k=1}^{d} \left( x_i^{(k)} - y_j^{(k)} \right)^2 = \sum_{k=1}^{d} \lVert \Phi x^{(k)} - \Psi y^{(k)} \rVert^2 = \lVert \Phi_d(x) - \Psi_d(y) \rVert^2.$$
∎
Due to the properties of product spaces and product measures, the proofs of all other results can be carried out componentwise.
References
[1] C. Banderier and S. Schwer. Why Delannoy numbers? Journal of Statistical Planning and Inference, 135(1):40–54, 2005.
[2] V. Bogachev. Measure Theory. Springer-Verlag Berlin Heidelberg, 2007.
[3] M. Brill, T. Fluschnik, V. Froese, B. Jain, R. Niedermeier, and D. Schultz. Exact mean computation in dynamic time warping spaces. SIAM International Conference on Data Mining, 2018 (accepted).
[4] R. Caron and T. Traynor. The zero set of a polynomial. WSMR Report, University of Windsor, 2005.
[5] M. Cuturi and M. Blondel. Soft-DTW: A differentiable loss function for time-series. International Conference on Machine Learning, 2017.
[6] P. Esling and C. Agon. Time-series data mining. ACM Computing Surveys, 45:1–34, 2012.
[7] L.C. Evans and R.F. Gariepy. Measure Theory and Fine Properties of Functions. CRC Press, 1992.
[8] T. Fu. A review on time series data mining. Engineering Applications of Artificial Intelligence, 24(1):164–181, 2011.
[9] P.R. Halmos. Measure Theory. Springer-Verlag, 2013.
[10] V. Hautamaki, P. Nykanen, and P. Franti. Time-series clustering by approximate prototypes. International Conference on Pattern Recognition, 2008.
[11] B. Jain. Generalized gradient learning on time series. Machine Learning, 100(2-3):587–608, 2015.
[12] B. Jain and D. Schultz. Asymmetric learning vector quantization for efficient nearest neighbor classification in dynamic time warping spaces. Pattern Recognition, 76:349–366, 2018.
[13] B. Jain. Warped-linear models for time series classification. arXiv preprint, arXiv:1711.09156, 2017.
[14] T. Kohonen and P. Somervuo. Self-organizing maps of symbol strings. Neurocomputing, 21(1-3):19–30, 1998.
[15] T. Kohonen. Self-Organizing Maps. Springer-Verlag Berlin Heidelberg, 2001.
[16] J.B. Kruskal and M. Liberman. The symmetric time-warping problem: From continuous to discrete. Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison, 1983.
[17] F. Petitjean, A. Ketterlin, and P. Gancarski. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, 44(3):678–693, 2011.
[18] F. Petitjean, G. Forestier, G.I. Webb, A.E. Nicholson, Y. Chen, and E. Keogh. Faster and more accurate classification of time series by exploiting a novel dynamic time warping averaging algorithm. Knowledge and Information Systems, 47(1):1–26, 2016.
[19] R.T. Rockafellar. A property of piecewise smooth functions. Computational Optimization and Applications, 25:247–250, 2003.
[20] H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, 1978.
[21] D. Schultz and B. Jain. Nonsmooth analysis and subgradient methods for averaging in dynamic time warping spaces. Pattern Recognition, 74:340–358, 2018.
[22] S. Soheily-Khah, A. Douzal-Chouakria, and E. Gaussier. Progressive and iterative approaches for time series averaging. Workshop on Advanced Analytics and Learning on Temporal Data, 2015.
[23] P. Somervuo and T. Kohonen. Self-organizing maps and learning vector quantization for feature sequences. Neural Processing Letters, 10(2):151–159, 1999.
[24] Z. Xing, J. Pei, and E. Keogh. A brief survey on sequence classification. ACM SIGKDD Explorations Newsletter, 12(1):40–48, 2010.