Time series such as stock prices, climate data, energy usages, sales, biomedical measurements, and biometric data are sequences of time-dependent observations that often vary in temporal dynamics, that is in length, speed, and shifts in phase. For example, the same word can be uttered with different speaking speeds. Similarly, monthly temperature or precipitation extremes of certain regions can differ in duration and may occur out of phase for a period of a few weeks.
. An intricate problem in DTW-based time series mining is time series averaging. The problem consists in finding a typical representative that summarizes a sample of time series. Different forms of time series averaging have been applied to improve nearest neighbor classifiers[14, 19], to accelerate similarity search , and to formulate -means clustering in DTW spaces [19, 12, 22].
A standard approach to time series averaging under DTW is based on an idea by Fréchet : Suppose that is a sample of time series. A sample mean of is any time series that minimizes the Fréchet function
where is the DTW-distance and is a set of time series of finite length. The search space typically takes two forms: (i) is the set of all time series of finite length and (ii) is the set of all time series of length . We refer to (i) as the unconstrained and to (ii) as the constrained sample mean problem.
A sample mean is guaranteed to exist in either case but may not be unique . In addition, computing a sample mean is NP-hard . Consequently, there is an ongoing research on devising heuristics for minimizing the Fréchet function. Most contributions focus on devising and applying heuristics for the constrained sample mean problem. State-of-the-art algorithms are stochastic subgradient methods , majorize-minimize algorithms [12, 18], and soft-DTW . In contrast, only few work has been done for solving the unconstrained sample mean problem. One algorithm is an (essentially optimal) dynamic program that exactly solves the unconstrained problem in exponential time . A second algorithm is a heuristic, called adaptive DBA (ADBA) . This algorithm uses a majorize-minimize algorithm (DBA) as a base-algorithm and iteratively refines subsequences to improve the solution quality.
Currently, there is no clear understanding of the characteristic properties, advantages, and disadvantages of both types of sample means. We can approach the sample mean problem theoretically and empirically. A prerequisite for an empirical approach towards a better understanding of the sample mean problem are sufficiently powerful averaging algorithms. Compared to the constrained sample mean problem, algorithms for the unconstrained sample mean problem are underdeveloped.
In this work, we propose an average-compress (AC) algorithm for the unconstrained sample mean problem. The algorithm repeatedly alternates between averaging (A-step) and compression (C-step). The A-step requires a time series as initial guess, minimizes the Fréchet function, and returns an approximate solution as output. The C-step compresses the approximation of the A-step to obtain an improved solution. The compressed solution of the C-step serves as initial guess of the A-step in the next iteration. Compression is motivated by empirical observations that an unconstrained sample mean is typically shorter than the sample time series to be averaged . In principle, any averaging algorithm and any compression method can be applied. Here, we propose a compression method that minimizes the squared DTW error between original and compressed time series. Empirical results suggest that the AC scheme substantially outperforms state-of-the-art heuristics including ADBA.
2 Average-Compress Algorithm
In this section, we develop an average-compress (AC) algorithm for approximately solving the unconstrained sample mean problem. To this end, we first introduce the DTW-distance (Section 2.1), the concept of a sample mean under DTW (Section 2.2), and compressions (Section 2.3). Thereafter, we describe the AC algorithm in Section 2.4.
2.1 Dynamic Time Warping
For a given , we write . A time series is a sequence with elements for all . We denote the length of time series by , the set of time series of length by , and the set of all time series of finite length by . Consider the ()-grid defined as
A warping path of order and length is a sequence through the grid consisting of points such that
for all .
The first condition is the boundary condition and the second condition is the step condition of the DTW-distance. We denote the set of all warping paths of order by . Suppose that is a warping path with points for all . Then defines an expansion (or warping) of the time series and to the length- time series and . By definition, the length of a warping path satisfies .
The cost of warping time series and along warping path is defined by
where denotes the Euclidean norm and and are the expansions defined by . The DTW-distance of and is
A warping path with is called an optimal warping path of and . By definition, the DTW-distance minimizes the Euclidean distance between all possible expansions that can be derived from warping paths. Computing the DTW-distance and deriving an optimal warping path is usually solved via dynamic programming [20, 25].
2.2 Sample Means under DTW
Let be a sample of time series . Note that is a multiset that allows multiple instances of the same elements. A sample mean of is any time series that minimizes the Fréchet function 
where is a subset of time series. The value is the Fréchet variation of sample at . The infimum serves as a measure of variability of . Here, the search space takes one of the following two forms: (i) and (ii) . We refer to (i) as the unconstrained and to (ii) as the constrained sample mean problem. Note that the constrained formulation only restricts the length of the candidate solutions, whereas there is no length restriction on the sample time series to be averaged.
A sample mean exists in either case but is not unique in general . This result implies that attains its infimum (has a unique minimum). However, computing a sample mean is NP-hard . The implication is that we often need to resort to heuristics that return useful solutions within acceptable time.
We briefly describe two state-of-the-art algorithms for the constrained sample mean problem: a stochastic subgradient method (SSG)  and a majorize-minimize algorithm (DBA) [12, 18]. For a detailed description of both algorithms, we refer to .
To present the update rule of both algorithms in a compact form, we introduce the notions of warping and valence matrix as proposed by . Suppose that is a warping path. The warping matrix of is the zero-one matrix with elements
The valence matrix of warping path is the diagonal matrix with positive diagonal elements
Suppose that and are time series of length and . Then warps onto the time axis of . Each diagonal element of counts how many elements of are warped to element .
Stochastic Subgradient Algorithm.
Subgradient methods for time series averaging have been proposed by . Algorithm 1 outlines a vanilla version of the SSG algorithm with constant learning rate . In practice, more sophisticated stochastic subgradient variants such as Adam  are preferred. The input of Algorithm 1 are a learning rate , a length-parameter of the constrained search space, and a sample of time series to be averaged. The output is a time series with lowest Fréchet variation that has been encountered during optimization.
Majorize-minimize algorithms for time series averaging have been proposed in the 1970s mainly by Rabiner and his co-workers with speech recognition as the primary application [19, 27]. The early approaches fell largely into oblivion and where successively rediscovered, consolidated, and improved in a first step by Abdulla et al.  in 2003 and then finalized in 2008 by Hautamaki et al. . In 2011, Petitjean et al.  reformulated, explored, and popularized the majorize-minimize algorithm by Hautamaki et al.  under the name DTW Barycenter Averaging (DBA).
Let be a time series of length . A compression of is a time series of length that maintains some desirable problem-specific properties of . By definition, is also a compression of itself. A compression chain of is a sequence of compressions of such that
There are numerous compression methods such as principal component analysis, discrete Fourier transform, discrete wavelet transform, and many more. Here, we consider two simple methods: adaptive scaling (ADA) and minimum squared DTW error (MSE).
Algorithm 3 describes ADA. The procedure takes a time series as input and returns a compression chain consisting of compressions of of length to . To compress a time series of length to a time series of length , ADA merges two consecutive elements with minimal distance. The merge subroutine in Algorithm 3 replaces these two consecutive time points by their average.
Finding the smallest distance in Line 4 takes time. In each iteration, the length of is reduced by one. Thus, the complexity of computing all compressions is .
Minimum Squared DTW Error Compression.
The second compression method computes a time series of a given length such that the squared DTW error is minimized. Let be a time series of length and let . We call each
an MSE compression of of length . Observe that the MSE compression problem for is the constrained sample mean problem of the sample . Algorithm 4 outlines MSE compression.
It is not hard to see that for a compression , an optimal warping path between and warps every element of to exactly one element of , that is, . Thus, we can write
where . Let , , and . The squared DTW error of is
MSE compression is also known as adaptive piecewise constant approximation  and as segmentation problem . It can be solved exactly via dynamic programming in time . Moreover, the dynamic program allows to find all compressions (for each length ) in time by running it once for (as it is done in Algorithm 4).
Interestingly, MSE compression is also related to one-dimensional -means clustering. To see this relationship, consider an optimal warping path between the compression and the original time series as in Eq. (1). Then, the squared DTW error is minimal for . Thus, finding an MSE compression of length can also be seen as a one-dimensional -means clustering problem, where every cluster consists of a consecutive subsequence of elements in . Indeed, the dynamic program in Algorithm 4 is the same as for one-dimensional -means  (without previously sorting the elements in ).
To reduce the computational complexity, several heuristics and approximations for MSE compression have been proposed [24, 15, 28, 8]. Also ADA compression can be regarded as a heuristic for MSE compression since it greedily averages two consecutive elements.
To conclude, with MSE compression, we consider an exact solution method to a sound compression problem and with ADA compression, we consider a fast heuristic. Among the various heuristics, we have chosen ADA compression because it has been successfully tested for improving approximate solutions of the constrained sample mean problem .
2.4 The Average-Compress Algorithm
In this section, we assemble the pieces of the previous sections and propose a generic average-compress (AC) algorithm for approximately solving the unconstrained sample mean problem.
The AC algorithm alternates between averaging (A-step) and compression (C-step). For this purpose, any averaging algorithm and any compression method can be used. Algorithm 5 depicts the generic procedure. The input of the algorithm is a sample of time series and an initial guess . It then repeatedly applies the following steps until termination:
A-step: approximate sample mean
C-step: compute a compression chain
Select the shortest compression such that for all .
If , set and go to Step 1, otherwise terminate.
Line 7 computes the complete compression chain that consists of all compressions of of lengths to . To accelerate the algorithm at a possible expense of solution quality, sparse compression chains can be considered.
In the following, we explain why and under which conditions compression is useful. To simplify our argument, we assume that AC uses an averaging algorithm for the constrained sample mean problem (such as SSG or DBA). In this case, the length of the initial guess restricts the search space of AC to the set of all time series of maximum length . The choice of the length-parameter via the initial guess is critical. If is too small, the search space may not contain an unconstrained sample mean. For a given sample , the Reduction Theorem  guarantees the existence of an unconstrained sample mean of a length at most . Consequently, we can safely constrain the search space to for solving the unconstrained sample mean problem. Then, a naive approach to minimize the Fréchet function on is to solve constrained sample mean problems on and then to pick the solution with lowest Fréchet variation. When using state-of-the-art heuristics for the constrained problems, the naive approach is computationally infeasible.
The purpose of compression is to substantially accelerate the intractable naive approach at the expense of solution quality. Instead of solving all constrained problems, the AC algorithm uses compressions to select a few promising search spaces with . Starting with , the solution found in is compressed in order to determine the next search space . The length-parameter of the next search space corresponds to the length of the compression of with lowest Fréchet variation. Obviously, this idea only accelerates the naive approach if the length of the best compression is substantially smaller than the length of the previous solution.
The theoretical upper bound provided by the Reduction Theorem  is usually very large such that existing state-of-the-art heuristics for solving the constrained problem on are computationally intractable. In this case, also the AC algorithm using such a heuristic would be infeasible. However, empirical results on samples of two time series of equal length suggest that the length of an unconstrained sample mean is more likely to be less than . Similar results for larger sample sizes are unavailable due to forbidding running times required for exact sample means. For solving constrained sample mean problems, it is common practice to choose within the range of the lengths of the sample time series . Within this range, experimental results showed that an approximate solution of a constrained sample mean can be improved by reducing its length using adaptive scaling . These findings suggest to choose the length-parameter within or slightly above the range of lengths of the sample time series.
Our goal is to assess the performance of the proposed AC algorithm. For our experiments, we use the data sets from the UCR archive . Appendix A.1 summarizes the parameter settings of the mean algorithms used in these experiments.
3.1 Comparison of ADA and MSE
We compared the performance of ADA and MSE as compression subroutines of the AC algorithm. We applied the following configurations:
|DBA||DTW Barycenter Averaging [12, 18]||–|
|DBA-ADA||DBA with ADA compression ||1|
|DBA-MSE||DBA with MSE compression||1|
|DBA-ADA||DBA with ADA compression|
|DBA-MSE||DBA with MSE compression|
Column refers to the number of iterations of the repeat-until loop of the AC Algorithm. Compression schemes with iterations run until convergence. We applied DBA and the four AC algorithms to approximate the class means of every UCR training set.111The UCR data sets have prespecified training and test sets.
To assess the performance of the mean algorithms, we recorded the percentage deviations, ranking distribution, and space-saving ratios. Here, we used the solutions of the DBA algorithm as reference. The percentage deviation of a mean algorithm is defined by
where is the solution of the DBA algorithm and is the solution of algorithm . Negative (positive) percentage deviations mean that algorithm has better (worse) Fréchet variation than DBA. The ranking distribution summarizes the rankings of every mean algorithm over all samples. The best (worst) algorithm is ranked first (last). The space-saving ratio of algorithm is
A positive (negative) space-saving ratio means that the solution is shorter (longer) than .
summarizes the results. The top table shows the average, standard deviation, minimum, and maximum percentage deviations from the Fréchet variation of the DBA algorithm (lower is better). The table in the middle shows the distribution of rankings and their corresponding averages and standard deviations. The best (worst) algorithm is ranked first (fifth). Finally, the bottom table shows the average, standard deviation, minimum, and maximum space-saving ratios (higher is better).
All AC variants improved the solutions of the DBA baseline by to on average and in the best case. By construction, an AC solution is never worse than a DBA solution. The best method is DBA-MSE with average rank followed by DBA-MSE and DBA-ADA with average ranks and , respectively. These three methods clearly outperformed DBA-ADA proposed by Petitjean et al. . The main improvement of DBA-MSE and DBA-ADA occurs at the first iteration.
We considered the lengths of the approximated means. Recall that all mean algorithms started with the same initial guess (medoid). The length of a DBA solution corresponds to the length of its initial guess, whereas solutions of AC algorithms are likely to be shorter by construction. The bottom table shows that solutions of ADA compression save about a third (, ) of the length of DBA solutions on average and solutions of MSE compressions close to a half (, ). As for the Fréchet variation, most of the space-saving occurs in the first iteration of an AC algorithm.
3.2 Comparison of AC Algorithms
The goal of the second experiment is to compare the performance of the following mean algorithms:
|DBA||DTW Barycenter Averaging [12, 18]||–|
|SSG||stochastic subgradient method ||–|
|ADBA||adaptive DBA ||–|
|DBA-MSE||DBA with MSE compression|
|SSG-MSE||SSG with MSE compression|
|ADBA-MSE||adaptive DBA with MSE compression|
Table 2 summarizes the results using the same legend as in Table 1. The percentage deviations and rankings suggest that the three AC variants DBA-MSE, SSG-MSE, and ADBA-MSE performed substantially better than the corresponding base algorithms DBA, SSG, and ADBA, respectively. The SSG-MSE algorithm performed best with an average rank of , followed by ADBA-MSE (), SSG (), and DBA-MSE (). Interestingly, ADBA performed worse than DBA-MSE. Both, ADBA and DBA-MSE, are based on the DBA algorithm. The difference between both algorithms is that ADBA compresses and expands selected subsequences of the current DBA solution, whereas DBA-MSE only compresses the current DBA solution. The results indicate that simple MSE compression on the entire sequence appears to be a better strategy than ADBA’s compression and expansion schemes on selected subsequences. Notably, SSG performed best among the three base averaging algorithms DBA, SSG, ADBA, and performed even better than DBA-MSE. These results are in contrast to those presented in , where ADBA outperformed SSG (and also DBA). Our findings confirm that the performance of SSG substantially depends on a careful selection of an optimizer (such as Adam) and a proper choice of the initial learning rate.
Next, we examine the length of the solutions. Note that SSG also does not alter the length of its initial guess such that . The bottom table shows that MSE compression schemes reduce the length of the solutions obtained by their corresponding base algorithm (DBA, SSG, and ADBA). The space-saving ratios of the AC variants are roughly independent of the particular base algorithm for mean computation (0.43–0.45). Notably, the base algorithm ADBA is more likely to compress rather than to expand the DBA solutions. This finding is in line with the hypothesis that an exact mean is typically shorter than the length of the sample time series .
3.3 Application: k-Means Clustering
In this experiment, we investigated how the quality of a mean algorithm affects the quality of a -means clustering. Let be a set of finite time series. The goal of -means is to find a set of centroids such that the -means error
is minimized. We used DBA, SSG, ADBA, DBA-MSE, SSG-MSE, and ADBA-MSE for computing the set of centroids. We applied the six variants of -means to UCR data sets and excluded UCR data sets due to overly long running times (see Appendix A.2). We merged the prespecified training and test sets. The number of clusters was set to the number of classes and the centroids were initialized by the class medoids.
Table 3 summarizes the results. The top table presents the average, standard deviation, minimum, and maximum percentage deviations from the respective minimum -means error (lower is better). The percentage deviation of -means algorithm for data set is defined by
where is the set of centroids returned by algorithm and is the best solution obtained by one of the six -means algorithms. The bottom table shows the distribution of rankings and their corresponding averages and standard deviations. The best (worst) algorithm is ranked first (sixth)
The results show that the AC approach substantially improved all -means variants using one of the base averaging methods (DBA, SSG, ADBA). Notably, SSG-MSE performed best with an average percentage deviation of and an average rank of , followed by ADBA-MSE ( and ). The average percentage deviations of DBA and DBA-MSE are substantially impacted by the results on a single data set (DiatomSizeReduction). Removing the DiatomSizeReduction data set yields an average percentage deviation of for DBA and for DBA-MSE, whereas the other average percentage deviations remain unchanged up to . These findings confirm the hypothesis raised by Brill et al.  that better mean algorithms more likely result in lower -means errors.
We proposed a generic average-compress algorithm for the unconstrained sample mean problem in DTW spaces. Starting with an initial guess of sufficient length, the AC algorithm alternates between averaging and compression. In principle, any averaging and any compression algorithm can be plugged into the AC scheme. The compression guides the algorithm to promising search spaces of shorter time series. This approach is theoretically justified by the Reduction Theorem  that guarantees the existence of an unconstrained sample mean in a search space of bounded length. Experimental results show that the AC algorithm substantially outperforms state-of-the-art heuristics for time series averaging. In addition, we observed that better averaging algorithms yield lower -means errors on average. Open research questions comprise empirical analysis of alternative compression methods for the AC algorithm and reducing its computational effort.
Appendix A Experimental Settings
a.1 Hyperparameter Settings
In all experiments, we selected the sample medoid as initial guess of a mean algorithm. The DBA algorithm terminated after convergence and latest after epochs (cycles through a sample). The ADBA algorithm terminates subsequence optimization when the sum of the scaling coefficients changes its sign and latest after iterations. The SSG algorithm terminated after iterations without observing an improvement and latest after epochs. As optimization scheme, SSG applied Adam  with as first and as second momentum. To cope with the problem of selecting an initial learning rate, we used the procedure described in Algorithm 6. The input is a sample of size . The output is the best solution found. The algorithm terminates if the solution did not improve for two consecutive learning rates and latest if .
a.2 Data Sets Excluded From k-Means Experiments
The following list contains all UCR data sets excluded from -means clustering due to computational reasons:
-  A. Abanda, U. Mori, and J.A. Lozano. A review on distance based time series classification. Data Mining and Knowledge Discovery, 33(2):378–412, 2019.
-  W.H. Abdulla, D. Chow, and G. Sin. Cross-words reference template for DTW-based speech recognition systems. Conference on Convergent Technologies for Asia-Pacific Region, 2003.
- Aghabozorgi et al.  S. Aghabozorgi, A.S. Shirkhorshidi, and T.-Y. Wah. Time-series clustering – A decade review. Information Systems, 53:16–38, 2015.
- Bagnall et al.  A. Bagnall, J. Lines, A. Bostrom, J. Large, and E. Keogh. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery, 31(3):606–660, 2017.
-  R. Bellman. On the approximation of curves by line segments using dynamic programming. Communications of the ACM, 4(6):284, 1961.
-  M. Brill, T. Fluschnik, V. Froese, B. Jain, R. Niedermeier, and D. Schultz. Exact Mean Computation in Dynamic Time Warping Spaces. Data Mining and Knowledge Discovery, 33(1):252-291, 2019.
-  L. Bulteau, V. Froese, and R. Niedermeier. Hardness of Consensus Problems for Circular Strings and Time Series Averaging. CoRR, abs/1804.02854, 2018.
-  K. Chakrabarti, E. Keogh, S. Mehrotra, and M. Pazzani. Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. ACM Transactions on Database Systems, 27(2):188–228, 2002.
-  Y. Chen, E. Keogh, B. Hu, N. Begum, A. Bagnall, A. Mueen, and G.E. Batista. The UCR Time Series Classification Archive. www.cs.ucr.edu/~eamonn/time_series_data/, Accessed: 08/2018.
M. Cuturi and M. Blondel.
Soft-DTW: A Differentiable Loss Function for Time-Series.
Proceedings of the 34th International Conference on Machine Learning, 70:894–903, 2017.
-  M. Fréchet. Les éléments aléatoires de nature quelconque dans un espace distancié. Annales de l’institut Henri Poincaré, 215–310, 1948.
V. Hautamaki, P. Nykanen, P. Franti.
Time-series clustering by approximate prototypes.
International Conference on Pattern Recognition, 1–4, 2008.
-  B. Jain and D. Schultz. On the Existence of a Sample Mean in Dynamic Time Warping Spaces. CoRR, abs/1610.04460, 2016.
B. Jain and D. Schultz.
Asymmetric learning vector quantization for efficient nearest neighbor classification in dynamic time warping spaces.Pattern Recognition, 76:349–366, 2018.
-  E. Keogh, K. Chakrabarti, S. Mehrotra, M. Pazzani, and S. Mehrotra. Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Journal of Knowledge and Information Systems, 3(3):263–286, 2001.
-  D. P. Kingma and J. L. Ba. Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, 2015.
-  Y. Liu, Y. Zhang, and M. Zeng. Adaptive Global Time Sequence Averaging Method Using Dynamic Time Warping. IEEE Transactions on Signal Processing, 67(8):2129–2142, 2019.
-  F. Petitjean, A. Ketterlin, and P. Gancarski. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition 44(3):678–693, 2011.
-  L.R. Rabiner and J.G. Wilpon. Considerations in applying clustering techniques to speaker-independent word recognition. The Journal of the Acoustical Society of America, 66(3):663–673, 1979.
-  H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, 1978.
-  D. Schultz and B. Jain. Nonsmooth analysis and subgradient methods for averaging in dynamic time warping spaces. Pattern Recognition, 74:340–358, 2018.
-  S. Soheily-Khah, A. Douzal-Chouakria, and E. Gaussier. Generalized -means-based clustering for temporal data under weighted and kernel time warp. Pattern Recognition Letters, 75:63–69, 2016.
-  C. W. Tan, G.I. Webb, and F. Petitjean. Indexing and classifying gigabytes of time series under time warping. Proceedings of the 2017 SIAM International Conference on Data Mining, 282–290, 2017.
-  E. Terzi and P. Tsaparas. Efficient algorithms for sequence segmentation. Proceedings of the 2006 SIAM International Conference on Data Mining, 316–327, 2006.
-  T.K. Vintsyuk. Speech discrimination by dynamic programming. Cybernetics, 4(1):52–57, 1968.
-  Haizhou Wang and Mingzhou Song. Ckmeans.1d.dp: Optimal -means Clustering in One Dimension by Dynamic Programming. The R Journal, 3(2):29–33, 2011.
-  J.G. Wilpon and L.R. Rabiner. A Modified -Means Clustering Algorithm for Use in Isolated Work Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(3):587–594, 1985.
-  B.-Y. Yi and C. Faloutsos. Fast time sequence indexing for arbitrary norms. International Conference on Very Large Databases, 385–394, 2000.