1 Introduction
The goto algorithm for computing alignments between two audio clips is Dynamic Time Warping (DTW) [19, 20], and DTW and its variants have seen wide application in music processing applications [16]. However, the textbook version of exact DTW has quadratic memory constraints. While some MIR applications, such as cover song identification, can get away with coarse, beatsynchronous features [7] to remain in a low memory regime, other applications may require finer scale features and can quickly explode in memory requirements. For instance, in orchestral music, onsets are weak, so one must often revert to framelevel features for satisfactory alignments. Singing voices also present unique challenges in this regard [24], and both stringed instruments and singing voices have precise, expressive attacks at subbeat scales.
In this work, we present a simple divide and conquer variant of DTW to compute a globally optimal alignment between two audio sequences with linear memory. Our contributions are as theoretical as they are practical; though there are many approximate algorithms that work well in practice (Section 2), we are not aware of any other linear algorithms for DTW with this guarantee. A related advantage is that there are no approximation parameters to tune; there is only one exact cost (with some caveats on numerical precision in Section 4.2).
Once we establish the algorithm, we present an experiment on a handcurated collection of classical music (Section 4). Since our memory only scales linearly with a small factor (Section 4.3), we are able to run it on longer pieces, enabling us to evaluate the precision of approximation algorithms at scales not previously possible.
2 Background
2.1 The Textbook DTW Algorithm
We now briefly review the standard DTW algorithm for context. Given a (possibly multivariate) time series with points and a time series with points, there is a notion of allowable matchings that preserve the time order, known as a timeordered correspondence, or “warping path.” A warping path , is a correspondence between and ^{1}^{1}1A correspondence between two indexing sets and is a subset of the cartesian product so that every element of is contained in at least one element of and every element of is contained in at least one element of with ordered tuples of indices of and so that (assuming 0indexing) , , and . In other words, matched points between time series must always stay still or advance by at most one along each, and at least one must move forward.
Given a cost measure between the element in the first time series and the element in the second time series, , then an “optimal” or “exact” solution to the Dynamic Time Warping problem is a warping path that minimizes the sum
(1) 
We’ll refer to an optimal path as and the optimal cost as . It is possible to compute the and using a wellestablished dynamic programming approach, which is also shared among edit distance algorithms such as Smith Waterman [22]. In particular, if refers to the optimal cost of aligning the first points of to the first points of , then the following recurrence holds for
(2) 
the boundary conditions are set as
(3) 
After filling in the first row and column by Equation 3, it is possible to compute all values of by applying Equation 2 from left to right, row by row. After processing all pairs of subsets in this fashion, contains the optimal cost.
At this point in the algorithm, we merely have a cost, not an optimal warping path that realizes this cost. But if we store a second matrix which remembers one of the three “backpointers” LEFT, RIGHT, and UP that realized the minimum cost at that cell, then we can “backtrace” by following these arrows back from to to figure out the elements of an optimal in between.
2.2 Variants And DTW Algorithms in MIR
There are countless works that incorporate and expand on DTW, so we constrain our focus to approaches and conventions that apply to music processing [16], with a particular focus on techniques that accelerate the algorithm.
There is theory to suggest that in general, computation will always be the worstcase for optimal DTW [3], so many settle for approximate solutions. The socalled “Itakura Parallelogram” [12] and “SakoeChiba Band” [20] were early fixed global alignment restrictions proposed to reduce memory and computation. More adaptive algorithms have also been used to approximate the DTW alignment on audio streams. One popular such example is a recursive multiresolution algorithm known as “FastDTW” [21], which has been used to synchronize orchestral music at large scales [17, 18]. The algorithm computes the warping path of lower resolutions versions of the time series, and then it recursively constrains alignments at finer scales to lie within some band of the lower resolution path. It is guaranteed to have worst case runtime and memory consumption. A similar algorithm, known as “MemoryRestricted MultiScale DTW” (MrMsDTW) [18] was devised to have constant memory usage, where performance degrades gracefully with a decreasing constant memory, and this technique has proved useful in MIR synchronization applications to pedagogy [23]. We compare to both of these algorithms in Section 4.
Beyond this, researchers have cut down on memory with approximate algorithms that use overlapping blocks [13] and which greedily expand cells to evaluate [6], though, like all of the approximations we’ve mentioned so far, they have no worstcase approximation guarantees. The authors of [26] and [1], on the other hand, present some of the only algorithms with worst case guarantees for DTW cost in Euclidean spaces. They achieve linearithmic runtime complexity, with a runtime proportional to the geometric complexity of the time series and inversely proportional to the approximation ratio. There are also several exact algorithms in the literature that use parallel architectures both for DTW [5] and for the related problem SmithWaterman scoring between gene sequences [27] to speed up computation. We draw on these algorithms in our design in Section 3.1. However, they were designed simply to compute costs/scores, not to extract alignments, so we must build on this work to extract alignments (Section 3.2).
The closest work in spirit to ours is an algorithm for finding the longest common subsequence between strings [11], which also uses a divide and conquer scheme for subproblems that overlap on cells, yielding an algorithm of time complexity and space complexity like ours. However, it does not guarantee a globally optimal solution in the context of DTW [2].
3 Our Algorithm
3.1 Computing the Cost (DiagDTW)
The backbone of our algorithm relies on a different order of filling in subproblems of the alignment, the hardware implementation of which is an instance of a "linear systolic array" in computer architecture [5, 27]. This part is not yet novel, but it is crucial to our approach, so we review it here. Rather than processing the cells of matrix row by row, as in the textbook version, it is also possible to satisfy dependencies needed to complete the recurrences in Equation 2 while filling in along diagonals. Figure 1 shows the idea. If the diagonals are processed in order from lower left to upper right, then it follows that all elements on a single diagonal can be computed in parallel from two previous adjacent diagonals and . Then, to move to the next diagonal, the three diagonal buffers circularly shift; becomes , becomes , and the previous can be reused as the new as . Since each diagonal has a max length of , this means that one only needs a memory of to maintain such a system of circularly shifting buffers^{2}^{2}2
Ignoring the cost of storing features for the moment
; one never needs to store all of in memory to compute the optimal cost. Once the algorithm has completed, the optimal cost can be read off as the only element in the last buffer.Algorithm 1 shows a sketch of the process. We refer to this algorithm as DiagDTW, and we have implemented it on the GPU using CUDA. For reasons that will become clear in Section 3.2, we take as a parameter kstop on which to stop the computation, and we return the states of all three diagonals at that point.
3.2 Extracting Alignment
The linear systolic array provides a way to compute cost, but if we insist on only remembering the most three recent diagonals of backpointers, then there is no obvious way to recover all of the backpointers to reconstruct an optimal warping path under memory constraints. Instead, we make the following two observations, which we use to build a different algorithm from standard backtracing which works in a memoryrestricted setting:
Lemma 1.
For any warping path and any adjacent set of 3 diagonals, at least one element of is incident on one of the three diagonals.
This follows directly from the definition of a warping path in Section 2.1. We also have the following observation
Lemma 2.
Let be an optimal warping path and , and let and be the time series and in reverse order, respectively. Then the cost of the warping path can be broken into three parts as follows
(4) 
This is depicted by the overlapping boxes in Figure 2. In other words, the total cost is the optimal cost of aligning the first half of the path from up to and including , plus the optimal cost of aligning second half of the path from up to and including , minus the distance from to (so we don’t double count that distance where they overlap). This follows from the fact that warping paths must start and end at the beginning and end of each time series (so each subpath is forced to touch ), and the fact that reversing both time series has no effect on the optimal cost. This is similar to the observation used in MrMsDTW to break up computation into smaller parts [18].
Now we are ready to present the divide and conquer algorithm to compute an optimal warping path . Lemma 1 and Lemma 2 together imply that if we trace the first half of the diagonals in a forward direction and the second half of the diagonals in the reverse direction and add them up pointwise where they meet at the center, subtracting off the distance at those points, then at least one element on the three diagonals will contain the optimal cost . Furthermore, since this cost occurs on the optimal path, it will by definition be the minimum cost over all pointwise sums. Therefore, to find a point towards the center of , we simply do the following

Run Algorithm 1 halfway in the forward direction, starting at the beginning

Run Algorithm 1 halfway in the reverse direction, starting at the end

Perform the sums in Equation 4 where they overlap

Take the indices of the value that achieves the minimum over all three diagonals (breaking ties arbitrarily, the result of which we explore in Section 4.2)
.
We refer to as the “pivot” at this step, and we are guaranteed that resides on . At this point, we divide the problem in half at the pivot and find two more points on the warping path to the left and right, which is the recursive step. Algorithm 2 summarizes this process.
Because Algorithm 2 calls DiagDTW as a subroutine and DiagDTW uses memory, Algorithm 2 also uses at most memory. What is slightly less obvious, but still fairly straightforward to show, is that a serial version of the algorithm takes time. To see this, parameterize the diagonal by a variable , where at the center of the central diagonal, then the total area of the subblock to the left of the chosen pivot and to the right of the chosen pivot is bounded from above by the following sum of two products
(5)  
(6) 
Then, , and , so a global maximum occurs at , for an area of . In other words, ignoring the edge effects due to the overlap, at most half of the total cells are processed across the two halves of each recursive split. This leads to the recurrence , which is bounded from above by . To understand the edge effects, we note the following: since the number of diagonals, , can be subdivided times, this leads to a bound of , for a total worstcase cost of
(7) 
However, the term will usually swamp the , unless one of or is very small (e.g , in which case it’s simply subdividing an interval of length repeatedly times). In practice, we parallelize the DiagDTW step on a GPU, so the algorithm runs faster than this. We also keep track of the number of cells processed, and we assume to indicate progress of the the algorithm to the user.
Finally, to save the overhead of initiating too many small GPU alignments, we break off the recursion when the subblocks get small enough (Line 2, Algorithm 2). In practice, if either length of of the subdivided time series goes below 500, we use the textbook DTW algorithm to complete the alignment, which uses an insignificant amount of computation and memory at that scale.
4 Experiments
Piece  Version 1  Version 2  DTW  FastDTW  Ours 
Vivaldi Spring  Abbado (188 sec)  Gunzenhauser (209 sec)  277 MB  3.86 MB  194 KB 
Candide Overture  Bernstein (268 sec)  Dudamel (279 sec)  527 MB  5.5 MB  270 KB 
Beethoven. Symph. No.5  Thielemann (445 sec)  Bernstein (514 sec)  1.58 GB  9.12 MB  448 KB 
Schumann  Symph. No. 3  Bernstein (2124 sec)  Muti (2199 sec)  23.2 GB  36.9 MB  1.77 MB 
Stravinsky The Rite of Spring  Rattle (2053 sec)  Bernstein (2082 sec)  29.4 GB  42.1 MB  2.02 MB 
Tchaikovsky Symph. No. 4  Bernstein (2645 sec)  Rozhdestve.. (2530 sec)  46.1 GB  51.9 MB  2.48 MB 
Shostakovich: Symph. No. 11  Søndergård (3647 sec)  Nelsons (3765 sec)  94.6 GB  74.8 MB  3.6 MB 
Verdi Requiem  Bychkov (4983 sec)  Solti (5042 sec)  173 GB  102 MB  4.9 MB 
Wagner  Das Rheingold  Kuhn (8799 sec)  Solti (8759 sec)  542 GB  180 MB  8.6 MB 
Now that the theory for our algorithm has been established, we apply it to align real audio data. We created a dataset with 100 pairs of mostly orchestral pieces, where each pair is performed under different conductors. All of the performances can be found on Youtube, and we provide code to automatically download them for reproducibility^{3}^{3}3If any links go down, the code is robust to that and will simply skip those examples. We do not have access to human annotated alignments, but since we can compute exact costs with linear memory, we can use our exact paths to assess the precision of approximate algorithms at very large scales to get an idea of how they perform in that regime. To that end, we split our dataset into two parts. The first 50 pairs are “shorter” pieces that can be handled (albeit sometimes slowly) by the textbook CPU DTW algorithm. The second set are pieces that would quickly use up all memory with the textbook algorithm on a personal computer, including many pieces around an hour or longer. Figure 3 shows the distribution of the lengths of pieces in both sets.
4.1 Features
So far, our discussion has assumed that we had access to some distance function between time series and . We now finally describe two different features sets that allow us to compute distances for synchronization, which we use in our experiments. The first set of features are the socalled “decaying locally adaptive normalized C0” (DLNC0) features [9], which are popular for fine scale alignments ^{4}^{4}4Unlike [17], we do not use a multiscale version of DLNC0, since we are assessing approximations of exact alignments at a single scale. The second set of features are referred to as “mfccmod” features, which consist of a large number of MFCC coefficients, throwing away the lower order ones to control for loudness. These features were shown to work well at precisely capturing human annotations [10].
For both feature sets, we sampled audio at 22050hz, and we used a hop size of 512 between feature frames. This corresponds to about 43 frames per second of resolution. For the DLNC0 features, we used librosa’s implementation of the CQT with default parameters as a starting point [15]. The DLNC0 features were concatenated to a 0.1 factor of CENS features to improve stability in steadystate regions, as suggested in [9]. For the mfccmod coefficients, we used an fftlength of 2048 and computed 120 “HTK” coefficients, leaving the first 20 out. Although [9] recommends using cosine distance for the DNLC0 component, we found lower discrepancies using the Euclidean distances as our measure across the board on all of our features.
4.2 Numerical Precision / Tie Breaking
Since our algorithm is on the GPU, we revert to 32bit computations, and there is a worry that numerical precision could cause discrepancies, especially since the numbers along warping paths are summed together in a different order in our algorithm, and ties are broken at a different stage. To rule this out as a source of error when comparing approximation precisions, we compare our GPU answer to the brute force 32bit CPU answer on the textbook algorithm. We also compare two different tie breaking rules on 64bit CPU brute force implementations; one where diagonal takes precedence over left, and one the other way around. Ultimately, though there are discrepancies, they are negligible compared to errors in approximation, as shown in Figure 4. And the 32bit versus 64bit appears to make little difference at these scales.
4.3 Memory Requirements
We compare our alignments to both FastDTW [21] with a band and to MrMsDTW using a constant amount of and cells. To make Equation 4 more convenient to compute in our implementation, we store the distances between corresponding points on the three diagonals in addition to the cumulative sums, so we end up using storage instead of storage. Still, we note that cells is an order of magnitude beyond this requirement, while cells is on the shorter end of what our algorithm needs on the short dataset, so these are two good reference points for MrMsDTW. To compare memory with FastDTW, we use the equation from [21] which states that the total worstcase space complexity for storing the cells is values. Hence, though FastDTW also has linear memory requirements, it has a larger constant factor, particularly for reasonable band sizes ( is less than a second of wiggle room). Table 1 shows the memory requirements for storing the cells for different algorithms with variable memory, assuming 32bit precision (4 bytes per cell). This neglects the memory for storing the warping path, which is negligible compared to the cost of storing the accumulated cost cells, and it also neglects the memory requirements of storing features, which is a separate issue mostly independent of the algorithms, since these are all run offline. Still, the memory differences are striking.
4.4 Results
We now examine the results closely. We computed the alignment discrepancies between two warping paths and as follows. For every element , we report the error as for . To maintain symmetry, we also add an analogously defined column error to our distributions for every element. Figure 4 shows the approximation error distributions for different algorithms on all of the shorter pieces, including tie breaking discrepancies on the exact algorithm (Section 4.2), while Figure 5 shows approximation errors on all of the longer pieces. In each figure for each pairwise comparison, there are four different color dots per piece that indicate the proportion of correspondences that fall below the alignment discrepancies (23 ms, 47 ms, 510ms, and 1 second). Overall, we find that the approximation algorithms often fail to agree at very fine scales, but they usually agree to within a second of audio, which is particularly impressive on the long pieces. And unsurprisingly, MrMsDTW performs better with more memory.
In addition to approximation errors, we also show the discrepancy between the mfccmod and DLNC0 feature sets for reference. Interestingly, their discrepancy is similar to that of approximation with MrMsDTW, suggesting that feature design is at least as important as a good approximation. However, under a good feature choice at a fine scale, it is likely that our exact algorithm will give the most desirable alignment.
Finally, to empirically validate the correctness of our computational complexity bound in Section 3.2, we report the ratio of cells processed to total cells in the full accumulated cost matrix in Figure 6 across all pieces, and we find that the ratio is very close to 2 in all cases, as predicted. Only under very extreme warps away from the center of the matrix would one expect this to be much smaller.
5 Software
Since some of the details of linmdtw (Algorithm 2) are tricky to implement correctly, and in the spirit of reproducibility [14]
, we have provided our CPU and GPU (pycuda) implementations of linmdtw, FastDTW, and MrMsDtw in an open source package at
https://github.com/ctralie/linmdtw, which can be installed simply with “pip install linmdtw”. We have documentation and Jupyter notebooks on the repo for example usage. The software will try run CUDA by default, but if it fails, it will fall back to the CPU implementation. There is also code to replicate the experiments in Section 4 by downloading URLs from Youtube, robustly skipping those no longer available. Finally, we used the Rubberband Library [4] and implemented the refinement technique of Ewert (Section 4 of [8]) to stretch audio to conform to warping paths.6 Discussion
In this paper, we presented a novel exact memory efficient algorithm for DTW. In addition establishing this new algorithm and proving its correctness, we empirically benchmarked a couple of popular approximation algorithms for DTW alignment in MIR at larger scales than had ever been shown. We found that these algorithms still have fairly good performance with reference to an exact alignment even on longer pieces. MrMsDTW is particularly fast computationally, so this suggests that it’s good as a first stop in many cases, though there are outliers, and there are always quality gains to be had for an exact algorithm.
Furthermore, though the focus of this paper was on memory constraints, our vanilla GPU implementation also led to speed increases over the textbook CPU version and had similar but slightly slower runtimes than FastDTW. However, a better GPU implementation would treat global and local memory with more care, along with addressing myriad other issues [25], so we do not believe this algorithm has yet reached its full computational potential.
There are also other computational problems with very similar dynamic programming design DTW, such as edit distance and Smith Waterman [22], which could benefit from the ability to align large sequences under memory restrictions. Even approximate DTW algorithms may benefit from tricks in this paper.
References
 [1] Pankaj K. Agarwal, Kyle Fox, Jiangwei Pan, and Rex Ying. Approximating Dynamic Time Warping and Edit Distance for a Pair of Point Sequences. In Sándor Fekete and Anna Lubiw, editors, 32nd Intl. Symp. on Computational Geometry (SoCG), in print, volume 51 of Leibniz Intl Proc. in Informatics (LIPIcs), pages 6:1–6:16, Dagstuhl, Germany, 2016. Schloss Dagstuhl–LeibnizZentrum fuer Informatik.
 [2] Ghazi AlNaymat, Sanjay Chawla, and Javid Taheri. Sparsedtw: a novel approach to speed up dynamic time warping. In Proc. Australasian Data Mining Conf., volume 101, pages 117–127, 2012.
 [3] Karl Bringmann and Marvin Künnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. In IEEE 56th Annual Symp. on Foundations of Computer Science, pages 79–97. IEEE, 2015.
 [4] C Cannam. Rubber band library. Software released under GNU General Public License (version 1.8. 1), 2012.
 [5] Evert Dijkstra and Christian Piguet. On minimizing memory in systolic arrays for the dynamic time warping algorithm. Integration, 4(2):155–173, 1986.
 [6] Simon Dixon. Live tracking of musical performances using online time warping. In Proc. of the 8th Intl. Conf. on Digital Audio Effects (DAFX), volume 92, page 97. Citeseer, 2005.
 [7] Daniel PW Ellis. Identifying ‘cover songs’ with beatsynchronous chroma features. MIREX 2006, page 32.
 [8] Sebastian Ewert and Meinard Müller. Refinement strategies for music synchronization. In Intl, Symp. on Computer Music Modeling and Retrieval, pages 147–165. Springer, 2008.
 [9] Sebastian Ewert, Meinard Müller, and Peter Grosche. High resolution audio synchronization using chroma onset features. In Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 1869–1872. IEEE, 2009.
 [10] Thassilo Gadermaier and Gerhard Widmer. A study of annotation and alignment accuracy for performance comparison in complex orchestral music. In Proc. of the Int. Soc. for Music Information Retrieval Conf. (ISMIR), in print, 2019.
 [11] Daniel S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Communications of the ACM, 18(6):341–343, 1975.
 [12] Fumitada Itakura. Minimum prediction residual principle applied to speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 23(1):67–72, 1975.
 [13] Robert Macrae and Simon Dixon. Accurate realtime windowed time warping. In Proc. of the Int. Soc. for Music Information Retrieval Conf. (ISMIR), in print, pages 423–428, 2010.
 [14] Brian McFee, Jong Wook Kim, Mark Cartwright, Justin Salamon, Rachel Bittner, and Juan Pablo Bello. Opensource practices for music signal processing research. IEEE Signal Processing Magazine, 1053(5888/19), 2019.
 [15] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proc. of the 14th python in science conf., volume 8, 2015.
 [16] Meinard Müller. Fundamentals of music processing: Audio, analysis, algorithms, applications. Springer, 2015.
 [17] Meinard Müller, Henning Mattes, and Frank Kurth. An efficient multiscale approach to audio synchronization. In Proc. of the Int. Soc. for Music Information Retrieval Conf. (ISMIR), in print, volume 546, pages 192–197. Citeseer, 2006.
 [18] Thomas Prätzlich, Jonathan Driedger, and Meinard Müller. Memoryrestricted multiscale dynamic time warping. In Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 569–573. IEEE, 2016.
 [19] Hiroaki Sakoe and Seibi Chiba. A similarity evaluation of speech patterns by dynamic programming. In Nat. Meeting of Institute of Electronic Communications Engineers of Japan, page 136, 1970.
 [20] Hiroaki Sakoe and Seibi Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE transactions on acoustics, speech, and signal processing, 26(1):43–49, 1978.
 [21] Stan Salvador and Phillip Chan. Fastdtw: Toward accurate dynamic time warping in linear time and space. Proc. of ACM Knowledge Data And Discovery (KDD), 3rd Wkshp. on Mining Temporal and Sequential Data, 2004.
 [22] Temple F Smith and Michael S Waterman. Identification of common molecular subsequences. Journal of molecular biology, 147(1):195–197, 1981.
 [23] TJ Tsai, Steven K Tjoa, and Meinard Müller. Make your own accompaniment: Adapting fullmix recordings to match soloonly user recordings. In Proc. of the Int. Soc. for Music Information Retrieval Conf. (ISMIR), in print, pages 79–86, 2017.
 [24] Simon Waloschek and Aristotelis Hadjakos. Driftin’down the scale: Dynamic time warping in the presence of pitch drift and transpositions. In Proc. of the Int. Soc. for Music Information Retrieval Conf. (ISMIR), in print, pages 630–636, 2018.
 [25] Limin Xiao, Yao Zheng, Wenqi Tang, Guangchao Yao, and Li Ruan. Parallelizing dynamic time warping algorithm using prefix computations on gpu. In 2013 IEEE 10th Intl. Conf. on High Performance Computing and Communications & 2013 IEEE Intl Conf. on Embedded and Ubiquitous Computing, pages 294–299. IEEE, 2013.
 [26] Rex Ying, Jiangwei Pan, Kyle Fox, and Pankaj K. Agarwal. A simple efficient approximation algorithm for dynamic time warping. In Proc. of the 24th ACM SIGSPATIAL Intl. Conf. on Advances in GIS, GIS ’16, pages 21:1–21:10, New York, NY, USA, 2016. ACM.
 [27] Chi Wai Yu, KH Kwong, KinHong Lee, and Philip Heng Wai Leong. A smithwaterman systolic cell. In New Algorithms, Architectures and Applications for Reconfigurable Computing, pages 291–300. Springer, 2005.
Comments
There are no comments yet.