With the recent emergence of new machine learning techniques, there has been an increasing interest in robotic action recognition. The foundation of action recognition lies in the problem of signal alignments, in the sense that prior to categorizing or identifying sets of sequences, one must establish a method to temporally parameterize the sequences that enables standardized comparisons.
Currently, the most successful techniques for signal alignment are based on the well-known method of Dynamic Time Warping (DTW) [1], which matches two time series with a monotonically increasing optimal warping path satisfying boundary conditions. Since its introduction almost 40 years ago, DTW has been applied to a variety of fields including speech recognition [2], action recognition [3], data mining [4], and motion perception [5]. Due to DTW’s quadratic time complexity, many variants have been introduced with the goal of striking a balance between accuracy and computational efficiency [6, 7, 8], the most widely used of these being the Fast Dynamic Time Warping (FastDTW) algorithm [7], which achieves a time complexity of $O(T)$. One of the most recent developments in the DTW family has been the introduction of Generalized Time Warping [9], which aligns multiple multi-modal sequences with linear time complexity.
Recently, a novel, alternative mathematical framework for signal alignment has been proposed, in which signals are reparameterized to a universal standard timescale (UST) using principles of variational calculus [10]. The goal of this paper is to introduce an efficient numerical algorithm for signal alignment based on this framework, which we will henceforth refer to as the Globally Optimal Reparameterization Algorithm (GORA), and to provide an initial numerical validation of this approach.
Given two or more time-evolving signal sequences, GORA temporally reparameterizes each to a UST that allows for pairwise comparison at each instance in time. Reparameterizations are found using variational calculus to produce mappings to new temporal variables that globally minimize the amount of change in the sequences, representing a new approach to the problem of signal alignment. The major advantages of this approach are:
1. It achieves linear time complexity of $O(T)$, where $T$ is the number of time instances in a signal;
2. It can simultaneously reparameterize multiple signal sequences to a universal time scale; and
3. It can potentially be built upon to allow for the effects of nuisance parameters such as noise or motion artifacts to be minimized or eliminated [10].
The remainder of the paper is organized as follows. First, we review GORA’s mathematical foundations, which were first described in [10], and define and introduce GORA itself. We then discuss the settings for its application to signals in the form of both real trajectories and video sequences. This is followed by a numerical verification of GORA’s ability to find the globally optimal temporal reparameterization of a given signal. We then provide an initial verification of the algorithm by comparing its performance, in terms of both computational efficiency and accuracy in matching signals, relative to DTW methods. Using both synthetic and real datasets, our results show a significant improvement in both speed and accuracy over the DTW and FastDTW algorithms. We conclude with a short discussion on the computational significance of the differences between GORA and DTW methods, in addition to the authors’ plans for the continued development of the GORA framework.
II Problem Statement
Without loss of generality, consider any kind of temporally evolving signal, $X(t)$, as a mapping from the unit interval to the space $\mathcal{M}$, i.e. $X: [0,1] \to \mathcal{M}$, on which that particular type of signal evolves. Defining a metric $d$ on $\mathcal{M}$, $(\mathcal{M}, d)$ becomes a metric space. In general, given any two signals $X_1, X_2$, it is likely that
$$ d\big(X_1(t), X_2(t)\big) \neq 0 $$
even if it is suspected that both signals portray similar dynamic phenomena, a major reason being that each signal could have a different temporal parameterization on the unit interval.
GORA is based on the notion that the temporal misalignment between two arbitrary signals can be compensated by reparameterizing each to a UST. In other words, assuming that nuisance parameters or motion artifacts, such as variations in perspective, are not significantly affecting the signals, if we can find two strictly monotonically increasing functions, $\tau_1, \tau_2: [0,1] \to [0,1]$, such that
$$ d\big(X_1(\tau_1(t)), X_2(\tau_2(t))\big) = 0 \quad \text{for all } t \in [0,1], $$
then we can say that $X_1$ and $X_2$ are fundamentally the same.
Let $\mathcal{T}$ represent the set of all such monotonically increasing functions on the unit interval. Denoting $\circ$ as the operation of composition of functions, namely
$$ (\tau_1 \circ \tau_2)(t) = \tau_1(\tau_2(t)), $$
$(\mathcal{T}, \circ)$ forms a group, which we refer to as the temporal reparameterization group (TRG). For a given signal, $X$, one can use the succeeding variational calculus formulation to find a globally optimal $\tau^* \in \mathcal{T}$ such that $X \circ \tau^*$ is the UST parameterization of $X$, reducing a search for this mapping from the full space of signals to the quotient space of signals modulo the TRG.
III Mathematical Formulations
Suppose one wants to find a function, $\tau(t)$, that extremizes a functional of the form
$$ J(\tau) = \int_0^1 f(\tau, \dot{\tau}, t)\, dt, \tag{1} $$
where $\dot{\tau} = d\tau/dt$. This type of problem can be addressed by the application of the calculus of variations, and the desired $\tau(t)$ is the solution to the Euler-Lagrange equation:
$$ \frac{\partial f}{\partial \tau} - \frac{d}{dt}\left(\frac{\partial f}{\partial \dot{\tau}}\right) = 0. \tag{2} $$
In general, there are no guarantees that the solution to the preceding equation will be globally optimal; however, in certain situations (including optimal temporal reparameterization), the structure of the function $f$ will guarantee that the solution generated by the equation is in fact a globally optimal solution. The following theorem is an example of one such case.
III-A Theorem and Proof of Global Optimality
THEOREM 1: If $g(\tau) > 0$ and the integrand in the cost functional (1) is of the form
$$ f(\tau, \dot{\tau}, t) = g(\tau)\, \dot{\tau}^2, \tag{3} $$
where $g$ is differentiable, then the solution generated by (2) subject to the boundary conditions $\tau(0) = 0$ and $\tau(1) = 1$ is globally minimal.
The proof of this theorem was first demonstrated in [10]; however, we choose to re-demonstrate it here as it illuminates the fundamental structure of GORA.
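Evaluating (2) with (3) gives
$$ 2\, g(\tau)\, \ddot{\tau} + g'(\tau)\, \dot{\tau}^2 = 0. $$
Multiplying both sides by $\dot{\tau}$ yields the exact differential
$$ \frac{d}{dt}\left[ g(\tau)\, \dot{\tau}^2 \right] = 0. $$
Integrating both sides with respect to $t$ and isolating $\dot{\tau}$ yields
$$ \dot{\tau} = c\, g^{-1/2}(\tau), $$
where $c$ is the arbitrary constant of integration. With the boundary conditions $\tau(0) = 0$ and $\tau(1) = 1$, we can then write
$$ F(\tau^*) = \frac{1}{c} \int_0^{\tau^*} g^{1/2}(\sigma)\, d\sigma = t, \qquad c = \int_0^1 g^{1/2}(\sigma)\, d\sigma. $$
The notation $\tau^*$ indicates that this is the unique solution obtained from the Euler-Lagrange equation that satisfies the boundary conditions.
The function $F$ can be inverted ($F$ is monotonically increasing since $g > 0$) to yield
$$ \tau^*(t) = F^{-1}(t). $$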
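To see that this solution is globally optimal, substitute
$$ \tau(t) = \tau^*(\epsilon(t)) $$
into the cost functional
$$ J(\tau) = \int_0^1 g(\tau)\, \dot{\tau}^2\, dt, $$
where $\epsilon$ is any function in the TRG, $\mathcal{T}$. Then, because $g^{1/2}(\tau^*)\, \dot{\tau}^*$ equals the constant $c$ along the Euler-Lagrange solution,
$$ c = \int_0^1 g^{1/2}(\tau^*(t))\, \frac{d\tau^*}{dt}\, dt = \int_0^1 g^{1/2}(\tau^*(\epsilon))\, \frac{d\tau^*}{d\epsilon}\, d\epsilon, $$
where the second equality is simply a change of name of the dummy variable of integration. Furthermore, since $\tau$ and $\epsilon$ are both functions of time, we can change the domain of integration as
$$ c = \int_0^1 g^{1/2}(\tau^*(\epsilon(t)))\, \frac{d\tau^*}{d\epsilon}\, \dot{\epsilon}\, dt = \int_0^1 g^{1/2}(\tau(t))\, \dot{\tau}\, dt. $$
Since in general, from the Cauchy-Schwarz inequality,
$$ \left( \int_0^1 a(t)\, b(t)\, dt \right)^2 \leq \int_0^1 a^2(t)\, dt \int_0^1 b^2(t)\, dt, $$
we see that by letting $a(t) = g^{1/2}(\tau(t))\, \dot{\tau}(t)$ and $b(t) = 1$ that
$$ J(\tau^*) = c^2 \leq \int_0^1 g(\tau)\, \dot{\tau}^2\, dt = J(\tau), $$
where $\tau^*$ is the solution generated by the Euler-Lagrange equation and $\tau$ is any function in $\mathcal{T}$. Therefore $\tau^*$ is a globally minimal solution.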
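As a worked illustration of Theorem 1, consider a weight chosen here purely for convenience, $g(\tau) = (1+\tau)^2$; this choice is an assumption for the sake of the example and does not come from the experiments. Writing $\tau$ for the reparameterization, $c$ for the normalizing constant, and $J$ for the cost functional (1) with integrand (3), the solution and its cost can be computed in closed form and compared against the identity parameterization $\tau(t) = t$:

```latex
\begin{align*}
c &= \int_0^1 g^{1/2}(\sigma)\, d\sigma = \int_0^1 (1+\sigma)\, d\sigma = \tfrac{3}{2}, \\
F(\tau^*) &= \tfrac{2}{3}\left( \tau^* + \tfrac{1}{2}(\tau^*)^2 \right) = t
  \quad\Longrightarrow\quad \tau^*(t) = -1 + \sqrt{1 + 3t}, \\
g(\tau^*)\,(\dot{\tau}^*)^2 &= (1 + 3t)\,\frac{9}{4\,(1+3t)} = \tfrac{9}{4}
  \quad\Longrightarrow\quad J(\tau^*) = c^2 = \tfrac{9}{4}, \\
J(\tau = t) &= \int_0^1 (1+t)^2\, dt = \tfrac{7}{3} > \tfrac{9}{4}.
\end{align*}
```

The boundary conditions hold ($\tau^*(0) = 0$, $\tau^*(1) = 1$), the integrand is constant along $\tau^*$, and the identity parameterization has a strictly larger cost, as the Cauchy-Schwarz argument predicts.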
III-B The Globally Optimal Reparameterization Algorithm (GORA)
In the context of signal alignment, the solution to the preceding variational problem provides a method for finding the UST parameterization of a given signal. In particular, taking $\tau = \tau^*$ we have $X_{\mathrm{UST}}(t) = X(\tau^*(t))$, subject to the definition of $g$, which measures the rate of change of the given signal along the temporal axis. This is the backbone of the Globally Optimal Reparameterization Algorithm (GORA), defined in Algorithm 1.
Given $g$, calculating $\tau^*$ is relatively straightforward and follows the first part of the proof of Theorem 1. For a given signal, the function $g$ should be defined analogously to the squared magnitude of the temporal derivative of the signal. For example, in the case of a video signal, an appropriate definition of $g$ could be based on the temporal derivative of the matrix of pixel values representing each frame.
It should be noted that steps 2-3 in GORA can be performed simultaneously. Additionally, the method of interpolation through which $X(\tau^*(t))$ is recovered from $X$ and $\tau^*$ in step 5 should be chosen based on the properties of the input signal.
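The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the function name `reparameterize_to_ust`, the uniform input grid, the forward-difference estimate of the signal's speed, and the piecewise linear inversion and interpolation are all choices made here for brevity.

```python
import math


def _distance(p, q):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))


def reparameterize_to_ust(signal):
    """Resample a trajectory onto an (assumed) universal standard timescale.

    signal: list of T points (each a list of floats), assumed to be sampled
    on the uniform grid t_k = k / (T - 1).
    Returns (tau, resampled): tau[k] is the original time mapped to the new
    time k / (T - 1), and resampled[k] is the signal interpolated at tau[k].
    """
    T = len(signal)
    if T < 2:
        raise ValueError("need at least two samples")
    dt = 1.0 / (T - 1)

    # sqrt(g) on each sampling interval, estimated with a forward difference.
    speed = [_distance(signal[k], signal[k + 1]) / dt for k in range(T - 1)]

    # Cumulative integral of sqrt(g), normalised so F[0] = 0 and F[-1] = 1.
    F = [0.0]
    for s in speed:
        F.append(F[-1] + s * dt)
    c = F[-1]
    if c == 0.0:
        raise ValueError("signal does not change, so g is not positive")
    F = [value / c for value in F]

    # Invert F by piecewise linear interpolation and resample the signal.
    tau, resampled = [], []
    j = 0
    for k in range(T):
        s = k * dt
        while j < T - 2 and F[j + 1] < s:
            j += 1
        span = F[j + 1] - F[j]
        w = 0.0 if span == 0.0 else (s - F[j]) / span
        w = min(max(w, 0.0), 1.0)
        tau.append((j + w) * dt)
        resampled.append(
            [a + w * (b - a) for a, b in zip(signal[j], signal[j + 1])]
        )
    return tau, resampled


# Usage: a straight line traversed at non-uniform speed, x(t) = t^2.
T = 101
warped = [[(k / (T - 1)) ** 2, 0.0, 0.0] for k in range(T)]
tau, ust = reparameterize_to_ust(warped)
```

With this choice of $g$, the recovered samples `ust` are evenly spaced along the line, which is the behaviour expected when the cost functional (1) is minimized.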
IV Numerical Verification Settings
In this paper, we provide a validation of GORA using discretized signals, in the form of both synthetically generated trajectories in $\mathbb{R}^3$ and video sequences from the Weizmann Action Recognition Classification Database [11, 12]. The following section describes our experimental regime and results.
IV-A Signal structure
The version of GORA implemented in our experiments is designed for signals in the form of real trajectories. Through vectorization, each frame of a video sequence can be represented as an array of $n$ pixel values, where $n$ is the number of pixels per frame. As such, any video sequence can be described by a temporally-evolving curve in $\mathbb{R}^n$. If we imagine video sequences as collections of discretized samplings of continuous phenomena at arbitrary time instances, any re-sampling at new time instances produces a curve in $\mathbb{R}^n$.
Additionally, it is important to note that for the sake of sampling consistency, we trimmed all video sequences in the Weizmann Database such that each trimmed video showed only a single instance of an action being performed. For example, videos of a person walking were trimmed to show only a single stride (two successive placements of the same foot) and videos of a person waving multiple times were trimmed to show only a single wave.
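IV-B Formulation of $g$
For signals of the form $X: [0,1] \to \mathbb{R}^n$, a natural choice for the definition of $g$ consistent with (3) is
$$ g(\tau) = \left\| \frac{dX}{d\tau} \right\|^2, $$
where $\| \cdot \|$ denotes the Euclidean norm of a vector. In practice, we computed $dX/d\tau$ using a high order finite difference method.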
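The text does not specify which finite difference scheme was used. As one possible sketch, the function below (the hypothetical `squared_speed`) estimates the squared Euclidean norm of the temporal derivative on a uniform grid, using a fourth-order central stencil in the interior and second-order stencils near the two ends; these stencil choices are assumptions made for illustration.

```python
def _derivative(column, k, dt):
    """Finite-difference derivative of a 1-D sample list at index k."""
    T = len(column)
    if 2 <= k <= T - 3:
        # Fourth-order central difference.
        return (column[k - 2] - 8.0 * column[k - 1]
                + 8.0 * column[k + 1] - column[k + 2]) / (12.0 * dt)
    if 1 <= k <= T - 2:
        # Second-order central difference next to the boundary.
        return (column[k + 1] - column[k - 1]) / (2.0 * dt)
    if k == 0:
        # Second-order one-sided (forward) difference.
        return (-3.0 * column[0] + 4.0 * column[1] - column[2]) / (2.0 * dt)
    # Second-order one-sided (backward) difference at the last sample.
    return (3.0 * column[-1] - 4.0 * column[-2] + column[-3]) / (2.0 * dt)


def squared_speed(signal, dt):
    """Squared Euclidean norm of the temporal derivative at every sample.

    signal: list of T >= 3 points (each a list of floats) on a uniform grid
    with spacing dt.
    """
    T = len(signal)
    if T < 3:
        raise ValueError("need at least three samples")
    dims = len(signal[0])
    columns = [[point[d] for point in signal] for d in range(dims)]
    return [
        sum(_derivative(column, k, dt) ** 2 for column in columns)
        for k in range(T)
    ]


# Usage: X(t) = (t^2, 2t), for which the squared speed is 4 t^2 + 4.
dt = 0.1
samples = [[(k * dt) ** 2, 2.0 * k * dt] for k in range(11)]
g_values = squared_speed(samples, dt)
```

All three stencils are exact for quadratic signals, so this example recovers $4t^2 + 4$ up to rounding error.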
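IV-C Error metric
Given two discretized signals $X_1, X_2$ we defined the distance or error between them as the average Euclidean distance over all time instances, namely,
$$ E(X_1, X_2) = \frac{1}{T} \sum_{k=1}^{T} \left\| X_1(t_k) - X_2(t_k) \right\|, \tag{7} $$
where $T$ is the number of time instances.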
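The error metric described above, the average Euclidean distance over all time instances, is simple to state in code. The function name `mean_euclidean_error` is a label chosen here, and the sketch assumes both signals have the same number of time instances.

```python
import math


def mean_euclidean_error(signal_a, signal_b):
    """Average Euclidean distance between two equally long discretized signals."""
    if len(signal_a) != len(signal_b) or not signal_a:
        raise ValueError("signals must be non-empty and of equal length")
    total = 0.0
    for p, q in zip(signal_a, signal_b):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return total / len(signal_a)


# Usage: two signals with two time instances each.
error = mean_euclidean_error([[0.0, 0.0], [0.0, 0.0]], [[3.0, 4.0], [0.0, 0.0]])
```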
V Verifications of Global Optimality
One of the major advantages of GORA is its ability to reparameterize multiple signals to their corresponding USTs in parallel, which allows for pairwise comparison between signals. Here we seek to verify the global optimality of UST parameterizations computed by GORA using both synthetic curves in $\mathbb{R}^3$ and video sequences in the form of vectorized curves in $\mathbb{R}^n$.
In principle, for an arbitrary input signal $X$, its UST reparameterization found by GORA minimizes the cost functional (1) with the integrand given by (3). Our experimental procedures for evaluation are summarized as follows: For a given number of time instances, we randomly selected 50 template signals. For each template signal, $X$, we randomly generated 50 functions in the TRG and reparameterized $X$ with respect to each to create 50 different input signals. We then used GORA to obtain the UST parameterization, $\tau^*$, and recover the UST reparameterized version of the signal, $X \circ \tau^*$. We then computed the value of the cost functional with respect to $\tau^*$ and $X \circ \tau^*$ and compared this with the value of the cost functional computed using the input signal and original timescale.
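The two ingredients of this procedure, drawing a random element of the TRG and evaluating a discrete version of the cost functional, can be sketched as follows. The increment-based sampling scheme, the function names, and the straight-line template are assumptions made for illustration; the text does not say how the random reparameterizations were generated.

```python
import random


def random_reparameterization(T, rng):
    """Random strictly increasing tau on T grid points, tau[0]=0 and tau[-1]=1."""
    steps = [rng.uniform(0.1, 1.0) for _ in range(T - 1)]
    total = sum(steps)
    tau = [0.0]
    for step in steps:
        tau.append(tau[-1] + step / total)
    tau[-1] = 1.0  # remove accumulated rounding error at the end point
    return tau


def cost_functional(samples):
    """Discrete approximation of the integral of ||dX/dt||^2 over [0, 1].

    samples: list of T points taken on the uniform grid t_k = k / (T - 1).
    """
    T = len(samples)
    dt = 1.0 / (T - 1)
    squared_steps = sum(
        sum((q - p) ** 2 for p, q in zip(samples[k], samples[k + 1]))
        for k in range(T - 1)
    )
    return squared_steps / dt


def line_template(u):
    """Hypothetical template: a straight line in R^3 traversed uniformly in u."""
    return [u, 2.0 * u, 3.0 * u]


# Usage: the uniformly traversed line costs less than any warped version.
rng = random.Random(0)
T = 50
ust_cost = cost_functional([line_template(k / (T - 1)) for k in range(T)])
warped_costs = [
    cost_functional([line_template(u) for u in random_reparameterization(T, rng)])
    for _ in range(50)
]
```

For this template the uniform traversal is already the UST parameterization, so every entry of `warped_costs` should be at least `ust_cost`, mirroring the comparison described above.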
The results of our global optimality experiments are displayed in Fig. 1. Fig. 1(a) shows the percentage of cost functional values computed with the UST that are lower than the cost functional values computed using the initial signals and initial timescales, for 20 to 150 time instances. For each number of time instances, the percentages were computed with respect to the 2500 (50 × 50) input and corresponding UST signal pairs generated using the experimental procedures detailed in the preceding paragraph. The red and blue lines show the results for synthetic trajectories in $\mathbb{R}^3$ and for vectorized video sequences, respectively.
As an example, Figs. 1(b) and 1(c) show the values of the cost functionals for fifty different pairs of randomly parameterized input signals and their corresponding UST reparameterizations found by GORA, with 20 and 100 time instances respectively. For both Figs. 1(b) and 1(c), all input signals were generated from a single template video sequence in the Weizmann Database. The black and green lines represent the cost functional values for the input signals and their UST reparameterizations, respectively.
The results indicate that in general, GORA does a remarkably good job of finding UST parameterizations that are globally minimal (or at least very close, depending on numerical precision) in the sense of (1). GORA’s failure to do so consistently in the case of signals with low numbers of time instances (e.g. Fig. 1(b)) can likely be explained by its reliance on numerical differentiation of the input signal (in the computation of $g$) and on numerical integration of $g^{1/2}$, both of which become less accurate with larger temporal step sizes corresponding to lower numbers of time instances.
Additionally, this type of numerical comparison between the cost functional values computed with the UST and those computed using the initial sequences and timescales might provide a template for finding the lower bound of GORA’s effectiveness with a given signal type. When the UST parameterization found by GORA is clearly not globally optimal, as is the case when the cost functional computed with respect to the input signal has a lower value than that computed using the UST reparameterization, it cannot be considered to be accurate. When using GORA for pairwise signal comparisons, failure to approximate UST parameterizations well would likely lead to a greater degree of induced error. Depending on the properties of the input signals and the chosen methods of differentiation, integration, and interpolation, one might be able to probe for a lower bound on the number of time instances based on a desired accuracy threshold.
VI Algorithm Performance and Comparisons
This section summarizes our comparisons between the performance of GORA and that of the DTW and FastDTW [7] algorithms. Specifically, we evaluate the performance of each of the above algorithms in terms of both accuracy in matching signals and computational efficiency. All comparisons are performed in Python 2.7 and the DTW and FastDTW implementations we used in our experiments were from the official Python packages. The experiments were performed on an Intel Core i7-7600U CPU @ 2.80GHz.
VI-A Comparison regime
We compared the performance of GORA with the DTW algorithm and implementations of the FastDTW algorithm with radii of 1, 5, and 20. The procedures with which we performed comparisons using both synthetic trajectories in $\mathbb{R}^3$ and video sequences are described as follows: For a given number of time instances, we randomly selected 50 different template signals. For each template signal, two initial parameterizations in the TRG were randomly generated and used to parameterize the original signal, creating 50 pairs of input signals, which were then fed to GORA and the DTW and FastDTW algorithms.
To ensure fair comparisons between algorithms, we use a modified version of GORA designed for the pairwise comparison of two signals, which is outlined in Fig. 2. This version accepts two input signals, $X_1$ and $X_2$, computes in parallel their respective UST reparameterizations as defined in Algorithm 1, i.e. $X_1 \circ \tau_1^*$ and $X_2 \circ \tau_2^*$, and outputs the error between the two UST reparameterizations given by (7). Similarly, we normalized the accumulated cost error output by the DTW and FastDTW algorithms under the Euclidean norm by dividing it by the length of the optimal warping path. Run time comparisons were performed using the clock function in Python’s time module. Given two input signals, we defined the run time (what we called computational efficiency) to be the time it took each algorithm to output the error between them.
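To make the normalization of the DTW output concrete, the sketch below implements plain dynamic-programming DTW under the Euclidean norm and divides the accumulated cost by the length of the recovered warping path. It is an illustrative stand-in written for this purpose, not the DTW or FastDTW package used in the experiments, and the function name `dtw_normalized_error` is an assumption.

```python
import math


def _distance(p, q):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def dtw_normalized_error(a, b):
    """Plain O(len(a) * len(b)) DTW.

    Returns (accumulated cost / warping path length, warping path length).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j]: minimal accumulated cost aligning a[:i] with b[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = _distance(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])

    # Backtrack from the end of both signals to count the path length.
    i, j, length = n, m, 1
    while (i, j) != (1, 1):
        candidates = [
            (D[i - 1][j - 1], i - 1, j - 1),
            (D[i - 1][j], i - 1, j),
            (D[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
        length += 1
    return D[n][m] / length, length


# Usage: the same ramp sampled with different timing aligns with zero error.
error, path_length = dtw_normalized_error(
    [[0.0], [0.0], [1.0]], [[0.0], [1.0], [1.0]]
)
```

Dividing by the path length makes the DTW output an average per-match distance, which is what allows it to be compared with the per-time-instance average in (7).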
Fig. 3 and Fig. 4 compare the performance of GORA and the DTW and FastDTW algorithms using signals in the form of trajectories in $\mathbb{R}^3$ and vectorized video sequences, respectively. Figs. 3(a) and 4(a) show the mean run time of each algorithm from 20 to 150 time instances. Figs. 3(c) and 4(c) show the corresponding standard deviations from the mean run times for each algorithm.
With both trajectories in $\mathbb{R}^3$ and video sequences, as the total number of time instances increases, DTW’s run time grows quadratically (i.e. $O(T^2)$ complexity) while all implementations of the FastDTW algorithm and GORA achieve linear complexity (i.e. $O(T)$). However, in both cases GORA’s run time is less than that of all the DTW methods, and GORA’s run time grows more slowly than that of the fastest implementation of FastDTW (radius 1). In addition, GORA’s run time has a similar degree of stability (in the sense of smaller deviations from the overall mean run time) as that of the FastDTW implementation with radius 1, and remains significantly more stable than other DTW methods.
Figs. 3(b) and 4(b) show the mean error between signal pairs given by each algorithm from 20 to 150 time instances. Figs. 3(d) and 4(d) show the corresponding standard deviations from the mean error for each algorithm. In both cases, GORA is significantly more accurate (in the sense that the computed error between signal pairs known to represent the same dynamic phenomena is small) than the DTW algorithm and all implementations of the FastDTW algorithm. It was often the case that the DTW algorithm and the implementations of the FastDTW algorithm gave identical errors, since it is possible for the FastDTW algorithm to construct the same accumulated cost matrix as the DTW algorithm.
The authors believe that the disparity in accuracy between GORA and the implementation of the FastDTW algorithm with radius 1 is especially significant. Since the error produced by the implementation of the FastDTW algorithm with radius 1 is both highly inaccurate and unstable (in the sense of large deviations from the mean error), especially for input signals in the form of video sequences, this suggests that an effective implementation of the FastDTW algorithm requires a larger radius. As such, the run time disparity between an effective FastDTW implementation and GORA is likely somewhere between the implementations of FastDTW with radius 1 and radius 5.
That being said, these results constitute only an initial analysis with two types of elementary data. However, they do suggest that GORA has potential to be a highly effective framework for signal comparison and action recognition.
A crucial difference between GORA and the DTW and FastDTW algorithms is GORA’s reliance on interpolation to recover the UST reparameterization of the input signal. Depending on the context, this can be an advantage or disadvantage for the GORA framework. For example, consider the problem of signal comparison or action recognition over a space of signals where computing the error between signals at a given time instance is itself computationally expensive to perform. By interpolating, GORA only has to compute the pairwise error between signals no more than $T$ times, where $T$ is the total number of time instances. On the other hand, all DTW methods will have to compute this error between $T$ and $T^2$ times. If GORA’s chosen method of interpolation is relatively inexpensive, this could give it a significant run time advantage over DTW methods.
However, this could easily become a disadvantage for GORA if the chosen method of interpolation is relatively expensive compared to the computation of the pairwise error between signals at a given time instance. In particular, this might serve to explain why GORA’s run time advantage over the FastDTW implementation with radius 1 is smaller with video sequences than with trajectories in $\mathbb{R}^3$. For trajectories in $\mathbb{R}^3$, GORA only performs three instances of linear interpolation, one in each dimension of the trajectory. In contrast, for video sequences GORA performs $n$ instances of linear interpolation, one for each pixel, i.e. along each dimension of the trajectory in $\mathbb{R}^n$ representing the vectorized video sequence. While this also means that DTW methods have to perform pairwise error computations with larger signals, it is likely that the cost due to increasing the instances of interpolation outweighs the cumulative cost of computing the Euclidean norm in $\mathbb{R}^n$. The videos in the Weizmann Database are relatively small ( pixels) and it is unknown whether GORA’s mean run time would remain favorable relative to other methods with video sequences of larger dimensions.
The GORA framework is very much an active work-in-progress. Currently, we are exploring GORA’s potential in providing a foundation for a more robust algorithm able to minimize or eliminate nuisance parameters while simultaneously reparameterizing signals to a UST [10]. The development of such an algorithm, able to inherently compensate for perturbations such as noise or motion artifacts while maintaining a similar linear complexity to GORA, would mark an important milestone toward the goal of robust robotic action recognition of human motions in real-time. Additionally, a well-known strength of DTW methods is their ability to effectively compare signals with different numbers of time instances. This is something we have yet to consider in our implementations of GORA and a topic we plan to address in our future work.
In this paper, we introduced the Globally Optimal Reparameterization Algorithm (GORA) for signal alignment and comparison, based on a recently proposed novel mathematical framework [10]. This algorithm reparameterizes signals to a universal standard timescale (UST), allowing for element-wise comparisons between multiple signals at each instance of time with linear time complexity of $O(T)$. In particular, we define procedures for applying this algorithm to characterize and compare signals in the form of real trajectories and video sequences.
Our experimental results have both provided a numerical validation of GORA’s theoretical basis and suggested that the GORA framework has potential to become a viable alternative to DTW methods for signal comparison and action recognition purposes. In particular, we showed that for signals in the form of real trajectories in $\mathbb{R}^3$ and vectorized video sequences with a fixed number of time instances, GORA’s run time is lower than that of the FastDTW algorithm with radius 1 and that GORA’s accuracy in matching signals representing fundamentally the same phenomena exceeds that of both the DTW algorithm and implementations of the FastDTW algorithm with radii of 1, 5, and 20.
The authors would like to thank Dr. Jin Seob Kim and Ms. Mengdi Xu for useful discussions that contributed to this work. This work was performed under National Science Foundation grant IIS-1619050 and Office of Naval Research Award N00014-17-1-2142.
References
[1] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 1, pp. 43–49, Feb 1978.
[2] H. Shaikh, L. C. Mesquita, S. D. C. S. Araujo, P. Student, and A. P. Professor, “Recognition of isolated spoken words and numeric using mfcc and dtw,” International Journal of Engineering Science, vol. 10539, 2017.
[3] S. Sempena, N. U. Maulidevi, and P. R. Aryan, “Human action recognition using dynamic time warping,” in Proceedings of the 2011 International Conference on Electrical Engineering and Informatics, July 2011, pp. 1–5.
[4] E. J. Keogh and M. J. Pazzani, “Scaling up dynamic time warping to massive datasets,” in Principles of Data Mining and Knowledge Discovery, J. M. Żytkow and J. Rauch, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 1999, pp. 1–11.
[5] T. Zhu, Q. Zhao, W. Wan, and Z. Xia, “Robust regression-based motion perception for online imitation on humanoid robot,” International Journal of Social Robotics, vol. 9, no. 5, pp. 705–725, Nov 2017. [Online]. Available: https://doi.org/10.1007/s12369-017-0416-9
[6] E. J. Keogh and M. J. Pazzani, “Derivative dynamic time warping,” in Proceedings of the 2001 SIAM International Conference on Data Mining. SIAM, 2001, pp. 1–11.
[7] S. Salvador and P. Chan, “Fastdtw: Toward accurate dynamic time warping in linear time and space,” Intelligent Data Analysis, vol. 11, no. 5, pp. 561–580, 2007.
[8] E. Keogh and C. A. Ratanamahatana, “Exact indexing of dynamic time warping,” Knowledge and Information Systems, vol. 7, no. 3, pp. 358–386, Mar 2005. [Online]. Available: https://doi.org/10.1007/s10115-004-0154-9
[9] F. Zhou and F. D. la Torre, “Generalized time warping for multi-modal alignment of human motion,” in
[10] G. S. Chirikjian, “Signal classification in quotient spaces via globally optimal variational calculus,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on. IEEE, 2017, pp. 735–743.
[11] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, “Actions as space-time shapes,” in The Tenth IEEE International Conference on Computer Vision (ICCV’05), 2005, pp. 1395–1402.
[12] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, “Actions as space-time shapes,” Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 12, pp. 2247–2253, December 2007.