Data is often gathered sequentially in the form of a time series, which consists of sequences of data points observed at successive time points. Elements of such sequences are correlated through time, and comparing time series requires one to take the direction of time into account. To define a sensible similarity measure between time series, saoke78 proposed dynamic time warping (DTW), a distance over the space of time series, which has recently been extended by pmlr-v70-cuturi17a into soft DTW for use as a differentiable loss. DTW consists of a minimal-cost alignment problem and is solved via a Bellman recursion, while soft DTW leverages a soft-min operation to smooth the DTW objective. Such distances enable tackling a large range of temporal problems, including aligning time series, averaging them or making long-term predictions.
DTW-based approaches require a sensible cost function between samples from the two time series. The specification of such cost functions is often hard, and limits the applicability of DTW. For example, in cases where the time series are invariant under symmetries, such as sequences of word embeddings which are only identified up to a rotation of latent space, one needs to solve an alignment problem to compare the two sequences. vayer2020time propose an extension of DTW that addresses this issue by making the cost invariant with respect to specific sets of symmetries. In this case, one still requires the definition of a cost function between samples from the two time series, along with a potentially large set of transformations to optimize over. On the other hand, in multi-modal settings, one considers time series that live on incomparable spaces: for example, the configuration space of a physical system and its representation as pixels of a video frame. In such cases, defining a sensible distance between samples from the two sequences is impractical, as it would require detailed understanding of the objects we wish to study.
In this work, we propose to tackle the problem by relaxing our notion of equality in a manner inspired by recent ideas from the optimal transport literature. Using connections between DTW and the Wasserstein distance, we propose Gromov dynamic time warping (GDTW), which compares two time series by contrasting their intra-relational geometries, analogously to the Gromov-Wasserstein distance of isometry classes of metric-measure spaces journals/focm/Memoli11. This allows one to compare two time series without defining a sensible ground cost between their domains, and automatically incorporates invariances into the distance.
(1) We introduce Gromov DTW, a distance between time series, along with a smoothed version, extending DTW variants to handle time series on incomparable spaces; (2) we present an efficient algorithm for computing it; (3) we apply Gromov DTW to a range of settings including barycentric averaging, generative modeling and imitation learning.
Let $(\mathcal{X}, d_{\mathcal{X}})$ be a compact metric space, and let a time series of length $T$ be an element of $\mathcal{X}^T$. Let $\mathcal{A}(T_x, T_y)$ be the set of alignment matrices, which are binary $T_x \times T_y$ matrices containing a path of ones from the top-left to the bottom-right corner, allowing only bottom, right or diagonal bottom-right moves. Given a matrix $A$ and a 4-dimensional array $L$, define the matrix $(L \otimes A)_{ij} = \sum_{k,l} L_{ikjl} A_{kl}$. Denote the Frobenius matrix inner product by $\langle \cdot, \cdot \rangle_F$. Let $\Sigma_n$ denote the probability simplex. Finally, $\check{x}_{1:t}$ denotes the first $t$ time steps of $\check{x}$.
2 Background and Related Work
We now introduce concepts and definitions needed in the rest of the work.
2.1 Dynamic Time Warping for Time Series Alignment
We now review DTW and its smoothed version. saoke78 consider the problem of aligning two time series $\check{x} \in \mathcal{X}^{T_x}$ and $\check{y} \in \mathcal{X}^{T_y}$, where potentially $T_x \neq T_y$. This is formalized as
where $\Delta(\check{x}, \check{y})$, with $\Delta(\check{x}, \check{y})_{ij} = d_{\mathcal{X}}(x_i, y_j)$, is the pairwise distance matrix. This problem amounts to finding an alignment matrix that minimizes the total alignment cost. The objective (2.1) can be computed in $O(T_x T_y)$ time by leveraging the dynamic programming forward recursion
The optimal alignment matrix can then be obtained by tracking the optimal path backwards. DTW is a more flexible choice for comparing time series than element-wise Euclidean distances, because its ability to warp time allows one to compare time series with different sampling frequencies. In particular, two time series can be close in DTW even if $T_x \neq T_y$. DTW has been used in a number of settings, including time series averaging and clustering PETITJEAN201276; 10.1016/j.patcog.2017.08.012655778; 10.1007/s10618-015-0418-x.
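As a concrete illustration, the forward recursion can be sketched as follows. This is a minimal NumPy sketch assuming a squared-Euclidean ground cost; function and variable names are illustrative, not the authors' implementation:

```python
import numpy as np

def dtw(x, y):
    """DTW value between time series x of shape (T_x, d) and y of shape
    (T_y, d) under a squared-Euclidean ground cost, via the Bellman
    forward recursion."""
    delta = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # pairwise cost matrix
    Tx, Ty = delta.shape
    r = np.full((Tx + 1, Ty + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            # Only bottom, right, or diagonal bottom-right moves are allowed.
            r[i, j] = delta[i - 1, j - 1] + min(r[i - 1, j], r[i, j - 1], r[i - 1, j - 1])
    return r[Tx, Ty]
```

For instance, `dtw` evaluates to zero when a series is compared against a time-warped copy of itself, reflecting DTW's insensitivity to sampling frequency.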
A limitation of DTW is the discontinuity of its gradient, which can affect the performance of gradient descent algorithms. To address this, pmlr-v70-cuturi17a introduced a soft version of DTW. The minimum in (2.1) is replaced with a softened version, yielding
DTW is recovered in the limit $\gamma \to 0$. They also discuss a softened version of the optimal alignment matrix $A^*$, given by the softened $\arg\min$ defined by
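To make the smoothing concrete, here is a minimal sketch of the soft minimum and the resulting soft DTW recursion (NumPy, squared-Euclidean ground cost; names are illustrative):

```python
import numpy as np

def soft_min(values, gamma):
    """Smoothed minimum via log-sum-exp; recovers min(values) as gamma -> 0."""
    v = np.asarray(values, dtype=float)
    m = v.min()  # subtract the min for numerical stability
    return m - gamma * np.log(np.exp(-(v - m) / gamma).sum())

def soft_dtw(x, y, gamma=1.0):
    """Soft DTW sketch: the hard min in the Bellman recursion is replaced
    by soft_min, which makes the objective differentiable in the inputs."""
    delta = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    Tx, Ty = delta.shape
    r = np.full((Tx + 1, Ty + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            r[i, j] = delta[i - 1, j - 1] + soft_min(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma)
    return r[Tx, Ty]
```

Note that the soft minimum is a lower bound on the hard minimum, so the soft DTW value can be slightly negative even for identical inputs; it converges to the DTW value as the temperature shrinks.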
where $Z$ is the normalizing constant of the corresponding unnormalized density. While it accounts for temporal variability, DTW is not invariant under transformations, such as translations and rotations, which can limit its applicability in settings where time series are observed only up to isometric transformations, such as word embeddings. To alleviate this, vayer2020time propose
which gives a distance between time series that is invariant under a set of transformations applied elementwise to points of the time series; vayer2020time consider orthonormal transformations. However, in more general settings, this requires one to optimize over a potentially large space of transformations, which becomes infeasible if the two spaces are too different.
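One natural way to carry out this joint optimization, sketched below under our own simplifying assumptions, is block-coordinate descent that alternates a DTW step with a Procrustes step (the optimal orthonormal map for the current alignment, obtained from an SVD). This is an illustrative reading of such schemes, not the reference implementation of vayer2020time:

```python
import numpy as np

def dtw_align(cost):
    """DTW on a precomputed cost matrix; returns the value and the list
    of aligned index pairs obtained by backtracking."""
    Tx, Ty = cost.shape
    r = np.full((Tx + 1, Ty + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            r[i, j] = cost[i - 1, j - 1] + min(r[i - 1, j], r[i, j - 1], r[i - 1, j - 1])
    pairs, (i, j) = [], (Tx, Ty)
    while i > 0 and j > 0:
        pairs.append((i - 1, j - 1))
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)], key=lambda t: r[t])
    return r[Tx, Ty], pairs

def dtw_gi(x, y, n_iter=10):
    """Alternate (i) DTW under the current transformation and (ii) a
    Procrustes step: the orthonormal map best matching the aligned pairs."""
    P = np.eye(y.shape[1])
    for _ in range(n_iter):
        cost = ((x[:, None, :] - (y @ P.T)[None, :, :]) ** 2).sum(-1)
        _, pairs = dtw_align(cost)
        # Procrustes: maximize sum of <x_i, P y_j> over matched pairs via SVD.
        M = sum(np.outer(x[i], y[j]) for i, j in pairs)
        U, _, Vt = np.linalg.svd(M)
        P = U @ Vt
    cost = ((x[:, None, :] - (y @ P.T)[None, :, :]) ** 2).sum(-1)
    return dtw_align(cost)[0], P
```

On a toy example where one series is a rotated copy of the other, the alternation recovers the rotation and drives the alignment cost to zero; with many or very different transformations, however, such schemes can get stuck in poor local optima, which motivates the approach below.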
2.2 Connecting DTW and Optimal Transport
Optimal transport peyre2019computational allows one to compare and average measures in a way that incorporates the geometry of the underlying space on which they are defined. Such approaches can be intuitively connected to DTW by observing that time series are essentially discrete measures equipped with an ordering. This allows one to view the alignment matrices in the DTW objective as analogues of coupling matrices that appear in the Kantorovich formulation of the classical optimal transport problem alma991005863149705596. To formalize this, we introduce the Wasserstein distance between discrete measures. Let $\mu = \sum_{i=1}^{n} a_i \delta_{x_i}$, $\nu = \sum_{j=1}^{m} b_j \delta_{y_j}$ be discrete probability measures with $a \in \Sigma_n$, $b \in \Sigma_m$, and set $\Delta_{ij} = d(x_i, y_j)$. Define the Wasserstein distance between discrete measures $\mu$ and $\nu$ as
where $\mathcal{C}(a, b)$ is the set of coupling matrices with marginals $a$ and $b$. Equation (2.2) clearly resembles (2.1), and in both cases the objective consists of the minimization of the element-wise dot product between a distance matrix and another matrix, which we call the plan. In the DTW case, the plan consists of an alignment matrix, and in the Wasserstein case it consists of a coupling matrix. Moreover, the optimal coupling describes the optimal amount of probability mass to move from point $x_i$ to $y_j$, whilst the optimal alignment describes whether or not $x_i$ and $y_j$ are aligned at optimality.
The Wasserstein distance is limited by the requirement for a sensible ground metric to be defined between samples $x_i$ and $y_j$, which is impossible if the spaces are unregistered Solomon16. The Wasserstein distance is also not invariant under isometries, such as rotations and translations, and generally leads to a large distance between measures equivalent up to such transformations. To relax these requirements, journals/focm/Memoli11 proposed the Gromov–Wasserstein (GW) distance between metric-measure triples $(\mathcal{X}, d_{\mathcal{X}}, \mu)$ and $(\mathcal{Y}, d_{\mathcal{Y}}, \nu)$, up to isometry. It is defined as
where $L$ is typically the squared error loss or the Kullback–Leibler divergence, and does not rely on a cost or metric to compare $\mathcal{X}$ with $\mathcal{Y}$. GW compares the intra-relational metric geometries of the two measures by comparing the distributions of their pairwise distances, and only requires the definition of metrics $d_{\mathcal{X}}$ and $d_{\mathcal{Y}}$ on $\mathcal{X}$ and $\mathcal{Y}$, respectively, which can be arbitrarily different. GW has been used as a tool for comparing measures on incomparable spaces, notably for training generative models pmlr-v97-bunne19a, graph matching pmlr-v97-xu19b, and graph averaging NIPS2019_8569. pmlr-v97-titouan19a also proposed an extension of both Wasserstein and GW, named Fused Gromov–Wasserstein, to deal with structured measures such as graphs, which consists of a mixture of the Wasserstein distance on the feature components and GW on the structural components.
3 Gromov Dynamic Time Warping
Motivated by the connections between DTW and optimal transport described in Sections 2.1 and 2.2, respectively, we introduce a distance between time series $\check{x}$ and $\check{y}$ defined on potentially incomparable compact metric spaces. Define the Gromov dynamic time warping distance between metric-time-series triples $(\mathcal{X}, d_{\mathcal{X}}, \check{x})$ and $(\mathcal{Y}, d_{\mathcal{Y}}, \check{y})$ as
where $L$ is a loss function measuring the alignment of the pairwise distances, and the first two elements of the triples are omitted to ease notation. We think of $L$ as a proxy for measuring the alignment of the time series (e.g., the squared error loss and the KL loss). Under the optimal alignment, for any two pairs of aligned points, if the distance between the points within one time series is small, then the distance between the corresponding points within the other is small as well. Some optimal alignments are given in Figure 1.
Provided $L$ is a pre-metric and so induces a Hausdorff topology, GDTW possesses the following properties: (i) $\mathrm{GDTW}(\check{x}, \check{y}) \geq 0$, (ii) $\mathrm{GDTW}(\check{x}, \check{y}) = 0$ if and only if there exists an isometry mapping $\check{x}$ onto $\check{y}$, and (iii) GDTW is symmetric if and only if $L$ is symmetric. As DTW fails to satisfy the triangle inequality, GDTW does not generally satisfy it either. Thus, GDTW is a pre-metric over equivalence classes of metric-time-series triples, up to metric isometry. A formal treatment is given in Appendix A.
3.1 Efficient Computation via a Frank–Wolfe Algorithm
We now present a straightforward algorithm for computing GDTW. Following ideas proposed in the optimal transport setting for computing the Gromov–Wasserstein distance, one can introduce a 4-dimensional array $L$ with entries $L_{ikjl} = L(d_{\mathcal{X}}(x_i, x_k), d_{\mathcal{Y}}(y_j, y_l))$ and express GDTW as $\mathrm{GDTW}(\check{x}, \check{y}) = \min_{A \in \mathcal{A}(T_x, T_y)} G_{\check{x}, \check{y}}(A)$, where $G_{\check{x}, \check{y}}(A) = \langle L \otimes A, A \rangle_F$. This expression is similar to the DTW objective in (2.1), but with a cost function that now depends on the alignment matrix $A$. We apply the Frank–Wolfe algorithm doi:10.1002/nav.3800030109; Chapel2020PartialGW to (3.1). This consists of (i) solving a linear minimization oracle
which can be performed exactly in $O(T_x T_y)$ time by a DTW iteration, and we note that $L \otimes A$ can be computed in $O(T_x T_y (T_x + T_y))$ operations in the case of the squared error loss pmlr-v48-peyre16. This is followed by (ii) a line-search step
We prove in Appendix A that the optimal step size is always $0$ or $1$. The final step (iii) is updating
In summary: if the proposal improves on the objective, we update the current alignment to the proposal; otherwise the algorithm has converged to a local optimum at the current alignment. Note that the optimal step size remedies the non-convexity of the constraint set: since it is always $0$ or $1$, iterates are guaranteed to remain in $\mathcal{A}(T_x, T_y)$ in spite of its non-convexity.
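The three steps above can be sketched end-to-end as follows. This is a naive NumPy sketch of the Frank–Wolfe scheme with the squared error loss $L(a, b) = (a - b)^2$, using an $O(T^4)$ tensor contraction for clarity rather than the faster factorization; names and initialization are illustrative:

```python
import numpy as np

def dtw_path(cost):
    """DTW on a precomputed cost matrix; returns the optimal alignment
    matrix by backtracking through the Bellman recursion."""
    Tx, Ty = cost.shape
    r = np.full((Tx + 1, Ty + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            r[i, j] = cost[i - 1, j - 1] + min(r[i - 1, j], r[i, j - 1], r[i - 1, j - 1])
    A = np.zeros((Tx, Ty))
    i, j = Tx, Ty
    while i > 0 and j > 0:
        A[i - 1, j - 1] = 1.0
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)], key=lambda t: r[t])
    return A

def gdtw_frank_wolfe(Dx, Dy, n_iter=50):
    """Frank-Wolfe sketch for GDTW with squared loss. Dx, Dy are the
    intra-series pairwise distance matrices of the two time series."""
    L = (Dx[:, :, None, None] - Dy[None, None, :, :]) ** 2   # L[i, k, j, l]
    cost_of = lambda A: np.einsum('ikjl,kl->ij', L, A)       # (L x A)[i, j]
    obj = lambda A: (cost_of(A) * A).sum()
    A = dtw_path(np.zeros_like(L[:, 0, :, 0]))  # any feasible alignment to start
    for _ in range(n_iter):
        # (i) linear minimization oracle: one DTW solve on the current cost
        # (the gradient is proportional to L x A for the symmetric squared loss).
        A_new = dtw_path(cost_of(A))
        # (ii)-(iii) optimal step size is 0 or 1: accept only if it improves.
        if obj(A_new) < obj(A):
            A = A_new
        else:
            break
    return obj(A), A
```

Each iteration thus costs one tensor contraction plus one quadratic-time DTW solve, and the 0/1 step size keeps every iterate a valid alignment matrix.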
Algorithm 1 for computing Gromov DTW converges to a stationary point.
Proof. See Appendix A. ∎
3.2 Gromov DTW as a Loss Function
Gromov DTW can itself be used as a differentiable loss function. Here, we apply the envelope theorem RePEc:mtp:titles:0262531925; milgrom2002envelope to (3.1) and obtain
Similarly to DTW, GDTW suffers from unpredictable behavior when the time series is close to a change point of the optimal alignment matrix, because of the discontinuity of its derivatives. To remedy this, we describe how GDTW can be softened analogously to soft DTW. This yields smoother derivatives when applying it to generative modeling of time series and imitation learning. Our algorithm for computing Gromov DTW consists of successive DTW iterations. Following ideas from the Gromov–Wasserstein literature, we propose to replace the DTW operation in the iterations with a softened version, by replacing the $\arg\min$ with the soft $\arg\min$ in (2.1). A priori, it may seem that computing this is significantly more involved—however, pmlr-v70-cuturi17a observed that
where the right-hand side involves the softened $\arg\min$ in (2.1). Hence, (3.2) can be computed by reverse-mode automatic differentiation in quadratic time, and soft GDTW iterations can be performed by plugging in the softened $\arg\min$. We approximate the derivatives by using the optimal soft alignment matrix in (3).
4 Learning with Gromov DTW as a Loss Function
We now present a range of applications of Gromov DTW, including barycentric averaging, generative modeling and imitation learning.
To compute barycenters with respect to Gromov DTW (3.1), we extend the algorithm from pmlr-v48-peyre16 to the sequential setting. Given time series $\check{x}_1, \dots, \check{x}_J$ and weights $\alpha \in \Sigma_J$, let $D_{\check{x}_j}$ be the distance matrix of $\check{x}_j$. For fixed $T$ (length of the barycentric time series), the barycenter is defined as any triple satisfying $D^* = \operatorname{arg\,min}_{D \in \mathbb{R}^{T \times T}} \sum_{j=1}^{J} \alpha_j \, \mathrm{GDTW}(D, D_{\check{x}_j})$, with $D_{mn} = d_{\mathcal{X}}(\check{x}^{(m)}, \check{x}^{(n)})$ for $n, m = 1, \dots, T$, where with some abuse of notation we write GDTW purely in terms of distance matrices. The barycentric time series can then be reconstructed by applying multi-dimensional scaling (MDS) Kruskal:1978eu to $D^*$, as illustrated in Figure 2. We solve (2) by rewriting it as
which is solved by alternating between minimizing over the alignments via Algorithm 1, and minimizing over $D$ for fixed alignments. The latter step admits a closed-form solution, given as follows.
Proof. See Appendix A. ∎
4.2 Generative Modeling
We now use GDTW as an approach for training generative models of time series. Here, we view our dataset of time series as a discrete measure $\mu$. We define a generative model $\mu_\theta = (g_\theta)_\# \rho$, where $\rho$ is a latent measure, such as an isotropic Gaussian, $g_\theta$ is a neural network, and $(g_\theta)_\# \rho$ is the pushforward measure. By nature of Gromov DTW, the generated time series do not have to live in the same space as the data. We train the model by minimizing the entropic Wasserstein distance NIPS2013_4927 between $\mu_\theta$ and $\mu$. For the ground cost, we use GDTW and, for comparison, DTW. For GDTW, the objective is
where $\varepsilon > 0$ weights the relative-entropy regularization term. Following pmlr-v84-genevay18a, it is also possible to use its debiased analog, the Sinkhorn divergence. The entropic Wasserstein distance is computed efficiently using the Sinkhorn algorithm Sinkhorn1974DiagonalET; NIPS2013_4927, and is minimized by gradient descent. This approach extends the Sinkhorn GAN of pmlr-v84-genevay18a and the GWGAN of pmlr-v97-bunne19a to sequential data.
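For reference, the entropic transport problem can be solved with a few lines of Sinkhorn iterations. The following is a minimal sketch with uniform marginals and a precomputed cost matrix (which in our setting would hold pairwise GDTW values between generated and data time series); variable names and defaults are illustrative:

```python
import numpy as np

def sinkhorn(C, eps=0.1, n_iter=200):
    """Entropic OT sketch: cost matrix C, uniform marginals, regularization
    strength eps. Returns the transport cost <P, C> under the regularized
    optimal plan P."""
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-C / eps)          # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):       # alternating marginal projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]
    return (P * C).sum()
```

In practice one would differentiate through these iterations (or use the debiased Sinkhorn divergence) to obtain gradients for the generator parameters.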
4.3 Imitation Learning
We consider an imitation learning setting in which an agent needs to solve a task given the demonstration of an expert. We assume the agent has access to the true transition function over the agent's state space, and define a state trajectory as a time series of states. An expert state trajectory solving a specific task, such as traversing a maze, is given. The goal is to train the agent's parametrized policy, a map from states to actions, to solve the given task by imitating the expert's behavior. To find this policy, the agent uses the model of the environment to predict state trajectories under the current policy, compares these predictions with the expert's trajectory, and then optimizes the controller parameters to minimize the distance between the predicted agent trajectory and the observed expert trajectory. Using GDTW, our objective is
The flexibility of GDTW allows for expert trajectories defined in pixel space, while the agent lives in a low-dimensional state space. Similarly, rollouts obtained with the learned policy mimic the expert's trajectory up to isometry. For comparison, we also consider DTW in (4.3), which aims to learn the same trajectory in the same space as the expert; this requires the agent's and expert's state spaces to coincide, and the starting positions for the agent and expert to be identical. From a reinforcement learning perspective, the use of GDTW in (4.3) can be interpreted as a value estimate, and gradient-based policy learning can be seen as taking value gradients fairbank2012value; heess2015learning.
We now assess the effectiveness of our proposals in settings in which (i) time series live in comparable spaces, where previous approaches apply, and (ii) the spaces are incomparable.
Baselines: Throughout the experiments, we compare our proposal, in settings in which they apply, to DTW saoke78, soft DTW pmlr-v70-cuturi17a, and the rotationally invariant extension DTW-GI of vayer2020time.
We first evaluate GDTW on alignment tasks. We consider two settings in which $\check{y}$ is obtained from $\check{x}$ by applying (i) a rotation, and (ii) a translation followed by a rotation. DTW-GI is invariant under rotations, and is therefore expected to work in setting (i) only, whilst GDTW is invariant under isometries, and is expected to work in both. In Figure 1, we see that GDTW recovers the right alignment in both settings, while DTW-GI only works in the rotational setting 1(a), and ordinary DTW fails in both 1(a) and 1(b). Further experiments with soft DTW and soft GDTW are given in Appendix B.
5.2 Barycenter Computation
We now investigate barycentric averaging with GDTW, on both toy data and the QuickDraw dataset (https://quickdraw.withgoogle.com/). We compare Gromov DTW to DTW and DTW-GI, where barycenters for the latter two methods are computed using DTW barycentric averaging (DBA) PETITJEAN201276.
In Figure 2, we see that in comparable settings DTW barycenters fail if time series are rotated or translated. DTW-GI is robust to rotation, but fails when applying both rotations and translations. By contrast, GDTW is robust to both, and leads to meaningful barycenters in all settings.
The QuickDraw dataset consists of time series of drawings in the plane, belonging to 345 categories. Among these categories, we selected hands, clouds, fish, and blueberries. To address the high variability within classes, we selected input data following a preprocessing routine described in Appendix B. A sample of the datasets, together with barycenters computed with DTW, DTW-GI, and GDTW, is displayed in Figure 3. While DTW and DTW-GI fail to reproduce the shape of the inputs for most classes, GDTW provides meaningful barycenters across the range of examples. This shows that GDTW is more robust in recovering the geometric shape of the time series, whilst the DTW variants are sensitive to even moderate isometries.
5.3 Generative Modeling
We evaluate the generative modeling proposal of Section 4.2, and compare the behavior of the learned model when using DTW and GDTW. Here, we consider the sequential MNIST dataset (https://github.com/edwin-de-jong/mnist-digits-stroke-sequence-data), which consists of time series of digits being drawn, where each time step corresponds to a stroke. In Figure 4 we see that samples using GDTW as ground cost (4.2) are of significantly better quality than samples using DTW. This can be explained by the variability in the dataset: slight translations significantly affect DTW, but not GDTW. Note that the GDTW samples are rotated and reflected, since GDTW only produces learned samples up to metric isometries.
5.4 Imitation Learning
We now apply Gromov DTW to the imitation learning setting of Section 4.3. Here, we are given an expert trajectory, and our goal is to find a policy such that the agent's simulated trajectory mimics it. We consider maze navigation tasks in two settings: (i) both the expert trajectories and the agent's domain are two-dimensional, and (ii) the expert trajectories consist of a video sequence of images, whilst the agent's domain is two-dimensional. In the first setting, both DTW and GDTW apply, whilst in the second setting only GDTW can be used. Figure 5(h) displays the loss (4.3), which is the GDTW distance to the given trajectory, obtained by learning with GDTW and DTW in setting (i), averaged across 20 seeds. We see that GDTW slightly outperforms DTW, and both agents recover the spiral trajectory provided by the expert.
Finally, we consider a setting in which an agent living in is provided with an expert trajectory consisting of a video of a car driving through a spiral, illustrated in Figures 5(a)–5(e) (before down-scaling the images). Here, the state-space of the agent, , differs from the state-space of the expert, . We define the cost on image space to be the -Wasserstein distance, defined on images interpreted as densities on a grid, and the cost on the Euclidean space to be the Euclidean distance. Figure 5(f) shows the agent’s trajectory under the learned policy , and Figure 5(g) shows the loss (4.3) against the number of training steps. We see that, using GDTW, the agent successfully learns to solve the task despite never having access to trajectories in the space of interest.
We propose Gromov DTW, a distance between time series living on potentially incomparable spaces. The idea is to compare the intra-relational geometries of the time series under consideration, alleviating the need for a ground metric defined across spaces, which previous approaches require. Moreover, Gromov DTW is invariant under isometries by construction, which is an important inductive bias for generalization, and makes it significantly more robust under transformations of the spaces than DTW and DTW-GI. The generality of our proposed distance enables applying it to a wide range of problems that previous approaches could not tackle, in particular when comparing time series on unregistered spaces. We considered applications ranging from alignment to barycentric averaging, generative modeling and imitation learning.
In this work, we develop techniques for aligning, averaging, and learning using multiple time series in potentially different domains. This makes it easier for practitioners to use a number of different tools in these settings.
In particular, we envision this might help reduce demand for manually labeled data in robotics. For example, one might use techniques derived from ours to train robots on expert demonstrations, without manually transcribing those demonstrations into a computer-friendly format. This can play an important role in human-robot interaction, for instance in construction or elderly care.
Similarly, aligning time series can be helpful in epidemiology. For example, it could allow scientists to compare the shapes of infection curves with different starting points, thereby making it possible to compare countries at different stages of an epidemic. This could help with understanding the evolution of diseases, such as COVID-19, which would in turn benefit the general public.
Finally, our work can be used in a climate science context to align time series from computer-generated numerical weather prediction and heterogeneous observational data, which might be available with different sampling frequencies. This in turn might improve quality of data assimilation, thereby improving weather forecasting, which we hope might contribute to the United Nations’ sustainable development goal 13 on climate action.
We are grateful to K. S. Sesh Kumar for ideas on the Frank-Wolfe algorithm. SC was supported by the Engineering and Physical Sciences Research Council (grant number EP/S021566/1).
Appendix A Theory
Here we develop the theory of Gromov dynamic time warping distances. We begin by introducing the necessary preliminaries.
Definition 3 (Time series).
Let $(\mathcal{X}, d_{\mathcal{X}})$ be a compact metric space, and let $T \in \mathbb{N}$. We call a finite sequence $\check{x} = (x_1, \dots, x_T) \in \mathcal{X}^T$ a time series. Let $\mathcal{X}^* = \bigcup_{T \in \mathbb{N}} \mathcal{X}^T$ be the space of all time series.
Let $\check{x}$ and $\check{y}$ be time series. Define a premetric $c$, which we call the cost, and define the cost matrix $\Delta(\check{x}, \check{y})$ by $\Delta(\check{x}, \check{y})_{ij} = c(x_i, y_j)$.
We say that a binary matrix $A \in \{0, 1\}^{T_x \times T_y}$ is an alignment matrix if $A_{1,1} = 1$, $A_{T_x, T_y} = 1$, and $A_{i,j} = 1$ with $(i, j) \neq (T_x, T_y)$ implies exactly one of $A_{i+1,j} = 1$, $A_{i,j+1} = 1$, and $A_{i+1,j+1} = 1$ holds. Let $\mathcal{A}(T_x, T_y)$ be the set of alignment matrices.
Definition 6 (Dynamic Time Warping).
Let and be time series. Define the dynamic time warping distance by
where $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product over real matrices.
If the cost $c$ is a premetric, then DTW is a premetric on the space of time series. If we take the cost to be the metric $d_{\mathcal{X}}$, then DTW is a symmetric premetric on $\mathcal{X}^*$.
A premetric induces a Hausdorff topology on the set it is defined over, and so is suitable for many purposes that ordinary metrics are used for. To proceed along the path suggested by Gromov-Hausdorff and Gromov-Wasserstein distances over metric-measure spaces, we need to define the time series analog.
Define a metric space equipped with a time series to be a triple $(\mathcal{X}, d_{\mathcal{X}}, \check{x})$.
Let $(\mathcal{X}, d_{\mathcal{X}}, \check{x})$ and $(\mathcal{Y}, d_{\mathcal{Y}}, \check{y})$ be metric spaces equipped with time series. Define the support of $\check{x}$ as a subset of $\mathcal{X}$, and similarly for $\check{y}$, and equip both sets with their respective subset metrics. We say that $(\mathcal{X}, d_{\mathcal{X}}, \check{x})$ and $(\mathcal{Y}, d_{\mathcal{Y}}, \check{y})$ are isomorphic if there is a metric isometry $\phi$ such that $\phi(\tilde{x}) = \tilde{y}$, where $\tilde{x}$ and $\tilde{y}$ denote $\check{x}$ and $\check{y}$ with consecutive repeated elements removed.
At this stage it is not clear whether or not the class of all such triples under isometry forms a set, or is instead a proper class. To avoid set-theoretic complications, we need the following technical result.
The class of all isometry classes of compact metric spaces is a set.
Proof. See alma991005863149705596, ch. 27, p. 746. ∎
It follows immediately that the class of all metric spaces equipped with time series is a set, provided that identification by isometry extends to the time series. We are now ready to define GDTW.
Let be a premetric on , and define by
Define the Gromov Dynamic Time Warping distance by
GDTW is a premetric on the set of all metric spaces equipped with time series.
We check the conditions. Non-negativity is immediate by definition. It also follows immediately that implies . We thus need to prove that implies . By hypothesis, we have
where all elements of the last sum are non-zero. Suppose without loss of generality that $\check{x}$ and $\check{y}$ contain no duplicate elements. We argue inductively that the optimal alignment matrix is the identity matrix. First, note that $A_{1,1} = 1$ by definition of $\mathcal{A}(T_x, T_y)$. Now, consider the next entry along the path. If we suppose the path moves right, then we must have equality of the corresponding pairwise distances, and hence equality of the corresponding points. But then two elements coincide, contradicting the assumption that there are no duplicates. Hence the path does not move right. By mirroring the above argument, the path does not move down. Hence, by definition of $\mathcal{A}(T_x, T_y)$, the only remaining possibility is a diagonal move. Inductively, we conclude the path moves diagonally at every step. Finally, since the lower-right corner of the alignment also has to be equal to one by definition, it follows that the alignment is the square identity matrix, hence $T_x = T_y$. Plugging this into the previous equality yields equality of the corresponding pairwise distances for all indices, which together with the diagonal alignment gives the isomorphism. Finally, to see that lack of duplicates truly is assumed without loss of generality, note that if there are duplicates in $\check{x}$ and $\check{y}$, then we apply the above argument to $\tilde{x}$ and $\tilde{y}$ of Definition 9, which no longer contain duplicates. The claim follows. ∎
One can easily see that GDTW will be symmetric if $L$ is symmetric. Since DTW itself does not satisfy a triangle inequality 10.1016/j.patcog.2008.11.030, GDTW will not satisfy it either.
We formulate an algorithm for computing GDTW within the Frank–Wolfe (FW) framework doi:10.1002/nav.3800030109. These algorithms tackle problems of the form
where is the objective to be minimized, assumed differentiable with Lipschitz continuous gradient, and is the (usually convex) constraint set.
We first describe the Frank–Wolfe algorithm in its general form. Let be the initial point.
Solve the linear minimization oracle
Find the optimal step size
Perform the update
We now describe the algorithm in our setting. The objective is
The constraint domain $\mathcal{A}(T_x, T_y)$ in our case is not convex. Convexity is usually a requirement of Frank–Wolfe algorithms, but we derive a result in the sequel that enables us to bypass this requirement.
Step 1: Linear Minimization Oracle
We note that the gradient is of the form
and step 1 thus consists in solving
This can be minimized exactly in $O(T_x T_y)$ time by plugging it into the DTW objective (2.1) and solving via dynamic programming.
Step 2: Optimal Step Size
The optimal step size in (6) is either 0 or 1.
This follows by applying the argument of Chapel2020PartialGW, who derive a similar result in the Gromov–Wasserstein setting, with one minor modification. In their equation (9), an optimal-transport-based argument is used to obtain an inequality—in our setting, an analogous inequality holds for DTW, given by
where is given by dynamic programming. The claim follows. ∎
We observe from Proposition 13 that the optimal step size is $0$ or $1$; therefore, if the proposal improves on the objective, the iterate is updated to the proposal, and otherwise it is kept unchanged.
Step 3: Iterative Updates
Frank–Wolfe algorithms typically require convexity of the constraint set, otherwise the iterates might escape from the constrained domain. In our setting, since the optimal step size is either 0 or 1, this never happens: the next iterate is equal to either the current iterate or the proposal, both of which belong to the constraint set, so convexity is not needed to guarantee the iterates remain in the constrained domain.
Summarizing, we obtain the following.
Algorithm 1 for computing GDTW converges to a stationary point.
The result follows by a minor modification of Theorem 1 of DBLP:journals/corr/Lacoste-Julien16, which proves convergence of the Frank–Wolfe algorithm for possibly non-convex optimization objectives over convex constraint sets. Here, we use Proposition 13 instead of convexity to ensure the iterates remain in the constraint set. ∎
If $L$ is the squared error loss, the solution to the minimization in (4.1) for fixed alignments is
where division is performed element-wise, and $\mathbf{1}$ is a vector of ones.
If $L$ is the squared error loss, then (4.1) can be written as
where $\odot$ denotes element-wise matrix multiplication. Differentiating the objective with respect to $D$ and setting it equal to zero, we get
which, dividing both sides element-wise, gives the result. ∎
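The closed-form update of the proposition above can be sketched in a few lines. This is an illustrative NumPy sketch under the squared error loss, where each alignment matrix relates the barycenter (rows) to one input series (columns); names are hypothetical:

```python
import numpy as np

def barycenter_distance_update(D_list, A_list, alpha):
    """Closed-form update of the barycentric distance matrix D for the
    squared error loss: a weighted average of aligned input distance
    matrices, normalized element-wise by the aligned mass."""
    num = sum(a * A @ D @ A.T for a, A, D in zip(alpha, A_list, D_list))
    # A.sum(1) is A @ 1; the denominator counts how much alignment mass
    # lands on each (m, n) entry of the barycenter.
    den = sum(a * np.outer(A.sum(1), A.sum(1)) for a, A in zip(alpha, A_list))
    return num / den  # element-wise division
```

Alternating this update with the Frank–Wolfe alignment step of Algorithm 1 yields the barycenter scheme of Section 4.1; the barycentric time series itself is then recovered from the resulting distance matrix via MDS.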
Appendix B Experimental Details
In Figures 7–10, we provide further alignment experiments. Here, we set the entropic term to for soft alignments, and we use normalized distance matrices. We observe that GDTW and soft GDTW are robust to scaling, rotations and translations, whilst DTW and soft DTW are sensitive to rotations and translations. Finally, DTW-GI is robust to rotations, but sensitive to translations, which further corroborates the observations from Figure 1.
In this experiment, we compute barycenters of 30 elements from 4 QuickDraw classes with respect to DTW, DTW-GI and GDTW.
Data selection and pre-processing
The classes considered in the experiment are fish, blueberries, clouds and hands. The variability within each QuickDraw class is extremely high: we created datasets of 30 elements chosen so that the category of each element is straightforward to recognize, each element is drawn with a single stroke, and the elements share a common style. The full datasets are displayed in Figure 6. Before running the algorithms, we rescaled the data, applying the transformation to each data point. Finally, we down-sampled the time series, reducing their length by 1/3 for hands and 1/2 for fish, clouds and blueberries.
For GDTW barycenters, we applied our algorithm of Section 4.1, using the entropy regularized version of GDTW with . For DTW and DTW-GI, we used standard DBA procedures. For both algorithms, we set the barycentric length to 60 for fish and hands and 40 for clouds and blueberries. Also, we set the maximum number of FW iterations for GDTW to 25, and the number of DTW-GI iterations to 30.
In this experiment, we use the Sinkhorn divergence objective. We use a latent dimension of , and the generator is a -layer MLP with neurons per layer. The length of the generated time series is set to , and the dimension of the space is , thus the MLP's output dimension is . We set the batch size to . We use the ADAM optimizer, with , and the learning rate set to . We set , and the maximum number of iterations in the GDTW computation to . We use the sequential MNIST dataset (https://github.com/edwin-de-jong/mnist-digits-stroke-sequence-data) and normalize the data, which are time series in the plane, into the unit square.
In this experiment, we use a two-layer MLP policy, with an input dimension of , a hidden dimension of 64, and an output dimension of . The learning rate is set to , and we use the ADAM optimizer. In the video/2D experiment (video generated using https://github.com/gezichtshaar/PyRaceGame), the ground cost for the video is the entropic 2-Wasserstein distance, computed efficiently using GeomLoss feydy2019interpolating, and the ground cost on the 2D space is the squared error loss. We plot mean scores along with standard deviations (across 20 random seeds).