COT-GAN: Generating Sequential Data via Causal Optimal Transport

by Tianlin Xu et al.

We introduce COT-GAN, an adversarial algorithm to train implicit generative models optimized for producing sequential data. The loss function of this algorithm is formulated using ideas from Causal Optimal Transport (COT), which combines classic optimal transport methods with an additional temporal causality constraint. Remarkably, we find that this causality condition provides a natural framework to parameterize the cost function learned by the discriminator as a robust (worst-case) distance, and an ideal mechanism for learning time-dependent data distributions. Following Genevay et al. (2018), we also include an entropic penalization term which allows for the use of the Sinkhorn algorithm when computing the optimal transport cost. Our experiments show the effectiveness and stability of COT-GAN when generating both low- and high-dimensional time series data. The success of the algorithm also relies on a new, improved version of the Sinkhorn divergence which demonstrates less bias in learning.








1 Introduction

Dynamical data are ubiquitous in the world, including natural scenes such as video and audio data, and temporal recordings such as physiological and financial traces. Being able to synthesize realistic dynamical data is a challenging unsupervised learning problem with wide scientific and practical applications. In recent years, training implicit generative models (IGMs) has proven to be a promising approach to data synthesis, driven by the work on generative adversarial networks (GANs) [22].


Nonetheless, training IGMs on dynamical data poses an interesting yet difficult challenge. On one hand, learning complex dependencies between spatial locations and channels in static images has already received significant attention within the research community. On the other hand, temporal dependencies are no less complicated, since dynamical features are strongly correlated with spatial features. Recent works, including [34, 42, 15, 39, 36], often tackle this problem by separating the model or loss into static and dynamic components.

In this paper, we consider training dynamic IGMs for sequential data. We introduce a new adversarial objective that builds on optimal transport (OT) theory and constrains the transport plans to respect causality: the probability mass moved to the target sequence at time $t$ can only depend on the source sequence up to time $t$ [2, 8]. A reformulation of the causality constraint leads to an adversarial training objective in the spirit of [20], but tailored to sequential data. In addition, we demonstrate that optimizing the original Sinkhorn divergence over mini-batches causes biased parameter estimation, and propose the mixed Sinkhorn divergence, which avoids this problem. Our new framework, Causal Optimal Transport GAN (COT-GAN), outperforms existing methods on a wide range of datasets, from traditional time series to high-dimensional videos.

2 Background

2.1 Adversarial learning for implicit generative models

Goodfellow et al. [22] introduced an adversarial scheme for training an IGM. Given a (real) data distribution $\mu$ and a distribution $\zeta$ on some latent space $\mathcal{Z}$, the generator is a function $g\colon \mathcal{Z}\to\mathcal{X}$ trained so that the induced distribution $\nu = g_{\#}\zeta$ is as close as possible to $\mu$ as judged by a discriminator. The discriminator is a function $f\colon \mathcal{X}\to\mathbb{R}$ trained to output a high value if the input is real (from $\mu$), and a low value otherwise (from $\nu$). In practice, the two functions are implemented as neural networks $g_\theta$ and $f_\varphi$ with parameters $\theta$ and $\varphi$, and the generator distribution is denoted by $\nu_\theta$. The training objective is then formulated as a zero-sum game between the generator and the discriminator. Different probability divergences were later proposed to evaluate the distance between $\mu$ and $\nu_\theta$ [30, 26, 29, 4]. Notably, the Wasserstein-1 distance was used in [6, 5]:

$$W_1(\mu, \nu_\theta) = \inf_{\pi \in \Pi(\mu, \nu_\theta)} \mathbb{E}^{\pi}\big[\, \|x - y\| \,\big], \tag{2.1}$$

where $\Pi(\mu, \nu_\theta)$ is the space of transport plans (couplings) between $\mu$ and $\nu_\theta$. Its dual form turns out to be a maximization problem over functions $f$ that are 1-Lipschitz. Combined with the minimization over $\theta$, a min-max problem can be formulated with a Lipschitz constraint on $f$.

2.2 Optimal transport and Sinkhorn divergences

The optimization in (2.1) is a special case of the classical (Kantorovich) optimal transport problem. Given probability measures $\mu$ on $\mathcal{X}$, $\nu$ on $\mathcal{Y}$, and a cost function $c\colon \mathcal{X}\times\mathcal{Y}\to\mathbb{R}$, the optimal transport problem is formulated as

$$W_c(\mu, \nu) := \inf_{\pi \in \Pi(\mu, \nu)} \mathbb{E}^{\pi}\big[\, c(x, y) \,\big]. \tag{2.2}$$

Here, $c(x,y)$ represents the cost of transporting a unit of mass from $x$ to $y$, and $W_c(\mu,\nu)$ is thus the minimal total cost to transport the mass from $\mu$ to $\nu$. Obviously, the Wasserstein-1 distance (2.1) corresponds to $c(x,y) = \|x-y\|$. However, when $\mu$ and $\nu$ are supported on finite sets of size $n$, solving (2.2) has super-cubic (in $n$) complexity [14, 31, 32], which is computationally expensive for large datasets.
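To make the cost of exact OT concrete, here is a minimal sketch (ours, not from the paper) that solves (2.2) for two uniform discrete measures of equal size: by Birkhoff's theorem an optimal plan can in this case be taken to be a permutation, and enumerating the factorial search space illustrates why exact solvers do not scale.

```python
import itertools

import numpy as np

def exact_ot_uniform(C):
    """Exact OT cost between two uniform discrete measures of equal size n,
    given the n x n cost matrix C. For uniform marginals an optimal plan can
    be taken to be a permutation (Birkhoff), so we enumerate all n! of them;
    this is only feasible for tiny n."""
    n = C.shape[0]
    best = min(
        sum(C[i, p[i]] for i in range(n))
        for p in itertools.permutations(range(n))
    )
    return best / n  # each atom carries mass 1/n
```

For $n$ beyond a dozen this enumeration is hopeless, and even specialized linear-programming solvers remain super-cubic, which motivates the entropic regularization discussed next.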

Instead, Genevay et al. [20] proposed training IGMs by minimizing a regularized Wasserstein distance that can be computed more efficiently by the Sinkhorn algorithm (see [14]). For transport plans with marginals $\mu$ supported on a finite set $\{x^1,\dots,x^n\}$ and $\nu$ on a finite set $\{y^1,\dots,y^n\}$, any $\pi \in \Pi(\mu,\nu)$ is also discrete with support on the set of all possible pairs $(x^i, y^l)$. Denoting $\pi_{i,l} = \pi(x^i, y^l)$, the Shannon entropy of $\pi$ is given by $H(\pi) = -\sum_{i,l} \pi_{i,l} \log \pi_{i,l}$. For $\varepsilon > 0$, the regularized optimal transport problem then reads as

$$\inf_{\pi \in \Pi(\mu,\nu)} \big\{ \mathbb{E}^{\pi}[c(x,y)] - \varepsilon H(\pi) \big\}. \tag{2.3}$$

Denoting by $\pi^{\varepsilon}$ the optimizer in (2.3), one can define a regularized distance by

$$\mathcal{W}_{c,\varepsilon}(\mu,\nu) := \mathbb{E}^{\pi^{\varepsilon}}\big[c(x,y)\big]. \tag{2.4}$$

Computing this distance is numerically more stable than solving the dual formulation of the OT problem, as the latter requires differentiating dual Kantorovich potentials; see e.g. [12, Proposition 3]. To correct the fact that $\mathcal{W}_{c,\varepsilon}(\mu,\mu) \neq 0$, Genevay et al. [20] proposed to use the Sinkhorn divergence

$$\widehat{\mathcal{W}}_{c,\varepsilon}(\mu,\nu) := 2\,\mathcal{W}_{c,\varepsilon}(\mu,\nu) - \mathcal{W}_{c,\varepsilon}(\mu,\mu) - \mathcal{W}_{c,\varepsilon}(\nu,\nu) \tag{2.5}$$

as the objective function, and to learn the cost $c_\varphi$ parameterized by $\varphi$, resulting in the adversarial objective

$$\min_\theta \max_\varphi \widehat{\mathcal{W}}_{c_\varphi,\varepsilon}(\mu, \nu_\theta).$$

In practice, a sample version of this cost is used, where $\mu$ and $\nu_\theta$ are replaced by distributions of mini-batches randomly extracted from them.
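The quantities (2.3)-(2.5) can be sketched in a few lines of NumPy. This is an illustrative implementation, assuming uniform empirical measures, a squared-distance cost, and a fixed number of Sinkhorn iterations (it is not the paper's reference code):

```python
import numpy as np

def sinkhorn_cost(X, Y, eps=0.1, n_iters=300):
    """Regularized OT cost (2.4) between uniform empirical measures on
    samples X (n, d) and Y (n, d), via Sinkhorn fixed-point iterations."""
    C = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # squared-distance cost
    K = np.exp(-C / eps)                                       # Gibbs kernel
    a = np.full(len(X), 1.0 / len(X))
    b = np.full(len(Y), 1.0 / len(Y))
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)   # match second marginal
        u = a / (K @ v)     # match first marginal
    P = u[:, None] * K * v[None, :]  # entropic optimal plan
    return float(np.sum(P * C))      # E^pi[c(x, y)]

def sinkhorn_divergence(X, Y, eps=0.1):
    """Sinkhorn divergence (2.5): removes the entropic bias so that the
    divergence of a sample set with itself vanishes."""
    return (2 * sinkhorn_cost(X, Y, eps)
            - sinkhorn_cost(X, X, eps) - sinkhorn_cost(Y, Y, eps))
```

By construction the divergence of a sample set with itself is zero, which is exactly the bias correction that (2.5) is designed to achieve.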

3 Training generative models with Causal Optimal Transport

We now focus on data that consists of $d$-dimensional (number of channels), $T$-long sequences, so that $\mu$ and $\nu$ are distributions on the path space $\mathbb{R}^{T \times d}$. In this setting we introduce a special class of transport plans between $\mu$ and $\nu$ that will be used to define our objective function. On the product of two copies of the path space, we denote by $x$ and $y$ the first and second half of the coordinates, and we let $\mathcal{F}^x = (\mathcal{F}^x_t)_{t=1}^T$ and $\mathcal{F}^y = (\mathcal{F}^y_t)_{t=1}^T$ be the canonical filtrations (for all $t$, $\mathcal{F}^x_t$ is the smallest $\sigma$-algebra s.t. $x_{1:t} := (x_1,\dots,x_t)$ is measurable; analogously for $y$).

3.1 Causal Optimal Transport

Definition 3.1.

A transport plan $\pi \in \Pi(\mu,\nu)$ is called causal if, for all $t$ and all sets $B \in \mathcal{F}^y_t$, the map $x \mapsto \pi(B \mid x)$ is $\mathcal{F}^x_t$-measurable. The set of all such plans will be denoted by $\Pi^{\mathcal{K}}(\mu,\nu)$.

Roughly speaking, the amount of mass transported by $\pi$ to a subset of the target space belonging to $\mathcal{F}^y_t$ depends on the source space only up to time $t$. Thus, a causal plan transports $\mu$ into $\nu$ in a non-anticipative way, which is a natural request in a sequential framework. In the present paper, we will use causality in the sense of Definition 3.1. However, note that in the literature the term causality is often used to indicate a mapping in which the output at a given time depends only on inputs up to that time.

Restricting the space of transport plans to $\Pi^{\mathcal{K}}(\mu,\nu)$ in the OT problem (2.2) gives the COT problem

$$\mathcal{W}^{\mathcal{K}}_{c}(\mu,\nu) := \inf_{\pi \in \Pi^{\mathcal{K}}(\mu,\nu)} \mathbb{E}^{\pi}\big[c(x,y)\big]. \tag{3.1}$$

COT has already found wide application in dynamic problems in stochastic calculus and mathematical finance, see e.g. [3, 1, 2, 9, 7]. The causality constraint can be equivalently formulated in several ways, see [8, Proposition 2.3]. The one that will be useful for our purposes can be expressed in the following way: let $\mathcal{M}(\mathcal{F}^x)$ be the set of $\mathcal{F}^x$-martingales, and define

$$\mathcal{H}(\mu) := \Big\{ (h, M) : h = (h_t)_{t=1}^{T-1},\ M = (M_t)_{t=1}^{T},\ h_t \in \mathcal{C}_b(\mathbb{R}^{t \times d}),\ M_t \in \mathcal{C}_b(\mathbb{R}^{t \times d}),\ M \in \mathcal{M}(\mathcal{F}^x) \text{ w.r.t. } \mu \Big\}; \tag{3.2}$$

then a transport plan $\pi \in \Pi(\mu,\nu)$ is causal if and only if

$$\mathbb{E}^{\pi}\Big[ \textstyle\sum_{t=1}^{T-1} h_t(y_{1:t})\, \Delta_{t+1} M(x) \Big] = 0 \quad \text{for all } (h, M) \in \mathcal{H}(\mu),$$

where $\Delta_{t+1} M(x) := M_{t+1}(x_{1:t+1}) - M_t(x_{1:t})$, $x_{1:t} := (x_1,\dots,x_t)$, and similarly for $y_{1:t}$. As usual, $\mathcal{C}_b$ denotes the space of continuous, bounded functions. Where no confusion can arise, with an abuse of notation we will simply write $M_t(x)$ rather than $M_t(x_{1:t})$.

3.2 Regularized Causal Optimal Transport

In the same spirit as [20], we include an entropic regularization in the COT problem (3.1) and consider

$$\inf_{\pi \in \Pi^{\mathcal{K}}(\mu,\nu)} \big\{ \mathbb{E}^{\pi}[c(x,y)] - \varepsilon H(\pi) \big\}. \tag{3.3}$$

The solution to such a problem is then unique due to strict concavity of $H$. We denote by $\pi^{\mathcal{K},\varepsilon}$ the optimizer of the above problem, and define the regularized COT distance by

$$\mathcal{W}^{\mathcal{K}}_{c,\varepsilon}(\mu,\nu) := \mathbb{E}^{\pi^{\mathcal{K},\varepsilon}}\big[c(x,y)\big]. \tag{3.4}$$

Remark 3.2.

In analogy to the non-causal case, it can be shown that, for discrete $\mu$ and $\nu$ (as in practice), the following limits hold:

$$\mathcal{W}^{\mathcal{K}}_{c,\varepsilon}(\mu,\nu) \xrightarrow{\ \varepsilon \to 0\ } \mathcal{W}^{\mathcal{K}}_{c}(\mu,\nu), \qquad \mathcal{W}^{\mathcal{K}}_{c,\varepsilon}(\mu,\nu) \xrightarrow{\ \varepsilon \to \infty\ } \mathbb{E}^{\mu\otimes\nu}\big[c(x,y)\big],$$

where $\mu\otimes\nu$ denotes the independent coupling.

See Section A.1 for a proof. This means that the regularized COT distance lies between the COT distance and the loss obtained by the independent coupling, and is closer to the former for small $\varepsilon$. Optimizing over the space of causal plans is not straightforward. Nonetheless, the following proposition shows that the problem can be reformulated as a maximization over non-causal problems with respect to a specific family of cost functions.

Proposition 3.3.

The regularized COT problem (3.3) can be reformulated as

$$\inf_{\pi \in \Pi^{\mathcal{K}}(\mu,\nu)} \big\{ \mathbb{E}^{\pi}[c(x,y)] - \varepsilon H(\pi) \big\} \;=\; \sup_{(h,M) \in \mathcal{H}(\mu)} \inf_{\pi \in \Pi(\mu,\nu)} \big\{ \mathbb{E}^{\pi}[c^{h,M}(x,y)] - \varepsilon H(\pi) \big\},$$

where

$$c^{h,M}(x,y) := c(x,y) + \sum_{t=1}^{T-1} h_t(y_{1:t})\, \Delta_{t+1} M(x). \tag{3.5}$$

This means that the optimal value of the regularized COT problem equals the maximum value over the family of regularized OT problems w.r.t. the set of cost functions $\{c^{h,M} : (h,M) \in \mathcal{H}(\mu)\}$. This result has been proven in [2]. As it is crucial for our analysis, we show it in Section A.2.

Proposition 3.3 suggests the following worst-case distance between $\mu$ and $\nu$:

$$\sup_{(h,M) \in \mathcal{H}(\mu)} \mathcal{W}_{c^{h,M},\varepsilon}(\mu,\nu) \tag{3.6}$$

as a regularized Sinkhorn distance that respects the causality constraint on the transport plans.

In the context of training a dynamic IGM, the training dataset is a collection of paths of equal length $T$: $\{x^i\}_{i=1}^N$ with $x^i \in \mathbb{R}^{T \times d}$. As $N$ is usually very large, we proceed as usual by approximating $\mu$ with its empirical mini-batch counterpart. Precisely, for a given IGM $g_\theta$, we fix a batch size $m$ and sample $\{x^i\}_{i=1}^m$ from the dataset and $\{z^i\}_{i=1}^m$ from $\zeta$. Denote the generated samples by $y^i = g_\theta(z^i)$, and the empirical distributions by

$$\hat\mu := \frac{1}{m}\sum_{i=1}^m \delta_{x^i}, \qquad \hat\nu_\theta := \frac{1}{m}\sum_{i=1}^m \delta_{y^i}.$$

The empirical distance can be efficiently approximated by the Sinkhorn algorithm.
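For sequence data, the $m \times m$ cost matrix fed to the Sinkhorn algorithm is built from per-pair sequence costs. A small sketch, assuming batches shaped (m, T, d) and a sum-of-squared-distances base cost (an assumption for illustration; any per-pair sequence cost works):

```python
import numpy as np

def batch_cost_matrix(X, Y):
    """Pairwise cost matrix between two mini-batches of sequences, each of
    shape (m, T, d), using the sum of squared distances over all time steps
    and channels as the base cost c."""
    diff = X[:, None, :, :] - Y[None, :, :, :]   # (m, m, T, d)
    return np.sum(diff ** 2, axis=(2, 3))        # (m, m)
```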

Figure 1: Regularized distance (2.4), Sinkhorn divergence (2.5) and mixed Sinkhorn divergence (3.8) computed for mini-batches of size $m$ drawn from $\mu$ and $\nu_\theta$. Color indicates batch size. Curve and error bar show the mean and SEM estimated from 300 draws of mini-batches.

3.3 Reducing the bias with mixed Sinkhorn divergence

When implementing the Sinkhorn divergence (2.5) at the level of mini-batches, one canonical candidate clearly is

$$\widehat{\mathcal{W}}_{c,\varepsilon}(\hat\mu, \hat\nu_\theta) = 2\,\mathcal{W}_{c,\varepsilon}(\hat\mu,\hat\nu_\theta) - \mathcal{W}_{c,\varepsilon}(\hat\mu,\hat\mu) - \mathcal{W}_{c,\varepsilon}(\hat\nu_\theta,\hat\nu_\theta), \tag{3.7}$$

which is indeed what is used in [20]. While the expression in (3.7) does converge in expectation to (2.5) for $m \to \infty$ ([19, Theorem 3]), it is not clear whether it is an adequate loss given data of fixed batch size $m$. In fact, we find that this is not the case, and demonstrate it here empirically.

Example 3.4.

We build an example where the data distribution belongs to a parameterized family of distributions $\{\mu_\theta\}$, with $\mu = \mu_{\theta^*}$ (details in Section A.3). As shown in Figure 1 (top two rows), neither the expected regularized distance (2.4) nor the Sinkhorn divergence (2.5) reaches its minimum at $\theta^*$, especially for small batch sizes. This means that optimizing over mini-batches will not lead to $\theta = \theta^*$.

Instead, we propose the following mixed Sinkhorn divergence at the level of mini-batches:

$$\mathcal{W}^{\mathrm{mix}}_{c,\varepsilon}(\hat\mu, \hat\mu', \hat\nu_\theta, \hat\nu'_\theta) := \mathcal{W}_{c,\varepsilon}(\hat\mu,\hat\nu_\theta) + \mathcal{W}_{c,\varepsilon}(\hat\mu',\hat\nu'_\theta) - \mathcal{W}_{c,\varepsilon}(\hat\mu,\hat\mu') - \mathcal{W}_{c,\varepsilon}(\hat\nu_\theta,\hat\nu'_\theta), \tag{3.8}$$

where $\hat\mu$ and $\hat\mu'$ are the empirical distributions of mini-batches from the data distribution, and $\hat\nu_\theta$ and $\hat\nu'_\theta$ from the IGM distribution $\nu_\theta$. The idea is to take into account the bias within the distribution $\mu$ and that within the distribution $\nu_\theta$ as well. The proposed divergence finds the correct minimizer for all batch sizes in Example 3.4 (Figure 1, bottom), and the improvement is not due solely to the double batch used by Equation 3.8. We further discuss this choice and our findings in Section A.3.
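A sketch of (3.8) on four mini-batches, reusing a plain Sinkhorn routine with uniform weights and a squared-distance cost (both assumptions made for illustration):

```python
import numpy as np

def entropic_cost(X, Y, eps=0.1, n_iters=300):
    """Regularized OT cost between uniform empirical measures on X, Y (m, d)."""
    C = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    K = np.exp(-C / eps)
    a = np.full(len(X), 1.0 / len(X))
    b = np.full(len(Y), 1.0 / len(Y))
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]
    return float(np.sum(P * C))

def mixed_sinkhorn(x, x2, y, y2, eps=0.1):
    """Mixed Sinkhorn divergence (3.8): x, x2 are two real mini-batches and
    y, y2 two generated mini-batches."""
    return (entropic_cost(x, y, eps) + entropic_cost(x2, y2, eps)
            - entropic_cost(x, x2, eps) - entropic_cost(y, y2, eps))
```

Here the two real batches play the roles of $\hat\mu$ and $\hat\mu'$, and the two generated batches those of $\hat\nu_\theta$ and $\hat\nu'_\theta$.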

3.4 COT-GAN: Adversarial learning for sequential data

We now combine the results in Section 3.2 and Section 3.3 to formulate an adversarial training algorithm for IGMs. First, we approximate the set of functions (3.5) by truncating the sums at a fixed number $J$ of terms, and we parameterize $h_{\varphi_1} = (h^j)_{j=1}^J$ and $M_{\varphi_2} = (M^j)_{j=1}^J$ as two separate neural networks, and let $\varphi = (\varphi_1, \varphi_2)$. To capture the adaptedness of those processes, we employ architectures where the output at time $t$ depends on the input only up to time $t$. The mixed Sinkhorn divergence between $\hat\mu$ and $\hat\nu_\theta$ is then calculated with respect to a parameterized cost function

$$c^{\mathcal{K}}_{\varphi}(x,y) := c(x,y) + \sum_{j=1}^{J} \sum_{t=1}^{T-1} h^{j}_{t}(y_{1:t})\, \Delta_{t+1} M^{j}(x). \tag{3.9}$$

Second, it is not obvious how to directly impose the martingale condition on $M_{\varphi_2}$, as constraints involving conditional expectations cannot be easily enforced in practice. Rather, we penalize processes whose increments at every time step are non-zero on average. For an $\mathcal{F}^x$-adapted process $M$ and a mini-batch $\{x^i\}_{i=1}^m$, we define the martingale penalization for $M$ as

$$p_{M}(\hat\mu) := \frac{1}{mT} \sum_{t=1}^{T-1} \Big|\, \sum_{i=1}^{m} \frac{M_{t+1}(x^i) - M_{t}(x^i)}{\sqrt{\operatorname{Var}[M]} + \eta} \,\Big|,$$

where $\operatorname{Var}[M]$ is the empirical variance of $M$ over time and batch, and $\eta > 0$ is a small constant. Third, we use the mixed Sinkhorn divergence introduced in (3.8). Each of the four terms in (3.8) is approximated by running the Sinkhorn algorithm on the cost $c^{\mathcal{K}}_{\varphi}$ for a fixed number of iterations.
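The martingale penalization can be sketched as follows for a network output M of shape (batch m, time T, channels); the array layout and the exact normalization are assumptions made for illustration:

```python
import numpy as np

def martingale_penalty(M, eta=1e-6):
    """Sketch of the martingale penalization: penalize the batch-averaged
    increments of the process M (shape (m, T, channels)), normalized by the
    empirical std of M over time and batch plus a small constant eta."""
    m, T = M.shape[0], M.shape[1]
    increments = M[:, 1:] - M[:, :-1]            # M_{t+1} - M_t per sample
    scale = np.sqrt(M.var()) + eta               # empirical std over time and batch
    batch_sums = increments.sum(axis=0) / scale  # sum over the mini-batch
    return float(np.abs(batch_sums).sum() / (m * T))
```

A process with zero-mean increments across the batch incurs (near) zero penalty, while a systematic drift is penalized.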

Altogether, we arrive at the following adversarial objective function for COT-GAN:

$$\min_{\theta} \max_{\varphi} \Big\{ \mathcal{W}^{\mathrm{mix}}_{c^{\mathcal{K}}_{\varphi},\varepsilon}(\hat\mu, \hat\mu', \hat\nu_\theta, \hat\nu'_\theta) - \lambda\, p_{M_{\varphi_2}}(\hat\mu) \Big\},$$

where $\hat\mu$ and $\hat\mu'$ are empirical measures corresponding to non-overlapping subsets of the dataset, $\hat\nu_\theta$ and $\hat\nu'_\theta$ are the ones corresponding to two samples from $\nu_\theta$, and $\lambda$ is a positive constant. We update $\theta$ to decrease this objective, and $\varphi$ to increase it.

While the generator acts as in classical GANs, the adversarial role here is played by $h_{\varphi_1}$ and $M_{\varphi_2}$. In this setting, the discriminator, parameterized by $\varphi$, learns a robust (worst-case) distance between the real data distribution $\mu$ and the generated distribution $\nu_\theta$, where the class of cost functions (3.9) originates from causality. The algorithm is summarized in Algorithm 1. Its per-iteration time complexity is dominated by the Sinkhorn computations on $m \times m$ cost matrices.

Data: $\{x^i\}_{i=1}^N$ (real data), $\zeta$ (probability distribution on latent space)
Parameters: $\theta$, $\varphi$, $m$ (batch size), $\varepsilon$ (regularization parameter), number of Sinkhorn iterations, learning rate, $\lambda$ (martingale penalty coefficient)
Result: $\theta$, $\varphi$
Initialize: $\theta$, $\varphi$
for each iteration do
        Sample $\{x^i\}_{i=1}^m$ and $\{x'^i\}_{i=1}^m$ from real data;
        Sample $\{z^i\}_{i=1}^m$ and $\{z'^i\}_{i=1}^m$ from $\zeta$;
        $y^i \leftarrow g_\theta(z^i)$, $y'^i \leftarrow g_\theta(z'^i)$;
        Compute the mixed divergence (3.8) by the Sinkhorn algorithm, with cost given by (3.9), and update $\varphi$ by gradient ascent on the COT-GAN objective;
        Sample $\{x^i\}_{i=1}^m$ and $\{x'^i\}_{i=1}^m$ from real data;
        Sample $\{z^i\}_{i=1}^m$ and $\{z'^i\}_{i=1}^m$ from $\zeta$;
        $y^i \leftarrow g_\theta(z^i)$, $y'^i \leftarrow g_\theta(z'^i)$;
        Compute the mixed divergence (3.8) by the Sinkhorn algorithm, with cost given by (3.9), and update $\theta$ by gradient descent on the COT-GAN objective;
end for
Algorithm 1: Training COT-GAN by SGD

4 Related work

Early video generation literature focuses on dynamic texture modeling [16, 35, 40]. Recent efforts in video generation within the GAN community have been devoted to designing GAN architectures of generator and discriminator to tackle the spatio-temporal dependencies separately, e.g., [39, 34, 36]. VGAN [39] explored a two-stream generator that combines a network for a static background and another one for moving foreground trained on the original GAN objective. TGAN [34] proposed a new structure capable of generating dynamic background as well as a weight clipping trick to regularize the discriminator. In addition to a unified generator, MoCoGAN [36] employed two discriminators to judge both the quality of frames locally and the evolution of motions globally.

The broader literature on sequential data generation attempts to capture temporal dependencies by deploying recurrent neural networks in the architecture [28, 18, 23, 42]. Among them, TimeGAN [42] demonstrated improvements in time series generation by adding a teacher-forcing component in the loss function. Alternatively, WaveGAN [15] adopted the causal structure of WaveNet [38]. Despite substantial progress made, existing sequential GANs are generally domain-specific. We therefore aim to offer a framework that considers (transport) causality in the objective function and is suitable for more general sequential settings.

Whilst our analysis builds upon [14] and [20], we remark on two major differences between COT-GAN and the Sinkhorn GAN in [20]. First, we consider a different family of costs: while [20] learns the cost function by parametrizing it directly, the family of costs in COT-GAN is obtained by adding to a fixed cost $c$ a causal component expressed in terms of the networks $h$ and $M$. Second, the objective is the mixed Sinkhorn divergence we propose, which reduces bias in parameter estimation and can be used as a generic divergence for training IGMs, not limited to time series settings.

5 Experiments

5.1 Time series

We now validate COT-GAN empirically (code and data are available online). For time series that have a relatively small dimensionality but exhibit complex temporal structure, we compare COT-GAN with the following methods: direct minimization of the Sinkhorn divergences (3.8) and (3.7); TimeGAN [42], as reviewed in Section 4; and Sinkhorn GAN, similar to [20], where the cost is trained to increase the mixed Sinkhorn divergence, with weight clipping. The networks $h$ and $M$ in COT-GAN and the cost network in Sinkhorn GAN share the same architecture. Details of models and datasets are in Section B.1.

Figure 2: Results on learning the multivariate AR-1 process. Top row shows the auto-correlation coefficient for each channel. Bottom row shows the correlation coefficient between channels averaged over time. The number on top of each panel is the sum of the absolute difference between the correlation coefficients computed from real (leftmost) and generated samples.
Figure 3: Results on EEG data. The same correlations as Figure 2 are shown.
Autoregressive processes.

We first test whether COT-GAN can learn temporal and spatial correlation in a multivariate first-order auto-regressive process (AR-1). Results are shown in Figure 2. COT-GAN samples have correlation structures that best match the real data. Minimizing the mixed divergence produces almost as good correlations as COT-GAN, but with less accurate auto-correlation. Minimizing the original Sinkhorn divergence yields poor results, and neither TimeGAN nor Sinkhorn GAN could capture the correlation structure of this dataset.
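For reference, data of this kind can be simulated as below; this is an illustrative generator, and the coefficient matrix and noise scale are assumptions rather than the paper's exact settings:

```python
import numpy as np

def sample_ar1(m, T, A, sigma=1.0, rng=None):
    """Sample m paths of length T from a multivariate AR-1 process
    x_t = A @ x_{t-1} + sigma * noise, with standard normal initial state."""
    rng = np.random.default_rng(rng)
    d = A.shape[0]
    x = np.zeros((m, T, d))
    x[:, 0] = rng.standard_normal((m, d))
    for t in range(1, T):
        x[:, t] = x[:, t - 1] @ A.T + sigma * rng.standard_normal((m, d))
    return x
```

For instance, `sample_ar1(32, 16, 0.8 * np.eye(3))` yields batches whose lag-1 autocorrelation reflects the diagonal of `A`.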

Noisy oscillations.

The noisy oscillation distribution is composed of sequences of 20-element arrays (1-D images) [41]. Figure 6 in Section B.1 shows data as well as samples generated by different training methods. To evaluate performance, we estimate two attributes of the samples by Monte Carlo: the marginal distribution of pixel values, and the joint distribution of the location at adjacent time steps. COT-GAN samples match the real data best.

Electroencephalography (EEG).

This dataset is from the UCI repository [17] and contains recordings from 43 healthy subjects, each undergoing around 80 trials. Each data sequence has 64 channels, and we model the first 100 time steps. We trained and evaluated each method 16 times with different training and test splits. We evaluated performance by the maximum mean discrepancy (MMD), and by the match with data in terms of temporal and channel correlations and frequency spectrum. In addition, we investigated how the penalty coefficient affects sample quality. We show an example of the data and learned correlations in Figure 3, and summary statistics of all evaluation metrics in Figure 8 in Section B.1. COT-GAN generates the best samples compared with the other baselines across all four metrics. A smaller coefficient tends to generate less realistic correlation patterns, but a slightly better match in frequency spectrum.

5.2 Videos

We train COT-GAN on Sprites animations [27, 33] and human action sequences [11], and compare with MoCoGAN [36]. The evaluation metrics are Fréchet Inception Distance (FID) [24] comparing individual frames, Fréchet Video Distance (FVD) [37] which compares the video sequences as a whole by mapping samples into features via pretrained 3D convolutional networks, and their kernel counterparts (KID, KVD) [10]. Previous studies suggest that FVD correlates better with human judgement than KVD for videos [37], whereas KID does so better than FID on images [44].

Figure 4: Animated (top) and human (bottom) action videos. Left column reports real data samples, and right column samples from COT-GAN.

We pre-process the Sprites and human action sequences to fixed sequence lengths. We employ the same architecture of generator and discriminator to train both datasets. Both the generator and discriminator comprise generic LSTMs with 2-D convolutional layers. The detailed data pre-processing, GAN architectures, hyper-parameter settings, and training techniques are reported in Appendix B.2. We show the real data and samples from COT-GAN side by side in Figure 4.

                 FVD      FID    KVD    KID
Sprites
  MoCoGAN      1213.2    281.3  160.1  0.33
  COT-GAN       444.6     83.5   64.0  0.077
Human actions
  MoCoGAN       661.8    128.4   60.4  0.21
  COT-GAN       541.0     52.4   46.2  0.096
Table 1: Evaluations for video datasets. Lower value means better sample quality.

The evaluation scores in Table 1 are estimated using 5000 generated samples. COT-GAN is the better performing method in both tasks for all four measurements. Further samples, and comparison with direct minimization of the mixed Sinkhorn divergence, are provided in Appendix C.

6 Discussion

The performance of COT-GAN suggests that constraining the transport plans to be causal is a promising direction for generating sequential data. The approximations we introduce, such as the mixed Sinkhorn divergence (3.8) and the truncated sum in (3.5), are sufficient to produce good experimental results, and provide opportunities for more theoretical analyses in future studies. Directions of future development include ways to learn from data of flexible lengths, extensions to conditional COT-GAN, and improved methods to enforce the martingale property and to better parameterize the causality constraint.


  • [1] B. Acciaio, J. Backhoff-Veraguas, and R. Carmona (2019) Extended mean field control problems: stochastic maximum principle and transport perspective. SIAM Journal on Control and Optimization 57 (6).
  • [2] B. Acciaio, J. Backhoff-Veraguas, and J. Jia (2020) Cournot-Nash equilibrium and optimal transport in a dynamic setting. arXiv preprint arXiv:2002.08786.
  • [3] B. Acciaio, J. Backhoff-Veraguas, and A. Zalashko (2019) Causal optimal transport and its links to enlargement of filtrations and continuous-time stochastic optimization. Stochastic Processes and their Applications.
  • [4] M. Arbel, D. Sutherland, M. Bińkowski, and A. Gretton (2018) On gradient regularizers for MMD GANs. In NeurIPS.
  • [5] M. Arjovsky and L. Bottou (2017) Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862.
  • [6] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In ICML.
  • [7] J. Backhoff, D. Bartl, M. Beiglböck, and J. Wiesel (2020) Estimating processes in adapted Wasserstein distance. arXiv preprint arXiv:2002.07261.
  • [8] J. Backhoff, M. Beiglböck, Y. Lin, and A. Zalashko (2017) Causal transport in discrete time and applications. SIAM Journal on Optimization 27 (4).
  • [9] J. Backhoff-Veraguas, D. Bartl, M. Beiglböck, and M. Eder (2019) Adapted Wasserstein distances and stability in mathematical finance. arXiv preprint arXiv:1901.07450.
  • [10] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton (2018) Demystifying MMD GANs. arXiv preprint arXiv:1801.01401.
  • [11] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri (2005) Actions as space-time shapes. In ICCV.
  • [12] O. Bousquet, S. Gelly, I. Tolstikhin, C. Simon-Gabriel, and B. Schoelkopf (2017) From optimal transport to generative modeling: the VEGAN cookbook. arXiv preprint arXiv:1705.07642.
  • [13] T. M. Cover and J. A. Thomas (2012) Elements of Information Theory. John Wiley & Sons.
  • [14] M. Cuturi (2013) Sinkhorn distances: lightspeed computation of optimal transport. In NeurIPS.
  • [15] C. Donahue, J. J. McAuley, and M. S. Puckette (2019) Adversarial audio synthesis. In ICLR.
  • [16] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto (2003) Dynamic textures. International Journal of Computer Vision 51 (2).
  • [17] D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
  • [18] C. Esteban, S. L. Hyland, and G. Rätsch (2017) Real-valued (medical) time series generation with recurrent conditional GANs. arXiv preprint arXiv:1706.02633.
  • [19] A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré (2019) Sample complexity of Sinkhorn divergences. In AISTATS.
  • [20] A. Genevay, G. Peyré, and M. Cuturi (2018) Learning generative models with Sinkhorn divergences. In AISTATS.
  • [21] I. J. Good (1963) Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. The Annals of Mathematical Statistics 34 (3).
  • [22] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NeurIPS.
  • [23] S. Haradal, H. Hayashi, and S. Uchida (2018) Biosignal data augmentation based on generative adversarial networks. In International Conference of the IEEE Engineering in Medicine and Biology Society.
  • [24] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS.
  • [25] M. J. Johnson, D. K. Duvenaud, A. Wiltschko, R. P. Adams, and S. R. Datta (2016) Composing graphical models with neural networks for structured representations and fast inference. In NeurIPS, pp. 2946-2954.
  • [26] C. Li, W. Chang, Y. Cheng, Y. Yang, and B. Póczos (2017) MMD GAN: towards deeper understanding of moment matching network. In NeurIPS.
  • [27] Y. Li and S. Mandt (2018) Disentangled sequential autoencoder. arXiv preprint arXiv:1803.02991.
  • [28] O. Mogren (2016) C-RNN-GAN: continuous recurrent neural networks with adversarial training. arXiv preprint arXiv:1611.09904.
  • [29] Y. Mroueh, C. Li, T. Sercu, A. Raj, and Y. Cheng (2018) Sobolev GAN. In ICLR.
  • [30] S. Nowozin, B. Cseke, and R. Tomioka (2016) f-GAN: training generative neural samplers using variational divergence minimization. In NeurIPS.
  • [31] J. B. Orlin (1993) A faster strongly polynomial minimum cost flow algorithm. Operations Research 41 (2), pp. 338-350.
  • [32] O. Pele and M. Werman (2009) Fast and robust earth mover's distances. In 2009 IEEE 12th International Conference on Computer Vision, pp. 460-467.
  • [33] S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee (2015) Deep visual analogy-making. In NeurIPS.
  • [34] M. Saito, E. Matsumoto, and S. Saito (2017) Temporal generative adversarial nets with singular value clipping. In ICCV.
  • [35] M. Szummer and R. W. Picard (1996) Temporal texture modeling. In International Conference on Image Processing, Vol. 3.
  • [36] S. Tulyakov, M. Liu, X. Yang, and J. Kautz (2018) MoCoGAN: decomposing motion and content for video generation. In CVPR.
  • [37] T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly (2019) FVD: a new metric for video generation.
  • [38] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu (2016) WaveNet: a generative model for raw audio. In ISCA Workshop.
  • [39] C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Generating videos with scene dynamics. In NeurIPS.
  • [40] L. Wei and M. Levoy (2000) Fast texture synthesis using tree-structured vector quantization. In Annual Conference on Computer Graphics and Interactive Techniques.
  • [41] L. K. Wenliang and M. Sahani (2019) A neurally plausible model for online recognition and postdiction in a dynamical environment. In NeurIPS.
  • [42] J. Yoon, D. Jarrett, and M. van der Schaar (2019) Time-series generative adversarial networks. In NeurIPS.
  • [43] X. L. Zhang, H. Begleiter, B. Porjesz, W. Wang, and A. Litke (1995) Event related potentials during object recognition tasks. Brain Research Bulletin 38 (6).
  • [44] S. Zhou, M. Gordon, R. Krishna, A. Narcomey, L. F. Fei-Fei, and M. Bernstein (2019) HYPE: a benchmark for human eye perceptual evaluation of generative models. In NeurIPS.

Appendix A Specifics on regularized Causal Optimal Transport

A.1 Limits of regularized Causal Optimal Transport

In this section we prove the limits stated in Remark 3.2.

Lemma A.1.

Let $\mu$ and $\nu$ be discrete measures on path spaces, with supports of cardinality $n$ and $\bar n$ respectively. Then

$$\inf_{\pi \in \Pi^{\mathcal{K}}(\mu,\nu)} \big\{ \mathbb{E}^{\pi}[c(x,y)] - \varepsilon H(\pi) \big\} \xrightarrow{\ \varepsilon \to 0\ } \mathcal{W}^{\mathcal{K}}_{c}(\mu,\nu).$$

We mimic the proof of Theorem 4.5 in [2], and note that the entropy of any $\pi \in \Pi(\mu,\nu)$ is uniformly bounded:

$$0 \le H(\pi) \le \log(n\bar n).$$

This yields

$$\mathcal{W}^{\mathcal{K}}_{c}(\mu,\nu) - \varepsilon \log(n\bar n) \;\le\; \inf_{\pi \in \Pi^{\mathcal{K}}(\mu,\nu)} \big\{ \mathbb{E}^{\pi}[c(x,y)] - \varepsilon H(\pi) \big\} \;\le\; \mathcal{W}^{\mathcal{K}}_{c}(\mu,\nu). \tag{A.2}$$

Now, note that the LHS and RHS in (A.2) both tend to $\mathcal{W}^{\mathcal{K}}_{c}(\mu,\nu)$ for $\varepsilon \to 0$. ∎

Lemma A.2.

Let $\mu$ and $\nu$ be discrete measures. Then

$$\mathcal{W}^{\mathcal{K}}_{c,\varepsilon}(\mu,\nu) \xrightarrow{\ \varepsilon \to \infty\ } \mathbb{E}^{\mu\otimes\nu}\big[c(x,y)\big].$$

Being $\mu$ and $\nu$ discrete, $H(\pi)$ is uniformly bounded for $\pi \in \Pi(\mu,\nu)$. Therefore, for $\varepsilon$ big enough, the optimizer in (3.3) is $\mu\otimes\nu$, the independent coupling, for which the entropy is maximal; see [13] and [21]. Therefore, for $\varepsilon$ big enough, we have $\mathcal{W}^{\mathcal{K}}_{c,\varepsilon}(\mu,\nu) = \mathbb{E}^{\mu\otimes\nu}[c(x,y)]$. ∎

A.2 Reformulation of the COT problem

The causal constraint can be expressed using the following characteristic function:

$$\chi(\pi) := \sup_{(h,M)\in\mathcal{H}(\mu)} \mathbb{E}^{\pi}\Big[\textstyle\sum_{t=1}^{T-1} h_t(y_{1:t})\,\Delta_{t+1}M(x)\Big] = \begin{cases} 0 & \text{if } \pi \text{ is causal},\\ +\infty & \text{otherwise}. \end{cases}$$

This allows to rewrite (3.3) as

$$\inf_{\pi\in\Pi^{\mathcal{K}}(\mu,\nu)}\big\{\mathbb{E}^{\pi}[c]-\varepsilon H(\pi)\big\} = \inf_{\pi\in\Pi(\mu,\nu)}\big\{\mathbb{E}^{\pi}[c]-\varepsilon H(\pi)+\chi(\pi)\big\} = \inf_{\pi\in\Pi(\mu,\nu)}\sup_{(h,M)\in\mathcal{H}(\mu)}\big\{\mathbb{E}^{\pi}[c^{h,M}]-\varepsilon H(\pi)\big\} = \sup_{(h,M)\in\mathcal{H}(\mu)}\inf_{\pi\in\Pi(\mu,\nu)}\big\{\mathbb{E}^{\pi}[c^{h,M}]-\varepsilon H(\pi)\big\},$$

where the third equality holds by the min-max theorem, thanks to convexity of $\mathcal{H}(\mu)$, and convexity and compactness of $\Pi(\mu,\nu)$. ∎

A.3 Sinkhorn divergence at the level of mini-batches

Empirical observation of the bias in Example 3.4.

In the experiment mentioned in Example 3.4, we consider a set of distributions $\mu_\theta$ given by sinusoids with random phase, frequency and amplitude. We let $\mu$ be one element in this set whose amplitude is uniformly distributed between minimum 0.3 and maximum 0.8. On the other hand, for each $\mu_\theta$, the amplitude is uniformly distributed between the same minimum 0.3 and a maximum that is varied. Thus, the only parameter of the distribution being varied is the maximum amplitude. We may equivalently take the maximum amplitude as a single scalar $\theta$ that parameterizes $\mu_\theta$, so that $\mu = \mu_{\theta^*}$ with $\theta^* = 0.8$. Figure 1 illustrates that the sample Sinkhorn divergence (3.7) (or regularized distance (2.4)) does not recover the optimizer $\theta^*$, while the proposed mixed Sinkhorn divergence (3.8) does.
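A sketch of this sinusoid family (the phase and frequency ranges below are assumptions; the text only fixes the amplitude range):

```python
import numpy as np

def sample_sinusoids(theta_max, m, T=20, a_min=0.3, rng=None):
    """Toy sinusoid family from Example 3.4: amplitude ~ U[a_min, theta_max],
    with random phase and frequency (their ranges are assumed here)."""
    rng = np.random.default_rng(rng)
    t = np.linspace(0.0, 1.0, T)
    amp = rng.uniform(a_min, theta_max, size=(m, 1))
    freq = rng.uniform(1.0, 3.0, size=(m, 1))        # assumed range
    phase = rng.uniform(0.0, 2 * np.pi, size=(m, 1))  # assumed range
    return amp * np.sin(2 * np.pi * freq * t[None, :] + phase)
```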

Further discussion.

As mentioned in Section 3.3, when implementing the Sinkhorn divergence (2.5) at the level of mini-batches, one canonical choice is the one adopted in [20], that is


What inspired us the different choice of the mixed Sinkhorn divergence in (3.8), that is


is the idea of also taking into account the bias within the distribution and that within the distribution , when sampling mini-batches from them.

Clearly, when the batch size , both (A.4) and (A.5) converge to (2.5) in expectation, see [19, Theorem 3]. So the main point here is, for a fixed , which one of the two does a better job in translating the idea of the Sinkhorn divergence at the level of mini-batches. Experiments suggest that (A.5) is indeed the better choice. To support this fact, note that the triangular inequality implies

One could argue that, since (A.5) uses two batches of size $m$ from each distribution, simply taking a bigger mini-batch, say of size $2m$, in (A.4) may perform as well. However, we have considered this case, and our experiments confirm that the mixed Sinkhorn divergence (A.5) outperforms the previously used (A.4) even when the latter is allowed a bigger batch size. This reasoning can be pushed further, for example by considering all four cross-combinations of the sampled mini-batches. Implementations showed no advantage in doing so, while requiring more computation.
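To make the comparison concrete, here is a minimal numerical sketch of the two mini-batch estimators. The squared-Euclidean ground cost and the plain Sinkhorn routine are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sinkhorn_cost(x, y, eps=1.0, n_iters=200):
    """Entropy-regularized OT cost between two equal-size mini-batches,
    with uniform weights and a squared-Euclidean ground cost (an
    assumed choice for illustration)."""
    m = x.shape[0]
    c = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # m x m cost matrix
    K = np.exp(-c / eps)                                 # Gibbs kernel
    a = np.ones(m) / m                                   # uniform marginals
    v = np.ones(m) / m
    for _ in range(n_iters):                             # Sinkhorn iterations
        u = a / (K @ v)
        v = a / (K.T @ u)
    pi = u[:, None] * K * v[None, :]                     # transport plan
    return float((pi * c).sum())

def sinkhorn_div_single(x, y, eps=1.0):
    """Canonical mini-batch Sinkhorn divergence in the style of (A.4):
    one batch per distribution, self-terms on the same batch."""
    return (sinkhorn_cost(x, y, eps)
            - 0.5 * sinkhorn_cost(x, x, eps)
            - 0.5 * sinkhorn_cost(y, y, eps))

def sinkhorn_div_mixed(x, xp, y, yp, eps=1.0):
    """Mixed mini-batch Sinkhorn divergence in the style of (A.5):
    two independent batches per distribution, so the within-distribution
    bias is estimated across batches rather than within one batch."""
    return (sinkhorn_cost(x, y, eps) + sinkhorn_cost(xp, yp, eps)
            - sinkhorn_cost(x, xp, eps) - sinkhorn_cost(y, yp, eps))
```

The only structural difference is that the debiasing terms in the mixed version compare two independent draws from the same distribution, which is what lets the within-distribution sampling bias cancel.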

The MMD limiting case.

In the limit $\varepsilon \to \infty$, Genevay et al. [20] showed that the Sinkhorn divergence converges to the MMD under the kernel induced by the cost function. Here we want to point out an interesting fact about the limiting behavior of the mixed Sinkhorn divergence.

Remark A.3.

Given mini-batches $\hat{x}$ and $\hat{y}$ formed by samples from $\mu$ and $\nu$, respectively, in the limit $\varepsilon \to \infty$ the Sinkhorn divergence converges to a biased estimator of the corresponding MMD; given additional mini-batches $\hat{x}'$ and $\hat{y}'$ from $\mu$ and $\nu$, respectively, the mixed Sinkhorn divergence converges to an unbiased estimator of it.

The first part of the statement relies on the fact that is a biased estimator of . Indeed, we have

Now note that

where we have used the fact that . A similar result holds for the sum over . On the other hand, . Therefore

which completes the proof of the first part of the statement.

For the second part, note that as [20, Theorem 1], thus

The RHS is an unbiased estimator of , since its expectation is

Figure 5: Data and samples obtained by different methods for the autoregressive process.

Note that the bias refers to the parameter estimate, rather than to the divergence itself. The mixed divergence may still be a biased estimate of the true Sinkhorn divergence. However, in the experiment of Example 3.4, we note that the minimum is reached for a parameter value close to the real one (Figure 1, bottom). We defer a detailed analysis of the mixed divergence to a future paper.

Appendix B Experimental details

b.1 Low dimensional time series

Here we describe details of the experiments in Section 5.1.

Autoregressive process.

The generative process to obtain data for the autoregressive process is

where the coefficient matrix is diagonal with ten values evenly spaced between the two stated bounds. We initialize from a 10-dimensional standard normal distribution, and discard the data in the first 10 time steps so that the sequence begins approximately at the stationary distribution. We use and for this experiment. Real data and generated samples are shown in Figure 5.
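A minimal simulation of such a process might look as follows. The bounds on the diagonal entries and the unit-variance Gaussian noise are illustrative assumptions, since the exact constants are not reproduced here.

```python
import numpy as np

def simulate_ar(n_paths, seq_len, d=10, burn_in=10,
                a_min=-0.9, a_max=0.9, seed=0):
    """10-dimensional AR(1) process x_{t+1} = A x_t + eps_t, diagonal A.

    The range (a_min, a_max) of the diagonal entries and the
    unit-variance Gaussian noise are illustrative assumptions.  The
    first `burn_in` steps are discarded so the kept part of each
    sequence is roughly stationary.
    """
    rng = np.random.default_rng(seed)
    A = np.diag(np.linspace(a_min, a_max, d))
    x = rng.standard_normal((n_paths, d))      # standard-normal initial state
    kept = []
    for t in range(burn_in + seq_len):
        x = x @ A.T + rng.standard_normal((n_paths, d))
        if t >= burn_in:
            kept.append(x.copy())
    return np.stack(kept, axis=1)              # shape (n_paths, seq_len, d)
```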

Figure 6: 1-D noisy oscillation. Top two rows show two samples from the data distribution and generators trained by different methods. Third row shows marginal distribution of pixels values (y axis clipped at 0.07 for clarity). Bottom row shows joint distribution of the position of the oscillation at adjacent time steps.
Noisy oscillation.

This dataset comprises paths simulated from a noisy, nonlinear dynamical system. Each path is represented as a sequence of fixed-dimensional arrays and can be displayed as an image for visualization. At each discrete time step, the observation is determined by the position of a “particle” following noisy, nonlinear dynamics. When shown as an image, each sample path appears visually as a “bump” travelling rightward, moving up and down in a zig-zag pattern, as shown in Figure 6 (top left).

More precisely, the state of the particle at time is described by its position and velocity , and evolves according to

where is a rotation matrix, and is uniformly distributed on the unit circle.

We take the observation to be a vector of evaluations of a Gaussian function at 20 evenly spaced locations, with the peak of the Gaussian following the position of the particle at each time step:

where the mapping sends pixel indices to a grid of evenly spaced points in the space of particle positions. Thus, the observation at each time step contains information about the position but not the velocity of the particle. A similar data generating process was used in [41], inspired by Johnson et al. [25].
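The particle dynamics and the rendering into pixel observations can be sketched as below. The rotation angle, noise scale, bump width, and the [-1, 1] pixel grid are all assumptions for illustration.

```python
import numpy as np

def simulate_bump_paths(n_paths, seq_len=20, n_pix=20, sigma=0.15,
                        theta=0.3, noise_scale=0.05, seed=0):
    """Sketch of the noisy oscillation data.

    The particle state (position, velocity) is rotated by angle `theta`
    and perturbed by Gaussian noise at each step; the observation is a
    Gaussian bump centred at the position, evaluated on `n_pix` evenly
    spaced grid points.  All numeric constants are illustrative
    assumptions, not the paper's exact values.
    """
    rng = np.random.default_rng(seed)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    grid = np.linspace(-1.0, 1.0, n_pix)          # pixel index -> position
    phi = rng.uniform(0.0, 2.0 * np.pi, n_paths)  # start on the unit circle
    state = np.stack([np.cos(phi), np.sin(phi)], axis=1)  # (n_paths, 2)
    frames = []
    for _ in range(seq_len):
        state = state @ R.T + noise_scale * rng.standard_normal(state.shape)
        pos = state[:, 0]                  # observe position, not velocity
        frames.append(np.exp(-(grid[None, :] - pos[:, None]) ** 2
                             / (2.0 * sigma ** 2)))
    return np.stack(frames, axis=1)        # (n_paths, seq_len, n_pix)
```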

We compare the marginal distribution of the pixel values and the joint distribution of the bump location between adjacent time steps. See Figure 6.


EEG.

We obtained the EEG dataset from [43] and took the recordings of all 43 subjects in the control group under the matching condition (S2). For each subject, we choose 75% of the trials as training data and the remaining ones for evaluation. All data are normalized by subtracting the channel-wise mean, dividing by three times the channel-wise standard deviation, and then passing through a nonlinearity. We train and evaluate models 16 times with different splits. For COT-GAN, we trained three variants corresponding to three values of the regularization parameter, the same for all OT-based methods. Data and samples are shown in Figure 7.
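The normalization step can be sketched as follows; tanh is used as the squashing nonlinearity here, which is an assumption since the text does not pin down which nonlinearity is used.

```python
import numpy as np

def preprocess_eeg(x):
    """Normalize EEG trials: subtract the channel-wise mean, divide by
    three times the channel-wise standard deviation, then squash with a
    nonlinearity (tanh is an assumed choice here).

    `x` has shape (n_trials, seq_len, n_channels); statistics are
    computed per channel over all trials and time steps.
    """
    mean = x.mean(axis=(0, 1), keepdims=True)
    std = x.std(axis=(0, 1), keepdims=True)
    return np.tanh((x - mean) / (3.0 * std))
```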

Figure 7: Data and samples obtained by different methods for EEG data, the number after COT-GAN indicates the value of .

We use four different metrics to compare sample quality. The first is a relative MMD test that compares a test statistic computed on three samples: the real test dataset, samples from a reference COT-GAN, and samples from an alternative method that is one of the following: COT-GAN with a different regularization parameter, direct minimization of the mixed or of the original Sinkhorn divergence, TimeGAN, or Sinkhorn GAN. A larger value of the test statistic indicates that the reference COT-GAN is better than the alternative. We do not employ the hypothesis testing framework, but rather use the test statistic as a metric of relative sample quality. We also compute the following quantities on the real and generated samples: a) the temporal correlation coefficient, b) the channel-wise correlation coefficient, and c) the frequency spectrum for each channel averaged over samples. For each of these three features, we use the sum of absolute differences between the features computed from real and synthesized data as a metric of similarity. A small number means the generated data is close to the real data with respect to the corresponding feature.
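As an example, the frequency-spectrum feature distance (metric c) might be computed as follows; using the magnitude of the real FFT along the time axis is an assumed concrete choice.

```python
import numpy as np

def spectrum_distance(real, fake):
    """Feature metric (c): per-channel magnitude spectrum averaged over
    samples, compared via the sum of absolute differences.

    Inputs have shape (n_samples, seq_len, n_channels).  Taking the
    magnitude of the real FFT along the time axis is an assumption.
    """
    spec_real = np.abs(np.fft.rfft(real, axis=1)).mean(axis=0)
    spec_fake = np.abs(np.fft.rfft(fake, axis=1)).mean(axis=0)
    return float(np.abs(spec_real - spec_fake).sum())
```

The two correlation metrics follow the same pattern: compute the feature on real and generated data, then sum the absolute differences.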

As the results in Figure 8 show, the different metrics do not agree in general. Nonetheless, COT-GANs generally outperform the other models. According to MMD and temporal correlation, direct minimization of the mixed Sinkhorn divergence is as good as the best COT-GAN variant, but all COT-GANs do better on channel correlation and frequency spectrum. We noticed that increasing the regularization parameter is helpful for MMD and the two correlations, but not for the frequency spectrum.

Model and training parameters.

The dimensionality of the latent state is 10 at each time step, and there is also a 10-dimensional time-invariant latent state. The generator common to COT-GAN, direct minimization, and Sinkhorn GAN comprises a 1-layer (synthetic) or 2-layer (EEG) LSTM network, whose output at each time step