COT-GAN: Generating Sequential Data via Causal Optimal Transport
We introduce COT-GAN, an adversarial algorithm to train implicit generative models optimized for producing sequential data. The loss function of this algorithm is formulated using ideas from Causal Optimal Transport (COT), which combines classic optimal transport methods with an additional temporal causality constraint. Remarkably, we find that this causality condition provides a natural framework to parameterize the cost function that is learned by the discriminator as a robust (worst-case) distance, and an ideal mechanism for learning time dependent data distributions. Following Genevay et al.(2018), we also include an entropic penalization term which allows for the use of the Sinkhorn algorithm when computing the optimal transport cost. Our experiments show effectiveness and stability of COT-GAN when generating both low- and high-dimensional time series data. The success of the algorithm also relies on a new, improved version of the Sinkhorn divergence which demonstrates less bias in learning.
Dynamical data are ubiquitous in the world, including natural scenes such as video and audio data, and temporal recordings such as physiological and financial traces. Being able to synthesize realistic dynamical data is a challenging unsupervised learning problem and has wide scientific and practical applications. In recent years, training implicit generative models (IGMs) has proven to be a promising approach to data synthesis, driven by the work on generative adversarial networks (GANs)
[22]. Nonetheless, training IGMs on dynamical data poses an interesting yet difficult challenge. On one hand, learning complex dependencies between spatial locations and channels in static images has already received significant attention within the research community. On the other hand, temporal dependencies are no less complicated, since dynamical features are strongly correlated with spatial features. Recent works, including [34, 42, 15, 39, 36], often tackle this problem by separating the model or loss into static and dynamic components.
In this paper, we consider training dynamic IGMs for sequential data. We introduce a new adversarial objective that builds on optimal transport (OT) theory and constrains the transport plans to respect causality: the probability mass moved to the target sequence at time $t$ can only depend on the source sequence up to time $t$ [2, 8]. A reformulation of the causality constraint leads to an adversarial training objective in the spirit of [20], but tailored to sequential data. In addition, we demonstrate that optimizing the original Sinkhorn divergence over mini-batches causes biased parameter estimation, and propose the mixed Sinkhorn divergence, which avoids this problem. Our new framework, Causal Optimal Transport GAN (COT-GAN), outperforms existing methods on a wide range of datasets, from traditional time series to high-dimensional videos.

Goodfellow et al. [22] introduced an adversarial scheme for training an IGM. Given a (real) data distribution $\mu$ and a distribution $\zeta$ on some latent space $\mathcal{Z}$, the generator is a function $g$ trained so that the induced distribution $\nu = g_\# \zeta$ is as close as possible to $\mu$, as judged by a discriminator. The discriminator is a function $f$ trained to output a high value if the input is real (from $\mu$), and a low value otherwise (from $\nu$). In practice, the two functions are implemented as neural networks $g_\theta$ and $f_\varphi$ with parameters $\theta$ and $\varphi$, and the generator distribution is denoted by $\nu_\theta$. The training objective is then formulated as a zero-sum game between the generator and the discriminator. Different probability divergences were later proposed to evaluate the distance between $\mu$ and $\nu_\theta$ [30, 26, 29, 4]. Notably, the Wasserstein-1 distance was used in [6, 5]:

$W_1(\mu, \nu_\theta) = \inf_{\pi \in \Pi(\mu, \nu_\theta)} \mathbb{E}_{(x,y)\sim\pi}\big[ \|x - y\| \big], \qquad (2.1)$
where $\Pi(\mu, \nu_\theta)$ is the space of transport plans (couplings) between $\mu$ and $\nu_\theta$. Its dual form turns out to be a maximization problem over functions $f_\varphi$ such that $f_\varphi$ is 1-Lipschitz. Combined with the minimization over $\theta$, a min-max problem can be formulated with a Lipschitz constraint on $f_\varphi$.
The optimization in (2.1) is a special case of the classical (Kantorovich) optimal transport problem. Given probability measures $\mu$ on $\mathcal{X}$, $\nu$ on $\mathcal{Y}$, and a cost function $c : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, the optimal transport problem is formulated as

$\mathcal{W}_c(\mu, \nu) := \inf_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(x,y)\sim\pi}\big[ c(x, y) \big]. \qquad (2.2)$
Here, $c(x, y)$ represents the cost of transporting a unit of mass from $x$ to $y$, and $\mathcal{W}_c(\mu,\nu)$ is thus the minimal total cost to transport the mass from $\mu$ to $\nu$. Obviously, the Wasserstein-1 distance (2.1) corresponds to $c(x,y) = \|x - y\|$. However, when $\mu$ and $\nu$ are supported on finite sets of size $n$, solving (2.2) has super-cubic (in $n$) complexity [14, 31, 32], which is computationally expensive for large datasets.
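To make the cost of exact solving concrete, the discrete Kantorovich problem (2.2) is a linear program over the entries of the plan. The sketch below uses SciPy's generic LP solver as an illustration of why this route becomes expensive for large supports; it is not one of the dedicated solvers cited above.

```python
import numpy as np
from scipy.optimize import linprog

def exact_ot_cost(a, b, C):
    """Solve the discrete Kantorovich problem (2.2) as a linear program.

    a, b are the weight vectors of two discrete measures and C is the
    pairwise cost matrix C[i, j] = c(x_i, y_j).  The plan has n*m
    variables and n+m marginal constraints, which is what makes generic
    solvers scale poorly in the support size.
    """
    n, m = C.shape
    A_eq = []
    for i in range(n):                      # row sums of the plan equal a
        row = np.zeros((n, m)); row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(m):                      # column sums of the plan equal b
        col = np.zeros((n, m)); col[:, j] = 1.0
        A_eq.append(col.ravel())
    res = linprog(C.ravel(), A_eq=np.array(A_eq),
                  b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun
```

For two uniform two-point measures this recovers the obvious matchings, e.g. a zero cost when each source point has a zero-cost partner.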
Instead, Genevay et al. [20] proposed training IGMs by minimizing a regularized Wasserstein distance that can be computed more efficiently by the Sinkhorn algorithm (see [14]). For transport plans with marginals $\mu$ supported on a finite set $\{x^{(i)}\}_i$ and $\nu$ on a finite set $\{y^{(j)}\}_j$, any $\pi \in \Pi(\mu,\nu)$ is also discrete, with support on the set of all possible pairs $(x^{(i)}, y^{(j)})$. Denoting $\pi_{ij} := \pi(x^{(i)}, y^{(j)})$, the Shannon entropy of $\pi$ is given by $H(\pi) := -\sum_{i,j} \pi_{ij} \log \pi_{ij}$. For $\varepsilon \geq 0$, the regularized optimal transport problem then reads

$\mathcal{P}_{c,\varepsilon}(\mu, \nu) := \inf_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_\pi\big[ c(x, y) \big] - \varepsilon H(\pi). \qquad (2.3)$
Denoting by $\pi^{\varepsilon}$ the optimizer in (2.3), one can define a regularized distance by

$\mathcal{W}_{c,\varepsilon}(\mu, \nu) := \mathbb{E}_{\pi^{\varepsilon}}\big[ c(x, y) \big]. \qquad (2.4)$
Computing this distance is numerically more stable than solving the dual formulation of the OT problem, as the latter requires differentiating dual Kantorovich potentials; see e.g. [12, Proposition 3]. To correct the fact that $\mathcal{W}_{c,\varepsilon}(\mu, \mu) \neq 0$, Genevay et al. [20] proposed to use the Sinkhorn divergence

$\widehat{\mathcal{W}}_{c,\varepsilon}(\mu, \nu) := 2\,\mathcal{W}_{c,\varepsilon}(\mu, \nu) - \mathcal{W}_{c,\varepsilon}(\mu, \mu) - \mathcal{W}_{c,\varepsilon}(\nu, \nu) \qquad (2.5)$
as the objective function, and to learn the cost $c_\varphi$ parameterized by $\varphi$, resulting in the following adversarial objective:

$\min_{\theta} \max_{\varphi}\; \widehat{\mathcal{W}}_{c_\varphi,\varepsilon}(\mu, \nu_\theta). \qquad (2.6)$
In practice, a sample version of this objective is used, where $\mu$ and $\nu_\theta$ are replaced by the empirical distributions of mini-batches randomly extracted from them.
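As a concrete illustration of the mini-batch computation described above, the following is a minimal NumPy sketch of the Sinkhorn iterations behind (2.3)-(2.4); the squared-Euclidean ground cost, the value of $\varepsilon$ and the iteration count are illustrative choices, not the paper's settings.

```python
import numpy as np

def sinkhorn_cost(x, y, eps=0.1, n_iters=100):
    """Entropy-regularized OT cost between two equal-size mini-batches.

    A minimal sketch of the Sinkhorn algorithm: alternate rescalings of
    the Gibbs kernel enforce the two uniform marginal constraints, and
    the regularized cost (2.4) is read off the resulting plan.
    """
    m = x.shape[0]
    # pairwise ground cost c(x_i, y_j) = ||x_i - y_j||^2
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-C / eps)                 # Gibbs kernel
    a = np.full(m, 1.0 / m)              # uniform marginal weights
    b = np.full(m, 1.0 / m)
    u = np.ones(m)
    for _ in range(n_iters):             # alternate marginal scalings
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]      # approximate optimal plan
    return (P * C).sum()                 # regularized transport cost
```

For small $\varepsilon$ the cost between a batch and itself is near zero, while shifting every point by one unit yields a cost near one under the squared ground cost.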
We now focus on data that consists of $d$-dimensional (number of channels), $T$-long sequences, so that $\mu$ and $\nu$ are distributions on the path space $\mathbb{R}^{d \times T}$. In this setting we introduce a special class of transport plans between $\mu$ and $\nu$ that will be used to define our objective function. On $\mathbb{R}^{d \times T} \times \mathbb{R}^{d \times T}$, we denote by $x$ and $y$ the first and second halves of the coordinates, and we let $(\mathcal{F}^x_t)_t$ and $(\mathcal{F}^y_t)_t$ be the canonical filtrations (for all $t$, $\mathcal{F}^x_t$ is the smallest $\sigma$-algebra such that $(x_1, \dots, x_t)$ is measurable; analogously for $\mathcal{F}^y_t$).
A transport plan $\pi \in \Pi(\mu, \nu)$ is called causal if

$\pi(dy_t \mid dx_1, \dots, dx_T) = \pi(dy_t \mid dx_1, \dots, dx_t) \quad \text{for all } t = 1, \dots, T.$

The set of all such plans will be denoted by $\Pi^{\mathcal{K}}(\mu, \nu)$.
Roughly speaking, the amount of mass transported by $\pi$ to a subset of the target space belonging to $\mathcal{F}^y_t$ depends on the source space only up to time $t$. Thus, a causal plan transports $\mu$ into $\nu$ in a non-anticipative way, which is a natural requirement in a sequential framework. In the present paper, we use causality in the sense of Definition 3.1. Note, however, that in the literature the term causality is often used to indicate a mapping in which the output at a given time depends only on inputs up to that time.
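To make the definition tangible, the toy check below tests causality for $T = 2$ and binary 1-D paths, where the condition reduces to the conditional law of $y_1$ given $(x_1, x_2)$ not depending on $x_2$. The two example plans are our own illustration, not taken from the paper.

```python
import numpy as np

def is_causal(pi, tol=1e-9):
    """Check the causality condition for T = 2 and binary 1-D paths.

    `pi` is a 2x2x2x2 array: pi[x1, x2, y1, y2] is the mass the plan
    puts on the pair of paths ((x1, x2), (y1, y2)).  For T = 2 the
    definition reduces to: the conditional law of y1 given (x1, x2)
    must not depend on x2.
    """
    for x1 in range(2):
        conds = []
        for x2 in range(2):
            mass = pi[x1, x2].sum()
            if mass > tol:
                conds.append(pi[x1, x2].sum(axis=1) / mass)  # law of y1
        if len(conds) == 2 and not np.allclose(conds[0], conds[1], atol=tol):
            return False
    return True

# x uniform on {0,1}^2; two deterministic couplings of x to y:
causal = np.zeros((2, 2, 2, 2))
noncausal = np.zeros((2, 2, 2, 2))
for x1 in range(2):
    for x2 in range(2):
        causal[x1, x2, x1, x2] = 0.25      # y1 = x1: uses only the past of x
        noncausal[x1, x2, x2, x1] = 0.25   # y1 = x2: peeks into the future
```

The first plan copies the path ($y_1 = x_1$, $y_2 = x_2$) and is causal; the second swaps the coordinates ($y_1 = x_2$), so the mass moved at time 1 anticipates the future of the source path.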
Restricting the space of transport plans to $\Pi^{\mathcal{K}}(\mu, \nu)$ in the OT problem (2.2) gives the COT problem

$\mathcal{W}^{\mathcal{K}}_{c}(\mu, \nu) := \inf_{\pi \in \Pi^{\mathcal{K}}(\mu, \nu)} \mathbb{E}_\pi\big[ c(x, y) \big]. \qquad (3.1)$
COT has already found wide application in dynamic problems in stochastic calculus and mathematical finance, see e.g. [3, 1, 2, 9, 7]. The causality constraint can be equivalently formulated in several ways, see [8, Proposition 2.3]. The formulation that will be useful for our purposes is the following: let $\mathcal{M}(\mu)$ be the set of $(\mathcal{F}^x, \mu)$-martingales; then a transport plan $\pi \in \Pi(\mu, \nu)$ is causal if and only if

$\mathbb{E}_\pi\big[ h_t(y)\, \Delta_{t+1} M(x) \big] = 0 \quad \text{for all } t \in \{1, \dots, T-1\},\ h_t \in C_b(\mathbb{R}^{d \cdot t}),\ M \in \mathcal{M}(\mu), \qquad (3.2)$

where $\Delta_{t+1} M(x) := M_{t+1}(x_{1:t+1}) - M_t(x_{1:t})$, with $x_{1:t} := (x_1, \dots, x_t)$ and similarly for $y_{1:t}$, and $h_t$ acts on $y_{1:t}$. As usual, $C_b$ denotes the space of continuous, bounded functions on the corresponding domain. Where no confusion can arise, with an abuse of notation we will simply write $h_t(y)$ rather than $h_t(y_{1:t})$.
In the same spirit as [20], we include an entropic regularization in the COT problem (3.1) and consider

$\mathcal{P}^{\mathcal{K}}_{c,\varepsilon}(\mu, \nu) := \inf_{\pi \in \Pi^{\mathcal{K}}(\mu, \nu)} \mathbb{E}_\pi\big[ c(x, y) \big] - \varepsilon H(\pi). \qquad (3.3)$
The solution to this problem is then unique, due to the strict concavity of the entropy $H$. We denote by $\pi^{\mathcal{K},\varepsilon}$ the optimizer of the above problem, and define the regularized COT distance by

$\mathcal{W}^{\mathcal{K}}_{c,\varepsilon}(\mu, \nu) := \mathbb{E}_{\pi^{\mathcal{K},\varepsilon}}\big[ c(x, y) \big].$
In analogy to the non-causal case, it can be shown that, for discrete $\mu$ and $\nu$ (as in practice), the following limits hold:

$\lim_{\varepsilon \to 0} \mathcal{W}^{\mathcal{K}}_{c,\varepsilon}(\mu, \nu) = \mathcal{W}^{\mathcal{K}}_{c}(\mu, \nu) \quad \text{and} \quad \lim_{\varepsilon \to \infty} \mathcal{W}^{\mathcal{K}}_{c,\varepsilon}(\mu, \nu) = \mathbb{E}_{\mu \otimes \nu}\big[ c(x, y) \big],$

where $\mu \otimes \nu$ denotes the independent coupling.
See Section A.1 for a proof. This means that the regularized COT distance lies between the COT distance and the loss obtained by the independent coupling, and is closer to the former for small $\varepsilon$. Optimizing over the space of causal plans is not straightforward. Nonetheless, the following proposition shows that the problem can be reformulated as a maximization over non-causal problems with respect to a specific family of cost functions,

$\mathcal{C}_{\mathcal{K}}(c) := \Big\{ c(x, y) + \sum_{k=1}^{K} \sum_{t=1}^{T-1} h^{k}_{t}(y)\, \Delta_{t+1} M^{k}(x) \;:\; K \in \mathbb{N},\ h^{k}_{t} \in C_b(\mathbb{R}^{d \cdot t}),\ M^{k} \in \mathcal{M}(\mu) \Big\}. \qquad (3.5)$

Proposition 3.3. For discrete $\mu$ and $\nu$, $\;\mathcal{P}^{\mathcal{K}}_{c,\varepsilon}(\mu, \nu) = \sup_{\hat c \in \mathcal{C}_{\mathcal{K}}(c)} \mathcal{P}_{\hat c,\varepsilon}(\mu, \nu)$.

This means that the optimal value of the regularized COT problem equals the maximum value over the family of regularized OT problems w.r.t. the set of cost functions $\mathcal{C}_{\mathcal{K}}(c)$. This result has been proven in [2]. As it is crucial for our analysis, we show it in Section A.2.
Proposition 3.3 suggests the following worst-case distance between $\mu$ and $\nu$:

$\widehat{\mathcal{W}}^{\mathcal{K}}_{c,\varepsilon}(\mu, \nu) := \sup_{\hat c \in \mathcal{C}_{\mathcal{K}}(c)} \mathcal{W}_{\hat c,\varepsilon}(\mu, \nu), \qquad (3.6)$

as a regularized Sinkhorn distance that respects the causality constraint on the transport plans.
In the context of training a dynamic IGM, the training dataset is a collection of paths of equal length $T$, $\{x^{(i)}\}_{i=1}^{N}$ with $x^{(i)} \in \mathbb{R}^{d \times T}$. As $N$ is usually very large, we proceed as usual by approximating $\mu$ with its empirical mini-batch counterpart. Precisely, for a given IGM $g_\theta$, we fix a batch size $m$, sample $\{x^{(i)}\}_{i=1}^{m}$ from the dataset and $\{z^{(i)}\}_{i=1}^{m}$ from $\zeta$, and denote the generated samples by $y^{(i)} := g_\theta(z^{(i)})$. The empirical distributions are

$\hat\mu := \frac{1}{m} \sum_{i=1}^{m} \delta_{x^{(i)}}, \qquad \hat\nu_\theta := \frac{1}{m} \sum_{i=1}^{m} \delta_{y^{(i)}},$

and the empirical distance can be efficiently approximated by the Sinkhorn algorithm.
When implementing the Sinkhorn divergence (2.5) at the level of mini-batches, one canonical candidate clearly is

$\widehat{\mathcal{W}}_{c,\varepsilon}(\hat\mu, \hat\nu_\theta) := 2\,\mathcal{W}_{c,\varepsilon}(\hat\mu, \hat\nu_\theta) - \mathcal{W}_{c,\varepsilon}(\hat\mu, \hat\mu) - \mathcal{W}_{c,\varepsilon}(\hat\nu_\theta, \hat\nu_\theta), \qquad (3.7)$

which is indeed what is used in [20]. While the expression in (3.7) does converge in expectation to (2.5) for $m \to \infty$ ([19, Theorem 3]), it is not clear whether it is an adequate loss given data of fixed batch size $m$. In fact, we find that this is not the case, and demonstrate it here empirically.
We build an example where the data distribution $\mu$ belongs to a parameterized family of distributions $\{\mu_\theta\}_\theta$, with $\mu = \mu_{\theta^*}$ (details in Section A.3). As shown in Figure 1 (top two rows), neither the expected regularized distance (2.4) nor the Sinkhorn divergence (2.5) reaches its minimum at $\theta = \theta^*$, especially for small $m$. This means that optimizing over mini-batches will not lead to $\theta^*$.
Instead, we propose the following mixed Sinkhorn divergence at the level of mini-batches:

$\widehat{\mathcal{W}}^{\mathrm{mix}}_{c,\varepsilon}(\hat\mu, \hat\mu', \hat\nu_\theta, \hat\nu'_\theta) := \mathcal{W}_{c,\varepsilon}(\hat\mu, \hat\nu_\theta) + \mathcal{W}_{c,\varepsilon}(\hat\mu', \hat\nu'_\theta) - \mathcal{W}_{c,\varepsilon}(\hat\mu, \hat\mu') - \mathcal{W}_{c,\varepsilon}(\hat\nu_\theta, \hat\nu'_\theta), \qquad (3.8)$
where $\hat\mu$ and $\hat\mu'$ are the empirical distributions of mini-batches from the data distribution $\mu$, and $\hat\nu_\theta$ and $\hat\nu'_\theta$ from the IGM distribution $\nu_\theta$. The idea is to also take into account the bias within the distribution $\mu$ and that within the distribution $\nu_\theta$. The proposed divergence finds the correct minimizer for all $m$ in Example 3.4 (Figure 1, bottom), and the improvement is not due solely to the double batch used by (3.8). We further discuss this choice and our findings in Section A.3.
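The four-batch computation in (3.8) can be sketched as follows; the embedded Sinkhorn routine and the fixed squared-Euclidean cost stand in for the learned cost, so this is an illustration of the estimator rather than the paper's implementation.

```python
import numpy as np

def sinkhorn_cost(x, y, eps=0.5, n_iters=200):
    # entropy-regularized OT cost between two batches with uniform weights
    m, n = x.shape[0], y.shape[0]
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-C / eps)
    u = np.ones(m) / m
    for _ in range(n_iters):
        v = (np.ones(n) / n) / (K.T @ u)
        u = (np.ones(m) / m) / (K @ v)
    P = u[:, None] * K * v[None, :]
    return (P * C).sum()

def mixed_sinkhorn_div(x1, x2, y1, y2, eps=0.5):
    """Mixed Sinkhorn divergence of eq. (3.8): x1, x2 are two independent
    mini-batches from the data and y1, y2 two from the generator."""
    return (sinkhorn_cost(x1, y1, eps) + sinkhorn_cost(x2, y2, eps)
            - sinkhorn_cost(x1, x2, eps) - sinkhorn_cost(y1, y2, eps))
```

When the generator batches are drawn far from the data, the two cross terms dominate and the divergence grows, while for batches from matching distributions it stays near zero.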
We now combine the results in Section 3.2 and Section 3.3 to formulate an adversarial training algorithm for IGMs. First, we approximate the set of functions (3.5) by truncating the sums at a fixed $K$, and we parameterize $h$ and $M$ as two separate neural networks $h_{\varphi_1}$ and $M_{\varphi_2}$, with $\varphi := (\varphi_1, \varphi_2)$. To capture the adaptedness of these processes, we employ architectures in which the output at time $t$ depends on the input only up to time $t$. The mixed Sinkhorn divergence between $\mu$ and $\nu_\theta$ is then calculated with respect to the parameterized cost function

$c^{K}_{\varphi}(x, y) := c(x, y) + \sum_{k=1}^{K} \sum_{t=1}^{T-1} h^{k}_{\varphi_1, t}(y)\, \Delta_{t+1} M^{k}_{\varphi_2}(x). \qquad (3.9)$
Second, it is not obvious how to directly impose the martingale condition, as constraints involving conditional expectations cannot easily be enforced in practice. Rather, we penalize processes whose increments at every time step are non-zero on average. For an $\mathcal{F}^x$-adapted process $M$ and a mini-batch $\{x^{(i)}\}_{i=1}^{m}$ with empirical distribution $\hat\mu$, we define the martingale penalization for $M$ as

$p_{M}(\hat\mu) := \frac{1}{mT} \sum_{t=1}^{T-1} \bigg| \sum_{i=1}^{m} \frac{M_{t+1}(x^{(i)}) - M_t(x^{(i)})}{\sqrt{\operatorname{Var}[M]} + \eta} \bigg|,$

where $\operatorname{Var}[M]$ is the empirical variance of $M$ over time and batch, and $\eta > 0$ is a small constant. Third, we use the mixed Sinkhorn divergence introduced in (3.8), where each of the four terms is approximated by running the Sinkhorn algorithm on the cost $c^{K}_{\varphi}$ for a fixed number of iterations. Altogether, we arrive at the following adversarial objective function for COT-GAN:
$\min_{\theta} \max_{\varphi} \Big\{ \widehat{\mathcal{W}}^{\mathrm{mix}}_{c^{K}_{\varphi},\varepsilon}(\hat\mu, \hat\mu', \hat\nu_\theta, \hat\nu'_\theta) - \lambda\, p_{M_{\varphi_2}}(\hat\mu) \Big\}, \qquad (3.10)$

where $\hat\mu$ and $\hat\mu'$ are empirical measures corresponding to non-overlapping subsets of the dataset, $\hat\nu_\theta$ and $\hat\nu'_\theta$ are the ones corresponding to two samples from $\nu_\theta$, and $\lambda$ is a positive constant. We update $\theta$ to decrease this objective, and $\varphi$ to increase it.
While the generator $g_\theta$ acts as in classical GANs, the adversarial role here is played by $h_{\varphi_1}$ and $M_{\varphi_2}$. In this setting, the discriminator, parameterized by $\varphi$, learns a robust (worst-case) distance between the real data distribution $\mu$ and the generated distribution $\nu_\theta$, where the class of cost functions in (3.9) originates from causality. The algorithm is summarized in Algorithm 1. Its time complexity per iteration is dominated by the Sinkhorn iterations, which scale quadratically in the batch size.
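The two discriminator ingredients above, the parameterized cost (3.9) and the martingale penalty, can be sketched as follows. The array shapes and the exact normalization of the penalty are our own hedged reading of the description in the text, not the paper's code.

```python
import numpy as np

def causal_cost(x, y, h, M, base_cost):
    """Parameterized cost c_phi^K of eq. (3.9) for one pair of paths.

    h[k, t] and M[k, t] stand in for the outputs of the two causal
    networks at time t for the k-th of the K truncation terms; in the
    paper these outputs depend on the path only up to time t.
    """
    dM = M[:, 1:] - M[:, :-1]               # increments Delta_{t+1} M(x)
    return base_cost(x, y) + (h[:, :-1] * dM).sum()

def martingale_penalty(M_batch, eta=1e-5):
    """Penalty on the batch-mean increments of M, as described in the text.

    M_batch[i, t] is the value of the process on the i-th sample path at
    time t; increments of a true martingale average out over the batch,
    so the penalty is small.
    """
    m, T = M_batch.shape
    dM = M_batch[:, 1:] - M_batch[:, :-1]
    scale = np.sqrt(M_batch.var()) + eta    # empirical std over time and batch
    return np.abs(dM.sum(axis=0) / scale).sum() / (m * T)
```

With $h \equiv 0$ the causal cost reduces to the base cost, and a batch whose increments cancel across samples incurs zero penalty, matching the intuition that only systematic drift is punished.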
Early video generation literature focuses on dynamic texture modeling [16, 35, 40]. Recent efforts in video generation within the GAN community have been devoted to designing GAN architectures of generator and discriminator to tackle the spatio-temporal dependencies separately, e.g., [39, 34, 36]. VGAN [39] explored a two-stream generator that combines a network for a static background and another one for moving foreground trained on the original GAN objective. TGAN [34] proposed a new structure capable of generating dynamic background as well as a weight clipping trick to regularize the discriminator. In addition to a unified generator, MoCoGAN [36] employed two discriminators to judge both the quality of frames locally and the evolution of motions globally.
The broader literature of sequential data generation attempts to capture the dependencies in time by simply deploying recurrent neural networks in the architecture
[28, 18, 23, 42]. Among them, TimeGAN [42] demonstrated improvements in time series generation by adding a teacher-forcing component to the loss function. Alternatively, WaveGAN [15] adopted the causal structure of WaveNet [38]. Despite substantial progress, existing sequential GANs are generally domain-specific. We therefore aim to offer a framework that incorporates (transport) causality in the objective function and is suitable for more general sequential settings.

Whilst our analysis is built upon [14] and [20], we remark on two major differences between COT-GAN and the Sinkhorn GAN in [20]. First, we consider a different family of costs: while [20] learns the cost function by parameterizing it directly, the family of costs in COT-GAN is obtained by adding to a fixed cost $c$ a causal component expressed in terms of the processes $h_{\varphi_1}$ and $M_{\varphi_2}$. Second, the objective of COT-GAN is the mixed Sinkhorn divergence we propose, which reduces bias in parameter estimation and can be used as a generic divergence for training IGMs, not limited to time series settings.
We now validate COT-GAN empirically (code and data are available at github.com/neuripss2020/COT-GAN). For time series that have a relatively small dimensionality but exhibit complex temporal structure, we compare COT-GAN with the following methods: direct minimization of the Sinkhorn divergences (3.8) and (3.7); TimeGAN [42], as reviewed in Section 4; and Sinkhorn GAN, similar to [20], with a cost learned through $f_\varphi$, where $f_\varphi$ is trained to increase the mixed Sinkhorn divergence with weight clipping. All methods use the same $\varepsilon$. The networks $h_{\varphi_1}$ and $M_{\varphi_2}$ in COT-GAN and $f_\varphi$ in Sinkhorn GAN share the same architecture. Details of models and datasets are in Section B.1.
We first test whether COT-GAN can learn temporal and spatial correlation in a multivariate first-order auto-regressive (AR-1) process. Results are shown in Figure 2. COT-GAN samples have correlation structures that best match the real data. Minimizing the mixed divergence produces almost as good correlations as COT-GAN, but with less accurate auto-correlation. Minimizing the original Sinkhorn divergence yields poor results, and neither TimeGAN nor Sinkhorn GAN captures the correlation structure of this dataset.
The noisy oscillation distribution is composed of sequences of 20-element arrays (1-D images) [41]. Figure 6 in Section B.1
shows data as well as generated samples by different training methods. To evaluate performance, we estimate two attributes of the samples by Monte Carlo: the marginal distribution of pixel values, and the joint distribution of the location at adjacent time steps. COT-GAN samples match the real data best.
This dataset is from the UCI repository [17] and contains recordings from 43 healthy subjects, each undergoing around 80 trials. Each data sequence has 64 channels, and we model the first 100 time steps. We trained and evaluated each method 16 times with different training and test splits. We evaluated performance by the maximum mean discrepancy (MMD) and by the match with data in terms of temporal and channel correlations and frequency spectrum. In addition, we investigated how the regularization coefficient affects sample quality. We show an example of the data and learned correlations in Figure 3
, and summary statistics of all evaluation metrics in
Figure 8 in Section B.1. COT-GANs generate the best samples compared with the other baselines across all four metrics. A smaller coefficient tends to generate less realistic correlation patterns, but a slightly better match in frequency spectrum.

We train COT-GAN on Sprites animations [27, 33] and human action sequences [11], and compare with MoCoGAN [36]. The evaluation metrics are the Fréchet Inception Distance (FID) [24], which compares individual frames; the Fréchet Video Distance (FVD) [37], which compares the video sequences as a whole by mapping samples into features via pretrained 3D convolutional networks; and their kernel counterparts (KID, KVD) [10]. Previous studies suggest that FVD correlates better with human judgement than KVD for videos [37], whereas KID does so better than FID on images [44].
We pre-process the Sprites and human action sequences to fixed sequence lengths, and each frame is resized to a fixed resolution. We employ the same architecture of generator and discriminator for both datasets: both comprise a generic LSTM with 2-D convolutional layers. The detailed data pre-processing, GAN architectures, hyper-parameter settings, and training techniques are reported in Appendix B.2. We show the real data and samples from COT-GAN side by side in Figure 4.
Sprites | FVD | FID | KVD | KID |
---|---|---|---|---|
MoCoGAN | 1213.2 | 281.3 | 160.1 | 0.33 |
COT-GAN | 444.6 | 83.5 | 64.0 | 0.077 |
Human actions | ||||
MoCoGAN | 661.8 | 128.4 | 60.4 | 0.21 |
COT-GAN | 541.0 | 52.4 | 46.2 | 0.096 |
The evaluation scores in Table 1 are estimated using 5000 generated samples. COT-GAN is the better performing method in both tasks for all four measurements. Further samples, and comparison with direct minimization of the mixed Sinkhorn divergence, are provided in Appendix C.
The performance of COT-GAN suggests that constraining the transport plans to be causal is a promising direction for generating sequential data. The approximations we introduce, such as the mixed Sinkhorn distance (3.8) and truncated sum in (3.5), are sufficient to produce good experimental results, and provide opportunities for more theoretical analyses in future studies. Directions of future development include ways to learn from data with flexible lengths, extensions to conditional COT-GAN, and improved methods to enforce the martingale property for and better parameterize the causality constraint.
G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto. Dynamic textures. International Journal of Computer Vision, 51(2), 2003.
D. Dua and C. Graff. UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences, 2017.
I. J. Good. Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. The Annals of Mathematical Statistics, 34(3), 1963.
C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos. MMD GAN: towards deeper understanding of moment matching network. In NeurIPS, 2017.
Y. Li and S. Mandt. Disentangled sequential autoencoder. arXiv preprint arXiv:1803.02991, 2018.
M. Saito, E. Matsumoto, and S. Saito. Temporal generative adversarial nets with singular value clipping. In ICCV, 2017.
L.-Y. Wei and M. Levoy. Fast texture synthesis using tree-structured vector quantization. In Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 2000.

In this section we prove the limits stated in Remark 3.2.
Let $\mu$ and $\nu$ be discrete measures on path spaces, with $\mu = \sum_i a_i \delta_{x^{(i)}}$ and $\nu = \sum_j b_j \delta_{y^{(j)}}$. Then $\lim_{\varepsilon \to 0} \mathcal{W}^{\mathcal{K}}_{c,\varepsilon}(\mu, \nu) = \mathcal{W}^{\mathcal{K}}_{c}(\mu, \nu)$.
Let $\mu$ and $\nu$ be discrete measures. Then $\lim_{\varepsilon \to \infty} \mathcal{W}^{\mathcal{K}}_{c,\varepsilon}(\mu, \nu) = \mathbb{E}_{\mu \otimes \nu}\big[ c(x, y) \big]$.
The causality constraint (3.2) can be expressed using the following characteristic function:

$\chi(\pi) := \sup_{\hat c \in \mathcal{C}_{\mathcal{K}}(c)} \mathbb{E}_\pi\big[ \hat c(x, y) - c(x, y) \big] = \begin{cases} 0 & \text{if } \pi \in \Pi^{\mathcal{K}}(\mu, \nu), \\ +\infty & \text{otherwise}, \end{cases} \qquad (A.3)$

since the expectations in (3.2) are linear in $(h, M)$ and can be scaled arbitrarily. This allows us to rewrite (3.3) as

$\mathcal{P}^{\mathcal{K}}_{c,\varepsilon}(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \big\{ \mathbb{E}_\pi[c] - \varepsilon H(\pi) + \chi(\pi) \big\} = \inf_{\pi \in \Pi(\mu, \nu)} \sup_{\hat c \in \mathcal{C}_{\mathcal{K}}(c)} \big\{ \mathbb{E}_\pi[\hat c] - \varepsilon H(\pi) \big\} = \sup_{\hat c \in \mathcal{C}_{\mathcal{K}}(c)} \inf_{\pi \in \Pi(\mu, \nu)} \big\{ \mathbb{E}_\pi[\hat c] - \varepsilon H(\pi) \big\} = \sup_{\hat c \in \mathcal{C}_{\mathcal{K}}(c)} \mathcal{P}_{\hat c,\varepsilon}(\mu, \nu),$

where the third equality holds by the min-max theorem, thanks to convexity of $\pi \mapsto \mathbb{E}_\pi[\hat c] - \varepsilon H(\pi)$, and convexity and compactness of $\Pi(\mu, \nu)$. ∎
In the experiment mentioned in Example 3.4, we consider a set of distributions $\mu_\theta$ given by sinusoids with random phase, frequency and amplitude. We let $\mu$ be one element in this set, whose amplitude is uniformly distributed between a minimum of 0.3 and a maximum of 0.8. On the other hand, for each $\theta$, the amplitude under $\mu_\theta$ is uniformly distributed between the same minimum 0.3 and a maximum that varies over an interval containing 0.8. Thus, the only parameter of the distribution being varied is the maximum amplitude. We may equivalently take the maximum amplitude as the single scalar $\theta$ that parameterizes $\mu_\theta$, so that $\mu = \mu_{\theta^*}$ with $\theta^* = 0.8$. Figure 1 illustrates that the sample Sinkhorn divergence (3.7) (or the regularized distance (2.4)) does not recover the optimizer $\theta^*$, while the proposed mixed Sinkhorn divergence (3.8) does.

As mentioned in Section 3.3, when implementing the Sinkhorn divergence (2.5) at the level of mini-batches, one canonical choice is the one adopted in [20], that is
$\widehat{\mathcal{W}}_{c,\varepsilon}(\hat\mu, \hat\nu_\theta) := 2\,\mathcal{W}_{c,\varepsilon}(\hat\mu, \hat\nu_\theta) - \mathcal{W}_{c,\varepsilon}(\hat\mu, \hat\mu) - \mathcal{W}_{c,\varepsilon}(\hat\nu_\theta, \hat\nu_\theta). \qquad (A.4)$
What inspired our different choice, the mixed Sinkhorn divergence in (3.8), that is
$\widehat{\mathcal{W}}^{\mathrm{mix}}_{c,\varepsilon}(\hat\mu, \hat\mu', \hat\nu_\theta, \hat\nu'_\theta) := \mathcal{W}_{c,\varepsilon}(\hat\mu, \hat\nu_\theta) + \mathcal{W}_{c,\varepsilon}(\hat\mu', \hat\nu'_\theta) - \mathcal{W}_{c,\varepsilon}(\hat\mu, \hat\mu') - \mathcal{W}_{c,\varepsilon}(\hat\nu_\theta, \hat\nu'_\theta), \qquad (A.5)$
is the idea of also taking into account the bias within the distribution $\mu$ and that within the distribution $\nu_\theta$ when sampling mini-batches from them.
Clearly, when the batch size $m \to \infty$, both (A.4) and (A.5) converge to (2.5) in expectation; see [19, Theorem 3]. So the main point is, for a fixed $m$, which of the two does a better job of translating the idea of the Sinkhorn divergence to the level of mini-batches. Experiments suggest that (A.5) is indeed the better choice, which can also be supported by a triangle-type inequality on the four Sinkhorn terms.
One could argue that in (A.5) we are using two batches of size $m$, so that simply considering a bigger mini-batch, say of size $2m$, in (A.4) may perform as well. However, we have considered this case, and our experiments confirm that the mixed Sinkhorn divergence (A.5) performs better than the so-far used (A.4), even when the latter is allowed a bigger batch size. This reasoning can be pushed further, for example by considering all four combinations of the available mini-batches; implementations showed that there is no advantage in doing so, while requiring more computation.
In the limit $\varepsilon \to \infty$, Genevay et al. [20] showed that the Sinkhorn divergence converges to the MMD under the kernel defined by $-c$. Here we want to point out an interesting fact about the limiting behavior of the mixed Sinkhorn divergence.
Given the empirical distributions $\hat\mu$ and $\hat\nu_\theta$ of mini-batches formed by $m$ samples from $\mu$ and $\nu_\theta$, respectively, in the limit $\varepsilon \to \infty$ the Sinkhorn divergence $\widehat{\mathcal{W}}_{c,\varepsilon}(\hat\mu, \hat\nu_\theta)$ converges to a biased estimator of $\mathrm{MMD}_{-c}(\mu, \nu_\theta)$; given additional mini-batches $\hat\mu'$ and $\hat\nu'_\theta$ from $\mu$ and $\nu_\theta$, respectively, the mixed Sinkhorn divergence $\widehat{\mathcal{W}}^{\mathrm{mix}}_{c,\varepsilon}(\hat\mu, \hat\mu', \hat\nu_\theta, \hat\nu'_\theta)$ converges to an unbiased estimator of $\mathrm{MMD}_{-c}(\mu, \nu_\theta)$.

The first part of the statement relies on the fact that the limiting quantity is a biased estimator of $\mathrm{MMD}_{-c}(\mu, \nu_\theta)$. Indeed, as $\varepsilon \to \infty$ the optimal coupling converges to the independent one [20, Theorem 1], so that

$\widehat{\mathcal{W}}_{c,\varepsilon}(\hat\mu, \hat\nu_\theta) \longrightarrow \frac{2}{m^2} \sum_{i,j} c(x^{(i)}, y^{(j)}) - \frac{1}{m^2} \sum_{i,j} c(x^{(i)}, x^{(j)}) - \frac{1}{m^2} \sum_{i,j} c(y^{(i)}, y^{(j)}).$

Now note that

$\mathbb{E}\Big[ \frac{1}{m^2} \sum_{i,j} c(x^{(i)}, x^{(j)}) \Big] = \frac{m-1}{m}\, \mathbb{E}_{\mu \otimes \mu}\big[ c(x, x') \big] + \frac{1}{m}\, \mathbb{E}_{\mu}\big[ c(x, x) \big],$

where we have used the fact that $x^{(i)}$ and $x^{(j)}$ are independent for $i \neq j$. A similar result holds for the sum over the $y^{(j)}$'s. On the other hand, $\mathbb{E}\big[ \frac{1}{m^2} \sum_{i,j} c(x^{(i)}, y^{(j)}) \big] = \mathbb{E}_{\mu \otimes \nu_\theta}[c(x, y)]$. Therefore

$\mathbb{E}\Big[ \lim_{\varepsilon \to \infty} \widehat{\mathcal{W}}_{c,\varepsilon}(\hat\mu, \hat\nu_\theta) \Big] \neq 2\, \mathbb{E}_{\mu \otimes \nu_\theta}[c(x, y)] - \mathbb{E}_{\mu \otimes \mu}[c(x, x')] - \mathbb{E}_{\nu_\theta \otimes \nu_\theta}[c(y, y')] = \mathrm{MMD}_{-c}(\mu, \nu_\theta),$

which completes the proof of the first part of the statement.

For the second part, note that $\mathcal{W}_{c,\varepsilon}(\hat\mu, \hat\nu_\theta) \to \mathbb{E}_{\hat\mu \otimes \hat\nu_\theta}[c]$ as $\varepsilon \to \infty$ [20, Theorem 1], thus

$\widehat{\mathcal{W}}^{\mathrm{mix}}_{c,\varepsilon}(\hat\mu, \hat\mu', \hat\nu_\theta, \hat\nu'_\theta) \longrightarrow \frac{1}{m^2} \sum_{i,j} \big[ c(x^{(i)}, y^{(j)}) + c(x'^{(i)}, y'^{(j)}) - c(x^{(i)}, x'^{(j)}) - c(y^{(i)}, y'^{(j)}) \big].$

The RHS is an unbiased estimator of $\mathrm{MMD}_{-c}(\mu, \nu_\theta)$, since its expectation is

$2\, \mathbb{E}_{\mu \otimes \nu_\theta}[c(x, y)] - \mathbb{E}_{\mu \otimes \mu}[c(x, x')] - \mathbb{E}_{\nu_\theta \otimes \nu_\theta}[c(y, y')]. \qquad \blacksquare$
Note that the bias refers to the parameter estimate, rather than the divergence itself. The mixed divergence may still be a biased estimate of the true Sinkhorn divergence. However, in the experiment of Example 3.4 we note that the minimum is reached for the parameter close to the real one (Figure 1, bottom). We defer detailed analysis of mixed divergence to a future paper.
Here we describe details of the experiments in Section 5.1.
The generative process for the autoregressive data is

$x_t = A x_{t-1} + \epsilon_t,$

where the transition matrix $A$ is diagonal with ten evenly spaced values and $\epsilon_t$ is the noise at time $t$. We initialize $x_0$ from a 10-dimensional standard normal distribution, and ignore the data in the first 10 time steps so that the data sequence begins with a more or less stationary distribution. Real data and generated samples are shown in Figure 5.
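A minimal simulation of this process might look as follows; the diagonal coefficients, the noise scale and the seed are placeholders, since the exact constants are elided in the text above.

```python
import numpy as np

def sample_ar1(m, T=30, d=10, burn_in=10, seed=0):
    """Simulate the AR-1 process described above: x_t = A x_{t-1} + e_t,
    with A diagonal.  The coefficient range and noise scale below are
    assumed values for illustration; the burn-in discards the first
    steps so sequences start near stationarity, as in the text.
    """
    rng = np.random.default_rng(seed)
    A = np.diag(np.linspace(0.1, 0.9, d))       # assumed diagonal values
    x = rng.standard_normal((m, d))             # standard-normal initial state
    paths = []
    for t in range(burn_in + T):
        x = x @ A.T + 0.1 * rng.standard_normal((m, d))
        if t >= burn_in:
            paths.append(x.copy())
    return np.stack(paths, axis=1)              # shape (m, T, d)
```

Because every diagonal entry lies strictly inside the unit interval, the simulated paths stay bounded and the burn-in suffices to reach an approximately stationary regime.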
This dataset comprises paths simulated from a noisy, nonlinear dynamical system. Each path is represented as a sequence of 20-dimensional arrays (1-D images) and can be displayed as an image for visualization. At each discrete time step $t$, the observation $x_t$ is determined by the position of a "particle" following noisy, nonlinear dynamics. When shown as an image, each sample path appears visually as a "bump" travelling rightward, moving up and down in a zig-zag pattern, as shown in Figure 6 (top left).
More precisely, the state of the particle at time $t$ is described by its position $z_t$ and velocity $v_t$, and evolves by a noisy linear update involving a rotation matrix $R$, where the initial velocity is uniformly distributed on the unit circle.
We take the observation dimension and bump width so that $x_t$ is a vector of evaluations of a Gaussian function at 20 evenly spaced locations, with the peak of the Gaussian following the position of the particle at each $t$:

$x_{t,i} = \exp\!\big( -(u_i - z_t)^2 / (2\sigma^2) \big),$

where $(u_i)_{i=1}^{20}$ maps pixel indices to a grid of evenly spaced points in the space of particle positions. Thus $x_t$, the observation at time $t$, contains information about $z_t$ but not $v_t$. A similar data-generating process was used in [41], inspired by Johnson et al. [25].
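The generative process above can be sketched as follows; the rotation angle, noise scale, grid range and bump width are placeholders for constants elided in the text, so this is an illustration of the mechanism rather than the exact dataset.

```python
import numpy as np

def sample_bump_paths(m, T=25, d=20, sigma=1.0, seed=0):
    """Sketch of the noisy-oscillation data: a 2-D (position, velocity)
    state is rotated by a fixed angle plus noise at every step, and the
    position is observed through a Gaussian bump over d pixels."""
    rng = np.random.default_rng(seed)
    theta = 0.25                                     # assumed rotation angle
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    grid = np.linspace(-3.0, 3.0, d)                 # pixel-to-position grid
    ang = rng.uniform(0.0, 2 * np.pi, m)             # initial state on unit circle
    s = np.stack([np.cos(ang), np.sin(ang)], axis=1)
    frames = []
    for _ in range(T):
        s = s @ R.T + 0.05 * rng.standard_normal((m, 2))
        bump = np.exp(-(grid[None, :] - s[:, [0]]) ** 2 / (2 * sigma ** 2))
        frames.append(bump)                          # Gaussian bump at position
    return np.stack(frames, axis=1)                  # shape (m, T, d)
```

Each returned array can be displayed as a $T \times d$ image in which the bright bump oscillates up and down, mirroring the zig-zag pattern described above.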
We compare the marginal distribution of the pixel values and the joint distribution of the bump location between adjacent time steps. See Figure 6.
We obtained the EEG dataset from [43] and took the recordings of all 43 subjects in the control group under the matching condition (S2). For each subject, we choose 75% of the trials as training data and the remaining ones for evaluation. All data are subtracted by the channel-wise mean, divided by three times the channel-wise standard deviation, and then passed through a squashing nonlinearity. We train and evaluate models 16 times with different splittings. For COT-GAN, we trained three variants corresponding to three values of the regularization coefficient, shared across all OT-based methods. Data and samples are shown in Figure 7.

We use four different metrics to compare sample quality. The first is a relative MMD test statistic of the form $\mathrm{MMD}(x, z) - \mathrm{MMD}(x, y)$, where $x$ indicates the real test dataset, $y$ is sampled from a reference COT-GAN, and $z$ is sampled from an alternative method that is one of the following: a COT-GAN with a different coefficient, direct minimization of the mixed or original Sinkhorn divergence, TimeGAN, or Sinkhorn GAN. A larger value of the test statistic indicates that the reference COT-GAN is better compared to the alternative. We do not employ the hypothesis-testing framework, but rather use the test statistic as a metric of relative sample quality. We also compute the following quantities on the real and generated samples: a) the temporal correlation coefficient, b) the channel-wise correlation coefficient, and c) the frequency spectrum for each channel averaged over samples. For each of these three features, we use the sum of absolute differences between features computed from real and synthesized data as a metric of similarity; a small number means the generated data are close to the real data for the corresponding feature.

As the results in Figure 8 show, the different metrics do not agree in general. Nonetheless, COT-GANs in general outperform the other models. According to MMD and temporal correlation, direct minimization of the mixed Sinkhorn divergence is as good as the best COT-GAN. All COT-GANs, however, do better in channel correlation and frequency spectrum. We noticed that increasing the coefficient is helpful for MMD and the two correlations, but not for the frequency spectrum.
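The relative comparison described above can be sketched with a standard unbiased MMD estimator; the Gaussian kernel, its bandwidth, and flattening sequences to vectors are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np

def mmd2(x, y, bandwidth=1.0):
    """Unbiased estimate of the squared MMD with a Gaussian kernel.
    Rows of x and y are samples (sequences flattened to vectors)."""
    def gram(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    m, n = len(x), len(y)
    Kxx, Kyy, Kxy = gram(x, x), gram(y, y), gram(x, y)
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2 * Kxy.mean())

def relative_mmd_stat(x, y, z):
    # positive values favour y (the reference model) over z (the alternative)
    return mmd2(x, z) - mmd2(x, y)
```

When y is drawn from the same distribution as the real data x and z from a shifted one, the statistic is positive, matching the interpretation given in the text.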
The dimensionality of the latent state is 10 at each time step, and there is also a 10-dimensional time-invariant latent state. The generator common to COT-GAN, direct minimization and Sinkhorn GAN comprises a 1-layer (synthetic) or 2-layer (EEG) LSTM network, whose output at each time step