Latent Space Optimal Transport for Generative Models

09/16/2018 ∙ by Huidong Liu, et al. ∙ Dalian University of Technology ∙ Stony Brook University ∙ Harvard University

Variational Auto-Encoders enforce their learned intermediate latent-space data distribution to be a simple distribution, such as an isotropic Gaussian. However, this causes the posterior collapse problem and destroys the manifold structure, which can be important for datasets such as facial images. A GAN can transform a simple distribution into a latent-space data distribution and thus preserve the manifold structure, but optimizing a GAN involves solving a Min-Max optimization problem, which is difficult and not well understood so far. Therefore, we propose a GAN-like method to transform a simple distribution to the data distribution in the latent space by solving only a minimization problem. This minimization problem arises from training a discriminator to distinguish a simple distribution from the latent-space data distribution. We can then explicitly formulate an Optimal Transport (OT) problem that computes the desired mapping between the two distributions. This means that we can transform a distribution without solving the difficult Min-Max optimization problem. Experimental results on an eight-Gaussian dataset show that the proposed OT can handle multi-cluster distributions. Results on the MNIST and the CelebA datasets validate the effectiveness of the proposed method.


1 Introduction

Auto-Encoders (AEs) have demonstrated the capability of learning a subspace for dimensionality reduction [22]. However, AEs are not generative. Mathematically speaking, there exist regions of the latent space where a latent code is not in the support of the latent representations of the input data [36]. In order to address this problem, Variational Auto-Encoders (VAEs) [16] enforce the latent-space data distribution to be close to a simple distribution, e.g., a unit Gaussian, such that a randomly sampled latent code lies in the support of the latent representation of the given dataset. In practice, VAEs minimize the KL-divergence between the latent-space data distribution and a unit Gaussian [17]. For a similar purpose, instead of measuring the KL-divergence, the Adversarial Auto-Encoder (AAE) [25] adopts adversarial training in the latent space to enforce the latent-space data distribution to be a unit Gaussian [3]. The Wasserstein Auto-Encoder (WAE) [36] with a GAN penalty (WAE-GAN) generalizes the AAE by allowing the reconstruction cost to be any cost function; when the cost function is quadratic, WAE-GAN reduces to AAE.

Existing VAE-based methods [26, 18], AAE, and WAE transform the distribution of the latent representations of data to a simple distribution such as an isotropic Gaussian. However, many real-world datasets, such as facial images, lie on lower-dimensional manifolds whose structure is quite different from a simple isotropic Gaussian [9, 1]. Forcing data with such manifold structure into a simple distribution, e.g., a unit Gaussian, obliterates the structure of the latent-space data distribution, causes the posterior collapse problem [37], and leads to unrealistic generated images.

In contrast to VAE-based methods, in which the distribution of the latent representation of data is transformed to an isotropic Gaussian implicitly using the KL-divergence, Generative Adversarial Nets (GANs) [7] can, in theory, transform any given distribution into another distribution. GAN-based methods are receiving increasing attention, both in various applications [34, 29, 39, 21, 11, 10] and in methodological improvements [32, 2, 8, 28, 27]. A GAN model consists of a generator and a discriminator (or critic): the generator synthesizes data from a simple distribution to fool the discriminator, while the discriminator tries to distinguish between real data and synthetic data. However, training a GAN amounts to solving a Min-Max optimization problem [31], which is difficult and unstable in practice [26]. Furthermore, the balance between the discriminator and the generator is difficult to control [3].

Figure 1: The workflow of AE-OT. The Encoder and Decoder are trained first and then fixed. In the training phase, the latent representation of a data sample serves as real data for the discriminator, while a noise sample serves as fake data; we train the discriminator using the proposed approach. In the generating phase, we sample a noise vector, map it with the computed OT map, and feed the mapped code into the Decoder to produce a generated data sample.

In this paper, we address the posterior collapse problem of VAE by learning a transformation from a simple distribution (e.g., a unit Gaussian) to the latent-space data distribution, which preserves the structure of the data in the latent space. Traditionally, a GAN with two networks (a generator and a discriminator) would be used to achieve this goal, but the adversarial training process is not well understood theoretically. In contrast, our proposed method computes the Optimal Transport (OT) map directly, based on the theoretical analysis presented in [20]. We only train a discriminator in the latent space, and the OT map that transforms a sample from a simple distribution to the latent-space data distribution is explicitly derived from the discriminator output. As we use an Auto-Encoder (AE) to find the latent space and use OT to perform the distribution transformation in the latent space, we name our method AE-OT. OT has a well-understood theory, and hence we can transform a distribution with a transparent theoretical model. Figure 1 shows the workflow of AE-OT.

In contrast to the Wasserstein Auto-Encoder (WAE), in which the Wasserstein Distance (WD) is defined in the original image space, in AE-OT the WD is defined in the latent space; both the objective and the training protocol are different. WAE requires solving a difficult Min-Max optimization problem, while AE-OT only needs to solve a minimization problem in the latent space, so training AE-OT is much easier than training WAE. It is worth noting that several OT-based methods have been proposed in the literature: [35, 6] compute discrete OT from the primal formulation and are thus not suitable for generative models; [30] learns the OT map using kernel methods, whose parameters are difficult to choose; and [33] needs to train three networks to learn the transport map, which is harder than training AE-OT.

The contributions of this paper are the following:

1) We propose a novel generative Auto-Encoder. Different from existing VAE-based methods, which map the latent-space data distribution to a simple distribution, the proposed generative model transforms a simple distribution to the latent-space data distribution. In this way, the intrinsic structure of the data in the latent space is preserved, and thus the posterior collapse problem [37] is addressed.

2) We show that, if the cost function is quadratic, then once the optimal discriminator is obtained, the generator can be explicitly derived from the discriminator output. AE-OT achieves the same goal as a GAN in the latent space, but it only needs to solve a minimization problem in the latent space rather than the difficult Min-Max optimization problem of GANs.

3) Experiments on an eight-Gaussian toy dataset demonstrate that the computed OT can model multi-cluster distributions. Qualitative and quantitative results on the MNIST [19] dataset show that AE-OT performs better than VAE and WAE. Images generated on the CelebA [24] dataset show that AE-OT generates much better facial images than VAE and WAE.

In the remainder of this paper, we first review optimal transport, then introduce our proposed generative model, and finally present experimental results.

2 Optimal Transport

Since our method is based on Optimal Transport (OT), we first introduce the background of OT. OT is a powerful tool for handling transformations of probability measures. For details, one may refer to [20, 38].

2.1 Optimal Transport theory

In this subsection, we introduce the basic concepts and theorems of classical optimal transport theory, focusing on the Kantorovich potential and Brenier's approach to solving the Monge problem defined in Problem 1.

Let $X$, $Y$ be two subsets of the $d$-dimensional Euclidean space $\mathbb{R}^d$, and let $\mu$ and $\nu$ be probability measures defined on $X$ and $Y$, respectively. We also require that they have equal total measure, i.e., $\mu(X) = \nu(Y)$.

Definition 1 (Measure-Preserving Map)

A map $T: X \to Y$ is measure preserving if for any measurable set $B \subset Y$, the set $T^{-1}(B)$ is $\mu$-measurable and

$$\mu(T^{-1}(B)) = \nu(B). \tag{1}$$

The problem of optimal transport arises from minimizing the total cost of moving all particles from one place (i.e. source) to another place (i.e. target), given the cost of moving each unit of mass. Formally speaking, we define a cost function $c(x, y)$ on $X \times Y$, such that $c(x, y) \ge 0$ for every $x \in X$ and $y \in Y$. The total transport cost of moving particles with density $\mu$ at $x$ to density $\nu$ at $T(x)$ is defined to be

$$\mathcal{C}(T) = \int_X c(x, T(x)) \, d\mu(x). \tag{2}$$

Eq. (2) can also be rewritten in terms of $T_\# \mu$, the push-forward measure induced by $T$. Now we can define Monge's problem of optimal transport.

Problem 1

[Monge’s Optimal Transport [4]] Given a transport cost function $c: X \times Y \to \mathbb{R}^{+}$, find a measure-preserving map $T: X \to Y$ that minimizes the total transport cost

$$\min_{T_\# \mu = \nu} \int_X c(x, T(x)) \, d\mu(x). \tag{3}$$

Note that in Monge’s problem a transport map $T$ is required, whose existence is not guaranteed for an arbitrary pair of measures $\mu$ and $\nu$; for example, when $\mu$ is a Dirac measure and $\nu$ is an arbitrary measure absolutely continuous with respect to the Lebesgue measure in $\mathbb{R}^d$, no such map exists. This means that, given $\mu$ and $\nu$, the feasible solution set of Monge’s problem may be empty. Therefore, Kantorovich introduced a relaxation of Monge’s problem [13] in the 1940s. Instead of considering a transport map, he considered the set of transport plans. Mathematically, given a source measure $\mu$ on $X$ and a target measure $\nu$ on $Y$, a transport plan is a joint distribution $\gamma$ on $X \times Y$ such that

$$\gamma(A \times Y) = \mu(A), \qquad \gamma(X \times B) = \nu(B), \tag{4}$$

for all measurable sets $A \subset X$ and $B \subset Y$.

Intuitively, if $\gamma(A \times B) > 0$ for some measurable sets $A \subset X$ and $B \subset Y$, there will be mass moved from $A$ to $B$ in the plan $\gamma$. Now the total cost of a transport plan $\gamma$ is

$$\mathcal{C}(\gamma) = \int_{X \times Y} c(x, y) \, d\gamma(x, y). \tag{5}$$

And the Monge-Kantorovich problem is defined as

$$(KP) \quad \min_{\gamma} \int_{X \times Y} c(x, y) \, d\gamma(x, y) \quad \text{s.t.} \quad (\pi_X)_\# \gamma = \mu, \ (\pi_Y)_\# \gamma = \nu, \tag{6}$$

among all transport plans $\gamma$, where $\pi_X$ and $\pi_Y$ are the projection maps from $X \times Y$ onto $X$ and $Y$, respectively.

To solve the (KP) problem, we consider its dual form, known as the Kantorovich dual problem [38],

$$(DP) \quad \max_{\varphi, \psi} \int_X \varphi(x) \, d\mu(x) + \int_Y \psi(y) \, d\nu(y) \quad \text{s.t.} \quad \varphi(x) + \psi(y) \le c(x, y), \tag{7}$$

where $\varphi$ and $\psi$ are real functions defined on $X$ and $Y$, respectively. One of the key observations for solving (DP) is based on the concept of the c-transform.

Definition 2 (c-transform)

Given a real function $\varphi: X \to \mathbb{R}$, the c-transform of $\varphi$ is defined by

$$\varphi^{c}(y) = \inf_{x \in X} \big( c(x, y) - \varphi(x) \big).$$

It can be shown [38] that by replacing $\psi$ with $\varphi^{c}$ in (7), the value of the energy to be maximized does not decrease. Therefore, we can search only over $\varphi$ to solve the Kantorovich dual problem:

$$\max_{\varphi} \int_X \varphi(x) \, d\mu(x) + \int_Y \varphi^{c}(y) \, d\nu(y). \tag{8}$$

Here $\varphi$ is called the Kantorovich potential.

Formula (8) can be rewritten in simpler forms if we narrow down the choice of cost functions. For example, if we choose the cost function $c(x, y) = \|x - y\|$, i.e., the $L^1$ distance, then the c-transform has the property $\varphi^{c} = -\varphi$, provided $\varphi$ is 1-Lipschitz [38], and (8) becomes

$$\max_{\|\varphi\|_{L} \le 1} \int_X \varphi(x) \, d\mu(x) - \int_Y \varphi(y) \, d\nu(y). \tag{9}$$

However, the Kantorovich potential $\varphi$ is usually parameterized by a Deep Neural Network (DNN), and restricting a DNN to be 1-Lipschitz is very difficult [2]. Our method instead adopts the $L^2$ distance as the cost function, because in this case, once the optimal Kantorovich potential is computed, the transport map can be written down in explicit form [20]. Since the Brenier potential determines the transport map (corresponding to the generator in GANs), we introduce Brenier's theorem [5] below. Suppose $u: X \to \mathbb{R}$ is a $C^2$ convex function; its gradient map is defined as $x \mapsto \nabla u(x)$.

Theorem 1 (Brenier[5])

Suppose $X$ and $Y$ are the Euclidean space $\mathbb{R}^d$, and the transport cost is the quadratic Euclidean distance $c(x, y) = \frac{1}{2}\|x - y\|^2$. If $\mu$ is absolutely continuous and $\mu$ and $\nu$ have finite second-order moments, then there exists a convex function $u: X \to \mathbb{R}$, whose gradient map $\nabla u$ gives the solution to Monge's problem; $u$ is called the Brenier potential. Furthermore, the optimal transport map is unique.

In GANs, the discriminator plays the role of the Kantorovich potential, and the generator plays the role of the transport map induced by the Brenier potential. The following theorem [20] establishes the relationship between the Brenier potential and the Kantorovich potential:

Theorem 2

Given $\mu$ and $\nu$ on a compact domain $\Omega \subset \mathbb{R}^d$, there exists an optimal transport plan $\gamma$ for the cost $c(x, y) = h(x - y)$ with $h$ strictly convex. It is unique and of the form $(\mathrm{id}, T)_\# \mu$, provided $\mu$ is absolutely continuous and $\partial \Omega$ is negligible. Moreover, there exists a Kantorovich potential $\varphi$, and $T$ can be represented as

$$T(x) = x - (\nabla h)^{-1}(\nabla \varphi(x)).$$

In particular, if we choose $c(x, y) = \frac{1}{2}\|x - y\|^2$, then

$$T(x) = x - \nabla \varphi(x). \tag{10}$$

The above theorem shows that when the discriminator is optimal, the generator can be directly computed from the discriminator. We will employ this important property in our generative model.
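To make Eq. (10) concrete, the following minimal sketch (our own illustration, not code from the paper) applies the map $T(x) = x - \nabla \varphi(x)$ with automatic differentiation; the quadratic potential `phi` is a hypothetical stand-in for a trained discriminator.

```python
# Sketch of Eq. (10): given a potential phi, the transport map is
# T(x) = x - grad_phi(x). Here phi is a toy quadratic potential; in AE-OT
# it would be the trained discriminator network.
import torch

def phi(x):
    # Hypothetical stand-in for a learned Kantorovich potential.
    return 0.25 * (x ** 2).sum(dim=1)

def transport(x):
    """Apply T(x) = x - grad_phi(x) via autograd."""
    x = x.clone().requires_grad_(True)
    grad = torch.autograd.grad(phi(x).sum(), x)[0]
    return (x - grad).detach()

z = torch.randn(4, 2)   # samples from the source distribution
y = transport(z)        # for this phi, T(x) = x - x/2, i.e., a contraction
```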

3 The Proposed Generative Model

Existing VAE-based methods try to transform the latent-space data distribution to a simple distribution. However, this changes the intrinsic structure of the data in the latent space and eventually leads to the posterior collapse problem [37]. In our method, we preserve the intrinsic data structure in the latent space and instead transform a simple distribution to the latent-space data distribution.

Different from existing methods, we first pre-train an AE; we then fix the AE and compute the OT map that transforms a simple distribution to the latent-space data distribution, so that the intrinsic data structure is preserved. We name our method AE-OT. AE-OT solves the optimal transport problem under the $L^2$ cost.

AE-OT has a strong geometric motivation. A data manifold can be arbitrarily complex, so learning to transform a simple distribution directly to a distribution on such a manifold can be very difficult. However, every neighborhood of a manifold admits a one-to-one mapping to a Euclidean space, which is typically of low dimension; one can get an intuition from Figure 1 of [20]. Thus, we propose to apply an AE to find this low-dimensional space and then perform the distribution transformation in the latent space, which is much easier than transforming distributions in the original input space.

3.1 Learning the Optimal Transport in Latent Space

In this part, we propose to transform the latent-space distribution by training only a discriminator. According to Brenier's theorem and Theorem 2, the transport map, i.e., the generator, can be explicitly expressed using the gradient of the optimal discriminator. Training a discriminator is a minimization problem, which is much easier than solving a Min-Max optimization problem. Since in real applications we are given empirical distributions, we describe the discrete case of optimal transport below.

3.1.1 Discrete Case of Optimal Transport

A generative model can be defined once we find the optimal transport map from a simple distribution to the distribution of real data. To carry out the computational tasks, we introduce the basic ideas when the probability measures $\mu$ and $\nu$ are defined on discrete sets.

Let $I$ and $J$ denote two disjoint index sets. Suppose $X = \{x_i\}_{i \in I}$ and $Y = \{y_j\}_{j \in J}$ are discrete subsets of $\mathbb{R}^d$, and the cost function is given by $c_{ij} = c(x_i, y_j)$, where the $c_{ij}$ are positive real numbers. Suppose the source measure is $\mu = \sum_{i \in I} \mu_i \delta_{x_i}$ and the target measure is $\nu = \sum_{j \in J} \nu_j \delta_{y_j}$. A transport plan is a real function $\gamma$ taking values $\gamma_{ij} \ge 0$ on $I \times J$ such that $\sum_{j \in J} \gamma_{ij} = \mu_i$ and $\sum_{i \in I} \gamma_{ij} = \nu_j$. We rewrite the total transport cost (5) as

$$\mathcal{C}(\gamma) = \sum_{i \in I} \sum_{j \in J} c_{ij} \gamma_{ij}. \tag{11}$$

The Monge-Kantorovich problem can then be rewritten as

$$\min_{\gamma_{ij} \ge 0} \sum_{i \in I} \sum_{j \in J} c_{ij} \gamma_{ij} \quad \text{s.t.} \quad \sum_{j \in J} \gamma_{ij} = \mu_i, \ \ \sum_{i \in I} \gamma_{ij} = \nu_j. \tag{12}$$

The Monge-Kantorovich dual problem, which in this case is simply the linear programming dual of (12), is

$$\max_{\varphi, \psi} \sum_{i \in I} \varphi_i \mu_i + \sum_{j \in J} \psi_j \nu_j \quad \text{s.t.} \quad \varphi_i + \psi_j \le c_{ij}. \tag{13}$$

Both (12) and (13) are linear programming problems, and thus can be solved with generic linear programming methods [14], such as the dual simplex method.
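As an illustration (not the paper's implementation), the sketch below poses the discrete primal problem (12) as a generic linear program and solves it with SciPy; the point sets and weights are made up for the example.

```python
# Sketch: discrete OT primal (12) as a generic LP. The plan gamma is flattened
# to a vector; the equality constraints encode the row/column marginals.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 5, 5
x, y = rng.normal(size=(n, 2)), rng.normal(size=(m, 2))
mu, nu = np.full(n, 1.0 / n), np.full(m, 1.0 / m)          # uniform weights
C = 0.5 * ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # cost c_ij

A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0    # sum_j gamma_ij = mu_i
for j in range(m):
    A_eq[n + j, j::m] = 1.0             # sum_i gamma_ij = nu_j
b_eq = np.concatenate([mu, nu])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
gamma = res.x.reshape(n, m)             # optimal transport plan
print("total transport cost:", res.fun)
```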

Next, we introduce how to learn the optimal transport map in the latent space.

3.1.2 Training Phase

Denote by $\{x_i\}_{i \in I}$ all the data in the given dataset, where $I$ is the index set of training samples. Denote by $E$ and $G$ the encoder and decoder pre-trained on all the given images. First, we use the encoder to obtain the latent codes $y_i = E(x_i)$ of the data. Then, we learn an OT map in the latent space from a simple distribution, a unit Gaussian for example, to the empirical distribution formed by $\{y_i\}_{i \in I}$. Since the OT map can be computed from the Kantorovich potential, we learn the Kantorovich potential in the two-step manner proposed in [23]. In the first step, we solve Eq. (13) on a mini-batch by solving the following linear programming problem:

$$\max_{\varphi, \psi} \frac{1}{n} \sum_{i=1}^{n} \varphi_i + \frac{1}{n} \sum_{j=1}^{n} \psi_j \quad \text{s.t.} \quad \varphi_i + \psi_j \le c_{ij}, \tag{14}$$

where the noise samples $z_i$, $i = 1, \dots, n$, are sampled from a simple distribution, the $y_j$, $j = 1, \dots, n$, are latent codes of a mini-batch of real data, and $n$ is the batch size. The LP variable is the concatenation (fusion) of $\{\varphi_i\}$ and $\{\psi_j\}$. The empirical weights are $\mu_i = 1/n$ and $\nu_j = 1/n$, and the cost is $c_{ij} = \frac{1}{2}\|z_i - y_j\|^2$.
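Below is a sketch of this first step under our own notational assumptions (uniform batch weights and quadratic cost): the mini-batch dual LP is solved with SciPy and returns the potential values that later serve as regression targets.

```python
# Sketch of the mini-batch dual LP (14): maximize (1/n) sum_i phi_i +
# (1/n) sum_j psi_j subject to phi_i + psi_j <= c_ij. linprog minimizes, so
# the objective is negated; phi_1 is pinned to 0 because the potentials are
# only defined up to an additive constant. Sizes are illustrative.
import numpy as np
from scipy.optimize import linprog

def batch_potentials(z, y):
    n = z.shape[0]
    c = 0.5 * ((z[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # quadratic cost
    obj = -np.full(2 * n, 1.0 / n)        # variables: [phi_1..phi_n, psi_1..psi_n]
    A_ub = np.zeros((n * n, 2 * n))
    for i in range(n):
        for j in range(n):
            A_ub[i * n + j, i] = 1.0      # coefficient of phi_i
            A_ub[i * n + j, n + j] = 1.0  # coefficient of psi_j
    bounds = [(0, 0)] + [(None, None)] * (2 * n - 1)
    res = linprog(obj, A_ub=A_ub, b_ub=c.ravel(), bounds=bounds, method="highs")
    return res.x[:n], res.x[n:]           # phi at noise samples, psi at codes

rng = np.random.default_rng(0)
z = rng.normal(size=(16, 8))              # noise batch
y = rng.normal(size=(16, 8))              # latent codes of a data batch
phi, psi = batch_potentials(z, y)
```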

In the second step, we employ a deep neural network $D_\theta$, parameterized by $\theta$, to regress the potential values $\{\varphi_i\}$ provided by (14):

$$\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \big( D_\theta(z_i) - \varphi_i \big)^2. \tag{15}$$

However, when the dimensionality of the latent space is high, the data become sparse in the high-dimensional space, and Eq. (10) does not necessarily hold when the Kantorovich potential is computed from such sparse data. Since Eq. (10) is the first-order optimality condition, each noise sample $z_i$ is mapped to a latent code $y_{\sigma(i)}$. Empirically, we can approximate the mapping from $z_i$ to $y_{\sigma(i)}$ by the following ordering function:

(16)

In the latent space, we compute the OT matching $\sigma$ from the random samples $\{z_i\}$ to the latent representations $\{y_j\}$ of the given data using the following ordering function:

(17)

Instead of optimizing Eq. (15), we optimize the following regularized regression problem:

$$\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \big( D_\theta(z_i) - \varphi_i \big)^2 + \lambda \, \frac{1}{n} \sum_{i=1}^{n} \big\| \nabla_{z} D_\theta(z_i) - \big( z_i - y_{\sigma(i)} \big) \big\|^2, \tag{18}$$

where $\lambda$ is a trade-off parameter. The second term regularizes the behavior of $D_\theta$'s gradient w.r.t. its input. The total loss ensures that $D_\theta$ approximates the Kantorovich potential well in both value and first-order derivative.
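The following sketch shows how such a value-plus-gradient regression loss could be written in PyTorch. It reflects our reading of Eq. (18), not released code: `phi_star` denotes the LP potential values from (14), `y_matched` the latent codes paired with the noise samples by the ordering function, and `lam` the trade-off parameter.

```python
# Sketch of a regularized regression loss in the spirit of Eq. (18): fit the
# discriminator values to the LP potentials, and fit its input gradient to the
# displacement z - y_matched implied by Eq. (10).
import torch

def regression_loss(D, z, phi_star, y_matched, lam=0.1):
    z = z.clone().requires_grad_(True)
    d = D(z).squeeze(-1)                               # D_theta(z_i)
    value_term = ((d - phi_star) ** 2).mean()
    grad = torch.autograd.grad(d.sum(), z, create_graph=True)[0]
    grad_term = ((grad - (z - y_matched)) ** 2).sum(dim=1).mean()
    return value_term + lam * grad_term
```

The `create_graph=True` flag keeps the gradient computation differentiable, so the regularization term can itself be backpropagated through the network parameters.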

3.1.3 Generating Phase

After we solve (18), $D_\theta$ is, intuitively, a smooth approximation of the Kantorovich potential. The OT map for a noise sample $z$ is given below using Eq. (10):

$$\tilde{y} = z - \nabla_{z} D_\theta(z). \tag{19}$$

After we obtain the mapped latent code $\tilde{y}$, we feed it into the pre-trained decoder $G$ to get a generated image $\tilde{x} = G(\tilde{y})$.

Figure 1 shows the workflow of AE-OT. The Encoder and Decoder are trained first and then fixed. In the training phase, the latent representation of a data sample serves as real data for the discriminator, while a noise vector sampled from a simple distribution serves as fake data; we train the discriminator using the two-step procedure described in this section. In the generating phase, we sample a noise vector, map it into the latent space with the OT map (19), and feed the mapped code into the Decoder to produce a generated sample. Algorithm 1 and Algorithm 2 present the training and generating phases of AE-OT, respectively.

0:  Number of iterations $N$, batch size $n$, Adam parameters $\alpha$, $\beta_1$, $\beta_2$, latent representations of real data $\{y_i\}_{i \in I}$, regularization parameter $\lambda$.
0:  The trained discriminator $D_\theta$.
1:  for $t = 1, \dots, N$ do
2:     Sample a mini-batch $\{y_j\}_{j=1}^{n}$ from the latent representations $\{y_i\}_{i \in I}$.
3:     Sample noise vectors $\{z_i\}_{i=1}^{n}$ from a simple distribution.
4:     Solve the linear programming problem in (14).
5:     Calculate the loss $L(\theta)$ via Eq. (18).
6:     Calculate the gradient $\nabla_\theta L(\theta)$.
7:     Update $\theta$ with Adam.
8:  end for
Algorithm 1 AE-OT Training Phase
0:  The trained discriminator $D_\theta$, the pre-trained decoder $G$, and a noise vector $z$ sampled from a simple distribution.
0:  The generated image $\tilde{x}$.
1:  Generate a new latent code: $\tilde{y} = z - \nabla_{z} D_\theta(z)$.
2:  Decode the generated code into an image: $\tilde{x} = G(\tilde{y})$.
Algorithm 2 AE-OT Generating Phase
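A minimal sketch of the generating phase in Algorithm 2 is given below, with toy stand-ins for the trained discriminator and the pre-trained decoder; shapes and layer sizes are illustrative only.

```python
# Sketch of Algorithm 2: map noise through Eq. (19), then decode.
import torch
import torch.nn as nn

def generate(D, decoder, num_samples, latent_dim):
    z = torch.randn(num_samples, latent_dim, requires_grad=True)   # noise
    grad = torch.autograd.grad(D(z).sum(), z)[0]                    # grad_z D(z)
    y_tilde = (z - grad).detach()                                   # Eq. (19)
    with torch.no_grad():
        return decoder(y_tilde)                                     # images

# Toy stand-ins for the trained discriminator and pre-trained decoder.
D = nn.Sequential(nn.Linear(10, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1))
decoder = nn.Sequential(nn.Linear(10, 28 * 28), nn.Sigmoid())
images = generate(D, decoder, num_samples=4, latent_dim=10)
print(images.shape)   # torch.Size([4, 784])
```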

4 Experiments

Figure 2: Results on the eight-Gaussian toy dataset. The green points are sampled from the target distribution, and the blue points are sampled from the source distribution. The red points are computed by the OT map after (a) 5 iterations and (b) 10000 iterations. The surface values of the discriminator are also plotted.

To demonstrate the effectiveness of the proposed method, we 1) evaluate AE-OT on an eight-Gaussian toy dataset; and 2) compare AE-OT against VAE [16] and WAE [36] for generative modeling on the MNIST [19] and CelebA [24] datasets. AE-OT has a strong geometric motivation and is therefore most suitable for data with strong manifold structure, such as handwritten digits and human faces. For WAE, we use the GAN penalty proposed in [36]. In the AE-OT implementation, the regularization parameter is set separately for the eight-Gaussian toy dataset and for the MNIST and CelebA datasets. We use Adam [15] for optimization with the same momentum parameters in all experiments, while the learning rate is set separately for the eight-Gaussian experiment and for the experiments on the MNIST and CelebA datasets.

Figure 3: Results on the MNIST dataset. Digits generated by (a) VAE, (b) WAE and (c) AE-OT. All images are randomly generated.
Figure 4: Generated faces on the CelebA dataset by (a) VAE, (b) WAE and (c) AE-OT. All faces are randomly generated.
Figure 5: Interpolation of the faces generated by AE-OT on the CelebA dataset.

Network Architecture: For the Auto-Encoder in AE-OT, we use a vanilla Auto-Encoder [12]. For the discriminator of AE-OT on the eight-Gaussian dataset, we use the network architecture of WGAN-GP [8]: a four-layer Multi-Layer Perceptron (MLP) with 512 nodes in each hidden layer and 1 node in the output layer, using ReLU as the non-linear activation function. For the discriminator of AE-OT on the MNIST and CelebA datasets, we use a six-layer MLP with 512 nodes in each hidden layer and 1 node in the output layer, using LeakyReLU with a slope of 0.2 as the non-linear activation function.
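For concreteness, a PyTorch sketch of the MNIST/CelebA discriminator described above is shown below; it follows the stated layer count, width, and LeakyReLU slope, while everything else (e.g., reading "six-layer" as six linear layers in total, with a linear scalar output) is our assumption.

```python
# Sketch of the six-layer MLP discriminator: 512 units per hidden layer,
# LeakyReLU (slope 0.2) activations, and a single scalar output
# (no activation on the output; an assumption).
import torch.nn as nn

def make_discriminator(latent_dim, hidden=512, num_linear=6, slope=0.2):
    layers, width = [], latent_dim
    for _ in range(num_linear - 1):                 # hidden layers
        layers += [nn.Linear(width, hidden), nn.LeakyReLU(slope)]
        width = hidden
    layers.append(nn.Linear(width, 1))              # scalar Kantorovich potential
    return nn.Sequential(*layers)

D_mnist = make_discriminator(latent_dim=10)         # MNIST latent size (Sec. 4.2)
D_celeba = make_discriminator(latent_dim=100)       # CelebA latent size (Sec. 4.3)
```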

4.1 Results on the Eight-Gaussian Toy Dataset

Dataset Description: Following previous work [8], we generate a toy dataset consisting of eight 2-D Gaussian distributions as the real data distribution. The eight Gaussians are centered at eight points evenly spaced on a circle around the origin, each with a small standard deviation. From each Gaussian we sample 32 data points, so the dataset consists of 256 2-D points representing the real data distribution. The synthetic data are sampled from a single Gaussian distribution centered at the origin, from which we sample 256 synthetic data points.
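For reproducibility, the following sketch generates a toy dataset matching this description; the circle radius and the two standard deviations are our assumptions, since the exact values are not recoverable from the text.

```python
# Sketch of the eight-Gaussian toy data: 8 Gaussians on a circle (real data)
# and one centered Gaussian (synthetic/source data), 256 points each.
import numpy as np

rng = np.random.default_rng(0)
radius, real_std, source_std = 2.0, 0.02, 0.5        # assumed values
angles = np.arange(8) * (2.0 * np.pi / 8.0)
centers = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)

real = np.concatenate([c + real_std * rng.normal(size=(32, 2)) for c in centers])
source = source_std * rng.normal(size=(256, 2))
print(real.shape, source.shape)                       # (256, 2) (256, 2)
```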

On this 2-D toy dataset, we do not use an AE for dimensionality reduction. We train a discriminator using the proposed OT approach and use it to map the synthetic data points to a set of new data points, and we evaluate whether the transformed points form a distribution similar to the real data distribution. The surface values of the discriminator after 5 and 10000 discriminator iterations are plotted in Figure 2 (a) and (b), respectively. The blue points in Figure 2 (a) and (b) are synthetic data points, the green points are the real data points, and the red points are computed from the synthetic data points using the proposed OT map. In Figure 2 (a), the generated empirical distribution (red points) is still very close to the synthetic distribution, since the discriminator has only been updated 5 times. After 10000 discriminator iterations, the red points form a distribution analogous to the real data distribution. This shows that, even though we do not use a generator to synthesize samples, the source distribution can be transformed to the target distribution with only the discriminator. Also, we do not use the regularization term in this experiment: computing the Kantorovich potential only on the synthetic data points already gives an accurate transport map from the synthetic data to the real data. From this experiment we can see that the proposed method can model multi-cluster distributions, which is considered a difficult task for generative models.

4.2 Results on the MNIST dataset



Method    MNIST
VAE       1.76 ± 0.11
WAE       1.64 ± 0.09
AE-OT     1.78 ± 0.13
Table 1: Inception scores on the MNIST dataset.

In this subsection, we compare our method against VAE and WAE on the MNIST dataset. The images are resized to a fixed resolution. For VAE, we use the vanilla Auto-Encoder architecture. The dimensionality of the latent space is set to 10 for all the methods. For VAE, WAE, and the AE in AE-OT, we train for 1000 epochs on the MNIST dataset. For the OT in AE-OT, we perform 200K iterations.

For all the methods, we randomly sample 64 noise vectors and feed them into the respective generative models. The images generated by the different methods are shown in Figure 3. From Figure 3 (a) we can see that the digits generated by VAE are darker than those generated by AE-OT, and some digits are unclear or incomplete. Figure 3 (b) shows digits generated by WAE. Many images are blurry, mainly because WAE trains three networks simultaneously and thus cannot learn a satisfactory reconstruction network. Digits produced by AE-OT are visually better than those by VAE and WAE. In addition, we list the Inception Scores (IS) [32] of the different methods in Table 1. The highest IS is achieved by AE-OT. This experiment shows that AE-OT outperforms VAE and WAE, mainly because AE-OT preserves the manifold structure of the data in the latent space.

4.3 Results on the CelebA dataset

We compare our method against VAE and WAE on the CelebA dataset. The images are cropped to 128×128. The dimensionality of the latent space is set to 100 for all the methods. We train VAE and the AE in AE-OT for 300 epochs on this dataset. For the OT in AE-OT, we perform 200K iterations. Our WAE experiment on this dataset crashed during training; the WAE results shown are those generated just before the crash.

We sample 64 random noise vectors in the latent space to generate images for all the methods. The generated faces are shown in Figure 4. From this figure we can see that many faces generated by VAE are distorted. VAE tends to mix the face and the background, because it forces the distribution of the latent representation of the face manifold to be a unit Gaussian, which distorts the intrinsic representation of the face manifold. Images generated by WAE, shown in Figure 4 (b), are very blurry and many faces are incomplete. WAE crashes because it jointly trains the encoder, the decoder, and the discriminator in the latent space; the competition among the three networks makes its training unstable. In contrast, faces produced by AE-OT (Figure 4 (c)) are visually much better than those produced by VAE and WAE: they are clear, complete, and recognizable, and the faces are well separated from the background, thanks to the fact that AE-OT preserves the structure of the latent representation of the face manifold.

In order to verify that the latent space learned by AE-OT is smooth, we randomly sample two noise vectors and map them into the latent space using the OT map. We then interpolate between the mapped vectors and feed the interpolated codes into the decoder to generate faces. Figure 5 shows the interpolation of the faces generated by AE-OT. The interpolated faces preserve a clear facial structure and exhibit a smooth transition in facial appearance. This indicates that the face manifold in the latent space is well preserved by AE-OT.

5 Conclusion

In this work, we propose a novel generative model named AE-OT. Instead of forcing the distribution of data in the latent space to a simple distribution, as VAE does, which leads to the posterior collapse problem, AE-OT transforms a simple distribution to the data distribution in the latent space. In this way, the manifold structure of the data in the latent space is preserved, and the posterior collapse problem is addressed. Moreover, in order to avoid the Min-Max optimization problem of GANs, we propose to compute an OT map from a well-trained discriminator and use it to generate data. AE-OT computes the optimal transport map directly in the latent space with an explicit theoretical interpretation. Results on the eight-Gaussian dataset show that the learned OT is capable of handling multi-cluster distributions. Qualitative and quantitative results on the MNIST dataset show that AE-OT generates better digits than VAE and WAE. Results on the CelebA dataset show that AE-OT generates much better faces than VAE and WAE and preserves the manifold structure in the latent space.

In future work, we will try to solve the optimal transport problem more accurately.

References