Mode Collapse and Regularity of Optimal Transportation Maps

02/08/2019 ∙ by Na Lei, et al. ∙ 2

This work builds the connection between the regularity theory of optimal transportation maps, the Monge-Ampère equation and GANs, which gives a theoretic understanding of the major drawbacks of GANs: convergence difficulty and mode collapse. According to the regularity theory of the Monge-Ampère equation, if the support of the target measure is disconnected or just non-convex, the optimal transportation mapping is discontinuous. General DNNs can only approximate continuous mappings. This intrinsic conflict leads to the convergence difficulty and mode collapse in GANs. Using an autoencoder combined with a discrete optimal transportation map (the AE-OT framework) on the CelebA data set, we test our hypothesis that the supports of real data distributions are in general non-convex, so that the discontinuity is unavoidable. The test result is positive. Furthermore, we propose to approximate the continuous Brenier potential directly, based on discrete Brenier theory, to tackle mode collapse. Compared with existing methods, this method is more accurate and effective.




1 Introduction

Generative Adversarial Networks (GANs, [13]) have emerged as one of the dominant approaches for unconditional image generation. GANs have shown a remarkable capability for generating realistic-looking and visually pleasing images. Typically, a GAN model consists of an unconditional generator that regresses real images from random noise and a discriminator that measures the difference between generated samples and real images. Despite these advantages, GANs have critical drawbacks: 1) training of GANs is tricky and sensitive to hyperparameters; 2) GANs suffer from mode collapse. Recently, Mescheder et al. ([23]) studied different GAN models and variants, showing that gradient-descent-based GAN optimization is not always convergent. The goal of this work is to improve the theoretic understanding of these difficulties and to propose methods to tackle them fundamentally.

Figure 1: Wasserstein GAN framework.

Optimal Transportation View of GANs Recent promising successes are making GANs more and more attractive (e.g. [27, 32, 36]). Among the various improvements of GANs, one breakthrough has been made by combining GANs with optimal transportation (OT) theory ([35]), as in the works on WGAN ([2]), WGAN-GP ([15]) and RWGAN ([16]). In the WGAN framework, the generator computes the optimal transportation map from the white noise to the data distribution, and the discriminator computes the Wasserstein distance between the real and the generated data distributions.

Fig. 1 illustrates the framework of WGAN. The image space is $\mathcal{X}$; the real data distribution $\nu$ is concentrated on a manifold $\Sigma$ embedded in $\mathcal{X}$. $\mathcal{Z}$ is the latent space, and $\zeta$ is the white noise (Gaussian distribution). The generator computes a transformation map $g_\theta$, which maps $\zeta$ to $\mu_\theta$; the discriminator computes the Wasserstein distance between $\mu_\theta$ and the real distribution $\nu$ by finding the Kantorovich potential $\varphi_\xi$ (refer to Eqn. 22).

In principle, the GAN model accomplishes two major tasks: 1) manifold learning, discovering the manifold structure of the data; 2) probability transformation, transforming a white noise to the data distribution. Accordingly, the generator map $g_\theta:(\mathcal{Z},\zeta)\to(\Sigma,\mu_\theta)$ can be further decomposed into two steps,

$$g_\theta:(\mathcal{Z},\zeta)\xrightarrow{\;T\;}(\mathcal{Z},\mu)\xrightarrow{\;g\;}(\Sigma,\mu_\theta), \tag{1}$$

where $T$ is a transportation map, which maps the white noise $\zeta$ to $\mu$ in the latent space $\mathcal{Z}$, and $g$ is the manifold parameterization, which maps local coordinates in the latent space to the manifold $\Sigma$. Namely, $g$ gives a local chart of the data manifold $\Sigma$, and $T$ realizes the probability measure transformation. The goal of the GAN model is to find $g_\theta$ such that the generated distribution $\mu_\theta$ fits the real data distribution $\nu$, namely

$$(g_\theta)_\#\zeta = \nu. \tag{2}$$

Regularity Analysis for mode collapse By the manifold structure assumption, the local chart representation $g:\mathcal{Z}\to\Sigma$ is continuous. Unfortunately, the continuity of the transportation map $T:\zeta\to\mu$ cannot be guaranteed. Even worse, according to the regularity theory of optimal transportation maps, except in very rare situations, the transportation map is discontinuous. In more detail, unless the support of $\mu$ is convex, there are non-empty singularity sets in the domain of $T$, where $T$ is discontinuous. By Eqn. 1 and 2, $\mu$ is determined by the real data distribution $\nu$ and the encoding map $g^{-1}$, so it is highly unlikely that the support of $\mu$ is convex.

On the other hand, deep neural networks (DNNs) can only model continuous mappings. For example, the commonly used ReLU DNNs can only represent continuous piecewise linear mappings. But the desired mapping itself is discontinuous. This intrinsic conflict explains the fundamental difficulties of GANs:

Current GANs search for a discontinuous mapping in the space of continuous mappings. The search either does not converge, or converges to one continuous branch of the target mapping, leading to mode collapse.

Solution to mode collapse We propose a solution to the mode collapse problem based on the Brenier theory of optimal transportation ([35]). According to Brenier's theorem (Theorem 3.4), under the quadratic distance cost function, the optimal transportation map is the gradient of a convex function, the so-called Brenier potential. Under mild conditions, the Brenier potential is always continuous and can be represented by DNNs. We propose to find the continuous Brenier potential instead of the discontinuous transportation map.

Contributions This work improves the theoretic understanding of the convergence difficulty and mode collapse of GANs from the perspective of optimal transportation; builds connections between the regularity theory of optimal transportation map, Monge-Ampère equation, and GANs; proposes solutions to conquer mode collapse based on Brenier theory.

This paper is organized as follows: in Section 3, we briefly introduce the theory of optimal transportation; in Section 4, we give a computational algorithm based on the discrete version of Brenier theory; in Section 5, we explain the mode collapse issue using Monge-Ampère regularity theory, and propose a novel method. Furthermore, we use the proposed method to test our hypothesis that general transportation maps in GANs are discontinuous. The test results are reported in Section 5 as well. Finally, we draw the conclusion in Section 6.

2 Previous Work

Generative adversarial networks. Generative adversarial networks (GANs) are a technique for training generative models to produce realistic examples from an unknown distribution ([13]). In particular, the GAN model consists of a generator network that maps latent vectors, typically drawn from a standard Gaussian distribution, into the real data distribution, and a discriminator network that aims to distinguish the generated data distribution from the real one. Training of GANs is unfortunately found to be tricky, and one major challenge is called mode collapse, which refers to a lack of diversity in the generated samples. This commonly happens when training on multimodal distributions. For example, on a dataset that consists of images of ten handwritten digits, generators might fail to generate some digits ([15]). Prior works have observed two types of mode collapse: failing to generate some modes entirely, or generating only a subset of a particular mode ([12, 34, 4, 10, 24, 28]). Several explanatory hypotheses for mode collapse have been proposed, concerning the choice of objective function ([2, 3]) and weak discriminators ([24, 29, 3, 21]).

Three main approaches to mitigating mode collapse are employing inference networks in addition to generators (e.g. [10, 11, 33]), discriminator augmentation (e.g. [8, 29, 19, 22]) and improving the optimization procedure during GAN training (e.g. [24]). However, these methods measure the difference between implicit distributions by a neural network (i.e. the discriminator), whose training relies on solving a non-convex optimization problem that might lead to non-convergence ([21, 23]). In contrast, the proposed method measures the distance between two distributions by the Wasserstein distance, which can be computed under a convex optimization framework. Furthermore, the optimal solution provides a transport map transforming the source distribution to the target distribution, and essentially serves as a generator in the feature space. The distribution generated by this generator is theoretically guaranteed to be identical to the real one, without mode collapse or artificial mode invention.

Optimal Transport. The optimal transport problem has attracted researchers' attention since Kantorovich's formulation in the 1940s, and there is a vast literature in fields such as computer vision and natural language processing. We refer the reader to [26] and [31] for detailed information.

Under discrete optimal transport, in which both the input and output measures are sums of Dirac masses, the problem can be modeled as a standard linear program (LP). To reduce the computational cost of the LP, [9] added an entropic regularizer to the original problem, which can then be solved quickly with the Sinkhorn algorithm. By introducing fast convolution, Solomon & Guibas ([30]) improved the computational efficiency of the above algorithm. The smaller the coefficient of the regularizer, the closer the solution of the regularized problem is to that of the original problem. However, when the coefficient is too small, the Sinkhorn algorithm cannot find a good solution. Thus, it can only approximate the original problem coarsely.
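The entropic-regularized scheme described above can be sketched in a few lines. This is a minimal illustration of the standard Sinkhorn iteration in pure Python, not the implementation used in [9] or [30]; the function name `sinkhorn`, the regularization coefficient `eps` and the toy point sets are assumptions made for this example.

```python
import math

def sinkhorn(a, b, cost, eps=0.1, iters=500):
    """Entropy-regularized OT plan between discrete measures a (size n) and b (size m)."""
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-c_ij / eps); a very small eps makes these entries underflow.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical toy problem: three source points, two target points, squared distance cost.
xs = [0.0, 1.0, 2.0]
ys = [0.5, 1.5]
a = [1 / 3, 1 / 3, 1 / 3]
b = [0.5, 0.5]
cost = [[(x - y) ** 2 for y in ys] for x in xs]
plan = sinkhorn(a, b, cost)
```

The returned `plan` is a regularized transport plan rather than a map: the middle source point splits its mass between the two targets.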

When computing the transport map between a continuous measure and a point-wise measure, i.e. semi-discrete optimal transport, [14] proposed to minimize a convex energy through the connection between the OT problem and convex geometry. Then, by the link between the c-transform and Legendre dual theory, the authors of [20] found the equivalence between the solution of the Kantorovich duality problem and that of the convex energy minimization.

If both the input and output are continuous densities, the OT problem can be reduced to solving the Monge-Ampère equation. [17, 18, 25] solved this PDE by computational fluid dynamics with an additional virtual time dimension. But this kind of method is both time consuming and hard to extend to high dimensions.

3 Optimal Mass Transport theory

In this section, we introduce basic concepts and theorems in classic optimal transport theory, focusing on Brenier’s approach, and their generalization to the discrete setting. Details can be found in [35].

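Optimal Transportation Problem Suppose $X$, $Y$ are two subsets of the $n$-dimensional Euclidean space $\mathbb{R}^n$, and $\mu$, $\nu$ are two probability measures defined on $X$ and $Y$ respectively, with density functions $\mu(x)=f(x)\,dx$, $\nu(y)=g(y)\,dy$. Suppose the total measures are equal, $\mu(X)=\nu(Y)$, namely $\int_X f(x)\,dx=\int_Y g(y)\,dy$. We only consider maps which preserve the measure.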

Definition 3.1 (Measure-Preserving Map)

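A map $T:X\to Y$ is measure preserving if for any measurable set $B\subset Y$, the set $T^{-1}(B)$ is $\mu$-measurable and $\mu(T^{-1}(B))=\nu(B)$, i.e. $\int_{T^{-1}(B)} f(x)\,dx=\int_B g(y)\,dy$. The measure-preserving condition is denoted as $T_\#\mu=\nu$, where $T_\#\mu$ is the push-forward measure induced by $T$.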

Given a cost function $c(x,y):X\times Y\to\mathbb{R}_{\ge 0}$, which indicates the cost of moving each unit mass from the source to the target, the total transport cost of the map $T:X\to Y$ is defined to be $\int_X c(x,T(x))\,d\mu(x)$.

Monge's problem of optimal transport is to find the measure-preserving map that minimizes the total transport cost.

Problem 3.2

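[Monge’s Optimal Mass Transport (MP) ([5])] Given a transport cost function $c:X\times Y\to\mathbb{R}_{\ge 0}$, find the measure preserving map $T:X\to Y$ that minimizes the total transport cost

$$(MP)\qquad \min_{T_\#\mu=\nu}\int_X c(x,T(x))\,d\mu(x).$$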

Definition 3.3 (Optimal Transportation Map)

The solution to Monge’s problem is called the optimal transportation map, whose total transportation cost is called the Wasserstein distance between $\mu$ and $\nu$, denoted as $W_c(\mu,\nu)$:

$$W_c(\mu,\nu)=\min_{T_\#\mu=\nu}\int_X c(x,T(x))\,d\mu(x).$$
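As a small illustration of Monge's problem, consider two equal-size samples on the real line with equal weights and the cost $|x-y|^p$, $p\ge 1$. In this special case the optimal map is known to be the monotone (sorted-to-sorted) matching, so the total transport cost can be computed by sorting. The sketch below is an assumption-laden toy, not part of the method of this paper; the names `transport_cost_1d` and `monotone_matching` are hypothetical.

```python
def transport_cost_1d(xs, ys, p=2):
    """Minimal total cost of moving equal-weight samples xs onto ys with cost |x - y|**p."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    n = len(xs)
    # In one dimension with a convex cost, matching sorted samples in order is optimal.
    return sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys))) / n

def monotone_matching(xs, ys):
    """The optimal assignment as a list of (source, target) pairs."""
    return list(zip(sorted(xs), sorted(ys)))

print(transport_cost_1d([0.0, 1.0], [2.0, 1.0]))  # each point moves by 1, so the cost is 1.0
```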


For the cost function being the $L^1$ norm, Kantorovich relaxed transportation maps to transportation plans, and proposed a linear programming method to solve this problem. We introduce the details of Kantorovich’s approach in Appendix A.

Brenier’s Approach For quadratic Euclidean distance cost, the existence, uniqueness and the intrinsic structure of the optimal transportation map were proven by Brenier ([6]).

Theorem 3.4 ([6])

Suppose $X$ and $Y$ are the Euclidean space $\mathbb{R}^n$ and the transportation cost is the quadratic Euclidean distance $c(x,y)=\frac{1}{2}|x-y|^2$. Furthermore, suppose $\mu$ is absolutely continuous and $\mu$ and $\nu$ have finite second order moments,

$$\int_X |x|^2\,d\mu(x)+\int_Y |y|^2\,d\nu(y)<\infty.$$

Then there exists a convex function $u:X\to\mathbb{R}$, the so-called Brenier potential, whose gradient map $\nabla u$ gives the solution to Monge’s problem,

$$(\nabla u)_\#\mu=\nu.$$

The Brenier potential is unique up to a constant, hence the optimal transportation map is unique.

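Assume the Brenier potential is $C^2$ smooth; then it is the solution to the following Monge-Ampère equation:

$$\det\left(\frac{\partial^2 u}{\partial x_i\,\partial x_j}\right)(x)=\frac{f(x)}{g\circ\nabla u(x)}.$$

In general cases, the Brenier potential $u$ satisfying the transport condition $(\nabla u)_\#\mu=\nu$ can be seen as a weak form of the Monge-Ampère equation, coupled with the boundary condition $\nabla u(X)=Y$. Hence $u$ is called a Brenier solution.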
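As a worked illustration, the Monge-Ampère equation can be solved in closed form in one dimension. The derivation below is a sketch under the assumptions that $n=1$, that the source density $f$ and target density $g$ are continuous, and that $g>0$ on an interval; $F$ and $G$ denote the cumulative distribution functions of $\mu$ and $\nu$ (symbols introduced here for the example only).

```latex
% One-dimensional Monge-Ampere equation: det D^2 u reduces to u''.
\begin{align*}
u''(x) &= \frac{f(x)}{g(u'(x))}
  && \text{(Monge-Amp\`ere equation for } n = 1\text{)} \\
g(T(x))\,T'(x) &= f(x)
  && \text{(write } T = u'\text{; } T' \ge 0 \text{ since } u \text{ is convex)} \\
\int_{-\infty}^{T(x)} g(y)\,dy &= \int_{-\infty}^{x} f(s)\,ds
  && \text{(integrate both sides)} \\
G(T(x)) &= F(x)
  \quad\Longrightarrow\quad T = G^{-1}\circ F .
\end{align*}
```

The optimal map is thus the monotone rearrangement. If the support of $\nu$ has a gap, $G$ is constant across that gap, so $G^{-1}$ and hence $T$ jump over it, while $u=\int T$ remains continuous.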

Definition 3.5 (Legendre Transform)

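Given a function $\varphi:\mathbb{R}^n\to\mathbb{R}$, its Legendre transform is defined as

$$\varphi^*(y):=\sup_{x}\big(\langle x,y\rangle-\varphi(x)\big).$$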
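The Legendre transform can be approximated numerically by taking the supremum over a finite grid. The following one-dimensional sketch is only an illustration; the function name `legendre_transform` and the grid are assumptions. It uses $\varphi(x)=x^2/2$, whose Legendre transform is known to be $\varphi^*(y)=y^2/2$.

```python
def legendre_transform(phi, xs, ys):
    """Approximate phi*(y) = sup_x (x*y - phi(x)) by a maximum over the grid xs."""
    return [max(x * y - phi(x) for x in xs) for y in ys]

grid = [i / 100.0 for i in range(-300, 301)]   # x in [-3, 3] with step 0.01
ys = [-1.0, 0.0, 0.5, 1.0]
dual = legendre_transform(lambda x: 0.5 * x * x, grid, ys)
expected = [0.5 * y * y for y in ys]           # phi*(y) = y^2 / 2 for phi(x) = x^2 / 2
```

The grid maximum is a lower bound of the true supremum, and it is exact here because the maximizer $x=y$ lies on the grid.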

Figure 2: PL Brenier potential (left) and its Legendre dual (right).

In practice, the Brenier solution is approximated by the so-called Alexandrov solution.

Definition 3.6 (Sub-gradient)

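Let $u:\mathbb{R}^n\to\mathbb{R}$ be a convex function. Its sub-gradient or sub-differential at a point $x$ is defined as

$$\partial u(x):=\{p\in\mathbb{R}^n : u(z)\ge\langle p,z-x\rangle+u(x),\ \forall z\in\mathbb{R}^n\}.$$

A convex function is locally Lipschitz, hence it is differentiable almost everywhere.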

Definition 3.7 (Alexandrov Solution)

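If a convex function $u:X\to\mathbb{R}$ satisfies

$$\mu\big((\partial u)^{-1}(B)\big)=\nu(B)\quad\text{for every Borel set } B\subset Y,$$

then we say $u$ is an Alexandrov solution to the Monge-Ampère equation.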

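Regularity of Optimal Transportation Maps Let $\Omega$ and $\Lambda$ be two bounded smooth open sets in $\mathbb{R}^d$, and let $\mu=f\,dx$ and $\nu=g\,dy$ be two probability measures on $\mathbb{R}^d$ such that $f|_{\mathbb{R}^d\setminus\Omega}=0$ and $g|_{\mathbb{R}^d\setminus\Lambda}=0$. Assume that $f$ and $g$ are bounded away from zero and infinity on $\Omega$ and $\Lambda$, respectively.

According to Caffarelli ([7]), if $\Lambda$ is convex, then the Alexandrov solution $u$ is strictly convex; furthermore,

  1. If $\lambda\le f,g\le 1/\lambda$ for some $\lambda>0$, then $u\in C^{1,\alpha}_{loc}(\Omega)$.

  2. If $f\in C^{k,\alpha}_{loc}(\Omega)$ and $g\in C^{k,\alpha}_{loc}(\Lambda)$, with $f,g>0$, then $u\in C^{k+2,\alpha}_{loc}(\Omega)$ ($k\ge 0$, $\alpha\in(0,1)$).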

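Here $C^{k,\alpha}$ denotes the space of functions whose $k$-th order derivatives are Hölder continuous with exponent $\alpha$. If $\Lambda$ is not convex, there exist smooth $f$ and $g$ such that $u\notin C^1(\Omega)$, and the optimal transportation map $\nabla u$ is discontinuous at singularities. The function $u$ is differentiable at $x$ if its subgradient $\partial u(x)$ is a singleton. We classify the points according to the dimensions of their subgradients, and define the sets

$$\Sigma_k(u):=\{x\in\mathbb{R}^d : \dim(\partial u(x))=k\},\quad k=0,1,\dots,d.$$

It is obvious that $\Sigma_0$ is the set of regular points, and $\Sigma_k$, $k>0$, are the sets of singular points. We also define the reachable subgradients at $x$ as

$$\nabla_* u(x):=\Big\{\lim_{k\to\infty}\nabla u(x_k) : x_k\in\Sigma_0,\ x_k\to x\Big\}.$$

It is well known that the subgradient equals the convex hull of the reachable subgradients,

$$\partial u(x)=\text{Convex Hull}\,(\nabla_* u(x)).$$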
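A one-dimensional example may help fix these definitions. The computation below is an illustration chosen for this purpose, using $u(x)=|x|$ on $\mathbb{R}$; it is not taken from the experiments.

```latex
% Subgradient and reachable gradients of u(x) = |x| on the real line.
\begin{align*}
x \neq 0:&\quad \partial u(x) = \{\operatorname{sign}(x)\},
  && \dim \partial u(x) = 0 \quad\text{(regular point)} \\
x = 0:&\quad \partial u(0) = \{p : |z| \ge p\,z \ \ \forall z\} = [-1, 1],
  && \dim \partial u(0) = 1 \quad\text{(singular point)} \\
&\quad \nabla_* u(0) = \{-1,\ +1\},
  && \text{Convex Hull}\,(\nabla_* u(0)) = [-1, 1] = \partial u(0).
\end{align*}
```

The gradient map $\nabla u=\operatorname{sign}(x)$ jumps from $-1$ to $+1$ at the singular point, while $u$ itself is continuous there.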

Theorem 3.8 (Regularity)

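Let $\Omega,\Lambda\subset\mathbb{R}^d$ be two bounded open sets, and let $f,g:\mathbb{R}^d\to\mathbb{R}^+$ be two probability densities that are zero outside $\Omega$, $\Lambda$ and are bounded away from zero and infinity on $\Omega$, $\Lambda$, respectively. Denote by $T=\nabla u:\Omega\to\Lambda$ the optimal transport map provided by Theorem 3.4. Then there exist two relatively closed sets $\Sigma_\Omega\subset\Omega$ and $\Sigma_\Lambda\subset\Lambda$ with $|\Sigma_\Omega|=|\Sigma_\Lambda|=0$ such that $T:\Omega\setminus\Sigma_\Omega\to\Lambda\setminus\Sigma_\Lambda$ is a homeomorphism of class $C^{0,\alpha}_{loc}$ for some $\alpha>0$.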

Figure 3: Singularity structure of an optimal transportation map.

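Fig. 3 illustrates the singularity set structure of an optimal transportation map $\nabla u:\Omega\to\Lambda$, computed using the algorithm based on Theorem 4.2. The singular set consists of curves, along which the subgradient is one-dimensional, and isolated points, at which the subgradient is two-dimensional. At one isolated singular point the subgradient is the entire inner hole of $\Lambda$; at the other it is the shaded triangle. For each point on a singular curve, the subgradient is a line segment outside $\Lambda$. One of the isolated singular points is a bifurcation point where singular curves meet. The Brenier potential is not differentiable on the singular set, and the optimal transportation map is discontinuous there.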

4 Discrete Brenier Theory

Brenier’s theorem can be directly generalized to the discrete situation. In GAN models, the source measure $\mu$ is given as a uniform (or Gaussian) distribution defined on a convex compact domain $\Omega$; the target measure $\nu$ is represented as the empirical measure, which is a sum of Dirac measures, $\nu=\sum_{i=1}^n\nu_i\,\delta(y-y_i)$, where $Y=\{y_1,y_2,\dots,y_n\}$ are the training samples and the weights satisfy $\sum_{i=1}^n\nu_i=\mu(\Omega)$. Each training sample $y_i$ corresponds to a supporting plane of the Brenier potential, denoted as

$$\pi_{h,i}(x):=\langle x,y_i\rangle+h_i,$$

where the height $h_i$ is a variable. We represent all the height variables as $h=(h_1,h_2,\dots,h_n)$.

An envelope of a family of hyper-planes in Euclidean space is a hyper-surface that is tangent to each member of the family at some point, and these points of tangency together form the whole envelope. As shown in Fig. 2, the Brenier potential $u_h:\Omega\to\mathbb{R}$ is a piecewise linear convex function determined by $h$, which is the upper envelope of all its supporting planes,

$$u_h(x)=\max_{i=1}^{n}\{\pi_{h,i}(x)\}=\max_{i=1}^{n}\{\langle x,y_i\rangle+h_i\}. \tag{11}$$

The graph of the Brenier potential is a convex polytope. Each supporting plane $\pi_{h,i}$ corresponds to a facet of the polytope. The projection of the polytope induces a cell decomposition of $\Omega$; each supporting plane $\pi_{h,i}$ projects onto a cell $W_i(h)$,

$$\Omega=\bigcup_{i=1}^{n}W_i(h)\cap\Omega,\qquad W_i(h):=\{p\in\mathbb{R}^d : \nabla u_h(p)=y_i\}. \tag{12}$$

The cell decomposition is a power diagram.

The $\mu$-measure of $W_i(h)\cap\Omega$ is denoted as $w_i(h):=\mu(W_i(h)\cap\Omega)$. The gradient map $\nabla u_h:\Omega\to Y$ maps each cell $W_i(h)$ to a single point $y_i$, $\nabla u_h:W_i(h)\mapsto y_i$, $i=1,2,\dots,n$.

Given the target measure $\nu$ in Eqn. 4, there exists a discrete Brenier potential in Eqn. 11, whose projected $\mu$-volume of each facet $w_i(h)$ equals the given target measure $\nu_i$. This was proved by Alexandrov in convex geometry.
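The discrete Brenier potential and its gradient map are simple to evaluate: the potential is the maximum over the supporting planes, and the gradient at a point is the sample whose plane attains that maximum. The sketch below is a minimal pure-Python illustration; the function names and the two-sample example are assumptions.

```python
def brenier_potential(x, ys, h):
    """Return (u_h(x), i) where u_h(x) = max_i (<x, y_i> + h_i) and i is the active plane."""
    values = [sum(xk * yk for xk, yk in zip(x, y)) + hi for y, hi in zip(ys, h)]
    i = max(range(len(ys)), key=values.__getitem__)
    return values[i], i

def gradient_map(x, ys, h):
    """Gradient of u_h at x: the sample y_i whose cell W_i(h) contains x."""
    _, i = brenier_potential(x, ys, h)
    return ys[i]

# Hypothetical example with two samples in the plane.
ys = [(-1.0, 0.0), (1.0, 0.0)]
print(gradient_map((0.2, 0.0), ys, [0.0, 0.0]))  # the cell boundary is x_0 = 0
print(gradient_map((0.2, 0.0), ys, [0.5, 0.0]))  # raising h_0 moves the boundary to x_0 = 0.25
```

Changing the heights $h$ moves the cell boundaries, which is how the cell measures $w_i(h)$ are adjusted to match the target weights $\nu_i$.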

Theorem 4.1 ([1])

Suppose $\Omega$ is a compact convex polytope with non-empty interior in $\mathbb{R}^n$, $n_1,\dots,n_k\subset\mathbb{R}^{n+1}$ are $k$ distinct unit vectors whose $(n+1)$-th coordinates are negative, and $\nu_1,\dots,\nu_k>0$ are such that $\sum_{i=1}^k\nu_i=\mathrm{vol}(\Omega)$. Then there exists a convex polytope $P\subset\mathbb{R}^{n+1}$ with exactly $k$ codimension-1 faces $F_1,\dots,F_k$ so that $n_i$ is the normal vector to $F_i$ and the intersection between $\Omega$ and the projection of $F_i$ has volume $\nu_i$. Furthermore, such a $P$ is unique up to vertical translation.

Alexandrov’s proof of the existence is based on algebraic topology, and is not constructive. Recently, Gu et al. gave a constructive proof based on the variational approach.

Theorem 4.2 ([14])

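Let $\mu$ be a probability measure defined on a compact convex domain $\Omega$ in $\mathbb{R}^d$, and let $Y=\{y_1,y_2,\dots,y_n\}$ be a set of distinct points in $\mathbb{R}^d$. Then for any $\nu_1,\nu_2,\dots,\nu_n>0$ with $\sum_{i=1}^n\nu_i=\mu(\Omega)$, there exists $h=(h_1,h_2,\dots,h_n)\in\mathbb{R}^n$, unique up to adding a constant $(c,c,\dots,c)$, so that $w_i(h)=\nu_i$ for all $i$. The vector $h$ is the unique minimum argument of the following convex energy

$$E(h)=\int_0^h\sum_{i=1}^n w_i(\eta)\,d\eta_i-\sum_{i=1}^n h_i\nu_i, \tag{13}$$

defined on an open convex set

$$H=\{h\in\mathbb{R}^n : w_i(h)>0,\ i=1,2,\dots,n\}.$$

Furthermore, $\nabla u_h$ minimizes the quadratic cost

$$\frac{1}{2}\int_\Omega |x-T(x)|^2\,d\mu(x)$$

among all transport maps $T_\#\mu=\nu$, where $\nu$ is the Dirac measure $\nu=\sum_{i=1}^n\nu_i\,\delta(y-y_i)$.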

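The gradient of the above convex energy in Eqn. 13 is given by:

$$\nabla E(h)=\big(w_1(h)-\nu_1,\ w_2(h)-\nu_2,\ \dots,\ w_n(h)-\nu_n\big)^T.$$

The Hessian of the energy is given by

$$\frac{\partial^2 E}{\partial h_i\,\partial h_j}=\frac{\partial w_i}{\partial h_j}=-\frac{\mu(W_i\cap W_j\cap\Omega)}{|y_j-y_i|}\quad (i\neq j),\qquad \frac{\partial^2 E}{\partial h_i^2}=\frac{\partial w_i}{\partial h_i}=\sum_{j\neq i}\frac{\mu(W_i\cap W_j\cap\Omega)}{|y_j-y_i|}. \tag{17}$$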
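Since the gradient of the energy is simply the difference between the current cell measures and the target weights, the heights can be found by any convex optimizer. The sketch below uses plain gradient descent with a Monte Carlo estimate of the cell measures. It is a toy illustration under stated assumptions (uniform source measure on the unit square, four hypothetical target points, fixed step size), not the GPU implementation used in the experiments; all names are hypothetical.

```python
import random

def cell_measures(samples, ys, h):
    """Monte Carlo estimate of w_i(h): fraction of samples whose active plane is i."""
    counts = [0] * len(ys)
    for x in samples:
        scores = [x[0] * y[0] + x[1] * y[1] + hi for y, hi in zip(ys, h)]
        counts[scores.index(max(scores))] += 1
    return [c / len(samples) for c in counts]

def solve_heights(samples, ys, nu, lr=0.1, iters=200):
    """Plain gradient descent on E(h), using grad E(h) = w(h) - nu."""
    h = [0.0] * len(ys)
    for _ in range(iters):
        w = cell_measures(samples, ys, h)
        h = [hi - lr * (wi - ni) for hi, wi, ni in zip(h, w, nu)]
        mean = sum(h) / len(h)   # h is unique only up to a constant, so fix the mean to 0
        h = [hi - mean for hi in h]
    return h

rng = random.Random(0)                                           # fixed seed
samples = [(rng.random(), rng.random()) for _ in range(1000)]    # mu: uniform on the unit square
ys = [(0.2, 0.2), (0.8, 0.2), (0.2, 0.8), (0.8, 0.8)]            # hypothetical target samples
nu = [0.25, 0.25, 0.25, 0.25]                                    # target weights
h = solve_heights(samples, ys, nu)
w = cell_measures(samples, ys, h)                                # should be close to nu
```

With all heights equal to zero, the plane of the sample $(0.8, 0.8)$ dominates everywhere on the square, so the initial cell measures are far from the target; the descent lowers that height until each cell carries about a quarter of the mass. A Newton step using the Hessian would typically converge in fewer iterations.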

Figure 4: The left frame shows the power diagram (blue) and its dual weighted Delaunay triangulation (black). The right frame shows the optimal transportation map from a disk to two clusters of points.

As shown in Fig. 2, the Hessian matrix has an explicit geometric interpretation. The left frame shows the discrete Brenier potential $u_h$, and the right frame shows its Legendre transformation $u_h^*$ as given by Definition 3.5. The Legendre transformation can be constructed geometrically: for each supporting plane $\pi_{h,i}$, we construct the dual point $\pi_{h,i}^*=(y_i,-h_i)$; the convex hull of the dual points $\{\pi_{h,1}^*,\pi_{h,2}^*,\dots,\pi_{h,n}^*\}$ is the graph of the Legendre transformation $u_h^*$. The projection of $u_h^*$ induces a triangulation of $Y=\{y_1,y_2,\dots,y_n\}$, which is the weighted Delaunay triangulation. As shown in the left frame of Fig. 4, the power diagram in Eqn. 12 and the weighted Delaunay triangulation are Poincaré dual to each other: if in the power diagram $W_i(h)$ and $W_j(h)$ intersect at a $(d-1)$-dimensional cell, then in the weighted Delaunay triangulation $y_i$ connects with $y_j$. The element of the Hessian matrix in Eqn. 17 is the ratio between the $\mu$-volume of the $(d-1)$-dimensional cell in the power diagram and the length of the dual edge in the weighted Delaunay triangulation.

Fig. 4 shows one computational example based on Theorem 4.2. Suppose the support of the target measure $\nu$ has two connected components, and restricted to each component, $\nu$ has a smooth density function. We sample $\nu$ and use a Dirac measure to approximate it. By increasing the sampling density, we can construct a sequence $\{\nu_k\}$ that converges weakly to $\nu$, $\nu_k\to\nu$; the Alexandrov solution for $\nu_k$ then also converges to the Alexandrov solution of the Monge-Ampère equation with $\nu$.

At each stage, the target $\nu_k$ is a Dirac measure with two clusters, and the source $\mu$ is the uniform distribution on the unit disk. Each cell on the disk is mapped to a point with the same color. The Brenier potential $u_k$ has a ridge in the middle. Let $k\to\infty$, $\nu_k\to\nu$; the ridge on $u_k$ will be preserved on the limit $u$, and its projection is the singularity set $\Sigma_1$ of the limit optimal transportation map $\nabla u$. Along $\Sigma_1$, $\nabla u$ is discontinuous, but the Brenier potential $u$ is always continuous.
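The persistence of the jump under refinement can be reproduced in a one-dimensional analogue of this experiment. The sketch below is an illustration under its own assumptions, not the experiment of Fig. 4: the source is the uniform distribution on $[0,1)$, the target consists of equal-weight samples in two intervals, and the optimal map for the quadratic cost is the monotone (quantile) map. All names are hypothetical.

```python
def two_cluster_targets(n):
    """n/2 equal-weight samples evenly spread in [-2, -1] and n/2 in [1, 2], sorted."""
    half = n // 2
    left = [-2.0 + (i + 0.5) / half for i in range(half)]
    right = [1.0 + (i + 0.5) / half for i in range(half)]
    return left + right

def quantile_map(x, targets_sorted):
    """Monotone optimal map from uniform[0, 1) to equal-weight sorted targets."""
    n = len(targets_sorted)
    return targets_sorted[min(int(x * n), n - 1)]

def largest_jump(n, grid=10000):
    """Largest difference between map values at consecutive grid points."""
    ts = two_cluster_targets(n)
    values = [quantile_map(k / grid, ts) for k in range(grid)]
    return max(b - a for a, b in zip(values, values[1:]))

for n in (10, 100, 1000):
    print(n, largest_jump(n))   # the largest jump stays above 2 as n grows
```

All other jumps shrink like $2/n$ as the sampling density increases, while the jump across the gap between the two clusters does not vanish: it is the discrete trace of the discontinuity of the limit map.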

5 Mode Collapse and Regularity

Although GANs are powerful for many applications, they have critical drawbacks: first, training of GANs is tricky, sensitive to hyper-parameters and difficult to converge; second, GANs suffer from mode collapse; third, GANs may generate unrealistic samples. This section focuses on explaining these difficulties using the regularity theorem (Theorem 3.8) of transportation maps.

Figure 5: Discontinuous Optimal transportation map, produced by a GPU implementation of algorithm based on theorem 4.2.

Intrinsic Conflict The difficulty of convergence, mode collapse, and generating unrealistic samples can be explained by the regularity theorem of the optimal transportation map.

Suppose the support $\Lambda$ of the target measure $\nu$ has multiple connected components, namely $\nu$ has multiple modes, or $\Lambda$ is non-convex; then the optimal transportation map $T:\Omega\to\Lambda$ is discontinuous and the singular set $\Sigma_\Omega$ is non-empty. Fig. 4 shows the multi-cluster case, where $\Lambda$ has multiple connected components and the optimal transportation map $T$ is discontinuous along $\Sigma_1$. Fig. 5 shows that the map can be discontinuous even when $\Lambda$ is connected but non-convex: $\Omega$ is a rectangle, $\Lambda$ is a dumbbell shape, the density functions are constants, and the optimal transportation map is discontinuous on a non-empty singularity set $\Sigma_1$. In general, due to the complexity of the real data distributions, the embedding manifold $\Sigma$ and the encoding/decoding maps, the supports of the target measure are rarely convex, therefore the transportation mapping cannot be continuous globally.

On the other hand, general deep neural networks, e.g. ReLU DNNs, can only approximate continuous mappings. The functional space represented by ReLU DNNs doesn’t contain the desired discontinuous transportation mapping. The training process, or equivalently the searching process, will lead to one of three situations:

  1. The training process is unstable, and doesn’t converge;

  2. The search converges to one of the multiple connected components of $\Lambda$, and the mapping converges to one continuous branch of the desired transformation mapping. This means we encounter a mode collapse;

  3. The training process leads to a transportation map which covers all the modes successfully, but also covers regions outside $\Lambda$. In practice, this induces the phenomenon of generating unrealistic samples, as shown in the middle frame of Fig. 6.

Therefore, in theory, it is impossible to approximate optimal transportation maps directly using DNNs.

Proposed Solution The fundamental reason for mode collapse is the conflict between the regularity of the transportation map and the continuous functional space of DNNs. In order to tackle this problem, we propose to compute the Brenier potential itself, instead of its gradient (the transportation map). This is based on the fact that the Brenier potential is always continuous under mild conditions, and representable by DNNs, but its gradient is rarely continuous and always outside the functional space of DNNs.

Figure 6: Comparison between PacGAN and our method to tackle mode collapsing.

Multi-mode Experiment

We use GPU implementation of the algorithm in Section 4 to compute the Brenier potential. As shown in Fig. 6, we compare our method with a recent GAN training method (PacGAN [22]) that aims to reduce mode collapse. Orange markers are real samples and green markers represent generated ones. Left frame shows a typical case of mode collapse where the generated samples cannot cover all modes. Middle frame shows the result of PacGAN. Although all modes are captured, the model also generates data that deviates from real samples. Right frame shows the result of our method that precisely captured all modes. It is obvious that our method accurately approximates the target measures and covers all the modes, whereas the method PacGAN generates many fake samples between the modes.

Figure 7: AE-OT framework.
(a) generated facial images (b) a path through a singularity.
Figure 8: Facial images generated by an AE-OT model.

Hypothesis Test on CelebA

In this experiment, we want to test our hypothesis: In most real applications, the support of the target measure is non-convex, therefore the singularity set is non-empty.

As shown in Fig. 7, we use an auto-encoder (AE) to compute the encoding/decoding maps from the CelebA data set $(\Sigma,\nu)$ to the latent space $\mathcal{Z}$; the encoding map $f_\theta:\Sigma\to\mathcal{Z}$ pushes $\nu$ forward to $(f_\theta)_\#\nu$ on the latent space. In the latent space, we compute the optimal transportation map (OT) $T:\mathcal{Z}\to\mathcal{Z}$ based on the algorithm described in Section 4; $T$ maps the uniform distribution $\zeta$ in a unit cube to $(f_\theta)_\#\nu$. Then we randomly draw a sample $z$ from the distribution $\zeta$, and use the decoding map $g_\xi$ to map $T(z)$ to a generated human facial image $g_\xi\circ T(z)$. The left frame in Fig. 8 demonstrates the realistic facial images generated by this AE-OT framework.

If the support of the push-forward measure $(f_\theta)_\#\nu$ in the latent space is non-convex, there will be a singularity set $\Sigma_k$, $k>0$. We would like to detect the existence of $\Sigma_k$. We randomly draw line segments in the unit cube in the latent space, then densely interpolate along each line segment to generate facial images. As shown in the right frame of Fig. 8, we find a line segment $\gamma$, and generate a morphing sequence between a boy with a pair of brown eyes and a girl with a pair of blue eyes. In the middle, we generate a face with one blue eye and one brown eye, which is definitely unrealistic and outside $\Sigma$. This means the line segment $\gamma$ goes through a singularity set $\Sigma_k$, where the transportation map $T$ is discontinuous. This also shows that our hypothesis is correct: the support of the encoded human facial image measure on the latent space is non-convex.
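The line-segment probe described above can be sketched as follows. This is a hypothetical illustration of the idea only: `find_jumps` flags parameters at which the map moves by more than a threshold between consecutive interpolation points, and `toy_map` is an invented discontinuous map standing in for the transportation map $T$; neither is the code used for Fig. 8.

```python
import math

def find_jumps(transport_map, z_start, z_end, steps=200, threshold=0.5):
    """Return the parameters t in [0, 1] where the map jumps by more than `threshold`."""
    jumps = []
    prev = None
    for k in range(steps + 1):
        t = k / steps
        z = [a + t * (b - a) for a, b in zip(z_start, z_end)]  # point on the segment
        cur = transport_map(z)
        if prev is not None and math.dist(prev, cur) > threshold:
            jumps.append(t)
        prev = cur
    return jumps

def toy_map(z):
    """Hypothetical discontinuous map with two branches separated at z[0] = 0.5."""
    shift = -1.0 if z[0] < 0.5 else 1.0
    return [shift + 0.2 * z[0], z[1]]

print(find_jumps(toy_map, [0.0, 0.0], [1.0, 1.0]))  # [0.5]
```

For a discrete target every cell boundary produces a small jump, so the threshold has to be chosen well above the typical spacing between neighbouring samples.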

As a by-product, we find this AE-OT framework improves the training speed by factor and increases the convergence stability, since the OT step is a convex optimization. This gives a promising way to improve existing GANs.

6 Conclusion

This work builds the connection between the regularity theory of optimal transportation maps, the Monge-Ampère equation and GANs, which gives a theoretic understanding of the major drawbacks of GANs: convergence difficulty and mode collapse.

According to the regularity theory of the Monge-Ampère equation, if the support of the target measure is disconnected or just non-convex, the optimal transportation mapping is discontinuous. General DNNs can only approximate continuous mappings; this intrinsic conflict leads to the convergence difficulty and mode collapse in GANs.

Using an autoencoder combined with a discrete optimal transportation map (the AE-OT framework) on the CelebA data set, we test our hypothesis that the supports of real data distributions are in general non-convex, so that the discontinuity is unavoidable. The test result is positive. Furthermore, we propose to approximate the continuous Brenier potential directly, based on discrete Brenier theory, to tackle the mode collapse problem. Compared with existing methods, this method provides a possible way to achieve higher accuracy and efficiency.


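Appendix A Kantorovich’s Approach

Depending on the cost function and the measures, the optimal transportation map between $(X,\mu)$ and $(Y,\nu)$ may not exist. Kantorovich relaxed transportation maps to transportation plans, and defined a joint probability measure $\rho(x,y):X\times Y\to\mathbb{R}_{\ge 0}$ such that the marginal probabilities of $\rho$ equal $\mu$ and $\nu$ respectively. Formally, let the projection maps be $\pi_x(x,y)=x$, $\pi_y(x,y)=y$; then define the mapping class

$$\Pi(\mu,\nu):=\{\rho:X\times Y\to\mathbb{R}_{\ge 0} : (\pi_x)_\#\rho=\mu,\ (\pi_y)_\#\rho=\nu\}.$$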

Problem A.1 (Kantorovich)

Given a transport cost function $c:X\times Y\to\mathbb{R}_{\ge 0}$, find the joint probability measure $\rho\in\Pi(\mu,\nu)$ that minimizes the total transport cost

$$(KP)\qquad W_c(\mu,\nu):=\min_{\rho\in\Pi(\mu,\nu)}\int_{X\times Y}c(x,y)\,d\rho(x,y). \tag{19}$$

Kantorovich’s problem can be solved using the linear programming method. Due to the duality of linear programming, the (KP) Eqn. 19 can be reformulated as the duality problem (DP) as follows:

Problem A.2 (Duality)

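Given a transport cost function $c:X\times Y\to\mathbb{R}_{\ge 0}$, find the functions $\varphi:X\to\mathbb{R}$ and $\psi:Y\to\mathbb{R}$ such that

$$(DP)\qquad \max_{\varphi,\psi}\left\{\int_X\varphi(x)\,d\mu+\int_Y\psi(y)\,d\nu\ :\ \varphi(x)+\psi(y)\le c(x,y)\right\}. \tag{20}$$

The maximum value of Eqn. 20 gives the Wasserstein distance. Most existing Wasserstein GAN models are based on the duality formulation under the $L^1$ cost function.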

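Definition A.3 (c-transformation)

The c-transformation of $\varphi:X\to\mathbb{R}$ is defined as $\varphi^c:Y\to\mathbb{R}$:

$$\varphi^c(y)=\inf_{x\in X}\big(c(x,y)-\varphi(x)\big).$$

Then the duality problem can be rewritten as

$$(DP)\qquad \max_{\varphi}\left\{\int_X\varphi(x)\,d\mu+\int_Y\varphi^c(y)\,d\nu\right\}, \tag{22}$$

where $\varphi$ is called the Kantorovich potential.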
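On finite point sets the c-transformation and the dual objective can be evaluated directly. The sketch below is a minimal illustration with hypothetical names and a toy example (two source points, two target points, cost $|x-y|$); it checks that the dual value never exceeds the optimal transport cost, and that a well-chosen potential attains it.

```python
def c_transform(phi, xs, ys, cost):
    """phi^c(y) = min_x (c(x, y) - phi(x)) for a potential phi given on the points xs."""
    return [min(cost(x, y) - p for x, p in zip(xs, phi)) for y in ys]

def dual_objective(phi, mu, nu, xs, ys, cost):
    """Value of sum_i phi(x_i) mu_i + sum_j phi^c(y_j) nu_j."""
    phi_c = c_transform(phi, xs, ys, cost)
    return sum(p * m for p, m in zip(phi, mu)) + sum(q * n for q, n in zip(phi_c, nu))

# Toy example: the optimal plan moves 0 -> 1 and 1 -> 2, so the optimal cost is 1.
xs, ys = [0.0, 1.0], [1.0, 2.0]
mu, nu = [0.5, 0.5], [0.5, 0.5]
cost = lambda x, y: abs(x - y)

print(dual_objective([0.0, 0.0], mu, nu, xs, ys, cost))   # 0.5, a lower bound
print(dual_objective([0.0, -1.0], mu, nu, xs, ys, cost))  # 1.0, attains the optimal cost
```

In a Wasserstein GAN the discriminator plays the role of the potential $\varphi$; here it is just a list of values on the source points.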