Generative adversarial networks (GANs) (Goodfellow et al., 2014a) have achieved great success at generating realistic and sharp-looking images. They are also very general methods, now starting to be applied to several other important problems, such as semi-supervised learning, stabilizing sequence learning methods for speech and language, and 3D modelling (Denton et al., 2015; Radford et al., 2015; Salimans et al., 2016; Lamb et al., 2016; Wu et al., 2016).
However, they still remain remarkably difficult to train, with most current papers dedicated to heuristically finding stable architectures (Radford et al., 2015; Salimans et al., 2016).
Despite their success, there is little to no theory explaining the unstable behaviour of GAN training. Furthermore, approaches to attacking this problem still rely on heuristics that are extremely sensitive to modifications. This makes it extremely hard to experiment with new variants, or to use them in new domains, which limits their applicability drastically. This paper aims to change that, by providing a solid understanding of these issues, and creating principled research directions towards addressing them.
It is interesting to note that the architecture of the generator used by GANs doesn't differ significantly from other approaches like variational autoencoders (Kingma & Welling, 2013). After all, at the core of it we first sample from a simple prior $z \sim p(z)$, and then output our final sample $g_\theta(z)$, sometimes adding noise in the end. In every case, $g_\theta$ is a neural network parameterized by $\theta$, and the main difference is how $g_\theta$ is trained.
Traditional approaches to generative modeling relied on maximizing likelihood, or equivalently minimizing the Kullback-Leibler (KL) divergence between our unknown data distribution $\mathbb{P}_r$ and our generator's distribution $\mathbb{P}_g$ (that depends of course on $\theta$). If we assume that both distributions are continuous with densities $P_r$ and $P_g$, then these methods try to minimize
$$KL(\mathbb{P}_r \| \mathbb{P}_g) = \int_{\mathcal{X}} P_r(x) \log \frac{P_r(x)}{P_g(x)} \, dx$$
This cost function has the good property that it has a unique minimum at $\mathbb{P}_g = \mathbb{P}_r$, and it doesn't require knowledge of the unknown density $P_r(x)$ to optimize it (only samples). However, it is interesting to see how this divergence is not symmetrical between $\mathbb{P}_r$ and $\mathbb{P}_g$:
If $P_r(x) > P_g(x)$, then $x$ is a point with higher probability of coming from the data than being a generated sample. This is the core of the phenomenon commonly described as 'mode dropping': when there are large regions with high values of $P_r$, but small or zero values of $P_g$. It is important to note that when $P_r(x) > 0$ but $P_g(x) \to 0$, the integrand inside the KL grows quickly to infinity, meaning that this cost function assigns an extremely high cost to a generator's distribution not covering parts of the data.
If $P_r(x) < P_g(x)$, then $x$ has low probability of being a data point, but high probability of being generated by our model. This is the case when we see our generator outputting an image that doesn't look real. In this case, when $P_r(x) \to 0$ and $P_g(x) > 0$, we see that the value inside the KL goes to 0, meaning that this cost function will pay an extremely low cost for generating fake-looking samples.
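The asymmetry described above can be illustrated on a toy discrete example. The following is a minimal sketch, not taken from the paper: the three-bin distributions and the names `p_r`, `p_g_dropped` and `p_g_fake` are hypothetical, chosen only so that one generator drops a mode and the other places mass where the data has none.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # the term 0 * log(0 / q) is taken to be 0
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total

# Hypothetical toy data: two equally likely modes plus one "off-data" bin.
p_r = [0.5, 0.5, 0.0]
# Generator A almost drops the second mode but stays on the data.
p_g_dropped = [0.999, 0.001, 0.0]
# Generator B covers both modes but also produces "fake looking" samples.
p_g_fake = [0.4, 0.4, 0.2]

kl_dropped = kl(p_r, p_g_dropped)   # maximum-likelihood direction, mode dropping
kl_fake = kl(p_r, p_g_fake)         # maximum-likelihood direction, fake samples
rev_dropped = kl(p_g_dropped, p_r)  # reversed direction, mode dropping
rev_fake = kl(p_g_fake, p_r)        # reversed direction, fake samples

print("KL(P_r || P_g): dropped mode =", kl_dropped, " fake samples =", kl_fake)
print("KL(P_g || P_r): dropped mode =", rev_dropped, " fake samples =", rev_fake)
```

In this toy setting the maximum-likelihood direction penalizes the dropped mode far more than the fake samples, while the reversed direction does the opposite.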
Clearly, if we minimized $KL(\mathbb{P}_g \| \mathbb{P}_r)$ instead, the weighting of these errors would be reversed, meaning that this cost function would pay a high cost for generating implausible-looking pictures. Generative adversarial networks have been shown to optimize (in their original formulation) the Jensen-Shannon divergence, a symmetric middle ground between these two cost functions
$$JSD(\mathbb{P}_r \| \mathbb{P}_g) = \frac{1}{2} KL(\mathbb{P}_r \| \mathbb{P}_A) + \frac{1}{2} KL(\mathbb{P}_g \| \mathbb{P}_A)$$
where $\mathbb{P}_A$ is the 'average' distribution, with density $\frac{P_r + P_g}{2}$. An impressive experimental analysis of the similarities, uses and differences of these divergences in practice can be seen in Theis et al. (2016). It is indeed conjectured that the reason for GANs' success at producing realistic-looking images is the switch away from traditional maximum likelihood approaches (Theis et al., 2016; Huszar, 2015). However, the problem is far from closed.
Generative adversarial networks are formulated in two steps. We first train a discriminator $D$ to maximize
$$L(D, g_\theta) = \mathbb{E}_{x \sim \mathbb{P}_r}[\log D(x)] + \mathbb{E}_{x \sim \mathbb{P}_g}[\log(1 - D(x))] \tag{1}$$
One can show easily that the optimal discriminator has the shape
$$D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)} \tag{2}$$
and that $L(D^*, g_\theta) = 2 JSD(\mathbb{P}_r \| \mathbb{P}_g) - 2 \log 2$, so minimizing equation (1) as a function of $\theta$ yields minimizing the Jensen-Shannon divergence when the discriminator is optimal. In theory, one would expect therefore that we would first train the discriminator as close as we can to optimality (so the cost function on $\theta$ better approximates the JSD), and then do gradient steps on $\theta$, alternating these two things. However, this doesn't work. In practice, as the discriminator gets better, the updates to the generator get consistently worse. The original GAN paper argued that this issue arose from saturation, and switched to another similar cost function that doesn't have this problem. However, even with this new cost function, updates tend to get worse and optimization gets massively unstable. Therefore, several questions arise:
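The identity relating the optimal discriminator to the Jensen-Shannon divergence can be checked numerically on a small discrete example. This is a minimal sketch under the assumption that densities are replaced by probability vectors; the values of `p_r` and `p_g` are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence via the 'average' distribution."""
    avg = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, avg) + 0.5 * kl(q, avg)

def gan_objective(d, p_r, p_g):
    """L(D) = E_{x~P_r}[log D(x)] + E_{x~P_g}[log(1 - D(x))]."""
    return sum(pr * math.log(dx) + pg * math.log(1 - dx)
               for dx, pr, pg in zip(d, p_r, p_g))

# Hypothetical toy distributions over three outcomes.
p_r = [0.1, 0.4, 0.5]
p_g = [0.3, 0.3, 0.4]

# Optimal discriminator: D*(x) = P_r(x) / (P_r(x) + P_g(x)).
d_star = [pr / (pr + pg) for pr, pg in zip(p_r, p_g)]

lhs = gan_objective(d_star, p_r, p_g)
rhs = 2 * jsd(p_r, p_g) - 2 * math.log(2)
print("L(D*) =", lhs, " 2*JSD - 2*log(2) =", rhs)
```

Any other discriminator, for example the constant `[0.5, 0.5, 0.5]`, attains a value of the objective no larger than `lhs`.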
Why do updates get worse as the discriminator gets better, both in the original and the new cost function?
Why is GAN training massively unstable?
Is the new cost function following a similar divergence to the JSD? If so, what are its properties?
Is there a way to avoid some of these issues?
The fundamental contributions of this paper are the answers to all these questions and, perhaps more importantly, the tools to analyze them properly. We provide a new direction designed to avoid the instability issues in GANs, and examine in depth the theory behind it. Finally, we state a series of open questions and problems that determine several new directions of research that begin with our methods.
2 Sources of instability
The theory tells us that the trained discriminator will have cost at most $2 \log 2 - 2 JSD(\mathbb{P}_r \| \mathbb{P}_g)$. However, in practice, if we just train $D$ till convergence, its error will go to 0, as observed in Figure 1, pointing to the fact that the JSD between them is maxed out. The only way this can happen is if the distributions are not continuous (by continuous we will actually refer to an absolutely continuous random variable, i.e. one that has a density, as is typically done; for further clarification see Appendix B), or they have disjoint supports.
One possible cause for the distributions not to be continuous is if their supports lie on low dimensional manifolds. There is strong empirical and theoretical evidence to believe that $\mathbb{P}_r$ is indeed extremely concentrated on a low dimensional manifold (Narayanan & Mitter, 2010). As for $\mathbb{P}_g$, we will prove soon that such is the case as well.
In the case of GANs, $\mathbb{P}_g$ is defined via sampling from a simple prior $z \sim p(z)$, and then applying a function $g: \mathcal{Z} \to \mathcal{X}$, so the support of $\mathbb{P}_g$ has to be contained in $g(\mathcal{Z})$. If the dimensionality of $\mathcal{Z}$ is less than the dimension of $\mathcal{X}$ (as is typically the case), then it's impossible for $\mathbb{P}_g$ to be continuous. This is because in most cases $g(\mathcal{Z})$ will be contained in a union of low dimensional manifolds, and therefore have measure 0 in $\mathcal{X}$. Note that while intuitive, this is highly nontrivial, since having an $n$-dimensional parameterization does absolutely not imply that the image will lie on an $n$-dimensional manifold. In fact, there are many easy counterexamples, such as Peano curves, lemniscates, and many more. In order to show this for our case, we rely heavily on $g$ being a neural network, since we are able to leverage that $g$ is made by composing very well behaved functions. We now state this properly in the following Lemma:
Lemma 1.
Let $g: \mathcal{Z} \to \mathcal{X}$ be a function composed of affine transformations and pointwise nonlinearities, which can either be rectifiers, leaky rectifiers, or smooth strictly increasing functions (such as the sigmoid, tanh, softplus, etc). Then, $g(\mathcal{Z})$ is contained in a countable union of manifolds of dimension at most $\dim \mathcal{Z}$. Therefore, if the dimension of $\mathcal{Z}$ is less than the one of $\mathcal{X}$, $g(\mathcal{Z})$ will be a set of measure 0 in $\mathcal{X}$.
Driven by this, this section shows that if the supports of $\mathbb{P}_r$ and $\mathbb{P}_g$ are disjoint or lie in low dimensional manifolds, there is always a perfect discriminator between them, and we explain exactly how and why this leads to an unreliable training of the generator.
2.1 The perfect discrimination theorems
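For simplicity, and to introduce the methods, we will first explain the case where $\mathbb{P}_r$ and $\mathbb{P}_g$ have disjoint supports. We say that a discriminator $D: \mathcal{X} \to [0, 1]$ has accuracy 1 if it takes the value 1 on a set that contains the support of $\mathbb{P}_r$ and value 0 on a set that contains the support of $\mathbb{P}_g$. Namely, $\mathbb{P}_r[D(x) = 1] = 1$ and $\mathbb{P}_g[D(x) = 0] = 1$.
Theorem 2.1.
If two distributions $\mathbb{P}_r$ and $\mathbb{P}_g$ have support contained on two disjoint compact subsets $\mathcal{M}$ and $\mathcal{P}$ respectively, then there is a smooth optimal discriminator $D^*: \mathcal{X} \to [0, 1]$ that has accuracy 1 and $\nabla_x D^*(x) = 0$ for all $x \in \mathcal{M} \cup \mathcal{P}$.
Proof.
The discriminator is trained to maximize
$$\mathbb{E}_{x \sim \mathbb{P}_r}[\log D(x)] + \mathbb{E}_{x \sim \mathbb{P}_g}[\log(1 - D(x))]$$
Since $\mathcal{M}$ and $\mathcal{P}$ are compact and disjoint, $0 < \delta = d(\mathcal{P}, \mathcal{M})$, the distance between both sets. We now define
$$\hat{\mathcal{M}} = \{x : d(x, \mathcal{M}) \le \delta/3\}, \qquad \hat{\mathcal{P}} = \{x : d(x, \mathcal{P}) \le \delta/3\}$$
By definition of $\delta$ we have that $\hat{\mathcal{P}}$ and $\hat{\mathcal{M}}$ are clearly disjoint compact sets. Therefore, by Urysohn's smooth lemma there exists a smooth function $D^*: \mathcal{X} \to [0, 1]$ such that $D^*|_{\hat{\mathcal{M}}} \equiv 1$ and $D^*|_{\hat{\mathcal{P}}} \equiv 0$. Since $\log D^*(x) = 0$ for all $x$ in the support of $\mathbb{P}_r$ and $\log(1 - D^*(x)) = 0$ for all $x$ in the support of $\mathbb{P}_g$, the discriminator is completely optimal and has accuracy 1. Furthermore, let $x$ be in $\mathcal{M} \cup \mathcal{P}$. If we assume that $x \in \mathcal{M}$, there is an open ball $B = B(x, \delta/3)$ on which $D^*|_B$ is constant. This shows that $\nabla_x D^*(x) \equiv 0$. Taking $x \in \mathcal{P}$ and working analogously we finish the proof. ∎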
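The construction in this proof can be made concrete in one dimension. The sketch below is an illustration under stated assumptions, not the paper's construction: the supports are taken to be the intervals [0, 1] and [2, 3], and the smooth function is built from the standard $e^{-1/t}$ bump rather than from Urysohn's lemma itself. The names `smooth_step`, `d_star` and `num_grad` are hypothetical.

```python
import math

def _f(t):
    """Smooth function that vanishes for t <= 0 (standard bump building block)."""
    return math.exp(-1.0 / t) if t > 0 else 0.0

def smooth_step(t):
    """C-infinity step: exactly 0 for t <= 0 and exactly 1 for t >= 1."""
    return _f(t) / (_f(t) + _f(1.0 - t))

def d_star(x, m_hi=1.0, p_lo=2.0):
    """Smooth discriminator equal to 1 near M = [0, m_hi] and 0 near P = [p_lo, 3].

    delta is the distance between the two supports; the transition happens
    strictly inside the gap, between m_hi + delta/3 and p_lo - delta/3.
    """
    delta = p_lo - m_hi
    lo = m_hi + delta / 3.0
    hi = p_lo - delta / 3.0
    return 1.0 - smooth_step((x - lo) / (hi - lo))

def num_grad(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
    print(x, d_star(x), num_grad(d_star, x))
```

On both supports the discriminator is exactly constant, so its gradient there is zero; only points in the gap (such as 1.5) see a non-zero gradient.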
In the next theorem, we take away the disjoint assumption, to make it general to the case of two different manifolds. However, if the two manifolds match perfectly on a big part of the space, no discriminator could separate them. Intuitively, the chances of two low dimensional manifolds having this property are rather slim: for two curves to match in space in a specific segment, they couldn't be perturbed in any arbitrarily small way and still satisfy this property. To do this, we will define the notion of two manifolds perfectly aligning, and show that this property never holds with probability 1 under any arbitrarily small perturbations.
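We first need to recall the definition of transversality. Let $\mathcal{M}$ and $\mathcal{P}$ be two boundary free regular submanifolds of $\mathcal{F}$, which in our cases will simply be $\mathcal{F} = \mathbb{R}^d$. Let $x \in \mathcal{M} \cap \mathcal{P}$ be an intersection point of the two manifolds. We say that $\mathcal{M}$ and $\mathcal{P}$ intersect transversally in $x$ if $T_x \mathcal{M} + T_x \mathcal{P} = T_x \mathcal{F}$, where $T_x \mathcal{M}$ means the tangent space of $\mathcal{M}$ around $x$.
We say that two manifolds without boundary $\mathcal{M}$ and $\mathcal{P}$ perfectly align if there is an $x \in \mathcal{M} \cap \mathcal{P}$ such that $\mathcal{M}$ and $\mathcal{P}$ don't intersect transversally in $x$.
We shall denote the boundary and interior of a manifold $\mathcal{M}$ by $\partial \mathcal{M}$ and $\mathrm{Int}\, \mathcal{M}$ respectively. We say that two manifolds $\mathcal{M}$ and $\mathcal{P}$ (with or without boundary) perfectly align if any of the boundary free manifold pairs $(\mathrm{Int}\, \mathcal{M}, \mathrm{Int}\, \mathcal{P})$, $(\mathrm{Int}\, \mathcal{M}, \partial \mathcal{P})$, $(\partial \mathcal{M}, \mathrm{Int}\, \mathcal{P})$ or $(\partial \mathcal{M}, \partial \mathcal{P})$ perfectly align.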
The interesting thing is that we can safely assume in practice that any two manifolds never perfectly align. This can be done since an arbitrarily small random perturbation on two manifolds will lead them to intersect transversally or not intersect at all. This is precisely stated and proven in Lemma 2.
As stated by Lemma 3, if two manifolds don't perfectly align, their intersection $\mathcal{L} = \mathcal{M} \cap \mathcal{P}$ will be a finite union of manifolds with dimensions strictly lower than both the dimension of $\mathcal{M}$ and the one of $\mathcal{P}$.
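Lemma 2.
Let $\mathcal{M}$ and $\mathcal{P}$ be two regular submanifolds of $\mathbb{R}^d$ that don't have full dimension. Let $\eta, \eta'$ be arbitrary independent continuous random variables. We therefore define the perturbed manifolds as $\tilde{\mathcal{M}} = \mathcal{M} + \eta$ and $\tilde{\mathcal{P}} = \mathcal{P} + \eta'$. Then
$$\mathbb{P}_{\eta, \eta'}\left(\tilde{\mathcal{M}} \text{ does not perfectly align with } \tilde{\mathcal{P}}\right) = 1$$
Lemma 3.
Let $\mathcal{M}$ and $\mathcal{P}$ be two regular submanifolds of $\mathbb{R}^d$ that don't perfectly align and don't have full dimension. Let $\mathcal{L} = \mathcal{M} \cap \mathcal{P}$. If $\mathcal{M}$ and $\mathcal{P}$ don't have boundary, then $\mathcal{L}$ is also a manifold, and has strictly lower dimension than both the one of $\mathcal{M}$ and the one of $\mathcal{P}$. If they have boundary, $\mathcal{L}$ is a union of at most 4 strictly lower dimensional manifolds. In both cases, $\mathcal{L}$ has measure 0 in both $\mathcal{M}$ and $\mathcal{P}$.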
We now state our perfect discrimination result for the case of two manifolds.
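Theorem 2.2.
Let $\mathbb{P}_r$ and $\mathbb{P}_g$ be two distributions that have support contained in two closed manifolds $\mathcal{M}$ and $\mathcal{P}$ that don't perfectly align and don't have full dimension. We further assume that $\mathbb{P}_r$ and $\mathbb{P}_g$ are continuous in their respective manifolds, meaning that if there is a set $A$ with measure 0 in $\mathcal{M}$, then $\mathbb{P}_r(A) = 0$ (and analogously for $\mathbb{P}_g$). Then, there exists an optimal discriminator $D^*: \mathcal{X} \to [0, 1]$ that has accuracy 1 and for almost any $x$ in $\mathcal{M}$ or $\mathcal{P}$, $D^*$ is smooth in a neighbourhood of $x$ and $\nabla_x D^*(x) = 0$.
Proof.
By Lemma 3 we know that $\mathcal{L} = \mathcal{M} \cap \mathcal{P}$ is strictly lower dimensional than both $\mathcal{M}$ and $\mathcal{P}$, and has measure 0 on both of them. By continuity, $\mathbb{P}_r(\mathcal{L}) = 0$ and $\mathbb{P}_g(\mathcal{L}) = 0$. Note that this implies the support of $\mathbb{P}_r$ is contained in $\mathcal{M} \setminus \mathcal{L}$ and the support of $\mathbb{P}_g$ is contained in $\mathcal{P} \setminus \mathcal{L}$.
Let $x \in \mathcal{M} \setminus \mathcal{L}$. Therefore, $x \in \mathcal{P}^c$ (the complement of $\mathcal{P}$), which is an open set, so there exists a ball of radius $\epsilon_x$ such that $B(x, \epsilon_x) \cap \mathcal{P} = \emptyset$. This way, we define
$$\hat{\mathcal{M}} = \bigcup_{x \in \mathcal{M} \setminus \mathcal{L}} B(x, \epsilon_x / 3)$$
We define $\hat{\mathcal{P}}$ analogously. Note that by construction these are both open sets in $\mathbb{R}^d$. Since $\mathcal{M} \setminus \mathcal{L} \subseteq \hat{\mathcal{M}}$ and $\mathcal{P} \setminus \mathcal{L} \subseteq \hat{\mathcal{P}}$, the supports of $\mathbb{P}_r$ and $\mathbb{P}_g$ are contained in $\hat{\mathcal{M}}$ and $\hat{\mathcal{P}}$ respectively. Also by construction, $\hat{\mathcal{M}} \cap \hat{\mathcal{P}} = \emptyset$.
Let us define $D^*(x) = 1$ for all $x \in \hat{\mathcal{M}}$, and 0 elsewhere (clearly including $\hat{\mathcal{P}}$). Since $\log D^*(x) = 0$ for all $x$ in the support of $\mathbb{P}_r$ and $\log(1 - D^*(x)) = 0$ for all $x$ in the support of $\mathbb{P}_g$, the discriminator is completely optimal and has accuracy 1. Furthermore, let $x \in \hat{\mathcal{M}}$. Since $\hat{\mathcal{M}}$ is an open set and $D^*$ is constant on $\hat{\mathcal{M}}$, then $\nabla_x D^*|_{\hat{\mathcal{M}}} \equiv 0$. Analogously, $\nabla_x D^*|_{\hat{\mathcal{P}}} \equiv 0$. Therefore, the set of points where $D^*$ is non-smooth or has non-zero gradient inside $\mathcal{M} \cup \mathcal{P}$ is contained in $\mathcal{L}$, which has null measure in both manifolds, therefore concluding the theorem. ∎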
These two theorems tell us that there are perfect discriminators which are smooth and constant almost everywhere in $\mathcal{M}$ and $\mathcal{P}$. The fact that the discriminator is constant in both manifolds points to the fact that we won't really be able to learn anything by backpropagating through it, as we shall see in the next subsection. To conclude this general statement, we state the following theorem on the divergences of $\mathbb{P}_r$ and $\mathbb{P}_g$, whose proof is trivial and left as an exercise to the reader.
Theorem 2.3.
Let $\mathbb{P}_r$ and $\mathbb{P}_g$ be two distributions whose support lies in two manifolds $\mathcal{M}$ and $\mathcal{P}$ that don't have full dimension and don't perfectly align. We further assume that $\mathbb{P}_r$ and $\mathbb{P}_g$ are continuous in their respective manifolds. Then,
$$JSD(\mathbb{P}_r \| \mathbb{P}_g) = \log 2, \qquad KL(\mathbb{P}_r \| \mathbb{P}_g) = +\infty, \qquad KL(\mathbb{P}_g \| \mathbb{P}_r) = +\infty$$
Note that these divergences will be maxed out even if the two manifolds lie arbitrarily close to each other. The samples of our generator might look impressively good, yet both KL divergences will be infinity. Therefore, Theorem 2.3 points us to the fact that attempting to use divergences out of the box to test similarities between the distributions we typically consider might be a terrible idea. Needless to say, if these divergences are always maxed out, attempting to minimize them by gradient descent isn't really possible. We would like to have a perhaps softer measure that incorporates a notion of distance between the points in the manifolds. We will come back to this topic later in section 3, where we explain an alternative metric and provide bounds on it that we are able to analyze and optimize.
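The observation that these divergences are maxed out regardless of how close the supports are can be seen in a discrete analogue. This is a minimal sketch, assuming point masses on a one-dimensional grid as a stand-in for distributions with disjoint supports; the helper `point_mass` and the grid size are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; infinite if p has mass where q has none."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def jsd(p, q):
    avg = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, avg) + 0.5 * kl(q, avg)

def point_mass(index, size):
    """Distribution on a grid of `size` cells with all mass in cell `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

size = 101
p_r = point_mass(0, size)

# Move the generator's point mass closer to or further from the data.
results = {}
for gap in (1, 10, 100):
    p_g = point_mass(gap, size)
    results[gap] = (jsd(p_r, p_g), kl(p_r, p_g), kl(p_g, p_r))
    print("gap", gap, "JSD", results[gap][0], "KLs", results[gap][1:])
```

The JSD equals $\log 2$ and both KL divergences are infinite for every gap, so none of them carries information about how far apart the supports are.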
2.2 The consequences, and the problems of each cost function
Theorems 2.1 and 2.2 showed one very important fact. If the two distributions we care about have supports that are disjoint or lie on low dimensional manifolds, the optimal discriminator will be perfect and its gradient will be zero almost everywhere.
2.2.1 The original cost function
We will now explore what happens when we pass gradients to the generator through a discriminator. One crucial difference with the typical analysis done so far is that we will develop the theory for an approximation to the optimal discriminator, instead of working with the (unknown) true discriminator. We will prove that as the approximation gets better, either we see vanishing gradients or the massively unstable behaviour we see in practice, depending on which cost function we use.
In what follows, we denote by $\|D\|$ the norm
$$\|D\| = \sup_{x \in \mathcal{X}} |D(x)| + \|\nabla_x D(x)\|_2$$
This norm is used to make the proofs simpler, but they could have been done in another Sobolev norm $\|\cdot\|_{1,p}$ for $p < \infty$ covered by the universal approximation theorem, in the sense that we can guarantee a neural network approximation in this norm (Hornik, 1991).
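Theorem 2.4 (Vanishing gradients on the generator).
Let $g_\theta: \mathcal{Z} \to \mathcal{X}$ be a differentiable function that induces a distribution $\mathbb{P}_g$. Let $\mathbb{P}_r$ be the real data distribution. Let $D$ be a differentiable discriminator. If the conditions of Theorems 2.1 or 2.2 are satisfied, $\|D - D^*\| < \epsilon$, and $\mathbb{E}_{z \sim p(z)}\left[\|J_\theta g_\theta(z)\|_2^2\right] \le M^2$ (since $M$ can depend on $\theta$, this condition is trivially verified for a uniform prior and a neural network; the case of a Gaussian prior requires more work because we need to bound the growth on $z$, but is also true for current architectures), then
$$\left\|\nabla_\theta \mathbb{E}_{z \sim p(z)}[\log(1 - D(g_\theta(z)))]\right\|_2 < M \frac{\epsilon}{1 - \epsilon}$$
Corollary 2.1.
Under the same assumptions of Theorem 2.4
$$\lim_{\|D - D^*\| \to 0} \nabla_\theta \mathbb{E}_{z \sim p(z)}[\log(1 - D(g_\theta(z)))] = 0$$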
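The bound of Theorem 2.4 can be motivated by the following chain of inequalities. This is a sketch of the argument rather than a full proof; it assumes, as in Theorems 2.1 and 2.2, that $D^*$ is locally 0 on the support of $\mathbb{P}_g$ (so $\nabla_x D^* = 0$ and $1 - D^* = 1$ there), and uses Jensen's inequality and the chain rule.

```latex
\begin{aligned}
\left\| \nabla_\theta \mathbb{E}_{z \sim p(z)}\left[\log(1 - D(g_\theta(z)))\right] \right\|_2^2
  &\le \mathbb{E}_{z \sim p(z)}\left[ \frac{\|\nabla_\theta D(g_\theta(z))\|_2^2}{|1 - D(g_\theta(z))|^2} \right] \\
  &\le \mathbb{E}_{z \sim p(z)}\left[ \frac{\|\nabla_x D(g_\theta(z))\|_2^2 \, \|J_\theta g_\theta(z)\|_2^2}{|1 - D(g_\theta(z))|^2} \right] \\
  &< \mathbb{E}_{z \sim p(z)}\left[ \frac{\left(\|\nabla_x D^*(g_\theta(z))\|_2 + \epsilon\right)^2 \|J_\theta g_\theta(z)\|_2^2}{\left(|1 - D^*(g_\theta(z))| - \epsilon\right)^2} \right] \\
  &= \mathbb{E}_{z \sim p(z)}\left[ \frac{\epsilon^2 \, \|J_\theta g_\theta(z)\|_2^2}{(1 - \epsilon)^2} \right]
  \le M^2 \frac{\epsilon^2}{(1 - \epsilon)^2}
\end{aligned}
```

Taking square roots gives the stated bound $M \frac{\epsilon}{1 - \epsilon}$, which goes to 0 as $\epsilon \to 0$.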
This shows that as our discriminator gets better, the gradient of the generator vanishes. For completeness, this was experimentally verified in Figure 2. The fact that this happens is terrible, since the generator's cost function is only close to the Jensen-Shannon divergence when this approximation is good. This points us to a fundamental dilemma: either our updates to the discriminator will be inaccurate, or they will vanish. This makes it difficult to train using this cost function, or leaves it up to the user to decide the precise amount of training dedicated to the discriminator, which can make GAN training extremely hard.
2.2.2 The alternative
To avoid gradients vanishing when the discriminator is very confident, people have chosen to use a different gradient step for the generator:
$$\Delta \theta = \nabla_\theta \mathbb{E}_{z \sim p(z)}[-\log D(g_\theta(z))]$$
We now state and prove for the first time which cost function is being optimized by this gradient step. Later, we prove that while this gradient doesn't necessarily suffer from vanishing gradients, it does cause massively unstable updates (that have been widely experienced in practice) in the presence of a noisy approximation to the optimal discriminator.
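Theorem 2.5.
Let $\mathbb{P}_r$ and $\mathbb{P}_{g_\theta}$ be two continuous distributions, with densities $P_r$ and $P_{g_\theta}$ respectively. Let $D^* = \frac{P_r}{P_{g_{\theta_0}} + P_r}$ be the optimal discriminator, fixed for a value $\theta_0$ (this is important since when backpropagating to the generator, the discriminator is assumed fixed). Therefore,
$$\mathbb{E}_{z \sim p(z)}\left[-\nabla_\theta \log D^*(g_\theta(z)) \big|_{\theta = \theta_0}\right] = \nabla_\theta \left[ KL(\mathbb{P}_{g_\theta} \| \mathbb{P}_r) - 2 JSD(\mathbb{P}_{g_\theta} \| \mathbb{P}_r) \right] \Big|_{\theta = \theta_0} \tag{3}$$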
Before diving into the proof, let's look at equation (3) for a second. This is the inverted KL minus two JSD. First of all, the JSDs are in the opposite sign, which means they are pushing for the distributions to be different, which seems like a fault in the update. Second, the KL appearing in the equation is $KL(\mathbb{P}_g \| \mathbb{P}_r)$, not the one equivalent to maximum likelihood. As we know, this KL assigns an extremely high cost to generating fake-looking samples, and an extremely low cost to mode dropping; and the JSD is symmetrical so it shouldn't alter this behaviour. This explains what we see in practice, that GANs (when stabilized) create good-looking samples, and justifies what is commonly conjectured, that GANs suffer from an extensive amount of mode dropping.
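The identity in equation (3) can be sanity-checked numerically in a discrete analogue. The sketch below is an assumption-laden stand-in, not the paper's setting: the generator's distribution is a softmax over three outcomes (playing the role of $\mathbb{P}_{g_\theta}$), the discriminator is the optimal one frozen at $\theta_0$, and gradients are approximated by finite differences. All names and values are hypothetical.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    avg = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, avg) + 0.5 * kl(q, avg)

def finite_diff_grad(f, theta, h=1e-6):
    """Central finite-difference gradient of f at theta."""
    grad = []
    for i in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[i] += h
        dn[i] -= h
        grad.append((f(up) - f(dn)) / (2 * h))
    return grad

# Hypothetical data distribution and generator parameters.
p_r = [0.2, 0.5, 0.3]
theta0 = [0.3, -0.1, 0.4]

# Optimal discriminator at theta0, held FIXED while differentiating.
p_g0 = softmax(theta0)
d_star = [pr / (pr + pg) for pr, pg in zip(p_r, p_g0)]

def alt_cost(theta):
    """E_{x ~ P_g_theta}[-log D*(x)] with D* frozen at theta0."""
    return sum(pg * -math.log(d) for pg, d in zip(softmax(theta), d_star))

def kl_minus_2jsd(theta):
    p_g = softmax(theta)
    return kl(p_g, p_r) - 2 * jsd(p_g, p_r)

grad_alt = finite_diff_grad(alt_cost, theta0)
grad_div = finite_diff_grad(kl_minus_2jsd, theta0)
print("gradient of alternative cost:", grad_alt)
print("gradient of KL - 2 JSD:      ", grad_div)
```

Up to finite-difference error, the two gradients coincide at $\theta_0$, which is the discrete counterpart of the statement above.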
We now turn to our result regarding the instability of a noisy version of the true discriminator.
Theorem 2.6 (Instability of generator gradient updates).
Let $g_\theta: \mathcal{Z} \to \mathcal{X}$ be a differentiable function that induces a distribution $\mathbb{P}_g$. Let $\mathbb{P}_r$ be the real data distribution, with either conditions of Theorems 2.1 or 2.2 satisfied. Let $D_\epsilon$ be a discriminator such that $D^* - D_\epsilon = \epsilon$ is a centered Gaussian process indexed by $x$ and independent for every $x$ (popularly known as white noise) and $\nabla_x D^* - \nabla_x D_\epsilon = r$ another independent centered Gaussian process indexed by $x$ and independent for every $x$. Then, each coordinate of
$$\mathbb{E}_{z \sim p(z)}\left[-\nabla_\theta \log D_\epsilon(g_\theta(z))\right]$$
is a centered Cauchy distribution with infinite expectation and variance. (Note that the theorem holds regardless of the variance of $r$ and $\epsilon$. As the approximation gets better, this error looks more and more like centered random noise due to the finite precision.)
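Proof.
Let us remember again that in this case $D^*$ is locally constant equal to 0 on the support of $\mathbb{P}_g$. We denote by $r(z), \epsilon(z)$ the random variables $r(g_\theta(z)), \epsilon(g_\theta(z))$. By the chain rule and the definition of $r, \epsilon$, we get
$$\mathbb{E}_{z \sim p(z)}\left[-\nabla_\theta \log D_\epsilon(g_\theta(z))\right] = \mathbb{E}_{z \sim p(z)}\left[-\frac{J_\theta g_\theta(z) \nabla_x D_\epsilon(g_\theta(z))}{D_\epsilon(g_\theta(z))}\right] = \mathbb{E}_{z \sim p(z)}\left[-\frac{J_\theta g_\theta(z) \, r(z)}{\epsilon(z)}\right]$$
Since $r(z)$ is a centered Gaussian distribution, multiplying by a matrix doesn't change this fact. Furthermore, when we divide by $\epsilon(z)$, a centered Gaussian independent from the numerator, we get a centered Cauchy random variable on every coordinate. Averaging over the different independent Cauchy random variables again yields a centered Cauchy distribution. (A note on technicality: when $\epsilon$ is defined as such, the remaining process is not measurable in $x$, so we can't take the expectation in $z$ trivially. This is commonly bypassed, and can be formally worked out by stating the expectation as the result of a stochastic differential equation.) ∎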
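The key step of this argument, that a ratio of independent centered Gaussians is Cauchy and that averaging Cauchy variables does not tame them, can be illustrated by simulation. This is a minimal sketch with hypothetical names, using standard normals for both the numerator and the denominator; it does not model a discriminator.

```python
import random
import statistics

def cauchy_ratio_samples(n, seed=0):
    """Draw n ratios r / eps of independent standard normals (each is standard Cauchy)."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        r, eps = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        if eps != 0.0:  # guard against an astronomically unlikely exact zero
            samples.append(r / eps)
    return samples

samples = cauchy_ratio_samples(20000)

# For a standard Cauchy variable, the median of |X| is 1.
median_abs = statistics.median([abs(s) for s in samples])

# Heavy tails: a noticeable fraction of draws exceeds 10 in absolute value,
# which would essentially never happen for a standard Gaussian.
tail_fraction = sum(abs(s) > 10.0 for s in samples) / len(samples)

print("median of |ratio|:", median_abs)
print("fraction with |ratio| > 10:", tail_fraction)

# Running means do not settle down the way they would with finite variance.
for n in (100, 1000, 10000, 20000):
    print("mean of first", n, "samples:", sum(samples[:n]) / n)
```

The running means printed at the end are themselves Cauchy distributed, so they need not shrink towards 0 as more samples are averaged.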
Note that even if we ignore the fact that the updates have infinite variance, we still arrive at the fact that the distribution of the updates is centered, meaning that if we bound the updates the expected update will be 0, providing no feedback to the gradient.
Since the assumption that the noises of $D$ and $\nabla D$ are decorrelated is admittedly too strong, we show in Figure 3 how the norm of the gradient grows drastically as we train the discriminator closer to optimality, at any stage in training of a well stabilized DCGAN except when it has already converged. In all cases, using these updates leads to a notable decrease in sample quality. The noise in the curves also shows that the variance of the gradients is increasing, which is known to lead to slower convergence and more unstable behaviour in the optimization (Bottou et al., 2016).
3 Towards softer metrics and distributions
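An important question now is how to fix the instability and vanishing gradients issues. Something we can do to break the assumptions of these theorems is add continuous noise to the inputs of the discriminator, therefore smoothing the distribution of the probability mass.
Theorem 3.1.
If $X$ has distribution $\mathbb{P}_X$ with support on $\mathcal{M}$ and $\epsilon$ is an absolutely continuous random variable with density $P_\epsilon$, then $\mathbb{P}_{X + \epsilon}$ is absolutely continuous with density
$$P_{X + \epsilon}(x) = \mathbb{E}_{y \sim \mathbb{P}_X}\left[P_\epsilon(x - y)\right] = \int_{\mathcal{M}} P_\epsilon(x - y) \, d\mathbb{P}_X(y)$$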
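The density formula of this theorem is easy to evaluate when $X$ is supported on finitely many points and the noise is Gaussian. The following is a minimal sketch under those assumptions; the atoms, weights and noise scale are hypothetical, and a finite set of atoms merely stands in for a low dimensional support.

```python
import math

def gaussian_pdf(x, sigma):
    """Density of a centered one-dimensional Gaussian with standard deviation sigma."""
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def noisy_density(x, atoms, weights, sigma):
    """P_{X+eps}(x) = E_{y ~ P_X}[P_eps(x - y)] for X supported on finitely many atoms."""
    return sum(w * gaussian_pdf(x - y, sigma) for y, w in zip(atoms, weights))

# Hypothetical distribution with no density: two point masses on the real line.
atoms = [-1.0, 2.0]
weights = [0.3, 0.7]
sigma = 0.5

# Riemann-sum check that the noisy distribution has a proper density.
grid_step = 0.01
mass = sum(noisy_density(-10.0 + i * grid_step, atoms, weights, sigma)
           for i in range(2001)) * grid_step
print("total mass of the noisy density:", mass)

# The density decays with distance to the support.
near = noisy_density(2.0, atoms, weights, sigma)
far = noisy_density(5.0, atoms, weights, sigma)
print("density at an atom:", near, " density far from the support:", far)
```

Changing `sigma` (or swapping `gaussian_pdf` for another noise density) changes how quickly the density decays away from the support.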
This theorem therefore tells us that the density $P_{X + \epsilon}(x)$ is inversely proportional to the average distance to points in the support of $\mathbb{P}_X$, weighted by the probability of these points. In the case of the support of $\mathbb{P}_X$ being a manifold, we will have the weighted average of the distance to the points along the manifold. How we choose the distribution of the noise $\epsilon$ will impact the notion of distance we are choosing. In our corollary, for example, we can see the effect of changing the covariance matrix by altering the norm inside the exponential. Different noises with different types of decays can therefore be used.
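Now, the optimal discriminator between $\mathbb{P}_{g + \epsilon}$ and $\mathbb{P}_{r + \epsilon}$ is
$$D^*(x) = \frac{P_{r + \epsilon}(x)}{P_{r + \epsilon}(x) + P_{g + \epsilon}(x)}$$
and we want to calculate what the gradient passed to the generator is.
Theorem 3.2.
Let $\mathbb{P}_r$ and $\mathbb{P}_g$ be two distributions with support on $\mathcal{M}$ and $\mathcal{P}$ respectively, with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. Then, the gradient passed to the generator has the form
$$\mathbb{E}_{z \sim p(z)}\left[\nabla_\theta \log(1 - D^*(g_\theta(z)))\right] = \mathbb{E}_{z \sim p(z)}\left[ a(z) \int_{\mathcal{M}} P_\epsilon(g_\theta(z) - y) \nabla_\theta \|g_\theta(z) - y\|^2 \, d\mathbb{P}_r(y) - b(z) \int_{\mathcal{P}} P_\epsilon(g_\theta(z) - y) \nabla_\theta \|g_\theta(z) - y\|^2 \, d\mathbb{P}_g(y) \right] \tag{4}$$
where $a(z)$ and $b(z)$ are positive functions. Furthermore, $b > a$ if and only if $P_{r + \epsilon} > P_{g + \epsilon}$, and $b < a$ if and only if $P_{r + \epsilon} < P_{g + \epsilon}$.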
This theorem proves that we will drive our samples $g_\theta(z)$ towards points along the data manifold, weighted by their probability and the distance from our samples. Furthermore, the second term drives our points away from high probability samples, again, weighted by the sample manifold and distance to these samples. This is similar in spirit to contrastive divergence, where we lower the free energy of our samples and increase the free energy of data points. The importance of this term is seen more clearly when we have samples that have higher probability of coming from $\mathbb{P}_g$ than from $\mathbb{P}_r$. In this case, the relation between $a$ and $b$ stated in the theorem applies, and the second term will have the strength to lower the probability of these too-likely samples. Finally, if there's an area around $x$ that has the same probability of coming from $\mathbb{P}_g$ as from $\mathbb{P}_r$, the gradient contributions of the two terms will cancel, therefore stabilizing the gradient when $\mathbb{P}_r$ is similar to $\mathbb{P}_g$.
There is one important problem with taking gradient steps exactly of the form (4), which is that in that case, $D$ will disregard errors that lie exactly in $g(\mathcal{Z})$, since this is a set of measure 0. However, $g$ will be optimizing its cost only on that space. This will make the discriminator extremely susceptible to adversarial examples, and will render low cost on the generator without high cost on the discriminator, and lousy meaningless samples. This is easily seen when we realize the term inside the expectation of equation (4) will be a positive scalar times $\nabla_x \log(1 - D^*(x)) \nabla_\theta g_\theta(z)$, which is the directional derivative towards the exact adversarial term of Goodfellow et al. (2014b). Because of this, it is important to backpropagate through noisy samples in the generator as well. This will yield a crucial benefit: the generator's backpropagation term will be through samples on a set of positive measure that the discriminator will care about. Formalizing this notion, the actual gradient through the generator will now be proportional to $\nabla_\theta JSD(\mathbb{P}_{r + \epsilon} \| \mathbb{P}_{g + \epsilon})$, which will make the two noisy distributions match. As we anneal the noise, this will make $\mathbb{P}_r$ and $\mathbb{P}_g$ match as well. For completeness, we show the smooth gradient we get in this case. The proof is identical to the one of Theorem 3.2, so we leave it to the reader.
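Corollary 3.2.
Let $\epsilon, \epsilon' \sim \mathcal{N}(0, \sigma^2 I)$ and $\tilde{g}_\theta(z) = g_\theta(z) + \epsilon'$, then
$$\mathbb{E}_{z \sim p(z), \epsilon'}\left[\nabla_\theta \log(1 - D^*(\tilde{g}_\theta(z)))\right] = \mathbb{E}_{z \sim p(z), \epsilon'}\left[ a(z) \int_{\mathcal{M}} P_\epsilon(\tilde{g}_\theta(z) - y) \nabla_\theta \|\tilde{g}_\theta(z) - y\|^2 \, d\mathbb{P}_r(y) - b(z) \int_{\mathcal{P}} P_\epsilon(\tilde{g}_\theta(z) - y) \nabla_\theta \|\tilde{g}_\theta(z) - y\|^2 \, d\mathbb{P}_g(y) \right] = 2 \nabla_\theta JSD(\mathbb{P}_{r + \epsilon} \| \mathbb{P}_{g + \epsilon})$$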
In the same way as with Theorem 3.2, $a$ and $b$ will have the same properties. The main difference is that we will be moving all our noisy samples towards the data manifold, which can be thought of as moving a small neighbourhood of samples towards it. This will protect the discriminator against measure 0 adversarial examples.
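Proof of theorem 3.2.
Since the discriminator is assumed fixed when backpropagating to the generator, the only thing that depends on $\theta$ is $g_\theta(z)$ for every $z$. By taking derivatives on our cost function
$$\begin{aligned}
\mathbb{E}_{z \sim p(z)}\left[\nabla_\theta \log(1 - D^*(g_\theta(z)))\right]
&= \mathbb{E}_{z \sim p(z)}\left[\nabla_\theta \log \frac{P_{g + \epsilon}(g_\theta(z))}{P_{r + \epsilon}(g_\theta(z)) + P_{g + \epsilon}(g_\theta(z))}\right] \\
&= \mathbb{E}_{z \sim p(z)}\left[\frac{\nabla_\theta P_{g + \epsilon}(g_\theta(z))}{P_{g + \epsilon}(g_\theta(z))} - \frac{\nabla_\theta P_{g + \epsilon}(g_\theta(z)) + \nabla_\theta P_{r + \epsilon}(g_\theta(z))}{P_{g + \epsilon}(g_\theta(z)) + P_{r + \epsilon}(g_\theta(z))}\right] \\
&= \mathbb{E}_{z \sim p(z)}\left[\frac{\nabla_\theta\left[-P_{r + \epsilon}(g_\theta(z))\right]}{P_{g + \epsilon}(g_\theta(z)) + P_{r + \epsilon}(g_\theta(z))} - \frac{P_{r + \epsilon}(g_\theta(z))}{P_{g + \epsilon}(g_\theta(z))} \cdot \frac{\nabla_\theta\left[-P_{g + \epsilon}(g_\theta(z))\right]}{P_{g + \epsilon}(g_\theta(z)) + P_{r + \epsilon}(g_\theta(z))}\right]
\end{aligned}$$
Let the density of $\epsilon$ be $\frac{1}{Z} e^{-\frac{\|x\|^2}{2 \sigma^2}}$. We now define
$$a(z) = \frac{1}{2 \sigma^2} \cdot \frac{1}{P_{g + \epsilon}(g_\theta(z)) + P_{r + \epsilon}(g_\theta(z))}, \qquad b(z) = \frac{1}{2 \sigma^2} \cdot \frac{1}{P_{g + \epsilon}(g_\theta(z)) + P_{r + \epsilon}(g_\theta(z))} \cdot \frac{P_{r + \epsilon}(g_\theta(z))}{P_{g + \epsilon}(g_\theta(z))}$$
Trivially, $a$ and $b$ are positive functions. Since $b = a \frac{P_{r + \epsilon}}{P_{g + \epsilon}}$, we know that $b > a$ if and only if $P_{r + \epsilon} > P_{g + \epsilon}$, and $b < a$ if and only if $P_{r + \epsilon} < P_{g + \epsilon}$ as we wanted. Continuing the proof, we know
$$\begin{aligned}
\mathbb{E}_{z \sim p(z)}\left[\nabla_\theta \log(1 - D^*(g_\theta(z)))\right]
&= \mathbb{E}_{z \sim p(z)}\left[2 \sigma^2 a(z) \nabla_\theta\left[-P_{r + \epsilon}(g_\theta(z))\right] - 2 \sigma^2 b(z) \nabla_\theta\left[-P_{g + \epsilon}(g_\theta(z))\right]\right] \\
&= \mathbb{E}_{z \sim p(z)}\left[2 \sigma^2 a(z) \int_{\mathcal{M}} -\nabla_\theta \frac{1}{Z} e^{-\frac{\|g_\theta(z) - y\|^2}{2 \sigma^2}} \, d\mathbb{P}_r(y) - 2 \sigma^2 b(z) \int_{\mathcal{P}} -\nabla_\theta \frac{1}{Z} e^{-\frac{\|g_\theta(z) - y\|^2}{2 \sigma^2}} \, d\mathbb{P}_g(y)\right] \\
&= \mathbb{E}_{z \sim p(z)}\left[a(z) \int_{\mathcal{M}} P_\epsilon(g_\theta(z) - y) \nabla_\theta \|g_\theta(z) - y\|^2 \, d\mathbb{P}_r(y) - b(z) \int_{\mathcal{P}} P_\epsilon(g_\theta(z) - y) \nabla_\theta \|g_\theta(z) - y\|^2 \, d\mathbb{P}_g(y)\right]
\end{aligned}$$
Finishing the proof. ∎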
An interesting observation is that if we have two distributions $\mathbb{P}_r$ and $\mathbb{P}_g$ with support on manifolds that are close, the noise terms will make the noisy distributions $\mathbb{P}_{r + \epsilon}$ and $\mathbb{P}_{g + \epsilon}$ almost overlap, and the JSD between them will be small. This is in drastic contrast to the noiseless variants $\mathbb{P}_r$ and $\mathbb{P}_g$, where all the divergences are maxed out, regardless of the closeness of the manifolds. We could argue for using the JSD of the noisy variants to measure a similarity between the original distributions, but this would depend on the amount of noise, and is not an intrinsic measure of $\mathbb{P}_r$ and $\mathbb{P}_g$. Luckily, there are alternatives.
We recall the definition of the Wasserstein metric $W(P, Q)$ for $P$ and $Q$, two distributions over $\mathcal{X}$. Namely,
$$W(P, Q) = \inf_{\gamma \in \Gamma} \int_{\mathcal{X} \times \mathcal{X}} \|x - y\|_2 \, d\gamma(x, y)$$
where $\Gamma$ is the set of all possible joints on $\mathcal{X} \times \mathcal{X}$ that have marginals $P$ and $Q$.
The Wasserstein distance also goes by other names, most commonly the transportation metric and the earth mover's distance. This last name is the most descriptive: it's the minimum cost of transporting the whole probability mass of $P$ from its support to match the probability mass of $Q$ on $Q$'s support. This identification of transporting points from $P$ to $Q$ is done via the coupling $\gamma$. We refer the reader to Villani (2009) for an in-depth explanation of these ideas. It is easy to see now that the Wasserstein metric incorporates the notion of distance (as also seen inside the integral) between the elements in the support of $P$ and the ones in the support of $Q$, and that as the supports of $P$ and $Q$ get closer and closer, the metric will go to 0, inducing as well a notion of distance between manifolds.
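The way the Wasserstein metric tracks the distance between supports can be seen on one-dimensional empirical distributions, where the optimal coupling for equally weighted samples simply matches sorted points. This is a minimal sketch under that assumption; the function name `wasserstein_1d` and the sample values are hypothetical.

```python
def wasserstein_1d(xs, ys):
    """W between two 1-D empirical distributions with equally many, equally weighted points.

    In one dimension the optimal coupling pairs the i-th smallest point of xs
    with the i-th smallest point of ys.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes the same number of points on both sides")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Hypothetical "data" points and a generator whose support is a shifted copy.
data = [0.0, 1.0, 2.0]

distances = {}
for gap in (1.0, 0.1, 0.001):
    generated = [x + gap for x in data]
    distances[gap] = wasserstein_1d(data, generated)
    print("gap", gap, "Wasserstein distance", distances[gap])
```

The supports stay disjoint for every positive gap, yet the distance shrinks smoothly with the gap instead of staying at a maximal value.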
Intuitively, as we decrease the noise, $\mathbb{P}_X$ and $\mathbb{P}_{X + \epsilon}$ become more similar. However, it is easy to see again that $JSD(\mathbb{P}_X \| \mathbb{P}_{X + \epsilon})$ is maxed out, regardless of the amount of noise. The following Lemma shows that this is not the case for the Wasserstein metric, and that it goes to 0 smoothly when we decrease the variance of the noise.
Lemma 4.
If $\epsilon$ is a random vector with mean 0, then we have
$$W(\mathbb{P}_X, \mathbb{P}_{X + \epsilon}) \le V^{\frac{1}{2}}$$
where $V = \mathbb{E}\left[\|\epsilon\|_2^2\right]$ is the variance of $\epsilon$.
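Proof.
Let $x \sim \mathbb{P}_X$, and $y = x + \epsilon$ with $\epsilon$ independent from $x$. We call $\gamma$ the joint of $(x, y)$, which clearly has marginals $\mathbb{P}_X$ and $\mathbb{P}_{X + \epsilon}$. Therefore,
$$W(\mathbb{P}_X, \mathbb{P}_{X + \epsilon}) \le \int \|x - y\|_2 \, d\gamma(x, y) = \mathbb{E}_{x \sim \mathbb{P}_X} \mathbb{E}_{y \sim x + \epsilon}\left[\|x - y\|_2\right] = \mathbb{E}\left[\|\epsilon\|_2\right] \le \mathbb{E}\left[\|\epsilon\|_2^2\right]^{\frac{1}{2}} = V^{\frac{1}{2}}$$
where the last inequality was due to Jensen. ∎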
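The two quantities in the last step of this proof, the expected norm of the noise and the square root of its variance, can be compared on samples. This is a minimal sketch assuming isotropic Gaussian noise in three dimensions; the noise scale, dimension and sample count are hypothetical.

```python
import math
import random

def noise_moments(samples):
    """Return (mean of ||eps||_2, mean of ||eps||_2^2) over a list of noise vectors."""
    norms = [math.sqrt(sum(c * c for c in e)) for e in samples]
    mean_norm = sum(norms) / len(norms)
    variance = sum(n * n for n in norms) / len(norms)
    return mean_norm, variance

rng = random.Random(0)
sigma, dim = 0.1, 3
eps_samples = [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(5000)]

mean_norm, variance = noise_moments(eps_samples)
print("E||eps||  (transport cost of the coupling x -> x + eps):", mean_norm)
print("V^(1/2)   (bound from the lemma):                       ", math.sqrt(variance))
```

The cost of the particular coupling used in the proof is `mean_norm`, so the Wasserstein distance is at most that value, which in turn is at most `sqrt(variance)`; both shrink as `sigma` is decreased.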
We now turn to one of our main results. We are interested in studying the distance between $\mathbb{P}_r$ and $\mathbb{P}_g$ without any noise, even when their supports lie on different manifolds, since (for example) the closer these manifolds are, the closer to actual points on the data manifold the samples will be. Furthermore, we eventually want a way to evaluate generative models, regardless of whether they are continuous (as in a VAE) or not (as in a GAN), a problem that has for now been completely unsolved. The next theorem relates the Wasserstein distance of $\mathbb{P}_r$ and $\mathbb{P}_g$, without any noise or modification, to the divergence of $\mathbb{P}_{r + \epsilon}$ and $\mathbb{P}_{g + \epsilon}$, and the variance of the noise. Since $\mathbb{P}_{r + \epsilon}$ and $\mathbb{P}_{g + \epsilon}$ are continuous distributions, this divergence is a sensible estimate, which we can even attempt to minimize, since a discriminator trained on those distributions will approximate the JSD between them, and provide smooth gradients as per Corollary 3.2.
Theorem 3.3.
Let $\mathbb{P}_r$ and $\mathbb{P}_g$ be any two distributions, and $\epsilon$ be a random vector with mean 0 and variance $V$. If $\mathbb{P}_{r + \epsilon}$ and $\mathbb{P}_{g + \epsilon}$ have support contained on a ball of diameter $C$ (while this last condition isn't true if $\epsilon$ is a Gaussian, this is easily fixed by clipping the noise), then
$$W(\mathbb{P}_r, \mathbb{P}_g) \le 2 V^{\frac{1}{2}} + 2 C \sqrt{JSD(\mathbb{P}_{r + \epsilon} \| \mathbb{P}_{g + \epsilon})}$$