Generative Adversarial Networks (GANs) (goodfellow2014generative)
allow for efficient learning of complicated probability distributions from samples. However, training such models is notoriously difficult: the dynamics can exhibit oscillatory behavior, and the convergence can be very slow. To alleviate this, a number of practical tricks have been proposed, including the usage of data augmentation (karras2020training; zhao2020image; zhang2019consistency) and optimization methods inspired by numerical integration schemes for ODEs (qin2020training). Nevertheless, the theory explaining the success of these methods is only starting to catch up with practice.
The standard way of training a GAN is to use simultaneous gradient descent. In nagarajan2017gradient it was shown that under mild assumptions, this method is locally convergent. The authors of mescheder2018training demonstrated that when these assumptions are not satisfied, regularization methods such as the gradient penalty (roth2017stabilizing) or consensus optimization (mescheder2017numerics) are required to achieve convergence. In balduzzi2018mechanics; nie2020towards; liang2019interaction
the local convergence is studied based on the eigenvalues of the Jacobian of the dynamics near the equilibrium. Such analysis involves the parameterization of the generator and the discriminator by neural networks and usually does not take into account the properties of the target distribution.
Our key idea is that these dynamics can be efficiently analyzed in the functional space. This is achieved by constructing a local quadratic approximation of the GAN training objective and writing down the dynamics as a system of partial differential equations (PDEs). The differential operator underlying this system has a remarkably simple form, and its spectrum can be analyzed in terms of the fundamental properties of the target distribution. Furthermore, we show that the convergence of the dynamics is determined by the Poincaré constant of this distribution. Intuitively, this constant describes the connectivity properties of a distribution; for instance, it is smaller for multimodal datasets with disconnected modes, for which GAN convergence is known to be slower. This connection is practically important: the Poincaré constant of a dataset can easily be estimated a priori, and the larger it is, the better the expected convergence of a GAN. Thus, we can analyze common techniques that alter the training set in terms of their effect on the Poincaré constant; for instance, one can choose the augmentation strategy that increases the Poincaré constant the most. The main contributions of our paper are:
We develop a linearized GAN training model in the functional space in the form of a PDE.
We derive explicit formulas connecting the eigenfunctions and eigenvalues of the resulting PDE operator with the target distribution properties. This connection provides a theoretical justification for common GAN stabilization techniques.
We describe an efficient, practical recipe for choosing optimal parameters for gradient penalty and data augmentations for a particular dataset.
2 Second-order approximation of the GAN objective
We assume that data samples
are produced by a target probability measure. The generator function is a deterministic mapping from a latent space to ; is sampled from a known probability measure . The discriminator is a real-valued function on whose goal is to distinguish between real and fake samples. The GAN objective can be written as a min-max problem
and , are some scalar convex twice differentiable functions. For example, for the LSGAN (mao2017least)
We analyze the behavior of a GAN in a neighborhood of the Nash equilibrium corresponding to an optimal discriminator function and an optimal generator function . We make the standard assumptions (nagarajan2017gradient; mescheder2018training; nie2020towards) that the optimal discriminator satisfies and that the measure produced by is equal to . Similarly, we also assume that and . In this setting we can recover many common GAN variations, including the vanilla GAN (goodfellow2014generative), WGAN (arjovsky2017wasserstein), and LSGAN (mao2017least).
The standard way of solving the min-max problem is to represent and
by some parametric models and solve it in the parameter space. The convergence of the resulting dynamics can be studied by linearizing it around the equilibrium point (nagarajan2017gradient; mescheder2018training; nie2020towards). This is equivalent to constructing the quadratic approximation of in the parameter space. In this paper, we perform the quadratic approximation first, obtaining a new saddle point problem in a functional space that approximates the local dynamics of standard GAN training methods. This representation has a remarkably simple form, and its properties can be studied in detail.
We assume that is a measure with positive density on , i.e., . In practice, this can be (formally) achieved by smoothing a discrete data distribution with Gaussian noise. As the functional space, we consider equipped with a natural weighted inner product:
We will also often utilize differential operators, which should be understood in the distributional sense. In these cases, the functions under study are assumed to be elements of the corresponding Sobolev space. Specifically, we will utilize the weighted Sobolev spaces and : the spaces of generalized functions whose distributional first-order and second-order derivatives, respectively, lie in .
We start by finding the second-order approximation to the GAN objective near Nash equilibrium in the functional space. We assume that is invertible, i.e., for any there exists a unique such that . Let be the functional variations of the optimal discriminator and generator respectively.
Let . Let us denote . Then, (2) can be approximated as
and and .
The proof is straightforward: we use the Taylor expansion to approximate both terms in the objective up to second order and use the fact that at the Nash equilibrium the first-order terms vanish.
The statement of the theorem then follows from the definitions of , and . ∎
3 Continuous gradient descent-ascent in the functional space
Now we need to solve the min-max problem of the form
We will refer to (3) as the linearized GAN problem (lGAN). Since it is a local quadratic approximation of the original min-max problem, the local behavior of GAN training methods can be understood through it. The linearized GAN problem depends solely on the measure , and we will see which fundamental properties of this measure determine the convergence of gradient-based methods for solving (3).
We will use for the weighted Sobolev space and for the weighted Sobolev space . The continuous descent-ascent flow in the functional space can be written as:
where and are the Fréchet derivatives of the functional with respect to and . It is important to specify the meaning of (4): it should be understood in the weak sense. Select a test function and take the scalar product of the first equation with it. We get
A similar equation holds for . The right-hand side is evaluated by integration by parts, which leads to a system of the form
By making use of the density , equations (6) can be rewritten in the strong form as
We are interested in the asymptotic behavior of and as goes to infinity. It is completely described by the eigenvalues of the operator on the right-hand side of (6). We say that is an eigenfunction with eigenvalue if it satisfies the following system of equations.
4 Eigenfunctions and eigenvalues of the linearized operator
If form an eigenfunction, then the solution of the time-dependent problem (6) with the initial conditions can be written as
We will see that form a basis in our functional space, and thus an arbitrary solution can be written as a series of terms (9). Hence, the real parts of the spectrum of the operator in (6) should be less than zero in order for the system to have asymptotic stability. We will demonstrate that our system does not have eigenvalues with a positive real part but has a naturally interpretable kernel.
4.1 The kernel.
We find that the kernel has the following form.
Let be an eigenfunction with . Then,
or in the strong form:
Here is a constant such that . I.e., for we get , and otherwise.
From (8) we observe that the element of the kernel satisfies the following equations .
Let us choose . From the second equation it follows that as desired. ∎
This kernel has a straightforward interpretation: if we add a function that satisfies (10) to , the mapped measure stays the same up to second-order terms. Indeed, consider (10) in the strong form. We obtain that . Recall that the function maps a sample to a sample from the synthetic density produced by , i.e., can be considered as the velocity of each sample. If samples evolve with velocity , the differential equation for the density takes the form
i.e., the density is invariant under such transformation. We will refer to the condition (10) as the divergence-free condition on .
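The divergence-free condition admits a quick numerical illustration. In the 2D sketch below (the Gaussian weight and the stream function are our own illustrative choices, not from the derivation above), any field whose flux has the form rho*h = (dpsi/dy, -dpsi/dx) automatically satisfies div(rho*h) = 0, so moving samples with velocity h leaves the density invariant to first order:

```python
import numpy as np

# Illustrative 2D check of the divergence-free (kernel) condition:
# for rho * h = (dpsi/dy, -dpsi/dx), div(rho * h) vanishes identically.
n = 400
x = np.linspace(-3, 3, n)
y = np.linspace(-3, 3, n)
X, Y = np.meshgrid(x, y, indexing="ij")

rho = np.exp(-(X**2 + Y**2) / 2)     # unnormalized Gaussian density
psi = np.sin(X) * np.cos(Y)          # arbitrary smooth stream function

dpsi_dx = np.gradient(psi, x, axis=0)
dpsi_dy = np.gradient(psi, y, axis=1)
h_x = dpsi_dy / rho                  # velocity field h
h_y = -dpsi_dx / rho

# divergence of the flux rho * h, by central differences
div = np.gradient(rho * h_x, x, axis=0) + np.gradient(rho * h_y, y, axis=1)
interior = div[2:-2, 2:-2]           # drop one-sided boundary stencils
print(np.abs(interior).max())        # only finite-difference error remains
```

Any such rotational perturbation of the generator leaves the modeled measure unchanged, which is exactly why these directions form the kernel of the linearized operator.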
4.2 Non-zero spectrum.
4.2.1 Weighted Laplace operator
We now address the non-zero eigenvalues of our operator. It will be convenient to utilize the weighted Laplace operator , which is defined (in the weak form) as follows.
In the strong form, it reads:
which reduces to the standard Laplacian in the case of the standard Lebesgue measure on . This operator commonly appears in the study of diffusion processes (coifman2006diffusion) and weighted heat equations (grigoryan2009heat).
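The equivalence between the divergence form and the drift form of the weighted Laplacian is easy to verify numerically. The 1D sketch below uses a Gaussian weight, for which the operator becomes f'' - x f' (the choice of test function is arbitrary):

```python
import numpy as np

# 1D check: (1/rho) d/dx (rho * df/dx) = f'' + (log rho)' * f'.
# For the Gaussian weight rho ~ exp(-x^2/2), (log rho)' = -x.
x = np.linspace(-4, 4, 2001)
f = np.sin(x)
rho = np.exp(-x**2 / 2)

fp = np.gradient(f, x)
divergence_form = np.gradient(rho * fp, x) / rho   # (rho f')' / rho
drift_form = np.gradient(fp, x) - x * fp           # f'' - x f'

err = np.abs(divergence_form - drift_form)[10:-10].max()
print(err)   # agreement up to finite-difference error
```

Setting rho to a constant in the snippet collapses both expressions to the plain second derivative, matching the Lebesgue-measure case mentioned above.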
Spectrum of the weighted Laplacian.
In what follows, we will use the eigenvalues and eigenfunctions of . First, this is a self-adjoint, non-positive definite operator, and under mild assumptions on (cianchi2011discreteness; grigoryan2009heat), e.g., when decreases sufficiently fast, it has a discrete spectrum. Let us study it in more detail. Consider the eigenvalue problem for written in the weak form.
We note that is an eigenfunction with ; thus, due to the self-adjointness of , every other eigenfunction satisfies , i.e., it has zero mean with respect to .
The Poincaré constant of .
Let us consider the smallest non-zero eigenvalue of . We obtain that the following inequality holds.
and the exact minimizer of this inequality is achieved by (grigoryan2009heat). The value of (sometimes its inverse) is called the Poincaré constant of the measure and often appears in Sobolev-type inequalities (adams2003sobolev). The exact value of for a given measure is almost never known analytically. To make use of our results in practical settings, we propose a simple neural-network-based approach for estimating ; we discuss it in Section 6.
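The inequality itself is easy to probe by Monte Carlo. For the standard Gaussian measure the Poincaré constant equals 1 and linear functions attain it, so the Rayleigh quotient of any test function should be at least 1 (the test functions below are arbitrary choices):

```python
import numpy as np

# Monte Carlo illustration of the Poincare inequality
#   Var_rho(f) <= (1/lambda_1) * E_rho |f'|^2.
# For the standard Gaussian, lambda_1 = 1, attained by f(x) = x.
rng = np.random.default_rng(0)
xs = rng.standard_normal(1_000_000)

def rayleigh(f, fprime):
    """Rayleigh quotient E|f'|^2 / Var(f) estimated from samples."""
    return np.mean(fprime(xs) ** 2) / np.var(f(xs))

# the linear function attains the optimum: quotient -> lambda_1 = 1
print(rayleigh(lambda x: x, lambda x: np.ones_like(x)))   # ~1.0
# any nonlinear test function gives a strictly larger quotient
print(rayleigh(np.tanh, lambda x: 1 / np.cosh(x) ** 2))   # > 1
```

Minimizing this quotient over a richer function class (a neural network in Section 6) drives the estimate down to the true lambda_1.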
What is it exactly?
We can provide an intuitive meaning of by drawing an analogy with graph Laplacians. In this case, the second smallest eigenvalue of the graph Laplacian, called the Fiedler value (fiedler1973algebraic)
, reflects how well connected the overall graph is. Thus, the Poincaré constant is, in a way, a continuous analog of the Fiedler value, reflecting the “connectivity” of a measure. This is exactly the property that we would expect to impact GAN convergence: rich empirical evidence shows that datasets that are more “disconnected” (such as ImageNet) are very challenging to model. We provide experimental evidence in support of this intuitive explanation in Section 7.
4.2.2 Spectrum of lGAN
We are now ready to fully describe the spectrum of our lGAN model. Let be the non-zero spectrum of and be the set of the corresponding eigenfunctions. Recall from Theorem 1 that the constants and are defined purely in terms of functions and . We now state our main result.
The non-zero spectrum of (8) is described as follows.
The eigenvalues are given by where are roots of the quadratic equation:
The corresponding eigenfunctions are written in terms of the eigenfunctions of as follows.
The following theorem expresses the solution in terms of the eigenfunctions and also provides the convergence estimates.
Let and . Then, these functions can be written as
and is divergence-free, i.e. . The coefficients and can be obtained as the solution of the linear systems:
With this expansion, the solution to (6) is
For the norms of and can be estimated as
where is the maximal real part of the eigenvalues.
The decomposition of into a potential part and a divergence-free part is a direct generalization of the classical result for the ordinary divergence and gradient, known as the Helmholtz decomposition (griffiths2005introduction). The divergence-free part belongs to the kernel of the operator and thus stays constant. The dynamics of and follows from the completeness of the eigenbasis of and the assumption that its spectrum is discrete, so we can expand them in this basis. By Theorem 2, each component in the sum is an eigenfunction, so its time dynamics is simply . For the constant term in , substituting into (6) yields the ODE , from which the statement follows. ∎
Theorem 3 provides the exact representation of the solution in terms of the eigenfunctions of . The eigenvalues are given as the solution of the quadratic equation (17), thus the spectral properties of completely determine the dynamics of the convergence. Specifically, we observe that the speed of convergence is determined by the value of , i.e., the lowest non-zero mode of the weighted Laplacian.
There are two distinct cases. If the first eigenvalue of satisfies (i.e., is large enough), then all the eigenvalues are complex, and their real part is equal to . Ideally, we would have , which would provide us with the optimal convergence speed and no oscillations for the highest mode. Consider now the case of a small . In this case, will be close to , resulting in a slow convergence rate. These observations can be used to explain the success of various practical GAN training methods, as we show in Section 7.
In the case when , which holds, for instance, for WGAN, we obtain the well-known purely oscillatory behavior (nagarajan2017gradient). We also note that for , the average value of the discriminator exponentially decays to . This resembles the convergence plots of state-of-the-art GANs; see, e.g., karras2020training.
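The purely oscillatory regime can be reproduced on the classical bilinear toy problem f(x, y) = x*y (a standard example from the local-convergence literature, not specific to our operator). The continuous descent-ascent flow is x' = -y, y' = x, whose trajectories are circles; discrete simultaneous gradient steps instead spiral outward:

```python
import numpy as np

# Simultaneous gradient descent-ascent on the bilinear game f(x, y) = x * y.
# The continuous flow conserves the radius |(x, y)|; the discrete forward
# Euler update multiplies it by sqrt(1 + lr^2) at every step.
x, y = 1.0, 0.0
lr = 0.1
radii = []
for _ in range(200):
    x, y = x - lr * y, y + lr * x   # simultaneous update of both players
    radii.append(np.hypot(x, y))

print(radii[0], radii[-1])   # radius grows monotonically: no convergence
```

This is the discrete shadow of the purely imaginary eigenvalues discussed above: without a damping term the iterates orbit the equilibrium and slowly diverge.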
Example: normal distribution.
Consider a model example: the normal distribution . Then has the density . The eigenfunctions and eigenvalues of can be computed explicitly. The strong form of the eigenproblem is
The solution of (22) exists for and the corresponding eigenfunction is the Hermite polynomial:
The smallest non-zero eigenvalue is . Therefore, for the LSGAN model, the discriminant in (17) is always non-positive, and the convergence of and is exponential with rate . The solution will also oscillate due to the presence of complex eigenvalues.
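For the unit-variance case the Hermite eigenfunctions can be verified directly with numpy's probabilists' Hermite module: the weighted Laplacian acts as L f = f'' - x f', and He_n satisfies L He_n = -n He_n, so the smallest non-zero eigenvalue magnitude is indeed 1:

```python
import numpy as np
from numpy.polynomial import hermite_e as He

# Check L He_n = -n He_n for the standard Gaussian weight, where
# L f = f'' - x f' and He_n is the probabilists' Hermite polynomial.
x = np.linspace(-3, 3, 101)
for n in range(1, 5):
    c = [0] * n + [1]                        # coefficients of He_n
    f1 = He.hermeval(x, He.hermeder(c, 1))   # He_n'
    f2 = He.hermeval(x, He.hermeder(c, 2))   # He_n''
    assert np.allclose(f2 - x * f1, -n * He.hermeval(x, c))
print("He_n is an eigenfunction with eigenvalue -n for n = 1..4")
```

In particular n = 1 (the linear polynomial) gives the lowest mode, consistent with the Poincaré constant of the standard Gaussian being 1.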
What about neural networks?
The discussion so far focuses on functional spaces. In practice, these functions are approximated by neural networks, and the dynamics is written in the parameters and of these networks. Local convergence analysis of such dynamics is possible in these parameters (mescheder2018training; nie2020towards); however, such analysis involves the eigenvalues of the Jacobian of the loss, and it is not easy to connect these to the fundamental properties of the measure. The functional-space analysis makes this connection explicit. The approximation by neural networks (or by any other parametric representation) can be thought of as a spatial discretization of the PDE. As the number of parameters increases, the eigenvalues of the discretized problem should approximate the eigenvalues of the infinite-dimensional problem. It is worth noting that different discretizations (for example, different neural network architectures) may lead to different properties. One can compare such discretizations by looking at the quality of approximation of the eigenvalues of the weighted Laplace operator (see Section 6 for the algorithmic details). Moreover, techniques used in practice, such as Jacobian regularization (karras2020analyzing), can be considered methods for choosing a better-conditioned discretization. Finally, we note that the lGAN problem is similar to the saddle point problems (benzi2005numerical) appearing in the mathematical modeling of fluids: for a robust discretization, one has to choose discretization spaces for (‘pressure’) and (‘velocity’) that satisfy the famous LBB (Ladyzhenskaya-Babuška-Brezzi) condition (boffi2013mixed; ladyzhenskaya1969mathematical) in order to obtain good convergence properties. This task requires a systematic study, and we leave it for future research.
6 Effects of common practices on the asymptotic convergence
Several studies have shown that penalizing the gradient norm of the discriminator improves convergence (gulrajani2017improved; mescheder2018training). In our case, this results in an additional term . After linearization, we obtain the regularized loss .
The new eigenvalue problem has the form
The kernel stays the same; moreover, the component is also the same, which can again be seen by using in the second equation. The only difference is the connection between the eigenvalues of the weighted Laplace operator and the eigenvalues of the linearized problem, which changes as follows:
One of the most interesting conclusions of our analysis is that the convergence is determined by the interplay between and . If we fix in advance, we cannot control and , and the convergence for a particular dataset will be determined by . That is, for one dataset the loss may work well, while for another it may fail. An alternative approach is to optimize over the hyperparameters of the loss and regularizer, taking the Poincaré constant into account. Expression (23) gives a direct way to do that. The eigenvalue with the maximal real part corresponds to and is given by the formula
The maximum is obtained if the discriminant is equal to , which gives
Another desirable property is the absence of oscillations. This is possible only if the quadratic function under the square root is non-negative. Since it equals zero at , it is necessary and sufficient for it to have a non-negative derivative:
A simple analysis gives the following conditions on the parameters such that we obtain the optimal local convergence rate and all the eigenvalues of the linearized operator are real:
Discrete approximation of the time dynamics.
The dynamics (6) can be written in the operator form as
The usage of gradient descent optimization for this problem is equivalent to the forward Euler scheme:
and the eigenvalues of the discretized operator are equal to
where are given by Theorem 2. Since for all such that the eigenvalue is complex, the modulus of the corresponding is
Since is an unbounded operator, under mild assumptions on there exists an eigenvalue making , i.e., the forward Euler scheme is absolutely unstable. On the discrete level, when the functions are approximated by a neural network, we might be working in a subspace that does not contain eigenvalues that are too large, and the method might actually converge; but this is very problem-specific. As noted in qin2020training, the usage of more advanced time integration schemes leads to competitive GAN training results even without functional constraints such as spectral normalization (miyato2018spectral). If the parameters are selected in the range (24), then there are no complex eigenvalues, and the Euler scheme can be made convergent; this, however, requires regularization. Another important research direction is the development of more suitable time discretization schemes. The system (6) belongs to the class of hyperbolic problems, for which special time discretization schemes have to be used. One class of such methods is total variation diminishing (TVD) schemes (gottlieb1998total). Note that one of the methods considered in qin2020training is the Heun method (suli2003introduction), a second-order TVD Runge-Kutta scheme, which explains its empirical efficiency. Thus, our analysis provides a theoretical foundation for the results of qin2020training.
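The advantage of Heun's method on the oscillatory modes is visible already in the scalar stability functions of the two schemes (the step size below is an arbitrary illustrative choice):

```python
import numpy as np

# For a purely imaginary eigenvalue mu = i*omega (an oscillatory mode),
# forward Euler amplifies by |1 + h*mu| = sqrt(1 + (h*omega)^2), while
# Heun (explicit trapezoidal RK2) amplifies by |1 + h*mu + (h*mu)^2 / 2|
# = sqrt(1 + (h*omega)^4 / 4): an O(h^4) instead of O(h^2) drift from 1.
h_omega = 0.1
mu = 1j * h_omega
euler = abs(1 + mu)
heun = abs(1 + mu + mu**2 / 2)
print(euler - 1, heun - 1)   # Heun's departure from 1 is far smaller
```

Both factors exceed 1, so neither scheme is exactly neutral on undamped modes, but Heun's excess is smaller by two orders of h, which is consistent with the empirical robustness of second-order TVD schemes reported in qin2020training.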
One of the most successful practical tricks for significantly improving GAN convergence has been the usage of data augmentation (karras2020training; zhang2019consistency; zhao2020image). In our framework, the usage of data augmentations is reflected in a shift of . Based on the intuition discussed earlier, higher values of correspond to more “connected” distributions and allow for faster convergence. This is exactly what data augmentation intuitively achieves: we fill the “holes” in our dataset with new samples and make it more connected. We describe experiments confirming this idea in Section 7.
Practical estimation of .
For practical analysis of GAN convergence, we need to estimate the value of the Poincaré constant
for a given dataset. This can be performed in the standard supervised learning manner. Recall from the definition that is the minimizer of the following optimization problem (commonly called the Rayleigh quotient).
In practice, we can parameterize with a neural network and optimize (27) in a stochastic manner by replacing expectations over with their empirical counterparts (over a batch of inputs). Due to the scale invariance of the Rayleigh quotient, we employ spectral normalization (miyato2018spectral) as an additional regularization; this also falls in line with commonly used discriminator architectures. To summarize, we utilize a neural network and, for a dataset, consider the batched version of the following loss.
Minimization of this loss can be performed with standard optimizers such as SGD or Adam (kingma2014adam).
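The estimation recipe can be sketched without any neural network by restricting the Rayleigh quotient to a small polynomial family (our own simplification for illustration; the actual estimator uses a spectrally normalized network). Over a finite basis, minimizing the quotient reduces to a generalized eigenproblem between a stiffness and a covariance matrix:

```python
import numpy as np

# Galerkin sketch of the Rayleigh-quotient estimator: minimize
# E|f'|^2 / Var(f) over f = sum_k c_k * phi_k(x). This is the smallest
# eigenvalue of A c = lambda B c with A_jk = E[phi_j' phi_k'] (stiffness)
# and B_jk = Cov(phi_j, phi_k) (mass). Data: standard Gaussian samples,
# for which the true Poincare constant is 1.
rng = np.random.default_rng(0)
xs = rng.standard_normal(500_000)

basis = [xs, xs**2, xs**3]                       # phi_k(x) = x^k
derivs = [np.ones_like(xs), 2 * xs, 3 * xs**2]   # phi_k'(x)

A = np.array([[np.mean(d1 * d2) for d2 in derivs] for d1 in derivs])
B = np.cov(np.stack(basis))                      # covariance (mass) matrix

# reduce A c = lambda B c to a symmetric eigenproblem via Cholesky of B
L = np.linalg.cholesky(B)
Linv = np.linalg.inv(L)
lam = np.linalg.eigvalsh(Linv @ A @ Linv.T).min()
print(lam)   # ~1.0 for the standard Gaussian
```

Replacing the fixed basis by a trainable network and the closed-form eigensolve by SGD/Adam on the batched quotient recovers the procedure described above.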
7 Experiments
Our experiments are organized as follows. We start by numerically investigating the value of and the impact of the formulas from Section 6 on convergence for synthetic datasets. We then study the more practical CIFAR-10 (krizhevsky2009learning) dataset. First, we show the correlation between the values obtained for various augmented versions of the dataset and the FID values obtained for the corresponding GANs. We then show that a similar correlation holds when we perform instance selection, a recently proposed technique shown to improve GAN convergence (devries2020instance). For the synthetic datasets, our experiments were performed in JAX (jax2018github) using the ODE-GAN code available on GitHub (https://github.com/deepmind/deepmind-research/tree/master/ode_gan). For the CIFAR-10 experiments we utilized PyTorch (NEURIPS2019_9015). Our experiments were performed on a single NVIDIA V100 GPU.
In order to study the effect of and the choice of and on the training procedure, we set up a simple one-dimensional test: a mixture of two normal distributions and , where is the separation parameter. Intuitively, the larger , the smaller , since larger separation reduces the connectivity of our measure. We verify this numerically, as shown in Figure 1 (top). In this case, we utilize a simple two-layer MLP with spectral normalization on top of the linear layers. We observe that the value of indeed decays rapidly as the separation parameter increases.
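The same decay can be reproduced without any neural network by discretizing the 1D weighted Laplacian directly (a finite-difference sketch with illustrative grid and separation values, not the MLP-based estimator used in the experiment):

```python
import numpy as np

# Estimate lambda_1 of a 1D measure by discretizing the Sturm-Liouville
# problem (rho f')' = -lambda * rho * f on a uniform grid (natural
# Neumann ends), then symmetrizing with D = diag(1/sqrt(rho)).
def poincare_fd(log_density, lo=-10.0, hi=10.0, n=800):
    x = np.linspace(lo, hi, n)
    h = x[1] - x[0]
    rho = np.exp(log_density(x))
    rho_mid = 0.5 * (rho[:-1] + rho[1:])   # rho at cell interfaces
    w = rho_mid / h**2
    main = np.zeros(n)
    main[:-1] += w
    main[1:] += w
    A = np.diag(main) - np.diag(w, 1) - np.diag(w, -1)   # stiffness
    d = 1.0 / np.sqrt(rho)
    evals = np.linalg.eigvalsh(d[:, None] * A * d[None, :])
    return np.sort(evals)[1]               # skip the zero (constant) mode

def mixture(s):
    """Unnormalized log-density of 0.5*N(-s,1) + 0.5*N(s,1)."""
    return lambda x: np.logaddexp(-(x - s)**2 / 2, -(x + s)**2 / 2)

lams = [poincare_fd(mixture(s)) for s in (0.0, 1.0, 2.0, 3.0)]
print(lams)   # decreasing in s: separation destroys connectivity
```

At s = 0 the mixture collapses to a single Gaussian and the estimate returns the exact value 1; as the modes separate, lambda_1 drops quickly, matching the behavior in Figure 1 (top).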
To experiment with GAN convergence in this toy setting, we setup two MLP models for and , and train the LSGAN-like model with
and gradient penalty with factor . We train the model using the ODE-GAN approach with the Heun method. We experiment with two moderately separated mixtures, namely with and . The respective values obtained by the numerical simulation described above are and . We consider five options for the values of . We start with the baseline WGAN corresponding to . We consider two options for the value of : the optimal one predicted by the theory, and a sub-optimal . With the latter value we obtain . We also consider two LSGAN variants: the default version with the parameters , and a version where we fix and choose the optimal and according to (24). We measure the performance of a GAN model via the Earth Mover's Distance between Gaussians fitted on real and synthetic data; this approach resembles the commonly used Fréchet Inception Distance (FID) metric for GAN evaluation. Results are visualized in Figure 1. We observe that the resulting convergence plots match our theoretical predictions. In particular, when the parameters are chosen optimally, the methods converge more rapidly and stably. On the other hand, when is not tuned or is sub-optimal, the convergence is more oscillatory, and the GANs struggle to converge. Note that even with the optimal for the methods with , we still observe some oscillations. This may be a result of noise induced by the neural function approximation or the stochasticity of training.
Data augmentation of CIFAR-10.
In this set of experiments, we verify whether the effects of data augmentations on GAN convergence can be predicted by evaluating for the correspondingly augmented dataset. We consider a number of augmentations commonly utilized in data augmentation pipelines for GANs (see Figure 2).
In particular, these augmentations include both spatial (translations, zooming) and visual augmentations (color adjustment). Folklore knowledge is that the first type is generally helpful, while the second is not. In our framework, the positive effect of augmentations can be understood as improving the “connectivity” of a dataset, which can be quantitatively measured by evaluating for the augmented version of the dataset. An increased value of both intuitively and theoretically (as supported by Theorem 2) corresponds to better convergence and better FID scores. Our experiments are based on the thorough analysis of the impact of augmentations on the quality of a GAN trained on CIFAR-10 provided in zhao2020image. Specifically, the authors select a single augmentation from a predefined set and train several types of GANs by augmenting real and fake images. The strength of the augmentation is controlled by a single parameter , and the authors provide the obtained FID for a number of its possible values (e.g., we may vary how strongly we zoom an image); we refer the reader to zhao2020image for specific details on each augmentation. We selected a large portion of the augmentations studied in that paper and replicated its data augmentation setup. For each augmentation and each strength value, we minimize the Rayleigh quotient with a neural network mimicking the SNDCGAN discriminator from zhao2020image. We train it on the train part of the dataset by applying the respective augmentation to each image with probability one. Note that the actual augmentation strength is randomly sampled from the range , so we cover the entire distribution relatively well. For training, we use a batch size of and the Adam optimizer with learning rate (results are not sensitive to these parameters). We measure the actual value of by sweeping across the augmented test set times to better cover the distribution, and aggregate the results across all (augmented) samples.
For the reference values, we choose the FID scores obtained by the SNDCGAN model with Balanced Consistency Regularization (bCR), compiled from zhao2020image. This model achieves better generation quality, so our theory is more reliable in this case. Our results are provided in Figure 3. For convenience, we normalize the obtained values by dividing them by the of the non-augmented dataset (approximately equal to by our estimation). We observe that the FID scores in many cases indeed follow the behavior of . For instance, for zoomin and cutout, the estimated values predict that there is an intermediate optimal augmentation strength, which matches practice. For translation, we observe that stronger augmentations worsen the connectivity of the dataset, which results in worse FID scores. We also observe a significant drop in for color-based augmentations, confirming their practical inefficiency. We note that in some cases (e.g., zoomout) we do not observe a direct correspondence; this may be a result of auxiliary effects not covered by our theory or of a discrepancy between our implementation and the authors'.
Another possible way to manipulate the data distribution is to remove non-typical samples, which can confuse the generator (devries2020instance)
. This can be performed by fitting a Gaussian model on feature vectors of the dataset produced by some pretrained model (in our experiments, we used Inception v3 (szegedy2016rethinking)) and keeping only samples with a high likelihood under this model. This procedure results in better FID scores of trained GANs, as shown in devries2020instance
. In our framework, this can be understood as another way to improve the dataset's connectivity: by removing outliers, we reduce the number of gaps in the data. The experiments in the aforementioned paper were conducted on the resized ImageNet dataset; however, we hypothesize that this phenomenon can also be observed on the conceptually similar CIFAR-10 dataset. We perform an experiment confirming this effect quantitatively. We follow the steps described above and, for each value of , evaluate of the dataset obtained by removing samples in the bottom
-quantile of the likelihood. Our results are provided in Table 1. We observe that for higher values of the truncation parameter, we indeed obtain better dataset connectivity, confirming the practical findings of devries2020instance.
We presented a novel framework for a theoretical understanding of GAN training by analyzing the local convergence in the functional space. Namely, we represent the GAN dynamics as a system of partial differential equations and analyze the spectrum of the corresponding differential operator, which determines the convergence of the dynamics. As the main result, we show how the spectrum depends on the properties of the target distribution, in particular, on its Poincaré constant. Our perspective provides a new understanding of established GAN tricks, such as the gradient penalty or dataset augmentation. For practitioners, our paper develops an efficient method that allows one to choose optimal augmentations for a particular dataset.