PairGAN
Adversarial training methods typically align distributions by solving two-player games. However, in most current formulations, even if the generator aligns perfectly with data, a sub-optimal discriminator can still drive the two apart. Absent additional regularization, the instability can manifest itself as a never-ending game. In this paper, we introduce a family of objectives by leveraging pairwise discriminators, and show that only the generator needs to converge. The alignment, if achieved, would be preserved with any discriminator. We provide sufficient conditions for local convergence; characterize the capacity balance that should guide the discriminator and generator choices; and construct examples of minimally sufficient discriminators. Empirically, we illustrate the theory and the effectiveness of our approach on synthetic examples. Moreover, we show that practical methods derived from our approach improve the generation of higher-resolution images.
The problem of finding a distributional alignment by means of adversarial training has become a core subroutine across learning tasks, from generative adversarial networks (GANs) (goodfellow2014generative) to domain-invariant training (ganin2016domain; li2018deep). For instance, in GANs we seek to align samples from the model with real examples (e.g., images). The generative model in GANs is trained by minimizing a discrepancy or divergence measure between the two distributions. This divergence measure is realized by a discriminator trained to separate real examples from those sampled from the model (nowozin2016f).
Despite their appeal, GANs are known to be hard to train due to stability issues. Since the estimation is typically set up as two objectives, one for the generator and the other for the discriminator, the desired solution is analogous to a Nash equilibrium of the associated game. Without additional regularization, the dynamics between the two can become unstable (mescheder2018training) and lead to a never-ending game. While there are multiple reasons for instability, we focus in particular on analyzing and ensuring the stability of alignment around the optimal solution(s). The generator sees the training signal, the divergence measure, only through the discriminator. As a result, a generator that aligns perfectly with the target distribution can be thrown off the alignment by a sub-optimal discriminator. In other words, the generator can achieve alignment if and only if the discriminator reaches its optimum at the same time.
We illustrate the stability problem and our approach to resolving it with a toy example (mescheder2018training) in Figure 1. In this example, the generative model is simply $q_\theta = \delta_\theta$, i.e., concentrated on a single point $\theta$, which is the parameter to optimize. The goal is to align it with a real point fixed at the origin, $p = \delta_0$. The discriminator is a simple linear classifier $D_\psi(x) = \psi x$ parameterized by slope $\psi$. The upper-left panel in Figure 1 gives the vector field for an alternating gradient descent training as well as an example trajectory. The training objective is given by the zero-sum game
$$\min_\theta \max_\psi \; f(\psi \theta) + f(0), \qquad f(t) = -\log(1 + e^{-t}).$$
The upper-right panel shows the time evolution of $\theta$ in relation to $\psi$. Note, in particular, that even when $\theta$ reaches the target position (perfect alignment), the sub-optimal discriminator drives the points apart.
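This failure mode can be reproduced numerically. The sketch below is illustrative only: it replaces the saturating objective with the simpler bilinear surrogate $V(\theta, \psi) = \psi\theta$ (a hypothetical simplification, not the paper's exact loss), for which alternating gradient descent rotates around the equilibrium instead of converging.

```python
import math

# Dirac-GAN-style toy (assumed simplification): generator is a point theta,
# real data sits at 0, unary discriminator D(x) = psi * x. With the bilinear
# surrogate V(theta, psi) = psi * theta, the discriminator ascends and the
# generator descends; the joint state (theta, psi) circles the equilibrium.
def unary_dirac_gan(theta=1.0, psi=0.5, lr=0.1, steps=500):
    for _ in range(steps):
        psi = psi + lr * theta    # discriminator gradient ascent
        theta = theta - lr * psi  # generator gradient descent
    return theta, psi

theta, psi = unary_dirac_gan()
# The orbit radius stays bounded away from zero: the game never settles.
print(math.hypot(theta, psi))
```

Running this for any number of steps leaves the pair at roughly its initial distance from the equilibrium, matching the cycling trajectory in the upper-left panel.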
In this paper, we focus on a different class of discriminators that operate on pairs of samples, trained to identify whether the samples come from the same distribution or not. Utilizing such pairwise discriminators, we identify a family of training objectives which ensures alignment stability even if the discriminator is sub-optimal.
The panels in the bottom row of Figure 1 illustrate the same example, now with a pairwise discriminator $D_\psi(x, x')$ parameterized by a single scalar $\psi$. The lower-left panel again shows the vector field for alternating gradient updates between $\theta$ and $\psi$, resulting from our objective function described in Section 5. In this case, the alignment $\theta = 0$ is a stationary point for any discriminator. The time evolution now shows that the alignment is preserved.
We make the following contributions:
In Section 4, we introduce a family of training objectives with pairwise discriminators that preserve the distribution alignment, if achieved, regardless of the discriminator's state.
In Section 5.2, we show that in our setup, only the generator needs to converge, and we provide sufficient conditions for local convergence.
In Section 5.3, we introduce the notion of a sufficient discriminator that formalizes the relationship between the capacities of the discriminator and generator. Moreover, we provide constructive examples of minimally sufficient discriminators.
In Section 6, we show that our approach and its benefits generalize to aligning multiple distributions.
In Section 7, we show that practical methods derived from our theoretical findings improve the stability and sample quality for generating higher-resolution images.
The code to reproduce all experiments presented in this paper can be found at https://github.com/ShangyuanTong/PairGAN.
All proofs can be found in the appendix.
goodfellow2014generative proposed Generative Adversarial Networks and showed that the associated min-max game can be viewed as minimization of the Jensen-Shannon divergence. It was pointed out by nowozin2016f that the original GAN objective is a special case of a broader family of min-max objectives corresponding to f-divergences. arjovsky2017wasserstein showed that the game-theoretic setup of GANs can be extended to approximately optimize the Wasserstein distance. mao2017least proposed LSGAN, which uses a least-squares objective related to the Pearson divergence.
mescheder2017numerics and nagarajan2017gradient proved that GAN training converges locally for absolutely continuous distributions. However, GANs are commonly used to approximate distributions that lie on low-dimensional manifolds (arjovsky2017towards). mescheder2018training showed that in this case, many training methods do not guarantee local convergence without additional regularization, such as instance noise or gradient penalties (roth2017stabilizing), which enjoy both theoretical guarantees and empirical improvements.
There is a body of work on GANs which utilizes pairwise discriminators to improve training dynamics. jolicoeur-martineau2019relative showed that using a “relativistic discriminator”, which compares real and fake data, improves both performance and stability. To combat mode collapse, lin2018pacgan proposed to feed a pack of multiple samples from the same distribution to the discriminator rather than a pair of samples from different distributions. Recently, tsirigotis2019objectives sought objectives for locally stable GAN training without gradient penalties by using a pairwise discriminator of a specific structure.
We consider an objective which is related to the Maximum Mean Discrepancy (MMD) metric (gretton2007kernel) between distributions. MMD is defined by a positive definite kernel and reaches its minimum only when the distributions are equal. li2015generative and dziugaite2015training use MMD with RBF kernels for training deep generative models. In MMD-GAN (li2017mmd; wang2018improving) the kernel function is parameterized by a discriminator network.
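As a concrete reference point for this connection, the squared MMD between two sample sets can be estimated in a few lines. This is the standard biased V-statistic estimator with an RBF kernel, not code from the paper; the bandwidth and sample sizes below are arbitrary choices for illustration.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # Pairwise RBF kernel matrix k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 sigma^2)).
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    # Biased estimate of MMD^2 = E[k(x,x')] - 2 E[k(x,y)] + E[k(y,y')].
    return (rbf_kernel(x, x, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean()
            + rbf_kernel(y, y, sigma).mean())

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 2))
fake_close = rng.normal(0.0, 1.0, size=(200, 2))  # aligned with real
fake_far = rng.normal(5.0, 1.0, size=(200, 2))    # shifted away
print(mmd2(real, fake_close), mmd2(real, fake_far))
```

With a positive definite kernel the estimate vanishes only when the two distributions coincide, which is exactly the property the later sufficient-operator analysis relaxes to a parametric subspace.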
ganin2016domain proposed domain-adversarial neural networks (DANN) for domain adaptation. Adversarial training in DANN is used to learn a feature representation such that it can be used to classify the labeled data while keeping the distribution of the representations invariant across domains. Similarly to GANs, discriminators in DANN are used to estimate a distance between the distributions. li2018deep extend this methodology to learn conditionally invariant representations for domain generalization.

Let $\mathcal{X}$ denote the space of objects (e.g., images). We consider functional spaces $\mathcal{F}$ of real-valued functions operating on $\mathcal{X}$. In this paper we consider two particular settings:
$\mathcal{X}$ is a finite set, $\mathcal{F} = \mathbb{R}^{|\mathcal{X}|}$;

$\mathcal{X}$ is a compact set, $\mathcal{F} = L^2(\mathcal{X})$.

In both cases, $\mathcal{F}$ is a vector space with an inner product. In our analysis, we build intuition about linear functionals and linear operators on $\mathcal{F}$ by treating them as finite-dimensional vectors and matrices. While the finite-dimensional space $\mathbb{R}^{|\mathcal{X}|}$ provides useful intuition, our results naturally extend to $L^2(\mathcal{X})$.
Consider a generative modelling setup where we want to approximate a distribution of “real” objects $p(x)$ with a distribution of generated (“fake”) objects $q_\theta(x)$. The training in GANs is performed by solving a game between the generator and a unary discriminator $D : \mathcal{X} \to \mathbb{R}$, which operates on single samples, with the loss functions^{1} for the two given by
$$L_D(D) = \mathbb{E}_{x \sim p}[a_1(D(x))] + \mathbb{E}_{x \sim q_\theta}[a_2(D(x))], \tag{1a}$$
$$L_G(\theta) = \mathbb{E}_{x \sim q_\theta}[a_3(D(x))], \tag{1b}$$
where $a_1, a_2, a_3$ are activation functions applied to the discriminator. The original GAN by goodfellow2014generative, which we refer to as the standard GAN (SGAN for short), has $a_1(t) = -\log \sigma(t)$, $a_2(t) = -\log(1 - \sigma(t))$, with $a_3(t) = \log(1 - \sigma(t))$ for the saturating version and $a_3(t) = -\log \sigma(t)$ for the non-saturating one.

^{1} Throughout this paper, we assume that all loss functions are to be minimized, unless stated otherwise.

Unary discriminators define linear forms. An expectation
$\mathbb{E}_{x \sim p}[f(x)]$ can be viewed as a linear form in the function space:
$$\mathbb{E}_{x \sim p}[f(x)] = \langle p, f \rangle,$$
where $p$ and $f$ are the function-space vectors corresponding to the density $p$ and the function $f$ respectively.
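On a finite $\mathcal{X}$ this identification is literal: the expectation is a dot product between the density vector and the function vector. A toy check (the numbers are arbitrary):

```python
import numpy as np

# Finite X = {0, 1, 2}: E_{x~p}[f(x)] is exactly the inner product <p, f>.
p = np.array([0.2, 0.5, 0.3])   # density vector (sums to 1)
f = np.array([1.0, 4.0, -2.0])  # function vector
expectation = sum(p[i] * f[i] for i in range(3))
print(expectation, p @ f)  # both equal 1.6
```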
Using the function-space notation we can rewrite the losses (1) as
$$L_D(D) = \langle p, a_1 \circ D \rangle + \langle q_\theta, a_2 \circ D \rangle, \tag{2a}$$
$$L_G(\theta) = \langle q_\theta, a_3 \circ D \rangle. \tag{2b}$$
Note that $p$ and $q_\theta$ must define valid density functions. We define $\mathcal{P}$ as the set of probability density functions which belong to $\mathcal{F}$. Formally, we define $\mathcal{P}$ as
$$\mathcal{P} = \{ u \in \mathcal{F} : u \geq 0, \ \langle u, \mathbf{1} \rangle = 1 \},$$
where $\mathbf{1}$ is a function-space vector having the constant value of $1$ on all of its “positions”.
The standard setup of GANs can be extended by replacing the unary discriminator with a pairwise discriminator $D : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which operates on pairs of samples (li2015generative; jolicoeur-martineau2019relative; jolicoeur2018rfdiv; tsirigotis2019objectives). In this paper we interpret a pairwise discriminator as a classifier which classifies the pairs of samples into two classes: same-distribution pairs and different-distribution pairs:

same: $(x, x') \sim p \times p$ or $(x, x') \sim q_\theta \times q_\theta$;

different: $(x, x') \sim p \times q_\theta$ or $(x, x') \sim q_\theta \times p$.
With a pairwise discriminator, we define a modified game for GANs:
$$L_D(D) = \mathbb{E}_{x, x' \sim p}[a_1(D(x, x'))] + \mathbb{E}_{x, x' \sim q_\theta}[a_2(D(x, x'))] + \mathbb{E}_{x \sim p,\, x' \sim q_\theta}[a_3(D(x, x'))], \tag{3a}$$
$$L_G(\theta) = \mathbb{E}_{x, x' \sim p}[b_1(D(x, x'))] + \mathbb{E}_{x, x' \sim q_\theta}[b_2(D(x, x'))] + \mathbb{E}_{x \sim p,\, x' \sim q_\theta}[b_3(D(x, x'))]. \tag{3b}$$
Pairwise discriminators define bi-linear forms. An expectation
$$\mathbb{E}_{x \sim p,\, x' \sim q}[g(x, x')]$$
can be viewed as a bi-linear form in the function space:
$$\mathbb{E}_{x \sim p,\, x' \sim q}[g(x, x')] = \langle p, G q \rangle,$$
where $G$ denotes a function-space linear operator corresponding to the function $g$:
$$(G f)(x) = \int g(x, x')\, f(x')\, dx'.$$
In this paper we consider symmetric discriminators, $g(x, x') = g(x', x)$, which define self-adjoint operators: $G = G^\top$.
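The finite-dimensional reading of this statement can be checked directly: the double expectation is the bilinear form $p^\top G q$ with matrix entries $G_{x,x'} = g(x, x')$, and a symmetric $g$ yields a symmetric matrix. The pair function below is an arbitrary illustration:

```python
import numpy as np

# E_{x~p, x'~q}[g(x, x')] as a bilinear form p^T G q on a finite space.
g = lambda i, j: (i + 1) * (j + 1) + 0.5   # symmetric pair function (example)
G = np.array([[g(i, j) for j in range(3)] for i in range(3)])
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.6, 0.1, 0.3])

double_sum = sum(p[i] * G[i, j] * q[j] for i in range(3) for j in range(3))
print(double_sum, p @ G @ q)   # identical values
print(np.allclose(G, G.T))     # symmetric g => self-adjoint operator
```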
Using the bi-linear forms we re-write the losses (3):
$$L_D(D) = \langle p, A_1 p \rangle + \langle q_\theta, A_2 q_\theta \rangle + \langle p, A_3 q_\theta \rangle, \tag{4a}$$
$$L_G(\theta) = \langle p, B_1 p \rangle + \langle q_\theta, B_2 q_\theta \rangle + \langle p, B_3 q_\theta \rangle, \tag{4b}$$
where $A_i$ and $B_i$ denote the operators corresponding to $a_i \circ D$ and $b_i \circ D$ respectively.
Unary discriminators destroy the alignment. Consider the generator loss for a unary GAN (2b). Suppose that at some moment the generator has been aligned with the target distribution: $q_\theta = p$. With the subsequent update, $q_\theta$ receives the gradient signal $\nabla_q L_G = a_3 \circ D$. Below we show that unless $a_3 \circ D$ is constant on the support of $p$, the discriminator will drive $q_\theta$ away from $p$ and destroy the alignment. We consider an infinitesimal perturbation $q = p + \delta q$. Since $q$ must be a valid density function, $\delta q$ must satisfy:
$$\langle \delta q, \mathbf{1} \rangle = 0; \qquad \delta q(x) \geq 0 \text{ for all } x \text{ with } p(x) = 0.$$
The first-order change of the loss (2b) corresponding to the perturbation is given by:
$$\delta L_G = \langle \delta q,\; a_3 \circ D \rangle.$$
The generator is stationary at $q = p$ iff
$$\langle \delta q,\; a_3 \circ D \rangle = 0 \quad \text{for all admissible } \delta q.$$
This is only possible when $a_3 \circ D$ is constant on the support of $p$.
Pairwise discriminators preserve the alignment. We find that there is a family of objectives (4b) with pairwise discriminators that prevents the discriminator from destroying the alignment, meaning:
$$\nabla_q L_G \big|_{q_\theta = p} = 0 \quad \text{for any discriminator } D. \tag{5}$$
Indeed, in order to satisfy (5) it is sufficient to choose $b_1 = b_2 = a$ and $b_3 = -2a$ for some activation $a$. We define a function $g = a \circ D$ and consider the following instance of the loss (4b):
$$L_G(\theta) = \langle p - q_\theta,\; G\,(p - q_\theta) \rangle, \tag{6}$$
where $G$ is the operator corresponding to $g$. Since $\nabla_q L_G = -2\, G\,(p - q_\theta)$, the alignment $q_\theta = p$ is stationary for any discriminator.
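In the finite case the preservation property is immediate to verify numerically: for loss (6) the gradient with respect to $q$ vanishes at $q = p$ no matter what $G$ is. The finite-difference check below uses an arbitrary random self-adjoint operator:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))
G = (A + A.T) / 2                   # arbitrary self-adjoint "discriminator"
p = np.array([0.1, 0.2, 0.3, 0.4])  # target density vector

def gen_loss(q):
    # Pairwise generator loss (6): <p - q, G (p - q)>
    r = p - q
    return r @ G @ r

# Finite-difference gradient of the loss at the aligned point q = p.
eps = 1e-6
grad = np.array([(gen_loss(p + eps * e) - gen_loss(p - eps * e)) / (2 * eps)
                 for e in np.eye(n)])
print(np.abs(grad).max())  # ~0: alignment is stationary for ANY G
```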
In this section, we first propose PairGAN, a formulation of GANs with the generator loss of the form (6). Then, in Section 5.1, for specific choices of the activation, we provide a theoretical insight similar to that in (goodfellow2014generative) to show that our approach in a specific form also minimizes a meaningful divergence metric. In Section 5.2, by evaluating a sufficient condition for local convergence, we introduce the notion of sufficient discriminators, which we analyze in detail in Section 5.3.
General formulation. PairGAN loss functions are described by a non-zero-sum game:
$$L_D(D) = \mathbb{E}_{(x,x') \sim s}\left[ -\log D(x, x') \right] + \mathbb{E}_{(x,x') \sim d}\left[ -\log (1 - D(x, x')) \right], \tag{7a}$$
$$L_G(\theta) = \langle p - q_\theta,\; G\,(p - q_\theta) \rangle, \qquad g = a \circ D, \tag{7b}$$
where $s$ and $d$ denote the distributions of same and different pairs, and $a$ is an activation applied to the discriminator.

PairGAN-Z. We also consider a zero-sum game for loss (6). We call the corresponding formulation PairGAN-Z:
$$\min_\theta \max_D \; \langle p - q_\theta,\; G\,(p - q_\theta) \rangle, \qquad g = a \circ D. \tag{8}$$
These loss functions are a natural choice for a probabilistic discriminator. In this setup, we interpret the output of a pairwise discriminator as the estimated probability of a pair being sampled from the same distribution. Here, we will show that both the non-zero-sum and zero-sum setups minimize meaningful divergence metrics.
Let us define the following mixture distributions:
$$s(x, x') = \tfrac{1}{2}\left[ p(x)\, p(x') + q_\theta(x)\, q_\theta(x') \right], \tag{9a}$$
$$d(x, x') = \tfrac{1}{2}\left[ p(x)\, q_\theta(x') + q_\theta(x)\, p(x') \right], \tag{9b}$$
$$m(x, x') = \tfrac{1}{2}\left[ s(x, x') + d(x, x') \right], \tag{9c}$$
where $s$ and $d$ are the distributions of same and different pairs respectively.
The family of discriminators for PairGAN is defined as the set of probabilistic classifiers $D : \mathcal{X} \times \mathcal{X} \to (0, 1)$. The generator loss evaluated at the optimal PairGAN discriminator is
$$L_G(D^*) = 2\left[ \mathrm{KL}(s \,\|\, d) + \mathrm{KL}(d \,\|\, s) \right].$$
For PairGAN-Z, we define another family of probabilistic discriminators whose values are separated from zero, $D : \mathcal{X} \times \mathcal{X} \to [\varepsilon, 1]$, where $\varepsilon \in (0, 1)$. Then, the generator loss evaluated at the optimal PairGAN-Z discriminator is
$$L_G(D^*) = 2 \log(1/\varepsilon) \cdot \mathrm{TV}(s, d).$$
Now, we can show that with optimal discriminators, these particular choices of PairGAN and PairGAN-Z minimize a symmetrized KL divergence and a total variation distance, respectively.
Each of the values above is equivalent to a divergence between the distributions $s$ and $d$: a symmetrized KL divergence for PairGAN and a total variation distance for PairGAN-Z. Consequently, each value is non-negative and equals zero iff $s = d$, which holds iff $q_\theta = p$.
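For intuition about the two divergences recovered at the optimal discriminators, both are easy to compute for discrete distributions, and both vanish exactly when the distributions coincide. (Toy vectors, illustrative only.)

```python
import numpy as np

def sym_kl(p, q):
    # Symmetrized KL: KL(p||q) + KL(q||p), for strictly positive vectors.
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def tv(p, q):
    # Total variation distance: half the L1 distance between densities.
    return 0.5 * float(np.abs(p - q).sum())

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
print(sym_kl(p, p), tv(p, p))  # both 0.0 at alignment
print(sym_kl(p, q), tv(p, q))  # both strictly positive off alignment
```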
We note that in game (7), since the generator loss is designed to preserve alignment once achieved, we only require the generator to reach alignment but do not require the discriminator to converge to a specific position. Thus, the goal of our convergence analysis is to identify the set of discriminators which allow the generator to converge.
Let $D_\psi$ and $q_\theta$ be a parametric discriminator and a parametric generator, parameterized by vectors $\psi$ and $\theta$ respectively. We consider the realizable setup, that is, we assume that there exists $\theta^*$ such that $q_{\theta^*} = p$. Generally, a parametrization may permit different instances of parameters to define the same distribution. Hence, we consider a reparametrization manifold (mescheder2018training):
$$\mathcal{M} = \{ \theta : q_\theta = p \}.$$
In our analysis below, we assume that there is an $\varepsilon$-ball around $\theta^*$ in which $\mathcal{M}$ defines a differentiable manifold. We denote the tangent space of the manifold at $\theta^*$ by $T_{\theta^*}\mathcal{M}$.
Recall from Section 4 that $\theta^*$ is a stationary generator for any discriminator $D$. Similar to mescheder2018training, we analyze the local convergence by examining the eigenvalues of the Hessian of the loss (7b) w.r.t. $\theta$ at $\theta^*$. We denote this Hessian by $H$. In Appendix B we show that the Hessian is given by
$$H = 2\, J^\top G\, J, \qquad J = \frac{\partial q_\theta}{\partial \theta}\Big|_{\theta = \theta^*}, \tag{10}$$
where $J$ is the Jacobian of the generated density w.r.t. the parameters.
The following proposition provides a sufficient condition for local convergence of the generator.
Suppose that $\theta^* \in \mathcal{M}$ and the pair $(\theta^*, D)$ satisfies:
$$u^\top H\, u > 0 \quad \text{for all } u \notin T_{\theta^*}\mathcal{M},\ u \neq 0. \tag{11}$$
Then, with fixed $D$, gradient descent w.r.t. $\theta$ for (7b) converges to $\mathcal{M}$ in a neighborhood of $\theta^*$ provided a small enough learning rate. Moreover, the rate of convergence is at least linear.
Proposition 5.2 states that a discriminator satisfying condition (11) allows the generator to converge. While the convergence guarantee is only established for training the generator with a fixed discriminator, the result still holds if we allow $D$ to vary within a set. Indeed, from Proposition 5.2 it follows that $\theta$ converges to $\mathcal{M}$, given that $D$ remains in the set of discriminators satisfying (11). Note that this set includes all discriminators in a neighborhood of any discriminator satisfying (11), since the left-hand side of (11) is continuous in $D$.
Figure 2 contrasts the convergence properties of GANs with unary discriminators and PairGAN on a toy example identical to that described in Section 1. The left panel of Figure 2 shows two trajectories for SGAN with gradient penalties (mescheder2018training). Both trajectories converge to the only stationary point. In contrast, for PairGAN (Figure 2, right), two trajectories initialized at different points both achieve the alignment but converge to different positions of the discriminator $\psi$. In this example, the discriminators corresponding to $\psi > 0$ satisfy (11) and define a gradient vector field pointing towards the line of alignment. We note that the discriminator updates tend to keep $\psi$ positive. In Section 5.4 we extend this observation for PairGAN-Z.
To characterize the set of discriminators satisfying condition (11), we build intuition from the function space perspective.
We consider a perturbed value of the parameters of the generator $\theta = \theta^* + \varepsilon$, where $\varepsilon$ is an infinitesimal perturbation vector. The corresponding first-order perturbation of the generated distribution can be expressed via Taylor expansion:
$$q_{\theta^* + \varepsilon} = q_{\theta^*} + \delta q + o(\|\varepsilon\|). \tag{12}$$
Note that $\delta q$ is a linear combination of the derivatives w.r.t. the individual parameters $\theta_i$:
$$\delta q = \sum_i \varepsilon_i\, \frac{\partial q_\theta}{\partial \theta_i}\Big|_{\theta = \theta^*}.$$
Thus, the set of all $\delta q$ defines a finite-dimensional subspace of the function space. We denote this subspace by $T$:
$$T = \mathrm{span}\left\{ \frac{\partial q_\theta}{\partial \theta_i}\Big|_{\theta = \theta^*} \right\}.$$
Note that $\langle \delta q, \mathbf{1} \rangle = 0$, since $\langle q_\theta, \mathbf{1} \rangle = 1$ for all $\theta$.
The expression in equation (11) can be rewritten in terms of the perturbation $\delta q$:
$$u^\top H\, u = 2\, \langle \delta q,\; G\, \delta q \rangle.$$
The following definition gives a function-space reformulation of the condition (11).

We say that a self-adjoint operator $G$ is sufficient for a parametric generator $q_\theta$ at $\theta^*$ if
$$\langle v,\; G\, v \rangle > 0 \quad \text{for all } v \in T \setminus \{0\}. \tag{13}$$
We say that a discriminator $D$ is sufficient for $q_\theta$ at $\theta^*$ if the corresponding operator is sufficient for $q_\theta$ at $\theta^*$.

This definition essentially means that a discriminator is sufficient for a particular aligned generator if every possible change that this generator can make results in an increase of the generator loss.

Note that for the condition (13) to be satisfied it is required that $\mathrm{rank}\, G \geq \dim T$.
We say that an operator $G$ is minimally sufficient for $q_\theta$ at $\theta^*$ if:

(a) $G$ is sufficient for $q_\theta$ at $\theta^*$;

(b) $\mathrm{rank}\, G \leq \mathrm{rank}\, G'$ for any sufficient operator $G'$.
The following proposition provides constructive examples of minimally sufficient discriminators for any given parametric generator.
Let and denote the functions:
The operators and :
are minimally sufficient operators for at .
These operators define the following generator objectives (7b):
(14)
Appendix D.1 provides a detailed discussion on the interpretation of the operators and the objectives .
Another interpretation of a sufficient discriminator in condition (13) is that its corresponding operator needs to be positive definite on the subspace defined by the generator. This notion naturally extends to positive definite operators defined by a kernel $k(x, x')$. In fact, for the special case of a discriminator-operator defined by a positive definite kernel, $g(x, x') = k(x, x')$, objective (6) defines the Maximum Mean Discrepancy (MMD) metric (gretton2007kernel), which is used as the loss function in MMD-GAN (li2017mmd; binkowski2018demystifying; wang2018improving).
In this section, we discuss the connections and differences between MMD-GAN and our method. We start by examining the optimization problem (7b) in the case of a non-parametric generator $q$.
Suppose that the generator is an arbitrary continuous density function not restricted to a parametric family. Then, the minimization of the loss (7b) transforms into a constrained optimization problem w.r.t. a function-space vector $q$:
$$\min_q \; \langle p - q,\; G\,(p - q) \rangle \tag{15a}$$
$$\text{s.t.} \quad q \geq 0, \quad \langle q, \mathbf{1} \rangle = 1. \tag{15b}$$
With constraints (15b), $q$ defines a valid density function.
Let $\delta q$ be an infinitesimal perturbation of the aligned distribution $q = p$. In order for $q$ to remain a valid distribution, we need to restrict the space of possible perturbations $\delta q$. We define the set of admissible perturbations as the intersection $\mathcal{A} = \mathcal{A}_1 \cap \mathcal{A}_2$, where
$$\mathcal{A}_1 = \{ \delta q : \langle \delta q, \mathbf{1} \rangle = 0 \}, \qquad \mathcal{A}_2 = \{ \delta q : \delta q(x) \geq 0 \text{ for all } x \text{ with } p(x) = 0 \}.$$
Requiring all admissible perturbations to be “detectable” by the operator $G$, we obtain a non-parametric version of the condition (13):
$$\langle \delta q,\; G\, \delta q \rangle > 0 \quad \text{for all } \delta q \in \mathcal{A} \setminus \{0\}. \tag{16}$$
Condition (13) is a relaxed version of condition (16), since $T \subseteq \mathcal{A}$ for any valid parameterization of a distribution. We now contrast the difference between the parametric and non-parametric cases.

Non-parametric: the perturbation of $q$ is not restricted by a parameterization; thus $G$ is required to be positive definite on the set $\mathcal{A}$, which is infinite-dimensional.

Parametric: the perturbation of $q_\theta$ is restricted by a parameterization; thus $G$ is only required to be positive definite on the finite-dimensional subspace $T$.
With this connection, we can see the key difference between PairGAN and MMD-GAN. MMD-GAN utilizes kernel operators which are positive definite on the whole functional space $\mathcal{F}$. These operators guarantee that $q = p$ is the unique minimizer in (15). Note that the set of positive definite kernels is a subset of the set of sufficient operators for a given parametric generator.
Now, we note an interesting property of PairGAN-Z (8).
For simplicity, we consider the case of finite $\mathcal{X}$. Let $\Delta$ denote the probability simplex in $\mathbb{R}^{|\mathcal{X}|}$. Consider game (8) between a generator $q$ and a discriminator-operator $G$ given a target distribution $p$. Suppose we initialize $q$ and $G$ with $q^{(0)}$ and $G^{(0)}$ respectively. An iteration of alternating gradient descent is given by:
$$q^{(t+1)} = \Pi_\Delta\!\left[ q^{(t)} + 2 \lambda_q\, G^{(t)} \left(p - q^{(t)}\right) \right], \tag{17a}$$
$$G^{(t+1)} = G^{(t)} + \lambda_G \left(p - q^{(t+1)}\right)\left(p - q^{(t+1)}\right)^\top, \tag{17b}$$
where $\lambda_q$ and $\lambda_G$ are positive learning rates and $\Pi_\Delta$ denotes the projection onto $\Delta$.

Suppose that at some iteration $G^{(t)}$ is positive definite. Then, each step of the generator decreases the value $\langle p - q,\; G\,(p - q) \rangle$ and drives $q$ towards $p$. Furthermore, once $G$ has become positive definite, it is guaranteed to remain positive definite after the symmetric rank-one update (17b). Thus, once $G$ becomes positive definite, $q$ is guaranteed to converge.

We hypothesize that the observed effect opens the possibility of establishing global convergence guarantees for PairGAN-Z. Informally, with each gradient update (17b), $G$ becomes “more” positive definite. It then remains to prove formally that, with updates (17b), $G$ reaches a positive definite state from any starting point $G^{(0)}$. We leave further analysis of this problem for future work.
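The two claims above (positive definiteness is preserved by the symmetric rank-one ascent update, and a positive definite $G$ makes the generator step contract $q$ towards $p$) can be checked with a small simulation. This is a hedged sketch of updates (17): the simplex projection of $q$ is omitted for simplicity, and the dimensions and learning rates are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
p = rng.dirichlet(np.ones(n))  # fixed target distribution
q = rng.dirichlet(np.ones(n))  # initial generator distribution
G = np.eye(n)                  # start from a positive definite operator
lr_q, lr_G = 0.05, 0.05

d0 = np.linalg.norm(p - q)
for _ in range(200):
    q = q + 2 * lr_q * (G @ (p - q))   # generator descent on <p-q, G(p-q)>
    r = p - q
    G = G + lr_G * np.outer(r, r)      # symmetric rank-one discriminator update

print(np.linalg.norm(p - q) / d0)   # much smaller than 1: q contracted to p
print(np.linalg.eigvalsh(G).min())  # positive: G stayed positive definite
```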
In GANs the goal is to align the generated distribution with a fixed real distribution . In this section, we consider an extended setup for adversarial training, where our goal is to align multiple distributions together. This setup is a simplified version of the distribution alignment problem arising in domain-invariant training (ganin2016domain; li2018deep), where adversarial training is used to make the distributions of representations in multiple domains indistinguishable from one another.
We consider the following loss function for the $i$-th distribution $q_i$:
$$L_i(q_i) = \sum_{j \neq i} \langle q_j - q_i,\; G\,(q_j - q_i) \rangle. \tag{18}$$
Suppose that $q_i = q_j$ for some $i \neq j$. Then:
$$\nabla_{q_i} L_i = \nabla_{q_j} L_j.$$
Proposition 6.1 states that whenever all distributions in any given subset of $\{q_i\}$ become mutually aligned, they will receive the same gradient. Consequently, the alignment within this subset will be preserved.
Figure 3 provides a toy-example demonstration of Proposition 6.1. In this example, the goal is to align three point-mass distributions. Both panels of Figure 3 show the trajectories of individual points obtained as a result of their interaction with a discriminator (domain classifier). The left panel corresponds to a game with a linear unary discriminator. Here, we observe that, when a pair of points becomes aligned, the discriminator can still drive them apart. The right panel of Figure 3 shows the trajectories obtained by using objective (18) with a pairwise discriminator. We observe that with objective (18) the alignment is preserved for any pair of distributions. We provide the detailed specification of the toy example in Appendix F.
We conducted experiments with the specific probabilistic form of PairGAN described in Section 5.1. The experiments are set on the CAT dataset (zhang2008cat), with the same preprocessing setup as (jolicoeur-martineau2019relative). This is generally a hard problem for generative models because of the high-resolution samples (up to 256x256) and the small dataset size (about 9k images for 64x64, 6k images for 128x128, and only 2k images for 256x256). The details of our model can be found in Appendix G.1.
We quantitatively evaluate our approach with the Fréchet Inception Distance (FID) (heusel2017gans) (where a lower value generally corresponds to better image quality and sample diversity) on the three choices of resolution against baselines provided in (jolicoeur-martineau2019relative). As our specific loss function is a variant of the standard GAN, Table 1 shows our model’s performance compared with the baselines that are also variants of the standard GAN. These baselines are: standard GAN (SGAN) (goodfellow2014generative), Relativistic SGAN (RSGAN), Relativistic SGAN with gradient penalty (RSGAN-GP), Relativistic average SGAN (RaSGAN), and Relativistic average SGAN with gradient penalty (RaSGAN-GP) (jolicoeur-martineau2019relative). We find that the SGAN baseline provided in (jolicoeur-martineau2019relative) uses a numerically unstable implementation of the cross-entropy loss, so we rerun this baseline with the stable implementation. Same as (jolicoeur-martineau2019relative), we calculate FID at a fixed set of generator-step checkpoints and report the minimum, maximum, mean, and standard deviation of the score values at these steps. Moreover, we evaluate our method and the fixed SGAN baseline three times and report the average of all four statistics.

Loss | Min | Max | Mean | SD
---|---|---|---|---
64x64 images | | | | |
SGAN | 13.51 | 41.89 | 23.78 | 8.81 |
RSGAN | 19.03 | 42.05 | 32.16 | 7.01 |
RaSGAN | 15.38 | 33.11 | 20.53 | 5.68 |
RSGAN-GP | 16.41 | 22.34 | 18.20 | 1.82 |
RaSGAN-GP | 17.32 | 22 | 19.58 | 1.81 |
PairGAN (ours) | 12.66 | 20.90 | 16.38 | 2.23 |
128x128 images | | | | |
SGAN | 27.35 | 57.76 | 40.17 | 9.34 |
RaSGAN | 21.05 | 39.65 | 28.53 | 6.52 |
PairGAN (ours) | 17.30 | 29.32 | 21.92 | 3.76 |
256x256 images | | | | |
SGAN | 69.64 | 344.55 | 208.99 | 104.08 |
RaSGAN | 32.11 | 102.76 | 56.64 | 21.03 |
PairGAN (ours) | 35.35 | 64.77 | 45.21 | 9.49 |
From Table 1, we observe that PairGAN improves both performance and stability on higher resolution images. Overall, our method outperforms the baselines in all categories except for RaSGAN-GP (64x64) in standard deviation and RaSGAN (256x256) in minimum.
Appendix G.2 provides an extended comparison with other baselines, including LSGAN (mao2017least), HingeGAN (miyato2018spectral), WGAN-GP (gulrajani2017improved), and their variants. At 64x64 resolution, PairGAN demonstrates performance comparable with the best baseline (Relativistic average LSGAN) in all four categories. In the higher-resolution settings (128x128, 256x256), our model achieves the best FID across maximum, mean, and standard deviation, and its minimum FID is comparable with the best model for that resolution.
We also find our approach to be consistent across multiple runs with small deviations of the four metrics. Further discussion and the full table with the standard deviations of the scores for our model over repeated trials can be found in Appendix G.2.
We include image samples generated by PairGAN in Appendix G.3.
We introduced PairGAN, a formulation of adversarial training where the training dynamics does not suffer from the instability of the alignment. Our theoretical results constitute first steps in understanding convergence guarantees for PairGAN. Interestingly, in our setup, one can formalize the balance of power between the discriminator and the generator with the notion of sufficient discriminators, which is not present in the standard formulation of GANs.
Directions for future work include further theoretical understanding of convergence guarantees and properties of sufficient discriminators. Throughout our analysis, PairGAN enjoys flexibility which permits the use of different loss functions and model architectures. More extensive experiments with different design choices are necessary to understand the general improvements that PairGAN can bring.
This work was partially supported by the MIT-IBM collaboration on adversarial learning.
First, we expand the expression for the discriminator loss (7a) in PairGAN:
$$L_D(D) = \mathbb{E}_{(x,x') \sim s}\left[ -\log D(x, x') \right] + \mathbb{E}_{(x,x') \sim d}\left[ -\log (1 - D(x, x')) \right].$$
We expand all expectations as integrals and obtain:
$$L_D(D) = -\iint \left[ s(x, x') \log D(x, x') + d(x, x') \log (1 - D(x, x')) \right] dx\, dx'.$$
We minimize the integral by minimizing the expression inside the integral w.r.t. $D(x, x')$ point-wise. Solving for the optimal $D$, we obtain:
$$D^*(x, x') = \frac{s(x, x')}{s(x, x') + d(x, x')}.$$
We rewrite this expression as a function of the mixture distributions (9):
$$D^*(x, x') = \frac{s(x, x')}{2\, m(x, x')}.$$
Next, we substitute $D^*$ into the generator loss (7b):
$$L_G(\theta) = \langle p - q_\theta,\; G^* (p - q_\theta) \rangle, \qquad g^*(x, x') = \log \frac{D^*(x, x')}{1 - D^*(x, x')} = \log \frac{s(x, x')}{d(x, x')}.$$
Using $\left(p - q_\theta\right)(x)\left(p - q_\theta\right)(x') = 2\left[ s(x, x') - d(x, x') \right]$, we rewrite the loss as
$$L_G(\theta) = 2 \iint \left[ s(x, x') - d(x, x') \right] \log \frac{s(x, x')}{d(x, x')}\, dx\, dx'.$$
The two resulting expectations give the KL and reverse-KL divergences between $s$ and $d$. Thus, we have shown that
$$L_G(D^*) = 2\left[ \mathrm{KL}(s \,\|\, d) + \mathrm{KL}(d \,\|\, s) \right].$$
The symmetrized KL-divergence above is non-negative and is equal to zero iff $s = d$, i.e.,
$$p(x)\, p(x') + q(x)\, q(x') = p(x)\, q(x') + q(x)\, p(x') \quad \text{for all } x, x'.$$
We transform the last equation in the following way:
$$\left( p(x) - q(x) \right)\left( p(x') - q(x') \right) = 0 \quad \text{for all } x, x'.$$
The last equation holds true for all $x, x'$ iff $p = q$.
First, we expand the expression for the discriminator loss for PairGAN-Z (8):
$$L_D(D) = -\langle p - q_\theta,\; G\,(p - q_\theta) \rangle, \qquad g = a \circ D.$$
We expand all expectations as integrals and obtain:
$$L_D(D) = -\iint \left( p(x) - q(x) \right)\left( p(x') - q(x') \right) \log D(x, x')\, dx\, dx'.$$
We introduce the function $\rho$ as:
$$\rho(x, x') = \left( p(x) - q(x) \right)\left( p(x') - q(x') \right) = 2\left[ s(x, x') - d(x, x') \right],$$
and re-write the loss as
$$L_D(D) = -\iint \rho(x, x') \log D(x, x')\, dx\, dx'.$$
Recall that in PairGAN-Z the discriminator aims to maximize $\iint \rho \log D$. Therefore, our goal is to maximize the expression in the integral pointwise w.r.t. $D(x, x')$. The optimal discriminator is given by^{2}:
$$D^*(x, x') = \begin{cases} 1, & \rho(x, x') \geq 0, \\ \varepsilon, & \rho(x, x') < 0. \end{cases}$$

^{2} We restrict the discriminator output to $D \in [\varepsilon, 1]$ in order for the discriminator loss to be bounded. For $\rho(x, x') < 0$, an unrestricted discriminator can drive $D(x, x')$ to $0$, making the loss unbounded.

The logarithm of $D^*$ can be written as:
$$\log D^*(x, x') = \log(\varepsilon) \cdot \mathbb{1}\left[ \rho(x, x') < 0 \right].$$
We substitute $D^*$ into the generator loss and obtain:
$$L_G(D^*) = \iint \rho \log D^*\, dx\, dx' = \log(1/\varepsilon) \iint |\rho| \cdot \mathbb{1}\left[ \rho < 0 \right] dx\, dx' = 2 \log(1/\varepsilon)\, \mathrm{TV}(s, d).$$