relativistic-f-divergences
Code from paper "On Relativistic f-divergences" (http://arxiv.org/abs/1901.02474)
This paper provides a more rigorous look at Relativistic Generative Adversarial Networks (RGANs). We prove that the objective function of the discriminator is a statistical divergence for any concave function f with minimal properties (f(0)=0, f'(0) ≠ 0, sup_x f(x)>0). We also devise a few variants of relativistic f-divergences. Wasserstein GAN was originally justified by the idea that the Wasserstein distance (WD) is most sensible because it is weak (i.e., it induces a weak topology). We show that the WD is weaker than f-divergences which are weaker than relativistic f-divergences. Given the good performance of RGANs, this suggests that WGAN does not perform well primarily because of the weak metric, but rather because of regularization and the use of a relativistic discriminator. We also take a closer look at estimators of relativistic f-divergences. We introduce the minimum-variance unbiased estimator (MVUE) for Relativistic paired GANs (RpGANs; originally called RGANs which could bring confusion) and show that it does not perform better. Furthermore, we show that the estimator of Relativistic average GANs (RaGANs) is only asymptotically unbiased, but that the finite-sample bias is small. Removing this bias does not improve performance.
Generative adversarial networks (GANs) (Goodfellow et al., 2014) are a very popular approach to approximately generate data from a complex probability distribution using only samples of data (without any information on the true data distribution). Most notably, they have been very successful at generating photo-realistic images (Karras et al., 2017) (Karras et al., 2018). A GAN consists of a game between two neural networks, the generator $G$ and the discriminator $D$. The goal of $D$ is to classify real from fake (generated) data. The goal of $G$ is to generate fake data that appears to be real, thus "fooling" $D$ into thinking that fake data is actually real.

There are many GAN variants and most of them consist of changing the loss function of $D$. To name a few: Standard GAN (SGAN) (Goodfellow et al., 2014), Least-Squares GAN (LSGAN) (Mao et al., 2017), Hinge-loss GAN (HingeGAN) (Miyato et al., 2018), Wasserstein GAN (WGAN) (Arjovsky et al., 2017).

For most GAN variants, training is equivalent to estimating a divergence: SGAN estimates the Jensen–Shannon divergence (JSD), LSGAN estimates the Pearson divergence, HingeGAN estimates the Reverse-KL divergence, and WGAN estimates the Wasserstein distance. Even more generally, $f$-GANs (Nowozin et al., 2016) estimate any $f$-divergence (which includes most of the popular divergences), while IPM-based GANs (Mroueh and Sercu, 2017) estimate any Integral probability metric (IPM) (Müller, 1997). Thus, intuitively, GANs can be thought of as approximately minimizing a divergence (this is not technically correct; see Jolicoeur-Martineau (2018a)).
Recently, Jolicoeur-Martineau (2018b) showed that IPM-based GANs possess a unique type of discriminator which they call a Relativistic Discriminator (RD). They explained that one can construct $f$-GANs while using a RD and that doing so improves the stability of the training and quality of generated data. They called this approach Relativistic GANs (RGANs). They proposed two variants: Relativistic paired GANs (RpGANs) and Relativistic average GANs (RaGANs).

Footnote 1: We added the word "paired" to better distinguish the variant with paired real/fake data (originally called RGANs) and the general approach called Relativistic GANs (RGANs).
Jolicoeur-Martineau (2018b) provided mathematical and intuitive arguments as to why having a Relativistic Discriminator (RD) may be helpful. However, they did not show that the loss functions are mathematically sensible as they did not show that these form statistical divergences. Furthermore, the estimators that they used were not the minimum-variance unbiased estimators (MVUE).
The contributions of this paper are the following:
We prove that the objective functions of the discriminator in RGANs are divergences (relativistic $f$-divergences).
We devise a few variants of relativistic $f$-divergences.
We show that the Wasserstein distance is weaker than $f$-divergences which are weaker than relativistic $f$-divergences.
We present the minimum-variance unbiased estimator (MVUE) of RpGANs and show that using it hinders the performance of the generator.
We show that RaGANs are only asymptotically unbiased, but that the finite-sample bias is small. Removing this bias does not improve the performance of the generator.
For the rest of the paper, we focus on the critic $C$ instead of the discriminator $D$. The critic is the discriminator before applying the activation function ($D(x) = a(C(x))$, where $a$ is an activation function and $C(x) \in \mathbb{R}$). Intuitively, the critic can be thought of as describing how realistic $x$ is. In the case of SGAN and HingeGAN, a large $C(x)$ means that $x$ is realistic, while a small $C(x)$ means that $x$ is not realistic. We use this notation because Relativistic GANs are defined in terms of the critic rather than the discriminator.

GANs can be defined very generally in the following way:

$$\sup_{C: \mathcal{X} \to \mathbb{R}} \; \mathbb{E}_{x \sim \mathbb{P}}[f_1(C(x))] + \mathbb{E}_{z \sim \mathbb{Z}}[f_2(C(G(z)))] \tag{1}$$

$$\sup_{G} \; \mathbb{E}_{x \sim \mathbb{P}}[g_1(C(x))] + \mathbb{E}_{z \sim \mathbb{Z}}[g_2(C(G(z)))] \tag{2}$$

where $f_1, f_2, g_1, g_2: \mathbb{R} \to \mathbb{R}$, $\mathbb{P}$ is the distribution of real data with support $\mathcal{X}$, $\mathbb{Z}$ is the latent distribution (generally a multivariate normal distribution), $C(x)$ is the critic evaluated at $x$, $G(z)$ is the generator evaluated at $z$, and $G(z) \sim \mathbb{Q}$, where $\mathbb{Q}$ is the distribution of fake data. See Brock et al. (2018) for details on how different choices of $\mathbb{Z}$ perform. The critic and the generator are generally trained with stochastic gradient descent (SGD) in alternating steps.
Most GANs can be separated into two classes: those with non-saturating and those with saturating loss functions. GANs with the saturating loss are such that $g_1 = -f_1$ and $g_2 = -f_2$, while GANs with the non-saturating loss are such that $g_1 = f_2$ and $g_2 = f_1$. In this paper, we will assume that the non-saturating loss is always used as it generally works best in practice (Goodfellow et al., 2014) (Nowozin et al., 2016). Note that $g_1$ is also generally not included as its gradient with respect to $G$ is zero.
Although not always the case, the most popular GAN loss functions (SGAN, LSGAN with labels -1/1, HingeGAN, WGAN) are symmetric (i.e., $f_1(-y) = f_2(y)$). For simplicity, in this paper, we restrict ourselves to symmetric loss functions.
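Non-saturating Symmetric GANs (SyGANs) can be represented more simply as:

$$\sup_{C: \mathcal{X} \to \mathbb{R}} \; \mathbb{E}_{x \sim \mathbb{P}}[f(C(x))] + \mathbb{E}_{z \sim \mathbb{Z}}[f(-C(G(z)))] \tag{3}$$

$$\sup_{G} \; \mathbb{E}_{z \sim \mathbb{Z}}[f(C(G(z)))] \tag{4}$$

for some function $f: \mathbb{R} \to \mathbb{R}$. For easier optimization, we generally want $f$ to be concave with respect to the critic. This is the case in symmetric $f$-GANs since , for some convex function and non-decreasing function , is concave.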
In this paper, we restrict our relativistic divergences to symmetric cases with concave $f$. Although this may be somewhat constraining, not making these assumptions would be very problematic for GANs. By not assuming concavity, we could have an objective function that diverges to infinity (and thus an infinite divergence). This is particularly problematic for GANs because early in training, we expect $\mathbb{P}$ and $\mathbb{Q}$ to be perfectly separated (because of fully disjoint supports). This would cause the objective function to explode towards infinity and thereby cause severe instabilities. The Kullback–Leibler (KL) divergence is a good example of such a problematic divergence for GANs. If a single sample from the support of $\mathbb{P}$ is not part of the support of $\mathbb{Q}$, the divergence will be $\infty$. Also, note that the dual form of the KL divergence cannot be represented as a SyGAN with equation (3) since $f_1$ and $f_2$ are not symmetric (Nowozin et al., 2016).
Rather than using a concave function to ensure a maximum on the objective function, IPM-based GANs instead force the critic to respect some constraint so that it does not grow too quickly. IPM-based GANs are defined in the following way:
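$$\sup_{C \in \mathcal{F}} \; \mathbb{E}_{x \sim \mathbb{P}}[C(x)] - \mathbb{E}_{z \sim \mathbb{Z}}[C(G(z))] \tag{5}$$

$$\sup_{G} \; \mathbb{E}_{z \sim \mathbb{Z}}[C(G(z))] - \mathbb{E}_{x \sim \mathbb{P}}[C(x)] \tag{6}$$

where $\mathcal{F}$ is the class of functions that defines the IPM. See Mroueh et al. (2017) for an extensive review of the choices of $\mathcal{F}$.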
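Rather than training the critic on real and fake data separately, this approach tries to maximize the critic's difference (CD), but not too much. In Relativistic paired GANs (RpGANs), the CD is defined as $C(x) - C(y)$, while in Relativistic average GANs (RaGANs), the CD is defined as $C(x) - \mathbb{E}_{y \sim \mathbb{Q}} C(y)$ (or vice-versa). The CD can be understood as how much more realistic real data is than fake data. The optimal size of the CD is determined by the choice of $f$. With a least-square loss, the CD must be exactly equal to 1. On the other hand, with a log-sigmoid loss, the CD is grown to around 2 or 3 (after which the gradient of $f$ vanishes to zero). This will be explained in more detail in the next section. Again, we focus only on cases with symmetry (as done with SyGANs).
Relativistic paired GANs (RpGANs) are defined in the following way:

$$\sup_{C: \mathcal{X} \to \mathbb{R}} \; \mathbb{E}_{x \sim \mathbb{P},\, y \sim \mathbb{Q}}[f(C(x) - C(y))] \tag{7}$$

$$\sup_{G} \; \mathbb{E}_{x \sim \mathbb{P},\, y \sim \mathbb{Q}}[f(C(y) - C(x))] \tag{8}$$

Relativistic average GANs (RaGANs) are defined in the following way:

$$\sup_{C: \mathcal{X} \to \mathbb{R}} \; \mathbb{E}_{x \sim \mathbb{P}}\big[f\big(C(x) - \mathbb{E}_{y \sim \mathbb{Q}} C(y)\big)\big] + \mathbb{E}_{y \sim \mathbb{Q}}\big[f\big(\mathbb{E}_{x \sim \mathbb{P}} C(x) - C(y)\big)\big] \tag{9}$$

$$\sup_{G} \; \mathbb{E}_{y \sim \mathbb{Q}}\big[f\big(C(y) - \mathbb{E}_{x \sim \mathbb{P}} C(x)\big)\big] + \mathbb{E}_{x \sim \mathbb{P}}\big[f\big(\mathbb{E}_{y \sim \mathbb{Q}} C(y) - C(x)\big)\big] \tag{10}$$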
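The paired and average critic's differences described above can be sketched on a toy mini-batch of critic scores. This is only an illustration under stated assumptions: expectations are replaced by mini-batch means, the hinge-type function is used as an example of a concave f, and the function names (`rpgan_objectives`, `ragan_objectives`) are hypothetical and not taken from the paper's code.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


def f_hinge(z):
    """Example concave f (hinge type): -max(0, 1 - z) + 1, so f(0) = 0."""
    return 1.0 - max(0.0, 1.0 - z)


def rpgan_objectives(real_scores, fake_scores, f):
    """Paired variant: the CD is C(x_i) - C(y_i) for matched pairs."""
    pairs = list(zip(real_scores, fake_scores))
    critic = mean([f(r - q) for r, q in pairs])      # critic wants real > fake
    generator = mean([f(q - r) for r, q in pairs])   # generator reverses the CD
    return critic, generator


def ragan_objectives(real_scores, fake_scores, f):
    """Average variant: each score is compared to the mean score of the other type."""
    mean_real, mean_fake = mean(real_scores), mean(fake_scores)
    critic = (mean([f(r - mean_fake) for r in real_scores])
              + mean([f(mean_real - q) for q in fake_scores]))
    generator = (mean([f(q - mean_real) for q in fake_scores])
                 + mean([f(mean_fake - r) for r in real_scores]))
    return critic, generator


# Toy critic scores (assumed values, for illustration only).
real = [2.0, 3.0]
fake = [0.0, 1.0]
rp_critic, rp_generator = rpgan_objectives(real, fake, f_hinge)
ra_critic, ra_generator = ragan_objectives(real, fake, f_hinge)
```

In this toy batch the critic already separates real from fake by more than 1, so the hinge-type critic objectives sit at their maximum while the generator objectives are negative.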
We define statistical divergences in the following way:
Let $\mathbb{P}$ and $\mathbb{Q}$ be probability distributions and $S$ be the set of all probability distributions with common support. A function $D: (S, S) \to \mathbb{R}_{\geq 0}$ is a divergence if it respects the following two conditions:

$$D(\mathbb{P}, \mathbb{Q}) \geq 0$$

$$D(\mathbb{P}, \mathbb{Q}) = 0 \iff \mathbb{P} = \mathbb{Q}$$

In other words, divergences are distances between probability distributions. The distribution of real data ($\mathbb{P}$) is fixed and our goal is to modify the distribution of fake data ($\mathbb{Q}$) so that the divergence decreases over training time.
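As discussed in the introduction, in most GANs, the objective function of the critic at optimum is a divergence. We show that the objective function of the critic in RpGANs, RaGANs, and other variants also estimates a divergence. The theorem is as follows:

Let $f: \mathbb{R} \to \mathbb{R}$ be a concave function such that $f(0) = 0$, $f$ is differentiable at 0, $f'(0) \neq 0$, $\sup_x f(x) > 0$, and $\arg\sup_x f(x) > 0$. Let $\mathbb{P}$ and $\mathbb{Q}$ be probability distributions with support $\mathcal{X}$. Let $\mathbb{M} = \frac{1}{2}\mathbb{P} + \frac{1}{2}\mathbb{Q}$. Then, we have that

$$D^{Rp}_f(\mathbb{P}, \mathbb{Q}) = \sup_{C: \mathcal{X} \to \mathbb{R}} \; \mathbb{E}_{x \sim \mathbb{P},\, y \sim \mathbb{Q}}[f(C(x) - C(y))]$$

$$D^{Ra}_f(\mathbb{P}, \mathbb{Q}) = \sup_{C: \mathcal{X} \to \mathbb{R}} \; \mathbb{E}_{x \sim \mathbb{P}}\big[f\big(C(x) - \mathbb{E}_{y \sim \mathbb{Q}} C(y)\big)\big] + \mathbb{E}_{y \sim \mathbb{Q}}\big[f\big(\mathbb{E}_{x \sim \mathbb{P}} C(x) - C(y)\big)\big]$$

$$D^{Ralf}_f(\mathbb{P}, \mathbb{Q}) = \sup_{C: \mathcal{X} \to \mathbb{R}} \; \mathbb{E}_{x \sim \mathbb{P}}\big[f\big(C(x) - \mathbb{E}_{y \sim \mathbb{Q}} C(y)\big)\big]$$

$$D^{Rc}_f(\mathbb{P}, \mathbb{Q}) = \sup_{C: \mathcal{X} \to \mathbb{R}} \; \mathbb{E}_{x \sim \mathbb{P}}\big[f\big(C(x) - \mathbb{E}_{m \sim \mathbb{M}} C(m)\big)\big] + \mathbb{E}_{y \sim \mathbb{Q}}\big[f\big(\mathbb{E}_{m \sim \mathbb{M}} C(m) - C(y)\big)\big]$$

are divergences.

We ask that the supremum of $f$ is reached at some positive $x$ (or at $\infty$). This is purely to ensure that a larger CD can be interpreted as leading to a larger divergence (rather than the opposite). This does not reduce the generality of Theorem 3.1. If $f(x)$ is maximized at some $x^* < 0$, we have that $f(-x)$ is maximized at $-x^* > 0$ and one can simply use $f(-x)$ instead of $f(x)$.

We require that $f$ is differentiable at zero and that its derivative there is non-zero. This assumption may not be necessary, but it is needed for one of our main lemmas, which we use to prove that these objective functions are divergences.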
A one-page sketch of the proof is available in Appendix A; the full proof is found in Appendix B.
Note that $D^{Rp}_f$ corresponds to RpGANs, $D^{Ra}_f$ corresponds to RaGANs, $D^{Ralf}_f$ corresponds to a simplified one-way version of RaGANs (RalfGANs), and $D^{Rc}_f$ corresponds to a new type of RGAN called Relativistic centered GANs (RcGANs). RalfGANs are not particularly interesting as they simply represent a simpler version of RaGANs. On the other hand, RcGANs are interesting as they center the critic scores using the mean of the whole mini-batch (rather than the mean of only real or only fake mini-batch samples). This divergence also has similarities to the Jensen–Shannon divergence (JSD) since the JSD adds the divergence between $\mathbb{P}$ and $\mathbb{M}$ to the divergence between $\mathbb{Q}$ and $\mathbb{M}$.
A logical extension to RcGANs would be to standardize the critic scores; however, this would not lead to a divergence given that we could not control the size of the elements inside . To make it a divergence, we would need a learn-able scaling weight (as in batch norm (Ioffe and Szegedy, 2015)), but this would counter the effect of the standardization. Thus, standardizing and scaling would just correspond to an equivalent re-parametrization of .
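The centering used by RcGANs can be sketched on a toy mini-batch. This is a minimal illustration, assuming equal numbers of real and fake samples (so that the mean over the whole mini-batch stands in for the mean under the mixture of real and fake data) and a hinge-type f; the function name `rcgan_critic_objective` is hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


def f_hinge(z):
    """Example concave f (hinge type): -max(0, 1 - z) + 1, so f(0) = 0."""
    return 1.0 - max(0.0, 1.0 - z)


def rcgan_critic_objective(real_scores, fake_scores, f):
    """Center every critic score with the mean of the whole mini-batch."""
    center = mean(real_scores + fake_scores)  # real and fake scores pooled
    real_term = mean([f(r - center) for r in real_scores])
    fake_term = mean([f(center - q) for q in fake_scores])
    return real_term + fake_term


# Toy critic scores (assumed values, for illustration only).
rc_value = rcgan_critic_objective([2.0, 3.0], [0.0, 1.0], f_hinge)
```

Compared with the average variant, only the reference point changes: both terms are measured against one shared center instead of the mean of the opposite type.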
Figure 1 shows three examples of concave $f$ with the necessary properties to be used in relativistic divergences; they are the concave functions used in SGAN, LSGAN (with labels 1/-1), and HingeGAN. Their respective mathematical functions are

$$f_S(z) = \log(\mathrm{sigmoid}(z)) + \log(2) \tag{11}$$

$$f_{LS}(z) = -(z - 1)^2 + 1 \tag{12}$$

$$f_{Hinge}(z) = -\max(0, 1 - z) + 1 \tag{13}$$

Interestingly, we see that they form three different types of functions. Firstly, we have functions that grow exponentially less as $z$ increases and thus reach their supremum at $\infty$. Secondly, we have functions that grow to a maximum and then forever decrease (thus penalizing large CDs). Thirdly, we have functions that grow to a maximum and then never change. SGAN is of the first type, LSGAN is of the second, and HingeGAN is of the third type.
This shows that for all three types, we have that the CD is only encouraged to grow until a certain point. With the first type, we never truly force the CD to stop growing, but the gradients vanish to zero. Thus, SGD effectively prevents the CDs from growing above a certain level (sigmoid saturates at around 2 or 3).
It is useful to keep in mind that Figure 1 also represents the concave functions used for SyGANs, in which case $f$ applies to real and fake data separately ($f(C(x))$ and $f(-C(y))$).
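The three types of concave functions discussed above can be checked numerically. The sketch below assumes the usual forms for the SGAN, LSGAN (labels 1/-1), and HingeGAN cases, each shifted so that f(0) = 0: log-sigmoid plus log 2, a negated squared distance to 1 plus 1, and a negated hinge plus 1. The function names are illustrative.

```python
import math


def f_sgan(z):
    """log(sigmoid(z)) + log(2), computed stably as -softplus(-z) + log(2)."""
    softplus_neg_z = max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return -softplus_neg_z + math.log(2.0)


def f_lsgan(z):
    """-(z - 1)^2 + 1: maximal at z = 1, then decreasing (penalizes large CDs)."""
    return -(z - 1.0) ** 2 + 1.0


def f_hinge(z):
    """-max(0, 1 - z) + 1: grows until z = 1, then stays flat."""
    return -max(0.0, 1.0 - z) + 1.0


# Values of each function at a few critic's differences.
grid = [-2.0, 0.0, 1.0, 3.0, 10.0]
table = {name: [fn(z) for z in grid]
         for name, fn in [("sgan", f_sgan), ("lsgan", f_lsgan), ("hinge", f_hinge)]}
```

On this grid the first function keeps increasing towards log 2 without reaching it, the second peaks at 1 and then decreases, and the third stays at 1 once the difference exceeds 1.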
The paper by Arjovsky et al. (2017) on using the Wasserstein distance (and other IPMs) for GANs has been extremely influential. In the WGAN paper, the authors suggest that the Wasserstein distance is more appropriate than $f$-divergences for training a critic since it induces the weakest topology possible. Rather than giving a formal definition in terms of topologies, we show a simpler definition (as also done by Arjovsky et al. (2017)):
Let $\mathbb{P}$ be a probability distribution with support $\mathcal{X}$, $(\mathbb{P}_n)_{n \in \mathbb{N}}$ be a sequence of distributions converging to $\mathbb{P}$, and $D_1$ and $D_2$ be statistical divergences (per definition 3.1).

We say that $D_1$ is weaker than $D_2$ if we have that:

$$D_2(\mathbb{P}_n, \mathbb{P}) \to 0 \implies D_1(\mathbb{P}_n, \mathbb{P}) \to 0,$$

but the converse is not true.

We say that $D_1$ is a weakest distance if we have that:

$$D_1(\mathbb{P}_n, \mathbb{P}) \to 0 \iff \mathbb{P}_n \xrightarrow{D} \mathbb{P},$$

where $\xrightarrow{D}$ represents convergence in distribution.

Thus, intuitively, a weaker divergence can be thought of as converging more easily. Arjovsky et al. (2017) showed that the Wasserstein distance is a weakest divergence and that it is weaker than common $f$-divergences (as used in $f$-GANs and standard GANs). They also showed that the Wasserstein distance is continuous with respect to its parameters and they attributed this property to the weakness of the divergence.
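A standard illustration of this ordering uses point masses on the real line. It follows the spirit of the example given by Arjovsky et al. (2017) and is not taken from this paper; the symbols $\delta_a$ (a point mass at $a$), $W$ (the Wasserstein-1 distance), and $\mathrm{JSD}$ are introduced here only for the example.

```latex
% Target distribution and a sequence approaching it
\mathbb{P} = \delta_0, \qquad \mathbb{P}_n = \delta_{1/n}, \qquad n = 1, 2, \ldots

% Wasserstein-1 distance: all of the mass is moved a distance 1/n
W(\mathbb{P}_n, \mathbb{P}) = \frac{1}{n} \xrightarrow[n \to \infty]{} 0

% Jensen-Shannon divergence: the supports are disjoint for every n
\mathrm{JSD}(\mathbb{P}_n, \mathbb{P}) = \log 2 \quad \text{for all } n
```

Here the sequence converges under the Wasserstein distance but not under the JSD, so convergence in the Wasserstein distance does not imply convergence in the JSD. Together with the reverse implication shown by Arjovsky et al. (2017), this is the sense in which the Wasserstein distance is the weaker of the two.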
Considering this argument, one would expect that RaGANs would be weaker than RpGANs, which would in turn be weaker than non-relativistic GANs, since this is the order of their relative performance and stability. Instead, we found the opposite relationship:
Let be a probability distribution with support , be a sequence of distributions converging to , be a concave function such that , is differentiable at 0, , , and . Then, we have that
where is the Wasserstein distance and is the distance in Symmetric GANs (see equation 3).
The proof is in Appendix C.
Given the good performance of RaGANs, this suggests that the argument made by Arjovsky et al. (2017) is insufficient. It only focuses on a perfect sequence of converging distributions, but the generator training does not guarantee a converging sequence of fake data distributions. It ignores the complex dynamics and intricacies of the generator training, which are still not well understood. Furthermore, it assumes an optimal critic which is unobtainable in practice. In practice, trying to obtain a semi-optimal critic requires many iterations and thus a significant amount of additional computational resources.
As previously suggested (Jolicoeur-Martineau, 2018b), what makes the Wasserstein distance a good choice of divergence is likely 1) the constraint on the critic (a Lipschitz critic) and 2) the use of a relativistic discriminator, rather than the weakness of the divergence.
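To estimate RpGANs, Jolicoeur-Martineau (2018b) used the following estimator (see footnote 2):

$$\frac{1}{k} \sum_{i=1}^{k} f(C(x_i) - C(y_i)),$$

where $x_1, \ldots, x_k$ and $y_1, \ldots, y_k$ are samples from $\mathbb{P}$ and $\mathbb{Q}$ respectively.

Footnote 2: Note that their formulation used a different function in place of $f$ because of how they defined the divergence.

Although this is an unbiased estimator of the RpGAN objective $\mathbb{E}_{x \sim \mathbb{P},\, y \sim \mathbb{Q}}[f(C(x) - C(y))]$, it is not the estimator with the minimal variance for a given mini-batch. Using the two-sample version (Lehmann, 1951) of the U-statistic theorem (Hoeffding, 1992) and given that the loss function is symmetric with respect to its arguments, one can show the following:

Let $\mathbb{P}$ and $\mathbb{Q}$ be probability distributions with support $\mathcal{X}$. Let $x_1, \ldots, x_k$ and $y_1, \ldots, y_k$ be i.i.d. samples from $\mathbb{P}$ and $\mathbb{Q}$ respectively. Then, we have that

$$\frac{1}{k^2} \sum_{i=1}^{k} \sum_{j=1}^{k} f(C(x_i) - C(y_j))$$

is the minimum-variance unbiased estimator (MVUE) of $\mathbb{E}_{x \sim \mathbb{P},\, y \sim \mathbb{Q}}[f(C(x) - C(y))]$.

Although it is the MVUE, this estimator requires $O(k^2)$ operations instead of $O(k)$. In the experiments, we will show that using this estimator does not lead to better results. Given the quadratic scaling and lack of performance gain, it may not be worth using.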
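The difference in cost between the two estimators can be sketched as follows. This is a minimal illustration assuming the paired estimator averages f over k matched real/fake pairs while the MVUE averages f over all k × k real/fake combinations; the function names and the toy scores are hypothetical.

```python
def f_hinge(z):
    """Example concave f (hinge type): -max(0, 1 - z) + 1, so f(0) = 0."""
    return 1.0 - max(0.0, 1.0 - z)


def paired_estimate(real_scores, fake_scores, f):
    """Average of f over the k matched pairs: k evaluations of f."""
    k = len(real_scores)
    return sum(f(r - q) for r, q in zip(real_scores, fake_scores)) / k


def all_pairs_estimate(real_scores, fake_scores, f):
    """Average of f over every real/fake combination: k * k evaluations of f."""
    total = sum(f(r - q) for r in real_scores for q in fake_scores)
    return total / (len(real_scores) * len(fake_scores))


# Toy critic scores (assumed values, for illustration only).
real = [1.0, 2.0]
fake = [0.0, 0.5]
paired_value = paired_estimate(real, fake, f_hinge)
all_pairs_value = all_pairs_estimate(real, fake, f_hinge)
```

Both functions target the same expectation; the second reuses each sample k times, which is where the quadratic cost comes from.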
The divergences of RaGANs and RalfGANs assume that one knows the true expectation of the critic of real and fake data. However, in practice, we can only estimate the expectation. Although never explicitly mentioned, Jolicoeur-Martineau (2018b) simply replaced all expectations by the mini-batch mean:

$$\frac{1}{k} \sum_{i=1}^{k} f\Big(C(x_i) - \frac{1}{k} \sum_{j=1}^{k} C(y_j)\Big) + \frac{1}{k} \sum_{i=1}^{k} f\Big(\frac{1}{k} \sum_{j=1}^{k} C(x_j) - C(y_i)\Big),$$

where $k$ is the size of the mini-batch.

Given the non-linear function applied after calculating the CD, the estimators of the divergences of RaGANs are biased with finite batch size $k$. This means that RaGANs are only asymptotically unbiased. How large $k$ must be for the bias to become negligible is unclear.
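The origin of this bias can be made concrete in the least-squares case. The sketch below is an illustration and not the paper's corollary: it assumes f(z) = -(z - 1)^2 + 1, independent real and fake samples, and critic scores drawn uniformly from small discrete sets, so that the expectation over all mini-batches can be enumerated exactly. Under these assumptions, plugging the mini-batch mean of the fake scores into the real-data term lowers its expectation by the variance of that mean, i.e. the fake-score variance divided by k.

```python
from itertools import product


def f_ls(z):
    """Least-squares type concave f: -(z - 1)^2 + 1."""
    return -(z - 1.0) ** 2 + 1.0


def mean(values):
    return sum(values) / len(values)


def expected_plugin_real_term(real_support, fake_support, k):
    """Exact expectation, over all equally likely size-k batches, of
    (1/k) * sum_i f(x_i - mean(y_1..y_k))."""
    total, count = 0.0, 0
    for xs in product(real_support, repeat=k):
        for ys in product(fake_support, repeat=k):
            y_bar = mean(ys)  # mini-batch mean replaces the true fake mean
            total += mean([f_ls(x - y_bar) for x in xs])
            count += 1
    return total / count


def true_real_term(real_support, fake_support):
    """Same term computed with the true mean of the fake scores."""
    mu_fake = mean(fake_support)
    return mean([f_ls(x - mu_fake) for x in real_support])


# Assumed toy score distributions (uniform over these values).
real_support = [1.0, 3.0]
fake_support = [0.0, 2.0]
k = 2

bias = (expected_plugin_real_term(real_support, fake_support, k)
        - true_real_term(real_support, fake_support))
var_q = mean([(y - mean(fake_support)) ** 2 for y in fake_support])
```

In this toy setting the bias equals `-var_q / k`, so it shrinks as the batch size grows, which is consistent with the estimator being asymptotically unbiased.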
We attempted to find a closed form for the bias with $f_S$, $f_{LS}$, and $f_{Hinge}$ (equations 11, 12, 13 and Figure 1). We were only able to find a closed form for the bias with $f_{LS}$. The bias with $f_{LS}$ has a simple form and can thus be removed, as seen below:
Let and be probability distributions with support . Then, we have that
are unbiased estimator of , , and respectively,
where , , , . and .
See Appendix B for the proof. This means that we can estimate the loss function in RaLSGAN, RalfLSGAN, and RcLSGAN without bias. In the experiments, we will show that the bias is negligible with the usual choices of $f$ (equations 11, 12, 13) and batch size (32 or higher).
All experiments were done with the spectral GAN architecture for 32x32 images (see Miyato et al. (2018)) in PyTorch (Paszke et al., 2017). We used the standard hyperparameters: learning rate (lr) = .0002, batch size (k) = 32, and the ADAM optimizer
(Kingma and Ba, 2014) with parameters . We trained the models for 100k iterations with one critic update per generator update. For the datasets, we used CIFAR-10 (Krizhevsky, 2009), CelebA (Liu et al., 2015) and CAT (Zhang et al., 2008). All models were trained using the same seed. To evaluate the quality of generated outputs, we used the Fréchet Inception Distance (FID) (Heusel et al., 2017). For a review of the different evaluation metrics for GANs, please see
Borji (2018).

We approximated the bias of RaGANs and RcGANs by estimating the real/fake critic mean from samples rather than the mini-batch samples. For $f_{LS}$, we were able to calculate the true value of the bias (in expectation, see Corollary 4.2). Results on CIFAR-10 are shown in Figure 2.
For RaGANs, the approximation of the relative bias with $f_{LS}$ was correct from 4k iterations onwards. For all choices of $f$, we observed the same pattern of low approximated relative bias that stabilized to a larger number after a certain number of iterations. We suspect that this may be due to the large instabilities of the first iterations when the discriminator is not optimal. At 15k iterations, all biases were stabilized. We calculated the average of the bias with different $f$ starting at 15k iterations: .995 for the true relative bias with $f_{LS}$, .996 for the approximated relative bias with $f_{LS}$, .994 for the approximated relative bias with , and .997 for the approximated relative bias with .
For RcGANs, the approximation of the bias with $f_{LS}$ was correct from the very beginning of training. All biases were relatively stable over time with the exception of which increased linearly over time (up to around 1.05). We calculated the average of the bias with different $f$: 1.007 for the true relative bias with $f_{LS}$, 1.007 for the approximated relative bias with $f_{LS}$, 1.03 for the approximated relative bias with , and 1.007 for the approximated relative bias with .
Overall, this shows that the bias in the estimators of RaGANs and RcGANs tends to be small. Furthermore, with the exception of , the bias is relatively stable over time. Thus, accounting for the bias may not be necessary.
To test the new relativistic divergences proposed (and verify whether removing the bias in RaGANs is useful), we ran experiments on CIFAR-10 using , on LSUN bedrooms using , and on CAT using (these choices of were arbitrary). Results are shown in Table 1.
Loss | CIFAR-10 | CelebA | CAT
GAN | 31.1 (8.2) | 15.3 (51.8) | 15.2 (11.1)
RpGAN | 31.5 (7.6) | 16.7 (4) | 12.9 (2.3)
RpGAN (MVUE) | 30.2 (11.7) | 21.9 (3.2) | 18.2 (2.9)
RaGAN | 29.2 (7.4) | 15.9 (4.5) | 12.3 (1.2)
RaGAN (unbiased) | 30.3 (12.9) | - | -
RcGAN | 31.7 (8) | 18.1 (2.9) | 16.5 (7.1)
RcGAN (unbiased) | 32.3 (8.7) | - | -
Table 1: Minimum (and standard deviation) of the FID calculated at 10k, 20k, … , 100k iterations using different loss functions (see equations 11, 12, 13) and datasets.
Using the MVUE for RpGAN resulted in the generator having a worse performance on CIFAR-10, CelebA, and CAT (see Table 1). Similarly, using the unbiased estimator made the generator perform slightly worse for RaLSGAN and RcLSGAN. These results are surprising as they suggest that using noisy or slightly biased estimators may be beneficial.
Most importantly, we proved that the objective function of the critic in RGANs is a divergence.
In addition, we showed that $f$-divergences are weaker than relativistic $f$-divergences. Thus, the weakness of the topology induced by a divergence alone cannot explain why WGAN performs well.
Finally, we took a closer look at the estimators of RGANs and found that 1) the estimator of RpGANs used by Jolicoeur-Martineau (2018b) is not the minimum-variance unbiased estimator (MVUE) and 2) the estimators of RaGANs and RalfGANs are slightly biased with finite batch sizes. Surprisingly, we found that neither using the MVUE with RpGANs nor using an unbiased estimator with RaGANs and RalfGANs improved the performance. On the contrary, using better estimators always slightly decreased the quality of generated samples. This suggests that using noisy estimates of the divergences may be beneficial as a regularization mechanism. This could be explained by vanishing gradients when the discriminator becomes closer to optimality (Arjovsky and Bottou, 2017).
It still remains a mystery why RaGANs are better than RpGANs and what direct mechanism leads RGANs to perform in a much more stable manner. Future work should attempt to better understand the effect of the critic's difference on training. Our experiments were limited to the generation of small images; thus, we encourage further experiments with the MVUE and the unbiased estimator of RaLSGAN in different settings.
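Although the four divergences have separate proofs, a similar framework is used in each proof. Each proof consists of three steps. For clarity of notation, let $D_f(\mathbb{P}, \mathbb{Q}) = \sup_{C: \mathcal{X} \to \mathbb{R}} F(\mathbb{P}, \mathbb{Q}, C, f)$ be the divergence, where $F$ is any of the objective functions in Theorem 3.1.
First, we show that $D_f(\mathbb{P}, \mathbb{Q}) \geq 0$. This is easily proven by taking the simplest possible choice of critic, which does not depend on the probability distributions, i.e., $C(x) = 0$ for all $x$. This critic always leads to a critic's difference of 0 and thus, since $f(0) = 0$, to an objective function equal to 0. This means that $D_f(\mathbb{P}, \mathbb{Q}) \geq 0$.
Second, we show that $\mathbb{P} = \mathbb{Q} \implies D_f(\mathbb{P}, \mathbb{Q}) = 0$. This step generally relies on Jensen's inequality (for concave functions) which we use to show that $D_f(\mathbb{P}, \mathbb{Q}) \leq 0$. Given that $D_f(\mathbb{P}, \mathbb{Q}) \geq 0$ and $D_f(\mathbb{P}, \mathbb{Q}) \leq 0$, we have that $D_f(\mathbb{P}, \mathbb{Q}) = 0$.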
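The Jensen step can be written out for the paired variant as a worked example. This is a sketch under the assumption that the paired objective is the expectation of $f$ applied to the difference of critic scores of independent real and fake samples; the other variants need their own (similar) argument.

```latex
% Assume P = Q and let C be any critic. Since f is concave, Jensen's inequality gives
\mathbb{E}_{x \sim \mathbb{P},\, y \sim \mathbb{Q}}\left[ f\big(C(x) - C(y)\big) \right]
  \;\leq\; f\left( \mathbb{E}_{x \sim \mathbb{P}}[C(x)] - \mathbb{E}_{y \sim \mathbb{Q}}[C(y)] \right)

% With P = Q both expectations coincide, so the argument of f is 0
  \;=\; f(0) \;=\; 0

% The bound holds for every critic C, hence also for the supremum over C
\sup_{C} \; \mathbb{E}_{x \sim \mathbb{P},\, y \sim \mathbb{Q}}\left[ f\big(C(x) - C(y)\big) \right] \;\leq\; 0
```

Combined with the lower bound of 0 obtained from the constant critic in the first step, the supremum is exactly 0 when the two distributions are equal.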
Third, we show that . This step is by far the most difficult to prove. Instead of showing it directly, we instead prove it by contraposition, i.e., we show that . To prove this, we use the fact that if
, there must be values of the probability density functions,
and respectively, such that (and vice versa). Let , we know that this set is not empty. To make the proof as simple as possible, we use the following sub-optimal critic:where . This critic function is very simple, but, as we will show, there exists a such that this leads to an objective function greater than 0 which means that the divergence is also greater than 0.
With this critic in mind, our goal is to transform the problem into the following:
where , for some and s.t. . We have been able to show this with all divergences.
We want to find a large enough so that the positive term () is big, but small enough so that the negative term () is not too big. The main caveat is that, by concavity, . This means that the negative term is always bigger in absolute value than the positive term. This is problematic, since could be be very close to and we want to get and show that we have a divergence. The solution is to choose to be very small. By continuity of the concave function, if we make small enough (very close to 0), we can reach a point where . In which case, if , we have that
In the actual proof, we show that there always exists a small enough such that any leads to . This concludes the sketch of the proof.
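Let $\mathbb{P}$ and $\mathbb{Q}$ be probability distributions and $S$ be the set of all probability distributions with common support. A function $D: (S, S) \to \mathbb{R}_{\geq 0}$ is a divergence if it respects the following two conditions:

$$D(\mathbb{P}, \mathbb{Q}) \geq 0$$

$$D(\mathbb{P}, \mathbb{Q}) = 0 \iff \mathbb{P} = \mathbb{Q}$$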
A function is concave on if and only if
Let be a concave function on , we have that
and
Let .
If , we have that .
If , we have that .
Either way, by concavity, we have that
If , we have that:
If , we have that:
∎
Let be a concave function such that . We have that
If we have that .
By Lemma A.1 , we have that
If , we have that .
By Lemma A.1, we have that
Thus, when , we have that
∎
Let be a concave function such that , is differentiable at 0, , , and . Let , where , , and .
If , , s.t.
If , , s.t. .
By concavity, for all , we have .
This means that for any , we have that .
By concavity, for all , we have that .
Thus, for all we have that .
This means that and .
Let , where .
We can show that:
If , by concavity we have that , thus .
Let , where .
By the definition of the limit, s.t. s.t. , we have
Since this is true for all s.t. , this is also true for all .
This means that
If , let , , and we have for all .
If , let , , and we have for all .
∎
Let be a concave function such that , is differentiable at 0, , , and . Let and be probability distributions with support . Then, we have that
is a divergence.
Let (worst possible choice of ).
Let (best possible choice of ).
#1 Proof that
.
#2 Proof that
Since , we have that .
#3 Proof that
We prove this by contraposition (i.e., we prove that ). To do so, we design a function that is better than the worse option ().
Assume that .
Let ^{3}^{3}3If and have probability density functions and respectively, then ..
Let .
Let .
Since , we know that .
This means that , , and .
Let , where .
Let .
We have that