OCFGAN
PyTorch implementation of OCFGAN-GP (CVPR 2020, Oral).
In this paper, we formulate the problem of learning an Implicit Generative Model (IGM) as minimizing the expected distance between characteristic functions. Specifically, we match the characteristic functions of the real and generated data distributions under a suitably-chosen weighting distribution. This distance measure, which we term the characteristic function distance (CFD), can be (approximately) computed with linear time-complexity in the number of samples, compared to the quadratic-time Maximum Mean Discrepancy (MMD). By replacing the discrepancy measure in the critic of a GAN with the CFD, we obtain a model that is simple to implement and stable to train; the proposed metric enjoys desirable theoretical properties including continuity and differentiability with respect to generator parameters, and continuity in the weak topology. We further propose a variation of the CFD in which the weighting distribution parameters are also optimized during training; this obviates the need for manual tuning and leads to an improvement in test power relative to CFD. Experiments show that our proposed method outperforms WGAN and MMD-GAN variants on a variety of unsupervised image generation benchmark datasets.
Implicit Generative Models (IGMs), such as Generative Adversarial Networks (GANs) [Goodfellow et al.2014], seek to learn a model of an underlying data distribution using samples from that distribution. Unlike prescribed probabilistic models, IGMs do not require a likelihood function and are thus appealing when the data likelihood is unknown or intractable. Empirically, GANs have excelled at numerous tasks, from unsupervised image generation [Karras, Laine, and Aila2019] to policy learning [Ho and Ermon2016].
Unfortunately, the original GAN suffered from optimization instability and mode collapse, and required various ad-hoc tricks to stabilize training [Radford, Metz, and Chintala2015]. Subsequent research has revealed that the generator-discriminator setup in the GAN minimizes the Jensen-Shannon divergence between the real and generated data distributions; this divergence possesses discontinuities that result in uninformative gradients as the generated distribution approaches the real one, which hampers training. Various works have since established desirable properties for a divergence to satisfy for use in GAN training, and proposed alternative training schemes [Arjovsky and Bottou2017, Salimans et al.2016, Arjovsky, Chintala, and Bottou2017], primarily using distances belonging to the Integral Probability Metric (IPM) family [Müller1997]. One popular IPM is the kernel-based measure Maximum Mean Discrepancy (MMD), and a significant portion of recent work has focused on deriving better MMD-GAN variants [Li et al.2017, Binkowski et al.2018, Arbel et al.2018, Li et al.2019].
In this paper, we take a different, more elementary approach, and formulate the problem of learning an IGM from data as that of matching the characteristic functions of the real and generated data distributions. Characteristic functions are widespread in probability theory and have been used for two-sample testing [Heathcote1972, Epps and Singleton1986, Chwialkowski et al.2015], yet surprisingly, have not been investigated for GAN training. We find this approach leads to a simple and computationally-efficient loss: the characteristic function distance (CFD). Computing CFD is linear in the number of samples (unlike the quadratic-time MMD), and our experiments show CFD minimization results in effective GAN training.
This work provides both theoretical and empirical support for using CFDs to train IGMs. We first establish that the CFD is continuous and differentiable almost everywhere with respect to the parameters of the generator, and that it satisfies continuity in the weak topology; these are key properties that make it a suitable GAN metric [Arjovsky, Chintala, and Bottou2017, Li et al.2017]. We provide novel direct proofs that supplement the existing theory on GAN training metrics. Given these properties, our key idea is simple: train GANs using empirical estimates of the CFD under optimized weighting distributions. We report on empirical results using synthetic distributions and four benchmark image datasets (MNIST, CIFAR10, STL10, CelebA). Our experiments show that the CFD-based approach outperforms state-of-the-art models (WGAN and MMD-GAN variants) on quantitative evaluation metrics. From a practical perspective, we find the CFD-based GANs are simple to implement and stable to train.
In summary, the key contributions of this work are:
A novel approach to train implicit generative models using a CFD-loss derived from characteristic functions;
Theoretical results showing that the proposed loss metric for the critic is continuous and differentiable in the parameters of the generator and is continuous in the weak topology;
Experimental results showing our approach leads to effective generative models compared to state-of-the-art models (WGAN and MMD-GAN variants) on a variety of synthetic and real-world datasets.
We begin by providing a brief review of the GAN framework and recent distance-based methods for training GANs. A Generative Adversarial Network (GAN) is an implicit generative model that seeks to learn the data distribution given samples from it. The GAN consists of a generator network and a critic network (also called the discriminator). The generator transforms a latent vector sampled from a simple distribution (e.g., Gaussian) to a vector in the data space. The original GAN [Goodfellow et al.2014] was defined via an adversarial two-player game between the critic and the generator; the critic attempts to distinguish the true data samples from ones obtained from the generator, and the generator attempts to make its samples indistinguishable from the true data.
In more recent work, this two-player game is cast as minimizing a divergence between the real data distribution and the generated distribution. The critic evaluates some probability divergence between the true and generated samples, and is optimized to maximize this divergence. In the original GAN, the associated (implicit) distance is the Jensen-Shannon divergence, but alternative divergences have since been introduced. For example, the 1-Wasserstein distance [Arjovsky, Chintala, and Bottou2017, Gulrajani et al.2017], Cramer distance [Bellemare et al.2017], maximum mean discrepancy (MMD) [Li et al.2017, Binkowski et al.2018, Arbel et al.2018], and Sobolev IPM [Mroueh et al.2017] have been proposed. Many distances proposed in the literature can be reduced to the Integral Probability Metric (IPM) framework with different restrictions on the function class.
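In this work, we propose to train GANs using an IPM based on characteristic functions (CFs). Letting $\mathbb{P}$ be the probability measure associated with a real-valued random variable $X$, the characteristic function $\varphi_{\mathbb{P}}: \mathbb{R}^d \to \mathbb{C}$ of $X$ is given by
$$\varphi_{\mathbb{P}}(t) = \mathbb{E}_{x \sim \mathbb{P}}\left[e^{i\langle t, x\rangle}\right] = \int e^{i\langle t, x\rangle}\, d\mathbb{P}, \tag{1}$$
where $t \in \mathbb{R}^d$ is the input argument. Characteristic functions are widespread in probability theory, and are often used as an alternative to probability density functions. The characteristic function of a random variable completely defines it, i.e., for two distributions $\mathbb{P}$ and $\mathbb{Q}$, $\mathbb{P} = \mathbb{Q}$ if and only if $\varphi_{\mathbb{P}} = \varphi_{\mathbb{Q}}$. Unlike the density function, the characteristic function always exists, and is uniformly continuous and bounded: $|\varphi_{\mathbb{P}}(t)| \le 1$.
The squared Characteristic Function Distance (CFD) between two distributions $\mathbb{P}$ and $\mathbb{Q}$ is given by the weighted integrated squared error between their characteristic functions
$$\mathrm{CFD}^2_\omega(\mathbb{P}, \mathbb{Q}) = \int_{\mathbb{R}^d} \left|\varphi_{\mathbb{P}}(t) - \varphi_{\mathbb{Q}}(t)\right|^2 \omega(t; \eta)\, dt, \tag{2}$$
where $\omega(t; \eta)$ is a weighting function, which we henceforth assume to be parametrized by $\eta$, and chosen such that the integral in Eq. (2) converges. When $\omega(t; \eta)$ is the probability density function of a distribution on $\mathbb{R}^d$, the integral in Eq. (2) can be written as an expectation:
$$\mathrm{CFD}^2_\omega(\mathbb{P}, \mathbb{Q}) = \mathbb{E}_{t \sim \omega(t; \eta)}\left[\left|\varphi_{\mathbb{P}}(t) - \varphi_{\mathbb{Q}}(t)\right|^2\right]. \tag{3}$$
By analogy to Fourier analysis in signal processing, Eq. (3) can be interpreted as the expected discrepancy between the Fourier transforms of two signals at frequencies sampled from $\omega(t; \eta)$. If $\mathrm{supp}(\omega) = \mathbb{R}^d$, it can be shown using the uniqueness theorem of characteristic functions that $\mathrm{CFD}_\omega(\mathbb{P}, \mathbb{Q}) = 0 \iff \mathbb{P} = \mathbb{Q}$ [Sriperumbudur et al.2010].
In practice, the CFD can be approximated using empirical characteristic functions and finite samples from the weighting distribution $\omega(t; \eta)$. To elaborate, the characteristic function of a degenerate distribution $\delta_a$ for $a \in \mathbb{R}^d$ is given by $e^{i\langle t, a\rangle}$ where $t \in \mathbb{R}^d$. Given observations $\{x_1, \dots, x_n\}$ from a probability distribution $\mathbb{P}$, the empirical distribution is a mixture of degenerate distributions with equal weights, and the corresponding empirical characteristic function $\hat{\varphi}_{\mathbb{P}}$ is a weighted sum of characteristic functions of degenerate distributions:
$$\hat{\varphi}_{\mathbb{P}}(t) = \frac{1}{n} \sum_{j=1}^{n} e^{i\langle t, x_j\rangle}. \tag{4}$$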
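As a small illustration of the empirical characteristic function in Eq. (4), the following sketch compares it against a case with a known closed form: for a standard normal, the characteristic function is $e^{-t^2/2}$. The helper name `empirical_cf`, the sample size, and the test frequencies are assumptions made for this example and do not come from the paper or its code.

```python
import cmath
import math
import random


def empirical_cf(samples, t):
    """Empirical characteristic function of 1-D samples at frequency t (Eq. 4)."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)


# Illustrative check (not from the paper): for a standard normal,
# the exact characteristic function is exp(-t^2 / 2).
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]

for t in (0.0, 0.5, 1.0, 2.0):
    ecf = empirical_cf(samples, t)
    exact = math.exp(-t * t / 2.0)
    print(f"t={t:.1f}  |ECF - exact| = {abs(ecf - exact):.4f}")
```

The estimate is an average of unit-modulus complex numbers, so its magnitude never exceeds 1, and its error shrinks roughly as $1/\sqrt{n}$.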
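Let $X := \{x_1, \dots, x_n\}$ and $Y := \{y_1, \dots, y_m\}$ with $x_i, y_i \in \mathbb{R}^d$ be samples from the distributions $\mathbb{P}$ and $\mathbb{Q}$ respectively, and let $t_1, \dots, t_k$ be samples from $\omega(t; \eta)$. Then, we define the empirical characteristic function distance (ECFD) between $\mathbb{P}$ and $\mathbb{Q}$ as
$$\mathrm{ECFD}^2_\omega(\mathbb{P}, \mathbb{Q}) = \frac{1}{k} \sum_{i=1}^{k} \left|\hat{\varphi}_{\mathbb{P}}(t_i) - \hat{\varphi}_{\mathbb{Q}}(t_i)\right|^2, \tag{5}$$
where $\hat{\varphi}_{\mathbb{P}}$ and $\hat{\varphi}_{\mathbb{Q}}$ are the empirical CFs, computed using $X$ and $Y$ respectively.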
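The ECFD defined above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: the function names, the Gaussian weighting with unit scale, and the toy data (two 2-D Gaussians that either coincide or differ by a mean shift) are all assumptions chosen for the example.

```python
import cmath
import random


def ecf(samples, t):
    """Empirical CF of d-dimensional samples at frequency vector t."""
    total = 0j
    for x in samples:
        total += cmath.exp(1j * sum(ti * xi for ti, xi in zip(t, x)))
    return total / len(samples)


def ecfd_squared(xs, ys, freqs):
    """Squared ECFD: mean squared gap between the two ECFs over the frequencies."""
    return sum(abs(ecf(xs, t) - ecf(ys, t)) ** 2 for t in freqs) / len(freqs)


rng = random.Random(0)
d, n, k = 2, 2000, 8
sigma = 1.0  # assumed scale of the Gaussian weighting distribution


def gaussian_batch(mean, count):
    """Draw `count` samples from an isotropic unit-variance Gaussian in d dims."""
    return [[rng.gauss(mean, 1.0) for _ in range(d)] for _ in range(count)]


# Frequencies t_1..t_k sampled from the weighting distribution N(0, sigma^2 I).
freqs = [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(k)]

same = ecfd_squared(gaussian_batch(0.0, n), gaussian_batch(0.0, n), freqs)
shifted = ecfd_squared(gaussian_batch(0.0, n), gaussian_batch(1.5, n), freqs)
print(f"same distribution:    {same:.5f}")
print(f"shifted distribution: {shifted:.5f}")
```

Each ECF costs one pass over the samples per frequency, so the total work grows as the number of samples times the number of frequencies, with no pairwise sample comparisons.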
A quantity related to CFD (Eq. 2) has been studied in [Paulson, Holcomb, and Leitch1975] and [Heathcote1977], in which the discrepancy between the analytical and empirical characteristic functions of stable distributions is minimized for parameter estimation. The CFD is well-suited to this application because stable distributions do not admit density functions, making maximum likelihood estimation difficult. Parameter fitting has also been explored for other models such as mixture-of-Gaussians, stable ARMA process, and affine jump diffusion models [Yu2004].
More recently, [Chwialkowski et al.2015] proposed fast ($O(n)$ in the number of samples $n$) two-sample tests based on ECFD, as well as a smoothed version of ECFD in which the characteristic function is convolved with an analytic kernel. The authors empirically show that ECFD and its smoothed variant have a better test-power/run-time trade-off compared to quadratic time tests, and better test power than the sub-quadratic time variants of MMD.
The choice of $\omega(t; \eta)$ is important for the success of ECFD in distinguishing two different distributions; choosing an appropriate distribution and/or set of parameters $\eta$ allows better coverage of the frequencies at which the differences between $\mathbb{P}$ and $\mathbb{Q}$ lie. For instance, if the differences are concentrated at frequencies far away from the origin and $\omega(t; \eta)$ is Gaussian, the test power can be improved by suitably enlarging the variance of each coordinate of $\omega(t; \eta)$.
To increase the power of ECFD, we propose to optimize the parameters $\eta$ (e.g., variances associated with a normal distribution) of the weighting distribution $\omega(t; \eta)$ to maximize the power of the test. However, care should be taken in how rich the class of all functions $\omega(t; \eta)$ is (i.e., choosing which parameters to optimize, and the associated constraints), since excessive optimization may cause the test to fixate on differences that are merely due to fluctuations in the sampling. As an extreme example, we found that optimizing the frequencies $t_i$ directly (instead of optimizing the weighting distribution) severely degrades the test's ability to correctly accept the null hypothesis $\mathbb{P} = \mathbb{Q}$.
To validate our approach, we conducted a basic experiment using high-dimensional Gaussians, similar to [Chwialkowski et al.2015]. Specifically, we used two multivariate Gaussians $\mathbb{P}$ and $\mathbb{Q}$ that have the same mean in all dimensions except one. As the dimensionality increases, it becomes increasingly difficult to distinguish between samples from the two distributions. In our tests, the weighting distribution $\omega(t; \eta)$ was chosen to be a Gaussian distribution, and the number of frequencies ($k$) was set to 3. We optimize the parameter vector $\eta$ to maximize the ECFD using the Adam optimizer for 100 iterations with a batch size of 1000.
Fig. 1(a) shows the variation of the test power (i.e., the fraction of times the null hypothesis is rejected) with the number of dimensions. OECFD refers to the optimized ECFD, and the “Smooth” suffix indicates the smoothed ECFD variant of [Chwialkowski et al.2015]. We see that optimization of $\eta$ increases the power of ECFD and ECFD-Smooth, particularly at the higher dimensionalities. There do not appear to be significant differences between the optimized smoothed and non-smoothed ECFD variants. Moreover, the optimization improved the ability of the test to correctly distinguish the two different distributions, but did not hamper its ability to correctly accept the null hypothesis when the distributions are the same.
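The idea of tuning the scale of the weighting distribution, rather than the frequencies themselves, can be illustrated with a toy sketch. The paper optimizes the scale with Adam; the version below swaps in a plain grid search so that it runs with the standard library only. The frequencies are written as a scale times fixed base noise, so only the scale changes across candidates. The data (two 1-D Gaussians with a small mean shift), the grid, and all names are assumptions made for this example.

```python
import cmath
import random


def ecf(samples, t):
    """Empirical characteristic function of 1-D samples at frequency t."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)


def ecfd_squared(xs, ys, freqs):
    """Squared ECFD over a fixed list of frequencies."""
    return sum(abs(ecf(xs, t) - ecf(ys, t)) ** 2 for t in freqs) / len(freqs)


rng = random.Random(0)
n, k = 2000, 3
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [rng.gauss(0.5, 1.0) for _ in range(n)]  # hypothetical small mean shift

# Reparameterize t = sigma * eps with eps ~ N(0, 1) held fixed, so that only
# the scale of the weighting distribution is tuned, not the frequencies.
eps = [rng.gauss(0.0, 1.0) for _ in range(k)]

# Grid search stands in for the gradient-based (Adam) optimization in the text.
grid = [0.1, 0.25, 0.5, 1.0, 2.0, 4.0]
scores = {s: ecfd_squared(xs, ys, [s * e for e in eps]) for s in grid}
best_sigma = max(scores, key=scores.get)

for s in grid:
    print(f"sigma={s:<4}  ECFD^2={scores[s]:.5f}")
print("selected scale:", best_sigma)
```

Because the search only moves a single scale parameter per dimension, it has far less freedom to fit sampling noise than a search over the individual frequencies would.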
To investigate how the weighting distribution is adapted, we visualize two dimensions from the dataset: one in which $\mathbb{P}$ and $\mathbb{Q}$ agree and one in which they differ. Fig. 1(b) shows the absolute difference between the ECFs of $\mathbb{P}$ and $\mathbb{Q}$, with the corresponding dimensions of the weighting distribution plotted in both dimensions. The solid blue line shows the optimized distribution (for OECFD) while the dashed orange line shows the initial distribution (i.e., for ECFD and ECFD-Smooth). In the dimension where the two distributions agree, the scale of the weighting distribution deviates only slightly from its initial value. However, in the dimension where the distributions are different, the increase in variance is more pronounced to compensate for the spread of difference between the ECFs away from the origin.
In this section, we turn our attention to applying the (optimized) CFD for learning IGMs, specifically GANs. As in the standard GAN, our model comprises a generator $g_\theta$ and a critic $f_\phi$, with parameter vectors $\theta$ and $\phi$, and data/latent spaces $\mathcal{X}$ and $\mathcal{Z}$. Below, we write $\Theta$, $\Phi$, $\Pi$ for the spaces in which the parameters $\theta$, $\phi$, $\eta$ lie.
The generator minimizes the empirical CFD between the real and generated data. Instead of minimizing the distance between characteristic functions of raw high-dimensional data, we use a critic neural network $f_\phi$ that is trained to maximize the CFD between real and generated data distributions in a learned lower-dimensional space. This results in the following minimax objective for the IGM:
$$\inf_{\theta \in \Theta} \sup_{\psi \in \Psi} \mathrm{CFD}^2_\omega\left(\mathbb{P}_{f_\phi(\mathcal{X})}, \mathbb{P}_{f_\phi(g_\theta(\mathcal{Z}))}\right), \tag{6}$$
where $\psi = \{\phi, \eta\}$ (with corresponding parameter space $\Psi$), and $\eta$ is the parameter vector of the weighting distribution $\omega$. The optimization over $\eta$ is omitted if we choose to not optimize the weighting distribution. In our experiments, we set $\eta = \sigma$, with $\sigma$ indicating the scale of each dimension of $\omega$. Since evaluating the CFD requires knowledge of the data distribution, in practice, we optimize the empirical estimate $\mathrm{ECFD}^2_\omega$ instead of $\mathrm{CFD}^2_\omega$. We henceforth refer to this model as the Characteristic Function Generative Adversarial Network (CF-GAN).
Similar to recently proposed Wasserstein [Arjovsky, Chintala, and Bottou2017] and MMD [Li et al.2017] GANs, the CFD exhibits desirable mathematical properties. Specifically, CFD is continuous and differentiable almost everywhere. Moreover, as a weak distance, it can provide a better signal to the generator than measures that are not weak distances (e.g., Jensen-Shannon). In the following, we provide proofs for the above claims under assumptions similar to [Arjovsky, Chintala, and Bottou2017].
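The following theorem formally states the result of continuity and differentiability in $\theta$ almost everywhere, which is desirable for permitting training via gradient descent.
Theorem 1. Assume that (i) $f_\phi \circ g_\theta$ is locally Lipschitz with constants $L(\theta, z)$ not depending on $\phi$, and satisfying $\mathbb{E}_z[L(\theta, z)] < \infty$; (ii) $\sup_{\eta \in \Pi} \mathbb{E}_{\omega(t)}[\|t\|] < \infty$, where $\omega$ implicitly depends on $\eta$. Then, the function $\sup_{\psi \in \Psi} \mathrm{CFD}^2_\omega(\mathbb{P}_{f_\phi(\mathcal{X})}, \mathbb{P}_{f_\phi(g_\theta(\mathcal{Z}))})$ is continuous in $\theta \in \Theta$ everywhere, and differentiable in $\theta \in \Theta$ almost everywhere.
The following theorem establishes continuity in the weak topology, and concerns general convergent distributions as opposed to only those corresponding to $g_\theta(z)$. In this result, we let $\mathbb{P}^{(\phi)}$ be the distribution of $f_\phi(x)$ when $x \sim \mathbb{P}$.
Theorem 2. Assume that (i) $f_\phi$ is Lipschitz with a constant not depending on $\phi$; (ii) $\sup_{\eta \in \Pi} \mathbb{E}_{\omega(t)}[\|t\|] < \infty$. Then, the function $\sup_{\psi \in \Psi} \mathrm{CFD}^2_\omega(\mathbb{P}_n^{(\phi)}, \mathbb{P}^{(\phi)})$ is continuous in the weak topology, i.e., if $\mathbb{P}_n \xrightarrow{D} \mathbb{P}$, then $\sup_{\psi \in \Psi} \mathrm{CFD}^2_\omega(\mathbb{P}_n^{(\phi)}, \mathbb{P}^{(\phi)}) \to 0$.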
The proofs are given in the appendix. In brief, we bound the difference between characteristic functions using geometric arguments, i.e., we interpret $e^{ia}$ as a vector on a circle and note that $|e^{ia} - e^{ib}| \le |a - b|$. We then upper-bound the difference of function values in terms of $\mathbb{E}_{\omega(t)}[\|t\|]$ (assumed to be finite) and averages of Lipschitz functions of the inputs under the distributions considered. The Lipschitz properties ensure that the function difference vanishes when one distribution converges to the other.
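The geometric step in this proof sketch can be written out explicitly. The following short derivation is a standard identity, given here as an aid to the reader rather than as a reproduction of the appendix; the symbols $a$, $b$, $x$, $y$ are generic placeholders. For real $a$ and $b$, factoring out a unit-modulus term gives the chord length on the unit circle:

$$\left|e^{ia} - e^{ib}\right| = \left|e^{i(a+b)/2}\right| \cdot \left|e^{i(a-b)/2} - e^{-i(a-b)/2}\right| = 2\left|\sin\!\left(\tfrac{a-b}{2}\right)\right| \le |a - b|,$$

where the last step uses $|\sin u| \le |u|$. Setting $a = \langle t, x\rangle$ and $b = \langle t, y\rangle$ and applying the Cauchy-Schwarz inequality then yields

$$\left|e^{i\langle t, x\rangle} - e^{i\langle t, y\rangle}\right| \le |\langle t, x - y\rangle| \le \|t\|\,\|x - y\|,$$

which is how a moment of $\|t\|$ under the weighting distribution and a Lipschitz bound on the inputs enter the argument.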
Various generators satisfy the locally Lipschitz assumption, e.g., when $g_\theta$ is a feed-forward network with ReLU activations. To ensure that $f_\phi$ is Lipschitz, common methods employed in prior work include weight clipping [Arjovsky, Chintala, and Bottou2017] and gradient penalty [Gulrajani et al.2017]. In addition, many common distributions satisfy $\mathbb{E}_{\omega(t)}[\|t\|] < \infty$, e.g., Gaussian, Student-t, and Laplace with fixed $\sigma$, which we use in our experiments. When $\sigma$ is optimized, we normalize the CFD by $\|\sigma\|$, which prevents $\sigma$ from going to infinity.
An example demonstrating the necessity of Lipschitz assumptions in continuity results (albeit for a different measure) can be found in Example 1 of [Arbel et al.2018]. In the appendix, we discuss conditions under which Theorem 2 can be strengthened to an “if and only if” statement.
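The CFD is related to the maximum mean discrepancy (MMD) [Gretton et al.2012]. Given samples from two distributions $\mathbb{P}$ and $\mathbb{Q}$, the squared MMD is given by
$$\mathrm{MMD}^2_\kappa(\mathbb{P}, \mathbb{Q}) = \mathbb{E}[\kappa(x, x')] + \mathbb{E}[\kappa(y, y')] - 2\,\mathbb{E}[\kappa(x, y)], \tag{7}$$
where $x, x' \sim \mathbb{P}$ and $y, y' \sim \mathbb{Q}$ are independent samples, and $\kappa$ is a kernel. When the weighting distribution of the CFD is equal to the inverse Fourier transform of the kernel in MMD (i.e., $\omega(t) = \mathcal{F}^{-1}\{\kappa\}$), the CFD and squared MMD are equivalent: $\mathrm{CFD}^2_\omega(\mathbb{P}, \mathbb{Q}) = \mathrm{MMD}^2_\kappa(\mathbb{P}, \mathbb{Q})$. Indeed, kernels with $\mathrm{supp}(\mathcal{F}^{-1}\{\kappa\}) = \mathbb{R}^d$ are called characteristic kernels [Sriperumbudur et al.2010], and when $\mathrm{supp}(\omega) = \mathbb{R}^d$, $\mathrm{MMD}_\kappa(\mathbb{P}, \mathbb{Q}) = 0$ if and only if $\mathbb{P} = \mathbb{Q}$. Although formally equivalent under the above conditions, we find experimentally that optimizing empirical estimates of MMD and CFD results in different convergence profiles and model performance across a range of datasets. Also, unlike MMD, which takes quadratic time in the number of samples to approximately compute, the CFD takes $O(nk)$ time and is therefore computationally attractive when $k \ll n$.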
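The correspondence between a weighting distribution and a kernel can be checked numerically in a simple case. For a symmetric weighting density, averaging $e^{it(x-y)}$ over frequencies reduces to averaging $\cos(t(x-y))$, and for a standard normal weighting this average is the unit-bandwidth Gaussian kernel $e^{-(x-y)^2/2}$. The sketch below verifies this by Monte Carlo in one dimension; the function names and the large number of frequencies (used only to keep the Monte Carlo error small) are assumptions made for the example.

```python
import math
import random

rng = random.Random(0)
num_freqs = 100000  # large only to make the Monte Carlo error small
freqs = [rng.gauss(0.0, 1.0) for _ in range(num_freqs)]


def kernel_from_frequencies(delta):
    """Monte Carlo estimate of E_{t ~ N(0,1)}[cos(t * delta)]."""
    return sum(math.cos(t * delta) for t in freqs) / len(freqs)


def gaussian_kernel(delta):
    """Unit-bandwidth Gaussian (RBF) kernel evaluated at x - y = delta."""
    return math.exp(-delta * delta / 2.0)


for delta in (0.0, 0.5, 1.0, 2.0):
    approx = kernel_from_frequencies(delta)
    exact = gaussian_kernel(delta)
    print(f"x - y = {delta:.1f}  frequency average = {approx:.3f}  kernel = {exact:.3f}")
```

In training, only a handful of frequencies are drawn per step, so the empirical CFD is a much noisier estimate of this kernel than the check above.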
Learning a generative model by minimizing the MMD between real and generated samples was proposed independently by [Li, Swersky, and Zemel2015] and [Dziugaite, Roy, and Ghahramani2015]. The Generative Moment Matching Network (GMMN) [Li, Swersky, and Zemel2015] uses an autoencoder to first transform the data into a latent space, and then trains a generative network to produce latent vectors that match the true latent distribution. The MMD-GAN [Li et al.2017] performs a similar input transformation using a network that is adversarially trained to maximize the MMD between the true distribution and the generator distribution; this results in a GAN-like min-max criterion. More recently, [Binkowski et al.2018] and [Arbel et al.2018] have proposed different theoretically-motivated regularizers on the gradient of the MMD-GAN critic that improve training. In our experiments, we compare against the MMD-GAN both with and without gradient regularization.
Very recent work [Li et al.2019] (IKL-GAN) has evaluated kernels parameterized in Fourier space, which are then used to compute MMD in MMD-GAN. The IKL-GAN utilizes a neural network to sample random frequencies, whereas we use a simpler fixed distribution with a learned scale. In this aspect, our work can be viewed as a special case of IKL-GAN; however, we derive the CF-GAN via characteristic functions rather than via MMD. This obviates the requirement for kernel evaluation as in IKL-GAN. We also provide novel direct proofs for the properties of optimized CFD that are not based on its equivalence to MMD. Our method yields state-of-the-art performance, which suggests that the more complex setup in IKL-GAN may not be required for effective GAN training.
In parallel, significant work has gone into improving GAN training via architectural and optimization enhancements [Miyato et al.2018, Brock, Donahue, and Simonyan2018, Karras, Laine, and Aila2019]; these research directions are orthogonal to our work and can be incorporated in our proposed model.
In this section, we present empirical results comparing three variants of our proposed model: 1) CF-GAN – ECFD with fixed scale $\sigma$; 2) OCF-GAN – ECFD with optimized $\sigma$; and, 3) OCF-GAN-GP – ECFD with optimized $\sigma$ and gradient penalty. For the first two versions, we use weight clipping to enforce Lipschitzness of the critic $f_\phi$, while for OCF-GAN-GP we constrain the norm of the critic's gradient at points between real and generated samples to be 1 using an additive penalty [Gulrajani et al.2017].
We compare our proposed model against two variants of MMD-GAN: (i) MMD-GAN [Li et al.2017], which uses MMD with a mixture of RBF kernels as the distance metric; (ii) MMD-GAN-GP [Binkowski et al.2018], which introduces an additive gradient penalty based on MMD’s IPM witness function, an L2 penalty on discriminator activations, and uses a mixture of RQ kernels. We also compare against WGAN [Arjovsky, Chintala, and Bottou2017] and WGAN-GP [Gulrajani et al.2017] due to their close relation to MMD-GAN [Li et al.2017, Binkowski et al.2018].
We first tested the methods on two synthetic 1D distributions: a simple unimodal distribution and a more complex bimodal distribution. The distributions were constructed by transforming samples from a base distribution using a known function. For the unimodal dataset, we used the scale-shift function form used by [Zaheer et al.2018]. For the bimodal dataset, we used the function form used by planar flow [Rezende and Mohamed2015]. We trained the various GAN models to approximate the distribution of the transformed samples. Once trained, we compared the transformation function learned by the GAN against the true function. We computed the mean absolute error (MAE) between the two to evaluate the models. Further details on the experimental setup can be found in Appendix B.1.
Figs. 2(a) and 2(b) show the variation of the MAE with training iterations. For both datasets, the models with gradient penalty converge to better minima. For the unimodal dataset, both MMD-GAN-GP and OCF-GAN-GP converge to the same value of MAE, but MMD-GAN-GP converges faster. During our experiments, we observed that the scale of the weighting distribution (which is initialized to 1) falls rapidly before the MAE begins to decrease. For the experiments with the scale fixed at 0.1 (CF-GAN-GP$_{\sigma=0.1}$) and 1 (CF-GAN-GP$_{\sigma=1}$), both models converge to the same MAE, but CF-GAN-GP$_{\sigma=1}$ takes much longer to converge than CF-GAN-GP$_{\sigma=0.1}$. This indicates that the optimization of the scale parameter leads to faster convergence. For the more complex bimodal dataset, MMD-GAN-GP takes far longer to converge compared to WGAN-GP and OCF-GAN-GP. OCF-GAN-GP converges fastest and to a better minimum, followed by WGAN-GP.
We conducted experiments on four benchmark datasets: 1) MNIST [LeCun, Bottou, and Haffner2001]: 60K grayscale images of handwritten digits; 2) CIFAR10 [Krizhevsky2009]: 50K RGB images; 3) CelebA [Liu et al.2015]: 200K RGB images of celebrity faces; and, 4) STL10 [Coates, Ng, and Lee2011]: 100K RGB images. For all the datasets, we center-cropped and scaled the images to .
We follow [Li et al.2017] and use a DCGAN-like generator and critic architecture for all models. For MMD-GAN, we use a mixture of five RBF kernels (5-RBF) with different scales [Li et al.2017]. For MMD-GAN-GP, we use a mixture of rational quadratic kernels (5-RQ). The kernel parameters and the trade-off parameters for gradient and L2 penalties were set according to [Binkowski et al.2018]. We tested CF-GAN variants with two weighting distributions: Gaussian and Student’s-t (with a fixed number of degrees of freedom). For CF-GAN, we tested 3 scale parameters and report the best results. The number of frequencies ($k$) used to compute the ECFD was set to 8. Complete implementation details can be found in Appendix B.2.
We compare the different models using three evaluation metrics: Fréchet Inception Distance (FID) [Salimans et al.2016], Kernel Inception Distance (KID) [Binkowski et al.2018], and Precision-Recall (PR) for Generative models [Sajjadi et al.2018]. Details on these metrics and the evaluation procedure can be found in Appendix B.2. In brief, the FID computes the Fréchet distance between two multivariate Gaussians and the KID computes the MMD (with a polynomial kernel of degree 3) between the real and generated data distributions. While both FID and KID give single-value scores, PR gives a two-dimensional score which disentangles the quality of generated samples from the coverage of the data distribution. PR is defined by a pair $F_8$ (recall) and $F_{1/8}$ (precision), which represent the distribution coverage and sample quality respectively.
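For reference, the two single-value metrics take the following standard forms; the symbols below are introduced here for illustration and are not taken from the paper's appendix. With $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ denoting the mean and covariance of Gaussians fitted to Inception features of real and generated images, the Fréchet distance between the two Gaussians is

$$\mathrm{FID} = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right).$$

The KID is the squared MMD between the same two sets of features under a degree-3 polynomial kernel, commonly taken as

$$\kappa(u, v) = \left(\tfrac{1}{p}\, u^\top v + 1\right)^3,$$

where $p$ is the feature dimension. Both scores are zero when the two feature distributions coincide, and lower values indicate a closer match.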
In the following, we summarize our main findings and relegate details to the Appendix. Table 1 shows the FID and KID values achieved by different models for CIFAR10, STL10, and CelebA datasets. In short, our model outperforms both variants of WGAN and MMD-GAN by a significant margin. OCF-GAN, using just one weighting function, outperforms both MMD-GANs that use a mixture of 5 different kernels.
We observe that the optimization of the scale parameter improves the performance of the models for both weighting distributions, and the introduction of gradient penalty as a means to ensure Lipschitzness of the critic $f_\phi$ results in a significant improvement in the score values for all models. This is in line with the results of [Gulrajani et al.2017] and [Binkowski et al.2018]. Overall, amongst the CF-GAN variants, OCF-GAN-GP with Gaussian weighting performs the best for all datasets.
The two-dimensional precision-recall scores in Fig. 3 provide further insight into the performance of different models. Across all the datasets, the addition of gradient penalty (OCF-GAN-GP) rather than weight clipping (OCF-GAN) leads to a much higher improvement in recall compared to precision. This result supports recent arguments that weight clipping forces the generator to learn simpler functions while gradient penalty is more flexible [Gulrajani et al.2017]. The improvement in recall with the introduction of gradient penalty is much more noticeable for CIFAR10 and STL10 datasets compared to CelebA. This result is intuitive; CelebA is a more uniform and simpler dataset when compared to CIFAR10/STL10, which contain more diverse classes of images and thus, likely have modes that are more complex and far apart.
Fig. 4 shows image samples generated by OCF-GAN-GP for different datasets. Additional qualitative comparisons can be found in Appendix C, which also describes further experiments using the smoothed version of ECFD and the optimized smoothed version (no improvement over the unsmoothed versions on the image datasets) and results on the MNIST dataset where all models achieve good score values.
Model | Kernel/Weight | CIFAR10 FID | CIFAR10 KID | STL10 FID | STL10 KID | CelebA FID | CelebA KID |
---|---|---|---|---|---|---|---|
WGAN | – | 44.11 (1.16) | 25 (1) | 38.61 (0.43) | 23 (1) | 17.85 (0.69) | 12 (1) |
WGAN-GP | – | 35.91 (0.30) | 19 (1) | 27.85 (0.81) | 15 (1) | 10.03 (0.37) | 6 (1) |
MMD-GAN | 5-RBF | 41.28 (0.54) | 23 (1) | 35.76 (0.54) | 21 (1) | 18.48 (1.60) | 12 (1) |
MMD-GAN-GP | 5-RQ | 38.88 (1.35) | 21 (1) | 31.67 (0.94) | 17 (1) | 13.22 (1.30) | 8 (1) |
CF-GAN | Gaussian | 39.81 (0.93) | 23 (1) | 33.54 (1.11) | 19 (1) | 13.71 (0.50) | 9 (1) |
CF-GAN | Student-t | 41.41 (0.64) | 22 (1) | 35.64 (0.44) | 20 (1) | 16.92 (1.29) | 11 (1) |
OCF-GAN | Gaussian | 38.47 (1.00) | 20 (1) | 32.51 (0.87) | 19 (1) | 14.91 (0.83) | 9 (1) |
OCF-GAN | Student-t | 37.96 (0.74) | 20 (1) | 31.03 (0.82) | 17 (1) | 13.73 (0.56) | 8 (1) |
OCF-GAN-GP | Gaussian | 33.08 (0.26) | 17 (1) | 26.16 (0.64) | 14 (1) | 9.39 (0.25) | 5 (1) |
OCF-GAN-GP | Student-t | 34.33 (0.77) | 18 (1) | 26.86 (0.38) | 15 (1) | 9.61 (0.39) | 6 (1) |
Table 1: FID and KID ($\times 10^{3}$) scores (lower is better) for CIFAR10, STL10, and CelebA datasets averaged over 5 random runs (standard deviation in parentheses).
# of freqs ($k$) | FID | KID |
---|---|---|
1 | 0.49 (0.04) | 6 (1) |
4 | 0.43 (0.07) | 5 (1) |
16 | 0.44 (0.04) | 5 (1) |
32 | 0.40 (0.03) | 4 (1) |
64 | 0.39 (0.04) | 4 (1) |
The choice of weighting distribution did not lead to drastic changes in model performance. The Student-t distribution performs best when weight clipping is used, while the Gaussian distribution performs best in the case of gradient penalty. This suggests that the proper choice of distribution is dataset and method dependent.
We also conducted preliminary experiments using a uniform distribution weighting scheme. Even though the condition $\mathrm{supp}(\omega) = \mathbb{R}^d$ does not hold for the uniform distribution, we found that this does not adversely affect the performance (see Appendix C). The uniform weighting distribution corresponds to the sinc-kernel in MMD, which is known to be a non-characteristic kernel [Sriperumbudur et al.2010]. Our results suggest that such kernels can perform reasonably well when used in MMD-GAN, but we did not verify this experimentally.
We conducted an experiment to study the impact of the number of random frequencies ($k$) that are sampled from the weighting distribution to compute the ECFD. We ran our best performing model (OCF-GAN-GP) with different values of $k$ from the set $\{1, 4, 16, 32, 64\}$. The FID and KID scores for this experiment are shown in Table 2. As expected, the score values improve as $k$ increases. However, even for the lowest number of frequencies possible ($k = 1$), the performance does not degrade too severely.
In this paper, we proposed a novel weighted distance between characteristic functions for training IGMs, and showed that the proposed metric has attractive theoretical properties. We observed experimentally that the proposed model outperforms MMD-GAN and WGAN variants on four benchmark image datasets. Our results indicate that CFDs provide an effective alternative means for training IGMs.
This work opens additional avenues for future research. For example, the empirical CFD used for training may result in high variance gradient estimates (particularly with a small number of sampled frequencies), yet the CFD-trained models attain high performance scores with better convergence in our tests. The reason for this should be more thoroughly explored. Although we used the gradient penalty proposed by WGAN-GP, there is no reason to constrain the gradient to exactly 1. We believe that an exploration of the geometry of the proposed loss could lead to improvement in the gradient regularizer for the proposed method.
Apart from generative modeling, two-sample tests such as MMD have been used for problems such as domain adaptation [Long et al.2015] and domain separation [Bousmalis et al.2016], among others. The optimized CFD loss function proposed in this work can be used as an alternative loss for these problems.
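[Gretton et al.2012] Gretton, A.; Borgwardt, K. M.; Rasch, M. J.; Schölkopf, B.; and Smola, A. 2012. A kernel two-sample test. Journal of Machine Learning Research 13(Mar):723–773.
[Ho and Ermon2016] Ho, J., and Ermon, S. 2016. Generative adversarial imitation learning. In NIPS.
[Sajjadi et al.2018] Sajjadi, M. S. M.; Bachem, O.; Lucic, M.; Bousquet, O.; and Gelly, S. 2018. Assessing generative models via precision and recall. In NeurIPS, 5228–5237.
Let $\mathbb{P}_{\mathcal{X}}$ be the data distribution, and let $\mathbb{P}_{g_\theta(\mathcal{Z})}$ be the distribution of $g_\theta(z)$ when $z \sim \mathbb{P}_{\mathcal{Z}}$, the latent distribution. Recall that the characteristic function of a distribution $\mathbb{P}$ is given by
$$\varphi_{\mathbb{P}}(t) = \mathbb{E}_{x \sim \mathbb{P}}\left[e^{i\langle t, x\rangle}\right]. \tag{8}$$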
The quantity can then be written as
(9) |
where we denote the characteristic functions of and by and respectively. For notational simplicity, we henceforth denote by .
Since the difference of two functions’ maximal values is always upper bounded by the maximal gap between the two functions, we have
(10) | ||||
(11) |
where denotes any parameters that are within of the supremum on the right-hand side of (11), where may be arbitrarily small. Such always exists by the definition of supremum. Subsequently, we define for compactness.
Let denote the distribution associated with . We further upper bound the right-hand side of (11) as follows:
(12) | ||||
(13) |
where uses the linearity of expectation and Jensen’s inequality.
Since any characteristic function is bounded by , the value of for any is upper bounded by 2. Since the function is (locally) -Lipschitz over the restricted domain , we have
(14) | ||||
(15) | ||||
(16) | ||||
(17) |
where uses the triangle inequality, and uses Jensen’s inequality.
In inequality (17), let , which can be interpreted as the length of the chord that subtends an angle of at the center of a unit circle centered at origin. The length of this chord is given by , and since , we have
(18) | ||||
(19) |
where uses the Cauchy-Schwarz inequality.
Furthermore, using the assumption , we get
(20) |
with the first term being finite.
By assumption, is locally Lipschitz, i.e., for any pair , there exists a constant and an open set such that we have . Hence,
(21) |
Since by assumption, we get
(22) |
and combining with (11) gives
(23) |
Taking the limit on both sides gives
(24) |
which proves that is locally Lipschitz, and therefore continuous. In addition, Rademacher’s theorem [Federer2014] states that any locally Lipschitz function is differentiable almost everywhere, which establishes the differentiability claim.
Let and . To study the behavior of , we first consider
(25) |
Since , using the fact that for , we have
(26) | |||
(27) | |||
(28) | |||
(29) |
where uses Jensen’s inequality, uses the geometric properties stated following Eq. (17) and the fact that , and uses the Cauchy-Schwarz inequality.
For brevity, let , which is finite by assumption. Interchanging the order of the expectations in Eq. (29) and applying Jensen’s inequality (to alone) and the concavity of , we can continue the preceding upper bound as follows:
(30) | |||
(31) |
where defines to be the Lipschitz constant of assumed to be independent of (see the theorem statement).
Observe that is a bounded Lipschitz function of . By the Portmanteau theorem ([Klenke2013], Thm. 13.16), convergence in distribution implies that for any such , and hence (31) yields (upon taking on both sides), as required.
Theorem 2 shows that, under some technical assumptions, the function satisfies continuity in the weak topology, i.e.,
Here we discuss whether the opposite is true: Does imply that ? In general, the answer is negative. For example:
If only contains the function , then is always the distribution corresponding to deterministically equaling zero, so any two distributions give zero CFD.
If has bounded support, then two distributions whose characteristic functions only differ for values outside that support may still give .
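The first counterexample can be reproduced numerically. The sketch below compares a constant-zero critic with an identity critic on two clearly different one-dimensional distributions; the helper `ecfd_squared`, the two critics, and all numeric settings are hypothetical choices for illustration only.

```python
import numpy as np

def ecfd_squared(fx, fy, freqs):
    """Squared empirical CFD between critic outputs fx (n, d) and fy (m, d)."""
    phi_x = np.exp(1j * (freqs @ fx.T)).mean(axis=1)
    phi_y = np.exp(1j * (freqs @ fy.T)).mean(axis=1)
    return float(np.mean(np.abs(phi_x - phi_y) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(loc=-3.0, size=(500, 1))   # samples from one distribution
y = rng.normal(loc=3.0, size=(500, 1))    # samples from a very different one
freqs = rng.normal(size=(8, 1))           # assumed Gaussian weighting distribution

zero_critic = lambda s: np.zeros_like(s)  # critic class containing only f = 0
identity_critic = lambda s: s             # a critic that preserves the inputs

collapsed = ecfd_squared(zero_critic(x), zero_critic(y), freqs)
separated = ecfd_squared(identity_critic(x), identity_critic(y), freqs)
```

With the zero critic, both sets of outputs are identically zero, so their empirical characteristic functions coincide and the distance vanishes even though the inputs differ.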
In the following, however, we argue that the answer is positive when is “sufficiently rich” and is “sufficiently well-behaved”.
Rather than seeking the most general assumptions that formalize these requirements, we focus on a simple special case that still captures the key insights, assuming the following:
There exists such that includes all linear functions that are -Lipschitz;
There exists such that has support , where is the output dimension of .
To give examples of these, note that neural networks with ReLU activations can implement arbitrary linear functions (with the Lipschitz condition amounting to bounding the weights), and note that the second assumption is satisfied by any Gaussian with fixed positive length-scales.
In the following, let and . We will prove the contrapositive statement:
By the Cramér-Wold theorem [Cramér and Wold1936], implies that we can find constants such that
(32) |
where denote the -th entries of , with being their dimension.
Recall that we assume includes all linear functions from to with Lipschitz constant at most . Hence, we can select such that every entry of equals , where is sufficiently large so that the Lipschitz constant of this is at most . However, for this , (32) implies that , which in turn implies that is bounded away from zero for all in some set of positive Lebesgue measure.
Choosing to have support in accordance with the second technical assumption above, it follows that and hence .
Figure: (in blue) estimated using Kernel Density Estimation (KDE) along with the true distribution (in red).

The synthetic data is generated by first sampling and then applying a function to the samples. We construct distributions of two types: a scale-shift unimodal distribution and a “scale-split-shift” bimodal distribution . The function for the two distributions is defined as:
: ; we set and . This shifts the mean of the distribution to , essentially resulting in the distribution. Fig. (a) shows the PDF (and histogram) of the original distribution and the distribution of , which is approximated using KDE.
: ; we set , , . This splits the distribution into two modes and shifts the two modes to and . Fig. (b) shows the PDF (and histogram) of the original distribution and the distribution of , which is approximated using KDE.
For the two cases described above, there are two transformation functions that will lead to the same distribution. In each case, the second transformation function is given by:
:
:
As there are two possible correct transformation functions ( and ) that the GANs can learn, we compute the Mean Absolute Error as follows
(33) |
where is the transformation learned by the generator. We estimate the expectations in Eq. (33) using 5000 samples.
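The error computation can be sketched as follows, assuming the reported error is the smaller of the two mean absolute errors (one per valid transformation). The scale-shift functions `h1` and `h2` and the values of `mu` and `sigma` are hypothetical placeholders, not the settings used in the experiments; they rely only on the fact that, for a symmetric latent variable, `mu + sigma * z` and `mu - sigma * z` have the same distribution.

```python
import numpy as np

def two_sided_mae(g, h1, h2, z):
    """Smaller of the two mean absolute errors between g and each valid transformation."""
    err1 = np.mean(np.abs(h1(z) - g(z)))
    err2 = np.mean(np.abs(h2(z) - g(z)))
    return float(min(err1, err2))

mu, sigma = -10.0, 0.2                    # hypothetical shift and scale
h1 = lambda z: mu + sigma * z             # one valid transformation
h2 = lambda z: mu - sigma * z             # its mirror image, same output distribution

rng = np.random.default_rng(0)
z = rng.normal(size=5000)                 # 5000 latent samples, as in the text

exact = two_sided_mae(h2, h1, h2, z)                                  # generator equals h2
biased = two_sided_mae(lambda s: mu + sigma * s + 0.5, h1, h2, z)     # h1 offset by 0.5
```

Taking the minimum avoids penalizing a generator that has learned the mirrored transformation, which produces exactly the same output distribution.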
For the generator and critic network architectures, we follow [Zaheer et al.2018]. Specifically, the generator is an MLP with 3 hidden layers of size 7, 13, 7, and ELU non-linearity between the layers. The critic network is also an MLP with 3 hidden layers of size 11, 29, 11, and ELU non-linearity between the layers. The inputs and outputs of both networks are one-dimensional. We use the RMSProp optimizer with a learning rate of 0.001 for all models. We run the models for 10000 and 20000 generator iterations for and respectively, with a batch size of 50 and 5 critic iterations per generator iteration. For all the models that rely on weight clipping, clipping in for resulted in poor performance, so we modified the range to .

We use a mixture of 5 RBF kernels for MMD-GAN [Li et al.2017] and a mixture of 5 RQ kernels and gradient penalty (as defined in [Binkowski et al.2018]) for MMD-GAN-GP. For CF-GAN variants, we use a single weighting distribution (Student-t and Gaussian for and respectively). WGAN-GP performed erratically for with a learning rate of 0.001, so we reduced the learning rate to , but this didn’t improve the performance significantly. With a learning rate of , WGAN-GP is too slow to converge compared to the other models.
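To make the layer sizes concrete, the following NumPy sketch builds the forward pass of the two synthetic-data MLPs. It is only an illustration of the stated architecture: the weight initialization, the helper names, and the use of NumPy instead of a deep learning framework are assumptions, and no training loop is shown.

```python
import numpy as np

def elu(x, alpha=1.0):
    # ELU: identity for positive inputs, alpha * (exp(x) - 1) otherwise.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def init_mlp(sizes, rng):
    # One (weight, bias) pair per layer; the scaling is an arbitrary choice.
    return [(rng.normal(scale=1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    # ELU between layers, no activation after the final layer.
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = elu(x)
    return x

def n_params(params):
    return sum(w.size + b.size for w, b in params)

rng = np.random.default_rng(0)
generator = init_mlp([1, 7, 13, 7, 1], rng)    # hidden layers of size 7, 13, 7
critic = init_mlp([1, 11, 29, 11, 1], rng)     # hidden layers of size 11, 29, 11

z = rng.normal(size=(50, 1))                   # one batch of 50 latent samples
fake = forward(generator, z)
score = forward(critic, fake)
```

Both networks are tiny (a few hundred parameters each), which is consistent with the one-dimensional inputs and outputs described above.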
Following [Li et al.2017], a decoder is also connected to the critic in CF-GAN to reconstruct the input to the critic; this encourages the critic to learn a representation that has a high mutual information with the input. The auto-encoding objective is optimized along with the discriminator and the final objective is given by:
(34) |
where is the decoder network, is the regularization parameter, and is the error between two datapoints (e.g., squared error or cross-entropy). While the decoder is interesting from an auto-encoding perspective of the representation learned by , we found in our experiments that the removal of the decoder does not impact the performance of the model; this is seen by the results of OCF-GAN-GP, which does not use a decoder network.
We also reduce the feasible set [Li et al.2017] of , which amounts to an additive penalty of . We observed in our experiments that this led to improved stability of training for models that use weight clipping to enforce the Lipschitz condition. For more details, we refer the reader to [Li et al.2017].
We use DCGAN-like generator and critic architectures, the same as in [Li et al.2017], for all models. Specifically, both and are fully convolutional networks with the following structures:
: upconv(256) → bn → relu → upconv(128) → bn → relu → upconv(64) → bn → relu → upconv() → tanh;
: conv(64) → leaky-relu(0.2) → conv(128) → bn → leaky-relu(0.2) → conv(256) → bn → leaky-relu(0.2) → conv(),
where conv, upconv, bn, relu, leaky-relu, and tanh refer to convolution, up-convolution, batch-normalization, ReLU, LeakyReLU, and Tanh layers respectively. The decoder (whenever used) is also a DCGAN-like decoder. The generator takes a -dimensional Gaussian latent vector as the input and outputs a image with channels. The value of is set differently depending on the dataset: MNIST (10), CIFAR10 (32), STL10 (32), and CelebA (64). The output dimensionality of the critic network () is set to 32 for MMD-GAN and CF-GAN models and 1 for WGAN and WGAN-GP. The batch normalization layers in the critic are omitted for WGAN-GP (as suggested by [Gulrajani et al.2017]) and OCF-GAN-GP.

The RMSProp optimizer is used with a learning rate of 0.00005. For all the datasets, all models are optimized with a batch size of 64 for 125000 generator iterations (50000 for MNIST) with 5 critic updates per generator iteration.

We tested CF-GAN variants with two weighting distributions: Gaussian () and Student-t () (with ). We also conducted preliminary experiments using Laplace () and Uniform () weighting distributions (see Table 3). For CF-GAN, we test with 3 scale parameters for and from the set and report the best results. The trade-off parameters for the auto-encoder penalty () and feasible-set penalty () are set to 8 and 16 respectively, as in [Li et al.2017]. For OCF-GAN-GP, the trade-off for the gradient penalty is set to 10, the same as for WGAN-GP. The number of random frequencies used for computing ECFD for all CF-GAN models is set to 8.

For MMD-GAN, we use a mixture of five RBF kernels with different scales () in as in [Li et al.2017]. For MMD-GAN-GP, we use a mixture of rational quadratic kernels with in ; the trade-off parameters of the gradient and L2 penalties are set according to [Binkowski et al.2018].
We compare the different models using three evaluation metrics: Fréchet Inception Distance (FID) [Salimans et al.2016], Kernel Inception Distance (KID) [Binkowski et al.2018], and Precision-Recall (PR) for generative models [Sajjadi et al.2018]. All evaluation metrics use features extracted from the pool3 layer (2048 dimensional) of an Inception network pre-trained on ImageNet, except for MNIST, for which we use a LeNet5 as the feature extractor. FID fits Gaussian distributions to Inception features of real and fake images and computes the Fréchet distance between the two Gaussians. KID, on the other hand, computes the MMD between the Inception features of the two distributions using a polynomial kernel of degree 3. This is equivalent to comparing the first three moments of the two distributions.
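The KID computation described above can be sketched as an unbiased MMD estimate with a cubic polynomial kernel. The kernel form $(\langle a, b\rangle / d + 1)^3$ follows the common convention for KID; the random 16-dimensional vectors below are stand-ins for Inception (or LeNet5) features, and the function names are chosen for this example only.

```python
import numpy as np

def poly_kernel(a, b):
    """Cubic polynomial kernel (a.b / d + 1)^3 between rows of a and rows of b."""
    d = a.shape[1]
    return (a @ b.T / d + 1.0) ** 3

def kid_unbiased(fx, fy):
    """Unbiased squared-MMD estimate between feature sets fx (n, d) and fy (m, d)."""
    n, m = fx.shape[0], fy.shape[0]
    kxx, kyy, kxy = poly_kernel(fx, fx), poly_kernel(fy, fy), poly_kernel(fx, fy)
    # Within-set terms exclude the diagonal (a sample paired with itself).
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return float(term_x + term_y - 2.0 * kxy.mean())

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 16))               # stand-in for real-image features
fake_good = rng.normal(size=(500, 16))          # same distribution as the real features
fake_bad = rng.normal(loc=1.0, size=(500, 16))  # mean-shifted features

kid_good = kid_unbiased(real, fake_good)
kid_bad = kid_unbiased(real, fake_bad)
```

Because the estimator is unbiased, it can come out slightly negative when the two feature sets follow the same distribution; only its magnitude relative to other models is meaningful.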
Let be samples from the data distribution and be samples from the GAN generator distribution . Let and be the feature vectors extracted from the Inception network for and respectively. The FID and KID are then given by
(35) | ||||