In traditional statistics, generative modeling is formulated as density estimation. The learning objective and evaluation metric are usually the expected negative log-likelihood. While maximizing the log-likelihood, or equivalently minimizing the KL-divergence, works well for modeling low-dimensional data, a number of issues arise when modeling high-dimensional data such as images. Perhaps the most important issue is the lack of guarantees that log-likelihood is a good proxy for sample quality. Specifically, Theis et al. (2016) exhibit generative models with high log-likelihood that produce samples of poor image quality, and models with poor log-likelihood that produce samples of high image quality. In some cases, they show that the log-likelihood can even be hacked to be arbitrarily high, even on test data, without improving the sample quality at all. Another practical issue with f-divergences, a generalization of the KL, is that they are either not defined or uninformative whenever the distributions are too far apart, even when tricks such as smoothing are used (Arjovsky et al., 2017).
Because of those shortcomings, we need to look past maximum likelihood and classic divergences in order to define better objectives. Let us take a step back and consider what the final task, or end goal, of generative modeling really is. One way to define the final task is that we want to generate realistic and diverse samples. A fundamental question is then how to formalize such a subjective final task into a task loss, an actual mathematical objective that can be optimized and evaluated (the terminology comes from statistical decision theory, which we introduce in Section 3). When it comes to defining relevant task losses, it is worthwhile to consider how people choose task losses in structured prediction. The task loss of structured prediction is the generalization error induced by a structured loss, which formally specifies how close the predicted label is to the ground truth. For instance, in machine translation, a possible structured loss is the BLEU score (Papineni et al., 2002), which basically counts how many words the predicted and ground truth sentences have in common. Although such a task loss is still imperfect, it is far more informative than “stronger” losses such as the 0-1 loss, which in essence gives no training signal unless the predicted label exactly matches the ground truth. Once the task loss is defined, we can objectively evaluate and compare models, even though such a comparison can only be as relevant as the task loss considered.
Unfortunately, in generative modeling, it is not as obvious how to define a task loss that correlates well with the final task of generating realistic samples. Nevertheless, we argue that the adversarial framework, introduced in the context of generative adversarial networks or GANs (Goodfellow et al., 2014), provides an interesting way to define meaningful and practical task losses for generative modeling. We focus in this paper on the divergence minimization perspective of GANs; there are other views, such as those based on game theory (Arora et al., 2017; Fedus et al., 2017), ratio matching, and moment matching (Mohamed & Lakshminarayanan, 2016). Under the divergence minimization view, training a GAN can be seen as training an implicit generator to minimize a special type of task loss, which is a parametric (adversarial) divergence:
where is the distribution to learn and is the distribution defined by the implicit generator. The expectation is minimized over a parametrized class of functions, generally neural networks, which are called discriminators in the GAN framework (Goodfellow et al., 2014). The constraints and the formulation determine properties of the resulting divergence.
Our first contribution is to relate generative modeling with structured prediction. Both can be formulated as statistical decision problems, where the goal is to output the model that minimizes some task loss. Under that perspective, we review results from the structured prediction literature that formalize the intuition that “weaker” task losses are easier to learn from than “stronger” task losses, and quantify how much easier it is. Although it is non-trivial to extend those results to the task losses of generative modeling, they are a first step towards developing a complete theory of generative modeling.
Our second contribution is to emphasize that parametric divergences have several good properties which make them suitable as a learning objective for generative modeling. We argue that it is actually better to optimize parametric divergences rather than their nonparametric counterparts. This is to be contrasted with the popular optimal discriminator assumption and the view that GANs are merely optimizing nonparametric divergences (e.g., Jensen-Shannon divergence, Wasserstein distance) or a lower-bound to them. Among other advantages, parametric divergences have good sample complexity, which makes them suitable for learning from limited data, and unlike usual nonparametric divergences, they are actually able to enforce properties that characterize the final task, which is critical in order to approximate the final task well.
Our third contribution is to propose two new challenging tasks in generative modeling. One is qualitative and the other is quantitative:
Qualitative Task: We collect a new dataset, Thin-8, which consists of 1585 high-resolution handwritten images of the digit “8”. Although the dimensionality is very large, the intrinsic dimensionality is fairly small, which makes it an interesting task for qualitatively evaluating parametric and nonparametric divergences.
Quantitative Task: We introduce the visual hyperplane task, which consists in generating images of 5 digits that sum up to 25, as an example of the more complicated constraints found in real data, such as those arising from physics. We consider precision and recall metrics to quantitatively evaluate whether parametric and nonparametric divergences are able to enforce this constraint.
Here we briefly introduce the structured prediction framework because it can be related to generative modeling in important ways. We will later link them formally, and present insights from recent theoretical results to choose a better divergence. We also unify parametric adversarial divergences with traditional divergences in order to compare them in the next section.
2.1 Structured Prediction
The goal of structured prediction is to learn a classifier which predicts a structured output from an input. The key difficulty is that the set of possible outputs usually has size exponential in the dimension of the input (e.g., it could be all possible sequences of symbols with a given length). Being able to handle this exponentially large set of possible outputs is one of the key challenges in structured prediction, and traditional multi-class classification methods are in general unsuitable for these problems. Standard practice in structured prediction (Taskar et al., 2003; Collins, 2002; Pires et al., 2013) is to consider predictors based on score functions, where the score/energy function (LeCun et al., 2006) assigns a score to each possible label for an input. Typically, as in structured SVMs (Taskar et al., 2003), the score function is linear in a predefined feature map. Alternatively, the score function could also be a learned neural network (Belanger & McCallum, 2016).
In order to evaluate the predictions objectively, we need to define a task-specific structured loss which expresses the cost of predicting one label when the ground truth is another. We discuss the relation between the loss function and the actual final task in Section 3.1. The goal is then to find a parameter which minimizes the generalization error
or, in practice, an empirical estimate of it based on an average over a finite sample from the data distribution. Directly minimizing this is often intractable, even in simple cases, e.g. when the structured loss is the 0-1 loss (Arora et al., 1993). Instead, the usual practice is to minimize a surrogate loss (Bartlett et al., 2006) which has nicer properties, such as sub-differentiability or convexity, to get a tractable optimization problem. The surrogate loss is said to be consistent (Osokin et al., 2017) when its minimizer is also a minimizer of the task loss.
2.2 Parametric and Nonparametric Adversarial Divergences
The focus of this paper is to analyze whether parametric divergences are good candidates for generative modeling. In particular, we analyze them relative to nonparametric divergences. Therefore, we first unify them with a formalism similar to that of Sriperumbudur et al. (2012) and Liu et al. (2017). We define adversarial divergences as follows:
When the function class is a nonparametric function space, we call the resulting divergence a nonparametric (adversarial) divergence. We review in Appendix A that many usual divergences can be written as Equation (3), for appropriate choices of function class and formulation. Examples include f-divergences (such as the Kullback-Leibler or the Chi-Squared), Wasserstein distances, and Maximum Mean Discrepancy (MMD).
When the function class is a parametric function space, we call the resulting divergence a parametric (adversarial) divergence. (Usually, the function class is a class of neural networks with fixed architecture; in that case, the divergence has been called a neural divergence in Arora et al. (2017). We will use the slightly more generic term parametric divergence in our work.) Most nonparametric divergences can be made parametric by replacing the function class with a neural network: examples are the parametric Jensen-Shannon, which is the standard mini-max GAN objective (Goodfellow et al., 2014), and the parametric Wasserstein, which is in essence the WGAN objective (Arjovsky et al., 2017), modulo some technical tricks. (There are subtleties in the way the Lipschitz constraint is enforced; see Petzka et al. (2017) for more details.) More details can be found in Appendix A and references therein. We deliberately chose a somewhat ambiguous terminology, nonparametric vs. parametric, not to imply a clear-cut distinction between the two (as, e.g., neural networks can be seen to become universal function approximators as we increase their size), but to imply a continuum from least restricted to more restricted function families, where the latter are typically expressed through an explicit parametrization.
In light of this unified framework, one could argue that parametric divergences are simply estimators –in fact lower-bounds– of their nonparametric counterparts. Our opinion is that parametric divergences are not merely convenient estimators, but actually much better objectives for generative modeling than nonparametric divergences. We give practical arguments why in Section 4, and demonstrate empirical evidence in Section 6.
3 Task Losses in Structured Prediction and Generative Modeling
In this section, we show that the problems of high-dimensional generative modeling and structured prediction have much in common. Obviously, they both consist of learning models that output high-dimensional samples or labels (as opposed to binary classification models, which output just a single bit of information). Less obvious is that they require formalizing a final task, which is what we really care about, into a task loss, an actual mathematical objective. (The (statistical) task loss is the arbitrary evaluation metric we choose; the terminology comes from the framework of statistical decision theory. We refer the interested reader to Appendix B.1 where, using that framework, we formally unify structured prediction and generative modeling as statistical decision problems.) Such a process is complex and rarely perfect; we explain why in Section 3.1. We emphasize that in the context of structured prediction, the choice of task loss is much more critical than it is in, say, traditional multiclass classification. Indeed, using the wrong task loss might result in exponentially slower learning, as detailed in Section 3.2.
3.1 Necessity of formalizing a Final Task into a Task Loss
We have seen in the introduction that both structured prediction and generative modeling involve a notion of final task (end goal) which is at the same time crucial and not well defined. Despite the complexity of the final task, we can still try to define criteria which characterize good solutions. If we incorporate sufficiently many criteria relevant to the final task into the task loss, then the hope is that minimizing it over an appropriate class of models will yield a model that will perform well on that final task.
A usual task loss in structured prediction is the generalization error induced by a structured loss, which specifies how bad it is to predict one label instead of another. For many prediction problems, the structured prediction community has engineered structured loss functions which induce properties of interest on the learned predictors. In machine translation, a commonly considered property of interest is for candidate translations to contain many words in common with the ground truth; this has given rise to the BLEU score, which counts the percentage of candidate words appearing in the ground truth. In the context of image segmentation, Osokin & Kohli (2014) have compared various structured loss functions which induce different properties on the predicted foreground mask.
As for generative modeling, the focus of this paper, we consider the special case of GANs. Specifically, we adopt the view that GANs are minimizing parametric adversarial divergences. Such an objective can be seen as both our learning objective and evaluation metric. In other words, the task loss of a GAN is a parametric divergence. Just as the structured prediction community has engineered structured losses, the GAN community has engineered architectures and formulations to induce properties of interest on the task loss. For instance, in the DCGAN (Radford et al., 2016), the discriminator has a convolutional architecture, which makes it potentially robust to small deformations that would not significantly affect the visual quality of the samples, while still allowing it to detect blurry samples; both are desirable properties for the final task of image generation.
3.2 The Choice of Task Loss is Crucial in High Dimensions
In this section we draw insights from the convergence results of Osokin et al. (2017) in structured prediction. They show in a specific setting that some “weaker” task losses can be easier to learn from than some stronger ones. This formal result parallels the intuition in generative modeling that learning with “weaker” divergences is easier (Arjovsky et al., 2017) and more intuitive (Liu et al., 2017) than stronger divergences.
Consider a strong structured loss, the 0-1 loss, and a weaker loss, the Hamming loss, which applies when the label decomposes into binary variables. Weaker losses like the Hamming loss have more flexibility; since they tell us how close a prediction is to the ground truth, we can expect that fewer examples are needed to generalize well. In a non-parametric setting (details and limitations are in Appendix B.3), Osokin et al. (2017) derive a worst-case sample complexity needed to obtain a fixed error. The sample complexity quantifies how fast the model minimizes the task loss. For the 0-1 loss, they get a sample complexity which is exponential in the dimension of the label. However, for the Hamming loss, they get a much better sample complexity which is polynomial in the number of dimensions, whenever certain constraints are imposed on the score function (see Osokin et al., 2017, section on exact calibration functions). Thus their results suggest that choosing the right structured loss, like the weaker Hamming loss, might make training exponentially faster.
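As a small illustration of why the Hamming loss is more informative than the 0-1 loss, the following sketch evaluates both on binary labels. It assumes the standard textbook definitions (0-1 loss is 1 unless the labels match exactly; Hamming loss is the fraction of mismatched positions); the normalization and the function names are choices made for this example, not taken from Osokin et al. (2017).

```python
def zero_one_loss(y_pred, y_true):
    """Strong loss: 1 unless the whole structured label matches exactly."""
    return 0.0 if tuple(y_pred) == tuple(y_true) else 1.0


def hamming_loss(y_pred, y_true):
    """Weaker loss: fraction of positions where the two labels disagree."""
    if len(y_pred) != len(y_true):
        raise ValueError("labels must have the same length")
    return sum(a != b for a, b in zip(y_pred, y_true)) / len(y_true)


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
near_miss = [1, 0, 1, 1, 0, 0, 1, 1]  # a single bit is wrong
far_miss = [0, 1, 0, 0, 1, 1, 0, 1]   # every bit is wrong

for name, y_pred in [("near miss", near_miss), ("far miss", far_miss)]:
    print(name, zero_one_loss(y_pred, y_true), hamming_loss(y_pred, y_true))
```

The 0-1 loss assigns the same value to both predictions, whereas the Hamming loss separates the near miss from the far miss, which is the kind of graded signal the paragraph above refers to.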
Under the framework of statistical decision theory (details in Section B.1 of the Appendix), their results can be related to analogous results in generative modeling (Arjovsky et al., 2017; Liu et al., 2017) showing that it can be easier to learn with weaker divergences than with stronger ones. In particular, one of their arguments is that distributions with disjoint support can be compared in weaker topologies like the one induced by the Wasserstein but not in stronger ones like the one induced by the Jensen-Shannon.
4 Advantages of Parametric Adversarial Divergences
In this section, we show that parametric adversarial divergences have many desirable properties which make them more appropriate for generative modeling in high dimensions, by comparing them to traditional divergences in terms of sample complexity, computational cost (Section 4.1), and ability to integrate criteria related to the final task (Section 4.2). We refer the reader to the Appendix for further highlights regarding differing properties of divergences. In particular we discuss shortcomings of the KL-divergence and of common workarounds in Appendix C.2. Optimization and stability issues are discussed in Appendix C.3. The fact that the parametric adversarial divergence formulation only requires the ability to sample from the generative model, and provides useful learning signal even when their nonparametric counterparts are not well-defined, is discussed in Appendix C.4.
4.1 Sample Complexity and Computational Cost
Since we want to learn from finite data, we would like to know how well empirical estimates of a divergence approximate the population divergence. In other words, we want to control the sample complexity, that is, how many samples we need so that, with high probability, the divergence between the empirical distributions is close to the divergence between the underlying distributions. Sample complexities for parametric and nonparametric divergences are summarized in Table 1.
Parametric adversarial divergences can be formulated as a classification/regression problem with a loss depending on the specific adversarial divergence. Therefore, they have a reasonable sample complexity, governed by the VC-dimension/number of parameters of the discriminator (Arora et al., 2017), and can be solved using classic stochastic gradient methods.
A straightforward estimator of the (nonparametric) Wasserstein is simply the Wasserstein distance between the empirical distributions, for which smoothed versions can be computed using specialized algorithms such as Sinkhorn’s algorithm (Cuturi, 2013) or iterative Bregman projections (Benamou et al., 2015). However, this empirical Wasserstein estimator has a sample complexity which is exponential in the number of dimensions (see Sriperumbudur et al., 2012, Corollary 3.5). Thus the empirical Wasserstein is not a viable estimator in high dimensions.
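To make the smoothed empirical Wasserstein estimator concrete, here is a minimal pure-Python sketch of the basic Sinkhorn iteration between two uniform empirical distributions. It is an illustration under stated assumptions, not the implementation used in the experiments: the points are one-dimensional, the ground cost is the squared distance, the regularization strength `reg` and iteration count are arbitrary, and no log-domain stabilization is applied (so very small `reg` would underflow).

```python
import math


def sinkhorn_cost(xs, ys, reg=0.5, n_iters=500):
    """Entropy-regularized transport cost between two uniform empirical
    distributions on the real line, with squared-distance ground cost."""
    n, m = len(xs), len(ys)
    a = [1.0 / n] * n  # uniform weights on the first sample
    b = [1.0 / m] * m  # uniform weights on the second sample
    cost = [[(x - y) ** 2 for y in ys] for x in xs]
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P_ij = u_i * K_ij * v_j; return the cost <P, C>.
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))


xs = [0.0, 0.25, 0.5, 0.75, 1.0]
shifted = [x + 1.0 for x in xs]
self_cost = sinkhorn_cost(xs, xs)
shift_cost = sinkhorn_cost(xs, shifted)
print(self_cost, shift_cost)
```

Note that the regularized cost between a sample and itself is not zero, because the entropic term blurs the transport plan. Each iteration costs on the order of the product of the two sample sizes, and the estimate is only as good as the empirical distributions it is computed from, which is where the unfavorable sample complexity discussed above enters.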
4.2 Ability to Integrate Desirable Properties for the Final Task
In Section 3, we discussed the necessity and importance of designing task losses which reflect the final task. We showed that in structured prediction, optimizing for more informative task losses can make learning considerably easier under some conditions. Similarly in generative modeling, we would like divergences to be as informative and close to the final task as possible. We show that although not all divergences can easily integrate final task-related criteria, parametric divergences provide an indirect way to do so.
Pure f-divergences cannot directly integrate any notion of final task, at least not without tweaking the generator. (One could also attempt to induce properties of interest by adding a regularization term to the f-divergence. However, if we assume that maximum likelihood is itself often not a meaningful task loss, then there is no guarantee that minimizing a tradeoff between maximum likelihood and a regularization term is more meaningful or easier.) The Wasserstein distance and MMD are respectively induced by a base metric and a kernel. The metric and kernel give us the opportunity to specify a task by letting us express a (subjective) notion of similarity. However, the metric and kernel traditionally had to be defined by hand. For instance, Genevay et al. (2017) learn to generate MNIST by minimizing a smooth Wasserstein based on the L2-distance, while Dziugaite et al. (2015) and Li et al. (2015) also learn to generate MNIST by minimizing the MMD induced by kernels obtained externally: either generic kernels based on the L2-distance or kernels based on autoencoder features. However, the results seem to be limited to simple datasets. There is no obvious or generally accepted way to learn the metric or kernel in an end-to-end fashion; this is an active research direction. In particular, MMD has recently been combined with adversarial kernel learning, with convincing results on LSUN, CelebA and ImageNet images: Mroueh et al. (2017) learn a feature map and try to match its mean and covariance, Li et al. (2017) learn kernels end-to-end, while Bellemare et al. (2017) do end-to-end learning of energy distances, which are closely related to MMD. See Bińkowski et al. (2018) for a recent review of MMD-based GANs.
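The role of a hand-defined kernel can be seen in a minimal sketch of an MMD estimate. The code below assumes a Gaussian kernel on the L2-distance with a fixed, hand-picked bandwidth and uses the biased (V-statistic) estimator of the squared MMD; the function names and the bandwidth value are illustrative choices, not those of the cited papers.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Generic kernel based on the squared L2-distance (hand-chosen bandwidth)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(us, vs):
        return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


sample = [[0.0], [0.5], [1.0]]
moved = [[p[0] + 3.0] for p in sample]
print(mmd_squared(sample, sample), mmd_squared(sample, moved))
```

Everything the divergence can "see" is determined by the `kernel` argument, which is why swapping a generic L2-based kernel for a learned one changes what the resulting objective is sensitive to.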
Parametric adversarial divergences offer a different route to being tailored to a specific final task, as they are induced by a parametrized class of discriminators and a formulation. The architecture of the discriminator, which significantly restricts the considered class of functions compared to the nonparametric counterpart, implicitly determines what aspects the divergence will be more sensitive or blind to. For instance, using a convolutional network as the discriminator may render the divergence insensitive to small elastic deformations, while leaving it able to detect whether images are “natural”: e.g., natural images should not be blurry, should be regular enough, and should have edges and textures. This trait is common to all parametric adversarial divergences, with the nonparametric Wasserstein having an additional knob for incorporating properties of interest (that it shares with the parametric Wasserstein) in the form of the choice of base metric. In Section 6 we conduct experiments with the goal of shedding light on the relation between the choice of discriminator and the divergence.
5 Related Work
Closest to our work are the following two papers. Arora et al. (2017) argue that analyzing GANs with a nonparametric (optimal discriminator) view does not really make sense, because the usual nonparametric divergences considered have bad sample complexity. They also prove sample complexities for parametric divergences. Liu et al. (2017) prove under some conditions that globally minimizing a neural divergence is equivalent to matching all moments that can be represented within the discriminator family. They unify parametric divergences with nonparametric divergences and introduce the notion of strong and weak divergence. However neither of these works focuses on the meaning and practical properties of parametric divergences, as we do here, regarding their suitability for a final task, and paralleling similar questions studied in structured prediction.
Throughout this paper, we have also used the following results from the literature to discuss whether parametric divergences are good task losses for generative modeling. Here by “good” we mean relevant to the final task (Section 3) and having practical advantages for use as a learning objective (Section 4). Before the first GAN paper, Sriperumbudur et al. (2012) unified traditional Integral Probability Metrics (IPM), analyzed their statistical properties, and proposed to view them as classification problems. Similarly, Reid & Williamson (2011) show that computing a divergence can be formulated as a classification problem. Later, Nowozin et al. (2016) generalize the GAN objective to any adversarial f-divergence. However, the first papers to actually study the effect of restricting the discriminator to be a neural network instead of any function are the MMD-GAN papers: Li et al. (2015); Dziugaite et al. (2015); Li et al. (2017); Mroueh et al. (2017); and Bellemare et al. (2017), who give an interpretation of their energy distance framework in terms of moment matching. Mohamed & Lakshminarayanan (2016) give many interpretations of generative modeling, including moment matching, divergence minimization, and density ratio matching. On the other hand, work has been done to better understand the GAN objective in order to improve its stability (Salimans et al., 2016). Subsequently, Arjovsky et al. (2017) introduce the adversarial Wasserstein distance, which makes training much more stable, and Gulrajani et al. (2017) improve the objective to make it more practical. Regarding model evaluation, Theis et al. (2016) contains an excellent discussion on the evaluation of generative models; they show in particular that log-likelihood is not a good proxy for the visual quality of samples. Danihelka et al. (2017) compare parametric adversarial divergence and likelihood objectives in the special case of RealNVP, a generator with explicit density, and obtain better visual results with the adversarial divergence. Concerning the theoretical understanding of learning in structured prediction, several recent papers, such as Cortes et al. (2016) and London et al. (2016), propose generalization error bounds in the same vein as Osokin et al. (2017) but with data dependencies.
Our perspective on generative modeling is novel because we ground it on the notion of final task – what we ultimately care about – and highlight the multiple reasons why parametric divergences offer a superior framework to define good task losses with respect to a final task; in essence, they provide a more effective and meaningful training signal. We also perform experiments to determine properties of some parametric divergences, such as invariance/robustness, ability to enforce constraints and properties of interest, as well as the difference with their nonparametric counterparts. To the best of our knowledge, this is the first work that links the task loss generalization error of structured prediction and the adversarial divergences used in generative modeling.
6 Experimental results
Importance of Sample Complexity.
Since the sample complexity of the nonparametric Wasserstein is exponential in the dimension (Section 4.1), we verify experimentally whether training a generator to minimize the nonparametric Wasserstein distance works in high dimensions. Implementation details and generated samples are in Appendix D.2. In summary, on MNIST, the generator manages to produce decent but blurry images. However, on CIFAR-10, which has higher intrinsic dimensionality, the generator fails to produce meaningful samples. This is in stark contrast with the high-quality generators displayed in the literature with a parametric adversarial Wasserstein (Wasserstein-GAN).
Robustness to Transformations.
Ideally, good divergences should vary smoothly with the amount of a small transformation applied to a reference distribution. They should neither saturate nor be invariant to those transformations in order to provide a useful learning signal. We consider two transformations (rotation and additive noise of input images), and plot the divergence between MNIST and transformed MNIST as a function of the amplitude of the transformation (degrees of rotation and standard deviation of noise). We consider the parametric Jensen-Shannon (ParametricJS) and parametric Wasserstein (ParametricW; we use the WGAN-GP formulation of Gulrajani et al., 2017) divergences induced by three discriminators (linear, 1-layer-dense, 2-layer-cnn). ParametricJS saturates quickly for rotations except for very simple architectures like the linear one (Figure 0(a)). ParametricW, on the other hand, does not saturate for any architecture. This is consistent with theory on weaker vs. stronger nonparametric divergences (Arjovsky et al., 2017; Liu et al., 2017). For additive Gaussian noise (Figure 0(c)), the linear discriminator is totally unable to distinguish the two distributions (it only “sees” the means of the distributions), whereas more complex architectures like CNNs can. In that sense the linear discriminator is too weak for the task, or not strict enough (Liu et al., 2017), which suggests that a better divergence involves trading off between robustness and strength.
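The remark that a linear discriminator only “sees” the means can be checked on a toy example. The sketch below assumes an IPM-style objective with a unit-norm linear critic, for which the supremum has a closed form (the Euclidean norm of the difference of sample means); this simplified setting, the synthetic data, and all names are assumptions for illustration and do not reproduce the MNIST experiment.

```python
import math
import random


def mean_vector(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[k] for p in points) / len(points) for k in range(dim)]


def linear_critic_divergence(xs, ys):
    """sup over unit-norm w of mean(w . x) - mean(w . y).
    The supremum is attained when w points along the difference of means,
    so the value is the Euclidean norm of that difference."""
    mx, my = mean_vector(xs), mean_vector(ys)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mx, my)))


rng = random.Random(0)  # fixed seed so the example is reproducible
data = [[rng.random() for _ in range(4)] for _ in range(2000)]
noisy = [[v + rng.gauss(0.0, 0.5) for v in x] for x in data]  # zero-mean noise
shifted = [[v + 1.0 for v in x] for x in data]                # mean shift

noise_gap = linear_critic_divergence(data, noisy)
shift_gap = linear_critic_divergence(data, shifted)
print(noise_gap, shift_gap)
```

Zero-mean noise leaves the means essentially unchanged, so the linear critic reports a gap close to zero even though the noisy samples are clearly different, while a pure mean shift is detected in full.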
Learning High-dimensional Data.
We collect Thin-8, a dataset of about 1500 handwritten images of the digit “8” at very high resolution, and augment them with random elastic deformations during training. Because the pen strokes are relatively thin, we expect any pixel-wise distance to be uninformative: the images are dominated by background pixels, and with high probability any two “8”s will overlap on only a small area. We train a convolutional VAE and a WGAN-GP (Gulrajani et al., 2017), henceforth simply denoted GAN, using nearly the same architectures (VAE decoder similar to GAN generator, VAE encoder similar to GAN discriminator), with 16 latent variables, at several resolutions. Generated samples are shown in Figure 2. Indeed, we observe that the VAE, trained to maximize the evidence lower bound on the log-likelihood, fails to generate convincing samples in high dimensions: they are blurry, pixel values are gray instead of being white, and some samples look like the average of many digits. On the contrary, the GAN can generate sharp and realistic samples even at the highest resolution. Our hypothesis is that the discriminator learns moments which are easier to match than it is to directly match the training set with maximum likelihood. Since we were able to perfectly generate high-resolution digits, an additional insight of our experiment is that the main difficulty in generating high-dimensional natural images (like ImageNet and LSUN bedrooms) resides not in high resolution itself, but in the intrinsic complexity of the scenes. Such complexity can be hidden in low resolution, which might explain recent successes in generating images in low resolution but not in higher ones.
Enforcing and Generalizing Constraints.
To be able to compare VAEs and GANs quantitatively rather than simply inspecting the quality of their generated images, we design the visual hyperplane dataset, which we generate on-the-fly with the following process. First, we enumerate all 5631 combinations of 5 digits such that those digits sum up to 25. Then, we split them into disjoint train (80%) and test (20%) sets. The sampling process then consists in uniformly sampling a random combination from the train/test set, sampling corresponding digit images from MNIST, and finally concatenating them to yield the final image containing the 5 digits in a row summing up to 25. We train a VAE and a WGAN-GP (henceforth simply denoted GAN) on the train set. Both models share the same architecture for the generator network and use 200 latent variables. After training, with the help of an MNIST classifier, we automatically recognize and sum up the digits in each generated sample. Generated samples can be found in Section D.4 of the appendix (as usual, the VAE samples are mostly blurry while the GAN samples are more realistic and crisp). We then compare how well the VAE and GAN enforce and generalize the constraint that the digits sum to 25. Figure 3 shows, on the left, the distributions of the sums of the digits generated by the VAE and GAN, and on the right, their train and test recall. (Train/test recalls are defined as the proportions of the train/test sets covered by a given generative model, after generating a fixed number of samples.) Our first observation is that the GAN distribution is more peaked and centered around the target 25, while the VAE distribution is less precise and not centered around the target. In that respect, the GAN (though still far from nailing the problem) was better than the VAE at capturing and enforcing the particular aspects and constraints of the data distribution (summing up to 25).
One possible explanation is that since training a classifier to recognize digits and sum them up is not hard in a supervised setting, it could also be relatively easy for a discriminator to discover such a constraint. Our second observation is that the WGAN-GP has the best train and test recalls, followed by the independent baseline, while the VAE ranks last. On the one hand, WGAN-GP has better train recall, which means it better covers the target distribution and has less mode-dropping than the VAE. On the other hand, WGAN-GP also has higher test recall than the VAE, so in a sense, it is better at generalizing constraints to new samples. Understanding why the VAE has worse recall than the independent baseline requires further investigation.
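The combinatorial part of the visual hyperplane construction can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the shuffling seed, and the way the 80/20 split is drawn are assumptions, and the MNIST image sampling and digit classifier are left out. Combinations are treated as ordered 5-tuples of digits, which gives the count of 5631 stated above.

```python
import itertools
import random

def enumerate_combinations(n_digits=5, target=25):
    """All ordered tuples of n_digits decimal digits whose sum equals target."""
    return [c for c in itertools.product(range(10), repeat=n_digits)
            if sum(c) == target]

def split_train_test(combos, train_frac=0.8, seed=0):
    """Shuffle and split into disjoint train/test lists (the seed is an assumption)."""
    combos = list(combos)
    random.Random(seed).shuffle(combos)
    cut = int(train_frac * len(combos))
    return combos[:cut], combos[cut:]

def recall(generated, reference):
    """Fraction of the reference combinations that appear among the generated ones."""
    reference = set(reference)
    return len(reference & set(generated)) / len(reference)

combos = enumerate_combinations()
train, test = split_train_test(combos)
```

In the experiment, `generated` would hold the digit tuples read off the generated images by the classifier; here any list of 5-tuples can be passed to `recall`.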
We provided several pieces of evidence in favor of parametric adversarial divergences over nonparametric divergences to guide generative modeling in high dimension, the most important being their ability to account for the final task. We provided a unifying perspective relating structured prediction and generative modeling under the framework of statistical decision theory. This allowed us to connect recent results from structured prediction to the notions of strong and weak divergences. Moreover, viewing parametric adversarial divergences as proper statistical task losses advocates for using them more systematically as evaluation criteria, replacing inflexible hand-crafted criteria which usually cannot be as exhaustive. In a sense, they are a flexible mathematical tool with which to incorporate desirable properties into a meaningful task loss that goes beyond the more traditional kernels and metrics. This will be our starting point for future work on how to define meaningful evaluation criteria with minimal human intervention.
This research was partially supported by the Canada Excellence Research Chair in “Data Science for Real-time Decision-making”, by the NSERC Discovery Grant RGPIN-2017-06936 and by a Google Research Award.
- Arjovsky et al. (2017) Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. Wasserstein GAN. In ICML, 2017.
- Arora et al. (1993) Arora, Sanjeev, Babai, László, Stern, Jacques, and Sweedyk, Z. The hardness of approximate optima in lattices, codes, and systems of linear equations. In FOCS, 1993.
- Arora et al. (2017) Arora, Sanjeev, Ge, Rong, Liang, Yingyu, Ma, Tengyu, and Zhang, Yi. Generalization and equilibrium in generative adversarial nets (GANs). In ICML, 2017.
- Aude et al. (2016) Aude, Genevay, Cuturi, Marco, Peyré, Gabriel, and Bach, Francis. Stochastic optimization for large-scale optimal transport. In NIPS, 2016.
- Bartlett et al. (2006) Bartlett, Peter L, Jordan, Michael I, and McAuliffe, Jon D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
- Belanger & McCallum (2016) Belanger, David and McCallum, Andrew. Structured prediction energy networks. In ICML, 2016.
- Bellemare et al. (2017) Bellemare, Marc G, Danihelka, Ivo, Dabney, Will, Mohamed, Shakir, Lakshminarayanan, Balaji, Hoyer, Stephan, and Munos, Rémi. The Cramer distance as a solution to biased Wasserstein gradients. arXiv:1705.10743, 2017.
- Benamou et al. (2015) Benamou, Jean-David, Carlier, Guillaume, Cuturi, Marco, Nenna, Luca, and Peyré, Gabriel. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
- Bińkowski et al. (2018) Bińkowski, Mikołaj, Sutherland, Dougal J, Arbel, Michael, and Gretton, Arthur. Demystifying MMD GANs. arXiv:1801.01401, 2018.
- Bousquet et al. (2017) Bousquet, Olivier, Gelly, Sylvain, Tolstikhin, Ilya, Simon-Gabriel, Carl-Johann, and Schoelkopf, Bernhard. From optimal transport to generative modeling: the VEGAN cookbook. arXiv:1705.07642, 2017.
- Collins (2002) Collins, Michael. In EMNLP, 2002.
- Cortes et al. (2016) Cortes, Corinna, Kuznetsov, Vitaly, Mohri, Mehryar, and Yang, Scott. Structured prediction theory based on factor graph complexity. In NIPS, 2016.
- Cuturi (2013) Cuturi, Marco. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, 2013.
- Danihelka et al. (2017) Danihelka, Ivo, Lakshminarayanan, Balaji, Uria, Benigno, Wierstra, Daan, and Dayan, Peter. Comparison of maximum likelihood and GAN-based training of real NVPs. arXiv:1705.05263, 2017.
- Dziugaite et al. (2015) Dziugaite, Gintare Karolina, Roy, Daniel M., and Ghahramani, Zoubin. Training generative neural networks via maximum mean discrepancy optimization. In UAI, 2015.
- Fedus et al. (2017) Fedus, William, Rosca, Mihaela, Lakshminarayanan, Balaji, Dai, Andrew M, Mohamed, Shakir, and Goodfellow, Ian. Many paths to equilibrium: GANs do not need to decrease a divergence at every step. arXiv:1710.08446, 2017.
- Genevay et al. (2017) Genevay, Aude, Peyré, Gabriel, and Cuturi, Marco. Sinkhorn-Autodiff: Tractable Wasserstein learning of generative models. arXiv:1706.00292, 2017.
- Goodfellow et al. (2014) Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In NIPS, 2014.
- Gretton et al. (2007) Gretton, Arthur, Borgwardt, Karsten M, Rasch, Malte, Schölkopf, Bernhard, Smola, Alexander J, et al. A kernel method for the two-sample-problem. In NIPS, 2007.
- Gretton et al. (2012) Gretton, Arthur, Borgwardt, Karsten M, Rasch, Malte J, Schölkopf, Bernhard, and Smola, Alexander. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
- Gulrajani et al. (2017) Gulrajani, Ishaan, Ahmed, Faruk, Arjovsky, Martin, Dumoulin, Vincent, and Courville, Aaron. Improved training of Wasserstein GANs. In NIPS, 2017. (to appear).
- Kingma & Welling (2014) Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. In ICLR, 2014.
- Kodali et al. (2017) Kodali, Naveen, Abernethy, Jacob, Hays, James, and Kira, Zsolt. How to train your DRAGAN. arXiv:1705.07215, 2017.
- Lamb et al. (2016) Lamb, Alex M, Goyal, Anirudh, Zhang, Ying, Zhang, Saizheng, Courville, Aaron C, and Bengio, Yoshua. Professor forcing: A new algorithm for training recurrent networks. In NIPS, 2016.
- Leblond et al. (2017) Leblond, Rémi, Alayrac, Jean-Baptiste, Osokin, Anton, and Lacoste-Julien, Simon. SEARNN: Training RNNs with global-local losses. arXiv:1706.04499, 2017.
- LeCun et al. (2006) LeCun, Yann, Chopra, Sumit, Hadsell, Raia, Ranzato, M, and Huang, F. A tutorial on energy-based learning. Predicting structured data, 2006.
- Li et al. (2017) Li, Chun-Liang, Chang, Wei-Cheng, Cheng, Yu, Yang, Yiming, and Póczos, Barnabás. MMD GAN: Towards deeper understanding of moment matching network. In NIPS, 2017. (to appear).
- Li et al. (2015) Li, Yujia, Swersky, Kevin, and Zemel, Rich. Generative moment matching networks. In ICML, 2015.
- Liu et al. (2017) Liu, Shuang, Bousquet, Olivier, and Chaudhuri, Kamalika. Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, pp. 5551–5559, 2017.
- London et al. (2016) London, Ben, Huang, Bert, and Getoor, Lise. Stability and generalization in structured prediction. Journal of Machine Learning Research, 17(222):1–52, 2016.
- Mikolov et al. (2010) Mikolov, Tomas, Karafiát, Martin, Burget, Lukas, Cernockỳ, Jan, and Khudanpur, Sanjeev. Recurrent neural network based language model. In Interspeech, volume 2, pp. 3, 2010.
- Mohamed & Lakshminarayanan (2016) Mohamed, Shakir and Lakshminarayanan, Balaji. Learning in implicit generative models. arXiv:1610.03483, 2016.
- Moon & Hero (2014) Moon, Kevin and Hero, Alfred. Multivariate f-divergence estimation with confidence. In Advances in Neural Information Processing Systems, pp. 2420–2428, 2014.
- Mroueh et al. (2017) Mroueh, Youssef, Sercu, Tom, and Goel, Vaibhava. McGan: Mean and covariance feature matching GAN. In ICML, 2017.
- Nguyen et al. (2010) Nguyen, XuanLong, Wainwright, Martin J, and Jordan, Michael I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
- Nowozin et al. (2016) Nowozin, Sebastian, Cseke, Botond, and Tomioka, Ryota. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.
- Oord et al. (2016) Oord, Aaron van den, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. In ICML, 2016.
- Osokin & Kohli (2014) Osokin, Anton and Kohli, Pushmeet. Perceptually inspired layout-aware losses for image segmentation. In ECCV, 2014.
- Osokin et al. (2017) Osokin, Anton, Bach, Francis, and Lacoste-Julien, Simon. On structured prediction theory with calibrated convex surrogate losses. In NIPS, 2017.
- Papineni et al. (2002) Papineni, Kishore, Roukos, Salim, Ward, Todd, and Zhu, Wei-Jing. BLEU: a method for automatic evaluation of machine translation. In ACL, pp. 311–318, 2002.
- Petzka et al. (2017) Petzka, Henning, Fischer, Asja, and Lukovnicov, Denis. On the regularization of Wasserstein GANs. arXiv:1709.08894, 2017.
- Pires et al. (2013) Pires, Bernardo Avila, Szepesvari, Csaba, and Ghavamzadeh, Mohammad. Cost-sensitive multiclass classification risk bounds. In ICML, 2013.
- Radford et al. (2016) Radford, Alec, Metz, Luke, and Chintala, Soumith. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
- Reddi et al. (2015) Reddi, Sashank J, Ramdas, Aaditya, Póczos, Barnabás, Singh, Aarti, and Wasserman, Larry. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In AAAI, 2015.
- Reid & Williamson (2011) Reid, Mark D and Williamson, Robert C. Information, divergence and risk for binary experiments. Journal of Machine Learning Research, 12(Mar):731–817, 2011.
- Roth et al. (2017) Roth, Kevin, Lucchi, Aurelien, Nowozin, Sebastian, and Hofmann, Thomas. Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems, pp. 2015–2025, 2017.
- Ruderman et al. (2012) Ruderman, Avraham, Reid, Mark, García-García, Darío, and Petterson, James. Tighter variational representations of f-divergences via restriction to probability measures. arXiv preprint arXiv:1206.4664, 2012.
- Salimans et al. (2016) Salimans, Tim, Goodfellow, Ian, Zaremba, Wojciech, Cheung, Vicki, Radford, Alec, and Chen, Xi. Improved techniques for training GANs. In NIPS, 2016.
- Sriperumbudur et al. (2012) Sriperumbudur, Bharath K, Fukumizu, Kenji, Gretton, Arthur, Schölkopf, Bernhard, Lanckriet, Gert RG, et al. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.
- Taskar et al. (2003) Taskar, Ben, Guestrin, Carlos, and Koller, Daphne. Max-margin Markov networks. In NIPS, 2003.
- Theis et al. (2016) Theis, Lucas, Oord, Aäron van den, and Bethge, Matthias. A note on the evaluation of generative models. In ICLR, 2016.
Appendix A Unifying Adversarial Divergences
Under this formalism we give some examples of nonparametric divergences:
$\psi$-divergences with generator function $\psi$ (which we call f-divergences) can be written in dual form (Nowozin et al., 2016) (the standard form is $\mathbb{E}_{x \sim q_\theta}\left[\psi\left(\frac{p(x)}{q_\theta(x)}\right)\right]$):
$$\mathrm{Div}_\psi(p \,\|\, q_\theta) = \sup_{T:\, \mathcal{X} \to \mathbb{R}} \; \mathbb{E}_{x \sim p}[T(x)] - \mathbb{E}_{x' \sim q_\theta}[\psi^*(T(x'))], \tag{5}$$
where $\psi^*$ is the convex conjugate. Depending on $\psi$, one can obtain any $\psi$-divergence such as the (reverse) Kullback-Leibler, the Jensen-Shannon, the Total Variation, or the Chi-Squared. (For instance, the Kullback-Leibler has the dual form $\sup_T \mathbb{E}_{x \sim p}[T(x)] - \mathbb{E}_{x' \sim q_\theta}[\exp(T(x') - 1)]$. Some require additional constraints on $T$, such as a bound on $\|T\|_\infty$ for the Total Variation.)
Wasserstein-1 distance induced by an arbitrary norm $\|\cdot\|$ and its corresponding dual norm $\|\cdot\|_*$ (Sriperumbudur et al., 2012):
$$W(p, q_\theta) = \sup_{f:\, \|\nabla f(x)\|_* \le 1 \;\forall x} \; \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x' \sim q_\theta}[f(x')] = \inf_{\pi \in \Pi(p, q_\theta)} \mathbb{E}_{(x, x') \sim \pi}\, \|x - x'\|, \tag{6}$$
which can be interpreted as the cost to transport all probability mass of $p$ into $q_\theta$, where $\|x - x'\|$ is the unit cost of transporting $x$ to $x'$.
In the optimization problems (5) and (6), whenever the function being optimized ($T$ or $f$) is additionally constrained to be in a given parametric family, the associated divergence will be termed a parametric adversarial divergence. In practice, that family will typically be specified as a neural network architecture, so in this work we will use the term neural adversarial divergences interchangeably with the slightly more generic parametric adversarial divergence. For instance, the parametric adversarial Jensen-Shannon optimized in GANs corresponds to (5) with a specific $\psi$ (Nowozin et al., 2016), while the parametric adversarial Wasserstein optimized in WGANs corresponds to (6) where $f$ is a neural network. See Liu et al. (2017) for a review and interpretation of other divergences such as the Wasserstein with entropic smoothing (Aude et al., 2016), energy-based distances (Li et al., 2017), which can be seen as adversarial MMD, and the WGAN-GP (Gulrajani et al., 2017) objective.
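As a sanity check on the dual form of f-divergences, the following sketch evaluates the Kullback-Leibler case on a small discrete example. It assumes the generator $u \log u$, whose convex conjugate is $\exp(t - 1)$, and uses the known optimal critic $T^*(x) = 1 + \log\frac{p(x)}{q(x)}$. The function names and the three-point distributions are illustrative assumptions. The last line restricts the critic to constant functions, a deliberately tiny "parametric family", which only yields a lower bound on the divergence.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_dual_objective(p, q, critic):
    """E_p[T] - E_q[exp(T - 1)], the dual objective for the generator u*log(u)."""
    return (sum(pi * ti for pi, ti in zip(p, critic))
            - sum(qi * math.exp(ti - 1.0) for qi, ti in zip(q, critic)))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

# Optimal critic: the dual objective attains the KL divergence exactly.
critic_star = [1.0 + math.log(pi / qi) for pi, qi in zip(p, q)]
value_at_optimum = kl_dual_objective(p, q, critic_star)

# Any other critic gives a smaller value.
critic_shifted = [t + 0.3 for t in critic_star]
value_shifted = kl_dual_objective(p, q, critic_shifted)

# Best constant critic (T = 1 everywhere) gives 0, a loose lower bound.
value_constant = kl_dual_objective(p, q, [1.0, 1.0, 1.0])
```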
Appendix B Task Losses in Structured Prediction and Generative Modeling
B.1 Statistical Decision Theory
We frame the relationship of structured prediction and GANs using the framework of statistical decision theory. Assume that we are in a world with a set $\mathcal{P}$ of possible states and that we have a set $\mathcal{A}$ of actions. When the world is in the state $p \in \mathcal{P}$, the cost of playing action $a \in \mathcal{A}$ is the (statistical) task loss $L_p(a)$. The goal is to play the action minimizing the task loss.
Generative models with Maximum Likelihood.
The set of possible states is the set of available distributions $\{p\}$ for the data $x$. The set of actions is the set of possible distributions $\{q_\theta\}$ for the model, and the task loss is the negative log-likelihood,
$$L_p(q_\theta) = \mathbb{E}_{x \sim p}\left[-\log q_\theta(x)\right].$$
Structured prediction.
The set of possible states is the set of available distributions $\{p\}$ for $(x, y)$. The set of actions is the set of prediction functions $\{h_\theta\}$, and the task loss is the generalization error:
$$L_p(h_\theta) = \mathbb{E}_{(x, y) \sim p}\left[\ell(h_\theta(x), y)\right], \tag{9}$$
where $\ell$ is a structured loss function.
GANs.
The set of possible states is the set of available distributions $\{p\}$ for the data $x$. The set of actions is the set of distributions $\{q_\theta\}$ that the generator can learn, and the task loss is the adversarial divergence
$$L_p(q_\theta) = \sup_{f \in \mathcal{F}} \mathbb{E}_{(x, x') \sim p \otimes q_\theta}\left[\Delta(f(x), f(x'))\right]. \tag{10}$$
Under this unified framework, the prediction function $h_\theta$ is analogous to the generative model $q_\theta$, while the choice of the right structured loss $\ell$ can be related to $\Delta$ and to the choice of the discriminator family $\mathcal{F}$ which will induce a good adversarial divergence. We will further develop this analogy in Section 3.1.
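The decision-theoretic vocabulary can be made concrete with a toy example. The sketch below takes the maximum-likelihood case on a three-outcome space: the state is a data distribution, the actions are candidate model distributions, and the task loss is the expected negative log-likelihood. All names and numbers are illustrative assumptions.

```python
import math

def nll_task_loss(state, action):
    """Task loss: expected negative log-likelihood of the action under the state."""
    return sum(-p * math.log(q) for p, q in zip(state, action) if p > 0)

def best_action(state, actions):
    """Play the action that minimizes the task loss in the given state."""
    return min(actions, key=lambda action: nll_task_loss(state, action))

state = [0.7, 0.2, 0.1]            # true data distribution (the state of the world)
actions = [
    [0.7, 0.2, 0.1],               # model that matches the data distribution
    [1 / 3, 1 / 3, 1 / 3],         # uniform model
    [0.1, 0.2, 0.7],               # model with the mass reversed
]
chosen = best_action(state, actions)
```

Swapping `nll_task_loss` for another loss changes which action is preferred, which is the sense in which the task loss encodes the final task.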
B.2 Diagram: Formalizing a Final Task into a Task Loss
The process of formalizing an ill-defined final task into a mathematical statistical task loss has many similarities between structured prediction and generative modeling (Figure 4). In our framework, one starts by specifying properties of interest which characterize the final task. Then, one crafts task losses which enforce such properties of interest.
While the task loss of structured prediction is typically the generalization error (9) and is induced by the choice of structured loss $\ell$, the task loss of GANs is the adversarial divergence (10) and is induced by the choice of architecture $\mathcal{F}$ and formulation $\Delta$. Therefore one can enforce properties of interest by choosing adequate structured losses and architectures, depending on the task. Please see Section 3 for full details.
B.3 Limitations of Osokin et al. (2017)’s Theory.
Although Osokin et al. (2017) give a lot of insights, their results must be taken with a grain of salt. In this section we point out the limitations of their theory.
First, their analysis ignores the dependence on the input $x$ and is non-parametric, which means that they consider the whole class of possible score functions for each given $x$. Additionally, they only consider convex consistent surrogate losses in their analysis, and they give upper bounds but not lower bounds on the sample complexity. It is possible that optimizing approximately-consistent surrogate losses instead of consistent ones, or making additional assumptions on the distribution of the data, could yield better sample complexities.
Appendix C Advantages of Parametric Adversarial Divergences
In this section, we describe additional advantages and properties of parametric adversarial divergences.
C.1 Additional Sample Complexities and Computational Costs
Here we give sample complexities for f-divergences and for MMD.
For explicit models which allow evaluating the density $q_\theta(x)$, one could use Monte-Carlo to evaluate the f-divergence with sample complexity $n = O(1/\epsilon^2)$, according to the Central-Limit theorem. For implicit models, there is no single good way of estimating f-divergences from samples. Some techniques exist (Nguyen et al., 2010; Moon & Hero, 2014; Ruderman et al., 2012), but they all make additional assumptions about the underlying densities (such as smoothness), or they solve the dual in a restricted family, such as an RKHS, which makes the divergences no longer f-divergences.
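To illustrate the Monte-Carlo route for explicit models, the sketch below estimates the Kullback-Leibler divergence between two one-dimensional Gaussians whose densities can be evaluated, and compares it with the closed form. The choice of distributions, the sample size, and the seed are assumptions made for the example. The estimator's error shrinks like $1/\sqrt{n}$, which is where a sample complexity of order $1/\epsilon^2$ comes from.

```python
import math
import random

def log_normal_pdf(x, mean, std):
    """Log-density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def monte_carlo_kl(n, rng, mean_p=0.0, mean_q=1.0, std=1.0):
    """Estimate KL(p || q) = E_{x ~ p}[log p(x) - log q(x)] from n samples of p."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mean_p, std)
        total += log_normal_pdf(x, mean_p, std) - log_normal_pdf(x, mean_q, std)
    return total / n

# Closed form for equal variances: (mean_p - mean_q)^2 / (2 * std^2).
exact_kl = 0.5
estimate = monte_carlo_kl(20000, random.Random(0))
```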
Maximum Mean Discrepancy admits an estimator with sample complexity $n = O(1/\epsilon^2)$, which can be computed analytically in $O(n^2)$. More details are given in the original MMD paper (Gretton et al., 2007). One should note that MMD depends fundamentally on the choice of kernel. As the sample complexity is independent of the dimension of the data, one might believe that the MMD estimator behaves well in high dimensions. However, it was experimentally illustrated in Dziugaite et al. (2015) that with generic kernels like RBF, MMD performs poorly on the MNIST and Toronto face datasets, as the generated images have many artifacts and are clearly distinguishable from the training dataset. See Section 4.2 for more details on the choice of kernel. It was also shown theoretically in Reddi et al. (2015) that the power of the MMD statistical test can drop polynomially with increasing dimension, which means that with generic kernels, MMD might be unable to discriminate well between high-dimensional generated and training distributions. More precisely, consider a Gaussian kernel of a given bandwidth and compute the MMD between two isotropic Gaussians with different means. Then, as the dimension $d$ grows, the MMD goes to zero polynomially in two of the bandwidth regimes considered and exponentially in a third (see Reddi et al. (2015) for the exact rates and conditions), all while the KL divergence between the two Gaussians stays constant.
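For concreteness, here is a minimal sketch of the standard unbiased estimator of the squared MMD with a Gaussian (RBF) kernel, written in plain Python for one-dimensional samples. The bandwidth, sample sizes, and test distributions are assumptions for illustration; the double loops make the quadratic cost in the number of samples explicit.

```python
import math
import random

def rbf_kernel(x, y, bandwidth):
    """Gaussian kernel exp(-(x - y)^2 / (2 * bandwidth^2)) for scalars."""
    return math.exp(-(x - y) ** 2 / (2.0 * bandwidth ** 2))

def mmd2_unbiased(xs, ys, bandwidth=1.0):
    """Unbiased estimate of the squared MMD; costs O(n^2) kernel evaluations."""
    n, m = len(xs), len(ys)
    k_xx = sum(rbf_kernel(xs[i], xs[j], bandwidth)
               for i in range(n) for j in range(n) if i != j)
    k_yy = sum(rbf_kernel(ys[i], ys[j], bandwidth)
               for i in range(m) for j in range(m) if i != j)
    k_xy = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return k_xx / (n * (n - 1)) + k_yy / (m * (m - 1)) - 2.0 * k_xy / (n * m)

rng = random.Random(0)
sample_a = [rng.gauss(0.0, 1.0) for _ in range(200)]
sample_b = [rng.gauss(0.0, 1.0) for _ in range(200)]  # same distribution as sample_a
sample_c = [rng.gauss(3.0, 1.0) for _ in range(200)]  # shifted mean

mmd2_same = mmd2_unbiased(sample_a, sample_b)
mmd2_different = mmd2_unbiased(sample_a, sample_c)
```

In one dimension the two cases are easy to tell apart; the point made above is that this separation can degrade as the dimension grows when a generic kernel is used.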
C.2 Combining KL-Divergence with Generators that have Special Structure creates other Problems
In some cases, imposing a certain structure on the generator (e.g. a Gaussian or Laplacian observation model) yields a Kullback-Leibler divergence which involves some form of component-wise distance between samples, reminiscent of the Hamming loss (see Section 3.2) used in structured prediction. However, doing maximum likelihood on generators having an imposed special structure can have drawbacks which we detail here. For instance, the generative model of a typical variational autoencoder can be seen as an infinite mixture of Gaussians (Kingma & Welling, 2014). The log-likelihood thus involves a “reconstruction loss”, a pixel-wise L2 distance between images analogous to the Hamming loss, which makes the training relatively easy and very stable. However, the Gaussian is partly responsible for the VAE’s inability to learn sharp distributions. Indeed, it is a known problem that VAEs produce blurry samples (Arjovsky et al., 2017). In fact, even if the approximate posterior matched the true posterior exactly, which would correspond to the evidence lower-bound being tight, the output of the VAE would still be blurry (Bousquet et al., 2017).
Other examples are autoregressive models such as recurrent neural networks (Mikolov et al., 2010), which factorize naturally as $q_\theta(x) = \prod_i q_\theta(x_i \mid x_1, \dots, x_{i-1})$, and PixelCNNs (Oord et al., 2016). Training autoregressive models using maximum likelihood results in teacher-forcing (Lamb et al., 2016): each ground-truth symbol is fed to the RNN, which then has to maximize the likelihood of the next symbol. Since teacher-forcing induces a lot of supervision, it is possible to learn using maximum-likelihood. Once again, there are similarities with the Hamming loss because each predicted symbol is compared with its associated ground truth symbol. However, among other problems, there is a discrepancy between training and generation. Sampling from $q_\theta$ would require iteratively sampling each symbol and feeding it back to the RNN, giving the potential to accumulate errors, which is not something that is accounted for during training. See Leblond et al. (2017) and references therein for more principled approaches to sequence prediction with autoregressive models.
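The gap between teacher-forced training and free-running generation can be seen in a toy autoregressive model. The sketch below replaces the RNN with a hand-written first-order Markov chain over two symbols, which is a strong simplification (an RNN conditions on the whole prefix); the probability tables and function names are assumptions for illustration. The likelihood always conditions on ground-truth symbols, whereas sampling conditions on the model's own previous output.

```python
import math
import random

# Hypothetical model parameters: initial and transition probabilities.
INITIAL = {"a": 0.5, "b": 0.5}
TRANSITION = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.2, "b": 0.8}}

def teacher_forced_log_likelihood(sequence):
    """Sum of log-conditionals, each conditioned on the ground-truth previous symbol."""
    log_prob = math.log(INITIAL[sequence[0]])
    for previous, current in zip(sequence, sequence[1:]):
        log_prob += math.log(TRANSITION[previous][current])
    return log_prob

def free_running_sample(length, rng):
    """Generate by feeding each sampled symbol back in as the next input."""
    symbols = [rng.choices(list(INITIAL), weights=list(INITIAL.values()))[0]]
    while len(symbols) < length:
        probs = TRANSITION[symbols[-1]]
        symbols.append(rng.choices(list(probs), weights=list(probs.values()))[0])
    return "".join(symbols)

log_prob = teacher_forced_log_likelihood("aab")
generated = free_running_sample(10, random.Random(0))
```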
C.3 Ease of Optimization and Stability
While adversarial divergences are learned and thus potentially much more powerful than traditional divergences, the fact that they are the solution to a hard, non-convex problem can make GANs unstable. Not all adversarial divergences are equally stable: Arjovsky et al. (2017) claimed that the adversarial Wasserstein gives a more meaningful learning signal than the adversarial Jensen-Shannon, in the sense that it correlates well with the quality of the samples and is less prone to mode dropping. In Section 6 we will show experimentally on a simple setting that indeed the neural adversarial Wasserstein consistently gives a more meaningful learning signal than the neural adversarial Jensen-Shannon, regardless of the discriminator architecture. Similarly to the WGAN, the MMD-GAN divergence (Li et al., 2017) was shown to correlate well with the quality of samples and to be robust to mode collapse. Recently, it was shown that neural adversarial divergences other than the Wasserstein can also be made stable by regularizing the discriminator properly (Kodali et al., 2017; Roth et al., 2017).
C.4 Sampling from Generator is Sufficient
Maximum-likelihood typically requires computing the density $q_\theta(x)$, which is not possible for implicit models such as GANs, from which it is only possible to sample. On the other hand, parametric adversarial divergences can be estimated with reasonable sample complexity (see Section 4.1) only by sampling from the generator, without any assumption on the form of the generator. This is also true for MMD but generally not the case for the empirical Wasserstein, which has bad sample complexity as stated previously. Another issue of f-divergences such as the Kullback-Leibler and the Jensen-Shannon is that they are either not defined (Kullback-Leibler) or uninformative (Jensen-Shannon) when $p$ is not absolutely continuous w.r.t. $q_\theta$ (Nowozin et al., 2016), which makes them unusable for learning sharp distributions such as manifolds. On the other hand, some integral probability metrics, such as the Wasserstein, MMD, or their adversarial counterparts, are well defined for any distributions $p$ and $q_\theta$. In fact, even though the Jensen-Shannon is uninformative for manifolds, the parametric adversarial Jensen-Shannon used in the original GANs (Goodfellow et al., 2014) still allows learning realistic samples, even though the process is unstable (Salimans et al., 2016).
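The contrast between an uninformative Jensen-Shannon and a well-behaved Wasserstein can be checked on the simplest sharp distributions: two point masses on a one-dimensional grid. In the sketch below (function names and grid size are illustrative assumptions), the Jensen-Shannon divergence equals $\log 2$ however far apart the masses are, while the Wasserstein-1 distance, computed from the difference of cumulative distribution functions, grows with their separation.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between discrete distributions (natural log)."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)

def wasserstein1_on_grid(p, q, spacing=1.0):
    """W1 on an evenly spaced 1-D grid: integral of |CDF_p - CDF_q|."""
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap) * spacing
    return total

def point_mass(index, size=10):
    """Distribution putting all its mass on one grid point."""
    return [1.0 if i == index else 0.0 for i in range(size)]

js_near = js_divergence(point_mass(0), point_mass(1))
js_far = js_divergence(point_mass(0), point_mass(7))
w1_near = wasserstein1_on_grid(point_mass(0), point_mass(1))
w1_far = wasserstein1_on_grid(point_mass(0), point_mass(7))
```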
C.5 Limitations of Sample Complexity Analysis
Note that comparing divergences in terms of sample complexity can give good insights on what is a good divergence, but should be taken with a grain of salt as well. On the one hand, the sample complexities we give are upper-bounds, which means the estimators could potentially converge faster. On the other hand, one might not need a very good estimator of the divergence in order to learn in some cases. This is illustrated in our experiments with the nonparametric Wasserstein (Section 6) which has bad sample complexity but yields reasonable results.
Appendix D Experimental results
D.1 Learnability of Parametric Adversarial Divergences.
Here, we compare the parametric adversarial divergences induced by three different discriminators (linear, dense, and CNN) under the WGAN-GP (Gulrajani et al., 2017) formulation.
We consider one of the simplest non-trivial generators, in order to factor out optimization issues on the generator side. The model is a mixture of $K = 100$ Gaussians with zero-covariance. The model density is $q_\theta(x) = \frac{1}{K} \sum_{z=1}^{K} \delta(x - x_z)$, parametrized by prototypes $\theta = (x_z)_{1 \le z \le K}$. The generative process consists in sampling a discrete random variable $z \in \{1, \dots, K\}$, and returning the prototype $x_z$.
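The generative process of this prototype model is short enough to write out. The sketch below is an illustration only: the class name is hypothetical, the discrete variable is assumed to be uniform, and prototypes are plain Python lists rather than learned image tensors. In the experiment, the prototypes are the parameters that get updated against the discriminator.

```python
import random

class PrototypeGenerator:
    """Mixture of zero-covariance Gaussians: sampling returns one stored prototype."""

    def __init__(self, prototypes):
        self.prototypes = prototypes  # the model parameters, one vector per component

    def sample(self, rng):
        # Sample the discrete latent variable (assumed uniform), then return
        # the corresponding prototype unchanged, since the covariance is zero.
        z = rng.randrange(len(self.prototypes))
        return self.prototypes[z]

generator = PrototypeGenerator([[0.0, 0.0], [1.0, 0.5], [0.2, 0.9]])
rng = random.Random(0)
samples = [generator.sample(rng) for _ in range(50)]
```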
Learned prototypes (means of each Gaussian) are shown in Figures 5 and 6. The first observation is that the linear discriminator induces too weak a divergence: all prototypes only learn the mean of the training set. The dense discriminator learns prototypes which sometimes look like digits, but are blurry or unrecognizable most of the time. The samples from the CNN discriminator are never blurry and are recognizable in the majority of cases. Our results confirm that, even for simplistic models like a mixture of Gaussians, using a CNN discriminator provides a better task loss for generative modeling of images.
D.2 Importance of Sample Complexity.
Since the sample complexity of the nonparametric Wasserstein is exponential in the dimension (Section 4.1), we check experimentally whether training a generator to minimize the nonparametric Wasserstein distance fails in high dimensions. We implement the Sinkhorn-AutoDiff algorithm (Genevay et al., 2017) to compute the entropy-regularized L2-Wasserstein distance between minibatches of training images and generated images. Figure 7 shows generated samples after training with the Sinkhorn-AutoDiff algorithm on both the MNIST and CIFAR-10 datasets. On MNIST, the network manages to produce decent but blurry images. However, on CIFAR-10, which is a much more complex dataset, the network fails to produce meaningful samples, which suggests that the nonparametric Wasserstein should not be used for generative modeling when the (effective) dimensionality is high. This result is to be contrasted with the recent successes in image generation of the parametric Wasserstein (Gulrajani et al., 2017), which also has much better sample complexity than the nonparametric Wasserstein.
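For readers unfamiliar with entropy-regularized optimal transport, the following sketch runs plain Sinkhorn iterations (Cuturi, 2013) between two small sets of scalars with a squared-distance cost and uniform weights. It is not the Sinkhorn-AutoDiff implementation used in the experiment: there is no automatic differentiation, no images, and the regularization strength, iteration count, and function name are assumptions chosen for the example.

```python
import math

def sinkhorn_cost(xs, ys, epsilon=1.0, n_iters=500):
    """Entropy-regularized transport plan and its transport cost for 1-D points."""
    n, m = len(xs), len(ys)
    a = [1.0 / n] * n  # uniform weights on the first sample
    b = [1.0 / m] * m  # uniform weights on the second sample
    cost = [[(x - y) ** 2 for y in ys] for x in xs]
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return total, plan

same_cost, same_plan = sinkhorn_cost([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
shifted_cost, shifted_plan = sinkhorn_cost([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
```

The returned plan has (approximately) uniform row and column sums, and the cost is larger for the shifted point set than for identical point sets.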
D.3 Additional Samples for VAE and GAN
D.4 Visual Hyperplane: Generated samples
Figure 11 shows some additional samples from the VAE and WGAN-GP trained on the visual-hyperplane task. Both models have 200 latent variables and similar architectures.