cramer-gan
Tensorflow Implementation on "The Cramer Distance as a Solution to Biased Wasserstein Gradients" (https://arxiv.org/pdf/1705.10743.pdf)
The Wasserstein probability metric has received much attention from the machine learning community. Unlike the Kullback-Leibler divergence, which strictly measures change in probability, the Wasserstein metric reflects the underlying geometry between outcomes. The value of being sensitive to this geometry has been demonstrated, among others, in ordinal regression and generative modelling. In this paper we describe three natural properties of probability divergences that reflect requirements from machine learning: sum invariance, scale sensitivity, and unbiased sample gradients. The Wasserstein metric possesses the first two properties but, unlike the Kullback-Leibler divergence, does not possess the third. We provide empirical evidence suggesting that this is a serious issue in practice. Leveraging insights from probabilistic forecasting we propose an alternative to the Wasserstein metric, the Cramér distance. We show that the Cramér distance possesses all three desired properties, combining the best of the Wasserstein and Kullback-Leibler divergences. To illustrate the relevance of the Cramér distance in practice we design a new algorithm, the Cramér Generative Adversarial Network (GAN), and show that it performs significantly better than the related Wasserstein GAN.
In machine learning, the Kullback-Leibler (KL) divergence is perhaps the most common way of assessing how well a probabilistic model explains observed data. Among the reasons for its popularity is that it is directly related to maximum likelihood estimation and is easily optimized. However, the KL divergence suffers from a significant limitation: it does not take into account how close two outcomes might be, but only their relative probability. This closeness can matter a great deal: in image modelling, for example, perceptual similarity is key (Rubner et al., 2000; Gao and Kleywegt, 2016). Put another way, the KL divergence cannot reward a model that “gets it almost right”.
To address this limitation, researchers have turned to the Wasserstein metric, which does incorporate the underlying geometry between outcomes. The Wasserstein metric can be applied to distributions with non-overlapping supports, and has good out-of-sample performance (Esfahani and Kuhn, 2015). Yet, practical applications of the Wasserstein distance, especially in deep learning, remain tentative. In this paper we provide a clue as to why that might be: estimating the Wasserstein metric from samples yields biased gradients, and may actually lead to the wrong minimum. This precludes using stochastic gradient descent (SGD) and SGD-like methods, whose fundamental mode of operation is sample-based, when optimizing for this metric.
As a replacement we propose the Cramér distance (Székely, 2002; Rizzo and Székely, 2016), also known as the continuous ranked probability score in the probabilistic forecasting literature (Gneiting and Raftery, 2007). The Cramér distance, like the Wasserstein metric, respects the underlying geometry but also has unbiased sample gradients. To underscore our theoretical findings, we demonstrate a significant quantitative difference between the two metrics when employed in typical machine learning scenarios: categorical distribution estimation, regression, and finally image generation.
In this section we provide the notation to mathematically distinguish the Wasserstein metric (and later, the Cramér distance) from the Kullback-Leibler divergence and probability distances such as the total variation. The experimentally-minded reader may choose to skip to Section 4.2, where we show that minimizing the sample Wasserstein loss leads to significant degradation in performance.
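Let $P$ be a probability distribution over $\mathbb{R}$. When $P$ is continuous, we will assume it has density $\mu_P$. The expectation of a function $f : \mathbb{R} \to \mathbb{R}$ with respect to $P$ is
$$\mathbb{E}_{x \sim P} f(x) := \int_{-\infty}^{\infty} f(x) \, P(\mathrm{d}x).$$
We will suppose all expectations and integrals under consideration are finite. We will often associate $P$ to a random variable $X$, such that for a subset of the reals $A \subseteq \mathbb{R}$, we have $\Pr\{X \in A\} = P(A)$. The (cumulative) distribution function of $P$ is then
$$F_P(x) := \Pr\{X \le x\} = \int_{-\infty}^{x} P(\mathrm{d}t).$$
Finally, the inverse distribution function of $P$, defined over the interval $(0, 1]$, is
$$F_P^{-1}(u) := \inf\{x \in \mathbb{R} : F_P(x) \ge u\}.$$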
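Consider two probability distributions $P$ and $Q$ over $\mathbb{R}$. A divergence $d$ is a mapping $(P, Q) \mapsto \mathbb{R}^+$ with $d(P, Q) = 0$ if and only if $P = Q$ almost everywhere. A popular choice is the Kullback-Leibler (KL) divergence
$$\mathrm{KL}(P \,\|\, Q) := \int_{-\infty}^{\infty} \log \frac{P(\mathrm{d}x)}{Q(\mathrm{d}x)} \, P(\mathrm{d}x),$$
with $\mathrm{KL}(P \,\|\, Q) = \infty$ if $P$ is not absolutely continuous w.r.t. $Q$. The KL divergence, also called relative entropy, measures the amount of information needed to encode the change in probability from $Q$ to $P$ (Cover and Thomas, 1991).
A probability metric is a divergence which is also symmetric ($d(P, Q) = d(Q, P)$) and respects the triangle inequality: for any distribution $R$, $d(P, Q) \le d(P, R) + d(R, Q)$. We will use the term probability distance to mean a symmetric divergence satisfying the relaxed triangle inequality $d(P, Q) \le c \, [d(P, R) + d(R, Q)]$ for some $c \ge 1$.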
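To make the limitation of the KL divergence concrete, the following minimal sketch computes it for discrete distributions. The helper name `kl_divergence` and the example vectors are illustrative assumptions, not part of the paper. Moving the misplaced mass to a nearby outcome or to a distant one yields the same KL value, because only probabilities enter the formula, not the outcomes themselves.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as probability vectors
    over the same outcomes. Returns inf when P is not absolutely
    continuous w.r.t. Q (P puts mass where Q has none)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    if np.any(q[support] == 0):
        return np.inf
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

# Hypothetical example over the outcomes 0, 1, 2, 3.
target = [1.0, 0.0, 0.0, 0.0]   # all mass on outcome 0
near = [0.5, 0.5, 0.0, 0.0]     # half the mass misplaced on a neighbouring outcome
far = [0.5, 0.0, 0.0, 0.5]      # half the mass misplaced on the farthest outcome

kl_near = kl_divergence(target, near)
kl_far = kl_divergence(target, far)
```

Both values equal log 2 here, so the KL divergence cannot prefer the model that "gets it almost right".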
We will first study the $p$-Wasserstein metrics $w_p$ (Dudley, 2002). For $1 \le p < \infty$, a practical definition is through the inverse distribution functions of $P$ and $Q$:
$$w_p(P, Q) := \left( \int_0^1 \left| F_P^{-1}(u) - F_Q^{-1}(u) \right|^p \mathrm{d}u \right)^{1/p}. \tag{1}$$
We will sometimes find it convenient to deal with the $p$-th power of the metric, which we will denote by $w_p^p$; note that $w_p^p$ is not a metric proper, but is a probability distance.
We will be chiefly concerned with the 1-Wasserstein metric, which is most commonly used in practice. The 1-Wasserstein metric has a dual form which is theoretically convenient and which we mention here for completeness. Define $\mathcal{F}_\infty$ to be the class of $1$-Lipschitz functions. Then
$$w_1(P, Q) = \sup_{f \in \mathcal{F}_\infty} \left| \mathbb{E}_{x \sim P} f(x) - \mathbb{E}_{x \sim Q} f(x) \right|. \tag{2}$$
This is a special case of the celebrated Monge-Kantorovich duality (Rachev et al., 2013), and is the integral probability metric (IPM) with function class $\mathcal{F}_\infty$ (Müller, 1997). We invite the curious reader to consult these two sources as a starting point on this rich topic.
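The inverse-distribution-function definition (1) can be evaluated exactly for discrete distributions, since both quantile functions are piecewise constant. The sketch below is one way to do this, assuming valid probability vectors as inputs; the function name `wasserstein_p` and its argument layout are illustrative choices, not taken from the paper.

```python
import numpy as np

def wasserstein_p(xs, p, ys, q, order=1):
    """p-Wasserstein metric between two discrete distributions on the reals.

    xs, ys: atom locations; p, q: their probabilities (each summing to 1).
    Integrates |F_P^{-1}(u) - F_Q^{-1}(u)|^order over u in (0, 1]."""
    xs, p = np.asarray(xs, dtype=float), np.asarray(p, dtype=float)
    ys, q = np.asarray(ys, dtype=float), np.asarray(q, dtype=float)
    ix, iy = np.argsort(xs), np.argsort(ys)
    xs, p, ys, q = xs[ix], p[ix], ys[iy], q[iy]
    cp, cq = np.cumsum(p), np.cumsum(q)
    cp[-1] = cq[-1] = 1.0  # guard against rounding in the cumulative sums
    # Both quantile functions are constant between consecutive breakpoints.
    edges = np.unique(np.concatenate([[0.0], cp, cq]))
    widths = np.diff(edges)
    mids = 0.5 * (edges[:-1] + edges[1:])
    # F^{-1}(u) = smallest atom whose cumulative probability reaches u.
    qx = xs[np.searchsorted(cp, mids, side="left")]
    qy = ys[np.searchsorted(cq, mids, side="left")]
    return float(np.sum(widths * np.abs(qx - qy) ** order) ** (1.0 / order))

# Two Bernoulli distributions with parameters 0.3 and 0.7.
w1 = wasserstein_p([0, 1], [0.7, 0.3], [0, 1], [0.3, 0.7], order=1)
w2 = wasserstein_p([0, 1], [0.7, 0.3], [0, 1], [0.3, 0.7], order=2)
```

For two Bernoulli distributions the quantile functions differ by exactly 1 on an interval of length equal to the difference of the parameters, so the $p$-th power of the metric is the same for every $p$.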
As noted in the introduction, the fundamental difference between the KL divergence and the Wasserstein metric is that the latter is sensitive not only to change in probability but also to the geometry of possible outcomes. To capture this notion we now introduce the concept of an ideal divergence.
Consider a divergence $d$, and for two random variables $X, Y$ with distributions $P, Q$ write $d(X, Y) := d(P, Q)$. We say that $d$ is scale sensitive (of order $\beta$), i.e. it has property (S), if there exists a $\beta > 0$ such that for all $X$, $Y$, and a real value $c > 0$,
$$d(cX, cY) \le |c|^\beta \, d(X, Y). \tag{S}$$
A divergence $d$ has property (I), i.e. it is sum invariant, if whenever $A$ is independent from $X, Y$,
$$d(A + X, A + Y) \le d(X, Y). \tag{I}$$
Following Zolotarev (1976), an ideal divergence $d$ is one that possesses both (S) and (I). (Properties (S) and (I) are called regularity and homogeneity by Zolotarev; we believe our choice of terms is more machine learning-friendly.)
We can illustrate the sensitivity of ideal divergences to the value of outcomes by considering Dirac functions $\delta_x$ at different values of $x$. If $d$ is scale sensitive of order $\beta = 1$ then the divergence $d(\delta_0, \delta_{1/2})$ can be no more than half the divergence $d(\delta_0, \delta_1)$. If $d$ is sum invariant, then the divergence of $\delta_0$ to $\delta_1$ is equal to the divergence of the same distributions shifted by a constant $c$, i.e. of $\delta_c$ to $\delta_{1+c}$. As a concrete example of the importance of these properties, Bellemare et al. (2017) recently demonstrated the importance of ideal metrics in reinforcement learning, specifically their role in providing the contraction property for a distributional form of the Bellman update.
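As a worked illustration of the two properties, consider Dirac distributions under the 1-Wasserstein metric, for which the distance is simply the distance between the two atoms. The symbols $a$ (a constant shift) and $c > 0$ (a scale factor) are introduced here only for the example.

```latex
\begin{align*}
w_1(\delta_{cx}, \delta_{cy}) &= |cx - cy| = |c| \, w_1(\delta_x, \delta_y)
  && \text{(scale sensitivity of order } \beta = 1\text{)} \\
w_1(\delta_{x+a}, \delta_{y+a}) &= |(x + a) - (y + a)| = w_1(\delta_x, \delta_y)
  && \text{(sum invariance under a constant shift)} \\
\mathrm{KL}(\delta_0 \,\|\, \delta_{1/2}) &= \mathrm{KL}(\delta_0 \,\|\, \delta_1) = \infty
  && \text{(the KL divergence ignores the scale of outcomes)}
\end{align*}
```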
In machine learning we often view the divergence $d$ as a loss function. Specifically, let $Q_\theta$ be some distribution parametrized by $\theta$, and consider the loss $\theta \mapsto d(P, Q_\theta)$. We are interested in minimizing this loss, that is finding $\theta^* := \arg\min_\theta d(P, Q_\theta)$. We now describe a third property based on this loss, which we call unbiased sample gradients.
Let $X_m := X_1, X_2, \dots, X_m$ be independent samples from $P$ and define the empirical distribution $\hat{P}_m := \frac{1}{m} \sum_{i=1}^m \delta_{X_i}$ (note that $\hat{P}_m$ is a random quantity). From this, define the sample loss $\theta \mapsto d(\hat{P}_m, Q_\theta)$. We say that $d$ has unbiased sample gradients when the expected gradient of the sample loss equals the gradient of the true loss for all $P$ and $m$:
$$\mathbb{E}_{X_m \sim P} \nabla_\theta d(\hat{P}_m, Q_\theta) = \nabla_\theta d(P, Q_\theta). \tag{U}$$
The notion of unbiased sample gradients is ubiquitous in machine learning and in particular in deep learning. Specifically, if a divergence $d$ does not possess (U) then minimizing it with stochastic gradient descent may not converge, or it may converge to the wrong minimum. Conversely, if $d$ possesses (U) then we can guarantee that the distribution which minimizes the expected sample loss is $Q = P$. In the probabilistic forecasting literature, this makes $d$ a proper scoring rule (Gneiting and Raftery, 2007).
We now characterize the KL divergence and the Wasserstein metric in terms of these properties. As it turns out, neither simultaneously possesses both (U) and (S).
The KL divergence has unbiased sample gradients (U), but is not scale sensitive (S).
The Wasserstein metric is ideal (I, S), but does not have unbiased sample gradients.
We will provide a proof of the bias in the sample Wasserstein gradients just below; the proof of the rest and later results are provided in the appendix.
In this section we give theoretical evidence of serious issues with gradients of the sample Wasserstein loss. We will consider a simple Bernoulli distribution $P$ with parameter $\theta^* \in (0, 1)$, which we would like to estimate from samples. Our model is $Q_\theta$, a Bernoulli distribution with parameter $\theta$. We study the behaviour of stochastic gradient descent w.r.t. $\theta$ over the sample Wasserstein loss, specifically using the $p$-th power of the metric (as is commonly done to avoid fractional exponents). Our results build on the example given by Bellemare et al. (2017), whose result is for $\theta^* = 1/2$ and $m = 1$.
Consider the estimate $\nabla_\theta w_p^p(\hat{P}_m, Q_\theta)$ of the gradient $\nabla_\theta w_p^p(P, Q_\theta)$. We now show that even in this simplest of settings, this estimate is biased, and we exhibit a lower bound on the bias for any value of $m$. Hence the Wasserstein metric does not have property (U). More worrisome still, we show that the minimum of the expected empirical Wasserstein loss $\theta \mapsto \mathbb{E}_{X_m} w_p^p(\hat{P}_m, Q_\theta)$ is not the minimum of the Wasserstein loss $\theta \mapsto w_p^p(P, Q_\theta)$. We then conclude that minimizing the sample Wasserstein loss by stochastic gradient descent may in general fail to converge to the minimum of the true loss.
Theorem 1. Let $\hat{P}_m = \frac{1}{m} \sum_{i=1}^m \delta_{X_i}$ be the empirical distribution derived from $m$ independent samples $X_1, \dots, X_m$ drawn from a Bernoulli distribution $P$. Then for all $1 \le p < \infty$,
Non-vanishing minimax bias of the sample gradient. For any $m \ge 1$ there exists a pair of Bernoulli distributions $P$, $Q_\theta$ for which $\left| \mathbb{E}_{X_m} \nabla_\theta w_p^p(\hat{P}_m, Q_\theta) - \nabla_\theta w_p^p(P, Q_\theta) \right| \ge 2 e^{-2}$.
Wrong minimum of the sample Wasserstein loss. The minimum of the expected sample loss $\tilde{\theta} = \arg\min_\theta \mathbb{E}_{X_m} w_p^p(\hat{P}_m, Q_\theta)$ is in general different from the minimum of the true Wasserstein loss $\theta^* = \arg\min_\theta w_p^p(P, Q_\theta)$.
Deterministic solutions to stochastic problems. For any $m \ge 1$, there exists a distribution $P$ with nonzero entropy whose sample loss is minimized by a distribution $Q_\theta$ with zero entropy.
Taken as a whole, Theorem 1 states that we cannot in general minimize the Wasserstein loss using naive stochastic gradient descent methods. Although our result does not imply the lack of a stochastic optimization procedure for this loss (for example, if $P$ has finite support, keeping track of the empirical distribution suffices), we believe our result to be serious cause for concern. We leave as an open question whether an unbiased optimization procedure exists and is practical.
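The bias in the Bernoulli setting can be checked numerically without any sampling. For two Bernoulli distributions the $p$-th power of the $p$-Wasserstein metric is the absolute difference of their parameters, so the sample gradient is a sign function and its expectation is a finite sum over binomial outcomes. The sketch below is an illustration under that observation; the function names and the particular choice of parameters (true parameter $1/m$, model parameter $1/(2m)$) are assumptions made for the example.

```python
from math import comb

def expected_sample_gradient(theta, theta_star, m):
    """Exact expectation of d/dtheta |theta_hat - theta|, where
    m * theta_hat ~ Binomial(m, theta_star). A zero subgradient is
    used when theta coincides with theta_hat."""
    total = 0.0
    for k in range(m + 1):
        prob = comb(m, k) * theta_star**k * (1 - theta_star) ** (m - k)
        theta_hat = k / m
        if theta > theta_hat:
            total += prob
        elif theta < theta_hat:
            total -= prob
    return total

def true_gradient(theta, theta_star):
    """Gradient of |theta_star - theta| w.r.t. theta (for theta != theta_star)."""
    return float((theta > theta_star) - (theta < theta_star))

# Bias of the sample gradient for growing sample sizes.
biases = {}
for m in (2, 5, 10, 100):
    theta_star, theta = 1.0 / m, 0.5 / m
    biases[m] = abs(expected_sample_gradient(theta, theta_star, m)
                    - true_gradient(theta, theta_star))
```

In this configuration the bias does not shrink as `m` grows: the true parameter moves towards the boundary at the same rate as the sample size increases, which is the minimax flavour of the first statement of the theorem.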
Our result is surprising given the prevalence of the Wasserstein metric in empirical studies. We hypothesize that this bias exists in published results and is an underlying cause of learning instability and poor convergence often remedied by heuristic means. For example, Frogner et al. (2015) and Montavon et al. (2016) reported the need for a mixed KL-Wasserstein loss to obtain good empirical results, with the latter explicitly discussing the issue of wrong minima when using Wasserstein gradients.
We remark that our result also applies to the dual (2), since the losses are the same. This dual was recently considered by Arjovsky et al. (2017) as an alternative loss to the primal (1). The adversarial procedure proposed by the authors is a two time-scale process which first maximizes (2) w.r.t. $f \in \mathcal{F}_\infty$ using $m$ samples, then takes a single stochastic gradient step w.r.t. $\theta$. Interestingly, this approach does seem to provide unbiased gradients as $m \to \infty$. However, the cost of a single gradient is now significantly higher, and for a fixed $m$ we conjecture that the minimax bias remains.
We are now ready to describe an alternative to the Wasserstein metric, the Cramér distance (Székely, 2002; Rizzo and Székely, 2016). As we shall see, the Cramér distance has the same appealing properties as the Wasserstein metric, but also provides us with unbiased sample gradients. As a result, we believe this underappreciated distance is strictly superior to the Wasserstein metric for machine learning applications.
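Recall that for two distributions $P$ and $Q$ over $\mathbb{R}$, their (cumulative) distribution functions are respectively $F_P$ and $F_Q$. The Cramér distance between $P$ and $Q$ is
$$l_2^2(P, Q) := \int_{-\infty}^{\infty} \left( F_P(x) - F_Q(x) \right)^2 \mathrm{d}x.$$
Note that as written, the Cramér distance is not a metric proper. However, its square root is, and is a member of the $l_p$ family of metrics
$$l_p(P, Q) := \left( \int_{-\infty}^{\infty} \left| F_P(x) - F_Q(x) \right|^p \mathrm{d}x \right)^{1/p}.$$
The $l_p$ and Wasserstein metrics are identical at $p = 1$, but are otherwise distinct. Like the Wasserstein metrics, the $l_p$ metrics have dual forms as integral probability metrics (see Dedecker and Merlevède, 2007, for a proof):
$$l_p(P, Q) = \sup_{f \in \mathcal{F}_q} \left| \mathbb{E}_{x \sim P} f(x) - \mathbb{E}_{x \sim Q} f(x) \right|, \tag{3}$$
where $\mathcal{F}_q := \{ f : f \text{ is absolutely continuous}, \ \| \tfrac{\mathrm{d}f}{\mathrm{d}x} \|_q \le 1 \}$ and $q$ is the conjugate exponent of $p$, i.e. $p^{-1} + q^{-1} = 1$. (This relationship is the reason for the notation $\mathcal{F}_\infty$ in the definition of the dual of the 1-Wasserstein (2).) It is this dual form that we use to prove that the Cramér distance has property (S).
Consider two random variables $X$, $Y$, a random variable $A$ independent of $X, Y$, and a real value $c > 0$. Then for $1 \le p \le \infty$,
$$\text{(I)} \quad l_p(A + X, A + Y) \le l_p(X, Y), \qquad \text{(S)} \quad l_p(cX, cY) \le |c|^{1/p} \, l_p(X, Y).$$
Furthermore, the Cramér distance has unbiased sample gradients. That is, given $X_m := X_1, \dots, X_m$ drawn from a distribution $P$, the empirical distribution $\hat{P}_m := \frac{1}{m} \sum_{i=1}^m \delta_{X_i}$, and a distribution $Q_\theta$,
$$\mathbb{E}_{X_m \sim P} \nabla_\theta l_2^2(\hat{P}_m, Q_\theta) = \nabla_\theta l_2^2(P, Q_\theta),$$
and of all the $l_p^p$ distances, only the Cramér ($p = 2$) has this property.
We conclude that the Cramér distance enjoys both the benefits of the Wasserstein metric and the SGD-friendliness of the KL divergence. Given the close similarity of the Wasserstein and $l_p$ metrics, it is truly remarkable that only the Cramér distance has unbiased sample gradients.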
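The unbiasedness claim can be verified exactly in the Bernoulli setting. For discrete distributions on a shared sorted support, the Cramér distance is a finite sum of squared differences between the two distribution functions, weighted by the gaps between atoms. The sketch below, with illustrative function names and a finite-difference gradient chosen for simplicity, compares the expected sample gradient (summed over all binomial outcomes) with the true gradient.

```python
import numpy as np
from math import comb

def cramer_distance(support, p, q):
    """Integral of (F_P - F_Q)^2 for two discrete distributions
    defined on the same sorted support."""
    support = np.asarray(support, dtype=float)
    cdf_diff = np.cumsum(np.asarray(p, dtype=float) - np.asarray(q, dtype=float))
    # The distribution functions are constant between consecutive atoms.
    return float(np.sum(cdf_diff[:-1] ** 2 * np.diff(support)))

def bernoulli(theta):
    return np.array([1.0 - theta, theta])

def grad_cramer(p, theta, eps=1e-6):
    """Central finite difference of theta -> cramer_distance(P, Bernoulli(theta))."""
    loss = lambda t: cramer_distance([0.0, 1.0], p, bernoulli(t))
    return (loss(theta + eps) - loss(theta - eps)) / (2 * eps)

def expected_sample_grad(theta, theta_star, m):
    """Exact expectation of the sample gradient over m Bernoulli(theta_star) draws."""
    total = 0.0
    for k in range(m + 1):
        prob = comb(m, k) * theta_star**k * (1 - theta_star) ** (m - k)
        total += prob * grad_cramer(bernoulli(k / m), theta)
    return total

theta, theta_star = 0.3, 0.7
sample_grad = expected_sample_grad(theta, theta_star, m=5)
true_grad = grad_cramer(bernoulli(theta_star), theta)
```

Here the loss is quadratic in the difference of the two Bernoulli parameters, so its gradient is linear in the empirical parameter and the expectation passes through, which is why the two gradients agree for every sample size.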
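The energy distance (Székely, 2002) is a natural extension of the Cramér distance to the multivariate case. Let $P, Q$ be probability distributions over $\mathbb{R}^d$ and let $X, X'$ and $Y, Y'$ be independent random variables distributed according to $P$ and $Q$, respectively. The energy distance (sometimes called the squared energy distance, see e.g. Rizzo and Székely, 2016) is
$$\mathcal{E}(P, Q) := \mathcal{E}(X, Y) := 2 \, \mathbb{E} \| X - Y \|_2 - \mathbb{E} \| X - X' \|_2 - \mathbb{E} \| Y - Y' \|_2. \tag{4}$$
Székely showed that, in the univariate case, $l_2^2(P, Q) = \frac{1}{2} \mathcal{E}(P, Q)$. Interestingly enough, the energy distance can also be written in terms of a difference of expectations. For
$$f^*(x) := \mathbb{E} \| x - Y' \|_2 - \mathbb{E} \| x - X' \|_2$$
we find that
$$\mathcal{E}(X, Y) = \mathbb{E} f^*(X) - \mathbb{E} f^*(Y). \tag{5}$$
Note that this $f^*$ is not the maximizer of the dual (3), since $\mathcal{E}$ is proportional to the squared $l_2$ metric (i.e. the Cramér distance); the maximizer of (3) can be obtained from results by Gretton et al. (2012). Finally, we remark that $\mathcal{E}$ also possesses properties (I), (S), and (U) (proof in the appendix).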
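For discrete distributions the energy distance can be computed exactly from pairwise distances between atoms, which also gives a direct check of the univariate identity relating it to the Cramér distance (the Cramér distance is half the energy distance). The sketch below uses illustrative function names and an arbitrary example distribution; it is not taken from the paper.

```python
import numpy as np

def _expected_distance(a, wa, b, wb):
    """E||A - B||_2 for independent discrete A, B with atoms a, b and weights wa, wb."""
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(wa @ dists @ wb)

def energy_distance(xs, p, ys, q):
    """2 E||X - Y|| - E||X - X'|| - E||Y - Y'|| for discrete P and Q.
    xs, ys: atoms with shape (n,) or (n, d); p, q: probability vectors."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    if xs.ndim == 1:
        xs = xs[:, None]
    if ys.ndim == 1:
        ys = ys[:, None]
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return (2 * _expected_distance(xs, p, ys, q)
            - _expected_distance(xs, p, xs, p)
            - _expected_distance(ys, q, ys, q))

def cramer_distance(support, p, q):
    """Integral of (F_P - F_Q)^2 on a shared sorted univariate support."""
    support = np.asarray(support, dtype=float)
    cdf_diff = np.cumsum(np.asarray(p, dtype=float) - np.asarray(q, dtype=float))
    return float(np.sum(cdf_diff[:-1] ** 2 * np.diff(support)))

support = [0.0, 1.0, 3.0, 7.0]
p = [0.1, 0.4, 0.3, 0.2]
q = [0.25, 0.25, 0.25, 0.25]
energy = energy_distance(support, p, support, q)
cramer = cramer_distance(support, p, q)
```

The same `energy_distance` function accepts multivariate atoms of shape `(n, d)`, where no distribution-function formulation is available.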
To illustrate how the Cramér distance compares to the 1-Wasserstein metric, we consider modelling the discrete distribution depicted in Figure 1 (left). Since the trade-offs between metrics are only apparent when using an approximate model, we use an underparametrized discrete distribution which assigns the same probability to and . That is,
Figure 1 depicts the distributions minimizing the various divergences under this parametrization. In particular, the Cramér solution is relatively close to the 1-Wasserstein solution. Furthermore, the minimizer of the sample Wasserstein loss () clearly provides a bad solution (most of the mass is on 0). Note that, as implied by Theorem 1, the bias shown here would arise even if the distribution could be exactly represented.
To further show the impact of the Wasserstein bias we used gradient descent to minimize either the true or sample losses with a fixed step-size (). In the stochastic setting, at each step we construct the empirical distribution from samples (a Dirac when ), and take a gradient step. As we are interested in finding a solution that accounts for the numerical values of outcomes, we measure the performance of each method in terms of the true 1-Wasserstein loss.
Figure 2 (left) plots the resulting training curves in the 1-Wasserstein regime, with the KL and Cramér solutions indicated for reference (for completeness, full learning curves are provided in the appendix). We first note that, compared to the KL solution, the Cramér solution has significantly smaller Wasserstein distance to the target distribution. Second, for small sample sizes stochastic gradient descent fails to find reasonable solutions, and for even converges to a solution worse than the KL minimizer. This small experiment highlights the cost incurred from minimizing the sample Wasserstein loss, and shows that increasing the sample size may not be sufficient to guarantee good behaviour.
We next trained a neural network in an ordinal regression task using each of the three divergences. The task we consider is the Year Prediction MSD dataset (Lichman, 2013). In this task, the model must predict the year a song was written (from 1922 to 2011) given a 90-dimensional feature representation. In our setting, this prediction takes the form of a probability distribution. We measure each method’s performance on the test set (Figure 2) in two ways: root mean squared error (RMSE) – the metric minimized by Hernández-Lobato and Adams (2015) – and the sample Wasserstein loss. Full details on the experiment may be found in the appendix.
The results show that minimizing the sample Wasserstein loss results in significantly worse performance. By contrast, minimizing the Cramér distance yields the lowest RMSE and Wasserstein loss, confirming the practical importance of having unbiased sample gradients. Naturally, minimizing for one loss trades off performance with respect to the others, and minimizing the Cramér distance results in slightly higher negative log likelihood than when minimizing the KL divergence (Figure 7 in appendix). We conclude that, in the context of ordinal regression where outcome similarity plays an important role, the Cramér distance should be preferred over either KL or the Wasserstein metric.
We now consider the Generative Adversarial Networks (GAN) framework (Goodfellow et al., 2014), in particular issues arising in the Wasserstein GAN (Arjovsky et al., 2017), and propose a better GAN based on the Cramér distance. A GAN is composed of a generative model (in our experiments, over images), called the generator, and a trainable loss function called a discriminator or critic. In theory, the Wasserstein GAN algorithm requires training the critic until convergence, but this is rarely achievable: we would require a critic that is a very powerful network to approximate the Wasserstein distance well. Simultaneously, training this critic to convergence would overfit the empirical distribution of the training set, which is undesirable.
Our proposed loss function allows for useful learning with imperfect critics by combining the energy distance with a transformation function $h : \mathbb{R}^d \to \mathbb{R}^k$, where $d$ is the input dimensionality and $k$ is the dimensionality of the transformed space. The generator then seeks to minimize the energy distance of the transformed variables $\mathcal{E}(h(X), h(Y))$, where $X$ is a real sample and $Y$ is a generated sample. The critic itself seeks to maximize this same distance by changing the parameters of $h$, subject to a soft constraint (the gradient penalty used by Gulrajani et al., 2017). The losses used by the Cramér GAN algorithm are summarized in Algorithm 1, with additional design choices detailed in the appendix.
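As a rough illustration of the generator objective described above, the sketch below estimates the energy distance between transformed real and generated minibatches. It is a simplified assumption-laden sketch, not a transcription of Algorithm 1: the linear map `h`, the batch shapes, and the function name `cramer_generator_loss` are hypothetical, the term involving only real samples is dropped because it does not depend on the generator, and the critic's gradient penalty is omitted since it requires automatic differentiation.

```python
import numpy as np

def cramer_generator_loss(h, x_real, x_gen, x_gen_prime):
    """Minibatch estimate of
    E||h(X) - h(Y)|| + E||h(X) - h(Y')|| - E||h(Y) - h(Y')||,
    where X is real and Y, Y' are two independent generated batches.
    This equals the energy distance of the transformed variables up to
    a term that is constant with respect to the generator."""
    hr, hg, hgp = h(x_real), h(x_gen), h(x_gen_prime)
    norm = lambda a: np.linalg.norm(a, axis=-1)
    return float(np.mean(norm(hr - hg) + norm(hr - hgp) - norm(hg - hgp)))

# Hypothetical setup: a linear transformation from 8 to 4 dimensions.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))            # stand-in for the critic's parameters
h = lambda x: x @ W
x_real = rng.normal(size=(16, 8))
x_gen = rng.normal(loc=1.0, size=(16, 8))        # first generated batch
x_gen_prime = rng.normal(loc=1.0, size=(16, 8))  # second, independent batch
loss = cramer_generator_loss(h, x_real, x_gen, x_gen_prime)
```

Two independent generated batches are used so that the last term estimates the expected distance between independent generated samples rather than collapsing to zero.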
We now show that, compared to the improved Wasserstein GAN (WGAN-GP) of Gulrajani et al. (2017), the Cramér GAN leads to more stable learning and increased diversity in the generated samples. In both cases we train generative models that predict the right half of an image given the left half; samples from unconditional models are provided in the appendix (Figure 10). The dataset we use here is the CelebA dataset (Liu et al., 2015) of celebrity faces.
Increased diversity. In our first experiment, we compare the qualitative diversity of completed faces by showing three sample completions generated by either model given the left half of a validation set image (Figure 3). We observe that the completions produced by WGAN-GP are almost deterministic. Our findings echo those of Isola et al. (2016), who observed that “the generator simply learned to ignore the noise.” By contrast, the completions produced by Cramér GAN are fairly diverse, including different hairstyles, accessories, and backgrounds. This lack of diversity in WGAN-GP is particularly concerning given that the main requirement of a generative model is that it should provide a variety of outputs.
Theorem 1 provides a clue as to what may be happening here. We know that minimizing the sample Wasserstein loss will find the wrong minimum. In particular, when the target distribution has low entropy, the sample Wasserstein minimizer may actually be a deterministic distribution. But a good generative model of images must lie in this “almost deterministic” regime, since the space of natural images makes up but a fraction of all possible pixel combinations and hence there is little per-pixel entropy. We hypothesize that the increased diversity in the Cramér GAN comes exactly from learning these almost deterministic predictions.
More stable learning. In a second experiment, we varied the number of critic updates () per generator update. To compare performance between the two architectures, we measured the loss computed by an independent WGAN-GP critic trained on the validation set, following a similar evaluation previously done by Danihelka et al. (2017). Figure 4 shows the independent Wasserstein critic distance between each generator and the test set during the course of training. Echoing our results with the toy experiment and ordinal regression, the plot shows that when a single critic update is used, WGAN-GP performs particularly poorly. We note that additional critic updates also improve Cramér GAN. This indicates that it is helpful to keep adapting the transformation.
There are many situations in which the KL divergence, which is commonly used as a loss function in machine learning, is not suitable. The desirable alternatives, as we have explored, are the divergences that are ideal and allow for unbiased estimators: they allow geometric information to be incorporated into the optimization problem; because they are scale-sensitive and sum-invariant, they possess the convergence properties we require for efficient learning; and the correctness of their sample gradients means we can deploy them in large-scale optimization problems. Among open questions, we mention deriving an unbiased estimator that minimizes the Wasserstein distance, and variance analysis and reduction of the Cramér distance gradient estimate.
In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Probabilistic backpropagation for scalable learning of Bayesian neural networks. In Proceedings of the International Conference on Machine Learning.
Generative moment matching networks. In Proceedings of the International Conference on Machine Learning.
In Proceedings of the International Conference on Computer Vision.
Wasserstein training of restricted Boltzmann machines. In Advances in Neural Information Processing Systems.
The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121.
Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning.
The statement regarding (U) for the KL divergence is well-known, and forms the basis of most stochastic gradient algorithms for classification. Chung and Sobel (1987) have shown that the total variation does not have property (S); by Pinsker’s inequality, it follows that the same holds for the KL divergence. A proof of (I) and (S) for the Wasserstein metric is given by Bickel and Freedman (1981), while the lack of (U) is shown in the proof of Theorem 1. ∎
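Minimax bias: Consider $P$, a Bernoulli distribution of parameter $\theta^*$ and $Q_\theta$ a Bernoulli of parameter $\theta$. The empirical distribution $\hat{P}_m$ is a Bernoulli with parameter $\hat{\theta} := \frac{1}{m} \sum_{i=1}^m X_i$. Note that with $\hat{P}_m$ and $Q_\theta$ both Bernoulli distributions, the $p$-th powers of the $p$-Wasserstein metrics are equal, i.e. $w_1(\hat{P}_m, Q_\theta) = w_p^p(\hat{P}_m, Q_\theta)$. This gives us an easy way to prove the stronger result that all $p$-Wasserstein metrics have biased sample gradients. The gradient of the loss is, for $\theta \neq \theta^*$,
$$\nabla_\theta w_p^p(P, Q_\theta) = \nabla_\theta |\theta^* - \theta| = \operatorname{sgn}(\theta - \theta^*),$$
and similarly, the gradient of the sample loss is, for $\theta \neq \hat{\theta}$,
$$\nabla_\theta w_p^p(\hat{P}_m, Q_\theta) = \nabla_\theta |\hat{\theta} - \theta| = \operatorname{sgn}(\theta - \hat{\theta}).$$
Notice that this estimate is biased for any $m$ since
$$\mathbb{E}_{X_m} \nabla_\theta w_p^p(\hat{P}_m, Q_\theta) = \Pr\{\hat{\theta} < \theta\} - \Pr\{\hat{\theta} > \theta\}, \tag{6}$$
which is different from $\nabla_\theta w_p^p(P, Q_\theta)$ for any $\theta^* \in (0, 1)$. In particular for $m = 1$, $\mathbb{E}_{X_m} \nabla_\theta w_p^p(\hat{P}_m, Q_\theta) = 1 - 2\theta^*$ does not depend on $\theta$, thus a gradient descent using a one-sample gradient estimate has no chance of minimizing the Wasserstein loss as it will converge to either $0$ or $1$ instead of $\theta^*$.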
Now observe that for , and any ,
and therefore
Taking , we find that
Thus for any , there exists and with such that the bias is lower-bounded by a numerical constant. Thus the minimax bias does not vanish with the number of samples .
Notice that a similar argument holds for $\theta^*$ and $\theta$ being close to $1$. In both situations where $\theta^*$ is close to $0$ or $1$, the bias is non-vanishing when $|\theta^* - \theta|$ is of order $1/m$. However this is even worse when $\theta^*$ is away from the boundaries. For example, choosing $\theta^* = 1/2$, we can prove that the bias is non-vanishing even when $|\theta^* - \theta|$ is (only) of order $1/\sqrt{m}$.
Indeed, using the anti-concentration result of Veraar (2010) (Proposition 2), we have that for a sequence of Rademacher random variables (i.e. with equal probability),
This means that for samples drawn from a Bernoulli (i.e., are Rademacher), we have
thus for we have the following lower bound on the bias:
Thus the bias is lower-bounded by a constant (independent of ) when and .
Wrong minimum: From (6), we deduce that a stochastic gradient descent algorithm based on the sample Wasserstein gradient will converge to a such that , i.e., is the median of the distribution over , whereas is the mean of that distribution. Since
follows a (normalized) binomial distribution with parameters
and , we know that the median and the mean do not necessarily coincide, and can actually be as far as-away from each other. For example for any odd
and any the median is .It follows that the minimum of the expected sample Wasserstein loss (the fixed point of the stochastic gradient descent using the sample Wasserstein gradient) is different from the minimum of the true Wasserstein loss:
This is illustrated in Figure 5.
Notice that the fact that the minima of these losses differ is worrisome as it means that minimizing the sample Wasserstein loss using (finite) samples will not converge to the correct solution.
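The gap between the two minima can be made concrete by computing the median of the empirical Bernoulli parameter directly. The sketch below assumes the convention that the median is the smallest value whose cumulative probability reaches one half; the function name `median_theta_hat` is illustrative.

```python
from math import comb

def median_theta_hat(m, theta_star):
    """Median of theta_hat = k / m with k ~ Binomial(m, theta_star),
    taken as the smallest value whose cumulative probability reaches 1/2."""
    cdf = 0.0
    for k in range(m + 1):
        cdf += comb(m, k) * theta_star**k * (1 - theta_star) ** (m - k)
        if cdf >= 0.5:
            return k / m
    return 1.0

theta_star = 0.4
fixed_points = {m: median_theta_hat(m, theta_star) for m in (1, 3, 11, 101)}
```

The mean of the empirical parameter is always `theta_star`, whereas the median is restricted to multiples of `1/m`, so for small `m` the two can differ substantially; with a single sample and `theta_star = 0.4` the median is 0, a deterministic solution.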
Deterministic solutions: Consider the specific case where (illustrated in the right plot of Figure 5). Then the expected sample gradient for any , so a gradient descent algorithm will converge to instead of . Notice that a symmetric argument applies for close to .
In this simple example, minimizing the sample Wasserstein loss may lead to degenerate solutions (i.e., deterministic) when our target distributions have low (but not zero) entropy. ∎
We provide an additional result here showing that the sample -Wasserstein gradient converges to the true gradient as .
Let and be probability distributions, with parametrized by . Assume that the set has measure zero, and that for any , the map is differentiable in a neighborhood of with a uniformly bounded derivative (for and ). Let be the empirical distribution derived from independent samples drawn from . Then
We note that the measure requirement is strictly to keep the proof simple, and does not subtract from the generality of the result.
Let . Since the Wasserstein distance measures the area between the curves defined by the distribution function of and , thus and
Now since we have assumed that for any , the map is differentiable in a neighborhood of and its derivative is uniformly (over and ) bounded by , we have
Thus the dominated convergence theorem applies and
since we have assumed that the set of such that has measure zero.
Now, using the same argument for we deduce that
Let us decompose this integral over as the sum of two integrals, one over and the other one over , where . We have
and
Now from the strong law of large numbers, we have that for any
, the empirical cumulative distribution function converges to the cumulative distribution almost surely. We deduce that converges to the set which has measure zero, thus andNow, since , we can use once more the dominated convergence theorem to deduce that
The following lemma will be useful in proving that the Cramér distance has property (U).
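Let $X_m := X_1, \dots, X_m$ be independent samples from $P$, and let $\hat{P}_m := \frac{1}{m} \sum_{i=1}^m \delta_{X_i}$. Then
$$\mathbb{E}_{X_m} F_{\hat{P}_m}(x) = F_P(x).$$
Because the $X_i$’s are independent,
$$F_{\hat{P}_m}(x) = \frac{1}{m} \sum_{i=1}^m \mathbb{I}[X_i \le x].$$
Now, taking the expectation w.r.t. $X_m$,
$$\mathbb{E}_{X_m} F_{\hat{P}_m}(x) = \frac{1}{m} \sum_{i=1}^m \Pr\{X_i \le x\} = F_P(x),$$
since the $X_i$ are identically distributed according to $P$. ∎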
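We will prove that $l_p$ has properties (I) and (S) for $p \in [1, \infty)$; the case $p = \infty$ follows by a similar argument. Begin by observing that, for $c > 0$,
$$F_{cX}(x) = \Pr\{cX \le x\} = F_X\!\left(\frac{x}{c}\right).$$
Then we may rewrite $l_p^p(cX, cY)$ as
$$l_p^p(cX, cY) = \int_{-\infty}^{\infty} \left| F_X\!\left(\frac{x}{c}\right) - F_Y\!\left(\frac{x}{c}\right) \right|^p \mathrm{d}x \overset{(a)}{=} c \int_{-\infty}^{\infty} \left| F_X(z) - F_Y(z) \right|^p \mathrm{d}z = c \, l_p^p(X, Y),$$
where (a) uses a change of variables $z = x / c$. Taking both sides to the power $1/p$ proves that the $l_p$ metric possesses property (S) of order $1/p$. For (I), we use the IPM formulation (3):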
where () is by independence of and , , and () is by Jensen’s inequality. Next, recall that is the set of absolutely continuous functions whose derivative has bounded norm. Hence if , then also for all the translate is also in . Therefore,
Now, to prove (U). Here we make use of the introductory requirement that “all expectations under consideration are finite.” Specifically, we require that the mean under , , is well-defined and finite, and similarly for . In this case,
(7) |
This mild requirement guarantees that the tails of the distribution function are light enough to avoid infinite Cramér distances and expected gradients (a similar condition was set by Dedecker and Merlevède (2007)). Now, by definition,