Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), as a new way of learning generative models, have recently shown promising results in various challenging tasks. Although GANs are popular and widely used (Isola et al., 2016; Brock et al., 2016; Nguyen et al., 2016; Zhu et al., 2017; Karras et al., 2017), they are notoriously hard to train (Goodfellow, 2016). The underlying obstacles, though widely studied (Arjovsky & Bottou, 2017; Lucic et al., 2017; Heusel et al., 2017a; Mescheder et al., 2017, 2018; Yadav et al., 2017), are still not fully understood.
In this paper, we study the convergence of GANs from the perspective of the optimal discriminative function $f^*$. We show that in the original GAN and most of its variants, $f^*(x)$ is a function of the densities at the current point $x$ but does not reflect any information about the densities or locations of other points in the real and fake distributions. Moreover, Arjovsky & Bottou (2017) state that the supports of the real and fake distributions are usually disjoint. We argue that the fundamental cause of failure in training GANs (Section 2.1) stems from the combination of these two facts. The generator uses $\nabla_x f^*(x)$ as the guidance for updating the generated samples, but $\nabla_x f^*(x)$ actually tells nothing about where the real distribution $P_r$ is. As a result, the generator is not guaranteed to converge to $P_r$.
Accordingly, Arjovsky et al. (2017) proposed the Wasserstein distance (in its dual form) as an alternative objective, which can properly measure the distance between two distributions regardless of whether their supports are disjoint. However, as shown in Section 2.3, when the supports of $P_g$ and $P_r$ are disjoint, the gradient of $f^*$ obtained from the dual form of the Wasserstein distance under a compact dual constraint also does not reflect any useful information about other points in $P_r$. Based on this observation, we provide further investigation in Section 2.4 and argue that measuring the distance properly does not necessarily imply that the gradient is well-defined in terms of $\nabla_x f^*(x)$.
In Section 3, we show that incorporating the Lipschitz-continuity condition in the objectives of GANs is a general solution to the above-mentioned problem, and prove that for a broad family of discriminator objectives, the Lipschitz-continuity condition builds strong connections between $P_g$ and $P_r$ through $f^*$, such that $\nabla_x f^*(x)$ at each sample $x \sim P_g$ points towards some real sample $y \sim P_r$. This guarantees that $P_g$ moves towards $P_r$ at every step, i.e., the convergence of GANs is guaranteed.
Finally, in Section 4, we extend our discussion on $f^*$ and $\nabla_x f^*(x)$ to the case where the supports of $P_g$ and $P_r$ overlap, and show that the locality of $f^*$ and $\nabla_x f^*(x)$ in traditional GANs turns out to be an intrinsic cause of mode collapse.
[Table 1: choices of the loss metrics in representative GAN models; surviving row label: Wasserstein-1 with Lip]
2 The Fundamental Cause of Failure in Training of GANs
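Typically, the objectives of GANs can be formulated as follows:
$$\min_{f \in \mathcal{F}} \; \mathbb{E}_{z \sim P_z}[\phi(f(g(z)))] + \mathbb{E}_{x \sim P_r}[\varphi(f(x))], \qquad \min_{g \in \mathcal{G}} \; \mathbb{E}_{z \sim P_z}[\psi(f(g(z)))]. \tag{1}$$
$P_z$ is the source distribution of the generator (usually a Gaussian distribution) in $\mathbb{R}^m$ and $P_r$ is the target (real) distribution in $\mathbb{R}^n$. The generative function $g: \mathbb{R}^m \to \mathbb{R}^n$ learns to output samples that share the same dimension as samples from $P_r$, while the discriminative function $f: \mathbb{R}^n \to \mathbb{R}$ learns to output a score indicating the authenticity of a given sample. We denote the implicit distribution of the generated samples as $P_g$, i.e., $P_g = g(P_z)$.
$\mathcal{F}$ and $\mathcal{G}$ denote the discriminative and generative function spaces parameterized by neural networks, respectively; the functions $\phi$, $\varphi$, $\psi$: $\mathbb{R} \to \mathbb{R}$ are loss metrics. We list the choices of $\phi$, $\varphi$ and $\psi$ in some representative GAN models in Table 1.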
In these GANs, the gradient that the generator receives from the discriminator with respect to a generated sample $x \sim P_g$ is
$$\nabla_x \psi(f(x)) = \nabla_{f(x)} \psi(f(x)) \cdot \nabla_x f(x). \tag{2}$$
In Eq. (2), the first term $\nabla_{f(x)} \psi(f(x))$ is a step-related scalar that is out of the scope of our discussion in this paper; the second term $\nabla_x f(x)$ is a vector indicating the direction that the generator should follow when optimizing on sample $x$.
2.1 $\nabla_x f^*(x)$ on $P_g$ does not reflect useful information about $P_r$
In this section, we will show that when the supports of $P_g$ and $P_r$ are disjoint, $\nabla_x f^*(x)$ in traditional GANs does not reflect any useful information about $P_r$, and $P_g$ is not guaranteed to converge to $P_r$. We argue that this is the fundamental cause of non-convergence and instability in traditional GANs. [Footnote 1: In this paper, traditional GANs mainly refers to the original GAN and Least-Squares GAN, where $f^*(x)$ depends only on the densities $P_g(x)$ and $P_r(x)$. Broadly, it refers to all GANs where $f^*(x)$ does not reflect information about the locations of the other points in $P_g$ and $P_r$, such as the Fisher GAN.]
2.1.1 The original GAN and Least-Squares GAN
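In the simplest case of Eq. (1), e.g., the original GAN (Goodfellow et al., 2014) and Least-Squares GAN (Mao et al., 2016), there is no restriction on $\mathcal{F}$. Therefore, $f^*(x)$ for each point $x$ is independent of other points, and we have
$$f^*(x) = \arg\min_{f(x) \in \mathbb{R}} \; P_g(x)\,\phi(f(x)) + P_r(x)\,\varphi(f(x)), \quad \forall x, \tag{3}$$
where $\phi$ and $\varphi$ are the discriminator's loss metrics on fake and real samples, respectively.
Since we assume the supports of $P_g$ and $P_r$ are disjoint, we further have
$$f^*(x) = \begin{cases} \arg\min_{f(x) \in \mathbb{R}} P_g(x)\,\phi(f(x)), & x \in S_g, \\ \arg\min_{f(x) \in \mathbb{R}} P_r(x)\,\varphi(f(x)), & x \in S_r, \end{cases} \tag{4}$$
where $S_g$ and $S_r$ denote the supports of $P_g$ and $P_r$.
For $x \sim P_g$, the value of $f^*(x)$ is irrelevant to $P_r$. Since $P_g$ and $P_r$ are disjoint [Footnote 2: Here and later, “two distributions are disjoint” means that their supports are disjoint.], $\nabla_x f^*(x)$ for $x \sim P_g$ also tells nothing about $P_r$. In consequence, the generator can hardly learn useful information and is not guaranteed to converge to the case where $P_g = P_r$.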
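The pointwise nature of the optimum can be checked numerically. The sketch below is illustrative only: it uses the well-known sigmoid-output form of the original GAN optimum, D*(x) = p_r(x) / (p_r(x) + p_g(x)), and the grid, the uniform densities and all function names are assumptions made for this example. It shows that the optimum on the fake support is the same wherever the real data sit, as long as the supports stay disjoint.

```python
def optimal_discriminator(p_r, p_g):
    """Pointwise optimum D*(x) = p_r(x) / (p_r(x) + p_g(x)); None where both densities vanish."""
    return [r / (r + g) if (r + g) > 0 else None for r, g in zip(p_r, p_g)]


def uniform_density(grid, lo, hi):
    """Density of U[lo, hi) evaluated at each grid point."""
    return [1.0 / (hi - lo) if lo <= x < hi else 0.0 for x in grid]


grid = [i * 0.5 for i in range(16)]        # points 0.0, 0.5, ..., 7.5
p_g = uniform_density(grid, 0.0, 1.0)      # fake samples live on [0, 1)

# Two placements of the real data, both disjoint from the fake support.
d_near = optimal_discriminator(uniform_density(grid, 2.0, 3.0), p_g)
d_far = optimal_discriminator(uniform_density(grid, 6.0, 7.0), p_g)

fake_idx = [i for i, g in enumerate(p_g) if g > 0]
on_fake_near = [d_near[i] for i in fake_idx]
on_fake_far = [d_far[i] for i in fake_idx]

# Identical values on the fake support: the optimum there ignores where P_r is.
print(on_fake_near, on_fake_far)
```

Outside both supports the optimum is left undefined (`None` here), which is why its gradient around the fake samples carries no information about the real distribution.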
2.1.2 The Fisher GAN
where $\mu$ is a distribution whose support covers $P_g$ and $P_r$. Given that $P_g$ and $P_r$ are disjoint, we have
Note that the scalar is a constant. Eq. (6) also defines $f^*$ on $P_g$ and $P_r$ independently. Therefore, for $x \sim P_g$, $f^*(x)$ and $\nabla_x f^*(x)$ tell nothing about $P_r$.
2.2 Connection to gradient vanishing
The non-convergence problem of the original GAN was once regarded as a gradient vanishing problem. In Goodfellow et al. (2014), it is addressed by using an alternative objective for the generator. However, this only changes the scalar term in Eq. (2), while the aforementioned problem in the direction term $\nabla_x f^*(x)$ still exists. The Least-Squares GAN (Mao et al., 2016) was proposed to address the gradient vanishing problem, but it also essentially focuses on the scalar term. As we have discussed, the Least-Squares GAN also belongs to traditional GANs, which are not guaranteed to converge when $P_g$ and $P_r$ are disjoint.
Arjovsky et al. (2017) provided a new perspective on understanding the gradient vanishing problem. They argued that gradient vanishing stems from the ill behavior of traditional metrics, i.e., the distance between $P_g$ and $P_r$ remains constant when they are disjoint. The Wasserstein distance was thus proposed as an alternative metric, which can properly measure the distance between two distributions regardless of whether they are disjoint. However, as we will show next, the Wasserstein distance may also suffer from the same problem in $\nabla_x f^*(x)$ if a more compact dual form is used.
In summary, gradient vanishing concerns the scalar term in Eq. (2) or the overall scale of the gradient, whereas in this paper we investigate its direction $\nabla_x f^*(x)$, where the problem is more fundamental and challenging. We will next show that the Wasserstein distance, which can properly measure the distance for disjoint distributions, may also suffer from the same issue.
2.3 Wasserstein distance in compact dual form suffers from the same problem
The (1-)Wasserstein distance is a distance function defined between two probability distributions:
$$W_1(P_r, P_g) = \inf_{\pi \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \pi}\,[d(x, y)], \tag{7}$$
where $\Pi(P_r, P_g)$ denotes the collection of all probability measures with marginals $P_r$ and $P_g$ on the first and second factors, respectively. Since solving it in the primal form (Eq. (7)) is burdensome, the Wasserstein distance is usually solved in its dual form. Though the dual form is usually written with the Lipschitz-continuity condition, we here provide a more compact version. The proof of this dual form can be found in Appendix I.
We leave the detailed discussion on the relationship between the Lipschitz-continuity condition and the Wasserstein distance to Section 5.1. In Eq. (8), we replace the strong Lipschitz-continuity condition with a looser constraint. Note that Eq. (7) and Eq. (8) are still equivalent. Next, we will demonstrate that a well-defined distance metric, e.g., the Wasserstein distance in this compact dual form, may also suffer from the same problem in $\nabla_x f^*(x)$ and does not necessarily ensure the convergence of GANs.
We now study the optimal discriminative function of Wasserstein distance in this dual form. Since there is generally no closed-form solution for in Eq. (8), we use an illustrative example for demonstration here, but the conclusion is general. Let be a uniform variable on interval , be the distribution of , and be the distribution of , as shown in Figure 1. According to Eq. (8), one of the optimal is as follows
Despite its constraint, the Wasserstein distance in this dual form also only defines the value of $f^*$ on the supports of $P_g$ and $P_r$, and the values of $f^*$ on $P_g$ contain no useful information about the location of $P_r$. Therefore, if $P_g$ and $P_r$ are disjoint, $\nabla_x f^*(x)$ hardly provides useful information to the generator about how to change $P_g$ into $P_r$, and the generator is not guaranteed to converge to the case $P_g = P_r$. It is worth noticing that the value of $f^*$ on the supports of $P_g$ and $P_r$ is sufficient to evaluate the Wasserstein distance.
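A small discrete check makes this concrete. The sketch below rests on stated assumptions rather than on the paper's exact setup: the real support is taken to be the segment x = 0 and the fake support the segment x = 1 in the plane, the compact constraint is assumed to read f(x) - f(y) <= ||x - y|| for real x and fake y, and the critic values are a hand-picked candidate. The candidate is feasible and its dual value matches the cost of an explicit coupling, so by weak duality both are optimal, even though the critic is defined only on the two supports.

```python
import math

n = 5
zs = [i / (n - 1) for i in range(n)]
real = [(0.0, z) for z in zs]   # assumed support of P_r: the segment x = 0
fake = [(1.0, z) for z in zs]   # assumed support of P_g: the segment x = 1


def dist(a, b):
    """Euclidean distance between two points in the plane."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Candidate critic, specified only on the two supports.
f_real = [1.0] * n
f_fake = [0.0] * n

# Assumed compact constraint: f(x) - f(y) <= ||x - y|| for x real, y fake.
feasible = all(
    f_real[i] - f_fake[j] <= dist(real[i], fake[j]) + 1e-12
    for i in range(n)
    for j in range(n)
)

# Dual objective: mean critic value on real points minus mean on fake points.
dual_value = sum(f_real) / n - sum(f_fake) / n

# Primal cost of the coupling that moves each fake point horizontally.
primal_cost = sum(dist(real[i], fake[i]) for i in range(n)) / n

print(feasible, dual_value, primal_cost)
```

Because the critic is constant on the fake support and unconstrained off the supports, any extension of it around the fake samples is admissible, so its gradient there is arbitrary.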
2.4 A well-defined metric does not necessarily guarantee the convergence
The objectives of GANs are usually defined as (or proved equivalent to) minimizing a distance metric between $P_g$ and $P_r$, which implies that $P_g = P_r$ is the unique global optimum and is in accordance with the final goal of the generative model, i.e., estimating the distribution of real samples. However, in this section, we emphasize that a well-defined (e.g., smooth, continuous, with $P_g = P_r$ being the optimum) distance metric does not necessarily guarantee the convergence of GANs.
Given an objective that is convex with respect to $P_g$ and has $P_g = P_r$ as the unique optimum, the convergence of GANs is guaranteed provided that it directly optimizes $P_g$. However, directly optimizing the distribution is usually infeasible, and the practice is to optimize the generated samples according to $\nabla_x f^*(x)$. As shown in previous sections, when $P_g$ and $P_r$ are disjoint, $\nabla_x f^*(x)$, the direction that the generator follows for updating the generated samples, tells nothing about how to pull $P_g$ to $P_r$. Therefore, the convergence of GANs is not necessarily guaranteed.
It is worth noticing that indeed indicates the direction of decreasing the objective in terms of the current , but updating to make the value of increase/decrease does not necessarily imply that is getting closer to . Recall that in the failure case of Wasserstein distance dual form in the above section, the values of on is , while the values of around is undefined.
In conclusion, a smooth distance metric for which $P_g = P_r$ is the optimum does not guarantee convergence, and updating samples according to $\nabla_x f^*(x)$ does not necessarily decrease the distance between $P_g$ and $P_r$. Therefore, if we use $\nabla_x f^*(x)$ for the update of the generator [Footnote 3: Alternative strategies exist; for example, Sanjabi et al. (2018) use the optimal transport plan (Seguy et al., 2017) between $P_g$ and $P_r$ to update the generator.], it is necessary to make $\nabla_x f^*(x)$ aware of how to pull $P_g$ to $P_r$. In the next section, we introduce the Lipschitz-continuity condition as a general solution for making $\nabla_x f^*(x)$ well-behaved and guaranteeing the convergence of $\nabla_x f^*(x)$-based GANs.
3 A General Solution: Lipschitz-continuity Condition
The Lipschitz-continuity condition has recently become popular in GANs as part of the discriminator’s objective (Arjovsky et al., 2017; Kodali et al., 2017; Fedus et al., 2017; Miyato et al., 2018), achieving great success. In this section, we explain the significance of the Lipschitz-continuity condition when introduced into the objective of the discriminator.
In a nutshell, under a broad family of GAN objectives, the Lipschitz-continuity condition is able to connect $P_g$ and $P_r$ through $f^*$ such that when $P_g$ and $P_r$ are disjoint, $\nabla_x f^*(x)$ for each generated sample $x \sim P_g$ points towards some real sample $y \sim P_r$, which guarantees that $P_g$ is getting closer to $P_r$ at every step. More detailed results are presented as follows.
3.1 The main result
A function $f: X \to Y$ is $k$-Lipschitz continuous if it satisfies the following property:
$$d_Y(f(x_1), f(x_2)) \le k \cdot d_X(x_1, x_2), \quad \forall x_1, x_2 \in X,$$
where $d_X$ and $d_Y$ are distance metrics in domains $X$ and $Y$, respectively. The smallest such constant $k$ is called the Lipschitz constant of function $f$. In this paper (and most GAN papers), $d_X$ and $d_Y$ are defined as the Euclidean distance. [Footnote 4: Actually, we argue that the distance metrics must be Euclidean distance in GANs. See Appendix D.] We let $\|\cdot\|$ denote the Euclidean distance.
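As a quick illustration of the definition, the ratio |f(x) - f(y)| / ||x - y|| over any finite set of point pairs gives a lower bound on the Lipschitz constant. The sketch below is a minimal example under assumed choices: the linear function, the sample points and the helper name `lipschitz_lower_bound` are all hypothetical and picked so that the bound is attained.

```python
import math
from itertools import combinations


def lipschitz_lower_bound(f, points):
    """Largest observed |f(x) - f(y)| / ||x - y|| over all pairs of distinct points."""
    best = 0.0
    for x, y in combinations(points, 2):
        gap = math.dist(x, y)
        if gap > 0:
            best = max(best, abs(f(x) - f(y)) / gap)
    return best


def f(p):
    """Linear function whose gradient (3, -4) has Euclidean norm 5."""
    return 3.0 * p[0] - 4.0 * p[1]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, -4.0)]
estimate = lipschitz_lower_bound(f, points)
print(estimate)
```

The pair (0, 0) and (3, -4) lies along the gradient direction, so the estimate reaches the true Lipschitz constant 5 of this function; for other point sets it would only be a lower bound.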
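As proved by Gulrajani et al. (2017), when the Lipschitz-continuity condition is combined with the Wasserstein distance, the following property holds: if $f^*$ is differentiable, then
$$\Pr_{(x, y) \sim \pi^*}\left[\nabla_{x_t} f^*(x_t) = \frac{y - x}{\|y - x\|}\right] = 1, \tag{11}$$
where $x_t = t x + (1 - t) y$, $0 \le t \le 1$, and $\pi^*$ is the optimal $\pi$ in Eq. (7). The meaning of this proposition is two-fold: (i) for each $x \sim P_g$, there exists a $y \sim P_r$ such that $\nabla_{x_t} f^*(x_t) = \frac{y - x}{\|y - x\|}$ for all linear interpolations $x_t$ between $x$ and $y$; (ii) these $(x, y)$ pairs match the optimal coupling $\pi^*$.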
Next we introduce our theorem on the Lipschitz-continuity condition. It turns out when combining the Lipschitz-continuity condition with generalized objectives, Property-(i) still holds and Property-(ii) is naturally dismissed as it is now not restricted to Wasserstein distance.
Let and denotes . Let and denote the supports of and , respectively. Assume , where is the Lipschitz constant of . If and in satisfy
then we have that
, such that or ;
, such that ;
if and , then , such that ;
the only Nash Equilibrium of is reached when , where .
The above theorem states that when the Lipschitz-continuity condition is combined with an objective that satisfies Eq. (12), then: (a) for the optimal discriminative function at any point , it either is bounded by the Lipschitz constant or holds a zero-gradient with respect to ; (b) for any point that only appears in or , there must exist a point that bounds this point in terms of , because for these points, will never get zero gradient with respect to as we prove in the Appendix G; (c) when and are totally overlapped, as long as still not converges to , there exists at least one pair that bounds each other; (d) the only Nash Equilibrium among and under this objective is “ with ”. The formal proof is in Appendix G.
Wasserstein distance, i.e., is one instance that satisfies Eq. (12); and it is a very special case, which holds and . Eq. (12) is actually quite general and there exists many other settings, e.g., , and . Generally, it is feasible to set . As such, to build a new objective, one only needs to find a function that is increasing and has non-decreasing derivative. See Figure 12. In addition, all linear combinations of feasible pairs also lie in the family.
It is worth noting that is also optimized here and it is actually necessary for Property-(c) and Property-(d). This is the key difference when the Lipschitz-continuity condition is extended to general objectives. The underlying reason for the need of also minimizing comes from the existence of case “ for ”, which does not hold when the objective is Wasserstein distance. Minimizing guarantees that the only Nash Equilibrium is “ with ”. On the other hand, if is not minimized towards zero, Wasserstein distance dual form based GANs are not guaranteed to have zero gradient at the convergence state . It indicates that minimizing is also beneficial to the Wasserstein GAN (Arjovsky et al., 2017).
3.2 Lipschitz-continuity connects and through
From Theorem 1, we know that for any point , as long as does not hold a zero gradient with respect to , must be bounded by another point such that . We here further clarify that, when there is a bounding relationship, it must involve both real sample(s) and fake sample(s). More formally, we have
If , then
, if such that , then such that ,
, if such that , then such that .
The intuition behind the above theorem is that samples from the same distribution, e.g., the fake samples, will not bound each other. It is worth noticing that there might exist a chain of bounding relationships that involves dozens of fake and real samples, in which case these points all lie on the same line and bound each other.
Under the Lipschitz-continuity condition, the bounded line in the value surface of $f^*$ is the basic building block that connects $P_g$ and $P_r$, and each fake sample lies on one of the bounded lines. Next we will further interpret the implication of the bounding relationship and show that it guarantees a meaningful $\nabla_x f^*(x)$ for all involved points.
3.3 Lipschitz-continuity ensures the convergence of $\nabla_x f^*(x)$-based GANs
Recall that the proposition in Eq. (11) states that . We next show that it is actually a direct consequence of bounding relationship between and . We formally state it as follows:
Assume is differentiable and -Lipschitz continuous. For all and which satisfy and , we have , where for .
In other words, if two points $x$ and $y$ bound each other in terms of $f$, there is a straight line between $x$ and $y$ in the value surface of $f$. Any point on this line has the maximum gradient slope $k$, and the gradients all point along the line connecting the two points. Combining Theorem 1 and Theorem 2, we can conclude that when $P_g$ and $P_r$ are disjoint, $\nabla_x f^*(x)$ for each sample $x \sim P_g$ points to a sample $y \sim P_r$, which guarantees that $P_g$ is moving towards $P_r$.
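The bounding relationship can be illustrated numerically. The sketch below uses an assumed toy function, f(p) = -K * ||p - y||, which is K-Lipschitz and satisfies f(y) - f(x) = K * ||y - x|| for every x; the constant K, the two points and the finite-difference helper are hypothetical choices for this example. At an interpolation point between x and y, the numerical gradient matches K * (y - x) / ||y - x||.

```python
import math

K = 2.0
y = (3.0, 4.0)   # plays the role of the real sample
x = (0.0, 0.0)   # plays the role of the fake sample


def f(p):
    """A K-Lipschitz function with f(y) - f(p) = K * ||y - p|| for every p."""
    return -K * math.dist(p, y)


def numerical_gradient(func, p, h=1e-6):
    """Central finite-difference gradient of func at point p."""
    grad = []
    for i in range(len(p)):
        up = list(p)
        up[i] += h
        down = list(p)
        down[i] -= h
        grad.append((func(up) - func(down)) / (2 * h))
    return grad


t = 0.5
x_t = tuple(t * a + (1 - t) * b for a, b in zip(x, y))   # midpoint (1.5, 2.0)
grad = numerical_gradient(f, x_t)

norm = math.dist(x, y)
expected = [K * (b - a) / norm for a, b in zip(x, y)]    # K * (y - x) / ||y - x||
print(grad, expected)
```

The gradient has norm K and points from the fake sample towards the real one, which is the behaviour the bounded line is meant to guarantee.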
In fact, Theorem 1 provides a further guarantee on the convergence. Property-(b) implies that for any $x \sim P_g$ that does not lie in $S_r$, $\nabla_x f^*(x)$ points to some real sample $y \sim P_r$. In the fully overlapped case, according to Property-(c), unless $P_g = P_r$, there exists a pair $(x, y)$ in a bounding relationship and $\nabla_x f^*(x)$ pulls $x$ towards $y$. Property-(d) guarantees that the only Nash Equilibrium is “$P_g = P_r$”. The proof of Theorem 3 is provided in Appendix D.
4 Overlapping case: the cause of mode collapse
In Section 2, we discussed the problem of $f^*$ and $\nabla_x f^*(x)$ in the case where $P_g$ and $P_r$ are disjoint. In this section, we extend our discussion to the overlapping case. In the disjoint case, we argued that the fact that “$f^*$ on $P_g$ does not reflect any information about the location of other points in $P_r$” leads to an infeasible $\nabla_x f^*(x)$ and thus non-convergence. In the overlapping and continuous case, things are different: $f^*$ around each point is also defined, and its gradient $\nabla_x f^*(x)$ now reflects the local variation of $f^*$.
For most traditional GANs, $f^*(x)$ mainly reflects the local information about the densities $P_g(x)$ and $P_r(x)$. Moreover, $f^*(x)$ is usually an increasing function with respect to $P_r(x)$ and a decreasing function with respect to $P_g(x)$. For instance, $f^*(x)$ in the original GAN is $\log \frac{P_r(x)}{P_g(x)}$. Optimizing the generator according to $\nabla_x f^*(x)$ will move sample $x$ in the direction of increasing $f^*(x)$. Because $f^*(x)$ positively correlates with $P_r(x)$ and negatively correlates with $P_g(x)$, this in a sense means $x$ is becoming more real. However, such a local greedy strategy turns out to be a fundamental cause of mode collapse.
Mode collapse is a notorious problem in GAN training, which refers to the phenomenon that the generator only learns to produce part of $P_r$. Many studies investigate the source of mode collapse (Che et al., 2016; Metz et al., 2016; Kodali et al., 2017; Arora et al., 2017) and measure the degree of mode collapse (Odena et al., 2016; Arora & Zhang, 2017).
The most recognized cause of mode collapse is that, if the generator is much stronger than the discriminator, it may learn to only produce the sample(s) at the local or global maximum of $f(x)$ for the current discriminator. This argument is true for most GAN models. However, from our perspective on $f^*$ and its gradient, there actually exists a much more fundamental cause of mode collapse, i.e., the locality of $f^*(x)$ in traditional GANs and the locality of the gradient operator $\nabla$.
In traditional GANs, $f^*(x)$ is a function of the local densities $P_g(x)$ and $P_r(x)$, which is local, and the gradient operator $\nabla$ is also a local operator. As a result, $\nabla_x f^*(x)$ only reflects local variations and cannot capture the statistics of $P_g$ and $P_r$ that are far from $x$ itself. If $f^*$ in the surrounding area of $x$ is well-defined, $x$ will move towards the nearby location where the value of $f^*$ is higher. It does not take the global status into account.
The typical result is that when fake samples get close to a mode of $P_r$, they move towards the mode and get stuck there (due to the locality). Assume $P_r$ consists of two Gaussian distributions (A and B) that are distant from each other, while the current $P_g$ is uniformly distributed over its support and close to real Gaussian A. In this case, $\nabla_x f^*(x)$ of all fake samples will point towards the center of Gaussian A. If $P_g$ is a Gaussian with the same standard deviation as Gaussian A, $\nabla_x f^*(x)$ in the original GAN and Least-Squares GAN shows almost identical behavior, which is illustrated in Figure 2. In the Fisher GAN, if $P_g$ is uniform, the case is even worse: a large number of points that are relatively far from Gaussian A will move away from A (but the direction is not necessarily towards B, though in our 1-D case it is). This observation again supports our argument that “a well-defined distance metric does not necessarily guarantee the convergence”, and the validity of $\nabla_x f^*(x)$ is still necessary even if $P_g$ and $P_r$ are continuous and overlapping.
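The two-Gaussian scenario can be reproduced in one dimension. The sketch below is illustrative and rests on assumptions: it takes the logit form of the original GAN optimum, f*(x) = log p_r(x) - log p_g(x), places Gaussian A at 0 and Gaussian B at 20 with unit standard deviation, and spreads the fake samples uniformly over [-3, 3]; all names and constants are hypothetical. Since log p_g is constant on a uniform support, the gradient of f* reduces to the gradient of log p_r.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


MEAN_A, MEAN_B, STD = 0.0, 20.0, 1.0


def grad_log_p_r(x):
    """d/dx log p_r(x) for the equal-weight mixture of Gaussians A and B."""
    w_a = normal_pdf(x, MEAN_A, STD)
    w_b = normal_pdf(x, MEAN_B, STD)
    return (w_a * (MEAN_A - x) + w_b * (MEAN_B - x)) / ((w_a + w_b) * STD ** 2)


# Fake samples spread uniformly over [-3, 3]; log p_g is constant there,
# so grad f*(x) = grad log p_r(x) - grad log p_g(x) = grad log p_r(x).
fake_samples = [-3.0 + 0.5 * i for i in range(13)]
directions = [grad_log_p_r(x) for x in fake_samples]

for x, d in zip(fake_samples, directions):
    print(f"x = {x:+.1f}  gradient = {d:+.3e}")
```

Every fake sample away from the center of A is pushed back towards A, and none receives a meaningful pull towards B, which is the local greedy behaviour described above.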
5 Extended Discussions
5.1 The relation between Lipschitz-continuity and Wasserstein distance
Most literature presents the dual form of Wasserstein distance with the Lipschitz-continuity condition. However, it is worth noticing that the Lipschitz-continuity condition is actually stronger than the necessary one in the dual form of Wasserstein distance. Recall that in the dual form of Wasserstein distance, the constraint can be more compactly written as (introduced in Section 2.3 and proved in Appendix I)
However, it is usually written as 1-Lipschitz continuous, which is
The key difference is that the constraint in Eq. (13) restricts the range of $x$ and $y$, whereas the Lipschitz-continuity condition (Eq. (14)) places no restriction on the range; thus the latter is a sufficient condition for the former. It is also worth noticing that, though the Lipschitz-continuity condition is stronger than the compact one, it does not affect the final solution (Appendix I). In other words, the Lipschitz-continuity condition is a safe extension of the compact constraint. If the supports of $P_g$ and $P_r$ are the entire space, Eq. (13) and Eq. (14) are actually identical; in that case, the Wasserstein distance in its dual form always works. However, $P_g$ and $P_r$ are usually disjoint in GANs. Therefore, using the strong Lipschitz-continuity condition is necessary to ensure the validity of the dual form of the Wasserstein distance in $\nabla_x f^*(x)$-based updating, and the constraint in Eq. (13) is not enough, as shown in Section 2.3.
5.2 Explanation on the empirical success of traditional GANs
Though traditional GANs have no guarantee of convergence, they have already achieved great success. The reason is that having no guarantee does not mean they cannot converge. It turns out that extensive parameter tuning increases the probability of convergence.
As shown in Appendix A, hyper-parameters strongly influence the value surface of $f^*$. Some typical settings (e.g., simplified neural network architecture, ReLU or leaky ReLU activation, relatively high learning rate, Adam optimizer, etc.) tend to form a relatively smooth value surface (e.g., monotonically increasing from $P_g$ to $P_r$), making $\nabla_x f^*(x)$ much more meaningful. That is, one can find settings in which $f^*$ or $\nabla_x f^*(x)$ is more favourable, enabling traditional GANs to work. Conversely, we have tried highly nonlinear activations such as swish (Ramachandran et al., 2018) in the discriminator, and it turns out that traditional GANs are very likely to fail. In contrast, our proposed Lipschitz-continuity condition based GANs are compatible with highly nonlinear activations. Another important empirical technique is to delicately balance the generator and the discriminator or limit the capacity of the discriminator. This is to avoid the fatal optimal $f^*$. All of these could possibly make traditional GANs work. However, the consequence is that these GANs are very sensitive to hyper-parameters and hard to use.
6 Experiments
In this section, we present the experimental results on our proposed objectives for GANs. The anonymous code is provided at http://bit.ly/2Kvbkje.
6.1 Verifying the objective family and its gradient
We further verify $\nabla_x f^*(x)$ with real-world data, using ten CIFAR-10 images as $P_r$ and ten noise images as $P_g$ to make solving $f^*$ feasible. The result is shown in Figure 4, where the leftmost images in each row are the samples $x \sim P_g$ and the second are their gradients $\nabla_x f^*(x)$. The interior images are $x + \epsilon \cdot \nabla_x f^*(x)$ with increasing $\epsilon$, which pass through a real sample, and the rightmost are the nearest $y \sim P_r$. This result visually demonstrates that the gradient of a generated sample points towards one real sample. Note that the final results of this experiment remain almost identical when varying the loss metrics $\phi$ and $\varphi$ in the family.
6.2 Stabilizing with new Objectives
Wasserstein distance is a special case in our proposed family of objectives where . As a result, under the Wasserstein distance objective where has a free offset, which means given a , with any is also an optimal. In practice, this behaves as an oscillatory during training. Any other instance of our new proposed objectives does not have this problem. We illustrate this practical difference in Figure 6.
6.3 Benchmark on unsupervised image generation tasks
[Table: Inception Score and FID results on unsupervised image generation; surviving column label: Oxford 102 Flower]
Finally, we fix in the generator’s objective and compare various objectives on unsupervised image generation tasks. The results of Inception Score (Salimans et al., 2016) and Frechet Inception Distance (Heusel et al., 2017b) are presented in Table 6.3. We also include the hinge loss, which is used in (Miyato et al., 2018). We use a classifier on the Oxford 102 Flower dataset for the evaluation of FID and Inception Score on Oxford 102.
The gradient of varies significantly and we find it requires a small learning rate to avoid explosion. The objectives and achieve the best performances. This is probably because they have bounded gradients and reduce the gradient of well-identified points towards zero, which enables the discriminator to pay more attention to the ill-identified ones. Hinge loss does not lie in our proposed objective family; it turns out to be unstable and performs unsatisfactorily in some cases. We also plot the training curve in terms of FID in Figure 6.
Due to page limitations, we leave the details, visual results and more experiments to the Appendix.
7 Conclusion
In this paper we have shown that the fundamental cause of failure in training GANs stems from the unreliable $\nabla_x f^*(x)$. Specifically, when $P_g$ and $P_r$ are disjoint, $\nabla_x f^*(x)$ for a fake sample $x \sim P_g$ tells nothing about $P_r$, so $P_g$ is not guaranteed to converge to $P_r$. We have further demonstrated that even the Wasserstein distance in a more compact dual form (which is still equivalent to the Wasserstein distance and can properly measure the distance between distributions) suffers from the same problem when $P_g$ and $P_r$ are disjoint. This implies that “whether a distance metric can properly measure the distance” does not yet touch the key of non-convergence of GANs. We have highlighted that a well-defined distance metric does not necessarily guarantee the convergence of GANs because $\nabla_x f^*(x)$ can be meaningless. Therefore, if we update the generator based on $\nabla_x f^*(x)$, we need to pay more attention to the design of $f^*$. Furthermore, to address the aforementioned problem, we have proposed the Lipschitz-continuity condition as a general solution to make $\nabla_x f^*(x)$ reliable and ensure the convergence of GANs, which works well with a large family of GAN objectives. In addition, we have shown that in the overlapping case, $\nabla_x f^*(x)$ is also problematic, which turns out to be an intrinsic cause of mode collapse in traditional GANs.
Remark 1: It is worth noticing that in our formulation is not derived from any well-established distance metric; it is derived based on Lipschitz-continuity condition. As we have shown that a well-established distance metric does not necessarily ensure the convergence, we hope our trial could shed light on the new direction of GANs.
Remark 2: Though the objective of generator is not the focus of this paper, our analysis indicates that the minimax in terms of in Eq. (1) is not essential, because it only influences the scale of the gradient. Nevertheless, the function does influence the updating of the generator and we leave the detailed investigation as future work.
8 Related work
The main argument in Wasserstein GAN (Arjovsky et al., 2017) for the benefit of Wasserstein distance is that it can properly measure the distance between two distributions no matter whether their supports are disjoint. However, according to our analysis, a proper distance metric does not necessarily ensure the convergence of GAN and the Lipschitz-continuity condition in Wasserstein GAN is crucial for ensuring its convergence. More specifically, we have shown that Wasserstein distance in the dual form with compacted constraint also cannot provide meaningful gradient through