Generative Adversarial Networks (GANs) have emerged as a formidable new method in unsupervised learning, often learning to generate images that are visually more appealing than those generated by other unsupervised learning methods (e.g., variational autoencoders, generative stochastic networks, deep Boltzmann machines). However, in contrast to many previous unsupervised learning methods, GANs do not provide an estimate of a measure of distributional fit (e.g., a likelihood calculation on held-out data). Thus there is no obvious guarantee of generalization at the end of training, and a persistent fear has been mode collapse (the simplest form being that the generator network memorizes training examples). Additionally, standard GAN frameworks may not provide any meaningful features (or latent representations) for downstream tasks, which is often the goal of unsupervised learning.
A theoretical study of GANs was initiated in the seminal work of (Goodfellow et al., 2014), which proved that when sample sizes, generator sizes, and discriminator sizes are all unbounded (i.e., infinite), the generator converges to the true data distribution. But a recent theoretical analysis of GANs with finite training samples and finite discriminator size (Arora et al., 2017) reached a different conclusion: the training objective can be close to optimal while the generator is far from having actually learnt the distribution. Concretely, (Arora et al., 2017) show that if the discriminator has size bounded by , the training objective can be close to optimal even though the output distribution is supported on only images. By contrast, one imagines that the target distribution usually has very large support: the set of all possible images of human faces (a frequent setting in GANs work) should effectively involve all combinations of hair color/style, facial features, complexion, expression, pose, lighting, race, etc., and thus the number of possible face images is astronomically large.
Of course, such a theoretical analysis does not in principle preclude the possibility that the training process of GANs somehow avoids such low-support solutions, similarly to how SGD seems to avoid bad local optima in the (supervised) training of feedforward neural networks. Clearly, the issue needed to be studied empirically as well. A recent paper (Arora & Zhang, 2017) proposed a birthday paradox test to probe the diversity of trained GANs. The birthday paradox states that if we sample uniformly at random from a set of size N, we will start seeing collisions (i.e., repeated samples of the same element) after sampling about √N elements. This is adapted to the continuous regime by drawing s samples from the generator, finding the 20 most similar images using some automated measure of similarity, and visually inspecting these 20 images for duplicates. (Note that the last step ensures “no false positives”: a “duplicate” is declared only after a visual inspection has occurred.) If we find a duplicate, this suggests that the support of the distribution cannot be more than about s². The results of the birthday paradox test in (Arora & Zhang, 2017) suggest that low-support solutions are not merely a theoretical issue, but do actually occur even in practically trained GANs, which suffer from mode collapse to various degrees.
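The discrete version of the birthday paradox is easy to check numerically. The following is a minimal sketch, not the test of (Arora & Zhang, 2017) itself: it samples uniformly from a finite set and counts draws until the first repeat. The function names, the trial count, and the fixed seed are illustrative assumptions.

```python
import math
import random


def draws_until_collision(support_size, rng):
    """Draw uniformly from {0, ..., support_size - 1} until an element repeats."""
    seen = set()
    draws = 0
    while True:
        item = rng.randrange(support_size)
        draws += 1
        if item in seen:
            return draws
        seen.add(item)


def mean_draws_until_collision(support_size, trials=200, seed=0):
    """Average number of draws before the first collision over several trials."""
    rng = random.Random(seed)
    total = sum(draws_until_collision(support_size, rng) for _ in range(trials))
    return total / trials


def support_upper_estimate(num_samples):
    """Heuristic reading of the test: a duplicate among s samples suggests
    a support of at most about s**2."""
    return num_samples ** 2


if __name__ == "__main__":
    n = 10_000
    print("sqrt(N) =", math.sqrt(n))
    print("mean draws until first collision =", mean_draws_until_collision(n))
```

The average number of draws grows on the order of the square root of the support size, which is what lets a duplicate among s samples be read as evidence of a support of size roughly s squared.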
Encoder-decoder frameworks like BiGAN (Donahue et al., 2017) and Adversarially Learned Inference or ALI (Dumoulin et al., 2017) were recently proposed to fix both the issue of mode collapse and the lack of features in the standard GAN setup. Inspired by autoencoder models, they force the generative model to learn an inference mechanism as well as a generative mechanism. The hope is that the encoding mechanism “inverts” the generator and thus forces the generator to learn meaningful featurizations of data. It has been suggested that the constraint of learning “meaningful features” will also help with the mode collapse problem: (Dumoulin et al., 2017) report experiments on 2-dimensional mixtures of Gaussians suggesting this is indeed the case. More promisingly, the theoretical result of (Arora et al., 2017) also doesn’t seem to extend to encoder-decoder architectures. Thus it was an open problem whether encoder-decoder GANs suffer from the same theoretical limitations as standard GANs. (We note that the above-mentioned empirical study (Arora & Zhang, 2017) did report that ALI training also suffers from mode collapse, although it seems slightly better than other GAN setups in this regard.)
The current paper provides a theoretical analysis showing that encoder-decoder training objectives cannot avoid mode collapse even for very realistic target distributions (basically, real-life images), nor can they enforce learning of meaningful codes/features. In fact, a close-to-optimum solution to the encoder-decoder optimization can be achieved by an inference mechanism that essentially extracts white noise from the images, and where the generator produces a distribution of finite support whose size is moderate (sub-quadratic in the discriminator size). The proof is novel, as explained below.
We recall the Adversarial Feature Learning (BiGAN) setup from (Donahue et al., 2017). For concreteness we assume the setup is being trained on the distribution of real-life images.
The “generative” player consists of two parts: a generator and an encoder . The generator takes as input a latent variable and produces a sample that is its attempt to output a realistic image. The encoder takes as input an actual image and produces , which is a guess for the latent variable that can generate .
The underlying assumption is that images come from an unknown manifold whose dimension is much lower than the number of pixels. Latent variables correspond to image representations on this manifold, and the encoder (resp., generator) maps from images to manifold points (resp., from manifold points to images). Then there exists a distribution , where
is the joint distribution of the latent variables and data. The goal of the training is to yield such that is distributed as , the true generator distribution; and is distributed as , the true encoder distribution. The hope is that the trained encoder-decoder pairs are such that both and are equal to .
In the older encoder-decoder frameworks, and
would be trained jointly using variational inference. But the GAN setup avoids any explicit probability calculations, and instead uses an adversarial discriminator that can be asked (i.e., trained) to distinguish between two given distributions. Thus the goal of the “generative” player is to convince the discriminator that the two distributions and are the same, where is a random seed and is a random image. The discriminator is trained to distinguish between them.
Using the usual min-max formalism for adversarial training, the BiGAN objective is written as:
where is the empirical distribution over data samples ; is a distribution over random “seeds” for the latent variables: typically sampled from a simple distribution like a standard Gaussian; and is a concave “measuring” function. (The standard choice is , though other options have been proposed in the literature.) For our purposes, we will assume that outputs values in the range , and is -Lipschitz.
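To make the quantities in this objective concrete, here is a minimal sketch of how the inner value could be evaluated for one fixed discriminator on finite samples. It assumes the objective compares the average of the measuring function applied to the discriminator on pairs (image, encoder(image)) against its average on pairs (generator(seed), seed), and takes the absolute difference, in the style of (Arora et al., 2017). That form, the use of the logarithm as the measuring function, and all names below are assumptions, since the displayed formula is not legible in this text.

```python
import math


def bigan_objective_gap(images, seeds, generator, encoder, discriminator, phi):
    """Absolute gap between the two averaged discriminator scores (assumed form)."""
    real_term = sum(phi(discriminator(x, encoder(x))) for x in images) / len(images)
    fake_term = sum(phi(discriminator(generator(z), z)) for z in seeds) / len(seeds)
    return abs(real_term - fake_term)


# Toy one-dimensional "images" and seeds, purely illustrative.
images = [0.2, 0.4, 0.9]
seeds = [-1.0, 0.0, 1.0]


def generator(z):
    return 0.5  # a collapsed generator that always emits the same "image"


def encoder(x):
    return 0.0  # an encoder that ignores its input


def blind_discriminator(x, z):
    return 0.5  # cannot tell the two joint distributions apart


def image_discriminator(x, z):
    return 0.25 + 0.5 * x  # looks only at the image coordinate


gap_blind = bigan_objective_gap(images, seeds, generator, encoder, blind_discriminator, math.log)
gap_image = bigan_objective_gap(images, seeds, generator, encoder, image_discriminator, math.log)
```

The outer maximization over discriminators and minimization over generator/encoder pairs is what training approximates; the sketch only evaluates the gap for one fixed choice of each.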
As mentioned, this objective leads to the target distribution being learnt, given enough capacity in the nets, samples, and training time. But the earlier analysis of (Arora et al., 2017) showed that finite-capacity discriminators behave very differently from infinite-capacity discriminators, in that they cannot prevent the learnt distribution from exhibiting mode collapse. Their proof, however, cannot handle the encoder-decoder framework, and the obstacle is nontrivial. Their argument is a simple concentration/epsilon-net argument showing that a discriminator of capacity cannot distinguish between a generator that samples from and one that memorizes a subset of random images in and outputs one randomly from this subset. By contrast, in the current setting we need to say what happens with the encoder. A big obstacle is that fairness requires that the encoder net be smaller than the discriminator, so that the discriminator could (in principle, if needed) learn to compute by itself. Thus in particular the proof must end up describing an explicit small encoder, which seems very difficult. (No such explicit description is known, and generative models are only an approximation.) This difficulty is circumvented in the argument below by making the encoder map images to random noise extracted from the image.
3 Limitations of Encoder-Decoder GAN architectures
For ease of exposition we will refer to the data distribution as the image distribution. The proof becomes more elegant if we assume that consists of images that have been noised: concretely, think of replacing every th pixel by Gaussian noise. Such noised images would of course look fine to our eyes, and we would expect the learnt encoder/decoder not to be affected by this noise. For concreteness, we will take the seed/code distribution to be a spherical zero-mean Gaussian (in an arbitrary dimension and with an arbitrary variance). (Footnote 1: The proof can be extended to non-noised inputs by assuming that natural images have an innate stochasticity that can be extracted by a small net to get a few statistically random bits. We chose not to write it that way because it requires making a novel assumption about images.) The proof also extends to more general code distributions than Gaussian.
Furthermore, we will assume that , with (we think of , which is certainly the case in practice). As in (Arora et al., 2017) we assume that discriminators are L-Lipschitz with respect to their trainable parameters, and the support size of the generator’s distribution will depend upon this Lipschitz constant and the capacity (= number of parameters) of the discriminator.
Theorem 1 (Main).
There exists a generator of support and an encoder with at most non-zero weights, s.t. for all discriminators that are L-Lipschitz and have capacity less than , it holds that
The interpretation of the above theorem is as stated before: the encoder has very small complexity (we will subsequently specify it precisely and show it simply extracts noise from the input ); the generator is a small-support distribution (so presumably far from the true data distribution). Nevertheless, the value of the BiGAN objective is small.
The precise noise model: Denoting by the distribution of unnoised images, and the distribution of seeds/codes, we define the distribution of noised images as the following distribution: to produce a sample in take a sample from and from independently and output , which is defined as
In other words, set every -th coordinate to one of the coordinates of . In practical settings, is usually chosen, so noising coordinates out of all will have no visually noticeable effect. (Footnote 2: Note that the result itself does not require constraints on beyond ; the point we are stressing is merely that the noise model makes sense for practical choices of .)
3.1 Proof Sketch, Theorem 1
(A full proof appears in the appendix.) The main idea is to show the existence of the generator/encoder pair via a probabilistic construction that is shown to succeed with high probability.
Encoder : The encoder just extracts the noise from the noised image (by selecting the relevant coordinates). Namely, . (So the code is just Gaussian noise and has no meaningful content.) It is easy to see that this can be captured using a ReLU network with weights: we can simply connect the -th output to the ()-th input using an edge of weight 1.
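The noise model and the noise-extracting encoder can be sketched together in a few lines. This is an illustration under stated assumptions, not the paper's exact definition: images are flattened lists of floats, the noised positions are taken to be 0, stride, 2*stride, and so on, and the names `noise_image`, `encoder`, and `stride` are hypothetical.

```python
import random


def noise_image(image, code, stride):
    """Return a copy of `image` whose coordinates 0, stride, 2*stride, ...
    are overwritten by the coordinates of `code` (assumed layout)."""
    if (len(code) - 1) * stride >= len(image):
        raise ValueError("image too short for this code length and stride")
    noised = list(image)
    for i, value in enumerate(code):
        noised[i * stride] = value
    return noised


def encoder(noised_image, code_dim, stride):
    """Encoder of the construction: select the noised coordinates, nothing else.
    Each output is wired to exactly one input with weight 1."""
    return [noised_image[i * stride] for i in range(code_dim)]


rng = random.Random(0)
image = [rng.random() for _ in range(64)]        # stand-in for a flattened image
code = [rng.gauss(0.0, 1.0) for _ in range(4)]   # spherical Gaussian seed
noised = noise_image(image, code, stride=16)
recovered = encoder(noised, code_dim=4, stride=16)
```

The recovered code equals the injected seed exactly, and it carries no information about the remaining pixels, which is the sense in which the code has no meaningful content.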
Generator : This is designed to produce a distribution of support size . We first define a partition of into equal-measure blocks under . Next, we draw samples from the image distribution. Finally, for a sample , we define to be the index of the block in the partition in which lies, and define the generator as . Since the set of samples is random, this specifies a distribution over generators. We prove that with high probability, one of these generators satisfies the statement of Theorem 1. Moreover, we show in the full version that such a generator can be easily implemented using a ReLU network of complexity .
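The memorizing generator described above can be sketched as a lookup: find the block that contains the seed, then return the sample stored for that block. The sketch below assumes a one-dimensional standard Gaussian seed and blocks given by equal-probability intervals of its CDF; this is a simplification chosen for illustration, and the strings stand in for memorized images. Whether the output is additionally noised with the seed is not recoverable from the text here, so the sketch omits that step.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist(0.0, 1.0)


def block_index(z, num_blocks):
    """Index of the equal-measure block (under the standard Gaussian) containing z."""
    u = _STD_NORMAL.cdf(z)                      # uniform on [0, 1] when z is Gaussian
    return min(int(u * num_blocks), num_blocks - 1)


def make_generator(memorized_samples):
    """Build a generator that outputs the memorized sample of the seed's block."""
    m = len(memorized_samples)

    def generator(z):
        return memorized_samples[block_index(z, m)]

    return generator


rng = random.Random(0)
memorized = [f"image_{i}" for i in range(8)]    # placeholders for sampled images
G = make_generator(memorized)
outputs = [G(rng.gauss(0.0, 1.0)) for _ in range(8000)]
counts = {name: outputs.count(name) for name in memorized}
```

Because the blocks have equal measure, each memorized sample is emitted with the same probability, so the generator's output distribution is uniform over a support whose size equals the number of blocks.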
The basic intuition of the proof is as follows. We will call a set of samples from non-colliding if no two lie in the same block. Let be the distribution over non-colliding sets , s.t. each is sampled independently from the conditional distribution of inside the i-th block of the partition.
First, we notice that under the distribution for we defined, it holds that
In other words, the “expected” encoder correctly matches the expectation of , so that the discriminator is fooled. We want to show that concentrates enough around this expectation, as a function of the randomness in , so that we can say with high probability over the choice of , is small. We handle the concentration argument in two steps:
First, we note that we can calculate the expectation of when by calculating the empirical expectation over -sized non-colliding sets sampled according to . Namely, as we show in the full version.
This follows easily from the fact that all blocks in the partition have equal measure under .
Thus, we have reduced our task to arguing about the concentration of
(viewed as a random variable in ). Towards this, we consider the random variable as a function of the randomness in and both. Since is a non-colliding set of samples, we can write
for some function , where the random variables , are all mutually independent; thus we can use McDiarmid’s inequality to argue about the concentration of in terms of both and .
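For reference, the standard statement of McDiarmid's bounded-differences inequality invoked here is recalled below. The symbols $X_i$, $f$, $c_i$, and $t$ are generic and are not the paper's notation; how the constants $c_i$ are instantiated for the quantity at hand is left to the full proof.

```latex
Let $X_1,\dots,X_n$ be independent random variables, and let $f$ satisfy the
bounded-differences condition
\[
  \sup_{x_1,\dots,x_n,\,x_i'}
  \bigl| f(x_1,\dots,x_i,\dots,x_n) - f(x_1,\dots,x_i',\dots,x_n) \bigr| \le c_i
  \qquad \text{for each } i \in \{1,\dots,n\}.
\]
Then for every $t > 0$,
\[
  \Pr\Bigl[\, \bigl| f(X_1,\dots,X_n) - \mathbb{E}\, f(X_1,\dots,X_n) \bigr| \ge t \,\Bigr]
  \;\le\; 2 \exp\!\Bigl( - \frac{2 t^2}{\sum_{i=1}^{n} c_i^2} \Bigr).
\]
```

The bound is useful when each argument influences the function only slightly, as is the case for an average in which changing one input affects a single term.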
From this, we can use Markov’s inequality to argue that all but an exponentially small (in ) fraction of encoders satisfy that: for all but an exponentially small (in ) fraction of non-colliding sets , is small. Note that this has to hold for all discriminators, so we additionally need to build an epsilon-net and union bound over all discriminators, similarly to (Arora et al., 2017). From this fact, it is easy to conclude that for such ,
is small, as we want. The details are provided in the full version.
We have considered the theoretical shortcomings of encoder-decoder GAN architectures, which a priori seemed promising because they target feature learning, a potentially simpler task than learning the full underlying distribution. At the same time, it was hoped that forcing the generator to learn “meaningful” encodings of the image should improve distribution learning as well by ameliorating mode collapse. Our work suggests, however, that the learning objective alone does not guarantee such success; the objective can be low even though the GAN has learnt meaningless features which amount to noise. Furthermore, the learnt distribution can exhibit severe mode collapse, just as it does for the usual GAN setup.
This theoretical problem arises from two causes: (a) Use of generative models that are highly expressive (multilayer nets) and which can therefore exhibit behavior unanticipated by the designer. (Obviously, this issue doesn’t arise in less expressive models like mixtures of Gaussians.) (b) Lack of any explicit probability calculations in the training. (Though of course this design decision arose after the failure of earlier models that did rely on such explicit calculations.)
Our theoretical analysis points to gaps in current ways of reasoning about GANs, but it leaves open the possibility that encoder-decoder GANs do work in practice, with the training somehow avoiding the bad solutions exhibited here. We hope our results will stimulate further theoretical and empirical study.
- Arora & Zhang (2017) Sanjeev Arora and Yi Zhang. Do GANs actually learn the distribution? An empirical study. arXiv preprint arXiv:1706.08224, 2017.
- Arora et al. (2017) Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573, 2017.
- Donahue et al. (2017) Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In International Conference on Learning Representations (ICLR), 2017.
- Dumoulin et al. (2017) Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. In International Conference on Learning Representations (ICLR), 2017.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
Appendix A Technical proofs
We recall the basic notation from the main part: the image distribution will be denoted as , and the code/seed distribution as , which we assume is a spherical Gaussian. For concreteness, we assumed the domain of is and the domain of is with . (As we said, we are thinking of .)
We also introduced the quantity .
Before proving Theorem 1, let’s note that the claim can easily be made into a finite-sample version. Namely:
Corollary A.1 (Main, finite sample version).
There exists a generator of support , s.t.
if is the uniform distribution over a training set of size at least , and is the uniform distribution over a sample from of size at least , then for all discriminators that are -Lipschitz and have fewer than parameters, with probability over the choice of training set , we have:
As noted in Theorem B.2 in (Arora et al., 2017), we can build a -net for the discriminators with size bounded by . By a Chernoff bound and a union bound over the points in the -net, with probability at least over the choice of a training set , we have
for all discriminators with capacity at most . Similarly, with probability at least over the choice of a noise set ,
Union bounding over these two events, we get the statement of the Corollary.
Spelling out the distribution over generators more explicitly, we will in fact show:
Theorem 2 (Main, more detailed).
Let follow the distribution over generators defined in Section 3. With probability over the choice of , for all discriminators that are L-Lipschitz and have capacity bounded by :
A.1 Proof of the main claim
Recall that we call a set of samples from non-colliding if no two lie in the same block, and that we denote by the distribution over non-colliding sets , s.t. each is sampled independently from the conditional distribution of inside the i-th block of the partition.
First, notice the following Lemma:
Lemma A.1 (Reducing to expectations over non-colliding sets).
Let be a fixed generator, and a fixed discriminator. Then,
where is the conditional distribution of in the -th block of the partition. However, since the blocks form an equipartitioning, we have
Lemma A.2 (Concentration of good generators).
With probability over the choice of ,
for all discriminators of capacity at most .
Consider as a random variable in and for a fixed . We can write
where the random variables , are all mutually independent. Note that the arguments that is a function of are all independent variables, so we can apply McDiarmid’s inequality. Towards that, let’s denote by
the vector of all inputs to , except for . Notice that
(as changing to only affects one out of the terms in ). Analogously we have
Denoting , by McDiarmid we get
Building a -net for the discriminators and union bounding over the points in the net, we get that . On the other hand, we also have
so by Markov’s inequality, with probability over the choice of , ,
Let us denote by the sets , s.t. . Then, with probability over the choice of , we have that for all of capacity at most ,
which is what we want.
With these in mind, we can prove the main theorem:
A.2 Representability results
In this section, we show that a generator of the type we described in the previous section can be implemented easily using a ReLU network. The generator can be parametrized by the set of “memorized” training samples. The high-level overview of the network is as follows: we partition into equal-measure blocks; subsequently, for a noise sample , we determine which block it belongs to, and output the corresponding sample , which is memorized in the weights of the network.
Let be the generator determined by the samples , i.e. . For any , there exists a ReLU network with non-zero weights, which implements a function , s.t. , where denotes total variation distance. (Footnote 3: The notation simply denotes the distribution of , when .) (Footnote 4: For smaller , the weights get larger, so the bit-length needed to represent them gets larger. This is standard when representing step functions using ReLU; for instance, see Lemma 3 in (Arora et al., 2017). For the purposes of our main result, Theorem 1, it suffices to make , which translates into weights on the order of , which in turn translates into bit complexity of , so this is not an issue either.)
The construction of the network is depicted in Figure 1. We expand on each of the individual parts of the network.
First, we show how to implement the partitioning into blocks. The easiest way to do this is to use the fact that the coordinates of a spherical Gaussian are independent, and simply partition each dimension separately into equimeasure blocks, depending on the value of : the absolute value of the -th coordinate. Concretely, without loss of generality, let’s assume , for some . Let us denote by , the real numbers, s.t. . We will associate to each -tuple the cell . These blocks clearly equipartition with respect to the Gaussian measure.
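The per-coordinate partition can be sketched numerically. The following assumes a standard (unit-variance) spherical Gaussian and takes the interval boundaries to be quantiles of the absolute value of a standard normal coordinate, so that each interval has the same probability; the exact boundary notation of the paper is not legible here, and the names `half_normal_thresholds` and `cell_of` are hypothetical.

```python
import bisect
from statistics import NormalDist


def half_normal_thresholds(k):
    """Boundaries t_1 < ... < t_{k-1} with P(|Z| <= t_j) = j / k for Z ~ N(0, 1).

    Since P(|Z| <= t) = 2 * Phi(t) - 1, we need Phi(t_j) = (1 + j / k) / 2.
    """
    std_normal = NormalDist(0.0, 1.0)
    return [std_normal.inv_cdf((1.0 + j / k) / 2.0) for j in range(1, k)]


def cell_of(z, thresholds):
    """Tuple of interval indices, one per coordinate, based on |z_i|."""
    return tuple(bisect.bisect_right(thresholds, abs(c)) for c in z)


thresholds = half_normal_thresholds(4)           # three boundaries, four intervals
cell = cell_of((0.1, -2.0, 0.5), thresholds)     # one index in {0, 1, 2, 3} per coordinate
```

With k intervals per coordinate and independent coordinates, the cells are products of equal-probability intervals, so a seed of dimension n falls into each of the k**n cells with equal probability.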
Inferring the -tuple after calculating the absolute values (which is trivially representable as a ReLU network as ) can be done using the “selector” circuit introduced in (Arora et al., 2017). Namely, by Lemma 3 there, there exists a ReLU network with non-zero weights that takes as input and outputs numbers , s.t. and with probability over the randomness of .
Since we care about and being close in total variation distance only, we can focus on the case where all are such that and for some indices .
We now wish to “turn on” the memorized weights for the corresponding block in the partition. To do this, we first pass the calculated -tuple through a network which interprets it as a number in -ary and calculates its equivalent decimal representation. This is easily implementable as a ReLU network with weights calculating . Then, we use a simple circuit with non-zero weights to output numbers , s.t. and (implemented in the obvious way). The subnetwork of will be responsible for the -th memorized sample.
Namely, we attach to each coordinate a circuit with fan-out of degree , s.t. the weight of edge is . Let’s denote these outputs as and let be defined as . It is easy to see, since , that .
Finally, the operation can be trivially implemented using additional weights: we simply connect each output either to if or to otherwise.
Adding up the sizes of the individual components, we see the total number of non-zero weights is , as we wanted.
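The index computation and selection steps of this construction can be illustrated outside of a neural network. The sketch below is one plausible reading under explicit assumptions: the tuple of per-coordinate interval indices is read as a base-k number with the least significant digit first, the selector is an exact one-hot vector, and the memorized sample is read out as a selector-weighted sum of the stored samples. The digit order, the function names, and the toy stored vectors are all assumptions, and the final step of the construction is not covered.

```python
def tuple_to_index(digits, base):
    """Interpret `digits` as a base-`base` number, least significant digit first."""
    index = 0
    for position, digit in enumerate(digits):
        if not 0 <= digit < base:
            raise ValueError("digit out of range for the given base")
        index += digit * base ** position
    return index


def one_hot(index, size):
    """Selector vector with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


def select_memorized(selector, memorized):
    """Selector-weighted sum of memorized samples; with a one-hot selector this
    returns exactly the sample stored for the selected block."""
    dim = len(memorized[0])
    return [sum(b * sample[j] for b, sample in zip(selector, memorized))
            for j in range(dim)]


base, num_coords = 3, 3
memorized = [[float(i), float(i) + 0.5] for i in range(base ** num_coords)]  # toy "images"
index = tuple_to_index((1, 0, 2), base)
selected = select_memorized(one_hot(index, len(memorized)), memorized)
```

In the network, each of these steps corresponds to a fixed set of weights, and the memorized samples themselves are stored as the weights of the final linear read-out.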