work
This is a public blog, intended to be used only by myself
view repo
We introduce a simple recurrent variational auto-encoder architecture that significantly improves image modeling. The system represents the state-of-the-art in latent variable models for both the ImageNet and Omniglot datasets. We show that it naturally separates global conceptual information from lower level details, thus addressing one of the fundamentally desired properties of unsupervised learning. Furthermore, the possibility of restricting ourselves to storing only global information about an image allows us to achieve high quality 'conceptual compression'.
READ FULL TEXT VIEW PDFThis is a public blog, intended to be used only by myself
Images contain a large amount of information that is a priori stored independently in the pixels. In the semi-supervised learning regime where a large number of images is available but only a small number of labels, one would like to leverage this information to create representations that allow for better (and especially faster) generalization. Intuitively one expects such representations to explicitly extract global conceptual aspects of an image.
In this paper we propose a method that is able to transform an image into a progression of increasingly detailed representations, ranging from global conceptual aspects to low level details (see Figures 1 & 2). At the same time, our model greatly improves latent variable image modeling compared to earlier implementations of deep variational auto-encoders (Kingma & Welling, 2014; Rezende et al., 2014; Gregor et al., 2014). Furthermore, it has the advantage of being a simple homogeneous architecture not requiring complex design choices, which is similar to the recurrent structure of DRAW (Gregor et al., 2015). Last, it provides an important insight into building good variational auto-encoder models of images: the use of multiple layers of stochastic variables that are all ‘close’ to the pixels significantly improves performance.
The system’s ability to stratify information enables it to perform high quality lossy compression, by storing only a subset of latent variables, starting with the high level ones, and generating the remainder during decompression (see Figure 4).
Currently the ultimate arbiter of lossy compression remains human evaluation. Other simple measures such as the L2 distance between compressed and original images are inappropriate – for example if a particular generated grass texture is sharp, but different from the one in the original image, it will yield a large L2 distance yet should, at the same time, be considered conceptually close to the original.
Achieving good lossy compression while storing only high level latent variables would imply that representations learned at a high level contain information similar to that used by humans to judge images. As humans outperform the best machines at learning abstract representations, human evaluation of lossy compression obtained by these generative models constitutes a reasonable test of the quality of representations learned by these models.
In the following we discuss variational auto-encoders and compression in more detail, present the algorithm and demonstrate the results on generation quality and compression.
Numerous techniques exist for unsupervised learning in deep networks, e.g. sparse auto-encoders and sparse coding (Kavukcuoglu et al., 2010; Le, 2013), denoising auto-encoders (Vincent et al., 2010), deconvolutional networks (Zeiler et al., 2010)
, restricted Boltzmann machines
(Hinton & Salakhutdinov, 2006), deep Boltzmann machines (Salakhutdinov & Hinton, 2009), generative adversarial networks (Goodfellow et al., 2014) and variational auto-encoders (Kingma & Welling, 2014; Rezende et al., 2014; Gregor et al., 2014).In this paper we focus on the class of models in the variational auto-encoding framework. Since we are also interested in compression, we present them from an information-theoretic perspective. Variational auto-encoders typically consist of two neural networks: one that generates samples from latent variables (‘imagination’), and one that infers latent variables from observations (‘recognition’). The two networks share the latent variables. Intuitively speaking one might think of these variables as specifying, for a given image, at different levels of abstraction, whether a particular object such as a cat or a dog is present in the input, or perhaps what the exact position and intensity of an edge at a given location might be.
During the recognition phase the network acquires information about the input and stores it in the latent variables, reducing their uncertainty. For example, at first not knowing whether a cat or a dog is present in the image, the network observes the input and becomes nearly certain that it is a cat. The reduction in uncertainty is quantitatively equal to the amount of information the network acquired about the input. During generation the network starts with uncertain latent variables and selects their values from a prior distribution that specifies this uncertainty (e.g. it chooses a dog). Different choices will produce different samples.
Variational auto-encoders provide a natural framework for unsupervised learning – we can build networks with layers of stochastic variables and expect that, after learning, the representations become increasingly more abstract for higher levels of the hierarchy. The questions then are: can such a framework indeed discover such representations both in principle and in practice, are such networks powerful enough for modeling real data, and what techniques one needs to make it work well.
Variational auto-encoders can not only be used for representation learning but also for compression. The training objective of variational auto-encoders is to compress the total amount of information needed to encode the input. They achieve this by using information-carrying latent variables that express what, before compression, was encoded using a larger amount of information in the input. The information in the layers and the remaining information in the input can be encoded in practice as explained later in this paper.
The amount of lossless compression one is able to achieve is bounded by the underlying entropy of the image distribution. Additionally, most image information as measured in bits is contained in the fine details of the image. Thus we might reasonably expect that lossless compression will never improve by more than a factor of two in comparison to current performance.
Lossy compression, on the other hand, holds much more potential for improvement. In this case we want to compress an image by a certain amount, allowing some information loss, while maximizing both quality and similarity to the original image. As an example, at a low level of compression (close to lossless compression), we could start by reducing pixel precision, e.g. from 8 bits to 7 bits. Then, as in JPEG, we could express a local 8x8 neighborhood in a discrete cosine transform basis and store only the most significant components. This way, instead of introducing quantization artifacts in the image that would appear if we kept decreasing pixel precision, we preserve higher level structures but to a lower level of precision. However, what can we do beyond that as we keep pushing the compression? We would like to preserve the most important aspects of the image. What determines what is important?
Let us imagine that we are compressing images of cats and dogs and would like to compress an image down to one bit. What would that bit be? One would imagine that it should represent whether the image contains either a cat or a dog. How would we then get an image out of this single bit? If we have a good generative model, we can simply generate the entire image from this one latent variable, an image of a cat if the bit corresponds to ‘cat’, and an image of a dog otherwise. Now let us imagine that instead of compressing to one bit we wanted to compress down to ten bits. Now we can store the most important properties of the animal as well – e.g. its type, color, and basic pose. The rest would be ‘filled in’ by the generative model that is conditioned on this information. If we increase the number of bits further we can preserve more and more about the image, while generating the fine details such as hair, or the exact pattern of the floor, etc. Most bits are in fact about such low level details. We call this kind of compression – compressing by giving priority to higher levels of representation and generating the remainder – ‘conceptual compression’. We suggest that this should be the ultimate objective of lossy compression.
Importantly, if we solve deep representation learning with latent variable generative models that generate high quality samples, we achieve the objective of lossy compression mentioned above. We can see this as follows. Assume that the network has learned a hierarchy of progressively more abstract representations. Then, to get different levels of compression, we can store only the corresponding number of topmost layers and generate the rest. By solving unsupervised deep learning, the network would order information according to its importance and store it with that priority.
While the ultimate goal of unsupervised learning remains elusive, we make a step in this direction, and show that our network learns to order information from a rather global level to precise details in images, without being hand-engineered to do this explicitly, as illustrated in Figures 1 and 2. This information separation already allows us to achieve better compression quality than JPEG and JPEG2000 as shown in Figure 4. While we are not bound by the same constraints as these algorithms, such as speed and memory, these results demonstrate the potential of this method, which will get better as latent variable generative models improve.
What are the challenges involved in turning latent variable models into state-of-the-art generative models of images? Many successful vision architectures (e.g. Simonyan & Zisserman, 2014
) have highly over-complete representations that contain many more neurons in hidden layers than pixels. These representations need to be combined to get a very sharp distribution at the pixel level if the pixels are modeled independently. This distribution corresponds to salt and pepper noise which is not present in natural images to a perceptible level. This poses a major challenge.
After experimenting with deep variational auto-encoders we concluded that it was exceedingly difficult to obtain satisfactory results with a single computational pass through the network. Instead we propose that the network needs the ability to correct itself over a number of time steps. Thus, sharp reconstructions should not be a property of high-precision values in the network, but should rather be the result of an iterative feedback mechanism that is robust to network parameter change.
Such a mechanism is provided by the DRAW algorithm (Gregor et al., 2015), which is a recurrent type of variational auto-encoder. At each time step, DRAW maintains a provisionary reconstruction, takes in information about a given image, stores it in latent variables and updates the reconstruction. Keeping track of the reconstruction aids the iterative feedback mechanism which is learned by back-propagation. Computation is both deep – in iterations – and close to the pixels.
We introduce convolutional DRAW. It features convolutions, latent prior modeling, a Gaussian input distribution (for natural images) and, in some experiments, a multi-layer architecture. However, it does not use an explicit attentional mechanism. We note that even the single-layer version is already a deep generative model which can decide to process higher level information first before focusing on details, as we demonstrate to some degree. We also experiment with making convolutional DRAW hierarchical in a similar way that we would build conventional deep variational auto-encoders (Gregor et al., 2014)
– stacking more layers of latent and deterministic variables. We believe that the recurrence is important not just for accurate pixel reconstructions, but also at higher levels. For example, when the network decides to generate edges at different locations, it needs to make sure that they are aligned. It is hard to imagine this happening in a single computational pass through the network. Similarly at higher levels, when it decides to generate objects, they need to be generated with the right relationship to one another. And finally at the scene level, one probably does not generate entire scenes at once, but rather one step at a time.
Let us relate this discussion to two other families of generative models, specifically generative adversarial networks (GANs; Goodfellow et al. (2014)) and auto-regressive pixel models. GANs have been demonstrated to be able to generate realistic looking images (Denton et al., 2015; Radford et al., 2015), with properly aligned edges, using a simple feedforward generative network (Radford et al., 2015)
. GANs also contain two networks – a generative network that is the same as in variational auto-encoders, and a classification network. The classification network is presented with both real and model-generated images and tries to classify them according to their true nature – real or model-generated. The generative network gets gradients from the classification network, changing its weights in an attempt to make the generated images be judged as real ones by the classification network. This makes the generation network produce realistic images that ‘fool’ the classification network. It needs to produce a wide diversity of images, not just one realistic image, because if it produced only one (or a small number of them), the classification network would classify that image as generated and others as realistic, and be almost always correct. This actually happens in practice, and one has to apply a variety of techniques, e.g. as in
(Radford et al., 2015), to obtain sufficient image diversity. However the extent of GANs’ sampling diversity is unknown and currently there is no satisfactory measure for it. So while a given network doesn’t produce just one image, it is possible that it produces only a tiny subset of possible realistic images, as it simply competes with the power of the classifier. For example if it generates a sharp image, it is unclear whether the system is also capable of generating its translated version, or simply a slightly distorted version.Finally there is another way to get low uncertainty at the pixel level: instead of predicting pixels independently given the latents, we can decide not to use latents and iterate sequentially over the pixels, predicting a given pixel from the previous ones (from top left to bottom right) in an autoregressive fashion (Bengio & Bengio, 1999; Graves & Schmidhuber, 2009; Larochelle & Murray, 2011; Gregor & LeCun, 2011; van den Oord et al., 2016). This is as ‘close’ to the pixels as one can possibly be, and furthermore the procedure is purely deterministic. The disadvantage is conceptual – the information and decisions are not done at a conceptual level but at the pixel level. For example when generating cats vs dogs the decisions at the first set of pixels (top left of the image) will contain no information about a hypothetical cat or dog. But as we get to the region where these objects are, we start choosing pixels that will slowly tip the probability of generating a cat vs a dog one way or the other. As we start generating an ear, it will more likely be a cat’s or a dog’s and so on. However this pixel level approach and our approach are orthogonal and can be easily combined, for example by feeding the output of convolutional DRAW into the conditional computation of a pixel level model. In this paper we study the latent variable approach and make the pixels independent given the latents.
In this section we describe the details of a single-layer version of the algorithm. Convolutional DRAW contains the following variables: input , reconstruction , reconstruction error , the state of the encoder recurrent net , the state of the decoder recurrent net and latent variable . The variables , and are recurrent (passed between different time steps) and are initialized with learned biases. Then at each time step , convolutional DRAW performs the following updates:
(1) | |||||
(2) | |||||
(3) | |||||
(4) | |||||
(5) | |||||
(6) | |||||
(7) |
where denotes a convolution and Rnn denotes a recurrent network. We use LSTM (Hochreiter & Schmidhuber, 1997) with convolutional operators instead of the usual linear ones. The final value of
contains the parameters of the input distribution. For binary images we use the Bernoulli distribution. For natural images we use the Gaussian distribution with mean and log variance given by splitting the vector
to obtain the input cost and the total cost :(8) | |||||
(9) | |||||
(10) | |||||
(11) |
where the handling of real valued-ness of the inputs (8, 10) is explained below, and being the standard setting. The algorithm is schematically illustrated in the first layer of Figure 3
. The network is trained by calculating the gradient of this loss and using a stochastic gradient descent algorithm. Stochastic back-propagation through a sampling function is done as in variational auto-encoders
(Kingma & Welling, 2014; Rezende et al., 2014). Both the approximate posterior and the prior are Gaussian, with mean and log variance being linear functions of or , respectively.Let us discuss how we handle the input distribution for natural images. Each pixel (per color) is one of 256 values. We could use a soft-max distribution to model it. This would result in a rather large output vector at every time step and also does not take advantage of the underlying real valued-ness of intensities and therefore we opted for a Gaussian distribution. However this still needs to be converted to a discrete distribution over 256 values to calculate the negative likelihood loss . Instead of this, we add uniform noise to the input with width corresponding to the spacing between discrete values and calculate where with if inputs are scaled to the interval .
Next we explain how we can stack convolutional DRAW with a two layer example. The first layer is the same as the one just introduced. The second layer has the same structure: recurrent encoder, recurrent decoder and a stochastic layer. The input to the second layer is the mean of the approximate posterior of the first layer. The output of the second layer biases the prior of the latent variable of the first layer and is also passed as input into the first layer decoder recurrent net. This is illustrated in Figure 3. We don’t use any reconstruction or error in the second layer.
Here we describe a given computational step in detail. Indices and denote the variables of layers and , respectively. In addition, let be the mean of . Then, the update at a given time step is given by
(12) | |||||
(13) | |||||
(14) | |||||
(15) | |||||
(16) | |||||
(17) | |||||
(18) | |||||
(19) | |||||
(20) | |||||
(21) | |||||
(22) |
Systems with more layers can be built analogously.
Here we show how one can turn variational auto-encoders including convolutional DRAW into compression algorithms. We have not built the actual compressor, however, as we explain, we have strong reasons to believe it would perform as well as calculated here. Two basic approaches exist. The first one is less convenient because it needs to add extra data to the bitstream when compressing an image but has essentially a guaranteed compression rate. The other one may require some experimentation but is expected to yield a similar compression rate and can be used on a given image without needing extra data.
The underlying compression mechanism for all cases is arithmetic coding (Witten et al., 1987). Arithmetic coding takes as input a sequence of discrete variables and a set of probabilities that predict the variable at time from previous variables. It then compresses this sequence to bits plus a constant of order one.
Compressing inputs using variational auto-encoders proceeds as follows: discretize each latent variable in each layer using the width of (eq. 3), treat the resulting variables as a sequence with predictions (eq. 4) and compress using arithmetic coding.
For this to work as explained, several things are needed. First, the discretization should be independent of the input. This can be achieved by training the network with the variance of being a learned constant that does not depend on the input. We found that this does not have much effect on the likelihood. Second, one should evaluate the log likelihood using this discretization. One has to decide on the exact manner in which should be computed for each discrete value. Significant tuning might be required here, for the obtained likelihoods to be as good as the ones obtained with sampling. However this is likely to be fruitful since there exists a second, less convenient way to compress that is guaranteed to achieve this rate.
This second approach uses bits-back coding (Hinton & Van Camp, 1993). We explain only the basic idea here. We discretize the latents down to a very high level of precision and use to transmit the information. Because the discretization precision is high, the probabilities for discrete values are easily assigned. That will preserve the information but it will cost many bits, namely where is a probability under that discretization. Now, instead of sampling from the approximate posterior when encoding an input, we encode bits of other information into the choice of , that we also want to transmit. When is recovered at the receiving end, both the information about the current input and the other information is recovered and thus the information needed to encode the current input is . The expectation of this quantity is the KL-divergence in (7), which therefore measures the amount of information stored in a given latent layer. The disadvantage of this approach is that we cannot encode a given input without also having some other information we want to transmit. However, this coding scheme works even if the variance of the approximate posterior is dependent on the input.
For natural images, all models except otherwise specified were single-layer, with , a kernel size of
, and stride 2 convolutions between input layers and hidden layers with
latent feature maps. We trained the models on Cifar-10 and ImageNet with and LSTM feature maps respectively. We use the version of ImageNet presented in (van den Oord et al., 2016) that will soon be released as a standard dataset. We train the network with the Adam algorithm (Kingma & Ba, 2014) with learning rate . Occasionally, we find that the cost suddenly increases dramatically. This is probably due to the Gaussian nature of the distribution, when a given variable is produced too far from the mean relative to sigma. We observed this happening approximately once per run. To be able to keep training we store older parameters, detect such jumps and revert to the old parameters when they occur. The network then just keeps training as if nothing had happened.The recently introduced Omniglot dataset (Lake et al., 2015) is comprised of character classes drawn from multiple alphabets with just samples per class. Referred to by many as the ‘inverse of MNIST’, it was designed to study conceptual representations and generative models in a low-data regime. Table 1 shows likelihoods of different models compared to ours. For our model, we only calculate the upper bound (variational bound) and therefore the true likelihood is actually better. Samples generated by the model are shown in Figure 5.
Table 2 shows likelihoods of different models on Cifar-10. We see that our method outperforms previous methods except for the just released Pixel RNN model of (van den Oord et al., 2016). As mentioned, the advantage of our model compared to such auto-regressive models is that it is a latent variable model that can be used for representation learning and lossy compression. At the same time, the two approaches are orthogonal and can be combined, for example by feeding the output of convolutional DRAW into the recurrent network of Pixel RNN.
We also report the likelihood for a (non-recurrent) variational auto-encoder and standard DRAW. For the variational auto-encoder we tested architectures with multiple layers, both deterministic and stochastic but with standard functional forms, and this was the best result that we obtained. Convolutional DRAW performs significantly better.
Additionally, we trained on the version of ImageNet prepared in (van den Oord et al., 2016) which was created with the aim of making a standardized dataset to test generative models. The results are in Table 3. Note that being a new dataset, no other methods have been reported on it.
In Figure 6 and Figure 7 we show generations from the model. We trained networks with varying input cost scales as explained in the next section. The generations are sharp and contain many details, unlike previous versions of variational auto-encoder that tend to generate blurry images.
Model | NLL Test |
---|---|
VAE (2 layer, 5 samples) | 106.31 |
IWAE (2 layer, 50 samples) | 103.38 |
RBM (500 hidden) | 100.46 |
DRAW | 96.5 |
Conv DRAW | 91.0 |
Model | NLL Test (Train) |
---|---|
Uniform Distribution | 8.00 |
Multivariate Gaussian | 4.70 |
NICE [1] | 4.48 |
Deep Diffusion [2] | 4.20 |
Deep GMMs [3] | 4.00 |
Pixel RNN [4] | 3.00 (2.93) |
Deep VAE | 4.54 |
DRAW | 4.13 |
Conv DRAW | 3.58 (3.57) |
Method | NLL Validation (Train) | |
---|---|---|
Conv DRAW | 4.40 (4.35) | |
Pixel RNN | 3.86 (3.83) | |
Method | NLL Validation (Train) | |
Conv DRAW | 4.10 (4.04) | |
Pixel RNN | 3.63 (3.57) |
Each pixel (and color channel) of the data consists of 256 values, and as such, likelihood and lossless compression are well defined. When compressing the image there is much to be gained in capturing precise correlations between nearby pixels. There are a lot more bits in these low level details than in the higher level structure that we are actually interested in when learning higher level representations. The network might focus on these details, ignoring higher level structure.
One way to make it focus less on the details is to scale down the cost of the input relative to the latents, that is, setting in (11). Generations for different cost scalings are shown in Figure 6, with the original objective being scale . We see that lower scales indeed have a ‘cleaner’ high level structure. Scale 1 contains a lot of information at the precise pixel values and the network tries to capture that, while not being good enough to properly align details and produce realistic patterns. This might be simply a matter of scaling, making layers larger, networks deeper, using more iterations, or using better functional forms.
Convolutional DRAW uses many iterations and might be considered expensive. However we found that networks with a larger number of time steps train faster per data example as shown in Figure 8 (left). To study how they train with respect to real time, we multiply the time scale of each input by the number of iterations as seen in Figure 8 (right). We see that despite having to do several iterations, up to about , convolutional DRAW does not take more wall clock time to train than the same architecture with smaller . For larger , the training slows down, but it does eventually reach better performance than at lower .
We look at how much information different levels and time steps contain. This information is simply the KL divergence in (7) and (22). For a two layer system with one convolutional and one fully connected layer, this is shown in Figure 9.
We see that the higher level contains information mainly at the beginning of the computation, whereas the lower layer starts with low information which then gradually increases. This is desirable from a conceptual point of view. It suggests that the network first finds out about an overall structure of the image and then explains the details contained within that structure. Understanding the overall structure rapidly is also convenient if the algorithm needs to respond to observations in a timely manner.
We can compress an image with loss of information by storing only a subset of the latent variables, typically the high levels of the hierarchy. We can do this in multilayer convolutional DRAW, storing only higher levels. However we can also store only a subset of time steps, specifically a given number of time steps at the beginning, and let the network generate the rest.
The units not stored should be generated from the prior distribution (Equation 4). However we can also generate a more likely image by lowering the variance of the prior Gaussian. We show generations with full variance in row 3 of each block of Figure 4 and with variance zero in row 4 of that figure. We see that using the original variance, the network generates sharp details. Because the generative model is not perfect, the resulting images are less realistic looking as we lower the number of stored time steps. For zero variance we see that the network starts with rough details making a smooth image and then refines it with more time steps. All these generations are produced with a single-layer convolutional DRAW, and thus, despite being single-layer, it achieves some level of ‘conceptual compression’ by first capturing the global structure of the image and then focusing on details.
There is another dimension we can vary for lossy compression – the input scale introduced in subsection 4.2. Even if we store all the latent variables (but not the input bits), the reconstructed images will get less detailed as we scale down the input cost.
To build a really good compressor, at each compression rate, we need to find which of the networks, input scales and number of time steps would produce visually good images. For several compression levels, we have looked at images produced by different methods and selected qualitatively which network gave the best looking images. We have not done this per image, just per compression level. We then display compressed images that we have not seen with this selection.
We compare our results to JPEG and JPEG2000 compression which we obtained using ImageMagick. We found however that these compressors are unable to produce reasonable results for small images () at high compression rates. Instead, we concatenated 100 images into one image, compressed that and extracted back the compressed small images. The number of bits per image reported is then the number of bits of this image divided by 100. This is actually unfair to our algorithm since any correlations between nearby images can be exploited. Nevertheless we show the comparison in Figure 4. Our algorithm shows better quality than JPEG and JPEG 2000 at all levels where a corruption is easily detectable. Note that even when our algorithm is trained on one specific image size, it can be used on arbitrarily sized images for those networks that contain only convolutional operators.
In this paper, we introduced convolutional DRAW, a state-of-the-art generative model which demonstrates the potential of sequential computation and recurrent neural networks in scaling up latent variable models. During inference, the algorithm sequentially arrives at a natural stratification of information, ranging from global aspects to low-level details. An interesting feature of the method is that, when we restrict ourselves to storing just the high level latent variables, we arrive at a ‘conceptual compression’ algorithm that rivals the quality of JPEG2000. As a generative model, it outperforms earlier latent variable models on both the Omniglot and ImageNet datasets.
We thank Aaron van den Oord, Diederik Kingma and Koray Kavukcuoglu for fruitful discussions.
Proceedings of the 31st International Conference on Machine Learning
, 2014.Proceedings of the sixth annual conference on Computational learning theory
, pp. 5–13. ACM, 1993.The neural autoregressive distribution estimator.
Journal of Machine Learning Research, 15:29–37, 2011.Stochastic backpropagation and approximate inference in deep generative models.
In Proceedings of the 31st International Conference on Machine Learning, pp. 1278–1286, 2014.International Conference on Artificial Intelligence and Statistics
, pp. 448–455, 2009.Factoring variations in natural images with deep gaussian mixture models.
In Advances in Neural Information Processing Systems, pp. 3518–3526, 2014.Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion.
The Journal of Machine Learning Research, 11:3371–3408, 2010.Figures 10, 11 show image generations for input scaling and of (11), while Figures 12, 13 show generations, also for and .