Associative Compression Networks

04/06/2018 · by Alex Graves, et al.

This paper introduces Associative Compression Networks (ACNs), a new framework for variational autoencoding with neural networks. The system differs from existing variational autoencoders in that the prior distribution used to model each code is conditioned on a similar code from the dataset. In compression terms this equates to sequentially transmitting the data using an ordering determined by proximity in latent space. As the prior need only account for local, rather than global variations in the latent space, the coding cost is greatly reduced, leading to rich, informative codes, even when autoregressive decoders are used. Experimental results on MNIST, CIFAR-10, ImageNet and CelebA show that ACNs can yield improved dataset compression relative to order-agnostic generative models, with an upper bound of 73.9 nats per image on binarized MNIST. They also demonstrate that ACNs learn high-level features such as object class, writing style, pose and facial expression, which can be used to cluster and classify the data, as well as to generate diverse and convincing samples.


1 Introduction

Unsupervised learning—the discovery of structure in data without extrinsic reward or supervision signals—is likely to be critical to the development of artificial intelligence, as it enables algorithms to exploit the vast amounts of data for which such signals are partially or completely lacking. In particular, it is hoped that unsupervised algorithms will be able to learn compact, transferable representations that will benefit the full spectrum of cognitive tasks, from low-level pattern recognition to high-level reasoning and planning.

Variational Autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014) are a class of generative model in which an encoder network extracts a stochastic code from the data, and a decoder network then uses this code to reconstruct the data. From a representation learning perspective, the hope is that the code will provide a high-level description or abstraction of the data, which will guide the decoder as it models the low-level details. However it has been widely observed (e.g. Chen et al. 2016b; van den Oord et al. 2017) that sufficiently powerful decoders—especially autoregressive models such as PixelCNN (Oord et al., 2016b)—will simply ignore the latent codes and learn an unconditional model of the data. Authors have proposed various modifications to correct this shortcoming, such as reweighting the coding cost (Higgins et al., 2017) or removing it entirely from the loss function (van den Oord et al., 2017), weakening the decoder by e.g. limiting its range of context (Chen et al., 2016b; Bowman et al., 2015), or adding auxiliary objectives that reward more informative codes, for example by maximising the mutual information between the prior distribution and generated samples (Zhao et al., 2017)—a tactic that has been fruitfully applied to Generative Adversarial Networks (Chen et al., 2016a). These approaches have had considerable success at discovering useful and interesting latent representations. However, they add parameters to the system that must be tuned by hand (e.g. weightings for various terms in the loss function, domain-specific limitations on the decoder, etc.) and in most cases yield worse log-likelihoods than purely autoregressive models.

To understand why VAEs do not typically improve on the modelling performance of autoregressive networks, it is helpful to analyse the system from a minimum description length perspective (Chen et al., 2016b). In that context a VAE embodies a two-part compression algorithm in which the code for each datum in the training set is first transmitted to a receiver equipped with a prior distribution over codes, followed by the residual bits required to correct the predictions of the decoder (to which the receiver also has access). The expected transmission cost of the code (including the ‘bits back’ received by the posterior; Hinton & Van Camp 1993) is equal to the Kullback-Leibler divergence between the prior and the posterior distribution yielded by the encoder, while the residual cost is the negative log-likelihood of the data under the predictive distribution of the decoder. The sum of the two, added up over the training set, is the compression cost optimised by the VAE loss function. (We ignore for now the description length of the prior and the decoder weights, noting that the former is likely to be negligible and the latter could be minimised with e.g. variational inference; Hinton & Van Camp, 1993; Graves, 2011.)

The underlying assumption of VAEs is that transmitting a piece of high-level information, for example that a particular MNIST image represents the digit 3, will be outweighed by the increased compression of the data by the decoder. But this assumption breaks down if the decoder is able to learn a distribution that closely matches the density of the data. In this case, if one-tenth of the training images are 3’s, finding out that a particular image is a 3 will only save the decoder around $\log_2 10 \approx 3.3$ bits. Furthermore, since an accurate prior will give a ten percent probability to 3’s, it will cost exactly the same amount for the encoder to transmit that information via the prior. In practice, since the code is stochastic and the decoder is typically deterministic, it is often more efficient to ignore the code entirely.
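To make the arithmetic explicit (a worked restatement of the example, assuming a uniform distribution over the ten digit classes):

$$\text{saving from knowing the class} = -\log_2 \tfrac{1}{10} = \log_2 10 \approx 3.32\ \text{bits} \approx 2.30\ \text{nats}, \qquad \text{cost of sending the class via the prior} = \log_2 10 \approx 3.32\ \text{bits},$$

so the two terms cancel and the latent code carries no net benefit.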

If we follow the above reasoning to its logical conclusion we come to a paradox that appears to undermine not only VAEs, but any effort to use high-level concepts to compress low-level data: the benefit of associating a particular concept with a particular piece of data will always be outweighed by the coding cost. The resolution to the paradox is that high-level concepts become efficient when a single concept can be collectively associated with many low-level data, rather than pointed to by each datum individually. This suggests a paradigm where latent codes are used to organise the training set as a whole, rather than annotate individual training examples. To return to the MNIST example, if we first sort the images according to digit class, then transmit all the zeros followed by all the ones and so on, the cost of transmitting the places where the digit class changes will be negligible compared to the cumulative savings over all the images of each class. Conversely, consider an encyclopaedia that has been carefully structured into topics, articles, paragraphs and so on, providing high-level context that is known to lead to improved compression. Now imagine that the encyclopaedia is transmitted in a randomly ordered sequence of 100 character chunks, attached to each of which is a code specifying the exact place in the structure from which it was drawn (topic X, article Y, paragraph Z etc.). It should be clear that this would be a very inefficient compression algorithm; so inefficient, in fact, that it would not be worth transmitting the structure at all.

Encyclopaedias are already ordered, and the most efficient way to compress them may well be to simply preserve the ordering and use an autoregressive model to predict one token at a time. But in general we do not know how the data should be ordered for efficient compression. It would be possible to find such an ordering by minimising a similarity metric defined directly on the data, such as Euclidean distance in pixel space or edit distance for text; however such metrics tend to be limited to superficial similarities (in the case of pixel distance we provide evidence of this in our experiments). We therefore turn to the similarity, or association (Bahdanau et al., 2014; Graves et al., 2014), among latent representations to guide the ordering. Transmitting associated codes consecutively will only be efficient if we have a prior that captures the local statistics of the area they inhabit, and not the global statistics of the entire dataset: if a series of pictures of sheep has just been sent, the prior should expect another sheep to come next. We achieve this by using a neural network to condition the prior on a code chosen from the nearest neighbours in latent space to the code being transmitted. Previous work has considered fitting mixture models as VAE priors (Nalisnick et al.; Tomczak & Welling, 2017), and one could think of our procedure as fitting a conditional prior to a uniform mixture over the $K$ posterior codes closest to whichever code we are about to transmit. As $K$ approaches $N$, the size of the training set, we recover the familiar setting of fitting an unconditional prior. Among supervised methods, perhaps the closest point of reference is Matching Networks (Vinyals et al., 2016), in which a nearest neighbours search over embeddings is leveraged for one-shot learning.

Conditioning on neighbouring codes does not obviously lead to a compression procedure. However we can define the following sequential compression algorithm if we insist that the neighbour for each code in the training set is unique:

  • Alice and Bob share the weights of the encoder, decoder and prior networks (in normal VAEs the encoder does not need to be shared).

  • Alice chooses an ordering for the training set, then transmits one element at a time by sending first a sample from the encoding distribution, then the residual bits required for lossless decoding.

  • After decoding each data sample, Bob re-encodes the data using his copy of the encoder network, then passes the statistics of the encoding distribution into the prior network as input. The resulting prior distribution is used to transmit the next code sample drawn by Alice, at a cost equal to the KL between their distributions (the prior for the first example may be assumed to be shared at negligible cost for a large dataset).

The optimal ordering Alice should choose is the one that minimises the sum of the KLs at each transmission step. Finding this ordering is a hard optimisation problem in general, but our empirical results suggest that the KL cost of the optimal ordering is well approximated by nearest neighbour sampling, given a suitable value of $K$.

It should be clear that ACNs are not IID in the usual sense: they optimise the cost of transmitting the entire dataset, in an order of their choosing, as opposed to the expected cost of transmitting a single data-point. One consequence is that the ACN loss function is not directly comparable to that of VAEs or other generative models. Indeed, since the expected cost of transmitting a uniformly random ordering of a size-$N$ dataset is $\log N!$ bits, it could be argued that an ACN has roughly $\log N$ ‘free bits’ per data-point to spend on codes relative to an IID model. However, we contend that it is exactly the information contained in the ordering, or more generally in the relational structure of dataset elements, that defines the high-level regularities we wish our representation to capture. For example, if half the voices in a speech database are male and half are female, compression should be improved by grouping according to gender, motivating the inclusion of gender in the latent codes; likewise representing speaker characteristics should make it possible to co-compress similar voices, and if there were enough examples of the same or similar phrases, it should become advantageous to encode linguistic information as well.
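For concreteness, the per-image cost of an arbitrary ordering follows from Stirling's approximation; with the 10,000-image MNIST test set this reproduces the figure quoted in Section 3.3:

$$\log N! \approx N\log N - N \;\;\Rightarrow\;\; \frac{\log N!}{N} \approx \log N - 1,$$
$$N = 10{,}000: \quad \ln 10{,}000 - 1 \approx 9.21 - 1 = 8.21\ \text{nats per image}.$$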

As the relationship between a particular datum and the rest of the dataset is not accessible to the decoder in an ACN, there is no need to weaken the decoder; indeed we recommend using the most powerful decoder possible to ensure that the latent codes are not cluttered by low-level information. Similarly, there is no need to modify the loss function or add extra terms to encourage the use of latent codes. Rather, the use of latent information is a natural consequence of the separation between high-level relations among data, and low-level dependencies within data. As our experiments demonstrate, this leads to compressed representations that capture many salient features likely to be useful for downstream tasks.

2 Background: Variational Auto-Encoders

Variational Autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014) are a family of generative models consisting of two neural networks—an encoder and a decoder—trained in tandem. The encoder receives observable data $x$ as input and emits as output a data-conditional distribution $q(z|x)$ over latent vectors $z$. A sample $z \sim q(z|x)$ is drawn from this distribution and used by the decoder to determine a code-conditional reconstruction distribution $r(x|z)$ over the original data $x$ (we write $r(x|z)$ for this distribution, rather than the usual $p(x|z)$, to avoid confusion with the ACN prior). The VAE loss function is defined as the expected negative log-likelihood of $x$ under $r(x|z)$ (often referred to as the reconstruction cost) plus the KL divergence from some prior distribution $p(z)$ to $q(z|x)$ (referred to as the KL or coding cost):

$$L^{VAE} = \mathbb{E}_{z \sim q(z|x)}\left[-\log r(x|z)\right] + D_{KL}\!\left(q(z|x)\,\|\,p(z)\right)$$

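As a concrete illustration of the two cost terms, here is a minimal PyTorch sketch assuming a unit-variance Gaussian encoder, a standard normal prior and a Bernoulli decoder over binary pixels; the module names and interfaces are placeholders rather than the architecture used in the paper (an autoregressive decoder would additionally condition on $x$).

```python
import torch
import torch.nn.functional as F

def vae_loss(x, encoder, decoder):
    """Reconstruction cost plus KL coding cost for a unit-variance Gaussian
    encoder q(z|x) = N(mu(x), I), a standard normal prior p(z) = N(0, I),
    and a Bernoulli decoder returning pixel logits (all placeholder modules)."""
    mu = encoder(x)                              # mean of q(z|x); unit variance assumed
    z = mu + torch.randn_like(mu)                # reparameterised sample z ~ q(z|x)
    logits = decoder(z)                          # Bernoulli logits of r(x|z)
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction='sum') / x.size(0)
    kl = 0.5 * (mu ** 2).sum(dim=1).mean()       # KL( N(mu, I) || N(0, I) ) per datum
    return recon + kl                            # nats per datum
```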
Although VAEs with discrete latent variables have been explored (Mnih & Rezende, 2016), most are continuous to allow for stochastic backpropagation using the reparameterisation trick (Kingma & Welling, 2013). The prior $p(z)$ may be a simple distribution such as a unit variance, zero mean Gaussian, or something more complex such as an autoregressive distribution whose parameters are adapted during training (Chen et al., 2016b; Gulrajani et al., 2016). In all cases however, the prior is constant for all $x$.

3 Associative Compression Networks

Associative compression networks (ACNs) are similar to VAEs, except the prior for each $x$ is now conditioned on the distribution used to encode some neighbouring datum $\hat{x}$. We used a unit variance, diagonal Gaussian for all encoding distributions, meaning that $q(z|x)$ is entirely described by its mean vector $c$, which we refer to as the code for $x$. Given $c$, we randomly pick $\hat{c}$, the code for $\hat{x}$, from $K(c)$, the set of $K$ nearest Euclidean neighbours to $c$ among all the codes for the training data. We then pass $\hat{c}$ to the prior network to obtain the conditional prior distribution $p(z|\hat{c})$ and hence determine the KL cost. Adding this KL cost to the usual VAE reconstruction cost yields the ACN loss function:

$$L^{ACN} = \mathbb{E}_{z \sim q(z|x)}\left[-\log r(x|z)\right] + D_{KL}\!\left(q(z|x)\,\|\,p(z|\hat{c})\right)$$

As with normal VAEs, the prior distribution may be chosen from a more or less flexible family. However, as each local prior is already conditioned on a nearby code, the marginal prior across latent space will be highly flexible even if the local priors are simple. For our experiments we chose an independent mixture prior for each dimension of latent space, to encourage multimodal but independent (and hence, hopefully, disentangled) representations.
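Because the KL between a diagonal Gaussian posterior and a conditional mixture prior has no closed form, one practical option is a single-sample Monte Carlo estimate. The sketch below assumes a unit-variance Gaussian posterior and a placeholder callable returning $\log p(z|\hat{c})$; it is an illustration under those assumptions, not necessarily the authors' implementation.

```python
import torch

def kl_monte_carlo(mu, prior_log_prob_fn):
    """Single-sample estimate of KL( q(z|x) || p(z|c_hat) ) for a unit-variance
    Gaussian posterior with mean `mu`; `prior_log_prob_fn(z)` is assumed to
    return log p(z|c_hat) under the conditional mixture prior."""
    z = mu + torch.randn_like(mu)                                # z ~ q(z|x)
    log_q = torch.distributions.Normal(mu, 1.0).log_prob(z).sum(dim=1)
    return (log_q - prior_log_prob_fn(z)).mean()                 # E_q[log q - log p]
```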

As discussed in the introduction, conditioning on neighbouring codes is equivalent to a sequential compression algorithm, as long as every neighbour is unique to a particular code. This can be ensured by a simple modification to the above procedure: restrict $K(c)$ at each step to contain only codes that have not yet been used as neighbours during the current pass through the dataset. With $K=1$ this is equivalent to a greedy nearest neighbour heuristic for the Euclidean travelling salesman problem of finding the shortest tour through the codes. The route found by this heuristic may be substantially longer than the optimal tour, which in any case may not correspond to the ordering that minimises the KL cost, as this depends on the KLs between the priors and the codes, and not directly on the distance between the codes. Nonetheless it provides an upper bound on the optimal KL cost, and hence on the compression of the dataset (note that the reconstruction cost does not depend on the ordering, as the decoder is conditioned only on the current code). We provide results in Section 4 to calibrate the accuracy of the KNN approximation against the KL cost yielded by an actual tour.
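The greedy heuristic itself is straightforward to state in code. The following numpy sketch returns a transmission order over an array of codes; it illustrates only the uniqueness constraint on neighbours, since computing the actual KL cost would additionally require the prior network.

```python
import numpy as np

def greedy_tour(codes, start=0):
    """Greedy nearest-neighbour tour through an (N, D) array of codes: each
    code's neighbour is the closest code not yet used, so every neighbour is
    unique and the returned sequence defines a valid transmission order."""
    n = len(codes)
    unused = set(range(n))
    unused.remove(start)
    order = [start]
    current = start
    while unused:
        idx = np.array(sorted(unused))
        d = np.linalg.norm(codes[idx] - codes[current], axis=1)
        current = int(idx[np.argmin(d)])
        unused.remove(current)
        order.append(current)
    return order  # order[i+1] is transmitted with the prior conditioned on order[i]
```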

To optimise $L^{ACN}$ we create an associative dataset $C$ that holds a separate code vector $c$ for each $x$ in the training set and run the following algorithm:

  Initialise the codes in C
  repeat
     Sample x uniformly from the training set
     Run encoder network, get q(z|x)
     Update C with new code: c ← mean of q(z|x)
     K(c) ← K nearest Euc. neighbours to c in C
     Pick ĉ randomly from K(c)
     Run prior network, get p(z|ĉ)
     Compute KL cost: D_KL(q(z|x) ‖ p(z|ĉ))
     Run decoder network, compute r(x|z) with z ∼ q(z|x)
     Compute reconstruction cost: −log r(x|z)
     Compute gradients, update network weights
  until convergence
Algorithm 1 Associative Compression Network Training
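A hedged PyTorch sketch of a single training step corresponding to Algorithm 1 is given below. The encoder, decoder and prior modules (with log_prob methods), the in-memory code bank, and the single-sample KL estimate are our own scaffolding under the assumptions stated in the text, not the authors' code.

```python
import torch

def acn_training_step(x, idx, code_bank, encoder, decoder, prior_net, optimiser, K):
    """One schematic step of Algorithm 1 for a batch x with dataset indices idx.
    code_bank is an (N, D) tensor holding the most recent code for every training
    example; encoder, decoder and prior_net are placeholder modules."""
    mu = encoder(x)                                    # q(z|x) means, i.e. the codes c
    code_bank[idx] = mu.detach()                       # update the associative dataset C
    # K nearest Euclidean neighbours of each code among the (possibly stale) bank
    dists = torch.cdist(mu.detach(), code_bank)        # (B, N) pairwise distances
    knn = dists.topk(K + 1, largest=False).indices[:, 1:]  # drop each code's own entry (distance zero)
    choice = torch.randint(K, (len(x),))
    c_hat = code_bank[knn[torch.arange(len(x)), choice]]   # randomly picked neighbour codes
    z = mu + torch.randn_like(mu)                      # z ~ q(z|x), reparameterised
    log_q = torch.distributions.Normal(mu, 1.0).log_prob(z).sum(dim=1)
    kl = (log_q - prior_net.log_prob(z, c_hat)).mean()     # MC estimate of KL(q || p(z|c_hat))
    recon = -decoder.log_prob(x, z).mean()                 # -log r(x|z)
    loss = recon + kl
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```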

In general $x$, $c$ and $\hat{c}$ will be batches computed in parallel. As the codes in $C$ are only updated when the corresponding data is sampled, the codes used for the $K$ nearest neighbour (KNN) search will in general be somewhat stale. To check that this wasn't a significant problem, we ran tests in which a parallel worker continually updated the codes using the current weights of the encoder network. For our experiments, increasing the code-update frequency made no discernible difference to learning; however code staleness could become more damaging for larger datasets. Likewise the computational cost of performing the KNN search was low compared to that of activating the networks for our experiments, but could become prohibitive for large datasets.

3.1 Unconditional Prior

Unlike normal VAEs, ACNs by default lack an unconditional prior, which makes it difficult to compare them to existing generative models. However, we can easily fit an unconditional prior to samples drawn from the codes in $C$ after training is complete.
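As noted in Section 4, the unconditional prior is fit as a Gaussian mixture with Expectation-Maximization, with the number of components chosen on a validation set. A sketch using scikit-learn's GaussianMixture as a stand-in for whatever EM implementation was used (the candidate component counts are arbitrary placeholders):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_unconditional_prior(train_codes, valid_codes, candidate_components=(1, 2, 5, 10, 20)):
    """Fit a Gaussian mixture to the training codes and select the number of
    components by held-out log-likelihood on validation codes."""
    best, best_ll = None, -np.inf
    for m in candidate_components:
        gmm = GaussianMixture(n_components=m).fit(train_codes)
        ll = gmm.score(valid_codes)        # mean log-likelihood per code, in nats
        if ll > best_ll:
            best, best_ll = gmm, ll
    return best
```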

3.2 Sampling

There are several different ways to sample from ACNs, of which we consider three. Firstly, by drawing a latent vector $z$ from the unconditional prior defined above and sampling from the decoder distribution $r(x|z)$, we can generate unconditional samples that reflect the ACN's global data distribution. Secondly, by choosing a batch of real images, encoding them and decoding conditioned on the resulting code, we can generate stochastic reconstructions of the images, revealing which features of the original are represented in the latents and transmitted to the decoder. Note that in order to reduce sampling noise we use the mean codes as latents for the reconstructions, rather than samples from $q(z|x)$; we assume at this point that the decoder is autoregressive. Lastly, the use of conditional priors opens up an alternative sampling protocol, where sequences of linked samples are generated from real data by iteratively encoding the data, sampling from the prior conditioned on the code, generating new data, then encoding again. We refer to these sequences as ‘daydreams’, as they remind us of the chains of associative imagining followed by the human mind at rest. The daydream sampling process is illustrated in Figure 1.
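A schematic version of the daydream chain, with placeholder sample methods on the prior and decoder modules:

```python
import torch

def daydream(x0, encoder, decoder, prior_net, steps=8):
    """Generate a chain of 'daydream' samples starting from real data x0:
    encode -> sample z from the prior conditioned on the code -> generate -> repeat."""
    samples = [x0]
    x = x0
    for _ in range(steps):
        c = encoder(x)                    # code for the current image (mean of q(z|x))
        z = prior_net.sample(c)           # z ~ p(z|c), the conditional prior
        x = decoder.sample(z)             # new image drawn from r(x|z)
        samples.append(x)
    return samples
```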

Figure 1: Flow diagram for daydream sampling.

3.3 Test Set Evaluation

Since the true KL cost depends on the order in which the data is transmitted, there are some subtleties in comparing the test set performance of ACN with other models. For one thing, as discussed in the introduction, most other models are order-agnostic, and hence arguably due a refund for the cost of specifying an arbitrary ordering (in the case of MNIST this would amount to 8.21 nats per test set image). We can resolve this by calculating both an upper bound on the ordered compression yielded by ACN, and the unordered compression which can be computed using the KL between the unconditional prior discussed in Section 3.1 and the test set encodings (recall that the reconstruction cost is unaffected by the ordering). As well as providing a fair comparison with previous results, the unconditional KL gives an idea of the total amount of information encoded for each data point, relative to the dataset as a whole. Another issue is that if an ordering is used, it is debatable whether the training and test set should be compressed together, with a single tour through all the data, or whether the test set should be treated as a separate tour, with the prior network conditioned on test set codes only. We chose the latter for simplicity, but note that doing so may unrealistically inflate the KL costs; for example if the test set is dramatically smaller than the training set, and the average distance between codes is correspondingly larger, the density of the prior distributions may be strongly miscalibrated.

4 Experimental Results

We present experimental results on four image datasets: binarized MNIST (Salakhutdinov & Murray, 2008), CIFAR-10 (Krizhevsky, 2009), ImageNet (Deng et al., 2009) and CelebA (Liu et al., 2015). Buoyed by our belief that the latent codes will not be ignored no matter how well the decoder can model the data, we used a Gated PixelCNN decoder (Oord et al., 2016b) to parameterise $r(x|z)$ for all experiments. The ACN encoder was a convolutional network fashioned after a VGG-style classifier (Simonyan & Zisserman, 2014), and the encoding distribution was a unit variance Gaussian with mean specified by the output of the encoder network. The prior network was an MLP with three hidden layers each containing 512 units, and skip connections from the input to all hidden layers and all hiddens to the output layer. The ACN prior distribution was parameterised using the outputs of the prior network as follows:

$$p(z|\hat{c}) = \prod_{d=1}^{D} \sum_{k=1}^{M} \pi_{d,k}\, \mathcal{N}\!\left(z_d \mid \mu_{d,k}, \sigma_{d,k}\right)$$

where $D$ is the dimensionality of $z$, $z_d$ is the $d^{th}$ element of $z$, there are $M$ mixture components for each dimension, and all parameters $\pi_{d,k}$, $\mu_{d,k}$, $\sigma_{d,k}$ are emitted by the prior network, with the softmax function used to normalise $\pi_{d,\cdot}$ and the softplus function used to ensure $\sigma_{d,k} > 0$. The value of $M$ differed between MNIST and the other datasets; the results did not seem very sensitive to this choice. Polyak averaging (Polyak & Juditsky, 1992) was applied for all experiments with a decay parameter of 0.9999; all samples and test set costs were calculated using averaged weights. For the unconditional prior we always fit a Gaussian mixture model using Expectation-Maximization, with the number of components optimised on the validation set.
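One way to realise this parameterisation is an output head that splits the prior network's activations into per-dimension mixture weights, means and scales. The following PyTorch sketch is illustrative only; the layer sizes, class name and number of components are placeholders rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

class MixturePriorHead(torch.nn.Module):
    """Output head mapping prior-network features to an independent mixture of
    M Gaussians per latent dimension, following the parameterisation above."""
    def __init__(self, in_features, z_dim, num_components):
        super().__init__()
        self.z_dim, self.M = z_dim, num_components
        # 3 raw outputs per (dimension, component): weight logit, mean, scale
        self.out = torch.nn.Linear(in_features, 3 * z_dim * num_components)

    def log_prob(self, z, features):
        """log p(z | c_hat), where `features` are the prior MLP activations for c_hat."""
        raw = self.out(features).view(-1, self.z_dim, self.M, 3)
        log_pi = F.log_softmax(raw[..., 0], dim=-1)          # softmax-normalised weights
        mu = raw[..., 1]                                      # component means
        sigma = F.softplus(raw[..., 2]) + 1e-4                # softplus keeps scales positive
        comp = torch.distributions.Normal(mu, sigma)
        log_comp = comp.log_prob(z.unsqueeze(-1))             # (B, D, M) component log-densities
        per_dim = torch.logsumexp(log_pi + log_comp, dim=-1)  # mixture density per dimension
        return per_dim.sum(dim=1)                             # independent dims: sum of logs
```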

For all experiments, the optimiser was RMSProp (Tieleman & Hinton, 2012). The encoding distribution was always a unit variance Gaussian with mean specified by the output of the encoder network. The dimensionality of $z$ differed between binarized MNIST and the other datasets, and unless stated otherwise a fixed value of $K$ was used for the KNN lookups during ACN training.

4.1 Binarized MNIST

For the binarized MNIST experiments the ACN encoder had five convolutional layers, and the decoder consisted of 10 gated residual blocks, each using 64 filters of size 5x5. The decoder output was a single Bernoulli distribution for each pixel, and a batch size of 64 was used for training.


Model Nats / image
Gated Pixel CNN (ours) 81.6
Pixel CNN (Oord et al., 2016a) 81.3
Discrete VAE (Rolfe, 2016) 81.0
DRAW (Gregor et al., 2015)
G. PixelVAE  (Gulrajani et al., 2016) 79.5
Pixel RNN (Oord et al., 2016a) 79.2
VLAE (Chen et al., 2016b) 79.0
GLN (Veness et al., 2017) 79.0
MatNet (Bachman, 2016)
ACN (unordered) 80.9
Table 1: Binarized MNIST test set compression results

Cost Nats / image
KL () 2.6
KL () 3.5
KL (Greedy Tour) 3.6
KL () 4.1
KL (Unconditional) 10.6
Reconstruction 70.3
ACN (ordered) 73.9
Table 2: Binarized MNIST test set ACN costs

The results in Table 1 show that unordered ACN gives similar compression to the decoder alone (Gated PixelCNN), supporting the thesis that the conventional VAE loss is not significantly reduced by latent codes when using an autoregressive decoder. Table 2 shows that the upper bound on the ordered ACN cost (sum of greedy tour KL and reconstruction) is 7 nats per image lower than the unordered ACN cost. Given that the cost of specifying an ordering for the test set is 8.21 nats per image, this suggests that the model is using most of the ‘free bits’ to encode latent information. The KL cost yielded by the ‘greedy tour’ heuristic described in Section 3 is close to that given by KNN sampling on the test set codes (note that we are varying $K$ when computing the test set KL only; the network was trained with a single fixed $K$). Since the greedy tour cost is a loose upper bound on the optimal KL for an ordered tour, and since the $K{=}1$ result is a lower bound (no tour can do better than always hopping to the nearest neighbour), we speculate that the true KL lies somewhere between the two.

Figure 2: MNIST compression costs. Unordered compression cost is height of blue and yellow bar, ordered compression cost (ACN models only) is height of red and yellow bar.

As discussed in the introduction, if the value of $K$ for the KNN lookups approaches the size of the training set, ACN should reduce to a VAE with a learned prior. To test this, we trained ACNs with a range of $K$ values and measured the change in compression costs. We also implemented a standard feedforward VAE and a VAE with the same encoder and decoder as ACN, but with an unconditional Gaussian mixture prior whose parameters were trained in place of the prior network. We refer to the latter as Gated PixelVAE due to its similarity with previous work (Gulrajani et al., 2016), but note that they used a fixed prior and a somewhat different encoder architecture. Figure 2 shows that the unordered compression cost per test set image is much the same for ACN regardless of $K$, and very similar to that of both Gated PixelVAE and Gated PixelCNN (again underlining the marginal impact of latent codes on VAE loss). However the distribution of the costs changes, with higher reconstruction cost and lower KL cost for higher $K$. As predicted, Gated PixelVAE performs similarly to ACN with very high $K$. The VAE performs considerably worse due to the non-autoregressive decoder; however its higher KL suggests that more information is encoded in the latents. Our next experiment attempts to quantify how useful this information is.


Input Accuracy (%)
PCA (16 components) 82.8
pixels 89.4
standard VAE codes 95.4
Gated PixelVAE codes 97.9
ACN codes 98.5
Table 3: Binarized MNIST linear classification results
Figure 3: Visualisation of the first two principal components of ACN latent space for MNIST. Images are coloured according to class label (Smilkov et al., 2016).
Figure 4: MNIST reconstructions. The codes for the test set images in the leftmost column were used to generate the samples in the remaining columns.
Figure 5: MNIST samples from ACN with unconditional prior (top) and Gated PixelCNN (bottom).
Figure 6: MNIST daydream samples. The leftmost column is from the test set. The remaining columns were generated by daydream sampling (Section 3.2).

Table 3 shows the results of training a linear classifier to predict the training set labels with various inputs. This gives us a measure of the amount of easily accessible high-level information the inputs contain. ACN codes are the most effective, but interestingly PixelVAE codes are a close second, in spite of having a KL cost of just over 1 nat per image. VAE codes, with a KL of 26 nats per image, are considerably worse; we hypothesize that the use of a weaker decoder leads the VAE to include more low-level information in the codes, making them harder to classify. In any case we can conclude that coding cost is not a reliable indicator of code utility.
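For reference, the linear probe protocol can be sketched as follows; the exact linear classifier is not specified above, so multinomial logistic regression is used here purely as a stand-in.

```python
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_inputs, train_labels, test_inputs, test_labels):
    """Train a linear classifier on fixed representations (codes, pixels, or PCA
    projections) and report accuracy, as in the Table 3 comparison."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_inputs, train_labels)
    return clf.score(test_inputs, test_labels)
```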

The salience of the ACN codes is supported by the visualisation of the principal components of the codes shown in Figure 3: note the clustering of image classes (coloured differently to aid interpretation) and the gradation in writing style across the clusters (e.g. strokes becoming thicker towards the top of the clusters, thinner towards the bottom). The reconstructions in Figure 4 further stress the fidelity of digit class, stroke thickness, writing style and orientation within the codes, while the comparison between unconditional ACN samples and baseline samples from the Gated PixelCNN reveals a subtle improvement in sample quality. Figure 6 illustrates the dynamic modulation of daydream sampling as it moves through latent space: note the continual shift in rotation and stroke width, and the gradual morphing of one digit into another.

4.2 Cifar-10

Figure 7: CIFAR-10 reconstructions.
Figure 8: CIFAR-10 samples from ACN with unconditional prior (top) and Gated PixelCNN (bottom).

For the CIFAR-10 experiments the encoder was a convolutional network fashioned after a VGG-style classifier (Simonyan & Zisserman, 2014), with 11 convolutional layers and 3x3 filters. The decoder had 15 gated residual blocks, each using 128 filters of size 5x5; its output was a categorical distribution over subpixel intensities, with 256 bins for each colour channel. Training batch size was 64. The reconstructions in Figure 7 demonstrate some high level coherence, with object features such as parts of cars and horses occasionally visible, while Figure 8 shows an improvement in sample coherence relative to the baseline. We found that ACN codes for CIFAR-10 images were linearly classified with 55.3% accuracy versus 38.4% accuracy for pixels. See Appendix A for more samples and results.

4.3 ImageNet

Figure 9: ImageNet daydream samples.
Figure 10: ImageNet reconstructions.
Figure 11: ImageNet samples from ACN with unconditional prior (top) and Gated PixelCNN (bottom).

For these experiments the setup was the same as for CIFAR-10, except the decoder had 20 gated residual layers of 368 5x5 filters, and the batch size was 128. We downsampled the images to 32x32 resolution to speed up training. We found that ACN ImageNet codes can be linearly classified with 18.5% top 1 accuracy and 40.5% top 5 accuracy, compared to 3.0% and 9.0% respectively for pixels. Better unsupervised classification scores have been recorded for ImageNet (Doersch et al., 2015; Donahue et al., 2016; Wang & Gupta, 2015), but these were obtained using higher resolution images. The reconstructions in Figure 10 suggest that ACN encodes information about image composition, colour, background and setting (natural, indoor, urban etc.), while Figure 9 shows continuous transitions in background, foreground and colour during daydream sampling. In this case the distinction between unconditional ACN samples and Gated PixelCNN samples was less clear (Figure 11). See Appendix B for more samples and results.

4.4 CelebA

Figure 12: CelebA reconstructions.
Figure 13: CelebA samples from ACN with unconditional prior (top) and PixelCNN (bottom).

We downsampled the CelebA images to 32x32 resolution and used the same setup as for CIFAR-10. Figure 12 demonstrates that high-level aspects of the original images, such as gender, pose, lighting, face shape and facial expression, are well represented by the codes, but that the specific details are left to the decoder. Figure 13 demonstrates a slight advantage in sample quality over the baseline.

5 Conclusion

We have introduced Associative Compression Networks (ACNs), a new form of Variational Autoencoder in which associated codes are used to condition the latent prior. Our experiments show that the latent representations learned by ACNs contain meaningful, high-level information that is not diminished by the use of autoregressive decoders. As well as providing a clear conditioning signal for the samples, these representations can be used to cluster and linearly classify the data, suggesting that they will be useful for other cognitive tasks. We have also seen that the joint latent and data space learned by the model can be naturally traversed by daydream sampling. We hope this work will open the door to more holistic, dataset-wide approaches to generative modelling and representation learning.

Acknowledgements

Many of our colleagues at DeepMind gave us valuable feedback on this work. We would particularly like to thank Andriy Mnih, Danilo Rezende, Igor Babuschkin, John Jumper, Oriol Vinyals, Guillaume Desjardins, Lasse Espeholt, Chris Jones, Alex Pritzel, Irina Higgins, Loic Matthey, Siddhant Jayakumar and Koray Kavukcuoglu.

References

Appendix A Cifar-10


Model Bits / dim
DRAW (Gregor et al., 2015) 4.13
Conv DRAW (Gregor et al., 2016) 4.00
Pixel CNN (Oord et al., 2016a) 3.14
Gated Pixel CNN (Oord et al., 2016b) 3.03
Pixel RNN (Oord et al., 2016a) 3.00
PixelCNN++ (Salimans et al., 2017) 2.92
PixelSNAIL (Chen et al., 2017) 2.85
ACN (unordered)
Table 4: CIFAR-10 test set compression results

Cost Nats / image
KL () 5.4
KL () 6.2
KL (Tour) 6.3
KL () 6.7
KL (Unconditional) 14.4
Reconstruction 6536.7
ACN (ordered)
Table 5: CIFAR-10 test set ACN costs
Figure 14: CIFAR-10 nearest neighbours. The leftmost column is from the test set. The remaining columns show the nearest Euclidean neighbours in ACN code space (top) and pixel space (bottom) in order of increasing distance. While the codes often cluster according to high-level features such as object class and figure composition, clustering in pixel space tends to match on background colour, and disproportionately favours blurry images.

Appendix B ImageNet


Model Bits / dim
conv. DRAW (Gregor et al., 2016) 4.40
Pixel RNN (Oord et al., 2016a) 3.86
Gated Pixel CNN (Oord et al., 2016b) 3.83
PixelSNAIL (Chen et al., 2017) 3.80
ACN (unordered)
Table 6: ImageNet 32x32 test set compression results

Cost Nats / image
KL () 2.9
KL () 8.7
KL () 10.3
KL (Greedy Tour) 10.6
KL (Unconditional) 18.2
Reconstruction 8112.8
ACN (ordered) 8123.4
Table 7: ImageNet test set ACN costs
Figure 15: ImageNet reconstructions.
Figure 16: ImageNet daydream samples.