1 Introduction
Unsupervised learning—the discovery of structure in data without extrinsic reward or supervision signals—is likely to be critical to the development of artificial intelligence, as it enables algorithms to exploit the vast amounts of data for which such signals are partially or completely lacking. In particular, it is hoped that unsupervised algorithms will be able to learn compact, transferable representations that will benefit the full spectrum of cognitive tasks, from low-level pattern recognition to high-level reasoning and planning.
Variational Autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014) are a class of generative model in which an encoder network extracts a stochastic code from the data, and a decoder network then uses this code to reconstruct the data. From a representation learning perspective, the hope is that the code will provide a high-level description or abstraction of the data, which will guide the decoder as it models the low-level details. However it has been widely observed (e.g. Chen et al. 2016b; van den Oord et al. 2017) that sufficiently powerful decoders—especially autoregressive models such as PixelCNN (Oord et al., 2016b)—will simply ignore the latent codes and learn an unconditional model of the data. Authors have proposed various modifications to correct this shortcoming, such as reweighting the coding cost (Higgins et al., 2017) or removing it entirely from the loss function (van den Oord et al., 2017), weakening the decoder by e.g. limiting its range of context (Chen et al., 2016b; Bowman et al., 2015), or adding auxiliary objectives that reward more informative codes, for example by maximising the mutual information between the prior distribution and generated samples (Zhao et al., 2017)—a tactic that has been fruitfully applied to Generative Adversarial Networks (Chen et al., 2016a). These approaches have had considerable success at discovering useful and interesting latent representations. However they add parameters to the system that must be tuned by hand (e.g. weightings for various terms in the loss function, domain-specific limitations on the decoder, etc.) and in most cases yield worse log-likelihoods than purely autoregressive models.

To understand why VAEs do not typically improve on the modelling performance of autoregressive networks, it is helpful to analyse the system from a minimum description length perspective (Chen et al., 2016b). In that context a VAE embodies a two-part compression algorithm in which the code for each datum in the training set is first transmitted to a receiver equipped with a prior distribution over codes, followed by the residual bits required to correct the predictions of the decoder (to which the receiver also has access). The expected transmission cost of the code (including the 'bits back' received by the posterior; Hinton & Van Camp 1993) is equal to the Kullback-Leibler divergence between the prior and the posterior distribution yielded by the encoder, while the residual cost is the negative log-likelihood of the data under the predictive distribution of the decoder. The sum of the two, added up over the training set, is the compression cost optimised by the VAE loss function.¹

¹ We ignore for now the description length of the prior and the decoder weights, noting that the former is likely to be negligible and the latter could be minimised with e.g. variational inference (Hinton & Van Camp, 1993; Graves, 2011).

The underlying assumption of VAEs is that the cost of transmitting a piece of high-level information, for example that a particular MNIST image represents the digit 3, will be outweighed by the increased compression of the data by the decoder. But this assumption breaks down if the decoder is able to learn a distribution that closely matches the density of the data. In this case, if one-tenth of the training images are 3's, finding out that a particular image is a 3 will only save the decoder around log₂ 10 ≈ 3.3 bits. Furthermore, since an accurate prior will give a ten percent probability to 3's, it will cost exactly the same amount for the encoder to transmit that information via the prior. In practice, since the code is stochastic and the decoder is typically deterministic, it is often more efficient to ignore the code entirely.
If we follow the above reasoning to its logical conclusion we come to a paradox that appears to undermine not only VAEs, but any effort to use high-level concepts to compress low-level data: the benefit of associating a particular concept with a particular piece of data will always be outweighed by the coding cost. The resolution to the paradox is that high-level concepts become efficient when a single concept can be collectively associated with many low-level data, rather than pointed to by each datum individually. This suggests a paradigm where latent codes are used to organise the training set as a whole, rather than annotate individual training examples. To return to the MNIST example, if we first sort the images according to digit class, then transmit all the zeros followed by all the ones and so on, the cost of transmitting the places where the digit class changes will be negligible compared to the cumulative savings over all the images of each class. Conversely, consider an encyclopaedia that has been carefully structured into topics, articles, paragraphs and so on, providing high-level context that is known to lead to improved compression. Now imagine that the encyclopaedia is transmitted in a randomly ordered sequence of 100-character chunks, attached to each of which is a code specifying the exact place in the structure from which it was drawn (topic X, article Y, paragraph Z etc.). It should be clear that this would be a very inefficient compression algorithm; so inefficient, in fact, that it would not be worth transmitting the structure at all.
Encyclopaedias are already ordered, and the most efficient way to compress them may well be to simply preserve the ordering and use an autoregressive model to predict one token at a time. But in general we do not know how the data should be ordered for efficient compression. It would be possible to find such an ordering by minimising a similarity metric defined directly on the data, such as Euclidean distance in pixel space or edit distance for text; however such metrics tend to be limited to superficial similarities (in the case of pixel distance we provide evidence of this in our experiments). We therefore turn to the similarity, or association (Bahdanau et al., 2014; Graves et al., 2014), among latent representations to guide the ordering. Transmitting associated codes consecutively will only be efficient if we have a prior that captures the local statistics of the area they inhabit, and not the global statistics of the entire dataset: if a series of pictures of sheep has just been sent, the prior should expect another sheep to come next. We achieve this by using a neural network to condition the prior on a code chosen from the K nearest neighbours in latent space to the code being transmitted. Previous work has considered fitting mixture models as VAE priors (Nalisnick et al.; Tomczak & Welling, 2017), and one could think of our procedure as fitting a conditional prior to a uniform mixture over the K posterior codes closest to whichever code we are about to transmit. As K approaches N, the size of the training set, we recover the familiar setting of fitting an unconditional prior. Among supervised methods, perhaps the closest point of reference is Matching Networks (Vinyals et al., 2016), in which a nearest neighbours search over embeddings is leveraged for one-shot learning.
Conditioning on neighbouring codes does not obviously lead to a compression procedure. However we can define the following sequential compression algorithm if we insist that the neighbour for each code in the training set is unique:

1. Alice and Bob share the weights of the encoder, decoder and prior networks.²

2. Alice chooses an ordering for the training set, then transmits one element at a time by sending first a sample from the encoding distribution, then the residual bits required for lossless decoding.

3. After decoding each data sample, Bob re-encodes the data using his copy of the encoder network, then passes the statistics of the encoding distribution into the prior network as input. The resulting prior distribution is used to transmit the next code sample drawn by Alice, at a cost equal to the KL between their distributions.³

² In normal VAEs the encoder does not need to be shared.
³ The prior for the first example may be assumed to be shared at negligible cost for a large dataset.
The optimal ordering Alice should choose is the one that minimises the sum of the KLs at each transmission step. Finding this ordering is a hard optimisation problem in general, but our empirical results suggest that the KL cost of the optimal ordering is well approximated by nearest neighbour sampling, given a suitable value of K, the number of nearest neighbours considered.
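To make the cost of an ordering concrete: for unit-variance Gaussian codes, the KL between two encoding distributions is half the squared Euclidean distance between their means. If (purely for illustration) the prior network were the identity, so that the prior is centred on the neighbouring code, the transmission cost of an ordering would simply be these half squared distances summed along the tour. A minimal sketch under those assumptions, comparing a greedy nearest-neighbour tour with an arbitrary ordering:

```python
import math
import random

def kl_unit_gaussians(c, c_hat):
    # KL( N(c, I) || N(c_hat, I) ) = ||c - c_hat||^2 / 2 for unit-variance Gaussians.
    return sum((a - b) ** 2 for a, b in zip(c, c_hat)) / 2.0

def greedy_tour_kl(codes):
    # Greedy nearest-neighbour heuristic: start at code 0, always hop to the
    # cheapest unvisited code, and accumulate the per-step KL cost.
    unused = set(range(1, len(codes)))
    total, current = 0.0, 0
    while unused:
        nxt = min(unused, key=lambda j: kl_unit_gaussians(codes[current], codes[j]))
        total += kl_unit_gaussians(codes[current], codes[nxt])
        unused.remove(nxt)
        current = nxt
    return total

random.seed(0)
codes = [[random.gauss(0, 1) for _ in range(4)] for _ in range(50)]
# Cost of transmitting the codes in an arbitrary (here: generation) order.
arbitrary_order = sum(kl_unit_gaussians(codes[i], codes[i + 1])
                      for i in range(len(codes) - 1))
greedy = greedy_tour_kl(codes)
assert greedy < arbitrary_order  # hopping between neighbours is cheaper
```

In the real model the prior network is learned and the KLs involve a mixture prior, so the tour cost and the Euclidean tour length no longer coincide exactly; this sketch only illustrates why nearest-neighbour ordering is a reasonable proxy.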
It should be clear that ACNs are not IID in the usual sense: they optimise the cost of transmitting the entire dataset, in an order of their choosing, as opposed to the expected cost of transmitting a single datapoint. One consequence is that the ACN loss function is not directly comparable to that of VAEs or other generative models. Indeed, since the expected cost of transmitting a uniformly random ordering of a size-N dataset is log N! bits, it could be argued that an ACN has log N! / N (≈ log N − 1 for large N) 'free bits' per datapoint to spend on codes relative to an IID model. However, we contend that it is exactly the information contained in the ordering, or more generally in the relational structure of dataset elements, that defines the high-level regularities we wish our representation to capture. For example, if half the voices in a speech database are male and half are female, compression should be improved by grouping according to gender, motivating the inclusion of gender in the latent codes; likewise, representing speaker characteristics should make it possible to co-compress similar voices, and if there were enough examples of the same or similar phrases, it should become advantageous to encode linguistic information as well.
As the relationship between a particular datum and the rest of the dataset is not accessible to the decoder in an ACN, there is no need to weaken the decoder; indeed we recommend using the most powerful decoder possible, to ensure that the latent codes are not cluttered by low-level information. Similarly, there is no need to modify the loss function or add extra terms to encourage the use of latent codes. Rather, the use of latent information is a natural consequence of the separation between high-level relations among data and low-level dependencies within data. As our experiments demonstrate, this leads to compressed representations that capture many salient features likely to be useful for downstream tasks.
2 Background: Variational AutoEncoders
Variational Autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014) are a family of generative models consisting of two neural networks—an encoder and a decoder—trained in tandem. The encoder receives observable data x as input and emits as output a data-conditional distribution q(z|x) over latent vectors z. A sample z is drawn from this distribution and used by the decoder to determine a code-conditional reconstruction distribution r(x|z) over the original data.⁴ The VAE loss function is defined as the expected negative log-likelihood of x under r (often referred to as the reconstruction cost) plus the KL divergence from some prior distribution p(z) to q(z|x) (referred to as the KL or coding cost):

L^{VAE} = E_{z∼q(z|x)} [−log r(x|z)] + KL( q(z|x) ∥ p(z) )

⁴ We use r instead of the usual notation p to avoid confusion with the ACN prior.

Although VAEs with discrete latent variables have been explored (Mnih & Rezende, 2016), most are continuous to allow for stochastic backpropagation using the reparameterisation trick (Kingma & Welling, 2013). The prior p(z) may be a simple distribution such as a unit variance, zero mean Gaussian, or something more complex such as an autoregressive distribution whose parameters are adapted during training (Chen et al., 2016b; Gulrajani et al., 2016). In all cases, however, the prior is constant for all x.

3 Associative Compression Networks
Associative compression networks (ACNs) are similar to VAEs, except the prior for each x is now conditioned on the distribution used to encode some neighbouring datum x̂. We used a unit variance, diagonal Gaussian for all encoding distributions, meaning that q(z|x) is entirely described by its mean vector c, which we refer to as the code for x. Given c, we randomly pick ĉ, the code for x̂, from the set of K nearest Euclidean neighbours to c among all the codes for the training data. We then pass ĉ to the prior network to obtain the conditional prior distribution p(z|ĉ) and hence determine the KL cost. Adding this KL cost to the usual VAE reconstruction cost yields the ACN loss function:

L^{ACN} = E_{z∼q(z|x)} [−log r(x|z)] + KL( q(z|x) ∥ p(z|ĉ) )
As with normal VAEs, the prior distribution may be chosen from a more or less flexible family. However, as each local prior is already conditioned on a nearby code, the marginal prior across latent space will be highly flexible even if the local priors are simple. For our experiments we chose an independent mixture prior for each dimension of latent space, to encourage multimodal but independent (and hence, hopefully, disentangled) representations.
As discussed in the introduction, conditioning on neighbouring codes is equivalent to a sequential compression algorithm, as long as every neighbour is unique to a particular code. This can be ensured by a simple modification to the above procedure: restrict the neighbour set at each step to contain only codes that have not yet been used as neighbours during the current pass through the dataset. With K = 1 this is equivalent to a greedy nearest neighbour heuristic for the Euclidean travelling salesman problem of finding the shortest tour through the codes. The route found by this heuristic may be substantially longer than the optimal tour, which in any case may not correspond to the ordering that minimises the KL cost, as this depends on the KLs between the priors and the codes, and not directly on the distance between the codes. Nonetheless it provides an upper bound on the optimal KL cost, and hence on the compression of the dataset (note that the reconstruction cost does not depend on the ordering, as the decoder is conditioned only on the current code). We provide results in Section 4 to calibrate the accuracy of this approximation against the KL cost yielded by an actual tour.

To optimise the ACN loss we create an associative dataset that holds a separate code vector c for each x in the training set and run the following algorithm:
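One step of this procedure might look as follows; the toy "networks" and all numeric details here are stand-ins for illustration, not the architectures used in the paper:

```python
import random

# Toy stand-ins for the trained networks (assumptions, not the paper's
# architectures): encoder_mean maps a datum to its code, prior_mean maps a
# neighbour code to the predicted prior mean.
def encoder_mean(x, w):
    return [w * v for v in x]

def prior_mean(c_hat, u):
    return [u * v for v in c_hat]

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def acn_step(data, codes, i, w=1.0, u=1.0, k=5):
    """One optimisation step, in outline:
    1. encode datum i and refresh its (possibly stale) stored code;
    2. pick a neighbour code uniformly from the K nearest stored codes;
    3. the KL term penalises the mismatch between the encoding distribution
       and the prior conditioned on the neighbour (unit-variance Gaussians)."""
    c = encoder_mean(data[i], w)
    codes[i] = c                                   # refresh the associative dataset
    others = sorted((j for j in range(len(codes)) if j != i),
                    key=lambda j: sq_dist(c, codes[j]))
    j = random.choice(others[:k])                  # neighbour from the KNN set
    kl = sq_dist(c, prior_mean(codes[j], u)) / 2.0
    return kl  # a real implementation adds the reconstruction loss here

random.seed(1)
data = [[random.gauss(0, 1) for _ in range(3)] for _ in range(20)]
codes = [[0.0] * 3 for _ in data]                  # stored codes, initially stale
kl = acn_step(data, codes, 0)
assert kl >= 0.0 and codes[0] == data[0]
```

In practice the loss would be backpropagated through the encoder, decoder and prior networks; this sketch only traces the data flow of the code refresh and neighbour lookup.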
In general the data, codes and neighbouring codes will be batches computed in parallel. As the stored codes are only updated when the corresponding data is sampled, the codes used for the nearest neighbour (KNN) search will in general be somewhat stale. To check that this wasn't a significant problem, we ran tests in which a parallel worker continually updated the codes using the current weights of the encoder network. For our experiments, increasing the code-update frequency made no discernible difference to learning; however code staleness could become more damaging for larger datasets. Likewise the computational cost of performing the KNN search was low compared to that of activating the networks for our experiments, but could become prohibitive for large datasets.
3.1 Unconditional Prior
Unlike normal VAEs, ACNs by default lack an unconditional prior, which makes it difficult to compare them to existing generative models. However we can easily fit an unconditional prior to samples drawn from the stored training-set codes after training is complete.
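For instance, a Gaussian mixture can be fit to the stored codes with a few iterations of EM. The hand-rolled 1-D version below is purely illustrative (in practice the number of components would be chosen on a validation set, as described in Section 4):

```python
import math
import random

def em_gmm_1d(xs, n_components=2, iters=50):
    # Minimal EM for a 1-D Gaussian mixture (illustration only).
    lo, hi = min(xs), max(xs)
    mu = [lo + i * (hi - lo) / (n_components - 1) for i in range(n_components)]
    var = [1.0] * n_components
    pi = [1.0 / n_components] * n_components
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            w = [pi[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 for k in range(n_components)]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M-step: re-estimate weights, means and variances.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk + 1e-6
    return pi, mu, var

# "Codes" drawn from two well-separated clusters.
random.seed(0)
xs = ([random.gauss(-3, 0.5) for _ in range(200)]
      + [random.gauss(3, 0.5) for _ in range(200)])
pi, mu, var = em_gmm_1d(xs)
assert abs(min(mu) + 3) < 0.5 and abs(max(mu) - 3) < 0.5
```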
3.2 Sampling
There are several different ways to sample from ACNs, of which we consider three. Firstly, by drawing a latent vector from the unconditional prior defined above and sampling from the decoder distribution, we can generate unconditional samples that reflect the ACN's global data distribution. Secondly, by choosing a batch of real images, encoding them, and decoding conditioned on the resulting codes, we can generate stochastic reconstructions of the images, revealing which features of the original are represented in the latents and transmitted to the decoder. Note that in order to reduce sampling noise we use the mean codes as latents for the reconstructions, rather than samples from the encoding distribution; we assume at this point that the decoder is autoregressive. Lastly, the use of conditional priors opens up an alternative sampling protocol, where sequences of linked samples are generated from real data by iteratively encoding the data, sampling from the prior conditioned on the code, generating new data, then encoding again. We refer to these sequences as 'daydreams', as they remind us of the chains of associative imagining followed by the human mind at rest. The daydream sampling process is illustrated in Figure 1.
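Schematically, daydream sampling is the following loop, where the three functions are hypothetical stand-ins for the trained encoder, conditional prior and decoder:

```python
import random

random.seed(0)

def encode(x):                 # stand-in encoder: code = datum
    return x

def prior_sample(c):           # stand-in conditional prior: small step from the code
    return c + random.gauss(0, 0.1)

def decode(z):                 # stand-in decoder: datum = latent plus noise
    return z + random.gauss(0, 0.05)

def daydream(x0, steps):
    chain, x = [x0], x0
    for _ in range(steps):
        z = prior_sample(encode(x))   # associate: drift to a nearby code
        x = decode(z)                 # imagine: generate new data from it
        chain.append(x)
    return chain

chain = daydream(0.0, 20)
assert len(chain) == 21
```

The chain never resets to the original datum, so successive samples drift gradually through latent space, which is what produces the continuous transitions seen in the daydream figures.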
3.3 Test Set Evaluation
Since the true KL cost depends on the order in which the data is transmitted, there are some subtleties in comparing the test set performance of ACN with other models. For one thing, as discussed in the introduction, most other models are order-agnostic, and hence arguably due a refund for the cost of specifying an arbitrary ordering (in the case of MNIST this would amount to 8.21 nats per test set image). We can resolve this by calculating both an upper bound on the ordered compression yielded by ACN, and the unordered compression which can be computed using the KL between the unconditional prior discussed in Section 3.1 and the test set encodings (recall that the reconstruction cost is unaffected by the ordering). As well as providing a fair comparison with previous results, the unconditional KL gives an idea of the total amount of information encoded for each data point, relative to the dataset as a whole. Another issue is that if an ordering is used, it is debatable whether the training and test set should be compressed together, with a single tour through all the data, or whether the test set should be treated as a separate tour, with the prior network conditioned on test set codes only. We chose the latter for simplicity, but note that doing so may unrealistically inflate the KL costs; for example if the test set is dramatically smaller than the training set, and the average distance between codes is correspondingly larger, the density of the prior distributions may be strongly miscalibrated.
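The 8.21-nat figure follows directly from the size of the MNIST test set: the cost of an arbitrary ordering of N items is log N! nats, i.e. log N! / N ≈ log N − 1 nats per item, which for N = 10000 gives:

```python
import math

N = 10000                               # MNIST test set size
per_image = math.lgamma(N + 1) / N      # log(N!) / N, in nats per image
assert abs(per_image - 8.21) < 0.01     # matches the figure quoted above
assert abs(per_image - (math.log(N) - 1)) < 0.001  # Stirling: log N - 1
```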
4 Experimental Results
We present experimental results on four image datasets: binarized MNIST (Salakhutdinov & Murray, 2008), CIFAR10 (Krizhevsky, 2009), ImageNet (Deng et al., 2009) and CelebA (Liu et al., 2015). Buoyed by our belief that the latent codes will not be ignored no matter how well the decoder can model the data, we used a Gated PixelCNN decoder (Oord et al., 2016b) to parameterise the reconstruction distribution for all experiments. The ACN encoder was a convolutional network fashioned after a VGG-style classifier (Simonyan & Zisserman, 2014), and the encoding distribution was a unit variance Gaussian with mean specified by the output of the encoder network. The prior network was an MLP with three hidden layers of 512 units each, with skip connections from the input to all hidden layers and from all hidden layers to the output layer. The ACN prior distribution was parameterised using the outputs of the prior network as follows:
p(z|ĉ) = ∏_{d=1}^{D} ∑_{k=1}^{M} π_d^k N( z_d | μ_d^k, σ_d^k )

where D is the dimensionality of z, z_d is the d-th element of z, ĉ is the neighbouring code, there are M mixture components for each dimension, and all parameters π_d^k, μ_d^k, σ_d^k are emitted by the prior network, with the softmax function used to normalise the weights π_d and the softplus function used to ensure σ_d^k > 0. The number of mixture components differed between MNIST and the other datasets; the results did not seem very sensitive to this. Polyak averaging (Polyak & Juditsky, 1992) was applied for all experiments with a decay parameter of 0.9999; all samples and test set costs were calculated using averaged weights. For the unconditional prior we always fit a Gaussian mixture model using Expectation-Maximization, with the number of components optimised on the validation set.
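Concretely, the per-dimension mixture density can be evaluated with a log-sum-exp over components. The sketch below uses made-up prior-network outputs, with softmax normalising the weights and softplus keeping the scales positive:

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def softplus(x):
    return math.log1p(math.exp(x))

def log_prior(z, raw_pi, raw_mu, raw_sigma):
    """log p(z | c_hat) for a per-dimension mixture-of-Gaussians prior.
    raw_* play the role of prior-network outputs (made-up here): for each
    latent dimension d there are M components with weights softmax(raw_pi[d]),
    means raw_mu[d] and scales softplus(raw_sigma[d])."""
    total = 0.0
    for d in range(len(z)):
        pi = softmax(raw_pi[d])
        comps = []
        for k in range(len(pi)):
            sigma = softplus(raw_sigma[d][k])
            comps.append(math.log(pi[k])
                         - 0.5 * math.log(2 * math.pi * sigma ** 2)
                         - (z[d] - raw_mu[d][k]) ** 2 / (2 * sigma ** 2))
        m = max(comps)                         # log-sum-exp over components
        total += m + math.log(sum(math.exp(c - m) for c in comps))
    return total

# One latent dimension, two components (illustrative numbers only).
lp = log_prior([0.0], raw_pi=[[0.0, 0.0]], raw_mu=[[-1.0, 1.0]], raw_sigma=[[0.5, 0.5]])
assert lp < 0.0
```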
For all experiments, the optimiser was RMSProp (Tieleman & Hinton, 2012) with fixed learning rate and momentum. The dimensionality of z differed between binarized MNIST and the other datasets. Unless stated otherwise, the same fixed value of K was used for the KNN lookups during ACN training.

4.1 Binarized MNIST
For the binarized MNIST experiments the ACN encoder had five convolutional layers, and the decoder consisted of 10 gated residual blocks, each using 64 filters of size 5x5. The decoder output was a single Bernoulli distribution for each pixel, and a batch size of 64 was used for training.
Table 1. Binarized MNIST test set compression.

Model                                   Nats / image
----------------------------------------------------
Gated PixelCNN (ours)                   81.6
PixelCNN (Oord et al., 2016a)           81.3
Discrete VAE (Rolfe, 2016)              81.0
DRAW (Gregor et al., 2015)
G. PixelVAE (Gulrajani et al., 2016)    79.5
PixelRNN (Oord et al., 2016a)           79.2
VLAE (Chen et al., 2016b)               79.0
GLN (Veness et al., 2017)               79.0
MatNet (Bachman, 2016)
ACN (unordered)                         80.9

Table 2. Breakdown of ACN costs on binarized MNIST.

Cost                    Nats / image
------------------------------------
KL ()                   2.6
KL ()                   3.5
KL (Greedy Tour)        3.6
KL ()                   4.1
KL (Unconditional)      10.6
Reconstruction          70.3
ACN (ordered)           73.9
The results in Table 1 show that unordered ACN gives similar compression to the decoder alone (Gated PixelCNN), supporting the thesis that the conventional VAE loss is not significantly reduced by latent codes when using an autoregressive decoder. Table 2 shows that the upper bound on the ordered ACN cost (sum of greedy tour KL and reconstruction) is 7 nats per image lower than the unordered ACN cost. Given that the cost of specifying an ordering for the test set is 8.21 nats per image, this suggests that the model is using most of the 'free bits' to encode latent information. The KL cost yielded by the 'greedy tour' heuristic described in Section 3 is close to that given by KNN sampling on the test set codes (note that we are varying K when computing the test set KL only; the network was trained with a single fixed K). Since the greedy tour cost is a loose upper bound on the optimal KL for an ordered tour, and since the K = 1 result is a lower bound (no tour can do better than always hopping to the nearest neighbour), we speculate that the true KL lies somewhere between the two.
As discussed in the introduction, if the value of K used for the KNN lookups approaches the size of the training set, ACN should reduce to a VAE with a learned prior. To test this, we trained ACNs with a range of increasing K values and measured the change in compression costs. We also implemented a standard feedforward VAE, and a VAE with the same encoder and decoder as ACN but with an unconditional Gaussian mixture prior whose parameters were trained in place of the prior network. We refer to the latter as Gated PixelVAE due to its similarity with previous work (Gulrajani et al., 2016), though note that they used a fixed prior and a somewhat different encoder architecture. Figure 2 shows that the unordered compression cost per test set image is much the same for ACN regardless of K, and very similar to that of both Gated PixelVAE and Gated PixelCNN (again underlining the marginal impact of latent codes on VAE loss). However the distribution of the costs changes, with higher reconstruction cost and lower KL cost for higher K. As predicted, Gated PixelVAE performs similarly to ACN with very high K. The VAE performs considerably worse due to its non-autoregressive decoder; however its higher KL suggests that more information is encoded in the latents. Our next experiment attempts to quantify how useful this information is.
Table 3. Linear classification accuracy with various inputs.

Input                     Accuracy (%)
--------------------------------------
PCA (16 components)       82.8
Pixels                    89.4
Standard VAE codes        95.4
Gated PixelVAE codes      97.9
ACN codes                 98.5
Table 3 shows the results of training a linear classifier to predict the training set labels from various inputs. This gives us a measure of the amount of easily accessible high-level information the inputs contain. ACN codes are the most effective, but interestingly PixelVAE codes are a close second, in spite of having a KL cost of just over 1 nat per image. VAE codes, with a KL of 26 nats per image, are considerably worse; we hypothesize that the weaker decoder leads the VAE to include more low-level information in the codes, making them harder to classify. In any case we can conclude that coding cost is not a reliable indicator of code utility.
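Such a linear probe can be sketched as ordinary logistic regression trained by gradient descent. The toy "codes" below are synthetic clusters; nothing about the classifier actually used for Table 3 is implied:

```python
import math
import random

random.seed(0)

def probe_accuracy(X, y, epochs=200, lr=0.5):
    # Plain logistic regression fit by stochastic gradient descent.
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
            g = p - yi                              # gradient of the log loss
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    correct = 0
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
        correct += (p > 0.5) == (yi == 1)
    return correct / len(X)

# Two well-separated clusters of 2-D "codes" with binary labels.
X = ([[random.gauss(-2, 0.5), random.gauss(-2, 0.5)] for _ in range(50)]
     + [[random.gauss(2, 0.5), random.gauss(2, 0.5)] for _ in range(50)])
y = [0] * 50 + [1] * 50
acc = probe_accuracy(X, y)
assert acc > 0.95
```

The point of the probe is that accuracy under a purely linear readout measures how accessible the high-level structure is, not how much information the representation contains in total.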
The salience of the ACN codes is supported by the visualisation of the principal components of the codes shown in Figure 3: note the clustering of image classes (coloured differently to aid interpretation) and the gradation in writing style across the clusters (e.g. strokes becoming thicker towards the top of the clusters, thinner towards the bottom). The reconstructions in Figure 4 further stress the fidelity of digit class, stroke thickness, writing style and orientation within the codes, while the comparison between unconditional ACN samples and baseline samples from the Gated PixelCNN reveals a subtle improvement in sample quality. Figure 6 illustrates the dynamic modulation of daydream sampling as it moves through latent space: note the continual shift in rotation and stroke width, and the gradual morphing of one digit into another.
4.2 CIFAR10
For the CIFAR10 experiments the encoder was a convolutional network fashioned after a VGG-style classifier (Simonyan & Zisserman, 2014), with 11 convolutional layers and 3x3 filters. The decoder had 15 gated residual blocks, each using 128 filters of size 5x5; its output was a categorical distribution over subpixel intensities, with 256 bins for each colour channel. Training batch size was 64. The reconstructions in Figure 7 demonstrate some high-level coherence, with object features such as parts of cars and horses occasionally visible, while Figure 8 shows an improvement in sample coherence relative to the baseline. We found that ACN codes for CIFAR10 images were linearly classified with 55.3% accuracy versus 38.4% accuracy for pixels. See Appendix A for more samples and results.
4.3 ImageNet
For these experiments the setup was the same as for CIFAR10, except the decoder had 20 gated residual layers of 368 5x5 filters, and the batch size was 128. We downsampled the images to 32x32 resolution to speed up training. We found that ACN ImageNet codes can be linearly classified with 18.5% top-1 accuracy and 40.5% top-5 accuracy, compared to 3.0% and 9.0% respectively for pixels. Better unsupervised classification scores have been recorded for ImageNet (Doersch et al., 2015; Donahue et al., 2016; Wang & Gupta, 2015), but these used higher-resolution images. The reconstructions in Figure 10 suggest that ACN encodes information about image composition, colour, background and setting (natural, indoor, urban etc.), while Figure 9 shows continuous transitions in background, foreground and colour during daydream sampling. In this case the distinction between unconditional ACN samples and Gated PixelCNN samples was less clear (Figure 11). See Appendix B for more samples and results.
4.4 CelebA
We downsampled the CelebA images to 32x32 resolution and used the same setup as for CIFAR10. Figure 12 demonstrates that high-level aspects of the original images, such as gender, pose, lighting, face shape and facial expression, are well represented by the codes, but that the specific details are left to the decoder. Figure 13 demonstrates a slight advantage in sample quality over the baseline.
5 Conclusion
We have introduced Associative Compression Networks (ACNs), a new form of Variational Autoencoder in which associated codes are used to condition the latent prior. Our experiments show that the latent representations learned by ACNs contain meaningful, high-level information that is not diminished by the use of autoregressive decoders. As well as providing a clear conditioning signal for the samples, these representations can be used to cluster and linearly classify the data, suggesting that they will be useful for other cognitive tasks. We have also seen that the joint latent and data space learned by the model can be naturally traversed by daydream sampling. We hope this work will open the door to more holistic, dataset-wide approaches to generative modelling and representation learning.
Acknowledgements
Many of our colleagues at DeepMind gave us valuable feedback on this work. We would particularly like to thank Andriy Mnih, Danilo Rezende, Igor Babuschkin, John Jumper, Oriol Vinyals, Guillaume Desjardins, Lasse Espeholt, Chris Jones, Alex Pritzel, Irina Higgins, Loic Matthey, Siddhant Jayakumar and Koray Kavukcuoglu.
References
 Bachman (2016) Bachman, Philip. An architecture for deep, hierarchical generative models. In Advances in Neural Information Processing Systems, pp. 4826–4834, 2016.
 Bahdanau et al. (2014) Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 Bowman et al. (2015) Bowman, Samuel R, Vilnis, Luke, Vinyals, Oriol, Dai, Andrew M, Jozefowicz, Rafal, and Bengio, Samy. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
 Chen et al. (2016a) Chen, Xi, Duan, Yan, Houthooft, Rein, Schulman, John, Sutskever, Ilya, and Abbeel, Pieter. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180, 2016a.
 Chen et al. (2016b) Chen, Xi, Kingma, Diederik P, Salimans, Tim, Duan, Yan, Dhariwal, Prafulla, Schulman, John, Sutskever, Ilya, and Abbeel, Pieter. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016b.
 Chen et al. (2017) Chen, Xi, Mishra, Nikhil, Rohaninejad, Mostafa, and Abbeel, Pieter. Pixelsnail: An improved autoregressive generative model. arXiv preprint arXiv:1712.09763, 2017.
 Deng et al. (2009) Deng, Jia, Dong, Wei, Socher, Richard, Li, LiJia, Li, Kai, and FeiFei, Li. Imagenet: A largescale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.
 Doersch et al. (2015) Doersch, Carl, Gupta, Abhinav, and Efros, Alexei A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430, 2015.
 Donahue et al. (2016) Donahue, Jeff, Krähenbühl, Philipp, and Darrell, Trevor. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
 Graves (2011) Graves, Alex. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 2348–2356, 2011.
 Graves et al. (2014) Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
 Gregor et al. (2015) Gregor, Karol, Danihelka, Ivo, Graves, Alex, Rezende, Danilo Jimenez, and Wierstra, Daan. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
 Gregor et al. (2016) Gregor, Karol, Besse, Frederic, Rezende, Danilo Jimenez, Danihelka, Ivo, and Wierstra, Daan. Towards conceptual compression. In Advances In Neural Information Processing Systems, pp. 3549–3557, 2016.
 Gulrajani et al. (2016) Gulrajani, Ishaan, Kumar, Kundan, Ahmed, Faruk, Taiga, Adrien Ali, Visin, Francesco, Vazquez, David, and Courville, Aaron. Pixelvae: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.
 Higgins et al. (2017) Higgins, Irina, Matthey, Loic, Pal, Arka, Burgess, Christopher, Glorot, Xavier, Botvinick, Matthew, Mohamed, Shakir, and Lerchner, Alexander. β-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2017.

Hinton & Van Camp (1993) Hinton, Geoffrey E and Van Camp, Drew. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pp. 5–13. ACM, 1993.
 Kingma & Welling (2013) Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Krizhevsky (2009) Krizhevsky, Alex. Learning multiple layers of features from tiny images. 2009.
 Liu et al. (2015) Liu, Ziwei, Luo, Ping, Wang, Xiaogang, and Tang, Xiaoou. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738, 2015.

 Mnih & Rezende (2016) Mnih, Andriy and Rezende, Danilo. Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, pp. 2188–2196, 2016.
 Nalisnick et al. Nalisnick, Eric, Hertel, Lars, and Smyth, Padhraic. Approximate inference for deep latent Gaussian mixtures.
 Oord et al. (2016a) Oord, Aaron van den, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016a.
 Oord et al. (2016b) Oord, Aäron van den, Kalchbrenner, Nal, Vinyals, Oriol, Espeholt, Lasse, Graves, Alex, and Kavukcuoglu, Koray. Conditional image generation with PixelCNN decoders. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 4797–4805. Curran Associates Inc., 2016b.
 Polyak & Juditsky (1992) Polyak, Boris T and Juditsky, Anatoli B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
 Rezende et al. (2014) Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 Rolfe (2016) Rolfe, Jason Tyler. Discrete variational autoencoders. arXiv preprint arXiv:1609.02200, 2016.

 Salakhutdinov & Murray (2008) Salakhutdinov, Ruslan and Murray, Iain. On the quantitative analysis of deep belief networks. In Proceedings of the 25th international conference on Machine learning, pp. 872–879. ACM, 2008.
 Salimans et al. (2017) Salimans, Tim, Karpathy, Andrej, Chen, Xi, and Kingma, Diederik P. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
 Simonyan & Zisserman (2014) Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Smilkov et al. (2016) Smilkov, Daniel, Thorat, Nikhil, Nicholson, Charles, Reif, Emily, Viégas, Fernanda B, and Wattenberg, Martin. Embedding projector: Interactive visualization and interpretation of embeddings. arXiv preprint arXiv:1611.05469, 2016.
 Tieleman & Hinton (2012) Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5 - rmsprop, COURSERA: Neural networks for machine learning. University of Toronto, Technical Report, 2012.
 Tomczak & Welling (2017) Tomczak, Jakub M and Welling, Max. VAE with a VampPrior. arXiv preprint arXiv:1705.07120, 2017.
 van den Oord et al. (2017) van den Oord, Aaron, Vinyals, Oriol, et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6309–6318, 2017.
 Veness et al. (2017) Veness, Joel, Lattimore, Tor, Bhoopchand, Avishkar, Grabska-Barwinska, Agnieszka, Mattern, Christopher, and Toth, Peter. Online learning with gated linear networks. arXiv preprint arXiv:1712.01897, 2017.
 Vinyals et al. (2016) Vinyals, Oriol, Blundell, Charles, Lillicrap, Tim, Wierstra, Daan, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.
 Wang & Gupta (2015) Wang, Xiaolong and Gupta, Abhinav. Unsupervised learning of visual representations using videos. arXiv preprint arXiv:1505.00687, 2015.
 Zhao et al. (2017) Zhao, Shengjia, Song, Jiaming, and Ermon, Stefano. InfoVAE: Information maximizing variational autoencoders. CoRR, abs/1706.02262, 2017.
Appendix A CIFAR-10
Model  Bits / dim 

DRAW (Gregor et al., 2015)  4.13 
Conv DRAW (Gregor et al., 2016)  4.00 
Pixel CNN (Oord et al., 2016a)  3.14 
Gated Pixel CNN (Oord et al., 2016b)  3.03 
Pixel RNN (Oord et al., 2016a)  3.00 
PixelCNN++ (Salimans et al., 2017)  2.92 
PixelSNAIL (Chen et al., 2017)  2.85 
ACN (unordered) 
Cost  Nats / image 

KL ()  5.4 
KL ()  6.2 
KL (Tour)  6.3 
KL ()  6.7 
KL (Unconditional)  14.4 
Reconstruction  6536.7 
ACN (ordered) 
Appendix B ImageNet
Model  Bits / dim 

conv. DRAW (Gregor et al., 2016)  4.40 
Pixel RNN (Oord et al., 2016a)  3.86 
Gated Pixel CNN (Oord et al., 2016b)  3.83 
PixelSNAIL (Chen et al., 2017)  3.80 
ACN (unordered) 
Cost  Nats / image 

KL ()  2.9 
KL ()  8.7 
KL ()  10.3 
KL (Greedy Tour)  10.6 
KL (Unconditional)  18.2 
Reconstruction  8112.8 
ACN (ordered)  8123.4 