Dimensionality Reduction Flows

08/05/2019 ∙ by Hari Prasanna Das, et al. ∙ berkeley college

Deep generative modelling using flows has gained popularity owing to tractable exact log-likelihood estimation together with efficient training and synthesis. Trained flow models carry rich information about the structure and local variance in the input data. However, a bottleneck for scaling flow models to higher dimensions is that the latent space has the same size as the high-dimensional input space. In this paper, we propose methods to reduce the latent space dimension of flow models. Our first approach replaces the standard high-dimensional prior with a prior learned from a low-dimensional noise space. To further achieve exact log-likelihood with reduced dimensionality, our second approach presents an improved multi-scale architecture (Dinh et al., 2016) via likelihood contribution based factorization of dimensions. Applying our method to state-of-the-art flow models, we demonstrate improvements in log-likelihood score on standard image benchmarks. Our work proposes a data-dependent factorization scheme which is more efficient than the static counterparts in prior works.




1 Introduction

Deep generative modelling aims to learn the embedded distributions and representations in input (especially unlabelled) data, requiring little or no human labelling effort. Learning without knowledge of labels (unsupervised learning) is of increasing importance because of the abundance of unlabelled data and the rich inherent patterns they possess. The learnt representations can then be utilized in a number of downstream tasks such as semi-supervised learning (Kingma et al., 2014; Odena, 2016), synthetic data augmentation and adversarial training (Cisse et al., 2017), text analysis and model-based control. The repertoire of deep generative models mainly includes likelihood-based models, such as autoregressive models (Oord et al., 2016c; Graves, 2013), latent variable models (Kingma and Welling, 2013) and flow-based models (Dinh et al., 2014, 2016; Kingma and Dhariwal, 2018), and implicit models such as generative adversarial networks (GANs) (Goodfellow et al., 2014). We briefly cover the strengths and shortcomings of each type of model.

Autoregressive models (Salimans et al., 2017; Oord et al., 2016c, b; Chen et al., 2017) achieve exceptional log-likelihood scores on many standard datasets, indicative of their power to model the inherent distribution. However, they suffer from a slow sampling process, which makes them hard to adopt in real-world applications. Latent variable models such as variational autoencoders (Kingma and Welling, 2013) tend to better capture the global feature representation in data, but do not offer an exact density estimate, as they maximize a lower bound on it. Implicit generative models such as GANs have recently become popular for their ability to synthesize realistic data (Karras et al., 2018; Engel et al., 2019). They optimize a generator and a discriminator in a min-max fashion such that, at the optimum, the generator is able to generate synthetic data matching the patterns in the original data, while the discriminator is trained to distinguish synthetic from original data. However, GANs do not offer a latent space suitable for further downstream tasks, nor do they perform density estimation. Flow-based generative models perform exact density estimation with fast inference and sampling, owing to their parallelizability. They also provide an information-rich latent space suitable for many applications.

However, flow-based generative models face a bottleneck when scaling to higher dimensions. The bottleneck arises because the latent space has the same dimension as the high-dimensional input space, which is a consequence of the bijective nature of flows. Reducing the dimension of the latent space while preserving log-likelihood and sample quality will lead to better computation and memory complexity. In this paper, we propose methods to reduce the latent space dimension of flow-based generative models. In Section 3.1, we describe our first approach, which replaces the standard normal prior with a prior learned from a low-dimensional noise space. This approach involves a relaxation of the training objective to include a variational term. Further improving to achieve exact log-likelihood, we propose our second approach in Section 3.2, which is an enhanced multiscale architecture (Dinh et al., 2016) based on the likelihood contribution of dimensions. We propose that the choice of factorization of dimensions at each flow layer in a multiscale architecture should be based on data-dependent masking rather than the static masking used in prior works such as RealNVP (Dinh et al., 2016) and Glow (Kingma and Dhariwal, 2018). We also design such factorization methods for the RealNVP and Glow models. In Section 4 we present our experimental results for both proposed approaches on standard image benchmarks.

2 Background

In this section, we illustrate the basic functioning of flow based generative models and Variational Autoencoders.

2.1 Flow-based Generative Models


Let $x$ be a high-dimensional random vector with unknown true distribution $p(x)$. The formulation presented below is directly applicable to continuous data, and to discrete data with some pre-processing steps such as dequantization (Uria et al., 2013; Salimans et al., 2017; Ho et al., 2019). Let $z$ be the latent variable with a known standard distribution $p(z)$, such as a standard multivariate Gaussian, $\mathcal{N}(z; 0, I)$. Using an i.i.d. dataset $\mathcal{D}$, our target is to come up with a model $p_\theta(x)$ with parameters $\theta$. A flow $f_\theta$ is defined to be an invertible transformation that maps observed data $x$ to the latent variable $z$. A flow is invertible, so the inverse function $f_\theta^{-1}$ maps $z$ to $x$:

$$z = f_\theta(x), \qquad x = f_\theta^{-1}(z).$$

The probability density and the log-likelihood can be expressed as:

$$p_\theta(x) = p(z)\left|\det\left(\frac{\partial f_\theta(x)}{\partial x}\right)\right|$$

$$\log p_\theta(x) = \log p(z) + \log\left|\det\left(\frac{\partial f_\theta(x)}{\partial x}\right)\right| \tag{4}$$

where $\partial f_\theta(x)/\partial x$ is the Jacobian of $f_\theta$ at $x$.

The invertible nature of a flow allows it to be composed of other flows of compatible dimensions. In practice, flows are constructed by composing a series of component flows. Let the flow $f_\theta$ be composed of $K$ component flows, i.e. $f_\theta = f_{\theta_K} \circ f_{\theta_{K-1}} \circ \dots \circ f_{\theta_1}$, and let the intermediate variables be denoted by $h_0, h_1, \dots, h_K$ with $h_0 = x$, $h_K = z$. Then the log-likelihood of the composed flow can be expressed as,

$$\log p_\theta(x) = \log p(z) + \sum_{i=1}^{K} \log\left|\det\left(\frac{\partial h_i}{\partial h_{i-1}}\right)\right| \tag{7}$$

which follows from the chain rule together with the fact that $\det(AB) = \det(A)\det(B)$. The first term in Equation 7 is commonly referred to as the log-latent-density term and the second term as the log-determinant (log-det) term. The reverse path, from $z$ to $x$, can be written as a composition of inverse flows, $x = f_{\theta_1}^{-1} \circ f_{\theta_2}^{-1} \circ \dots \circ f_{\theta_K}^{-1}(z)$. Because of the exact inference and flexible scaling properties offered by flow models, they have proven to be an ideal candidate for deep generative modelling. Conforming to the properties mentioned in this section, many different types of flows can be constructed (Kingma and Dhariwal, 2018; Dinh et al., 2016, 2014; Bell and Sejnowski, 1995). One flow of particular interest is the affine flow (Dinh et al., 2016), which has a tractable log-determinant and efficient inference and sampling. We use affine flows in our experiments (Section 4).
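The change-of-variables computation for a composed flow can be sketched in a few lines. The example below is a minimal illustration, assuming element-wise affine component flows with fixed (not data-dependent) log-scales and shifts; the function names and the toy parameters are hypothetical and are not taken from the paper's implementation.

```python
import math

def affine_forward(h, log_scale, shift):
    """Element-wise affine flow: h_out = h * exp(log_scale) + shift.

    Returns the transformed vector and log|det Jacobian|, which for an
    element-wise map is the sum of the log-scales.
    """
    out = [hi * math.exp(s) + t for hi, s, t in zip(h, log_scale, shift)]
    return out, sum(log_scale)

def affine_inverse(h, log_scale, shift):
    """Exact inverse of affine_forward."""
    return [(hi - t) * math.exp(-s) for hi, s, t in zip(h, log_scale, shift)]

def standard_normal_logpdf(z):
    """log N(z; 0, I) for a vector z."""
    return sum(-0.5 * (zi * zi + math.log(2.0 * math.pi)) for zi in z)

def flow_log_likelihood(x, layers):
    """log p(x) = log p(z) + sum of per-layer log-dets, with h_0 = x, h_K = z."""
    h, total_log_det = list(x), 0.0
    for log_scale, shift in layers:
        h, log_det = affine_forward(h, log_scale, shift)
        total_log_det += log_det
    return standard_normal_logpdf(h) + total_log_det, h

def flow_inverse(z, layers):
    """Reverse path: apply the inverse component flows in reverse order."""
    h = list(z)
    for log_scale, shift in reversed(layers):
        h = affine_inverse(h, log_scale, shift)
    return h

# Toy two-layer flow on a 2-dimensional input (illustrative values only).
layers = [([0.5, -0.2], [1.0, 0.0]), ([-0.3, 0.1], [0.0, 2.0])]
x = [0.7, -1.3]
log_px, z = flow_log_likelihood(x, layers)
x_rec = flow_inverse(z, layers)
```

In an affine coupling layer the log-scales and shifts are themselves functions of part of the input, but the bookkeeping of the log-latent-density and log-det terms stays the same.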

2.2 Variational Autoencoders

Variational Autoencoders (VAEs) (Kingma and Welling, 2013) are a class of autoencoders (Zhang et al., 2016; Kingma and Welling, 2013) based on variational inference, i.e. optimizing a lower bound on the log-likelihood of the data. The inherent assumption in these models is that the data ($x$) is generated by a random process involving some latent variables ($z$). The process is carried out in two steps: (1) the latent variable ($z$) is generated from some prior distribution $p_\theta(z)$ and (2) a value $x$ is generated from a conditional distribution $p_\theta(x|z)$. It is assumed that $p_\theta(z)$ and $p_\theta(x|z)$ are tractable. A parametric inference model $q_\phi(z|x)$ is introduced to approximately match the true posterior $p_\theta(z|x)$, which is generally intractable. The log-likelihood can be expressed as,

$$\log p_\theta(x) = D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big) + \mathcal{L}(\theta, \phi; x) \ge \mathcal{L}(\theta, \phi; x) \tag{8}$$

The Variational Lower Bound (VLB), $\mathcal{L}(\theta, \phi; x)$, is optimized via different methods. For continuous $z$ it can be done efficiently through a re-parameterization trick involving $q_\phi(z|x)$ (Kingma and Welling, 2013; Rezende et al., 2014). The VLB can be re-written as,

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z)\big)$$

where the first term is the reconstruction term and the second term is a regularization term. This is consistent with an autoencoder loss with $q_\phi(z|x)$ as the encoder and $p_\theta(x|z)$ as the decoder.
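The two terms of the VLB can be estimated numerically as sketched below. This is a minimal illustration, assuming a diagonal-Gaussian encoder, a standard normal prior and a unit-variance Gaussian decoder; the helper names and the `decode` callable are hypothetical and stand in for learned networks.

```python
import math
import random

def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ): the regularization term."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var, rng):
    """Re-parameterization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def unit_gaussian_logpdf(x, mean):
    """log N(x; mean, I), used here as the decoder likelihood log p(x|z)."""
    return sum(-0.5 * ((xi - mi) ** 2 + math.log(2.0 * math.pi)) for xi, mi in zip(x, mean))

def elbo(x, mu, log_var, decode, rng, n_samples=64):
    """Monte Carlo estimate of E_q[log p(x|z)] - KL(q(z|x) || p(z))."""
    reconstruction = 0.0
    for _ in range(n_samples):
        z = reparameterize(mu, log_var, rng)
        reconstruction += unit_gaussian_logpdf(x, decode(z))
    reconstruction /= n_samples
    return reconstruction - gaussian_kl_to_standard_normal(mu, log_var)

# Toy usage: identity decoder, encoder output chosen by hand.
rng = random.Random(0)
value = elbo([1.0], [0.5], [math.log(0.5)], lambda z: z, rng, n_samples=256)
```

For this toy linear-Gaussian model the encoder parameters used above coincide with the true posterior, so the estimate approaches the exact log-likelihood as the number of samples grows.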

3 Dimensionality Reduction Methods

3.1 Learned Low-Dimensional Prior with Variational Inference

3.1.1 Related Work

A powerful technique often adopted to improve generative modelling performance is to replace the standard prior distribution with a learned prior. In the case of VAEs, various flavours of learned priors have been proposed to improve their performance (Rezende and Mohamed, 2015; Turner and Sahani, 2011; Salimans et al., 2015; Mnih and Gregor, 2014). Kingma et al. (2016) and Rezende and Mohamed (2015) use Inverse Autoregressive Flow (IAF) and normalizing flows to improve the approximate posterior. Chen et al. (2016) improve the VAE latent code using Autoregressive Flows (AF), which also serves to improve the bits-back efficiency. For discrete inputs used in flow models, Ho et al. (2019) model the dequantization noise distribution as a conditional flow-based generative model. In VQ-VAE (Oord et al., 2017), the prior is learned with PixelCNN (Oord et al., 2016b, c) models. This technique has also been utilized in language modelling with LSTM decoders (Bowman et al., 2015) and in audio generation (with a WaveNet (Oord et al., 2016a) prior). We exploit this idea to replace the usual Gaussian latent space of a flow with a low-dimensional learned prior via variational inference.

3.1.2 Proposed Method

Latent variable models in general capture causal sources of variation in the data, referred to as the global feature representation. Flow models tend to capture the local variations along with the global feature representation. However, for datasets with high input redundancy, such as images, a flow model can be designed to discard some of the high-frequency noise while maintaining its knowledge of the global representation. We propose a cascaded Flow-VAE model, where the latent space of the flow is inferred variationally via a VAE.

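For the flow ($f_\theta$) introduced in Section 2.1, let the latent space $z$ be learnt from a lower-dimensional noise space $\epsilon$. Combining Equations 4 and 8, we have,

$$\log p_\theta(x) = \log p(z) + \log\left|\det\left(\frac{\partial f_\theta(x)}{\partial x}\right)\right| \ge \mathcal{L}(\theta, \phi; z) + \log\left|\det\left(\frac{\partial f_\theta(x)}{\partial x}\right)\right| \tag{11}$$

where $\mathcal{L}(\theta, \phi; z) = \mathbb{E}_{q_\phi(\epsilon|z)}\big[\log p_\theta(z|\epsilon)\big] - D_{KL}\big(q_\phi(\epsilon|z)\,\|\,p(\epsilon)\big)$ is the variational lower bound on $\log p(z)$. The R.H.S. of Equation 11, which acts as the new lower bound on the true likelihood, is maximized. Note that this lower bound is better than that of a stand-alone VAE due to the log-det term; hence our proposed model combines the strengths of both flows and VAEs.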

An important point to note here is that earlier attempts to combine VAEs with autoregressive or flow models reported that the autoregressive or flow part explained most of the structure in the data, while the latent code was barely used, if at all (Chen et al., 2016; Fabius and van Amersfoort, 2014; Chung et al., 2015; Fraccaro et al., 2016; Serban et al., 2017). Our approach does not run into this problem, since we propose the reverse: the flow model should explain most of the structure in the data and the latent code should be used whenever necessary. In this manner, the dimension of the latent space in our model is reduced at little or no expense of lost information. With careful design of the VAE part, our method can also have an added advantage: the sampling process in latent space can be easily manipulated to generate data with a desired global representation, similar in process to Glow (Kingma and Dhariwal, 2018).

3.2 Likelihood Contribution based Multiscale Architecture

3.2.1 Related Work

Another approach to dimensionality reduction of flow models is based on iterative early factorization of a part of the total dimensions at regular intervals. This contributes towards a computationally fast and memory-efficient architecture, as the dimension of the component flows (equivalently, the number of parameters to train) gradually builds up going from the latent space towards the input space. In the context of image recognition, this idea of dimensionality reduction has been beneficial in training very deep neural networks (Simonyan and Zisserman, 2014). Dinh et al. (2016) (RealNVP) proposed such a multi-scale architecture by gaussianizing half of the dimensions at each component flow construction phase, and finally composing each of them to form the final flow. For factorization of the dimensions, it first performs a squeezing operation to transform an $s \times s \times c$ tensor into an $\frac{s}{2} \times \frac{s}{2} \times 4c$ tensor, where $s$ is the image height/width and $c$ is the number of channels for image inputs, and splits the tensor into two halves along the channel dimension. One half of the split is gaussianized and the other half is passed on as an input to the next layer of flow. Kingma and Dhariwal (2018) (Glow) follow the same design choice for the multiscale architecture as RealNVP, combined with their proposed flow using invertible $1 \times 1$ convolutions.

It is apparent that the dimensions exposed to more layers of flow will be more expressive than the ones which get factored at a finer scale (an earlier layer). From this perspective, the method of splitting proposed by Dinh et al. (2016) is static in nature and does not distinguish between the importance of different dimensions. In the case of images, although this can generate visually convincing samples, as the squeezing and splitting operation preserves local variation in both parts of the split, it is not efficient from a density estimation perspective.
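The static squeeze-and-split step described above can be sketched as follows. This is a minimal illustration on nested Python lists indexed as `x[row][col][channel]`; the ordering of the four pixels along the new channel axis is an assumption made for the example, and actual implementations may order them differently.

```python
def squeeze(x):
    """Turn an s x s x c tensor into an (s/2) x (s/2) x 4c tensor.

    Each non-overlapping 2x2 spatial block is stacked along the channel axis.
    """
    s = len(x)
    assert s % 2 == 0, "spatial size must be even"
    out = []
    for i in range(0, s, 2):
        row = []
        for j in range(0, s, 2):
            stacked = []
            for di in (0, 1):
                for dj in (0, 1):
                    stacked.extend(x[i + di][j + dj])
            row.append(stacked)
        out.append(row)
    return out

def split_channels(x):
    """Static split: the first half of the channels is passed on to the next
    flow layer, the second half is factored out (gaussianized)."""
    half = len(x[0][0]) // 2
    passed_on = [[pixel[:half] for pixel in row] for row in x]
    factored = [[pixel[half:] for pixel in row] for row in x]
    return passed_on, factored

# Toy usage on a 4 x 4 x 1 tensor holding the values 0..15.
image = [[[4 * i + j] for j in range(4)] for i in range(4)]
squeezed = squeeze(image)
passed_on, factored = split_channels(squeezed)
```

The split here depends only on position, never on the data, which is the sense in which the scheme is static.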

3.2.2 Multi-scale Architecture based on Likelihood Contribution of Dimensions

Figure 1: Log-determinant based squeezing operation. (Left) The tensor representing the log-det of the variables in a flow layer, which is squeezed with local max and min operations. (Right) The black (white) marked pixels represent dimensions having more (less) log-det locally.

To perform a preferential splitting in the multiscale architecture, we propose a heuristic to decide the dimensions to be factored at an earlier layer. Recall from Equation 7 that the log-likelihood is composed of two terms, the log-latent-density term and the log-det term. The log-latent-density term depends on the choice of latent distribution and is fixed given the sampled latent variable, whereas the log-det term depends on the modelling of the flow layers. So, maximizing the log-det term results in maximized likelihood. We propose a two-step approach to build an efficient multiscale architecture which factors the dimensions at each layer such that the local variance in the input space is well captured and the log-det is maximized. In the first step, the dimension of all the component flows is kept the same as the input dimension and the network is trained to capture the log-dets at each layer. Based on the mechanism proposed below, we construct a mask at each layer to determine the dimensions to be gaussianized. In the second step, the mask at each layer, as constructed in step 1, is used for dimension factorization. We now present the mechanism to construct the mask at each layer.

Let the dimension of the input space be $s \times s \times c$, where $s$ is the image height/width and $c$ is the number of channels for image input. Let us apply the flow $f_\theta$ of Section 2.1 to an $(x, z)$ pair. To recall, it has $K$ component flows, and the intermediate variables are denoted by $h_0, \dots, h_K$ with $h_0 = x$, $h_K = z$. The log-det term at layer $l$, $L_d^{(l)}$, is given by,

$$L_d^{(l)} = \sum_{i=1}^{l} \log\left|\det\left(\frac{\partial h_i}{\partial h_{i-1}}\right)\right| = \sum_{a,b,k} \left[\mathbf{L}_d^{(l)}\right]_{a,b,k}$$

where $\mathbf{L}_d^{(l)}$ denotes the log-det viewed as an $s \times s \times c$ tensor with one entry for each of the variables, summed over the flow layers up to $l$. Each of the entries of $\mathbf{L}_d^{(l)}$ can be considered as the contribution towards the total log-det (and hence the log-likelihood) by the variable (dimension) corresponding to that entry. The entries having higher values correspond to the variables which contribute more towards the total log-likelihood, and hence are more valuable for a better flow formulation. So, while factorizing at each layer, the variables with a larger log-det term should be exposed to more layers of flow and the ones with a smaller log-det term should be directly gaussianized. In this manner, we give more expressive power to the variables which capture meaningful representation (and are more valuable from a log-det perspective) by exposing them to multiple flow layers. However, for datasets like images, it is important that the local variance is well captured. In summary, an ideal factorization method should,

  1. [For efficient density estimation] Gaussianize the variables having less log-det in a flow layer and expose the ones having more log-det to more flow layers.

  2. [For qualitative reconstruction] Capture the local variance over the flow layers, i.e. the part of the factorization being exposed to more flow layers should contain representative pixels covering the whole image.
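To make the notion of a per-variable log-det contribution concrete, consider a single affine coupling layer in the form introduced by Dinh et al. (2016). The symbols $u$, $y$, $d$, $D$, $a(\cdot)$ and $b(\cdot)$ below are notation assumed here for illustration only: $u$ and $y$ are the layer input and output with $D$ variables, the first $d$ variables pass through unchanged, and $a$, $b$ are the learned log-scale and shift functions.

```latex
\begin{aligned}
y_{1:d}   &= u_{1:d}, \\
y_{d+1:D} &= u_{d+1:D} \odot \exp\big(a(u_{1:d})\big) + b(u_{1:d}), \\
\log\left|\det\frac{\partial y}{\partial u}\right|
          &= \sum_{j=1}^{D-d} a(u_{1:d})_j .
\end{aligned}
```

Because the Jacobian is triangular, the log-det is a plain sum with one term per transformed variable: a pass-through variable contributes $0$ and the $j$-th transformed variable contributes $a(u_{1:d})_j$. Accumulating these terms over the layers yields one value per variable, which is the quantity the two requirements above rank.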

Variants of hybrid techniques for factorization satisfying above two requirements can be implemented to improve the multiscale architecture. We present here such implementations for RealNVP (Dinh et al., 2016) and Glow (Kingma and Dhariwal, 2018).

Implementation for RealNVP: The proposed factorization method involves a log-det contribution based squeezing. It converts the $s \times s \times c$ shaped tensor into an $\frac{s}{2} \times \frac{s}{2} \times 4c$ shaped tensor using max-pooling and min-pooling (= −max-pooling(−input)) operations, as illustrated in Figure 1. Among the $4c$ channels, one half contains the dimensions having a larger log-det term than their local neighbourhood (marked black in Figure 1), and the other half contains the dimensions having a smaller log-det (marked white in Figure 1). Finally, the mask is constructed so as to pass the dimensions contributing more to the likelihood into more flow layers and to gaussianize early the ones contributing less.
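One plausible reading of this log-det based squeezing is sketched below; it is an assumption for illustration, not the authors' implementation. Within every 2x2 spatial block of each channel, the two entries with the larger accumulated log-det are routed to the first half of the squeezed channels (to be passed on to further flow layers) and the two with the smaller log-det to the second half (to be gaussianized). The function names are hypothetical.

```python
def logdet_squeeze_order(logdet):
    """Build a data-dependent squeeze from an s x s x c log-det tensor.

    Returns order[i][j], a list of 4c source indices (row, col, channel) for
    each squeezed pixel: the first 2c have the locally larger log-det, the
    last 2c the locally smaller log-det.
    """
    s = len(logdet)
    c = len(logdet[0][0])
    assert s % 2 == 0, "spatial size must be even"
    order = []
    for i in range(0, s, 2):
        row = []
        for j in range(0, s, 2):
            high, low = [], []
            for k in range(c):
                cells = [(i + di, j + dj, k) for di in (0, 1) for dj in (0, 1)]
                # Rank the four entries of this block by log-det, largest first.
                cells.sort(key=lambda p: logdet[p[0]][p[1]][p[2]], reverse=True)
                high.extend(cells[:2])
                low.extend(cells[2:])
            row.append(high + low)
        order.append(row)
    return order

def apply_order(x, order):
    """Squeeze a tensor x (same shape as the log-det tensor) using the order."""
    return [[[x[a][b][k] for (a, b, k) in cell] for cell in row] for row in order]

# Toy usage on a 2 x 2 x 1 log-det tensor.
toy_logdet = [[[0.1], [0.9]], [[0.5], [-0.3]]]
toy_order = logdet_squeeze_order(toy_logdet)
toy_squeezed = apply_order(toy_logdet, toy_order)
```

Since every 2x2 block contributes entries to both halves, each half still covers the whole image, which is how this kind of scheme addresses the local-variance requirement.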

Implementation for Glow: The main step in the implementation for Glow is to obtain the log-det term for all the channels. Unlike RealNVP, where the log-det term is diagonal, Glow contains invertible $1 \times 1$ convolution blocks having a non-diagonal log-det term for the channel dimensions. For an $s \times s \times c$ input $h$ and a $c \times c$ weight matrix $W$, it is given by,

$$\log\left|\det\left(\frac{d\,\mathrm{conv2D}(h; W)}{d\,h}\right)\right| = s \cdot s \cdot \log|\det(W)|$$

It remains to obtain the individual contribution of each channel towards the per-pixel log-det term ($\log|\det(W)|$). As a suitable candidate, the singular values of $W$ correspond to the contribution from each channel dimension, so their log values are the individual log-det contributions. This is justified, as the product of the singular values of a square matrix equals the absolute value of its determinant, i.e.

$$\sum_{k=1}^{c} \log \sigma_k = \log|\det(W)|$$

where $\sigma_k$ are the singular values of the weight matrix $W$.

Once the log-det contribution term for each of the dimensions is obtained, a method similar to that for RealNVP is employed for factorization of dimensions at each flow layer in the multiscale architecture.
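The singular-value identity can be checked numerically. The sketch below is restricted to a 2x2 weight matrix so that the singular values have a closed form and no linear-algebra library is needed; the function name and the example matrix are hypothetical, and for a general $c \times c$ matrix a library SVD routine would be used instead.

```python
import math

def singular_values_2x2(w):
    """Singular values of a 2x2 matrix via the eigenvalues of W^T W."""
    (a, b), (c, d) = w
    # Entries of the symmetric matrix W^T W = [[p, q], [q, r]].
    p = a * a + c * c
    q = a * b + c * d
    r = b * b + d * d
    mean = 0.5 * (p + r)
    radius = math.sqrt((0.5 * (p - r)) ** 2 + q * q)
    return math.sqrt(mean + radius), math.sqrt(max(mean - radius, 0.0))

# Example weight matrix (illustrative values) with determinant -3.5.
weight = [[2.0, 1.0], [0.5, -1.5]]
sigma = singular_values_2x2(weight)
per_channel_logdet = [math.log(v) for v in sigma]
determinant = weight[0][0] * weight[1][1] - weight[0][1] * weight[1][0]
total_logdet = math.log(abs(determinant))
```

Here `per_channel_logdet` plays the role of the individual contributions, and their sum matches `total_logdet` up to floating-point error even though the determinant itself is negative.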

4 Experiments

In this section we present the results of experiments conducted to evaluate the performance of our proposed methods. In Section 4.1, the results obtained by replacing the standard prior with a learned prior (the method described in Section 3.1) are presented. In Section 4.2, we detail the qualitative and quantitative results obtained by implementing our proposed likelihood contribution based factorization of dimensions over the RealNVP multiscale architecture (as described in Section 3.2). Results from our method applied over Glow's multiscale architecture have not been included in this version due to computational constraints, and will be presented in a future revision of this manuscript.

Datasets: We perform experiments on four benchmark image datasets: CIFAR-10 (Krizhevsky, 2009), Imagenet (Russakovsky et al., 2014) (downsampled to $32 \times 32$ and $64 \times 64$), and CelebFaces Attributes (CelebA) (Liu et al., 2015).

Pre-processing: For CelebA, we take a central crop and then resize it to $64 \times 64$. For dequantization of images (whose values lie in $[0, 256]^D$), we use the same technique as RealNVP (Dinh et al., 2016), i.e. the data is transformed to $\mathrm{logit}\left(\alpha + (1 - \alpha) \odot \frac{x}{256}\right)$, where $\alpha = 0.05$ as in RealNVP. To further improve results, the dequantization methods introduced by Ho et al. (2019) can also be adopted. The sample allocation for training and validation follows the official allocation for each dataset.
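The RealNVP-style pre-processing can be sketched per pixel as below. This is a minimal illustration, assuming uniform dequantization noise in $[0, 1)$ and the value $\alpha = 0.05$ reported for RealNVP; the function names are hypothetical.

```python
import math
import random

def preprocess_pixel(value, rng, alpha=0.05):
    """Dequantize an integer pixel in {0, ..., 255} and map it to logit space."""
    noisy = value + rng.random()                # continuous value in [0, 256)
    p = alpha + (1.0 - alpha) * noisy / 256.0   # squashed into [alpha, 1)
    return math.log(p) - math.log(1.0 - p)      # logit(p)

def postprocess_pixel(y, alpha=0.05):
    """Invert the logit transform back to the continuous [0, 256) range."""
    p = 1.0 / (1.0 + math.exp(-y))
    return (p - alpha) / (1.0 - alpha) * 256.0

# Toy usage: the recovered value lies in [pixel, pixel + 1).
rng = random.Random(0)
pixel = 200
recovered = postprocess_pixel(preprocess_pixel(pixel, rng))
```

Keeping $p$ at least $\alpha$ away from zero avoids taking the logit of values at the lower boundary.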

Flow model architecture: We use affine coupling layers as introduced by Dinh et al. (2016). A layer of flow is defined as 3 coupling layers with checkerboard splits at $s \times s$ resolution followed by 3 coupling layers with channel splits at $\frac{s}{2} \times \frac{s}{2}$ resolution, where $s$ is the resolution at the input of that layer. For datasets having resolution 32, we use 3 such layers and for those having resolution 64, we use 4 layers. The cascade of these layers is followed by 4 coupling layers with checkerboard splits at the final resolution, marking the end of the flow composition. For CIFAR-10, each coupling layer uses 8 residual blocks. Other datasets having images of size $32 \times 32$ use 4 residual blocks, whereas $64 \times 64$ ones use 2 residual blocks. More details on the architectures will be given in a source code release.
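The two kinds of static splits used by the coupling layers can be sketched as binary masks. In the example below, a mask value of 1 marks a variable that passes through unchanged and conditions the transformation of the variables marked 0; this convention, the function names and the `invert` flag are assumptions for illustration.

```python
def checkerboard_mask(s, invert=False):
    """s x s spatial mask alternating between 0 and 1 like a checkerboard."""
    flip = 1 if invert else 0
    return [[((i + j) % 2) ^ flip for j in range(s)] for i in range(s)]

def channel_mask(c, invert=False):
    """Length-c mask selecting the first half of the channels (or the second
    half when inverted)."""
    half = c // 2
    mask = [1] * half + [0] * (c - half)
    return [1 - m for m in mask] if invert else mask

# Toy usage: alternate the roles of the two halves across coupling layers.
spatial_masks = [checkerboard_mask(4, invert=(k % 2 == 1)) for k in range(3)]
channel_masks = [channel_mask(4, invert=(k % 2 == 1)) for k in range(3)]
```

Alternating the mask between consecutive coupling layers ensures that every variable is transformed by some layer.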

Optimization parameters: We optimize with ADAM (Kingma and Ba, 2014) with default hyperparameters and use an $L_2$ regularization on the weight scale parameters. A batch size of 64 was used. The computations were performed on NVIDIA Tesla V100 GPUs.

4.1 Results with learned low-dimensional prior

We infer the latent space of flow by a VAE and vary the latent space dimension of VAE to capture the variation in image sample quality.

(a) Samples from the model trained on the CelebA dataset, without low-temperature sampling. Rows 2-5 each show the results for a different latent-space dimension of the VAE prior.
(b) Samples from the model with a VAE latent dimension of one quarter of the total input dimension, trained on the CIFAR-10 dataset.
Figure 2: Samples from the flow model with learned prior trained on the CelebA and CIFAR-10 datasets

Variational Autoencoder architecture: For the encoder, we use 3 convolutional layers with 3, 32 and 32 filters respectively, followed by a fully-connected layer to match the desired VAE latent space size. The decoder path is similar to the encoder path but with deconvolutional layers.

Figure 2 depicts samples generated from the Flow-VAE model trained end-to-end on the CelebA and CIFAR-10 datasets; in Figure 2(a) the latent dimension of the VAE is varied. It is apparent that even with an effectively lower-dimensional latent space, the model is able to generate good-quality samples. The important benefit of such a hybrid model is that it strikes a balance between the global representation and the local variance in the data. For the CelebA dataset, the local variance includes hair style and hair colour, which, although not generated as accurately as by pure flow models, suffice to give the faces their characteristic features. The global representation, such as the facial features, is constructed well. Figure 2(b) shows samples from the Flow-VAE model trained on CIFAR-10. Although, owing to the inherent properties of VAEs, these samples are not competitive with GAN samples of natural images (Brock et al., 2019), they are perceptually competitive with those of similar models, even with a low-dimensional noise space.

4.2 Results with proposed likelihood contribution based multiscale architecture

We improve the multiscale architecture proposed by Dinh et al. (2016) with our proposed factorization method. The training process consists of two steps: the first obtains the log-det term at each layer, and the second trains the multiscale architecture with the proposed factorization method, using the log-det terms obtained in the first step.

(a) Examples from the dataset.
(b) Samples from the trained model.
Figure 3: Samples from the flow model with the proposed likelihood contribution based multiscale architecture trained on different datasets. The datasets shown in this figure are, in order: CIFAR-10, Imagenet ($32 \times 32$), Imagenet ($64 \times 64$) and CelebA.

Since our method performs factorization based on the likelihood contribution of dimensions at each flow layer, it can be adopted in any flow model having a multiscale architecture. We apply our factorization method while keeping all other architectural details (coupling layers, residual blocks) the same as in RealNVP (Dinh et al., 2016) and Glow (Kingma and Dhariwal, 2018).

Model CelebA CIFAR-10 ImageNet 32x32 ImageNet 64x64
RealNVP (Dinh et al., 2016)
RealNVP with Likelihood Contribution based Factorization (ours)
Table 1: Improvements in Log-likelihood for RealNVP and Glow models via proposed method

Multiscale Architecture: We use the same architecture as proposed in RealNVP, with scaling once for CIFAR-10, thrice for Imagenet ($32 \times 32$) and 4 times for Imagenet ($64 \times 64$) and CelebA.

The improvement in log-likelihood score using our method is summarized in Table 1. With our factorization method applied to RealNVP, the log-likelihood score improves for each dataset. We observed that the improvement for CelebA is relatively high compared with natural image datasets like CIFAR-10 or Imagenet. This can be attributed to the high redundancy in facial features.

Figure 4: Smooth linear interpolations in latent space between two images from the CelebA dataset

Figure 3 shows the original datasets and samples from the models trained on those datasets. Consistent with Dinh et al. (2016), the local variance in the samples, especially for CelebA, is well captured along with the global feature representation. The backgrounds of natural images, such as those from Imagenet, are well reconstructed. The inference process is parallelized over dimensions and hence is faster than that of autoregressive models (Oord et al., 2016c; Salimans et al., 2017; Parmar et al., 2018). We took a pair of images from the CelebA dataset, obtained their representations in latent space and performed a linear interpolation between the latent codes to generate the interpolation samples shown in Figure 4. The smooth interpolations indicate a well-structured latent space.
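The interpolation procedure amounts to a convex combination of two latent codes, as sketched below. The function name is hypothetical, and the encoding and decoding steps (the flow and its inverse) are assumed to be available separately.

```python
def interpolate_latents(z_a, z_b, n_steps):
    """Return n_steps latent codes evenly spaced on the segment from z_a to z_b."""
    assert n_steps >= 2, "need at least the two end points"
    codes = []
    for k in range(n_steps):
        t = k / (n_steps - 1)
        codes.append([(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return codes

# Toy usage with 2-dimensional latent codes.
path = interpolate_latents([0.0, 2.0], [4.0, 0.0], 3)
```

Each intermediate code would then be mapped back to image space through the inverse flow to obtain one frame of the interpolation.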

5 Conclusions and Future Work

We proposed methods to reduce the latent space dimension of flow-based generative models, which can help them scale to higher dimensions with efficient computation and memory requirements. Empirical studies conducted on benchmark image datasets validate the strength of our proposed methods, which improve log-likelihood scores for flow models with a multiscale architecture (Dinh et al., 2016) and are able to generate good-quality samples. One line of future work is to design or learn a masking scheme for factorization on the fly during training (possibly as a parallel training process), while preserving the flow properties, to further reduce computation. A reinforcement learning based masking scheme can also be adopted, with appropriate reward functions and with mask values as states.


  • A. J. Bell and T. J. Sejnowski (1995) An information-maximization approach to blind separation and blind deconvolution. Neural Computation 7 (6), pp. 1129–1159.
  • S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio (2015) Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
  • A. Brock, J. Donahue, and K. Simonyan (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations.
  • X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel (2016) Variational lossy autoencoder. arXiv preprint arXiv:1611.02731.
  • X. Chen, N. Mishra, M. Rohaninejad, and P. Abbeel (2017) Pixelsnail: an improved autoregressive generative model. arXiv preprint arXiv:1712.09763.
  • J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio (2015) A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems 28, pp. 2980–2988.
  • M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier (2017) Parseval networks: improving robustness to adversarial examples. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 854–863.
  • L. Dinh, D. Krueger, and Y. Bengio (2014) Nice: non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
  • L. Dinh, J. Sohl-Dickstein, and S. Bengio (2016) Density estimation using real NVP. CoRR abs/1605.08803.
  • J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts (2019) Gansynth: adversarial neural audio synthesis. arXiv preprint arXiv:1902.08710.
  • O. Fabius and J. R. van Amersfoort (2014) Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581.
  • M. Fraccaro, S. K. Sønderby, U. Paquet, and O. Winther (2016) Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, pp. 2199–2207.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  • A. Graves (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
  • J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel (2019) Flow++: improving flow-based generative models with variational dequantization and architecture design. arXiv preprint arXiv:1902.00275.
  • T. Karras, S. Laine, and T. Aila (2018) A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224.
  • D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling (2014) Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589.
  • D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751.
  • A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
  • Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738.
  • A. Mnih and K. Gregor (2014) Neural variational inference and learning in belief networks. CoRR abs/1402.0030.
  • A. Odena (2016) Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583.
  • A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu (2016a) Wavenet: a generative model for raw audio. arXiv preprint arXiv:1609.03499.
  • A. v. d. Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. (2016b) Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798.
  • A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu (2016c) Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759.
  • A. v. d. Oord, O. Vinyals, et al. (2017) Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315.
  • N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, A. Ku, and D. Tran (2018) Image transformer. arXiv preprint arXiv:1802.05751.
  • D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
  • D. J. Rezende and S. Mohamed (2015) Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li (2014) ImageNet large scale visual recognition challenge. CoRR abs/1409.0575.
  • T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma (2017) PixelCNN++: improving the pixelcnn with discretized logistic mixture likelihood and other modifications. CoRR abs/1701.05517.
  • T. Salimans, D. Kingma, and M. Welling (2015) Markov chain monte carlo and variational inference: bridging the gap. In International Conference on Machine Learning, pp. 1218–1226.
  • I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. Courville, and Y. Bengio (2017) A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • R. E. Turner and M. Sahani (2011) Two problems with variational expectation maximisation for time series models. In Bayesian Time Series Models, pp. 104–124.
  • B. Uria, I. Murray, and H. Larochelle (2013) RNADE: the real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems, pp. 2175–2183.
  • B. Zhang, D. Xiong, J. Su, H. Duan, and M. Zhang (2016) Variational neural machine translation. arXiv preprint arXiv:1605.07869.