Unsupervised learning of probabilistic models is a central yet challenging problem. Deep generative models have shown promising results in modeling complex distributions such as natural images (Radford et al., 2015), audio (Van Den Oord et al., 2016), and text (Bowman et al., 2015). A number of approaches have emerged over the years, including Variational Autoencoders (VAEs) (Kingma and Welling, 2014), Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), autoregressive neural networks (Larochelle and Murray, 2011; Oord et al., 2016), and flow-based generative models (Dinh et al., 2014, 2016; Kingma and Dhariwal, 2018). Among these, flow-based generative models have gained popularity for their capability of estimating densities of complex distributions, efficiently generating high-fidelity syntheses, and automatically learning useful latent spaces.
Flow-based generative models typically warp a simple distribution into a complex one by mapping points from the simple distribution to the complex data distribution through a chain of invertible transformations whose Jacobian determinants are efficient to compute. This design guarantees that the density of the transformed distribution can be estimated analytically, making maximum likelihood learning feasible. Flow-based generative models have spawned significant interest in improving and analyzing them from both theoretical and practical perspectives, and in applying them to a wide range of tasks and domains.
In their pioneering work, Dinh et al. (2014) proposed Non-linear Independent Component Estimation (NICE), the first application of flow-based models to modeling complex high-dimensional densities. RealNVP (Dinh et al., 2016) extended NICE with more flexible invertible transformations and experimented on natural images. However, these flow-based generative models have much worse density estimation performance compared to state-of-the-art autoregressive models, and are incapable of realistic-looking synthesis of large images compared to GANs (Karras et al., 2018; Brock et al., 2019). Recently, Kingma and Dhariwal (2018) proposed Glow, a generative flow with invertible 1×1 convolutions, significantly improving the density estimation performance on natural images. Importantly, they demonstrated that flow-based generative models optimized towards the plain likelihood-based objective are capable of efficiently generating realistic-looking, high-resolution natural images. Prenger et al. (2018) investigated applying flow-based generative models to speech synthesis by combining Glow with WaveNet (Van Den Oord et al., 2016). Unfortunately, the density estimation performance of Glow on natural images still falls behind autoregressive models such as PixelRNN/CNN (Oord et al., 2016; Salimans et al., 2017), Image Transformer (Parmar et al., 2018), PixelSNAIL (Chen et al., 2017), and SPN (Menick and Kalchbrenner, 2019). We note in passing that there is also work (Rezende and Mohamed, 2015; Kingma et al., 2016; Zheng et al., 2017) applying flows to variational inference.
In this paper, we propose a novel architecture of generative flow, masked convolutional generative flow (MaCow), using masked convolutional neural networks (Oord et al., 2016). The bijective mapping between input and output variables can easily be established; meanwhile, computation of the determinant of the Jacobian is efficient. Compared to inverse autoregressive flow (IAF) (Kingma et al., 2016), MaCow has the merits of stable training and efficient inference and synthesis, by restricting the local connectivity to a small "masked" kernel, and of large receptive fields, by stacking multiple layers of convolutional flows and using masks in reversed ordering (§3.1). We also propose a fine-grained version of the multi-scale architecture adopted in previous flow-based generative models to further improve performance (§3.2). Experimentally, on three benchmark image datasets (CIFAR-10, ImageNet, and CelebA-HQ), we demonstrate the effectiveness of MaCow as a density estimator by consistently achieving significant improvements over Glow on all three datasets. When equipped with the variational dequantization mechanism (Ho et al., 2018), MaCow considerably narrows the gap in density estimation to autoregressive models (§4).
2 Flow-based Generative Models
In this section, we first set up notations, then describe flow-based generative models and review Glow (Kingma and Dhariwal, 2018), upon which MaCow is built.
2.1 Notations
Throughout this paper, we use uppercase letters for random variables and lowercase letters for realizations of the corresponding random variables. Let $X \in \mathcal{X}$ be the random variable of the observed data; e.g., $x$ is an image or a sentence for image and text generation, respectively.
Let $p^{*}(x)$ denote the true distribution of the data, i.e., $X \sim p^{*}(x)$, and let $D = \{x_1, \ldots, x_N\}$ be our training sample, where the $x_i$ are usually i.i.d. samples of $X$. Let $\mathcal{P} = \{p_\theta : \theta \in \Theta\}$ denote a parametric statistical model indexed by the parameter $\theta$, where $\Theta$ is the parameter space; $p_\theta$ is used to denote the density of the corresponding distribution. In the literature of deep generative models, deep neural networks are the most widely used parametric models. The goal of generative models is to learn the parameter $\theta$ such that $p_\theta(x)$ can best approximate the true distribution $p^{*}(x)$. In the context of maximum likelihood estimation, we wish to minimize the negative log-likelihood of the parameters:
$$\theta^{*} = \operatorname*{argmin}_{\theta \in \Theta} \; \mathbb{E}_{x \sim \tilde{p}(x)}\big[-\log p_\theta(x)\big], \tag{1}$$
where $\tilde{p}(x)$ is the empirical distribution derived from the training data $D$.
2.2 Flow-based Models
In the framework of flow-based generative models, a set of latent variables $Z \in \mathcal{Z}$ is introduced with a prior distribution $p_Z(z)$, typically a simple distribution like a multivariate Gaussian. For a bijection $f_\theta: \mathcal{X} \to \mathcal{Z}$ (with $g_\theta = f_\theta^{-1}$), the change of variables formula defines the model distribution on $X$ by
$$p_\theta(x) = p_Z\big(f_\theta(x)\big)\left|\det\!\left(\frac{\partial f_\theta(x)}{\partial x}\right)\right|,$$
where $\frac{\partial f_\theta(x)}{\partial x}$ is the Jacobian of $f_\theta$ at $x$.
The generative process is defined straightforwardly as
$$z \sim p_Z(z), \qquad x = g_\theta(z).$$
Flow-based generative models focus on certain types of transformations for which both the inverse functions and the Jacobian determinants are tractable to compute. By composing multiple such invertible transformations in a sequence, which is also called a (normalizing) flow (Rezende and Mohamed, 2015), a flow is capable of warping a simple distribution ($p_Z(z)$) into a complex one ($p_\theta(x)$):
$$x \;\overset{f_1}{\longleftrightarrow}\; h_1 \;\overset{f_2}{\longleftrightarrow}\; h_2 \;\cdots\; \overset{f_K}{\longleftrightarrow}\; z,$$
where $f = f_K \circ \cdots \circ f_2 \circ f_1$ is a flow of $K$ transformations. For brevity, we omit the parameter $\theta$ from $f_\theta$ and $g_\theta$.
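To make the change-of-variables computation concrete, here is a minimal NumPy sketch (our illustration, not the paper's code) using a toy elementwise affine bijection with a standard Gaussian prior; the names `forward`, `inverse`, and `log_density` are hypothetical:

```python
import numpy as np

# Toy bijection f(x) = scale * x + shift applied elementwise.
# Its Jacobian is diagonal, so log|det J| is just sum(log|scale|).
def forward(x, scale, shift):
    return scale * x + shift, np.sum(np.log(np.abs(scale)))

def inverse(z, scale, shift):
    return (z - shift) / scale

def log_density(x, scale, shift):
    # Change of variables: log p(x) = log p_Z(f(x)) + log|det df/dx|,
    # with a standard Gaussian prior p_Z.
    z, logdet = forward(x, scale, shift)
    log_pz = -0.5 * np.sum(z ** 2) - 0.5 * z.size * np.log(2 * np.pi)
    return log_pz + logdet

x = np.array([0.3, -1.2, 0.7])
scale = np.array([2.0, 0.5, 1.5])
shift = np.array([0.1, 0.0, -0.2])
z, logdet = forward(x, scale, shift)
x_rec = inverse(z, scale, shift)  # round trip recovers x exactly
```

Composing several such bijections simply chains the forward maps and sums the log-determinants, which is all a flow does at the level of the density computation.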
In recent years, a number of types of invertible transformations have emerged to enhance the expressiveness of flows, among which Glow (Kingma and Dhariwal, 2018) has stood out for its simplicity and effectiveness on both density estimation and high-fidelity synthesis. We briefly describe the three types of transformations that Glow consists of.
Actnorm. Kingma and Dhariwal (2018) proposed an activation normalization layer (Actnorm) as an alternative to batch normalization (Ioffe and Szegedy, 2015) to alleviate problems in model training. Similar to batch normalization, Actnorm performs an affine transformation of the activations using a scale and a bias parameter per channel for 2D images:
$$y_{i,j} = s \odot x_{i,j} + b,$$
where both $x$ and $y$ are tensors of shape $[h \times w \times c]$, with spatial dimensions $(h, w)$ and channel dimension $c$.
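A toy NumPy sketch of the Actnorm transform and its log-determinant (Glow additionally initializes $s$ and $b$ from data statistics, which we omit here; the function names are ours):

```python
import numpy as np

def actnorm_forward(x, s, b):
    # x: [h, w, c]; s, b: [c]. Per-channel affine transform.
    y = s * x + b
    # The Jacobian is diagonal, with each s_c repeated h*w times.
    h, w, _ = x.shape
    logdet = h * w * np.sum(np.log(np.abs(s)))
    return y, logdet

def actnorm_inverse(y, s, b):
    return (y - b) / s

x = np.random.default_rng(0).normal(size=(4, 4, 3))
s = np.array([1.5, 0.8, 2.0])
b = np.array([0.1, -0.3, 0.0])
y, logdet = actnorm_forward(x, s, b)
```

The inverse is again a per-channel affine map, so Actnorm adds negligible cost in either direction.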
Invertible 1×1 convolution. To incorporate a permutation along the channel dimension, Glow includes a trainable invertible 1×1 convolution layer that generalizes the permutation operation:
$$y_{i,j} = W x_{i,j},$$
where $W$ is a weight matrix of shape $c \times c$.
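Since the same $W$ acts on the channel vector at every spatial position, the log-determinant of the whole layer is $h \cdot w \cdot \log|\det W|$. A toy NumPy sketch (our illustration, not Glow's implementation):

```python
import numpy as np

def inv1x1_forward(x, W):
    # x: [h, w, c]; W: [c, c] invertible. A 1x1 convolution applies the
    # same linear map to the channel vector at every spatial position.
    y = x @ W.T
    h, w, _ = x.shape
    logdet = h * w * np.log(np.abs(np.linalg.det(W)))
    return y, logdet

def inv1x1_inverse(y, W):
    return y @ np.linalg.inv(W).T

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 4, 3))
W = np.linalg.qr(rng.normal(size=(3, 3)))[0]  # orthogonal init: |det W| = 1
y, logdet = inv1x1_forward(x, W)
```

With the orthogonal initialization used here the initial log-determinant is zero, which is one reason such initializations are convenient for training.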
Affine Coupling Layers. Following Dinh et al. (2016), Glow includes affine coupling layers in its architecture:
$$\begin{aligned} x_a, x_b &= \mathrm{split}(x), \\ y_a &= x_a, \\ y_b &= s(x_a) \odot x_b + b(x_a), \\ y &= \mathrm{concat}(y_a, y_b), \end{aligned}$$
where $s(x_a)$ and $b(x_a)$ are outputs of two neural networks with $x_a$ as input. The $\mathrm{split}()$ and $\mathrm{concat}()$ functions perform their operations along the channel dimension.
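The coupling computation can be sketched directly in NumPy (a toy version with stand-in functions for the two networks; invertibility only requires that $s$ and $b$ depend on the unchanged half):

```python
import numpy as np

def coupling_forward(x, s_net, b_net):
    # Split along channels; xa passes through unchanged and
    # parameterizes an affine transform of xb.
    c = x.shape[-1] // 2
    xa, xb = x[..., :c], x[..., c:]
    s, b = s_net(xa), b_net(xa)
    yb = s * xb + b
    logdet = np.sum(np.log(np.abs(s)))
    return np.concatenate([xa, yb], axis=-1), logdet

def coupling_inverse(y, s_net, b_net):
    # The inverse re-computes s and b from the unchanged half ya = xa.
    c = y.shape[-1] // 2
    ya, yb = y[..., :c], y[..., c:]
    s, b = s_net(ya), b_net(ya)
    return np.concatenate([ya, (yb - b) / s], axis=-1)

# Stand-in "networks": any functions of xa work.
s_net = lambda xa: np.exp(np.tanh(xa))  # strictly positive scales
b_net = lambda xa: 0.5 * xa

x = np.random.default_rng(2).normal(size=(4, 4, 6))
y, logdet = coupling_forward(x, s_net, b_net)
```

Note that the networks themselves never need to be inverted, which is what makes coupling layers cheap to invert regardless of how complex $s$ and $b$ are.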
From the designed architecture of Glow, we see that interactions between spatial dimensions are incorporated only in the coupling layers. The coupling layer, however, is typically costly in memory, making it infeasible to stack a large number of coupling layers in a single model, particularly for high-resolution images. The main goal of this work is to design a new type of transformation that simultaneously models dependencies in both the spatial and channel dimensions while maintaining a relatively low memory footprint, in the hope of improving the capacity of the generative flow.
3 Masked Convolutional Generative Flows
In this section, we describe the architectural components that compose the masked convolutional generative flow (MaCow). We first introduce the proposed flow transformation using masked convolutions in §3.1. Then, we present a fine-grained version of the multi-scale architecture adopted in previous generative flows (Dinh et al., 2016; Kingma and Dhariwal, 2018) in §3.2. In §3.3, we briefly revisit the dequantization problem involved in flow-based generative models.
3.1 Flow with Masked Convolutions
Applying autoregressive models to normalizing flows has been explored in previous studies (Kingma et al., 2016; Papamakarios et al., 2017). The idea behind autoregressive flows is to model the input random variables sequentially in an autoregressive order, ensuring that the model cannot read input variables behind the current one:
$$y_t = s(x_{<t}) \odot x_t + b(x_{<t}),$$
where $x_{<t}$ denotes the input variables in $x$ that are in front of $x_t$ in the autoregressive order. $s()$ and $b()$ are two autoregressive neural networks, typically implemented using spatial masks (Germain et al., 2015; Oord et al., 2016).
Despite their effectiveness in high-dimensional spaces, autoregressive flows suffer from two crucial problems: (1) the training procedure is unstable when modeling long-range contexts and stacking multiple layers; (2) inference and synthesis are inefficient, due to the non-parallelizable inverse function.
We propose to use masked convolutions that restrict the local connectivity to a small "masked" kernel to address these two problems. The two autoregressive neural networks, $s()$ and $b()$, are implemented with one-layer masked convolutional networks with small kernels, so that they can only read contexts in a small neighborhood:
$$y_t = s(x_{<t}) \odot x_t + b(x_{<t}),$$
where $x_{<t}$ denotes the input variables, restricted to a small kernel, on which $x_t$ depends. By using masks in reversed ordering and stacking multiple layers of flows, the model can capture a large receptive field (see Figure 1) and is able to model dependencies in both the spatial and channel dimensions.
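As an illustration of the kind of kernel mask involved (our toy construction, not the exact masks used in MaCow), the following builds a small mask that reads only positions above the center, or to its left on the same row, plus its 180-degree rotation as the reversed ordering:

```python
import numpy as np

def causal_kernel_mask(kh, kw):
    # Mask for a small convolution kernel: positions strictly above the
    # center, or on the center row strictly to its left, are visible;
    # the center itself and everything after it are zeroed out.
    mask = np.zeros((kh, kw))
    ch, cw = kh // 2, kw // 2
    mask[:ch, :] = 1.0   # rows above the center
    mask[ch, :cw] = 1.0  # left of the center on the same row
    return mask

m = causal_kernel_mask(3, 5)
# Rotating the mask 180 degrees gives the mirrored ordering; stacking
# flows with both orderings grows the receptive field in both directions.
m_rev = m[::-1, ::-1]
```

Multiplying a convolution's weights elementwise by such a mask is what makes the layer autoregressive within the kernel while leaving everything outside the kernel unread.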
As discussed above, synthesis from autoregressive flows is inefficient, since the inverse must be computed by sequentially traversing the autoregressive order. In the context of 2D images with shape $[h \times w \times c]$, the time complexity of synthesis is $O(h \cdot w \cdot T(h, w, c))$, where $T(h, w, c)$ is the time of computing the outputs of the neural networks $s()$ and $b()$ on an input of shape $[h \times w \times c]$. In our proposed flow with masked convolutions, the computation of $x_t$ can begin as soon as all the inputs inside its kernel are available, contrary to the autoregressive requirement that all preceding $x_{<t}$ must be computed. Moreover, at each step we only need to feed the input restricted to the kernel (with shape $[k_h \times k_w \times c]$) into $s()$ and $b()$, where $k_h \times k_w$ is the shape of a kernel in the convolution. Thus, the time complexity of synthesis is reduced significantly: each step requires only a kernel-sized network evaluation, and computations of variables outside each other's receptive fields can proceed in parallel.
3.2 Fine-grained Multi-Scale Architecture
Dinh et al. (2016) proposed a multi-scale architecture using a squeezing operation, which has been demonstrated to be helpful for training very deep flows. In the original multi-scale architecture, the model factors out half of the dimensions at each scale to reduce computational and memory costs. In this paper, inspired by the size upscaling in subscale ordering (Menick and Kalchbrenner, 2019), which generates an image as a sequence of sub-images of equal size, we propose a fine-grained multi-scale architecture to further improve model performance. In this fine-grained multi-scale architecture, each scale consists of $M$ blocks, and after each block the model splits out a fraction of the dimensions of the input. Note that the original multi-scale architecture is a special case of the fine-grained version with $M = 1$. Figure 5 illustrates the graphical specification of the two versions of the architecture. Experimental improvements demonstrate the effectiveness of the fine-grained multi-scale architecture (§4).
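The squeeze-and-split machinery behind these multi-scale architectures can be sketched as follows (a simplified single-example NumPy version; `split_out` and its fraction argument are our stand-ins for the split size):

```python
import numpy as np

def squeeze(x):
    # Trade spatial resolution for channels: [h, w, c] -> [h/2, w/2, 4c].
    h, w, c = x.shape
    x = x.reshape(h // 2, 2, w // 2, 2, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h // 2, w // 2, 4 * c)

def split_out(x, frac):
    # Factor out a fraction of the channels as latents after a block.
    k = int(x.shape[-1] * frac)
    return x[..., :k], x[..., k:]  # (factored-out z, remaining part)

x = np.random.default_rng(3).normal(size=(8, 8, 3))
y = squeeze(x)               # shape (4, 4, 12)
z, rest = split_out(y, 0.5)  # original architecture: half per scale
```

In the fine-grained variant, smaller splits happen after each of the $M$ blocks within a scale rather than a single half-split at the end.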
3.3 Dequantization
From the description in §2, generative flows are defined on continuous random variables. Many real-world datasets, however, are recordings of discrete representations of signals, and fitting a continuous density model to discrete data will produce a degenerate solution that places all probability mass on the discrete datapoints (Uria et al., 2013; Ho et al., 2018). A common solution to this problem is "dequantization", which converts the discrete data distribution into a continuous one.
Specifically, in the context of natural images, each dimension (pixel) of the discrete data $y$ takes on values in $\{0, 1, \ldots, 255\}$. The dequantization process adds continuous random noise $u$ to $y$, resulting in a continuous data point
$$x = y + u,$$
where $u$ is continuous random noise taking values in the interval $[0, 1)$. By modeling the density of $x$ with $p_\theta(x)$, the distribution of $y$ is defined as
$$P_\theta(y) = \int_{[0,1)^{D}} p_\theta(y + u)\, du,$$
where $D$ is the dimensionality of $y$. By restricting the range of $u$ to $[0, 1)$, the mapping between $x$ and the pair $(y, u)$ is bijective. Thus, we have $p(x) = P(y)\, p(u \mid y)$.
By introducing a dequantization noise distribution $q(u \mid y)$, the training objective in (1) can be rewritten as
$$\mathbb{E}_{y \sim \tilde{p}(y)}\big[-\log P_\theta(y)\big] \;\le\; \mathbb{E}_{y \sim \tilde{p}(y)}\,\mathbb{E}_{u \sim q(u \mid y)}\big[\log q(u \mid y) - \log p_\theta(y + u)\big], \tag{8}$$
where $p_\theta(y + u)$ is the density of the dequantized variable $x = y + u$ under the dequantization noise distribution $q(u \mid y)$.
The most common dequantization method used in prior work is uniform dequantization, where the noise $u$ is sampled from the uniform distribution:
$$u \sim \mathcal{U}[0, 1)^{D}.$$
From (8), we have
$$\mathbb{E}_{y \sim \tilde{p}(y)}\big[-\log P_\theta(y)\big] \;\le\; \mathbb{E}_{y \sim \tilde{p}(y)}\,\mathbb{E}_{u \sim \mathcal{U}[0,1)^{D}}\big[-\log p_\theta(y + u)\big].$$
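Uniform dequantization is a one-liner in practice; the sketch below (our illustration) adds per-pixel uniform noise and rescales 8-bit values to $[0, 1)$:

```python
import numpy as np

def uniform_dequantize(y, num_bits=8):
    # y: integer pixel values in {0, ..., 2^num_bits - 1}.
    # Add u ~ Uniform[0, 1) per dimension, then rescale to [0, 1).
    rng = np.random.default_rng(0)
    u = rng.uniform(0.0, 1.0, size=y.shape)
    return (y + u) / (2 ** num_bits)

y = np.array([[0, 128], [255, 64]])
x = uniform_dequantize(y)
```

Because the noise stays inside the unit interval, the integer pixel can always be recovered by flooring, which is the bijectivity property used in the derivation above.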
As discussed in Ho et al. (2018), uniform dequantization asks $p_\theta$ to assign uniform density to the unit hypercubes $y + [0, 1)^{D}$, which is difficult for smooth distribution approximators. They instead proposed to use a parametric dequantization noise distribution $q_\phi(u \mid y)$, and the training objective is to optimize the evidence lower bound (ELBO) provided in (8):
$$\mathbb{E}_{y \sim \tilde{p}(y)}\,\mathbb{E}_{u \sim q_\phi(u \mid y)}\big[\log q_\phi(u \mid y) - \log p_\theta(y + u)\big],$$
where $\phi$ denotes the parameters of the noise distribution. In this paper, we implement both dequantization methods for our MaCow (detailed in §4).
4 Experiments
We evaluate our MaCow model on both low-resolution and high-resolution datasets. A step of MaCow consists of masked convolution units together with a Glow step, which is the same as that described in Kingma and Dhariwal (2018): an Actnorm followed by an invertible 1×1 convolution, followed by a coupling layer. Each coupling layer has three convolution layers, where the first and last convolutions are 3×3, while the center convolution is 1×1. For low-resolution images we use affine coupling layers with 512 hidden channels, while for high-resolution images we use additive coupling layers with 256 hidden channels to reduce memory cost. ELU (Clevert et al., 2015) is used as the activation function throughout the flow architecture. For variational dequantization, the dequantization noise distribution $q_\phi(u \mid y)$ is modeled with a conditional MaCow with a shallow architecture. More details on the architectures, as well as results and analysis of the conducted experiments, will be given in a source code release.
4.1 Low-Resolution Images
We begin our experiments by evaluating the density estimation performance of MaCow on two datasets of low-resolution images commonly used to evaluate deep generative models: CIFAR-10, with images of size 32×32 (Krizhevsky and Hinton, 2009), and the 64×64 downsampled version of ImageNet (Oord et al., 2016).
Table 1: Density estimation performance (bits/dim, lower is better) on CIFAR-10 and 64×64 ImageNet.

| | Model | CIFAR-10 | ImageNet 64×64 |
|---|---|---|---|
| Autoregressive | IAF VAE (Kingma et al., 2016) | 3.11 | – |
| | Parallel Multiscale (Reed et al., 2017) | – | 3.70 |
| | PixelRNN (Oord et al., 2016) | 3.00 | 3.63 |
| | Gated PixelCNN (van den Oord et al., 2016) | 3.03 | 3.57 |
| | MAE (Ma et al., 2019) | 2.95 | – |
| | PixelCNN++ (Salimans et al., 2017) | 2.92 | – |
| | PixelSNAIL (Chen et al., 2017) | 2.85 | 3.52 |
| | SPN (Menick and Kalchbrenner, 2019) | – | 3.52 |
| Flow-based | Real NVP (Dinh et al., 2016) | 3.49 | 3.98 |
| | Glow (Kingma and Dhariwal, 2018) | 3.35 | 3.81 |
| | Flow++: Unif (Ho et al., 2018) | 3.29 | – |
| | Flow++: Var (Ho et al., 2018) | 3.09 | 3.69 |
We run ablation studies to dissect the effectiveness of each component of our MaCow model. The Org model utilizes the original multi-scale architecture, while the +fine-grained model replaces it with the fine-grained multi-scale architecture proposed in §3.2. The +var model further implements variational dequantization (§3.3) on top of +fine-grained, replacing the uniform dequantization. For each ablation, we slightly adjust the number of steps in each level so that all models have similar numbers of parameters, for fair comparison.
Table 1 provides the density estimation performance of different variations of our MaCow model, together with the top-performing autoregressive models (first section) and flow-based generative models (second section). First, on both datasets, +fine-grained models outperform Org ones, demonstrating the effectiveness of the fine-grained multi-scale architecture. Second, with uniform dequantization, MaCow combined with the fine-grained multi-scale architecture significantly improves the performance over Glow on both datasets, and obtains slightly better results than Flow++ on CIFAR-10. In addition, with variational dequantization, MaCow achieves a 0.05 bits/dim improvement over Flow++ on 64×64 ImageNet. On CIFAR-10, however, the performance of MaCow is around 0.07 bits/dim behind Flow++.
Compared with PixelSNAIL (Chen et al., 2017) and SPN (Menick and Kalchbrenner, 2019), the state-of-the-art autoregressive generative models, the performance of MaCow is around 0.31 bits/dim worse on CIFAR-10 and 0.12 bits/dim worse on 64×64 ImageNet. Further improving the density estimation performance of MaCow on natural images is left to future work.
4.2 High-Resolution Images
We now demonstrate experimentally that our MaCow model is capable of generating high-fidelity samples at high resolution. Following Kingma and Dhariwal (2018), we choose the CelebA-HQ dataset (Karras et al., 2018), which consists of 30,000 high-resolution images from the CelebA dataset (Liu et al., 2015). We train our models on 5-bit images, with the fine-grained multi-scale architecture and both uniform and variational dequantization.
4.2.1 Density Estimation
Table 2 reports the negative log-likelihood scores, in bits/dim, of the two versions of MaCow on the 5-bit CelebA-HQ dataset. With uniform dequantization, MaCow improves the log-likelihood over Glow from 1.03 bits/dim to 0.95 bits/dim. Equipped with variational dequantization, MaCow obtains 0.74 bits/dim, which is 0.13 bits/dim behind the state-of-the-art autoregressive generative model SPN (Menick and Kalchbrenner, 2019), significantly narrowing the gap.
4.2.2 Image Generation
Consistent with previous work on likelihood-based generative models (Parmar et al., 2018; Kingma and Dhariwal, 2018), we find that sampling from a reduced-temperature model often results in higher-quality samples. Figure 6 showcases random samples obtained from our model on 5-bit CelebA-HQ with temperature 0.7. The images are of extremely high quality for non-autoregressive likelihood-based models.
5 Conclusion
In this paper, we proposed a new type of generative flow, coined MaCow, which exploits masked convolutional neural networks. By restricting the local dependencies to a small masked kernel, MaCow enjoys fast and stable training as well as efficient sampling. Experiments on both low- and high-resolution benchmark image datasets demonstrate the capability of MaCow for both density estimation and high-fidelity generation, achieving state-of-the-art or comparable likelihoods and superior sample quality compared with previous top-performing models.
One potential direction for future work is to extend MaCow to other forms of data, in particular text, to which (to the best of our knowledge) no attempt has yet been made to apply flow-based generative models. Another exciting direction is to combine MaCow, or flow-based generative models in general, with variational inference to automatically learn meaningful (low-dimensional) representations from raw data.
References
- Bowman et al. (2015) Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
- Brock et al. (2019) Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), 2019.
- Chen et al. (2017) Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. Pixelsnail: An improved autoregressive generative model. arXiv preprint arXiv:1712.09763, 2017.
- Clevert et al. (2015) Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
- Dinh et al. (2014) Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
- Dinh et al. (2016) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
- Germain et al. (2015) Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889, 2015.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems (NIPS-2014), pages 2672–2680, 2014.
- Ho et al. (2018) Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. 2018.
- Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
- Karras et al. (2018) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR), 2018.
- Kingma and Welling (2014) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR-2014), Banff, Canada, April 2014.
- Kingma et al. (2016) Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
- Kingma and Dhariwal (2018) Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245, 2018.
- Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- Larochelle and Murray (2011) Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-2011), pages 29–37, 2011.
- Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
- Ma et al. (2019) Xuezhe Ma, Chunting Zhou, and Eduard Hovy. Mae: Mutual posterior-divergence regularization for variational autoencoders. In International Conference on Learning Representations (ICLR), 2019.
- Menick and Kalchbrenner (2019) Jacob Menick and Nal Kalchbrenner. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. In International Conference on Learning Representations (ICLR), 2019.
- Oord et al. (2016) Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In Proceedings of International Conference on Machine Learning (ICML-2016), 2016.
- Papamakarios et al. (2017) George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.
- Parmar et al. (2018) Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, and Alexander Ku. Image transformer. arXiv preprint arXiv:1802.05751, 2018.
- Prenger et al. (2018) Ryan Prenger, Rafael Valle, and Bryan Catanzaro. Waveglow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002, 2018.
- Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- Reed et al. (2017) Scott Reed, Aäron Oord, Nal Kalchbrenner, Sergio Gómez Colmenarejo, Ziyu Wang, Yutian Chen, Dan Belov, and Nando Freitas. Parallel multiscale autoregressive density estimation. In International Conference on Machine Learning, pages 2912–2921, 2017.
- Rezende and Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
- Salimans et al. (2017) Tim Salimans, Andrej Karpathy, Xi Chen, Diederik P Kingma, and Yaroslav Bulatov. Pixelcnn++: A pixelcnn implementation with discretized logistic mixture likelihood and other modifications. In International Conference on Learning Representations (ICLR), 2017.
- Uria et al. (2013) Benigno Uria, Iain Murray, and Hugo Larochelle. Rnade: The real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems, pages 2175–2183, 2013.
- Van Den Oord et al. (2016) Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
- van den Oord et al. (2016) Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
- Zheng et al. (2017) Guoqing Zheng, Yiming Yang, and Jaime G. Carbonell. Convolutional normalizing flows. CoRR, abs/1711.02255, 2017.