1 Introduction
Unsupervised learning of probabilistic models is a central yet challenging problem. Deep generative models have shown promising results in modeling complex distributions such as natural images (Radford et al., 2015), audio (Van Den Oord et al., 2016) and text (Bowman et al., 2015). A number of approaches have emerged over the years, including Variational Autoencoders (VAEs) (Kingma and Welling, 2014), Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), autoregressive neural networks (Larochelle and Murray, 2011; Oord et al., 2016), and flow-based generative models (Dinh et al., 2014, 2016; Kingma and Dhariwal, 2018). Among these models, flow-based generative models have gained popularity for their capability of estimating densities of complex distributions, efficiently generating high-fidelity syntheses, and automatically learning useful latent spaces.
Flow-based generative models typically warp a simple distribution into a complex one by mapping points from the simple distribution to the complex data distribution through a chain of invertible transformations whose Jacobian determinants are efficient to compute. This design guarantees that the density of the transformed distribution can be computed analytically, making maximum likelihood learning feasible. Flow-based generative models have spawned significant interest in improving and analyzing them from both theoretical and practical perspectives, and in applying them to a wide range of tasks and domains.
In their pioneering work, Dinh et al. (2014) proposed Non-linear Independent Component Estimation (NICE), where they first applied flow-based models to modeling complex high-dimensional densities. RealNVP (Dinh et al., 2016) extended NICE with more flexible invertible transformations and experimented on natural images. However, these flow-based generative models achieve much worse density estimation performance than state-of-the-art autoregressive models, and are incapable of realistic-looking synthesis of large images compared to GANs (Karras et al., 2018; Brock et al., 2019). Recently, Kingma and Dhariwal (2018) proposed Glow, a generative flow with invertible 1x1 convolutions, significantly improving density estimation performance on natural images. Importantly, they demonstrated that flow-based generative models optimized towards a plain likelihood-based objective are capable of efficiently generating realistic-looking, high-resolution natural images. Prenger et al. (2018) investigated applying flow-based generative models to speech synthesis by combining Glow with WaveNet (Van Den Oord et al., 2016). Unfortunately, the density estimation performance of Glow on natural images still falls behind autoregressive models, such as PixelRNN/CNN (Oord et al., 2016; Salimans et al., 2017), Image Transformer (Parmar et al., 2018), PixelSNAIL (Chen et al., 2017) and SPN (Menick and Kalchbrenner, 2019). We note in passing that there is also some work (Rezende and Mohamed, 2015; Kingma et al., 2016; Zheng et al., 2017) applying flows to variational inference.
In this paper, we propose a novel architecture of generative flow, masked convolutional generative flow (MaCow), using masked convolutional neural networks (Oord et al., 2016). The bijective mapping between input and output variables can easily be established; meanwhile, computation of the determinant of the Jacobian is efficient. Compared to inverse autoregressive flow (IAF) (Kingma et al., 2016), MaCow offers stable training and efficient inference and synthesis by restricting the local connectivity to a small "masked" kernel, as well as large receptive fields by stacking multiple layers of convolutional flows and using masks in reversed orderings (§3.1). We also propose a fine-grained version of the multi-scale architecture adopted in previous flow-based generative models to further improve performance (§3.2). Experimentally, on three benchmark image datasets (CIFAR-10, ImageNet, and CelebA-HQ), we demonstrate the effectiveness of MaCow as a density estimator by consistently achieving significant improvements over Glow on all three datasets. When equipped with the variational dequantization mechanism (Ho et al., 2018), MaCow considerably narrows the gap in density estimation to autoregressive models (§4).
2 Flow-based Generative Models
In this section, we first set up notations, describe flow-based generative models, and review Glow (Kingma and Dhariwal, 2018), on which MaCow is built.
2.1 Notations
Throughout the paper, we use uppercase letters for random variables and lowercase letters for realizations of the corresponding random variables. Let X = {X_1, ..., X_d} be the random variables of the observed data, e.g., X is an image or a sentence for image and text generation, respectively.
Let P denote the true distribution of the data, i.e., X ~ P, and let D = {x_1, ..., x_N} be our training sample, where the x_i are usually i.i.d. samples of X. Let {P_θ : θ ∈ Θ} denote a parametric statistical model indexed by the parameter θ, where Θ is the parameter space. p_θ is used to denote the density of the corresponding distribution P_θ. In the literature on deep generative models, deep neural networks are the most widely used parametric models. The goal of generative models is to learn the parameter θ such that P_θ can best approximate the true distribution P. In the context of maximum likelihood estimation, we wish to minimize the negative log-likelihood of the parameters:

min_{θ∈Θ} (1/N) Σ_{i=1}^{N} −log p_θ(x_i) = min_{θ∈Θ} E_{x~P̃}[−log p_θ(x)],   (1)

where P̃ is the empirical distribution derived from the training data D.
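To make the objective concrete, the following minimal sketch (our own illustration, not from the paper; it uses a univariate Gaussian as the parametric model p_θ with θ = (μ, σ)) evaluates the empirical negative log-likelihood of (1) and checks that the maximum likelihood estimate attains a lower value than an arbitrary parameter setting:

```python
import numpy as np

def neg_log_likelihood(data, mu, sigma):
    # Empirical estimate of objective (1): average -log p_theta(x) over the
    # sample, here for a univariate Gaussian model with theta = (mu, sigma).
    z = (data - mu) / sigma
    return np.mean(0.5 * z ** 2 + np.log(sigma) + 0.5 * np.log(2 * np.pi))

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=10_000)

# The maximum likelihood estimate (sample mean and std) attains a lower
# NLL than any other parameter setting, e.g. the standard normal.
mle = neg_log_likelihood(data, data.mean(), data.std())
other = neg_log_likelihood(data, 0.0, 1.0)
print(mle < other)   # True
```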
2.2 Flow-based Models
In the framework of flow-based generative models, a set of latent variables Z is introduced with a prior distribution p_Z(z), typically a simple distribution like a multivariate Gaussian. For a bijective function f: X → Z (with g = f^{-1}), the change-of-variables formula defines the model distribution on X by:

p_θ(x) = p_Z(f(x)) |det(∂f(x)/∂x)|,   (2)

where ∂f(x)/∂x is the Jacobian of f at x.
The generative process is defined straightforwardly as:

z ~ p_Z(z),   x = g(z).   (3)

Flow-based generative models focus on certain types of transformations for which both the inverse functions and the Jacobian determinants are tractable to compute. By stacking multiple such invertible transformations in a sequence, which is also called a (normalizing) flow (Rezende and Mohamed, 2015), a flow is capable of warping a simple distribution (p_Z) into a complex one (p_θ): f = f_1 ∘ f_2 ∘ ... ∘ f_K, where f is a flow of K transformations. For brevity, we omit the parameter θ from f_θ and g_θ.
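As an illustration of the change-of-variables formula (2), the sketch below (our own, not part of the paper) uses a single elementwise affine bijection f(x) = (x − μ)·exp(−log σ), whose Jacobian is diagonal, so the log-determinant reduces to −Σ log σ:

```python
import numpy as np

def log_prior(z):
    # Log density of the standard Gaussian prior p_Z, summed over dimensions.
    return -0.5 * (z ** 2 + np.log(2 * np.pi)).sum(axis=-1)

def affine_flow_log_px(x, mu, log_sigma):
    # f(x) = (x - mu) * exp(-log_sigma): an elementwise affine bijection.
    # Its Jacobian is diagonal, so log|det| = -sum(log_sigma).
    z = (x - mu) * np.exp(-log_sigma)
    log_det = -log_sigma.sum()
    return log_prior(z) + log_det   # formula (2) in log space

x = np.array([[0.5, -1.0], [2.0, 0.3]])
mu = np.array([0.0, 0.5])
log_sigma = np.array([0.1, -0.2])
print(affine_flow_log_px(x, mu, log_sigma))
```

With μ = 0 and log σ = 0 the flow is the identity and the formula returns exactly the standard-normal log density, a quick sanity check on the implementation.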
2.3 Glow
In recent years, a number of types of invertible transformations have emerged to enhance the expressiveness of flows, among which Glow (Kingma and Dhariwal, 2018) stands out for its simplicity and effectiveness in both density estimation and high-fidelity synthesis. We briefly describe the three types of transformations that Glow consists of.
Actnorm.
Kingma and Dhariwal (2018) proposed an activation normalization layer (Actnorm) as an alternative to batch normalization (Ioffe and Szegedy, 2015) to alleviate problems in model training. Similar to batch normalization, Actnorm performs an affine transformation of the activations using a scale and a bias parameter per channel for 2D images:

y_{i,j} = s ⊙ x_{i,j} + b,

where both s and b are tensors of shape [1 × 1 × c], with spatial dimensions (i, j) and channel dimension c.
Invertible 1 × 1 convolution.
To incorporate a permutation along the channel dimension, Glow includes a trainable invertible 1 × 1 convolution layer to generalize the permutation operation:

y_{i,j} = W x_{i,j},

where W is the weight matrix with shape c × c.
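The log-determinant of an invertible 1 × 1 convolution factorizes over spatial positions: the same c × c matrix W acts on every pixel, contributing h · w · log|det W| in total. A minimal numpy sketch (our own illustration, not Glow's code; the orthogonal initialization is an assumption):

```python
import numpy as np

def invertible_1x1_conv(x, W):
    # Applies the same c x c matrix W to the channel vector at every
    # spatial position of an [h, w, c] input.
    h, w, _ = x.shape
    y = x @ W.T
    log_det = h * w * np.log(abs(np.linalg.det(W)))
    return y, log_det

def invert_1x1_conv(y, W):
    # The inverse applies W^{-1} per pixel.
    return y @ np.linalg.inv(W).T

rng = np.random.default_rng(0)
W = np.linalg.qr(rng.normal(size=(4, 4)))[0]   # orthogonal init: |det W| = 1
x = rng.normal(size=(8, 8, 4))
y, log_det = invertible_1x1_conv(x, W)
print(np.allclose(invert_1x1_conv(y, W), x))   # True
print(log_det)                                 # ~0 for an orthogonal W
```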
Affine Coupling Layers.
Following Dinh et al. (2016), Glow includes affine coupling layers in its architecture:

x_a, x_b = split(x)
y_a = x_a
y_b = s(x_a) ⊙ x_b + b(x_a)
y = concat(y_a, y_b),

where s(x_a) and b(x_a) are outputs of two neural networks with x_a as input. The split() and concat() functions perform operations along the channel dimension.
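A minimal sketch of an affine coupling layer (our own illustration; the stand-in function `nets` is a placeholder for the real networks s(·) and b(·)). The inverse never needs to invert the networks, only to re-evaluate them on the unchanged half:

```python
import numpy as np

def nets(x_a):
    # Stand-in for the neural networks s(.) and b(.) of the coupling layer;
    # invertibility never requires inverting them, only re-evaluating them.
    s = np.tanh(x_a) + 1.5        # keep the scale strictly positive
    b = 0.5 * x_a
    return s, b

def coupling_forward(x):
    x_a, x_b = np.split(x, 2, axis=-1)    # split along the channel axis
    s, b = nets(x_a)
    y_b = s * x_b + b
    log_det = np.log(s).sum()             # triangular Jacobian
    return np.concatenate([x_a, y_b], axis=-1), log_det

def coupling_inverse(y):
    y_a, y_b = np.split(y, 2, axis=-1)
    s, b = nets(y_a)                      # y_a == x_a, so s, b are identical
    return np.concatenate([y_a, (y_b - b) / s], axis=-1)

x = np.random.default_rng(1).normal(size=(2, 6))
y, log_det = coupling_forward(x)
print(np.allclose(coupling_inverse(y), x))   # True
```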
From the designed architecture of Glow, we see that interactions between spatial dimensions are incorporated only in the coupling layers. The coupling layer, however, is typically costly in memory, making it infeasible to stack a large number of coupling layers in a single model, particularly for high-resolution images. The main goal of this work is to design a new type of transformation that simultaneously models dependencies in both the spatial and channel dimensions while maintaining a relatively low memory footprint, in the hope of improving the capacity of the generative flow.
3 Masked Convolutional Generative Flows
In this section, we describe the architectural components that compose the masked convolutional generative flow (MaCow). We first introduce the proposed flow transformation using masked convolutions in §3.1. Then, we present a fine-grained version of the multi-scale architecture adopted in previous generative flows (Dinh et al., 2016; Kingma and Dhariwal, 2018) in §3.2. In §3.3, we briefly revisit the dequantization problem in flow-based generative models.
3.1 Flow with Masked Convolutions
Applying autoregressive models to normalizing flows has been explored in previous studies (Kingma et al., 2016; Papamakarios et al., 2017). The idea behind autoregressive flows is to model the input random variables sequentially in an autoregressive order, so that the model cannot read input variables after the current one:

z_t = x_t ⊙ σ(x_{<t}) + μ(x_{<t}),   (4)

where x_{<t} denotes the input variables in x that precede x_t in the autoregressive order. μ and σ are two autoregressive neural networks, typically implemented using spatial masks (Germain et al., 2015; Oord et al., 2016).
Despite their effectiveness in high-dimensional spaces, autoregressive flows suffer from two crucial problems: (1) the training procedure is unstable when modeling long-range contexts and stacking multiple layers; (2) inference and synthesis are inefficient, due to the non-parallelizable inverse function.
We propose to use masked convolutions that restrict the local connectivity to a small "masked" kernel to address these two problems. The two autoregressive neural networks, μ and σ, are implemented with one-layer masked convolutional networks with small kernels to ensure that they only read contexts in a small neighborhood:

z_t = x_t ⊙ σ(x_{<t}^{(k)}) + μ(x_{<t}^{(k)}),   (5)

where x_{<t}^{(k)} denotes the input variables, restricted to a small kernel, on which x_t depends. By using masks in reversed orderings and stacking multiple layers of flows, the model can capture a large receptive field (see Figure 1), and is able to model dependencies in both the spatial and channel dimensions.
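The masks themselves can be built as small binary kernels. The sketch below (our own illustration; the exact mask layout used by MaCow may differ) constructs a 3 × 3 kernel mask that exposes only the neighbors above and to the left of the center, plus its 180-degree rotation, which provides the reversed ordering:

```python
import numpy as np

def causal_kernel_mask(kh, kw):
    # 1 = visible neighbor, 0 = masked out. Rows above the center are fully
    # visible; in the center row only positions left of the center are; the
    # center pixel itself and everything after it are hidden.
    mask = np.zeros((kh, kw))
    mask[: kh // 2, :] = 1.0
    mask[kh // 2, : kw // 2] = 1.0
    return mask

mask = causal_kernel_mask(3, 3)
print(mask)
# A 180-degree rotation gives the reversed ordering: stacking flows with
# both orderings lets every pixel eventually see its full neighborhood.
reversed_mask = mask[::-1, ::-1]
print(mask + reversed_mask)   # all ones except the (hidden) center
```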
Efficient Synthesis.
As discussed above, synthesis from autoregressive flows is inefficient, since the inverse has to be computed by sequentially traversing the autoregressive order. In the context of 2D images with shape [h, w, c], the time complexity of synthesis is O(h·w·T(h, w, c)), where T(h, w, c) is the time of computing the outputs of the neural networks μ and σ with input shape [h, w, c]. In our proposed flow with masked convolutions, computation of z_t can begin as soon as all of x_{<t}^{(k)} are available, contrary to the autoregressive requirement that all of x_{<t} be computed. Moreover, at each step we only need to feed a slice of the image (with shape [k_h, k_w, c]) into μ and σ, where [k_h, k_w] is the shape of a kernel in the convolution. Thus, the time complexity reduces significantly to O(h·w·T(k_h, k_w, c)).
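To illustrate why synthesis needs only kernel-sized network inputs, the toy sketch below (our own; `mu_sigma` is a stand-in for the real masked networks) inverts a masked-convolution flow pixel by pixel in raster order, feeding each call only a 3 × 3 patch of already-generated values:

```python
import numpy as np

MASK = np.array([[1., 1., 1.],
                 [1., 0., 0.],
                 [0., 0., 0.]])   # visible neighbors inside a 3 x 3 kernel

def mu_sigma(patch):
    # Toy stand-in for the one-layer masked networks mu and sigma; masking
    # guarantees they never read the current pixel or pixels after it.
    p = patch * MASK
    return 0.1 * p.sum(), 1.0 + 0.5 * np.tanh(p.mean())

def forward(x):
    h, w = x.shape
    xp = np.pad(x, 1)                     # zero padding at the borders
    z = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            m, s = mu_sigma(xp[i:i + 3, j:j + 3])
            z[i, j] = x[i, j] * s + m     # elementwise affine, as in (5)
    return z

def inverse(z):
    # Raster-order synthesis: every network call sees only a kernel-sized
    # patch of already-generated pixels, never the whole image.
    h, w = z.shape
    xp = np.zeros((h + 2, w + 2))
    for i in range(h):
        for j in range(w):
            m, s = mu_sigma(xp[i:i + 3, j:j + 3])
            xp[i + 1, j + 1] = (z[i, j] - m) / s
    return xp[1:-1, 1:-1]

x = np.random.default_rng(0).normal(size=(6, 6))
print(np.allclose(inverse(forward(x)), x))   # True
```

Because the mask zeroes out the current pixel and every later one, the forward and inverse passes see identical contexts, so the round trip is exact up to floating-point error.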
3.2 Fine-grained Multi-Scale Architecture
Dinh et al. (2016) proposed a multi-scale architecture using a squeezing operation, which has been demonstrated to be helpful for training very deep flows. In the original multi-scale architecture, the model factors out half of the dimensions at each scale to reduce computational and memory cost. In this paper, inspired by the size-upscaling in subscale ordering (Menick and Kalchbrenner, 2019), which generates an image as a sequence of sub-images of equal size, we propose a fine-grained multi-scale architecture to further improve model performance. In this fine-grained multi-scale architecture, each scale consists of several blocks, and after each block the model splits out a fraction of the dimensions of the input. Note that the original multi-scale architecture is a special case of the fine-grained version in which each scale contains a single block. Figure 5 illustrates the graphical specification of the two versions of the architecture. Experimental improvements demonstrate the effectiveness of the fine-grained multi-scale architecture (§4).
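The two building blocks of the multi-scale design, squeezing and splitting, can be sketched as follows (our own illustration; the function names and the split fraction are assumptions, not the paper's code):

```python
import numpy as np

def squeeze(x, factor=2):
    # Trades spatial size for channels: [h, w, c] -> [h/f, w/f, c*f*f].
    h, w, c = x.shape
    x = x.reshape(h // factor, factor, w // factor, factor, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(
        h // factor, w // factor, c * factor * factor)

def split_out(x, fraction=0.5):
    # After each block, factor out a fraction of the channels; the factored
    # part is modeled directly by the prior, the rest continues through the flow.
    k = int(x.shape[-1] * fraction)
    return x[..., :k], x[..., k:]

x = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
y = squeeze(x)
z_out, z_next = split_out(y)
print(y.shape, z_out.shape, z_next.shape)   # (2, 2, 12) (2, 2, 6) (2, 2, 6)
```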
3.3 Dequantization
As described in §2, generative flows are defined on continuous random variables. Many real-world datasets, however, are recordings of discrete representations of signals, and fitting a continuous density model to discrete data produces a degenerate solution that places all probability mass on the discrete datapoints (Uria et al., 2013; Ho et al., 2018). A common solution to this problem is "dequantization", which converts the discrete data distribution into a continuous one.
Specifically, in the context of natural images, each dimension (pixel) of the discrete data x takes on values in {0, 1, ..., 255}. The dequantization process adds continuous random noise u to x, resulting in a continuous data point:
y = x + u,   (6)
where u is continuous random noise taking values in the interval [0, 1). By modeling the density of y with p_θ, the distribution of x is defined as:
P_θ(x) = ∫_{u∈[0,1)^d} p_θ(x + u) du.   (7)
By restricting the range of u to [0, 1)^d, the mapping between y and the pair (x, u) is bijective. Thus, we have p(y) = P(x) q(u | x), where q(u | x) denotes the density of the noise u given x.
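Uniform dequantization and the bijectivity of the (x, u) ↔ y mapping can be illustrated as follows (our own sketch; taking the floor recovers the discrete pixel exactly because the noise lies in [0, 1)):

```python
import numpy as np

def dequantize_uniform(x_discrete, rng):
    # y = x + u with u ~ Uniform[0, 1): the integer part of y recovers x
    # exactly, so the mapping (x, u) <-> y is bijective.
    u = rng.uniform(0.0, 1.0, size=x_discrete.shape)
    return x_discrete + u

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(5, 5)).astype(float)   # discrete pixel values
y = dequantize_uniform(x, rng)
print(np.array_equal(np.floor(y), x))   # True: flooring undoes the noise
```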
By introducing a dequantization noise distribution q(u | x), the training objective in (1) can be rewritten as:
E_{x~P̃}[log P_θ(x)] ≥ E_{x~P̃} E_{u~q(·|x)}[log p_θ(x + u) − log q(u | x)],   (8)
where y = x + u is the dequantized variable under the dequantization noise distribution q(u | x).
Uniform Dequantization.
The most common dequantization method used in prior work is uniform dequantization, where the noise u is sampled from the uniform distribution U[0, 1)^d. From (8), we have

E_{x~P̃}[log P_θ(x)] ≥ E_{x~P̃} E_{u~U[0,1)^d}[log p_θ(x + u)],

as log q(u | x) = 0 for uniform noise.
Variational Dequantization.
As discussed in Ho et al. (2018), uniform dequantization asks p_θ to assign uniform density to the unit hypercubes x + [0, 1)^d, which is difficult for smooth distribution approximators. They proposed using a parametric dequantization noise distribution q_φ(u | x), with the training objective of optimizing the evidence lower bound (ELBO) in (8):

E_{x~P̃}[log P_θ(x)] ≥ E_{x~P̃} E_{u~q_φ(·|x)}[log p_θ(x + u) − log q_φ(u | x)],   (9)

where φ denotes the parameters of the noise distribution. In this paper, we implement both dequantization methods for MaCow (detailed in §4).
4 Experiments
We evaluate our MaCow model on both low-resolution and high-resolution datasets. Each step of MaCow uses masked convolution units followed by a Glow step, which is the same as that described in Kingma and Dhariwal (2018): an Actnorm, followed by an invertible 1 × 1 convolution, followed by a coupling layer. Each coupling layer has three convolution layers, where the first and last convolutions are 3 × 3 and the center convolution is 1 × 1. For low-resolution images we use affine coupling layers with 512 hidden channels, while for high-resolution images we use additive coupling layers with 256 hidden channels to reduce memory cost. ELU (Clevert et al., 2015) is used as the activation function throughout the flow architecture. For variational dequantization, the dequantization noise distribution q_φ(u | x) is modeled with a conditional MaCow with a shallow architecture. More details on architectures, results, and analysis of the conducted experiments will be given in a source code release.
4.1 Low-Resolution Images
We begin our experiments by evaluating the density estimation performance of MaCow on two low-resolution image datasets commonly used to evaluate deep generative models: CIFAR-10, with images of size 32 × 32 (Krizhevsky and Hinton, 2009), and the 64 × 64 downsampled version of ImageNet (Oord et al., 2016).
Table 1: Density estimation performance (negative log-likelihood in bits/dim; lower is better) on CIFAR-10 and 64 × 64 ImageNet.

Model                                        | CIFAR-10 | ImageNet-64
Autoregressive
  IAF VAE (Kingma et al., 2016)              | 3.11     | –
  Parallel Multiscale (Reed et al., 2017)    | –        | 3.70
  PixelRNN (Oord et al., 2016)               | 3.00     | 3.63
  Gated PixelCNN (van den Oord et al., 2016) | 3.03     | 3.57
  MAE (Ma et al., 2019)                      | 2.95     | –
  PixelCNN++ (Salimans et al., 2017)         | 2.92     | –
  PixelSNAIL (Chen et al., 2017)             | 2.85     | 3.52
  SPN (Menick and Kalchbrenner, 2019)        | –        | 3.52
Flow-based
  Real NVP (Dinh et al., 2016)               | 3.49     | 3.98
  Glow (Kingma and Dhariwal, 2018)           | 3.35     | 3.81
  Flow++: Unif (Ho et al., 2018)             | 3.29     | –
  Flow++: Var (Ho et al., 2018)              | 3.09     | 3.69
  MaCow: Org                                 | 3.31     | 3.78
  MaCow: +fine-grained                       | 3.28     | 3.75
  MaCow: +var                                | 3.16     | 3.64
We run ablation studies to dissect the effectiveness of each component of our MaCow model. The Org model utilizes the original multi-scale architecture, while the +fine-grained model replaces it with the fine-grained multi-scale architecture proposed in §3.2. The +var model further implements variational dequantization (§3.3) on top of +fine-grained, replacing uniform dequantization. For each ablation, we slightly adjust the number of steps in each level so that all the models have similar numbers of parameters for a fair comparison.
Table 1 provides the density estimation performance of different variants of our MaCow model, together with the top-performing autoregressive models (first section) and flow-based generative models (second section). First, on both datasets, the fine-grained models outperform the Org ones, demonstrating the effectiveness of the fine-grained multi-scale architecture. Second, with uniform dequantization, MaCow combined with the fine-grained multi-scale architecture significantly improves performance over Glow on both datasets, and obtains slightly better results than Flow++ on CIFAR-10. In addition, with variational dequantization, MaCow achieves a 0.05 bits/dim improvement over Flow++ on 64 × 64 ImageNet. On CIFAR-10, however, the performance of MaCow is around 0.07 bits/dim behind Flow++.
Compared with PixelSNAIL (Chen et al., 2017) and SPN (Menick and Kalchbrenner, 2019), the state-of-the-art autoregressive generative models, the performance of MaCow is around 0.31 bits/dim worse on CIFAR-10 and 0.12 bits/dim worse on 64 × 64 ImageNet. Further improving the density estimation performance of MaCow on natural images is left to future work.
4.2 High-Resolution Images
We now demonstrate experimentally that our MaCow model is capable of generating high-fidelity samples at high resolution. Following Kingma and Dhariwal (2018), we choose the CelebA-HQ dataset (Karras et al., 2018), which consists of 30,000 high-resolution images from the CelebA dataset (Liu et al., 2015). We train our models on 5-bit images, with the fine-grained multi-scale architecture and both uniform and variational dequantization.
4.2.1 Density Estimation
Table 2 reports the negative log-likelihood scores in bits/dim of the two versions of MaCow on the 5-bit CelebA-HQ dataset. With uniform dequantization, MaCow improves the log-likelihood over Glow from 1.03 bits/dim to 0.95 bits/dim. Equipped with variational dequantization, MaCow obtains 0.74 bits/dim, 0.13 bits/dim behind the state-of-the-art autoregressive generative model SPN (Menick and Kalchbrenner, 2019), significantly narrowing the gap.
4.2.2 Image Generation
Consistent with previous work on likelihood-based generative models (Parmar et al., 2018; Kingma and Dhariwal, 2018), we found that sampling from a reduced-temperature model often results in higher-quality samples. Figure 6 showcases random samples obtained from our model on 5-bit CelebA-HQ with temperature 0.7. The images are of extremely high quality for non-autoregressive likelihood-based models.
5 Conclusion
In this paper, we proposed a new type of generative flow, coined MaCow, which exploits masked convolutional neural networks. By restricting the local dependencies to a small masked kernel, MaCow enjoys fast and stable training as well as efficient sampling. Experiments on both low- and high-resolution benchmark image datasets show the capability of MaCow in both density estimation and high-fidelity generation, with state-of-the-art or comparable likelihoods and superior sample quality compared to previous top-performing models.
One potential direction for future work is to extend MaCow to other forms of data, in particular text, to which (to the best of our knowledge) no attempt has been made to apply flow-based generative models. Another exciting direction is to combine MaCow, or flow-based generative models in general, with variational inference to automatically learn meaningful (low-dimensional) representations from raw data.
References
 Bowman et al. (2015) Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
 Brock et al. (2019) Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), 2019.
 Chen et al. (2017) Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. Pixelsnail: An improved autoregressive generative model. arXiv preprint arXiv:1712.09763, 2017.
 Clevert et al. (2015) Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
 Dinh et al. (2014) Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Nonlinear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
 Dinh et al. (2016) Laurent Dinh, Jascha SohlDickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.

 Germain et al. (2015) Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889, 2015.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS-2014), pages 2672–2680, 2014.
 Ho et al. (2018) Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flowbased generative models with variational dequantization and architecture design. 2018.
 Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
 Karras et al. (2018) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR), 2018.
 Kingma and Welling (2014) Diederik P Kingma and Max Welling. Autoencoding variational bayes. In Proceedings of the 2th International Conference on Learning Representations (ICLR2014), Banff, Canada, April 2014.
 Kingma et al. (2016) Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
 Kingma and Dhariwal (2018) Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245, 2018.
 Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

 Larochelle and Murray (2011) Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-2011), pages 29–37, 2011.
 Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
 Ma et al. (2019) Xuezhe Ma, Chunting Zhou, and Eduard Hovy. MAE: Mutual posterior-divergence regularization for variational autoencoders. In International Conference on Learning Representations (ICLR), 2019.
 Menick and Kalchbrenner (2019) Jacob Menick and Nal Kalchbrenner. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. In International Conference on Learning Representations (ICLR), 2019.

 Oord et al. (2016) Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning (ICML-2016), 2016.
 Papamakarios et al. (2017) George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.
 Parmar et al. (2018) Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, and Alexander Ku. Image transformer. arXiv preprint arXiv:1802.05751, 2018.
 Prenger et al. (2018) Ryan Prenger, Rafael Valle, and Bryan Catanzaro. Waveglow: A flowbased generative network for speech synthesis. arXiv preprint arXiv:1811.00002, 2018.
 Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 Reed et al. (2017) Scott Reed, Aäron Oord, Nal Kalchbrenner, Sergio Gómez Colmenarejo, Ziyu Wang, Yutian Chen, Dan Belov, and Nando Freitas. Parallel multiscale autoregressive density estimation. In International Conference on Machine Learning, pages 2912–2921, 2017.
 Rezende and Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
 Salimans et al. (2017) Tim Salimans, Andrej Karpathy, Xi Chen, Diederik P Kingma, and Yaroslav Bulatov. Pixelcnn++: A pixelcnn implementation with discretized logistic mixture likelihood and other modifications. In International Conference on Learning Representations (ICLR), 2017.
 Uria et al. (2013) Benigno Uria, Iain Murray, and Hugo Larochelle. Rnade: The realvalued neural autoregressive densityestimator. In Advances in Neural Information Processing Systems, pages 2175–2183, 2013.
 Van Den Oord et al. (2016) Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
 van den Oord et al. (2016) Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
 Zheng et al. (2017) Guoqing Zheng, Yiming Yang, and Jaime G. Carbonell. Convolutional normalizing flows. CoRR, abs/1711.02255, 2017.