MaCow: Masked Convolutional Generative Flow

02/12/2019 ∙ by Xuezhe Ma, et al. ∙ Carnegie Mellon University

Flow-based generative models, conceptually attractive due to the tractability of both exact log-likelihood computation and latent-variable inference, as well as the efficiency of both training and sampling, have led to a number of impressive empirical successes and spawned many advanced variants and theoretical investigations. Despite their computational efficiency, the density estimation performance of flow-based generative models significantly falls behind that of state-of-the-art autoregressive models. In this work, we introduce masked convolutional generative flow (MaCow), a simple yet effective architecture of generative flow using masked convolution. By restricting the local connectivity to a small kernel, MaCow enjoys fast and stable training as well as efficient sampling, while achieving significant improvements over Glow for density estimation on standard image benchmarks, considerably narrowing the gap to autoregressive models.







1 Introduction

Unsupervised learning of probabilistic models is a central yet challenging problem. Deep generative models have shown promising results in modeling complex distributions such as natural images (Radford et al., 2015), audio (Van Den Oord et al., ) and text (Bowman et al., 2015). A number of approaches have emerged over the years, including Variational Autoencoders (VAEs) (Kingma and Welling, 2014), Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), autoregressive neural networks (Larochelle and Murray, 2011; Oord et al., 2016), and flow-based generative models (Dinh et al., 2014, 2016; Kingma and Dhariwal, 2018). Among these models, flow-based generative models have gained popularity for their capability of estimating densities of complex distributions, efficiently generating high-fidelity syntheses, and automatically learning useful latent spaces.

Flow-based generative models warp a simple distribution into a complex one by mapping points from the simple distribution to the complex data distribution through a chain of invertible transformations whose Jacobian determinants are efficient to compute. This design guarantees that the density of the transformed distribution can be computed analytically, making maximum likelihood learning feasible. Flow-based generative models have spawned significant interest in improving and analyzing them from both theoretical and practical perspectives, and in applying them to a wide range of tasks and domains.

In their pioneering work, Dinh et al. (2014) proposed Non-linear Independent Component Estimation (NICE), where they first applied flow-based models to modeling complex high-dimensional densities. RealNVP (Dinh et al., 2016) extended NICE with more flexible invertible transformations and experimented on natural images. However, these flow-based generative models achieve much worse density estimation performance than state-of-the-art autoregressive models, and are incapable of synthesizing realistic-looking large images compared to GANs (Karras et al., 2018; Brock et al., 2019). Recently, Kingma and Dhariwal (2018) proposed Glow, a generative flow with invertible 1x1 convolutions, significantly improving the density estimation performance on natural images. Importantly, they demonstrated that flow-based generative models optimized towards the plain likelihood-based objective are capable of generating realistic-looking high-resolution natural images efficiently. Prenger et al. (2018) investigated applying flow-based generative models to speech synthesis by combining Glow with WaveNet (Van Den Oord et al., ). Unfortunately, the density estimation performance of Glow on natural images still falls behind that of autoregressive models, such as PixelRNN/CNN (Oord et al., 2016; Salimans et al., 2017), Image Transformer (Parmar et al., 2018), PixelSNAIL (Chen et al., 2017) and SPN (Menick and Kalchbrenner, 2019). We note in passing that there is also some work (Rezende and Mohamed, 2015; Kingma et al., 2016; Zheng et al., 2017) applying flows to variational inference.

In this paper, we propose a novel architecture of generative flow, masked convolutional generative flow (MaCow), using masked convolutional neural networks (Oord et al., 2016). The bijective mapping between input and output variables can easily be established; meanwhile, computation of the determinant of the Jacobian is efficient. Compared to inverse autoregressive flow (IAF) (Kingma et al., 2016), MaCow offers stable training and efficient inference and synthesis by restricting the local connectivity to a small "masked" kernel, and large receptive fields by stacking multiple layers of convolutional flows with masks in reversed ordering (§3.1). We also propose a fine-grained version of the multi-scale architecture adopted in previous flow-based generative models to further improve performance (§3.2). Experimentally, on three benchmark image datasets, CIFAR-10, ImageNet, and CelebA-HQ, we demonstrate the effectiveness of MaCow as a density estimator, consistently achieving significant improvements over Glow on all three datasets. When equipped with the variational dequantization mechanism (Ho et al., 2018), MaCow considerably narrows the gap to autoregressive models on density estimation (§4).

2 Flow-based Generative Models

In this section, we first set up notations, describe flow-based generative models, and review Glow (Kingma and Dhariwal, 2018), on which MaCow is built.

2.1 Notations

Throughout this paper, we use uppercase letters for random variables and lowercase letters for realizations of the corresponding random variables. Let $X \in \mathcal{X}$ be the random variable of the observed data, e.g., $x$ is an image or a sentence for image and text generation, respectively.

Let $P$ denote the true distribution of the data, i.e., $X \sim P$, and let $D = \{x_1, \ldots, x_N\}$ be our training sample, where $x_i, i = 1, \ldots, N,$ are usually i.i.d. samples of $X$. Let $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ denote a parametric statistical model indexed by the parameter $\theta$, where $\Theta$ is the parameter space. $p_\theta$ is used to denote the density of the corresponding distribution $P_\theta$. In the literature of deep generative models, deep neural networks are the most widely used parametric models. The goal of generative models is to learn the parameter $\theta$ such that $p_\theta$ can best approximate the true distribution $P$. In the context of maximum likelihood estimation, we wish to minimize the negative log-likelihood of the parameters:

$$\min_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} -\log p_\theta(x_i) = \min_{\theta \in \Theta} \mathbb{E}_{x \sim \tilde{P}}[-\log p_\theta(x)], \qquad (1)$$

where $\tilde{P}$ is the empirical distribution derived from the training data $D$.
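The objective above amounts to an empirical average of negative log-densities over the training sample. A minimal sketch (our illustration, not the paper's code), using a hypothetical standard-Gaussian model density:

```python
import numpy as np

def log_density(x):
    # Hypothetical model density p_theta: a standard Gaussian.
    return -0.5 * (x ** 2 + np.log(2 * np.pi))

def nll(samples):
    # Empirical expectation E_{x ~ P_tilde}[-log p_theta(x)] over the sample.
    return -np.mean(log_density(samples))
```

For data actually drawn from a standard Gaussian, this estimate concentrates around the entropy 0.5 * (1 + log 2π) ≈ 1.42 nats.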

2.2 Flow-based Models

In the framework of flow-based generative models, a set of latent variables $z \in \mathcal{Z}$ is introduced with a prior distribution $p_Z(z)$, which is typically a simple distribution like a multivariate Gaussian. For a bijection function $f: \mathcal{X} \to \mathcal{Z}$ (with $g = f^{-1}$), the change of variables formula defines the model distribution on $X$ by

$$p_\theta(x) = p_Z\big(f_\theta(x)\big) \left| \det\!\left( \frac{\partial f_\theta(x)}{\partial x} \right) \right|, \qquad (2)$$

where $\frac{\partial f_\theta(x)}{\partial x}$ is the Jacobian of $f_\theta$ at $x$.
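As a concrete check of the change-of-variables formula, here is a minimal sketch (our illustration, not the paper's code) for a 1-D affine bijection f(x) = a·x + b with a standard-Gaussian prior:

```python
import numpy as np

def log_prior(z):
    # Log density of the standard Gaussian prior p_Z.
    return -0.5 * (z ** 2 + np.log(2 * np.pi))

def log_density(x, a=2.0, b=1.0):
    # log p_X(x) = log p_Z(f(x)) + log |df/dx| for f(x) = a*x + b.
    z = a * x + b
    return log_prior(z) + np.log(abs(a))
```

Because the log-determinant term accounts for the change of volume under f, the resulting density integrates to one over x.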

The generative process is defined straightforwardly as

$$z \sim p_Z(z), \qquad x = g_\theta(z). \qquad (3)$$

Flow-based generative models focus on certain types of transformations $f$ for which both the inverse function $g$ and the Jacobian determinant are tractable to compute. By stacking multiple such invertible transformations in a sequence, which is also called a (normalizing) flow (Rezende and Mohamed, 2015), a flow is capable of warping a simple distribution ($p_Z$) into a complex one ($p_\theta$):

$$x \overset{f_1}{\underset{g_1}{\rightleftharpoons}} h_1 \overset{f_2}{\underset{g_2}{\rightleftharpoons}} h_2 \cdots \overset{f_K}{\underset{g_K}{\rightleftharpoons}} z,$$

where $f = f_K \circ \cdots \circ f_1$ is a flow of $K$ transformations. For brevity, we omit the parameter $\theta$ from $f_\theta$ and $g_\theta$.

2.3 Glow

In recent years, a number of types of invertible transformations have emerged to enhance the expressiveness of flows, among which Glow (Kingma and Dhariwal, 2018) stands out for its simplicity and effectiveness on both density estimation and high-fidelity synthesis. We briefly describe the three types of transformations that Glow consists of.


Actnorm.

Kingma and Dhariwal (2018) proposed an activation normalization layer (Actnorm) as an alternative to batch normalization (Ioffe and Szegedy, 2015) to alleviate problems in model training. Similar to batch normalization, Actnorm performs an affine transformation of the activations using a scale and bias parameter per channel for 2D images:

$$y_{i,j} = s \odot x_{i,j} + b, \qquad (4)$$

where both $x$ and $y$ are tensors of shape $[h \times w \times c]$ with spatial dimensions $(h, w)$ and channel dimension $c$.
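A minimal sketch of such a per-channel affine transformation and its log-determinant (our illustration in NCHW layout, not the authors' implementation):

```python
import numpy as np

def actnorm_forward(x, log_s, b):
    # x: (N, C, H, W); log_s, b: per-channel (C,) parameters.
    y = np.exp(log_s)[None, :, None, None] * x + b[None, :, None, None]
    # The same per-channel scale acts on each of the H*W positions, so
    # log|det J| = H * W * sum(log_s) per sample.
    log_det = x.shape[2] * x.shape[3] * np.sum(log_s)
    return y, log_det

def actnorm_inverse(y, log_s, b):
    # Exact inverse of the affine transformation.
    return (y - b[None, :, None, None]) * np.exp(-log_s)[None, :, None, None]
```

In Glow, the scale and bias are initialized from the first minibatch so that activations start with zero mean and unit variance per channel.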

Invertible 1 x 1 convolution.

To incorporate a permutation along the channel dimension, Glow includes a trainable invertible 1 x 1 convolution layer to generalize the permutation operation:

$$y_{i,j} = W x_{i,j}, \qquad (5)$$

where $W$ is the weight matrix with shape $c \times c$.
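The invertible 1 x 1 convolution is a per-pixel matrix multiply; a sketch (our illustration, NCHW layout):

```python
import numpy as np

def inv_conv_forward(x, W):
    # x: (N, C, H, Wsp); W: invertible (C, C) weight matrix.
    y = np.einsum('ij,njhw->nihw', W, x)
    # The same W acts on every pixel, so log|det J| = H * Wsp * log|det W|.
    log_det = x.shape[2] * x.shape[3] * np.linalg.slogdet(W)[1]
    return y, log_det

def inv_conv_inverse(y, W):
    # Invert by multiplying each pixel by W^{-1}.
    return np.einsum('ij,njhw->nihw', np.linalg.inv(W), y)
```

Note that the determinant is over a small c x c matrix, so it stays cheap even for large images (Glow additionally proposes an LU parameterization to reduce this cost).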

Affine Coupling Layers.

Following Dinh et al. (2016), Glow includes affine coupling layers in its architecture:

$$x_a, x_b = \mathrm{split}(x)$$
$$y_a = x_a$$
$$y_b = s(x_a) \odot x_b + b(x_a)$$
$$y = \mathrm{concat}(y_a, y_b), \qquad (6)$$

where $s(x_a)$ and $b(x_a)$ are outputs of two neural networks with $x_a$ as input. The $\mathrm{split}()$ and $\mathrm{concat}()$ functions perform operations along the channel dimension.
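A sketch of the affine coupling transform (our illustration; `net` is a stand-in for the coupling network producing the log-scale and bias from the pass-through half):

```python
import numpy as np

def coupling_forward(x, net):
    # Split along the channel dimension; x_a passes through unchanged.
    x_a, x_b = np.split(x, 2, axis=1)
    log_s, b = net(x_a)
    y_b = np.exp(log_s) * x_b + b
    # The Jacobian is triangular, so log|det J| is the sum of log_s per sample.
    log_det = log_s.sum(axis=(1, 2, 3))
    return np.concatenate([x_a, y_b], axis=1), log_det

def coupling_inverse(y, net):
    # y_a equals x_a, so the same network outputs can be recomputed.
    y_a, y_b = np.split(y, 2, axis=1)
    log_s, b = net(y_a)
    return np.concatenate([y_a, (y_b - b) * np.exp(-log_s)], axis=1)
```

The inverse never needs to invert `net` itself, which is why coupling layers admit arbitrarily complex inner networks.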

From the design of the Glow architecture, we see that interactions between spatial dimensions are incorporated only in the coupling layers. The coupling layer, however, is typically memory-intensive, making it infeasible to stack a large number of coupling layers in a single model, particularly for high-resolution images. The main goal of this work is to design a new type of transformation that simultaneously models dependencies in both the spatial and channel dimensions while maintaining a relatively low memory footprint, in the hope of improving the capacity of the generative flow.

3 Masked Convolutional Generative Flows

In this section, we describe the architectural components that compose the masked convolutional generative flow (MaCow). We first introduce the proposed flow transformation using masked convolutions in §3.1. Then, we present a fine-grained version of the multi-scale architecture adopted in previous generative flows (Dinh et al., 2016; Kingma and Dhariwal, 2018) in §3.2. In §3.3, we briefly revisit the dequantization problem involved in flow-based generative models.

Figure 1: Visualization of the receptive field of two masked convolutions with reversed ordering.

3.1 Flow with Masked Convolutions

Applying autoregressive models to normalizing flows has been explored in previous studies (Kingma et al., 2016; Papamakarios et al., 2017). The idea behind autoregressive flows is to model the input random variables sequentially in an autoregressive order, ensuring that the model cannot read input variables behind the current one:

$$z_t = s(x_{<t}) \odot x_t + \mu(x_{<t}), \qquad (7)$$

where $x_{<t}$ denotes the input variables in $x$ that are in front of $x_t$ in the autoregressive order. $s(\cdot)$ and $\mu(\cdot)$ are two autoregressive neural networks typically implemented using spatial masks (Germain et al., 2015; Oord et al., 2016).

Despite their effectiveness in high-dimensional spaces, autoregressive flows suffer from two crucial problems: (1) the training procedure is unstable when modeling long-range contexts and stacking multiple layers; (2) inference and synthesis are inefficient, due to the non-parallelizable inverse function.

We propose to use masked convolutions that restrict the local connectivity to a small "masked" kernel to address these two problems. The two autoregressive neural networks, $s(\cdot)$ and $\mu(\cdot)$, are implemented with one-layer masked convolutional networks with small kernels, so that they are only able to read contexts in a small neighborhood:

$$z_t = s(x_{N(t)}) \odot x_t + \mu(x_{N(t)}),$$

where $x_{N(t)}$ denotes the input variables, restricted to a small kernel, on which $z_t$ depends. By using masks in reversed ordering and stacking multiple layers of flows, the model can capture a large receptive field (see Figure 1), and is able to model dependencies in both the spatial and channel dimensions.
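To make this concrete, here is one way to build such masks for a small convolution kernel (our illustration, not necessarily the authors' exact mask design): kernel positions strictly before the center in raster-scan order are kept, and the reversed-ordering mask is the 180-degree rotation of the first.

```python
import numpy as np

def masked_kernel(kh, kw):
    # 1 for kernel positions strictly before the centre in raster order,
    # 0 for the centre and everything after it.
    mask = np.zeros((kh, kw))
    ch, cw = kh // 2, kw // 2
    mask[:ch, :] = 1.0   # all rows above the centre row
    mask[ch, :cw] = 1.0  # centre row, strictly left of the centre
    return mask

def reversed_mask(kh, kw):
    # Reversed ordering: rotate the mask by 180 degrees.
    return masked_kernel(kh, kw)[::-1, ::-1]
```

Multiplying the convolution weights elementwise by such a mask before each application restricts s(·) and μ(·) to the allowed neighborhood; stacking layers with the normal and reversed masks grows the receptive field in both directions, as in Figure 1.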

Efficient Synthesis.

As discussed above, synthesis from autoregressive flows is inefficient, since the inverse has to be computed by sequentially traversing the autoregressive order. In the context of 2D images with shape $[h \times w \times c]$, the time complexity of synthesis is $O(h \cdot w \cdot T(h, w, c))$, where $T(h, w, c)$ is the time of computing the outputs of the neural networks $s(\cdot)$ and $\mu(\cdot)$ with input shape $[h \times w \times c]$. In our proposed flow with masked convolutions, computation of $x_t$ can begin as soon as all of $x_{N(t)}$ are available, contrary to the autoregressive requirement that all $x_{<t}$ must be computed. Moreover, at each step we only need to feed a patch of shape $[k_h \times k_w \times c]$ into $s(\cdot)$ and $\mu(\cdot)$, where $[k_h \times k_w]$ is the shape of a kernel in the convolution. Thus, the time complexity reduces significantly to $O(h \cdot w \cdot T(k_h, k_w, c))$.

(a) One step of MaCow (b) Original multi-scale architecture (c) Fine-grained multi-scale architecture
Figure 5: The architecture of the proposed MaCow model, where each step (a) consists of units of ActNorm followed by two masked convolutions with reversed ordering, and a Glow step. This flow is combined with either the original multi-scale architecture (b) or a fine-grained one (c).

3.2 Fine-grained Multi-Scale Architecture

Dinh et al. (2016) proposed a multi-scale architecture using a squeezing operation, which has been demonstrated to be helpful for training very deep flows. In the original multi-scale architecture, the model factors out half of the dimensions at each scale to reduce computational and memory cost. In this paper, inspired by the size upscaling in subscale ordering (Menick and Kalchbrenner, 2019), which generates an image as a sequence of sub-images of equal size, we propose a fine-grained multi-scale architecture to further improve model performance. In this fine-grained multi-scale architecture, each scale consists of several blocks, and after each block the model splits out a portion of the dimensions of the input.¹ Figure 5 illustrates the graphical specification of the two versions of the architecture. Experimental improvements demonstrate the effectiveness of the fine-grained multi-scale architecture (§4).

¹In our experiments, we set . Note that the original multi-scale architecture is a special case of the fine-grained version with .
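The split step after each block can be sketched as follows (our illustration; the fraction of channels split out per block, `frac`, is a hypothetical hyperparameter, as the value used in the paper is not shown in this copy):

```python
import numpy as np

def split_out(h, frac=0.5):
    # After a block, factor out `frac` of the channels as latent z
    # and continue the flow on the remaining channels.
    c = h.shape[1]
    k = max(1, int(round(c * frac)))
    z, h_rest = h[:, :k], h[:, k:]
    return z, h_rest
```

With a single block per scale and frac=0.5, this reduces to the original multi-scale architecture, which factors out half of the dimensions at each scale.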

3.3 Dequantization

As described in §2, generative flows are defined on continuous random variables. Many real-world datasets, however, are recordings of discrete representations of signals, and fitting a continuous density model to discrete data will produce a degenerate solution that places all probability mass on the discrete datapoints (Uria et al., 2013; Ho et al., 2018). A common solution to this problem is "dequantization", which converts the discrete data distribution into a continuous one.

Specifically, in the context of natural images, each dimension (pixel) of the discrete data $x$ takes on values in $\{0, 1, \ldots, 255\}$. The dequantization process adds continuous random noise $u$ to $x$, resulting in a continuous data point:

$$\tilde{x} = x + u,$$

where $u$ is continuous random noise taking values in the interval $[0, 1)$. By modeling the density of $\tilde{x}$ with $p(\tilde{x})$, the distribution of $x$ is defined as:

$$P(x) = \int_{[0,1)^D} p(x + u) \, du.$$

By restricting the range of $u$ to $[0, 1)$, the mapping between $\tilde{x}$ and the pair $(x, u)$ is bijective. Thus, we have $p(\tilde{x}) = p(x, u)$.

By introducing a dequantization noise distribution $q(u|x)$, the training objective in (1) can be re-written as:

$$\mathbb{E}_{x \sim \tilde{P}}[-\log P(x)] \leq \mathbb{E}_{\tilde{x} \sim \tilde{P}_q}\big[\log q(u|x) - \log p(\tilde{x})\big], \qquad (8)$$

where $\tilde{P}_q$ is the distribution of the dequantized variable $\tilde{x} = x + u$ under the dequantization noise distribution $q$.

Uniform Dequantization.

The most common dequantization method used in prior work is uniform dequantization, where the noise $u$ is sampled from the uniform distribution $\mathcal{U}[0, 1)^D$. From (8), we have

$$\mathbb{E}_{x \sim \tilde{P}}[-\log P(x)] \leq \mathbb{E}_{x \sim \tilde{P}} \, \mathbb{E}_{u \sim \mathcal{U}[0,1)^D}[-\log p(x + u)],$$

as $\log q(u|x) = 0$ for the uniform noise $q(u|x) = \mathcal{U}[0, 1)^D$.
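Uniform dequantization for 8-bit images is one line in practice; a sketch (our illustration): add independent uniform noise so the model sees continuous inputs.

```python
import numpy as np

def dequantize_uniform(x_int, rng):
    # x_int: integer pixel array with values in {0, ..., 255}.
    # Returns the continuous value x + u with u ~ U[0, 1) per pixel.
    return x_int.astype(np.float64) + rng.uniform(0.0, 1.0, size=x_int.shape)
```

Each dequantized pixel lies in the half-open unit interval above its integer value, which is exactly the bijectivity condition used above.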

Variational Dequantization.

As discussed in Ho et al. (2018), uniform dequantization asks $p$ to assign uniform density to the unit hypercubes $x + [0, 1)^D$, which is difficult for smooth distribution approximators. They instead proposed a parametric dequantization noise distribution $q_\phi(u|x)$, and the training objective is to optimize the evidence lower bound (ELBO) provided in (8):

$$\mathbb{E}_{x \sim \tilde{P}} \, \mathbb{E}_{u \sim q_\phi(\cdot|x)}\big[\log q_\phi(u|x) - \log p(x + u)\big],$$

where $\phi$ denotes the parameters of the noise distribution. In this paper, we implement both dequantization methods for MaCow (detailed in §4).

4 Experiments

We evaluate our MaCow model on both low-resolution and high-resolution datasets. Each step of MaCow uses masked convolution units, and the Glow step is the same as that described in Kingma and Dhariwal (2018): an ActNorm followed by an invertible 1 x 1 convolution, followed by a coupling layer. Each coupling layer has three convolution layers, where the first and last convolutions are , while the center convolution is . For low-resolution images we use affine coupling layers with 512 hidden channels, while for high-resolution images we use additive coupling layers with 256 hidden channels to reduce memory cost. ELU (Clevert et al., 2015) is used as the activation function throughout the flow architecture. For variational dequantization, the dequantization noise distribution $q_\phi(u|x)$ is modeled with a conditional MaCow with a shallow architecture. More details on architectures, as well as results and analysis of the conducted experiments, will be given in a source code release.

4.1 Low-Resolution Images

We begin our experiments by evaluating the density estimation performance of MaCow on two datasets of low-resolution images commonly used to evaluate deep generative models: CIFAR-10 (Krizhevsky and Hinton, 2009) and a downsampled version of ImageNet (Oord et al., 2016).

Model                                          CIFAR-10   ImageNet-64
Autoregressive:
  IAF VAE (Kingma et al., 2016)                3.11       –
  Parallel Multiscale (Reed et al., 2017)      –          3.70
  PixelRNN (Oord et al., 2016)                 3.00       3.63
  Gated PixelCNN (van den Oord et al., 2016)   3.03       3.57
  MAE (Ma et al., 2019)                        2.95       –
  PixelCNN++ (Salimans et al., 2017)           2.92       –
  PixelSNAIL (Chen et al., 2017)               2.85       3.52
  SPN (Menick and Kalchbrenner, 2019)          –          3.52
Flow-based:
  Real NVP (Dinh et al., 2016)                 3.49       3.98
  Glow (Kingma and Dhariwal, 2018)             3.35       3.81
  Flow++: Unif (Ho et al., 2018)               3.29       –
  Flow++: Var (Ho et al., 2018)                3.09       3.69
  MaCow: Org                                   3.31       3.78
  MaCow: +fine-grained                         3.28       3.75
  MaCow: +var                                  3.16       3.64
Table 1: Density estimation performance on CIFAR-10 and ImageNet-64. Results are reported in bits/dim.

We run ablation studies to dissect the effectiveness of each component of our MaCow model. The Org model utilizes the original multi-scale architecture, while the +fine-grained model augments it with the fine-grained multi-scale architecture proposed in §3.2. The +var model further adds variational dequantization (§3.3) on top of +fine-grained, replacing the uniform dequantization. For each ablation, we slightly adjust the number of steps in each level so that all the models have similar numbers of parameters, for a fair comparison.

Table 1 provides the density estimation performance of different variations of our MaCow model, together with the top-performing autoregressive models (first section) and flow-based generative models (second section). First, on both datasets, the fine-grained models outperform the Org ones, demonstrating the effectiveness of the fine-grained multi-scale architecture. Second, with uniform dequantization, MaCow combined with the fine-grained multi-scale architecture significantly improves over Glow on both datasets, and obtains slightly better results than Flow++ on CIFAR-10. In addition, with variational dequantization, MaCow achieves a 0.05 bits/dim improvement over Flow++ on ImageNet-64. On CIFAR-10, however, the performance of MaCow is around 0.07 bits/dim behind Flow++.

Compared with PixelSNAIL (Chen et al., 2017) and SPN (Menick and Kalchbrenner, 2019), the state-of-the-art autoregressive generative models, the performance of MaCow is around 0.31 bits/dim worse on CIFAR-10 and 0.12 bits/dim worse on ImageNet-64. Further improving the density estimation performance of MaCow on natural images is left to future work.

Model CelebA-HQ
Glow (Kingma and Dhariwal, 2018) 1.03
SPN (Menick and Kalchbrenner, 2019) 0.61
MaCow: Unif 0.95
MaCow: Var 0.74
Table 2: Negative Log-likelihood scores for 5-bit datasets in bits/dim.
Figure 6: 5-bit CelebA-HQ samples, with temperature 0.7.

4.2 High-Resolution Images

We now demonstrate experimentally that our MaCow model is capable of generating high-fidelity samples at high resolution. Following Kingma and Dhariwal (2018), we choose the CelebA-HQ dataset (Karras et al., 2018), which consists of 30,000 high-resolution images from the CelebA dataset (Liu et al., 2015). We train our models on 5-bit images with the fine-grained multi-scale architecture, using both uniform and variational dequantization.

4.2.1 Density Estimation

Table 2 reports the negative log-likelihood scores in bits/dim of the two versions of MaCow on the 5-bit CelebA-HQ dataset. With uniform dequantization, MaCow improves the negative log-likelihood over Glow from 1.03 bits/dim to 0.95 bits/dim. Equipped with variational dequantization, MaCow obtains 0.74 bits/dim, which is 0.13 bits/dim behind the state-of-the-art autoregressive generative model SPN (Menick and Kalchbrenner, 2019), significantly narrowing the gap.

4.2.2 Image Generation

Consistent with previous work on likelihood-based generative models (Parmar et al., 2018; Kingma and Dhariwal, 2018), we found that sampling from a reduced-temperature model often results in higher-quality samples. Figure 6 showcases some random samples obtained from our model on 5-bit CelebA-HQ with temperature 0.7. The images are of remarkably high quality for a non-autoregressive likelihood-based model.

5 Conclusion

In this paper, we proposed a new type of generative flow, coined MaCow, which exploits masked convolutional neural networks. By restricting the local dependencies to a small masked kernel, MaCow enjoys fast and stable training as well as efficient sampling. Experiments on both low- and high-resolution benchmark image datasets demonstrate the capability of MaCow for both density estimation and high-fidelity generation, achieving state-of-the-art or comparable likelihoods and superior sample quality compared to previous top-performing models.

One potential direction for future work is to extend MaCow to other forms of data, in particular text, to which (to the best of our knowledge) no attempt has been made to apply flow-based generative models. Another exciting direction is to combine MaCow, or general flow-based generative models, with variational inference to automatically learn meaningful (low-dimensional) representations from raw data.