1 Introduction
Generative models aim to learn a representation of the data p(x), in contrast with discriminative models that learn a probability distribution of labels given data p(y|x). Generative modeling may be used for numerous applications such as anomaly detection, denoising, inpainting, and super-resolution. The task of generative modeling is challenging, because data is often very high-dimensional, which makes optimization and choosing a successful objective difficult.
Generative models based on normalizing flows (Rippel & Adams, 2013) have several advantages over other generative models: i) They optimize the log-likelihood of a continuous distribution exactly, as opposed to Variational AutoEncoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014), which optimize a lower bound to the log-likelihood. ii) Drawing samples has a computational cost comparable to inference, in contrast with PixelCNNs (Van Oord et al., 2016). iii) Generative flows also have the potential for huge memory savings, because activations necessary in the backward pass can be obtained by computing the inverse of layers (Gomez et al., 2017; Li & Grathwohl, 2018).
The performance of density estimation models can be largely attributed to Masked Autoregressive Flows (MAFs) (Papamakarios et al., 2017) and the coupling layers introduced in NICE and RealNVP (Dinh et al., 2014, 2017). MAFs contain flexible autoregressive transformations, but are computationally expensive to invert, which is a disadvantage for sampling high-dimensional data. Coupling layers transform a subset of the dimensions of the data, parameterized by the remaining dimensions. The inverse of coupling layers is straightforward to compute, which makes them suitable for generative flows. However, since coupling layers can only operate on a subset of the dimensions of the data, they may be limited in flexibility.
To improve their effectiveness, coupling layers are alternated with less complex transformations that do operate on all dimensions of the data. Dinh et al. (2017) use a fixed channel permutation in Real NVP, and Kingma & Dhariwal (2018) utilize learnable 1 × 1 convolutions in Glow.
However, 1 × 1 convolutions suffer from limited flexibility, and using standard convolutions is not straightforward as they are very computationally expensive to invert. We propose two methods to obtain easily invertible and flexible convolutions: emerging and periodic convolutions. Both of these convolutions have receptive fields identical to standard convolutions, resulting in flexible transformations over both the channel and spatial axes.
The structure of an emerging convolution is depicted in Figure 1, where the top depicts the convolution filters, and the bottom shows the equivalent matrices of these convolutions. Two autoregressive convolutions are chained to obtain an emerging receptive field identical to a standard convolution. Empirically, we find that replacing 1 × 1 convolutions with the generalized invertible convolutions produces significantly better results on galaxy images, CIFAR10 and ImageNet, even when correcting for the increase in parameters.
In addition to invertible convolutions, we also propose a QR decomposition for 1 × 1 convolutions, which resolves flexibility issues of the PLU decomposition proposed by Kingma & Dhariwal (2018). The main contributions of this paper are: 1) Invertible emerging convolutions using autoregressive convolutions. 2) Invertible periodic convolutions using decoupling in the frequency domain. 3) Numerically stable and flexible 1 × 1 convolutions parameterized by a QR decomposition. 4) An accelerated inversion module for autoregressive convolutions.
2 Background
Generative Flow | Function | Inverse | Log Determinant
Actnorm | z = γ ⊙ x + β | x = (z − β) ⊘ γ | n_h n_w Σ_c log |γ_c|
Affine coupling | z_a = x_a,  z_b = s(x_a) ⊙ x_b + t(x_a) | x_a = z_a,  x_b = (z_b − t(z_a)) ⊘ s(z_a) | Σ log |s(x_a)|
1 × 1 Conv | ∀ h, w:  z_{:,hw} = W x_{:,hw} | ∀ h, w:  x_{:,hw} = W⁻¹ z_{:,hw} | n_h n_w log |det W|
Emerging Conv | z = k_2 ★ (k_1 ★ x) | solve x sequentially by substitution | n_h n_w Σ_c (log |k_{1,c,c,m}| + log |k_{2,c,c,m}|)
Periodic Conv | ∀ u, v:  ẑ_{:,uv} = Ŵ_{uv} x̂_{:,uv} | ∀ u, v:  x̂_{:,uv} = Ŵ_{uv}⁻¹ ẑ_{:,uv} | Σ_{u,v} log |det Ŵ_{uv}|
⊙ and ⊘ denote element-wise multiplication and division. Input and output may be denoted as tensors x and z with dimensions n_c × n_h × n_w. The inputs and outputs may also be denoted as one-dimensional vectors \vec{x} and \vec{z} with dimension n_c · n_h · n_w. Input and output in the frequency domain are denoted with x̂ and ẑ, with dimensions n_c × n_h × n_w, where the last two components denote frequencies u and v.

2.1 Change of variables formula
Consider a bijective map f between variables x and z, where z = f(x). The likelihood of the variable x can be written as the likelihood of the transformation z = f(x) evaluated under p_Z, using the change of variables formula:

p_X(x) = p_Z(z) \left| \det \frac{\partial z}{\partial x} \right|    (1)

The complicated probability density p_X(x) is equal to the probability density p_Z(z) multiplied by the Jacobian determinant, where p_Z is chosen to be tractable. The function f can be learned, but the choice of f is constrained by two practical issues: firstly, the Jacobian determinant should be tractable; secondly, to draw samples from p_X, the inverse of f should be tractable.
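The change of variables computation can be sketched numerically. The elementwise affine map and its parameters below are hypothetical, chosen only because its Jacobian determinant is trivial to compute:

```python
import numpy as np

# Toy elementwise bijection z = f(x) = a*x + b; 'a' and 'b' are
# hypothetical parameters chosen for illustration only.
a, b = np.array([2.0, 0.5]), np.array([1.0, -1.0])

def f(x):
    return a * x + b

def log_p_z(z):
    # Tractable base density: standard normal.
    return np.sum(-0.5 * z ** 2 - 0.5 * np.log(2 * np.pi))

def log_p_x(x):
    # Change of variables: log p_X(x) = log p_Z(f(x)) + log|det df/dx|.
    log_det = np.sum(np.log(np.abs(a)))  # Jacobian of f is diagonal here
    return log_p_z(f(x)) + log_det

x = np.array([0.3, -0.7])
print(log_p_x(x))
```

For this f, the result agrees with the closed-form Gaussian density of x under z ~ N(0, I), which is a useful sanity check for any flow layer.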
2.1.1 Composition of functions
A sequence composed of several applications of the change of variables formula is often referred to as a normalizing flow (Deco & Brauer, 1995; Tabak et al., 2010; Tabak & Turner, 2013; Rezende & Mohamed, 2015). Let h_1, h_2, …, h_{L−1} be the intermediate representations produced by the layers of a neural network, where h_0 = x and h_L = z. The log-likelihood of x is written as the log-likelihood of z, plus the summation of the log Jacobian determinant of each layer:

\log p_X(x) = \log p_Z(z) + \sum_{l=1}^{L} \log \left| \det \frac{\partial h_l}{\partial h_{l-1}} \right|    (2)
2.1.2 Dequantization
We will evaluate our methods with experiments on image datasets, where pixels are discrete-valued from 0 to 255. Since generative flows are continuous density models, they may trivially place infinite mass on discretized bin locations. Therefore, we use the definition of Theis et al. (2016) that defines the relation between a discrete model P and a continuous model p as an integration over bins: P(\tilde{x}) = \int_{[0,1)^d} p(\tilde{x} + u) \, du, where \tilde{x} denotes the discrete-valued image. They further derive a lower bound to optimize this model with Jensen's inequality, resulting in additive uniform noise u for the integer-valued pixels \tilde{x} from the data distribution \mathcal{D}:

\mathbb{E}_{\tilde{x} \sim \mathcal{D}} \left[ \log P(\tilde{x}) \right] \geq \mathbb{E}_{\tilde{x} \sim \mathcal{D}} \, \mathbb{E}_{u \sim \mathcal{U}[0,1)^d} \left[ \log p(\tilde{x} + u) \right]    (3)
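As a minimal illustration of this dequantization scheme (batch shape and values are arbitrary), discrete pixels are simply perturbed with uniform noise before being passed to the continuous model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete 8-bit "images"; the batch shape is arbitrary.
x_discrete = rng.integers(0, 256, size=(4, 3, 8, 8))

# Additive uniform dequantization: train the continuous density on
# x + u with u ~ U[0, 1)^d, a lower bound on the discrete likelihood
# by Jensen's inequality (Theis et al., 2016).
u = rng.uniform(0.0, 1.0, size=x_discrete.shape)
x_continuous = x_discrete + u

assert np.all((x_continuous >= 0.0) & (x_continuous < 256.0))
```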
2.2 Generative flows
Generative flows are bijective functions, often structured as deep learning layers, that are designed to have tractable Jacobian determinants and inverses. An overview of several generative flows is provided in Table 1, and a description is given below.

Coupling layers (Dinh et al., 2017) split the input in two parts. The output is a combination of a copy of the first part, and a transformation of the second part, parametrized by the first part. As a result, the inverse and Jacobian determinant are straightforward to compute.
Actnorm layers (Kingma & Dhariwal, 2018) are data-dependent initialized layers with scale and translation parameters. They are initialized such that the distribution of activations has mean zero and standard deviation one. Actnorm layers improve training stability and performance.
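A data-dependent initialization of this kind can be sketched as follows. This is a simplified NumPy illustration (NCHW shapes assumed), not the authors' implementation:

```python
import numpy as np

class ActNorm:
    """Per-channel scale and translation with data-dependent initialization,
    in the spirit of Kingma & Dhariwal (2018). Minimal NumPy sketch (NCHW)."""

    def __init__(self):
        self.initialized = False

    def forward(self, x):
        if not self.initialized:
            # Initialize so post-actnorm activations have zero mean and
            # unit standard deviation per channel (over batch and space).
            mean = x.mean(axis=(0, 2, 3), keepdims=True)
            std = x.std(axis=(0, 2, 3), keepdims=True)
            self.s = 1.0 / std
            self.t = -mean / std
            self.initialized = True
        z = self.s * x + self.t
        # log|det|: every spatial position contributes the per-channel
        # log-scales once.
        logdet = x.shape[2] * x.shape[3] * np.sum(np.log(np.abs(self.s)))
        return z, logdet

    def inverse(self, z):
        return (z - self.t) / self.s

x = 3.0 * np.random.default_rng(1).standard_normal((16, 4, 8, 8)) + 2.0
layer = ActNorm()
z, logdet = layer.forward(x)
```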
2.3 Convolutions
Generally, a convolution layer^1 with filter w and input x is equivalent to the multiplication of W, an (n_c · n_h · n_w) × (n_c · n_h · n_w) matrix, and a vectorized input \vec{x}. An example of a single-channel convolution and its equivalent matrix is depicted in Figure 2. The signals x and z are indexed as x_{h · n_w + w}, where w is the width index, h is the height index, and n_w is the total width. Note that the matrix becomes sparser as the image dimensions grow and that the parameters of the filter occur repeatedly in the matrix W. A two-channel convolution is visualized in Figure 3, where we have omitted parameters inside filters to avoid clutter. Here, \vec{x} and \vec{z} are vectorized using indexing x_{c · n_h · n_w + h · n_w + w}, where c denotes the channel index and n_c the number of channels.

^1 Note that in deep learning frameworks, convolutions are often actually cross-correlations. In our equations ⋆ denotes a cross-correlation and ∗ denotes a convolution. In addition, a convolution layer is usually implemented as an aggregation of cross-correlations, i.e., a cross-correlation layer. We denote such a layer with the symbol ★. In the main text we may omit these details, and these operations are referred to as convolutions.
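The equivalence between a zero-padded convolution layer and a sparse matrix multiplication can be verified directly. The helper below is illustrative, restricted to a single channel and an odd-sized filter:

```python
import numpy as np
from scipy.signal import correlate2d

def conv_matrix(k, h, w):
    """Dense matrix W with W @ x.ravel() equal to the zero-padded 'same'
    cross-correlation of an h x w input with an odd-sized filter k."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    W = np.zeros((h * w, h * w))
    for i in range(h):
        for j in range(w):
            for di in range(kh):
                for dj in range(kw):
                    si, sj = i + di - ph, j + dj - pw
                    if 0 <= si < h and 0 <= sj < w:
                        # The same filter weights recur at every position.
                        W[i * w + j, si * w + sj] = k[di, dj]
    return W

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5))
k = rng.standard_normal((3, 3))
W = conv_matrix(k, 5, 5)

assert np.allclose(W @ x.ravel(), correlate2d(x, k, mode='same').ravel())
```

The resulting 25 × 25 matrix is mostly zeros, illustrating why operating on W directly does not scale.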
Using standard convolutions as a generative flow is inefficient. The determinant and inverse can be obtained naïvely by operating directly on the corresponding matrix W, but this would be very expensive, corresponding to a computational complexity of O(n_c^3 n_h^3 n_w^3).
2.4 Autoregressive Convolutions
Autoregressive convolutions have been widely used in the field of normalizing flows (Germain et al., 2015; Kingma et al., 2016) because it is straightforward to compute their Jacobian determinant. Although there exist autoregressive convolutions with different input and output dimensions, we let the input and output dimensions be equal for invertibility. In this case, autoregressive convolutions can be expressed as a multiplication between a triangular weight matrix and a vectorized input.
In practice, a filter k is constructed from weights w and a binary mask m that enforces the autoregressive structure, k = w ⊙ m (see Figure 4). The convolution with the masked filter k is autoregressive without the need to mask inputs, which allows parallel computation of the convolution layer:

z = k ★ x    (4)

where ★ denotes a convolution layer. The matrix multiplication \vec{z} = K \vec{x} produces the equivalent result, where \vec{x} and \vec{z} are the vectorized signals, and K is a sparse triangular matrix constructed from k (see Figure 4). The Jacobian is triangular by design and its determinant can be computed in O(n_c), since it only depends on the diagonal elements of the matrix K:

\log \left| \det \frac{\partial z}{\partial x} \right| = n_h n_w \sum_{c} \log \left| k_{c,c,m_h,m_w} \right|    (5)

where index c denotes the channel and (m_h, m_w) denotes the spatial center of the filter. The inverse of an autoregressive convolution can theoretically be computed using \vec{x} = K^{-1} \vec{z}. In reality this matrix is large and impractical to invert. Since K is triangular, the solution for \vec{x} can be found through forward substitution:

\vec{x}_t = \frac{\vec{z}_t - \sum_{t' < t} K_{t,t'} \vec{x}_{t'}}{K_{t,t}}    (6)

The inverse can be computed by sequentially traversing through the input feature map in the imposed autoregressive order. The total amount of computation for the inverse is of the same order as the forward convolution, but it must be performed sequentially over spatial locations; computation can, however, be parallelized across examples in the minibatch.
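The forward pass, log-determinant, and sequential inverse of a single-channel autoregressive convolution can be sketched as follows. The 3 × 3 mask imposing a raster-scan order is illustrative:

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
h = w = 6

# 3x3 filter masked to the center tap and strictly "past" pixels in raster
# order; the center tap must be nonzero for invertibility.
mask = np.array([[1., 1., 1.],
                 [1., 1., 0.],
                 [0., 0., 0.]])
k = rng.standard_normal((3, 3)) * mask
k[1, 1] = 2.0

x = rng.standard_normal((h, w))
z = correlate2d(x, k, mode='same')   # forward pass, computed in parallel

# Log-determinant (cf. Eq. 5): the Jacobian is triangular with the center
# tap on the diagonal, once per spatial position.
log_det = h * w * np.log(np.abs(k[1, 1]))

# Inverse by forward substitution in raster order (cf. Eq. 6): at each
# position, masked taps only reference already-recovered pixels.
x_rec = np.zeros_like(z)
for i in range(h):
    for j in range(w):
        acc = 0.0
        for di in range(3):
            for dj in range(3):
                si, sj = i + di - 1, j + dj - 1
                if (di, dj) != (1, 1) and 0 <= si < h and 0 <= sj < w:
                    acc += k[di, dj] * x_rec[si, sj]
        x_rec[i, j] = (z[i, j] - acc) / k[1, 1]

assert np.allclose(x_rec, x)
```

The Python loop is only for clarity; the sequential structure is exactly what the accelerated Cython inversion module exploits, batched over examples.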
3 Method
We present two methods to generalize 1 × 1 convolutions to invertible convolutions with larger receptive fields, improving the flexibility of generative flow models. Emerging convolutions are obtained by chaining autoregressive convolutions (Section 3.1), and periodic convolutions are decoupled in the frequency domain (Section 3.2). In Section 3.3, we provide a stable and flexible parameterization for invertible 1 × 1 convolutions.
3.1 Emerging convolutions
Although autoregressive convolutions are invertible, their transformation is restricted by the imposed autoregressive order, enforced through masking of the filters (as depicted in Figure 4). To alleviate this restriction, we propose emerging convolutions, which are more flexible and nevertheless invertible. Emerging convolutions are obtained by chaining specific autoregressive convolutions, invertible via the autoregressive inverses. To some extent this resembles the combination of stacks used to resolve the blind spot problem in conditional image modeling with PixelCNNs (van den Oord et al., 2016), with the important difference that we do not constrain the resulting convolution itself to be autoregressive.
The emerging receptive field can be controlled by chaining autoregressive convolutions with variations in the imposed order. A collection of achievable receptive fields for emerging convolutions is depicted in Figure 5, based on commonly used autoregressive masking.
The autoregressive inverse requires the solution to a sequential problem, and as a result, it inevitably suffers some additional computational cost. In emerging convolutions we minimize this cost through the use of an accelerated parallel inversion module, implemented in Cython, and by maintaining relatively small dimensionality in the emerging convolutions compared to the internal size of coupling layers.
3.1.1 Square Emerging Convolutions
Deep learning applications tend to use square filters, and libraries are specifically optimized for these shapes. Since most of the receptive fields in Figure 5 are unusually shaped, these would require masking to fit them in rectangular arrays, leading to unnecessary computation.
However, there is a special case in which the emerging receptive field of two specific autoregressive convolutions is identical to a standard convolution. These square emerging convolutions can be obtained by combining off-center square convolutions, depicted in the bottom row of Figure 5 (also Figure 1). Our square emerging convolution filters are more efficient since they require fewer masked values in rectangular arrays.
There are two approaches to efficiently compute square emerging convolutions during optimization and density estimation. Either the emerging convolution is expressed as two smaller consecutive convolutions, or the order of convolution is changed: first the smaller filters (k_1 and k_2) are convolved to obtain a single equivalent convolution filter, and then the output of the emerging convolution is obtained by convolving the equivalent filter, k_2 ∗ k_1, with the feature map x:

k_2 ★ (k_1 ★ x) = (k_2 ∗ k_1) ★ x    (7)
This equivalence follows from the associativity of convolutions and the time reversal of real discrete signals in crosscorrelations.
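This associativity can be checked numerically with full (uncropped) cross-correlations; the filter sizes below are arbitrary:

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5))
k1 = rng.standard_normal((2, 2))
k2 = rng.standard_normal((2, 2))

# Chained cross-correlations: k2 star (k1 star x).
lhs = correlate2d(correlate2d(x, k1, mode='full'), k2, mode='full')

# Single equivalent filter: the smaller filters are *convolved* (note the
# time reversal relative to cross-correlation) into k = k2 * k1.
k = convolve2d(k1, k2, mode='full')
rhs = correlate2d(x, k, mode='full')

assert np.allclose(lhs, rhs)
```

Note the combined filter is formed with a convolution, not a cross-correlation; this is exactly the time-reversal mentioned above.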
When the filter size is 1 × 1, two autoregressive convolutions simplify to an LU-decomposed 1 × 1 convolution. To ensure that emerging convolutions are flexible, we use emerging convolutions that consist of a single 1 × 1 convolution and two square autoregressive convolutions with different masking, as depicted in the bottom row of Figure 1. Again, the individual convolutions may all be combined into a single emerging convolution filter using the associativity of convolutions (Equation 7).
3.2 Invertible Periodic Convolutions
In some cases, data may be periodic or boundaries may contain roughly the same values. In these cases it may be advantageous to use invertible periodic
convolutions, which assume that boundaries wrap around. When computed in the frequency domain, this alternative convolution has a tractable determinant Jacobian and inverse. The method leverages the convolution theorem, which states that the Fourier transform of a convolution is given by the elementwise product of the Fourier transformed signals. Specifically, the input and filter are transformed using the Discrete Fourier Transform (DFT) and multiplied elementwise, after which the inverse DFT is taken. By considering the transformation in the frequency domain, the computational complexity of the determinant Jacobian and the inverse are considerably reduced. In contrast with emerging convolutions, which are very specifically parameterized, the filters of periodic convolutions are completely unconstrained.
A standard convolution layer in deep learning is conventionally implemented as an aggregation of cross-correlations over input channels for every output channel. The convolution layer with input x and filter k outputs the feature map z, which is computed as:

z_{c_{out}} = \sum_{c_{in}} k_{c_{out}, c_{in}} \star x_{c_{in}}    (8)
Let \mathcal{F} denote the Fourier transform and let \mathcal{F}^{-1} denote the inverse Fourier transform. The Fourier transform can be moved inside the channel summation, since it is distributive over addition. Let \hat{x}_{c_{in}} = \mathcal{F}(x_{c_{in}}) and \hat{z}_{c_{out}} = \mathcal{F}(z_{c_{out}}), which are indexed by frequencies u and v. Because a convolution differs from a cross-correlation by a time reversal for real signals, let k^r denote the reflection of filter k in both spatial directions. Using these definitions, each cross-correlation is written as an element-wise multiplication in the frequency domain:

\hat{z}_{c_{out}} = \sum_{c_{in}} \hat{k}^r_{c_{out}, c_{in}} \odot \hat{x}_{c_{in}}    (9)
which can be written as a sum of products in scalar form:

\hat{z}_{c_{out}, uv} = \sum_{c_{in}} \hat{k}^r_{c_{out}, c_{in}, uv} \, \hat{x}_{c_{in}, uv}    (10)
The summation of multiplications can be reformulated as a matrix multiplication over the channel axis by viewing the output at frequency (u, v) as a multiplication of the matrix \hat{W}_{uv} and the input vector \hat{x}_{:,uv}:

\hat{z}_{:,uv} = \hat{W}_{uv} \hat{x}_{:,uv}, \qquad [\hat{W}_{uv}]_{c_{out}, c_{in}} = \hat{k}^r_{c_{out}, c_{in}, uv}    (11)
The matrix \hat{W}_{uv} has dimensions n_c × n_c, and the input \hat{x}_{:,uv} and output \hat{z}_{:,uv} are vectors with dimension n_c. The output in the original domain can simply be retrieved by taking the inverse Fourier transform, z_{c_{out}} = \mathcal{F}^{-1}(\hat{z}_{c_{out}}). The perspective of matrix multiplication in the frequency domain decouples the convolution transformation (see Figure 6). Therefore, the log determinant of a periodic convolution layer is equal to the sum of the log determinants of the individual frequency components:

\log \left| \det \frac{\partial z}{\partial x} \right| = \sum_{u,v} \log \left| \det \hat{W}_{uv} \right|    (12)
The determinant remains unchanged by the Fourier transform and its inverse, since these are unitary transformations. The inverse operation requires an inversion of the matrix \hat{W}_{uv} for every frequency (u, v):

\hat{x}_{:,uv} = \hat{W}_{uv}^{-1} \hat{z}_{:,uv}    (13)

The solution of x in the original domain is obtained by the inverse Fourier transform, x_{c_{in}} = \mathcal{F}^{-1}(\hat{x}_{c_{in}}), for every channel c_{in}.
Recall that a standard convolution layer is equivalent to a matrix multiplication with an (n_c · n_h · n_w) × (n_c · n_h · n_w) matrix, where we let the input and output dimensions be equal for invertibility. The Fourier transform decouples the transformation of the convolution layer at each frequency, which divides the computation into n_h · n_w separate matrix multiplications with n_c × n_c matrices. Therefore, the computational cost of the determinant is reduced from O(n_c^3 n_h^3 n_w^3) to O(n_h n_w n_c^3) in the frequency domain, and computation can be parallelized since the matrices are independent across frequencies and independent of the data. Furthermore, the inverse matrices \hat{W}_{uv}^{-1} only need to be computed once after the model has converged, which reduces the inverse convolution to an efficient matrix multiplication with computational complexity O(n_h n_w n_c^2).^2

^2 The inverse also incurs some overhead due to the Fourier transform of the feature maps, which corresponds to a computational complexity of O(n_c n_h n_w log(n_h n_w)).
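The frequency-domain view can be sketched with NumPy's FFT. For simplicity this illustration defines the layer directly as a circular convolution with an unconstrained filter, so the filter reflection discussed above does not appear explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n_c, n_h, n_w = 2, 8, 8

x = rng.standard_normal((n_c, n_h, n_w))
k = rng.standard_normal((n_c, n_c, n_h, n_w))  # unconstrained periodic filter

# Forward pass: one n_c x n_c matrix multiplication per frequency (u, v).
x_f = np.fft.fft2(x)                           # FFT over the two spatial axes
W_f = np.fft.fft2(k)                           # shape (c_out, c_in, u, v)
z_f = np.einsum('oiuv,iuv->ouv', W_f, x_f)
z = np.real(np.fft.ifft2(z_f))

# Log-determinant: sum of per-frequency log-determinants (cf. Eq. 12).
W_mats = np.moveaxis(W_f, (0, 1), (-2, -1))    # shape (u, v, c_out, c_in)
log_det = np.sum(np.log(np.abs(np.linalg.det(W_mats))))

# Inverse: invert each small matrix independently (cf. Eq. 13).
W_inv = np.linalg.inv(W_mats)
x_rec_f = np.einsum('uvab,buv->auv', W_inv, z_f)
x_rec = np.real(np.fft.ifft2(x_rec_f))

assert np.allclose(x_rec, x)
```

Both the determinant and the inverse reduce to batched operations over n_h · n_w small n_c × n_c matrices, which is the source of the complexity reduction described above.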
3.3 QR 1 × 1 convolutions
Standard 1 × 1 convolutions are flexible but may be numerically unstable during optimization, causing crashes in the training procedure. Kingma & Dhariwal (2018) propose to learn a PLU decomposition, but since the permutation matrix is fixed during optimization, its flexibility is limited.
In order to resolve the stability issues while retaining the flexibility of the transformation, we propose to use a QR decomposition. Any real square matrix can be decomposed into a multiplication of an orthogonal and a triangular matrix, W = QR. In a similar fashion to the PLU parametrization, we stabilize the decomposition by choosing R = \tilde{R} + \mathrm{diag}(s), where Q is orthogonal, \tilde{R} is strictly triangular, and the elements in s are nonzero. Any orthogonal matrix Q can be constructed from at most n Householder reflections through Q = H_1 H_2 \cdots H_n, where H_i is a Householder reflection:

H_i = I - 2 \frac{v_i v_i^T}{v_i^T v_i}    (14)

and the vectors v_i are learnable parameters. Note that in our case n = n_c. In practice, arbitrary flexibility of Q may be redundant, and we can trade off computational complexity and flexibility by using a smaller number of Householder reflections. The log determinant of the QR decomposition is \sum_c \log |s_c| and can be computed in O(n_c). The computational complexity to construct Q is between O(n_c^2) and O(n_c^3), depending on the desired flexibility. The QR parametrization has two main advantages: in contrast with the straightforward parameterization it is numerically stable, and it can be completely flexible in contrast with the PLU parametrization.
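A sketch of such a QR parameterization is given below; keeping the diagonal nonzero via exp is one possible choice, not necessarily the authors':

```python
import numpy as np

rng = np.random.default_rng(0)
n_c = 4

# Hypothetical learnable parameters of this sketch: Householder vectors
# for Q, a strictly upper-triangular part for R, and s for its diagonal.
V = rng.standard_normal((n_c, n_c))
R_strict = np.triu(rng.standard_normal((n_c, n_c)), k=1)
s = rng.standard_normal(n_c)

def householder_q(V):
    # Q as a product of Householder reflections H_i = I - 2 v v^T / (v^T v).
    Q = np.eye(n_c)
    for v in V:
        Q = Q @ (np.eye(n_c) - 2.0 * np.outer(v, v) / (v @ v))
    return Q

Q = householder_q(V)
R = R_strict + np.diag(np.exp(s))  # exp keeps the diagonal nonzero (one choice)
W = Q @ R

# Q is orthogonal, so |det W| = |det R| and the log-determinant is cheap:
log_det = np.sum(np.log(np.abs(np.diag(R))))  # equals np.sum(s) here

assert np.allclose(Q @ Q.T, np.eye(n_c))
assert np.isclose(log_det, np.log(np.abs(np.linalg.det(W))))
```

Using fewer rows of V directly implements the flexibility/cost trade-off mentioned above.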
4 Related Work
The field of generative modeling has been approached from several directions. This work mainly builds upon generative flow methods developed in (Rippel & Adams, 2013; Dinh et al., 2014, 2017; Papamakarios et al., 2017; Kingma & Dhariwal, 2018). In (Papamakarios et al., 2017) autoregressive convolutions are also used for density estimation, but both the depth and the number of channels of their model make drawing samples computationally expensive.
Normalizing flows have also been used to perform flexible inference in variational autoencoders (Rezende & Mohamed, 2015; Kingma et al., 2016; Tomczak & Welling, 2016; van den Berg et al., 2018; Huang et al., 2018) and Bayesian neural networks (Louizos & Welling, 2017)
. Instead of designing discrete sequences of transformations, continuoustime normalizing flows can also be designed by drawing a connection with ordinary differential equations
(Chen et al., 2018; Grathwohl et al., 2018).

Other likelihood-based methods such as PixelCNNs (Van Oord et al., 2016) impose a specific order on the dimensions of the image, which may not reflect the actual generative process. Furthermore, drawing samples tends to be computationally expensive. Alternatively, VAEs (Kingma & Welling, 2014) optimize a lower bound of the likelihood. The likelihood can be evaluated via an importance sampling scheme, but the quality of the estimate depends on the number of samples and the quality of the proposal distribution.
Many non-likelihood-based methods that can generate high-resolution image samples utilize Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). Although GANs tend to generate high-quality images, they do not directly optimize a likelihood. This makes it difficult to obtain likelihoods and to measure their coverage of the dataset.
5 Results
The architecture of (Kingma & Dhariwal, 2018) is the starting point for the architecture in our experiments. In the flow module, the invertible 1 × 1 convolution can simply be replaced with a periodic or emerging convolution. For a detailed overview of the architecture see Figure 7. We quantitatively evaluate models on a variety of datasets in bits per dimension, which is the negative log2-likelihood per dimension. We do not use inception-based metrics, as they do not generalize to different datasets, and they do not report overfitting (Barratt & Sharma, 2018). In addition, we provide image samples generated with periodic convolutions trained on galaxy images, and samples generated with emerging convolutions trained on CIFAR10.
Note that generative models are very computationally expensive in general, and we do not have the computational budget to run extremely high-dimensional image modeling tasks.
5.1 Galaxy density modeling
Since periodic convolutions assume that image boundaries are connected, they are suited for data where pixels along the boundaries are roughly the same, or are actually connected. An example of such data is pictures taken in space, as they tend to contain some scattered light sources, and boundaries are mostly dark. Ackermann et al. collected a small classification dataset of galaxies with images of merging and non-merging galaxies. On the non-merging galaxy images, we compare the bits per dimension of three models, constrained by the same parameter budget: 1 × 1 convolutions (Glow), periodic and emerging convolutions (see Table 2). Experiments show that both our periodic and emerging convolutions significantly outperform 1 × 1 convolutions, and their performance is less sensitive to initialization. Samples of the model using periodic convolutions are depicted in Figure 8.
Model | Galaxy (bits/dim)
1 × 1 (Glow) | 2.03
Periodic | 1.98
Emerging | 1.98
5.2 Emerging convolutions
The performance of emerging convolutions is extensively tested on CIFAR10 (Krizhevsky & Hinton, 2009) and ImageNet (Russakovsky et al., 2015), with different architectural sizes. The experiments in Table 3 use the architecture from Kingma & Dhariwal (2018), where emerging convolutions replace the 1 × 1 convolutions. Emerging convolutions perform either on par with or better than Glow^3, which may be explained by the overparameterization of these large models. Samples of the model using emerging convolutions are depicted in Figure 9.

^3 The CIFAR10 performance of Glow was obtained by running the code from the original github repository.
In some cases, it may not be feasible to run very large models in production because of the large computational cost. Therefore, it is interesting to study the behavior of models when they are constrained in size. We compare 1 × 1 and emerging convolutions with the same number of flows per level (D), for D = 8 and D = 4. Both on CIFAR10 and ImageNet, we observe that models using emerging convolutions perform significantly better. Furthermore, for smaller models the contribution of emerging convolutions becomes more important, as evidenced by the increasing performance gap (see Table 4).
Model | CIFAR10 | ImageNet 32x32 | ImageNet 64x64
Real NVP | 3.51 | 4.28 | 3.98
Glow | 3.36 | 4.09 | 3.81
Emerging | 3.34 | 4.09 | 3.81
Model | CIFAR10 | ImageNet 32x32 | D
1 × 1 (Glow) | 3.46 | 4.18 | 8
Emerging | 3.43 | 4.16 | 8
1 × 1 (Glow) | 3.56 | 4.28 | 4
Emerging | 3.51 | 4.25 | 4
5.3 Modeling and sample time comparison with MAF
Recall that the inverse of autoregressive convolutions requires solving a sequential problem, which we have accelerated with an inversion module that uses Cython and parallelism across the minibatch. Considering CIFAR10 and the same architecture as used in Table 3
, it takes 39ms to sample an image using our accelerated emerging inverses, 46 times faster than the naïvely obtained inverses using tensorflow bijectors (see Table
5). As expected, sampling from models using convolutions remains faster and takes 5ms.Masked Autoregressive Flows (MAFs) are a very flexible method for density estimation, and they improve performance over emerging convolutions slightly, 3.33 versus 3.34 bits per dimension. However, the width and depth of MAFs makes them a poor choice for sampling, because it considerably increases the time to compute their inverse: 3000ms per sample using a naïve solution, and 650ms per sample using our inversion module. Since emerging convolutions operate on lower dimensions of the data, they are 17 times faster to invert than the MAFs.
CIFAR10 | bits/dim | Naïve sample (ms) | Accelerated sample (ms)
1 × 1 (Glow) | 3.36 | 5 | 5
MAF | 3.33 | 3000 | 650
Emerging | 3.34 | 1800 | 39
5.4 QR 1 × 1 convolutions
QR 1 × 1 convolutions are compared with standard and PLU-parameterized 1 × 1 convolutions on the CIFAR10 dataset. The models have 3 levels and 8 flows per level. Experiments confirm that our stable QR decomposition achieves the same performance as the standard parameterization, as shown in Table 6. This is expected, since any real square matrix has a QR decomposition. Furthermore, the experiments confirm that the less flexible PLU parameterization leads to worse performance, which is caused by the fixed permutation matrix.
Parametrization | CIFAR10
W | 3.46
PLU | 3.47
QR | 3.46
6 Conclusion
We have introduced three generative flows: i) emerging convolutions as invertible standard zero-padded convolutions, ii) periodic convolutions for periodic data or data with minimal boundary variation, and iii) stable and flexible 1 × 1 convolutions using a QR parametrization. Our methods show consistent improvements over various datasets using the same parameter budget, especially when considering models constrained in size.
References

Ackermann, S., Schawinski, K., Zhang, C., Weigel, A. K., and Turp, M. D. Using transfer learning to detect galaxy mergers. Monthly Notices of the Royal Astronomical Society.
Barratt & Sharma (2018) Barratt, S. and Sharma, R. A note on the inception score. ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.
 Chen et al. (2018) Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6572–6583, 2018.
 Deco & Brauer (1995) Deco, G. and Brauer, W. Higher Order Statistical Decorrelation without Information Loss. In Tesauro, G., Touretzky, D. S., and Leen, T. K. (eds.), Advances in Neural Information Processing Systems 7, pp. 247–254. MIT Press, 1995.
Dinh et al. (2014) Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
Dinh et al. (2017) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. International Conference on Learning Representations, ICLR, 2017.

Germain et al. (2015) Germain, M., Gregor, K., Murray, I., and Larochelle, H. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pp. 881–889, 2015.
Gomez et al. (2017) Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems, pp. 2214–2224, 2017.
Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
 Grathwohl et al. (2018) Grathwohl, W., Chen, R. T., Betterncourt, J., Sutskever, I., and Duvenaud, D. Ffjord: Freeform continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.
 Huang et al. (2018) Huang, C.W., Krueger, D., Lacoste, A., and Courville, A. Neural autoregressive flows. arXiv preprint arXiv:1804.00779, 2018.
 Kingma & Dhariwal (2018) Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10236–10245, 2018.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. AutoEncoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, 2014.
 Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751, 2016.
 Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Li & Grathwohl (2018) Li, X. and Grathwohl, W. Training Glow with Constant Memory Cost. NIPS Workshop on Bayesian Deep Learning, 2018.
 Louizos & Welling (2017) Louizos, C. and Welling, M. Multiplicative normalizing flows for variational bayesian neural networks. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 2218–2227. JMLR. org, 2017.
 Papamakarios et al. (2017) Papamakarios, G., Murray, I., and Pavlakou, T. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338–2347, 2017.
 Rezende & Mohamed (2015) Rezende, D. and Mohamed, S. Variational Inference with Normalizing Flows. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1530–1538. PMLR, 2015.
 Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 1278–1286. PMLR, 2014.
 Rippel & Adams (2013) Rippel, O. and Adams, R. P. Highdimensional probability estimation with deep density models. arXiv preprint arXiv:1302.5125, 2013.

Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
Tabak & Turner (2013) Tabak, E. and Turner, C. V. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013.
 Tabak et al. (2010) Tabak, E. G., VandenEijnden, E., et al. Density estimation by dual ascent of the loglikelihood. Communications in Mathematical Sciences, 8(1):217–233, 2010.
 Theis et al. (2016) Theis, L., van den Oord, A., and Bethge, M. A note on the evaluation of generative models. In International Conference on Learning Representations, 2016.
 Tomczak & Welling (2016) Tomczak, J. M. and Welling, M. Improving variational autoencoders using householder flow. arXiv preprint arXiv:1611.09630, 2016.
 van den Berg et al. (2018) van den Berg, R., Hasenclever, L., Tomczak, J. M., and Welling, M. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.
 van den Oord et al. (2016) van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. Conditional Image Generation with PixelCNN Decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016.

Van Oord et al. (2016) Van Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. In International Conference on Machine Learning, pp. 1747–1756, 2016.
Appendix A Experimental details
Models are optimized with settings identical to (Kingma & Dhariwal, 2018). The optimizer Adamax is used with a learning rate of 0.001. Coupling layers contain neural networks with three convolution layers; the first and last convolution are 3 × 3 and the center convolution is 1 × 1. The two hidden layers have w (width) channels and ReLU activations. A flow module consists of an actnorm layer, a mixing layer and a coupling layer. A mixing layer is either a 1 × 1 convolution (Glow) or an emerging or periodic convolution. A forward pass of the entire model uses D flows per level, for L levels. At the end of a level the squeeze operation reduces the spatial dimensions by two, and increases the channel dimensions by four.

Model | L | D | width | batchsize
1 × 1 (Glow) | 3 | 8 | 512 | 256
Periodic | 3 | 8 | 510 | 256
Emerging | 3 | 8 | 510 | 256
In the MAF model, 1 × 1 convolutions are succeeded (not replaced) by two MAF layers with opposite autoregressive order. MAF layers have the same structure as coupling layers, with the difference that the two hidden layers have 96 channels.
Model | L | D | width | batchsize
Glow | 3 | 32 | 512 | 256
MAF | 3 | 32 | 462 | 256
Emerging | 3 | 32 | 512 | 256
The models constrained in size are tested on CIFAR10 and have a different number of flows (D) per level. The other hyperparameters remain unchanged.

Model | L | width | batchsize
Glow | 3 | 512 | 256
Emerging | 3 | 507 | 256
Experiment | Dataset | L | D | width | batchsize
Glow | ImageNet 32 | 3 | 48 | 512 | 256
Emerging | ImageNet 32 | 3 | 48 | 512 | 256
Glow | ImageNet 64 | 3 | 48 | 512 | 128
Emerging | ImageNet 64 | 3 | 48 | 512 | 128
The experiments that compare parameterizations of the 1 × 1 convolutions all use the same hyperparameters: L = 3, D = 8, width 512, and batchsize 256.