1 Introduction
Invertible neural networks have many applications in machine learning. They have been employed to investigate representations of deep classifiers [14], understand the cause of adversarial examples [13], learn transition operators for MCMC [26, 17], create generative models that are directly trainable by maximum likelihood [6, 5, 22, 15, 9, 1], and perform approximate inference [25, 16].
Many applications of invertible neural networks require that both inverting the network and computing the Jacobian determinant be efficient. Typical neural networks are not invertible, and achieving these properties often imposes restrictive constraints on the architecture. For example, planar flows [25] and Sylvester flows [2] constrain the number of hidden units to be smaller than the input dimension. NICE [5] and Real NVP [6] rely on dimension partitioning heuristics and specific architectures such as coupling layers, which could make training more difficult [1]. Methods like FFJORD [9] and iResNets [1] have fewer architectural constraints. However, their Jacobian determinants have to be approximated, which is problematic if repeatedly performed at training time as in flow models.
In this paper, we propose a new method of constructing invertible neural networks which are flexible, efficient to invert, and whose Jacobian determinant can be computed exactly and efficiently. We use triangular matrices as our basic module. Then, we provide a set of composition rules to recursively build more complex nonlinear modules from the basic module, and show that the composed modules are invertible as long as their Jacobians are nonsingular. As in previous work [6, 22], the Jacobians of our modules are triangular, allowing efficient determinant computation. The inverse of these modules can be obtained by an efficiently parallelizable fixed-point iteration method, making the cost of inversion comparable to that of an iResNet [1] block.
Using our composition rules and masked convolutions as the basic triangular building block, we construct a rich set of invertible modules to form a deep invertible neural network. The architecture of our proposed invertible network closely follows that of ResNet [10], the state-of-the-art architecture for discriminative learning. We call our model Masked Invertible Network (MintNet). To demonstrate the capacity of MintNets, we first test them on image classification. We find that a MintNet classifier achieves 99.6% accuracy on MNIST, matching the performance of a ResNet with a similar architecture. On CIFAR-10, it achieves 91.2% accuracy, comparable to the 92.6% accuracy of ResNet. When used as generative models, MintNets achieve new state-of-the-art results in bits per dimension (bpd) on uniformly dequantized images. Specifically, MintNet achieves bpd values of 0.98, 3.32, and 4.06 on MNIST, CIFAR-10 and ImageNet 32×32, while the former best (published) results are 0.99 (FFJORD [9]), 3.35 (Glow [15]) and 4.09 (Glow) respectively. Moreover, MintNet uses fewer parameters and less computational resources. Our MNIST model uses 30% fewer parameters than FFJORD [9]. For CIFAR-10 and ImageNet 32×32, MintNet uses 60% and 74% fewer parameters than the corresponding Glow [15] models. When training on datasets such as CIFAR-10, MintNet required 2 GPUs for approximately 5 days, while FFJORD [9] used 6 GPUs for approximately 5 days, and Glow [15] used 8 GPUs for approximately 7 days.
2 Background
Consider a neural network $f: \mathbb{R}^D \to \mathbb{R}^L$ that maps a data point $\mathbf{x} \in \mathbb{R}^D$ to a latent representation $\mathbf{z} \in \mathbb{R}^L$. When for every $\mathbf{z} \in \mathbb{R}^L$ there exists a unique $\mathbf{x} \in \mathbb{R}^D$ such that $f(\mathbf{x}) = \mathbf{z}$, we call $f$ an invertible neural network. There are several basic properties of invertible networks. First, when $f$ is continuous, a necessary condition for $f$ to be invertible is $D = L$. Second, if $f_1$ and $f_2$ are both invertible, $f = f_2 \circ f_1$ will also be invertible. In this work, we mainly consider applications of invertible neural networks to classification and generative modeling.
2.1 Classification with invertible neural networks
Neural networks for classification are usually not invertible because the number of classes $L$ is usually different from the input dimension $D$. Therefore, when discussing invertible neural networks for classification, we separate the classifier into two parts $f = f_2 \circ f_1$: feature extraction $\mathbf{z} = f_1(\mathbf{x})$ and classification $\mathbf{y} = f_2(\mathbf{z})$, where $f_2$ is usually the softmax function. We say the classifier is invertible when $f_1$ is invertible. Invertible classifiers are arguably more interpretable, because a prediction can be traced back by inverting latent representations [14, 13].
2.2 Generative modeling with invertible neural networks
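An invertible network $f: \mathbf{x} \in \mathbb{R}^D \mapsto \mathbf{z} \in \mathbb{R}^D$ can be used to warp a complex probability density $p(\mathbf{x})$ to a simple base distribution $\pi(\mathbf{z})$ (e.g., a multivariate standard Gaussian) [5, 6]. Under the condition that both $f$ and $f^{-1}$ are differentiable, the densities $p(\mathbf{x})$ and $\pi(\mathbf{z})$ are related by the following change of variables formula:
$$\log p(\mathbf{x}) = \log \pi(\mathbf{z}) + \log \left| \det J_f(\mathbf{x}) \right|, \tag{1}$$
where $J_f(\mathbf{x})$ denotes the Jacobian of $f(\mathbf{x})$ and we require $J_f(\mathbf{x})$ to be nonsingular so that $\log |\det J_f(\mathbf{x})|$ is well-defined. Using this formula, $p(\mathbf{x})$ can be easily computed if the Jacobian determinant $\det J_f(\mathbf{x})$ is cheaply computable and $\pi(\mathbf{z})$ is known.
Therefore, an invertible neural network $f$ implicitly defines a normalized density model $p(\mathbf{x})$, which can be directly trained by maximum likelihood. The invertibility of $f$ is critical to fast sample generation. Specifically, in order to generate a sample $\mathbf{x}$ from $p(\mathbf{x})$, we can first draw $\mathbf{z} \sim \pi(\mathbf{z})$, and warp it back through the inverse of $f$ to obtain $\mathbf{x} = f^{-1}(\mathbf{z})$.
Note that multiple invertible models $f_1, f_2, \dots, f_K$ can be stacked together to form a deeper invertible model $f = f_K \circ \cdots \circ f_2 \circ f_1$, without much impact on the inverse and determinant computation. This is because we can sequentially invert each component, i.e., $f^{-1} = f_1^{-1} \circ f_2^{-1} \circ \cdots \circ f_K^{-1}$, and the total Jacobian determinant equals the product of each individual Jacobian determinant, i.e., $|\det J_f| = |\det J_{f_1}| \, |\det J_{f_2}| \cdots |\det J_{f_K}|$.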
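The bookkeeping for stacked invertible models can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes a standard Gaussian base distribution and uses hypothetical elementwise affine layers (`affine_forward`, `affine_inverse`) purely because their Jacobians are trivial to write down. It accumulates the log-determinants in the forward direction and inverts the layers in reverse order.

```python
import numpy as np

def affine_forward(x, scale, shift):
    """Hypothetical elementwise invertible layer z = scale * x + shift."""
    z = scale * x + shift
    log_det = np.sum(np.log(np.abs(scale)))  # Jacobian is diag(scale)
    return z, log_det

def affine_inverse(z, scale, shift):
    return (z - shift) / scale

def standard_normal_logpdf(z):
    return -0.5 * np.sum(z ** 2) - 0.5 * z.size * np.log(2.0 * np.pi)

def log_density(x, layers):
    """log p(x) = log pi(z) + sum_k log|det J_{f_k}| for f = f_K o ... o f_1."""
    h, total_log_det = x, 0.0
    for scale, shift in layers:
        h, log_det = affine_forward(h, scale, shift)
        total_log_det += log_det
    return standard_normal_logpdf(h) + total_log_det, h

def invert(z, layers):
    """x = f_1^{-1}(... f_K^{-1}(z)): undo the layers in reverse order."""
    h = z
    for scale, shift in reversed(layers):
        h = affine_inverse(h, scale, shift)
    return h

layers = [(np.array([2.0, 0.5]), np.array([1.0, -1.0])),
          (np.array([-3.0, 4.0]), np.array([0.0, 2.0]))]
x = np.array([0.3, -0.7])
log_px, z = log_density(x, layers)
x_rec = invert(z, layers)
```

Here the accumulated log-determinant is $\log(2 \cdot 0.5 \cdot 3 \cdot 4) = \log 12$, and `x_rec` recovers `x` up to floating-point error.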
3 Building invertible modules compositionally
In this section, we discuss how simple blocks like masked convolutions can be composed to build invertible modules that allow efficient, parallelizable inversion and determinant computation. To this end, we first introduce the basic building block of our models. Then, we propose a set of composition rules to recursively build up complex nonlinear modules with triangular Jacobians. Next, we prove that these composed modules are invertible as long as their Jacobians are nonsingular. Finally, we discuss how these modules can be inverted efficiently using numerical methods.
3.1 The basic module
We start by considering linear transformations $f(\mathbf{x}) = \mathbf{W}\mathbf{x} + \mathbf{b}$, with $\mathbf{W} \in \mathbb{R}^{D \times D}$ and $\mathbf{b} \in \mathbb{R}^D$. For a general $\mathbf{W}$, computing its Jacobian determinant requires $O(D^3)$ operations. We therefore choose $\mathbf{W}$ to be a triangular matrix. In this case, the Jacobian determinant $\det J_f(\mathbf{x}) = \det \mathbf{W}$ is the product of all diagonal entries of $\mathbf{W}$, and the computational complexity is reduced to $O(D)$. The linear function $f(\mathbf{x}) = \mathbf{W}\mathbf{x} + \mathbf{b}$ with $\mathbf{W}$ being triangular is our basic module.
Masked convolutions.
Convolution is a special type of linear transformation that is very effective for image data. The triangular structure of the basic module can be achieved using masked convolutions (e.g., causal convolutions in PixelCNN [20]). We provide the formula of our masks in Appendix B and an illustration of a masked convolution in Fig. 1. Intuitively, the causal structure of the filters (ordering of the pixels) enforces a triangular structure.
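As a small numerical check of the complexity argument in this section, the sketch below compares the $O(D)$ diagonal-product computation with a general-purpose determinant routine on a random lower triangular matrix. The matrix, its size, and the variable names are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 6

# Random lower triangular weight with a nonzero diagonal of mixed signs.
W = np.tril(rng.normal(size=(D, D)))
W[np.diag_indices(D)] = rng.uniform(0.5, 2.0, size=D) * rng.choice([-1.0, 1.0], size=D)

# O(D): log|det W| is the sum of log-magnitudes of the diagonal entries.
log_abs_det_fast = np.sum(np.log(np.abs(np.diag(W))))

# General-purpose reference based on an LU factorization, O(D^3).
sign, log_abs_det_ref = np.linalg.slogdet(W)
```

The two values agree up to floating-point error, while the first requires only a pass over the diagonal.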
3.2 The calculus of building invertible modules
Complex nonlinear invertible functions can be constructed from our basic modules in two steps. First, we follow several composition rules so that the composed module has a triangular Jacobian. Next, we impose appropriate constraints so that the module is invertible. To simplify the discussion, we only consider modules with lower triangular Jacobians here, and we note that it is straightforward to extend the analysis to modules with upper triangular Jacobians.
The following proposition summarizes several rules to compositionally build new modules with triangular Jacobians using existing ones.
Proposition 1.
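Define $\mathcal{F}$ as the set of all continuously differentiable functions whose Jacobian is lower triangular. Then $\mathcal{F}$ contains the basic module in Section 3.1, and is closed under the following composition rules.

Rule of addition. $f_1 \in \mathcal{F} \wedge f_2 \in \mathcal{F} \Rightarrow \lambda f_1 + \mu f_2 \in \mathcal{F}$, where $\lambda, \mu \in \mathbb{R}$.

Rule of composition. $f_1 \in \mathcal{F} \wedge f_2 \in \mathcal{F} \Rightarrow f_2 \circ f_1 \in \mathcal{F}$. A special case is $f \in \mathcal{F} \Rightarrow h \circ f \in \mathcal{F}$, where $h(\cdot)$ is a continuously differentiable nonlinear activation function that is applied elementwise.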
The proof of this proposition is straightforward and deferred to Appendix A. By repeatedly applying the rules in Proposition 1, our basic linear module can be composed to construct complex nonlinear modules having continuous and triangular Jacobians. Note that besides our linear basic modules, other functions with triangular and continuous Jacobians can also be made more expressive using the composition rules. For example, the layers of dimension partitioning models (e.g., NICE [5], Real NVP [6], Glow [15]) and autoregressive flows (e.g., MAF [22]) all have continuous and triangular Jacobians and therefore belong to $\mathcal{F}$. Note that the rule of addition in Proposition 1 preserves triangular Jacobians but not invertibility. Therefore, we need additional constraints if we want the composed functions to be invertible.
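The closure rules of Proposition 1 can be checked numerically on toy modules. The sketch below is an illustration under assumed choices (random lower triangular weights, `tanh` as the elementwise activation, a finite-difference Jacobian helper named `numerical_jacobian`): it builds one module with the rule of addition and one with the rule of composition, then confirms that the strictly upper triangular part of each Jacobian vanishes.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian; J[i, j] approximates d f_i / d x_j."""
    D = x.size
    J = np.zeros((D, D))
    for j in range(D):
        e = np.zeros(D)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return J

rng = np.random.default_rng(0)
D = 5
W1 = np.tril(rng.normal(size=(D, D)))
W2 = np.tril(rng.normal(size=(D, D)))
b1, b2 = rng.normal(size=D), rng.normal(size=D)

f1 = lambda x: W1 @ x + b1                      # basic module
f2 = lambda x: W2 @ x + b2                      # basic module
added = lambda x: 0.7 * f1(x) - 1.3 * f2(x)     # rule of addition
composed = lambda x: f2(np.tanh(f1(x)))         # rule of composition with h = tanh

x0 = rng.normal(size=D)
upper_added = np.triu(numerical_jacobian(added, x0), k=1)
upper_composed = np.triu(numerical_jacobian(composed, x0), k=1)
```

Both `upper_added` and `upper_composed` are zero up to finite-difference error. The check says nothing about invertibility, which needs the additional nonsingularity condition discussed next.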
Next, we state the condition for $f \in \mathcal{F}$ to be invertible, and denote the invertible subset of $\mathcal{F}$ as $\mathcal{M}$.
Theorem 1.
If $f \in \mathcal{F}$ and $J_f(\mathbf{x})$ is nonsingular for all $\mathbf{x}$ in the domain, then $f$ is invertible.
Proof.
A proof can be found in Appendix A. ∎
The nonsingularity constraint on $J_f(\mathbf{x})$ in Theorem 1 is natural in the context of generative modeling. This is because in order for Eq. (1) to make sense, $\log |\det J_f(\mathbf{x})|$ has to be well-defined, which requires $J_f(\mathbf{x})$ to be nonsingular.
In many cases, Theorem 1 can be easily used to check and enforce the invertibility of $f \in \mathcal{F}$. For example, the layers of autoregressive flow models and dimension partitioning models can all be viewed as elements of $\mathcal{F}$ because they are continuously differentiable and have triangular Jacobians. Since the diagonal entries of their Jacobians are always strictly positive, and hence the Jacobians are nonsingular, we can immediately conclude from Theorem 1 that they are invertible, thus generalizing their model-specific proofs of invertibility.
In Fig. 2, we provide a Venn diagram to illustrate the set of functions that satisfy the condition of Theorem 1. As depicted by the orange set, Theorem 1 captures a subset of $\mathcal{M}$ where the Jacobians of functions are nonsingular so that the change of variables formula is usable. Note that the condition in Theorem 1 is sufficient but not necessary. For example, $f(x) = x^3$ is invertible, but $J_f(0) = 0$ is singular. Many previous invertible models with special architectures, such as NICE, Real NVP, and MAF, can be viewed as elements belonging to subsets of $\mathcal{M}$.
3.3 Efficient inversion of the invertible modules
In this section, we show that when the conditions in Theorem 1 hold, not only do we know that $f \in \mathcal{F}$ is invertible ($f \in \mathcal{M}$), but we also have a fixed-point iteration method to invert $f$ with strong theoretical guarantees and good performance in practice.
The pseudocode of our proposed inversion algorithm is described in Algorithm 1. Theoretically, we can prove that this method is locally convergent—as long as the initial value is close to the true value, the method is guaranteed to find the correct inverse. We formally summarize this result in Theorem 2.
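A minimal sketch of such an iteration is given below. It assumes the update takes the diagonally preconditioned form that the proof in Appendix A describes, namely $\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha\, \mathrm{diag}(J_f(\mathbf{x}_t))^{-1} (f(\mathbf{x}_t) - \mathbf{z})$. The toy map `f`, its analytic Jacobian diagonal, the function name `fixed_point_inverse`, and all hyperparameters are hypothetical choices for illustration, not the paper's Mint layer.

```python
import numpy as np

def fixed_point_inverse(f, diag_jacobian, z, x0, alpha=1.0, num_iters=60):
    """Iterate x <- x - alpha * diag(J_f(x))^{-1} (f(x) - z)."""
    x = x0.copy()
    for _ in range(num_iters):
        # Every coordinate is updated at once, so each step is fully vectorized.
        x = x - alpha * (f(x) - z) / diag_jacobian(x)
    return x

# Toy map with a lower triangular Jacobian and strictly positive diagonal.
rng = np.random.default_rng(0)
D = 8
L = (np.tril(rng.uniform(-0.3, 0.3, size=(D, D)), k=-1)
     + np.diag(rng.uniform(0.5, 1.0, size=D)))

def f(x):
    return x + 0.5 * np.tanh(L @ x)

def diag_jacobian(x):
    # d f_i / d x_i = 1 + 0.5 * (1 - tanh^2((L x)_i)) * L_ii
    return 1.0 + 0.5 * (1.0 - np.tanh(L @ x) ** 2) * np.diag(L)

x_true = np.linspace(-1.0, 1.0, D)
z = f(x_true)
x_rec = fixed_point_inverse(f, diag_jacobian, z, x0=np.zeros(D))
max_error = np.max(np.abs(x_rec - x_true))
```

Each iteration needs only one evaluation of `f` and of the Jacobian diagonal, which is what makes the method easy to parallelize. On this well-conditioned toy map the iteration converges from a zero initialization, but in general only local convergence should be expected.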
Theorem 2.
The iterative method of Algorithm 1 is locally convergent whenever $0 < \alpha < 2$.
Sketch of Proof.
In practice, the method is also easily parallelizable on GPUs, making the cost of inverting $f \in \mathcal{M}$ similar to that of an iResNet [1] layer. Within each iteration, the computation is mostly matrix operations that can be vectorized and run efficiently in parallel. Therefore, the time cost will be roughly proportional to the number of iterations, i.e., $O(T)$. As will be shown in our experiments, Algorithm 1 converges fast and usually the error quickly becomes negligible when $T \ll D$. This is in stark contrast to existing methods of inverting autoregressive flow models such as MAF [22], where $D$ univariate equations need to be solved sequentially, requiring at least $O(D)$ iterations. There are also other approaches for inverting $f$. For example, the bisection method is guaranteed to converge globally, but its computational cost is $O(D)$, and is usually much more expensive than Algorithm 1. Note that as discussed earlier, autoregressive flow models can also be viewed as special cases of our framework. Therefore, Algorithm 1 is also applicable to inverting autoregressive flow models and could potentially result in large improvements in sampling speed.
4 Masked Invertible Networks
We show that techniques developed in Section 3 can be used to build our Masked Invertible Network (MintNet). First, we discuss how we compose several masked convolutions to form the Masked Invertible Layer (Mint layer). Next, we stack multiple Mint layers to form a deep neural network, i.e., the MintNet. Finally, we compare MintNets with several existing invertible architectures.
4.1 Building the Masked Invertible Layer
We construct an invertible module in $\mathcal{M}$ that serves as the basic layer of our MintNet. This invertible module, named Mint layer, is defined as
(2) 
where denotes the elementwise multiplication, , , and are all lower triangular matrices with additional constraints to be specified later, and . Additionally, Mint layers use a monotonic activation function , so that . Common choices of include ELU [4], tanh and sigmoid. Note that every individual weight matrix has the same size, and the 3 groups of weights , and can be implemented with 3 masked convolutions (see Appendix B). We design the form of so that it resembles a ResNet / iResNet block that also has 3 convolutions with filters, with being the number of channels of .
From Proposition 1 in Section 3.2, we can easily conclude that . Now, we consider additional constraints on the weights so that , i.e., it is invertible. Note that the analytic form of its Jacobian is
(3) 
with , , and . Therefore, once we impose the following constraint
(4) 
we have , which satisfies the condition of Theorem 1 and as a consequence we know . In practice, the constraint Eq. (4) can be easily implemented. For all , we impose no constraint on and , but replace with . Note that has the same signs as and therefore . Moreover, is almost everywhere differentiable w.r.t. , which allows gradients to backprop through.
4.2 Constructing the Masked Invertible Network
In this section, we introduce design choices that help stack multiple Mint layers together to form an expressive invertible neural network, namely the MintNet. The full MintNet is constructed by stacking the following paired Mint layers and squeezing layers.
Paired Mint layers.
As discussed above, our Mint layer always has a triangular Jacobian. To maximize the expressive power of our invertible neural network, it is undesirable to constrain the Jacobian of the network to be triangular since this limits capacity and will cause blind spots in the receptive field of masked convolutions. We thus always pair two Mint layers together—one with a lower triangular Jacobian and the other with an upper triangular Jacobian, so that the Jacobian of the paired layers is not triangular, and blind spots can be eliminated.
Squeezing layers.
Subsampling is important for enlarging the receptive field of convolutions. However, common subsampling operations such as pooling and strided convolutions are usually not invertible. Following [6] and [1], we use a “squeezing” operation to reshape the feature maps so that they have smaller resolution but more channels. After a squeezing operation, the height and width will decrease by a factor of $k$, but the number of channels will increase by a factor of $k^2$. This procedure is invertible and the Jacobian is an identity matrix. Throughout the paper, we use $k = 2$.
4.3 Comparison to other approaches
In what follows we compare MintNets to several existing methods for developing invertible architectures. We will focus on architectures with a tractable Jacobian determinant. However, we note that there are models (cf., [7, 19, 8]) that allow fast inverse computation but do not have tractable Jacobian determinants. Following [1], we also provide some comparison in Tab. 5 (see Appendix E).
Identities of determinants.
Some identities can be used to speed up the computation of determinants if the Jacobians have special structures. For example, in Sylvester flow [2], the invertible transformation has the form $f(\mathbf{x}) = \mathbf{x} + \mathbf{A} h(\mathbf{B}\mathbf{x} + \mathbf{b})$, where $h(\cdot)$ is a nonlinear activation function, $\mathbf{A} \in \mathbb{R}^{D \times M}$, $\mathbf{B} \in \mathbb{R}^{M \times D}$, $\mathbf{b} \in \mathbb{R}^M$, and $M \leq D$. By Sylvester’s determinant identity, $\det J_f(\mathbf{x})$ can be computed in $O(M^3)$, which is much less than $O(D^3)$ if $M \ll D$. However, the requirement that $M$ is small becomes a bottleneck of the architecture and limits its expressive power. Similarly, Planar flow [25] uses the matrix determinant lemma, but has an even narrower bottleneck.
The form of the Mint layer bears some resemblance to Sylvester flow. However, we improve the capacity of Sylvester flow in two ways. First, we add one extra nonlinear convolutional layer. Second, we avoid the bottleneck that limits the maximum dimension of latent representations in Sylvester flow.
Dimension partitioning.
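NICE [5], Real NVP [6], and Glow [15] all depend on an affine coupling layer. Given $d < D$, $\mathbf{x}$ is first partitioned into two parts $\mathbf{x} = [\mathbf{x}_{1:d}; \mathbf{x}_{d+1:D}]$. The coupling layer is an invertible transformation, defined as $f: \mathbf{x} \mapsto \mathbf{z}$, $\mathbf{z}_{1:d} = \mathbf{x}_{1:d}$, $\mathbf{z}_{d+1:D} = \mathbf{x}_{d+1:D} \odot \exp(s(\mathbf{x}_{1:d})) + t(\mathbf{x}_{1:d})$, where $s(\cdot)$ and $t(\cdot)$ are two arbitrary functions. However, the partitioning of $\mathbf{x}$ relies on heuristics, and the performance is sensitive to this choice (cf., [15, 1]). In addition, the Jacobian of $f$ is a triangular matrix with diagonal $[\mathbf{1}_d; \exp(s(\mathbf{x}_{1:d}))]$. In contrast, the Jacobian of MintNets has more flexible diagonals, without being partially restricted to $1$’s.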
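For comparison, an affine coupling layer can be sketched as follows. The snippet assumes the standard Real NVP form, in which the first $d$ coordinates pass through unchanged and the rest are scaled by $\exp(s(\cdot))$ and shifted by $t(\cdot)$. The functions `s` and `t` below are hypothetical toy choices, where real models would use neural networks.

```python
import numpy as np

def coupling_forward(x, d, s, t):
    """z_{1:d} = x_{1:d};  z_{d+1:D} = x_{d+1:D} * exp(s(x_{1:d})) + t(x_{1:d})."""
    x1, x2 = x[:d], x[d:]
    log_scale = s(x1)
    z2 = x2 * np.exp(log_scale) + t(x1)
    # The Jacobian diagonal is [1, ..., 1, exp(s(x_{1:d}))], so the
    # log-determinant is just the sum of the log-scales.
    return np.concatenate([x1, z2]), np.sum(log_scale)

def coupling_inverse(z, d, s, t):
    z1, z2 = z[:d], z[d:]
    x2 = (z2 - t(z1)) * np.exp(-s(z1))
    return np.concatenate([z1, x2])

rng = np.random.default_rng(0)
D, d = 6, 3
A = rng.normal(size=(D - d, d))
B = rng.normal(size=(D - d, d))
s = lambda a: np.tanh(A @ a)   # toy log-scale function
t = lambda a: B @ a            # toy shift function

x = rng.normal(size=D)
z, log_det = coupling_forward(x, d, s, t)
x_rec = coupling_inverse(z, d, s, t)
```

The first `d` coordinates of `z` equal those of `x`, which is the part of the Jacobian diagonal that is pinned to ones.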
Autoregressive transformation.
By leveraging autoregressive transformations, the Jacobian can be made triangular. For example, MAF [22] defines the invertible transformation as $f: \mathbf{x} \mapsto \mathbf{z}$, $z_i = \mu(\mathbf{x}_{1:i-1}) + \sigma(\mathbf{x}_{1:i-1})\, x_i$, where $\mu(\cdot) \in \mathbb{R}$ and $\sigma(\cdot) \in \mathbb{R}^+$. However, the architecture of $f$ is only an affine combination of autoregressive functions with $\mathbf{x}$, which might restrict its expressive power. In contrast, the architecture of MintNets is arguably more flexible.
Freeform invertible models.
Some work proposes invertible transformations whose Jacobians are not limited by special structures. For example, FFJORD [9] uses a continuous version of the change of variables formula [3] where the determinant is replaced by a trace. Unlike MintNets, FFJORD needs an ODE solver to compute its value and inverse, and uses a stochastic estimator to approximate the trace. Another work is iResNet [1], which constrains the Lipschitz constant of ResNet layers to make them invertible. Both iResNet and MintNet use ResNet blocks with 3 convolutions. The inverse of iResNet can be obtained efficiently by a parallelizable fixed-point iteration method, which has a computational cost comparable to that of our Algorithm 1. However, unlike MintNets whose Jacobian determinants are exact, the log-determinant of the Jacobian of an iResNet must be approximated by truncating a power series and estimating each term with stochastic estimators.
5 Experiments
In this section, we evaluate our MintNet architectures on both image classification and density estimation. We focus on three common image datasets, namely MNIST, CIFAR-10 and ImageNet 32×32. We also empirically verify that Algorithm 1 can provide accurate solutions within a small number of iterations. We provide more details about settings and model architectures in Appendix D.
5.1 Classification
To check the capacity of MintNet and understand the trade-off of invertibility, we test its classification performance on MNIST and CIFAR-10, and compare it to a ResNet with a similar architecture.
On MNIST, MintNet achieves a test accuracy of 99.6%, which is the same as that of the ResNet. On CIFAR-10, MintNet reaches 91.2% test accuracy while ResNet reaches 92.6%. Both MintNet and ResNet achieve 100% training accuracy on the MNIST and CIFAR-10 datasets. This indicates that MintNet has enough capacity to fit all data labels on the training dataset, and the invertible representations learned by MintNet are comparable to representations learned by non-invertible networks in terms of generalizability. Note that a small degradation in classification accuracy is also observed in other invertible networks. For example, depending on the Lipschitz constant, the gap between test accuracies of iResNet and ResNet can be as large as 1.92% on CIFAR-10.
5.2 Density estimation and verification of invertibility
In this section, we demonstrate the superior performance of MintNet on density estimation by training it as a flow generative model. In addition, we empirically verify that Algorithm 1 can accurately produce the inverse using a small number of iterations. We show that samples can be efficiently generated from MintNet by inverting each Mint layer with Algorithm 1.
Density estimation.
In Tab. 1, we report bits per dimension (bpd) on the MNIST, CIFAR-10, and ImageNet 32×32 datasets. It is notable that MintNet sets new records for bpd on all three datasets. Moreover, when compared to previous best models, our MNIST model uses 30% fewer parameters than FFJORD, and our CIFAR-10 and ImageNet 32×32 models respectively use 60% and 74% fewer parameters than Glow. When trained on datasets such as CIFAR-10, MintNet requires 2 GPUs for approximately five days, while FFJORD is trained on 6 GPUs for five days, and Glow on 8 GPUs for seven days. Note that all values in Tab. 1 are with respect to the continuous distribution of uniformly dequantized images, and results of models that view images as discrete distributions are not directly comparable (e.g., PixelCNN [20], IAF-VAE [16], and Flow++ [12]). To show that MintNet learns semantically meaningful representations of images, we also perform latent space interpolation similar to the interpolation experiments in Real NVP (see Appendix C).
Table 1: Bits per dimension (bpd) on uniformly dequantized images.
Method | MNIST | CIFAR-10 | ImageNet 32×32
NICE [5] | 4.36 | 4.48 | -
MAF [22] | 1.89 | 4.31 | -
Real NVP [6] | 1.06 | 3.49 | 4.28
Glow [15] | 1.05 | 3.35 | 4.09
FFJORD [9] | 0.99 | 3.40 | -
iResNet [1] | 1.06 | 3.45 | -
MintNet (ours) | 0.98 | 3.32 | 4.06
Verification of invertibility.
We first examine the performance of Algorithm 1 by measuring the reconstruction error of MintNets. We compute the inverse of MintNet by sequentially inverting each Mint layer with Algorithm 1. We used grid search to select the step size $\alpha$ in Algorithm 1 and chose a separate value for each of MNIST, CIFAR-10 and ImageNet 32×32. Interestingly, the step size selected for MNIST lies outside the range $(0, 2)$ covered by Theorem 2, yet it works better than values of $\alpha$ within that range, even though it does not have the theoretical guarantee of local convergence. As Fig. 3(a) shows, the normalized reconstruction error converges for all datasets considered. Additionally, Fig. 3(b) demonstrates that the reconstructed images look visually indistinguishable from the true images.
Samples.
Using Algorithm 1, we can generate samples efficiently by computing the inverse of MintNets. We use the same step sizes as in the reconstruction error analysis, and run Algorithm 1 for 120 iterations for all three datasets. We provide uncurated samples in Fig. 3, and more samples can be found in Appendix F. In addition, we compare our sampling time to that of the other models (see Tab. 6 in Appendix E). Our sampling method has a speed comparable to that of iResNet. It is approximately 5 times faster than autoregressive sampling on MNIST, and roughly 25 times faster on CIFAR-10 and ImageNet 32×32.
6 Conclusion
We propose a new method to compositionally construct invertible modules that are flexible, efficient to invert, and with a tractable Jacobian. Starting from linear transformations with triangular matrices, we apply a set of composition rules to recursively build new modules that are nonlinear and more expressive (Proposition 1). We then show that the composed modules are invertible as long as their Jacobians are nonsingular (Theorem 1), and propose an efficiently parallelizable numerical method (Algorithm 1) with theoretical guarantees (Theorem 2) to compute the inverse. The Jacobians of our modules are all triangular, which allows efficient and exact determinant computation.
As an application of this idea, we use masked convolutions as our basic module. Using our composition rules, we compose multiple masked convolutions together to form a module named Mint layer, following the architecture of a ResNet block. To enforce its invertibility, we constrain the masked convolutions to satisfy the condition of Theorem 1. We show that multiple Mint layers can be stacked together to form a deep invertible network which we call MintNet. Experimentally, we show that MintNet performs well on MNIST and CIFAR-10 classification. Moreover, when trained as a generative model, MintNet achieves new state-of-the-art performance on MNIST, CIFAR-10 and ImageNet 32×32.
Acknowledgements
This research was supported by Intel Corporation, Amazon AWS, TRI, NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), and AFOSR (FA9550-19-1-0024).
References
[1] J. Behrmann, W. Grathwohl, R. T. Q. Chen, D. Duvenaud, and J.-H. Jacobsen. Invertible residual networks. arXiv preprint arXiv:1811.00995, 2019.
[2] R. v. d. Berg, L. Hasenclever, J. M. Tomczak, and M. Welling. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.
[3] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 6571–6583. Curran Associates, Inc., 2018.
[4] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
[5] L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-linear independent components estimation. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Workshop Track Proceedings, 2015.
[6] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
[7] A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse. The reversible residual network: Backpropagation without storing activations. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2214–2224. Curran Associates, Inc., 2017.
[8] A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse. The reversible residual network: Backpropagation without storing activations. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2214–2224. Curran Associates, Inc., 2017.
[9] W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
[12] J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design, 2019.
[13] J.-H. Jacobsen, J. Behrmann, R. Zemel, and M. Bethge. Excessive invariance causes adversarial vulnerability. In International Conference on Learning Representations, 2019.
[14] J.-H. Jacobsen, A. W. Smeulders, and E. Oyallon. i-RevNet: Deep invertible networks. In International Conference on Learning Representations, 2018.
[15] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.
[16] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4743–4751. Curran Associates, Inc., 2016.
[17] D. Levy, M. D. Hoffman, and J. Sohl-Dickstein. Generalizing Hamiltonian Monte Carlo with neural networks. In International Conference on Learning Representations, 2018.
[18] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[19] M. MacKay, P. Vicol, J. Ba, and R. B. Grosse. Reversible recurrent neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 9029–9040. Curran Associates, Inc., 2018.
[20] A. V. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1747–1756, New York, New York, USA, 20–22 Jun 2016. PMLR.
[21] J. M. Ortega and W. C. Rheinboldt. Iterative solution of nonlinear equations in several variables, volume 30. SIAM, 1970.
[22] G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.
[23] T. T. Phuong and L. T. Phong. On the convergence proof of AMSGrad and a new version. arXiv preprint arXiv:1904.03590, 2019.
[24] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
[25] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1530–1538, Lille, France, 07–09 Jul 2015. PMLR.
[26] J. Song, S. Zhao, and S. Ermon. A-NICE-MC: Adversarial training for MCMC. In Advances in Neural Information Processing Systems, pages 5140–5150, 2017.
Appendix A Proofs
Notations.
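Let $J_f(\mathbf{x})$ denote the Jacobian of $f$ evaluated at $\mathbf{x}$. We use $f_i(\mathbf{x})$ to denote the $i$-th component of the vector-valued function $f$, and $[J_f(\mathbf{x})]_{ij}$ to denote the $(i,j)$-th entry of $J_f(\mathbf{x})$. We further use $x_i$ to denote the $i$-th component of the input vector $\mathbf{x}$, and $\partial f_i(\mathbf{x}) / \partial x_j$ to denote the partial derivative of $f_i$ w.r.t. $x_j$, evaluated at $\mathbf{x}$.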
Proposition 1.
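Define $\mathcal{F}$ as the set of all continuously differentiable functions whose Jacobian is lower triangular. Then $\mathcal{F}$ contains the basic module in Section 3.1, and is closed under the following composition rules.

Rule of addition. $f_1 \in \mathcal{F} \wedge f_2 \in \mathcal{F} \Rightarrow \lambda f_1 + \mu f_2 \in \mathcal{F}$, where $\lambda, \mu \in \mathbb{R}$.

Rule of composition. $f_1 \in \mathcal{F} \wedge f_2 \in \mathcal{F} \Rightarrow f_2 \circ f_1 \in \mathcal{F}$. A special case is $f \in \mathcal{F} \Rightarrow h \circ f \in \mathcal{F}$, where $h(\cdot)$ is a continuously differentiable nonlinear activation function that is applied elementwise.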
Proof.
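Since the basic modules have the form $f(\mathbf{x}) = \mathbf{W}\mathbf{x} + \mathbf{b}$, where $\mathbf{W}$ is a lower triangular matrix, we immediately know that $f$ is continuously differentiable and $J_f(\mathbf{x}) = \mathbf{W}$ is lower triangular, therefore $f \in \mathcal{F}$. Next, we prove the closure properties of $\mathcal{F}$ one by one.

Rule of addition. $\lambda f_1 + \mu f_2$ is continuously differentiable, and $J_{\lambda f_1 + \mu f_2}$ is lower triangular. This is because $J_{\lambda f_1 + \mu f_2} = \lambda J_{f_1} + \mu J_{f_2}$, and both $J_{f_1}$ and $J_{f_2}$ are continuous and lower triangular.

Rule of composition. $f_2 \circ f_1$ is continuously differentiable and has a lower triangular Jacobian. This is because $J_{f_2 \circ f_1}(\mathbf{x}) = J_{f_2}(f_1(\mathbf{x}))\, J_{f_1}(\mathbf{x})$, and both $J_{f_2}$ and $J_{f_1}$ are continuous and lower triangular. As a special case, we choose $f_2 = h$, where $h(\cdot)$ is a continuously differentiable univariate function applied elementwise. Since the Jacobian of $h$ is diagonal and continuous, we have $h \in \mathcal{F}$. Therefore $h \circ f \in \mathcal{F}$ holds true for all $f \in \mathcal{F}$.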
∎
The following two lemmas will be very helpful for proving Theorem 1.
Lemma 1.
$J_f(\mathbf{x})$ is lower triangular for all $\mathbf{x}$ implies $f_i(\mathbf{x})$ is a function of $x_1, \dots, x_i$, and does not depend on $x_{i+1}, \dots, x_D$.
Proof.
Due to the fact that $J_f(\mathbf{x})$ is lower triangular, we have $\partial f_i(\mathbf{x}) / \partial x_j = 0$ for any $j > i$. When $x_1, \dots, x_i$ are fixed, we have
(5)  
(6) 
This implies that $f_i(\mathbf{x})$ does not depend on $x_j$ for any $j > i$. In other words, $f_i(\mathbf{x})$ is only a function of $x_1, \dots, x_i$. ∎
Lemma 2.
implies that for any , either (i) or (ii) . That is, is monotonic w.r.t. when are fixed.
Proof.
Clearly is equivalent to . This means for any , and it shares the same sign with , a constant that is either strictly positive or strictly negative. This further implies that when are fixed, is either strictly positive or strictly negative for all , and is therefore monotonic w.r.t. . ∎
Theorem 1.
If $f \in \mathcal{F}$ and $J_f(\mathbf{x})$ is nonsingular for all $\mathbf{x}$ in the domain, then $f$ is invertible.
Proof.
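Assume without loss of generality that $J_f$ is lower triangular. We first prove by contradiction that every diagonal entry of $J_f(\mathbf{x})$ keeps the same sign over the whole domain. Assuming otherwise, there exist an index $i$ and two points $\mathbf{x}, \mathbf{x}'$ such that $[J_f(\mathbf{x})]_{ii}$ and $[J_f(\mathbf{x}')]_{ii}$ have different signs. Because $J_f$ is always triangular and nonsingular, we immediately conclude that its diagonal entries are never zero. Assume without loss of generality that $[J_f(\mathbf{x})]_{ii} > 0$ and $[J_f(\mathbf{x}')]_{ii} < 0$. Then, by the intermediate value theorem, we know that there exists $\mathbf{x}''$ such that $[J_f(\mathbf{x}'')]_{ii} = 0$, which contradicts the fact that $J_f$ is always nonsingular.
Next, we prove that for all $\mathbf{z}$ in the range of $f$, there exists a unique $\mathbf{x}$ such that $f(\mathbf{x}) = \mathbf{z}$. To obtain $x_1$, we only need to solve $[f(\mathbf{x})]_1 = z_1$, which is an equation of the variable $x_1$ alone, as concluded from Lemma 1. Since Lemma 2 implies that $[f(\mathbf{x})]_1$ is monotonic w.r.t. $x_1$, we know that this equation has a unique solution $x_1$ whenever $z_1$ is in the range of $[f]_1$. Now assume we have already obtained $x_1, \dots, x_k$, where $1 \le k < n$. In this case, Lemma 1 asserts that $[f(\mathbf{x})]_{k+1} = z_{k+1}$ is an equation of the variable $x_{k+1}$ alone. Again Lemma 2 implies that $[f(\mathbf{x})]_{k+1}$ is a monotonic function of $x_{k+1}$ given $x_1, \dots, x_k$, which implies further that $[f(\mathbf{x})]_{k+1} = z_{k+1}$ has a unique solution $x_{k+1}$ whenever $z_{k+1}$ is in the range of $[f]_{k+1}$. By induction, we can solve for $x_1, \dots, x_n$ by repeatedly employing this procedure, which shows that $f^{-1}$ exists and is uniquely determined.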
∎
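The constructive argument in the proof of Theorem 1 can be illustrated with a short sketch: coordinates are recovered one at a time, each by solving a scalar monotone equation. The toy map `f`, the bisection solver `solve_monotone`, and the assumption that every diagonal partial derivative is strictly positive (so each scalar equation is increasing) are illustrative choices; this sequential procedure is not the parallel fixed-point method the paper uses in practice.

```python
import math

def f(x):
    """Toy map whose Jacobian is lower triangular with a strictly positive diagonal."""
    x1, x2, x3 = x
    return [
        2.0 * x1 + math.tanh(x1),            # depends on x1 only
        math.sin(x1) + x2 + x2 ** 3,         # depends on x1, x2
        x1 * x2 + 3.0 * x3 + math.tanh(x3),  # depends on x1, x2, x3
    ]

def solve_monotone(g, target, lo=-1.0, hi=1.0, iters=200):
    """Solve g(t) = target for a strictly increasing g by bracketing, then bisection."""
    while g(lo) > target:   # widen the bracket to the left
        lo *= 2.0
    while g(hi) < target:   # widen the bracket to the right
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def invert(f, z):
    """Recover x with f(x) = z by solving for x_1, then x_2, and so on."""
    n = len(z)
    x = [0.0] * n
    for i in range(n):
        def g(t, i=i):
            # Component i ignores coordinates after i (Lemma 1), so the
            # placeholder zeros stored in x[i+1:] do not affect the result.
            trial = list(x)
            trial[i] = t
            return f(trial)[i]
        x[i] = solve_monotone(g, z[i])
    return x

x_true = [0.4, -1.3, 2.1]
x_rec = invert(f, f(x_true))
max_abs_err = max(abs(a - b) for a, b in zip(x_true, x_rec))
print(max_abs_err)
```

The loop over coordinates is inherently sequential, which is the motivation for an iterative method that updates all coordinates at once.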
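Proof.
Let $\mathbf{z}$ be any value in the range of $f$ and $g(\mathbf{x}) = \mathbf{x} - \alpha\, \operatorname{diag}(J_f(\mathbf{x}))^{-1} (f(\mathbf{x}) - \mathbf{z})$, where $\operatorname{diag}(J_f(\mathbf{x}))^{-1}$ denotes a diagonal matrix whose diagonal entries are the reciprocals of those of $J_f(\mathbf{x})$. The iterative method of Algorithm 1 can be written as $\mathbf{x}_{t+1} = g(\mathbf{x}_t)$. Because of Theorem 1, there exists a unique $\mathbf{x}^*$ such that $f(\mathbf{x}^*) = \mathbf{z}$, in which case $g(\mathbf{x}^*) = \mathbf{x}^*$. Applying the product rule and using $f(\mathbf{x}^*) - \mathbf{z} = 0$, we have
$$J_g(\mathbf{x}^*) = I - \alpha\, \operatorname{diag}(J_f(\mathbf{x}^*))^{-1} J_f(\mathbf{x}^*),$$
where $J_g(\mathbf{x}^*)$ denotes the Jacobian of $g$ evaluated at $\mathbf{x}^*$. Since $J_f(\mathbf{x}^*)$ is triangular, $J_g(\mathbf{x}^*)$ will also be triangular, and all of its diagonal entries equal $1 - \alpha$. Therefore, the only eigenvalue of $J_g(\mathbf{x}^*)$ is $1 - \alpha$, due to the fact that the only solution to the characteristic equation $\det(\lambda I - J_g(\mathbf{x}^*)) = 0$ is $\lambda = 1 - \alpha$. Since $0 < \alpha < 2$, the spectral radius of $J_g(\mathbf{x}^*)$ satisfies $\rho(J_g(\mathbf{x}^*)) = |1 - \alpha| < 1$. Then the Ostrowski Theorem (cf. Theorem 10.1.3 in [21]) shows that the sequence $\{\mathbf{x}_t\}$ obtained by $\mathbf{x}_{t+1} = g(\mathbf{x}_t)$ converges locally to $\mathbf{x}^*$ as $t \to \infty$. ∎
Appendix B Masked convolutions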
Convolution is a special type of linear transformation that proves to be very effective for image data. The basic invertible module can be implemented using masked convolutions (e.g., causal convolutions in PixelCNN [20]). Consider a 2D convolutional layer with $c_{\text{in}}$ input feature maps, $c_{\text{out}}$ filters, a kernel size of $k \times k$ and a zero-padding of $p$. We assume $k$ is an odd integer and $p = (k-1)/2$ so that the input and output of the convolutional layer have the same shape. Let $W \in \mathbb{R}^{c_{\text{out}} \times c_{\text{in}} \times k \times k}$ be the weight tensor of this layer. We define a mask $M \in \{0, 1\}^{c_{\text{out}} \times c_{\text{in}} \times k \times k}$ that satisfies
(7)
The masked convolution then uses $M \odot W$ as the weight tensor. In Fig. 1, we provide an illustration of a masked convolution with filters.
In MintNet, is efficiently implemented with 3 masked convolutional layers. The weights and masks are denoted as , and , which separately correspond to in Eq. (2). Let be the number of input feature maps, and suppose the kernel size is . The shapes of , and are respectively , and . The masks of them are simple concatenations of copies of the mask in Eq. (7). For instance, consists of copies of Eq. (7), and consists of copies. Using masked convolutions, can be concisely written as
(8) 
where are biases, and denotes the operation of discrete 2D convolution.
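To make the link between masking and triangular Jacobians concrete, the sketch below builds a causal mask in the spirit of PixelCNN [20] and checks that the resulting zero-padded convolution, written as a dense matrix, is lower triangular. The specific masking rule in `causal_mask`, the flattening order (row, column, channel), the cross-correlation indexing, and the toy weights are all assumptions made for illustration; they are not claimed to reproduce Eq. (7) exactly.

```python
def causal_mask(channels, k):
    """mask[i][j][m][n] for output channel i, input channel j, kernel offset (m, n).

    Assumed rule: keep kernel positions strictly before the center in raster
    order, and keep the center only for input channels j <= i.
    """
    c = k // 2
    mask = [[[[0] * k for _ in range(k)] for _ in range(channels)]
            for _ in range(channels)]
    for i in range(channels):
        for j in range(channels):
            for m in range(k):
                for n in range(k):
                    before_center = m < c or (m == c and n < c)
                    at_center = m == c and n == c
                    if before_center or (at_center and j <= i):
                        mask[i][j][m][n] = 1
    return mask

def conv_as_matrix(weight, mask, height, width):
    """Dense matrix of the masked, zero-padded convolution (stride 1).

    Inputs and outputs are flattened as (row, column, channel), so the
    matrix is also the Jacobian of this linear layer.
    """
    channels, k = len(weight), len(weight[0][0])
    c = k // 2
    size = height * width * channels
    A = [[0.0] * size for _ in range(size)]
    for r in range(height):
        for q in range(width):
            for i in range(channels):
                out = (r * width + q) * channels + i
                for j in range(channels):
                    for m in range(k):
                        for n in range(k):
                            rr, qq = r + m - c, q + n - c
                            if 0 <= rr < height and 0 <= qq < width:
                                inp = (rr * width + qq) * channels + j
                                A[out][inp] += weight[i][j][m][n] * mask[i][j][m][n]
    return A

def is_lower_triangular(A):
    return all(A[a][b] == 0.0 for a in range(len(A)) for b in range(a + 1, len(A)))

C, K, H, W = 2, 3, 3, 3
# Hypothetical weights; every entry is nonzero so the mask alone creates the zeros.
weight = [[[[1.0 + 0.1 * (i + 2 * j + 3 * m + 5 * n) for n in range(K)]
            for m in range(K)] for j in range(C)] for i in range(C)]
all_ones = [[[[1] * K for _ in range(K)] for _ in range(C)] for _ in range(C)]

masked = conv_as_matrix(weight, causal_mask(C, K), H, W)
unmasked = conv_as_matrix(weight, all_ones, H, W)
masked_triangular = is_lower_triangular(masked)
unmasked_triangular = is_lower_triangular(unmasked)
diagonal_nonzero = all(masked[a][a] != 0.0 for a in range(len(masked)))
print(masked_triangular, unmasked_triangular, diagonal_nonzero)
```

With this ordering, the diagonal of the matrix consists of the center-tap weights `weight[i][i][c][c]`, so its determinant is simply their product over all pixels.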
Appendix C Interpolation of hidden representations
MintNet interpolation of hidden representation.
Left: MintNet MNIST latent space interpolation. Middle: MintNet CIFAR-10 latent space interpolation. Right: MintNet ImageNet 32×32 latent space interpolation.
Given four images $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4$ in the dataset, let $\mathbf{z}_i = f(\mathbf{x}_i)$, where $i = 1, 2, 3, 4$, be the corresponding features in the feature domain. Similar to [6], in the feature domain, we define
(9) 
where axis corresponds to , axis corresponds to , and both and range over . We then transform back to the image domain by taking . Interpolation results are shown in Fig. 5.
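The interpolation scheme can be sketched as follows. Because the formula is not reproduced in the text above, the sketch assumes the trigonometric four-point interpolation used in [6], with two angles playing the role of the two plot axes; the function name `interpolate`, the angle grid, and the toy feature vectors are hypothetical.

```python
import math

def interpolate(z1, z2, z3, z4, phi, phi_prime):
    """Assumed scheme from [6]:
    z = cos(phi) * (cos(phi') z1 + sin(phi') z2)
      + sin(phi) * (cos(phi') z3 + sin(phi') z4)
    """
    top = [math.cos(phi_prime) * a + math.sin(phi_prime) * b for a, b in zip(z1, z2)]
    bottom = [math.cos(phi_prime) * a + math.sin(phi_prime) * b for a, b in zip(z3, z4)]
    return [math.cos(phi) * t + math.sin(phi) * b for t, b in zip(top, bottom)]

# Toy feature vectors standing in for f(x_1), ..., f(x_4).
z1, z2, z3, z4 = [1.0, 0.0], [0.0, 1.0], [-1.0, 2.0], [3.0, -2.0]

# Hypothetical grid of angles between 0 and pi/2 for both axes.
steps = 8
angles = [k * math.pi / (2 * (steps - 1)) for k in range(steps)]
grid = [[interpolate(z1, z2, z3, z4, phi, phi_prime) for phi_prime in angles]
        for phi in angles]

corner_a = grid[0][0]     # phi = 0,    phi' = 0    -> z1
corner_b = grid[0][-1]    # phi = 0,    phi' = pi/2 -> approximately z2
corner_c = grid[-1][0]    # phi = pi/2, phi' = 0    -> approximately z3
corner_d = grid[-1][-1]   # phi = pi/2, phi' = pi/2 -> approximately z4
```

Each grid entry would then be mapped back to the image domain with the inverse network to produce one tile of the interpolation figure.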
Appendix D Experiment setup and network architecture
Hyperparameter tuning and computation infrastructure.
We use the standard train/test split of MNIST, CelebA and CIFAR-10. We tune our models by observing their training bpd. For density estimation on CIFAR-10 and ImageNet 32×32, the models were run on two Titan XP GPUs. In other cases the model was run on one Titan XP GPU.
Classification setup.
Following [1], we pad the images to 16 channels with zeros. This corresponds to the first convolution in ResNet which increases the number of channels to 16. Both ResNet and our MintNet are trained with AMSGrad [24] for 200 epochs with the cosine learning rate schedule [18] and an initial learning rate of 0.001. Both networks use a batch size of 128.
Classification architecture.
The ResNet contains 38 pre-activation residual blocks [11], and each block has three convolutions. The architecture is divided into 3 stages, with 16, 64 and 256 filters respectively. Our MintNet uses 19 grouped invertible layers, which include a total of 38 residual invertible layers, each having three convolutions. Batch normalization is applied before each invertible layer. Note that batch normalization does not affect the invertibility of our network, because during test time it uses fixed running average and standard deviation and is an invertible operation. We use 2 squeezing blocks at the same positions where ResNet applies subsampling, and match the number of filters used in ResNet. To produce the logits for classification, both MintNet and ResNet first apply global average pooling and then use a fully connected layer (see Tab. 2).
Density estimation setup.
We mostly follow the settings in [22]. All training images are dequantized and transformed using the logit transformation. Networks are trained using AMSGrad [23]. On MNIST, we decay the learning rate by a factor of 10 at the 250th and 350th epoch, and train for 400 epochs. On CIFAR-10, we train with cosine learning rate decay for a total of 200 epochs. On ImageNet 32×32, we train with cosine learning rate decay for a total of 350k steps. All initial learning rates are 0.001.
Density estimation architecture.
For density estimation on MNIST, we use 20 paired Mint layers with 45 filters each. For both CIFAR-10 and ImageNet 32×32, we use 21 paired Mint layers, each of which has 255 filters. For all three datasets, two squeezing operations are used and are distributed evenly across the network (see Tab. 3 and Tab. 4).
Tuning the step size for sampling.
We perform a grid search to find the hyperparameter $\alpha$ for Algorithm 1 using a minibatch of 128 images. More specifically, we start from to 5 with a step size 0.5 for MNIST, CIFAR-10, and ImageNet 32×32, and compute the normalized reconstruction error with respect to the number of iterations. The normalized error is defined as , where and are two image vectors corresponding to the original and reconstructed images. We find that the algorithm converges most quickly when $\alpha$ is in intervals , and for MNIST, CIFAR-10 and ImageNet 32×32 respectively. Then we perform a second round of grid search on the corresponding interval with a step size 0.05. In this case, we are able to find the best $\alpha$, that is for the corresponding datasets.
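The tuning procedure can be illustrated on a toy problem. The sketch below assumes the fixed-point update has the form $\mathbf{x} \leftarrow \mathbf{x} - \alpha\, \operatorname{diag}(J_f(\mathbf{x}))^{-1}(f(\mathbf{x}) - \mathbf{z})$, as described in the convergence proof, and that the iteration is initialized at $\mathbf{z}$. The toy map `f`, its analytic diagonal `jac_diag`, the relative L2 error used as the normalized error, and the grid of step sizes are all hypothetical stand-ins for the real network and the paper's settings.

```python
import math

def f(x):
    """Toy map with a lower triangular Jacobian and positive diagonal."""
    x1, x2, x3 = x
    return [
        2.0 * x1 + math.tanh(x1),
        0.5 * math.sin(x1) + 1.5 * x2,
        0.3 * x1 * x2 + 3.0 * x3 + math.tanh(x3),
    ]

def jac_diag(x):
    """Diagonal of the Jacobian of f, derived by hand for this toy map."""
    x1, _, x3 = x
    return [3.0 - math.tanh(x1) ** 2, 1.5, 4.0 - math.tanh(x3) ** 2]

def invert(z, alpha, steps):
    """Fixed-point iteration; all coordinates are updated simultaneously."""
    x = list(z)  # assumed initialization
    for _ in range(steps):
        fx, d = f(x), jac_diag(x)
        x = [xi - alpha * (fi - zi) / di for xi, fi, zi, di in zip(x, fx, z, d)]
    return x

def normalized_error(x_hat, x):
    """Assumed definition: L2 distance divided by the L2 norm of the original."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_hat, x)))
    den = math.sqrt(sum(b * b for b in x))
    return num / den

x_true = [0.4, -1.3, 2.1]
z = f(x_true)

# Coarse grid over the step size, using a fixed iteration budget.
alphas = [0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75]
errors = {a: normalized_error(invert(z, a, steps=30), x_true) for a in alphas}
best_alpha = min(errors, key=errors.get)
for a in alphas:
    print(f"alpha={a:.2f}  normalized error={errors[a]:.3e}")
```

A finer second pass around `best_alpha` would mirror the two-round search described above; on a real network the best step size need not match the one found on this toy map.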
Verification of invertibility.
To verify the invertibility of MintNet, we study the normalized reconstruction error for MNIST, CIFAR-10 and ImageNet 32×32. The reconstruction error is computed for 128 images on all three datasets. We plot the exponential of the mean log reconstruction errors in Fig. 4. The shaded area corresponds to the exponential of the standard deviation of log reconstruction errors.
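The plotted statistic is the geometric mean of the per-image errors. A minimal sketch of the computation follows; the example error values, the use of the population standard deviation, and the interpretation of the shaded area as a multiplicative band around the geometric mean are assumptions made for illustration.

```python
import math

def log_stats(errors):
    """Mean and (population) standard deviation of the log errors."""
    logs = [math.log(e) for e in errors]
    mean = sum(logs) / len(logs)
    var = sum((v - mean) ** 2 for v in logs) / len(logs)
    return mean, math.sqrt(var)

# Hypothetical per-image reconstruction errors at one iteration count.
errors = [1e-3, 1e-4, 1e-5]
mean_log, std_log = log_stats(errors)

center = math.exp(mean_log)   # exponential of the mean log error (geometric mean)
spread = math.exp(std_log)    # exponential of the std of the log errors
band = (center / spread, center * spread)  # assumed multiplicative band
print(center, band)
```

Averaging in log space keeps a few badly reconstructed images from dominating the curve, which is useful when errors span several orders of magnitude.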
Name  Configuration  Replicate Block 
Paired Mint Block1 with Batch Normalization  batch normalization  
lower triangular masked convolution, 1 filter  
leaky relu activation  
lower triangular masked convolution, filter  
leaky relu activation  
lower triangular masked convolution, filter  
batch normalization  
upper triangular masked convolution, filter  
leaky relu activation  
upper triangular masked convolution, filter  
leaky relu activation  
upper triangular masked convolution, filter  
Squeezing Layer  squeezing layer  — 
Paired Mint Block2 with Batch Normalization  batch normalization  
lower triangular masked convolution, 1 filter  
leaky relu activation  
lower triangular masked convolution, filter  
leaky relu activation  
lower triangular masked convolution, filter  
batch normalization  
upper triangular masked convolution, filter  
leaky relu activation  
upper triangular masked convolution, filter  
leaky relu activation  
upper triangular masked convolution, filter  
Squeezing Layer  squeezing layer  — 
Paired Mint Block3 with Batch Normalization  batch normalization  
lower triangular masked convolution, 1 filter  
leaky relu activation  
lower triangular masked convolution, filter  
leaky relu activation  
lower triangular masked convolution, filter  
batch normalization  
upper triangular masked convolution, filter  
leaky relu activation  
upper triangular masked convolution, filter  
leaky relu activation  
upper triangular masked convolution, filter  
Output Layer  average pooling  — 
fully connected layer  
softmax layer 
Name  Configuration  Replicate Block 
Paired Mint Block1  lower triangular masked convolution, 45 filters  
elu activation  
lower triangular masked convolution, filters  
elu activation  
lower triangular masked convolution, filters  
upper triangular masked convolution, filters  
elu activation  
upper triangular masked convolution, filters  
elu activation  
upper triangular masked convolution, filters  
Squeezing Layer  squeezing layer  — 
Paired Mint Block2  lower triangular masked convolution, 45 filters  
elu activation  
lower triangular masked convolution, filters  
elu activation  
lower triangular masked convolution, filters  
upper triangular masked convolution, filters  
elu activation  
upper triangular masked convolution, filters  
elu activation  
upper triangular masked convolution, filters  
Squeezing Layer  squeezing layer  — 
Paired Mint Block3  lower triangular masked convolution, 45 filters  
elu activation  
lower triangular masked convolution, filters  
elu activation  
lower triangular masked convolution, filters  
upper triangular masked convolution, filters  
elu activation  
upper triangular masked convolution, filters  
elu activation  
upper triangular masked convolution, filters 
Name  Configuration  Replicate Block 
Paired Mint Block1  lower triangular masked convolution, 85 filters  
elu activation  
lower triangular masked convolution, filters  
elu activation  
lower triangular masked convolution, filters  
upper triangular masked convolution, filters  
elu activation  
upper triangular masked convolution, filters  
elu activation  
upper triangular masked convolution, filters  
Squeezing Layer  squeezing layer  — 
Paired Mint Block2  lower triangular masked convolution, 85 filters  
elu activation  
lower triangular masked convolution, filters  
elu activation  
lower triangular masked convolution, filters  
upper triangular masked convolution, filters  
elu activation  
upper triangular masked convolution, filters  
elu activation  
upper triangular masked convolution, filters  
Squeezing Layer  squeezing layer  — 
Paired Mint Block3  lower triangular masked convolution, 85 filters  
elu activation  
lower triangular masked convolution, filters  
elu activation  
lower triangular masked convolution, filters  
upper triangular masked convolution, filters  
elu activation  
upper triangular masked convolution, filters  
elu activation  
upper triangular masked convolution, filters 
Appendix E Additional tables
Method  MNIST  CIFAR-10  ImageNet 32×32 
iResNet [1] (100 iterations)  11.56s  99.41s  92.53s 
Autoregressive (1 iteration)  63.61s  2889.64s  2860.21s 
MintNet (120 iterations) (ours)  12.81s  117.83s  120.78s 
Appendix F More Samples
In this section, we provide more uncurated MintNet samples on MNIST, CIFAR-10 and ImageNet 32×32.