# MintNet: Building Invertible Neural Networks with Masked Convolutions

We propose a new way of constructing invertible neural networks by combining simple building blocks with a novel set of composition rules. This leads to a rich set of invertible architectures, including those similar to ResNets. Inversion is achieved with a locally convergent iterative procedure that is parallelizable and very fast in practice. Additionally, the determinant of the Jacobian can be computed analytically and efficiently, enabling their generative use as flow models. To demonstrate their flexibility, we show that our invertible neural networks are competitive with ResNets on MNIST and CIFAR-10 classification. When trained as generative models, our invertible networks achieve new state-of-the-art likelihoods on MNIST, CIFAR-10 and ImageNet 32x32, with bits per dimension of 0.98, 3.32 and 4.06 respectively.


## 1 Introduction

Invertible neural networks have many applications in machine learning. They have been employed to investigate representations of deep classifiers [14], understand the cause of adversarial examples [13], learn transition operators for MCMC [26, 17], create generative models that are directly trainable by maximum likelihood [6, 5, 22, 15, 9, 1], and perform approximate inference [25, 16].

Many applications of invertible neural networks require that both inverting the network and computing the Jacobian determinant be efficient. Typical neural networks are not invertible, and achieving these properties often imposes restrictive constraints on the architecture. For example, planar flows [25] and Sylvester flows [2] constrain the number of hidden units to be smaller than the input dimension. NICE [5] and Real NVP [6] rely on dimension partitioning heuristics and specific architectures such as coupling layers, which could make training more difficult [1]. Methods like FFJORD [9] and i-ResNets [1] have fewer architectural constraints. However, their Jacobian determinants have to be approximated, which is problematic if repeatedly performed at training time as in flow models.

In this paper, we propose a new method of constructing invertible neural networks which are flexible, efficient to invert, and whose Jacobian can be computed exactly and efficiently. We use triangular matrices as our basic module. Then, we provide a set of composition rules to recursively build more complex non-linear modules from the basic module, and show that the composed modules are invertible as long as their Jacobians are non-singular. As in previous work [6, 22], the Jacobians of our modules are triangular, allowing efficient determinant computation. The inverse of these modules can be obtained by an efficiently parallelizable fixed-point iteration method, making the cost of inversion comparable to that of an i-ResNet [1] block.

Using our composition rules and masked convolutions as the basic triangular building block, we construct a rich set of invertible modules to form a deep invertible neural network. The architecture of our proposed invertible network closely follows that of ResNet [10], the state-of-the-art architecture for discriminative learning. We call our model Masked Invertible Network (MintNet). To demonstrate the capacity of MintNets, we first test them on image classification. We found that a MintNet classifier achieves 99.6% accuracy on MNIST, matching the performance of a ResNet with a similar architecture. On CIFAR-10, it achieves 91.2% accuracy, comparable to the 92.6% accuracy of ResNet. When used as generative models, MintNets achieve new state-of-the-art bits per dimension (bpd) on uniformly dequantized images. Specifically, MintNet achieves bpd values of 0.98, 3.32, and 4.06 on MNIST, CIFAR-10 and ImageNet 32×32, while the previous best published results are 0.99 (FFJORD [9]), 3.35 (Glow [15]) and 4.09 (Glow) respectively. Moreover, MintNet uses fewer parameters and fewer computational resources. Our MNIST model uses 30% fewer parameters than FFJORD [9]. For CIFAR-10 and ImageNet 32×32, MintNet uses 60% and 74% fewer parameters than the corresponding Glow [15] models. When training on datasets such as CIFAR-10, MintNet required 2 GPUs for approximately 5 days, while FFJORD [9] used 6 GPUs for approximately 5 days, and Glow [15] used 8 GPUs for approximately 7 days.

## 2 Background

Consider a neural network $f: \mathbb{R}^D \to \mathbb{R}^L$ that maps a data point $x \in \mathbb{R}^D$ to a latent representation $z \in \mathbb{R}^L$. When for every $z \in \mathbb{R}^L$ there exists a unique $x \in \mathbb{R}^D$ such that $f(x) = z$, we call $f$ an invertible neural network. There are several basic properties of invertible networks. First, when $f(x)$ is continuous, a necessary condition for $f$ to be invertible is $D = L$. Second, if $f_1$ and $f_2$ are both invertible, $f = f_2 \circ f_1$ will also be invertible. In this work, we mainly consider applications of invertible neural networks to classification and generative modeling.

### 2.1 Classification with invertible neural networks

Neural networks for classification are usually not invertible because the number of classes $L$ is usually different from the input dimension $D$. Therefore, when discussing invertible neural networks for classification, we separate the classifier into two parts $f = f_2 \circ f_1$: feature extraction $z = f_1(x)$ and classification $y = f_2(z)$, where $f_2$ is usually the softmax function. We say the classifier is invertible when $f_1$ is invertible. Invertible classifiers are arguably more interpretable, because a prediction can be traced back to the input by inverting latent representations [14, 13].

### 2.2 Generative modeling with invertible neural networks

An invertible network $f: x \in \mathbb{R}^D \mapsto z \in \mathbb{R}^D$ can be used to warp a complex probability density $p(x)$ to a simple base distribution $\pi(z)$ (e.g., a multivariate standard Gaussian) [5, 6]. Under the condition that both $f$ and $f^{-1}$ are differentiable, the densities $p(x)$ and $\pi(z)$ are related by the following change of variable formula

$$\log p(x) = \log \pi(z) + \log \left|\det(J_f(x))\right|, \tag{1}$$

where $J_f(x)$ denotes the Jacobian of $f(x)$ and we require $J_f(x)$ to be non-singular so that $\log|\det(J_f(x))|$ is well-defined. Using this formula, $p(x)$ can be easily computed if the Jacobian determinant $\det(J_f(x))$ is cheaply computable and $\pi(z)$ is known.

Therefore, an invertible neural network $f$ implicitly defines a normalized density model $p(x)$, which can be directly trained by maximum likelihood. The invertibility of $f$ is critical to fast sample generation. Specifically, in order to generate a sample $x$ from $p(x)$, we can first draw $z \sim \pi(z)$, and warp it back through the inverse of $f$ to obtain $x = f^{-1}(z)$.

Note that multiple invertible models $f_1, f_2, \ldots, f_K$ can be stacked together to form a deeper invertible model $f = f_K \circ \cdots \circ f_2 \circ f_1$, without much impact on the inverse and determinant computation. This is because we can sequentially invert each component, i.e., $f^{-1} = f_1^{-1} \circ f_2^{-1} \circ \cdots \circ f_K^{-1}$, and the total Jacobian determinant equals the product of the individual Jacobian determinants, i.e., $|\det(J_f)| = |\det(J_{f_1})| \, |\det(J_{f_2})| \cdots |\det(J_{f_K})|$.
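As a minimal sketch of Eq. (1) for a stacked model, the NumPy snippet below evaluates the log-density of a point under two linear layers with a standard Gaussian base distribution. The layer parameters and the helper names (`std_normal_logpdf`, `flow_log_density`) are illustrative assumptions, not part of the original work; the point is only that the log-determinants of the layers add up.

```python
import numpy as np

def std_normal_logpdf(z):
    """Log-density of a standard multivariate Gaussian base distribution."""
    return -0.5 * (z @ z + z.size * np.log(2.0 * np.pi))

def flow_log_density(x, layers):
    """Change of variables for a stack of linear layers z = W x + b.

    `layers` is a list of (W, b) pairs (hypothetical toy parameters).
    The log|det| of each layer's Jacobian is accumulated additively.
    """
    z, total_logdet = x, 0.0
    for W, b in layers:
        z = W @ z + b
        _, logabsdet = np.linalg.slogdet(W)  # the Jacobian of a linear map is W
        total_logdet += logabsdet
    return std_normal_logpdf(z) + total_logdet

rng = np.random.default_rng(0)
D = 3
# Lower triangular weights with a strictly positive diagonal, hence non-singular.
layers = [
    (np.tril(rng.normal(size=(D, D)), k=-1) + np.diag(rng.uniform(0.5, 1.5, size=D)),
     rng.normal(size=D))
    for _ in range(2)
]
x = rng.normal(size=D)
log_px = flow_log_density(x, layers)
```

Because both layers are linear, the result can be cross-checked against the single collapsed layer with weight `W2 @ W1`.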

## 3 Building invertible modules compositionally

In this section, we discuss how simple blocks like masked convolutions can be composed to build invertible modules that allow efficient, parallelizable inversion and determinant computation. To this end, we first introduce the basic building block of our models. Then, we propose a set of composition rules to recursively build up complex non-linear modules with triangular Jacobians. Next, we prove that these composed modules are invertible as long as their Jacobians are non-singular. Finally, we discuss how these modules can be inverted efficiently using numerical methods.

### 3.1 The basic module

We start by considering linear transformations $f(x) = Wx + b$, with $W \in \mathbb{R}^{D \times D}$ and $b \in \mathbb{R}^D$. For a general $W$, computing its Jacobian determinant requires $O(D^3)$ operations. We therefore choose $W$ to be a triangular matrix. In this case, the Jacobian determinant $\det(J_f(x)) = \det(W)$ is the product of all diagonal entries of $W$, and the computational complexity is reduced to $O(D)$. The linear function $f(x) = Wx + b$ with $W$ being triangular is our basic module.
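The saving from a triangular weight matrix can be illustrated with a short NumPy comparison. This is a sketch with randomly generated toy weights (an assumption for illustration): the log-determinant read off the diagonal matches the general-purpose routine.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5
# Lower triangular matrix with a non-zero diagonal (toy values).
W = np.tril(rng.normal(size=(D, D)), k=-1) + np.diag(rng.uniform(0.5, 2.0, size=D))

# Triangular case: only the D diagonal entries are needed.
logdet_fast = np.sum(np.log(np.abs(np.diag(W))))

# General case: a factorization of the full matrix, cubic in D.
_, logdet_general = np.linalg.slogdet(W)
```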

Convolution is a special type of linear transformation that is very effective for image data. The triangular structure of the basic module can be achieved using masked convolutions (e.g., causal convolutions in PixelCNN [20]). We provide the formula of our masks in Appendix B and an illustration of a masked convolution with filters in Fig. 1. Intuitively, the causal structure of the filters (ordering of the pixels) enforces a triangular structure.

### 3.2 The calculus of building invertible modules

Complex non-linear invertible functions can be constructed from our basic modules in two steps. First, we follow several composition rules so that the composed module has a triangular Jacobian. Next, we impose appropriate constraints so that the module is invertible. To simplify the discussion, we only consider modules with lower triangular Jacobians here, and we note that it is straightforward to extend the analysis to modules with upper triangular Jacobians.

The following proposition summarizes several rules to compositionally build new modules with triangular Jacobians using existing ones.

###### Proposition 1.

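Define $\mathcal{F}$ as the set of all continuously differentiable functions whose Jacobian is lower triangular. Then $\mathcal{F}$ contains the basic module in Section 3.1, and is closed under the following composition rules.

• Rule of addition. $f_1 \in \mathcal{F} \wedge f_2 \in \mathcal{F} \Rightarrow \lambda f_1 + \mu f_2 \in \mathcal{F}$, where $\lambda, \mu \in \mathbb{R}$.

• Rule of composition. $f_1 \in \mathcal{F} \wedge f_2 \in \mathcal{F} \Rightarrow f_2 \circ f_1 \in \mathcal{F}$. A special case is $f \in \mathcal{F} \Rightarrow h \circ f \in \mathcal{F}$, where $h(\cdot)$ is a continuously differentiable non-linear activation function that is applied element-wise.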

The proof of this proposition is straightforward and deferred to Appendix A. By repeatedly applying the rules in Proposition 1, our basic linear module can be composed to construct complex non-linear modules having continuous and triangular Jacobians. Note that besides our linear basic modules, other functions with triangular and continuous Jacobians can also be made more expressive using the composition rules. For example, the layers of dimension partitioning models (e.g., NICE [5], Real NVP [6], Glow [15]) and autoregressive flows (e.g., MAF [22]) all have continuous and triangular Jacobians and therefore belong to $\mathcal{F}$. Note that the rule of addition in Proposition 1 preserves triangular Jacobians but not invertibility. Therefore, we need additional constraints if we want the composed functions to be invertible.
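The closure rules can be checked numerically on a toy example. The sketch below (all weights and the particular combination are illustrative assumptions) builds a function from two lower triangular linear maps using addition, composition and an element-wise `tanh`, then estimates its Jacobian by central finite differences and extracts the strictly upper triangular part, which should vanish.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian; entry [i, j] approximates d f_i / d x_j."""
    D = x.size
    J = np.zeros((D, D))
    for j in range(D):
        step = np.zeros(D)
        step[j] = eps
        J[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
D = 4
W1 = np.tril(rng.normal(size=(D, D)))
W2 = np.tril(rng.normal(size=(D, D)))

def f1(x):
    return W1 @ x  # basic module

def f2(x):
    return W2 @ x  # basic module

def composed(x):
    # Rule of addition applied to f1 and a composition of f2, tanh and f1.
    return 0.5 * f1(x) + 2.0 * np.tanh(f2(np.tanh(f1(x))))

J = numerical_jacobian(composed, rng.normal(size=D))
upper_part = np.triu(J, k=1)  # entries above the diagonal
```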

Next, we state the condition for $f \in \mathcal{F}$ to be invertible, and denote the invertible subset of $\mathcal{F}$ as $\mathcal{M}$.

###### Theorem 1.

If $f \in \mathcal{F}$ and $J_f(x)$ is non-singular for all $x$ in the domain, then $f$ is invertible.

###### Proof.

A proof can be found in Appendix A. ∎

The non-singularity constraint on $J_f(x)$ in Theorem 1 is natural in the context of generative modeling. This is because in order for Eq. (1) to make sense, $\log|\det(J_f(x))|$ has to be well-defined, which requires $J_f(x)$ to be non-singular.

In many cases, Theorem 1 can be easily used to check and enforce the invertibility of $f \in \mathcal{F}$. For example, the layers of autoregressive flow models and dimension partitioning models can all be viewed as elements of $\mathcal{F}$ because they are continuously differentiable and have triangular Jacobians. Since the diagonal entries of their Jacobians are always strictly positive, the Jacobians are non-singular, and we can immediately conclude that these layers are invertible with Theorem 1, thus generalizing their model-specific proofs of invertibility.

In Fig. 2, we provide a Venn diagram to illustrate the set of functions that satisfy the condition of Theorem 1. As depicted by the orange set labeled $\mathcal{M}$, Theorem 1 captures a subset of $\mathcal{F}$ where the Jacobians of functions are non-singular so that the change of variable formula is usable. Note that the condition in Theorem 1 is sufficient but not necessary. For example, $f(x) = x^3$ is invertible, but $J_f(x = 0) = 0$ is singular. Many previous invertible models with special architectures, such as NICE, Real NVP, and MAF, can be viewed as elements belonging to subsets of $\mathcal{M}$.

### 3.3 Efficient inversion of the invertible modules

In this section, we show that when the conditions in Theorem 1 hold, not only do we know that $f$ is invertible ($f \in \mathcal{M}$), but we also have a fixed-point iteration method to invert $f$ with strong theoretical guarantees and good performance in practice.

The pseudo-code of our proposed inversion algorithm is described in Algorithm 1. Theoretically, we can prove that this method is locally convergent—as long as the initial value is close to the true value, the method is guaranteed to find the correct inverse. We formally summarize this result in Theorem 2.

###### Theorem 2.

The iterative method of Algorithm 1 is locally convergent whenever $0 < \alpha < 2$.

###### Sketch of Proof.

Denote $g(x) = x - \alpha \operatorname{diag}(J_f(x))^{-1}(f(x) - z)$ and let $x^*$ be the true inverse, i.e., $f(x^*) = z$. We can show that the spectral radius of the Jacobian of $g$ at $x^*$ is $|1 - \alpha|$, which is smaller than $1$ when $0 < \alpha < 2$. Then the local convergence holds true according to Ostrowski's Theorem (cf., Theorem 10.1.3. in [21]). We provide a more rigorous proof in Appendix A. ∎

In practice, the method is also easily parallelizable on GPUs, making the cost of inverting $f \in \mathcal{M}$ similar to that of an i-ResNet [1] layer. Within each iteration, the computation is mostly matrix operations that can be vectorized and run efficiently in parallel. Therefore, the time cost will be roughly proportional to the number of iterations $T$, i.e., $O(T)$. As will be shown in our experiments, Algorithm 1 converges fast and usually the error quickly becomes negligible when $T \ll D$. This is in stark contrast to existing methods of inverting autoregressive flow models such as MAF [22], where $D$ univariate equations need to be solved sequentially, requiring at least $O(D)$ iterations. There are also other approaches for inverting $f \in \mathcal{M}$. For example, the bisection method is guaranteed to converge globally, but its computational cost is $O(D)$, and is usually much more expensive than Algorithm 1. Note that as discussed earlier, autoregressive flow models can also be viewed as special cases of our framework. Therefore, Algorithm 1 is also applicable to inverting autoregressive flow models and could potentially result in large improvements in sampling speed.
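The fixed-point update analyzed in Theorem 2 and its proof, $x \leftarrow x - \alpha \operatorname{diag}(J_f(x))^{-1}(f(x) - z)$, can be sketched in a few lines of NumPy. The toy function `f`, its analytic Jacobian diagonal `jac_diag`, the initialization at `z`, and the iteration count are all assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def invert_fixed_point(f, jac_diag, z, alpha=1.0, num_iters=100):
    """Iterate x <- x - alpha * (f(x) - z) / diag(J_f(x)).

    Every coordinate is updated simultaneously, so each iteration is a
    handful of vectorized operations.
    """
    x = z.copy()  # initial guess (an assumption of this sketch)
    for _ in range(num_iters):
        x = x - alpha * (f(x) - z) / jac_diag(x)
    return x

rng = np.random.default_rng(0)
D = 6
# Toy lower triangular weight with a small positive diagonal.
W = (np.tril(rng.normal(scale=0.5, size=(D, D)), k=-1)
     + np.diag(rng.uniform(0.1, 0.5, size=D)))

def f(x):
    # Residual-style map whose Jacobian I + diag(1 - tanh^2(Wx)) W is lower triangular.
    return x + np.tanh(W @ x)

def jac_diag(x):
    # Diagonal of the Jacobian of f; strictly positive, so f satisfies Theorem 1.
    return 1.0 + (1.0 - np.tanh(W @ x) ** 2) * np.diag(W)

x_true = rng.normal(size=D)
z = f(x_true)
x_rec = invert_fixed_point(f, jac_diag, z, alpha=1.0, num_iters=100)
max_error = np.max(np.abs(x_rec - x_true))
```

Only the diagonal of the Jacobian is needed, which is why the update avoids solving the $D$ coordinates one after another.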

## 4 Masked Invertible Networks

We show that the techniques developed in Section 3 can be used to build our Masked Invertible Network (MintNet). First, we discuss how we compose several masked convolutions to form the Masked Invertible Layer (Mint layer). Next, we stack multiple Mint layers to form a deep neural network, i.e., the MintNet. Finally, we compare MintNets with several existing invertible architectures.

### 4.1 Building the Masked Invertible Layer

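We construct an invertible module in $\mathcal{M}$ that serves as the basic layer of our MintNet. This invertible module, named Mint layer, is defined as

$$L(x) = t \odot x + \sum_{i=1}^{K} W^3_i\, h\Big(\sum_{j=1}^{K} W^2_{ij}\, h(W^1_j x + b^1_j) + b^2_{ij}\Big) + b^3_i, \tag{2}$$

where $\odot$ denotes the elementwise multiplication, $\{W^1_i\}_{i=1}^{K}$, $\{W^2_{ij}\}_{1 \le i,j \le K}$, and $\{W^3_i\}_{i=1}^{K}$ are all lower triangular matrices with additional constraints to be specified later, and $t > 0$. Additionally, Mint layers use a monotonic activation function $h$, so that $h' \geq 0$. Common choices of $h$ include ELU [4], tanh and sigmoid. Note that every individual weight matrix has the same size, and the 3 groups of weights $\{W^1_i\}$, $\{W^2_{ij}\}$ and $\{W^3_i\}$ can be implemented with 3 masked convolutions (see Appendix B). We design the form of $L(x)$ so that it resembles a ResNet / i-ResNet block that also has 3 convolutions with $K \times C$ filters, with $C$ being the number of channels of $x$.

From Proposition 1 in Section 3.2, we can easily conclude that $L \in \mathcal{F}$. Now, we consider additional constraints on the weights so that $L \in \mathcal{M}$, i.e., it is invertible. Note that the analytic form of its Jacobian is

$$J_L(x) = \sum_{i=1}^{K} W^3_i A_i \sum_{j=1}^{K} W^2_{ij} B_j W^1_j + t, \tag{3}$$

with $A_i = \operatorname{diag}\big(h'\big(\sum_{j=1}^{K} W^2_{ij}\, h(W^1_j x + b^1_j) + b^2_{ij}\big)\big) \geq 0$, $B_j = \operatorname{diag}\big(h'(W^1_j x + b^1_j)\big) \geq 0$, and $t > 0$. Therefore, once we impose the following constraint

$$\operatorname{diag}(W^3_i)\operatorname{diag}(W^2_{ij})\operatorname{diag}(W^1_j) \geq 0, \quad \forall\, 1 \leq i, j \leq K, \tag{4}$$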

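we have $\operatorname{diag}(J_L(x)) > 0$, which satisfies the condition of Theorem 1 and as a consequence we know $L \in \mathcal{M}$. In practice, the constraint Eq. (4) can be easily implemented. For all $1 \leq i, j \leq K$, we impose no constraint on $W^3_i$ and $W^1_j$, but replace $W^2_{ij}$ with $V^2_{ij} = W^2_{ij} \operatorname{sign}(\operatorname{diag}(W^2_{ij})) \operatorname{sign}(\operatorname{diag}(W^3_i)\operatorname{diag}(W^1_j))$. Note that $\operatorname{diag}(V^2_{ij})$ has the same signs as $\operatorname{diag}(W^3_i)\operatorname{diag}(W^1_j)$ and therefore $\operatorname{diag}(W^3_i)\operatorname{diag}(V^2_{ij})\operatorname{diag}(W^1_j) \geq 0$. Moreover, $V^2_{ij}$ is almost everywhere differentiable w.r.t. $W^2_{ij}$, which allows gradients to backpropagate through it.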
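To make the construction concrete, here is a small dense (non-convolutional) sketch of a Mint layer following Eq. (2), with one possible sign-flipping reparameterization of the middle weights so that Eq. (4) holds. The exact form of the reparameterization, the placement of the biases inside the sums, the use of `tanh`, and all helper names are assumptions of this sketch rather than a description of the authors' code.

```python
import numpy as np

def constrain_w2(W1, W2, W3):
    """Flip column signs of W2[i][j] so that diag(W3_i) diag(V2_ij) diag(W1_j) >= 0."""
    K = len(W1)
    V2 = [[None] * K for _ in range(K)]
    for i in range(K):
        for j in range(K):
            own_sign = np.sign(np.diag(W2[i][j]))
            target_sign = np.sign(np.diag(W3[i]) * np.diag(W1[j]))
            # Column-wise scaling keeps the matrix lower triangular.
            V2[i][j] = W2[i][j] * (own_sign * target_sign)[None, :]
    return V2

def mint_layer(x, t, W1, V2, W3, b1, b2, b3, h=np.tanh):
    """Dense analogue of Eq. (2); biases are placed inside the sums (an assumption)."""
    K = len(W1)
    inner = [h(W1[j] @ x + b1[j]) for j in range(K)]
    out = t * x
    for i in range(K):
        pre = sum(V2[i][j] @ inner[j] + b2[i][j] for j in range(K))
        out = out + W3[i] @ h(pre) + b3[i]
    return out

def numerical_jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian."""
    D = x.size
    J = np.zeros((D, D))
    for j in range(D):
        step = np.zeros(D)
        step[j] = eps
        J[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
D, K = 4, 2

def random_tril():
    return np.tril(rng.normal(size=(D, D)))

W1 = [random_tril() for _ in range(K)]
W2 = [[random_tril() for _ in range(K)] for _ in range(K)]
W3 = [random_tril() for _ in range(K)]
b1 = [rng.normal(size=D) for _ in range(K)]
b2 = [[rng.normal(size=D) for _ in range(K)] for _ in range(K)]
b3 = [rng.normal(size=D) for _ in range(K)]
t = rng.uniform(0.5, 1.5, size=D)  # strictly positive scaling

V2 = constrain_w2(W1, W2, W3)
x = rng.normal(size=D)
J = numerical_jacobian(lambda u: mint_layer(u, t, W1, V2, W3, b1, b2, b3), x)
log_abs_det = np.sum(np.log(np.diag(J)))  # triangular Jacobian with positive diagonal
```

With the constraint in place, the estimated Jacobian is lower triangular and every diagonal entry is at least the corresponding entry of `t`, so the log-determinant is a sum of logs of the diagonal.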

### 4.2 Constructing the Masked Invertible Network

In this section, we introduce design choices that help stack multiple Mint layers together to form an expressive invertible neural network, namely the MintNet. The full MintNet is constructed by stacking the following paired Mint layers and squeezing layers.

#### Paired Mint layers.

As discussed above, our Mint layer always has a triangular Jacobian. To maximize the expressive power of our invertible neural network, it is undesirable to constrain the Jacobian of the network to be triangular since this limits capacity and will cause blind spots in the receptive field of masked convolutions. We thus always pair two Mint layers together—one with a lower triangular Jacobian and the other with an upper triangular Jacobian, so that the Jacobian of the paired layers is not triangular, and blind spots can be eliminated.

#### Squeezing layers.

Subsampling is important for enlarging the receptive field of convolutions. However, common subsampling operations such as pooling and strided convolutions are usually not invertible. Following [6] and [1], we use a “squeezing” operation to reshape the feature maps so that they have smaller resolution but more channels. After a squeezing operation, the height and width will decrease by a factor of $k$, but the number of channels will increase by a factor of $k^2$. This procedure is invertible and the Jacobian is an identity matrix. Throughout the paper, we use $k = 2$.
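A squeezing operation of this kind can be written as a reshape and transpose. The sketch below assumes a channels-first `(C, H, W)` layout and a squeezing factor called `k` (defaulting to 2 here as an assumption); the particular ordering of the new channels is one of several equally valid choices.

```python
import numpy as np

def squeeze(x, k=2):
    """(C, H, W) -> (C*k*k, H/k, W/k): move each k x k spatial block into channels."""
    C, H, W = x.shape
    x = x.reshape(C, H // k, k, W // k, k)
    x = x.transpose(0, 2, 4, 1, 3)
    return x.reshape(C * k * k, H // k, W // k)

def unsqueeze(y, k=2):
    """Exact inverse of `squeeze`."""
    Ck2, h, w = y.shape
    C = Ck2 // (k * k)
    y = y.reshape(C, k, k, h, w)
    y = y.transpose(0, 3, 1, 4, 2)
    return y.reshape(C, h * k, w * k)

x = np.arange(3 * 4 * 4, dtype=float).reshape(3, 4, 4)
y = squeeze(x)
x_back = unsqueeze(y)
```

Since the operation only rearranges entries, it contributes nothing to the log-determinant.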

### 4.3 Comparison to other approaches

In what follows we compare MintNets to several existing methods for developing invertible architectures. We will focus on architectures with a tractable Jacobian determinant. However, we note that there are models (cf., [7, 19, 8]) that allow fast inverse computation but do not have tractable Jacobian determinants. Following [1], we also provide some comparison in Tab. 5 (see Appendix E).

#### Identities of determinants.

Some identities can be used to speed up the computation of determinants if the Jacobians have special structures. For example, in Sylvester flow [2], the invertible transformation has the form $f(x) = x + A h(Bx + b)$, where $h(\cdot)$ is a nonlinear activation function, $A \in \mathbb{R}^{D \times M}$, $B \in \mathbb{R}^{M \times D}$, $b \in \mathbb{R}^{M}$ and $M \leq D$. By Sylvester’s determinant identity, $\det(J_f(x))$ can be computed in $O(M^3)$, which is much less than $O(D^3)$ if $M \ll D$. However, the requirement that $M$ is small becomes a bottleneck of the architecture and limits its expressive power. Similarly, Planar flow [25] uses the matrix determinant lemma, but has an even narrower bottleneck.
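Sylvester's determinant identity, $\det(I_D + PQ) = \det(I_M + QP)$, is easy to verify numerically. In the sketch below the symbols `A`, `B`, `b`, the dimensions `D` and `M`, and the choice of `tanh` as activation are assumptions for illustration: the $D \times D$ Jacobian of $x + A\,h(Bx + b)$ and the much smaller $M \times M$ matrix give the same log-determinant.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 50, 4  # input dimension and (small) hidden dimension
A = rng.normal(size=(D, M))
B = rng.normal(size=(M, D))
b = rng.normal(size=M)
x = rng.normal(size=D)

# Derivative of tanh at the pre-activation B x + b.
h_prime = 1.0 - np.tanh(B @ x + b) ** 2

# Full D x D Jacobian of x + A tanh(B x + b).
J_full = np.eye(D) + A @ (h_prime[:, None] * B)
sign_full, logdet_full = np.linalg.slogdet(J_full)

# Equivalent M x M matrix from Sylvester's identity.
J_small = np.eye(M) + (h_prime[:, None] * B) @ A
sign_small, logdet_small = np.linalg.slogdet(J_small)
```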

The form of $L(x)$ bears some resemblance to Sylvester flow. However, we improve the capacity of Sylvester flow in two ways. First, we add one extra non-linear convolutional layer. Second, we avoid the bottleneck that limits the maximum dimension of latent representations in Sylvester flow.

#### Dimension partitioning.

NICE [5], Real NVP [6], and Glow [15] all depend on an affine coupling layer. Given $d < D$, $x$ is first partitioned into two parts $x = [x_{1:d}; x_{d+1:D}]$. The coupling layer is an invertible transformation, defined as $f: x \mapsto z$, $z_{1:d} = x_{1:d}$, $z_{d+1:D} = x_{d+1:D} \odot \exp(s(x_{1:d})) + t(x_{1:d})$, where $s(\cdot)$ and $t(\cdot)$ are two arbitrary functions. However, the partitioning of $x$ relies on heuristics, and the performance is sensitive to this choice (cf., [15, 1]). In addition, the Jacobian of $f$ is a triangular matrix with diagonal $[\mathbf{1}_d; \exp(s(x_{1:d}))]$. In contrast, the Jacobian of MintNets has more flexible diagonals, without being partially restricted to $1$'s.
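For comparison, an affine coupling layer of the kind described above can be sketched as follows. The split index `d`, the toy scale and shift networks `s` and `t`, and the function names are assumptions for illustration. The first `d` coordinates pass through unchanged, which is why that part of the Jacobian diagonal is fixed to ones.

```python
import numpy as np

def coupling_forward(x, d, s, t):
    """Affine coupling: keep x[:d], transform x[d:] conditioned on x[:d]."""
    x1, x2 = x[:d], x[d:]
    scale = s(x1)
    z2 = x2 * np.exp(scale) + t(x1)
    logdet = np.sum(scale)  # log of the product of the diagonal [1, ..., 1, exp(s)]
    return np.concatenate([x1, z2]), logdet

def coupling_inverse(z, d, s, t):
    """Closed-form inverse; s and t are only ever evaluated, never inverted."""
    z1, z2 = z[:d], z[d:]
    x2 = (z2 - t(z1)) * np.exp(-s(z1))
    return np.concatenate([z1, x2])

rng = np.random.default_rng(0)
D, d = 6, 3
Ws = rng.normal(scale=0.3, size=(D - d, d))
Wt = rng.normal(size=(D - d, d))

def s(u):
    return np.tanh(Ws @ u)  # toy scale network

def t(u):
    return Wt @ u  # toy shift network

x = rng.normal(size=D)
z, logdet = coupling_forward(x, d, s, t)
x_back = coupling_inverse(z, d, s, t)
```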

#### Autoregressive transformation.

By leveraging autoregressive transformations, the Jacobian can be made triangular. For example, MAF [22] defines the invertible transformation as $f: x \mapsto z$, $z_i = \mu(x_{1:i-1}) + \sigma(x_{1:i-1})\, x_i$, where $\mu(\cdot) \in \mathbb{R}$ and $\sigma(\cdot) \in \mathbb{R}^{+}$. However, the architecture of $f$ is only an affine combination of autoregressive functions with $x$, which might restrict its expressive power. In contrast, the architecture of MintNets is arguably more flexible.

#### Free-form invertible models.

Some work proposes invertible transformations whose Jacobians are not limited by special structures. For example, FFJORD [9] uses a continuous version of the change of variables formula [3] where the determinant is replaced by a trace. Unlike MintNets, FFJORD needs an ODE solver to compute its value and inverse, and uses a stochastic estimator to approximate the trace. Another work is i-ResNet [1], which constrains the Lipschitz-ness of ResNet layers to make them invertible. Both i-ResNet and MintNet use ResNet blocks with 3 convolutions. The inverse of i-ResNet can be obtained efficiently by a parallelizable fixed-point iteration method, which has a computational cost comparable to that of our Algorithm 1. However, unlike MintNets whose Jacobian determinants are exact, the log-determinant of the Jacobian of an i-ResNet must be approximated by truncating a power series and estimating each term with stochastic estimators.

## 5 Experiments

In this section, we evaluate our MintNet architectures on both image classification and density estimation. We focus on three common image datasets, namely MNIST, CIFAR-10 and ImageNet 32×32. We also empirically verify that Algorithm 1 can provide accurate solutions within a small number of iterations. We provide more details about settings and model architectures in Appendix D.

### 5.1 Classification

To check the capacity of MintNet and understand the trade-off of invertibility, we test its classification performance on MNIST and CIFAR-10, and compare it to a ResNet with a similar architecture.

On MNIST, MintNet achieves a test accuracy of 99.6%, which is the same as that of the ResNet. On CIFAR-10, MintNet reaches 91.2% test accuracy while ResNet reaches 92.6%. Both MintNet and ResNet achieve 100% training accuracy on MNIST and CIFAR-10 datasets. This indicates that MintNet has enough capacity to fit all data labels on the training dataset, and the invertible representations learned by MintNet are comparable to representations learned by non-invertible networks in terms of generalizability. Note that the small degradation in classification accuracy is also observed in other invertible networks. For example, depending on the Lipschitz constant, the gap between test accuracies of i-ResNet and ResNet can be as large as 1.92% on CIFAR-10.

### 5.2 Density estimation and verification of invertibility

In this section, we demonstrate the superior performance of MintNet on density estimation by training it as a flow generative model. In addition, we empirically verify that Algorithm 1 can accurately produce the inverse using a small number of iterations. We show that samples can be efficiently generated from MintNet by inverting each Mint layer with Algorithm 1.

#### Density estimation.

In Tab. 1, we report bits per dimension (bpd) on the MNIST, CIFAR-10, and ImageNet 32×32 datasets. Notably, MintNet sets new bpd records on all three datasets. Moreover, when compared to previous best models, our MNIST model uses 30% fewer parameters than FFJORD, and our CIFAR-10 and ImageNet 32×32 models respectively use 60% and 74% fewer parameters than Glow. When trained on datasets such as CIFAR-10, MintNet requires 2 GPUs for approximately five days, while FFJORD is trained on 6 GPUs for five days, and Glow on 8 GPUs for seven days. Note that all values in Tab. 1 are with respect to the continuous distribution of uniformly dequantized images, and results of models that view images as discrete distributions are not directly comparable (e.g., PixelCNN [20], IAF-VAE [16], and Flow++ [12]). To show that MintNet learns semantically meaningful representations of images, we also perform latent space interpolation similar to the interpolation experiments in Real NVP (see Appendix C).

#### Verification of invertibility.

We first examine the performance of Algorithm 1 by measuring the reconstruction error of MintNets. We compute the inverse of MintNet by sequentially inverting each Mint layer with Algorithm 1. We used grid search to select the step size $\alpha$ in Algorithm 1, choosing a separate value for each of MNIST, CIFAR-10 and ImageNet 32×32. Interestingly, the value selected for MNIST lies outside $(0, 2)$ and works better than the values of $\alpha$ within $(0, 2)$, even though it does not have the theoretical guarantee of local convergence. As Fig. 3(a) shows, the normalized reconstruction error converges within a small number of iterations for all datasets considered. Additionally, Fig. 3(b) demonstrates that the reconstructed images look visually indistinguishable from the true images.

#### Samples.

Using Algorithm 1, we can generate samples efficiently by computing the inverse of MintNets. We use the same step sizes as in the reconstruction error analysis, and run Algorithm 1 for 120 iterations for all three datasets. We provide uncurated samples in Fig. 3, and more samples can be found in Appendix F. In addition, we compare our sampling time to that of the other models (see Tab. 6 in Appendix E). Our sampling method has a speed comparable to that of i-ResNet. It is approximately 5 times faster than autoregressive sampling on MNIST, and roughly 25 times faster on CIFAR-10 and ImageNet 32×32.

## 6 Conclusion

We propose a new method to compositionally construct invertible modules that are flexible, efficient to invert, and with a tractable Jacobian. Starting from linear transformations with triangular matrices, we apply a set of composition rules to recursively build new modules that are non-linear and more expressive (Proposition 1). We then show that the composed modules are invertible as long as their Jacobians are non-singular (Theorem 1), and propose an efficiently parallelizable numerical method (Algorithm 1) with theoretical guarantees (Theorem 2) to compute the inverse. The Jacobians of our modules are all triangular, which allows efficient and exact determinant computation.

As an application of this idea, we use masked convolutions as our basic module. Using our composition rules, we compose multiple masked convolutions together to form a module named Mint layer, following the architecture of a ResNet block. To enforce its invertibility, we constrain the masked convolutions to satisfy the condition of Theorem 1. We show that multiple Mint layers can be stacked together to form a deep invertible network which we call MintNet. Experimentally, we show that MintNet performs well on MNIST and CIFAR-10 classification. Moreover, when trained as a generative model, MintNet achieves new state-of-the-art performance on MNIST, CIFAR-10 and ImageNet 32×32.

### Acknowledgements

This research was supported by Intel Corporation, Amazon AWS, TRI, NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), AFOSR (FA9550-19-1-0024).

## Appendix A Proofs

#### Notations.

Let $J_f(x)$ denote the Jacobian of $f$ evaluated at $x$. We use $[f(x)]_i$ to denote the $i$-th component of the vector-valued function $f$, and $[J_f(x)]_{ij}$ to denote the $ij$-th entry of $J_f(x)$. We further use $x_i$ to denote the $i$-th component of the input vector $x$, and $\frac{\partial [f(x)]_i}{\partial x_j}$ to denote the partial derivative of $[f(x)]_i$ w.r.t. $x_j$, evaluated at $x$.

###### Proposition 1.

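Define $\mathcal{F}$ as the set of all continuously differentiable functions whose Jacobian is lower triangular. Then $\mathcal{F}$ contains the basic module in Section 3.1, and is closed under the following composition rules.

• Rule of addition. $f_1 \in \mathcal{F} \wedge f_2 \in \mathcal{F} \Rightarrow \lambda f_1 + \mu f_2 \in \mathcal{F}$, where $\lambda, \mu \in \mathbb{R}$.

• Rule of composition. $f_1 \in \mathcal{F} \wedge f_2 \in \mathcal{F} \Rightarrow f_2 \circ f_1 \in \mathcal{F}$. A special case is $f \in \mathcal{F} \Rightarrow h \circ f \in \mathcal{F}$, where $h(\cdot)$ is a continuously differentiable non-linear activation function that is applied element-wise.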

###### Proof.

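Since the basic modules have the form $f(x) = Wx + b$, where $W$ is a lower triangular matrix, we immediately know that $f$ is continuously differentiable and $J_f = W$ is lower triangular, therefore $f \in \mathcal{F}$. Next, we prove the closure properties of $\mathcal{F}$ one by one.

• Rule of addition. $\lambda f_1 + \mu f_2$ is continuously differentiable, and $J_{\lambda f_1 + \mu f_2}$ is lower triangular. This is because $J_{\lambda f_1 + \mu f_2} = \lambda J_{f_1} + \mu J_{f_2}$, and both $J_{f_1}$ and $J_{f_2}$ are continuous and lower triangular.

• Rule of composition. $f_2 \circ f_1$ is continuously differentiable and has a lower triangular Jacobian. This is because $J_{f_2 \circ f_1}(x) = J_{f_2}(f_1(x))\, J_{f_1}(x)$, and both $J_{f_2}$ and $J_{f_1}$ are continuous and lower triangular. As a special case, we choose $f_2 = h$, where $h$ is a continuously differentiable univariate function applied element-wise. Since the Jacobian of $h$ is diagonal and continuous, we have $h \in \mathcal{F}$. Therefore $h \circ f \in \mathcal{F}$ holds true for all $f \in \mathcal{F}$.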

The following two lemmas will be very helpful for proving Theorem 1.

###### Lemma 1.

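$J_f(x)$ is lower triangular for all $x$ implies $[f(x)]_i$ is a function of $x_1, \ldots, x_i$, and does not depend on $x_{i+1}, \ldots, x_D$.

###### Proof.

Due to the fact that $J_f(x)$ is lower triangular, we have $\frac{\partial [f(x)]_i}{\partial x_j} = 0$ for any $j > i$. When $x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_D$ are fixed, we have

$$
\begin{aligned}
[f(x_1, \ldots, x_{j-1}, x_j, x_{j+1}, \ldots, x_D)]_i &= [f(x_1, \ldots, x_{j-1}, 0, x_{j+1}, \ldots, x_D)]_i + \int_0^{x_j} \frac{\partial [f(t)]_i}{\partial t_j}\, \mathrm{d}t_j && (5) \\
&= [f(x_1, \ldots, x_{j-1}, 0, x_{j+1}, \ldots, x_D)]_i. && (6)
\end{aligned}
$$

This implies that $[f(x)]_i$ does not depend on $x_j$ for any $j > i$. In other words, $[f(x)]_i$ is only a function of $x_1, \ldots, x_i$. ∎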

###### Lemma 2.

implies that for any , either (i) or (ii) . That is, is monotonic w.r.t. when are fixed.

###### Proof.

Clearly is equivalent to . This means for any , and it shares the same sign with , a constant that is either strictly positive or strictly negative. This further implies that when are fixed, is either strictly positive or strictly negative for all , and is therefore monotonic w.r.t. . ∎

###### Theorem 1.

If $f \in \mathcal{F}$ and $J_f(x)$ is non-singular for all $x$ in the domain, then $f$ is invertible.

###### Proof.

Assume without loss of generality that is lower triangular. We first prove that by contradiction. Assuming , then such that . Because is always triangular and non-singular, we immediately conclude that . Assume without loss of generality that and . Then, by the intermediate value theorem, we know that such that , which contradicts that fact that is always non-singular.

Next, we prove that for all $z$ in the range of $f$, there exists a unique $x$ such that $f(x) = z$. To obtain $x_1$, we only need to solve $[f(x)]_1 = z_1$, which is an equation of the single variable $x_1$, as concluded from Lemma 1. Since Lemma 2 implies that $[f(x)]_1$ is monotonic w.r.t. $x_1$, we know that $[f(x)]_1 = z_1$ has a unique solution $x_1$ whenever $z_1$ is in the range of $[f(x)]_1$. Now assume we have already obtained $x_1, \ldots, x_k$, where $k < D$. In this case, Lemma 1 asserts that $[f(x)]_{k+1} = z_{k+1}$ is an equation of the single variable $x_{k+1}$. Again Lemma 2 implies that $[f(x)]_{k+1}$ is a monotonic function of $x_{k+1}$ given $x_1, \ldots, x_k$, which implies further that $[f(x)]_{k+1} = z_{k+1}$ has a unique solution $x_{k+1}$ whenever $z$ is in the range of $f$. By induction, we can solve for $x_1, \ldots, x_D$ by repeatedly employing this procedure, which concludes that $f^{-1}(z)$ exists and can be determined uniquely.

###### Theorem 2.

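The iterative method of Algorithm 1 is locally convergent whenever $0 < \alpha < 2$.

###### Proof.

Let $z$ be any value in the range of $f$ and $g(x; \alpha, z) = x - \alpha \operatorname{diag}(J_f(x))^{-1}(f(x) - z)$, where $\operatorname{diag}(J_f(x))^{-1}$ denotes a diagonal matrix whose diagonal entries are the reciprocals of those of $J_f(x)$. The iterative method of Algorithm 1 can be written as $x_{t+1} = g(x_t; \alpha, z)$. Because of Theorem 1, there exists a unique $x^*$ such that $f(x^*) = z$, in which case $g(x^*; \alpha, z) = x^*$. Applying the product rule, and noting that the term involving the derivative of $\operatorname{diag}(J_f(x))^{-1}$ vanishes at $x^*$ because $f(x^*) - z = 0$, we have

$$J_g(x^*; \alpha, z) = I - \alpha \operatorname{diag}(J_f(x^*))^{-1} J_f(x^*),$$

where $J_g(x^*; \alpha, z)$ denotes the Jacobian of $g$ evaluated at $x^*$. Since $J_f(x^*)$ is triangular, $J_g(x^*; \alpha, z)$ will also be triangular, and all of its diagonal entries equal $1 - \alpha$. Therefore, the only eigenvalue of $J_g(x^*; \alpha, z)$ is $1 - \alpha$, due to the fact that the only solution to the equation $\det(\lambda I - J_g(x^*; \alpha, z)) = (\lambda - 1 + \alpha)^D = 0$ is $\lambda = 1 - \alpha$. Since $0 < \alpha < 2$, the spectral radius of $J_g(x^*; \alpha, z)$ satisfies $\rho(J_g(x^*; \alpha, z)) = |1 - \alpha| < 1$. Then the Ostrowski Theorem (cf., Theorem 10.1.3. in [21]) shows that the sequence $\{x_t\}$ obtained by $x_{t+1} = g(x_t; \alpha, z)$ converges locally to $x^*$ as $t \to \infty$. ∎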
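The key step of the argument, that $I - \alpha \operatorname{diag}(J_f)^{-1} J_f$ is triangular with every diagonal entry equal to $1 - \alpha$, can be checked on a random triangular matrix. The matrix `J_f` and the value of `alpha` below are toy assumptions; the eigenvalues are read directly off the diagonal because the matrix is triangular.

```python
import numpy as np

rng = np.random.default_rng(0)
D, alpha = 5, 1.5

# Random lower triangular "Jacobian" with non-zero diagonal entries of mixed sign.
diagonal = rng.uniform(0.5, 2.0, size=D) * rng.choice([-1.0, 1.0], size=D)
J_f = np.tril(rng.normal(size=(D, D)), k=-1) + np.diag(diagonal)

# I - alpha * diag(J_f)^{-1} J_f: row i of J_f is divided by its diagonal entry.
J_g = np.eye(D) - alpha * J_f / np.diag(J_f)[:, None]

eigenvalues = np.diag(J_g)  # eigenvalues of a triangular matrix are its diagonal
spectral_radius = np.max(np.abs(eigenvalues))
```

For this choice of `alpha` the spectral radius is $|1 - 1.5| = 0.5$, inside the unit disc as the theorem requires.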

## Appendix B Masked convolutions

Convolution is a special type of linear transformation that proves to be very effective for image data. The basic invertible module can be implemented using masked convolutions (e.g., causal convolutions in PixelCNN [20]). Consider a 2D convolutional layer with a given number of input feature maps and filters and a kernel size of $R \times R$. We assume $R$ is an odd integer and that the layer is padded so that the input and output of the convolutional layer have the same shape. Let $W$ be the weight tensor of this layer. We define a mask $M$ of the same shape as $W$ that satisfies

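$$M[i,j,m,n] = \begin{cases} 0, & \text{if } i < j \text{ or } i = j \wedge m > \lfloor R/2 \rfloor \text{ or } i = j \wedge m = \lfloor R/2 \rfloor \wedge n > \lfloor R/2 \rfloor, \\ 1, & \text{otherwise.} \end{cases} \tag{7}$$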

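The masked convolution then uses the masked weights $M \odot W$, where $W$ is the weight tensor of the layer and $\odot$ is the elementwise product, as its weight tensor. In Fig. 1, we provide an illustration of a masked convolution with filters.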
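As an illustration, the mask of Eq. (7) can be built directly from its three conditions. The sketch below is written in NumPy under several assumptions: the layer is square (equal numbers of input and output channels), the first index `i` is the filter and the second index `j` is the input feature map, and the kernel indices `m`, `n` are zero-based so that $\lfloor R/2 \rfloor$ is the kernel center. The function name `make_mask` is hypothetical.

```python
import numpy as np

def make_mask(channels, R):
    """Build a mask M[i, j, m, n] following the conditions of Eq. (7)."""
    c = R // 2  # floor(R / 2), the kernel center for odd R
    M = np.ones((channels, channels, R, R))
    for i in range(channels):
        for j in range(channels):
            if i < j:
                M[i, j] = 0.0              # i < j: whole kernel masked
            elif i == j:
                M[i, j, c + 1:, :] = 0.0   # m > floor(R/2)
                M[i, j, c, c + 1:] = 0.0   # m == floor(R/2) and n > floor(R/2)
    return M

M = make_mask(2, 3)
print(M[0, 0])  # kernel for i == j: rows above the center and the center row up to the center stay 1
```

Multiplying a weight tensor elementwise by this mask keeps the center weight of each `i == j` kernel, which is the entry that ends up on the diagonal of the triangular Jacobian.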

In MintNet, is efficiently implemented with 3 masked convolutional layers. The weights and masks are denoted as , and , which separately correspond to in Eq. (2). Let be the number of input feature maps, and suppose the kernel size is . The shapes of , and are respectively , and . The masks of them are simple concatenations of copies of the mask in Eq. (7). For instance, consists of copies of Eq. (7), and consists of copies. Using masked convolutions, can be concisely written as

 (8)

where are biases, and denotes the operation of discrete 2D convolution.

## Appendix C Interpolation of hidden representations

Given four images in the dataset, let $z_i$, where $i \in \{1, 2, 3, 4\}$, be the corresponding features in the feature domain. Similar to [6], in the feature domain, we define

$$z = \cos(\phi)\,(\cos(\phi') z_1 + \sin(\phi') z_2) + \sin(\phi)\,(\cos(\phi') z_3 + \sin(\phi') z_4) \tag{9}$$

where -axis corresponds to , -axis corresponds to , and both and range over . We then transform back to the image domain by taking . Interpolation results are shown in Fig. 5.
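The interpolation of Eq. (9) is straightforward to evaluate on a grid of angles. The sketch below is illustrative only: the feature vectors are random placeholders, the angle range $[0, \pi/2]$ and the grid size are assumptions (the range is not recoverable from the text above), and mapping the result back to the image domain with the inverse network is left out.

```python
import numpy as np

def interpolate(z1, z2, z3, z4, phi, phi_p):
    """Evaluate Eq. (9) for one pair of angles (phi, phi_p)."""
    return (np.cos(phi) * (np.cos(phi_p) * z1 + np.sin(phi_p) * z2)
            + np.sin(phi) * (np.cos(phi_p) * z3 + np.sin(phi_p) * z4))

rng = np.random.default_rng(0)
z1, z2, z3, z4 = rng.normal(size=(4, 8))  # placeholder feature vectors

angles = np.linspace(0.0, np.pi / 2, 5)   # assumed range and resolution
grid = np.stack([
    np.stack([interpolate(z1, z2, z3, z4, phi, phi_p) for phi_p in angles])
    for phi in angles
])  # shape: (5, 5, 8)
```

Under the assumed range, the four corners of the grid reproduce the four original feature vectors, and interior points blend them smoothly.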

## Appendix D Experiment setup and network architecture

#### Hyperparameter tuning and computation infrastructure.

We use the standard train/test splits of MNIST, CelebA and CIFAR-10. We tune our models by observing their training bits per dimension (bpd). For density estimation on CIFAR-10 and ImageNet 32x32, the models were run on two Titan XP GPUs. In all other cases, the model was run on one Titan XP GPU.

#### Classification setup.

Following [1], we pad the images to 16 channels with zeros. This corresponds to the first convolution in ResNet, which increases the number of channels to 16. Both ResNet and our MintNet are trained with AMSGrad [24] for 200 epochs with the cosine learning rate schedule [18] and an initial learning rate of 0.001. Both networks use a batch size of 128.
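The zero-padding step described above can be sketched in a few lines of NumPy. The `(batch, channels, height, width)` layout and the choice to append the zero channels after the original ones are assumptions; the text only states that images are padded to 16 channels with zeros.

```python
import numpy as np

def pad_channels(images, target_channels=16):
    """Zero-pad the channel axis of a (N, C, H, W) batch up to target_channels."""
    n, c, h, w = images.shape
    extra = target_channels - c
    # np.pad defaults to constant padding with zeros.
    return np.pad(images, ((0, 0), (0, extra), (0, 0), (0, 0)))

batch = np.random.default_rng(0).uniform(size=(4, 3, 32, 32))
padded = pad_channels(batch)
```

Because the padding only appends constant channels, the original image is recovered exactly by dropping them, so the step does not interfere with invertibility on the image channels.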

#### Classification architecture.

The ResNet contains 38 pre-activation residual blocks [11], and each block has three convolutions. The architecture is divided into 3 stages, with 16, 64 and 256 filters respectively. Our MintNet uses 19 grouped invertible layers, which include a total of 38 residual invertible layers, each having three convolutions. Batch normalization is applied before each invertible layer. Note that batch normalization does not affect the invertibility of our network, because at test time it uses a fixed running average and standard deviation and is therefore an invertible operation. We use 2 squeezing blocks at the same positions where ResNet applies subsampling, and match the number of filters used in ResNet. To produce the logits for classification, both MintNet and ResNet first apply global average pooling and then use a fully connected layer (see Tab. 2).

#### Density estimation setup.

We mostly follow the settings in [22]. All training images are dequantized and transformed using the logit transformation. Networks are trained using AMSGrad [23]. On MNIST, we decay the learning rate by a factor of 10 at the 250th and 350th epoch, and train for 400 epochs. On CIFAR-10, we train with cosine learning rate decay for a total of 200 epochs. On ImageNet 32x32, we train with cosine learning rate decay for a total of 350k steps. All initial learning rates are 0.001.
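The two learning rate schedules mentioned above can be written as small functions. This is a sketch under assumptions: the cosine schedule is taken to anneal to zero without restarts, and the step decay is taken to apply from the 250th and 350th epoch onward (zero-based epoch index); neither detail is stated in the text. The function names are hypothetical.

```python
import math

def step_decay_lr(epoch, base_lr=1e-3):
    """MNIST-style schedule: divide by 10 at epoch 250 and again at epoch 350."""
    if epoch >= 350:
        return base_lr / 100.0
    if epoch >= 250:
        return base_lr / 10.0
    return base_lr

def cosine_lr(step, total_steps, base_lr=1e-3):
    """Cosine decay from base_lr at step 0 down to 0 at total_steps."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

mnist_lrs = [step_decay_lr(e) for e in range(400)]
cifar_lrs = [cosine_lr(e, 200) for e in range(200)]
```

The same `cosine_lr` function covers the ImageNet 32x32 setting by passing a step count and `total_steps=350_000` instead of epochs.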

#### Density estimation architecture.

For density estimation on MNIST, we use 20 paired Mint layers with 45 filters each. For both CIFAR-10 and ImageNet 32x32, we use 21 paired Mint layers, each of which has 255 filters. For all three datasets, two squeezing operations are used and are distributed evenly across the network (see Tab. 3 and Tab. 4).

#### Tuning the step size for sampling.

We perform a grid search to find the hyperparameter $\alpha$ for Algorithm 1 using a minibatch of 128 images. More specifically, we start from to 5 with a step size of 0.5 for MNIST, CIFAR-10, and ImageNet 32x32, and compute the normalized reconstruction error with respect to the number of iterations. The normalized error is defined as , where and are two image vectors corresponding to the original and reconstructed images. We find that the algorithm converges most quickly when $\alpha$ is in intervals , and for MNIST, CIFAR-10 and ImageNet 32x32 respectively. Then we perform a second round of grid search on the corresponding interval with a step size of 0.05. In this case, we are able to find the best $\alpha$, that is for the corresponding datasets.
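The coarse-then-fine search described above can be sketched generically. Everything specific here is an assumption: `error_fn` is a hypothetical callable that returns a scalar score for a given step size (for example, the reconstruction error after a fixed number of iterations), the lower bound of the coarse grid is a free parameter, and the fine grid is taken to span one coarse step on either side of the best coarse value.

```python
import numpy as np

def grid(lo, hi, step):
    """Evenly spaced values from lo to hi inclusive, avoiding float drift."""
    n = int(round((hi - lo) / step))
    return lo + step * np.arange(n + 1)

def two_round_search(error_fn, lo, hi, coarse=0.5, fine=0.05):
    """Coarse grid search followed by a finer search around the best value."""
    best_coarse = min(grid(lo, hi, coarse), key=error_fn)
    fine_lo = max(lo, best_coarse - coarse)
    fine_hi = min(hi, best_coarse + coarse)
    return float(min(grid(fine_lo, fine_hi, fine), key=error_fn))

# Placeholder objective with a known minimum, used only to exercise the search.
best_alpha = two_round_search(lambda a: (a - 2.3) ** 2, lo=0.5, hi=5.0)
```

In an actual run, `error_fn` would wrap the inversion procedure on the minibatch, so each evaluation is considerably more expensive than in this toy objective.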

#### Verification of invertibility.

To verify the invertibility of MintNet, we study the normalized reconstruction error for MNIST, CIFAR-10 and ImageNet 32x32. The reconstruction error is computed for 128 images on all three datasets. We plot the exponential of the mean log reconstruction errors in Fig. 4. The shaded area corresponds to the exponential of the standard deviation of log reconstruction errors.
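The plotted quantity is the geometric mean of the per-image errors, with a multiplicative spread derived from the log-domain standard deviation. A minimal sketch follows; the function name `log_domain_summary` and the example error values are hypothetical, and how the spread is turned into a shaded band (for instance, dividing and multiplying the center by it) is an assumption not specified in the text.

```python
import numpy as np

def log_domain_summary(errors):
    """Return exp(mean(log e)) and exp(std(log e)) for positive errors."""
    logs = np.log(np.asarray(errors, dtype=float))
    center = np.exp(logs.mean())  # geometric mean of the errors
    spread = np.exp(logs.std())   # multiplicative spread factor
    return center, spread

center, spread = log_domain_summary([1e-2, 1e-4])
band = (center / spread, center * spread)  # one possible shaded region
```

Summarizing in the log domain keeps a few large errors from dominating the average when errors span several orders of magnitude.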