 # The Convolution Exponential and Generalized Sylvester Flows

This paper introduces a new method to build linear flows, by taking the exponential of a linear transformation. This linear transformation does not need to be invertible itself, and the exponential has the following desirable properties: it is guaranteed to be invertible, its inverse is straightforward to compute and the log Jacobian determinant is equal to the trace of the linear transformation. An important insight is that the exponential can be computed implicitly, which allows the use of convolutional layers. Using this insight, we develop new invertible transformations named convolution exponentials and graph convolution exponentials, which retain the equivariance of their underlying transformations. In addition, we generalize Sylvester Flows and propose Convolutional Sylvester Flows which are based on the generalization and the convolution exponential as basis change. Empirically, we show that the convolution exponential outperforms other linear transformations in generative flows on CIFAR10 and the graph convolution exponential improves the performance of graph normalizing flows. In addition, we show that Convolutional Sylvester Flows improve performance over residual flows as a generative flow model measured in log-likelihood.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Deep generative models aim to learn a distribution for a high-dimensional variable . Flow-based generative models (dinh2014nice; dinh2016density) are particularly attractive because they admit exact likelihood optimization and straightforward sampling. Since normalizing flows are based on the change of variable formula, they require the flow transformation to be invertible. In addition, the Jacobian determinant needs to be tractable to compute the likelihood.

In practice, a flow is composed of multiple invertible layers. Since the Jacobian determinant is required to compute the likelihood, many flow layers are triangular maps, as the determinant is then the product of the diagonal elements. However, without other transformations, the composition of triangular maps will remain triangular. For that reason, triangular flows are typically interleaved with linear flows that mix the information over dimensions. Existing methods include permutations (dinh2014nice) and 1 1 convolutions (kingma2018glow) but these do not operate over feature maps spatially. Alternatives are emerging convolutions (hoogeboom2019emerging) and periodic convolutions (finzi2019invertible; karami2019invertible). However, periodicity is generally not a good inductive bias for images, and emerging convolutions are autoregressive and their inverse is solved iteratively over dimensions.

In this paper, we introduce a new method to construct invertible transformations, by taking the exponential of any linear transformation. The exponential is always invertible, and computing the inverse and Jacobian determinant is straightforward. Unlike prior work, we observe that the exponential can be computed implicitly. As a result, we can take the exponential of linear operations for which the corresponding matrix multiplication would be intractable. The canonical example of such a transformation is a convolutional layer, using which we develop a new transformation named the convolution exponential. In addition we propose a new residual transformation named Convolutional Sylvester Flow, a combination of a generalized formulation for Sylvester Flows, and the convolution exponential as basis change. Figure 1: Visualization of the equivalent matrix exponential exp(M) where M represents a 2d convolution on a 1×5×5 input (channel first). In this example the computation is explicit, however in practice the exponential is computed implicit and the matrices M and exp(M) are never stored.

## 2 Background

Consider a variable and an invertible function that maps each to a unique output . In this case, the likelihood can be expressed in terms of a base distribution and the Jacobian determinant of :

 pX(x)=pZ(z)∣∣∣dzdx∣∣∣, (1)

where is typically chosen to be a simple factorized distribution such as a Gaussian, and is a function with learnable parameters that is referred to as a flow. Drawing a sample is equivalent to drawing a sample and computing . Figure 2: A convolution of a signal x with a kernel m (left) is equivalent to a matrix multiplication using a matrix Mand a vectorized signal →x (right). In this example, x has a single channel with spatial dimensions 5×5. The convolution is zero-padded with one pixel on all sides. A white square indicates its value is zero.

### 2.1 The Matrix Exponential

The matrix exponential gives a method to construct an invertible matrix from any dimensionality preserving linear transformation. For any square (possibly non-invertible) matrix

, the matrix exponential is given by the power series:

 exp(M)≡I+M1!+M22!+…=∞∑i=0Mii!. (2)

The matrix exponential is well-defined and the series always converges. Additionally, the matrix exponential has two very useful properties: i) computing the inverse of the matrix exponential has the same computational complexity as the exponential itself, and ii) the determinant of the matrix exponential can be computed easily using the trace:

 exp(M)−1=exp(−M)andlogdet[exp(M)]=TrM. (3)

The matrix exponential has been largely used in the field of ODEs. Consider the linear ordinary differential equation

. Given the initial condition , the solution for at time can be written using the matrix exponential: .

### 2.2 Convolutions as Matrix Multiplications

Convolutional layers in deep learning can be expressed as matrix multiplications. Let

denote a convolution111In frameworks convolutions are typically implemented as cross-correlations. We follow literature convention and refer to them as convolutions in text. In equations denotes a cross-correlations and is a convolution., then there exists an equivalent matrix such that the convolution is equivalent to the matrix multiplication , where vectorizes . An example is provided in Figure 2. In these examples we use zero-padded convolutions, for periodic and reflective padded convolutions a slightly different equivalent matrix exists. An important detail to notice is that the equivalent matrix is typically unreasonably large to store in memory, its dimensions grow quadratically with the dimension of . For example, for 2d signals it has size , where is height, is width and denotes number of channels. In practice the equivalent matrix is never stored but instead, it is a useful tool to utilize concepts from linear algebra.

## 3 The Convolution Exponential

We introduce a new method to build linear flows, by taking the exponential of a linear transformation. As the main example the exponential of a convolutional layer is taken, which we name the convolution exponential. Since a convolutional is linear, it can be expressed as a matrix multiplication (section 2.2). For a convolution with a kernel , there exists an associated equivalent matrix using the matrix such that and are equivalent. We define the convolution exponential:

 z=m⋆ex, (4)

for a kernel and signal as the output of the matrix exponential of the equivalent matrix: , where the difference between and is a vectorization or reshape operation that can be easily inverted. Notice that although is a linear operation with respect to and , the exponential operation is only linear with respect to . Using the properties of the matrix exponential, the inverse is given by , and the log Jacobian determinant is the trace of . For a 2d convolutional layer the trace is

given the 4d kernel tensor

, where height is , width is , the spatial center of the kernel is given by and iterates over channels. As an example, consider the convolution in Figure 2. The exponential of its equivalent matrix is depicted in Figure 1. In contrast with a standard convolution, the convolution exponential guaranteed to be invertible, and computing the Jacobian determinant is computationally cheap.

Implicit iterative computation
Due to the popularity of the matrix exponential as a solutions to ODEs, numerous methods to compute the matrix exponential with high numerical precision exist (arioli1996pade; moler2003nineteen). However, these methods typically rely on storing the matrix in memory, which is very expensive for transformations such as convolutional layers. Instead, we propose to solve the exponential using matrix vector products . The exponential matrix vector product can be computed implicitly using the power series, multiplied by any vector using only matrix-vector multiplications:

 exp(M)⋅→x=→x+M⋅→x1!+M2⋅→x2!+…=∞∑i=0Mi⋅→xi!, (5)

where the term can be expressed as two matrix vector multiplications . Further, computation from previous terms can be efficiently re-used as described in Algorithm 1. Using this fact, the convolution exponential can be directly computed using the series:

 m⋆ex=x+m⋆x1!+m⋆(m⋆x)2!+…, (6)

which can be done efficiently by simply setting in Algorithm 2. A visual example of the implicit computation is presented in Figure 3. (a) Forward computation z=m⋆ex. Figure 4: Upper bound of the norm of a term in the power series ||Mix||p/i! at iteration i, relative to the size of the input ||x||p given a matrix norm.

Power series convergence
Even though the exponential can be solved implicitly, it is uncertain how many terms of the series will need to be expanded for accurate results. Moreover, it is also uncertain whether the series can be computed with high numerical precision. To resolve both issues, we constrain the induced matrix norm of the linear transformation. Given the -norm on the matrix , a theoretical upper bound for the size of the terms in the power series can be computed using the inequality: . Hence, an upper bound for relative size of the norm of a term at iteration , is given by . Notice that the factorial term in the denominator causes the exponential series to converges very fast, which is depicted in Figure 4.

In our experiments we constrain using spectral normalization (miyato2018spectral), which constrains the norm of the matrix () and can be computed efficiently for convolutional layers and standard linear layers. Even though the algorithm from miyato2018spectral produces a lower bound on the norm, in practice the bound is sufficiently close to produce convergence behaviour as shown in Figure 4. Moreover, the figure depicts worst-case behaviour given the norm, and typically the series converges far more rapidly. In experiments we normalize the convolutional layer using a coefficient of and we find that expanding around or terms of the series is generally sufficient.

### 3.1 Graph Convolution Exponential

In this section we extend the Convolution Exponential to graph structured data. Given a graph with nodes and edges . We define a matrix of nodes features , an adjacency matrix and a degree matrix . Employing a similar notation as kipf2016semi, a linear graph convolutional layer can be defined as:

where are free parameters. Since the output in the graph convolution linearly depends on its inputs, it can also be expressed as a product of some equivalent matrix with a vectorized signal . Note that the trace of this equivalent matrix is is equal to the trace of , multiplied by the number of nodes, i.e. . This is because the adjacency matrix contains zeros on its diagonal and all self-connections are parametrized by . The proofs to obtain from equation 7 and its trace are shown in the Appendix.

The graph convolution exponential can be computed by replacing with the function in Algorithm 2. Since the size and structure of graphs may vary, the norm of changes depending on this structure even if the parameters and remain unchanged. As a rule of thumb we find that the graph convolution exponential converges quickly when the norm is constrained to one divided by the maximum number of neighbours, and to one (as it is already normalized via ).

### 3.2 General Linear Exponentials and Equivariance

In the previous section we generalized the exponential to convolutions and graph convolutions. Graph convolutions can be viewed as permutation equivariant convolutions (maron2019invariant). In this section we ask the question whether exponentiation and equivariance commute, i.e., which equivariant convolutions will remain equivariant after exponentiation.

Consider a feature field with components where indexes the channel dimensions of the capsules and is the pixel at position . Note that feature fields can consist of multiple capsules but we will restrict ourselves without loss of generality to only one. We now construct a resized vector which combines the indices into . Next we define a (not necessarily linear) operator which maps . Equivariance under is defined as where

is a general kernel that maps one layer of the neural network to the next layer. It states that first performing the map

(usually a convolution) and then the symmetry transform in the activation layer is the same as first transforming the input layer and then convolving. Note that neither nor need to be invertible, but that we did require that the symmetry transformation in the input layer and the activation layer are the same (this is less general than the usual equivariance constraint which is of the form ). Subject to that constraint, this definition is however very general and encompasses group convolutions (cohen2016group; dieleman2016exploiting) and permutation equivariant graph convolutions (maron2019invariant).

Since it follows that for positive powers . Moreover, any linear combination of any collection of powers commutes as well, which in turn implies the statement proving that the exponential of the map is still equivariant. In particular it shows that both the exponential of group convolutions and permutation equivariant graph convolutions are still equivariant.

## 4 Generalized Sylvester Flows

Sylvester Normalizing Flows (SNF) (van2018sylvester) takes advantage of the Sylvester identity that allows to calculate a determinant of form in an efficient manner. Specifically, van2018sylvester parametrize and

using a composition of a shared orthogonal matrix

and triangular matrices , such that and . However, the original Sylvester flows utilize fully connected parametrizations, and not convolutional ones. We introduce an improvement of Sylvester flows which we call generalized Sylvester flows. The transformation is described by:

 z=x+W−1fAR(Wx), (8)

where is an invertible matrix and is a smooth autoregressive function. In this case the determinant can be computed using:

 det(dzdx)=det(I+JfAR(Wx)WW−1)=det(I+JfAR(Wx)), (9)

where denotes the Jacobian of , which is triangular because is autoregressive. We can show that 1) generalized Sylvester flows are invertible and 2) that they generalize the original Sylvester flows.

Theorem 1: Let be an invertible matrix. Let be a smooth autoregressive function (i.e., if ). Additionally, constrain . Then the transformation given by (8) is invertible. Proof: see Appendix A.

Theorem 2: The original Sylvester flow , is a special case of the generalized Sylvester flow (8). Proof: see Appendix A.

Theorem 1 demonstrates that the generalized Sylvester Flow transformation is invertible and Theorem 2 shows that the original Sylvester Flows can be modelled as a special case by this new transformation. The generalization admits the use of any invertible linear transformation for , such as a convolution exponential. In addition, it allows the use of general autoregressive functions.

### 4.1 Inverting Sylvester Flows

Recall that we require that the diagonal values of are greater than . If we additionally constrain the maximum of this diagonal to , then the function becomes a one-dimensional contraction, given that the other dimensions are fixed. Using this, the inverse of Sylvester flows can be easily computed using a fixed point iteration. Firstly, compute and let . At this point the triangular system can be solved for using the fixed-point iteration:

 u(t)=v−fAR(u(t−1)). (10)

Subsequently, can be obtained by computing . This procedure is valid both for our method and the original Sylvester flows. Although the fixed-point iteration is identical to (behrmann2019invertible), the reason that Sylvester flows converge is because i) the function is a contraction in one dimension, ii) the function is autoregressive (for a proof see Appendix A). The entire function does not need to be a contraction. Solving an autoregressive inverse using fixed-point iteration is generally faster than solving the system iteratively (song2020nonlinear; wiggers2020predictive).

Specifically, we choose that , where are strictly autoregressive functions parametrized by neural networks with a shared representation. Also , utilize a final function so that their output is in and , which we set to . This transformation is somewhat similar to the construction of the original Sylvester flows (van2018sylvester), with the important difference that can now be modelled by any strictly autoregressive function.

### 4.2 Convolutional Sylvester Flows

Generalized Sylvester flows and the convolution exponential can be naturally combined to obtain Convolutional Sylvester Flows (CSFs). In Equation 8 we let , where is the equivalent matrix of a convolution with filter . In addition is an orthogonal convolution modeled by Householder reflections (tomczak2016improving; hoogeboom2019emerging):

 z=x+QT((−m)⋆e∑fAR(m⋆eQx)), (11)

where the function is modelled using autoregressive convolutions (germain2015made; kingma2016improved). For this transformation the determinant , which is straightforward to compute as is triangular.

## 5 Related Work

Deep generative models can be broadly divided in likelihood based model such as autoregressive models (ARMs)

, Variational AutoEncoders (VAEs)

(kingma2014auto), Normalizing flows (rezende2015norm), and adversarial methods (goodfellow2014generative)

. Normalizing flows are particularly attractive because they admit exact likelihood estimation and can be designed for fast sampling.

Linear flows are generally used to mix information in-between triangular maps. Existing transformations in literature are permutations (dinh2016density), orthogonal transformations tomczak2016improving; golinski2019improving, 1 1 convolutions (kingma2018glow), low-rank Woodbury transformations (lu2020woodbury), emerging convolutions (hoogeboom2019emerging), and periodic convolutions (finzi2019invertible; karami2019invertible; hoogeboom2019emerging). From these transformations only periodic and emerging convolutions have a convolutional parametrization. However, periodicity is generally not a good inductive bias for images, and since emerging convolutions are autoregressive, their inverse requires the solution to an iterative problem. Notice that golinski2019improving utilize the matrix exponential to construct orthogonal transformations. However, their method cannot be utilized for convolutional transformations since they compute the exponential matrix explicitly. Our linear exponential can also be seen as a linear ODE (chen2018neural), but the methods are used for different purposes and are computed differently.

There exist many triangular flows in the literature such as coupling layers (dinh2014nice; dinh2016density), autoregressive flows (germain2015made; kingma2016improved; papamakarios2017masked; chen2018neural; cao2019block; song2019mintnet; nielsen2020closing), spline flows (durkan2019neural; durkan2019cubic) and polynomial flows (jaini2019sumsquares). Other flows such as Sylvester Flows (van2018sylvester) and Residual Flows (behrmann2019invertible; chen2019residual) learn invertible residual transformations. Sylvester Flows ensure invertibility by orthogonal basis changes and constraints on triangular matrices. Our interpretation connects Sylvester Flows to more general triangular functions, such as the ones described above. Residual Flows ensure invertibility by constraining the Lipschitz continuity of the residual function. A disadvantage of residual flows is that computing the log determinant is not exact.

## 6 Experiments

Because image data needs to be dequantized theis2016note; ho2019flow++, we optimize the expected lowerbound (ELBO) of the log-likelihood. The performance is compared in terms of negative ELBO and negative log-likelihood (NLL) which is approximated with importance weighting samples. Values are reported in bits per dimension on CIFAR10.

### 6.1 Mixing for generative flows Figure 5: Samples from a generative Convolutional Sylvester flow trained on CIFAR10.

In this experiment the convolution exponential is utilized as a linear layer in-between affine coupling layers. For a fair comparison, all the methods are implemented in the same framework, and are optimized using the same procedure. For details regarding architecture and optimization see Appendix C. The convolution exponential is compared to other linear mixing layers from literature: 1 1 convolutions (kingma2018glow), emerging convolutions (hoogeboom2019emerging), and Woodbury transformations lu2020woodbury. The number of intermediate channels in the coupling layers are adjusted slightly such that each method has an approximately equal parameter budget. The experiments show that our method outperforms all other methods measured in negative ELBO and log-likelihood (see Table 1). Interestingly, even though emerging convolutions also have a convolutional parametrization, their performance is worse than the convolution exponential. This indicates that the autoregressive factorization of emerging convolutions somewhat limits their flexibility, and the exponential parametrization works better.

### 6.2 Density modelling using residual transformations

Since Sylvester Flows are designed to have a residual connection, it is natural to compare their performance to invertible residual networks

(behrmann2019invertible) which were improved to have unbiased log determinant estimates, and subsequently named residual flows (chen2019residual). For a fair comparison, we run the code from (chen2019residual) inside our framework using the same architecture and number of optimizer steps. For reference we also train a typical coupling-based flow with the same architecture. For more details please refer to Appendix C. Note that a direct comparison to the results in Table 1 may not be fair, as the network architectures are structurally different. The results show that Sylvester flows considerably outperform residual networks in image density estimation. Additionally, the memory footprint during training of residual blocks is roughly twice of the other models, due to the the Jacobian determinant estimation. When correcting for this, the equal memory budget result is obtained. In this case, the residual block flow is even outperformed by a coupling flow. We hypothesize that this is caused by the strict Lipschitz continuity that has to be enforced for residual flows. The ablation study in Table 3 shows the effect of using non-generalized Sylvester Flows, and the effect of not doing the basis change. Since the original Sylvester Flows (van2018sylvester) are not convolutional, it is difficult to directly compare these methods. The ablation result without the generalization using , is the closest to a convolutional interpretation of the original Sylvester flows, although it already has the added benefit of the exponential basis change. Even so, our Convolutional Sylvester flows considerably outperform this non-generalized Sylvester flow.

### 6.3 Graph Normalizing Flows

In this section we compare our Graph Convolution Exponential with other methods from the literature. As a first baseline we use a baseline coupling flow dinh2016density that does not exploit the graph structure of the data. The second baseline is a Graph Normalizing Flows that uses graph coupling layers as described in (liu2019graph). Since normalizing flows for edges of the graph is an open problem, following liu2019graph we assume a fully connected adjacency matrix. Our method then adds a graph convolution exponential layer preceding every coupling layer. For further implementation details refer to Appendix B. Following liu2019graph we test the methods on the graph datasets Mixture of Gaussian (MoG) and Mixture of Gaussians Ring (MoG-Ring), which are essentially mixtures of permutation of Gaussians. The original MoG dataset considers Gaussians, which we extend to and points obtaining two new datasets MoG-9 and MoG-16 to study performance when the number of nodes increase. The MoG-Ring entropy is estimated using importance weighting to marginalize over the rotation. Results are presented in Table 4. Adding the graph convolution exponential improves the performance in all four datasets. The improvement becomes larger as the number of nodes increases (e.g. MoG-9 and MoG-16), which is coherent with the intuition that our Graph Convolution Exponential propagates information among nodes in the mixing layer.

## 7 Conclusion

In this paper we introduced a new simple method to construct invertible transformations, by taking the exponential of any linear transformation. Unlike prior work, we observe that the exponential can be computed implicitly. Using this we developed new invertible transformations named convolution exponentials and graph convolution exponentials, and showed that they retain their equivariance properties under exponentiation. In addition, we generalize Sylvester Flows and propose Convolutional Sylvester Flows.

## Appendix A Details for Generalized Sylvester Flows

Recall that the Generalized Sylvester Flows transformation is described by:

 z=x+W−1fAR(Wx). (12)

Theorem 1: Let be an invertible matrix. Let be a smooth autoregressive function (i.e., if ). Additionally, constrain . Then the transformation given by (12) is invertible.

Proof. The vectors of matrix form a basis change for . Since the basis change is invertible, it suffices to show that the transformation in the basis is invertible. Multiplying Equation 12 by from the left gives:

 Wzv=Wxu+fAR(Wxu), (13)

Recall that a one-dimensional real function with strictly positive derivative is invertible. Let and . Since solely depends on and , the function

 v1=u1+fAR(u1), (14)

is invertible and hence given there is a unique . Suppose all variables have been obtained. Consider the equation . Since are all known, now depends solely on and since , the inverse is unique. Inductively, this gives an expression for given and therefore the transformation is invertible.

Theorem 2: The original Sylvester flow , is a special case of the generalized Sylvester flow (12).

Proof. Let and let . Indeed, any orthogonal matrix is invertible, so can be modelled by . Also, note that and are upper triangular and is an elementwise function. The matrix product of the Jacobians is triangular, and thus has a triangular Jacobian and is therefore autoregressive. Hence, it can be modelled by . Further, note that van2018sylvester bound > , which ensures that the constraint is satisfied. Hence, can be written as Equation 8 when writing and without violating any constraints on and , and is therefore a special case.

Remark 1:
The increased expressitivity originates from and not from . To see why, suppose we replace and in the original formulation by and . Consider that any real square matrix may be decomposed as . Hence, compositions and can be written as and , where and which are both still upper triangular. Hence, we have shown that even if the orthogonal matrix is replaced by an invertible matrix , the transformation can still be written in terms of a shared orthogonal matrix and upper triangular matrices and . Therefore, the source of the increased expressitivity is not the replacement of by .

Remark 2:
Sylvester flows can also be viewed from a different perspective as a composition of three invertible transformations: a basis change, a residual invertible function and the inverse basis change. Specifically, let be an invertible function that can be written as (for instance can be the autoregressive function from above). Now apply a linear basis change on , and the inverse on the output . Then . In other words, because the basis change is linear it distributes over addition and cancels in the identity connection, which results in a residual transformation.

The reason that we still utilize an invertible matrix is that it allows more freedom when modelling using the convolution exponential (a decomposition for convolutions can generally not be expressed in terms of convolutions). For fully connected settings can safely be chosen to be orthogonal.

Inverting Sylvester Flows
The inverse of Sylvester flows can be easily computed using a fixed point iteration. Firstly, compute and let . At this point the triangular system can be solved for using the fixed-point iteration:

 u(t)=v−fAR(u(t−1)). (15)

To show that it converges, recall that we constrain the diagonal values of to be greater than and less than . In addition, we require to be Lipschitz continuous for some value

. Note that since neural network are generally composed of linear layers and activation functions that are Lipschitz continuous, these networks themselves are also Lipschitz continuous.

Firstly note that , and we’ll refer to the maximum over all as the one-dimensional Lipschitz continuity , where . Since the function is autoregressive, when all preceding dimensions are fixed, the function is a (one-dimensional) contraction. We will show inductively that converges to . For the base case, note that and hence converges at a rate of . For the remainder of this prove we use the distance as distance metric.

Assume that converges at a rate of , that is for some constant . Then converges at a rate of . We can bound the difference for recursively using:

 |u(t)i−u(t+1)i|≤γ|u(t−1)i−u(t)i|+L||u(t−1)1:i−1−u(t)1:i−1||. (16)

When expanding this equation, this equation and using that , we can write:

 |u(t)i−u(t+1)i| ≤γt|u(0)i−u(1)i|+t∑t′=1γ(t−t′)LBt′n−1γt′, ≤γt|u(0)i−u(1)i|+tnγtLB,

which is guaranteed to converge at least at a rate of . The last inequality follows because . Since converges already at a rate of , the convergence rate for is bounded by . Combining this result with the base case, converges with a rate of , gives the result that the convergence rate of the entire vector is bounded by , where is the dimensionality of . Studying this equation, we can recognize two factors that influence the convergence that we can easily control: The Lipschitz continuity , and the one-dimensional continuity . We find experimentally that constraining the Lipschitz continuity of the convolutional layers in to , and setting generally allows the fixed point iteration to converge within 50 iterations when using an absolute tolerance of .

## Appendix B Graph Convolution Exponential

Given the product of three matrices with dimensions , and respectively we can express its vectorized form in the following way:

 vec(ABC)=(CT⊗A)vec(B) (17)

Where stands for the Kronecker product. We obtain the vectorized form of the Linear Graph Convolutional Layer by applying the above mentioned equation 17 to the graph convolutional layer equation as follows:

Now that we have analytically obtained we can compute its trace by making use of the following Kronecker product property: . The trace of will be:

## Appendix C Experimental Details

We train on the first images of CIFAR10, using the remaining for validation. The final performance is shown on the conventional test images.

### c.1 Mixing experiment

The flow architecture is multi-scale following (kingma2018glow): Each level starts with a squeeze operation, and then subflows which each consist of a linear mixing layer and an affine coupling layer (dinh2016density). The coupling architecture utilizes densenets as described in (hoogeboom2019integer). Further, we use variational dequantization (ho2019flow++), using the same flow architecture as for the density estimation, but using less subflows. Following (dinh2016density; kingma2018glow) after each level (except the final level) half the variables are transformed by another coupling layer and then factored-out. The final base distribution

is a diagonal Gaussian with mean and standard deviation. All methods are optimized using a batch size of

using the Adam optimizer (kingma2014adam) with a learning rate of with standard settings. More details are given in Table 5. Notice that convexp mixing utilizes a convolution exponential and a convolutions, as it tends to map close to the identity by the construction of the power series. Results are obtained by running models three times after random weight initialization, and the mean of the values is reported. Runs require approximately four to five days to complete. Results are obtained by running on four NVIDIA GeForce GTX 1080Ti GPUs, CUDA Version: .

### c.2 Invertible Residual Transformations experiment

The setup is identical to section C.1, where a single subflow is now either a residual block or a convolutional sylvester flow transformation, with a leading actnorm layer (kingma2018glow). The network architectures inside the Sylvester and residual network architectures all consist of three standard convolutional layers: A convolution, a convolution and another convolution. These provide the translation and scale parameters for the Sylvester transformation, and they model the residual for the residual flows. These convolutions map to channels internally, where the first and last convolutional layers map to the respective input and output sizes. Note that for the coupling flow an important difference that a subflow consists of a coupling layer and a convolution, because the coupling layer itself cannot mix information. All methods use the same dequantization flow and splitprior architecture that were described in section C.1. All methods are optimized using a batch size of using the Adam optimizer (kingma2014adam) with a learning rate of with . Results are obtained by running models a single after random weight initialization. More details are given in Table 6. Results are obtained by running on two NVIDIA GeForce GTX 1080Ti GPUs, CUDA Version: . Runs require approximately four to five days to complete. The residual block flow utilizes four GPUs as it requires more memory.

### c.3 Graph Normalizing Flow experiment

The normalizing flows in the graph experiments all utilize three subflows, where a subflow consists of an actnorm layer (kingma2018glow), a convolution and an affine coupling layer dinh2016density. In the model that utilizes the graph convolution exponential, the convolution exponentional precedes each coupling layer. In the baseline coupling flow, the neural networks inside the coupling layers are

-layer Multi Layer Perceptrons (MLPs) with Leaky Relu activations. In the graph normalizing flow, the neural networks inside the coupling layers are graph neural networks where node and edge operations are performed by a

-layer and a -layer MLPs respectively with ReLU activations. All above mentioned neural networks utilize hidden features.

All experiments are optimized for iterations, using a batch size of 256 and a starting learning rate of with a learning rate decay factor of 0.1 every iterations. For testing we used samples, i.e. iterations with a batch size of .

## Appendix D Variational posterior modelling in VAEs

This experiment utilizes normalizing flows as the variational posterior for a Variational AutoEncoder (VAE). The code is built upon the original Sylvester flow implementation (van2018sylvester) which contains a VAE with a single latent representation with a standard Gaussian prior. There are two differences: For CIFAR10 a discretized mixture of logistics salimans2017pixelcnn++ is used as output distribution for the decoder. Additionally, the gated convolutions are replaced by denseblock layers.

The proposed Convolutional Sylvester Flows outperform the other methods considerably in terms of ELBO and log-likelihood on CIFAR10. Interestinly, although the Convolutional Sylvester Flow outperforms other convolutional methods, the experiment shows that fully connected flows actually perform better on binary MNIST. Noteworthy for the comparison between the orginal Householder Sylvester flows (H-SNF) and our method is that H-SNF has four times more parameters than our convolutional Sylvester flows.