Deep generative models are one of the lynchpins of unsupervised learning, underlying tasks spanning distribution learning, feature extraction and transfer learning. Parametric families of neural-network based models have been improved to the point of being able to model complex distributions like images of human faces. One paradigm that has received a lot of attention is normalizing flows, which model distributions as pushforwards of a standard Gaussian through an invertible neural network. Thus, the likelihood has an explicit form via the change of variables formula using the Jacobian of the network.
Training normalizing flows is challenging due to a couple of main issues. Empirically, these models seem to require a much larger size than other generative models (e.g. GANs) and, most notably, a much larger depth. This makes training challenging due to vanishing/exploding gradients. A closely related problem is conditioning, more precisely the smallest singular value of the forward map. It’s intuitively clear that natural images will have a low-dimensional structure, thus a close-to-singular forward map might be needed. On the other hand, the change-of-variables formula involves the determinant of the Jacobian of the inverse map, which grows larger the more singular the forward map is.
While the universal approximation power of various types of invertible architectures has recently been studied (Dupont et al., 2019; Huang et al., 2020) under the assumption that the input is padded with a sufficiently large number of all-0 coordinates, a precise quantification of the cost of invertibility, in terms of the depth required and the conditioning of the model, has not been fleshed out.
In this paper, we study both mathematically and empirically representational aspects of depth and conditioning in normalizing flows and answer several fundamental questions.
2 Overview of Results
2.1 Results about General Architectures
The most basic question about normalizing flows we can ask is how restrictive the assumption on invertibility of a model is—in particular how much the depth must increase to compensate for it. More precisely, we ask:
Question 1: is there a distribution over which can be written as the pushforward of a Gaussian through a small, shallow generator, which cannot be written as the pushforward of a Gaussian through a small, shallow invertible neural network?
Given that there is great latitude in terms of the choice of layer architecture, while keeping the network invertible, the most general way to pose this question is to require each layer to be a function of parameters – i.e. where denotes function composition and each is an invertible function specified by a vector of parameters.
This framing is extremely general: for instance it includes layerwise invertible feedforward networks in which , is invertible, is invertible, and . It also includes popular architectures based on affine coupling blocks (e.g. Dinh et al. (2014, 2016); Kingma and Dhariwal (2018)) where each has the form for some which we revisit in more detail in the following subsection.
We answer this question in the negative: namely, we show that there is a distribution over which can be expressed as the pushforward of a network with depth and size that cannot be (even very approximately) expressed as the pushforward of a Gaussian through a Lipschitz invertible network of depth smaller than .
Towards formally stating the result, let be the vector of all parameters (e.g. weights, biases) in the network, where are the parameters that correspond to layer , and let denote the resulting function. Define so that is contained in the Euclidean ball of radius .
We say the family is -Lipschitz with respect to its parameters and inputs, if
and . [Footnote 1: Note that for architectures having trainable biases in the input layer, these two notions of Lipschitzness should be expected to behave similarly.] We will discuss the reasonable range for in terms of the weights after the Theorem statement. We show:
For every and , there exists a neural network with parameters and depth , s.t. for any family of layerwise invertible networks that are -Lipschitz with respect to its parameters and inputs, have parameters per layer and depth at most we have
Furthermore, for all and .
We make several comments. First, note that while the number of parameters in both networks is comparable (i.e. it’s ), the invertible network is deeper, which is usually accompanied by algorithmic difficulties for training, due to vanishing and exploding gradients. For layerwise invertible generators, if we assume that the nonlinearity is -Lipschitz and each matrix in the network has operator norm at most , then a depth network will have and . [Footnote 2: Note that our theorem applies to exponentially large Lipschitz constants.] For an affine coupling network with parameterized by -layer networks with parameters each, -Lipschitz activations and weights bounded by as above, we would similarly have .
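The claim that depth compounds the Lipschitz constant can be checked numerically. The following is a minimal sketch, assuming a layerwise network with a leaky ReLU nonlinearity (which is 1-Lipschitz and invertible) and square weight matrices; the names `leaky_net` and `lipschitz_upper_bound` are hypothetical and only the input-Lipschitz constant is examined, not Lipschitzness in the parameters.

```python
import numpy as np

def leaky_net(weights, x, slope=0.1):
    """Apply x -> act(W_l ... act(W_1 x)) with a leaky ReLU activation."""
    for w in weights:
        h = w @ x
        x = np.where(h > 0, h, slope * h)  # 1-Lipschitz, invertible nonlinearity
    return x

def lipschitz_upper_bound(weights):
    """Product of spectral norms: an upper bound for 1-Lipschitz activations."""
    return float(np.prod([np.linalg.norm(w, 2) for w in weights]))

rng = np.random.default_rng(0)
dim, depth = 8, 5
weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(depth)]
bound = lipschitz_upper_bound(weights)

# Empirical ratios ||f(x) - f(y)|| / ||x - y|| should never exceed the bound.
ratios = []
for _ in range(200):
    x, y = rng.standard_normal(dim), rng.standard_normal(dim)
    diff = leaky_net(weights, x) - leaky_net(weights, y)
    ratios.append(np.linalg.norm(diff) / np.linalg.norm(x - y))
max_ratio = max(ratios)
```

If every matrix has operator norm at most some constant, the bound is at most that constant raised to the depth, which is the exponential dependence on depth discussed above.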
We furthermore make several remarks on the “hard” distribution we construct, as well as the meaning of the parameter and how to interpret the various lower bounds in the different metrics. The distribution for a given will in fact be close to a mixture of Gaussians, each with mean on the sphere of radius and covariance matrix . Thus this distribution has most of its mass in a sphere of radius , so this is close to a trivial approximation for .
The KL divergence bounds are derived by so-called transport inequalities between KL and Wasserstein for subgaussian distributions (Bobkov and Götze, 1999). The discrepancy between the two KL divergences comes from the fact that the functions may have different Lipschitz constants, hence the tails of and behave differently. In fact, if the function had the same Lipschitz constant as , both KL lower bounds would be on the order of a constant.
2.2 Results About Affine Coupling Architectures
The appeal of affine coupling architectures comes from training efficiency. Although layerwise invertible neural networks (i.e. networks for which each layer consists of an invertible matrix and invertible pointwise nonlinearity) seem like a natural choice, in practice these are computationally prohibitive.
Concretely, if a random variable and distribution is the pushforward of the through an invertible function , the density of at a point is , where denotes the Jacobian of the function . Hence, maximizing the likelihood of the observable data requires an efficient way of evaluating the determinant of the Jacobian of . For a structurally unrestricted network, computing the determinant of the Jacobian is a cubic-time operation in the dimension, which is prohibitively expensive.
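The change of variables formula can be illustrated on the simplest invertible map. The sketch below assumes an affine map `g(z) = a @ z + b` applied to a standard Gaussian (the names `a`, `b` and the helper functions are hypothetical), and compares the flow-style log-density with the closed-form Gaussian density of the pushforward.

```python
import numpy as np

def log_standard_normal(z):
    """Log-density of N(0, I) at the vector z."""
    d = z.shape[0]
    return -0.5 * float(z @ z) - 0.5 * d * np.log(2.0 * np.pi)

def log_density_pushforward(x, a, b):
    """Log-density at x of g(z) = a @ z + b with z ~ N(0, I)."""
    z = np.linalg.solve(a, x - b)           # z = g^{-1}(x)
    _, log_abs_det = np.linalg.slogdet(a)   # log |det J_g|, cubic cost in general
    # |det J_{g^{-1}}| = 1 / |det J_g|, hence the minus sign.
    return log_standard_normal(z) - log_abs_det

def log_gaussian(x, mean, cov):
    """Closed-form log-density of N(mean, cov) at x."""
    d = x.shape[0]
    diff = x - mean
    _, log_det_cov = np.linalg.slogdet(cov)
    quad = float(diff @ np.linalg.solve(cov, diff))
    return -0.5 * quad - 0.5 * log_det_cov - 0.5 * d * np.log(2.0 * np.pi)

rng = np.random.default_rng(0)
a = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)  # invertible with probability one
b = rng.standard_normal(3)
x = rng.standard_normal(3)

flow_value = log_density_pushforward(x, a, b)
closed_form = log_gaussian(x, b, a @ a.T)  # the pushforward is N(b, a a^T)
```

The two values agree up to floating-point error. The call to `slogdet` on a dense matrix is the expensive step that structured layers are designed to avoid.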
Consequently, it’s typical for the transformations in a flow network to be constrained in a manner that allows for efficient computation of the Jacobian determinant. The most common building block is an affine coupling block, originally proposed by Dinh et al. (2014, 2016). A coupling block partitions the coordinates into two parts: and , for a subset with containing around half the coordinates of . The transformation then has the form:
An affine coupling block is a map , s.t.
Of course, the modeling power will be severely constrained if the coordinates in never change, so flow models typically change the set in a fixed or learned way (e.g. alternating between different partitions of the channel in Dinh et al. (2016) or applying a learned permutation in Kingma and Dhariwal (2018)). A permutation is a discrete object and hence difficult to learn in a differentiable manner, so Kingma and Dhariwal (2018) simply learn an invertible linear function (i.e. a 1x1 convolution) as a differentiation-friendly relaxation thereof.
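As an illustration of why coupling blocks are cheap, here is a minimal sketch of one block in the standard form used by Dinh et al. (2016): the coordinates in the fixed part pass through unchanged, and the remaining coordinates are scaled and shifted by functions of the fixed part. The concrete choices of `s` and `t` (a bounded exponential scale and a linear shift) and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
idx_fixed, idx_moved = np.array([0, 1]), np.array([2, 3])
w_s = rng.standard_normal((2, 2))  # hypothetical parameters of s
w_t = rng.standard_normal((2, 2))  # hypothetical parameters of t

def s(u):
    return np.exp(np.tanh(w_s @ u))  # strictly positive scale

def t(u):
    return w_t @ u

def coupling_forward(x):
    """y_fixed = x_fixed, y_moved = x_moved * s(x_fixed) + t(x_fixed)."""
    scale = s(x[idx_fixed])
    y = x.copy()
    y[idx_moved] = x[idx_moved] * scale + t(x[idx_fixed])
    # The Jacobian is triangular, so its log-determinant is a sum of logs.
    return y, float(np.sum(np.log(np.abs(scale))))

def coupling_inverse(y):
    x = y.copy()
    x[idx_moved] = (y[idx_moved] - t(y[idx_fixed])) / s(y[idx_fixed])
    return x

def numerical_jacobian(f, x, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    jac = np.zeros((x.size, x.size))
    for j in range(x.size):
        step = np.zeros(x.size)
        step[j] = eps
        jac[:, j] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return jac

x = rng.standard_normal(d)
y, log_det = coupling_forward(x)
x_rec = coupling_inverse(y)
jac = numerical_jacobian(lambda v: coupling_forward(v)[0], x)
_, log_det_numeric = np.linalg.slogdet(jac)
```

The inverse needs only one evaluation of `s` and `t`, and the log-determinant costs linear time, in contrast to the cubic cost for an unstructured layer.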
2.2.1 The effect of choice of partition on depth
The first question about affine couplings we ask is how much of a saving in terms of the depth of the network one can hope to gain from using learned partitions (à la Glow) as compared to a fixed partition. More precisely:
Question 2: Can models like Glow (Kingma and Dhariwal, 2018) be simulated by a sequence of affine blocks with a fixed partition without increasing the depth by much?
We answer this question in the affirmative at least for equally sized partitions (which is what is typically used in practice). We show the following surprising fact: consider an arbitrary partition of , such that satisfies , for . Then for any invertible matrix , the linear map can be exactly represented by a composition of affine coupling layers that are linear, namely have the form or for matrices , s.t. each is diagonal. For convenience of notation, without loss of generality let . Then, each of the layers is a matrix of the form or , where the rows and columns are partitioned into blocks of size .
With this notation in place, we show the following theorem:
For all , there exists a such that for any invertible with , there exist matrices and diagonal matrices for all such that
In particular, since permutation matrices are invertible, this means that any applications of permutations to achieve a different partition of the inputs (e.g. like in Glow (Kingma and Dhariwal, 2018)) can in principle be represented as a composition of not-too-many affine coupling layers, indicating that the flexibility in the choice of partition is not the representational bottleneck.
It is reasonable to ask how optimal the bound is – we supplement our upper bound with a lower bound, namely that . This is surprising, as naive parameter counting would suggest might work. Namely, we show:
For all and , there exists an invertible with , s.t. for all and for all diagonal matrices it holds that
Beyond the relevance of this result in the context of how important the choice of partitions is, it also shows a lower bound on the depth for an equal number of nonlinear affine coupling layers (even with quite complex functions and in each layer) – since a nonlinear network can always be linearized about a (smooth) point to give a linear network with the same number of layers. In other words, studying linear affine coupling networks lets us prove a depth lower bound/depth separation for nonlinear networks for free.
Finally, in Section 5.3, we include an empirical investigation of our theoretical results on synthetic data, by fitting random linear functions of varying dimensionality with linear affine networks of varying depths in order to see the required number of layers. The results there suggest that the constant in the upper bound is quite loose – and the correct value for is likely closer to the lower bound – at least for random matrices.
2.2.2 Universal Approximation with Ill-Conditioned Affine Coupling Networks
Finally, we turn to universal approximation and the close ties to conditioning. Namely, a recent work (Theorem 1 of (Huang et al., 2020)) showed that deep affine coupling networks are universal approximators if we allow the training data to be padded with sufficiently many zeros. While zero padding is convenient for their analysis (in fact, similar proofs have appeared for other invertible architectures like Augmented Neural ODEs (Zhang et al., 2020)), in practice models trained on zero-padded data often perform poorly (see Appendix B).
In fact, we show that neither padding nor depth is necessary representationally: shallow models without zero padding are already universal approximators in Wasserstein.
Theorem 4 (Universal approximation without padding).
Suppose that is the standard Gaussian measure in with even and is a distribution on with bounded support and absolutely continuous with respect to the Lebesgue measure. Then for any , there exists a depth-3 affine coupling network , with maps represented by feedforward ReLU networks such that
A shared caveat of the universality construction in Theorem 4 with the construction in Huang et al. (2020) is that the resulting network is poorly conditioned. In the case of the construction in Huang et al. (2020), this is obvious because they pad the -dimensional training data with additional zeros, and a network that takes as input a Gaussian distribution in (i.e. has full support) and outputs data on a -dimensional manifold (the space of zero padded data) must have a singular Jacobian almost everywhere. [Footnote 3: Alternatively, we could feed a degenerate Gaussian supported on a -dimensional subspace into the network as input, but there is no way to train such a model using maximum-likelihood training, since the prior is degenerate.] In the case of Theorem 4, the condition number of the network blows up at least as quickly as as we take the approximation error , so this network is also ill-conditioned if we are aiming for a very accurate approximation. Based on Theorem 3, we can show that condition number blowup of either the Jacobian or the Hessian is necessary for such a shallow model to be universal, even when approximating well-conditioned linear maps (see Remark 8).
3 Related Work
On the empirical side, flow models were first popularized by Dinh et al. (2014), who introduce the NICE model and the idea of parametrizing a distribution as a sequence of transformations with triangular Jacobians, so that maximum likelihood training is tractable. Quickly thereafter, Dinh et al. (2016) improved the affine coupling block architecture they introduced to allow non-volume-preserving (NVP) transformations, and finally Kingma and Dhariwal (2018) introduced 1x1 convolutions in the architecture, which they view as relaxations of permutation matrices—intuitively, allowing learned partitions for the affine blocks. Subsequently, there have been variants on these ideas: (Grathwohl et al., 2018; Dupont et al., 2019; Behrmann et al., 2018) viewed these models as discretizations of ODEs and introduced ways to approximate determinants of non-triangular Jacobians, though these models still don’t scale beyond datasets the size of CIFAR10. The conditioning/invertibility of trained models was experimentally studied in (Behrmann et al., 2019), along with some “adversarial vulnerabilities” of the conditioning. Mathematically understanding the relative representational power and statistical/algorithmic implications thereof for different types of generative models is still however a very poorly understood and nascent area of study.
Most closely related to our results are the recent works of Huang et al. (2020) and Zhang et al. (2020). Both prove universal approximation results for invertible architectures (the former affine couplings, the latter neural ODEs) if the input is allowed to be padded with zeroes. As already expounded upon in the previous sections – our results prove universal approximation even without padding, but we focus on more fine-grained implications to depth and conditioning of the learned model.
More generally, there are various classical results that show a particular family of generative models can closely approximate most sufficiently regular distributions over some domain. Some examples are standard results for mixture models with very mild conditions on the component distribution (e.g. Gaussians, see (Everitt, 2014)et al., 2011; Montufar and Ay, 2011); GANs (Bailey and Telgarsky, 2018).
4 Proof of Theorem 1: Depth Lower Bounds on Invertible Models
In this section we prove Theorem 1. The intuition behind the bound on the depth relies on parameter counting: a depth invertible network will have parameters in total ( per layer), which is the size of the network we are trying to represent. Of course, the difficulty is that we need more than simply not being identical: we need a quantitative bound in various probability metrics.
The proof will proceed as follows. First, we will exhibit a large family of distributions (of size ), s.t. each pair of these distributions has a large pairwise Wasserstein distance between them. Moreover, each distribution in this family will be approximately expressible as the pushforward of the Gaussian through a small neural network. Since the family of distributions will have a large pairwise Wasserstein distance, by the triangle inequality, no other distribution can be close to two distinct members of the family.
Second, we can count the number of “approximately distinct” invertible networks of depth : each layer is described by weights, hence there are parameters in total. The Lipschitzness of the neural network in terms of its parameters then allows us to argue about discretizations of the weights.
Formally, we show the following lemma:
Lemma 1 (Large family of well-separated distributions).
For every , there exists a family of distributions, s.t. and:
1. Each distribution is a mixture of Gaussians with means and covariance .
2. and , we have for a neural network with at most parameters. [Footnote 4: The size of does not in fact depend on . The weights in the networks will simply grow as becomes small.]
3. For any .
The proof of this lemma will rely on two ideas: first, we will show that there is a family of distributions consisting of mixtures of Gaussians with components – s.t. each pair of members of this family is far in distance, and each member in the family can be approximated by the pushforward of a network of size .
The reason for choosing mixtures is that it’s easy to lower bound the Wasserstein distance between two mixtures with equal weights and covariance matrices in terms of the distances between the means. Namely, we show:
Let and be two mixtures of spherical Gaussians in dimensions with mixing weights , means and respectively, and with all of the Gaussians having spherical covariance matrix for some . Suppose that there exists a set with such that for every ,
By the dual formulation of Wasserstein distance (Kantorovich-Rubinstein Theorem, (Villani, 2003)), we have where the supremum is taken over all -Lipschitz functions . Towards lower bounding this, consider and note that this function is -Lipschitz and always valued in . For a single Gaussian , observe that
Therefore, we see that by combining the above calculation with the fact that at least of the centers for are in . On the other hand, for we have
(e.g. by Bernstein’s inequality (Vershynin, 2018), as is a sum of squares of Gaussians, i.e. a -random variable). In particular, since the points in do not have a close point in , we similarly have , since very little mass from each Gaussian in lands in the support of by the separation assumption. Combining the bounds gives the result. ∎
Given this, to design a family of mixtures of Gaussians with large pairwise Wasserstein distance, it suffices to construct a large family of -tuples for the means, s.t. for each pair of -tuples , there exists a set , s.t. . We do this by leveraging ideas from coding theory (the Gilbert-Varshamov bound (Gilbert, 1952; Varshamov, 1957)). Namely, we first pick a set of vectors of norm , each pair of which has a large distance; second, we pick a large number () of -tuples from this set at random, and show with high probability, no pair of tuples intersect in more than elements.
Concretely, first, by elementary Chernoff bounds, we have the following result:
Lemma 3 (Large family of well-separated points).
Let . There exists a set of vectors with , s.t. for all .
Recall that for a random unit vector on the sphere in dimensions, . (This is a basic fact about spherical caps, see e.g. Rao (2011)). By spherical symmetry and the union bound, this means for two unit vectors sampled uniformly at random . Taking gives that the probability is ; therefore if we draw i.i.d. vectors, the probability that two have inner product larger than in absolute value is at most if , which in particular implies such a collection of vectors exists. ∎
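The concentration phenomenon behind this argument is easy to observe empirically. The following sketch draws random unit vectors in a moderately high dimension and measures the largest pairwise inner product; the dimension, the number of vectors, and the variable names are arbitrary choices for the example and are not the quantities from the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, count = 1000, 50

# Normalized Gaussian vectors are uniformly distributed on the unit sphere.
vectors = rng.standard_normal((count, dim))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

gram = vectors @ vectors.T
off_diagonal = gram[~np.eye(count, dtype=bool)]
max_inner = np.abs(off_diagonal).max()

# For unit vectors, ||u - v||^2 = 2 - 2 <u, v>, so small inner products
# translate into pairwise distances close to sqrt(2).
min_distance = np.sqrt(2.0 - 2.0 * off_diagonal.max())
```

Typical inner products are of order one over the square root of the dimension, so all pairs are nearly orthogonal and hence well separated.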
From this, we construct a large set of -sized subsets of this family which have small overlap, essentially by choosing such subsets uniformly at random. We use the following result:
Lemma 4 (Rödl and Thoma (1996)).
There exists a set consisting of subsets of size of , s.t. no pair of subsets intersect in more than elements.
To handle part 2 of Lemma 1, we also show that a mixture of Gaussians can be approximated as the pushforward of a Gaussian through a network of size . The idea is rather simple: the network will use a sample from a standard Gaussian in . We will subsequently use the first coordinate to implement a “mask” that most of the time masks all but one randomly chosen coordinate in . The remaining coordinates are used to produce a sample from each of the components in the Gaussian, and the mask is used to select only one of them. For details, see Section A.
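The masking idea can be sketched with a hard (non-smooth) selection rule in place of the neural network used in the actual construction. The sketch assumes the input is a standard Gaussian with one extra coordinate beyond the output dimension and that all components share the covariance `sigma**2 * I`; the function names and the bucketing of the first coordinate via the Gaussian CDF are assumptions made for the example.

```python
import math
import numpy as np

def gaussian_cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def mixture_pushforward(z, means, sigma):
    """Map z in R^{d+1} to a sample from the equal-weight Gaussian mixture."""
    k = means.shape[0]
    # The first coordinate becomes uniform on (0, 1) and picks one component.
    index = min(int(gaussian_cdf(z[0]) * k), k - 1)
    mask = np.zeros(k)
    mask[index] = 1.0
    # One candidate sample per component, all sharing the same noise.
    candidates = means + sigma * z[1:]  # shape (k, d)
    return mask @ candidates, index

rng = np.random.default_rng(0)
k, d, sigma = 4, 3, 0.01
means = rng.standard_normal((k, d))
means /= np.linalg.norm(means, axis=1, keepdims=True)  # means on the unit sphere

samples, indices = [], []
for _ in range(5000):
    x, index = mixture_pushforward(rng.standard_normal(d + 1), means, sigma)
    samples.append(x)
    indices.append(index)

frequencies = np.bincount(indices, minlength=k) / len(indices)
max_offset = max(np.linalg.norm(x - means[i]) for x, i in zip(samples, indices))
```

Each component is selected with probability close to one over the number of components, and every sample lies near the mean of its selected component. A network can only approximate the hard mask, which is why the lemma is stated with an approximation error.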
With this lemma in hand, we finish the Wasserstein lower bound with a standard epsilon-net argument, using the parameter Lipschitzness of the invertible networks. Namely, the following lemma is immediate:
Suppose that is contained in a ball of radius and is a family of invertible layerwise networks which is -Lipschitz with respect to its parameters. Then there exists a set of neural networks , s.t. and for every there exists a , s.t. .
The proof of Theorem 1 can then be finished by triangle inequality: since the family of distributions has large Wasserstein distance, by the triangle inequality, no other distribution can be close to two distinct members of the family. Finally, KL divergence bounds can be derived from the Bobkov-Götze inequality (Bobkov and Götze, 1999), which lower bounds KL divergence by the squared Wasserstein distance. Concretely:
Theorem 5 (Bobkov and Götze (1999)).
Let be two distributions s.t. for every 1-Lipschitz and , is -subgaussian. Then, we have .
Then, to finish the two inequalities in the statement of the main theorem, we will show that:
In this section, we will prove Theorems 3 and 2. Before proceeding to the proofs, we will introduce a bit of helpful notation. We let denote the group of matrices with positive determinant (see Artin (2011) for a reference on group theory). The lower triangular linear affine coupling layers are the subgroup of the form
and likewise the upper triangular linear affine coupling layers are the subgroup of the form
Finally, define . This set is not a subgroup because it is not closed under multiplication. Let denote the th power of , i.e. all elements of the form for .
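The subgroup and non-closure statements can be checked directly on block matrices. The sketch below assumes the layers act on a space of even dimension split into two equal blocks, with lower layers of the form `[[I, 0], [A, B]]` and upper layers of the form `[[C, D], [0, I]]` where `B` and `C` are diagonal with positive entries; the helper names are hypothetical.

```python
import numpy as np

def lower_layer(a, b_diag):
    d = a.shape[0]
    return np.block([[np.eye(d), np.zeros((d, d))], [a, np.diag(b_diag)]])

def upper_layer(c_diag, dd):
    d = dd.shape[0]
    return np.block([[np.diag(c_diag), dd], [np.zeros((d, d)), np.eye(d)]])

def is_lower_layer(m, tol=1e-12):
    d = m.shape[0] // 2
    top_ok = (np.allclose(m[:d, :d], np.eye(d), atol=tol)
              and np.allclose(m[:d, d:], 0.0, atol=tol))
    b = m[d:, d:]
    return bool(top_ok and np.allclose(b, np.diag(np.diag(b)), atol=tol))

def is_upper_layer(m, tol=1e-12):
    d = m.shape[0] // 2
    bottom_ok = (np.allclose(m[d:, d:], np.eye(d), atol=tol)
                 and np.allclose(m[d:, :d], 0.0, atol=tol))
    c = m[:d, :d]
    return bool(bottom_ok and np.allclose(c, np.diag(np.diag(c)), atol=tol))

rng = np.random.default_rng(0)
d = 3
lower1 = lower_layer(rng.standard_normal((d, d)), rng.uniform(0.5, 2.0, d))
lower2 = lower_layer(rng.standard_normal((d, d)), rng.uniform(0.5, 2.0, d))
upper1 = upper_layer(rng.uniform(0.5, 2.0, d), rng.standard_normal((d, d)))

lower_product = lower1 @ lower2  # stays of the form [[I, 0], [A, B]]
mixed_product = lower1 @ upper1  # generically of neither form
```

The product of two lower layers is again a lower layer, while a lower layer times an upper layer is generically neither, which is why powers of the union are the natural object to study.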
5.1 Upper Bound
The main result of this section is the following:
Theorem 6 (Restatement of Theorem 2).
There exists an absolute constant such that for any , .
In other words, any linear map with positive determinant (“orientation-preserving”) can be implemented using a bounded number of linear affine coupling layers. In group-theoretic language, this says that generates and furthermore the diameter of the corresponding (uncountably infinite) Cayley graph is upper bounded by a constant independent of . The proof relies on the following two structural results. The first one is about representing permutation matrices, up to sign, using a constant number of linear affine coupling layers:
For any permutation matrix , there exists with for all .
The second one proves how to represent using a constant number of linear affine couplings matrices with special eigenvalue structure:
Let be an arbitrary invertible matrix with distinct real eigenvalues and be a lower triangular matrix with the same eigenvalues as . Then .
Given these Lemmas, the strategy to prove Theorem 6 will proceed as follows. Every matrix has an LUP factorization (Horn and Johnson, 2012) into a lower-triangular, upper-triangular, and permutation matrix. Lemma 6 takes care of the permutation part, so what remains is building an arbitrary lower/upper triangular matrix; because the eigenvalues of lower-triangular matrices are explicit, a careful argument allows us to reduce this to Lemma 7.
We proceed to implement this strategy.
We start with Lemma 6. As a preliminary, we recall a folklore result about permutations. Let denote the symmetric group on elements, i.e. the set of permutations of equipped with the multiplication operation of composition. Recall that the order of a permutation is the smallest positive integer such that is the identity permutation.
For any permutation , there exists of order at most 2 such that
This result is folklore. We include a proof of it for completeness. [Footnote 5: This proof, given by HH Rugh, and some other ways to prove this result can be found at https://math.stackexchange.com/questions/1871783/every-permutation-is-a-product-of-two-permutations-of-order-2.]
First, recall that every permutation has a unique decomposition as a product of disjoint cycles. Therefore if we show the result for a single cycle, so for every , then taking and proves the desired result since and are both of order at most .
It remains to prove the result for a single cycle of length . The cases are trivial. Without loss of generality, we assume . Let , , and otherwise . Let , , , and otherwise . It’s easy to check from the definition that both of these elements are order at most .
We now claim . To see this, we consider the following cases:
For all other , .
In all cases we see that which proves the result. ∎
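The folklore result can be made concrete with a short program. The sketch below uses one well-known construction, which is not necessarily the one from the proof above: writing a cycle as `c[0] -> c[1] -> ... -> c[k-1] -> c[0]`, the maps `c[j] -> c[-j mod k]` and `c[j] -> c[(1 - j) mod k]` are both reflections, and applying the first and then the second advances every element by one step along the cycle. The function names are hypothetical.

```python
import random

def cycles_of(perm):
    """Disjoint cycles of a permutation given as a list with perm[i] = image of i."""
    seen, cycles = set(), []
    for start in range(len(perm)):
        if start in seen:
            continue
        cycle, current = [], start
        while current not in seen:
            seen.add(current)
            cycle.append(current)
            current = perm[current]
        cycles.append(cycle)
    return cycles

def two_involutions(perm):
    """Return (first, second), each of order at most 2, with perm = second after first."""
    n = len(perm)
    first, second = list(range(n)), list(range(n))
    for cycle in cycles_of(perm):
        k = len(cycle)
        for j, element in enumerate(cycle):
            first[element] = cycle[(-j) % k]      # reflection j -> -j
            second[element] = cycle[(1 - j) % k]  # reflection j -> 1 - j
    return first, second

random.seed(0)
perm = list(range(10))
random.shuffle(perm)
first, second = two_involutions(perm)
composed = [second[first[i]] for i in range(len(perm))]
```

Because the cycles are disjoint, the per-cycle reflections combine into two global permutations of order at most 2, matching the reduction to a single cycle in the proof.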
We now prove Lemma 6.
Proof of Lemma 6.
It is easy to see that swapping two elements is possible in a fashion that doesn’t affect other dimensions by the following ‘signed swap’ procedure requiring matrices:
Next, let and . There will be an equal number of elements which in a particular permutation will be permuted from to as those which will be permuted from to . We can choose an arbitrary bijection between the two sets of elements and perform these ‘signed swaps’ in parallel as they are disjoint, using a total of matrices. The result of this will be the elements partitioned into and that would need to be mapped there.
We can also (up to sign) transpose elements within a given set or via the following computation using our previous ‘signed swaps’ that requires one ‘storage component’ in the other set:
So, up to sign, we can in matrices compute any transposition in L or R separately. In fact, since any permutation can be represented as the product of two order-2 permutations (Lemma 8) and any order-2 permutation is a disjoint union of transpositions, we can implement an order-2 permutation up to sign using matrices and an arbitrary permutation up to sign using matrices.
In total, we used matrices to move elements to the correct side and matrices to move them to their correct position, for a total of matrices. ∎
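The signed swap at the heart of this proof can be written out explicitly. The sketch below assumes the two halves of the input have equal size `d` and uses the classical factorization of a quarter-turn rotation into three shears, each of which is a linear affine coupling layer with identity diagonal block; the function name `signed_swap` and the pair encoding are assumptions for the example.

```python
import numpy as np

def signed_swap(d, pairs):
    """Three coupling layers sending x_i -> y_j and y_j -> -x_i for each (i, j)."""
    e = np.zeros((d, d))
    for i, j in pairs:  # pairs must be disjoint so the swaps run in parallel
        e[i, j] = 1.0
    eye, zero = np.eye(d), np.zeros((d, d))
    upper = np.block([[eye, e], [zero, eye]])     # (x, y) -> (x + E y, y)
    lower = np.block([[eye, zero], [-e.T, eye]])  # (x, y) -> (x, y - E^T x)
    return upper @ lower @ upper

d = 3
swap = signed_swap(d, [(0, 2)])
v = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # x = (1, 2, 3), y = (4, 5, 6)
swapped = swap @ v
flipped = swap @ swap @ v  # two signed swaps (six layers) give a sign flip
```

Coordinates not named in `pairs` are left untouched, which is what allows several disjoint swaps to share the same three layers.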
Next, we proceed to prove Lemma 7. We will need the following simple lemma:
Suppose is a matrix with distinct real eigenvalues. Then there exists an invertible matrix such that where is a diagonal matrix containing the eigenvalues of .
Observe that for every eigenvalue of , the matrix has rank by definition, hence there exists a corresponding real eigenvector by taking a nonzero solution of the real linear system . Taking to be the linear operator which maps to standard basis vector , and proves the result. ∎
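A quick numerical illustration of this diagonalization fact, using a lower triangular matrix whose eigenvalues are its (distinct) diagonal entries. The specific matrix and variable names are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
# Strictly lower triangular random part plus a diagonal with distinct entries.
lower = np.tril(rng.standard_normal((d, d)), k=-1) + np.diag([1.0, 2.0, 3.0, 4.0])

eigenvalues, eigenvectors = np.linalg.eig(lower)
# Columns of `eigenvectors` are eigenvectors, so lower = V diag(eigenvalues) V^{-1}.
reconstructed = eigenvectors @ np.diag(eigenvalues) @ np.linalg.inv(eigenvectors)
sorted_eigenvalues = np.sort(eigenvalues.real)
```

Since the eigenvalues are real and distinct, the eigenvector matrix is real and invertible, so two matrices with the same such spectrum are conjugate to the same diagonal matrix and hence to each other.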
With this, we prove Lemma 7.
Proof of Lemma 7.
where is an invertible matrix that will be specified later. We can multiply out with these values giving
Here what remains is to guarantee . Since and have the same eigenvalues, by Lemma 9 there exist real matrices such that and for the same diagonal matrix , hence . Therefore taking gives the result. ∎
Finally, with all lemmas in place, we prove Theorem 6.
Proof of Theorem 6.
Recall that our goal is to show that for an absolute constant . To show this, we consider an arbitrary matrix , i.e. an arbitrary matrix with positive determinant, and show how to build it as a product of a bounded number of elements from . As is a square matrix, it admits an LUP decomposition (Horn and Johnson, 2012): i.e. a decomposition into the product of a lower triangular matrix , an upper triangular matrix , and a permutation matrix . This proof proceeds essentially by showing how to construct the , , and components in a constant number of our desired matrices.
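For readers who want to see the decomposition concretely, here is a minimal sketch of Gaussian elimination with partial pivoting. It returns factors with `p @ a == lower @ upper`, which is one common ordering convention and may differ from the ordering of the factors used in the text; the function name `lup` is hypothetical and the input is assumed to be invertible.

```python
import numpy as np

def lup(a):
    """Return (p, lower, upper) with p @ a = lower @ upper via partial pivoting."""
    upper = np.array(a, dtype=float)
    n = upper.shape[0]
    lower = np.eye(n)
    perm = np.arange(n)
    for k in range(n - 1):
        # Pick the largest entry in column k as the pivot.
        pivot = k + int(np.argmax(np.abs(upper[k:, k])))
        if pivot != k:
            upper[[k, pivot], :] = upper[[pivot, k], :]
            perm[[k, pivot]] = perm[[pivot, k]]
            lower[[k, pivot], :k] = lower[[pivot, k], :k]
        for i in range(k + 1, n):
            lower[i, k] = upper[i, k] / upper[k, k]
            upper[i, :] -= lower[i, k] * upper[k, :]
    p = np.eye(n)[perm]  # row i of p @ a is row perm[i] of a
    return p, lower, upper

rng = np.random.default_rng(0)
a = rng.standard_normal((5, 5))
p, lower, upper = lup(a)
```

The lower factor has unit diagonal and the upper factor carries the pivots, so the sign of the determinant is split between the upper factor and the permutation.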
By Lemma 6, we can produce a matrix with which agrees with up to the sign of its entries using linear affine coupling layers. Then is a matrix which admits an LU decomposition: for example, given that we know has an LUP decomposition, we can flip the sign of some entries of to get an LU decomposition of . Furthermore, since , we can choose an LU decomposition such that (for any decomposition which does not satisfy this, the two matrices and must both have negative determinant as . In this case, we can flip the sign of column in and row in to make the two matrices positive determinant).
It remains to show how to construct a lower/upper triangular matrix with positive determinant out of our matrices. We show how to build such a lower triangular matrix as building is symmetrical.
At this point we have a matrix , where and are lower triangular. We can use column elimination to eliminate the bottom-left block:
where and are lower-triangular.
Recall from (1) that we can perform the signed swap operation in of taking for using 3 affine coupling blocks. Therefore using 6 affine coupling blocks we can perform a sign flip map . Note that because , the number of negative entries in the first diagonal entries has the same parity as the number of negative entries in the second diagonal entries. Therefore, using these sign flips in parallel, we can ensure using affine coupling layers that the first and last diagonal entries of have the same number of negative elements. Now that the number of negative entries match, we can apply two diagonal rescalings to ensure that:
The first diagonal entries of the matrix are distinct.
The last diagonal entries contain the multiplicative inverses of the first entries up to reordering. Here we use that the number of negative elements in the first and last elements are the same, which we ensured earlier.
At this point, we can apply Lemma 7 to construct this matrix from four of our desired matrices. Since this shows we can build and , this shows we can build any matrix with positive determinant.
Now, let’s count the matrices we needed to accomplish this. In order to construct , we needed 21 matrices. To construct , we needed 1 for column elimination, 6 for the sign flip, 2 for the rescaling of diagonal elements, and 4 for Lemma 7 giving a total of 13. The same count of 13 applies to the other triangular factor by symmetry. So, we need 21 + 13 + 13 = 47 total matrices to construct the whole decomposition. ∎
5.2 Lower Bound
We proceed to the lower bound. Note, a simple parameter counting argument shows that for sufficiently large , at least four affine coupling layers are needed to implement an arbitrary linear map (each affine coupling layer has only parameters whereas is a Lie group of dimension ). Perhaps surprisingly, it turns out that four affine coupling layers do not suffice to construct an arbitrary linear map. We prove this in the following Theorem.
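The parameter counting heuristic can be spelled out as a small calculation. The sketch assumes the ambient dimension is `2 * d` with two blocks of size `d`, so that one linear affine coupling layer has a dense `d x d` block plus `d` diagonal entries, while the group of invertible matrices has dimension `(2 * d) ** 2`; these counts are a reading of the block structure described earlier, and the function name is hypothetical.

```python
import math

def min_layers_by_counting(d):
    """Smallest number of layers whose parameters could cover all (2d) x (2d) matrices."""
    params_per_layer = d * d + d    # dense block plus diagonal block
    group_dimension = (2 * d) ** 2  # dimension of the group of invertible matrices
    return math.ceil(group_dimension / params_per_layer)

small_case = min_layers_by_counting(3)
large_case = min_layers_by_counting(100)
```

Under these assumptions three layers fall short once `d` exceeds 3, so counting alone suggests four layers might suffice, which is the bound the theorem below shows is not attainable.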
Theorem 7 (Restatement of Theorem 3).
For , is a proper subset of . In other words, there exists a matrix which is not in .
The key observation is that matrices in satisfy a strong algebraic invariant which is not true of arbitrary matrices. This invariant can be expressed in terms of the Schur complement (Zhang, 2006):
Suppose that is an invertible matrix and suppose there exist matrices , and diagonal matrices , such that
Then the Schur complement is similar to : more precisely, if then .
We explicitly solve the block matrix equations. Multiplying out the LHS gives
Starting with the top-left block gives that
Next, the top-right block gives that
The bottom-left and (2) gives
Taking the bottom-right block and substituting (6) gives
Substituting (5) gives
Substituting (7) gives
Here we notice that is similar to , where we get to choose values along the diagonal of . In particular, this means that and must have the same eigenvalues. ∎
With this, we can prove Theorem 7.
Proof of Theorem 7.
First, note that any element in can be written in either the form or for and . We construct an explicit matrix which cannot be written in either form.
Consider an invertible matrix of the form
and observe that the Schur complement is simply . Therefore Lemma 10 says that this matrix can only be in if is similar to for some diagonal matrix . Now consider the case where is a permutation matrix encoding the permutation and is a diagonal matrix with nonzero entries. Then is a diagonal matrix as well, hence has real eigenvalues, while the eigenvalues of are the -roots of unity. (The latter claim follows because for any with , the vector is an eigenvector of with eigenvalue ). Since similar matrices must have the same eigenvalues, it is impossible that and are similar.
The remaining possibility we must consider is that this matrix is in . In this case by applying the symmetrical version of Lemma 10 (which follows by swapping the first and last coordinates), we see that and must be similar. Since and , this is impossible. ∎
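The spectral obstruction used in this proof is easy to verify numerically: a cyclic permutation matrix has the roots of unity as eigenvalues, whereas any real diagonal matrix has real eigenvalues, so the two can never be similar once the cycle has length at least 3. The size and the diagonal entries below are arbitrary choices for the example.

```python
import numpy as np

d = 5
# Permutation matrix of a single d-cycle: it sends basis vector e_i to e_{i+1 mod d}.
cyclic = np.roll(np.eye(d), 1, axis=0)
cyclic_eigenvalues = np.linalg.eigvals(cyclic)

diagonal = np.diag([0.5, -1.0, 2.0, 3.0, -4.0])
diagonal_eigenvalues = np.linalg.eigvals(diagonal)

largest_imaginary_part = np.abs(cyclic_eigenvalues.imag).max()
```

Every eigenvalue of the cyclic matrix satisfies `lam ** d == 1`, and some of them are genuinely complex, which rules out similarity to a real diagonal matrix.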
We remark that the argument in the proof is actually fairly general; it can be shown, for example, that for a random choice of and from the Ginibre ensemble, that cannot typically be expressed in . So there are significant restrictions on what matrices can be expressed even with four affine coupling layers.
Remark 8 (Connection to Universal Approximation).
As mentioned earlier, this lower bound shows that the map computed by general 4-layer affine coupling networks is quite restricted in its local behavior (its Jacobian cannot be arbitrary). This implies that smooth 4-layer affine coupling networks, where smooth means the Hessian (of each coordinate of the output) is bounded in spectral norm, cannot be universal function approximators as they cannot even approximate some linear maps. In contrast, if we allow the computed function to be very jagged then three layers are universal (see Theorem 4).
5.3 Experimental results
We also verify the bounds from this section. At least on randomly chosen matrices, the correct bound is closer to the lower bound. Precisely, we generate (synthetic) training data of the form , where for a fixed square matrix with random standard Gaussian entries and train a linear affine coupling network with layers by minimizing the loss . We are training this “supervised” regression loss instead of the standard unsupervised likelihood loss to minimize algorithmic (training) effects as the theorems are focusing on the representational aspects. The results for are shown in Figure 2, and more details are in Section B.
Finally, we also note that there are some surprisingly simple functions that cannot be exactly implemented by a finite affine coupling network. For instance, an entrywise function (i.e. an entrywise nonlinearity) cannot be exactly represented by any finite affine coupling network, regardless of the nonlinearity used. Details of this are in Appendix C.