Normalizing flows (NF, Rezende and Mohamed, 2015) have gained popularity in recent years because of their unique ability to model complex data distributions while allowing for both sampling and exact density computation. This family of deep neural networks combines a base distribution with a series of invertible transformations while keeping track of the change of density that is caused by each transformation.
Probabilistic graphical models (PGMs) are well-established mathematical tools that combine graph and probability theory to ease the manipulation of joint distributions. They are commonly used to visualize and reason about the set of independencies in probabilistic models. Among PGMs, Bayesian networks (BN, Pearl, 2011) offer a good balance between readability and modeling capacity. Reading the independencies stated by a BN is simple and can be performed graphically with the d-separation algorithm (Geiger et al., 1990).
In this note, we revisit NFs as Bayesian networks. We first briefly review the mathematical grounds of these two worlds. Then, for the first time in the literature, we show that the modeling assumptions behind coupling and autoregressive transformations can be perfectly expressed by distinct classes of BNs. From this insight, we show that stacking multiple transformation layers relaxes independencies and entangles the model distribution. Then, we show that a fundamental change of regime emerges when the NF architecture includes 3 transformation steps or more. Finally, we prove the non-universality of affine normalizing flows.
2.1 Normalizing flows
A normalizing flow is defined as a sequence of invertible transformation steps $g_k: \mathbb{R}^d \to \mathbb{R}^d$ ($k = 1, \dots, K$) that are composed together to create an expressive invertible mapping $g := g_1 \circ \dots \circ g_K: \mathbb{R}^d \to \mathbb{R}^d$. This mapping can be used to perform density estimation, using $g(\cdot; \theta)$ to map a sample $x \in \mathbb{R}^d$ to a latent vector $z \in \mathbb{R}^d$ equipped with a density $p_z(z)$. The transformation $g$ implicitly defines a density $p(x; \theta)$ as given by the change of variables formula,
$$p(x; \theta) = p_z(g(x; \theta)) \left| \det J_{g(x; \theta)} \right|,$$
where $J_{g(x; \theta)}$ is the Jacobian of $g(x; \theta)$ with respect to $x$. The resulting model is trained by maximizing the likelihood of the data $\{x^1, \dots, x^N\}$. NFs can also be used for data generation tasks while keeping track of the density of the generated samples, such as to improve the latent distribution in variational auto-encoders (Rezende and Mohamed, 2015). In the rest of this paper, we will not distinguish between $g$ and $g_k$ when the discussion is focused on only one of these steps $g_k$.
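As a minimal sketch of the change of variables formula, consider a one-dimensional map from data to latent space, $g(x) = (x - \mu)/\sigma$, with a standard normal latent density. Both choices, as well as the function names below, are assumptions made only for illustration. The density obtained from the formula should then match the closed-form density of $\mathcal{N}(\mu, \sigma^2)$.

```python
import math


def standard_normal_pdf(z):
    """Latent density p_z, assumed standard normal for this sketch."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def g(x, mu, sigma):
    """Hypothetical invertible map from data space to latent space."""
    return (x - mu) / sigma


def model_density(x, mu, sigma):
    """p(x) = p_z(g(x)) * |det J_g(x)|; in 1D the Jacobian is dg/dx = 1 / sigma."""
    abs_det_jacobian = abs(1.0 / sigma)
    return standard_normal_pdf(g(x, mu, sigma)) * abs_det_jacobian


def normal_pdf(x, mu, sigma):
    """Closed-form N(mu, sigma^2) density, used only as a reference."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


density = model_density(0.3, mu=1.0, sigma=2.0)
reference = normal_pdf(0.3, mu=1.0, sigma=2.0)
print(density, reference)
```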
In general, steps $g$ can take any form as long as they define a bijective map. Here, we focus on a sub-class of normalizing flows for which these steps can be mathematically described as
$$g(x) = \begin{bmatrix} g^1(x_1; c^1(x)) & \dots & g^d(x_d; c^d(x)) \end{bmatrix},$$
where the $c^i$ are denoted as the conditioners and constrain the structure of the Jacobian of $g$. The functions $g^i$, partially parameterized by their conditioner, must be invertible with respect to their input variable $x_i$. These are usually defined as affine or strictly monotonic functions, with the latter being the most general class of invertible scalar continuous functions. In this note, we mainly discuss affine normalizers that can be expressed as $g(x; m, s) = x \exp(s) + m$ where $m \in \mathbb{R}$ and $s \in \mathbb{R}$ are computed by the conditioner.
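An affine normalizer can be sketched in a few lines, assuming the parameterization $g(x; m, s) = x \exp(s) + m$ with an offset $m$ and a log-scale $s$ supplied by the conditioner. The function names are hypothetical. The exponential keeps the slope strictly positive, so the map is invertible for any $s$ and the log-derivative is simply $s$.

```python
import math


def affine_forward(x, m, s):
    """Affine normalizer g(x; m, s) = x * exp(s) + m (assumed parameterization)."""
    return x * math.exp(s) + m


def affine_inverse(y, m, s):
    """Inverse of the affine normalizer with respect to its input."""
    return (y - m) * math.exp(-s)


def affine_log_abs_derivative(s):
    """log |dg/dx| = s, the contribution of this component to log |det J|."""
    return s


x = 0.7
y = affine_forward(x, m=-0.4, s=0.3)
x_back = affine_inverse(y, m=-0.4, s=0.3)

# Central finite difference of g with respect to x, to compare with exp(s).
eps = 1e-6
slope = (affine_forward(x + eps, -0.4, 0.3) - affine_forward(x - eps, -0.4, 0.3)) / (2 * eps)
print(x_back, math.log(slope), affine_log_abs_derivative(0.3))
```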
2.2 Bayesian networks
Bayesian networks allow for a compact and natural representation of probability distributions by exploiting conditional independence. More precisely, a BN is a directed acyclic graph (DAG) whose structure encodes the conditional independencies through the concept of d-separation (Geiger et al., 1990). Equivalently, its skeleton supports an efficient factorization of the joint distribution.
A BN is able to model a distribution $p_x$ if and only if it is an I-map with respect to $p_x$. That is, iff the set of independencies stated by the BN structure is a subset of the independencies that hold for $p_x$. Equivalently, a BN is a valid representation of a random vector $x$ iff its density $p_x(x)$ can be factorized by the BN structure as
$$p_x(x) = \prod_{i=1}^{d} p(x_i \mid \mathcal{P}_i), \tag{1}$$
where $\mathcal{P}_i = \{j : A_{i,j} = 1\}$ denotes the set of parents of the vertex $i$ and $A \in \{0, 1\}^{d \times d}$ is the adjacency matrix of the BN. As an example, Fig. (a) is a valid BN for any distribution over $x$ because it does not state any independence, leading to a factorization that results in the chain rule.
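The link between an adjacency matrix and the factorization it supports can be illustrated with a short sketch. It assumes the convention that entry $(i, j)$ equal to 1 means vertex $j$ is a parent of vertex $i$, and it uses a hypothetical 4-dimensional example. The helper names are not from the paper.

```python
def parent_sets(adjacency):
    """Return P_i = {j : A[i][j] == 1} for every vertex i (0-indexed)."""
    return [{j for j, a in enumerate(row) if a == 1} for row in adjacency]


def is_acyclic(adjacency):
    """Check the DAG property by repeatedly removing vertices whose parents are all removed."""
    parents = parent_sets(adjacency)
    removed = set()
    progress = True
    while progress:
        progress = False
        for i, p in enumerate(parents):
            if i not in removed and p <= removed:
                removed.add(i)
                progress = True
    return len(removed) == len(adjacency)


def factorization(adjacency):
    """Write the factorization of the joint density implied by the BN structure."""
    terms = []
    for i, parents in enumerate(parent_sets(adjacency)):
        given = ", ".join(f"x{j + 1}" for j in sorted(parents))
        terms.append(f"p(x{i + 1} | {given})" if given else f"p(x{i + 1})")
    return " ".join(terms)


d = 4  # hypothetical dimension
# A BN that states no independence: every earlier vertex is a parent of every later one.
chain_rule_bn = [[1 if j < i else 0 for j in range(d)] for i in range(d)]
print(is_acyclic(chain_rule_bn))
print(factorization(chain_rule_bn))
```

For the fully connected structure the printed factorization is the chain rule, which is why such a BN is an I-map of any distribution.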
3 Normalizing flows as Bayesian networks
3.1 Autoregressive conditioners
Autoregressive conditioners can be expressed as
$$c^i(x) = h^i(x_{1:i-1}),$$
where $h^i: \mathbb{R}^{i-1} \to \mathbb{R}^l$ are functions of the first $i-1$ components of $x$ and whose output size $l$ depends on architectural choices. These conditioners constrain the Jacobian of $g$ to be lower triangular, making the computation of its determinant $\mathcal{O}(d)$. The multivariate density $p(x; \theta)$ induced by $g(x; \theta)$ and $p_z(z)$ can be expressed as a product of $d$ univariate conditional densities,
$$p(x; \theta) = p(x_1; \theta) \prod_{i=2}^{d} p(x_i \mid x_{1:i-1}; \theta). \tag{2}$$
When $p_z(z)$ is a factored distribution $p_z(z) = \prod_{i=1}^{d} p(z_i)$, we identify that each component $z_i$ coupled with the corresponding function $g^i$ encodes the conditional $p(x_i \mid x_{1:i-1}; \theta)$. An explicit connection between BNs and autoregressive conditioners can be made if we define $\mathcal{P}_i = \{x_1, \dots, x_{i-1}\}$ and compare (2) with (1). Therefore, and as illustrated in Fig. (a), autoregressive conditioners can be seen as a way to model the conditional factors of a BN that does not state any independence.
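The triangular structure induced by an autoregressive conditioner can be checked numerically. The sketch below assumes affine normalizers of the form $x_i \exp(s_i) + m_i$ and an arbitrary, hypothetical conditioner function. It compares a finite-difference Jacobian with the cheap log-determinant obtained by summing the log-scales.

```python
import math


def conditioner(prefix):
    """Hypothetical h^i: maps the components before index i to (m_i, s_i)."""
    total = sum(prefix)
    return math.sin(total), 0.5 * math.tanh(total)


def autoregressive_step(x):
    """z_i = x_i * exp(s_i) + m_i, where (m_i, s_i) depends only on x_1 ... x_{i-1}."""
    z, log_det = [], 0.0
    for i, x_i in enumerate(x):
        m, s = conditioner(x[:i])
        z.append(x_i * math.exp(s) + m)
        log_det += s  # the Jacobian is triangular, so log|det J| is the sum of the s_i
    return z, log_det


def numerical_jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian, jac[i][j] = d f_i / d x_j."""
    d = len(x)
    jac = [[0.0] * d for _ in range(d)]
    for j in range(d):
        plus, minus = list(x), list(x)
        plus[j] += eps
        minus[j] -= eps
        f_plus, f_minus = f(plus), f(minus)
        for i in range(d):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return jac


x = [0.2, -1.0, 0.7]
z, log_det = autoregressive_step(x)
jac = numerical_jacobian(lambda v: autoregressive_step(v)[0], x)

# Largest entry strictly above the diagonal, and log|det| read from the diagonal.
upper = max(abs(jac[i][j]) for i in range(3) for j in range(3) if j > i)
diag_log_det = sum(math.log(abs(jac[i][i])) for i in range(3))
print(upper, diag_log_det, log_det)
```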
3.2 Coupling conditioners
Coupling conditioners (Dinh et al., 2017) are another popular type of conditioners used in normalizing flows. The conditioners made from coupling layers are defined as
$$c^i(x) = \begin{cases} \underline{h}^i & \text{if } i \le k, \\ h^i(x_{1:k}) & \text{if } i > k, \end{cases}$$
where the underlined symbols $\underline{h}^i \in \mathbb{R}^l$ denote constant values. As for autoregressive conditioners, the Jacobian of $g$ made of coupling layers is lower triangular. Assuming a factored latent distribution, the density associated with these conditioners can be written as follows:
$$p(x; \theta) = \prod_{i=1}^{k} p(x_i) \prod_{i=k+1}^{d} p(x_i \mid x_{1:k}). \tag{3}$$
The factors $p(x_i \mid x_{1:k})$ define valid 1D conditional probability distributions because they can be seen as 1D changes of variables between $z_i$ and $x_i$. This factorization can be graphically expressed by a BN as shown in Fig. (b). In addition, we can see Fig. (b) as the marginal BN of Fig. (c), which fully describes the stochastic process modeled by an NF that is made of a single transformation step and a coupling conditioner. In contrast to autoregressive conditioners, coupling layers are not by themselves universal density approximators, even when associated with very expressive normalizers $g^i$. Indeed, d-separation reveals independencies stated by this class of BN, such as the conditional independence between each pair in $x_{k+1:d}$ knowing $x_{1:k}$. These independence statements do not hold in general.
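A single coupling step with affine normalizers can be sketched as follows. The split index, the constants used for the first components and the conditioner function are all hypothetical choices. The point of the sketch is that the inverse is as cheap as the forward pass, because the conditioning components are recovered first and then reused.

```python
import math

K = 2  # hypothetical split index: the first K components use constant parameters


def coupling_conditioner(i, x):
    """Return (m_i, s_i): constants for i < K, functions of x[:K] otherwise (0-indexed)."""
    if i < K:
        return 0.1 * (i + 1), -0.2
    total = sum(x[:K])
    return math.cos(total + i), 0.3 * math.tanh(total)


def coupling_forward(x):
    """Map data x to latent z and accumulate log|det J| as the sum of the log-scales."""
    z, log_det = [], 0.0
    for i, x_i in enumerate(x):
        m, s = coupling_conditioner(i, x)
        z.append(x_i * math.exp(s) + m)
        log_det += s
    return z, log_det


def coupling_inverse(z):
    """Recover x from z: the first K components first, then the conditioned ones."""
    x = [0.0] * len(z)
    for i in range(len(z)):
        # For i >= K the conditioner only reads x[:K], which is already filled in.
        m, s = coupling_conditioner(i, x)
        x[i] = (z[i] - m) * math.exp(-s)
    return x


x = [0.5, -0.3, 1.2, 0.8]
z, log_det = coupling_forward(x)
x_back = coupling_inverse(z)
round_trip_error = max(abs(a - b) for a, b in zip(x, x_back))
print(round_trip_error, log_det)
```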
3.3 Stacking transformation steps
In practice, the invertible transformations discussed above are often stacked together in order to increase the representation capacity of the flow, with the common practice of permuting the vector components between two transformation steps. The structural benefits of this stacking strategy can be explained from the perspective of the underlying BN.
First, a BN that explicitly includes latent variables is faithful as long as the sub-graph made only of those latent nodes is an I-map with respect to their distribution. Normalizing flows composed of multiple transformation layers can therefore be viewed as single transformation flows whose latent distribution is itself recursively modeled by a normalizing flow. As an example, Fig. 5 illustrates an NF made of two transformation steps with coupling conditioners. It can be observed that the latent vector is itself a normalizing flow whose distribution can be factored out by a class of BN.
Second, from the BN associated with an NF, we observe that additional layers relax the independence assumptions defined by its conditioners. The distribution modeled by the flow gets more entangled at each additional layer. For example, Fig. 5 shows that for coupling layers, the additional steps relax the strong conditional independencies between components of $x$ that are stated by the single transformation NF of Fig. (c). Indeed, we can observe from the figure that such components have common ancestors among the latent variables, whereas they are clearly assumed independent in Fig. (b).
In general, we note that edges between two nodes in a BN do not model dependence; only the absence of edges models independence. However, because some of the relationships between nodes are bijective, these nodes are strictly dependent on each other. We represent these relationships with undirected edges in the BN, as can be seen in Fig. 5.
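How stacking entangles the model can be traced by propagating, for each component, the set of base latent variables it descends from. The sketch below is only an illustration under stated assumptions: coupling steps with a hypothetical split index `k`, a hypothetical dimension `d = 4`, and a reversal of the component order between steps as the permutation.

```python
def coupling_ancestors(ancestors, k):
    """Latent-ancestor sets after one coupling step.

    The first k outputs depend only on their own input component. The remaining
    outputs depend on their own input and, through the conditioner, on the first k.
    """
    head = [set(a) for a in ancestors[:k]]
    shared = set().union(*head) if head else set()
    tail = [set(a) | shared for a in ancestors[k:]]
    return head + tail


def stacked_ancestors(d, k, n_steps):
    """Latent-ancestor sets of the output components of an n_steps coupling flow."""
    ancestors = [{i} for i in range(d)]  # each base latent variable is its own ancestor
    for _ in range(n_steps):
        ancestors = coupling_ancestors(ancestors, k)
        ancestors = ancestors[::-1]  # assumed permutation: reverse the component order
    return ancestors


for n_steps in (1, 2, 3):
    sizes = [len(a) for a in stacked_ancestors(d=4, k=2, n_steps=n_steps)]
    print(n_steps, sizes)
```

In this example, some components still descend from a single latent variable after one step, while every component descends from all four latent variables once three steps are stacked.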
4 Affine normalizing flows unlock their capacity with 3 transformation steps
We now show how some of the limitations of affine normalizers can be relaxed by stacking multiple transformation steps. We also discuss why some limitations cannot be relaxed even with a large number of transformation steps. We intentionally put aside monotonic normalizers because they have already been proven to lead to universal density approximators when the conditioner is autoregressive (Huang et al., 2018). We focus our discussion on a multivariate normal with an identity covariance matrix as base distribution $p_z(z)$.
We first observe from Fig. 4 that in an NF with a single transformation step at least one component of $x$ is a function of only one latent variable. If the normalizer is affine and the base distribution is normal, then this necessarily implies that the marginal distribution of this component is normal as well, which will very likely not lead to a good fit. We easily see that adding steps relaxes this constraint. A more interesting question is how much modeling capacity is gained with each additional affine step. Should we add steps to increase capacity, or should we increase the capacity of each step instead? We first discuss a simple 2-dimensional case, which has the advantage of unifying the discussion for autoregressive and coupling conditioners, and then extend it to a more general setting.
Affine NFs made of a single transformation step induce strong constraints on the form of the density. In particular, these models implicitly assume that the data distribution can be factorized as a product of conditional normal distributions. These assumptions are relaxed when accumulating steps in the NF. As an example, Fig. 6 shows the equivalent BN of a 2D NF composed of 3 steps. This flow is mathematically described with the following set of equations:
$$\begin{aligned}
z_1^1 &:= z_1, & z_2^1 &:= \exp(s_2^1(z_1))\, z_2 + m_2^1(z_1), \\
z_2^2 &:= z_2^1, & z_1^2 &:= \exp(s_1^2(z_2^1))\, z_1^1 + m_1^2(z_2^1), \\
x_1 &:= z_1^2, & x_2 &:= \exp(s_2^3(z_1^2))\, z_2^2 + m_2^3(z_1^2),
\end{aligned}$$
where the superscript indexes the transformation step and the subscript indexes the component.
From these equations, we see that after one step the latent variables $z_1^1$ and $z_2^1$ are respectively normal and conditionally normal. This is relaxed with the second step, where the latent variable $z_1^2$ is a non-linear function of two random variables distributed normally (by assumption on the distribution of $z_1$ and $z_2$). However, $z_2^2$ is a stochastic affine transformation of a normal random variable. In addition, we observe that the expression of $z_1^2$ is strictly more expressive than the expression of $z_2^1$. Finally, $x_1$ and $x_2$ are non-linear functions of both latent variables $z_1$ and $z_2$. Assuming that the functions $s_i^j$ and $m_i^j$ are universal approximators, we argue that the stochastic process that generates $x_1$ and the one that generates $x_2$ are as expressive as each other. Indeed, by making the functions arbitrarily complex, the transformation for $x_1$ could be made arbitrarily close to the transformation for $x_2$ and vice versa. This is true because both transformations can be seen as an affine transformation of a normal random variable whose scaling and offset factors are non-linear, arbitrarily expressive transformations of all the latent variables. Because of this equilibrium between the two expressions, additional steps do not improve the modeling capacity of the flow. The same observations can be made empirically, as illustrated in Fig. 6 for 2-dimensional toy problems. A clear leap of capacity occurs from 2-step to 3-step NFs, while having 4 steps or more does not result in any noticeable improvement when $s_i^j$ and $m_i^j$ already have enough capacity.
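The alternating structure of the 2D flow can be simulated directly. The sketch below assumes a generative direction from the latent pair to the data pair, with the second coordinate transformed at odd steps and the first at even steps. The scaling and offset functions `s_fn` and `m_fn` are hypothetical stand-ins for the learned functions, shared across steps only to keep the example short. The excess kurtosis of the first output coordinate serves as a crude indicator of departure from normality.

```python
import math
import random


def s_fn(u):
    """Hypothetical bounded log-scale function."""
    return math.tanh(u)


def m_fn(u):
    """Hypothetical offset function."""
    return u * u - 1.0


def flow_2d(z1, z2, n_steps):
    """Apply n_steps affine steps, alternating which coordinate is transformed."""
    a, b = z1, z2
    for step in range(n_steps):
        if step % 2 == 0:
            b = math.exp(s_fn(a)) * b + m_fn(a)  # second coordinate conditioned on the first
        else:
            a = math.exp(s_fn(b)) * a + m_fn(b)  # first coordinate conditioned on the second
    return a, b


def excess_kurtosis(values):
    """Sample excess kurtosis; it is zero in expectation for a normal distribution."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    fourth = sum((v - mean) ** 4 for v in values) / n
    return fourth / (var * var) - 3.0


rng = random.Random(0)
latents = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(20000)]
kurtosis_x1 = {
    n: excess_kurtosis([flow_2d(z1, z2, n)[0] for z1, z2 in latents])
    for n in (1, 2, 3)
}
print(kurtosis_x1)
```

With a single step the first coordinate is left untouched and therefore stays exactly normal, so its excess kurtosis is zero up to sampling noise. Only from the second step onward can it depart from normality.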
For $d > 2$, autoregressive and coupling conditioners do not correspond to the same set of equations or BN. However, if the order of the vector is reversed between two transformation steps, the discussion generalizes to any value of $d$ for both conditioners. Indeed, in both cases each component of the intermediate latent vectors can be seen as having a set of conditioning variables and a set of independent variables. At each successive step the indices of the non-conditioning variables are exchanged with the conditioning ones, and thus any component of the output vector can be expressed either as a component of the vector form of $x_1$ or of $x_2$.
5 Affine normalizing flows are not universal density approximators
We argue that affine normalizers do not lead to universal density approximators in general, even for an infinite number of steps. In the following, we assume again that the latent variables are distributed according to a normal distribution with a unit covariance matrix.
To prove the non-universality of affine normalizing flows, one only needs to provide a counter-example. Let us consider the simple setup in which one component $x_i$ of the random vector $x$ is independent from all the other components. Let us also assume that $x_i$ is distributed under a non-normal distribution. We can then consider two cases. In the first case, $x_i$ has only one component of the latent vector as an ancestor. This implies that the equivalent BN would be as in Fig. 8, hence that $x_i$ is a linear function of this ancestor and is therefore normally distributed. In the second case, $x_i$ has several components of the latent vector as ancestors. However, this second case would imply that at least one undirected edge is removed from the original BN considered in Section 3.3. This cannot happen since it would break the bijectivity of the flow.
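The first case can be made explicit with a short derivation. It is sketched here under the assumption that the component, written $x_i$, is reached from its single latent ancestor $z_j$ through $K$ affine steps of the form $u \mapsto u \exp(s_k) + m_k$ whose parameters $(s_k, m_k)$ are constants, since any dependence on other components would add latent ancestors. The symbols $a$ and $b$ are introduced only for this illustration.

```latex
\begin{aligned}
x_i &= e^{s_K}\bigl(\cdots e^{s_2}\bigl(e^{s_1} z_j + m_1\bigr) + m_2 \cdots\bigr) + m_K = a\, z_j + b, \\
a &= \exp\Bigl(\sum_{k=1}^{K} s_k\Bigr), \qquad
b = \sum_{k=1}^{K} m_k \exp\Bigl(\sum_{l=k+1}^{K} s_l\Bigr), \\
z_j \sim \mathcal{N}(0, 1) \;&\Longrightarrow\; x_i \sim \mathcal{N}(b, a^2).
\end{aligned}
```

A composition of affine maps with constant coefficients is again affine, so no number of such steps can turn the normal marginal of $z_j$ into a non-normal one.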
Besides proving the non-universality of affine NFs, this discussion provides the important insight that affine normalizers introduce dependence in the model of the distribution whenever they must transform some latent variables non-linearly. In some sense, this means that the additional disorder required to model this non-normal component comes at the cost of some loss in entropy caused by mutual information between the components of the random vector.
In this preliminary work, we have revisited normalizing flows from the perspective of Bayesian networks. We have shown that stacking multiple transformations in a normalizing flow relaxes independence assumptions and entangles the model distribution. Then, we have shown that affine normalizing flows benefit from having at least 3 transformation layers. Finally, we demonstrated that they remain non-universal density approximators regardless of their depth.
We hope these results will give practitioners more intuition in the design of normalizing flows. We also believe that this work may lead to further research. First, unifying Bayesian networks and normalizing flows could be pushed one step further with conditioners that are specifically designed to model Bayesian networks. Second, the study could be extended to other types of normalizing flows such as non-autoregressive monotonic flows. Finally, we believe this study may spark research at the intersection of structural equation modeling, causal networks and normalizing flows.
The authors would like to thank Matthia Sabatelli, Johann Brehmer and Louis Wehenkel for proofreading the manuscript. Antoine Wehenkel is a research fellow of the F.R.S.-FNRS (Belgium) and acknowledges its financial support. Gilles Louppe is the recipient of the ULiège - NRB Chair on Big data and is thankful for the support of NRB.
- L. Dinh, J. Sohl-Dickstein, and S. Bengio (2017) Density estimation using Real NVP. In International Conference on Learning Representations.
- D. Geiger, T. Verma, and J. Pearl (1990) d-separation: from theorems to algorithms. Machine Intelligence and Pattern Recognition, Vol. 10, pp. 139–148.
- C.-W. Huang, D. Krueger, A. Lacoste, and A. Courville (2018) Neural autoregressive flows. In International Conference on Machine Learning, pp. 2083–2092.
- J. Pearl (2011) Bayesian networks.
- D. J. Rezende and S. Mohamed (2015) Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530–1538.