A major goal of statistics and machine learning has been to model a probability distribution given samples drawn from that distribution. This is an example of unsupervised learning and is sometimes called generative modelling. Its importance derives from the relative abundance of unlabelled data compared to labelled data. Applications include density estimation, outlier detection, prior construction, and dataset summarization.
Many methods for generative modeling have been proposed. Direct analytic approaches approximate a data distribution with a fixed family of distributions; these include traditional parameter estimation in statistics. Variational approaches and expectation maximization introduce latent variables which aim to explain observed variables and provide additional flexibility but can increase the complexity of learning and inference. Graphical models[graphModels]
aim to explicitly model the conditional dependence between random variables. Recently, generative neural approaches have been proposed including generative adversarial networks (GAN)[Goodfellow2014GenerativeAN] and variational auto-encoders (VAE)[Kingma2014AutoEncodingVB].
GANs and VAEs have demonstrated impressive performance results on challenging tasks such as learning distributions of natural images. However, several issues limit their application in practice. Neither allows for exact evaluation of the probability density of new points.111Approximations of the likelihood for GANs and VAEs are possible but can be difficult and computationally expensive. Furthermore, training can be challenging due to a variety of phenomena including mode collapse, posterior collapse, vanishing gradients and training instability [Bowman2015GeneratingSF; Salimans2016ImprovedTF].
A Normalizing Flow (NF) is family of generative models which produces tractable distributions where both sampling and density evaluation can be efficient and exact. Successful applications include: image generation [Kingma2018; ho2019flow], video generation [videoflow], audio generation [flowwavenet; waveglow; esling2019], graph generation [Madhawa2019]
, and reinforcement learning[Ward2019RL].
There are several survey papers for VAEs [Kingma2019IntroVAE] and GANs [Creswell2018GenerativeAN; Wang2017GenerativeAN], but to the best of our knowledge, no equivalent survey articles for Normalizing Flows. This article aims to provide a coherent and comprehensive review of the literature around the construction and use of Normalizing Flows for distribution learning. Our goals are to 1) provide context and explanation to enable a reader to become familiar with the basics, 2) review current the state-of-the-art literature, and 3) identify open questions and promising future directions.
In Section 2, we introduce Normalizing Flows and describe how they are trained. In Section 3 we review existing constructions for Normalizing Flows. In Section 4 we describe datasets used for testing Normalizing Flows and illustrate the performance of existing methods. Finally, in Section 5 we discuss open problems and possible future research directions.
Normalizing Flows were popularised by Rezende2015 in the context of variational inference and by Dinh2015 for density estimation. However, the framework was previously defined in Tabak2010 and Tabak2013 and explored for density estimation in a preprint by Rippel2013.
A Normalizing Flow is a transformation of a simple probability distribution (e.g., a standard normal) into a more complex distribution by a sequence of invertible and differentiable mappings. The density of a sample can be evaluated by transforming it back to the original simple distribution and then computing the product of i) the density of the inverse-transformed sample under this distribution and ii) the associated change in volume induced by the sequence of inverse transformations. The change in volume is the product of the absolute values of the determinants of the Jacobians for each transformation, as required by the change of variables formula.
The result of this approach is a mechanism to construct new families of distributions by choosing an initial density and then chaining together some number of parameterized, invertible and differentiable transformations. The new density can be sampled from (by sampling from the initial density and applying the transformations) and the density at a sample (i.e., the likelihood) can be computed as described above.
be a random variable with a known and tractable probability density function. Let be an invertible function and . Then using the change of variables formula, one can compute the probability density function of the random variable :
where is the inverse of , is the Jacobian of and is the Jacobian of . This new density function is called a pushforward of the density by the function and denoted by (Figure 1).
In the context of generative models, the above function (a generator) “pushes forward” the base density (sometimes referred to as the “noise”) to a more complex density. This movement from base density to final complex density is the generative direction. Note that to generate a data point , one can sample from the base distribution, and then apply the generator: .
The inverse function moves in the opposite, normalizing direction: from a complex and irregular data distribution towards a simpler, more regular or “normal” form, specifically the base measure . This view is what gives rise to the name “normalizing flows” as is “normalizing” the data distribution. This term is doubly accurate if the base measure
is chosen as a Normal distribution as it often is in practice.
Intuitively, if the transformation can be arbitrarily complex, one can generate any distribution from a base distribution 222There should be some reasonable condition on the base density, like has non-zero support everywhere. as the pushforward . This has been proven formally [transportation; triangular2005; triangular2008]. See Section 3.5.1 for more details.
Constructing arbitrarily complex non-linear invertible functions (bijections) can be difficult. By the term Normalizing Flows people mean bijections which are convenient to compute, invert, and calculate the determinant of their Jacobian. One approach to this is to note that the composition of invertible functions is itself invertible and the determinant of its Jacobian has a specific form. In particular, let be a set of bijective functions and define to be the composition of the functions. Then it can be shown that is also bijective, with inverse given by
and the determinant of the Jacobian is
where is the Jacobian of , is the value of the -th intermediate flow and . Thus, given a set of nonlinear bijective functions, the above result allows them to be composed to construct successively more complex functions.
2.1.1 More formal construction
In this section we explain normalizing flows from more formal perspective. Readers unfamiliar with measure theory can safely skip it and move to Section 2.2. First, let us recall the general definition of a pushforward.
If , are measurable spaces, is a measurable mapping between them, and is a measure on , then one can define a measure on (called the pushforward measure and denoted by ) by the formula
This notion gives a general formulation of a generative modelling. Data can be understood as a sample from a measured “data” space , which we want to learn. To do that one can introduce a simpler measured space and find a function , such that . This function can be interpreted as a “generator”, and as a latent space. This view puts generative models in the context of transportation theory [transportation].
For the rest of this survey we will assume that , all sigma-algebras are Borel, and all measures are absolutely continuous with respect to Lebesgue measure (i.e., ).
A function is called a diffeomorphism, if it is bijective, differentiable, and its inverse is differentiable as well.
It is well known that the pushforward of an absolutely continuous measure by a diffeomorphism is also absolutely continuous and its density function is calculated by Equation (2.1). Note that this more general approach is important if one wants to study generative models on non-Euclidean spaces (see Section 5.2 for discussion).
It is common in the normalizing flows literature to simply refer to diffeomorphisms as “bijections” even though this is formally incorrect. In general, it is not necessary that is everywhere differentiable; rather it is sufficient that it is differentiable only almost everywhere with respect to the Lebesgue measure on . This allows, for instance, piecewise continuous functions to be used in the construction of .
2.2.1 Density estimation and sampling
The natural and most obvious use of normalizing flows is to perform density estimation. For simplicity assume that only a single flow,
, is used and it is parameterized by the vector. Further, assume that the base measure, is given and is parameterized by the vector . Given a set of data observed from some complex distribution, , we can then perform likelihood-based estimation of the parameters . The data likelihood in this case simply becomes
where the first term is the log likelihood of the sample under the base measure and the second term, sometimes called the log-determinant or volume correction, accounts for the change of volume induced by the transformation of the normalizing flows (see Equation (2.1)). During training, the parameters of the flow () and of the base distribution are adjusted to maximize the log-likelihood.
Note that evaluating the likelihood of a normalizing flow requires computing the inverse mapping, (normalizing direction), as well as the log determinant. This occurs at test time, but will occur repeatedly during likelihood-based training. Thus, efficiency of the inverse mapping and log determinant will determine the efficiency of the training.
However, sampling from the distribution defined by the normalizing flow requires only evaluating the forward mapping of the flow (generative direction). Thus sampling performance is largely determined by the cost of the forward mapping. Even though a flow must be theoretically invertible, computation of the inverse may be difficult in practice; hence, for density estimation it is common to model a flow in the normalizing direction (i.e., ).
If one wants to have both efficient density estimation and sampling, Oord2017ParallelWF proposed an approach called Probability Density Distillation which trains the flow with log likelihood maximization, and then fixes it and uses it as a teacher network. The teacher network is used to train a student network by minimizing the divergence between the distribution of sampled points and the modeled data distribution by the teacher network (Figure 2).
Finally, while maximum likelihood estimation is often effective (and statistically efficient under certain conditions) other forms of estimation can and have been used with normalizing flows. In particular, adversarial losses can be used with normalizing flow models (e.g., in Flow-GAN [Grover2018]).
2.2.2 Variational Inference
Consider a latent variable model . Let be an observed variable and the latent variable. The posterior distribution is used when estimating the parameters of the model, but its computation is usually intractable in practice. One solution is to use variational inference and introduce the approximate posterior which should be as close to the real posterior as possible. This can be achieved by minimizing the KL divergence , which is equivalent to maximizing the evidence lower bound . The latter optimization can be done with gradient descent; however for that one needs to compute gradients of the form , which is not straightforward.
As was observed in Rezende2015, one can reparametrize with normalizing flows. Assume for simplicity, that only a single flow with parameterization is used, and the base distribution does not depend on . Note that this is a generative direction. Then
and the gradient of the right hand side with respect to can be computed.
In this scenario evaluating the likelihood is only required at points which have been sampled. Here the sampling performance and evaluation the log determinant are the only relevant metrics and computing the inverse of the mapping may not be necessary. Indeed, the planar and radial flows introduced in Rezende2015 are not easily invertible (see Section 3.3).
As described in the previous section, Normalizing Flows should satisfy several conditions in order to be practical:
They should be invertible; for sampling we need to know their inverse.
They should be expressive enough to model real data distributions.
They should be computationally efficient: calculation of the Jacobian determinant, sampling from the base distribution, and application of the forward and (for sampling) inverse functions should be tractable.
In this section we will describe different flows that have been proposed in the literature.
3.1 Elementwise bijections
A basic form of bijective non-linearity can be constructed given any bijective scalar function. That is, let be a (scalar valued) bijection. Then, if ,
is also a bijection whose inverse simply requires computing and whose Jacobian is the product of the derivatives of
. This can be generalized by allowing each element to have its own distinct bijective function; this might be useful, for instance, if we wish to only modify portions of our parameter vector. In deep learning terminology,Maas2013; He2015] can be used instead.
Elementwise operations on their own are inherently insufficient as they cannot express any form of correlation between dimensions; more bijections are necessary if dependencies between dimensions are to be captured.
3.2 Linear Flows
Linear mappings take the form
where and are parameters. If
is an invertible matrix, the function is invertible. Linear flows on their own are not very expressive. For example, consider a Gaussian base distribution:. After transformation by a linear flow, the distribution will remain Gaussian: if , then its distribution will be . More generally, a linear flow of a distribution from the exponential family will stay in the exponential family. However, linear flows are an important building block as they form the basis of affine coupling flows (Section 3.6.1).
Note that the determinant of the Jacobian is simply which can be computed directly in . The inverse map can also be computed in . Hence, using general form of the linear flow can become expensive for large . By restricting the form of we can avoid these practical problems at the expense of expressive power. For example, if is diagonal with nonzero diagonal entries, then its inverse can be computed in linear time and its determinant is simply the product of the diagonal entries. However, such a transformation is extremely limited in expressivity; it is an elementwise transformation. Nonetheless, this can still be useful for representing normalization layers [Dinh2017] which have become a ubiquitous part of modern CNNs [Ioffe2015].
A somewhat more expressive form of linear transformation is the triangular matrix. Its determinant is the product of its diagonal and it is non-singular so long as its diagonal entries are non-zero. Inversion is relatively inexpensive requiring a single pass of back-substitution costingoperations [linalgbook] versus
for a general matrix. This can be viewed as a linear analog of autoregressive models considered in Section3.5.
A slight generalization was done in Tomczakconvex. Consider triangular matrices with ones on the diagonal and a -dimensional probability vector . Then one can define a linear transformation . The determinant of this bijection is one, however finding the inverse has complexity, if some of the matrices are upper- and some are lower-triangular.
3.2.2 Permutation and Orthogonal
The expressiveness of triangular transformations is sensitive to the ordering of dimensions. Reordering the dimensions can be done easily and has an absolute determinant of . Different strategies have been tried, including reversing and a fixed random permutation [Dinh2017; Kingma2018]. However, in these cases the permutations are set initially and remain fixed during training which might not be optimal for a given problem.
An alternative to permutations is the use of orthogonal transformations which is a generalization of the permutation operation because the set of permutation matrices is a subset of the set of orthogonal matrices. The inverse and absolute determinant of an orthogonal matrices are both trivial which make them efficient to use. Tomczak2016 used orthogonal matrices parameterized by the Householder transform
. The idea is based on the observation that any orthogonal matrix can be written as a product of reflections. To parameterize a reflection matrixin one just needs to fix a nonzero vector , and then define . To model an orthogonal matrix one can consider a linear flow of reflections.333 Another natural choice of parameterization is to use the tangent space of (i.e.
, the set of skew-symmetric matrices) to parameterizeand the exponential map as the transformation. This has yet to be explored in the literature but is an interesting alternative as it is directly differentiable and may have a more natural geometry for optimization.
Instead of limiting the form of , Kingma2018 proposed to parameterize it using the factorization. Specifically,
where is lower triangular with ones on the diagonal, is upper triangular with non-zero diagonal entries, and is a permutation matrix. The determinant is the product of the diagonal entries of which can be computed in . The inverse of the function can be computed using two passes of backward substitution in . However, one issue is that is a discrete permutation and cannot be easily optimized as a parameter of the model. To avoid this, is randomly generated initially and then fixed. Hoogeboom2019 noted that fixing the permutation matrix limits the flexibility of the transformation, and proposed using the decomposition instead. They proposed modeling the orthogonal matrix for that decomposition with Householder transforms (Section 3.2.2)
Another form of linear transformation is a convolution. A general convolution is easy to compute but it can be difficult to efficiently calculate the determinant or ensure invertibility. Kingma2018 restricted themselves to “” convolutions for flows, and Zheng2018 used 1D convolutions (ConvFlow). Hoogeboom2019 provided a more general solution for modelling convolutions by stacking together autoregressive convolutions.
3.3 Planar and Radial Flows
Rezende2015 introduced planar and radial flows. They are relatively simple, but their inverses aren’t easily computed. These flows are not widely used in practice, yet we review them here for completeness.
3.3.1 Planar flows
Planar flows expand and contract the distribution along certain specific directions and take the form
where and are parameters and is a smooth non-linearity. The Jacobian determinant for this transformation is
where the last equality comes from the application of the matrix determinant lemma. This can be computed in time. The inversion of this flow isn’t possible in closed form and may not exist for certain choices of and certain parameter settings. This is explored in more detail in the Appendix of Rezende2015.
As was speculated in Kingma2016, the term
can be interpreted as a multilayer perceptron with a bottleneck hidden layer with a single unit. This bottleneck means that one needs to stack many planar flows to get high expressivity.[Hasenclever2017] and [sylvester] introduced Sylvester flows to resolve this problem. Here, the transformation is given by the similar formula as before:
where and are matrices, and is an elementwise smooth nonlinearity, where
is a hyperparameter to choose and which can be interpreted as the dimension of a hidden layer. In this case the Jacobian determinant is:
where the last equality is Sylvester’s determinant identity (which gives these flows their name). This can be computationally efficient if is small. Some sufficient conditions for the invertibility of Sylvester transformations are discussed in Hasenclever2017 and sylvester.
3.3.2 Radial flows
Radial flows instead modify the distribution around a specific point so that
where is the point around which the distribution is distorted, and are parameters, . As for planar flows, the Jacobian determinant can be computed relatively efficiently. The inverse of radial flows cannot be given in closed form, however, it exists under suitable constraints on the parameters. Details can be found in Rezende2015.
In the next two sections we will describe coupling and auto-regresssive flows. These are currently the two most widely used flow architectures. First, we will present them in the general form, and then in Section 3.6 we will give specific examples.
3.4 Coupling Flows
Dinh2015 introduced a coupling method to enable highly expressive transformations for flows. Consider a (disjoint) partition of the input into two subspaces: and a bijection , parametrized by . Then one can define a function by the formula:
where is any arbitrary function which only uses as input, which is called a conditioner. A bijection is called a coupling layer, and the resulting function is called a coupling flow. The coupling flow is invertible if and only if is invertible and its inverse is given by:
The Jacobian of is a block triangular matrix where the diagonal blocks are
and the identity matrix respectively. Hence the determinant of the Jacobian of the coupling flow is simply the determinant of.
The majority of coupling layers used in the literature are in the decomposed form (i.e., they apply to element-wise):
where each is a scalar bijection. In this case a coupling flow is a triangular transformation (i.e., has a triangular Jacobian matrix). We give a list of coupling layers in Section 3.6.
The power of a coupling flow resides, largely, in the ability of
to be arbitrarily complex. In practice it is usually modelled as a neural network, for instance,Kingma2018 used a shallow ResNet [He2016] architecture. Alternatively, the conditioner can be constant (i.e., not depend on at all). This provides a form of factorization which allows for the construction of a “multi-scale flow” which gradually introduces dimensions to the distribution in the generative direction. In the normalizing direction, the dimension of the layer reduces by half after each iteration step, such that the final output contains the most of semantic information about the input. This was proposed by Dinh2017 to reduce the computational costs of transforming high dimensional distributions as well as to capture the multi-scale structure inherent in natural images. The coupling architecture and its use for a multi-scale structure is demonstrated in Figure 3.
A question remains of how to partition . This is often done by splitting the dimensions in half [Dinh2015], potentially after a random permutation. However, more structured partitioning has also been explored and is common practice, particularly when modelling images. For instance, Dinh2017 used “masked” flows take alternating pixels or blocks of channels in the case of an image in non-volume preserving flows (NVP). In place of permutation Kingma2018 used convolution (Glow). For the partition for the multi-scale flow in the normalizing direction, Das2019 suggested selecting features at which the Jacobian of the flow has higher values for the propagated part.
3.5 Autoregressive Flows
Kingma2016 used autoregressive models as a form of normalizing flow. One can consider these as nonlinear versions of multiplication by a triangular matrix (Section 3.2.1).
Consider an ordering on the basis of , so that the order of entries in the variable is fixed. Denote by the tuple . Let be a bijection parameterized by . Then an autoregressive model is a function , which outputs each entry of conditioned on the previous entries of the input (with respect to the fixed ordering). Formally,
where for each we choose arbitrary functions mapping to the set of all parameters, and is a constant. The functions use as input are called conditioners.
The Jacobian matrix of the autoregressive transformation is triangular (each output only depends on , and so the determinant is just a product of its diagonal entries:
Given the inverse of , the inverse of can be found with recursion. For that and for any , . However this computation is inherently sequential which makes it difficult to implement efficiently on modern hardware as it cannot be parallelized.
Note that the functional form for the autoregressive model is very similar to that for the coupling flow. In both cases a bijection is used, which has as an input one part of the space and which is parameterized conditioned on the other part. We call this bijection a coupling layer in both cases. Note that Huang2018NeuralFlows used the name “transformer” (which has nothing to do with transformers in NLP).
Alternatively, Kingma2016 introduced the “inverse autoregressive flow” (IAF), which outputs each entry of conditioned the previous entries of (with respect to the fixed ordering). Formally,
One can see that the functional form of the inverse autoregressive flow is the same as the form of the inverse of the flow in Equation (17), hence the name. Computation of the IAF is sequential and expensive, but the inverse of IAF (which is a direct autoregressive flow) can be computed relatively efficiently (Figure 4).
Germain2015 (in MADE) suggested to use masks for efficient autoregressive density estimation. This idea was used by Papamakarios2017 for masked autoregressive flows (MAF): masking makes it possible to compute all the entries of the direct flow (Equation (17)) in one pass.
As mentioned before (Section 2.2.1), papers typically model flows in the “normalizing” direction (from data to the base density) to enable efficient evaluation of the log-likelihood for training with density estimation. In this context one can think of IAF as a flow in the generative direction: from base density to data. As noted in Papamakarios2017, one should use IAF if fast sampling is needed (e.g., for stochastic variational inference), and use MAF if fast density estimation is desirable. However, these two methods are theoretically equivalent, i.e., they can learn the same distribution [Papamakarios2017]. For both efficient density estimation and sampling one can also use Probability Density Distillation [Oord2017ParallelWF] (Section 2.2.1).
For several autoregressive flows the universality property has been proven [Huang2018NeuralFlows; Priyank2019]. Informally, universality means that the flow can learn any target density to any required precision given sufficient capacity and data. We will provide a formal proof of the universality theorem following the clean exposition of Priyank2019. This section requires some knowledge of measure theory and functional analysis and can be safely skipped without jeopardizing comprehension of the rest of the document.
First, recall that a mapping is called triangular if is a function of for each . Such triangular map is called increasing if is an increasing function of for each .
[[triangular2005], Lemma 2.1] If and are absolutely continuous Borel probability measures on , then there exists an increasing triangular transformation , such that . This transformation is unique up to null sets of . Similar result holds for measures on .
If is an absolutely continuous Borel probability measures on and is a sequence of maps which converges pointwise to a map , then a sequence of measures weakly converges to . See [Huang2018NeuralFlows], Lemma 4, for details. Basically the result follows from the dominated convergence theorem.
As a corollary, to claim that a class of autoregressive flows is universal, it is enough to demonstrate that a family of coupling layers used in the class is dense in the set of all monotone functions in the pointwise convergence topology. In particular, Huang2018NeuralFlows used neural monotone networks for coupling layers, and Priyank2019 used monotone polynomials. Using the theory outlined in this section, the universality could be proved for spline flows [splines1; splines2] with splines for coupling layers (it was not done in the papers). See the next section for further details on each architecture.
3.6 Coupling Layers
As described in the previous sections, coupling flows and autoregressive flows have a similar functional form and both have coupling layers as building blocks. Basically, a coupling layer is just a univariate bijective differentiable function , parameterized by . Note, that such a function is necessarily (strictly) monotone.444On the other hand, every continuous univariate function which is surjective and strictly monotone, must be bijective. In this section we describe the coupling layers used in the literature.
3.6.1 Affine coupling
Two simple forms of coupling layers were proposed by [Dinh2015] in NICE (nonlinear independent component estimation): the additive coupling layer:
and the affine coupling layer:
Affine coupling layers are used for coupling architectures in NICE [Dinh2015], RealNVP [Dinh2017], Glow [Kingma2018] and for autoregressive architectures in IAF [Kingma2016] and MAF [Papamakarios2017]. It is simple and facilitates efficient computation, however it is limited in its expressiveness and many such layers must be stacked.
3.6.2 Nonlinear squared flow
Ziegler2019 proposed an invertible non-linear squared transformation defined by:
Under some constraints on the parameters , the coupling layer is invertible and its inverse is analytically computable as a root of a cubic polynomial (with only one real root). Qualitative experiments demonstrated that such coupling layer facilitates learning multimodal distributions.
3.6.3 Expressive coupling with continuous mixture CDFs
ho2019flow proposed Flow++ model, where among several improvements a more expressive coupling layer was suggested. The layer is almost like a linear transformation, but one also applies a monotone function to . Formally,
where , and , is a positive integer. The function is the CDF of a mixture of logistics, postcomposed with inverse sigmoid:
Note, that the postcomposition with is used to ensure the right range for . Computation of the inverse of the coupling function is done numerically with the bisection algorithm. The derivative of this transformation with respect to is expressed in terms of PDF of logistic mixture (i.e., a linear combination of hyperbolic secant functions), and its computation is not expensive. An ablation study demonstrated that switching from an affine coupling layer to logistic mixture gives slightly better performance.
A spline is a piecewise-polynomial or a piecewise-rational function. One can model a coupling layer as a monotone spline. To define a spline one specifies points , called knots, through which the spline passes. To define a monotone spline one has a restriction: and .
Usually splines are considered on a compact interval.
Muller2018NeuralSampling introduced a linear spline for a coupling layer . They divided the domain into bins of equal length. Instead of defining increasing values for , the authors suggested to model a monotone piecewise-linear function as the integral of a positive piecewise-constant function. Formally:
where is a probability vector, (the bin that contains ), and (the relative position of in the bin ). This map is clearly invertible, if all . The derivative is:
Muller2018NeuralSampling also consider a monotone quadratic spline on the unit interval for a coupling layer. Analogously to the linear case, they propose to model it as an integral of a positive piecewise-linear function. A monotone quadratic spline is invertible; finding its inverse map requires solving a quadratic equation.
splines1 proposed using monotone cubic splines for a coupling layer. Unlike [Muller2018NeuralSampling], they do not restrict the domain to the unit interval, but they model a coupling layer in the form: , where is a monotone cubic spline and is a sigmoid.
For construction of a monotone cubic spline the authors use Steffen’s method: one needs to specify knots of the spline and also boundary derivatives and . All of these quantities are modelled as the output of a neural network.
Computation of the derivative of a cubic spline is not hard as it is a piecewise-quadratic polynomial. To invert a cubic spline, one needs to find a root of a cubic polynomial555Note that a monotone cubic polynomial has only one real root., which can be done either analytically or numerically. However, the procedure is numerically unstable if not treated carefully. Because Steffen’s method is differentiable, the flow can be trained by gradient descent. However, the problem with this model, as was observed in [splines2], is numerical difficulty with the sigmoid which saturates for input values far from zero.
Rational quadratic splines
splines2 model a coupling layer as a monotone rational-quadratic splines on the compact interval , and outside of the interval as the identity function. To define such a spline (with the method of Gregory and Delbourgo), one needs to specify knots, where boundary points are and ), and also specify derivatives at the inner points: . All these parameters are modelled as the output of a neural network.
The derivative of such layer with respect to is simply a quotient derivative. The function can be inverted by solving a quadratic equation. The authors suggested to use this layer either with a coupling architecture RQ-NSF(C) or with auto-regressive architecture RQ-NSF(AR).
3.6.5 Neural autoregressive flow
A different approach was taken by Huang2018NeuralFlows in their Neural Autoregressive Flows (NAF). They modeled with a deep neural network. The authors proved the following statement giving a sufficient condition for a neural network to be bijective: If is a multilayer percepton, such that all the weights are positive and all activation functions are strictly monotone, then is a strictly monotone function.
Two forms of neural networks for were suggested: the deep sigmoidal coupling layer (DSF) and deep dense sigmoidal coupling layer (DDSF); the resulting autoregressive flows are called NAF-DSF and NAF-DDSF
respectively. In both cases the coupling layers are MLPs with layers of sigmoid and logit units and nonnegative weights - seeHuang2018NeuralFlows for details. By Proposition 3.6.5, the resulting is a strictly monotonic function. The paper further proved that any strictly monotonic univariate function can be approximated with a DSF network666More formally, DSF networks are dense in the set of all strictly monotonic univariate continuous functions in the topology of compact convergence.. Hence, NAF-DSF is a universal flow (Section 3.5.1).
Wehenkel2019 noted that imposing positivity of weights on a flow makes training harder and requires more complex conditioners. To mitigate this, they introduced unconstrained monotonic neural networks (UMNN). The idea is simple: every strictly monotone function has a nonzero derivative. To model a strictly monotone function one has to model a strictly positive (or negative) function with a neural network and then integrate it using any numerical integration method. Authors demonstrated that UMNN requires less parameters than NAF to reach similar performance, and hence, it is more scalable for higher-dimensional datasets.
3.6.6 Sum-of-Squares polynomial flow
Priyank2019 proposed to model as a strictly increasing polynomial. They proved that any strictly monotonic univariate continuous function can be approximated with an increasing univariate polynomial777More formally, a set of increasing univariate polynomials is dense in the set of all strictly monotonic univariate continuous functions in the topology of compact convergence.. Hence, the resulting flow (called SOS - sum of squares polynomial flow) is a universal flow (see Section 3.5.1).
To construct all possible increasing single-variable polynomials the authors observed that the derivative of such a polynomial is a positive polynomial. Then a classical result from algebra was used: all positive single-variable polynomials are the sum of squares of polynomials. Hence, to get a coupling layer, one needs to integrate the sum of squares:
where and are hyperparameters (and, as noted in the paper, can be chosen to be ).
As Priyank2019 stated, SOS is easier to train than NAF, because there are no restrictions on the parameters (like positivity of weights). Further, if one takes L=0, SOS reduces to the affine coupling layer and as such can be viewed as a generalization of the basic affine coupling flow.
3.6.7 Piecewise-bijective coupling
dinh2019rad explore the idea that a coupling layer does not need to be bijective, but just piecewise-bijective. Formally, they consider a function and a covering of the domain into disjoint subsets: , such that the restriction of the function onto each subset is injective (see Figure 5).
dinh2019rad constructed a flow with a coupling architecture with piecewise-bijective coupling layer in the normalizing direction - from data distribution to (simpler) base distribution. This means that there is a covering of the data domain, and each subset of this covering is separately mapped to the base distribution. It follows that Equation 2.1 is not applicable for flows with piecewise-bijective coupling layers; each part of the base distribution receives contributions from each subset of the data domain. dinh2019rad suggested how to correct this by using a probabilistic mapping from the base domain to the data domain; for each part of the base domain there is some probability that the mapping is to each subset of the data domain.
More formally, let us denote the input with and the output with , and consider a lookup function , such that , if . One can define a new map , given by the rule , and a density on a target space . One can think of this as an unfolding of the non-injective map . In particular, for each point one can find its pre-image by sampling from , which is called a gating network. Pushing forward along this unfolded map is now well-defined and one gets the formula for the density :
The resulting flow (called RAD - for “real and discrete”) was demonstrated to efficiently learn distributions with discrete structures (multimodal distributions, distributions with holes, discrete symmetries etc).
3.7 Residual Flows
Residual networks [He2016] are simply compositions of functions of the form
(such a function is called a residual connection), where the residual block
is a feed-forward neural network of any kind (a CNN in the original paper).
The first attempts to build a reversible network architecture based on residual connections were made in RevNets [Gomez2017] and iRevNets [jacobsen2018irevnet]. Their main motivation was to save memory during training and to stabilize computation. The central idea are a variation of additive coupling layers: consider a disjoint partition of denoted by for the input and for the output, and define a function:
where and are residual blocks. This network is clearly invertible. However, the computation of the Jacobian determinant is not efficient.
A different point of view on reversible networks comes from a dynamical systems perspective via the observation that a residual connection is a discretization of a first order ordinary differential equation[haber2017; e2017]. Consider an ODE
where is a function which determines the dynamic (the evolution function), is a set of parameters and is a parameterization. The discretization of this equation (Euler’s method) is
and this is equivalent to a residual connection with a residual block .
Guided by the rich and elaborated theory of ODEs, one can look for improvements of residual blocks. For example, Chang2018; chang2018antisymmetricrnn proposed several architectures for stable networks, which are robust to the input noise and easier to train, and some of these networks were demonstrated to be invertible. However the Jacobian determinant of these network cannot be computed efficiently.
A sufficient condition for the invertibility was found in [Behrmann]. They proved the following statement: The residual connection (29) is invertible, if the Lipschitz constant of the residual block is . There is no analytically closed form of the inverse of the network, but it can be found numerically using fixed-point iterations (which, by the Banach theorem, converge if we assume ).
Controlling the Lipschitz constant of a neural network is not simple. The specific architecture proposed by Behrmann, called iResNet, uses a convolutional network for the residual block. It constrains the spectral radius of each convolutional layer in this network to be less than one.
The Jacobian determinant of the iResNet cannot be computed directly, so the authors propose to use a (biased) stochastic estimation . They considered the residual connection in Equation (29); its Jacobian is: . Because the function is assumed to be Lipschitz with , one has: . One can write the following chain of equalities:
where the second equality is a linear algebra identity: (see loc.cit for the proof). Then one considers a power series for the trace of the matrix logarithm:
By truncating this series one can calculate the approximation of log Jacobian determinant of . To efficiently compute each member of the truncated series, a stochastic estimation of the matrix trace (Hutchinson trick) was used: for a matrix , one has: , where , , and .
Truncating the power series in Equation (34) gives a biased estimation of the log Jacobian determinant of the flow (the bias depends on the truncation error). An unbiased stochastic estimator was proposed by iresnetunbiased in Residual flow. The authors used a Russian roulette estimator instead of truncation. Informally, every time one adds the next term to the partial sum while calculating the series
, one flips a coin to decide if the calculation should be continued or stopped at this step. During this process one needs to reweight terms for an unbiased estimation.888More formally, for a series one has an equality , where
is a random variable, which support is natural numbers (e.g, geometric distribution).
3.8 Infinitesimal (Continuous) Flows
In the previous section we discussed a discretization of ODEs. But what if one does not discretize and tries to learn the continuous dynamical system instead? Such flows are called infinitesimal or continuous). We consider two distinct types. The formulation of the first type comes from ordinary differential equations, and of the second type from stochastic differential equations.
3.8.1 ODE-based methods
First let us set up the mathematical formulation. One will need Definition 2.1.1 for this section.
Consider an ODE as in Equation (31), where . Under some assumptions on the evolution function (uniform Lipschitz continuity in and continuity in ), the solution exists (at least, locally) and, given an initial condition , is unique (this fact is known as Picard-Lindelöf-Lipschitz-Cauchy theorem [ArnoldODE]). We denote the solution at each time as .
At each time , is a diffeomorphism and it satisfies the group law: . Mathematically speaking, an ODE (31) defines a one-parameter group of diffeomorphisms on . Such group is called a smooth flow in the dynamical systems theory and differential geometry - see intro_to_dynamical_systems for details.
In particular, when , a diffeomorphism is called a time one map. The idea to model a normalizing flow as a time one map was presented by [chen2018neural] under the name Neural ODE (NODE). From the deep learning perspective this can be seen as an “infinitely deep” neural network with the input layer , the output layer and continuous weights . As mentioned, the invertibility of such networks naturally comes from the theorem of the existence and uniqueness of the solution of the ODE.
Training these types of networks for a supervised downstream task could be done by continuous analog of backpropagation, the so calledadjoint sensitivity method
, introduced by Pontryagin. It computes gradients of the loss function (whatever is more relevant for the task) by solving a second (augmented) ODE backwards in time.999If one has the loss , where is a solution of ODE (31), one can consider its sensitivity , also called adjoint. It is the analog of the derivative of the loss with respect to the hidden layer. In a standard neural network, one uses the backpropagation formula to find this derivative: . For “infinitely deep” neural network, this formula changes into an ODE: . For a more formal description, see chen2018neural. For density estimation learning, the change of variables formula is given by another ODE:
Note that one no longer needs determinant computation (compare with Equation (2.1)). To train the model and to sample from one needs to solve these ODEs, which can be done with any numerical ODE solver.
GrathwohlFFJORD:MODELS proposed a slightly modified version (called FFJORD), where they provided an unbiased stochastic estimation of the trace-term, using the Hutchinson estimator. This reduced the complexity even further.
An interesting side-effect of considering continuous ODE-type flows is that one needs much fewer parameters to achieve the same performance. For example, GrathwohlFFJORD:MODELS show that for the comparable performance on CIFAR10, FFJORD uses less than as many parameters as Glow.
Not all diffeomorphisms can be presented as a time one map of an ODE. This is a well-studied question in mathematics - see [intro_to_dynamical_systems; time_one_maps]. Without going too deep into the theory of dynamical systems, one can easily find a necessary condition. Note that is a (continuous) path in the space of diffeomorphisms from the identity map to the time one map . Because the Jacobian determinant of a diffeomorphism is nonzero, the sign of it can not change along the path. Hence a time one map must have a positive Jacobian determinant (such maps are called orientation preserving). For example, consider a map , such that . It is obviously a diffeomorphism, but it can not be presented as a time one map of any ODE, because it is not orientation preserving.
augmentedNeuralODE suggested how one can improve Neural ODE in order to be able to represent a broader class of diffeomorphisms as time one maps. Their model is called Augmented Neural ODE (ANODE). They simply added variables and considered a new ODE:
with initial condition . Note that addition of free variables, in particular, gives freedom to keep the Jacobian determinant positive. As was demonstrated in the experiments, ANODE is capable of learning distributions that the Neural ODE cannot, and the training time is shorter. Zhang2019ode proved that any diffeomorphism can be represented as a time one map of ANODE, hence this is a universal flow.
Salman2018DeepDN take a similar approach and presented Deep Diffeomorphic Flows, where the authors modelled a path in the space of diffeomorphisms as a solution of an ODE in that space. They also proposed geodesic regularisation in which longer paths are punished.
3.8.2 SDE-based methods (Langevin flows)
The idea of the Langevin flow is simple: assume that we have a complicated and irregular distribution on , then we can mix it well enough to produce a simple base distribution (either normal or uniform on some compact support). If the mixing obeys certain rules (stochastic equations), then this procedure can be invertible, and we obtain a bijection this way. This idea was explored in the literature by several authors (Fokker-Plank-machine; Welling2011BayesianLV; Rezende2015; Sohl-DicksteinW15; Salimans2015; Jankowiak2018; Chen2018CTF). We will provide a high-level overview of the method, including the necessary mathematical background.
A stochastic differential equation (SDE), also called Itô process, is an equation describing a change of a random variable in time. It is written as:
where , , - drift coefficient, - diffusion coefficient, and is -dimensional Brownian motion. Under some assumption on the functions and , the solution exists and is unique [Oksendal]. Informally, one can interpret the drift term as a deterministic dynamic and the diffusion term as providing all the stochasticity and mixing. In particular, the diffusion process satisfies this equation.
For a time-dependent random variable one can consider its density function , which is also time dependent. If is a solution of Equation (37
), its density function satisfies two partial differential equations describing the forward and backward evolution (in the positive and negative direction of time, respectively)[Oksendal]. The forward evolution is given by Fokker-Plank equation (or Kolmogorov’s forward equation):
where , with the initial condition . The reverse evolution is given by Kolmogorov’s backward equation:
where , and the initial condition is .
As observed by Fokker-Plank-machine, asymptotically the Langevin flow can learn any distribution, if one picks the drift and diffusion coefficients appropriately. However this result is not very practical, because one needs to know the (unnormalized) density function of the data distribution for that.
One can see that if the diffusion coefficient is zero, the Itô process reduces to an ODE, and the Fokker-Plank equation becomes a Liouville’s equation. The connection between this and Equation (35), which describes the log density changes, was observed in chen2018neural. It is also equivalent to the form of the transport equation considered in Jankowiak2018 for stochastic optimization.
Sohl-DicksteinW15 and Salimans2015 suggested using MCMC methods to model the diffusion. They consider discrete time . For each time shot , is a random variable at each time shot, is a data point, and a base point. The forward transition probability
is taken to be either normal or binomial distribution (whose parameters are trainable parameters of the model). As was noticed in[Sohl-DicksteinW15], it follows from Kolmogorov’s backward equation that the backward transition must have the same functional form as the forward transition (i.e., be either normal or binomial, respectively, with trainable parameters). Denote: , the data distribution, and , the base distribution. Applying the backward transition to the base distribution, one obtains a new density , which one wants to match with . Hence, the optimization objective is the log likelihood . This is intractable in practice, but one can find a lower bound on (similarly to in variational inference).
Several papers have worked explicitly with the SDE (37) [Chen2018CTF; NSDE1; NSDE2; Liutkus2018SlicedWassersteinFN]. In particular, Chen2018CTF discuss how to create an interesting posterior for variational inference. For that they sampled a latent variable conditioned on the input , and then evolved with SDE. In practice this evolution can be computed by discretization. Also, by analogy to Neural ODEs, Neural Stochastic Differential Equations were proposed [NSDE1; NSDE2]. Here coefficients of the equation are modelled as neural networks, and black box SDE solvers are used for the inference.101010To train Neural SDE one needs an analog of backpropagation. NSDE2 used Kunita’s theory of stochastic flows.
Note, that even though Langevin flows manifest nice mathematical properties, they have not found practical applications. In particular, none of the methods has been tested on baseline datasets for flows.
4 Datasets and performance
In this section we discuss datasets commonly used for training and testing normalizing flows. We also provide comparison tables of the results as they were presented in the corresponding papers. The list of the flows for which we post the performance results is given in Table 1.
|Coupling flows, 3.4||affine, 3.6.1||realNVP|
|mixture CDF, 3.6.3||Flow++|
|splines, 3.6.4||quadratic (C)|
|Autoregressive flows, 3.5||affine||MAF|
|neural network, 3.6.5||NAF|
|Residual flows, 3.7||iResNet|
Note that for preliminary qualitative analysis of flow performance people usually use synthetic 2-dimensional datasets. Such datasets (grid Gaussian mixture, ring Gaussian mixture, two moons, two circles, spiral, many moons, checkboard etc.) exhibit multimodality, strong curvature and other peculiarities which an expressive flow should be able to model.
4.1 Tabular datasets
We describe datasets as they were preprossessed in Papamakarios2017111111See https://github.com/gpapamak/maf. The relatively small size makes these datasets reasonable for a first test of unconditional density estimation models.
A collection of datasets from University of California, Irvine, machine learning repository [uci].
POWER: a collection of electric power consumption measurements in one house over 47 months.
GAS: a collection of measurements from chemical sensors in several gas mixtures.
HEPMASS: measurements from high-energy physics experiments aiming to detect particles with unknown mass.
MINIBOONE: measurements from MiniBooNE experiment for observing neutrino oscillations.
Berkeley segmentation dataset [bsds] contains segmentations of natural images. Papamakarios2017 extracted random monochrome patches from it.
All these datasets were cleaned and dequantized by adding uniform noise, so they can be considered samples from an absolutely continuous distribution. In Table 2 we provide a summary of these datasets.
|Examples in train set||M||K||K||K||M|
In Table 3 we present performance of flows on tabular datasets. For the details of the experiments see the following papers: realNVP and MAF [Papamakarios2017], Glow and FFJORD [GrathwohlFFJORD:MODELS], NAF [Huang2018NeuralFlows], UMNN [Wehenkel2019], SOS [Priyank2019], Quadratic Spline flow and RQ-NSF [splines2], Cubic Spline Flow [splines1].
|Quadratic Spline (C)||0.64||12.80||-15.35||-9.35||157.65|
|Quadratic Spline (AR)||0.66||12.91||-14.67||-9.72||157.42|
One can see that universal flows (NAF, SOS, Splines) demonstrate relatively better performance.
4.2 Image datasets
These are datasets of increasing complexity. They are used preprocessed as in Dinh2017 by dequantizing with uniform noise (except for Flow++). The most common image datasets used for evaluating normalizing flows are MNIST and CIFAR-10, however other (Omniglot, LSUN, CIFAR-100, CelebA HQ, SVHN, etc) could be used too. In Table 4 we give the summary of the image datasets.
|Examples in train set||50K||K||M||M|
Performance of flows on image datasets is given in Table 5
for unconditional density estimation. For the details of the experiments see the following papers: realNVP for CIFAR-10 and ImageNet[Dinh2017], Glow for CIFAR-10 and ImageNet [Kingma2018], realNVP and Glow for MNIST, MAF and FFJORD [GrathwohlFFJORD:MODELS], SOS [Priyank2019], RQ-NSF [splines2], UMNN [Wehenkel2019], iResNet [Behrmann], Residual Flow [iresnetunbiased], Flow++ [ho2019flow].
One can see that Flow++ is the overall winner. Besides using expressive coupling layers (Section 3.6.3), ho2019flow changed the architecture of the conditioner to self-attention and used variational dequantization instead of uniform. Ablation study in the paper shows that the latter twist gave the most significant improvement. It would be interesting to test other flow models with variational dequantization.
5 Discussion and open problems
5.1 Inductive biases
Role of the base measure
The base measure of a normalizing flow is (generally) assumed to be a simple distribution such as a uniform or Gaussian. However as noted in several places this does not need to be the case. Any distribution for which we can easily sample from and compute the log probability density function is possible and the parameters of this distribution can also be learned during training.
One important thing to note is that theoretically a base measure shouldn’t matter: any distribution for which a CDF can be computed, can be simulated by applying the inverse CDF to draw from the uniform distribution. However in practice if structure is provided in the base measure, the resulting transformations may need to be: 1) less complex and 2) easier to learn. In other words, the choice of base measure can be viewed as a form of prior or inductive bias on the distribution and may be useful in its own right. For example, a trade-off between the complexity of the generative transformation and the form of base measure was explored in[tails] in the context of modelling tail behaviour.
Form of diffeomorphisms
The majority of the flows explored are triangular flows (either coupling or autoregressive architecture). Residual networks and Neural ODEs are also being actively investigated and applied. The natural question to ask: are there other ways to model diffeomorphisms which are efficient for computation? What inductive bias does the architecture impose? A related question concerns the best way to model conditional normalizing flows when one needs to learn conditional probability distribution.Trippe2018ConditionalDE suggested to use different flows for each condition, but this approach doesn’t leverage weight sharing, and as a result may be inefficient in terms of memory and data usage. Atanov2019 proposed to use affine coupling layers where the parameters depend on the condition. Conditional distributions are useful in particular for time series modelling, where one needs to find . This context was considered in [videoflow].
The majority of the existing flows are trained by minimization of KL-divergence between the source and the target distributions (or, equivalently, with log-likelihood maximization). However, other losses could be used which would put normalizing flows in a broader context of optimal transport theory [transportation]. Interesting work has been done in this direction including the Flow-GAN [Grover2018] paper mentioned earlier and the minimization of the Wasserstein distance as suggested by [Arjovsky2017WassersteinG; Tolstikhin2018WassersteinA].
5.2 Generalisation to non-Euclidean spaces
Flows on manifolds.
A mathematically interesting problem is warping probability distributions on a manifold. This idea was explored in [lie2019], where the analog of the Gaussian reparameterization trick was given for a Lie group. In particular, for a -dimensional Lie group , one considers its Lie algebra and chooses an isomorphism . Then for a base distribution with the density on , one can push it forward on via the exponential map. Then, the analog of shifting by an element is by left multiplication. Additionally applying a normalizing flow to a base measure before pushing it to helps to construct multimodal distributions on . Another approach was proposed in [hyperbolic2019], where the Gaussian reparameterization trick in a hyperbolic space was investigated. In general, the question is how best to define a normalizing flow on a differentiable manifold.
Discrete latent variables were used in dinh2019rad
as an auxiliary tool to pushforward continuous random variables along piecewise-bijective maps (see Section3.6.7
). However, how one can define normalizing flows if one or both of our distributions are discrete? Answers to this could be useful for many applications including natural language processing, graph generation and others.
Several approaches have been suggested. Tran2019 models bijective functions on a finite set and show that, in this case, the change of variables is given by the formula: , i.e., with no Jacobian term (compare with Definition 2.1.1). For backpropagation of functions with discrete variables they use the straight-through gradient estimator [Bengio2013StraightThrough]. However this method is not scalable to distributions with large numbers of elements.
Alternatively hoogeb2019lossless models bijections on
directly with additive coupling layers. Further, several approaches transform a discrete variable into a continuous latent variable with a variational autoencoder, and then apply normalzing flows in the continuous latent space[Zizhuang2019; Ziegler2019].
Finally, another approach is dequantization, i.e.
, adding noise to discrete data to make it continuous, can be used on for ordinal variables,e.g., discretized intensities. The noise can be uniform but other forms are possible and this dequantization can even be learned as part of a variational model [ho2019flow]. Modelling distributions over discrete spaces is important in a range of problems, however the generalization of normalizing flows to discrete distributions remains an open problem.