Implicit Normalizing Flows
Code for "Implicit Normalizing Flows" (ICLR 2021 spotlight)
Normalizing flows define a probability distribution by an explicit invertible transformation $\mathbf{z} = f(\mathbf{x})$. In this work, we present implicit normalizing flows (ImpFlows), which generalize normalizing flows by allowing the mapping to be implicitly defined by the roots of an equation $F(\mathbf{z}, \mathbf{x}) = 0$. ImpFlows build on residual flows (ResFlows) with a proper balance between expressiveness and tractability. Through theoretical analysis, we show that the function space of ImpFlows is strictly richer than that of ResFlows. Furthermore, for any ResFlow with a fixed number of blocks, there exists some function that the ResFlow has a non-negligible approximation error in fitting, while the same function is exactly representable by a single-block ImpFlow. We propose a scalable algorithm to train and draw samples from ImpFlows. Empirically, we evaluate ImpFlow on several classification and density modeling tasks, and ImpFlow outperforms ResFlow with a comparable amount of parameters on all the benchmarks.
Normalizing flows (NFs) (rezende2015variational; nice) are promising methods for density modeling. NFs define a model distribution by specifying an invertible transformation $f$ from $\mathbf{x}$ to another random variable $\mathbf{z} = f(\mathbf{x})$. By the change-of-variable formula, the model density is

$$p_X(\mathbf{x}) = p_Z(f(\mathbf{x})) \left|\det\left(\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\right)\right|, \tag{1}$$

where $\mathbf{z}$ follows a simple distribution $p_Z$, such as a Gaussian. NFs are particularly attractive due to their tractability, i.e., the model density can be directly evaluated as in Eqn. (1). To achieve such tractability, NF models should satisfy two requirements: (i) the mapping between $\mathbf{x}$ and $\mathbf{z}$ is invertible; (ii) the log-determinant of the Jacobian is tractable. Searching for rich model families that satisfy these tractability constraints is crucial for the advance of normalizing flow research. For the second requirement, earlier works such as inverse autoregressive flow (kingma2016improved) and RealNVP (realnvp) restrict the model family to those with triangular Jacobian matrices.
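As a minimal illustration of Eqn. (1), consider a toy elementwise affine flow, whose Jacobian is diagonal. This is our own sketch with illustrative names (`log_density`, `a`, `b`), not the released code:

```python
import numpy as np

# A sketch of Eqn. (1) for a toy explicit flow f(x) = a * x + b applied
# elementwise, so the Jacobian df/dx is diagonal with entries a.

def log_density(x, a, b):
    """log p_X(x) = log p_Z(f(x)) + log|det(df/dx)| with standard Gaussian p_Z."""
    z = a * x + b
    log_pz = -0.5 * np.sum(z ** 2) - 0.5 * x.size * np.log(2 * np.pi)
    log_det = np.sum(np.log(np.abs(a)))  # determinant of a diagonal Jacobian
    return log_pz + log_det

x = np.array([0.3, -1.2])
print(log_density(x, a=np.array([2.0, 0.5]), b=np.zeros(2)))
```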
More recently, free-form Jacobian approaches have emerged, such as Residual Flows (ResFlows) (behrmann2019invertible; chen2019residual). They relax the triangular Jacobian constraint by utilizing a stochastic estimator of the log-determinant, enriching the model family. However, the Lipschitz constant of each transformation block is constrained for invertibility. In general, this is not preferable because mapping a simple prior distribution to a potentially complex data distribution may require a transformation with a very large Lipschitz constant (see Fig. 3 for a 2D example). Moreover, all the aforementioned methods assume that there exists an explicit forward mapping $\mathbf{z} = f(\mathbf{x})$. Bijections with an explicit forward mapping cover only a fraction of the broad class of invertible functions suggested by the first requirement, which may limit the model capacity.

In this paper, we propose implicit flows (ImpFlows) to generalize NFs, allowing the transformation to be implicitly defined by an equation $F(\mathbf{z}, \mathbf{x}) = 0$. Given $\mathbf{x}$ (or $\mathbf{z}$), the other variable can be computed by an implicit root-finding procedure. An explicit mapping $\mathbf{z} = f(\mathbf{x})$ used in prior NFs can be viewed as a special case of ImpFlows in the form $F(\mathbf{z}, \mathbf{x}) = f(\mathbf{x}) - \mathbf{z}$. To balance between expressiveness and tractability, we present a specific form of ImpFlows, where each block is the composition of a ResFlow block and the inverse of another ResFlow block. We theoretically study the model capacity of ResFlows and ImpFlows in the function space. We show that the function family of single-block ImpFlows is strictly richer than that of two-block ResFlows by relaxing the Lipschitz constraints. Furthermore, for any ResFlow with a fixed number of blocks, there exists some invertible function that the ResFlow has a non-negligible approximation error in fitting, but that an ImpFlow can exactly model.
On the practical side, we develop a scalable algorithm to estimate the probability density and its gradients, and to draw samples from ImpFlows. The algorithm leverages the implicit differentiation formula. Despite being more powerful, the gradient computation of an ImpFlow is largely similar to that of ResFlows, except for some additional overhead on root finding. We test the effectiveness of ImpFlow on several classification and generative modeling tasks. ImpFlow outperforms ResFlow on all the benchmarks, with comparable model sizes and computational cost.
Expressive Normalizing Flows. There are many works focusing on improving the capacity of NFs. For example, nice; realnvp; kingma2018glow; ho2019flow++; song2019mintnet; hoogeboom2019emerging; de2020block; durkan2019neural design dedicated model architectures with tractable Jacobians. More recently, grathwohl2018ffjord; behrmann2019invertible; chen2019residual propose NFs with free-form Jacobians, which approximate the determinant with stochastic estimators. In parallel with architecture design, chen2020vflow; huang2020augmented; cornish2019relaxing; nielsen2020survae improve the capacity of NFs by operating in a higher-dimensional space. As mentioned in the introduction, all these existing works adopt explicit forward mappings, which form only a subset of the broad class of invertible functions. In contrast, the implicit function family we consider is richer. While we primarily discuss the implicit generalization of ResFlows (chen2019residual) in this paper, the general idea of utilizing implicit invertible functions could potentially be applied to other models as well. Finally, zhang2020approximation formally prove that the model capacity of ResFlows is restricted by the dimension of the residual blocks. In comparison, we study another limitation of ResFlows in terms of the bounded Lipschitz constant, and compare the function families of ResFlows and ImpFlows with a comparable depth.
Continuous Time Flows (CTFs) (chen2018neural; grathwohl2018ffjord; chen2018continuous) are a flexible alternative to discrete-time flows for generative modeling. They typically treat the invertible transformation as a dynamical system, which is approximately simulated by ordinary differential equation (ODE) solvers. In contrast, the implicit function family considered in this paper does not involve differential equations, and only requires fixed-point solvers. Moreover, the theoretical guarantee is different. While CTFs typically study the universal approximation capacity in the continuous-time case (i.e., the "infinite depth" limit), we consider the model capacity of ImpFlows and ResFlows under a finite number of transformation steps. Finally, while CTFs are flexible, their learning is challenging due to instability (liu2020does; massaroli2020stable) and exceedingly many ODE solver steps (finlay2020train), making their large-scale application still an open problem.

Implicit Deep Learning.
Utilizing implicit functions enhances the flexibility of neural networks, enabling the design of network layers in a problem-specific way. For instance, bai2019deep propose a deep equilibrium model as a compact replacement of recurrent networks; amos2017optnet generalize each layer to solve an optimization problem; wang2019satnet integrate logical reasoning into neural networks; reshniak2019robust utilize the implicit Euler method to improve the stability of both the forward and backward processes for residual blocks; and sitzmann2020implicit incorporate periodic functions for representation learning. Different from these works, which consider implicit functions as a replacement for feed-forward networks, we develop invertible implicit functions for normalizing flows, discuss the conditions for the existence of such functions, and theoretically study the model capacity of our proposed ImpFlow in the function space.

We now present implicit normalizing flows, starting with a brief overview of existing work.
As shown in Eqn. (1), a normalizing flow is an invertible function $f$ that defines a probability distribution with the change-of-variable formula. The modeling capacity of normalizing flows depends on the expressiveness of the invertible function $f$. Residual flows (ResFlows) (chen2019residual; behrmann2019invertible) are a particularly powerful class of NFs due to their free-form Jacobian. ResFlows use a composition $f = f_L \circ \cdots \circ f_1$ to construct the invertible mapping, where each layer $f_i$ is an invertible residual network whose Lipschitz constant is bounded by a fixed constant $\kappa < 1$:

$$f_i(\mathbf{x}) = \mathbf{x} + g_i(\mathbf{x}), \quad \mathrm{Lip}(g_i) \le \kappa < 1, \tag{2}$$

where $\mathrm{Lip}(g)$ is the Lipschitz constant of a function $g$ (see Sec. 4.1 for details). Despite their free-form Jacobian, the model capacity of ResFlows is still limited by the Lipschitz constant of the invertible function. The Lipschitz constant of each ResFlow block cannot exceed 2 (behrmann2019invertible), so the Lipschitz constant of an $L$-block ResFlow cannot exceed $2^L$. However, to transform a simple prior distribution into a potentially complex data distribution, the Lipschitz constant of the transformation may in general need to be sufficiently large. Therefore, ResFlows can be undesirably deep simply to meet the Lipschitz constraints (see Fig. 3 for a 2D example). Below, we present implicit flows (ImpFlows) to relax the Lipschitz constraints.
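The Lipschitz constraint in Eqn. (2) is typically enforced with spectral normalization (as we do in Sec. 6). The following is a minimal sketch of one such block, assuming hypothetical names (`LipschitzResBlock`, `kappa`); it is not our released implementation:

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# One ResFlow-style block f(x) = x + g(x) with Lip(g) <= kappa < 1.
class LipschitzResBlock(nn.Module):
    def __init__(self, dim, hidden, kappa=0.9):
        super().__init__()
        self.kappa = kappa
        # spectral_norm constrains each weight matrix to unit spectral norm;
        # with 1-Lipschitz activations (e.g. ELU), Lip(net) <= 1.
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(dim, hidden)), nn.ELU(),
            spectral_norm(nn.Linear(hidden, dim)),
        )

    def g(self, x):
        # scaling the output by kappa < 1 gives Lip(g) <= kappa < 1
        return self.kappa * self.net(x)

    def forward(self, x):
        return x + self.g(x)

block = LipschitzResBlock(dim=2, hidden=64)
print(block(torch.randn(4, 2)).shape)  # torch.Size([4, 2])
```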
In general, an implicit flow (ImpFlow) is defined as an invertible mapping between random variables $\mathbf{x}$ and $\mathbf{z}$ of dimension $d$ by finding the roots of $F(\mathbf{z}, \mathbf{x}) = 0$, where $F$ is a function from $\mathbb{R}^{2d}$ to $\mathbb{R}^d$. In particular, the explicit mappings used in prior flow instances (chen2019residual; kingma2018glow) can be expressed as an implicit function in the form $F(\mathbf{z}, \mathbf{x}) = f(\mathbf{x}) - \mathbf{z}$. While ImpFlows are a powerful family to explore, in general they are not guaranteed to satisfy the invertibility and the tractability of the log-determinant as required by NFs. In this paper, we focus on the following specific form, which achieves a good balance between expressiveness and tractability, and leave other possibilities for future studies.
Definition 1. Let $g_{\mathbf{x}}$ and $g_{\mathbf{z}}$ be two functions such that $\mathrm{Lip}(g_{\mathbf{x}}) < 1$ and $\mathrm{Lip}(g_{\mathbf{z}}) < 1$, where $\mathrm{Lip}(g)$ is the Lipschitz constant of a function $g$. A specific form of ImpFlows is defined by

$$F(\mathbf{z}, \mathbf{x}) = g_{\mathbf{x}}(\mathbf{x}) - g_{\mathbf{z}}(\mathbf{z}) + \mathbf{x} - \mathbf{z} = 0. \tag{3}$$
The root pairs $(\mathbf{z}, \mathbf{x})$ of Eqn. (3) form a subset of $\mathbb{R}^{2d}$, which actually defines the assignment rule of a unique invertible function $f$. To see this, for any $\mathbf{x} \in \mathbb{R}^d$, according to Definition 1, we can construct a contraction $\mathbf{z} \mapsto \mathbf{x} + g_{\mathbf{x}}(\mathbf{x}) - g_{\mathbf{z}}(\mathbf{z})$ with a unique fixed point, which corresponds to a unique root (w.r.t. $\mathbf{z}$) of $F(\mathbf{z}, \mathbf{x}) = 0$, denoted by $f(\mathbf{x})$. Similarly, in the reverse process, given a $\mathbf{z}$, the root (w.r.t. $\mathbf{x}$) of $F(\mathbf{z}, \mathbf{x}) = 0$ also exists and is unique, denoted by $f^{-1}(\mathbf{z})$. These two properties are sufficient to ensure the existence and the invertibility of $f$, as summarized in the following theorem.
Theorem 1. Eqn. (3) defines a unique mapping $f: \mathbb{R}^d \to \mathbb{R}^d$, $\mathbf{z} = f(\mathbf{x})$, and $f$ is invertible.
See the proof in Appendix A.1. Theorem 1 characterizes the validity of the ImpFlows introduced in Definition 1. In fact, a single ImpFlow is a stack of a single ResFlow and the inverse of another single ResFlow, which will be formally stated in Sec. 4. We will investigate the expressiveness of the function family of ImpFlows in Sec. 4, and present a scalable algorithm to learn a deep generative model built upon ImpFlows in Sec. 5.
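The sketch below shows how $\mathbf{z} = f(\mathbf{x})$ can be evaluated in practice by iterating the contraction used in the argument above; since $\mathrm{Lip}(g_{\mathbf{z}}) < 1$, Banach's fixed-point theorem guarantees convergence. We use plain Picard iteration for clarity (the algorithm in Sec. 5 uses Broyden's method instead), and `impflow_forward` is an illustrative name:

```python
import torch

# Evaluate z = f(x) for one ImpFlow block by fixed-point iteration on
# Eqn. (3): z = x + g_x(x) - g_z(z).
def impflow_forward(g_x, g_z, x, tol=1e-6, max_iter=1000):
    rhs = x + g_x(x)          # fixed part: x + g_x(x)
    z = x.clone()             # initial guess
    for _ in range(max_iter):
        z_next = rhs - g_z(z)
        if (z_next - z).norm() < tol:
            return z_next
        z = z_next
    return z

# usage with two blocks like the (hypothetical) LipschitzResBlock above:
# z = impflow_forward(block_x.g, block_z.g, torch.randn(4, 2))
```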
We first present some preliminaries on Lipschitz continuous functions in Sec. 4.1 and then formally study the expressive power of ImpFlows, especially in comparison to ResFlows. In particular, we prove that the function space of ImpFlows is strictly richer than that of ResFlows in Sec. 4.2 (see an illustration in Fig. 1 (a)). Furthermore, for any ResFlow with a fixed number of blocks, there exists some function that the ResFlow has a non-negligible approximation error in fitting, while the function is exactly representable by a single-block ImpFlow. These results are illustrated in Fig. 1 (b) and formally presented in Sec. 4.3.
For any differentiable function $f$ and any $\mathbf{x}$, we denote the Jacobian matrix of $f$ at $\mathbf{x}$ as $J_f(\mathbf{x})$.

A function $f: \mathbb{R}^d \to \mathbb{R}^d$ is called Lipschitz continuous if there exists a constant $L$ s.t. for all $\mathbf{x}_1, \mathbf{x}_2 \in \mathbb{R}^d$,

$$\|f(\mathbf{x}_1) - f(\mathbf{x}_2)\| \le L \, \|\mathbf{x}_1 - \mathbf{x}_2\|.$$

The smallest $L$ that satisfies the inequality is called the Lipschitz constant of $f$, denoted as $\mathrm{Lip}(f)$. Generally, the definition of $\mathrm{Lip}(f)$ depends on the choice of the norm $\|\cdot\|$; we use the 2-norm by default in this paper for simplicity.

A function is called bi-Lipschitz continuous if it is Lipschitz continuous and has an inverse mapping $f^{-1}$ which is also Lipschitz continuous.
It is useful to consider an equivalent definition of the Lipschitz constant in our following analysis.
(Rademacher (federer1969grundlehren, Theorem 3.1.6)) If $f: \mathbb{R}^d \to \mathbb{R}^d$ is Lipschitz continuous, then $f$ is differentiable almost everywhere, and

$$\mathrm{Lip}(f) = \sup_{\mathbf{x} \in \mathbb{R}^d} \|J_f(\mathbf{x})\|_2,$$

where $\|A\|_2$ is the operator norm of the matrix $A$.
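As a simple sanity check of this characterization (our example, not part of the original analysis): for a linear map $f(\mathbf{x}) = A\mathbf{x}$, the Jacobian is $J_f(\mathbf{x}) = A$ everywhere, so

$$\mathrm{Lip}(f) = \|A\|_2 = \sigma_{\max}(A),$$

the largest singular value of $A$; e.g., $A = \mathrm{diag}(3, 1/2)$ gives $\mathrm{Lip}(f) = 3$.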
We formally compare the expressive power of a single-block ImpFlow and a two-block ResFlow. We highlight the structure of the theoretical results of this subsection in Fig. 1 (a) and present a 1D motivating example in Fig. 2. All the proofs can be found in Appendix A.
On the one hand, according to the definition of ResFlow, the function family of the single-block ResFlow is
$$\mathcal{R} = \{f : f = \mathrm{Id} + g, \ \mathrm{Lip}(g) < 1, \ g \in C^1(\mathbb{R}^d, \mathbb{R}^d)\}, \tag{4}$$

where $C^1(\mathbb{R}^d, \mathbb{R}^d)$ consists of all functions from $\mathbb{R}^d$ to $\mathbb{R}^d$ with continuous derivatives and $\mathrm{Id}$ denotes the identity map. Besides, the function family of $\ell$-block ResFlows is defined by composition:

$$\mathcal{R}_\ell = \{f : f = f_\ell \circ \cdots \circ f_1, \ f_1, \ldots, f_\ell \in \mathcal{R}\}. \tag{5}$$
On the other hand, according to the definition of the ImpFlow in Eqn. (3), we can obtain $(\mathrm{Id} + g_{\mathbf{z}}) \circ f = \mathrm{Id} + g_{\mathbf{x}}$, where $\circ$ denotes the composition of functions. Equivalently, we have $f = (\mathrm{Id} + g_{\mathbf{z}})^{-1} \circ (\mathrm{Id} + g_{\mathbf{x}})$, which implies that the function family of the single-block ImpFlow is

$$\mathcal{I} = \{f : f = f_2^{-1} \circ f_1, \ f_1, f_2 \in \mathcal{R}\}. \tag{6}$$
Intuitively, a single-block ImpFlow can be interpreted as the composition of a ResFlow block and the inverse function of another ResFlow block, which may not have an explicit form (see Fig. 2 (c) and (d) for a 1D example). Therefore, it is natural to investigate the relationship between $\mathcal{I}$ and $\mathcal{R}_2$. Before that, we first introduce a family of "monotonically increasing functions" that does not have an explicit Lipschitz constraint, and show that it is strictly larger than $\mathcal{R}$.
Lemma 1.

$$\mathcal{R} \subsetneq \mathcal{F} := \{f : f \in \mathcal{D} \ \text{and} \ \det(J_f(\mathbf{x})) > 0, \ \forall \mathbf{x} \in \mathbb{R}^d\}, \tag{7}$$

where $\mathcal{D}$ is the set of all bi-Lipschitz $C^1$-diffeomorphisms from $\mathbb{R}^d$ to $\mathbb{R}^d$, and $\subsetneq$ means the left-hand set is a proper subset of the right-hand one.
Note that it follows from behrmann2019invertible that all functions in $\mathcal{R}$ are bi-Lipschitz, so $\mathcal{R} \subset \mathcal{D}$. In the 1D input case, $\mathcal{R}$ and $\mathcal{F}$ can be illustrated explicitly (see Fig. 2). In the high-dimensional cases, $\mathcal{R}$ and $\mathcal{F}$ are hard to illustrate. Nevertheless, the Lipschitz constants of the functions in $\mathcal{R}$ are less than 2 (behrmann2019invertible), while those of the functions in $\mathcal{F}$ can be arbitrarily large. Based on Lemma 1, we prove that the function family of ImpFlows consists of the compositions of two functions in $\mathcal{F}$, and therefore is strictly larger than $\mathcal{R}_2$, as summarized in the following theorem.
Theorem 2 (Equivalent form of the function family of a single-block ImpFlow).

$$\mathcal{I} = \{f : f = f_2 \circ f_1, \ f_1, f_2 \in \mathcal{F}\}. \tag{8}$$
Note that the identity mapping $\mathrm{Id} \in \mathcal{F}$, and it is easy to get $\mathcal{F} \subset \mathcal{I}$. Thus, the Lipschitz constant of a single ImpFlow (and of its inverse) can be arbitrarily large. Because $\mathcal{R}_2 \subset \mathcal{I}$ and there exist some functions in $\mathcal{I}$ but not in $\mathcal{R}_2$ (see a constructed example in Sec. 4.3), we can get the following corollary.
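For a concrete 1D illustration of the blow-up (our own example, not one of the constructions used in the proofs): take $g_{\mathbf{x}}(x) = \kappa x$ and $g_{\mathbf{z}}(z) = -\kappa z$ with $0 < \kappa < 1$, both valid contractions. Eqn. (3) then gives $(1+\kappa)x = (1-\kappa)z$, i.e.,

$$f(x) = \frac{1+\kappa}{1-\kappa}\,x, \qquad \mathrm{Lip}(f) = \frac{1+\kappa}{1-\kappa} \to \infty \ \text{as} \ \kappa \to 1,$$

so a single ImpFlow block already escapes any fixed Lipschitz bound.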
Corollary 1. $\mathcal{R}_2 \subsetneq \mathcal{I}$.
We further investigate the relationship between $\mathcal{R}_\ell$ for $\ell > 2$ and $\mathcal{I}$, as illustrated in Fig. 1 (b). For a fixed $\ell$, the Lipschitz constant of functions in $\mathcal{R}_\ell$ is still bounded, and there exist infinitely many functions that are not in $\mathcal{R}_\ell$ but in $\mathcal{I}$. We construct one such function family: for any $L > 0$, define

$$\mathcal{P}_L = \{f : f \in \mathcal{D}, \ \exists \, \mathcal{B}_r \subset \mathbb{R}^d \ \text{s.t.} \ \|f(\mathbf{x}_1) - f(\mathbf{x}_2)\| = L \, \|\mathbf{x}_1 - \mathbf{x}_2\|, \ \forall \mathbf{x}_1, \mathbf{x}_2 \in \mathcal{B}_r\}, \tag{9}$$

where $\mathcal{B}_r$ is a $d$-dimensional ball with radius $r$. Obviously, $\mathcal{P}_L$ is an infinite set. Below, we will show that when $L > 2^\ell$, every function in $\mathcal{R}_\ell$ has a non-negligible approximation error for functions in $\mathcal{P}_L$; however, they are exactly representable by functions in $\mathcal{I}$.
Theorem 3. Given $\ell \in \mathbb{Z}^+$ and $L > 2^\ell$, we have

$\mathcal{P}_L \subset \mathcal{I}$;

$\forall f \in \mathcal{P}_L$, $f \notin \mathcal{R}_\ell$. Moreover, for any $f \in \mathcal{P}_L$ with associated $d$-dimensional ball $\mathcal{B}_r$, the minimal error for fitting $f$ in $\mathcal{B}_r$ by functions in $\mathcal{R}_\ell$ satisfies

$$\inf_{g \in \mathcal{R}_\ell} \sup_{\mathbf{x} \in \mathcal{B}_r} \|f(\mathbf{x}) - g(\mathbf{x})\| \ge \frac{r \, (L - 2^\ell)}{2}. \tag{10}$$
It follows from Theorem 3 that to model a function in $\mathcal{P}_L$, we need only a single-block ImpFlow but at least a $\lceil \log_2 L \rceil$-block ResFlow. In Fig. 2 (b), we show a 1D case where a 3-block ResFlow cannot fit a function that is exactly representable by a single-block ImpFlow. In addition, we also prove some other properties of ImpFlows, including results for compositions of multiple ImpFlow blocks. We formally present the results in Appendix B.
ImpFlows can be parameterized by neural networks and stacked to form a deep generative model for high-dimensional data distributions. We develop a scalable algorithm to perform inference, sampling and learning in such models. For simplicity, we focus on a single block during the derivation.
Formally, a parametric ImpFlow block is defined by

$$F(\mathbf{z}, \mathbf{x}; \theta) = g_{\mathbf{x}}(\mathbf{x}; \theta) - g_{\mathbf{z}}(\mathbf{z}; \theta) + \mathbf{x} - \mathbf{z} = 0, \tag{11}$$

with $\mathrm{Lip}(g_{\mathbf{x}}(\cdot; \theta)) < 1$ and $\mathrm{Lip}(g_{\mathbf{z}}(\cdot; \theta)) < 1$. Let $\theta$ denote all the parameters in $g_{\mathbf{x}}$ and $g_{\mathbf{z}}$ (which does NOT mean $g_{\mathbf{x}}$ and $g_{\mathbf{z}}$ share parameters). Note that $\mathbf{x}$ refers to the input of the layer, not the input data.
The inference process, computing $\mathbf{z}$ given $\mathbf{x}$ in a single ImpFlow block, amounts to finding the root of $F(\mathbf{z}, \mathbf{x}; \theta) = 0$ w.r.t. $\mathbf{z}$, which cannot be computed explicitly because of the implicit formulation. Instead, we adopt a quasi-Newton method (i.e., Broyden's method (broyden1965class)) to solve this problem iteratively, as follows:

$$\mathbf{z}^{[i+1]} = \mathbf{z}^{[i]} - \alpha B^{[i]} F(\mathbf{z}^{[i]}, \mathbf{x}; \theta), \tag{12}$$

where $B^{[i]}$ is a low-rank approximation of the inverse Jacobian¹ and $\alpha$ is the step size, which we compute dynamically by a line search. The stopping criterion is $\|F(\mathbf{z}^{[i]}, \mathbf{x}; \theta)\|_2 < \epsilon_f$, where $\epsilon_f$ is a hyperparameter that balances the computation time and precision. As Theorem 1 guarantees the existence and uniqueness of the root, the convergence of Broyden's method is also guaranteed, and it is typically faster than linear.

¹We refer readers to broyden1965class for the calculation details of $B^{[i]}$.
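The sketch below shows one way to implement the update in Eqn. (12), using the classical "good Broyden" rank-1 update of the inverse Jacobian. It uses a fixed step size instead of a line search, and `broyden_solve` is an illustrative name rather than our exact implementation:

```python
import numpy as np

# Solve F(z, x) = 0 w.r.t. z with Broyden's method, as in Eqn. (12).
def broyden_solve(F, z0, tol=1e-6, max_iter=100, alpha=1.0):
    z = z0.copy()
    B = -np.eye(z.size)            # dF/dz = -(I + J_gz), so -I is a sane init
    Fz = F(z)
    for _ in range(max_iter):
        if np.linalg.norm(Fz) < tol:   # stopping criterion ||F|| < eps_f
            break
        dz = -alpha * (B @ Fz)         # quasi-Newton step
        z_new = z + dz
        Fz_new = F(z_new)
        dF = Fz_new - Fz
        BdF = B @ dF                   # "good Broyden" rank-1 update of B
        B = B + np.outer(dz - BdF, dz @ B) / (dz @ BdF)
        z, Fz = z_new, Fz_new
    return z

# e.g., for one ImpFlow block:
# z = broyden_solve(lambda z: gx(x) - gz(z) + x - z, z0=x)
```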
Another inference problem is to estimate the log-likelihood. Assume that $\mathbf{z} \sim p_Z(\mathbf{z})$, where $p_Z$ is a simple prior distribution (e.g., a standard Gaussian). The log-likelihood of $\mathbf{x}$ can be written as
$$\log p(\mathbf{x}) = \log p_Z(\mathbf{z}) + \log\det(I + J_{g_{\mathbf{x}}}(\mathbf{x})) - \log\det(I + J_{g_{\mathbf{z}}}(\mathbf{z})), \tag{13}$$

where $J_g(\mathbf{x})$ denotes the Jacobian matrix of a function $g$ at $\mathbf{x}$. See Appendix A.4 for the detailed derivation. Exact calculation of the log-determinant terms requires $\mathcal{O}(d^3)$ time and is hard to scale up to high-dimensional data. Instead, we propose the following unbiased estimator of $\log\det(I + J_g(\mathbf{x}))$, using the same technique as chen2019residual with the Skilling-Hutchinson trace estimator (skilling1989eigenvalues; hutchinson1989stochastic):

$$\log\det(I + J_g(\mathbf{x})) = \mathbb{E}_{n \sim p(N), \, \mathbf{v} \sim \mathcal{N}(\mathbf{0}, I)} \left[ \sum_{k=1}^{n} \frac{(-1)^{k+1}}{k} \frac{\mathbf{v}^\top J_g(\mathbf{x})^k \mathbf{v}}{P(N \ge k)} \right], \tag{14}$$

where $p(N)$ is a distribution supported over the positive integers.
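A sketch of Eqn. (14) with a geometric $p(N)$ is given below. The truncation cap `n_max` is for brevity only (a strict implementation would not cap $n$), and the names are illustrative:

```python
import torch

# Unbiased log-det estimator: truncated power series with Russian-roulette
# reweighting (geometric N) and a Hutchinson trace probe, as in Eqn. (14).
def logdet_estimator(g, x, p_geom=0.5, n_max=20):
    x = x.requires_grad_(True)
    gx = g(x)
    v = torch.randn_like(x)                    # Hutchinson probe vector
    n = torch.distributions.Geometric(p_geom).sample().int().item() + 1
    n = min(n, n_max)                          # cap for brevity (biases slightly)
    w, est = v, 0.0
    for k in range(1, n + 1):
        # vector-Jacobian product: w <- w^T J_g(x), so after k steps w = v^T J^k
        w = torch.autograd.grad(gx, x, grad_outputs=w, retain_graph=True)[0]
        p_geq_k = (1 - p_geom) ** (k - 1)      # P(N >= k) for this geometric N
        est = est + (-1) ** (k + 1) / k * (w * v).sum() / p_geq_k
    return est
```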
The sampling process, computing $\mathbf{x}$ given $\mathbf{z}$, can also be solved by Broyden's method, and the hyperparameters are shared with the inference process.
In the learning process, we perform stochastic gradient descent to minimize the negative log-likelihood of the data, denoted as $\mathcal{L}$. For efficiency, we estimate the gradient w.r.t. the model parameters in the backpropagation manner. According to the chain rule and the additivity of the log-determinant, in each layer we need to estimate the gradients w.r.t. $\theta$ of the terms in Eqn. (13). In particular, the gradient computation involves two kinds of terms: one is $\frac{\partial}{\partial \theta} \log\det(I + J_g(\hat{\mathbf{x}}))$ and the other is $\frac{\partial \hat{\mathbf{x}}}{\partial \theta}$, where $g$ is a function satisfying $\mathrm{Lip}(g) < 1$ and $\hat{\mathbf{x}}$ denotes $\mathbf{x}$ or $\mathbf{z}$. On the one hand, for the log-determinant term, we can use the same technique as chen2019residual and obtain an unbiased gradient estimator as follows:

$$\frac{\partial}{\partial \theta} \log\det(I + J_g(\mathbf{x})) = \mathbb{E}_{n \sim p(N), \, \mathbf{v} \sim \mathcal{N}(\mathbf{0}, I)} \left[ \left( \sum_{k=0}^{n} \frac{(-1)^k}{P(N \ge k)} \mathbf{v}^\top J_g(\mathbf{x})^k \right) \frac{\partial J_g(\mathbf{x})}{\partial \theta} \mathbf{v} \right], \tag{15}$$
where $p(N)$ is a distribution supported over the positive integers. On the other hand, $\frac{\partial \hat{\mathbf{x}}}{\partial \theta}$ can be computed according to the implicit function theorem as follows (see details in Appendix A.5):

$$\frac{\partial \mathbf{z}}{\partial \theta} = -\left(\frac{\partial F(\mathbf{z}, \mathbf{x}; \theta)}{\partial \mathbf{z}}\right)^{-1} \frac{\partial F(\mathbf{z}, \mathbf{x}; \theta)}{\partial \theta}. \tag{16}$$
In comparison to directly differentiating through the quasi-Newton iterations of the forward pass, the implicit gradient above is simple and memory-efficient, treating the root solver as a black box. Following bai2019deep, we compute the vector-inverse-Jacobian product in Eqn. (16) by solving a linear system iteratively, as detailed in Appendix C.1. The training algorithm is formally presented in Appendix C.4.
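The following sketch illustrates the implicit backward pass of Eqn. (16) for the block $F(\mathbf{z}, \mathbf{x}) = g_{\mathbf{x}}(\mathbf{x}) - g_{\mathbf{z}}(\mathbf{z}) + \mathbf{x} - \mathbf{z}$. Here $\partial F / \partial \mathbf{z} = -(I + J_{g_{\mathbf{z}}})$, so the required product $\mathbf{u}^\top (I + J_{g_{\mathbf{z}}})^{-1}$ can be computed with a Neumann series, which converges since $\mathrm{Lip}(g_{\mathbf{z}}) < 1$. This is illustrative only; the actual algorithm follows bai2019deep and solves the linear system iteratively:

```python
import torch

def vjp_inverse(g_z, z, u, n_terms=50):
    """Approximate u^T (I + J_{g_z}(z))^{-1} via the series u^T sum_k (-J_{g_z})^k."""
    z = z.requires_grad_(True)
    gz = g_z(z)
    acc, w = u.clone(), u
    for _ in range(n_terms):
        # w <- -(w^T J_{g_z}(z)); the partial sums converge geometrically
        w = -torch.autograd.grad(gz, z, grad_outputs=w, retain_graph=True)[0]
        acc = acc + w
    return acc

# Parameter gradients then follow from the chain rule: propagate
# vjp_inverse(g_z, z, dL/dz) into d(x + g_x(x; theta))/dtheta.
```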
We demonstrate the model capacity of ImpFlows on classification and density modeling tasks². In all experiments, we use spectral normalization (miyato2018spectral) to enforce the Lipschitz constraints, where the Lipschitz upper bound of each layer (called the Lipschitz coefficient) is denoted as $\kappa$. For Broyden's method, we use fixed stopping tolerances $\epsilon_f$ for training and testing to numerically ensure the invertibility and the stability during training. Please see other detailed settings, including the method of estimating the log-determinant, the network architecture, the learning rate, the batch size, and so on, in Appendix D.

²See https://github.com/thu-ml/implicit-normalizing-flows for details.
Table 1: Classification error rates (%) on CIFAR10 and CIFAR100, with standard deviations in parentheses. "Vanilla" is a standard ResNet-18 baseline; the remaining columns correspond to five settings of the Lipschitz coefficient κ.

| Dataset | Model | Vanilla | κ₁ | κ₂ | κ₃ | κ₄ | κ₅ |
|---|---|---|---|---|---|---|---|
| CIFAR10 | ResFlow | 6.61 (0.02) | 8.24 (0.03) | 8.39 (0.01) | 8.69 (0.03) | 9.25 (0.02) | 9.94 (0.02) |
| CIFAR10 | ImpFlow | | 7.29 (0.03) | 7.41 (0.03) | 7.94 (0.06) | 8.44 (0.04) | 9.22 (0.02) |
| CIFAR100 | ResFlow | 27.83 (0.03) | 31.02 (0.05) | 31.88 (0.02) | 32.21 (0.03) | 33.58 (0.02) | 34.48 (0.03) |
| CIFAR100 | ImpFlow | | 29.06 (0.03) | 30.47 (0.03) | 31.40 (0.03) | 32.64 (0.01) | 34.17 (0.02) |
Table 2: Average test log-likelihood (in nats, higher is better) on the tabular datasets.

| Model | POWER | GAS | HEPMASS | MINIBOONE | BSDS300 |
|---|---|---|---|---|---|
| RealNVP (realnvp) | 0.17 | 8.33 | -18.71 | -13.55 | 153.28 |
| FFJORD (grathwohl2018ffjord) | 0.46 | 8.59 | -14.92 | -10.43 | 157.40 |
| MAF (papamakarios2017masked) | 0.24 | 10.08 | -17.70 | -11.75 | 155.69 |
| NAF (huang2018neural) | 0.62 | 11.96 | -15.09 | -8.86 | 157.73 |
| ImpFlow (20 blocks) | 0.61 | 12.11 | -13.95 | -13.32 | 155.68 |
| ResFlow (10 blocks) | 0.26 | 6.20 | -18.91 | -21.81 | 104.63 |
| ImpFlow (5 blocks) | 0.30 | 6.94 | -18.52 | -21.50 | 113.72 |
Table 3: Density modeling on CIFAR10 (bits/dim, lower is better; standard deviations in parentheses) under four settings of the Lipschitz coefficient κ.

| Model | κ₁ | κ₂ | κ₃ | κ₄ |
|---|---|---|---|---|
| ResFlow | 3.469 (0.0004) | 3.533 (0.0002) | 3.627 (0.0004) | 3.820 (0.0003) |
| ImpFlow | 3.452 (0.0003) | 3.511 (0.0002) | 3.607 (0.0003) | 3.814 (0.0005) |
We first empirically compare ResFlows and ImpFlows on classification tasks. Compared with generative modeling, classification is a more direct measure of the richness of the function family, because it isolates the function fitting from generative modeling subtleties, such as log-determinant estimation. We train both models in the same settings on CIFAR10 and CIFAR100 (krizhevsky2009learning). Specifically, we use an architecture similar to ResNet-18 (he2016deep). Overall, the amounts of parameters of ResNet-18 with vanilla ResBlocks, with ResFlow blocks and with ImpFlow blocks are the same. The detailed network structure can be found in Appendix D. The classification results are shown in Table 1. To see the impact of the Lipschitz constraints, we vary the Lipschitz coefficient $\kappa$ to show the difference between ResFlows and ImpFlows under the condition of a fixed Lipschitz upper bound. Given different values of $\kappa$, the classification results of ImpFlows are consistently better than those of ResFlows. These results empirically validate Corollary 1, which claims that the function family of ImpFlows is richer than that of ResFlows. Besides, for a large Lipschitz upper bound $\kappa$, ImpFlow blocks are comparable with the vanilla ResBlocks in terms of classification.
For the density modeling tasks, we first evaluate ImpFlows on the Checkerboard data, whose density is multi-modal, as shown in Fig. 3 (a). For fairness, we follow the same experiment settings as chen2019residual (which are specified in Appendix D), except that we adopt a Sine (sitzmann2020implicit) activation function for all models. We note that the data distribution has bounded support, while we want to fit a transformation $f$ mapping it to the standard Gaussian distribution, whose support is unbounded. A perfect $f$ requires a sufficiently large Lipschitz constant, because some $\mathbf{x}$ must be mapped far from the mean of the Gaussian. Therefore, the Lipschitz constant of such an $f$ is too large for it to be fitted by a ResFlow with 8 blocks (see Fig. 3 (b)). An ImpFlow with the same number of parameters achieves a better result (in bits) than the ResFlow. Such results accord with our theoretical results in Theorem 2 and strongly motivate ImpFlows.

We also train ImpFlows on some real density modeling datasets, including the tabular datasets (used by papamakarios2017masked), CIFAR10 and 5-bit CelebA (kingma2018glow). For all the real datasets, we use the scalable algorithm proposed in Sec. 5.
We test the performance on five tabular datasets: POWER ($d = 6$), GAS ($d = 8$), HEPMASS ($d = 21$), MINIBOONE ($d = 43$) and BSDS300 ($d = 63$) from the UCI repository (Dua:2019), where $d$ is the data dimension. For a fair comparison, on each dataset we use a 10-block ResFlow and a 5-block ImpFlow with the same amount of parameters, and a 20-block ImpFlow for a better result. The detailed network architecture and hyperparameters can be found in Appendix D. Table 2 shows the average test log-likelihood for ResFlows and ImpFlows. ImpFlow achieves better density estimation performance than ResFlow consistently on all datasets. Again, the results demonstrate the effectiveness of ImpFlows.
Then we test ImpFlows on the CIFAR10 dataset. We train a multi-scale convolutional version of both ImpFlows and ResFlows, following the same settings as chen2019residual except that we use a smaller network of 5.5M parameters for both models (see details in Appendix D). As shown in Table 3, ImpFlow achieves better results than ResFlow consistently given different values of the Lipschitz coefficient $\kappa$. Moreover, the computation time of ImpFlow is comparable to that of ResFlow; see Appendix C.2 for detailed results. Besides, there is a trade-off between the expressiveness and the numerical optimization of ImpFlows in larger models. Based on the above experiments, we believe that advances such as a lower-variance estimator of the log-determinant can benefit ImpFlows in larger models, which is left for future work.
We also train ImpFlows on the 5-bit CelebA dataset. For a fair comparison, we use the same settings as chen2019residual. The samples from our model are shown in Appendix E.
We propose implicit normalizing flows (ImpFlows), which generalize normalizing flows by utilizing an implicit invertible mapping defined by the roots of the equation $F(\mathbf{z}, \mathbf{x}) = 0$. ImpFlows build on Residual Flows (ResFlows) with a good balance between tractability and expressiveness. We show that the function family of ImpFlows is richer than that of ResFlows, particularly for modeling functions with large Lipschitz constants. Based on the implicit differentiation formula, we present a scalable algorithm to train and evaluate ImpFlows. Empirically, ImpFlows outperform ResFlows on several classification and density modeling benchmarks. Finally, while this paper mostly focuses on the implicit generalization of ResFlows, the general idea of utilizing implicit functions for NFs could be extended to a wider scope. We leave this for future work.
We thank Yuhao Zhou, Shuyu Cheng, Jiaming Li, Kun Xu, Fan Bao, Shihong Song and Qi'An Fu for proofreading. This work was supported by the National Key Research and Development Program of China (Nos. 2020AAA0104304), NSFC Projects (Nos. 61620106010, 62061136001, U19B2034, U181146, 62076145), Beijing NSF Project (No. JQ19016), Beijing Academy of Artificial Intelligence (BAAI), Tsinghua-Huawei Joint Research Program, Huawei Hisilicon Kirin Intelligence Engineering Development, the MindSpore team, a grant from Tsinghua Institute for Guo Qiang, Tiangong Institute for Intelligent Computing, and the NVIDIA NVAIL Program with GPU/DGX Acceleration. C. Li was supported by the fellowship of China postdoctoral Science Foundation (2020M680572), and the fellowship of China national postdoctoral program for innovative talents (BX20190172) and Shuimu Tsinghua Scholar. J. Chen was supported by Shuimu Tsinghua Scholar.
(Theorem 1) Firstly, for any $\mathbf{x} \in \mathbb{R}^d$, the mapping

$$\mathbf{z} \mapsto \mathbf{x} + g_{\mathbf{x}}(\mathbf{x}) - g_{\mathbf{z}}(\mathbf{z})$$

is a contractive mapping, which can be shown by the Lipschitz condition of $g_{\mathbf{z}}$:

$$\|g_{\mathbf{z}}(\mathbf{z}_1) - g_{\mathbf{z}}(\mathbf{z}_2)\| \le \mathrm{Lip}(g_{\mathbf{z}}) \, \|\mathbf{z}_1 - \mathbf{z}_2\|, \quad \mathrm{Lip}(g_{\mathbf{z}}) < 1.$$

Therefore, the mapping has a unique fixed point, denoted by $f(\mathbf{x})$, which is the unique root (w.r.t. $\mathbf{z}$) of $F(\mathbf{z}, \mathbf{x}) = 0$:

$$F(f(\mathbf{x}), \mathbf{x}) = 0.$$

Similarly, we also have: for any $\mathbf{z} \in \mathbb{R}^d$, there exists a unique $\mathbf{x}$, denoted by $h(\mathbf{z})$, satisfying $F(\mathbf{z}, h(\mathbf{z})) = 0$.

Moreover, let $\mathbf{z} = f(\mathbf{x})$; we have $F(\mathbf{z}, \mathbf{x}) = 0$. By the uniqueness of the root w.r.t. $\mathbf{x}$, we have $h(f(\mathbf{x})) = \mathbf{x}$. Similarly, $f(h(\mathbf{z})) = \mathbf{z}$. Therefore, $f$ is unique and invertible, with $f^{-1} = h$. ∎
We denote by $\mathcal{D}$ the set of all bi-Lipschitz $C^1$-diffeomorphisms from $\mathbb{R}^d$ to $\mathbb{R}^d$.

Firstly, we prove Lemma 1 in the main text.

(Lemma 1) For any $f \in \mathcal{R}$, we have $f = \mathrm{Id} + g$ with $\mathrm{Lip}(g) < 1$, which is equivalent to

$$J_f(\mathbf{x}) = I + J_g(\mathbf{x}), \quad \|J_g(\mathbf{x})\|_2 < 1, \ \forall \mathbf{x} \in \mathbb{R}^d.$$

Note that $J_f(\mathbf{x})$ is nonsingular, so for all $\mathbf{x}$ we have $\det(J_f(\mathbf{x})) \neq 0$. Thus, since every eigenvalue of $J_g(\mathbf{x})$ has modulus less than 1, every eigenvalue of $I + J_g(\mathbf{x})$ has positive real part, and

$$\det(J_f(\mathbf{x})) > 0, \quad \forall \mathbf{x} \in \mathbb{R}^d.$$

So we have $\mathcal{R} \subset \mathcal{F}$.

Note that the converse is not true, because $\mathcal{F}$ does not restrict the upper bound of the Lipschitz constant of $f$. For example, when $f(\mathbf{x}) = k\mathbf{x}$ where $k$ is a positive real number, we have

$$\det(J_f(\mathbf{x})) = k^d > 0,$$

so $f \in \mathcal{F}$. However, $\mathrm{Lip}(f) = k$ can be an arbitrarily large positive number, while the Lipschitz constant of every function in $\mathcal{R}$ is less than 2. So we have $\mathcal{R} \subsetneq \mathcal{F}$. ∎
Lemma 2. For any $f \in \mathcal{D}$, if

(17)

then

(18)
(Proof of Lemma 2) By the Inverse Function Theorem,

$$J_{f^{-1}}(f(\mathbf{x})) = (J_f(\mathbf{x}))^{-1}.$$

Because $f$ is a bijection from $\mathbb{R}^d$ to $\mathbb{R}^d$, every $\mathbf{y} \in \mathbb{R}^d$ can be written as $\mathbf{y} = f(\mathbf{x})$. Letting $\mathbf{y} = f(\mathbf{x})$ and $h = f^{-1}$, we have $J_h(\mathbf{y}) = (J_f(\mathbf{x}))^{-1}$, and the bound (18) on $J_h$ follows from the assumption (17). The derivation uses the following fact: for a real matrix $A$, $\|A\|_2 = \|A^\top\|_2$, because $\|A\|_2 = \sqrt{\lambda_{\max}(A^\top A)} = \sqrt{\lambda_{\max}(A A^\top)} = \|A^\top\|_2$.

Note that $h$ is Lipschitz continuous, so by the Rademacher lemma it is differentiable almost everywhere and $\mathrm{Lip}(h) = \sup_{\mathbf{y}} \|J_h(\mathbf{y})\|_2$, which converts the Jacobian bound into the corresponding Lipschitz bound. Thus, (18) holds. ∎
Lemma 3. For any $f$, if …, we have

(19)
Lemma 4. For any $f$, if …, then $\exists \dots$, s.t. $\forall \dots$, …
(Proof of Lemma 4) Note that $f$ is Lipschitz continuous, so by the Rademacher lemma it is differentiable almost everywhere, and the claimed bound follows from the corresponding bound on its Jacobian. The derivation uses the following fact: for a real matrix $A$, $\|A\|_2 = \|A^\top\|_2$, because $\|A\|_2 = \sqrt{\lambda_{\max}(A^\top A)} = \sqrt{\lambda_{\max}(A A^\top)} = \|A^\top\|_2$. ∎