Bayesian inference is a statistical inference framework where a priorprobability distribution of some unknown variable gets updated as more observations become available. The task of sampling from such a posterior probability distribution can be extremely challenging, in particular when the unknown variable is either very high-dimensional, or the underlying model is very complex and expensive to evaluate (e.g. chaotic dynamical systems, SDEs, PDEs). The problem becomes even harder when observations arrive in a streaming form and Bayesian inference has to be performed sequentially (a.k.a. filtering). Standard attempts to solve these tasks include MCMC algorithms, SMC techniques, variational inference approximations gilks1995markov; doucet2009tutorial; blei2017variational and many remarkable improvements of these contributions have been proposed in the last decades.
A more recent approach entails the use of optimal transport villani2008optimal in order to map a reference density (e.g. Gaussian) to a target density (e.g. posterior) marzouk2016sampling. One can define a possible class of transport maps and, within this class, seek for an optimal map such that the correspondent push-forward density of the reference approximates the target as closely as possible. Samples from the target density can then be recovered by applying the transport map to samples from the reference density. The class of transport maps that is chosen characterizes the method in use. For example, marzouk2016sampling introduces polynomial maps within a Knothe-Rosenblatt rearrangement structure, whereas liu2016stein; detommaso2018stein exploit variational characterizations embedded in a RKHS.
In this paper, we propose different ways to use Invertible Neural Networks (INNs) within an optimal transport perspective. INNs have recently been introduced in dinh2016density as class of neural networks characterized by an invertible architecture. Crucially, every layer combines orthogonal transformations and triangular maps, which ensures that the determinant of the Jacobian of the overall map can be computed at essentially no extra cost during the forward pass. Here, we extend the results in dinh2016density; ardizzone2018analyzing and we introduce and compare different transport constructions with the goal of sampling from a posterior distribution. Finally we propose HINT, a Hierarchical Invertible Neural Transport method where a hierarchical INN architecture is exploited within a Knothe-Rosenblatt structure in order to densify the input-to-output dependence of the transport map and to allow sampling from the posterior. One of the advantages of HINT is that model evaluations can be performed fully offline and the gradient of the model is not required, which is very convenient when models and their gradients (e.g. PDEs and their adjoint operators) are very expensive to compute. In addition, analytical evaluations of the prior density are never required, which makes HINT ideal for Sequential Bayesian inference.
provides a transport perspective, where new constructions are derived and studied, and a statistical error analysis of the transport estimator in3.3. Section 4 introduces HINT, a novel hierarchical architecture, and how to use it to sample from the posterior. Section 5 describes HINT in a sequential Bayesian framework. Section 6 studies the performance of the algorithm on two challenging inference problems. We finish with some conclusion in section 7.
2 Invertible Neural Networks
Invertible Neural Networks (INNs) have been introduced in dinh2016density as a special case of deep neural networks whose architecture allows for trivial inversion of the map from input to output. One can describe an INN as an invertible map
defined on some vector space, characterized by the composition of invertible layers , i.e. , for and . The architecture of each layer can be described as follows:
where is an arbitrary splitting, denotes element-wise exponential operation, are orthogonal matrices, i.e. , and are arbitrarily complex neural networks. In practice, we will take to be series of Householder reflections and ,
as sequences of fully-connected layers with leaky ReLU activations. We denote byall the parameters within and that we want to learn across all layers, which will in turn parametrize the map .
We emphasize that each layer is a composition of an orthogonal transformation and a triangular map, where the latter is better known in the field of transport maps as Knothe-Rosenblatt rearrangement marzouk2016sampling
. This factorization can be seen as a non-linear generalization to a classic QR decompositionstoer2013introduction
. Whereas the triangular part encodes the possibility to represent non-linear transformations, the orthogonal part reshuffles the entries to foster dependence of each part of the input to the final output, thereby drastically increasing the representation power of the map.
Let us call , and . Then
where denotes the element-wise sum.
A proof is given in appendix A. Lemma 1 crucially shows that the determinant of the Jacobian of an INN can be calculated for free during the forward pass. The importance of this will become clear in Section 3.
We remark that the expressions in (1) are trivially invertible:
where corresponds to the splitting of . Note that and do not need to be invertible.
2.1 About the extendibility to non-linear .
As previously highlighted, every layer in (1) could be seen as a QR-type decomposition with constant . However, one could argue that in order to increase the representation power of the decomposition, also should be taken to be a non-linear function of . In this section, we generalize to , where is the triangular part of the map in (1), whereas the function should generalize the role of . If , we recover the architecture in . In addition, we have . Then, we would like to choose to be a smooth and easily invertible transformation such that
is an orthogonal matrix, so thatcan still be computed at no extra cost. Such transformations belong to the more general class of conformal maps ahlfors2010conformal. In fact, any conformal map in is characterized by , where is assumed to exist almost everywhere. Unfortunately, Liouville’s theorem revsetnjak1967liouville shows that, for any , the only conformal maps are Möbius transformations of the form
where , , and is an orthogonal matrix. Note that, for both and , , and Lemma 1 can be easily generalized.
3 A transport perspective for general Bayesian inference
, INNs were used to sample from a posterior probability distribution. In this section, we will generalize the results inardizzone2018analyzing via the mathematical framework of transport maps. We will suggest different procedures and architectures which involve INNs and we will propose a new algorithm to perform sampling from a posterior distribution.
3.1 A general Bayesian framework
Let us denote by some hidden parameters of interest and by some observations. We respectively denote by , , and
the prior density, the likelihood function, the joint density and the evidence. By Bayes’ theorem, the posterior density can be expressed as.
All the results we will present hold for any possible likelihood function , in contrast with the results in ardizzone2018analyzing which hold exclusively for the Dirac delta likelihood , where is some non-linear operator. However, for sake of clarity, throughout the paper we will focus on additive Gaussian noise relations, i.e. , for some standard Gaussian noise. It immediately follows that .
Furthermore, we introduce a latent variable , whose dimension will be specified below. Although any treatable density can be chosen for , we will practically use a standard Gaussian .
3.2 A transport perspective
Suppose that we want to approximate a target density . A possible approach is to seek for an invertible transport map such that the pushforward density of some reference density approximates, in some sense, the target .111For any invertible map and probability density , the pushforward density of is given by . We restrict the attention to parametric families of transport maps , for . We define the function
is the Kullback-Leibler divergence between two probability densities. One would like to minimize the functionover ; if the family of maps is rich enough, the KL divergence will approach zero and .
In general, the family of maps can contain any invertible function approximator that can be trained over to push a treatable reference to the target . Different choices for such a family have been proposed in the literature, e.g. polynomials regressors marzouk2016sampling
, Tensor-Train representationsdolgov2018approximation and variational approximations liu2016stein; detommaso2018stein. The accuracy of will depend on how well represents a map between the two densities. In this paper, we propose INNs as a suitable choice for that, at the same time, retains the universal approximation properties of deep neural networks and makes the minimization tractable because of the triangular architecture within the layers . In fact, the main computational bottleneck in the minimization of the expression in (4) is that, as we will see, it involves the evaluation of at every input of the training set and at every training step of the algorithm, which quickly becomes practically unfeasible. Lemma 1 showed how, for INNs, this expression can be calculated essentially at no extra cost during the forward pass.
Different algorithms can be constructed depending on the choices of and . Although the ultimate goal is to sample from the posterior , for some specific combination of densities and this can be done in two phases: first, train a transport map between the two densities; then, recover posterior samples via specific conditional procedures. We will study three cases, which are represented in Figure (2).
In both cases 1⃝ and 3⃝, the family of maps is trained over independently of the actual observation . After training is performed, samples from can be recovered for any contingent by a conditional procedure. When is taken to be an INN, case 1⃝ corresponds to an extension of ardizzone2018analyzing to general likelihood functions . Case 3⃝ is related to marzouk2016sampling, where families of polynomial maps are employed, and will be studied further in section 4. On the other hand, case 2⃝ attempts to sample directly from the posterior , therefore training requires the actual observation . For a different observation, training should be recomputed. Stein variational inference liu2016stein; detommaso2018stein closely relates to this case. In the following theorem, for each case we derive an explicit loss upon which a family of maps can be trained. Algorithms to sample from are given in appendix B.
Given a parametric family of invertible maps , minimizing the function in (4) over is equaivalent to minimize the following losses :
Let , and assume . Let and . Then
where , with and .
Let , and an observed . Let and . Then
Let , , and . Then
In practice, expectations within the losses of Theorem 1 should be approximated. A sensible choice is to exploit Monte Carlo samples, which can be directly obtained in each of the three cases and will play the role of training set for the family of maps . For both cases 1⃝ and 3⃝, these samples, together with potentially expensive evaluations of the model , can be performed a priori of the training phase. In addition, no gradient of is required during training, which allows to be completely treated as a black box. Vice versa, the loss in case 2⃝ involve the term , which does not allow offline evaluations of and requires its gradient during training.
The minimization of the first term in the loss of case 1⃝ can be challenging. First, has to be trained to be a surrogate of , which can be difficult if is particularly complicated or exhibits an unstable (e.g. chaotic) behaviour. In addition, either if is very small or is very large, the loss can become very steep and difficult to optimize. Similar issues are present in case 2⃝. In contrast, the loss in case 3⃝ strikes for its simplicity. The first term in the loss pushes all samples towards the mode of the standard Gaussian density , whereas the second term is a repulsion force that maintains them apart. The information about the model and the probabilistic relation between hidden space and observation space is completely contained within the training set sampled from .
Furthermore, we observe that, unlike in case 2⃝, cases 1⃝ and 3⃝ never require evaluations from the prior but only samples from it. This makes them very suitable for sequential Bayesian inference, as we will study further in section 5.
In defence of case 2⃝, if we want to sample from a specific posterior and if none of the issues raised above constitute a major difficulty, for particular applications this case may result faster than the other two, because the family of maps is trained to map samples directly to the posterior. In addition, unlike 1⃝ and 3⃝, case 2⃝ does not require an explicit expression for .
3.3 Consistency and Central Limit Theorem of the transport map estimator
Here, we analyze asymptotic properties of the statistical error due to the Monte Carlo approximation of the losses in Theorem 1.
Let us take , and assume , where is a compact set. Although this is a technical limitation, it is also fully reasonable in practice, as trainable parameters are often clamped within a defined range to avoid instability. Let us introduce to be the integrand in the loss , for some probability density . We denote by the Monte Carlo estimator of , where
are independent random variables. Suppose there exist strict global minimizersand in the interior of . We remark that both and are random variables depending on . Let us assume that the transport map is continuous and differentiable with respect to . Furthermore, we impose the regularity conditions and
, which ensure the applicability of the Uniform Law of Large Numbers. We will respectively denote convergence in probability and distribution forby and .
With the assumptions above, for any input , the transport map estimator is consistent, i.e. . Furthermore, satisfies the following Central Limit Theorem:
satisfies the following Central Limit Theorem:
4 HINT: Hierarchical Invertible Neural Transport
In this section, we propose an algorithm to sample from a posterior density via case 3⃝, which we discussed to be appealing for several reasons.
4.1 From joint to posterior via Knothe-Rosenblatt transport maps
Consider an invertible transport map and suppose that it can be rewritten as a Knothe-Rosenblatt rearrangement, so that for , and . Let us denote by its inverse, i.e. and , for .
Observe that we can split the latent density as , where and respectively correspond to the marginal density of and the conditional density of given . Because we chose to be a standard Gaussian, we further have . Finally, assume that exactly, or equivalently . Then, it was shown in marzouk2016sampling that and .
The result above suggests that, given a map satisfying the conditions above, a posterior sample can be simply achieved by calculating , for . Intuitively, whereas the simple application of the map to a sample would provide a sample from the joint , the act of fixing to be the anti-image of through makes sure that the resulting sample comes from the posterior . A simplified visualization of this procedure is displayed in Figure 3.
4.2 Hierarchical architecture
The goal is to introduce an architecture which endows with a Knothe-Rosenblatt structure in order to apply the sampling procedure described in section 4.1. In fact, such a structure is not satisfied by a general INN architecture in (1) because of the orthogonal transformations , which are essential to have a large representation power of the network. In order to overcome this issue, we proceed as follows: first, we develop a hierarchical generalization of the architecture in (1) to embolden a rich input-to-output dependence within each layer ; then, we enforce the last level of each hierarchy to be triangular, so that the overall map satisfies the desired structure.
Intuitively, we want to recursively nest INNs within each other in order to perform multiple coordinate splittings and therefore densify the architecture structure. In order to characterize this nesting procedure, for each layer we define a binary tree of splittings. Each entry refers to a splitting coordinate and to the sub-tree of its children. We denote the tree root by . Given , let us denote by and its direct children. Also, we denote by the cardinality of the tree, i.e. the number of splittings. Let us define the following hierarchical architecture:
with the initial condition for . is an arbitrary orthogonal matrix, whereas and are arbitrarily complex neural networks. Note that, for , (8) corresponds to (1). Figure 3(a) displays the architecture of a hierarchical layer , whereas Figure 3(b) shows the layout of the Jacobian for a tree with , with respect to the transformed inputs and . Analogously to hierarchical matrices, the architecture in (8) densifies the dependence between input and output of by recursively nesting INNs layers within themselves.
Let us define and the overall map . It is easy to check that can be recursively decomposed to assume a similar structure as in Lemma 1 and calculated essentially for free during the forward pass.
Given the hierarchical construction of defined above, we can enforce the overall map to retain the desired Knothe-Rosenblatt structure by, for each : (a) defining to split between the variables and ; (b) taking . It immediately follows that, for any input , we can split , with and . Hence, after training is performed upon the loss in (7), we can apply the procedure described in section 4.1 to sample from a posterior density . We address the overall algorithm as Hierarchical Invertible Neural Transport (HINT). An implementation is given in appendix B.
5 HINT for sequential Bayesian inference
Sequential Bayesian inference typically describes a dynamical framework where data arrives in a streaming form. We denote by a sequence of data points at times . Analogously, we assume that there is an underlying sequence of corresponding hidden states . Model dependencies are defined through the graphical model in Figure 5.
By Bayes’ theorem and the assumed dependence structure, we have
Equation (10) is usually known as the prediction step, while equation (9) is called assimilation (or updating) step. Here, is the desired posterior density, is the likelihood function, whereas plays the role of prior density given the previous observations . The analytic expression of the prior is not given explicitly, but rather through an expectation over the previous posterior . If we have samples from the previous posterior, we can estimate the expectation via Monte Carlo, however this can easily be very inaccurate if the transition density is complicated or very concentrated. In the limit case where the relation between and is deterministic, i.e. is a Dirac delta function, a Monte Carlo estimate is not even possible. In addition to this problem, in many applications of interest, every evaluation of the transition density requires the solution of a very expensive model, and the evaluation of the prior density through a Monte Carlo approximation would become too expensive for online prediction.
For those reasons, we would like to have an algorithm that is able to sequentially generate samples from the posterior but does not need to evaluate the prior density . Theorem 3 shows how to generalize the results for HINT (case 3⃝) in Theorem 1 to sequential Bayesian inference and how only samples from are required, but never an analytical evaluation of it. An analogous result can be derived for case 1⃝. As before, although the results can be achieved for general probability densities, we focus on additive Gaussian noise relations and , for some non-linear operators , , some standard Gaussian noises , and standard deviations and . This immediately implies and .
Let be a parametric family of invertible transport map from to . Suppose we observed and denote . Let us define
Then, minimizing in (11) over is equivalent to minimize the following loss:
A proof follows directly from the proof of Theorem 1 in appendix A. Importantly, the minimization of the loss in (12) at time should be initialized at the optimal value of at time . In fact, if the geometry of the posterior does not change much, the new optimal value is going to be close to the previous one, and the training very short.
6 Numerical experiments
In this section, we compare the performance of standard INN (case 1⃝) and HINT (case 3⃝) on two challenging numerical experiments in a Bayesian inference framework. In both cases, we use approximately parameters. We use HINT with a coarse hierarchical depth and show that this suffices to compare favorably to INN.
6.1 Competitive Lotka-Volterra and unobserved species prediction over time
Competitive Lotka-Volterra is a generalization of the classic Lotka-Volterra which describes the demographical interaction of species competing on common resources. It is given by
for , where is the size of the -th species at a given time, is its growth rate and describes the interaction with the other species. We take and set parameters . Observations are taken at times , for , and we set . The solution of (13) between characterizes the transition model , and , where and . We observe a perturbed number of the first three species with and , where is a realization of the process with . We set an initial prior . The goal is to sequentially recover the posterior densities and to predict the unobserved component .
At , we estimate the mean-squared-error (MSE) of the trace of the posterior covariance matrix both varying the number of training samples and training steps. Figure 6
(top row) shows that both standard INN and HINT converge to very small values of the MSE. INN appears to largely struggle for small training sizes, but seems to perform slightly better for many training epochs.
Figure 6 (bottom row) describes the sequential prediction of , which shows a largely better performance of HINT over INN. Methods were trained with 64000 samples for 50 epochs.
6.2 Lorenz96 transition and log-Rosenbrock observation models
Lorenz96 is a chaotic dynamical system characterized by
for . where it is assumed that , and . We take and set (chaotic regime). We start at and take an observation at time . The solution of (14) between characterizes the transition model , and , where and . To make the problem very complicated, we take to be a log-Rosenbrock function in each component, i.e. for , and observe with , where is a realization of the process with . We set an initial prior . The goal is to sample from the posterior at time .
For this hard experiment, INN failed to produce meaningful results. We remark that with better annealing rate of and network configurations, INN may still be able to converge, but this can be tedious and difficult. In contrast, HINT does not have to worry about concentration issues, as highlighted in section 3. Hence, it is much more robust and, importantly, requires much less parameters tweaking. Figure 7 shows its convergence at time .
In this work, we introduced HINT as an algorithm that combines INNs and optimal transport to sample from a posterior distribution in a general and a sequential Bayesian framework. We discussed how the use of HINT over INN can be advantageous for several reasons, and we performed numerical comparisons in two challenging test cases. Further research directions may include the use of Quasi-Monte Carlo training samples for better space-filling and generalization properties dick2013high, and multilevel techniques to beat down the computation cost of generating the training set kuo2017multilevel.
Appendix A Appendix: Proofs
Proof of Lemma 1.
With the notation above, we have
. Then, by chain rule we have
where we denote by the element-wise product and we used that because are orthogonal matrices. ∎
Proof of Theorem 1.
Let us take a family of invertible maps . It is easy to check that
for any densities and , with . We study the following three cases separately.
Let , assume and take . Let and . We observe that the decomposition of the target density as is enforcing independence of from and , in fact . Then
where , with and . By definitions of , , and , we have