Preconditioned training of normalizing flows for variational inference in inverse problems

01/11/2021 ∙ by Ali Siahkoohi, et al. ∙ Microsoft Georgia Institute of Technology UMC Utrecht 8

Obtaining samples from the posterior distribution of inverse problems with expensive forward operators is challenging especially when the unknowns involve the strongly heterogeneous Earth. To meet these challenges, we propose a preconditioning scheme involving a conditional normalizing flow (NF) capable of sampling from a low-fidelity posterior distribution directly. This conditional NF is used to speed up the training of the high-fidelity objective involving minimization of the Kullback-Leibler divergence between the predicted and the desired high-fidelity posterior density for indirect measurements at hand. To minimize costs associated with the forward operator, we initialize the high-fidelity NF with the weights of the pretrained low-fidelity NF, which is trained beforehand on available model and data pairs. Our numerical experiments, including a 2D toy and a seismic compressed sensing example, demonstrate that thanks to the preconditioning considerable speed-ups are achievable compared to training NFs from scratch.



There are no comments yet.


page 3

page 8

page 18

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Our aim is to perform approximate Bayesian inference for inverse problems characterized by computationally expensive forward operators, , with a data likelihood, :


where is the unknown model, the observed data, and the measurement noise. Given a prior density, , variational inference (VI, jordan1999introduction) based on normalizing flows (NFs, rezende2015variational) can be used where the Kullback-Leibler (KL) divergence is minimized between the predicted and the target—i.e., high-fidelity, posterior density (liu2016stein; kruse2019hint; rizzuti2020SEGpub; siahkoohi2020TRfuqf; sun2020deep):


In the above expression, denotes a NF with parameters and a Gaussian latent variable

. The above objective consists of the data likelihood term, regularization on the output of the NF, and a log-determinant term that is related to the entropy of the NF output. The last term is necessarily to prevent the output of the NF from collapsing on the maximum a posteriori estimate. For details regarding the derivation of the objective in Equation (

2), we refer to Appendix A. During training, we replace the expectation by Monte-Carlo averaging using mini-batches of . After training, samples from the approximated posterior, , can be drawn by evaluating for (kruse2019hint). It is important to note that Equation (2) trains a NF specific to the observed data . While the above VI formulation in principle allows us to train a NF to generate samples from the posterior given a single observation

, this variational estimate requires access to a prior density, and the training calls for repeated evaluations of the forward operator,

, as well as the adjoint of its Jacobian,

. As in multi-fidelity Markov chain Monte Carlo (MCMC) sampling

(marzouk2018multifidelity), the costs associated with the forward operator may become prohibitive even though VI-based methods are known to have computational advantages over MCMC (blei2017variational).

Aside from the above computational considerations, reliance on having access to a prior may be problematic especially when dealing with images of the Earth’s subsurface, which are the result of complex geological processes that do not lend themselves to be easily captured by hand-crafted priors. Under these circumstances, data-driven priors—or even better data-driven posteriors obtained by training over model and data pairs sampled from the joint distribution,

—are preferable. More specifically, we follow kruse2019hint, kovachki2020conditional, and baptista2020adaptive, and formulate the objective function in terms of a block-triangular conditional NF, , with latent space :


Thanks to the block-triangular structure of , samples of the approximated posterior, can be drawn by evaluating for (marzouk2016sampling). Unlike the objective in Equation (2), training does not involve multiple evaluations of and , nor does it require specifying a prior density. However, its success during inference heavily relies on having access to training pairs from the joint distribution, . Unfortunately, unlike medical imaging, where data is abundant and variability among patients is relatively limited, samples from the joint distribution are unavailable in geophysical applications. Attempts have been made to address this lack of training pairs including the generation of simplified artificial geological models (mosser2018stochastic), but these approaches cannot capture the true heterogeneity exhibited by the Earth’s subsurface. This is illustrated in Figure 5, which shows several true seismic image patches drawn from the Parihaka dataset. Even though samples are drawn from a single data set, they illustrate significant differences between shallow (Figures (a)a and (b)b) and deeper (Figures (c)c and (d)d) sections.

Figure 5: Subsurface images estimated from a real seismic survey, indicating strong heterogeneity between (a), (b) shallow and (c), (d) deep parts of the same survey area.

To meet the challenges of computational cost, heterogeneity and lack of access to training pairs, we propose a preconditioning scheme where the two described VI methods are combined to:

  1. take maximum advantage of available samples from the joint distribution , to pretrain by minimizing Equation (3). We only incur these costs once, by training this NF beforehand. As these samples typically come from a different (neighboring) region, they are considered as low-fidelity;

  2. exploit the invertibility of , which gives us access to a low-fidelity posterior density, . For a given , this trained (conditional) prior can be used in Equation (2);

  3. initialize with weights from the pretrained

    . This initialization can be considered as an instance of transfer learning

    (yosinski2014transferable), and we expect a considerable speed-up when solving Equation (2). This is important since it involves inverting , which is computationally expensive.

2 Related work

In the context of variational inference for inverse problems with expensive forward operators, herrmann2019NIPSliwcuc

train a generative model to sample from the posterior distribution, given indirect measurements of the unknown model. This approach is based on an Expectation Maximization technique, which infers the latent representation directly instead of using an inference encoding model. While that approach allows for inclusion of hand-crafted priors, capturing the posterior is not fully developed. Like


, we also use a block-triangular map between the joint model and data distribution and their respective latent spaces to train a network to generate samples from the conditional posterior. By imposing an additional monotonicity constraint, these authors train a generative adversarial network

(GAN, Goodfellow2014) to directly sample from the posterior distribution. To allow for scalability to large scale problems, we work with NFs instead, because they allow for more memory efficient training (vandeLeemput2019MemCNN; putzky2019invert; peters2020fullysensing; peters2020fully). Our contribution essentially corresponds to a reformulation of parno2018transport and marzouk2018multifidelity. In that work, transport-based maps are used as non-Gaussian proposal distributions during MCMC sampling. As part of the MCMC, this transport map is then tuned to match the target density, which improves the efficiency of the sampling. marzouk2018multifidelity extend this approach by proposing a preconditioned MCMC sampling technique where a transport-map trained to sample from a low-fidelity posterior distribution is used as a preconditioner. This idea of multi-fidelity preconditioned MCMC inspired our work where we setup a VI objective instead. We argue that this formulation can be faster and may be easier to scale to large-scale Bayesian inference problems (blei2017variational).

Finally, there is a conceptional connection between our work and previous contributions on amortized variational inference (gershman2014amortized), including an iterative refinement step (hjelm2016iterative; pmlrv84krishnan18a; pmlrv80kim18e; marino2018iterative). Although similar in spirit, our approach is different from these attempts because we adapt the weights of our conditional generative model to account for the inference errors instead of correcting the inaccurate latent representation of the out-of-distribution data.

3 Multi-fidelity preconditioning scheme

For an observation , we define a NF as


where we obtain by training through minimizing the objective function in Equation (3). To train , we use available low-fidelity training pairs . We perform this training phase beforehand, similar to the pretraining phase during transfer learning (yosinski2014transferable). Thanks to the invertibility of , it provides an expression for the posterior. We refer to this posterior as low-fidelity because the network is trained with often scarce and out-of-distribution training pairs. Because the Earth’s heterogeneity does not lend itself to be easily captured by hand-crafted priors, we argue that this NF can still serve as a (conditional) prior in Equation (2):


To train the high-fidelity NF given observed data , we minimize the KL divergence between the predicted and the high-fidelity posterior density, (liu2016stein; kruse2019hint)


where the prior density of Equation (5) is used. Notice that this minimization problem differs from the one stated in Equation (2). Here, the optimization involves “fine-tuning” the low-fidelity network parameters introduced in Equation (3). Moreover, this low-fidelity network is also used as a prior. While other choices exist for the latter—e.g., it could be replaced or combined with a hand-crafted prior in the form of constraints (peters2018pmf) or by a separately trained data-driven prior (mosser2018stochastic), using the low-fidelity posterior as a prior (cf. Equation (5)) has certain distinct advantages. First, it removes the need for training a separate data-driven prior model. Second, use of the low-fidelity posterior may be more informative (yang2018conditional) than its unconditional counterpart because it is conditioned by the observed data . In addition, our multi-fidelity approach has strong connections with online variational Bayes (zeno2018task) where data arrives sequentially and previous posterior approximates are used as priors for subsequent approximations.

In summary, the problem in Equation (6) can be interpreted as an instance of transfer learning (yosinski2014transferable) for conditional NFs. This formulation is particularly useful for inverse problems with expensive forward operators, where access to high fidelity training samples, i.e. samples from the target distribution, is limited. In the next section, we present two numerical experiments designed to show the speed-up and accuracy gained with our proposed multi-fidelity formulation.

4 Numerical experiments

We present two synthetic examples aimed at verifying the anticipated speed-up and increase in accuracy of the predicted posterior density via our multi-fidelity preconditioning scheme. The first example is a two-dimensional problem where the posterior density can be accurately and cheaply sampled via MCMC. The second example demonstrates the effect of the preconditioning scheme in a seismic compressed sensing (candes2006stable; donoho2006compressed)

problem. Details regarding training hyperparameters and the NF architectures are included in Appendix B. Code to reproduce our results are made available on

GitHub. Our implementation relies on InvertibleNetworks.jl (invnet), a recently-developed memory-efficient framework for training invertible networks in the Julia programming language.

4.1 2D toy example

To illustrate, the advantages of working with our multi-fidelity scheme, we consider the 2D Rosenbrock distribution, , plotted in Figure (a)a. High-fidelity data are generated via , where and is a forward operator. To control the discrepancy between the low- and high-fidelity samples, we set equal to , where is the spectral radius of ,

has independent and normally distributed entries, and

. By choosing smaller values for , we make

more dissimilar to the identity matrix, therefore increasing the discrepancy between the low- and high-fidelity posterior.

Figure (b)b depicts the low- (purple) and high-fidelity (red) data densities. The dark star represents the unknown model. Low-fidelity data samples are generated with the identity as the forward operator. During the pretraining phase conducted beforehand, we minimize the objective function in Equation (3) for epochs.

Figure 11: (a) Prior, (b) low- and high-fidelity data, (c) low- (blue) and high-fidelity (orange) approximated posterior densities, and (d) approximated posterior densities via MCMC (dark circles), and objectives in Equations (2) in green and (6) in orange. (e) Objective value during training via Equations (2) in green and (6) in orange.

The pretrained low-fidelity posterior is subsequently used to precondition the minimization of (6) given observed data . The resulting low- and high-fidelity estimates for the posterior as plotted in Figure (c)c differ significantly. In Figure ((d)d), the accuracy of the proposed method is verified by comparing the approximated high-fidelity posterior density (orange contours) with the approximation (in green) obtained by minimizing the objective of Equation (2). The overlap between the orange contours and the green shaded area confirms consistency between the two methods. To assess the accuracy of the estimated densities themselves, we also include samples from the posterior (dark circles) obtained via stochastic gradient Langevin dynamics (welling2011bayesian), an MCMC sampling technique. As expected, the estimated posterior densities with and without the preconditioning scheme are in agreement with the MCMC samples.

Finally, to illustrate the performance our multi-fidelity scheme, we consider the convergence plot in Figure (e)e where the objective values of Equations (2) and (6) are compared. As explained in Appendix A, the values of the objective functions correspond to the KL divergence (plus a constant) between the posterior given by Equation (2) and the posterior distribution obtained by our multi-fidelity approach (Equation (6)). As expected, the multi-fidelity objective converges much faster because of the “warm start”. In addition, the updates of via Equation (6) succeed in bringing down the KL divergence within only five epochs (see orange curve), whereas it takes epochs via the objective in Equation (2) to reach approximately the same KL divergence. This pattern holds for smaller values of too as indicated in Table 1. According to Table 1, the improvements by our multi-fidelity method become more pronounced if we decrease the . This behavior is to be expected since the samples used for pretraining are more and more out of distribution in that case. We refer to Appendix C for additional figures for different values of .

Low-fidelity Without preconditioning With preconditioning
Table 1: The KL divergence (plus some constant, see Appendix A) between different estimated posterior densities and high-fidelity posterior distribution for different values of . Second column corresponds to the low-fidelity posterior estimate obtained via Equation (3), third column relates to the posterior estimate via Equation (2), and finally, the last column shows the same quantity for the posterior estimate via the multi-fidelity preconditioned scheme.

4.2 Seismic compressed sensing example

This experiment is designed to show challenges with geophysical inverse problems due to the Earth’s strong heterogeneity. We consider the inversion of noisy indirect measurements of image patches sampled from deeper parts of the Parihaka seismic dataset. The observed measurements are given by where . For simplicity, we chose with a compressing sensing matrix with

% subsampling. The measurement vector

corresponds to a pseudo-recovered model contaminated with noise.

To mimic a realistic situation in practice, we change the likelihood distribution by reducing the standard deviation of the noise to

in combination with using image patches sampled from the shallow part of the Parihaka dataset. As we have seen in Figure 5, these patches differ in texture. Given pairs , we pretrain our network by minimizing Equation (3). Figures (a)a and (b)b contain a pair not used during pretraining. Estimates for the conditional mean and standard deviation obtained by drawing samples from the pretrained conditional NF for the noisy indirect measurement (Figure (b)b) are included in Figures (c)c and (d)d. Both estimates exhibit the expected behavior because the examples in Figure (a)a and (b)b are within the distribution. As anticipated, this observation no longer holds if we apply this pretrained network to indirect data depicted in Figure (f)f, which is sampled from the deeper part. However, these results are significantly improved when the pretrained network is fine-tuned by minimizing Equation (6). After fine tuning, the fine details in the image are recovered (compare Figures (g)g and (h)h). This improvement is confirmed by the relative errors plotted in Figures (i)i and (j)j, as well as by the reduced standard deviation (compare Figures (k)k and (l)l).

Figure 24: Seismic compressed sensing. First row indicates the performance of the pretrained network on data not used during pretraining. Second row is compares the pretraining-based recovery with the accelerated scheme result. Last row compares the recovery errors and pointwise STDs for the two recoveries.

5 Conclusions

Inverse problems in fields such as seismology are challenging for several reasons. The forward operators are complex and expensive to evaluate numerically while the Earth is highly heterogeneous. To handle this situation and to quantify uncertainty, we propose a preconditioned scheme for training normalizing flows for Bayesian inference. The proposed scheme is designed to take full advantage of having access to training pairs drawn from a joint distribution, which for the reasons stated above is close but not equal to the actual joint distribution. We use these samples to train a normalizing flow via likelihood maximization leveraging the normalizing property. We use this pretrained low-fidelity estimate for the posterior as a prior and preconditioner for the actual variational inference on the observed data, which minimizes the Kullback-Leibler divergence between the predicted and the desired posterior density. By means of a series of examples, we demonstrate that our preconditioned scheme leads to considerable speed-ups compared to training a normalizing flow from scratch.


Appendix A Mathematical derivations


be a bijective transformation that maps a random variable

to . We can write the change of variable formula (villani2009optimal)

that relates probability density functions

and in the following manner:


This relation serves as the basis for the objective functions used throughout this paper.

a.1 Derivation of Equation (2)

In Equation 2, we train a bijective transformation, denoted by , that maps the latent distribution to the high-fidelity posterior density . We optimize the parameters of by minimizing the KL divergence between the push-forward density (bogachev2006measure), denoted by , and the posterior density:


In the above expression, we can rewrite the expectation with respect to as the expectation with respect to the latent distribution, followed by a mapping via —i.e.,


The last term in the expectation above can be further simplified via the change of variable formula in Equation (7). If , then:


The last equality in Equation (10) holds due to the invertibility of and the differentiability of its inverse (inverse function theorem). By combining Equations (9) and (10), we arrive at the following objective function for training :


Finally, by ignoring the term, which is constant with respect to , using Bayes’ rule, and inserting our data likelihood model from Equation (1), we derive Equation (2):


Next, based on this equation, we derive the objective function used in the pretraining phase.

a.2 Derivation of Equation (3)

The derivation of objective in Equation (3) follows directly from the change of variable formula in Equation 7, applied to a bijective map, , where and are Gaussian latent spaces. That is to say:


Given (low-fidelity) training pairs, , the maximum likelihood estimate of is obtained via the following objective:


that is the objective function in Equation (3). The NF trained via the objective function, given samples from the latent distribution, draws samples from the low-fidelity joint distribution, .

By construction, is a block-triangular map—i.e.,


kruse2019hint showed that after solving the optimization problem in Equation (3), approximates the well-known triangular Knothe-Rosenblat map (santambrogio2015optimal). As shown in marzouk2016sampling, the triangular structure and ’s invertibility yields the following property,



denotes the low-fidelity posterior probability density function. The expression above means we can get access to low-fidelity posterior distribution samples by simply evaluating

for for a given observed data .

Appendix B Training details and network architectures

For our network architecture, we adapt the recursive coupling blocks proposed by kruse2019hint, which use invertible coupling layers from dinh2016density in a hierarchical way. In other words, we recursively divide the incoming state variables and apply an affine coupling layer. The final architecture is obtained by composing several of these hierarchical coupling blocks. The hierarchical structure leads to dense triangular Jacobian, which is essential in representation power of NFs (kruse2019hint).

For all examples in this paper, we use the hierarchical coupling blocks as described in kruse2019hint. The affine coupling layers within each hierarchal block contain a residual block as described in he2016deep. Each residual block has the following dimensions: input, hidden, and output channels, except for the first and last coupling layer where we have input and output channels, respectively. We use the Wavelet transform and its transpose before feeding seismic images into the network and after the last layer of the NFs.

Below, we describe the network architectures and training details regarding the two numerical experiments described in the paper. Throughout the experiments, we use the Adam optimization algorithm (kingma2015adam).

b.1 2D toy example

We use 8 hierarchal coupling blocks, as described above for both and (Equation (3)). As a result, due to our proposed method in Equation (4), we choose the same architecture for (Equation (2)).

For pretraining according to Equation (3), we use low-fidelity joint training pairs, . We minimize Equation (3) for epochs with a batch size of and a (starting) learning rate of . We decrease the learning rate each epoch by a factor of .

For the preconditioned step—i.e., solving Equation 6, we use latent training samples. We train for epochs with a batch size of and a learning rate . Finally, as a comparison, we solve the objective in Equation 6 for a randomly initialized NF with the same latent training samples for epochs. We decrease the learning rate each epoch by .

b.2 Seismic compressed sensing

We use 12 hierarchal coupling blocks, as described above for both , , and we use the same architecture for as .

For pretraining according to Equation (3), we use low-fidelity joint training pairs, . We minimize Equation (3) for epochs with a batch size of and a starting learning rate of . Once again, we decrease the learning rate each epoch by .

For the preconditioned step—i.e., solving Equation 6, we use latent training samples. We train for epochs with a batch size and a learning rate of , where we decay the step by in every th epoch.

Appendix C 2D toy example—more results

Here we show the effect on our proposed method in the 2D toy experiment. By choosing smaller values for , we make with less close to the identity matrix, hence enhancing the discrepancy between the low- and high-fidelity posterior. The first row of Figure 37 shows the low- (purple) and high-fidelity (red) data densities for decreasing values of from down to . The second row depicts the predicted posterior densities via the preconditioned scheme (orange contours) and the low-fidelity posterior in green along with MCMC samples (dark circles). The third row compares the preconditioned posterior densities to samples obtained via the low-fidelity pretrained NF—i.e., Equation (3). Finally, the last row shows the objective function values during training with (orange) and without (green) preconditioning.

Figure 37: 2D toy example with decreasing values of from first to last column, respectively. First row: Low- and high-fidelity data. Second row: approximated posterior densities via MCMC (dark circles), and objectives in Equations (2) and (6). Third row: overlaid prior density with predicted posterior densities via objectives in Equations (3) and (6). Last row: training objective values during training via Equations (2) and (6).

We observe that by decreasing from to , the low-fidelity posterior approximations become worse. As a result, the objective function for the preconditioned approach (orange) at the beginning start from a higher value, indicating more mismatch between low- and high-fidelity posterior densities. Finally, our preconditioning method consistently improves upon low-fidelity posterior by training for epochs.

Appendix D Seismic compressed sensing—more results

Here we show more examples to verify the pretraining phase obtained via solving the objective in Equation (2). Each row in Figure 54 corresponds to a different testing image. The first column shows the different true seismic images used to create low-fidelity compressive sensing data, depicted in the second column. The third and last columns correspond to the conditional mean and pointwise STD estimates, respectively. Clearly, the pretrained network is able to successfully recover the true image, and consistently indicates more uncertainty in areas with low-amplitude events.

Figure 54: Seismic compressed sensing for four different low-fidelity images. First column: true seismic images. Second column: low-fidelity observed data. Third and last columns: conditional mean and pointwise STD estimates obtained by drawing samples from the pretrained conditional NF.