1 Introduction
Our aim is to perform approximate Bayesian inference for inverse problems characterized by computationally expensive forward operators, $F$, with a data likelihood, $p(y \mid x)$:

(1) $y = F(x) + \varepsilon, \quad \varepsilon \sim \mathrm{N}(0, \sigma^2 I),$

where $x$ is the unknown model, $y$ the observed data, and $\varepsilon$ the measurement noise. Given a prior density, $p(x)$, variational inference (VI, jordan1999introduction) based on normalizing flows (NFs, rezende2015variational) can be used, where the Kullback–Leibler (KL) divergence is minimized between the predicted and the target—i.e., high-fidelity—posterior density (liu2016stein; kruse2019hint; rizzuti2020SEGpub; siahkoohi2020TRfuqf; sun2020deep):

(2) $\min_{w}\; \mathbb{E}_{z \sim \mathrm{N}(0, I)} \Big[ \frac{1}{2\sigma^2} \big\| F(T_w(z)) - y \big\|_2^2 - \log p(T_w(z)) - \log \big| \det \nabla_z T_w(z) \big| \Big]$
In the above expression, $T_w$ denotes a NF with parameters $w$ and a Gaussian latent variable $z \sim \mathrm{N}(0, I)$. The above objective consists of the data-likelihood term, a regularization term on the output of the NF, and a log-determinant term that is related to the entropy of the NF output. The last term is necessary to prevent the output of the NF from collapsing onto the maximum a posteriori estimate. For details regarding the derivation of the objective in Equation (2), we refer to Appendix A. During training, we replace the expectation by Monte-Carlo averaging using mini-batches of $z \sim \mathrm{N}(0, I)$. After training, samples from the approximated posterior, $p(x \mid y)$, can be drawn by evaluating $T_{\hat w}(z)$ for $z \sim \mathrm{N}(0, I)$ (kruse2019hint). It is important to note that Equation (2) trains a NF specific to the observed data $y$. While the above VI formulation in principle allows us to train a NF to generate samples from the posterior given a single observation, this variational estimate requires access to a prior density, $p(x)$, and the training calls for repeated evaluations of the forward operator, $F$, as well as the adjoint of its Jacobian. As in multi-fidelity Markov chain Monte Carlo (MCMC) sampling
(marzouk2018multifidelity), the costs associated with the forward operator may become prohibitive, even though VI-based methods are known to have computational advantages over MCMC (blei2017variational). Aside from these computational considerations, reliance on having access to a prior may be problematic, especially when dealing with images of the Earth's subsurface, which are the result of complex geological processes that do not lend themselves to being easily captured by handcrafted priors. Under these circumstances, data-driven priors—or, even better, data-driven posteriors obtained by training over model and data pairs sampled from the joint distribution, $p(x, y)$—are preferable. More specifically, we follow kruse2019hint, kovachki2020conditional, and baptista2020adaptive, and formulate the objective function in terms of a block-triangular conditional NF, $G_\theta$, with latent space $z = (z_y, z_x)$:

(3) $\min_{\theta}\; \mathbb{E}_{(x, y) \sim p(x, y)} \Big[ \frac{1}{2} \big\| G_\theta(y, x) \big\|_2^2 - \log \big| \det \nabla G_\theta(y, x) \big| \Big]$

Thanks to the block-triangular structure of $G_\theta$, samples of the approximated posterior can be drawn by evaluating $G_{\hat\theta, x}^{-1}(z_x;\, y)$ for $z_x \sim \mathrm{N}(0, I)$ (marzouk2016sampling). Unlike the objective in Equation (2), training does not involve multiple evaluations of $F$ and the adjoint of its Jacobian, nor does it require specifying a prior density. However, its success during inference heavily relies on having access to training pairs from the joint distribution, $p(x, y)$. Unfortunately, unlike medical imaging, where data is abundant and variability among patients is relatively limited, samples from the joint distribution are unavailable in geophysical applications. Attempts have been made to address this lack of training pairs, including the generation of simplified artificial geological models (mosser2018stochastic), but these approaches cannot capture the true heterogeneity exhibited by the Earth's subsurface. This is illustrated in Figure 5, which shows several true seismic image patches drawn from the Parihaka dataset. Even though the samples are drawn from a single dataset, they exhibit significant differences between shallow (Figures 5a and 5b) and deeper (Figures 5c and 5d) sections.
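To make the conditional sampling mechanism concrete, the following is a minimal numerical sketch in which a hand-built affine block-triangular map stands in for a trained conditional NF; all coefficients and names are illustrative, not the paper's implementation.

```python
import numpy as np

# Toy stand-in for a trained block-triangular map G(y, x) = (G_y(y), G_x(y, x)),
# here with scalar affine blocks; all coefficients are hypothetical.
A_y = 2.0          # G_y(y) = A_y * y
A_x, B = 1.5, 0.7  # G_x(y, x) = A_x * x + B * y

def G_forward(y, x):
    return A_y * y, A_x * x + B * y

def sample_posterior(y_obs, n_samples, rng):
    """Draw x ~ p(x | y_obs) by inverting the x-block at Gaussian latents,
    keeping the observed data y_obs fixed."""
    z_x = rng.standard_normal(n_samples)
    return (z_x - B * y_obs) / A_x  # solve G_x(y_obs, x) = z_x for x

rng = np.random.default_rng(0)
samples = sample_posterior(y_obs=1.0, n_samples=10_000, rng=rng)
```

Because the first block only sees $y$, inverting the second block at fixed $y$ is all that is needed to sample the conditional; the trained $G_\theta$ relies on the same mechanism.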
To meet the challenges of computational cost, heterogeneity, and lack of access to training pairs, we propose a preconditioning scheme that combines the two VI methods described above to:

take maximum advantage of available samples from the joint distribution, $p(x, y)$, to pretrain $G_\theta$ by minimizing Equation (3). We only incur these costs once, by training this NF beforehand. As these samples typically come from a different (neighboring) region, they are considered low-fidelity;

exploit the invertibility of $G_\theta$, which gives us access to a low-fidelity posterior density, $\widetilde p(x \mid y)$. For a given $y$, this trained (conditional) prior can be used in Equation (2);

initialize $T_w$ with weights from the pretrained $G_{\hat\theta}$. This initialization can be considered as an instance of transfer learning (yosinski2014transferable), and we expect a considerable speedup when solving Equation (2). This is important since solving Equation (2) involves inverting the computationally expensive forward operator $F$.
2 Related work
In the context of variational inference for inverse problems with expensive forward operators, herrmann2019NIPSliwcuc train a generative model to sample from the posterior distribution given indirect measurements of the unknown model. This approach is based on an expectation-maximization technique, which infers the latent representation directly instead of using an inference encoding model. While that approach allows for the inclusion of handcrafted priors, its ability to capture the posterior is not fully developed. Like kovachki2020conditional, we also use a block-triangular map between the joint model and data distribution and their respective latent spaces to train a network that generates samples from the conditional posterior. By imposing an additional monotonicity constraint, these authors train a generative adversarial network (GAN, Goodfellow2014) to directly sample from the posterior distribution. To allow for scalability to large-scale problems, we work with NFs instead, because they allow for more memory-efficient training (vandeLeemput2019MemCNN; putzky2019invert; peters2020fullysensing; peters2020fully). Our contribution essentially corresponds to a reformulation of parno2018transport and marzouk2018multifidelity. In the former work, transport-based maps are used as non-Gaussian proposal distributions during MCMC sampling. As part of the MCMC, this transport map is then tuned to match the target density, which improves the efficiency of the sampling. marzouk2018multifidelity extend this approach by proposing a preconditioned MCMC sampling technique, where a transport map trained to sample from a low-fidelity posterior distribution is used as a preconditioner. This idea of multi-fidelity preconditioned MCMC inspired our work, in which we set up a VI objective instead. We argue that this formulation can be faster and may be easier to scale to large-scale Bayesian inference problems (blei2017variational). Finally, there is a conceptual connection between our work and previous contributions on amortized variational inference (gershman2014amortized), including an iterative refinement step (hjelm2016iterative; pmlr-v84-krishnan18a; pmlr-v80-kim18e; marino2018iterative). Although similar in spirit, our approach differs from these attempts because we adapt the weights of our conditional generative model to account for the inference errors, instead of correcting the inaccurate latent representation of the out-of-distribution data.
3 Multifidelity preconditioning scheme
For an observation $\hat y$, we define a NF as

(4) $T_w(z) := G_{w, x}^{-1}(z;\, \hat y),$

where we obtain $\hat\theta$, the initial value of $w$, by training $G_\theta$ through minimizing the objective function in Equation (3). To train $G_\theta$, we use the available low-fidelity training pairs $(x_i, y_i)$. We perform this training phase beforehand, similar to the pretraining phase during transfer learning (yosinski2014transferable). Thanks to the invertibility of $G_{\hat\theta}$, it provides an expression for the posterior. We refer to this posterior as low-fidelity because the network is trained with often scarce and out-of-distribution training pairs. Because the Earth's heterogeneity does not lend itself to being easily captured by handcrafted priors, we argue that this NF can still serve as a (conditional) prior in Equation (2):

(5) $\widetilde p(x \mid \hat y) = p_{z_x}\big( G_{\hat\theta, x}(\hat y, x) \big)\, \big| \det \nabla_x G_{\hat\theta, x}(\hat y, x) \big|$
To train the high-fidelity NF given observed data $\hat y$, we minimize the KL divergence between the predicted and the high-fidelity posterior density, $p(x \mid \hat y)$ (liu2016stein; kruse2019hint):

(6) $\min_{w}\; \mathbb{E}_{z \sim \mathrm{N}(0, I)} \Big[ \frac{1}{2\sigma^2} \big\| F(T_w(z)) - \hat y \big\|_2^2 - \log \widetilde p(T_w(z) \mid \hat y) - \log \big| \det \nabla_z T_w(z) \big| \Big],$

where the prior density of Equation (5) is used. Notice that this minimization problem differs from the one stated in Equation (2). Here, the optimization involves "fine-tuning" the low-fidelity network parameters introduced in Equation (3). Moreover, this low-fidelity network is also used as a prior. While other choices exist for the latter—e.g., it could be replaced or combined with a handcrafted prior in the form of constraints (peters2018pmf) or with a separately trained data-driven prior (mosser2018stochastic)—using the low-fidelity posterior as a prior (cf. Equation (5)) has certain distinct advantages. First, it removes the need for training a separate data-driven prior model. Second, the low-fidelity posterior may be more informative (yang2018conditional) than its unconditional counterpart because it is conditioned on the observed data $\hat y$. In addition, our multi-fidelity approach has strong connections with online variational Bayes (zeno2018task), where data arrive sequentially and previous posterior approximations are used as priors for subsequent approximations.

In summary, the problem in Equation (6) can be interpreted as an instance of transfer learning (yosinski2014transferable) for conditional NFs. This formulation is particularly useful for inverse problems with expensive forward operators, where access to high-fidelity training samples, i.e., samples from the target distribution, is limited. In the next section, we present two numerical experiments designed to show the speedup and accuracy gained with our proposed multi-fidelity formulation.
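As a toy illustration of this fine-tuning step, the objective in Equation (6) can be minimized for a one-dimensional affine flow, with an identity forward operator and a Gaussian density standing in for the pretrained low-fidelity posterior; the optimizer, the model, and all constants below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

y_obs, s2 = 0.5, 0.1      # observed data and noise variance (illustrative)
mu_lo, var_lo = 0.0, 1.0  # Gaussian stand-in for the low-fidelity posterior

def objective(w, z):
    """Monte-Carlo estimate of the Equation (6)-style objective for
    T_w(z) = mu + sigma * z with an identity forward operator."""
    mu, log_sigma = w
    x = mu + np.exp(log_sigma) * z                    # T_w(z)
    data_fit = 0.5 * np.mean((x - y_obs) ** 2) / s2   # likelihood term
    prior = 0.5 * np.mean((x - mu_lo) ** 2) / var_lo  # -log p_lo, up to a constant
    logdet = log_sigma                                # log|det dT_w/dz|
    return data_fit + prior - logdet

# Crude gradient descent with central-difference gradients; a real
# implementation would rely on automatic differentiation.
w = np.array([0.0, 0.0])  # a "warm start" would come from the pretrained NF
z = rng.standard_normal(4000)
for _ in range(500):
    g = np.zeros(2)
    for i in range(2):
        e = np.zeros(2)
        e[i] = 1e-4
        g[i] = (objective(w + e, z) - objective(w - e, z)) / 2e-4
    w -= 0.05 * g
```

For these Gaussian stand-ins the optimum is available in closed form (the posterior precision is the sum of the likelihood and prior precisions), which makes the sketch easy to check.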
4 Numerical experiments
We present two synthetic examples aimed at verifying the anticipated speedup and increase in accuracy of the predicted posterior density via our multi-fidelity preconditioning scheme. The first example is a two-dimensional problem where the posterior density can be accurately and cheaply sampled via MCMC. The second example demonstrates the effect of the preconditioning scheme in a seismic compressed sensing (candes2006stable; donoho2006compressed) problem. Details regarding training hyperparameters and the NF architectures are included in Appendix B. Code to reproduce our results is made available on GitHub. Our implementation relies on InvertibleNetworks.jl (invnet), a recently developed memory-efficient framework for training invertible networks in the Julia programming language.

4.1 2D toy example
To illustrate the advantages of working with our multi-fidelity scheme, we consider the 2D Rosenbrock distribution, plotted in Figure (a). High-fidelity data are generated via $y = Ax + \varepsilon$, where $\varepsilon \sim \mathrm{N}(0, \sigma^2 I)$ and $A$ is a forward operator. To control the discrepancy between the low- and high-fidelity samples, we set $A$ equal to $\gamma I + (1 - \gamma)\, B / \rho(B)$, where $\rho(B)$ is the spectral radius of $B$, $B$ has independent and normally distributed entries, and $\gamma \in (0, 1]$. By choosing smaller values for $\gamma$, we make $A$ more dissimilar to the identity matrix, therefore increasing the discrepancy between the low- and high-fidelity posterior.
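One way to realize such a family of operators, consistent with the description above (a blend of the identity and a spectrally normalized random matrix, controlled by $\gamma$), is sketched below; the specific blend $A = \gamma I + (1 - \gamma) B / \rho(B)$ is our reading of the construction and should be treated as an assumption.

```python
import numpy as np

def make_highfi_operator(n, gamma, rng):
    """Blend the identity with a spectrally normalized random matrix:
    A = gamma * I + (1 - gamma) * B / rho(B). Smaller gamma moves A
    further away from the identity."""
    B = rng.standard_normal((n, n))
    rho = np.max(np.abs(np.linalg.eigvals(B)))  # spectral radius of B
    return gamma * np.eye(n) + (1 - gamma) * B / rho

rng = np.random.default_rng(0)
A1 = make_highfi_operator(2, 1.0, rng)  # gamma = 1 recovers the identity
A2 = make_highfi_operator(2, 0.2, rng)  # smaller gamma: far from identity
```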
Figure (b) depicts the low-fidelity (purple) and high-fidelity (red) data densities. The dark star represents the unknown model. Low-fidelity data samples are generated with the identity as the forward operator. During the pretraining phase, conducted beforehand, we minimize the objective function in Equation (3).
The pretrained low-fidelity posterior is subsequently used to precondition the minimization of Equation (6) given observed data $\hat y$. The resulting low- and high-fidelity estimates for the posterior, as plotted in Figure (c), differ significantly. In Figure (d), the accuracy of the proposed method is verified by comparing the approximated high-fidelity posterior density (orange contours) with the approximation (in green) obtained by minimizing the objective of Equation (2). The overlap between the orange contours and the green shaded area confirms the consistency between the two methods. To assess the accuracy of the estimated densities themselves, we also include samples from the posterior (dark circles) obtained via stochastic gradient Langevin dynamics (welling2011bayesian), an MCMC sampling technique. As expected, the estimated posterior densities with and without the preconditioning scheme are in agreement with the MCMC samples.
Finally, to illustrate the performance of our multi-fidelity scheme, we consider the convergence plot in Figure (e), where the objective values of Equations (2) and (6) are compared. As explained in Appendix A, the values of the objective functions correspond, up to a constant, to the KL divergence between the true posterior and the approximations obtained via Equation (2) and via our multi-fidelity approach (Equation (6)), respectively. As expected, the multi-fidelity objective converges much faster because of the "warm start". In addition, the updates of $w$ via Equation (6) succeed in bringing down the KL divergence within only five epochs (see orange curve), whereas it takes considerably more epochs via the objective in Equation (2) to reach approximately the same KL divergence. This pattern also holds for smaller values of $\gamma$, as indicated in Table 1. According to Table 1, the improvements achieved by our multi-fidelity method become more pronounced as we decrease $\gamma$. This behavior is to be expected, since the samples used for pretraining are then increasingly out of distribution. We refer to Appendix C for additional figures for different values of $\gamma$.
Table 1: Comparison of the low-fidelity posterior and the posteriors obtained without and with preconditioning, for different values of $\gamma$.
4.2 Seismic compressed sensing example
This experiment is designed to highlight the challenges that the Earth's strong heterogeneity poses for geophysical inverse problems. We consider the inversion of noisy, indirect measurements of image patches $x$ sampled from deeper parts of the Parihaka seismic dataset. The observed measurements are given by $y = A^\top(Ax + \varepsilon)$, where $\varepsilon \sim \mathrm{N}(0, \sigma^2 I)$. For simplicity, we chose the forward operator $A^\top A$, with $A$ a compressive sensing matrix with a fixed percentage of subsampling. The measurement vector $y$ therefore corresponds to a pseudo-recovered model contaminated with noise.

To mimic a realistic situation in practice, we change the likelihood distribution by reducing the standard deviation of the noise, in combination with using image patches sampled from the shallow part of the Parihaka dataset. As we have seen in Figure 5, these patches differ in texture. Given pairs $(x_i, y_i)$, we pretrain our network by minimizing Equation (3). Figures (a) and (b) contain a pair not used during pretraining. Estimates for the conditional mean and standard deviation, obtained by drawing samples from the pretrained conditional NF for the noisy indirect measurement (Figure (b)), are included in Figures (c) and (d). Both estimates exhibit the expected behavior because the examples in Figures (a) and (b) are within the distribution. As anticipated, this observation no longer holds if we apply the pretrained network to the indirect data depicted in Figure (f), which is sampled from the deeper part. However, these results are significantly improved when the pretrained network is fine-tuned by minimizing Equation (6). After fine-tuning, the fine details in the image are recovered (compare Figures (g) and (h)). This improvement is confirmed by the relative errors plotted in Figures (i) and (j), as well as by the reduced standard deviation (compare Figures (k) and (l)).

5 Conclusions
Inverse problems in fields such as seismology are challenging for several reasons. The forward operators are complex and expensive to evaluate numerically, while the Earth is highly heterogeneous. To handle this situation and to quantify uncertainty, we propose a preconditioned scheme for training normalizing flows for Bayesian inference. The proposed scheme is designed to take full advantage of having access to training pairs drawn from a joint distribution, which, for the reasons stated above, is close but not equal to the actual joint distribution. We use these samples to train a normalizing flow via likelihood maximization, leveraging its normalizing property. We use this pretrained low-fidelity estimate of the posterior as a prior and as a preconditioner for the actual variational inference on the observed data, which minimizes the Kullback–Leibler divergence between the predicted and the desired posterior density. By means of a series of examples, we demonstrate that our preconditioned scheme leads to considerable speedups compared to training a normalizing flow from scratch.
References
Appendix A Mathematical derivations
Let $T$ be a bijective transformation that maps a random variable $z$ with density $p_Z$ to $x = T(z)$. We can write the change-of-variable formula (villani2009optimal) that relates the probability density functions $p_X$ and $p_Z$ in the following manner:

(7) $p_X(x) = p_Z\big(T^{-1}(x)\big)\, \big| \det \nabla_x T^{-1}(x) \big|.$
This relation serves as the basis for the objective functions used throughout this paper.
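Equation (7) can be checked numerically on a simple bijection: for $T(z) = \exp(z)$ applied to a standard-normal variable, the pushforward density must match the standard log-normal density. The script below is purely illustrative.

```python
import numpy as np

def p_z(z):
    """Standard normal density."""
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def p_x_via_cov(x):
    """Density of x = T(z) = exp(z) via Equation (7):
    p_X(x) = p_Z(T^{-1}(x)) * |d T^{-1}/dx|, with T^{-1}(x) = log x."""
    return p_z(np.log(x)) / x  # |d/dx log x| = 1/x

def p_x_lognormal(x):
    """Closed-form standard log-normal density, for comparison."""
    return np.exp(-0.5 * np.log(x) ** 2) / (x * np.sqrt(2.0 * np.pi))

xs = np.linspace(1e-4, 50.0, 200_000)
max_err = np.max(np.abs(p_x_via_cov(xs) - p_x_lognormal(xs)))
mass = np.sum(p_x_via_cov(xs)) * (xs[1] - xs[0])  # should be close to 1
```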
A.1 Derivation of Equation (2)
In Equation (2), we train a bijective transformation, denoted by $T_w$, that maps the latent distribution $\mathrm{N}(0, I)$ to the high-fidelity posterior density $p(x \mid y)$. We optimize the parameters $w$ of $T_w$ by minimizing the KL divergence between the pushforward density (bogachev2006measure), denoted by $p_{T_w}$, and the posterior density:

(8) $\min_w\; \mathrm{KL}\big( p_{T_w} \,\|\, p(\cdot \mid y) \big) = \min_w\; \mathbb{E}_{x \sim p_{T_w}} \big[ \log p_{T_w}(x) - \log p(x \mid y) \big].$
In the above expression, we can rewrite the expectation with respect to $p_{T_w}$ as the expectation with respect to the latent distribution, followed by a mapping via $T_w$—i.e.,

(9) $\min_w\; \mathbb{E}_{z \sim \mathrm{N}(0, I)} \big[ -\log p\big(T_w(z) \mid y\big) + \log p_{T_w}\big(T_w(z)\big) \big].$
The last term in the expectation above can be further simplified via the change-of-variable formula in Equation (7). If $x = T_w(z)$, then:

(10) $\log p_{T_w}\big(T_w(z)\big) = \log p_Z(z) + \log \big| \det \nabla_x T_w^{-1}(x) \big| = \log p_Z(z) - \log \big| \det \nabla_z T_w(z) \big|.$
The last equality in Equation (10) holds due to the invertibility of $T_w$ and the differentiability of its inverse (inverse function theorem). By combining Equations (9) and (10), we arrive at the following objective function for training $T_w$:

(11) $\min_w\; \mathbb{E}_{z \sim \mathrm{N}(0, I)} \big[ -\log p\big(T_w(z) \mid y\big) + \log p_Z(z) - \log \big| \det \nabla_z T_w(z) \big| \big].$
Finally, by ignoring the $\log p_Z(z)$ term, which is constant with respect to $w$, using Bayes' rule, and inserting our data-likelihood model from Equation (1), we derive Equation (2):

(12) $\min_w\; \mathbb{E}_{z \sim \mathrm{N}(0, I)} \Big[ \frac{1}{2\sigma^2} \big\| F(T_w(z)) - y \big\|_2^2 - \log p\big(T_w(z)\big) - \log \big| \det \nabla_z T_w(z) \big| \Big].$
Next, based on this equation, we derive the objective function used in the pretraining phase.
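A Monte-Carlo estimate of the objective in Equation (12) can be sketched in one dimension with an affine flow $T_w(z) = \mu + \sigma z$, an identity forward operator, and a standard-normal prior; all of these choices are illustrative stand-ins, not the paper's setup.

```python
import numpy as np

def vi_objective(mu, sigma, y_obs, noise_var, z):
    """Monte-Carlo estimate of the Equation (12)-style objective for
    T_w(z) = mu + sigma * z, identity forward operator, standard-normal prior."""
    x = mu + sigma * z
    data_term = 0.5 * np.mean((x - y_obs) ** 2) / noise_var
    prior_term = 0.5 * np.mean(x ** 2)  # -log p(x), up to a constant
    logdet_term = np.log(sigma)         # entropy-like term; prevents collapse
    return data_term + prior_term - logdet_term

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)
val = vi_objective(mu=0.0, sigma=1.0, y_obs=0.5, noise_var=0.1, z=z)
```

With $\mu = 0$ and $\sigma = 1$, each of the three terms can be checked against its analytic expectation, which is a useful sanity test before optimizing over $w$.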
A.2 Derivation of Equation (3)
The derivation of the objective in Equation (3) follows directly from the change-of-variable formula in Equation (7), applied to a bijective map, $G_\theta(y, x) = (z_y, z_x)$, where $z_y$ and $z_x$ are Gaussian latent variables. That is to say:

(13) $p(y, x) = p_Z\big(G_\theta(y, x)\big)\, \big| \det \nabla G_\theta(y, x) \big|.$
Given (low-fidelity) training pairs, $(x_i, y_i)$, $i = 1, \ldots, N$, the maximum-likelihood estimate of $\theta$ is obtained via the following objective:

(14) $\min_\theta\; \frac{1}{N} \sum_{i=1}^{N} \Big[ \frac{1}{2} \big\| G_\theta(y_i, x_i) \big\|_2^2 - \log \big| \det \nabla G_\theta(y_i, x_i) \big| \Big],$

which is the objective function in Equation (3). The NF trained via this objective, given samples from the latent distribution, draws samples from the low-fidelity joint distribution, $\widetilde p(x, y)$.
By construction, $G_\theta$ is a block-triangular map—i.e.,

(15) $G_\theta(y, x) = \begin{bmatrix} G_{\theta, y}(y) \\ G_{\theta, x}(y, x) \end{bmatrix}.$
kruse2019hint showed that after solving the optimization problem in Equation (3), $G_\theta$ approximates the well-known triangular Knothe–Rosenblatt map (santambrogio2015optimal). As shown in marzouk2016sampling, the triangular structure and the invertibility of $G_\theta$ yield the following property:

(16) $G_{\hat\theta, x}^{-1}(z_x;\, y) \sim \widetilde p(x \mid y) \quad \text{for} \quad z_x \sim \mathrm{N}(0, I),$

where $\widetilde p(x \mid y)$ denotes the low-fidelity posterior probability density function. The expression above means that we can access samples from the low-fidelity posterior distribution by simply evaluating $G_{\hat\theta, x}^{-1}(z_x;\, y)$ for $z_x \sim \mathrm{N}(0, I)$, for given observed data $y$.

Appendix B Training details and network architectures
For our network architecture, we adapt the recursive coupling blocks proposed by kruse2019hint, which use the invertible coupling layers from dinh2016density in a hierarchical way. In other words, we recursively divide the incoming state variables and apply an affine coupling layer. The final architecture is obtained by composing several of these hierarchical coupling blocks. The hierarchical structure leads to a dense triangular Jacobian, which is essential for the representation power of NFs (kruse2019hint).
For all examples in this paper, we use the hierarchical coupling blocks as described in kruse2019hint. The affine coupling layers within each hierarchical block contain a residual block as described in he2016deep. Each residual block has a fixed number of input, hidden, and output channels, except for the first and last coupling layers. We apply the wavelet transform and its transpose before feeding seismic images into the network and after the last layer of the NF.
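For reference, a minimal affine coupling layer in the spirit of dinh2016density is sketched below, with simple linear maps standing in for the residual blocks used in the paper; the forward pass returns the log-determinant term required by the objectives in Equations (2), (3), and (6).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear "networks" for scale and shift (the paper uses residual
# blocks instead); each acts on the first half of a 4-dimensional input.
W_s = 0.1 * rng.standard_normal((2, 2))
W_t = 0.1 * rng.standard_normal((2, 2))

def coupling_forward(u):
    """Affine coupling: the first half passes through unchanged; the second
    half is scaled and shifted conditioned on the first half. Returns the
    output and log|det J|."""
    u1, u2 = u[:2], u[2:]
    s, t = W_s @ u1, W_t @ u1
    v2 = u2 * np.exp(s) + t
    return np.concatenate([u1, v2]), np.sum(s)  # log|det J| = sum of log-scales

def coupling_inverse(v):
    """Exact inverse: recompute s, t from the untouched first half."""
    v1, v2 = v[:2], v[2:]
    s, t = W_s @ v1, W_t @ v1
    return np.concatenate([v1, (v2 - t) * np.exp(-s)])

u = rng.standard_normal(4)
v, logdet = coupling_forward(u)
```

Composing such layers, while alternating which half conditions which, yields an invertible network whose total log-determinant is the sum of the per-layer terms.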
Below, we describe the network architectures and training details regarding the two numerical experiments described in the paper. Throughout the experiments, we use the Adam optimization algorithm (kingma2015adam).
B.1 2D toy example
We use 8 hierarchical coupling blocks, as described above, for both networks in Equation (3). As a result, due to our proposed method in Equation (4), we choose the same architecture for the NF in Equation (2).
For pretraining according to Equation (3), we use low-fidelity joint training pairs, $(x_i, y_i)$. We minimize Equation (3) over multiple epochs with a fixed batch size and a starting learning rate, which we decrease by a constant factor after each epoch.
For the preconditioned step—i.e., solving Equation (6)—we train over several epochs on latent training samples, with a fixed batch size and learning rate. Finally, as a comparison, we solve the objective in Equation (2) for a randomly initialized NF with the same latent training samples, decreasing the learning rate after each epoch.
B.2 Seismic compressed sensing
We use 12 hierarchical coupling blocks, as described above, for both networks in Equation (3), and we use the same architecture for the NF in Equation (2).
For pretraining according to Equation (3), we use low-fidelity joint training pairs, $(x_i, y_i)$. We minimize Equation (3) over multiple epochs with a fixed batch size and starting learning rate. Once again, we decrease the learning rate after each epoch.
For the preconditioned step—i.e., solving Equation (6)—we train over several epochs on latent training samples, with a fixed batch size and learning rate, decaying the step by a constant factor every few epochs.
Appendix C 2D toy example—more results
Here we show the effect of $\gamma$ on our proposed method in the 2D toy experiment. By choosing smaller values for $\gamma$, we make $A$ less close to the identity matrix, hence enhancing the discrepancy between the low- and high-fidelity posterior. The first row of Figure 37 shows the low-fidelity (purple) and high-fidelity (red) data densities for decreasing values of $\gamma$. The second row depicts the predicted posterior densities via the preconditioned scheme (orange contours) and the low-fidelity posterior (green), along with MCMC samples (dark circles). The third row compares the preconditioned posterior densities to samples obtained via the low-fidelity pretrained NF—i.e., Equation (3). Finally, the last row shows the objective function values during training with (orange) and without (green) preconditioning.
We observe that by decreasing $\gamma$, the low-fidelity posterior approximations become worse. As a result, the objective function for the preconditioned approach (orange) starts from a higher value, indicating a larger mismatch between the low- and high-fidelity posterior densities. Finally, our preconditioning method consistently improves upon the low-fidelity posterior after only a few epochs of training.
Appendix D Seismic compressed sensing—more results
Here we show more examples that verify the pretraining phase obtained via solving the objective in Equation (3). Each row in Figure 54 corresponds to a different testing image. The first column shows the different true seismic images used to create low-fidelity compressed sensing data, depicted in the second column. The third and last columns correspond to the conditional mean and pointwise standard deviation estimates, respectively. Clearly, the pretrained network is able to successfully recover the true image, and it consistently indicates more uncertainty in areas with low-amplitude events.