# Multimodal, high-dimensional, model-based, Bayesian inverse problems with applications in biomechanics

This paper is concerned with the numerical solution of model-based, Bayesian inverse problems. We are particularly interested in cases where the cost of each likelihood evaluation (forward-model call) is expensive and the number of unknown (latent) variables is high. This is the setting in many problems in computational physics where forward models with nonlinear PDEs are used and the parameters to be calibrated involve spatio-temporally varying coefficients, which upon discretization give rise to a high-dimensional vector of unknowns. One of the consequences of the well-documented ill-posedness of inverse problems is the possibility of multiple solutions. While such information is contained in the posterior density in Bayesian formulations, the discovery of a single mode, let alone multiple, is a formidable task. The goal of the present paper is twofold. On the one hand, we propose approximate, adaptive inference strategies using mixture densities to capture multi-modal posteriors; on the other, we extend our work in [1] with regards to effective dimensionality reduction techniques that reveal low-dimensional subspaces where the posterior variance is mostly concentrated. We validate the proposed model by employing Importance Sampling, which confirms that the bias introduced is small and can be efficiently corrected if the analyst wishes to do so. We demonstrate the performance of the proposed strategy in nonlinear elastography, where the identification of the mechanical properties of biological materials can inform non-invasive medical diagnosis. The discovery of multiple modes (solutions) in such problems is critical in achieving the diagnostic objectives.


## 1 Introduction

Model-based (or model-constrained) inverse problems appear in many scientific fields and their solution represents a fundamental challenge in the context of model calibration and system identification biegler_large-scale_2010 . Bayesian formulations offer a rigorous setting for their solution as they account for the various sources of uncertainty that are unavoidably present in these problems. Furthermore, they possess a great advantage over deterministic alternatives: apart from point estimates, they provide quantitative metrics of the uncertainty in the unknowns, encapsulated in the posterior distribution gelman_bayesian_2003 .

An application of particular, but not exclusive, interest for this paper involves the identification of the mechanical properties of biological materials, in the context of non-invasive medical diagnosis (elastography). While in certain cases mechanical properties can also be measured directly by excising multiple tissue samples, non-invasive procedures offer obvious advantages in terms of ease, cost and reduced risk of complications to the patient. Rather than x-ray techniques which capture variations in density, the identification of stiffness, or mechanical properties in general, can potentially lead to earlier and more accurate diagnosis ISI:000263259100006 ; liver2006 , provide valuable insights that differentiate between modalities of the same pathology cur12gen , monitor the progress of treatments and ultimately lead to patient-specific treatment strategies.

All elastographic techniques consist of the following three basic steps doyley_model-based_2012 : 1) excite the tissue using a (quasi-)static, harmonic or transient source, 2) (indirectly) measure tissue deformation (e.g. displacements, velocities) using an imaging technique such as ultrasound Ophir:1991 , magnetic resonance mut95mag or optical tomography khalil_tissue_2005 , and 3) infer the mechanical properties from this data using a suitable continuum mechanical model of the tissue’s deformation. Perhaps the most practical such imaging technique due to its lower relative cost and increased portability is ultrasound elasticity imaging sarvazyan_elasticity_2011 ; doyley_elastography:_2014 . The pioneering work of Ophir and coworkers Ophir:1991 followed by several clinical studies Garra:1997 ; Bamber:2002 ; Thomas:2006 ; par11im have demonstrated that the resulting strain images typically improve the diagnostic accuracy over ultrasound alone. Furthermore, technological advances have led to ultra-portable ultrasound transducers, attachable to smartphones/tablets schleder_diagnostic_2013 ; driver_doctors_2016 . As the rate of data acquisition increases and the cost decreases, it becomes increasingly important to develop tools that leverage the capabilities of physics-based models in order to produce quickly and accurately diagnostic estimates as well as quantify the confidence in them. Apart from breast cancer, there is a wealth of evidence indicating the potential of elastography-based techniques in detecting a variety of other pathologies such as prostate kro98ela ; hoy08tis and liver cancer asbach_assessment_2008 , characterizing blood clots schmitt_noninvasive_2007 , brain imaging hamhaber_vivo_2010 , atherosclerosis ohayon_biomechanics_2013 , osteopenia shore_transversely_2011 .

In this paper we advocate a probabilistic, indirect or iterative procedure (in contrast to direct elastography ISI:000275699600006 ) which admits an inverse problem formulation and involves the discrepancy between observed and model-predicted displacements ISI:000223500200013 ; doyley_enhancing_2006 ; ISI:000275756200016 ; ISI:000280774700004 ; doyley_model-based_2012 . Several other problems which involve complex forward models (i.e. expensive likelihoods) and the identification of spatially varying model parameters share similar characteristics, such as permeability estimation for soil transport processes, which can assist in the detection of contaminants, oil exploration and carbon sequestration wang_bayesian_2004 ; wang_hierarchical_2005 ; dostert_coarse-gradient_2006 .

The solution of such model calibration problems in the Bayesian framework is hampered by two main difficulties. The first affects the computational efficiency of such methods and stems from the poor scaling of traditional Bayesian inference tools with respect to the dimensionality of the unknown parameter vector - another instance of the curse-of-dimensionality. In problems such as the one described above, the model parameters of interest (i.e. material properties) exhibit spatial variability which requires fine discretizations in order to be captured. This variability can also span different scales mallat_wavelet_2008 ; koutsourelakis_multi-resolution_2009 .

Standard Markov Chain Monte Carlo (MCMC, green_bayesian_2015 ) techniques require an exorbitant number of likelihood evaluations (i.e. solutions of the forward model) in order to converge roberts_exponential_1996 ; roberts_optimal_1998 ; mattingly_diffusion_2012 ; pillai_optimal_2012 . As each of these calls implies the solution of very large systems of (non)linear, and potentially transient, equations, it is generally of interest to minimize their number, particularly in time-sensitive applications. Advanced sampling schemes, involving adaptive MCMC lee_markov_2002 ; holloman_multi-resolution_2006 ; chopin_free_2012 and Sequential Monte Carlo (SMC, moral_sequential_2006 ; koutsourelakis_multi-resolution_2009 ; del_moral_adaptive_2010 ) exploit physical insight and the use of multi-fidelity solvers in order to expedite the inference process. Nevertheless, the number of forward calls can still be in the order of tens of thousands if not much more. Several attempts have also been directed towards using emulators, surrogates or reduced-order models of various kinds marzouk_stochastic_2007 ; bui-thanh_model_2008 ; rosic_sampling-free_2012 ; bilionis_solution_2014 ; chen_sparse-grid_2015 ; lan_emulation_2016 but such a task is severely hindered by the high-dimensionality. The use of first- and second-order derivatives has also been advocated, either in a standard MCMC format or by developing advanced sampling strategies. These are generally available by solving appropriate adjoint problems which are well-understood in the context of deterministic formulations flath_fast_2011 ; girolami_riemann_2011 ; martin_stochastic_2012 ; bui-thanh_solving_2014 ; petra_computational_2014 .

More recent treatments attempt to exploit the lower-dimensional structure of the target posterior by identifying subspaces where either most of the probability mass is contained franck_sparse_2015 or where maximal sensitivity is observed cui_likelihood-informed_2014 ; spantini_optimal_2015 ; cui_scalable_2015 ; cui_dimension-independent_2016 . This enables inference tasks that are carried out on spaces of significantly reduced dimension and are not hampered by the aforementioned difficulties. Generally, all such schemes construct such approximations around the MAP point by employing local information (e.g. gradients) and are therefore not suitable for multi-modal or highly non-Gaussian posteriors.

The latter represents the second challenge that we attempt to address in this paper: the identification of multiple posterior modes. In the context of elastography, multi-modality can originate from anisotropic material behavior chatelin_anisotropic_2014 , wrong/missing information from images/measurements fang_compositional-prior-guided_2010 or the imaging modality employed fromageau_estimation_2007 . In all cases, each mode in the posterior can lead to different diagnostic conclusions and it is therefore very important to identify them and correctly assess their posterior probabilities. The majority of Bayesian strategies for the solution of computationally intensive inverse problems operates under the assumption of a unimodal posterior or focuses on the approximation of a single mode of the posterior. Some numerical inference tools based on SMC or other tempering mechanisms gill_dynamic_2004 ; li_adaptive_2015 ; feroz_multimodal_2008 have been developed but require a very large number of forward model calls, particularly when the dimension of the unknowns increases. We note finally that the treatment of multi-modal densities in high dimensions has attracted significant interest in atomistic simulation, in the context of free energy computations lelievre_free_2010 ; bilionis_free_2012 , but in such problems (apart from other distinguishing features) the cost per density evaluation (i.e. one MD time-step) is smaller than in our setting.

In this paper we propose a Variational Bayesian (VB) strategy that extends our previous work franck_sparse_2015 . Therein we have shown how accurate approximations of the true posterior can be attained by identifying a low-dimensional subspace where posterior uncertainty is concentrated. This has led to computational schemes that require only a few tens of forward model runs in the problems investigated. Nevertheless, our previous work was based on the assumption of a unimodal posterior, a restriction we propose to overcome in this paper by employing a mixture of multivariate Gaussians. Such mixtures have been employed in various statistics and machine learning applications (e.g. speaker identification reynolds_speaker_2000 , data clustering jain_data_1999 ) and in combination with Variational Bayesian inference techniques as well choudrey_variational_2003 ; beal_variational_2003 ; kuusela_gradient-based_2009 . Nevertheless, all these problems were characterized by inexpensive likelihoods, relatively low dimensions and multiple data/measurements. In contrast, the inverse problems considered here are based on a single experiment and a single observation vector. Furthermore, we propose an adaptive algorithm based on information-theoretic criteria for the identification of the number of required mixture components (Section 2). We present the parametrization of the proposed model in Section 2, where we also discuss a Variational-Bayesian Expectation-Maximization beal_variational_2003 scheme for performing inference and learning. In Section 3 we present numerical illustrations involving a simple toy example and an example in the context of elastography.

## 2 Methodology

This section discusses the methodological framework advocated. We begin (Section 2.1) with a generic presentation of the forward model and in particular with models arising from the discretization of nonlinear PDEs such as in our motivating application of nonlinear elastography. In Section 2.2 we present a Bayesian mixture model that can identify lower-dimensional subspaces where most of the posterior mass is concentrated as well as accounts for multiple modes. The prior assumptions for the model parameters are summarized in Section 2.3. In Section 2.4 we discuss a Variational Bayesian Expectation-Maximization scheme for computing efficiently approximations to the posterior for a fixed number of mixture components, and in Section 2.5 we discuss a scheme for determining the appropriate number of such components. Finally, in Section 2.6 we discuss how to assess the accuracy of the approximation computed as well as a way to correct for any bias if this is deemed to be necessary.

### 2.1 Forward model - Governing equations

Canonical formulations of model-based, inverse problems postulate the existence of a forward model that typically arises from the discretization of governing equations, such as PDEs/ODEs, and which can be expressed in the residual form as follows:

 r(u;Ψ)=0. (1)

The residual vector r depends on the state vector u (forward-problem unknowns) and on Ψ, the vector of unobserved (latent) model parameters (inverse-problem unknowns). The aforementioned equation is complemented by a set of (noisy) observations/measurements ^y which pertain to the model output y(Ψ):

 ^y=y(Ψ)+noise. (2)

The unknowns Ψ in the problems considered arise from the discretization of spatially varying parameters and we refer to them as material parameters in view of the biomechanics applications discussed in Section 3.2. Throughout this work we assume that the noise term pertains only to measurement/observation errors which we model with a zero-mean, uncorrelated Gaussian vector z:

 ^y=y(Ψ)+z,z∼N(0,τ−1Idy). (3)

The precision τ of the observation noise will be assumed unknown and will be inferred from the data along with Ψ. Other noise distributions (e.g. to account for faulty sensors) can readily be considered jin_variational_2012 . We note that the difference between observations and model predictions would in general also contain model errors arising from the discretization of the governing equations and/or the inadequacy of the model itself to capture the underlying physical process. While the former source can be reduced by considering very fine discretizations (at the cost of increasing the dimensionality of the state vector u and potentially Ψ), the latter requires a much more thorough treatment which exceeds the scope of this work kennedy_bayesian_2001 ; higdon_computer_2008 ; bayarri_framework_2007 ; berliner_modeling_2008 ; koutsourelakis_novel_2012 ; strong_when_2014 ; sargsyan_statistical_2015 .

In the traditional route, the likelihood p(^y|Ψ,τ), implied by Equation (3):

 p(^y|Ψ,τ)∝τdy/2e−τ2||^y−y(Ψ)||2. (4)

is complemented by priors pΨ(Ψ) and pτ(τ) which, with the application of Bayes' rule, lead to the definition of the posterior density, which is proportional to:

 p(Ψ,τ|^y)∝p(^y|Ψ,τ) pΨ(Ψ) pτ(τ). (5)

The intractability of the map y(Ψ) precludes the availability of closed-form solutions for the posterior and necessitates the use of various sampling schemes such as those discussed in the introduction. This task is seriously impeded by a) the need for repeated solutions of the discretized forward problem (Equation (1)), each of which can be quite taxing computationally, and b) the high dimensionality of the vector of unknowns Ψ, which hinders the efficient search (e.g. by sampling) of the latent parameter space and further increases the computational burden. The goal of the proposed Variational Bayesian scheme is to alleviate these difficulties by proposing adequate approximations and dimensionality-reduction techniques that are seamlessly integrated in the inference framework. Furthermore, we attempt to overcome well-known limitations that have to do with the multimodality of the posterior and which further exacerbate these problems. Multimodality is inherently related to the ill-posedness of inverse problems and its potential increases when the dimension of the vector of unknowns grows and/or the noise is amplified.
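For concreteness, the unnormalized log-posterior of Equations (4)-(5) can be transcribed directly. The sketch below is a minimal illustration, not the paper's implementation: the linear map `A` stands in (hypothetically) for the expensive PDE-based forward model, and the flat log-priors are placeholders.

```python
import numpy as np

def log_posterior(psi, tau, y_hat, forward, log_prior_psi, log_prior_tau):
    """Unnormalized log-posterior of Eq. (5): Gaussian log-likelihood of
    Eq. (4) plus the log-priors on Psi and on the noise precision tau."""
    resid = y_hat - forward(psi)
    d_y = y_hat.size
    log_lik = 0.5 * d_y * np.log(tau) - 0.5 * tau * resid @ resid
    return log_lik + log_prior_psi(psi) + log_prior_tau(tau)

# Toy illustration: a linear "forward model" y = A psi (a hypothetical
# stand-in for the nonlinear PDE solve) with flat priors and noise-free data.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
psi_true = np.array([1.0, -0.5, 2.0])
y_hat = A @ psi_true
lp = log_posterior(psi_true, 10.0, y_hat, lambda p: A @ p,
                   lambda p: 0.0, lambda t: 0.0)
```

With a zero residual, only the (dy/2) log τ term of the log-likelihood survives; every extra forward-model call in a sampler would incur one evaluation of `forward`.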

### 2.2 Bayesian Mixture Model

In this section and in view of the aforementioned desiderata we introduce the augmented formulation of the Bayesian inverse problem.

• In order to capture multiple modes of the posterior (if such are present) we introduce the discrete, latent variable s which takes integer values between 1 and S. The latter represents the number of modes identified, each of which will be modeled with a multivariate Gaussian. The cardinality of the model, i.e. S, is learned in a manner that is described in the sequel.

• In order to identify a lower-dimensional representation of the unknowns Ψ, we define the latent variables Θ of dimension dΘ such that dΘ≪dΨ. The premise here is that while Ψ is high-dimensional, its posterior can be adequately represented on a subspace of dimension dΘ that captures most of the variance. As we have argued in franck_sparse_2015 these latent variables can give rise to a PCA-like representation of the form:

 Ψ=μ+WΘ+η (6)

where μ is the mean vector and the columns of the orthogonal matrix W (of dimension dΨ×dΘ) span the aforementioned subspace, with reduced coordinates Θ. The vector η captures the residual variance (noise) that complements the main effects.

In view of the multimodal approximation and since each mode implies a different mean and a different lower-dimensional subspace (Figure 1), we advocate in this work expansions of the form:

 Ψs=μs+WsΘ+η,s=1,2,…,S (7)

where the notation Ψs implies the representation of Ψ within mode s. As with multiple modes there can also be multiple subspaces where the variance is concentrated, it is necessary to distinguish the Ws. In principle, the dimension dΘ of the reduced subspaces can also vary with s but we do not consider this here for simplicity of notation.
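The representation of Equation (7) is generative: within a component, Ψ is the mean plus a low-rank term plus isotropic residual noise. A minimal sketch (all dimensions, precisions and the random orthonormal W below are hypothetical illustration values, not taken from the paper):

```python
import numpy as np

def sample_psi(mu, W, Lambda_inv_diag, lam_eta_inv, n, rng):
    """Draw n samples Psi = mu + W Theta + eta (Eq. (7)) for one component:
    Theta ~ N(0, diag(Lambda_inv_diag)), eta ~ N(0, lam_eta_inv * I)."""
    d_psi, d_theta = W.shape
    Theta = rng.standard_normal((n, d_theta)) * np.sqrt(Lambda_inv_diag)
    eta = rng.standard_normal((n, d_psi)) * np.sqrt(lam_eta_inv)
    return mu + Theta @ W.T + eta

rng = np.random.default_rng(1)
d_psi, d_theta = 50, 2
# A random orthonormal W whose columns span the reduced subspace:
W, _ = np.linalg.qr(rng.standard_normal((d_psi, d_theta)))
samples = sample_psi(np.zeros(d_psi), W, np.array([4.0, 1.0]), 1e-3, 2000, rng)
```

Projecting the samples onto the columns of W recovers (up to the small residual term) the prescribed variances along the principal directions, which is the sense in which the subspace "captures most of the variance".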

For reasons that will become apparent in the following, we distinguish between latent variables, Θ, η, s and τ, and model parameters, μs and Ws. We seek point estimates for the latter and (approximations) of the actual (conditional) posterior for the former. The discussion thus far suggests that the likelihood of Equation (4) takes the form:

 p(^y|s,Θ,η,τ,μs,Ws)∝τdy/2e−τ2||^y−y(μs+WsΘ+η)||2. (8)

A graphical illustration of the proposed probabilistic generative model is given in Figure 2.

Following the standard Bayesian formalism, one would complement the aforementioned likelihood with priors on the model parameters μ={μs}, W={Ws} and the latent variables Θ, η, s, τ, in order to obtain the joint posterior (given ^y):

 p(s,Θ,η,τ,μ,W|^y)∝p(^y|s,Θ,η,τ,μs,Ws) p(Θ,η,s,τ) p(μ,W). (9)

We discuss the specific form of the priors (and associated hyperparameters) in Subsection 2.3, as well as the inference/learning strategy we propose in Subsection 2.4. We note at this stage however that, given this posterior, one would obtain a mixture representation of the unknown material parameters Ψ, as implied by Equation (7). In particular, given values for μ, W, it immediately follows that the posterior (given μ, W) of Ψ is:

 p(Ψ|μ,W,^y)=∑Ss=1∫p(Ψ,s,Θ,η,τ|μ,W,^y) dΘ dη dτ
=∑Ss=1∫p(Ψ|s,Θ,η,τ,μs,Ws) p(s,Θ,η,τ|μ,W,^y) dΘ dη dτ
=∑Ss=1∫δ(Ψ−(μs+WsΘ+η)) p(s,Θ,η,τ|μ,W,^y) dΘ dη dτ (10)

where the conditional posterior is found from Equation (9). We discuss in Section 2.4 how the posterior on the latent variables is approximated, as well as how the values (point estimates) of the model parameters are computed.

### 2.3 Priors

We assume that, a priori, the precision τ of the observation noise is independent of the remaining latent variables, i.e.:

 p(Θ,η,s,τ)=p(Θ,η,s) pτ(τ). (11)

In particular, we employ:

• A Gamma prior on τ: we employ a (conditionally) conjugate Gamma prior:

 pτ(τ)≡Gamma(a0,b0). (12)

We use a0=b0=0, which results in a non-informative Jeffreys prior that is scale-invariant.

• We assume that Θ and η are, a priori, conditionally independent given s, i.e. that p(Θ,η,s)=pΘ(Θ|s) pη(η|s) ps(s). We discuss each of these terms below:

• We assume that each component s is, a priori, equally likely, which implies:

 ps(s)=1S,s∈[1:S]. (13)

Hierarchical priors can readily be adopted (e.g. beal_variational_2003 ) but we consider here the simplest possible scenario. An interesting extension would involve infinite mixture models with Dirichlet Process priors antoniak_mixtures_1974 ; maceachern_estimating_1998 which would enable the number of components to be determined automatically. In this work, a less elegant but quite effective adaptive scheme for determining S is proposed in Section 2.5.

• A Gaussian prior on Θ:
The role of the latent variables Θ is to capture the most significant variations of Ψ around its mean μs. By significant we mean in this case the directions along which the largest posterior uncertainty is observed. Naturally these are closely related to the matrices Ws, as Θ represents the reduced coordinates along the subspace spanned by their column vectors. We assume therefore that, a priori, these are independent, have zero mean and follow a multivariate Gaussian:

 pΘ(Θ|s)=N(0,Λ−10,s) (14)

where the entries of the (diagonal) matrix Λ0,s express prior precisions along each of the latent principal directions.

• A Gaussian prior on η:

As the role of the latent variables η is to capture any residual variance (that is not accounted for by WsΘ), we assume that, a priori, η can be modeled by a multivariate Gaussian that has zero mean and an isotropic covariance:

 pη(η|s)=N(0,λ−10,η,sIdΨ). (15)

For the model parameters μs, Ws, we assume that, a priori, the parameters associated with each component are independent. In particular:

• Prior on each μs, s∈[1:S]:
In general such priors must encapsulate not only the information/beliefs available a priori to the analyst but also reflect the physical meaning of the parameters Ψ. We are motivated by applications in elastography where the goal is to identify inclusions that correspond to tumors and generally have very different properties from the surrounding tissue (wellman_breast_1999 ; krouskop_elastic_1998 ). The vector Ψ represents the spatial discretization of the material parameters, i.e. each of its entries corresponds to the value of the material parameter at a certain point in the physical domain. This structure is inherited by μs and for this reason we employ a hierarchical prior that penalizes jumps between neighboring locations (on the spatial domain) calvetti_hypermodels_2008 in a manner controlled by appropriately selected hyperparameters. The model was discussed in detail in franck_sparse_2015 and is included for completeness in B.

• Prior specification on each Ws, s∈[1:S]:
We require that each Ws is orthonormal, i.e. WTsWs=IdΘ, where IdΘ is the dΘ-dimensional identity matrix. This is equivalent to employing a uniform prior on the corresponding Stiefel manifold.

### 2.4 Variational Approximation

We note that inference (exact or approximate) for all the model parameters described previously would pose a formidable task, particularly with regard to μs and Ws, which scale with the dimension dΨ of Ψ. For that purpose, we advocate a hybrid approach whereby Maximum-A-Posteriori (MAP) point estimates of the high-dimensional parameters T={μs,Ws}Ss=1 are obtained and the posterior of the remaining (latent) variables is approximated. To that end we make use of the Variational Bayesian Expectation-Maximization scheme (VB-EM, beal_variational_2003 ; franck_sparse_2015 ) which provides a lower bound on the log of the marginal posterior of T. This can be iteratively maximized by a generalized coordinate ascent (Figure 3) which alternates between finding optimal approximations of the exact (conditional) posterior and optimizing with respect to T.

On the basis of the discussion above and the separation between latent variables (Θ, η, s, τ) and model parameters T, we can rewrite Equation (9) (for a given S) as follows:

 p(s,Θ,η,τ,T|^y)=p(^y|s,Θ,η,τ,T) ps(s) pΘ(Θ|s) pη(η|s) pτ(τ) pT(T)/p(^y). (16)

We note that both sides of the equation above depend implicitly on S, i.e. the total number of components in the model. This is especially important for the model evidence term p(^y) which we discuss in Section 2.5. We nevertheless omit S from the expressions in order to simplify the notation.

Furthermore, the conditional posterior of the latent variables given T is:

 p(s,Θ,η,τ|T,^y)=p(s,Θ,η,τ,T|^y)/p(T|^y) (17)

where p(T|^y) is the (marginal) posterior of the model parameters T.

For an arbitrary density q(Θ,η,τ,s) and by employing Jensen's inequality, it can be shown that franck_sparse_2015 :

 logp(T|^y)=log∑Ss=1∫p(T,Θ,η,τ,s|^y) dΘ dη dτ
=log∑Ss=1∫q(Θ,η,τ,s) p(T,Θ,η,τ,s|^y)/q(Θ,η,τ,s) dΘ dη dτ
≥∑Ss=1∫q(Θ,η,τ,s) log[p(T,Θ,η,τ,s|^y)/q(Θ,η,τ,s)] dΘ dη dτ
=F(q(Θ,η,τ,s),T). (18)

We note here that the variational lower bound F has a direct connection with the Kullback-Leibler (KL) divergence between q(Θ,η,τ,s) and the (conditional) posterior p(Θ,η,τ,s|^y,T). In particular, if we denote by Eq[⋅] expectations with respect to q, then:

 KL(q(Θ,η,τ,s)||p(Θ,η,τ,s|^y,T))=−Eq[log(p(Θ,η,τ,s|^y,T)/q(Θ,η,τ,s))]
=−Eq[log(p(T,Θ,η,τ,s|^y)/(p(T|^y) q(Θ,η,τ,s)))]
=logp(T|^y)−F(q(Θ,η,τ,s),T). (19)

The Kullback-Leibler divergence is by definition non-negative and becomes zero when q(Θ,η,τ,s)=p(Θ,η,τ,s|^y,T). Hence, for a given T, constructing a good approximation to the conditional posterior (in the KL divergence sense) is equivalent to maximizing the lower bound F with respect to q. Analogously, maximizing F with respect to T (for a given q) leads to (sub-)optimal MAP estimates franck_sparse_2015 . This suggests an iterative scheme that alternates between:

• VB-Expectation step: Given the current estimate of T, find the q that maximizes F.

• VB-Maximization step: Given the current q, find the T that maximizes F.

As in standard EM schemes neal_view_1998 , relaxed versions of the aforementioned partial optimization problems can be considered, which improve upon the current value of F rather than finding the optimum at each iteration.
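The alternating scheme above is a generic coordinate ascent on the lower bound. A minimal sketch of the loop follows; the scalar surrogate objective, its closed-form E/M updates and all numeric values are hypothetical stand-ins for the paper's actual bound, chosen only so the skeleton runs end to end.

```python
def vb_em(T0, vb_e_step, vb_m_step, lower_bound, max_iter=100, tol=1e-6):
    """Generic VB-EM coordinate ascent: alternate the VB-E step (optimal q
    given T) and the VB-M step (optimal T given q), monitoring F(q, T)."""
    T, q = T0, None
    F_old = F_new = -float("inf")
    for _ in range(max_iter):
        q = vb_e_step(T)            # maximize F over q for fixed T
        T = vb_m_step(q, T)         # maximize (or just improve) F over T
        F_new = lower_bound(q, T)
        if F_new - F_old < tol:     # F is non-decreasing under exact updates
            break
        F_old = F_new
    return q, T, F_new

# Hypothetical scalar surrogate F(q, T) = -(q - T)^2 - (T - 3)^2:
# the exact E-step gives q = T, the exact M-step gives T = (q + 3) / 2.
q, T, F = vb_em(
    T0=0.0,
    vb_e_step=lambda T: T,
    vb_m_step=lambda q, T: (q + 3.0) / 2.0,
    lower_bound=lambda q, T: -(q - T) ** 2 - (T - 3.0) ** 2,
)
```

The monotone increase of the monitored bound is what makes the stopping criterion on successive F values meaningful, mirroring the convergence monitoring described later for Equation (35).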

Using Equation (16), the lower bound F can be expressed as:

 F(q(Θ,η,τ,s),T)=Eq[log(p(^y|s,Θ,η,τ,T) ps(s) pΘ(Θ|s) pη(η|s) pτ(τ) pT(T)/(p(^y) q(Θ,η,τ,s)))]
=Eq[log(p(^y|s,Θ,η,τ,T) ps(s) pΘ(Θ|s) pη(η|s) pτ(τ)/q(Θ,η,τ,s))]+logpT(T)−logp(^y)
=^F(q(Θ,η,τ,s),T)+logpT(T)−logp(^y). (20)

We will omit the term logp(^y) as it depends on neither q nor T. It is apparent that the challenging term in ^F involves the likelihood, i.e.:

 ^F(q(Θ,η,τ,s),T)=Eq[log(p(^y|s,Θ,η,τ,T) ps(s) pΘ(Θ|s) pη(η|s) pτ(τ)/q(Θ,η,τ,s))]
=Eq[dy2logτ−τ2||^y−y(μs+WsΘ+η)||2]+Eq[log(ps(s) pΘ(Θ|s) pη(η|s) pτ(τ)/q(Θ,η,τ,s))]. (21)

The intractability of the map y(⋅) precludes an analytic computation of the expectation with respect to q, let alone the optimization with respect to it. While stochastic approximation techniques in the context of VB inference have been suggested titsias_doubly_2014 to carry out this task, these would require repeated forward solves (i.e. evaluations of y) which would render them impractical. For that purpose, as in our previous work franck_sparse_2015 , we invoke an approximation by using a first-order Taylor series expansion of y (given s) at μs, i.e.:

 y(μs+WsΘ+η)=y(μs)+Gs(WsΘ+η)+O(||WsΘ+η||2) (22)

where Gs denotes the gradient (Jacobian) of the map y(⋅) evaluated at μs. We will discuss rigorous validation strategies for the approximation error thus introduced in Section 2.6. Truncating Equation (22) at first order, the term in the exponent of the likelihood becomes:

 ||^y−y(μs+WsΘ+η)||2=||^y−y(μs)−GsWsΘ−Gsη||2
=||^y−y(μs)||2−2(^y−y(μs))TGsWsΘ+WTsGTsGsWs:ΘΘT
−2ηTGTs(^y−y(μs)−GsWsΘ)+ηTGTsGsη. (23)
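The expansion in Equation (23) is a purely algebraic identity, so it can be checked numerically. The sketch below compares the direct squared norm against the expanded terms for random (hypothetical) data, Jacobian G, orthonormal W, Θ and η:

```python
import numpy as np

rng = np.random.default_rng(2)
d_y, d_psi, d_theta = 6, 10, 2
y_hat = rng.standard_normal(d_y)
y_mu = rng.standard_normal(d_y)        # stand-in for y(mu_s)
G = rng.standard_normal((d_y, d_psi))  # stand-in for the Jacobian G_s
W, _ = np.linalg.qr(rng.standard_normal((d_psi, d_theta)))
Theta = rng.standard_normal(d_theta)
eta = rng.standard_normal(d_psi)

r = y_hat - y_mu                       # residual at the mean, ^y - y(mu_s)
direct = np.sum((r - G @ W @ Theta - G @ eta) ** 2)

GW = G @ W
expanded = (r @ r
            - 2.0 * r @ GW @ Theta
            + np.trace(GW.T @ GW @ np.outer(Theta, Theta))   # W^T G^T G W : Theta Theta^T
            - 2.0 * eta @ G.T @ (r - GW @ Theta)
            + eta @ G.T @ G @ eta)
```

Writing the quadratic term as a Frobenius inner product with ΘΘT is what makes the expectations in the next step tractable: only second moments of Θ and η are needed.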

We introduce a second approximation in terms of the family of ’s over which we wish to optimize by using a mean-field decomposition (peierls_minimum_1938 ; opper_advanced_2001 ) of the form:

 q(Θ,η,s,τ)≈q(Θ,η,s)q(τ)=q(Θ,η|s)q(s) q(τ)≈q(Θ|s)q(η|s)q(s)q(τ). (24)

In the first approximation, τ is assumed to be a posteriori independent of the remaining latent variables, on the premise that the measurement noise precision is determined by the experimental conditions and is not directly dependent on the latent variables. In the second approximation, we assume that Θ and η are conditionally independent given s (this implies that Θ and η are actually dependent, as one would expect). The latter assumption is justified by the roles of Θ and η in the representation of Ψ (Equation (7)), expressing the main effects around the mean and the residual "noise" respectively. As such, it is also reasonable to assume that the means of q(Θ|s) and q(η|s) are zero a posteriori, i.e. <Θ>Θ|s=0 and <η>η|s=0. Furthermore, we employ an isotropic covariance for q(η|s), i.e. λ−1η,sIdΨ, where λ−1η,s represents the (unknown) variance.

If we denote expectations with respect to q(Θ|s), q(η|s) and q(τ) by <⋅>Θ|s, <⋅>η|s and <⋅>τ respectively, then (omitting constants that do not affect the optimization) Equation (21) becomes:

 ^F(q(Θ,η,τ,s),T)=dy2<logτ>τ−<τ>τ2∑sq(s)||^y−y(μs)||2
+<τ>τ∑sq(s)(^y−y(μs))TGsWs<Θ>Θ|s (=0 since <Θ>Θ|s=0)
−<τ>τ2∑sq(s)WTsGTsGsWs:<ΘΘT>Θ|s
+<τ>τ∑sq(s)<η>Tη|sGTs(^y−y(μs)) (=0 since <η>η|s=0)
−<τ>τ∑sq(s)<η>Tη|sGTsGsWs<Θ>Θ|s (=0 since <η>η|s=0)
−<τ>τ2∑sq(s)GTsGs:<ηηT>η|s
+∑sq(s)log1S
+(a0−1)<logτ>τ−b0<τ>τ
+∑sq(s)(12log|Λ0,s|−12Λ0,s:<ΘΘT>Θ|s)
+∑sq(s)(dΨ2logλ0,η,s−λ0,η,s2I:<ηηT>η|s)
−∑sq(s)∫q(Θ|s)logq(Θ|s) dΘ
−∑sq(s)∫q(η|s)logq(η|s) dη
−∑sq(s)logq(s)
−∫q(τ)logq(τ) dτ. (25)

Despite the long expression, the optimization of F in the VB-Expectation step can be done analytically and we find that the optimal q (given T) is:

 qopt(Θ|s)≡N(0,Λ−1s)
qopt(η|s)≡N(0,λ−1η,sIdΨ)
qopt(τ)≡Gamma(a,b) (26)

where:

 Λs=Λ0,s+<τ>τWTsGTsGsWs (27)
 λη,s=λ0,η,s+1dΨ<τ>τtrace(GTsGs) (28)
 a=a0+dy/2 (29)
 b=b0+12S∑s=1q(s)(||^y−y(μs)||2+WTsGTsGsWs:Λ−1s+λ−1η,s trace(GTsGs)). (30)
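The closed-form updates of Equations (27)-(30) transcribe almost verbatim into code. The sketch below handles a single component s with hypothetical inputs (a random Jacobian G, a random orthonormal W) and computes only component s's contribution to the sum over s in Equation (30):

```python
import numpy as np

def vb_e_updates(y_hat, y_mu, G, W, Lambda0, lam0_eta, a0, b0, q_s, tau_mean):
    """Closed-form VB-E updates (Eqs. (27)-(30)) for one mixture component s;
    tau_mean is the current <tau>, q_s the current posterior weight q(s)."""
    GW = G @ W
    M = GW.T @ GW                                    # W^T G^T G W
    Lambda_s = Lambda0 + tau_mean * M                # Eq. (27)
    d_psi = G.shape[1]
    lam_eta = lam0_eta + tau_mean * np.trace(G.T @ G) / d_psi   # Eq. (28)
    a = a0 + y_hat.size / 2.0                        # Eq. (29)
    r = y_hat - y_mu
    # Eq. (30): only the term of the sum over s belonging to this component.
    b = b0 + 0.5 * q_s * (r @ r
                          + np.trace(M @ np.linalg.inv(Lambda_s))
                          + np.trace(G.T @ G) / lam_eta)
    return Lambda_s, lam_eta, a, b

rng = np.random.default_rng(3)
G = rng.standard_normal((4, 6))
W, _ = np.linalg.qr(rng.standard_normal((6, 2)))
Lambda_s, lam_eta, a, b = vb_e_updates(
    y_hat=rng.standard_normal(4), y_mu=rng.standard_normal(4), G=G, W=W,
    Lambda0=np.eye(2), lam0_eta=1.0, a0=0.0, b0=0.0, q_s=1.0, tau_mean=2.0)
```

Note that only quantities already available from the linearization (the misfit at μs and products with Gs) enter the updates, so no additional forward solves are required within the VB-E step.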

Furthermore, for the latent variable s we find that:

 qopt(s)∝ecs (31)

where:

 cs=12log|Λ0,s||Λs|+dΨ2logλ0,η,sλη,s−<τ>τ2||^y−y(μs)||2 (32)

and <τ>τ=a/b. The normalization constant for qopt(s) can be readily found by imposing the condition ∑Ss=1qopt(s)=1, which yields:

 qopt(s)=ecs∑s′ecs′. (33)
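Equations (31)-(33) amount to a softmax over the scores cs. Since cs involves log-determinants and scaled misfits it can be large in magnitude, so a numerically robust implementation subtracts the maximum score before exponentiating (the standard log-sum-exp shift; the numeric scores below are hypothetical):

```python
import numpy as np

def q_s_from_scores(c):
    """Normalize component scores c_s (Eq. (32)) into q(s) (Eq. (33)).
    Subtracting max(c) leaves the ratios e^{c_s}/sum e^{c_s'} unchanged
    while preventing overflow in the exponentials."""
    c = np.asarray(c, dtype=float)
    q = np.exp(c - c.max())
    return q / q.sum()

q = q_s_from_scores([710.0, 709.0, 700.0])  # naive exp(710) would overflow
```

The shift is exact, not approximate: adding a constant to every cs cancels between numerator and denominator of Equation (33).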

While the optimal q's are inter-dependent, we note in the expression above that the posterior probability qopt(s) of each mixture component, as one would expect, increases as the mean-square error ||^y−y(μs)||2 gets smaller. More interestingly perhaps, qopt(s) increases as the determinant of the posterior precision matrix Λs decreases, i.e. as the posterior variance associated with the reduced coordinates of component s increases. The same effect is observed for the posterior residual variance λ−1η,s. This implies that, ceteris paribus, mixture components with larger posterior variance will have a bigger weight in the overall posterior.

For the optimal q (given T) in the equations above, the variational lower bound takes the following form (terms independent of q or T are omitted - for details see A):

$$ \hat{F}(q^{opt},T) = \sum_{s=1}^{S} q^{opt}(s)\left(-\frac{\langle\tau\rangle_\tau}{2}\|\hat{y}-y(\mu_s)\|^2 + \frac{1}{2}\log\frac{|\Lambda_{0,s}|}{|\Lambda_s|} + \frac{d_\Psi}{2}\log\frac{\lambda_{0,\eta,s}}{\lambda_{\eta,s}} - \log q^{opt}(s)\right) + a\log\langle\tau\rangle_\tau \qquad (34) $$

and

$$ F(q^{opt},T) = \hat{F}(q^{opt},T) + \log p_T(T) \qquad (35) $$

where the omitted constant is the normalization constant of a Gamma distribution with parameters $(a,b)$. This can be computed at each full iteration of VB-EM in order to monitor convergence.

While it is difficult again to gain insight into the expression above due to the inter-dependencies between the various terms, we note that the smaller the mean-square error $\|\hat{y}-y(\mu_s)\|^2$ becomes (i.e. the better the mean $\mu_s$ is able to reproduce the measurements), the more the lower bound increases. In addition, we can see that the lower bound increases as the variance of the mixture components gets larger, i.e. the more posterior variance they can capture.

For the VB-Maximization step, it can be readily established from Equation (25) that the optimization of $F$ with respect to $\mu$ (given $q$) involves the following set of uncoupled optimization problems [1]:

$$ \max_{\mu_j} F_{\mu_j} = -\frac{\langle\tau\rangle_\tau}{2}\|\hat{y}-y(\mu_j)\|^2 + \log p(\mu_j), \qquad j=1,\dots,S. \qquad (36) $$

Since the objectives are identical for each $j$, we can deduce that the $\mu_j$ should correspond to (different or identical) local maxima of $F_{\mu_j}$. This implies that, in the posterior approximation constructed, each Gaussian in the mixture is associated with a (regularized, due to the prior) local optimum of the least-squares solution of the inverse problem. The search for multiple local optima, and more importantly their number, is discussed in the next section.
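As a toy illustration of how different starting points yield different modes $\mu_j$, consider the following sketch. The scalar quadratic forward map, the flat prior $\log p(\mu)=\text{const}$, and the step sizes are invented for illustration only and are not part of the paper's formulation:

```python
def local_mode(mu0, y_hat, y, grad_y, tau=1.0, step=0.05, iters=500):
    """Gradient ascent on F_mu(mu) = -tau/2 * ||y_hat - y(mu)||^2, i.e.
    Eq. (36) with a flat prior. dF/dmu = tau * (y_hat - y(mu)) * y'(mu).
    Different starting points mu0 can converge to different local maxima."""
    mu = float(mu0)
    for _ in range(iters):
        r = y_hat - y(mu)                    # residual at the current iterate
        mu += step * tau * r * grad_y(mu)    # ascent step
    return mu

# Toy forward map with two least-squares solutions: y(mu) = mu^2, y_hat = 1,
# so both mu = +1 and mu = -1 reproduce the data exactly.
y, grad_y, y_hat = (lambda m: m * m), (lambda m: 2 * m), 1.0
m1 = local_mode(0.5, y_hat, y, grad_y)   # converges near +1
m2 = local_mode(-0.5, y_hat, y, grad_y)  # converges near -1
```

Each recovered optimum would seed the mean of one Gaussian component in the mixture, which is precisely why a multi-start strategy is needed to populate the approximation.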

The determination of the optimal $\mu_j$ is performed using first-order derivatives of $F_{\mu_j}$. Since $\log p(\mu_j)$ and its derivative are analytically unavailable, we employ an additional layer (inner loop) of Expectation-Maximization to deal with the hyperparameters in the prior of $\mu_j$. The details were discussed in [1] and are included in Appendix B for completeness.

Considering the computational cost of these operations, we point out that the updates of $\mu_j$ are the most demanding, as they require calls to the forward model to evaluate $y(\mu_j)$ and the derivatives $G_j$. For the computation of the derivatives we employ adjoint formulations, which offer great savings when the dimension of the unknowns is high and the number of observables is comparatively small [papadimitriou_direct_2008]. As discussed in detail in [1], the latter condition can be removed as long as a direct solver is used for the solution of the forward problem. In this case, the cost of the solution of the adjoint equations is even less than that of the forward solution.

The remaining aspect of the VB-Maximization step involves the optimization with respect to the $W_j$ (given $q$ and $\mu$). As with $\mu_j$, it suffices to consider only the terms in Equation (25) that depend on $W_j$ (which we denote by $F_{W_j}$) and which again lead to a set of uncoupled problems:

$$ \max_{W_j} F_{W_j} = -\frac{\langle\tau\rangle_\tau}{2}\, W_j^T G_j^T G_j W_j : \Lambda_j^{-1} + \log p(W_j), \qquad j=1,\dots,S. \qquad (37) $$

The first term favors directions corresponding to the eigenvectors of $G_j^T G_j$ with the smallest eigenvalues, where $G_j$ is the gradient of the map $y(\cdot)$ at $\mu_j$. As discussed previously in Section 2.3, the prior $p(W_j)$ enforces the orthogonality of the basis vectors in $W_j$. To solve this constrained optimization problem, we use the iterative algorithm of [wen_feasible_2013], which employs a Cayley transform to enforce the constraint. It makes use of first-order derivatives of $F_{W_j}$ and as such does not require any additional forward model runs.
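The core of such a Cayley-based update can be sketched as follows. This is a minimal dense-matrix version, with illustrative names, of the orthogonality-preserving curve used in feasible methods in the spirit of Wen & Yin; the step-size selection and the low-rank reformulation of the linear solve used in practice are omitted:

```python
import numpy as np

def cayley_step(W, D, step):
    """One orthogonality-preserving update along a Cayley curve.

    W: (n, k) with W.T @ W = I (a point on the Stiefel manifold).
    D: (n, k) gradient of the objective with respect to W.
    A = D W^T - W D^T is skew-symmetric, so the Cayley transform
    W(step) = (I + step/2 * A)^{-1} (I - step/2 * A) W
    is orthogonal and keeps W.T @ W = I for any step size.
    """
    n = W.shape[0]
    A = D @ W.T - W @ D.T                 # skew-symmetric (n, n)
    I = np.eye(n)
    return np.linalg.solve(I + 0.5 * step * A, (I - 0.5 * step * A) @ W)
```

For large $n$ the dense $(n \times n)$ solve above is too expensive; the original algorithm rewrites it via a Sherman-Morrison-Woodbury identity so that only $(2k \times 2k)$ systems are solved.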

With regards to the number of columns $d_\Theta$ in $W_s$ (which is equal to the dimension of $\Theta$), we assume that this is the same across all mixture components $s$. We had developed an information-theoretic criterion in [1] which can also be employed here. This allows the adaptive determination of $d_\Theta$ by measuring the information gain, here denoted by $I(d_\Theta, j)$ for each mixture component $j$, that each new dimension in $W_j$ furnishes. When these fall below a prescribed threshold $I_{\max}$, i.e.:

$$ I(d_\Theta) = \max_j I(d_\Theta, j) \;\le\; I_{\max} \qquad (38) $$

we assume that the number of columns $d_\Theta$ is sufficient. A detailed discussion on the estimation of $d_\Theta$ using the information gain criterion is given in Appendix C and [1].

Following the previous discussion in Equation (10), we note that once the (approximate) posterior $q$ and the optimal model parameters $T$ have been computed, we obtain a multimodal posterior approximation for the material parameters $\Psi$, which is given by:

$$
\begin{aligned}
p(\Psi|\hat{y},T) &= \sum_{s=1}^{S}\int \delta\big(\Psi-(\mu_s+W_s\Theta+\eta)\big)\, p(s,\Theta,\eta,\tau|\hat{y},T)\, d\Theta\, d\eta\, d\tau \\
&\approx \sum_{s=1}^{S}\int \delta\big(\Psi-(\mu_s+W_s\Theta+\eta)\big)\, q(s,\Theta,\eta)\, d\Theta\, d\eta \\
&= \sum_{s=1}^{S} q(s)\int \delta\big(\Psi-(\mu_s+W_s\Theta+\eta)\big)\, q(\Theta,\eta|s)\, d\Theta\, d\eta \\
&= \sum_{s=1}^{S} q(s)\, q_s(\Psi) = q(\Psi) \qquad (39)
\end{aligned}
$$

where each component in the last mixture is given by:

$$ q_s(\Psi) = \int \delta\big(\Psi-(\mu_s+W_s\Theta+\eta)\big)\, q(\Theta,\eta|s)\, d\Theta\, d\eta \equiv \mathcal{N}(\mu_s, D_s), \qquad (40) $$

i.e. a multivariate Gaussian with mean $\mu_s$ and covariance $D_s$, where:

$$ D_s = W_s\Lambda_s^{-1}W_s^T + \lambda_{\eta,s}^{-1} I_{d_\Psi}. \qquad (41) $$

Based on Equation (39), one can evaluate the posterior mean and covariance of $\Psi$ as follows:

$$
\begin{aligned}
E_q[\Psi] &= E\big[E_q[\Psi|s]\big] = E[\mu_s] = \sum_{s=1}^{S} q(s)\,\mu_s \\
Cov_q[\Psi] &= E_q[\Psi\Psi^T] - E_q[\Psi]\,E_q^T[\Psi]
= E\big[E_q[\Psi\Psi^T|s]\big] - E_q[\Psi]\,E_q^T[\Psi] \\
&= \sum_{s=1}^{S} q(s)\big(D_s+\mu_s\mu_s^T\big) - \left(\sum_{s=1}^{S} q(s)\,\mu_s\right)\left(\sum_{s=1}^{S} q(s)\,\mu_s\right)^T. \qquad (42)
\end{aligned}
$$

Posterior moments of any order or posterior probabilities can be readily computed as well.
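The moment formulas (41)-(42) are straightforward to implement. The following sketch (function names are illustrative) assembles a component covariance $D_s$ and the mixture mean and covariance:

```python
import numpy as np

def component_cov(W, Lambda, lam_eta):
    """Covariance D_s of Eq. (41): W Lambda^{-1} W^T + lam_eta^{-1} I."""
    d_Psi = W.shape[0]
    return W @ np.linalg.inv(Lambda) @ W.T + np.eye(d_Psi) / lam_eta

def mixture_moments(q_s, mu, D):
    """Posterior mean and covariance of the Gaussian mixture, Eq. (42).

    q_s: (S,) component weights;  mu[s]: (d,) means;  D[s]: (d, d) covariances.
    """
    mean = sum(q * m for q, m in zip(q_s, mu))
    cov = sum(q * (Ds + np.outer(m, m)) for q, m, Ds in zip(q_s, mu, D))
    cov -= np.outer(mean, mean)
    return mean, cov
```

Note that the mixture covariance contains both the within-component covariances $D_s$ and the between-component spread of the means $\mu_s$, which is exactly how multimodality inflates the overall posterior variance.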

Note that if