1 Introduction
Model-based (or model-constrained) inverse problems appear in many scientific fields, and their solution represents a fundamental challenge in the context of model calibration and system identification biegler_largescale_2010
. Bayesian formulations offer a rigorous setting for their solution, as they account for the various sources of uncertainty that are unavoidably present in these problems. Furthermore, they possess a great advantage over deterministic alternatives: apart from point estimates, they provide quantitative metrics of the uncertainty in the unknowns, encapsulated in the
posterior distribution gelman_bayesian_2003 . An application of particular, but not exclusive, interest for this paper involves the identification of the mechanical properties of biological materials, in the context of non-invasive medical diagnosis (elastography). While in certain cases mechanical properties can also be measured directly by excising multiple tissue samples, non-invasive procedures offer obvious advantages in terms of ease, cost and reduced risk of complications to the patient. In contrast to x-ray techniques, which capture variations in density, the identification of stiffness, or of mechanical properties in general, can potentially lead to earlier and more accurate diagnosis ISI:000263259100006 ; liver2006 , provide valuable insights that differentiate between modalities of the same pathology cur12gen , monitor the progress of treatments and ultimately lead to patient-specific treatment strategies.
All elastographic techniques consist of the following three basic steps doyley_modelbased_2012 : 1) excite the tissue using a (quasi-)static, harmonic or transient source, 2) (indirectly) measure tissue deformation (e.g. displacements, velocities) using an imaging technique such as ultrasound Ophir:1991 , magnetic resonance mut95mag or optical tomography khalil_tissue_2005 , and 3) infer the mechanical properties from these data using a suitable continuum mechanical model of the tissue’s deformation. Perhaps the most practical such imaging technique, due to its lower relative cost and increased portability, is ultrasound elasticity imaging sarvazyan_elasticity_2011 ; doyley_elastography:_2014 . The pioneering work of Ophir and co-workers Ophir:1991 , followed by several clinical studies Garra:1997 ; Bamber:2002 ; Thomas:2006 ; par11im , has demonstrated that the resulting strain images typically improve the diagnostic accuracy over ultrasound alone. Furthermore, technological advances have led to ultra-portable ultrasound transducers, attachable to smartphones/tablets schleder_diagnostic_2013 ; driver_doctors_2016 . As the rate of data acquisition increases and the cost decreases, it becomes increasingly important to develop tools that leverage the capabilities of physics-based models in order to produce diagnostic estimates quickly and accurately, as well as to quantify the confidence in them. Apart from breast cancer, there is a wealth of evidence indicating the potential of elastography-based techniques in detecting a variety of other pathologies, such as prostate kro98ela ; hoy08tis and liver cancer asbach_assessment_2008 , in characterizing blood clots schmitt_noninvasive_2007 , and in brain imaging hamhaber_vivo_2010 , atherosclerosis ohayon_biomechanics_2013 and osteopenia shore_transversely_2011 .
In this paper we advocate a probabilistic, indirect or iterative procedure (in contrast to direct elastography ISI:000275699600006 ) which admits an inverse problem formulation and involves the discrepancy between observed and model-predicted displacements ISI:000223500200013 ; doyley_enhancing_2006 ; ISI:000275756200016 ; ISI:000280774700004 ; doyley_modelbased_2012 . Several other problems which involve complex forward models (i.e. expensive likelihoods) and the identification of spatially varying model parameters share similar characteristics, such as permeability estimation for soil transport processes, which can assist in the detection of contaminants, oil exploration and carbon sequestration wang_bayesian_2004 ; wang_hierarchical_2005 ; dostert_coarsegradient_2006 .
The solution of such model calibration problems in the Bayesian framework is hampered by two main difficulties. The first affects the computational efficiency of such methods and stems from the poor scaling of traditional Bayesian inference tools with respect to the dimensionality of the unknown parameter vector, another instance of the curse of dimensionality. In problems such as the one described above, the model parameters of interest (i.e. material properties) exhibit spatial variability which requires fine discretizations in order to be captured. This variability can also span different scales mallat_wavelet_2008 ; koutsourelakis_multiresolution_2009 . Standard Markov Chain Monte Carlo (MCMC,
green_bayesian_2015 ) techniques require an exorbitant number of likelihood evaluations (i.e. solutions of the forward model) in order to converge roberts_exponential_1996 ; roberts_optimal_1998 ; mattingly_diffusion_2012 ; pillai_optimal_2012 . As each of these calls implies the solution of very large systems of (non)linear, and potentially transient, equations, it is generally of interest to minimize their number, particularly in time-sensitive applications. Advanced sampling schemes, involving adaptive MCMC lee_markov_2002 ; holloman_multiresolution_2006 ; chopin_free_2012 and Sequential Monte Carlo (SMC, moral_sequential_2006 ; koutsourelakis_multiresolution_2009 ; del_moral_adaptive_2010 ), exploit physical insight and the use of multi-fidelity solvers in order to expedite the inference process. Nevertheless, the number of forward calls can still be on the order of tens of thousands, if not much more. Several attempts have also been directed towards using emulators, surrogates or reduced-order models of various kinds marzouk_stochastic_2007 ; buithanh_model_2008 ; rosic_samplingfree_2012 ; bilionis_solution_2014 ; chen_sparsegrid_2015 ; lan_emulation_2016 , but such a task is severely hindered by the high dimensionality. The use of first- and second-order derivatives has also been advocated, either in a standard MCMC format or by developing advanced sampling strategies. These are generally available by solving appropriate adjoint problems, which are well-understood in the context of deterministic formulations flath_fast_2011 ; girolami_riemann_2011 ; martin_stochastic_2012 ; buithanh_solving_2014 ; petra_computational_2014 . More recent treatments attempt to exploit the lower-dimensional structure of the target posterior by identifying subspaces where either most of the probability mass is contained
franck_sparse_2015 or where maximal sensitivity is observed cui_likelihoodinformed_2014 ; spantini_optimal_2015 ; cui_scalable_2015 ; cui_dimensionindependent_2016 . This enables inference tasks that are carried out on spaces of significantly reduced dimension and are not hampered by the aforementioned difficulties. Generally, all such schemes construct these approximations around the MAP point by employing local information (e.g. gradients) and are therefore not suitable for multimodal or highly non-Gaussian posteriors. The latter represents the second challenge that we attempt to address in this paper, namely the identification of multiple posterior modes. In the context of elastography, multimodality can originate from anisotropic material behavior chatelin_anisotropic_2014 , wrong/missing information from images/measurements fang_compositionalpriorguided_2010 or the imaging modality employed fromageau_estimation_2007
. In all cases, each mode in the posterior can lead to different diagnostic conclusions and it is therefore very important to identify them and correctly assess their posterior probabilities. The majority of Bayesian strategies for the solution of computationally intensive inverse problems operates under the assumption of a unimodal posterior or focuses on the approximation of a single mode of the posterior. Some numerical inference tools based on SMC or other tempering mechanisms
gill_dynamic_2004 ; li_adaptive_2015 ; feroz_multimodal_2008 have been developed, but they require a very large number of forward model calls, particularly when the dimension of the unknowns increases. We note finally that the treatment of multimodal densities in high dimensions has attracted significant interest in atomistic simulation, in the context of free energy computations lelievre_free_2010 ; bilionis_free_2012 , but in such problems (apart from other distinguishing features) the cost per density evaluation (i.e. one MD time step) is smaller than in our setting. In this paper we propose a Variational Bayesian (VB) strategy that extends our previous work franck_sparse_2015
. Therein we have shown how accurate approximations of the true posterior can be attained by identifying a low-dimensional subspace where posterior uncertainty is concentrated. This has led to computational schemes that require only a few tens of forward model runs in the problems investigated. Nevertheless, our previous work was based on the assumption of a unimodal posterior, which we propose overcoming in this paper by employing a mixture of multivariate Gaussians. Such mixtures have been employed in various statistics and machine learning applications (e.g. speaker identification
reynolds_speaker_2000 , data clustering jain_data_1999 ), and in combination with Variational Bayesian inference techniques as well choudrey_variational_2003 ; beal_variational_2003 ; kuusela_gradientbased_2009 . Nevertheless, all these problems were characterized by inexpensive likelihoods, relatively low dimensions and multiple data/measurements. In contrast, the inverse problems considered here are based on a single experiment and a single observation vector. Furthermore, we propose an adaptive algorithm based on information-theoretic criteria for the identification of the required number of mixture components (Section 2). We present the parametrization of the proposed model in Section 2, where we also discuss a Variational Bayesian Expectation-Maximization
beal_variational_2003 scheme for performing inference and learning. In Section 3 we present numerical illustrations involving a simple toy example and an example in the context of elastography.
2 Methodology
This section discusses the methodological framework advocated. We begin (Section 2.1) with a generic presentation of the forward model, and in particular of models arising from the discretization of nonlinear PDEs, as in our motivating application of nonlinear elastography. In Section 2.2 we present a Bayesian mixture model that can identify lower-dimensional subspaces where most of the posterior mass is concentrated, as well as account for multiple modes. The prior assumptions for the model parameters are summarized in Section 2.3. In Section 2.4 we discuss a Variational Bayesian Expectation-Maximization scheme for efficiently computing approximations to the posterior for a fixed number of mixture components, and in Section 2.5 we discuss a scheme for determining the appropriate number of such components. Finally, in Section 2.6 we discuss how to assess the accuracy of the computed approximation, as well as a way to correct for any bias if this is deemed necessary.
2.1 Forward model - Governing equations
Canonical formulations of model-based inverse problems postulate the existence of a forward model that typically arises from the discretization of governing equations, such as PDEs/ODEs, and which can be expressed in residual form as follows:
(1) 
The residual vector depends on the state vector (forward-problem unknowns) and on the vector of unobserved (latent) model parameters (inverse-problem unknowns). The aforementioned equation is complemented by a set of (noisy) observations/measurements which pertain to the model output:
(2) 
The unknowns in the problems considered arise from the discretization of spatially varying parameters, and we refer to them as material parameters in view of the biomechanics applications discussed in Section 3.2. Throughout this work we assume that the noise term pertains only to measurement/observation errors, which we model with a zero-mean, uncorrelated Gaussian vector:
(3) 
The precision of the observation noise will be assumed unknown and will be inferred from the data along with the material parameters. Other noise distributions (e.g. to account for faulty sensors) can readily be considered jin_variational_2012 . We note that the difference between observations and model predictions would in general also contain model errors, arising from the discretization of the governing equations and/or the inadequacy of the model itself to capture the underlying physical process. While the former source can be reduced by considering very fine discretizations (at the cost of increasing the dimensionality of the state vector and, potentially, of the parameter vector), the latter requires a much more thorough treatment which exceeds the scope of this work kennedy_bayesian_2001 ; higdon_computer_2008 ; bayarri_framework_2007 ; berliner_modeling_2008 ; koutsourelakis_novel_2012 ; strong_when_2014 ; sargsyan_statistical_2015 .
In the traditional route, the likelihood implied by Equation (3):
(4) 
is complemented by priors which, with the application of Bayes’ rule, lead to the definition of the posterior density, which is proportional to:
(5) 
The intractability of the forward map precludes the availability of closed-form solutions for the posterior and necessitates the use of various sampling schemes such as those discussed in the introduction. This task is seriously impeded by a) the need for repeated solutions of the discretized forward problem (Equation (1)), each of which can be quite taxing computationally, and b) the high dimensionality of the vector of unknowns, which hinders the efficient search (e.g. by sampling) of the latent parameter space and further increases the computational burden. The goal of the proposed Variational Bayesian scheme is to alleviate these difficulties by proposing adequate approximations and dimensionality-reduction techniques that are seamlessly integrated in the inference framework. Furthermore, we attempt to overcome well-known limitations that have to do with the multimodality of the posterior and which further exacerbate these problems. Multimodality is inherently related to the ill-posedness of inverse problems, and its potential increases as the dimension of the vector of unknowns grows and/or the noise is amplified.
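To make the structure of the posterior in Equation (5) concrete, the following minimal Python sketch evaluates the unnormalized log-posterior for the Gaussian noise model of Equation (3). The callables `forward`, `log_prior_theta` and `log_prior_tau` are illustrative placeholders (not part of the paper's formulation); every MCMC step would require one call to `forward`, which is precisely the cost this work seeks to avoid.

```python
import numpy as np

def log_posterior_unnorm(theta, tau, y, forward, log_prior_theta, log_prior_tau):
    """Unnormalized log-posterior for a Gaussian likelihood with noise
    precision tau: 0.5*n*log(tau) - 0.5*tau*||y - u(theta)||^2 + log-priors.
    All callables are user-supplied placeholders (a sketch, not the
    paper's implementation)."""
    r = y - forward(theta)                       # data - model prediction
    n = y.size
    log_lik = 0.5 * n * np.log(tau) - 0.5 * tau * (r @ r)
    return log_lik + log_prior_theta(theta) + log_prior_tau(tau)
```

Each evaluation of this density entails one forward solve, which motivates both the linearization and the variational approximation introduced below.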
2.2 Bayesian Mixture Model
In this section and in view of the aforementioned desiderata we introduce the augmented formulation of the Bayesian inverse problem.

In order to capture multiple modes of the posterior (if such are present), we introduce a discrete, latent variable which takes integer values from 1 up to the number of modes identified, each of which will be modeled with a multivariate Gaussian. The cardinality of the model, i.e. the number of mixture components, is learned in a manner that is described in the sequel.

In order to identify a lower-dimensional representation of the unknowns, we define latent variables of much smaller dimension. The premise here is that, while the vector of unknowns is high-dimensional, its posterior can be adequately represented on a subspace of much smaller dimension that captures most of the variance. As we have argued in franck_sparse_2015 , these latent variables can give rise to a PCA-like representation of the form:
(6) where
is the mean vector and the columns of the orthogonal matrix
span the aforementioned subspace with reduced coordinates. The residual vector captures the variance (noise) that complements the main effects. In view of the multimodal approximation, and since each mode implies a different mean and a different lower-dimensional subspace (Figure 1), we advocate in this work expansions of the form:
(7) where the notation implies the representation within each mode. As with multiple modes there can also be multiple subspaces where the variance is concentrated, it is necessary to distinguish between them. In principle, the dimension of the reduced subspaces could also vary from mode to mode, but we do not consider this here for simplicity of notation.
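The per-mode expansion of Equation (7) can be sketched as a generative sampler. All names (`mus`, `Ws`, `tau_theta`) are illustrative, and the standard-normal draws merely stand in for whatever distributions are eventually inferred for the reduced coordinates and the residual noise.

```python
import numpy as np

def sample_theta(mode, mus, Ws, tau_theta, rng):
    """Draw a realization Theta = mu_i + W_i @ eta + xi for the given
    mixture component `mode` (a sketch of Equation (7), not the paper's
    code). mus[i]: mean vector (d,); Ws[i]: orthonormal basis (d, d_eta);
    tau_theta: precision of the isotropic residual noise xi."""
    mu, W = mus[mode], Ws[mode]
    eta = rng.standard_normal(W.shape[1])                    # reduced coordinates
    xi = rng.standard_normal(mu.size) / np.sqrt(tau_theta)   # residual noise
    return mu + W @ eta + xi
```

Note that the high-dimensional vector is reconstructed from only a handful of reduced coordinates per mode, plus low-variance residual noise.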
For reasons that will become apparent in the following, we distinguish between latent variables and model parameters. We seek point estimates for the latter and (approximations of) the actual (conditional) posterior for the former. The discussion thus far suggests that the likelihood of Equation (4) takes the form:
(8) 
A graphical illustration of the proposed probabilistic generative model is given in Figure 2.
Following the standard Bayesian formalism, one would complement the aforementioned likelihood with priors on the model parameters and the latent variables , in order to obtain the joint posterior (given ):
(9) 
We discuss the specific form of the priors (and associated hyperparameters) in Subsection
2.3, as well as the inference/learning strategy we propose in Section 2.4. We note at this stage, however, that given this posterior, one would obtain a mixture representation of the unknown material parameters, as implied by Equation (7). In particular, given values of the latent variables, it immediately follows that the posterior of the material parameters is:
(10) 
where the conditional posterior is found from Equation (9). We discuss in Section 2.4 how the posterior on the latent variables is approximated as well as the values (point estimates) for the model parameters are computed.
2.3 Priors
We assume that, a priori, the precision of the observation noise is independent of the remaining latent variables i.e.:
(11) 
In particular, we employ:

A Gamma prior on the noise precision: we employ a (conditionally) conjugate Gamma prior:
(12) We use hyperparameter values that result in a non-informative Jeffreys prior, which is scale-invariant.

We assume that the remaining latent variables are, a priori, conditionally independent. We discuss each of the corresponding terms below:

We assume that each component is, a priori, equally likely, which implies:
(13) Hierarchical priors can readily be adopted (e.g. beal_variational_2003 ), but we consider here the simplest possible scenario. An interesting extension would involve infinite models with Dirichlet Process priors antoniak_mixtures_1974 ; maceachern_estimating_1998 , which would enable the number of components to be determined automatically. In this work, a less elegant but quite effective adaptive scheme for determining the number of components is proposed in Section 2.5.

A Gaussian prior on :
The role of these latent variables is to capture the most significant variations of the unknowns around the mean, i.e. the directions along which the largest posterior uncertainty is observed. Naturally, these are closely related to the basis matrices and represent the reduced coordinates along the subspace spanned by their column vectors. We assume therefore that, a priori, they are independent, have zero mean and follow a multivariate Gaussian:
(14) where the hyperparameters express the prior variances along each of the latent principal directions.

A Gaussian prior on :
As the role of these latent variables is to capture any residual variance (not accounted for by the main effects), we assume that, a priori, they can be modeled by a multivariate Gaussian that has zero mean and an isotropic covariance:
(15)

For the model parameters , we assume that, a priori, the parameters associated with each component are independent. In particular:

Prior on each for :
In general, such priors must encapsulate not only the information/beliefs available a priori to the analyst, but also reflect the physical meaning of the parameters. We are motivated by applications in elastography where the goal is to identify inclusions that correspond to tumors, which generally have very different properties from the surrounding tissue wellman_breast_1999 ; krouskop_elastic_1998 . The vector of unknowns represents the spatial discretization of the material parameters, i.e. each of its entries corresponds to the value of the material parameter at a certain point in the physical domain. This structure is inherited by the mean vectors, and for this reason we employ a hierarchical prior that penalizes jumps between neighboring locations (in the spatial domain) calvetti_hypermodels_2008 , in a manner controlled by appropriately selected hyperparameters. The model was discussed in detail in franck_sparse_2015 and is included for completeness in Appendix B. 
Prior specification on each for :
We require that each basis matrix is orthonormal, i.e. that its columns are mutually orthogonal unit vectors. This is equivalent to employing a uniform prior on the Stiefel manifold
.
2.4 Variational Approximation
We note that inference (exact or approximate) for all the model parameters described previously would pose a formidable task, particularly with regard to the means and bases, whose dimension scales with that of the unknowns. For that purpose, we advocate a hybrid approach whereby Maximum-A-Posteriori (MAP) point estimates of the high-dimensional parameters are obtained and the posterior of the remaining (latent) variables is approximated. To that end, we make use of the Variational Bayesian Expectation-Maximization scheme (VB-EM, beal_variational_2003 ; franck_sparse_2015 ), which provides a lower bound on the log of the marginal posterior of the model parameters. This bound can be iteratively maximized by a generalized coordinate ascent (Figure 3) which alternates between finding optimal approximations of the exact (conditional) posterior and optimizing with respect to the model parameters.
On the basis of the discussion above and the separation between latent variables () and model parameters , we can rewrite Equation (9) (for a given ) as follows:
(16) 
We note that both sides of the equation above depend implicitly on the total number of components in the model. This is especially important for the model evidence term, which we discuss in Section 2.5. We nevertheless omit this dependence from the expressions in order to simplify the notation.
Furthermore the conditional posterior of given is:
(17) 
where is the (marginal) posterior of the model parameters .
For an arbitrary density and by employing Jensen’s inequality, it can be shown that franck_sparse_2015 :
(18) 
We note here that the variational lower bound has a direct connection with the Kullback-Leibler (KL) divergence between the approximating density and the (conditional) posterior. In particular, denoting expectations with respect to the approximating density accordingly, we have:
(19) 
The Kullback-Leibler divergence is by definition non-negative and becomes zero when the approximating density coincides with the conditional posterior. Hence, for a given set of model parameters, constructing a good approximation to the conditional posterior (in the KL divergence sense) is equivalent to maximizing the lower bound with respect to the approximating density. Analogously, maximizing the lower bound with respect to the model parameters (for a given approximating density) leads to (sub)optimal MAP estimates franck_sparse_2015 . This suggests an iterative scheme that alternates between:
VB-Expectation step: Given the current estimate of the model parameters, find the approximating density that maximizes the lower bound.

VB-Maximization step: Given the current approximating density, find the model parameters that maximize the lower bound.
As in standard EM schemes neal_view_1998 , relaxed versions of the aforementioned partial optimization problems can be considered, which improve upon the current values rather than finding the optimum at each iteration.
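The alternation above can be summarized by a generic coordinate-ascent skeleton. The two step functions are placeholders for the problem-specific updates derived below, not the actual implementation; convergence is monitored through the lower bound, as discussed later.

```python
def vbem(init_params, vb_e_step, vb_m_step, max_iter=50, tol=1e-8):
    """Generic VB-EM coordinate ascent (a sketch under simplifying
    assumptions): alternate the VB-E step (update the approximating
    density q given the parameters) and the VB-M step (update point
    estimates of the parameters given q) until the lower bound F
    stops improving."""
    params = init_params
    q, F_old = None, -float("inf")
    for _ in range(max_iter):
        q, F = vb_e_step(params)       # optimal q for current params, plus bound
        params = vb_m_step(q, params)  # (sub)optimal params for current q
        if abs(F - F_old) <= tol * (1.0 + abs(F)):
            break
        F_old = F
    return q, params
```

Because each step can only increase the bound, the iteration is monotone, mirroring standard EM behavior.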
Using Equation (16), the lower bound can be expressed as:
(20) 
We will omit the evidence-related term, as it depends neither on the approximating density nor on the model parameters. It is apparent that the challenging term in the lower bound involves the likelihood, i.e.:
(21) 
The intractability of the forward map precludes an analytic computation of the expectation with respect to the latent variables, let alone the optimization with respect to it. While stochastic approximation techniques in the context of VB inference have been suggested titsias_doubly_2014 to carry out this task, these would require repeated forward solves (i.e. evaluations of the model response), which would render them impractical. For that purpose, as in our previous work franck_sparse_2015 , we invoke an approximation by using a first-order Taylor series expansion of the model response (given the mixture component) around the mean, i.e.:
(22) 
where is the gradient of the map at . We will discuss rigorous validation strategies of the approximation error thus introduced in Section 2.6. Truncating Equation (22) to firstorder, the term in the exponent of the likelihood becomes:
(23) 
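As an illustration of this linearization, the following sketch compares a toy nonlinear map with its first-order expansion. The quadratic map merely stands in for the (expensive) discretized forward solver, and in practice the Jacobian would be obtained via adjoint solves rather than written analytically.

```python
import numpy as np

def forward(theta):
    """Toy nonlinear forward map standing in for the expensive solver."""
    return np.array([theta[0] ** 2, theta[0] * theta[1]])

def jacobian(theta):
    """Analytic Jacobian G of the toy map (adjoints in practice)."""
    return np.array([[2.0 * theta[0], 0.0],
                     [theta[1], theta[0]]])

def linearized(theta0, delta):
    """First-order expansion u(theta0 + delta) ~ u(theta0) + G @ delta,
    which renders the exponent of the likelihood quadratic in delta."""
    return forward(theta0) + jacobian(theta0) @ delta
```

For small perturbations the expansion error is second order, which is what makes the subsequent Gaussian updates tractable.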
We introduce a second approximation in terms of the family of approximating densities over which we wish to optimize, by using a mean-field decomposition peierls_minimum_1938 ; opper_advanced_2001 of the form:
(24) 
In the first line, the noise precision is assumed to be a posteriori independent of the remaining latent variables, on the premise that the measurement noise precision is determined by the experimental conditions and does not directly depend on the latent variables. In the third line, we assume that the reduced coordinates and the residual noise are conditionally independent given the mixture component (this implies that, marginally, they are dependent, as one would expect). The latter assumption is justified by their roles in the representation of the unknowns (Equation (7)), expressing the main effects around the mean and the residual “noise”, respectively. As such, it is also reasonable to assume that their means are zero a posteriori. Furthermore, we employ an isotropic covariance for the residual noise, with an (unknown) posterior variance.
If we denote expectations with respect to the individual factors of the approximating density accordingly, then Equation (21) becomes (constants that do not affect the optimization are omitted):
(25) 
Despite the long expression, the optimization in the VB-Expectation step can be done analytically, and we find that the optimal factor (given the model parameters) is:
(26) 
where:
(27) 
(28) 
(29) 
(30) 
Furthermore, for the latent variable we find that:
(31) 
where:
(32) 
The normalization constant can readily be found by imposing that the component probabilities sum to one, which yields:
(33) 
While the optimal factors are interdependent, we note in the expression above that the posterior probability of each mixture component increases, as one would expect, as the mean-square error gets smaller. More interestingly perhaps, it also increases as the determinant of the posterior precision matrix decreases, i.e. as the posterior variance associated with the reduced coordinates of the component increases. The same effect is observed for the posterior residual variance. This implies that, ceteris paribus, mixture components with larger posterior variance will have a bigger weight in the overall posterior.
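Numerically, the normalization in Equation (33) amounts to exponentiating and rescaling the unnormalized log-weights of the components, for which the log-sum-exp trick is advisable; a minimal sketch:

```python
import numpy as np

def component_weights(log_q):
    """Turn unnormalized log-probabilities log q(s=i) into component
    weights that sum to one. Subtracting the max before exponentiating
    (log-sum-exp trick) avoids overflow/underflow for the large-magnitude
    log-likelihood terms typical of inverse problems."""
    m = np.max(log_q)
    w = np.exp(log_q - m)
    return w / w.sum()
```

This step is negligible in cost; the expensive quantities are the misfit and determinant terms entering each log-weight.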
For the optimal (given ) in the equations above, the variational lower bound takes the following form (terms independent of or are omitted  for details see A):
(34) 
and
(35) 
where is the normalization constant of a distribution with parameters . This can be computed at each full iteration of VBEM in order to monitor convergence.
While it is again difficult to gain insight into the expression above due to the interdependencies between the various terms, we note that the smaller the mean-square error becomes (i.e. the better the mean is able to reproduce the measurements), the more the lower bound increases. In addition, we can see that the lower bound increases as the variance of the mixture components gets larger, i.e. the more variance they can capture.
For the VB-Maximization step, it can readily be established from Equation (25) that the optimization of the lower bound with respect to the component means (given the approximating density) involves the following set of uncoupled optimization problems franck_sparse_2015 :
(36) 
Since the objectives are identical for each component, we can deduce that the means should correspond to (different or identical) local maxima. This implies that, in the posterior approximation constructed, each Gaussian in the mixture is associated with a (regularized, due to the prior) local optimum of the least-squares solution of the inverse problem. The search for multiple local optima, and more importantly their number, is discussed in the next section.
The determination of the optimal means is performed using first-order derivatives of the objective. Since the objective and its derivatives are not analytically available, we employ an additional layer (inner loop) of Expectation-Maximization to deal with the hyperparameters in the prior of the mean. The details were discussed in franck_sparse_2015 and are included in Appendix B for completeness.
Considering the computational cost of these operations, we point out that the updates of the means are the most demanding, as they require calls to the forward model to evaluate the response and its derivatives. For the computation of the derivatives we employ adjoint formulations, which offer great savings when the number of observables is small compared to the number of unknowns papadimitriou_direct_2008 . As discussed in detail in franck_sparse_2015 , the latter condition can be removed as long as a direct solver is used for the solution of the forward problem. In this case, the cost of solving the adjoint equations is even less than that of the forward solution.
The remaining aspect of the VB-Maximization step involves the optimization with respect to the basis matrices (given the approximating density). As with the means, it suffices to consider only the terms in Equation (25) that depend on them, which again leads to a set of uncoupled problems:
(37) 
The first term favors the directions corresponding to the eigenvectors with the smallest eigenvalues of the matrix involving the gradient of the model response at the mean. As discussed previously in Section 2.3, the prior enforces the orthogonality of the basis vectors. To solve this constrained optimization problem, we use the iterative algorithm of wen_feasible_2013 , which employs a Cayley transform to enforce the constraint. It makes use of first-order derivatives of the objective and as such does not require any additional forward model runs. With regard to the number of columns of the basis matrices (which is equal to the dimension of the reduced coordinates), we assume that this is the same across all mixture components. We had developed an information-theoretic criterion in franck_sparse_2015 which can also be employed here. This allows the adaptive determination of the reduced dimension by measuring the information gain that each new dimension furnishes for each mixture component. When these gains fall below a prescribed threshold, i.e.:
(38) 
we assume that the number of reduced dimensions is sufficient. A detailed discussion of the estimation of the reduced dimension using the information gain criterion is given in Appendix C and in franck_sparse_2015 .
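The adaptive selection of the reduced dimension can be sketched as a simple loop; `info_gains` is a hypothetical interface returning the per-component information gains contributed by each newly added direction.

```python
def select_dim(info_gains, threshold=0.01, max_dim=100):
    """Grow the reduced dimension until the information gain of the newest
    direction falls below `threshold` for every mixture component (a sketch
    of the adaptive criterion; `info_gains(j)` is an assumed interface
    returning the list of per-component gains of the j-th direction)."""
    j = 1
    while j < max_dim and max(info_gains(j)) >= threshold:
        j += 1
    return j
```

In practice each increment entails updating the corresponding basis column, so the loop terminates after only as many forward-gradient evaluations as there are retained directions.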
Following the previous discussion in Equation (10), we note that once the (approximate) posterior and the optimal model parameters have been computed, we obtain a multimodal posterior approximation for the material parameters , which is given by:
(39) 
where each component in the last mixture is given by:
(40) 
i.e. a multivariate Gaussian with mean and covariance where:
(41) 
Based on Equation (39), one can evaluate the posterior mean and covariance of as follows:
(42) 
Posterior moments of any order or posterior probabilities can be readily computed as well.
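The posterior moments of Equation (42) follow the standard formulas for Gaussian mixtures; a minimal sketch (with component weights, means and covariances as computed above):

```python
import numpy as np

def mixture_moments(weights, means, covs):
    """Mean and covariance of a Gaussian mixture:
       E[x]   = sum_i w_i * mu_i
       Cov[x] = sum_i w_i * (C_i + (mu_i - E[x])(mu_i - E[x])^T)
    i.e. within-component covariance plus between-component spread."""
    mean = sum(w * m for w, m in zip(weights, means))
    cov = sum(w * (C + np.outer(m - mean, m - mean))
              for w, m, C in zip(weights, means, covs))
    return mean, cov
```

Note that well-separated modes inflate the total covariance through the between-component term even when each component is narrow, which is why reporting the full mixture (rather than a single Gaussian) matters for diagnosis.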
Note that if