1 Introduction
The extensive use of large-scale computational models poses several challenges in model calibration, as the accuracy of the predictions depends strongly on assigning proper values to the various model parameters. In mechanics of materials, accurate mechanical property identification can guide damage detection and an informed assessment of the system's reliability johannesson_multiresolution_2005 . Identifying property cross-correlations can lead to the design of multifunctional materials torquato_random_2002 . Permeability estimation for soil transport processes can assist in the detection of contaminants and in oil exploration wang_markov_2006 . Deterministic optimization techniques which have been developed to address these problems vogel_computational_2002 lead to point estimates for the unknowns, without rigorously considering the statistical nature of the problem and without providing a quantification of the uncertainty in the inverse solution. Statistical approaches based on the Bayesian paradigm kaipio_statistical_2006 ,
on the other hand, aim at computing a (posterior) probability distribution over the parameters of interest. Bayesian formulations offer several advantages as they provide a unified framework for dealing with the uncertainty introduced by incomplete and noisy measurements. Significant successes have been noted in applications such as geological tomography glaser_stochastic_2004 , medical tomography weir_fully_1997 , petroleum engineering hegstad_uncertainty_2001 , as well as a host of other physical, biological, or social systems wang_hierarchical_2005 ; liu_bayesian_2008 . Representations of the parametric fields in existing deterministic and Bayesian approaches (artificially) impose a minimum length scale of variability, usually determined by the discretization size of the governing PDEs lee_markov_2002 . As a result they give rise to a very large vector of unknowns. Inference in high-dimensional spaces using standard Markov Chain Monte Carlo (MCMC) schemes is generally impractical, as it requires an exorbitant number of calls to the forward simulator in order to achieve convergence. Advanced schemes such as those employing Sequential Monte Carlo samplers
moral_sequential_2006 ; koutsourelakis_multiresolution_2009 , adaptive MCMC chopin_free_2012 , accelerated MCMC methods hoang_complexity_2013 or spectral methods marzouk_stochastic_2007 can alleviate some of these difficulties, particularly when the posterior is multimodal, but still pose significant challenges in terms of computational cost bilionis_solution_2014 . This work is particularly concerned with the identification of the mechanical properties of biological materials, in the context of noninvasive medical diagnosis. While in certain cases mechanical properties can also be measured directly by excising multiple tissue samples, noninvasive procedures offer obvious advantages in terms of ease, cost and reduced risk of complications to the patient. Rather than X-ray techniques, which capture variations in density, the identification of stiffness or mechanical properties in general can potentially lead to earlier and more accurate diagnosis oberai_linear_2009 ; gannecarrie_accuracy_2006 , provide valuable insights that differentiate between modalities of the same pathology curtis_genomic_2012 and monitor the progress of treatments. In this paper we do not propose new imaging techniques, but rather aim at developing rigorous statistical models and efficient computational tools that can make use of the data/observables (i.e. noisy displacements of deformed tissue) from existing imaging modalities (such as magnetic resonance muthupillai_magnetic_1995 or ultrasound) in order to produce certifiable estimates of mechanical properties. The primary imaging modality considered in this project is ultrasound elasticity imaging (elastography sarvazyan_editorial_2011 ; doyley_modelbased_2012 ). It is based on ultrasound tracking of pre- and post-compression images to obtain a map of position changes and deformations of the specimen due to an external pressure/load.
The pioneering work of Ophir and coworkers ophir_elastography:_1991 followed by several clinical studies bamber_progress_2002 ; thomas_realtime_2006 ; parker_imaging_2011 have demonstrated that the resulting strain images typically improve the diagnostic accuracy over ultrasound alone.
Beyond mere strain imaging, there are two approaches for inferring the constitutive material parameters. In the direct approach, the equations of equilibrium are interpreted as equations for the material parameters of interest, in which the inferred strains and their derivatives appear as coefficients barbone_adjointweighted_2010 . While such an approach provides a computationally efficient strategy, it does not use the raw data (i.e. noisy displacements) but transformed versions, i.e. strain fields (or, even worse, strain derivatives), which arise by applying sometimes ad hoc filtering and smoothing operators. As a result the informational content of the data is compromised and the quantification of the effect of observation noise is cumbersome. Furthermore, the smoothing employed can smear regions with sharply varying properties and hinder proper identification.
The alternative to direct methods, i.e. indirect or iterative procedures, admits an inverse problem formulation where the discrepancy between observed and model-predicted displacements is minimized with respect to the material fields of interest oberai_evaluation_2004 ; doyley_enhancing_2006 ; arnold_efficient_2010 ; olson_numerical_2010 . While these approaches utilize the raw data directly, they generally imply an increased computational cost, as the forward problem and potentially derivatives have to be solved/computed several times. This effort is amplified when stochastic/statistical formulations are employed, such as those arising in the Bayesian paradigm. Technological advances have led to the development of hand-carried ultrasound systems the size of a smartphone schleder_diagnostic_2013 . Naturally their accuracy and resolution do not compare with those of the more expensive traditional ultrasound machines, or even more so MRI systems. If however computational tools are available that can distill the informational content from noisy and incomplete data, then this would constitute a major advance. Furthermore, significant progress is needed in improving the computational efficiency of these tools if they are to be made applicable on a patient-specific basis.
In this work we advocate a Variational Bayesian (VB) perspective beal_variational_2003 ; bishop_pattern_2006
. Such methods have risen to prominence for probabilistic inference tasks in the machine learning community jordan_introduction_1999 ; attias_variational_2000 ; wainwright_graphical_2008 but have recently been employed also in the context of inverse problems chappell_variational_2009 ; jin_hierarchical_2010 . They provide approximate-inference results by solving an optimization problem over a family of appropriately selected probability densities, with the objective of minimizing the Kullback-Leibler divergence cover_elements_1991 from the exact posterior. The success of such an approach hinges upon the selection of appropriate densities that have the capacity to provide good approximations while enabling efficient (and preferably closed-form) optimization with regard to their parameters. We note that an alternative optimization strategy, originating from a different perspective and founded on map-based representations of the posterior, has been proposed in el_moselhy_bayesian_2012 . A pivotal role in Variational Bayesian strategies, or any other inference method, is played by dimensionality reduction, i.e. the identification of lower-dimensional features that provide the strongest signature to the unknowns and the corresponding posterior. Discovering a sparse set of features has attracted great interest in many applications, as in the representation of natural images olshausen_sparse_1997 , and a host of algorithms have been developed not only for finding such representations but also an appropriate dictionary for achieving this goal lewicki_learning_2000 . While all these tools are pertinent to the present problem, they differ in a fundamental way: they are based on several data/observations/instantiations of the vector that we seek to represent. In our problem however we do not have such direct observations, i.e. the data available pertains to the output of a model which is nonlinearly and implicitly dependent on the vector of unknowns. Furthermore, we are primarily interested in approximating the posterior of this vector rather than in the dimensionality reduction itself. We demonstrate how this can be done by using fully Bayesian argumentation and employing the marginal likelihood or evidence as the ultimate model validation metric for any proposed dimensionality reduction.
The paper is organized as follows: the next section (Section 2) presents the essential ingredients of the forward model (Section 2.1), which are common with a wide range of nonlinear, high-dimensional problems encountered in several simulation contexts. We also discuss the VB framework advocated, the dimensionality reduction scheme proposed, the prior densities for all model parameters, an iterative, coordinate-ascent algorithm that enables the identification of all the unknowns (Section 2.2), as well as an information-theoretic criterion for determining the number of dimensions needed (Section 2.3). We finally describe a Monte Carlo scheme based on Importance Sampling that can provide statistics of the exact posterior as well as a quantitative assessment of the VB approximation (Section 2.4). Section 3 demonstrates the performance and features of the proposed methodology on two problems from solid mechanics that are of relevance to the elastography setting. Various signal-to-noise ratios are considered and the performance of the method, in terms of forward calls and accuracy, is assessed.
2 Methodology
The motivating application is related to continuum mechanics in the nonlinear elasticity regime. We describe below the governing equations in terms of conservation laws and the constitutive equations. The proposed model calibration process can be readily adapted to other forward models. As will be shown, the only information utilized by the Bayesian inference engine proposed is a) the response quantities at the locations where measurements are available, and b) their derivatives with respect to the unknown model parameters.
2.1 Forward model - Governing equations
The following expressions are formulated in the general case, which includes nonlinear material behavior and large deformations. Our physical domain is denoted by $\Omega_0$ in the reference configuration. Let $\mathbf{X}$ denote the coordinates of the continuum particles in the undeformed configuration and $\mathbf{x}$ in the deformed one. Their relation (in the static case) is provided by the deformation map $\boldsymbol{\varphi}$ such that: $\mathbf{x} = \boldsymbol{\varphi}(\mathbf{X})$. The displacement field is defined as $\mathbf{u}(\mathbf{X}) = \boldsymbol{\varphi}(\mathbf{X}) - \mathbf{X}$. The gradient of the deformation map is denoted by $\mathbf{F} = \partial \boldsymbol{\varphi} / \partial \mathbf{X}$ and

$\mathbf{E} = \dfrac{1}{2} \left( \mathbf{F}^T \mathbf{F} - \mathbf{I} \right)$

is then the Lagrangian (finite) strain tensor used as the primary kinematic state variable in our constitutive law holzapfel_nonlinear_2000 ; mase_continuum_2009 ; bonet_worked_2012 . The governing equations consist of the conservation of linear momentum:

(1) $\nabla_{\mathbf{X}} \cdot \left( \mathbf{F} \mathbf{S} \right) + \rho_0 \mathbf{b} = \mathbf{0} \quad \text{in } \Omega_0$
and the Dirichlet and Neumann boundary conditions

(2) $\mathbf{u} = \bar{\mathbf{u}} \quad \text{on } \Gamma_u$

(3) $\left( \mathbf{F} \mathbf{S} \right) \mathbf{N} = \bar{\mathbf{t}} \quad \text{on } \Gamma_t$

Here $\mathbf{b}$ is the body force vector (per unit mass), $\rho_0$ is the initial density, $\mathbf{S}$ is the second Piola-Kirchhoff stress tensor and $\mathbf{N}$ is the outward normal at $\Gamma_t$. $\Gamma_u$ and $\Gamma_t$ are subsets of the boundary, $\partial \Omega_0$, on which displacement and traction boundary data, $\bar{\mathbf{u}}$ and $\bar{\mathbf{t}}$ respectively, are specified. For a hyperelastic material, it is assumed that a strain energy density function $\Psi(\mathbf{E}; \boldsymbol{\theta})$ exists and depends on the invariants of the Lagrangian strain tensor $\mathbf{E}$ and the constitutive material parameters $\boldsymbol{\theta}$. We note that the latter in general exhibit spatial variability, which we intend to estimate using the methods discussed. The conjugate stress variables described by the second Piola-Kirchhoff stress tensor can be found as:
(4) $\mathbf{S} = \dfrac{\partial \Psi}{\partial \mathbf{E}}$
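To make the relation in Equation (4) concrete, the sketch below checks the analytic second Piola-Kirchhoff stress of a St. Venant-Kirchhoff material (a simple hyperelastic law chosen for illustration, not necessarily the one used later in the paper) against a finite-difference derivative of its strain energy density; the constants `lam` and `mu` are hypothetical.

```python
import numpy as np

lam, mu = 1.2, 0.8                      # hypothetical Lame-type constants
psi = lambda E: 0.5 * lam * np.trace(E)**2 + mu * np.trace(E @ E)

def S_analytic(E):
    """S = dPsi/dE for the St. Venant-Kirchhoff energy above."""
    return lam * np.trace(E) * np.eye(3) + 2.0 * mu * E

def S_finite_difference(E, h=1e-6):
    """Central finite differences of psi, component by component."""
    S = np.zeros((3, 3))
    for i in range(3):
        for j in range(3):
            Ep, Em = E.copy(), E.copy()
            Ep[i, j] += h
            Em[i, j] -= h
            S[i, j] = (psi(Ep) - psi(Em)) / (2.0 * h)
    return S

A = np.random.default_rng(0).standard_normal((3, 3))
E = 0.5 * (A + A.T)                     # a random symmetric strain tensor
assert np.allclose(S_finite_difference(E), S_analytic(E), atol=1e-5)
```

For a symmetric strain tensor the component-wise finite differences agree with the closed-form stress, which is a useful sanity check whenever a new energy density is implemented.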
The aforementioned governing equations should be complemented with any other information about the problem or the material, such as incompressibility. In fact, incompressibility is frequently encountered in biomaterials and corresponds to the condition $\det \mathbf{F} = 1$ at all points in the problem domain.
The governing equations presented thus far cannot be solved analytically for the vast majority of problems, and one must resort to numerical techniques that discretize these equations and the associated fields. The most prominent such approach is the Finite Element Method (FEM), which is employed in this study as well. In the first step, the weak form of the PDEs needs to be derived. To that end, we define the usual function spaces $\mathcal{S}$ and $\mathcal{V}$ for the set of admissible solutions and weighting functions respectively, as follows gokhale_solution_2008 :

(5) $\mathcal{S} = \left\{ \mathbf{u} \in \left[ H^1(\Omega_0) \right]^3 : \mathbf{u} = \bar{\mathbf{u}} \text{ on } \Gamma_u \right\}, \qquad \mathcal{V} = \left\{ \mathbf{w} \in \left[ H^1(\Omega_0) \right]^3 : \mathbf{w} = \mathbf{0} \text{ on } \Gamma_u \right\}$

where $H^1(\Omega_0)$ denotes the Sobolev space of square integrable functions with square integrable derivatives in $\Omega_0$ adams_sobolev_2003 . By multiplying Equation (1) with a weighting function $\mathbf{w} \in \mathcal{V}$, integrating by parts and exploiting the essential and non-essential boundary conditions above, we obtain:

(6) $\int_{\Omega_0} \mathbf{S} : \delta\mathbf{E}(\mathbf{w}) \, d\Omega = \int_{\Omega_0} \rho_0 \, \mathbf{b} \cdot \mathbf{w} \, d\Omega + \int_{\Gamma_t} \bar{\mathbf{t}} \cdot \mathbf{w} \, d\Gamma \quad \forall \, \mathbf{w} \in \mathcal{V}$

where $\delta\mathbf{E}(\mathbf{w}) = \frac{1}{2} \left( \mathbf{F}^T \nabla_{\mathbf{X}} \mathbf{w} + (\nabla_{\mathbf{X}} \mathbf{w})^T \mathbf{F} \right)$.
In the incompressible case, pressure must be taken into account and for that purpose the pressure trial solutions and weighting functions should also be introduced goenezen_solution_2011 .
Subsequently the problem domain is discretized into finite elements and shape functions are used for interpolating the unknown fields. As this is a very mature subject, from a theoretical and computational point of view, we do not provide further details here but refer the interested reader to one of many books available
zienkiewicz_finite_1977 ; hughes_finite_2000 , or, more specifically in the context of inverse problems for (in)compressible elasticity, to gokhale_solution_2008 ; goenezen_solution_2011 . Most often all unknowns are expressed in terms of the discretized displacement field, denoted here by a vector $\mathbf{u} \in \mathbb{R}^n$. An approximate solution can be found by solving the $n$-dimensional system of nonlinear algebraic equations which in residual form can be written as:

(7) $\mathbf{r}(\mathbf{u}; \boldsymbol{\theta}) = \mathbf{0}$

We denote here by $\mathbf{r} \in \mathbb{R}^n$ the residuals and by $\boldsymbol{\theta} \in \mathbb{R}^{d_\theta}$ the discretized vector of constitutive material parameters.
The discretizations can be done in many different ways. For example, if the same mesh and shape functions as for the discretization of the displacements are adopted, then each entry of the vector $\boldsymbol{\theta}$ corresponds to the value of the material parameter of interest at each nodal point. Frequently it is assumed that the values of the constitutive parameters are constant within each finite element, in which case $d_\theta$ coincides with the number of elements in the FE mesh. While the representation of $\boldsymbol{\theta}$ is discussed in detail in the sequel, we point out that the discretization of $\boldsymbol{\theta}$ does not need to be associated with the discretization used for the governing equations. Usually in practice the two are related, but if one aims at inferring as many details about the variability of $\boldsymbol{\theta}$ as the discretized equations can provide, a finer discretization might be employed for $\boldsymbol{\theta}$. We note however that if the material properties exhibit significant variability within each finite element, special care has to be taken in formulating the finite element solution and multiscale schemes might need to be employed e_principles_2011 .
We note here:

Frequently the size $n$ of the system of equations that needs to be solved is large. This is necessitated by accuracy requirements in capturing the underlying physics and mathematics. It can impose a significant computational burden as, in general, repeated solutions of this system, under different values of $\boldsymbol{\theta}$, are needed. If for example a Newton-Raphson method is employed, then repeated solutions of the linearized Equation (7) will need to be performed:

(8) $\mathbf{J}\left( \mathbf{u}^{(k)} \right) \Delta\mathbf{u}^{(k)} = -\mathbf{r}\left( \mathbf{u}^{(k)}; \boldsymbol{\theta} \right), \qquad \mathbf{u}^{(k+1)} = \mathbf{u}^{(k)} + \Delta\mathbf{u}^{(k)}$

where $k$ is the iteration number and $\mathbf{J} = \partial \mathbf{r} / \partial \mathbf{u}$ is the Jacobian matrix. Hence for large $n$, as in applications of interest, the number of such forward solutions is usually what dictates the overall computational cost, and this is what we report in subsequent numerical experiments. Depending on the particular solution method employed, converged solutions at a certain stage of the inversion procedure can be used as initial guesses for subsequent solutions under different $\boldsymbol{\theta}$, reducing as a result the overall cost. In this work we do not make use of such techniques.
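The Newton-Raphson iteration of Equation (8) can be sketched on a toy problem; the two-equation system below is a hypothetical stand-in for the FE residual, purely to show the structure of the solver.

```python
import numpy as np

def newton_solve(residual, jacobian, u0, tol=1e-10, max_iter=50):
    """Newton-Raphson for r(u; theta) = 0, cf. Equation (8)."""
    u = u0.copy()
    for _ in range(max_iter):
        r = residual(u)
        if np.linalg.norm(r) < tol:
            break
        u = u + np.linalg.solve(jacobian(u), -r)   # solve J du = -r, update u
    return u

# Hypothetical 2-d residual standing in for the FE system; its root is (1, 1).
r = lambda u: np.array([u[0]**3 - 1.0, u[0] + u[1]**2 - 2.0])
J = lambda u: np.array([[3.0 * u[0]**2, 0.0], [1.0, 2.0 * u[1]]])
u_star = newton_solve(r, J, np.array([2.0, 2.0]))
```

Every call to `newton_solve` plays the role of one forward solve in the cost accounting described above.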

The data available generally concern a subset, or more generally a lower-dimensional function, of $\mathbf{u}$. In this work, the experimental measurements/observations are (noisy) displacements at specific locations in the physical domain. We denote these displacements by $\mathbf{y} \in \mathbb{R}^{d_y}$ and they can be formally expressed as $\mathbf{y} = \mathbf{H} \mathbf{u}$, where $\mathbf{H}$ is a Boolean matrix which picks out the entries of interest from $\mathbf{u}$. Naturally, since $\mathbf{u}$ depends on $\boldsymbol{\theta}$, $\mathbf{y}$ is also a function of $\boldsymbol{\theta}$, i.e. $\mathbf{y} = \mathbf{y}(\boldsymbol{\theta})$. We emphasize that this function is generally highly nonlinear and, more often than not, many-to-one schillings_sparse_2013 . The unavailability of the inverse as well as the high nonlinearity constitute two of the basic difficulties of the associated inverse problem.
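The observation operator is nothing more than a row selection; a minimal sketch with hypothetical sensor indices:

```python
import numpy as np

u = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])   # full displacement vector (toy)
obs_idx = [1, 4]                               # hypothetical sensor locations
H = np.zeros((len(obs_idx), u.size))
H[np.arange(len(obs_idx)), obs_idx] = 1.0      # Boolean selection matrix
y = H @ u                                      # observables y = H u
```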

In addition to the solution vector $\mathbf{u}$, the proposed inference scheme will make use of the derivatives $\partial \mathbf{y} / \partial \boldsymbol{\theta}$. The computation of derivatives of the response with respect to model parameters is a well-studied subject in the context of PDE-constrained optimization giles_introduction_2000 ; hinze_optimization_2009 ; papadimitriou_direct_2008 and we make use of it in this work. For any scalar function $f(\mathbf{u})$, one can employ the adjoint form of Equation (7), according to which:

(9) $\dfrac{df}{d\boldsymbol{\theta}} = -\boldsymbol{\lambda}^T \dfrac{\partial \mathbf{r}}{\partial \boldsymbol{\theta}}$

where the adjoint vector $\boldsymbol{\lambda}$ is defined such that:

(10) $\mathbf{J}^T \boldsymbol{\lambda} = \dfrac{\partial f}{\partial \mathbf{u}}$

We note that $\mathbf{J} = \partial \mathbf{r} / \partial \mathbf{u}$ is the Jacobian of the residuals in Equation (7) evaluated at the solution $\mathbf{u}$. We point out that if a direct solver is employed for the solution of the linear system in Equation (8), then the additional cost of evaluating $df/d\boldsymbol{\theta}$ is minimal, as the Jacobian would not need to be refactorized for solving Equation (10) (footnote 1: the cost of evaluating $\partial \mathbf{r} / \partial \boldsymbol{\theta}$ is negligible compared to other terms as it scales linearly with the number of elements/nodes). In the context of the problems considered in this paper (see Section 3), repeated use of Equation (10) is made where $f$ is a different component of the observables, and as such the overall cost increases proportionally with the number of observables (displacements in our problems) that are available. In problems where $n$ is so large that it precludes the use of direct solvers, the cost of solving the adjoint equations can be higher but nevertheless comparable to the cost of a forward solution. In cases where both $n$ and the dimension of $\boldsymbol{\theta}$ are high, advanced iterative solvers, suitable for multiple right-hand sides, must be employed Orginos:1118470 ; gutknecht_block_2009 . These imply an added computational burden which nevertheless scales sublinearly with the dimension of $\boldsymbol{\theta}$.
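A minimal numerical illustration of Equations (9)-(10), using a linear stand-in $\mathbf{r}(\mathbf{u}; \theta) = \mathbf{K}(\theta)\mathbf{u} - \mathbf{b}$ for the residual so that everything can be checked by hand; the "stiffness" $\mathbf{K}(\theta) = \theta \mathbf{I}$ and the scalar parameter are hypothetical.

```python
import numpy as np

theta = 2.0
b = np.array([3.0, 1.0])
K = lambda th: th * np.eye(2)            # hypothetical stiffness, linear in theta

u = np.linalg.solve(K(theta), b)         # forward solve: r(u) = K u - b = 0

# Adjoint solve, Equation (10): J^T lambda = df/du, with J = K and f(u) = u[0]
dfdu = np.array([1.0, 0.0])
lam = np.linalg.solve(K(theta).T, dfdu)

# Sensitivity, Equation (9): df/dtheta = -lambda^T (dr/dtheta), dr/dtheta = u
grad = -lam @ u

# Analytic check: u = b/theta, hence df/dtheta = -b[0]/theta**2
assert np.isclose(grad, -b[0] / theta**2)
```

Note that one adjoint solve per observable component is needed, which is the source of the proportional cost mentioned above.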
2.2 Bayesian Model
The following discussion is formulated in general terms and can be applied to the calibration of any model with parameters represented by the vector $\boldsymbol{\theta}$ when output $\mathbf{y}(\boldsymbol{\theta})$ is available. We also presuppose the availability of the derivatives $\partial \mathbf{y} / \partial \boldsymbol{\theta}$. For problems of practical interest, it is assumed that the dimension $d_\theta$ of the unknowns is very large, which poses a significant hindrance in the solution of the associated inverse problem, as well as in finding proper regularization (in deterministic settings calvetti_tikhonov_2003 ) or in specifying appropriate priors (in probabilistic settings bardsley_gaussian_2013 ; schwab_sparse_2012 ). The primary focus of the Bayesian model developed is twofold:

find lower-dimensional representations of the unknown parameter vector $\boldsymbol{\theta}$ that capture as much as possible of the associated posterior density;

enable the computation of the posterior density with as few forward calls (i.e. evaluations of $\mathbf{y}(\boldsymbol{\theta})$) as possible.
We denote by $\hat{\mathbf{y}} \in \mathbb{R}^{d_y}$ the vector of observations/measurements. In the context of elastography the observations are displacements (in the static case) and/or velocities (in the dynamic case). The extraction of this data from images (ultrasound or MRI) is a challenging topic that requires sophisticated image registration techniques richards_quantitative_2007 ; rivaz_realtime_2011 . Naturally, these compromise the informational content of the raw data (i.e. the images). In this study we ignore the error introduced by the image registration process, as the emphasis is on the inversion of the continuum mechanics, PDE-based model, and assume that the displacement data are contaminated with noise. We postulate the presence of i.i.d. Gaussian noise, denoted here by the random vector $\mathbf{e}$, such that:

(11) $\hat{\mathbf{y}} = \mathbf{y}(\boldsymbol{\theta}) + \mathbf{e}$
We assume that each entry of $\mathbf{e}$ has zero mean and an unknown variance $\tau^{-1}$, which will also be inferred from the data. We note that other noise models can also be employed, for example impulsive noise to account for outliers due to instrument calibration or experimental conditions jin_variational_2012 . Generally the difference between observed and model-predicted outputs can be attributed not only to observation errors (noise) but also to model discrepancies arridge_approximation_2006 ; kaipio_statistical_2007 ; koutsourelakis_novel_2012 . In this work such model errors are lumped with observation errors in the $\mathbf{e}$ term. The likelihood function of the observed data, i.e. its conditional probability density given the model parameters $\boldsymbol{\theta}$ (and implicitly the model itself as described by the aforementioned governing Equations (7)) and the precision $\tau$, is:

(12) $p(\hat{\mathbf{y}} \mid \boldsymbol{\theta}, \tau) = \left( \dfrac{\tau}{2\pi} \right)^{d_y/2} \exp\left\{ -\dfrac{\tau}{2} \left\| \hat{\mathbf{y}} - \mathbf{y}(\boldsymbol{\theta}) \right\|^2 \right\}$
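The log of the Gaussian likelihood in Equation (12) is straightforward to evaluate; a small sketch (the function name is ours):

```python
import numpy as np

def log_likelihood(y_obs, y_model, tau):
    """Log of Equation (12): Gaussian i.i.d. noise with precision tau."""
    r = y_obs - y_model
    return 0.5 * y_obs.size * np.log(tau / (2.0 * np.pi)) - 0.5 * tau * (r @ r)

# A perfect fit with unit precision gives -(d_y/2) * log(2*pi):
ll = log_likelihood(np.array([1.0, 2.0]), np.array([1.0, 2.0]), 1.0)
```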
In the Bayesian framework advocated, one would also need to specify priors on the unknown parameters. We defer a detailed discussion of the priors associated with $\boldsymbol{\theta}$ to the next section, where the dimensionality reduction aspects are discussed. With regard to the noise precision $\tau$ we employ a (conditionally) conjugate Gamma prior, i.e.

(13) $p(\tau) = \mathrm{Gamma}(\tau \mid a_0, b_0) \propto \tau^{a_0 - 1} e^{-b_0 \tau}$

The values $a_0 = b_0 = 0$ are taken in the following examples. This corresponds to a limiting case where the density degenerates to an improper, noninformative Jeffreys prior, i.e. $p(\tau) \propto \tau^{-1}$, that is scale invariant gelman_bayesian_2003 . Naturally, more informative choices can be made if such information is available a priori.
2.2.1 Dimensionality Reduction for $\boldsymbol{\theta}$
As mentioned earlier, one of the primary goals of the present work is to identify, with the least number of forward calls, a lower-dimensional subspace of $\mathbb{R}^{d_\theta}$ on which the posterior probability density can be sufficiently well approximated. Dimensionality reduction could be enforced directly by appropriate prior specification. For example, in honarvar_sparsity_2012 the Fourier transform coefficients of $\boldsymbol{\theta}$ corresponding to small-wavelength fluctuations were turned off by assigning zero prior probability to nonzero values. While such an approach achieves the goal of dimensionality reduction, it does not take into account the forward model in doing so. The nonlinear map $\mathbf{y}(\boldsymbol{\theta})$ as well as the available data provide varying amounts of information for identifying different features of $\boldsymbol{\theta}$. One would expect the likelihood (which measures the degree of fit of model predictions with the data) to exhibit different levels of sensitivity along different directions in the parameter space. Consider for example Laplace's method, which is based on a semi-analytic Gaussian approximation around the Maximum-A-Posteriori (MAP) estimate mackay_choice_1998 ; bishop_pattern_2006 . The inverse of the negative Hessian of the log-posterior (assuming this is positive-definite) serves as the covariance matrix. As was shown in buithanh_extremescale_2012 , in many inverse problems this covariance matrix exhibits a significant discrepancy in its eigenvalues, which was exploited in constructing low-rank approximations. At one extreme, there would be principal directions (with small variance) along which the slightest deviation from the MAP estimate would cause a huge decrease in the posterior, and at the other, there would be principal directions (with large variance) along which the posterior would remain almost constant. Such principal directions will naturally encapsulate the effect of the log-prior. In the proposed scheme however, only the data log-likelihood affects the directions with the maximal posterior variance tiangang_cui_likelihoodinformed_2014 . More importantly perhaps, we propose a unified framework where the identification of the subspace with the largest posterior variance is performed simultaneously with the inference of the posterior, under the same Variational Bayesian objective. This yields not only a highly efficient algorithm (in terms of the number of forward solves) but also a highly extendable framework, as discussed in the conclusions. The inference and dimensionality reduction problems are approached by employing fully Bayesian argumentation and invoking the quality of the approximation to the posterior as our guiding objective. To that end we postulate the following representation for the high-dimensional vector of unknowns $\boldsymbol{\theta}$:

(14) $\boldsymbol{\theta} = \boldsymbol{\mu} + \mathbf{W} \boldsymbol{\eta}$
The motivation behind such a decomposition is quite intuitive, as it resembles a Principal Component Analysis (PCA) model tipping_probabilistic_1999 . The vector $\boldsymbol{\mu} \in \mathbb{R}^{d_\theta}$ represents the mean value of the representation of $\boldsymbol{\theta}$, whereas $\boldsymbol{\eta} \in \mathbb{R}^{d_\eta}$ (with $d_\eta \ll d_\theta$) are the reduced (and latent) coordinates of $\boldsymbol{\theta}$ along the linear subspace spanned by the columns of the matrix $\mathbf{W}$. The linear decomposition of a high-dimensional vector such as $\boldsymbol{\theta}$ has received a lot of attention in several different fields. Most commonly $\boldsymbol{\theta}$ represents a high-dimensional signal (e.g. an image, an audio/video recording) and $\mathbf{W}$ consists of an over- or undercomplete basis set olshausen_sparse_1997 ; dobigeon_bayesian_2010 which attempts to encode the signal as sparsely as possible. Significant advances in Compressed Sensing candes_robust_2006 or Sparse Bayesian Learning wipf_sparse_2004 have been achieved in recent years along these lines. A host of deterministic lee_efficient_2006 or probabilistic seeger_variational_2010 algorithms have been developed for identifying the reduced coordinates (or their posterior), as well as techniques for learning the most appropriate basis (dictionary learning), i.e. the one that can lead to the sparsest possible representation. While all these tools are pertinent to the present problem, they differ in a fundamental way: they are based on several data/observations/instantiations of $\boldsymbol{\theta}$, whereas in our problem we do not have such direct observations, i.e. the data available pertains to $\hat{\mathbf{y}}$ which is nonlinearly and implicitly dependent on $\boldsymbol{\theta}$. Furthermore, we are primarily interested in approximating the posterior on $\boldsymbol{\theta}$ rather than in the dimensionality reduction itself. We focus now on the representation of Equation (14) and proceed to discuss the identification of $\boldsymbol{\mu}$, $\mathbf{W}$ and $\boldsymbol{\eta}$. In a fully Bayesian setting these parameters would be equipped with priors, say $p(\boldsymbol{\mu})$, $p(\mathbf{W})$ and $p(\boldsymbol{\eta})$ respectively (footnote 2: we use the same symbol $p$ for various densities without super/subscripts for economy of notation; the parameters each density pertains to can be identified from its arguments), and their joint posterior would be sought:

(15) $p(\boldsymbol{\eta}, \tau, \boldsymbol{\mu}, \mathbf{W} \mid \hat{\mathbf{y}}) \propto p(\hat{\mathbf{y}} \mid \boldsymbol{\eta}, \tau, \boldsymbol{\mu}, \mathbf{W}) \, p(\boldsymbol{\eta}) \, p(\tau) \, p(\boldsymbol{\mu}) \, p(\mathbf{W})$
where $p(\tau)$ represents the Gamma prior for $\tau$ discussed in Equation (13). Such an inference problem would in general be formidable, particularly with regard to $\boldsymbol{\mu}$ and $\mathbf{W}$, whose dimension is dominated by $d_\theta$. To address this difficulty we propose computing point estimates for $\boldsymbol{\mu}$ and $\mathbf{W}$ while inferring the whole posterior of $(\boldsymbol{\eta}, \tau)$. In doing so, the natural objective function for $\boldsymbol{\mu}$ and $\mathbf{W}$ would be the marginal posterior $p(\boldsymbol{\mu}, \mathbf{W} \mid \hat{\mathbf{y}})$:

(16) $p(\boldsymbol{\mu}, \mathbf{W} \mid \hat{\mathbf{y}}) = \int p(\boldsymbol{\eta}, \tau, \boldsymbol{\mu}, \mathbf{W} \mid \hat{\mathbf{y}}) \, d\boldsymbol{\eta} \, d\tau$

In such a case the point estimates for $(\boldsymbol{\mu}, \mathbf{W})$ would be the Maximum-a-Posteriori (MAP) estimates. We note that (up to an additive constant):

(17) $\log p(\boldsymbol{\mu}, \mathbf{W} \mid \hat{\mathbf{y}}) = \log \int p(\hat{\mathbf{y}} \mid \boldsymbol{\eta}, \tau, \boldsymbol{\mu}, \mathbf{W}) \, p(\boldsymbol{\eta}) \, p(\tau) \, d\boldsymbol{\eta} \, d\tau + \log p(\boldsymbol{\mu}) + \log p(\mathbf{W})$
We note that such an integration is analytically intractable, primarily due to the nonlinear and implicit nature of $\mathbf{y}(\boldsymbol{\theta})$ and secondarily due to the coupling of $\boldsymbol{\eta}$ and $\tau$. To that end we employ a Variational Bayesian approximation bishop_pattern_2006 to the integral in Equation (17). We provide further details in the next section. We note that similar approximations have been employed in previous works jin_hierarchical_2010 ; jin_variational_2012 ; chappell_variational_2009 in order to expedite Bayesian inference. The novel element of this work pertains to the dimensionality reduction that can be achieved.
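The decomposition of Equation (14) can be sketched numerically; the dimensions below are hypothetical and the orthogonal basis is generated by a QR factorization.

```python
import numpy as np

rng = np.random.default_rng(0)
d_theta, d_eta = 100, 3                 # hypothetical full / reduced dimensions

mu = np.zeros(d_theta)                  # mean of the representation
W, _ = np.linalg.qr(rng.standard_normal((d_theta, d_eta)))  # W^T W = I
eta = rng.standard_normal(d_eta)        # latent reduced coordinates
theta = mu + W @ eta                    # Equation (14)
```

Only the `d_eta` coordinates `eta` are inferred in full, while `mu` and `W` receive point estimates, which is the key computational saving of the scheme.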
2.2.2 Variational Bayesian approximation
Consider an arbitrary joint density $q(\boldsymbol{\eta}, \tau)$ on the latent variables $(\boldsymbol{\eta}, \tau)$. Then, by employing Jensen's inequality, one can construct a lower bound $\mathcal{F}(q, \boldsymbol{\mu}, \mathbf{W})$ to the log of the integral appearing in the log-marginal-posterior of Equation (17), as follows:

(18) $\log \int p(\hat{\mathbf{y}} \mid \boldsymbol{\eta}, \tau, \boldsymbol{\mu}, \mathbf{W}) \, p(\boldsymbol{\eta}) \, p(\tau) \, d\boldsymbol{\eta} \, d\tau \;\ge\; \int q(\boldsymbol{\eta}, \tau) \log \dfrac{p(\hat{\mathbf{y}} \mid \boldsymbol{\eta}, \tau, \boldsymbol{\mu}, \mathbf{W}) \, p(\boldsymbol{\eta}) \, p(\tau)}{q(\boldsymbol{\eta}, \tau)} \, d\boldsymbol{\eta} \, d\tau \;=\; \mathcal{F}(q, \boldsymbol{\mu}, \mathbf{W})$
We note here that the variational lower bound $\mathcal{F}$ has an intimate connection with the Kullback-Leibler divergence between $q(\boldsymbol{\eta}, \tau)$ and the (conditional) posterior $p(\boldsymbol{\eta}, \tau \mid \hat{\mathbf{y}}, \boldsymbol{\mu}, \mathbf{W})$:

(19) $\mathrm{KL}\left( q(\boldsymbol{\eta}, \tau) \,\|\, p(\boldsymbol{\eta}, \tau \mid \hat{\mathbf{y}}, \boldsymbol{\mu}, \mathbf{W}) \right) = -\mathbb{E}_q\left[ \log \dfrac{p(\boldsymbol{\eta}, \tau \mid \hat{\mathbf{y}}, \boldsymbol{\mu}, \mathbf{W})}{q(\boldsymbol{\eta}, \tau)} \right]$

In particular, if we denote by $\mathbb{E}_q[\cdot]$ the expectation with regard to $q(\boldsymbol{\eta}, \tau)$:

(20) $\log \int p(\hat{\mathbf{y}} \mid \boldsymbol{\eta}, \tau, \boldsymbol{\mu}, \mathbf{W}) \, p(\boldsymbol{\eta}) \, p(\tau) \, d\boldsymbol{\eta} \, d\tau = \mathcal{F}(q, \boldsymbol{\mu}, \mathbf{W}) + \mathrm{KL}\left( q \,\|\, p(\boldsymbol{\eta}, \tau \mid \hat{\mathbf{y}}, \boldsymbol{\mu}, \mathbf{W}) \right)$
By definition the KL divergence is non-negative and it becomes zero when $q(\boldsymbol{\eta}, \tau) \equiv p(\boldsymbol{\eta}, \tau \mid \hat{\mathbf{y}}, \boldsymbol{\mu}, \mathbf{W})$. Hence, for a given $(\boldsymbol{\mu}, \mathbf{W})$, constructing a good approximation to the conditional posterior (in the KL-divergence sense) is equivalent to maximizing the lower bound $\mathcal{F}$ with regard to $q$.
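For two univariate Gaussians the KL divergence of Equation (19) is available in closed form, which makes its two defining properties (non-negativity, and zero exactly when the densities coincide) easy to verify numerically; a minimal sketch:

```python
import numpy as np

def kl_gauss(m_q, s_q, m_p, s_p):
    """Closed-form KL( N(m_q, s_q^2) || N(m_p, s_p^2) )."""
    return np.log(s_p / s_q) + (s_q**2 + (m_q - m_p)**2) / (2.0 * s_p**2) - 0.5

assert kl_gauss(0.0, 1.0, 0.0, 1.0) == 0.0   # zero iff q coincides with p
assert kl_gauss(1.0, 2.0, 0.0, 1.0) > 0.0    # strictly positive otherwise
```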
The aforementioned discussion suggests an iterative optimization scheme that resembles the Variational Bayes - Expectation-Maximization (VB-EM) methods that have appeared in the Machine Learning literature beal_variational_2003 . At each iteration $t$, one alternates between (Figure 1):
VB-Expectation: Given $(\boldsymbol{\mu}^{(t)}, \mathbf{W}^{(t)})$, find:

(21) $q^{(t+1)} = \arg\max_{q} \mathcal{F}\left( q, \boldsymbol{\mu}^{(t)}, \mathbf{W}^{(t)} \right)$

VB-Maximization: Given $q^{(t+1)}$, find:

(22) $\left( \boldsymbol{\mu}^{(t+1)}, \mathbf{W}^{(t+1)} \right) = \arg\max_{\boldsymbol{\mu}, \mathbf{W}} \mathcal{F}\left( q^{(t+1)}, \boldsymbol{\mu}, \mathbf{W} \right) + \log p(\boldsymbol{\mu}) + \log p(\mathbf{W})$
In plain terms, the strategy advocated in order to carry out the inference task can be described as a generalized coordinate ascent on $\mathcal{F}$ with regard to $q$, $\boldsymbol{\mu}$ and $\mathbf{W}$ (Figure 2).
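Structurally, the VB-EM iteration of Equations (21)-(22) is a generic coordinate ascent; the skeleton below captures only that alternation, with toy `e_step`/`m_step` maps standing in for the actual variational updates derived in the following sections.

```python
def vb_em(params0, e_step, m_step, n_iter=100):
    """Generic VB-EM coordinate ascent: alternate Equations (21) and (22)."""
    params = params0
    q = None
    for _ in range(n_iter):
        q = e_step(params)      # VB-E: optimal q for fixed point estimates
        params = m_step(q)      # VB-M: optimal point estimates for fixed q
    return q, params

# Toy stand-ins with fixed point q = params = 2, purely to show the alternation:
q_final, params_final = vb_em(0.0, lambda p: 0.5 * p + 1.0, lambda q: q)
```

In the actual algorithm, `q` would be the parameters of the factorized density over $(\boldsymbol{\eta}, \tau)$ and `params` the pair $(\boldsymbol{\mu}, \mathbf{W})$.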
To alleviate the difficulties with the log-likelihood integral above, we employ the following approximations:

We linearize the map $\mathbf{y}(\boldsymbol{\theta})$ at the current estimate $\boldsymbol{\theta}^{(t)}$. Hence:

(25) $\mathbf{y}(\boldsymbol{\theta}) \approx \mathbf{y}\left( \boldsymbol{\theta}^{(t)} \right) + \mathbf{G}\left( \boldsymbol{\theta}^{(t)} \right) \left( \boldsymbol{\theta} - \boldsymbol{\theta}^{(t)} \right)$

where $\mathbf{G} = \partial \mathbf{y} / \partial \boldsymbol{\theta}$ is the gradient of the map at $\boldsymbol{\theta}^{(t)}$.
By keeping the first-order terms from Equation (25), the term in the exponent of the likelihood becomes:

(26) $\left\| \hat{\mathbf{y}} - \mathbf{y}(\boldsymbol{\theta}) \right\|^2 \approx \left\| \hat{\mathbf{y}} - \mathbf{y}\left( \boldsymbol{\theta}^{(t)} \right) - \mathbf{G}\left( \boldsymbol{\theta}^{(t)} \right) \left( \boldsymbol{\theta} - \boldsymbol{\theta}^{(t)} \right) \right\|^2$

We note here that a quadratic expression with respect to $\boldsymbol{\theta}$ could also be obtained by considering the second-order Taylor series of $\left\| \hat{\mathbf{y}} - \mathbf{y}(\boldsymbol{\theta}) \right\|^2$ around $\boldsymbol{\theta}^{(t)}$ directly. In particular, if we denote by $\mathbf{r}^{(t)} = \hat{\mathbf{y}} - \mathbf{y}\left( \boldsymbol{\theta}^{(t)} \right)$ the residual and by $\mathbf{K}_i$ the Hessian of the $i$-th observable at $\boldsymbol{\theta}^{(t)}$, keeping only up to second-order terms yields:

(27) $\left\| \hat{\mathbf{y}} - \mathbf{y}(\boldsymbol{\theta}) \right\|^2 \approx \left\| \mathbf{r}^{(t)} \right\|^2 - 2 \, \mathbf{r}^{(t)T} \mathbf{G} \, \Delta\boldsymbol{\theta} + \Delta\boldsymbol{\theta}^T \left( \mathbf{G}^T \mathbf{G} - \sum_i r_i^{(t)} \mathbf{K}_i \right) \Delta\boldsymbol{\theta}, \qquad \Delta\boldsymbol{\theta} = \boldsymbol{\theta} - \boldsymbol{\theta}^{(t)}$

The computation of second-order derivatives can also be addressed within the adjoint framework. We refer the interested reader to hinze_optimization_2009 ; buithanh_adaptive_2012 , as we do not pursue this possibility further in this work. The ensuing expressions are based on Equation (26) but can be readily adjusted to include the terms in Equation (27) instead (footnote 3: the only additional requirement is that the matrix of the quadratic term is positive semi-definite or that a positive semi-definite approximation is used).
We note that by making use of the linearization of the map and the Variational Bayesian approximation, one can obtain a tractable approximation of the posterior of the latent parameters $\boldsymbol{\eta}$ and $\tau$. This will enable us to ultimately identify all model parameters and, through this process, the optimal subspace for approximating the posterior on $\boldsymbol{\theta}$. This is explained in detail when the final algorithm is presented in Section 2.2.4.

The aforementioned equations for the VB-Expectation step imply that probabilistic inference can be expressed in terms of a parametric optimization problem. One can adopt a functional form for $q(\boldsymbol{\eta}, \tau)$ depending on an appropriate set of parameters and identify their optimal values by minimizing the KL divergence from the posterior, or equivalently maximizing $\mathcal{F}$. We adopt a mean-field approximation, where one looks for factorized densities of the form:

(28) $q(\boldsymbol{\eta}, \tau) = q(\boldsymbol{\eta}) \, q(\tau)$

Variational mean-field approximations have their origin in statistical physics peierls_minimum_1938 . We make these expressions more specific in the next sections, where we discuss the prior for $\boldsymbol{\eta}$ as well.
2.2.3 Prior Specification for $\mathbf{W}$, $\boldsymbol{\eta}$ and $\boldsymbol{\mu}$
We discuss first the prior specification on . Its columns span the subspace over which an approximation of is sought. We note that depends on the product which would remain invariant by appropriate rescaling of each pair of and for any . Hence, to resolve identifiability issues we require that is orthogonal i.e. where is the
dimensional identity matrix. This is equivalent to employing a uniform prior on
on the Stiefel manifold muirhead_aspects_1982 .The latent, reduced coordinates capture the variation of around its mean along the directions of as implied by Equation (14). It is therefore reasonable to assume that, a priori, these should have zero mean and should be uncorrelated tipping_probabilistic_1999 . For that purpose we adopt a multivariate Gaussian prior (denoted by in the Equations of the previous section) with a diagonal covariance denoted by . We select prior variances such that . This induces a natural (stochastic) ordering to the reduced coordinates since is invariant to permutations of the entries of the and the columns of (Equation (14)). As a result of this ordering, is associated with the direction along which the largest variance in is attained,
with the direction of the second largest variance, and so on. We discuss the particular values given to the prior hyperparameters in the sequel (Section 3), and in Section 2.3 the possibility of an adaptive decomposition is also presented. This enables the sequential addition of reduced coordinates until a sufficiently good approximation of the posterior is attained. The final aspect of the prior model pertains to . We use a hierarchical prior that induces the requisite smoothness, given that represents the spatial variability of the material parameters. In particular, the prior model employed penalizes the jumps in the values of and which correspond to neighboring sites/locations . The definition of a neighborhood can be adjusted depending on the problem, but in this work we assume that sites/locations belong to the same neighborhood if they correspond to adjacent pixels/voxels. Suppose is the total number of jumps or neighboring pairs. Then for , if and denote the corresponding neighboring pair:
(29) 
The strength of the penalty is proportional to the hyperparameter , i.e. smaller values of induce a weaker penalty and vice versa. Let denote the Boolean matrix that produces the vector of all jumps (such as the one above) between all neighboring sites from the vector as , and the diagonal matrix containing all the hyperparameters associated with each of these jumps. We can represent the combined prior on as:
(30) 
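To make the construction concrete, the sketch below (a hedged illustration with assumed names `jump_operator` and `smoothness_log_prior`, not code from the paper) assembles the Boolean jump matrix for a small 2D pixel grid with a 4-neighborhood and evaluates the resulting Gaussian smoothness log-prior up to its normalization constant:

```python
import numpy as np

def jump_operator(nx, ny):
    """Boolean difference matrix mapping a vectorized nx-by-ny pixel field
    to the vector of all jumps between horizontally and vertically adjacent
    pixels. Each row contains a single +1/-1 pair."""
    n = nx * ny
    idx = lambda i, j: i * ny + j          # row-major vectorization
    rows = []
    for i in range(nx):
        for j in range(ny):
            if i + 1 < nx:                 # vertical neighbor
                r = np.zeros(n); r[idx(i, j)] = 1.0; r[idx(i + 1, j)] = -1.0
                rows.append(r)
            if j + 1 < ny:                 # horizontal neighbor
                r = np.zeros(n); r[idx(i, j)] = 1.0; r[idx(i, j + 1)] = -1.0
                rows.append(r)
    return np.array(rows)

def smoothness_log_prior(mu, D, lam):
    """Log-prior up to a constant: -0.5 * (D mu)^T diag(lam) (D mu)."""
    jumps = D @ mu
    return -0.5 * jumps @ (lam * jumps)
```

Because each row of the operator contains only a +1/-1 pair, the quadratic form penalizes differences between adjacent pixels while leaving the overall level of the field unconstrained.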
A conjugate prior of the hyperparameters
is a product of Gamma distributions:
(31) 
As in bardsley_hierarchical_2010 , the independence is motivated by the absence of correlation (a priori) with regard to the locations of the jumps. In this work we use , which corresponds to a limiting case of a Jeffreys prior that is scale-invariant. We note that, in contrast to previous works where such priors have been employed for the vector of unknowns and MAP estimates have been obtained kaipio_statistical_2006 , we employ this here for , which is only part of the overall decomposition in Equation (14). We discuss in the following section the update equations for and the associated hyperparameters, as well as for the remaining model variables.
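The conjugacy can be made explicit with a minimal sketch (the helper name `gamma_posterior` is hypothetical; this is standard Gaussian-Gamma conjugacy rather than code from the paper): for a single zero-mean Gaussian jump with unknown precision, a Gamma(a, b) prior on that precision yields a Gamma posterior with updated parameters:

```python
def gamma_posterior(a, b, jump_sq):
    """Posterior of a Gamma(a, b) prior on the precision of a single
    zero-mean Gaussian jump whose squared value is jump_sq:
    Gamma(a + 1/2, b + jump_sq / 2)."""
    return a + 0.5, b + 0.5 * jump_sq

# near the scale-invariant Jeffreys limit a, b -> 0 used in the text
a1, b1 = gamma_posterior(1e-6, 1e-6, jump_sq=0.25)
post_mean_precision = a1 / b1   # large precision => strongly smoothed jump
```

In the Jeffreys limit the posterior mean precision is driven almost entirely by the observed jump size, so smooth regions receive strong penalties while sharp interfaces retain only a weak one.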
2.2.4 Update equations for
We postulate first that the reduced coordinates should a posteriori have zero mean, as they capture variability around . For that purpose we confine our search for to distributions with zero mean. Given the aforementioned prior and the linearization discussed in the previous section, we can readily deduce from Equation (23) that the optimal approximate posteriors and under the mean-field Variational Bayesian scheme adopted will be:
(32) 
The associated parameters are given by the following iterative Equations:
(33) 
(34) 
where .
As a result of the aforementioned Equations and Equation (14), one can establish that the posterior of is approximated by a Gaussian with mean and covariance given by:
(35) 
We note that if we diagonalize , i.e. , where is diagonal and is orthogonal with columns equal to the eigenvectors of , then:
(36) 
where is also orthogonal (i.e. ) and contains the principal directions of the posterior covariance of . Hence it suffices to consider approximate posteriors with covariance that is diagonal i.e. . In this case the update equations for in Equation (34) reduce to:
(37) 
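A small numerical check of this rotation argument (a hedged sketch with generic dimensions; `W`, `S` and the eigendecomposition below are stand-ins for the quantities in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 50, 5                       # full dimension and number of reduced coordinates

# an orthogonal basis (columns) via QR of a random matrix
W, _ = np.linalg.qr(rng.standard_normal((d, m)))

# a generic, non-diagonal covariance of the reduced coordinates
A = rng.standard_normal((m, m))
S = A @ A.T + m * np.eye(m)

# diagonalize S = R diag(lam) R^T
lam, R = np.linalg.eigh(S)

# the rotated basis W R is still orthogonal, and
# W S W^T = (W R) diag(lam) (W R)^T, so a diagonal covariance suffices
WR = W @ R
orth_error = np.abs(WR.T @ WR - np.eye(m)).max()
cov_error = np.abs(W @ S @ W.T - WR @ np.diag(lam) @ WR.T).max()
```

The rotation can thus be absorbed into the basis itself, which is why restricting the approximate posterior to diagonal covariances loses no generality.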
We note that despite the prior assumption of uncorrelated , the posterior on exhibits correlation and captures the principal directions along which the variance is largest. Furthermore, implicit in the aforementioned derivations is the assumption of a unimodal posterior on and subsequently on . This assumption can be relaxed by employing a mixture of Gaussians (e.g. choudrey_variational_2003 ) that will enable the approximation of highly non-Gaussian and potentially multimodal posteriors. Such approximations could also be combined with the employment of different basis sets for each of the mixture components, which would provide a wide range of possibilities. We defer further discussion along these lines to future work. In the examined elastography applications, the unimodal assumption seems to be a reasonable one due to the generally large amounts of data/observations obtained from various imaging modalities.
Given the aforementioned results one can obtain an expression for the variational lower bound in Equation (23). For economy of notation we use to denote expectations with respect to and/or as implied by the arguments:
(38) 
where is the normalization constant of a distribution with parameters . The aforementioned equation can be further simplified by making use of the following expectations: , :
(39) 
In order to update in the VB-Maximization step, it suffices to consider only the terms of that depend on it, which we denote by , i.e.:
(40) 
As discussed earlier, the prior enforces the orthogonality constraint on . To address this constrained optimization problem, we employ the iterative algorithm proposed in wen_feasible_2013 , which has proven highly efficient in terms of the number of iterations and the cost per iteration in several settings. It employs the Cayley transform to preserve the constraint during the optimization and makes use only of first-order derivatives:
(41) 
In brief, if is the skew-symmetric matrix:
(42) 
the update equations are based on a Crank-Nicolson-like scheme:
(43) 
where is the step size. One notes that the aforementioned update preserves the orthogonality of (wen_feasible_2013 ). In order to select a good step size we use the Barzilai-Borwein scheme barzilai_2point_1988 , which results in a non-monotone line search algorithm:
(44) 
where represents the difference between the current parameter values and those of the previous step. As discussed in detail in wen_feasible_2013 , the inversion of the matrix in Equation (43) can be performed efficiently by inverting a matrix of dimension which is much smaller than . We note that the updates of require no forward calls for the computation of or its derivatives . The updates/iterations are terminated when no further improvement to the objective is possible.
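The feasible update can be sketched as follows (a minimal illustration with the assumed helper `cayley_step`; for clarity it forms and inverts the full n-by-n matrix, whereas, as noted above, an equivalent much smaller inversion is used in practice):

```python
import numpy as np

def cayley_step(W, G, tau):
    """One feasible step on the Stiefel manifold in the style of Wen & Yin
    (2013): A = G W^T - W G^T is skew-symmetric, and the Crank-Nicolson-like
    update W(tau) = (I + tau/2 A)^{-1} (I - tau/2 A) W remains orthogonal."""
    n = W.shape[0]
    A = G @ W.T - W @ G.T              # skew-symmetric by construction
    I = np.eye(n)
    return np.linalg.solve(I + 0.5 * tau * A, (I - 0.5 * tau * A) @ W)

rng = np.random.default_rng(1)
n, m = 30, 4
W, _ = np.linalg.qr(rng.standard_normal((n, m)))   # start on the manifold
G = rng.standard_normal((n, m))                    # a stand-in gradient
W_new = cayley_step(W, G, tau=0.1)                 # still satisfies W^T W = I
```

Because the update is a Cayley transform of a skew-symmetric matrix, orthogonality holds exactly for any step size, so the line search only has to trade off progress in the objective.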
The final component involves the optimization of . As with , we consider only the terms of (Equation (39)) that depend on , which we denote by , i.e.:
(45) 
Due to the analytical unavailability of and its derivatives, we employ here an Expectation-Maximization scheme dempster_maximum_1977 ; neal_view_1998 , which we describe in Appendix A for completeness. The output of this algorithm is also the posterior on the hyperparameters in Equation (29), which capture the locations of jumps in as well as the probabilities associated with them. The cost of the numerical operations is minimal and scales linearly with the number of neighboring pairs . In the following we simply make use of Equations (65) without further explanation.
Formally, the determination of the optimal would require the derivatives in Equation (45). We note that depends on . Hence finding would require the computation of second-order derivatives of , which poses significant computational difficulties in the high-dimensional setting considered. To avoid this, and only for the purpose of the updates, we linearize Equation (45) around the current guess by ignoring the dependence of on , or equivalently by assuming that remains constant in the vicinity of the current guess. In particular, let denote the value of at iteration ; then, in order to find the increment , we define the new objective as follows:
(46) 
We note that there is no approximation with regard to the prior term. By keeping only the terms depending on in the equation above, we obtain:
(47) 
This is concave and quadratic with respect to the unknown . The maximum can be found by setting , which yields:
(48) 
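Schematically, the resulting update is a single linear solve (a hedged sketch in which `g` and `H` are generic stand-ins for the gradient and the symmetric positive-definite matrix of the concave quadratic; this is not the paper's exact objective):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
B = rng.standard_normal((n, n))
H = B @ B.T + n * np.eye(n)        # symmetric positive definite
g = rng.standard_normal(n)

# maximizing F(d) = g^T d - 0.5 d^T H d: stationarity gives H d* = g
d_star = np.linalg.solve(H, g)
residual = np.abs(H @ d_star - g).max()

# as described in the text, the candidate step would then be accepted only
# if the exact (non-linearized) objective improves at the new point
```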
We note that the exact objective is evaluated at , and the step is accepted only if the value of the objective is larger than that at . Iterations are terminated when no further improvement is possible. Finally, it was found that activating the regularization term () only after five updates/iterations, during which the optimization is performed solely on the basis of , enabled better exploration of the feasible solutions.
We summarize below the basic steps of the iterative Variational Bayesian scheme proposed in Algorithm 1.
With regard to the overall computational cost, we note that the updates of are the most demanding, as they require calls to the forward model to evaluate and the derivatives . The updates were terminated when no further increase in (Equation (39)) could be attained.
2.3 Adaptive learning - Cardinality of reduced coordinates
The presentation thus far was based on a fixed number of reduced coordinates . A natural question that arises is how many one should consider. In order to address this issue we propose an adaptive learning scheme, according to which the analysis is first performed with a few (even one) reduced coordinates and, upon convergence, additional reduced coordinates are introduced, either in small batches or even one-by-one. Critical to the implementation of such a scheme is a metric for the progress achieved by the addition of reduced coordinates and basis vectors, which can also be used as a termination criterion.
In this work we advocate the use of an information-theoretic criterion which measures the information gain between the prior beliefs on and the corresponding posterior. To measure such gains we employ again the KL divergence between the aforementioned distributions itti_bayesian_2009 . In particular, if (Section 2.2.3) and (Equation (37)) denote the dimensional prior and posterior respectively, we define the quantity as follows:
(49) 
which measures the (relative) information gain from to reduced coordinates. The KL divergence between and with and where are diagonal as explained previously follows with: