Many physical phenomena are modeled by governing equations which depend linearly on some terms and nonlinearly on others. For example, the wave equation may depend linearly on a forcing term and nonlinearly on the medium velocity. This paper is about inverse problems where both a linear part and a nonlinear part are unknown. Such inverse problems occur in passive radar imaging, or in seismology, where the source of an earthquake has to be determined (the source could be a point, or a fault) and a forcing term supported on that source is also unknown. This inverse problem is then linear in the unknown forcing term and nonlinear in the location of the source. In the last section of this paper, we show a simulation in seismology where a fault can be reconstructed thanks to our stochastic algorithm for mixed problems, while classic deterministic algorithms fail. Mixed linear and nonlinear inverse problems also occur in the training phase of neural networks. As most neural networks are based on linear combinations of basis functions depending on a few parameters, these parameters have to be determined by "training" the network on data, in other words by solving a mixed linear and nonlinear optimization problem where the nonlinear part represents the parameters to be determined [Bishop et al., 1995].
Let us now formulate the mixed linear and nonlinear problem studied in this paper. Assume that, after discretization, a forward model which is linear in a forcing term $g$ and nonlinear in a parameter $m$ is given by the relation
$$u = A(m)\, g + \eta, \quad (1.1)$$
where $g \in \mathbb{R}^P$ is the forcing term, $m \in \mathbb{R}^N$ is the nonlinear parameter, $A(m)$ is a $D \times P$ matrix depending continuously on the parameter $m$, $\eta$ is a $D$-dimensional Gaussian random variable assumed to have zero mean and covariance $\sigma^2 I$ with $\sigma > 0$, and $u \in \mathbb{R}^D$ is the resulting data for the inverse problem. Depending on the problem, $m$ may represent a constitutive coefficient in a PDE, the location of a point source if $A(m)$ is derived from a Green's function, or the geometry of a support if $A(m)$ is derived from a convolution with a Green's function. In practice, the mapping $m \mapsto A(m)$ is assumed to be known; in other words, a model is known. We are interested in challenging cases where the following difficulties arise simultaneously:
(i) the size of the matrix $A(m)$ is $D \times P$ with $D \ll P$ (sparse data),
(ii) the ordered singular values $s_1 \geq s_2 \geq \dots$ of $A(m)$ decay rapidly toward zero (even $A(m)A(m)^T$ is ill conditioned),
(iii) $\sigma$ is unknown and not even an estimate of $\sigma$ is available, thus choosing a regularization parameter by the discrepancy principle is impractical (unknown noise level),
(iv) due to the singularity and rectangular shape of $A(m)$, for all parameters $m$ in the search set, $\|A(m)g - u\|$ can be made arbitrarily small for some $g$ in $\mathbb{R}^P$; thus the Variable Projection (VP) functional as defined in [Golub and Pereyra, 2003] can be minimized to numerical zero for all $m$ in the search set, and consequently the variable projection method is inapplicable.
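To make difficulty (iv) concrete, here is a small numerical illustration; the smoothing kernel below is a hypothetical stand-in for $A(m)$, not the seismology model of section 5. Whenever $A(m)$ has full row rank and $D \ll P$, the unregularized residual can be driven to numerical zero for every candidate $m$, so the residual alone carries no information about $m$.

```python
import numpy as np

rng = np.random.default_rng(0)
D, P = 20, 200   # far fewer data points than unknowns: difficulty (i)

def forward_matrix(m):
    # hypothetical smooth kernel A(m)[i, j] = exp(-m |t_i - s_j|); its singular
    # values decay quickly (difficulty (ii)), yet generically it has full row rank
    t = np.linspace(0.0, 1.0, D)[:, None]
    s = np.linspace(0.0, 1.0, P)[None, :]
    return np.exp(-m * np.abs(t - s))

u = rng.standard_normal(D)        # any data vector whatsoever
residuals = []
for m in (0.5, 1.0, 2.0):         # candidate nonlinear parameters
    A = forward_matrix(m)
    g, *_ = np.linalg.lstsq(A, u, rcond=None)   # minimize ||A(m) g - u|| over g
    residuals.append(np.linalg.norm(A @ g - u))
# residuals are tiny for every m: minimizing over g cannot discriminate between
# candidate values of the nonlinear parameter
```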
In light of (iv), solving the inverse problem (1.1) is not about minimizing $\|A(m)g - u\|$ over all $m$ in the search set, nor any regularized version thereof. Instead, there is a great advantage to formulating this inverse problem in terms of finding probabilistic information about $m$. We propose in this paper an algorithm capable of computing the expected value of $m$, its covariance matrix, and possibly the posterior marginal probability distribution of some of the components of $m$. As to the linear part of inverse problem (1.1), we may ultimately only be interested in the expected value of $g$ knowing some particular value of $m$, such as its expected value, or any value within a standard deviation of its expected value.
Instead of using the VP functional, we start from the regularized error functional
$$\|A(m)\, g - u\|^2 + C \|R g\|^2, \quad (1.2)$$
where $C > 0$ is a regularization constant and $R$ is an invertible $P$ by $P$ matrix. Typical choices for $R$ include the identity matrix and matrices derived from discretizing derivative operators. In all cases $R$ is assumed to be square, large, sparse, and well-conditioned. In section 3 we relate the functional (1.2) to the probability density of $u$ knowing $g$ and $m$, and we use the Maximum Likelihood (ML) assumption to eliminate $\sigma$. As far as we know, this idea was first introduced (for linear problems only) in [Galatsanos and Katsaggelos, 1992], but unlike that reference, we do not eliminate $C$: we let
$C$ remain a random variable, and thanks to Bayes' theorem we find a formula for the probability density of $(m, C)$ knowing $u$: this is stated in Theorem 3.1. In section 4.2, this formula is used to build a parallel adaptive sampling algorithm to simulate the probability density of $(m, C)$ knowing $u$, and from there the statistics of $m$ and $g$ can be computed. Finally, we show in section 5 a numerical simulation where this algorithm is applied to a particularly challenging inverse problem in geophysics. In this problem a fault geometry, described by a nonlinear parameter $m$ in $\mathbb{R}^N$, has to be reconstructed from surface displacement data, a vector $u$ in $\mathbb{R}^D$. The data is produced by a slip field modeled by a large vector $g$ in $\mathbb{R}^P$. $u$ depends linearly on $g$, and this dependence can be expressed by a matrix $A(m)$. In this simulation, the matrix $A(m)$ is full since it is derived from convolution with a Green's function. In addition, the entries of $A(m)$ are particularly expensive to compute, which is a hallmark of problems involving half-space elasticity. This application features all the difficulties (i) through (iv) listed above. This not only makes the VP functional method unsuitable, but also renders other methods such as Generalized Cross Validation and the discrepancy principle ineffective, as discussed in section 5.3.
2 The linear part of the inverse problem
In this section we review three classical methods for selecting an adequate value for the regularization constant $C$ in (1.2), assuming that the nonlinear parameter $m$ is fixed. The Euclidean norm will be denoted by $\|\cdot\|$ and the transpose of a matrix $M$ will be denoted by $M^T$. Since we assumed that the matrix $A = A(m)$ is ill-conditioned, it is well known that one should not attempt to minimize $\|Ag - u\|$ for $g$ in $\mathbb{R}^P$ to solve for the linear part of the inverse problem without some kind of regularization. We therefore consider a Tikhonov-type regularization where we seek to minimize over $\mathbb{R}^P$ the functional (1.2) for some $C > 0$. It is well known that the functional (1.2) has a unique minimizer $g_C$ in $\mathbb{R}^P$. A difficult issue remains: selecting a value for the regularization constant $C$. Values that are too low may lead to solutions that are too oscillatory, have very large norms, and are overly sensitive to noise. Values that are too large may lead to solutions that are too smooth and produce large differences between $A g_C$ and $u$. There is a vast literature on methods for selecting an adequate value for the regularization constant $C$. An account of the most commonly used methods, together with error analysis, can be found in [Vogel, 2002]. In this paper we review three such methods, which we subsequently compare to our own algorithm.
2.1 Generalized cross validation (GCV)
We first note that, setting $h = Rg$ and $B = A R^{-1}$, minimizing (1.2) is equivalent to minimizing, for $h$ in $\mathbb{R}^P$,
$$\|B h - u\|^2 + C \|h\|^2. \quad (2.1)$$
The GCV method was first introduced and analyzed in [Golub et al., 1979]. The parameter $C$ is selected by minimizing
$$V(C) = \frac{\|(I - B B_C^{\dagger}) u\|^2}{\big(\mathrm{tr}(I - B B_C^{\dagger})\big)^2}, \quad (2.2)$$
where $B_C^{\dagger}$ is the regularized pseudo-inverse of $B$ given by
$$B_C^{\dagger} = (B^T B + C I)^{-1} B^T, \quad (2.3)$$
and tr is the trace operator. Note that
$$I - B B_C^{\dagger} = C \,(B B^T + C I)^{-1}. \quad (2.4)$$
It is well known that for a fixed $C$, (2.1) as a function of $h$ achieves a unique minimum at $h_C = B_C^{\dagger} u$. Let $C_{GCV}$ be the value of $C$ which minimizes (2.2). Golub et al. proved in [Golub et al., 1979] that, as a function of $C$, the expected value of $\|B h_C - (u - \eta)\|^2$, which can be thought of as an indicator of the fidelity of the pseudo-solution $h_C$ to the noiseless equation, is approximately minimized at $C_{GCV}$ as $D \to \infty$. Although the GCV method enjoys this remarkable asymptotic property and does not require knowledge of the covariance $\sigma^2 I$, many authors have noted that determining the minimum of (2.2) can in practice be costly and inaccurate, as the quantity in (2.2) is flat near its minimum for a wide range of values of $C$ [Thompson et al., 1989, Varah, 1983].
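The GCV computation can be sketched as follows; the identity (2.4) reduces everything to the singular values of $B$. The matrix $B$ and the noise below are synthetic stand-ins, not data from this paper, and we drop the constant factor sometimes placed in front of (2.2), which does not change the minimizer.

```python
import numpy as np

rng = np.random.default_rng(1)
D, P = 40, 120
# synthetic B with decaying singular values (rows scaled geometrically)
B = rng.standard_normal((D, P)) * (0.9 ** np.arange(D))[:, None]
u = B @ rng.standard_normal(P) + 0.05 * rng.standard_normal(D)

U, s, _ = np.linalg.svd(B, full_matrices=False)   # s: singular values of B
c = U.T @ u

def gcv(C):
    # by (2.4), the eigenvalues of I - B B_C† on range(B) are C / (s_i^2 + C)
    w = C / (s**2 + C)
    resid2 = np.sum((w * c) ** 2)      # ||(I - B B_C†) u||^2
    return resid2 / np.sum(w) ** 2     # GCV ratio (2.2), up to a constant factor

Cs = np.logspace(-6, 2, 200)
C_gcv = Cs[np.argmin([gcv(C) for C in Cs])]
```

Scanning a logarithmic grid, as above, is a common practical substitute for a full minimization, precisely because (2.2) tends to be flat near its minimum.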
2.2 The discrepancy principle (CLS)
This method is also called constrained least squares (CLS) [Galatsanos and Katsaggelos, 1992]. The parameter $C$ is chosen so that the residual matches the noise level,
$$\|B h_C - u\|^2 = D \sigma^2. \quad (2.5)$$
A regularization constant $C$ such that (2.5) is achieved will be denoted by $C_{CLS}$. Clearly, applying this method requires knowledge of the value of the covariance $\sigma^2 I$, or at least some reasonable approximation of it. Even if $\sigma$ is known, $C_{CLS}$ leads to solutions that are in general overly smooth [Galatsanos and Katsaggelos, 1992, Vogel, 2002].
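A minimal sketch of the discrepancy principle, assuming $\sigma$ is known: since the squared residual of the Tikhonov minimizer is increasing in $C$, equation (2.5) can be solved by bisection. The matrix and noise level below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
D, P, sigma = 40, 120, 0.05                     # sigma is assumed known here
B = rng.standard_normal((D, P)) * (0.9 ** np.arange(D))[:, None]
u = B @ rng.standard_normal(P) + sigma * rng.standard_normal(D)

U, s, _ = np.linalg.svd(B, full_matrices=False)
c = U.T @ u

def resid2(C):
    # squared residual ||B h_C - u||^2 of the Tikhonov minimizer, via (2.4):
    # the eigenvalues of I - B B_C† on range(B) are C / (s_i^2 + C)
    return float(np.sum((C / (s**2 + C) * c) ** 2))

target = D * sigma**2                           # discrepancy level (2.5)
lo, hi = 1e-12, 1e12                            # resid2 is increasing in C
for _ in range(200):                            # geometric bisection over 24 decades
    mid = np.sqrt(lo * hi)
    lo, hi = (mid, hi) if resid2(mid) < target else (lo, mid)
C_cls = np.sqrt(lo * hi)
```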
2.3 Maximum likelihood (ML)
Of the three previously known methods that we use in this paper for comparison to our proposed algorithm, this one is of greatest interest, since we will show in the next section how a modified and expanded version can be successfully adapted to mixed linear and nonlinear inverse problems. To the best of our knowledge, the ML method was first proposed in [Galatsanos and Katsaggelos, 1992]. It relies on maximizing the likelihood of the minimizer of (1.2) knowing $C$ and $\sigma$. As the maximum is computed over all $\sigma > 0$, Galatsanos and Katsaggelos obtained in [Galatsanos and Katsaggelos, 1992] an expression that is independent of $\sigma$, which they then minimize in $C$. This expression to be minimized in $C$ is
$$\frac{u^T (I - B B_C^{\dagger})\, u}{\det(I - B B_C^{\dagger})^{1/D}}, \quad (2.6)$$
where $I - B B_C^{\dagger}$ is as in (2.4). We will show that the numerator in (2.6) is positive for any non-zero $u$. We will also indicate how the determinant in the denominator of (2.6) can be efficiently evaluated from the singular values of $B$. Minimizing (2.6) does not require any knowledge of the covariance $\sigma^2 I$. Interestingly, if $C$ is set to be $C_{ML}$, the minimizer of (2.6), Galatsanos and Katsaggelos showed in [Galatsanos and Katsaggelos, 1992] the relation (2.7).
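Like the previous two criteria, the ML ratio (2.6) can be evaluated from the singular values of $B$ alone; the sketch below uses the same synthetic data as before and works with log-determinants for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(3)
D, P = 40, 120
B = rng.standard_normal((D, P)) * (0.9 ** np.arange(D))[:, None]
u = B @ rng.standard_normal(P) + 0.05 * rng.standard_normal(D)

U, s, _ = np.linalg.svd(B, full_matrices=False)
c = U.T @ u

def ml_ratio(C):
    w = C / (s**2 + C)                 # eigenvalues of I - B B_C†, by (2.4)
    num = float(np.sum(w * c**2))      # u^T (I - B B_C†) u, positive for u != 0
    logdet = np.sum(np.log(w))         # log det(I - B B_C†), computed stably
    return num / np.exp(logdet / D)    # the ratio (2.6)

Cs = np.logspace(-6, 2, 200)
C_ml = Cs[np.argmin([ml_ratio(C) for C in Cs])]
```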
In [Galatsanos and Katsaggelos, 1992], formulas (2.6) and (2.7) were only established in the case of square matrices ($D = P$). The generalization to rectangular matrices is rather straightforward. In this paper, our main contribution is to generalize the ML method to mixed linear and nonlinear inverse problems, where $m$ becomes variable, and to propose an alternative to minimizing the ratio (2.6). In this alternative, $C$ will itself be a random variable. Instead of only retaining the most likely value of $C$, we will consider all positive values of $C$. There is a simple intuitive explanation for why this new approach will prove fruitful. Since the nonlinear parameter $m$ is variable, the `optimal' value for $C$ depends on $m$. One line of thinking is to compute the optimal value for $C$ as a function of $m$ using the GCV or the CLS method. Our numerical simulations show that this leads to highly unstable solutions. This is chiefly due to the fact that for values of $m$ which are far from the `correct' value, the computed value for $C$ is low, so more irregular solutions for the linear part of the problem are favored. For values of $m$ which are close to the `correct' value, higher values for $C$ are selected, and accordingly more regular solutions for the linear part of the problem are favored. Altogether, this is a poor way of comparing how well different values of $m$ fit the data. One way around that hurdle is to find a criterion for selecting a uniform value of $C$ for all $m$, as in previous studies [Volkov and Sandiumenge, 2019, Volkov et al., 2017a]. This led to acceptable results on simulated data and on measured data, although the nonlinear parameter had to be restricted to be three-dimensional for this approach to be fruitful. However, a physical argument can be made against selecting a uniform value of $C$ for all $m$: suppose that equation (1.1) models a physical phenomenon such that the nonlinear parameter $m$ is related to the distance to a set of sources.
Suppose that the intensity of the induced physical field decays with the distance to the source. Then, in order to produce the same intensity of measurement, a faraway source requires a stronger impulse, which pushes toward lower values of $C$. This explains why selecting a uniform value of $C$ leads to a bias toward decreasing the distance to reconstructed sources.
3 A new Bayesian approach for finding the posterior of the augmented random variable $(m, C)$
We make the following assumptions:
H1. $m$, $C$, and $g$ are random variables in $\mathbb{R}^N$, $(0, \infty)$, and $\mathbb{R}^P$, respectively,
H2. $(m, C)$ has a known prior distribution denoted by $\rho_{\mathrm{pr}}(m, C)$,
H3. $A(m)$ is a $D$ by $P$ matrix which depends continuously on $m$,
H4. $\eta$ is a $D$-dimensional Gaussian random variable that we assume to have zero mean and covariance $\sigma^2 I$, with $\sigma > 0$,
H5. relation (1.1) holds,
H6. $R$ is a fixed invertible $P$ by $P$ matrix and we set $B = A(m) R^{-1}$,
H7. we set $g_C = (A(m)^T A(m) + C R^T R)^{-1} A(m)^T u$; equivalently, $g_C$ is the minimizer of (1.2),
H8. the ML assumption: the prior of $g$ is also a normal random variable with zero mean and covariance $\frac{\sigma^2}{C} (R^T R)^{-1}$.
The ML assumption H8 was introduced in [Galatsanos and Katsaggelos, 1992] and justified in that paper by a physical argument. Here we give another interpretation. The functional (1.2) may be rewritten, after division by $2\sigma^2$, as
$$\frac{1}{2\sigma^2}\|A(m)\, g - u\|^2 + \frac{1}{2\sigma^2/C}\|R g\|^2. \quad (3.1)$$
According to (1.1), we would like the difference $u - A(m)g$ to behave like a normal random variable with zero mean and covariance $\sigma^2 I$. Assuming that the prior of $g$ is also a normal random variable with zero mean and covariance $\frac{\sigma^2}{C}(R^T R)^{-1}$ restores a balance between reconstruction fidelity (first term in (3.1)) and regularity requirements (second term in (3.1)).
Theorem 3.1. Assume assumptions H1 to H8 hold. Let $\rho(u \,|\, m, C, \sigma)$ be the marginal probability density of $u$ knowing $(m, C, \sigma)$. As a function of $\sigma$, $\rho(u \,|\, m, C, \sigma)$ achieves a unique maximum at
$$\sigma_{\max} = \Big(\frac{\|A(m)g_C - u\|^2 + C\|Rg_C\|^2}{D}\Big)^{1/2}. \quad (3.2)$$
Fixing $\sigma = \sigma_{\max}$, the probability density of $(m, C)$ knowing $u$ is then given, up to a multiplicative constant, by the formula
$$\rho(m, C \,|\, u) \propto \rho_{\mathrm{pr}}(m, C)\, \det(I + C^{-1} B^T B)^{-1/2}\, \big(\|A(m)g_C - u\|^2 + C\|Rg_C\|^2\big)^{-D/2}. \quad (3.3)$$
Proof: According to H4 and H5, the probability density of $u$ knowing $g$, $m$, $C$, and $\sigma$ is, since $\eta$ does not depend on $g$,
$$\rho(u \,|\, g, m, C, \sigma) = (2\pi\sigma^2)^{-D/2} \exp\Big(-\frac{\|A(m)g - u\|^2}{2\sigma^2}\Big). \quad (3.4)$$
Due to assumption H8,
$$\rho_{\mathrm{pr}}(g \,|\, C, \sigma) = \Big(\frac{2\pi\sigma^2}{C}\Big)^{-P/2} \det(R^T R)^{1/2} \exp\Big(-\frac{C\|Rg\|^2}{2\sigma^2}\Big), \quad (3.5)$$
since this prior is independent of $m$. The joint distribution of $(u, g)$ knowing $(m, C, \sigma)$ is related to the distribution of $u$ knowing $(g, m, C, \sigma)$ by
$$\rho(u, g \,|\, m, C, \sigma) = \rho(u \,|\, g, m, C, \sigma)\, \rho_{\mathrm{pr}}(g \,|\, C, \sigma), \quad (3.6)$$
where $\rho_{\mathrm{pr}}(g \,|\, C, \sigma)$ is the prior probability distribution of $g$ [Kaipio and Somersalo, 2006], which we said was given by (3.5). Combining (3.4), (3.5), and (3.6) we obtain
$$\rho(u \,|\, m, C, \sigma) = \int_{\mathbb{R}^P} \rho(u \,|\, g, m, C, \sigma)\, \rho_{\mathrm{pr}}(g \,|\, C, \sigma)\, dg. \quad (3.7)$$
This last integral can be computed explicitly [Volkov and Sandiumenge, 2019] to find
$$\rho(u \,|\, m, C, \sigma) = (2\pi\sigma^2)^{-D/2} \det(I + C^{-1} B^T B)^{-1/2} \exp\Big(-\frac{\|A(m)g_C - u\|^2 + C\|Rg_C\|^2}{2\sigma^2}\Big), \quad (3.9)$$
which we now maximize for $\sigma$ in $(0, \infty)$. Note that $g_C$ does not depend on $\sigma$. As $\sigma$ tends to infinity, the limit of (3.9) is clearly zero. As $\sigma$ tends to zero, as long as $\|A(m)g_C - u\|^2 + C\|Rg_C\|^2$ is non-zero, the exponential factor decays faster than $\sigma^{-D}$ grows, so the limit of (3.9) is again zero. We then take the derivative of the logarithm of (3.9) in $\sigma$ and set it equal to zero to find the equation
$$-\frac{D}{\sigma} + \frac{\|A(m)g_C - u\|^2 + C\|Rg_C\|^2}{\sigma^3} = 0,$$
thus the value $\sigma_{\max}$ given by (3.2). Substituting $\sigma_{\max}$ in (3.9) yields
$$\rho(u \,|\, m, C, \sigma_{\max}) \propto \det(I + C^{-1} B^T B)^{-1/2}\, \big(\|A(m)g_C - u\|^2 + C\|Rg_C\|^2\big)^{-D/2},$$
where $\propto$ means `equal to some constant times'. Since our goal is to reconstruct $m$ and $C$ knowing $u$, we apply Bayes' law
$$\rho(m, C \,|\, u) \propto \rho(u \,|\, m, C, \sigma_{\max})\, \rho_{\mathrm{pr}}(m, C)$$
to obtain (3.3). $\square$
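A sketch of a routine evaluating the non-normalized posterior density of $(m, C)$; it assumes the posterior takes the form prior $\times \det(I + C^{-1}B^TB)^{-1/2} \times (\|A(m)g_C - u\|^2 + C\|Rg_C\|^2)^{-D/2}$, which is our reading of (3.3). The `forward_matrix` and `log_prior` callables are user-supplied placeholders, not part of the paper's notation.

```python
import numpy as np

def log_post(m, C, u, forward_matrix, R, log_prior):
    """Non-normalized log posterior of (m, C) knowing u, assuming (3.3) reads
    prior(m, C) * det(I + B^T B / C)^(-1/2) * F(g_C)^(-D/2),
    with B = A(m) R^{-1} and F the functional (1.2) at its minimizer g_C."""
    A = forward_matrix(m)                      # D x P model matrix, user-supplied
    D = A.shape[0]
    B = A @ np.linalg.inv(R)                   # fine for moderate P; factor R once in production
    s = np.linalg.svd(B, compute_uv=False)     # singular values of B
    logdet = np.sum(np.log1p(s**2 / C))        # log det(I + B^T B / C)
    g_C = np.linalg.solve(A.T @ A + C * (R.T @ R), A.T @ u)   # minimizer of (1.2)
    F = np.sum((A @ g_C - u)**2) + C * np.sum((R @ g_C)**2)
    return log_prior(m, C) - 0.5 * logdet - 0.5 * D * np.log(F)
```

A sampler only needs such log densities up to an additive constant, so the unknown normalization in (3.3) is harmless.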
We now compare formulas (3.2) and (3.3) from Theorem 3.1 to formulas (28) and (29) found in [Galatsanos and Katsaggelos, 1992]. Let us first emphasize a major difference in our approach. In [Galatsanos and Katsaggelos, 1992], the ratio (28) is optimized in the regularization parameter, so eventually only one value of that parameter is considered. Instead, formula (3.3) uses a prior on the regularization parameter $C$, so all values of $C$ can be considered.
In order to show the connection between the numerator of (28) in [Galatsanos and Katsaggelos, 1992] and the term $\|A(m)g_C - u\|^2 + C\|Rg_C\|^2$ in (3.3), we note that since $g_C$ satisfies assumption H7,
$$\|A(m)g_C - u\|^2 + C\|Rg_C\|^2 = u^T (I - B B_C^{\dagger})\, u,$$
which is the analog of the numerator in formula (28) in [Galatsanos and Katsaggelos, 1992]. To relate the determinant in (3.3) to the determinant in formula (28) in [Galatsanos and Katsaggelos, 1992], we need the following lemma.
Lemma 3.1. For any $C > 0$,
$$\det(I - B B_C^{\dagger})^{-1} = \det(I + C^{-1} B B^T) = \det(I + C^{-1} B^T B). \quad (3.10)$$
Proof: We first notice that, by (2.4),
$$I - B B_C^{\dagger} = C\,(B B^T + C I)^{-1} = (I + C^{-1} B B^T)^{-1}, \quad (3.11)$$
so the first two terms in (3.10) are equal. Let $\lambda$ be an eigenvalue of $I + C^{-1} B^T B$ which is different from 1. There is a non-zero $v$ in $\mathbb{R}^P$ such that
$$(I + C^{-1} B^T B)\, v = \lambda v. \quad (3.12)$$
This implies that $B^T B v = C(\lambda - 1) v \neq 0$, and in particular $Bv \neq 0$. From (3.12),
$$(I + C^{-1} B B^T)\, Bv = \lambda\, Bv,$$
which shows that $\lambda$ is also an eigenvalue of $I + C^{-1} B B^T$ since $Bv \neq 0$. The same calculation can be used to show that if $v_1, \dots, v_k$ are independent eigenvectors of $I + C^{-1} B^T B$ for the eigenvalue $\lambda$, then $B v_1, \dots, B v_k$ are independent eigenvectors of $I + C^{-1} B B^T$ for the eigenvalue $\lambda$. Conversely, let $\lambda$ be an eigenvalue of $I + C^{-1} B B^T$ which is different from 1. Then there is a non-zero $w$ in $\mathbb{R}^D$ such that
$$(I + C^{-1} B B^T)\, w = \lambda w,$$
and consequently
$$(I + C^{-1} B^T B)\, B^T w = \lambda\, B^T w,$$
thus $\lambda$ is an eigenvalue of $I + C^{-1} B^T B$ as $B^T w \neq 0$. The same calculation can be used to show that if $w_1, \dots, w_k$ are independent eigenvectors of $I + C^{-1} B B^T$ for the eigenvalue $\lambda$, then $B^T w_1, \dots, B^T w_k$ are independent eigenvectors of $I + C^{-1} B^T B$ for the eigenvalue $\lambda$. In conclusion, we have shown that the symmetric matrices $I + C^{-1} B^T B$ and $I + C^{-1} B B^T$ have the same eigenvalues with the same multiplicities, except possibly for the eigenvalue 1. Since the eigenvalue 1 contributes a factor of 1 to each determinant, it follows that they have the same determinant. $\square$
The determinants in (3.10) can be evaluated efficiently. In many applications the matrix $A(m)$ is rectangular. In the particular application shown later in this paper, $D \ll P$. We recall that the matrix $R$ is sparse and well-conditioned, so $B = A(m) R^{-1}$ can be efficiently evaluated. Let $s_1 \geq \dots \geq s_r > 0$ be the non-zero singular values of $B$ counted with multiplicity. Note that $r \leq D$. In practice, if both $D$ and $P$ are large, since we assumed that the singular values of $B$ are rapidly decaying, computing just the largest singular values of $B$ is sufficient. The eigenvalues of $I + C^{-1} B B^T$ that are different from 1 are $1 + s_i^2/C$, $1 \leq i \leq r$, and accordingly
$$\det(I + C^{-1} B B^T) = \prod_{i=1}^{r} \Big(1 + \frac{s_i^2}{C}\Big). \quad (3.14)$$
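Formula (3.14) in action: on a synthetic matrix with rapidly decaying singular values, the log-determinant computed from a truncated set of singular values agrees with the full computation, since small $s_i$ contribute $\log(1 + s_i^2/C) \approx 0$.

```python
import numpy as np

rng = np.random.default_rng(4)
D, P, C = 30, 100, 0.1
# synthetic B with rapidly decaying singular values (rows scaled by e^{-i})
B = rng.standard_normal((D, P)) * np.exp(-1.0 * np.arange(D))[:, None]

s = np.linalg.svd(B, compute_uv=False)         # all non-zero singular values
exact = np.sum(np.log1p(s**2 / C))             # log det(I + B B^T / C), full sum

k = 15                                          # keep only the k largest singular values
approx = np.sum(np.log1p(s[:k]**2 / C))        # truncated version of (3.14)
# exact and approx agree to high accuracy: the discarded terms are negligible
```

For genuinely large problems, the leading singular values themselves would come from a truncated or randomized SVD rather than a full decomposition.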
4 Proposed algorithm
We first present in section 4.1 a single-processor version of our adaptive sampling algorithm derived from formula (3.3). A reader familiar with this kind of algorithm can focus on the notation introduced in section 4.1 and on the sub-algorithm for computing the relative probability densities of proposals, and then move to section 4.2, where we explain how this algorithm can be adjusted to multi-processor platforms. The adjustment is based on [Calderhead, 2014], where it is shown that astute combinations of proposals computed in parallel result in better mixing and faster convergence.
4.1 Single processor algorithm
Based on (3.3), we define the non-normalized distribution
$$\tilde\rho(m, C) = \rho_{\mathrm{pr}}(m, C)\, \det(I + C^{-1} B^T B)^{-1/2}\, \big(\|A(m)g_C - u\|^2 + C\|Rg_C\|^2\big)^{-D/2}.$$
Our proposed algorithm will call a sub-algorithm which computes $\tilde\rho(m, C)$ for a given $(m, C)$. We use the following notations: $\mathrm{E}$ for expected value, $\mathrm{cov}$ for covariance matrix, $\mathcal{N}(\mu, \Sigma)$ for a normal distribution with mean $\mu$ and covariance $\Sigma$,
and $\mathcal{U}(a, b)$ for a uniform distribution in the interval $(a, b)$. The random walk algorithm starts from a point $(m_0, C_0)$ such that $\tilde\rho(m_0, C_0) > 0$ and from an initial covariance matrix $\Sigma_0$ obtained from the prior distribution. A good choice of the initial point may have a strong impact on how many sampling steps are necessary in our algorithm. A poor choice may result in a very long "burn in" phase where the random walk is lost in a low probability region. How to find a good starting point depends greatly on the application, so we discuss that issue in a later section where we cover a specific example. Our basic single-processor algorithm follows the well-established adaptive MCMC propose/accept/reject scheme [Roberts and Rosenthal, 2009]. Let $(\beta_i)$ be a decreasing sequence in $(0, 1)$ which converges to 0. This sequence is used to weigh a convex combination of the initial covariance $\Sigma_0$ and the covariance learned from sampling. The updating of this covariance need not occur at every step. Let $T$ be the total number of steps and $t$ the number of steps between covariance updates. We require that $t$ divide $T$. Finally, the covariance of the proposals is scaled by a factor of $2.38^2/(N+1)$, where $N+1$ is the dimension of $(m, C)$, as recommended in [Roberts et al., 2001, Roberts and Rosenthal, 2009]. It was shown in [Roberts et al., 2001] that this scaling leads to an optimal acceptance rate.
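The single-processor scheme just described can be sketched as follows. The update schedule, the weight sequence $\beta_i = 1/\sqrt{i}$, and the small ridge added to the empirical covariance are illustrative choices of ours, not prescriptions from the references.

```python
import numpy as np

def adaptive_mcmc(log_rho, x0, Sigma0, T=5000, t_update=100, seed=0):
    """Adaptive random-walk Metropolis in the spirit of [Roberts and Rosenthal, 2009]:
    the proposal covariance is a convex combination, with decreasing weight beta,
    of the initial covariance Sigma0 and the empirical covariance of the chain,
    scaled by 2.38^2 / d where d is the dimension of the state."""
    rng = np.random.default_rng(seed)
    d = len(x0)
    scale = 2.38**2 / d
    x = np.asarray(x0, dtype=float)
    lx = log_rho(x)
    Sigma = np.array(Sigma0, dtype=float)
    chain = [x.copy()]
    for i in range(1, T + 1):
        prop = rng.multivariate_normal(x, scale * Sigma)
        lp = log_rho(prop)
        if np.log(rng.uniform()) < lp - lx:          # Metropolis accept/reject
            x, lx = prop, lp
        chain.append(x.copy())
        if i % t_update == 0:                        # periodic covariance update
            beta = 1.0 / np.sqrt(i)                  # decreasing weight, tends to 0
            emp = np.cov(np.asarray(chain).T) + 1e-10 * np.eye(d)
            Sigma = beta * Sigma0 + (1.0 - beta) * emp
    return np.asarray(chain)
```

In the setting of this paper, `log_rho` would be the logarithm of $\tilde\rho(m, C)$ and the state $x$ would be the $(N+1)$-dimensional vector $(m, C)$.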
The sub-algorithm for computing $\tilde\rho(m, C)$ depends on the application. We present in the last section an application where $A(m)$ is derived from an integral operator with an integration kernel which is very expensive to compute. In that case the matrix $A(m)$ is full and it is advantageous to use array operations to compute the entries of $A(m)$ in aggregate. Typically, the regularization matrix $R$ could be the identity or a matrix derived from the evaluation of discrete derivatives, plus a few entries to make $R$ well conditioned. In the application shown in the last section, $R$ is of the latter form. Note that $R$ is sparse and does not depend on $m$, so a factorization of $R$ should be computed only once and stored. In any case, it is important to take advantage of the sparsity of $R$. Computing the determinant should take advantage of formula (3.14). As discussed after the proof of Lemma 3.1, only the largest non-zero singular values of $B$ are needed. If the size of $B$ is large, one can take advantage of randomized SVD techniques. Finally, for larger problems, an iterative solver should be used to evaluate $g_C$, the minimizer of (1.2). For efficiency, one should make sure to code the function $g \mapsto (A(m)^T A(m) + C R^T R)\, g$ without evaluating the matrix product $A(m)^T A(m)$.
4.2 Parallel algorithm
Let $K$ be the number of processing units. A straightforward way of taking advantage of multiple processors is to generate $K$ separate chains of samples using the single-processor algorithm described in section 4.1 and then concatenate them. However, computations can be greatly accelerated by analyzing the proposals produced by the chains in aggregate [Calderhead, 2014, Jacob et al., 2011]. While in section 4.1 the state was a single $(N+1)$-dimensional vector, here we set $X$ to be an $(N+1)$ by $(K+1)$ matrix whose $j$-th column, denoted by $X_j$, is a sample of the random variable $(m, C)$, $1 \leq j \leq K+1$; at each stage, one column holds the current state and the remaining $K$ columns hold the proposals. We then assemble a $(K+1)$ by $(K+1)$ transition matrix from the computed non-normalized densities $\tilde\rho(X_1), \dots, \tilde\rho(X_{K+1})$. The entries of the transition matrix are given by the following formula [Calderhead, 2014],
$$A(i, j) = \frac{1}{K} \min\Big(1, \frac{\tilde\rho(X_j)}{\tilde\rho(X_i)}\Big) \ \text{for} \ j \neq i, \qquad A(i, i) = 1 - \sum_{j \neq i} A(i, j).$$
Note that for $1 \leq i \leq K+1$, the $i$-th row of $A$ defines a discrete probability distribution on $\{1, \dots, K+1\}$.
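A sketch of the transition matrix construction, assuming a symmetric proposal kernel so that the proposal densities cancel in the acceptance ratio (the general formula in [Calderhead, 2014] also weighs the proposal kernel):

```python
import numpy as np

def transition_matrix(log_rho_vals):
    """(K+1) x (K+1) transition matrix built from the non-normalized log
    densities of K+1 states (current point plus K proposals), assuming a
    symmetric proposal kernel: A[i, j] = min(1, rho_j / rho_i) / K for
    j != i, and the diagonal entry completes each row to 1."""
    lr = np.asarray(log_rho_vals, dtype=float)
    K = len(lr) - 1
    A = np.minimum(1.0, np.exp(lr[None, :] - lr[:, None])) / K
    np.fill_diagonal(A, 0.0)
    np.fill_diagonal(A, 1.0 - A.sum(axis=1))   # rows sum to 1
    return A
```

Each row being a probability distribution, the next batch of samples is drawn by sampling indices from the row of the current state.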
Note that this parallel algorithm is especially well suited to applications where computing the non-normalized density $\tilde\rho$ is particularly expensive. In that case, even the naive parallel algorithm where $K$ separate chains are computed in parallel will be about $K$ times more efficient than the single-processor algorithm. The parallel algorithm presented in this section is in fact even more efficient due to superior mixing properties and sampling performance: the performance is not overly sensitive to the tuning of proposal parameters [Calderhead, 2014].
5 Application to the fault inverse problem in seismology and numerical simulations
We now formulate a fault inverse problem in seismology that we will solve using the algorithm introduced in section 4.2. Using standard rectangular coordinates, we denote by $x = (x_1, x_2, x_3)$ elements of $\mathbb{R}^3$. We define $\mathbb{R}^{3-}$ to be the open half space $\{x_3 < 0\}$. We use the equations of linear elasticity with Lamé constants $\lambda$ and $\mu$ such that $\lambda > 0$ and $\mu > 0$. For a vector field $u = (u_1, u_2, u_3)$,
the strain and stress tensors will be denoted as follows,
$$\epsilon_{ij}(u) = \frac{1}{2}\Big(\frac{\partial u_i}{\partial x_j} + \frac{\partial u_j}{\partial x_i}\Big), \qquad \sigma_{ij}(u) = \lambda\, \mathrm{div}\, u \; \delta_{ij} + 2\mu\, \epsilon_{ij}(u),$$
and the stress vector in the direction $e$ will be denoted by
$$T_e u = \sigma(u)\, e.$$
Let $\Gamma$ be a Lipschitz open surface which is strictly included in $\mathbb{R}^{3-}$, with normal vector $n$. We define the jump of the vector field $u$ across $\Gamma$ to be
$$[u](x) = \lim_{h \to 0^+} u(x + h n) - u(x - h n),$$
for $x$ in $\Gamma$, if this limit exists. Let $u$ be the displacement field solving
$$\mathrm{div}\, \sigma(u) = 0 \ \text{in}\ \mathbb{R}^{3-} \setminus \overline{\Gamma}, \qquad T_{e_3} u = 0 \ \text{on}\ \{x_3 = 0\}, \qquad [T_n u] = 0, \quad [u] = g \ \text{on}\ \Gamma,$$
where $e_3$ is the vector $(0, 0, 1)$.
Let $\Omega$ be a bounded domain in $\mathbb{R}^{3-}$ with Lipschitz boundary containing $\overline{\Gamma}$. Let $\mathcal{H}$ be the space of restrictions to $\Gamma$ of tangential fields in $H^{1/2}(\partial\Omega)$ supported in $\overline{\Gamma}$. In [Volkov et al., 2017a], we defined the functional space of vector fields $v$ defined in $\mathbb{R}^{3-} \setminus \overline{\Gamma}$ such that