1 Introduction
Observations obeying certain physical laws can often be described by a partial differential equation (PDE). Real-world measurements carry statistical noise and thus do not generally exhibit the idealised pattern of the PDE exactly, but it is desirable that the recovery of parameters from data be consistent with the PDE structure. In the mathematical literature on inverse problems, several algorithms that incorporate such constraints have been proposed, notably optimisation based methods such as Tikhonov regularisation [EHN96, BB18] and maximum a posteriori (MAP) estimates related to Bayesian inversion techniques [stuart10, DS16]. In statistical terminology these methods can often be viewed as penalised least squares estimators over parameter spaces of regression functions that are restricted to lie in the range of some ‘forward operator’ describing the solution map of the PDE. The case of a linear forward operator is reasonably well studied in the inverse problems literature, but already in basic elliptic PDE examples the forward map is nonlinear and the analysis is more delicate. The observation scheme considered here is (a natural continuous analogue of) the standard Gaussian regression model
(1.1) $Y_i = u_f(x_i) + \varepsilon_i, \quad \varepsilon_i \sim^{iid} N(0,1), \quad i = 1, \dots, N,$
where the $x_i$ are ‘equally spaced’ design points on a bounded domain $\mathcal{O} \subset \mathbb{R}^d$ with smooth boundary $\partial\mathcal{O}$. The function $u = u_f$ is, in our first example, the solution of the elliptic PDE (with $\nabla$ denoting the gradient and $\nabla\cdot$ the divergence operator)
(1.2) $\nabla \cdot (f \nabla u) = g \ \text{ on } \mathcal{O}, \qquad u = 0 \ \text{ on } \partial\mathcal{O},$
where $g$ is a given source function defined on $\mathcal{O}$ and $f$ is an unknown conductivity (or diffusion) coefficient. The second model example arises with solutions of the time-independent Schrödinger equation (with $\Delta$ equal to the standard Laplacian operator)
(1.3) $\tfrac{\Delta}{2} u - f u = 0 \ \text{ on } \mathcal{O}, \qquad u = g \ \text{ on } \partial\mathcal{O},$
corresponding to the unknown attenuation potential (or reaction coefficient) $f \ge 0$, and given positive ‘boundary temperatures’ $g > 0$ on $\partial\mathcal{O}$. Both PDEs have a fundamental physical interpretation and feature in many application areas, see, e.g., [EHN96, BHM04, HP08, BU10, stuart10, devore, DS16], and references therein.
When $f$ belongs to some Sobolev space $H^\alpha(\mathcal{O})$ for appropriate $\alpha > 0$, unique solutions $u_f$ of the PDEs (1.2), (1.3) exist, and the ‘forward’ map $f \mapsto u_f$ is nonlinear. A natural method to estimate $f$ is by a penalised least squares approach: one minimises in $f$ the squared Euclidean distance
$$\sum_{i=1}^N \big(Y_i - u_f(x_i)\big)^2$$
of the observation vector $(Y_1, \dots, Y_N)$ to the fitted values $(u_f(x_1), \dots, u_f(x_N))$, and penalises too complex solutions by, for instance, an additive Sobolev norm type penalty. The crucial positivity constraint on $f$ can be incorporated by a smooth one-to-one transformation of the penalty function, and a final estimator minimises a criterion function of the form
$$\sum_{i=1}^N \big(Y_i - u_f(x_i)\big)^2 + \lambda \cdot \mathrm{pen}(f)$$
over the parameter space, where $\lambda > 0$ is a scalar regularisation parameter to be chosen. Both Tikhonov regularisers as well as Bayesian maximum a posteriori (MAP) estimates arising from suitable Gaussian priors fall into this class of estimators. We show in the present paper that suitable choices of $\lambda$ give rise to statistically optimal solutions of the above PDE constrained regression problems from data (1.1), in prediction loss. The convergence rates obtained can be combined with ‘stability estimates’ to obtain bounds also for the recovery of the parameter $f$ itself.
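The structure of this penalised least squares estimator can be sketched numerically. The following toy example is our own simplification, not the paper's PDE setting: the forward map is a stand-in linear smoothing operator (cumulative integration on a one-dimensional grid rather than an elliptic PDE solution map), positivity is enforced through a softplus link, and the Sobolev-type penalty is a crude discrete version. All names and constants here are illustrative choices.

```python
# Hedged sketch of penalised least squares with a positivity link.
# The forward map, link, penalty and grid are all simplifications for
# illustration only -- not the elliptic PDE setting of the paper.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

n = 40                           # number of design points on (0, 1)
x = (np.arange(n) + 0.5) / n     # 'equally spaced' design points

def link(F):
    """Smooth one-to-one map enforcing positivity of the coefficient."""
    return np.logaddexp(0.0, F) + 0.1   # softplus + shift, range (0.1, inf)

def forward(f):
    """Stand-in smoothing forward map: cumulative integration of f."""
    return np.cumsum(f) / n

def penalty(F):
    """Discrete Sobolev-type penalty: L2 norms of F and of its increments."""
    return np.sum(F**2) / n + np.sum(np.diff(F)**2)

f0 = 1.0 + 0.5 * np.sin(2 * np.pi * x)            # 'true' positive coefficient
Y = forward(f0) + 0.05 * rng.standard_normal(n)   # noisy observations as in (1.1)

lam = 1e-3   # regularisation parameter

def criterion(F):
    resid = Y - forward(link(F))
    return np.sum(resid**2) + lam * penalty(F)

res = minimize(criterion, np.zeros(n), method="L-BFGS-B")
f_hat = link(res.x)              # estimated coefficient, positive by construction

print("relative prediction error:",
      np.linalg.norm(forward(f_hat) - forward(f0)) / np.linalg.norm(forward(f0)))
```

The link transformation turns the constrained problem over positive coefficients into an unconstrained optimisation over the 'pulled-back' parameter, which is the same device used for the PDE models later in the paper.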
Our main results are based on a general convergence rate theorem for maximisers of penalised least squares type functionals of the form (2.3) below, in possibly nonlinear inverse problems whose forward map satisfies a certain modulus of continuity assumption between Hilbert spaces. This result, which adapts M-estimation techniques [sara2001, saramest] to the inverse problems setting, is of independent interest, and provides novel results also for linear forward maps, see Example 2.5 for an application to Radon transforms.
For the sake of conciseness, our theory is given in the Gaussian white noise model introduced in (2.1) below – it serves as an asymptotically equivalent (see [BL96, R08]) continuous analogue of the discrete model (1.1), and facilitates the application of PDE techniques in our proofs. Transferring our results to discrete regression models is possible, but the additional difficulties are mostly of a technical nature and will not be pursued here.
Recovery for nonlinear inverse problems such as those mentioned above has been studied initially in the deterministic regularisation literature [EKN89, N92, SEK93, EHN96, TJ02], and the convergence rate theory developed there has been adapted to the statistical regression model (1.1) in [BHM04, BHMR07, HP08, LL10]. These results all assume that a suitable Fréchet derivative of the nonlinear forward map exists at the ‘true’ parameter, and moreover require that the true parameter lies in the range of the adjoint of this derivative – the so-called ‘source condition’. Particularly for the PDE (1.2), such conditions are problematic and do not hold in general for rich enough classes of coefficients $f$ (such as Sobolev balls) unless one makes very stringent additional model assumptions. Our results circumvent such source conditions. Further remarks, including a discussion of related convergence analysis of estimators obtained from Bayesian inversion techniques [v13, dashti13, n17], can be found in Section 3.4.
1.1 Some preliminaries and basic notation
Throughout, $\mathcal{O} \subset \mathbb{R}^d$, $d \in \mathbb{N}$, denotes a bounded nonempty domain (an open bounded set with smooth boundary) with closure $\bar{\mathcal{O}}$. The usual space $L^2(\mathcal{O})$ of square integrable functions carries the norm $\|\cdot\|_{L^2(\mathcal{O})}$ induced by the inner product
$$\langle f, g \rangle_{L^2(\mathcal{O})} = \int_{\mathcal{O}} f g \, d\mu,$$
where $\mu$ denotes Lebesgue measure. For any multi-index $i = (i_1, \dots, i_d)$ of ‘length’ $|i| = \sum_j i_j$, let $D^i$ denote the $i$-th (weak) partial derivative operator of order $|i|$. Then for integer $\alpha \in \mathbb{N}$, the usual Sobolev spaces are defined as
$$H^\alpha(\mathcal{O}) = \big\{ f \in L^2(\mathcal{O}) : D^i f \in L^2(\mathcal{O}) \ \text{for all } |i| \le \alpha \big\},$$
normed by $\|f\|_{H^\alpha(\mathcal{O})} = \sum_{|i| \le \alpha} \|D^i f\|_{L^2(\mathcal{O})}$. For non-integer real values $\alpha \ge 0$, we define $H^\alpha(\mathcal{O})$
by interpolation, see, e.g.,
[lionsmagenes] or [T78]. The spaces of bounded and continuous functions on $\mathcal{O}$ and $\bar{\mathcal{O}}$ are denoted by $C(\mathcal{O})$ and $C(\bar{\mathcal{O}})$, respectively, equipped with the supremum norm $\|\cdot\|_\infty$. For $\beta \in \mathbb{N}$, the space of $\beta$-times differentiable functions on $\mathcal{O}$ with uniformly continuous derivatives is denoted by $C^\beta(\mathcal{O})$. For $\beta > 0$, $\beta \notin \mathbb{N}$, we say $f \in C^\beta(\mathcal{O})$ if for all multi-indices $i$ with $|i| \le \lfloor\beta\rfloor$ (the integer part of $\beta$), $D^i f$ exists and is $(\beta - \lfloor\beta\rfloor)$-Hölder continuous. The norm on $C^\beta(\mathcal{O})$ is
$$\|f\|_{C^\beta(\mathcal{O})} = \sum_{|i| \le \lfloor\beta\rfloor} \|D^i f\|_\infty + \sum_{|i| = \lfloor\beta\rfloor} \sup_{x \ne y} \frac{|D^i f(x) - D^i f(y)|}{|x - y|^{\beta - \lfloor\beta\rfloor}}.$$
We also define the set of smooth functions $C^\infty(\mathcal{O}) = \cap_{\beta} C^\beta(\mathcal{O})$ and its subspace $C_c^\infty(\mathcal{O})$ of functions compactly supported in $\mathcal{O}$.
The previous definitions will be used also for the domain $\mathcal{O}$ replaced by $\mathbb{R}^d$ or by the boundary $\partial\mathcal{O}$. When there is no ambiguity, we omit $\mathcal{O}$ from the notation.
For any normed linear space $(X, \|\cdot\|_X)$, its topological dual space is
$$X^* = \{ L : X \to \mathbb{R} \ \text{linear and continuous} \},$$
which is a Banach space for the norm
$$\|L\|_{X^*} = \sup_{\|x\|_X \le 1} |L(x)|.$$
We need further Sobolev-type spaces to address routine subtleties of the behaviour of functions near $\partial\mathcal{O}$: denote by $H_0^\alpha(\mathcal{O})$ the completion of $C_c^\infty(\mathcal{O})$ for the $H^\alpha(\mathcal{O})$-norm, and let $H^\alpha_{\bar{\mathcal{O}}}$ denote the closed subspace of $H^\alpha(\mathbb{R}^d)$ consisting of functions supported in $\bar{\mathcal{O}}$. We have $H_0^\alpha(\mathcal{O}) = H^\alpha(\mathcal{O})$ unless $\alpha > 1/2$ (Section 4.3.2 in [T78]), and one defines negative order Sobolev spaces $H^{-\alpha}(\mathcal{O})$, $\alpha > 0$, by duality, cf. also Theorem 3.30 in [ML00].
We use the symbols “$\lesssim$” and “$\gtrsim$” for inequalities that hold up to multiplicative constants that are universal, or whose dependence on other constants will be clear from the context. We also use the standard notation $a \vee b = \max(a,b)$ and $a \wedge b = \min(a,b)$ for $a, b \in \mathbb{R}$.
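As a concrete illustration of the Sobolev scale just introduced, consider the simplified periodic model case of the unit interval with the Fourier basis, where the spaces admit an explicit spectral description; this is a heuristic for, not identical to, the boundary-value spaces defined above, and the weight convention below is one of several equivalent choices.

```python
# Spectral computation of Sobolev norms on the periodic unit interval
# (a simplification of the H^alpha spaces defined in the text, which
# live on a bounded domain with boundary).
import numpy as np

def sobolev_norm(f_vals, alpha):
    """||f||_{H^alpha} via Fourier coefficients: (sum_k (1+k^2)^alpha |c_k|^2)^{1/2}."""
    n = len(f_vals)
    c = np.fft.fft(f_vals) / n            # Fourier coefficients c_k
    k = np.fft.fftfreq(n, d=1.0 / n)      # integer frequencies (cycles per unit)
    return np.sqrt(np.sum((1.0 + k**2) ** alpha * np.abs(c) ** 2))

x = np.linspace(0.0, 1.0, 256, endpoint=False)
f = np.sin(2 * np.pi * x)   # Fourier mode with |c_{+1}| = |c_{-1}| = 1/2

# The norm increases with alpha; negative alpha gives the weaker dual norms:
print(sobolev_norm(f, -1), sobolev_norm(f, 0), sobolev_norm(f, 1))
```

For this single Fourier mode the three printed values are $\sqrt{1/8}$, $\sqrt{1/2}$ and $1$, showing how one function's size changes across the scale of spaces.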
2 A convergence rate result for general inverse problems
2.1 Forward map and white noise model
Let $\mathbb{H}$ be a separable Hilbert space with inner product $\langle \cdot, \cdot \rangle_{\mathbb{H}}$. Suppose that
$$\mathscr{G} : \mathcal{F} \to \mathbb{H}$$
is a given ‘forward’ map. For some $F_0 \in \mathcal{F}$, and for scalar ‘noise level’ $\varepsilon > 0$, we observe a realisation of the equation
(2.1) $Y = \mathscr{G}(F_0) + \varepsilon \mathbb{W},$
where $\mathbb{W}$ is a centred Gaussian white noise process indexed by the Hilbert space $\mathbb{H}$ (see p.19-20 in [nicklgine]). Let $E_{F_0}$ denote the expectation operator under the law of $Y$ from (2.1). Observing (2.1) means to observe a realisation of the Gaussian process $\big(\langle Y, \psi \rangle_{\mathbb{H}} : \psi \in \mathbb{H}\big)$ with marginal distributions
$$\langle Y, \psi \rangle_{\mathbb{H}} \sim N\big(\langle \mathscr{G}(F_0), \psi \rangle_{\mathbb{H}},\ \varepsilon^2 \|\psi\|_{\mathbb{H}}^2\big).$$
In the case $\mathbb{H} = L^2(\mathcal{O})$ relevant in Section 3 below, (2.1) can be interpreted as a Gaussian shift experiment in a negative order Sobolev space (see, e.g., [cn1, n17]), and also serves as a theoretically convenient (and, for $\varepsilon = 1/\sqrt{N}$, asymptotically equivalent) continuous surrogate model for observing $\mathscr{G}(F_0)$ in the standard fixed design Gaussian regression model
(2.2) $Y_i = \mathscr{G}(F_0)(x_i) + \varepsilon_i, \quad \varepsilon_i \sim^{iid} N(0,1), \quad i = 1, \dots, N,$
where the $x_i$ are ‘equally spaced’ design points in the domain $\mathcal{O}$ (see [BL96, R08]).
In the discrete model (2.2) the least squares criterion $\sum_i (Y_i - \mathscr{G}(F)(x_i))^2$ can be decomposed as $\sum_i Y_i^2 - 2\sum_i Y_i \mathscr{G}(F)(x_i) + \sum_i \mathscr{G}(F)(x_i)^2$. The first term is independent of $F$ and can be neglected when optimising in $F$. In the continuous model (2.1) we have $\|Y\|_{\mathbb{H}} = \infty$ a.s. (unless $\mathbb{H}$ is finite dimensional), which motivates to define a ‘Tikhonov-regularised’ functional
(2.3) $\mathcal{J}_\lambda(F) = 2\langle Y, \mathscr{G}(F) \rangle_{\mathbb{H}} - \|\mathscr{G}(F)\|_{\mathbb{H}}^2 - \lambda \|F\|_{H^\gamma}^2,$
where $\lambda > 0$ is a regularisation parameter to be chosen, and where we set $\mathcal{J}_\lambda(F) = -\infty$ for $F \notin H^\gamma$. Maximising $\mathcal{J}_\lambda$ thus amounts to minimising the natural least squares fit with a penalty for $F$, and we note that it also corresponds to maximising the penalised log-likelihood function arising from (2.1), see, e.g., [n17], Section 7.4. In all that follows $\|\cdot\|_{H^\gamma}$ could be replaced by any equivalent norm on $H^\gamma$.
2.2 Results
For $F_0$ as in (2.1), $F \in H^\gamma$ and $\lambda > 0$, define the functional
(2.4) $\tau^2(F) \equiv \tau^2_{\lambda, F_0}(F) = \|\mathscr{G}(F) - \mathscr{G}(F_0)\|_{\mathbb{H}}^2 + \lambda \|F\|_{H^\gamma}^2.$
The main result of this section, Theorem 2.2, proves the existence of maximisers $\hat F$ of the functional from (2.3) over suitable subsets and concentration properties for $\tau(\hat F)$, where $F_0$ is the ‘true’ function generating the law of $Y$ from equation (2.1). Note that bounds for $\tau(\hat F)$ simultaneously control the ‘prediction error’ $\|\mathscr{G}(\hat F) - \mathscr{G}(F_0)\|_{\mathbb{H}}$ as well as the regularity of the estimated output $\hat F$.
Theorem 2.2 is proved under a general ‘modulus of continuity’ condition on the map which reads as follows.
Definition 2.1.
Let be nonnegative real numbers and . Set if , and if . A map is called regular if there exists a constant such that for all , we have
(2.5) 
This condition is easily checked for ‘smoothing’ linear maps with , see Remark 2.5 for an example. But (2.5) also allows for certain nonlinearities of on unbounded parameter spaces that will be seen later on to accommodate the forward maps induced by the PDEs (1.2), (1.3). See also Remarks 2.6, 3.10 below.
Theorem 2.2.
Suppose that $\mathscr{G}$ is a regular map in the sense of Definition 2.1, for some integer value of the smoothness parameter there. Let $Y$ be as in (2.1) for some fixed $F_0$. Then the following holds.
1. Let $\mathcal{F}$ be closed for the weak topology of the underlying Hilbert space. Then for all $\lambda > 0$, almost surely under the law of $Y$, there exists a maximiser $\hat F$ of the functional from (2.3) over $\mathcal{F}$, satisfying
(2.6) 
2. Let . There exist constants such that we have for all satisfying
(2.7) 
all , any maximiser of over and any ,
(2.8) 
and also
(2.9) 
Various applications of Theorem 2.2 for specific choices of , , and will be illustrated in the following – besides the main PDE applications from Section 3, see Remarks 2.4, 3.10 and 3.11 as well as Example 2.5 below.
Theorem 2.2 does not necessarily require that the true parameter $F_0$ lie in the optimisation set, as long as $F_0$ can be suitably approximated by some element of it, see Remark 2.4 for an instance of when this is relevant. If $F_0$ does lie in the optimisation set, then we can set the approximating element equal to $F_0$ in the above theorem and obtain the following convergence rates, which are well known to be optimal for smoothing linear forward maps, and which will be seen to be optimal also for the nonlinear inverse problems arising from the PDE models (1.2) and (1.3).
Corollary 2.3.
Under the conditions of Part 2 of Theorem 2.2, there exists such that for , we have that for all , all small enough and any maximiser of over ,
(2.10) 
When images of bounded subsets of the parameter space under a forward map are bounded in a Sobolev space of sufficiently high order, then the bound (2.10) extends (via interpolation and the bounds for $\|\hat F\|_{H^\gamma}$ implied by Theorem 2.2) to stronger norms, which in turn can be used to obtain convergence rates also for the parameter itself by using stability estimates. See the results in Section 3 and also Example 2.5 below for examples.
Remark 2.4 (MAP estimates).
Let be a Gaussian process prior measure for with reproducing kernel Hilbert space (RKHS) and RKHSnorm . Taking note of the form of the likelihood function in the model (2.1) (see, e.g., Section 7.4 in [n17]), maximisers of over with have a formal interpretation as maximum a posteriori (MAP) estimators for the resulting posterior distributions , see also [dashti13, HB15]. For instance, let and consider a linear inverse problem where for and , is a linear map satisfying (2.5) with for all . Then, applying Theorem 2.2 with (so that ) and yields
(2.11) 
for any fixed subdomain such that . Indeed, one easily checks (2.7), and given set , where is such that on and
is the Fourier transform. Then
and in (2.9) yield (2.11). Similar comments apply to nonlinear $\mathscr{G}$, with appropriate choices, see Remark 3.10.
Example 2.5 (Rates for the Radon transform).
Let be the Radon transform, where and , equipped with Lebesgue measure, see p.9 in [N86] for definitions. Then satisfies (2.5) with and any – see p.42 in [N86] and note that our norm is the norm used in [N86] (cf. Theorem 3.30 in [ML00]). Applying Corollary 2.3 with , and implies that for any ,
(2.12) 
Using again the estimates on p.42 in [N86] and that Hölder’s inequality implies
for defined as in [N86], we deduce from (2.12) and Markov’s inequality
with constants uniform in for any . Similarly, if one chooses instead, then the MAP estimate from Remark 2.4 satisfies
uniformly over for .
Remark 2.6 (The effect of nonlinearity).
In the proof of Theorem 2.2 we follow ideas for estimation from [saramest, sara2001], and condition (2.5) is needed to bound the entropy numbers of images of Sobolev balls under , which in turn control the modulus of continuity of the Gaussian process that determines the convergence rate of to . The at most polynomial growth in of the Lipschitz constants
(2.13) 
in (2.5) turns out to be essential in the proof of Theorem 2.2. But even when only a ‘polynomial nonlinearity’ is present, the last term in condition (2.7) can become dominant if the penalisation parameter $\lambda$ is too small. The intuition is that, for nonlinear problems, too little penalisation can mean that the maximisers over unbounded parameter spaces behave erratically, yielding suboptimal convergence rates.
3 Results for elliptic PDE models
In this section, we apply Theorem 2.2 to the inverse problems induced by the PDEs (1.2) and (1.3). We also discuss the implied convergence rates for the parameter $f$ itself.
3.1 Basic setup and link functions
For any integer and any constant , and denoting the outward pointing normal vector at by , define the parameter space
(3.1) 
and its subclasses
We note that the restrictions and on in (3.1) are made only for convenience, and could be replaced by for any fixed satisfying . Moreover, for estimation over parameter spaces without prescribed boundary values for , see Remark 3.11.
We will assume that the coefficients $f$ of the second order linear elliptic partial differential operators featuring in the boundary value problems (1.2) and (1.3), respectively, belong to the spaces from (3.1) for large enough $\alpha$, and denote by
(3.2) $f \mapsto u_f$
the corresponding solution maps. Following (2.1) (with the Hilbert space there equal to $L^2(\mathcal{O})$), we then observe
(3.3) $Y = u_f + \varepsilon \mathbb{W},$
whose law will now be denoted by $P_f$ for $f$ in the parameter space.
We will apply Theorem 2.2 to a suitable bijective reparameterisation of the parameter space for which the set one optimises over is a linear space. This is natural for implementation purposes but also necessary to retain the Bayesian interpretation of our estimators from Remark 2.4. To this end, we introduce ‘link functions’ $\Phi$ – the lowercase and uppercase notation for corresponding functions $f = \Phi \circ F$ and $F$ will be used throughout.
Definition 3.1.
1. A function is called a link function if is a smooth, strictly increasing bijective map satisfying and on .
2. A function , , is called regular if all derivatives of of order are bounded, i.e.
(3.4) 
In the notation of Theorem 2.2, throughout this section we take as parameter space to optimise over the ‘pulled-back’ set of those $F$ for which $\Phi \circ F$ lies in the space from (3.1), and
(3.5) $\mathscr{G}(F) = u_{\Phi \circ F}.$
For as in (3.1), one easily verifies that
where the second equality follows from the characterization of these spaces in Theorem 11.5 of [lionsmagenes]. Given a realisation of (3.3) and a regular link function $\Phi$, we define the generalised Tikhonov-regularised functional,
(3.6) 
Then for all , we have in the notation (2.3), and maximising over is equivalent to maximising over . Any pair of maximisers will be denoted by
The proofs of the theorems which follow are based on an application of Theorem 2.2, after verifying that the map (3.5) satisfies (2.5) with suitable values of the parameters there. The verification of (2.5) is based on PDE estimates that control the modulus of continuity of the solution map (3.2), and on certain analytic properties of the link function $\Phi$. In practice the choice $\Phi = \exp$ is often made (cf. [stuart10]), but our results suggest that the use of a regular link function might be favourable. Indeed, the polynomial growth requirement (2.13) discussed above is not met if one chooses for $\Phi$ the exponential function. Before we proceed, let us give an example of a regular link function.
Example 3.2.
Define the function by , let be a smooth, compactly supported function with , and write for their convolution. It follows from elementary calculations that, for any ,
is a regular link function with range .
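A simple closed-form function sharing the qualitative features required of a regular link function (smooth, strictly increasing, derivatives of positive order uniformly bounded, range bounded below) is a shifted softplus. The sketch below is an illustrative stand-in of our own, not the mollification construction of Example 3.2, and the constant `K_MIN` is a hypothetical lower bound for the coefficient.

```python
# Illustrative stand-in for a regular link function: a shifted softplus.
# Not the mollified construction of Example 3.2; K_MIN is hypothetical.
import numpy as np

K_MIN = 0.5   # hypothetical lower bound for the conductivity

def phi(t):
    """Shifted softplus: smooth, strictly increasing map onto (K_MIN, inf)."""
    return K_MIN + np.logaddexp(0.0, t)   # K_MIN + log(1 + e^t), overflow-safe

t = np.linspace(-20.0, 20.0, 2001)
vals = phi(t)
slopes = np.diff(vals) / np.diff(t)       # finite-difference first derivative

assert np.all(vals > K_MIN)               # range stays above the lower bound
assert np.all(slopes > 0)                 # strictly increasing
assert np.all(slopes < 1.0 + 1e-9)        # first derivative uniformly bounded
```

In contrast, the exponential link has unbounded derivatives of every order, which is exactly the failure of the polynomial growth requirement discussed after Definition 3.1.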
3.2 Divergence form equation
For a given source function $g$, we consider the Dirichlet boundary value problem
(3.7) $\nabla \cdot (f \nabla u) = g \ \text{ on } \mathcal{O}, \qquad u = 0 \ \text{ on } \partial\mathcal{O},$
where $f$ belongs to the parameter space from (3.1), for large enough $\alpha$. The Sobolev embedding then implies that $f \in C^{1,\eta}$ for some $\eta > 0$, and the Schauder theory for elliptic PDEs (Theorem 6.14 in [gt]) then gives that (3.7) has a unique classical solution, which we shall denote by $u_f$.
Upper bounds
For a link function and , define (cf. (2.4))
Theorem 3.3 (Prediction error).
Let the parameter space be given by (3.1) for some integer $\alpha$ and some constant. Let $u_{f_0}$ denote the unique solution of (3.7) with $f = f_0$, and let $Y$ be as in (3.3) for this $f_0$. Moreover, suppose that $\Phi$ is a regular link function and that the criterion functional is given by (3.6), where
Then the following holds.

For each and , almost surely under , there exists a maximiser of over .

For each , , there exist finite constants such that for any maximiser of , all and all ,
(3.8) 
For each , and , there exists a constant such that for any maximiser of with corresponding , for all ,
(3.9)
Lower bounds
We now give a minimax lower bound on the rate of estimation for which matches the bound in (3.9). To facilitate the exposition we only consider the unit ball , set identically on , and fix loss with .
Theorem 3.4.
Observe that (3.10) coincides with the minimax lower bound for estimating $u_{f_0}$ as an unconstrained regression function. Note however that unconstrained ‘off the shelf’ regression function estimators will not satisfy the nonlinear PDE constraint $u = u_f$ for some $f$, thus providing no recovery of the PDE coefficient $f$ itself.
Rates for via stability estimates
For estimators that lie in the range of the forward map, we can resort to ‘stability estimates’ which allow us to control the convergence rate of the estimated coefficient to the true one by the rate of convergence of the corresponding PDE solutions, in appropriate norms. Injectivity and global stability estimates for this problem have been studied in several papers since Richter [R81], see the recent contribution [devore] and the discussion therein. They require additional assumptions, a very common choice being a pointwise sign condition on the source holding throughout $\mathcal{O}$. The usefulness of these estimates depends in possibly subtle ways on the class of coefficients one constrains the problem to. The original stability estimate given in [R81] controls the recovery of $f$ in terms of a norm of the PDE solutions that does not combine well with the convergence rates obtained in Theorem 3.3. The results proved in [devore] are designed for ‘low regularity’ cases: they give at best
(3.11) 
which via Theorem 3.3 would imply a convergence rate of for . For higher regularity relevant here, this can be improved. We prove in Lemma LABEL:lemdivstab below a Lipschitz stability estimate for the map between the spaces and , and combined with Theorem 3.3 this gives the following rate bound for .
Theorem 3.5.
Suppose that , , , , , are as in Theorem 3.3 and assume in addition that on . Let be any maximiser of . Then, for each and , there exists a constant such that we have for all ,
(3.12) 
The rate in Theorem 3.5 is strictly better than what can be obtained from (3.11), or from a direct application of Richter’s stability estimate. A more detailed study of the stability problem, and of the related question of optimal rates for estimating $f$, is beyond the scope of the present paper and will be pursued elsewhere.