Log In Sign Up

Convergence rates for Penalised Least Squares Estimators in PDE-constrained regression problems

by   Richard Nickl, et al.
University of Cambridge
ETH Zurich

We consider PDE constrained nonparametric regression problems in which the parameter f is the unknown coefficient function of a second order elliptic partial differential operator L_f, and the unique solution u_f of the boundary value problem L_fu=g_1 on O, u=g_2 on ∂ O, is observed corrupted by additive Gaussian white noise. Here O is a bounded domain in R^d with smooth boundary ∂ O, and g_1, g_2 are given functions defined on O, ∂ O, respectively. Concrete examples include L_fu=Δ u-2fu (Schrödinger equation with attenuation potential f) and L_fu=div (f∇ u) (divergence form equation with conductivity f). In both cases, the parameter space F={f∈ H^α( O)| f > 0}, α>0, where H^α( O) is the usual order α Sobolev space, induces a set of non-linearly constrained regression functions {u_f: f ∈ F}. We study Tikhonov-type penalised least squares estimators f̂ for f. The penalty functionals are of squared Sobolev-norm type and thus f̂ can also be interpreted as a Bayesian `MAP'-estimator corresponding to some Gaussian process prior. We derive rates of convergence of f̂ and of u_f̂, to f, u_f, respectively. We prove that the rates obtained are minimax-optimal in prediction loss. Our bounds are derived from a general convergence rate result for non-linear inverse problems whose forward map satisfies a mild modulus of continuity condition, a result of independent interest that is applicable also to linear inverse problems, illustrated in an example with the Radon transform.


page 1

page 2

page 3

page 4


On statistical Calderón problems

For D a bounded domain in R^d, d > 3, with smooth boundary ∂ D, the non...

On some information-theoretic aspects of non-linear statistical inverse problems

Results by van der Vaart (1991) from semi-parametric statistics about th...

Stability estimates for the expected utility in Bayesian optimal experimental design

We study stability properties of the expected utility function in Bayesi...

Laplace priors and spatial inhomogeneity in Bayesian inverse problems

Spatially inhomogeneous functions, which may be smooth in some regions a...

Noisy linear inverse problems under convex constraints: Exact risk asymptotics in high dimensions

In the standard Gaussian linear measurement model Y=Xμ_0+ξ∈ℝ^m with a fi...

Noise level free regularisation of general linear inverse problems under unconstrained white noise

In this note we solve a general statistical inverse problem under absenc...

1 Introduction

Observations obeying certain physical laws can often be described by a partial differential equation (PDE). Real world measurements carry statistical noise and thus do not generally exactly exhibit the idealised pattern of the PDE, but it is desirable that recovery of parameters from data is consistent with the PDE structure. In the mathematical literature on inverse problems several algorithms that incorporate such constraints have been proposed, notably optimisation based methods such as Tikhonov regularisation [EHN96, BB18] and maximum a posteriori (MAP) estimates related to Bayesian inversion techniques [stuart10, DS16]. In statistical terminology these methods can often be viewed as penalised least squares estimators over parameter spaces of regression functions that are restricted to lie in the range of some ‘forward operator’ describing the solution map of the PDE. The case where is linear is reasonably well studied in the inverse problems literature, but already in basic elliptic PDE examples, the map is non-linear and the analysis is more delicate. The observation scheme considered here is (a natural continuous analogue of) the standard Gaussian regression model


where are ‘equally spaced’ design points on a bounded domain with smooth boundary . The function is, in our first example, the solution of the elliptic PDE (with denoting the gradient and the divergence operator)


where is a given source function defined on and is an unknown conductivity (or diffusion) coefficient. The second model example arises with solutions of the time-independent Schrödinger equation (with equal to the standard Laplacian operator)


corresponding to the unknown attenuation potential (or reaction coefficient) , and given positive ‘boundary temperatures’ . Both PDEs have a fundamental physical interpretation and feature in many application areas, see, e.g., [EHN96, BHM04, HP08, BU10, stuart10, devore, DS16], and references therein.

When belongs to some Sobolev space for appropriate , unique solutions of the PDEs (1.2), (1.3) exist, and the ‘forward’ map is non-linear. A natural method to estimate is by a penalised least squares approach: one minimises in the squared Euclidean distance

of the observation vector

to the fitted values , and penalises too complex solutions by, for instance, an additive Sobolev norm - type penalty. The crucial constraint can be incorporated by a smooth one-to-one transformation of the penalty function, and a final estimator minimises a criterion function of the form

over , where is a scalar regularisation parameter to be chosen. Both Tikhonov regularisers as well as Bayesian maximum a posteriori (MAP) estimates arising from suitable Gaussian priors fall into this class of estimators. We show in the present paper that suitable choices of give rise to statistically optimal solutions of the above PDE constrained regression problems from data (1.1), in prediction loss. The convergence rates obtained can be combined with ‘stability estimates’ to obtain bounds also for the recovery of the parameter itself.

Our main results are based on a general convergence rate theorem for minimisers over of functionals of the form

in possibly non-linear inverse problems whose forward map satisfies a certain modulus of continuity assumption between Hilbert spaces. This result, which adapts -estimation techniques [sara2001, saramest] to the inverse problems setting, is of independent interest, and provides novel results also for linear forward maps, see Remark 2.5 for an application to Radon transforms.

For sake of conciseness, our theory is given in the Gaussian white noise model introduced in (2.1) below – it serves as an asymptotically equivalent (see [BL96, R08]) continuous analogue of the discrete model (1.1), and facilitates the application of PDE techniques in our proofs. Transferring our results to discrete regression models is possible, but the additional difficulties are mostly of a technical nature and will note be pursued here.

Recovery for non-linear inverse problems such as those mentioned above has been studied initially in the deterministic regularisation literature [EKN89, N92, SEK93, EHN96, TJ02], and the convergence rate theory developed there has been adapted to the statistical regression model (1.1) in [BHM04, BHMR07, HP08, LL10]. These results all assume that a suitable Fréchet derivative of the non-linear forward map exists at the ‘true’ parameter , and moreover require that lies in the range of the adjoint operator of – the so called ‘source condition’. Particularly for the PDE (1.2), such conditions are problematic and do not hold in general for rich enough classes of ’s (such as Sobolev balls) unless one makes very stringent additional model assumptions. Our results circumvent such source conditions. Further remarks, including a discussion of related convergence analysis of estimators obtained from Bayesian inversion techniques [v13, dashti13, n17] can be found in Section 3.4.

1.1 Some preliminaries and basic notation

Throughout, , , denotes a bounded non-empty -domain (an open bounded set with smooth boundary) with closure . The usual space of square integrable functions carries a norm induced by the inner product

where denotes Lebesgue measure. For any multi-index of ‘length’ , let denote the -th (weak) partial derivative operator of order . Then for integer , the usual Sobolev spaces are defined as

normed by . For non-integer real values , we define

by interpolation, see, e.g.,

[lionsmagenes] or [T78].

The spaces of bounded and continuous functions on and are denoted by and , respectively, equipped with the supremum norm . For , the space of -times differentiable functions on with uniformly continuous derivatives is denoted by . For , we say if for all multi-indices with (the integer part of ), exists and is -Hölder continuous. The norm on is

We also define the set of smooth functions as and its subspace of functions compactly supported in .

The previous definitions will be used also for replaced by or . When there is no ambiguity, we omit from the notation.

For any normed linear space its topological dual space is

which is a Banach space for the norm

We need further Sobolev-type spaces to address routine subtleties of the behaviour of functions near : denote by the completion of for the norm, and let denote the closed subspace of consisting of functions supported in . We have unless (Section 4.3.2 in [T78]), and one defines negative order Sobolev spaces , cf. also Theorem 3.30 in [ML00].

We use the symbols “” for inequalities that hold up to multiplicative constants that are universal, or whose dependence on other constants will be clear from the context. We also use the standard notation and for .

2 A convergence rate result for general inverse problems

2.1 Forward map and white noise model

Let be a separable Hilbert space with inner product . Suppose that and that

is a given ‘forward’ map. For some , and for scalar ‘noise level’ , we observe a realisation of the equation


where is a centred Gaussian white noise process indexed by the Hilbert space (see p.19-20 in [nicklgine]). Let denote the expectation operator under the law of from (2.1). Observing (2.1) means to observe a realisation of the Gaussian process with marginal distributions

In the case relevant in Section 3 below, (2.1) can be interpreted as a Gaussian shift experiment in the Sobolev space (see, e.g., [cn1, n17]), and also serves as a theoretically convenient (and, for , as asymptotically equivalent) continuous surrogate model for observing in the standard fixed design Gaussian regression model


where the are ‘equally spaced’ design points in the domain (see [BL96, R08]).

In the discrete model (2.2) the least squares criterion can be decomposed as . The first term is independent of and can be neglected when optimising in . In the continuous model (2.1) we have a.s. (unless dim), which motivates to define a ‘Tikhonov-regularised’ functional


where is a regularisation parameter to be chosen, and where we set for . Maximising thus amounts to minimising the natural least squares fit with a -penalty for , and we note that it also corresponds to maximising the penalised log-likehood function arising from (2.1), see, e.g., [n17], Section 7.4. In all that follows could be replaced by any equivalent norm on .

2.2 Results

For , and , define the functional


The main result of this section, Theorem 2.2, proves the existence of maximisers for over suitable subsets and concentration properties for , where is the ‘true’ function generating the law from equation (2.1). Note that bounds for simultaneously control the ‘prediction error’ as well as the regularity of the estimated output .

Theorem 2.2 is proved under a general ‘modulus of continuity’ condition on the map which reads as follows.

Definition 2.1.

Let be non-negative real numbers and . Set if , and if . A map is called -regular if there exists a constant such that for all , we have


This condition is easily checked for ‘-smoothing’ linear maps with , see Remark 2.5 for an example. But (2.5) also allows for certain non-linearities of on unbounded parameter spaces that will be seen later on to accommodate the forward maps induced by the PDEs (1.2), (1.3). See also Remarks 2.6, 3.10 below.

Theorem 2.2.

Suppose that is a -regular map for some integer . Let from (2.1) for some fixed . Then the following holds.

1. Let be closed for the weak topology of the Hilbert space . Then for all , almost surely under , there exists a maximiser of from (2.3) over , satisfying


2. Let . There exist constants such that we have for all satisfying


all , any maximiser of over and any ,


and also


Various applications of Theorem 2.2 for specific choices of , , and will be illustrated in the following - besides the main PDE applications from Section 3, see Remarks 2.4, 3.10 and 3.11 as well as Example 2.5 below.

Theorem 2.2 does not necessarily require as long as can be suitably approximated by some , see Remark 2.4 for an instance of when this is relevant. If then we can set in the above theorem and obtain the following convergence rates, which are well known to be optimal for -smoothing linear forward maps , and which will be seen to be optimal also for the non-linear inverse problems arising from the PDE models (1.2) and (1.3).

Corollary 2.3.

Under the conditions of Part 2 of Theorem 2.2, there exists such that for , we have that for all , all small enough and any maximizer of over ,


When images of -bounded subsets of under a forward map are bounded in for some , then the -bound (2.10) extends (via interpolation and bounds for implied by Theorem 2.2) to -norms, , which in turn can be used to obtain convergence rates also for by using stability estimates. See the results in Section 3 and also Example 2.5 below for examples.

Remark 2.4 (MAP estimates).

Let be a Gaussian process prior measure for with reproducing kernel Hilbert space (RKHS) and RKHS-norm . Taking note of the form of the likelihood function in the model (2.1) (see, e.g., Section 7.4 in [n17]), maximisers of over with have a formal interpretation as maximum a posteriori (MAP) estimators for the resulting posterior distributions , see also [dashti13, HB15]. For instance, let and consider a linear inverse problem where for and , is a linear map satisfying (2.5) with for all . Then, applying Theorem 2.2 with (so that ) and yields


for any fixed sub-domain such that . Indeed, one easily checks (2.7), and given set , where is such that on and

is the Fourier transform. Then

and in (2.9) yield (2.11). Similar comments apply to non-linear , with appropriate choice of , see Remark 3.10.

Example 2.5 (Rates for the Radon transform).

Let be the Radon transform, where and , equipped with Lebesgue measure, see p.9 in [N86] for definitions. Then satisfies (2.5) with and any – see p.42 in [N86] and note that our -norm is the -norm used in [N86] (cf. Theorem 3.30 in [ML00]). Applying Corollary 2.3 with , and implies that for any ,


Using again the estimates on p.42 in [N86] and that Hölder’s inequality implies

for defined as in [N86], we deduce from (2.12) and Markov’s inequality

with constants uniform in for any . Similarly, if one chooses instead, then the MAP estimate from Remark 2.4 satisfies

uniformly over for .

Remark 2.6 (The effect of nonlinearity).

In the proof of Theorem 2.2 we follow ideas for -estimation from [saramest, sara2001], and condition (2.5) is needed to bound the entropy numbers of images of Sobolev balls under , which in turn control the modulus of continuity of the Gaussian process that determines the convergence rate of to . The at most polynomial growth in of the Lipschitz constants


in (2.5) turns out to be essential in the proof of Theorem 2.2. But even when only a ‘polynomial nonlinearity’ is present (), the last term in the condition (2.7) can become dominant if the penalisation parameter is too small. The intuition is that, for non-linear problems, too little penalisation can mean that the maximisers over unbounded parameter spaces behave erratically, yielding sub-optimal convergence rates.

3 Results for elliptic PDE models

In this section, we apply Theorem 2.2 to the inverse problems induced by the PDEs (1.2) and (1.3). We also discuss the implied convergence rates for the parameter .

3.1 Basic setup and link functions

For any integer and any constant , and denoting the outward pointing normal vector at by , define the parameter space


and its subclasses

We note that the restrictions and on in (3.1) are made only for convenience, and could be replaced by for any fixed satisfying . Moreover, for estimation over parameter spaces without prescribed boundary values for , see Remark 3.11.

We will assume that the coefficient of the second order linear elliptic partial differential operators featuring in the boundary value problems (1.2) and (1.3), respectively, belong to for large enough , and denote by


the corresponding solution maps. Following (2.1) with , we then observe


whose law will now be denoted by for .

We will apply Theorem 2.2 to a suitable bijective re-parameterisation of for which the set one optimises over is a linear space. This is natural for implementation purposes but also necessary to retain the Bayesian interpretation of our estimators from Remark 2.4. To this end, we introduce ‘link functions’ – the lowercase and uppercase notation for corresponding functions and will be used throughout.

Definition 3.1.

1. A function is called a link function if is a smooth, strictly increasing bijective map satisfying and on .

2. A function , , is called regular if all derivatives of of order are bounded, i.e.


In the notation of Theorem 2.2, throughout this section we set , to be the ‘pulled-back’ parameter space, and


For as in (3.1), one easily verifies that

where the second equality follows from the characterization of in Theorem 11.5 of [lionsmagenes]. Given a realisation of (3.3) and a regular link function , we define the generalised Tikhonov regularised functional ,


Then for all , we have in the notation (2.3), and maximising over is equivalent to maximising over . Any pair of maximisers will be denoted by

The proofs of the theorems which follow are based on an application of Theorem 2.2, after verifying that the map (3.5) satisfies (2.5) with and suitable values of . The verification of (2.5) is based on PDE estimates that control the modulus of continuity of the solution map (3.2), and on certain analytic properties of the link function . In practice often the choice is made (cf. [stuart10]), but our results suggest that the use of a regular link function might be favourable. Indeed, the polynomial growth requirement (2.13) discussed above is not met if one chooses for the exponential function. Before we proceed, let us give an example of a regular link function.

Example 3.2.

Define the function by , let be a smooth, compactly supported function with , and write for their convolution. It follows from elementary calculations that, for any ,

is a regular link function with range .

3.2 Divergence form equation

For a given source function , we consider the Dirichlet boundary value problem


where from (3.1), for . Then (LABEL:h-emb) implies for some and the Schauder theory for elliptic PDEs (Theorem 6.14 in [gt]) then gives that (3.7) has a unique classical solution in which we shall denote by .

Upper bounds

For a link function and , define (cf. (2.4))

Theorem 3.3 (Prediction error).

Let be given by (3.1) for some integer and . Let denote the unique solution of (3.7) and let from (3.3) for some . Moreover, suppose that is a regular link function and that is given by (3.6), where

Then the following holds.

  1. For each and , almost surely under , there exists a maximiser of over .

  2. For each , , there exist finite constants such that for any maximiser of , all and all ,

  3. For each , and , there exists a constant such that for any maximiser of with corresponding , for all ,


Lower bounds

We now give a minimax lower bound on the rate of estimation for which matches the bound in (3.9). To facilitate the exposition we only consider the unit ball , set identically on , and fix -loss with .

Theorem 3.4.

For , and on , consider solutions to (3.7). Then there exists such that for all small enough,


where the infimum ranges over all measurable functions of from (3.3) that take values in .

Observe that (3.10) coincides with the lower bound for estimating as an unconstrained regression function in under -loss. Note however that unconstrained ‘off the shelf’ regression function estimators for will not satisfy the non-linear PDE constraint for some , thus providing no recovery of the PDE coefficient itself.

Rates for via stability estimates

For estimators that lie in the range of the forward map , we can resort to ‘stability estimates’ which allow to control the convergence rate of to by the rate of towards , in appropriate norms. Injectivity and global stability estimates for this problem have been studied in several papers since Richter [R81], see the recent contribution [devore] and the discussion therein. They require additional assumptions, a very common choice being that throughout . The usefulness of these estimates depends in possibly subtle ways on the class of ’s one constrains the problem to. The original stability estimate given in [R81] controls in terms of which does not combine well with the - convergence rates obtained in Theorem 3.3. The results proved in [devore] are designed for ‘low regularity’ cases where : they give at best


which via Theorem 3.3 would imply a convergence rate of for . For higher regularity relevant here, this can be improved. We prove in Lemma LABEL:lem-div-stab below a Lipschitz stability estimate for the map between the spaces and , and combined with Theorem 3.3 this gives the following rate bound for .

Theorem 3.5.

Suppose that , , , , , are as in Theorem 3.3 and assume in addition that on . Let be any maximiser of . Then, for each and , there exists a constant such that we have for all ,


The rate in Theorem 3.5 is strictly better than what can be obtained from (3.11), or by estimating by and using Richter’s stability estimate. A more detailed study of the stability problem, and of the related question of optimal rates for estimating , is beyond the scope of the present paper and will be pursued elsewhere.

3.3 Schrödinger equation

We now turn to the Schrödinger equation


with given . By standard results for elliptic PDEs (Theorem 6.14 in [gt]), for from (3.1) with , a unique classical solution to (1.3) exists which lies in