 # Inverse learning in Hilbert scales

We study the linear ill-posed inverse problem with noisy data in the statistical learning setting. Approximate reconstructions from random noisy data are sought with general regularization schemes in Hilbert scales. We discuss the rates of convergence for the regularized solution under the prior assumptions and a certain link condition. We express the error in terms of certain distance functions. For regression functions with smoothness given in terms of source conditions, the error bound can then be established explicitly.

## 1. Introduction

Let $A\colon \mathcal{H} \to \mathcal{H}'$ be a linear injective operator between the infinite-dimensional Hilbert spaces $\mathcal{H}$ and $\mathcal{H}'$ with the inner products $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ and $\langle\cdot,\cdot\rangle_{\mathcal{H}'}$, respectively. Let $\mathcal{H}'$ be the space of functions between a Polish space $X$ and a real separable Hilbert space $Y$. Here we study linear ill-posed operator problems governed by the operator equation

$$A(f) = g, \quad \text{for } f \in \mathcal{H} \text{ and } g \in \mathcal{H}'. \tag{1}$$

We observe noisy values of $g$ at some points, and the foremost objective is to estimate the true solution $f$. The problem of interest can be described as follows: Given data $\{(x_i, y_i)\}_{i=1}^m$ under the model

$$y_i = g(x_i) + \varepsilon_i, \quad i = 1, \dots, m, \tag{2}$$

where $\varepsilon_i$ is the observational noise and $m$ denotes the sample size, determine (approximately) the underlying element $f$ with $g = A(f)$ being the regression function.

For classical inverse problems, the observational noise is assumed to be deterministic. Here we assume that the random observations $\{(x_i, y_i)\}_{i=1}^m$ are independent and follow some unknown probability distribution $\rho$, defined on the sample space $Z = X \times Y$, and hence we are in the context of statistical inverse problems.

The reconstruction of the unknown true solution will be based on spectral regularization schemes. Various schemes can be used to stably estimate the true solution. Tikhonov regularization is widely considered in the literature. This scheme consists of an error term measuring the fit to the data and a penalty term controlling the complexity of the reconstruction. In this study we enforce smoothness of the approximated solution by introducing an unbounded, linear, self-adjoint, strictly positive operator $L$ with a dense domain of definition $\mathcal{D}(L) \subset \mathcal{H}$, and then we define the Tikhonov regularization scheme in Hilbert scales as follows:

$$f_{\mathbf{z},\lambda} = \operatorname*{argmin}_{f \in \mathcal{D}(L)} \left\{ \frac{1}{m} \sum_{i=1}^m \| A(f)(x_i) - y_i \|_Y^2 + \lambda \| L f \|_{\mathcal{H}}^2 \right\}, \tag{3}$$

where $\lambda$ is a positive regularization parameter and the operator $L$ influences the properties of the approximated solution. Standard Tikhonov regularization corresponds to $L = I$, the identity mapping. In many practical problems, the operator $L$ is chosen to be a differential operator in some appropriate function spaces, e.g., $L^2$-spaces.

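Notice from (3) that the reconstruction $f_{\mathbf{z},\lambda}$ belongs to $\mathcal{D}(L)$, such that formally we may introduce $u_{\mathbf{z},\lambda} := L f_{\mathbf{z},\lambda} \in \mathcal{H}$. In the regular case, when $f_\rho \in \mathcal{D}(L)$, we let $u_\rho := L f_\rho \in \mathcal{H}$. With this notation we can rewrite (1) as

$$g = A f = A L^{-1} u, \quad u \in \mathcal{D}(L).$$

Also, the Tikhonov minimization problem would reduce to the standard one

$$u_{\mathbf{z},\lambda} = \operatorname*{argmin}_{u \in \mathcal{H}} \left\{ \frac{1}{m} \sum_{i=1}^m \| (A L^{-1} u)(x_i) - y_i \|_Y^2 + \lambda \| u \|_{\mathcal{H}}^2 \right\},$$

albeit for a different operator $A L^{-1}$. Accordingly, the error bounds relate as

$$\| f_\rho - f_{\mathbf{z},\lambda} \|_{\mathcal{H}} = \| L^{-1} (u_\rho - u_{\mathbf{z},\lambda}) \|_{\mathcal{H}}.$$

Therefore, error bounds for $u_\rho - u_{\mathbf{z},\lambda}$ in the weak norm, in $\mathcal{H}_{-1}$, yield bounds for $f_\rho - f_{\mathbf{z},\lambda}$. The latter bounds are not known from previous studies. We are also interested in the oversmoothing case, when $f_\rho \notin \mathcal{D}(L)$, for which we provide a detailed error analysis here. However, the above relation will implicitly be utilized in the subsequent proofs.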

We review the literature related to the considered problem. Regularization schemes in Hilbert scales are widely considered in classical inverse problems (with deterministic noise), starting from F. Natterer and continued in [9, 18, 20, 21, 23, 24, 25, 27, 31]. G. Blanchard and N. Mücke considered general regularization schemes for linear inverse problems in statistical learning and provided (upper and lower) rates of convergence under Hölder type source conditions. Here we consider general (spectral) regularization schemes in Hilbert scales for statistical inverse problems. We discuss rates of convergence for general regularization under certain noise conditions, approximate source conditions, and a specific link condition between the operator $A$, governing the equation (1), and the smoothness-promoting operator $L$ as used, e.g., in (3). We study error estimates by using the concept of reproducing kernel Hilbert spaces. The concept of the effective dimension plays an important role in the convergence analysis.

The key points of our results can be described as follows:

• We do not restrict ourselves to white or coloured Hilbertian noise. We consider general centered noise, satisfying certain moment conditions, see Assumption 3.

• We consider general regularization schemes in Hilbert scales. It is well known that Tikhonov regularization suffers from the saturation effect. In contrast, this saturation is delayed for Tikhonov regularization in Hilbert scales.

• The analysis uses the concept of link conditions, see Assumption 4, required to transfer information in terms of properties of the operator $L$ to the covariance operator.

• We analyze the regular case, i.e., when the true solution belongs to the domain of the operator $L$.

• We also focus on the oversmoothing case, when the true solution does not belong to the domain of the operator $L$.

The paper is organized as follows. The basic definitions, assumptions, and notation required in our analysis are presented in Section 2. In Section 3 we discuss bounds of the reconstruction error in the direct learning setting and the inverse problem setting by means of distance functions. This section comprises two main results: the first is devoted to convergence rates in the oversmoothing case, while the second focuses on the regular case. When specifying smoothness in terms of source conditions we can bound the distance functions, and this gives rise to convergence rates in terms of the sample size $m$. This program is carried out in Section 4. In case both the smoothness and the link condition are of power type, we establish the optimality of the obtained error bounds in the regular case in Section 5. In the Appendix, we present probabilistic estimates which provide the tools to obtain the error bounds.

## 2. Notation and Assumptions

In this section, we introduce some basic concepts, definitions, notation, and assumptions required in our analysis.

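We assume that $X$ is a Polish space, therefore the probability distribution $\rho$ allows for a disintegration as

$$\rho(x, y) = \rho(y|x)\, \nu(x),$$

where $\rho(y|x)$ is the conditional probability distribution of $y$ given $x$, and $\nu(x)$ is the marginal probability distribution. We consider random observations $\{(x_i, y_i)\}_{i=1}^m$ which follow the model $y = A(f)(x) + \varepsilon$ with centered noise $\varepsilon$. We assume throughout the paper that the operator $A$ is injective.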

###### Assumption 1 (The true solution).

The conditional expectation w.r.t. $\rho$ of $y$ given $x$ exists (a.s.), and there exists $f_\rho \in \mathcal{H}$ such that

$$\int_Y y \, d\rho(y|x) = g_\rho(x) = A(f_\rho)(x), \quad \text{for all } x \in X.$$

The element $f_\rho$ is the true solution which we aim at estimating.

###### Assumption 2 (Noise condition).

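There exist some constants $M, \Sigma$ such that for almost all $x \in X$,

$$\int_Y \left( e^{\| y - A(f_\rho)(x) \|_Y / M} - \frac{\| y - A(f_\rho)(x) \|_Y}{M} - 1 \right) d\rho(y|x) \le \frac{\Sigma^2}{2 M^2}.$$

This assumption is usually referred to as a Bernstein-type assumption.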

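We return to the unbounded operator $L$. By spectral theory, the operator $L^s\colon \mathcal{D}(L^s) \to \mathcal{H}$ is well-defined for $s \in \mathbb{R}$, and the spaces $\mathcal{H}_s := \mathcal{D}(L^s)$, $s \ge 0$, equipped with the inner product $\langle f, g \rangle_{\mathcal{H}_s} := \langle L^s f, L^s g \rangle_{\mathcal{H}}$ are Hilbert spaces. For $s < 0$, the space $\mathcal{H}_s$ is defined as the completion of $\mathcal{H}$ under the norm $\| f \|_{\mathcal{H}_s} := \langle f, f \rangle_{\mathcal{H}_s}^{1/2}$. The family $(\mathcal{H}_s)_{s \in \mathbb{R}}$ is called the Hilbert scale induced by $L$. The following interpolation inequality is an important tool for the analysis:

$$\| f \|_{\mathcal{H}_r} \le \| f \|_{\mathcal{H}_t}^{\frac{s-r}{s-t}} \, \| f \|_{\mathcal{H}_s}^{\frac{r-t}{s-t}}, \quad f \in \mathcal{H}_s, \tag{4}$$

which holds for any $t < r < s$ [11, Chapt. 8].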
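
The interpolation inequality (4) can be checked numerically in a finite-dimensional stand-in for the Hilbert scale. The sketch below is only an illustration under assumptions that are not part of the paper: $L$ is taken to be diagonal with hypothetical eigenvalues $1, 2, \dots, 50$, so that $\|f\|_{\mathcal{H}_s}^2 = \sum_j \ell_j^{2s} f_j^2$, and the helper names `scale_norm` and `interpolation_gap` are made up for this example.

```python
import numpy as np

def scale_norm(f, eigvals, s):
    """Norm ||L^s f|| for a diagonal operator L with the given eigenvalues."""
    return float(np.sqrt(np.sum(eigvals ** (2.0 * s) * f ** 2)))

def interpolation_gap(f, eigvals, t, r, s):
    """Right-hand side minus left-hand side of the interpolation inequality."""
    assert t < r < s
    theta = (s - r) / (s - t)          # exponent of the H_t norm
    lhs = scale_norm(f, eigvals, r)
    rhs = (scale_norm(f, eigvals, t) ** theta
           * scale_norm(f, eigvals, s) ** (1.0 - theta))
    return rhs - lhs

rng = np.random.default_rng(0)
eigvals = np.arange(1, 51, dtype=float)        # hypothetical spectrum of L
f = rng.standard_normal(50) / eigvals ** 2     # hypothetical element of H_s
gap = interpolation_gap(f, eigvals, t=-1.0, r=0.5, s=2.0)
print(gap >= 0.0)
```

In this diagonal setting the inequality is an instance of Hölder's inequality, so the gap is nonnegative for every choice of `f`.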

### 2.1. Reproducing Kernel Hilbert space and related operators

We start with the concept of reproducing kernel Hilbert spaces. Such a space is a subspace of $L^2(X, \nu; Y)$ (the space of square-integrable functions from $X$ to $Y$ with respect to the probability distribution $\nu$) which can be characterized by a symmetric, positive semi-definite kernel, and each of its functions satisfies the reproducing property. Here we discuss vector-valued reproducing kernel Hilbert spaces, which are the generalization of real-valued reproducing kernel Hilbert spaces.

###### Definition 2.1 (Vector-valued reproducing kernel Hilbert space).

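For a non-empty set $X$ and a real separable Hilbert space $Y$, a Hilbert space $\mathcal{H}$ of functions from $X$ to $Y$ is said to be a vector-valued reproducing kernel Hilbert space if the linear functional $F_{x,y}\colon \mathcal{H} \to \mathbb{R}$, defined by

$$F_{x,y}(f) = \langle y, f(x) \rangle_Y \quad \forall f \in \mathcal{H},$$

is continuous for every $x \in X$ and $y \in Y$.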

###### Definition 2.2 (Operator-valued positive semi-definite kernel).

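Suppose $\mathcal{L}(Y)$ is the Banach space of bounded linear operators on $Y$. A function $K\colon X \times X \to \mathcal{L}(Y)$ is said to be an operator-valued positive semi-definite kernel if

1. $K(x, x')^* = K(x', x)$ for all $x, x' \in X$, and

2. $\sum_{i,j=1}^N \langle y_i, K(x_i, x_j) y_j \rangle_Y \ge 0$ for all finite sets $\{x_1, \dots, x_N\} \subset X$ and $\{y_1, \dots, y_N\} \subset Y$.

For a given operator-valued positive semi-definite kernel $K\colon X \times X \to \mathcal{L}(Y)$, we can construct a unique vector-valued reproducing kernel Hilbert space $\mathcal{H}$ of functions from $X$ to $Y$ as follows: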

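1. We define the linear function

$$K_x\colon Y \to \mathcal{H}\colon y \mapsto K_x y,$$

where $(K_x y)(x') = K(x', x)\, y$ for $x, x' \in X$ and $y \in Y$.

2. The span of the set $\{K_x y : x \in X,\ y \in Y\}$ is dense in $\mathcal{H}$.

3. Reproducing property:

$$\langle f(x), y \rangle_Y = \langle f, K_x y \rangle_{\mathcal{H}}, \quad x \in X,\ y \in Y,\ \forall f \in \mathcal{H},$$

in other words $f(x) = K_x^* f$.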

Moreover, there is a one-to-one correspondence between operator-valued positive semi-definite kernels and vector-valued reproducing kernel Hilbert spaces, see .

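We make the following assumption concerning the Hilbert space $\mathcal{H}'$:

###### Assumption 3.

The space $\mathcal{H}'$ is assumed to be a vector-valued reproducing kernel Hilbert space of functions $f\colon X \to Y$ corresponding to the kernel $K\colon X \times X \to \mathcal{L}(Y)$ such that

1. $K_x\colon Y \to \mathcal{H}'$ is a Hilbert-Schmidt operator for $x \in X$ with

$$\kappa'^2 := \sup_{x \in X} \| K_x \|_{HS}^2 = \sup_{x \in X} \operatorname{tr}(K_x^* K_x) < \infty.$$

2. For $y, y' \in Y$, the real-valued function $(x, x') \mapsto \langle K_x y, K_{x'} y' \rangle_{\mathcal{H}'}$ is measurable.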

###### Example 2.3.

In case the set $Y$ is a bounded subset of $\mathbb{R}$, the reproducing kernel Hilbert space becomes a real-valued reproducing kernel Hilbert space. The corresponding kernel becomes the symmetric, positive semi-definite function $K\colon X \times X \to \mathbb{R}$ with the reproducing property $f(x) = \langle f, K_x \rangle_{\mathcal{H}}$. Also, in this case Assumption 3 simplifies to the condition that the kernel is measurable and $\kappa'^2 := \sup_{x \in X} K(x, x) < \infty$.

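Now we introduce some relevant operators used in the convergence analysis. We introduce the notation $\mathbf{x} = (x_1, \dots, x_m)$, $\mathbf{y} = (y_1, \dots, y_m)$ and $\mathbf{z} = ((x_1, y_1), \dots, (x_m, y_m))$ for the vectors. The product Hilbert space $Y^m$ is equipped with the inner product $\langle \mathbf{y}, \mathbf{y}' \rangle_m = \frac{1}{m} \sum_{i=1}^m \langle y_i, y_i' \rangle_Y$ and the corresponding norm $\| \mathbf{y} \|_m^2 = \frac{1}{m} \sum_{i=1}^m \| y_i \|_Y^2$. We define the sampling operator $S_{\mathbf{x}}\colon \mathcal{H}' \to Y^m\colon g \mapsto (g(x_1), \dots, g(x_m))$; then the adjoint $S_{\mathbf{x}}^*\colon Y^m \to \mathcal{H}'$ is given by

$$S_{\mathbf{x}}^* \mathbf{y} = \frac{1}{m} \sum_{i=1}^m K_{x_i} y_i.$$

Let $I_\nu$ denote the canonical injection map $\mathcal{H}' \to L^2(X, \nu; Y)$. Then we observe that, under Assumption 3, both the operators $S_{\mathbf{x}}$ and $I_\nu$ are bounded by $\kappa'$, since

$$\| I_\nu f \|_{L^2(X,\nu;Y)}^2 = \int_X \| f(x) \|_Y^2 \, d\nu(x) = \int_X \| K_x^* f \|_Y^2 \, d\nu(x) \le \kappa'^2 \| f \|_{\mathcal{H}}^2$$

and

$$\| S_{\mathbf{x}} f \|_m^2 = \frac{1}{m} \sum_{i=1}^m \| f(x_i) \|_Y^2 = \frac{1}{m} \sum_{i=1}^m \| K_{x_i}^* f \|_Y^2 \le \kappa'^2 \| f \|_{\mathcal{H}}^2.$$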

We denote the population operators , and their empirical versions . The operators  are positive, self-adjoint and depend on the kernel. Under Assumption 3, the operators  are bounded by  and the operators  are bounded by  for , i.e.,  and .

In the subsequent analysis, we shall derive convergence rates by using approximate source conditions, which are related to a certain benchmark smoothness. This benchmark smoothness is determined by the user. In order to have handy arguments to derive the convergence rates, we shall fix an (integer) power . We shall use a link condition to transfer smoothness in terms of the operator L to the covariance operator . This link condition will involve an index function.

###### Definition 2.4 (Index function).

A function $\varphi\colon \mathbb{R}^+ \to \mathbb{R}^+$ is said to be an index function if it is continuous and strictly increasing with $\varphi(0) = 0$.

An index function $\varphi$ is called sub-linear whenever the mapping $t \mapsto t / \varphi(t)$ is nondecreasing. Further, we require this index function to belong to the following class of functions.

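$$\mathcal{F} = \left\{ \varphi = \varphi_1 \varphi_2 : \varphi_1, \varphi_2\colon [0, \kappa^2] \to [0, \infty),\ \varphi_1 \text{ nondecreasing continuous sub-linear},\ \varphi_2 \text{ nondecreasing Lipschitz},\ \varphi_1(0) = \varphi_2(0) = 0 \right\}. \tag{5}$$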

The representation $\varphi = \varphi_1 \varphi_2$ is not unique, therefore $\varphi_2$ can be assumed to be a Lipschitz function with Lipschitz constant $1$. Now we state an important result needed in our analysis [28, Corollary 1.2.2]:

$$\| \varphi_2(T_{\mathbf{x}}) - \varphi_2(T_\nu) \|_{HS} \le \| T_{\mathbf{x}} - T_\nu \|_{HS}.$$

###### Example 2.5.

The polynomial function , and the logarithm function  are examples of functions in the class .

###### Assumption 4 (Link condition).

There exist a power  and an index function , for which the function  is sub-linear. There are constants  such that

The function  belongs to the class .

As shown in , Assumption 4 implies the range identity . In the context of a comparison of operators we mention the well-known Heinz Inequality, see [11, Prop. 8.21], which asserts that a comparison , for non-negative self-adjoint operators  yields for every exponent  that . Applying this to the above link condition we obtain the following:

###### Proposition 2.6.

Under Assumption 4 we have

 ∥∥L−1u∥∥H≤∥ϱ(Tν)u∥H≤β∥∥L−1u∥∥H,u∈H and

Moreover, we have that

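$$\left\| \varrho(T_\nu) (T_\nu + \lambda I)^{-1/2} \right\|_{\mathcal{L}(\mathcal{H})} \le \frac{\varrho(\lambda)}{\sqrt{\lambda}}, \quad 0 < \lambda \le 1. \tag{6}$$

###### Proof.

The first assertions are a consequence of the Heinz Inequality. For the last one, we argue as follows. Since $\varrho^2$ is assumed to be sub-linear, we find that

$$\left\| \varrho(T_\nu) (T_\nu + \lambda I)^{-1/2} \right\|_{\mathcal{L}(\mathcal{H})} = \frac{1}{\sqrt{\lambda}} \left\| \varrho(T_\nu) \left( \lambda (T_\nu + \lambda I)^{-1} \right)^{1/2} \right\|_{\mathcal{L}(\mathcal{H})} \le \frac{1}{\sqrt{\lambda}} \left\| \varrho^2(T_\nu) \left( \lambda (T_\nu + \lambda I)^{-1} \right) \right\|_{\mathcal{L}(\mathcal{H})}^{1/2} \le \frac{\varrho(\lambda)}{\sqrt{\lambda}},$$

which completes the proof. ∎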
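
The bound (6) can be illustrated numerically. The following is a minimal sketch under assumptions that go beyond the text: the spectrum of $T_\nu$ is replaced by a hypothetical grid in $[0, 1]$, and the index function is the power function $\varrho(t) = t^{a}$ with $a = 1/4$, chosen so that $t \mapsto t/\varrho^2(t)$ is nondecreasing. For a self-adjoint operator the norm on the left of (6) is then the supremum of $\varrho(t)/\sqrt{t + \lambda}$ over the spectrum.

```python
import numpy as np

def lhs_of_bound(rho, lam, spectrum):
    """sup over the spectrum of rho(t) / sqrt(t + lam)."""
    return float(np.max(rho(spectrum) / np.sqrt(spectrum + lam)))

a = 0.25                                   # hypothetical exponent, a <= 1/2
rho = lambda t: t ** a                     # power-type index function
spectrum = np.linspace(0.0, 1.0, 10001)    # hypothetical spectrum of T_nu
lams = [1e-3, 1e-2, 1e-1, 1.0]

# Ratio of the left-hand side of (6) to rho(lam) / sqrt(lam); should be <= 1.
ratios = [lhs_of_bound(rho, lam, spectrum) / (rho(lam) / np.sqrt(lam))
          for lam in lams]
print(ratios)
```

For this particular choice the supremum is attained near $t = \lambda$, so the ratios stay strictly below one.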

###### Remark 2.7.

From the assertion, it is heuristically clear that the function

cannot increase faster than linearly, because the operator  has  in it. More details will be given in Section 5.

Link conditions as in Assumption 4 imply decay rates for the singular numbers of the operators, known as Weyl’s Monotonicity Theorem [4, Cor. III.2.3]. In our case, this yields that . For classical spaces, as e.g. Sobolev spaces, when , then  (one spatial dimension). For the above index function  this means that .

###### Example 2.8 (Finitely smoothing).

In case that the function , and hence its inverse is of power type then this implies a power type decay of the singular numbers of . In this case, the operator  is called finitely smoothing.

###### Example 2.9 (Infinitely smoothing).

If, on the other hand, the function  is logarithmic, as e.g., , then . In this case, the operator  is called infinitely smoothing.

### 2.3. Effective dimension

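Now we introduce the concept of the effective dimension, which is an important ingredient in deriving rates of convergence under Hölder source conditions [7, 10, 12] and general source conditions [16, 29]. The effective dimension for the trace-class operator $T_\nu$ is defined as

$$\mathcal{N}_{T_\nu}(\lambda) := \operatorname{Tr}\left( (T_\nu + \lambda I)^{-1} T_\nu \right), \quad \text{for } \lambda > 0.$$

It is known that the function $\lambda \mapsto \mathcal{N}_{T_\nu}(\lambda)$ is continuous and decreasing from $\infty$ to zero for $0 < \lambda < \infty$ for an infinite-dimensional operator $T_\nu$ (see [5, 8, 15, 16, 32] for details).

The integral operator $T_\nu$ is a trace-class operator, hence the effective dimension is finite, and we have that

$$\mathcal{N}_{T_\nu}(\lambda) \le \left\| (T_\nu + \lambda I)^{-1} \right\|_{\mathcal{L}(\mathcal{H})} \operatorname{Tr}(T_\nu) \le \frac{\kappa^2}{\lambda}.$$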
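
To make the effective dimension concrete, here is a small sketch that evaluates it from a list of eigenvalues. The spectrum $t_j = j^{-2}$ is a hypothetical example, not taken from the paper, and the function name `effective_dimension` is introduced only for illustration. The trace of the truncated spectrum plays the role of the constant $\kappa^2$ in the bound above.

```python
import numpy as np

def effective_dimension(eigvals, lam):
    """Tr((T + lam I)^{-1} T) for a positive operator with these eigenvalues."""
    eigvals = np.asarray(eigvals, dtype=float)
    return float(np.sum(eigvals / (eigvals + lam)))

eigvals = 1.0 / np.arange(1, 2001) ** 2    # hypothetical spectrum t_j = j^{-2}
trace = float(eigvals.sum())               # stands in for the bound on Tr(T)
lams = [1e-1, 1e-2, 1e-3]
dims = [effective_dimension(eigvals, lam) for lam in lams]

for lam, dim in zip(lams, dims):
    # The effective dimension grows as lam decreases but stays below trace/lam.
    print(lam, dim, trace / lam)
```

For a polynomially decaying spectrum like this one, the effective dimension grows much more slowly than the crude bound `trace / lam`, which is why it yields sharper rates.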

In the subsequent analysis, we shall need a relationship between the effective dimensions $\mathcal{N}_{T_\nu}$ and $\mathcal{N}_{L_\nu}$. For this, the link condition (Assumption 4) is crucial. The arguments will be based on operator monotonicity and concavity. Below, $s_j(\cdot)$, $j = 1, 2, \dots$, denotes the singular numbers of the operator in its argument.

The following assumption was introduced in . There, it was shown that it is satisfied for both moderately ill-posed and severely ill-posed operators.

###### Assumption 5.

There exists a constant  such that for  we have

 t−1∑sj(Lν)

The relation between the effective dimensions is established in the following proposition, whose proof is given in Appendix A.

###### Proposition 2.10.

Suppose Assumptions 4 and 5 hold true. Suppose the function  from the link condition is such that the function  is operator concave, and that there is some  for which the function  is concave. Then, there is  for which we have that

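$$\mathcal{N}_{L_\nu}\left( \frac{\lambda}{\varrho^2(\lambda)} \right) \le 2 \beta^{n+1} \tilde{C}\, \mathcal{N}_{T_\nu}(\lambda), \quad 0 < \lambda \le \| T_\nu \|_{\mathcal{L}(\mathcal{H})}.$$

###### Remark 2.11.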

For a power type function  the above concavity assumptions hold true whenever  and . In particular the number  is uniquely determined.

### 2.4. Regularization Schemes

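General regularization schemes were introduced and discussed in ill-posed inverse problems and learning theory (see [17, Section 2.2] and [2, Section 3.1] for a brief discussion). By using the notation from § 2.1, the Tikhonov regularization scheme from (3) can be re-expressed as follows:

$$f_{\mathbf{z},\lambda} = \operatorname*{argmin}_{f \in \mathcal{D}(L)} \left\{ \| S_{\mathbf{x}} A f - \mathbf{y} \|_m^2 + \lambda \| L f \|_{\mathcal{H}}^2 \right\},$$

and its minimizer is given by

$$f_{\mathbf{z},\lambda} = L^{-1} (T_{\mathbf{x}} + \lambda I)^{-1} B_{\mathbf{x}}^* \mathbf{y}.$$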
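
The closed form of the minimizer can be sanity-checked in a finite-dimensional setting. The sketch below rests on assumptions not made in the paper: the sampled operator $S_{\mathbf{x}} A$ is replaced by a hypothetical random $m \times n$ matrix `A`, $L$ by a hypothetical diagonal positive matrix, and $B_{\mathbf{x}} = S_{\mathbf{x}} A L^{-1}$, $T_{\mathbf{x}} = B_{\mathbf{x}}^* B_{\mathbf{x}}$ are formed with the $1/m$-weighted inner product on $Y^m$. It compares the normal equations of the penalized least-squares problem with the formula $L^{-1}(T_{\mathbf{x}} + \lambda I)^{-1} B_{\mathbf{x}}^* \mathbf{y}$.

```python
import numpy as np

def tikhonov_direct(A, L, y, lam):
    """Minimize (1/m)||A f - y||^2 + lam ||L f||^2 via the normal equations."""
    m = A.shape[0]
    return np.linalg.solve(A.T @ A / m + lam * (L.T @ L), A.T @ y / m)

def tikhonov_via_substitution(A, L, y, lam):
    """Evaluate L^{-1} (T + lam I)^{-1} B^* y with B = A L^{-1}, T = B^* B."""
    m, n = A.shape
    B = A @ np.linalg.inv(L)
    T = B.T @ B / m                     # adjoint w.r.t. the 1/m-weighted norm
    u = np.linalg.solve(T + lam * np.eye(n), B.T @ y / m)
    return np.linalg.solve(L, u)

rng = np.random.default_rng(1)
m, n, lam = 50, 20, 1e-2
A = rng.standard_normal((m, n))                 # hypothetical sampled operator
L = np.diag(np.arange(1, n + 1, dtype=float))   # hypothetical smoothing operator
y = rng.standard_normal(m)

f_direct = tikhonov_direct(A, L, y, lam)
f_subst = tikhonov_via_substitution(A, L, y, lam)
print(np.allclose(f_direct, f_subst))
```

Both routes solve the same linear system after the substitution $u = L f$, which is why they agree up to rounding.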

We consider the following definition.

###### Definition 2.12 (General regularization).

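We say that a family of functions $g_\lambda\colon [0, \kappa^2] \to \mathbb{R}$, $\lambda > 0$, is a regularization scheme if there exist constants $D, B, \gamma$ such that

• $\sup_{t \in [0, \kappa^2]} | t\, g_\lambda(t) | \le D$.

• $\sup_{t \in [0, \kappa^2]} | g_\lambda(t) | \le B / \lambda$.

• $\sup_{t \in [0, \kappa^2]} | r_\lambda(t) | \le \gamma$ for the residual function $r_\lambda(t) = 1 - t\, g_\lambda(t)$.

• For some constant $\gamma_p$ (independent of $\lambda$), the maximal $p$ satisfying the condition

$$\sup_{t \in [0, \kappa^2]} | r_\lambda(t) |\, t^p \le \gamma_p \lambda^p$$

is said to be the qualification of the regularization scheme $g_\lambda$.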
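
The conditions in this definition can be explored numerically for two familiar filters. This is a sketch under stated assumptions: $\kappa^2 = 1$, $\lambda = 0.05$, the Tikhonov filter $g_\lambda(t) = 1/(t + \lambda)$ and the spectral cut-off filter $g_\lambda(t) = t^{-1}$ for $t \ge \lambda$ (zero otherwise) are written in their standard textbook form, and all function names are hypothetical.

```python
import numpy as np

def g_tikhonov(t, lam):
    return 1.0 / (t + lam)

def g_cutoff(t, lam):
    # 1/t on [lam, kappa^2], zero below lam (np.maximum avoids division by 0).
    return np.where(t >= lam, 1.0 / np.maximum(t, lam), 0.0)

def residual(g, t, lam):
    return 1.0 - t * g(t, lam)

def qualification_ratio(g, t, lam, p):
    """sup_t |r_lam(t)| t^p / lam^p; bounded in lam iff p is admissible."""
    return float(np.max(np.abs(residual(g, t, lam)) * t ** p) / lam ** p)

t = np.linspace(0.0, 1.0, 100001)   # grid on [0, kappa^2] with kappa^2 = 1
lam = 0.05

sup_tg = float(np.max(np.abs(t * g_tikhonov(t, lam))))      # candidate for D
sup_g = float(np.max(np.abs(g_tikhonov(t, lam)))) * lam     # candidate for B
tik_p1 = qualification_ratio(g_tikhonov, t, lam, p=1)
tik_p2 = qualification_ratio(g_tikhonov, t, lam, p=2)
cut_p2 = qualification_ratio(g_cutoff, t, lam, p=2)
print(sup_tg, sup_g, tik_p1, tik_p2, cut_p2)
```

On this grid the ratio for Tikhonov regularization stays below one for $p = 1$ but becomes large for $p = 2$, reflecting the saturation effect, whereas the cut-off filter remains bounded for $p = 2$ as well.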

###### Definition 2.13.

The qualification $p$ covers the index function $\varphi$ if the function $t \mapsto t^p / \varphi(t)$ is nondecreasing.

We mention the following result.

###### Proposition 2.14.

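Suppose $\varphi$ is a nondecreasing index function and the qualification, say $p$, of the regularization $g_\lambda$ covers $\varphi$. Then

$$\sup_{\sigma \in [0, \kappa^2]} | r_\lambda(\sigma) |\, \varphi(\sigma) \le c_p\, \varphi(\lambda), \quad c_p = \max(\gamma, \gamma_p).$$

Also, we have that

$$\sup_{\sigma \in [0, \kappa^2]} | r_\lambda(\sigma) |\, \varphi(\lambda + \sigma) \le 2^p c_p\, \varphi(\lambda).$$

###### Proof.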

The first assertion is a restatement of [19, Proposition 3]. For the second assertion, we stress that , which follows from convexity. This yields

which implies the second assertion and completes the proof. ∎

Essentially all the linear regularization schemes (Tikhonov regularization, Landweber iteration or spectral cut-off) satisfy the properties of general regularization. Inspired by the representation for the minimizer of the Tikhonov functional we consider a general regularized solution in Hilbert scales corresponding to the above regularization in the form

$$f_{\mathbf{z},\lambda} = L^{-1} g_\lambda(T_{\mathbf{x}}) B_{\mathbf{x}}^* \mathbf{y}. \tag{7}$$
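
A finite-dimensional sketch of the estimator (7) applies the filter $g_\lambda$ to $T_{\mathbf{x}}$ through an eigendecomposition. As assumptions for illustration only, the sampled operator $S_{\mathbf{x}} A$ is a hypothetical random matrix `A`, $L$ is a hypothetical diagonal matrix, $B_{\mathbf{x}} = S_{\mathbf{x}} A L^{-1}$ with adjoints taken in the $1/m$-weighted inner product, and the function names are made up.

```python
import numpy as np

def regularized_solution(A, L, y, lam, g):
    """Evaluate L^{-1} g_lam(T) B^* y with B = A L^{-1} and T = B^* B."""
    m, _ = A.shape
    B = A @ np.linalg.inv(L)
    T = B.T @ B / m
    evals, evecs = np.linalg.eigh(T)        # T is symmetric positive semi-definite
    evals = np.clip(evals, 0.0, None)       # guard against tiny negative rounding
    coeffs = evecs.T @ (B.T @ y / m)        # B^* y in the eigenbasis of T
    u = evecs @ (g(evals, lam) * coeffs)    # spectral calculus g_lam(T) B^* y
    return np.linalg.solve(L, u)

def g_tikhonov(t, lam):
    return 1.0 / (t + lam)

def g_cutoff(t, lam):
    return np.where(t >= lam, 1.0 / np.maximum(t, lam), 0.0)

rng = np.random.default_rng(2)
m, n, lam = 60, 15, 5e-2
A = rng.standard_normal((m, n))                 # hypothetical sampled operator
L = np.diag(np.arange(1, n + 1, dtype=float))   # hypothetical smoothing operator
y = rng.standard_normal(m)

f_tik = regularized_solution(A, L, y, lam, g_tikhonov)
f_cut = regularized_solution(A, L, y, lam, g_cutoff)
f_ref = np.linalg.solve(A.T @ A / m + lam * (L.T @ L), A.T @ y / m)
print(np.allclose(f_tik, f_ref))
```

With the Tikhonov filter the spectral formula reproduces the penalized least-squares solution; swapping in another filter changes only the function `g`.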

## 3. Convergence analysis

Here we study the convergence for general regularization schemes in the Hilbert scale of the linear statistical inverse problem based on the prior assumptions and the link condition.

The analysis will distinguish between two cases, the 'regular' one, when $f_\rho \in \mathcal{D}(L)$, and the 'low smoothness' case, when $f_\rho \notin \mathcal{D}(L)$. In either case, we shall first utilize the concept of distance functions. This will later allow us to establish convergence rates in a more classical style.

For the asymptotic analysis, we shall require the standard assumption relating the sample size $m$ and the parameter $\lambda$ such that

$$\mathcal{N}_{T_\nu}(\lambda) \le m \lambda \quad \text{and} \quad 0 < \lambda \le 1. \tag{8}$$

It will be seen that asymptotically the condition (8) is always satisfied for the parameter which is optimally chosen under known smoothness.

The fact that $\mathcal{N}_{T_\nu}(\lambda)$ is a decreasing function of $\lambda$ and $\lambda \le 1$ implies that $\mathcal{N}_{T_\nu}(1) \le \mathcal{N}_{T_\nu}(\lambda)$. Hence from condition (8) we obtain

$$\mathcal{N}_{T_\nu}(1) \le m \lambda. \tag{9}$$
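
Condition (8) restricts how small $\lambda$ may be for a given sample size. The sketch below, assuming a hypothetical spectrum $t_j = j^{-2}$ for $T_\nu$ and a hypothetical logarithmic grid of candidate parameters, searches for the smallest grid value satisfying (8). Since the effective dimension decreases in $\lambda$ while $m\lambda$ increases, the admissible values form an interval ending at $1$.

```python
import numpy as np

def effective_dimension(eigvals, lam):
    """Tr((T + lam I)^{-1} T) for a positive operator with these eigenvalues."""
    return float(np.sum(eigvals / (eigvals + lam)))

def smallest_admissible_lambda(eigvals, m, grid):
    """Smallest grid value lam <= 1 with N(lam) <= m * lam, or None."""
    for lam in sorted(grid):
        if lam <= 1.0 and effective_dimension(eigvals, lam) <= m * lam:
            return float(lam)
    return None

eigvals = 1.0 / np.arange(1, 5001, dtype=float) ** 2   # hypothetical spectrum
grid = np.logspace(-4, 0, 200)                          # hypothetical candidates

lam_small_sample = smallest_admissible_lambda(eigvals, m=1_000, grid=grid)
lam_large_sample = smallest_admissible_lambda(eigvals, m=10_000, grid=grid)
print(lam_small_sample, lam_large_sample)
```

A larger sample size admits a smaller regularization parameter, in line with the role of (8) in the asymptotic analysis.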

Several probabilistic quantities will be used to express the error bounds. Precisely, for an index function  we let

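$$\Xi^{\zeta} = \Xi^{\zeta}(\lambda) := \left\| \left( \frac{1}{\zeta} \right)(T_{\mathbf{x}} + \lambda I)\, \zeta(T_\nu + \lambda I) \right\|_{\mathcal{L}(\mathcal{H})}, \tag{10}$$

$$\Lambda = \Lambda(\lambda) := \left\| (L_\nu + \lambda I)^{-1/2} (L_\nu - L_{\mathbf{x}}) \right\|_{HS}, \tag{11}$$

$$\Upsilon = \Upsilon(\lambda) := \left\| (T_\nu + \lambda I)^{-1/2} (T_\nu - T_{\mathbf{x}}) \right\|_{HS}, \tag{12}$$

and

$$\Psi = \Psi(\lambda) := \left\| (T_\nu + \lambda I)^{-1/2} B_{\mathbf{x}}^* (S_{\mathbf{x}} A f_\rho - \mathbf{y}) \right\|_{\mathcal{H}}. \tag{13}$$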

In case that  we abbreviate  by  and  by , not to be confused with the power. High probability bounds for these quantities are known, and these will be given correspondingly in Propositions B.1 and B.2.

### 3.1. The oversmoothing case

As mentioned before, we shall use distance functions, and these are called ‘approximate source conditions’ sometimes, because these measure the violation of a benchmark smoothness. Here the benchmark will be .

###### Definition 3.1 (Approximate source condition).

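We define the distance function $d\colon (0, \infty) \to [0, \infty)$ by

$$d(R) = \inf \left\{ \| f_\rho - f \|_{\mathcal{H}} : f = L^{-1} v \text{ and } \| v \|_{\mathcal{H}} \le R \right\}, \quad R > 0. \tag{14}$$

We denote by $f_\rho^R$ the element which realizes the above minimization problem.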

Notice the following: If  then for some  the minimizer  of the distance function will obey .

###### Remark 3.2.

In a rudimentary form, this approach was given in [3, Thm. 6.8]. It was then introduced in regularization theory in . Within learning theory, such a concept was also used in the study .

###### Theorem 3.3.

Let  be i.i.d. samples drawn according to the probability measure . Suppose the Assumptions 1–5 hold true. Suppose that the qualification  of the regularization  covers the function  (for  from Assumption 4) and that  are concave, or operator concave functions for some , respectively. Then for all , and for  satisfying the condition (8) the following upper bound holds for the regularized solution  (7) with confidence :

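$$\| f_{\mathbf{z},\lambda} - f_\rho \|_{\mathcal{H}} \le C \left\{ d(R) + 2 R\, \varrho(\lambda) \right\} \log^4\left( \frac{4}{\eta} \right), \quad R \ge \Sigma + \frac{\kappa M}{\mathcal{N}_{T_\nu}(1)},$$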

where  depends on .

###### Proof.

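For the minimizer $f_\rho^R$ of the distance function defined in (14), the error can be expressed as follows:

$$f_\rho - f_{\mathbf{z},\lambda} = L^{-1} \left\{ r_\lambda(T_{\mathbf{x}}) L (f_\rho - f_\rho^R) + r_\lambda(T_{\mathbf{x}}) L f_\rho^R + g_\lambda(T_{\mathbf{x}}) B_{\mathbf{x}}^* (S_{\mathbf{x}} A f_\rho - \mathbf{y}) \right\}.$$

By using Proposition 2.6 the error for the regularized solution can be bounded as

$$\begin{aligned} \| f_\rho - f_{\mathbf{z},\lambda} \|_{\mathcal{H}} \le{} & \| L^{-1} r_\lambda(T_{\mathbf{x}}) L (f_\rho - f_\rho^R) \|_{\mathcal{H}} + \| L^{-1} r_\lambda(T_{\mathbf{x}}) L f_\rho^R \|_{\mathcal{H}} + \| L^{-1} g_\lambda(T_{\mathbf{x}}) B_{\mathbf{x}}^* (S_{\mathbf{x}} A f_\rho - \mathbf{y}) \|_{\mathcal{H}} \\ \le{} & \underbrace{d(R)\, \| L^{-1} r_\lambda(T_{\mathbf{x}}) L \|_{\mathcal{L}(\mathcal{H})}}_{I_1} + \underbrace{\| \varrho(T_\nu) r_\lambda(T_{\mathbf{x}}) L f_\rho^R \|_{\mathcal{H}}}_{I_2} + \underbrace{\| \varrho(T_\nu) g_\lambda(T_{\mathbf{x}}) B_{\mathbf{x}}^* (S_{\mathbf{x}} A f_\rho - \mathbf{y}) \|_{\mathcal{H}}}_{I_3}. \end{aligned} \tag{15}$$

We shall bound each summand on the right in (15).

$I_1$: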

By Lemma B.3 we find that

 ∥∥L−1rλ(Tx)L∥∥L(H)≤1+(B+D)(ΞϱΞυ+Ξϱ(λ)(ϱ(λ)+1)Λ√λ)

with  as in (10), (11) and .

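From the estimates of Propositions B.1 and B.2 we get with confidence  that

$$\| L^{-1} r_\lambda(T_{\mathbf{x}}) L \|_{\mathcal{L}(\mathcal{H})} \le 1 + (B + D) \left\{ (2\kappa + 1)^8 + 2 (2\kappa + 1)^4 (\varrho(\lambda) + 1) \left( \frac{\tilde{\kappa}\, \varrho(\lambda)}{m \lambda} + \sqrt{\frac{\tilde{\kappa}\, \varrho^2(\lambda)\, \mathcal{N}_{L_\nu}(\lambda)}{m \lambda}} \right) \right\} \log^4\left( \frac{4}{\eta} \right). \tag{16}$$

For $\vartheta(\lambda) := \lambda / \varrho^2(\lambda)$, using that $\lambda \mapsto \lambda\, \mathcal{N}_{L_\nu}(\lambda)$ is an increasing function and that $\vartheta(\lambda) \ge \lambda$ for $\lambda$ small enough, we get

$$\lambda\, \mathcal{N}_{L_\nu}(\lambda) \le \vartheta(\lambda)\, \mathcal{N}_{L_\nu}(\vartheta(\lambda)).$$

This together with Proposition 2.10 implies that

$$\varrho^2(\lambda)\, \mathcal{N}_{L_\nu}(\lambda) \le \mathcal{N}_{L_\nu}\left( \frac{\lambda}{\varrho^2(\lambda)} \right) \le 2 \beta^{n+1} \tilde{C}\, \mathcal{N}_{T_\nu}(\lambda). \tag{17}$$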

Under the condition (8) from the estimates (9), (16), (17) we get with confidence :

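$$\| L^{-1} r_\lambda(T_{\mathbf{x}}) L \|_{\mathcal{L}(\mathcal{H})} \le 1 + (B + D)\, \beta^{n+1} \tilde{C}\, C_{\kappa, \tilde{\kappa}} \log^4\left( \frac{4}{\eta} \right), \tag{18}$$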

where  depends on .

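$I_2$:

By construction of $f_\rho^R$ we have that $\| L f_\rho^R \|_{\mathcal{H}} \le R$. Using the fact that the qualification covers $\varrho$ we bound

$$\| \varrho(T_\nu) r_\lambda(T_{\mathbf{x}}) L f_\rho^R \|_{\mathcal{H}} \le R\, \Xi^{\varrho}\, \| \varrho(T_{\mathbf{x}} + \lambda I) r_\lambda(T_{\mathbf{x}}) \|_{\mathcal{L}(\mathcal{H})} \le 2 R\, \Xi^{\varrho} \varrho(\lambda). \tag{19}$$

$I_3$:

For the last summand we argue

$$\begin{aligned} \| \varrho(T_\nu) g_\lambda(T_{\mathbf{x}}) B_{\mathbf{x}}^* (S_{\mathbf{x}} A f_\rho - \mathbf{y}) \|_{\mathcal{H}} \le{} & \Xi^{\frac{1}{2}}\, \Xi^{\varrho}\, \Psi \left\| g_\lambda(T_{\mathbf{x}})\, \varrho(T_{\mathbf{x}} + \lambda I) (T_{\mathbf{x}} + \lambda I)^{\frac{1}{2}} \right\|_{\mathcal{L}(\mathcal{H})} \\ \le{} & \Xi^{\frac{1}{2}}\, \Xi^{\varrho}\, \Psi \sup_{t \in [0, \kappa^2]} \varrho(t + \lambda) (t + \lambda)^{\frac{1}{2}} | g_\lambda(t) | \\ \le{} & \Xi^{\frac{1}{2}}\, \Xi^{\varrho}\, \Psi \left( \sup_{t \in [0, \kappa^2]} \varrho(t + \lambda) (t + \lambda)^{-\frac{1}{2}} \right) \left\{ \lambda \sup_{t \in [0, \kappa^2]} | g_\lambda(t) | + \sup_{t \in [0, \kappa^2]} | t\, g_\lambda(t) | \right\} \\ \le{} & \Xi^{\frac{1}{2}}\, \Xi^{\varrho}\, \Psi\, \{ B + D \}\, \varrho(\lambda) \lambda^{-\frac{1}{2}}, \end{aligned} \tag{20}$$

where $\Xi^{\zeta}$ and $\Psi$ are as in (10) and (13).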

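Summarizing, using the estimates of Propositions B.1 and B.2, and (18)–(20), we get with confidence :

$$\| f_\rho - f_{\mathbf{z},\lambda} \|_{\mathcal{H}} \le C \left[ d(R) + \varrho(\lambda) \left\{ R + \frac{\kappa M}{m \lambda} + \sqrt{\frac{\Sigma^2\, \mathcal{N}_{T_\nu}(\lambda)}{m \lambda}} \right\} \right] \log^4\left( \frac{4}{\eta} \right). \tag{21}$$

For any parameter choice $\lambda$ satisfying the condition (8), using the inequality (9) we get that

$$\frac{\kappa M}{m \lambda} \le \frac{\kappa M}{\mathcal{N}_{T_\nu}(1)}$$

and

$$\sqrt{\frac{\Sigma^2\, \mathcal{N}_{T_\nu}(\lambda)}{m \lambda}} \le \Sigma.$$

This implies

$$R + \frac{\kappa M}{m \lambda} + \sqrt{\frac{\Sigma^2\, \mathcal{N}_{T_\nu}(\lambda)}{m \lambda}} \le 2 R, \tag{22}$$

provided that $R \ge \Sigma + \kappa M / \mathcal{N}_{T_\nu}(1)$. Inserting the bound from inequality (22) into the estimate (21) completes the proof. ∎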

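The bound from Theorem 3.3 is valid for all $R \ge \Sigma + \kappa M / \mathcal{N}_{T_\nu}(1)$, and we shall now optimize it with respect to the choice of $R$.

First, if $f_\rho \in \mathcal{D}(L)$ then there is $\bar{R}$ such that $d(\bar{R}) = 0$, and

$$\| f_{\mathbf{z},\lambda} - f_\rho \|_{\mathcal{H}} \le C \bar{R}\, \varrho(\lambda) \log^4\left( \frac{4}{\eta} \right),$$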

where  depends on .

Otherwise, in the low smoothness case, , we introduce the following function

 Γ(R):=