# Tikhonov regularization with oversmoothing penalty for nonlinear statistical inverse problems

In this paper, we consider nonlinear ill-posed inverse problems with noisy data in the statistical learning setting. Tikhonov regularization in Hilbert scales is used to reconstruct the estimator from random noisy data. In this setting, we derive rates of convergence for the regularized solution under certain assumptions on the nonlinear forward operator and on the prior. We discuss estimates of the reconstruction error using the approach of reproducing kernel Hilbert spaces.


## 1. Introduction

We consider the nonlinear ill-posed operator equation of the form

$$A(f) = g$$

with a nonlinear forward operator $A: D(A)\subset H\to H'$ between the infinite-dimensional Hilbert spaces $H$ and $H'$. Moreover, $H'$ is a space of functions $g: X\to Y$ for a Polish space $X$ (the input space) and a real separable Hilbert space $Y$ (the output space). Ill-posed inverse problems have important applications in science and technology (see, e.g., [13, 15, 29, 31]).

In the classical inverse problem setting, we observe an approximation $g^\delta$ of the function $g$ with $\|g-g^\delta\|\le\delta$ for some known noise level $\delta$, and then reconstruct an estimator of the quantity $f$ through regularization schemes. Here we consider the problem in the statistical learning setting, in which we observe random noisy images $y_i$ at the points $x_i$. The problem can be described as follows:

$$y_i = g(x_i) + \varepsilon_i, \quad i = 1,\ldots,m, \qquad g = A(f), \tag{1}$$

where $\varepsilon_i$ is the random observational noise with $\mathbb{E}[\varepsilon_i]=0$ and $m$ is called the sample size.

The model (1) covers nonparametric regression under random design (which we also call the direct problem, i.e., $A = I$) and the linear statistical inverse learning problem. Thus, introducing a general nonlinear operator $A$ gives a unified approach to the different learning problems.
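As a purely illustrative sketch of the observation model (1), the following Python snippet draws a sample from a toy problem. The forward map `forward` (here $A(f)(x) = f(x) + f(x)^3$), the uniform design on $[0,1]$, and the Gaussian noise are assumptions made for the example only; none of them is taken from the paper.

```python
import math
import random

def forward(f):
    """Hypothetical nonlinear forward operator A: g = A(f) with g(x) = f(x) + f(x)**3."""
    return lambda x: f(x) + f(x) ** 3

def sample(f, m, sigma, rng):
    """Draw m pairs (x_i, y_i) with x_i ~ Uniform[0, 1] and y_i = A(f)(x_i) + eps_i."""
    g = forward(f)
    xs = [rng.random() for _ in range(m)]
    # Centered Gaussian noise is one admissible choice of observational noise.
    ys = [g(x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys

rng = random.Random(0)
f_true = lambda x: math.sin(2 * math.pi * x)   # toy "true solution"
xs, ys = sample(f_true, m=200, sigma=0.1, rng=rng)
```

Replacing `forward` by the identity map gives the direct (regression) problem mentioned above.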

Suppose the random observations are drawn independently and identically according to the joint probability measure $\rho$ on the sample space $Z = X\times Y$, and that the probability measure $\rho$ can be split as follows:

$$\rho(x,y) = \rho(y|x)\,\nu(x),$$

where $\rho(y|x)$ is the conditional probability distribution of $y$ given $x$ and $\nu$ is the marginal probability distribution on $X$.

For the statistical inverse problem (1), the goodness of an estimator $f$ can be measured through the expected risk:

$$\mathcal{E}_\rho(f) = \int_Z \|A(f)(x) - y\|_Y^2 \, d\rho(x,y). \tag{2}$$

Further, we assume that  for any . Then for the function

$$g_\rho(x) = \int_Y y \, d\rho(y|x),$$

the expected risk can be expressed as follows:

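$$\mathcal{E}_\rho(f) = \int_X \|A(f)(x) - g_\rho(x)\|_Y^2 \, d\nu(x) + \int_Z \|g_\rho(x) - y\|_Y^2 \, d\rho(x,y). \tag{3}$$

Hence, finding the minimizer of the expected risk is equivalent to minimizing the quantity $\int_X \|A(f)(x) - g_\rho(x)\|_Y^2 \, d\nu(x)$, since the second term in (3) does not depend on $f$.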

Since the probability measure $\rho$ is unknown, the only information about it is available through the sample. Therefore we use regularization methods to stably reconstruct an estimator of the quantity $f$. Tikhonov regularization is widely considered in both classical inverse problems and statistical learning theory. We consider Tikhonov regularization in Hilbert scales, which consists of an error term measuring the fit to the data and an oversmoothing penalty. We introduce an unbounded, closed, linear, self-adjoint, strictly positive operator $L: D(L)\subset H\to H$ with a dense domain of definition $D(L)\subset H$ to treat the oversmoothing penalty in terms of a Hilbert scale. For some $\ell>0$, the operator satisfies:

$$\|Lf\|_H \ge \ell\,\|f\|_H \quad \text{for all } f\in D(L). \tag{4}$$

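For a given sample $z = \{(x_i,y_i)\}_{i=1}^m$, we define the Tikhonov regularization scheme in Hilbert scales:

$$f_{z,\lambda} = \operatorname*{argmin}_{f\in D(A)\cap D(L)} \left\{ \frac{1}{m}\sum_{i=1}^m \|A(f)(x_i)-y_i\|_Y^2 + \lambda\,\|L(f-\bar f)\|_H^2 \right\}. \tag{5}$$

Here $\bar f$ denotes some initial guess of the true solution, which offers the possibility to incorporate a priori information, and $\lambda$ is a positive regularization parameter which controls the trade-off between the error term and the complexity of the solution.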
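To make the structure of the Tikhonov functional concrete (empirical data misfit plus $\lambda\|L(f-\bar f)\|_H^2$), here is a minimal finite-dimensional sketch. Everything specific in it is an assumption for illustration: functions are represented by coefficient vectors `theta` in a sine basis, the forward map is the toy choice $A(f)(x) = f(x) + f(x)^3$, and $L$ is the diagonal operator $\mathrm{diag}(1,2,\ldots,n)$, which satisfies (4) with $\ell = 1$.

```python
import math

def f_theta(theta, x):
    """f(x) = sum_j theta_j * sqrt(2) * sin(j*pi*x), j = 1..n (orthonormal in L^2[0, 1])."""
    return sum(t * math.sqrt(2) * math.sin((j + 1) * math.pi * x)
               for j, t in enumerate(theta))

def forward(theta, x):
    """Hypothetical nonlinear forward map: A(f)(x) = f(x) + f(x)**3."""
    v = f_theta(theta, x)
    return v + v ** 3

def tikhonov_objective(theta, theta_bar, xs, ys, lam):
    """Empirical misfit plus lam * ||L(f - f_bar)||^2 with L = diag(1, 2, ..., n)."""
    m = len(xs)
    misfit = sum((forward(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / m
    penalty = sum(((j + 1) * (t - tb)) ** 2
                  for j, (t, tb) in enumerate(zip(theta, theta_bar)))
    return misfit + lam * penalty

theta_true = [1.0, -0.5, 0.25]
xs = [(i + 0.5) / 50 for i in range(50)]
ys = [forward(theta_true, x) for x in xs]   # noise-free data, for illustration only
zero = [0.0, 0.0, 0.0]
```

With noise-free data, the objective vanishes at `theta_true` when the initial guess equals the truth, while a poor initial guess is charged through the weighted penalty; higher-index (rougher) coefficients are penalized more strongly.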

In many practical problems, the operator $L$, which influences the properties of the regularized approximation, is chosen to be a differential operator in some appropriate function space, e.g., a space of square-integrable functions. It is well known that standard Tikhonov regularization suffers from the saturation effect. This finite qualification of Tikhonov regularization can be overcome using Hilbert scales. The problem (5) is non-convex, therefore a minimizer may not exist in general. For a continuous and weakly sequentially closed operator $A$ (i.e., if a sequence $(f_n)\subset D(A)$ converges weakly to some $f\in H$ and the sequence $(A(f_n))$ converges weakly to some $g\in H'$, then $f\in D(A)$ and $A(f)=g$), there exists a global minimizer of the functional in (5). But it is not necessarily unique since $A$ is nonlinear (see [29, Section 4.1.1]).

Generally, in the classical inverse problem literature (see [4, 13, 17, 29] and references therein), 2-step approaches are considered: first an estimator of the function $g$ is constructed from the observations, and then the quantity $f$ is estimated stably using various regularization schemes. Here we estimate the quantity $f$ in a 1-step method using the Tikhonov regularization scheme (5) in the statistical learning setting.

Now we review the work in the literature related to the considered problem. Regularization schemes in Hilbert scales are widely considered in classical inverse problems (with deterministic noise) [12, 16, 22, 24, 25, 26, 30]. On the contrary, the inverse problems with random observations are not well-studied. The linear statistical inverse problems are studied in , under the assumption that the marginal probability measure  is known which is an unrealistic assumption since the only information is available through the input points . This problem is also discussed in  for the general random design with an unknown marginal probability measure.

In this nonlinear setup, the reference  established the error estimates for the generalized Tikhonov regularization for (1) using the linearization technique in a random design setting. In other work, the authors  consider a 2-step approach, however, again under the assumption of the norm in  being known. The references  and [17, 32] consider respectively a Gauss-Newton algorithm and the Tikhonov regularization for certain nonlinear inverse problem, but also in the idealized setting of Hilbertian white or colored noise with known covariance, which can only cover sampling effects when  is known. Loubes et al.  discussed the problem (1) under a fixed design and concentrate on the problem of model selection. Finally, the recent work  discussed the rates of convergence for the Tikhonov regularization of the nonlinear inverse problem.

In contrast with the existing work [3, 4, 17, 32] our results are improved in three respects:

• We do not restrict ourselves to the Hilbertian white or colored noise.

• We consider a 1-step approach rather than existing 2-step approaches for the nonlinear inverse problems.

• The considered approach does not suffer the saturation effect of standard Tikhonov regularization.

Following the work [1, 7], we develop the error analysis for the Tikhonov regularization scheme for the nonlinear inverse problems in Hilbert scales in the statistical learning setting. We establish the error bounds for the statistical inverse problems in reproducing kernel approach. We discuss the rates of convergence for Tikhonov regularization under certain assumptions on the nonlinear forward operator and the prior assumptions.

Some structural assumptions are required on the nonlinear mappings  to establish the convergence analysis. We consider the widely assumed conditions in the literature of the classical inverse problems, first assumed in , and presented in detail in the monograph . We assume that the operator is Fréchet differentiable at the true solution, the Fréchet derivative is Lipschitz continuous and satisfies the link condition (for precise statement see Assumption 4).

The goal is to analyze the theoretical properties of the Tikhonov estimator $f_{z,\lambda}$. In particular, the asymptotic performance of the regularization scheme is evaluated by error estimates of the Tikhonov estimator $f_{z,\lambda}$ in the reproducing kernel approach. Precisely, we develop a non-asymptotic analysis of Tikhonov regularization (5) for the nonlinear statistical inverse problem based on the tools that have been developed for the modern mathematical study of reproducing kernel methods. The challenges specific to the studied problem are that the considered model is an inverse problem (rather than a pure prediction problem) and nonlinear. The rate of convergence of the Tikhonov estimator $f_{z,\lambda}$ to the true solution is described in the probabilistic sense by exponential tail inequalities. For sample size $m$ and confidence level $0<\eta<1$, we establish bounds of the form

$$\mathbb{P}_{z\in Z^m}\left\{ \|f_{z,\lambda}-f\|_H \le \varepsilon(m)\,\log^2\!\left(\frac{1}{\eta}\right) \right\} \ge 1-\eta.$$

Here the function $m\mapsto\varepsilon(m)$ is positive and decreasing, and describes the rate of convergence as $m\to\infty$.

The paper is organized as follows. In Section 2, we discuss the basic definitions and assumptions required in our analysis. In Section 3, we discuss bounds on the reconstruction error under certain assumptions on the (unknown) joint probability measure $\rho$ and the (nonlinear) mapping $A$. In the Appendix, we present the probabilistic estimates and the preliminary results which provide the tools to obtain the error bounds in the reproducing kernel approach.

## 2. Notation and assumptions

In this section, we introduce some basic concepts, definitions, and notations required in our analysis.

### 2.1. Reproducing Kernel Hilbert space and related operators

We start with the concept of reproducing kernel Hilbert spaces. Such a space is a subspace of $L^2(X,\nu;Y)$ (the space of square-integrable functions from $X$ to $Y$ with respect to the probability distribution $\nu$) which can be characterized by a symmetric, positive semi-definite kernel, and each of its functions satisfies the reproducing property. Here we discuss vector-valued reproducing kernel Hilbert spaces, which are the generalization of real-valued reproducing kernel Hilbert spaces.

###### Definition 2.1 (Vector-valued reproducing kernel Hilbert space).

For a non-empty set $X$ and a real separable Hilbert space $Y$, a Hilbert space $H$ of functions from $X$ to $Y$ is said to be a vector-valued reproducing kernel Hilbert space if the linear functional $F_{x,y}: H\to\mathbb{R}$, defined by

$$F_{x,y}(f) = \langle y, f(x)\rangle_Y \quad \forall f\in H,$$

is continuous for every $x\in X$ and $y\in Y$.

Throughout the paper,  denotes adjoint of an operator .

###### Definition 2.2 (Operator-valued positive semi-definite kernel).

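Suppose $L(Y)$ is the Banach space of bounded linear operators on $Y$. A function $K: X\times X\to L(Y)$ is said to be an operator-valued positive semi-definite kernel if

1. $K(x,x')^* = K(x',x)$ for all $x,x'\in X$, and
2. $\sum_{i,j=1}^N \langle y_i, K(x_i,x_j)\,y_j\rangle_Y \ge 0$ for all $N\in\mathbb{N}$, $x_1,\ldots,x_N\in X$ and $y_1,\ldots,y_N\in Y$.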

For a given operator-valued positive semi-definite kernel $K$, we can construct a unique vector-valued reproducing kernel Hilbert space $H$ of functions from $X$ to $Y$ as follows:

1. We define the linear function

$$K_x: Y\to H: y\mapsto K_x y,$$

where $(K_x y)(x') = K(x',x)\,y$ for $x,x'\in X$ and $y\in Y$.

2. The span of the set $\{K_x y : x\in X,\ y\in Y\}$ is dense in $H$.

3. Reproducing property:

$$\langle f(x), y\rangle_Y = \langle f, K_x y\rangle_H, \quad x\in X,\ y\in Y,\ \forall f\in H,$$

in other words $f(x) = K_x^* f$.

Moreover, there is a one-to-one correspondence between operator-valued positive semi-definite kernels and vector-valued reproducing kernel Hilbert spaces. The reproducing kernel Hilbert space becomes a real-valued reproducing kernel Hilbert space in the case that $Y$ is a bounded subset of $\mathbb{R}$, and the corresponding kernel becomes a symmetric, positive semi-definite function $K: X\times X\to\mathbb{R}$ with the reproducing property $f(x) = \langle f, K_x\rangle_H$.

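We make the following assumption concerning the Hilbert space $H'$:

###### Assumption 1.

The space $H'$ is assumed to be a vector-valued reproducing kernel Hilbert space of functions $f: X\to Y$ corresponding to the kernel $K: X\times X\to L(Y)$ such that

1. $K_x: Y\to H'$ is a Hilbert-Schmidt operator for $x\in X$ with

$$\kappa^2 := \sup_{x\in X}\|K_x\|_{HS}^2 = \sup_{x\in X}\operatorname{tr}(K_x^* K_x) < \infty.$$

2. For $y,y'\in Y$, the real-valued function $(x,x')\mapsto\langle K_x y, K_{x'} y'\rangle_{H'}$ is measurable.

Note that in the case of real-valued functions ($Y\subset\mathbb{R}$), Assumption 1 simplifies to the condition that the kernel is measurable and $\kappa^2 = \sup_{x\in X} K(x,x) < \infty$.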

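Now we introduce some relevant operators used in the convergence analysis. We introduce the notation $x = (x_1,\ldots,x_m)$, $y = (y_1,\ldots,y_m)$ and $z = ((x_1,y_1),\ldots,(x_m,y_m))$ for the discrete ordered sets. The product Hilbert space $Y^m$ is equipped with the inner product $\langle y, y'\rangle_m = \frac{1}{m}\sum_{i=1}^m \langle y_i, y_i'\rangle_Y$ and the corresponding norm $\|y\|_m^2 = \frac{1}{m}\sum_{i=1}^m \|y_i\|_Y^2$. We define the sampling operator $S_x: H'\to Y^m: f\mapsto (f(x_1),\ldots,f(x_m))$; then the adjoint $S_x^*: Y^m\to H'$ is given by

$$S_x^* c = \frac{1}{m}\sum_{i=1}^m K_{x_i} c_i, \quad \forall c = (c_1,\ldots,c_m)\in Y^m.$$

Let $I_\nu$ denote the canonical injection map $H'\to L^2(X,\nu;Y)$. Then both the canonical injection map $I_\nu$ and the sampling operator $S_x$ are bounded by $\kappa$ under Assumption 1, since

$$\|I_\nu f\|_{L^2(X,\nu;Y)}^2 = \int_X \|f(x)\|_Y^2 \, d\nu(x) = \int_X \|K_x^* f\|_Y^2 \, d\nu(x) \le \kappa^2 \|f\|_{H'}^2$$

and

$$\|S_x f\|_m^2 = \frac{1}{m}\sum_{i=1}^m \|f(x_i)\|_Y^2 = \frac{1}{m}\sum_{i=1}^m \|K_{x_i}^* f\|_Y^2 \le \kappa^2 \|f\|_{H'}^2.$$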
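The sampling operator, its adjoint, and the bound $\|S_x f\|_m^2 \le \kappa^2\|f\|^2$ can be checked numerically in the scalar-valued case. The sketch below is an illustration under assumptions not made in the paper: a Gaussian kernel on the real line (so $\kappa^2 = \sup_x K(x,x) = 1$), a function $f = \sum_j a_j K(\cdot, z_j)$ with randomly chosen centers $z_j$, and random sample points.

```python
import math
import random

def k(x, xp):
    """Gaussian kernel on the real line; K(x, x) = 1, hence kappa^2 = 1."""
    return math.exp(-0.5 * (x - xp) ** 2)

rng = random.Random(1)
centers = [rng.uniform(-1, 1) for _ in range(5)]   # f = sum_j a_j K(., z_j)
coef = [rng.uniform(-1, 1) for _ in range(5)]
xs = [rng.uniform(-1, 1) for _ in range(20)]       # sample points x_1..x_m
c = [rng.uniform(-1, 1) for _ in range(20)]        # arbitrary element of Y^m = R^m
m = len(xs)

def f(x):
    return sum(a * k(x, z) for a, z in zip(coef, centers))

# RKHS norm of f: ||f||^2 = a^T G a with Gram matrix G_jl = K(z_j, z_l)
norm_f_sq = sum(a * b * k(z, w) for a, z in zip(coef, centers)
                for b, w in zip(coef, centers))

# Sampling operator and empirical norm ||S_x f||_m^2 = (1/m) sum_i f(x_i)^2
Sf = [f(x) for x in xs]
norm_Sf_sq = sum(v * v for v in Sf) / m

# <S_x f, c>_m versus <f, S_x^* c>, where S_x^* c = (1/m) sum_i c_i K(., x_i)
lhs = sum(v * ci for v, ci in zip(Sf, c)) / m
rhs = sum(a * ci * k(z, x) for a, z in zip(coef, centers)
          for ci, x in zip(c, xs)) / m
```

Here `lhs` and `rhs` agree up to rounding, which is the adjoint identity, and `norm_Sf_sq` does not exceed `norm_f_sq`, which is the stated bound with $\kappa = 1$.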

We denote by $T_\nu = I_\nu^* I_\nu$ the population version, the corresponding covariance operator. The operator $T_\nu$ is positive, self-adjoint and depends on both the kernel and the marginal probability measure $\nu$. We also introduce the sampling version $T_x = S_x^* S_x$, which is positive, self-adjoint and depends on both the kernel and the inputs $x$.

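By spectral theory, the operator $L^s: D(L^s)\to H$ is well-defined for $s\in\mathbb{R}$, and the spaces $H_s := D(L^s)$, $s\ge 0$, equipped with the inner product $\langle f,g\rangle_{H_s} = \langle L^s f, L^s g\rangle_H$ are Hilbert spaces. For $s<0$, the space $H_s$ is defined as the completion of $H$ under the norm $\|f\|_{H_s} := \langle f,f\rangle_{H_s}^{1/2}$. The family $(H_s)_{s\in\mathbb{R}}$ is called the Hilbert scale induced by $L$. We notice that the space $H_0$ is $H$ according to the above notation. The interpolation inequality is an important tool for the analysis:

$$\|f\|_{H_r} \le \|f\|_{H_t}^{\frac{s-r}{s-t}} \, \|f\|_{H_s}^{\frac{r-t}{s-t}}, \quad f\in H_s, \tag{6}$$

which holds for any $t<r<s$.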
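The interpolation inequality (6) can be checked numerically in a finite-dimensional stand-in for the Hilbert scale. The following sketch assumes $H = \mathbb{R}^n$ and the diagonal operator $L = \mathrm{diag}(1,2,\ldots,n)$, so that $\|f\|_{H_s} = \|L^s f\|$; this choice is hypothetical and only serves to illustrate the inequality for indices $t<r<s$.

```python
import math
import random

def scale_norm(f, s):
    """||f||_{H_s} = ||L^s f|| for L = diag(1, 2, ..., n) acting on R^n."""
    return math.sqrt(sum(((j + 1) ** s * v) ** 2 for j, v in enumerate(f)))

def interpolation_gap(f, t, r, s):
    """Right-hand side minus left-hand side of (6); nonnegative when t < r < s."""
    theta = (s - r) / (s - t)            # exponent of the weaker norm ||f||_{H_t}
    rhs = scale_norm(f, t) ** theta * scale_norm(f, s) ** (1 - theta)
    return rhs - scale_norm(f, r)

rng = random.Random(2)
f = [rng.uniform(-1, 1) for _ in range(30)]
gaps = [interpolation_gap(f, t, r, s)
        for (t, r, s) in [(-2.0, 0.0, 1.0), (-1.0, 0.5, 1.0), (0.0, 0.3, 2.0)]]
```

The first triple, with a negative index $t$, a zero index $r$ and $s=1$, mirrors the way the inequality is used later to pass between a weak norm, the $H$-norm and the penalty norm.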

### 2.2. The true solution, noise condition, and nonlinearity structure

We assume that the random observations $\{(x_i,y_i)\}_{i=1}^m$ follow the model $y = A(f)(x)+\varepsilon$ with a centered noise $\varepsilon$.

We assume throughout the paper that the operator $A$ is injective.

###### Assumption 2 (The true solution).

The conditional expectation w.r.t. $\rho$ of $y$ given $x$ exists (a.s.), and there exists $f_\rho\in D(A)$ such that

$$\int_Y y \, d\rho(y|x) = g_\rho(x) = A(f_\rho)(x), \quad \text{for all } x\in X.$$

From (3) we observe that $f_\rho$ is the minimizer of the expected risk. The element $f_\rho$ is the true solution which we aim at estimating.

###### Assumption 3 (Noise condition).

There exist some constants $M$, $\Sigma$ such that for almost all $x\in X$,

$$\int_Y \left( e^{\|y - A(f_\rho)(x)\|_Y / M} - \frac{\|y - A(f_\rho)(x)\|_Y}{M} - 1 \right) d\rho(y|x) \le \frac{\Sigma^2}{2M^2}.$$

This assumption is usually referred to as a Bernstein-type assumption. The distribution of the observational noise is reflected in the parameters $M$, $\Sigma$. For the convergence analysis, the output space need not be bounded as long as the noise condition for the output variable is fulfilled.

We need an assumption on the nonlinearity structure of the operator $A$ to establish the rates of convergence. Following the work of Engl et al. [13, Chapt. 10] on ‘classical’ nonlinear inverse problems, we consider the following assumption:

###### Assumption 4 (nonlinearity structure).
1. $D(A)$ is convex, $A$ is weakly sequentially closed and $A$ is Fréchet differentiable with derivative $A'$.

2. The Fréchet derivative $A'(f)$ is bounded in a ball $B_d(f_\rho)$ of sufficiently large radius $d$, i.e., there exists $J$ such that

$$\|A'(f)\|_{H\to H'} \le J \quad \forall f\in B_d(f_\rho)\cap D(A)\subset H.$$

3. (Link condition) There exist constants $p$, $\alpha$ and $\beta$ such that for all $g\in H$,

$$\alpha\,\|g\|_{H_{-p}} \le \|I_\nu A'(f_\rho)\,g\|_{L^2(X,\nu;Y)} \le \beta\,\|g\|_{H_{-p}}.$$

4. (Lipschitz continuity of $A'$) For all $f\in B_d(f_\rho)\cap D(A)$, there exists a constant $\gamma$ such that

$$\|I_\nu\{A'(f_\rho)-A'(f)\}\|_{H_{-p}\to L^2(X,\nu;Y)} \le \gamma\,\|f_\rho-f\|_H \le \frac{\alpha^2}{2\beta}.$$

A sufficient condition for weak sequential closedness is that $D(A)$ is weakly closed (e.g. closed and convex) and $A$ is weakly continuous. The link condition (Assumption 4 (iii)) describes an interplay between the operator $L$ and the Fréchet derivative of the operator $A$. This link condition is known as finitely smoothing. It is satisfied in various types of problems (for examples see [9, Example 10.2], [32, Examples 4, 5]).

### 2.3. Effective dimension

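Now we introduce the concept of the effective dimension, which is an important ingredient for deriving the rates of convergence [7, 10, 14, 19, 21, 28]. The effective dimension is defined as

$$\mathcal{N}(\lambda) := \operatorname{Tr}\left((T_\nu+\lambda I)^{-1}T_\nu\right), \quad \text{for } \lambda>0.$$

Using the singular value decomposition $T_\nu = \sum_{i=1}^\infty t_i \langle\cdot, e_i\rangle_{H'}\, e_i$ for an orthonormal sequence $(e_i)_{i\in\mathbb{N}}$ of eigenvectors of $T_\nu$ with corresponding eigenvalues $(t_i)_{i\in\mathbb{N}}$ such that $t_1\ge t_2\ge\cdots\ge 0$, we get

$$\mathcal{N}(\lambda) = \sum_{i=1}^\infty \frac{t_i}{\lambda+t_i}.$$

Hence the function $\lambda\mapsto\mathcal{N}(\lambda)$ is continuous and decreasing from $\infty$ to zero for $0<\lambda<\infty$ for the infinite-dimensional operator $T_\nu$ (see for details [5, 8, 18, 21, 33]).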

Since the integral operator $T_\nu$ is a trace class operator, the effective dimension is finite and we have that

$$\mathcal{N}(\lambda) \le \|(T_\nu+\lambda I)^{-1}\|_{L(H')} \operatorname{Tr}(T_\nu) \le \frac{\kappa^2}{\lambda}. \tag{7}$$

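
The series formula for the effective dimension and the bound (7) are easy to evaluate for a given spectrum. The sketch below uses a hypothetical, truncated eigenvalue sequence $t_i = i^{-2}$ (an assumption for illustration, not a spectrum from the paper) and uses $\operatorname{Tr}(T_\nu)$ in place of its upper bound $\kappa^2$.

```python
def effective_dimension(eigs, lam):
    """N(lam) = sum_i t_i / (lam + t_i) for eigenvalues t_i of the covariance operator."""
    return sum(t / (lam + t) for t in eigs)

eigs = [1.0 / (i * i) for i in range(1, 2001)]   # hypothetical spectrum t_i = i^{-2}
trace = sum(eigs)                                # Tr(T), which is at most kappa^2
lams = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
vals = [effective_dimension(eigs, lam) for lam in lams]
```

The values in `vals` decrease as `lam` grows, and each stays below `trace / lam`, in line with (7).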
###### Assumption 5 (Polynomial decay condition).

Assume that there exists some positive constant $c$ such that

$$\mathcal{N}(\lambda) \le c\,\lambda^{-b}, \quad \text{for } b<1,\ \forall\lambda>0.$$

###### Assumption 6 (Logarithmic decay condition).

Assume that there exists some positive constant $c$ such that

$$\mathcal{N}(\lambda) \le c\,\log\left(\frac{1}{\lambda}\right), \quad \forall\lambda>0.$$

Lu et al.  showed that different kernels with some probability measures show different behavior of the effective dimension. For Gaussian kernel  with the uniform sampling on , the effective dimension exhibits the log-type behavior (Assumption 6), on the other hand, the kernel  exhibits the power-type behavior (Assumption 5).

Caponnetto et al. showed that if the eigenvalues $t_n$ of the integral operator $T_\nu$ follow a polynomial decay, i.e., for fixed positive constants $\mu$ and $b<1$,

$$t_n \le \mu\, n^{-\frac{1}{b}} \quad \forall n\in\mathbb{N},$$

then the effective dimension behaves like a power-type function (Assumption 5).
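
The passage from polynomial eigenvalue decay to the power-type bound of Assumption 5 can be illustrated numerically. The sketch below assumes the extreme case $t_n = \mu n^{-1/b}$ with the hypothetical values $\mu = 1$, $b = 1/2$. The constant `c` comes from comparing the series with the integral $\int_0^\infty \mu/(\mu+\lambda x^{1/b})\,dx = (\mu/\lambda)^b\, \pi b/\sin(\pi b)$; this is a standard computation and not a constant quoted from the paper.

```python
import math

def effective_dimension(eigs, lam):
    """N(lam) = sum_n t_n / (lam + t_n)."""
    return sum(t / (lam + t) for t in eigs)

mu, b = 1.0, 0.5                                   # hypothetical decay parameters
eigs = [mu * n ** (-1.0 / b) for n in range(1, 20001)]

# Integral-comparison constant: N(lam) <= c * lam**(-b); equals pi/2 for b = 1/2.
c = mu ** b * math.pi * b / math.sin(math.pi * b)

ratios = [effective_dimension(eigs, lam) * lam ** b
          for lam in (1e-1, 1e-2, 1e-3, 1e-4)]
```

Each entry of `ratios` stays below `c`, so $\mathcal{N}(\lambda)\,\lambda^{b}$ remains bounded as $\lambda$ decreases.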

## 3. Convergence analysis

Here we establish error bounds for Tikhonov regularization for nonlinear statistical inverse problems in the $H$-norm in the probabilistic sense. An explicit expression for $f_{z,\lambda}$ is not known, therefore we use the definition (5) of the Tikhonov estimator $f_{z,\lambda}$ to derive the error estimates. A linearization technique is used for the nonlinear operator $A$ in a neighborhood of the true solution $f_\rho$. The rates of convergence are established by exploiting the nonlinearity structure of the operator (see Assumption 4). We discuss the rates of convergence for the Tikhonov estimator by measuring the effect of random sampling, which is governed by the noise condition (Assumption 3). The bounds on the reconstruction error depend on the effective dimension, the smoothness parameter of the true solution and the parameter related to the link condition.

It is convenient to introduce shorthand notation for some “standardized” key quantities used in our analysis. We let

$$\Xi_x := S_x (T_x+\lambda I)^{-1} S_x^*,$$

$$\Delta_z := S_x A(f_\rho) - y,$$

$$\Psi_x := \|(T_\nu+\lambda I)^{-1/2}(T_\nu - T_x)\|_{L(H')}$$

and

$$\Gamma_x := \|(T_x+\lambda I)^{-1/2}(T_\nu+\lambda I)^{1/2}\|_{L(H')}.$$

The error bound in the following theorem is non-asymptotic, but it requires the regularization parameter $\lambda$ and the sample size $m$ to be chosen such that the following holds:

$$\mathcal{N}(\lambda) \le m\lambda \quad \text{and} \quad \lambda \le \min\left(1, \|T_\nu\|_{L(H')}\right). \tag{8}$$

The condition (8) says that as the regularization parameter $\lambda$ decreases, the sample size $m$ must increase.
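To see how condition (8) couples the regularization parameter and the sample size, one can compute, for a given spectrum, the smallest $m$ with $\mathcal{N}(\lambda)\le m\lambda$. The sketch below assumes the hypothetical truncated spectrum $t_i = i^{-2}$ (so the operator norm is 1); the helper name `min_sample_size` is introduced here for illustration only.

```python
import math

def effective_dimension(eigs, lam):
    """N(lam) = sum_i t_i / (lam + t_i)."""
    return sum(t / (lam + t) for t in eigs)

def min_sample_size(eigs, lam):
    """Smallest integer m with N(lam) <= m * lam, for lam <= min(1, ||T||)."""
    assert 0 < lam <= min(1.0, max(eigs))   # second part of condition (8)
    return math.ceil(effective_dimension(eigs, lam) / lam)

eigs = [1.0 / (i * i) for i in range(1, 2001)]   # hypothetical spectrum, ||T|| = 1
lams = [1e-1, 1e-2, 1e-3]
sizes = [min_sample_size(eigs, lam) for lam in lams]
```

The required sample sizes in `sizes` grow as `lam` shrinks, which is the qualitative statement made about (8).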

###### Theorem 3.1.

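Let $z$ be i.i.d. samples drawn according to the probability measure $\rho$. Suppose that Assumptions 1–4 and (8) hold true and that  for some . Then, for the Tikhonov estimator $f_{z,\lambda}$ in (5) with the a-priori choice of the regularization parameter $\lambda = \Theta^{-1}_{\mathcal{N},p,q}\left(\frac{1}{\sqrt{m}}\right)$ for , for all $0<\eta<1$, the following error bound holds with confidence $1-\eta$:

$$\|f_{z,\lambda}-f_\rho\|_H = O\left( \left(\Theta^{-1}_{\mathcal{N},p,q}\left(\frac{1}{\sqrt{m}}\right)\right)^{r} \log^2\left(\frac{4}{\eta}\right) \right) \quad \text{for } r = \frac{q}{2(p+1)}.$$
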
###### Proof.

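By the definition of $f_{z,\lambda}$ as the solution of the minimization problem in (5), we have

$$\frac{1}{m}\sum_{i=1}^m \|A(f_{z,\lambda})(x_i)-y_i\|_Y^2 + \lambda\|L(f_{z,\lambda}-\bar f)\|_H^2 \le \frac{1}{m}\sum_{i=1}^m \|A(f_\rho)(x_i)-y_i\|_Y^2 + \lambda\|L(f_\rho-\bar f)\|_H^2$$

which implies

$$\|S_x A(f_{z,\lambda})-y\|_m^2 + \lambda\|L(f_{z,\lambda}-\bar f)\|_H^2 \le \|S_x A(f_\rho)-y\|_m^2 + \lambda\|L(f_\rho-\bar f)\|_H^2. \tag{9}$$

By linearizing the nonlinear operator $A$ at $f_\rho$ we get

$$A(f_{z,\lambda}) = A(f_\rho) + A'(f_\rho)(f_{z,\lambda}-f_\rho) + r(f_{z,\lambda}) \tag{10}$$

where $r(f_{z,\lambda})$ is the error term from linearizing the operator $A$ at the true solution $f_\rho$. Using this we re-express the inequality (9) as follows:

$$\|S_x A'(f_\rho)(f_{z,\lambda}-f_\rho) + \Delta_z + S_x r(f_{z,\lambda})\|_m^2 + \lambda\|L(f_{z,\lambda}-\bar f)\|_H^2 \le \|\Delta_z\|_m^2 + \lambda\|L(f_\rho-\bar f)\|_H^2.$$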

Then we have,

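$$\begin{aligned} &\|S_x A'(f_\rho)(f_{z,\lambda}-f_\rho)\|_m^2 + \|\Delta_z + S_x r(f_{z,\lambda})\|_m^2 + 2\langle S_x A'(f_\rho)(f_{z,\lambda}-f_\rho), \Delta_z + S_x r(f_{z,\lambda})\rangle_m \\ &\quad + \lambda\|L(f_{z,\lambda}-f_\rho)\|_H^2 \le 2\lambda\langle L(f_\rho-f_{z,\lambda}), L(f_\rho-\bar f)\rangle_H + \|\Delta_z\|_m^2 \end{aligned}$$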

which implies

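$$\begin{aligned} &\|I_\nu A'(f_\rho)(f_{z,\lambda}-f_\rho)\|_{L^2(X,\nu;Y)}^2 + \lambda\|L(f_{z,\lambda}-f_\rho)\|_H^2 \\ &\quad \le 2\lambda\langle L(f_\rho-f_{z,\lambda}), L(f_\rho-\bar f)\rangle_H + \|\Delta_z\|_m^2 - \|\Delta_z + S_x r(f_{z,\lambda})\|_m^2 \\ &\qquad + \langle A'(f_\rho)(f_{z,\lambda}-f_\rho), (T_\nu-T_x)A'(f_\rho)(f_{z,\lambda}-f_\rho)\rangle_{H'} \\ &\qquad - 2\langle A'(f_\rho)(f_{z,\lambda}-f_\rho), S_x^*\{\Delta_z + S_x r(f_{z,\lambda})\}\rangle_{H'}. \end{aligned} \tag{11}$$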

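Now with Assumption 4 and (4), from Lemmas A.3–A.4 we obtain

$$\begin{aligned} &\alpha^2\|f_{z,\lambda}-f_\rho\|_{H_{-p}}^2 + \lambda\|L(f_{z,\lambda}-f_\rho)\|_H^2 \\ &\quad \le \delta_1 + \sqrt{\lambda}\,\delta_2\|L(f_{z,\lambda}-f_\rho)\|_H + \delta_3\|f_{z,\lambda}-f_\rho\|_{H_{-p}} + \beta\gamma\|f_{z,\lambda}-f_\rho\|_H\|f_{z,\lambda}-f_\rho\|_{H_{-p}}^2 \end{aligned}$$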

where  and .

Under the condition (iii) of Assumption 4 using the interpolation inequality (6), we obtain

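$$\le \delta_1 + \sqrt{\lambda}\,\delta_2\|L(f_{z,\lambda}-f_\rho)\|_H + \delta_3\|f_{z,\lambda}-f_\rho\|_{H_{-p}} + 2\lambda\|f_\rho-\bar f\|_{H_q}\|f_{z,\lambda}-f_\rho\|_{H_{-p}}^{\frac{q-1}{p+1}}\|L(f_{z,\lambda}-f_\rho)\|_H^{\frac{p-q+2}{p+1}} \tag{12}$$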

which can be re-expressed as

 (13) ∥∥fz,λ−fρ∥∥2H−p =

In the analysis, we will make repeated use of the following:

$$c^r \le e + d\,c^t \quad\Rightarrow\quad c^r = O\left(e + d^{\frac{r}{r-t}}\right) \tag{14}$$

which holds for  and .
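
The implication (14) can be made explicit and checked numerically. If $c^r \le e + d\,c^t$ then either $d\,c^t \le e$, which gives $c^r \le 2e$, or $c^r \le 2d\,c^t$, which gives $c^r \le (2d)^{r/(r-t)}$; hence $c^r \le 2e + (2d)^{r/(r-t)}$. The sketch below assumes $c, d, e \ge 0$ and $r > t \ge 0$ (the parameter ranges are an assumption here), computes the largest admissible $c$ by bisection, and compares it with this bound. The function names are hypothetical.

```python
def largest_root(r, t, d, e):
    """Largest c >= 0 with c**r <= e + d * c**t (assumes r > t >= 0 and d, e >= 0)."""
    h = lambda c: c ** r - e - d * c ** t
    hi = 1.0
    while h(hi) <= 0:          # grow the bracket until the inequality fails
        hi *= 2.0
    lo = 0.0
    for _ in range(200):       # bisection: h(lo) <= 0 < h(hi) is kept invariant
        mid = 0.5 * (lo + hi)
        if h(mid) <= 0:
            lo = mid
        else:
            hi = mid
    return lo

def bound(r, t, d, e):
    """Explicit constant version of (14): 2e + (2d)**(r / (r - t))."""
    return 2 * e + (2 * d) ** (r / (r - t))

cases = [(2.0, 1.0, 3.0, 1.0), (2.0, 0.5, 0.7, 0.01), (2.0, 1.5, 5.0, 2.0)]
checks = [(largest_root(*p) ** p[0], bound(*p)) for p in cases]
```

In each case the first entry of the pair (the extremal value of $c^r$) stays below the second (the explicit bound).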

We apply this inequality to the estimate (13) for  and . First we take  and and we obtain

Then we choose  and  and we get

 (15)

where .

Replacing the term that contains  on the right-hand side in (12) and using the inequality  for  we obtain

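$$\begin{aligned} \|L(f_{z,\lambda}-f_\rho)\|_H^2 = O\Bigg( &\frac{\delta_1}{\lambda} + \frac{\delta_3\delta_4}{\lambda} + \frac{\delta_2}{\sqrt{\lambda}}\|L(f_{z,\lambda}-f_\rho)\|_H + \frac{\delta_2^{\frac12}\delta_3}{\lambda^{\frac34}}\|L(f_{z,\lambda}-f_\rho)\|_H^{\frac12} \\ &+ \delta_4^{\frac{q-1}{p+1}}\|L(f_{z,\lambda}-f_\rho)\|_H^{\frac{p-q+2}{p+1}} + \lambda^{\frac{q-1}{2p-q+3}}\|L(f_{z,\lambda}-f_\rho)\|_H^{\frac{2(p-q+2)}{2p-q+3}} \Bigg). \end{aligned}$$

Applying (14) repeatedly for  and  and  we obtain

$$\begin{aligned} \|L(f_{z,\lambda}-f_\rho)\|_H^2 = O\Bigg( &\frac{1}{\lambda}\left\{\delta_1 + \delta_3\delta_4 + \delta_2^2 + \delta_2^{\frac23}\delta_3^{\frac43}\right\} + \lambda^{\frac{q-1}{2p+q+1}}\delta_2^{\frac{2(q-1)}{2p+q+1}} \\ &+ \lambda^{\frac{2(q-p-2)}{3p-q+4}}\delta_3^{\frac{2(2p-q+3)}{3p-q+4}} + \delta_4^{\frac{2(q-1)}{p+q}} + \lambda^{\frac{q-1}{p+1}} \Bigg), \end{aligned}$$

$$\|L(f_{z,\lambda}-f_\rho)\|_H^2 = O\left( \frac{\delta^2}{\lambda} + \lambda^{\frac{q-1}{2p+q+1}}\delta^{\frac{2(q-1)}{2p+q+1}} + \lambda^{\frac{2(q-p-2)}{3p-q+4}}\delta^{\frac{2(2p-q+3)}{3p-q+4}} + \delta^{\frac{2(q-1)}{q+p}} + \lambda^{\frac{q-1}{p+1}} \right), \tag{16}$$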

where .

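Under the condition (8), from Propositions A.1–A.2 we get with probability $1-\eta$,

$$\delta = O\left( \left\{ \frac{1}{m\sqrt{\lambda}} + \sqrt{\frac{\mathcal{N}(\lambda)}{m}} \right\} \log^2\left(\frac{4}{\eta}\right) \right). \tag{17}$$

Under the condition (8), the spectral decomposition of the operator $T_\nu$ gives

$$\mathcal{N}(\lambda) \ge \frac{\|T_\nu\|_{L(H')}}{\lambda + \|T_\nu\|_{L(H')}} \ge \frac{1}{2} \quad \text{for } \lambda \le \|T_\nu\|_{L(H')}. \tag{18}$$

From (18) and then $\mathcal{N}(\lambda)\le m\lambda$ in (8) we get

$$\frac{1}{m\sqrt{\lambda}} \le \frac{2\,\mathcal{N}(\lambda)}{m\sqrt{\lambda}} \le 2\sqrt{\frac{\mathcal{N}(\lambda)}{m}}. \tag{19}$$

Hence we get

$$\delta = O\left( \sqrt{\frac{\mathcal{N}(\lambda)}{m}}\, \log^2\left(\frac{4}{\eta}\right) \right). \tag{20}$$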

By balancing the error terms in (16), we consider the parameter choice  for . We have with the probability ,

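$$\|L(f_{z,\lambda}-f_\rho)\|_H = O\left( \lambda^{\frac{q-1}{2(p+1)}} \log^2\left(\frac{4}{\eta}\right) \right) = O\left( \left(\Theta^{-1}_{\mathcal{N},p,q}\left(\frac{1}{\sqrt{m}}\right)\right)^{\frac{q-1}{2(p+1)}} \log^2\left(\frac{4}{\eta}\right) \right)$$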