Convergence analysis of Tikhonov regularization for non-linear statistical inverse learning problems

02/14/2019 ∙ by Abhishake Rastogi, et al. ∙ Weierstrass Institute ∙ Universität Potsdam

We study a non-linear statistical inverse learning problem, where we observe the noisy image of a quantity through a non-linear operator at some random design points. We consider the widely used Tikhonov regularization (or method of regularization, MOR) approach to construct an estimator of the quantity for the non-linear ill-posed inverse problem. The estimator is defined as the minimizer of a Tikhonov functional, which is the sum of a data misfit term and a quadratic penalty term. We develop a theoretical analysis for the minimizer of the Tikhonov regularization scheme using the ansatz of reproducing kernel Hilbert spaces. We discuss optimal rates of convergence for the proposed scheme, uniformly over classes of admissible solutions defined through appropriate source conditions.


1. Introduction

In this study, we shall consider non-linear operator equations of the form

(1)

where the non-linear mapping  is acting between the real separable Hilbert spaces  and . Such non-linear inverse problems occur in many situations, and examples are given in the seminal monograph [9]. Of special importance are problems of parameter identification in partial differential equations, and we mention the monograph [12, Chapt. 1], and the more recent [25].

Within the classical setup it is assumed that one observes noisy data  with , where the number  denotes the noise level. In supervised learning, it is assumed that the image space  consists of functions, given on some domain  and taking values in another Hilbert space . Moreover, function evaluation is continuous, such that for  the values  are well-defined elements in . The goal is to learn the unknown and indirectly observed quantity  from examples, given in the form of i.i.d. samples , where the elements  are noisy observations of  at random points  of the form

(2)

We assume that the random observations of  are drawn independently and identically according to some unknown joint probability distribution  on the sample space . The noise terms  are independent centered random variables satisfying . The cardinality  of the samples  is called the sample size.
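To fix ideas, the following sketch simulates data from the sampling model (2). The particular non-linear forward map, the true quantity, the design distribution and the noise law used here are illustrative assumptions, not part of the model above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (assumptions): a true quantity, a pointwise non-linear
# forward map, a uniform random design and centred Gaussian noise.
def f_true(t):
    return np.sin(2 * np.pi * t)

def A_of_f(f, x):
    # toy non-linear operator acting pointwise: A(f)(x) = tanh(f(x))
    return np.tanh(f(x))

n = 200                                   # sample size
x = rng.uniform(0.0, 1.0, size=n)         # random design points x_i drawn from the marginal distribution
eps = 0.05 * rng.standard_normal(n)       # centred noise with E[eps_i | x_i] = 0
y = A_of_f(f_true, x) + eps               # noisy observations of A(f)(x_i), cf. (2)
```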

In the case of random observations, the literature is much more scarce than for the classical setup. Milestone work includes [19], which considers asymptotic analysis for the generalized Tikhonov regularization for (3) using the linearization technique. The reference [3] considers a two-step approach; however, it is assumed that the norm in  (the space of square integrable functions with respect to the probability measure  on ) is observable, an unrealistic assumption if the only information on  is available through the points . The references [1] and [11] consider respectively a Gauss-Newton algorithm and the MOR method for certain non-linear inverse problems, but also in the idealized setting of Hilbertian white or colored noise, which can only cover sampling effects when  is known. Loubes et al. [14] consider (3) under a fixed design and concentrate on the problem of model selection. Finally, the recent work [23] analyzes rates of convergence in a model where observations are of the form  perturbed by noise, but only in a white noise model and for specific, univariate non-linear link functions  and a linear operator .

A widely used approach to stabilizing the estimation problem (2) is Tikhonov regularization, also known as the regularized least-squares algorithm or the method of regularization (MOR). The estimate of the true solution of (2) is obtained by minimizing an objective function consisting of an error term measuring the fit to the data plus a smoothness term measuring the complexity of the quantity . For the non-linear statistical inverse learning problem (2), the regularization scheme over the hypothesis space  can be described as

(3)

Here  denotes some initial guess of the true solution, which offers the possibility to incorporate a priori information. The regularization parameter  is positive and controls the trade-off between the error term, measuring the fit to the data, and the complexity of the solution, measured in the norm of .
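As a simplified numerical illustration of the scheme (3), the sketch below performs gradient descent on an empirical Tikhonov functional over a finite kernel expansion f = Σ_j α_j K(·, t_j). The Gaussian kernel, the grid of expansion centres, the pointwise non-linearity A(f)(x) = tanh(f(x)) and the optimizer are all assumptions made for illustration; the paper itself only requires the abstract minimizer of (3).

```python
import numpy as np

def gauss_kernel(a, b, gamma=20.0):
    """A Gaussian reproducing kernel on [0, 1] (an illustrative choice)."""
    return np.exp(-gamma * np.subtract.outer(a, b) ** 2)

def tikhonov_fit(x, y, lam, n_centres=30, steps=5000, lr=0.05):
    """Gradient descent on  (1/n) sum_i (A(f)(x_i) - y_i)^2 + lam * ||f||_H^2
    for f = sum_j alpha_j K(., t_j) and the toy non-linearity A(f)(x) = tanh(f(x))."""
    t = np.linspace(0.0, 1.0, n_centres)      # expansion centres (an assumption)
    K_xt = gauss_kernel(x, t)                 # point evaluations K(x_i, t_j)
    K_tt = gauss_kernel(t, t)                 # Gram matrix: ||f||_H^2 = alpha^T K_tt alpha
    alpha = np.zeros(n_centres)               # initial guess corresponds to f_bar = 0
    n = len(x)
    for _ in range(steps):
        fx = K_xt @ alpha
        res = np.tanh(fx) - y                 # data misfit A(f)(x_i) - y_i
        grad = (2.0 / n) * K_xt.T @ (res * (1.0 - np.tanh(fx) ** 2)) \
               + 2.0 * lam * (K_tt @ alpha)   # gradient of misfit plus quadratic penalty
        alpha -= lr * grad
    return t, alpha
```

With data (x, y) as in the sampling sketch above, `t, alpha = tikhonov_fit(x, y, lam=1e-3)` yields an estimator f_λ(s) = Σ_j α_j K(s, t_j); the analysis below quantifies how fast such a minimizer approaches the true solution as the sample size grows and λ is chosen accordingly.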

The objective of this paper is to analyze the theoretical properties of the regularized least-squares estimator ; in particular, the asymptotic performance of the algorithm is evaluated via bounds and rates of convergence of the regularized least-squares estimator  in the reproducing kernel ansatz. Precisely, we develop a non-asymptotic analysis of Tikhonov regularization (3) for the non-linear statistical inverse learning problem, based on tools that have been developed for the modern mathematical study of reproducing kernel methods. The challenges specific to the studied problem are that the considered model is an inverse problem (rather than a pure prediction problem) and non-linear. The upper rate of convergence for the regularized least-squares estimator  to the true solution is described in a probabilistic sense by exponential tail inequalities. For sample size , a positive decreasing function , and confidence level , we establish bounds of the form

The function  describes the rate of convergence as . The upper rate of convergence is complemented by a minimax lower bound, valid for any learning algorithm for the considered non-linear statistical inverse problem. The lower rate result shows that the error rate attained by the Tikhonov regularization scheme, for a suitable choice of the regularization parameter, is optimal on a suitable class of probability measures.

Now we review previous results concerning regularization algorithms for different learning schemes which are directly comparable to our results: Rastogi et al. [22] and Blanchard et al. [4]. For convenience, we present the most essential points in a unified way in Table 1.

                        Smoothness                  Scheme                                                     Optimal rates
Rastogi et al. [22]     general source condition    General regularization for direct learning
Blanchard et al. [4]    Hölder source condition     General regularization for linear inverse learning
Our Results             general source condition    Tikhonov regularization for non-linear inverse learning
Table 1. Convergence rates of the regularized least-squares algorithms on different learning schemes

In this table, the parameter  corresponds to a (Hölder type) smoothness assumption for the unknown true solution, and the parameter  corresponds to the decay rate of the eigenvalues of the covariance operator, both to be introduced below in Assumption 6 and Assumption 7, respectively.

The model (2) covers non-parametric regression under random design (which we also call the direct problem, i.e., ) and the linear statistical inverse learning problem. Thus, introducing a general non-linear operator  gives a unified approach to these different learning problems. In the direct learning setting, Rastogi et al. [22] obtained minimax optimal rates of convergence for general regularization under a general source condition. Blanchard et al. [4] considered general regularization for the linear statistical inverse learning problem. They generalized the convergence analysis of the direct learning scheme to the inverse learning setting and achieved the minimax optimal rates of convergence for general regularization under a Hölder source condition. They assumed that the image of the operator  is a reproducing kernel Hilbert space, which is a special case of our general assumption that  is contained in a reproducing kernel Hilbert space. Here, we consider Tikhonov regularization for the non-linear statistical inverse learning problem. We obtain minimax optimal rates of convergence under a general source condition. The assumptions on the non-linear operator  (see Assumption 5 and the condition (11) below) allow us to estimate the error bounds for the source condition under some additional constraint, which for the Hölder source condition () corresponds to the range .

The structure of the paper is as follows. In Section 2, we introduce the basic setup and notation for supervised learning problems in a reproducing kernel Hilbert space framework. In Sections 3 and 4, we discuss the main results of this paper on consistency and error bounds of the regularized least-squares solution  under certain assumptions on the (unknown) joint probability measure , and on the (non-linear) mapping . We establish minimax rates of convergence over the regularity classes defined through appropriate source conditions by using the concept of effective dimension. In Section 5, we present a concluding discussion on some further aspects of the results. In the appendix, we establish the concentration inequalities, perturbation results and the proofs of consistency results, upper error bounds and lower error bounds.

2. Setup and basic definitions

In this section, we discuss the mathematical concepts and definitions used in our analysis. We start with a brief description of reproducing kernel Hilbert spaces, since our approximation schemes will be built in such spaces. Vector-valued reproducing kernel Hilbert spaces are an extension of real-valued reproducing kernel Hilbert spaces, see e.g. [18].

Definition 2.1.

Let  be a non-empty set,  be a real separable Hilbert space and  be a Hilbert space  of functions from  to . If the linear functional , defined by

is continuous for every  and , then  is called a vector-valued reproducing kernel Hilbert space.

For the Banach space of bounded linear operators , a function  is said to be an operator-valued positive semi-definite kernel if for each pair , and for every finite set of points  and ,

For every operator-valued positive semi-definite kernel, , there exists a unique vector-valued reproducing kernel Hilbert space  of functions from   to  satisfying the following conditions:

  1. For all  and , the function , defined by

    belongs to ; this allows us to define the linear mapping .

  2. The span of the set  is dense in .

  3. For all  and , in other words  (reproducing property).

Moreover, there is a one-to-one correspondence between operator-valued positive semi-definite kernels and vector-valued reproducing kernel Hilbert spaces [18]. In the special case when  is a bounded subset of , the reproducing kernel Hilbert space is said to be a real-valued reproducing kernel Hilbert space. In this case, the operator-valued positive semi-definite kernel becomes a symmetric, positive semi-definite kernel  and each reproducing kernel Hilbert space  can be described as the completion of the span of the set  for . Moreover, for every function  in the reproducing kernel Hilbert space , the reproducing property can be described as .
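For the real-valued case just described, the correspondence between kernels and spaces can be made concrete: a finite expansion f = Σ_j c_j K(·, x_j) has squared H-norm c⊤Gc for the Gram matrix G = [K(x_j, x_k)], and point evaluation is given by the reproducing property. The Gaussian kernel used below is just one example of a symmetric, positive semi-definite kernel; the points and coefficients are illustrative.

```python
import numpy as np

def K(a, b, gamma=5.0):
    """A symmetric, positive semi-definite (Gaussian) kernel on X = [0, 1] (illustrative choice)."""
    return np.exp(-gamma * np.subtract.outer(a, b) ** 2)

x_pts = np.array([0.1, 0.4, 0.7])                    # a finite set of points in X
G = K(x_pts, x_pts)                                  # Gram matrix G_jk = K(x_j, x_k)
assert np.all(np.linalg.eigvalsh(G) >= -1e-12)       # positive semi-definiteness of the kernel

c = np.array([1.0, -0.5, 2.0])                       # f = sum_j c_j K(., x_j) lies in the span
norm_H_sq = c @ G @ c                                # ||f||_H^2 via the reproducing property
f_at = lambda s: K(np.atleast_1d(s), x_pts) @ c      # point evaluation f(s) = <f, K(., s)>_H
print(norm_H_sq, f_at(0.55))
```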

We assume that the input space  is a Polish space and the output space  is a real separable Hilbert space. Hence, the joint probability measure  on the sample space  can be described as , where  is the conditional distribution of  given  and  is the marginal distribution on .

We specify the abstract framework for the present study. We consider that random observations  follow the model  with the centered noise .

Assumption 1 (True solution ).

The conditional expectation w.r.t.  of  given  exists (a.s.), and there exists  such that

The element  is the true solution which we aim at estimating.

Assumption 2 (Noise condition).

There exist some constants  such that for almost all ,

This Assumption is usually referred to as a Bernstein-type assumption.

Concerning the Hilbert space , we assume the following throughout the paper.

Assumption 3 (Vector valued reproducing kernel Hilbert space ).

We assume  to be a vector-valued reproducing kernel Hilbert space of functions  corresponding to the kernel  such that

  1. For all  is a Hilbert-Schmidt operator, and

    implying in particular that .

  2. The real-valued function , defined by , is measurable .

Note that in the case of real-valued functions (), Assumption 3 simplifies to the condition that the kernel is measurable and .

The operator  denotes the canonical injection map , so that

We denote by  the corresponding covariance operator.
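In the real-valued case, the spectrum of the covariance operator can be approximated from a sample: the eigenvalues of the normalised Gram matrix (1/n)[K(x_i, x_j)]_{i,j} are those of the empirical covariance operator and approach the eigenvalues of the population operator as n grows. The kernel and sampling distribution below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def K(a, b, gamma=5.0):
    return np.exp(-gamma * np.subtract.outer(a, b) ** 2)

n = 500
x = rng.uniform(0.0, 1.0, size=n)                    # design points drawn from the marginal distribution
G_n = K(x, x) / n                                    # normalised Gram matrix (1/n) [K(x_i, x_j)]_{ij}
emp_eigs = np.clip(np.linalg.eigvalsh(G_n)[::-1], 0.0, None)
print(emp_eigs[:5])                                  # leading eigenvalues of the empirical covariance operator
```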

3. Consistency

We establish consistency of Tikhonov regularization, both in the RMS sense and almost surely, in the sense that  as . For this we need only weak assumptions on the operator.

Assumption 4 (Lipschitz continuity).

We suppose that  is weakly closed with nonempty interior and that  is Lipschitz continuous and one-to-one.

The inequality  for  and the continuity of the operator  imply that  is also continuous. Since  is weakly closed,  is weakly sequentially closed (i.e., if a sequence  converges weakly to some  and the sequence  converges weakly to some , then  and ). For the continuous and weakly sequentially closed operator , there exists a global minimizer of the functional in (3), but it is not necessarily unique since  is non-linear (see [25, Section 4.1.1]).

The proofs of Theorems 3.1–3.3 will be given in Appendix B.

Theorem 3.1.

Suppose that Assumptions 1, 3, 4 hold true and . Let  denote a (not necessarily unique) solution to the minimization problem (3) and assume that the regularization parameter  is chosen such that

(4)

Then we have that

(5)
Remark 3.2.

As can be seen from the proof, the existence of arbitrary moments, as required in Assumption 2, is not needed. Instead, only the existence of second moments is used, as seen from the introduction of .

The previous result can be strengthened as follows.

Theorem 3.3.

Suppose that Assumptions 1–4 hold true. Let  denote a (not necessarily unique) solution to the minimization problem (3) and assume that the regularization parameter  is chosen such that

(6)

Then we have that

(7)

4. Convergence rates

In order to derive rates of convergence, additional assumptions are made on the operator . We need to introduce the corresponding notion of smoothness of the true solution  from Assumption 1. We discuss the class of probability measures, defined through an appropriate source condition, which describes the smoothness of the true solution.

Following the work of Engl et al. [9, Chapt. 10] on ‘classical’ non-linear inverse problems, we consider the following assumption:

Assumption 5 (Non-linearity of the operator).

We assume that  is convex with nonempty interior,  is weakly sequentially closed and one-to-one. Furthermore, we assume that

  1. is Fréchet differentiable,

  2. the Fréchet derivative  of  at  is bounded in a sufficiently large ball , i.e., there exists  such that

    and

  3. there exists  such that for all  we have,

Remark 4.1.

The condition (iii) also holds true under the stronger assumption that  is Lipschitz for the operator norm (see [9, Chapt. 10]), i.e.,

A sufficient condition for weak sequential closedness is that  is weakly closed (e.g. closed and convex) and  is weakly continuous. Note that under the Fréchet differentiability of  (Assumption 5 (ii)), the operator  is Lipschitz continuous with Lipschitz constant .

To illustrate the general setting, we consider a family of integral operators on the Sobolev space satisfying the above assumptions, where the kernel  is completely explicit.

Example 4.2.

Let  be the Sobolev space  of differential order  (based on ), for the integer , which is defined as the completion of  with respect to the norm given by:

The Sobolev space  is a reproducing kernel Hilbert space with the reproducing kernel , given by (see [24, Sec. 1.3.5])

where  is the Euclidean norm in .

It satisfies Assumption 3 with . We consider the non-linear operator  given by:

where  is -times differentiable. It can be checked that , with

where

(assumed to be finite).

The Fréchet derivative of  at  is given by

Then we have

and

so that Assumption 5 is satisfied.
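The following sketch instantiates an example of this kind numerically for m = 1, where the Sobolev space H^1(ℝ) has the explicit kernel K(x, y) = ½ e^{-|x−y|}. The particular Nemytskii-type integral operator A(f)(x) = ∫₀ˣ φ(f(t)) dt with φ = tanh is a hypothetical stand-in for the operator of Example 4.2, used only to check numerically the linearisation behaviour asserted by Assumption 5.

```python
import numpy as np

# Hypothetical instance: H = H^1(R) with kernel K(x, y) = 0.5 * exp(-|x - y|),
# and a Nemytskii-type integral operator A(f)(x) = \int_0^x phi(f(t)) dt (illustrative choice).
def sobolev_kernel(x, y):
    return 0.5 * np.exp(-np.abs(np.subtract.outer(x, y)))

phi = np.tanh                                     # smooth, bounded non-linearity (assumption)
dphi = lambda u: 1.0 - np.tanh(u) ** 2            # its derivative phi'

t = np.linspace(0.0, 1.0, 1001)
dt = t[1] - t[0]

def A(f_vals):
    """Discretised A(f)(x) = int_0^x phi(f(t)) dt via the trapezoidal rule."""
    integrand = phi(f_vals)
    return np.concatenate(([0.0], np.cumsum(0.5 * (integrand[1:] + integrand[:-1]) * dt)))

def A_prime(f_vals, h_vals):
    """Discretised Frechet derivative A'(f)h (x) = int_0^x phi'(f(t)) h(t) dt."""
    integrand = dphi(f_vals) * h_vals
    return np.concatenate(([0.0], np.cumsum(0.5 * (integrand[1:] + integrand[:-1]) * dt)))

f = np.sin(2 * np.pi * t)
h = 1e-3 * np.cos(3 * np.pi * t)
lin_err = np.max(np.abs(A(f + h) - A(f) - A_prime(f, h)))
print(lin_err)  # the Taylor remainder of the linearisation is of second order in h
```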

Under the above non-linearity assumption on the operator  we now introduce the corresponding operators which will turn out to be useful in the analysis of regularization schemes.

We recall that  denotes the canonical injection map . We define the operator

We denote by  the corresponding covariance operator. The operators  from Section 2 and  are positive, self-adjoint and compact, even trace-class, operators.

Observe that the operator  depends on  and , thus on the joint probability measure  itself. It is bounded and satisfies .

The consistency results established in Section 3 yield convergence of the minimizers  as  tends to infinity, provided the parameter  is chosen appropriately. However, the rates of convergence may be arbitrarily slow. This phenomenon is known as the no free lunch theorem [8]. Therefore, we need some prior assumptions on the probability measure  in order to achieve uniform rates of convergence for learning algorithms.

Assumption 6 (General source condition).

The true solution  belongs to the class  with

where  is a continuous increasing index function defined on the interval  with the assumption .

The general source condition , by allowing for general index functions , covers a wide range of source conditions, such as the Hölder source condition  with , and the logarithmic-type source condition  with . The source sets  are precompact sets in , since the operator  is compact. Observe that, in contrast with the linear case, in the equation  from Assumption 6 the true solution  appears on both sides, since the operator  itself depends on it (through ). This condition is more easily interpreted as a condition on the “initial guess” : the initial error  should satisfy a source condition with respect to the operator linearized at the true solution. Assumption 6 is usually referred to as a general source condition, see e.g. [17], and is a measure of regularity of the true solution . It is inspired, on the one hand, by the approach considered in previous works on statistical learning using kernels and, on the other hand, by the “classical” literature on non-linear inverse problems. The true solution  is represented in terms of the marginal probability distribution  over the input space  and of the linearized operator at the true solution, respectively; both aspects enter into Assumption 6.

Following Bauer et al. [2] and Blanchard et al. [4], we consider the class of probability measures  which satisfy both the noise condition of Assumption 2 and the smoothness condition of Assumption 6. This class depends on the observation noise distribution (reflected in the parameters ) and on the smoothness properties of the true solution  (reflected in the parameters ). For the convergence analysis, the output space need not be bounded as long as the noise condition for the output variable is fulfilled.

The class  may be further constrained by imposing properties of the covariance operator  from above. Thus we consider the set of probability measures  which also satisfy the following condition:

Assumption 7 (Eigenvalue decay condition).

The eigenvalues  of the covariance operator  follow a polynomial decay, i.e., for fixed positive constants  and ,

Now, under Assumption 5 (ii), using the relation for singular values  for  (see [20, Chapter 11]), we obtain

Hence the polynomial decay condition on the eigenvalues of the operator  implies that the eigenvalues of  also follow a polynomial decay.

We achieve optimal minimax rates of convergence using the concept of effective dimension of the operator . For the trace class operator , the effective dimension is defined as

For the infinite dimensional operator , the effective dimension is a continuously decreasing function of  from  to . For further discussion on the effective dimension, we refer to the literature [13, 15].
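For a trace-class operator with eigenvalues μ_i, the effective dimension N(λ) = Tr((T + λI)^{-1}T) = Σ_i μ_i/(μ_i + λ) is straightforward to evaluate once the spectrum is known. The polynomially decaying spectrum μ_i = i^{-b} used below is a hypothetical example matching Assumption 7.

```python
import numpy as np

def effective_dimension(mu, lam):
    """N(lam) = trace((T + lam I)^{-1} T) = sum_i mu_i / (mu_i + lam)."""
    return float(np.sum(mu / (mu + lam)))

# Hypothetical polynomially decaying spectrum mu_i = i^{-b} (cf. Assumption 7), truncated for illustration.
b = 2.0
mu = np.arange(1, 100_001, dtype=float) ** (-b)

for lam in (1e-1, 1e-2, 1e-3, 1e-4):
    print(lam, effective_dimension(mu, lam))         # grows roughly like lam^(-1/b) as lam -> 0
```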

Under Assumptions 3 and 5 (ii), the effective dimension  can trivially be estimated as follows,

(8)

However, we know from [5, Prop. 3] that, under Assumption 7, we have the improved bound

(9)
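For completeness, the following display sketches the standard computation behind a bound of this form; it is an illustrative derivation, assuming eigenvalues μ_i ≤ β i^{-b} with b > 1 and splitting the sum at i_* = (β/λ)^{1/b}, rather than a reproduction of [5, Prop. 3]:

\[
\mathcal{N}(\lambda)=\sum_{i\ge 1}\frac{\mu_i}{\mu_i+\lambda}
\le \sum_{i\le i_*} 1+\frac{1}{\lambda}\sum_{i>i_*}\beta i^{-b}
\le i_*+\frac{\beta}{\lambda}\,\frac{i_*^{\,1-b}}{b-1}
=\Big(1+\frac{1}{b-1}\Big)\Big(\frac{\beta}{\lambda}\Big)^{1/b}
=\frac{b}{b-1}\Big(\frac{\beta}{\lambda}\Big)^{1/b}.
\]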

4.1. Upper rates of convergence

In Theorems 4.3–4.4, we present the upper error bounds for the regularized least-squares solution  over the class of probability measures . We establish the error bounds for both the direct learning setting, in the sense of the -norm reconstruction error , and the inverse problem setting, in the sense of the -norm reconstruction error . Since the explicit expression of  is not known, we use the definition (3) of the regularized least-squares solution  to derive the error bounds. We use linearization techniques for the operator  in a neighborhood of the true solution , using the (Fréchet) differentiability of . We estimate the error bounds for the regularized least-squares estimator by measuring the complexity of the true solution  and the effect of random sampling. The rates of convergence are governed by the noise condition (Assumption 2), the general source condition (Assumption 6) and the ill-posedness of the problem, as measured by an assumed power decay (Assumption 7) of the eigenvalues of  with exponent . The effect of random sampling and the complexity of  are measured through Assumption 2 and Assumption 6 in Proposition A.3 and Proposition C.1, respectively. We briefly discuss two additional assumptions of the theorem. Condition (10) below says that as the regularization parameter  decreases, the sample size  must increase. This condition will be automatically satisfied under the parameter choice considered later in Theorem 4.5. The additional assumption (11) is a “smallness” condition which imposes a constraint between  and the non-linearity as measured by the parameter  in Assumption 5 (iii). In order for the latter norm to be finite for any function satisfying the source condition , it is required that  remains bounded near 0; in particular, if , that .

The error bound discussed in the following theorem holds non-asymptotically, provided the regularization parameter  is sufficiently small and the sample size  is sufficiently large. For fixed  and , we can choose a sufficiently large sample size  such that

(10)

Under the source condition  for , we have that  for . We assume that

(11)

The proofs of Theorems 4.3–4.5 will be given in Appendix C.

Theorem 4.3.

Let  be i.i.d. samples drawn according to the probability measure , where . Suppose Assumptions 1–3, 5–6 and the conditions (10), (11) hold true. Then, for all , for the regularized least-squares estimator  (not necessarily unique) in (3) with the confidence  the following upper bound holds:

and

where  and  depend on the parameters .

In the above theorem, we discussed the error bounds for the Hölder source condition (Assumption 6) with . In the following theorem, we discuss the error bound for the general source condition under suitable assumptions on the function .

Theorem 4.4.

Let  be i.i.d. samples drawn according to the probability measure , where  is an index function satisfying the conditions that  and  are nondecreasing functions. Suppose Assumptions 1–3, 5–6 and the conditions (10), (11) hold true. Then, for all , for the regularized least-squares estimator  (not necessarily unique) in (3) with the confidence  the following upper bound holds:

where  depends on the parameters .

Note that the error bounds for  in Theorem 4.3 and Theorem 4.4 are the same up to a constant factor which depends on the parameters .

In Theorems 4.3–4.4, the error estimates reveal that the error bound consists of an increasing and a decreasing function of , which leads us to propose a choice of the regularization parameter by balancing these terms. We derive the rates of convergence for the regularized least-squares estimator based on a data-independent (a priori) choice of  for the classes of probability measures  and . The effective dimension plays a crucial role in the error analysis of the regularized least-squares learning algorithm. In Theorem 4.5, we derive the rate of convergence for the regularized least-squares solution  under the general source condition  for a parameter choice rule for  based on the index function  and the sample size . For the class of probability measures , the polynomial decay condition (Assumption 7) on the spectrum of the operator  also enters the picture, and the parameter  enters the parameter choice through the estimate (9) of the effective dimension. For this class, we derive the minimax optimal rate of convergence in terms of the index function , the sample size  and the parameter .
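The balancing idea can be illustrated numerically. Assuming, purely for illustration, a Hölder approximation term λ^r and a sample term of the form √(N(λ)/(nλ)) with N(λ) ≍ λ^{-1/b}, the sketch below locates the λ at which the two terms balance and compares it with the closed-form choice n^{-b/(2br+b+1)}. The exact form of the two bound terms here is an assumption for the sketch, not the paper's bound.

```python
import numpy as np

# Hypothetical error-bound terms (Hoelder case, for illustration only):
#   approximation term   approx(lam) = lam**r                           (increasing in lam)
#   sample/noise term    sqrt(N(lam) / (n*lam)) with N(lam) ~ lam**(-1/b)  (decreasing in lam)
r, b, n = 0.5, 2.0, 10_000

approx = lambda lam: lam ** r
sample = lambda lam: np.sqrt(lam ** (-1.0 / b) / (n * lam))

lams = np.logspace(-8, 0, 4000)
lam_star = lams[np.argmin(approx(lams) + sample(lams))]   # balance the two terms numerically

print(lam_star, n ** (-b / (2 * b * r + b + 1)))          # same order as the closed-form a priori choice
print(approx(lam_star))                                   # resulting rate of order n**(-b*r / (2*b*r + b + 1))
```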

Theorem 4.5.

Under the same assumptions as in Theorem 4.4, the convergence of the regularized least-squares estimator  in (3) to the true solution  can be described as follows:

  1. For the class of probability measures  with the parameter choice  where , we have

    where  depends on the parameters  and

  2. For the class of probability measures  under Assumption 7 and the parameter choice  where , we have

    where  depends on the parameters  and

Notice that the rate given for the class  is worse than the one for the (smaller) class , which is easily seen from the fact that  for , and hence  for .

We obtain the following corollary as a consequence of Theorem 4.5.

Corollary 4.6.

Under the same assumptions as in Theorem 4.4, with the Hölder source condition , the convergence of the regularized least-squares estimator  in (3) to the true solution  can be described as follows:

  1. For the class of probability measures  with the parameter choice , for all , we have with the confidence ,

  2. For the class of probability measures  with the parameter choice , for all , we have with the confidence ,

We obtain the following corollary as a consequence of Theorem 4.3.

Corollary 4.7.

Under the same assumptions as in Theorem 4.3, with the Hölder source condition , the convergence of the regularized least-squares estimator  in (3) to the true solution  can be described as follows:

  1. For the class of probability measures  with the parameter choice , for all , we have with the confidence ,

    and

where  and  depend on the parameters .

  2. For the class of probability measures  with the parameter choice , for all , we have with the confidence ,

    and