Dimension-free PAC-Bayesian bounds for matrices, vectors, and linear least squares regression

12/07/2017 ∙ by Olivier Catoni, et al.

This paper is focused on dimension-free PAC-Bayesian bounds, under weak polynomial moment assumptions, allowing for heavy-tailed sample distributions. It covers the estimation of the mean of a vector or a matrix, with applications to least squares linear regression. Special efforts are devoted to the estimation of Gram matrices, due to their prominent role in high-dimensional data analysis.


1 Introduction

The subject of this paper is to discuss dimension-free PAC-Bayesian bounds for matrices and vectors. It comes after Catoni (2016) and Giulini (2017a), the first paper discussing dimension-dependent bounds and the second one dimension-free bounds, under a kurtosis-like assumption about the data distribution. Here, in contrast, we envision even weaker assumptions, and focus on dimension-free bounds only.

Our main objective is the estimation of the mean of a random vector and of a random matrix. Finding sub-Gaussian estimators for the mean of a not necessarily sub-Gaussian random vector has been the subject of much research in the last few years, with important contributions from Joly, Lugosi and Oliveira (2017), Lugosi and Mendelson (2017) and Minsker (2015). While in Joly, Lugosi and Oliveira (2017) the statistical error bound still has a residual dependence on the dimension of the ambient space, in Lugosi and Mendelson (2017) this dependence is removed, for an estimator of the median of means type. However, this estimator is not easy to compute and the bound contains large constants. We propose here another type of estimator, that can be seen as a multidimensional extension of Catoni (2012). It provides a nonasymptotic confidence region with the same diameter (including the values of the constants) as the Gaussian concentration inequality stated in equation (1.1) of Lugosi and Mendelson (2017), although in our case, the confidence region is not necessarily a ball, but still a convex set. The Gaussian bound concerns the estimation of the expectation of a Gaussian random vector by the mean of an i.i.d. sample, whereas in our case, we only assume that the variance is finite, a much weaker hypothesis.

In Minsker (2016) the question of estimating the mean of a random matrix is addressed. The author uses exponential matrix inequalities in order to extend Catoni (2012) to matrices and to control the operator norm of the error. In the bounds at confidence level $1 - \delta$, the complexity term is multiplied by $\log(\delta^{-1})$. Here, we extend Catoni (2012) using PAC-Bayesian bounds to measure complexity, and define an estimator with a bound where the $\log(\delta^{-1})$ term is multiplied by some directional variance term only, and not by the complexity factor, which is larger.

After recalling in Section 2 the PAC-Bayesian inequality that will be at the heart of many of our proofs, we deal successively with the estimation of the mean of a random vector (Section 3) and of a random matrix (Section 4). Section 6 is devoted to the estimation of the Gram matrix, due to its prominent role in multidimensional data analysis. In Section 7 we introduce some applications to least squares regression.

2 A well-known PAC-Bayesian inequality

This is a preliminary section, where we state the PAC-Bayesian inequality that we will use throughout this paper to obtain deviation inequalities holding uniformly with respect to some parameter.

Consider a random variable $W$ taking its values in a measurable space $\mathcal{W}$ and a measurable parameter space $\Theta$. Let $\pi$ be a probability measure on $\Theta$ and $f : \mathcal{W} \times \Theta \to \mathbb{R}$ a bounded measurable function. For any other probability measure $\rho$ on $\Theta$, define the Kullback divergence function as usual by the formula
$$
\mathcal{K}(\rho, \pi) = \begin{cases} \displaystyle \int \log \Bigl( \frac{\mathrm{d}\rho}{\mathrm{d}\pi} \Bigr) \, \mathrm{d}\rho, & \text{when } \rho \ll \pi, \\ + \infty, & \text{otherwise.} \end{cases}
$$
Let $W_1, \dots, W_n$ be independent copies of $W$.

Proposition 2.1.

For any $\delta \in (0, 1)$, with probability at least $1 - \delta$, for any probability measure $\rho$ on $\Theta$,
$$
\int \sum_{i=1}^{n} f(W_i, \theta) \, \mathrm{d}\rho(\theta) \;\le\; n \int \log \Bigl( \mathbb{E} \bigl[ \exp f(W, \theta) \bigr] \Bigr) \, \mathrm{d}\rho(\theta) + \mathcal{K}(\rho, \pi) + \log(\delta^{-1}).
$$

Proof.

It is a consequence of equation (5.2.1) page 159 of Catoni (2004). Indeed, let us recall the identity
$$
\log \int \exp \bigl( h(\theta) \bigr) \, \mathrm{d}\pi(\theta) = \sup_{\rho} \Bigl\{ \int h(\theta) \, \mathrm{d}\rho(\theta) - \mathcal{K}(\rho, \pi) \Bigr\},
$$
where $h$ may be any bounded measurable function (extensions to unbounded $h$ are possible but will not be required in this paper), and where the supremum in $\rho$ is taken on all probability measures on the measurable parameter space $\Theta$. The proof may be found in (Catoni, 2004, page 159). Combined with Fubini's lemma, it yields
$$
\mathbb{E} \exp \biggl( \sup_{\rho} \Bigl\{ \int \sum_{i=1}^{n} \bigl[ f(W_i, \theta) - \log \mathbb{E} \exp f(W, \theta) \bigr] \, \mathrm{d}\rho(\theta) - \mathcal{K}(\rho, \pi) \Bigr\} \biggr) = \mathbb{E} \int \prod_{i=1}^{n} \frac{\exp f(W_i, \theta)}{\mathbb{E} \exp f(W, \theta)} \, \mathrm{d}\pi(\theta) = 1.
$$
Since $\mathbb{E} \exp(Z) \le 1$ implies that
$$
\mathbb{P} \bigl( Z \ge \log(\delta^{-1}) \bigr) \le \delta,
$$
we obtain the desired result, considering
$$
h(\theta) = \sum_{i=1}^{n} \bigl[ f(W_i, \theta) - \log \mathbb{E} \exp f(W, \theta) \bigr]. \qquad \text{∎}
$$
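As a quick sanity check of the variational identity used in this proof, the following Python snippet (an illustration added here, not part of the paper) verifies on a finite parameter space that $\log \int \exp(h) \, \mathrm{d}\pi$ coincides with the supremum of $\int h \, \mathrm{d}\rho - \mathcal{K}(\rho, \pi)$, which is attained by the Gibbs measure proportional to $\pi \exp(h)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite parameter space with K points, prior pi, bounded function h(theta).
K = 50
pi = rng.dirichlet(np.ones(K))           # prior probability measure on Theta
h = rng.uniform(-2.0, 2.0, size=K)       # a bounded measurable function of theta

def kl(rho, pi):
    """Kullback divergence K(rho, pi) between discrete probability measures."""
    mask = rho > 0
    return float(np.sum(rho[mask] * np.log(rho[mask] / pi[mask])))

# Left-hand side of the identity: log of the integral of exp(h) with respect to pi.
lhs = np.log(np.sum(pi * np.exp(h)))

# The supremum of  int h d(rho) - K(rho, pi)  over rho is attained by the
# Gibbs measure rho* proportional to pi * exp(h), and equals the left-hand side.
rho_star = pi * np.exp(h)
rho_star /= rho_star.sum()
sup_value = float(rho_star @ h) - kl(rho_star, pi)
print(lhs, sup_value)                     # the two values coincide

# Any other probability measure rho gives a smaller value.
for _ in range(5):
    rho = rng.dirichlet(np.ones(K))
    assert rho @ h - kl(rho, pi) <= lhs + 1e-12
```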

3 Estimation of the mean of a random vector

Let $X \in \mathbb{R}^d$ be a random vector and let $X_1, \dots, X_n$ be independent copies of $X$. In this section, we will estimate the mean $\mathbb{E}(X)$ and obtain dimension-free non-asymptotic bounds for the estimation error.

Let be the unit sphere of and let

be the identity matrix of size

. Let

be the normal distribution centered at

, whose covariance matrix is , where is a positive real parameter.

Instead of estimating directly the mean vector $\mathbb{E}(X)$, our strategy will be rather to estimate its component $\langle \mathbb{E}(X), \theta \rangle$ in each direction $\theta$ of the unit sphere. For this, we introduce the estimator $\hat{r}(\theta)$ of $\langle \mathbb{E}(X), \theta \rangle$ defined as
$$
\hat{r}(\theta) = \frac{1}{n \lambda} \sum_{i=1}^{n} \int \psi \bigl( \lambda \langle \theta', X_i \rangle \bigr) \, \mathrm{d}\rho_\theta(\theta'),
$$

where $\psi$ is the symmetric influence function
$$
\psi(t) = \begin{cases} t - t^3/6, & - \sqrt{2} \le t \le \sqrt{2}, \\ 2\sqrt{2}/3, & t > \sqrt{2}, \\ - 2\sqrt{2}/3, & t < - \sqrt{2}, \end{cases} \tag{1}
$$
and where the positive constants $\lambda$ and $\beta$ will be chosen afterward.

As stated in the following lemma, we chose this influence function because it is close to the identity in a neighborhood of zero and is such that $\exp(\psi)$ and $\exp(-\psi)$ are bounded by polynomial functions.

Lemma 3.1.

For any $t \in \mathbb{R}$,
$$
- \log \bigl( 1 - t + t^2/2 \bigr) \;\le\; \psi(t) \;\le\; \log \bigl( 1 + t + t^2/2 \bigr).
$$

Proof.

Put . Remark that for and that for . As and

proving that

Since is increasing on and decreasing on , while is constant on these two intervals, the above inequality can be extended to all . From the symmetry , we deduce the converse inequality

that ends the proof. ∎
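As a numerical counterpart to Lemma 3.1, the following short Python check (an illustration only, taking $\psi$ as in equation (1)) verifies the two polynomial bounds on a grid of values.

```python
import numpy as np

def psi(t):
    """Symmetric influence function of equation (1):
    t - t^3/6 on [-sqrt(2), sqrt(2)], constant outside."""
    t = np.asarray(t, dtype=float)
    cap = 2.0 * np.sqrt(2.0) / 3.0
    inner = t - t**3 / 6.0
    return np.where(t > np.sqrt(2.0), cap,
           np.where(t < -np.sqrt(2.0), -cap, inner))

# Numerical check of Lemma 3.1:
#   -log(1 - t + t^2/2) <= psi(t) <= log(1 + t + t^2/2)  for all t.
t = np.linspace(-10.0, 10.0, 200001)
upper = np.log(1.0 + t + t**2 / 2.0)
lower = -np.log(1.0 - t + t**2 / 2.0)
assert np.all(psi(t) <= upper + 1e-12)
assert np.all(lower <= psi(t) + 1e-12)
print("Lemma 3.1 bounds hold on the tested grid.")
```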

Since $\langle \theta', x \rangle$, for $\theta' \sim \rho_\theta$, follows a normal distribution with mean $\langle \theta, x \rangle$ and standard deviation $\beta^{-1/2} \lVert x \rVert$, and since the influence function $\psi$ is piecewise polynomial, the estimator $\hat{r}(\theta)$ can be computed explicitly in terms of the standard normal distribution function. This is done in the following lemma.

Lemma 3.2.

Let be a standard Gaussian real valued random variable. For any and any , define

The function can be computed as

where, introducing , , the correction term is

Remark that the correction term is small when is small and is small, since

Proof.

The proof of this lemma is a simple computation, based on the expression

on the identities

and on the fact that . ∎

Accordingly, the estimator can be computed as
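For readers who prefer a numerical route, the following Python sketch evaluates the same directional estimate by Gauss-Hermite quadrature of the one-dimensional Gaussian expectation instead of the closed form of Lemma 3.2. The function names and the toy values of $\lambda$ and $\beta$ are illustrative only; the constants prescribed by Proposition 3.3 below are not reproduced here.

```python
import numpy as np

def psi(t):
    """Influence function of equation (1) (same as in the previous sketch)."""
    t = np.asarray(t, dtype=float)
    cap = 2.0 * np.sqrt(2.0) / 3.0
    return np.where(t > np.sqrt(2.0), cap,
           np.where(t < -np.sqrt(2.0), -cap, t - t**3 / 6.0))

def directional_estimate(X, theta, lam, beta, n_nodes=80):
    """Directional estimate of <E(X), theta> in the spirit of Section 3:
    average over the sample of E_{theta' ~ N(theta, I/beta)} psi(lam <theta', X_i>),
    divided by lam.  Under rho_theta, <theta', x> ~ N(<theta, x>, |x|^2 / beta),
    so each expectation is a one-dimensional Gaussian integral, evaluated here
    by Gauss-Hermite quadrature rather than by the closed form of Lemma 3.2."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    X = np.asarray(X, dtype=float)
    mean = X @ theta                           # <theta, X_i> for each sample point
    sd = np.linalg.norm(X, axis=1) / np.sqrt(beta)
    # E[psi(lam * Z)] for Z ~ N(mean_i, sd_i^2), one row per sample point.
    z = mean[:, None] + np.sqrt(2.0) * sd[:, None] * nodes[None, :]
    expect = (psi(lam * z) @ weights) / np.sqrt(np.pi)
    return expect.mean() / lam

# Toy illustration on a heavy-tailed sample with true mean e_1.
rng = np.random.default_rng(1)
d, n = 20, 10000
mean_vec = np.zeros(d); mean_vec[0] = 1.0
X = mean_vec + rng.standard_t(df=3, size=(n, d))
theta = np.zeros(d); theta[0] = 1.0
print(directional_estimate(X, theta, lam=0.05, beta=100.0))
```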

3.1 Estimation without centering

Proposition 3.3.

Assume that

and

where and are two known constants and where is an arbitrary symmetric subset of the unit sphere, meaning that if then . Choose any confidence parameter and set the constants and used in the definition of the estimator to

Non asymptotic confidence region: With probability at least ,

Consider an estimator of satisfying

With probability at least , such a vector exists and

Remark 3.1.

In particular in the case when is the whole unit sphere, we obtain with probability at least the bound

By choosing as the middle of a diameter of the confidence region, we could do a little better and replace the factor in this bound by a factor .

Proof.

According to the PAC-Bayesian inequality of Proposition 2.1, with probability at least , for any ,

We can then use the polynomial approximation of given by Lemma 3.1, remarking that and that , to deduce that

We conclude by considering both and to get the reverse inequality, using the assumption that is symmetric and remarking that .
The existence, with the stated probability, of a vector satisfying the required inequality is guaranteed by the fact that, on the event defined by the above PAC-Bayesian inequality, the expectation $\mathbb{E}(X)$ belongs to the confidence region, which, as a result, cannot be empty. ∎

3.2 Centered estimate

The bounds in the previous section are simple, but they are stated in terms of uncentered moments of order two where we would have expected a variance. In this section, we explain how to deduce centered bounds from the uncentered bounds of the previous section, through the use of a sample splitting scheme.
Assume that

and

where and are known constants. Remark that when these bounds hold, the bounds

(2)

hold in the previous section. Assume that we know also some bound such that

Split the sample in two parts and . Use the first part to construct an estimator of as described in Proposition 3.3, choosing . According to this proposition and by equation (2), with probability at least ,

where we have put .
We then construct an estimator of , , built as described in Proposition 3.3, based on the sample and on the constants and . With probability at least ,

and we can, if needed, deduce from an estimator such that with probability at least ,

If we want the correction term to behave as a second order term when tends to , we can for example take , in which case is equivalent to at infinity, so that is equivalent to

Let us also mention that a simpler estimator, obtained by shrinking the norm of the observations, is also possible. It comes with a sub-Gaussian deviation bound under the slightly stronger hypothesis that a moment of order higher than two is finite (for some not necessarily integer exponent), and is described in Catoni and Giulini (2017).
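Returning to the sample splitting scheme of this section, the following Python sketch shows its overall structure. The preliminary estimator `robust_mean` is only a placeholder standing in for the construction of Proposition 3.3, whose constants are not reproduced here.

```python
import numpy as np

def robust_mean(X):
    """Placeholder for the preliminary estimator of Proposition 3.3.
    Any estimator returning a point of the confidence region would do;
    the coordinate-wise median is used here only to keep the sketch short."""
    return np.median(X, axis=0)

def centered_estimate(X, estimator):
    """Two-stage scheme of Section 3.2: split the sample, estimate the mean on
    the first half, recenter the second half around that preliminary estimate,
    re-estimate on the centered data, and add the preliminary estimate back."""
    n = X.shape[0]
    first, second = X[: n // 2], X[n // 2 :]
    m1 = estimator(first)           # preliminary estimate from the first half
    m2 = estimator(second - m1)     # estimate of E(X) - m1 from the second half
    return m1 + m2

rng = np.random.default_rng(2)
X = 5.0 + rng.standard_t(df=3, size=(20000, 10))   # heavy-tailed, true mean = 5
print(centered_estimate(X, robust_mean))
```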

4 Mean matrix estimate

Let $M$ be a random matrix and let $M_1, \dots, M_n$ be independent copies of $M$. In this section, we will provide an estimator of the mean $\mathbb{E}(M)$.

From the previous section, we already have an estimator of $\mathbb{E}(M)$ with a bounded Hilbert-Schmidt norm error, since from the point of view of the Hilbert-Schmidt norm, $M$ is nothing but a random vector. Here, we will be interested in another natural norm, the operator norm
$$
\lVert A \rVert_\infty = \sup \bigl\{ \langle A v, u \rangle : \lVert u \rVert = \lVert v \rVert = 1 \bigr\}.
$$
Indeed, recalling that
$$
\langle A v, u \rangle = \langle A, u v^\top \rangle_{\mathrm{HS}},
$$
we see that we can deduce results from the previous section on vectors, considering the scalar product between matrices
$$
\langle A, B \rangle_{\mathrm{HS}} = \mathrm{Tr} \bigl( A^\top B \bigr)
$$
and the part of the unit sphere defined as
$$
\Theta = \bigl\{ u v^\top : \lVert u \rVert = \lVert v \rVert = 1 \bigr\}.
$$

Doing so, we obtain in the uncentered case a bound of the form

We will show in the next section that the second, $\delta$-dependent, term is satisfactory, whereas the first, $\delta$-independent, term can be improved.
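The reduction from the operator norm to scalar products against rank-one matrices can be checked numerically. The following Python snippet (an illustration only) verifies that $\langle A, u v^\top \rangle_{\mathrm{HS}} = \langle A v, u \rangle$ and that its supremum over unit vectors is the largest singular value.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))

# Operator norm = largest singular value.
op_norm = np.linalg.norm(A, ord=2)

# <A, u v^T>_HS = trace(A^T u v^T) = u^T A v, maximised by the top singular pair.
U, s, Vt = np.linalg.svd(A)
u, v = U[:, 0], Vt[0, :]
print(op_norm, u @ A @ v, np.trace(A.T @ np.outer(u, v)))

# Random unit rank-one directions never exceed the operator norm.
for _ in range(1000):
    uu = rng.standard_normal(6); uu /= np.linalg.norm(uu)
    vv = rng.standard_normal(4); vv /= np.linalg.norm(vv)
    assert abs(uu @ A @ vv) <= op_norm + 1e-10
```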

4.1 Estimation without centering

Consider the influence function $\psi$ defined by equation (1).

For any , let , where is the identity matrix of size . In the same way, let , . Consider the estimator of defined as

Proposition 4.1.

For any parameters , , with probability at least , for any and any ,

Proof.

The PAC-Bayesian inequality of Proposition 2.1 tells us that with probability at least , for any and any ,

Using the properties of (Lemma 3.1) and Fubini’s lemma, we get

As

this concludes the proof. ∎

Let us now discuss the question of computing . Remark that, according to Lemma 3.2, for any ,

It is also easy to check that

Consider a standard random vector . We obtain that

so that

The last term is not explicit, since it contains an expectation, but it should most of the time be a small remainder and can be evaluated using a Monte-Carlo numerical scheme. This gives a more explicit and efficient method than evaluating the estimator directly using a Monte-Carlo simulation for the couple of random variables involved.
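As an illustration of such a Monte-Carlo evaluation, the following Python sketch estimates an expectation with respect to a standard Gaussian vector together with its standard error. The integrand `remainder_term` is a hypothetical placeholder, since the exact expression of the correction term is not reproduced here.

```python
import numpy as np

def remainder_term(u):
    """Hypothetical integrand standing in for the non-explicit correction term;
    replace it with the actual expression from the text."""
    return np.minimum(np.linalg.norm(u, axis=1) ** 3, 10.0) * 1e-3

def monte_carlo(f, dim, n_draws=100000, seed=0):
    """Estimate E[f(U)] for U ~ N(0, I_dim) together with its standard error."""
    rng = np.random.default_rng(seed)
    values = f(rng.standard_normal((n_draws, dim)))
    return values.mean(), values.std(ddof=1) / np.sqrt(n_draws)

estimate, std_err = monte_carlo(remainder_term, dim=10)
print(f"remainder ~ {estimate:.6f} +/- {std_err:.6f}")
```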

Proposition 4.2.

Assume that the following finite bounds are known

and choose

For any values of , , with probability at least , for any , any ,

Consider now any estimator of . With probability at least ,

In particular, if we choose such that,

with probability at least , this choice is possible and

Remark 4.1.

In particular, choosing , we get

The bound is of the type , with a complexity (or dimension) term equal to

Remark 4.2.

Let us envision a simple case to compare the precision of the bounds in a setting where dimension-free and dimension-dependent bounds coincide. Assume more specifically that the entries of the matrix $M$ are centered and i.i.d. Assume that the variance of the entries is known, and take

Choosing , we get a complexity term equal to

whereas the bound of the previous section made for vectors has a complexity factor equal to .

4.2 Controlling both the operator norm error and the Hilbert-Schmidt error

There are situations where it is desirable to control both the operator norm error and the Hilbert-Schmidt norm error. To do so we can very easily combine Propositions 3.3 and 4.2, since these two propositions are based on the construction of confidence regions.

More precisely, first consider as a vector and use the scalar product

Applying Proposition 3.3, we can build an estimator such that with probability at least ,

On the other hand, we can also apply Proposition 4.2 and build an estimator , such that with probability at least ,

Proposition 4.3.

Consider a matrix such that

Combining Propositions 3.3 and 4.2 shows that, with probability at least , such a matrix exists and satisfies both

Remark that is typically smaller than as expected in interesting large dimension situations.

4.3 Centered estimator

As already done in the case of the estimation of the mean of a random vector, we deduce in this section centered bounds from the uncentered bounds of the previous sections, using sample splitting.

Put and . Assume that we know finite constants such that

When this is true, we can take for the previous uncentered constants

In view of this, it is suitable to assume that we also know some finite constants and such that

As we see that the Hilbert-Schmidt norm comes into play, we will use the combined preliminary estimate provided by Proposition 4.3.

Given an i.i.d. matrix sample , first use to build a preliminary estimator as described in Proposition 4.3. With probability at least ,

Then use the sample to build an estimator based on the construction described in Proposition 4.2, at confidence level . It is such that with probability at least ,

If we choose for instance , we obtain that