, the first paper discussing dimension-dependent bounds and the second one dimension-free bounds, under a kurtosis-like assumption on the data distribution. Here, in contrast, we envision even weaker assumptions and focus on dimension-free bounds only.
Our main objective is the estimation of the mean of a random vector and of a random matrix. Finding sub-Gaussian estimators for the mean of a not necessarily sub-Gaussian random vector has been the subject of much research in the last few years, with important contributions from Joly, Lugosi and Oliveira (2017), Lugosi and Mendelson (2017) and Minsker (2015). While in Joly, Lugosi and Oliveira (2017) the statistical error bound still has a residual dependence on the dimension of the ambient space, in Lugosi and Mendelson (2017) this dependence is removed, for a median-of-means-type estimator. However, this estimator is not easy to compute and the bound contains large constants. We propose here another type of estimator, which can be seen as a multidimensional extension of Catoni (2012). It provides a non-asymptotic confidence region with the same diameter (including the values of the constants) as the Gaussian concentration inequality stated in equation (1.1) of Lugosi and Mendelson (2017), although in our case the confidence region is not necessarily a ball, but still a convex set. The Gaussian bound concerns the estimation of the expectation of a Gaussian random vector by the mean of an i.i.d. sample, whereas in our case we only assume that the variance is finite, a much weaker hypothesis.
In Minsker (2016) the question of estimating the mean of a random matrix is addressed. The author uses exponential matrix inequalities in order to extend Catoni (2012) to matrices and to control the operator norm of the error. In the bounds at confidence level $1 - \epsilon$, the complexity term is multiplied by $\log(\epsilon^{-1})$. Here, we extend Catoni (2012) using PAC-Bayesian bounds to measure complexity, and define an estimator with a bound where the $\log(\epsilon^{-1})$ term is multiplied by some directional variance term only, and not by the complexity factor, which is larger.
After recalling in Section 2 the PAC-Bayesian inequality that will be at the heart of many of our proofs, we deal successively with the estimation of the mean of a random vector (Section 3) and of a random matrix (Section 4). Section 6 is devoted to the estimation of the Gram matrix, given its prominent role in multidimensional data analysis. In Section 7 we introduce some applications to least squares regression.
2 A well-known PAC-Bayesian inequality
This is a preliminary section, where we state the PAC-Bayesian inequality that we will use throughout this paper to obtain deviation inequalities holding uniformly with respect to some parameter.
Consider a random variable $X$ taking values in a measurable space $\mathcal{X}$ and a measurable parameter space $\Theta$. Let $\pi$ be a probability measure on $\Theta$ and $h : \mathcal{X} \times \Theta \to \mathbb{R}$ a bounded measurable function. For any other probability measure $\rho$ on $\Theta$, define the Kullback divergence function as usual by the formula
$$\mathcal{K}(\rho, \pi) = \begin{cases} \displaystyle\int \log \Bigl( \frac{\mathrm{d}\rho}{\mathrm{d}\pi} \Bigr) \mathrm{d}\rho, & \text{when } \rho \ll \pi, \\ +\infty, & \text{otherwise.} \end{cases}$$
Let $X_1, \dots, X_n$ be independent copies of $X$.
For any $\epsilon \in (0, 1)$, with probability at least $1 - \epsilon$, for any probability measure $\rho$ on $\Theta$,
$$\int \frac{1}{n} \sum_{i=1}^{n} h(X_i, \theta) \, \mathrm{d}\rho(\theta) \le \int \log \mathbb{E}\bigl[ \exp\bigl( h(X, \theta) \bigr) \bigr] \, \mathrm{d}\rho(\theta) + \frac{\mathcal{K}(\rho, \pi) + \log(\epsilon^{-1})}{n}.$$
It is a consequence of equation (5.2.1), page 159, of Catoni (2004). Indeed, let us recall the identity
$$\log \int \exp\bigl( h(\theta) \bigr) \, \mathrm{d}\pi(\theta) = \sup_{\rho} \, \int h(\theta) \, \mathrm{d}\rho(\theta) - \mathcal{K}(\rho, \pi),$$
where $h$ may be any bounded measurable function (extensions to unbounded $h$ are possible but will not be required in this paper), and where the supremum in $\rho$ is taken over all probability measures on the measurable parameter space $\Theta$. The proof may be found in (Catoni, 2004, page 159). Combined with Fubini's lemma, it yields
$$\mathbb{E} \biggl\{ \exp \biggl[ \sup_{\rho} \int \sum_{i=1}^{n} h(X_i, \theta) - n \log \mathbb{E}\bigl[ \exp\bigl( h(X, \theta) \bigr) \bigr] \, \mathrm{d}\rho(\theta) - \mathcal{K}(\rho, \pi) \biggr] \biggr\} = 1.$$
Since $\mathbb{E}\bigl[ \exp(Z) \bigr] \le 1$ implies that
$$\mathbb{P}\bigl[ Z \ge \log(\epsilon^{-1}) \bigr] \le \epsilon,$$
we obtain the desired result, considering
$$Z = \sup_{\rho} \int \sum_{i=1}^{n} h(X_i, \theta) - n \log \mathbb{E}\bigl[ \exp\bigl( h(X, \theta) \bigr) \bigr] \, \mathrm{d}\rho(\theta) - \mathcal{K}(\rho, \pi). \qquad \text{∎}$$
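The variational identity underlying this proof can be checked numerically on a finite parameter space, where the supremum is attained by the Gibbs measure $\rho^*(\theta) \propto \exp(h(\theta))\, \pi(\theta)$. A minimal sketch (the finite space, the uniform prior and the function `h` below are illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite parameter space with k points and uniform prior pi.
k = 5
pi = np.full(k, 1.0 / k)
h = rng.normal(size=k)  # an arbitrary bounded function h(theta)

# Left-hand side: log of the integral of exp(h) with respect to pi.
lhs = np.log(np.sum(np.exp(h) * pi))

# The supremum over rho of  E_rho[h] - K(rho, pi)  is attained at the
# Gibbs measure rho* proportional to exp(h) * pi.
rho = np.exp(h) * pi
rho /= rho.sum()
rhs = np.sum(rho * h) - np.sum(rho * np.log(rho / pi))

# The two sides coincide up to rounding error; any other rho,
# e.g. the prior itself (for which the divergence term vanishes),
# gives a smaller value.
```

For $\rho = \pi$ the divergence term vanishes, so the right-hand side reduces to $\int h \, \mathrm{d}\pi$, which is indeed below the log-integral by Jensen's inequality.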
3 Estimation of the mean of a random vector
Let $X \in \mathbb{R}^d$ be a random vector and let $X_1, \dots, X_n$ be independent copies of $X$. In this section, we will estimate the mean $\mathbb{E}(X)$ and obtain dimension-free non-asymptotic bounds for the estimation error.
Let $\mathbb{S}_d = \bigl\{ \theta \in \mathbb{R}^d : \| \theta \| = 1 \bigr\}$ be the unit sphere of $\mathbb{R}^d$ and let $I$ be the identity matrix of size $d \times d$. Let $\pi_\theta = \mathcal{N}\bigl( \theta, \beta^{-1} I \bigr)$ be the normal distribution centered at $\theta$, whose covariance matrix is $\beta^{-1} I$, where $\beta$ is a positive real parameter.
Instead of estimating directly the mean vector $\mathbb{E}(X)$, our strategy will rather be to estimate its component $\langle \theta, \mathbb{E}(X) \rangle$ in each direction $\theta$ of the unit sphere. For this, we introduce the estimator $\hat{r}(\theta)$ of $\langle \theta, \mathbb{E}(X) \rangle$ defined as
$$\hat{r}(\theta) = \frac{1}{n \lambda} \sum_{i=1}^{n} \int \psi\bigl( \lambda \langle \theta', X_i \rangle \bigr) \, \mathrm{d}\pi_\theta(\theta'),$$
where $\psi$ is the symmetric influence function
$$\psi(t) = \begin{cases} t - t^3/6, & -\sqrt{2} \le t \le \sqrt{2}, \\ 2\sqrt{2}/3, & t > \sqrt{2}, \\ -2\sqrt{2}/3, & t < -\sqrt{2}, \end{cases}$$
and where the positive constants $\lambda$ and $\beta$ will be chosen afterwards.
As stated in the following lemma, we chose this influence function because it is close to the identity in a neighborhood of zero and because $\exp(\pm\psi)$ is bounded by polynomial functions.
For any $t \in \mathbb{R}$,
$$- \log\Bigl( 1 - t + \frac{t^2}{2} \Bigr) \le \psi(t) \le \log\Bigl( 1 + t + \frac{t^2}{2} \Bigr).$$
Put $f(t) = \log\bigl( 1 + t + t^2/2 \bigr) - \psi(t)$. Remark that $f(0) = 0$ and that, for $-\sqrt{2} \le t \le \sqrt{2}$,
$$f'(t) = \frac{1 + t}{1 + t + t^2/2} - 1 + \frac{t^2}{2} = \frac{t^3 (2 + t)}{4 \bigl( 1 + t + t^2/2 \bigr)}.$$
As $f' \le 0$ on $[-\sqrt{2}, 0]$ and $f' \ge 0$ on $[0, \sqrt{2}]$, we see that $f \ge 0$ on $[-\sqrt{2}, \sqrt{2}]$. Since $t \mapsto \log\bigl( 1 + t + t^2/2 \bigr)$ is increasing on $[\sqrt{2}, +\infty)$ and decreasing on $(-\infty, -\sqrt{2}]$, while $\psi$ is constant on these two intervals, the above inequality can be extended to all $t \in \mathbb{R}$. From the symmetry $\psi(-t) = -\psi(t)$, we deduce the converse inequality
$$- \log\Bigl( 1 - t + \frac{t^2}{2} \Bigr) \le \psi(t),$$
that ends the proof. ∎
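The lemma can be checked numerically. In the sketch below, the piecewise-polynomial influence function is implemented as we read it from the construction above (the cut-off at $\pm\sqrt{2}$ and the plateau values $\pm 2\sqrt{2}/3$ are our reading and should be taken as illustrative), and the two logarithmic bounds are verified on a grid:

```python
import numpy as np

def psi(t):
    """Symmetric influence function: t - t**3/6 on [-sqrt(2), sqrt(2)],
    frozen at its boundary values +/- 2*sqrt(2)/3 outside that interval."""
    s = np.clip(np.asarray(t, dtype=float), -np.sqrt(2), np.sqrt(2))
    return s - s**3 / 6

t = np.linspace(-10, 10, 20001)
# Both arguments of the logarithms are positive for every real t,
# since the discriminant of 1 +/- t + t^2/2 is negative.
upper = np.log(1 + t + t**2 / 2)
lower = -np.log(1 - t + t**2 / 2)

assert np.all(psi(t) <= upper + 1e-12)   # upper polynomial bound
assert np.all(lower - 1e-12 <= psi(t))   # lower polynomial bound
assert np.allclose(psi(-t), -psi(t))     # symmetry psi(-t) = -psi(t)
```

The clipped form `s - s**3/6` agrees with the plateau values, because $t - t^3/6$ equals $\pm 2\sqrt{2}/3$ at $t = \pm\sqrt{2}$.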
Since, when $\theta' \sim \pi_\theta$, the scalar product $\langle \theta', X_i \rangle$ follows a normal distribution with mean $\langle \theta, X_i \rangle$, and since the influence function $\psi$ is piecewise polynomial, the estimator $\hat{r}(\theta)$ can be computed explicitly in terms of the standard normal distribution function. This is done in the following lemma.
Let $W$ be a standard Gaussian real-valued random variable. For any $a \in \mathbb{R}$ and any $b > 0$, define
$$\chi(a, b) = \mathbb{E}\bigl[ \psi( a + b W ) \bigr].$$
The function $\chi$ can be computed as
$$\chi(a, b) = a - \frac{a^3}{6} - \frac{a b^2}{2} + R(a, b),$$
where, introducing the thresholds $c_\pm = \bigl( \pm\sqrt{2} - a \bigr)/b$, the correction term $R(a, b)$ accounts for the event $|a + b W| > \sqrt{2}$, on which $\psi$ is constant, and is an explicit combination of the standard normal distribution function and density evaluated at $c_-$ and $c_+$.
Remark that the correction term is small when $b$ is small and $|a| < \sqrt{2}$, since it is driven by the probability of the event $|a + b W| > \sqrt{2}$, which decays exponentially fast in this regime.
The proof of this lemma is a simple computation, based on the expression of $\psi$ as a piecewise polynomial function, on the identities
$$\mathbb{E}\bigl[ W \mathbb{1}( W \ge c ) \bigr] = \varphi(c), \qquad \mathbb{E}\bigl[ W^2 \mathbb{1}( W \ge c ) \bigr] = \Phi(-c) + c \varphi(c), \qquad \mathbb{E}\bigl[ W^3 \mathbb{1}( W \ge c ) \bigr] = \bigl( 2 + c^2 \bigr) \varphi(c),$$
where $\varphi$ and $\Phi$ are the density and the distribution function of the standard normal distribution, and on the fact that $-W$ has the same distribution as $W$. ∎
Accordingly, the estimator can be computed as
$$\hat{r}(\theta) = \frac{1}{n \lambda} \sum_{i=1}^{n} \chi\Bigl( \lambda \langle \theta, X_i \rangle, \; \lambda \beta^{-1/2} \| X_i \| \Bigr).$$
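The Gaussian smoothing can also be approximated directly by Monte-Carlo, which gives a convenient sanity check of the closed-form computation. In the sketch below, the influence function, the values of $\lambda$ and $\beta$, and the synthetic data are all illustrative assumptions; the estimate should be close to the true directional mean when the data are light-tailed and $\lambda$ is small:

```python
import numpy as np

def psi(t):
    # Influence function (our reading): t - t^3/6 on [-sqrt(2), sqrt(2)],
    # frozen at its boundary values +/- 2*sqrt(2)/3 outside.
    s = np.clip(np.asarray(t, dtype=float), -np.sqrt(2), np.sqrt(2))
    return s - s**3 / 6

def r_hat(theta, X, lam, beta, n_mc=2000, rng=None):
    """Monte-Carlo version of the smoothed estimator of <theta, E X>:
    average of psi(lam * <theta', X_i>) / lam over theta' ~ N(theta, I/beta)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n_mc_draws = theta + rng.normal(size=(n_mc, len(theta))) / np.sqrt(beta)
    vals = psi(lam * n_mc_draws @ X.T)   # shape (n_mc, n)
    return vals.mean() / lam

# Illustrative use on synthetic data with known mean m.
rng = np.random.default_rng(1)
d, n = 5, 2000
m = np.arange(1.0, d + 1.0) / d
X = m + rng.normal(size=(n, d))
theta = np.zeros(d)
theta[0] = 1.0                            # estimate the first coordinate of m
est = r_hat(theta, X, lam=0.05, beta=100.0, rng=rng)
```

For small $\lambda$ the argument of $\psi$ stays in the cubic region, so the estimate is close to the empirical directional mean, up to the cubic bias term.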
3.1 Estimation without centering
where the two bounding constants are assumed to be known, and where $\Theta$ is an arbitrary symmetric subset of the unit sphere, meaning that if $\theta \in \Theta$ then $-\theta \in \Theta$. Choose any confidence parameter $\epsilon \in (0, 1)$ and set the constants $\lambda$ and $\beta$ used in the definition of the estimator to
Non-asymptotic confidence region: with probability at least $1 - \epsilon$,
Consider an estimator of the mean satisfying
With probability at least $1 - \epsilon$, such a vector exists and
In particular, in the case when $\Theta$ is the whole unit sphere, we obtain, with probability at least $1 - \epsilon$, the bound
By choosing as estimator the midpoint of a diameter of the confidence region, we could do a little better and replace the factor in this bound by a smaller constant.
According to the PAC-Bayesian inequality of Proposition 2.1, with probability at least $1 - \epsilon$, for any $\theta \in \Theta$,
We can then use the polynomial approximation of given by Lemma 3.1, remarking that and that , to deduce that
We conclude by considering both $\theta$ and $-\theta$ to get the reverse inequality, using the assumption that $\Theta$ is symmetric and remarking that $\psi(-t) = -\psi(t)$.
The existence, with probability at least $1 - \epsilon$, of an estimator satisfying the required inequality is granted by the fact that, on the event defined by the above PAC-Bayesian inequality, the expectation $\mathbb{E}(X)$ belongs to the confidence region, which therefore cannot be empty. ∎
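The proof shows that any point of the (non-empty) confidence region can serve as the estimator. As a purely illustrative sketch of how a point estimate can be extracted from finitely many directional estimates, the code below uses plain empirical directional means in place of the robust estimator $\hat{r}$ and a least-squares selection rule (both are our assumptions, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 4, 5000
m = np.array([1.0, -2.0, 0.5, 3.0])
X = m + rng.standard_t(df=5, size=(n, d))  # heavier-tailed than Gaussian

# A finite symmetric set of directions: +/- canonical basis and random units.
U = rng.normal(size=(16, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)
Theta = np.vstack([np.eye(d), -np.eye(d), U, -U])

# Directional estimates (empirical means here, standing in for r_hat).
r = (X @ Theta.T).mean(axis=0)

# Any point m_hat with <theta, m_hat> close to r(theta) for all theta in
# Theta is a valid candidate; least squares gives one such point.
m_hat, *_ = np.linalg.lstsq(Theta, r, rcond=None)
```

With empirical directional means, the least-squares point coincides exactly with the empirical mean vector; the interest of the construction is that the same selection rule applies to the robust directional estimates.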
3.2 Centered estimate
The bounds in the previous section are simple, but they
are stated in terms of uncentered moments of order two where
we would have expected a variance.
In this section, we explain how to deduce centered bounds
from the uncentered bounds of the previous section, through the use of
a sample splitting scheme.
where the constants are assumed to be known. Remark that when these bounds hold, so do the bounds of the previous section. Assume also that we know some bound such that
Split the sample into two parts and . Use the first part to construct an estimator of as described in Proposition 3.3, choosing . According to this proposition and by equation (2), with probability at least ,
where we have put .
We then construct a second estimator, built as described in Proposition 3.3, based on the sample and on the constants and . With probability at least ,
and we can, if needed, deduce from an estimator such that with probability at least ,
If we want the correction term to behave as a second-order term when tends to , we can for example take , in which case is equivalent to at infinity, so that is equivalent to
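The two-stage splitting scheme of this section can be sketched as follows. The norm-truncated mean below is only a stand-in robust estimator (our assumption, not the construction of Proposition 3.3), so the code illustrates the splitting logic only: a crude location estimate on the first half, then estimation of the recentered sample on the second half:

```python
import numpy as np

def truncated_mean(X, tau):
    # Stand-in robust estimator: shrink each observation's norm to at most tau,
    # then average. (Illustrative; not the estimator of Proposition 3.3.)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return (X * np.minimum(1.0, tau / norms)).mean(axis=0)

rng = np.random.default_rng(3)
d, n = 6, 4000
m = np.full(d, 5.0)                       # a mean far from the origin
X = m + rng.standard_t(df=4, size=(n, d))  # finite variance, heavy-ish tails

half = n // 2
# Stage 1: crude estimate on the first half (generous truncation level).
m1 = truncated_mean(X[:half], tau=3 * np.sqrt(d) + 15)
# Stage 2: estimate the mean of the *centered* variable X - m1 on the
# second half, whose norms are much smaller, then add m1 back.
m_hat = m1 + truncated_mean(X[half:] - m1, tau=3 * np.sqrt(d))
```

The point of the recentering is visible in the truncation levels: after subtracting the crude estimate, a much smaller radius suffices, so the truncation bias of the second stage is controlled by centered moments rather than by uncentered ones.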
Let us also mention that a simpler estimator, obtained by shrinking the norm of the observations, is also possible. It comes with a sub-Gaussian deviation bound under the slightly stronger hypothesis that for some (not necessarily integer) exponent , and is described in Catoni and Giulini (2017).
4 Mean matrix estimate
Let $M$ be a random matrix and let $M_1, \dots, M_n$ be independent copies of $M$. In this section, we will provide an estimator of $\mathbb{E}(M)$.
From the previous section, we already have an estimator of $\mathbb{E}(M)$ with a bound on the Hilbert-Schmidt norm of the error, since from the point of view of the Hilbert-Schmidt norm, $M$ is nothing but a random vector whose dimension is the number of entries of the matrix. Here, we will be interested in another natural norm, the operator norm
Indeed, recalling that
$$\| M \|_{\mathrm{op}} = \sup \bigl\{ \langle u, M v \rangle : \| u \| = \| v \| = 1 \bigr\},$$
we see that we can deduce results from the previous section on vectors, considering the scalar product between matrices
$$\langle A, B \rangle = \mathrm{Tr}\bigl( A^{\top} B \bigr)$$
and the part of the unit sphere defined as
$$\Theta = \bigl\{ u v^{\top} : \| u \| = \| v \| = 1 \bigr\},$$
since $\bigl\langle u v^{\top}, M \bigr\rangle = \langle u, M v \rangle$.
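The identification of the operator norm with a supremum of Hilbert-Schmidt scalar products against rank-one matrices $u v^{\top}$ can be verified numerically; the supremum is attained at the top singular vectors:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 3))

# <u v^T, A>_HS = Tr(v u^T A) = u^T A v, and the supremum over unit
# vectors u, v is the largest singular value, i.e. the operator norm.
U, s, Vt = np.linalg.svd(A)
u, v = U[:, 0], Vt[0]
op_norm = np.linalg.norm(A, 2)

assert np.isclose(u @ A @ v, op_norm)
assert np.isclose(np.trace(np.outer(u, v).T @ A), u @ A @ v)
# The rank-one directions u v^T lie on the Hilbert-Schmidt unit sphere:
assert np.isclose(np.linalg.norm(np.outer(u, v), 'fro'), 1.0)
```

This is why the set of rank-one matrices $u v^{\top}$ plays the role of the (partial) unit sphere when the vector results are specialized to matrices.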
Doing so, we obtain in the uncentered case a bound of the form
We will show in the next section that the second -dependent term is satisfactory whereas the first -independent term can be improved.
4.1 Estimation without centering
For any , let , where is the identity matrix of size . In the same way, let , . Consider the estimator of defined as
For any parameters , , with probability at least , for any and any ,
Let us now discuss the question of computing . Remark that, according to Lemma 3.2, for any ,
It is also easy to check that
Consider a standard Gaussian random vector . We obtain that
The last term is not explicit, since it contains an expectation, but it should most of the time be a small remainder and can be evaluated using a Monte-Carlo numerical scheme. This gives a more explicit and efficient method than evaluating the whole estimator directly by a Monte-Carlo simulation of the couple of random variables .
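As an illustration of such a Monte-Carlo evaluation, the sketch below approximates a Gaussian expectation of the influence function and compares it with the polynomial value obtained when the clipping is ignored; the form of $\psi$ and the values of $a$ and $b$ are our illustrative assumptions:

```python
import numpy as np

def psi(t):
    # Influence function (our reading): t - t^3/6 on [-sqrt(2), sqrt(2)],
    # frozen at its boundary values +/- 2*sqrt(2)/3 outside.
    s = np.clip(np.asarray(t, dtype=float), -np.sqrt(2), np.sqrt(2))
    return s - s**3 / 6

rng = np.random.default_rng(5)
a, b = 0.3, 0.2                      # illustrative values
W = rng.standard_normal(200_000)
mc = psi(a + b * W).mean()           # Monte-Carlo value of E[psi(a + bW)]

# Polynomial value obtained when the clipping is ignored, using the
# Gaussian moment E[(a+bW)^3] = a^3 + 3ab^2:
# E[(a+bW) - (a+bW)^3/6] = a - a^3/6 - a*b^2/2.
poly = a - a**3 / 6 - a * b**2 / 2
# For small a and b the remainder (their difference) is tiny, since the
# clipping event |a + bW| > sqrt(2) has negligible probability.
```

The Monte-Carlo error decreases as the inverse square root of the number of draws, so a moderate simulation budget already makes the remainder term negligible compared with the main terms of the bound.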
Assume that the following finite bounds are known
For any values of , , with probability at least , for any , any ,
Consider now any estimator of . With probability at least ,
In particular, if we choose such that,
with probability at least , this choice is possible and
In particular, choosing , we get
The bound is of the type , with a complexity (or dimension) term equal to
Let us envision a simple case to compare the precision of the bounds in a setting where dimension-free and dimension-dependent bounds coincide. Assume more specifically that the entries of the matrix are centered and i.i.d. Assume that the variance of the entries is known, and take
Choosing , we get a complexity term equal to
whereas the bound of the previous section made for vectors has a complexity factor equal to .
4.2 Controlling both the operator norm error and the Hilbert-Schmidt error
There are situations where it is desirable to control both the operator norm error and the Hilbert-Schmidt norm error. To do so, we can very easily combine Propositions 3.3 and 4.2, since both propositions are based on the construction of confidence regions.
More precisely, first consider the random matrix as a vector and use the scalar product
Applying Proposition 3.3, we can build an estimator such that with probability at least ,
On the other hand, we can also apply Proposition 4.2 and build an estimator , such that with probability at least ,
Remark that is typically smaller than , as expected in the interesting large-dimension situations.
4.3 Centered estimator
As already done in the case of the estimation of the mean of a random vector, we deduce in this section centered bounds from the uncentered bounds of the previous sections, using sample splitting.
Put and . Assume that we know finite constants such that
When this is true, we can take for the previous uncentered constants
In view of this, it is suitable to assume that we also know some finite constants and such that
As the Hilbert-Schmidt norm comes into play, we will use the combined preliminary estimate provided by Proposition 4.3.
Given an i.i.d. matrix sample , first use to build a preliminary estimator as described in Proposition 4.3. With probability at least ,
Then use the sample to build an estimator based on the construction described in Proposition 4.2, at confidence level . It is such that with probability at least ,
If we choose for instance , we obtain that