 # PAC-Bayesian Bounds for Randomized Empirical Risk Minimizers

The aim of this paper is to generalize the PAC-Bayesian theorems proved by Catoni in the classification setting to more general problems of statistical inference. We show how to control the deviations of the risk of randomized estimators. Particular attention is paid to randomized estimators drawn in a small neighborhood of classical estimators, whose study leads to control of the risk of the latter. These results allow one to bound the risk of very general estimation procedures, as well as to perform model selection.

## Authors


## 1. Introduction

The aim of this paper is to perform statistical inference with observations in a possibly high-dimensional space. Let us first introduce the notation.

### 1.1. General notations

Let $N$ be the number of observations. Let $(\mathcal{Z},\mathcal{B})$ be a measurable space and $P_1$, …, $P_N$ be probability measures on this space, unknown to the statistician. We assume that

$$(Z_1,\ldots,Z_N)$$

is the canonical process on

$$\left(\mathcal{Z}^N,\mathcal{B}^{\otimes N},P_1\otimes\cdots\otimes P_N\right).$$

###### Definition 1.1.

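Let us put

$$P=P_1\otimes\cdots\otimes P_N,$$

and

$$\overline{P}=\frac{1}{N}\sum_{i=1}^N\delta_{Z_i}.$$

We want to perform statistical inference on a general parameter space $\Theta$, with respect to some loss function

$$\ell_\theta:\mathcal{Z}\rightarrow\mathbb{R},\quad\theta\in\Theta.$$

###### Definition 1.2 (Risk functions).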

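We introduce, for any $\theta\in\Theta$,

$$r(\theta)=\overline{P}(\ell_\theta)=\frac{1}{N}\sum_{i=1}^N\ell_\theta(Z_i),$$

the empirical risk function, and

$$R(\theta)=P(\ell_\theta)=\frac{1}{N}\sum_{i=1}^NP_i(\ell_\theta),$$

the risk function.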

We now describe three classical problems in statistics that fit the general context described above.

###### Example 1.1 (Classification).

We assume that $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$ where $\mathcal{X}$ is a set of objects and $\mathcal{Y}$ a finite set of possible labels for these objects. Consider a set of classification functions $\{f_\theta:\mathcal{X}\rightarrow\mathcal{Y},\ \theta\in\Theta\}$ which assign to each object a label. Let us put, for any $z=(x,y)\in\mathcal{Z}$, $\ell_\theta(z)=\psi(f_\theta(x),y)$ where $\psi$ is some symmetric discrepancy measure. The most usual case is to use the 0-1 loss function $\psi(y,y')=\mathbb{1}(y\neq y')$. If moreover $\mathcal{Y}$ has only two elements, we can decide that $\mathcal{Y}=\{-1,+1\}$ and express the 0-1 loss through the sign of the product $yy'$. However, in many practical situations, algorithmic considerations lead to the use of a convex upper bound of this loss function, like

$$
\begin{aligned}
\psi(y,y') &= (1-yy')_+=\max(1-yy',0), && \text{the "hinge loss"},\\
\psi(y,y') &= \exp(-yy'), && \text{the exponential loss},\\
\psi(y,y') &= (1-yy')^2, && \text{the least square loss}.
\end{aligned}
$$

For example, Cortes and Vapnik generalized the SVM technique to non-separable data using the hinge loss, while Schapire, Freund, Bartlett and Lee gave a statistical interpretation of boosting algorithms thanks to the exponential loss. See Zhang for a complete study of the performance of classification methods using these loss functions. Remark that in this case, $f_\theta(x)$ is allowed to take any real value, and not only $-1$ or $+1$, although the labels in the training set are either $-1$ or $+1$.
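As an illustration of the convex surrogates listed above, the following minimal sketch evaluates the three losses on simulated scores and checks numerically that each one upper bounds the 0-1 loss. The function names, the simulated data and the convention that a zero margin counts as an error are assumptions made for this example only.

```python
import numpy as np

def zero_one(y, score):
    # 0-1 loss for labels in {-1, +1}; a margin y * score <= 0 counts as an error
    # (treating a zero margin as an error is a convention chosen for this sketch).
    return (y * score <= 0).astype(float)

def hinge(y, score):
    # hinge loss (1 - y y')_+
    return np.maximum(1.0 - y * score, 0.0)

def exponential(y, score):
    # exponential loss exp(-y y')
    return np.exp(-y * score)

def least_square(y, score):
    # least square loss (1 - y y')^2
    return (1.0 - y * score) ** 2

# Hypothetical data: random labels and real-valued scores f_theta(x).
rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=1000)
score = rng.normal(size=1000)

for name, loss in [("hinge", hinge), ("exponential", exponential),
                   ("least square", least_square)]:
    dominated = np.all(loss(y, score) >= zero_one(y, score))
    print(f"{name} loss upper bounds the 0-1 loss on this sample: {dominated}")
```

Each surrogate is at least 1 whenever the margin $yy'$ is non-positive and is non-negative otherwise, which is why the check succeeds for any sample.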

###### Example 1.2 (Regression estimation).

The context is the same except that the label set $\mathcal{Y}$ is infinite; in most cases it is $\mathbb{R}$ or an interval of $\mathbb{R}$. Here, the most usual case is regression with the quadratic loss, $\psi(y,y')=(y-y')^2$; however, more general cases can be studied, like the loss $\psi(y,y')=|y-y'|^p$ for some $p\geq 1$.

###### Example 1.3 (Density estimation).

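Here, we assume that $P_1=\cdots=P_N$; in this example the common distribution is also written $P$, and we want to estimate the density $f$ of $P$ with respect to a known measure $\mu$. We assume that we are given a set of probability measures $\{Q_\theta,\theta\in\Theta\}$ with densities $q_\theta=dQ_\theta/d\mu$ and we use the loss function $\ell_\theta=-\log\circ q_\theta$. Indeed in this case, we can write under suitable hypotheses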

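$$R(\theta)=P(-\log\circ q_\theta)=P\left(-\log\circ\frac{dQ_\theta}{d\mu}\right)=P\left(\log\circ\frac{dP}{dQ_\theta}\right)+P\left(\log\circ\frac{d\mu}{dP}\right)=K(P,Q_\theta)-P(\log\circ f),$$

showing that the risk is the Kullback-Leibler divergence between $P$ and $Q_\theta$ up to a constant (the definition of $K$ is recalled in this paper, see Definition 1.8).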
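The decomposition of the risk into a Kullback-Leibler term and a constant can be checked numerically on a finite space. The sketch below is only an illustration: the reference measure `mu`, the true distribution `p_mass` and the candidate density `q` are arbitrary values chosen for the example.

```python
import numpy as np

# Hypothetical finite sample space with 4 points.
mu = np.array([0.5, 1.0, 2.0, 0.5])       # weights of the reference measure mu
p_mass = np.array([0.1, 0.4, 0.3, 0.2])   # probabilities P({z})
f = p_mass / mu                           # density f = dP/dmu

q_unnorm = np.array([1.0, 2.0, 0.5, 3.0])
q = q_unnorm / np.sum(q_unnorm * mu)      # density q_theta = dQ_theta/dmu (integrates to 1)
q_mass = q * mu                           # probabilities Q_theta({z})

risk = np.sum(p_mass * (-np.log(q)))            # R(theta) = P(-log q_theta)
kl = np.sum(p_mass * np.log(p_mass / q_mass))   # K(P, Q_theta)
constant = np.sum(p_mass * np.log(f))           # P(log f), independent of theta

print(risk, kl - constant)  # the two values agree up to rounding
```

Since the term $P(\log\circ f)$ does not depend on $\theta$, minimizing the risk over $\theta$ is the same as minimizing $K(P,Q_\theta)$.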

In each case the objective is to estimate $\arg\min_{\theta\in\Theta}R(\theta)$ on the basis of the observations $Z_1$, …, $Z_N$, presumably using in some way or another the value of the empirical risk. We have to notice that when the space $\Theta$ is large or complex (for example a vector space with large dimension), $r$ and $R$ can be very different. This does not happen if $\Theta$ is simple (for example a vector space with small dimension), but such a case is less interesting, as we have to eliminate a lot of dimensions in $\Theta$ before proceeding to statistical inference, with no guarantee that these directions are not relevant.

### 1.2. Statistical learning theory and PAC-Bayesian point of view

The learning theory point of view introduced by Vapnik and Cervonenkis (see Vapnik for a presentation of the main results in English) gives a setting that has proved to be well suited to estimation problems in large dimension. This point of view has attracted considerable interest over the past few years; see for example the well-known books of Devroye, Györfi and Lugosi, and of Friedman, Hastie and Tibshirani, or more recently the paper by Boucheron, Bousquet and Lugosi and the references therein, for a state of the art.

The idea of Vapnik and Cervonenkis is to introduce a structure, namely a family of submodels of $\Theta$. The problem of model selection then arises: we must choose the submodel in which the minimization of the empirical risk $r$ will lead to the smallest possible value for the real risk $R$. This choice requires estimating the complexity of the submodels. An example of complexity measure is the so-called Vapnik-Cervonenkis dimension or VC-dimension, see [9, 21].

The PAC-Bayesian point of view, introduced in the context of classification by McAllester [16, 17], is based on the following remark: while classical measures of complexity (like the VC-dimension) require theoretical results on the submodels, the introduction of a probability measure $\pi$ on the model $\Theta$ allows one to measure empirically the complexity of every submodel. From a more technical point of view, we will see later that $\pi$ allows a generalization of the so-called union bound. This point of view might be compared with Rissanen's work on MDL (Minimum Description Length), which makes a link between statistical inference and information theory: $-\log\pi(\theta)$ can be seen as the length of a code for the parameter $\theta$ (at least when $\Theta$ is finite).

The PAC-Bayesian point of view was developed in more contexts (classification, least square regression and density estimation) by Catoni, and then improved in the context of classification by Catoni and Audibert, in the context of least square regression by Audibert, and for regression with a general loss in our PhD thesis. The most recent work in the context of classification by Catoni improves the upper bound given on the risk of the PAC-Bayesian estimators, leading to purely empirical bounds that allow one to perform model selection with no assumption on the probability measure $P$. The aim of this work is to extend these results to the very general context of statistical inference introduced in subsection 1.1, which includes classification, regression with a general loss function and density estimation.

Let us introduce our estimators.

###### Definition 1.3.

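Let us assume that we have a family of functions

$$\psi^i_\theta:\mathcal{Z}\rightarrow\mathbb{R}\cup\{+\infty\}$$

indexed by $i$ in a finite or countable set $I$ and by $\theta\in\Theta$. For every $i\in I$ we choose:

$$\hat\theta_i\in\arg\min_{\theta\in\Theta}\overline{P}(\psi^i_\theta).$$

###### Example 1.4 (Empirical risk minimization and model selection).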

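If we take $I=\{0\}$ we can choose $\psi^0_\theta=\ell_\theta$ and we obtain $\overline{P}(\psi^0_\theta)=r(\theta)$ and so

$$\hat\theta_0=\arg\min_{\theta\in\Theta}r(\theta),$$

the empirical risk minimizer. In the case where the dimension of $\Theta$ is large, we can choose several submodels indexed by a finite or countable family $I$: $(\Theta_i)_{i\in I}$. In order to obtain

$$\hat\theta_i=\arg\min_{\theta\in\Theta_i}r(\theta)$$

we can put

$$\psi^i_\theta(\cdot)=\begin{cases}\ell_\theta(\cdot)&\text{if }\theta\in\Theta_i,\\+\infty&\text{otherwise.}\end{cases}$$

The problem of the selection of the $\hat\theta_i$ with the smallest possible risk $R(\hat\theta_i)$ (the so-called model selection problem) can be solved with the help of PAC-Bayesian bounds.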

Note that the PAC-Bayesian bounds given by Catoni [6, 7, 8] usually apply to "randomized estimators". More formally, let us introduce a $\sigma$-algebra on $\Theta$ and a probability measure $\pi$ on the resulting measurable space. We will need the following definitions.

###### Definition 1.4.

For any measurable space $E$, we let $\mathcal{M}_+^1(E)$ denote the set of all probability measures on $E$.

###### Definition 1.5.

In order to generalize the notion of estimator (a measurable function $\mathcal{Z}^N\rightarrow\Theta$), we call a randomized estimator any function $\rho:\mathcal{Z}^N\rightarrow\mathcal{M}_+^1(\Theta)$ that is a regular conditional probability measure. For the sake of simplicity, the sample being given, we will write $\rho$ instead of $\rho(Z_1,\ldots,Z_N)$.

PAC-Bayesian bounds for randomized estimators are usually given for their mean risk

 ∫θ∈ΘR(θ)dρ(θ),

whereas here we will rather focus on $R(\tilde\theta)$, where $\tilde\theta$ is drawn from $\rho$ and $\rho$ is highly concentrated around a "classical" (deterministic) estimator $\hat\theta$.

### 1.3. Truncation of the risk

In this subsection, we introduce a truncated version of the relative risk of two parameters $\theta$ and $\theta'$.

###### Definition 1.6.

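We put, for any $\lambda\in\mathbb{R}_+^*$ and $(\theta,\theta')\in\Theta^2$,

$$R_\lambda(\theta,\theta')=P\left[(\ell_\theta-\ell_{\theta'})\wedge\frac{N}{\lambda}\right].$$

Note of course that if $\ell_\theta-\ell_{\theta'}\leq N/\lambda$ $P$-almost surely, then $R_\lambda(\theta,\theta')=R(\theta)-R(\theta')$.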

In what follows, we will give empirical bounds on $R_\lambda(\theta,\theta')$ for some $\theta$ and $\theta'$ chosen by some statistical procedure. One can wonder why we prefer to bound this truncated version of the risk instead of $R(\theta)-R(\theta')$. The reason is the following. In this paper, we want to give bounds that hold with no particular assumption on the unknown data distribution $P$. However, it is clear that we cannot obtain a purely empirical bound on $R(\theta)-R(\theta')$ with no assumption on the data distribution, as is shown by the following example.

###### Example 1.5.

Let us choose and . We assume that and that with . We put with probability and otherwise. Then we have and

$$R(\theta)=\frac{1}{N}\,cN+\left(1-\frac{1}{N}\right)0=c$$

while and with probability at least we also have , this means that we cannot upper bound precisely by empirical quantities with no assumption.

So, we introduce the truncation of the risk. However, two remarks should be made. First, in the case of a bounded loss function, with a large enough ratio $N/\lambda$ we have $R_\lambda(\theta,\theta')=R(\theta)-R(\theta')$.

In the general case, if we want to upper bound $R(\theta)-R(\theta')$ we can make additional hypotheses on the data distribution, ensuring that a (known) upper bound $\Delta_\lambda(\theta,\theta')$ is available:

$$\Delta_\lambda(\theta,\theta')\geq R(\theta)-R(\theta')-R_\lambda(\theta,\theta')$$

as is done in our PhD thesis. For the sake of completeness, such an upper bound is given in the Appendix (bounding the effect of truncation).

### 1.4. Main tools

In this subsection, we give two lemmas that will be useful in order to build PAC-Bayesian theorems. First, let us recall the following definitions. In this whole subsection, we assume that $E$ is an arbitrary measurable space.

###### Definition 1.7.

For any measurable function $h$ on $E$, for any measure $m\in\mathcal{M}_+^1(E)$ we put

$$m(h)=\sup_{B\in\mathbb{R}}\int_E[h(x)\wedge B]\,m(dx).$$

###### Definition 1.8 (Kullback-Leibler divergence).

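Given a measurable space $E$, we define, for any $(m,n)\in[\mathcal{M}_+^1(E)]^2$, the Kullback-Leibler divergence function

$$K(m,n)=\begin{cases}\displaystyle\int_E\log\left(\frac{dm}{dn}(e)\right)m(de)&\text{if }m\ll n,\\+\infty&\text{otherwise.}\end{cases}$$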

###### Lemma 1.1 (Legendre transform of the Kullback divergence function).

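For any $n\in\mathcal{M}_+^1(E)$, for any measurable function $h$ such that $n(\exp\circ h)<+\infty$ we have

$$\log n(\exp\circ h)=\sup_{m\in\mathcal{M}_+^1(E)}\bigl(m(h)-K(m,n)\bigr),\tag{1.1}$$

where by convention $\infty-\infty=-\infty$. Moreover, as soon as $h$ is upper-bounded on the support of $n$, the supremum with respect to $m$ in the right-hand side is reached for the Gibbs distribution $n_{\exp(h)}$, given by:

$$\forall e\in E,\quad\frac{dn_{\exp(h)}}{dn}(e)=\frac{\exp[h(e)]}{n(\exp\circ h)}.$$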
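The duality formula of Lemma 1.1 can be illustrated on a finite space, where measures are probability vectors. The following sketch is only a numerical sanity check under that assumption; the base measure `n`, the function `h` and the random competitors are arbitrary choices made for the example.

```python
import numpy as np

def kl(m, n):
    # Kullback-Leibler divergence K(m, n) between two probability vectors,
    # assuming n has full support (so m is absolutely continuous w.r.t. n).
    mask = m > 0
    return float(np.sum(m[mask] * np.log(m[mask] / n[mask])))

rng = np.random.default_rng(1)
n = rng.dirichlet(np.ones(6))   # hypothetical base measure on 6 points
h = rng.normal(size=6)          # hypothetical function h

# Left-hand side: log n(exp o h).
lhs = float(np.log(np.sum(n * np.exp(h))))

# Gibbs distribution with density exp(h) / n(exp o h) with respect to n.
gibbs = n * np.exp(h) / np.sum(n * np.exp(h))
value_at_gibbs = float(np.sum(gibbs * h) - kl(gibbs, n))

# Any other probability vector m gives a value m(h) - K(m, n) that is not larger.
others = [float(np.sum(m * h) - kl(m, n)) for m in rng.dirichlet(np.ones(6), size=200)]

print(lhs, value_at_gibbs, max(others))
```

The first two printed values coincide up to rounding, and the third one is smaller, in line with the supremum being reached at the Gibbs distribution.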

The proof of this lemma is given at the end of the paper, in a section devoted to proofs (subsection 5.1). We now state another lemma that will be useful in the sequel. First, we need the following definition.

###### Definition 1.9.

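We put, for any $\alpha\in\mathbb{R}_+^*$,

$$
\begin{aligned}
\Phi_\alpha:\ \left]-\infty,\tfrac{1}{\alpha}\right[\ &\rightarrow\mathbb{R}\\
t&\mapsto-\frac{\log(1-\alpha t)}{\alpha}.
\end{aligned}
$$

Note that $\Phi_\alpha$ is invertible, that for any $u\in\mathbb{R}$,

$$\Phi_\alpha^{-1}(u)=\frac{1-\exp(-\alpha u)}{\alpha}\leq u,$$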

and that . Also note that for , is convex and that . An elementary study of this function also proves that for any , for any and any we have:

$$\Phi_\alpha(p)\leq p+\frac{\alpha p^2}{2}.$$
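The function $\Phi_\alpha$ and its inverse are simple to implement, which gives a quick way to check the stated properties numerically. The sketch below is an illustration only; the function names `phi` and `phi_inv` and the test values are choices made for this example.

```python
import math

def phi(alpha, t):
    # Phi_alpha(t) = -log(1 - alpha * t) / alpha, defined for t < 1 / alpha.
    if alpha * t >= 1.0:
        raise ValueError("t must be smaller than 1 / alpha")
    return -math.log1p(-alpha * t) / alpha

def phi_inv(alpha, u):
    # Inverse function: Phi_alpha^{-1}(u) = (1 - exp(-alpha * u)) / alpha.
    return -math.expm1(-alpha * u) / alpha

alpha = 0.5
for t in [-3.0, -0.5, 0.0, 0.7, 1.9]:
    u = phi(alpha, t)
    # Round trip t -> Phi(t) -> t, and the inequality Phi^{-1}(u) <= u.
    print(t, u, phi_inv(alpha, u), phi_inv(alpha, u) <= u + 1e-12)
```

The inequality $\Phi_\alpha^{-1}(u)\leq u$ is what later allows the bounds to be read without the transform, at the price of some looseness.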

We can now give the lemma.

###### Lemma 1.2.

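We have, for any $\lambda$, for any $a$, for any $(\theta,\theta')\in\Theta^2$,

$$P\exp\left\{\lambda\Phi_{\frac{\lambda}{N}}\left[R_{\frac{\lambda}{a}}(\theta,\theta')\right]-\frac{\lambda}{N}\sum_{i=1}^N\Phi_{\frac{\lambda}{N}}\left[(\ell_\theta-\ell_{\theta'})(Z_i)\wedge\frac{aN}{\lambda}\right]\right\}=1.$$

The proof is almost trivial; we give it now in order to emphasize the role of the truncation and of the change of variable.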

###### Proof.

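For any $\lambda$ and $a$, for any $(\theta,\theta')\in\Theta^2$,

$$
\begin{aligned}
P\exp&\left\{\lambda\Phi_{\frac{\lambda}{N}}\left[R_{\frac{\lambda}{a}}(\theta,\theta')\right]-\frac{\lambda}{N}\sum_{i=1}^N\Phi_{\frac{\lambda}{N}}\left[(\ell_\theta-\ell_{\theta'})(Z_i)\wedge\frac{aN}{\lambda}\right]\right\}\\
&=P\exp\left\{\sum_{i=1}^N\left(\log\left[1-\frac{\lambda}{N}\left((\ell_\theta-\ell_{\theta'})(Z_i)\wedge\frac{aN}{\lambda}\right)\right]-\log\left[1-\frac{\lambda}{N}P_i\left((\ell_\theta-\ell_{\theta'})(Z_i)\wedge\frac{aN}{\lambda}\right)\right]\right)\right\}\\
&=P\left[\prod_{i=1}^N\frac{1-\frac{\lambda}{N}\left((\ell_\theta-\ell_{\theta'})(Z_i)\wedge\frac{aN}{\lambda}\right)}{1-\frac{\lambda}{N}P_i\left((\ell_\theta-\ell_{\theta'})(Z_i)\wedge\frac{aN}{\lambda}\right)}\right]\\
&=\prod_{i=1}^NP_i\left[\frac{1-\frac{\lambda}{N}\left((\ell_\theta-\ell_{\theta'})(Z_i)\wedge\frac{aN}{\lambda}\right)}{1-\frac{\lambda}{N}P_i\left((\ell_\theta-\ell_{\theta'})(Z_i)\wedge\frac{aN}{\lambda}\right)}\right]=1.
\end{aligned}
$$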

Note that this lemma will be used as an alternative to Hoeffding’s or Bernstein’s (see [13, 4]) inequalities in order to prove PAC inequalities.
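The computation in the proof above can be checked by exact enumeration in a toy case. The sketch below assumes identically distributed observations and a loss difference $(\ell_\theta-\ell_{\theta'})(Z)$ taking three values; all numerical values (`N`, `lam`, `a`, `values`, `probs`) are hypothetical and chosen so that the truncation at $aN/\lambda$ is active.

```python
import itertools
import math

def phi(alpha, t):
    # Phi_alpha(t) = -log(1 - alpha * t) / alpha, for t < 1 / alpha
    return -math.log(1.0 - alpha * t) / alpha

N, lam, a = 3, 1.5, 0.5
values = [-1.0, 0.0, 2.0]   # possible values of (l_theta - l_theta')(Z)
probs = [0.3, 0.5, 0.2]     # their probabilities under the common distribution

alpha = lam / N             # index of Phi in the lemma
cap = a * N / lam           # truncation level aN / lambda (here 1.0, so 2.0 is truncated)
truncated = [min(v, cap) for v in values]

# Truncated relative risk: expectation of the truncated loss difference.
R_trunc = sum(p * t for p, t in zip(probs, truncated))

# Exact expectation over all samples (Z_1, ..., Z_N) by enumeration.
total = 0.0
for sample in itertools.product(range(len(values)), repeat=N):
    weight = math.prod(probs[k] for k in sample)
    empirical = sum(phi(alpha, truncated[k]) for k in sample)
    total += weight * math.exp(lam * phi(alpha, R_trunc) - alpha * empirical)

print(total)  # equal to 1 up to rounding errors
```

The truncation guarantees that every argument of the logarithm stays positive, which is the role it plays in the proof.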

### 1.5. A basic PAC-Bayesian Theorem

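Let us integrate Lemma 1.2 with respect to $(\theta,\theta')$ with a given probability measure $\pi\otimes\pi'$ with $(\pi,\pi')\in[\mathcal{M}_+^1(\Theta)]^2$. Applying the Fubini-Tonelli theorem we obtain:

$$P\left\{\int_{(\theta,\theta')\in\Theta^2}d(\pi\otimes\pi')(\theta,\theta')\exp\left\{\lambda\Phi_{\frac{\lambda}{N}}\left[R_{\frac{\lambda}{a}}(\theta,\theta')\right]-\frac{\lambda}{N}\sum_{i=1}^N\Phi_{\frac{\lambda}{N}}\left[(\ell_\theta-\ell_{\theta'})(Z_i)\wedge\frac{aN}{\lambda}\right]\right\}\right\}=1.\tag{1.2}$$

This implies that for any $(\rho,\rho')$,

$$P\left\{\int_{(\theta,\theta')\in\Theta^2}d(\rho\otimes\rho')(\theta,\theta')\exp\left\{\lambda\Phi_{\frac{\lambda}{N}}\left[R_{\frac{\lambda}{a}}(\theta,\theta')\right]-\frac{\lambda}{N}\sum_{i=1}^N\Phi_{\frac{\lambda}{N}}\left[(\ell_\theta-\ell_{\theta'})(Z_i)\wedge\frac{aN}{\lambda}\right]-\log\left[\frac{d(\rho\otimes\rho')}{d(\pi\otimes\pi')}(\theta,\theta')\right]\right\}\right\}\leq1.$$

(This inequality becomes an equality when $\pi\ll\rho$ and $\pi'\ll\rho'$.)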

###### Theorem 1.3.

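Let us assume that we have $(\pi,\pi')\in[\mathcal{M}_+^1(\Theta)]^2$, and two randomized estimators $\rho$ and $\rho'$. For any $\varepsilon>0$, for any $a$ and $\lambda$, with $P\otimes\rho\otimes\rho'$-probability at least $1-\varepsilon$ over the sample $(Z_1,\ldots,Z_N)$ and the parameters $(\tilde\theta,\tilde\theta')$, we have:

$$R_{\frac{\lambda}{a}}(\tilde\theta,\tilde\theta')\leq\Phi^{-1}_{\frac{\lambda}{N}}\left\{\frac{1}{N}\sum_{i=1}^N\Phi_{\frac{\lambda}{N}}\left[(\ell_{\tilde\theta}-\ell_{\tilde\theta'})(Z_i)\wedge\frac{aN}{\lambda}\right]+\frac{\log\left[\frac{d\rho}{d\pi}(\tilde\theta)\right]+\log\left[\frac{d\rho'}{d\pi'}(\tilde\theta')\right]+\log\frac{1}{\varepsilon}}{\lambda}\right\}.$$

In order to provide an interpretation of Theorem 1.3, let us give the following corollary in the bounded case, which is obtained using basic properties of the function $\Phi$ given just after Definition 1.9. In this case, the parameter $a$ is just set to .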

###### Corollary 1.4.

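Let us assume that for any . Let us assume that we have $(\pi,\pi')\in[\mathcal{M}_+^1(\Theta)]^2$, and two randomized estimators $\rho$ and $\rho'$. For any $\varepsilon>0$, for any $\lambda$, with $P\otimes\rho\otimes\rho'$-probability at least $1-\varepsilon$ we have:

$$R(\tilde\theta)-R(\tilde\theta')\leq\Phi^{-1}_{\frac{\lambda}{N}}\left\{r(\tilde\theta)-r(\tilde\theta')+\frac{\lambda}{2N}\overline{P}\left[(\ell_{\tilde\theta}-\ell_{\tilde\theta'})^2\right]+\frac{\log\left[\frac{d\rho}{d\pi}(\tilde\theta)\right]+\log\left[\frac{d\rho'}{d\pi'}(\tilde\theta')\right]+\log\frac{1}{\varepsilon}}{\lambda}\right\}.$$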

We can see that the difference of the "true" risk of the randomized estimators $\tilde\theta$ and $\tilde\theta'$, drawn independently from $\rho$ and $\rho'$, is upper bounded by the difference of the empirical risk, plus a variance term and a complexity term expressed in terms of the logarithm of the density of the randomized estimator with respect to a given prior. So Theorem 1.3 provides an empirical way to compare the theoretical performance of two randomized estimators, leading to applications in model selection. This paper is devoted to improvements of Theorem 1.3 (we will see in the sequel that this theorem does not necessarily lead to optimal estimators) and to the effective construction of estimators using variants of Theorem 1.3.

Now, note that the choice of the randomized estimators $\rho$ and $\rho'$ is not straightforward. The following theorem, which gives an integrated variant of Theorem 1.3, can be useful for that purpose.

###### Theorem 1.5.

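Let us assume that we have $(\pi,\pi')\in[\mathcal{M}_+^1(\Theta)]^2$. For any $\varepsilon>0$, for any $a$ and $\lambda$, with $P$-probability at least $1-\varepsilon$, for any $(\rho,\rho')\in[\mathcal{M}_+^1(\Theta)]^2$,

$$\int_{\Theta^2}R_{\frac{\lambda}{a}}(\theta,\theta')\,d(\rho\otimes\rho')(\theta,\theta')\leq\Phi^{-1}_{\frac{\lambda}{N}}\left\{\int_{\Theta^2}\frac{1}{N}\sum_{i=1}^N\Phi_{\frac{\lambda}{N}}\left[(\ell_\theta-\ell_{\theta'})(Z_i)\wedge\frac{aN}{\lambda}\right]d(\rho\otimes\rho')(\theta,\theta')+\frac{K(\rho,\pi)+K(\rho',\pi')+\log\frac{1}{\varepsilon}}{\lambda}\right\}.$$

The proof is given in subsection 5.2.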

### 1.6. Main results of the paper

In our PhD dissertation, a particular case of Theorem 1.5 is given and applied to regression estimation with quadratic loss in a bounded model of finite dimension. In this particular case, it is shown that the estimators based on the minimization of the right-hand side of Theorem 1.5 do not achieve the optimal rate of convergence, but only a slower one. A solution is given by Catoni and consists in replacing the prior $\pi$ by the so-called "localized prior" $\pi_{\exp(-\beta R)}$ for a given $\beta>0$. The main problem is that this choice leads to the presence of a non-empirical term in the right-hand side, $K(\rho,\pi_{\exp(-\beta R)})$.

In Section 2, we give an empirical bound for this term $K(\rho,\pi_{\exp(-\beta R)})$. We also give a heuristic that leads to this technique of localization.

In Section 3, we show how this result, combined with Theorem 1.5, leads to the effective construction of an estimator that can reach optimal rates of convergence.

The proofs of the theorems stated in this paper are gathered in Section 5.

## 2. Empirical bound for the localized complexity and localized PAC-Bayesian theorems

### 2.1. Mutual information between the sample and the parameter

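Let us consider Theorem 1.5 with $\rho'=\pi'=\delta_{\theta'}$ for a given parameter $\theta'$. For the sake of simplicity, let us assume in this subsection that we are in the bounded case (bounded loss). Theorem 1.5 ensures that, for any $\lambda$, with $P$-probability at least $1-\varepsilon$, for any $\rho\in\mathcal{M}_+^1(\Theta)$,

$$\rho(R)-R(\theta')\leq\rho(r)-r(\theta')+\frac{\lambda}{2N}\overline{P}\left[\int_\Theta(\ell_\theta-\ell_{\theta'})^2d\rho(\theta)\right]+\frac{K(\rho,\pi)+\log\frac{1}{\varepsilon}}{\lambda}.$$

This is an incentive to choose

$$\rho=\arg\min_{\mu\in\mathcal{M}_+^1(\Theta)}\left[\mu(r)+\frac{\lambda}{2N}\overline{P}\left[\int_\Theta(\ell_\theta-\ell_{\theta'})^2d\mu(\theta)\right]+\frac{K(\mu,\pi)}{\lambda}\right].$$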

However, if we choose to neglect the variance term, we may consider the following randomized estimator:

$$\rho=\arg\min_{\mu\in\mathcal{M}_+^1(\Theta)}\left[\mu(r)+\frac{K(\mu,\pi)}{\lambda}\right].$$

Actually, in this case, Lemma 1.1 leads to:

$$\rho=\pi_{\exp(-\lambda r)}.$$
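On a finite parameter set, the Gibbs randomized estimator $\pi_{\exp(-\lambda r)}$ can be computed explicitly and compared with other candidate distributions. The sketch below is an illustration under that finiteness assumption; the prior, the empirical risks and the value of `lam` are hypothetical.

```python
import numpy as np

def gibbs_posterior(prior, emp_risk, lam):
    # Probability vector proportional to prior * exp(-lam * emp_risk),
    # computed in log scale for numerical stability.
    logw = np.log(prior) - lam * emp_risk
    logw -= logw.max()
    w = np.exp(logw)
    return w / w.sum()

def objective(mu, prior, emp_risk, lam):
    # mu(r) + K(mu, prior) / lam for probability vectors mu and prior.
    mask = mu > 0
    kl = np.sum(mu[mask] * np.log(mu[mask] / prior[mask]))
    return float(np.sum(mu * emp_risk) + kl / lam)

rng = np.random.default_rng(2)
prior = np.full(8, 1.0 / 8)              # hypothetical uniform prior on 8 parameters
emp_risk = rng.uniform(0.0, 1.0, size=8) # hypothetical empirical risks r(theta)
lam = 5.0

posterior = gibbs_posterior(prior, emp_risk, lam)
best = objective(posterior, prior, emp_risk, lam)
competitors = [objective(m, prior, emp_risk, lam) for m in rng.dirichlet(np.ones(8), size=200)]

print(best, min(competitors))
```

By Lemma 1.1 applied with $h=-\lambda r$, the minimal value is $-\log\pi[\exp(-\lambda r)]/\lambda$, and no competitor does better.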

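Let us remark that, for any $\pi\in\mathcal{M}_+^1(\Theta)$ we have:

$$P\left[K(\rho,\pi)\right]=P\left[K(\rho,P(\rho))\right]+K(P(\rho),\pi).\tag{2.1}$$

This implies that, for a given data-dependent $\rho$, the optimal deterministic measure $\pi$ is $P(\rho)$, in the sense that it minimizes the expectation of $K(\rho,\pi)$ (left-hand side of Equation 2.1), making it equal to the expectation of $K(\rho,P(\rho))$. This last quantity is the mutual information between the estimator and the sample.

So, for $\rho=\pi_{\exp(-\lambda r)}$, this is an incentive to replace the prior $\pi$ with $P(\pi_{\exp(-\lambda r)})$. It is then natural to approximate this distribution by $\pi_{\exp(-\lambda R)}$.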

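In what follows, we replace $\pi$ by $\pi_{\exp(-\beta R)}$ for a given $\beta>0$, keeping one more degree of freedom. Now, note that Theorem 1.5 gives:

$$\rho(R)-R(\theta')\leq\rho(r)-r(\theta')+\frac{\lambda}{2N}\overline{P}\left[\int_\Theta(\ell_\theta-\ell_{\theta'})^2d\rho(\theta)\right]+\frac{K(\rho,\pi_{\exp(-\beta R)})+\log\frac{1}{\varepsilon}}{\lambda}$$

and note that the upper bound is no longer empirical (observable to the statistician).

The aim of the next subsection is to upper bound $K(\rho,\pi_{\exp(-\beta R)})$ by an empirical bound in a general setting.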

### 2.2. Empirical bound of the localized complexity

###### Definition 2.1.

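Let us put, for any $(\theta,\theta')\in\Theta^2$ and $(a,\lambda)$,

$$v_{a,\frac{\lambda}{N}}(\theta,\theta')=\frac{2N}{\lambda}\left\{\frac{1}{N}\sum_{i=1}^N\Phi_{\frac{\lambda}{N}}\left[(\ell_\theta-\ell_{\theta'})(Z_i)\wedge\frac{aN}{\lambda}\right]-[r(\theta)-r(\theta')]\right\}.$$

###### Theorem 2.1.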

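Let us choose a distribution $\pi\in\mathcal{M}_+^1(\Theta)$. For any $\varepsilon>0$ and $a$, for any $(\beta,\gamma)$ such that $\beta<\gamma$, with $P$-probability at least $1-\varepsilon$, for any $\rho\in\mathcal{M}_+^1(\Theta)$,

$$K(\rho,\pi_{\exp(-\beta R)})\leq BK_{a,\beta,\gamma}(\rho,\pi)+\frac{\beta}{\gamma-\beta}\log\frac{1}{\varepsilon}$$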

where

The proof is given in the section dedicated to proofs, more precisely in subsection 5.3. Note that the localized entropy term is controlled by its empirical counterpart together with a variance term.

Before combining this result with Theorem 1.5, we give the analogous result for the non-integrated case, whose proof is also given in subsection 5.3.

###### Theorem 2.2.

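Let us choose a distribution $\pi\in\mathcal{M}_+^1(\Theta)$ and a randomized estimator $\rho$. For any $\varepsilon>0$ and $a$, for any $(\beta,\gamma)$ such that $\beta<\gamma$, with $P\otimes\rho$-probability at least $1-\varepsilon$,

$$\log\left[\frac{d\rho}{d\pi_{\exp(-\beta R)}}(\tilde\theta)\right]\leq D_{a,\beta,\gamma}(\rho,\pi)(\tilde\theta)+\frac{\beta}{\gamma-\beta}\log\frac{1}{\varepsilon}$$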

where

### 2.3. Localized PAC-Bayesian theorems

###### Definition 2.2.

From now on, we will deal with model selection. We assume that we have a family of submodels of $\Theta$: $(\Theta_i,i\in I)$ where $I$ is finite or countable. We also choose a probability measure $\mu\in\mathcal{M}_+^1(I)$, and assume that we have a prior distribution $\pi^i\in\mathcal{M}_+^1(\Theta_i)$ for every $i\in I$.

We choose

$$\pi=\sum_{i\in I}\mu(i)\,\pi^i_{\exp(-\beta_iR)}$$

and apply Theorem 1.3 that we combine with Theorem 2.2 by a union bound argument, to obtain the following result.

###### Theorem 2.3.

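Let us assume that we have randomized estimators $(\rho_i)_{i\in I}$ such that $\rho_i(\Theta_i)=1$. For any $\varepsilon>0$, $a$ and $\lambda$, for any $(\beta,\gamma,\beta',\gamma')$ such that $\beta<\gamma$ and $\beta'<\gamma'$, with probability at least $1-\varepsilon$ over the sample and the parameters $(\tilde\theta_i)_{i\in I}$, for any $(i,i')\in I^2$ we have:

$$R_{\frac{\lambda}{a}}(\tilde\theta_i,\tilde\theta_{i'})\leq\Phi^{-1}_{\frac{\lambda}{N}}\left\{r(\tilde\theta_i)-r(\tilde\theta_{i'})+\frac{\lambda}{2N}v_{a,\frac{\lambda}{N}}(\tilde\theta_i,\tilde\theta_{i'})+\frac{1}{\lambda}\left[D_{a,\beta,\gamma}(\rho_i,\pi^i)(\tilde\theta_i)+D_{a,\beta',\gamma'}(\rho_{i'},\pi^{i'})(\tilde\theta_{i'})+\left(1+\frac{\beta}{\gamma-\beta}+\frac{\beta'}{\gamma'-\beta'}\right)\log\frac{3}{\varepsilon\mu(i)\mu(i')}\right]\right\}.$$

In the same way, we can give an integrated variant, using Theorem 1.5 and Theorem 2.1.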

###### Theorem 2.4.

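For any $\varepsilon>0$, $a$ and $\lambda$, for any $(\beta,\gamma,\beta',\gamma')$ such that $\beta<\gamma$ and $\beta'<\gamma'$, with $P$-probability at least $1-\varepsilon$, for any $(i,i')\in I^2$ and $(\rho,\rho')\in\mathcal{M}_+^1(\Theta_i)\times\mathcal{M}_+^1(\Theta_{i'})$,

$$\int_{\Theta^2}d(\rho\otimes\rho')(\theta,\theta')R_{\frac{\lambda}{a}}(\theta,\theta')\leq\Phi^{-1}_{\frac{\lambda}{N}}\left\{\rho(r)-\rho'(r)+\frac{\lambda}{2N}\int_{\Theta^2}d(\rho\otimes\rho')(\theta,\theta')\,v_{a,\frac{\lambda}{N}}(\theta,\theta')+\frac{BK_{a,\beta,\gamma}(\rho,\pi^i)+BK_{a,\beta',\gamma'}(\rho',\pi^{i'})+\left(1+\frac{\beta}{\gamma-\beta}+\frac{\beta'}{\gamma'-\beta'}\right)\log\frac{3}{\varepsilon\mu(i)\mu(i')}}{\lambda}\right\}.$$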

### 2.4. Choice of the parameters

In this subsection, we explain how to choose the parameters in Theorems 2.3 and 2.4. In some really simple situations (parametric model with strong assumptions on $P$), this choice can be made on the basis of theoretical considerations. However, in many realistic situations, such hypotheses cannot be made and we would like to optimize the upper bound in the theorems with respect to the parameters. This would lead to data-dependent values for the parameters, which is not allowed by Theorems 2.3 and 2.4. Catoni proposes to make a union bound on a grid of values of the parameters, thus allowing optimization with respect to these parameters. We apply this idea to Theorem 2.4, and obtain the following result.

###### Theorem 2.5.

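Let us choose a measure $\nu\in\mathcal{M}_+^1(\mathbb{R}_+^*)$ that is supported by a finite or countable set of points, $\mathrm{supp}(\nu)$. Let us assume that we have randomized estimators $(\rho_{i,\beta})$ such that $\rho_{i,\beta}(\Theta_i)=1$. For any $\varepsilon>0$ and $a$, with probability at least $1-\varepsilon$ over the sample and the parameters $(\tilde\theta_{i,\beta})$, for any $(i,i')\in I^2$ and $(\beta,\beta')\in\mathrm{supp}(\nu)^2$ we have:

$$
\begin{aligned}
R_{\frac{\lambda}{a}}(\tilde\theta_{i,\beta},\tilde\theta_{i',\beta'})&\leq B\bigl((i,\beta),(i',\beta')\bigr)\\
&=\inf_{\substack{\lambda\in]0,+\infty[\\ \gamma\in]\beta,+\infty[\\ \gamma'\in]\beta',+\infty[}}\Phi^{-1}_{\frac{\lambda}{N}}\Biggl\{r(\tilde\theta_{i,\beta})-r(\tilde\theta_{i',\beta'})+\frac{\lambda}{2N}v_{a,\frac{\lambda}{N}}(\tilde\theta_{i,\beta},\tilde\theta_{i',\beta'})\\
&\qquad+\frac{1}{\lambda}\Biggl[D_{a,\beta,\gamma}(\rho_{i,\beta},\pi^i)(\tilde\theta_{i,\beta})+D_{a,\beta',\gamma'}(\rho_{i',\beta'},\pi^{i'})(\tilde\theta_{i',\beta'})\\
&\qquad+\left(1+\frac{\beta}{\gamma-\beta}+\frac{\beta'}{\gamma'-\beta'}\right)\log\frac{3}{\varepsilon\,\nu(\lambda)\nu(\gamma)\nu(\beta)\nu(\gamma')\nu(\beta')\mu(i)\mu(i')}\Biggr]\Biggr\}.
\end{aligned}
$$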

### 2.5. Introduction of the complexity function

It is convenient to remark that we can dissociate the optimization with respect to the different parameters in Theorem 2.5 thanks to the introduction of an appropriate complexity function. The model selection algorithm we propose in the next subsection takes advantage of this decomposition.

###### Definition 2.3.

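Let us choose some real constants $\zeta>1$, $a$ and $\varepsilon$. We assume that some randomized estimators $(\rho_{i,\beta})$ have been chosen and that we have drawn $\tilde\theta_{i,\beta}$ for every $i$ and $\beta$. We define, for any $(i,\beta)$,

$$C(i,\beta)=\inf_{\gamma\in[\zeta\beta,+\infty[}\left\{D_{a,\beta,\gamma}(\rho_{i,\beta},\pi^i)(\tilde\theta_{i,\beta})+\left(\frac{\beta}{\gamma-\beta}+\frac{1}{\zeta-1}+1\right)\log\frac{3}{\varepsilon\mu(i)\nu(\beta)\nu(\gamma)}\right\}.$$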

We have the following result.

###### Theorem 2.6.

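For any $(i,\beta)$ and $(i',\beta')$,

$$B\bigl((i,\beta),(i',\beta')\bigr)\leq\inf_{\lambda>0}\Phi^{-1}_{\frac{\lambda}{N}}\left\{r(\tilde\theta_{i,\beta})-r(\tilde\theta_{i',\beta'})+\frac{\lambda}{2N}v_{a,\frac{\lambda}{N}}(\tilde\theta_{i,\beta},\tilde\theta_{i',\beta'})+\frac{C(i,\beta)+C(i',\beta')+\frac{\zeta+1}{\zeta-1}\log\frac{3}{\varepsilon\nu(\lambda)}}{\lambda}\right\}.$$

Note, as a consequence of the concavity of $\Phi^{-1}_{\frac{\lambda}{N}}$, that this implies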

###### Corollary 2.7.
 B((i,β),(i′,β′))+B((i′,β′),(i,β))≤2