PAC-Bayes under potentially heavy tails

05/20/2019 · Matthew J. Holland, et al. · Osaka University

We derive PAC-Bayesian learning guarantees for heavy-tailed losses, and demonstrate that the resulting optimal Gibbs posterior enjoys much stronger guarantees than are available for existing randomized learning algorithms. Our core technique itself makes use of PAC-Bayesian inequalities in order to derive a robust risk estimator, which by design is easy to compute. In particular, only assuming that the variance of the loss distribution is finite, the learning algorithm derived from this estimator enjoys nearly sub-Gaussian statistical error.

1 Introduction

More than two decades ago, the origins of PAC-Bayesian learning theory were developed with the goal of strengthening traditional PAC (probably approximately correct) learning guarantees [20] by explicitly accounting for prior knowledge [18, 13, 6]. Subsequent work developed finite-sample risk bounds for “Bayesian” learning algorithms which specify a distribution over the model [14]. These bounds are controlled using the empirical risk and the relative entropy between “prior” and “posterior” distributions, and hold uniformly over the choice of the latter; since the guarantees thus cover data-dependent posteriors, the naming is justified. Furthermore, choosing the posterior to minimize PAC-Bayesian risk bounds leads to practical learning algorithms which have seen numerous successful applications [3].

Following this framework, a tremendous amount of work has been done to refine, extend, and apply the PAC-Bayesian framework to new learning problems. Tight risk bounds for bounded losses are due to Seeger [16] and Maurer [12], with the former work applying them to Gaussian processes. Bounds constructed using the loss variance in a Bernstein-type inequality were given by Seldin et al. [17], with a data-dependent extension derived by Tolstikhin and Seldin [19]. As stated by McAllester [15], virtually all the bounds derived in the original PAC-Bayesian theory “only apply to bounded loss functions.” This technical barrier was overcome by Alquier et al. [3], who introduce an additional error term depending on the concentration of the empirical risk about the true risk. This technique was subsequently applied to the log-likelihood loss in the context of Bayesian linear regression by Germain et al. [11], and further systematized by Bégin et al. [5]. While this approach lets us deal with unbounded losses, the statistical error guarantees are naturally only as good as the confidence intervals available for the empirical mean deviations. In particular, strong assumptions on all of the moments of the loss are essentially unavoidable using the traditional tools espoused by Bégin et al. [5], which means the “heavy-tailed” regime, where all we assume is that a few higher-order moments are finite (say finite variance and/or finite kurtosis), cannot be handled. A new technique for deriving PAC-Bayesian bounds under heavy-tailed losses is introduced by Alquier and Guedj [2]; their lucid procedure provides error rates even under heavy tails, but as the authors recognize, the rates are highly sub-optimal due to direct dependence on the empirical risk, leading in turn to sub-optimal algorithms derived from these bounds. (See work by Catoni [8], Devroye et al. [10], and the references within for background on the fundamental limitations of the empirical mean for real-valued random variables.)

In this work, while keeping many core ideas of Bégin et al. [5] intact, we take a novel approach and obtain exponential tail bounds on the excess risk via PAC-Bayesian inequalities that hold even under heavy-tailed losses. Our key technique is to replace the empirical risk with a new mean estimator, inspired by the dimension-free estimators of Catoni and Giulini [9] and designed to be computationally convenient. We review some key theory in section 2 before introducing the new estimator in section 3. In section 4 we apply this estimator to the PAC-Bayes setting, deriving a new robust optimal Gibbs posterior. Most detailed proofs are relegated to section A.1 in the appendix.

2 PAC-Bayesian theory based on the empirical mean

Let us begin by briefly reviewing the best available PAC-Bayesian learning guarantees under general losses. Denote by $z_1,\dots,z_n \in \mathcal{Z}$ a sequence of independent observations distributed according to a common distribution $\mu$. Denote by $\mathcal{H}$ a model, from which the learner selects a candidate $h$ based on the $n$-sized sample. The quality of this choice can be measured in a pointwise fashion using a loss function $l: \mathcal{H} \times \mathcal{Z} \to \mathbb{R}$, typically assumed non-negative. The learning task is to achieve a small risk, defined by $R(h) := \mathbb{E}_\mu\, l(h;z)$. Since the underlying distribution $\mu$ is inherently unknown, the canonical proxy is the empirical risk

$$\widehat{R}(h) := \frac{1}{n} \sum_{i=1}^{n} l(h; z_i).$$

Let $\pi$ and $\rho$ respectively denote “prior” and “posterior” distributions on the model $\mathcal{H}$. The so-called Gibbs risk induced by $\rho$, as well as its empirical counterpart, are given by

$$\mathbb{E}_{h \sim \rho}\, R(h), \qquad \mathbb{E}_{h \sim \rho}\, \widehat{R}(h).$$

When our losses are almost surely bounded, lucid guarantees are available.

Theorem 1 (PAC-Bayes under bounded losses [14, 5]).

Assume the loss satisfies $0 \le l(h;z) \le 1$, and fix an arbitrary prior $\pi$ on $\mathcal{H}$. For any confidence level $\delta \in (0,1)$, we have with probability no less than $1-\delta$ over the draw of the sample that

$$\mathbb{E}_{h\sim\rho}\, R(h) \le \mathbb{E}_{h\sim\rho}\, \widehat{R}(h) + \sqrt{\frac{K(\rho;\pi) + \log(2\sqrt{n}/\delta)}{2n}},$$

uniformly in the choice of $\rho$.

Since the “good event” on which the inequality in Theorem 1 holds is valid for any choice of $\rho$, the result holds even when $\rho$ depends on the sample, which justifies calling it a posterior distribution. Optimizing this upper bound leads to the so-called optimal Gibbs posterior, which takes a form that is readily characterized (cf. Remark 13).
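To make the computation concrete, the following minimal sketch (in Python, with toy numbers of our own choosing) evaluates the bound of Theorem 1 for a finite model; the Maurer-type form used here follows the statement above.

```python
import numpy as np

def pac_bayes_bound_bounded(gibbs_emp_risk, kl, n, delta):
    """Bounded-loss PAC-Bayes bound of Theorem 1 (Maurer-type form):
    empirical Gibbs risk plus sqrt((K(rho; pi) + log(2 sqrt(n)/delta)) / (2n))."""
    return gibbs_emp_risk + np.sqrt((kl + np.log(2.0 * np.sqrt(n) / delta)) / (2.0 * n))

# Toy example: a model with three candidates and losses in [0, 1].
pi = np.array([1/3, 1/3, 1/3])              # prior over the model
rho = np.array([0.7, 0.2, 0.1])             # data-dependent posterior
emp_risk = np.array([0.15, 0.25, 0.40])     # empirical risk of each candidate
kl = float(np.sum(rho * np.log(rho / pi)))  # K(rho; pi) for discrete distributions
print(pac_bayes_bound_bounded(rho @ emp_risk, kl, n=1000, delta=0.05))
```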

The above results fall apart when the loss is unbounded, and meaningful extensions become challenging when exponential moment bounds are not available. As highlighted in section 1 above, over the years, the analytical machinery has evolved to provide general-purpose PAC-Bayesian bounds even under heavy-tailed data. The following theorem of Alquier and Guedj [2] extends the strategy of Bégin et al. [5] to obtain bounds under the weakest conditions we know of.

Theorem 2 (PAC-Bayes under heavy-tailed losses [2]).

Take any $p > 1$ and set $q = p/(p-1)$. For any confidence level $\delta \in (0,1)$, we have with probability no less than $1-\delta$ over the draw of the sample that

$$\mathbb{E}_{h\sim\rho}\, R(h) \le \mathbb{E}_{h\sim\rho}\, \widehat{R}(h) + \left(\frac{\mathcal{M}_q}{\delta}\right)^{1/q} \left(D_{\phi_p}(\rho\,\|\,\pi) + 1\right)^{1/p},$$

uniformly in the choice of $\rho$, where $\mathcal{M}_q := \mathbb{E}_{h\sim\pi}\, \mathbb{E}_\mu\, |\widehat{R}(h) - R(h)|^q$ and $D_{\phi_p}(\rho\,\|\,\pi) := \mathbb{E}_{h\sim\pi}\, \phi_p\!\left(\frac{d\rho}{d\pi}(h)\right) - 1$ with $\phi_p(t) = t^p$.

For concreteness, consider the case of $p = q = 2$, where $D_{\phi_2}(\rho\,\|\,\pi) = \chi^2(\rho\,\|\,\pi)$, and assume that the variance of the loss is $\mu$-finite, namely that

$$\mathbb{V}_\mu\, l(h;z) \le \sigma^2 < \infty \quad \text{for all } h \in \mathcal{H}.$$

From Proposition 4 of Alquier and Guedj [2], we have $\mathcal{M}_2 \le \sigma^2/n$. It follows that on the high-probability event, we have

$$\mathbb{E}_{h\sim\rho}\, R(h) \le \mathbb{E}_{h\sim\rho}\, \widehat{R}(h) + \sqrt{\frac{\sigma^2 \left(\chi^2(\rho\,\|\,\pi) + 1\right)}{n\,\delta}}.$$

While the $1/\sqrt{n}$ rate and the dependence on a divergence between $\rho$ and $\pi$ are similar, note that the dependence on the confidence level $\delta$ is polynomial; compare this with the logarithmic dependence available in Theorem 1 above when the losses were bounded.

For comparison, our main result of section 4 is a uniform bound on the Gibbs risk: with probability no less than $1-\delta$, we have

$$\mathbb{E}_{h\sim\rho}\, R(h) \le \mathbb{E}_{h\sim\rho}\, \widehat{R}_n(h) + O\!\left(\frac{K(\rho;\pi) + \log(\delta^{-1})}{\sqrt{n}}\right)$$

uniformly in the choice of $\rho$, where $\widehat{R}_n$ is an estimator of the risk $R$ defined in section 3, and the key constants are bounds $m_2$ such that for all $h \in \mathcal{H}$ we have $\mathbb{E}_\mu\, l(h;z)^2 \le m_2$. As long as the second moment is finite, this guarantee holds, and thus both sub-Gaussian and heavy-tailed losses (e.g., with infinite higher-order moments) are permitted. Given any valid $m_2$, the PAC-Bayesian upper bound above can be computed from the data, and thus an optimal Gibbs posterior which minimizes this bound in $\rho$ can also be computed in practice. In section 4, we characterize this “robust posterior.”

3 A new estimator using smoothed Bernoulli noise

Notation

In this section, we are dealing with the specific problem of robust mean estimation, thus we specialize our notation slightly. Data observations will be $x_1,\dots,x_n \in \mathbb{R}$, assumed to be independent copies of $x \sim \mu$. Denote the index set $[n] := \{1,\dots,n\}$. Write $\mathbb{P}$ and $\mathbb{E}$ for probability and expectation, taken with respect to $\mu$ unless otherwise specified. Write $\mathcal{P}(\Omega)$ for the set of all probability measures defined on the measurable space $(\Omega, \mathcal{F})$. We shall typically suppress $\mu$ and $\mathcal{F}$ in the notation when clear from the context. Use $[\cdot]_+$ and $[\cdot]_-$ to denote the positive and negative parts of the enclosed functions, e.g., $[f]_+ = \max\{f, 0\}$ and $[f]_- = \max\{-f, 0\}$. Let $\psi$ be a bounded, non-decreasing function such that for some $b > 0$ and all $u \in \mathbb{R}$,

$$-\log\left(1 - u + \frac{u^2}{b}\right) \le \psi(u) \le \log\left(1 + u + \frac{u^2}{b}\right). \qquad (1)$$

As a concrete and analytically useful example, we shall use the piecewise polynomial function of Catoni and Giulini [9], defined by

$$\psi(u) := \begin{cases} u - u^3/6, & -\sqrt{2} \le u \le \sqrt{2} \\ 2\sqrt{2}/3, & u > \sqrt{2} \\ -2\sqrt{2}/3, & u < -\sqrt{2}, \end{cases} \qquad (2)$$

which satisfies (1) with $b = 2$, and is pictured in Figure 1 with the two key bounds. (Slightly looser bounds hold for an analogous procedure using a Huber-type influence function.)

Figure 1: Graph of the Catoni function $\psi$ of (2), together with the two logarithmic bounds of (1).
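The following is a minimal sketch of the truncation function (2), together with a numerical check of the logarithmic bounds (1) for $b = 2$; the grid and tolerance are arbitrary choices.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def psi_catoni(u):
    """Catoni-Giulini soft-truncation function (2): odd, bounded, non-decreasing;
    equal to u - u^3/6 on [-sqrt(2), sqrt(2)], saturating at +/- 2*sqrt(2)/3 outside."""
    u = np.asarray(u, dtype=float)
    out = u - u**3 / 6.0
    out = np.where(u > SQRT2, 2.0 * SQRT2 / 3.0, out)
    out = np.where(u < -SQRT2, -2.0 * SQRT2 / 3.0, out)
    return out

# Sanity check of the logarithmic bounds (1) with b = 2 on a grid.
u = np.linspace(-3.0, 3.0, 601)
lower = -np.log(1.0 - u + u**2 / 2.0)   # note: 1 - u + u^2/2 > 0 for all u
upper = np.log(1.0 + u + u**2 / 2.0)
vals = psi_catoni(u)
assert np.all(lower <= vals + 1e-12) and np.all(vals <= upper + 1e-12)
```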

Estimator definition

We consider a straightforward procedure, in which the data are subject to a soft truncation after re-scaling, defined by

$$\widehat{x} := \frac{s}{n} \sum_{i=1}^{n} \psi\!\left(\frac{x_i}{s}\right), \qquad (3)$$

where $s > 0$ is a re-scaling parameter. Depending on the setting of $s$, this function can very closely approximate the sample mean, and indeed modifying this scaling parameter controls the bias of this estimator in a direct way, which can be quantified as follows. As the scale $s$ grows, note that

$$s\, \psi\!\left(\frac{x}{s}\right) \to x, \quad s \to \infty,$$

which implies that, taking expectation with respect to the sample, in the limit this estimator is unbiased, with

$$\lim_{s \to \infty} \mathbb{E}\, \widehat{x} = \mathbb{E}\, x.$$

On the other hand, taking $s$ closer to zero implies that more observations will be truncated. Taking $s$ small enough (more precisely, $0 < s \le \min_{i \in [n]} |x_i|/\sqrt{2}$), we have

$$\widehat{x} = \frac{s}{n} \cdot \frac{2\sqrt{2}}{3} \left(|I_+| - |I_-|\right),$$

which converges to zero as $s \to 0$. Here the positive/negative index sets are $I_+ := \{i \in [n] : x_i > 0\}$ and $I_- := \{i \in [n] : x_i < 0\}$. Thus taking $s$ too small means that only the signs of the observations matter, and the absolute value of the estimator tends to become too small.
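A minimal sketch of the estimator (3), illustrating the two regimes just described; the heavy-tailed test distribution is purely illustrative, and psi_catoni is the function from the previous snippet.

```python
import numpy as np
# (assumes psi_catoni from the earlier sketch is in scope)

def soft_truncated_mean(x, s):
    """Robust mean estimator (3): re-scale, soft-truncate, average, undo the scaling."""
    x = np.asarray(x, dtype=float)
    return (s / len(x)) * np.sum(psi_catoni(x / s))

rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=2000) + 1.0    # heavy-tailed data with mean 1

print(soft_truncated_mean(x, s=100.0))  # large s: close to the sample mean
print(soft_truncated_mean(x, s=1e-4))   # tiny s: shrunk toward zero (signs only)
```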

High-probability deviation bounds for $\widehat{x}$

We are interested in high-probability bounds on the deviations $|\widehat{x} - \mathbb{E}\,x|$ under the weakest possible assumptions on the underlying data distribution. To obtain such guarantees in a straightforward manner, we make the simple observation that the estimator defined in (3) can be related to an estimator with smoothed noise as follows. Let $\epsilon_1,\dots,\epsilon_n$ be an iid sample of noise with distribution $\epsilon \sim \mathrm{Bernoulli}(\theta)$ for some $\theta \in (0,1)$. Then, since $\psi(0) = 0$, taking expectation with respect to the noise sample, one has that

$$\widehat{x} = \mathbb{E}_{\epsilon}\left[\frac{s}{n\,\theta} \sum_{i=1}^{n} \psi\!\left(\frac{x_i\, \epsilon_i}{s}\right)\right]. \qquad (4)$$

This simple observation becomes useful to us in the context of the following technical fact.

Lemma 3.

Assume we are given some independent data $x_1,\dots,x_n$, assumed to be copies of the random variable $x \sim \mu$. In addition, let $\epsilon_1,\dots,\epsilon_n$ similarly be independent observations of “strategic noise,” with distribution $\nu$ that we can design. Fix an arbitrary prior distribution $\pi$ on the noise space, and consider $f(x,\epsilon)$, assumed to be bounded and measurable. Write $K(\rho;\pi)$ for the Kullback-Leibler divergence between distributions $\rho$ and $\pi$. It follows that with probability no less than $1-\delta$ over the random draw of the sample, we have

$$\mathbb{E}_{\epsilon}\!\left[\frac{1}{n} \sum_{i=1}^{n} f(x_i, \epsilon_i)\right] \le \mathbb{E}_{\epsilon\sim\rho} \log \mathbb{E}_{x\sim\mu} \exp\big(f(x, \epsilon)\big) + \frac{K(\rho;\pi) + \log(\delta^{-1})}{n},$$

uniformly in the choice of $\rho$, where the $\epsilon_i$ on the left-hand side are drawn iid from $\rho$, i.e., the expectation on the left-hand side is over the noise sample.

The special case of interest here is $f(x,\epsilon) = \psi(x\epsilon/s)$. Using (1) and Lemma 3, with prior $\pi = \mathrm{Bernoulli}(\theta_0)$ and posterior $\rho = \mathrm{Bernoulli}(\theta)$, it follows that on the high-probability event, uniformly in the choice of $\theta$, we have

$$\frac{\theta}{s}\, \widehat{x} \le \theta \left(\frac{\mathbb{E}\,x}{s} + \frac{\mathbb{E}\,x^2}{2 s^2}\right) + \frac{K(\rho;\pi) + \log(\delta^{-1})}{n}, \qquad (5)$$

where we have used the fact that $\epsilon^2 = \epsilon$ in the Bernoulli case. Dividing both sides by $\theta/s$ and optimizing this as a function of $s$ yields a closed-form expression for $s$ depending on the second moment, the confidence $\delta$, and $n$. Analogous arguments yield lower bounds on the same quantity. Taking these facts together, we have the following proposition, which says that assuming only a finite second moment $m_2 := \mathbb{E}\,x^2 < \infty$, the proposed estimator achieves exponential tail bounds scaling with the second non-central moment.

Proposition 4 (Concentration of deviations).

Scaling with $s^2 = n\, m_2 / (2 \log(\delta^{-1}))$, the estimator defined in (3) satisfies

$$|\widehat{x} - \mathbb{E}\,x| \le \sqrt{\frac{2\, m_2 \log(\delta^{-1})}{n}} \qquad (6)$$

with probability at least $1 - 2\delta$.

Proof of Proposition 4.

First, note that the upper bound derived from (5) holds uniformly in the choice of $\theta$ on a high-probability event. Setting $\rho = \pi$ (so that $K(\rho;\pi) = 0$) with $\theta$ arbitrarily close to one, and solving for the optimal setting of $s$, is just calculus. It remains to obtain a corresponding lower bound on $\widehat{x} - \mathbb{E}\,x$. To do so, consider the analogous setting of Bernoulli $\rho$ and $\pi$, but this time on the domain $\{0,-1\}$, with $\mathbb{E}_{\epsilon\sim\rho}\,\epsilon = -\theta$ and $\epsilon^2 = -\epsilon$. Using (1) and Lemma 3 again, we have

$$-\frac{\theta}{s}\, \widehat{x} \le \theta \left(-\frac{\mathbb{E}\,x}{s} + \frac{\mathbb{E}\,x^2}{2 s^2}\right) + \frac{K(\rho;\pi) + \log(\delta^{-1})}{n},$$

where we note that the signs flip since $\mathbb{E}_{\epsilon\sim\rho}\,\epsilon = -\theta$ while $\mathbb{E}_{\epsilon\sim\rho}\,\epsilon^2 = \theta$. This yields a high-probability lower bound in the desired form when we set $s$ as in the hypothesis, since an upper bound on $\mathbb{E}\,x - \widehat{x}$ is equivalent to a lower bound on $\widehat{x} - \mathbb{E}\,x$. However, since we have changed the prior in this case, the high-probability event here need not be the same as that for the upper bound, and as such, we must take a union bound over these two events to obtain the desired final result. ∎

Remark 5.

While the above bound (6) depends on the true second moment, as is clear from the proof outlined above, the result is easily extended to hold for any valid upper bound on the moment, which is what will inevitably be used in practice.
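In code, the scaling suggested by Proposition 4 might be implemented as follows; m2_bound is assumed to be any valid upper bound on the second moment, in the spirit of Remark 5, and soft_truncated_mean is from the earlier snippet.

```python
import numpy as np
# (assumes soft_truncated_mean from the earlier sketch is in scope)

def catoni_giulini_estimate(x, m2_bound, delta):
    """Mean estimate with the scale of Proposition 4, s^2 = n*m2 / (2*log(1/delta)).
    Returns the estimate together with the deviation level of (6),
    sqrt(2*m2*log(1/delta)/n), valid with probability at least 1 - 2*delta."""
    n = len(x)
    s = np.sqrt(n * m2_bound / (2.0 * np.log(1.0 / delta)))
    eps = np.sqrt(2.0 * m2_bound * np.log(1.0 / delta) / n)
    return soft_truncated_mean(x, s), eps
```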

Centered estimates

Note that the bound (6) depends on the second moment of the underlying data; this is in contrast to M-estimators, which, due to a natural “centering” of the data, typically have tail bounds depending on the variance. This results in a sensitivity to the location of the distribution: e.g., holding the variance fixed at one, performance when $\mathbb{E}\,x = 0$ will tend to be much better than when $|\mathbb{E}\,x|$ is large. Fortunately, a simple centering strategy works well to alleviate this sensitivity, as follows.

Without loss of generality, assume that the first $k < n$ observations are used for constructing a shifting device, with the remaining $n-k$ points left for running the usual routine on shifted data. More concretely, define

$$\widehat{x}_{\mathrm{shift}} := \frac{s}{k} \sum_{i=1}^{k} \psi\!\left(\frac{x_i}{s}\right). \qquad (7)$$

From (6) in Proposition 4, we have

$$|\widehat{x}_{\mathrm{shift}} - \mathbb{E}\,x| \le \varepsilon_k := \sqrt{\frac{2\, m_2 \log(\delta^{-1})}{k}}$$

on an event with probability no less than $1 - 2\delta$, over the draw of the $k$-sized sub-sample. Using this, we shift the remaining data points as $\widetilde{x}_i := x_i - \widehat{x}_{\mathrm{shift}}$ for $i = k+1,\dots,n$. Note that, conditioned on the first sub-sample and the above event, the second moment of this data is bounded as follows:

$$\mathbb{E}\,\widetilde{x}^2 = \mathrm{var}(x) + \left(\mathbb{E}\,x - \widehat{x}_{\mathrm{shift}}\right)^2 \le \mathrm{var}(x) + \varepsilon_k^2.$$

Passing these shifted points through (3), with the analogous second moment bound used for scaling, we have

$$\widehat{x}_{\mathrm{new}} := \frac{s'}{n-k} \sum_{i=k+1}^{n} \psi\!\left(\frac{\widetilde{x}_i}{s'}\right), \qquad (s')^2 = \frac{(n-k)\left(\mathrm{var}(x) + \varepsilon_k^2\right)}{2 \log(\delta^{-1})}. \qquad (8)$$

Shifting the resulting output back to the original location by adding $\widehat{x}_{\mathrm{shift}}$, and conditioning on $\widehat{x}_{\mathrm{shift}}$, we have by (6) again that

$$\left|\widehat{x}_{\mathrm{new}} - \left(\mathbb{E}\,x - \widehat{x}_{\mathrm{shift}}\right)\right| \le \sqrt{\frac{2 \left(\mathrm{var}(x) + \varepsilon_k^2\right) \log(\delta^{-1})}{n-k}}$$

with probability no less than $1 - 2\delta$ over the draw of the remaining points. Defining the centered estimator as $\widehat{x}_{\mathrm{cent}} := \widehat{x}_{\mathrm{shift}} + \widehat{x}_{\mathrm{new}}$, and taking a union bound over the two “good events” on the independent sample subsets, we may thus conclude that

$$|\widehat{x}_{\mathrm{cent}} - \mathbb{E}\,x| \le \sqrt{\frac{2 \left(\mathrm{var}(x) + \varepsilon_k^2\right) \log(\delta^{-1})}{n-k}} \qquad (9)$$

with probability no less than $1 - 4\delta$, where probability is over the draw of the full $n$-sized sample. While one takes a hit in terms of effective sample size, the bound now scales with the variance rather than the raw second moment, which combats sensitivity to the distribution's location.
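A minimal sketch of this two-stage centered estimator follows; the even split k = n/2 is an illustrative default, and var_bound, m2_bound are assumed to be valid upper bounds supplied by the user.

```python
import numpy as np
# (assumes catoni_giulini_estimate from the earlier sketch is in scope)

def centered_estimate(x, var_bound, m2_bound, delta, k=None):
    """Two-stage centered estimator (7)-(9): estimate a rough shift on the first
    k points, re-run the estimator on the shifted remainder (whose second moment
    is now close to the variance), then undo the shift."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    k = n // 2 if k is None else k
    shift, eps_k = catoni_giulini_estimate(x[:k], m2_bound, delta)
    # Second moment of the shifted data is at most var_bound + eps_k**2.
    est, _ = catoni_giulini_estimate(x[k:] - shift, var_bound + eps_k**2, delta)
    return shift + est
```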

4 PAC-Bayesian bounds for heavy-tailed data

An important and influential paper due to D. McAllester gave the following theorem as a motivating result. For clarity, we give a slightly modified version of his result.

Theorem 6 (McAllester [13], Preliminary Theorem 2).

Let $\pi$ be a prior probability distribution over $\mathcal{H}$, assumed countable, and such that $\pi(h) > 0$ for all $h \in \mathcal{H}$. Consider the pattern recognition task with $z = (x, y)$, $y \in \{0, 1\}$, and the classification error $l(h;z) = I\{h(x) \neq y\}$. Then with probability no less than $1-\delta$, for any choice of $h \in \mathcal{H}$, we have

$$R(h) \le \widehat{R}(h) + \sqrt{\frac{\log(1/\pi(h)) + \log(1/\delta)}{2n}}.$$

Proof.

For clean notation, denote the empirical risk as

$$\widehat{R}(h) := \frac{1}{n} \sum_{i=1}^{n} I\{h(x_i) \neq y_i\}.$$

Using a classical Chernoff bound specialized to the case of Bernoulli observations (Lemma 17), we have that for any $h \in \mathcal{H}$ and $\varepsilon > 0$, it holds that

$$\mathbb{P}\left\{R(h) - \widehat{R}(h) > \varepsilon\right\} \le \exp\left(-2 n \varepsilon^2\right).$$

Rearranging terms, it follows immediately that with probability no less than $1 - \delta\,\pi(h)$, we have

$$R(h) \le \widehat{R}(h) + \sqrt{\frac{\log\left(1/(\delta\,\pi(h))\right)}{2n}}.$$

The desired result follows from a union bound:

$$\mathbb{P} \bigcup_{h \in \mathcal{H}} \left\{R(h) > \widehat{R}(h) + \sqrt{\frac{\log\left(1/(\delta\,\pi(h))\right)}{2n}}\right\} \le \sum_{h \in \mathcal{H}} \delta\,\pi(h) = \delta.$$

The event on the left-hand side of the above inequality is precisely that of the hypothesis, namely the “bad event” on which the sample is such that the risk exceeds the given bound for some candidate $h$. ∎

Our motivating pre-theorem

The basic idea of our approach is very simple: instead of using the sample mean, bound the off-sample risk using a more robust estimator which is easy to compute directly, and which allows risk bounds even under unbounded, potentially heavy-tailed losses. Define a new approximation of the risk by

$$\widehat{R}_n(h) := \frac{s}{n} \sum_{i=1}^{n} \psi\!\left(\frac{l(h; z_i)}{s}\right) \qquad (10)$$

for $s > 0$. Note that this is just a direct application of the robust estimator defined in (3) to the case of a loss which depends on the choice of candidate $h$. As a motivating result, we essentially re-prove McAllester's result (Theorem 6) under much weaker assumptions on the loss, using the statistical properties of the new risk estimator (10) rather than relying on classical Chernoff inequalities.

Theorem 7 (Pre-theorem).

Let $\pi$ be a prior probability distribution over $\mathcal{H}$, assumed countable. Assume that $\pi(h) > 0$ for all $h \in \mathcal{H}$, and that $\mathbb{E}_\mu\, l(h;z)^2 \le m_2 < \infty$ for all $h \in \mathcal{H}$. Setting the scale in (10) to $s^2 = n\, m_2 / (2 \log(1/(\delta\,\pi(h))))$, then with probability no less than $1 - 2\delta$, for any choice of $h \in \mathcal{H}$, we have

$$R(h) \le \widehat{R}_n(h) + \sqrt{\frac{2\, m_2 \log\left(\frac{1}{\delta\,\pi(h)}\right)}{n}}.$$

Proof.

We start by making use of the pointwise deviation bound given in Proposition 4, which tells us that with high probability

$$|\widehat{R}_n(h) - R(h)| \le \sqrt{\frac{2\, m_2 \log(\delta^{-1})}{n}}$$

for any pre-fixed $h \in \mathcal{H}$. Replacing $\delta$ with $\delta\,\pi(h)$ gives the key error level

$$\varepsilon(h) := \sqrt{\frac{2\, m_2 \log\left(\frac{1}{\delta\,\pi(h)}\right)}{n}},$$

and using the union bound argument in the proof of Theorem 6, we have

$$\mathbb{P} \bigcup_{h \in \mathcal{H}} \left\{|\widehat{R}_n(h) - R(h)| > \varepsilon(h)\right\} \le \sum_{h \in \mathcal{H}} 2\,\delta\,\pi(h) = 2\delta. \qquad \blacksquare$$

Remark 8.

We note that all quantities on the right-hand side of Theorem 7 are easily computed based on the sample, except for the second moment bound $m_2$, which in practice must be replaced with an empirical estimate. With an empirical estimate of $m_2$ in place, the upper bound can easily be used to derive a learning algorithm.
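Following Remark 8, a minimal sketch for a single candidate h plugs the empirical second moment into the scale and error level of Theorem 7; here losses holds the observed values l(h; z_i), prior_weight is pi(h), and psi_catoni is from the earlier snippet.

```python
import numpy as np
# (assumes psi_catoni from the earlier sketch is in scope)

def robust_risk_bound(losses, prior_weight, delta):
    """Robust risk estimate (10) for one candidate, with the scale of Theorem 7
    and the second moment m_2 replaced by its empirical estimate (Remark 8).
    Returns the estimate plus the error level, i.e., the computable upper bound."""
    losses = np.asarray(losses, dtype=float)
    n = len(losses)
    m2_hat = float(np.mean(losses**2))            # empirical second moment
    log_term = np.log(1.0 / (delta * prior_weight))
    s = np.sqrt(n * m2_hat / (2.0 * log_term))    # scale from Theorem 7
    estimate = (s / n) * np.sum(psi_catoni(losses / s))
    return estimate + np.sqrt(2.0 * m2_hat * log_term / n)
```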

Main theorem

Next we extend the previous motivating theorem to a more general result on a potentially uncountable model $\mathcal{H}$, using stochastic learning algorithms, as has become standard in the PAC-Bayes literature. As before, denote by $m_2 < \infty$ a uniform bound on the second moment of the loss.

Theorem 9.

Let $\pi$ be a prior distribution on model $\mathcal{H}$. Assume only that the second moment of the loss is bounded, as $\mathbb{E}_\mu\, l(h;z)^2 \le m_2$ for all $h \in \mathcal{H}$. Setting the scale in (10) to $s^2 = n\, m_2 / (2 \log(\delta^{-1}))$, then with probability no less than $1-\delta$ over the random draw of the sample, it holds that

$$\mathbb{E}_{h\sim\rho}\, R(h) \le \mathbb{E}_{h\sim\rho}\, \widehat{R}_n(h) + \sqrt{\frac{m_2}{n}} \left(K(\rho;\pi) + \log\frac{1 + \sqrt{2\pi e}}{\delta}\right)$$

for any choice of probability distribution $\rho$ on $\mathcal{H}$ with $K(\rho;\pi) < \infty$.

Proof of Theorem 9.

To begin, let us recall a useful “change of measures” inequality, which can be immediately derived from our proof of Theorem 18. (There are other very closely related approaches to this proof; see Tolstikhin and Seldin [19] and Bégin et al. [5] for some recent examples. Furthermore, the key facts used here are also present in Catoni [7].) In particular, recall from identity (25) that given some prior $\pi$ and constructing $\rho_g$ such that almost everywhere one has

$$\frac{d\rho_g}{d\pi}(h) = \frac{e^{g(h)}}{\mathbb{E}_{h\sim\pi}\, e^{g(h)}}$$

for a bounded measurable $g$, it follows that

$$\mathbb{E}_{h\sim\rho}\, g(h) - K(\rho;\pi) = \log \mathbb{E}_{h\sim\pi}\, e^{g(h)} - K(\rho;\rho_g),$$

whenever $K(\rho;\pi) < \infty$. In the case where $K(\rho;\pi) = \infty$, upper bounds are of course meaningless. Re-arranging, observe that since $K(\rho;\rho_g) \ge 0$, it follows that

$$\mathbb{E}_{h\sim\rho}\, g(h) \le K(\rho;\pi) + \log \mathbb{E}_{h\sim\pi}\, e^{g(h)}. \qquad (11)$$

This inequality given in (11) is deterministic, holds for any choice of $\rho$, and is a standard technical tool in deriving PAC-Bayes bounds.

To keep notation clean, write

$$\Delta(h) := R(h) - \widehat{R}_n(h).$$

To begin, noting that $\mathbb{E}_{h\sim\pi}\, e^{\lambda \Delta(h)}$ is random, with dependence on the sample, via Markov's inequality we have

$$\mathbb{E}_{h\sim\pi}\, e^{\lambda \Delta(h)} \le \frac{1}{\delta}\, \mathbb{E}_\mu\, \mathbb{E}_{h\sim\pi}\, e^{\lambda \Delta(h)} \qquad (12)$$

with probability no less than $1-\delta$. Here probability and $\mathbb{E}_\mu$ are with respect to the sample. Since $\psi$ is bounded, as long as $K(\rho;\pi) < \infty$, the relevant integrals are finite, which lets us use the change of measures inequality in a meaningful way. Now for any constant $\lambda > 0$, observe that we have

$$\mathbb{E}_{h\sim\rho}\, \Delta(h) \le \frac{1}{\lambda} \left(K(\rho;\pi) + \log \mathbb{E}_{h\sim\pi}\, e^{\lambda \Delta(h)}\right) \le \frac{1}{\lambda} \left(K(\rho;\pi) + \log \frac{\mathbb{E}_\mu\, \mathbb{E}_{h\sim\pi}\, e^{\lambda \Delta(h)}}{\delta}\right)$$

with probability no less than $1-\delta$. The first inequality follows from change of measures (11) applied to $g = \lambda\Delta$, the second inequality follows from (12), and the interchange of integration operations is valid using Fubini's theorem [4]. Note that the “good event” depends only on $\pi$ (fixed in advance) and not $\rho$. Thus, the above inequality holds on the good event, uniformly in $\rho$.

It remains to bound $\mathbb{E}_\mu\, e^{\lambda \Delta(h)}$. First, using the classic identity relating the expectation to the tails of a distribution, we have

$$\mathbb{E}_\mu\, e^{\lambda \Delta} = \int_0^\infty \mathbb{P}\left\{e^{\lambda \Delta} > t\right\} dt = \lambda \int_{-\infty}^{\infty} e^{\lambda u}\, \mathbb{P}\left\{\Delta > u\right\} du, \qquad (13)$$

where the second equality follows using integration by substitution ($t = e^{\lambda u}$). The right-hand side of (13) is readily controlled as follows. First note that using Proposition 4, we have

$$\mathbb{P}\left\{\Delta > u\right\} \le \exp\left(-\frac{n u^2}{2 m_2}\right), \quad u > 0,$$

while for $u \le 0$ we simply bound the probability by one, so that the $u \le 0$ portion of the integral in (13) contributes at most $1$. Writing $a := \sqrt{2 m_2 / n}$, the key bound of interest can be compactly written as

$$\lambda \int_0^\infty e^{\lambda u - u^2/a^2}\, du = \lambda\, e^{\lambda^2 a^2/4} \int_0^\infty \exp\left(-\left(\frac{u}{a} - \frac{\lambda a}{2}\right)^2\right) du \le \lambda\, a\, \sqrt{\pi}\, e^{\lambda^2 a^2/4}.$$

Note that the first equality uses the usual “complete the square” identity, and the rest follows from basic properties of the Gaussian integral. Filling in the definition of $a$, we have

$$\mathbb{E}_\mu\, e^{\lambda \Delta} \le 1 + \lambda \sqrt{\frac{2 \pi m_2}{n}}\, \exp\left(\frac{\lambda^2 m_2}{2 n}\right),$$

and furthermore setting $\lambda = \sqrt{n/m_2}$, we have

$$\mathbb{E}_\mu\, e^{\lambda \Delta} \le 1 + \sqrt{2\pi}\, e^{1/2} = 1 + \sqrt{2 \pi e}.$$

Note that using concavity of the square root and Jensen's inequality, we have

$$\mathbb{E}_{h\sim\pi} \sqrt{\mathbb{E}_\mu\, l(h;z)^2} \le \sqrt{\mathbb{E}_{h\sim\pi}\, \mathbb{E}_\mu\, l(h;z)^2},$$

so that candidate-dependent second moments can be averaged under the prior. We can of course build more expressive, general-purpose bounds with the above inequality, but the simplest case is the one in which we assume $\mathbb{E}_\mu\, l(h;z)^2 \le m_2$ for all $h \in \mathcal{H}$. In this case, taking the log of both sides of the above bound yields the simple form

$$\log \mathbb{E}_\mu\, e^{\lambda \Delta(h)} \le \log\left(1 + \sqrt{2 \pi e}\right).$$

Finally, going back to the bound on $\mathbb{E}_{h\sim\rho}\, \Delta(h)$, and plugging in what we have for $\lambda$ and the bounded $\mathbb{E}_\mu\, e^{\lambda \Delta}$, the result is

$$\lambda\, \mathbb{E}_{h\sim\rho}\, \Delta(h) \le K(\rho;\pi) + \log \frac{1 + \sqrt{2 \pi e}}{\delta}.$$

Dividing both sides by $\lambda = \sqrt{n/m_2}$ yields the desired result. ∎

Remark 10.

Note that while in its tightest form the above bound requires knowledge of $m_2$, we may set the scale $s$ used to define $\widehat{R}_n$ using any valid upper bound on $m_2$, under which the above bound still holds as-is, using known quantities.

As a principled approach to deriving stochastic learning algorithms, one naturally considers the choice of posterior $\rho$ in Theorem 9 that minimizes the upper bound. This is typically referred to as the optimal Gibbs posterior [11], and takes a form which is easily characterized, as we prove in the following proposition.

Proposition 11 (Robust optimal Gibbs posterior).

The upper bound of Theorem 9 is optimized by a data-dependent posterior distribution $\widehat{\rho}$, defined in terms of its density function with respect to the prior as

$$\frac{d\widehat{\rho}}{d\pi}(h) = \frac{\exp\left(-\beta\, \widehat{R}_n(h)\right)}{\mathbb{E}_{h'\sim\pi} \exp\left(-\beta\, \widehat{R}_n(h')\right)}, \qquad \beta := \sqrt{\frac{n}{m_2}}.$$

Furthermore, the risk bound under the optimal Gibbs posterior takes the form

$$\mathbb{E}_{h\sim\widehat{\rho}}\, R(h) \le -\frac{1}{\beta} \log \mathbb{E}_{h\sim\pi} \exp\left(-\beta\, \widehat{R}_n(h)\right) + \frac{1}{\beta} \log \frac{1 + \sqrt{2\pi e}}{\delta}$$

with probability no less than $1-\delta$ over the draw of the sample.

Proof of Proposition 11.

To keep the notation clean, write $\beta := \sqrt{n/m_2}$. Similar to the proof of Theorem 18, we have

$$\mathbb{E}_{h\sim\rho}\, \widehat{R}_n(h) + \frac{K(\rho;\pi)}{\beta} = \frac{1}{\beta} \left(\mathbb{E}_{h\sim\rho} \log e^{\beta \widehat{R}_n(h)} + \mathbb{E}_{h\sim\rho} \log \frac{d\rho}{d\pi}(h)\right) \qquad (14)$$

$$= \frac{1}{\beta}\, \mathbb{E}_{h\sim\rho} \log\left(\frac{d\rho}{d\widehat{\rho}}(h)\, \frac{d\widehat{\rho}}{d\pi}(h)\, e^{\beta \widehat{R}_n(h)}\right) \qquad (15)$$

$$= \frac{1}{\beta} \left(K(\rho;\widehat{\rho}) + \mathbb{E}_{h\sim\rho} \log\left(\frac{d\widehat{\rho}}{d\pi}(h)\, e^{\beta \widehat{R}_n(h)}\right)\right) \qquad (16)$$

$$= \frac{K(\rho;\widehat{\rho})}{\beta} - \frac{1}{\beta} \log \mathbb{E}_{h\sim\pi}\, e^{-\beta \widehat{R}_n(h)}, \qquad (17)$$

whenever $K(\rho;\pi) < \infty$. Using non-negativity of the relative entropy (Lemma 15), the left-hand side of this chain of equalities is minimized in $\rho$ at $\rho = \widehat{\rho}$. Since the term $-\beta^{-1} \log \mathbb{E}_{h\sim\pi}\, e^{-\beta \widehat{R}_n(h)}$ is free of $\rho$, it follows that

$$\widehat{\rho} = \operatorname*{arg\,min}_{\rho} \left\{\mathbb{E}_{h\sim\rho}\, \widehat{R}_n(h) + \frac{K(\rho;\pi)}{\beta}\right\},$$

which proves the result regarding the form of the optimal Gibbs posterior.

Evaluating the risk bound under this posterior is a straightforward computation. Observe that

$$\mathbb{E}_{h\sim\widehat{\rho}}\, \widehat{R}_n(h) + \frac{K(\widehat{\rho};\pi)}{\beta} = -\frac{1}{\beta} \log \mathbb{E}_{h\sim\pi}\, e^{-\beta \widehat{R}_n(h)}.$$

Substituting this into the upper bound of Theorem 9, the robust empirical mean estimate terms cancel, and we have the stated result. ∎

Remark 12.

The bound in Proposition 11, achieved by the optimal Gibbs posterior computed based on the data, is rather straightforward to interpret. It converges at a $O(1/\sqrt{n})$ rate, and prior knowledge is reflected explicitly, in that a prior which performs better, in the sense of concentrating on candidates with smaller robust risk estimates, leads to a smaller risk bound.

Remark 13 (Comparison with traditional Gibbs posterior).

In traditional PAC-Bayes analysis [11, Equation 8], the optimal Gibbs posterior, let us write $\rho^{\ast}$, is defined by

$$\frac{d\rho^{\ast}}{d\pi}(h) \propto \exp\left(-n\, \widehat{R}(h)\right),$$

where $\widehat{R}$ is the empirical risk. Our robust posterior $\widehat{\rho}$ replaces $\widehat{R}$ with $\widehat{R}_n$ and scales with the inverse temperature $\beta = \sqrt{n/m_2}$ rather than $n$; in both cases the normalization factor cancels out. In the special case of the negative log-likelihood loss, Germain et al. [11] demonstrate that the optimal Gibbs posterior coincides with the classical Bayesian posterior. As noted by Alquier et al. [3], the optimal Gibbs posterior has shown strong empirical performance in practice, and variational approaches have been proposed as efficient alternatives to more traditional MCMC-based implementations. Comparison of both the computational and learning efficiency of our proposed “robust Gibbs posterior” with the traditional Gibbs posterior is a point of significant interest moving forward.
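For a finite grid of candidates, the robust Gibbs posterior can be sketched as follows; the inverse temperature beta = sqrt(n/m2) matches the form given in Proposition 11, loss_matrix and m2_bound are assumed inputs, and psi_catoni is from the earlier snippet.

```python
import numpy as np
# (assumes psi_catoni from the earlier sketch is in scope)

def robust_gibbs_weights(loss_matrix, prior, m2_bound, delta):
    """Robust optimal Gibbs posterior (Proposition 11) over a finite candidate set:
    weights proportional to pi(h) * exp(-beta * Rhat_n(h)), where Rhat_n is the
    robust risk estimate (10). loss_matrix[j, i] holds l(h_j; z_i)."""
    n = loss_matrix.shape[1]
    s = np.sqrt(n * m2_bound / (2.0 * np.log(1.0 / delta)))    # scale for (10)
    r_hat = (s / n) * psi_catoni(loss_matrix / s).sum(axis=1)  # robust risk per candidate
    beta = np.sqrt(n / m2_bound)                               # inverse temperature
    logw = np.log(prior) - beta * r_hat
    w = np.exp(logw - logw.max())                              # numerically stable
    return w / w.sum()
```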

5 Conclusions

The main contribution of this paper was to develop a novel approach to obtaining PAC-Bayesian learning guarantees which admits deviations with exponential tails under weak moment assumptions on the underlying loss distribution, while still being computationally amenable. Our chief interest was the fundamental problem of obtaining strong guarantees for stochastic learning algorithms which can reflect prior knowledge about the data-generating process, from which we derived a new robust Gibbs posterior. Moving forward, a deeper study of the statistical properties of this new stochastic learning algorithm, as well as the computational considerations to be made in practice, is of significant interest.

Appendix A Technical appendix

A.1 Additional proofs

Relative entropy

Here we recall the basic notions of the relative entropy, or Kullback-Leibler divergence, between two probability distributions. Consider $\rho$ and $\pi$, both defined over a finite space $\Omega$. The relative entropy of $\rho$ from $\pi$ is defined

$$K(\rho;\pi) := \sum_{\omega \in \Omega} \rho(\omega) \log \frac{\rho(\omega)}{\pi(\omega)}, \qquad (18)$$

where this definition clearly includes the possibility that $K(\rho;\pi) = \infty$, which occurs only when $\pi$ assigns zero probability to an element that $\rho$ assigns positive probability to.

More generally, when $\Omega$ is potentially uncountably infinite, consider two probabilities $\rho$ and $\pi$ on the measurable space $(\Omega, \mathcal{A})$, where $\mathcal{A}$ is an appropriate $\sigma$-algebra. (A certain degree of measure theory is assumed in this exposition, at approximately the level of the first few chapters of Ash and Doleans-Dade [4].) In this case, the relative entropy is defined

$$K(\rho;\pi) := \int_\Omega \log\left(\frac{d\rho}{d\pi}\right) d\rho, \qquad (19)$$

where $d\rho/d\pi$ denotes the Radon-Nikodym derivative of $\rho$ with respect to $\pi$, typically called the density of $\rho$ with respect to $\pi$. The basic underlying technical assumption, denoted $\rho \ll \pi$, is that $\rho$ be absolutely continuous with respect to $\pi$, meaning that $\rho(A) = 0$ whenever $\pi(A) = 0$, for $A \in \mathcal{A}$. In the event that $\rho \ll \pi$ does not hold, by convention we define $K(\rho;\pi) = \infty$. Recall that the Radon-Nikodym theorem guarantees that when $\rho \ll \pi$, there exists a measurable function $f \ge 0$ such that

$$\rho(A) = \int_A f\, d\pi, \quad A \in \mathcal{A}.$$

This function is unique in the sense that if there exists another $g$ satisfying the above equality, then $f = g$ almost everywhere $[\pi]$. This uniqueness justifies using the notation $f = d\rho/d\pi$, and calling this function the density of $\rho$ (rather than a density of $\rho$).

Lemma 14 (Chain rule).

On measure space $(\Omega, \mathcal{A}, \mu)$, let $f \ge 0$ be a Borel-measurable function, and define measure $\nu$ by

$$\nu(A) := \int_A f\, d\mu, \quad A \in \mathcal{A}.$$

For any Borel-measurable function $g$ on $\Omega$, it follows that

$$\int_\Omega g\, d\nu = \int_\Omega g\, f\, d\mu,$$

whenever either integral exists.

Proof.

See section 2.2, problem 4 of Ash and Doleans-Dade [4]. ∎

Lemma 15 (Non-negativity of relative entropy).

For any probabilities $\rho$ and $\pi$, we have $K(\rho;\pi) \ge 0$.

Proof of Lemma 15.

If $\rho \ll \pi$ does not hold, then $K(\rho;\pi) = \infty$ and non-negativity follows trivially. As for the case of $\rho \ll \pi$, we begin with the basic logarithmic inequality $\log u \le u - 1$, valid for any $u > 0$ [1]. We thus have $\log u \ge 1 - 1/u$ for any $u > 0$. Writing $f := d\rho/d\pi$, using this inequality and the chain rule (Lemma 14), we have

$$K(\rho;\pi) = \int \log(f)\, d\rho = \int f \log(f)\, d\pi \ge \int f \left(1 - \frac{1}{f}\right) d\pi = \int (f - 1)\, d\pi = \rho(\Omega) - \pi(\Omega) = 0.$$

The final equality uses the Radon-Nikodym theorem, by which $\int f\, d\pi = \rho(\Omega)$. ∎

Lemma 16 (Lower bound on Bernoulli relative entropy).

The relative entropy between $\mathrm{Bernoulli}(p)$ and $\mathrm{Bernoulli}(q)$ is bounded below by $2(p-q)^2$.

Proof of Lemma 16.

Consider the function $F$ defined

$$F(p, q) := K\big(\mathrm{Bernoulli}(p); \mathrm{Bernoulli}(q)\big) - 2(p-q)^2 = p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q} - 2(p-q)^2.$$

Fix any arbitrary $p \in [0, 1]$, and take the derivative with respect to $q$, noting that

$$\frac{\partial F(p, q)}{\partial q} = \frac{q - p}{q(1-q)} - 4(q - p) = (q - p)\left(\frac{1}{q(1-q)} - 4\right).$$

Using the basic fact that $q(1-q) \le 1/4$ for all $q \in (0,1)$, we have that the factor $1/(q(1-q)) - 4$ is non-negative. Thus, the slope is negative when $q < p$, positive when $q > p$, and zero when $q = p$. Thus $q = p$ is the only minimum of the function in $(0,1)$. Note that $F(p, p) = 0$, and so for all $q$ it follows that $F(p, q) \ge 0$. This holds for any choice of $p$ as well, implying the desired result by the definition of $F$. ∎
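As a quick numerical sanity check of Lemma 16 (a sketch; the grid resolution is an arbitrary choice):

```python
import numpy as np

def bernoulli_kl(p, q):
    """Relative entropy K(Bernoulli(p); Bernoulli(q)) for p, q in (0, 1)."""
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

# Check K >= 2*(p - q)^2 over a grid of (p, q) pairs in the open interval.
p, q = np.meshgrid(np.linspace(0.01, 0.99, 99), np.linspace(0.01, 0.99, 99))
assert np.all(bernoulli_kl(p, q) >= 2.0 * (p - q) ** 2 - 1e-12)
```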

Lemma 17 (Chernoff bound for Bernoulli data).

Let $x_1, \dots, x_n$ be independent and identically distributed random variables, taking values in $\{0, 1\}$. Write $\bar{x} := \frac{1}{n}\sum_{i=1}^{n} x_i$ for the sample mean, and $\theta := \mathbb{E}\, x_1$. The tails of the sample mean deviations can be bounded as

$$\mathbb{P}\left\{\bar{x} - \theta > \varepsilon\right\} \le \exp\left(-2 n \varepsilon^2\right)$$

for any $\varepsilon > 0$.

Proof of Lemma 17.

For the random variable $\sum_{i=1}^{n} x_i$, recall that using Markov's inequality, for any $\beta > 0$ we have

$$\mathbb{P}\left\{\bar{x} - \theta > \varepsilon\right\} = \mathbb{P}\left\{e^{\beta \sum_i x_i} > e^{\beta n (\theta + \varepsilon)}\right\} \le e^{-\beta n (\theta + \varepsilon)} \left(\mathbb{E}\, e^{\beta x_1}\right)^n.$$

Taking the derivative of this upper bound with respect to $\beta$ and setting it to zero, we obtain the condition

$$e^{\beta} = \frac{(\theta + \varepsilon)(1 - \theta)}{(1 - \theta - \varepsilon)\, \theta},$$

where we write $\theta = \mathbb{E}\, x_1$ as above. Plugging this choice of $\beta$ back in, the exponent becomes a relative entropy between Bernoulli distributions, yielding

$$\mathbb{P}\left\{\bar{x} - \theta > \varepsilon\right\} \le \exp\left(-n\, K\big(\mathrm{Bernoulli}(\theta + \varepsilon); \mathrm{Bernoulli}(\theta)\big)\right) \le \exp\left(-2 n \varepsilon^2\right),$$

where the final inequality follows from Lemma 16. ∎