Finite-sample risk bounds for maximum likelihood estimation with arbitrary penalties

W. D. Brinda et al., December 29, 2017

The MDL two-part coding index of resolvability provides a finite-sample upper bound on the statistical risk of penalized likelihood estimators over countable models. However, the bound does not apply to unpenalized maximum likelihood estimation or procedures with exceedingly small penalties. In this paper, we point out a more general inequality that holds for arbitrary penalties. In addition, this approach makes it possible to derive exact risk bounds of order 1/n for iid parametric models, improving on the order (log n)/n resolvability bounds. We conclude by discussing implications for adaptive estimation.


I Introduction

A remarkably general method for bounding the statistical risk of penalized likelihood estimators comes from work on two-part coding, one of the minimum description length (MDL) approaches to statistical inference. Two-part coding MDL prescribes assigning codelengths to a model (or model class) and then selecting the distribution that provides the most efficient description of one's data [1]. The total description length has two parts: the part that specifies a distribution within the model (as well as a model within the model class if necessary) and the part that specifies the data with reference to the specified distribution. If the codelengths are exactly Kraft-valid, this approach is equivalent to Bayesian maximum a posteriori (MAP) estimation, in that the two parts correspond to the log reciprocal of the prior and the log reciprocal of the likelihood, respectively. More generally, one can call the part of the codelength specifying the distribution a penalty term; it is called the complexity in MDL literature.
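To make the correspondence concrete, the following minimal sketch (in Python, with a hypothetical discretized Gaussian location model and a Kraft-valid complexity chosen purely for illustration) selects the distribution minimizing the two-part codelength, complexity plus negative log-likelihood in nats, and confirms that the same distribution is chosen by MAP estimation under the prior proportional to exp(−complexity).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=1.3, scale=1.0, size=50)   # hypothetical sample

# Countable model: Gaussian location family restricted to a discretized grid.
grid = np.arange(-5.0, 5.0, 0.1)

# A Kraft-valid complexity (codelength in nats): log-reciprocal of a prior on the grid.
prior = np.exp(-np.abs(grid))
prior /= prior.sum()
complexity = -np.log(prior)                      # L(theta) = log 1/pi(theta)

neg_loglik = np.array([-norm.logpdf(data, loc=t, scale=1.0).sum() for t in grid])

# Two-part codelength = complexity + negative log-likelihood (both in nats).
two_part = complexity + neg_loglik
theta_mdl = grid[np.argmin(two_part)]

# MAP estimate under the prior pi: identical selection.
log_posterior = np.log(prior) - neg_loglik
theta_map = grid[np.argmax(log_posterior)]

assert theta_mdl == theta_map
print(theta_mdl)
```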

Let Θ denote a discrete set indexing distributions {P_θ : θ ∈ Θ}, along with a complexity function L : Θ → [0, ∞). With data Y ∼ P*, the (pointwise) redundancy of any θ ∈ Θ is its two-part codelength L(θ) + log(1/p_θ(Y)) minus log(1/p*(Y)), the codelength one gets by using P* as the coding distribution. (Footnote 1: For now, we mean that P* governs the entirety of the data. The notion of sample size and iid assumptions are not essential to the bounds, as will be seen in the statement of Theorem II.1. Specialization to iid data will be discussed thereafter.) The expectation of the redundancy is the relative entropy D(P* ‖ P_θ) plus L(θ). Let θ* denote the minimizer of expected redundancy; it is the average-case optimal representative from Θ when the true distribution is P*. Its expected redundancy will be denoted

R(P*) = min_{θ∈Θ} { D(P* ‖ P_θ) + L(θ) },

or, in the context of iid data and iid modeling, its expected redundancy rate (normalized by the sample size n) is denoted

R_n(P*) = min_{θ∈Θ} { D(P* ‖ P_θ) + L(θ)/n },

where D now denotes the per-observation relative entropy.

Interestingly, [2] showed that if the complexity function is large enough, then the corresponding penalized likelihood estimator performs at least as well as the average-case optimal representative: the statistical risk is bounded by that representative's expected redundancy rate. That result is stated for iid sampling in (2) below. (Footnote 2: Throughout the paper, we will refer to this inequality as "the resolvability bound," but realize that there are a variety of related resolvability bounds in other contexts. They involve comparing risk to a codelength and lead to bounds that are suboptimal by a logarithmic factor.)

There are a number of attractive features of the resolvability bound; we will highlight four. One of the most powerful aspects of the resolvability bound is the ease with which it can be used to devise adaptive estimation procedures for which the bound applies. For instance, to use a class of nested models rather than a single model, one only needs to tack on an additional penalty term corresponding to a codelength used to specify the selected model within the class.

Another nice feature is its generality: the inequality statement only requires that the data-generating distribution has finite relative entropy to some probability measure in the model.

(Footnote 3: Although the forthcoming resolvability bounds (i.e., as in (2) with a penalty that is at least twice a codelength function) are valid under misspecification, they do not in general imply consistency in the sense that the corresponding penalized estimator eventually converges to the element of the model that minimizes KL or Hellinger distance to the truth. Indeed, there are various examples [3] in which the twice-codelength penalized estimator is inconsistent, i.e., provably never converges to that element.) In practice, the common assumptions of other risk-bound methods, for instance that the generating distribution belongs to the model, are unlikely to be exactly true.

A third valuable property of the bound is its exactness for finite samples. Many risk bound methods only provide asymptotic bounds. But such results do not imply anything exact for a data analyst with a specific sample.

Lastly, the resolvability bound uses a meaningful loss function: the α-Rényi divergence [4]. For convenience, we specialize our discussion and our present work to Bhattacharyya divergence [5], which is the 1/2-Rényi divergence,

B(P, Q) = 2 log (1 / A(P, Q)),

where A(P, Q) denotes the Hellinger affinity

A(P, Q) = ∫ √(p q) dμ.

Like relative entropy, B decomposes product measures into sums; that is,

B(P₁ × P₂, Q₁ × Q₂) = B(P₁, Q₁) + B(P₂, Q₂).

Bhattacharyya divergence is bounded below by squared Hellinger distance (using log x ≤ x − 1) and above by relative entropy (using Jensen's inequality). Importantly, it has a strictly increasing relationship with squared Hellinger distance H²(P, Q) = ∫ (√p − √q)² dμ = 2(1 − A(P, Q)), which is an f-divergence:

B(P, Q) = 2 log ( 1 / (1 − H²(P, Q)/2) ).

As such, it inherits desirable f-divergence properties such as the data processing inequality. Also, it is clear from the definition that B is parametrization-invariant. For many more properties of B, including its bound on total variation distance, see [6].
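As a quick numerical check of these properties (a sketch using closed-form expressions for univariate Gaussians; the helper functions and parameter values are ours, not the paper's), the following verifies the ordering between squared Hellinger distance, Bhattacharyya divergence, and relative entropy, and the additivity of B over product measures.

```python
import numpy as np

def gaussian_affinity(m1, s1, m2, s2):
    # Hellinger affinity A = integral of sqrt(p q) between N(m1, s1^2) and N(m2, s2^2), closed form.
    return np.sqrt(2 * s1 * s2 / (s1**2 + s2**2)) * np.exp(-(m1 - m2)**2 / (4 * (s1**2 + s2**2)))

def gaussian_kl(m1, s1, m2, s2):
    # Relative entropy D(N(m1, s1^2) || N(m2, s2^2)).
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

A = gaussian_affinity(0.0, 1.0, 1.5, 2.0)
B = 2 * np.log(1 / A)                    # Bhattacharyya divergence (1/2-Renyi)
H2 = 2 * (1 - A)                         # squared Hellinger distance
D = gaussian_kl(0.0, 1.0, 1.5, 2.0)
assert H2 <= B <= D                      # ordering discussed above

# Affinities multiply over product measures, so Bhattacharyya divergences add.
A2 = gaussian_affinity(0.3, 1.0, -0.2, 1.0)
B2 = 2 * np.log(1 / A2)
assert np.isclose(2 * np.log(1 / (A * A2)), B + B2)
print(H2, B, D)
```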

Next, we make note of some of the limitations of the resolvability bound. One complaint is that it is for discrete parameter sets, while people generally want to optimize penalized likelihood over a continuous parameter space. In practice, one typically selects a parameter value that is rounded to a fixed precision, so in effect the selection is from a discretized space. However, for mathematical convenience, it is nice to have risk bounds for the theoretical optimizer. A method to extend the resolvability bound to continuous models was introduced by [7]; in that paper, the method was specialized to estimation of a log density by linear combinations from a finite dictionary with an ℓ1 penalty on the coefficients. More recently, [8] worked out the continuous extension for Gaussian graphical models (building on [9]) with an ℓ1 penalty, assuming the model is well-specified, and for linear regression with an ℓ1 penalty, assuming the true error distribution is Gaussian. These results are explained in more detail by [10], where the extension for the ℓ0 penalty for linear regression is also shown, again assuming the true error distribution is Gaussian.

Another limitation is that the resolvability bound needs a large enough penalty; it must have a finite Kraft sum. This paper provides a more general inequality that escapes such a requirement and therefore applies even to unpenalized maximum likelihood estimation. The resulting bound retains the four desirable properties we highlighted above, but loses the coding and resolvability interpretations.

Finally, the resolvability bounds for smooth parametric iid modeling are of order (log n)/n and cannot be improved, according to [11], whereas under regularity conditions (for which Bhattacharyya divergence is locally equivalent to one-half relative entropy, according to [7]) the optimal Bhattacharyya risk is of order 1/n [12]. Our variant on the resolvability method leads to the possibility of deriving exact bounds of order 1/n.

Our bounds can be used for the penalized MLE over a discretization of an unbounded parameter space under a power decay condition on the Hellinger affinity, as in Theorems II.11 and II.12. We show that such a condition is satisfied by exponential families of distributions when the smallest eigenvalue of the covariance matrix of their sufficient statistics is bounded below by a positive number (see Lemma II.9). For these models and others, we establish order-1/n bounds for the Bhattacharyya risk. The primary focus of this paper is to develop new tools towards this end.

One highly relevant line of work is [13], in which a more general resolvability risk bound is established for "posterior" distributions on the parameter space. Implications for penalized MLEs come from forcing the "posteriors" to be point-masses. The resulting risk bounds have the form of a resolvability term plus a "corrective" term, which is comparable to the form of our results. Indeed, as we will point out, one of our corollaries nearly coincides with [13, Thm 4.2] but works with arbitrary penalties.

The trick we employ is to introduce an arbitrary function Γ on the parameter set, which we call a pseudo-penalty, that adds to the penalty pen; strategic choices of pseudo-penalty can help to control the "penalty summation" over the model. The resulting risk bound has an additional expected pseudo-penalty term that must be dealt with.

In Section II, we prove our more general version of the resolvability bound inequality using a derivation closely analogous to the one by [14]. We then explore corollaries that arise from various choices of pseudo-penalty. In Section III, we explain how our approach applies in the context of adaptive modeling. Additional work can be found in [15], including some simple concrete examples [15, "Simple concrete examples", Sec 2.1.2], extension to continuous models [15, "Continuous parameter spaces", Sec 2.2], and an application to Gaussian mixtures [15, Chap 4].

Every result labeled a Theorem or Lemma has a formal proof, some of which are in the Appendix. Any result labeled a Corollary is an immediate consequence of previously stated results, and thus no formal proof is provided. For any random vector X, the notation Cov X means its covariance matrix, while tr Cov X represents its trace, the sum of the coordinate variances. The notation λ_j(·) means the jth largest eigenvalue of the matrix argument. Whenever a capital letter has been introduced to represent a probability distribution, the corresponding lower-case letter will represent a density for the measure with respect to either Lebesgue or counting measure. The penalized MLE is the (random) parameter that maximizes log-likelihood minus penalty. The notation D(P* ‖ Θ) represents the infimum relative entropy from P* to distributions indexed by the model Θ. Multiplication and division take precedence over the minimum and maximum operators ∧ and ∨; for instance, a ∧ b/c means a ∧ (b/c).

II Models with countable cardinality

Let us begin with countable (e.g. discretized) models, which were the original context for the MDL penalized likelihood risk bounds. We will show that a generalization of that technique works for arbitrary penalties. The only assumption we need is that for any possible data, there exists a (not necessarily unique) maximizer of the penalized likelihood. (Footnote 4: We will say "the" penalized MLE, even though we do not require uniqueness; any scheme can be used for breaking ties. This existence requirement will be implicit throughout our paper.) Theorem II.1 gives a general result that is agnostic about any structure within the data; the consequence for iid data with sample size n is pointed out after the proof.

Theorem II.1.

Let Y ∼ P*, and let θ̂ be the penalized MLE over Θ indexing a countable model {P_θ : θ ∈ Θ} with penalty pen. Then for any pseudo-penalty Γ : Θ → ℝ,

E B(P*, P_θ̂) ≤ E[ log(p*(Y)/p_θ̂(Y)) + pen(θ̂) + Γ(θ̂) ] + 2 log Σ_{θ∈Θ} e^{−[pen(θ)+Γ(θ)]/2}.

Proof.

We follow the pattern of Jonathan Li’s version of the resolvability bound proof [14].

We were able to bound the random quantity by the sum over all θ ∈ Θ because each of these terms is non-negative.

We will take the expectation of both sides for Y ∼ P*. To deal with the first term, we use Jensen's inequality and the definition of Hellinger affinity.

Returning to the overall inequality, we have the bound claimed in the theorem statement. ∎

Suppose now that the data comprise iid observations and are modeled as such; in other words, the data has the form Y = (Y₁, …, Yₙ) with the Yᵢ iid ∼ P, and the model has the form {P_θ^{×n} : θ ∈ Θ}. Because B(P^{×n}, Q^{×n}) = n B(P, Q) and D(P^{×n} ‖ Q^{×n}) = n D(P ‖ Q), we can divide both sides of Theorem II.1 by n to reveal the role of sample size in this context:

E B(P, P_θ̂) ≤ (1/n) E[ log(p(Y)/p_θ̂(Y)) + pen(θ̂) + Γ(θ̂) ] + (2/n) log Σ_{θ∈Θ} e^{−[pen(θ)+Γ(θ)]/2},

where the likelihood ratio is evaluated on the full sample.

We will see three major advantages to Theorem II.1. The most obvious is that it can handle cases in which the sum of exponential negative half penalties is infinite; unpenalized estimation, for example, has pen identically zero. One consequence of this is that the resolvability method for minimax risk upper bounds can be extended to models that are not finitely covered by relative entropy balls. We will also find that Theorem II.1 enables us to derive exact risk bounds of order 1/n rather than the usual order-(log n)/n resolvability bounds.

In many cases, it is convenient to have only the pseudo-penalty in the summation. Substituting Γ − pen as the pseudo-penalty in Theorem II.1 gives us a corollary that moves pen out of the summation.

Corollary II.2.

Let Y ∼ P*, and let θ̂ be the penalized MLE over Θ indexing a countable model with penalty pen. Then for any pseudo-penalty Γ,

E B(P*, P_θ̂) ≤ E[ log(p*(Y)/p_θ̂(Y)) + Γ(θ̂) ] + 2 log Σ_{θ∈Θ} e^{−Γ(θ)/2}.

The iid data and model version is

E B(P, P_θ̂) ≤ (1/n) E[ log(p(Y)/p_θ̂(Y)) + Γ(θ̂) ] + (2/n) log Σ_{θ∈Θ} e^{−Γ(θ)/2}.

We will use the term pseudo-penalty for the function labeled Γ in either Theorem II.1 or Corollary II.2. Note that Γ is allowed to depend on P* but not on the data.

A probabilistic loss bound can also be derived for the difference between the loss and the redundancy plus pseudo-penalty.

Theorem II.3.

Let Y ∼ P*, and let θ̂ be the penalized MLE over Θ indexing a countable model with penalty pen. Then for any pseudo-penalty Γ and any t > 0,

P( B(P*, P_θ̂) ≥ log(p*(Y)/p_θ̂(Y)) + pen(θ̂) + Γ(θ̂) + t ) ≤ e^{−t/2} Σ_{θ∈Θ} e^{−[pen(θ)+Γ(θ)]/2}.

Proof.

Following the steps described in [7, Theorem 2.3], we use Markov's inequality and then bound a non-negative random variable by the sum of its possible values. ∎

For iid data and an iid model, Theorem II.3 implies

P( B(P, P_θ̂) ≥ (1/n)[ log(p(Y)/p_θ̂(Y)) + pen(θ̂) + Γ(θ̂) ] + t ) ≤ e^{−nt/2} Σ_{θ∈Θ} e^{−[pen(θ)+Γ(θ)]/2}.

Several of our corollaries have pen and Γ designed to make Σ_{θ∈Θ} e^{−[pen(θ)+Γ(θ)]/2} equal to 1. In such cases, the difference between the loss and the point-wise redundancy plus pseudo-penalty is stochastically less than an exponential random variable.

Often the countable model of interest is a discretization of a continuous model. Given any ε > 0, an ε-discretization of ℝ^d is a shifted lattice θ₀ + εℤ^d, by which we mean {θ₀ + εz : z ∈ ℤ^d} for some θ₀ ∈ ℝ^d. An ε-discretization of a parameter set Θ̃ ⊆ ℝ^d is a set of the form (θ₀ + εℤ^d) ∩ Θ̃. See Section III-D for a discussion of the behavior of ε in that context.
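A minimal sketch of this construction (with a hypothetical dimension, spacing, and bounding box) builds the shifted lattice of spacing ε and intersects it with the continuous parameter set; shrinking ε with the sample size refines the grid.

```python
import itertools
import numpy as np

def eps_discretization(eps, lower, upper, offset=None):
    """Grid {offset + eps * z : z integer vector} intersected with the box [lower, upper] in each coordinate."""
    d = len(lower)
    if offset is None:
        offset = np.zeros(d)
    axes = []
    for j in range(d):
        zmin = int(np.ceil((lower[j] - offset[j]) / eps))
        zmax = int(np.floor((upper[j] - offset[j]) / eps))
        axes.append(offset[j] + eps * np.arange(zmin, zmax + 1))
    return np.array(list(itertools.product(*axes)))

n = 400                                   # sample size
eps = 1.0 / np.sqrt(n)                    # spacing shrinking like 1/sqrt(n)
grid = eps_discretization(eps, lower=[-2.0, -2.0], upper=[2.0, 2.0])
print(len(grid), eps)                     # grid size grows like (range/eps)^d
```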

To derive useful consequences of the above results, we will explore some convenient choices of pseudo-penalty: zero, Bhattacharyya divergence, the log reciprocal pmf of θ̂, quadratic forms, and the penalty itself. We specialize to the iid data and model setting for the remainder of this document to highlight the fact that many of the exact risk bounds we derive are of order 1/n in that case.

II-A Zero as pseudo-penalty

Setting the pseudo-penalty to zero gives us the traditional resolvability bound, which we review in this section.

Corollary II.4.

Assume Y ∼ P^{×n}, and let θ̂ be the penalized MLE over Θ indexing a countable iid model with penalty pen. Then

E B(P, P_θ̂) ≤ min_{θ∈Θ} { D(P ‖ P_θ) + pen(θ)/n } + (2/n) log Σ_{θ∈Θ} e^{−pen(θ)/2}.

The usual statement of the resolvability bound [7] assumes pen is at least twice a codelength function, so that it is large enough for the sum of exponential terms to be no greater than 1. That is,

Σ_{θ∈Θ} e^{−pen(θ)/2} ≤ 1      (1)

implies

E B(P, P_θ̂) ≤ min_{θ∈Θ} { D(P ‖ P_θ) + pen(θ)/n }.      (2)

The quantity on the right-hand side of (2) is called the index of resolvability of (Θ, pen) for P at sample size n. Any corresponding minimizer θ* is considered to index an average-case optimal representative for P at sample size n.

In fact, for any finite sum c = Σ_{θ∈Θ} e^{−pen(θ)/2}, the maximizer of the penalized likelihood is also the maximizer with penalty pen + 2 log c. Thus one has a resolvability bound of the form (2) with the equivalent penalty pen + 2 log c, which satisfies (1) with equality.
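For illustration, the sketch below (a hypothetical discretized Gaussian location model with an arbitrary penalty of our own choosing) computes the index of resolvability on the right-hand side of (2) and the equivalent penalty that satisfies (1) with equality.

```python
import numpy as np

# Hypothetical setup: truth N(0.37, 1); model N(theta, 1) for theta on a grid.
grid = np.arange(-5.0, 5.0, 0.1)
theta_star, n = 0.37, 200
pen = 2.0 + 2 * np.log1p(np.abs(grid))       # an arbitrary (not Kraft-valid) penalty, in nats

kl = 0.5 * (grid - theta_star) ** 2          # D(N(theta*,1) || N(theta,1)) per observation
index_of_resolvability = np.min(kl + pen / n)
print(index_of_resolvability)

# Adding a constant to the penalty leaves the penalized MLE unchanged; choosing the
# constant 2*log(sum of exp(-pen/2)) yields an equivalent penalty satisfying (1) with equality.
constant = 2 * np.log(np.sum(np.exp(-pen / 2)))
pen_equiv = pen + constant
assert np.isclose(np.sum(np.exp(-pen_equiv / 2)), 1.0)
```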

Additionally, the resolvability bounds give an exact upper bound on the minimax risk for any model that can be covered by finitely many relative entropy balls of radius δ; the log of the minimal covering number is called the KL-metric entropy H_KL(δ). These balls' center points are called a KL-net; we will denote the net by Θ_δ. With data Y ∼ P^{×n} for any P in the model, the MLE restricted to Θ_δ has the resolvability risk bound

E B(P, P_θ̂) ≤ δ + 2 H_KL(δ)/n.

If an explicit bound for H_KL(δ) is known, then the overall risk bound can be optimized over the radius; see for instance [7, Section 1.5].
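A sketch of that optimization (assuming, hypothetically, a Gaussian location family with means in [−R, R] and unit variance, for which the KL covering numbers are available up to rounding) minimizes the sum of the cover radius and twice the metric entropy divided by n.

```python
import numpy as np

# Hypothetical: Gaussian location family with means in [-R, R], unit variance.
# KL(N(a,1) || N(b,1)) = (a-b)^2/2, so KL balls of radius delta correspond to
# intervals of half-width sqrt(2*delta) in the mean, giving roughly R/sqrt(2*delta) centers.
R, n = 10.0, 500

def risk_bound(delta):
    covering_number = np.ceil(R / np.sqrt(2 * delta)) + 1
    metric_entropy = np.log(covering_number)
    return delta + 2 * metric_entropy / n    # approximation radius + (twice metric entropy)/n

deltas = np.logspace(-4, 0, 200)
bounds = [risk_bound(d) for d in deltas]
best = deltas[int(np.argmin(bounds))]
print(best, min(bounds))                     # optimized radius is of order 1/n, overall bound of order (log n)/n
```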

Because this approach to upper bounding minimax risk requires twice-Kraft-valid codelengths, it only applies to models that can be covered by finitely many relative entropy balls. However, Corollary II.2 reveals new possibilities for establishing minimax upper bounds even if the cover is infinite. Given any pseudo-penalty, one can use any constant penalty that is at least as large as the resulting log summation term, where the estimator is the unpenalized MLE on the net and the summation is taken over those points. (Footnote 5: Putting the same choice into either Theorem II.1 or Corollary II.2 would give us the same idea.) For a minimax result, one still needs this quantity to be uniformly bounded over all data-generating distributions. See Corollary II.10 below as an example.

II-B Bhattacharyya divergence as pseudo-penalty

Important corollaries to Theorem II.1 and Corollary II.2 come from setting the pseudo-penalty equal to a multiple of the Bhattacharyya divergence B(P, P_θ); the expected pseudo-penalty is then proportional to the risk, so that term can be subtracted from both sides. (Footnote 6: Our Corollary II.5 was inspired by the very closely related result of [13, Thm 4.2].) For the iid scenario, we also use the product property of Hellinger affinity: A(P^{×n}, Q^{×n}) = A(P, Q)^n.

The following corollaries serve as the starting point for the main bounds in Theorems II.11 and II.12, in which more refined techniques are used to control the two terms in (3) and (4).

Corollary II.5.

Assume , and let be the penalized MLE over indexing a countable iid model with penalty . Then for any ,

(3)
Corollary II.6.

Assume , and let be the penalized MLE over indexing a countable iid model with penalty . Then for any ,

(4)

For simplicity, the corollaries throughout this subsection will use .

Consider a penalized MLE selected from an ε-discretization of a continuous parameter space; as the sample size increases, one typically wants to shrink ε to make the grid more refined (see Section III-D). Examining Corollaries II.5 and II.6, we see two opposing forces at work as n increases: the grid-points themselves proliferate, while the nth power depresses the terms in the summation. An easy case occurs when the Hellinger affinity is bounded by a Gaussian-shaped curve; we apply Corollary II.6 and invoke Lemma III.10.

Corollary II.7.

Assume , and let be the penalized MLE over an -discretization indexing an iid model with penalty . Assume for some and some . Then

With ε proportional to 1/√n, our bound on the summation of Hellinger affinities is stable. Corollary II.8 specializes these choices to demonstrate a more concrete instantiation of this result.

Corollary II.8.

Assume , and let be the MLE over an -discretization indexing an iid model using . Assume for some and some . Then
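The two opposing forces can be seen numerically in a small sketch (assuming a one-dimensional Gaussian-shaped affinity bound A(P, P_θ) ≤ exp(−(θ − θ₀)²/8), which holds with equality for the unit-variance Gaussian location family): with grid spacing ε proportional to 1/√n, the sum of nth powers of affinities over the grid stays near a constant as n grows.

```python
import numpy as np

def affinity_power_sum(n, c=0.125, eps_const=1.0, half_width=10.0):
    """Sum over a 1-d grid of spacing eps = eps_const/sqrt(n) of exp(-c*n*(theta-theta0)^2)."""
    eps = eps_const / np.sqrt(n)
    theta0 = 0.0
    grid = theta0 + eps * np.arange(-int(half_width / eps), int(half_width / eps) + 1)
    return np.sum(np.exp(-c * n * (grid - theta0) ** 2))

for n in [10, 100, 1000, 10000]:
    # More grid points as n grows, but each term is depressed by the nth power.
    print(n, affinity_power_sum(n))
```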

If the model is an exponential family with natural parameter θ, then Hellinger affinities do have a Gaussian-shaped bound as long as the minimum eigenvalue of the sufficient statistic's covariance matrix is uniformly bounded below by a positive number. We use the notation λ_j(·) for the jth largest eigenvalue of the matrix argument.

Lemma II.9.

Let {P_θ : θ ∈ Θ ⊆ ℝ^d} be an exponential family with natural parameter θ and sufficient statistic T. Then

A(P_θ, P_θ′) ≤ e^{−c‖θ−θ′‖²/8},

where c = inf_{θ∈Θ} λ_d(Cov_θ T).

In Lemma II.9, c does not depend on the data-generating distribution. If in addition the ε-discretization is also a KL-net, then the risk of the estimator described in Corollary II.8 is uniformly bounded over data-generating distributions in the model. The minimax risk is no greater than the supremum risk of this particular estimator.
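The shape of the bound in Lemma II.9 can be checked numerically in a one-parameter example (a sketch using the Poisson family in its natural parametrization, a choice of ours for illustration), where the affinity is computed from the log-partition function ψ via A(P_θ, P_θ′) = exp(ψ((θ+θ′)/2) − (ψ(θ)+ψ(θ′))/2).

```python
import numpy as np

# Poisson with natural parameter theta has log-partition psi(theta) = exp(theta),
# so the variance of the sufficient statistic is psi''(theta) = exp(theta).
psi = np.exp

def affinity(theta1, theta2):
    # Hellinger affinity in an exponential family: exp(psi(midpoint) - average of psi).
    return np.exp(psi((theta1 + theta2) / 2) - (psi(theta1) + psi(theta2)) / 2)

rng = np.random.default_rng(1)
for _ in range(1000):
    t1, t2 = rng.uniform(-2.0, 2.0, size=2)
    c = np.exp(min(t1, t2))              # lower bound on psi'' over the segment [t1, t2]
    assert affinity(t1, t2) <= np.exp(-c * (t1 - t2) ** 2 / 8) + 1e-12
```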

Corollary II.10.

Let index a set of distributions. Assume that for some , every has the property that . Assume further that there exists such that for all , every -discretization is also a KL-net with balls of radius . Then the minimax Bhattacharyya risk of has the upper bound

In general, however, requiring the Hellinger affinity to be uniformly bounded by a Gaussian curve may be too severe a requirement. A weaker condition is to require only a power decay for θ far from some center point.

Theorem II.11.

Assume , and let be the penalized MLE over an -discretization indexing an iid model with penalty . Assume that for some , radius and , the Hellinger affinity is bounded by outside the ball and bounded by inside the ball. If , and , then,

Proof.

The part of the summation where Hellinger affinity is bounded by a Gaussian curve has the same bound as in Corollary II.7, which is a direct consequence of Lemma III.10.

(5)

Notice that the “center” point for this Gaussian curve can be different from the center of the ball .

The summation of the remaining terms is handled by Lemma III.14, assuming .

(6)
(7)

The assumption that assures us that , simplifying the bound.

Each of (5) and (6) is at least , so by Lemma III.3, the sum of their logs is bounded by the log of their sum plus . Finally, substitute . ∎

The sample size requirement in Theorem II.11 can be avoided by using a squared norm penalty. The bound we derive has superlinear order in the dimension.

Theorem II.12.

Assume , and let be the penalized MLE over an -discretization indexing an iid model with penalty . Assume that for some , radius and , the Hellinger affinity is bounded by outside the ball and bounded by inside the ball. If , then

Proof.

This time we use Corollary II.5 rather than Corollary II.6. The challenge is to bound the summation

Assuming the sample size is large enough, we can bound that term as in Theorem II.11. For smaller sample sizes, we invoke Lemmas III.15 and III.16. In each case, the bound is no greater than the one we have claimed. ∎

As in Corollary II.7, the bounds in Theorems II.11 and II.12 remain stable if ε is proportional to 1/√n.

As an example, we will see how these bounds apply in a location family parametrized by the mean in ℝ^d. First, we establish the power decay, assuming the family has a finite first moment. By Lemma III.18, the Hellinger affinity satisfies a power-decay bound whose constants are determined by the first central moments. Therefore, Theorems II.11 and II.12 apply if we can find a Gaussian-shaped Hellinger affinity bound that holds inside the corresponding ball.

In particular, let us assume the model comprises distributions that are continuous with respect to Lebesgue measure. Then we will also assume that the data-generating distribution is continuous; otherwise, the risk bound is infinite anyway. These assumptions ensure the existence of exact medians, enabling us to use Lemma III.20.

Let be the vector of marginal medians of the model distribution with mean . The marginal median vector of any model distribution is then . Let be the marginal median vector of . By Lemma III.20, for any , the inequality

holds for within , where is times the minimum squared marginal density of within of its median. It remains to identify an large enough that contains . Using the triangle inequality and then Lemma III.4 to bound the distance between means and medians,

For in the ball , the first term is bounded by . This tells us that the ball contains .

Thus if all the marginal densities of are positive within of their medians, then there is a positive for which

in , confirming that Theorems II.11 and II.12 hold.

If the data-generating distribution is itself in the location family, then and . Thus the bound holds uniformly over . If there exists such that every -discretization of the family is a KL-net with radius , then a minimax risk bound can be derived in the same manner as Corollary II.10.

II-C Log reciprocal pmf of θ̂ as pseudo-penalty

In Section II-B, we chose a pseudo-penalty to have an expectation that is easy to handle; we only had to worry about the resulting log summation. Now we will select a pseudo-penalty with the opposite effect. We can eliminate Corollary II.2's log summation term by letting Γ be twice a codelength function. The smallest resulting expected pseudo-penalty comes from setting Γ to be two times the log reciprocal of the probability mass function of θ̂. This expectation is twice the Shannon entropy of the penalized MLE's distribution (i.e. the image measure of P* under the Θ-valued deterministic transformation y ↦ θ̂(y)).

Corollary II.13.

Let , and let be a penalized MLE over all indexing a countable iid model. Then

It is known that the risk of the MLE is bounded in terms of the log-cardinality of the model (e.g. [14]); Corollary II.13 implies a generalization of this fact for penalized MLEs, since the Shannon entropy of the estimator's distribution never exceeds the log-cardinality of the model.

Importantly, Corollary II.13 also applies to models of infinite cardinality.
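As a simulation illustration (a sketch with a hypothetical finite Gaussian location model and an unpenalized MLE), the Shannon entropy of the estimator's distribution is typically far below the log-cardinality of the grid, which is what makes an entropy-based bound attractive.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
grid = np.arange(-3.0, 3.0, 0.25)         # finite model: N(theta, 1), theta on the grid
theta_star, n, reps = 0.4, 50, 2000

counts = np.zeros(len(grid))
for _ in range(reps):
    data = rng.normal(theta_star, 1.0, size=n)
    loglik = np.array([norm.logpdf(data, loc=t, scale=1.0).sum() for t in grid])
    counts[np.argmax(loglik)] += 1        # unpenalized MLE over the grid

pmf = counts / reps
entropy = -np.sum(pmf[pmf > 0] * np.log(pmf[pmf > 0]))   # Shannon entropy of the MLE's distribution
print(entropy, np.log(len(grid)))         # entropy is much smaller than log-cardinality
```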

Lemma II.14.

Let be an -discretization, and let be a -valued random vector. Suppose that for some and some radius , every outside of has probability bounded by . Then the entropy of has the bound

If , then this bound grows exponentially in . However, if and are known, then one can set and find that is guaranteed to be bounded by . Of course, one needs to take the behavior of the index of resolvability into account as well; good overall behavior will typically require that has order .

In certain models satisfying for some , we surmise that it may be possible to establish the applicability of Lemma II.14 (with having order ) by using information theoretic large deviation techniques along the lines of [16, Thm 19.2].

II-D Quadratic form as pseudo-penalty

Other simple corollaries come from using a quadratic pseudo-penalty Γ(θ) = (θ − θ₀)ᵀ A (θ − θ₀) for some positive definite matrix A. The expected pseudo-penalty is then

E Γ(θ̂) = tr(A Cov θ̂),

where Cov θ̂ denotes the covariance matrix of the random vector θ̂ and the centering point is θ₀ = E θ̂. For the log summation term, we note that the sum of the resulting Gaussian-shaped terms over the grid is controlled by Lemma III.10. Using a matrix proportional to n as A gives us Corollary II.15.

Corollary II.15.

Assume , and let be the penalized MLE over an -discretization indexing an iid model with penalty . Then for any ,

As described in Section III-D, one gets desirable order behavior from the log summation term by using ε proportional to 1/√n. For either of these two corollaries above to have order-1/n bounds, the numerator of the second term should be stable in n. In Corollary II.15, one sets A proportional to n and thus needs Cov θ̂ to have order 1/n. In many cases, such as ordinary MLE with an exponential family, the covariance matrix of the optimizer over the continuous parameter space is indeed bounded by a matrix divided by n. However, one still needs to handle the discrepancy in behavior between the continuous and discretized estimators.

In a sense, Corollary II.15 shifts the problem to another risk-related quantity, while the pseudo-penalties used in Sections II-B and II-C provide more direct ways of deriving exact risk bounds of order 1/n.
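The expectation identity behind the quadratic pseudo-penalty, that the expected quadratic form equals tr(A Cov θ̂) plus a quadratic in the offset between the centering point and the estimator's mean (so only the trace term remains when the centering point is E θ̂), can be verified by simulation; the numbers below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
A = np.array([[2.0, 0.3], [0.3, 1.0]])                    # positive definite matrix
mean, cov = np.array([0.5, -0.2]), np.array([[0.04, 0.01], [0.01, 0.09]])
theta0 = np.array([0.4, 0.0])

draws = rng.multivariate_normal(mean, cov, size=200000)   # stand-in for the estimator's distribution
diffs = draws - theta0
mc = np.mean(np.einsum('ij,jk,ik->i', diffs, A, diffs))   # Monte Carlo E[(X-theta0)' A (X-theta0)]

exact = np.trace(A @ cov) + (mean - theta0) @ A @ (mean - theta0)
print(mc, exact)                                          # these agree up to Monte Carlo error
```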

II-E Penalty as pseudo-penalty

Another simple corollary to Theorem II.1 takes the pseudo-penalty equal to the penalty itself, Γ = pen.

Corollary II.16.

Assume Y ∼ P^{×n}, and let θ̂ be the penalized MLE over Θ indexing a countable iid model with penalty pen. Then

E B(P, P_θ̂) ≤ (1/n) E[ log(p(Y)/p_θ̂(Y)) + 2 pen(θ̂) ] + (2/n) log Σ_{θ∈Θ} e^{−pen(θ)}.