1 Introduction: The Catch-Up Phenomenon
We consider inference based on a countable set of models (sets of probability distributions), focusing on two tasks: model selection and model averaging. In model selection tasks, the goal is to select the model that best explains the given data. In model averaging, the goal is to find the weighted combination of models that leads to the best prediction of future data from the same source.
An attractive property of some criteria for model selection is that they are consistent under weak conditions, i.e. if the true distribution is in one of the models, then the -probability that this model is selected goes to one as the sample size increases. BIC [Schwarz, 1978], Bayes factor model selection [Kass and Raftery, 1995], Minimum Description Length (MDL) model selection [Barron et al., 1998] and prequential model validation [Dawid, 1984] are examples of widely used model selection criteria that are usually consistent. However, other model selection criteria such as AIC [Akaike, 1974] and leave-one-out cross-validation (LOO) [Stone, 1977], while often inconsistent, do typically yield better predictions. This is especially the case in nonparametric settings of the following type:
can be arbitrarily well-approximated by a sequence of distributions in the (parametric) models under consideration, but is not itself contained in any of these. In many such cases, the predictive distribution converges to the true distribution at the optimal rate for AIC and LOO[Shibata, 1983, Li, 1987], whereas in general MDL, BIC, the Bayes factor method and prequential validation only achieve the optimal rate to within an factor [Rissanen et al., 1992, Foster and George, 1994, Yang, 1999, Grünwald, 2007]. In this paper we reconcile these seemingly conflicting approaches [Yang, 2005a] by improving the rate of convergence achieved in Bayesian model selection without losing its consistency properties. First we provide an example to show why Bayes sometimes converges too slowly.
1.1 The Catch-Up Phenomenon
Given priors on parametric models , ,
and parameters therein, Bayesian inference associates each modelwith the marginal distribution , given by
obtained by averaging over the parameters according to the prior. In Bayes factor model selection the preferred model is the one with maximum a posteriori probability. By Bayes’ rule this is , where
denotes the prior probability of. We can further average over model indices, a process called Bayesian Model Averaging (BMA). The resulting distribution can be used for prediction. In a sequential setting, the probability of a data sequence under a distribution typically decreases exponentially fast in . It is therefore common to consider , which we call the code length of achieved by . We take all logarithms to base , allowing us to measure code length in bits. The name code length refers to the correspondence between code length functions and probability distributions based on the Kraft inequality, but one may also think of the code length as the accumulated log loss that is incurred if we sequentially predict the by conditioning on the past, i.e. using [Barron et al., 1998, Grünwald, 2007, Dawid, 1984, Rissanen, 1984]. For BMA, we have
Here the th term represents the loss incurred when predicting given using , which turns out to be equal to the posterior average: .
Prediction using has the advantage that the code length it achieves on is close to the code length of , where is the best of the marginals , i.e. achieves . More precisely, given a prior on model indices, the difference between and must be in the range , whatever data are observed. Thus, using BMA for prediction is sensible if we are satisfied with doing essentially as well as the best model under consideration. However, it is often possible to combine into a distribution that achieves smaller code length than ! This is possible if the index of the best distribution changes with the sample size in a predictable way. This is common in model selection, for example with nested models, say . In this case typically predicts better at small sample sizes (roughly, because has more parameters that need to be learned than ), while predicts better eventually. Figure 1 illustrates this phenomenon.
It shows the accumulated code length difference on “The Picture of Dorian Gray” by Oscar Wilde, where and
are the Bayesian marginal distributions for the first-order and second-order Markov chains, respectively, and each character in the book is an outcome. We used uniform (Dirichlet) priors on the model parameters (i.e., the “transition probabilities”) , but the same phenomenon occurs with other common priors , such as Jeffreys”. Clearly is better for about the first outcomes, gaining a head start of approximately bits. Ideally we should predict the initial outcomes using and the rest using . However, only starts to behave like when it catches up with at a sample size of about , when the code length of drops below that of . Thus, in the shaded area behaves like while is making better predictions of those outcomes: since at , is bits behind, and at , it has caught up, in between it must have outperformed by bits!
Note that the example models and are very crude; for this particular application much better models are available. Thus and serve as a simple illustration only (see the discussion in Section 8.1
). However, our theorems, as well as experiments with nonparametric density estimation on which we will report elsewhere, indicate that the same phenomenon also occurs with more realistic models. In fact, the general pattern that first one model is better and then another occurs widely, both on real-world data and in theoretical settings. We argue that failure to take this effect into account leads to the suboptimal rate of convergence achieved by Bayes factor model selection and related methods. We have developed an alternative method to combine distributionsand into a single distribution , which we call the switch distribution, defined in Section 2. Figure 1 shows that behaves like initially, but in contrast to it starts to mimic almost immediately after starts making better predictions; it essentially does this no matter what sequence is actually observed. differs from in that it is based on a prior distribution on sequences of models rather than simply a prior distribution on models. This allows us to avoid the implicit assumption that there is one model which is best at all sample sizes. After conditioning on past observations, the posterior we obtain gives a better indication of which model performs best at the current sample size, thereby achieving a faster rate of convergence. Indeed, the switch distribution is very closely related to earlier algorithms for tracking the best expert developed in the universal prediction literature; see also Section 7 [Herbster and Warmuth, 1998, Vovk, 1999, Volf and Willems, 1998, Monteleoni and Jaakkola, 2004]; however, the applications we have in mind and the theorems we prove are completely different.
The remainder of the paper is organized as follows (for the reader’s convenience, we have attached a table of contents at the end of the paper). In Section 2 we introduce our basic concepts and notation, and we then define the switch distribution. While in the example above, we switched between just two models, the general definition allows switching between elements of any finite or countably infinite set of models. In Section 3 we show that model selection based on the switch distribution is consistent (Theorem 1). Then in Section 4 we show that the switch distribution achieves a rate of convergence that is never significantly worse than that of Bayesian model averaging, and we show that, in contrast to Bayesian model averaging, the switch distribution achieves the worst-case optimal rate of convergence when it is applied to histogram density estimation. In Section 5 we develop a number of tools that can be used to bound the rate of convergence in Cesàro-mean in more general parametric and nonparametric settings, which include histogram density estimation as a special case. In Section 5.3 and Section 5.4
we apply these tools to show that the switch distribution achieves minimax convergence rates in density estimation based on exponential families and in some nonparametric linear regression problems. In Section6 we give a practical algorithm that computes the switch distribution. Theorem 14 of that section shows that the run-time for predictors is time. In Sections 7 and Section 8 we put our work in a broader context and explain how our results fit into the existing literature. Specifically, Section 7.1 explains how our result can be reconciled with a seemingly contradictory recent result of Yang [2005a], and Section 8.1 describes a strange implication of the catch-up phenomenon for Bayes factor model selection. The proofs of all theorems are in Appendix A (except the central results of Section 5, which are proved in the main text).
2 The switch distribution for Model Selection and Prediction
Suppose , ,
is a sequence of random variables that take values in sample spacefor some . For , let , , denote the first outcomes of , such that takes values in the product space . (We let denote the empty sequence.) For , we write for , , , where is allowed. We omit the subscript when , writing rather than .
Any distribution may be defined in terms of a sequential prediction strategy that predicts the next outcome at any time . To be precise: Given the previous outcomes at time , a prediction strategy should issue a conditional density with corresponding distribution for the next outcome . Such sequential prediction strategies are sometimes called prequential forecasting systems [Dawid, 1984]. An instance is given in Example 1 below. Whenever the existence of a ‘true’ distribution is assumed — in other words, are distributed according —, we may think of any prediction strategy as a procedure for estimating , and in such cases, we will often refer to an estimator. For simplicity, we assume throughout that the density is taken relative to either the usual Lebesgue measure (if is continuous) or the counting measure (if is countable). In the latter case is a probability mass function. It is natural to define the joint density and let be the unique distribution on such that, for all , is the density of its marginal distribution for . To ensure that is well-defined even if is continuous, we will only allow prediction strategies satisfying the natural requirement that for any and any fixed measurable event the probability is a measurable function of . This requirement holds automatically if is countable.
2.2 Model Selection and Prediction
In model selection the goal is to choose an explanation for observed data from a potentially infinite list of candidate models , , We consider parametric models, which we define as sets of prediction strategies that are indexed by elements of , for some smallest possible
, the number of degrees of freedom. A model is more commonly viewed as a set of distributions, but since distributions can be viewed as prediction strategies as explained above, we may think of a model as a set of prediction strategies as well. Examples of model selection are histogram density estimation[Rissanen et al., 1992] ( is the number of bins minus 1), regression based on a set of basis functions such as polynomials ( is the number of coefficients of the polynomial), and the variable selection problem in regression [Shibata, 1983, Li, 1987, Yang, 1999] ( is the number of variables). A model selection criterion is a function that, given any data sequence of arbitrary length , selects the model with index .
With each model we associate a single prediction strategy . The bar emphasizes that is a meta-strategy based on the prediction strategies in . In many approaches to model selection, for example AIC and LOO, is defined using some parameter estimator , which maps a sequence of previous observations to an estimated parameter value that represents a “best guess” of the true/best distribution in the model. Prediction is then based on this estimator: , which also defines a joint density . The Bayesian approach to model selection or model averaging goes the other way around. It starts out with a prior on , and then defines the Bayesian marginal density
When is non-zero this joint density induces a unique conditional density
which is equal to the mixture of according to the posterior, , based on . Thus the Bayesian approach also defines a prediction strategy .
Associating a prediction strategy with each model is known as the prequential approach to statistics [Dawid, 1984] or predictive MDL [Rissanen, 1984]. Regardless of whether is based on parameter estimation or on Bayesian predictions, we may usually think of it as a universal code relative to [Grünwald, 2007].
Suppose . Then a prediction strategy may be based on the Bernoulli model that regards as a sequence of independent, identically distributed Bernoulli random variables with . We may predict using the maximum likelihood (ML) estimator based on the past, i.e. using . The prediction for is then undefined. If we use a smoothed ML estimator such as the Laplace estimator, , then all predictions are well-defined. It is well-known that the predictor defined by equals the Bayesian predictive distribution based on a uniform prior. Thus in this case a Bayesian predictor and an estimation-based predictor coincide!
2.3 The switch distribution
Suppose , , is a list of prediction strategies for . (Although here the list is infinitely long, the developments below can with little modification be adjusted to the case where the list is finite.) We first define a family of combinator prediction strategies that switch between the original prediction strategies. Here the parameter space is defined as
The parameter specifies the identities of constituent prediction strategies and the sample sizes, called switch-points, at which to switch between them. For , let , and . We omit the argument when the parameter is clear from context; e.g. we write for . For each the corresponding is defined as:
Switching to the same predictor multiple times (consecutively or not) is allowed. The extra switch-point is included to simplify notation; we always take , so that represents the strategy that is used in the beginning, before any actual switch takes place.
Given a list of prediction strategies , , , we define the switch distribution as a Bayesian mixture of the elements of according to a prior on :
Definition 1 (switch distribution).
Suppose is a probability mass function on . Then the switch distribution with prior is the distribution for that is defined by the density
for any , , and .
Hence the marginal likelihood of the switch distribution has density
Although the switch distribution provides a general way to combine prediction strategies (see Section 7.3), in this paper it will only be applied to combine prediction strategies , , that correspond to parametric models. In this case we may define a corresponding model selection criterion . To this end, let be a random variable that denotes the strategy/model that is used to predict given past observations . Formally, let be the unique such that and either (i.e. the current sample size is between the -th and -st switch-point), or (i.e. the current sample size is beyond the last switch point). Then
. Now note that by Bayes’ theorem, the prior, together with the data , induces a posterior on switching strategies . This posterior on switching strategies further induces a posterior on the model that is used to predict . Algorithm 6, given in Section 6, efficiently computes the posterior distribution on given :
which selects the model with maximum posterior probability.
If one of the models, say with index , is actually true, then it is natural to ask whether is consistent, in the sense that it asymptotically selects with probability . Theorem 1 below states that, if the prediction strategies associated with the models are Bayesian predictive distributions, then is consistent under certain conditions which are only slightly stronger than those required for standard Bayes factor model selection consistency. It is followed by Theorem 2, which extends the result to the situation where the are not necessarily Bayesian.
Bayes factor model selection is consistent if for all , and are mutually singular, that is, if there exists a measurable set such that and [Barron et al., 1998]. For example, this can usually be shown to hold if (a) the models are nested and (b) for each , is a subset of of -measure . In most interesting applications in which (a) holds, (b) also holds [Grünwald, 2007]. For consistency of , we need to strengthen the mutual singularity-condition to a “conditional” mutual singularity-condition: we require that, for all and all , all , the distributions and are mutually singular. For example, if are independent and identically distributed (i.i.d.) according to each in all models, but also if is countable and for all , all , then this conditional mutual singularity is automatically implied by ordinary mutual singularity of and .
Let denote the set of all possible extensions of to more switch-points. Let , , be Bayesian prediction strategies with respective parameter spaces , , and priors , , , and let be the prior of the corresponding switch distribution.
Theorem 1 (Consistency of the switch distribution).
Suppose is positive everywhere on and such that for some positive constant , for every , . Suppose further that and are mutually singular for all , , all , all . Then, for all , for all except for a subset of of -measure , the posterior distribution on satisfies
The requirement that is automatically satisfied if is of the form
where , and are priors on with full support, and is geometric: for some . In this case .
We now extend the theorem to the case where the universal distributions are not necessarily Bayesian, i.e. they are not necessarily of the form (1). It turns out that the “meta-Bayesian” universal distribution is still consistent, as long as the following condition holds. The condition essentially expresses that, for each , must not be too different from a Bayesian predictive distribution based on (1). This can be verified if all models are exponential families, and the represent ML or smoothed ML estimators (see Theorems 2.1 and 2.2 of [Li and Yu, 2000]). We suspect that it holds as well for more general parametric models and universal codes, but we do not know of any proof.
There exist Bayesian prediction strategies of form (1), with continuous and strictly positive priors such that
The conditions of Theorem 1 hold for and the chosen switch distribution prior .
For all , for each compact subset of the interior of , there exists a such that for all , with -probability 1, for all
For all with and all , the distributions and are mutually singular.
Theorem 2 (Consistency of the switch distribution, Part 2).
Let be prediction strategies and let be the prior of the corresponding switch distribution. Suppose that the condition above holds relative to and . Then, for all , for all except for a subset of of Lebesgue-measure , the posterior distribution on satisfies
4 Risk Convergence Rates
In this section and the next we investigate how well the switch distribution is able to predict future data in terms of expected logarithmic loss or, equivalently, how fast estimates based on the switch distribution converge to the true distribution in terms of Kullback-Leibler risk. In Section 4.1, we define the central notions of model classes, risk, convergence in Cesàro mean, and minimax convergence rates, and we give the conditions on the prior distribution under which our further results hold. We then (Section 4.2) show that the switch distribution cannot converge any slower than standard Bayesian model averaging. As a proof of concept, in Section 4.3 we present Theorem 4, which establishes that, in contrast to Bayesian model averaging, the switch distribution converges at the minimax optimal rate in a nonparametric histogram density estimation setting.
In the more technical Section 5, we develop a number of general tools for establishing optimal convergence rates for the switch distribution, and we show that optimal rates are achieved in, for example, nonparametric density estimation with exponential families and (basic) nonparametric linear regression, and also in standard parametric situations.
4.1.1 Model Classes
The setup is as follows. Suppose is a sequence of parametric models with associated estimators as before. Let us write for the union of the models. Although formally is a set of prediction strategies, it will often be useful to consider the corresponding set of distributions for . With minor abuse of notation we will denote this set by as well.
To test the predictions of the switch distribution, we will want to assume that is distributed according to a distribution that satisfies certain restrictions. These restrictions will always be formulated by assuming that , where is some restricted set of distributions for . For simplicity, we will also assume throughout that, for any , the conditional distribution has a density (relative to the Lebesgue or counting measure) with probability one under . For example, if , then might be the set of all i.i.d. distributions that have uniformly bounded densities with uniformly bounded first derivatives, as will be considered in Section 4.3. In general, however, the sequence need not be i.i.d. (under the elements of ).
We will refer to any set of distributions for as a model class. Thus both and are model classes. In Section 5.5 it will be assumed that , which we will call the parametric setting. Most of our results, however, deal with various nonparametric situations, in which is non-empty. It will then be useful to emphasize that is (much) larger than by calling a nonparametric model class.
Given , we will measure how well any estimator predicts in terms of the Kullback-Leibler (KL) divergence [Barron, 1998]. Suppose that and are distributions for some random variable , with densities and respectively. Then the KL divergence from to is defined as
KL divergence is never negative, and reaches zero if and only if equals . Taking an expectation over leads to the standard definition of the risk of estimator at sample size relative to KL divergence:
Instead of the standard KL risk, we will study the cumulative risk
where the superscript denotes taking the marginal of the distribution on the first outcomes. We will show convergence of the predictions of the switch distribution in terms of the cumulative rather than the individual risk. This notion of convergence, defined below, is equivalent to the well-studied notion of convergence in Cesàro mean. It has been considered by, among others, Rissanen et al. , Barron , Poland and Hutter , and its connections to ordinary convergence of the risk were investigated in detail by Grünwald .
Asymptotic properties like ‘convergence’ and ‘convergence in Cesàro mean’ will be expressed conveniently using the following notation, which extends notation from [Yang and Barron, 1999]:
Definition 2 (Asymptotic Ordering of Functions).
For any two nonnegative functions and any we write if for all there exists an such that for all it holds that . The less precise statement that there exists some such that , will be denoted by . (Note the absence of the subscript.) For , we define to mean , and means that for some , . Finally, we say that if both and .
Note that is equivalent to . One may think of as another way of writing . The two statements are equivalent if is never zero.
We can now succinctly state that the risk of an estimator converges to at rate if , where is a nonnegative function such that goes to as increases. We say that converges to 0 at rate at least in Cesàro mean if . As -ordering is invariant under multiplication by a positive constant, convergence in Cesàro mean is equivalent to asymptotically bounding the cumulative risk of as
We will always express convergence in Cesàro mean in terms of cumulative risks as in (14). The reader may verify that if the risk of is always finite and converges to at rate and , then the risk of also converges in Cesàro mean at rate . Conversely, suppose that the risk of converges in Cesàro mean at rate . Does this also imply that the risk of converges to at rate in the ordinary sense? The answer is “almost”, as shown in [Grünwald, 2007]: The risk of may be strictly larger than for some , but the gap between any two and at which the risk of exceeds must become infinitely large with increasing . This indicates that, although convergence in Cesàro mean is a weaker notion than standard convergence, obtaining fast Cesàro mean convergence rates is still a worthy goal in prediction and estimation. We explore the connection between Cesàro and ordinary convergence in more detail in Section 5.2.
4.1.3 Minimax Convergence Rates
The worst-case cumulative risk of the switch distribution is given by
We will compare it to the minimax cumulative risk, defined as:
where the infimum is over all estimators as defined in Section 2.1. We will say that the switch distribution achieves the minimax convergence rate in Cesàro mean (up to a multiplicative constant) if . Note that there is no requirement that is a distribution in or ; We are looking at the worst case over all possible estimators, irrespective of the model class, , used to approximate . Thus, we may call an “out-model estimator” [Grünwald, 2007].
4.1.4 Restrictions on the Prior
Throughout our analysis of the achieved rate of convergence we will require that the prior of the switch distribution, , can be factored as in (9), and is chosen to satisfy
Thus , the prior on the total number of distinct predictors, is allowed to decrease either exponentially (as required for Theorem 1) or polynomially, but and cannot decrease faster than polynomially. For example, we could set and , or we could take the universal prior on the integers [Rissanen, 1983].
4.2 Never Much Worse than Bayes
Suppose that the estimators are Bayesian predictive distributions, defined by their densities as in (1). The following lemma expresses that the Cesàro mean of the risk achieved by the switch distribution is never much higher than that of Bayesian model averaging, which is itself never much higher than that of any of the Bayesian estimators under consideration.
Let be the switch distribution for with prior of the form (9). Let be the Bayesian model averaging distribution for the same estimators, defined with respect to the same prior on the estimators . Then, for all , all , and all ,
Consequently, if , , are distributed according to any distribution , then for any ,
As mentioned in the introduction, one advantage of model averaging using is that it always predicts almost as well as the estimator for any , including the that yields the best predictions overall. Lemma 3 shows that this property is shared by , which multiplicatively dominates . In the sequel, we investigate under which circumstances the switch distribution may achieve a smaller cumulative risk than Bayesian model averaging.
4.3 Histogram Density Estimation
How many bins should be selected in density estimation based on histogram models with equal-width bins? Suppose , , take outcomes in and are distributed i.i.d. according to , where has density for all . Let for . Let us restrict to the set of distributions with densities that are uniformly bounded above and below and also have uniformly bounded first derivatives. In particular, suppose there exist constants and such that
In this setting the minimax convergence rate in Cesàro mean can be achieved using histogram models with bins of equal width (see below). The equal-width histogram model with bins, , is specified by the set of densities on that are constant within the bins , , , , where . In other words, contains any density such that, for all that lie in the same bin, . The
-dimensional parameter vectordenotes the probability masses of the bins, which have to sum up to one: . Note that this last constraint makes the number of degrees of freedom one less than the number of bins. Following Yu and Speed  and Rissanen et al. , we associate the following estimator with model :
where denotes the number of outcomes in that fall into the same bin as . As in Example 1, these estimators may both be interpreted as being based on parameter estimation (estimating , where denotes the number of outcomes in bin ) or on Bayesian prediction (a uniform prior for also leads to this estimator [Yu and Speed, 1992]).
The minimax convergence rate in Cesàro mean for