Consider data $D$ sampled from some distribution $p(D|\theta)$ with unknown $\theta\in\Omega$. The likelihood function or the posterior contain the complete statistical information of the sample. Often this information needs to be summarized or simplified for various reasons (comprehensibility, communication, storage, computational efficiency, mathematical tractability, etc.). Parameter estimation, hypothesis testing, and model (complexity) selection can all be regarded as ways of summarizing this information, albeit in different ways or contexts. The posterior might either be summarized by a single point (e.g. ML or MAP or mean or stochastic model selection), or by a convex set
(e.g. confidence or credible interval), or by a finite set of points (mixture models) or a sample of points (particle filtering), or by the mean and covariance matrix (Gaussian approximation), or by more general density estimation, or in a few other ways [BM98, Bis06]. I have roughly sorted the methods in increasing order of complexity. This paper concentrates on set estimation, which includes (multiple) point estimation and hypothesis testing as special cases, henceforth jointly referred to as “hypothesis identification” (this nomenclature seems uncharged and naturally includes what we will do: estimation and testing of simple and complex hypotheses but not density estimation). We will briefly comment on generalizations beyond set estimation at the end.
Desirable properties. There are many desirable properties any hypothesis identification principle ideally should satisfy. It should
lead to good predictions (that’s what models are ultimately for),
be broadly applicable,
be analytically and computationally tractable,
be defined and make sense also for non-i.i.d. and non-stationary data,
be reparametrization and representation invariant,
work for simple and composite hypotheses,
work for classes containing nested and overlapping hypotheses,
work in the estimation, testing, and model selection regime,
reduce in special cases (approximately) to existing other methods.
Here we concentrate on the first item, and will show that the resulting principle nicely satisfies many of the other items.
The main idea. We address the problem of identifying hypotheses (parameters/models) with good predictive performance head on. If $\theta_0$ is the true parameter, then $p(x|\theta_0)$ is obviously the best prediction of the $m$ future observations $x$. If we don’t know $\theta_0$ but have prior belief $p(\theta)$ about its distribution, the predictive distribution $p(x|D)$ based on the past $n$ observations $D$ (which averages the likelihood $p(x|\theta)$ over $\theta$ with posterior weight $p(\theta|D)$) is by definition the best Bayesian predictor. Often we cannot use full Bayes (for reasons discussed above) but predict with hypothesis $H=\{\theta\}$, i.e. use $p(x|\theta)$ as prediction. The closer $p(x|\theta)$ is to $p(x|\theta_0)$ or $p(x|D)$, the better is $H$’s prediction (by definition), where we can measure closeness with some distance function $d$. Since $x$ and $\theta_0$ are (assumed to be) unknown, we have to sum or average over them. (Footnote 1: So far we tacitly assumed that given $\theta$, $x$ is independent of $D$. For non-i.i.d. data this is generally not the case, hence the appearance of $D$.)
Definition 1 (Predictive Loss)
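The predictive $\text{Loss}$/$\widetilde{\text{Loss}}$ of $H$ given $D$ based on distance $d$ for $m$ future observations is
$$\text{Loss}_d^m(H,D) := \int d\big(p(x|H,D),\,p(x|D)\big)\,dx, \qquad \widetilde{\text{Loss}}{}_d^m(H,D) := \int\!\!\int d\big(p(x|H,D),\,p(x|\theta,D)\big)\,p(\theta|D)\,dx\,d\theta \qquad (1)$$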
Predictive hypothesis identification (PHI) minimizes the losses w.r.t. some hypothesis class $\mathcal{H}$. Our formulation is general enough to cover point and interval estimation, simple and composite hypothesis testing, (mixture) model (complexity) selection, and others.
(Un)related work. The general idea of inference by maximizing predictive performance is not new [Gei93]. Indeed, in the context of model (complexity) selection it is prevalent in machine learning and implemented primarily by empirical cross validation procedures and variations thereof [Zuc00] or by minimizing test and/or train set (generalization) bounds; see [Lan02] and references therein. There are also a number of statistics papers on predictive inference; see [Gei93] for an overview and older references, and [BB04, MGB05] for newer references. Most of them deal with distribution free methods based on some form of cross-validation discrepancy measure, and often focus on model selection. A notable exception is MLPD [LF82], which maximizes the predictive likelihood including future observations. The full decision-theoretic setup, in which a decision based on $D$ leads to a loss depending on $x$ and the expected loss is minimized, has been studied extensively [BM98, Hut05], but scarcely in the context of hypothesis identification. On the natural progression of estimation → prediction → action, approximating the predictive distribution by minimizing (1) lies between traditional parameter estimation and optimal decision making. Formulation (1) is quite natural but I haven’t seen it elsewhere. Indeed, besides ideological similarities the papers above bear no resemblance to this work.
Contents. The main purpose of this paper is to investigate the predictive losses above and in particular their minima, i.e. the best predictor in $\mathcal{H}$. Section 2 introduces notation, global assumptions, and illustrates PHI on a simple example. This also shows a shortcoming of MAP and ML estimation. Section 3 formally states PHI, possible distance and loss functions, their minima, and the major prediction scenarios. In Section 4, I study exact properties of PHI: invariances, sufficient statistics, and equivalences. Section 5 investigates the limit $m\to\infty$ in which PHI can be related to MAP and ML. Section 6 derives large sample approximations ($n\to\infty$) for which PHI reduces to sequential moment fitting (SMF). The results are subsequently used for Offline PHI. Section 7 contains summary, outlook and conclusions. Throughout the paper, the Bernoulli example will illustrate the general results.
The main aim of this paper is to introduce and motivate PHI, demonstrate how it can deal with the difficult problem of selecting composite and nested hypotheses, and show how PHI reduces to known principles in certain regimes. The latter provides additional justification and support of previous principles, and clarifies their range of applicability. In general, the treatment is exemplary, not exhaustive.
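Setup. Let $D\equiv(x_1,\dots,x_n)\in\mathcal{X}^n$ be the observed sample with observations $x_i\in\mathcal{X}$ from some measurable space $\mathcal{X}$, e.g. $\mathbb{R}^{d'}$ or $\mathbb{N}$ or a subset thereof. Similarly let $x\equiv(x_{n+1},\dots,x_{n+m})\in\mathcal{X}^m$ be potential future observations. We assume that $D$ and $x$ are sampled from some probability distribution $P[\cdot|\theta]$, where $\theta\in\Omega$ is some unknown parameter. We do not assume independence of the $x_i$ unless otherwise stated. For simplicity of exposition we assume that the densities $p(D|\theta)$ w.r.t. the default (Lebesgue or counting) measure ($\int d\lambda$, $\sum_x$, written both henceforth as $\int dx$) exist.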
Bayes. Similarly, we assume a prior distribution $P[\Theta]$ with density $p(\theta)$ over parameters. From prior $p(\theta)$ and likelihood $p(D|\theta)$ we can compute the posterior $p(\theta|D)=p(D|\theta)\,p(\theta)/p(D)$, where normalizer $p(D)=\int p(D|\theta)\,p(\theta)\,d\theta$. The full Bayesian approach uses parameter averaging for prediction
$$p(x|D)=\int p(x|\theta,D)\,p(\theta|D)\,d\theta=\frac{p(D,x)}{p(D)}$$
the so-called predictive distribution (or more precisely predictive density), which can be regarded as the gold standard for prediction (and there are plenty of results justifying this [BCH93, Hut05]).
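Composite likelihood. Let $H_\theta$ be the simple hypothesis that $x$ is sampled from $p(x|\theta)$ and $H_\Theta$ the composite hypothesis that $x$ is “sampled” from $p(x|\Theta)$, where $\Theta\subseteq\Omega$. In the Bayesian framework, the “composite likelihood” $p(x|\Theta)$ is actually well defined (for measurable $\Theta$ with $P[\Theta]>0$) as an averaged likelihood
$$p(x|\Theta)=\int p(x|\theta)\,p(\theta|\Theta)\,d\theta,\qquad p(\theta|\Theta)=\frac{p(\theta)}{P[\Theta]}\ \text{for}\ \theta\in\Theta\ \text{and}\ 0\ \text{else.}$$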
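MAP and ML. Let $\mathcal{H}$ be the (finite, countable, continuous, complete, or else) class of hypotheses $H_\Theta$ (or $\Theta$ for short) from which the “best” one shall be selected. Each $\Theta\in\mathcal{H}$ is assumed to be a measurable subset of $\Omega$. The maximum a posteriori (MAP) estimator is defined as $\theta^{\text{MAP}}=\arg\max_{\theta\in\mathcal{H}}p(\theta|D)$ if $\mathcal{H}$ contains only simple hypotheses and $\Theta^{\text{MAP}}=\arg\max_{\Theta\in\mathcal{H}}P[\Theta|D]$ in the general case. The composite maximum likelihood estimator is defined as $\Theta^{\text{ML}}=\arg\max_{\Theta\in\mathcal{H}}p(D|\Theta)$, which reduces to ordinary ML for simple hypotheses.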
In order not to further clutter up the text with too much mathematical gibberish, we make the following global assumptions during informal discussions:
Global Assumption 2
Wherever necessary, we assume that sets, spaces, and functions are measurable, densities exist w.r.t. some (Lebesgue or counting) base measure, observed events have non-zero probability, or densities conditioned on probability zero events are appropriately defined, in which case statements might hold with probability 1 only. Functions and densities are sufficiently often (continuously) differentiable, and integrals exist and exchange.
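Bernoulli Example. Consider a binary ($\mathcal{X}=\{0,1\}$) i.i.d. process $p(D|\theta)=\theta^{n_1}(1-\theta)^{n_0}$ with bias $\theta\in[0,1]$, and $n_1=x_1+\dots+x_n=n-n_0$ the number of observed 1s. Let us assume a uniform prior $p(\theta)=1$. Here but not generally in later continuations of the example we also assume $n_0=n_1$. Consider hypothesis class $\mathcal{H}=\{H_f,H_v\}$ containing simple hypothesis $H_f=\{\theta=\frac12\}$ meaning “fair” and composite vacuous alternative $H_v=\Omega=[0,1]$ meaning “don’t know”. It is easy to see that
$$p(D|H_v)=p(D)=\frac{n_1!\,n_0!}{(n+1)!}\ \le\ 2^{-n}=p(D|H_f)\qquad(\text{with equality only for } n=0)$$
hence $H^{\text{ML}}=H_f$, i.e. ML always suggests a fair coin however weak the evidence is. On the other hand, $P[H_f|D]=0<1=P[H_v|D]$, i.e. MAP never suggests a fair coin however strong the evidence is.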
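The contrast between composite ML and MAP in this example can be checked numerically. The following is a minimal sketch, assuming the uniform prior and the balanced data ($n_0=n_1$) of the example; the function names `lik_fair` and `lik_vacuous` are illustrative and not part of the paper.

```python
from fractions import Fraction
from math import factorial

def lik_fair(n1, n0):
    """p(D|H_f): probability of one specific sequence under theta = 1/2."""
    return Fraction(1, 2 ** (n1 + n0))

def lik_vacuous(n1, n0):
    """p(D|H_v): likelihood averaged over a uniform prior on [0, 1] (Beta integral)."""
    return Fraction(factorial(n1) * factorial(n0), factorial(n1 + n0 + 1))

for n in (0, 2, 4, 10, 20):
    n1 = n0 = n // 2
    f, v = lik_fair(n1, n0), lik_vacuous(n1, n0)
    # Composite ML compares the two likelihoods (a tie at n = 0 is broken towards "fair").
    ml_choice = "fair" if f >= v else "don't know"
    print(f"n={n:2d}  p(D|H_f)={float(f):.3e}  p(D|H_v)={float(v):.3e}  ML -> {ml_choice}")
```

MAP needs no computation here: under any continuous prior the single point $\theta=\frac12$ has posterior probability zero, so the vacuous hypothesis always wins.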
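Now consider PHI. Let $m_1=x_{n+1}+\dots+x_{n+m}=m-m_0$ be the number of future 1s. The probabilities of $x$ given $H_f$, $H_v$, and $D$ are, respectively
$$p(x|H_f)=2^{-m},\qquad p(x|H_v)=\frac{m_1!\,m_0!}{(m+1)!},\qquad p(x|D)=\frac{(m_1+n_1)!\,(m_0+n_0)!}{(n+m+1)!}\cdot\frac{(n+1)!}{n_1!\,n_0!}\qquad (2)$$
For $m=1$ we get $p(1|H_f)=\frac12=p(1|H_v)$, so when concerned with predicting only one bit, both hypotheses are equally good. More generally, for an interval $H=[a,b]$, compare $p(1|H)=\frac{a+b}{2}$ to the full Bayesian prediction $p(1|D)=\frac{n_1+1}{n+2}$ (Laplace’s rule). Hence if $\mathcal{H}$ is a class of interval hypotheses, then PHI chooses the $H$ whose midpoint is closest to Laplace’s rule, which is reasonable. The size of the interval doesn’t matter, since $p(x_{n+1}|H)$ is independent of it.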
Things start to change for $m=2$. The following table lists $p(x|D)$ for some $D$, together with $p(x|H_f)$ and $p(x|H_v)$, and their prediction error $\text{Err}(H):=\text{Loss}_1^2(H,D)$ for $d(p,q)=|p-q|$ in (1)
The last column contains the identified best predictive hypothesis. For four or more observations, PHI says “fair”, otherwise “don’t know”.
Using (2) or our later results, one can show more generally that PHI chooses “fair” for $n\gg m$ and “don’t know” for $m\gg n$.
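The $m=2$ comparison can be reproduced with exact rational arithmetic. This is a sketch under the example's assumptions (uniform prior, balanced data, absolute deviation as distance); the closed forms for the three predictive probabilities are the standard Beta integrals, and all function names are illustrative.

```python
from fractions import Fraction
from math import comb, factorial

def p_fair(m1, m):
    """p(x|H_f) for a future sequence x with m1 ones out of m."""
    return Fraction(1, 2 ** m)

def p_vacuous(m1, m):
    """p(x|H_v): uniform average of theta^m1 (1-theta)^(m-m1)."""
    return Fraction(factorial(m1) * factorial(m - m1), factorial(m + 1))

def p_bayes(m1, m, n1, n0):
    """p(x|D): posterior predictive probability of the sequence x."""
    n = n1 + n0
    num = factorial(m1 + n1) * factorial(m - m1 + n0) * factorial(n + 1)
    den = factorial(n + m + 1) * factorial(n1) * factorial(n0)
    return Fraction(num, den)

def abs_loss(p_h, m, n1, n0):
    """Sum of |p(x|H) - p(x|D)| over all 2^m sequences, grouped by the count m1."""
    return sum(comb(m, m1) * abs(p_h(m1, m) - p_bayes(m1, m, n1, n0))
               for m1 in range(m + 1))

def phi_choice(m, n1, n0):
    err_f = abs_loss(p_fair, m, n1, n0)
    err_v = abs_loss(p_vacuous, m, n1, n0)
    return ("fair" if err_f < err_v else "don't know"), err_f, err_v

for n in (0, 2, 4, 6, 10):
    choice, err_f, err_v = phi_choice(2, n // 2, n // 2)
    print(f"n={n:2d}  Err(H_f)={float(err_f):.4f}  Err(H_v)={float(err_v):.4f}  ->  {choice}")
```

For balanced data and $m=2$ the two errors work out to $\frac{1}{n+3}$ and $\frac{n}{3(n+3)}$, so the switch from “don’t know” to “fair” happens between $n=2$ and $n=4$.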
MAP versus ML versus PHI. The conclusions of the example generalize: For $\Theta\subseteq\Theta'$, we have $P[\Theta|D]\le P[\Theta'|D]$, i.e. MAP always chooses the less specific hypothesis $\Theta'$. On the other hand, we have $p(D|\theta^{\text{ML}})\ge p(D|\Theta)$, since the maximum can never be smaller than an average, i.e. composite ML prefers the maximally specific hypothesis. So interestingly, although MAP and ML give identical answers for uniform prior on simple hypotheses, their naive extensions to composite hypotheses are diametrically opposed. While MAP is risk averse, finding a likely true model of low predictive power, composite ML risks an (over)precise prediction. Sure, there are ways to make MAP and ML work for nested hypotheses. The Bernoulli example has also shown that PHI’s answer depends not only on the past data size $n$ but also on the future data size $m$. Indeed, if we make only few predictions based on a lot of data ($m\ll n$), a point estimation ($H_f$) is typically sufficient, since there will not be enough future observations to detect any discrepancy. On the other hand, if $m\gg n$, selecting a vacuous model ($H_v$) that ignores past data is better than selecting a potentially wrong parameter, since there is plenty of future data to learn from. This is exactly the behavior PHI exhibited in the example.
3 Predictive Hypothesis Identification Principle
We have already defined the predictive loss functions in (1). We now formally state our predictive hypothesis identification (PHI) principle, discuss possible distances $d$, and major prediction scenarios related to the choice of $m$.
Distance functions. Throughout this work we assume that $d$ is continuous and zero if and only if both arguments coincide. Some popular distances are: the (f) $f$-divergence $d(p,q)=f(p/q)\,q$ for convex $f$ with $f(1)=0$, the ($\alpha$) $\alpha$-distance $f(t)=|t^\alpha-1|^{1/\alpha}$, the (1) absolute deviation $d(p,q)=|p-q|$ ($\alpha=1$), the (h) Hellinger distance $d(p,q)=(\sqrt{p}-\sqrt{q})^2$ ($\alpha=\frac12$), the (c) chi-square distance $f(t)=(t-1)^2$, the (k) KL-divergence $f(t)=t\ln t$, and the (r) reverse KL-divergence $f(t)=-\ln t$. The only distance considered here that is not an $f$-divergence is the (2) squared distance $d(p,q)=(p-q)^2$. The $f$-divergence is particularly interesting, since it contains most of the standard distances and makes Loss representation invariant (RI).
Definition 3 (Predictive hypothesis identification (PHI))
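The best ($\widetilde{\text{best}}$) predictive hypothesis in $\mathcal{H}$ given $D$ is defined as
$$\hat H_d^m := \arg\min_{H\in\mathcal{H}}\text{Loss}_d^m(H,D)\qquad\Big(\tilde H_d^m := \arg\min_{H\in\mathcal{H}}\widetilde{\text{Loss}}{}_d^m(H,D)\Big)$$
The PHI ($\widetilde{\text{PHI}}$) principle states to predict $x$ with probability $p(x|\hat H_d^m,D)$ ($p(x|\tilde H_d^m,D)$), which we call PHI ($\widetilde{\text{PHI}}$) prediction.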
Prediction modes. There exist a few distinct prediction scenarios and modes. Here are prototypes of the presumably most important ones:
Infinite batch: Assume we summarize our data $D$ by a model/hypothesis $H\in\mathcal{H}$. The model is henceforth used as background knowledge for predicting and learning from further observations essentially indefinitely. This corresponds to $m\to\infty$.
Finite batch: Assume the scenario above, but terminate after $m$ predictions for whatever reason. This corresponds to a finite $m$ (often large).
Offline: The selected model $H$ is used for predicting each of the $m$ future observations separately, without further learning from the intermediate observations taking place. This corresponds to repeated $m=1$ with common $H$.
Online: At every step we determine a (good) hypothesis from $\mathcal{H}$ based on the past data, and use it only once for predicting the next observation. Then for the following step we select a new hypothesis etc. This corresponds to repeated $m=1$ with different $H$.
The above list is not exhaustive. Other prediction scenarios are definitely possible. In all prediction scenarios above we can use $\widetilde{\text{Loss}}$ instead of Loss equally well. Since all time steps in Online PHI are completely independent, Online PHI reduces to 1-Batch PHI, hence will not be discussed any further.
4 Exact Properties of PHI
Reparametrization and representation invariance (RI). An important sanity check of any statistical procedure is its behavior under reparametrization $\theta\to\vartheta$ [KW96] and/or when changing the representation of observations $x_i\to y_i$ [Wal96], where both maps are bijections. If the parametrization/representation is judged irrelevant to the problem, any inference should also be independent of it. MAP and ML are both representation invariant, but (for point estimation) only ML is reparametrization invariant.
Proposition 4 (Invariance of Loss)
$\text{Loss}_d^m(H,D)$ and $\widetilde{\text{Loss}}{}_d^m(H,D)$ are invariant under reparametrization of $\Omega$. If distance $d$ is an $f$-divergence, then they are also independent of the representation of the observation space $\mathcal{X}$. For continuous $\mathcal{X}$, the transformations are assumed to be continuously differentiable.
RI of the losses is obvious, but we will see some interesting consequences later. Any exact inference or any specialized form of PHI will inherit RI. The same holds for approximations, as long as they do not break RI. For instance, PHI will lead to an interesting RI variation of MAP.
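Sufficient statistic. For large $m$, the integral in Definition 1 is prohibitive. Many models (the whole exponential family) possess a sufficient statistic which allows us to reduce the integral over $\mathcal{X}^m$ to an integral over the sufficient statistic. Let
$$T(x)\ \text{be a sufficient statistic, i.e.}\quad p(x|T(x),\theta)=p(x|T(x))\quad\forall x,\theta \qquad (3)$$
which implies that there exist functions $g$ and $h$ such that the likelihood factorizes into
$$p(x|\theta)=h(x)\,g(T(x)|\theta) \qquad (4)$$
The proof is trivial for discrete $\mathcal{X}$ (choose $h(x)=p(x|T(x))$ and $g(t|\theta)=P[T(x)=t\,|\,\theta]$) and follows from Fisher’s factorization theorem for continuous $\mathcal{X}$. Let $A$ be an event that is independent of $x$ given $\theta$. Then multiplying (4) by $p(\theta|A)$ and integrating over $\theta$ yields
$$p(x|A)=h(x)\,g(T(x)|A),\qquad g(t|A):=\int g(t|\theta)\,p(\theta|A)\,d\theta \qquad (5)$$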
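For some $\beta\in\mathbb{R}$ let the (non-probability) measure $\mu_\beta[B]:=\int_{T^{-1}(B)}h(x)^\beta\,dx$ (6) have density $h_\beta(t)$ w.r.t. the (Lebesgue or counting) base measure $dt$ ($\int dt=\sum_t$ in the discrete case). Informally,
$$h_\beta(t):=\int h(x)^\beta\,\delta(T(x)-t)\,dx \qquad (7)$$
where $\delta$ is the Dirac delta for continuous $T$ (or the Kronecker delta for countable $T$, i.e. $\int dx\,\delta(T(x)-t)=\sum_{x:T(x)=t}$).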
Theorem 5 (PHI for sufficient statistic)
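Let $T(x)$ be a sufficient statistic (3) for $\theta$ and assume $x$ is independent of $D$ given $\theta$, i.e. $p(x|\theta,D)=p(x|\theta)$. Then
$$\text{Loss}_d^m(H,D)=\int d\big(g(t|H),\,g(t|D)\big)\,h_\beta(t)\,dt$$
holds (where $g$ and $h_\beta$ have been defined in (4), (6), and (7)), provided one (or both) of the following conditions hold: (i) distance $d$ scales with a power $\beta\in\mathbb{R}$, i.e. $d(\sigma p,\sigma q)=\sigma^\beta d(p,q)$ for $\sigma>0$, or (ii) any distance $d$, but $h\equiv 1$ in (4). One can choose $g(t|\theta)=p(t|\theta)$, the probability density of $t$, in which case $h_1(t)\equiv 1$.
All distances defined in Section 3 satisfy (i), the $f$-divergences all with $\beta=1$ and the square loss with $\beta=2$. The independence assumption is rather strong. In practice, usually it only holds for some $n$ if it holds for all $n$. Independence of $x$ from $D$ given $\theta$ for all $n$ can only be satisfied for independent (not necessarily identically distributed) $x_i$.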
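The scaling property used in the theorem is easy to check numerically. The sketch below assumes the standard textbook forms of the named distances, written so that the first argument is the hypothesis's prediction and the second the reference prediction; the exponent table `EXPONENT` and the helper `scales_with_power` are illustrative names only.

```python
import math

def absolute(p, q):   return abs(p - q)
def hellinger(p, q):  return (math.sqrt(p) - math.sqrt(q)) ** 2
def chi_square(p, q): return (p - q) ** 2 / q
def kl(p, q):         return p * math.log(p / q)
def reverse_kl(p, q): return q * math.log(q / p)
def squared(p, q):    return (p - q) ** 2

# Claimed scaling exponents: 1 for the f-divergences, 2 for the squared distance.
EXPONENT = {absolute: 1, hellinger: 1, chi_square: 1, kl: 1, reverse_kl: 1, squared: 2}

def scales_with_power(d, beta, p=0.3, q=0.7, sigmas=(0.1, 2.0, 5.0)):
    """Check d(s*p, s*q) == s**beta * d(p, q) for a few scale factors s > 0."""
    return all(math.isclose(d(s * p, s * q), s ** beta * d(p, q), rel_tol=1e-12)
               for s in sigmas)

for dist, beta in EXPONENT.items():
    print(f"{dist.__name__:10s} scales with power {beta}: {scales_with_power(dist, beta)}")
```

A spot check at one pair $(p,q)$ is of course not a proof; for $f$-divergences the property follows directly from $d(\sigma p,\sigma q)=f(p/q)\,\sigma q$.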
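Theorem 6 (Equivalence of PHI and $\widetilde{\text{PHI}}$ for square and RKL distance)
For square distance ($d\,\hat=\,2$) and RKL distance ($d\,\hat=\,r$), $\text{Loss}_d^m(H,D)$ differs from $\widetilde{\text{Loss}}{}_d^m(H,D)$ only by an additive constant $c$ independent of $H$, hence PHI and $\widetilde{\text{PHI}}$ select the same hypotheses: $\hat H_2^m=\tilde H_2^m$ and $\hat H_r^m=\tilde H_r^m$.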
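Let us continue with our Bernoulli example with uniform prior. $T(x)=m_1=x_{n+1}+\dots+x_{n+m}$ is a sufficient statistic. Since $\mathcal{X}$ is discrete, $dt$ is the counting measure and $\int dt=\sum_{t=0}^m$. In (4) we can choose $g(t|\theta)=p(x|\theta)=\theta^t(1-\theta)^{m-t}$, which implies $h(x)\equiv 1$ and $h_\beta(t)=\binom{m}{t}$. From definition (5) we see that $g(t|D)=p(x|D)$, whose expression can be found in (2). For RKL-distance, Theorem 5 now yields $\text{Loss}_r^m(H,D)=\sum_{t=0}^m\binom{m}{t}\,g(t|D)\ln\frac{g(t|D)}{g(t|H)}$. For a point hypothesis $H=\theta$ this evaluates to a constant minus $E[t|D]\ln\theta+(m-E[t|D])\ln(1-\theta)$, where $E[t|D]=m\frac{n_1+1}{n+2}$, which is minimized for $\theta=\frac{n_1+1}{n+2}$. Therefore the best predictive point is $\hat\theta_r=\frac{n_1+1}{n+2}$ = Laplace rule, and by Theorem 6 the same point is selected when $\widetilde{\text{Loss}}$ is used instead of Loss.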
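The claim that Laplace's rule is the best predictive point under the reverse KL distance can be checked by brute force. The following sketch assumes a uniform prior and the reverse KL form $d(p,q)=q\ln(q/p)$ with $q$ the Bayesian prediction; it sums over the sufficient statistic $t$ (number of future 1s) and uses a simple grid search, so the names and the grid resolution are illustrative choices.

```python
from math import comb, factorial, log

def p_bayes(t, m, n1, n0):
    """p(x|D) for any future sequence with t ones out of m (uniform prior)."""
    n = n1 + n0
    return (factorial(t + n1) * factorial(m - t + n0) * factorial(n + 1)
            / (factorial(n + m + 1) * factorial(n1) * factorial(n0)))

def rkl_loss(theta, m, n1, n0):
    """Reverse KL predictive loss of the point hypothesis theta."""
    total = 0.0
    for t in range(m + 1):
        q = p_bayes(t, m, n1, n0)                  # Bayesian prediction
        p = theta ** t * (1 - theta) ** (m - t)    # prediction of the hypothesis
        total += comb(m, t) * q * log(q / p)       # comb(m, t) sequences share this t
    return total

def best_point(m, n1, n0, grid=2000):
    """Grid minimizer of the loss over theta in (0, 1)."""
    thetas = [k / grid for k in range(1, grid)]
    return min(thetas, key=lambda th: rkl_loss(th, m, n1, n0))

n1, n0, m = 3, 1, 5
print("grid minimizer:", best_point(m, n1, n0))
print("Laplace rule  :", (n1 + 1) / (n1 + n0 + 2))
```

The grid minimizer agrees with $\frac{n_1+1}{n+2}$ up to the grid spacing, and the same holds for other values of $m$.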
5 PHI for $\infty$-Batch
In this section we will study PHI for large $m$, or more precisely, the $m\gg n$ regime. No assumption is made on the data size $n$, i.e. the results are exact for any (small or large) $n$ in the limit $m\to\infty$. For simplicity and partly by necessity we assume that the $x_i$ are i.i.d. (lifting the “identical” is possible). Throughout this section we make the following assumptions.
Assumption 7
Let $x_1,\dots,x_{n+m}$ be independent and identically distributed, $\Omega\subseteq\mathbb{R}^d$, the likelihood density $p(x_i|\theta)$ twice continuously differentiable w.r.t. $\theta$, and let the boundary of $H$ have zero prior probability.
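We further define $x:=x_i$ (any $i$) and the partial derivative $\partial:=\partial/\partial\theta$. The (two representations of the) Fisher information matrix of $p(x|\theta)$
$$I_1(\theta)=E\big[(\partial\ln p(x|\theta))(\partial\ln p(x|\theta))^{\!\top}\,\big|\,\theta\big]=-\int\big(\partial\partial^{\top}\ln p(x|\theta)\big)\,p(x|\theta)\,dx$$
will play a crucial role in this Section. It also occurs in Jeffreys’ prior,
$$p_J(\theta):=\sqrt{\det I_1(\theta)}\,/\,J,\qquad J:=\int\sqrt{\det I_1(\theta)}\,d\theta$$
a popular reparametrization invariant (objective) reference prior (when it exists) [KW96]. We call the determinant (det) of $I_1(\theta)$ the Fisher information. $J$ can be interpreted as the intrinsic size of $\Omega$ [Grü07]. Although not essential to this work, it will be instructive to occasionally plug it into our expressions. As distance we choose the Hellinger distance.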
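Theorem 8 ($\text{Loss}_h^m(\theta,D)$ for large $m$)
Under Assumption 7, for point estimation, the predictive Hellinger loss for large $m$ is
$$\text{Loss}_h^m(\theta,D)=2-2\Big(\frac{8\pi}{m}\Big)^{d/4}\sqrt{\frac{p(\theta|D)}{\sqrt{\det I_1(\theta)}}}\,[1+o(1)]\ \overset{J}{=}\ 2-2\Big(\frac{8\pi}{m}\Big)^{d/4}\sqrt{\frac{p(D|\theta)}{J\,p(D)}}\,[1+o(1)]$$
where the first expression holds for any continuous prior density and the second expression ($\overset{J}{=}$) holds for Jeffreys’ prior.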
IMAP. The asymptotic expression shows that minimizing $\text{Loss}_h^m$ is equivalent to the following maximization
$$\theta^{\text{IMAP}}:=\arg\max_\theta\frac{p(\theta|D)}{\sqrt{\det I_1(\theta)}}$$
Without the denominator, this would just be MAP estimation. We have discussed that MAP is not reparametrization invariant, hence can be corrupted by a bad choice of parametrization. Since the square root of the Fisher information transforms like the posterior, their ratio is invariant. So PHI led us to a nice reparametrization invariant variation of MAP, immune to this problem. Invariance of the expressions in Theorem 8 is not a coincidence. It has to hold due to Proposition 4. For Jeffreys’ prior (second expression in Theorem 8), minimizing $\text{Loss}_h^m$ is equivalent to maximizing the likelihood, i.e. $\hat\theta=\theta^{\text{ML}}$. Remember that the expressions are exact even and especially for small samples $n$. No large $n$ approximation has been made. For small $n$, MAP, ML, and IMAP can lead to significantly different results. For Jeffreys’ prior, IMAP and ML coincide. This is a nice reconciliation of MAP and ML: An “improved” MAP leads for Jeffreys’ prior back to “simple” ML.
MDL. We can also relate PHI to MDL by taking the logarithm of the second expression in Theorem 8:
For $m=4n$ this is the classical (large $n$ approximation of) MDL [Grü07]. So presuming that (11) is a reasonable approximation of PHI even for $m=4n$, MDL approximately minimizes the predictive Hellinger loss iff used for $O(n)$ predictions. We will not expand on this, since the alluded relation to MDL stands on shaky grounds (for several reasons).
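Corollary 9 ($\hat\theta_h^\infty=\theta^{\text{IMAP}}$)
The predictive estimator $\hat\theta_h^\infty:=\lim_{m\to\infty}\hat\theta_h^m$ coincides with $\theta^{\text{IMAP}}$, a representation invariant variation of MAP. In the special case of Jeffreys’ prior, it also coincides with the maximum likelihood estimator $\theta^{\text{ML}}$.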
Theorem 10 ( for large )
Under Assumption 7, for composite , the predictive Hellinger loss for large is
where the first expression holds for any continuous prior density and the second expression ($\overset{J}{=}$) holds for Jeffreys’ prior.
MAP meets ML half way. The second expression in Theorem 10 is proportional to the geometric average of the posterior and the composite likelihood. For large $H$ the likelihood $p(D|H)$ gets small, since the average involves many wrong models. For small $H$, the posterior $P[H|D]$ is proportional to the volume of $H$ hence tends to zero. The product is maximal for some $H$ in-between:
The regions where the posterior density and where the (point) likelihood are large are quite similar, as long as the prior is not extreme. Let be this region. It typically has diameter . Increasing cannot significantly increase , but significantly decreases the likelihood, hence the product gets smaller. Vice versa, decreasing cannot significantly increase , but significantly decreases the posterior. The value at follows from . Together this shows that approximately maximizes the product of likelihood and posterior. So the best predictive has diameter , which is a very reasonable answer. It covers well but not excessively the high posterior and high likelihood regions (provided is sufficiently rich of course). By multiplying the likelihood or dividing the posterior with only the square root of the prior, they meet half way!
Bernoulli Example. A Bernoulli process with uniform prior and
has posterior variance. Hence any reasonable symmetric interval estimate of will have size . For PHI we get
where equality is a large approximation, and erf is the error function [AS74]. erf has a global maximum at within 1% precision. Hence PHI selects an interval of half-width .
If faced with a binary decision between point estimate $H_f$ and vacuous estimate $H_v$, comparing the losses in Theorems 8 and 10, we see that for large $m$, $H_v$ is selected, despite the posterior variance being close to zero for large $n$. In Section 2 we have explained that this makes sense from a predictive point of view.
Finding $\hat H$. Contrary to MAP and ML, an unrestricted maximization of (12) over all measurable $H\subseteq\Omega$ makes sense. The following result reduces the optimization problem to finding the level sets of the likelihood function and to a one-dimensional maximization problem.
Theorem 11 (Finding )
Let be the -level set of . If is continuous in , then
More precisely, every global maximum of (12) differs from the maximizer at most on a set of measure zero.
Using posterior level sets, i.e. shortest -credible sets/intervals instead of likelihood level sets would not work (an indirect proof is that they are not RI). For a general prior, level sets need to be considered. The continuity assumption on excludes likelihoods with plateaus, which is restrictive if considering non-analytic likelihoods. The assumption can be lifted by considering all in-between and . Exploiting the special form of (12) one can show that the maximum is attained for either or with obtained as in the theorem.
Large $n$. For large $n$ ($m\gg n\gg 1$), the likelihood usually tends to an (un-normalized) Gaussian with mean=mode $\hat\theta=\theta^{\text{ML}}$ and covariance matrix $[n\,I_1(\hat\theta)]^{-1}$. Therefore the level sets are ellipsoids
We know that the size of the maximizing ellipsoid scales with . For such tiny ellipsoids, (12) is asymptotically proportional to
where , and , and , and is the incomplete Gamma function [AS74], and we dropped all factors that are independent of . The expressions also holds for general prior in Theorem 8, since asymptotically the prior has no influence. They are maximized for the following :
i.e. for , unrestricted PHI selects ellipsoid of (linear) size .
So far we have considered $\text{Loss}$. Analogous asymptotic expressions can be derived for $\widetilde{\text{Loss}}$: While $\widetilde{\text{Loss}}$ differs from $\text{Loss}$, for point estimation their minima coincide. For composite $H$, the answer is qualitatively similar but differs quantitatively.
6 Large Sample Approximations
In this section we will study PHI for large sample sizes $n$, more precisely the $n\gg m$ regime. For simplicity we concentrate on the univariate case only. Data may be non-i.i.d.
Sequential moment fitting (SMF). A classical approximation of the posterior density $p(\theta|D)$ is by a Gaussian with same mean and variance. In case the class of available distributions is further restricted, it is still reasonable to approximate the posterior by the distribution whose mean and variance are closest to those of $p(\theta|D)$. There might be a tradeoff between taking a distribution with good mean (low bias) or one with good variance. Often low bias is of primary importance, and variance comes second. This suggests to first fit the mean, then the variance, and possibly continue with higher order moments.
PHI is concerned with predictive performance, not with density estimation, but of course they are related. Good density estimation in general and sequential moment fitting (SMF) in particular lead to good predictions, but the converse is not necessarily true. We will indeed see that PHI for $n\to\infty$ (under certain conditions) reduces to an SMF procedure.
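The SMF algorithm. In our case, the set of available distributions is given by $\{p(\theta|H):H\in\mathcal{H}\}$. For some event $A$, let
$$\mu_1^A:=E[\theta|A]=\int\theta\,p(\theta|A)\,d\theta\quad\text{and}\quad\mu_k^A:=E[(\theta-\mu_1^A)^k\,|\,A]\quad(k\ge 2)$$
be the mean and central moments of $p(\theta|A)$. The posterior moments $\mu_k^D$ are known and can in principle be computed. SMF sequentially “fits” $\mu_k^H$ to $\mu_k^D$: Starting with $\mathcal{H}_0:=\mathcal{H}$, let $\mathcal{H}_k\subseteq\mathcal{H}_{k-1}$ be the set of $H\in\mathcal{H}_{k-1}$ that minimize $|\mu_k^H-\mu_k^D|$:
$$\mathcal{H}_k:=\arg\min_{H\in\mathcal{H}_{k-1}}|\mu_k^H-\mu_k^D|$$
Let $k^*$ be the smallest $k$ for which there is no perfect fit anymore (or $\infty$ otherwise). Under some quite general conditions, in a certain sense, all and only the $H\in\mathcal{H}_{k^*}$ minimize $\text{Loss}_d^m(H,D)$ for large $n$.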
Theorem 12 (PHI for large by SMF)
For some , assume is times continuously differentiable w.r.t. at the posterior mean . Let and assume , , , and is a bounded function. Then
For the distances we have , for the square distance we have (see Section 3). For i.i.d. distributions with finite moments, the assumption is virtually nil. Normally, no has better loss order than , i.e. can be regarded as the set of all asymptotically optimal predictors. In many cases, contains only a single element. Note that does neither depend on , nor on the chosen distance , i.e. the best predictive hypothesis is essentially the same for all and if is large.
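In the Bernoulli Example in Section 2 we considered a binary decision between point estimate $H_f$ and vacuous estimate $H_v$, i.e. $\mathcal{H}=\{H_f,H_v\}$. For $n_0=n_1$ we have $\mu_1^{H_f}=\mu_1^{H_v}=\frac12=\mu_1^D$, i.e. both fit the first moment exactly, hence $\mathcal{H}_1=\mathcal{H}_0=\mathcal{H}$. For the second moments we have $\mu_2^D=\frac{1}{4(n+3)}$, but $\mu_2^{H_f}=0$ and $\mu_2^{H_v}=\frac{1}{12}$, hence for large $n$ the point estimate matches the posterior variance better, so $\mathcal{H}_2=\{H_f\}$, which makes sense.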
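The sequential fitting procedure is short enough to write down for a finite hypothesis class. The sketch below assumes each hypothesis is summarized by a list of its moments (mean first, then central moments) and uses exact fractions so that "perfect fit" can be tested by equality; the function `smf`, the dictionary layout, and the helper `bernoulli_posterior_moments` are illustrative constructions.

```python
from fractions import Fraction

def smf(hypotheses, posterior_moments):
    """Sequential moment fitting over a finite class.

    hypotheses: dict mapping a name to its moments [mu_1, mu_2, ...]
    posterior_moments: [mu_1^D, mu_2^D, ...]
    Returns (survivors, k_star); k_star is None if every listed moment fits exactly.
    """
    survivors = list(hypotheses)
    for k, target in enumerate(posterior_moments, start=1):
        gaps = {h: abs(hypotheses[h][k - 1] - target) for h in survivors}
        best = min(gaps.values())
        survivors = [h for h in survivors if gaps[h] == best]
        if best != 0:               # first moment that cannot be matched exactly
            return survivors, k
    return survivors, None

def bernoulli_posterior_moments(n):
    """Mean and variance of the Beta(n/2+1, n/2+1) posterior (uniform prior, n0 = n1)."""
    return [Fraction(1, 2), Fraction(1, 4 * (n + 3))]

# Point hypothesis "fair" (theta = 1/2) versus vacuous hypothesis (uniform on [0, 1]).
CLASS = {"fair": [Fraction(1, 2), Fraction(0)],
         "don't know": [Fraction(1, 2), Fraction(1, 12)]}

for n in (2, 4, 10, 100):
    print(n, smf(CLASS, bernoulli_posterior_moments(n)))
```

Both hypotheses match the mean, so the decision is made at the second moment; the point hypothesis survives as soon as $\frac{1}{4(n+3)}<\frac{1}{24}$, i.e. for $n>3$.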
For unrestricted (single) point estimation, i.e. $\mathcal{H}=\{\{\theta\}:\theta\in\Omega\}$, one can typically estimate the mean exactly but no higher moments. More generally, finite mixture models with $l$ components (degrees of freedom) can fit at most $l$ moments. For large $l$, the number of components that lie in a small neighborhood of some $\theta$ (i.e. the “density” of points in $H$ at $\theta$) will be proportional to the likelihood $p(D|\theta)$. Countably infinite and even more so continuous models if otherwise unrestricted are sufficient to get all moments right. If the parameter range is restricted, anything can happen ($k^*=\infty$ or $k^*<\infty$). For interval estimation $H=[a,b]$ and uniform prior, we have $\mu_1^H=\frac{a+b}{2}$ and $\mu_2^H=\frac{(b-a)^2}{12}$, hence the first two moments can be fitted exactly and the SMF algorithm yields the unique asymptotic solution $\hat H=[\mu_1^D-\sqrt{3\mu_2^D},\ \mu_1^D+\sqrt{3\mu_2^D}]$. In higher dimensions, common choices of $H$ are convex sets, ellipsoids, and hypercubes. For ellipsoids, the mean and covariance matrix can be fitted exactly and uniquely similarly to 1d interval estimation. While SMF can be continued beyond $k^*$, $\mathcal{H}_k$ typically does not contain $\hat H$ for $k>k^*$ anymore. The correct continuation beyond $k^*$ is either $\mathcal{H}_{k^*+1}$ or $\mathcal{H}_{k^*+2}$ (there is some criterion for the choice), but apart from exotic situations this does not improve the order of the loss, and usually $|\mathcal{H}_{k^*}|=1$ anyway.
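For interval hypotheses under a uniform prior, matching the first two moments amounts to equating the mean and variance of a uniform distribution on the interval with the posterior mean and variance. A minimal sketch for the Bernoulli model, assuming the Beta posterior that results from a uniform prior (the function name `smf_interval` is illustrative):

```python
import math

def smf_interval(n1, n0):
    """Interval whose uniform distribution matches posterior mean and variance.

    The posterior under a uniform prior is Beta(n1 + 1, n0 + 1). A uniform
    distribution on [lo, hi] has mean (lo + hi) / 2 and variance (hi - lo)^2 / 12,
    so the half-width is sqrt(3 * variance).
    """
    a, b = n1 + 1, n0 + 1
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    half_width = math.sqrt(3.0 * var)
    return mean - half_width, mean + half_width

for n1, n0 in ((5, 5), (50, 50), (70, 30)):
    lo, hi = smf_interval(n1, n0)
    print(f"n1={n1:3d} n0={n0:3d}  interval = [{lo:.4f}, {hi:.4f}]")
```

The half-width shrinks like $n^{-1/2}$. For very small or very skewed samples the matched interval can extend beyond $[0,1]$, in which case the restriction to the parameter range has to be handled separately.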
Exploiting Theorem 6, we see that SMF is also applicable for $\widetilde{\text{PHI}}$ with the square and RKL distances. Luckily, Offline PHI can also be reduced to 1-Batch PHI:
Proposition 13 (Offline = 1-Batch)
If $x_i$ are i.i.d., the Offline $\text{Loss}$ is proportional to the 1-Batch $\text{Loss}$: