As Albert Einstein almost said, “Scientific models should be as simple as possible, but no simpler”. Ockham’s razor is a common term for this principle of simplicity. Sober (2015) contains a wide-ranging review, including what Einstein actually said.
In Statistics and Machine Learning, Ockham’s razor leads us to favour less complex models where we can. But why this should be, and what we mean by ‘complex’, are subtle issues. In this paper, we describe one approach to implementing Ockham’s razor for model selection, based on a decomposition of the ‘evidence’ into a ‘fit’ term and a complexity term which we call ‘flexibility’. Our decomposition is exact for all models and all regularizers. Effectively, we complete the strand of research initiated by David MacKay (MacKay, 1992, 2003), and followed up by many others since then. Our analysis is simple enough to warrant inclusion in every practitioner’s toolbox.
Section 2 provides the background, including a discussion of the role of the evidence in model selection. Section 3 presents our exact decomposition of evidence into ‘fit plus flexibility’, and we justify flexibility as a measure of model complexity. Section 4 illustrates, using the Gaussian linear model for which simple closed-form expressions are available. The Bayesian Information Criterion (BIC) penalty is shown to be an approximation to flexibility in this case, but we caution against its use in model selection in general, and even specifically in those cases where the BIC penalty and flexibility are asymptotically equal. Section 5 concludes with a summary.
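Our starting-point is a set of observations $y^{\text{obs}}$. A statistical model is proposed,
$$\{\, f(y; \theta) : \theta \in \Omega \,\}.$$
The defining features of such a model are
$$f(y; \theta) \ge 0 \quad\text{and}\quad \int_{\mathcal{Y}} f(y; \theta)\, dy = 1 \quad\text{for all } \theta \in \Omega,$$
where the integral may be replaced by a sum if $\mathcal{Y}$ is countable. This model is augmented with a regularizer $R(\theta)$. The fitted value of the parameter is
$$\hat{\theta} := \arg\max_{\theta \in \Omega} \big\{ \log f(y^{\text{obs}}; \theta) - R(\theta) \big\}. \tag{3}$$
Thus the regularizer is functionally equivalent to a prior distribution
$$\pi(\theta) \propto \exp\{ -R(\theta) \}, \tag{4}$$
which we assume is proper, and in this case the fitted value is identical to the Maximum A Posteriori (MAP) estimate of $\theta$.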
As this outline makes clear, the estimated value is a function of both the statistical model and the regularizer. To be precise, we would write ‘model and regularizer’ everywhere below, but this would be tedious; therefore we will write ‘model’, but in every case where we write ‘model’ we mean ‘model and regularizer’.
Define the ‘evidence’ of this model as
$$E := \int_{\Omega} f(y^{\text{obs}}; \theta)\, \pi(\theta)\, d\theta.$$
$E$ is a function of the model and the observations, but in a Bayesian approach it has the additional interpretation of ‘marginal likelihood’: the probability density function of the observables evaluated at the observations $y^{\text{obs}}$. The normalizing constant in the prior distribution (4) is not required to compute $\hat{\theta}$, but it is required to compute the evidence directly. Friel and Wyse (2012) provide a review of methods for estimating the evidence, covering a literature which stretches back thirty years.
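As a minimal sketch of what computing the evidence directly involves, consider a hypothetical conjugate toy model that is not taken from the text above: IID observations from N(θ, 1) with the quadratic regularizer R(θ) = λθ²/2, so that the implied proper prior is N(0, 1/λ). The data values, the value of λ, and all function names are illustrative assumptions. The integral of likelihood times prior is approximated with the trapezoidal rule and compared with the closed form available for this particular model.

```python
import math

def log_lik(theta, y):
    """Log-likelihood of IID N(theta, 1) observations."""
    n = len(y)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((yi - theta) ** 2 for yi in y)

def log_prior(theta, lam):
    """Proper prior implied by R(theta) = lam * theta**2 / 2, i.e. N(0, 1/lam)."""
    return 0.5 * math.log(lam / (2 * math.pi)) - 0.5 * lam * theta ** 2

def evidence_quadrature(y, lam, half_width=12.0, steps=24000):
    """Trapezoidal approximation to the integral of likelihood times prior."""
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        theta = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.exp(log_lik(theta, y) + log_prior(theta, lam))
    return total * h

def evidence_exact(y, lam):
    """Closed-form marginal likelihood for this conjugate toy model."""
    n, s, ss = len(y), sum(y), sum(yi ** 2 for yi in y)
    log_e = (-0.5 * n * math.log(2 * math.pi) + 0.5 * math.log(lam / (lam + n))
             - 0.5 * ss + 0.5 * s ** 2 / (lam + n))
    return math.exp(log_e)

y_obs = [0.8, 1.3, -0.2, 1.9, 0.6]   # hypothetical observations
lam = 0.5                            # hypothetical regularization strength
E_num = evidence_quadrature(y_obs, lam)
E_exact = evidence_exact(y_obs, lam)
print(E_num, E_exact)
```

Note that `log_prior` includes the normalizing constant of the prior: dropping it would leave the fitted value unchanged but would change the evidence. Brute-force quadrature of this kind is only practical in very low dimensions, which is why the estimation methods reviewed by Friel and Wyse (2012) are needed in general.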
Suppose, perhaps for pragmatic reasons, that we would like to select a single model from a set of models under consideration, indexed by $m \in \mathcal{M}$. For example, $\mathcal{M}$ might represent a set of Gaussian linear models with different model matrices (see Section 4). The basic claim is that maximizing the evidence is a good way to select such a model. That is, if $E_m$ is the evidence of model $m$, then
$$\hat{m} := \arg\max_{m \in \mathcal{M}} E_m$$
is the best single model in $\mathcal{M}$. This claim has two justifications.
The first justification is Bayesian. The Bayesian approach to inference with multiple models has a long and rich literature; see, e.g., Bernardo and Smith (1994, ch. 6), O’Hagan and Forster (2004, ch. 7), Robert (2007, ch. 7), or Gelman et al. (2014, ch. 7) for the theory, and Hastie and Green (2012) for computation. Here we narrate only a ‘bare bones’ approach. Suppose that we equip the models in $\mathcal{M}$ with prior probabilities, $w_m$; then the posterior probabilities are proportional to $w_m E_m$. If the evidence is highly concentrated relative to the prior probabilities, then $\hat{m}$ is very likely to be the MAP model (certain, if $w_m \propto 1$, but this flat prior on models can be problematic). If we can use only one model, then the MAP model is a reasonable candidate. This leads us to select model $\hat{m}$.
Clearly, this Bayesian justification is highly contingent: on our willingness to provide prior probabilities for the models under consideration, on the concentration of the evidence relative to the prior probabilities, and on the suitability of the MAP model as the best single choice. Concerning this last point, it is easy to imagine a situation where the MAP model is isolated in ‘model-space’, but where there is a cluster of models elsewhere in model-space, for which the cluster collectively has higher posterior probability than the MAP model. A model from the centre of the cluster might well be preferred to the MAP model. All in all, the Bayesian argument for selecting the evidence-maximizing model as the single ‘best’ model is suggestive but not compelling.
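The ‘bare bones’ Bayesian argument can be sketched numerically. The log-evidences and prior probabilities below are hypothetical values chosen for illustration, and the function name is an assumption. The sketch shows that the evidence-maximizing model is the MAP model under a flat prior on models, but need not be when the evidence is not concentrated relative to the prior probabilities.

```python
import math

def posterior_model_probs(log_evidence, prior_prob):
    """Posterior model probabilities, proportional to prior probability times evidence."""
    log_w = [math.log(p) + le for p, le in zip(prior_prob, log_evidence)]
    shift = max(log_w)                      # subtract the maximum before exponentiating
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

log_E = [-104.2, -101.7, -103.0]            # hypothetical log-evidences of three models
flat = [1 / 3, 1 / 3, 1 / 3]                # flat prior on models
skewed = [0.90, 0.05, 0.05]                 # prior strongly favouring the first model

post_flat = posterior_model_probs(log_E, flat)
post_skewed = posterior_model_probs(log_E, skewed)
map_flat = max(range(3), key=post_flat.__getitem__)
map_skewed = max(range(3), key=post_skewed.__getitem__)
print(post_flat, map_flat)
print(post_skewed, map_skewed)
```

Working on the log scale and subtracting the maximum avoids underflow, since evidences are typically far too small to exponentiate directly.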
The second justification is Frequentist, because it claims that maximizing the evidence is a model-selection algorithm with good properties. It is universally recognized that selecting entirely on the basis of fit is detrimental to good out-of-sample predictive performance, and that good selection algorithms ought to penalize fit with some measure of ‘model complexity’, as summarized by the following schematic:
$$\text{selection criterion} = \text{fit} - \text{complexity penalty}.$$
The first author to derive an explicit simple complexity penalty was Akaike (1973), and there have been many proposals since: this is still an active field of research in Statistics and Machine Learning (see, e.g., Gelman et al., 2014, ch. 7).
David MacKay (MacKay, 1992, 2003) argued that the evidence itself contains a complexity penalty. He termed this penalty the Ockham factor (although he spelled it ‘Occam’; we are following the spelling in Sober, 2015). MacKay’s argument had two strands. First, there was ‘proof by picture’, a compelling illustration that the evidence will sometimes select less complex models over more complex models, shown here as Figure 1.
But the second strand is less compelling. MacKay applied a first-order Laplace approximation to the posterior distribution, and showed that, under this approximation,
$$\log E \approx \log f(y^{\text{obs}}; \hat{\theta}) - \text{penalty},$$
where ‘penalty’ has an explicit form in terms of the model and the observations, but its form is not important to our argument. We know, as a matter of logic, that ‘penalty’ must behave something like a model complexity penalty. This follows because the first term on the righthand side will tend to be larger for more complex models, and hence if the evidence can sometimes be smaller for more complex models, as Figure 1 demonstrates, then ‘penalty’ must sometimes be larger for more complex models.
However, the first-order Laplace approximation to the posterior distribution is dubious. It was dubious in 1992, as evidenced by the fact that statisticians at that time were working hard to develop MCMC methods, which would hardly have been necessary had the Laplace approximation been effective (see, e.g., Andrieu et al., 2003, for some history). It is even more dubious today, in our era of massively over-parameterized models and exotic regularizers. Therefore it is gratifying that we can show that there is an exact equality between evidence, fit, and a complexity penalty, which holds in complete generality, and where the penalty has a simple and intuitive form.
Our approach uses a simple but insightful result that is simply a reformulation of Bayes’s Theorem:
$$E = \frac{f(y^{\text{obs}}; \theta)\, \pi(\theta)}{\pi^*(\theta)} \tag{8}$$
for all $\theta$ for which $\pi^*(\theta) > 0$, where $\pi^*$ is the posterior distribution. Chib (1995, p. 1314) refers to this equality as the basic marginal likelihood identity (BMI); it also goes by the name Candidate’s formula, after Besag (1989).
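The identity can be checked numerically in any model where the posterior distribution is available in closed form. The sketch below assumes a hypothetical conjugate toy model, with IID N(θ, 1) observations and a N(0, 1/λ) prior, for which the posterior is N(s/(λ+n), 1/(λ+n)) with s the sum of the observations. The data, λ, and function names are illustrative assumptions. The point is that likelihood times prior divided by posterior gives the same value at every θ.

```python
import math

def log_norm_pdf(x, mean, precision):
    """Log density of a Gaussian with the given mean and precision."""
    return 0.5 * math.log(precision / (2 * math.pi)) - 0.5 * precision * (x - mean) ** 2

def log_evidence_via_identity(theta, y, lam):
    """log likelihood + log prior - log posterior, all evaluated at theta."""
    n, s = len(y), sum(y)
    log_lik = sum(log_norm_pdf(yi, theta, 1.0) for yi in y)
    log_prior = log_norm_pdf(theta, 0.0, lam)
    log_post = log_norm_pdf(theta, s / (lam + n), lam + n)
    return log_lik + log_prior - log_post

y_obs = [0.8, 1.3, -0.2, 1.9, 0.6]   # hypothetical observations
values = [log_evidence_via_identity(t, y_obs, 0.5) for t in (-1.0, 0.0, 0.8, 2.5)]
print(values)
```

All four entries of `values` agree up to floating-point rounding, even though each individual term varies strongly with θ.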
If we set $\theta = \hat{\theta}$ from (3), then we immediately deduce
$$\log E = \log f(y^{\text{obs}}; \hat{\theta}) - \log \frac{\pi^*(\hat{\theta})}{\pi(\hat{\theta})}$$
from (8). By our argument at the end of the previous section, the second term on the righthand side must behave something like a model complexity penalty. To identify it explicitly, we give it the name
$$\text{flexibility} := \log \frac{\pi^*(\hat{\theta})}{\pi(\hat{\theta})}.$$
We have, under this definition,
$$\log E = \text{fit} - \text{flexibility},$$
an exact decomposition of the evidence, which holds for all models. This is the unique decomposition for which the ‘fit’ term is $\log f(y^{\text{obs}}; \hat{\theta})$. A different estimate for $\theta$, such as the Generalized Method of Moments (GMM) estimate, would give a different penalty term in the decomposition of the evidence. However, given that the regularizer is functionally equivalent to a prior distribution, the penalized likelihood (or MAP) estimate $\hat{\theta}$ seems the most natural value to use for $\theta$.
We contend that flexibility is a reasonable way to measure model complexity. A model will be ‘flexible’ if it contains a large number of degrees of freedom (e.g. represented by $p$, which might be the number of basis functions), and if its coefficients are unconstrained in the regularizer (or prior distribution). A model will be ‘inflexible’ either if it contains few degrees of freedom, or if its coefficients are constrained, or both. A flexible model will often be able to concentrate its posterior probability into a small region of the parameter space, relative to the prior probability, and hence its flexibility penalty will be high. An inflexible model will often struggle to move its posterior probability away from its prior probability, and hence its flexibility penalty will be low. Complex models will typically be flexible, in this sense, and simple models will be inflexible.
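The decomposition and the behaviour of flexibility can be illustrated with a small sketch. It assumes a hypothetical conjugate toy model: IID N(θ, 1) observations with regularizer R(θ) = λθ²/2, so that the prior is N(0, 1/λ), the posterior is N(s/(λ+n), 1/(λ+n)), and the fitted value is the posterior mode. The data, the two values of λ, and the function name are illustrative assumptions. The sketch checks that fit minus flexibility reproduces the log-evidence, and compares a loosely constrained model with a tightly constrained one.

```python
import math

def decompose(y, lam):
    """Return (fit, flexibility, log_evidence) for the conjugate toy model."""
    n, s, ss = len(y), sum(y), sum(yi ** 2 for yi in y)
    theta_hat = s / (lam + n)                     # MAP / penalized-likelihood estimate
    fit = (-0.5 * n * math.log(2 * math.pi)
           - 0.5 * sum((yi - theta_hat) ** 2 for yi in y))
    # flexibility = log posterior density minus log prior density, both at theta_hat
    log_post = 0.5 * math.log((lam + n) / (2 * math.pi))
    log_prior = 0.5 * math.log(lam / (2 * math.pi)) - 0.5 * lam * theta_hat ** 2
    flexibility = log_post - log_prior
    # closed-form log marginal likelihood, computed independently of the two terms above
    log_evidence = (-0.5 * n * math.log(2 * math.pi) + 0.5 * math.log(lam / (lam + n))
                    - 0.5 * ss + 0.5 * s ** 2 / (lam + n))
    return fit, flexibility, log_evidence

y_obs = [0.8, 1.3, -0.2, 1.9, 0.6]                       # hypothetical observations
fit_loose, flex_loose, logE_loose = decompose(y_obs, 0.01)    # weakly constrained
fit_tight, flex_tight, logE_tight = decompose(y_obs, 100.0)   # strongly constrained
print(fit_loose, flex_loose, logE_loose)
print(fit_tight, flex_tight, logE_tight)
```

In this example the weakly constrained model has the better fit but also the larger flexibility, which is the trade-off that the evidence resolves.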
Defining model complexity as flexibility unifies the Bayesian and Frequentist justifications for selecting a single model by maximizing the evidence. In other words, the MAP model selected using a Bayesian approach with a flat (or flattish) prior on models is the same as the model selected using a Frequentist approach in which the fit is penalized using flexibility as the complexity penalty. Because the evidence already contains a complexity penalty, it would be a strange decision to add a complexity penalty to the evidence for the purposes of model selection, unless it was quite clear that the flexibility penalty was deficient in some way.
The next Section considers the case where the flexibility has a closed-form expression, and a simple asymptotic approximation.
4 Illustration: Gaussian linear model
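Consider the Gaussian linear model with $n \times p$ model-matrix $X$ and observation error variance $\sigma^2$. Using a quadratic regularizer parameterized by $\lambda$,
$$R(\beta) = \frac{\lambda}{2}\, \beta^{\top} \beta, \tag{11a}$$
the prior and posterior distributions of the coefficients $\beta$ are both Gaussian, with
$$Q = \lambda I \quad\text{and}\quad Q^* = Q + \sigma^{-2} X^{\top} X,$$
where $Q$ is the prior precision, and $Q^*$ is the posterior precision; these are both functions of $\lambda$, which we treat as known (but see below). Hence
$$\text{flexibility} = \frac{1}{2} \log \frac{|Q^*|}{|Q|} + \frac{1}{2}\, \hat{\beta}^{\top} Q\, \hat{\beta}. \tag{12}$$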
This is the exact result. In the limit as $\lambda \to \infty$, $|Q^*| / |Q| \to 1$ and $\hat{\beta} \to 0$; thus flexibility $\to 0$, confirming that, asymptotically, flexibility decreases to zero as the penalty on the quadratic regularizer increases, although $\lambda$ might have to be huge to overwhelm the $n$ term in $Q^*$.
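A minimal numerical sketch of this closed form follows, under stated assumptions: the regularizer is taken to be $R(\beta) = \lambda \beta^{\top}\beta / 2$, so that the prior precision is $Q = \lambda I$, the posterior precision is $Q^* = Q + \sigma^{-2} X^{\top} X$, the fitted value is $\hat{\beta} = (Q^*)^{-1} \sigma^{-2} X^{\top} y$, and flexibility is $\tfrac{1}{2} \log(|Q^*| / |Q|) + \tfrac{1}{2} \hat{\beta}^{\top} Q \hat{\beta}$. This parameterization, the simulated data with $p = 2$ covariates, and the function names are assumptions made for illustration. Explicit 2×2 formulas are used so that only the standard library is needed.

```python
import math
import random

def simulate(n, seed=1):
    """Hypothetical data: two standard-normal covariates and unit error variance."""
    rng = random.Random(seed)
    X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    y = [1.0 * x1 - 0.5 * x2 + rng.gauss(0, 1) for x1, x2 in X]
    return X, y

def flexibility(X, y, sigma2, lam):
    """0.5 * log(|Q*| / |Q|) + 0.5 * beta_hat' Q beta_hat, with Q = lam * I and p = 2."""
    a11 = sum(x1 * x1 for x1, _ in X) / sigma2
    a12 = sum(x1 * x2 for x1, x2 in X) / sigma2
    a22 = sum(x2 * x2 for _, x2 in X) / sigma2
    b1 = sum(x1 * yi for (x1, _), yi in zip(X, y)) / sigma2
    b2 = sum(x2 * yi for (_, x2), yi in zip(X, y)) / sigma2
    q11, q12, q22 = a11 + lam, a12, a22 + lam        # entries of Q*
    det_post = q11 * q22 - q12 * q12                 # |Q*|
    beta1 = (q22 * b1 - q12 * b2) / det_post         # beta_hat solves Q* beta = X'y / sigma2
    beta2 = (q11 * b2 - q12 * b1) / det_post
    return 0.5 * math.log(det_post / (lam * lam)) + 0.5 * lam * (beta1 ** 2 + beta2 ** 2)

X, y = simulate(50)
flex_by_lam = {lam: flexibility(X, y, 1.0, lam) for lam in (1e-2, 1.0, 1e2, 1e6, 1e12)}
for lam, value in flex_by_lam.items():
    print(lam, value)
```

With $n = 50$, the flexibility only becomes negligible once $\lambda$ is many orders of magnitude larger than $n$, in line with the remark above about overwhelming the $n$ term in $Q^*$.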
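Now consider the effect of $n$ when $\lambda$ is fixed. Under IID sampling,
$$X^{\top} X \approx n V \quad\text{and}\quad \hat{\beta} \approx \beta_0,$$
where $V$ and $\beta_0$ are both constants. Substituting into (12) and rearranging,
$$\text{flexibility} - \frac{p}{2} \log n \approx \frac{1}{2} \left\{ \log \frac{|\sigma^{-2} V + n^{-1} Q|}{|Q|} + \beta_0^{\top} Q\, \beta_0 \right\}.$$
The second term on the lefthand side is the Bayesian Information Criterion (BIC) penalty, and the term on the righthand side is $O(1)$. Thus, for the Gaussian linear model and IID sampling,
$$\text{flexibility} = \text{BIC penalty} + O(1).$$
For sufficiently large $n$, flexibility $\approx$ BIC penalty does seem to be justifiable, for choosing between models with different model matrices, and thus different $p$’s. This large-$n$ calculation is also applicable when the model gives rise to an approximately Gaussian posterior distribution.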
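The large-sample relationship can be examined by simulation. The sketch below assumes the same illustrative setup as a ridge-type regularizer with prior precision $Q = \lambda I$, two IID standard-normal covariates ($p = 2$), $\sigma^2 = 1$ and $\lambda = 1$; it also takes the BIC penalty on the log-evidence scale to be $(p/2) \log n$. All of these choices, and the function name, are assumptions made for illustration. The gap between flexibility and the BIC penalty should stay bounded as $n$ grows, while flexibility itself grows like $\log n$.

```python
import math
import random

def flexibility_p2(n, lam=1.0, sigma2=1.0, seed=0):
    """Flexibility for simulated data with two covariates and prior precision lam * I."""
    rng = random.Random(seed)
    a11 = a12 = a22 = b1 = b2 = 0.0
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        yi = 1.0 * x1 - 0.5 * x2 + rng.gauss(0, math.sqrt(sigma2))
        a11 += x1 * x1 / sigma2
        a12 += x1 * x2 / sigma2
        a22 += x2 * x2 / sigma2
        b1 += x1 * yi / sigma2
        b2 += x2 * yi / sigma2
    q11, q12, q22 = a11 + lam, a12, a22 + lam        # posterior precision Q*
    det_post = q11 * q22 - q12 * q12
    beta1 = (q22 * b1 - q12 * b2) / det_post         # MAP estimate of the coefficients
    beta2 = (q11 * b2 - q12 * b1) / det_post
    return 0.5 * math.log(det_post / (lam * lam)) + 0.5 * lam * (beta1 ** 2 + beta2 ** 2)

p = 2
sizes = (100, 1000, 10000)
flex = {n: flexibility_p2(n) for n in sizes}
gaps = {n: flex[n] - 0.5 * p * math.log(n) for n in sizes}   # flexibility minus BIC penalty
for n in sizes:
    print(n, flex[n], gaps[n])
```

The gap does not vanish as $n$ increases; it settles near a constant that depends on $\lambda$ and the coefficients, which is the $O(1)$ term.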
However, we advise caution. First, as already noted, we cannot presume an approximately Gaussian posterior distribution in modern practice, and therefore ‘flexibility $\approx$ BIC penalty’ is a poor generic approximation. Second, it is not safe to drop $O(1)$ terms in model comparison (Gelfand and Dey, 1994), as we will discuss further below. Put simply, if flexibility is the right way to penalize model complexity, then the correct approach is to estimate the evidence directly (see, e.g., Friel and Wyse, 2012), rather than to approximate it by computing the fit and replacing the flexibility with the BIC penalty, or some other simple approximation.
To elucidate, we consider submodels within a single model: everything in the previous two sections also applies within a single model, on partitioning the model’s parameters. The Gaussian linear model has coefficients, denoted $\beta$, but it also has two additional parameters $\sigma^2$ and $\lambda$. So define the evidence function
$$E(\sigma^2, \lambda) := \int f(y^{\text{obs}}; \beta, \sigma^2)\, \pi(\beta; \lambda)\, d\beta.$$
Each tuple $(\sigma^2, \lambda)$ defines a submodel, and the evidence function decomposes as a fit term plus a flexibility term for each submodel:
$$\log E(\sigma^2, \lambda) = \log f(y^{\text{obs}}; \hat{\beta}, \sigma^2) - \text{flexibility}(\sigma^2, \lambda),$$
remembering that $\hat{\beta}$ is itself a function of $(\sigma^2, \lambda)$. However, the BIC penalty is invariant to the value of $\lambda$, and so if flexibility is approximated by the BIC penalty then the only effect on the evidence of changing $\lambda$ is indirectly through its effect on $\hat{\beta}$ in the fit term (this effect is itself indirect). But $\lambda$ controls the effective number of parameters, and to suppress its effect in this way seems undesirable.
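The contrast can be made concrete with a one-covariate sketch ($p = 1$). It assumes a ridge-type regularizer $R(\beta) = \lambda \beta^2 / 2$, a known $\sigma^2 = 1$, simulated data, and a BIC penalty of $(p/2) \log n$ on the log-evidence scale; these choices and all names are illustrative assumptions. For each $\lambda$ on a grid, the exact log-evidence (fit minus flexibility) is compared with the approximation in which flexibility is replaced by the BIC penalty.

```python
import math
import random

rng = random.Random(2)
n, sigma2 = 200, 1.0
x = [rng.gauss(0, 1) for _ in range(n)]                 # hypothetical covariate
y = [0.3 * xi + rng.gauss(0, 1) for xi in x]            # hypothetical response
a = sum(xi * xi for xi in x) / sigma2                   # X'X / sigma^2 (a scalar here)
b = sum(xi * yi for xi, yi in zip(x, y)) / sigma2       # X'y / sigma^2

def fit(lam):
    """Log-likelihood evaluated at the MAP estimate beta_hat = b / (lam + a)."""
    beta_hat = b / (lam + a)
    rss = sum((yi - xi * beta_hat) ** 2 for xi, yi in zip(x, y))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - 0.5 * rss / sigma2

def flexibility(lam):
    """Log posterior density minus log prior density, both at beta_hat."""
    beta_hat = b / (lam + a)
    return 0.5 * math.log((lam + a) / lam) + 0.5 * lam * beta_hat ** 2

bic_penalty = 0.5 * 1 * math.log(n)                     # (p / 2) * log(n), the same for every lam
grid = (1e-3, 1e-1, 1e1, 1e3)
exact = {lam: fit(lam) - flexibility(lam) for lam in grid}
approx = {lam: fit(lam) - bic_penalty for lam in grid}
for lam in grid:
    print(lam, exact[lam], approx[lam], flexibility(lam))
```

Because the approximation responds to $\lambda$ only through the fit, it always prefers the weakest regularization on the grid, whereas the exact log-evidence also accounts for the change in flexibility.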
Put more generally, the $O(1)$ term in the flexibility adjusts for the difference between the nominal and the effective number of parameters. Therefore, approximating flexibility with the BIC penalty misses this crucial feature of modern statistical models (Spiegelhalter et al., 2002).
We illustrate the discrepancy between flexibility and the BIC penalty using the ‘donkeys’ dataset from Milner and Rougier (2014). The Gaussian linear model is
where gender is a factor with levels stallion, gelding, and female, represented by two dummy variables in the model matrix. The response is centred, and the columns of the model matrix are centred and scaled, as is sensible for a regularizer of the form given in (11a).
This application has and , and is an excellent candidate for a large- approximation. Figure 2 shows the flexibility over a range of moderate values for . Over this range, the difference between the effective and the nominal number of parameters, crudely assessed, varies from to , which is sizable bearing in mind that take moderate values. This difference would be much larger if were pushed to less moderate values, which could easily happen in a numerical optimization.
The evidence of a model is a well-defined quantity, where by ‘model’ we mean a statistical model plus a regularizer for the parameters or, equivalently, plus a prior distribution for the parameters. The evidence can be challenging to evaluate, but this topic has been well-studied, and there are now many options, covering a wide range of models. But what is the evidence for?
A Bayesian argument indicates that the evidence can be used to select a single model according to posterior probability. A Frequentist argument indicates that the evidence has the form of ‘fit minus complexity penalty’, which makes the evidence an attractive optimand for an algorithm to select a single model. We have taken the Frequentist argument to its conclusion, identifying the unique decomposition of evidence into fit minus a term which we label ‘flexibility’. We argue that flexibility behaves like a complexity penalty, although any such argument has to be heuristic, given that model complexity is such an amorphous concept.
We show that in the Gaussian linear model, and by extension in models with Gaussian posterior distributions, the flexibility term equals the Bayesian Information Criterion (BIC) penalty plus an $O(1)$ term. But we caution against using the approximation ‘flexibility $\approx$ BIC penalty’ for two reasons. First, Gaussian posterior distributions are not a reliable feature of modern practice in Statistics and Machine Learning. Second, the missing $O(1)$ term plays an important role in practice, capturing the difference between the effective and the nominal number of parameters.
Adopting flexibility as the definition of model complexity unifies the Bayesian and Frequentist justifications for selecting a single model on the basis of the evidence. If flexibility is the right way to quantify and penalize model complexity, then we strongly recommend estimating the evidence directly, rather than using ‘evidence = fit minus flexibility’ and then replacing flexibility with a simpler term such as the BIC penalty.
- Akaike (1973) H. Akaike. Information theory and an extension of the maximum likelihood principle. In B. N. Petrov and F. Csaki, editors, Proceedings of the Second International Symposium on Information Theory, pages 267–281. Akademiai Kiado, Budapest, 1973.
- Andrieu et al. (2003) C. Andrieu, N. De Freitas, A. Doucet, and M. I. Jordan. An introduction to MCMC for Machine Learning. Machine Learning, 50:5–43, 2003.
- Bernardo and Smith (1994) J. M. Bernardo and A. F. M. Smith. Bayesian Theory. Chichester, UK: Wiley, 1994.
- Besag (1989) J. Besag. A candidate’s formula; a curious result in Bayesian prediction. Biometrika, 76(1):183, 1989.
- Chib (1995) S. Chib. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90:1313–1321, 1995.
- Friel and Wyse (2012) N. Friel and J. Wyse. Estimating the evidence – a review. Statistica Neerlandica, 66(3):288–308, 2012.
- Gelfand and Dey (1994) A. E. Gelfand and D. Dey. Bayesian model choice: Asymptotic and exact calculations. Journal of the Royal Statistical Society, Series B, 56:501–514, 1994.
- Gelman et al. (2014) A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, Boca Raton, Florida, 3rd edition, 2014. Online resources at http://www.stat.columbia.edu/~gelman/book/.
- Hastie and Green (2012) D. Hastie and P. J. Green. Model choice using reversible jump Markov chain Monte Carlo. Statistica Neerlandica, 66:309–338, 2012.
- MacKay (1992) D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4:415–447, 1992.
- MacKay (2003) D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge, 2003. Available online, http://www.inference.org.uk/mackay/itila/.
- Milner and Rougier (2014) K. Milner and J.C. Rougier. How to weigh a donkey in the Kenyan countryside. Significance, 11(4):40–43, 2014.
- O’Hagan and Forster (2004) A. O’Hagan and J. Forster. Bayesian Inference, volume 2b of Kendall’s Advanced Theory of Statistics. Edward Arnold, London, 2nd edition, 2004.
- Robert (2007) C. P. Robert. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer, New York, 2007.
- Sober (2015) E. Sober. Ockham’s Razors: A User’s Manual. Cambridge University Press, Cambridge, 2015.
- Spiegelhalter et al. (2002) D. J. Spiegelhalter, N. G. Best, B. P. Carlin, and A. van der Linde. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B, 64(4):583–616, 2002. With discussion, pp. 616–639.