# Bayesian Evidence and Model Selection

In this paper we review the concepts of Bayesian evidence and Bayes factors, also known as log odds ratios, and their application to model selection. The theory is presented along with a discussion of analytic, approximate and numerical techniques. Specific attention is paid to the Laplace approximation, variational Bayes, importance sampling, thermodynamic integration, and nested sampling and its recent variants. Analogies to statistical physics, from which many of these techniques originate, are discussed in order to provide readers with deeper insights that may lead to new techniques. The utility of Bayesian model testing in the domain sciences is demonstrated by presenting four specific practical examples considered within the context of signal processing in the areas of signal detection, sensor characterization, scientific model selection and molecular force characterization.

## Authors

• 11 publications
• 3 publications
• 1 publication
• 1 publication
• 2 publications
• ### Bayesian inference-driven model parameterization and model selection for 2CLJQ fluid models

A high level of physical detail in a molecular model improves its abilit...
05/14/2021 ∙ by Owen C. Madin, et al. ∙ 0

• ### Accurate Computation of Marginal Data Densities Using Variational Bayes

Bayesian model selection and model averaging rely on estimates of margin...
05/25/2018 ∙ by Gholamreza Hajargasht, et al. ∙ 0

• ### Quantifying the weight of fingerprint evidence using an ROC-based Approximate Bayesian Computation algorithm

The Bayes factor has been advocated to quantify the weight of forensic e...
03/27/2018 ∙ by J. H. Hendricks, et al. ∙ 0

• ### Reconciling the Bayes Factor and Likelihood Ratio for Two Non-Nested Model Selection Problems

In statistics, there are a variety of methods for performing model selec...
01/28/2019 ∙ by Danica M. Ommen, et al. ∙ 0

• ### Simulating normalising constants with referenced thermodynamic integration: application to COVID-19 model selection

Model selection is a fundamental part of Bayesian statistical inference;...
09/08/2020 ∙ by Iwona Hawryluk, et al. ∙ 19

• ### Model Selection Techniques -- An Overview

In the era of big data, analysts usually explore various statistical mod...
10/22/2018 ∙ by Jie Ding, et al. ∙ 0

• ### Occam Factor for Gaussian Models With Unknown Variance Structure

We discuss model selection to determine whether the variance-covariance ...
05/04/2021 ∙ by Zachary M. Pisano, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The application of model-based reasoning techniques employing Bayesian probability theory has recently found wide use in signal processing, and in the physical sciences in general

Jaynes:Book Gregory:2005 Candy:2009 vonToussaint:2011 Gelman+etal:2013 vonderLinden:2014 . In such an approach, it is critical to be able to statistically compare the probability of one model to another. This is performed by computing the Bayesian evidence of the two models and comparing them by forming a ratio, which is often referred to as a Bayes factor or the odds ratio.

In this paper, we present an overview of the theory behind Bayesian evidence, discuss various methods of computation, and demonstrate the application in four practical examples of current interest more closely related to signal processing. We do not aim to cover all of the techniques and applications, as there exists a great number of excellent treatments spanning several decades Jeffreys:1961 DeGroot:2004 Kass+Raftery:1995 Jaynes:Book Zellner+Siow:1980 Sivia+etal:1993 Andrieu+etal:1998 Fitzgerald:2001 Gregory:2005 Bernardo+Smith:2009 Gelman+etal:2013 MacKay:1995 MacKay:2003 Sivia&Skilling vonToussaint:2011 as well as a wide variety of applications spread across a great number of fields, such as acoustics Xiang+Goggans:2003 Jasa+Xiang:2012 Goggans+Henderson+Xiang:2013 , astronomy, astrophysics and cosmology Loredo+Lamb:2002 Liddle+etal:2006 Clark+etal:2007 Heavens:2007 Liddle:2007 Jullo+etal:2007 Trotta:2008 Liddle:2009 Feroz+etal:2011 Feroz:2013 Debono:2014 , chemistry Sivia+Carlile:1992

, computer science and machine learning

Beal+Ghahramani:2003 Marshall+etal:2006 Mackay:1992 Wolpert:1993 Penny+Roberts:1999 , neuroscience Lewicki:1994 Trujillo+etal:2004 Woolrich+etal:2009 Friston+Penny:2011 , nuclear and particle physics GulamRazul+Fitzgerald+Andrieu:2003 DeCruz+etal:2011 Bergstrom:2012 , signal processing Nallanathan+Fitzgerald:1994 Roberts:1998 Kannan+etal:2000 Roberts+Everson:2001 Andrieu+Djuric+Doucet:2001 Penny+Roberts:2002 Punskaya+etal:2002 systems engineering Beck+Yuen:2004 Simon+Weber+Levrat:2007 Chulani+etal:1999 , and statistics in general Madigan+Raftery:1994 Hoeting+etal:1999 .

## 2 Probability

Logical statements can imply other logical statements. Probability theory Berger:1985 Box&Tiao:1992 Jaynes:Book MacKay:2003 Gregory:2005 Sivia&Skilling Robert:2007 Bernardo+Smith:2009 Gelman+etal:2013 vonderLinden:2014 allows one to generalize the concept of implication by providing a measure of the degree of implication among logical statements Cox:1946 Cox:1961 . More specifically, probability is a scalar measure that quantifies, within a topic of discourse, the degree to which one logical statement, representing a state of knowledge, implies another Knuth:measuring Knuth&Skilling:2012 .111This is a relatively new interpretation of probability that has significant advantages over older concepts such as the frequency of occurrences of events, the degree of truth or the degree of belief. As a scalar measure, probability enables one to rank logical statements with respect to a given context or premise.

The utility of probability theory becomes apparent when one considers the degree to which a statement considering a set of several hypotheses or models, , implies a joint statement proposing a particular model in conjunction with additional information or data, , which we write as . The product rule, which can be derived as a consequence of basic symmetries of Boolean logic Cox:1946 Cox:1961 Knuth:measuring Knuth&Skilling:2012 , enable one to express this probability in two ways

 P(m,d|M) = P(m|M)P(d|m,M) (1) = P(d|M)P(m|d,M). (2)

These two expressions can be equated

 P(m|M)P(d|m,M)=P(d|M)P(m|d,M) (3)

and rearranged resulting in the familiar Bayes’ theorem

 P(m|d,M)=P(m|M)P(d|m,M)P(d|M), (4)

where the posterior probability

can be expressed in terms of the product of the prior probability

with a data-dependent term consisting of the ratio of the likelihood to the evidence . It is in this sense that one can think of Bayes’ theorem as a learning rule where one’s prior state of knowledge about the problem, represented by the prior probability, is updated by a data-dependent term resulting in a posterior probability that depends both on the prior state of knowledge as well as the data.

Both the prior probability and the likelihood must be assigned based on any and all additional information that one may possess about the problem. This is not a deficit or drawback of probability theory. Instead it is a strength since symmetries only serve to constrain manipulation of probabilities to the sum and product rules. This leaves free the probability assignments resulting in a theory of inductive logic that can be applied to any particular inference problem. The dependence of these probabilities on problem-specific prior information is often indicated by including the symbol to the right of the solidus. 222This notation goes back to Jaynes Jaynes:Book and has been adopted in several prominent textbooks in the physical sciences Bretthorst:1988 Sivia&Skilling Gregory:2005 vonderLinden:2014 . For example, this is done by writing the prior probability as .

While the posterior probability over the space of models fully quantifies all that is known about the problem, it is often common practice to summarize what is known by focusing on a particular model that maximizes the posterior probability, such that this model is most implied by the data given the prior information. Such a model is referred to as the most probable model or mode (within the context defined by the space of models

), or the maximum a posteriori (MAP) estimate. Often the space of models

to be considered is a parameterized space where each model is represented by a set of particular parameter values that act as coordinates in the space. In this case, one can consider summarizing the posterior using the model given by the mean parameter values found using the posterior. Either way, when the models in the space are parameterized, selecting a particular model given the data and prior information amounts to a parameter estimation problem.

The evidence, which in parameter estimation problems acts mainly as a normalization factor, can be obtained by summing or integrating (marginalizing) over all possible models in the set of models

 P(d|M,I) = ∫dm P(m,d|M,I) (5) = ∫dm P(m|M,I)P(d|m,M,I), (6)

which is the reason that the evidence is often referred to as the marginal likelihood.

We can refer to a set of models, , as a particular theory. Given two competing theories or one can compare the posterior probability to the posterior probability , where, among additional prior information, represents the fact that theories and are among those to be considered. In general, both theories will result in non-zero probabilities. However, the more probable theory can be determined by considering the ratio of their posterior probabilities. We can examine this by considering the ratio of joint probabilities of the sets of models and and the data and then using the product rule to write the joint probability in two ways

 P(M1,d|I)P(M2,d|I) = P(M1,d|I)P(M2,d|I) (7) P(d|I)P(M1|d,I)P(d|I)P(M2|d,I) = P(M1|I)P(d|M1,I)P(M2|I)P(d|M2,I) (8) P(M1|d,I)P(M2|d,I) = P(M1|I)P(M2|I)P(d|M1,I)P(d|M2,I) (9)

so that the ratio of the posterior probabilities of the two theories is proportional to the ratio of their respective evidences. The proportionality becomes an equality in the case where the prior probabilities of the two theories are equal. This leads to the concept of the Bayes factor or odds ratio where we define

 OR=P(d|M1,I)P(d|M2,I) (10)

or, equivalently, the log odds ratio

 logOR=logP(d|M1,I)−logP(d|M2,I). (11)

With this definition, we can write the ratio of posterior probabilities for the two different theories and in terms of the odds ratio

 P(M1|d,I)P(M2|d,I)=P(M1|I)P(M2|I)×OR, (12)

where the two are equal when the ratio of the prior probabilities of the two theories are equal.

In the case of parameter estimation problems, the Bayesian evidence plays a relatively minor role as a normalization factor. However, in problems where two theories are being tested against one another, which is often called a model selection problem333The terminology may be confusing since the term ‘model selection’ seems to refer to the process of selecting a particular model; whereas, it refers to selecting one set of models, or theory, over another., the ratio of evidences is the relevant quantity to consider. In some special cases, the integrals can be solved analytically as described in DeGroot:2004 Zellner+Siow:1980 and demonstrated below in Section 5.1.

## 3 Evidence, Model Order, and Priors

It is instructive to consider how the evidence (6) varies as a function of the considered model order as well as the prior information one may possess about the model. We begin by considering a model consisting of a single parameter , for which we have assigned a uniform prior probability over an interval of width . We can define the effective width of the likelihood over the prior range as

 δx≐1Lmax∫xmaxxmindx P(d|x,M,I), (13)

where is the value of the likelihood attained at the maximum likelihood estimate . The evidence of the model amounts to

 Z≡P(d|M,I)=1Δx∫xmaxxmindx P(d|x,M,I), (14)

which using the definition in (13) can be conveniently expressed in terms of the prior width and effective likelihood width by (Bretthorst:1988, , pp. 63-65)

 Z=LmaxδxΔx. (15)

Thus we can write the evidence as a product of the maximum of the likelihood (the best achievable goodness-of-fit) and an Occam factor :

 Z=LmaxW (16)

where is formally defined as

 W=ZLmax=∫dx P(x|M,I)P(d|x,M,I)Lmax. (17)

For models with a single adjustable parameter the Occam factor is the ratio of the width of the likelihood over the prior range to the width of the prior: . For multiple model parameters this generalizes to the ratio of the volume occupied by those models that are compatible with both data and prior over the prior accessible volume.

By making the prior broader we pay in evidence. It is in this sense that Bayesian probability theory embodies Occam’s razor: “Entities are not to be multiplied without necessity.” If we increase the flexibility of our model by the introduction of more model parameters, we reduce the Occam factor. Let’s for simplicity assume that every additional parameter is also uniform over an interval of length and that there are such parameters . Then beyond a certain model order , we will achieve a perfect fit of the data upon which we cannot improve the likelihood any further. Because the Occam factor scales as , it will disfavor a further increase in model order.

Consider a Gaussian likelihood function, which is normalized so that it integrates to unity. If the data

are modeled as independent observations, the likelihood, assuming a standard deviation

, is

 P(d|x,M,I)=(2πσ2)−n/2exp{−n2σ2[(x−¯¯¯d)2+v]} (18)

where is the sample average and

the sample variance. Maximum likelihood is obtained at

achieving a likelihood of . The evidence is

 P(d|M,I)=Lmax√2πnσΔxerf% (√n2xmax−¯¯¯dσ)+% erf(√n2¯¯¯d−xminσ)2. (19)

For and small or large, we can ignore the last factor involving the error function. The Occam factor is essentially . If falls outside the support of the prior ( or ), the evidence decreases rapidly reflecting the discrepancy between our prior assumptions and the actual observations.

Let us compare a model that has no adjustable parameter and a model with a single adjustable parameter by computing the odds ratio:

 OR=P(d|M0,I)P(d|M1,I)≈P(D|M0,I)P(D|^x,M1,I)Δxδx (20)

The odds ratio is comprised of two factors: the ratio of the likelihoods

 P(D|M0,I)P(D|^x,M1,I)

and the Occam factor . The likelihood ratio is a classical statistic in frequentist model selection. If we only consider the likelihood ratio in model comparison problems, we fail to acknowledge the importance of Occam factors.

## 4 Numerical Techniques

In general, the evidence, which is found by integrating the prior times the likelihood (6) over the entire parameter space, cannot be solved analytically.444A rare exception is given by the first example presented in Section 5.1 where an analytical solution is obtained. This requires that we use numerical techniques to estimate the evidence. Straightforward estimation of the evidence integral directly from posterior sampling, such as in Chib:1995:gibbs , proves to be quite challenging in general, especially in the case of multimodal distributions arising from mixture models or high-dimensional spaces. While a number of sophisticated problem-specific techniques have been developed to handle such difficulties Berkhof+etal:2003 Trias+etal:2009 Chib+Srikanth:2010:tailored , there is a need for more general widely-applicable techniques that require little to no fine tuning.

Other methods, such as

Reversible Jump Markov Chain Monte Carlo

(RJMCMC) treat the model order as a model parameter Green:1995 Brooks+etal:2003:RJMCMC Lopes+West:2004 . However, such techniques typically encounter serious difficulties with inefficient model-switching moves. The difficulties these more direct techniques experience are especially problematic in high-dimensional spaces and in problems where the likelihood calculations are expensive, such as in the case of large data sets or complex forward models.

This has resulted in the development of a rather sophisticated array of computational techniques. Here we briefly review some of the more popular methods, pointing the interested readers to additional excellent resources and reviews, such as Kass+Raftery:1995 and Han+Carlin:2001 , and conclude with a focus on the more recent methods of nested sampling and its cousin MultiNest, which are used in three of the examples provided in the following section.

### 4.1 Laplace Approximation

, is a simple and useful method for approximating a unimodal probability density function with a Gaussian

MacKay:2003 Penny+Kiebel+Friston:2006:vb vonToussaint:2011 vonderLinden:2014 . As such, the Laplace approximation forms the basis of more advanced techniques, such as Gull and MacKay’s Evidence Framework Gull:1989 Mackay:1992 .

Consider a function , which has a peak at . One can write the Taylor series expansion of the logarithm of the probability density about to second order as

 lnp(x)≃lnp(x0)+ddxlnp(x)∣∣∣x=x0(x−x0)+12d2dx2lnp(x)∣∣∣x=x0(x−x0)2+…, (21)

which can be simplified to

 (22)

since the first derivative of evaluated at the peak is zero . By defining to be minus the inverse of the local curvature at the peak

 σ2=(−12d2dx2lnp(x)∣∣∣x=x0)−1, (23)

we can rewrite (22) as

 lnp(x)≃lnp(x0)−12σ2(x−x0)2+…. (24)

Taking the exponential of both sides results in an un-normalized approximation for

 p(x)≃p(x0)exp[−12σ2(x−x0)2], (25)

which would have as its normalization factor

 Z=p(x0)√2πσ2. (26)

If the function is taken to be the product of the prior probability and the likelihood, then, the normalization factor (26) is an approximation of the evidence.

In dimensions, we expand the function as

 lnp(x)≃lnp(x0)−12(x−x0)TA(x−x0)+…, (27)

where is an matrix, known as the Hessian, with matrix elements given by

 Aij=−d2dxidxjlnp(x)∣∣∣x=x0. (28)

The approximation of is then given by

 p(x)≃1Zexp[−12xTAx] (29)

where the normalization factor is

 Z=p(x0)√(2π)NdetA. (30)

Again, if the function is defined by the product of the prior and the likelihood, then is the approximation to the evidence. This method requires that the peak of the distribution be identified and the Hessian estimated either analytically or numerically.

The Laplace approximation has been very useful in performing inference on latent Gaussian models, such as Gaussian processes Rasmussen+Williams:2006 . The Integrated Nested Laplace Approximation (INLA) Rue+etal:2009 Martins+etal:2013

can be used to compute the posteriors of the model parameters in the case of structured additive regression models where the predictor depends on a sum of functions of a set of covariates, and the number of hyperparameters is small (

). This is accomplished by setting up a grid of hyperparameter values where the posterior of the hyperparameter values given the data has been approximated using the Laplace approximation. Then the Laplace approximation is used to compute the marginal posteriors given the data and the hyperparameter values across the grid. The product of the hyperparameter posteriors (given the data) and the marginals (given the data and the hyperparameters) can then be numerically integrated over the hyperparameters to obtain the desired posterior marginals. Another method to approximate the marginals based on expectation propagation Minka:2001 has been proposed by Cseke and Heskes Cseke+Heskes:2011 . They demonstrated that this method is typically more accurate than INLA and works in cases where the Laplace approximation fails.

### 4.2 Importance Sampling

Importance Sampling Neal:1993 allows one to find expectation values with respect to one distribution by computing expectation values with respect to a second distribution that is easier to sample from. The expectation value of with respect to is given by

 ⟨f(x)⟩p=∫f(x)p(x)dx∫p(x)dx. (31)

Note that one can write the distribution as where the only theoretical requirement is that must be non-zero wherever is non-zero. This allows one to rewrite the expectation value above as

 ⟨f(x)⟩p =∫f(x)p(x)q(x)q(x)dx∫p(x)q(x)q(x)dx (32) (33)

which can be approximated with samples from by

 ⟨f(x)⟩p≈∑Ni=1f(xi)p(xi)q(xi)∑Ni=1p(xi)q(xi), (34)

where the samples are drawn from . This works well as long as the ratio defined by does not attain extreme values. Importance sampling is a generally valid method useful even in cases where is not Gaussian, as long as is easier to sample from than using techniques such as existing random number generators or MCMC.

Importance sampling can be used to compute ratios of evidence values in a similar fashion by writing Neal:1993

 ZpZq=∫p(x)dx∫q(x)dx (35)

which can be written as

 ZpZq =∫p(x)q(x)q(x)dx∫q(x)dx (36) =⟨p(x)q(x)⟩q (37)

which can be approximated with samples from by

 ⟨p(x)q(x)⟩q≈∑Ni=1p2(xi)q2(xi)∑Ni=1p(xi)q(xi). (38)

However, again the function must be close to to avoid extreme ratios, which will cause problems for the numeric integration.

### 4.3 Analogy to Statistical Physics

Techniques for evaluating the evidence can build on numerical methods in statistical physics because there is a close analogy between both fields. A key quantity in equilibrium statistical mechanics is the canonical partition function

 Z(β)=∫dx e−βE(x) (39)

where

are the configurational degrees of freedom of a system governed by the energy

and is the inverse temperature. Because is typically very high-dimensional, the partition function can only be evaluated numerically. Instead of computing directly by solving the high-dimensional integral (39), it is convenient to compute the Density of States (DOS)

 g(E)=∫dx δ(E−E(x)) (40)

where is Dirac’s delta function. The partition function and the DOS are linked via a Laplace transform

 Z(β)=∫dEg(E)e−βE. (41)

Therefore, knowing either of the two functions suffices to characterize equilibrium properties of the system and compute, for example, free energies and heat capacities.

In a Bayesian application, the model parameters play the role of the system’s degrees of freedom and the negative log likelihood can be viewed as an energy function . For a given data set , we write the DOS as

 g(E)=∫dm P(m|M,I)δ[E−E(m)] (42)

The evidence can then be written as a one-dimensional integral over the DOS:

 P(d|M,I) = ∫dE g(E)e−E (43) = ∫dm P(m|M,I)∫dE δ[E+logP(d|m,M,I)]e−E = ∫dm P(m|M,I)P(d|m,M,I)

Therefore, knowledge of allows us to compute the evidence in the same way as the canonical partition function (41) can be evaluated through a Laplace transform of the DOS Habeck:2012 .

Physics-inspired algorithms for evaluating the evidence aim to compute either the partition function at or the density of states. The previous class of methods comprises path sampling Gelman+Meng:1998 , parallel tempering Swendsen:1986 ; Geyer:1991 , annealed importance sampling Neal:2001 and other thermal methods that simulate a modified version of the posterior:

 [P(d|m,M,I)]βP(m|M,I) (44)

where the likelihood has been raised to a fractional power. By letting vary between zero and one, we can smoothly bridge between the prior and the posterior. A recent DOS-based algorithm called nested sampling Skilling:2004:nested Skilling:2006:nested Sivia&Skilling vonderLinden:2014 is discussed in Section 4.7.

### 4.4 Path Sampling and Thermodynamic Integration

The method of path sampling is based on the calculation of free energy differences in thermodynamics Gelman+Meng:1998 . The method focuses on the estimation of the difference between the logarithm of two distributions and , which depend on model parameters. One can connect the two distributions by a “path” through a space of distributions by defining what is called the geometric path

 p(x|β)∝p0(x)1−βp1(x)β (45)

where the parameter can vary freely from to so that at the endpoints we have that and . By letting we can establish a direct relation with the canonical ensemble; the normalizing constant is the partition function:

 Z(β) =∫dx p0(x)1−βp1(x)β =∫dx p0(x)e−βE(x). (46)

The log partition function can be estimated using samples from in the following way. We have

 ∂βlogZ(β) =−1Z(β)∫dx E(x)p0(x)e−βE(x) =⟨log[p1/p0]⟩β (47)

where denotes the expectation with respect to the bridging distribution . Integration of the previous equation yields

 log[Z(1)/Z(0)] =∫10dβ ∂βlogZ(β) =∫10dβ ⟨log[p1/p0]⟩β. (48)

By choosing a finely spaced -path we can approximate the ratio of the normalization constants by a sum over the expected energy (log likelihood ratio) over each of the bridging distributions:

 log[Z(1)/Z(0)]≈∑i⟨log[p1/p0]⟩βi(βi+1−βi). (49)

This approach is called thermodynamic integration. It is also possible to estimate the DOS from samples produced along a thermal path bridging between the prior and posterior and thereby obtain an alternative estimate of the evidence that is sometimes more accurate than thermodynamic integration Habeck:2012 ; Habeck:2012b .

If we choose and , we can use path sampling in combination with thermodynamic integration to obtain the log-evidence because and . In case we want to compare two models that share the same parameters , we can use thermodynamic integration to estimate the log odds ratio (11) by defining () and sampling from the following family of bridging distributions

 p(x|β)∝[P(m|M1,I)P(d|m,M1,I)]1−β[P(m|M2,I)P(d|m,M2,I)]β (50)

For the special case that both models also share the same prior, , this simplifies to

 p(m|β)∝P(m|I)[P(d|m,M1,I)]1−β[P(d|m,M2,I)]β. (51)

By drawing models from the mixed posterior the log odds ratio can be computed directly using thermodynamic integration. An open problem relevant to all thermal methods using a geometric path (45

) is where to place the intermediate distributions. This becomes increasingly difficult for complex systems that show a phase transition.

Ensemble Annealing Habeck:2015 , a variant of simulated annealing Kirkpatrick+etal:1983 , aims to circumvent this problem by constructing an optimal temperature schedule in the course of the simulation. This is achieved by controlling the relative entropy between successive intermediate distributions: After simulating the system at a current temperature, the new temperature is chosen such that the estimated relative entropy between the current and the new distribution is constant. Ensemble annealing can be viewed as a generalization of nested sampling (see Section 4.7) to general families of bridging distributions such as the geometric path (45). Ensemble annealing has been applied to various systems showing first- and second-order phase transitions such as Ising, Potts, and protein models Habeck:2015 .

### 4.5 Annealed Importance Sampling

Annealed Importance Sampling (AIS) Neal:2001 is closely related to other annealing methods such as simulated annealing but does not rely on thermodynamic integration. AIS generates multiple independent sequences of states where is a sample from the -th intermediate distribution bridging between the initial distribution and the destination distribution . For example, in case we are using the geometric bridge (45) the states follow

 x(j)i∼p0(x)1−βip1(x)βi (52)

where the superscript enumerates the sequences. Sampling of is typically achieved by starting a Markov chain sampler from the precursor state . Each of the generated sequences is assigned an importance weight

 w(j)=∏ipi+1(x(j)i)pi(x(j)i). (53)

Neal has shown Neal:2001

that the average of the importance weights is an unbiased estimator of the ratio of the normalizing constants:

 Z(1)Z(0)≈1MM∑j=1w(j)=1MM∑j=1∏ipi+1(x(j)i)pi(x(j)i). (54)

It is important to note that the annealing sequence is simulated multiple times, and that the partition function is obtained from the importance weights by an arithmetic average rather than a geometric average. For the special case of the geometric bridge, the AIS estimator is

 Z(1)Z(0)≈1M∑jexp{∑i(βi+1−βi)log[p1(x(j)i)/p0(x(j)i)]}. (55)

On the other hand, if we apply thermodynamic integration [Eq. (49)] to the sequences sampled during AIS, we obtain

 log[Z(1)/Z(0)]≈∑i(βi+1−βi)1M∑jlog[p1(x(j)i)/p0(x(j)i)]. (56)

Both estimators are closely related but not identical. To see this, let us rewrite the estimate obtained by thermodynamic integration:

 Z(1)Z(0) ≈ exp{1M∑j∑i(βi+1−βi)log[p1(x(j)i)/p0(x(j)i)]} (57) ≈ exp{1M∑jlogw(j)}=(∏jw(j))1/M. (58)

This shows that AIS estimates the ratio of partition functions by an arithmetic average over the importance weights, whereas thermodynamic integration averages the importance weights geometrically. Neal’s analysis as well as results from non-equilibrium thermodynamics (e.g. Crooks:1999 ) show that the AIS estimator is valid even if the sequences of states are not in equilibrium.

### 4.6 Variational Bayes

Another technique called Ensemble Learning Hinton+VanCamp:1993 Mackay:1995:ensemble Valpola+Karhunen:2002 , or Variational Bayes (VB) Waterhouse+etal:1996 Attias:1999:vb Lawrence+Bishop:2000:vb Sato:2001:vb Roberts+Penny:2002:vb Penny+Kiebel+Friston:2006:vb Smidl+Quinn:2006:vb , is named after Feynman’s variational free energy method in statistical mechanics Feynman:1972 . As such, it is yet another example of how methods developed in thermodynamics and statistical mechanics have had an impact in machine learning and inference.

We consider a normalized probability density on the set of model parameters , such that

 ∫dmQ(m)=1. (59)

While not obviously useful, the log-evidence can be written as

 logP(M|I)=∫dmQ(m)logP(M|I). (60)

Using the product rule, this can be written as

 logP(M|I) =∫dmQ(m)logP(M,m|I)P(m|M,I) (61) (62)

This expression can be broken up into the sum of the negative free energy

 F(Q(m),P(M,m|I))=∫dmQ(m)logP(M,m|I)Q(m) (63)

and the Kullback-Leibler (KL) divergence

 KL[Q(m)∥P(m|M,I)]=∫dmQ(m)logQ(m)P(m|M,I) (64)

by

 logP(M|I)=F(Q(m),P(M,m|I))+KL[Q(m)∥P(m|M,I)], (65)

which is the critical concept behind variational Bayes.

The properties of the KL divergence expose an important relationship between the negative free energy and the evidence. First, the KL divergence is zero when the density is equal to the posterior . For this reason, is referred to as the approximate posterior. Furthermore, since the KL divergence is always non-negative, we have that

 logP(M|I)=maxQF(Q(m),P(M,m|I))≥F(Q(m),P(M,m|I)) (66)

so that the negative free energy is a lower bound to the log-evidence.

The main idea is to vary the density (approximate posterior) so that it approaches the posterior . One cannot do this directly through the KL divergence since the evidence, which is the normalization factor for the posterior, is not known. Instead, by maximizing the negative free energy in (65), which is the same as minimizing the free energy, the negative free energy approaches the log-evidence and the approximate posterior approaches the posterior. However, this presents a technical difficulty in that the integral for the negative free energy (63) will not be analytically solvable for arbitrary . The approach generally taken involves a concept from the mean field approximation in statistical mechanics McComb:2004:renormalization where a non-factorizable function is replaced by one that is factorizable

 Q(m)=Q(m0)Q(m1) (67)

where the set of model parameters can be divided into two disjoint sets and so that and .

The negative free energy (63) can then be written as Penny+Kiebel+Friston:2006:vb

 F =∫dmQ(m)logP(M,m|I)Q(m) =∫∫dm0dm1Q(m0)Q(m1)logP(M,m0,m1|I)Q(m0)Q(m1) =∫dm0Q(m0)[∫dm1Q(m1)logP(M,m0,m1|I)] −∫dm0Q(m0)logQ(m0)+C =∫dm0Q(m0)I(m0)−∫dm0Q(m0)logQ(m0)+C

where the constant consists of terms that do not depend on and

 I(m0)=∫dm1Q(m1)logP(M,m0,m1|I). (68)

The negative free energy can then be expressed in terms of a KL-divergence by writing

 F=KL[Q(m0)∥exp(I(m0))]+C, (69)

which is minimized when

 Q(m0)∝exp(I(m0)). (70)

This implies that not only can the posterior be approximated with , but also the analytic form of the component posteriors can be determined. This is known as the free-form approximation Penny+Kiebel+Friston:2006:vb , which applies, in general, to the conjugate exponential family of distributions Attias:1999:vb Ghahramani+Beal:2001:vb Winn+Bishop:2005:vb , and can be extended to non-conjugate distributions Attias:1999:vb Jaakkola+Jordan:1996 .

Since the negative free energy (69) is a lower bound to the log-evidence, the log-evidence can be estimated by minimizing the negative free energy, so that the approximate posterior approaches the posterior.

### 4.7 Nested Sampling

relies on stochastic integration to numerically compute the evidence of the posterior probability. In contrast to the thermal algorithms discussed so far, nested sampling aims to estimate the DOS or rather its cumulative distribution function

 X(L) =∫−logL−∞dE g(E) =∫P(d|m,M,I)>LdmP(m|M,I) (71)

which calculates the prior mass contained in the likelihood contour . We can now write the evidence integral as

 Z =∫∞−∞dE g(E)e−E =∫10dX L(X) ≈∑iLi(Xi−1−Xi) (72)

where the likelihood is understood as a function of the cumulative DOS or prior mass (71). Because is unknown for general inference problems, we have to estimate it. Nested sampling does this by estimating its inverse function using walkers that explore the prior constrained by a lower/upper bound on the likelihood/energy (Figure 1A). Since decreases monotonically in likelihood, we can sort the unknown prior masses associated with each walker by sorting them according to likelihood. The walker with worst likelihood will enclose the largest prior mass. The maximum mass can be estimated using order statistics:

 Xmax∼NXN−1maxX(L) (73)

where the walkers have been numbered such that they increase in likelihood and thus . The worst likelihood will define the lower likelihood bound in the next iteration. Walkers to will, by construction, already attain states that are also valid samples from the prior truncated at such that we only have to replace the first walker. This can be done by randomly selecting one among the surviving states and evolving it within the new contour using a Monte Carlo procedure. The initial states are obtained by sampling from the prior (i.e. the lower bound on the likelihood is zero); the associated mass is .

Rather than increasingly giving the data more and more weight as is done in thermal approaches, nested sampling focuses on states with high posterior probability by constructing a sequence of nested priors restricted to higher and higher likelihood regions. Thereby nested sampling locates the relevant states that contribute strongly to the evidence integral and simultaneously constructs an optimal sequence of likelihood contours. In path sampling, the geometric path must be typically chosen by the user; nested sampling elegantly circumvents this problem.

Another advantage of nested sampling is that each likelihood bound in the nested sequence compresses the prior volume by approximately the same factor, which allows nested sampling to handle first order phase transitions. In contrast, tempering methods such as simulated annealing and parallel tempering compress based on steps in temperature (), and as a result they typically fail at phase transitions (Figure 1B).

Often a practical difficulty of applying nested sampling is the requirement to sample from the prior subject to a hard constraint on the likelihood. Mukherjee et al. Mukherjee+etal:2006 developed a version of nested sampling that fits an enlarged ellipse around the walkers and samples uniformly from that ellipse until a sample is drawn that has a likelihood exceeding the old minimum. Sampling within the hard constraint is also made difficult when the distribution is multi-modal. MultiNest Feroz+Hobson:2008 Feroz+etal:2009 Feroz:2013

was developed to handle multi-modal distributions by using K-means clustering to cluster the walkers into a set of ellipsoids. At each iteration, MultiNest replaces the walker with the worst likelihood by a new walker generated by randomly selecting an ellipsoid (uniformly) and sampling uniformly from within the bounds of that ellipsoid. These ellipsoids serve to allow one to detect and characterize multiple peaks in the distribution. However, the method has two drawbacks in which accurate K-means clustering limits the dimensionality of the problem to tens of parameters, and the elliptical regions may not always cover the high likelihood regions of the parameter space.

Other variants of nested sampling couple the technique with Hamilton Monte Carlo Betancourt:2010 or Galilean Monte Carlo Skilling:2012:GMC Feroz+Skilling:2013 Goggans+Henderson+Xiang:2013 , which sample within the hard likelihood constraint by considering the step size to be determined by some particle dynamics depending on the particle velocity, and using that velocity and likelihood gradient to reflect off of the hard likelihood boundary. This has been demonstrated to result in improved exploration in cases of multi-modal distributions and distributions with curved degeneracies.

Another possible way to facilitate sampling from within the hard likelihood constraint is to introduce additional “demon” variables that smooth the constraint boundary and push the walkers away from it Habeck:2014 . This approach can help to solve complex inference problems as they arise, for example, in protein structure determination, at the expense of introducing additional algorithmic parameters.

Diffusive Nested Sampling Brewer+etal:2011 is a variant of nested sampling that monitors the log likelihood values during the MCMC steps and creates nested levels such that each level covers approximately of the prior mass of the previous level. This allows the relative enclosed prior mass of the nested levels to be estimated more accurately than in nested sampling. Samples are then obtained from a weighted mixture of the current level and the previous levels so that a mixture of levels is diffusively explored facilitating travel between isolated modes and allowing a more refined estimate of the log evidence.

## 5 Practical Examples

In this section we consider a set of four practical examples where the Bayesian evidence is both calculated and used in different ways. The purpose of this section is not to compare one computational method against another, since given the large number of techniques available, this would require a more extensive treatment. Instead, the goal is to demonstrate the utility of Bayesian model selection in several examples both relevant to signal processing and spanning the domain sciences.

The first example focuses on the problem of signal detection where the evidence, which is computed analytically, is used to test between two models: signal present and signal absent. The second example focuses on using the evidence, estimated numerically by nested sampling, to select the model order of a Gaussian mixture model of the spatial sensitivity function of a light sensor. The third example relies on the application of the evidence, estimated using MultiNest, to select among a set of exoplanet models each exhibiting different combinations of photometric effects. The final example selects a molecular mechanics force field approximately describing atomic interactions in proteins by computing the evidence of nuclear magnetic resonance (NMR) data.

### 5.1 Signal Detection

In this example, based on the work by Mubeen and Knuth Mubeen+Knuth:2014 , we consider a practical signal detection problem where the log odds-ratio can be analytically derived. The result is a novel signal detection filter that outperforms correlation-based detection methods in the case where both the noise variance and the variance in the overall signal amplitude is known. While this detection filter was originally designed to be used in brain-computer interface (BCI) applications, it is applicable to signal detection in general (with slight modification).

We consider the problem of detecting a stereotypic signal, , which is modeled by a time-series with time points. This signal has the potential to be recorded from detector channels with various (potentially negative) coupling weights where the index refers to the channel. Last, and perhaps more specific to the BCI problem, we consider that the overall amplitude of the emitted signal waveshape can vary. This is modeled using a positive-valued amplitude parameter , which is the only free parameter as it is assumed that the coupling weights and the signal waveshape are known.

There are two states to be considered: signal absent

(null hypothesis) and

signal present. We model the signal absent state as noise only

 MN:xm(t)=nm(t) (74)

where denotes the “noise-only” model, denotes the signal time-series recorded in the channel and refers to the noise signal associated with the channel. The signal present state is modeled as signal plus noise by

 MS+N:xm(t)=αCms(t)+nm(t) (75)

where the symbol denotes the “signal-plus-noise” model and is the amplitude of the signal , which is coupled to each of the detectors with weights .

The odds-ratio can be written as the ratio of evidences

 OR=P(X|MS+N,I)P(X|MN,I)≡ZS+NZN (76)

where

represents the available data, which here will be the recorded time series vector

, and represents any relevant prior information including the coupling weights and the signal waveshape . The two evidence values can be written as

 ZN =P(X|MN,I) (77) =P(x(t)|n(t),I) (78)

and

 ZS+N =P(X|MS+N,I) (79) =∫αmaxαmindαP(α|I)P(x(t)|n(t),I) (80)

where the latter is marginalized over the amplitude range of the signal since we only care to detect the signal. Here represents the prior probability for the amplitude parameter . Note also that and without subscripts refer to the vector of time series over each of the detector channels.

Assuming that the noise signals have identical characteristics in each channel, we assign a Gaussian likelihood with a standard deviation of

to both models. Note that this is not quite the same as assuming that the signals are Gaussian distributed, but rather this is the maximum entropy assignment where both the mean and squared deviation from the mean are known to be relevant quantities. For the “noise-only” model there are no model parameters and the likelihood is equal to the evidence (

78)

 ZN=(2πσn2)−MT/2exp[−12σn2M∑m=1T∑t=1xm2(t)]. (81)

In the “signal-plus-noise” model we have the Gaussian likelihood

 P(x(t)|α,n(t),I)=(2πσn2)−MT/2exp[−12σn2M∑m=1T∑t=1(xm(t)−αCms(t))2]. (82)

By assigning a (potentially-truncated) Gaussian prior to the amplitude parameter ,

 P(α|I)=1Zαexp[−12σα2(α−^α)2], (83)

with a normalization constant given by

 Zα=∫αmaxαmindα exp[−12σ2α(α−^α)2], (84)

one can integrate the likelihood (82) to find the evidence of the “signal-plus-noise” model (80).

By defining

 D=S2+M∑m=1T∑t=1Cm2s2(t) (85)
 E=S2^α+M∑m=1T∑t=1Cmxm(t)s(t) (86)
 F=S2^α2+M∑m=1T∑t=1xm2(t), (87)

where

 S2=σn2σα2, (88)

we can complete the square in the exponent and write the odds ratio as

 ZS+NZN =∫αmaxαmindα