# Bayesian averaging of computer models with domain discrepancies: a nuclear physics perspective

This article studies Bayesian model averaging (BMA) in the context of several competing computer models in nuclear physics. We quantify model uncertainty in terms of posterior prediction errors, including an explicit formula for their posterior variance. We extend BMA when the competing models are defined on non-identical study regions. Any model's local forecasting difficulty is offset by predictions obtained from the average model, extending individual models to the full domain. We illustrate our methodology via simulations and applications to forecasting nuclear masses. We find significant improvement in both the BMA prediction error and the quality of the corresponding uncertainty quantification.

## Authors

• 5 publications
• 6 publications
• 1 publication
• 3 publications
02/11/2020

### Statistical aspects of nuclear mass models

We study the information content of nuclear masses from the perspective ...
01/22/2019

### Neutron drip line in the Ca region from Bayesian model averaging

The region of heavy calcium isotopes forms the frontier of experimental ...
10/28/2019

### Beyond the proton drip line: Bayesian analysis of proton-emitting nuclei

The limits of the nuclear landscape are determined by nuclear binding en...
01/05/2018

### Inverse Uncertainty Quantification using the Modular Bayesian Approach based on Gaussian Process, Part 1: Theory

In nuclear reactor system design and safety analysis, the Best Estimate ...
01/16/2020

### Quantified limits of the nuclear landscape

The chart of the nuclides is limited by particle drip lines beyond which...
03/29/2019

### PAC Learnability of nuclear masses

After more than 80 years from the seminal work of Weizsäcker and the liq...
06/01/2018

### Bayesian approach to model-based extrapolation of nuclear observables

The mass, or binding energy, is the basis property of the atomic nucleus...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Model averaging arises in situations with several competing models available to solve the same or similar problems, and no single model can be selected at a desired level of certainty. This is a common scenario in many scientific fields concerned with modeling complex systems. One of the historically dominant solutions is Bayesian Model Averaging (BMA) (KassRaftery1995; BMA), which is a natural Bayesian framework when facing a finite or countable number of alternative models. The seminal review work by Geweke introduced BMA in the field of econometrics and later in other fields such as in the political and social sciences; BMA has also been applied to the medical sciences (med-Vi14; med-SCH16), ecology and evolution (ecol-Si14; ecol-Ho15), genetics (gen-Vis11; gen-Wen15)

, and machine learning

(ML-Mer11; ML-Hernandez2018).

Contemporary developments in computing capabilities have meanwhile brought new perspectives to model averaging. The rapid surge of models implemented on a computer, which we shall refer to as computer models, opened several challenges to BMA. Nuclear structure provides stimulating examples (NuclearLandscape; ElectricDipole; NeutronDripline; Neufcourt18-2; Olsen18) which illustrate the canonical difficulties one faces with BMA of computer models that might require hundreds to thousands of core hours for a single evaluation.

Consider a general situation where experimental measurements of a physical process are used to predict its values at a new input value . In the nuclear physics context, marks the isotopes, defined by the proton number and neutron number . The output is the corresponding value of an observable such as a nuclear mass or radius, and

are typically extrapolations of these observables. The Bayesian inference approach is to consider a quantity of interest

and derive its posterior probability given the data

and model . This is simply obtained from Bayes’ formula as long as one can express the likelihood and provide a prior distribution . Establishing the likelihood is done through a model . The quantity is typically of universal interest, but can represent any latent quantity at any level of the model. Various well-known methods such as the Metropolis-Hastings algorithm (Metropolis) or more advanced Monte Carlo Methods such as Hamiltonian Monte-Carlo, or No U-Turn Sampler (NUTS) can be applied to obtain samples from the posterior distribution .

Now, let us consider a scenario with competing models , and let us assume that there exists a true model . The Bayes formula can be applied to express the posterior probability of a model , given observations , as

 p(Mk|y)=p(y|Mk)π(Mk)∑Kℓ=1p(y|Mℓ)π(Mℓ)∝kp(y|Mk)π(Mk), (1)

where the symbol indicates proportionality with respect to for fixed . Here, the difficulty lies in evaluation of the evidence integral

for each model under the prior probability

that is the true model , after which the BMA posterior distribution for any quantity of interest can be derived as

 p(Δ|y)=K∑k=1p(Δ|y,Mk)p(Mk|y). (2)

This formula shows that the actual posterior probability of an observable is the average of ’s the posterior distributions given each model, weighted by the model posterior probabilities. In other words (2) is simply a mixture of distributions which makes sampling from the BMA posterior density immediate once we obtain posterior samples under each model.

In particular, the posterior mean and variance of are given by

 E[Δ|y]=K∑k=1E[Δ|y,Mk]p(Mk|y), (3)

and the standard result for the conditional variance of a mixture yields the posterior variance of given

 Var[Δ|y] = K∑k=1p(Mk|y)Var[Δ|y,Mk]+Var[E(Δ|y,M)|y]. (4)

Note that the term

is the variance of a function of the discrete random variable

, which accounts for the model uncertainty. This model uncertainty is not accounted for by individual models. Its inclusion thus allows for a more honest uncertainty quantification.

In this work, we first describe the methodology of Bayesian analysis of computer models with generalization to account for experimental data coming from several sources (several observables). Then, we perform a systematic analysis of the prediction errors, establishing that BMA is the optimal linear combination (projection) in the

sense under the posterior probability distribution, among all possible mixtures of models. Motivated by recurrent scenarios in nuclear structure, we subsequently extend BMA to situations when different models constrain different subsets of the data. Lastly, we present an application of the BMA methodology to two scenarios, a pedagogical ”toy” example based on synthetic data and an illustration of the methodology in a setting with real nuclear-physics data. We frequently use nuclear physics terminology, but we hope the paper will be transparent to scientists from other disciplines.

## 2 Review of Bayesian Analysis of Computer Models

### 2.1 Calibration

Over the past two decades, there has been a considerable amount of research dedicated to Bayesian calibration of computer models starting with the seminal work of KoH with extensions and applications provided by Higdon04; Higdon08; Higdon15. We briefly discuss the standard Bayesian framework and refer to KoH for full treatment.

Let us consider a single computer model

relying on a parameter vector

of dimension and an input point in a finite dimensional space (for simplicity). Model calibration corresponds to determining the unknown and hypothetical true value of the parameter , at which the physical process would satisfy ;

is the systematic discrepancy of the model whose form is generally unknown. The term ”calibration” is broader than the term ”estimation”, which can imply the use of some well-established statistical methodology. Herein, our notion of calibration is a specific Bayesian estimation methodology, which include a full evaluation of uncertainty for every parameter. Thus we can write the complete statistical model as

 yi=f(xi,θ)+δ(xi)+σ(xi)ϵi. (5)

at each of the observations . For brevity, we use for . The term is a scale function to be parametrized and inferred, and are independent standard random variables representing measurement errors, which we assume to be Gaussian.

One may expect the function representing the output of the computer model to be deterministic, given a model and its parameters. This holds in the case of inexpensive computer models, where the evaluation of takes a reasonably ”constant” time. For computationally expensive models the evaluations of cannot be reasonably performed at calibration runtime, and needs to be done beforehand, typically on a grid. Hence it has become a common practice to emulate the computer model by a Gaussian process with mean function and covariance function . The data in this situation also include a number of runs of the computer model at pre-determined points . Refined approaches typically decompose on a dense family of basis functions (wavelets, Fourier, polynomials) across the domain of and . As for the covariance function, the most popular choices are the quadratic exponential kernel, the Matern kernels, and fractional Gaussian noise (RasmussenWilliams).

The discrepancy function represents the systematic error between the computer model and the physical process. While it is intrinsically deterministic, a Gaussian process model is typically imposed for inference. However, the inference becomes tricky when there is only a single observation for every . This is often the reality for nuclear physics datasets, and one cannot hope to distinguish from the noise . Indeed one could take for any value of as well as . It is easy to see that such confounding between the discrepancy term and calibration parameters makes non-identifiable in general. Several authors have pointed this out and proposed various methods to mitigate the problem including Ha; Plumee; Wu1; Wu2; Vlid. Our main goal here, nonetheless, is not correct the identification of , but a prediction of new observations with honest uncertainties quantification. For prediction, the identifiability of calibration parameter is not particularly a concern.

#### Rapid calibration via linearization

Here, we propose a variation to the calibration method in KoH based on a linearization of the computer model. We consider situation where , i.e. a linear model. This simple framework is valid in practice, as the approximation of a function , up to a reparametrization, as long as one can a priori localize the parameters, typically around a special value , such as the maximum likelihood estimator or the solution of a least-squares optimization of the computer model. Indeed, one can replace the computer model by a linear function of in the range of a first order Taylor expansion of around

 f(x,θ)≈f(x,θ0)+∇θf(x,θ0)T⋅(θ−θ0), (6)

as long as one has access to the gradients . The advantage of this method is that it is intuitive, computationally inexpensive, and can be used either as a mean function of a GP emulator for computationally expensive model, or as a surrogate in the procedure outlined herein for calibration of inexpensive computer models.

#### Several observables for the same model.

The simple notational for of model (5) obscures the situation where mixes data of different nature, which can have critical impact on both calibration and model averaging. This can represent different types of observables in nuclear physics.

Suppose that the same model is fitted to types of observables . To recognize the dependency on observable types, we can write

 yoli=fol(xi,θ)+δ(xi)+σl(xi)ϵi,l=1,⋯,q. (7)

and the corresponding likelihood can be computed naturally for each data corresponding to an observable . Allowing different types of observables doesn’t raise significant differences with the standard case. Particular attention is to be brought to the noise scaling parameter which can now vary across observables. In this case the weighting between different observables is done through the relative importance of the scaling functions which can either be fixed or left as parameters. In the context of nuclear physics we refer to the discussion in (JPG, Section 3).

### 2.2 Computing the Evidence Integral

Suppose the -th model is parametrized by for

that consists of both calibration parameters and hyperparameters (i.e. parameters that go into defining prior distributions for other parameters) with likelihoods

. The evidence integral (KassRaftery1995; Fragoso; evint-AA13) of the model is defined as

 p(y|Mk)=∫ϕkp(y|ϕk,Mk)π(ϕk|Mk)dϕk. (8)

The numerical evaluation of evidence integrals is challenging in practice, and requires approximation. A natural idea is to apply the Monte Carlo method, as most commonly used in the literature, and which we have adopted in our applications:

 ˆp(y|Mk)=1nMC∑ip(y|ϕ(i)k,Mk).

Here are i.i.d. samples from the prior for . While the Monte Carlo approximation yields reasonable results, it requires new evaluations of the likelihood on new samples from the prior which can be very costly in computing time.

Another frequently used method is the Laplace approximation, which relies on the fact that the integration in (8

) can be performed formally in case of a linear regression with Gaussian noise. It corresponds to a second order Taylor approximation of the log-likelihood around its maximum, in which case the likelihood becomes Gaussian. The Laplace method typically gives very good results for likelihoods which are very peaked. We refer the reader to

KassRaftery1995 for a survey of different methods used to compute evidence integral.

## 3 BMA and Prediction Error

BMA is only one of various natural ways to deal with several alternative models and to account for model uncertainty, but it does have the property of reducing the Posterior Mean Square Error (PMSE) of prediction of a new observation . In this section, we illustrate this property in a clear and concise way.

Let us, for simplicity of notation, consider two competing models and - the treatment of multiple models follows from a similar argument, and our verbal descriptions below in this section occasionally refer to the general case without further comment. Denote and as the posterior means of under each model, and let . We also define for for the posterior probability of each model. Thus the BMA posterior mean estimator (3) can be written as . The PMSE of is then defined naturally as with the following decomposition.

###### Lemma.

For every satisfying , we have

 E[(y∗−ˆy∗)2|y]=E[(y∗−λ1ˆy∗1−λ2ˆy∗2)2|y]−[(λ1−p1)ˆy∗1−(λ2−p2)ˆy∗2]2 (9)

This lemma, proved in Appendix A, shows explicitly that the BMA PMSE is smaller than the PMSE associated with any convex combination of the each of the two models’ posterior means. It also measures how much smaller it is, and shows that equality holds as soon as the convex coefficients are equal to the posterior probabilities of each model, . Now, applying the Lemma twice, with and with we obtain the following dual expressions for the BMA PMSE, involving each individual model’s PMSE, showing how much smaller the former is compared to the two latter:

 (10)

To be more descriptive about these properties, first we record the optimality of BMA’s PMSE as the following inequality:

 E[(y∗−ˆy∗)2|y]≤E[(y∗−ˆy∗i)2|y],i=1,2. (11)

The previous inequality is the clearest way to state that the BMA estimator (3) produces the smallest prediction error, in the PMSE sense, among all the individual models’ posterior mean estimators, as long as all those models are used in creating the BMA estimator. We interpret this as a translation of the fact that each model that goes into creating the BMA estimator necessarily ignores model uncertainty. Note that this says nothing about how the BMA estimator would compare to a model not used in its definition.

But more can be said. As we mentioned implicitly, the previous inequality is a weaker statement than the statement in the Lemma, since the latter covers optimality over all convex combinations of the original models, not just the individual models themselves: the Lemma shows that BMA achieves the following minimum

 (p(Mk|y))k=1,2:=argminλ∈[0,1]2:λ1+λ2=1E[(y∗−(λ1ˆy∗1+λ2^y∗2))2|y]. (12)

Hence, the BMA estimator is actually optimal over all convex combinations of the individual estimators and .

We can also express the reduction of the PMSE for the BMA estimator, compared to the best (lowest) PMSE among all of the individual models’, as

 r2BMA:=1−E[(ˆy∗−y∗)2|y]miniE[(ˆy∗i−y∗)2|y]. (13)

In the specific case of two competing models, if we assume for instance that the ’best’ model is , we can obtain an even more explicit expression for which provides the relative gain attained by BMA, namely

 r2BMA=p(M1|y)2(ˆy∗1−ˆy∗2)2E[(ˆy∗2−y∗)2|y]. (14)

Below in the Application section, we denote the sample version of the expression in (13) as , which we will use to evaluate the performance of BMA quantitatively.

To finish this section, we decompose the quantity against the residuals , from each individual model. This is easily done by symmetrizing formula (10) via reintroducing to identify these residuals, and then taking another conditional expectation with respect to to avoid an expression which depends on unobserved data. We obtain

 E[(y∗−ˆy∗)2|y] =(p1−p21)E[(y∗−ˆy∗1)2|y]+(p2−p22)E[(y∗−ˆy∗2)2|y] (15) −(p21+p22)E[(ˆy∗1−y∗)(y∗−ˆy∗2)|y].

Formula (15) shows that the PMSE of the BMA estimator is an explicit linear combination of the prediction errors of estimators for each constituent model, those that do not consider model uncertainty, but that one must subtract a coupling correction term on the right hand side of (15).

It is interesting to note that the weights in the aforementioned linear combination can be interpreted as the variances of Bernoulli random variables with the posterior model probabilities and as their success probabilities. Also note that, since these variances , the linear combination is not convex, but is smaller. The correction term is not necessarily a subtraction of a positive term, but it is likely to be so in some generic cases, for example, when both individual models have significant biases in opposite directions for prediction of . This is particularly interesting when both models have similar posterior performances. Indeed, then, both values of will be close to , which minimizes the values of for both . This is a scenario where using the BMA model will significantly improve prediction errors even when each model is competitive compared to the other, regardless of how large the individual models’ biases are, and without knowing in what direction they go, as long as the two models are known or assumed to have significant defects that work in opposite directions.

The PMSE corresponds to an MSE under the posterior probability. It is of different nature that the MSE one would get if the predicted values could be compared to experimental measurements. Nevertheless it is a good proxy for an actual MSE, when posterior predictions are close to actual measurements.

## 4 Domain-Corrected Model Averaging

#### Motivation.

It is easy to imagine scenarios where alternative models are defined on different subsets of the same input space: this can typically arise with local models or with numerical models with different constraints. It is the case for instance for nuclear mass models: e.g. Energy Density Functionals (EDF) primarily cover heavier nuclei, while computations based on two- and three-body interactions (typically known as -body models) range over lighter nuclei; this even occurs for different EDFs (see Kortelainen10; Klupfel08). This can also easily happen for models constrained by observables of different type where one type of observable constrains one parameter, and each model has a wider parameter space for a specific variable. For instance, some nuclear models are mostly fitted on masses, and others on masses and radii. Surprisingly, BMA has not been studied or even proposed in this context; in particular, in nuclear physics, to our knowledge, one does not find studies which provide principled, fully satisfactory answers on how to compared models with similar, overlapping, but significantly non-identical domains. Other applications of our framework could include time series with missing data, or different time scales, e.g. in a financial setting where different classes of assets can also be described as observables.

To address this ”domain discrepancy”, we present a method which relaxes the requirement that all models cover the same domain, and we name our procedure domain-corrected BMA.

We start by considering two models and , which we will also denote by and , or even merely and , for simplicity depending on the context, and we assume that they are respectively defined only on subsets and of the data , denoting and for their respective complements in . The actual Bayesian evidence for each of these models are the probabilities and , but a ”classical” BMA construction would have to be based on and instead, i.e. when each model refers only to its original range of validity. These quantities are related by conditioning as follows:

 p(y|A) = p(y(A),y(−A)|A) = p(y(A)|A)p(y(−A)|y(A),A).

This expression emphasizes that, to obtain Model ’s actual Bayesian evidence, must be multiplied by a correction factor which represents the information one has on assuming that Model holds and that it does not provide any prediction at the data points in . Note that the distribution is meaningful only to the extend that – and thus – is measurable in the underlying probability space, which implies the existence of underlying distributions and subsequently of and . To that extent, the problem of averaging models with different domains can be ill posed, if these distributions cannot be defined convincingly.

Depending on the precise meaning and understanding of the domain of definition of a model, there are two easy situations covering two extreme scenarios.

• If Model does not predict the values to exist, then this factor should be and the model is ruled impossible.

• If neither Model nor the data constrain the values at all, then the factor should be .

#### Inadequacies of the classical BMA.

The second scenario above is helpful to understand the modeling implications of confounding the narrow and broad meanings of Bayesian evidence for models with restricted domains: in the extreme case where no information can be gleaned about from or from , it is legitimate to ignore the aforementioned correction factor. This has been proposed as a methodological matter of convenience: in the absence of any convincing information to the contrary, one may set the correction factor to , and this choice is easily adopted. Trotta08 contains one of the rare discussions on averaging models with different domains. The following passage from this reference is consistent with our comments above:

On the other hand, it is important to notice that the Bayesian evidence does not penalize models with parameters that are unconstrained by the data. It is easy to see that unmeasured parameters (i.e. parameters whose posterior is equal to the prior) do not contribute to the evidence integral, and hence model comparison does not act against them, awaiting better data (Trotta08).

This corresponds to positing that, given Model , should be deterministically equal to its sample value. Let us illustrate how this approach can fail to provide a satisfactory ranking of models in two examples where a model takes a shortcut by ‘refusing’ to predict difficult/challenging points.

#### Scenario 1.

Consider the situation where one model is empty, in the sense that and . On the other hand, any other model which constrains any part of the data will have an evidence lower than which implies that the model will end up with lower posterior weights when starting from equal prior weights. Thus any predictive model will be deemed inferior to a non-predictive one.

#### Scenario 2.

Take two models and with input space (domain of ) . Assume model does well at location , but poorly at location . Assume model does passably well at location , but does not predict anything at location . One can easily choose the numbers to get an extreme situation (e.g. make ’s prediction at location to be extremely poor) where model

ends up with a much higher Bayes factor than model

, while the common sense idea by which no prediction is a form of extremely poor prediction, would always imply that model is better than model .

These examples show how important it is to acknowledge that a model’s inability to make predictions in some locations is not a neutral property.

#### Domain correction with two models.

The above discussion motivates a modification of the procedure to amend the model weights to account for their (in-)ability to provide predictions at locations of interest. To account reasonably for the effect of a model’s domain into the computation of the model weights, we propose the weaker assumption that is independent from the model, i.e. we assume

 p(y(−A)|y(A),A)=p(y(−A)|y(A)).

This is quite natural considering that it implies a distribution but model provides no information on , leaving unconstrained by . This natural, intermediate solution can be shown to correspond to the following choice (details omitted). One replaces the factors of the likelihood corresponding to the missing model predictions by a log-average of the likelihoods over the models which do produce predictions, where the average is based on the predictive models’ posterior weights.

The evidence is now given by

 p(y|A)∝Ap(y(A)|A)p(y(−A)|y(A))π(A). (16)

Our assumption implies that can only be informed by . Hence

 p(y(−A)|y(A))=p(y(−A)|y(A),B)=p(y(−A)|y(A)∩(B),B) (17)

which can be written as an explicit integral with respect to model ’s parameter ,

 ∫p(y(−A)|B,ϕB)p(ϕB|y(A)∩(B),B)dϕB, (18)

and computed similarly to a classical evidence integral (see Section 2.2).

#### Domain correction in the general case.

In the general case, each model constrains a subset of the data (for ); as in the case of two models, denotes the complement subset of in . We also introduce as the set of data which is common to all individual models. Moreover we assume that , i.e. every datapoint is covered by at least one model. We also assume, up to taking equivalence classes on models (see Appendix B for details), that for each pair of models there exists a chain of models joining them where each model shares a data point in its domain with each of its neighbours . Relying on the same principles, we set

 p(y(−k)|y(k),Mk)=p(y(−k)|y(k)),

which leads to the model weights of the form

 p(Mk|y)∝kp(y(−k)|y(k))p(Mk|y(k))=wk. (19)

In the computation of the corrective factors, there is the additional difficulty that, when there is more than one model constraining , the factor is no longer equal to a single but rather to the Bayesian average of all models constraining . The notation for a given corrective factor can become quite cumbersome, or could be ambiguous, when model domains have very general intersections, but these corrective factors can still be computed recursively rather than directly. We relegate the calculations of the general case to the appendix, for the sake of readability.

#### A call for localization.

Let us take now model weights as given, by our domain corrected BMA method or another. The posterior distribution for a general quantity of interest can be calculated from

 p(Δ|y)∝K∑k=1wkp(Δ|y(k),Mk). (20)

Suppose that is . Then

 p(y∗|y)∝∑kwkp(y∗|(k),Mk)=∑x∗∈Dkwkp(y∗|y(k),Mk)+∑x∗∉Dkwkp(y∗|y(k),Mk).

In the traditional BMA framework, would naturally belong to the domain of every model , but this does not hold in our framework. Let us consider the problem of model selection. Picking one best model and using it for predictions at all locations would amount to a likelihood of 0 at all locations outside of the domain of the selected model, which is rather extreme. This calls for a local variation of model selection. For instance, instead of selecting a single best model, one can rank the models such that is the best model and is the worst model and use the following prediction procedure: at each location , the true model is the model with the smallest which contains in its domain. One could easily think of more complex procedures. This also suggest local variations of model mixing.

#### Discussion.

In the naive procedure a model which does not include an existing data point in its domain would have weight 0. With the domain-corrected procedure a model which does not include a given data point in its domain would have the same weight as the same model which would predict the average value over the other models. In that sense, the procedure is neutral towards (does not penalize or favor) a model not producing any prediction and is still conservative.

We can also note that our procedure questions that the models are mutually exclusive (i.e. that there exist a true model such that each is of the form ). Specifically, our procedure is incompatible with global exclusivity of models in the classical BMA framework, in the sense that the classical BMA assumes that one of the model is true on the whole domain. In our new procedure, we replaced the model prediction with a “default” value at locations where it would otherwise predicts nothing, given by the Bayesian average prediction. Consequently our model averages predict something in any given location, even when the set of predictive models depends on the location. Applying the same principle to the germane problem of model selection would lead to producing an ”optimal” model replaced by the BMA outside its own domain.

To be absolutely clear, it is of course always possible to bypass the domain issues by restricting estimations to the data which is common to the domains of all models. This circumvents having to deal with models withholding their predictions, but it can leave significant amounts of data unused. That disadvantage can be particularly stark in nuclear physics, when comparing EDF models with -body models, as we mentioned at the start of this section, since these two model classes have such narrow overlap.

## 5 Examples and Applications

To illustrate the methodology described in the previous sections, we present two generic instances under which BMA of computer models leads to a reduction in prediction error. Our first illustration offers a pedagogical examples with an error reduction exceeding 0.5. Our second set of examples focuses on application of the methodology in a setting with nuclear mass models.

Each of the examples in this section look at a situation with several competing models without any prior knowledge of which is better; thus we set the prior model weights to be uniform over the model space. All the posterior samples were computed using a Hamiltonian Markov-Chain Monte-Carlo algorithm. The evidence integrals were approximated using basic Monte-Carlo integration.

### 5.1 A Pedagogical Example

We assume here that the models under consideration are computationally expensive, and that we have access to their parameter gradients, which is often the case in typical applications. Therefore, we may define two competing computer models as Gaussian processes with means determined by their first order Taylor expansions. In this fashion, the calibration procedure falls under the rapid calibration via linearization. In the spirit of making the examples transparent, we choose to ignore the systematic discrepancy between the computer model and the physical model.

Let us consider a synthetic dataset with 10 identically

observations, whose noise is generated independently from the normal distribution

, at input points . We will see clearly how the BMA intuitively leads to improvement in predictive power. We define the mean values and for the two computer models via their first order Taylor expansions, as follows:

 fi(x,θi) ≈αix2+θi,i∈{1,2} (21)

where and and and represent unknown calibration parameters. One easily conceptualizes this as a scenario with two computationally expensive models (only 10 observations) with quantitatively comparable -dynamics acting in the opposite directions. It is natural to consider mixing both models since neither expansion in (21) provides significant a priori knowledge on which model is better. In addition, this is supported by the calculated posterior model probabilities, found below in Table 1 to be essentially equal. To begin with, we also assume that both models are defined on a common domain.

After calibrating the two models on the full dataset, we sampled from posterior predictive densities of , , and at all available input points.

Figure 1 shows how the structural differences in and are reflected in the predictions they yield. Typically, where overestimates the predicted value, tends to underestimate. On the other hand, the predictions from give a tighter fit to the original dataset and provide a better predictive ability on average. The estimated root MSE (RMSE) reduction for the BMA posterior mean estimator is , a considerable improvement compared to the better of the two models in (21).

To further demonstrate the benefits of BMA, let us now extend this scenario to a situation where models in (21) are defined on different domains, and the corrective terms need to be computed (e.g. the models are defined on different subsets of the nuclear landscape). We again consider a synthetic dataset with observations equal to 0 with measurement error generated independently from , here at input points . Each of the models and was assigned a different training dataset . We constructed four different scenarios, for various proportions of shared observations (), from to , according to the scheme in Table 2.

For each four values of , we carried out the BMA procedure the same way we did in the case where the models share the same domain, with the additional step involving the corrective terms . Note that the approximate computation of these terms is typically more demanding than the computation of evidence integrals (8), because it requires integration over a posterior distribution of parameters.

The RMSE was calculated at the same input points as in the case of the pedagogical example above. The BMA posterior mean estimator consistently outperforms estimators based on each model individually, regardless of the proportion of shared training data. Table 3 gives quantitative results which are comparable with the results in Table 1. In this table,

denotes the posterior odds ratio

for sampling from mixture distribution (20).

The odds ratios stay expectedly close to 1, since the deviations from out-of-domain data are comparable across the models under consideration. Notice that the values of the corrective terms increase (they are closer to 1) as increases. This can be expected, since the extreme case of , where the domain is shared, corresponds to the classical BMA framework where no correction is needed. Also note that, the larger the overlap in models’ domains, the smaller the posterior mean squared error is, while keeping the estimated RMSE reduction close to .

### 5.2 Averaging Nuclear (Mass) Models

An important task in the field of theoretical nuclear structure is the prediction of various physical observables with quantified uncertainties (JPG). Nuclear mass models approximate these observables and other properties of atomic nuclei, for each pair of proton number and neutron number . Of particular interest are the binding energy and the root mean square (rms) proton charge radius, which measures the size of an atomic nucleus. Mass models are typically a combination of a non-linear function with a parameter vector to be determined and a discrepancy term , of the form

 f(Z,N;θ)=mθ(Z,N)+δ(Z,N), (22)

where is a correction due to physics that is not accounted for by the function .

For the purpose of a naive exercise we propose the following family of functions for :

 molθ(Z,N)=α(Z+N)1+γ(Z+N)+βol,l∈{1,2}, (23)

where is a linear or non-linear function representing basic physics based on the atomic number , and is a constant which allows a modeler some additional flexibility to choose the convexity of the mass model depending on their range of interest for . To fix ideas, let us assume that is a constant, thus becoming akin to a scale parameter, to adjust the model to realistic ranges of atomic masses. The mass model (23) describes a scenario where the model behaves linearly for nuclei with small atomic number , i.e. light nuclei, but the denominator on the right-hand side of (23) starts to dominate with increasing atomic number. The function becomes markedly convex for large when , and similarly concave for positive , for heavier nuclei. The model is also allowed to depend on two types of observables through the constant additive parameters .

It is not uncommon for the data used to fit nuclear mass models to be scarce (expensive), and practitioners estimate subsets of the model parameters based on external sources of prior knowledge, before fitting the model to the data. We present a BMA analysis of two models according to (23), which differ in the value of the shape parameter , as chosen by two hypothetical practitioners who want specific features in the regime of large . Both models are meant to agree largely for small , and disagree significantly for large . Specifically we choose , and . In the range of realistic ’s, i.e. integers from 1 to 300, both models are increasing, the one with is roughly linear with a slight convexity for large , the other is strongly concave for large , which forces it to deviate strongly from linearity in that range. The choice of these two values of was designed to produce two models which are not comparable intuitively. The one with represents a simplistic linear approach, which can be interpreted as an effort to cover a wider range of atomic numbers in the absence of any specific physics. One may presume that it operates poorly for very large A. The other, with , represents a model which attempts to cover and patch together two different regimes, where a specific effort is made to handle large . The absolute numerical values of gamma used here are designed to differentiate these two scenarios. The case could be replaced by without major changes to our analysis. The negative sign, which changes the function’s convexity, allows for models which disagree in several respects. The are the calibration parameters for models , , while we assume here that the parameter

has been determined ahead of time without the use of Bayesian statistics.

In this example, we used a data set of measurements of 20 binding energies and of 15 charge radii which are publicly available on the site of National Nuclear Data Center based at the Brookhaven National Laboratory corresponding to nuclei of metals whose predictions have been challenging to nuclear physicists (AME2016; ANGELI). Gaussian process emulators with means according to (23) were fitted on a training set of 10 measurements of binding energies (, , , , , and ) and 10 measurements of charge radii ( and ). To aid with the transparency of the results, we again choose to ignore the systematic discrepancy .

From the point of view of the RMSE, the BMA predictor again provides on average a considerably better fit than or for both observables. Table 4 shows that in general, both models produce relatively small RMSE for the training data, however, the BMA methodology still produces a significant improvement. Especially in the case of binding energies, the improvement is close to relative to the better of the two models. It is also good to know that the posterior probabilities on and do not discard either of the models, despite being presumably superior. This supports our decision to take both models into account, rather than to pick one model over the other.

We further considered a testing data set consisting of 10 nuclei for prediction of binding energies (, , and ) and 5 nuclei for prediction of proton rms charge radii ( and ). This set of nuclei was disjoint from those in the training data. Again, the BMA posterior mean predictor performed better than either of the two models according to the estimated as in the case of the training data. In addition to the enhanced predictive ability of the BMA posterior mean predictor, BMA also increases the honesty of uncertainty quantification for the prediction, by accounting for modeling uncertainty as demonstrated by the decomposition of the total variance in (4). The data-driven averaging of uncertainties on observables according to the posterior probabilities

yields a compromise between possibly over-confident credible intervals provided by the model

and unnecessarily wide intervals of . See Appendix C for results on selected isotopes that illustrate this phenomenon.

### 5.3 Averaging Nuclear Mass Emulators in the Ca Region

In their recent work, Neufcourt18-2 use Gaussian Processes to model the discrepancies between experimental data and theoretical calculations for several nuclear models from the density functional theory. Their objective is to obtain quantified extrapolations for nuclear masses in the regions where no experimental data are available while providing robust uncertainty bounds. Such predictions are of direct interest to guide future nuclear experiments or to feed astrophysical models for the abundance of elements in the universe. Neufcourt18-2 compute a simplified BMA of 9 models (massexplorer; FRDM2012; HFB24), where the weights are based on the ability of the model to predict positive separation energies for all nuclei which have been observed. Here, we perform a full BMA analysis of the same models and compare our results to those obtained by Neufcourt18-2.

We consider the same training dataset of one-neutron () and two-neutron () separation energies AME2003 (AME2003) restricted to the calcium (Ca) region on the nuclear landscape with and . The predictive performance of each model (augmented with a GP model for systematic discrepancies) and the BMA posterior mean predictor is evaluated on both the training dataset and a testing dataset of new measurements in AME2016 (AME2016).

Similarly to Neufcourt18-2 we calculate model posterior probabilities independently on four non-overlapping nuclear domains according to the parity of numbers and . We assess the perfomance of the BMA with two metrics, the RMSE improvement and the fidelity of the empirical coverage probability. These quantities are calculated independently on the four parity domains and aggregated. See the results in Table 5.

The predictions based on the full BMA () outperform the simplified method of Neufcourt18-2 () by on the training dataset and on the testing one, as measured by . The lowest RMSE on the training dataset was attained by SLy4 and UNEDF1 respectively for AME2016 AME2003. This result should not discourage practitioner from using BMA posterior mean predictors, because the BMA methodology outlined in this paper allows for existence of a ”best” model for a particular data domain. However, such a model does not account for modeling uncertainty whereas BMA does, and therefore the BMA posterior mean estimator performs consistently well irrespective of the dataset. In fact it attains the second lowest RMSE on both AME2003 and AME2016 AME2003. Moreover, if we consider only a subset of the whole model space, the BMA attains the lowest root RMSE. See Table 6 in Appendix C for the results with restricted model space.

Fig. 7 in Appendix C shows the empirical coverage probabilities (ECP), i.e. the proportion of independent testing points falling into the respective credibility intervals with nominal value given in the horizontal axis - ideally a straight line. While it is not clear that the BMA has an improved ECP compared to each individual models, its ECP is certainly significantly better than the worst models and similar to the models with highest fidelity.

## 6 Conclusion

Motivated by nuclear physics research problems, we analyzed the Bayesian Model Averaging setup - the natural Bayesian framework to infer any unknown quantity of interest when several models are competing. We focused on the specific challenges arising from BMA of computer models such as the calibration of both computationally inexpensive and expensive models as well as the computation of the evidence integral in this context.

The nuclear physics perspective led us to study the special case of averaging models with different domains, which has not been explored in the literature to our knowledge. We also gave a theoretical justification for the use of BMA posterior mean predictor in terms of PMSE reduction. While this predictor does not guarantee a universal improvement in predictive ability, on average, it performs at least as well as the best models under consideration.

Finally, we applied the methodology outlined in this paper under two general scenarios that lead to considerable RMSE reduction; one set of pedagogical examples with a synthetic dataset and one real data analysis of a pair of simple models fitted to a set of binding energies and proton rms charge radii. These illustrative examples provide insight into the benefits of BMA and serve as a ”proof of concept.” We also provided a full-scale BMA analysis of 9 state-of-the-art nuclear mass models. Fully documented Python code that reproduces all the examples in this paper is available at https://github.com/kejzlarv/BA_of_computer_models and can be easily modified to serve practitioners.

There are several opportunities to further explore BMA of computer models, within and beyond nuclear physics. As any other Bayesian method, it hinges on efficient sampling from posterior distributions. Direct sampling from the BMA posterior distribution could significantly improve the methodology ease of implementation. While our theoretical basis for BMA comes from potential reduction of PMSE, a more universal argument could seek consistency of an estimator of based on its BMA posteriors. Finally our domain correction to BMA corresponds to an elementary and constrained localization. Developing a more elaborated local BMA procedure could answer a wider range of practical challenges in model mixing.

#### Acknowledgement.

The authors thank Witold Nazarewicz for providing us with motivating examples and questions, and for insightful discussions and comments on an early version of our work.

## Appendix A BMA and Prediction Error Lemma

###### Lemma.

For every satisfying , we have

 E[(y∗−ˆy∗)2|y]=E[(y∗−λ1ˆy∗1−λ2ˆy∗2)2|y]−[(λ1−p1)ˆy∗1−(λ2−p2)ˆy∗2]2. (24)
###### Proof.

This follows from taking conditional expectation in the following expression, derived from standard factorization identities, and notice that the right hand side is -measurable :

.

## Appendix B Averaging models with different domains: the general case

Suppose given data and models , and assume that each model is defined on a subset of . Denote also the subset of corresponding to , the complementary subset as well as