 # Reconciling the Bayes Factor and Likelihood Ratio for Two Non-Nested Model Selection Problems

In statistics, there are a variety of methods for performing model selection that all stem from slightly different paradigms of statistical inference. The reasons for choosing one particular method over another seem to be based entirely on philosophical preferences. In the case of non-nested model selection, two of the prevailing techniques are the Bayes Factor and the likelihood ratio. This article focuses on reconciling the likelihood ratio and the Bayes Factor for comparing a pair of non-nested models under two different problem frameworks typical in forensic science, the common-source and the specific-source problem. We show that the Bayes Factor can be expressed as the expected value of the corresponding likelihood ratio function with respect to the posterior distribution for the parameters given the entire set of data where the set(s) of unknown-source observations has been generated according to the second model. This expression leads to a number of useful theoretic and practical results relating the two statistical approaches. This relationship is quite meaningful in many scientific applications where there is a general confusion between the various statistical methods, and particularly in the case of forensic science.


## 1 Introduction

Section 2 summarizes the two model selection frameworks that we call the common-source and the specific-source problems and, for each problem, defines the two models between which the selection is to be made. Section 3 gives the forms of both the Bayes Factor and the likelihood ratio for the two non-nested model selection problems. Finally, the relationships between them are explored in Section 4. The interested reader is directed to the supplementary material Ommen and Saunders (2018a) for further details that have been omitted from this article for the sake of conciseness and clarity of presentation.

## 2 Non-Nested Model Selection

One of the most common, and increasingly one of the most difficult, areas of statistical application in forensic science is the forensic identification of source. The general idea of identification of source is that you have an object related to the perpetration of a crime, and you wish to determine where that object originated. For example, a fingerprint is left at the scene of a murder, and you want to determine if the print originates from a finger of the suspect. Similarly, suppose you have two different crime scenes with fingerprints recovered, and you want to know whether the prints were left by the same finger of an unknown perpetrator. (See Ommen and Saunders (2018b) for further details of forensic identification of source problems of these types.) As it turns out, these two scenarios can be expressed statistically as two different non-nested model selection problems.

### 2.1 Common-Source Models

The idea of the common-source non-nested model selection problem is that you have built up a dataset, $X_a$, of observations from many different subjects in some population, with several samples from within each subject, and two sets of observations with unknown sources, $x_b$ and $x_c$, have been observed. We assume that each set of unknown-source observations has been generated by one single source, and that both sets are generated from the model which produced the dataset $X_a$. It is of interest to determine whether the two sets of unknown-source observations have been generated from a common unspecified (random) subject in the population, denoted by $M_1$, or from two different unspecified (random) subjects in the population, denoted by $M_2$. In the absence of simplifying assumptions regarding the subjects in the dataset $X_a$, the model describing the generation of $X_a$ implicitly defines a latent random variable related to the selection of a subject from the population; its values appear as the integration variables $a_u$, $a_b$, and $a_c$ in the likelihood functions below. The corresponding sampling models for the entire set of the data, denoted $X_N$ where $N$ denotes the index for the various sample sizes, are given in the supplementary material Ommen and Saunders (2018a) and visualized in Figure 1. An example demonstrating the set-up for an application of the common-source framework to forensic evidence is given in Ommen and Saunders (2018b).

Figure 1: Hierarchy and notation for the common-source model selection problem where the green boxes denote the datasets and the yellow boxes denote the two competing models.

The goal of specifying the sampling models is to indicate the exchangeability assumptions for the data. To fully specify the necessary likelihood functions, $\theta_a$ will denote the parameter. For the common-source problem, the model selection is then a selection between $M_1$ and $M_2$ for the two sets of unknown-source observations. Under $M_1$, the likelihood function for the unknown-source observations will be denoted

$$L(\theta_a \mid x_b, x_c, M_1) = f(x_b, x_c \mid \theta_a, M_1) = \int f(x_b, x_c \mid a_u, \theta_a)\, dG_{\theta_a}(a_u). \tag{1}$$

Under $M_2$, the likelihood function for the unknown-source observations will be denoted

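$$\begin{aligned} L(\theta_a \mid x_b, x_c, M_2) &= L(\theta_a \mid x_b, M_2)\, L(\theta_a \mid x_c, M_2) = f(x_b \mid \theta_a)\, f(x_c \mid \theta_a) \\ &= \int f(x_b \mid a_b, \theta_a)\, dG_{\theta_a}(a_b) \int f(x_c \mid a_c, \theta_a)\, dG_{\theta_a}(a_c). \end{aligned} \tag{2}$$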

These likelihood functions will be used in Section 3 to define statistics from both the classical and the Bayesian paradigms needed to perform the model selection.
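As a rough illustration of Equations 1 and 2, the sketch below evaluates both likelihoods under a simple univariate model that is an assumption made here, not the model of the article: a latent source mean $a \sim N(\mu, \tau^2)$ and observations $x_i \mid a \sim N(a, \sigma^2)$, so that $\theta_a = (\mu, \tau^2, \sigma^2)$ and the integral over the latent variable has a closed form. The function names and numerical values are hypothetical.

```python
import math

def log_marginal(x, mu, tau2, sigma2):
    """Log density of one set of observations from a single random source.

    Assumed toy model: source mean a ~ N(mu, tau2), x_i | a ~ N(a, sigma2).
    Integrating a out gives a closed form in the sample mean and the
    within-set sum of squares.
    """
    n = len(x)
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    total = sigma2 + n * tau2
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * (n - 1) * math.log(sigma2)
            - 0.5 * math.log(total)
            - 0.5 * ss / sigma2
            - 0.5 * n * (xbar - mu) ** 2 / total)

def log_lik_m1(xb, xc, theta):
    # Equation 1: both sets share one latent source, so they are pooled.
    return log_marginal(list(xb) + list(xc), *theta)

def log_lik_m2(xb, xc, theta):
    # Equation 2: two independent latent sources, so the likelihood factorizes.
    return log_marginal(xb, *theta) + log_marginal(xc, *theta)

theta_a = (0.0, 1.0, 0.1)  # hypothetical (mu, tau2, sigma2)
xb = [1.9, 2.1, 2.0]       # hypothetical unknown-source observations
xc = [2.2, 1.8]
print(log_lik_m1(xb, xc, theta_a), log_lik_m2(xb, xc, theta_a))
```

Working on the log scale is a design choice to avoid underflow; the two sets here lie close together and far from the population mean, so the pooled likelihood under $M_1$ is the larger of the two.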

### 2.2 Specific Source Models

In contrast to the common-source problem, the idea of the specific-source non-nested model selection problem is that you have built up a dataset, $X_a$, of observations from many different subjects in some population, and there is one particular subject of interest (that is not a subject from the population mentioned previously) that may be the source of another set of observations. In addition to the dataset $X_a$, you have also collected a set of observations from the particular subject of interest. This dataset is denoted $x_b$ and is composed of observations from a second, independent population consisting of a single source. Now, when a set of observations with an unknown source, $x_c$ (with sample size $N_c$), is obtained, you want to determine whether it has come from the population associated with the subject of interest, denoted by $M_1$, or has been generated by a randomly selected subject from the larger population of many sources, denoted by $M_2$. The corresponding sampling models for the entire set of data, denoted $X_N$ where $N$ denotes the index for the various sample sizes, are given in the supplementary material Ommen and Saunders (2018a) and visualized in Figure 2. An example demonstrating the set-up of the specific-source problem for a forensic science application is given in Ommen and Saunders (2018b).

Figure 2: Hierarchy and notation for the specific-source model selection problem where the green boxes denote the datasets and the yellow boxes denote the two competing models.

Similar to the common-source problem, the sampling models only provide information about the exchangeability of the observations, so the parametric models need to be specified as well. Let $\theta_a$ denote the parameter associated with the large population of many subjects and let $\theta_b$ denote the parameter associated with the specified subject’s population. For the specific-source problem, the first model implies that $x_c$ has been generated according to the model for the specified subject, whereas the second model implies that $x_c$ has been generated according to the model for the large population of many subjects. As in the common-source problem, the model selection problem is then a selection between $M_1$ and $M_2$ for $x_c$. Under the first model, the likelihood function for the unknown-source observation is denoted

$$L(\theta_b \mid x_c, M_1) = f(x_c \mid \theta_b, M_1) = \prod_{i=1}^{N_c} f(x_{c_i} \mid \theta_b) \tag{3}$$

and under the second model it will be denoted

$$L(\theta_a \mid x_c, M_2) = f(x_c \mid \theta_a, M_2) = \int f(x_c \mid a_c, \theta_a)\, dG_{\theta_a}(a_c). \tag{4}$$

These likelihood functions will be used in Section 3 to define statistics from both the classical and the Bayesian paradigms needed to perform the model selection.

### 2.3 Discussion

While the non-nested model selection problems discussed in this article have been motivated by forensic science applications, they can be applied to a number of other areas as well. The common-source and specific-source frameworks for non-nested model selection might be used in any area where statistical pattern recognition methods can be applied. For example, consider the situation in which there is a database of observations that have been divided into $k$ different classes, and you wish to classify an unknown-source set of observations, $y$, to one of these classes. Then the traditional Bayes’ classifier is given by

$$r(y) = \underset{i \in \{1, 2, \ldots, k\}}{\arg\max}\; P(M_i \mid y)$$

where $P(M_i \mid y)$ is the frequentist posterior probability of the model for class $i$ creating the unknown-source set of observations Anderson (2003). Now, suppose that one of those classes is of particular interest to you for some reason. The model for this class can be used in the specific-source framework to describe the generation of data from the population associated with the single subject of interest, and the combination of all the models for the other classes can be used in the specific-source framework to describe the generation of data from the larger population of many subjects. Then, the specific-source (non-nested) model selection problem would be a selection between these two models as the one that generated the unknown-source set of observations, $y$. Or, consider the scenario in which you have two unknown-source sets of observations and you want to know whether or not they come from the same class, without specifying which of the classes. Then the combination of all the models for each of the classes serves as the model associated with the population for the common-source framework, and the (non-nested) model selection problem is then a decision about whether the two sets have come from the same unspecified class or from two different unspecified classes.
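The Bayes' classifier above can be sketched in a few lines. The example assumes, purely for illustration, that each class model $M_i$ is a univariate normal distribution with known mean and variance and that the class prior probabilities are given; the function names and numbers are hypothetical.

```python
import math

def normal_logpdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def bayes_classifier(y, class_models, class_priors):
    """Return (index of the most probable class, list of P(M_i | y)).

    class_models: list of (mean, variance) pairs, one per class (assumed known).
    class_priors: list of prior class probabilities P(M_i).
    """
    log_post = []
    for (mean, var), prior in zip(class_models, class_priors):
        log_lik = sum(normal_logpdf(yi, mean, var) for yi in y)
        log_post.append(math.log(prior) + log_lik)
    # Normalize on the log scale to get posterior class probabilities.
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    posterior = [w / total for w in weights]
    best = max(range(len(posterior)), key=posterior.__getitem__)
    return best, posterior

models = [(0.0, 1.0), (3.0, 1.0), (6.0, 1.0)]  # hypothetical class models
priors = [1 / 3, 1 / 3, 1 / 3]
y = [2.6, 3.3, 2.9]                            # hypothetical unknown-source set
best, posterior = bayes_classifier(y, models, priors)
print(best, posterior)
```

In the specific-source reading of this example, one entry of `models` would play the role of the subject of interest and the remaining entries together would stand in for the larger population.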

The main difference between the model selection problems discussed in this paper and traditional classification or pattern recognition problems is the order in which you observe the sets of data. In forensics applications, the unknown-source data is usually observed first, and then the other sets of data are subsequently collected. This is because the unknown-source data is typically collected from the crime scene. Investigators don’t collect samples from a person of interest unless a crime has been committed first, so the samples from the known subjects are collected last. In traditional classification problems, the known-source data is typically collected first, and then the unknown-source data is analyzed as it is observed. Note, the order in which the data is observed is not a necessary condition for application of this problem set-up. However, the results in this paper rely on the fixed observation of the unknown-source data, making them particularly useful for forensic science, and other similar, applications where this subset of the data is observed first.

## 3 Methods

In general, Bayesian statistical methods of approaching the forensic identification of source problems have been advocated by a number of forensic science researchers, especially in Europe Eur (2015); Taroni et al. (2016); Berger and Slooten (2016); Biedermann et al. (2016), while alternative methods have been advocated by a number of forensic science researchers primarily in the United States Lund and Iyer (2017); Swofford et al. (2018); Kafadar (2018). The Bayesian methods are centered around computing the Bayes Factor, which provides a relative measure of how much the data support the two competing models. In contrast, the alternative methods typically focus on finding an approximation to the likelihood ratio. One of the most confusing things about these problems in forensics is that both of these values are referred to as the “likelihood ratio” regardless of which method was used to compute the value. It is easy to imagine that a similar phenomenon occurs in related areas outside of forensics, too. Therefore, it is important to distinguish the methods used to compute these values. The following subsections provide a brief review of each of the methods and the notation needed to define the likelihood ratio and the Bayes Factor for each of the non-nested model selection problems we are interested in.

### 3.1 The Likelihood Ratio

Typically, the term likelihood ratio is used in reference to the likelihood ratio test statistic developed by Neyman and Pearson for simple- and nested-model selection. Another common reference for the term likelihood ratio is the log-likelihood ratio statistic for nested-model selection. However, neither of these methods is directly applicable to non-nested models, like the two model selection problems introduced above. Since these early developments, others have explored the use of the likelihood ratio in more generality for non-nested model selection (see Vuong (1989) for example). In this article, the phrase likelihood ratio function will be used to describe the ratio of the two different likelihood functions for the unknown-source observations under the two competing models. In particular, the likelihood ratio function for the common-source model selection problem described in Section 2.1 is given by

$$\lambda_{cs}(\theta_a \mid x_b, x_c) = \frac{L(\theta_a \mid x_b, x_c, M_1)}{L(\theta_a \mid x_b, M_2)\, L(\theta_a \mid x_c, M_2)} \tag{5}$$

and the likelihood ratio function for the specific-source model selection problem described in Section 2.2 is given by

$$\lambda_{ss}(\theta_a, \theta_b \mid x_c) = \frac{L(\theta_b \mid x_c, M_1)}{L(\theta_a \mid x_c, M_2)}. \tag{6}$$

For given observations of the unknown-source data, the likelihood ratio function for the common-source model selection problem is a function of the multivariate parameter vector $\theta_a$, which is constrained to take values in its parameter space. In the specific-source model selection problem, the likelihood ratio function is a multivariate function of the joint parameter vector for both $\theta_a$ and $\theta_b$, given the observation of the unknown-source data. In this sense, the value of the likelihood ratio function which corresponds to the values of the parameters associated with the true sampling distributions implied by models $M_1$ and $M_2$ can be considered the “true likelihood ratio.” The parameter value $\theta_{a_0}$ is used to indicate the true parameter for the larger population of many sources, while $\theta_{b_0}$ is used to indicate the true parameter for the population associated with the single source of interest. Since we do not know the values of these parameters, the true likelihood ratio is fixed, but unknown. The form of the true likelihood ratio for the common-source model selection problem is given by

$$\lambda_{cs}(\theta_{a_0} \mid x_b, x_c) = \frac{L(\theta_{a_0} \mid x_b, x_c, M_1)}{L(\theta_{a_0} \mid x_b, M_2)\, L(\theta_{a_0} \mid x_c, M_2)} \tag{7}$$

and for the specific-source model selection problem by

$$\lambda_{ss}(\theta_{a_0}, \theta_{b_0} \mid x_c) = \frac{L(\theta_{b_0} \mid x_c, M_1)}{L(\theta_{a_0} \mid x_c, M_2)}. \tag{8}$$

Note that the true likelihood ratio given by either Equation 7 or Equation 8 is not the limiting form of the likelihood ratio function as the size of the entire set of data (including the size of the observations with unknown source) grows infinitely large. Theorem 5.1 from Vuong (1989) provides the limiting form of the likelihood ratio function for this scenario, which will not be considered further in this article. The true likelihood ratio will take values between zero and positive infinity (exclusive of the endpoints) since the size of the unknown-source observations remains fixed.
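To make the specific-source likelihood ratio function of Equation 6 concrete, the following sketch evaluates it on the log scale. It assumes a univariate toy model that is not taken from the article: under $M_1$ the observations are independent $N(\mu_b, \sigma_b^2)$ draws (Equation 3), and under $M_2$ they come from one random source with mean $a \sim N(\mu, \tau^2)$ and within-source variance $\sigma^2$ (Equation 4). All names and values are hypothetical.

```python
import math

def normal_logpdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_marginal(x, mu, tau2, sigma2):
    # Closed-form log density of a set of observations from one random source
    # (latent source mean integrated out under the assumed normal model).
    n = len(x)
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    total = sigma2 + n * tau2
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * (n - 1) * math.log(sigma2)
            - 0.5 * math.log(total)
            - 0.5 * ss / sigma2
            - 0.5 * n * (xbar - mu) ** 2 / total)

def log_lambda_ss(xc, theta_a, theta_b):
    """Log of the specific-source likelihood ratio function (Equation 6)."""
    mu_b, sigma2_b = theta_b
    log_num = sum(normal_logpdf(xi, mu_b, sigma2_b) for xi in xc)  # Equation 3
    log_den = log_marginal(xc, *theta_a)                           # Equation 4
    return log_num - log_den

xc = [2.0, 2.1]              # hypothetical unknown-source observations
theta_a = (0.0, 1.0, 0.1)    # hypothetical (mu, tau2, sigma2)
for theta_b in [(2.0, 0.1), (-2.0, 0.1)]:
    print(theta_b, log_lambda_ss(xc, theta_a, theta_b))
```

The loop evaluates the same observations at two candidate values of $\theta_b$, which shows why the likelihood ratio is a function of the parameters rather than a single number until the true values $\theta_{a_0}$ and $\theta_{b_0}$ are plugged in.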

### 3.2 The Bayes Factor

The Bayes Factor is a well-studied statistic often used for performing model selection on a wide variety of models Kass and Raftery (1995); Newton and Raftery (1994). One of the advantages of using the Bayes Factor for model selection is that the models do not need to be nested. The Bayes Factor can be viewed as the strength of the support the data provide for one of the two competing models over the other. A Bayes Factor greater than one indicates that the data support model $M_1$ over model $M_2$, in contrast to a Bayes Factor less than one, which indicates that the data support model $M_2$ over model $M_1$. A Bayes Factor that is equal to one means that the data cannot discriminate between the two competing models. However, the cost for increased applicability of the Bayes Factor for model selection is the need to specify prior probabilities for each of the models. The Bayes Factor can then be used to update the prior odds for the models to arrive at the posterior odds.

 P(M1|X)P(M2|X)Posterior Odds=P(X|M1)P(X|M2)Bayes Factor× P(M1)P(M2)% Prior Odds (9)

The posterior odds will then be used to select the model. A posterior odds greater than one indicates that, given all the data you have observed, model $M_1$ is preferred to model $M_2$, whereas a posterior odds less than one indicates that model $M_2$ is preferred to model $M_1$.
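The update in Equation 9 is simple arithmetic, as the short sketch below illustrates. The Bayes Factor and prior probability used here are hypothetical numbers chosen only for the example.

```python
def posterior_odds(bayes_factor, prior_prob_m1):
    """Equation 9: posterior odds = Bayes Factor x prior odds."""
    prior_odds = prior_prob_m1 / (1.0 - prior_prob_m1)
    return bayes_factor * prior_odds

def odds_to_probability(odds):
    # Convert odds in favor of M1 into the probability of M1.
    return odds / (1.0 + odds)

bf = 20.0       # hypothetical Bayes Factor in favor of M1
prior_m1 = 0.1  # hypothetical prior probability of M1
odds = posterior_odds(bf, prior_m1)
print(odds, odds_to_probability(odds))
```

With these numbers the prior odds are 1/9, so the posterior odds are 20/9 and model $M_1$ is preferred even though it started out less probable than $M_2$.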

In the situation where the models involve parametric distributions, the Bayes Factor is generally given by the following expression:

$$\beta(X) = \frac{\int f(X \mid \theta, M_1)\, d\Pi(\theta \mid M_1)}{\int f(X \mid \theta, M_2)\, d\Pi(\theta \mid M_2)}. \tag{10}$$

This can be interpreted as a ratio of marginal (sometimes called integrated) likelihoods for the two competing models. Notice that the Bayes Factor itself contains a second set of prior distributions, denoted by $\Pi$, for the values of the parameters under each of the two competing models. These prior distributions are distinct from the prior odds for the models.

The precise form of the Bayes Factor for the common-source model selection problem described in Section 2.1 is derived under the assumption that the prior distribution of $\theta_a$ will be the same under both models, and is given by

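$$\beta_{cs}(X_N) = \frac{\int f(x_b, x_c \mid \theta_a, M_1)\, d\Pi(\theta_a \mid X_a)}{\int f(x_b \mid \theta_a, M_2)\, f(x_c \mid \theta_a, M_2)\, d\Pi(\theta_a \mid X_a)} \equiv \frac{m(x_b, x_c \mid X_a, M_1)}{m(x_b, x_c \mid X_a, M_2)}. \tag{11}$$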

Under the assumption that the prior distribution of $\theta_a$ is statistically independent of the prior distribution for $\theta_b$, the Bayes Factor for the specific-source model selection problem described in Section 2.2 is given by

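$$\beta_{ss}(X_N) = \frac{\int f(x_c \mid \theta_b, M_1)\, d\Pi(\theta_b \mid x_b)}{\int f(x_c \mid \theta_a, M_2)\, d\Pi(\theta_a \mid X_a)} \equiv \frac{m(x_c \mid x_b, M_1)}{m(x_c \mid X_a, M_2)}. \tag{12}$$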

The derivations of these two Bayes Factor expressions (Equations 11 and 12) are given in the appendix of Ommen et al. (2017). Bayes Factors in these forms are often very computationally intensive since the integrals rarely have closed-form solutions Kass and Raftery (1995); Newton and Raftery (1994); Ommen et al. (2017).

## 4 Relationships between the LR and the BF

As we previously mentioned, the likelihood ratio is most often used within the classical paradigm of statistics, while the Bayes Factor is most often associated with the subjective Bayesian paradigm. In this section, we provide some expressions which directly relate these statistics from the two different paradigms. These expressions will serve as the foundation for subscribers of one statistical paradigm to more effectively communicate their results to subscribers of differing paradigms. For example, imagine that a statistician working within the classical paradigm presents a model selection result, in the form of the likelihood ratio function, to a statistician working within the Bayesian paradigm. Then, the corresponding Bayes Factor for the model selection problem can be expressed as the expected value of the likelihood ratio function with respect to the Bayesian’s subjective posterior belief about the parameters of the sampling models given the entire collection of data under the second model. This is formalized in the following two equations for the common-source and specific-source model selection problems, respectively. The derivations of Equation 13 and Equation 14 can be found in Appendix A.

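$$\beta_{cs}(X_N) = \int \lambda_{cs}(\theta_a \mid x_b, x_c)\, d\Pi(\theta_a \mid X_N, M_2) \tag{13}$$

$$\beta_{ss}(X_N) = \int \lambda_{ss}(\theta_a, \theta_b \mid x_c)\, d\Pi(\theta_a, \theta_b \mid X_N, M_2) \tag{14}$$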

A similar expression can be derived using the first model for the data in the posterior distribution, with the introduction of two inverses. The derivations of Equation 15 and Equation 16 can also be found in Appendix A.

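$$\beta_{cs}(X_N) = \left[\int \frac{1}{\lambda_{cs}(\theta_a \mid x_b, x_c)}\, d\Pi(\theta_a \mid X_N, M_1)\right]^{-1} \tag{15}$$

$$\beta_{ss}(X_N) = \left[\int \frac{1}{\lambda_{ss}(\theta_a, \theta_b \mid x_c)}\, d\Pi(\theta_a, \theta_b \mid X_N, M_1)\right]^{-1} \tag{16}$$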

In these expressions, notice that the data with unknown source is included in the posterior distribution for the parameters. Another interesting thing to note about these forms for the Bayes Factor is that one of them will be computed using a misspecified model for the data depending on the truth of which model actually generated the evidence.
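The two ways of writing the common-source Bayes Factor can be checked numerically. The sketch below is a toy verification under assumptions made here, not the article's model: a univariate normal hierarchy with known variances, a single unknown parameter $\mu$ restricted to a grid with a flat prior (so every integral becomes a finite sum), and made-up data. It computes the Bayes Factor as a ratio of marginal likelihoods, as the posterior expectation of the likelihood ratio under $M_2$ (the form of Equation 13), and as the inverse of the posterior expectation of the reciprocal likelihood ratio under $M_1$ (the inverse form).

```python
import math

TAU2, SIGMA2 = 1.0, 0.1  # assumed known between- and within-source variances

def log_marginal(x, mu):
    # Log density of a set of observations from one random source under the
    # assumed model a ~ N(mu, TAU2), x_i | a ~ N(a, SIGMA2).
    n = len(x)
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    total = SIGMA2 + n * TAU2
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * (n - 1) * math.log(SIGMA2)
            - 0.5 * math.log(total) - 0.5 * ss / SIGMA2
            - 0.5 * n * (xbar - mu) ** 2 / total)

def normalize(log_w):
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    s = sum(w)
    return [v / s for v in w]

grid = [-4 + 0.01 * i for i in range(801)]        # grid of mu values
X_a = [[0.3, 0.5, 0.4], [-1.2, -1.0, -1.1],
       [1.6, 1.4, 1.5], [0.1, -0.1, 0.0]]          # hypothetical known sources
xb, xc = [2.0, 2.1, 1.9], [2.2, 1.8]               # hypothetical unknown-source sets

# Posterior of mu given X_a only (flat prior on the grid), up to a constant.
log_post_a = [sum(log_marginal(s, mu) for s in X_a) for mu in grid]
post_a = normalize(log_post_a)

log_f1 = [log_marginal(xb + xc, mu) for mu in grid]                      # M1
log_f2 = [log_marginal(xb, mu) + log_marginal(xc, mu) for mu in grid]    # M2
lam = [math.exp(a - b) for a, b in zip(log_f1, log_f2)]                  # LR function

# Bayes Factor as a ratio of marginal likelihoods given X_a.
bf = (sum(p * math.exp(l) for p, l in zip(post_a, log_f1))
      / sum(p * math.exp(l) for p, l in zip(post_a, log_f2)))

# Posterior expectation of the LR; the posterior includes xb, xc under M2.
post_m2 = normalize([a + l for a, l in zip(log_post_a, log_f2)])
bf_m2 = sum(p * l for p, l in zip(post_m2, lam))

# Inverse of the posterior expectation of 1/LR; posterior under M1.
post_m1 = normalize([a + l for a, l in zip(log_post_a, log_f1)])
bf_m1 = 1.0 / sum(p / l for p, l in zip(post_m1, lam))

print(bf, bf_m2, bf_m1)
```

On a finite grid the three quantities agree up to floating-point error, because the normalizing constants of the two posteriors are exactly the marginal likelihoods that appear in the ratio form.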

### 4.1 Asymptotic Results

In the reviewed literature, asymptotic properties of the Bayes Factor, particularly in the case of non-nested model selection, have been examined with respect to data from a single population representing a single source of information when the number of observations from that single population is growing. In this case, the Bayes Factor will diverge to positive infinity (in probability) when the model in the numerator is preferred and will converge to zero (in probability) when the model in the denominator is preferred Chib and Kuffner (2016). In the case of both the common-source and specific-source model selection problems, there are multiple sources of information and only the number of observations from a portion of these sources is allowed to grow. In this section, we examine the consistency of the Bayes Factor by way of a well-known result, the Bernstein-von Mises Theorem van der Vaart (1998). The Bernstein-von Mises Theorem is reproduced from van der Vaart (1998) for clarity in the supplementary material Ommen and Saunders (2018a). We will show that, under certain regularity conditions, the Bayes Factor for both the common-source and specific-source model selection problems will converge to the true likelihood ratio. These results are formalized in the theorems to follow. The notational modifications used to facilitate the asymptotic results and corresponding proofs are given in Appendix B.

###### Theorem 1 (Common-Source Bayes Factor Consistency).

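Given a fixed observation of $x_b$ and $x_c$, suppose that the likelihood ratio function is bounded in a neighborhood of $\theta_{a_0}$ and that the estimator of $\theta_a$ is a consistent estimator for $\theta_{a_0}$. Furthermore, suppose the assumptions of the Bernstein-von Mises Theorem are satisfied. Then as $n_a \to \infty$, the Bayes Factor converges in $P^{n_a}_{\theta_a}$-probability to the true likelihood ratio,

$$\beta_{cs}(X_{a,n_a}, x_b, x_c) \xrightarrow{\;P^{n_a}_{\theta_a}\;} \lambda_{cs}(\theta_{a_0} \mid x_b, x_c).$$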

The proof of this theorem is provided in Appendix B.

###### Theorem 2 (Specific-Source Bayes Factor Consistency).

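Given a fixed observation of $x_c$, suppose that the likelihood ratio function, $\lambda_{ss}(\theta \mid x_c)$ with $\theta = (\theta_a, \theta_b)$, is bounded in a neighborhood of $\theta_0 = (\theta_{a_0}, \theta_{b_0})$ and that the estimator of $\theta$ is a consistent estimator for $\theta_0$. Furthermore, suppose the assumptions of the Bernstein-von Mises Theorem are satisfied. Then the Bayes Factor converges in $P^{n}_{\theta}$-probability to the true likelihood ratio as the sizes of $X_{a,n_a}$ and $X_{b,n_b}$ grow,

$$\beta_{ss}(X_{a,n_a}, X_{b,n_b}, x_c) \xrightarrow{\;P^{n}_{\theta}\;} \lambda_{ss}(\theta_0 \mid x_c).$$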

The proof of this theorem is provided in Appendix B.

These results expand upon the foundation for members of one statistical paradigm to communicate more effectively with members of another paradigm. For example, imagine that a statistician working within the Bayesian paradigm presents a model selection result, in the form of the Bayes Factor, to a statistician working within the classical paradigm. Then, the likelihood ratio for the corresponding model selection problem can be thought of as the limit of the given Bayes Factor if more and more information is gathered from a subset of the data sources. Unfortunately, this is not a very useful result in practice since it is often infeasible to gather more information, and we rarely have large enough sample sizes for the value of the Bayes Factor to be used as a large-sample replacement for the true value of the likelihood ratio.
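The convergence described above can be illustrated with a small simulation. The sketch assumes a toy model that is not the one in the article: a univariate normal hierarchy with known variances, a conjugate normal prior on the single unknown parameter $\mu$, and simulated known-source data. The function names, the prior variance, and the sample sizes are all hypothetical, and the unknown-source observations stay fixed while only the number of known sources grows.

```python
import math
import random

TAU2, SIGMA2, MU0 = 1.0, 0.1, 0.0  # assumed variances and true population mean
M = 3                              # observations per known source (assumed)

def log_marginal(x, mu):
    # Log density of a set of observations from one random source.
    n = len(x)
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    total = SIGMA2 + n * TAU2
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * (n - 1) * math.log(SIGMA2)
            - 0.5 * math.log(total) - 0.5 * ss / SIGMA2
            - 0.5 * n * (xbar - mu) ** 2 / total)

def log_lambda_cs(mu, xb, xc):
    # Log of the common-source likelihood ratio function at mu.
    return log_marginal(xb + xc, mu) - log_marginal(xb, mu) - log_marginal(xc, mu)

def bayes_factor_cs(n_a, xb, xc, rng, prior_var=100.0):
    # With known variances, the posterior of mu given the known-source data
    # depends only on the n_a source means, each distributed N(mu, var_mean).
    var_mean = TAU2 + SIGMA2 / M
    source_means = [rng.gauss(MU0, math.sqrt(var_mean)) for _ in range(n_a)]
    post_prec = 1.0 / prior_var + n_a / var_mean   # prior mu ~ N(0, prior_var)
    post_mean = (sum(source_means) / var_mean) / post_prec
    post_sd = math.sqrt(1.0 / post_prec)
    # Quadrature on an even grid within 8 posterior standard deviations.
    k = 2001
    grid = [post_mean + post_sd * (-8 + 16 * i / (k - 1)) for i in range(k)]
    w = [math.exp(-0.5 * ((mu - post_mean) / post_sd) ** 2) for mu in grid]
    num = sum(wi * math.exp(log_marginal(xb + xc, mu)) for wi, mu in zip(w, grid))
    den = sum(wi * math.exp(log_marginal(xb, mu) + log_marginal(xc, mu))
              for wi, mu in zip(w, grid))
    return num / den

xb, xc = [2.0, 2.1, 1.9], [2.2, 1.8]  # fixed unknown-source observations
rng = random.Random(0)
true_lr = math.exp(log_lambda_cs(MU0, xb, xc))
for n_a in [5, 50, 500, 5000]:
    print(n_a, bayes_factor_cs(n_a, xb, xc, rng), true_lr)
```

The printed Bayes Factors should settle near the true likelihood ratio as the number of known sources increases, while for the smallest sample sizes the two can differ noticeably, which is the practical limitation discussed above.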

### 4.2 Bayesian Credible Intervals for the LR

Since the results from the previous section only provide theoretical information for transforming the Bayes Factor into the corresponding likelihood ratio, the purpose of this section is to provide statisticians with a practical way to relate the two. Credible intervals, particularly those used to determine posterior concentration rates, have recently been a popular research topic in the high-dimensional and non-parametric statistics literature (see Hoffmann et al. (2015), Rockova and van der Pas (2018), Donnet et al. (2018), and Nickl and Sohl (2017) for example). Using similar methods, we will create a credible interval for the value of the likelihood ratio derived from the posterior distribution for the parameters. The necessary notational modifications related to the result and corresponding proofs are given in Appendix C.

###### Theorem 3 (Approximate 1−α Credible Interval for the LR).

Let the assumptions of Lemma 3.1 and Lemma 3.2 hold and let

$$I_n = \beta(X_n) \pm \Phi^{-1}(1 - \alpha/2)\, \sigma_n$$

where $\beta(X_n)$ represents the sequence of either the common-source or the specific-source Bayes Factor, $\alpha$ is the desired significance level, $\sigma_n$ is the sequence of posterior standard errors for the likelihood ratio, and $\Phi^{-1}$ is the standard normal quantile function. Then as $n \to \infty$,

$$\Pi(\lambda(\theta) \in I_n \mid X_n, M_2) \to 1 - \alpha.$$

Please see Appendix C for Lemma 3.1 and Lemma 3.2 and for a proof of this theorem. We would like to note that we chose a form of the interval that is guaranteed to contain the posterior mean, in this case the Bayes Factor, instead of choosing an equal-tails or highest posterior density interval. In our experience, the posterior distributions may be so skewed that the Bayes Factor does not actually fall in the body of the posterior distribution, and therefore may not be contained in an interval constructed using these other methods. This would have serious implications in applications like forensic science.
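A minimal sketch of how the interval in Theorem 3 might be computed is given below. It assumes that the posterior distribution of the likelihood ratio under $M_2$ is available as a set of values with weights (for example, posterior draws of the parameters pushed through the likelihood ratio function), and it takes $\sigma_n$ to be the posterior standard deviation of those values; both the function name and that reading of $\sigma_n$ are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def lr_credible_interval(lam_values, weights, alpha=0.05):
    """Return (Bayes Factor, lower endpoint, upper endpoint).

    lam_values: likelihood ratio values lambda(theta) at posterior draws or
                grid points of theta under the second model.
    weights:    nonnegative posterior weights (equal weights for plain draws).
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    # The Bayes Factor is the posterior mean of the likelihood ratio.
    bf = sum(p * l for p, l in zip(probs, lam_values))
    # Posterior standard deviation of the likelihood ratio (assumed sigma_n).
    var = sum(p * (l - bf) ** 2 for p, l in zip(probs, lam_values))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(var)
    return bf, bf - half_width, bf + half_width

# Hypothetical posterior values of the likelihood ratio with equal weights.
bf, lower, upper = lr_credible_interval([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(bf, lower, upper)
```

Because the interval is symmetric about the Bayes Factor, its lower endpoint can fall below zero for skewed posteriors; the function returns the raw endpoints and leaves any truncation to the caller.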

The credible interval described above provides a range of probable likelihood ratio values that would correspond to the subjective, personal Bayes Factor provided for the non-nested model selection problem. Now, this method has a couple of drawbacks. The first is that this interval for a classical statistic relies on subjective Bayesian probabilities. The interval is highly dependent on the given value of the (possibly estimated) Bayes Factor and the corresponding prior distributions chosen. Another disadvantage of this method is that if the prior has not been chosen properly, the credible interval for the LR corresponding to the given Bayes Factor may not actually contain the true value of the likelihood ratio. In this way, the credible interval is easily misinterpreted. The interval must not be interpreted as a range of probable values for the true likelihood ratio (as if you had full, infinite data for every source of information), but instead should be interpreted as a range of probable values for the estimated likelihood ratio given the limited data and given the chosen prior distribution. This interpretation may be unsatisfying for the classical statistician who must rely on someone else’s subjective belief to determine a corresponding value of the estimated likelihood ratio. Finally, if the value of the Bayes Factor has been estimated, using Monte Carlo integration for example, it is unclear how, or even if, the computational error should be incorporated into this credible interval for the corresponding LR.

## 5 Example

One of the areas of forensic science where both the common-source and specific-source frameworks would straightforwardly apply is trace elemental analysis of glass evidence. Ommen et al. (2017) provide a set-up for both the common-source and specific-source Bayes Factors for the trace elemental composition of a dataset of 62 different window panes, each with 5 measured fragments of glass. This data set was originally collected by Dr. JoAnn Buscaglia of the Federal Bureau of Investigation Laboratory Division and analyzed by Aitken and Lucy (2004) using a multivariate analysis. For this example, the 16 window panes from the first group will be used as the data. Figure 3 provides the window ID numbers corresponding to the pairwise values for the transformed trace elemental compositions provided in Figure 4.

Figure 3: Pairwise plots of the mean elemental concentrations for each window in the first group along with the corresponding window identification number.

Figure 4: Pairwise plots of the elemental concentrations for each fragment in the first group along with the mean elemental concentrations for each associated window.

Directly computing the value of the likelihood ratio for this dataset for either the common-source or specific-source framework, given by Equation 5 and Equation 6, respectively, is impossible since we do not know the parameter values under either model. Also, approximating the value of the true likelihood ratio using the asymptotic results given by Theorem 1 or Theorem 2 is inadvisable since we have such small sample sizes. So, this is a case in which we would need to compute credible intervals for the likelihood ratio using Theorem 3 in order to compare the results from the two differing statistical paradigms. The Bayes Factors needed to compute the intervals are given by Equation 13 and Equation 15 for the common-source problem, or by Equation 14 and Equation 16 for the specific-source problem. Ommen et al. (2017) give the Bayes Factors in the two forms of Section 3.2 (Equations 11 and 12) and in the form of Equation 13, but the remaining three required forms follow readily.

In the interest of space, we will only consider the common-source example for this small glass dataset. Following the setup given in Ommen et al. (2017), the transformed trace elemental concentrations will be modeled by the same normal distributions, using the same normal prior distributions for the mean parameters and the same inverse Wishart prior distributions for the covariance parameters. Table 1 gives the value of the Bayes Factor, the posterior standard deviation, and the corresponding credible interval for the likelihood ratio for a sampling of window panes serving as one set of unknown-source observations, with window pane 10 serving as the other set, and using the second group of window panes to determine the prior hyperparameters. Table 2 gives the value of the Bayes Factor, the posterior standard deviation, and the corresponding credible interval for the likelihood ratio for the same sampling of window pairs serving as the data for the unknown-source observations, $x_b$ and $x_c$, but using the third group of window panes to determine the prior hyperparameters. This would be like having two different experts compute the same values using two different prior distributions. In both examples, the remaining 14 window panes are used for $X_a$. Details of the example are given in the supplementary material Ommen and Saunders (2018a).

As you can see from Table 1 and Table 2, the credible intervals for the likelihood ratios have been truncated at zero. We did this because likelihood ratios cannot be negative. Without truncation, the lower endpoints would be negative, which is due to the very small sample sizes that we are working with. In addition, it is possible using this method to have the credible intervals for the likelihood ratio from two different experts be non-overlapping. This means that Expert 1 and Expert 2 would likely disagree on the value of evidence given the limited size of the datasets, because they used different prior distributions. However, it is worth noting that neither of the intervals contains the neutral value of one, and so neither would present misleading evidence to a fact-finder. In the event that more data were collected, the intervals from the two experts would eventually overcome the disagreement caused by the differing priors. This is a direct result from Theorem 3. Exactly how much additional data would need to be collected to overcome the disagreement in a practical scenario may be estimated through simulation (although this is not directly considered in this example).

In this example, the glass dataset was collected for research purposes and was not intended to represent the type of evidence forensic scientists would see in casework. However, the two non-nested models provided in Ommen et al. (2017) can be applied to trace elemental analysis of glass evidence in general. In addition, the non-nested models can be modified to work for trace elemental analysis of any type of evidence, not just glass. For instance, the common-source and specific-source frameworks can be used for trace elemental analysis of ink Subedi et al. (2015), paper McGaw et al. (2009), soil Pye et al. (2006), and paint Hobbs and Almirall (2003) as well. Even more generally, these two frameworks can be applied to evidence measured by nearly any quantitative method. Once the appropriate framework has been set, all of the results relating the Bayes Factor to the likelihood ratio discussed in this article apply.

## 6 Conclusions

In this article, we have expressed the Bayes Factor for two non-nested model selection problems as the expected value of the corresponding likelihood ratio function with respect to the posterior distribution for the parameters given the entire set of data, where the set(s) of unknown-source observations has been generated according to the second model. This expression has led to a number of useful theoretic and practical results relating the Bayesian solution of these non-nested model selection problems to the classical solution. These non-nested model selection problems differ from those previously considered in other asymptotic results regarding the Bayes Factor in that more than one source of information is generating data. In addition, for the asymptotic results, only a portion of these datasets is allowed to grow in size while some observations remain fixed throughout. These results have strong implications for the field of forensic science, where the term “likelihood ratio” is used ubiquitously to mean a solution to the forensic identification of source problem, regardless of the style of statistics used in the analysis.

## Acknowledgements

We would like to thank Dr. JoAnn Buscaglia for providing the dataset used in the example and for the many helpful discussions throughout the years. Since the research presented in this article was performed as part of Dr. Ommen’s dissertation research at South Dakota State University, she would like to thank the members of her departmental Ph.D. committee, Dr. Kurt Cogswell and Dr. Cedric Neumann, for their support and guidance.

## References

• Aitken and Lucy (2004) Colin G. G. Aitken and David Lucy. Evaluation of trace evidence in the form of multivariate data. Journal of the Royal Statistical Society. Series C (Applied Statistics), 53(1):109–122, Jan. 2004.
• Anderson (2003) T. W. Anderson. An Introduction to Multivariate Statistical Analysis. John Wiley and Sons, Ltd., Hoboken, NJ, USA, 3rd edition, 2003.
• Ash and Doleans-Dade (2000) Robert B. Ash and Catherine A. Doleans-Dade. Probability and Measure Theory. Harcourt/Academic Press, San Diego, CA, USA, 2nd edition, 2000.
• Berger and Slooten (2016) Charles E.H. Berger and Klaas Slooten. The LR does not exist. Science and Justice, 56(5):388–391, 2016.
• Biedermann et al. (2016) A. Biedermann, S. Bozza, F. Taroni, and C. G. G. Aitken. Reframing the debate: A question of probability, not of likelihood ratio. Science and Justice, 56(5):392–396, 2016.
• Chib and Kuffner (2016) Siddhartha Chib and Todd A. Kuffner. Bayes Factor consistency. Technical report, https://arxiv.org/abs/1607.00292v1, July 2016.
• Donnet et al. (2018) Sophie Donnet, Vincent Rivoirard, Judith Rousseau, and Catia Scricciolo. Posterior concentration rates for empirical Bayes procedures with applications to Dirichlet process mixtures. Technical report, https://arxiv.org/abs/1406.4406v1, 2018.
• Eur (2015) ENFSI Guideline for Evaluative Reporting in Forensic Science. European Network of Forensic Science Institutes, 2015.
• Hobbs and Almirall (2003) Andria L. Hobbs and Jose R. Almirall. Trace elemental analysis of automotive paints by laser ablation-inductively coupled plasma-mass spectrometry (LA-ICP-MS). Analytical and Bioanalytical Chemistry, 376(8):1265–1271, 2003.
• Hoffmann et al. (2015) Marc Hoffmann, Judith Rousseau, and Johannes Schmidt-Hieber. On adaptive posterior concentration rates. Annals of Statistics, 43(5):2259–2295, 2015.
• Kafadar (2018) Karen Kafadar. The critical role of statistics in demonstrating the reliability of expert evidence. Fordham Law Review, 86(4):1617–1637, 2018.
• Kass and Raftery (1995) Robert E. Kass and Adrian E. Raftery. Bayes factors. Journal of the American Statistical Association, 90(430):773–795, Jun. 1995.
• Lund and Iyer (2017) Steven P. Lund and Hari Iyer. Likelihood ratio as weight of forensic evidence: A closer look. Journal of the Research of National Institute of Standards and Technology, 122(27):1–32, 2017.
• McGaw et al. (2009) Elizabeth A. McGaw, David W. Szymanski, and Ruth Waddell Smith. Determination of trace elemental concentrations in document papers for forensic comparison using inductively coupled plasma-mass spectrometry. Journal of Forensic Sciences, 54(5):1163–1170, 2009.
• Newton and Raftery (1994) Michael A. Newton and Adrian E. Raftery. Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society. Series B (Methodological), 56(1):3–48, 1994.
• Nickl and Sohl (2017) Richard Nickl and Jakob Sohl. Bernstein-von Mises theorems for statistical inverse problems II: Compound Poisson processes. Technical report, https://arxiv.org/abs/1709.07752v1, 2017.
• Ommen and Saunders (2018a) Danica M. Ommen and Christopher P. Saunders. Supplementary material to “Reconciling the Bayes Factor and Likelihood Ratio for Two Non-Nested Model Selection Problems”. 2018a.
• Ommen and Saunders (2018b) Danica M. Ommen and Christopher P. Saunders. Building a unified statistical framework for the forensic identification of source problems. Law, Probability, and Risk, 17:179–197, 2018b.
• Ommen et al. (2017) Danica M. Ommen, Christopher P. Saunders, and Cedric Neumann. The characterization of Monte Carlo errors for the quantification of the value of forensic evidence. Journal of Statistical Computation and Simulation, 87(8):1608–1643, 2017.
• Pye et al. (2006) Kenneth Pye, Simon J. Blott, and David S. Wray. Elemental analysis of soil samples for forensic purposes by inductively coupled plasma spectrometry — precision considerations. Forensic Science International, 160(2-3):178–192, 2006.
• Rockova and van der Pas (2018) Veronika Rockova and Stephanie van der Pas. Posterior concentration for Bayesian regression trees and forests. Technical report, https://arxiv.org/pdf/1708.08734v4.pdf, Jul 2018.
• Subedi et al. (2015) Kiran Subedi, Tatiana Trejos, and Jose R. Almirall. Forensic analysis of printing inks using tandem laser induced breakdown spectroscopy and laser ablation inductively coupled plasma mass spectrometry. Spectrochimica Acta Part B: Atomic Spectroscopy, 103-104:76–83, 2015.
• Swofford et al. (2018) H.J. Swofford, A.J. Koerntner, F. Zemp, M. Ausdemore, A. Liu, and M.J. Salyards. A method for the statistical interpretation of friction ridge skin impression evidence: method development and validation. Forensic Science International, 287:113–126, 2018.
• Taroni et al. (2016) Franco Taroni, Silvia Bozza, Alex Biedermann, and Colin G. G. Aitken. Dismissal of the illusion of uncertainty in the assessment of a likelihood ratio. Law, Probability, and Risk, 15(1):1–16, March 2016.
• van der Vaart (1998) A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, Cambridge, UK, 1998.
• van der Vaart and Wellner (2000) A. W. van der Vaart and Jon Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer, New York, NY, USA, 2000.
• Vuong (1989) Quang H. Vuong. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57(2):307–333, March 1989.

## Appendix A Bayes Factor Derivations

Further details of all the derivations that follow are given in the supplementary material Ommen and Saunders [2018a].

### A.1 Equation 13

This derivation is a generalization of the derivation given in Ommen et al. [2017].

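$$
\begin{aligned}
\beta_{cs}(X_N) &= \int \frac{f(x_b,x_c|\theta_a,M_1)}{m(x_b,x_c|X_a,M_2)}\, d\Pi(\theta_a|X_a) && \text{(17a)} \\
&= \int \frac{f(x_b,x_c|\theta_a,M_1)}{f(x_b,x_c|\theta_a,M_2)} \times \frac{f(x_b,x_c|\theta_a,M_2)}{m(x_b,x_c|X_a,M_2)}\, d\Pi(\theta_a|X_a) && \text{(17b)} \\
&= \int \frac{f(x_b,x_c|\theta_a,M_1)}{f(x_b,x_c|\theta_a,M_2)}\, d\Pi(\theta_a|X_a,x_b,x_c,M_2) && \text{(17c)} \\
&= \int \lambda_{cs}(\theta_a|x_b,x_c)\, d\Pi(\theta_a|X_N,M_2) && \text{(17d)}
\end{aligned}
$$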

### A.2 Equation 14

This derivation closely follows the derivation of Equation 13 given above.

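$$
\begin{aligned}
\beta_{ss}(X_N) &= \int \frac{f(X_a|\theta_a)\, f(x_b|\theta_b)\, f(x_c|\theta_b,M_1)}{m(X_a,x_b,x_c|M_2)}\, d\Pi(\theta_a,\theta_b) && \text{(18a)} \\
&= \int \frac{f(x_c|\theta_b,M_1)}{f(x_c|\theta_a,M_2)} \times \frac{f(X_a|\theta_a)\, f(x_b|\theta_b)\, f(x_c|\theta_a,M_2)}{m(X_a,x_b,x_c|M_2)}\, d\Pi(\theta_a,\theta_b) && \text{(18b)} \\
&= \int \frac{f(x_c|\theta_b,M_1)}{f(x_c|\theta_a,M_2)}\, d\Pi(\theta_a,\theta_b|X_a,x_b,x_c,M_2) && \text{(18c)} \\
&= \int \lambda_{ss}(\theta_a,\theta_b|x_c)\, d\Pi(\theta_a,\theta_b|X_N,M_2) && \text{(18d)}
\end{aligned}
$$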

### A.3 Equation 15

First, consider the reciprocal of the common-source Bayes Factor.

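$$
\begin{aligned}
\frac{1}{\beta_{cs}(X_N)} &= \int \frac{f(x_b,x_c|\theta_a,M_2)}{m(x_b,x_c|X_a,M_1)}\, d\Pi(\theta_a|X_a) && \text{(19a)} \\
&= \int \frac{f(x_b,x_c|\theta_a,M_2)}{f(x_b,x_c|\theta_a,M_1)} \times \frac{f(x_b,x_c|\theta_a,M_1)}{m(x_b,x_c|X_a,M_1)}\, d\Pi(\theta_a|X_a) && \text{(19b)} \\
&= \int \frac{f(x_b,x_c|\theta_a,M_2)}{f(x_b,x_c|\theta_a,M_1)}\, d\Pi(\theta_a|X_a,x_b,x_c,M_1) && \text{(19c)} \\
&= \int \frac{1}{\lambda_{cs}(\theta_a|x_b,x_c)}\, d\Pi(\theta_a|X_N,M_1) && \text{(19d)}
\end{aligned}
$$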

Therefore,

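$$
\beta_{cs}(X_N) = \left[\int \frac{1}{\lambda_{cs}(\theta_a|x_b,x_c)}\, d\Pi(\theta_a|X_N,M_1)\right]^{-1} \qquad \text{(20)}
$$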

### A.4 Equation 16

Similar to the common-source derivation above, consider the reciprocal of the specific-source Bayes Factor.

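$$
\begin{aligned}
\frac{1}{\beta_{ss}(X_N)} &= \int \frac{f(X_a|\theta_a)\, f(x_b|\theta_b)\, f(x_c|\theta_a,M_2)}{m(X_a,x_b,x_c|M_1)}\, d\Pi(\theta_a,\theta_b) && \text{(21a)} \\
&= \int \frac{f(x_c|\theta_a,M_2)}{f(x_c|\theta_b,M_1)} \times \frac{f(X_a|\theta_a)\, f(x_b|\theta_b)\, f(x_c|\theta_b,M_1)}{m(X_a,x_b,x_c|M_1)}\, d\Pi(\theta_a,\theta_b) && \text{(21b)} \\
&= \int \frac{f(x_c|\theta_a,M_2)}{f(x_c|\theta_b,M_1)} \times \frac{f(X_a,x_b,x_c|\theta_a,\theta_b,M_1)}{m(X_a,x_b,x_c|M_1)}\, d\Pi(\theta_a,\theta_b) && \text{(21c)} \\
&= \int \frac{f(x_c|\theta_a,M_2)}{f(x_c|\theta_b,M_1)}\, d\Pi(\theta_a,\theta_b|X_a,x_b,x_c,M_1) && \text{(21d)} \\
&= \int \frac{1}{\lambda_{ss}(\theta_a,\theta_b|x_c)}\, d\Pi(\theta_a,\theta_b|X_N,M_1) && \text{(21e)}
\end{aligned}
$$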

Therefore,

$$
\beta_{ss}(X_N) = \left[\int \frac{1}{\lambda_{ss}(\theta_a,\theta_b|x_c)}\, d\Pi(\theta_a,\theta_b|X_N,M_1)\right]^{-1} \qquad \text{(22)}
$$
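The two representations derived in this appendix can be checked numerically on a toy problem. The sketch below is not the glass example or either forensic model: it assumes a single scalar parameter with a standard normal distribution standing in for the prior given the background data, a single observation, and two hypothetical sampling models that differ only in their standard deviation. All names and values are illustrative assumptions. It compares the ratio of marginal likelihoods with the posterior expectation of the likelihood ratio under the second model, and with the inverse of the posterior expectation of the reciprocal likelihood ratio under the first model.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


# Hypothetical toy setup (not from the article).
X_OBS = 1.3                        # the single unknown-source observation
SD_M1, SD_M2 = 1.0, 2.0            # sampling standard deviations under M1, M2
PRIOR_MEAN, PRIOR_SD = 0.0, 1.0    # stand-in for the prior on theta


def prior(theta):
    return normal_pdf(theta, PRIOR_MEAN, PRIOR_SD)


def likelihood_ratio(theta):
    """Likelihood ratio function: f(x | theta, M1) / f(x | theta, M2)."""
    return normal_pdf(X_OBS, theta, SD_M1) / normal_pdf(X_OBS, theta, SD_M2)


def integrate(func, lo=-12.0, hi=12.0, steps=24000):
    """Midpoint-rule integration over a range that covers the prior mass."""
    width = (hi - lo) / steps
    return sum(func(lo + (i + 0.5) * width) for i in range(steps)) * width


# Marginal likelihoods under each model and their ratio (the Bayes Factor).
m1 = integrate(lambda t: normal_pdf(X_OBS, t, SD_M1) * prior(t))
m2 = integrate(lambda t: normal_pdf(X_OBS, t, SD_M2) * prior(t))
bayes_factor = m1 / m2

# Posterior expectation of the likelihood ratio under M2.
expected_lr = integrate(
    lambda t: likelihood_ratio(t) * normal_pdf(X_OBS, t, SD_M2) * prior(t) / m2
)

# Inverse of the posterior expectation of 1 / LR under M1.
harmonic_form = 1.0 / integrate(
    lambda t: (1.0 / likelihood_ratio(t)) * normal_pdf(X_OBS, t, SD_M1) * prior(t) / m1
)

print(bayes_factor, expected_lr, harmonic_form)
```

In this toy setting the three quantities agree up to numerical integration error, as the cancellation of the second-model (or first-model) likelihood in the derivations suggests they should.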

## Appendix B Consistency Proofs

The Bernstein-von Mises Theorem describes the contraction of the posterior distribution for the parameters as more and more relevant data are observed. It will be used to show that the Bayes Factor for each of the two non-nested model selection problems is consistent for the corresponding true likelihood ratio. A version of the Bernstein-von Mises Theorem from van der Vaart [1998] is reproduced in the supplementary material Ommen and Saunders [2018a] for ease of reference. Further details of the proofs in this section are also given in the supplementary material Ommen and Saunders [2018a].

Please observe the following notational extensions to facilitate the proof of the common-source theorem. First, let $X_{a,n_a}$ denote a sequence of random variables corresponding to the generation of datasets, where $n_a$ is the index that denotes the increasing number of subjects in the dataset with a fixed number of elements from within each subject. Also, let $P^{n_a}_{\theta_a}$ denote the joint probability measure on $X_{a,n_a}$ for all $\theta_a \in \Theta_a$, including $\theta_{a0}$. The two sets of unknown-source observations, $x_b$ and $x_c$, will retain the previous notation since the size and the observation of these datasets will remain fixed. Now, consider the maximum likelihood estimator for $\theta_a$ that would be computed using the entire dataset when $x_b$ and $x_c$ are generated under the model $M_2$. Finally, let $\beta_{cs}(X_{a,n_a}, x_b, x_c)$ denote the random function corresponding to the Bayes Factor given in Equation 13 before the value of $X_{a,n_a}$ has been observed.

###### Theorem (Common-Source Bayes Factor Consistency).

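Given a fixed observation of $x_b$ and $x_c$, suppose that the likelihood ratio function $\lambda_{cs}(\theta_a|x_b,x_c)$ is bounded in a neighborhood of $\theta_{a0}$ and that the maximum likelihood estimator is a consistent estimator for $\theta_{a0}$. Furthermore, suppose the assumptions of the Bernstein-von Mises Theorem are satisfied. Then as $n_a \to \infty$, the Bayes Factor converges in $P^{n_a}_{\theta_a}$-probability to the true likelihood ratio,

$$
\beta_{cs}(X_{a,n_a}, x_b, x_c) \xrightarrow{\;P^{n_a}_{\theta_a}\;} \lambda_{cs}(\theta_{a0}|x_b,x_c).
$$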
###### Proof.

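First, let $X_{n_a}$ denote the random variable associated with generating the entire dataset, where $X_{a,n_a}$ generates the observations from the population of sources and where the observations $x_b$ and $x_c$ are already fixed. Also, let $\delta_{\theta_{a0}}$ denote the probability measure that is degenerate at $\theta_{a0}$.

$$
\left|\beta_{cs}(X_{n_a}) - \lambda_{cs}(\theta_{a0}|x_b,x_c)\right| = \left|\int \lambda_{cs}(\theta_a|x_b,x_c)\, d\left[\Pi(\theta_a|X_{n_a},M_2) - \delta_{\theta_{a0}}(\theta_a)\right]\right|
$$

Note that $\Pi(\theta_a|X_{n_a},M_2) - \delta_{\theta_{a0}}(\theta_a)$ is a sequence of signed measures. For any signed measure $\mu$ and bounded function $g$ it follows that $\left|\int g\, d\mu\right| \le \sup|g|\, \|\mu\|_{TV}$, where $\|\cdot\|_{TV}$ is the total variation norm Ash and Doleans-Dade [2000]. Now, by the assumption that the likelihood ratio function is bounded, let $|\lambda_{cs}(\theta_a|x_b,x_c)| \le C$ for some real number $C$. Therefore,

$$
\left|\beta_{cs}(X_{n_a}) - \lambda_{cs}(\theta_{a0}|x_b,x_c)\right| \le C \left\|\Pi(\theta_a|X_{n_a},M_2) - \delta_{\theta_{a0}}(\theta_a)\right\|_{TV}.
$$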

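Now, let $\Phi(\theta_a|X_{n_a},M_2)$ denote the probability measure corresponding to the normal distribution with mean given by the maximum likelihood estimator and variance determined by the corresponding inverse of the observed Fisher’s information matrix. Then by the triangle inequality, we obtain

$$
\left\|\Pi(\theta_a|X_{n_a},M_2) - \delta_{\theta_{a0}}(\theta_a)\right\|_{TV} \le \left\|\Pi(\theta_a|X_{n_a},M_2) - \Phi(\theta_a|X_{n_a},M_2)\right\|_{TV} + \left\|\Phi(\theta_a|X_{n_a},M_2) - \delta_{\theta_{a0}}(\theta_a)\right\|_{TV}.
$$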

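By the Bernstein-von Mises Theorem, as $n_a \to \infty$,

$$
\left\|\Pi(\theta_a|X_{n_a},M_2) - \Phi(\theta_a|X_{n_a},M_2)\right\|_{TV} \xrightarrow{\;P^{n_a}_{\theta_a}\;} 0.
$$

By the assumption that the maximum likelihood estimator is consistent, and provided that the inverse of the observed Fisher’s information matrix is bounded in $P^{n_a}_{\theta_a}$-probability in a neighborhood of $\theta_{a0}$, this implies that as $n_a \to \infty$,

$$
\left\|\Phi(\theta_a|X_{n_a},M_2) - \delta_{\theta_{a0}}(\theta_a)\right\|_{TV} \xrightarrow{\;P^{n_a}_{\theta_a}\;} 0.
$$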
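The consequence of the argument above, that the posterior expectation of a bounded likelihood ratio function approaches the likelihood ratio at the true parameter as the posterior contracts, can be illustrated numerically. The sketch below is a simplification under stated assumptions rather than a reproduction of the proof: the posterior is replaced by a normal distribution centered exactly at a hypothetical true value with variance $1/n$, and the likelihood ratio function compares two hypothetical normal sampling models for one fixed observation. All names and values are illustrative.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


X_FIXED = 1.3    # fixed unknown-source observation (hypothetical)
THETA_0 = 0.5    # hypothetical true parameter value


def likelihood_ratio(theta):
    """Bounded toy likelihood ratio: N(x; theta, 1) over N(x; theta, 2)."""
    return normal_pdf(X_FIXED, theta, 1.0) / normal_pdf(X_FIXED, theta, 2.0)


def bayes_factor_normal_posterior(n, steps=20000):
    """Expectation of the likelihood ratio under a N(THETA_0, 1/n) posterior.

    Assumption: the posterior is replaced by its normal approximation,
    centered at the true value, to keep the example deterministic.
    """
    post_sd = 1.0 / math.sqrt(n)
    lo = THETA_0 - 8.0 * post_sd
    width = 16.0 * post_sd / steps
    total = 0.0
    for i in range(steps):
        theta = lo + (i + 0.5) * width  # midpoint rule
        total += likelihood_ratio(theta) * normal_pdf(theta, THETA_0, post_sd)
    return total * width


true_lr = likelihood_ratio(THETA_0)
sample_sizes = (10, 100, 1000, 10000)
errors = {n: abs(bayes_factor_normal_posterior(n) - true_lr) for n in sample_sizes}

for n in sample_sizes:
    print(n, errors[n])
```

In this toy setting the absolute difference shrinks as $n$ grows, which is the behavior the consistency theorem describes for the Bayes Factor.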

Please observe the following notational extensions to facilitate the proof of the specific-source theorem. First, let $X_{a,n_a}$ be defined in the same way as for the common-source model selection problem. Similarly, let $X_{b,n_b}$ denote a sequence of random variables corresponding to the generation of data from the fixed subject of interest, where $n_b$ is the index that denotes the increasing number of elements from within that subject. For simplicity, we will fix $n_a = n_b = n$ so that the number of subjects in the population increases in the exact same way as the number of elements from the subject of interest. The proofs of the results can easily be modified to accommodate more flexible relationships between the two sample sizes. In addition, let $P^{n}_{\theta}$ denote the joint probability measure on $X_{a,n_a}$ and $X_{b,n_b}$ for all $\theta \in \Theta$, where $\Theta$ is the joint parameter space, including $\theta_0$, which denotes the true value of the joint parameter. The set of unknown-source observations will retain the notation $x_c$ since the size and the observation of this set will remain fixed. Now, consider the maximum likelihood estimate computed using the entire dataset when $x_c$ is generated under the model $M_2$. Finally, let $\beta_{ss}(X_{a,n_a}, X_{b,n_b}, x_c)$ denote the random function corresponding to the Bayes Factor given in Equation 14 before the values of $X_{a,n_a}$ and $X_{b,n_b}$ have been observed.

###### Theorem (Specific-Source Bayes Factor Consistency).

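Given a fixed observation of $x_c$, suppose that the likelihood ratio function, $\lambda_{ss}(\theta|x_c)$, is bounded in a neighborhood of $\theta_0$ and that the maximum likelihood estimator is a consistent estimator for $\theta_0$. Furthermore, suppose the assumptions of the Bernstein-von Mises Theorem are satisfied. Then the Bayes Factor converges in $P^{n}_{\theta}$-probability to the true likelihood ratio as $n \to \infty$,

$$
\beta_{ss}(X_{a,n_a}, X_{b,n_b}, x_c) \xrightarrow{\;P^{n}_{\theta}\;} \lambda_{ss}(\theta_0|x_c).
$$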
###### Proof.

Similar to the proof of Theorem 1. ∎

## Appendix C Credible Interval Proofs

Before we begin the proofs of the results for the credible intervals, it will be necessary to define some notational conventions related to the generation of the data. To start, let $P^{n_a}_{\theta_a}$ denote the joint probability measure on $X_{a,n_a}$ for all $\theta_a \in \Theta_a$, including $\theta_{a0}$ (for both the common-source and specific-source problems). Similarly, for the specific-source problem only, let $P^{n_b}_{\theta_b}$ denote the joint probability measure on $X_{b,n_b}$ for all $\theta_b \in \Theta_b$, including $\theta_{b0}$. Next, let $X_n$ denote the random variable associated with generating the entire dataset under either of the non-nested model selection problems. For the common-source problem, this means that $X_{a,n_a}$ generates the observations from the population of sources and that the observations $x_b$ and $x_c$ are already fixed. For the specific-source problem, this means that $X_{a,n_a}$ generates the observations from the larger population of subjects, $X_{b,n_b}$ generates the observations from the population associated with the single subject of interest, and $x_c$ is already fixed. Furthermore, let $P^{n}_{\theta}$ denote the joint probability measure on $X_n$ for all $\theta \in \Theta$, including $\theta_0$, where $\Theta$ is the joint parameter space.

Additional notational conventions needed to understand the proofs of the results for the credible intervals will now be considered. Let $\lambda(\theta)$ denote the likelihood ratio function for either the common-source or the specific-source problem given by Equation 5 or Equation 6, respectively. Moreover, let $\Phi(\theta|X_n,M_2)$ denote the sequence of probability measures corresponding to the normal distribution with mean vector $\hat{\theta}_{n|M_2}$ and a covariance matrix of $n^{-1} I^{-1}_{\hat{\theta}_{n|M_2}}$. Also, let $\hat{\theta}_{n|M_2}$ denote the maximum likelihood estimator computed from the entire collection of data where the unknown-source set(s) of observations are generated according to the model $M_2$, and let $I^{-1}_{\hat{\theta}_{n|M_2}}$ denote the corresponding inverse of the observed Fisher’s information matrix. By properties of M-estimators, $\hat{\theta}_{n|M_2}$ can be designed in such a way that it is a consistent estimator of $\theta_0$ van der Vaart [1998], van der Vaart and Wellner [2000]. Next, let $\gamma^2_{n|M_2}$ be defined such that

$$
\gamma^2_{n|M_2} = \lambda'(\hat{\theta}_{n|M_2})^T\, I^{-1}_{\hat{\theta}_{n|M_2}}\, \lambda'(\hat{\theta}_{n|M_2})
$$

where $\lambda'(\hat{\theta}_{n|M_2})$ is the vector of first partial derivatives of the likelihood ratio function evaluated at $\hat{\theta}_{n|M_2}$. Finally, let $\beta(X_n)$ represent either the sequence of common-source or specific-source Bayes Factors. It should be noted that further details of the proofs in this section are given in the supplementary material Ommen and Saunders [2018a].
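The quadratic form that defines the squared scale quantity above can be computed directly once a gradient of the likelihood ratio function and an inverse information matrix are available. The following is a minimal sketch under assumptions: the gradient is approximated by central finite differences, and the likelihood ratio function, the estimate, and the inverse information matrix are all hypothetical stand-ins chosen so the result is easy to verify by hand.

```python
import math


def numerical_gradient(func, theta, step=1e-6):
    """Central finite-difference approximation to the gradient of func."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += step
        down[i] -= step
        grad.append((func(up) - func(down)) / (2.0 * step))
    return grad


def gamma_squared(lr_func, theta_hat, info_inverse):
    """Quadratic form grad^T * info_inverse * grad evaluated at theta_hat."""
    grad = numerical_gradient(lr_func, theta_hat)
    dim = len(grad)
    return sum(
        grad[i] * info_inverse[i][j] * grad[j]
        for i in range(dim)
        for j in range(dim)
    )


def toy_likelihood_ratio(theta):
    """Hypothetical smooth function with gradient (1, 2) at the origin."""
    return math.exp(theta[0] + 2.0 * theta[1])


THETA_HAT = [0.0, 0.0]                     # hypothetical estimate
INFO_INVERSE = [[2.0, 0.0], [0.0, 0.5]]    # hypothetical inverse information

gamma_sq = gamma_squared(toy_likelihood_ratio, THETA_HAT, INFO_INVERSE)
print(gamma_sq)
```

For this stand-in the gradient at the origin is $(1, 2)$, so the quadratic form is $2 \cdot 1^2 + 0.5 \cdot 2^2 = 4$. In practice an analytic gradient would usually be preferable to finite differences when one is available.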

###### Lemma 3.1.

For a given observation of the unknown-source set(s) of observations, suppose that the likelihood ratio function, , is twice continuously differentiable, and that is a consistent estimator for under -probability. If the assumptions of the Bernstein-von Mises Theorem hold, then converges to zero in -probability as .

###### Proof.

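Consider the following result based on the Taylor series expansion of $\lambda(\theta)$ about the maximum likelihood estimate $\hat{\theta}_{n|M_2}$:

$$
\sqrt{n}\left[\lambda(\theta) - \lambda(\hat{\theta}_{n|M_2})\right] = \sqrt{n}\,(\theta - \hat{\theta}_{n|M_2})^T \lambda'(\hat{\theta}_{n|M_2}) + \frac{\sqrt{n}}{2}\,(\theta - \hat{\theta}_{n|M_2})^T \lambda''(\tilde{\theta}_{n|M_2})\,(\theta - \hat{\theta}_{n|M_2})
$$

where $\lambda'(\hat{\theta}_{n|M_2})$ is the vector of first partial derivatives of $\lambda(\theta)$, $\lambda''(\tilde{\theta}_{n|M_2})$ is the matrix of second partial derivatives of $\lambda(\theta)$, and $\tilde{\theta}_{n|M_2}$ is a value on the line between $\theta$ and $\hat{\theta}_{n|M_2}$. Now, consider the error term of the expansion given by

$$
\frac{1}{2}\left[\sqrt{n}\,(\theta - \hat{\theta}_{n|M_2})\right]^T \left[\frac{1}{\sqrt{n}}\,\lambda''(\tilde{\theta}_{n|M_2})\right] \left[\sqrt{n}\,(\theta - \hat{\theta}_{n|M_2})\right].
$$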

The Bernstein-von Mises Theorem implies that the