A General Model Validation and Testing Tool

by Kevin Vanslette, et al.

We construct and propose the "Bayesian Validation Metric" (BVM) as a general model validation and testing tool. We find the BVM to be capable of representing all of the standard validation metrics (square error, reliability, probability of agreement, frequentist, area, statistical hypothesis testing, and Bayesian model testing) as special cases and find that it can be used to improve, generalize, or further quantify their uncertainties. Thus, the BVM allows us to assess the similarities and differences between existing validation metrics in a new light. The BVM may be used to select models according to novel model validation comparison measures. We constructed the BVM ratio for the purpose of quantifying model selection under arbitrary definitions of agreement. This construction generalizes the Bayesian model testing framework. As an example of the versatility and effectiveness of our method, we formulated a quantitative comparison function to represent the visual inspection an engineer might use to validate a model. The BVM ratio leads to the correct selection of the preferable model in both the completely certain and uncertain cases.




1 Introduction

Pivotal to the scientific and engineering process is the testing of hypotheses and the validation of mathematical and computational models. Without a validation procedure, there is no reason to believe that a model, which has been designed to serve as a convenient representation of physical and/or human-made processes, has fulfilled its functional purpose. Thus, validation is the result of the positive justification of a model’s representation of relevant features in the real world [1, 2].

We are interested in studying the validation of multivariate computational models that represent uncertain situations and/or data. It is understood that complete certainty is a special case of uncertainty. The uncertainty in a model or data set may originate from stochasticity, model parameter and input data uncertainty, measurement uncertainty, or other possible aleatoric or epistemic sources of uncertainty. Each of the following data modeling schemes may include quantifiable amounts of uncertainty (or certainty) that we would like to validate on the basis of a set of validation data: neural networks and AI models, machine learning models, Gaussian Process Regression models [3], polynomial chaos and other surrogate models [4, 5, 6, 7], spatial and time series stochastic models, physics based models (usually solutions to differential equations), engineering based models (which are sufficiently abstracted physics based models), Monte Carlo simulation models [8, 9], and more. Model output uncertainties may be quantified through uncertainty propagation techniques (that may or may not include verification, calibration, and validation) [3, 5, 6, 7, 10, 11, 12, 13, 14, 15, 16].

There exist several validation metrics. Each metric is designed to compare features of a model-data pair to quantify validation: square error compares the difference in the data and model values in a point to point fashion, the reliability metric [17] and the probability of agreement [18] compare continuous model outputs and data expectation values (the model reliability metric was extended past expectation values in [19]), the frequentist validation metric [20, 21] and statistical hypothesis testing compare data and model test statistics, the area metric compares the cumulative distribution of the model to the estimated cumulative distribution of the data [14, 22], and Bayesian model testing compares the posterior probability that each model would correctly output the observed data [12, 11, 23, 24, 25, 26, 27]. A detailed review of the majority of these metrics may be found in [28] and the references therein.

To assist the comparison of the positive and negative aspects of (most of) the above validation metrics, reference [28] outlines six “desirable validation criteria” that a validation metric might have (they extend [20, 22]). One conclusion from [28] is that none of the available metrics simultaneously satisfies all six desirable validation criteria. We summarize the most important features of the desirable validation criteria with the following validation criterion:

  1. A validation metric should be a statistically quantified quantitative measure (as opposed to a qualitative measure) of the agreement between (general) model predictions and data pairs, in the presence or absence of uncertainty.

The desire for objectivity, “that a metric will produce the same assessment for every analyst independent of their individual preferences” [28], is difficult to satisfy because there are no rules in place to guide a modeler toward selecting one validation metric over another. For this reason the individual might simply choose a metric based on their preferences, or worse, be tempted to base their decision on which validation metric gives them the most favorable evaluation. Given individuals may choose different validation metrics for the same model-data pair, it is possible for individuals to impose accuracy requirements that are incompatible with one another and arrive at different conclusions regarding the validity of the same model-data pair. As the final goal is objectivity, when possible, a map between the accuracy requirements should be constructed such that the validation metrics yield consistent evaluations of model-data validity when applicable.

Further, Liu et al. [28] suggest that there is no agreed upon unified model-data comparison function. Even including the results of this article, we expect this statement to hold as it is extremely difficult to guess the prior information about the utility of a model an analyst may be required to include in the validation of a model-data pair. For instance, given arbitrary data, one may ask: “What features of the data are relevant to capture with a model?”, “Of these features, are some more relevant than others?”, and “What accuracy is required for the model to be valid?”. Agreement and validation are ultimately human-made concepts designed for the purpose of expressing that “in general, not every feature or statistic between a model-data pair needs to be equal to conserve the utility of the model”. For some model-data pairs, all that may be required is that the model and data averages closely match within uncertainty, while for others, one may require that the model can accurately reproduce the probability distribution of a data set as a whole (as one would do to physically model a noisy measurement device). Given the wide variety of data and the large number of different inferences (and thus models and hypotheses) that one may be interested in drawing from a given data source, i.e. the context of the model-data pair, we do not expect any single set of comparison functions, statistics, or values to be equally relevant and maximally useful for all possible model-data contexts. This, however, does not stop us from quantifying the validity of a model-data pair given any arbitrary comparison function and with any arbitrary definition of agreement.

In this article, we construct and propose the “Bayesian Validation Metric” (BVM) as a general model validation and testing tool. An example validation scenario the BVM can quantify is depicted in Figure 1. We design the BVM to adhere to the desired validation criterion (1.) by using “four BVM inputs”: the model and data comparison values, the model output and data pdfs, the comparison value function, and the agreement function. The comparison value function is a function of model output and data comparison values that provides the desired quantitative comparison measure, e.g. square difference. Using the model output pdf and the data pdf, the value of the comparison value function is statistically quantified. In turn, the agreement function provides an accept/reject rule and effectively wraps the previous three BVM inputs together to give the BVM. From this, the BVM outputs “the probability the model and data agree”, where agreement is a user defined Boolean function that meets, or does not meet, accuracy requirements between model and data comparison values. Thus, the BVM meets the desired validation criterion (1.) for arbitrary comparison value functions, arbitrary definitions of agreement, and in principle for arbitrary data types such as integers, vectors, tensors, strings, pictures, or others.

The BVM can be used to represent all of the aforementioned validation metrics as special cases. This allows us to compare and contrast the validation metrics from the literature in a new light. We find the conditions under which several of the current validation metrics are effectively equal to one another, which improves the objectivity of the current validation procedure. In brief, we find that the frequentist metric (using natural definitions of agreement) is equal to the reliability metric and the probabilities from Bayesian model testing are equal to the probabilities of the improved model reliability metric [19] when one demands exact equality of the model-data comparison values. Because probability can represent both certain and uncertain situations, so can the BVM. Thus, these “special case” metrics can be generalized to quantify certain or uncertain cases, and even be combined into more complex validation requirements using the BVM framework. The BVM therefore provides a standardized framework to improve, generalize, or further quantify these validation metrics.

Figure 1: This is the depiction of a common model validation scenario. The model line is trained on noisy data (not depicted in the figure) and is to be compared to a set of validation data. As both the model line and the data are uncertain in general, any quantitative measure (i.e. the comparison function) between these comparison values inherits this uncertainty. Thus, any accept/reject rule on the basis of these uncertain comparison function values is uncertain as well. A visual inspection of this graph seems to indicate, up to statistical fluctuation, that the comparison values of the model and data more or less (or probably) agree, but this intuitive measure has yet to be quantified. Graphic adapted from [29].

By constructing the “BVM ratio”, we generalize the Bayesian model testing framework [23]. In the Bayesian model testing framework, one constructs the Bayes ratio to rank models according to the ratio of their posterior probabilities given the data. We show that these posterior probabilities are equal to a special case of the BVM under the definition of agreement that requires these uncertain model outputs and data to match exactly. Thus, nothing prevents us from extending the logic used in the Bayesian model testing framework to our framework and we construct the BVM ratio for the purpose of model selection under arbitrary definitions of agreement, i.e. for arbitrary validation scenarios.

The remainder of the article continues as follows. In Section 2 we construct the BVM by following our validation criterion. Through some edge cases, we show that the BVM satisfies both the six desirable validation criteria from [28] as well as our validation criterion (1.) in Section 3. In Section 4, we summarize the results derived in Appendix A. Appendix A incorporates all of the above standard validation metrics as special cases of the BVM, draws relationships between several of the validation metrics, provides improvements and generalizations to these metrics as is suggested by the functional form of the BVM, and constructs the BVM ratio. In Section 5 we simulate an example BVM model selection scenario using a quantitative representation of the visual inspection an engineer might perform in both completely certain and uncertain settings. This demonstrates the versatility and effectiveness of our method.

2 The Bayesian Validation Metric

For the remainder of the article we will use the following notation and language. We will let $\hat{y}$ denote the output of a model, $\hat{Y}$ a set of model outputs, $y$ a data point, and $Y$ a set of data points. The proposition $M$ essentially stands for “the model” or “coming from the model”, and $D$ stands for “the experiment” or “coming from the experiment”. We let $\hat{z}$ and $z$ represent the comparison quantities of interest, which pertain to the model and the data respectively. Further, we let the comparison quantities take general forms, such as multidimensional vectors, functions, or functionals (e.g. output values, expectation values, pdfs, …), so we can represent any such pair of quantities we may wish to compare between the model and the data. When we refer to “the four BVM inputs” we mean: the comparison values $(\hat{z}, z)$, the model output and data pdfs $\rho(\hat{z}|M)$ and $\rho(z|D)$, the comparison value function $f(\hat{z}, z)$, and the agreement function $B(f)$. The (denoted) integrals may be integrals or sums depending on the nature of the variable being summed or integrated over, which is to be understood from the discrete or continuous context of the inference at hand. The dot “$\cdot$” represents standard multiplication, which is mainly used to improve aesthetics.

Performing uncertainty propagation through a model results in a model output probability (density) distribution $\rho(\hat{y}|M)$ that ultimately we would like to validate by comparing it to an uncertain validation data source $\rho(y|D)$, to see if they agree (as depicted in Figure 1). The immediate question is, however, “What values do we want to compare and what do we mean by agree?”. Given the wide variety of data and the large number of different inferences (and thus models and hypotheses) that one may be interested in drawing from a given data source, i.e. the context of the model-data pair, we do not expect any single set of comparison functions to be equally relevant and maximally useful for all possible model-data contexts. In light of this, we instead quantify the validity of a model-data pair given any arbitrary comparison value function and with any arbitrary definition of agreement.

2.1 Derivation

Here we will begin constructing the Bayesian Validation Metric (BVM). To capture the concept of what we might mean by agree, we define $\hat{z}$ and $z$ to agree, $A$, when the Boolean expression, $B(f(\hat{z}, z))$, is true. Both $A$ and $B$ are defined by the modeler and their prior knowledge of the context of the model-data pair. Naturally then, the agreement function $B$ is some function or functional of a comparison value function $f$.

Given the values of $\hat{z}$ and $z$ are known, i.e. certain, we quantify agreement using a probability distribution that assigns certainty,

$$p(A|\hat{z}, z) = \Theta\big(B(f(\hat{z}, z))\big). \tag{1}$$

The indicator function $\Theta$ is defined to equal unity if $B$ evaluates to “true” (i.e. “agreeing”) and equal to zero otherwise. Thus in the completely certain case, we are certain as to whether the model and data comparison values agree, or do not agree, as defined by $B$ and the deterministic evaluation of $f$ (see footnote 1). We will call $p(A|\hat{z}, z)$ the “agreement kernel”.

Footnote 1: This binary yet probabilistic definition of agreement turns out to be completely satisfactory for our current purposes. One could imagine representing an amount of partial agreement (as opposed to probabilistic agreement as it is now and will further become) with a function of $\hat{z}$ and $z$ that is not a probability and is instead a partial membership function employing fuzzy logic. This would allow for a non-binary ranking of more or less agreeing. A potentially interesting alternative would be to interpret normalized comparison value functions as membership functions in the current framework and in the literature, although we require no such interpretation here.
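As a minimal sketch of the agreement kernel in the completely certain case, the snippet below evaluates the indicator of a Boolean agreement function applied to a comparison value. The absolute-difference comparison function, the tolerance of 0.5, and all function names are illustrative assumptions rather than choices made in this article.

```python
def agreement_kernel(z_hat, z, compare, agrees):
    """Indicator of agreement: 1.0 if agrees(compare(z_hat, z)) is true, else 0.0."""
    return 1.0 if agrees(compare(z_hat, z)) else 0.0


def abs_diff(z_hat, z):
    """Hypothetical comparison value function f: absolute difference."""
    return abs(z_hat - z)


def within_tol(f, eps=0.5):
    """Hypothetical Boolean agreement function B: true when f <= eps."""
    return f <= eps


# With definite (certain) comparison values the kernel is either 0 or 1.
print(agreement_kernel(2.0, 2.3, abs_diff, within_tol))
print(agreement_kernel(2.0, 3.0, abs_diff, within_tol))
```

Swapping in a different comparison function or accept/reject rule changes only the two helper functions, which mirrors the separation between the comparison value function and the agreement function.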

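Given that in general the comparison values are uncertain, and quantified by $\rho(\hat{z}, z|M, D)$, the probability the comparison values agree, as defined by $B$ and $f$, is equal to,

$$p(A|M, D) = \int p(A|\hat{z}, z)\,\rho(\hat{z}, z|M, D)\,d\hat{z}\,dz \tag{2}$$

$$= \int \rho(\hat{z}|M)\cdot\Theta\big(B(f(\hat{z}, z))\big)\cdot\rho(z|D)\,d\hat{z}\,dz, \tag{3}$$

which is a marginalization over the spaces of $\hat{z}$ and $z$ (see footnote 2).

Footnote 2: Recall that the propositions in the probability distributions $\rho(\hat{z}|M)$ and $\rho(z|D)$ are completely arbitrary (in some cases requiring propagation from $\rho(\hat{y}|M)$ and $\rho(y|D)$); they could be continuous, discrete (with order), categorical variables (no well defined order, e.g. strings, pictures, …), or a mix.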

Equation (2) is the general form of the Bayesian Validation Metric (BVM). Because $A$ is discrete, the BVM is a probability rather than a probability density and it therefore falls in the range $[0, 1]$. Equation (3) explicitly assumes that the uncertainty in the data is independent of the model, i.e. $\rho(z|\hat{z}, M, D) = \rho(z|D)$, that the data does not take $\hat{z}$ or the model (that it is currently being compared to) as inputs (see footnote 3). The BVM may be given a geometric interpretation as the projection of two probability vectors (potentially of unequal length) in a space whose overlap is defined by the agreement kernel – this is an inner product if $p(A|\hat{z}, z)$ is symmetric in its arguments. The BVM may be computed using any of the well known computational integration methods.

Footnote 3: In a controls system this may not be the case, as the model may interact with the system of interest. In such a case this constraint may be lifted and one should use (2) instead. The joint probability $\rho(\hat{z}, z|M, D)$ can, in principle, be used to account for the correlations between the model (the controller or reference) and the data (the measured response of the system being controlled) in a controls setting. This is a relatively common scenario so it is stated explicitly.
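One of the simplest integration methods for the independent form, equation (3), is Monte Carlo sampling: draw model and data comparison values from their pdfs and average the agreement indicator. The sketch below assumes Gaussian pdfs, an absolute-difference comparison function, and a tolerance of 0.5; these choices and the function name `bvm_monte_carlo` are hypothetical and serve only to illustrate the computation.

```python
import numpy as np


def bvm_monte_carlo(sample_model, sample_data, compare, agrees, n=200_000, seed=0):
    """Estimate the probability of agreement by averaging the agreement indicator."""
    rng = np.random.default_rng(seed)
    z_hat = sample_model(rng, n)   # draws from the model comparison-value pdf
    z = sample_data(rng, n)        # independent draws from the data pdf
    return float(np.mean(agrees(compare(z_hat, z))))


# Assumed example inputs (not taken from the article).
p_agree = bvm_monte_carlo(
    sample_model=lambda rng, n: rng.normal(1.0, 0.2, n),
    sample_data=lambda rng, n: rng.normal(1.1, 0.3, n),
    compare=lambda z_hat, z: np.abs(z_hat - z),   # comparison value function f
    agrees=lambda f: f <= 0.5,                    # Boolean agreement function B
)
print(f"estimated probability of agreement: {p_agree:.3f}")
```

For correlated model and data, as in the controls setting of footnote 3, the two samplers would need to be replaced by a single sampler of the joint distribution.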

2.2 An identical representation

In some cases, it is useful to work directly with the probability density $\rho(f|M, D)$, which quantifies the probability the comparison value function takes the value $f$ due to uncertainty in its inputs. This pdf is independent of any user defined accuracy requirement. We will call this pdf the comparison value probability density, which is equal to,

$$\rho(f|M, D) = \int \delta\big(f - f(\hat{z}, z)\big)\,\rho(\hat{z}, z|M, D)\,d\hat{z}\,dz. \tag{4}$$

This is the net uncertainty propagated through the comparison value function from the uncertain model and data comparison values. All of the expectation values that are associated with $f$ may be generated from this pdf.

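If one imposes an accuracy requirement with a Boolean expression $B(f)$ (i.e. defining agreement according to the value of $f$), the resulting accumulated probability is the BVM. That is, the BVM, i.e. equation (2), may equally be expressed as,

$$p(A|M, D) = \int \rho(f|M, D)\,\Theta\big(B(f)\big)\,df, \tag{5}$$

which is proven through substitution and marginalization over $f$,

$$\int \rho(f|M, D)\,\Theta\big(B(f)\big)\,df = \int \delta\big(f - f(\hat{z}, z)\big)\,\Theta\big(B(f)\big)\,\rho(\hat{z}, z|M, D)\,df\,d\hat{z}\,dz = \int \Theta\big(B(f(\hat{z}, z))\big)\,\rho(\hat{z}, z|M, D)\,d\hat{z}\,dz. \tag{6}$$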
2.3 Importance

The BVM allows the user to, in principle, quantify the probability the model and the data agree with one another under arbitrary comparison value functions and with arbitrary definitions of agreement. The BVM can therefore be used to fully quantify the probability of agreement between arbitrary model and data types using novel or existing comparison value functions and definitions of agreement. Thus, the problem of model-data validation may be reduced to the problem of finding the four BVM inputs in any model validation scenario.
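The equivalence of the two representations in Section 2.2 can be checked numerically. The sketch below, under assumed Gaussian inputs and an assumed tolerance rule, first estimates the probability of agreement directly from samples and then recovers the same number by building a histogram estimate of the comparison value pdf and accumulating it over the region where the agreement function is true. All numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
z_hat = rng.normal(1.0, 0.2, n)      # assumed model comparison values
z = rng.normal(1.1, 0.3, n)          # assumed data comparison values
f = np.abs(z_hat - z)                # comparison value function: absolute difference
eps = 0.5                            # agreement rule: f <= eps

# Direct route: average the agreement indicator over the samples.
p_direct = float(np.mean(f <= eps))

# Pdf route: histogram estimate of the comparison value pdf, with a bin edge at eps.
edges = np.concatenate([np.linspace(0.0, eps, 11),
                        np.linspace(eps, f.max(), 41)[1:]])
density, _ = np.histogram(f, bins=edges, density=True)
widths = np.diff(edges)
inside = edges[1:] <= eps            # bins lying entirely in the agreement region
p_from_pdf = float(np.sum(density[inside] * widths[inside]))

print(p_direct, p_from_pdf)
```

Because the pdf of the comparison values does not depend on the tolerance, the same histogram can be reused to evaluate other accuracy requirements without resampling.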

3 Meeting the desirable validation criterion

First we will describe how the BVM, equations (2) and (5), precisely matches our validation criterion (1.). As can be seen by equation (4), incorporated into the BVM is a statistically quantified quantitative measure that compares data and model outputs, $\rho(f|M, D)$. However, this pdf is in some sense lacking a context pertaining to the model-data pair. Not until an accept/reject rule is imparted on $f$ does one define what is meant by agreement in the model-data context. Thus, the BVM only becomes the probability of agreement between the data and the model when the agreement function is also incorporated. The four BVM inputs are therefore adequate to satisfy (1.) as the BVM is a “statistically quantified quantitative measure ($f$) of agreement between model predictions and data pairs $(\hat{z}, z)$, in the presence or absence of uncertainty $\rho(\hat{z}, z|M, D)$”.

There are a few more BVM concepts worth discussing before moving forward. We will show that the BVM is capable of handling general multidimensional model-data comparisons and that there are no conceptual issues when agreement is exact, i.e. $B$ is true iff $\hat{z} = z$, in the certain and uncertain cases. We will then comment on the sense in which the BVM adheres to the full set of six desirable validation criteria given in [28] by discussing the criteria that are underrepresented in (1.).

3.1 Compound Booleans

Because Boolean operations between Boolean functions result in a Boolean function, the BVM is capable of handling multidimensional model-data comparisons. We will call a Boolean function built in this way a “compound Boolean”. A compound Boolean function results from and, $\wedge$, conjunctions and or, $\vee$, disjunctions between a set of Boolean functions, e.g.,

$$B = (B_1 \wedge B_2) \vee B_3, \tag{7}$$

where each $B_i$ may use a different comparison function $f_i$. Compound Booleans using conjunctions quantify the validity of entire model functions (random fields and/or multidimensional vectors) by assessing agreement between each of the model-data comparison field points simultaneously, i.e. over the comparison points 1 and points 2 and so on. The compound Booleans may be factored into their constituting Boolean functions using the standard product and sum rules of probability theory after being mapped to probabilities with the agreement kernel.
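A conjunction over field points can be sketched with a small Monte Carlo experiment. The example below assumes a hypothetical model line and validation data at five points, independent Gaussian uncertainty at each point, and a per-point tolerance rule; none of these values come from the article. Since the points are sampled independently here, the probability of the conjunction should be close to the product of the per-point probabilities, which is the product rule mentioned above.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = np.linspace(0.0, 1.0, 5)             # comparison points
model_mean = 2.0 * x                      # hypothetical model line
data_mean = 2.0 * x + 0.05                # hypothetical validation data

# Independent Gaussian uncertainty at every comparison point (an assumption).
z_hat = rng.normal(model_mean, 0.1, size=(n, x.size))
z = rng.normal(data_mean, 0.1, size=(n, x.size))

eps = 0.3
per_point = np.abs(z_hat - z) <= eps      # Boolean B_i evaluated for every sample

p_each = per_point.mean(axis=0)                  # agreement probability per point
p_all = float(per_point.all(axis=1).mean())      # conjunction: B_1 and B_2 and ...
p_any = float(per_point.any(axis=1).mean())      # disjunction: B_1 or B_2 or ...

print("per point:", np.round(p_each, 3))
print("conjunction:", round(p_all, 3), "product:", round(float(p_each.prod()), 3))
print("disjunction:", round(p_any, 3))
```

With correlated field points the product shortcut no longer applies, but the sample average of the compound Boolean remains a valid estimate.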

3.2 The BVM under the conditions of exact agreement

We can calculate the BVM under the conditions of exact agreement in the completely certain and uncertain cases. Because the BVM is a probability rather than a probability density, the agreement kernel falls in the range $[0, 1]$. Under the conditions of exact agreement, that is, $B$ is only true when $\hat{z} = z$, the agreement kernel is $\Theta(\hat{z} = z) = \delta_{\hat{z}, z}$, which is the Kronecker delta, i.e. it is $0$ or $1$, but has continuous labels. As it is uncommon to deal with Kronecker deltas having continuous labels under integration, we will show that the BVM gives reasonable results under the condition of exact agreement in the completely certain case as well as in the general uncertain case.

Complete certainty and exact agreement

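Complete certainty is represented using Dirac delta pdf functions over the model and data comparison values. This gives the BVM,

$$p(A|M, D) = \int \delta(\hat{z} - \hat{z}_0)\cdot\Theta(\hat{z} = z)\cdot\delta(z - z_0)\,d\hat{z}\,dz, \tag{8}$$

where we are considering the model-data pair to agree iff the comparison values are exactly equal. Using the sifting property of the Dirac delta function, we find the reasonable result that,

$$p(A|M, D) = \Theta(\hat{z}_0 = z_0) = \delta_{\hat{z}_0, z_0}, \tag{9}$$

which is equal to unity iff $\hat{z}_0$ and $z_0$, the definite values of $\hat{z}$ and $z$, are equal.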

Uncertainty and exact agreement

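In the uncertain case under the condition of exact agreement, the BVM is

$$p(A|M, D) = \int \rho(\hat{z}|M)\cdot\Theta(\hat{z} = z)\cdot\rho(z|D)\,d\hat{z}\,dz. \tag{10}$$

We will use the following trick to correctly interpret this integral. We will first let $B$ be true if $z \le \hat{z} \le z + \Delta z$ and then take the limit as $\Delta z \to 0$ such that $\Delta z \to dz$ when appropriate. With this Boolean expression, the BVM is,

$$p(A|M, D) = \int \left(\int_{z}^{z + \Delta z} \rho(\hat{z}|M)\,d\hat{z}\right)\rho(z|D)\,dz. \tag{11}$$

In the limit $\Delta z \to 0$, the term in the parentheses goes to $\rho(\hat{z} = z|M)\,dz$ by the definition of probabilities. This gives,

$$p(A|M, D) = \int \rho(\hat{z} = z|M)\,dz\cdot\rho(z|D)\,dz, \tag{12}$$

which is understood to be the sum of the model and the data probabilities that jointly output exactly the same values. We see that the BVM in this case is proportional to $dz$, $p(A|M, D) \propto dz$, in the general case of exact agreement, and therefore the BVM goes to zero unless the pdf $\rho(\hat{z} = z|M) \propto 1/dz$. Thus, we recover the standard logical result for probability densities unless it is offset by $1/dz$. This result is easily generalized to the dependent case using $\rho(\hat{z}, z|M, D)$. The result, equation (12), is no more surprising than (9) in principle.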

Due to the vast number of possibilities for continuous valued variables, having a pathological definition of exact agreement between continuous variables does not occur in practice. In a computational setting, $dz$ becomes a finite difference and these conceptual issues of infinitely improbable agreement are avoided. The Bayesian model testing framework avoids these issues by evaluating posterior odds ratios, in which case the measures, $dz$, drop out.
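The scaling of the exact-agreement BVM with the width of the agreement window can be illustrated in closed form for Gaussian inputs. The sketch below assumes independent Gaussian model and data comparison values (the means and standard deviations are hypothetical) and a window rule in which the model value must fall between the data value and the data value plus a small width `delta`. The probability of agreement shrinks roughly linearly with `delta`, while the ratio of the probability to `delta` settles to a finite density.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def bvm_window(delta, mu_m=1.0, s_m=0.2, mu_d=1.1, s_d=0.3):
    """P(z <= z_hat <= z + delta) for independent Gaussian z_hat and z.

    The difference z_hat - z is Gaussian with mean mu_m - mu_d and
    standard deviation sqrt(s_m**2 + s_d**2).
    """
    mu = mu_m - mu_d
    s = math.hypot(s_m, s_d)
    return normal_cdf(delta, mu, s) - normal_cdf(0.0, mu, s)


for delta in (1e-1, 1e-2, 1e-3):
    p = bvm_window(delta)
    print(f"delta={delta:g}  p={p:.6g}  p/delta={p / delta:.4f}")
```

In a ratio of two such probabilities computed with the same `delta`, the common factor cancels, which is the sense in which the measures drop out of posterior odds ratios.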

3.3 Meeting underrepresented validation criteria

In this subsection we will discuss how the BVM also meets the validation criteria found in [28]. This is done by using the derived general and special cases of the BVM for each of the criteria which are underrepresented in (1.).

Perhaps the primary underrepresented criterion from [28] is their second. It states that “..the criteria used for determining whether a model is acceptable or not should not be a part of the metric which is expected to provide a quantitative measurement only.” We argue that the functional form of the BVM presented in equation (5) clearly demonstrates this feature as it factors into $\rho(f|M, D)$ and $\Theta(B(f))$. The comparison function $f$ represents the “objective quantitative measure” from their first criterion that is separate from the accept/reject rule, which is our agreement function $B$ – both of which require definition to ultimately evaluate the validity of a model. We see it as advantageous to quantify the probability the model is accepted or rejected through $B$ due to the uncertainty in the value of $f$, which is the general case, and which gives the BVM as the result. As all of the validation metrics presented in [28] (and more) will be shown to be representable with the BVM, and thus placed on the same footing, we find our language of “comparison function” and “agreement function” to ultimately be more useful than a language that only considers comparison functions (without accept/reject rules) to be the validation metrics.

The third criterion in [28] is that ideally the metric should “degenerate to the value from a deterministic comparison between scalar values when uncertainty is absent”. This is indeed the case, as can be seen in equation (1) or in equations (2) and (5) by utilizing Dirac delta pdfs similar to their application in (8).

The fifth desirable validation criterion in [28] states that artificially widening probability distributions should not lead to higher rates of validation. They find all but the frequentist metric to have this undesired feature; however, we later see that the frequentist metric may be considered a special case of the reliability metric (when reasonable accuracy requirements are imposed), meaning artificial widening can lead to higher rates of validation for more general instances of the frequentist metric.

Further, we argue that artificially introducing uncertainty for the express purpose of passing a validation test is indistinguishable from scientific misconduct. If there is objective reason to include more uncertainty into the analysis or if the circumstance for what constitutes validation has changed due to a change of context – and it happens to improve the rate of validation – so be it. This is a different context, model, or state of uncertainty than was originally proposed so different rates of acceptance should be expected. Reducing the uncertainty of either the data or the model (the inputs) through additional measurements may later prove the model valid or invalid. Thus, to meet this validation criterion, we simply assume the user is not engaging in scientific misconduct.
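The effect of widening that the fifth criterion warns about can be reproduced with a small closed-form calculation. The sketch below is an illustration under assumed values, not an example from [28] or from this article: the model and data are independent Gaussians whose means differ by much more than the tolerance, and the probability of agreement under a tolerance rule is evaluated for several model standard deviations.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def bvm_tolerance(mu_m, s_m, mu_d, s_d, eps):
    """P(|z_hat - z| <= eps) for independent Gaussian z_hat and z."""
    mu = mu_m - mu_d
    s = math.hypot(s_m, s_d)
    return normal_cdf(eps, mu, s) - normal_cdf(-eps, mu, s)


# Hypothetical setting: model mean 0, data mean 1 (std 0.1), tolerance 0.2.
widths = (0.1, 0.5, 1.0, 3.0)
probs = [bvm_tolerance(0.0, s_m, 1.0, 0.1, 0.2) for s_m in widths]
for s_m, p in zip(widths, probs):
    print(f"model std={s_m:g}  probability of agreement={p:.3g}")
```

In this setting a moderately wider model pdf raises the probability of agreement even though the model mean is unchanged, and a much wider one lowers it again, which is why the state of uncertainty must be justified on objective grounds rather than tuned to pass a test.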

Finally, due to the results of the “Compound Booleans” section, their sixth criterion is met. Because the BVM (2) can be used to assess single or multidimensional controllable settings (see footnote 3), we can assess global function validity (in or out of a controls setting). As they note, “This last feature is critical from the viewpoint of engineering design”.

Thus, the BVM satisfies both our validation criterion and the six desirable validation criteria outlined in [28]. This was accomplished by representing model-data validation as an inference problem using the four BVM inputs.

4 Representing and generalizing the known validation metrics with the BVM

This section is a review of the material found in Appendix A. The following validation metrics will be represented with the BVM and then improved, generalized, and/or commented on: reliability/probability of agreement, improved reliability metric, frequentist, area metric, statistical hypothesis testing, and Bayesian model testing.

4.1 Representing the known validation metrics with the BVM

Table 1 outlines the values of the four BVM inputs that result in the BVM representing each of the well known validation metrics as special cases. The notation used for the comparison values $(\hat{z}, z)$ distinguishes expectation values (written with brackets $\langle\cdot\rangle$), averaged values, single values, multidimensional (or many valued) values, cumulative distribution functions, test statistics, and the confidence interval of the data. In the agreement function column, an element listed as a general function of $f$ means the creators of the metric intentionally left the definition of agreement unspecified; however, it is natural to assume it is a function of the comparison function $f$.

Table 1: Specification of the four BVM inputs that give the other validation metrics as special cases. The column headings are the four BVM input values: Comparison Values (Comp. Values), Probabilities (Probs.), Comparison Function (Comp. Func.), and the Boolean Agreement Function (Agree. Func.). The row headings read: Reliability, Improved Reliability, Frequentist, Area, Statistical Hypothesis Test, and Bayesian Model Test. The denoted data probability for the average in the frequentist metric, Stud. t, is the Student t distribution.

Table 2: BVM representation of the special cases using the inputs specified in Table 1. The rows follow the same metrics as Table 1.

Table 1 shows some of the similarities and differences between the known validation metrics using the BVM. In particular, by looking at the validation metrics with the same type of comparison values, i.e. the reliability and frequentist or the improved reliability and Bayesian model testing, we can compare them directly. We see that if one lets the frequentist metric allow for more general input probability distributions and the use of a reasonable agreement function (i.e., $B$ is true if the comparison value $f$ lies within a stated tolerance), then the frequentist metric is the reliability metric. Further, in Bayesian model testing, if the agreement function is loosened from exact equality to accept values within a tolerance, then the pdfs that appear in the Bayesian model testing framework are equal to the improved reliability metric. This information improves the objectivity of the current validation procedure because we now have a map between validation metrics that were originally thought to be different.

Table 2 shows the resulting BVM using the specifications listed in Table 1. It uses the standard notation for the reliability metric [17] and an analogous symbol for the improved reliability metric [19]. The BVM represents each of the known validation metrics as a probability of agreement between the model and the data from equation (2). As no agreement function is specified directly for the frequentist and area metrics, the problem is underconstrained, so the agreement functions are left as general functions over the comparison function $f$. Thus, for any chosen agreement function, the BVM quantifies their probability of agreement. The remaining metrics all do specify (or indicate) an agreement function, and thus have specified all of the information required to compute the BVM.

The statistical hypothesis test is perhaps a bit out of place among the validation metrics. First, note that the comparison function for statistical hypothesis testing is not a function of both the data and the model. Further note, the model pdf used for statistical hypothesis testing assumes the null hypothesis is true, which in our language is the assumption that

, i.e. that the pdf of the model is equal to the pdf of the data. This shows how statistical hypothesis testing is a bit out of place among the validation metrics: here we are attempting to validate a model, usually with its own quantified pdf, rather than, perhaps irresponsibly, assuming it is equal to the data pdf before validating that to be the case. This causes standard statistical hypothesis testing pitfalls, such as type I errors (rejecting the null hypothesis when it is true) and type II errors (failing to reject the null hypothesis when it is false), to be carried over into the BVM, which is unwanted. Several comments are made in Appendix A.4 on this issue.

A perhaps surprising result is the proposed functional form of the BVM that represents Bayesian model testing . This is the probability the model and data output exactly the same values. Usually what is discussed when reviewing Bayesian model testing is the Bayes posterior odds ratio, i.e. the “Bayes Ratio”,

which tests one model , i.e. for validation, against another model . However, in validation metric problems, we are first interested in considering the validation of a single model – the ratio is an extra bit of inference. In Appendix A.5 we show that the BVM result of is exactly what we mean by

in the numerator of the Bayes factor, which effectively quantifies the validation of a single model against data

, all quantified under uncertainty.

4.2 Generalizations and Improvements to the known validation metrics with the BVM

The BVM offers several avenues to generalize or improve many of the metrics. The theoretical generalizations consist of moving each element of Table 1 toward the general BVM: generalizing the comparison values, loosening the constraint on the form of the agreement function, and, in the case of the frequentist and area metrics, allowing for uncertainty in their definite model and/or data quantities. These generalizations are only useful if quantitative statements can be made on their behalf; in such a case, these generalizations are improvements. We give a brief review of the improvements we found below, but the full discussion is located in Appendix A. By making the generalizations or improvements to each of the known validation metrics that are implied by the BVM, each metric can be made to satisfy our validation criterion as well as the six desirable validation criteria in [28], due to the results of Section 3.

Appendix A.1 uses the BVM to show that the reliability metric and the improved reliability metric can be generalized to compare values without a unique order, such as strings, in principle. This involves creating an agreement function over sets of values (such as synonymous sets of strings), rather than in a continuous interval, that may be considered to “agree”.

Appendix A.2 derives the frequentist validation metric and generalizes it to the case where both the model and data expectation values are uncertain. The frequentist metric assumes the model outputs are known with certainty, which may or may not be true. If a model is stochastic, the model pdfs may be estimated with Monte Carlo or other uncertainty propagation methods that quantify the pdf directly.

Appendix A.3 shows that the area metric may be cast as a special case of the BVM. The area metric involves quantifying the difference between model and data cumulative distributions on a point to point basis; thus, the comparison values are cumulative distributions themselves. The comparison values are assumed to be known with complete certainty, which in the case of cumulative distributions of data is often difficult to argue. Any quantifiable uncertainty in the cumulative distributions may be integrated over, which generalizes the area metric to situations when the model and/or the data cumulative distributions are uncertain. A drawback is that the BVM in these cases may be very computationally intensive and would likely need to be approximated using a random sampling or discretization scheme. This is because the probability over possible cumulative distribution functions effectively constitutes quantifying the probability over a random field. A binned pdf metric is put forward to potentially reduce the computational complexity of quantifying this generalized area validation metric.
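For orientation, the completely certain version of the area metric can be sketched in a few lines of Python. This is only an illustrative sketch and not the authors' implementation: the function names `ecdf` and `area_metric` and the sample values are hypothetical, and the cumulative distributions are assumed to be empirical step functions built from finite samples.

```python
from bisect import bisect_right


def ecdf(samples):
    """Return a function x -> fraction of samples that are <= x."""
    xs = sorted(samples)
    n = len(xs)
    return lambda x: bisect_right(xs, x) / n


def area_metric(model_samples, data_samples):
    """Area between two empirical cdfs (exact, since both are step functions)."""
    f_model = ecdf(model_samples)
    f_data = ecdf(data_samples)
    knots = sorted(set(model_samples) | set(data_samples))
    area = 0.0
    for left, right in zip(knots, knots[1:]):
        # Both cdfs are constant on [left, right), so the integrand is constant.
        area += abs(f_model(left) - f_data(left)) * (right - left)
    return area


# Hypothetical samples, for illustration only.
print(area_metric([0.0, 2.0], [1.0]))
```

In the generalized setting described above, one would average an agreement indicator on this area over the uncertain cumulative distributions, which is where the computational cost arises.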

Appendix A.4 uses the BVM to construct an improved statistical hypothesis test that takes into account the model form. Because in principle we have a model output pdf in validation problems , we can use it (in place of assuming the null hypothesis is true) to remove the possibility of both type I and type II errors.

In the improved statistical hypothesis test, the model and the data are defined to agree if their test statistics both lie within one another’s confidence intervals (or “confidence sets” as explained in Appendix A.4). The BVM for the improved statistical hypothesis test becomes the product of the statistical powers of the model and data, denoted in equation (36). Further comments are made about how systematic error (defined as when a test statistic lies outside of its own confidence interval) may be removed.

It is concluded that the BVM for the improved statistical hypothesis test still has a very low resolving power. This is because large confidence intervals imply large tolerance intervals for acceptance. For this reason, statistical hypothesis testing should only be used for validation in situations when a high degree of nonexactness between model and data test statistics is permissible and the pdf’s have very thin tails.

Appendix A.5 finds that Bayesian model testing has the highest possible resolving power because the model and the data are defined to agree only if their values are exactly equal. This is the reverse of what was concluded about statistical hypothesis testing.

Further in Appendix A.5, we argue that, analogous to the Bayesian model testing framework, nothing prevents us from constructing what we call the BVM factor. The BVM factor is,


which is a ratio of the BVMs of two models under arbitrary definitions of agreement

. Using Bayes Theorem,

, we may further construct the BVM ratio,


for the purpose of comparative model selection under a general definition of agreement . The ratio

is the ratio of the prior probabilities of

and . Analogous to Bayesian model testing, if there is no reason to suspect that one model is a priori more probable than another, one may let , and then in value.

Thus, using the BVM ratio, we can perform general model validation testing under arbitrary definitions of agreement and with any reasonable set of comparison functions. The BVM ratio therefore generalizes the Bayesian model testing framework.
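The construction above can be illustrated with a short Monte Carlo sketch in Python. It is a sketch under stated assumptions, not the paper's implementation: the names `bvm_monte_carlo` and `bvm_ratio`, the Gaussian toy model and data distributions, the tolerance of 0.2, and the independence of model and data samples are all assumptions made for illustration. The ratio is taken as the first model's BVM over the second's, multiplied by the ratio of their priors.

```python
import random


def bvm_monte_carlo(sample_model, sample_data, agree, n=20000, seed=0):
    """Estimate the probability that `agree(model_value, data_value)` is true."""
    rng = random.Random(seed)
    hits = sum(agree(sample_model(rng), sample_data(rng)) for _ in range(n))
    return hits / n


def bvm_ratio(bvm_a, bvm_b, prior_a=0.5, prior_b=0.5):
    """BVM factor times the prior ratio; undefined when bvm_b is zero."""
    return (bvm_a / bvm_b) * (prior_a / prior_b)


# Hypothetical toy problem: scalar data near 0, two candidate models.
def sample_data(rng):
    return rng.gauss(0.0, 0.1)


def agree(y_model, y_data):
    # Assumed definition of agreement: values within a tolerance of 0.2.
    return abs(y_model - y_data) < 0.2


bvm_a = bvm_monte_carlo(lambda rng: rng.gauss(0.0, 0.1), sample_data, agree)
bvm_b = bvm_monte_carlo(lambda rng: rng.gauss(0.5, 0.1), sample_data, agree)
ratio = bvm_ratio(bvm_a, bvm_b)
print(bvm_a, bvm_b, ratio)
```

Any other Boolean agreement function can be passed as `agree` without changing the rest of the sketch, which is the sense in which the ratio holds under arbitrary definitions of agreement.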

5 BVM: Quantified Visual Inspection Validation Example

In this section we design an agreement function to represent the visual inspection an engineer might perform graphically and use the BVM ratio for model selection under this definition of agreement. Such an agreement function could be useful in experimental, engineering, and data science settings. By quantifying this, a modeler could “visually validate” a model without actually looking at the model-data pair. Thus, a “visual inspection” could be carried out in high dimensional spaces that are beyond human comprehension or visualization. We will proceed by introducing this agreement function and some simple models to test it on. We will then quantify this measure using the BVM in the completely certain and uncertain cases. Using the results, we will perform BVM model testing.

To quantify something resembling the visual inspection of a model that an engineer might make graphically we use two main criteria. We define the model to be accepted if most of the model and data point pairs lie relatively close to one another and if none of the point pairs deviates too far from one another. We therefore consider a compound Boolean that is true if a percentage larger than % ( 90%) of the model output points lie within of the data and 100% of the model output points lie within some multiple of the data, which rules out obvious model form error. We will call this Boolean the “- Boolean”, which is a compound Boolean. The values , , and can be adjusted to the needs of the modeler. We will perform the analysis for a variety of and values to explore the limits of the metric.
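A minimal Python sketch of such a compound Boolean is given below. The argument names `eps` (the closeness tolerance), `gamma` (the required fraction of close points), and `multiple` (the factor defining the outer tolerance) are hypothetical stand-ins for the symbols used in the text, and the default values are illustrative only.

```python
def visual_agreement(model_y, data_y, eps, gamma=0.9, multiple=3.0):
    """Compound Boolean: most points close to the data, and no point far away.

    True if at least a fraction `gamma` of the model points lie within `eps`
    of the data and every model point lies within `multiple * eps`.
    Assumption: "at least" is used (the text says "larger than") so that
    gamma = 1 remains satisfiable.
    """
    if len(model_y) != len(data_y) or not model_y:
        raise ValueError("model and data must be non-empty and equally long")
    errors = [abs(m - d) for m, d in zip(model_y, data_y)]
    fraction_close = sum(e <= eps for e in errors) / len(errors)
    no_outliers = all(e <= multiple * eps for e in errors)
    return fraction_close >= gamma and no_outliers


# Hypothetical example: nine exact points and one mild deviation.
data = [0.0] * 10
model = [0.0] * 9 + [0.15]
print(visual_agreement(model, data, eps=0.1, gamma=0.9, multiple=2.0))
```

The second condition is what rules out obvious model form error: a single point far from the data fails the test even when every other point matches.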

We will calculate the BVM for two different order polynomial models that approximate data points taken from the cosine function , as an illustration. The points are evenly spaced in the range . The first model has uncertain parameters and the second model has uncertain parameters .

To formulate the BVM, we still need to formulate the model and data probability distributions. Because the Boolean expression is over the entire model and data functions, the model probability distribution is and data probability distribution is . These are joint probabilities over all of the points , that constitute a particular path of the model or data, respectively. Because both models are linear in the uncertain coefficients , there is a one to one correspondence from the set of model parameters to the set of the possible paths (given is greater than the number of independent coefficients). This makes the uncertainty propagation from the uncertain model parameters to the full joint probability of the points on a path simple and results in the joint probability of paths being equal to the joint probability of the uncertain input model parameters. For simplicity we will let , where

is a normal distribution of average

and standard deviation

, such that is,

After discretization, the - BVM for each model is,


In principle is an dimensional vector where each may be adjusted to impose more or less stringent agreement conditions on a point to point basis, which may be used to enforce reliability in regions of interest. In our example we let all of the components of be equal. If the uncertainty in the data is less than , one may approximate the BVM as,


which can greatly reduce the number of combinations one must calculate by effectively treating the data as known, deterministic, and equal to . We will use this approximation as it does not take away from the main features of our example, which are to use the - Boolean and perform BVM model testing with it for a variety of and values.

We will use the following numerics. In the completely certain case, we will let the parameters be the Taylor series coefficients ( for model 1) and in the uncertain case we let each coefficient have Gaussian uncertainty centered at its Taylor series coefficient with standard deviations (and where for model 1). We let each model output path have points and we allow for possible values per parameter , which results in possible paths for model 1 and for model 2. We let vary between and using an increment of and let vary between and using an increment of . The value of was chosen to be equal to , which imposes that no model path can have points that are greater than away while still being considered to agree with the data.
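The discretized calculation with deterministic data can be sketched as follows. Every specific number here is an assumption made for illustration, since the text's values are not reproduced: the polynomial orders (second and fourth), the range [-1, 1], the 21 points, the 5 grid values per coefficient, the coefficient standard deviation of 0.02, and the agreement settings. The helper names are hypothetical as well.

```python
import itertools
import math


def discretized_gaussian(mean, sigma, k=5, width=2.0):
    """k grid values in mean +/- width*sigma with normalized Gaussian weights."""
    if sigma == 0.0 or k == 1:
        return [(mean, 1.0)]  # completely certain parameter
    values = [mean + width * sigma * (2.0 * i / (k - 1) - 1.0) for i in range(k)]
    weights = [math.exp(-0.5 * ((v - mean) / sigma) ** 2) for v in values]
    total = sum(weights)
    return [(v, w / total) for v, w in zip(values, weights)]


def agrees(path, data, eps, gamma, multiple):
    """Compound Boolean: enough points within eps, all within multiple*eps."""
    errors = [abs(m - d) for m, d in zip(path, data)]
    close = sum(e <= eps for e in errors) / len(errors)
    return close >= gamma and all(e <= multiple * eps for e in errors)


def bvm_polynomial(means, powers, sigma, xs, data, eps, gamma=0.9, multiple=3.0, k=5):
    """Sum the probabilities of all discretized model paths that agree with the data."""
    grids = [discretized_gaussian(m, sigma, k) for m in means]
    prob = 0.0
    for combo in itertools.product(*grids):
        path = [sum(c * x ** p for (c, _), p in zip(combo, powers)) for x in xs]
        if agrees(path, data, eps, gamma, multiple):
            prob += math.prod(w for _, w in combo)
    return prob


xs = [-1.0 + 0.1 * i for i in range(21)]          # assumed range and spacing
data = [math.cos(x) for x in xs]                   # data treated as deterministic
model_1 = dict(means=[1.0, -0.5], powers=[0, 2])   # assumed 2nd-order Taylor model
model_2 = dict(means=[1.0, -0.5, 1.0 / 24.0], powers=[0, 2, 4])  # assumed 4th-order

certain_1 = bvm_polynomial(**model_1, sigma=0.0, xs=xs, data=data, eps=0.01)
certain_2 = bvm_polynomial(**model_2, sigma=0.0, xs=xs, data=data, eps=0.01)
uncertain_1 = bvm_polynomial(**model_1, sigma=0.02, xs=xs, data=data, eps=0.05)
uncertain_2 = bvm_polynomial(**model_2, sigma=0.02, xs=xs, data=data, eps=0.05)
print(certain_1, certain_2, uncertain_1, uncertain_2)
```

Because each model is linear in its coefficients, enumerating the coefficient grid enumerates the paths, so the weight of a path is the product of its coefficient weights.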

The BVM probability of agreement values as a function of are plotted in Figure 2 for model 1 and model 2 in the completely certain case:

Figure 2: Completely Certain Case: The BVM probability of agreement between model 1 and 2 with the data is plotted in the space of . The results for model 1 ( order polynomial) are plotted on the left and those for model 2 ( order polynomial) are plotted on the right. Because the models here are deterministic, the BVM probability of agreement for each pair is either zero or one. As expected, model 2 better fits the data in the space of : it has more BVM values equal to one than model 1 because it is overall closer to the cosine function, being the next nonzero order in the Taylor series expansion. Neither model fits the data exactly, as the BVM for both models at is zero.

For a single pair, the models’ BVM ratio (a priori, the models are assumed to be equally likely) is,


which, because the numerator and denominator are each either 0 or 1 in the deterministic case, gives a ratio equal to 1, 0, ∞, or 0/0, meaning that the models both agree, model 1 does not agree but model 2 agrees, model 1 agrees but model 2 does not agree, or both models disagree, respectively. Thus, the BVM ratio for a single pair between two deterministic models with completely certain data is not particularly insightful, as they either agree or do not agree as defined by . As it may not always be clear precisely what values of one should choose to define agreement, one can meaningfully average (marginalize) over a viable volume in the space of , with , and arrive at an averaged Boolean BVM ratio,


which is simply a ratio of the number of agreements found for model 1, , in the volume to the number of agreements found for model 2, , in the selected volume. In our deterministic example, as model 2 better fits the data, as defined by , for the chosen meaningful volume (which is taken to be the whole tested volume in this toy example). The BVM ratio or the averaged Boolean BVM ratio may be used as a guide for selecting models in the deterministic case given reasonable regions are chosen.
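As a hedged illustration of this averaging, the sketch below sums BVM values over a shared grid of agreement settings and takes the ratio of model 1 to model 2. The function name `averaged_bvm_ratio`, the uniform weighting over the grid, and the toy BVM values are assumptions for illustration and are not results from the paper.

```python
def averaged_bvm_ratio(bvm_model_1, bvm_model_2):
    """Ratio of BVM values summed over the same grid of agreement settings.

    Each argument maps a grid point, e.g. a (tolerance, fraction) pair, to the
    BVM at that point. In the deterministic case the values are 0 or 1, so the
    sums are simply counts of agreements.
    """
    if bvm_model_1.keys() != bvm_model_2.keys():
        raise ValueError("both models must be evaluated on the same grid")
    total_1 = sum(bvm_model_1.values())
    total_2 = sum(bvm_model_2.values())
    if total_2 == 0:
        raise ZeroDivisionError("model 2 never agrees in the chosen volume")
    return total_1 / total_2


# Hypothetical deterministic BVM values on a 2 x 2 grid (not the paper's numbers).
toy_1 = {(0.05, 0.9): 0, (0.05, 0.5): 1, (0.10, 0.9): 0, (0.10, 0.5): 1}
toy_2 = {(0.05, 0.9): 1, (0.05, 0.5): 1, (0.10, 0.9): 1, (0.10, 0.5): 1}
print(averaged_bvm_ratio(toy_1, toy_2))
```

A value below one favors model 2 over the chosen volume, which matches the direction of the ratio used in the text.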

The BVM probability of agreement values as a function of are plotted in Figure 3 for model 1 and model 2 in the uncertain model case:

Figure 3: Uncertain Case: The BVM probability of agreement between model 1 and 2 with the data is plotted in the space of . The results for model 1 ( order polynomial) are plotted on the left and those for model 2 ( order polynomial) are plotted on the right. Because the model paths are uncertain, the BVM probability of agreement for each pair may take any value from zero to one. As expected, model 2 better fits the data in the space of as its BVM is generally larger than that of model 1; however, the BVM values are about equal for large values of and (the definition of agreement is less stringent and they both “agree”) and in the case of demanding absolute equality (), as neither model fits the data exactly.

The BVM ratios of the uncertain models are plotted as a function of in Figure 4:

Figure 4: This is a plot of the BVM ratios in the uncertain case. Model 2 is generally favored over model 1 as there exist no values greater than one on the plot. The amount the BVM ratio favors model 2 over model 1 decreases as the metric becomes less and less stringent (i.e. as decreases and increases). The line was removed because neither model agrees with the data exactly.

The averaged Boolean BVM ratio for the uncertain models is,


which conforms to the notion that model 2 is, generally speaking, the preferable model, and which may be communicated with this single number. Using this article as a theoretical foundation, we plan to tackle more realistic problems in the future.

6 Conclusion

We demonstrated the versatility of the BVM in expressing and solving model validation problems. The BVM quantifies the probability the model is valid for arbitrary quantifiable definitions of model-data agreement using arbitrary quantified comparison functions of the model-data comparison values. The BVM was shown: to obey all of the desired validation metric criteria [28] (which is a first), to be able to represent all of the standard validation metrics as special cases, to supply improvements and generalizations to those special cases, and to be a tool for quantifying the validity of a model in novel model-data contexts. The latter was demonstrated in our example. Using the BVM, one can jointly validate multiple features of the model-data pair in conjunction and/or disjunction with one another, e.g. validating a model-data pair according to pdf difference and average value difference simultaneously.

Finally, it was shown that one can perform model validation selection using the constructed BVM ratio. The BVM model testing framework generalizes the Bayesian model testing framework to arbitrary model-data contexts and with reference to arbitrary comparisons and agreement definitions. That is, the BVM ratio may be used to rank models directly in terms of the relevant model-data validation context. The problem of model-data validation may be reduced to the problem of finding or defining the four BVM inputs: , , , and , and computing their BVM value, all of which may be regarded as an application of inference. We find that the BVM is a useful tool for quantifying, expressing, and performing model validation and testing.


This work was supported by the Center for Complex Engineering Systems (CCES) at King Abdulaziz City for Science and Technology (KACST) and the Massachusetts Institute of Technology (MIT). We would like to thank all of the researchers with the Center for Complex Engineering Systems (CCES), especially Zeyad Al-Awwad, Arwa Alanqari, and Mohammad Alrished. Finally, we would also like to thank Nicholas Carrara and Ariel Caticha.



  • [1] Oberkampf, W. L., Trucano, T. G., and Hirsch, C., 2004, “Verification, Validation, and Predictive Capability in Computational Engineering and Physics,” Appl. Mech. Rev., 57(3), pp. 345–384.
  • [2] Sornette, D., Davis, A. B., Ide, K., Vixie, K. R., Pisarenko, V., and Kamm, J. R., 2007, “Algorithm for Model Validation: Theory and Applications,” Proc. Natl. Acad. Sci. U.S.A., 104(16), pp. 6562–6567.
  • [3] M. C. Kennedy and A. O’Hagan, Bayesian calibration of computer models, J. R. Stat. Soc.: Ser. B (Stat Methodol.), vol. 63, no. 3, pp. 425-464, 2001.
  • [4] O. P. Maître, O. M. Knio, Spectral Methods for Uncertainty Quantification. New York: Springer Science+Business Media, 2010.
  • [5] Adams, B.M., Bauman, L.E., Bohnhoff, W.J., Dalbey, K.R., Ebeida, M.S., Eddy, J.P., Eldred, M.S., Hough, P.D., Hu, K.T., Jakeman, J.D., Stephens, J.A., Swiler, L.P., Vigil, D.M., and Wildey, T.M., “Dakota, A Multilevel Parallel Object-Oriented Framework for Design Optimization, Parameter Estimation, Uncertainty Quantification, and Sensitivity Analysis: Version 6.0 User’s Manual,” Sandia Technical Report SAND2014-4633, July 2014. Updated November 2015 (Version 6.3).
  • [6] R. Ghanem, D. Higdon, and H.Owhadi, “The Uncertainty Quantification Toolkit (UQTk)” Handbook of Uncertainty Quantification, Springer, 2016.
  • [7] B. Debusschere, N. Habib, N. Najm, P. Pébay, O. M. Knio, R. Ghanem, and O. P. Maître, Numerical Challenges in the Use of Polynomial Chaos Representations for Stochastic Processes, SIAM J. Sci. Comput. vol. 26, no. 2, pp. 698-719, 2005.
  • [8] W. Gilks, S. Richardson, and D. J. Spiegelhalter, Markov chain Monte Carlo in practice. London, UK: Chapman and Hall, 1996.
  • [9] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, Equation of state calculations by fast computing machines, J. Chem. Phys., vol. 21, no. 6, pp. 1087, 1953.
  • [10] S. Sankararaman and S. Mahadevan, Integration of model verification, validation, and calibration for uncertainty quantification in engineering systems, Reliab. Eng. Syst. Saf., vol. 138, pp. 194-209, 2015.
  • [11] S. Sankararaman and S. Mahadevan, Model validation under epistemic uncertainty, Reliab. Eng. Syst. Saf. vol. 96, no. 9, pp. 1232-1241, 2011.
  • [12] S. Mahadevan and R. Rebba, Validation of reliability computational models using Bayes networks, Reliab. Eng. Syst. Saf. vol. 87, no. 2, pp. 223-232, 2005.
  • [13] C. Li and S. Mahadevan, Role of calibration, validation, and relevance in multi-level uncertainty integration, Reliab. Eng. Syst. Saf., vol. 148, pp. 32-43, 2016.
  • [14] C. Roy and W. Oberkampf, A comprehensive framework for verification, validation, and uncertainty quantification in scientific computing, Comput. Methods Appl. Mech. Engrg., vol. 200, pp. 2131-2144, 2011.
  • [15] M. Stefano and B. Sudret, UQLab user manual - Polynomial chaos expansions. Report UQLab-V0.9-104, Chair of Risk, Safety and Uncertainty Quantification, ETH Zurich, 2015.
  • [16] M. Parno and A. Davis, “MUQ: MIT Uncertainty Quantification Library”, http://muq.mit.edu/home, 2018.
  • [17] R. Rebba and S. Mahadevan, Computational methods for model reliability assessment, Reliab. Eng. Syst. Saf., vol 93, no. 8, pp 1197-1207, 2008.
  • [18] N. T. Stevens, Assessment and comparison of Continuous Measurement Systems, Ph. D. Thesis, Dept. Statistics, Univ. of Waterloo, Waterloo, Ontario, Canada, 2014.
  • [19] S. Sankararaman and S. Mahadevan, Assessing the reliability of computational models under uncertainty. In: The 54th AIAA/ASME/ASCE/AHS/ASC structures, structural dynamics, and materials conference; 2013.
  • [20] W. L. Oberkampf, M. F. Barone, Measures of agreement between computation and experiment: validation metrics, J. Comput. Phys. vol. 217, no. 1, pp. 5-36, 2006.
  • [21] W.L. Oberkampf, M.F. Barone, Measures of agreement between computation and experiment: validation metrics, AIAA Paper 2004 2626.
  • [22] S. Ferson, W. L. Oberkampf, and L. Ginzburg, Model validation and predictive capability for the thermal challenge problem, Comput. Methods Appl. Mech. Eng., vol. 197, no. 29, pp. 2408-2430, 2008.
  • [23] D. Sivia and J. Skilling, “Data Analysis A Bayesian Tutorial second edition”, Oxford University Press, Oxford, UK, 2006.
  • [24] B. Placek, Bayesian Detection and Characterization of Extra-Solar Planets Via Photometric Variations, Ph. D. Thesis, Dept. Physics, Univ. at Alb. (S.U.N.Y.), Albany, NY, USA, 2014.
  • [25] A. E. Gelfand and D. K. Dey, Bayesian model choice: asymptotics and exact calculations, J. R. Stat. Soc. Ser. B (Methodol.), pp. 501-514, 1994
  • [26] J. Geweke, Bayesian model comparison and validation, Am. Econ. Rev. 2007;97(2):60–4.
  • [27] R. Zhang and S. Mahadevan, Bayesian methodology for reliability model acceptance, Reliab. Eng. Syst. Saf., vol. 80, no. 1, pp. 95-103, 2003.
  • [28] Y. Liu, W. Chen, P. Arendt, and H. Z. Huang, Toward a better understanding of model validation metrics, Trans ASME J. Mech. Des. vol. 133, no. 7, pp. 071005, 2011.
  • [29] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
  • [30] E. T. Jaynes. In “Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science Vol. II”, D. Reidel Publishing Company, Dordrecht-Holland, pp. 175-257, 1976.
  • [31] E. T. Jaynes, Probability Theory: The Logic of Science, UK Cambridge: Cambridge University Press, 2003.
  • [32] F. Feroz and M. P. Hobson, Multimodal nested sampling: an efficient and robust alternative to Markov chain Monte Carlo methods for astronomical data analyses, Monthly Notices of the Royal Astronomical Society, vol. 384, no. 2, pp. 449-463, 2008.
  • [33] A. Caticha, Entropic Inference and the Foundations of Physics (monograph commissioned by the 11th Brazilian Meeting on Bayesian Statistics - EBEB-2012). 2012. URL
  • [34] K. Knuth, Optimal Data-Based Binning for Histograms, arxiv:physics/0605197v2, https://arxiv.org/abs/physics/0605197v2, 2013.

Appendix A Deriving the other validation metrics from the BVM

In the following subsections we will show some of the special cases of the Bayesian validation metric. Subsequent improvements or immediate generalizations of the metrics using (2) are presented when applicable. A detailed review of the majority of these metrics may be found in [28] and the references therein. Tables 1 and 2 in Section 4 outline the results.

A.1 Reliability metric and probability of agreement

There are a few validation metrics related to the reliability metric present in the literature. The reliability metric [17] is equal to the probability that the data and the model expectation values are within a tolerance of size . The “probability of agreement” introduced in [18] is closely related to , but instead expresses the quantity as “the probability the data and the model expectation values agree within a tolerance (or sliding tolerance) of ”. The reliability metric was expanded in [19] to account for model outputs and data rather than simply comparing the mean of the model prediction against the mean of the data. The improved reliability metric is equal to,


where and are the full joint probability distributions of the model outputs and the data, respectively. This metric quantifies the probability that the error is less than a value on a point to point basis.

The BVM is the reliability metric when the comparison values are and , and takes the form of an inequality, being true if . If we would like to use a sliding interval of “tolerance” or “error acceptance”, denote it by ”, where , and the BVM is,


This is the reliability metric if is a constant and symmetric interval about .

The BVM is the improved reliability metric when , , and when the Boolean is true iff for all pairs. This is,


The BVM quantifies the probability that the square error (or difference) is less than some by considering,


Nothing in the BVM requires the variables to be continuous or ordered, so the natural generalization is to let the Boolean expression be true if “The value is in the subset , which is the set of ’s agreeing with ”. For example, if ’s are strings, might be the set of words or phrases in that are reasonably synonymous with . This gives the straightforward generalization to accommodate arbitrary data types,


by using sets rather than intervals.
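A small Python sketch may make the set-based generalization concrete. It assumes discrete, independent model and data distributions, and the function name `bvm_discrete`, the probabilities, and the synonym sets are all hypothetical examples rather than anything taken from the paper.

```python
def bvm_discrete(p_model, p_data, agree):
    """Probability of agreement for discrete, independent model and data outputs.

    `p_model` and `p_data` map values (of any hashable type) to probabilities;
    `agree(model_value, data_value)` is an arbitrary Boolean agreement function.
    """
    return sum(
        pm * pd
        for y_model, pm in p_model.items()
        for y_data, pd in p_data.items()
        if agree(y_model, y_data)
    )


# Hypothetical string-valued example: agreement means "reasonably synonymous".
synonyms = {
    "large": {"large", "big"},
    "tiny": {"tiny", "small"},
}
p_model = {"big": 0.6, "small": 0.4}
p_data = {"large": 0.7, "tiny": 0.3}

bvm = bvm_discrete(p_model, p_data, lambda y_model, y_data: y_model in synonyms[y_data])
print(bvm)  # 0.6 * 0.7 + 0.4 * 0.3
```

No ordering or distance between values is needed here; only membership in the set of values that agree with a given data value.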

A.2 Frequentist validation metric

To include the frequentist validation metric in the BVM, we will have to express the comparison variables and and their respective probabilities. The result can be replicated by letting: be the Student t distribution and have a Dirac delta distribution where is the known value of the computational model’s expected output. Because the frequentist validation metric does not force the modeler to define what is meant by agreement, we represent this freedom by keeping general. This gives,


Making the coordinate transformations and gives,


with and . Given that judgments of agreement in the frequentist validation metric are expected to be made based on the confidence level that is within the confidence interval, this may be factored into , as well as other user defined terms toward expressing agreement. Equation (26) is thought to be the full BVM’s representation of the frequentist validation metric.

The immediate generalization to the frequentist validation metric offered by the BVM is to let the model output expectation value have some amount of uncertainty. The uncertainty is perhaps Gaussian or Student t distributed in (the true model expectation value) due to only having a finite number of Monte Carlo samples and/or uncertainty induced by discretization error. Because and are both uncertain in general, one generalizes to,


It is interesting to note the consequence of defining a reasonable Boolean expression of agreement on the BVM representation of the frequentist metric. A natural agreement function for the metric is one that is true if . The BVM then gives,


Thus, the frequentist validation metric is the reliability metric [17] and “probability of agreement” [18] when reasonable accuracy requirements are imposed on the acceptable difference between the expectation values.
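For the special case in which both expectation values carry independent Gaussian uncertainty, the resulting probability has a closed form, sketched below in Python. The Gaussian assumption replaces the Student t distribution used in the text purely to keep the example dependency-free, and the function names and numerical inputs are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def gaussian_reliability(mean_model, sd_model, mean_data, sd_data, eps):
    """P(|model mean - data mean| < eps) for independent Gaussian uncertainties."""
    shift = mean_model - mean_data
    spread = math.hypot(sd_model, sd_data)  # std. dev. of the difference
    if spread == 0.0:
        # Completely certain case: the probability collapses to 0 or 1.
        return 1.0 if abs(shift) < eps else 0.0
    return normal_cdf((eps - shift) / spread) - normal_cdf((-eps - shift) / spread)


# Hypothetical numbers: both means estimated near 0 with standard error 0.1.
print(gaussian_reliability(0.0, 0.1, 0.0, 0.1, eps=0.2))
```

Setting `sd_model` to zero recovers the situation in which the model expectation value is known with certainty.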

A.3 Area and Binned Probability Difference Metric

The BVM is able to represent and generalize the area and the binned probability distribution difference metric by letting the comparison values be the cdfs or pdfs in question. Using the BVM, this is,