Let be a quantity of interest, where
is a finite-dimensional vector space and is a compact set. Let be a measurable space and write
for the set of probability distributions on. Consider a model representing the dependence of the distribution of a data point on the value of the parameter . Throughout, we use to denote the Euclidean norm. We are interested in solving the following problem.
Let be an unknown element of . Given an observation of data, estimate and quantify the uncertainty (accuracy/risk) of the estimate.
We will assume that is a dominated model with positive densities, that is, for each , is defined by a (strictly) positive density with respect to a measure , such that, for each measurable subset of ,
1.1. The three main approaches to UQ
Problem 1 is a fundamental Uncertainty Quantification (UQ) problem, and there are essentially three main approaches for solving it. We now describe them when is a Euclidean space with the loss function.
In a different setting, essentially where the set
consists of probability measures, the OUQ framework provides a worst-case analysis yielding rigorous uncertainty bounds. In the setting of this paper, in the absence of data (or ignoring the data ), the (vanilla) worst-case (or robust optimization) answer is to estimate with the minimizer of the worst-case error
In that approach, are therefore identified as the center and squared radius of the minimum enclosing ball of .
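To make the vanilla worst-case answer concrete: the minmax estimate is the center of the minimum enclosing ball, and the worst-case risk is its squared radius. A minimal sketch (illustrative only; a finite point set stands in for the admissible set, and the Badoiu-Clarkson farthest-point iteration is one simple way to approximate the ball):

```python
import numpy as np

def minimum_enclosing_ball(points, n_iter=2000):
    """Approximate the minimum enclosing ball of a finite point set via the
    Badoiu-Clarkson iteration: repeatedly move the center a shrinking step
    toward the farthest point."""
    pts = np.asarray(points, dtype=float)
    c = pts.mean(axis=0)
    for k in range(1, n_iter + 1):
        far = pts[np.argmax(np.linalg.norm(pts - c, axis=1))]
        c = c + (far - c) / (k + 1)
    r2 = float(np.max(np.sum((pts - c) ** 2, axis=1)))
    return c, r2

# Worst-case estimate: the ball's center; worst-case risk: its squared radius.
center, risk = minimum_enclosing_ball([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]])
```

The iteration converges toward the optimal center at rate O(1/sqrt(k)); exact solvers such as Welzl's algorithm are discussed later in the paper.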
The (vanilla) Bayesian (decision theory) approach (see e.g. Berger [3, Sec. 4.4]) is to assume that is sampled from a prior distribution , and approximate with the minimizer of the Bayesian posterior risk
associated with the decision , where
is the posterior measure determined by the likelihood , the prior and the observation . This minimizer is the posterior mean
and the uncertainty is quantified by the posterior variance
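As a concrete illustration of the vanilla Bayesian answer, the posterior mean and variance can be approximated on a grid. The model here (one Gaussian observation with unit noise, a uniform prior on an interval, observed x = 1) is an assumption chosen for the sketch, not the paper's setting:

```python
import numpy as np

# Toy model (assumed for illustration): one observation x ~ N(theta, 1),
# uniform prior for theta on [-3, 3], observed x = 1.0.
theta = np.linspace(-3.0, 3.0, 2001)
prior = np.ones_like(theta) / theta.size       # uniform prior weights
x = 1.0
likelihood = np.exp(-0.5 * (x - theta) ** 2)   # Gaussian density up to a constant
posterior = prior * likelihood
posterior /= posterior.sum()                   # normalize on the grid

post_mean = float(np.sum(theta * posterior))                    # posterior mean (1.5)
post_var = float(np.sum((theta - post_mean) ** 2 * posterior))  # posterior variance
```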
1.1.3. Game/decision theoretic
Wald’s game/decision-theoretic approach considers a two-player zero-sum game in which player I selects , and player II selects a decision function which estimates the quantity of interest (given the data ), resulting in the loss
for player II. Such a game will normally not have a saddle point, so following von Neumann’s approach , one randomizes both players’ plays to identify a Nash equilibrium. To that end, first observe that, for the quadratic loss considered here (for ease of presentation), only the choice of player I needs to be randomized. So, let be a probability measure randomizing the play of player I, and consider the lift
Consequently, an optimal strategy for player II is the posterior mean of the form (1.5) determined by a worst-case measure and optimal randomized/mixed strategy for player I
To connect with the Bayesian framework, we observe (by changing the order of integration) that Wald’s risk (1.8) can be written as the average
of the Bayesian decision risk ((1.3) for ) determined by the prior and decision with respect to the -marginal distribution
associated with the prior and the model . However, whereas the prior used in Bayesian decision theory is specified by the practitioner, in the Wald framework it is a worst-case prior (1.10).
1.2. Limitations of the three main approaches to UQ
All three approaches described in Section 1.1 have limitations in terms of accuracy, robustness, and computational complexity. Although the worst-case approach is robust, it is unfavorable in terms of accuracy and data assimilation. The Bayesian approach, on the other hand, suffers from the computational complexity of estimating the posterior distribution and from brittleness with respect to the choice of prior, along with Stark’s admonition “your prior can bite you on the posterior.” Although Kempthorne develops a rigorous numerical procedure with convergence guarantees for solving the equations of Wald’s statistical decision theory, one which appears amenable to computational complexity analysis, it suffers from the curse of dimensionality (see Fig. 1). This can be understood from the fact that the risk associated with the worst-case measure in the Wald framework is an average, over the observational variable, of the conditional risk, conditioned on the observation . Consequently, for a discrete approximation of a worst-case measure, after an observation is made, there may be insufficient mass near the places where the conditioning will provide a good estimate of the appropriate conditional measure. Indeed, in the proposal to develop Wald’s statistical decision theory along the lines of Machine Learning, with its dual focus on performance and computation, it was observed that
“Although Wald’s theory of Optimal Statistical Decisions has resulted in many important statistical discoveries, looking through the three Lehmann symposia of Rojo and Pérez-Abreu  in 2004, and Rojo [28, 29] in 2006 and 2009, it is clear that the incorporation of the analysis of the computational algorithm, both in terms of its computational efficiency and its statistical optimality, has not begun.”
Moreover, one might ask why, after seeing the data, one is choosing a worst-case measure which optimizes the average (1.11) of the Bayesian risk (1.6), instead of choosing it to optimize the value of the risk at the value of the observation .
1.3. UQ of the 4th kind
In this paper, we introduce a framework which is a hybrid between Wald’s statistical decision theory, Bayesian decision theory [3, Sec. 4.4], robust optimization, and hypothesis testing. For simplicity, we describe its components here when the loss function is the loss; later, in Section 6, we develop the framework for general loss functions.
1.3.1. Rarity assumption on the data
We assume the data is generated according to the density of the -marginal determined by and the model , for some . In the proposed framework, we consider a game, played after observing the data , whose loss function is defined by the Bayesian decision risk (1.3), where player I selects a prior subject to a rarity assumption () and player II selects a decision . The rarity assumption considered here is
Since for all in the support of any , it follows that such a satisfies (1.13), and is therefore sufficient to prevent Bayesian brittleness.
1.3.2. The relative likelihood for the rarity assumption
Observe in (1.4) that the map from the prior to the posterior is scale-invariant in the likelihood and that the effects of scaling the likelihood in the rarity assumption can be undone by modifying . Consequently, we scale the likelihood function
to its relative likelihood function
According to Sprott [35, Sec. 2.4], the relative likelihood measures the plausibility of any parameter value relative to the maximum likelihood value and summarizes the information about contained in the sample . See Rossi [31, p. 267] for its large-sample connection with the
distribution and several examples of the relationship between likelihood regions and confidence intervals.
For and , let
denote the corresponding likelihood region and, updating (1.14), redefine the rarity assumption by
That is, the rarity constraint constrains priors to have support on the likelihood region . We will now define the confidence level of the family .
1.3.3. Significance/confidence level
For a given , let the significance at the value be the maximum (over ) of the probability that the data does not satisfy the rarity assumption , i.e.,
where, for fixed , is the indicator function of the set . Observe that, in the setting of hypothesis testing, (1) can be interpreted as the p-value associated with the hypothesis that the rarity assumption is not satisfied (i.e. the hypothesis that does not belong to the set (1.17)), and (2) can be interpreted as the confidence level associated with the rarity assumption (i.e. the smallest probability that belongs to the set (1.17)). Therefore, to select , we set a significance level (e.g. ) and choose to be the largest value such that the significance at satisfies .
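A sketch of this selection rule, for an assumed one-sample Gaussian toy model: estimate the significance at each candidate value by Monte Carlo (the maximum, over a parameter grid, of the probability that the relative likelihood of fresh data falls below the candidate), and keep the largest candidate whose significance stays below the level 0.05. The grid, the parameter values, and the sample sizes are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def significance(alpha, theta_grid, n_mc=20000, sigma=1.0):
    # Monte Carlo estimate of the significance at level alpha: the largest,
    # over theta in the grid, probability that data drawn at theta has
    # relative likelihood below alpha (i.e. violates the rarity assumption).
    worst = 0.0
    for theta in theta_grid:
        x = rng.normal(theta, sigma, size=n_mc)
        rel_lik = np.exp(-0.5 * ((x - theta) / sigma) ** 2)
        worst = max(worst, float(np.mean(rel_lik < alpha)))
    return worst

beta = 0.05  # significance level
alphas = np.linspace(0.01, 0.99, 99)
feasible = [a for a in alphas if significance(a, theta_grid=[-1.0, 0.0, 1.0]) <= beta]
alpha_star = max(feasible)
```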
For models where the maximum of the likelihood function
is expensive to compute but an efficiently computable upper approximation is available, the surrogate
to the relative likelihood may be used in place of (1.15). If we let denote the value determined in (1.19) using the surrogate (1.20) and denote the corresponding likelihood region, then we have . Consequently, obtaining for significance level implies that .
As an example, for an -dimensional Gaussian model with , the elementary upper bound
the surrogate relative likelihood defined in (1.20) becomes
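In code, such a surrogate is straightforward. For a Gaussian with known noise level, the likelihood is bounded above by its value at the mean equal to the data, so the ratio exp(-||x - mu||^2/(2 sigma^2)) lower-bounds the true relative likelihood over a constrained parameter set (and equals it when the mean is unconstrained). Symbol names here are assumptions for illustration:

```python
import numpy as np

def surrogate_relative_likelihood(mu, x, sigma=1.0):
    """Surrogate relative likelihood for a Gaussian model N(mu, sigma^2 I):
    the likelihood divided by its elementary upper bound (2*pi*sigma^2)**(-n/2),
    which leaves only the exponential factor."""
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return float(np.exp(-np.sum((x - mu) ** 2) / (2.0 * sigma ** 2)))
```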
1.3.4. Posterior game and risk
After observing , we now consider playing a game using the loss
Since the likelihood is positive, it follows from
that the map from priors to posteriors is bijective, and so we can completely remove the conditioning in the game defined by (1.21) and instead play a game using the loss
Recall that a pair is a saddle point of the game (1.23) if
We then have the following theorem.
Consider , , and suppose that the relative likelihood and the quantity of interest are continuous. The loss function for the game (6.3) has saddle points and a pair is a saddle point for if and only if
Furthermore the associated risk (the value of the two person game (6.3))
is the same for all saddle points of . Moreover, the second component of the set of saddle points is unique and the set of first components of saddle points is convex, providing a convex ridge of saddle points.
1.4. Duality with the minimum enclosing ball
Although the Lagrangian duality between the maximum variance problem and the minimum enclosing ball problem on finite sets is known (see Yildirim ), we now analyze the infinite case. Utilizing Lim and McCann’s [15, Thm. 1] recent generalization of the one-dimensional result of Popoviciu regarding the relationship between variance maximization and the minimum enclosing ball, the following theorem demonstrates that the maximum variance problem (1.25) determining a worst-case measure is essentially the Lagrangian dual of the minimum enclosing ball problem on the image . Let denote the pushforward map (change of variables) defined by for every Borel set , mapping probability measures on to probability measures on .
For , , suppose the relative likelihood and the quantity of interest are continuous. Consider a saddle point of the game (6.3). The optimal decision and its associated risk (6.10) are equal to the center and squared radius, respectively, of the minimum enclosing ball of , i.e. the minimizer and the value of the minimum enclosing ball optimization problem
Moreover, the variance maximization problem on (1.25) pushes forward to the variance maximization problem on the image of the likelihood region under giving the identity
and the latter is the Lagrangian dual to the minimum enclosing ball problem (1.27) on the image . Finally, let , with center , denote the minimum enclosing ball of . Then a measure is optimal for the variance maximization problem (1.25) if and only if
that is, all the mass of lives on the intersection of the image of the likelihood region and the boundary of its minimum enclosing ball and the center of mass of the measure is the center of .
Note that once , and therefore , is determined, the computation of the risk and the minmax estimator reduces to the minimum enclosing ball problem about , which is also determined by the worst-case optimization problem (1.2) for .
Theorem 1.3 introduces the possibility of primal-dual algorithms; in particular, it provides rigorous stopping criteria for the maximum variance problem (1.25). To that end, for a feasible measure , let denote its variance and denote by (1.26) the optimal variance. Let be feasible for the minimum enclosing ball problem (1.27). Then the inequality implies the rigorous bound
quantifying the suboptimality of the measure in terms of known quantities and .
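The stopping criterion is easy to operationalize: the variance of any feasible discrete measure lower-bounds the optimal value, while the squared radius of any enclosing ball upper-bounds it, so their difference certifies suboptimality. A minimal sketch on an arbitrary point set:

```python
import numpy as np

def variance(points, weights):
    """Variance of a discrete measure: points q_i with weights w_i."""
    q = np.asarray(points, dtype=float)
    w = np.asarray(weights, dtype=float)
    mean = w @ q
    return float(w @ np.sum((q - mean) ** 2, axis=1))

points = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]])
weights = np.array([0.5, 0.5, 0.0])
var = variance(points, weights)                       # lower bound on the optimum
center = np.array([1.0, 0.0])                         # any candidate ball center
R2 = float(np.max(np.sum((points - center) ** 2, axis=1)))  # squared radius upper bound
gap_bound = R2 - var                                  # certified suboptimality bound
```

Here the two bounds coincide, certifying that the measure is optimal for this support.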
1.5. Finite-dimensional reduction
Let denote the set of convex combinations of Dirac measures located in , and let , defined by
denote the finite-dimensional subset of the rarity assumption set consisting of the convex combinations of Dirac measures supported in .
As a consequence of Theorems 1.3 and 1.5, a measure with finite support on is the pushforward under of an optimal measure for the maximum variance problem (1.25) if and only if, as illustrated in Figure 2, it is supported on the intersection of and the boundary of the minimum enclosing ball of and the center of is the center of mass of the measure .
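As a toy illustration of this finite-dimensional reduction, one can maximize the variance over convex combinations of Diracs on a small fixed candidate support by brute force and check the optimality condition (mass on the boundary of the minimum enclosing ball, barycenter at its center). The support points below are arbitrary:

```python
import numpy as np

def max_variance_on_support(points, n_grid=51):
    """Brute-force variance maximization over convex combinations of Diracs
    on a fixed 3-point candidate support, by scanning a simplex grid."""
    pts = np.asarray(points, dtype=float)
    assert len(pts) == 3, "this toy scan is written for a 3-point support"
    best_w, best_v = None, -np.inf
    for i in range(n_grid):
        for j in range(n_grid - i):
            w = np.array([i, j, n_grid - 1 - i - j], dtype=float) / (n_grid - 1)
            mean = w @ pts                               # barycenter of the measure
            v = float(w @ np.sum((pts - mean) ** 2, axis=1))  # its variance
            if v > best_v:
                best_v, best_w = v, w
    return best_w, best_v

pts = [[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]]
w_opt, v_opt = max_variance_on_support(pts)
```

On this support the maximizer puts weight 1/2 on each of the two points spanning the ball's diameter, its barycenter is the ball's center, and the optimal variance equals the squared radius, matching the characterization above.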
1.6. Relaxing MLE with an accuracy/robustness tradeoff
For fixed , assume that the model is such that the maximum likelihood estimate (MLE)
of exists and is unique.
Observe that, for near one: (1) the support of and concentrate around the MLE and ; (2) the risk (6.10) concentrates around zero; and (3) the confidence associated with the rarity assumption is the smallest. In that limit, our estimator inherits the accuracy and lack of robustness of the MLE approach to estimating the quantity of interest.
Conversely, for near zero, since by (1.17) : (1) the support of the pushforward of by concentrates on the boundary of and concentrate around the center of the minimum enclosing ball of ; (2) the risk (1.26) is the highest and concentrates around the worst-case risk (1.2); and (3) the confidence associated with the rarity assumption is the highest. In that limit, our estimator inherits the robustness and lack of accuracy of the worst-case approach to estimating the quantity of interest.
For between and , the proposed game-theoretic approach induces a minmax optimal tradeoff between the accuracy of MLE and the robustness of the worst case.
2. Two simple examples
2.1. Gaussian Mean Estimation
Consider the problem of estimating the mean with known variance from the observation of one sample from that distribution and from the information that for some given . Note that this problem can be formulated in the setting of Problem 1 by letting (1) be the Gaussian distribution on with mean and variance , (2) and , and (3) be the identity map . Following Remark 1.1, we utilize the surrogate relative likelihood
to define the likelihood region obtained by the maximum likelihood upper bound instead of the true relative likelihood and, for simplicity, remove the prime and henceforth indicate it as . A simple calculation yields
where maximize the variance
where the last two constraints are equivalent to the rarity assumption .
Hence, for near , , and by Theorem 1.3, the variance is maximized by placing a Dirac at each of the two boundary points of the region , each receiving half of the total probability mass, that is, by and , in which case and . For , the rarity constraint implies when , leading to the MLE with . Note that from (1.19) we have
Figure 3 shows the different results of solving the optimization problem (2.4) in the cases , and . We plot the curve (top left), the likelihood of the model in and the level sets (top right), and the evolution of the risk with (bottom left) and of the optimal decision with (bottom right). Since, by Theorem 1.3, the optimal decision is the midpoint of the interval with extremes in either the level sets or , we observe that for low our optimal decision does not coincide with the MLE.
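A hedged sketch of this example in code (the symbol names x, sigma, a, alpha are assumptions): the surrogate likelihood region is the interval x ± sigma·sqrt(-2 ln alpha) intersected with [-a, a], the minmax decision is its midpoint, and the risk is its squared half-width:

```python
import numpy as np

def optimal_decision(x, alpha, sigma=1.0, a=2.0):
    """Minmax decision and risk for the 1-D Gaussian mean example: the
    center and squared radius of the minimum enclosing ball (interval) of
    the surrogate likelihood region intersected with [-a, a]."""
    half = sigma * np.sqrt(-2.0 * np.log(alpha))   # half-width of the level set
    lo, hi = max(x - half, -a), min(x + half, a)   # intersect with [-a, a]
    decision = 0.5 * (lo + hi)                     # midpoint of the interval
    risk = ((hi - lo) / 2.0) ** 2                  # squared half-width
    return decision, risk

d, r = optimal_decision(x=1.5, alpha=0.5, sigma=1.0, a=2.0)
```

As alpha tends to 1 the decision tends to the observation (the MLE), and as alpha tends to 0 it tends to the center of [-a, a], the worst-case answer.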
2.2. Coin toss
2.2.1. tosses of a single coin.
In this example, we estimate the probability that a biased coin lands on heads from the observation of independent tosses of that coin. Specifically, we consider flipping a coin which has an unknown probability of coming up heads and probability of coming up tails . Here , , and the model is and . We toss the coin times, generating a sequence of i.i.d. Bernoulli variables all with the same unknown parameter , and let denote the outcome of the experiment. Let denote the number of heads observed and the number of tails. Then the model for the -fold toss is
and, given an observation , the MLE is so that the relative likelihood (1.15) is
Although violates our positivity assumption on the model, this technical restriction can be removed in this case, so we can still use this example as an illustration. We seek to estimate , so let and let the quantity of interest be the identity function . In this case, given , the likelihood region
constrains the support of priors to points with relative likelihood larger than .
where maximize the variance
Equation (1.19) allows us to compute as a function of . The solution of the optimization problem can be found by finding the minimum enclosing ball of the set , which in this one-dimensional case is a subinterval of the interval . For tosses resulting in heads and tails, Figure 4 plots (1) , the relative likelihood, its level sets and minimum enclosing balls as a function of , and (2) the risk (1.26) and optimal decision as a function of . Three different points on the curve are highlighted.
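The single-coin computation can be sketched numerically (the grid size and the particular observation below are illustrative): the relative likelihood is unimodal, so its level set is an interval, and the minmax decision and risk are the interval's midpoint and squared half-width:

```python
import numpy as np

def coin_decision(h, t, alpha, n_grid=100001):
    """h heads and t tails give MLE h/(h+t); the relative likelihood is
    (theta/mle)**h * ((1-theta)/(1-mle))**t, and its alpha-level set is an
    interval located by a grid scan."""
    mle = h / (h + t)
    theta = np.linspace(1e-9, 1 - 1e-9, n_grid)
    log_r = (h * (np.log(theta) - np.log(mle)) +
             t * (np.log1p(-theta) - np.log1p(-mle)))
    region = theta[log_r >= np.log(alpha)]      # the likelihood region (interval)
    lo, hi = region.min(), region.max()
    return 0.5 * (lo + hi), ((hi - lo) / 2.0) ** 2

d, r = coin_decision(h=7, t=3, alpha=0.5)
```

As expected, the decision lies between the MLE 0.7 and the midpoint of the full interval, and the risk shrinks as alpha grows.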
2.2.2. and tosses of two coins.
We now consider the same problem with two independent coins with unknown probabilities . After tossing each coin times, the observation consists of heads and tails for each , producing a 2D relative likelihood function on given by
Figure 5 illustrates the level sets and their corresponding bounding balls for and different values of .
3. Computational framework
The introduction developed this framework in the context of a model with density in terms of a single sample . In Section 2.2, the single-sample case was extended to i.i.d. samples by defining the multisample and the product model density . Extensions incorporating correlations in the samples, such as Markov or other stochastic processes, can easily be developed. Here we continue this development for the general model of the introduction for the loss and also develop more fully a Gaussian noise model. Later, in Sections 5 and 4, these models will be tested on estimating a quadratic function and a Lotka-Volterra predator-prey model based on noisy observations. In Section 6 the framework will be generalized to more general loss functions and rarity assumptions, to which much of the current section extends.
Let the possible states of nature be a compact subset , the decision space be and the elements of the -fold multisample
lie in , that is, lies in the multisample space . Let the components of the quantity of interest be indicated by . Here, using the i.i.d. product model with density , the definition of in (1.19) becomes
where, for fixed , is the indicator function of the set . We use boldface, such as or to emphasize the vector nature of variables and functions in the computational framework.
where the component of the objective function
is the variance of the random variable under the measure .
3.1. Algorithm for solving the game
We are now prepared to develop an algorithm for player II (the decision maker) to play the game (1.23), using the saddle point theorem (Theorem 1.2) and the finite-dimensional reduction (Theorem 1.5). The rarity parameter, quantifying the rarity assumption (1.17) in terms of the relative likelihood (1.15) or a surrogate as described in Remark 1.1, is selected to be the largest such that the significance (3.1) at satisfies , where is the significance level.
At a high level the algorithm for computing a worst-case measure, its resulting risk (variance) and optimal estimator is as follows:
To solve (3.2) in Step 3 we apply the duality of the variance maximization problem with the minimum enclosing ball problem (Theorem 1.3) to obtain the following complete algorithm. It uses Algorithm 3 for computing the minimum enclosing ball about the (generally) infinite set , which in turn uses a minimum enclosing ball algorithm Miniball applied to finite sets of size at most ; see e.g. Welzl , Yildirim and Gärtner . Here we use that of Welzl . See Section 7 for a discussion and a proof, in Theorem 7.1, of the convergence of Algorithm 3. Theorem 7.1 also establishes convergence when the distance maximization Step 8a in Algorithm 1 is performed approximately. Note that the likelihood region is defined by
is an MLE.
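The alternating structure of the algorithm can be sketched as follows (illustrative only: a finite grid stands in for the image of the likelihood region, and a simple farthest-point iteration stands in for Miniball): maintain a finite active set, compute its minimum enclosing ball, add the region point farthest from the current center, and stop once no region point lies meaningfully outside the ball:

```python
import numpy as np

def meb(points, n_iter=3000):
    """Approximate minimum enclosing ball of a finite set
    (Badoiu-Clarkson farthest-point iteration)."""
    pts = np.asarray(points, dtype=float)
    c = pts.mean(axis=0)
    for k in range(1, n_iter + 1):
        c += (pts[np.argmax(np.linalg.norm(pts - c, axis=1))] - c) / (k + 1)
    return c, float(np.max(np.linalg.norm(pts - c, axis=1)))

def enclosing_ball_of_region(region_pts, tol=1e-3, max_rounds=50):
    """Grow a finite active set S inside the region until its minimum
    enclosing ball covers the whole region (up to tol)."""
    region = np.asarray(region_pts, dtype=float)
    S = [region[0]]
    for _ in range(max_rounds):
        c, r = meb(S)
        d = np.linalg.norm(region - c, axis=1)
        far = int(np.argmax(d))
        if d[far] <= r + tol:       # farthest region point already covered
            return c, r
        S.append(region[far])       # grow the active set and re-solve
    return c, r

# Region stand-in: 200 points on the unit circle (true ball: center 0, radius 1).
ts = np.linspace(0, 2 * np.pi, 200, endpoint=False)
region = np.stack([np.cos(ts), np.sin(ts)], axis=1)
c, r = enclosing_ball_of_region(region)
```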
3.1.1. Large sample simplifications
Here we demonstrate that when the number of samples is large, under classic regularity assumptions, the significance is approximated by the value of a chi-squared distribution, substantially simplifying the determination of in Step 3 of Algorithm 1.
Let and let the data be generated by the model at the value . Under standard regularity conditions (see Casella and Berger [7, Sec. 10.6.2 & Thm. 10.1.12]), the maximum likelihood estimator (MLE), , is asymptotically efficient for . That is, as the sample size
where is the Fisher information matrix. Therefore, standard arguments (see Casella and Berger [7, Thm. 10.3.1]) for the asymptotic distribution of the likelihood ratio test result in the following approximation of .
Let and assume that the model density satisfies the regularity conditions of Casella and Berger [7, Sec. 10.6.2]. Then
as , where is the chi-square distribution with degrees of freedom.
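This approximation makes the calibration of Step 3 essentially closed-form: the significance at a given level is approximately the chi-square tail probability at minus twice its logarithm, so the largest level meeting a significance level beta is obtained by inverting the chi-square quantile. A sketch (assuming the stated regularity conditions hold, with k the parameter dimension):

```python
import numpy as np
from scipy.stats import chi2

def alpha_for_significance(beta, k):
    """Invert the Wilks approximation: significance(alpha) ~ P(chi2_k > -2 log alpha),
    so the largest alpha with significance <= beta is exp(-chi2_k.ppf(1 - beta) / 2)."""
    return float(np.exp(-0.5 * chi2.ppf(1.0 - beta, df=k)))

alpha = alpha_for_significance(beta=0.05, k=1)  # one-parameter model
```

For k = 2 the chi-square distribution is exponential, so this gives alpha = beta exactly; for k = 1 and beta = 0.05 it gives alpha approximately 0.147.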
3.2. Algorithm 1 for a Gaussian noise model
Consider a Gaussian noise model where, and for , the components of the multisample are i.i.d. samples from the Gaussian distribution