## 1. Introduction

Let be a quantity of interest, where

is a finite-dimensional vector space and

is a compact set. Let be a measurable space and writefor the set of probability distributions on

. Consider a model representing the dependence of the distribution of a data point on the value of the parameter . Throughout, we use to denote the Euclidean norm. We are interested in solving the following problem.###### Problem 1.

Let be an unknown element of . Given an observation of data, estimate and quantify the uncertainty (accuracy/risk) of the estimate.

We will assume that is a dominated model with positive densities, that is, for each , is defined by a (strictly) positive density with respect to a measure , such that, for each measurable subset of ,

(1.1) |

### 1.1. The three main approaches to UQ

Problem 1 is a fundamental Uncertainty Quantification (UQ) problem, and there are essentially three main approaches for solving it. We now describe them when is a Euclidean space with the loss function.

#### 1.1.1. Worst-case

In a different setting, essentially where the set

consists of probability measures, the OUQ framework

[22] provides a worst-case analysis for providing rigorous uncertainty bounds. In the setting of this paper, in the absence of data (or ignoring the data ), the (vanilla) worst-case (or robust optimization) answer is to estimate with the minimizer of the worst-case error(1.2) |

In that approach, are therefore identified as the center and squared radius of the minimum enclosing ball of .

#### 1.1.2. Bayesian

The (vanilla) Bayesian (decision theory) approach (see e.g. Berger [3, Sec. 4.4]) is to assume that is sampled from a prior distribution , and approximate with the minimizer of the Bayesian posterior risk

(1.3) |

associated with the decision , where

(1.4) |

is the posterior measure determined by the likelihood , the prior and the observation . This minimizer is the posterior mean

(1.5) |

and the uncertainty is quantified by the posterior variance

(1.6) |

#### 1.1.3. Game/decision theoretic

The Wald’s game/decision theoretic approach is to consider a two-player zero-sum game where player I selects , and player II selects a decision function which estimates the quantity of interest (given the data ), resulting in the loss

(1.7) |

for player II. Such a game will normally not have a saddle point, so following von Neumann’s approach [40], one randomizes both players’ plays to identify a Nash equilibrium. To that end, first observe that, for the quadratic loss considered here (for ease of presentation), only the choice of player I needs to be randomized. So, let be a probability measure randomizing the play of player I, and consider the lift

(1.8) |

of the game (1.7). A minmax optimal estimate of is then obtained by identifying a Nash equilibrium (a saddle point) for (1.8), i.e. and satisfying

(1.9) |

Consequently, an optimal strategy of player II is then the posterior mean of the form (1.5) determined by a worst-case measure and optimal randomized/mixed strategy for player I

(1.10) |

To connect with the Bayesian framework we observe (by changing the order of integration) that the Wald’s risk (1.8) can be written as the average

(1.11) |

of the Bayesian decision risk ((1.3) for ) determined by the prior and decision with respect to the -marginal distribution

(1.12) |

associated with the prior and the model . However, the prior used in Bayesian decision theory is specified by the practitioner and in the Wald framework is a worst-case prior (1.10).

### 1.2. Limitations of the three main approaches to UQ

All three approaches described in Section 1.1 have limitations in terms of accuracy, robustness, and computational complexity. Although the worst-case approach is robust, it appears unfavorable in terms of accuracy and data assimilation. The Bayesian approach, on the other hand, suffers from the computational complexity of estimating the posterior distribution and from brittleness [21] with respect to the choice of prior along with Stark’s admonition [36] “your prior can bite you on the posterior.” Although Kempthorne [13] develops a rigorous numerical procedure with convergence guarantees for solving the equations of Wald’s statistical decision theory which appears amenable to computational complexity analysis, it appears it suffers from the curse of dimensionality (see Fig. 1). This can be understood from the fact that the risk associated with the worst-case measure in the Wald framework is an average over the observational variable of the conditional risk, conditioned on the observation . Consequently, for a discrete approximation of a worst-case measure, after an observation is made, there may be insufficient mass near the places where the conditioning will provide a good estimate of the appropriate conditional measure. Indeed, in the proposal [19]

to develop Wald’s statistical decision theory along the lines of Machine Learning, with its dual focus on performance and computation, it was observed that

“Although Wald’s theory of Optimal Statistical Decisions has resulted in many important statistical discoveries, looking through the three Lehmann symposia of Rojo and Pérez-Abreu [30] in 2004, and Rojo [28, 29] in 2006 and 2009, it is clear that the incorporation of the analysis of the computational algorithm, both in terms of its computational efficiency and its statistical optimality, has not begun.”

Moreover, one might ask why, after seeing the data, one is choosing a worst-case measure which optimizes the average (1.11) of the Bayesian risk (1.6), instead of choosing it to optimize the value of the risk at the value of the observation .

### 1.3. UQ of the 4th kind

In this paper, we introduce a framework which is a hybrid between Wald’s statistical decision theory [42], Bayesian decision theory [3, Sec. 4.4], robust optimization and hypothesis testing. Here we describe its components for simplicity when the loss function is the loss. Later in Section 6 we develop the framework for general loss functions.

#### 1.3.1. Rarity assumption on the data

In [21, Pg. 576] it was demonstrated that one could alleviate the brittleness of Bayesian inference (see [20, 18]) by restricting to priors for which the observed data is not rare, that is,

(1.13) |

according to the density of the -marginal determined by and the model , for some . In the proposed framework, we consider playing a game after observing the data whose loss function is defined by the Bayesian decision risk (1.3), where player I selects a prior subject to a rarity assumption () and player II selects a decision . The rarity assumption considered here is

(1.14) |

Since for all in the support of any it follows that such a satisfies (1.13) and therefore is sufficient to prevent Bayesian brittleness.

#### 1.3.2. The relative likelihood for the rarity assumption

Observe in (1.4) that the map from the prior to posterior is scale-invariant in the likelihood and that the effects of scaling the likelihood in the rarity assumption can be undone by modifying . Consequently, we scale the likelihood function

(1.15) |

to its relative likelihood function

(1.16) |

According to Sprott [35, Sec. 2.4], the relative likelihood measures the plausibility of any parameter value relative to a maximum likely and summarizes the information about contained in the sample . See Rossi [31, p. 267] for its large sample connection with the

distribution and several examples of the relationship between likelihood regions and confidence intervals.

For and , let

(1.17) |

denote the corresponding likelihood region and, updating (1.14), redefine the rarity assumption by

(1.18) |

That is, the rarity constraint constrains priors to have support on the likelihood region . We will now define the confidence level of the family .

#### 1.3.3. Significance/confidence level

For a given , let the significance at the value be the maximum (over ) of the probability that a data does not satisfy the rarity assumption , i.e.,

(1.19) | |||||

where, for fixed , is the indicator function of the set . Observe that, in the setting of hypothesis testing, (1) can be interpreted as the p-value associated with the hypothesis that the rarity assumption is not satisfied (i.e. the hypothesis that does not belongs to the set (1.17)), and (2) can be interpreted as the confidence level associated with the rarity assumption (i.e. the smallest probability that belongs to the set (1.17)). Therefore, to select , we set a significance level (e.g. ) and choose to be the largest value such that the significance at satisfies .

###### Remark 1.1.

For models where the maximum of the likelihood function

is expensive to compute but for which there exists an efficiently computable upper approximation available, the surrogate

(1.20) |

to the relative likelihood may be used in place of (1.15). If we let denote the value determined in (1.19) using the surrogate (1.20) and denote the corresponding likelihood region, then we have . Consequently, obtaining for significance level implies that .

As an example, for an -dimensional Gaussian model with with , the elementary upper bound

the surrogate relative likelihood defined in (1.20) becomes

#### 1.3.4. Posterior game and risk

After observing , we now consider playing a game using the loss

(1.21) |

Since the likelihood is positive, it follows from

(1.22) |

that the map from priors to posteriors is bijective and so that we can completely remove the conditioning in the game defined by (1.21) and instead play a game using the loss

(1.23) |

Recall that a pair is a saddle point of the game (1.23) if

We then have the following theorem.

###### Theorem 1.2.

Consider , , and suppose that the relative likelihood and the quantity of interest are continuous. The loss function for the game (6.3) has saddle points and a pair is a saddle point for if and only if

(1.24) |

and

(1.25) |

Furthermore the associated risk (the value of the two person game (6.3))

(1.26) |

is the same for all saddle points of . Moreover, the second component of the set of saddle points is unique and the set of first components of saddle points is convex, providing a convex ridge of saddle points.

### 1.4. Duality with the minimum enclosing ball

Although the Lagrangian duality between the maximum variance problem and the minimum enclosing ball problem on finite sets is known, see Yildirim [44], we now analyze the infinite case. Utilizing the recent generalization of the one-dimensional result of Popoviciu [25] regarding the relationship between variance maximization and the minimum enclosing ball by Lim and McCann [15, Thm. 1], the following theorem demonstrates that essentially the maximum variance problem (1.25) determining a worst-case measure is the Lagrangian dual of the minimum enclosing ball problem on the image . Let denote the pushforward map (change of variables) defined by for every Borel set , mapping probability measures on to probability measures on .

###### Theorem 1.3.

For , , suppose the relative likelihood and the quantity of interest are continuous. Consider a saddle point of the game (6.3). The optimal decision and its associated risk (6.10) are equal to the center and squared radius, respectively, of the minimum enclosing ball of , i.e. the minimizer and the value of the minimum enclosing ball optimization problem

(1.27) |

Moreover, the variance maximization problem on (1.25) pushes forward to the variance maximization problem on the image of the likelihood region under giving the identity

and the latter is the Lagrangian dual to the minimum enclosing ball problem (1.27) on the image . Finally, let , with center , denote the minimum enclosing ball of . Then a measure is optimal for the variance maximization problem (1.25) if and only if

and

that is, all the mass of lives on the intersection of the image of the likelihood region and the boundary of its minimum enclosing ball and the center of mass of the measure is the center of .

###### Remark 1.4.

Note that once , and therefore , is determined that the computation of the risk and the minmax estimator is determined by the minimum enclosing ball about , which is also determined by the worst-case optimization problem (1.2) for .

Theorem 1.3 introduces the possibility of primal-dual algorithms. In particular, the availability of rigorous stopping criteria for the maximum variance problem (1.25). To that end, for a feasible measure , let denote its variance and denote by (1.26) the optimal variance. Let be a feasible for the minimum enclosing ball problem (1.27). Then the inequality implies the rigorous bound

(1.28) |

quantifying the suboptimality of the measure in terms of known quantities and .

### 1.5. Finite-dimensional reduction

Let denote the set of convex sums of Dirac measures located in and, let defined by

(1.29) |

denote the finite-dimensional subset of the rarity assumption set consisting of the convex combinations of Dirac measures supported in .

###### Theorem 1.5.

As a consequence of Theorems 1.3 and 1.5, a measure with finite support on is the pushforward under of an optimal measure for the maximum variance problem (1.25) if and only if, as illustrated in Figure 2, it is supported on the intersection of and the boundary of the minimum enclosing ball of and the center of is the center of mass of the measure .

### 1.6. Relaxing MLE with an accuracy/robustness tradeoff

For fixed , assume that the model is such that the maximum likelihood estimate (MLE)

(1.33) |

of exists and is unique.

Observe that for near one (1) the support of and concentrate around the MLE and , (2) the risk (6.10) concentrates around zero, and (3) the confidence associated with the rarity assumption is the smallest. In that limit, our estimator inherits the accuracy and lack of robustness of the MLE approach to estimating the quantity of interest.

Conversely for near zero, since by (1.17) , (1) the support of the pushforward of by concentrates on the boundary of and concentrate around the center of the minimum enclosing ball of , (2) the risk (1.26) is the highest and concentrates around the worst-case risk (1.2), and (3) the confidence associated with the rarity assumption is the highest. In that limit, our estimator inherits the robustness and lack of accuracy of the worst-case approach to estimating the quantity of interest.

For between and , the proposed game-theoretic approach induces a minmax optimal tradeoff between the accuracy of MLE and the robustness of the worst case.

## 2. Two simple examples

### 2.1. Gaussian Mean Estimation

Consider the problem of estimating the mean

with known variance from the observation of one sample from that distribution and from the information that for some given . Note that this problem can be formulated in the setting of Problem 1 by letting (1) be the Gaussian distribution on with mean and variance , (2) and and (3) be the identity map . Following Remark 1.1, we utilize the surrogate relative likelihood(2.1) |

to define the likelihood region obtained by the maximum likelihood upper bound instead of the true relative likelihood and, for simplicity, remove the prime and henceforth indicate it as . A simple calculation yields

(2.2) |

Using Theorem 1.5 with , for , one can compute a saddle point of the game (1.23) as

(2.3) |

where maximize the variance

(2.4) |

where the last two constraints are equivalent to the rarity assumption .

Hence for near , , and by Theorem 1.3, the variance is maximized by placing each Dirac on each boundary point of the region , each receiving half of the total probability mass, that is by , and , in which case and . For , the rarity constraint implies when , leading to the MLE with . Note that from (1.19) we have

where

is the cumulative distribution function of the chi-squared distribution with one degree of freedom. We illustrate in Figure

3 the different results of solving the optimization problem (2.4) in the case , and . We plot the curve (top left), the likelihood of the model in and the level sets (top right), and the evolutions of the risk with (bottom left), and the optimal decision with (bottom right). Since, by Theorem 1.3, the optimal decision is the midpoint of the interval with extremes in either the level sets or , we observe that for low , our optimal decision does not coincide with the MLE.### 2.2. Coin toss

#### 2.2.1. tosses of a single coin.

In this example, we estimate the probability that a biased coin lands on heads from the observation of independent tosses of that coin. Specifically, we consider flipping a coin which has an unknown probability of coming heads and probability coming up tails . Here , , and the model is and . We toss the coin times generating a sequence of i.i.d. Bernoulli variables all with the same unknown parameter , and let denote the outcome of the experiment. Let denote the number of heads observed and the number of tails. Then the model for the -fold toss is

(2.5) |

and, given an observation , the MLE is so that the relative likelihood (1.15) is

(2.6) |

Although the fact that violates our positivity assumptions on the model in our framework, in this case this technical restriction can be removed, so we can still use this example as an illustration. We seek to estimate , so let and let the quantity of interest be the identity function . In this case, given , the likelihood region

(2.7) |

constrains the support of priors to points with relative likelihood larger than .

Using Theorem 1.5 with , one can compute a saddle point of the game (1.23) as

(2.8) |

where maximize the variance

(2.9) |

Equation (1.19) allows us to compute as a function of . The solution of the optimization problem can be found by finding the minimum enclosing ball of the set , which in this -D case is also subinterval of the interval . For tosses resulting in heads and tails, Figure 4 plots (1) , the relative likelihood, its level sets and minimum enclosing balls as a function of , and (2) The risk (1.26) and optimal decision as a function of . Three different points in the curve are highlighted.

#### 2.2.2. and tosses of two coins.

We now consider the same problem with two independent coins with unknown probabilities . After tossing each coin times, the observation consists of heads and tails for each , produce a 2D relative likelihood function on given by

(2.10) |

Figure 5 illustrates the level sets and their corresponding bounding balls for and different values of .

## 3. Computational framework

The introduction developed this framework in the context of a model with density in terms of a single sample . In Section 2.2, the single sample case was extended to i.i.d. samples by defining the multisample and defining the product model density . Extensions incorporating correlations in the samples, such as Markov or other stochastic processes can easily be developed. Here we continue this development for the general model of the introduction for the loss and also develop more fully a Gaussian noise model. Later, in Sections 5 and 4 these models will be tested on estimating a quadratic function and a Lotka-Volterra predator-prey model based on noisy observations. In Section 6 the framework will be generalized to more general loss functions and rarity assumptions, which much of the current section generalizes to.

Let the possible states of nature be a compact subset , the decision space be and the elements of the -fold multisample

lie in , that is, lies in the multisample space . Let the components of the quantity of interest be indicated by . Here, using the i.i.d. product model with density , the definition of in (1.19) becomes

(3.1) | |||||

where, for fixed , is the indicator function of the set . We use boldface, such as or to emphasize the vector nature of variables and functions in the computational framework.

In this notation, the finite dimensional reduction guaranteed by Theorem 1.5 in (1.31) and (1.32) of the optimization problem (1.25) defining a worst-case measure of the form takes the form

(3.2) |

where the component of the objective function

is the variance of the random variable

under the measure .### 3.1. Algorithm for solving the game

We are now prepared to develop an algorithm for player II (the decision maker) to play the game (1.23), using the saddle point Theorem 1.2 and the finite dimensional reduction Theorem 1.5 after selecting the rarity parameter quantifying the rarity assumption (1.17) in terms of the relative likelihood (1.15) or a surrogate as described in Remark 1.1, to be the largest such that the significance (3.1) at satisfies , the significance level.

At a high level the algorithm for computing a worst-case measure, its resulting risk (variance) and optimal estimator is as follows:

To solve (3.2) in Step 3 we apply the duality of the variance maximization problem with the minimum enclosing ball problem, Theorem 1.3, to obtain the following complete algorithm. It uses Algorithm 3 for computing the minimum enclosing ball about the (generally) infinite set , which in turn uses a minimum enclosing ball algorithm Miniball applied to sets of size at most , see e.g. Welzl [43], Yildirim [44] and Gartner [10]. Here we use that of Welzl [43]. See Section 7 for a discussion and a proof in Theorem 7.1 of the convergence of Algorithm 3. Theorem 7.1 also establishes a convergence proof when the distance maximization Step 8a in Algorithm 1 is performed approximately. Note that the likelihood region is defined by

where

is a MLE.

#### 3.1.1. Large sample simplifications

Here we demonstrate that when the number of samples is large, under classic regularity assumptions, the significance is approximated by the value of a chi-squared distribution, substantially simplifying the determination of in Step 3 of Algorithm 1.

Let and let data be generated by the model at the value . Under standard regularity conditions, check Casella and Berger [7, Sec. 10.6.2 & Thm. 10.1.12], the maximum likelihood estimator (MLE), , is asymptotically efficient for . That is as the sample size

(3.3) |

where is the Fisher information matrix. Therefore, standard arguments, see Casella and Berger [7, Thm. 10.3.1], for the asymptotic distribution of the likelihood ratio test result in the following approximation of .

###### Theorem 3.1.

Let and assume that the model density satisfies the regularity conditions of Casella and Berger [7, Section. 10.6.2]. Then

(3.4) |

as , where is the chi-square distribution with degrees of freedom.

### 3.2. Algorithm 1 for a Gaussian noise model

Consider a Gaussian noise model where, and for , the components of the multisample are i.i.d. samples from the Gaussian distribution

Comments

There are no comments yet.