Risk-averse estimation, an axiomatic approach to inference, and Wallace-Freeman without MML

06/28/2018 ∙ by Michael Brand, et al. ∙ Monash University

We define a new class of Bayesian point estimators, which we refer to as risk-averse estimators. We then use this definition to formulate several axioms that we claim to be natural requirements for good inference procedures, and show that for two classes of estimation problems the axioms uniquely characterise an estimator. Namely, for estimation problems with a discrete hypothesis space, we show that the axioms lead to the MAP estimate, whereas for well-behaved, purely continuous estimation problems the axioms lead to the Wallace-Freeman estimate. Interestingly, this combined use of MAP and Wallace-Freeman estimation reflects the common practice in the Minimum Message Length (MML) community, but there these two estimators are used as approximations for the information-theoretic Strict MML estimator, whereas we derive them exactly, not as approximations, and do so with no use of encoding or information theory.




1 Introduction

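One of the fundamental statistical problems is point estimation. In a Bayesian setting, this can be described as follows. Let $(\boldsymbol{x}, \boldsymbol{\theta})$ be a pair of random variables with a known joint distribution that assigns positive probability / probability density to any $(x, \theta) \in X \times \Theta$. Here, $\boldsymbol{x}$ is known as the observation, $X$ as observation space, $\boldsymbol{\theta}$ as the parameter and $\Theta$ as parameter space. We aim to describe a function $\hat{\theta}: X \to \Theta$ such that $\hat{\theta}(x)$ is our “best guess” for $\boldsymbol{\theta}$ given $\boldsymbol{x} = x$.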

Such problems appear frequently, for example in scientific inference, where we aim to decide on a theory that best fits the known set of experimental results.

The optimal choice of a “best guess” naturally depends on our definition of “best”. The most common Bayesian approach regarding this is that used by Bayes estimators [3], which define “best” explicitly, by means of a loss function. This allows estimators to optimally trade off different types of errors, based on their projected costs.

In this paper, we examine the situation where errors of all forms are extremely costly and should therefore be minimised and if possible avoided, rather than factored in. The scientific scenario, where one aims to decide on a single theory, rather than a convenient trade-off between multiple hypotheses, is an example. We define this scenario rigorously under the name risk-averse estimation.

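We show that for problems in which some $\theta$ values have a positive posterior probability, the assumption of risk-averse estimation is enough to uniquely characterise Maximum A Posteriori (MAP),

$$\hat{\theta}_{\mathrm{MAP}}(x) = \operatorname*{argmax}_{\theta \in \Theta} \mathbf{P}(\boldsymbol{\theta} = \theta \mid \boldsymbol{x} = x).$$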

Risk-averse estimation does not suffice alone, however, to uniquely characterise a solution for continuous problems, i.e. problems where the joint distribution of $(\boldsymbol{x}, \boldsymbol{\theta})$ can be described by a probability density function $f$. To do so, we introduce three additional axioms, two of which relate to invariance to representation and the last to invariance to irrelevant alternatives. These reflect natural requirements for a good inference procedure, and all of them are also met by MAP.

(Notably, the estimator that maximises the posterior probability density $f(\theta \mid x)$, which in the literature is usually also named MAP, does not satisfy invariance to representation. To avoid confusion, we refer to it as $f$-MAP.)
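The lack of representation invariance of the density-maximising estimator can be seen in a small numerical sketch. The setup below is an assumption chosen for illustration and is not taken from the paper: a binomial observation with a uniform prior on the success probability `p`, reparameterised through the hypothetical choice `phi = logit(p)`. Maximising the posterior density in each parameterisation and mapping back to `p` yields two different estimates.

```python
import math

def argmax_on_grid(log_density, lo, hi, steps=100_000):
    """Return the interior grid point of (lo, hi) that maximises log_density."""
    best_t, best_v = None, -math.inf
    for i in range(1, steps):
        t = lo + (hi - lo) * i / steps
        v = log_density(t)
        if v > best_v:
            best_t, best_v = t, v
    return best_t

n, s = 10, 3  # hypothetical data: 3 successes in 10 Bernoulli trials

def log_post_p(p):
    """Log posterior density over p under a uniform prior (up to a constant)."""
    return s * math.log(p) + (n - s) * math.log(1 - p)

def log_post_phi(phi):
    """Same posterior over phi = logit(p); the change of variables
    multiplies the density by dp/dphi = p * (1 - p)."""
    p = 1 / (1 + math.exp(-phi))
    return log_post_p(p) + math.log(p) + math.log(1 - p)

p_hat = argmax_on_grid(log_post_p, 0.0, 1.0)
phi_hat = argmax_on_grid(log_post_phi, -10.0, 10.0)
p_from_phi = 1 / (1 + math.exp(-phi_hat))

print(p_hat)       # density maximiser found directly in p
print(p_from_phi)  # density maximiser found in phi, mapped back to p
```

In this toy problem the first maximiser sits near `s / n` and the second near `(s + 1) / (n + 2)`, so the estimate depends on how the parameter happens to be written down.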

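We prove that our risk-aversion assumption and the three additional axioms together (and only together) uniquely characterise a single estimation function in the continuous case, namely the Wallace-Freeman estimator (WF) [20],

$$\hat{\theta}_{\mathrm{WF}}(x) = \operatorname*{argmax}_{\theta \in \Theta} \frac{f(\theta \mid x)}{\sqrt{|I_\theta|}},$$

where $I_\theta$ is the Fisher information matrix [10], whose $(i,j)$ element is the conditional expectation

$$I_\theta(i,j) = \mathbf{E}\left[ \frac{\partial \log f(\boldsymbol{x} \mid \theta)}{\partial \theta(i)} \cdot \frac{\partial \log f(\boldsymbol{x} \mid \theta)}{\partial \theta(j)} \,\middle|\, \boldsymbol{\theta} = \theta \right].$$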

A scenario not covered by either of the above is one where $\boldsymbol{\theta}$ is a continuous variable but $\boldsymbol{x}$ is discrete. To handle this case, we introduce a fourth axiom, relating to invariance to superfluous information.

This creates a set of four axioms that is both symmetric and aesthetic: two axioms relate to representation invariance (one in parameter space, the other in observation space), and two relate to invariance to irrelevancies (again, one in each domain).

We show that the four axioms together (and only together) uniquely characterise WF also in the remaining case.

The fact that our axioms uniquely characterise the Wallace-Freeman estimator is in itself of interest, because this estimator exists almost exclusively as part of Minimum Message Length (MML) theory [18], and even there is defined merely as a computationally convenient approximation to Strict MML (SMML) [19], which MML theory considers to be the optimal estimator, for information-theoretical reasons.

Importantly, because SMML is computationally intractable in all but the simplest cases [6], it is generally not used directly, and MML practitioners are encouraged instead to approximate it by MAP in the discrete case and by WF in the continuous case (see, e.g., [5], p. 268). Thus, MML’s standard practice coincides with what risk-averse estimation advocates. However, in the case of risk-averse estimation, neither MAP nor WF is an approximation. Rather, they are both optimal estimators in their own right (within their respective domains), and the justifications given for them are purely Bayesian and involve no coding theory.

Thus, risk-averse estimation provides a new theoretical foundation, unrelated to MML, that explains the empirical success of the MML recipe, for which recent examples include [17, 16, 15, 8, 9].

2 Background

2.1 Bayes estimation

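The most commonly used class of Bayesian estimators is Bayes estimators. A Bayes estimator, $\hat{\theta}_L$, is defined over a loss function,

$$L: \Theta \times \Theta \to \mathbb{R}^{\ge 0},$$

where $L(\theta_1, \theta_2)$ represents the cost of choosing $\theta_2$ when the true value of $\boldsymbol{\theta}$ is $\theta_1$. The estimator chooses an estimate that minimises the expected loss given the observation, $x$:

$$\hat{\theta}_L(x) = \operatorname*{argmin}_{\theta \in \Theta} \mathbf{E}\left[ L(\boldsymbol{\theta}, \theta) \mid \boldsymbol{x} = x \right].$$

We denote by $P_\theta$ the distribution of $\boldsymbol{x}$ at $\boldsymbol{\theta} = \theta$, i.e. the likelihood of $\boldsymbol{x}$ given $\theta$, and assume for all estimation problems and loss functions

$$L(\theta_1, \theta_2) = 0 \Leftrightarrow P_{\theta_1} = P_{\theta_2}. \qquad (1)$$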


When the distribution of $\boldsymbol{x}$ is known to be continuous, we denote by $f_\theta$ the probability density function (pdf) of $P_\theta$, i.e. $f_\theta(x) = f(x \mid \theta)$. Throughout, where $\boldsymbol{x}$ is known to be continuous, we use $f_\theta$ interchangeably with $P_\theta$, and, in general, pdfs interchangeably with the distributions they represent, e.g. in notation such as “$\boldsymbol{x} \sim f$” for “$\boldsymbol{x}$ is a random variable with distribution (pdf) $f$”.

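We say that $L$ is discriminative for an estimation problem if for every $\theta \in \Theta$ and every neighbourhood $N$ of $\theta$, the infima over $\theta' \in \Theta \setminus N$ of both $L(\theta, \theta')$ and $L(\theta', \theta)$ are positive.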

Notably, Bayes estimators are invariant to a linear monotone increasing transform in $L$. They may also be defined over a gain function, $G$, where $G$ is the result of a monotone decreasing affine transform on a loss function; the estimator then maximises the expected gain instead of minimising the expected loss.

Examples of Bayes estimators are posterior expectation, which minimises quadratic loss, and MAP, which minimises loss over the discrete metric. In general, Bayes estimators such as posterior expectation may return a $\hat{\theta}$ value that is not in $\Theta$. This demonstrates how their trade-off of errors may make them unsuitable for a high-stakes “risk-averse” scenario.
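A minimal sketch of this last point, using a hypothetical two-point parameter space and a made-up posterior: the posterior expectation lands between the two admissible values, whereas MAP returns one of them.

```python
# Hypothetical parameter space with two admissible values, and an
# illustrative posterior over them (not taken from the paper).
theta_space = [0.0, 1.0]
posterior = {0.0: 0.6, 1.0: 0.4}

# Posterior expectation: the minimiser of expected quadratic loss
# when the estimate is allowed to range over the real line.
posterior_mean = sum(theta * weight for theta, weight in posterior.items())

# MAP: the value with the highest posterior probability.
map_estimate = max(posterior, key=posterior.get)

print(posterior_mean, posterior_mean in theta_space)
print(map_estimate, map_estimate in theta_space)
```

Here the posterior expectation is a compromise between the two hypotheses and corresponds to neither of them.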

2.2 Set-valued estimators

Before defining risk-averse estimation, we must make a note regarding set-valued estimators.

Typically, estimators are considered as functions from the observation space to (extended) parameter space. However, all standard point estimators are defined by means of an $\operatorname{argmin}$ or an $\operatorname{argmax}$. Such functions intrinsically allow the result to be a subset of $\Theta$, rather than an element of $\Theta$.

We say that an estimator is a well-defined point estimator for $(\boldsymbol{x}, \boldsymbol{\theta})$ if it returns a single-element set for every $x \in X$, in which case we take this element to be its estimate. Otherwise, we say it is a set estimator. The set estimator, in turn, is well-defined on $(\boldsymbol{x}, \boldsymbol{\theta})$ if it does not return an empty set as its estimate for any $x \in X$.

All estimators discussed will therefore be taken to be set estimators, and the use of point-estimator notation should be considered solely as notational convenience.

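We also define set limit and use the notation

$$\operatorname*{setlim}_{k \to \infty} \Theta_k,$$

where $(\Theta_k)_{k \in \mathbb{N}}$ is a sequence of sets with an eventually bounded union (i.e., there exists a $k$, such that $\bigcup_{i \ge k} \Theta_i$ is bounded), to mean the set of elements $\theta$ for which there exists a monotone increasing sequence of naturals $k_1, k_2, \ldots$ and a sequence $\theta_1, \theta_2, \ldots$, such that for each $i$, $\theta_i \in \Theta_{k_i}$, and $\lim_{i \to \infty} \theta_i = \theta$.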

3 Risk-averse estimation

The idea behind MAP is to maximise the posterior probability that the estimated value is the correct value. In the continuous domain this cannot hold verbatim, because all $\theta$ have probability zero. Instead, we translate the notion into the continuous domain by maximising the probability that the estimated value is essentially the correct value. The way to do this is as follows.

Definition 1.

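A continuously differentiable, monotone decreasing function, $A: \mathbb{R}^{\ge 0} \to \mathbb{R}^{\ge 0}$, satisfying

  1. $A(0) > 0$,

  2. $\exists a_0: \forall a \ge a_0, A(a) = 0$,

will be called an attenuation function, and the minimal $a_0$ will be called its threshold value.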

Definition 2.

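Let $L$ be the loss function of a Bayes estimator and $A$ an attenuation function.

We define a risk-averse estimator over $L$ and $A$ to be the estimator satisfying

$$\hat{\theta}_{L,A}(x) = \operatorname*{setlim}_{k \to \infty} \hat{\theta}_k(x),$$

where $\hat{\theta}_k$ is the Bayes estimator whose gain function is

$$G_k(\theta_1, \theta_2) = A\left(k \, L(\theta_1, \theta_2)\right).$$

By convention we will assume $A(0) = 1$, noting that this value can be set by applying a positive multiple to the gain function, which does not affect the definition of the estimator.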

The rationale behind this definition is that we use a loss function, $L$, to determine how similar or different $\theta_2$ is to $\theta_1$, and then use an attenuation function, $A$, to translate this divergence into a gain function, where a $1$ indicates an exact match and a $0$ that $\theta_2$ is not materially similar to $\theta_1$. (Such a gain function is often referred to as a similarity measure.) The parameter $k$ is then used to contract the neighbourhood of partial similarity, to the point that anything that is not “essentially identical” to $\theta_1$ according to the loss function is considered a $0$. Note that this is done without distorting the loss function, as $k$ merely introduces a linear multiplication over it, a transformation that preserves not only the closeness ordering of pairs but also the Bayes estimator defined on the scaled function.

In this way, the risk-averse estimator maximises the probability that $\hat{\theta}$ is essentially identical to $\boldsymbol{\theta}$, while preserving our notion, codified in $L$, of how various $\theta$ values interrelate.
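The effect of the scale parameter can be sketched on a toy discrete problem. Everything below is an illustrative assumption rather than the paper's construction: the posterior, the absolute-difference loss, and the particular attenuation function `(1 - a)^2` truncated at a threshold value of 1. For a small scale the maximiser of the expected gain is a compromise value near the cluster of moderately likely hypotheses; once the scale is large enough that only exact matches earn any gain, the maximiser is the posterior mode.

```python
import math

def attenuation(a):
    """Hypothetical attenuation function: continuously differentiable,
    monotone decreasing, A(0) = 1, and zero from the threshold value 1 onward."""
    return (1.0 - a) ** 2 if a < 1.0 else 0.0

def loss(theta_true, theta_est):
    """Illustrative loss; it stands in for whatever loss the analyst chooses."""
    return abs(theta_true - theta_est)

def scaled_bayes_estimate(posterior, k):
    """Set of candidates maximising the expected gain A(k * L) under the posterior.
    Candidates are restricted to the (finite) support of the posterior."""
    def expected_gain(est):
        return sum(w * attenuation(k * loss(t, est)) for t, w in posterior.items())
    gains = {est: expected_gain(est) for est in posterior}
    best = max(gains.values())
    return {est for est, g in gains.items() if math.isclose(g, best)}

# Made-up posterior: one isolated mode at 0 and a cluster around 5.
posterior = {0.0: 0.4, 4.0: 0.2, 5.0: 0.2, 6.0: 0.2}

estimates = {k: scaled_bayes_estimate(posterior, k) for k in (0.1, 0.5, 1.0, 10.0)}
for k, est in estimates.items():
    print(k, est)
```

Because the posterior here has finite support, the sequence of estimates is constant from some scale onward, so its set limit can be read off directly from the large-scale entries.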

4 Positive probability events

Theorem 1.

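Any risk-averse estimator, $\hat{\theta}_{L,A}$, regardless of its loss function $L$ or its attenuation function $A$, satisfies for any $x$ in any estimation problem in which there exists a $\theta$ with a positive posterior probability that

$$\hat{\theta}_{L,A}(x) \subseteq \hat{\theta}_{\mathrm{MAP}}(x)$$

and is a nonempty set, provided $L$ is discriminative for the estimation problem. In particular, $\hat{\theta}_{L,A}$ is in all such cases a well-defined set estimator, and where MAP is a well-defined point estimator, so is $\hat{\theta}_{L,A}$, and

$$\hat{\theta}_{L,A} = \hat{\theta}_{\mathrm{MAP}}.$$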


The risk-averse estimator problem is defined by


Fix , and let .

Let be the set of for which is positive. The value of is bounded from both sides by


Because, by discriminativity of , for any neighbourhood of there is a value from which , as goes to infinity both bounds converge to . So, this is the limit for . Also, is a monotone decreasing function of .

The above proves that

is the MAP solution. To show that it is also the limit of the argmax (i.e., when switching back to the order of the quantifiers in (2)), we need to show certain uniformity properties on the speed of convergence, which is what the remainder of this proof is devoted to.

Let , and define an enumeration over , where the values are sorted by descending . (Such an enumeration is not necessarily unique.) If is countably infinite, the values of range in . Otherwise, it is a finite enumeration, with in .

Let be the set for which attains its maximum value, .

Let .

Because we know that for all is monotone decreasing and tending to zero, there is for each such a threshold value, , such that if , . When this is the case, can clearly no longer be part of the argmax in (2). Let be the subset of not thus excluded at .

By discriminativity of , for any subset such that and any there is a threshold value such that for all , there is no such that .

Combining these two observations, let be the set of such that there is some in for which . We conclude that as grows to infinity, the probability tends to zero. In particular, there exists a threshold value, which we will name , for which this probability is lower than .

Let be a set , where is such that


The choice of is not unique. However, such an always exists.

Define . Importantly, sets and are both finite.

Let be a set of neighbourhoods of , respectively, such that no two neighbourhoods intersect. Because this set of values is finite, there is a minimum distance between any two and therefore such neighbourhoods exist.

For each , let .

Because is discriminative, all are positive. Because this is a finite set, is also positive.

Consider now values of which are larger than , where is the attenuation function’s threshold value.

Because we chose all to be without intersection, any can be in at most one . For values as described, only values in can have . In particular, each can contain at most one of .

By (4), any such neighbourhood that does not contain one of , i.e. set , the MAP solutions, has a value lower than , the lower bound for given in (3). This is because for a , the total value from all elements in can contribute, by construction, less than , whereas the one element from that may be in the same neighbourhood can contribute no more than . On the other hand, a has by the definition of .

Therefore, any must have an containing exactly one of .

Let us partition any sequence of such elements according to the element of contained in , discarding any subsequence that is finite.

Consider now only the subsequence such that for some fixed .

By the same logic as before, because is discriminative, for any neighbourhood of , and therefore there exists a value such that for all , if then .

We conclude, therefore, that for all sufficiently large . By definition, the sequence therefore converges to , and the set limit of the entire sequence is the subset of the MAP solution, , for which such infinite subsequences exist.

Because the entire sequence is infinite, at least one of the subsequences will be infinite, hence the risk-averse solution is never the empty set. ∎

5 The axioms

We now describe additional good properties satisfied by the MAP estimator which make it suitable for scenarios such as scientific inference. These natural desiderata will form axioms of inference, which we will then investigate outside the discrete setting.

Our interest is in investigating inference and estimation in situations where all errors are highly costly, and hence we begin with an implicit “Axiom 0” that all estimators investigated are risk averse.

Our remaining axioms are not regarding the estimators themselves, but rather regarding what constitutes a reasonable loss function for such estimators. We maintain that these axioms can be applied equally in all situations in which loss functions are used, such as with Bayes estimators.

In all axioms, our requirement is that the loss function satisfies the specified conditions for every estimation problem $(\boldsymbol{x}, \boldsymbol{\theta})$, and every pair of parameters $\theta_1$ and $\theta_2$ in parameter space.

As always, we take the parameter space to be $\Theta$ and the observation space to be $X$.


Axiom 1: Invariance to Representation of Parameter Space (IRP)

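A loss function $L$ is said to satisfy IRP if for every invertible, continuous, differentiable function $F$, whose Jacobian is defined and non-zero everywhere,

$$L_{(\boldsymbol{x}, \boldsymbol{\theta})}(\theta_1, \theta_2) = L_{(\boldsymbol{x}, F(\boldsymbol{\theta}))}(F(\theta_1), F(\theta_2)),$$

where the subscript on $L$ names the estimation problem over which the loss is evaluated.

Axiom 2: Invariance to Representation of Observation Space (IRO)

A loss function $L$ is said to satisfy IRO if for every invertible, piecewise continuous, differentiable function $F$, whose Jacobian is defined and non-zero everywhere,

$$L_{(\boldsymbol{x}, \boldsymbol{\theta})}(\theta_1, \theta_2) = L_{(F(\boldsymbol{x}), \boldsymbol{\theta})}(\theta_1, \theta_2).$$

Axiom 3: Invariance to Irrelevant Alternatives (IIA)

A loss function $L$ is said to satisfy IIA if $L(\theta_1, \theta_2)$ does not depend on any detail of the joint distribution of $(\boldsymbol{x}, \boldsymbol{\theta})$ (described in the continuous case by the pdf $f$) other than at $\boldsymbol{\theta} \in \{\theta_1, \theta_2\}$.

Axiom 4: Invariance to Superfluous Information (ISI)

A loss function $L$ is said to satisfy ISI if for any random variable $\boldsymbol{y}$ such that $\boldsymbol{y}$ is independent of $\boldsymbol{\theta}$ given $\boldsymbol{x}$,

$$L_{((\boldsymbol{x}, \boldsymbol{y}), \boldsymbol{\theta})}(\theta_1, \theta_2) = L_{(\boldsymbol{x}, \boldsymbol{\theta})}(\theta_1, \theta_2).$$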

A loss function that satisfies both IRP and IRO is said to be representation invariant.

The conditions of representation invariance follow [19], whereas IIA was first introduced in a game-theoretic context by [11].

The ISI axiom is one we need neither in the positive probability case discussed above nor in the continuous case of the next section. However, we will use it in the remaining case, of discrete observations with a continuous parameter space.
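As a rough numerical illustration of the two representation axioms, the sketch below compares two candidate losses on a toy problem. All choices here are assumptions made for the example: unit-variance Gaussian likelihoods, the Kullback-Leibler divergence standing in for a generic likelihood-based loss, and `sinh` as the invertible smooth map applied to either space. It is not claimed that either loss is the one the paper ultimately works with. Quadratic loss changes when the parameter is relabelled, while the divergence, which looks only at the two likelihoods, is unchanged when the observation is re-expressed.

```python
import math

def log_normal_pdf(x, mean):
    """Log-density of N(mean, 1) at x."""
    return -0.5 * (x - mean) ** 2 - 0.5 * math.log(2 * math.pi)

def log_pdf_transformed(y, mean):
    """Log-density of y = sinh(x) when x ~ N(mean, 1); the Jacobian
    of the inverse map asinh is 1 / sqrt(1 + y^2)."""
    return log_normal_pdf(math.asinh(y), mean) - 0.5 * math.log(1 + y * y)

def kl(log_p, log_q, nodes):
    """Trapezoid-rule estimate of KL(p || q) over an increasing list of nodes."""
    vals = [math.exp(log_p(t)) * (log_p(t) - log_q(t)) for t in nodes]
    return sum((nodes[i + 1] - nodes[i]) * (vals[i] + vals[i + 1]) / 2
               for i in range(len(nodes) - 1))

theta1, theta2 = 0.0, 1.0

# Observation space: original coordinates versus y = sinh(x).
x_nodes = [-10 + 20 * i / 20000 for i in range(20001)]
y_nodes = [math.sinh(x) for x in x_nodes]
kl_x = kl(lambda x: log_normal_pdf(x, theta1),
          lambda x: log_normal_pdf(x, theta2), x_nodes)
kl_y = kl(lambda y: log_pdf_transformed(y, theta1),
          lambda y: log_pdf_transformed(y, theta2), y_nodes)

# Parameter space: quadratic loss before and after relabelling theta -> sinh(theta).
quad_before = (theta1 - theta2) ** 2
quad_after = (math.sinh(theta1) - math.sinh(theta2)) ** 2

print(kl_x, kl_y)               # should agree up to integration error
print(quad_before, quad_after)  # differ under the relabelling
```

The divergence is also unaffected by relabelling the parameter, since relabelling leaves the two likelihoods themselves untouched.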

6 The continuous case

6.1 Well-behaved problems

We now move to the harder case, where the distribution of $\boldsymbol{\theta}$ is continuous and none of its values is assigned a positive posterior probability. We refer to this as the $\theta$-continuous case.

We begin our exploration by looking at the special sub-case where the joint distribution of $(\boldsymbol{x}, \boldsymbol{\theta})$ is given by a probability density function $f$. We refer to this as the continuous case. Much of the machinery we develop for the continuous case will be reused, however, in the next section, where we discuss problems with a discrete $\boldsymbol{x}$ but a continuous $\boldsymbol{\theta}$. For this reason, where possible, we describe our results in this section in terminology more general than is needed purely for handling the continuous case.

We show that in the continuous case for any well-behaved estimation problem $(\boldsymbol{x}, \boldsymbol{\theta})$ and well-behaved loss function $L$, if $L$ satisfies the first three invariance axioms of Section 5, any risk-averse estimator over $L$ equals the Wallace-Freeman estimator, regardless of its attenuation function.
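For a concrete feel of the Wallace-Freeman estimate, which maximises the posterior density divided by the square root of the determinant of the Fisher information, here is a small sketch on an assumed toy problem: a binomial observation with a uniform prior, with made-up data. The second parameterisation, `phi = logit(p)`, is likewise a hypothetical choice, used only to check that the estimate does not move when the parameter is re-expressed.

```python
import math

def grid_argmax(objective, lo, hi, steps=100_000):
    """Return the interior grid point of (lo, hi) that maximises objective."""
    best_t, best_v = None, -math.inf
    for i in range(1, steps):
        t = lo + (hi - lo) * i / steps
        v = objective(t)
        if v > best_v:
            best_t, best_v = t, v
    return best_t

n, s = 10, 3  # hypothetical data: 3 successes in 10 Bernoulli trials

def wf_objective_p(p):
    """log posterior density (uniform prior) minus half the log Fisher
    information, where the Fisher information of n trials is n / (p (1 - p))."""
    log_post = s * math.log(p) + (n - s) * math.log(1 - p)
    log_fisher = math.log(n) - math.log(p) - math.log(1 - p)
    return log_post - 0.5 * log_fisher

def wf_objective_phi(phi):
    """Same objective in phi = logit(p). The density picks up dp/dphi = p (1 - p)
    and the Fisher information picks up (dp/dphi)^2."""
    p = 1 / (1 + math.exp(-phi))
    log_jac = math.log(p) + math.log(1 - p)
    log_post = s * math.log(p) + (n - s) * math.log(1 - p) + log_jac
    log_fisher = math.log(n) - math.log(p) - math.log(1 - p) + 2 * log_jac
    return log_post - 0.5 * log_fisher

wf_p = grid_argmax(wf_objective_p, 0.0, 1.0)
wf_phi = grid_argmax(wf_objective_phi, -10.0, 10.0)
wf_p_from_phi = 1 / (1 + math.exp(-wf_phi))

print(wf_p, wf_p_from_phi)  # both close to (s + 0.5) / (n + 1)
```

The Jacobian factors cancel between the density and the square root of the Fisher information, which is why both searches land on the same value of `p`.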

Note that unlike in the discrete case, in the $\theta$-continuous case we restrict our analysis to “well-behaved” problems. The reason for this is mathematical convenience and simplicity of presentation.

In this section we define well-behavedness. We will reuse the definition when analysing the $\theta$-continuous case. However, some well-behavedness requirements for continuous problems are not meaningful for distributions with a discrete $\boldsymbol{x}$, so the definition states explicitly how the requirements are reduced for the more general $\theta$-continuous case.

We refer to a continuous/$\theta$-continuous estimation problem $(\boldsymbol{x}, \boldsymbol{\theta})$ as well-behaved if it satisfies the following criteria.

  1. For continuous problems: the function $f(x \mid \theta)$ is piecewise continuous in $x$ and three-times continuously differentiable in $\theta$. If, alternatively, $\boldsymbol{x}$ is discrete, we merely require that for every $x$, $\mathbf{P}(x \mid \theta)$ is three-times continuously differentiable in $\theta$.

  2. The set $\Theta$ is a compact closure of an open set.

Additionally, we say that a loss function is well-behaved if it satisfies the following conditions.


If is a well-behaved continuous/-continuous estimation problem, then the function is three times differentiable in and these derivatives are continuous in and .


There exists at least one well-behaved continuous/-continuous estimation problem and at least one choice of , and such that


(For continuous estimation problems only:) is problem-continuous (or “-continuous”), in the sense that if is a sequence of well-behaved continuous estimation problems, such that for every , , then for every ,

In the last criterion, the symbol “” indicates convergence in measure [7]. This is defined as follows. Let be the space of normalisable, non-atomic measures over some , let be a function and let be a sequence of such functions. Then if


where can be, equivalently, any measure in whose support is at least the union of the support of and all .

We will usually take and all to be pdfs. When this is the case, ’s support only needs to equal the support of . Furthermore, because is normalisable, one can always choose values and such that and are both arbitrarily small, for which reason one can substitute the absolute difference “” in (5) with a relative difference “”, and reformulate it in the case that and all are pdfs as


This reformulation makes it clear that convergence in measure over pdfs is a condition independent of representation: it is invariant to transformations of the sort we allow on the observation space.

6.2 The main theorem

Our main theorem for continuous problems is as follows.

Theorem 2.

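If $(\boldsymbol{x}, \boldsymbol{\theta})$ is a well-behaved continuous estimation problem for which $\hat{\theta}_{\mathrm{WF}}$ is a well-defined set estimator, and if $L$ is a well-behaved loss function, discriminative for $(\boldsymbol{x}, \boldsymbol{\theta})$, that satisfies all of IIA, IRP and IRO, then any risk-averse estimator $\hat{\theta}_{L,A}$ over $L$, regardless of its attenuation function $A$, is a well-defined set estimator, and for every $x$,

$$\hat{\theta}_{L,A}(x) \subseteq \hat{\theta}_{\mathrm{WF}}(x).$$

In particular, if $\hat{\theta}_{\mathrm{WF}}$ is a well-defined point estimator, then so is $\hat{\theta}_{L,A}$, and

$$\hat{\theta}_{L,A} = \hat{\theta}_{\mathrm{WF}}.$$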

We prove this through a progression of lemmas. For the purpose of this derivation, the dimension of the parameter space and the dimension of the observation space are throughout taken to be fixed, so as to simplify notation.

Lemma 1.

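For estimation problems with a continuous $\boldsymbol{\theta}$, if $L$ satisfies both IIA and IRP then $L(\theta_1, \theta_2)$ is a function only of the likelihoods $P_{\theta_1}$ and $P_{\theta_2}$.


The IIA axiom is tantamount to stating that $L(\theta_1, \theta_2)$ is dependent only on the following:

  1. the function’s inputs $\theta_1$ and $\theta_2$,

  2. the likelihoods $P_{\theta_1}$ and $P_{\theta_2}$, and

  3. the priors $f(\theta_1)$ and $f(\theta_2)$.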

We can assume without loss of generality that , or else the value of can be determined to be zero by (1).

Our first claim is that, due to IRP, $L(\theta_1, \theta_2)$ can also not depend on the problem’s prior probability densities $f(\theta_1)$ and $f(\theta_2)$. To show this, construct an invertible, continuous, differentiable function $F$, whose Jacobian is defined and non-zero everywhere, in the following way.

Let be an orthogonal basis for wherein . We design as

where is a continuous, differentiable function onto , with a derivative that is positive everywhere, satisfying

  1. and , and

  2. and , for some arbitrary positive values and .

Such a function is straightforward to construct for any values of and , and by an appropriate choice of these values, it is possible to map into in a way that does not change or , but adjusts and to any desired positive values.

Lastly, we show that can also not depend on the values of and other than through and . For this we once again invoke IRP: by applying a similarity transform on , we can map any and values into arbitrary new values, again without this affecting their respective likelihoods. ∎

In light of Lemma 1, we will henceforth use the notation $L(P_{\theta_1}, P_{\theta_2})$ (or, when $\boldsymbol{x}$ is known to be continuous, $L(f_{\theta_1}, f_{\theta_2})$) instead of $L(\theta_1, \theta_2)$. A loss function that can be written in this way is referred to as a likelihood-based loss function.

For distributions and with a common support , absolutely continuous with respect to each other, let be the Radon-Nikodym derivative [13]. For a value of that has positive probability in both and , this is simply , whereas for pdfs and it is within the common support.

We now define the function by

Lemma 2.

The function is -continuous for continuous distributions, in the sense that if both and , where and are pdfs and and are pdf sequences, then .


For any , let , and let and be the infimum and the supremum , respectively, for which .

Because is a monotone increasing function, is also the supremum for which (unless no such exists, in which case ), so by definition is the supremum of , for all , from which we conclude

Because and , we can use (6) to determine that using a large enough both and are arbitrarily close to in all but a diminishing measure of . Hence,

We conclude that for any a large enough will satisfy

and hence . For all such , and in particular for all ,


A symmetrical analysis on yields that for all ,


Consider, now, the functions


Because each is monotone increasing, so are and . Monotone functions can only have countably many discontinuity points (for a total of measure zero). For any that is not a discontinuity point of either function, we have from (7) and (8) that exists and equals , so the conditions of convergence in measure hold. ∎

Lemma 3.

If satisfies IRO and is a well-behaved likelihood-based loss function and and are piecewise-continuous probability density functions over , then depends only on .


The following conditions are equivalent.

  1. equals the indicator function on in all but a measure zero of values,

  2. equals in all but a measure zero of ,

  3. and are -equivalent, in the sense that a sequence of elements all equal to nevertheless satisfies the condition of -convergence to , and

  4. ,

where the equivalence of the last condition follows from the previous one by problem continuity, together with (1). Hence, if the second condition is met, we are done. We can therefore assume that and differ in a positive measure of , and (because both integrate to ) that they are consequently not linearly dependent.

Because is known to be likelihood-based, the value of is not dependent on the full details of the estimation problem: it will be the same in any estimation problem of the same dimensions that contains the likelihoods and . Let us therefore design an estimation problem that is easy to analyse but contains these two likelihoods.

Let be an estimation problem with and a uniform prior on . Its likelihood at will be , at will be , and we will choose piecewise continuous likelihoods, , over the rest of the so all are linearly independent, share the same support, and differ from each other over a positive measure of , and so that their respective values are all monotone weakly increasing with and with each other.

If , we further choose at to satisfy that is monotone strictly increasing. If , this is not necessary and we, instead, choose .

We then extend this description of at into a full characterisation of all the problem’s likelihoods by setting these to be multilinear functions of the coordinates of .

We now create a sequence of estimation problems, to satisfy the conditions of ’s problem-continuity assumption. We do this by constructing a sequence of subsets of such that for all , tends to , and for every ,


for an arbitrarily-chosen sequence tending to zero. By setting the remaining likelihood values as multilinear functions of the coordinates of , as above, the sequence will satisfy the problem-continuity condition and will guarantee .

Each will be describable by the positive parameters as follows. Let , i.e. the axis parallel, origin-centred, -dimensional cube of side length . will be chosen to contain all such that for all , and is at least a distance of away from the nearest discontinuity point of , as well as from the origin. By choosing small enough and and large enough and , it is always possible to make arbitrarily close to , so the sequence can be made to satisfy its requirements.

We will choose to be a natural.

We now describe how to construct each from its respective . We first describe for each a new function as follows. Begin by setting for all . If , set to zero. Otherwise, complete the functions so that all are linearly independent and so that each is positive and continuous inside , and integrates to . Note that because a neighbourhood around the origin is known to not be in , it is never the case that

. This allows enough degrees of freedom in completing the functions

in order to meet all their requirements.

As all are continuous functions over the compact domain , by the Heine-Cantor Theorem [14] they are uniformly continuous. There must therefore exist a natural , such that we can tile into sub-cubes of side length such that by setting each value in each sub-cube to a constant for the sub-cube equal to the mean over the entire sub-cube tile of , the result will satisfy for all and all , . Because is by design multi-linear in , this implies that for all and all , condition (9) is attained. Furthermore, by choosing a large enough , we can always ensure, because the functions are continuous and linearly independent, that also the functions, for are linearly independent and differ in more than a measure zero of . Together, these properties ensure that the new problems constructed are both well defined and well behaved.

We have therefore constructed as a sequence of well-behaved estimation problems that -approximate arbitrarily well, while being entirely composed of functions whose support is for some natural and whose values within their support are piecewise-constant inside cubic tiles of side-length , for some natural .

We now use IRO to reshape the observation space of the estimation problems in the constructed sequence by a piecewise-continuous transform.

Namely, we take each constant-valued cube of side length and transform it using a scaling transformation in each coordinate, as follows. Consider a single cubic tile, and let the value of at points that are within it be . We scale the first coordinate of the tile to be of length , and all other coordinates to be of length . Notably, this transformation increases the volume of the cube by a factor of , so the probability density inside the cube, for each , will drop by a corresponding factor of .

We now place the transformed cubes by stacking them along the first coordinate, sorted by increasing .

Notably, because the probability density in all transformed cubes is , it is possible to arrange all transformed cubes in this way so that, together, they fill exactly the unit cube in . Let the new estimation problems created in this way be , let be the transformation, , applied on the observation space and let be the first coordinate value of .
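The stretch-and-stack step can be pictured in one dimension. The sketch below reflects one reading of the construction and is not a reproduction of it: the exact scale factors are not recoverable from the text, so it is assumed here that each tile is stretched by a factor equal to the value on that tile of one of the two piecewise-constant densities (called the reference density `p` below), and that tiles are then sorted by the resulting value of the other density `q`. The densities, the tile width and the function name `stack_tiles` are all hypothetical.

```python
def stack_tiles(p, q, width):
    """Stretch each tile by its reference density and sort by the density ratio.

    p, q: piecewise-constant density values on consecutive tiles of equal width.
    Returns a list of tiles, each with its new width and its two new densities.
    """
    tiles = []
    for p_i, q_i in zip(p, q):
        if p_i <= 0:
            raise ValueError("the reference density must be positive on every tile")
        # Stretching by p_i divides every density on the tile by p_i,
        # so the reference density becomes 1 and q becomes q_i / p_i.
        tiles.append({"width": p_i * width, "p": 1.0, "q": q_i / p_i})
    tiles.sort(key=lambda tile: tile["q"])
    return tiles

# Made-up densities on four tiles of width 0.25; each integrates to 1.
p = [0.5, 1.5, 1.0, 1.0]
q = [1.0, 0.5, 2.0, 0.5]

tiles = stack_tiles(p, q, width=0.25)
total_width = sum(tile["width"] for tile in tiles)
q_mass = sum(tile["width"] * tile["q"] for tile in tiles)
ratios = [tile["q"] for tile in tiles]

print(total_width)  # the stacked tiles cover a unit-length interval
print(ratios)       # the second density is now monotone along the stack
print(q_mass)       # and still integrates to one
```

After the transformation the reference density is uniform on a unit interval and the other density is a monotone step function, which is the kind of canonical form the argument goes on to exploit.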

By IRO, , which we know tends to .

Consider the probability density of each over its support . This is a probability density that is uniform along all axes except the first, but has some marginal, , along the first axis. We denote such a distribution by . Specifically, for , because of our choice of sorting order, we have , so by Lemma 2, this is known to -converge to .

If , the above is enough to show that the -limit problem of exists. If , consider the following.

Let be the transformation mapping each to the supremum for which . This will be satisfied with equality wherever is continuous, which (because it is monotone) it is in all but a measure zero of the , and therefore of the .

Thus, in all but a diminishing measure of we have that the value of approaches , which in turn equals the value . On the other hand, we have that