# A taxonomy of estimator consistency on discrete estimation problems

We describe a four-level hierarchy mapping both all discrete estimation problems and all estimators on these problems, such that the hierarchy describes each estimator's consistency guarantees on each problem class. We show that no estimator is consistent for all estimation problems, but that some estimators, such as Maximum A Posteriori, are consistent for the widest possible class of discrete estimation problems. For Maximum Likelihood and Approximate Maximum Likelihood estimators we show that they do not provide consistency on as wide a class, but define a sub-class of problems characterised by their consistency. Lastly, we show that some popular estimators, specifically Strict Minimum Message Length, do not provide consistency guarantees even within the sub-class.

## 1 Introduction

Consistency has been studied extensively as a desirable property of estimators by proponents of diverse approaches to point estimation. For instance, there is an extensive literature investigating assumptions sufficient to ensure the consistency of maximum likelihood estimators, of which Doob (1934) is the first, Wald (1949) is the classic source, Perlman (1972) provides a good summary and Seo & Lindsay (2013) is a more recent development. On the Bayesian side, the related issue of “posterior consistency” has an extensive literature beginning with Doob (1949) and summarised in Ghosal (1997). This was related to the consistency of Bayes estimators in Schwartz (1965), and consistency of Bayes estimators was more directly analysed in Diaconis & Freedman (1986). More recently, Dowe (Dowe et al., 1998; Dowe, 2011) conjectured that under certain conditions only Bayesian estimation methods would be consistent.

The statistical notion of consistency is typically discussed in the context of a given estimation method and a single problem or class of problems (as in Schwartz (1965, Section 6) and Dowe & Wallace (1997)). By contrast, in this paper we show that estimators fall into natural classes in terms of the properties they require of an estimation problem in order to ensure consistency. This allows us to map any estimator according to the class it belongs to, and, in turn, also to map any estimation problem according to the same classification, by the properties it exhibits.

Specifically, we define three nested classes of estimation problems. The widest class is the class of all estimation problems. (Here, “all estimation problems” refers to all discrete estimation problems, i.e. to estimation problems for which both the parameter space and the observation space are countable. Throughout this paper, this will be the scope of the discussion.) Smaller than the class of all problems is the class of problems having convergent likelihood ratios. And the smallest is the class of problems having distinctive likelihoods. An estimator is classified according to the largest of these classes within which it guarantees consistency for any problem. Thus, we describe estimator consistency as a hierarchy: at level zero of the hierarchy are the estimators consistent for all problems, at level one those that are not at level zero but are consistent for all problems with convergent likelihood ratios (we call such estimators properly consistent), at level two are those estimators that are not at level one but are consistent for all problems with distinctive likelihoods (we call such estimators likelihood consistent), and at level three are those estimators for which not even the property of distinctive likelihoods is a guarantee of consistency.

In the paper, we provide several characterisations of each of the problem classes, taking both Bayesian and frequentist approaches. We then provide examples of popular estimators and of wide estimator classes mapped into each level, proving both their consistency within the required problem class and their inconsistency within the next-largest problem class.

Specifically, we show that

1. No estimator resides in level zero of the hierarchy, and that, in fact, no estimator can be consistent for any estimation problem outside the class of those with convergent likelihood ratios,

2. Maximum A Posteriori (MAP) resides in level one of the hierarchy, being consistent over all problems that have any consistent estimators, and so do some frequentist estimators, but not Maximum Likelihood (ML),

3. Maximum Likelihood (ML) estimation resides in level two of the hierarchy, being consistent for all problems with distinctive likelihoods, and

4. Strict Minimum Message Length (SMML) estimation resides in level three of the hierarchy, not guaranteeing consistency even for problems with distinctive likelihoods.

The last result, namely the result regarding SMML, resolves in the negative a conjecture by Dowe (Dowe et al., 1998; Dowe, 2011) to the effect that Minimum Message Length estimators are consistent on more general classes of problems than alternatives such as Maximum Likelihood, and frequentist estimation methods in general.

The remaining parts of this paper are organised as follows. In Section 2, we provide general definitions regarding the point estimation problem, precisely scoping our present results. In Section 3, we define the class of estimation problems with convergent likelihood ratios, the corresponding class of properly consistent estimators, provide examples, and show that no estimation problem without convergent likelihood ratios has consistent estimators at all. In Section 4, we define the class of estimation problems with distinctive likelihoods, the corresponding class of likelihood consistent estimators, and provide examples. Lastly, in Section 5, we analyse the Strict Minimum Message Length estimator, and show an estimation problem with distinctive likelihoods for which it is provably not consistent.

## 2 Definitions

### 2.1 Estimation Problems

Point estimation concerns the problem of inferring a single point estimate of an unknown parameter of a model given observations generated from the model. Point estimation problems are divided into small sample problems (where the task is to make an estimate given a fixed, finite number of observations), and large sample problems (where the task is to provide estimates which have good properties in the limit of infinite observations). While consistency is a concern only in large sample problems, we first introduce small sample problems as they are the most natural context for defining most of the estimators we shall be interested in.

###### Definition 2.1.

A discrete, small sample, classical estimation problem is a tuple $(X, \Theta, \{P_\theta\}_{\theta\in\Theta})$ where:

• $X$ is called the observation space, and is a finite product of countable subsets of some Euclidean space, representing possible finite sequences of observations that might be observed.

• $\Theta$ is called the parameter space, and is a countable subset of a Euclidean space representing possible values of the unknown parameter.

• $\{P_\theta\}_{\theta\in\Theta}$ is a set of distinct probability distributions over $X$ indexed by $\Theta$, where $P_\theta$ represents the probability distribution over the observations implied by the model with parameter value $\theta$.

The intended interpretation is that we receive an observation $x \in X$ (which is a term we will also use when describing a finite sequence of observations), from which we make our estimate, $\hat\theta(x)$, of $\theta$. The observation is modelled as a random variable $\boldsymbol{x}$ whose distribution is $P_\theta$, which depends on the true value of the unknown parameter.

The classical approach to point estimation assumes no prior information regarding the value of the unknown parameter $\theta$ except for its set of possible values. This is appropriate when one is evaluating “frequentist” approaches which make no use of any prior information, or when considering properties such as bias or variance of estimators. In order to evaluate Bayesian approaches in the same framework, one must assume that the prior is selected by the statistician when selecting an estimation method. This is incongruous with (mainstream) Bayesian philosophy, since according to this philosophy the prior distribution should capture the statistician’s prior knowledge about the unknown parameter, and therefore the statistician is not free to choose this prior arbitrarily. Such a Bayesian, therefore, should treat the prior as given in the estimation problem, instead of something to be determined freely in choosing an estimation method. Therefore in this paper we shall consider what we call Bayesian estimation problems:

###### Definition 2.2.

A discrete, small sample, Bayesian estimation problem is a tuple,, where , and are defined as in Definition 2.1, and is a probability distribution (called the prior distribution) over such that for all .

The addition of the prior distribution allows us to treat the value of the unknown parameter as a random variable , distributed according to . Note that we assume that the prior distribution is everywhere positive, reflecting the informal notion of as the set of values that the statistician considers possible values for the unknown parameter. Other than this we make no assumptions about the form of the prior distribution.

The distributions $P_\theta$ can then be thought of as conditional distributions of $\boldsymbol{x}$, conditioned on different possible values of $\boldsymbol{\theta}$. As a result we can alternatively, and more simply, conceive of a small sample Bayesian estimation problem as being characterised by the joint distribution on the pair of random variables $(\boldsymbol{x}, \boldsymbol{\theta})$. That is, the following definition is equivalent to Definition 2.2, and is the one we shall mostly use. (Note that when $\Theta$ is uncountable these two definitions are no longer always equivalent, since the conditional probability is not necessarily uniquely defined by the joint distribution in that case.)

###### Definition 2.4.

A discrete, small sample, Bayesian estimation problem is a pair of random variables , with a given joint probability distribution, with ranging over and ranging over , with , and with and possessing the same properties as in Definition 2.2.
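The equivalence between the prior-plus-likelihoods view (Definition 2.2) and the joint-distribution view (Definition 2.4) can be checked mechanically on a finite toy problem. The following is a minimal sketch, not part of the original paper; the parameter labels, probabilities and function names are all hypothetical, and exact rational arithmetic is used so that the round trip can be compared with `==`.

```python
from fractions import Fraction


def joint_from_prior_and_likelihoods(prior, likelihoods):
    """Build P(x, theta) = prior(theta) * P_theta(x), keyed by (x, theta)."""
    return {
        (x, theta): prior[theta] * p
        for theta, dist in likelihoods.items()
        for x, p in dist.items()
    }


def prior_and_likelihoods_from_joint(joint):
    """Recover the prior (marginal of theta) and each P_theta (conditional of x)."""
    prior = {}
    for (_, theta), p in joint.items():
        prior[theta] = prior.get(theta, 0) + p
    likelihoods = {theta: {} for theta in prior}
    for (x, theta), p in joint.items():
        # Dividing is safe because the prior is assumed positive everywhere.
        likelihoods[theta][x] = p / prior[theta]
    return prior, likelihoods


# Hypothetical toy problem: two parameter values, three possible observations.
prior = {"a": Fraction(1, 4), "b": Fraction(3, 4)}
likelihoods = {
    "a": {0: Fraction(1, 2), 1: Fraction(1, 2), 2: Fraction(0)},
    "b": {0: Fraction(1, 10), 1: Fraction(2, 10), 2: Fraction(7, 10)},
}
joint = joint_from_prior_and_likelihoods(prior, likelihoods)
recovered_prior, recovered_likelihoods = prior_and_likelihoods_from_joint(joint)
```

The recovery step divides by the prior mass of each parameter value, which is why the positivity requirement on the prior matters for the two definitions to carry the same information.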

While estimators are simplest to define in the small sample case, in this paper we are interested in the asymptotic behaviour of estimators in the limit of infinite data. We shall therefore be concerned with what we shall call discrete, large sample Bayesian estimation problems:

###### Definition 2.5.

A discrete, large sample Bayesian estimation problem is a tuple where each is a countable subset of some Euclidean space, and is the observation space (in this case the infinite Cartesian product of the ), is a set of distinct probability distributions over with the associated -algebra being the smallest -algebra containing all open cylinders (, , ), and satisfies the same conditions as in Definition 2.2. Alternatively a discrete, large sample Bayesian estimation problem is a pair of random variables as in Definition 2.4 with ranging over an observation space which is the product of a countably infinite sequence of countable non-empty subsets of a Euclidean space, where the factors of the product are indexed by .

The intended interpretation is that the statistician will receive a sequence of observations, with the observation being a member of . The observations form a sequence of random variables whose distribution depends on the unknown value . Note that we have put no restrictions on the form of the distributions on , and so in particular we do not enforce the common requirement that the observations be independent and identically distributed (i.i.d.).

All the problems we consider from this point forward will be both discrete and Bayesian, so we shall leave out these qualifiers, instead dividing problems only into small sample or large sample estimation problems. In both cases we shall use when referring to probabilities defined in terms of the joint probability distribution referred to in Definitions 2.4 and 2.5. For instance, we may write instead of .

Note that in all estimation problems we have required the observation and parameter spaces to be subsets of some Euclidean space. We do this solely to make our results more comprehensible due to the general familiarity with Euclidean spaces. All that is required for the results that follow is that $\Theta$ is a first countable, $T_1$ space and that all subsets of the observation space and of $\Theta$ are measurable.

Finally, we shall need some additional notation for discussing individual observations or finite sequences of observations in large sample estimation problems:

• .

• .

### 2.2 Estimators

An estimator for a small sample estimation problem is a function , where represents a ‘best guess’ as to the true value of , when given the observation . Since the observation is a random variable, is itself a random variable which we will denote . An estimator for a large sample estimation problem is a sequence of estimators for the sequence of small sample estimation problems where .

Estimators are thus defined relative to a given estimation problem, while we are interested in comparing methods of point estimation over whole classes of estimation problems. We thus require the notion of estimator classes which are simply sets of estimators defined over broad classes of estimation problems. For most estimator classes we consider we will define membership explicitly only for small sample estimation problems. In such cases estimators on large sample problems will be considered members of a given estimator class if the sequence of estimators on the small sample problems is eventually in the class of small sample estimators so defined.

As an example, perhaps the most well-studied class of estimators and one of the main classes we shall consider is the Maximum Likelihood (ML) estimator class:

###### Definition 2.10.

An estimator for estimation problem is a Maximum Likelihood (ML) estimator and so a member of the ML estimator class if, for all it satisfies the equation:

$$\hat\theta(x) = \operatorname*{argmax}_{\theta\in\Theta} P_\theta(x). \tag{1}$$

The ML estimator class is what we shall call a frequentist estimator class, since whether an estimator on a problem is a member of the class can be determined independently of the prior distribution on . We shall also later introduce Bayesian estimator classes whose membership does depend on the prior.
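As an illustration of equation (1), the sketch below computes the set of maximum likelihood estimates for a finite toy problem. It is only a sketch under stated assumptions: the parameter space is finite (so the maximum is always attained), the two coin models are hypothetical, and the function name is ours. Note that no prior appears anywhere in the function's inputs, which is what makes the class frequentist in the sense described above.

```python
def ml_estimate_set(likelihoods, x):
    """Return every theta attaining max over theta of P_theta(x).

    `likelihoods` maps each parameter value to a dict P_theta over observations.
    A finite parameter space is assumed, so the maximum always exists.
    """
    best = max(dist.get(x, 0.0) for dist in likelihoods.values())
    return {theta for theta, dist in likelihoods.items() if dist.get(x, 0.0) == best}


# Hypothetical problem: one coin flip, two candidate models.
likelihoods = {
    "fair": {"H": 0.5, "T": 0.5},
    "biased": {"H": 0.9, "T": 0.1},
}

estimates_for_heads = ml_estimate_set(likelihoods, "H")
estimates_for_tails = ml_estimate_set(likelihoods, "T")
```

Returning a set rather than a single value reflects that any function selecting one element of this set for each observation is an ML estimator.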

### 2.3 Consistency

While our estimation problems are Bayesian, our evaluative criterion is the frequentist one of consistency:

###### Definition 2.11.

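An estimator $\hat\theta = (\hat\theta_n)_{n\in\mathbb{N}}$ is consistent at $\theta\in\Theta$ if, when $\boldsymbol{x}$ is distributed according to $P_\theta$, the sequence $\hat\theta_n(\boldsymbol{x}_{1:n})$ converges in probability to $\theta$. That is, if for every neighbourhood $U$ of $\theta$,

$$\lim_{n\to\infty} P_\theta\left(\hat\theta_n(\boldsymbol{x}_{1:n}) \in U\right) = 1.$$

An estimator is consistent for an estimation problem if it is consistent at every $\theta\in\Theta$.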
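To make the definition concrete, the following sketch computes exactly (by enumerating the number of successes, not by simulation) the probability that an ML estimate equals the true parameter after $n$ observations. Everything specific here is an assumption chosen for illustration: the observations are i.i.d. coin flips, the parameter space is the hypothetical three-point set `THETAS`, and because every point of a finite parameter space is isolated, consistency at $\theta$ amounts to this probability tending to one.

```python
from math import comb, log

THETAS = (0.2, 0.5, 0.8)  # hypothetical finite parameter space of coin biases


def ml_estimate(successes, n):
    """ML estimate from the sufficient statistic (number of successes in n flips)."""
    def loglik(p):
        return successes * log(p) + (n - successes) * log(1 - p)
    return max(THETAS, key=loglik)


def prob_correct(theta, n):
    """Exact probability, under P_theta, that the ML estimate after n flips is theta."""
    return sum(
        comb(n, s) * theta**s * (1 - theta) ** (n - s)
        for s in range(n + 1)
        if ml_estimate(s, n) == theta
    )


for n in (5, 20, 80):
    print(n, [round(prob_correct(t, n), 4) for t in THETAS])
```

On this toy problem the probabilities approach one as $n$ grows for every parameter value, which is the behaviour the definition asks for; it says nothing about problems outside this very simple i.i.d. setting.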

###### Definition 2.13.

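An estimator is strongly consistent at $\theta\in\Theta$ if, when $\boldsymbol{x}$ is distributed according to $P_\theta$, the sequence $\hat\theta_n(\boldsymbol{x}_{1:n})$ converges almost surely to $\theta$. That is, if

$$P_\theta\left(\lim_{n\to\infty}\hat\theta_n(\boldsymbol{x}_{1:n}) = \theta\right) = 1.$$

If an estimator is strongly consistent at all $\theta\in\Theta$ then it is strongly consistent.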

In this paper however, we are not primarily interested in the consistency of individual estimators, but rather of estimator classes. Ideally we would define an estimator class as (strongly) consistent if every estimator within the class is (strongly) consistent. However, since our definition of estimation problems in Section 2.1 made no assumptions regarding the relationship between the sequence of observations and the true value of , there can be no guarantee in general that a consistent estimator exists for a given estimation problem. Thus estimator classes can only guarantee consistency on subclasses of the class of all estimation problems. We will therefore define consistency of estimator classes relative to a given class of large sample estimation problems:

###### Definition 2.15.

Estimator class is (strongly) consistent over a class of estimation problems if, for every estimation problem , there exists at least one estimator in for and all such estimators are (strongly) consistent on .

Note, however, that an estimator class may fail to contain estimators for a given estimation problem (for instance there may be no maximum likelihood estimator for a problem if the likelihood function for infinitely many fails to attain its supremum). We therefore also define partial consistency:

###### Definition 2.16.

Estimator class is partially (strongly) consistent over class of estimation problems if, for every estimation problem , all estimators in for are (strongly) consistent on .

## 3 Proper Consistency

In this section we introduce our broadest class of estimation problems: those with consistent posteriors. The primary results of this section will show that a consistent posterior is a necessary condition for a problem to have consistent estimators, and that it is also a sufficient condition for an estimation problem to have strongly consistent estimators. An estimator class which is consistent over all problems with consistent posteriors we therefore call properly consistent. We will then introduce Bayes estimators and show that Bayes estimator classes whose associated loss functions have a property we call discernment are properly consistent.

To conclude this section we will consider proper consistency from a frequentist perspective. We will show that the existence of a consistent posterior is equivalent to the frequentist property of having convergent likelihood ratios. This entails the existence of properly consistent frequentist estimator classes, however we will show that the Maximum Likelihood estimator class is not properly consistent.

### 3.1 Posterior Consistency

###### Definition 3.17.

(Ghosal, 1997) An estimation problem is said to have a consistent posterior if, for every $\theta\in\Theta$ and every neighbourhood $U$ of $\theta$,

$$\lim_{n\to\infty} P(\boldsymbol{\theta}\in U \mid \boldsymbol{x}_{1:n}) = 1 \quad \text{a.s.} \tag{2}$$

when $\boldsymbol{x}$ is distributed according to $P_\theta$. A problem without a consistent posterior is said to have an inconsistent posterior. (Note that a probability conditioned on a random variable is itself a random variable.)

That is, an estimation problem has a consistent posterior if the Bayesian posterior will, in the limit, concentrate all probability mass on regions about the true value with probability 1. Since we only consider the discrete case in this paper, Definition 3.17 can be simplified using Lévy’s upward theorem to show that a consistent posterior will in fact concentrate all probability mass precisely on the true value.
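The concentration of the posterior on the true value can be seen numerically on a small example. The sketch below is illustrative only and rests on several assumptions that are not in the paper: observations are i.i.d. coin flips, the parameter space is the hypothetical set `THETAS` with a uniform prior, and the data are a deterministic stand-in for a typical sample from the model with bias 0.8 (exactly four heads in every five flips), not a random draw.

```python
from math import exp, log

THETAS = (0.2, 0.5, 0.8)  # hypothetical parameter space of coin biases
PRIOR = (1 / 3, 1 / 3, 1 / 3)  # hypothetical uniform prior


def posterior(prior, thetas, heads, n):
    """Posterior over coin biases after observing `heads` heads in n i.i.d. flips."""
    log_weights = [
        log(p) + heads * log(t) + (n - heads) * log(1 - t)
        for p, t in zip(prior, thetas)
    ]
    peak = max(log_weights)  # subtract the peak to avoid underflow for large n
    weights = [exp(w - peak) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def mass_on_truth(n):
    """Posterior mass on bias 0.8 when four out of every five flips are heads."""
    heads = (4 * n) // 5
    return posterior(PRIOR, THETAS, heads, n)[2]


for n in (5, 50, 500):
    print(n, mass_on_truth(n))
```

The posterior mass on the data-generating value increases towards one as more observations arrive, which is the behaviour that Definition 3.17 requires to hold with probability one.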

###### Lemma 3.19 (Lévy’s upward theorem).

(Williams, 1991, sec. 14.2) If is a real random variable such that , and if is a random process and , then

$$\lim_{n\to\infty} E(z \mid y_{1:n}) = E(z \mid y) \quad \text{a.s.}$$

###### Lemma 3.20.

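An estimation problem has a consistent posterior if and only if, for every $\theta\in\Theta$:

$$\lim_{n\to\infty} P(\boldsymbol{\theta} = \theta \mid \boldsymbol{x}_{1:n}) = 1 \quad \text{a.s.} \tag{3}$$

when $\boldsymbol{x}$ is distributed according to $P_\theta$.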

###### Proof.

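if: For any neighbourhood $U$ of $\theta$, $P(\boldsymbol{\theta} = \theta \mid \boldsymbol{x}_{1:n}) \le P(\boldsymbol{\theta}\in U \mid \boldsymbol{x}_{1:n}) \le 1$. Therefore $\lim_{n\to\infty} P(\boldsymbol{\theta}\in U \mid \boldsymbol{x}_{1:n}) = 1$ by the order limit theorem.

only if: Assume the problem has a consistent posterior, and assume that $\boldsymbol{x}$ is distributed according to $P_\theta$. Now, for every neighbourhood $U$, $P(\boldsymbol{\theta}\in U \mid \boldsymbol{x}_{1:n})$ is the expectation of a random variable bounded between 0 and 1. We can therefore apply Lévy’s upward theorem to Definition 3.17 to get that for every neighbourhood $U$ of $\theta$, $P(\boldsymbol{\theta}\in U \mid \boldsymbol{x}) = 1$ almost surely. However, by the $T_1$ property of $\Theta$, for every $\theta' \neq \theta$ there exists a neighbourhood $U$ of $\theta$ such that $\theta' \notin U$. Therefore $P(\boldsymbol{\theta} = \theta' \mid \boldsymbol{x}) = 0$ almost surely. Finally, since $\Theta$ is discrete:

$$P(\boldsymbol{\theta}\in U \mid \boldsymbol{x}) = \sum_{\tilde\theta\in U} P(\boldsymbol{\theta} = \tilde\theta \mid \boldsymbol{x}) = P(\boldsymbol{\theta} = \theta \mid \boldsymbol{x}) \quad\therefore\quad P(\boldsymbol{\theta} = \theta \mid \boldsymbol{x}) = 1 \quad \text{a.s.}$$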

Now in order to prove the main result of this subsection (Lemma 3.35) we will need the following (well-known) lemma:

###### Lemma 3.34.

(Resnick, 1999, Thm. 6.3.1(b), p. 172) A sequence of random variables converges in probability to if and only if, for every subsequence there exists a further subsequence which converges almost surely to .

###### Lemma 3.35.

No estimator is consistent for an estimation problem with an inconsistent posterior.

###### Proof.

Assume is an estimation problem with an inconsistent posterior. Then, applying Lévy’s upward theorem to Lemma 3.20, this implies that for some , with positive probability when is distributed according to , . Let be such a member of and let then:

 (4)

Suppose is a consistent estimator for the estimation problem, so that for any , converges to in probability, when is drawn from . Therefore, by Lemma 3.34, for every subsequence there is a further subsequence which converges to almost surely when is distributed according to . Thus, in particular, there exists a subsequence such that . Letting we can restate this as:

 (5)

Thus combining (4) and (5) we get:

 (6)

Since , this implies that . Further by the definition of S, . So by Bayes’ theorem:

 (7)

It follows that for some

 (8)

But again by consistency of there is a subsequence of such that

 (9)

(8) and (9) together imply that for some , contradicting the definition of A. Thus cannot be consistent for this problem, and therefore there can be no consistent estimator for an estimation problem with an inconsistent posterior. ∎

Lemma 3.35 implies that any estimator class which is consistent over the class of problems with consistent posteriors is consistent on all problems for which consistent estimators exist at all. We shall call such estimator classes properly consistent.

### 3.2 Bayes Estimators

We now introduce Bayes estimator classes, which form our first examples of Bayesian estimator classes. In the next subsection we will then give sufficient conditions for a Bayes estimator class to be properly consistent. Bayes estimators (Lehmann & Casella, 1998, p.255) are widely discussed examples of Bayesian estimators that attempt to minimise the posterior expectation of some loss function on the parameter space representing the cost of guessing the model incorrectly. Specifically:

###### Definition 3.48.

A function is a loss function for estimation problem if, for all , .

###### Definition 3.49.

A Bayes estimator associated with Loss function is any estimator satisfying:

$$\hat\theta_L(x) \in \operatorname*{argmin}_{\theta_0} E\left(L(\boldsymbol{\theta},\theta_0) \mid \boldsymbol{x} = x\right).$$

Note that while Definition 3.48 treats the two arguments of the loss function symmetrically, Definition 3.49 does not. The reason is that there is a semantic difference in the two arguments of the loss function. $L(\theta,\theta_0)$ represents the loss suffered when $\theta_0$ is the estimated value and $\theta$ is the true value of $\boldsymbol{\theta}$. However, we make no requirement that the loss functions themselves treat the arguments asymmetrically, and in fact many of the most popular loss functions (such as squared distance) do not. Of the loss functions we introduce later in this section, the discrete loss function is symmetric, while the Kullback-Leibler loss function is not.

Note further that while Definition 3.49 identifies all and only those estimators that are ordinarily called Bayes estimators, there are two important differences between the way we will treat Bayes estimators in this paper and the way they are treated in the literature (for instance in the complete class theorem Wald (1950); Stein (1955); Sacks (1963)). The first difference was already noted in Section 2.1 — namely that we treat the prior as part of the estimation problem rather than as part of the estimator. The second difference is in the estimator classes we shall consider. Bayes estimators are typically considered individually or grouped together in a single class of all Bayes estimators (such as in the complete class theorem just mentioned). However, in order to use Bayes estimation one must first select a loss function to use. We will therefore subdivide the class of all Bayes estimators into estimator classes based on their associated loss functions.

The construction of the Bayes estimator classes thus takes a little more work, since (like estimators) loss functions only exist for a given estimation problem. To define Bayes estimator classes which exist across estimation problems with varying parameter spaces, we need the notion of a general loss function:

###### Definition 3.50.

A general loss function is a function from estimation problems to loss functions.

The loss function which is the image of estimation problem under general loss function we will denote with . Thus represents the loss from selecting as an estimate when is the true value when applying general loss function to estimation problem .

In practice a loss function for a particular estimation problem is usually chosen using information from only some of the components of an estimation problem, imposing structure on the possible form of the general loss function. In particular general loss functions which are used in practice tend to either be parameter-based or distribution-based. A parameter-based loss function is a general loss function such that depends only on the values and (independently of ). A distribution-based loss function is a general loss function such that depends only on the distributions and in . For instance the general loss function which puts the discrete metric on every space:

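$$L^{(T)}_{\mathrm{disc}}(\theta,\theta') \stackrel{\mathrm{def}}{=} \begin{cases} 1 & \text{if } \theta \neq \theta' \\ 0 & \text{if } \theta = \theta' \end{cases}$$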

is a parameter-based loss function. An example of a popular distribution-based loss function is the following, based on Kullback-Leibler divergence:

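$$L^{(T)}_{\mathrm{KL}}(\theta,\theta') \stackrel{\mathrm{def}}{=} D_{\mathrm{KL}}(P_\theta, P_{\theta'}) \stackrel{\mathrm{def}}{=} E_{P_\theta}\left(\log\frac{dP_\theta}{dP_{\theta'}}\right).$$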

The Bayes estimator class associated with a given general loss function is the class of Bayes estimators whose associated loss function is the image of the estimation problem the estimator is defined on, under the general loss function. A Bayes estimator class whose associated general loss function is parameter-based (distribution-based) we will call a parameter-based (distribution-based) Bayes estimator class. Thus the class of Bayes estimators minimising the discrete metric is a parameter-based Bayes estimator class which is more commonly known as the Maximum A Posteriori (MAP) estimator class, and the class of Bayes estimators minimising Kullback-Leibler divergence is a distribution-based Bayes estimator class which we will follow Dowe et al. (1998) in calling the minimum Expected Kullback-Leibler (minEKL) estimator class. If an individual Bayes estimator is a member of a parameter-based (distribution-based) Bayes estimator class we will call it a parameter-based (distribution-based) estimator. This implies that an estimator may be both a parameter-based and a distribution-based Bayes estimator, while an estimator class cannot be both parameter-based and distribution-based.
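The identification of MAP with the Bayes estimator for the discrete loss can be checked on a small posterior. The sketch below assumes a finite parameter space and restricts candidate estimates to that space; the posteriors and function names are hypothetical. Under the discrete loss, the posterior expected loss of a candidate is one minus its posterior probability, so minimising it is the same as maximising the posterior. A second, numeric posterior shows that a different loss (squared error, used here only for contrast) can select a different estimate.

```python
def bayes_estimate(posterior, loss):
    """Return a candidate theta0 minimising the posterior expected loss.

    `posterior` maps parameter values to posterior probabilities;
    `loss(theta, theta0)` is the loss when theta is true and theta0 is estimated.
    """
    def expected_loss(theta0):
        return sum(p * loss(theta, theta0) for theta, p in posterior.items())
    return min(posterior, key=expected_loss)


def discrete_loss(theta, theta0):
    """Discrete metric: zero for a correct guess, one otherwise."""
    return 0.0 if theta == theta0 else 1.0


def squared_loss(theta, theta0):
    return (theta - theta0) ** 2


def map_estimate(posterior):
    """Return a parameter value with maximal posterior probability."""
    return max(posterior, key=posterior.get)


# Hypothetical posteriors after some observation x.
labelled_posterior = {"a": 0.2, "b": 0.5, "c": 0.3}
numeric_posterior = {0: 0.4, 1: 0.25, 2: 0.35}

discrete_choice = bayes_estimate(labelled_posterior, discrete_loss)
squared_choice = bayes_estimate(numeric_posterior, squared_loss)
```

Here `discrete_choice` coincides with `map_estimate(labelled_posterior)`, while `squared_choice` differs from `map_estimate(numeric_posterior)`, illustrating why the paper groups Bayes estimators by their loss functions.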

We have dealt explicitly only with small sample estimation problems in defining Bayes estimator classes, which in light of Section 2.2 is sufficient to define Bayes estimators and estimator classes over large sample estimation problems as well. There is an important point to note about the associated loss functions in large sample estimation problems, however. In the large sample case the distributions associated with each member of $\Theta$, and so the associated loss functions of distribution-based Bayes estimators, depend on the number of data points observed. Thus a distribution-based Bayes estimator on a large sample estimation problem does not have a single associated loss function. Instead such an estimator has a sequence of associated loss functions $(L_n)_{n\in\mathbb{N}}$. While the associated loss function of a parameter-based Bayes estimator does not depend on the number of data points observed, in order to provide a uniform treatment of the two types of Bayes estimator classes we shall consider all Bayes estimators on large sample estimation problems to have an associated loss function sequence instead of an associated loss function (with parameter-based Bayes estimators’ associated loss function sequences being constant).
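The dependence of a distribution-based loss on the number of observations can be illustrated with the Kullback-Leibler loss. The sketch below makes an assumption the paper does not require, namely that observations are i.i.d., in which case the loss after $n$ observations is $n$ times the single-observation loss. The two coin models and all function names are hypothetical, and both distributions are assumed to share the same support with no zero probabilities in the second argument.

```python
from itertools import product
from math import log, prod


def kl_divergence(p, q):
    """D_KL(p, q) for distributions given as dicts over the same finite support."""
    return sum(px * log(px / q[x]) for x, px in p.items() if px > 0)


def iid_sequence_distribution(single, n):
    """Distribution of n i.i.d. draws, keyed by tuples of observations."""
    return {
        seq: prod(single[x] for x in seq)
        for seq in product(single, repeat=n)
    }


# Hypothetical models for a single coin flip.
p_single = {"H": 0.8, "T": 0.2}
q_single = {"H": 0.5, "T": 0.5}

loss_1 = kl_divergence(p_single, q_single)
loss_3 = kl_divergence(
    iid_sequence_distribution(p_single, 3),
    iid_sequence_distribution(q_single, 3),
)
reverse_loss_1 = kl_divergence(q_single, p_single)
```

In this example `loss_3` equals three times `loss_1` up to rounding, so the associated loss function sequence is not constant, and `reverse_loss_1` differs from `loss_1`, which reflects the asymmetry of the Kullback-Leibler loss noted earlier.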

### 3.3 Consistency of Bayes Estimator Classes

One natural way to show that a Bayes estimator is consistent is to show that:

1. The expected loss of the true value converges to zero, while

2. The expected loss of values far away from the true value is eventually bounded away from zero.

With , this can fail if either the loss can get too small so there is some sequence of estimates not converging to which has expected losses converging to zero, or conversely if can get too large so that does not converge to zero. This provides the intuition for the following definition of a discerning sequence of loss functions, which is designed to avoid both of these issues.

###### Definition 3.51.

The sequence of loss functions is discerning if for all , and any neighbourhood of ,

$$\liminf_{n\to\infty}\inf_{\theta'\in U^c}\frac{L_n(\theta,\theta')}{K^{(\theta)}_n} > 0 \tag{10}$$

where . Loss function is discerning if the constant sequence (i.e., the sequence where for all , ) is discerning. A general loss function sequence is discerning if, for every large sample estimation problem , the loss function sequence is discerning.

Note that for constant loss function sequences equation (10) reduces to

$$\inf_{\theta'\in U^c}\frac{L(\theta,\theta')}{K^{(\theta)}} > 0,$$

where . This holds if and only if and .

Having a discerning associated loss function sequence is a sufficient condition for the consistency of a Bayes estimator:

###### Lemma 3.52.

If is a large sample estimation problem with a consistent posterior, and is a discerning loss function sequence, then any Bayes estimator on associated with , , is strongly consistent on .

###### Proof.

Let be a large sample estimation problem with a consistent posterior, let be a discerning loss function sequence, and for each , let . Then Lemma 3.20 implies:

 (11)

Now suppose that is not strongly consistent. Then there exists a such that does not converge almost surely to when . Thus let , so that by assumption

 (12)

Then from (11) and (12) we get:

and

Therefore is not empty so we may choose . Now note that since is discerning, for all , . Further, because by definition , we have that for any , , and because is a loss function, . Combining these two facts, we get:

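$$E\left(\frac{L_n(\boldsymbol{\theta},\theta)}{K^{(\theta)}_n}\,\middle|\,\boldsymbol{x}_{1:n} = x_{1:n}\right) \le 1 - P(\boldsymbol{\theta} = \theta \mid \boldsymbol{x}_{1:n} = x_{1:n}). \tag{13}$$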

Therefore, since :

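$$\lim_{n\to\infty} E\left(\frac{L_n(\boldsymbol{\theta},\theta^*)}{K^{(\theta^*)}_n}\,\middle|\,\boldsymbol{x}_{1:n} = x^*_{1:n}\right) = 0. \tag{14}$$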

Now, , so

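$$E\left(\frac{L_n(\boldsymbol{\theta},\hat\theta_{L_n})}{K^{(\theta^*)}_n}\,\middle|\,\boldsymbol{x}_{1:n} = x^*_{1:n}\right) \ge P(\boldsymbol{\theta} = \theta^* \mid \boldsymbol{x}_{1:n} = x^*_{1:n})\,\frac{L_n(\theta^*,\hat\theta_{L_n}(x^*_{1:n}))}{K^{(\theta^*)}_n}. \tag{15}$$

Since , , so (15) implies: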

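$$\limsup_{n\to\infty} E\left(\frac{L_n(\boldsymbol{\theta},\hat\theta_{L_n})}{K^{(\theta^*)}_n}\,\middle|\,\boldsymbol{x}_{1:n} = x^*_{1:n}\right) \ge \limsup_{n\to\infty}\frac{L_n(\theta^*,\hat\theta_{L_n}(x^*_{1:n}))}{K^{(\theta^*)}_n}. \tag{16}$$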

However, since , there exists a neighbourhood of , and a subsequence of , such that is in . Hence

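$$\limsup_{n\to\infty}\frac{L_n(\theta^*,\hat\theta_{L_n}(x^*_{1:n}))}{K^{(\theta^*)}_n} \ge \liminf_{k\to\infty}\frac{L_{n_k}(\theta^*,\hat\theta_{L_{n_k}}(x^*_{1:n_k}))}{K^{(\theta^*)}_{n_k}} \ge \liminf_{k\to\infty}\inf_{\theta'\notin U}\frac{L_{n_k}(\theta^*,\theta')}{K^{(\theta^*)}_{n_k}}. \tag{17}$$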

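But since the liminf of a subsequence is at least as great as the liminf of the original sequence, (17) implies:

$$\limsup_{n\to\infty}\frac{L_n(\theta^*,\hat\theta_{L_n}(x^*_{1:n}))}{K^{(\theta^*)}_n} \ge \liminf_{n\to\infty}\inf_{\theta'\in U^c}\frac{L_n(\theta^*,\theta')}{K^{(\theta^*)}_n} > 0, \tag{18}$$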

where the final inequality follows from being discerning. Combining (16) and (18) we conclude . Therefore there exists a such that for infinitely many , . In contrast, it follows from (14) that , so there exists an such that for all , . Therefore, for infinitely many :

 E(Lk(θ,θ∗)|x1:k=x∗1:k)

Since , equation (19) implies:

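$$\hat{\theta}_{L_k}(x^*_{1:k}) \neq \operatorname*{argmin}_{\theta'} \mathbf{E}\left(L_k(\boldsymbol{\theta},\theta') \,\middle|\, \boldsymbol{x}_{1:k} = x^*_{1:k}\right),$$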

contradicting the definition of (Definition 3.49). Hence, our assumption that is not strongly consistent must be false, proving the lemma. ∎

Lemma 3.52 thus defines a family of estimator classes which are (strongly) properly consistent. It remains to exhibit a member of the family, to show that it is non-empty and give our first example of a (strongly) properly consistent estimator class. The following corollary shows that the class of MAP estimators is such a class, and is therefore strongly properly consistent.

###### Corollary 3.63.

The MAP estimator class is strongly properly consistent.

###### Proof.

The discrete metric is discerning since for any and , . Thus, by Lemma 3.52, every MAP estimator on a problem with a consistent posterior is strongly consistent. That is, the MAP estimator class is strongly partially consistent over problems with consistent posteriors.

In order to show that MAP is strongly properly consistent, we must additionally show that for every problem with a consistent posterior, there exists a MAP estimator for . Suppose for contradiction that no MAP estimator exists for . Then for some and , does not exist. In other words, for all , . It follows that , and that for all there exist infinitely many such that . This then implies , but we know that by the laws of conditional probability. Hence, the assumption that no MAP estimator for exists must be false, and therefore the class of MAP estimators is strongly properly consistent. ∎
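As an informal illustration of the link between the discrete metric and MAP, the following minimal sketch computes a posterior over a small finite parameter set and checks that minimising the posterior expected loss under the discrete metric selects the same value as maximising the posterior. The Bernoulli model, the candidate grid, the uniform prior and all function names are assumptions made for this example only; they are not taken from the paper.

```python
import math

def posterior(candidates, prior, data):
    """Posterior over candidate Bernoulli parameters (illustrative model only)."""
    ones = sum(data)
    zeros = len(data) - ones
    # Work in log space to avoid underflow for long samples.
    logs = {
        p: math.log(prior[p]) + ones * math.log(p) + zeros * math.log(1 - p)
        for p in candidates
    }
    top = max(logs.values())
    weights = {p: math.exp(v - top) for p, v in logs.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}

def map_estimate(post):
    """Value with the largest posterior probability."""
    return max(post, key=post.get)

def bayes_estimate_discrete_metric(post):
    """Minimiser of posterior expected loss for the loss L(t, g) = 1 if t != g else 0.

    The expected loss of guessing g is the posterior mass on all other values,
    i.e. 1 - post[g], so the minimiser coincides with the MAP estimate.
    """
    expected_loss = {
        g: sum(q for p, q in post.items() if p != g) for g in post
    }
    return min(expected_loss, key=expected_loss.get)

# Hypothetical setup: five candidate parameters, uniform prior,
# and a sample in which exactly 30% of the observations are ones.
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = {p: 1 / len(candidates) for p in candidates}
data = [1] * 300 + [0] * 700
post = posterior(candidates, prior, data)
print(map_estimate(post), bayes_estimate_discrete_metric(post))
```

The sketch uses a finite parameter set, so a maximiser always exists; the existence argument in the proof above is what covers the countably infinite case.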

The results concerning posterior consistency can now be neatly summarised in the following main result:

###### Theorem 3.72.

For any estimation problem the following conditions are all equivalent:

1. has a consistent posterior.

2. There exists a consistent estimator for .

3. There exists a strongly consistent estimator for .

###### Proof.

(3) ⇒ (2): Every estimator which is strongly consistent is consistent, since almost sure convergence implies convergence in probability.
(2) ⇒ (1): Lemma 3.35.
(1) ⇒ (3): Corollary 3.63 gives an example of a strongly consistent estimator for any problem with a consistent posterior. ∎

While we have seen that the MAP estimator class is strongly properly consistent, this does not hold for Bayes estimator classes in general, and, in fact, no distribution-based Bayes estimator class is properly consistent.

###### Theorem 3.73.

No distribution-based Bayes estimator class is properly consistent.

###### Proof.

Consider the large sample estimation problem , where , , for all , and the distribution of is given by:

 Problem 3.75

Thus is deterministic when conditioned on , and takes the value of a sequence of ones followed by an infinite sequence of zeros, or, if , an infinite sequence of ones. Note that since the MAP estimator is consistent on , has a consistent posterior.

Now consider the estimator for this problem where . For any sequence of observations, and any distribution-based loss function has a posterior expected loss of 0, and so is a member of every distribution-based estimator class. However , so is inconsistent. Therefore, no distribution-based Bayes estimator class is properly consistent. ∎

### 3.4 Frequentist Proper Consistency

Up until this point, our approach to properly consistent estimator classes has been entirely Bayesian: proper consistency was defined using the Bayesian posterior and we have thus far only considered the requirements for Bayes estimators to be properly consistent. In this subsection we consider a frequentist approach and show that Bayesianism is unnecessary for both defining proper consistency and for attaining properly consistent estimator classes. In particular, we will show that the consistent posterior property is equivalent to the frequentist property of having convergent likelihood ratios.

The equivalence between posterior consistency and convergent likelihood ratios implies that for any properly consistent Bayesian estimator class , a frequentist estimator class can be constructed by simply choosing a canonical prior for each possible , and then declaring an estimator on estimation problem to be a member of if and only if the same estimator on is a member of .

While it is thus possible to form a frequentist properly consistent estimator class, the most popular frequentist estimator class, Maximum Likelihood, is not properly consistent. There are famous examples of the inconsistency of Maximum Likelihood which can be used to show failure of proper consistency. (For instance, in Hanan (1960) a discrete problem is given, where Maximum Likelihood fails to be consistent, despite the existence of consistent estimators for the problem.) In the conclusion of this section, however, we show that the Maximum Likelihood estimator class is not properly consistent by means of an original example, which we introduce because it provides the motivation for our notion of distinctive likelihoods to be introduced in the next section.

###### Definition 3.77.

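A large sample estimation problem has convergent likelihood ratios if, for all $\theta, \theta'$ such that $\theta \neq \theta'$, the likelihood ratio of $\theta'$ over $\theta$ converges almost surely to 0 conditioned on $\theta$ being the true model. That is:

$$\lim_{n\to\infty} \frac{P^{(1:n)}_{\theta'}(x_{1:n})}{P^{(1:n)}_{\theta}(x_{1:n})} = 0 \quad \text{a.s.}$$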

Note that, unlike Definition 3.17, Definition 3.77 makes no use of the prior distribution on .

###### Lemma 3.79.

A large sample estimation problem has a consistent posterior if and only if it has convergent likelihood ratios.

###### Proof.

only if: Suppose first that is a large sample estimation problem with a consistent posterior. Then when is distributed according to , almost surely by Lemma 3.20 and so almost surely for any . Thus, taking a fixed , by Lévy’s upward theorem both of the following hold with probability 1: and . Therefore almost surely.

Now, let be the support of , so that for all we get by applying Bayes’ theorem to the numerator and denominator. Then since , almost surely. Therefore, when is distributed according to , almost surely. Since and this value does not depend on , by the algebraic limit theorem (Abbott, 2000, Thm 2.3.3) , proving the first half of the claim.

if: Suppose for any such that , . Applying Bayes’ theorem this implies that when is distributed according to

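$$\lim_{n\to\infty} \frac{\mathbf{P}(\boldsymbol{\theta}=\theta)\,\mathbf{P}(\boldsymbol{\theta}=\theta' \mid x_{1:n})}{\mathbf{P}(\boldsymbol{\theta}=\theta')\,\mathbf{P}(\boldsymbol{\theta}=\theta \mid x_{1:n})} = 0 \quad \text{a.s.} \tag{20}$$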

Since and this value does not depend on , equation (20) implies that almost surely.

By the algebraic limit theorem, we conclude that and so by Lévy’s upward theorem, for all . Since this holds for all and because is discrete, almost surely. Hence, by Lévy’s upward theorem, when is distributed according to , almost surely. Thus, the posterior is consistent by Lemma 3.20. ∎

###### Corollary 3.113.

If any of the conditions in Theorem 3.72 are satisfied by estimation problem , then they are also satisfied for any other estimation problem , identical to except for the prior distribution.

###### Proof.

By Lemma 3.79, if the original problem satisfies any of the conditions of Theorem 3.72 then it has convergent likelihood ratios. But whether a problem has convergent likelihood ratios is independent of the problem’s prior, so the other problem also has convergent likelihood ratios, and hence by Lemma 3.79 it satisfies all of the conditions of Theorem 3.72. ∎
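The prior-independence used in this argument can be checked numerically. The sketch below is an illustration under assumed inputs (a two-point parameter set, two arbitrary priors and a short hand-written sample; none of these come from the paper). It recovers the likelihood ratio from the posterior via the Bayes' theorem rearrangement in equation (20) and confirms that the same value is obtained whichever prior is used.

```python
import math

def likelihood(p, data):
    """Probability of an i.i.d. Bernoulli(p) sample (illustrative model only)."""
    ones = sum(data)
    return p ** ones * (1 - p) ** (len(data) - ones)

def posterior(prior, data):
    """Posterior over the keys of `prior`, computed directly from Bayes' theorem."""
    joint = {p: prior[p] * likelihood(p, data) for p in prior}
    total = sum(joint.values())
    return {p: v / total for p, v in joint.items()}

def ratio_from_posterior(prior, post, theta, theta_alt):
    """The expression inside the limit of (20):
    prior(theta) * post(theta_alt) / (prior(theta_alt) * post(theta))."""
    return (prior[theta] * post[theta_alt]) / (prior[theta_alt] * post[theta])

# Hypothetical data and priors, chosen only for the demonstration.
data = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
uniform = {0.3: 0.5, 0.6: 0.5}
skewed = {0.3: 0.9, 0.6: 0.1}

direct = likelihood(0.6, data) / likelihood(0.3, data)
via_uniform = ratio_from_posterior(uniform, posterior(uniform, data), 0.3, 0.6)
via_skewed = ratio_from_posterior(skewed, posterior(skewed, data), 0.3, 0.6)
print(direct, via_uniform, via_skewed)
```

All three printed values agree up to floating-point error, because the prior terms cancel exactly in the rearranged ratio.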

As an example of convergent likelihood ratios, consider the large sample estimation problem of estimating the probability parameter of a sequence of i.i.d., Bernoulli-distributed random variables. This will form a running example, so it is worth defining explicitly.

###### Definition 3.114.

Let be a one-to-one function.

A binomial probability estimation problem is a large sample estimation problem where for each , each is, given , independently Bernoulli-distributed with probability parameter .

###### Lemma 3.115.

Every binomial probability estimation problem has convergent likelihood ratios.

###### Proof.

A Bernoulli distribution admits two possible values. Under the assumption , one of these values has probability and the other . Let be the number of observations, among , to attain the value of probability , and let .

The strong law of large numbers states that

converges almost surely to .

Consider, now, that for any ,

$$P^{(1:n)}_{\theta'}(x_{1:n}) = p(\theta')^{k(n)}\,(1-p(\theta'))^{n-k(n)} = \left(p(\theta')^{\rho(n)}\,(1-p(\theta'))^{1-\rho(n)}\right)^{n}, \tag{21}$$

and that as converges to , the value of converges to .

The function has a unique maximum at . Hence, with probability given and ,

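$$\lim_{n\to\infty} \frac{p(\theta')^{\rho(n)}\,(1-p(\theta'))^{1-\rho(n)}}{p(\theta)^{\rho(n)}\,(1-p(\theta))^{1-\rho(n)}} = \frac{f(p(\theta'))}{f(p(\theta))} < 1, \tag{22}$$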

so combining (21) and (22), we get

$$\lim_{n\to\infty} \frac{P^{(1:n)}_{\theta'}(x_{1:n})}{P^{(1:n)}_{\theta}(x_{1:n})} = 0.$$
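The behaviour established in this proof can be observed in a small simulation. The sketch below is a hedged illustration, not part of the paper: the parameter values 0.3 and 0.5, the seed and the function names are arbitrary choices, and the function `f` is assumed to be $x \mapsto x^{p}(1-x)^{1-p}$ with $p$ the true probability, which is the form suggested by equations (21) and (22). It prints the log-likelihood ratio for growing sample sizes and checks on a grid that `f` peaks at the true parameter.

```python
import math
import random

def f(x, p):
    """Assumed form of f: x^p * (1 - x)^(1 - p), with p the true probability."""
    return x ** p * (1 - x) ** (1 - p)

def log_likelihood_ratio(p_alt, p_true, data):
    """log of P_alt(data) / P_true(data) for i.i.d. Bernoulli observations."""
    k = sum(data)          # number of ones, k(n) in the text
    n = len(data)
    return (k * math.log(p_alt / p_true)
            + (n - k) * math.log((1 - p_alt) / (1 - p_true)))

def simulate(p_true, n, seed=0):
    """Draw n Bernoulli(p_true) observations with a fixed seed."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_true else 0 for _ in range(n)]

# Hypothetical true and alternative parameters.
p_true, p_alt = 0.3, 0.5
data = simulate(p_true, 5000)
for n in (10, 100, 1000, 5000):
    # The log ratio drifts towards minus infinity, so the ratio itself tends to 0.
    print(n, log_likelihood_ratio(p_alt, p_true, data[:n]))
```

Working with the logarithm of the ratio avoids floating-point underflow; a log ratio that decreases roughly linearly in the sample size corresponds to the exponential decay implied by (21) and (22).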

Finally we provide our example of maximum likelihood inconsistency.

###### Theorem 3.116.

Maximum Likelihood is not properly consistent.

###### Proof.

For in the range , let denote the base breakdown of , as follows:

$$\theta = \sum_{i=1}^{\infty} [\theta]^{b}_{i}\, b^{-i},$$

where for all , . Then let and . Note that the are disjoint so we can define the function over their union such that if and only if .

Now, let be a large sample estimation problem where , and for all , with the independent when conditioned on , with conditional distributions given by:

We will prove that ML is inconsistent for this problem, even though it has convergent likelihood ratios (and therefore a consistent posterior). First, to see that ML is inconsistent, note that for any , if then after observations the likelihood of will be smaller than . However, if then for any value for which , will have the larger likelihood of . Thus, is not the maximum likelihood solution. As grows to infinity, the ML estimate will converge to the value . This value is clearly distinct from , differing from it by at least .

To see that the problem has convergent likelihood ratios (and therefore a consistent posterior) consider any such that , and let . Then:

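$$\frac{P^{(1:A+n)}_{\theta'}(x_{1:A+n})}{P^{(1:A+n)}_{\theta}(x_{1:A+n})} = \frac{P^{(1:A)}_{\theta'}(x_{1:A})}{P^{(1:A)}_{\theta}(x_{1:A})} \cdot \frac{P^{(A+1:A+n)}_{\theta'}(x_{A+1:A+n})}{P^{(A+1:A+n)}_{\theta}(x_{A+1:A+n})}.$$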

However, defines the distribution of each in given as an independent Bernoulli distribution, with probability parameter . By Lemma 3.115, this problem has convergent likelihood ratios, so with probability given and , we have that

$$\lim_{n\to\infty} \frac{P^{(A+1:A+n)}_{\theta'}(x_{A+1:A+n})}{P^{(A+1:A+n)}_{\theta}(x_{A+1:A+n})} = 0,$$

and therefore also

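$$\lim_{n\to\infty} \frac{P^{(1:A+n)}_{\theta'}(x_{1:A+n})}{P^{(1:A+n)}_{\theta}(x_{1:A+n})} = \frac{P^{(1:A)}_{\theta'}(x_{1:A})}{P^{(1:A)}_{\theta}(x_{1:A})} \cdot \lim_{n\to\infty} \frac{P^{(A+1:A+n)}_{\theta'}(x_{A+1:A+n})}{P^{(A+1:A+n)}_{\theta}(x_{A+1:A+n})} = 0.$$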
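The factorisation step in this argument relies only on the observations being independent given the parameter. The sketch below illustrates that step in isolation. It does not reproduce the conditional distributions of the theorem: the per-position Bernoulli parameters, the split point `A = 3` and the sample are hypothetical values chosen for the demonstration.

```python
import math

def log_likelihood(params, data):
    """Log-probability of independent Bernoulli observations,
    where params[i] is the probability that data[i] equals 1."""
    return sum(math.log(q if x == 1 else 1 - q) for q, x in zip(params, data))

def log_ratio(params_alt, params_true, data):
    """Log of the likelihood ratio of the alternative model over the true model."""
    return log_likelihood(params_alt, data) - log_likelihood(params_true, data)

# Hypothetical models: the first A positions follow one rule, the rest another.
A = 3
data = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
params_true = [0.9] * A + [0.3] * (len(data) - A)
params_alt = [0.8] * A + [0.6] * (len(data) - A)

whole = log_ratio(params_alt, params_true, data)
head = log_ratio(params_alt[:A], params_true[:A], data[:A])
tail = log_ratio(params_alt[A:], params_true[A:], data[A:])
# Independence makes the ratio a product, so the log ratio is a sum.
print(whole, head + tail)
```

Because the head factor is a fixed finite number once the first `A` observations are known, the limit of the whole ratio is governed entirely by the tail factor.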

## 4 Likelihood Consistency

The example in Theorem 3.116 defeats Maximum Likelihood for a simple reason: although every likelihood ratio with the true model's likelihood in the denominator converges to 0, the ratios do not converge uniformly. Consequently, at no point are all the likelihood ratios less than 1. Requiring the ratios to converge uniformly yields the class of estimation problems with distinctive likelihoods, and we call estimator classes that are consistent over this class likelihood consistent. We show that having distinctive likelihoods is sufficient to ensure the consistency of all maximum likelihood estimators. Unfortunately, maximum likelihood estimators do not exist for all estimation problems with distinctive likelihoods, so the maximum likelihood estimator class is only partially likelihood consistent. We therefore introduce the class of approximate maximum likelihood estimators (see Balakrishnan & Cohen, 2014, Chapter 6), which has members for every estimation problem, and show that the class is strongly likelihood consistent. In fact, we show that the approximate maximum likelihood estimator class characterises the class of estimation problems with distinctive likelihoods, in the sense that a problem has distinctive likelihoods if and only if all the approximate maximum likelihood estimators on the problem are strongly consistent.

### 4.1 Maximum Likelihood

###### Definition 4.119.

A large sample estimation problem , has distinctive likelihoods if, for all and neighbourhoods of , :

 (23)

###### Lemma 4.121.

Every binomial probability estimation problem with has distinctive likelihoods.

###### Proof.

Let be as in Lemma 3.115, and consider that

$$P^{(1:n)}_{\theta'}(x_{1:n}) = \left((\theta')^{\rho(n)}\,(1-\theta')^{1-\rho(n)}\right)^{n}.$$

Taken as a function of , this has a unique turning point: a maximum at . Recall, however, that by the strong law of large numbers with probability the value of converges to , when . Therefore, with probability , it will eventually be in any neighbourhood of . If, without loss of generality, we take to be the interval , then when