# MML is not consistent for Neyman-Scott

Strict Minimum Message Length (SMML) is a statistical inference method widely cited (but only with informal arguments) as providing estimations that are consistent for general estimation problems. It is, however, almost invariably intractable to compute, for which reason only approximations of it (known as MML algorithms) are ever used in practice. We investigate the Neyman-Scott estimation problem, an oft-cited showcase for the consistency of MML, and show that even with a natural choice of prior, neither SMML nor its popular approximations are consistent for it, thereby providing a counterexample to the general claim. This is the first known explicit construction of an SMML solution for a natural, high-dimensional problem. We use the same novel construction methods to refute other claims regarding MML also appearing in the literature.


## I Introduction

Minimum Message Length (MML) is a general name for any member of the family of statistical inference methods based on the minimum message length principle, which, in turn, is closely related to the family of Minimum Description Length (MDL) estimators [1, 2, 3], but predates it. The minimum message length principle was first introduced in [4], and the estimator that follows the principle directly, which was first described in [5], is known as Strict MML (SMML).

Although purportedly returning ideal inferences, SMML is never used in practice because it is computationally infeasible and analytically intractable in all but a select few cases [6, 7].

A computationally-feasible approximation to SMML was introduced in [8]. This is known as the Wallace-Freeman approximation (WF-MML), and is perhaps the MML variant that is in widest use.

Although not as popular as Maximum Likelihood (ML) or Maximum A Posteriori (MAP), MML still enjoys a wide following, with over 70 papers published regarding it in 2016 alone, including [9, 10, 11]. MML proponents cite for it a wide variety of attributes that make the method and its inferences in some ways superior to other methods, and claim that these benefits outweigh the method’s computational requirements, which are heavy even when using the Wallace-Freeman approximation. (One paper [12] cites MML computation times that are over 400,000 times longer than ML computation times, despite working on a data-set of fewer than 1,000 items, using only 9 attributes and having only 3 classes. It may be the case that these times would have been reducible through optimisation, but no information regarding this is given in the paper.)

One of the many properties often attributed to MML is “consistency”. For example, [13] states:

SMML has been studied fairly thoroughly, and is known […] to be consistent and efficient.

Loosely speaking, an estimate is said to be consistent for a particular problem if given enough observations the estimate is guaranteed to converge to the correct parameter value. (See [14] for a formal definition.) Importantly, it is the property of an estimate, not of an estimator: it is given in the context of a specific estimation problem. However, in the MML literature, statements such as the one quoted above are often given without specifying a particular estimation problem. To determine how to interpret such statements, consider the following quote from [15]:

These results of inconsistency of ML and consistency of MML and marginalised maximum likelihood for an increasing number of distributions with shared parameter will remain valid for e.g. the von Mises circular distribution, a distribution where Maximum Likelihood shows even worse small sample bias than it does for the Gaussian distribution. We seek a problem of this Neyman-Scott nature (of many distributions sharing a common parameter value) for which the MML estimate remains consistent (as we know it will) but for which the marginalised Maximum Likelihood estimate is either inconsistent or not defined.

This quote explicitly states that MML is known to be consistent even on completely unspecified estimation problems (except for the fact that they involve many distributions sharing a common parameter value), and gives three specific examples: the von Mises estimation problem, the Neyman-Scott estimation problem, and problems of a “Neyman-Scott nature”.

The claim repeated in the MML literature is therefore that MML’s consistency property is universal, independent of the specific choice of estimation problem. (The title of this paper, “MML is not consistent for Neyman-Scott”, should be interpreted as the logical opposite, which is to say “It is not true that MML is consistent for a general member of the Neyman-Scott estimation problem family; it will be inconsistent for some Neyman-Scott cases.”) This claim is not merely stated but also argued. For example, [15] claims:

The fact that general MML codes are (by definition) optimal (i.e. have highest posterior probability) implicitly suggests that, given sufficient data, MML will converge as closely as possible to any underlying model.

[B]ecause the SMML estimator chooses the shortest encoding of the model and data, it must be both consistent and efficient.

(Cf. [17].)

MML approximations are not said to hold such universal consistency properties. However, [15] calculated the Wallace-Freeman MML estimate and [18] calculated another MML approximation known as “Ideal Group” (IG), both working on the Neyman-Scott estimation problem [19], which is a problem on which many other estimation methods do not produce consistent estimates.

The fact that this example can be computed using the Ideal Group approximation, which is often as intractable as SMML itself, has made the Neyman-Scott problem a touchstone in the MML literature and a consistently-given example to showcase its superior performance.

Note, however, that Neyman-Scott is a frequentist problem, defined by its likelihoods. To put any Bayesian method, including any of the MML variants discussed, to use on an estimation problem, one must also define a prior distribution for the estimated parameters. Without a specified prior, the “Neyman-Scott problem” is, from a Bayesian viewpoint, an entire family of estimation problems. Both [15] and [18] analysed the problem under a specific prior, which we shall name the Wallace prior.

Importantly, neither in these two papers nor in papers citing this result (e.g., [20, 21, 22, 23]) is the result ever restricted by this choice of prior. For example, [24] writes:

[T]he Wallace-Freeman MML estimator […] and the Dowe-Wallace Ideal Group (IG) estimator have both been shown to be statistically consistent for the Neyman-Scott problem.

This paper analyses the Neyman-Scott problem under another prior which we name the scale free prior. We show that neither the statements about SMML nor about the MML approximations are true for this instance of the Neyman-Scott problem, thus giving a counterexample to these general claims. In fact, outside of a few simple, one-dimensional cases for which ML is also consistent, SMML has not been shown to be consistent anywhere, and there is no reason to assume SMML holds any consistency properties that are superior to those of ML.

The methods developed in this paper, allowing for the first time direct, non-approximated analysis of a high-dimensional SMML solution, are general, and applicable beyond just the Neyman-Scott problem.

Table I lists the consistency properties of MML, both previously known and new to this paper.

## II Definitions

### II-A Notation

This paper deals with the problem of statistical inference: from a set of observations, $x$, taken from $X$ (the observation space), we wish to provide a point estimate, $\hat{\theta}(x)$, to the value, $\theta$, of a random variable, $\boldsymbol{\theta}$, drawn from $\Theta$ (parameter space). When speaking about statistical inference in general, we use the symbols introduced above. For a specific problem, such as in discussing the Neyman-Scott problem, we use problem-specific names for the variables. However, in all cases Latin characters refer to observables, Greek to unobservables that are to be estimated, boldface characters to random variables, non-boldface characters to values of said random variables, and hat-notation to estimates. Boldface is used for the observations, too, when considering the observations as random variables.

All point estimates discussed in this paper are defined using an $\operatorname{argmin}$ or an $\operatorname{argmax}$. We take these as, in general, returning sets. Nevertheless, we use “$\hat{\theta}(x)=\theta$” as shorthand for “$\theta\in\hat{\theta}(x)$”.

To be consistent with the notation of [18], we use $h(\theta)$ to indicate the prior and

$$r(x)=\int_{\Theta}h(\theta)f(x|\theta)\,d\theta \tag{1}$$

as the marginal. The integral of $h$ over $\Theta$ may be $1$ (in which case it is a proper prior and the problem is a proper estimation problem) but it may also integrate to other positive values (in which case it is a scaled prior) or diverge to infinity (in which case it is an improper prior). Our analysis will reject a prior as pathological only if it does not allow computation of a marginal using (1).

When speaking of events that have positive probability, we will use the $\operatorname{Prob}(\cdot)$ notation. However, in calculating over a scaled or improper prior some probabilities will be correspondingly scaled when computed as an integral over the prior or the marginal. For these we use the $\operatorname{ScaledProb}(\cdot)$ notation.

For reasons of mathematical convenience, we take both the observation space, $X$, and the parameter space, $\Theta$, as complete metric spaces, and assume that priors, likelihoods, posterior probabilities and marginals are all continuous, differentiable, everywhere-positive functions. This allows us to take limits, derivatives, $\operatorname{argmin}$s, $\operatorname{argmax}$s, etc., freely, without having to prove at every step that these are well-defined and have a value.

### II-B MML

Minimum Message Length (MML) is an inference method that attempts to codify the principle of Occam’s Razor in information-theoretic terms [18].

Define

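$$R_{\theta}(x)\stackrel{\text{def}}{=}\log\left(\frac{r(x)}{f(x|\theta)}\right). \tag{2}$$

Given a piecewise-constant function $F:X\to\Theta$, we define

$$L_E(F)\stackrel{\text{def}}{=}\operatorname{Entropy}(F(\boldsymbol{x})),$$

$$L_P(F)\stackrel{\text{def}}{=}\int_{X}r(x)R_{F(x)}(x)\,dx,$$

and

$$L(F)\stackrel{\text{def}}{=}L_E(F)+L_P(F).$$

The SMML estimator is usually defined as the minimiser of $L$. However, because this minimiser may not be unique, we use the more rigorous

$$\hat{\theta}_{\text{SMML}}(x)=\operatorname{closure}\left(\bigcup_{F\in\operatorname{argmin}_{F'}L(F')}F(x)\right).$$

Functions minimising $L$ are known as SMML code-books.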
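To make the objective concrete, the following is a minimal sketch that evaluates $L(F)=L_E(F)+L_P(F)$ by brute force on a toy discrete problem. The problem itself (a binomial observation with 4 trials, a five-point parameter grid and a uniform prior) is an assumption made purely for illustration and does not appear in the paper; the names `binom_pmf` and `code_length` are likewise hypothetical. Because the toy problem is discrete, sums replace the integrals of (1) and of $L_P$.

```python
import math
from itertools import product

def binom_pmf(x, n, p):
    """Binomial likelihood f(x | p) for x successes in n trials."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def code_length(F, r, f):
    """L(F) = L_E(F) + L_P(F); F[x] is the index of the parameter assigned to x."""
    # q[t] is the marginal probability of the region mapped to parameter t.
    q = {}
    for x, t in enumerate(F):
        q[t] = q.get(t, 0.0) + r[x]
    l_e = -sum(v * math.log(v) for v in q.values())              # entropy of F(x)
    l_p = sum(r[x] * math.log(r[x] / f[x][t]) for x, t in enumerate(F))
    return l_e + l_p

# Illustrative toy problem (an assumption, not taken from the paper).
n = 4
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
h = [1.0 / len(thetas)] * len(thetas)                             # uniform prior
f = [[binom_pmf(x, n, p) for p in thetas] for x in range(n + 1)]  # f[x][t]
r = [sum(h[t] * f[x][t] for t in range(len(thetas))) for x in range(n + 1)]

# Exhaustive search over all 5**5 candidate code-books.
best_L, best_F = min(
    (code_length(F, r, f), F)
    for F in product(range(len(thetas)), repeat=n + 1)
)
constant_L = code_length((2,) * (n + 1), r, f)  # every x mapped to theta = 0.5

print("best code-book:", [thetas[t] for t in best_F], "L =", best_L)
print("constant code-book L =", constant_L)
```

Exhaustive search is only viable because the toy observation space has five points; it is meant to show what is being minimised, not how SMML would be computed in general.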

### II-C The Ideal Point

We introduce the notion of an “ideal point” which will be central to our analysis. This is built on an approximation for SMML known in the MML literature as Ideal Group [18].

The Ideal Group estimator is defined in terms of its functional inverse, mapping $\theta$ values to (sets of) $x$ values. We refer to such functions as reverse estimators and denote them $\tilde{x}(\theta)$. The Ideal Group reverse estimator is defined as

$$\tilde{x}_{\text{IG}}(\theta)\stackrel{\text{def}}{=}\{x\in X \mid R_{\theta}(x)\le t(\theta)\}, \tag{3}$$

where $t(\theta)$ is a threshold whose value is given in [18], and which is computed in a way that guarantees that the ideal group is a non-empty set for each $\theta$.

Because the ideal group is always non-empty, it must include

$$\tilde{x}_{\text{IP}}(\theta)\stackrel{\text{def}}{=}\operatorname*{argmin}_{x\in X}R_{\theta}(x).$$

We refer to this as the Ideal Point approximation (a notion and a name that, unlike the Ideal Group, are new to this paper).

We denote the inverse functions of reverse estimators, e.g.

$$\hat{\theta}_{\text{IP}}(x)\stackrel{\text{def}}{=}\{\theta\in\Theta \mid x\in\tilde{x}_{\text{IP}}(\theta)\},$$

by the same hat notation as estimators, but stress that these are only true estimators (albeit, perhaps, multi-valued) if the reverse estimator is a surjection.

### II-D The Neyman-Scott problem

###### Definition 1.

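The Neyman-Scott problem [19] is the problem of jointly estimating the tuple $(\sigma^2,\mu_1,\ldots,\mu_N)$ after observing $(x_{nj} : n=1,\ldots,N;\ j=1,\ldots,J)$, each element of which is independently distributed $x_{nj}\sim N(\mu_n,\sigma^2)$.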

It is assumed that $J\ge 2$.

Let

$$m_n\stackrel{\text{def}}{=}\frac{\sum_{j=1}^{J}x_{nj}}{J}$$

and

$$s^2\stackrel{\text{def}}{=}\frac{\sum_{n=1}^{N}\sum_{j=1}^{J}(x_{nj}-m_n)^2}{NJ}.$$

Also, let $m$ be the vector $(m_1,\ldots,m_N)$ and $\mu$ be the vector $(\mu_1,\ldots,\mu_N)$.

The interesting case for Neyman-Scott is to observe the behaviour of the estimate for $\sigma^2$ when this estimate is part of the larger joint estimation problem, while taking $N$ to infinity and fixing $J$.

This set-up creates an inconsistent posterior [25], a situation where even with unlimited data, the uncertainty regarding the true value of $\mu$ remains high, even though the value of $\sigma^2$ is known with high confidence. Because of this, many of the popular estimation methods fail to return a consistent estimate for $\sigma^2$ in this scenario. Maximum Likelihood, as a case in point, returns the estimate $\hat{\sigma}^2=s^2$, rather than the consistent $\frac{J}{J-1}s^2$.
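The inconsistency of the maximum likelihood estimate can be observed directly in simulation. The sketch below is an illustration under assumed settings ($J=2$, $\sigma=3$, and group means drawn uniformly from $[-10,10]$ purely to generate data; none of these values come from the paper), and the function name `neyman_scott_s2` is hypothetical. It computes $s^2$ as defined above and compares it with the rescaled value $\frac{J}{J-1}s^2$.

```python
import random

def neyman_scott_s2(N, J, sigma, rng):
    """Simulate N groups of J observations and return the statistic s^2."""
    total = 0.0
    for _ in range(N):
        mu_n = rng.uniform(-10.0, 10.0)       # arbitrary per-group mean (assumption)
        xs = [rng.gauss(mu_n, sigma) for _ in range(J)]
        m_n = sum(xs) / J                     # group mean m_n
        total += sum((x - m_n) ** 2 for x in xs)
    return total / (N * J)

rng = random.Random(0)
J, sigma = 2, 3.0
s2 = neyman_scott_s2(20000, J, sigma, rng)

ml_estimate = s2                              # maximum likelihood estimate of sigma^2
corrected = J / (J - 1) * s2                  # rescaled estimate

print("true sigma^2      :", sigma ** 2)
print("ML estimate s^2   :", ml_estimate)
print("J/(J-1) * s^2     :", corrected)
```

With $J$ fixed, increasing $N$ tightens $s^2$ around $\frac{J-1}{J}\sigma^2$ rather than around $\sigma^2$, which is the behaviour the text describes.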

MML’s success on the Neyman-Scott problem has made it an oft-cited showcase for the power of this method. In [18] alone, eight entire sections (4.2 – 4.9) are devoted to it. As another example, [24], using the Neyman-Scott problem as a key example, writes about the family of MML-based estimates that these are likely to be unique in being the only estimates that are both statistically invariant to representation and statistically consistent even for estimation problems where, as in the Neyman-Scott problem, the joint posterior does not fully converge.

It is in this context that our finding that MML is, in fact, no better than ML for Neyman-Scott becomes highly significant for MML at large.

## III Analysis of the Ideal Group approximation

Although [18] discusses the Neyman-Scott problem at great length, the actual analysis of the Ideal Group estimate for it (ibid., Section 4.3) is brief enough to be quoted here in full.

Given the Uniform prior on $\mu$ and the scale free prior on $\sigma$, we do not need to explore the details of an ideal group with estimate $(\mu,\sigma)$. It is sufficient to realise that the only quantity which can give scale to the dimensions of the group in -space is the Standard Deviation $\sigma$. All dimensions of the data space, viz., are commensurate with $\sigma$. Hence, for some and , the shape of the ideal group is independent of $\mu$ and $\sigma$, and its volume is independent of $\mu$ but varies with $\sigma$ as . Since the marginal data density varies as , the coding probability , which is the integral of over the group, must vary as . The Ideal Group estimate for data obviously has , and the estimate of $\sigma$ is found by maximizing as

$$\hat{\sigma}^2_{\text{IG}}=Js^2/(J-1).$$

Unfortunately, the argument of [18] is incorrect. For the shape of the ideal group to be independent of and it is not enough for one to be translation invariant and for the other to be scale invariant. The solution of (3) is only scale and translation independent if , as a single unit, is simultaneously both scale and translation invariant.

###### Definition 2.

An inference problem will be called scale free if for some parameterization of and , both of which are assumed to be vector spaces, it is true that for every , , and ,

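$$\int_{\Omega}h(\theta)\,d\theta=\int_{\gamma\Omega}h(\theta)\,d\theta \tag{4}$$

and

$$\int_{A}f(x|\theta)\,dx=\int_{\gamma A}f(x|\gamma\theta)\,dx, \tag{5}$$

where “$\gamma\Omega$” and “$\gamma A$” refer to the set notation

$$\gamma S\stackrel{\text{def}}{=}\{\gamma x \mid x\in S\}.$$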

Translation independence can be defined analogously.

Notably, in our problem, for (5) to hold, i.e. for the shape of the likelihood distribution not to change when switching scales, the scale change must be not only in $\sigma$ but also in $\mu$. The prior advocated in [18] (which we refer to as “the Wallace prior”), does not satisfy (4), because the change from integrating over $\Omega$ to integrating over $\gamma\Omega$ increases the area of integration by a factor of $\gamma^{N+1}$, where the exponent is simply the dimension of the parameter space: one dimension for $\sigma$ and $N$ for the $\mu_n$ parameters.

The only prior which satisfies the claims of [18] regarding the shape of the ideal group is therefore $h(\sigma,\mu)\propto 1/\sigma^{N+1}$. We will refer to it as the scale free prior, and call the Neyman-Scott problem under it the scale free Neyman-Scott problem.
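Condition (4) can be checked in closed form on a box-shaped region. The sketch below is illustrative only: the box bounds, $\gamma=2$, $N=3$ and the function names `prior_mass` and `scaling_ratio` are all assumptions. It integrates a prior density of the form $\sigma^{-p}$ (uniform in each $\mu_n$) over a box $\Omega$ and over $\gamma\Omega$, with $p=1$ standing for a prior that is $1/\sigma$ in $\sigma$ and uniform in $\mu$, and $p=N+1$ for the $1/\sigma^{N+1}$ density appearing in equation (8)'s marginal.

```python
import math

def prior_mass(power, sigma_lo, sigma_hi, mu_lo, mu_hi, N):
    """Integral of sigma**(-power) over sigma in [sigma_lo, sigma_hi]
    and each of the N means in [mu_lo, mu_hi]."""
    if power == 1:
        sigma_part = math.log(sigma_hi / sigma_lo)
    else:
        e = 1 - power
        sigma_part = (sigma_hi ** e - sigma_lo ** e) / e
    return sigma_part * (mu_hi - mu_lo) ** N

def scaling_ratio(power, gamma, N, box=(1.0, 2.0, -1.0, 3.0)):
    """Mass of gamma*Omega divided by mass of Omega, for a box Omega."""
    s_lo, s_hi, m_lo, m_hi = box
    before = prior_mass(power, s_lo, s_hi, m_lo, m_hi, N)
    after = prior_mass(power, gamma * s_lo, gamma * s_hi,
                       gamma * m_lo, gamma * m_hi, N)
    return after / before

N, gamma = 3, 2.0
print("1/sigma prior      :", scaling_ratio(1, gamma, N))      # gamma**N
print("1/sigma^(N+1) prior:", scaling_ratio(N + 1, gamma, N))  # 1
```

Under the $1/\sigma$ density the mass grows by $\gamma^{N}$ (the $\gamma^{N+1}$ growth in volume is offset by only one factor of $\gamma$), whereas the $1/\sigma^{N+1}$ density leaves the mass unchanged, as (4) requires.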

Both priors have an improper scale free distribution on $\sigma$ and both have an improper uniform distribution on $\mu$ given $\sigma$, but in order to attain scale freedom, one must relinquish the idea that $\sigma$ and $\mu$ are independent: in the scale free prior, the $\mu_n$ are individually scale free, whereas in the Wallace prior they are individually uniformly distributed.

The original proof of [18] is therefore incorrect. Its claim that the ideal group approximation is consistent for the Wallace prior is, however, true. We present here an alternative proof for this, which works in the native observation space and utilises the concept of the ideal point.

###### Theorem 1.

The Ideal Group MML reverse estimator is consistent for the Neyman-Scott problem under the Wallace prior.

###### Proof.

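In the observation space, the probability density of a given set of observations, $x$, given a particular choice of $\sigma^2$ and $\mu$, is

$$f(x|\sigma^2,\mu)=\frac{1}{(\sqrt{2\pi}\sigma)^{NJ}}\exp\left(-\frac{\sum_{n=1}^{N}\sum_{j=1}^{J}(x_{nj}-\mu_n)^2}{2\sigma^2}\right). \tag{6}$$

Under the Wallace prior, this results in the marginal probability density of the observations being

$$r(x)=\int_{0}^{\infty}\frac{1}{\sigma}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}f(x|\sigma^2,\mu)\,d\mu_1\cdots d\mu_N\,d\sigma=\frac{1}{2}J^{-N/2}\pi^{-\frac{N(J-1)}{2}}(NJs^2)^{-\frac{N(J-1)}{2}}\,\Gamma\!\left(\frac{N(J-1)}{2}\right). \tag{7}$$

Note that $(s^2,m)$ is a sufficient statistic for this problem, because both $f(x|\sigma^2,\mu)$ and $r(x)$ can be calculated based on it, where for $f(x|\sigma^2,\mu)$ we use the relation

$$\sum_{n=1}^{N}\sum_{j=1}^{J}(x_{nj}-\mu_n)^2=NJs^2+J\sum_{n=1}^{N}(m_n-\mu_n)^2.$$

For this reason (following [18]), we can present the equations above solely in terms of $(s^2,m)$.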
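As a sanity check on the closed form in (7), the sketch below compares it with a numerical evaluation. It assumes that the integrals over $\mu_1,\ldots,\mu_N$ have already been carried out analytically (each is a standard Gaussian integral contributing a factor $\sqrt{2\pi\sigma^2/J}$), so that only the one-dimensional integral over $\sigma$ remains. That remaining integral is evaluated with the trapezoid rule after the substitution $\sigma=e^{t}$. The function names, integration limits and the test values $N=3$, $J=2$, $s^2=1.5$ are illustrative assumptions.

```python
import math

def marginal_closed_form(N, J, s2):
    """Right-hand side of (7)."""
    k = N * (J - 1)
    return (0.5 * J ** (-N / 2) * math.pi ** (-k / 2)
            * (N * J * s2) ** (-k / 2) * math.gamma(k / 2))

def marginal_numeric(N, J, s2, t_lo=-6.0, t_hi=16.0, steps=22000):
    """(2*pi)^(-k/2) * J^(-N/2) * integral of sigma^(-k-1) * exp(-N*J*s2 / (2*sigma^2)),
    computed over t = log(sigma) with the trapezoid rule."""
    k = N * (J - 1)
    a = N * J * s2 / 2
    dt = (t_hi - t_lo) / steps
    total = 0.0
    for i in range(steps + 1):
        t = t_lo + i * dt
        # sigma^(-k-1) d(sigma) becomes exp(-k*t) dt under sigma = exp(t).
        g = math.exp(-k * t - a * math.exp(-2 * t))
        total += g if 0 < i < steps else 0.5 * g
    return (2 * math.pi) ** (-k / 2) * J ** (-N / 2) * total * dt

closed = marginal_closed_form(3, 2, 1.5)
numeric = marginal_numeric(3, 2, 1.5)
print(closed, numeric)
```

Changing the integrand's exponent from $-k-1$ to $-NJ-1$ would give the analogous check for a $1/\sigma^{N+1}$ prior.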

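Substituting now (6) and (7) into (2), we get

$$\begin{aligned}R_{(\sigma^2,\mu)}(x)={}&-\frac{N}{2}\log(J)+\frac{NJ-2}{2}\log 2+\frac{N}{2}\log\pi+\log\left(\Gamma\!\left(\frac{N(J-1)}{2}\right)\right)\\&+\frac{NJ}{2}\log(\sigma^2)+\frac{NJs^2}{2\sigma^2}+\frac{J}{2\sigma^2}\sum_{n}(m_n-\mu_n)^2-\frac{N(J-1)}{2}\log(NJs^2).\end{aligned}$$

Recall that the ideal point is defined as the $x$ value minimising $R_{\theta}(x)$, and for this reason guarantees that the ideal group for $\theta$ necessarily includes it.

Differentiating $R_{(\sigma^2,\mu)}(x)$ according to $s^2$ and according to each $m_n$ we reach the desired

$$\hat{\sigma}^2_{\text{IP}}(x)=\frac{J}{J-1}s^2,\qquad\hat{\mu}_{n\,\text{IP}}(x)=m_n.$$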

Unfortunately, Theorem 1 does not hold for the scale free prior.

###### Theorem 2.

The Ideal Group MML reverse estimator is not consistent for the scale free Neyman-Scott problem. In particular, it contains for $x$ the point $(\sigma^2,\mu)=(s^2,m)$, which is the (inconsistent) maximum likelihood estimate, as the Ideal Point.

The proof is essentially the same as above, but substituting in the scale free prior instead of the Wallace prior.

###### Proof.

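We begin by recalculating $r(x)$ and $R_{(\sigma^2,\mu)}(x)$ given the new prior.

$$r(x)=\int_{0}^{\infty}\frac{1}{\sigma^{N+1}}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}f(x|\sigma^2,\mu)\,d\mu_1\cdots d\mu_N\,d\sigma=2^{N/2-1}J^{-N/2}\pi^{-\frac{N(J-1)}{2}}(NJs^2)^{-\frac{NJ}{2}}\,\Gamma\!\left(\frac{NJ}{2}\right).$$

$$\begin{aligned}R_{(\sigma^2,\mu)}(x)={}&-\frac{N}{2}\log J+\frac{NJ+N-2}{2}\log 2+\frac{N}{2}\log\pi+\log\left(\Gamma\!\left(\frac{NJ}{2}\right)\right)\\&+\frac{NJ}{2}\log(\sigma^2)+\frac{NJs^2}{2\sigma^2}+\frac{J}{2\sigma^2}\sum_{n}(m_n-\mu_n)^2-\frac{NJ}{2}\log(NJs^2).\end{aligned} \tag{8}$$

Following the same argument as before, the Ideal Point estimator now becomes

$$\hat{\sigma}^2_{\text{IP}}(x)=s^2,\qquad\hat{\mu}_{n\,\text{IP}}(x)=m_n,$$

which is identical to the maximum likelihood estimate, and well known to be inconsistent. ∎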
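The effect of the prior on the ideal point can be reproduced numerically. The sketch below is a hedged illustration: it keeps only the terms of $R_{(\sigma^2,\mu)}(x)$ that depend on $s^2$ (with $m=\mu$, which zeroes the sum over $n$), and minimises over $s^2$ by ternary search, which is valid because the retained expression is convex in $s^2$. The function names, the search interval and the values $N=5$, $J=2$, $\sigma^2=4$ are assumptions for illustration.

```python
import math

def r_theta(s2, sigma2, N, J, scale_free):
    """Terms of R that depend on s^2 or sigma^2, evaluated at m = mu.
    The coefficient of log(N*J*s^2) is N*J/2 for the scale free prior
    and N*(J-1)/2 for the Wallace prior."""
    c = N * J / 2 if scale_free else N * (J - 1) / 2
    return (N * J / 2 * math.log(sigma2)
            + N * J * s2 / (2 * sigma2)
            - c * math.log(N * J * s2))

def ideal_point_s2(sigma2, N, J, scale_free, lo=1e-9, hi=1e3, iters=200):
    """Ternary search for the s^2 minimising r_theta at fixed sigma^2."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if r_theta(a, sigma2, N, J, scale_free) < r_theta(b, sigma2, N, J, scale_free):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2

N, J, sigma2 = 5, 2, 4.0
s2_wallace = ideal_point_s2(sigma2, N, J, scale_free=False)
s2_scale_free = ideal_point_s2(sigma2, N, J, scale_free=True)

print("Wallace prior   : s^2 =", s2_wallace, "-> sigma^2 = J/(J-1) * s^2")
print("scale free prior: s^2 =", s2_scale_free, "-> sigma^2 = s^2")
```

Inverting the map from $\sigma^2$ to the minimising $s^2$ recovers the two ideal point formulas for $\hat{\sigma}^2_{\text{IP}}$ given in the text.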

We remark that much as we were able to switch the Ideal Group approximation from being consistent to being inconsistent by the change of prior, we can do the same while keeping the prior but making a slight change in the likelihoods: instead of using

$$x_{nj}\sim N(\mu_n,\sigma^2)$$

as in the standard Neyman-Scott set-up, one can make the Ideal Group approximation inconsistent by switching to

$$x_{nj}\sim N(\mu_n\sigma,\sigma^2).$$

The reason for this is that the new problem is the same as the Neyman-Scott problem under the scale free prior, except for a change of parameters: what was $\mu_n$ before is now $\mu_n\sigma$. All MML methods discussed are invariant to such re-parameterization.

This demonstrates an important point: the prior is in no way pathological, nor can it be blamed for the inconsistency. The same inconsistency is equally reproducible with the Wallace prior.

## IV SMML analysis

### IV-A Some special types of inference problems

The Neyman-Scott problem satisfies many good properties that enable our analysis, but which are not unique to it. We enumerate them here.

We begin by defining transitivity, a property that generalises the notion of scale freedom.

###### Definition 3.

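An automorphism for an estimation problem $(\boldsymbol{x},\boldsymbol{\theta})$, with $\boldsymbol{x}\in X$ and $\boldsymbol{\theta}\in\Theta$, is a pair of continuous bijections, $U:X\to X$ and $T:\Theta\to\Theta$, such that

1. For every $A\subseteq X$,

$$\operatorname{ScaledProb}(\boldsymbol{x}\in A)=\operatorname{ScaledProb}(\boldsymbol{x}\in U(A)), \tag{9}$$

and

2. For every $A\subseteq X$ and every $\theta\in\Theta$,

$$\operatorname{Prob}(\boldsymbol{x}\in A|\theta)=\operatorname{Prob}(\boldsymbol{x}\in U(A)|T(\theta)), \tag{10}$$

where $U(A)\stackrel{\text{def}}{=}\{U(x) \mid x\in A\}$.

For reasons of mathematical convenience, we assume that $U$ and $T$ are such that the Jacobians of these bijections, $\frac{dU(x)}{dx}$ and $\frac{dT(\theta)}{d\theta}$, are defined everywhere, and their determinants, $\left|\frac{dU(x)}{dx}\right|$ and $\left|\frac{dT(\theta)}{d\theta}\right|$, are positive everywhere. This allows us, for example, to restate condition (9) as

$$r(x)=r(U(x))\left|\frac{dU(x)}{dx}\right|,$$

and condition (10) as

$$f(x|\theta)=f(U(x)|T(\theta))\left|\frac{dU(x)}{dx}\right|.$$

An estimation problem will be called observation transitive if for every $x_1,x_2\in X$ there is an automorphism $(U,T)$ for which $U(x_1)=x_2$.

An estimation problem will be called parameter transitive if for every $\theta_1,\theta_2\in\Theta$ there is an automorphism $(U,T)$ for which $T(\theta_1)=\theta_2$.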

An estimation problem will be called transitive if it is both observation transitive and parameter transitive.

###### Lemma 2.1.

The scale free Neyman-Scott problem with fixed $N$ and $J$ and with observable parameters $(s,m)$ is transitive.

###### Proof.

Transitivity of the Neyman-Scott problem stems from its scale- and translation-invariance: Consider and with .

It is straightforward to verify that is an automorphism. Furthermore, for any and it is straightforward to find parameters and that would map to , and similarly for and . ∎

Transitivity implies other good properties. Define

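$$R^{*}_{\theta}\stackrel{\text{def}}{=}\min_{x\in X}R_{\theta}(x).$$

###### Definition 4.

An estimation problem $(\boldsymbol{x},\boldsymbol{\theta})$, with $\boldsymbol{x}\in X$ and $\boldsymbol{\theta}\in\Theta$, will be called homogeneous if the value of $R^{*}_{\theta}$ is a constant, $R^{*}$, for all $\theta\in\Theta$.

###### Lemma 2.2.

Every parameter-transitive estimation problem is homogeneous.

More generally, for any $\theta_1,\theta_2\in\Theta$, if there exists an automorphism $(U,T)$ such that $T(\theta_1)=\theta_2$, then $R^{*}_{\theta_1}=R^{*}_{\theta_2}$.

###### Proof.

Assume to the contrary that for some such $\theta_1,\theta_2$, the inequality $R^{*}_{\theta_1}>R^{*}_{\theta_2}$ holds.

Let $(U,T)$ be an automorphism on $(\boldsymbol{x},\boldsymbol{\theta})$ such that $T(\theta_1)=\theta_2$, and let $x$ be a value such that $R_{\theta_2}$ attains its minimum at $U(x)$.

By definition,

$$R^{*}_{\theta_1}\le R_{\theta_1}(x)=\log\left(\frac{r(x)}{f(x|\theta_1)}\right)=\log\left(\frac{r(U(x))\left|\frac{dU(x)}{dx}\right|}{f(U(x)|T(\theta_1))\left|\frac{dU(x)}{dx}\right|}\right)=R_{\theta_2}(U(x))=R^{*}_{\theta_2},$$

contradicting the assumption.

The option $R^{*}_{\theta_1}<R^{*}_{\theta_2}$ also cannot hold, because $(U^{-1},T^{-1})$ is also an automorphism, this one mapping $\theta_2$ to $\theta_1$. ∎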
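For the Neyman-Scott problem, homogeneity can be examined numerically by evaluating the minimum of $R_{(\sigma^2,\mu)}(x)$ at different values of $\sigma^2$. The sketch below is an illustration rather than part of the proof: it plugs the ideal point minimisers stated earlier ($s^2=\sigma^2$ under the scale free prior, $s^2=\frac{J-1}{J}\sigma^2$ under the Wallace prior, with $m=\mu$) into the $\sigma^2$-dependent terms of $R$, dropping additive constants. The function name `r_star` and the values $N=5$, $J=2$ are assumptions.

```python
import math

def r_star(sigma2, N, J, scale_free):
    """Minimum of R over (s^2, m) at a given sigma^2, up to additive
    constants that do not depend on sigma^2."""
    if scale_free:
        c, s2 = N * J / 2, sigma2                      # minimiser s^2 = sigma^2
    else:
        c, s2 = N * (J - 1) / 2, (J - 1) / J * sigma2  # minimiser s^2 = (J-1)/J * sigma^2
    return (N * J / 2 * math.log(sigma2)
            + N * J * s2 / (2 * sigma2)
            - c * math.log(N * J * s2))

N, J = 5, 2
gap_scale_free = r_star(9.0, N, J, True) - r_star(1.0, N, J, True)
gap_wallace = r_star(9.0, N, J, False) - r_star(1.0, N, J, False)

print("scale free prior, R*(9) - R*(1):", gap_scale_free)  # approximately 0
print("Wallace prior,    R*(9) - R*(1):", gap_wallace)     # (N/2) * log(9)
```

Under these assumptions the minimum is independent of $\sigma^2$ for the scale free prior, whereas under the Wallace prior it shifts by $\frac{N}{2}\log\sigma^2$, so the homogeneity property is specific to the scale free variant.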

Similarly:

###### Definition 5.

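An estimation problem $(\boldsymbol{x},\boldsymbol{\theta})$, with $\boldsymbol{x}\in X$ and $\boldsymbol{\theta}\in\Theta$, will be called comprehensive if the value of

$$R_{\text{opt}}(x)\stackrel{\text{def}}{=}\min_{\theta\in\Theta}R_{\theta}(x)$$

is a constant, $R_{\text{opt}}$, for all $x\in X$.

###### Lemma 2.3.

Every observation-transitive estimation problem is comprehensive.

More generally, for any $x_1,x_2\in X$ for which there exists an automorphism $(U,T)$ such that $U(x_1)=x_2$, $R_{\text{opt}}(x_1)=R_{\text{opt}}(x_2)$.

###### Proof.

The proof is identical to the proof of Lemma 2.2, except that instead of choosing $(U,T)$ such that $T(\theta_1)=\theta_2$ we now choose an automorphism such that $U(x_1)=x_2$, and instead of choosing $x$ such that $R_{\theta_2}$ attains its minimum at $U(x)$, we choose $\theta$ such that $R_{\theta'}(x_2)$ attains its minimum over all $\theta'\in\Theta$ at $T(\theta)$. ∎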

Another good property, and one that one would expect of a typical, natural problem, is concentration.

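Define for every $\epsilon>0$,

$$\tilde{x}_{\epsilon}(\theta)\stackrel{\text{def}}{=}\{x\in X \mid R_{\theta}(x)-R^{*}_{\theta}<\epsilon\},$$

and

$$\hat{\theta}_{\epsilon}(x)\stackrel{\text{def}}{=}\{\theta\in\Theta \mid x\in\tilde{x}_{\epsilon}(\theta)\}.$$

###### Definition 6.

An estimation problem will be called concentrated if for every $x\in X$ there is an $\epsilon>0$ for which $\hat{\theta}_{\epsilon}(x)$ is a bounded set.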

###### Lemma 2.4.

The scale free Neyman-Scott problem with fixed $N$ and $J$ and with observable parameters $(s,m)$ is concentrated.

###### Proof.

The general formula for in the Neyman-Scott problem has been given in (8), and can easily be shown to be a strictly convex function of , with a unique minimum, for any . As such, is bounded for any .

Consider, now, the Neyman-Scott problem under the parameterization and . In this re-parameterization, it is easy to see that for any translation function, , is an automorphism. In particular, this means that for any ,

$$\tilde{x}_{\epsilon}(\theta_0)=\{x+\theta_0-\theta \mid x\in\tilde{x}_{\epsilon}(\theta)\}.$$

All such sets are translations of each other, having the same volume, shape and bounding box dimensions.

It follows regarding the inverse function, , that for any it maps to a set of the same volume and bounding box dimensions as each , albeit with an inverted shape.

In particular, it is bounded.

Being bounded under the new parameterization is tantamount to being bounded under the native problem parameterization. ∎

The last good property we wish to mention regarding the scale free Neyman-Scott problem is the following.

###### Definition 7.

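An estimation problem will be called local if there exist values $V_0$ and $\gamma>1$ such that for every $\theta\in\Theta$ there exist $\theta_1,\ldots,\theta_k\in\Theta$, such that for all $x$ outside a subset of $X$ of total scaled probability at most $V_0$,

$$\gamma k f(x|\theta)<\max_{i\in\{1,\ldots,k\}}f(x|\theta_i). \tag{11}$$

###### Lemma 2.5.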

Every proper estimation problem is local.

###### Proof.

Consider any estimation problem over a normalised (unscaled) prior, and consequently also a normalised (unscaled) marginal.

The total probability over all $x$ is, by definition, $1$, so choosing $V_0=1$ satisfies the conditions of locality. ∎

###### Lemma 2.6.

The scale free Neyman-Scott problem is local.

The proof of Lemma 2.6 is given in Appendix V.

###### Definition 8.

An estimation problem is called regular if it is observation-transitive, homogeneous, concentrated and local.

### IV-B Relating SMML to IP

We will now show that for regular problems, the IP solution can be used to draw conclusions about the SMML solution.

Our first lemma proves for a family of estimation problems that the SMML solutions to these problems do not diverge entirely, in the sense of allocating arbitrarily high (scaled) probabilities to single values. Although a basic requirement for any good estimator, no such result was previously known for SMML.

For a code-book $F$, let

$$\operatorname{region}_F(\theta)\stackrel{\text{def}}{=}\{x \mid F(x)=\theta\}$$

be known as the region of $\theta$ in $F$.

###### Lemma 2.7.

For every local estimation problem there is a $V_{\max}$ such that no SMML code-book $F$ for the problem contains any $\theta$ whose region has scaled probability greater than $V_{\max}$ in the marginal distribution of $\boldsymbol{x}$.

###### Proof.

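Let $V_0$ and $\gamma$ be as in Definition 7. Note that $V_0$ can always be increased without violating the conditions of the definition, so it can be assumed to be positive.

Assign $V_{\max}=(\beta_0^{-1}+1)V_0$ for a constant $\beta_0$ to be computed later on, and assume for contradiction that $F$ contains a $\theta$ whose region, $\operatorname{region}_F(\theta)$, has scaled probability greater than $V_{\max}$. By construction, $\operatorname{region}_F(\theta)$ contains a non-empty, positive scaled probability subregion wherein (11) is satisfied.

Let $V_b$ be the scaled probability of this subregion, and let $V_a$ be the scaled probability of the remainder of $\operatorname{region}_F(\theta)$.

Also, define $\beta=V_a/V_b$, noting that

$$\beta<\beta_0, \tag{12}$$

because, by assumption, $V_a+V_b>V_{\max}$ and $V_a\le V_0$, so

$$\beta^{-1}=\frac{V_b}{V_a}>\frac{V_{\max}}{V_0}-1=\beta_0^{-1}.$$

We will design a code-book $F'$ such that $L(F')<L(F)$, proving by contradiction that $F$ is not optimal.

Our definition of $F'$ is as follows. For all $x$ outside the subregion, $F'(x)=F(x)$. Otherwise, $F'(x)$ will be the value among $\theta_1,\ldots,\theta_k$ for which the likelihood of $x$ is maximal.

Recall that

$$L(F)-L(F')=(L_E(F)-L_E(F'))+(L_P(F)-L_P(F')).$$

Because, by construction, the subregion, of scaled probability $V_b$, satisfies that for any $x$ in it,

$$\log f(x|F'(x))-\log f(x|F(x))>\log(\gamma k),$$

we have

$$L_P(F)-L_P(F')>V_b\log(\gamma k). \tag{13}$$

On the other hand, the worst-case addition in (scaled) entropy caused by splitting the subregion into $k$ separate $\theta$ values is if each receives an equal probability. We can write this worst-case addition as

$$V_b\log k+V_a\log\left(\frac{V_a+V_b}{V_a}\right)+V_b\log\left(\frac{V_a+V_b}{V_b}\right). \tag{14}$$

This is in the case that $V_a>0$. If $V_a=0$, the expression $V_a\log\left(\frac{V_a+V_b}{V_a}\right)$ is dropped from (14). This change makes no difference in the later analysis, so we will, for convenience, assume for now that $V_a>0$.

Under the assumption $V_a>0$, we can subtract (14) from (13) to get

$$\begin{aligned}L(F)-L(F')&>V_b\log(\gamma k)-V_b\log k-V_a\log\left(\frac{V_a+V_b}{V_a}\right)-V_b\log\left(\frac{V_a+V_b}{V_b}\right)\\&=V_b\log\gamma-V_a\log\left(\frac{V_a+V_b}{V_a}\right)-V_b\log\left(\frac{V_a+V_b}{V_b}\right).\end{aligned} \tag{15}$$

To reach a contradiction, we want $L(F)-L(F')>0$. If $V_a=0$, equation (15) degenerates to $L(F)-L(F')>V_b\log\gamma>0$ for an immediate contradiction. Otherwise, contradiction is reached if

$$V_a\log\left(\frac{V_a+V_b}{V_a}\right)+V_b\log\left(\frac{V_a+V_b}{V_b}\right)\le V_b\log\gamma,$$

or equivalently if

$$\beta\log(\beta^{-1}+1)+\log(\beta+1)\le\log\gamma. \tag{16}$$

A small enough $\beta$ value can bring the left-hand side of (16) arbitrarily close to $0$, and in particular to a value lower than $\log\gamma$ for any $\gamma>1$.

By choosing a small enough $\beta_0$, we can ensure that any $\beta$ satisfying (12) will also satisfy (16), creating a contradiction and proving our claim. ∎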
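The final step of the proof relies on the left-hand side of (16) being small for small $\beta$. The sketch below, a minimal illustration with hypothetical function names and an assumed value $\gamma=1.5$, evaluates that left-hand side and uses bisection to find a threshold $\beta_0$ below which (16) holds. Bisection is justified because the derivative of $\beta\log(\beta^{-1}+1)+\log(\beta+1)$ with respect to $\beta$ simplifies to $\log(1+\beta^{-1})$, which is positive, so the expression is increasing in $\beta$.

```python
import math

def lhs16(beta):
    """Left-hand side of (16): beta*log(1/beta + 1) + log(beta + 1)."""
    return beta * math.log(1 / beta + 1) + math.log(beta + 1)

def largest_beta0(gamma, lo=1e-12, hi=1e6, iters=200):
    """Bisection (on a geometric scale) for the largest beta with
    lhs16(beta) <= log(gamma). Assumes lhs16(lo) <= log(gamma) < lhs16(hi)."""
    target = math.log(gamma)
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if lhs16(mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo

gamma = 1.5   # illustrative value (assumption)
beta0 = largest_beta0(gamma)
print("beta0 =", beta0, " lhs16(beta0) =", lhs16(beta0), " log(gamma) =", math.log(gamma))
```

Any smaller positive value would serve equally well as the $\beta_0$ of the proof; the bisection merely shows that a valid choice exists and can be computed for a given $\gamma>1$.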

Lemma 2.7 now allows us to draw a direct connection between SMML and .

###### Theorem 3.

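In every local, homogeneous estimation problem $(\boldsymbol{x},\boldsymbol{\theta})$, for every SMML code-book $F$ and for every $\epsilon>0$ there exists a $\theta_0\in\Theta$ for which the set

$$\operatorname{region}_F(\theta_0)\cap\tilde{x}_{\epsilon}(\theta_0)$$

is a set of positive scaled probability in the marginal distribution of $\boldsymbol{x}$.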

###### Proof.

Suppose to the contrary that for some , no element is mapped from a positive scaled probability of values from its respective .

Let be the set of values with positive scaled probability regions in , and let be the directed graph whose vertex set is and which contains an edge from to if the intersection

 ~xϵ/2(θ1)∩regionF(θ2)

has positive scaled probability. By assumption, has no self-loops.

Let

 V(θ)=ScaledProb(x∈regionF(θ)).

We claim that for any ,

 logV(θ2)−logV(θ1)≥ϵ/2, (17)

an immediate consequence of which is that and therefore cannot have any cycles.

To prove (17), note first that because of our assumption that all likelihoods are continuous, , for every and any choice of , has positive measure in the space of , and because of our assumption that all likelihoods are positive, a positive measure in the space of translates to a positive scaled probability. This also has the side effect that all vertices in must have an outgoing edge (because this positive scaled probability must be allocated to some edge).

Next, consider how transferring a small subset of , of size , in from to changes . Given that can be made arbitrarily small, we can consider the rate of change, rather than the magnitude of change: for to be optimal, we must have a non-negative rate of change, or else a small-enough can be used to improve . Given that is the sum of over all , by transferring probability from to , the rate of change to is .

Consider now the rate of change to . By transferring probability from , where it is outside of (and therefore by definition assigned an value of at least ) to , where it is assigned into (and therefore by definition assigned an value that is smaller than ) the difference is a reduction rate greater than .

The condition that the rate of change of is nonnegative therefore translates simply to (17), thus proving the equation’s correctness.

However, if contains no self-loops and no cycles, and every one of its vertices has an outgoing edge, then it contains arbitrarily long paths starting from any vertex. Consider any such path starting at some of length greater than , where is as in Lemma 2.7. By (17), we have that the scaled probability assigned to the value ending the path is greater than , thereby reaching a contradiction. ∎

We can now present our main theorem, formalising the connection between the SMML estimator and the ideal point.

###### Theorem 4.

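In any regular estimation problem, for every $x\in X$,

$$\hat{\theta}_{\text{SMML}}(x)\cap\hat{\theta}_{\text{IP}}(x)\ne\emptyset. \tag{18}$$

In particular, $\hat{\theta}_{\text{IP}}$ is a true estimator, in the sense that $\hat{\theta}_{\text{IP}}(x)\ne\emptyset$ for every $x\in X$.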

###### Proof.

Let be a value for which we want to prove (18).

From Theorem 3 we know that for all there exists a code-book and a for which is non-empty. Let be a value inside this intersection.

By observation transitivity, there is an automorphism such that . Let .

Let us define by . It is easy to verify that by the definition of automorphism , so is also an SMML code-book, and furthermore

 ~xϵ(θϵ)={U(x)|x∈~xϵ(θ∗)},

so .

Consider now a sequence of such for . The set is a complete metric space, by construction the reside inside the nested sets , and by our assumption that the problem is concentrated, for a small enough , is bounded. We conclude, therefore, that the sequence has a converging sub-sequence. Let be a bound for one such converging sub-sequence.

We claim that is inside both and , thus proving that their intersection is non-empty.

To show this, consider first that we know because is a continuous function, and by construction .

Lastly, for every in the sub-sequence, , so follows from the closure of the SMML estimator (which is guaranteed by definition). ∎

###### Corollary 4.1.

For the scale free Neyman-Scott problem with fixed and , and .

In particular, this is true when approaches infinity, leading SMML to be inconsistent for this problem.

###### Proof.

The IP estimator for the Neyman-Scott problem was already established in Theorem 2 to be single-valued and equal to the ML estimator. The value of the SMML estimator therefore follows from Theorem 4.

As the consistent estimator for is and not , the SMML estimator is inconsistent. ∎
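The gap between the ML estimate and a consistent one can be illustrated numerically. The following is a minimal sketch, not the paper's construction: it assumes the standard Neyman-Scott setup (N groups of J Gaussian observations, each group with its own mean and all sharing one variance), and relies on the textbook fact that the ML variance estimate divides the within-group sum of squares by NJ, whereas dividing by N(J-1) gives a consistent estimate. All function names, and the uniform range used to draw the nuisance means, are hypothetical choices made for illustration.

```python
import random


def neyman_scott_sample(N, J, sigma, rng):
    """Draw N groups of J observations; group n has its own mean mu_n."""
    data = []
    for _ in range(N):
        mu_n = rng.uniform(-10.0, 10.0)  # arbitrary nuisance mean (assumption)
        data.append([rng.gauss(mu_n, sigma) for _ in range(J)])
    return data


def within_group_ss(data):
    """Sum of squared deviations of each observation from its group mean."""
    total = 0.0
    for group in data:
        mean = sum(group) / len(group)
        total += sum((x - mean) ** 2 for x in group)
    return total


def ml_variance(data):
    """ML estimate of the shared variance: divide by N * J."""
    N, J = len(data), len(data[0])
    return within_group_ss(data) / (N * J)


def corrected_variance(data):
    """Consistent estimate of the shared variance: divide by N * (J - 1)."""
    N, J = len(data), len(data[0])
    return within_group_ss(data) / (N * (J - 1))


rng = random.Random(0)
sigma, J = 1.0, 2
data = neyman_scott_sample(20000, J, sigma, rng)
ml = ml_variance(data)
corrected = corrected_variance(data)
# With J = 2 the ML value settles near half the true variance,
# while the corrected value settles near the true variance.
print(ml, corrected)
```

Increasing N does not move the ML value towards the true variance; only increasing J shrinks the (J-1)/J factor, which is why the estimate remains biased in the limit that the corollary considers.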

At first glance, this result may seem impossible, because, as established, an SMML code-book can only encode a countable number of values. Corollary 4.2 resolves this seeming paradox.

###### Corollary 4.2.

The scale free Neyman-Scott problem with fixed and admits uncountably many distinct SMML code-books, and for every value there is a continuum of SMML estimates.

###### Proof.

Uncountably many distinct code-books can be generated by arbitrarily scaling and translating any given code-book, which, as we have seen, does not alter .

To show that for every value there are uncountably many distinct SMML estimates, recall from our proof of Lemma 2.4 that if we consider the problem in observation space and parameter space, then both scaling and translation in the original parameter space are translations under the new representation. If any belongs to a region of volume in this space that is mapped to a particular by a particular , one can create a new code-book, , which is a translation of in both and , which would still be optimal.

As long as the translation in observation-space is such that is still mapped into its original region, its associated will be the correspondingly-translated . As such, the volume of values associated with a single is at least as large as the volume of the region of (and, by observation-transitivity of the problem, at least as large as the volume of the largest region in the code-book’s partition). ∎

SMML is therefore not a point estimator for this problem at all.

### IV-C Relating IP to ML

Beyond the connections between the SMML solution and the Ideal Point approximation, there is also a direct link to the maximum likelihood estimate.

###### Theorem 5.

If is a homogeneous, comprehensive estimation problem, then .

###### Proof.

By definition,

$$\tilde{x}_{\text{IP}}(\theta)=\operatorname*{argmin}_{x\in X}R_{\theta}(x)=\left\{x\in X\,\middle|\,R_{\theta}(x)=\min_{x'\in X}R_{\theta}(x')\right\}.$$

By assumption, the estimation problem is homogeneous, so is a constant, , independent of . Substituting into the definition of and calculating the functional inverse, we get

$$\hat{\theta}_{\text{IP}}(x)=\{\theta\in\Theta\mid R_{\theta}(x)=R^{*}\}.$$

For an arbitrary choice of , let be such that . The value of is , and there certainly is no for which (or this would contradict homogeneity), so, using the notation of Definition 5, .

Thus,

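$$\begin{aligned}
\hat{\theta}_{\text{IP}}(x) &=\{\theta\in\Theta\mid R_{\theta}(x)=R^{*}\}\\
&=\operatorname*{argmin}_{\theta\in\Theta}R_{\theta}(x)\\
&=\operatorname*{argmin}_{\theta\in\Theta}\log\left(\frac{r(x)}{f(x|\theta)}\right)\\
&=\operatorname*{argmax}_{\theta\in\Theta}f(x|\theta)=\hat{\theta}_{\text{ML}}(x).
\end{aligned}$$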
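The passage from the minimisation to the maximisation can be spelled out in one more step. This is only a sketch, and it assumes that the logarithm's argument is the ratio of $r(x)$ to $f(x|\theta)$, with $r(x)$ not depending on $\theta$:

$$\operatorname*{argmin}_{\theta\in\Theta}\log\frac{r(x)}{f(x|\theta)}
=\operatorname*{argmin}_{\theta\in\Theta}\bigl[\log r(x)-\log f(x|\theta)\bigr]
=\operatorname*{argmax}_{\theta\in\Theta}\log f(x|\theta)
=\operatorname*{argmax}_{\theta\in\Theta}f(x|\theta).$$

The term $\log r(x)$ is constant in $\theta$ and so does not affect the minimiser, and the final equality holds because the logarithm is strictly increasing.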

###### Corollary 5.1.

In any regular estimation problem, for every ,

$$\hat{\theta}_{\text{SMML}}(x)\cap\hat{\theta}_{\text{ML}}(x)\neq\emptyset.$$

###### Proof.

From Lemma 2.3 we know every observation-transitive problem is comprehensive, so we can apply both Theorem 4, equating the SMML estimator with the IP one, and Theorem 5, equating the IP one with ML. ∎

The following is a list of additional results that are immediate corollaries of the above. They are given with sketched proofs.

• Contrary to the oft-cited claims of [24], the Wallace-Freeman approximation [8] is inconsistent for the scale free Neyman-Scott problem. In fact, every frequentist estimation problem for which ML is inconsistent (such as the von Mises problem) admits a prior for which the Wallace-Freeman approximation is inconsistent. This follows from the folkloric and immediate result (cf. [18], p. 412) that the Wallace-Freeman approximation coincides with ML for estimation problems whose prior is their Jeffreys prior [26, 27]. The scale free prior happens to also be a Jeffreys prior for the Neyman-Scott problem.

• Contrary to the claims of [13] and others, SMML does not satisfy internal consistency in the sense of returning the same estimate whether it is estimated jointly with or alone. The problem of estimating only is also regular, for which reason IG, IP, SMML and ML all coincide for it. The ML estimate is in this case consistent, and therefore not equal to the former estimate. The same is true also for the Wallace-Freeman approximation, as the marginalised problem also has a Jeffreys prior.

• Contrary to the claims of [28], it is not true that when SMML is applied in parallel to a large number of independent estimation problems, its predictions for each individual problem are distributed with the same mean and variance as the posterior for . Parallel estimation of multiple independent regular problems is, itself, a regular problem. Hence, the SMML estimate for each individual problem will coincide with that problem’s ML estimate, even when this is inconsistent.
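The step behind the first item in the list above, that the Wallace-Freeman approximation coincides with ML under a Jeffreys prior, can be sketched as follows. The sketch assumes the usual form of the Wallace-Freeman estimate as the maximiser of $h(\theta)f(x|\theta)/\sqrt{|I(\theta)|}$, where the symbols $h$ (the prior) and $I(\theta)$ (the Fisher information matrix) are introduced here for illustration only. If $h$ is the Jeffreys prior, then $h(\theta)\propto\sqrt{|I(\theta)|}$, and therefore

$$\hat{\theta}_{\text{WF}}(x)
=\operatorname*{argmax}_{\theta\in\Theta}\frac{h(\theta)\,f(x|\theta)}{\sqrt{|I(\theta)|}}
=\operatorname*{argmax}_{\theta\in\Theta}f(x|\theta)
=\hat{\theta}_{\text{ML}}(x).$$

The prior and the Fisher information term cancel up to a constant factor, which does not change the maximiser.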

### Appendix: Proof that Neyman-Scott is local

We prove Lemma 2.6, stating that the scale free Neyman-Scott problem is local.

###### Proof.

Set , for a value to be chosen later on. Importantly, , and all other constants introduced later on in this proof (e.g., , and ) depend solely on and and are not dependent on . As such, they are constants of the construction.

Let , and for let be the vector identical to except that its ’th element equals . Let be the vector identical to except that its ’th element equals .

For , we use all and all . Next, we pick , where is Euler’s constant.

This leaves a further values of to be assigned. To assign these, divide for each the range between and into equal-length segments, and let be the set containing the centres of these segments. We define our remaining values as

$$\Theta'=\left\{\left(\sqrt{2NJ}\,c\sigma,\mu'_{1},\ldots,\mu'_{N}\right)\,\middle|\,\forall n,\ \mu'_{n}\in\Omega_{n}\right\}.$$

We will show that, for a constant to be chosen later on, outside a subset of of total scaled probability ,

$$e^{T}f(x|\theta)<\max_{i}f(x|\theta_{i}).$$

Equivalently:

$$\max_{i}\log f(x|\theta_{i})-\log f(x|\theta)>T. \tag{19}$$

Showing this is enough to prove the lemma, because for a sufficiently large ,

$$e^{T}=(c+1)^{N}\geq c^{N}+Nc^{N-1}>c^{N}+2N+2=k+1,$$

so by choosing the conditions of Definition 7 are satisfied. (Recall that is a constant of the construction, and therefore can depend on .)
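The inequality chain above can be checked numerically for concrete values. The following minimal sketch uses the hypothetical helpers `chain_holds` and `smallest_c`, which are not part of the paper. It assumes N is at least 2, since for N = 1 the strict middle inequality reduces to 1 > 4 and cannot hold for any c.

```python
def chain_holds(c, N):
    """Check (c+1)^N >= c^N + N*c^(N-1) > c^N + 2N + 2 for integers c, N."""
    lhs = (c + 1) ** N
    mid = c ** N + N * c ** (N - 1)
    rhs = c ** N + 2 * N + 2
    return lhs >= mid > rhs


def smallest_c(N):
    """Return the smallest positive integer c for which the chain holds.

    Assumes N >= 2; for N = 1 no such c exists, so the search would not end.
    """
    if N < 2:
        raise ValueError("the chain cannot hold for N < 2")
    c = 1
    while not chain_holds(c, N):
        c += 1
    return c


for N in (2, 3, 5):
    print(N, smallest_c(N))
```

The first inequality always holds by the binomial expansion of (c+1)^N, so the search is in effect looking for the point at which N c^(N-1) exceeds 2N + 2. Once that happens it keeps holding for every larger c, which is the sense in which a sufficiently large value is enough.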

To prove (19), let us divide the problem into cases. First, let us show that this holds true for any value for which, for any , . To show this, assume without loss of generality that for a particular the equation holds true.

 logmax