Minimum Message Length (MML) is a general name for any member of the family of statistical inference methods based on the minimum message length principle, which, in turn, is closely related to the family of Minimum Description Length (MDL) estimators [1, 2, 3], but predates it. The minimum message length principle was first introduced in , and the estimator that follows the principle directly, which was first described in , is known as Strict MML (SMML).
A computationally-feasible approximation to SMML was introduced in . This is known as the Wallace-Freeman approximation (WF-MML), and is perhaps the MML variant that is in widest use.
Although not as popular as Maximum Likelihood (ML) or Maximum A Posteriori (MAP), MML still enjoys a wide following, with over 70 papers published regarding it in 2016 alone, including [9, 10, 11]. MML proponents cite for it a wide variety of attributes that make the method and its inferences in some ways superior to other methods, and claim that these benefits outweigh the method’s computational requirements, which are heavy even when using the Wallace-Freeman approximation. (One paper cites MML computation times that are over 400,000 times longer than ML computation times, despite working on a data-set of fewer than 1,000 items, using only 9 attributes and having only 3 classes. It may be the case that these times would have been reducible through optimisation, but no information regarding this is given in the paper.)
One of the many properties often attributed to MML is “consistency”. For example,  states:
SMML has been studied fairly thoroughly, and is known […] to be consistent and efficient.
Loosely speaking, an estimate is said to be consistent for a particular problem if given enough observations the estimate is guaranteed to converge to the correct parameter value. (See  for a formal definition.) Importantly, it is the property of an estimate, not of an estimator: it is given in the context of a specific estimation problem. However, in the MML literature, statements such as the one quoted above are often given without specifying a particular estimation problem. To determine how to interpret such statements, consider the following quote from :
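For reference, the textbook form of this definition, stated here in our own generic notation (not necessarily that of the cited source), is:

```latex
% Standard definition of consistency, in generic notation:
% an estimate is consistent at the true value \theta_0 if it
% converges to \theta_0 in probability as the sample grows.
\lim_{m \to \infty}
  \Pr\!\left( \left\| \hat{\theta}(x_1,\ldots,x_m) - \theta_0 \right\| > \varepsilon \right)
  = 0
\qquad \text{for every } \varepsilon > 0 .
```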
These results of inconsistency of ML and consistency of MML and marginalised maximum likelihood for an increasing number of distributions with shared parameter will remain valid for e.g. the von Mises circular distribution, a distribution where Maximum Likelihood shows even worse small sample bias than it does for the Gaussian distribution. We seek a problem of this Neyman-Scott nature (of many distributions sharing a common parameter value) for which the MML estimate remains consistent (as we know it will) but for which the marginalised Maximum Likelihood estimate is either inconsistent or not defined.
This quote explicitly states that MML is known to be consistent even on completely unspecified estimation problems (except for the fact that they involve many distributions sharing a common parameter value), and gives three specific examples: the von Mises estimation problem, the Neyman-Scott estimation problem, and problems of a “Neyman-Scott nature”.
The claim repeated in the MML literature is therefore that MML’s consistency property is universal, independent of the specific choice of estimation problem. (The title of this paper, “MML is not consistent for Neyman-Scott”, should be interpreted as the logical opposite of this claim, which is to say: it is not true that MML is consistent for a general member of the Neyman-Scott estimation problem family; it will be inconsistent for some Neyman-Scott cases.) This claim is not merely stated but also argued for. For example,  claims:
The fact that general MML codes are (by definition) optimal (i.e. have highest posterior probability) implicitly suggests that, given sufficient data, MML will converge as closely as possible to any underlying model.
and  adds:
[B]ecause the SMML estimator chooses the shortest encoding of the model and data, it must be both consistent and efficient.
MML approximations are not claimed to have such universal consistency properties. However,  calculated the Wallace-Freeman MML estimate and  calculated another MML approximation known as “Ideal Group” (IG), both working on the Neyman-Scott estimation problem, a problem on which many other estimation methods fail to produce consistent estimates.
The fact that this example can be computed using the Ideal Group approximation, which is often as intractable as SMML itself, has made the Neyman-Scott problem a touchstone in the MML literature and a consistently-given example to showcase its superior performance.
Note, however, that Neyman-Scott is a frequentist problem, defined by its likelihoods. To put any Bayesian method, including any of the MML variants discussed, to use on an estimation problem, one must also define a prior distribution for the estimated parameters. Without a specified prior, the “Neyman-Scott problem” is, from a Bayesian viewpoint, an entire family of estimation problems. Both  and  analysed the problem under a specific prior, which we shall name the Wallace prior.
As summarised in :

[T]he Wallace-Freeman MML estimator […] and the Dowe-Wallace Ideal Group (IG) estimator have both been shown to be statistically consistent for the Neyman-Scott problem.
This paper analyses the Neyman-Scott problem under another prior, which we name the scale free prior. We show that neither the statements about SMML nor those about the MML approximations are true for this instance of the Neyman-Scott problem, thus giving a counterexample to these general claims. In fact, outside of a few simple, one-dimensional cases for which ML is also consistent, SMML has not been shown to be consistent anywhere, and there is no reason to assume SMML holds any consistency properties that are superior to those of ML.
The methods developed in this paper, allowing for the first time direct, non-approximated analysis of a high-dimensional SMML solution, are general, and applicable beyond just the Neyman-Scott problem.
Table I lists the consistency properties of MML, both previously known and new to this paper.
This paper deals with the problem of statistical inference: from a set of observations, $x$, taken from $X$ (the observation space) we wish to provide a point estimate, $\hat{\theta}$, to the value, $\theta$, of a random variable, $\boldsymbol{\theta}$, drawn from $\Theta$ (parameter space). When speaking about statistical inference in general, we use the symbols introduced above. For a specific problem, such as in discussing the Neyman-Scott problem, we use problem-specific names for the variables. However, in all cases Latin characters refer to observables, Greek to unobservables that are to be estimated, boldface characters to random variables, non-boldface characters to values of said random variables, and hat-notation to estimates. Boldface is used for the observations, too, when considering the observations as random variables.
All point estimates discussed in this paper are defined using an $\operatorname{argmin}$ or an $\operatorname{argmax}$. We take these as, in general, returning sets. Nevertheless, we use “$\hat{\theta}(x) = \theta$” as shorthand for “$\theta \in \hat{\theta}(x)$”.
To be consistent with the notation of , we use $h(\theta)$ to indicate the prior and

$$r(x) = \int_\Theta f(x \mid \theta)\, h(\theta)\, \mathrm{d}\theta \tag{1}$$

as the marginal. The integral of $h$ over $\Theta$ may be $1$ (in which case it is a proper prior and the problem is a proper estimation problem), but it may also integrate to other positive values (in which case it is a scaled prior) or diverge to infinity (in which case it is an improper prior). Our analysis will reject a prior as pathological only if it does not allow computation of a marginal using (1).
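As a simple illustration (our example, not the paper’s): for a scale parameter $\sigma \in (0, \infty)$,

```latex
\int_0^\infty \mathbf{1}_{[0,1]}(\sigma)\,\mathrm{d}\sigma = 1
  \quad\text{(proper prior)}, \qquad
\int_0^\infty 2\,\mathbf{1}_{[0,1]}(\sigma)\,\mathrm{d}\sigma = 2
  \quad\text{(scaled prior)}, \qquad
\int_0^\infty \frac{\mathrm{d}\sigma}{\sigma} = \infty
  \quad\text{(improper prior)}.
```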
When speaking of events that have positive probability, we will use the notation. However, in calculating over a scaled or improper prior some probabilities will be correspondingly scaled when computed as an integral over the prior or the marginal. For these we use the notation.
For reasons of mathematical convenience, we take both the observation space, $X$, and the parameter space, $\Theta$, as complete metric spaces, and assume that priors, likelihoods, posterior probabilities and marginals are all continuous, differentiable, everywhere-positive functions. This allows us to take limits, derivatives, minima, and the like, freely, without having to prove at every step that these are well-defined and have a value.
Minimum Message Length (MML) is an inference method that attempts to codify the principle of Occam’s Razor in information-theoretic terms .
Given a piecewise-constant function , we define
The SMML estimator is usually defined as the minimiser of . However, because this minimiser may not be unique, we use the more rigorous
Functions minimising are known as SMML code-books.
II-C The Ideal Point
We introduce the notion of an “ideal point” which will be central to our analysis. This is built on an approximation for SMML known in the MML literature as Ideal Group .
The Ideal Group estimator is defined in terms of its functional inverse, mapping values to (sets of) values. We refer to such functions as reverse estimators and denote them . The Ideal Group reverse estimator is defined as
where is a threshold whose value is given in , and which is computed in a way that guarantees that the ideal group is a non-empty set for each .
Because the ideal group is always non-empty, it must include
We refer to this as the Ideal Point approximation (a notion and a name that, unlike the Ideal Group, are new to this paper).
We denote the inverse functions of reverse estimators, e.g.
by the same hat notation as estimators, but stress that these are only true estimators (albeit, perhaps, multi-valued) if the reverse estimator is a surjection.
II-D The Neyman-Scott problem
The Neyman-Scott problem  is the problem of jointly estimating the tuple $(\mu_1,\ldots,\mu_n,\sigma)$ after observing $x = (x_{ij})$, $i \in \{1,\ldots,n\}$, $j \in \{1,\ldots,N\}$, each element of which is independently distributed $x_{ij} \sim N(\mu_i, \sigma^2)$.
It is assumed that $N \geq 2$.
Let $\mu$ be the vector $(\mu_1,\ldots,\mu_n)$ and $\bar{x}$ be the vector $(\bar{x}_1,\ldots,\bar{x}_n)$ of group means, $\bar{x}_i = \frac{1}{N}\sum_j x_{ij}$.
The interesting case for Neyman-Scott is to observe the behaviour of the estimate for $\sigma$ when this estimate is part of the larger joint estimation problem, while taking $n$ to infinity and fixing $N$.
This set-up creates an inconsistent posterior, a situation where even with unlimited data the uncertainty regarding the true values of the $\mu_i$ remains high, even though the value of $\sigma$ is known with high confidence. Because of this, many of the popular estimation methods fail to return a consistent estimate for $\sigma$ in this scenario. Maximum Likelihood, as a case in point, returns an estimate of $\sigma^2$ converging to $\frac{N-1}{N}\sigma^2$, rather than to $\sigma^2$.
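The inconsistency of ML here is easy to reproduce numerically. The following sketch (our own illustration; the variable names and the choice $N = 2$ are ours, not the paper’s) estimates $\sigma^2$ by maximum likelihood on simulated Neyman-Scott data:

```python
# Simulate the Neyman-Scott setup with N = 2 observations per group and
# show that the ML variance estimate converges to ((N-1)/N) * sigma^2
# = 0.5 rather than the true sigma^2 = 1.
import random

random.seed(0)
n, N, sigma = 50_000, 2, 1.0
total = 0.0
for i in range(n):
    mu_i = random.uniform(-10, 10)              # nuisance mean of group i
    xs = [random.gauss(mu_i, sigma) for _ in range(N)]
    xbar = sum(xs) / N                          # ML estimate of mu_i
    total += sum((x - xbar) ** 2 for x in xs)

sigma2_ml = total / (n * N)                     # ML estimate of sigma^2
print(round(sigma2_ml, 2))                      # close to 0.5, not 1.0
```

With $N$ fixed at 2, adding more groups never repairs the bias: each new group brings a new nuisance mean $\mu_i$ along with its data. Replacing the divisor $nN$ with $n(N-1)$ recovers a consistent estimate.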
MML’s success on the Neyman-Scott problem has made it an oft-cited showcase for the power of this method. In  alone, eight entire sections (4.2 – 4.9) are devoted to it. As another example, , using the Neyman-Scott problem as a key example, writes that the family of MML-based estimates is likely unique in being both statistically invariant to representation and statistically consistent, even for estimation problems where, as in the Neyman-Scott problem, the joint posterior does not fully converge.
It is in this context that our finding that MML is, in fact, no better than ML for Neyman-Scott becomes highly significant for MML at large.
III Analysis of the Ideal Group approximation
Although  discusses the Neyman-Scott problem at great length, the actual analysis of the Ideal Group estimate for it (ibid., Section 4.3) is brief enough to be quoted here in full.
Given the Uniform prior on and the scale free prior on , we do not need to explore the details of an ideal group with estimate . It is sufficient to realise that the only quantity which can give scale to the dimensions of the group in
-space is the Standard Deviation. All dimensions of the data space, viz., are commensurate with .
Hence, for some and , the shape of the ideal group is independent of and , and its volume is independent of but varies with as . Since the marginal data density varies as , the coding probability , which is the integral of over the group, must vary as . The Ideal Group estimate for data obviously has , and the estimate of is found by maximizing as
Unfortunately, the argument of  is incorrect. For the shape of the ideal group to be independent of $\mu$ and $\sigma$ it is not enough for one to be translation invariant and for the other to be scale invariant. The solution of (3) is only scale and translation independent if the problem, as a single unit, is simultaneously both scale and translation invariant.
An inference problem will be called scale free if for some parameterization of $X$ and $\Theta$, both of which are assumed to be vector spaces, it is true that for every $c > 0$, $x \in X$, and $\theta \in \Theta$,
where “” and “” refer to the set notation
Translation independence can be defined analogously.
Notably, in our problem, for (5) to hold, i.e. for the shape of the likelihood distribution not to change when switching scales, the scale change must be not only in $\sigma$ but also in $\mu$. The prior advocated in  (which we refer to as “the Wallace prior”) does not satisfy (4), because the change from integrating over $(\mu, \sigma)$ to integrating over $(c\mu, c\sigma)$ increases the area of integration by a factor of $c^{n+1}$, where the exponent is simply the dimension of the parameter space: one dimension for $\sigma$ and $n$ for the $\mu$ parameters.
The only prior which satisfies the claims of  regarding the shape of the ideal group is therefore $h(\mu, \sigma) \propto \sigma^{-(n+1)}$. We will refer to it as the scale free prior, and call the Neyman-Scott problem under it the scale free Neyman-Scott problem.
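In explicit form (using our notation; the paper’s own symbols may differ), the two priors on $(\mu_1,\ldots,\mu_n,\sigma)$ can be contrasted as:

```latex
h_{\text{Wallace}}(\mu, \sigma) \propto \frac{1}{\sigma}
\qquad\text{vs.}\qquad
h_{\text{scale free}}(\mu, \sigma) \propto \frac{1}{\sigma^{\,n+1}} .
```

Under the scale change $(\mu, \sigma) \mapsto (c\mu, c\sigma)$ the volume element $\mathrm{d}\mu\,\mathrm{d}\sigma$ acquires a factor of $c^{n+1}$, which only the second prior cancels.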
Both priors have an improper scale free distribution on $\sigma$, and both have an improper uniform distribution on $\mu$ given $\sigma$, but in order to attain scale freedom, one must relinquish the idea that $\mu$ and $\sigma$ are independent: in the scale free prior, the $\mu_i$ are individually scale free, whereas in the Wallace prior they are individually uniformly distributed.
The original proof of  is therefore incorrect. Its claim that the ideal group approximation is consistent for the Wallace prior is, however, true. We present here an alternative proof of this, which works in the native observation space and utilises the concept of the ideal point.
The Ideal Group MML reverse estimator is consistent for the Neyman-Scott problem under the Wallace prior.
In the observation space, the probability density of a given set of observations, $x$, given a particular choice of $\mu$ and $\sigma$, is

$$f(x \mid \mu, \sigma) = (2\pi\sigma^2)^{-nN/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\sum_{j=1}^{N}(x_{ij}-\mu_i)^2\right).$$
Under the Wallace prior, this results in the marginal probability density of the observations being
Note that $(\bar{x}, S)$, where $S = \sum_{i,j}(x_{ij} - \bar{x}_i)^2$, is a sufficient statistic for this problem, because both the likelihood and the marginal can be calculated based on it, using the relation $\sum_{j}(x_{ij} - \mu_i)^2 = N(\bar{x}_i - \mu_i)^2 + \sum_{j}(x_{ij} - \bar{x}_i)^2$.
For this reason (following ), we can present the equations above solely in terms of this sufficient statistic.
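The sufficiency claim can also be verified numerically. In the sketch below (our own construction, not from the paper), two different data sets share the group means and the pooled sum of squared deviations, and consequently have identical likelihood at every parameter choice:

```python
# Numeric check that the Neyman-Scott likelihood depends on the data
# only through the group means and the pooled sum of squared deviations.
import math

def loglik(data, mu, sigma):
    """Log-likelihood of grouped data under x_ij ~ N(mu[i], sigma^2)."""
    ll = 0.0
    for i, group in enumerate(data):
        for x in group:
            ll += -0.5 * math.log(2 * math.pi * sigma ** 2) \
                  - (x - mu[i]) ** 2 / (2 * sigma ** 2)
    return ll

# Two different data sets sharing group means (0, 0) and pooled sum of
# squared deviations S = 2, though the per-group split of S differs.
a = [[-1.0, 1.0], [0.0, 0.0]]
b = [[0.0, 0.0], [-1.0, 1.0]]
for mu, sigma in [((0.3, -0.2), 0.7), ((1.0, -2.0), 1.5)]:
    # identical log-likelihoods despite different raw observations
    assert abs(loglik(a, mu, sigma) - loglik(b, mu, sigma)) < 1e-9
```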
Recall that the ideal point is defined as the value minimising , and for this reason guarantees that the ideal group for necessarily includes it.
Differentiating according to $\sigma$ and according to each $\mu_i$, we reach the desired
Unfortunately, Theorem 1 does not hold for the scale free prior.
The Ideal Group MML reverse estimator is not consistent for the scale free Neyman-Scott problem. In particular, for every observation it contains the (inconsistent) maximum likelihood estimate as the Ideal Point.
The proof is essentially the same as above, but substituting in the scale free prior instead of the Wallace prior.
We begin by recalculating and given the new prior.
Following the same argument as before, the Ideal Point estimator now becomes
which is identical to the maximum likelihood estimate, and well known to be inconsistent. ∎
We remark that much as we were able to switch the Ideal Group approximation from being consistent to being inconsistent by the change of prior, we can do the same while keeping the prior but making a slight change in the likelihoods: instead of using $x_{ij} \sim N(\mu_i, \sigma^2)$,
as in the standard Neyman-Scott set-up, one can make the Ideal Group approximation inconsistent by switching to $x_{ij} \sim N(\sigma\mu_i, \sigma^2)$.
The reason for this is that the new problem is the same as the Neyman-Scott problem under the scale free prior, except for a change of parameters: what was before $\mu_i$ is now $\sigma\mu_i$. All MML methods discussed are invariant to such re-parameterization.
This demonstrates an important point: the scale free prior is in no way pathological, nor can it be blamed for the inconsistency. The same inconsistency is equally reproducible with the Wallace prior.
IV SMML analysis
IV-A Some special types of inference problems
The Neyman-Scott problem satisfies many good properties that enable our analysis, but which are not unique to it. We enumerate them here.
We begin by defining transitivity, a property that generalises the notion of scale freedom.
An automorphism for an estimation problem , with and , is a pair of continuous bijections, and , such that
For every ,
For every and every ,
For reasons of mathematical convenience, we assume that and are such that the Jacobians of these bijections, and , are defined everywhere, and their determinants, and , are positive everywhere. This allows us, for example, to restate condition (9) as
and condition (10) as
An estimation problem will be called observation transitive if for every there is an automorphism for which .
An estimation problem will be called parameter transitive if for every there is an automorphism for which .
An estimation problem will be called transitive if it is both observation transitive and parameter transitive.
The scale free Neyman-Scott problem with fixed $n$ and $N$, and with observables and parameters as defined above, is transitive.
Transitivity of the Neyman-Scott problem stems from its scale- and translation-invariance: Consider joint scalings and translations of observations and parameters, with scale $c > 0$ and translation $t$.
It is straightforward to verify that such a pair of maps is an automorphism. Furthermore, for any two observations it is straightforward to find a scale $c$ and translation $t$ that would map one to the other, and similarly for any two parameter values. ∎
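The scale-and-translation equivariance underlying this lemma can be checked numerically. The sketch below (our construction; the values of $c$, $t$ and the toy data are arbitrary) verifies that transforming both data and parameters, and correcting by the Jacobian factor $c^{nN}$, leaves the likelihood unchanged:

```python
# Numeric check of scale-and-translate equivariance for the
# Neyman-Scott likelihood: f(c*x + t | c*mu + t, c*sigma) * c^(nN)
# equals f(x | mu, sigma).
import math

def density(data, mu, sigma):
    """Joint density of grouped data under x_ij ~ N(mu[i], sigma^2)."""
    p = 1.0
    for i, group in enumerate(data):
        for x in group:
            p *= math.exp(-(x - mu[i]) ** 2 / (2 * sigma ** 2)) \
                 / math.sqrt(2 * math.pi * sigma ** 2)
    return p

data = [[0.3, -1.2], [2.0, 0.4]]        # n = 2 groups, N = 2 each
mu, sigma = (0.1, 1.0), 0.8
c, t = 2.5, (1.0, -3.0)                 # scale and per-group translation
nN = sum(len(g) for g in data)

gx = [[c * x + t[i] for x in g] for i, g in enumerate(data)]
gmu = tuple(c * m + t[i] for i, m in enumerate(mu))

lhs = density(gx, gmu, c * sigma) * c ** nN   # Jacobian factor c^(nN)
rhs = density(data, mu, sigma)
assert abs(lhs - rhs) < 1e-12
```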
Transitivity implies other good properties. Define
An estimation problem , with and , will be called homogeneous if the value of is a constant, , for all .
Every parameter-transitive estimation problem is homogeneous.
More generally, for any , if there exists an automorphism such that , then .
Assume to the contrary that for some such , the inequality holds.
Let be an automorphism on such that , and let be a value such that attains its minimum at .
contradicting the assumption.
The option also cannot hold, because is also an automorphism, this one mapping to . ∎
An estimation problem , with and , will be called comprehensive if the value of
is a constant, , for all .
Every observation-transitive estimation problem is comprehensive.
More generally, for any for which there exists an automorphism such that , .
The proof is identical to the proof of Lemma 2.2, except that instead of choosing such that we now choose an automorphism such that , and instead of choosing such that attains its minimum at , we choose such that attains its minimum over all at . ∎
Another good property, one that we would expect of any typical, natural problem, is concentration.
Define for every ,
An estimation problem will be called concentrated if for every there is an for which is a bounded set.
The scale free Neyman-Scott problem with fixed $n$ and $N$, and with observables and parameters as defined above, is concentrated.
The general formula for in the Neyman-Scott problem has been given in (8), and can easily be shown to be a strictly convex function of , with a unique minimum, for any . As such, is bounded for any .
Consider, now, the Neyman-Scott problem under the parameterization and . In this re-parameterization, it is easy to see that for any translation function, , is an automorphism. In particular, this means that for any ,
All such sets are translations of each other, having the same volume, shape and bounding box dimensions.
It follows regarding the inverse function, , that for any it maps to a set of the same volume and bounding box dimensions as each , albeit with an inverted shape.
In particular, it is bounded.
Being bounded under the new parameterization is tantamount to being bounded under the native problem parameterization. ∎
The last good property we wish to mention regarding the scale free Neyman-Scott problem is the following.
An estimation problem will be called local if there exist values and such that for every there exist , such that for all outside a subset of of total scaled probability at most ,
Every proper estimation problem is local.
Consider any estimation problem over a normalised (unscaled) prior, and consequently also a normalised (unscaled) marginal.
The total probability over all is, by definition, , so choosing satisfies the conditions of locality. ∎
The scale free Neyman-Scott problem is local.
An estimation problem is called regular if it is observation-transitive, homogeneous, concentrated and local.
IV-B Relating SMML to IP
We will now show that for regular problems, one can infer from the IP solution to the SMML solution.
Our first lemma proves, for a family of estimation problems, that the SMML solutions to these problems do not diverge entirely, in the sense of allocating arbitrarily high (scaled) probabilities to single values. Although this is a basic requirement for any good estimator, no such result was previously known for SMML.
For a code-book , let
be known as the region of in .
For every local estimation problem there is a such that no SMML code-book for the problem contains any whose region has scaled probability greater than in the marginal distribution of .
Let and be as in Definition 7. Note that can always be increased without violating the conditions of the definition, so it can be assumed to be positive.
Assign for a constant to be computed later on, and assume for contradiction that contains a whose region, , has scaled probability greater than . By construction, contains a non-empty, positive scaled probability region wherein (11) is satisfied.
Let be the scaled probability of , and let be .
Also, define , noting that
because, by assumption, and , so
We will design a code-book such that , proving by contradiction that is not optimal.
Our definition of is as follows. For all , . Otherwise, will be the value among for which the likelihood of is maximal.
Because, by construction, the set , of scaled probability , satisfies that for any ,
On the other hand, the worst-case addition in (scaled) entropy caused by splitting the set into separate values is if each receives an equal probability. We can write this worst-case addition as
This is in the case that . If , the expression is dropped from (14). This change makes no difference in the later analysis, so we will, for convenience, assume for now that .
To reach a contradiction, we want . If , equation (15) degenerates to for an immediate contradiction. Otherwise, contradiction is reached if
or equivalently if
A small enough value can bring the left-hand side of (16) arbitrarily close to , and in particular to a value lower than for any .
Lemma 2.7 now allows us to draw a direct connection between SMML and .
In every local, homogeneous estimation problem , for every SMML code-book and for every there exists a for which the set
is a set of positive scaled probability in the marginal distribution of .
Suppose to the contrary that for some , no element is mapped from a positive scaled probability of values from its respective .
Let be the set of values with positive scaled probability regions in , and let be the directed graph whose vertex set is and which contains an edge from to if the intersection
has positive scaled probability. By assumption, has no self-loops.
We claim that for any ,
an immediate consequence of which is that and therefore cannot have any cycles.
To prove (17), note first that because of our assumption that all likelihoods are continuous, , for every and any choice of , has positive measure in the space of , and because of our assumption that all likelihoods are positive, a positive measure in the space of translates to a positive scaled probability. This also has the side effect that all vertices in must have an outgoing edge (because this positive scaled probability must be allocated to some edge).
Next, consider how transferring a small subset of , of size , in from to changes . Given that can be made arbitrarily small, we can consider the rate of change, rather than the magnitude of change: for to be optimal, we must have a non-negative rate of change, or else a small-enough can be used to improve . Given that is the sum of over all , by transferring probability from to , the rate of change to is .
Consider now the rate of change to . By transferring probability from , where it is outside of (and therefore by definition assigned an value of at least ) to , where it is assigned into (and therefore by definition assigned an value that is smaller than ) the difference is a reduction rate greater than .
The condition that the rate of change of is nonnegative therefore translates simply to (17), thus proving the equation’s correctness.
However, if contains no self-loops and no cycles, and every one of its vertices has an outgoing edge, then it contains arbitrarily long paths starting from any vertex. Consider any such path starting at some of length greater than , where is as in Lemma 2.7. By (17), we have that the scaled probability assigned to the value ending the path is greater than , thereby reaching a contradiction. ∎
We can now present our main theorem, formalising the connection between the SMML estimator and the ideal point.
In any regular estimation problem, for every ,
In particular, is a true estimator, in the sense that for every .
Let be a value for which we want to prove (18).
From Theorem 3 we know that for all there exists a code-book and a for which is non-empty. Let be a value inside this intersection.
By observation transitivity, there is an automorphism such that . Let .
Let us define by . It is easy to verify that by the definition of automorphism , so is also an SMML code-book, and furthermore
Consider now a sequence of such for . The set is a complete metric space, by construction the reside inside the nested sets , and by our assumption that the problem is concentrated, for a small enough , is bounded. We conclude, therefore, that the sequence has a converging sub-sequence. Let be the limit of one such converging sub-sequence.
We claim that is inside both and , thus proving that their intersection is non-empty.
To show this, consider first that we know because is a continuous function, and by construction .
Lastly, for every in the sub-sequence, , so follows from the closure of the SMML estimator (which is guaranteed by definition). ∎
For the scale free Neyman-Scott problem with fixed and , and .
In particular, this is true when approaches infinity, leading SMML to be inconsistent for this problem.
As the consistent estimator for is and not , the SMML estimator is inconsistent. ∎
At first glance, this result may seem impossible, because, as established, an SMML code-book can only encode a countable number of values. Corollary 4.2 resolves this seeming paradox.
The scale free Neyman-Scott problem with fixed $n$ and $N$ admits uncountably many distinct SMML code-books, and for every value there is a continuum of SMML estimates.
Uncountably many distinct code-books can be generated by arbitrarily scaling and translating any given code-book, which, as we have seen, does not alter .
To show that for every value there are uncountably many distinct SMML estimates, recall from our proof of Lemma 2.4 that if we consider the problem in observation space and parameter space, then both scaling and translation in the original parameter space are translations under the new representation. If any belongs to a region of volume in this space that is mapped to a particular by a particular , one can create a new code-book, , which is a translation of in both and , which would still be optimal.
As long as the translation in observation-space is such that is still mapped into its original region, its associated will be the correspondingly-translated . As such, the volume of values associated with a single is at least as large as the volume of the region of (and, by observation-transitivity of the problem, at least as large as the volume of the largest region in the code-book’s partition). ∎
SMML is therefore not a point estimator for this problem at all.
IV-C Relating IP to ML
Beyond the connections between the SMML solution and the Ideal Point approximation, there is also a direct link to the maximum likelihood estimate.
If is a homogeneous, comprehensive estimation problem, then .
By assumption, the estimation problem is homogeneous, so is a constant, , independent of . Substituting into the definition of and calculating the functional inverse, we get
For an arbitrary choice of , let be such that . The value of is , and there certainly is no for which (or this would contradict homogeneity), so, using the notation of Definition 5, .
In any regular estimation problem, for every ,
V Additional results
The following is a list of additional results that are immediate corollaries of the above. They are given with sketched proofs.
Contrary to the oft-cited claims of , the Wallace-Freeman approximation  is inconsistent for the scale free Neyman-Scott problem. In fact, every frequentist estimation problem for which ML is inconsistent (such as the von Mises problem) admits a prior for which the Wallace-Freeman approximation is inconsistent. This follows from the folkloric and immediate result (cf. , p. 412) that the Wallace-Freeman approximation coincides with ML for estimation problems whose prior is their Jeffreys prior [26, 27]. The scale free prior happens to also be a Jeffreys prior for the Neyman-Scott problem.
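A sketch of that folkloric result, in standard notation (ours): the Wallace-Freeman message length is, up to terms constant in $\theta$,

```latex
% Wallace-Freeman message length, with F(\theta) the Fisher
% information matrix of the model:
-\log h(\theta) \;-\; \log f(x \mid \theta)
  \;+\; \tfrac{1}{2}\log \det F(\theta) \;+\; \text{const}.
```

When the prior is the Jeffreys prior, $h(\theta) \propto \sqrt{\det F(\theta)}$, the first and third terms cancel up to a constant, and minimising the message length reduces to maximising $f(x \mid \theta)$, i.e. to ML.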
Contrary to the claims of  and others, SMML does not satisfy internal consistency in the sense of returning the same estimate for $\sigma$ whether it is estimated jointly with $\mu$ or alone. The problem of estimating $\sigma$ alone is also regular, for which reason IG, IP, SMML and ML all coincide for it. The ML estimate is in this case consistent, and therefore not equal to the former estimate. The same is true also for the Wallace-Freeman approximation, as the marginalised problem also has a Jeffreys prior.
Contrary to the claims of , it is not true that when SMML is applied in parallel to a large number of independent estimation problems, its predictions for each individual problem are distributed with the same mean and variance as the posterior for that problem. Parallel estimation of multiple independent regular problems is, itself, a regular problem. Hence, the SMML estimate for each individual problem will coincide with that problem’s ML estimate, even when this is inconsistent.
[Proof that Neyman-Scott is local]
We prove Lemma 2.6, stating that the scale free Neyman-Scott problem is local.
Set , for a value to be chosen later on. Importantly, , and all other constants introduced later on in this proof (e.g., , and ) depend solely on and and are not dependent on . As such, they are constants of the construction.
Let , and for let be the vector identical to except that its ’th element equals . Let be the vector identical to except that its ’th element equals .
For , we use all and all . Next, we pick , where is Euler’s constant.
This leaves a further values of to be assigned. To assign these, divide for each the range between and into equal-length segments, and let be the set containing the centres of these segments. We define our remaining values as
We will show that, for a constant to be chosen later on, outside a subset of of total scaled probability ,
Showing this is enough to prove the lemma, because for a sufficiently large ,
so by choosing the conditions of Definition 7 are satisfied. (Recall that is a constant of the construction, and therefore can depend on .)
To prove (19), let us divide the problem into cases. First, let us show that this holds true for any value for which, for any , . To show this, assume without loss of generality that for a particular the equation holds true.