1 Introduction
In neyman1948consistent, Neyman and Scott introduced a problem in consistent estimation that has since been studied extensively in many fields (see lancaster2000incidental for a review). It is known under many names, such as the problem of partial consistency (e.g., fan2005semilinear), the incidental parameter problem (e.g., graham2009incidental), one-way ANOVA (e.g., lindsay1980nuisance), the two sample normal problem (e.g., ghosh1996noninformative), or simply the Neyman-Scott problem (e.g., li2003efficiency; kamata2007multilevel), each name indicating a slightly different scoping of the problem and a slightly different emphasis.
In this paper we return to Neyman and Scott’s first and most studied example case of the phenomenon, namely the problem of consistent estimation of variance based on a fixed number of Gaussian samples.
In Bayesian statistics, this problem has repeatedly been addressed by analysis over a particular choice of prior (as in Wallace2005) or over a particular family of priors (as in Ghosh1994). Priors used include several noninformative priors (see yang1996catalog for a list), including reference priors berger1992development and the Jeffreys prior Jeffreys1946invariant; Jeffreys1998theory.
In non-Bayesian statistics, the problem has been addressed by means of conditional likelihoods, eliminating nuisance parameters by integrating over them. Analogous techniques exist also in Bayesian analysis.
In (DoweBaxterOliverWallace1998, p. 93), Dowe et al. opine that such marginalisation-based solutions are a priori unsatisfactory because they rely on the estimates for individual parameters not agreeing with their own joint estimation. The resulting estimator may therefore be consistent in the usual sense, but by definition exhibits a form of internal inconsistency.
The paper goes on to present a conjecture by Dowe that elegantly excludes all such marginalisation-based methods as well as other simple approaches to the problem by requiring (implicitly) that for a solution to the Neyman-Scott problem to be satisfactory, it must also satisfy invariance to representation. This excludes most estimators discussed in the literature as either inconsistent or not invariant. Indeed, Dowe’s most modern version of his conjecture (see (Dowe2008a, p. 539) and citations within) is that only estimators belonging to the MML family wallace1975invariant simultaneously satisfy both conditions. The two properties were demonstrated for two algorithms in the MML family in (Wallace2005, p. 201) and dowe1997resolving, both using the same reference prior.
More recently, however, brand2016neymanscott showed that neither these two algorithms nor Strict MML WallaceBoulton1968 remains consistent under the problem’s Jeffreys prior, leading to the question of whether there is any estimator that retains both properties in a general setting, i.e. under an arbitrary choice of prior.
While we will not answer here the general question of whether an estimation method can be both consistent and invariant for general estimation problems (see BrandXXXX for a discussion), we describe a novel estimation method, RKL, that is usable on general point estimation problems, belongs to the Bayes estimator family, is invariant to representation of both the observation space and parameter space, and for the Neyman-Scott problem is also consistent regardless of the choice of prior, whether proper or improper.
The method also satisfies the broader criterion of DoweBaxterOliverWallace1998, in that for the Neyman-Scott problem the same method can be applied to estimate any subset of the parameters and will provide the same estimates.
The estimator presented, RKL, is the Bayes estimator whose cost function is the Reverse Kullback-Leibler divergence. While both the Kullback-Leibler divergence (KLD) and its reverse (RKL) are well-known and much-studied functions cover2012elements and frequently used in machine learning contexts (see, e.g., nowozin2016f), including for the purpose of distribution estimation by minimisation of relative entropy muhlenbein2005estimation; basu1994minimum, their usage in the context of point estimation, where the true distribution is unknown and must be estimated, is much rarer. Dowe et al. DoweBaxterOliverWallace1998 introduce the usage of KLD in this context under the name “minEKL”, and the use of RKL in the same context is novel to this paper.
We argue that the good consistency properties exhibited by RKL on Neyman-Scott are not accidental, and describe its advantages for the purposes of consistency over alternatives such as minEKL on a wider class of problems.
We remark that despite being both invariant and consistent for the problem, RKL is unrelated to the MML family. It therefore provides further refutation of Dowe’s Dowe2008a conjecture.
2 Definitions
2.1 Point estimation
A point estimation problem lehmann2006theory is a set of likelihoods, $f_\theta(x)$, which are probability density functions over $x \in X$, indexed by a parameter, $\theta \in \Theta$. Here, $\Theta$ is known as parameter space, $X$ as observation space, and $x$ as an observation. A point estimator for such a problem is a function, $\hat\theta : X \to \Theta$, matching each possible observation to a parameter value.
For example, the well-known Maximum Likelihood estimator (MLE) is defined by
$$\hat\theta_{\mathrm{MLE}}(x) = \operatorname*{argmax}_{\theta \in \Theta} f_\theta(x).$$
Because this definition is equally applicable for many estimation problems, simply by substituting in each problem’s $\Theta$ and $f_\theta$, we say that MLE is an estimation method, rather than just an estimator.
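The distinction between an estimator and an estimation method can be sketched in code. The snippet below is only an illustration under stated assumptions: the helper names (`mle`, `gaussian_mean_likelihood`, `bernoulli_likelihood`) and the finite candidate grids are hypothetical and not part of the paper. The same `mle` routine is applied to two different problems simply by substituting each problem's likelihood and parameter space.

```python
import math

def mle(likelihood, candidates, x):
    """Generic MLE over a finite candidate grid (an approximation of the argmax)."""
    return max(candidates, key=lambda theta: likelihood(x, theta))

def gaussian_mean_likelihood(x, mu):
    """Density of the list x under i.i.d. N(mu, 1) observations."""
    return math.prod(
        math.exp(-0.5 * (xi - mu) ** 2) / math.sqrt(2.0 * math.pi) for xi in x
    )

def bernoulli_likelihood(x, p):
    """Probability of the list x of 0/1 outcomes under success probability p."""
    return math.prod(p if xi else 1.0 - p for xi in x)

# Problem 1: unknown Gaussian mean, parameter space approximated by a grid on [-3, 3].
mu_grid = [i / 100 for i in range(-300, 301)]
mu_hat = mle(gaussian_mean_likelihood, mu_grid, [0.2, 0.5, 0.8])

# Problem 2: unknown Bernoulli success probability, grid on (0, 1).
p_grid = [i / 100 for i in range(1, 100)]
p_hat = mle(bernoulli_likelihood, p_grid, [1, 0, 1, 1])
```

In both cases the grid search recovers the familiar closed-form answers (the sample mean and the sample proportion), up to the resolution of the grid.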
In Bayesian statistics, estimation problems are also endowed with a prior distribution over their parameter space, denoted by an everywhere-positive probability density function $h(\theta)$. (Footnote 1: We take priors that assign a zero probability density to any $\theta$ to be degenerate, and advocate that in this case such $\theta$ should be excluded from $\Theta$.) This makes it possible to think of $f_\theta$ and $h$ as jointly describing the joint distribution of a random variable pair $(\theta, x)$, where $h$ is the marginal distribution of $\theta$ and $f_\theta$ is the conditional distribution of $x$ given $\theta$. We therefore use $f(x|\theta)$ as a synonym for $f_\theta(x)$.
It is convenient to generalise the idea of a prior distribution by allowing priors to be improper, in the sense that
$$\int_\Theta h(\theta)\,\mathrm{d}\theta = \infty.$$
This means that $(\theta, x)$ is no longer described by a probability distribution, but rather by a general measure. When choosing an improper prior more care must be taken: for a prior to be valid, it should still be possible to compute the (equally improper) marginal distribution of $x$ by means of
$$r(x) = \int_\Theta f(x|\theta)\,h(\theta)\,\mathrm{d}\theta,$$
as without this Bayesian calculations quickly break down. We will, throughout, assume all priors to be valid.
Where $r(x) \neq 0$, we can also define the posterior distribution
$$f(\theta|x) = \frac{f(x|\theta)\,h(\theta)}{r(x)}.$$
Note that even when $h(\theta)$ and $r(x)$ are improper, $f(x|\theta)$ and $f(\theta|x)$ are both proper probability distribution functions.
Lemma 1.
In any estimation problem where is always positive, for every .
Proof.
Fix , and for any natural let .
The sequence partitions
into a countable number of parts. As a result, at least one such part has a positive prior probability.
We can now bound from below by
∎
In this paper, we will throughout be discussing estimation problems where the conditions of Lemma 1 hold, for which reason we will always assume that is positive. Coupled with the fact that is, by assumption, also always positive, this leads to positive, well defined, , positive and positive .
2.2 Consistency
In defining point estimation, we treated as a single variable. Typically, however,
is a vector. Consider, for example, an observation space
. In this case, the observation takes the form .
Typically, every in an estimation problem is defined such that individual are independent and identically distributed, but we will not require this.
For estimation methods that can estimate from every prefix , it is possible to define consistency, which is one desirable property for an estimation problem, as follows lehmann2006theory .
Definition 1 (Consistency).
Let be an estimation problem over observation space , and let be the sequence of estimation problems created by taking only as the observation.
An estimation method is said to be consistent on if for every and every neighbourhood of , if is taken from the distribution then almost surely
where the choice of estimation problem for is understood from the choice of parameter.
2.3 The Neyman-Scott problem
Definition 2.
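The Neyman-Scott problem (neyman1948consistent) is the problem of jointly estimating the tuple $(\sigma^2, \mu_1, \ldots, \mu_N)$ after observing $(x_{nj} : n = 1, \ldots, N;\ j = 1, \ldots, J)$, each element of which is independently distributed $x_{nj} \sim \mathcal{N}(\mu_n, \sigma^2)$.
It is assumed that $J \ge 2$, and for brevity we take $\mu$ to be the vector $(\mu_1, \ldots, \mu_N)$.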
The Neyman-Scott problem is a classic case study for consistency due to its partially-consistent posterior.
Loosely speaking, a posterior, i.e. the distribution of $\theta$ given the observations, is called inconsistent if even in the limit, as $N \to \infty$, there is no $\theta$ such that every neighbourhood of $\theta$ tends to total probability $1$. (See (ghosal1997review) for a formal definition.) In such a case it is clear that no estimation method can be consistent. When keeping $J$ constant and taking $N$ to infinity, the Neyman-Scott problem creates such an inconsistent posterior, because the uncertainty in the distribution of each $\mu_n$ remains high.
The problem is, however, partially consistent in that the posterior distribution for $\sigma^2$ does converge, so it is possible for an estimation method to estimate it, individually, in a consistent way.
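For example, the estimator
$$\hat\sigma^2(x) = \frac{J}{J-1}\,s^2 \qquad (1)$$
is a well-known consistent estimator for $\sigma^2$, where
$$s^2 = \frac{1}{NJ} \sum_{n=1}^{N} \sum_{j=1}^{J} (x_{nj} - m_n)^2$$
and
$$m_n = \frac{1}{J} \sum_{j=1}^{J} x_{nj}.$$
(We use $m$ to denote the vector $(m_1, \ldots, m_N)$.)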
The interesting question for Neyman-Scott is what estimation methods can be devised for the joint estimation problem, such that their estimate for $\sigma^2$, as part of the larger estimation problem, is consistent.
Famously, MLE’s estimate for $\sigma^2$ is in this scenario the uncorrected within-group sample variance, $s^2$, which is not consistent, and the same is true for the estimates of many other popular estimation methods such as Maximum A Posteriori Probability (MAP) and Minimum Expected Kullback-Leibler Distance (minEKL).
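The inconsistency of the MLE can be checked with a small simulation. The sketch below is illustrative only: the function names, the seed, the value $J = 2$ and the uniform draw of the nuisance means are all assumptions made for the example, not choices taken from the paper. It compares the within-group variance $s^2$ (the MLE) with its corrected version $s^2 J/(J-1)$.

```python
import random

def simulate_neyman_scott(N, J, sigma2, rng):
    """Draw N groups of J observations; group n has its own nuisance mean mu_n."""
    sigma = sigma2 ** 0.5
    data = []
    for _ in range(N):
        mu_n = rng.uniform(-10.0, 10.0)  # arbitrary nuisance means (assumption)
        data.append([rng.gauss(mu_n, sigma) for _ in range(J)])
    return data

def pooled_s2(data):
    """s^2: mean squared deviation of each observation from its own group mean."""
    total, count = 0.0, 0
    for group in data:
        m_n = sum(group) / len(group)
        total += sum((x - m_n) ** 2 for x in group)
        count += len(group)
    return total / count

rng = random.Random(0)
J, true_sigma2 = 2, 4.0
data = simulate_neyman_scott(20000, J, true_sigma2, rng)

s2 = pooled_s2(data)           # MLE of sigma^2; concentrates near sigma^2 * (J - 1) / J
corrected = s2 * J / (J - 1)   # bias-corrected estimator; concentrates near sigma^2
```

With $J$ held fixed, increasing $N$ does not help the MLE: each group contributes only $J - 1$ degrees of freedom, so $s^2$ converges to $\sigma^2 (J-1)/J$ rather than to $\sigma^2$.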
It is, of course, possible for an estimation method to work on each coordinate independently. An example of an estimation method that does this is Posterior Expectation (PostEx). Such methods, however, rely on a particular choice of description for the parameter space (and sometimes also for the observation space). If one were to estimate , for example, instead of , the estimates of PostEx for the same estimation problem would change substantially. PostEx may therefore be consistent for the problem, but it is not invariant to representation of and .
The question therefore arises whether it is possible to construct an estimation method that is both invariant (like MLE) and consistent (like the estimators of Ghosh1994 ), and that, moreover, unlike the estimators of dowe1997resolving ; Wallace2005 , retains these properties for all possible priors.
Typically, priors studied in the literature can be described as for some function . These are priors where values are independent and uniform given . The studied methods often break down, as in the case of dowe1997resolving ; Wallace2005 , simply by switching to another .
The RKL estimator introduced here, however, remains consistent under extremely general priors, including ones with distributions that, even given , are not uniform, not identically distributed, and not independent.
3 The RKL estimator
Definition 3.
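The Reverse Kullback-Leibler (RKL) estimator $\hat\theta_{\mathrm{RKL}}$ is a Bayes estimator, i.e. it is an estimator that can be defined as a minimiser of the conditional expectation of a loss function,
$$\hat\theta_{\mathrm{RKL}}(x) = \operatorname*{argmin}_{\hat\theta \in \Theta} \mathbf{E}\left[L(\theta, \hat\theta) \,\middle|\, x\right],$$
where for RKL the function $L$ is defined by
$$L(\theta, \hat\theta) = D_{\mathrm{KL}}(f_{\hat\theta} \,\|\, f_\theta).$$
Here, $D_{\mathrm{KL}}(p \,\|\, q)$ is the Kullback-Leibler divergence (KLD) kullback1997information from $q$ to $p$,
$$D_{\mathrm{KL}}(p \,\|\, q) = \int_X p(x) \log \frac{p(x)}{q(x)}\,\mathrm{d}x.$$
Equivalently, it is the entropy of $p$ relative to $q$.
This definition looks quite similar to the definition of the standard minEKL estimator, which uses the Kullback-Leibler divergence as its loss function, except that the parameter order has been reversed. Instead of utilising $D_{\mathrm{KL}}(f_\theta \,\|\, f_{\hat\theta})$, as in the original definition of minEKL, we use $D_{\mathrm{KL}}(f_{\hat\theta} \,\|\, f_\theta)$. Because the Kullback-Leibler divergence is nonsymmetric, the result is a different estimator.
Although the Reverse Kullback-Leibler divergence is a well-known divergence liese2006divergences; nowozin2016f, it has to our knowledge never been applied as a loss function in Bayes estimation.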
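To make the asymmetry concrete, the following sketch evaluates both argument orders for a pair of univariate Gaussians, using the standard closed form of the Kullback-Leibler divergence between normal distributions. The function names and the example parameter values are hypothetical, chosen only for illustration.

```python
import math

def kl_gauss(mu_p, var_p, mu_q, var_q):
    """D_KL(N(mu_p, var_p) || N(mu_q, var_q)) in nats (standard closed form)."""
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def minekl_loss(true_params, estimate):
    """Forward order: divergence of the estimate's distribution from the true one."""
    return kl_gauss(*true_params, *estimate)

def rkl_loss(true_params, estimate):
    """Reverse order: the roles of the true value and the estimate are swapped."""
    return kl_gauss(*estimate, *true_params)

true_params = (0.0, 1.0)   # (mean, variance) of the true distribution
estimate = (0.0, 4.0)      # an estimate that overstates the variance

forward = minekl_loss(true_params, estimate)
reverse = rkl_loss(true_params, estimate)
```

For this pair, the reverse order penalises the inflated variance more heavily than the forward order does, which hints at why the two losses lead to different Bayes estimators.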
4 Invariance and consistency
In terms of invariance to representation, it is clear that RKL inherits the good properties of divergences.
Lemma 2.
RKL is invariant to representation of $\Theta$ and of $X$.
Proof.
The RKL loss function is dependent only on distributions of $x$ given a choice of $\theta$. Renaming the $\theta$ therefore does not affect it. Furthermore, the loss function is an $f$-divergence, and therefore invariant to reparameterisations of $x$ qiao2008f. ∎
More interesting is the analysis of RKL’s consistency. In this section, we analyse RKL’s consistency on Neyman-Scott. In the next section, we turn to its consistency properties in more general settings.
Theorem 1.
RKL is consistent for Neyman-Scott over any valid, nondegenerate prior.
To begin, let us describe the estimator more concretely.
Lemma 3.
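For Neyman-Scott over any valid, nondegenerate prior,
$$\hat\sigma^2_{\mathrm{RKL}}(x) = \left(\mathbf{E}\left[\frac{1}{\sigma^2} \,\middle|\, x\right]\right)^{-1}.$$
Proof.
In Neyman-Scott, each observation $x_{nj}$ is distributed independently with some variance $\sigma^2$ and some mean $\mu_n$,
$$x_{nj} \sim \mathcal{N}(\mu_n, \sigma^2).$$
The KLD between two such distributions is
$$D_{\mathrm{KL}}\left(\mathcal{N}(\hat\mu_n, \hat\sigma^2) \,\|\, \mathcal{N}(\mu_n, \sigma^2)\right) = \frac{(\hat\mu_n - \mu_n)^2}{2\sigma^2} + \frac{1}{2}\left(\frac{\hat\sigma^2}{\sigma^2} - 1 - \log\frac{\hat\sigma^2}{\sigma^2}\right).$$
Given that these observations are independent, the KLD over all observations is the sum of the KLD over the individual observations:
$$L\left((\sigma^2, \mu), (\hat\sigma^2, \hat\mu)\right) = \frac{J}{2\sigma^2} \sum_{n=1}^{N} (\hat\mu_n - \mu_n)^2 + \frac{NJ}{2}\left(\frac{\hat\sigma^2}{\sigma^2} - 1 - \log\frac{\hat\sigma^2}{\sigma^2}\right).$$
The Bayes risk associated with choosing a particular $(\hat\sigma^2, \hat\mu)$ as the estimate is therefore
$$\mathbf{E}\left[\frac{J}{2\sigma^2} \sum_{n=1}^{N} (\hat\mu_n - \mu_n)^2 + \frac{NJ}{2}\left(\frac{\hat\sigma^2}{\sigma^2} - 1 - \log\frac{\hat\sigma^2}{\sigma^2}\right) \,\middle|\, x\right].$$
In finding the combination $(\hat\sigma^2, \hat\mu)$ that minimises this risk, it is clear that the choice of $\hat\sigma^2$ and of each $\hat\mu_n$ can be made separately, as the expression can be split into additive components, each of which is dependent only on one variable.
The risk component associated with each $\hat\mu_n$ is
$$\mathbf{E}\left[\frac{J}{2\sigma^2} (\hat\mu_n - \mu_n)^2 \,\middle|\, x\right] \qquad (2)$$
$$= \frac{J}{2}\left(\hat\mu_n^2\, \mathbf{E}\left[\frac{1}{\sigma^2} \,\middle|\, x\right] - 2 \hat\mu_n\, \mathbf{E}\left[\frac{\mu_n}{\sigma^2} \,\middle|\, x\right] + \mathbf{E}\left[\frac{\mu_n^2}{\sigma^2} \,\middle|\, x\right]\right). \qquad (3)$$
More interestingly in the context of consistency, the risk component associated with $\hat\sigma^2$ is
$$\frac{NJ}{2}\, \mathbf{E}\left[\frac{\hat\sigma^2}{\sigma^2} - 1 - \log\frac{\hat\sigma^2}{\sigma^2} \,\middle|\, x\right].$$
This expression is a Bayes risk for the one-dimensional problem of estimating $\sigma^2$ from $x$ (with a specific loss function), a type of problem that is typically not difficult for Bayes estimators.
We will utilise the fact that the risk function is a linear combination of the $L(\theta, \hat\theta)$ functions, when taking these as functions of $\hat\theta$, indexed by $\theta$, and these functions, both in their complete form and when separated to components, are convex, differentiable functions with a unique minimum. We conclude that the risk function is also a convex function, and that its minimum can be found by setting its derivative to zero, which, in turn, is a linear combination of the derivatives of the individual loss functions. To solve for $\hat\sigma^2$, for example, we therefore solve the equation
$$\frac{\partial}{\partial \hat\sigma^2}\, \mathbf{E}\left[\frac{\hat\sigma^2}{\sigma^2} - 1 - \log\frac{\hat\sigma^2}{\sigma^2} \,\middle|\, x\right] = \mathbf{E}\left[\frac{1}{\sigma^2} \,\middle|\, x\right] - \frac{1}{\hat\sigma^2} = 0.$$
This leads to
$$\frac{1}{\hat\sigma^2} = \mathbf{E}\left[\frac{1}{\sigma^2} \,\middle|\, x\right].$$
So for the Neyman-Scott problem, the $\sigma^2$ portion of the RKL estimator is
$$\hat\sigma^2_{\mathrm{RKL}}(x) = \left(\mathbf{E}\left[\frac{1}{\sigma^2} \,\middle|\, x\right]\right)^{-1},$$
as required. ∎
For completeness, we remark that using the same analysis on (2) we can determine that the estimate for each $\mu_n$ is
$$\hat\mu_n = \frac{\mathbf{E}\left[\mu_n / \sigma^2 \,\middle|\, x\right]}{\mathbf{E}\left[1 / \sigma^2 \,\middle|\, x\right]}.$$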
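The separation argument above can be checked numerically. The sketch below assumes that the variance-dependent part of the reverse Kullback-Leibler loss between Gaussians is $\tfrac{1}{2}(\hat\sigma^2/\sigma^2 - 1 - \log(\hat\sigma^2/\sigma^2))$, and uses a small, entirely hypothetical discrete posterior over $\sigma^2$. It minimises the resulting risk by grid search and compares the result with the reciprocal of the posterior expectation of $1/\sigma^2$.

```python
import math

# Hypothetical discrete posterior over sigma^2: (value, posterior probability) pairs.
posterior = [(0.5, 0.2), (1.0, 0.5), (4.0, 0.3)]

def rkl_risk(var_hat):
    """Posterior expectation of the variance-dependent part of the reverse KL loss."""
    return sum(
        p * 0.5 * (var_hat / var - 1.0 - math.log(var_hat / var))
        for var, p in posterior
    )

def reciprocal_of_expected_precision():
    """Candidate closed-form minimiser: 1 / E[1 / sigma^2 | x]."""
    return 1.0 / sum(p / var for var, p in posterior)

def posterior_mean():
    """Posterior expectation of sigma^2, shown for comparison only."""
    return sum(p * var for var, p in posterior)

grid = [i / 1000 for i in range(1, 10001)]   # candidate estimates in (0, 10]
numeric_minimiser = min(grid, key=rkl_risk)
closed_form = reciprocal_of_expected_precision()
```

Under these assumptions the grid minimiser agrees with the closed form to within the grid resolution, and both differ noticeably from the posterior mean of $\sigma^2$, which the loss does not target.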
We now turn to the question of how consistent this estimator is for .
Proof of Theorem 1.
We want to calculate
(4) 
The fact that Neyman-Scott has any consistent estimators (such as, for example, the one presented in (1)) indicates that the posterior probability, given $x$, for $\sigma^2$ to be outside any neighbourhood of the real $\sigma^2$ tends to zero. For this reason, posterior expectation in any case where the estimation variable is bounded is necessarily consistent.
Here this is not the case, because $1/\sigma^2$ tends to infinity when $\sigma^2$ tends to zero. However, given that the denominator of (4) is precisely the marginal density of $x$, and therefore by assumption positive, it is enough to show that with probability $1$,
for some , to prove that all parameter options with cannot affect the expectation, and that the resulting estimator is therefore consistent.
To do this, the first step is to recall that by assumption is finite, including at , and therefore, for any ,
Let us now define
Differentiating by , we conclude that this is a monotone increasing function for every
and so in particular for any for a sufficiently large .
For a given , and , reaches its maximum at .
Furthermore, note that for , strictly decreases with , and tends to zero.
Let us now choose an value smaller than both and , and calculate
∎
5 General consistency of RKL
The results above may seem unintuitive: in designing a good loss function, $L(\theta, \hat\theta)$, for Bayes estimators, one strives to find a measure that reflects how bad it would be to return the estimate $\hat\theta$ when the correct value is $\theta$, and yet the RKL loss function, contrary to minEKL, seems to be using $\hat\theta$ as its baseline, and measures how different the distribution induced by $\theta$ would be at every point. Why should this result in a better metric?
To answer, consider the portion of the (forward) Kullback-Leibler divergence that depends on $\mu$. In the one-dimensional case, and when calculating the divergence between two normal distributions $\mathcal{N}(\mu, \sigma^2)$ and $\mathcal{N}(\hat\mu, \hat\sigma^2)$, this is $(\mu - \hat\mu)^2 / (2\hat\sigma^2)$.
The difference between the two expectations is measured in terms of how many standard deviations away the two are.
Because $\sigma$, the yardstick for the divergence of the expectations, is used in minEKL as $\hat\sigma$, a value to be estimated, it is possible to reduce the measured divergence by unnecessarily inflating the $\hat\sigma$ estimate.
By contrast, RKL uses the true $\sigma$ value as its yardstick, for which reason no such inflation is possible for it. This makes its estimates consistent.
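The inflation effect can be illustrated with a toy posterior. The sketch below is an assumption-laden example, not the paper's computation: it takes a single Gaussian observation model, a hypothetical two-point posterior in which the variance is known exactly but the mean is uncertain, and uses the minimisers obtained by setting the derivative of each expected Gaussian divergence to zero (forward order for minEKL, reverse order for RKL).

```python
# Hypothetical posterior over (mu, sigma^2): the variance is known to be 1,
# while the mean is equally likely to be -1 or +1.
posterior = [((-1.0, 1.0), 0.5), ((1.0, 1.0), 0.5)]

def expect(g):
    """Posterior expectation of g(mu, var)."""
    return sum(p * g(mu, var) for (mu, var), p in posterior)

def minekl_estimate():
    """Minimiser of E[D_KL(f_theta || f_theta_hat) | x] for univariate Gaussians."""
    mu_hat = expect(lambda mu, var: mu)
    # The estimated variance absorbs the posterior spread of the mean.
    var_hat = expect(lambda mu, var: var + (mu - mu_hat) ** 2)
    return mu_hat, var_hat

def rkl_estimate():
    """Minimiser of E[D_KL(f_theta_hat || f_theta) | x] for univariate Gaussians."""
    precision = expect(lambda mu, var: 1.0 / var)
    mu_hat = expect(lambda mu, var: mu / var) / precision
    return mu_hat, 1.0 / precision

minekl_mu, minekl_var = minekl_estimate()
rkl_mu, rkl_var = rkl_estimate()
```

Even though the posterior places all its mass on $\sigma^2 = 1$, the forward-order estimate of the variance is pushed above 1 by the uncertainty in the mean, whereas the reverse-order estimate stays at 1. In Neyman-Scott the uncertainty in each mean never vanishes for fixed $J$, which is the situation this toy case mimics.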
This is a general trait of RKL, in the sense that it uses as its loss metric the entropy of the estimate relative to the true value, even though the true value is unknown.
While not a foolproof method of avoiding problems created by partial consistency (or, indeed, even some fully consistent scenarios), this does address the problem in a range of situations.
The following alternate characterisation of RKL gives better intuition regarding when the method’s estimates are consistent.
Definition 4 (RKL reference distribution).
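Given an estimation problem $(x, \theta)$ with likelihoods $f_\theta$, let $\tilde{L}_x$ be defined by
$$\tilde{L}_x(y) = \exp\left(\mathbf{E}\left[\log f_\theta(y) \,\middle|\, x\right]\right). \qquad (5)$$
If for every $x$, $\int_X \tilde{L}_x(y)\,\mathrm{d}y$ is defined and nonzero, let $g_x$, the RKL reference distribution, be the probability density function $g_x(y) = \tilde{L}_x(y) / \int_X \tilde{L}_x(y')\,\mathrm{d}y'$, i.e. the normalised version of $\tilde{L}_x$.
Theorem 2.
Let $(x, \theta)$ be an estimation problem with likelihoods $f_\theta$ and RKL reference distribution $g_x$. Then
$$\hat\theta_{\mathrm{RKL}}(x) = \operatorname*{argmin}_{\hat\theta \in \Theta} D_{\mathrm{KL}}(f_{\hat\theta} \,\|\, g_x). \qquad (6)$$
Proof.
Expanding the RKL formula, we get
$$\begin{aligned}
\hat\theta_{\mathrm{RKL}}(x) &= \operatorname*{argmin}_{\hat\theta} \mathbf{E}\left[D_{\mathrm{KL}}(f_{\hat\theta} \,\|\, f_\theta) \,\middle|\, x\right] \\
&= \operatorname*{argmin}_{\hat\theta} \mathbf{E}\left[\int_X f_{\hat\theta}(y) \log\frac{f_{\hat\theta}(y)}{f_\theta(y)}\,\mathrm{d}y \,\middle|\, x\right] \\
&= \operatorname*{argmin}_{\hat\theta} \int_X f_{\hat\theta}(y) \left(\log f_{\hat\theta}(y) - \mathbf{E}\left[\log f_\theta(y) \,\middle|\, x\right]\right) \mathrm{d}y \\
&= \operatorname*{argmin}_{\hat\theta} \int_X f_{\hat\theta}(y) \log\frac{f_{\hat\theta}(y)}{\tilde{L}_x(y)}\,\mathrm{d}y \\
&= \operatorname*{argmin}_{\hat\theta} D_{\mathrm{KL}}(f_{\hat\theta} \,\|\, g_x).
\end{aligned}$$
The move from $\tilde{L}_x$ to $g_x$ is justified because the difference is a positive multiplicative constant, translating to an additive constant after the $\log$, and therefore not altering the result of the $\operatorname{argmin}$. ∎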
This alternate characterisation makes RKL’s behaviour on Neyman-Scott more intuitive: the RKL reference distribution is calculated using an expectation over log-scaled likelihoods. In the case of Gaussian distributions, representing the likelihoods in log scale results in parabolas. Calculating the expectation over parabolas leads to a parabola whose leading coefficient is the expectation of the leading coefficients of the original parabolas. This directly justifies Lemma 3 and can easily be extended also to multivariate normal distributions.
Furthermore, the alternate characterisation provides a more general sufficient condition for RKL’s consistency in cases where the posterior is consistent.
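The parabola argument can be sketched numerically. The example below assumes a hypothetical two-point posterior over Gaussian parameters and recovers the leading coefficient of the averaged log-likelihood from a second finite difference, which is exact for a quadratic. All names and values are illustrative assumptions.

```python
import math

# Hypothetical posterior over (mu, sigma^2) for a univariate Gaussian likelihood.
posterior = [((0.0, 1.0), 0.6), ((2.0, 4.0), 0.4)]

def log_gauss(y, mu, var):
    """Log-density of N(mu, var) at y: a parabola in y with leading coefficient -1/(2 var)."""
    return -0.5 * math.log(2.0 * math.pi * var) - (y - mu) ** 2 / (2.0 * var)

def expected_log_likelihood(y):
    """Posterior expectation of the log-likelihood at y (log of the unnormalised reference)."""
    return sum(p * log_gauss(y, mu, var) for (mu, var), p in posterior)

def leading_coefficient():
    """Coefficient of y^2, read off from a second finite difference with step 1."""
    q = expected_log_likelihood
    return (q(1.0) - 2.0 * q(0.0) + q(-1.0)) / 2.0

def reference_variance():
    """Variance of the Gaussian whose log-density has that leading coefficient."""
    return -1.0 / (2.0 * leading_coefficient())

expected_precision = sum(p / var for (_, var), p in posterior)
```

Here the averaged parabola has leading coefficient $-\tfrac{1}{2}\mathbf{E}[1/\sigma^2 \mid x]$, so the Gaussian it describes has variance equal to the reciprocal of the expected precision, in line with the intuition given above.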
Definition 5 (Distinctive likelihoods).
An estimation problem with likelihoods is said to have distinctive likelihoods if for any sequence and any ,
where “” indicates total variation distance.
Corollary 3.0.
If is an estimation problem with distinctive likelihoods and an RKL reference distribution , such that for every , with probability over an generated from the distribution ,
(7) 
then RKL is consistent on the problem.
Proof.
By Pinsker’s inequality cover2012elements , converges under the total variations metric to both and . By the triangle inequality, the total variation distance between and tends to zero, and therefore by assumption of likelihood distinctiveness
∎
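The role of Pinsker's inequality in this argument, namely that a vanishing Kullback-Leibler divergence forces a vanishing total variation distance, can be illustrated for a simple case. The sketch below restricts attention to two Gaussians with a common variance, for which both quantities have standard closed forms; the function names are hypothetical, and the total variation distance is taken in the convention $\sup_A |P(A) - Q(A)|$.

```python
import math

def kl_same_variance(mu_p, mu_q, sigma):
    """D_KL(N(mu_p, sigma^2) || N(mu_q, sigma^2)) in nats."""
    return (mu_p - mu_q) ** 2 / (2.0 * sigma ** 2)

def tv_same_variance(mu_p, mu_q, sigma):
    """Total variation distance between N(mu_p, sigma^2) and N(mu_q, sigma^2)."""
    return math.erf(abs(mu_p - mu_q) / (2.0 * math.sqrt(2.0) * sigma))

def pinsker_bound(kl):
    """Pinsker's upper bound on total variation distance: sqrt(KL / 2)."""
    return math.sqrt(kl / 2.0)

# Compare the distance with its bound as the two means approach each other.
gaps = [2.0, 1.0, 0.5, 0.1, 0.01]
table = [
    (gap, tv_same_variance(0.0, gap, 1.0), pinsker_bound(kl_same_variance(0.0, gap, 1.0)))
    for gap in gaps
]
```

In every row of `table` the total variation distance sits below the Pinsker bound, and both shrink together as the gap closes, which is the mechanism the proof relies on.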
RKL’s consistency is therefore guaranteed in cases where converges to . This type of guarantee is similar to guarantees that exist also for other estimators, such as minEKL and posterior expectation, in that convergence of the estimator is reliant on the convergence of a particular expectation function. Having a consistent posterior guarantees that all values outside any neighbourhood of receive a posterior probability density tending to zero, but when calculating expectations such probability densities are multiplied by the random variable over which the expectation is calculated, for which reason if it tends to infinity fast enough compared to the speed in which the probability density tends to zero, the expectation may not converge.
RKL’s distinction over posterior expectation and minEKL, however, is that, as demonstrated in (5), the random variable of the expectation is taken in log scale, making it much harder for its magnitude to tend quickly to infinity.
This makes RKL’s consistency more robust than minEKL’s over a large class of realistic estimation problems.
6 Conclusions and future research
We have introduced RKL as a novel, simple, general-purpose, parameterisation-invariant Bayes estimation method, and showed it to be consistent over a large class of estimation problems with consistent posteriors and over Neyman-Scott one-way ANOVA problems regardless of one’s choice of prior.
Beyond being an interesting and useful new estimator in its own right and a satisfactory solution to the Neyman-Scott problem, the estimator also serves as a direct refutation of Dowe’s conjecture in (Dowe2008a, p. 539).
The robustness of RKL’s consistency was traced back to its reference distribution being calculated as an expectation in log scale.
This leaves open the question of whether there are other types of scaling functions, with even better properties, that can be used instead of log scale, without losing the estimator’s invariance to parameterisation.
References
 (1) A. Basu and B.G. Lindsay. Minimum disparity estimation for continuous models: efficiency, distributions and robustness. Annals of the Institute of Statistical Mathematics, 46(4):683–705, 1994.
 (2) J.O. Berger and J.M. Bernardo. On the development of reference priors. Bayesian statistics, 4(4):35–60, 1992.
 (3) M. Brand. MML is not consistent for Neyman-Scott. https://arxiv.org/abs/1610.04336, October 2016.
 (4) M. Brand, T. Hendrey, and D.L. Dowe. A taxonomy of estimator consistency for discrete estimation problems. To appear.
 (5) T.M. Cover and J.A. Thomas. Elements of information theory. John Wiley & Sons, 2012.
 (6) D.L. Dowe. Foreword re C. S. Wallace. Computer Journal, 51(5):523–560, September 2008. Christopher Stewart WALLACE (1933-2004) memorial special issue.
 (7) D.L. Dowe, R.A. Baxter, J.J. Oliver, and C.S. Wallace. Point estimation using the Kullback-Leibler loss function and MML. In Research and Development in Knowledge Discovery and Data Mining, Second Pacific-Asia Conference, PAKDD-98, Melbourne, Australia, April 15–17, 1998, Proceedings, volume 1394 of LNAI, pages 87–95, Berlin, April 15–17 1998. Springer.
 (8) D.L. Dowe and C.S. Wallace. Resolving the Neyman-Scott problem by Minimum Message Length. Computing Science and Statistics, pages 614–618, 1997.
 (9) J. Fan, H. Peng, and T. Huang. Semilinear high-dimensional model for normalization of microarray data: a theoretical analysis and partial consistency. Journal of the American Statistical Association, 100(471):781–796, 2005.

 (10) S. Ghosal. A review of consistency and convergence of posterior distribution. In Varanashi Symposium in Bayesian Inference, Banaras Hindu University, 1997.
 (11) M. Ghosh. On some Bayesian solutions of the Neyman-Scott problem. In S.S. Gupta and J.O. Berger, editors, Statistical Decision Theory and Related Topics. V, pages 267–276. Springer-Verlag, New York, 1994. Papers from the Fifth International Symposium held at Purdue University, West Lafayette, Indiana, June 14–19, 1992.
 (12) M. Ghosh and M.Ch. Yang. Noninformative priors for the two sample normal problem. Test, 5(1):145–157, 1996.
 (13) B.S. Graham, J. Hahn, and J.L. Powell. The incidental parameter problem in a nondifferentiable panel data model. Economics Letters, 105(2):181–182, 2009.
 (14) H. Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 186(1007):453–461, 1946.
 (15) H. Jeffreys. The theory of probability. OUP Oxford, 1998.
 (16) A. Kamata and Y.F. Cheong. Multilevel Rasch models. Multivariate and mixture distribution Rasch models, pages 217–232, 2007.
 (17) S. Kullback. Information theory and statistics. Courier Corporation, 1997.
 (18) T. Lancaster. The incidental parameter problem since 1948. Journal of econometrics, 95(2):391–413, 2000.
 (19) E.L. Lehmann and G. Casella. Theory of point estimation. Springer Science & Business Media, 2006.
 (20) H. Li, B.G. Lindsay, and R.P. Waterman. Efficiency of projected score methods in rectangular array asymptotics. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(1):191–208, 2003.
 (21) F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, 2006.
 (22) B.G. Lindsay. Nuisance parameters, mixture models, and the efficiency of partial likelihood estimators. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 296(1427):639–662, 1980.
 (23) H. Mühlenbein and R. Höns. The estimation of distributions and the minimum relative entropy principle. Evolutionary Computation, 13(1):1–27, 2005.
 (24) J. Neyman and E.L. Scott. Consistent estimates based on partially consistent observations. Econometrica: Journal of the Econometric Society, pages 1–32, 1948.
 (25) S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
 (26) Y. Qiao and N. Minematsu. f-divergence is a generalized invariant measure between distributions. In Ninth Annual Conference of the International Speech Communication Association, 2008.
 (27) C.S. Wallace. Statistical and Inductive Inference by Minimum Message Length. Information Science and Statistics. Springer Verlag, May 2005.
 (28) C.S. Wallace and D.M. Boulton. An information measure for classification. The Computer Journal, 11(2):185–194, 1968.
 (29) C.S. Wallace and D.M. Boulton. An invariant Bayes method for point estimation. Classification Society Bulletin, 3(3):11–34, 1975.
 (30) R. Yang and J.O. Berger. A catalog of noninformative priors. Institute of Statistics and Decision Sciences, Duke University, 1996.