Efficiency of maximum likelihood estimation for a multinomial distribution with known probability sums

by Yo Sheena et al.

For a multinomial distribution, suppose that we have prior knowledge of the sum of the probabilities of some categories. This allows us to construct a submodel of the full (i.e., unrestricted) model. Maximum likelihood estimation (MLE) under this submodel is expected to be more efficient than MLE under the full model. This article presents the asymptotic expansion of the risk of MLE with respect to Kullback–Leibler divergence for both the full model and the submodel. The results reveal that, using the submodel, the reduction of the risk is quite small in some cases. Furthermore, when the sample size is small, the use of the submodel can even increase the risk.






1 Introduction

We consider a multinomial distribution as follows: In each independent trial, the random variable

takes its value, which belongs to one of the categories with the probability


Due to the restriction


the dimension of this probability model is equal to , hereafter referred to as the “full model.” We consider a submodel that is expressed by a set of parameters as


The dimension of this submodel is equal to .

We often encounter situations in which we have prior information on the sum of the probabilities of some categories, namely


with some known constant . A collection of such equations as (4) formulates a submodel. We call this type of submodel an “-aggregation submodel,” and each equation (4) is referred to as a “restriction equation.”

Typical examples are as follows:

Example 1—Prior knowledge of some categories’ probabilities–
If we have prior information on the probabilities of some categories, then the restriction equations

obviously formulate an -aggregation submodel of the dimension .

Example 2—A two-way contingency table with some known column and row sums–

Suppose that a two-way contingency table of the

dimension has cells and a multinomial distribution over this table is given by . Without any restriction other than , this is the full model of dimension . Suppose that the first column sums are given as

then the submodel defined by these restriction equations is a -aggregation submodel of the dimension . If, in addition to this prior information, the first row sums are also given as

then the submodel becomes a more restricted -aggregation submodel with the dimension .

We can classify

-aggregation submodels into two groups. If each parameter appears no more than once in all restriction equations, we call the model “non-overlapping.” If some s appear multiple times in the restriction, the model is referred to as “overlapping.” For example, if we only know the column sums of a two-way contingency table, the submodel is non-overlapping. If both row- and column-wise sums are known, an overlapping submodel is formulated.
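The non-overlapping/overlapping distinction can be checked mechanically. In the sketch below (an illustration of ours, not the paper's notation), each restriction equation is identified with the set of cell indices it sums over; a submodel is non-overlapping exactly when these index sets are pairwise disjoint:

```python
def is_non_overlapping(restrictions):
    # restrictions: list of sets of category indices, one set per
    # restriction equation. The submodel is "non-overlapping" when
    # no category appears in more than one restriction equation.
    seen = set()
    for r in restrictions:
        if seen & set(r):
            return False
        seen |= set(r)
    return True

# A 2 x 3 table with cells indexed row-major (row 0: 0,1,2; row 1: 3,4,5).
# Knowing the first two column sums gives disjoint index sets:
assert is_non_overlapping([{0, 3}, {1, 4}])
# Knowing a row sum and a column sum shares cell 0: overlapping.
assert not is_non_overlapping([{0, 1, 2}, {0, 3}])
```

This matches the example in the text: column sums alone are non-overlapping, whereas combined row and column sums overlap.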

We consider the MLE for the -aggregation submodel and measure its performance through the Kullback–Leibler divergence between the true distribution and the predictive distribution obtained by the MLE plug-in.

We denote the MLE of the parameter by

We consider the predictive distribution given by the MLE plug-in, that is, the multinomial distribution with the parameter


We will measure the discrepancy between the true distribution and the predictive distribution using Kullback–Leibler divergence,

(Since Kullback–Leibler divergence is an “α-divergence” with , we will use the notation . See Amari [1] and Amari and Nagaoka [2].) See Gokhale and Kullback [4] for the use of Kullback–Leibler divergence in inference for the multinomial distribution.

We evaluate the performance of MLE using the risk with respect to the Kullback–Leibler divergence, that is


In this paper, we derive the asymptotic expansion of the risk with respect to the sample size up to the second order. The first-order term and the second-order term provide important information on the asymptotic efficiency of the MLE.
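The Kullback–Leibler risk admits a direct Monte Carlo check. The following Python sketch (our own illustration; the category count, true probabilities, and replication number are arbitrary choices, not taken from the paper) estimates the risk of the full-model MLE and compares it with the first-order term, i.e., (model dimension)/(2 × sample size):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    # Kullback-Leibler divergence D(p || q) for multinomial parameter
    # vectors; terms with p_i = 0 contribute zero.
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def mle_risk(p, n, reps=20000):
    # Monte Carlo estimate of E[D(p || p_hat)], where p_hat is the
    # vector of relative frequencies (the full-model MLE). Samples
    # containing an empty category would give infinite divergence,
    # so they are discarded (cf. the conditioning rule in the paper).
    total, kept = 0.0, 0
    for _ in range(reps):
        x = rng.multinomial(n, p)
        if np.all(x > 0):
            total += kl(p, x / n)
            kept += 1
    return total / kept

p = np.full(5, 0.2)                       # 5 categories, dimension 4
risk = mle_risk(p, 200)
first_order = (len(p) - 1) / (2 * 200)    # = 0.01
```

For this setting the Monte Carlo risk lands close to the first-order value 4/400 = 0.01, with the second-order term accounting for the small remaining gap.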

Sheena [5] derived the asymptotic expansion of the risk of MLE with regard to α-divergence for a general parametric model (Theorem 1 of [5]). As an application to the full-model multinomial distribution, it was demonstrated (see (42)) that




The first-order term is half the relative complexity of the model, that is, the ratio of the model’s dimension to the sample size (“ ratio”). Since this holds true for any parametric model, the first-order term for a submodel of -dimension equals .

Since a submodel is based on prior knowledge about the parameters of the full model, it is naturally expected to reduce the estimation risk. Comparing the first-order terms of the full model and the submodel shows that the risk of the submodel is smaller than that of the full model for a large enough sample size. This is the benefit of the dimension reduction, and as the sample size goes to infinity, the risk ratio converges to (“dimension ratio”).

In this paper, we consider a more elaborate risk comparison between the full model and the non-overlapping -aggregation submodel. We focus on the situation in which the sample size is not large enough; hence, the second-order terms have a non-negligible effect on the risk comparison. The contribution and structure of this paper are as follows:

  1. We derive the explicit form of the second-order term for the non-overlapping -aggregation submodel (Section 2.1). The result reveals that the second-order term of the non-overlapping -aggregation submodel is larger than that of the full model unless the prior information is as “solid” as in Example 1.

  2. Through a simulation study of some concrete examples of contingency tables, we demonstrate that the risk of the submodel is not as small as the dimension reduction suggests when the sample size is “medium.” It may even be larger than that of the full model when the sample size is “small” (Section 2.2).

2 Risk of non-overlapping -aggregation submodel

2.1 Two-stage multinomial distribution models

We begin by defining the two-stage multinomial distribution. Suppose that the random variable takes values that belong to one of the categories and that


for . (To eliminate a trivial case, we suppose that .) Let . Then, the first-stage model is given by focusing on the group to which the value of belongs. Its parameters are given by . The th model () in the second stage is given by the categories and the corresponding probabilities () under the condition .

If the restriction for the submodel is non-overlapping, the parameters can be grouped by the restriction equation in which each parameter appears (the parameters that appear in no restriction equation form one group). Decomposed into two stages according to this grouping, the full model can be regarded as a two-stage model.

Kullback–Leibler divergence satisfies the so-called “chain rule.” Let

denote the conditional distribution of when is given. Suppose a pair of the distributions and defines the distribution of for . Then, the following relationship (chain rule) holds.


where is the expectation under the condition . (For more details on divergences, see Vajda [7] and Amari and Nagaoka [2].)

Thanks to this property, we can decompose the MLE risk into several parts, each of which corresponds to the first- and second-stage distributions.

When a sample of size is taken from the two-stage multinomial distribution, let and denote the number of individuals that belong to and , respectively, hence

The notations of and are similarly defined as the corresponding random variables.

In this section, we always assume that and are parameterized independently, that is,

1. depends on . (11)
2. For each , depends on . (12)

We obtain the following result.

Theorem 1.

The MLE for the two-stage multinomial distribution model, , is given by



where is the MLE for the first-stage model based on , and for each , is the MLE for the second-stage model based on .


Proof. For the sample , the log-likelihood is expressed as

We obtain

We notice that

is the log-likelihood function of up to a constant for the first-stage model based on , whereas, for each , the log-likelihood function of for the second-stage model is given by

up to a constant with the sample . Since , all the results are obtained. ∎
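Theorem 1 can be checked numerically: the two-stage MLE, built from the first-stage relative frequencies and the within-group relative frequencies, reproduces the ordinary full-model MLE cell by cell. A minimal sketch (the grouping and parameter values are our own illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Five categories; categories {0, 1} form the first group and
# {2, 3, 4} the second (an illustrative grouping).
groups = [[0, 1], [2, 3, 4]]
p = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
n = 1000
x = rng.multinomial(n, p)

# First-stage MLE: relative frequency of each group.
q_hat = np.array([x[g].sum() / n for g in groups])

# Second-stage MLEs: relative frequencies within each group. The
# plug-in cell probability q_hat[j] * (x_i / n_j) collapses to
# x_i / n, the full-model MLE, as Theorem 1 asserts.
p_hat = np.empty_like(p)
for j, g in enumerate(groups):
    n_j = x[g].sum()
    for i in g:
        p_hat[i] = q_hat[j] * (x[i] / n_j)

assert np.allclose(p_hat, x / n)
```

The factorization reflects the chain rule used below: the joint likelihood separates into a first-stage factor and independent second-stage factors.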

From here on, we consider only the MLE as the estimator of , which is given by (13).

The following decomposition of holds:



When the first-stage model is full, the MLE of is given by , and the asymptotic expansion of the risk of the MLE for the first-stage model (denoted by ) with the sample size is given by (see (7))




For the -th model () at the second stage (not necessarily a full model), let denote the second-order term of with the sample size . Namely, we have


Because the dimension of the -th model at the second stage equals due to (12), the first-order term equals (see Theorem 1 of [5]).

It should be noted that it is possible that . Hence, we can encounter a situation in which we are unable to estimate , the parameter of the -th second-stage model, because no sample is available. We overcome this problem by making it a rule to discard such samples without estimation. In the following theorems, all the expectations are conditional on the state . However, as Lemma 1 in the Appendix shows, the conditional expectations of Kullback–Leibler divergence and differ from the unconditional ones by for any . Therefore, we use the same notation as that for the unconditional distribution. (Avoiding zero probability estimates is a practically important issue; see Darscheid et al. [3].)

The next theorem is Theorem 1 of Sheena [6]. It provides the decomposition of when the first-stage model is a full model.

Theorem 2.

If the first-stage model is a full model, then the risk of MLE is equal to



Suppose that we have prior knowledge on in (11), equivalently . The next theorem gives the risk of MLE for this situation.

Theorem 3.

If the first-stage model is a full model and are all known, then the risk of MLE is


In particular, if all the second-stage models are full models, then the risk of MLE is


Since , that is, , we have

(see (14)).


From (17)

hence, we have

We use Lemma 2 in the Appendix to prove .

Taylor expansions of and are given by

Using and , we have



(We omit the proof that and .)

Consequently, we have


When all the second-stage models are full,

If we insert these results into (19), we have


The prior knowledge on the values of is equivalent to the -aggregation submodel


where the ’s are known constants. It is clear that any non-overlapping -aggregation submodel can be treated as a two-stage model in which every second-stage model is a full model. Consequently, the risk for the non-overlapping -aggregation submodel is given by (20).

We now compare the risks in (18) and (19). Notice that the difference between Theorem 2 and Theorem 3 is the existence of the prior information on . We call the models in Theorems 2 and 3 “M1” and “M2,” respectively. If the second-stage models are all full, then M1 becomes the full model and M2 becomes the -aggregation submodel.

If we neglect the terms, the difference between (18) and (19) is equal to


especially, when the second-stage models are all full models,


Regarding the first-order terms, that of M2 is always smaller than that of M1. The risk ratio between the full and submodels is close to the dimension ratio

under a large enough sample size.

However, the second-order term of M2 can be larger than that of M1. Suppose that , then

The second inequality holds because the arithmetic mean is never less than the harmonic mean:


When the second-stage models are all full, is equivalent to , namely

in (24). We call a restriction equation of this type “solid,” as it provides solid information on the probability of a particular cell. The aforementioned result applies to the -aggregation model as the following corollary.

Corollary 1.

If none of the restriction equations of the -aggregation submodel are solid, then the second-order term of the submodel is larger than that of the full model.

This means that (26) is negative for some small values of when the restriction equations are all non-solid. To simplify the explanation, we roughly classify the sample size as follows: “small sample size” when (26) is negative; “medium sample size” when (26) is positive, but the second-order term of (26) is still non-negligible; “large sample size” when the second-order term of (26) is negligible.

Our conclusion is as follows:

  • When is “small,” a rather pathological situation may occur in which the submodel (i.e., the prior information) increases the risk; hence, it is better not to use the submodel.

  • When is “medium,” the submodel has an advantage over the full model. However, the submodel loses part of the estimation efficiency gained by the dimension reduction, owing to its larger second-order term. That is, the ratio of the risks between the sub and full models is not as small as the dimension ratio.

  • When is “large,” the risk ratio between the sub and full models is close to the dimension ratio. It should be noted that the dimension ratio is close to one when the number of restriction equations in the submodel () is quite small compared with the dimension of the full model ().

The comparison between the full and sub models is based on (18) and (19), which are “approximations” of the true risks obtained by neglecting terms.

2.2 Numeric analysis of some examples

In the previous subsection, we observed the risk difference between the full model and the non-overlapping -aggregation submodel through the approximated risks, that is, the asymptotic expansion of the risks up to the second order. In this subsection, we confirm several results using simulation with three examples. Every example is a two-way contingency table, and the non-overlapping -aggregation submodel is the one in which the column sums are all known (see Example 2 in the Introduction). As a two-stage model, the first-stage model consists of the multinomial distribution over the columns, and each second-stage model is the distribution over the rows within a given column.

To compare the submodel and the full model, we also use the indicator “the required sample size (r.s.s.) of the submodel relative to the full model under the condition ,” defined as the solution of the following equation

where and are the risks of the sub and full models, respectively, regarded as functions of the sample size. In other words, the r.s.s. is the sample size the submodel requires to attain the same risk as the full model with sample size . The risks in the equation above are calculated by approximation or simulation.
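At first order, the r.s.s. admits a closed form: since the leading risk term of either model is (dimension)/(2 × sample size), equating the two risks gives an r.s.s. of n times the dimension ratio. A tiny sketch (the function name is ours, not the paper's):

```python
def rss_first_order(n, dim_full, dim_sub):
    # Solve dim_sub / (2 * n_sub) = dim_full / (2 * n) for n_sub:
    # at first order, the submodel needs only the dimension-ratio
    # fraction of the full model's sample size.
    return n * dim_sub / dim_full

# Example 2 below: a 3 x 5 table has full dimension 14; knowing the
# five column sums (four independent restrictions) leaves dimension 10.
print(rss_first_order(700, 14, 10))  # -> 500.0
```

The second-order terms pull the actual r.s.s. above this first-order value, which is exactly the effect examined in the tables below.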

The following abbreviations are commonly used in the three examples:
“f.risk.sim(app)” is the risk of the full model obtained by simulation (approximation).
“s.risk.sim(app)” is the risk of the submodel obtained by simulation (approximation).
“ratio.sim(app)” is the risk ratio between the submodel and the full model based on the simulated (approximated) risks.
“r.s.s.sim(app)” is the r.s.s. obtained by simulation (approximation).
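The simulation procedure for these examples can be sketched as follows (a simplified version under our own conventions: a uniform 2 × 2 table, a crude discard rule for zero cells, and an arbitrary replication count; none of these choices are taken from the paper). The full-model MLE is the table of relative frequencies, while the submodel MLE rescales each column to its known sum:

```python
import numpy as np

rng = np.random.default_rng(2)

def kl(p, q):
    # D(p || q) for probability vectors with p > 0 elementwise.
    return float(np.sum(p * np.log(p / q)))

def risks(P, n, reps=5000):
    # P: r x c table of true cell probabilities; the submodel assumes
    # the column sums of P are known exactly.
    col = P.sum(axis=0)
    f_tot, s_tot, kept = 0.0, 0.0, 0
    for _ in range(reps):
        X = rng.multinomial(n, P.ravel()).reshape(P.shape)
        if (X == 0).any():
            continue  # discard samples with an empty cell (cf. the paper)
        full = X / n                    # full-model MLE
        sub = col * X / X.sum(axis=0)   # known column sums x within-column freq.
        f_tot += kl(P.ravel(), full.ravel())
        s_tot += kl(P.ravel(), sub.ravel())
        kept += 1
    return f_tot / kept, s_tot / kept

P = np.full((2, 2), 0.25)      # uniform 2 x 2 table
f_risk, s_risk = risks(P, 500)
```

For this table at n = 500, the full-model risk is near its first-order value 3/1000 and the submodel risk near 2/1000; by the chain rule, the per-sample difference equals the first-stage divergence and is therefore nonnegative, while at much smaller n the average ordering can reverse, as Example 1 shows.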

Example 1
The first example is an artificial setting to confirm some theoretical results in the previous subsection. Consider a 100 × 2 contingency table with uniform probability. Consequently, and

In this setting, the use of the submodel (prior information) makes little contribution to risk reduction because there is virtually just one restriction equation while the dimension of the full model is as large as 199.

The risks and r.s.s. under several values of are given in Table 1. The r.s.s. is under the condition . For the calculation of the simulated risk, we used sets of samples and took the average over these.

n f.risk.app s.risk.app ratio.app r.s.s.app f.risk.sim s.risk.sim ratio.sim r.s.s.sim
100 1.3283 1.3332 1.0037 100 1.0046 1.0068 1.0022 101
200 0.5808 0.5808 0.9999 200 0.5712 0.5713 1.0003 201
300 0.3687 0.3681 0.9985 300 0.3846 0.3843 0.9991 300
400 0.2696 0.2690 0.9977 399 0.2833 0.2829 0.9983 400
500 0.2123 0.2117 0.9972 499 0.2219 0.2214 0.9977 499
1000 0.1028 0.1024 0.9961 996 0.1041 0.1037 0.9962 996
2000 0.0506 0.0504 0.9955 1991 0.0507 0.0504 0.9955 1992
Table 1: Example 1—Risk and R.S.S.—

We observe the following results:

  • When is as small as 100 or 200, the use of the submodel increases the risk.

  • Since the dimension ratio () is close to one, the use of the submodel causes little risk reduction even when is relatively large. In view of the r.s.s., the contribution of the submodel is almost negligible.

  • The approximated values of the risk deviate somewhat from the simulated values when is small; however, the approximated risk ratio and r.s.s. are quite close to the simulated values even when is small.

Example 2

We use real data on breast cancer taken from the “UCI machine learning repository” (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer). We created a cross-tabulation table from the variables “degree of malignancy” (3 levels: 1, 2, 3) and “age group” (5 groups: 30–39, …, 70–79), excluding one person in his or her twenties from the original data set. We obtained the relative frequencies by dividing each cell count by the total number of individuals (285) (see Table 2). We assume that these give the true probability of each category. We consider the submodel corresponding to the situation in which we have prior knowledge of each column sum, that is, the distribution over the age groups.

Table 3 reveals the approximated and simulated risk (in parentheses) for the full and sub models for several cases of sample size . It also shows the approximated and simulated r.s.s. of the submodel to the full model under the condition (the simulated r.s.s is in parentheses).

We summarize the result as follows:

  • For this example, when , the value of (25) (equivalently (26)) becomes negative. However, the simulation reveals that the risk of the submodel is always smaller than that of the full model. In fact, even if , the former equals 0.6236 and the latter equals 0.6543.

  • The dimension ratio is . Since the table is small, the knowledge of the column sums significantly reduces the risk. However, we notice that the effect of the dimension reduction is lessened by the larger second-order term of the submodel as the “Risk Ratio” or “R.S.S./” is larger than 0.714.

  • The approximation method for risk calculation is effective under the sample sizes in the table.

30–39 40–49 50–59 60–69 70–79
1 0.025 0.063 0.088 0.060 0.014
2 0.060 0.168 0.137 0.084 0.004
3 0.042 0.084 0.112 0.056 0.004
Column sum 0.126 0.317 0.337 0.200 0.021
Table 2: Breast cancer classification
n 200 400 600 800 1000
Full Model 0.0367(0.0361) 0.0179(0.0180) 0.0119(0.0119) 0.0089(0.0090) 0.0071(0.0071)
Submodel 0.0281(0.0277) 0.0133(0.0135) 0.0087(0.0088) 0.0064(0.0066) 0.0051(0.0052)
Risk Ratio 0.766(0.769) 0.741(0.752) 0.732(0.741) 0.728(0.734) 0.725(0.730)
R.S.S. 158(155) 302(305) 445(450) 588(591) 732(739)
R.S.S./ 0.790(0.775) 0.755(0.762) 0.742(0.750) 0.735(0.739) 0.732(0.739)
Table 3: Example 2—Risk and R.S.S.—

Example 3
We use data from the “2014 National Survey of Family Income and Expenditure” by the Statistics Bureau in Japan (https://www.stat.go.jp/english/data/zensho/index.html). Table 4 is the relative frequency obtained from the classification of 100,006 households according to “Yearly income group (Y1,…,Y10)” and “Household age group (H1,…,H6).” We use this relative frequency as the population parameter . We consider the submodel that is based on the prior knowledge of all the column sums. Table 5 presents the results.

We summarize the results as follows:

  • For this example, when , the value of (25) (equivalently (26)) becomes negative. The simulation result shows that this pathological phenomenon actually occurs, but only when is as small as 20.

  • The dimension ratio is . The table is larger than that of Example 2, and the knowledge of the column sums is not as useful in terms of reducing the dimension. The larger second-order term of the submodel is also a burden on risk reduction, which increases the “Risk Ratio” or “R.S.S./” to over 0.915.

  • The approximation method for risk calculation works effectively under the sample sizes in the table.

H1 H2 H3 H4 H5 H6
Y1 0.00161 0.00232 0.00512 0.00395 0.00468 0.00066
Y2 0.00331 0.0081 0.00953 0.00783 0.01145 0.00278
Y3 0.00974 0.02109 0.02046 0.01499 0.02536 0.00494
Y4 0.00799 0.03519 0.03229 0.02017 0.0338 0.00708
Y5 0.00547 0.0376 0.04362 0.02442 0.02675 0.00398
Y6 0.00494 0.05082 0.09003 0.05772 0.03732 0.00452
Y7 0.00126 0.02106 0.0543 0.05531 0.01999 0.00234
Y8 0.00071 0.00961 0.0323 0.04043 0.0108 0.00122
Y9 0.00011 0.00201 0.01204 0.02184 0.00466 0.00052
Y10 0.00006 0.00139 0.00697 0.01582 0.00344 0.00022
Col. Sum 0.03520 0.18919 0.30666 0.26248 0.17825 0.02826
Table 4: Household classification
n 1000 1500 2000 2500 3000
Full Model 0.0372(0.0295) 0.0231(0.0197) 0.0167(0.0148) 0.0130(0.0119) 0.0107(0.0099)
Submodel 0.0351(0.0275) 0.0216(0.0183) 0.0155(0.0137) 0.0121(0.0109) 0.0099(0.0091)
Risk Ratio 0.945(0.931) 0.937(0.926) 0.932(0.924) 0.929(0.923) 0.927(0.922)
R.S.S. 955(934) 1419(1392) 1880(1848) 2339(2305) 2799(2773)
R.S.S./ 0.955(0.934) 0.946(0.928) 0.940(0.924) 0.936(0.922) 0.933(0.924)
Table 5: Example 3—Risk and R.S.S.—

3 Conclusion and Discussion

The theoretical and numerical analyses revealed the following points:

  • Risk reduction through the use of the submodel (prior information) is mainly determined by the dimension reduction. If we have numerous restriction equations and a large sample, the submodel significantly reduces the risk.

  • When we have relatively few restriction equations compared with the dimension of the full model and the sample size is small, the submodel can increase the risk. We may be tempted to use the submodel to “compensate” for the small sample, but doing so can backfire. Equation (25) (or (26)) is useful for checking whether the estimation conditions (the dimension of the submodel and the sample size) may produce this pathological phenomenon. If the value of (25) is negative, we must be cautious about using the submodel.

We briefly refer to the following points for further study:

  • The second-order terms of the asymptotic expansion for a submodel are generally not easily obtained (see [5]). Even for the overlapping -aggregation submodel, the method presented in this paper (the two-stage model) cannot be used.

  • The reliability of the prior information creates another problem. If it is not accurate, the use of the submodel could increase the risk regardless of the sample size (e.g., [6]).


  • [1] S. Amari. Information geometry and its applications. Springer, 2016.
  • [2] S. Amari and H. Nagaoka. Methods of information geometry. Translations of mathematical monographs 191. American Mathematical Society, 2000.
  • [3] P. Darscheid, A. Guthke and U. Ehret. A maximum-entropy method to estimate discrete distributions from samples ensuring nonzero probabilities. Entropy, doi:10.3390/e20080601, 2018.
  • [4] D. V. Gokhale and S. Kullback. The information in contingency tables. Marcel Dekker, 1978.
  • [5] Y. Sheena. Asymptotic expansion of the risk of maximum likelihood estimator with respect to α-divergence as a measure of the difficulty of specifying a parametric model. Communications in Statistics – Theory and Methods, 47:4059–4087, 2018.
  • [6] Y. Sheena. Asymptotic efficiency of M.L.E. using another survey in multinomial distributions. arXiv:1904.06826 [math.ST]
  • [7] I. Vajda. Theory of statistical inference and information, Kluwer Academic Publishers, 1989.


Lemma 1.

Let be the random vector whose distribution is defined as the multinomial distribution with (1) and the sample size . The distribution under the condition is considered. Let the unconditional and conditional expectations of a random variable be denoted by and , respectively. If

holds with some nonnegative numbers , then the difference between the two expectations decreases to zero at exponential speed as goes to infinity; namely, for any ,


In the special case of the MLE of , , and the moments of , the following equations hold for any .


For and , because the following equivalence relationship holds

we have


Let be the set of all -dimensional vectors whose elements are nonnegative integers.

Notice that

which means . Because

we have