# Non-Asymptotic Behavior of the Maximum Likelihood Estimate of a Discrete Distribution

In this paper, we study the maximum likelihood estimate of the probability mass function (pmf) of n independent and identically distributed (i.i.d.) random variables, in the non-asymptotic regime. We are interested in characterizing the Neyman--Pearson criterion, i.e., the log-likelihood ratio for testing a true hypothesis within a larger hypothesis. Wilks' theorem states that this ratio behaves like a χ^2 random variable in the asymptotic case; however, less is known about the precise behavior of the ratio when the number of samples is finite. In this work, we find an explicit bound for the difference between the cumulative distribution function (cdf) of the log-likelihood ratio and the cdf of a χ^2 random variable. Furthermore, we show that this difference vanishes with a rate of order 1/√(n) in accordance with Wilks' theorem.

## Authors

• 4 publications
• 8 publications
• 63 publications
07/20/2020

### Maximum likelihood estimation for matrix normal models via quiver representations

In this paper, we study the log-likelihood function and Maximum Likeliho...
05/22/2020

### Asymptotic accuracy of the saddlepoint approximation for maximum likelihood estimation

The saddlepoint approximation gives an approximation to the density of a...
06/10/2018

### Bounds for the asymptotic distribution of the likelihood ratio

In this paper we give an explicit bound on the distance to chisquare for...
02/20/2021

### A Note on Taylor's Expansion and Mean Value Theorem With Respect to a Random Variable

We introduce a stochastic version of Taylor's expansion and Mean Value T...
02/05/2018

### A useful variant of Wilks' theorem for grouped data

This paper provides a generalization of a classical result obtained by W...
11/28/2019

### An analysis of the maximum likelihood estimates for the Lomax distribution

The Lomax distribution is a popularly used heavy-tailed distribution tha...
11/28/2019

### A note on the Lomax distribution

The Lomax distribution is a popularly used heavy-tailed distribution tha...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The maximum likelihood estimator (MLE), a conventional method for parameter estimation of a statistical model, has been extensively studied in fields such as statistics, information theory, and signal processing. This estimator possesses some remarkable asymptotic (in the number of samples) properties, both in parametric and non-parametric scenarios (ibragimov2013statistical ; Wilks1938 ; Owen2001book ), such as the normality of the estimate for i.i.d. observations (lehmann2004elements, , Ch. 7.3) or the behavior of the log-likelihood ratio (LLR) between a null and alternative hypothesis, known as Wilks’ theorem or phenomenon Wilks1938 ; Owen1988empirical ; billingsley1961statistical

. The LLR in particular and its properties have also been considerably investigated in analysis of significance and confidence intervals in hypothesis testing

Wilks1938 ; Owen2001book ; billingsley1961statistical ; spokoiny2012parametric ; spokoiny2013bernstein ; spokoiny2015bootstrap . In the present work, we intend to provide additional results in this direction.

Let us assume that the random variable is distributed according to , where the probability measure is a member of a parametrized family of distributions . Consider is a subset of , i.e., is described with parameters. In particular, we are interested in deriving a non-asymptotic bound on the cdf of

 Λn≜2[maxθ∈ΘLn(θ)−Ln(θ0)], (1)

where is the log-likelihood function of the parameter given i.i.d. samples111It is a common assumption to analyze the case of i.i.d. samples since the calculations become more tractable by eliminating dependencies among samples (see e.g., Wilks1938 ; Owen1988empirical ). It is possible to extend the results from this work to a more general model, e.g., a Markov process (as in billingsley1961statistical ). However, we study only the simple case for ease of presentation. of the random variable , and assuming the maximum exists. This represents the Neyman–Pearson criterion for testing the true hypothesis within a larger (composite) hypothesis , i.e.,

 H0:θ=θ0↔H1:θ=argmaxθ′∈ΘLn(θ′). (2)

A proper characterization of the statistical distribution of the LLR (1) allows us to determine the performance of the aforementioned test. It is thus no surprise that a major line of research is concerned with identification of the asymptotic (Wilks1938 ; Owen2001book ; billingsley1961statistical ) and non-asymptotic (spokoiny2015bootstrap and the references therein) behavior of the LLR. Additionally, hypothesis tests based on information-theoretic measures such as the mutual and directed information are related to the LLR given that they are defined in terms of the logarithm of a probability ratio. Therefore, the behavior of the LLR appears in the analysis of performance and significance of composite hypothesis tests based on said measures kontoyiannis2016estimating ; Mol2017TestforDIG .

The first characterization of the behavior of is due to Wilks Wilks1938 , who shows that the LLR is asymptotically distributed like a random variable, up to an error of order . For a large but finite number of samples, we may obtain a similar characterization following a two-step approach: first, we establish that has a quadratic form, and second, we identify the behavior of the quadratic form as following a distribution. A conventional technique for the first step is to employ a Taylor expansion of , as proposed in billingsley1961statistical . More recently, in spokoiny2012parametric ; spokoiny2013bernstein ; spokoiny2015bootstrap , an alternative method is presented which uses a bracketing approach to express in the vicinity of two quadratic terms; specifically, it is shown that is pointwise close to a quadratic form via a penalty of order with exponentially high probability in the non-asymptotic regime. Various methods exist to approximate the behavior of the aforementioned quadratic form to that of a , a collection of these is presented in Prokhorov2013 . For instance, Spokoiny and Zhilova spokoiny2015bootstrap use Pinsker’s inequality combined with the Kullback–Leiber divergence to make this approximation, achieving a penalty of where only depends on the dimension of . Sharper bounds are available in the works of Benktus bentkus2003dependence and Götze gotze2014explicit .

The approximation of the LLR is also shown to be valid in high dimensional analysis, i.e., when is very large (albeit smaller than ). Portnoy portnoy1988asymptotic studies the MLE for exponential families with parameters and shows that the approximation holds if . In a recent work anastasiou2018bounds , the authors obtain an explicit asymptotic bound to approximate the LLR with a variable, which is valid in high dimension if . In hjort2009extending ; tang2010penalized , the effect of large dimension is similarly analyzed when the estimation is performed assuming a multinomial distribution, which is based on the work by Owen Owen2001book . In a recent work, the LLR for a logistic model is shown to behave asymptotically in high dimesion as a rescaled sur2017likelihood . Our result in this work is closer to Portnoy’s, yet different since the observed random variables belong to a discrete distribution parametrized with its pmf.

The main contribution of this paper is to derive an explicit bound on the difference between the cdf of and that of a distribution, for a finite number of samples. We start by reformulating (1) using a Taylor series to elicit its quadratic form (Lemma 2). The behavior of is then decomposed into its asymptotic component and a non-asymptotic penalty; the latter is first bounded via the matrix Bernstein inequality and second, via the approximation provided by Benktus bentkus2003dependence . The bound thus obtained is compared to one derived using the tools presented in spokoiny2015bootstrap , which requires additional assumptions to hold, and an improvement is shown for some values of . Furthermore, we investigate the effect of large dimension and show that a sufficient condition for the asymptotic convergence is that .

The paper is organized as follows. In Section 2, notations and required definitions are presented. The main theorem is then introduced in Section 3 and subsequently proved in Section 4. Finally, in Section 5, we compare our result with the one derived from spokoiny2015bootstrap , and the paper is concluded after some final remarks.

## 2 Preliminaries

We begin by describing notations we have used throughout the paper, and the investigated model is explained afterwards. Next, the maximum likelihood estimator of the model’s parameter is reviewed. Finally, the quantity of interest, i.e., , is expressed as a sum of a quadratic term and asymptotically negligible remainders.

### 2.1 Notation

Given two integers and , the expression denotes the set

. For a vector

, the -th component is denoted , while for a matrix , denotes the element in the -th row and -th column. We use to denote a square diagonal matrix whose nonzero entries are the elements from the vector . An all-one and all-zero column vectors are denoted and , respectively.

For a matrix , and

indicate the maximum and minimum eigenvalues of

, respectively. Moreover, the spectral norm of , defined as , is denoted by . For the -norm of a vector , we use the notation .

For a function of a vector , the notation stands for the first derivative with respect to component as

 g′u(⋅;θ)≜∂g(⋅;θ)/∂θu, (3)

and similarly , and denote the second, and third derivative with respect to the components , , and , respectively. Moreover, and denote the gradient and Hessian matrix of the function , respectively.

For a random vector

, the expression indicates that the distribution function of at any continuity point converges to the distribution function corresponding to (see convergence in distribution billingsley2013convergence ).

### 2.2 Model Definition

Consider i.i.d. random variables distributed by where , and we set off to estimate its pmf using a maximum likelihood estimator. The pmf of can be parametrized with a vector where , i.e.,

 Pr{X=j}=P(j;θ)={θjif j∈[1:r]θresif j=r+1, (4)

where we define

 θres≜1−∑rj=1θj. (5)

Throughout the paper, we denote the true value of the parameter vector by .

To prevent undefined behavior of some quantities, like the Fisher information matrix, we make the following assumption.

###### Assumption 1.

All elements of the alphabet have nonzero probability, i.e., , where we define .

Before continuing, let us define for simplicity the function

 g(X;θ)≜logP(X;θ). (6)

Then, the Fisher information matrix about the true parameter contained in is defined as

 Σ ≜E[∇g(X;θ0)∇g(X;θ0)T]=E[−∇2g(X;θ0)], (7)

where the second equality holds under certain conditions (Lehmann1998, , Lem. 5.3), which here we assume them to hold. In our model, each element of the matrix may be characterized using the definition in (3) as follows,

 Σuv =E[g′u(X;θ0)g′v(X;θ0)] =r+1∑j=1P(j;θ0)g′u(j;θ0)g′v(j;θ0) =r∑j=1θ0j∂(logθ0j)∂θu∂(logθ0j)∂θv+θ0res∂(logθ0res)∂θu∂(logθ0res)∂θv. (8)

Given that is a function of every component of as defined in (5) while all the other components are independent of each other, we obtain

 Σ=diag(1θ01,…,1θ0r)+1θ0res11T. (9)

All the entries of are finite as long as Assumption 1 holds true. Additionally, we may obtain the inverse of using the Sherman–Morrison–Woodbury formula (Horn2013, , Sec. 0.7.4):

 Σ−1=diag(θ0)−θ0(θ0)T. (10)

### 2.3 ML Estimation

Given the i.i.d. samples , consider the log-likelihood function

 Ln(θ)≜logP(X1,…,Xn;θ)=∑ni=1g(Xi;θ), (11)

where we use (6). Let us denote the solution to the ML-estimation as (assuming it exists), i.e.,

 θ∗≜argmaxθ∈ΘLn(θ). (12)

Hereafter and to simplify notation we use and to indicate and , respectively.

It is well-known that under some regulatory conditions with probability one (see e.g., Hajek1970 ; Hajek1972 ; LeCam1986 ). To analyze the convergence behavior, we may define

 ln≜√n(θ∗−θ0). (13)

The following lemma provides a bound on the probability of having a large difference between the estimate and the true value of the parameter. We may see that the tail probability of decreases exponentially fast with .

###### Lemma 1.

The following bound holds for the -norm of :

 (14)
###### Proof.

See A. ∎

Two other quantities of interest are the standardized score

 tn≜1√nn∑i=1∇g(Xi;θ0) (15)

and the empirical information matrix

 Jn≜−1nn∑i=1∇2g(Xi;θ0). (16)

There exist situations for finite in which any samples from a specific has not been observed. According to our model (4), this yields to be singular. In order to avoid such deficiencies we make the following assumption to guarantee observing all members of .

###### Assumption 2.

is sufficiently large such that is non-singular and the inverse exists.

In the following by using these quantities, the LLR  (1) may be expressed as a quadratic form with remainders, as long as Assumption 2 holds. This is the main step toward extracting the part from which behaves asymptotically as a random variable.

###### Lemma 2.

If Assumption 2 holds true, there exist and such that the Neyman–Pearson criterion may be formulated as

 Λn =tTnJ−1ntn−(∥ln∥22√n¯Gn)2α′TJ−1nα′+α∥ln∥32√n¯Gn, (17)

where , for ,

 ¯Gn ≜1nn∑i=1G(Xi), (18) G(Xi) ≜supθ′\absg′′′uvw(Xi;θ′), (19)

and is on the line connecting and .

###### Proof.

The derivations in this lemma are similar to the ones found in (billingsley1961statistical, , Ch. 2) for the case of a first order Markov process; in this work, the samples come from an i.i.d. process. The proof is deferred to B. ∎

Lemma 2 shows that, as , the behavior of the LLR is dominated by the first term on the r.h.s. of (17), i.e., . In the next subsection, we see that this term behaves as a random variable with a distribution.

### 2.4 Asymptotic Behavior of Λn

The asymptotic behavior of the ML estimate and the LLR has been extensively studied (see e.g., Owen2001book ; Owen1988empirical ; billingsley1961statistical ). We briefly review these results in the following.

We may see in the definition of in (15) that every summand is a zero-mean random vector; this follows from the fact that

 E[g′u(Xi;θ0)]=θ0u1θ0u−θ0res1θ0res=0, (20)

for . Also, for any , the covariance matrix of is equal to the Fisher information matrix, i.e.,

 Cov[tn] =1nn∑i=1Cov[∇g(Xi;θ0)]=Σ, (21)

where the first equality holds because the samples are i.i.d. and the summands are zero-mean, while the second equality is due to (7

). Consequently, invoking the classical central limit theorem (CLT)

(lehmann2004elements, , Th. 2.4.1) for i.i.d. samples,

 tnL∼N(0,Σ). (22)

Also, by the weak law of large numbers and (

7), asymptotically converges in probability to –the Fisher information matrix about the true parameter . Then, it can be shown that (see Wilks1938 ; Owen2001book ; Owen1988empirical ):

 ΛnL∼χ2r, (23)

since , i.e., the distribution with degrees of freedom, and ; this is known as Wilks’ theorem.

For finite values of , is not necessarily distributed as a random variable. In the next section, we show a non-asymptotic bound on the difference between the cdf of and a random variable, where is the number of free parameters in our model.

## 3 Main Results

In this section, we present an explicit bound for the cdf of for any value of . Let us define

 P∗n≜Pr{Λn

Then, using Lemma 2, we may write

 P∗n=Pr{tTnJ−1ntn−(∥ln∥22√n¯Gn)2α′TJ−1nα′+α∥ln∥32√n¯Gn

As in the asymptotic case, for large222We need to be large enough so that Assumption 2 holds true. but finite , the behavior of the argument of (25) is dominated by the first quadratic term, where is close to the Fisher information matrix . Moreover, the effect of the remaining two terms in the argument of (25) is accounted as a change in the threshold ; thus, loosely the becomes

 Pr{tTnΣ−1tn

However, the following theorem establishes an explicit uniform bound on the behavior of .

###### Theorem 1.

For any choice of and , if Assumptions 1 and 2 hold true, then the following bound holds for :

 F(r,a−Δ(n)1+δs)−μ≤P∗n≤F(r,a+Δ(n)1−δs)+μ, (27)

where is the cdf of a random variable at point , ,

 μ≜ϵ(n,δ′)+2rexp(−2nδr)+h(θ0)√n, (28)

and the rest of the parameters used in (27) and (28) are defined in Table 1.

###### Proof.

See Section 4. ∎

Before proceeding with the proof of the Theorem, we show next that this bound recovers the already known asymptotic behavior of the log-likelihood ratio Owen1988empirical .

###### Corollary 1.

For sufficiently large such that , a more compact, albeit looser, representation of the bound in Theorem 1 is given as

 \absP∗n−F(r,a)≤min{μ′r,1}, (29)

where for

 μ′r≜μ+Δ(n)2(1−δs)(a2+Δ(n)2(1−δs))r2−1, (30)

and for

 (31)
###### Proof.

The Taylor expansion of the function with respect to the second component is expressed using the mean value theorem as

 F(r,a+δm)=F(r,a)+δm2(~a+2)r2−1e−~a+2, (32)

for and . Since , for we may bound (32) from above as follows

 F(r,a+δm)≤F(r,a)+δm2(a+δm2)r2−1, (33)

whereas for (binary case) we obtain

 F(1,a+δm)≤F(1,a)+δm2(a2)−12. (34)

The upper bound in (27) may thus be relaxed using (33) or (34).

On the other hand, again by the mean value theorem,

 F(r,a−δm)=F(r,a)−δm2(~a−2)r2−1e−~a−2, (35)

for and . Hence for we derive from (35) that

 F(r,a−δm)≥F(r,a)−δm2(a2)r2−1, (36)

while for

 F(1,a−δm)≥F(1,a)−δm2(a−δm2)−12. (37)

As a result, the bounds (36) and (37) can relax the lower bound in (27).

To obtain the compact bound (29), the tighter bound (36) for is omitted. Finally, since for small values of the quantity might be large, we may trivially bound the difference of two cdf with , which completes the proof. ∎

###### Proposition 1.

(High dimensional analysis) If the dimension is allowed to grow with respect to such that , the LLR is asymptotically distributed as a random variable, i.e.,

 ΛnL∼χ2r.
###### Proof.

Assume , then and for some . By choosing and with asymptotically large, given the definitions in Table 1, the parameter is of order

 ϵ(n,δ′) =O(exp(−n2c−max{2ζ,c+ζ−12})) =O(exp(−n2c−2ζ)), (38)

which decays exponentially fast.

Furthermore, let such that the condition of Theorem 1 holds. Then, we may see that if , the second term of (28) also converges exponentially fast:

 2rexp(−2nδr)=O(exp(−nc′−ζ)). (39)

The last term in (28) is of order and is the dominant term in the parameter . Now to verify that we have:

 Δ(n) =O(n−12+c)+O(n3ζ+2c′−1)+O(n32ζ+3c′2−12), (40)

which is converging in the area marked in Figure 1. This yields that,

 \absP∗n−F(r,a)≤O(n−12+c). (41)

Given that may be arbitrarily small, (41) illustrates the known asymptotic behavior called Wilks’ phenomena (Wilks1938 ; billingsley1961statistical ). Moreover, by taking the supremum value of in the marked area in Figure 1 and since , the chi-square approximation still holds if . ∎

## 4 Proof of Theorem 1

In the following, we first reformulate the argument of the probability (25) in order to separate the asymptotic and non-asymptotic terms. We then proceed to bound said terms from below and from above to obtain an upper and a lower bound on , respectively.

### 4.1 Preliminaries

The argument of (25) is dominated by its first term; using Lemma 1, it is easy to see that the other two terms become negligible for large . Furthermore, it was previously mentioned that the empirical information matrix , defined in (16), tends to the true Fisher information matrix , defined in (7), as the number of samples grows. However, for a finite , is likely to differ from . Let us define this difference as

 Rn≜Jn−Σ. (42)

Given that the argument of is a function of the inverse of , let us first write this quantity differently,

 J−1n =(I+Σ−1Rn)−1Σ−1 =Σ−1−∞∑k=1(Σ−1Rn)2k−1Σ−1+∞∑k=1(Σ−1Rn)2kΣ−1, (43)

where follows from (Horn2013, , Cor. 5.6.16) as long as the condition holds true. This condition would be fulfilled if as we show in the following lemma. In particular, Lemma 3 will be used later to bound the probability of violating that condition.

###### Lemma 3.

Let us define the matrix as

 B≜Σ−12RnΣ−12. (44)

Then, given the definitions of and ,

 ∥∥Σ−1Rn∥∥=∥B∥≤∥Rn∥. (45)
###### Proof.

The equality is due to the matrices being similar, whereas the proof of the inequality is deferred to C. ∎

We are now ready to reformulate the argument of (25). Consider,

 P∗n=Pr{A1−A2+A3

where

 A1 ≜tTnJ−1ntn, A2 ≜¯G2n∥ln∥42nα′TJ−1nα′, A3 ≜α¯Gn∥ln∥32√n. (47)

Furthermore, if , according to Lemma 3, may be expanded using (43) as

 A1≜A11−A12+A13, (48)

where

 A11 ≜tTnΣ−1tn, A12 ≜tTn(∞∑k=1(Σ−1Rn)2k−1Σ−1)tn, A13 ≜tTn(∞∑k=1(Σ−1Rn)2kΣ−1)tn. (49)

In order to have a simpler expression for these quantities, let us define

 v≜Σ−12tn, (50)

which is asymptotically a zero-mean Gaussian random vector with identity covariance matrix. Then, we may rewrite the terms in (49) as

 A11 =vTv, A12 =vT∞∑k=1B2k−1v, A13 =vT∞∑k=1B2kv, (51)

where was defined in (44).

In the following, we present an upper and a lower bound on (46); we show that the dominating term in the argument has a quadratic form, and thus is close to the cdf of a random variable. The remaining terms are bounded using concentration inequalities for and . Accordingly, let us define the events

 El ≜{∥ln∥22>δ}, (52) ER ≜{∥Rn∥>δ′}. (53)

Lemma 1 bounds , and we introduce the following lemma to bound .

###### Lemma 4.

For any and any , the following bound holds:

 (54)

where is defined in Table 1.

###### Proof.

The proof is an immediate result of the matrix Bernstein inequality (Tropp2015, , Thm. 1.6.2). See D for the complete proof. ∎

### 4.2 Upper Bound on P∗n

In order to bound and , note that for any positive integer , it holds that

 vTBsv≤∥B∥s∥v∥22. (55)

Then, consider the following

 −A12+A13 =−∞∑k=1vTB2k−1v+∞∑k=1vTB2kv (56)

where follows from (55) and the fact that we turn positive terms into negative ones, and stems from Lemma 3. Given that the statement of the Theorem specifies that , if the event occurs, we have that and we may then combine (48), (51), and (56) to obtain:

 (57)

where

 δs≜∞∑k=1(δ′)k=δ′1−δ′. (58)

Consider now the following bound for ,

 \absA2 =¯G2n∥ln∥42n\absα′TJ−1nα′ (a)≤¯G2nr34∥ln∥42n∥∥J−1n∥∥, (59)

where follows from the fact that for according to Lemma 2. We may also bound as

 \absA3 =\absα¯Gn∥ln∥32√n ≤¯Gnr323∥ln∥32√n (60)

where the inequality is due to according to Lemma 2.

From (46) and the bounds (59) and (60), we obtain that

 P∗n≤Pr⎧⎨⎩A1−¯G2nr34∥ln∥42n∥∥J−1n∥∥−¯Gnr323∥ln∥32√n

Before proceeding, we introduce the following two lemmas which enable us to derive refined bounds for and that only depend on and .

###### Lemma 5.

For any , if , the spectral norm of is bounded from above by

 ∥∥J−1n∥∥≤(1−∥Rn∥)−1. (62)

See E. ∎

###### Lemma 6.

According to the model definition, if , then

 (63)
###### Proof.

See F. ∎

Let and , and consider the following expansion of (61),

 P∗n (a)≤Pr⎧⎨⎩A1−r34∥ln∥42n∥∥J−1n∥∥¯G2n−r323∥ln∥32√n¯Gn

where is due to Lemma 4, follows from conditioned on the event , and the use of (57) and Lemma 5, stems from the fact that and the use of Lemma 1, is due to conditioned on the event , the use of Lemma 6, and the fact that . Finally, follows from the definition

 Δ(n)≜δsa+nr3δ2/[1−δ′](θmin−√rδ)6+2nr32δ323(θ% min−√rδ)3. (65)

The first term on the r.h.s. of (64) is the cdf of a quadratic form which asymptotically converges to a distribution. An explicit bound for the gap between the true and the distributions is found in bentkus2003dependence , which we restate here for completeness.

###### Lemma 7.

Let , then the following bound holds for the cdf of the quadratic term :

 supa\absPr{∥v∥22

where

 F(r,a)≜PG(r2,