# A useful variant of Wilks' theorem for grouped data

This paper provides a generalization of a classical result obtained by Wilks about the asymptotic behavior of the likelihood ratio. The new results deal with the asymptotic behavior of the joint distribution of a vector of likelihood ratios which turn out to be stochastically dependent, due to a suitable grouping of the data.


## 1 Introduction

Consider independent measurements of the same physical phenomenon, with the additional information of the times at which measurements are performed. This paper deals with the problem of testing statistical hypotheses when is large and only a small number of observations, concentrated in short time intervals (critical phenomena), are relevant to the study under investigation. The sample consists of , where are independent and identically distributed (iid) random variables, and . We make use of to split into vectors, as follows: fix a basic unit of time in such a way that the whole dataset corresponds to the observation of units of time, and define the random vector , for , whose components are the ’s such that . For instance, for a phenomenon with measurements taken every minute for multiple years, may be the number of hours in a year. To complete the picture, let be the set of all possible realizations of any trial, endowed with the σ-algebra , and consider a regular parametric model , where is an open subset of and, for every , is a probability density function with respect to a σ-finite reference measure on . The notion of a regular parametric model will be made precise in Section 2. The common probability density function of the ’s is denoted by , where is the true, but unknown, value of the parameter . The objective of our study is to test the null hypothesis against the alternative hypothesis , where denotes a proper subset of .

We define a testing procedure in terms of multiple likelihood ratio (LR) statistics, in accordance with the following principles: P1) each LR statistic is formed by gathering observations included in consecutive vectors ’s, i.e. observations ’s whose ’s belong to consecutive units of time, where is a suitable time window defined a priori with respect to an arbitrary choice of the “origin of time”; P2) is rejected only if at least LR statistics are sufficiently small, for a suitable choice of . The time window in P1) allows for tuning the LR statistics with respect to what, a priori, is considered to be the typical duration (length of time intervals) of critical phenomena that are supposed to be induced by . Then P2) is justified whenever it is desirable to have repeated manifestations of critical phenomena before accepting . Our testing procedure with multiple LR statistics based on P1) and P2) is motivated by the above premise that, among a large number of observations, only critical phenomena are relevant to the study. Indeed, if relevant observations are concentrated in short time intervals of duration less than units of time, then the analysis based on a single LR statistic would be meaningless, since the overwhelming majority of observations would always lead to accepting . On the contrary, the application of P1), in conjunction with a reasonable choice of the time window , ensures that observations may be relevant with respect to a subgroup of observations detected during a period of units of time.

Wilks’ theorem on large sample asymptotics for LR statistics (Wilks (1938) and Wald (1943)) may be applied to devise a testing procedure fulfilling principles P1) and P2). Let , with being the number of LR statistics, and let be the collected data. Then we define the vector of LR statistics , where is obtained by gathering data belonging to vectors from to , i.e.

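$$\Lambda^{(st)}_i := \Lambda^{(st)}_i\big(x^{((i-1)G+1)}; \dots; x^{(iG)}\big) := \frac{\sup_{\theta \in \Theta_0} \prod_{p=(i-1)G+1}^{iG} \prod_{j=1}^{n_p} f\big(x^{(p)}_j; \theta\big)}{\sup_{\theta \in \Theta} \prod_{p=(i-1)G+1}^{iG} \prod_{j=1}^{n_p} f\big(x^{(p)}_j; \theta\big)} \tag{1}$$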

for . Note that, in this framework, the components of turn out to be stochastically independent, since the groups of vectors just considered are disjoint. Then, reject if at least of the ’s are less than some reference value . Due to independence, the probability of type I error can be evaluated by means of the binomial formula as , where the probability that a single is less than can be approximated, with sufficiently good precision, by resorting to Wilks’ theorem. In fact, from this theorem one has: if is a regular parametric model and is an -dimensional () sub-manifold of , then, under , the probability distribution function of converges weakly, for every , to a standard chi-square distribution with degrees of freedom, as go to infinity.
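The binomial evaluation of the type I error described above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: the names `chi2_sf` and `type_one_error`, and the arguments `M` (number of independent LR statistics), `k` (minimum number of small statistics required to reject), `lam` (reference value) and `df` (degrees of freedom of the limiting chi-square law) are all hypothetical. It assumes that Wilks' theorem is applied in its usual form, so that the probability of a single LR statistic falling below `lam` is approximated by the chi-square tail probability at $-2\log$ `lam`.

```python
import math

def chi2_sf(x, df):
    """Upper tail P(chi-square_df > x) for an integer df >= 1.

    Uses the recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, with y = x / 2.
    """
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(y))   # df = 1 base case
    else:
        a, q = 1.0, math.exp(-y)              # df = 2 base case
    while 2 * a < df:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q

def type_one_error(M, k, lam, df):
    """Approximate P(at least k of M independent LR statistics are below lam).

    Assumption: each LR statistic satisfies P(Lambda_i < lam) ~ P(chi2_df > -2 log lam).
    """
    p = chi2_sf(-2.0 * math.log(lam), df)
    return sum(math.comb(M, j) * p**j * (1.0 - p)**(M - j) for j in range(k, M + 1))
```

For instance, with `M = 10`, `k = 2`, `df = 1` and `lam` chosen so that each single test has level 0.05, `type_one_error` returns about 0.086, which is the binomial probability of at least two successes in ten trials with success probability 0.05.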

While the above testing procedure is simple and supported by Wilks’ theorem, the number of LR statistics less than may be affected by the arbitrary choice of the “origin of time”, in connection with P1). Indeed, since is supported by critical phenomena of duration less than units of time, each of these phenomena is fully captured by an LR statistic only if both its initial time and its final time belong to the interval . On the contrary, if the initial time of a critical phenomenon belongs to and the final time belongs to , such a phenomenon is not captured, or is only partially captured, with both the LR statistic and the LR statistic being possibly greater than . The application of Wilks’ theorem thus implies a specific choice for the “origin of time”, unless one neglects observations belonging to units of time in between time windows. Clearly, this may markedly affect the decision process. In this paper we propose an alternative testing procedure which overcomes the problem of the arbitrary choice of the “origin of time”, while fulfilling principles P1) and P2). According to our procedure, for any choice of the time window it is no longer possible to neglect a critical phenomenon (of duration less than units of time) starting at the time interval and ending at the time interval . Indeed, there will always exist another time interval, in a new finer subdivision, which contains both the initial and the final time instants of the critical phenomenon. The proposed approach relies on a novel Wilks’ theorem for grouped data, which leads to a rejection event that includes the corresponding rejection event based on Wilks’ theorem. That is, our testing procedure is more powerful than the above Wilks’ testing procedure.

## 2 Methodology

Consider groups of consecutive vectors, the -th group consisting of those vectors that are numbered from to , where . Once the data are collected in the form , we associate an LR statistic with each group, obtaining the vector of LR statistics defined by

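$$\Lambda^{(new)}_i := \Lambda^{(new)}_i\big(x^{(i)}; \dots; x^{(i+G-1)}\big) := \frac{\sup_{\theta \in \Theta_0} \prod_{p=i}^{i+G-1} \prod_{j=1}^{n_p} f\big(x^{(p)}_j \mid \theta\big)}{\sup_{\theta \in \Theta} \prod_{p=i}^{i+G-1} \prod_{j=1}^{n_p} f\big(x^{(p)}_j \mid \theta\big)} \tag{2}$$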

for . Unlike , the components ’s of are no longer independent. Therefore, our testing procedure will deal with the joint probability distribution of , and in particular with its asymptotic behaviour for large values of the sample sizes . Our result will not provide weak convergence of towards a specific limiting distribution, but only a merging phenomenon, in the following sense: after fixing a distance to compare probability distributions on , we will provide an approximating sequence such that the distance between the probability distribution of and the corresponding element of the approximating sequence goes to zero as go to infinity. The approximating sequence depends on the data only through the sample sizes , and it depends neither on the model nor on the choice of . With such a theoretical result at our disposal, we can describe a testing procedure which overcomes the problem of the arbitrary choice of the “origin of time” while fulfilling principles P1) and P2). Such a procedure consists of rejecting whenever there are at least of the ’s, say , with , which are less than . Formally, the rejection rule corresponds to considering the event

 ∪1≤i1<⋯

whose probability can be evaluated after knowing the joint probability distribution of . Theorem 2 below provides an explicit approximation of such a joint probability distribution for LR statistics.
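The difference between the disjoint grouping in (1) and the overlapping grouping in (2) can be made concrete with a small sketch. Everything below is illustrative and rests on stated assumptions: the helper names are hypothetical, the number of overlapping windows is taken to be $P - G + 1$ (consistent with the last window in Theorem 1 running over units $M$ to $P$), and the model is the Gaussian location model $N(\theta, 1)$ with null hypothesis $\theta = 0$, which is not taken from the paper. For that model the LR statistic has the closed form $-2\log\Lambda = n\,\bar{x}^2$, where $n$ and $\bar{x}$ are the number and the mean of the observations in the window.

```python
def disjoint_windows(P, G):
    """Windows of (1): units (i-1)G+1, ..., iG (1-based, inclusive)."""
    return [((i - 1) * G + 1, i * G) for i in range(1, P // G + 1)]

def sliding_windows(P, G):
    """Windows of (2): units i, ..., i+G-1 for i = 1, ..., P-G+1 (assumed M)."""
    return [(i, i + G - 1) for i in range(1, P - G + 2)]

def neg2_log_lr(units, window):
    """-2 log LR for H0: theta = 0 in the N(theta, 1) model (illustrative choice).

    `units[p-1]` holds the observations recorded during unit of time p.
    """
    lo, hi = window
    xs = [x for p in range(lo, hi + 1) for x in units[p - 1]]
    n = len(xs)
    return n * (sum(xs) / n) ** 2 if n else 0.0

# A burst that straddles the boundary between the two disjoint windows.
units = [[0.0, 0.0], [0.0, 0.0], [2.0, 2.0], [2.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
P, G = len(units), 3
standard = [neg2_log_lr(units, w) for w in disjoint_windows(P, G)]
new = [neg2_log_lr(units, w) for w in sliding_windows(P, G)]
```

Every disjoint window also appears among the sliding windows, so any LR statistic that is small under the standard grouping is also small under the new one. In this toy dataset the burst is split across the two disjoint windows, while the sliding windows starting at units 2 and 3 contain it entirely and yield a larger value of $-2\log\Lambda$, that is, a smaller LR statistic.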

Before stating Theorem 2, it is worth recalling that the parametric model is called regular when the following conditions are met:

1. , belongs to ;

2. the set does not depend on and ;

3. for any measurable function satisfying for all , derivatives of first and second order (with respect to ) may be passed under the integral sign in ;

4. for any , there exist a measurable function and such that

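$$\int_{\mathbb{X}} K_0(x)\, f(x;\theta_0)\, \nu(dx) < +\infty, \qquad \sup_{|\theta - \theta_0| \le \delta_0} \left| \frac{\partial^2}{\partial\theta_i\, \partial\theta_j} \log f(x;\theta) \right| \le K_0(x) \quad \forall\, x \in \mathbb{X},\ i,j = 1,\dots,d;$$

5. the Fisher information matrix $I(\theta)$, given by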

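$$I_{i,j}(\theta) := -\int_{\mathbb{X}} \left( \frac{\partial^2}{\partial\theta_i\, \partial\theta_j} \log f(x;\theta) \right) f(x;\theta)\, \nu(dx), \tag{3}$$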

is well-defined and positive definite at every value of ;

6. the model is identified, i.e. entails .
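As a quick check of how (3) is used, consider the Poisson model, chosen here purely for illustration and not taken from the paper: $f(x;\theta) = e^{-\theta}\theta^{x}/x!$ for $x \in \{0, 1, 2, \dots\}$, with $\Theta = (0, +\infty)$, $d = 1$ and $\nu$ the counting measure. Since $\log f(x;\theta) = -\theta + x\log\theta - \log x!$, one gets

$$\frac{\partial^2}{\partial\theta^2} \log f(x;\theta) = -\frac{x}{\theta^2}, \qquad I(\theta) = \sum_{x \ge 0} \frac{x}{\theta^2}\, f(x;\theta) = \frac{1}{\theta} > 0,$$

so the Fisher information is well-defined and positive at every $\theta \in \Theta$, as the fifth condition requires.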

In addition, in order to avoid technical (but not conceptual) complications in the proofs, we require that a maximum likelihood estimator (MLE) actually exists as a point of , meaning that such an MLE must coincide with a root of the likelihood equation. More formally, we assume that

7. , there exists a measurable function such that

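$$\sup_{\theta \in \Theta} \left[ \prod_{j=1}^{n} f(x_j;\theta) \right] = \prod_{j=1}^{n} f\big(x_j;\, t_n(x_1,\dots,x_n)\big) \qquad \forall\, (x_1,\dots,x_n) \in \mathbb{X}^n. \tag{4}$$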

To formalize the concept of approximating sequence, we must introduce a suitable distance to compare probability distributions on . See, e.g., Gibbs and Su (2002) or Chapter 2 of Senatov (1998) for a comprehensive treatment of distances for probability distributions. Among the various possible distances, we select the Lévy-Prokhorov distance , which is particularly meaningful with respect to our problem. Specifically, given a pair of probability measures on ,

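$$D_l(\mu_1;\mu_2) := \inf\big\{ \varepsilon > 0 \ \big|\ \mu_1(B) \le \mu_2(B^{\varepsilon}) + \varepsilon,\ \mu_2(B) \le \mu_1(B^{\varepsilon}) + \varepsilon,\ \forall\, B \in \mathcal{B}(\mathbb{R}^l) \big\},$$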

where . The distance is often used in the context of multidimensional extensions of the Berry-Esseen estimate, being related to the concept of weak convergence of probability measures (see, e.g., Section 11.3 of Dudley (2002)).

Now we can state our first result, which deals with the asymptotic normality of the vector of MLEs, whose components are defined by , for , with the same as in (4).

###### Theorem 1

Let be the true, but unknown, value of , and let the conditions of regularity C1)-C7) for the parametric model be satisfied. Then, the probability distribution of

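$$\left( \sqrt{\sum_{k=1}^{G} n_k} \cdot \big(\hat{\theta}_{n_1,\dots,n_G} - \theta_0\big),\ \dots,\ \sqrt{\sum_{k=M}^{P} n_k} \cdot \big(\hat{\theta}_{n_M,\dots,n_P} - \theta_0\big) \right),$$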

meets

$$D_{dM}\Big( \mu^{(dM)}_{n_1,\dots,n_P};\ \gamma^{(dM)}\big(R_M,\, I(\theta_0)^{-1}\big) \Big) \to 0 \tag{5}$$

as , where:

• is the matrix whose elements are given by

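$$\rho_{i,j}(n_1,\dots,n_P) := \begin{cases} 0 & \text{if } i,j \in \{1,\dots,M\},\ |i-j| \ge G \\[4pt] \dfrac{\sum_{p=a(i,j)}^{b(i,j)} n_p}{\sqrt{\sum_{q=i}^{i+G-1} \sum_{l=j}^{j+G-1} n_q n_l}} & \text{if } i,j \in \{1,\dots,M\},\ |i-j| < G \end{cases} \tag{6}$$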

with and ;

• is defined by means of (3);

• is the -dimensional Gaussian probability distribution with zero means and covariance matrix

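$$\begin{pmatrix} \rho_{1,1} I(\theta_0)^{-1} & \rho_{1,2} I(\theta_0)^{-1} & \dots & \rho_{1,M} I(\theta_0)^{-1} \\ \rho_{2,1} I(\theta_0)^{-1} & \rho_{2,2} I(\theta_0)^{-1} & \dots & \rho_{2,M} I(\theta_0)^{-1} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{M,1} I(\theta_0)^{-1} & \rho_{M,2} I(\theta_0)^{-1} & \dots & \rho_{M,M} I(\theta_0)^{-1} \end{pmatrix}. \tag{7}$$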

It is worth noticing that for , and that the matrix is positive-definite, as it coincides with the covariance matrix of the Gaussian random vector where and is a vector of independent real random variables with .
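The correlation structure in Theorem 1 can be computed directly from the sample sizes. The sketch below is a hedged illustration: the function name `rho_matrix` is hypothetical, indices are 0-based in the code, and the summation limits are assumed to be $a(i,j) = \max(i,j)$ and $b(i,j) = \min(i,j) + G - 1$, that is, the units of time shared by windows $i$ and $j$. This assumption is consistent with reading the matrix as the covariance of normalized sums of independent variables over overlapping windows, but it is not quoted from the source. The double sum under the square root factorizes as the product of the two window totals.

```python
import math

def rho_matrix(n, G):
    """Correlation matrix (rho_{i,j}) built from the sample sizes n_1, ..., n_P.

    n[p] is the number of observations in unit of time p + 1 (0-based list).
    Returns an M x M list of lists with M = P - G + 1.
    """
    P = len(n)
    M = P - G + 1
    totals = [sum(n[i:i + G]) for i in range(M)]   # observations per window
    R = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(M):
            if abs(i - j) < G:
                # Assumed limits: units shared by windows i and j.
                a, b = max(i, j), min(i, j) + G - 1
                R[i][j] = sum(n[a:b + 1]) / math.sqrt(totals[i] * totals[j])
    return R
```

When all units of time contain the same number of observations, the entries reduce to $\rho_{i,j} = 1 - |i-j|/G$ for $|i-j| < G$, so neighbouring windows are strongly correlated and the correlation decays linearly with the shift between windows.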

As a consequence of Theorem 1, we can state the main result of the paper.

###### Theorem 2

Let be an -dimensional sub-manifold of , with , and let the conditions of regularity C1)-C7) for the parametric model be satisfied. If , for , then, under , the probability distribution of meets

 (8)

as , where:

• ;

• stands for the probability distribution of the -dimensional random vector

$$\left( \sum_{h=1}^{r} Z_{h;1}^{2},\ \sum_{h=1}^{r} Z_{h;2}^{2},\ \dots,\ \sum_{h=1}^{r} Z_{h;M}^{2} \right);$$

• the -dimensional random vector is jointly Gaussian with zero means and covariance matrix given by

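$$\begin{cases} \operatorname{Var}(Z_{h;i}) = 1 & \text{if } h = 1,\dots,r \text{ and } i = 1,\dots,M \\ \operatorname{Cov}(Z_{h;i}, Z_{l;j}) = 0 & \text{if } h \ne l \text{ and } i,j = 1,\dots,M \\ \operatorname{Cov}(Z_{h;i}, Z_{h;j}) = 0 & \text{if } |i-j| \ge G \text{ and } h = 1,\dots,r \\ \operatorname{Cov}(Z_{h;i}, Z_{h;j}) = \rho_{i,j} & \text{if } |i-j| < G \text{ and } h = 1,\dots,r \end{cases}$$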

From a theoretical perspective, there is a clear improvement in using the new testing procedure based on Theorem 2 rather than the standard testing procedure based on Wilks’ theorem. This is because the new rejection event includes its standard counterpart, so that the new testing procedure turns out to be more powerful than the standard one. Moreover, the problem of the arbitrary choice of the “origin of time” is now fully resolved. Indeed, it is no longer possible to neglect a critical phenomenon (of duration less than units of time) starting at the time interval and ending at the time interval , for any choice of .
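Since the limiting vector in Theorem 2 has dependent components, the probability of the rejection event is no longer given by a binomial formula, but it can be approximated by simulation. The sketch below is only one possible way to do this, under stated assumptions: the function name and all arguments are hypothetical; the threshold `c` is assumed to live on the chi-square scale, so that an LR statistic below a reference value corresponds to a component of the limiting vector above `c`; and the Gaussian variables are built as normalized window sums of independent standard normals weighted by the square roots of the sample sizes, a construction whose covariance equals the overlap of two windows divided by the geometric mean of their totals.

```python
import math
import random

def simulate_rejection_prob(n, G, r, k, c, reps=20000, seed=0):
    """Monte Carlo estimate of P(at least k of the M components exceed c),
    where component i is sum_{h=1}^{r} Z_{h;i}^2 and M = len(n) - G + 1.
    """
    rng = random.Random(seed)
    P = len(n)
    M = P - G + 1
    w = [math.sqrt(v) for v in n]                          # sqrt(n_p)
    norm = [math.sqrt(sum(n[i:i + G])) for i in range(M)]  # sqrt of window totals
    hits = 0
    for _ in range(reps):
        S = [0.0] * M
        for _h in range(r):
            W = [rng.gauss(0.0, 1.0) for _ in range(P)]    # independent N(0, 1)
            for i in range(M):
                z = sum(w[p] * W[p] for p in range(i, i + G)) / norm[i]
                S[i] += z * z
        if sum(s > c for s in S) >= k:
            hits += 1
    return hits / reps
```

With a single window (`G` equal to the number of units) and `r = 1`, the estimate reduces to a chi-square tail probability with one degree of freedom, which gives a simple way to sanity-check the simulation before using it with overlapping windows.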

## 3 Discussion

We considered testing hypotheses under this setting: a large number of independent measurements of which only a small number, concentrated in short periods, are relevant to the study under investigation. Our motivating example comes from recent works on detection of γ-ray astrophysical sources under the AGILE project (http://agile.asdc.asi.it). See, e.g., Bulgarelli et al. (2012) and Bulgarelli et al. (2014). The ’s are associated with measurements of photons, with the information being the position of the photon in the sky and its energy. The basic unit of time is the hour, and the iid assumption is motivated by the fact that the region of the sky under investigation is invariant for the duration of the AGILE project (5 years). The dataset consists of a huge number of observations, but only a small number of them, concentrated in periods of less than 24 hours, are relevant. Indeed, the number of photons ascribable to distinct astrophysical sources (e.g., supernova remnants, black hole binaries and pulsar wind nebulae) is much smaller than the total number of observed photons. Bulgarelli et al. (2012) relied on the statistic (1), with , for testing certain hypotheses related to the detection of γ-ray astrophysical sources. In this paper we discussed how (1), with an arbitrary choice of the “origin of time”, may lead to a meaningless analysis. We then introduced an alternative, and more powerful, test that allows for an arbitrary choice of the “origin of time”. Such a procedure relies on the novel Wilks’ theorem for grouped data, which may be of independent interest. Since a precise formulation of the problem in Bulgarelli et al. (2012) would require introducing certain (technical) protocols of the AGILE project, we defer the application of our approach to a companion paper for a journal in astrophysics.

## 4 Proofs

The proofs of the main theorems are based on the following three lemmas.

###### Lemma 3

Let and be two sequences of p.m.’s on . If is tight and as , then is also tight.

Proof of Lemma 3. For any , denote by a positive number such that , where . Then, putting , fix for which for every . Since holds for every , one gets

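$$\beta_n\big(B^{c}_{3\rho(\varepsilon)/2}\big) \le \beta'_n\Big(\big(B^{c}_{3\rho(\varepsilon)/2}\big)^{\delta_n}\Big) + \delta_n \le \beta'_n\big(B^{c}_{\rho(\varepsilon)}\big) + \varepsilon/3 \le 2\varepsilon/3$$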

for every . The proof is now completed since it is always possible to find a positive number such that .

For the statement of the second lemma, let and be two families of random elements, indexed by , such that belongs to and is an element of the space of matrices with real entries. It is also required that and depend on only through , for any . Let stand for the probability laws of the vector and assume that

 (9)

in probability as , for suitable non-random matrices . For completeness, the distance between the two vectors of matrices is measured by , where denotes the Frobenius norm. Moreover, for any elements of the space of matrices with real entries, write to indicate the linear mapping . Finally, let and denote the probability laws of and , respectively, and let and denote the probability laws of and , respectively.

###### Lemma 4

Let (9) be in force.

1. If is a tight family of probability laws, there hold

in probability as . In particular,

as .

2. If are non-singular and as , for some tight family of probability laws on , then as , where designates the composition of mappings.

Proof of Lemma 4. Thanks to the tightness of , for any , there exists a compact subset of , say , such that . Whence, for any ,

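$$P\Big[ \big| \big(Q^{(i)}_n - \overline{Q}^{(i)}\big) Y^{(i)}_n \big| > \varepsilon \Big] \le P\Big[ \big\| Q^{(i)}_n - \overline{Q}^{(i)} \big\|_F \cdot \big| Y^{(i)}_n \big| > \varepsilon \Big] \le P\big[ Y^{(i)}_n \notin K_\delta \big] + P\Big[ \big\| Q^{(i)}_n - \overline{Q}^{(i)} \big\|_F > \varepsilon \big/ \big( \sup_{u \in K_\delta} |u| \big) \Big]$$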

leading to , by the arbitrariness of . The thesis follows by recalling that the convergence in probability to zero of a sequence of random vectors amounts to the convergence in probability to zero of the sequences of the single components. Moreover, the same argument can be applied to prove the convergence in probability to zero of . To prove the merging of the probability distributions, consider the so-called Fortet-Mourier distance, defined as follows. Given two probability measures and on , set

where denotes the space of real-valued functions on with . To prove that , fix and write, for arbitrary ,

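$$\left| \int_{\mathbb{R}^{dM}} h(u)\, \beta^{(dM)}_n(du) - \int_{\mathbb{R}^{dM}} h(u)\, \overline{\beta}^{(dM)}_n(du) \right| \le 2 \sum_{i=1}^{M} P\big[ Y^{(i)}_n \notin K_\delta \big] + 2 \sum_{i=1}^{M} P\Big[ \big\| Q^{(i)}_n - \overline{Q}^{(i)} \big\|_F \ge \eta \Big] + \eta M \sup_{u \in K_\delta} |u|.$$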

Therefore, for any , choose and to obtain

$$\limsup_{n_1,\dots,n_P \to +\infty} D^{*}_{dM}\big( \beta^{(dM)}_n;\ \overline{\beta}^{(dM)}_n \big) \le \varepsilon,$$

which is tantamount to saying that , as . Finally, the thesis follows from the metric equivalence between the Prokhorov and the Fortet-Mourier distance, stated, e.g., in Theorem 11.3.3 of Dudley (2002). Again, an analogous argument shows that as , completing the proof of point .

To prove point , consider again the Fortet-Mourier distance and set to write

 D∗dM(ξ(dM)n;ω(dM)n∘L[¯¯¯¯Q(1),…,¯¯¯¯Q(M)]) ≤ + suph∈BL1(RdM)∣∣E[h(¯¯¯P(1)Q(