# Divergence measures estimation and its asymptotic normality theory: Discrete case

In this paper we provide the asymptotic theory of general ϕ-divergence measures, based on the empirical probability distribution.


## 1. Introduction

### 1.1. Motivations

In this paper, we study the convergence of empirical discrete probability distributions supported on a finite set. Throughout the following, let $X$ be a finite countable space. The probability distributions on $X$ are the finite-dimensional vectors $p$ in

$$\mathcal{P}(X)=\Big\{p=(p_c)_{c\in X} : p_c\geq 0,\ \forall c\in X, \ \text{and} \ \sum_{c\in X}p_c=1\Big\}.$$

A divergence measure on $\mathcal{P}(X)$ is a function

(1.1)
$$D :\ (\mathcal{P}(X))^2 \longrightarrow \overline{\mathbb{R}}, \qquad (p,q)\longmapsto D(p,q),$$

such that $D(p,p)=0$ for any $p$ in the domain of application of $D$.

The function $D$ is not necessarily an application (it may be undefined at some pairs). And when it is, it is not always symmetric, and it does not have to be a metric either. In the absence of symmetry, the following more general notation is more appropriate:

(1.2)
$$D :\ \mathcal{P}_1(X)\times\mathcal{P}_2(X) \longrightarrow \overline{\mathbb{R}}, \qquad (p,q)\longmapsto D(p,q),$$

where $\mathcal{P}_1(X)$ and $\mathcal{P}_2(X)$ are two families of probability distributions on $X$, not necessarily the same. To better explain our concern, let us introduce some of the most celebrated divergence measures.

We may present the following divergence measures. Let $X$ be of cardinality $r\geq 1$, and let $p$ and $q$ be two probability distributions on $X$.

(1) The $L_2$-divergence measure:

(1.3)
$$D_{L_2}(p,q)=\sum_{j=1}^{r}(p_j-q_j)^2.$$

(2) The family of Rényi divergence measures indexed by $\alpha>0$, $\alpha\neq 1$, known under the name of Rényi-$\alpha$:

(1.4)
$$D_{R,\alpha}(p,q)=\frac{1}{\alpha-1}\log\Bigg(\sum_{j=1}^{r}p_j^{\alpha}q_j^{1-\alpha}\Bigg).$$

(3) The family of Tsallis divergence measures indexed by $\alpha>0$, $\alpha\neq 1$, also known under the name of Tsallis-$\alpha$:

(1.5)
$$D_{T,\alpha}(p,q)=\frac{1}{\alpha-1}\Bigg(\sum_{j=1}^{r}p_j^{\alpha}q_j^{1-\alpha}-1\Bigg);$$

(4) The Kullback-Leibler divergence measure:

(1.6)
$$D_{KL}(p,q)=\sum_{j=1}^{r}p_j\log(p_j/q_j).$$

The latter, the Kullback-Leibler measure, may be interpreted as a limit case of both the Rényi and the Tsallis families by letting $\alpha\to 1$. As well, for $\alpha$ near 1, the Tsallis family may be seen as derived from the Rényi family based on the first-order expansion of the logarithm function in the neighborhood of unity.
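To make the four measures above concrete, here is a minimal, stdlib-only Python sketch (the function names are ours, not the paper's); it also illustrates numerically that the Rényi and Tsallis families approach the Kullback-Leibler divergence as $\alpha\to 1$.

```python
import math

def d_l2(p, q):
    # L2-divergence (1.3): sum of squared coordinate differences.
    return sum((pj - qj) ** 2 for pj, qj in zip(p, q))

def i_alpha(p, q, alpha):
    # The functional I_alpha(p, q) = sum_j p_j^alpha * q_j^(1 - alpha).
    return sum(pj ** alpha * qj ** (1 - alpha) for pj, qj in zip(p, q))

def d_renyi(p, q, alpha):
    # Renyi-alpha divergence (1.4), alpha > 0, alpha != 1.
    return math.log(i_alpha(p, q, alpha)) / (alpha - 1)

def d_tsallis(p, q, alpha):
    # Tsallis-alpha divergence (1.5), alpha > 0, alpha != 1.
    return (i_alpha(p, q, alpha) - 1) / (alpha - 1)

def d_kl(p, q):
    # Kullback-Leibler divergence (1.6): the alpha -> 1 limit of both families.
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q))
```

For instance, with `p = [0.2, 0.3, 0.5]` and `q = [0.25, 0.25, 0.5]`, both `d_tsallis(p, q, 1.0001)` and `d_renyi(p, q, 1.0001)` agree with `d_kl(p, q)` to about four decimal places.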

From this small sample of divergence measures, we may give the following remarks.

(a) The $L_2$-divergence measure is both an application and a metric on $\mathcal{P}_2(X)^2$, where $\mathcal{P}_2(X)$ is the class of probability measures $p$ on $X$ such that

$$\sum_j p_j^2<+\infty.$$

(b) For both the Rényi and the Tsallis families, we may have computation problems and lack of symmetry. Let us give examples. It is clear from the very form of these divergence measures that we do not have symmetry, except in the special case $\alpha=1/2$. Both families are built on the functional

$$I_\alpha(p,q)=\sum_j p_j^{\alpha}q_j^{1-\alpha}.$$

### 1.2. Previous work and main contributions

Our main contribution may be summarized as follows: for data sampled from one or two unknown random variables, we derive almost-sure convergence and central limit theorems for empirical divergences.

## 2. Distribution limit for empirical ϕ-divergence

### 2.1. Notation and definitions

Before we state the main results, we need a few definitions. Define the empirical probability distribution generated by i.i.d. random variables $X_1,\dots,X_n$ from the probability distribution $p$ as

(2.1)
$$\hat p_n=(\hat p_{c,n})_{c\in X}, \quad\text{where}\quad \hat p_{c_j,n}=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{c_j}(X_i),$$

and $\hat q_m$ is defined in the same way from i.i.d. random variables $Y_1,\dots,Y_m$ from $q$, that is,

(2.2)
$$\hat q_m=(\hat q_{c,m})_{c\in X}, \quad\text{where}\quad \hat q_{c_j,m}=\frac{1}{m}\sum_{i=1}^{m}\mathbf{1}_{c_j}(Y_i).$$
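The empirical distributions (2.1)–(2.2) are straightforward to compute; a small stdlib-only Python sketch (the helper name `empirical_pmf` is ours, and it assumes every observation lies in the given support):

```python
from collections import Counter

def empirical_pmf(sample, support):
    # Empirical probability distribution over a finite support:
    # p_hat_n(c) = (1/n) * #{ i : X_i = c }, as in (2.1).
    counts = Counter(sample)
    n = len(sample)
    return {c: counts.get(c, 0) / n for c in support}
```

By construction the masses are nonnegative and sum to one, so the returned vector is itself an element of $\mathcal{P}(X)$.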
###### Definition 1.

The $\phi$-divergence between the two probability distributions $p$ and $q$ is given by

(2.3)
$$J(p,q)=\sum_{j\in D}\phi(p_j,q_j),$$

where $\phi$ is a measurable function on which we will impose the appropriate conditions.

The results on the functional $J$ will lead to those on the particular cases of the Rényi, Tsallis, and Kullback-Leibler measures.

### 2.2. Main results

Since, for a fixed $j$, $n\hat p_{c_j,n}$ has a binomial distribution with parameters $n$ and success probability $p_j$, we have

(2.4)
$$\mathbb{E}\big[\hat p_{c_j,n}\big]=p_j \quad\text{and}\quad \mathbb{V}\big(\hat p_{c_j,n}\big)=\frac{p_j(1-p_j)}{n}.$$

Furthermore, by the strong law of large numbers, $\hat p_{c_j,n}$ converges almost surely (and hence in probability) to $p_j$ for every fixed $j$. By the central limit theorem,

(2.5)
$$\sqrt{n}\,\frac{\hat p_{c_j,n}-p_j}{\sqrt{p_j(1-p_j)}}\rightsquigarrow \mathcal{N}(0,1) \quad\text{as}\quad n\to+\infty,$$

where we use the symbol $\rightsquigarrow$ to denote convergence in distribution.

Also, for a fixed $j$, we have

(2.6)
$$\sqrt{m}\,\frac{\hat q_{c_j,m}-q_j}{\sqrt{q_j(1-q_j)}}\rightsquigarrow \mathcal{N}(0,1) \quad\text{as}\quad m\to+\infty.$$

More generally, since $(X_1,\dots,X_n)$ is a sample of size $n$ from a multinomial distribution with probabilities $p$, we have (see Lo et al. (2016))

$$\sqrt{n}\,(\hat p_n-p)\rightsquigarrow \mathcal{N}_r(0,\Sigma(p)) \quad\text{as}\quad n\to+\infty,$$

where $\Sigma(p)$ is the multinomial covariance matrix given by

$$\Sigma(p)=\begin{bmatrix}
p_1(1-p_1) & -p_1p_2 & \cdots & -p_1p_r\\
-p_2p_1 & p_2(1-p_2) & \cdots & -p_2p_r\\
\vdots & \vdots & \ddots & \vdots\\
-p_rp_1 & -p_rp_2 & \cdots & p_r(1-p_r)
\end{bmatrix}.$$
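The matrix $\Sigma(p)$ has a simple closed form, so it can be generated directly; a minimal stdlib-only sketch (the function name is ours):

```python
def multinomial_covariance(p):
    # Sigma(p) for the limit of sqrt(n) * (p_hat_n - p):
    # diagonal entries p_j * (1 - p_j), off-diagonal entries -p_j * p_k.
    r = len(p)
    return [[p[j] * (1 - p[j]) if j == k else -p[j] * p[k]
             for k in range(r)]
            for j in range(r)]
```

Each row of $\Sigma(p)$ sums to zero, reflecting the fact that the components of $\hat p_n$ sum to one; in particular $\Sigma(p)$ is singular.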

## 3. Asymptotic theory for ϕ-divergence measure

### 3.1. Boundedness assumption and notations

Define

$$D=\big\{j\in\{1,2,\dots,r\} \ \text{such that} \ p_j\geq\kappa>0 \ \text{and} \ q_j\geq\kappa>0\big\}.$$

Let

$$J(p,q)=\sum_{j\in D}\phi(p_j,q_j),$$

where $\phi$ is a measurable function having continuous second-order partial derivatives, denoted as follows:

$$\phi^{(1)}_1(s,t)=\frac{\partial\phi}{\partial s}(s,t), \qquad \phi^{(1)}_2(s,t)=\frac{\partial\phi}{\partial t}(s,t),$$

and

$$\phi^{(2)}_1(s,t)=\frac{\partial^2\phi}{\partial s^2}(s,t), \qquad \phi^{(2)}_2(s,t)=\frac{\partial^2\phi}{\partial t^2}(s,t), \qquad \phi^{(2)}_{1,2}(s,t)=\phi^{(2)}_{2,1}(s,t)=\frac{\partial^2\phi}{\partial s\,\partial t}(s,t).$$

Set

$$A_{1,p}=\sum_{j\in D}\big|\phi^{(1)}_1(p_j,q_j)\big|, \qquad A_{2,q}=\sum_{j\in D}\big|\phi^{(1)}_2(p_j,q_j)\big|,$$

$$A_{3,q}=\sum_{j\in D}\big|\phi^{(1)}_1(q_j,p_j)\big|, \quad\text{and}\quad A_{4,p}=\sum_{j\in D}\big|\phi^{(1)}_2(q_j,p_j)\big|.$$

Based on (2.1) and (2.2), we will use the following empirical $\phi$-divergences:

$$J(\hat p_n,q)=\sum_{j\in D}\phi(\hat p_{c_j,n},q_j), \qquad J(p,\hat q_m)=\sum_{j\in D}\phi(p_j,\hat q_{c_j,m}), \quad\text{and}\quad J(\hat p_n,\hat q_m)=\sum_{j\in D}\phi(\hat p_{c_j,n},\hat q_{c_j,m}).$$

Set

(3.1)
$$a_n=\sup_{j\in D}\big|\hat p_{c_j,n}-p_j\big|, \qquad b_m=\sup_{j\in D}\big|\hat q_{c_j,m}-q_j\big|, \quad\text{and}\quad c_{n,m}=\max(a_n,b_m).$$

### 3.2. Statements of the main results

The first result concerns the almost-sure efficiency of the estimators.

###### Theorem 1.

Let $X$ be a finite countable space, and let $\hat p_n$ and $\hat q_m$ be generated by i.i.d. samples from $p$ and $q$, respectively. Then the following asymptotic results hold for the empirical $\phi$-divergences.

• One sample:

(3.2)
$$\limsup_{n\to+\infty}\frac{\big|J(\hat p_n,q)-J(p,q)\big|}{a_n}\leq A_{1,p} \quad\text{a.s.},$$

(3.3)
$$\limsup_{m\to+\infty}\frac{\big|J(p,\hat q_m)-J(p,q)\big|}{b_m}\leq A_{2,q} \quad\text{a.s.}$$

• Two samples:

(3.4)
$$\limsup_{(n,m)\to(+\infty,+\infty)}\frac{\big|J(\hat p_n,\hat q_m)-J(p,q)\big|}{c_{n,m}}\leq A_{1,p}+A_{2,q} \quad\text{a.s.},$$

where $a_n$, $b_m$, and $c_{n,m}$ are as in (3.1).

The second result concerns the asymptotic normality of the estimators.

###### Theorem 2.

Let

$$V_{1,p}=\sum_{j\in D}p_j(1-p_j)\big(\phi^{(1)}_1(p_j,q_j)\big)^2 \quad\text{and}\quad V_{2,q}=\sum_{j\in D}q_j(1-q_j)\big(\phi^{(1)}_2(p_j,q_j)\big)^2.$$

Under the same assumptions as in Theorem 1, the following central limit theorems hold for the empirical $\phi$-divergences.

• One sample: as $n\to+\infty$ (resp. $m\to+\infty$),

(3.5)
$$\sqrt{n}\,\big(J(\hat p_n,q)-J(p,q)\big)\rightsquigarrow \mathcal{N}(0,V_{1,p}),$$

(3.6)
$$\sqrt{m}\,\big(J(p,\hat q_m)-J(p,q)\big)\rightsquigarrow \mathcal{N}(0,V_{2,q}).$$

• Two samples: as $n\to+\infty$ and $m\to+\infty$,

(3.7)
$$\bigg(\frac{nm}{m V_{1,p}+n V_{2,q}}\bigg)^{1/2}\big(J(\hat p_n,\hat q_m)-J(p,q)\big)\rightsquigarrow \mathcal{N}(0,1).$$

II - Direct extensions.

Quite a few divergence measures are not symmetric, and among these non-symmetric measures are some of the most interesting ones. For such measures, the estimators $J(\hat p_n,q)$, $J(p,\hat q_m)$, and $J(\hat p_n,\hat q_m)$ are not equal to $J(q,\hat p_n)$, $J(\hat q_m,p)$, and $J(\hat q_m,\hat p_n)$, respectively.

In one-sided tests, we have to decide whether the hypothesis $p=q$, for $q$ known and fixed, is true based on data from $p$. In such a case, we may use one of the statistics $J(\hat p_n,q)$ and $J(q,\hat p_n)$ to perform the test. We may have information that allows us to prefer one of them. If not, it is better to use both of them, upon the finiteness of both $J(p,q)$ and $J(q,p)$, in the symmetrized form

(3.8)
$$J^{(s)}(p,q)=\frac{J(p,q)+J(q,p)}{2}.$$

The same situation applies when we face two-sided tests, i.e., testing $p=q$ from data generated by $p$ and $q$.

Asymptotic a.s. efficiency.

###### Theorem 3.

Under the same assumptions as in Theorem 1, the following hold.

• One sample:

(3.9)
$$\limsup_{n\to+\infty}\frac{\big|J^{(s)}(\hat p_n,q)-J^{(s)}(p,q)\big|}{a_n}\leq \frac{1}{2}\big(A_{1,p}+A_{4,p}\big) \quad\text{a.s.},$$

(3.10)
$$\limsup_{m\to+\infty}\frac{\big|J^{(s)}(p,\hat q_m)-J^{(s)}(p,q)\big|}{b_m}\leq \frac{1}{2}\big(A_{2,q}+A_{3,q}\big) \quad\text{a.s.}$$

• Two samples:

(3.11)
$$\limsup_{(n,m)\to(+\infty,+\infty)}\frac{\big|J^{(s)}(\hat p_n,\hat q_m)-J^{(s)}(p,q)\big|}{c_{n,m}}\leq \frac{1}{2}\big(A_{1,p}+A_{4,p}+A_{2,q}+A_{3,q}\big) \quad\text{a.s.}$$

Asymptotic Normality.

Denote

$$V_{3,q}=\sum_{j\in D}q_j(1-q_j)\big(\phi^{(1)}_1(q_j,p_j)\big)^2 \quad\text{and}\quad V_{4,p}=\sum_{j\in D}p_j(1-p_j)\big(\phi^{(1)}_2(q_j,p_j)\big)^2,$$

$$V_{1,4,p}=V_{1,p}+V_{4,p} \quad\text{and}\quad V_{2,3,q}=V_{2,q}+V_{3,q}.$$

We have

###### Theorem 4.

Under the same assumptions as in Theorem 1, the following hold.

• One sample: as $n\to+\infty$,

(3.12)
$$\sqrt{\frac{n}{V_{1,4,p}}}\,\big(J^{(s)}(\hat p_n,q)-J^{(s)}(p,q)\big)\rightsquigarrow \mathcal{N}(0,1),$$

(3.13)
$$\sqrt{\frac{n}{V_{2,3,q}}}\,\big(J^{(s)}(p,\hat q_n)-J^{(s)}(p,q)\big)\rightsquigarrow \mathcal{N}(0,1).$$

• Two samples: as $(n,m)\to(+\infty,+\infty)$,

(3.14)
$$\bigg(\frac{nm}{m V_{1,4,p}+n V_{2,3,q}}\bigg)^{1/2}\big(J^{(s)}(\hat p_n,\hat q_m)-J^{(s)}(p,q)\big)\rightsquigarrow \mathcal{N}(0,1).$$

Remark. The proofs of these extensions are not given here, since they are straightforward consequences of the main results. Likewise, such considerations will not be repeated for the particular measures below, for the same reason.

## 4. Particular Cases

### 4.1. Renyi and Tsallis families

These two families are expressed through the functional

(4.1)
$$I_\alpha(p,q)=\sum_{j\in D}p_j^{\alpha}q_j^{1-\alpha}, \qquad \alpha>0,\ \alpha\neq 1,$$

which is of the form of the $\phi$-divergence measure with

$$\phi(x,y)=x^{\alpha}y^{1-\alpha}, \qquad (x,y)\in\{(p_j,q_j),\ j\in D\}.$$

A- (a)- The asymptotic behavior of the Tsallis divergence measure.

Denote

$$A_{T,\alpha,1}:=\frac{\alpha}{|\alpha-1|}\sum_{j\in D}(p_j/q_j)^{\alpha-1} \quad\text{and}\quad A_{T,\alpha,2}:=\sum_{j\in D}(p_j/q_j)^{\alpha}.$$

We have

###### Corollary 1.

Under the same assumptions as in Theorem 1, and for any $\alpha>0$, $\alpha\neq 1$, the following hold.

• One sample:

$$\limsup_{n\to+\infty}\frac{\big|D_{T,\alpha}(\hat p_n,q)-D_{T,\alpha}(p,q)\big|}{a_n}\leq A_{T,\alpha,1} \quad\text{a.s.}, \qquad \limsup_{n\to+\infty}\frac{\big|D_{T,\alpha}(p,\hat q_n)-D_{T,\alpha}(p,q)\big|}{b_n}\leq A_{T,\alpha,2} \quad\text{a.s.}$$

• Two samples:

$$\limsup_{(n,m)\to(+\infty,+\infty)}\frac{\big|D_{T,\alpha}(\hat p_n,\hat q_m)-D_{T,\alpha}(p,q)\big|}{c_{n,m}}\leq A_{T,\alpha,1}+A_{T,\alpha,2} \quad\text{a.s.}$$

Denote

$$\sigma^2_{T,\alpha,1}(p,q)=\frac{\alpha^2}{(\alpha-1)^2}\Bigg[\sum_{j\in D}p_j(p_j/q_j)^{2\alpha-2}-\Big(\sum_{j\in D}p_j(p_j/q_j)^{\alpha-1}\Big)^2\Bigg],$$

$$\sigma^2_{T,\alpha,2}(p,q)=\sum_{j\in D}q_j(p_j/q_j)^{2\alpha}-\Big(\sum_{j\in D}q_j(p_j/q_j)^{\alpha}\Big)^2.$$

We have

###### Corollary 2.

Under the same assumptions as in Theorem 1, and for any $\alpha>0$, $\alpha\neq 1$, the following hold:

$$\sqrt{n}\,\big(D_{T,\alpha}(\hat p_n,q)-D_{T,\alpha}(p,q)\big)\rightsquigarrow \mathcal{N}\big(0,\sigma^2_{T,\alpha,1}(p,q)\big) \quad\text{as}\quad n\to+\infty,$$

$$\sqrt{n}\,\big(D_{T,\alpha}(p,\hat q_n)-D_{T,\alpha}(p,q)\big)\rightsquigarrow \mathcal{N}\big(0,\sigma^2_{T,\alpha,2}(p,q)\big) \quad\text{as}\quad n\to+\infty,$$

and, as $(n,m)\to(+\infty,+\infty)$,

$$\bigg(\frac{mn}{n\sigma^2_{T,\alpha,2}(p,q)+m\sigma^2_{T,\alpha,1}(p,q)}\bigg)^{1/2}\big(D_{T,\alpha}(\hat p_n,\hat q_m)-D_{T,\alpha}(p,q)\big)\rightsquigarrow \mathcal{N}(0,1).$$
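The variances $\sigma^2_{T,\alpha,1}$ and $\sigma^2_{T,\alpha,2}$ appearing in Corollary 2 are plain finite sums and can be coded directly; a stdlib-only sketch (function names are ours):

```python
def tsallis_var1(p, q, alpha):
    # sigma^2_{T,alpha,1}(p, q): asymptotic variance when p is estimated.
    c = alpha ** 2 / (alpha - 1) ** 2
    m1 = sum(pj * (pj / qj) ** (2 * alpha - 2) for pj, qj in zip(p, q))
    m2 = sum(pj * (pj / qj) ** (alpha - 1) for pj, qj in zip(p, q))
    return c * (m1 - m2 ** 2)

def tsallis_var2(p, q, alpha):
    # sigma^2_{T,alpha,2}(p, q): asymptotic variance when q is estimated.
    m1 = sum(qj * (pj / qj) ** (2 * alpha) for pj, qj in zip(p, q))
    m2 = sum(qj * (pj / qj) ** alpha for pj, qj in zip(p, q))
    return m1 - m2 ** 2
```

Both quantities vanish when $p=q$, consistent with the degeneracy of the limiting distribution at the null.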

As to the symmetrized form

$$D^{(s)}_{T,\alpha}(p,q)=\frac{D_{T,\alpha}(p,q)+D_{T,\alpha}(q,p)}{2},$$

we need the supplementary notations:

$$A_{T,\alpha,3}=\frac{\alpha}{|\alpha-1|}\sum_{j\in D}(q_j/p_j)^{\alpha-1}, \qquad A_{T,\alpha,4}=\sum_{j\in D}(q_j/p_j)^{\alpha},$$

$$\sigma^2_{T,\alpha,3}=\frac{\alpha^2}{(\alpha-1)^2}\Bigg[\sum_{j\in D}q_j(q_j/p_j)^{2\alpha-2}-\Big(\sum_{j\in D}q_j(q_j/p_j)^{\alpha-1}\Big)^2\Bigg],$$

and

$$\sigma^2_{T,\alpha,4}=\sum_{j\in D}p_j(q_j/p_j)^{2\alpha}-\Big(\sum_{j\in D}p_j(q_j/p_j)^{\alpha}\Big)^2.$$

We have

###### Corollary 3.

Under the same assumptions as in Theorem 1, and for any $\alpha>0$, $\alpha\neq 1$,

$$\limsup_{n\to+\infty}\frac{\big|D^{(s)}_{T,\alpha}(\hat p_n,q)-D^{(s)}_{T,\alpha}(p,q)\big|}{a_n}\leq \big(A_{T,\alpha,1}+A_{T,\alpha,4}\big)/2=:A^{(s)}_{T,\alpha,1} \quad\text{a.s.},$$

$$\limsup_{n\to+\infty}\frac{\big|D^{(s)}_{T,\alpha}(p,\hat q_n)-D^{(s)}_{T,\alpha}(p,q)\big|}{b_n}\leq \big(A_{T,\alpha,2}+A_{T,\alpha,3}\big)/2=:A^{(s)}_{T,\alpha,2} \quad\text{a.s.},$$

and

$$\limsup_{(n,m)\to(+\infty,+\infty)}\frac{\big|D^{(s)}_{T,\alpha}(\hat p_n,\hat q_m)-D^{(s)}_{T,\alpha}(p,q)\big|}{c_{n,m}}\leq A^{(s)}_{T,\alpha,1}+A^{(s)}_{T,\alpha,2} \quad\text{a.s.}$$

Denote

$$\sigma^2_{T,\alpha,1:4}(p,q)=\sigma^2_{T,\alpha,1}(p,q)+\sigma^2_{T,\alpha,4}(p,q), \qquad \sigma^2_{T,\alpha,2:3}(p,q)=\sigma^2_{T,\alpha,2}(p,q)+\sigma^2_{T,\alpha,3}(p,q).$$

We also have

###### Corollary 4.

Under the same assumptions as in Theorem 1, and for any $\alpha>0$, $\alpha\neq 1$, we have

$$\sqrt{n}\,\big(D^{(s)}_{T,\alpha}(\hat p_n,q)-D^{(s)}_{T,\alpha}(p,q)\big)\rightsquigarrow \mathcal{N}\big(0,\sigma^2_{T,\alpha,1:4}(p,q)\big) \quad\text{as}\quad n\to+\infty,$$

$$\sqrt{n}\,\big(D^{(s)}_{T,\alpha}(p,\hat q_n)-D^{(s)}_{T,\alpha}(p,q)\big)\rightsquigarrow \mathcal{N}\big(0,\sigma^2_{T,\alpha,2:3}(p,q)\big) \quad\text{as}\quad n\to+\infty,$$

and, as $(n,m)\to(+\infty,+\infty)$,

$$\bigg(\frac{nm}{m\sigma^2_{T,\alpha,1:4}(p,q)+n\sigma^2_{T,\alpha,2:3}(p,q)}\bigg)^{1/2}\big(D^{(s)}_{T,\alpha}(\hat p_n,\hat q_m)-D^{(s)}_{T,\alpha}(p,q)\big)\rightsquigarrow \mathcal{N}(0,1).$$

A-(b)- The asymptotic behavior of the Rényi-$\alpha$ divergence measure.

The treatment of the asymptotic behavior of the Rényi-$\alpha$ divergence, $\alpha>0$, $\alpha\neq 1$, is obtained from Part A-(a) by expansions and by the application of the delta method.

We first remark that

$$D_{R,\alpha}(p,q)=\frac{1}{\alpha-1}\log\big(I_\alpha(p,q)\big).$$
###### Corollary 5.

Under the same assumptions as in Theorem 1, and for any $\alpha>0$, $\alpha\neq 1$, the following hold:

$$\limsup_{n\to+\infty}\frac{\big|D_{R,\alpha}(\hat p_n,q)-D_{R,\alpha}(p,q)\big|}{a_n}\leq \frac{A_{T,\alpha,1}}{I_\alpha(p,q)}=:A_{R,\alpha,1} \quad\text{a.s.},$$

$$\limsup_{n\to+\infty}\frac{\big|D_{R,\alpha}(p,\hat q_n)-D_{R,\alpha}(p,q)\big|}{b_n}\leq \frac{A_{T,\alpha,2}}{I_\alpha(p,q)}=:A_{R,\alpha,2} \quad\text{a.s.},$$

and

$$\limsup_{(n,m)\to(+\infty,+\infty)}\frac{\big|D_{R,\alpha}(\hat p_n,\hat q_m)-D_{R,\alpha}(p,q)\big|}{c_{n,m}}\leq A_{R,\alpha,1}+A_{R,\alpha,2} \quad\text{a.s.}$$

Denote

$$\sigma^2_{R,\alpha,1}(p,q)=\frac{\sigma^2_{T,\alpha,1}(p,q)}{I^2_\alpha(p,q)} \quad\text{and}\quad \sigma^2_{R,\alpha,2}(p,q)=\frac{\sigma^2_{T,\alpha,2}(p,q)}{I^2_\alpha(p,q)}.$$

We have

###### Corollary 6.

Under the same assumptions as in Theorem 1, and for any $\alpha>0$, $\alpha\neq 1$,

$$\sqrt{n}\,\big(D_{R,\alpha}(\hat p_n,q)-D_{R,\alpha}(p,q)\big)\rightsquigarrow \mathcal{N}\big(0,\sigma^2_{R,\alpha,1}(p,q)\big) \quad\text{as}\quad n\to+\infty,$$

$$\sqrt{n}\,\big(D_{R,\alpha}(p,\hat q_n)-D_{R,\alpha}(p,q)\big)\rightsquigarrow \mathcal{N}\big(0,\sigma^2_{R,\alpha,2}(p,q)\big) \quad\text{as}\quad n\to+\infty,$$

and, as $(n,m)\to(+\infty,+\infty)$,

$$\bigg(\frac{mn}{n\sigma^2_{R,\alpha,2}(p,q)+m\sigma^2_{R,\alpha,1}(p,q)}\bigg)^{1/2}\big(D_{R,\alpha}(\hat p_n,\hat q_m)-D_{R,\alpha}(p,q)\big)\rightsquigarrow \mathcal{N}(0,1).$$
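The delta-method scaling behind Corollary 6 — the Tsallis variance divided by $I_\alpha(p,q)^2$ — can be expressed in a few lines; a stdlib-only sketch (function names are ours):

```python
import math

def i_alpha(p, q, alpha):
    # I_alpha(p, q) = sum_j p_j^alpha * q_j^(1 - alpha), as in (4.1).
    return sum(pj ** alpha * qj ** (1 - alpha) for pj, qj in zip(p, q))

def d_renyi(p, q, alpha):
    # D_{R,alpha}(p, q) = log(I_alpha(p, q)) / (alpha - 1).
    return math.log(i_alpha(p, q, alpha)) / (alpha - 1)

def renyi_var(sigma2_tsallis, p, q, alpha):
    # Delta method with g(x) = log(x) / (alpha - 1): compared with the
    # Tsallis map x -> (x - 1) / (alpha - 1), the Renyi asymptotic
    # variance is the Tsallis one divided by I_alpha(p, q)^2.
    return sigma2_tsallis / i_alpha(p, q, alpha) ** 2
```

Note that $I_\alpha(p,p)=1$, so at $p=q$ the two families share the same asymptotic variance.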

As to the symmetrized form

$$D^{(s)}_{R,\alpha}(p,q)=\frac{D_{R,\alpha}(p,q)+D_{R,\alpha}(q,p)}{2},$$

we need the supplementary notations:

$$A_{R,\alpha,3}=\frac{A_{T,\alpha,3}}{I_\alpha(p,q)}, \qquad A_{R,\alpha,4}=\frac{A_{T,\alpha,4}}{I_\alpha(p,q)},$$

$$\sigma^2_{R,\alpha,3}(p,q)=\frac{\sigma^2_{T,\alpha,3}(p,q)}{I^2_\alpha(p,q)} \quad\text{and}\quad \sigma^2_{R,\alpha,4}(p,q)=\frac{\sigma^2_{T,\alpha,4}(p,q)}{I^2_\alpha(p,q)}.$$
###### Corollary 7.

Under the same assumptions as in Theorem 1, and for any $\alpha>0$, $\alpha\neq 1$,

$$\limsup_{n\to+\infty}\frac{\big|D^{(s)}_{R,\alpha}(\hat p_n,q)-D^{(s)}_{R,\alpha}(p,q)\big|}{a_n}\leq \big(A_{R,\alpha,1}+A_{R,\alpha,4}\big)/2=:A^{(s)}_{R,\alpha,1} \quad\text{a.s.},$$

$$\limsup_{n\to+\infty}\frac{\big|D^{(s)}_{R,\alpha}(p,\hat q_n)-D^{(s)}_{R,\alpha}(p,q)\big|}{b_n}\leq \big(A_{R,\alpha,2}+A_{R,\alpha,3}\big)/2=:A^{(s)}_{R,\alpha,2} \quad\text{a.s.},$$

and

$$\limsup_{(n,m)\to(+\infty,+\infty)}\frac{\big|D^{(s)}_{R,\alpha}(\hat p_n,\hat q_m)-D^{(s)}_{R,\alpha}(p,q)\big|}{c_{n,m}}\leq A^{(s)}_{R,\alpha,1}+A^{(s)}_{R,\alpha,2} \quad\text{a.s.}$$

Denote

 σ2R,α,1: