# An inverse Sanov theorem for curved exponential families

We prove the large deviation principle (LDP) for posterior distributions arising from curved exponential families in a parametric setting, allowing for misspecification of the model. Moreover, motivated by the so-called inverse Sanov theorem, obtained in a nonparametric setting by Ganesh and O'Connell (1999 and 2000), we study the relationship between the rate function for the LDP studied in this paper and the rate function for the LDP for the corresponding maximum likelihood estimators. In our setting, even in the correctly specified case, it is not true in general that the rate functions for posterior distributions and for maximum likelihood estimators are Kullback-Leibler divergences with exchanged arguments. Finally, the results of the paper have some further interest for the case of exponential families with a dual one (see Letac (2021+)).

## 1 Introduction

Interest in Bayesian consistency has grown in the last decades, especially in the nonparametric framework; see the survey papers of Ghosal, Ghosh and Ramamoorthi (1998) and Wasserman (1998). Some more recent developments have addressed the issue of misspecification, too; see Kleijn and Van der Vaart (2006). Most of the literature concerns sufficient conditions on the prior distribution ensuring consistency. The work by Ganesh and O'Connell (2000) has been a source of inspiration for this paper. Under a Dirichlet process prior on a compact state space, they prove a Large Deviation Principle (LDP, see (5) for a definition) on the family of probability measures on that space, for the family of posterior distributions, as the sample size grows. When the empirical distribution converges weakly to some law P0 (i.e. the true law), such LDP is governed by the following rate function (evaluated at P)

 D(P0||P) = { ∫ log(dP0/dP) dP0, if P0 is absolutely continuous w.r.t. P; +∞, otherwise, (1)

which is the celebrated Kullback-Leibler divergence of P0 with respect to P. Consistency is a consequence of this result, since D(P0||P) ≥ 0, and the equality holds if and only if P = P0. Now observe that the LDP provided by the celebrated Sanov theorem for the empirical distribution of i.i.d. samples drawn from P0 is governed by the rate function D(·||P0), the Kullback-Leibler divergence of P with respect to P0; for this reason, the authors called their result the inverse Sanov theorem: the rate functions in the two LDP's are obtained one from the other by exchanging the arguments of D. Notice also that, even if in a rather formal way, the empirical distribution can be regarded as a non-parametric maximum likelihood estimator of the true distribution, giving a statistical flavour to the Sanov theorem.
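As a concrete illustration of the rate function (1), here is a minimal numerical sketch (ours, not from the paper) computing the Kullback-Leibler divergence between discrete distributions, with the +∞ convention for the non-absolutely-continuous case:

```python
import numpy as np

def kl_divergence(p0, p):
    """Kullback-Leibler divergence D(P0 || P) for discrete distributions,
    returning +inf when P0 is not absolutely continuous w.r.t. P, as in (1)."""
    p0, p = np.asarray(p0, dtype=float), np.asarray(p, dtype=float)
    support = p0 > 0
    if np.any(p[support] == 0):       # P0 puts mass where P does not
        return np.inf
    return float(np.sum(p0[support] * np.log(p0[support] / p[support])))

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))   # 0.0
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))
print(kl_divergence([0.5, 0.5], [1.0, 0.0]))   # inf
```

Note the asymmetry: `kl_divergence(p0, p)` and `kl_divergence(p, p0)` generally differ, which is exactly the exchange of arguments behind the inverse Sanov terminology.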

The role of the prior distribution for consistency in parametric problems is rather clear-cut. Indeed, in general what matters about the prior is only its support, as it entails the choice of a specific statistical model for the observations. As a matter of fact, in an earlier paper by Ganesh and O'Connell (1999) the inverse Sanov theorem is proved for a finite sample space, without any restriction on the prior, except that its support must include the limit assumed for the empirical distribution, which is nothing but the assumption that the model is not misspecified.

Motivated by previous works by the first author (Macci and Petrella (2009) and Macci (2014)), in the present paper we focus our attention on the analysis of parametric problems, establishing a kind of parametric inverse Sanov theorem. By this we mean a LDP for the sequence of posterior distributions, with rate function of the form (1), but restricted to the parametric family assumed for the data. In addition, our derivation covers also the misspecified case. This parametric family is assumed to be a curved exponential family, which in this context means a general subfamily of a full exponential family, called in the sequel the saturated model. The saturated model is generated by a positive σ-finite Borel measure λ on Rd, with cumulant generating function

 κ(θ) = log ∫_{Rd} e^{θ⋅x} λ(dx), θ ∈ Rd,

which is regular, that is, λ is not concentrated on a proper affine submanifold of Rd and the essential domain (domain of finiteness) of κ, denoted by dom(κ), is open. A more general situation will be discussed in Section 4. The full exponential family generated by λ is defined through the densities

 dPθ/dλ(x) = e^{θ⋅x − κ(θ)}, θ ∈ dom(κ). (2)

It is well known that the function κ is smooth in dom(κ) and

 ∇κ(θ) = ∫ x Pθ(dx), θ ∈ dom(κ).

The normalized log-likelihood function, evaluated on the empirical mean x̄n of an observed sample of n points in Rd, is defined by

 l(θ; x̄n) = θ⋅x̄n − κ(θ), θ ∈ Rd. (3)

It is understood that the log-likelihood is set equal to −∞ outside dom(κ).
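As a sketch of how (3) is maximized, take the illustrative choice λ = δ0 + δ1 on the real line (ours, not the paper's), so that κ(θ) = log(1 + e^θ) and dom(κ) = R; the numerical maximizer of l(·; x̄n) can be checked against the first order condition ∇κ(θ) = x̄n:

```python
import numpy as np

# Illustrative saturated model generated by lambda = delta_0 + delta_1:
# kappa(theta) = log(1 + e^theta), so grad kappa is the logistic function.
kappa = lambda th: np.log1p(np.exp(th))
loglik = lambda th, xbar: th * xbar - kappa(th)   # l(theta; xbar) as in (3)

xbar = 0.3                                        # a sample mean in int C(lambda) = (0, 1)
grid = np.linspace(-6.0, 6.0, 200001)
theta_hat = grid[np.argmax(loglik(grid, xbar))]

# First order condition grad kappa(theta) = xbar gives theta = log(xbar/(1-xbar))
print(theta_hat, np.log(xbar / (1 - xbar)))
```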

Within the Bayesian approach, one also needs to specify a prior distribution on the parameter θ, that is, a probability measure ν on Rd supported by dom(κ). Any subfamily of (2) of the form {Pθ : θ ∈ T}, where T is a Borel set with ν(T) = 1 (called a measurable support of ν in the sequel), is a statistical model compatible with the choice of ν. In order to avoid a separate treatment for cases which are not practically relevant, we will always assume that ν is atomless. Moreover we assume T to be contained in the (topological) support S(ν) of ν, that is, the complement of the largest open set with ν-probability 0. Clearly ν(S(ν)) = 1. The definition of support applies to the σ-finite measure λ as well: the convex hull C(λ) of the support of λ will play a fundamental role in the sequel.

By Bayes’ formula the posterior distribution, conditional on x̄n, is given by

 πn(A|x̄n) = ∫_A exp(n l(θ; x̄n)) ν(dθ) / ∫_T exp(n l(θ; x̄n)) ν(dθ), (4)

where A is any Borel subset of T. In view of the main theorem, we recall that a sequence (ϕn) of probability measures on some topological space satisfies a LDP with rate function I if I is a lower semi-continuous function and

 −inf_{θ∈int(A)} I(θ) ≤ liminf_{n→∞} (1/n) ln ϕn(A) ≤ limsup_{n→∞} (1/n) ln ϕn(A) ≤ −inf_{θ∈Ā} I(θ), (5)

where int(A) and Ā are the interior and the closure of the Borel set A, respectively.

Now we are ready to state the main result of the paper.

###### Theorem 1.

If the sequence (x̄n) converges to some μ0 belonging to the interior of C(λ), the sequence of probability measures on T defined in (4) satisfies a LDP with rate function

 I(θ) = l(θν; μ0) − l(θ; μ0), θ ∈ T, (6)

where θν is any maximizer of l(·; μ0) in S(ν).

The assumption that x̄n converges to μ0 is quite natural: by the Strong Law of Large Numbers, if we consider i.i.d. samples drawn from a distribution with mean μ0, such assumption holds almost surely. The regularity of λ implies that l(·; μ0) is an upper semi-continuous function with compact superlevel sets (see e.g. Barndorff-Nielsen (1978), page 150), from which the existence (but not the uniqueness) of θν is guaranteed. Note that, by (3), θν can be interpreted as a limiting Maximum Likelihood Estimate of θ in S(ν), because μ0 is supposed to be the limit of the sample means (as n → ∞).

In addition, by regularity, there exists θ0 ∈ dom(κ) with ∇κ(θ0) = μ0, and it is unique, since κ is strictly convex in dom(κ). By differentiation, one checks that θ0 is also the unique maximum point of l(·; μ0) in the whole dom(κ): it is the limiting unrestricted MLE. Indeed, for θ ∈ dom(κ) we have

 l(θ0; μ0) − l(θ; μ0) = (θ0 − θ)⋅μ0 − κ(θ0) + κ(θ) = ∫ log(dPθ0/dPθ) dPθ0 = D(Pθ0||Pθ), (7)

which is positive except when θ = θ0. Now either θ0 ∈ S(ν), which is the “inverse Sanov regime”, in which case θν = θ0 and the rate function is given by (7) (extended to +∞ out of T), or θ0 ∉ S(ν), which means that the model is misspecified. Also in the misspecified case the rate function can be rewritten as an “excess of divergence over the minimum”, in the form

 I(θ) = {l(θ0; μ0) − l(θ; μ0)} − {l(θ0; μ0) − l(θν; μ0)} = D(Pθ0||Pθ) − D(Pθ0||Pθν),

for θ ∈ T, and +∞ elsewhere. In addition, the function I is clearly a lower semi-continuous function.

It is worth mentioning that also in the misspecified case the rate function can itself be written as a divergence, by means of the Pythagorean identity for linear subfamilies stated below, first proved by Simon (1973). In Section 3 we will give an example to illustrate the failure of this property for genuinely curved subfamilies.

###### Proposition 1.

Let μ0 ∈ int(C(λ)), and let T ⊂ M be a measurable support of ν, where M is an affine submanifold of Rd. Then θν is the only vector in T such that the difference ∇κ(θν) − μ0 is orthogonal to M. Moreover, for any θ ∈ T it holds

 D(Pθ0||Pθ) = D(Pθ0||Pθν) + D(Pθν||Pθ). (8)
###### Proof.

The first statement can be found in Brown (1986), Theorem 5.8 and Construction 5.9. The second statement follows from the first: writing down (8) by means of (7), the identity reduces to

 (θ − θν)⋅(∇κ(θν) − μ0) = 0,

which holds by the orthogonality stated above. ∎

From the previous result it follows that, when the statistical model entailed by the prior is a linear subfamily of the saturated model, the rate function governing the LDP for the sequence of posteriors (4) is the same for a misspecified case and for a correctly specified one in which θ0 is replaced by θν, as long as θν is obtained from μ0 as in Theorem 1 (indeed, notice that by (8) the rate function reduces to D(Pθν||Pθ) on T).
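The Pythagorean identity (8) is easy to check numerically. The sketch below (our illustration) uses the three-outcome family also appearing in Example 3 below, whose subfamily {θ1 + θ2 = 0} is linear, and verifies (8) in a misspecified situation where θ0 lies off the subfamily; the closed form for θν anticipates the computation done in that example:

```python
import numpy as np

def probs(t1, t2):
    """Outcome probabilities P(0), P(e1), P(e2) for natural parameter (t1, t2)."""
    z = 2 + np.exp(t1) + np.exp(t2)
    return np.array([2.0, np.exp(t1), np.exp(t2)]) / z

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

theta0 = (0.5, -0.2)            # true parameter, off the subfamily (misspecified)
p0 = probs(*theta0)
mu0 = (p0[1], p0[2])            # mean of the sufficient statistic under P_theta0

# Linear subfamily T = {theta1 + theta2 = 0}; closed-form maximizer of l(.; mu0):
s = mu0[0] - mu0[1]
t = np.log((1 + s) / (1 - s))
p_nu = probs(t, -t)             # projection P_theta_nu

theta = (1.0, -1.0)             # an arbitrary other point of T
p = probs(*theta)

lhs = kl(p0, p)                                 # D(P0 || P_theta)
rhs = kl(p0, p_nu) + kl(p_nu, p)                # right hand side of (8)
print(lhs, rhs)
```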

Finally, observe also that the various choices of T in the rate function (6) allow one to consider different statistical models embedded in the saturated model, each with a different relative topology in which the LDP holds.

The proof of Theorem 1 will be given in Section 2: it relies on some general facts about convex conjugate functions. In Section 3 we will discuss its frequentist counterpart, namely the LDP for the Maximum Likelihood Estimator (again denoted by MLE). The analysis of large deviations for consistent estimators in classical statistics dates back to the results of Bahadur (see Bahadur et al. (1980)). The application to the MLE in exponential families was discussed by Kester and Kallenberg (1986) and Arcones (2006), who observed that the parametric analogue of the Sanov theorem holds only for linear subfamilies. This is due to the failure of the Pythagorean identity for genuinely curved families, which will be illustrated through an example. Section 4 is devoted to examining an extension of Theorem 1 to non-regular families. Since this is more cumbersome to state, we have decided to put it in a separate section. The last section deals with exponential families generated by dual measures, a concept which arises quite naturally from the subject of the paper (see Letac, 2021+).

## 2 Proof of the main theorem

Before giving the proof of Theorem 1, we need to recall some general facts about natural exponential families, which can be found in the books of Barndorff-Nielsen (1978) and Brown (1986). As anticipated in the Introduction, we assume that λ is a regular σ-finite Borel measure on Rd. Then the cumulant generating function κ is a convex (and lower semi-continuous) function on Rd, which is strictly convex (and continuous) in dom(κ) (see e.g. Barndorff-Nielsen (1978), Theorem 7.1). Moreover κ is differentiable in dom(κ), and ∇κ maps dom(κ) diffeomorphically onto the interior of C(λ) (see e.g. Barndorff-Nielsen (1978), page 121). Throughout the paper, we set

 l(θ; t) = θ⋅t − κ(θ),

for θ ∈ Rd and t ∈ Rd, and we consider the conjugate function κ∗ of κ defined by

 κ∗(t) = sup_{θ∈Rd} l(θ; t) = sup_{θ∈dom(κ)} l(θ; t). (9)

It is a lower semi-continuous convex function, differentiable in the interior of its essential domain, which coincides with int(C(λ)), being

 int(C(λ)) ⊂ dom(κ∗) ⊂ C(λ)

(see e.g. Barndorff-Nielsen (1978), Theorems 9.1, 9.2 and 9.13). The gradient ∇κ∗ is the inverse mapping of ∇κ, thus it maps int(C(λ)) onto dom(κ). Moreover

 κ∗(t) + κ(θ) ≥ θ⋅t

for every θ and t, and the equality holds if and only if t ∈ int(C(λ)) and θ = ∇κ∗(t), or equivalently if t = ∇κ(θ), with θ ∈ dom(κ). As a consequence

 κ∗(t) = ∇κ∗(t)⋅t − κ(∇κ∗(t)) = l(∇κ∗(t); t), t ∈ int(C(λ)). (10)

Thus, for t ∈ int(C(λ)), ∇κ∗(t) is the MLE for the parameter θ in the saturated model, when t is the observed sample mean. Moreover θ0 = ∇κ∗(μ0) in (7) (since ∇κ(θ0) = μ0), and

 D(Pθ0||Pθ) = l(θ0; ∇κ(θ0)) − l(θ; ∇κ(θ0)) = κ∗(∇κ(θ0)) − l(θ; ∇κ(θ0)), (11)

for θ ∈ dom(κ).
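A quick numerical sanity check of (11), under the illustrative Bernoulli-type choice λ = δ0 + δ1 (ours, not the paper's), for which ∇κ is the logistic function and ∇κ∗ its inverse:

```python
import numpy as np

# Illustrative saturated model: kappa(theta) = log(1 + e^theta),
# grad kappa = logistic, grad kappa* = logit (mutually inverse maps).
kappa = lambda th: np.log1p(np.exp(th))
grad_kappa = lambda th: 1.0 / (1.0 + np.exp(-th))
grad_kappa_star = lambda t: np.log(t / (1.0 - t))
kappa_star = lambda t: t * grad_kappa_star(t) - kappa(grad_kappa_star(t))  # as in (10)

theta0, theta = 0.8, -0.3
t0 = grad_kappa(theta0)
print(grad_kappa_star(t0))                    # recovers theta0

# Right hand side of (11): kappa*(grad kappa(theta0)) - l(theta; grad kappa(theta0))
rhs = kappa_star(t0) - (theta * t0 - kappa(theta))

# Direct Kullback-Leibler divergence between the two Bernoulli laws
p0, p = grad_kappa(theta0), grad_kappa(theta)
kl = p0 * np.log(p0 / p) + (1 - p0) * np.log((1 - p0) / (1 - p))
print(rhs, kl)
```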

In order to discuss also the LDP's for MLE's it is worth observing that, once we define

 l∗(t; θ) = t⋅θ − κ∗(t),

we can also write (similarly to (10))

 κ(θ) = ∇κ(θ)⋅θ − κ∗(∇κ(θ)) = l∗(∇κ(θ); θ) = sup_{t∈Rd} l∗(t; θ), θ ∈ dom(κ),

and from (11) we get

 D(Pθ0||Pθ) = κ∗(∇κ(θ0)) − l(θ; ∇κ(θ0)) = κ(θ) + κ∗(∇κ(θ0)) − θ⋅∇κ(θ0) = κ(θ) − l∗(∇κ(θ0); θ), θ0, θ ∈ dom(κ). (12)

Finally, for an arbitrary set B ⊂ Rd, define

 κ∗_B(t) = sup_{θ∈B} l(θ; t) = sup_{θ∈Rd} {θ⋅t − κ(θ) − δ(θ|B)},

where δ(θ|B) = 0 if θ ∈ B and δ(θ|B) = +∞ if θ ∉ B. The function κ∗_B is again a lower semi-continuous convex function, being a supremum of affine functions. Given that κ∗_B ≤ κ∗ (the former being a supremum constrained to a smaller domain), it is dom(κ∗) ⊂ dom(κ∗_B), and therefore int(C(λ)) ⊂ dom(κ∗_B). Since a convex function is continuous in the interior of its effective domain (see e.g. Roberts and Varberg (1973), Theorem D, page 93), the function κ∗_B is always continuous in int(C(λ)), whatever the choice of the set B.

For proving Theorem 1, first we need the following lemma.

###### Lemma 2.

Under the assumptions of Theorem 1,

 lim_{n→∞} (1/n) ln ∫_{S(ν)} exp(n l(θ; x̄n)) ν(dθ) = κ∗_{S(ν)}(μ0) = l(θν; μ0).
###### Proof.

First of all, recall that by assumption

 κ∗_{S(ν)}(μ0) = l(θν; μ0),

where θν is a maximizer of l(·; μ0) in S(ν). Now, replacing the integrand with its supremum over the support S(ν), we immediately have

 (1/n) ln ∫_{S(ν)} exp(n l(θ; x̄n)) ν(dθ) ≤ sup_{θ∈S(ν)} l(θ; x̄n) = κ∗_{S(ν)}(x̄n).

Hence, if x̄n tends to μ0 ∈ int(C(λ)) as n tends to ∞, then x̄n is eventually in int(C(λ)). By continuity of κ∗_{S(ν)} within this set, κ∗_{S(ν)}(x̄n) tends to κ∗_{S(ν)}(μ0) and

 limsup_{n→∞} (1/n) ln ∫_{S(ν)} exp(n l(θ; x̄n)) ν(dθ) ≤ κ∗_{S(ν)}(μ0).

For the reverse inequality, let B(θν, ϵ) be the ball of radius ϵ and center θν, and observe that ν(B(θν, ϵ)) > 0 for any ϵ > 0, since θν ∈ S(ν); from this,

 inf_{θ∈S(ν)∩B(θν,ϵ)} l(θ; x̄n) + (1/n) ln ν(B(θν, ϵ)) ≤ (1/n) ln ∫_{S(ν)} exp(n l(θ; x̄n)) ν(dθ).

Sending n to ∞ first, and then ϵ to 0, one gets

 sup_{ϵ>0} liminf_{n→∞} inf_{θ∈S(ν)∩B(θν,ϵ)} l(θ; x̄n) ≤ liminf_{n→∞} (1/n) ln ∫_{S(ν)} exp(n l(θ; x̄n)) ν(dθ). (13)

Finally, we prove that the left hand side in display (13) cannot be smaller than l(θν; μ0). Reasoning by contradiction, suppose that for some δ > 0

 sup_{ϵ>0} liminf_{n→∞} inf_{θ∈S(ν)∩B(θν,ϵ)} l(θ; x̄n) < l(θν; μ0) − δ.

Then for any positive integer m there exist θm ∈ S(ν) ∩ B(θν, 1/m) and an integer nm such that

 θm⋅x̄_{nm} − κ(θm) < θν⋅μ0 − κ(θν) − δ. (14)

Now θm converges to θν, x̄_{nm} converges to μ0, and (nm) can be chosen to be increasing with m. As m → ∞ we get the convergence of the left hand side of (14) to l(θν; μ0) = θν⋅μ0 − κ(θν), which is impossible since we assumed δ > 0. So we have proved that

 liminf_{n→∞} (1/n) ln ∫_{S(ν)} exp(n l(θ; x̄n)) ν(dθ) ≥ l(θν; μ0) = κ∗_{S(ν)}(μ0), (15)

ending the proof. ∎
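Lemma 2 is a Laplace-type asymptotic, and it can be observed numerically. The sketch below (an illustration of ours, not from the paper) takes the Bernoulli-type model κ(θ) = log(1 + e^θ) with a uniform prior ν on [−3, 3], and shows (1/n) log of the integral approaching the supremum κ∗_{S(ν)}(μ0):

```python
import numpy as np

kappa = lambda th: np.log1p(np.exp(th))
l = lambda th, t: th * t - kappa(th)

mu0 = 0.3
grid = np.linspace(-3.0, 3.0, 60001)       # S(nu) = [-3, 3], nu uniform
dtheta = grid[1] - grid[0]

# The unrestricted maximizer logit(mu0) lies inside [-3, 3], so
# kappa*_{S(nu)}(mu0) equals the unconstrained supremum of l(.; mu0).
sup_l = l(np.log(mu0 / (1 - mu0)), mu0)

for n in (10, 100, 1000, 10000):
    # (1/n) log of int exp(n l(theta; mu0)) nu(dtheta); shifting by sup_l
    # avoids underflow (log-sum-exp trick), and 1/6 is the uniform density.
    logint = np.log(np.sum(np.exp(n * (l(grid, mu0) - sup_l))) * dtheta / 6.0) / n + sup_l
    print(n, logint, sup_l)
```

The gap closes at the usual Laplace rate of order (log n)/n.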

### Proof of Theorem 1.

The proof of the upper bound consists in estimating the numerator of Bayes’ formula (4). Choose A = B ∩ T, where B is a Borel set of Rd and T is a measurable support of ν. Then, with exactly the same argument as in the previous lemma,

 limsup_{n→∞} (1/n) ln ∫_{B∩T} exp(n l(θ; x̄n)) ν(dθ) ≤ lim_{n→∞} κ∗_{B∩T}(x̄n) = κ∗_{B∩T}(μ0), (16)

which, together with Lemma 2, implies the rightmost inequality in (5), with the rate function I defined in (6). Indeed, the supremum in (16) does not decrease once it is taken over the closure of B ∩ T (in the relative topology of T).

As far as the lower bound is concerned, let G be the interior of the measurable set B ∩ T in the relative topology of T. Thus G is an open subset of T and, since T ⊂ S(ν) and ν(T) = 1, it holds ν(B(θ∗, ϵ) ∩ T) = ν(B(θ∗, ϵ)) > 0 for every θ∗ ∈ G and ϵ > 0. Repeating the argument of the previous proof, with any θ∗ ∈ G replacing θν and B ∩ T replacing S(ν), one arrives at

 liminf_{n→∞} (1/n) ln ∫_{B∩T} exp(n l(θ; x̄n)) ν(dθ) ≥ l(θ∗; μ0). (17)

As a consequence, optimizing over θ∗, the leftmost inequality in (5) is readily obtained. ∎

###### Example 3.

The Hardy-Weinberg family of distributions, in its simplest form with two alleles (see e.g. Barndorff-Nielsen (1978), Example 8.10), is a subfamily of the family of all distributions over three outcomes, coded by the vectors 0, e1, e2 in the plane R2. By choosing

 λ = (1/2)δ0 + (1/4)δ_{e1} + (1/4)δ_{e2},

this family is represented as the natural exponential family generated by λ, with natural parameter θ = (θ1, θ2) and

 κ(θ) = log ∫ exp{θ⋅t} λ(dt) = log(2 + e^{θ1} + e^{θ2}) − 2 log 2.

The probabilities of the three outcomes are

 Pθ(ei) = ∂κ/∂θi = e^{θi}/(2 + e^{θ1} + e^{θ2}), i = 1, 2, Pθ(0) = 2/(2 + e^{θ1} + e^{θ2}).

The Hardy-Weinberg subfamily assumes that these probabilities arise from a binomial distribution with 2 trials, where the outcome 0 corresponds to one success and one failure; hence they are subject to the constraint

 Pθ(0) = 2√(Pθ(e1) Pθ(e2)),

which in terms of the natural parameters becomes

 HW = {θ1 + θ2 = 0},

taken to be the support of the prior distribution ν. Let μ0 = (x, y) be any vector with positive components and x + y < 1, which means that μ0 belongs to the interior of C(λ). With simple computations, the maximizer θν of the likelihood function l(·; μ0) in HW is given by

 θν,1 = log(1 + x − y) − log(1 − x + y), θν,2 = −θν,1.

Then, by Theorem 1, if the sequence x̄n converges to μ0 as n → ∞, the sequence of probability measures (4) on HW satisfies a LDP with rate function I(θ) = l(θν; μ0) − l(θ; μ0). This is better visualized in terms of the parameter p(θ1), the success probability of the underlying binomial distribution. Since p0 = (1 + x − y)/2 is the limiting MLE of this parameter, with simple computations one gets

 I1(θ1) = 2{p0² log(p0/p(θ1)) + p0(1 − p0) log[p0(1 − p0)/(p(θ1)(1 − p(θ1)))] + (1 − p0)² log[(1 − p0)/(1 − p(θ1))]},

the Kullback-Leibler divergence between two binomials with 2 trials and probability of success p0 and p(θ1), respectively, in agreement with Proposition 1.
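The identification of the rate function of Example 3 with a binomial Kullback-Leibler divergence can be verified numerically. In the sketch below (ours) we take p(θ1) = (1 + e^{−θ1})^{−1}, an expression obtained by inverting Pθ(e1) = p² in the natural parametrization above (an assumption of this illustration):

```python
import numpy as np

kappa = lambda t1, t2: np.log(2 + np.exp(t1) + np.exp(t2)) - 2 * np.log(2)
l = lambda t1, t2, x, y: t1 * x + t2 * y - kappa(t1, t2)
p = lambda t1: 1.0 / (1.0 + np.exp(-t1))          # binomial success probability

def kl_bin2(p0, p1):
    """Kullback-Leibler divergence between Binomial(2, p0) and Binomial(2, p1)."""
    return 2 * (p0 * np.log(p0 / p1) + (1 - p0) * np.log((1 - p0) / (1 - p1)))

x, y = 0.5, 0.2                                    # mu0 with x, y > 0 and x + y < 1
tnu = np.log((1 + x - y) / (1 - x + y))            # theta_nu,1 from the example
p0 = (1 + x - y) / 2                               # limiting MLE of the success probability

t1 = 1.3                                           # an arbitrary point of HW
rate = l(tnu, -tnu, x, y) - l(t1, -t1, x, y)       # I(theta) = l(theta_nu) - l(theta)
print(rate, kl_bin2(p0, p(t1)))
```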

## 3 Large deviation principles for the MLE

This section is devoted to reviewing what can be considered as the frequentist counterpart of Theorem 1, namely the LDP for a MLE in a curved exponential family. Let X1, …, Xn be an i.i.d. sample drawn from a law Pθ0 belonging to the family (2), where θ0 ∈ dom(κ). A Maximum Likelihood Estimator constrained to a measurable parameter set T is a measurable mapping φ, defined on the set of values of the sample mean, such that

 l(φ(x̄n); x̄n) ≥ l(θ; x̄n), θ ∈ T,

almost surely. In order to relate this terminology with that of the Introduction, notice that we are allowed to say that the values of a maximum likelihood estimator are maximum likelihood estimates (see (3)). When x̄n ∈ int(C(λ)), subtracting the maximum value of the unconstrained likelihood function from both sides (see (10) with t = x̄n), the above inequality is equivalently formulated as

 D(P_{∇κ∗(x̄n)}||P_{φ(x̄n)}) ≤ D(P_{∇κ∗(x̄n)}||Pθ), θ ∈ T.

Under suitable assumptions, the following result can be derived using rather general results of the theory of large deviations.

###### Theorem 4.

Suppose that X1, …, Xn is an i.i.d. sample drawn from Pθ0 and the sample mean takes values in int(C(λ)) a.s. Moreover, suppose that there exists a continuous function φ : int(C(λ)) → T which is a MLE constrained to T. Then the sequence (φ(x̄n)) satisfies a LDP with rate function

 ~I(θ) = inf{D(Pθ′||Pθ0) : θ′ ∈ dom(κ), φ(∇κ(θ′)) = θ}, θ ∈ T. (18)
###### Proof.

By Cramér’s theorem (see e.g. Theorem 2.2.30 in Dembo and Zeitouni (1998)), the sample mean of i.i.d. random variables drawn from Pθ0 satisfies a LDP with rate function

 ι(t) = sup_θ {t⋅θ − log ∫ e^{(θ0+θ)⋅x − κ(θ0)} λ(dx)} = sup_θ {t⋅θ − κ(θ + θ0) + κ(θ0)},

with the supremum taken over the whole space Rd. Therefore, by (10) and a straightforward change of variable, for t ∈ int(C(λ)) we have

 ι(t) = κ∗(t) − l(θ0; t) = l(∇κ∗(t); t) − l(θ0; t) = D(P_{∇κ∗(t)}||Pθ0).

Finally, by the continuity of φ, we use the contraction principle (see e.g. Dembo and Zeitouni (1998), Section 4.1.4) and we get the LDP of (φ(x̄n)) with rate function ~I defined by

 ~I(θ) = inf{ι(t) : t ∈ int(C(λ)), φ(t) = θ} = inf{ι(∇κ(θ′)) : θ′ ∈ dom(κ), φ(∇κ(θ′)) = θ},

which is easily seen to coincide with (18), since ∇κ and ∇κ∗ are inverse to each other between dom(κ) and int(C(λ)). ∎
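The identity ι(t) = D(P_{∇κ∗(t)}||Pθ0) used in the proof is easy to check numerically. The sketch below (an illustration of ours on the Bernoulli-type model λ = δ0 + δ1, where ∇κ∗ is the logit map) compares the Cramér rate with the direct Kullback-Leibler divergence:

```python
import numpy as np

kappa = lambda th: np.log1p(np.exp(th))
sigmoid = lambda th: 1.0 / (1.0 + np.exp(-th))     # grad kappa

theta0 = 0.4
t = 0.25                                           # a mean value in int C(lambda) = (0, 1)
theta = np.log(t / (1 - t))                        # grad kappa*(t), the unconstrained MLE

# Cramer rate: iota(t) = kappa*(t) - l(theta0; t), with kappa*(t) as in (10)
kappa_star = t * theta - kappa(theta)
iota = kappa_star - (theta0 * t - kappa(theta0))

# D(P_{grad kappa*(t)} || P_theta0): KL between Bernoulli(t) and Bernoulli(sigmoid(theta0))
q = sigmoid(theta0)
kl = t * np.log(t / q) + (1 - t) * np.log((1 - t) / (1 - q))
print(iota, kl)
```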

As a consequence of the previous result, under the conditions stated therein, the rate function ~I(θ), for any θ ∈ T, is computed by solving the following geometrical problem: find the parameter vector θ′ “closest” to θ0 within the “surface of constant MLE”

 Mθ = {θ′ ∈ dom(κ) : φ(∇κ(θ′)) = θ},

in the sense of minimizing D(Pθ′||Pθ0). If

 D(Pθ||Pθ0) ≤ D(Pθ′||Pθ0), ∀θ′ ∈ Mθ, (19)

then the quantity ~I(θ) is equal to D(Pθ||Pθ0); therefore, when (19) holds for every θ ∈ T, we can say that the “parametric” Sanov theorem holds.

The property (19) holds for the full exponential family, i.e. when T = dom(κ). Indeed, the MLE is injective in this case, hence the set Mθ reduces to a point. More generally, it holds under the assumptions of the following proposition, which is immediately obtained from Proposition 1.

###### Proposition 2.

When T = M ∩ dom(κ), with M an affine submanifold of Rd, if under i.i.d. sampling from Pθ0, with θ0 ∈ T, the sample mean takes values a.s. in int(C(λ)), then there is a uniquely defined MLE φT and

 D(P_{∇κ∗(t)}||Pθ0) = D(P_{∇κ∗(t)}||P_{φT(t)}) + D(P_{φT(t)}||Pθ0),

for any t ∈ int(C(λ)). As a consequence, for θ = φT(t), it holds

 D(Pθ||Pθ0) ≤ D(P_{∇κ∗(t)}||Pθ0).

When T does not have the form prescribed by the previous result, the above displayed property may fail, as illustrated by the following example.

###### Example 5.

The family of Gaussian distributions with mean equal to the standard deviation forms a one-parameter curved subfamily of the two-parameter Gaussian exponential family. Recall that the family of distributions in the cartesian plane that are images of Gaussian laws on the real line under the mapping x ↦ (x, q(x)), where q(x) = x², is a natural exponential family, with a generating measure λ that can be chosen equal to the image of the Lebesgue measure under the above mapping. The natural parameters are then

 θ1 = μ/σ² ∈ R, θ2 = −1/(2σ²) < 0,

and the cumulant generating function is

 κ(θ1, θ2) = −(1/2)(2 log 2 + log π + log(−θ2) + θ1²/(2θ2)),

 ∂κ/∂θ1 = −θ1/(2θ2), ∂κ/∂θ2 = θ1²/(4θ2²) − 1/(2θ2). (20)

Since λ is supported by the graph of the function q, the set C(λ) is the subset of the plane above this graph. It is clear that the mean of a sample of n ≥ 2 points drawn from any law of this exponential family will lie in the interior of this set unless all the elements of the sample are equal, which clearly happens with probability zero.

The subfamily of laws with mean equal to the standard deviation corresponds to the following curve in the natural parameter space:

 T = {(θ1, θ2) : θ2 = −(1/2)θ1² = −(1/2)q(θ1), θ1 > 0}, (21)

whose image under the mapping ∇κ is readily checked to be the graph of the function 2q, restricted to the first quadrant of the plane. The first order condition for the maximization of the likelihood in T, with T parametrized as in (21), for an observed mean (x, y) with y > q(x), gives the following equation:

 (x − ∂κ/∂θ1(θ1, −θ1²/2)) − θ1 (y − ∂κ/∂θ2(θ1, −θ1²/2)) = (x − 1/θ1) − θ1 (y − 2/θ1²) = 0,

whose unique positive solution is

 θ1 = (x + √(x² + 4y))/(2y) = φ(x, y), (22)

with φ continuous in int(C(λ)). The conditions of Theorem 4 are thus satisfied: the sequence of MLE's (φ(x̄n)) satisfies a LDP in T with rate function of the form (18), i.e.

 ~I(θ1, θ2) = ~I1(θ1), θ2 = −(1/2)θ1², θ1 > 0,

where ~I1 has to be determined. In order to do this, observe that, by means of (12), the minimization problem appearing in (18) can be rephrased as the maximization of l∗(t; θ0) over the set of t = (x, y) such that φ(x, y) = θ1. Now observe that the set of (x, y) such that (22) is satisfied can be described as the graph of the function

 y = g(x) = 1/θ1² + x/θ1,

so, if we write θ0 = (θ0,1, θ0,2), the first order condition for such a maximization problem in the variable x is

 θ0,1 + θ0,2/θ1 − ∂κ∗/∂x(x, g(x)) − (1/θ1) ∂κ∗/∂y(x, g(x)) = 0.

By recalling that ∇κ and ∇κ∗ are inverse to each other and taking (20) into account, this gives the following quadratic equation in z = θ1 x:

 2(θ1θ0,1 + θ0,2) z² − 2(θ1θ0,1 + θ0,2 − θ1²) z − [2(θ1θ0,1 + θ0,2) + θ1²] = 0.

Finally, assume that θ0,2 = −(1/2)q(θ0,1), for θ0,1 > 0, that is, θ0 belongs to T. The above equation has the solution z = 1, corresponding to x = 1/θ1, provided θ1θ0,1 + θ0,2 satisfies 2(θ1θ0,1 + θ0,2) = θ1², which is equivalent to θ1 = θ0,1. As a consequence

 ~I1(θ1) ≤ D(P_{(θ1, −q(θ1)/2)} || P_{(θ0,1, −q(θ0,1)/2)}), θ1 > 0,

and the equality holds if and only if θ1 = θ0,1. So the “parametric” Sanov theorem fails because, for all values θ1 ≠ θ0,1, we have the strict inequality.
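The closed form (22) for the constrained MLE of Example 5 can be checked against the first order condition directly; a small sketch (ours, with an arbitrarily chosen mean point):

```python
import numpy as np

def phi(x, y):
    """Constrained MLE (22) on the curve theta2 = -theta1**2/2, theta1 > 0."""
    return (x + np.sqrt(x**2 + 4 * y)) / (2 * y)

x, y = 0.7, 1.5            # a point with y > x**2, i.e. in int C(lambda)
t1 = phi(x, y)

# First order condition from the example: (x - 1/t1) - t1*(y - 2/t1**2) = 0
residual = (x - 1 / t1) - t1 * (y - 2 / t1**2)
print(t1, residual)
```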

## 4 LDP’s when the set dom(κ) is not open

The aim of this section is to explain under which circumstances the LDP stated in Theorem 1 continues to hold when the essential domain of the cumulant generating function κ (of the reference measure λ) is not open. In this case κ remains continuous in the interior of dom(κ), but this is not necessarily true at boundary points of dom(κ). The basic assumption remains unchanged: the sequence (x̄n) converges to μ0 ∈ int(C(λ)), which ensures that there exists θν ∈ S(ν) such that

 κ∗_{S(ν)}