Interest in Bayesian consistency has grown in recent decades, especially in the nonparametric framework; see the survey papers of Ghosal, Ghosh and Ramamoorthi (1998) and Wasserman (1998). Some more recent developments have addressed the issue of misspecification, too; see Kleijn and Van der Vaart (2006). Most of the literature concerns sufficient conditions on the prior distribution to ensure consistency. The work by Ganesh and O’Connell (2000) has been a source of inspiration for this paper. Under a Dirichlet process prior on a compact state space , they prove a Large Deviation Principle (LDP, see (5) for a definition) on , the family of probability measures on , for the family of posterior distributions, as the sample size grows. When the empirical distribution converges weakly to some law (i.e. the true law), such LDP is governed by the following rate function (evaluated at )
which is the celebrated Kullback-Leibler divergence of with respect to . Consistency is a consequence of this result, since , and the equality holds if and only if . Now observe that the LDP provided by the celebrated Sanov theorem for the empirical distribution of i.i.d. samples drawn from is governed by the rate function , the Kullback-Leibler divergence of with respect to ; for this reason, the authors called their result the inverse Sanov theorem: the rate functions in the two LDP’s are obtained one from the other by exchanging the arguments in . Notice also that, albeit in a rather formal way, the empirical distribution can be regarded as a nonparametric maximum likelihood estimator of the true distribution, giving a statistical flavour to the Sanov theorem.
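For the reader’s convenience, the two rate functions can be written explicitly (the symbols $K$, $q$ and $\pi_0$ here are ours and may differ from the paper’s own notation):

```latex
% Kullback-Leibler divergence of q with respect to p:
K(q \mid p) \;=\; \int \log\!\Big(\frac{dq}{dp}\Big)\, dq \;\ge\; 0,
\qquad K(q \mid p) = 0 \iff q = p .
% Sanov theorem (empirical measures of i.i.d. samples from \pi_0):
%   rate function  q \mapsto K(q \mid \pi_0).
% Inverse Sanov theorem (posterior distributions):
%   rate function  q \mapsto K(\pi_0 \mid q).
```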
The role of the prior distribution for consistency in parametric problems is rather clear-cut. Indeed, in general what matters about the prior is only its support, as it entails the choice of a specific statistical model for the observations. As a matter of fact, in an earlier paper by Ganesh and O’Connell (1999) the inverse Sanov theorem is proved for a finite sample space, without any restriction on the prior, except that its support must include the limit assumed for the empirical distribution, which is nothing but the assumption that the model is not misspecified.
Motivated by previous works by the first author (Macci and Petrella (2009) and Macci (2014)), in the present paper we focus our attention on the analysis of parametric problems, establishing a kind of parametric inverse Sanov theorem. By this we mean a LDP for the sequence of posterior distributions, with the rate function of the form (1), but restricted to the parametric family assumed for the data. In addition, our derivation also covers the misspecified case. This parametric family is assumed to be a curved exponential family, which in this context means a general subfamily of a full exponential family, called in the sequel the saturated model. The saturated model is generated by some positive -finite Borel measure on , with cumulant generating function
which is regular, that is is not concentrated on a proper affine submanifold of and with open essential domain (domain of finiteness), denoted by . A more general situation will be discussed in Section 4. The full exponential family generated by is defined through the densities
It is well known that the function is smooth in and
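In our notation (a standard fact about exponential families, stated here for reference), writing $\psi$ for the cumulant generating function and $X$ for a random vector distributed according to the family member with natural parameter $\theta$:

```latex
\nabla \psi(\theta) \;=\; \mathbb{E}_{\theta}[X],
\qquad
\nabla^{2} \psi(\theta) \;=\; \operatorname{Cov}_{\theta}(X),
```

with $\nabla^{2}\psi(\theta)$ positive definite under the regularity assumption (the generating measure is not concentrated on a proper affine submanifold).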
The normalized log-likelihood function, evaluated on the empirical mean of an observed sample of points in , is defined by
It is understood that the log-likelihood is set equal to outside .
Within the Bayesian approach, one also needs to specify a prior distribution on the parameter
, that is a probability distribution on supported by . Any subfamily of (2) of the form , where is a Borel set with (called a measurable support of in the sequel), is a statistical model compatible with the choice of . In order to avoid a separate treatment of cases which are not practically relevant, we will always assume that is atomless. Moreover we assume to be contained in the (topological) support of , that is the complement of the largest open set with -probability . Clearly . The definition of support applies to the -finite measure as well: its convex hull will play a fundamental role in the sequel.
By Bayes’ formula the posterior distribution, conditional on is given by
where is any Borel subset of . In view of the main theorem, we recall that a sequence of probability measures on some topological space , satisfies a LDP with a rate function if is a lower semi-continuous function and
where and are the interior and the closure of , respectively.
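Spelled out with our symbols (writing $Q_n$ for the measures and $I$ for the rate function), the LDP bounds read, for every Borel set $A$:

```latex
-\inf_{x \in A^{\circ}} I(x)
\;\le\; \liminf_{n\to\infty} \frac{1}{n} \log Q_n(A)
\;\le\; \limsup_{n\to\infty} \frac{1}{n} \log Q_n(A)
\;\le\; -\inf_{x \in \bar{A}} I(x),
```

where $A^{\circ}$ and $\bar{A}$ denote interior and closure.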
Now we are ready to state the main result of the paper.
If the sequence converges to some belonging to the interior of , the sequence of probability measures on defined in (4), satisfies a LDP with rate function
where is any maximizer of in .
The assumption that converges to is quite natural. By the Strong Law of Large Numbers, if we consider i.i.d. samples drawn from a distribution with mean , this assumption holds almost surely. The regularity of implies that is an upper semi-continuous function with compact superlevel sets (see e.g. Barndorff-Nielsen (1978), page 150), from which the existence (but not the uniqueness) of is guaranteed. Note that, by (3), can be interpreted as a limiting Maximum Likelihood Estimate of in , because is supposed to be the limit of the sample means (as ).
In addition, by regularity, there exists with and it is unique, since is strictly convex in . By differentiation, one checks that is also the unique minimum point of in the whole : it is the limiting unrestricted MLE. Indeed, for we have
which is positive except when . Now either , which is the “inverse Sanov regime”, in which case and is given by (7) (extended to out of ), or , which means that the model is misspecified. Also when the rate function can be rewritten as an “excess of divergence over the minimum”, in the form
for , and elsewhere. In addition, the function is clearly lower semi-continuous.
It is worth mentioning that in the misspecified case, too, the rate function can itself be written as a divergence, by means of the Pythagorean identity for linear subfamilies stated below, first proved by Simon (1973). In Section 3 we will give an example to illustrate the failure of this property for genuinely curved subfamilies.
Let and let be a measurable support of , where is an affine submanifold of . Then is the only vector in such that the difference is orthogonal to . Moreover for any it holds
From the previous result it follows that when the statistical model entailed by the prior is a linear subfamily of the saturated model, the rate function governing the LDP for the sequence of posteriors (4) is the same for a misspecified case and for a correctly specified one in which this is replaced by , as long as and is obtained from as in Theorem 1 (indeed notice ).
Finally, observe also that the various choices of in the rate function (6) allow us to consider different statistical models which are embedded in the saturated model, with a different relative topology in which the LDP holds.
The proof of Theorem 1 will be given in Section 2: it relies on some general facts about convex conjugate functions. In Section 3 we will discuss its frequentist counterpart, namely the LDP for the Maximum Likelihood Estimator (again denoted by MLE). The analysis of large deviations for consistent estimators in classical statistics dates back to the results of Bahadur (see Bahadur et al. (1980)). The application to the MLE in exponential families was discussed by Kester and Kallenberg (1986) and Arcones (2006), who observed that the parametric analogue of the Sanov theorem holds only for linear subfamilies. This is due to the failure of the Pythagorean identity for genuinely curved families, which will be illustrated through an example. Section 4 is devoted to examining an extension of Theorem 1 to non regular families. Since this is more cumbersome to state, we have decided to put it in a separate section. The last section deals with exponential families generated by dual measures, a concept which arises quite naturally from the subject of the paper (see Letac, 2021+).
2 Proof of the main theorem
Before giving the proof of Theorem 1, we need to recall some general facts about natural exponential families, which can be found in the books of Barndorff-Nielsen (1978) and Brown (1986). As anticipated in the introduction, we assume that is a regular -finite Borel measure on . Then the cumulant generating function is a convex (and lower semi-continuous) function on , which is strictly convex (and continuous) in (see e.g. Barndorff-Nielsen (1978), Theorem 7.1). Moreover is differentiable in , and maps diffeomorphically onto the interior of (see e.g. Barndorff-Nielsen (1978), page 121). Throughout the paper, we set
for , and we consider the conjugate function of defined by
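In symbols (ours), the conjugate is the Legendre transform of $\psi$, and its gradient inverts $\nabla\psi$:

```latex
\psi^{*}(x) \;=\; \sup_{\theta} \big( \langle \theta, x \rangle - \psi(\theta) \big),
\qquad
\nabla \psi^{*} \;=\; (\nabla \psi)^{-1}
\quad \text{on the interior of the essential domain.}
```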
It is a lower semi-continuous convex function and differentiable in the interior of its essential domain, which coincides with , being
(see e.g. Barndorff-Nielsen (1978), Theorems 9.1, 9.2 and 9.13). The gradient is the inverse mapping to , thus it is defined in onto . Moreover
whereas and the equality holds if and only if and , or equivalently if , with . As a consequence
Thus, for , is the MLE for the parameter in the saturated model. Moreover in (7) (since ), and
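These computations rest on a standard identity expressing the Kullback-Leibler divergence within an exponential family through the cumulant function (our notation, with $f_\alpha$, $f_\beta$ two members of the family (2)):

```latex
K(f_\alpha \mid f_\beta)
\;=\; \psi(\beta) - \psi(\alpha) - \langle \beta - \alpha,\, \nabla\psi(\alpha) \rangle ,
\qquad \alpha, \beta \in \Theta .
```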
In order to discuss also the LDP’s for MLE’s it is worth observing that, once defined
we can also write (similarly to (10))
and from (11) we get
Finally for an arbitrary set define
where if and if . The function is again a lower semi-continuous convex function, being a supremum of affine functions. Given that (the former being a supremum constrained to a smaller domain), it is , and therefore . Since a convex function is continuous in the interior of its effective domain (see e.g. Roberts and Varberg (1973), Theorem D, page 93), the function is always continuous in , whatever the choice of the set .
For proving Theorem 1, first we need the following lemma.
Under the assumptions of Theorem 1,
First of all, recall that by assumption
where . Now, replacing the integrand with its supremum over the support , we immediately have
Hence if tends to as tends to , then is eventually in . By continuity of within this set, tends to and
For the reverse inequality let be the ball of radius and center and observe that for any , , from which
Sending to first, and then to , one gets
Finally we prove that the left hand side in display (13) cannot be smaller than . Reasoning by contradiction, suppose that for some
Then for any positive integer there exist and an integer such that
Now converges to , converges to , and can be chosen to be increasing with . As we get the convergence of the left hand side of (14) to , which is impossible since we assumed . So we have proved that
ending the proof. ∎
Proof of Theorem 1.
The proof of the upper bound consists in estimating the numerator in Bayes’ formula (4). Choose , where is a Borel set of and is a measurable support of . Then, with exactly the same argument as in the previous lemma
which, together with Lemma 2, implies the rightmost inequality in (5), with the rate function defined in (6). Indeed the supremum in (16) is increased once it is taken in the closure of (in the relative topology of ).
As far as the lower bound is concerned, let be the interior of the measurable set in the relative topology of . Thus is an open set, and . Repeating the argument of the previous proof with any replacing and replacing , one arrives at
As a consequence the leftmost inequality in (5) is readily obtained.
The Hardy-Weinberg family of distributions, in its simplest form with two alleles (see e.g. Barndorff-Nielsen (1978), Example 8.10), is a subfamily of the family of all distributions over outcomes, coded with the vectors in the plane . By choosing
this family is represented as the natural exponential family generated by , with the natural parameter and
The probabilities of these outcomes are
The Hardy-Weinberg subfamily assumes that these probabilities arise from a binomial distribution with trials, where corresponds to one success and one failure; hence they are subject to the constraints
which in terms of the natural parameters becomes
taken to be the support of the prior distribution . Let be any vector with positive components and , which means that belongs to the interior of . With simple computations, the maximizer of the likelihood function with is given by
Then, by Theorem 1, if the sequence converges to as , the sequence of probability measures on satisfies a LDP with rate function . This is better visualized in terms of the parameter , which is the success probability of the underlying binomial distribution. Since is the limiting MLE of this parameter, with simple computations one gets
the Kullback-Leibler divergence between two binomials with trials and probability of success and , respectively, in agreement with Proposition 1.
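As a quick numerical check of this formula (an illustrative sketch with hypothetical values, not code from the paper), the divergence between two binomials with the same number of trials reduces to $n$ times the corresponding Bernoulli divergence:

```python
import math

def kl_binomial(n, p, q):
    """Kullback-Leibler divergence K(Bin(n, p) | Bin(n, q)).

    For two binomials with the same number n of trials this equals
    n * [p log(p/q) + (1 - p) log((1 - p)/(1 - q))].
    """
    def term(a, b):
        # Convention: 0 * log(0 / b) = 0.
        return a * math.log(a / b) if a > 0 else 0.0
    return n * (term(p, q) + term(1.0 - p, 1.0 - q))

# The divergence vanishes exactly when the success probabilities coincide.
print(kl_binomial(2, 0.5, 0.5))      # 0.0
print(kl_binomial(2, 0.7, 0.5) > 0)  # True
```

Note the divergence scales linearly in the number of trials, in agreement with the displayed formula.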
3 Large deviation principles for the MLE
This section is devoted to reviewing what can be considered as the frequentist counterpart of Theorem 1, namely the LDP for a MLE in a curved exponential family. Let be an i.i.d. sample drawn from belonging to the family (2), where . A Maximum Likelihood Estimator constrained to a measurable parameter set is a measurable mapping such that
almost surely with respect to . In order to relate this terminology with that of the Introduction, notice that we are allowed to say that the values of a maximum likelihood estimator are maximum likelihood estimates (see (3)). When , subtracting the maximum value of the unconstrained likelihood function from both sides (see (10) with ), the above inequality is equivalently formulated as
Under suitable assumptions, the following result can be derived using rather general results of the theory of large deviations.
Suppose that is an i.i.d. sample drawn from and the sample mean takes values in a.s. Moreover suppose that there exists a continuous function which is a MLE constrained to . Then the sequence satisfies a LDP with a rate function
By Cramér’s theorem (see e.g. Theorem 2.2.30 in Dembo and Zeitouni (1998)), the sample mean of i.i.d. random variables drawn from satisfies a LDP with rate function
with the supremum on the whole space . Therefore, by (10) and a straightforward change of variable, for we have
Finally, by the continuity of , we use the contraction principle (see e.g. Dembo and Zeitouni (1998), Section 4.1.4) and we get the LDP of with rate function defined by
which is easily seen to coincide with (18) since and are inverse to each other, between and . ∎
As a consequence of the previous result, under the conditions stated therein, the rate function , for any , is computed by solving the following geometrical problem: find the parameter vector “closest” to within the “surface of constant MLE”
in the sense of minimizing . If
then the quantity is equal to ; therefore when , we can say that the “parametric” Sanov theorem holds.
The property (19) holds for the full exponential family, i.e. when . Indeed the MLE is injective, hence the set reduces to a point. More generally it holds under the assumptions of the following proposition, which is immediately obtained from Proposition 1.
When , being an affine submanifold of , if under i.i.d. sampling from , with , the sample mean takes values a.s. in , then there is a uniquely defined MLE and
for any . As a consequence, as long as , it holds
When does not have the form prescribed by the previous result, the above displayed property fails, as illustrated by the following example.
The family of Gaussian distributions with mean equal to the standard deviation forms a one-parameter curved subfamily of the two-parameter Gaussian exponential family. Recall that the family of distributions in the cartesian plane that are images of Gaussian laws on the real line under the mapping , where , is a natural exponential family, with a generating measure that can be chosen equal to the image of the Lebesgue measure under the above mapping. The natural parameters are then
and the cumulant generating function is
whose gradient is given by
Since is supported by the graph of the function , the set is the subset of the plane above this graph. It is clear that the mean of a sample of size drawn from any law of this exponential family will lie on this set unless all the elements of the sample are equal, which happens with probability zero.
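In explicit coordinates (our reconstruction from the sufficient statistic $x \mapsto (x, x^{2})$; the paper’s normalization of the generating measure may shift $\psi$ by a constant), for the Gaussian law with mean $m$ and variance $\sigma^{2}$:

```latex
\theta_1 = \frac{m}{\sigma^{2}}, \qquad
\theta_2 = -\frac{1}{2\sigma^{2}} \; (<0), \qquad
\psi(\theta_1, \theta_2) = -\frac{\theta_1^{2}}{4\theta_2}
  + \frac{1}{2}\log\frac{\pi}{-\theta_2},
% whose gradient recovers the first two moments:
\nabla\psi(\theta_1, \theta_2)
  = \Big( -\frac{\theta_1}{2\theta_2},\;
          \frac{\theta_1^{2}}{4\theta_2^{2}} - \frac{1}{2\theta_2} \Big)
  = \big( m,\; m^{2} + \sigma^{2} \big).
```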
The subfamily of laws with mean equal to the standard deviation corresponds to the following curve in the natural parameter space
whose image under the mapping is readily checked to be the graph of the function , restricted to the first quadrant of the plane. The first order condition for the maximization of the likelihood in , with parametrized as in (21), when and , gives the following equation
whose unique positive solution is
where has to be determined. In order to do this, observe that, by means of (12), the minimization problem appearing in (18) can be rephrased as the maximization of constrained to . Now observe that the set of such that (22) is satisfied can be described as the graph of the function
so, if we set , the first order condition for such a maximization problem in is
By recalling that and are inverse to each other and taking (20) into account, this gives the quadratic equation in
Finally assume that , for , that is belongs to . The above equation has the solution , corresponding to , provided satisfies , which is equivalent to . As a consequence
and the equality holds if and only if . So the “parametric” Sanov theorem fails because, for all values , we have the strict inequality.
4 LDP’s when the set is not open
The aim of this section is to explain under which circumstances the LDP stated as Theorem 1 continues to hold when the essential domain of the cumulant generating function (of the reference measure ) is not open. In this case remains continuous in the interior of , but this is not necessarily true at boundary points of . The basic assumption remains unchanged: the sequence converges to , which ensures that there exists such that