A Family of Bayesian Cramér-Rao Bounds, and Consequences for Log-Concave Priors

02/22/2019
by   Efe Aras, et al.

Under minimal regularity assumptions, we establish a family of information-theoretic Bayesian Cramér-Rao bounds, indexed by probability measures that satisfy a logarithmic Sobolev inequality. This family includes as a special case the known Bayesian Cramér-Rao bound (or van Trees inequality), and its less widely known entropic improvement due to Efroimovich. For the setting of a log-concave prior, we obtain a Bayesian Cramér-Rao bound which holds for any (possibly biased) estimator and, unlike the van Trees inequality, does not depend on the Fisher information of the prior.


1 Introduction

Throughout, we let $\mathcal{P}(\mathbb{R}^d)$ denote the set of Borel probability measures on $\mathbb{R}^d$. For $\mu \in \mathcal{P}(\mathbb{R}^d)$, we abuse notation slightly and define

$$\mathrm{Var}(\mu) := \int \Big| x - \int y \, d\mu(y) \Big|^2 d\mu(x),$$

where $|\cdot|$ denotes Euclidean length on $\mathbb{R}^d$. Thus, $\mathrm{Var}(\mu)$ is the usual variance in dimension $d = 1$; it is the trace of the covariance matrix corresponding to $\mu$ for arbitrary dimension $d$. A probability measure $\mu(dx) = e^{-\varphi(x)}\,dx$ is said to be log-concave if $\varphi : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is convex. All logarithms are taken with respect to the natural base.
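For concreteness, three standard examples consistent with this definition (the particular instances are illustrative choices, not drawn from the paper): the standard Gaussian, the two-sided exponential, and the uniform measure on a convex body are all log-concave,

$$d\mu(x) = \frac{e^{-|x|^2/2}}{(2\pi)^{d/2}}\,dx, \qquad d\mu(x) = \tfrac{1}{2} e^{-|x|}\,dx \ \ (d = 1), \qquad d\mu(x) = \frac{\mathbf{1}_K(x)}{\mathrm{vol}(K)}\,dx \ \ (K \subset \mathbb{R}^d \text{ convex}),$$

corresponding to $\varphi(x) = |x|^2/2 + \tfrac{d}{2}\log(2\pi)$, $\varphi(x) = |x| + \log 2$, and $\varphi = \log \mathrm{vol}(K)$ on $K$ (and $+\infty$ off $K$), respectively. The last example has infinite Fisher information, a point that becomes relevant below.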

Our results are best stated within the general framework of parametric statistics. To this end, we let $(P_\theta)_{\theta \in \mathbb{R}^d}$ be a dominated family of probability measures on a measurable space $(\mathcal{X}, \mathcal{F})$, with dominating $\sigma$-finite measure $\lambda$. To each $\theta$, we associate a density (w.r.t. $\lambda$) according to

$$p_\theta := \frac{dP_\theta}{d\lambda}.$$

The Fisher information of the parametric family evaluated at $\theta$ is defined as

$$I_F(\theta) := \int_{\mathcal{X}} \frac{|\nabla_\theta\, p_\theta(x)|^2}{p_\theta(x)}\, d\lambda(x),$$

where $\nabla_\theta$ denotes the gradient with respect to $\theta$. Note that $I_F(\theta)$ is distinct from the information theorist's Fisher information $I(\mu)$, defined as

$$I(\mu) := \int \frac{|\nabla f(x)|^2}{f(x)}\, dx$$

for a probability measure $\mu$ having density $f$ with respect to Lebesgue measure. In the special case where $\theta$ is a location parameter, the two quantities coincide.
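To make the last claim explicit (a standard one-line computation, in the notation introduced above): if $p_\theta(x) = f(x - \theta)$ is a location family with differentiable density $f$, then $\nabla_\theta\, p_\theta(x) = -\nabla f(x - \theta)$, so

$$I_F(\theta) = \int \frac{|\nabla f(x - \theta)|^2}{f(x - \theta)}\, dx = \int \frac{|\nabla f(u)|^2}{f(u)}\, du = I(f),$$

independently of $\theta$.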

For a real-valued parameter $\theta$ and an observation $X \sim P_\theta$, the basic question of parametric statistics is how well one can estimate $\theta$ from $X$. Here, the Cramér-Rao bound is of central importance in proving lower bounds on estimation error, stating that

$$\mathbb{E}_{X \sim P_\theta}\big[(\hat{\theta}(X) - \theta)^2\big] \geq \frac{1}{I_F(\theta)} \tag{1}$$

for any unbiased estimator $\hat{\theta}$. The assumption of unbiasedness is quite restrictive, especially since unbiased estimators may not always exist, or may be less attractive than biased estimators for any one of a variety of reasons (computability, performance, etc.). Under the assumption that the parameter $\theta$ is distributed according to some prior $\pi$, the so-called Bayesian Cramér-Rao bound [1, 2] (also known as the van Trees inequality) states, under mild regularity assumptions, that

$$\mathbb{E}\big[(\hat{\theta}(X) - \theta)^2\big] \geq \frac{1}{\mathbb{E}_\pi[I_F(\theta)] + I(\pi)}, \tag{2}$$

where the expectation is over $\theta \sim \pi$ and, conditioned on $\theta$, $X \sim P_\theta$. As noted by Tsybakov [3, Section 2.7.3], this inequality is quite powerful since it does not impose any restriction on unbiasedness, is relatively simple to apply, and often leads to sharp results (including sharp constants). Tsybakov states that one primary disadvantage of (2) is that it applies only to quadratic loss. Although it does not appear to be widely known, this is actually not true. Indeed, Efroimovich proved in [4] that

$$\frac{1}{2\pi e}\, e^{2 h(\theta \mid X)} \geq \frac{1}{\mathbb{E}_\pi[I_F(\theta)] + I(\pi)}, \tag{3}$$

which is stronger than (2) by the maximum-entropy property of Gaussians. Efroimovich's inequality can be rearranged to give an upper bound on the mutual information $I(\theta; X)$:

$$I(\theta; X) \leq h(\pi) + \frac{1}{2}\log\!\left(\frac{\mathbb{E}_\pi[I_F(\theta)] + I(\pi)}{2\pi e}\right).$$
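To spell out why (3) is stronger than (2) (a standard chain of inequalities, reconstructed here rather than quoted from the paper): for any estimator $\hat{\theta}$,

$$\mathbb{E}\big[(\hat{\theta}(X) - \theta)^2\big] \;\geq\; \mathbb{E}\big[\mathrm{Var}(\theta \mid X)\big] \;\geq\; \frac{1}{2\pi e}\, e^{2 h(\theta \mid X)},$$

where the first inequality is the optimality of the posterior mean and the second combines the maximum-entropy property of Gaussians (applied conditionally on $X$) with Jensen's inequality; chaining these with (3) recovers (2).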

Such a general upper bound on $I(\theta; X)$ can be useful in settings beyond those where (2) applies. For example, it can be used to give one direction of the key estimate in Clarke and Barron's work showing that Jeffreys' prior is least favorable [5]. It can also be applied to characterize Bayes risk measured under losses other than squared error when coupled with a lower bound on mutual information (see, e.g., [6]). We remark that several systematic techniques exist for lower bounding the mutual information $I(\theta; X)$ in terms of Bayes risk (e.g., Fano's method, or the Shannon lower bound for the rate-distortion function), so finding a good upper bound is often the challenge. A typical heuristic is to bound $I(\theta; X)$ from above by the capacity of the channel $\theta \to X$, but this method has the disadvantages that (i) it discards information about the prior $\pi$; and (ii) capacity expressions are only explicitly known for very special parametric families (e.g., Gaussian channels). Efroimovich's inequality overcomes both of these obstacles, but has the undesirable property of being degenerate when $I(\pi) = \infty$. This can be a serious disadvantage in applications since many natural priors have infinite Fisher information, for example uniform measures on convex bodies.¹

¹ Mollification may be a useful heuristic to compensate for infinite $I(\pi)$ in low dimensions, but this becomes fundamentally problematic in high dimensions, where mollification picks up dimensional dependence and generally alters the boundary of a set where the measure concentrates.
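As a quick numerical illustration of (2) in a model where everything is computable (a minimal Python sketch; the Gaussian prior, noise model, variances, and variable names below are illustrative assumptions, not taken from the paper):

import numpy as np

# Gaussian location model: theta ~ N(0, s2) and, given theta, X ~ N(theta, n2).
# Then E_pi[I_F(theta)] = 1/n2 and I(pi) = 1/s2, so the van Trees bound (2)
# reads E[(theta_hat - theta)^2] >= 1 / (1/n2 + 1/s2).
s2, n2 = 2.0, 0.5
rng = np.random.default_rng(0)
theta = rng.normal(0.0, np.sqrt(s2), size=1_000_000)
x = theta + rng.normal(0.0, np.sqrt(n2), size=theta.size)

theta_hat = x * s2 / (s2 + n2)            # posterior mean (Bayes estimator)
mse = np.mean((theta_hat - theta) ** 2)
van_trees = 1.0 / (1.0 / n2 + 1.0 / s2)
print(mse, van_trees)                     # the bound is attained in this model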

Contributions

We make two main contributions, which we describe in rough terms here. Precise statements are given in Section 2. First, we establish a family of Bayesian Cramér-Rao-type bounds indexed by probability measures that satisfy a logarithmic Sobolev inequality on $\mathbb{R}^d$. This generalizes Efroimovich's inequality (3), which corresponds to the special case where the reference measure is taken to be Gaussian. Second, we specialize the first result to obtain an explicit Bayesian Cramér-Rao-type bound under the assumption of a log-concave prior $\pi$. In dimension one, the result implies

(4)

provided a mild non-degeneracy condition relating the prior to the parametric family holds; a correction is needed if this condition is not met² (see Theorem 2 for a precise statement). In particular, a bound of the form

$$\mathbb{E}\big[(\hat{\theta}(X) - \theta)^2\big] \geq \frac{c}{\mathbb{E}_\pi[I_F(\theta)]}$$

holds under our assumptions for a universal constant $c > 0$, regardless of whether $\hat{\theta}$ is biased. This should be compared to the classical Cramér-Rao bound: morally speaking, (1) continues to hold (up to a modest constant factor) for any estimator $\hat{\theta}$, provided we are working with a log-concave prior $\pi$ which, together with the parametric family, satisfies the non-degeneracy condition above. Note that the crucial (and somewhat surprising) advantage relative to (3) is that the Fisher information $I(\pi)$ of the prior does not appear.

² It is easy to see why a condition like this is needed: if there were no such assumption, then we could let $\pi$ approximate a point mass, effectively showing that the Cramér-Rao bound holds – up to an absolute constant – for any estimator. This clearly cannot be true (consider $\hat{\theta}$ constant, not equal to $\theta$).
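To see why a bound free of $I(\pi)$ is useful, consider a numerical sketch with a uniform prior, for which $I(\pi) = \infty$ and (2)-(3) degenerate. The model, noise variance, and estimator below are illustrative choices, not taken from the paper, and only the qualitative behavior (not the theorem's constant) is being illustrated:

import numpy as np
from scipy.stats import norm

# Uniform prior on [0, 1]: log-concave, but I(pi) is infinite, so the van
# Trees / Efroimovich bounds degenerate.  With X | theta ~ N(theta, n2) we
# have E_pi[I_F(theta)] = 1/n2, and we compare the Monte Carlo Bayes risk
# to 1 / E_pi[I_F(theta)] = n2.
rng = np.random.default_rng(1)
n2 = 0.01
theta = rng.uniform(0.0, 1.0, size=200_000)
x = theta + rng.normal(0.0, np.sqrt(n2), size=theta.size)

# Posterior mean under the uniform prior: mean of N(x, n2) truncated to [0, 1].
s = np.sqrt(n2)
a, b = (0.0 - x) / s, (1.0 - x) / s
theta_hat = x + s * (norm.pdf(a) - norm.pdf(b)) / (norm.cdf(b) - norm.cdf(a))

bayes_risk = np.mean((theta_hat - theta) ** 2)
print(bayes_risk / n2)   # ratio is of order one: risk ~ 1 / E_pi[I_F(theta)]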

Organization

The sequel is organized as follows: the main results, along with assumptions and a brief discussion, are provided in Section 2. The proofs of all results can be found in Section 3.

2 Main Results

2.1 Assumptions

As is typical of Cramér-Rao-type bounds, our main results require us to assume some mild regularity. In particular, for a given dominating measure $\lambda$, we will refer to the following standard condition on the densities associated to $(P_\theta)_{\theta \in \mathbb{R}^d}$:

$$\int_{\mathcal{X}} \nabla_\theta\, p_\theta(x)\, d\lambda(x) = 0 \quad \text{for all } \theta, \tag{5}$$

where $\nabla_\theta$ denotes the gradient with respect to $\theta$. We remark that this holds whenever the order of differentiation with respect to $\theta$ and integration with respect to $\lambda$ can be exchanged (Leibniz rule), since $\int_{\mathcal{X}} p_\theta\, d\lambda = 1$ for every $\theta$.
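A quick symbolic check of (5) for a Gaussian location family (a sketch; the family is chosen purely for illustration):

import sympy as sp

# Verify condition (5) for the Gaussian location family p_theta(x) = N(theta, 1):
# integrating the theta-derivative of the density over x gives 0.
x, theta = sp.symbols('x theta', real=True)
p = sp.exp(-(x - theta)**2 / 2) / sp.sqrt(2 * sp.pi)
print(sp.simplify(sp.integrate(sp.diff(p, theta), (x, -sp.oo, sp.oo))))   # -> 0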

2.2 Statement of Results

Our first main result establishes a family of Cramér-Rao-type bounds on the mutual information $I(\theta; X)$ in terms of logarithmic Sobolev inequalities on $\mathbb{R}^d$. To this end, we recall the standard definitions of relative entropy and relative Fisher information (the parlance in which logarithmic Sobolev inequalities are framed). Consider $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$, with $\mu \ll \nu$ and $f := \frac{d\mu}{d\nu}$. The entropy of $\mu$, relative to $\nu$, is defined as

$$D(\mu \,\|\, \nu) := \int f \log f \, d\nu.$$

If the density $f$ is weakly differentiable, the Fisher information of $\mu$, relative to $\nu$, is defined according to

$$I(\mu \,\|\, \nu) := \int \frac{|\nabla f|^2}{f}\, d\nu.$$

If $f$ is not weakly differentiable, we adopt the convention that $I(\mu \,\|\, \nu) = +\infty$, so that our expressions make sense even in the general case.

A probability measure $\nu$ is said to satisfy a logarithmic Sobolev inequality with constant $c > 0$ (or $\mathrm{LSI}(c)$, for short) if, for all probability measures $\mu \ll \nu$,

$$D(\mu \,\|\, \nu) \leq \frac{c}{2}\, I(\mu \,\|\, \nu).$$

The standard Gaussian measure $\gamma$ on $\mathbb{R}^d$ is a prototypical example of a measure that satisfies an LSI, and does so with constant $c = 1$. More generally, if $d\nu = e^{-V}\,dx$ with $\nabla^2 V \succeq \frac{1}{c}\,\mathrm{Id}$ for $c > 0$ and $\mathrm{Id}$ the identity matrix, then $\nu$ satisfies $\mathrm{LSI}(c)$ [7]; this result is known as the Bakry-Émery theorem, and we shall need it later in the proof of Theorem 2.
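As a sanity check on these conventions (a minimal sketch; the closed-form expressions below are standard for Gaussian test measures and are not taken from the paper), one can verify $D(\mu \,\|\, \gamma) \leq \tfrac{1}{2} I(\mu \,\|\, \gamma)$ numerically for $\mu = N(m, s^2)$ and $\gamma = N(0, 1)$:

import numpy as np

# Gaussian LSI check: for mu = N(m, s2) and gamma = N(0, 1),
#   D(mu || gamma) = 0.5 * (s2 + m^2 - 1 - log s2)
#   I(mu || gamma) = m^2 + (s2 - 1)^2 / s2
# and the LSI with constant c = 1 asserts D <= I / 2.
def rel_entropy(m, s2):
    return 0.5 * (s2 + m**2 - 1.0 - np.log(s2))

def rel_fisher(m, s2):
    return m**2 + (s2 - 1.0)**2 / s2

for m in (0.0, 0.5, 3.0):
    for s2 in (0.2, 1.0, 4.0):
        D, I = rel_entropy(m, s2), rel_fisher(m, s2)
        assert D <= 0.5 * I + 1e-12
        print(f"m={m:3.1f}  s2={s2:3.1f}  D={D:7.4f}  I/2={0.5 * I:7.4f}")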

With these definitions in hand, our first result is the following:

Theorem 1.

Let $\nu \in \mathcal{P}(\mathbb{R}^d)$ satisfy $\mathrm{LSI}(c)$ and assume the regularity condition (5) holds. For any prior probability measure $\pi$ on $\mathbb{R}^d$,

(6)

Inequality (6) improves the LSI for $\nu$. Indeed, taking $X$ independent of $\theta$ renders $I(\theta; X) = 0$ and $I_F \equiv 0$, so that the LSI for $\nu$ is recovered. However, the proof of (6) follows from a relatively simple application of the LSI for $\nu$ and some basic calculus, so the two inequalities should be viewed as being formally equivalent in this sense.

Clearly, the statement of Theorem 1 allows us the freedom to choose the measure $\nu$ so as to obtain the tightest possible bound on $I(\theta; X)$. However, a notable example is obtained when $\nu$ is taken to be the standard Gaussian measure on $\mathbb{R}^d$. In this case, upon simplification we obtain

(7)

Of note, (7) is not invariant to rescalings of the parameter $\theta$. So, just as one passes from Lieb's inequality to the entropy power inequality, we may optimize over all such scalings to obtain the following multidimensional version of (3):

Remark 1.

Efroimovich's work [4] contains a slightly stronger multidimensional form, stated in terms of determinants of Fisher information matrices. As defined, our Fisher information quantities $I(\pi)$ and $\mathbb{E}_\pi[I_F(\theta)]$ correspond to traces of the same matrices, leading to a weaker inequality by the arithmetic-geometric mean inequality. Nevertheless, the two inequalities should really be regarded as essentially equivalent, as they are both direct consequences of the one-dimensional inequality (where the two results coincide). See [4, Proof of Theorem 5] for details. It is unclear whether a similar claim holds for non-Gaussian $\nu$ in (6).

We remark that (3) was discovered by Efroimovich in 1979, but does not appear to be widely known (we could not find a statement of the result outside the Russian literature). At the time of Efroimovich's initial discovery of (3), the study of logarithmic Sobolev inequalities was just getting started, being largely initiated by Gross's work on the Gaussian case in 1975 [8]. In particular, the derivation of (3) (and, in particular, the less general van Trees inequality) from the Gaussian logarithmic Sobolev inequality does not appear to have been observed previously. So, from a conceptual standpoint, one contribution of Theorem 1 is that it demonstrates how Efroimovich's result (and the weaker van Trees inequality) emerges as one particular instance in the broader context of LSIs, which, to our knowledge, have not found direct use in parametric statistics beyond their implications for measure concentration (see, e.g., [9]).

A nontrivial consequence of Theorem 1 is a general Cramér-Rao-type bound on $I(\theta; X)$, assuming only that the prior $\pi$ is log-concave. Specifically, our second main result is the following:

Theorem 2.

Assume the parametric family satisfies (5) for equal to Lebesgue measure. Let satisfy for some scalar , where is the identity matrix. Define , . It holds that

(8)

where

Remark 2.

The one-dimensional inequality (4) follows directly from Theorem 2 for $d = 1$, combined with the entropy lower bound for log-concave random variables due to Marsiglietti and Kostina [10]. Similar statements hold for general dimension $d$, albeit with a correction factor that depends on dimension (no correction is needed if the hyperplane conjecture is true; see [11]).

The upper bound (8) should be viewed as a function of two nonnegative quantities: the products and . By the Brascamp-Lieb inequality [12], we always have ; this quantity only depends on the prior $\pi$ and distills what quantitative information is known about its degree of log-concavity. In particular, if $\pi$ is only known to be log-concave, then gives . In the other extreme case, if (e.g., if $\pi$ is a scaled standard Gaussian), we have the slightly improved bound . These bounds both essentially behave as for modestly large, so knowledge of (i.e., additional information about the measure $\pi$) only significantly affects the behavior of the upper bound (8) for small. To be precise, for near zero, the upper bound behaves as when , and if . Applications in asymptotic statistics consider a sequence of observations $X_1, \dots, X_n$, conditionally independent given $\theta$. In this case, $\mathbb{E}_\pi[I_F(\theta)]$ grows linearly with $n$, so that the logarithmic behavior of the bound dominates, regardless of what is known about $\pi$.

Let us now make a brief observation on the sharpness of Theorem 2. To this end, consider the classical Gaussian sequence model, where the noise is independent of $\theta$. In this case, the typical quantity of relevance is the signal-to-noise ratio $\mathrm{snr}$, in terms of which we have the sharp upper bound

(9)

Thus, in view of the previous discussion, we clearly see that Theorem 2 provides a sharp estimate in the regime where $\mathrm{snr}$ is moderately large. We do not yet know whether the bound is sharp for small and , but we believe that it should be.
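For reference, in a Gaussian model of this type the mutual information has a closed form. The following sketch evaluates the benchmark $\tfrac{1}{2}\log(1 + \mathrm{snr})$ per coordinate; the prior/noise variances and the particular Gaussian prior used here are illustrative assumptions rather than the paper's exact setup:

import numpy as np

# Gaussian benchmark: theta ~ N(0, s2) and Y_i = theta + Z_i with
# Z_i ~ N(0, n2) i.i.d., i = 1, ..., n.  The sample mean is sufficient, and
# I(theta; Y^n) = 0.5 * log(1 + snr) with snr = n * s2 / n2.
def gaussian_mutual_info(s2, n2, n):
    snr = n * s2 / n2
    return 0.5 * np.log1p(snr)

for n in (1, 10, 100, 1000):
    print(n, gaussian_mutual_info(s2=1.0, n2=1.0, n=n))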

Finally, we remark that all results have correct dependence on dimension, as can be seen by testing on product measures.

2.3 Remarks on Applications

Applications of Cramér-Rao-type bounds to parameter estimation are numerous, and our results will generally apply in Bayesian settings. In particular, we believe corollaries such as (4) may be especially useful for proving lower bounds on Bayes risk when the prior is log-concave.

We note that our results are quite general in form, and therefore not restricted to applications in parametric statistics. To give one quick example, consider a log-concave probability measure $\mu$ on $\mathbb{R}^d$, normalized so that , and define , where $X_1, \dots, X_n$ are drawn i.i.d. according to $\mu$. Then, an immediate corollary of Theorem 2 is that, for $n$ sufficiently large,

which is a sort of reverse entropy power inequality, holding for log-concave random vectors. This improves a result of Cover and Zhang [13] for $n$ sufficiently large, in which the leading coefficient in parentheses on the right is . This inequality should also be compared to the formulation of the hyperplane conjecture recently put forth by Marsiglietti and Kostina [14].

3 Proofs

This section contains the proofs of the main results.

3.1 Proof of Theorem 1

We may assume that the RHS of equation (6) is finite; else the claim is trivially true. Let , and note that is the joint density of with respect to . Define , and , which is well-defined -a.e. Now, since satisfies , we have for -a.e. 

where we write in place of for brevity. Integrating both sides with respect to the density , we have

Now, observe that

where the penultimate identity follows by the product rule for derivatives and expanding the square. The final cross term is integrable; indeed, Cauchy-Schwarz yields

The exchange of integrals to obtain the last line is justified by Tonelli’s theorem. Therefore, by Fubini’s theorem,

where the last equality follows by the regularity assumption. Summarizing, we have

To finish, we observe that

which proves the claim.

3.2 Proof of Theorem 2

We require the following proposition, the proof of which is the most arduous part of the argument. The ideas of the proof are independent of Theorem 2, so it is deferred to the appendix.

Proposition 1.

Let be a probability density on , with convex.

  1. For each , there exists a unique such that

  2. For as in part (i), and each

To begin the proof, consider the log-concave density , where . For , let be the probability measure with density

where is a normalizing constant and is such that , which exists as a consequence of Proposition 1(i). Note that has density with respect to . Therefore, we may readily compute

By the Bakry-Émery theorem, satisfies , so it follows from Theorem 1 that

By Proposition 1(ii) and the inequality

holding by definition of , we have

(10)

where are as defined in the statement of the theorem. Since the above holds for arbitrary , we now particularize by (optimally) choosing

if , and otherwise choosing

It can be verified that if , then this choice of ensures . On the other hand, if , then this choice of ensures . Hence, substitution into equation (10) and simplifying yields:

where is defined piecewise according to

This bound is actually better than what is stated in the theorem, but is clearly a bit cumbersome. Since , we note the simpler (yet still essentially as good) bound holding for in the range , completing the proof.

Acknowledgement

This work was supported in part by NSF grants CCF-1704967, CCF-0939370 and CCF-1750430.

References

  • [1] R. D. Gill and B. Y. Levit. Applications of the van Trees inequality: a Bayesian Cramér-Rao bound. Bernoulli, 1(1-2):59–79, 1995.
  • [2] H. L. van Trees. Detection, estimation, and modulation theory, part I: detection, estimation, and linear modulation theory. John Wiley & Sons, 1968.
  • [3] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer-Verlag New York, 2009.
  • [4] S. Y. Efroimovich. Information contained in a sequence of observations (in Russian). Problems of Information Transmission, 15(3):24–39, 1979.
  • [5] B. S. Clarke and A. R. Barron. Jeffreys' prior is asymptotically least favorable under entropy risk. Journal of Statistical Planning and Inference, 41(1):37–60, 1994.
  • [6] Y. Wu. Lecture notes for information-theoretic methods for high-dimensional statistics, July 2017.

  • [7] D. Bakry and M. Émery. Diffusions hypercontractives. In Séminaire de Probabilités XIX 1983/84, pages 177–206. Springer, 1985.
  • [8] L. Gross. Logarithmic Sobolev inequalities. American Journal of Mathematics, 97(4):1061–1083, 1975.
  • [9] M. Ledoux. The concentration of measure phenomenon. Number 89. American Mathematical Soc., 2001.
  • [10] A. Marsiglietti and V. Kostina. A lower bound on the differential entropy of log-concave random vectors with applications. Entropy, 20(3):185, 2018.
  • [11] S. Bobkov and M. Madiman. The entropy per coordinate of a random vector is highly constrained under convexity conditions. IEEE Transactions on Information Theory, 57(8):4940–4954, 2011.
  • [12] H. J. Brascamp and E. H. Lieb. On extensions of the Brunn-Minkowski and Prékopa-Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation. Journal of Functional Analysis, 22(4):366–389, 1976.
  • [13] T. M. Cover and Z. Zhang. On the maximum entropy of the sum of two dependent random variables. IEEE Transactions on Information Theory, 40(4):1244–1246, 1994.
  • [14] A. Marsiglietti and V. Kostina. New connections between the entropy power inequality and geometric inequalities. In 2018 IEEE International Symp. on Information Theory (ISIT), pages 1978–1982. IEEE, 2018.

Appendix

This appendix contains the proof of the following extended version of Proposition 1. It may be of independent interest.

Lemma 1.

Let be a probability density on , with convex.

  1. For each , there exists a unique such that

  2. For each , the map

    has a unique global maximum at .

  3. The map is continuous on . In particular, for each , there is a neighborhood of and such that for all .

  4. For as in part (i), and each ,

Remark 3.

An intuitive interpretation is as follows: if we convolve a log-concave density $p$ with a Gaussian of variance $t$, then the point of maximum likelihood of the resulting density (call it $x^*$) is unique, and changes smoothly as we adjust $t$. The last part of the lemma gives a lower bound on the likelihood at $x^*$. The only real surprise is the fact that $x^*$ is also the barycenter of the density proportional to $y \mapsto p(y)\, e^{-|x^* - y|^2/(2t)}$, which is part (i) of the claim.
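A small numerical illustration of this interpretation (the density, smoothing variance, and grids below are arbitrary illustrative choices, not taken from the paper):

import numpy as np

# For a log-concave density p convolved with a Gaussian of variance t, the
# mode x* of the smoothed density should coincide with the barycenter of the
# density proportional to y -> p(y) * exp(-(x* - y)^2 / (2 t)).
t = 0.5
y = np.linspace(0.0, 30.0, 30001)            # grid carrying the mass of p
dy = y[1] - y[0]
p = np.exp(-y)                               # Exp(1) density: log-concave

xs = np.linspace(-3.0, 3.0, 6001)
vals = np.array([np.sum(p * np.exp(-(x - y) ** 2 / (2 * t))) * dy for x in xs])
x_star = xs[np.argmax(vals)]                 # mode of the smoothed density

w = p * np.exp(-(x_star - y) ** 2 / (2 * t))
barycenter = np.sum(y * w) / np.sum(w)
print(x_star, barycenter)                    # agree up to grid resolution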

The proof of Lemma 1 starts by showing that the map defined by

is a contraction with respect to the usual Euclidean metric. Then, the claims follow from the well-known Banach fixed-point theorem:

Lemma 2 (Banach Fixed Point Theorem).

Let $(\mathcal{X}, d)$ be a complete metric space, and let $f : \mathcal{X} \to \mathcal{X}$ satisfy $d(f(x), f(y)) \leq \lambda\, d(x, y)$ for all $x, y \in \mathcal{X}$, where $0 \leq \lambda < 1$. Then $f$ has a unique fixed point $x^*$. Moreover, if $x_0 \in \mathcal{X}$ and $x_{n+1} = f(x_n)$, $n \geq 0$, then

$$d(x_n, x^*) \leq \frac{\lambda^n}{1 - \lambda}\, d(x_1, x_0). \tag{11}$$
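A minimal sketch of the iteration that (11) controls, on a toy contraction (the map and contraction factor are illustrative and unrelated to the map used in the proof below):

import math

# Fixed-point iteration for f(x) = 0.5 * cos(x); since |f'(x)| <= 0.5 < 1,
# Lemma 2 gives a unique fixed point and the a priori error bound (11).
lam = 0.5
f = lambda u: lam * math.cos(u)

x0 = 0.0
x1 = f(x0)
x = x0
for n in range(30):
    x = f(x)

error_bound = lam**30 / (1.0 - lam) * abs(x1 - x0)   # bound (11) at n = 30
print(x, error_bound)    # x is approximately 0.4502, error_bound about 9.3e-10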

So, to begin, let denote the probability measure with density proportional to . We note that cannot split off an independent Gaussian factor with variance . Indeed, if this were the case, then after suitable change of coordinates, we could assume splits off an independent Gaussian factor of variance in the first coordinate, so that

for some . Rearranging, this yields for some constant . This would imply is not integrable in coordinate , a contradiction. Thus, we must have

for some . This follows from the Brascamp-Lieb inequality, and the fact that Gaussians are the only extremizers.

By differentiating the th coordinate of at , we see that

Hence, the Jacobian of has entries . Recalling the variance inequality above,

so that is a contraction as claimed. Hence, the desired existence and uniqueness of follows from the Banach Fixed Point Theorem.

To prove the second claim, note that for any and ,

The strict inequality holds since is a contraction and for . Thus, for any not equal to , the map is strictly increasing on , so that achieves a unique global maximum at as claimed.

Toward proving the third claim, we first note that (ii) proved above yields a uniform bound on for all . In particular,

Since

is log-concave, it has finite moments of all orders, and we conclude