# Generalized Bayesian Cramér-Rao Inequality via Information Geometry of Relative α-Entropy

The relative α-entropy is the Rényi analog of relative entropy and arises prominently in information-theoretic problems. Recent information-geometric investigations of this quantity have enabled a generalization of the Cramér-Rao inequality that provides a lower bound on the variance of an estimator of an escort of the underlying parametric probability distribution. However, this framework has remained unexamined in the Bayesian setting. In this paper, we propose a general Riemannian metric based on relative α-entropy to obtain a generalized Bayesian Cramér-Rao inequality. This establishes a lower bound on the variance of an unbiased estimator for the α-escort distribution starting from an unbiased estimator for the underlying distribution. We show that, in the limiting case when the entropy order approaches unity, this framework reduces to the conventional Bayesian Cramér-Rao inequality. Further, in the absence of priors, the same framework yields the deterministic Cramér-Rao inequality.


## I Introduction

In information geometry, a parameterized family of probability distributions is expressed as a manifold in a Riemannian space [1], in which the parameters form the coordinate system on the manifold and the distance measure is given by the Fisher information matrix (FIM) [2]. This framework reduces certain important information-theoretic problems to investigations of different Riemannian manifolds [3]. This perspective is helpful in analyzing many problems in engineering and the sciences where probability distributions are used, including optimization [4], signal processing [5, 6], optimal transport [7], and quantum information [8].

In particular, when the separation between two points on the manifold is defined by the Kullback-Leibler divergence (KLD) or relative entropy between two probability distributions $p$ and $q$ on a finite state space $\mathcal{X}$, i.e.,

$$I(p,q) := \sum_{x\in\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}, \qquad (1)$$

then the resulting Riemannian metric is given by the FIM [9]. This method of defining a Riemannian metric on statistical manifolds from a general divergence function is due to Eguchi [10]. Since the FIM is the inverse of the well-known deterministic Cramér-Rao lower bound (CRLB), the information-geometric results are directly connected with those of estimation theory. Further, the relative entropy is related to the Shannon entropy $H(p)$ by $H(p) = \log|\mathcal{X}| - I(p,u)$, where $u$ is the uniform distribution on $\mathcal{X}$.
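As a quick numerical illustration of (1) and the entropy relation above, the following sketch (plain Python; the helper name `kld` is ours) computes the KLD on a three-point alphabet and checks that $H(p) = \log|\mathcal{X}| - I(p,u)$:

```python
import math

def kld(p, q):
    """Kullback-Leibler divergence I(p, q) of Eq. (1) on a finite alphabet."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.2, 0.3, 0.5]
u = [1 / 3] * 3  # uniform distribution on X

# Shannon entropy H(p) and the identity H(p) = log|X| - I(p, u)
H = -sum(px * math.log(px) for px in p)
assert abs(H - (math.log(3) - kld(p, u))) < 1e-12
```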

It is, therefore, instructive to explore information-geometric frameworks for key estimation-theoretic results. For example, the Bayesian CRLB [11, 12] is the analogous lower bound to the CRLB for random parameters. It assumes the parameters to be random with an a priori probability density function. In [13], we derived the Bayesian CRLB using a general definition of the KLD for probability densities that are not normalized.

Recently, [14] studied the information geometry of the Rényi entropy [15], which is a generalization of the Shannon entropy. In the source coding problem where normalized cumulants of compressed lengths are considered instead of expected compressed lengths, the Rényi entropy is used as the measure of uncertainty [16]. The Rényi entropy of $p$ of order $\alpha$, $\alpha > 0$, $\alpha \neq 1$, is defined to be $H_\alpha(p) := \frac{1}{1-\alpha}\log\sum_x p(x)^\alpha$. In the mismatched (source distribution) version of this problem, the Rényi analog of relative entropy is the relative α-entropy [17, 18]. The relative α-entropy of $p$ with respect to $q$ (or Sundaresan’s divergence between $p$ and $q$) is defined as

$$I_\alpha(p,q) := \frac{\alpha}{1-\alpha}\log\sum_x p(x)q(x)^{\alpha-1} - \frac{1}{1-\alpha}\log\sum_x p(x)^\alpha + \log\sum_x q(x)^\alpha. \qquad (2)$$

It follows that, as $\alpha \to 1$, we have $H_\alpha(p) \to H(p)$ and $I_\alpha(p,q) \to I(p,q)$ [19]. The Rényi entropy and relative α-entropy are related by the equation $I_\alpha(p,u) = \log|\mathcal{X}| - H_\alpha(p)$, where $u$ is the uniform distribution on $\mathcal{X}$. Relative α-entropy is closely related to the Csiszár $f$-divergence as

$$I_\alpha(p,q) = \frac{\alpha}{1-\alpha}\log\big[\mathrm{sgn}(1-\alpha)\cdot D_f\big(p^{(\alpha)}, q^{(\alpha)}\big) + 1\big], \qquad (3)$$

where $f(t) = \mathrm{sgn}(1-\alpha)\,(t^{1/\alpha}-1)$, $p^{(\alpha)}(x) = p(x)^\alpha/\sum_y p(y)^\alpha$, and $q^{(\alpha)}(x) = q(x)^\alpha/\sum_y q(y)^\alpha$ [19, Sec. II]. The measures $p^{(\alpha)}$ and $q^{(\alpha)}$ are called α-escort or α-scaled measures [20, 21]. It is easy to show that, indeed, the right side of (3) is the Rényi divergence between $p^{(\alpha)}$ and $q^{(\alpha)}$ of order $1/\alpha$.
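To make the definitions concrete, here is a small numerical sketch (plain Python; the function names are ours) that evaluates $I_\alpha(p,q)$ per (2), forms the α-escort measures, and checks both the $\alpha \to 1$ limit and the identification of (3) with a Rényi divergence of order $1/\alpha$ between escorts:

```python
import math

def escort(p, alpha):
    """α-escort of p: p(x)^α / Σ_y p(y)^α."""
    w = [px ** alpha for px in p]
    s = sum(w)
    return [wx / s for wx in w]

def relative_alpha_entropy(p, q, alpha):
    """Relative α-entropy I_α(p, q) of Eq. (2)."""
    a = alpha
    t1 = a / (1 - a) * math.log(sum(px * qx ** (a - 1) for px, qx in zip(p, q)))
    t2 = -1 / (1 - a) * math.log(sum(px ** a for px in p))
    t3 = math.log(sum(qx ** a for qx in q))
    return t1 + t2 + t3

def kld(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def renyi(p, q, order):
    """Rényi divergence of the given order between p and q."""
    return (1 / (order - 1)) * math.log(
        sum(px ** order * qx ** (1 - order) for px, qx in zip(p, q)))

p, q = [0.2, 0.3, 0.5], [0.5, 0.25, 0.25]

# As α → 1, I_α(p, q) → I(p, q)
assert abs(relative_alpha_entropy(p, q, 1 + 1e-6) - kld(p, q)) < 1e-4

# Eq. (3): I_α(p, q) equals the Rényi divergence of order 1/α between escorts
a = 0.7
lhs = relative_alpha_entropy(p, q, a)
rhs = renyi(escort(p, a), escort(q, a), 1 / a)
assert abs(lhs - rhs) < 1e-10
```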

The Rényi entropy and relative α-entropy arise in several important information-theoretic problems such as guessing [22, 18, 23] and task encoding [24]. Relative α-entropy arises in statistics as a generalized likelihood function robust to outliers [25], [26]. It also shares many interesting properties with relative entropy; see, e.g., [19, Sec. II] for a summary. For example, relative α-entropy behaves like squared Euclidean distance and satisfies a Pythagorean property in a manner similar to relative entropy [19, 13]. This property helps in establishing a computation method [26] for a robust estimation procedure [27].

Motivated by such analogous relationships, our previous work [14] investigated the relative α-entropy from a differential-geometric perspective. In particular, we applied Eguchi’s method with relative α-entropy as the divergence function to obtain the resulting statistical manifold with a general Riemannian metric. This metric is specified by a generalized Fisher information matrix that is the inverse of the so-called deterministic α-CRLB [19]. In this paper, we study the structure of statistical manifolds with respect to relative α-entropy in a Bayesian setting. This is a non-trivial extension of our work in [13], where we proposed a Riemannian metric arising from the relative entropy in the Bayesian case. In the process, we derive a general Bayesian Cramér-Rao inequality and the resulting Bayesian α-CRLB, which embeds the compounded effects of both the Rényi order $\alpha$ and the Bayesian prior distribution. We show that, in limiting cases, the bound reduces to the deterministic α-CRLB (in the absence of a prior), the Bayesian CRLB (when $\alpha \to 1$), or the CRLB (no prior and $\alpha \to 1$).

The rest of the paper is organized as follows. In the next section, we provide the essential background on information geometry. We then introduce the definition of the Bayesian relative α-entropy in Section III and show that it is a valid divergence function. In Section IV, we establish the connection between this divergence and the Riemannian metric, and we then derive the Bayesian α-version of the Cramér-Rao inequality in Section V. Finally, we state our main result for the Bayesian α-CRLB in Section VI and conclude in Section VII.

## II Desiderata for Information Geometry

A $d$-dimensional manifold is a Hausdorff, second-countable topological space that is locally homeomorphic to a Euclidean space of dimension $d$ [2]. A Riemannian manifold is a real differentiable manifold in which the tangent space at each point is a finite-dimensional Hilbert space and is, therefore, equipped with an inner product. The collection of all these inner products is the Riemannian metric. In information geometry, statistical models play the role of the manifold, while the Fisher information matrix and its various generalizations play the role of the Riemannian metric. A statistical manifold here means a parametric family of probability distributions with a continuously varying parameter space (a statistical model). The dimension of a statistical manifold is the dimension of its parameter space; for example, the family of Gaussian distributions $\{N(\mu,\sigma^2) : \mu \in \mathbb{R}, \sigma > 0\}$ is a two-dimensional statistical manifold. The tangent space at a point $p$ of $S$ is a linear space that corresponds to a “local linearization” of the manifold around $p$. It is denoted by $T_p(S)$, and its elements are called tangent vectors of $S$ at $p$. A Riemannian metric at a point $p$ of $S$ is an inner product defined for any pair of tangent vectors of $S$ at $p$.

Let us restrict ourselves to statistical manifolds defined on a finite set $\mathcal{X}$. Let $P$ denote the space of all probability distributions on $\mathcal{X}$. Let $S \subset P$ be a sub-manifold, and let $\theta = (\theta_1,\dots,\theta_n)$ be a parameterization of $S$. By a divergence, we mean a non-negative function $D$ defined on $S \times S$ such that $D(p,q) = 0$ iff $p = q$. Given a divergence function $D$ on $S$, Eguchi [28] defines a Riemannian metric on $S$ by the matrix

$$G^{(D)}(\theta) = \big[g^{(D)}_{i,j}(\theta)\big],$$

where

$$g^{(D)}_{i,j}(\theta) := -D[\partial_i, \partial_j] := -\left.\frac{\partial}{\partial\theta'_j}\frac{\partial}{\partial\theta_i} D(p_\theta, p_{\theta'})\right|_{\theta=\theta'},$$

where $g^{(D)}_{i,j}$ is the element in the $i$th row and $j$th column of $G^{(D)}$, $\partial_i := \partial/\partial\theta_i$, and $\partial'_j := \partial/\partial\theta'_j$, together with dual affine connections $\nabla^{(D)}$ and $\nabla^{(D^*)}$, with connection coefficients described by the following Christoffel symbols

$$\Gamma^{(D)}_{ij,k}(\theta) := -D[\partial_i\partial_j, \partial_k] := -\left.\frac{\partial}{\partial\theta_i}\frac{\partial}{\partial\theta_j}\frac{\partial}{\partial\theta'_k} D(p_\theta, p_{\theta'})\right|_{\theta=\theta'}$$

and

$$\Gamma^{(D^*)}_{ij,k}(\theta) := -D[\partial_k, \partial_i\partial_j] := -\left.\frac{\partial}{\partial\theta_k}\frac{\partial}{\partial\theta'_i}\frac{\partial}{\partial\theta'_j} D(p_\theta, p_{\theta'})\right|_{\theta=\theta'},$$

such that $\nabla^{(D)}$ and $\nabla^{(D^*)}$ form a dualistic structure in the sense that

$$\partial_k\, g^{(D)}_{i,j} = \Gamma^{(D)}_{ki,j} + \Gamma^{(D^*)}_{kj,i}, \qquad (4)$$

where $\partial_k := \partial/\partial\theta_k$.
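Eguchi's construction can be checked numerically: for the Bernoulli family with $D$ the KLD, the metric $g^{(D)}(\theta) = -\partial_{\theta'}\partial_\theta D(p_\theta,p_{\theta'})|_{\theta'=\theta}$ should recover the Fisher information $1/(\theta(1-\theta))$. A minimal sketch using central finite differences (function names are ours):

```python
import math

def kld_bernoulli(t, s):
    """KLD between Bernoulli(t) and Bernoulli(s)."""
    return t * math.log(t / s) + (1 - t) * math.log((1 - t) / (1 - s))

def eguchi_metric(D, theta, h=1e-4):
    """g(θ) = -∂_{θ'} ∂_θ D(p_θ, p_θ') at θ' = θ, via central differences."""
    mixed = (D(theta + h, theta + h) - D(theta + h, theta - h)
             - D(theta - h, theta + h) + D(theta - h, theta - h)) / (4 * h * h)
    return -mixed

theta = 0.3
g = eguchi_metric(kld_bernoulli, theta)
fisher = 1 / (theta * (1 - theta))  # Fisher information of Bernoulli(θ)
assert abs(g - fisher) < 1e-3
```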

## III Relative α-entropy in the Bayesian Setting

We now introduce the relative α-entropy in the Bayesian case. Let $S$ be an $n$-dimensional sub-manifold of $P$ and define

$$\tilde{S} := \{\tilde{p}_\theta(x) = p_\theta(x)\lambda(\theta) : p_\theta \in S\}, \qquad (5)$$

where $\lambda$ is a prior probability distribution on the parameter space. Then $\tilde{S}$ is a sub-manifold of the space $\tilde{P}$ of all positive measures on $\mathcal{X}$. Let $\tilde{p}_\theta, \tilde{p}_{\theta'} \in \tilde{S}$. The relative entropy of $\tilde{p}_\theta$ with respect to $\tilde{p}_{\theta'}$ is (c.f. [29, Eq. (2.4)] and [13])

$$\begin{aligned} I(\tilde{p}_\theta \,\|\, \tilde{p}_{\theta'}) &= \sum_x \tilde{p}_\theta(x)\log\frac{\tilde{p}_\theta(x)}{\tilde{p}_{\theta'}(x)} - \sum_x \tilde{p}_\theta(x) + \sum_x \tilde{p}_{\theta'}(x) \\ &= \sum_x p_\theta(x)\lambda(\theta)\log\frac{p_\theta(x)\lambda(\theta)}{p_{\theta'}(x)\lambda(\theta')} - \lambda(\theta) + \lambda(\theta'). \end{aligned}$$
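A short sketch (plain Python, our naming) verifying that the unnormalized relative entropy above is non-negative on positive measures and collapses to the usual KLD when $\lambda \equiv 1$:

```python
import math

def kld_unnormalized(pt, qt):
    """Relative entropy between positive measures:
    I(p̃‖q̃) = Σ p̃ log(p̃/q̃) − Σ p̃ + Σ q̃."""
    return (sum(a * math.log(a / b) for a, b in zip(pt, qt))
            - sum(pt) + sum(qt))

p, q = [0.2, 0.3, 0.5], [0.5, 0.25, 0.25]
lam, lam2 = 0.4, 0.6
pt = [lam * x for x in p]    # p̃_θ  = λ(θ)  p_θ
qt = [lam2 * x for x in q]   # p̃_θ' = λ(θ') p_θ'

assert kld_unnormalized(pt, qt) >= 0   # valid divergence on positive measures
# λ ≡ 1 recovers the usual KLD of Eq. (1)
usual = sum(a * math.log(a / b) for a, b in zip(p, q))
assert abs(kld_unnormalized(p, q) - usual) < 1e-12
```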

We define the relative α-entropy of $\tilde{p}_\theta$ with respect to $\tilde{p}_{\theta'}$ by

$$I_\alpha(\tilde{p}_\theta, \tilde{p}_{\theta'}) := \frac{\lambda(\theta)}{1-\alpha}\log\sum_x p_\theta(x)\big(\lambda(\theta')p_{\theta'}(x)\big)^{\alpha-1} + \lambda(\theta') - \lambda(\theta)\left[\frac{\log\sum_x p_\theta(x)^\alpha}{\alpha(1-\alpha)} + 1 - \log\lambda(\theta) - \frac{1}{\alpha}\log\sum_x p_{\theta'}(x)^\alpha\right].$$

We present the following Lemma 1, which shows that our definition of the Bayesian relative α-entropy is not only a valid divergence function but also coincides with the KLD as $\alpha \to 1$.

###### Lemma 1.
1. $I_\alpha(\tilde{p}_\theta, \tilde{p}_{\theta'}) \ge 0$, with equality if and only if $\tilde{p}_\theta = \tilde{p}_{\theta'}$.

2. $I_\alpha(\tilde{p}_\theta, \tilde{p}_{\theta'}) \to I(\tilde{p}_\theta \,\|\, \tilde{p}_{\theta'})$ as $\alpha \to 1$.

###### Proof:

1) Let $\alpha > 1$. Applying Hölder's inequality with Hölder conjugates $\alpha$ and $\alpha/(\alpha-1)$, we have

$$\sum_x p_\theta(x)\big(\lambda(\theta')p_{\theta'}(x)\big)^{\alpha-1} \le \|p_\theta\|_\alpha\,\big(\lambda(\theta')\,\|p_{\theta'}\|_\alpha\big)^{\alpha-1},$$

where $\|p\|_\alpha := \big(\sum_x p(x)^\alpha\big)^{1/\alpha}$ denotes the $\alpha$-norm. When $\alpha < 1$, the inequality is reversed. Hence

$$\begin{aligned} \frac{\lambda(\theta)}{1-\alpha}\log\sum_x p_\theta(x)\big(\lambda(\theta')p_{\theta'}(x)\big)^{\alpha-1} &\ge \frac{\lambda(\theta)\log\sum_x p_\theta(x)^\alpha}{\alpha(1-\alpha)} - \lambda(\theta)\log\lambda(\theta') - \frac{\lambda(\theta)}{\alpha}\log\sum_x p_{\theta'}(x)^\alpha \\ &\ge \frac{\lambda(\theta)\log\sum_x p_\theta(x)^\alpha}{\alpha(1-\alpha)} - \lambda(\theta)\log\lambda(\theta) + \lambda(\theta) - \lambda(\theta') - \frac{\lambda(\theta)}{\alpha}\log\sum_x p_{\theta'}(x)^\alpha \\ &= \lambda(\theta)\left[\frac{\log\sum_x p_\theta(x)^\alpha}{\alpha(1-\alpha)} + 1 - \log\lambda(\theta) - \frac{1}{\alpha}\log\sum_x p_{\theta'}(x)^\alpha\right] - \lambda(\theta'), \end{aligned}$$

which, rearranged, is precisely the statement $I_\alpha(\tilde{p}_\theta, \tilde{p}_{\theta'}) \ge 0$. Here the second inequality follows because, for $x, y > 0$,

$$x\log\frac{x}{y} = -x\log\frac{y}{x} \ge -x\left(\frac{y}{x}-1\right) = -y + x,$$

and hence

$$x\log y \le x\log x - x + y.$$

The conditions for equality follow from those in Hölder's inequality and in $\log t \le t - 1$.

2) This follows by applying L'Hôpital's rule to the first term of $I_\alpha(\tilde{p}_\theta, \tilde{p}_{\theta'})$:

$$\begin{aligned} \lim_{\alpha\to 1}\left[\frac{\lambda(\theta)}{1-\alpha}\log\sum_x p_\theta(x)\big(\lambda(\theta')p_{\theta'}(x)\big)^{\alpha-1}\right] &= -\lambda(\theta)\lim_{\alpha\to 1}\frac{\sum_x p_\theta(x)\big(\lambda(\theta')p_{\theta'}(x)\big)^{\alpha-1}\log\big(\lambda(\theta')p_{\theta'}(x)\big)}{\sum_x p_\theta(x)\big(\lambda(\theta')p_{\theta'}(x)\big)^{\alpha-1}} \\ &= -\sum_x \big(\lambda(\theta)p_\theta(x)\big)\log\big(\lambda(\theta')p_{\theta'}(x)\big), \end{aligned}$$

and the remaining terms converge to $\sum_x \lambda(\theta)p_\theta(x)\log\big(\lambda(\theta)p_\theta(x)\big) - \lambda(\theta) + \lambda(\theta')$ since the Rényi entropy coincides with the Shannon entropy as $\alpha \to 1$. ∎
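Both parts of Lemma 1 can be spot-checked numerically. The sketch below (plain Python; the helper name and the particular test distributions are ours) implements the definition of $I_\alpha(\tilde{p}_\theta, \tilde{p}_{\theta'})$ given above and verifies non-negativity, the equality condition, and the $\alpha \to 1$ limit:

```python
import math

def bayes_alpha_entropy(p, lam, q, lam2, alpha):
    """Bayesian relative α-entropy I_α(p̃_θ, p̃_θ') per the definition above,
    with p̃_θ = λ(θ) p_θ and p̃_θ' = λ(θ') p_θ'."""
    a = alpha
    cp = math.log(sum(x ** a for x in p))
    cq = math.log(sum(x ** a for x in q))
    t1 = lam / (1 - a) * math.log(
        sum(px * (lam2 * qx) ** (a - 1) for px, qx in zip(p, q)))
    return t1 + lam2 - lam * (cp / (a * (1 - a)) + 1 - math.log(lam) - cq / a)

p, q = [0.2, 0.3, 0.5], [0.5, 0.25, 0.25]
lam, lam2 = 0.4, 0.6

# Lemma 1.1: non-negativity, and equality when p̃_θ = p̃_θ'
assert bayes_alpha_entropy(p, lam, q, lam2, 1.5) >= 0
assert abs(bayes_alpha_entropy(p, lam, p, lam, 1.7)) < 1e-12

# Lemma 1.2: α → 1 recovers the Bayesian KLD
kld_b = (sum(lam * a_ * math.log(lam * a_ / (lam2 * b_)) for a_, b_ in zip(p, q))
         - lam + lam2)
assert abs(bayes_alpha_entropy(p, lam, q, lam2, 1 + 1e-6) - kld_b) < 1e-3
```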

## IV Fisher Information Matrix for the Bayesian Case

The Eguchi’s theory we provided in section II can also be extended to the space of all positive measures on , that is, . Following Eguchi [28], we define a Riemannian metric on by

$$g^{(I_\alpha)}_{i,j}(\theta) = -\left.\frac{\partial}{\partial\theta'_j}\frac{\partial}{\partial\theta_i} I_\alpha(\tilde{p}_\theta, \tilde{p}_{\theta'})\right|_{\theta'=\theta}. \qquad (6)$$

Only the first term of $I_\alpha$ and the term $\frac{\lambda(\theta)}{\alpha}\log\sum_x p_{\theta'}(x)^\alpha$ depend on both $\theta$ and $\theta'$. Differentiating these and evaluating at $\theta' = \theta$ yields

$$\begin{aligned} g^{(I_\alpha)}_{i,j}(\theta) &= \lambda(\theta)\,\mathrm{Cov}^{(\alpha)}_\theta\big[\partial_i\log p_\theta(X),\, \partial_j\log p_\theta(X)\big] + \partial_i\lambda(\theta)\Big\{\partial_j\log\lambda(\theta) + E^{(\alpha)}_\theta\big[\partial_j\log p_\theta(X)\big]\Big\} - \partial_i\lambda(\theta)\, E^{(\alpha)}_\theta\big[\partial_j\log p_\theta(X)\big] \\ &= \lambda(\theta)\Big\{\mathrm{Cov}^{(\alpha)}_\theta\big[\partial_i\log p_\theta(X),\, \partial_j\log p_\theta(X)\big] + \partial_i\log\lambda(\theta)\,\partial_j\log\lambda(\theta)\Big\} \\ &= \lambda(\theta)\big[g^{(\alpha)}_{i,j}(\theta) + J^\lambda_{i,j}(\theta)\big], \end{aligned} \qquad (7)$$

where $E^{(\alpha)}_\theta$ and $\mathrm{Cov}^{(\alpha)}_\theta$ denote expectation and covariance with respect to the escort distribution $p^{(\alpha)}_\theta$,

$$g^{(\alpha)}_{i,j}(\theta) := \mathrm{Cov}^{(\alpha)}_\theta\big[\partial_i\log p_\theta(X),\, \partial_j\log p_\theta(X)\big], \qquad (8)$$

and

$$J^\lambda_{i,j}(\theta) := \partial_i\big(\log\lambda(\theta)\big)\cdot\partial_j\big(\log\lambda(\theta)\big). \qquad (9)$$

Let $G^{(\alpha)}(\theta) := [g^{(\alpha)}_{i,j}(\theta)]$, $J^\lambda(\theta) := [J^\lambda_{i,j}(\theta)]$, and $G^{(I_\alpha)}(\theta) := [g^{(I_\alpha)}_{i,j}(\theta)] = \lambda(\theta)\big[G^{(\alpha)}(\theta) + J^\lambda(\theta)\big]$. Notice that, when $\alpha = 1$, $G^{(I_\alpha)}(\theta)$ becomes $\lambda(\theta)\big[G(\theta) + J^\lambda(\theta)\big]$, where $G(\theta)$ is the usual Fisher information matrix, as in the Bayesian case (c.f. [13]).
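For a scalar family, $g^{(\alpha)}(\theta)$ in (8) is simply the variance of the score under the α-escort of $p_\theta$. The following sketch computes it for the Bernoulli family and confirms that $\alpha = 1$ recovers the usual Fisher information $1/(\theta(1-\theta))$ (helper names are ours):

```python
import math

def bernoulli_alpha_info(theta, alpha):
    """g^(α)(θ) of Eq. (8) for the Bernoulli family: the variance of the
    score ∂ log p_θ under the α-escort of p_θ = (θ, 1-θ)."""
    p = [theta, 1 - theta]
    score = [1 / theta, -1 / (1 - theta)]   # ∂_θ log p_θ(x)
    w = [x ** alpha for x in p]
    esc = [wx / sum(w) for wx in w]          # α-escort p_θ^(α)
    m = sum(e * s for e, s in zip(esc, score))
    return sum(e * (s - m) ** 2 for e, s in zip(esc, score))

theta = 0.3
# α = 1 recovers the usual Fisher information 1/(θ(1-θ))
assert abs(bernoulli_alpha_info(theta, 1.0) - 1 / (theta * (1 - theta))) < 1e-9
```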

## V An α-Version of the Cramér-Rao Inequality in the Bayesian Setting

We now investigate the geometry of $\tilde{P}$ with respect to the metric in (7). Later, we formulate an α-equivalent version of the Cramér-Rao inequality associated with a submanifold $\tilde{S}$ of $\tilde{P}$. Observe that $\tilde{P}$ is a subset of the set of strictly positive functions on $\mathcal{X}$. We denote a tangent vector (that is, an element of the tangent space $T_{\tilde{p}}(\tilde{P})$) by $X$. The manifold $\tilde{P}$ can be identified with its homeomorphic image under the mapping $\tilde{p} \mapsto \log\tilde{p}$. Under this mapping, the tangent vector $X$ is represented by $X^{(e)}_{\tilde{p}}$, defined by $X^{(e)}_{\tilde{p}}(x) := X(x)/\tilde{p}(x)$, and we define

$$T^{(e)}_{\tilde{p}}(\tilde{P}) = \big\{X^{(e)}_{\tilde{p}} : X \in T_{\tilde{p}}(\tilde{P})\big\} = \big\{A \in \mathbb{R}^{\mathcal{X}} : E_{\tilde{p}}[A] = 0\big\}. \qquad (10)$$

Motivated by the expression for the Riemannian metric in (7), define

$$\begin{aligned} \partial^{(\alpha)}_i\big(p_\theta(x)\big) &:= \frac{1}{\alpha-1}\,\partial'_i\left(\frac{p_{\theta'}(x)^{\alpha-1}}{\sum_y p_\theta(y)\,p_{\theta'}(y)^{\alpha-1}}\right)\Bigg|_{\theta'=\theta} \\ &= \frac{p_\theta(x)^{\alpha-2}\,\partial_i p_\theta(x)}{\sum_y p_\theta(y)^\alpha} - \frac{p_\theta(x)^{\alpha-1}\sum_y p_\theta(y)^{\alpha-1}\,\partial_i p_\theta(y)}{\big(\sum_y p_\theta(y)^\alpha\big)^2} \\ &= \frac{p^{(\alpha)}_\theta(x)}{p_\theta(x)}\,\partial_i\big(\log p_\theta(x)\big) - \frac{p^{(\alpha)}_\theta(x)}{p_\theta(x)}\, E^{(\alpha)}_\theta\big[\partial_i\big(\log p_\theta(X)\big)\big]. \end{aligned}$$

We shall call the above the α-representation of $\partial_i$ at $p_\theta$. With this notation, $g^{(\alpha)}_{i,j}$ is given by

$$g^{(\alpha)}_{i,j}(\theta) = \sum_x \partial_i p_\theta(x)\cdot\partial^{(\alpha)}_j\big(p_\theta(x)\big). \qquad (11)$$

It should be noted that $\sum_x p_\theta(x)\,\partial^{(\alpha)}_i(p_\theta(x)) = 0$. This follows since

$$\partial^{(\alpha)}_i(p_\theta) = \frac{1}{\alpha}\cdot\frac{p^{(\alpha)}_\theta}{p_\theta}\,\partial_i\log p^{(\alpha)}_\theta,$$

so the sum equals $\frac{1}{\alpha}\sum_x \partial_i\, p^{(\alpha)}_\theta(x) = 0$. When $\alpha = 1$, the right-hand side of the above reduces to $\partial_i\log p_\theta(x)$, the usual score function.
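The identity (11) expressing $g^{(\alpha)}_{i,j}$ through the α-representation can be checked directly for the Bernoulli family, where $\partial_\theta p_\theta = (1,-1)$: the α-representation form and the escort-covariance form of (8) agree (a sketch, our naming):

```python
import math

def alpha_rep_metric(theta, alpha):
    """g^(α)(θ) computed as Σ_x ∂p_θ(x) · ∂^(α)(p_θ(x)) for Bernoulli(θ),
    with ∂^(α) the α-representation defined above."""
    p = [theta, 1 - theta]
    dp = [1.0, -1.0]                         # ∂_θ p_θ(x)
    score = [d / x for d, x in zip(dp, p)]   # ∂_θ log p_θ(x)
    w = [x ** alpha for x in p]
    esc = [wx / sum(w) for wx in w]          # α-escort p_θ^(α)
    m = sum(e * s for e, s in zip(esc, score))
    d_alpha = [e / x * (s - m) for e, x, s in zip(esc, p, score)]
    return sum(d * da for d, da in zip(dp, d_alpha))

theta, alpha = 0.3, 0.6
# Escort-covariance form of g^(α) in Eq. (8), for comparison
p = [theta, 1 - theta]
score = [1 / theta, -1 / (1 - theta)]
w = [x ** alpha for x in p]
esc = [wx / sum(w) for wx in w]
m = sum(e * s for e, s in zip(esc, score))
cov = sum(e * (s - m) ** 2 for e, s in zip(esc, score))
assert abs(alpha_rep_metric(theta, alpha) - cov) < 1e-12
```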

Motivated by this, the α-representation of a tangent vector $X$ at $\tilde{p}$ is

$$X^{(\alpha)}_{\tilde{p}}(x) := \frac{\tilde{p}^{(\alpha)}(x)}{p(x)}\,X^{(e)}_{\tilde{p}}(x) - \frac{\tilde{p}^{(\alpha)}(x)}{p(x)}\,E_{\tilde{p}^{(\alpha)}}\big[X^{(e)}_{\tilde{p}}\big] = \frac{p^{(\alpha)}(x)}{p(x)}\Big(X^{(e)}_{\tilde{p}}(x) - E_{p^{(\alpha)}}\big[X^{(e)}_{\tilde{p}}\big]\Big), \qquad (12)$$

where the last equality follows because $\tilde{p}^{(\alpha)} = p^{(\alpha)}$ (the scaling by $\lambda$ cancels in the escort). The collection of all such α-representations is

$$T^{(\alpha)}_{\tilde{p}}(\tilde{P}) := \big\{X^{(\alpha)}_{\tilde{p}} : X \in T_{\tilde{p}}(\tilde{P})\big\}. \qquad (13)$$

Clearly, $T^{(\alpha)}_{\tilde{p}}(\tilde{P}) \subseteq T^{(e)}_{\tilde{p}}(\tilde{P})$. Also, any $A \in \mathbb{R}^{\mathcal{X}}$ with $E_{\tilde{p}}[A] = 0$ can be written as

$$A = \frac{p^{(\alpha)}}{\tilde{p}}\Big(\tilde{B} - E_{p^{(\alpha)}}\big[\tilde{B}\big]\Big),$$

where

$$\tilde{B}(x) := \frac{\tilde{p}(x)}{p^{(\alpha)}(x)}\,A(x).$$

In view of (10), we have

$$T^{(e)}_{\tilde{p}}(\tilde{P}) = T^{(\alpha)}_{\tilde{p}}(\tilde{P}). \qquad (14)$$

Now, the inner product between any two tangent vectors $X, Y \in T_{\tilde{p}}(\tilde{P})$ defined by the α-information metric in (7) is

$$\langle X, Y\rangle^{(\alpha)}_{\tilde{p}} := E_{\tilde{p}}\big[X^{(e)}_{\tilde{p}}\,Y^{(\alpha)}_{\tilde{p}}\big]. \qquad (15)$$

Consider now an $n$-dimensional statistical manifold $\tilde{S}$, a submanifold of $\tilde{P}$, together with the metric in (15). Let $T^*_{\tilde{p}}(\tilde{S})$ be the dual space (cotangent space) of the tangent space $T_{\tilde{p}}(\tilde{S})$, and consider, for each $Y \in T_{\tilde{p}}(\tilde{S})$, the element $\omega_Y \in T^*_{\tilde{p}}(\tilde{S})$ which maps $X$ to $\langle X, Y\rangle^{(\alpha)}_{\tilde{p}}$. The correspondence $Y \mapsto \omega_Y$ is a linear map between $T_{\tilde{p}}(\tilde{S})$ and $T^*_{\tilde{p}}(\tilde{S})$. An inner product and a norm on $T^*_{\tilde{p}}(\tilde{S})$ are naturally inherited from $T_{\tilde{p}}(\tilde{S})$ by

$$\langle\omega_X, \omega_Y\rangle_{\tilde{p}} := \langle X, Y\rangle^{(\alpha)}_{\tilde{p}}$$

and

$$\|\omega_X\|_{\tilde{p}} := \|X\|^{(\alpha)}_{\tilde{p}} = \sqrt{\langle X, X\rangle^{(\alpha)}_{\tilde{p}}}.$$

Now, for a (smooth) real function $f$ on $\tilde{S}$, the differential of $f$ at $\tilde{p}$, $(df)_{\tilde{p}}$, is the member of $T^*_{\tilde{p}}(\tilde{S})$ which maps $X$ to $X(f)$. The gradient of $f$ at $\tilde{p}$ is the tangent vector $\mathrm{grad}\,f$ corresponding to $(df)_{\tilde{p}}$; hence, it satisfies

$$(df)_{\tilde{p}}(X) = X(f) = \langle\mathrm{grad}\,f,\, X\rangle^{(\alpha)}_{\tilde{p}} \qquad (16)$$

and

$$\|(df)_{\tilde{p}}\|^2_{\tilde{p}} = \langle\mathrm{grad}\,f,\, \mathrm{grad}\,f\rangle^{(\alpha)}_{\tilde{p}}. \qquad (17)$$

Since $\mathrm{grad}\,f$ is a tangent vector,

$$\mathrm{grad}\,f = \sum_{i=1}^n h_i\,\partial_i \qquad (18)$$

for some scalars $h_i$. Applying (16) with $X = \partial_j$, for each $j = 1,\dots,n$, and using (18), we obtain

$$\partial_j(f) = \Big\langle\sum_{i=1}^n h_i\,\partial_i,\; \partial_j\Big\rangle^{(\alpha)} = \sum_{i=1}^n h_i\,\langle\partial_i, \partial_j\rangle^{(\alpha)} = \sum_{i=1}^n h_i\, g^{(\alpha)}_{i,j}, \qquad j = 1,\dots,n.$$

This yields

$$[h_1,\dots,h_n]^T = \big[G^{(\alpha)}\big]^{-1}\,[\partial_1(f),\dots,\partial_n(f)]^T,$$

and so

$$\mathrm{grad}\,f = \sum_{i,j}\big(g^{(\alpha)}\big)^{i,j}\,\partial_j(f)\,\partial_i. \qquad (19)$$

From (16), (17), and (19), we get

$$\|(df)_{\tilde{p}}\|^2_{\tilde{p}} = \sum_{i,j}\big(g^{(\alpha)}\big)^{i,j}\,\partial_i(f)\,\partial_j(f), \qquad (20)$$

where $(g^{(\alpha)})^{i,j}$ is the $(i,j)$th entry of the inverse of $G^{(\alpha)} = [g^{(\alpha)}_{i,j}]$.

With these preliminaries, we now state our main results. These are analogous to those in [30, Sec. 2.5].

###### Theorem 2.

Let $A : \mathcal{X} \to \mathbb{R}$ be any mapping (that is, a vector in $\mathbb{R}^{\mathcal{X}}$). Let $E[A] : \tilde{S} \to \mathbb{R}$ be the mapping $\tilde{p} \mapsto E_{\tilde{p}}[A]$. We then have

$$\mathrm{Var}_{p^{(\alpha)}}\left[\frac{\tilde{p}}{p^{(\alpha)}}\big(A - E_{\tilde{p}}[A]\big)\right] = \big\|(dE_{\tilde{p}}[A])_{\tilde{p}}\big\|^2_{\tilde{p}}. \qquad (21)$$

###### Proof.

For any tangent vector $X \in T_{\tilde{p}}(\tilde{S})$,

$$X(E_{\tilde{p}}[A]) = \sum_x X(x)A(x) \qquad (22)$$

$$= E_{\tilde{p}}\big[X^{(e)}_{\tilde{p}}\cdot A\big] = E_{\tilde{p}}\big[X^{(e)}_{\tilde{p}}\big(A - E_{\tilde{p}}[A]\big)\big]. \qquad (23)$$

Since $T^{(e)}_{\tilde{p}}(\tilde{S}) = T^{(\alpha)}_{\tilde{p}}(\tilde{S})$ (c.f. (14)), there exists $Y \in T_{\tilde{p}}(\tilde{S})$ such that $\omega_Y = (dE_{\tilde{p}}[A])_{\tilde{p}}$ and $Y^{(\alpha)}_{\tilde{p}} = A - E_{\tilde{p}}[A]$. Hence we see that

$$\begin{aligned} \big\|(dE_{\tilde{p}}[A])_{\tilde{p}}\big\|^2_{\tilde{p}} &= E_{\tilde{p}}\big[Y^{(e)}_{\tilde{p}}\,Y^{(\alpha)}_{\tilde{p}}\big] \\ &= E_{\tilde{p}}\big[Y^{(e)}_{\tilde{p}}\big(A - E_{\tilde{p}}[A]\big)\big] \\ &\stackrel{(a)}{=} E_{\tilde{p}}\left[\left\{\frac{\tilde{p}(X)}{p^{(\alpha)}(X)}\big(A - E_{\tilde{p}}[A]\big) + E_{p^{(\alpha)}}\big[Y^{(e)}_{\tilde{p}}\big]\right\}\big(A - E_{\tilde{p}}[A]\big)\right] \\ &\stackrel{(b)}{=} E_{\tilde{p}}\left[\frac{\tilde{p}(X)}{p^{(\alpha)}(X)}\big(A - E_{\tilde{p}}[A]\big)^2\right] \\ &= E_{p^{(\alpha)}}\left[\left(\frac{\tilde{p}(X)}{p^{(\alpha)}(X)}\big(A - E_{\tilde{p}}[A]\big)\right)^{\!2}\right] \\ &= \mathrm{Var}_{p^{(\alpha)}}\left[\frac{\tilde{p}(X)}{p^{(\alpha)}(X)}\big(A - E_{\tilde{p}}[A]\big)\right]. \end{aligned}$$

∎