# Information Geometric Approach to Bayesian Lower Error Bounds

Information geometry describes a framework in which probability densities are viewed as differential-geometric structures. This approach has shown that the geometry of the space of probability distributions parameterized by their covariance matrix is linked to the fundamental concepts of estimation theory. In particular, prior work proposes a Riemannian metric, which defines the distance between the parameterized probability distributions, that is equivalent to the Fisher Information Matrix and helps in obtaining the deterministic Cramér-Rao lower bound (CRLB). Recent work in this framework has established links with several practical applications. However, the classical CRLB is useful only for unbiased estimators and inaccurately predicts the mean square error in low signal-to-noise ratio (SNR) scenarios. In this paper, we propose a general Riemannian metric that yields both the Bayesian CRLB and the deterministic CRLB, along with their vector parameter extensions. We also extend our results to the Barankin bound, thereby enhancing their applicability to low SNR situations.

## I Introduction

Information geometry is the study of statistical models from a Riemannian geometric perspective. Differential geometric modeling methods were introduced to statistics by C. R. Rao in his seminal paper [1] and later formally developed by Cencov [2]. The information geometric concept of a manifold that describes the parameterized probability distributions has garnered considerable interest in recent years. The main advantages include structures in the space of probability distributions that are invariant to non-singular transformations of the parameters [3], robust estimation of covariance matrices [4], and the use of the Fisher Information Matrix (FIM) as a metric [5].

Information geometry has now transcended its initial statistical scope and expanded to several novel research areas including, but not limited to, Fisher-Rao Riemannian geometry [6], Finsler information geometry [7], optimal transport geometry [8], and quantum information geometry [9]. Many problems in science and engineering use probability distributions and, therefore, information geometry has been used as a useful and rigorous tool for analyses in applications such as neural networks [10, 11], optimization [12, 13], radar systems [14, 15], communications [16], [6], and machine learning [17, 18]. More recently, several developments in deep learning [19, 20] that employ various approximations to the FIM to calculate the gradient descent have incorporated information geometric concepts.

Information geometry bases the distance between the parameterized probability distributions on the FIM [3]. In estimation theory, the well-known deterministic Cramér-Rao lower bound (CRLB) is the inverse of the FIM. Therefore, the results derived from information geometry are directly connected with the fundamentals of estimation theory. Nearly all prior works exploited this information geometric link to the CRLB in their analyses because the CRLB is the most widely used benchmark for evaluating the mean square error (MSE) performance of an estimator. However, the classical CRLB holds only if the estimator is unbiased. In many practical problems, such as nonparametric regression [21], communication [22], and radar [23], the estimators are biased. The above-mentioned information geometric framework loses its utility in these cases.
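As a concrete illustration of the statement that the CRLB is the inverse of the FIM, the following minimal sketch evaluates both quantities exactly for a scalar example. The binomial model, the sample proportion estimator, and all function names are assumptions chosen for illustration; they do not appear in the original text.

```python
from math import comb


def binomial_pmf(m, theta):
    """Probabilities of x = 0, ..., m successes in m Bernoulli(theta) trials."""
    return [comb(m, x) * theta ** x * (1.0 - theta) ** (m - x) for x in range(m + 1)]


def fisher_information(m, theta):
    """E[(d/dtheta log p_theta(X))^2], computed by exact summation."""
    pmf = binomial_pmf(m, theta)
    # Score of the binomial model: x / theta - (m - x) / (1 - theta).
    return sum(p * (x / theta - (m - x) / (1.0 - theta)) ** 2
               for x, p in enumerate(pmf))


def variance_of_sample_proportion(m, theta):
    """Exact variance of the unbiased estimator x / m."""
    pmf = binomial_pmf(m, theta)
    mean = sum(p * x / m for x, p in enumerate(pmf))
    return sum(p * (x / m - mean) ** 2 for x, p in enumerate(pmf))


m, theta = 10, 0.3
crlb = 1.0 / fisher_information(m, theta)
var = variance_of_sample_proportion(m, theta)
print(crlb, var)
```

In this particular model the sample proportion attains the bound, so the two printed values agree up to rounding. For a generic unbiased estimator only the inequality holds.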

Moreover, the classical CRLB is a tight bound only when the errors are small. It is well known that in nonlinear estimation problems with finite support parameters, for example time delay estimation in radar [23], the performance of the estimator is characterized by three distinct signal-to-noise ratio (SNR) regions [24]. When the number of observations is large or the SNR is high (asymptotic region), the CRLB describes the MSE accurately. With few observations or at low SNR, the information from the signal observations is insufficient and the estimator criterion is heavily corrupted by the noise. Here, the MSE is close to that obtained via only a priori information, that is, a quasi-uniform random variable on the parameter support. In between these two limiting cases lies the threshold region, where the signal observations are subject to ambiguities and the estimator MSE increases sharply due to the outlier effect. The CRLB is useful only in the asymptotic region and is not an accurate predictor of the MSE when the performance breaks down due to an increase in noise.

In this paper, we propose an information geometric framework that addresses these drawbacks of the classical CRLB. The Bayesian CRLB [21] is typically used for assessing the quality of biased estimators. It is similar to the deterministic CRLB except that it assumes the parameters to be random with an a priori probability density function. We develop a general Riemannian metric that can be modified to link to both the Bayesian and the deterministic CRLB. To address the problem of the threshold effect, other bounds that are tighter than the CRLB have been developed (see, e.g., [25] for an overview) to accurately identify the SNR thresholds that define the ambiguity region. In particular, the Barankin bound [26] is a fundamental statistical tool for understanding the threshold effect for unbiased estimators. In simple terms, the threshold effect can be understood as the region where the Barankin bound deviates from the CRLB [27]. In this paper, we show that our metric is also applicable to the Barankin bound. Hence, compared to previous works [3], our information geometric approach to lower bounds on the MSE holds for both the Bayesian CRLB and the deterministic CRLB, their vector equivalents, and the threshold effect through the Barankin bound.

The paper is organized as follows: In the next section, we provide a brief background to the information geometry and describe the notation used in the later sections. Further, we explain the dual structure of the manifolds because, in most applications, the underlying manifolds are dually flat. Here, we define a divergence function between two points in a manifold. In Section III, we establish the connection between the Riemannian metric and the Kullback-Leibler divergence for the Bayesian case. The manifold of all discrete probability distributions is dually flat and the Kullback-Leibler divergence plays a key role here. We also show that, under certain conditions, our approach yields the previous results from [3]. In Section IV, we state and prove our main result applicable to several other bounds before providing concluding remarks in Section V.

## II Information Geometry: A Brief Background

An $n$-dimensional manifold is a Hausdorff and second countable topological space which is locally homeomorphic to Euclidean space of dimension $n$ [28, 29, 30]. A Riemannian manifold is a real differentiable manifold in which the tangent space at each point is a finite dimensional Hilbert space and, therefore, equipped with an inner product. The collection of all these inner products is called a Riemannian metric.

In the information geometry framework, the statistical models play the role of a manifold and the Fisher information matrix and its various generalizations play the role of a Riemannian metric. Formally, by a statistical manifold, we mean a parametric family of probability distributions $S = \{p_\theta : \theta \in \Theta\}$ with a "continuously varying" parameter space $\Theta$ (statistical model). The dimension of a statistical manifold is the dimension of the parameter space. For example, the family of normal distributions $S = \{N(\mu, \sigma^2) : \mu \in \mathbb{R}, \sigma^2 > 0\}$ is a two dimensional statistical manifold. The tangent space at a point of $S$ is a linear space that corresponds to a "local linearization" at that point. The tangent space at a point $p$ of $S$ is denoted by $T_p(S)$. The elements of $T_p(S)$ are called tangent vectors of $S$ at $p$. A Riemannian metric at a point $p$ of $S$ is an inner product defined for any pair of tangent vectors of $S$ at $p$.

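In this paper, we restrict ourselves to statistical manifolds defined on a finite set $\mathcal{X}$. Let $\mathcal{P} := \mathcal{P}(\mathcal{X})$ denote the space of all probability distributions on $\mathcal{X}$. Let $S \subset \mathcal{P}$ be a sub-manifold. Let $\theta = (\theta_1, \dots, \theta_k)$ be a parameterization of $S$. Given a divergence function $D$ on $S$ (by a divergence, we mean a non-negative function $D$ defined on $S \times S$ such that $D(p, q) = 0$ iff $p = q$), Eguchi [31] defines a Riemannian metric on $S$ by the matrix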

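$$G^{(D)}(\theta) = \left[g^{(D)}_{i,j}(\theta)\right],$$

where

$$g^{(D)}_{i,j}(\theta) := -D[\partial_i, \partial_j] := -\frac{\partial}{\partial\theta_i}\frac{\partial}{\partial\theta'_j} D(p_\theta, p_{\theta'})\Big|_{\theta=\theta'}.$$

Here $g^{(D)}_{i,j}$ is the element in the $i$th row and $j$th column of the matrix $G^{(D)}$, $\theta = (\theta_1, \dots, \theta_k)$, and $\theta' = (\theta'_1, \dots, \theta'_k)$. Eguchi also defines dual affine connections $\nabla^{(D)}$ and $\nabla^{(D^*)}$, with connection coefficients described by the following Christoffel symbols

$$\Gamma^{(D)}_{ij,k}(\theta) := -D[\partial_i\partial_j, \partial_k] := -\frac{\partial}{\partial\theta_i}\frac{\partial}{\partial\theta_j}\frac{\partial}{\partial\theta'_k} D(p_\theta, p_{\theta'})\Big|_{\theta=\theta'}$$

and

$$\Gamma^{(D^*)}_{ij,k}(\theta) := -D[\partial_k, \partial_i\partial_j] := -\frac{\partial}{\partial\theta_k}\frac{\partial}{\partial\theta'_i}\frac{\partial}{\partial\theta'_j} D(p_\theta, p_{\theta'})\Big|_{\theta=\theta'},$$

such that $\nabla^{(D)}$ and $\nabla^{(D^*)}$ form a dualistic structure in the sense that

$$\partial_k g^{(D)}_{i,j} = \Gamma^{(D)}_{ki,j} + \Gamma^{(D^*)}_{kj,i}, \tag{1}$$

where $D^*(p, q) = D(q, p)$.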
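Eguchi's construction can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the original derivation: it takes $D$ to be the Kullback-Leibler divergence on a three-outcome categorical family with coordinates $\theta = (\theta_1, \theta_2)$, approximates $-\partial_i \partial'_j D$ by central finite differences, and compares the result with the closed-form Fisher information matrix of that family. The family, the step size `h`, and the function names are all hypothetical choices.

```python
from math import log


def categorical(theta):
    """Three-outcome distribution with coordinates theta = (theta_1, theta_2)."""
    t1, t2 = theta
    return [t1, t2, 1.0 - t1 - t2]


def kl(theta, theta_prime):
    """Kullback-Leibler divergence D(p_theta, p_theta')."""
    p, q = categorical(theta), categorical(theta_prime)
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))


def shifted(theta, index, step):
    """Copy of theta with one coordinate moved by step."""
    out = list(theta)
    out[index] += step
    return out


def eguchi_metric(theta, h=1e-4):
    """g_ij = -d/dtheta_i d/dtheta'_j D(p_theta, p_theta') at theta' = theta."""
    k = len(theta)
    g = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            # Central difference in the first argument (index i)
            # and in the second argument (index j).
            mixed = (kl(shifted(theta, i, h), shifted(theta, j, h))
                     - kl(shifted(theta, i, h), shifted(theta, j, -h))
                     - kl(shifted(theta, i, -h), shifted(theta, j, h))
                     + kl(shifted(theta, i, -h), shifted(theta, j, -h))) / (4.0 * h * h)
            g[i][j] = -mixed
    return g


def fisher_closed_form(theta):
    """Fisher information matrix of the categorical family in these coordinates."""
    t1, t2 = theta
    t3 = 1.0 - t1 - t2
    return [[1.0 / t1 + 1.0 / t3, 1.0 / t3],
            [1.0 / t3, 1.0 / t2 + 1.0 / t3]]


numeric = eguchi_metric([0.2, 0.3])
closed = fisher_closed_form([0.2, 0.3])
print(numeric)
print(closed)
```

The two matrices agree up to the finite-difference error, which reflects the fact that the metric induced by the Kullback-Leibler divergence is the Fisher information metric.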

## III Fisher Information Matrix for the Bayesian Case

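Eguchi's theory in Section II can also be extended to the space $\tilde{\mathcal{P}}(\mathcal{X})$ of all positive measures on $\mathcal{X}$. That is, $\tilde{\mathcal{P}} = \{\tilde{p} : \mathcal{X} \to (0, \infty)\}$. Let $S = \{p_\theta : \theta = (\theta_1, \dots, \theta_k) \in \Theta\}$ be a $k$-dimensional sub-manifold of $\mathcal{P}$ and let

$$\tilde{S} := \{\tilde{p}_\theta(x) = p_\theta(x)\lambda(\theta) : p_\theta \in S\}, \tag{2}$$

where $\lambda$ is a probability distribution on $\Theta$. Then $\tilde{S}$ is a $(k+1)$-dimensional sub-manifold of $\tilde{\mathcal{P}}$. For $\tilde{p}_\theta, \tilde{p}_{\theta'} \in \tilde{\mathcal{P}}$, the Kullback-Leibler divergence (KL-divergence) between $\tilde{p}_\theta$ and $\tilde{p}_{\theta'}$ is given by

$$\begin{aligned}
I(\tilde{p}_\theta \| \tilde{p}_{\theta'}) &= \sum_x \tilde{p}_\theta(x) \log\frac{\tilde{p}_\theta(x)}{\tilde{p}_{\theta'}(x)} - \sum_x \tilde{p}_\theta(x) + \sum_x \tilde{p}_{\theta'}(x) \\
&= \sum_x p_\theta(x)\lambda(\theta) \log\frac{p_\theta(x)\lambda(\theta)}{p_{\theta'}(x)\lambda(\theta')} - \lambda(\theta) + \lambda(\theta').
\end{aligned} \tag{3}$$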

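We define a Riemannian metric $G^{(I)}(\theta) = [g^{(I)}_{i,j}(\theta)]$ on $\tilde{S}$ by

$$g^{(I)}_{i,j}(\theta) := -I[\partial_i \| \partial_j] = -\frac{\partial}{\partial\theta_i}\frac{\partial}{\partial\theta'_j} \sum_x p_\theta(x)\lambda(\theta) \log\frac{p_\theta(x)\lambda(\theta)}{p_{\theta'}(x)\lambda(\theta')}\bigg|_{\theta'=\theta}, \tag{4}$$

which evaluates to

$$\begin{aligned}
g^{(I)}_{i,j}(\theta) &= \sum_x \partial_i\big(p_\theta(x)\lambda(\theta)\big) \cdot \partial_j \log\big(p_\theta(x)\lambda(\theta)\big) \\
&= \sum_x p_\theta(x)\lambda(\theta)\, \partial_i\big(\log p_\theta(x)\lambda(\theta)\big) \cdot \partial_j\big(\log p_\theta(x)\lambda(\theta)\big) \\
&= \lambda(\theta) \sum_x p_\theta(x) \big[\partial_i(\log p_\theta(x)) + \partial_i(\log\lambda(\theta))\big] \cdot \big[\partial_j(\log p_\theta(x)) + \partial_j(\log\lambda(\theta))\big] \\
&= \lambda(\theta) \big\{E_\theta[\partial_i \log p_\theta(X) \cdot \partial_j \log p_\theta(X)] + \partial_i(\log\lambda(\theta)) \cdot \partial_j(\log\lambda(\theta))\big\} \\
&= \lambda(\theta) \big\{g^{(e)}_{i,j}(\theta) + J^{\lambda}_{i,j}(\theta)\big\},
\end{aligned} \tag{5}$$

where the cross terms vanish because $E_\theta[\partial_i \log p_\theta(X)] = 0$,

$$g^{(e)}_{i,j}(\theta) := E_\theta[\partial_i \log p_\theta(X) \cdot \partial_j \log p_\theta(X)], \tag{6}$$

and

$$J^{\lambda}_{i,j}(\theta) := \partial_i(\log\lambda(\theta)) \cdot \partial_j(\log\lambda(\theta)). \tag{7}$$

Let $G^{(e)}(\theta) := [g^{(e)}_{i,j}(\theta)]$ and $J^{\lambda}(\theta) := [J^{\lambda}_{i,j}(\theta)]$. Then

$$G^{(I)}(\theta) = \lambda(\theta)\big[G^{(e)}(\theta) + J^{\lambda}(\theta)\big]. \tag{8}$$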
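The decomposition of the metric into a Fisher term and a prior term can be verified numerically in a scalar example. The sketch below is only an illustration under assumptions that are not in the original text: $p_\theta$ is a Bernoulli distribution, the prior $\lambda$ is a Beta(2, 2) density, and the mixed derivative of the unnormalized KL-divergence is approximated by central finite differences with a hypothetical step size `h`.

```python
from math import log


def bernoulli(theta):
    return [1.0 - theta, theta]


def prior_density(theta):
    # Hypothetical prior chosen for illustration: Beta(2, 2) density on (0, 1).
    return 6.0 * theta * (1.0 - theta)


def unnormalized_kl(theta, theta_prime):
    """KL-divergence between the measures p_theta * lambda(theta) and p_theta' * lambda(theta')."""
    lam, lam_prime = prior_density(theta), prior_density(theta_prime)
    total = 0.0
    for p, q in zip(bernoulli(theta), bernoulli(theta_prime)):
        total += p * lam * log((p * lam) / (q * lam_prime))
    # The two correction terms are the total masses of the two measures.
    return total - lam + lam_prime


def metric_from_divergence(theta, h=1e-4):
    """-d/dtheta d/dtheta' of the divergence at theta' = theta (central differences)."""
    mixed = (unnormalized_kl(theta + h, theta + h)
             - unnormalized_kl(theta + h, theta - h)
             - unnormalized_kl(theta - h, theta + h)
             + unnormalized_kl(theta - h, theta - h)) / (4.0 * h * h)
    return -mixed


def metric_closed_form(theta):
    """lambda(theta) * (Fisher information + squared derivative of log lambda)."""
    fisher = 1.0 / (theta * (1.0 - theta))
    prior_score = (1.0 - 2.0 * theta) / (theta * (1.0 - theta))
    return prior_density(theta) * (fisher + prior_score ** 2)


theta = 0.3
numeric = metric_from_divergence(theta)
closed = metric_closed_form(theta)
print(numeric, closed)
```

Both values agree up to the discretization error, which is consistent with the scalar form of the decomposition derived above.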

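Notice that $G^{(e)}(\theta)$ is the usual Fisher information matrix. Also observe that $\tilde{\mathcal{P}}$ is a subset of $\mathbb{R}^{\tilde{\mathcal{X}}}$, where $\tilde{\mathcal{X}} := \mathcal{X} \cup \{a_{d+1}\}$ with $\mathcal{X} = \{a_1, \dots, a_d\}$. The tangent space at every point of $\tilde{\mathcal{P}}$ is $\mathcal{A}_0 := \{A \in \mathbb{R}^{\tilde{\mathcal{X}}} : \sum_{x \in \tilde{\mathcal{X}}} A(x) = 0\}$. That is, $T_{\tilde{p}}(\tilde{\mathcal{P}}) = \mathcal{A}_0$. We denote a tangent vector (that is, an element of $\mathcal{A}_0$) by $X^{(m)}$. The manifold $\tilde{\mathcal{P}}$ can be recognized by its homeomorphic image $\{\log\tilde{p} : \tilde{p} \in \tilde{\mathcal{P}}\}$ under the mapping $\tilde{p} \mapsto \log\tilde{p}$. Under this mapping, the tangent vector $X \in T_{\tilde{p}}(\tilde{\mathcal{P}})$ can be represented by $X^{(e)}$, which is defined by $X^{(e)}(x) = X^{(m)}(x)/\tilde{p}(x)$, and we define

$$T^{(e)}_{\tilde{p}}(\tilde{\mathcal{P}}) = \{X^{(e)} : X \in T_{\tilde{p}}(\tilde{\mathcal{P}})\} = \{A \in \mathbb{R}^{\tilde{\mathcal{X}}} : E_{\tilde{p}}[A] = 0\}.$$

For the natural basis $\partial_i$ of a coordinate system $\theta = (\theta_i)$, $(\partial_i)^{(m)} = \partial_i \tilde{p}_\theta$ and $(\partial_i)^{(e)} = \partial_i \log\tilde{p}_\theta$.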

With these notations, for any two tangent vectors $X, Y \in T_{\tilde{p}}(\tilde{\mathcal{P}})$, the Fisher metric in (5) can be written as

$$\langle X, Y \rangle^{(e)}_{\tilde{p}} = E_{\tilde{p}}[X^{(e)} Y^{(e)}].$$

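Let $\tilde{S}$ be a sub-manifold of $\tilde{\mathcal{P}}$ of the form as in (2), together with the metric $G^{(I)}$ as in (5). Let $T^*_{\tilde{p}}(\tilde{S})$ be the dual space (cotangent space) of the tangent space $T_{\tilde{p}}(\tilde{S})$ and let us consider, for each $Y \in T_{\tilde{p}}(\tilde{S})$, the element $\omega_Y \in T^*_{\tilde{p}}(\tilde{S})$ which maps $X$ to $\langle X, Y \rangle^{(e)}$. The correspondence $Y \mapsto \omega_Y$ is a linear map between $T_{\tilde{p}}(\tilde{S})$ and $T^*_{\tilde{p}}(\tilde{S})$. An inner product and a norm on $T^*_{\tilde{p}}(\tilde{S})$ are naturally inherited from $T_{\tilde{p}}(\tilde{S})$ by

$$\langle \omega_X, \omega_Y \rangle_{\tilde{p}} := \langle X, Y \rangle^{(e)}_{\tilde{p}}$$

and

$$\|\omega_X\|_{\tilde{p}} := \|X\|^{(e)}_{\tilde{p}} = \sqrt{\langle X, X \rangle^{(e)}_{\tilde{p}}}.$$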

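Now, for a (smooth) real function $f$ on $\tilde{S}$, the differential of $f$ at $\tilde{p}$, $(df)_{\tilde{p}}$, is a member of $T^*_{\tilde{p}}(\tilde{S})$ which maps $X$ to $X(f)$. The gradient of $f$ at $\tilde{p}$ is the tangent vector corresponding to $(df)_{\tilde{p}}$, hence satisfies

$$(df)_{\tilde{p}}(X) = X(f) = \langle (\mathrm{grad} f)_{\tilde{p}}, X \rangle^{(e)}_{\tilde{p}} \tag{9}$$

and

$$\|(df)_{\tilde{p}}\|^2_{\tilde{p}} = \langle (\mathrm{grad} f)_{\tilde{p}}, (\mathrm{grad} f)_{\tilde{p}} \rangle^{(e)}_{\tilde{p}}. \tag{10}$$

Since $\mathrm{grad} f$ is a tangent vector, we can write

$$\mathrm{grad} f = \sum_{i=1}^{k} h_i \partial_i \tag{11}$$

for some scalars $h_i$. Applying (9) with $X = \partial_j$, for each $j = 1, \dots, k$, and using (11), we get

$$(\partial_j)(f) = \Big\langle \sum_{i=1}^{k} h_i \partial_i, \partial_j \Big\rangle^{(e)} = \sum_{i=1}^{k} h_i \langle \partial_i, \partial_j \rangle^{(e)} = \sum_{i=1}^{k} h_i g^{(e)}_{i,j}, \quad j = 1, \dots, k.$$

From this, we have

$$[h_1, \dots, h_k]^T = [G^{(e)}]^{-1} [\partial_1(f), \dots, \partial_k(f)]^T,$$

and so

$$\mathrm{grad} f = \sum_{i,j} (g^{i,j})^{(e)} \partial_j(f)\, \partial_i. \tag{12}$$

From (9), (10), and (12), we get

$$\|(df)_{\tilde{p}}\|^2_{\tilde{p}} = \sum_{i,j} (g^{i,j})^{(e)} \partial_j(f)\, \partial_i(f), \tag{13}$$

where $(g^{i,j})^{(e)}$ is the $(i,j)$th entry of the inverse of $G^{(e)}$.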

The above and the following results are indeed an extension of [3, Sec. 2.5] to $\tilde{\mathcal{P}}$.

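###### Theorem 1 ([3])

Let $A : \tilde{\mathcal{X}} \to \mathbb{R}$ be any mapping (that is, a vector in $\mathbb{R}^{\tilde{\mathcal{X}}}$). Let $E[A] : \tilde{\mathcal{P}} \to \mathbb{R}$ be the mapping $\tilde{p} \mapsto E_{\tilde{p}}[A]$. We then have

$$\mathrm{Var}(A) = \|(dE_{\tilde{p}}[A])_{\tilde{p}}\|^2_{\tilde{p}}. \tag{14}$$

###### Proof:

For any tangent vector $X \in T_{\tilde{p}}(\tilde{\mathcal{P}})$,

$$X(E_{\tilde{p}}[A]) = \sum_x X(x) A(x) \tag{15}$$

$$\phantom{X(E_{\tilde{p}}[A])} = E_{\tilde{p}}[X^{(e)}_{\tilde{p}} \cdot A] = E_{\tilde{p}}[X^{(e)}_{\tilde{p}} (A - E_{\tilde{p}}[A])]. \tag{16}$$

Since $A - E_{\tilde{p}}[A] \in T^{(e)}_{\tilde{p}}(\tilde{\mathcal{P}})$, there exists $Y \in T_{\tilde{p}}(\tilde{\mathcal{P}})$ such that $A - E_{\tilde{p}}[A] = Y^{(e)}_{\tilde{p}}$, and $\mathrm{grad}(E[A]) = Y$. Hence we see that

$$\|(dE[A])_{\tilde{p}}\|^2_{\tilde{p}} = E_{\tilde{p}}[Y^{(e)}_{\tilde{p}} Y^{(e)}_{\tilde{p}}] = E_{\tilde{p}}[(A - E_{\tilde{p}}[A])^2].$$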
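
###### Corollary 2 ([3])

If $S$ is a submanifold of $\tilde{\mathcal{P}}$, then

$$\mathrm{Var}_{\tilde{p}}[A] \ge \|(dE[A]|_S)_{\tilde{p}}\|^2_{\tilde{p}} \tag{17}$$

with equality iff

$$A - E_{\tilde{p}}[A] \in \{X^{(e)}_{\tilde{p}} : X \in T_{\tilde{p}}(S)\} =: T^{(e)}_{\tilde{p}}(S).$$

###### Proof:

Since $(\mathrm{grad}\, E[A]|_S)_{\tilde{p}}$ is the orthogonal projection of $(\mathrm{grad}\, E[A])_{\tilde{p}}$ onto $T_{\tilde{p}}(S)$, the result follows from the theorem.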
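Theorem 1 and Corollary 2 can be illustrated numerically. The following sketch makes simplifying assumptions that are not part of the original text: it works on the ordinary probability simplex over three outcomes (the setting of [3]) instead of the extended measure space, uses the coordinates $(p_0, p_1)$ on the full simplex, and takes the binomial family with two trials as the one-dimensional submanifold. The vectors `a_generic` and `a_affine` are hypothetical choices of $A$.

```python
def inverse_2x2(matrix):
    (a, b), (c, d) = matrix
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def variance(p, a):
    mean = sum(pi * ai for pi, ai in zip(p, a))
    return sum(pi * (ai - mean) ** 2 for pi, ai in zip(p, a))


def squared_norm_full_simplex(p, a):
    """Sum over i, j of g^{ij} d_i f d_j f for f = E_p[A] on the full simplex."""
    # Coordinates are (p[0], p[1]); p[2] = 1 - p[0] - p[1].
    g = [[1.0 / p[0] + 1.0 / p[2], 1.0 / p[2]],
         [1.0 / p[2], 1.0 / p[1] + 1.0 / p[2]]]
    g_inv = inverse_2x2(g)
    grad = [a[0] - a[2], a[1] - a[2]]  # partial derivatives of E_p[A]
    return sum(g_inv[i][j] * grad[i] * grad[j] for i in range(2) for j in range(2))


def squared_norm_binomial_curve(t, a):
    """Same quantity restricted to the curve p_t = Binomial(2, t)."""
    fisher = 2.0 / (t * (1.0 - t))
    # Derivative of E[A] = a0 (1 - t)^2 + a1 2 t (1 - t) + a2 t^2 with respect to t.
    d_mean = a[0] * (-2.0 * (1.0 - t)) + a[1] * (2.0 - 4.0 * t) + a[2] * (2.0 * t)
    return d_mean ** 2 / fisher


t = 0.4
p = [(1.0 - t) ** 2, 2.0 * t * (1.0 - t), t ** 2]
a_generic = [1.0, 0.5, 3.0]  # not an affine function of the success count
a_affine = [0.0, 1.0, 2.0]   # the success count itself

print(variance(p, a_generic), squared_norm_full_simplex(p, a_generic))
print(variance(p, a_generic), squared_norm_binomial_curve(t, a_generic))
print(variance(p, a_affine), squared_norm_binomial_curve(t, a_affine))
```

On the full simplex the squared norm reproduces the variance. On the binomial curve it is strictly smaller for `a_generic` and equal for `a_affine`, whose centered version is proportional to the score and therefore lies in the tangent space of the curve.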

## IV Derivation of Error Bounds

We state our main result in the following theorem.

###### Theorem 3

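Let $S$ and $\tilde{S}$ be as in (2). Let $\hat{\theta}$ be an estimator of $\theta$. Then

1. Bayesian Cramér-Rao:

$$E_\lambda\big[\mathrm{Var}_\theta(\hat{\theta})\big] \ge \big\{E_\lambda\big[G^{(I)}(\theta)\big]\big\}^{-1}, \tag{18}$$

where $\mathrm{Var}_\theta(\hat{\theta}) = [\mathrm{Cov}_\theta(\hat{\theta}_i, \hat{\theta}_j)]$ is the covariance matrix and $G^{(I)}(\theta)$ is as in (8). (In (18), we use the usual convention that, for two matrices $A$ and $B$, $A \ge B$ means that $A - B$ is positive semi-definite.)

2. Deterministic Cramér-Rao (unbiased): If $\hat{\theta}$ is an unbiased estimator of $\theta$, then

$$\mathrm{Var}_\theta[\hat{\theta}] \ge [G^{(e)}(\theta)]^{-1}. \tag{19}$$

3. Deterministic Cramér-Rao (biased): For any estimator $\hat{\theta}$ of $\theta$,

$$\mathrm{MSE}_\theta[\hat{\theta}] \ge (1 + B'(\theta)) [G^{(e)}(\theta)]^{-1} (1 + B'(\theta)) + b(\theta) b(\theta)^T,$$

where $b(\theta) = (b_1(\theta), \dots, b_k(\theta))^T := E_\theta[\hat{\theta}] - \theta$ is the bias and $1 + B'(\theta)$ is the matrix whose $(i,j)$th entry is $0$ if $i \ne j$ and is $1 + \partial_i b_i(\theta)$ if $i = j$.

4. Barankin Bound: (Scalar case) If $\hat{\theta}$ is an unbiased estimator of $\theta$, then

$$\mathrm{Var}_\theta[\hat{\theta}] \ge \sup_{n,\, a_l,\, \theta^{(l)}} \frac{\Big[\sum_{l=1}^{n} a_l (\theta^{(l)} - \theta)\Big]^2}{\sum_x \Big[\sum_{l=1}^{n} a_l L_{\theta^{(l)}}(x)\Big]^2 p_\theta(x)}, \tag{20}$$

where $L_{\theta^{(l)}}(x) := p_{\theta^{(l)}}(x)/p_\theta(x)$ and the supremum is over all $a_1, \dots, a_n \in \mathbb{R}$, $n \in \mathbb{N}$, and $\theta^{(1)}, \dots, \theta^{(n)} \in \Theta$.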
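To make the biased deterministic bound concrete, the sketch below evaluates both sides of the scalar inequality $\mathrm{MSE}_\theta[\hat{\theta}] \ge (1 + b'(\theta))^2 / g^{(e)}(\theta) + b(\theta)^2$ exactly on a finite sample space. Everything specific in it is an assumption made for illustration: the binomial model, the shrinkage estimator $(x + 1)/(m + 2)$, the finite-difference step `h`, and the function names.

```python
from math import comb


def binomial_pmf(m, theta):
    return [comb(m, x) * theta ** x * (1.0 - theta) ** (m - x) for x in range(m + 1)]


def shrinkage_estimate(x, m):
    # Hypothetical biased estimator used only for illustration.
    return (x + 1.0) / (m + 2.0)


def bias(m, theta):
    """b(theta) = E_theta[estimate] - theta, by exact summation."""
    pmf = binomial_pmf(m, theta)
    return sum(p * shrinkage_estimate(x, m) for x, p in enumerate(pmf)) - theta


def mean_square_error(m, theta):
    pmf = binomial_pmf(m, theta)
    return sum(p * (shrinkage_estimate(x, m) - theta) ** 2 for x, p in enumerate(pmf))


def biased_crlb(m, theta, h=1e-5):
    """(1 + b'(theta))^2 / Fisher information + b(theta)^2."""
    fisher = m / (theta * (1.0 - theta))
    bias_slope = (bias(m, theta + h) - bias(m, theta - h)) / (2.0 * h)
    return (1.0 + bias_slope) ** 2 / fisher + bias(m, theta) ** 2


m, theta = 10, 0.3
mse = mean_square_error(m, theta)
bound = biased_crlb(m, theta)
print(mse, bound)
```

Because this estimator is an affine function of the sufficient statistic, the bound is attained and the two printed values coincide up to rounding. Other biased estimators would generally satisfy the inequality strictly.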

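###### Proof:

1. Let $A = \sum_{i=1}^{k} c_i \hat{\theta}_i$, where $\hat{\theta} = (\hat{\theta}_1, \dots, \hat{\theta}_k)$ is an unbiased estimator of $\theta$, in Corollary 2. Then, from (17), we have

$$\sum_{i,j} c_i c_j \mathrm{Cov}_{\tilde{\theta}}(\hat{\theta}_i, \hat{\theta}_j) \ge \sum_{i,j} c_i c_j (g^{(I)})^{i,j}(\theta),$$

where $\mathrm{Cov}_{\tilde{\theta}}$ denotes the covariance taken with respect to $\tilde{p}_\theta$ and $(g^{(I)})^{i,j}$ is the $(i,j)$th entry of the inverse of $G^{(I)}$. This implies that

$$\lambda(\theta) \sum_{i,j} c_i c_j \mathrm{Cov}_\theta(\hat{\theta}_i, \hat{\theta}_j) \ge \sum_{i,j} c_i c_j (g^{(I)})^{i,j}(\theta). \tag{21}$$

Hence, taking expectation on both sides with respect to $\lambda$, we get

$$\sum_{i,j} c_i c_j E_\lambda\big[\mathrm{Cov}_\theta(\hat{\theta}_i, \hat{\theta}_j)\big] \ge \sum_{i,j} c_i c_j E_\lambda\big[(g^{(I)})^{i,j}(\theta)\big].$$

That is,

$$E_\lambda\big[\mathrm{Var}_\theta(\hat{\theta})\big] \ge E_\lambda\big[G^{(I)}(\theta)^{-1}\big].$$

But $E_\lambda\big[G^{(I)}(\theta)^{-1}\big] \ge \big\{E_\lambda\big[G^{(I)}(\theta)\big]\big\}^{-1}$ by [32]. This proves the result.

2. This follows from (21) by taking $\lambda(\theta) = 1$.

3. Let us first observe that $\hat{\theta}$ is an unbiased estimator of $\theta + b(\theta)$. Let $A = \sum_{i=1}^{k} c_i \hat{\theta}_i$ as before. Then $E_\theta[A] = \sum_{i=1}^{k} c_i (\theta_i + b_i(\theta))$. Then, from Corollary 2 and (17), we have

$$\mathrm{Var}_\theta[\hat{\theta}] \ge (1 + B'(\theta)) [G^{(e)}(\theta)]^{-1} (1 + B'(\theta)).$$

But $\mathrm{MSE}_\theta[\hat{\theta}] = \mathrm{Var}_\theta[\hat{\theta}] + b(\theta) b(\theta)^T$. This proves the assertion.

4. For fixed $a_1, \dots, a_n \in \mathbb{R}$ and $\theta^{(1)}, \dots, \theta^{(n)} \in \Theta$, let us define a metric by the following formula

$$g(\theta) := \sum_x \Big[\sum_{l=1}^{n} a_l L_{\theta^{(l)}}(x)\Big]^2 p_\theta(x). \tag{22}$$

Let $f$ be the mapping $p \mapsto E_p[A]$. Let $A(\cdot) = \hat{\theta}(\cdot) - \theta$, where $\hat{\theta}$ is an unbiased estimator of $\theta$. Then the partial derivative of $f$ in (13) equals

$$\sum_x (\hat{\theta}(x) - \theta) \Big(\sum_{l=1}^{n} a_l L_{\theta^{(l)}}(x)\Big) p_\theta(x) = \sum_{l=1}^{n} a_l \Big(\sum_x (\hat{\theta}(x) - \theta) \frac{p_{\theta^{(l)}}(x)}{p_\theta(x)} p_\theta(x)\Big) = \sum_{l=1}^{n} a_l (\theta^{(l)} - \theta).$$

Hence from (13) and Corollary 2, we have

$$\mathrm{Var}_\theta[\hat{\theta}] \ge \frac{\Big[\sum_{l=1}^{n} a_l (\theta^{(l)} - \theta)\Big]^2}{\sum_x \Big[\sum_{l=1}^{n} a_l L_{\theta^{(l)}}(x)\Big]^2 p_\theta(x)}.$$

Since $a_l$ and $\theta^{(l)}$ are arbitrary, taking the supremum over all $a_1, \dots, a_n \in \mathbb{R}$, $n \in \mathbb{N}$, and $\theta^{(1)}, \dots, \theta^{(n)} \in \Theta$, we get (20).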
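The ratio inside the supremum of the Barankin bound can be evaluated directly for any finite choice of weights and test points. The sketch below does this under assumptions that are not in the original text: a binomial model, the unbiased estimator $x/m$, and two hand-picked configurations of test points. It does not search for the supremum; it only checks that each ratio is a valid lower bound on the variance.

```python
from math import comb


def binomial_pmf(m, theta):
    return [comb(m, x) * theta ** x * (1.0 - theta) ** (m - x) for x in range(m + 1)]


def barankin_ratio(m, theta, test_points, weights):
    """Ratio inside the supremum for fixed weights a_l and test points theta^(l)."""
    base = binomial_pmf(m, theta)
    test_pmfs = [binomial_pmf(m, point) for point in test_points]
    numerator = sum(a * (point - theta) for a, point in zip(weights, test_points)) ** 2
    denominator = 0.0
    for x, p in enumerate(base):
        # Weighted sum of likelihood ratios L_{theta^(l)}(x) = p_{theta^(l)}(x) / p_theta(x).
        combination = sum(a * q[x] / p for a, q in zip(weights, test_pmfs))
        denominator += combination ** 2 * p
    return numerator / denominator


m, theta = 10, 0.3
variance = theta * (1.0 - theta) / m  # exact variance of the unbiased estimator x / m

# One distant test point with unit weight.
single_point = barankin_ratio(m, theta, [0.5], [1.0])

# Two test points forming a difference quotient; for small h this approaches the CRLB.
h = 1e-3
two_point = barankin_ratio(m, theta, [theta + h, theta], [1.0 / h, -1.0 / h])

print(single_point, two_point, variance)
```

Both ratios stay below the variance, and the difference-quotient configuration comes close to $\theta(1-\theta)/m$, which is the CRLB for this model. This matches the usual reading of the Barankin bound as a family of bounds that contains the CRLB as a limiting case.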

## V Conclusion

We have shown that our Theorem 3 provides a general information geometric characterization of the statistical manifolds linking them to the Bayesian CRLB for vector parameters; the extension to estimators of measurable functions of the parameter is trivial. We exploited the general definition of Kullback-Leibler divergence when the probability densities are not normalized. This is an improvement over Amari’s work [3] on information geometry which only dealt with the notion of deterministic CRLB of scalar parameters. Further, we proposed an approach to arrive at the Barankin bound thereby shedding light on the relation between the threshold effect and information geometry. Both of our improvements enable usage of information geometric approaches in critical scenarios of biased estimators and low SNRs. This is especially useful in the analyses of many practical problems such as radar and communication. In future investigations, we intend to explore these methods further especially in the context of the threshold effect.

## References

• [1] C. R. Rao, “Information and the accuracy attainable in the estimation of statistical parameters,” Bulletin of Calcutta Mathematical Society, vol. 37, pp. 81–91, 1945.
• [2] N. N. Cencov, Statistical decision rules and optimal inference (Translations of mathematical monographs).   American Mathematical Society, 2000, no. 53.
• [3] S. Amari and H. Nagaoka, Methods of information geometry.   American Mathematical Society, Oxford University Press, 2000, vol. 191.
• [4] B. Balaji, F. Barbaresco, and A. Decurninge, “Information geometry and estimation of toeplitz covariance matrices,” in IEEE Radar Conference, 2014, pp. 1–4.
• [5] F. Nielsen, “Cramér-Rao lower bound and information geometry,” arXiv preprint arXiv:1301.3578, 2013.
• [6] S. J. Maybank, S. Ieng, and R. Benosman, “A Fisher-Rao metric for paracatadioptric images of lines,” International Journal of Computer Vision, vol. 99, no. 2, pp. 147–165, 2012.
• [7] Z. Shen, “Riemann-Finsler geometry with applications to information geometry,” Chinese Annals of Mathematics-Series B, vol. 27, no. 1, pp. 73–94, 2006.
• [8] W. Gangbo and R. J. McCann, “The geometry of optimal transportation,” Acta Mathematica, vol. 177, no. 2, pp. 113–161, 1996.
• [9] M. R. Grasselli and R. F. Streater, “On the uniqueness of the Chentsov metric in quantum information geometry,” Infinite Dimensional Analysis, Quantum Probability and Related Topics, vol. 4, no. 02, pp. 173–182, 2001.
• [10] S. Amari, “Information geometry of neural networks: An overview,” in Mathematics of Neural Networks, ser. Operations Research/Computer Science Interfaces Series, S. W. Ellacott, J. C. Mason, and I. J. Anderson, Eds.   Springer US, 1997, vol. 8.
• [11] ——, “Information geometry of neural learning and belief propagation,” in IEEE International Conference on Neural Information Processing, vol. 2, 2002, pp. 886–vol.
• [12] S. Amari and M. Yukawa, “Minkovskian gradient for sparse optimization,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 4, pp. 576–585, 2013.
• [13] H.-G. Beyer, “Convergence analysis of evolutionary algorithms that are based on the paradigm of information geometry,” Evolutionary Computation, vol. 22, no. 4, pp. 679–709, 2014.
• [14] E. de Jong and R. Pribić, “Design of radar grid cells with constant information distance,” in IEEE Radar Conference, 2014, pp. 1–5.
• [15] F. Barbaresco, “Innovative tools for radar signal processing based on Cartan’s geometry of SPD matrices & information geometry,” in IEEE Radar Conference, 2008, pp. 1–6.
• [16] M. Coutino, R. Pribić, and G. Leus, “Direction of arrival estimation based on information geometry,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 3066–3070.
• [17] K. Sun and S. Marchand-Maillet, “An information geometry of statistical manifold learning,” in International Conference on Machine Learning, 2014, pp. 1–9.
• [18] S. Amari, “Natural gradient works efficiently in learning,” Neural computation, vol. 10, no. 2, pp. 251–276, 1998.
• [19] G. Desjardins, K. Simonyan, R. Pascanu et al., “Natural neural networks,” in Advances in Neural Information Processing Systems, 2015, pp. 2071–2079.
• [20] N. L. Roux, P.-A. Manzagol, and Y. Bengio, “Topmoumoute online natural gradient algorithm,” in Advances in neural information processing systems, 2008, pp. 849–856.
• [21] R. D. Gill and B. Y. Levit, “Applications of the van Trees inequality: A Bayesian Cramér-Rao bound,” Bernoulli, pp. 59–79, 1995.
• [22] R. Prasad and C. R. Murthy, “Cramér-rao-type bounds for sparse bayesian learning,” IEEE Transactions on Signal Processing, vol. 61, no. 3, pp. 622–632, 2013.
• [23] K. V. Mishra and Y. C. Eldar, “Performance of time delay estimation in a cognitive radar,” in IEEE Int. Conf. Acoustics, Speech and Signal Process., 2017, pp. 3141–3145.
• [24] H. L. Van Trees, K. L. Bell, and Z. Tian, Detection Estimation and Modulation Theory, Part I: Detection, Estimation, and Filtering Theory, 2nd ed.   Wiley, 2013.
• [25] A. Renaux, P. Forster, P. Larzabal, C. D. Richmond, and A. Nehorai, “A fresh look at the Bayesian bounds of the Weiss-Weinstein family,” IEEE Transactions on Signal Processing, vol. 56, no. 11, pp. 5334–5352, 2008.
• [26] E. W. Barankin, “Locally best unbiased estimates,” The Annals of Mathematical Statistics, pp. 477–501, 1949.
• [27] L. Knockaert, “The Barankin bound and threshold behavior in frequency estimation,” IEEE Transactions on Signal Processing, vol. 45, no. 9, pp. 2398–2401, 1997.
• [28] S. Gallot, D. Hulin, and J. Lafontaine, Riemannian Geometry.   Springer, 2004.
• [29] J. Jost, Riemannian Geometry and Geometric Analysis.   Springer, 2005.
• [30] M. Spivak, A Comprehensive Introduction to Differential Geometry - Volume I.   Publish or Perish Inc., 2005.
• [31] S. Eguchi, “Geometry of minimum contrast,” Hiroshima Mathematical Journal, vol. 22, no. 3, pp. 631–647, 1992.
• [32] T. Groves and T. Rothenberg, “A note on the expected value of an inverse matrix,” Biometrika, vol. 56, pp. 690–691, 1969.