I Introduction
Information geometry is the study of statistical models from a Riemannian geometric perspective. Differential geometric methods were introduced to statistics by C. R. Rao in his seminal paper [1] and later formally developed by Cencov [2]. The information geometric concept of a manifold that describes parameterized probability distributions has garnered considerable interest in recent years. Its main advantages include structures in the space of probability distributions that are invariant to nonsingular transformations of parameters [3], robust estimation of covariance matrices [4], and the use of the Fisher information matrix (FIM) as a metric [5].
Information geometry has now transcended its initial statistical scope and expanded to several novel research areas including, but not limited to, Fisher-Rao Riemannian geometry [6], Finsler information geometry [7], optimal transport geometry [8], and quantum information geometry [9]. Many problems in science and engineering involve probability distributions, and information geometry has therefore served as a useful and rigorous analysis tool in applications such as neural networks [10, 11], optimization [12, 13], radar systems [14, 15], communications [16], and machine learning [17, 18]. More recently, several developments in deep learning [19, 20] that employ various approximations to the FIM to compute the gradient descent have incorporated information geometric concepts.

Information geometry bases the distance between parameterized probability distributions on the FIM [3]. In estimation theory, the well-known deterministic Cramér-Rao lower bound (CRLB) is the inverse of the FIM. Therefore, the results derived from information geometry are directly connected with the fundamentals of estimation theory. Nearly all prior works exploited this information geometric link to the CRLB in their analyses because the CRLB is the most widely used benchmark for evaluating the mean square error (MSE) performance of an estimator. However, the classical CRLB holds only if the estimator is unbiased. In practice, estimators are biased in many problems such as nonparametric regression [21], communication [22], and radar [23]. The above-mentioned information geometric framework ceases to be useful in these cases.
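As a concrete illustration of why the classical CRLB can fail for biased estimators, consider estimating the mean of a Gaussian from n i.i.d. samples. The following sketch (our own toy example, not from the references) uses an oracle shrinkage estimator, purely for illustration, whose MSE falls below the CRLB:

```python
import numpy as np

# Classical CRLB for the mean theta of N(theta, sigma^2) from n i.i.d. samples:
# FIM = n / sigma^2, so CRLB = sigma^2 / n.
sigma, n, theta = 1.0, 10, 0.2
crlb = sigma**2 / n

# Unbiased estimator (sample mean): MSE = variance = sigma^2 / n = CRLB.
mse_unbiased = sigma**2 / n

# Biased (oracle) shrinkage estimator a * sample_mean, 0 < a < 1:
# MSE(a) = a^2 * sigma^2 / n + (1 - a)^2 * theta^2   (variance + bias^2).
# The MSE-optimal factor below assumes theta is known, hence "oracle".
a = theta**2 / (theta**2 + sigma**2 / n)
mse_biased = a**2 * sigma**2 / n + (1 - a)**2 * theta**2

print(f"CRLB          : {crlb:.4f}")
print(f"MSE (unbiased): {mse_unbiased:.4f}")
print(f"MSE (biased)  : {mse_biased:.4f}")
```

The biased estimator's MSE is below the CRLB, so the classical bound cannot benchmark it; this is the gap the Bayesian CRLB and the framework of this paper address.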
Moreover, the classical CRLB is a tight bound only when the errors are small. It is well known that, in nonlinear estimation problems with finite-support parameters, for example time delay estimation in radar [23], the performance of the estimator is characterized by the presence of three distinct signal-to-noise-ratio (SNR) regions [24]. When the observations are large or the SNR is high (asymptotic region), the CRLB describes the MSE accurately. When the observations are few or the SNR is low, the information from the signal observations is insufficient and the estimator criterion is heavily corrupted by the noise. Here, the MSE is close to that obtained via only a priori information, that is, a quasi-uniform random variable on the parameter support. In between these two limiting cases lies the threshold region, where the signal observations are subject to ambiguities and the estimator MSE increases sharply due to the outlier effect. The CRLB is useful only in the asymptotic region and is not an accurate predictor of the MSE when the performance breaks down due to increased noise.
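This three-region behavior is easy to reproduce numerically. The sketch below (our own illustrative simulation, not taken from [23] or [24]) estimates the frequency of a single complex tone by a zero-padded periodogram peak search and compares the empirical MSE with the CRLB at a high and a low SNR:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative Monte Carlo sketch: ML-style frequency estimation of a single
# complex tone in white Gaussian noise via a zero-padded periodogram search.
N = 16                    # number of samples
f0 = 0.123                # true frequency (cycles/sample)
nfft = 8192               # fine frequency search grid
trials = 400
t = np.arange(N)
s = np.exp(2j * np.pi * f0 * t)
grid = np.arange(nfft) / nfft

def crlb(snr):
    # CRLB for the frequency of a single tone (unknown amplitude and phase)
    return 6.0 / ((2 * np.pi) ** 2 * snr * N * (N ** 2 - 1))

def empirical_mse(snr):
    sigma = 1.0 / np.sqrt(snr)                  # unit-amplitude signal
    err2 = np.empty(trials)
    for i in range(trials):
        w = sigma / np.sqrt(2) * (rng.standard_normal(N)
                                  + 1j * rng.standard_normal(N))
        fhat = grid[np.argmax(np.abs(np.fft.fft(s + w, nfft)))]
        err2[i] = ((fhat - f0 + 0.5) % 1.0 - 0.5) ** 2   # wrapped error
    return err2.mean()

mse_hi, crlb_hi = empirical_mse(100.0), crlb(100.0)   # 20 dB: asymptotic region
mse_lo, crlb_lo = empirical_mse(0.1), crlb(0.1)       # -10 dB: breakdown
print(f"20 dB : MSE={mse_hi:.3e}  CRLB={crlb_hi:.3e}")
print(f"-10 dB: MSE={mse_lo:.3e}  CRLB={crlb_lo:.3e}")
```

At 20 dB the empirical MSE hugs the CRLB; at -10 dB outliers from ambiguous periodogram peaks inflate the MSE orders of magnitude above it, which is exactly the regime where the CRLB stops being informative.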
In this paper, we propose an information geometric framework that addresses these drawbacks of the classical CRLB. The Bayesian CRLB [21] is typically used for assessing the quality of biased estimators. It is similar to the deterministic CRLB except that it assumes the parameters to be random with an a priori probability density function. We develop a general Riemannian metric that can be modified to link to both the Bayesian and deterministic CRLBs. To address the problem of the threshold effect, other bounds that are tighter than the CRLB have been developed (see e.g. [25] for an overview) to accurately identify the SNR thresholds that define the ambiguity region. In particular, the Barankin bound [26] is a fundamental statistical tool for understanding the threshold effect for unbiased estimators. In simple terms, the threshold effect can be understood as the region where the Barankin bound deviates from the CRLB [27]. In this paper, we show that our metric is also applicable to the Barankin bound. Hence, compared with previous works [3], our information geometric approach to minimum bounds on the MSE holds for the Bayesian CRLB, the deterministic CRLB, their vector equivalents, and the threshold effect through the Barankin bound.
The paper is organized as follows: In the next section, we provide a brief background on information geometry and describe the notation used in the later sections. Further, we explain the dual structure of manifolds because, in most applications, the underlying manifolds are dually flat. Here, we define a divergence function between two points in a manifold. In Section III, we establish the connection between the Riemannian metric and the Kullback-Leibler divergence for the Bayesian case. The manifold of all discrete probability distributions is dually flat and the Kullback-Leibler divergence plays a key role here. We also show that, under certain conditions, our approach yields the previous results from [3]. In Section IV, we state and prove our main result applicable to several other bounds before providing concluding remarks in Section V.

II Information Geometry: A Brief Background
An $n$-dimensional manifold is a Hausdorff and second countable topological space which is locally homeomorphic to the Euclidean space of dimension $n$ [28, 29, 30]. A Riemannian manifold is a real differentiable manifold in which the tangent space at each point is a finite-dimensional Hilbert space and is, therefore, equipped with an inner product. The collection of all these inner products is called a Riemannian metric.

In the information geometry framework, statistical models play the role of a manifold, and the Fisher information matrix and its various generalizations play the role of a Riemannian metric. Formally, by a statistical manifold we mean a parametric family of probability distributions $S = \{p_\theta : \theta \in \Theta\}$ with a "continuously varying" parameter space $\Theta$ (statistical model). The dimension of a statistical manifold is the dimension of its parameter space. For example, the family of normal distributions $\{N(\mu, \sigma^2) : \mu \in \mathbb{R},\ \sigma^2 > 0\}$ is a two-dimensional statistical manifold. The tangent space at a point $p$ of $S$ is a linear space that corresponds to a "local linearization" at that point; it is denoted by $T_p(S)$. The elements of $T_p(S)$ are called tangent vectors of $S$ at $p$. A Riemannian metric at a point $p$ of $S$ is an inner product defined for any pair of tangent vectors of $S$ at $p$.
In this paper, we restrict ourselves to statistical manifolds defined on a finite set $\mathcal{X}$. Let $\mathcal{P} := \mathcal{P}(\mathcal{X})$ denote the space of all probability distributions on $\mathcal{X}$, let $S \subset \mathcal{P}$ be a submanifold, and let $\theta = (\theta_1, \ldots, \theta_k)$ be a parameterization of $S$. Given a divergence function $D$ on $S$ (by a divergence, we mean a non-negative function $D$ defined on $S \times S$ such that $D(p\|q) = 0$ iff $p = q$), Eguchi [31] defines a Riemannian metric on $S$ by the matrix

$G^{(D)}(\theta) = \big[g_{i,j}^{(D)}(\theta)\big]$,

where

$g_{i,j}^{(D)}(\theta) := -D[\partial_i \| \partial_j] := -\left.\partial_i\, \partial_j'\, D(p_\theta \| p_{\theta'})\right|_{\theta' = \theta}$,

where $g_{i,j}^{(D)}$ is the element in the $i$th row and $j$th column of the matrix $G^{(D)}$, $\partial_i := \frac{\partial}{\partial \theta_i}$, $\partial_j' := \frac{\partial}{\partial \theta_j'}$, and dual affine connections $\nabla^{(D)}$ and $\nabla^{(D^*)}$, with connection coefficients described by the following Christoffel symbols

$\Gamma_{ij,k}^{(D)}(\theta) := -D[\partial_i \partial_j \| \partial_k]$

and

$\Gamma_{ij,k}^{(D^*)}(\theta) := -D[\partial_k \| \partial_i \partial_j]$,

such that $\nabla^{(D)}$ and $\nabla^{(D^*)}$ form a dualistic structure in the sense that

$\partial_k\, g_{i,j}^{(D)} = \Gamma_{ki,j}^{(D)} + \Gamma_{kj,i}^{(D^*)}, \qquad (1)$

where $D^*(p\|q) := D(q\|p)$.
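Eguchi's construction can be checked numerically on a simple one-parameter family: taking $D$ to be the KL divergence on the Bernoulli manifold, a finite-difference evaluation of the mixed second derivative recovers the Fisher information. The code below is an illustrative sketch with our own choice of family and step size:

```python
import numpy as np

def kl(t1, t2):
    # KL divergence between Bernoulli(t1) and Bernoulli(t2)
    return t1 * np.log(t1 / t2) + (1 - t1) * np.log((1 - t1) / (1 - t2))

def eguchi_metric(D, theta, h=1e-4):
    # g(theta) = -d/dtheta d/dtheta' D(p_theta || p_theta') at theta' = theta,
    # approximated by a central finite difference (scalar parameter case)
    return -(D(theta + h, theta + h) - D(theta + h, theta - h)
             - D(theta - h, theta + h) + D(theta - h, theta - h)) / (4 * h * h)

theta = 0.3
g = eguchi_metric(kl, theta)
fim = 1.0 / (theta * (1 - theta))   # Fisher information of Bernoulli(theta)
print(g, fim)
```

With the KL divergence as $D$, the induced metric coincides (up to numerical error) with the Fisher information, which is the key fact used in the next section.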
III Fisher Information Matrix for the Bayesian Case
Eguchi's theory in Section II can also be extended to the space of all positive measures on $\mathcal{X}$, that is, $\tilde{\mathcal{P}} := \tilde{\mathcal{P}}(\mathcal{X}) = \{\tilde{p} : \mathcal{X} \to (0, \infty)\}$. Let $S = \{p_\theta\}$ be a $k$-dimensional submanifold of $\mathcal{P}(\mathcal{X})$ and let

$\tilde{S} := \{\tilde{p}_\theta(x) = p_\theta(x)\, \lambda(\theta)\}, \qquad (2)$

where $\lambda$ is a probability distribution on the parameter space $\Theta$. Then $\tilde{S}$ is a $k$-dimensional submanifold of $\tilde{\mathcal{P}}$. For $\tilde{p}, \tilde{q} \in \tilde{\mathcal{P}}$, the Kullback-Leibler divergence (KL-divergence) between $\tilde{p}$ and $\tilde{q}$ is given by

$I(\tilde{p} \| \tilde{q}) = \sum_x \tilde{p}(x) \log \frac{\tilde{p}(x)}{\tilde{q}(x)} - \sum_x \tilde{p}(x) + \sum_x \tilde{q}(x). \qquad (3)$

We define a Riemannian metric $G^{(I)}(\theta) = [g_{i,j}^{(I)}(\theta)]$ on $\tilde{S}$ by

$g_{i,j}^{(I)}(\theta) := -I[\partial_i \| \partial_j] \qquad (4)$

$\phantom{g_{i,j}^{(I)}(\theta)} = \sum_x \tilde{p}_\theta(x)\, \partial_i \log \tilde{p}_\theta(x)\, \partial_j \log \tilde{p}_\theta(x), \qquad (5)$

where

$I_{i,j}(\theta) := \sum_x p_\theta(x)\, \partial_i \log p_\theta(x)\, \partial_j \log p_\theta(x) \qquad (6)$

and

$J_{i,j}(\theta) := \partial_i \log \lambda(\theta)\, \partial_j \log \lambda(\theta). \qquad (7)$

Let $I(\theta) := [I_{i,j}(\theta)]$ and $J(\theta) := [J_{i,j}(\theta)]$. Then

$G^{(I)}(\theta) = \lambda(\theta)\, [I(\theta) + J(\theta)]. \qquad (8)$
Notice that $I(\theta)$ is the usual Fisher information matrix. Also observe that $\tilde{\mathcal{P}}$ is a subset of $\mathbb{R}^{d}_{+}$, where $d := |\mathcal{X}|$. The tangent space at every point of $\tilde{\mathcal{P}}$ is $\mathbb{R}^{d}$; that is, $T_{\tilde{p}}(\tilde{\mathcal{P}}) = \mathbb{R}^{d}$. We denote a tangent vector (that is, an element of $T_{\tilde{p}}(\tilde{\mathcal{P}})$) by $A$. The manifold $\tilde{\mathcal{P}}$ can be recognized by its homeomorphic image under the mapping $\tilde{p} \mapsto \log \tilde{p}$. Under this mapping, the tangent vector $A \in T_{\tilde{p}}(\tilde{\mathcal{P}})$ can be represented by $A^{(e)}$, which is defined by $A^{(e)}(x) := A(x)/\tilde{p}(x)$, and we define $T_{\tilde{p}}^{(e)}(\tilde{\mathcal{P}}) := \{A^{(e)} : A \in T_{\tilde{p}}(\tilde{\mathcal{P}})\}$.
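In this Bayesian setting, the information thus combines a data term (the usual FIM) and a prior term. As an illustrative sketch of what such a combined bound buys, consider the standard scalar Gaussian-Gaussian example (our own choice, not from the paper): the van Trees Bayesian CRLB is the inverse of the sum of data and prior informations and is attained by the posterior mean:

```python
import numpy as np

rng = np.random.default_rng(1)

# Scalar Gaussian model: theta ~ N(0, s2p) (prior); x_i | theta ~ N(theta, s2).
# Bayesian information = data Fisher information + prior Fisher information:
#   J_B = n / s2 + 1 / s2p, and the Bayesian CRLB is 1 / J_B (van Trees).
n, s2, s2p = 5, 1.0, 0.5
bcrlb = 1.0 / (n / s2 + 1.0 / s2p)

# Monte Carlo MSE of the posterior mean (MMSE estimator), which attains it.
trials = 200_000
theta = rng.normal(0.0, np.sqrt(s2p), trials)
xbar = theta + rng.normal(0.0, np.sqrt(s2 / n), trials)
w = (n / s2) / (n / s2 + 1.0 / s2p)      # posterior-mean weight on the data
mse = np.mean((w * xbar - theta) ** 2)

print(f"Bayesian CRLB: {bcrlb:.5f}, posterior-mean MSE: {mse:.5f}")
```

The posterior mean is biased for every fixed $\theta$, yet its Bayesian MSE meets the bound, illustrating why the Bayesian CRLB is the right benchmark for biased estimators.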
For the natural basis of a coordinate system $\theta$, $(\partial_i)_{\tilde{p}_\theta} = \partial_i \tilde{p}_\theta$ and $(\partial_i)_{\tilde{p}_\theta}^{(e)} = \partial_i \log \tilde{p}_\theta$.

With these notations, for any two tangent vectors $A, B \in T_{\tilde{p}}(\tilde{\mathcal{P}})$, the Fisher metric in (5) can be written as

$\langle A, B \rangle_{\tilde{p}} := \sum_x \tilde{p}(x)\, A^{(e)}(x)\, B^{(e)}(x).$
Let $\tilde{S}$ be a submanifold of $\tilde{\mathcal{P}}$ of the form in (2), together with the metric in (5). Let $T_p^*(\tilde{S})$ be the dual space (cotangent space) of the tangent space $T_p(\tilde{S})$, and let us consider, for each $Y \in T_p(\tilde{S})$, the element $\omega_Y \in T_p^*(\tilde{S})$ which maps $X$ to $\langle X, Y \rangle$. The correspondence $Y \mapsto \omega_Y$ is a linear map between $T_p(\tilde{S})$ and $T_p^*(\tilde{S})$. An inner product and a norm on $T_p^*(\tilde{S})$ are naturally inherited from $T_p(\tilde{S})$ by

$\langle \omega_X, \omega_Y \rangle_p := \langle X, Y \rangle_p$

and

$\|\omega_X\|_p := \|X\|_p.$

Now, for a (smooth) real function $f$ on $\tilde{S}$, the differential of $f$ at $p$, $(df)_p$, is a member of $T_p^*(\tilde{S})$ which maps $X$ to $X(f)$. The gradient of $f$ at $p$ is the tangent vector $\mathrm{grad}\, f$ corresponding to $(df)_p$; hence it satisfies

$(df)_p(X) = X(f) = \langle \mathrm{grad}\, f, X \rangle \qquad (9)$

and

$\|(df)_p\|^2 = \|\mathrm{grad}\, f\|^2. \qquad (10)$
Since $\mathrm{grad}\, f$ is a tangent vector, we can write

$\mathrm{grad}\, f = \sum_i h_i \partial_i \qquad (11)$

for some scalars $h_i$. Applying (9) with $X = \partial_j$, for each $j$, and using (11), we get

$\partial_j f = \Big\langle \sum_i h_i \partial_i,\ \partial_j \Big\rangle = \sum_i h_i\, g_{i,j}.$

From this, we have

$h_i = \sum_j g^{i,j}\, \partial_j f,$

and so

$\mathrm{grad}\, f = \sum_{i,j} g^{i,j}\, (\partial_j f)\, \partial_i. \qquad (12)$

From (9), (10), and (12), we get

$\|(df)_p\|^2 = \sum_{i,j} g^{i,j}\, (\partial_i f)(\partial_j f), \qquad (13)$

where $g^{i,j}$ is the $(i,j)$th entry of the inverse of the matrix $[g_{i,j}]$.
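The gradient with respect to the Fisher metric in (12)-(13) is computed through the inverse metric matrix; this is exactly the "natural gradient" of [18]. A minimal numeric sketch on the Gaussian $(\mu, \sigma)$ manifold, whose Fisher matrix $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$ we take as a known standard fact (the function $f$ below is an arbitrary illustrative choice):

```python
import numpy as np

# Gradient with respect to the Fisher metric ("natural gradient"):
#   grad f = G(theta)^{-1} @ (Euclidean gradient)            -- cf. (12)
#   ||df||^2 = (Euclidean gradient)^T G(theta)^{-1} (grad)   -- cf. (13)
# Sketch on the Gaussian manifold theta = (mu, sigma), where the Fisher
# information matrix is diag(1/sigma^2, 2/sigma^2).
mu, sigma = 1.0, 2.0
G = np.diag([1.0 / sigma**2, 2.0 / sigma**2])

# Illustrative function f(mu, sigma) = mu^2 + sigma; Euclidean gradient:
df = np.array([2.0 * mu, 1.0])

nat_grad = np.linalg.solve(G, df)         # metric gradient, eq. (12)
df_norm2 = df @ np.linalg.solve(G, df)    # squared norm of df, eq. (13)

print(nat_grad, df_norm2)
```

Rescaling the raw gradient by the inverse metric is what makes the descent direction invariant to reparameterization, which is the practical payoff of (12)-(13).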
The above and the following results are indeed an extension of [3, Sec. 2.5] to $\tilde{\mathcal{P}}$.
Theorem 1 ([3])
Let $f$ be any mapping from $\tilde{\mathcal{P}}$ to $\mathbb{R}$ (that is, a vector in $\mathbb{R}^{d}$). Let $E[f]$ be the mapping $\tilde{p} \mapsto \sum_x \tilde{p}(x) f(x)$. We then have

$\|(dE[f])_{\tilde{p}}\|^2 = \sum_x \tilde{p}(x)\, f(x)^2. \qquad (14)$

Proof:

For any tangent vector $X \in T_{\tilde{p}}(\tilde{\mathcal{P}})$,

$(dE[f])_{\tilde{p}}(X) = \sum_x X(x)\, f(x) \qquad (15)$

$\phantom{(dE[f])_{\tilde{p}}(X)} = \sum_x \tilde{p}(x)\, X^{(e)}(x)\, f(x). \qquad (16)$

Since $(dE[f])_{\tilde{p}} \in T_{\tilde{p}}^*(\tilde{\mathcal{P}})$, there exists $Y$ such that $(dE[f])_{\tilde{p}} = \omega_Y$, and from (16), $Y^{(e)} = f$. Hence we see that

$\|(dE[f])_{\tilde{p}}\|^2 = \|Y\|^2 = \sum_x \tilde{p}(x)\, f(x)^2.$
Corollary 2 ([3])
If $\tilde{S}$ is a submanifold of $\tilde{\mathcal{P}}$, then

$\big\|(dE[f]|_{\tilde{S}})_{\tilde{p}}\big\|^2 \le \sum_x \tilde{p}(x)\, f(x)^2, \qquad (17)$

with equality iff $f \in T_{\tilde{p}}^{(e)}(\tilde{S})$.

Proof:

Since $\mathrm{grad}(E[f]|_{\tilde{S}})$ is the orthogonal projection of $\mathrm{grad}\, E[f]$ onto $T_{\tilde{p}}(\tilde{S})$, the result follows from the theorem.
IV Derivation of Error Bounds
We state our main result in the following theorem.
Theorem 3
Let $S$ and $\tilde{S}$ be as in (2). Let $\hat{\theta}$ be an estimator of $\theta$. Then:

Bayesian Cramér-Rao:

$E_\lambda\big[(\hat{\theta} - \theta)(\hat{\theta} - \theta)^T\big] \ge \Big\{E_\lambda\big[I(\theta) + J(\theta)\big]\Big\}^{-1}, \qquad (18)$

where $E_\lambda$ denotes expectation with respect to the joint distribution of the observations and $\theta$.

Deterministic Cramér-Rao (unbiased): If $\hat{\theta}$ is an unbiased estimator of $\theta$, then

$E_\theta\big[(\hat{\theta} - \theta)(\hat{\theta} - \theta)^T\big] \ge I(\theta)^{-1}. \qquad (19)$

Deterministic Cramér-Rao (biased): For any estimator $\hat{\theta}$ of $\theta$,

$E_\theta\big[(\hat{\theta} - \theta)(\hat{\theta} - \theta)^T\big] \ge A_\theta\, I(\theta)^{-1} A_\theta^T + b(\theta)\, b(\theta)^T,$

where $b(\theta) := E_\theta[\hat{\theta}] - \theta$ is the bias and $A_\theta$ is the matrix whose $(i,j)$th entry is $1 + \frac{\partial b_i(\theta)}{\partial \theta_j}$ if $i = j$ and is $\frac{\partial b_i(\theta)}{\partial \theta_j}$ if $i \neq j$.

Barankin Bound: (Scalar case) If $\hat{\theta}$ is an unbiased estimator of $\theta$, then

$\mathrm{Var}_\theta(\hat{\theta}) \ge \sup \frac{\Big[\sum_{l=1}^{n} a_l\, (\theta_l - \theta)\Big]^2}{E_\theta\Big[\Big(\sum_{l=1}^{n} a_l\, L_{\theta_l}(X)\Big)^2\Big]}, \qquad (20)$

where $L_{\theta_l}(x) := p_{\theta_l}(x)/p_\theta(x)$ and the supremum is over all $a_1, \ldots, a_n \in \mathbb{R}$, $\theta_1, \ldots, \theta_n \in \Theta$, and $n \in \mathbb{N}$.
Proof:

This follows from (21) by taking .
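The threshold behavior captured by (20) can be sketched numerically with a single test point, which reduces the Barankin bound to the Chapman-Robbins form $\sup_h h^2 / (E[L^2] - 1)$. The toy model below (frequency of a fully known tone in complex white Gaussian noise, with our own grid choices) is an illustrative assumption, not the paper's derivation:

```python
import numpy as np

# Chapman-Robbins (single-test-point Barankin) bound for the frequency of a
# fully known unit-amplitude tone in complex white Gaussian noise (variance s2).
N, f0 = 16, 0.123
t = np.arange(N)
sum_t2 = float(np.sum(t ** 2))

def Q(h, s2):
    # Q(h) = 2 ||s(f0+h) - s(f0)||^2 / s2, so that E[L^2] = exp(Q)
    d = np.exp(2j * np.pi * (f0 + h) * t) - np.exp(2j * np.pi * f0 * t)
    return 2.0 * np.sum(np.abs(d) ** 2) / s2

def barankin(s2, hs=np.linspace(1e-4, 0.5, 4000)):
    # sup_h h^2 / (E[L^2] - 1); cap Q to avoid overflow (large Q gives ~0 anyway)
    return max(h * h / np.expm1(min(Q(h, s2), 700.0)) for h in hs)

def crlb(s2):
    # CRLB for this fully known signal model (only the frequency is unknown)
    return s2 / (2.0 * (2 * np.pi) ** 2 * sum_t2)

bb_hi, c_hi = barankin(0.1), crlb(0.1)     # low noise: bound ~ CRLB
bb_lo, c_lo = barankin(10.0), crlb(10.0)   # high noise: bound >> CRLB
print(f"s2=0.1 : Barankin={bb_hi:.3e}  CRLB={c_hi:.3e}")
print(f"s2=10  : Barankin={bb_lo:.3e}  CRLB={c_lo:.3e}")
```

At low noise the supremum is achieved as $h \to 0$ and the bound collapses to the CRLB; at high noise, large test offsets $h$ with small signal separation dominate and the bound pulls well above the CRLB, which is the onset of the threshold region described in Section I.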
V Conclusion
We have shown that our Theorem 3 provides a general information geometric characterization of statistical manifolds, linking them to the Bayesian CRLB for vector parameters; the extension to estimators of measurable functions of the parameter is straightforward. We exploited the general definition of the Kullback-Leibler divergence for probability densities that are not normalized. This is an improvement over Amari's work [3] on information geometry, which only dealt with the deterministic CRLB of scalar parameters. Further, we proposed an approach to arrive at the Barankin bound, thereby shedding light on the relation between the threshold effect and information geometry. Both of our improvements enable the use of information geometric approaches in the critical scenarios of biased estimators and low SNR. This is especially useful in the analyses of many practical problems such as radar and communication. In future investigations, we intend to explore these methods further, especially in the context of the threshold effect.
References
 [1] C. R. Rao, “Information and the accuracy attainable in the estimation of statistical parameters,” Bulletin of the Calcutta Mathematical Society, vol. 37, pp. 81–91, 1945.
 [2] N. N. Cencov, Statistical decision rules and optimal inference (Translations of mathematical monographs). American Mathematical Society, 2000, no. 53.
 [3] S. Amari and H. Nagaoka, Methods of information geometry. American Mathematical Society, Oxford University Press, 2000, vol. 191.
 [4] B. Balaji, F. Barbaresco, and A. Decurninge, “Information geometry and estimation of Toeplitz covariance matrices,” in IEEE Radar Conference, 2014, pp. 1–4.
 [5] F. Nielsen, “Cramér-Rao lower bound and information geometry,” arXiv preprint arXiv:1301.3578, 2013.
 [6] S. J. Maybank, S. Ieng, and R. Benosman, “A Fisher-Rao metric for paracatadioptric images of lines,” International Journal of Computer Vision, vol. 99, no. 2, pp. 147–165, 2012.
 [7] Z. Shen, “Riemann-Finsler geometry with applications to information geometry,” Chinese Annals of Mathematics, Series B, vol. 27, no. 1, pp. 73–94, 2006.
 [8] W. Gangbo and R. J. McCann, “The geometry of optimal transportation,” Acta Mathematica, vol. 177, no. 2, pp. 113–161, 1996.
 [9] M. R. Grasselli and R. F. Streater, “On the uniqueness of the Chentsov metric in quantum information geometry,” Infinite Dimensional Analysis, Quantum Probability and Related Topics, vol. 4, no. 02, pp. 173–182, 2001.
 [10] S. Amari, “Information geometry of neural networks: An overview,” in Mathematics of Neural Networks, ser. Operations Research/Computer Science Interfaces Series, S. W. Ellacott, J. C. Mason, and I. J. Anderson, Eds. Springer US, 1997, vol. 8.
 [11] ——, “Information geometry of neural learning and belief propagation,” in IEEE International Conference on Neural Information Processing, vol. 2, 2002.
 [12] S. Amari and M. Yukawa, “Minkovskian gradient for sparse optimization,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 4, pp. 576–585, 2013.

 [13] H.-G. Beyer, “Convergence analysis of evolutionary algorithms that are based on the paradigm of information geometry,” Evolutionary Computation, vol. 22, no. 4, pp. 679–709, 2014.
 [14] E. de Jong and R. Pribić, “Design of radar grid cells with constant information distance,” in IEEE Radar Conference, 2014, pp. 1–5.
 [15] F. Barbaresco, “Innovative tools for radar signal processing based on Cartan’s geometry of SPD matrices & information geometry,” in IEEE Radar Conference, 2008, pp. 1–6.
 [16] M. Coutino, R. Pribić, and G. Leus, “Direction of arrival estimation based on information geometry,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 3066–3070.
 [17] K. Sun and S. MarchandMaillet, “An information geometry of statistical manifold learning,” in International Conference on Machine Learning, 2014, pp. 1–9.
 [18] S. Amari, “Natural gradient works efficiently in learning,” Neural computation, vol. 10, no. 2, pp. 251–276, 1998.
 [19] G. Desjardins, K. Simonyan, R. Pascanu et al., “Natural neural networks,” in Advances in Neural Information Processing Systems, 2015, pp. 2071–2079.
 [20] N. L. Roux, P.A. Manzagol, and Y. Bengio, “Topmoumoute online natural gradient algorithm,” in Advances in neural information processing systems, 2008, pp. 849–856.
 [21] R. D. Gill and B. Y. Levit, “Applications of the van Trees inequality: A Bayesian CramérRao bound,” Bernoulli, pp. 59–79, 1995.
 [22] R. Prasad and C. R. Murthy, “Cramér-Rao-type bounds for sparse Bayesian learning,” IEEE Transactions on Signal Processing, vol. 61, no. 3, pp. 622–632, 2013.
 [23] K. V. Mishra and Y. C. Eldar, “Performance of time delay estimation in a cognitive radar,” in IEEE Int. Conf. Acoustics, Speech and Signal Process., 2017, pp. 3141–3145.
 [24] H. L. Van Trees, K. L. Bell, and Z. Tian, Detection Estimation and Modulation Theory, Part I: Detection, Estimation, and Filtering Theory, 2nd ed. Wiley, 2013.
 [25] A. Renaux, P. Forster, P. Larzabal, C. D. Richmond, and A. Nehorai, “A fresh look at the Bayesian bounds of the WeissWeinstein family,” IEEE Transactions on Signal Processing, vol. 56, no. 11, pp. 5334–5352, 2008.
 [26] E. W. Barankin, “Locally best unbiased estimates,” The Annals of Mathematical Statistics, pp. 477–501, 1949.
 [27] L. Knockaert, “The Barankin bound and threshold behavior in frequency estimation,” IEEE Transactions on Signal Processing, vol. 45, no. 9, pp. 2398–2401, 1997.
 [28] S. Gallot, D. Hulin, and J. Lafontaine, Riemannian Geometry. Springer, 2004.
 [29] J. Jost, Riemannian Geometry and Geometric Analysis. Springer, 2005.
 [30] M. Spivak, A Comprehensive Introduction to Differential Geometry  Volume I. Publish or Perish Inc., 2005.
 [31] S. Eguchi, “Geometry of minimum contrast,” Hiroshima Mathematical Journal, vol. 22, no. 3, pp. 631–647, 1992.
 [32] T. Groves and T. Rothenberg, “A note on the expected value of an inverse matrix,” Biometrika, vol. 56, pp. 690–691, 1969.