1 Introduction
In this communication, we begin with a variation on the $\chi^\beta$-divergence, which introduces an averaging with respect to an arbitrary distribution. We give some properties of this divergence, including information monotonicity and a data processing inequality. Then, in the context of parameter estimation, we show that it induces an extension of the standard Fisher information, which reduces to the standard Fisher information in a particular case, and leads to an extension of the Cramér-Rao inequality for the estimation of a parameter. We show how these results can be expressed in the multidimensional case, with general norms. We also show how the classical notion of Fisher information matrix can be extended in this context, although only in a non-explicit form.
In the case of a translation parameter and using the concept of escort distributions, the general Cramér-Rao inequality leads to an inequality for distributions which is saturated by generalized $q$-Gaussian distributions. These generalized $q$-Gaussians are important in several areas of physics and mathematics. They are known to maximize the $q$-entropies subject to a moment constraint. The Cramér-Rao inequality shows that the generalized $q$-Gaussians also minimize the generalized Fisher information among distributions with a fixed moment. In information theory, the de Bruijn identity links the Fisher information and the derivative of the entropy. We show that this identity can be extended to generalized versions of the entropy and Fisher information. More precisely, the generalized Fisher information naturally pops up in the expression of the derivative of the entropy. Finally, we give an extended version of the Weyl-Heisenberg uncertainty relation as a consequence of the generalized multidimensional Cramér-Rao inequality. Due to the lack of space, we will omit or only sketch the proofs, but will refer to the literature whenever possible.
2 Context
Let $P$, $Q$ and $G$ be three probability distributions, with probability densities $p$, $q$ and $g$ defined on a subset $X$ of $\mathbb{R}^n$, and let $\theta\in\Theta\subseteq\mathbb{R}^k$ denote a parameter of these densities. We will deal with a measure of divergence between two probability distributions, say $P$ and $Q$, and we will also be interested in the estimation of the vector $\theta$, with $\hat{\theta}(x)$ the corresponding estimator. If $\|\cdot\|$ is an arbitrary norm on $\mathbb{R}^k$, its dual norm $\|\cdot\|_*$ is defined by
$$ \|y\|_* = \sup_{\|x\|\leq 1} \langle x, y\rangle, $$
where $\langle x, y\rangle$ is the standard inner product. In particular, if $\|\cdot\|$ is an $L^\alpha$ norm, then $\|\cdot\|_*$ is an $L^\beta$ norm, with $\frac{1}{\alpha}+\frac{1}{\beta}=1$.
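As a quick numerical illustration of this duality between $L^\alpha$ and $L^\beta$ norms (a sketch, not part of the original text; the example vector and the value of $\alpha$ are arbitrary choices), one can check that the vector with components proportional to $\mathrm{sign}(y_i)|y_i|^{\beta-1}$ attains the supremum in the definition of the dual norm:

```python
import numpy as np

alpha = 3.0
beta = alpha / (alpha - 1.0)          # Hölder conjugate: 1/alpha + 1/beta = 1

y = np.array([1.0, -2.0, 0.5])

# Candidate maximizer of <x, y> over the unit alpha-ball:
# components proportional to sign(y_i) |y_i|^(beta-1), normalized in alpha-norm.
x = np.sign(y) * np.abs(y) ** (beta - 1.0)
x /= np.linalg.norm(x, ord=alpha)

inner = np.dot(x, y)
dual_norm = np.linalg.norm(y, ord=beta)
print(inner, dual_norm)               # the two values coincide (up to rounding)
```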
A very basic idea in this work is that it may be useful to vary the probability density function with respect to which the expectations are computed. Typically, in the context of estimation, if the error is $\hat{\theta}(x)-\theta$, then the bias can be evaluated as $E_{p_\theta}\!\left[\hat{\theta}(x)-\theta\right]$, while a general moment of an arbitrary norm of the error can be computed with respect to another probability distribution, say $g_\theta$, as in $E_{g_\theta}\!\left[\|\hat{\theta}(x)-\theta\|^\alpha\right]$. The two distributions $f$ and $g$ can be chosen quite arbitrarily. However, one can also build $g$ as a transformation of $f$ that highlights, or on the contrary screens out, some characteristics of $f$. For instance, $g$ can be a weighted version of $f$, or a quantized version such as $g(x)\propto f(\lfloor x\rfloor)$, where $\lfloor\cdot\rfloor$ denotes the integer part. Another important special case is when $g$ is defined as the escort distribution of order $q$ of $f$, where $q$ plays the role of a tuning parameter: $f$ and $g$ are a pair of escort distributions, which are defined as follows:
$$ g(x) = \frac{f(x)^q}{\int f(u)^q\,du}, \qquad (1) $$
where $q$ is a positive parameter, provided of course that the involved integrals are finite. These escort distributions are an essential ingredient in the nonextensive thermostatistics context. Actually, the escort distributions have been introduced as an operational tool in the context of multifractals, cf. [8], [4], with interesting connections with the standard thermodynamics. A discussion of their geometric properties can be found in [1, 14]. Escort distributions also prove useful in source coding, as noticed in [5].
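As a minimal numerical sketch (not part of the original text; the Gaussian example and the grid are arbitrary choices), the escort transformation (1) can be computed on a grid; values $q>1$ sharpen the distribution around its modes while $q<1$ flatten it:

```python
import numpy as np

def escort(f_vals, q, dx):
    """Escort distribution of order q of a density sampled on a regular grid."""
    fq = f_vals ** q
    return fq / (np.sum(fq) * dx)     # normalization: integral of f^q

x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]
f = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)   # standard Gaussian density

for q in (0.5, 1.0, 2.0):
    g = escort(f, q, dx)
    # variance of the escort: q > 1 concentrates the mass, q < 1 spreads it
    var_g = np.sum(x**2 * g) * dx
    print(q, var_g)
```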
Finally, we will see that in our results, the generalized $q$-Gaussians play an important role. These generalized $q$-Gaussians appear in statistical physics, where they are the maximum entropy distributions of the nonextensive thermostatistics [16]. The generalized $q$-Gaussian distributions, which reduce to the standard Gaussian in a particular case, define a versatile family that can describe problems with compact support as well as problems with heavy-tailed distributions. They are also analytical solutions of actual physical problems, see e.g. [13], [15], and are sometimes known as Barenblatt-Pattle functions, following their identification by Barenblatt and Pattle. We shall also mention that the generalized $q$-Gaussian distributions appear in other fields, namely as the solutions of nonlinear diffusion equations, or as the distributions that saturate some sharp inequalities in functional analysis [10], [9].
3 The modified $\chi^\beta$-divergence
The results presented in this section build on a beautiful, but unfortunately overlooked, work by Vajda [17]. In this work, Vajda presented and characterized an extension of the Fisher information popping up as a limit case of a $\chi^\alpha$-divergence (for consistency with our notations in previous papers, we will use here the superscript $\beta$ instead of $\alpha$ as in Vajda's paper). Here, we simply take a step further along this route.
The $\chi^\beta$-divergence between two probability distributions $P_1$ and $P_2$, with densities $p_1$ and $p_2$, is defined by
$$ \chi^\beta(P_1\|P_2) = \int \left|\frac{p_1(x)}{p_2(x)}-1\right|^\beta p_2(x)\,dx, \qquad (2) $$
with $\beta\geq 1$. In the case $\beta=2$ and a parametric density $p_\theta$, it is known that the Fisher information of $\theta$ is nothing but
$$ I(\theta) = \lim_{\theta'\to\theta} \frac{\chi^2\!\left(P_{\theta'}\|P_\theta\right)}{(\theta'-\theta)^2}. $$
In [17], Vajda extended this to any $\beta\geq 1$ and defined a generalized Fisher information as
$$ I_\beta(\theta) = \lim_{\theta'\to\theta} \frac{\chi^\beta\!\left(P_{\theta'}\|P_\theta\right)}{|\theta'-\theta|^\beta} = E_{p_\theta}\!\left[\left|\frac{\partial}{\partial\theta}\ln p_\theta(x)\right|^\beta\right], $$
assuming that $p_\theta$ is differentiable with respect to $\theta$. Let us consider again a $\chi^\beta$-divergence as in (2), but modified in order to involve a third distribution $G$, with density $g$:
$$ \chi^\beta_g(P_1\|P_2) = \int \left|\frac{p_1(x)-p_2(x)}{g(x)}\right|^\beta g(x)\,dx. $$
Obviously, this formula includes the standard divergence, which is recovered for $g=p_2$. For $\theta\in\mathbb{R}^k$, let us denote by $\partial_i$ the partial derivative with respect to the $i$-th component, and let $\epsilon_i=\epsilon e_i$ be a vector that increments this component. Then, we have
$$ \lim_{\epsilon\to 0} \frac{\chi^\beta_{g_\theta}\!\left(P_{\theta+\epsilon_i}\|P_\theta\right)}{|\epsilon|^\beta} = \int \left|\frac{\partial_i p_\theta(x)}{g_\theta(x)}\right|^\beta g_\theta(x)\,dx. $$
Doing this for all components and summing the resulting terms, we finally arrive at the following definition
$$ I_\beta[p_\theta;g_\theta] = \int \left\|\frac{\nabla_\theta p_\theta(x)}{g_\theta(x)}\right\|_\beta^\beta g_\theta(x)\,dx = E_{g_\theta}\!\left[\left\|\frac{\nabla_\theta p_\theta(x)}{g_\theta(x)}\right\|_\beta^\beta\right], \qquad (3) $$
where $\nabla_\theta$ denotes the gradient with respect to $\theta$ and $\|\cdot\|_\beta$ is the $\ell_\beta$-norm. A version involving a general norm instead of the $\ell_\beta$-norm is given in [7]. Vajda's generalized Fisher information [17] corresponds to the scalar case and $g_\theta=p_\theta$. We will see that this generalized Fisher information, which includes previous definitions as particular cases, is involved in a generalized Cramér-Rao inequality for parameter estimation.
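A small numerical sketch (not in the original; the Gaussian location family, the escort choice, the exponents and the grid are arbitrary, and the expressions follow the reconstructed forms of the modified divergence and of (3) above) checking that the modified divergence between two nearby members of a location family, divided by $|\epsilon|^\beta$, approaches the generalized Fisher information:

```python
import numpy as np

beta = 1.5
q = 0.8                                  # escort order used for g
x = np.linspace(-12.0, 12.0, 20001)
dx = x[1] - x[0]

def gauss(x, mean, var=1.0):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

theta = 0.0
p = gauss(x, theta)
g = p ** q / (np.sum(p ** q) * dx)       # escort of order q of p

# generalized Fisher information (3), scalar case, location parameter:
# d p_theta / d theta = - d p / d x, computed here by finite differences
dp_dtheta = -np.gradient(p, dx)
I_beta = np.sum(np.abs(dp_dtheta / g) ** beta * g) * dx

for eps in (1e-1, 1e-2, 1e-3):
    p_shift = gauss(x, theta + eps)
    chi_mod = np.sum(np.abs((p_shift - p) / g) ** beta * g) * dx
    print(eps, chi_mod / eps ** beta, I_beta)   # ratio tends to I_beta
```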
The modified $\chi^\beta$-divergence enjoys some important properties:
Property 1
The modified $\chi^\beta$-divergence has the information monotonicity property. This means that coarse-graining the data leads to a loss of information. If $\bar{p}_1$, $\bar{p}_2$ and $\bar{g}$ denote the probability densities after coarse-graining, then we have $\chi^\beta_{\bar{g}}(\bar{P}_1\|\bar{P}_2)\leq\chi^\beta_g(P_1\|P_2)$. A proof of this result can be obtained following the lines in [2]. A consequence of this result is a data processing inequality: if $Y=T(X)$ and if $p_{Y,1}$, $p_{Y,2}$ and $g_Y$ denote the densities after this transformation, then
$$ \chi^\beta_{g_Y}\!\left(P_{Y,1}\|P_{Y,2}\right) \leq \chi^\beta_g\!\left(P_1\|P_2\right), $$
with equality if the transformation is invertible. It must be mentioned here that this also yields an important data processing inequality for the generalized Fisher information: $I_\beta[p_{Y;\theta};g_{Y;\theta}] \leq I_\beta[p_\theta;g_\theta]$.
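A toy numerical illustration of information monotonicity (a sketch under the reconstructed form of the modified divergence, not part of the original; the discrete distributions and the pairwise merging of bins are arbitrary choices):

```python
import numpy as np

def chi_mod(p1, p2, g, beta):
    """Modified chi^beta divergence for discrete distributions (reconstructed form)."""
    return np.sum(np.abs(p1 - p2) ** beta / g ** (beta - 1.0))

def coarse_grain(p):
    """Merge adjacent bins two by two (a simple coarse-graining)."""
    return p.reshape(-1, 2).sum(axis=1)

rng = np.random.default_rng(0)
p1 = rng.random(8); p1 /= p1.sum()
p2 = rng.random(8); p2 /= p2.sum()
g  = rng.random(8); g  /= g.sum()

beta = 1.7
before = chi_mod(p1, p2, g, beta)
after = chi_mod(coarse_grain(p1), coarse_grain(p2), coarse_grain(g), beta)
print(after <= before)                   # True: coarse-graining loses information
```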
Property 2
Matrix Fisher data processing inequality. Consider the quadratic case, i.e. $\beta=2$, and an increment $d\theta$ on $\theta$. Assuming that the partial derivatives of $p_\theta$ with respect to the components of $\theta$ exist and are absolutely integrable, a Taylor expansion about $\theta$ gives $p_{\theta+d\theta}(x)-p_\theta(x)\approx d\theta^T\nabla_\theta p_\theta(x)$. Hence, to within second order terms,
$$ \chi^2_{g_\theta}\!\left(P_{\theta+d\theta}\|P_\theta\right) \approx d\theta^T J_g(\theta)\,d\theta, \qquad J_g(\theta)=E_{g_\theta}\!\left[\psi(x;\theta)\,\psi(x;\theta)^T\right], $$
where $J_g(\theta)$ is a Fisher information matrix computed with respect to $g_\theta$ and $\psi(x;\theta)=\nabla_\theta p_\theta(x)/g_\theta(x)$ is a generalized score function. By information monotonicity, $\chi^2_{g_Y}\!\left(P_{Y,\theta+d\theta}\|P_{Y,\theta}\right)\leq\chi^2_{g_\theta}\!\left(P_{\theta+d\theta}\|P_\theta\right)$, and therefore we get that $J_{g,Y}(\theta)\preceq J_g(\theta)$ (the difference of the two matrices is positive semi-definite), which is a data processing inequality for Fisher information matrices.
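A short numerical sketch (not in the original; the two-dimensional Gaussian location family, the escort order and the grid are arbitrary, and the matrix form $J_g(\theta)=E_{g}[\psi\psi^T]$ is the reconstructed expression of Property 2) computing the generalized Fisher information matrix by quadrature:

```python
import numpy as np

q = 1.5                                          # escort order for g
n = 301
ax = np.linspace(-8.0, 8.0, n)
dx = ax[1] - ax[0]
X, Y = np.meshgrid(ax, ax, indexing="ij")

# 2-D standard Gaussian location family at theta = 0
p = np.exp(-0.5 * (X**2 + Y**2)) / (2.0 * np.pi)
g = p ** q / (np.sum(p ** q) * dx * dx)          # escort of order q of p

# gradient of p wrt the location parameter: minus the spatial gradient of p
dpx, dpy = np.gradient(p, dx, dx)
psi = np.stack([-dpx, -dpy]) / g                 # generalized score psi = grad_theta p / g

# J_g = E_g[psi psi^T], computed by quadrature
J = np.einsum("ixy,jxy,xy->ij", psi, psi, g) * dx * dx
print(J)                                         # for q = 1 this would be the identity
```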
Property 3
If $T(X)$ is a statistic, then, with $\frac{1}{\alpha}+\frac{1}{\beta}=1$, we have
$$ \left|E_{P_1}[T(x)]-E_{P_2}[T(x)]\right| \leq E_G\!\left[\left|T(x)-E_{P_2}[T(x)]\right|^\alpha\right]^{\frac{1}{\alpha}} \chi^\beta_g\!\left(P_1\|P_2\right)^{\frac{1}{\beta}}. \qquad (4) $$
It suffices here to consider $E_{P_1}[T(x)]-E_{P_2}[T(x)]=\int\left(T(x)-E_{P_2}[T(x)]\right)\frac{p_1(x)-p_2(x)}{g(x)}\,g(x)\,dx$ and then apply the Hölder inequality to the left hand side.
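A quick discrete-case sanity check of this Hölder step (a sketch, not in the original; the distributions and the statistic are random choices, and the bound tested is the reconstructed inequality (4)):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 12
p1 = rng.random(m); p1 /= p1.sum()
p2 = rng.random(m); p2 /= p2.sum()
g  = rng.random(m); g  /= g.sum()
T  = rng.normal(size=m)                      # an arbitrary statistic

alpha = 3.0
beta = alpha / (alpha - 1.0)

lhs = abs(np.dot(T, p1) - np.dot(T, p2))
moment = np.sum(np.abs(T - np.dot(T, p2)) ** alpha * g) ** (1.0 / alpha)
chi = np.sum(np.abs(p1 - p2) ** beta / g ** (beta - 1.0)) ** (1.0 / beta)
print(lhs, moment * chi)                     # the first value never exceeds the second
```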
4 The generalized Cramér-Rao inequality
Property 3 above can be used to derive a generalized Cramér-Rao inequality involving the generalized Fisher information (3). Let us consider the scalar case. Set $T(x)=\hat{\theta}(x)$ and denote by $P_1=P_{\theta'}$ and $P_2=P_\theta$ the two distributions in (4). Then divide both sides of (4) by $|\theta'-\theta|$, substitute $E_{P_\theta}[T(x)]$ by $\theta$ (the bound (4) holds with any centering constant) and take the limit $\theta'\to\theta$. Assuming that we can exchange the order of integrations and derivations, and using the definition (3) in the scalar case, we obtain
$$ E_{g_\theta}\!\left[\left|\hat{\theta}(x)-\theta\right|^\alpha\right]^{\frac{1}{\alpha}} I_\beta[p_\theta;g_\theta]^{\frac{1}{\beta}} \;\geq\; \left|\frac{\partial}{\partial\theta}E_{p_\theta}\!\left[\hat{\theta}(x)\right]\right|, \qquad (5) $$
which reduces to the standard Cramér-Rao inequality in the case $\alpha=\beta=2$ and $g_\theta=p_\theta$. A multidimensional version involving arbitrary norms can be obtained by the following steps: (a) evaluate the divergence of the bias, (b) introduce an averaging with respect to $g_\theta$ in the resulting integral, (c) apply a version of the Hölder inequality for arbitrary norms. The proof can be found in [7] for the direct estimation of the parameter $\theta$. For the estimation of any function $t(\theta)$, we have a generalized Cramér-Rao inequality that enables us to lower bound a moment of an arbitrary norm of the estimation error, this moment being computed with respect to any distribution $g_\theta$.
Proposition 1
[Generalized Cramér-Rao inequality, partially in [7]] Under some regularity conditions, for any estimator $\hat{t}(x)$ of $t(\theta)$,
$$ E_{g_\theta}\!\left[\left\|\hat{t}(x)-t(\theta)\right\|^\alpha\right]^{\frac{1}{\alpha}} E_{g_\theta}\!\left[\left\|\frac{\nabla_\theta p_\theta(x)}{g_\theta(x)}\right\|_*^\beta\right]^{\frac{1}{\beta}} \;\geq\; \left|\nabla_\theta\cdot E_{p_\theta}\!\left[\hat{t}(x)\right]\right|, \qquad (6) $$
where $\alpha$ and $\beta$ are Hölder conjugates of each other, i.e. $\frac{1}{\alpha}+\frac{1}{\beta}=1$, with $\alpha\geq 1$, and where the second factor in the left side is actually the generalized Fisher information $I_\beta[p_\theta;g_\theta]$, defined here with the dual norm $\|\cdot\|_*$.
Many consequences can be obtained from this general result. For instance, if one chooses $t(\theta)=\theta$ and an unbiased estimator $\hat{\theta}(x)$, then the inequality above reduces to
$$ E_{g_\theta}\!\left[\left\|\hat{\theta}(x)-\theta\right\|^\alpha\right]^{\frac{1}{\alpha}} I_\beta[p_\theta;g_\theta]^{\frac{1}{\beta}} \;\geq\; k. \qquad (7) $$
Taking $g_\theta=p_\theta$, we obtain an extension of the standard Cramér-Rao inequality, featuring a general norm and an arbitrary power; in the scalar case, we obtain the Barankin-Vajda result [3, 17], see also [19].
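As a numerical sanity check of the scalar inequality (a sketch, not in the original; the Laplace location family, the escort order and the exponents are arbitrary choices, and the expressions follow the reconstructed formulas (3) and (7)):

```python
import numpy as np

alpha = 3.0
beta = alpha / (alpha - 1.0)
q = 1.2                                     # escort order for the averaging density g

x = np.linspace(-30.0, 30.0, 60001)
dx = x[1] - x[0]
f = 0.5 * np.exp(-np.abs(x))                # zero-mean Laplace density, theta = 0
g = f ** q / (np.sum(f ** q) * dx)          # escort of order q of f

# unbiased estimator theta_hat(x) = x of a location parameter
moment = (np.sum(np.abs(x) ** alpha * g) * dx) ** (1.0 / alpha)

dp_dtheta = -np.gradient(f, dx)             # derivative of f(x - theta) wrt theta at 0
fisher = (np.sum(np.abs(dp_dtheta / g) ** beta * g) * dx) ** (1.0 / beta)

print(moment * fisher)                      # the bound predicts a value >= 1 (k = 1 here)
```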
When $t(\theta)$ is scalar valued, we have the following variation on the theme (proof omitted), which involves a Fisher information matrix, but unfortunately in a non-explicit form: for any matrix $A$, we have
(8)
where $\psi(x;\theta)$ is a score function. We define as Fisher information matrix the matrix $A$ which maximizes the right hand side. In the quadratic case, and using an inequality valid for any positive definite matrix, one can check that the maximum is precisely attained at $J_g(\theta)$, that is, the Fisher information matrix we obtained above in Property 2 in the quadratic case. The inequality then reduces to the corresponding inequality known in the standard case.
In the quadratic case, it is also possible to obtain an analog of the well-known result that the covariance of the estimation error is greater than the inverse of the Fisher information matrix (in the Löwner sense). The proof follows the lines in [11, pp. 296-297] and we get that
$$ E_{g_\theta}\!\left[\left(\hat{\theta}(x)-\theta\right)\left(\hat{\theta}(x)-\theta\right)^T\right] \;\succeq\; J_g(\theta)^{-1}, \qquad (9) $$
with the matrix $J_g(\theta)$ defined by $J_g(\theta)=E_{g_\theta}\!\left[\psi(x;\theta)\,\psi(x;\theta)^T\right]$, and with equality iff $\hat{\theta}(x)-\theta=K(\theta)\,\psi(x;\theta)$, with $K(\theta)=J_g(\theta)^{-1}$.
Let us now consider the case of a location parameter, that is, a translation family $p_\theta(x)=f(x-\theta)$. In such a case, we have $\nabla_\theta p_\theta(x)=-\nabla f(x-\theta)$. Let us also assume, without loss of generality, that $f$ has zero mean. In these conditions, the estimator $\hat{\theta}(x)=x$ is unbiased. Finally, taking $g_\theta(x)=g(x-\theta)$, the relation (7), with $k=n$, leads to
Proposition 2
[Functional Cramér-Rao inequality] For any pair of probability density functions $(f,g)$, and under some technical conditions,
$$ E_g\!\left[\|x\|^\alpha\right]^{\frac{1}{\alpha}} \left(\int \left\|\frac{\nabla f(x)}{g(x)}\right\|_*^\beta g(x)\,dx\right)^{\frac{1}{\beta}} \;\geq\; n, \qquad (10) $$
with equality if the Hölder inequality used in the proof is saturated, which is the case for the escort pairs associated with the generalized $q$-Gaussians considered below.
At this point, we can obtain an interesting new characterization of the generalized $q$-Gaussian distributions, which involves the generalized Fisher information. Indeed, if we take $f$ and $g$ as a pair of escort distributions, with
$$ g(x) = \frac{f(x)^q}{M_q[f]}, \qquad M_q[f]=\int f(u)^q\,du, \qquad (11) $$
we obtain the following.
Proposition 3
[$q$-Cramér-Rao inequality [7]] For any probability density $f$ on $\mathbb{R}^n$, if $m_\alpha[f]=E_f\!\left[\|x\|^\alpha\right]$ is the moment of order $\alpha$ of the norm of $x$ and if
$$ I_{\beta,q}[f] = \left(\frac{q}{M_q[f]}\right)^{\!\beta} \int f(x)^{\beta(q-1)+1}\left\|\frac{\nabla f(x)}{f(x)}\right\|_*^\beta dx \qquad (12) $$
is the Fisher information of order $(\beta,q)$, with $\frac{1}{\alpha}+\frac{1}{\beta}=1$, then
$$ m_\alpha[f]^{\frac{1}{\alpha}} \; I_{\beta,q}[f]^{\frac{1}{\beta}} \;\geq\; n, \qquad (13) $$
with equality if and only if $f$ is a generalized Gaussian of the form
$$ G_\gamma(x) = \frac{1}{Z(\gamma)}\left(1-(q-1)\gamma\|x\|^\alpha\right)_+^{\frac{1}{q-1}}. \qquad (14) $$
Let us simply note here that $G_\gamma$ is also called a stretched $q$-Gaussian; it becomes a stretched Gaussian for $q=1$ and a standard Gaussian when, in addition, $\alpha=2$. The inequality (13) shows that the generalized $q$-Gaussians minimize the generalized Fisher information among all distributions with a given moment.
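A small sketch (not in the original) evaluating the density form (14) as reconstructed above; the parameter values are arbitrary and only meant to show the compact-support ($q>1$) versus heavy-tail ($q<1$) behaviour in one dimension:

```python
import numpy as np

def q_gaussian(x, q, alpha, gamma):
    """Unnormalized generalized q-Gaussian profile, 1-D, as in the reconstructed (14)."""
    if abs(q - 1.0) < 1e-12:
        return np.exp(-gamma * np.abs(x) ** alpha)
    base = 1.0 - (q - 1.0) * gamma * np.abs(x) ** alpha
    return np.maximum(base, 0.0) ** (1.0 / (q - 1.0))

x = np.linspace(-10.0, 10.0, 40001)
dx = x[1] - x[0]
for q in (0.7, 1.0, 1.5):
    profile = q_gaussian(x, q, alpha=2.0, gamma=1.0)
    Z = np.sum(profile) * dx                  # numerical normalization constant
    density = profile / Z
    support = x[density > 0]
    print(q, support.min(), support.max())    # finite support only for q > 1
```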
Let us also mention that the inequality (13) is similar to, but different from, an inequality given by Lutwak et al. [12], which is also saturated by the generalized Gaussians (14). Finally, for a location parameter, the matrix inequality (9) reduces to a bound on the covariance of the estimation error, with equality iff $f$ is a generalized $q$-Gaussian with a prescribed covariance matrix.
The generalized Fisher information (12) also pops up in an extension of the de Bruijn identity. This identity is usually shown for the solutions of a heat equation. An extension is obtained by considering the solutions of a doubly-nonlinear equation [18]
$$ \frac{\partial u}{\partial t} = \Delta_\beta\!\left(u^m\right) = \nabla\cdot\left(\left\|\nabla u^m\right\|^{\beta-2}\nabla u^m\right). \qquad (15) $$
Proposition 4
[Extended de Bruijn identity [6]] For $u(x,t)$ a solution of (15), and with $S_q$ denoting the Tsallis entropy, we have
(16)
Of course, the standard de Bruijn identity is recovered in the particular case $\beta=2$ and $m=q=1$.
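As an illustration of the classical case only (a sketch, not in the original; it checks the standard de Bruijn identity $\frac{d}{dt}H[u_t]=I[u_t]$ for the heat equation $\partial_t u=\partial_x^2 u$, which is the case recovered from (16) when $\beta=2$ and $m=q=1$; the initial variance and the time step are arbitrary choices):

```python
import numpy as np

x = np.linspace(-40.0, 40.0, 80001)
dx = x[1] - x[0]

def gaussian(x, var):
    return np.exp(-0.5 * x**2 / var) / np.sqrt(2.0 * np.pi * var)

def shannon_entropy(u):
    mask = u > 0
    return -np.sum(u[mask] * np.log(u[mask])) * dx

def fisher(u):
    du = np.gradient(u, dx)
    mask = u > 1e-300
    return np.sum(du[mask] ** 2 / u[mask]) * dx

# For du/dt = d^2u/dx^2, a Gaussian of variance v0 evolves into one of variance v0 + 2t.
v0, t, h = 1.0, 0.5, 1e-3
u_minus = gaussian(x, v0 + 2.0 * (t - h))
u_plus  = gaussian(x, v0 + 2.0 * (t + h))
dH_dt = (shannon_entropy(u_plus) - shannon_entropy(u_minus)) / (2.0 * h)

u_t = gaussian(x, v0 + 2.0 * t)
print(dH_dt, fisher(u_t))      # both values are close to 1 / (v0 + 2 t) = 0.5
```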
We close this paper by indicating that it is possible to exhibit new uncertainty relations, beginning with the generalized Cramér-Rao inequality (13). These inequalities involve moments computed with respect to escort distributions like (11). We denote by $E_q[\cdot]$ an expectation computed with respect to an escort distribution of order $q$. If $\psi(x)$ is a wave function, and $x$ and $\xi$ two Fourier conjugated variables, then
Proposition 5
[Uncertainty relations] For admissible values of the exponents and of the escort orders,
(17)
For certain parameter values, the lower bound is attained if and only if the wave function is a generalized Gaussian. For particular values, this inequality yields a multidimensional version of the Weyl-Heisenberg uncertainty principle; for other values, we also get a related inequality.
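A numerical sketch of the standard case only (not in the original; it checks the usual Weyl-Heisenberg relation $\mathrm{Var}(x)\,\langle k^2\rangle\geq\tfrac{1}{4}$ for a one-dimensional, real, normalized wave function, the particular case of (17) mentioned above; the test wave functions are arbitrary choices):

```python
import numpy as np

x = np.linspace(-20.0, 20.0, 40001)
dx = x[1] - x[0]

def uncertainty_product(psi):
    """Var(x) * <k^2> for a real wave function, with <k^2> = integral of |psi'|^2."""
    norm = np.sqrt(np.sum(psi**2) * dx)
    psi = psi / norm
    prob = psi**2
    mean_x = np.sum(x * prob) * dx
    var_x = np.sum((x - mean_x) ** 2 * prob) * dx
    dpsi = np.gradient(psi, dx)
    mean_k2 = np.sum(dpsi**2) * dx
    return var_x * mean_k2

gauss = np.exp(-x**2 / 4.0)                                    # saturates the bound (1/4)
bump  = np.exp(-x**2 / 4.0) * (1.0 + 0.3 * np.cos(3.0 * x))    # a perturbed packet
print(uncertainty_product(gauss), uncertainty_product(bump))   # >= 0.25, = 0.25 for gauss
```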
References
- [1] Abe, S.: Geometry of escort distributions. Physical Review E 68(3), 031101 (2003)
- [2] Amari, S.I.: α-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Transactions on Information Theory 55(11), 4925–4931 (2009)
- [3] Barankin, E.W.: Locally best unbiased estimates. The Annals of Mathematical Statistics 20(4), 477–501 (Dec 1949)
- [4] Beck, C., Schloegl, F.: Thermodynamics of Chaotic Systems. Cambridge University Press (1993)
- [5] Bercher, J.F.: Source coding with escort distributions and Rényi entropy bounds. Physics Letters A 373(36), 3235–3238 (Aug 2009)
- [6] Bercher, J.F.: Some properties of generalized Fisher information in the context of nonextensive thermostatistics (2012), http://hal.archives-ouvertes.fr/hal-00766699
- [7] Bercher, J.F.: On multidimensional generalized Cramér-Rao inequalities, uncertainty relations and characterizations of generalized q-Gaussian distributions. Journal of Physics A: Mathematical and Theoretical 46(9), 095303 (Feb 2013), http://hal-upec-upem.archives-ouvertes.fr/hal-00766695
- [8] Chhabra, A., Jensen, R.V.: Direct determination of the singularity spectrum. Physical Review Letters 62(12), 1327 (Mar 1989)
- [9] Cordero-Erausquin, D., Nazaret, B., Villani, C.: A mass-transportation approach to sharp Sobolev and Gagliardo-Nirenberg inequalities. Advances in Mathematics 182(2), 307–332 (Mar 2004)
- [10] Del Pino, M., Dolbeault, J.: Best constants for Gagliardo-Nirenberg inequalities and applications to nonlinear diffusions. Journal de Mathématiques Pures et Appliquées 81(9), 847–875 (Sep 2002)
- [11] Liese, F., Miescke, K.J.: Statistical Decision Theory: Estimation, Testing, and Selection. Springer, 2008 edn. (Jun 2008)
- [12] Lutwak, E., Lv, S., Yang, D., Zhang, G.: Extensions of Fisher information and Stam’s inequality. IEEE Transactions on Information Theory 58(3), 1319 –1327 (Mar 2012)
- [13] Lutz, E.: Anomalous diffusion and Tsallis statistics in an optical lattice. Physical Review A 67(5), 051402 (2003)
- [14] Ohara, A., Matsuzoe, H., Amari, S.I.: A dually flat structure on the space of escort distributions. Journal of Physics: Conference Series 201, 012012 (2010)
- [15] Schwämmle, V., Nobre, F.D., Tsallis, C.: q-Gaussians in the porous-medium equation: stability and time evolution. The European Physical Journal B - Condensed Matter and Complex Systems 66(4), 537–546 (2008)
- [16] Tsallis, C.: Introduction to Nonextensive Statistical Mechanics. Springer (Apr 2009)
- [17] Vajda, I.: $\chi^\alpha$-divergence and generalized Fisher information. In: Transactions of the Sixth Prague Conference on Information Theory, Statistical Decision Functions and Random Processes. pp. 873–886 (1973)
- [18] Vázquez, J.L.: Smoothing and Decay Estimates for Nonlinear Diffusion Equations: Equations of Porous Medium Type. Oxford University Press, USA (Oct 2006)
- [19] Weinstein, E., Weiss, A.: A general class of lower bounds in parameter estimation. IEEE Transactions on Information Theory 34(2), 338–342 (1988)