 # Statistical Meaning of Mean Functions

The basic properties of the Fisher information make it possible to reveal the statistical meaning of classical inequalities between mean functions. Applied to scale mixtures of Gaussian distributions, these properties lead to a new mean function of purely statistical origin, unrelated to the classical arithmetic, geometric, and harmonic means. We call it the informational mean and show that when the arguments of the mean functions are Hermitian positive definite matrices, not necessarily commuting, the informational mean lies between the harmonic and arithmetic means, playing, in a sense, the role of the geometric mean, which cannot be correctly defined for non-commuting matrices. Surprisingly, the monotonicity and additivity properties of the Fisher information lead to a new generalization of the classical inequality between the arithmetic and harmonic means.

## 1 Introduction.

Fisher information is a fundamental concept in statistics because it quantifies the efficiency of point estimators in finite samples and the asymptotic behavior of maximum likelihood estimators. The importance of Fisher information derives from two properties:

• Monotonicity: The Fisher information in a statistic (a reduction of a set of data) is never greater than the information in the complete data set.

• Additivity: The total Fisher information in a set of independent observations is the sum of the Fisher informations of each of its components.

In this article we apply Fisher information to develop analytic inequalities involving both scalars and matrices. The monotonicity and additivity of Fisher information are key tools in deriving or reproving analytical inequalities, as shown below. Our general approach is to formulate a probability model, specialize it to Gaussian distributions, and use information-theoretic properties of the model to derive inequalities based on statistical principles.

In Kagan and Smith (2001) we used Fisher information to create statistical proofs of the monotonicity and convexity of the matrix function $A^{-1}$ for positive definite Hermitian matrices. That is,

$$A \ge B \;\Rightarrow\; B^{-1} \ge A^{-1}$$

and, given weights $w_1, \dots, w_n$ such that $w_i > 0$ and $w_1 + \cdots + w_n = 1$,

$$(w_1 A_1 + \cdots + w_n A_n)^{-1} \le w_1 A_1^{-1} + \cdots + w_n A_n^{-1}.$$

Here and throughout the paper, for any pair of Hermitian matrices $A$ and $B$, $A \ge B$ means that $A - B$ is nonnegative definite. Similarly, the matrix function $A^2$ is shown to be convex using statistical methods.
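The two matrix statements above can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly generated Hermitian positive definite matrices; the helpers `random_hpd` and `is_nonneg_definite` are illustrative names, not part of the cited papers, and a numerical check is of course not a proof.

```python
import numpy as np

def random_hpd(rng, d):
    """Random Hermitian positive definite d x d matrix (illustrative helper)."""
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return g @ g.conj().T + 0.1 * np.eye(d)

def is_nonneg_definite(m, tol=1e-9):
    """True if the Hermitian part of m has no eigenvalue below -tol."""
    h = (m + m.conj().T) / 2
    return bool(np.linalg.eigvalsh(h).min() >= -tol)

rng = np.random.default_rng(0)
d = 3
inv = np.linalg.inv

# Monotonicity: A >= B implies B^{-1} >= A^{-1}.
B = random_hpd(rng, d)
A = B + random_hpd(rng, d)          # A - B is positive definite by construction
monotone_ok = is_nonneg_definite(inv(B) - inv(A))

# Convexity: (w1 A1 + w2 A2)^{-1} <= w1 A1^{-1} + w2 A2^{-1}.
w1, w2 = 0.3, 0.7
A1, A2 = random_hpd(rng, d), random_hpd(rng, d)
lhs = inv(w1 * A1 + w2 * A2)
rhs = w1 * inv(A1) + w2 * inv(A2)
convex_ok = is_nonneg_definite(rhs - lhs)

print(monotone_ok, convex_ok)
```

The ordering is tested through the smallest eigenvalue of the difference, which mirrors the definition of $A \ge B$ used in the text.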

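The convexity result above was extended to a notion of matrix-weighted averages in Kagan and Smith (1999). The scalar weights $w_i$ are replaced by matrix weights $B_i$ as follows:

$$B_1^T A_1 B_1 + \cdots + B_n^T A_n B_n$$

where $B_1^T B_1 + \cdots + B_n^T B_n = I$. It was shown that $A^2$ and $A^{-1}$ are hyperconvex functions, meaning that

$$(B_1^T A_1 B_1 + \cdots + B_n^T A_n B_n)^2 \le B_1^T A_1^2 B_1 + \cdots + B_n^T A_n^2 B_n$$

and

$$(B_1^T A_1 B_1 + \cdots + B_n^T A_n B_n)^{-1} \le B_1^T A_1^{-1} B_1 + \cdots + B_n^T A_n^{-1} B_n.$$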

As before, these results were derived by making use of the properties of Fisher information.
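A numerical illustration of the two hyperconvexity inequalities is sketched below. It assumes real symmetric positive definite $A_i$ and the normalization $B_1^T B_1 + \cdots + B_n^T B_n = I$ for the matrix weights; the construction of the weights from random matrices `G` is an illustrative device, not taken from the cited paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 4

def random_spd(d):
    """Random real symmetric positive definite matrix (illustrative helper)."""
    g = rng.normal(size=(d, d))
    return g @ g.T + 0.1 * np.eye(d)

A = [random_spd(d) for _ in range(n)]

# Build matrix weights with sum_i B_i^T B_i = I by normalizing random G_i.
G = [rng.normal(size=(d, d)) for _ in range(n)]
S = sum(g.T @ g for g in G)
vals, vecs = np.linalg.eigh(S)
S_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
Bs = [g @ S_inv_sqrt for g in G]

def weighted_average(mats):
    """Matrix-weighted average B_1^T M_1 B_1 + ... + B_n^T M_n B_n."""
    return sum(b.T @ m @ b for b, m in zip(Bs, mats))

def min_eig(m):
    """Smallest eigenvalue of the symmetric part of m."""
    return float(np.linalg.eigvalsh((m + m.T) / 2).min())

M = weighted_average(A)
# Both gaps should be nonnegative up to rounding error.
gap_square = min_eig(weighted_average([a @ a for a in A]) - M @ M)
gap_inverse = min_eig(weighted_average([np.linalg.inv(a) for a in A]) - np.linalg.inv(M))
print(gap_square, gap_inverse)
```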

Our work is similar to the use of properties of entropy and related informational quantities to derive and extend classical inequalities. See Dembo, Cover and Thomas (1991) for an exposition of that work.

## 2 Properties of Fisher Information.

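Basic results concerning Fisher information are given in standard textbooks on mathematical statistics, for example Rao (1973) or Bickel and Doksum (2015). Let $X$ be a random vector with density $p(x; \boldsymbol{\theta})$ depending on a parameter $\boldsymbol{\theta}$. We assume the score function

$$J(x; \boldsymbol{\theta}) = \frac{\partial}{\partial \boldsymbol{\theta}} \log p(x; \boldsymbol{\theta})$$

is well defined. Then $I_X(\boldsymbol{\theta})$, the Fisher information on $\boldsymbol{\theta}$ contained in $X$, is defined as the variance-covariance matrix of the score:

$$I_X(\boldsymbol{\theta}) = \operatorname{Cov}_{\boldsymbol{\theta}}[J(X; \boldsymbol{\theta})] = E_{\boldsymbol{\theta}}\left[J(X; \boldsymbol{\theta})\, J(X; \boldsymbol{\theta})^T\right].$$

Under further regularity conditions,

$$I_X(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}\left[-\frac{\partial}{\partial \boldsymbol{\theta}} \frac{\partial}{\partial \boldsymbol{\theta}^T} \log p(X; \boldsymbol{\theta})\right].$$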

The fundamental information inequality (or Cramér-Rao inequality) states that if $T = T(X)$ is an unbiased estimator of $\boldsymbol{\theta}$, then

$$\operatorname{Var}_{\boldsymbol{\theta}}[T] \ge I_X(\boldsymbol{\theta})^{-1}.$$

(If $A$ and $B$ are Hermitian matrices, the notation $A \ge B$ means that $A - B$ is nonnegative definite.)

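When $\boldsymbol{\theta}$ is a location parameter, $X$ has density $p(x - \boldsymbol{\theta})$. The Fisher information on a location parameter becomes

$$I_X = \int \frac{\partial \log p(x)}{\partial x}\, \frac{\partial \log p(x)}{\partial x^T}\, p(x)\, dx.$$

Plainly, $I_X$ is constant in $\boldsymbol{\theta}$. (The notation $I_X$ by default denotes the information on a location parameter throughout this paper.)

If $X$ is distributed as $p(x)$, then for any $c > 0$ the density of $cX$ is $(1/c)\, p(x/c)$ and plainly $I_{cX} = I_X / c^2$.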

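For a scalar Gaussian random variable $X \sim N(\theta, \sigma^2)$ one has $I_X = 1/\sigma^2$, and for any $X$ with $E(X) = \theta$ and $\operatorname{Var}(X) = \sigma^2$, $I_X \ge 1/\sigma^2$. This is a consequence of the Cramér-Rao inequality.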
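The scalar statements above can be illustrated by approximating the location Fisher information on a grid. This is a sketch under stated assumptions: the function `location_fisher_info`, the grid limits, and the choice of the logistic distribution as a non-Gaussian comparison are illustrative and do not come from the paper. For the Gaussian the product of information and variance should be close to 1, and for the logistic it should exceed 1.

```python
import numpy as np

def location_fisher_info(logpdf, lo=-60.0, hi=60.0, num=200_001):
    """Approximate I_X = integral of (d log p / dx)^2 p dx on a uniform grid."""
    x = np.linspace(lo, hi, num)
    lp = logpdf(x)
    score = np.gradient(lp, x)              # numerical derivative of log p
    integrand = score**2 * np.exp(lp)
    dx = x[1] - x[0]
    return float(np.sum(integrand) * dx)    # Riemann sum; the tails are negligible

sigma = 2.0

def gauss_logpdf(x):
    """Log density of N(0, sigma^2)."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)

def logistic_logpdf(x):
    """Log density of the standard logistic distribution, e^{-x} / (1 + e^{-x})^2."""
    return -x - 2 * np.logaddexp(0.0, -x)

info_gauss = location_fisher_info(gauss_logpdf)
info_logistic = location_fisher_info(logistic_logpdf)
var_logistic = np.pi**2 / 3                 # variance of the standard logistic

print(info_gauss * sigma**2)                # close to 1: the Gaussian attains the bound
print(info_logistic * var_logistic)         # larger than 1: I_X > 1 / Var(X)
```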

## 3 Mixtures, Mean Functions and Inequalities.

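Consider an experiment consisting of observing a pair $(\Delta, X)$, where $\Delta$ is a discrete random variable with $P(\Delta = i) = w_i$, $i = 1, \dots, n$, and the conditional distribution of $X$ given $\Delta = i$ is $N(\theta, \sigma_i^2)$.

The marginal distribution of $X$ is a scale mixture of Gaussian distributions with mixing probabilities $w_1, \dots, w_n$. Its density is

$$p(x - \theta) = w_1 \varphi_{\sigma_1}(x - \theta) + \cdots + w_n \varphi_{\sigma_n}(x - \theta). \tag{1}$$

Here $\varphi_{\sigma}(x) = \sigma^{-1} \varphi(x/\sigma)$, where $\varphi$ is the density of the standard normal $N(0, 1)$. The variance $\sigma^2$ of $X$ with density (1) is

$$\sigma^2 = w_1 \sigma_1^2 + \cdots + w_n \sigma_n^2. \tag{2}$$

The Fisher information on $\theta$ contained in the pair $(\Delta, X)$ is

$$I_{\Delta, X} = w_1/\sigma_1^2 + \cdots + w_n/\sigma_n^2. \tag{3}$$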

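Monotonicity of the Fisher information (the information in the whole data set is never less than in any part of it; in our case $X$ is a part of $(\Delta, X)$) implies

$$I_X \le I_{\Delta, X}.$$

For any $X$ with $\operatorname{Var}(X) = \sigma^2$, $I_X \ge 1/\sigma^2$. Hence one gets a two-sided inequality for $I_X$ with density (1):

$$\left[\sum_{1}^{n} w_i \sigma_i^2\right]^{-1} \le I_X \le \sum_{1}^{n} w_i/\sigma_i^2. \tag{4}$$

Since $p$ in (1) is completely determined by the weights $w_1, \dots, w_n$ and variances $\sigma_1^2, \dots, \sigma_n^2$, so is $I_X$. On setting $a_i = 1/\sigma_i^2$, the inequality (4) takes the form

$$\left[\sum_{1}^{n} w_i/a_i\right]^{-1} \le I_X(a_1, \dots, a_n; w_1, \dots, w_n) \le \sum_{1}^{n} w_i a_i. \tag{5}$$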

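Recall that a function $M(a_1, \dots, a_n)$ is called a mean function if for all $a_1, \dots, a_n > 0$:

(i) $\min(a_1, \dots, a_n) \le M(a_1, \dots, a_n) \le \max(a_1, \dots, a_n)$,

(ii) for any $\lambda > 0$, $M(\lambda a_1, \dots, \lambda a_n) = \lambda M(a_1, \dots, a_n)$.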

Classical examples of mean functions are the arithmetic, geometric and harmonic means.

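From (5), $I_X(a_1, \dots, a_n; w_1, \dots, w_n)$ satisfies (i), since the harmonic and arithmetic means on its two sides both lie between $\min a_i$ and $\max a_i$. Furthermore, for any $\lambda > 0$, $I_X(\lambda a_1, \dots, \lambda a_n; w_1, \dots, w_n)$ is the Fisher information in $X$ with density

$$p_\lambda(x - \theta) = w_1 \varphi_{\sigma_1/\sqrt{\lambda}}(x - \theta) + \cdots + w_n \varphi_{\sigma_n/\sqrt{\lambda}}(x - \theta) = \sqrt{\lambda}\, p\big(\sqrt{\lambda}(x - \theta)\big)$$

and due to the well known scaling property of the Fisher information mentioned above,

$$I_X(\lambda a_1, \dots, \lambda a_n; w_1, \dots, w_n) = \lambda\, I_X(a_1, \dots, a_n; w_1, \dots, w_n)$$

so that $I_X$ satisfies (ii). Thus, $I_X(a_1, \dots, a_n; w_1, \dots, w_n)$ is a mean function. We suggest calling it the informational mean.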
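The informational mean has no closed form in general, but it can be approximated by numerical integration. The sketch below assumes a simple Riemann sum on a uniform grid; the function name `informational_mean`, the grid size, and the example values of $a_i$ and $w_i$ are illustrative choices. It checks the two-sided inequality (5) and the homogeneity property (ii).

```python
import numpy as np

def informational_mean(a, w, num=400_001):
    """Approximate I_X(a_1..a_n; w_1..w_n) for the scale mixture (1), a_i = 1/sigma_i^2."""
    a, w = np.asarray(a, float), np.asarray(w, float)
    sigma_max = 1.0 / np.sqrt(a.min())
    x = np.linspace(-12 * sigma_max, 12 * sigma_max, num)
    # Component densities phi_{sigma_i}(x), one row per component.
    comp = np.sqrt(a[:, None] / (2 * np.pi)) * np.exp(-0.5 * a[:, None] * x**2)
    p = w @ comp                            # mixture density at theta = 0
    dp = -x * ((w * a) @ comp)              # derivative of p with respect to x
    dx = x[1] - x[0]
    return float(np.sum(dp**2 / p) * dx)    # I_X = integral of (p')^2 / p dx

a = np.array([0.5, 2.0, 8.0])               # hypothetical precisions a_i
w = np.array([0.2, 0.5, 0.3])               # mixing probabilities, sum to 1

harmonic = 1.0 / np.sum(w / a)
arithmetic = float(np.sum(w * a))
info = informational_mean(a, w)
print(harmonic, info, arithmetic)           # expected ordering: harmonic <= info <= arithmetic

lam = 3.0
scaled = informational_mean(lam * a, w)
print(scaled, lam * info)                   # property (ii): the two values should agree
```

The grid spans twelve standard deviations of the widest component, so the mixture density stays strictly positive on the grid and the ratio $(p')^2/p$ is well defined everywhere.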

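Inequalities (4) and (5) have a statistical interpretation. Their right hand sides are the Fisher information on $\theta$ in the pair $(\Delta, X)$ with

$$P(\Delta = i) = w_i, \quad X \mid \{\Delta = i\} \sim N(\theta, \sigma_i^2), \quad \sigma_i^2 = 1/a_i, \quad i = 1, \dots, n. \tag{6}$$

The left hand sides are the Fisher information on $\theta$ in a Gaussian $X$ with variance $\sigma^2$ given by (2).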

Turn now to the case when $a_1, \dots, a_n$ are replaced with Hermitian positive definite matrices $A_1, \dots, A_n$. As is well known, the inequality between the arithmetic and harmonic means still holds:

$$\left[w_1 A_1^{-1} + \cdots + w_n A_n^{-1}\right]^{-1} \le w_1 A_1 + \cdots + w_n A_n. \tag{7}$$

The matrices $A_1, \dots, A_n$ are not assumed to commute, so their geometric mean is not defined.

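Suppose that $X$ is a $d$-dimensional random vector with distribution given by a density $p(x - \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ is a $d$-dimensional parameter, the vector score,

$$J(X - \boldsymbol{\theta}) = (\partial \log p/\partial \theta_1, \dots, \partial \log p/\partial \theta_d)^T,$$

is well defined and $E\,\|J(X - \boldsymbol{\theta})\|^2 < \infty$. Then the matrix $I_X = E[J J^T]$ is called the matrix of Fisher information on $\boldsymbol{\theta}$ contained in $X$. (The superscript $T$ denotes transposition.)

For a Gaussian $X$ with mean vector $\boldsymbol{\theta}$ and non-degenerate covariance matrix $V$, $I_X = V^{-1}$. For any $X$ with covariance matrix $V$, the information matrix $I_X$ is evidently constant in $\boldsymbol{\theta}$ and $I_X \ge V^{-1}$. (Here and throughout this paper, $A \ge B$ means that the matrix $A - B$ is nonnegative definite.)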

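Let $(\Delta, X)$ be a pair of random elements whose distribution is given by

$$P(\Delta = i) = w_i, \quad X \mid \{\Delta = i\} \sim N(\boldsymbol{\theta}, V_i), \quad i = 1, \dots, n. \tag{8}$$

The marginal density of $X$ is the mixture of the densities of $N(\boldsymbol{\theta}, V_i)$, $i = 1, \dots, n$, with mixing probabilities $w_1, \dots, w_n$. Similarly to (2), the covariance matrix of $X$ is

$$V = w_1 V_1 + \cdots + w_n V_n \tag{9}$$

and the matrix of Fisher information on $\boldsymbol{\theta}$ in the pair $(\Delta, X)$ is

$$I_{\Delta, X} = w_1 V_1^{-1} + \cdots + w_n V_n^{-1}, \tag{10}$$

which is constant in $\boldsymbol{\theta}$.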

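As in the case of a scalar valued $X$, when $X$ is vector valued, the matrix of Fisher information is monotone. In our case, $I_X \le I_{\Delta, X}$.

On setting $A_i = V_i^{-1}$, $I_X$ becomes a function of $A_1, \dots, A_n$ and the mixing probabilities $w_1, \dots, w_n$. Comparing it with $I_{\Delta, X}$ on one side and with the matrix of Fisher information in a Gaussian $N(\boldsymbol{\theta}, V)$ on the other leads to

$$\left(w_1 A_1^{-1} + \cdots + w_n A_n^{-1}\right)^{-1} \le I_X(A_1, \dots, A_n; w_1, \dots, w_n) \le w_1 A_1 + \cdots + w_n A_n. \tag{11}$$

We want to emphasize that the matrices $A_1, \dots, A_n$ are not assumed to commute.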

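As a function of $A_1, \dots, A_n$, $I_X$ satisfies the above condition (ii) and the following version of (i): if a nonnegative definite matrix $A$ and a positive definite matrix $B$ are such that $A \le A_i \le B$ for $i = 1, \dots, n$, then $A \le I_X(A_1, \dots, A_n; w_1, \dots, w_n) \le B$. The statistical interpretation of (11) is the same as that of (4) and (5).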
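The matrix inequality (11) can be illustrated in dimension two, where the information matrix of the mixture is computable by a grid approximation of $\int (\nabla p)(\nabla p)^T / p \, dx$. This is a sketch only: the function `mixture_information_matrix`, the grid parameters, and the two non-commuting real symmetric matrices are assumptions chosen for illustration.

```python
import numpy as np

def mixture_information_matrix(A_list, w, half_width=10.0, num=801):
    """2-D grid approximation of I_X for the Gaussian mixture with precisions A_i."""
    g = np.linspace(-half_width, half_width, num)
    X, Y = np.meshgrid(g, g, indexing="ij")
    pts = np.stack([X.ravel(), Y.ravel()], axis=1)        # shape (num^2, 2)
    p = np.zeros(len(pts))
    grad = np.zeros_like(pts)
    for A, wi in zip(A_list, w):
        Ax = pts @ A                                      # rows are (A x)^T, A symmetric
        dens = np.sqrt(np.linalg.det(A)) / (2 * np.pi) * np.exp(-0.5 * np.sum(pts * Ax, axis=1))
        p += wi * dens
        grad += -(wi * dens)[:, None] * Ax                # gradient of w_i * N(0, A^{-1}) density
    cell = (g[1] - g[0]) ** 2
    return (grad / p[:, None]).T @ grad * cell

# Hypothetical non-commuting symmetric positive definite matrices.
A1 = np.array([[1.0, 0.3], [0.3, 0.6]])
A2 = np.array([[5.0, -1.0], [-1.0, 4.0]])
w = [0.5, 0.5]

info = mixture_information_matrix([A1, A2], w)
arithmetic = w[0] * A1 + w[1] * A2
harmonic = np.linalg.inv(w[0] * np.linalg.inv(A1) + w[1] * np.linalg.inv(A2))

# Smallest eigenvalues of the two gaps in (11); both should be nonnegative.
upper_gap = float(np.linalg.eigvalsh(arithmetic - info).min())
lower_gap = float(np.linalg.eigvalsh(info - harmonic).min())
print(upper_gap, lower_gap)
```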

## 4 An inequality for Fisher information in sums of random variables.

In the previous section, we considered the Fisher information in a scale mixture of Gaussian densities to obtain analytic inequalities of mean functions. In this section we follow a different approach by examining the Fisher information on weighted location parameters in an independent sample of observations. The model is as follows.

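For independent $X_1, \dots, X_n$ with finite Fisher information and $w_1, \dots, w_n > 0$, set

$$U_i = X_i + w_i^{\alpha} \theta, \quad i = 1, \dots, n. \tag{12}$$

The information in $U_i$ on $\theta$ equals $w_i^{2\alpha} I_{X_i}$. Observe that for any constant $c \ne 0$, the information in $c U_i$ equals that in $U_i$.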

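Multiplying both sides of (12) by $w_i^{\beta}$ with $\sum_{1}^{n} w_i^{\alpha + \beta} = 1$ and taking the sum of the results gives

$$U = \sum_{1}^{n} w_i^{\beta} U_i = \sum_{1}^{n} w_i^{\beta} X_i + \theta$$

whence

$$I_U = I_{\sum_{1}^{n} w_i^{\beta} X_i}. \tag{13}$$

The information about $\theta$ in the vector $(U_1, \dots, U_n)$ with independent components is the same as in the vector $(w_1^{\beta} U_1, \dots, w_n^{\beta} U_n)$. Due to monotonicity and additivity of the Fisher information,

$$I_U = I_{\sum_{1}^{n} w_i^{\beta} U_i} \le \sum_{1}^{n} I_{U_i} \tag{14}$$

whence

$$I_{\sum_{1}^{n} w_i^{\beta} X_i} \le \sum_{1}^{n} w_i^{2\alpha} I_{X_i} \tag{15}$$

for any $w_1, \dots, w_n > 0$ with $\sum_{1}^{n} w_i^{\alpha + \beta} = 1$. In a special case this inequality is known (e.g., see Dembo, Cover and Thomas 1991, Theorem 13).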

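When the $X_i$ are independent Gaussian variables with variances $\sigma_i^2 = 1/a_i$, the sum $\sum_{1}^{n} w_i^{\beta} X_i$ has a Gaussian distribution with variance $\sum_{1}^{n} w_i^{2\beta}/a_i$ and (15) takes the form

$$\sum_{1}^{n} w_i^{2\alpha} a_i \ge \frac{1}{\sum_{1}^{n} w_i^{2\beta}/a_i} \tag{16}$$

for $a_1, \dots, a_n > 0$ subject to $\sum_{1}^{n} w_i^{\alpha + \beta} = 1$.

Replacing $w_i^2$ with $w_i$ subject to $\sum_{1}^{n} w_i^{(\alpha + \beta)/2} = 1$ gives a generalization, in a sense, of the classical inequality between the arithmetic and harmonic means:

$$\sum_{1}^{n} w_i^{\alpha} a_i \ge \frac{1}{\sum_{1}^{n} w_i^{\beta}/a_i} \tag{17}$$

for any $a_1, \dots, a_n > 0$. The classical inequality corresponds to $\alpha = \beta = 1$.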
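Inequality (17) is easy to test numerically once the weights are normalized so that $\sum w_i^{(\alpha+\beta)/2} = 1$. The sketch below uses arbitrary illustrative exponents and random positive inputs (none of these values come from the paper), and also covers the classical case $\alpha = \beta = 1$.

```python
import numpy as np

def generalized_am_hm_sides(a, v, alpha, beta):
    """Return (weights, lhs, rhs) of (17) after normalizing v so that
    sum_i w_i^{(alpha+beta)/2} = 1. Assumes alpha + beta != 0."""
    a, v = np.asarray(a, float), np.asarray(v, float)
    s = (alpha + beta) / 2
    w = v / np.sum(v**s) ** (1 / s)          # rescale so that sum w_i^s = 1
    lhs = float(np.sum(w**alpha * a))
    rhs = float(1.0 / np.sum(w**beta / a))
    return w, lhs, rhs

rng = np.random.default_rng(2)
a = rng.uniform(0.1, 10.0, size=6)           # hypothetical positive arguments a_i
v = rng.uniform(0.1, 1.0, size=6)            # unnormalized positive weights

# Illustrative exponents; alpha = beta = 1 recovers the classical AM-HM inequality.
w_gen, lhs_gen, rhs_gen = generalized_am_hm_sides(a, v, alpha=0.4, beta=1.7)
w_cls, lhs_cls, rhs_cls = generalized_am_hm_sides(a, v, alpha=1.0, beta=1.0)

print(lhs_gen >= rhs_gen, lhs_cls >= rhs_cls)
```

In the classical case the normalized weights sum to one, so `lhs_cls` and `rhs_cls` are the weighted arithmetic and harmonic means of the $a_i$.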

The paper reveals the statistical meaning of classical mean functions (see in this connection Rao (2000), Kagan and Smith (2001), Kagan (2003), Kagan and Rao (2003)) and introduces a new one of purely statistical origin, called the informational mean. It leads to a new inequality, similar to the classical inequality between the arithmetic, geometric and harmonic means, that holds when the arguments of the mean functions are Hermitian positive definite matrices, not necessarily commuting, in which case the geometric mean cannot be defined.
The material of the paper can be used as a part of the chapter on the Fisher information in graduate courses in Statistics.

## References

1. Bickel, P.J. and Doksum, K.A. (2015), Mathematical Statistics (Vol. 1, 2nd ed.), Boca Raton: CRC Press.

2. Dembo, A., Cover, T.M., and Thomas, J.A. (1991), “Information Theoretic Inequalities,” IEEE Trans. Information Theory 37, 1501-1518.

3. Kagan, A. and Smith, P.J. (1999), “A Stronger Version of Matrix Convexity as Applied to Functions of Hermitian Matrices,” J. Inequal. & Appl., 3, 143-152.

4. Kagan, A. and Smith, P.J. (2001), “Multivariate Normal Distributions, Fisher Information and Matrix Inequalities,” Int. J. Math Educ. Sci. Technol., 32, 91-96.

5. Kagan, A. (2003), “Statistical Approach to Some Mathematical Problems”, Austrian J. Statist., 32(1-2), 71-83.

6. Kagan, A. and Rao, C. R. (2003), “Some Properties and Applications of the Efficient Fisher Score”, J. Statist. Plann. Inference, 116, 343-352.

7. Rao, C. R. (2000), “Statistical Proofs of Some Matrix Inequalities”, Linear Algebra Appl., 321, 307-320.

8. Rao, C.R. (1973), Linear Statistical Inference and Its Applications, Hoboken, NJ: J. Wiley & Sons.