The Fisher information is a fundamental concept in Statistics and Information Theory (Rissanen, 1996); for example, it features in the Jeffreys prior (Jeffreys, 1946), in the Cramér-Rao lower bound (Cramér, 1946; Rao, 1992) and in the analysis of the asymptotics of maximum-likelihood estimators (Le Cam, 1986; Douc et al., 2004, 2011). Although different generalisations have been proposed, see e.g. (Lutwak et al., 2005, 2012), the standard formulation of the Fisher information often involves a parametric family of probability measures which are all absolutely continuous with respect to a common reference measure, so that the corresponding probability density functions can be defined. This can, however, be a restrictive assumption for some statistical models.
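Let $\Theta \subseteq \mathbb{R}$ be a given open set of parameters and let $\{P_\theta\}_{\theta \in \Theta}$ be a parametric family of probability measures on a Polish space $E$ equipped with its Borel $\sigma$-algebra $\mathcal{B}(E)$ and with a reference measure $\lambda$. Most often, $E$ is a subset of $\mathbb{R}^d$ for some $d > 0$ and $\lambda$ is the Lebesgue measure, although Haar measures can be considered more generally for locally-compact topological groups. We will consider the former since the main practical limitation of the usual definition of Fisher information does not come from the lack of a natural reference measure but instead from the irregularity of the probability distributions of interest. The usual setting is to assume that, for all $\theta \in \Theta$, $P_\theta$ is absolutely continuous with respect to $\lambda$, denoted $P_\theta \ll \lambda$. In this case, the probability density function $p_\theta$ can be defined as the Radon-Nikodym derivative
$$ p_\theta = \frac{\mathrm{d}P_\theta}{\mathrm{d}\lambda}, $$
that is, as the function on $E$ defined uniquely up to a $\lambda$-null set by
$$ P_\theta(A) = \int_A p_\theta(x)\,\lambda(\mathrm{d}x) $$
for all $A \in \mathcal{B}(E)$. In this situation, assuming that $p_\theta$ is differentiable with respect to $\theta$, the score is defined as $\partial_\theta \log p_\theta$ or indeed $\partial_\theta p_\theta / p_\theta$. Under the final assumption that the score is square integrable, the Fisher information (Lehmann & Casella, 1998) is defined as
$$ I(\theta) = \int \Big( \frac{\partial_\theta p_\theta(x)}{p_\theta(x)} \Big)^2 p_\theta(x)\,\lambda(\mathrm{d}x). \tag{1} $$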
The objective in this article is twofold. For some applications, it is necessary to relax the requirement that $P_\theta \ll \lambda$ holds for all $\theta \in \Theta$, or indeed for any $\theta$, and an appropriate definition of $I(\theta)$ is needed in these cases. Upon addressing this issue, our second objective is then to study the Fisher information of some observation models frequently used in multi-object tracking. Our starting point is the following generalisation of the score given in Heidergott & Vázquez-Abad (2008),
$$ \frac{\mathrm{d}P'_\theta}{\mathrm{d}P_\theta}, \tag{2} $$
where $P'_\theta$ is the (yet to be formally defined) derivative of the probability measure $P_\theta$ with respect to $\theta$ and the ratio in eq. 2 is the Radon-Nikodym derivative of $P'_\theta$ with respect to $P_\theta$. Heidergott & Vázquez-Abad (2008) introduced this definition of the score in the context of sensitivity analysis for performance measures of Markov chains (Rubinstein & Shapiro, 1993). We define the Fisher information using this expression for the score and then study the loss of information in the context of some statistical estimation problems arising in Engineering (see section 2). As shown in proposition 3, when the family $\{P_\theta\}_{\theta \in \Theta}$ has differentiable densities with respect to the Lebesgue measure, the Fisher information defined using the score in eq. 2 coincides with eq. 1.
The first problem, studied in section 2.1, concerns fitting a parametric model to random vectors which are observed through a sensor that randomly permutes the components of the vector. This problem arises in the context of multi-object tracking (Houssineau et al., 2017) where the random vector corresponds to recorded measurements from distinct objects (e.g. vehicles) being tracked using a radar. The radar is able to provide (noisy) measurements of the locations of these objects but without knowledge of the association of recorded measurements to the objects themselves. Our analysis involves studying a parametric model that does not have a common dominating measure and, through the proposed definition of the Fisher information, we provide a simple proof that association uncertainty results in a loss of information. This fact is surprisingly undocumented in the literature despite the numerous articles in Engineering on statistical inference for these types of models.
Multi-object observation models often also include thinning and clutter. Clutter consists of spurious observations, unrelated to the objects being tracked, generated by radar reflections from non-targets. Thinning is the random deletion of target-generated measurements, which models the occasional obscuring of targets by obstacles. The augmented set of thinned and spurious observations can be modelled as a spatial point process and section 2.2 concerns fitting a parametric model to a spatial point process that is observed under thinning and superposition. Like random permutation, thinning and superposition result in a loss of information, which is easily shown using the Fisher information defined via eq. 2 and its associated properties. These properties are invoked in the proofs in section 2 but are formally stated and proven in the final section, section 3.
2 Motivating examples
2.1 Random permutation of a random vector
Consider a parametric probability measure $P_\theta$, $\theta \in \Theta$. For each $\theta$, $P_\theta$ is the law of a random vector $X = (X_1, \dots, X_n)$ where each $X_i$ is in $\mathbb{R}^d$, i.e. $P_\theta$ is a probability measure on $\mathbb{R}^{dn}$. Assume $n$ and $d$ are fixed. Let $Y = \sigma(X)$, a random permutation of $X$, where $\sigma$ is a random variable with values in the set of permutations of $\{1, \dots, n\}$. Throughout this section, $\sigma(x)$ denotes the vector $(x_{\sigma(1)}, \dots, x_{\sigma(n)})$.
In multi-object tracking, each $X_i$ corresponds to a measurement of a distinct object being tracked; there are $n$ of them. The sensor acquiring $X$, e.g. a radar, returns the vector $X$ but with the association of observations to the targets lost, which can be modelled as $Y$. Filtering for such models has spawned an entire family of algorithms; see e.g. Blackman (1986); Bar-Shalom (1987).
The following theorem shows that the Fisher information of the law of , i.e. after the random permutation, is smaller than the Fisher information of . The concept of weak-differentiability will be defined formally in the next section.
Assume the family is weakly-differentiable. Then any random permutation of that is independent of incurs a loss of information, that is .
Let be the probability distribution of on , then a version of the conditional law of given is
for any . The fact that does not depend on follows from the independence of the random permutation from the parameter. From corollary 1 and lemma 1, the score corresponding to the extended model can then be expressed as
for all and all in . Note that is not absolutely continuous with respect to the Lebesgue measure on even when has a density with respect to the Lebesgue measure. Using the extension of the Fisher identity (see proposition 4), it follows that
with the marginal law of . Applying Jensen’s inequality to the function , we conclude that
which concludes the proof of the theorem. ∎
A different proof of this result has been proposed in Houssineau et al. (2017) using the standard formulation of Fisher information. However the proof presented here is remarkably concise and less tedious thanks to the possibility of defining in eq. 3 the score of the extended parametric model which does not have a common dominating measure. The final result then follows from the identity in eq. 4 and Jensen’s inequality.
It is not possible to establish a strict information loss in general, e.g. if is symmetrical or if is related to a summary statistic that is not affected by random permutation. Additional assumptions that guarantee a strict loss are given in Houssineau et al. (2017).
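As a rough numerical illustration of theorem 1, the following Python sketch considers a hypothetical two-component Gaussian model that is not taken from the article: $X_1 \sim N(\theta, 1)$ and $X_2 \sim N(0, 1)$ are independent, so that the Fisher information of $X$ equals 1, and the two components are swapped with probability 1/2. In this particular case the law of $Y$ happens to have a Lebesgue density, so the standard formula of eq. 1 can be approximated on a grid. All function names, the grid size and the finite-difference step are illustrative assumptions.

```python
import numpy as np


def normal_pdf(x, mean=0.0):
    """Density of a unit-variance Gaussian with the given mean."""
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2.0 * np.pi)


def permuted_density(y1, y2, theta):
    """Density of Y when the two components of X are swapped with probability 1/2."""
    return (0.5 * normal_pdf(y1, theta) * normal_pdf(y2)
            + 0.5 * normal_pdf(y1) * normal_pdf(y2, theta))


def fisher_information_permuted(theta, half_width=10.0, n_grid=801, eps=1e-4):
    """Grid approximation of the Fisher information of Y at theta (assumed model)."""
    grid = np.linspace(-half_width, half_width, n_grid)
    step = grid[1] - grid[0]
    y1, y2 = np.meshgrid(grid, grid, indexing="ij")
    density = permuted_density(y1, y2, theta)
    # Central finite difference in theta for the derivative of the density.
    d_density = (permuted_density(y1, y2, theta + eps)
                 - permuted_density(y1, y2, theta - eps)) / (2.0 * eps)
    # Integrand (p')^2 / p, guarded against division by a vanishing density.
    integrand = np.where(density > 1e-300,
                         d_density ** 2 / np.maximum(density, 1e-300),
                         0.0)
    return float(integrand.sum() * step ** 2)


if __name__ == "__main__":
    for theta in (0.0, 1.5, 3.0):
        print(theta, fisher_information_permuted(theta))
```

In this toy model the score of $Y$ at $\theta = 0$ is $(y_1 + y_2)/2$, so the information after permutation is $1/2$ there, compared with $1$ for $X$; for other values of $\theta$ the sketch only checks numerically that the value stays below $1$.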
2.2 Thinning and superposition of point processes
Spatial point processes are important in numerous applications (Baddeley et al., 2006), e.g. Forestry (Stoyan & Penttinen, 2000) and Epidemiology (Elliot et al., 2000). In addition, point process models are widely used in formulating multi-object tracking problems (Mahler, 2007) as they naturally account for an unknown number of objects which are observed indirectly without association and under thinning and superposition. We adopt the approach of the previous section but now characterise the Fisher information of a family of point processes, parametrised by $\theta$, observed under thinning and superposition. (Note that the loss of Fisher information in the presence of association uncertainty has already been established in section 2.1.)
Let denote a point process on with parametrised distribution on , with denotes an arbitrary isolated point representing the absence of points in the process. A realisation from is a random vector where both the number of points and their locations are random. However, point-process distributions on are not always absolutely continuous with respect to the corresponding Lebesgue measure. In particular, the distribution of a non-simple point process, which is a point process such that there is a positive probability of two or more points of its realisation, say and of , being identical; see Schoenberg (2006) for a discussion about non-simple point processes and examples, e.g. by duplicating the points in a realisation as discussed further below. Assuming that the family is weakly-differentiable, the Fisher information corresponding to the parametrised distribution of can then be expressed as
where is a probability mass function on characterising the number of points in and where is the conditional distribution of the location of the points in given that the number of points is (which is supported by ). A straightforward example is when is an independently identically distributed point process. Its distribution factorises as
for any and any , where is a probability measure on . Using the product rule of corollary 1 the expression of the Fisher information simplifies in the independently identically distributed case to
where is a random variable with distribution .
A trivial construction of a non-simple point process can be obtained from an independently identically distributed point process by duplicating its realisation. The resulting point process, denoted , has each point of present twice. The Fisher information of can be expressed with the proposed formulation in spite of the lack of absolute continuity with respect to the reference measure on . Indeed, the law of the point process is
and , where a probability measure supported by the diagonal of such that for any . One can verify that so that
from which it follows that , that is, duplicating each point in the point process does not change the Fisher information. In the context of parameter inference, this is in agreement with the natural approach of removing the duplicate points before estimating .
We now return to a general point process which is not necessarily independently identically distributed. For each , let denote the thinned version of where each point of its realisation is retained independently of the other points with probability . In multi-object tracking, an independently thinned point process arises because a radar can fail to return a credible observation for an object in its surveillance region.
Let be a point process characterised by a weakly-differentiable family of probability distributions parametrised by , then holds for any . If then the inequality is strict when .
The probability distribution of the thinned point process given can be expressed as
for any , any and any integers such that , with so that is the th element of . We obtain from the Fisher identity that the score associated with the point process with law verifies
where the use of as an argument of point-process distributions is possible because of the irrelevance of the points’ ordering. The proof of can now be concluded using the decomposition in (5) and invoking Jensen’s inequality as in theorem 1. The proof of the strict inequality is deferred to the Appendix. ∎
The decrease of the Fisher information demonstrated in theorem 2 can be quantified in the special case of an independently identically distributed point process as follows.
Let be an independently identically distributed point process characterised by a weakly-differentiable family of probability distributions parametrised by and assume its cardinality distribution does not depend on , then
for any .
The parameter of the distribution is omitted in this proof as a consequence of the assumption of independence. Additionally, thinning does not affect the common distribution of the points in so that, from (2.2), both point processes have and their terms are equal. Thus, denoting the random number of points in , the objective is to show that is greater than . It holds that the distribution of verifies
for any , so that
The second sum on the right-hand side can be recognised to be the second moment of a Bernoulli random variable so that
from which the result follows. ∎
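The effect of independent thinning on the number of points can be sketched numerically. The Python snippet below uses the standard binomial-mixture form of the cardinality distribution after thinning, $P(N' = k) = \sum_{n \geq k} P(N = n) \binom{n}{k} \alpha^k (1 - \alpha)^{n - k}$, where the retention probability $\alpha$ (`retain_prob`) and the example probability mass function are hypothetical choices made for illustration only.

```python
from math import comb

import numpy as np


def thinned_cardinality_pmf(pmf, retain_prob):
    """Return P(N' = k) after independent thinning, given pmf[n] = P(N = n)."""
    max_n = len(pmf) - 1
    thinned = np.zeros(max_n + 1)
    for n, mass in enumerate(pmf):
        for k in range(n + 1):
            # Each of the n points is retained independently with probability retain_prob.
            thinned[k] += (mass * comb(n, k)
                           * retain_prob ** k * (1.0 - retain_prob) ** (n - k))
    return thinned


def moments(pmf):
    """First and second moments of a pmf supported on {0, 1, ..., len(pmf) - 1}."""
    support = np.arange(len(pmf))
    return float(np.dot(support, pmf)), float(np.dot(support ** 2, pmf))


if __name__ == "__main__":
    original = np.array([0.1, 0.2, 0.3, 0.4])  # hypothetical cardinality distribution
    thinned = thinned_cardinality_pmf(original, retain_prob=0.7)
    print(moments(original), moments(thinned))
```

Under these assumptions the mean number of points is scaled by $\alpha$, while the second moment becomes $\alpha(1 - \alpha)\,\mathbb{E}[N] + \alpha^2\,\mathbb{E}[N^2]$, which is how the first two moments of the cardinality enter calculations of this kind.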
Proposition 1 sheds light on the source of the information loss when applying independent thinning to a point process: the quantity , which can be seen as a relative loss of Fisher information, is shown to be related to the first and second moments of the random variable associated with the number of points in the process. This is because the operation of thinning applied to the considered type of independently identically distributed point process incurs a loss of information only through the decrease of the number of points.
The focus is now on how information evolves when the points of are augmented with those of another point process which has a distribution not depending on $\theta$. In the context of multi-object observation models, the points being added to are spurious observations called clutter, which are unrelated to the objects being tracked, e.g. generated by radar reflections from non-targets. This, combined with the fact that the number of clutter points received is a priori unknown, shows that treating clutter as a $\theta$-independent point process is appropriate. Superposition is less straightforward than thinning since the resulting augmented point process will have an altered spatial distribution and cardinality distribution. However, the operation of superposition can be expressed as a Markov kernel that transforms to a new point process and this Markov kernel is independent of $\theta$. Thus the same approach as in theorem 2 can be applied to show that superposition (in general) also leads to a loss of Fisher information. In the following proposition, stands for the point process resulting from the superposition of with another point process .
Let be a point process characterised by a weakly-differentiable family of probability distributions parametrised by and let be another point process whose conditional distribution given does not depend on . Then .
Let be the conditional law of given , then the law of the point process given a realisation of is
for any . The desired can be now established by proceeding as in the proof of theorem 2; details are omitted. ∎
3 Fisher information via the weak derivative
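To start with, the derivative $P'_\theta$ has to be defined formally. For this purpose, we consider the following weak form of measure-valued differentiation (Pflug, 1992), where the notation $\mu(f)$ is used to denote the integral $\int f(x)\,\mu(\mathrm{d}x)$. Henceforth, the set $E$ will be assumed to be Polish with its Borel $\sigma$-algebra $\mathcal{B}(E)$.
Let $\{\mu_\theta\}_{\theta \in \Theta}$ be a parametric family of finite measures on $E$, then $\mu_\theta$ is said to be weakly differentiable at $\theta \in \Theta$ if there exists a signed finite measure $\mu'_\theta$ on $E$ such that
$$ \lim_{\epsilon \to 0} \frac{1}{\epsilon} \big( \mu_{\theta + \epsilon}(f) - \mu_\theta(f) \big) = \mu'_\theta(f) $$
holds for all bounded continuous functions $f$ on $E$.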
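The definition can be checked numerically on a simple family. The Python sketch below assumes, purely for illustration, the Gaussian location family $P_\theta = N(\theta, 1)$, for which the natural candidate for the weak derivative is the signed measure with density $(x - \theta)\,p_\theta(x)$; it compares the difference quotient of $P_\theta(f)$ with the integral of $f$ against that candidate for a bounded continuous test function. The grid, the step size and all names are assumptions of this sketch.

```python
import numpy as np

GRID = np.linspace(-12.0, 12.0, 24001)
STEP = GRID[1] - GRID[0]


def gaussian_density(x, theta):
    """Density of N(theta, 1)."""
    return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2.0 * np.pi)


def integrate_against_law(f, theta):
    """Riemann-sum approximation of P_theta(f)."""
    return float(np.sum(f(GRID) * gaussian_density(GRID, theta)) * STEP)


def difference_quotient(f, theta, eps=1e-5):
    """(P_{theta + eps}(f) - P_theta(f)) / eps, as in the definition."""
    return (integrate_against_law(f, theta + eps)
            - integrate_against_law(f, theta)) / eps


def candidate_weak_derivative(f, theta):
    """Integral of f against the signed measure with density (x - theta) p_theta(x)."""
    signed_density = (GRID - theta) * gaussian_density(GRID, theta)
    return float(np.sum(f(GRID) * signed_density) * STEP)


if __name__ == "__main__":
    theta = 0.5
    print(difference_quotient(np.tanh, theta),
          candidate_weak_derivative(np.tanh, theta))
```

Taking $f$ constant gives a value close to zero for the candidate derivative, reflecting the fact that the weak derivative of a family of probability measures has total mass zero.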
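Although the signed measure $\mu'_\theta$ is only characterised by the mass it gives to bounded continuous functions, one can show that this characterisation is sufficient to define $\mu'_\theta$ on the whole Borel $\sigma$-algebra $\mathcal{B}(E)$, see lemma 2 in the Appendix.
Assuming that $P_\theta$ has a derivative $P'_\theta$ at $\theta$, that $P'_\theta$ is absolutely continuous with respect to $P_\theta$, and that the square of the score is integrable, the Fisher information is defined to be
$$ I(\theta) = \int \Big( \frac{\mathrm{d}P'_\theta}{\mathrm{d}P_\theta}(x) \Big)^2 P_\theta(\mathrm{d}x). $$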
Simple cases where this more versatile definition of Fisher information is useful can be given using Dirac measures on the real line as in the following examples.
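Consider $P_\theta = \theta\,\delta_a + (1 - \theta)\,\delta_b$ with $\theta \in (0, 1)$, for some given $a, b \in \mathbb{R}$ with $a \neq b$. In this case, $P_\theta$ is not absolutely continuous with respect to the natural reference measure on the real line, the Lebesgue measure $\lambda$. However,
$$ P'_\theta = \delta_a - \delta_b, $$
which is a signed measure and
$$ \frac{\mathrm{d}P'_\theta}{\mathrm{d}P_\theta}(x) = \begin{cases} 1/\theta & \text{if } x = a, \\ -1/(1 - \theta) & \text{if } x = b, \\ 0 & \text{otherwise,} \end{cases} $$
where the Radon-Nikodym derivative is assumed without loss of generality to be equal to $0$ everywhere it is not uniquely defined. It follows from basic calculations that
$$ I(\theta) = \frac{1}{\theta} + \frac{1}{1 - \theta} = \frac{1}{\theta(1 - \theta)}. $$
This unsurprisingly is the Fisher information of a Bernoulli experiment with probability of success equal to $\theta$. Example 2 is meant to be an illustrative calculation executing the definition of $I(\theta)$: indeed the same result can be recovered by simply restricting the domain of definition of $P_\theta$ to the set $\{a, b\}$ for all $\theta$. The following result illustrates a usual setting in which one would expect both definitions of the Fisher information to coincide.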
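The calculation for a purely atomic family can be reproduced with a few lines of Python. The sketch below assumes a two-atom family with weights $\theta$ and $1 - \theta$, which is consistent with the Bernoulli interpretation given in the text but is otherwise an assumption of this example; the weak derivative on each atom is approximated by a central finite difference of the weight, and the helper names are hypothetical.

```python
def dirac_mixture_weights(theta):
    """Weights of theta * delta_a + (1 - theta) * delta_b on the atoms (a, b)."""
    return (theta, 1.0 - theta)


def fisher_information_atoms(weights_fn, theta, eps=1e-6):
    """Fisher information of an atomic family via the score dP'_theta / dP_theta."""
    weights = weights_fn(theta)
    upper = weights_fn(theta + eps)
    lower = weights_fn(theta - eps)
    info = 0.0
    for w, w_up, w_down in zip(weights, upper, lower):
        # Mass that the weak derivative gives to this atom (finite difference).
        d_w = (w_up - w_down) / (2.0 * eps)
        if w <= 0.0:
            if abs(d_w) > 1e-9:
                # The derivative charges an atom that P_theta does not: no score exists.
                raise ValueError("derivative not absolutely continuous w.r.t. P_theta")
            continue
        score = d_w / w  # Radon-Nikodym derivative evaluated on the atom
        info += score ** 2 * w
    return info


if __name__ == "__main__":
    theta = 0.3
    print(fisher_information_atoms(dirac_mixture_weights, theta),
          1.0 / (theta * (1.0 - theta)))
```

No reference measure on the real line is needed here: only the weights of the atoms and their derivatives enter the computation.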
For some dominating measure , assume for all and let denote its density. For each , assume is differentiable w.r.t. and
for all and -almost all where is some integrable function on . Then .
The assumption of eq. 9 is often invoked in the analysis of maximum likelihood estimation (Douc et al., 2004; Dean et al., 2014) to interchange the order of integration and differentiation, and is thus not specific to this work. An alternative to the assumption in eq. 9 is to assume that the mapping is a continuous function of . This will imply
Recalling that the probability density function of with respect to is defined as
for all , it follows from Leibniz’s rule that
for any bounded continuous mappings on , and we conclude that is the Radon-Nikodym derivative of with respect to . Rewriting the Fisher information as
concludes the proof of the proposition. ∎
The proposed expression of Fisher information can be easily extended to cases where the parameter is vector-valued: each component of the Fisher information matrix can be simply defined based on the partial version of the weak differentiation introduced in definition 1.
Another Polish space is now considered in order to study the Fisher information for probability measures on product spaces. A function on is said to be a signed kernel from to if is a signed finite measure for all and if is measurable for all (with equipped with the Borel -algebra, which will be considered by default). If, in particular, is a probability measure for all then is said to be a Markov kernel. If is a probability measure on then we denote by the probability measure on characterised by for all in the product -algebra . A family of Markov kernels from to is said to be weakly-differentiable if the measure is weakly-differentiable for all and for all ; it is additionally said to be bounded weakly-differentiable if
where the supremum is taken over all bounded continuous functions. If the latter condition is satisfied, then is itself a signed kernel (see Heidergott et al., 2008, theorem 1). Some technical results are first required.
A formal approach to the weak differentiability of product measures has been considered in Heidergott & Leahu (2010) and we consider here an easily-proved corollary of (Heidergott & Leahu, 2010, theorem 6.1).
Let be a weakly-differentiable parametric family of probability measures on and let be a bounded weakly-differentiable parametric family of Markov kernels from to , then
Corollary 1 was used on several occasions in the examples of section 2 for the special case where the kernel does not depend on , that is . In these examples, the key argument was the simplification of terms that appear both in the numerator and denominator of the score function, using the following lemma.
Let and be finite signed measures on such that and let and be signed kernels from to such that for all , then
for -almost every .
Denoting the Radon-Nikodym derivative of by , it holds by definition that
for all , so that
which implies that, for all , it holds that
for -almost every . Since is a Polish space, there exists a countable collection of subsets of that is a -system and that is generating . Equation 12 implies that for all , there exists a subset of with full -measure such that is true for all . Considering the countable intersection , it follows that the statement of interest is true for all and all . To prove the equality of the measures defined on each side of eq. 12 it is sufficient to prove their equality on a -system as demonstrated. We conclude that is also the Radon-Nikodym derivative of by for -almost every , which proves the first result. The second result can be proved in a similar but simpler way. ∎
Now assuming that the interest is in the marginal law of on , it is often easier to express as
for any . In this case, the score can be computed as in the following proposition.
Proposition 4 (Fisher identity).
Let be the law of a random variable from to defined as the marginal of the law of on , and let and be respectively weakly-differentiable and bounded weakly-differentiable, then
with the conditional expectation for a given .
For any , the marginal is simply the probability measure , so that the family inherits weak-differentiability from and . The derivative can then be characterised for all by
Recalling that concludes the proof of the proposition. ∎
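The Fisher identity can be illustrated on a linear Gaussian toy model, which is an assumption of this sketch rather than an example from the article: $X \sim N(\theta, 1)$ and, given $X = x$, $Y \sim N(x, 1)$ through a kernel that does not depend on $\theta$. The joint score is then $x - \theta$, the marginal law of $Y$ is $N(\theta, 2)$ with score $(y - \theta)/2$, and the Python code below checks on a grid that the conditional expectation of the joint score given $Y = y$ matches the marginal score. The grid and the function names are illustrative.

```python
import numpy as np

GRID = np.linspace(-15.0, 15.0, 30001)


def normal_pdf(x, mean, var=1.0):
    """Density of N(mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def conditional_expectation_of_score(y, theta):
    """E[X - theta | Y = y], computed by integrating over the hidden variable X."""
    # Unnormalised conditional density of X given Y = y: prior times kernel.
    weights = normal_pdf(GRID, theta) * normal_pdf(y, GRID)
    joint_score = GRID - theta  # the kernel does not depend on theta
    return float(np.sum(joint_score * weights) / np.sum(weights))


def marginal_score(y, theta):
    """Score of the marginal law N(theta, 2) of Y."""
    return (y - theta) / 2.0


if __name__ == "__main__":
    theta = 0.4
    for y in (-1.0, 0.3, 2.5):
        print(y, conditional_expectation_of_score(y, theta), marginal_score(y, theta))
```

The grid spacing cancels in the ratio of the two sums, so no explicit integration step is needed.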
The Fisher identity is particularly important when the interest is in the Fisher information with respect to the successive observations of a state space model (Douc et al., 2004; Dean et al., 2014), in which case it is defined as the limit
where refers to the time horizon and where is the stationary distribution of the observation process.
The results of corollary 1 and lemma 1 also lead to the following extension of a known property of Fisher information, involving the Fisher information of a random variable calculated with respect to the conditional law of given another random variable , defined as
where is the law of and where
is assumed to be a measurable mapping, with a Markov kernel identified with the conditional law of given . Note that making the law of dependent on the parameter does not induce any difficulties.
Let and be random variables on a common probability space whose laws are parametrised by , let the family of laws of be weakly-differentiable, and let the family of laws of given be bounded and weakly-differentiable, then the Fisher information corresponding to the law of can be expressed as
where and correspond to the random variables and respectively.
S.S. Singh would like to thank Prof. Ioannis Kontoyiannis for helpful remarks. All authors were supported by Singapore Ministry of Education AcRF tier 1 grant R-155-000-182-114. AJ is affiliated with the Risk Management Institute, OR and analytics cluster and the Center for Quantitative Finance at NUS.
Proofs and technical details
Proof of strict inequality in theorem 2.
Jensen’s inequality is strict unless it is applied to a non-strictly-convex function or to a degenerate random variable. In the context of theorem 2, the involved function is so that we only have to verify that the random variable
is not -measurable:
We can rule out (for some constant ) almost surely as follows: since , it follows that But this violates the assumption that .
Since is not a constant almost surely, there exists a set such that . Then
since almost surely where denotes the number of points in and where is the indicator of the event . We can similarly show that eq. 15 holds with replaced with . Thus
which violates the following fact: Let and be integrable random variables, assume is an atom of and has positive probability. If is measurable then is either or equal to .
Let be a finite signed measure on a metric space characterised by the value of for all bounded continuous mappings on . Then is uniquely defined on .
Let be another finite signed measure that is characterised by for all bounded continuous functions on . We first prove that and agree on the closed subsets of . Let be the metric on and let denote the usual distance between a point and set . Let be the continuous function for some closed set and some where denotes the positive part of a function . Note that is a continuous function that approximates and