Information cohomology of classical vector-valued observables

We provide here a novel algebraic characterization of two information measures associated with a vector-valued random variable, its differential entropy and the dimension of the underlying space, purely based on their recursive properties (the chain rule and the nullity-rank theorem, respectively). More precisely, we compute the information cohomology of Baudot and Bennequin with coefficients in a module of continuous probabilistic functionals over a category that mixes discrete observables and continuous vector-valued observables, characterizing completely the 1-cocycles; evaluated on continuous laws, these cocycles are linear combinations of the differential entropy and the dimension.

02/08/2021


1 Introduction

Baudot and Bennequin [Baudot2015] introduced information cohomology and identified Shannon entropy as a nontrivial cohomology class in degree 1. This cohomology has an explicit description in terms of cocycles and coboundaries; the cocycle equations are rules that relate different values of a cocycle. When the coefficients form a module of measurable probabilistic functionals on a category of discrete observables, Shannon's entropy defines a 1-cocycle and the aforementioned rule is simply the chain rule; moreover, the cocycle equations in every degree are systems of functional equations, and one can use the techniques developed by Tverberg, Lee, Kannappan, Aczél, Daróczy, etc. [Aczel1975] to show that, in degree one, the entropy is the unique measurable, nontrivial solution. The theory thus gave a new algebraic characterization of this information measure, based on topos theory à la Grothendieck, and showed that the chain rule is its defining algebraic property.

It is natural to wonder if a similar result holds for the differential entropy. We consider here information cohomology with coefficients in a presheaf of continuous probabilistic functionals on a category that mixes discrete and continuous (vector-valued) observables, and establish that every 1-cocycle, when evaluated on probability measures absolutely continuous with respect to the Lebesgue measure, is a linear combination of the differential entropy and the dimension of the underlying space (the term continuous has three different meanings in this sentence). We already showed that this holds for gaussian measures [Vigneaux2019-thesis]; in that case, there is a finite-dimensional parametrization of the laws, and we were able to use Fourier analysis to solve the 1-cocycle equations. Here we exploit that result, expressing any density as a limit of gaussian mixtures (i.e. convex combinations of gaussian densities), and then using the 1-cocycle condition to compute the value of a 1-cocycle on gaussian mixtures in terms of its values on discrete laws and gaussian laws. The result depends on the conjectural existence of a "well-behaved" class of probabilities and probabilistic functionals; see Section 3.2.

The dimension appears here as an information quantity in its own right: its "chain rule" is the nullity-rank theorem. In retrospect, its role as an information measure is already suggested by old results in information theory. For instance, the expansion of Kolmogorov's ε-entropy of a continuous, ℝ^d-valued random variable "is determined first of all by the dimension of the space, and the differential entropy appears only in the form of the second term of the expression for it." [Kolmogorov1993, Paper 3, p. 22]
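The nullity-rank theorem acting as a "chain rule" for the dimension can be illustrated numerically; the following sketch (using numpy; the rank-2 construction is an illustrative assumption, not part of the original) shows the additive splitting of the dimension of the domain into rank and nullity:

```python
import numpy as np

# The "chain rule" of the dimension is the nullity-rank theorem:
# for a linear map T : V -> W,  dim V = rank(T) + dim ker(T).
# Build a map R^5 -> R^3 of known rank 2 by composing through R^2.
rng = np.random.default_rng(0)
T = rng.standard_normal((3, 2)) @ rng.standard_normal((2, 5))

rank = np.linalg.matrix_rank(T)
nullity = T.shape[1] - rank  # dimension of the kernel

assert rank == 2
assert rank + nullity == 5   # dimension of the domain splits additively
```

The splitting mirrors the cocycle equation: the "information" of the joint space decomposes into the part captured by the image and the part left in the fibers.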

2 Some known results about information cohomology

Given the severe length constraints, it is impossible to report here the motivations behind information cohomology, its relationship with traditional algebraic characterizations of entropy, and its topos-theoretic foundations. For that, the reader is referred to the introductions of [Vigneaux2019-thesis] and [Vigneaux2020information]. We simply recall here a minimum of definitions in order to make sense of the characterization of 1-cocycles that is used later in the article.

Let S be a partially ordered set (poset); we see it as a category, denoting the order relation by an arrow. It is supposed to have a terminal object ⊤ and to satisfy the following property: whenever X, Y ∈ Ob S are such that Z → X and Z → Y for some Z, the categorical product X ∧ Y exists in S. An object X of S (i.e. X ∈ Ob S) is interpreted as an observable, an arrow X → Y as Y being coarser than X, and X ∧ Y as the joint measurement of X and Y.

The category S is just an algebraic way of encoding the relationships between observables. The measure-theoretic "implementation" of them comes in the form of a functor E that associates to each X ∈ Ob S a measurable set (E_X, B_X), and to each arrow X → Y in S a measurable surjection E_X → E_Y. To be consistent with the interpretations given above, one must suppose that E_⊤ is a point and that E_{X∧Y} is mapped injectively into E_X × E_Y by the product of the corresponding surjections. We consider mainly two examples: the discrete case, in which E_X is finite and B_X is the collection of its subsets, and the Euclidean case, in which E_X is a Euclidean space and B_X is its Borel σ-algebra. The pair (S, E) is an information structure.

Throughout this article, conditional probabilities are understood as disintegrations. Let μ be a σ-finite measure on a measurable space (E, B), and ν a σ-finite measure on (E_T, B_T). The measure μ has a disintegration {μ_t}_{t∈E_T} with respect to a measurable map T: E → E_T and ν, or a (T, ν)-disintegration, if each μ_t is a σ-finite measure on E concentrated on the fiber T⁻¹(t)—i.e. μ_t(E ∖ T⁻¹(t)) = 0 for ν-almost every t—and, for each measurable nonnegative function f on E, the mapping t ↦ μ_t(f) is measurable and μ(f) = ν(μ_t(f)) [Chang1997].

We associate to each X ∈ Ob S the set Π(X) of probability measures on E_X, i.e. of possible laws of X, and to each arrow π: X → Y the marginalization map Π(X) → Π(Y) that maps ρ to the image measure π_∗ρ. More generally, we consider any subfunctor Q of Π that is stable under conditioning: for all X ∈ Ob S, Y ∈ S_X, and ρ ∈ Q(X), the conditional law ρ|_{Y=y} belongs to Q(X) for π_{Y∗}ρ-almost every y, where {ρ|_{Y=y}}_y is the corresponding disintegration of ρ.

We associate to each X ∈ Ob S the set S_X of objects coarser than X, which is a monoid under the product ∧ introduced above. The assignment X ↦ S_X defines a contravariant functor (presheaf). The induced monoid algebras ℝ[S_X] give a presheaf A. An A-module is a collection of modules over ℝ[S_X], one for each X ∈ Ob S, with an action that is "natural" in X. The main example is the following: for any adapted probability functor Q, one introduces a contravariant functor F = F(Q) declaring that F(X) are the measurable functions on Q(X), and F(π) is precomposition with the marginalization π_∗ for each morphism π in S. The monoid S_X acts on F(X) by the rule:

 ∀Y ∈ S_X, ∀ρ ∈ Q(X),   Y.ϕ(ρ) = ∫_{E_Y} ϕ(ρ|_{Y=y}) dπ^X_{Y∗}ρ(y)   (1)

where π^X_Y stands for the surjection induced by Y ∈ S_X under E. This action can be extended by linearity to ℝ[S_X] and is natural in X.

In [Vigneaux2020information], the information cohomology is defined using derived functors in the category of A-modules, and then described explicitly, for each degree n, as a quotient of n-cocycles by n-coboundaries. For n = 1, the coboundaries vanish, so we simply have to describe the cocycles. Let B be the A-module freely generated by a collection of bracketed symbols [Y], one for each Y ∈ S_X; an arrow X → X′ induces an inclusion B(X′) ⊂ B(X), so B is a presheaf. A 1-cochain is a natural transformation φ: B → F, with components φ_X: B(X) → F(X); we use φ_X[Y] as a shorthand for φ_X([Y]). The naturality implies that φ_X[Y](ρ) equals φ_Y[Y](π^X_{Y∗}ρ), a property that [Baudot2015] called locality; sometimes we write Φ_Y instead of φ_Y[Y]. A 1-cochain φ is a 1-cocycle iff

 ∀X ∈ Ob S, ∀X₁, X₂ ∈ S_X,   φ_X[X₁ ∧ X₂] = X₁.φ_X[X₂] + φ_X[X₁].   (2)

Remark that this is an equality of functions in F(X).

An information structure is finite if E_X is finite for all X ∈ Ob S. In this case, [Vigneaux2020information, Prop. 4.5.7] shows that, whenever an object W can be written as a product X₁ ∧ X₂ and this product is "close" to W, as formalized by the definition of nondegenerate product [Vigneaux2020information, Def. 4.5.6], then there exists K ∈ ℝ such that for all such W and all ρ in Q(W)

 Φ_W(ρ) = −K ∑_{w∈E_W} ρ(w) log ρ(w).   (3)
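As a sanity check (a sketch, not part of the original text), the 1-cocycle equation (2) for the functional (3) with K = 1 is exactly Shannon's chain rule H(X₁ ∧ X₂) = H(X₂) + ∑_y p(y) H(X₁ | X₂ = y); numerically, for an arbitrary joint law:

```python
import numpy as np

def shannon(p):
    """Shannon entropy -sum p log p (natural log), the 1-cocycle of (3) with K = 1."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Joint law of two discrete observables X1 (rows) and X2 (columns).
joint = np.array([[0.10, 0.20, 0.05],
                  [0.25, 0.15, 0.25]])

p2 = joint.sum(axis=0)    # marginal law of X2
cond = joint / p2         # columns: conditional laws of X1 given X2 = y

# Cocycle equation (2): H(X1 ∧ X2) = H(X2) + E_y[ H(X1 | X2 = y) ].
lhs = shannon(joint.ravel())
rhs = shannon(p2) + sum(p2[y] * shannon(cond[:, y]) for y in range(3))
assert np.isclose(lhs, rhs)
```

The identity holds exactly, not merely numerically; the assertion only illustrates the algebraic content of (2) on the discrete sector.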

The continuous case is of course more delicate. In the case of observables taking values in vector spaces, and gaussian laws, [Vigneaux2019-thesis] treated it as follows. We start with a vector space E with Euclidean metric ⟨·,·⟩, and a poset S of vector subspaces of E, ordered by inclusion, satisfying the hypotheses stated above; remark that the product ∧ corresponds to intersection. Then we introduce E by V ↦ E_V := E/V, and further identify E/V with V^⊥ using the metric (so that we only deal with vector subspaces of E). We also introduce a sheaf of supports A, such that each A(V) consists of affine subspaces of E_V and is closed under intersections; the sheaf is supposed to be closed under the projections induced by E and to contain the fibers of all these projections. On each affine subspace A there is a unique Lebesgue measure μ_A induced by the metric ⟨·,·⟩. We consider a sheaf Q such that the elements of Q(V) are probability measures that are absolutely continuous with respect to μ_A for some A ∈ A(V) and have a gaussian density with respect to it. We also introduce a subfunctor F of the measurable probabilistic functionals, made of functions that grow moderately (i.e. at most polynomially) with respect to the mean, in such a way that the integral (1) is always convergent. Ref. [Vigneaux2019-thesis] called a triple (S, A, Q) sufficiently rich when there are "enough supports", in the sense that one can perform marginalization and conditioning with respect to projections on subspaces generated by elements of at least two different bases of E. In this case, we showed that for every 1-cocycle φ with coefficients in F, there are real constants a and c such that, for every X ∈ Ob S and every gaussian law ρ with support A ∈ A(X) and variance Σ (a nondegenerate, symmetric, positive bilinear form on the direction of A),

 Φ_X(ρ) = a log det(Σ) + c dim E_X.   (4)

Moreover, φ is completely determined by its behavior on nondegenerate laws. (The metric is enough to define the determinant [Vigneaux2019-thesis, Sec. 11.2.1], but the latter can also be computed w.r.t. a basis in which the metric is represented by the identity matrix.)
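The closed form for the differential entropy of a gaussian law, S(G_{M,Σ}) = ½ log det Σ + (d/2) log(2πe), underlies the linear combination of log det Σ and the dimension appearing above; it can be checked by Monte Carlo (a sketch; the covariance below is an arbitrary illustrative choice, not from the original):

```python
import numpy as np

# Monte-Carlo check of S(f) = -E[log f(X)] for a gaussian N(m, Sigma) on R^d,
# against the closed form (1/2) log det Sigma + (d/2) log(2*pi*e).
rng = np.random.default_rng(1)
d = 3
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)     # a nondegenerate covariance (illustrative)
m = rng.standard_normal(d)

X = rng.multivariate_normal(m, Sigma, size=200_000)
diff = X - m
inv = np.linalg.inv(Sigma)
logdet = np.linalg.slogdet(Sigma)[1]
# log-density of N(m, Sigma) evaluated at each sample
log_f = -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff) \
        - 0.5 * (d * np.log(2 * np.pi) + logdet)

S_mc = -log_f.mean()
S_closed = 0.5 * logdet + 0.5 * d * np.log(2 * np.pi * np.e)
assert abs(S_mc - S_closed) < 0.05
```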

3 An extended model

3.1 Information structure, supports, and reference measures

In this section, we introduce a more general model that allows us to "mix" discrete and continuous variables. It is simply the product of a structure of discrete observables and a structure of continuous ones.

Let (S₁, E₁) be a finite information structure such that, for every X ∈ Ob S₁, there exist X₁, X₂ with X = X₁ ∧ X₂, and for every X there is a W that can be written as a nondegenerate product and such that W → X; this implies that (3) holds for every X ∈ Ob S₁. Let (S₂, A, Q₂) be a sufficiently rich triple in the sense of the previous section, associated to a vector space E with metric ⟨·,·⟩, so that (4) holds for every object of S₂.

Let S = S₁ × S₂ be the product in the category of information structures, see [Vigneaux2020information, Prop. 2.2.2]. By definition, every object of S has the form X₁ ∧ X₂ for X₁ ∈ Ob S₁ and X₂ ∈ Ob S₂, with E_{X₁∧X₂} = E_{X₁} × E_{X₂}, and there is an arrow X₁ ∧ X₂ → Y₁ ∧ Y₂ in S if and only if there exist arrows X₁ → Y₁ in S₁ and X₂ → Y₂ in S₂; under the functor E, such an arrow is mapped to the product surjection E_{X₁} × E_{X₂} → E_{Y₁} × E_{Y₂}. There is an embedding in the sense of information structures, see [Vigneaux2019-thesis, Sec. 1.4], S₂ → S, X₂ ↦ ⊤ ∧ X₂; we call its image the "continuous sector" of S; we write X₂ instead of ⊤ ∧ X₂ and E_{X₂} instead of E_{⊤∧X₂}. The "discrete sector" is defined analogously.

We extend the sheaf of supports to the whole structure: A(⊤ ∧ X₂) := A(X₂) when X₂ ∈ Ob S₂, and then A(X₁ ∧ X₂) := { B × A : B ⊂ E_{X₁}, A ∈ A(X₂) } for any X₁ ∈ Ob S₁. The resulting A is a functor on S closed under projections and intersections, that contains the fibers of every projection.

For every object Z of S and every A ∈ A(Z), there is a unique reference measure μ_{Z,A} compatible with E: on the continuous sector it is the Lebesgue measure given by the metric on the affine subspaces of E, on the discrete sector it is the counting measure restricted to the given support, and for any product support A₁ × A₂, with A₁ discrete and A₂ affine, it is just the product of the two, each factor being the image of the corresponding reference measure under the inclusion into E_Z. We write μ_A instead of μ_{Z,A} when no confusion can arise.

The disintegration of the counting measure into counting measures is trivial. The disintegration of the Lebesgue measure on a support A under a projection of vector spaces is given by the Lebesgue measures on the fibers of the projection. We recall that we are in a framework where all the relevant spaces are identified with subspaces of E, which has a metric ⟨·,·⟩; the disintegration formula is just Fubini's theorem.

To see that disintegrations exist under any arrow of the category, consider first an object Z = X ∧ Y and the arrows Z → X and Z → Y, where X belongs to the continuous sector and Y to the discrete one. By definition E_Z = E_X × E_Y, and the canonical projections π₁: E_Z → E_X and π₂: E_Z → E_Y are the images under E of Z → X and Z → Y, respectively. Set A ∈ A(X) and B ⊂ E_Y. According to the previous definitions, the support A × B has reference measure μ_{X,A} ⊗ ν_B, where ν_B is the image of the counting measure on B under the inclusion B ⊂ E_Y. Hence, by definition, {μ_{X,A} ⊗ δ_y}_{y∈B} is a (π₂, ν_B)-disintegration of μ_{Z,A×B}. Similarly, μ_{Z,A×B} has as (π₁, μ_{X,A})-disintegration the family of measures {δ_x ⊗ ν_B}_{x∈A}, where each δ_x ⊗ ν_B is the counting measure restricted to the fiber {x} × B.

More generally, the disintegration of the reference measure on a support A × B of Z = X ∧ Y under an arrow Z → X′ ∧ Y′ is the collection of measures indexed by (x′, y′) such that

 μ_{Z,A×B,(x′,y′)} = ∑_{y ∈ π₂⁻¹(y′)} μ^y_{X,A,x′}   (5)

where μ^y_{X,A,x′} is the image measure, under the inclusion E_X ≅ E_X × {y} ⊂ E_Z, of the measure μ_{X,A,x′} that comes from the disintegration of μ_{X,A} under the projection E_X → E_{X′}.

3.2 Probability laws and probabilistic functionals

Consider the subfunctor Q of Π that associates to each Z ∈ Ob S the set Q(Z) of probability measures on E_Z that are absolutely continuous with respect to the reference measure μ_{Z,A} on some A ∈ A(Z). We define the (affine) support or carrier of ρ ∈ Q(Z), denoted supp ρ, as the unique A ∈ A(Z) such that ρ is absolutely continuous with respect to μ_{Z,A}.

In this work, we want to restrict our attention to subfunctors Q of probability laws such that:

1. Q is stable under marginalization and conditioning;

2. for each ρ ∈ Q, the differential entropy exists, i.e. it is finite;

3. when restricted to probabilities in Q with the same carrier A, the differential entropy is a continuous functional in the total variation norm;

4. for each Z ∈ Ob S and each A ∈ A(Z), the gaussian mixtures carried by A are contained in Q(Z)—cf. next section.

Problem 1

The characterization of the functors Q that satisfy properties 1.–4.

Below we use kernel estimates, which interact nicely with the total variation norm. This norm is defined for every signed measure μ on (E, B) by ‖μ‖_{TV} = sup_{B∈B} |μ(B)|. Let F be a real-valued functional defined on probability measures with a fixed carrier A, and D the space of densities with total mass 1, i.e. f ≥ 0 such that ∫ f dμ_A = 1; the continuity of F in the total variation distance is equivalent to the continuity of the corresponding functional on D in the L¹(μ_A)-norm, because of Scheffé's identity [Devroye2002, Thm. 45.4].
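Scheffé's identity can be illustrated numerically for discrete laws, where the total variation distance is a maximum over finitely many events (a self-contained sketch, not part of the original):

```python
import numpy as np
from itertools import chain, combinations

# Scheffé's identity: for probability densities p, q w.r.t. a common
# reference measure,  sup_B |P(B) - Q(B)| = (1/2) ∫ |p - q|.
# Brute-force check over all events B of a four-point space.
p = np.array([0.10, 0.40, 0.20, 0.30])
q = np.array([0.25, 0.25, 0.25, 0.25])

events = chain.from_iterable(combinations(range(4), r) for r in range(5))
tv = max(abs(p[list(B)].sum() - q[list(B)].sum()) for B in events)

assert np.isclose(tv, 0.5 * np.abs(p - q).sum())
```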

The characterization referred to in Problem 1 might involve the densities or the moments of the laws. That is the case with the main result that we found concerning convergence of the differential entropy and its continuity in total variation [Ghourchian2017]. Or it might resemble Otáhal's result [Otahal1994]: if densities f_n tend to f in L¹ and ∫|x|^α f_n(x) dx → ∫|x|^α f(x) dx for some α > 0, then S(f_n) → S(f).

For each Z ∈ Ob S, let F(Z) be the vector space of measurable functions of ρ ∈ Q(Z), equivalently of the pair (A, dρ/dμ_A), where A is an element of A(Z), μ is a global determination of the reference measure on any affine subspace given by the metric, and μ_A is the corresponding reference measure on the carrier A of ρ under this determination.

We want to restrict our attention to functionals for which the action (1) is integrable. Of course, these restrictions depend on the answer to Problem 1.

Problem 2

What are the appropriate restrictions on the functionals to guarantee the convergence of (1)?

4 Computation of 1-cocycles

4.1 A formula for gaussian mixtures

Let Q be a probability functor satisfying conditions 1.–4. of Subsection 3.2, and F a linear subfunctor of the measurable probabilistic functionals such that (1) converges for laws in Q. In this section, we compute the value of any 1-cocycle on gaussian mixtures.

Consider a generic object Z = X ∧ Y of S, with X in the continuous sector and Y in the discrete one; we write everywhere X and Y instead of X ∧ ⊤ and ⊤ ∧ Y. We suppose that E_X is a Euclidean space of dimension d. Recall that E_Y is a finite set and E_Z = E_X × E_Y. Let {G_{M_y,Σ_y}}_{y∈E_Y} be gaussian densities on E_X (with means M_y and covariances Σ_y), and p a density on E_Y; then r(x,y) := p(y) G_{M_y,Σ_y}(x) defines a probability measure ρ on E_Z, absolutely continuous with respect to μ_Z, with density r. We have that π^Z_{Y∗}ρ has density p with respect to the counting measure on E_Y, whereas π^Z_{X∗}ρ is absolutely continuous with respect to μ_X (see [Chang1997, Thm. 3]) with density

 μ^x_Z(r) = ∫ (dρ/dμ_Z) dμ^x_Z = ∑_{y∈E_Y} p(y) G_{M_y,Σ_y}(x),   (6)

i.e. it is a gaussian mixture. For conciseness, we utilize here linear-functional notation for some integrals, e.g. μ^x_Z(r) = ∫ r dμ^x_Z. The measure ρ has a π₁-disintegration into probability laws {ρ^x}_{x∈E_X}, such that each ρ^x is concentrated on {x} × E_Y and

 ρ^x(y) = p(y) G_{M_y,Σ_y}(x) / ∑_{y′∈E_Y} p(y′) G_{M_{y′},Σ_{y′}}(x).   (7)
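The disintegration (7) is just Bayes' theorem for the mixture; a minimal numerical sketch (illustrative parameters, not from the original) for a two-component mixture on the real line:

```python
import numpy as np

def gauss(x, m, s2):
    """One-dimensional gaussian density with mean m and variance s2."""
    return np.exp(-(x - m) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

p = np.array([0.3, 0.7])                 # density of Y on a two-point set
means, var = np.array([-1.0, 2.0]), np.array([0.5, 1.5])

x = 0.4
joint = p * gauss(x, means, var)         # r(x, y) = p(y) G_{M_y, Σ_y}(x)
mixture = joint.sum()                    # marginal density (6) at x
rho_x = joint / mixture                  # conditional law (7), a probability on E_Y

assert np.isclose(rho_x.sum(), 1.0)
# Bayes consistency: r(x, y) = ρ^x(y) · (marginal density at x)
assert np.allclose(joint, rho_x * mixture)
```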

By virtue of the cocycle condition (recall that Z = X ∧ Y),

 Φ_Z(ρ) = φ_Z[Y](ρ) + ∑_{y∈E_Y} p(y) φ_Z[X](G_{M_y,Σ_y} μ^y_X).   (8)

Locality implies that φ_Z[Y](ρ) = Φ_Y(π^Z_{Y∗}ρ), and the latter equals −b ∑_{y∈E_Y} p(y) log p(y), for some b ∈ ℝ, in virtue of the characterization of cocycles restricted to the discrete sector. Similarly, φ_Z[X](G_{M_y,Σ_y} μ^y_X) = a log det(Σ_y) + c dim E_X for some a, c ∈ ℝ, since our hypotheses are enough to characterize the value of any cocycle restricted to the gaussian laws on the continuous sector. Hence

 Φ_Z(ρ) = −b ∑_{y∈E_Y} p(y) log p(y) + ∑_{y∈E_Y} p(y) (a log det(Σ_y) + c dim E_X).   (9)

Remark that the same argument can be applied to arbitrary densities {f_y}_{y∈E_Y} instead of gaussian ones. Thus, it is enough to determine Φ_X on general densities, for each X in the continuous sector, to characterize completely the cocycle φ.

The cocycle condition also implies that

 Φ_Z(ρ) = Φ_X(π^Z_{X∗}ρ) + ∫_{E_X} φ_Z[Y](ρ^x) dπ^Z_{X∗}ρ(x).   (10)

The law π^Z_{X∗}ρ is a composite gaussian, and the law ρ^x is supported on the discrete space {x} × E_Y, with density

 (x,y) ↦ ρ^x(y) = p(y) G_{M_y,Σ_y}(x) / ∑_{y′∈E_Y} p(y′) G_{M_{y′},Σ_{y′}}(x) = r(x,y) / μ^x_Z(r).   (11)

Using again locality and the characterization of cocycles on the discrete sector, we deduce that

 φ_Z[Y](ρ^x) = Φ_Y(π^Z_{Y∗}ρ^x) = −b ∑_{y∈E_Y} ρ^x(y) log ρ^x(y).   (12)

A direct computation shows that ∫_{E_X} φ_Z[Y](ρ^x) dπ^Z_{X∗}ρ(x) equals −b times:

 ∑_{y∈E_Y} p(y) log p(y) − ∑_{y∈E_Y} p(y) S_{μ_X}(G_{M_y,Σ_y}) + S_{μ_X}(μ^x_Z(r)),   (13)

where S_{μ_X} is the differential entropy, defined for any density f by S_{μ_X}(f) = −∫_{E_X} f log f dμ_X.
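The decomposition (13) follows by expanding log ρ^x(y) = log p(y) + log G_{M_y,Σ_y}(x) − log μ^x_Z(r) under the integral; it can be verified numerically for a one-dimensional mixture (a sketch with illustrative parameters, not part of the original):

```python
import numpy as np

def gauss(x, m, s2):
    return np.exp(-(x - m) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

# Grid wide enough for negligible tail mass, narrow enough that every
# density stays strictly positive in double precision.
x = np.linspace(-15.0, 16.0, 200_001)
dx = x[1] - x[0]
integral = lambda f: f.sum() * dx
S = lambda f: -integral(f * np.log(f))        # differential entropy -∫ f log f

p = np.array([0.3, 0.7])                                  # law of Y
G = np.array([gauss(x, -1.0, 0.5), gauss(x, 2.0, 1.5)])   # components G_{M_y,Σ_y}
mix = (p[:, None] * G).sum(axis=0)                        # mixture density, as in (6)
rho = p[:, None] * G / mix                                # conditional ρ^x(y), as in (7)

# ∫ Σ_y ρ^x(y) log ρ^x(y) · mix(x) dx  =  Σ p log p − Σ p S(G_y) + S(mix)
lhs = integral((rho * np.log(rho)).sum(axis=0) * mix)
rhs = p @ np.log(p) - p @ np.array([S(g) for g in G]) + S(mix)
assert abs(lhs - rhs) < 1e-8
```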

It is well known that S_{μ_X}(G_{M,Σ}) = ½ log det Σ + (d/2) log(2πe), where d = dim E_X.

Equating the right hand sides of (9) and (10), we conclude that

 Φ_X(π^Z_{X∗}ρ) = ∑_{y∈E_Y} p(y) ((a − b/2) log det Σ_y + cd − (bd/2) log(2πe)) + b S_{μ_X}(μ^x_Z(r)).   (14)

4.2 Kernel estimates and main result

Equation (14) gives an explicit value to the functional Φ_X evaluated on a gaussian mixture. Any density can be approximated in L¹ by a random mixture of gaussians. This approximation is known as a kernel estimate.

Let X₁, X₂, … be a sequence of independent, identically distributed random elements of ℝ^d, all having a common density f with respect to the Lebesgue measure λ. Let K: ℝ^d → [0,∞) be a nonnegative Borel measurable function, called a kernel, such that ∫ K dλ = 1, and (h_n)_{n≥1} a sequence of positive real numbers. The kernel estimate of f is given by

 f_n(x) = (1 / (n h_n^d)) ∑_{i=1}^{n} K((x − X_i)/h_n).   (15)

The distance ∫|f_n − f| dλ is a random variable, invariant under arbitrary automorphisms of ℝ^d. The key result concerning these estimates [Devroye1985, Ch. 3, Thm. 1] says, among other things, that ∫|f_n − f| dλ → 0 in probability as n → ∞ for some f if and only if ∫|f_n − f| dλ → 0 almost surely as n → ∞ for all f, which holds if and only if h_n → 0 and n h_n^d → ∞.
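A minimal sketch of the kernel estimate (15) in dimension d = 1, checking that the L¹ distance shrinks as n grows; the bandwidth rate n^{−1/5} and the target density are illustrative assumptions, not from the original:

```python
import numpy as np

rng = np.random.default_rng(42)
grid = np.linspace(-8.0, 8.0, 2001)
dx = grid[1] - grid[0]
f = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)   # true density: standard gaussian

def kernel_estimate(samples, h):
    """f_n(x) = (1/(n h)) Σ_i K((x − x_i)/h), K a standard gaussian, on the grid."""
    u = (grid[:, None] - samples[None, :]) / h
    K = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
    return K.sum(axis=1) / (len(samples) * h)

errors = []
for n in (50, 2000):
    X = rng.standard_normal(n)
    h = n ** (-0.2)                     # h_n → 0 while n·h_n → ∞
    errors.append(np.abs(kernel_estimate(X, h) - f).sum() * dx)

assert errors[1] < errors[0]            # larger sample, smaller L1 distance
```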

Theorem 4.1

Let φ be a 1-cocycle on S with coefficients in F(Q), X an object in the continuous sector of S, and ρ a probability law in Q(X) absolutely continuous with respect to μ_A, where A := supp ρ. Then, there exist real constants c₁, c₂ such that

 Φ_X(ρ) := φ_X[X](ρ) = c₁ S_{μ_A}(ρ) + c₂ dim E_X.   (16)
Proof

By hypothesis, X is an object in the continuous sector of S. Let f be any density of ρ with respect to μ_A, (X_i)_{i≥1} an i.i.d. sequence of points of A with law ρ, and (h_n)_{n≥1} any sequence such that h_n → 0 and n h_n^d → ∞. Let (x_i)_{i≥1} be any realization of the process such that the kernel estimates f_n tend to f in L¹(μ_A). We introduce, for each n, the kernel estimate (15) evaluated at (x_1, …, x_n), taking K equal to the density of a standard gaussian. Each f_n is the density of a composite gaussian law ρ_n = f_n μ_A, equal to the mixture (1/n) ∑_{i=1}^n G_{x_i, h_n² Id}. This can be "lifted" to the mixed setting: that is, there exists a law on a product of X with a discrete object, with density (x, y_i) ↦ p(y_i) G_{x_i, h_n² Id}(x), where p is taken to be the uniform law. The arguments of Subsection 4.1 then imply that

 Φ_X(ρ_n) = 2d(a − b/2) log h_n + cd − (bd/2) log(2πe) + b S_{μ_A}(ρ_n).   (17)

In virtue of the hypotheses on Q, S_{μ_A}(ρ) is finite and S_{μ_A}(ρ_n) → S_{μ_A}(ρ). Since Φ_X is continuous when restricted to laws with carrier A, and Φ_X(ρ) is a real number while log h_n → −∞, we conclude that necessarily a = b/2. The statement is then just a rewriting of (17).

This is the best possible result: the dimension is an invariant associated to the reference measure, and the entropy depends on the density.