A combinatorial interpretation for Tsallis 2-entropy

07/13/2018
by   Juan Pablo Vigneaux, et al.

While Shannon entropy is related to the growth rate of multinomial coefficients, we show that Tsallis 2-entropy is connected to their q-version; when q is a prime power, these coefficients count the number of flags in F_q^n with prescribed length and dimensions (F_q denotes the field of order q). In particular, the q-binomial coefficients count vector subspaces of given dimension. In this way we obtain a combinatorial explanation for non-additivity. We show that statistical systems whose configurations are described by flags provide a frequentist justification for the maximum entropy principle with Tsallis statistics. We then introduce a discrete-time stochastic process associated to the q-binomial distribution, which generates at time n a vector subspace of F_q^n. The concentration of measure on certain "typical subspaces" allows us to extend the asymptotic equipartition property to this setting. We discuss applications to information theory, in particular to source coding.



1 Introduction

1.1 Two faces of entropy

It is well known that Shannon entropy

$$S_1(p) = -\sum_{i=1}^{s} p_i \ln p_i$$

is related to the exponential growth of multinomial coefficients. More precisely: given a discrete probability law $p = (p_1, \ldots, p_s)$,

$$\lim_{n \to \infty} \frac{1}{n} \ln \binom{n}{k_1(n), \ldots, k_s(n)} = S_1(p), \qquad (1)$$

whenever $k_1(n), \ldots, k_s(n)$ are nonnegative integers that sum to $n$ and satisfy $k_i(n)/n \to p_i$ for each $i$.

These coefficients have a $q$-analog. For given $q \neq 1$, the $q$-integers are defined by $[m]_q = \frac{q^m - 1}{q - 1} = 1 + q + \cdots + q^{m-1}$, the $q$-factorials by $[m]_q! = [1]_q [2]_q \cdots [m]_q$ (with $[0]_q! = 1$), and the $q$-multinomial coefficients are

$$\binom{n}{k_1, \ldots, k_s}_q = \frac{[n]_q!}{[k_1]_q! \cdots [k_s]_q!}, \qquad (2)$$

where $k_1, \ldots, k_s$ are nonnegative integers such that $k_1 + \cdots + k_s = n$. When $q$ is a prime power, these coefficients count the number of flags of vector spaces $0 = V_0 \subset V_1 \subset \cdots \subset V_s = \mathbb{F}_q^n$ such that $\dim V_i - \dim V_{i-1} = k_i$ (here $\mathbb{F}_q$ denotes the finite field of order $q$); we refer to the sequence $(k_1, \ldots, k_s)$ as the type of the flag. In particular, the $q$-binomial coefficient $\binom{n}{k}_q$ counts vector subspaces of dimension $k$ in $\mathbb{F}_q^n$.
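As a concrete illustration (a minimal sketch of ours, not part of the paper), the following Python snippet computes $q$-multinomial coefficients from definition (2) and checks the subspace-counting interpretation in the binomial case against a brute-force enumeration over $\mathbb{F}_2$:

```python
from itertools import combinations

def q_int(m, q):
    # q-integer [m]_q = 1 + q + ... + q^(m-1)
    return (q**m - 1) // (q - 1)

def q_factorial(m, q):
    out = 1
    for i in range(1, m + 1):
        out *= q_int(i, q)
    return out

def q_multinomial(ks, q):
    # the q-multinomial coefficient (2), in exact integer arithmetic
    num = q_factorial(sum(ks), q)
    den = 1
    for k in ks:
        den *= q_factorial(k, q)
    return num // den

def f2_span(vectors):
    # span of bit-mask vectors over F_2 (XOR is vector addition)
    space = {0}
    for v in vectors:
        space |= {x ^ v for x in space}
    return frozenset(space)

def count_subspaces(n, k):
    # number of k-dimensional subspaces of F_2^n, by brute force
    spans = {f2_span(basis) for basis in combinations(range(1, 2**n), k)}
    return sum(1 for s in spans if len(s) == 2**k)

for n, k in [(3, 1), (4, 2), (5, 2)]:
    assert q_multinomial([k, n - k], 2) == count_subspaces(n, k)
```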

In Section 3 we study in detail the asymptotic behavior of these coefficients. In particular, we show that, given a discrete probability law $p = (p_1, \ldots, p_s)$,

$$\lim_{n \to \infty} \frac{2}{n^2} \log_q \binom{n}{k_1(n), \ldots, k_s(n)}_q = 1 - \sum_{i=1}^{s} p_i^2, \qquad (3)$$

whenever $k_i(n)/n \to p_i$ for each $i$. The function $S_2(p) = 1 - \sum_{i=1}^{s} p_i^2$ is known as quadratic entropy [5].
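The convergence in (3) is easy to observe numerically; the sketch below (ours, with arbitrary parameters) evaluates $\frac{2}{n^2}\log_q \binom{n}{k,\, n-k}_q$ for $k \approx pn$ and compares it with $S_2(p)$:

```python
import math

def log_q_binomial(n, k, q):
    # log base q of the Gaussian binomial coefficient, in floating point,
    # via binom(n, k)_q = prod_{i<k} (q^(n-i) - 1) / (q^(i+1) - 1)
    return sum(math.log(q**(n - i) - 1.0, q) - math.log(q**(i + 1) - 1.0, q)
               for i in range(k))

q, p = 2, 0.3
S2 = 1 - p**2 - (1 - p)**2          # quadratic entropy, = 0.42
for n in (20, 100, 500):
    print(n, 2 / n**2 * log_q_binomial(n, round(p * n), q))
# the printed values approach S2 = 0.42 as n grows
```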

More generally, one can introduce a parameterized family of functions $\ln_\alpha : (0, \infty) \to \mathbb{R}$, for $\alpha > 0$, that generalize the usual logarithm through the formula

$$\ln_\alpha(x) = \int_1^x \frac{\mathrm{d}t}{t^\alpha}. \qquad (4)$$

The $\alpha$-surprise of a random event of probability $p$ is then defined as $s_\alpha(p) = \ln_\alpha(1/p)$, following the traditional definitions in information theory. Given a random variable $X$ (in this work, the range $E_X$ of every random variable is supposed to be a finite set) with law $p$ (a probability on $E_X$), its $\alpha$-entropy is defined as the expected $\alpha$-surprise $S_\alpha(X) = \mathbb{E}[s_\alpha(p(X))]$. This $\alpha$-entropy, or any real multiple of it, can be taken as a generalized information measure. The $1$-entropy is the usual Shannon entropy

$$S_1(X) = -\sum_{x \in E_X} p(x) \ln p(x), \qquad (5)$$

whereas $\alpha = 2$ implies

$$S_2(X) = 1 - \sum_{x \in E_X} p(x)^2. \qquad (6)$$

This function appears in the literature under several denominations: it was introduced by Havrda and Charvát [9] as structural $\alpha$-entropy, and Aczél and Daróczy [1] call it the generalized information function of degree $\alpha$, but by far the most common name is Tsallis $\alpha$-entropy (in the physics literature it is customary to use the letter $q$ instead of $\alpha$, but we reserve $q$ for the 'quantum' parameter that appears in the $q$-integers, $q$-multinomial coefficients, etc.), because Tsallis popularized its use in statistical mechanics.

Given a second variable $Y$ and a law $p$ for the pair $(X, Y)$, the $\alpha$-entropies satisfy the equations

$$S_\alpha(p) = S_\alpha(\pi_* p) + \sum_{y \in E_Y} \big(\pi_* p(y)\big)^\alpha\, S_\alpha(p|_{Y=y}), \qquad (7)$$

where $E_Y$ is the range of $Y$, the symbol $p|_{Y=y}$ denotes the conditional law of $X$ given $Y = y$, and $\pi_* p$ is the push-forward of the law $p$ on $E_X \times E_Y$ under the canonical projection $\pi : E_X \times E_Y \to E_Y$. We have shown in [24] that, up to a multiplicative constant, $S_\alpha$ is the only family of measurable real-valued functions that satisfies these functional equations for generic collections of random variables and probabilities. The case $\alpha = 1$ is already treated in [3]. Of course, this depends on a long history of axiomatic characterizations of entropy that begins with Shannon himself; see [22, 1, 5].

In particular, if $X$, $Y$ represent the possible states of two independent systems (e.g. physical systems, random sources), in the sense that $p = p_X \otimes p_Y$, then

$$S_1(X, Y) = S_1(X) + S_1(Y). \qquad (8)$$

This property of Shannon entropy is called additivity. Under the same assumptions, Tsallis entropy verifies (for $\alpha \neq 1$):

$$S_\alpha(X, Y) = S_\alpha(X) + S_\alpha(Y) + (1 - \alpha)\, S_\alpha(X)\, S_\alpha(Y). \qquad (9)$$

One says that Tsallis entropy is non-additive. (Originally, this was called non-extensivity, which explains the name 'non-extensive statistical mechanics'.)
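A quick numerical check of (8) and (9) for product laws (our own sketch, not from the paper; the laws are drawn at random):

```python
import numpy as np

def tsallis(p, alpha):
    # S_alpha(p) = (1 - sum_i p_i^alpha) / (alpha - 1); Shannon at alpha = 1
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if alpha == 1:
        return -np.sum(p * np.log(p))
    return (1.0 - np.sum(p**alpha)) / (alpha - 1.0)

rng = np.random.default_rng(0)
u, v = rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(4))
joint = np.outer(u, v).ravel()      # law of an independent pair (X, Y)

for alpha in (0.5, 1.0, 2.0, 3.0):
    lhs = tsallis(joint, alpha)
    rhs = (tsallis(u, alpha) + tsallis(v, alpha)
           + (1 - alpha) * tsallis(u, alpha) * tsallis(v, alpha))
    assert np.isclose(lhs, rhs)     # (9); reduces to (8) when alpha = 1
```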

This property is problematic from the point of view of heuristic justifications for information functions, which have always assumed as 'intuitive' that the amount of information given by two independent events should be computed as the sum of the amounts of information given by each one separately (this explains the use of the logarithm to define the surprise).

The initial motivation behind this paper was to understand better these generalized information functions of degree $\alpha$. Tsallis used them as the foundation of non-extensive statistical mechanics, a generalization of Boltzmann-Gibbs statistical mechanics that was expected to describe well some systems with long-range correlations. It is not completely clear which kinds of statistical systems follow these "generalized statistics". ("…the entropy to be used for thermostatistical purposes would be not universal but would depend on the system or, more precisely, on the nonadditive universality class to which the system belongs." [23, p. xii]) There is extensive empirical evidence about the pertinence of the predictions made by non-extensive statistical mechanics [23]. However, very few papers address the microscopic foundations of the theory (for instance, [21, 8, 12]). We present here a novel approach in this direction, based on the combinatorics of flags, but only for the case $\alpha = 2$. However, we indicate in the last section how these ideas could be extended to other cases.

There is a connection between the combinatorial and algebraic characterizations of entropy, which we describe in Section 2 (Shannon entropy) and Section 3.3 (quadratic entropy). The well-known multiplicative relations at the level of multinomial coefficients shed new light on additivity/non-additivity. In the simplest case, let $u = (u_1, u_2)$, $v = (v_1, v_2)$ be two probability laws on $\{1, 2\}$; then, whenever all the numbers $n u_i v_j$ are integers,

$$\binom{n}{n u_1 v_1,\, n u_1 v_2,\, n u_2 v_1,\, n u_2 v_2} = \binom{n}{n u_1,\, n u_2} \binom{n u_1}{n u_1 v_1,\, n u_1 v_2} \binom{n u_2}{n u_2 v_1,\, n u_2 v_2}. \qquad (10)$$

Applying $\frac{1}{n}\ln$ to both sides and taking the limit $n \to \infty$, we recover (8). Equation (10) remains valid for the $q$-multinomial coefficients, but in this case one should apply $\frac{2}{n^2}\log_q$ to both sides to obtain the quadratic entropy:

$$S_2(u \otimes v) = S_2(u) + (u_1^2 + u_2^2)\, S_2(v) = S_2(u) + S_2(v) - S_2(u)\, S_2(v).$$

Thus, asymptotically, the number of flags of type $u \otimes v = (u_1 v_1, u_1 v_2, u_2 v_1, u_2 v_2)$ in $\mathbb{F}_q^n$ can be computed in terms of the number of flags of type $u$ in $\mathbb{F}_q^n$ and the number of flags of type $v$ in $\mathbb{F}_q^{n u_k}$ (where $k$ can take the values $1$ or $2$) through this non-additive formula.
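The identity (10) can be checked directly for $q$-multinomial coefficients, reusing `q_multinomial` from the first sketch (the laws below are chosen so that all the products $n u_i v_j$ are integers):

```python
def check_identity_10(n, u, v, q):
    lhs = q_multinomial([round(n * ui * vj) for ui in u for vj in v], q)
    rhs = q_multinomial([round(n * ui) for ui in u], q)
    for ui in u:
        rhs *= q_multinomial([round(n * ui * vj) for vj in v], q)
    return lhs == rhs

assert check_identity_10(10, (0.5, 0.5), (0.2, 0.8), q=2)
assert check_identity_10(12, (1/3, 2/3), (0.25, 0.75), q=3)
```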

1.2 Statistical models

The asymptotic formula (1) plays a key role in information theory. Consider a random source that emits at each time $i \geq 1$ a symbol in $A = \{a_1, \ldots, a_s\}$, each being an independent realization of an $A$-valued random variable $X$ with law $p$. A message (at time $n$) corresponds to a random sequence $(X_1, \ldots, X_n)$ taking values in $A^n$ with law $p^{\otimes n}$. The type of a sequence $(x_1, \ldots, x_n) \in A^n$ is the probability distribution on $A$ given by the relative frequency of appearance of each symbol in it; for example, when $A = \{0, 1\}$, the type of a sequence with $k$ ones is $(1 - k/n,\, k/n)$. A "typical sequence" is expected to have type $p$, and therefore its probability is approximately $p_1^{n p_1} \cdots p_s^{n p_s} = e^{-n S_1(p)}$. The cardinality of the set of sequences of type $p$ is $\binom{n}{n p_1, \ldots, n p_s} \approx e^{n S_1(p)}$. This implies, according to Shannon, that "it is possible for most purposes to treat the long sequences as though there were just $2^{Hn}$ of them, each with a probability $2^{-Hn}$" [22, p. 397]. This result is known nowadays as the asymptotic equipartition property (AEP), and can be stated more precisely as follows [4, Th. 3.1.2]: given $\varepsilon > 0$ and $\delta > 0$, it is possible to find $n_0 \in \mathbb{N}$ and sets $T_\delta^{(n)} \subset A^n$, $n \geq n_0$, such that, for every $n \geq n_0$,

  1. $\mathbb{P}\big((X_1, \ldots, X_n) \in T_\delta^{(n)}\big) > 1 - \varepsilon$, and

  2. for every $(x_1, \ldots, x_n) \in T_\delta^{(n)}$,

    $$\left| -\frac{1}{n} \ln p^{\otimes n}(x_1, \ldots, x_n) - S_1(p) \right| < \delta. \qquad (11)$$

Furthermore, if $|T_\delta^{(n)}|$ denotes the cardinality of $T_\delta^{(n)}$,

then, for $n$ large enough,

$$\left| \frac{1}{n} \ln \big|T_\delta^{(n)}\big| - S_1(p) \right| < 2\delta. \qquad (12)$$

The set $T_\delta^{(n)}$ can be defined to contain all the sequences whose type is close to $p$, in the sense that the deviation of each relative frequency from $p_i$ is upper-bounded by a small quantity; this is known as strong typicality (see [6, Def. 2.8]).
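The concentration behind the AEP can be simulated directly. The following sketch (ours; the law and the tolerance $\delta$ are arbitrary) estimates the probability that an i.i.d. sequence satisfies the typicality condition (11):

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])
H = -np.sum(p * np.log(p))       # Shannon entropy S_1(p), in nats
delta = 0.05

for n in (100, 1000, 10000):
    samples = rng.choice(len(p), size=(2000, n), p=p)
    # -(1/n) log p(x_1 ... x_n) is the empirical mean of -log p(x_i)
    rates = -np.log(p)[samples].mean(axis=1)
    print(n, np.mean(np.abs(rates - H) < delta))
# the fraction of delta-typical sequences tends to 1 as n grows
```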

Similar conclusions can be drawn for a system of $n$ independent particles, the state of each one being represented by a random variable $X_i$; in this case, the vector $(X_1, \ldots, X_n)$ is called a configuration. The set $T_\delta^{(n)}$ can be thought of as an approximation to the effective phase space ("reasonably probable" configurations) and the entropy as a measure of its size; see [11, Sec. V]. In both cases, messages and configurations, the underlying probabilistic model is a process linked to the multinomial distribution, and the AEP is merely a result on measure concentration around the expected type.

We envisage a new type of statistical model, such that a configuration of a system with $n$ particles is represented by a flag of vector spaces $0 = V_0 \subset V_1 \subset \cdots \subset V_s = \mathbb{F}_q^n$. In the simplest case ($s = 2$) a configuration is just a vector subspace of $\mathbb{F}_q^n$. While the type of a sequence is determined by the number of appearances of each symbol, the type of a flag is determined by its dimensions $(\dim V_1, \ldots, \dim V_s)$ or, equivalently, by the numbers $k_i = \dim V_i - \dim V_{i-1}$ associated to it; by abuse of language, we refer to $(k_1, \ldots, k_s)$ as the type. The cardinality of the set of flags that have type $(k_1, \ldots, k_s)$ is asymptotically $C\, q^{\sum_{i<j} k_i k_j}$, where $C$ is an appropriate constant.

To push the analogy further, we need a random process that produces at time $n$ a flag in $\mathbb{F}_q^n$ that would correspond to a generalized message. We can define such a process if we restrict our attention to the binomial case ($s = 2$). This is the purpose of Section 4.

Let $\theta$ be a positive real number, and let $(B_n)_{n \geq 1}$ be a collection of independent random variables that satisfy $\mathbb{P}(B_n = 1) = 1 - \mathbb{P}(B_n = 0) = \frac{\theta q^{n-1}}{1 + \theta q^{n-1}}$, for each $n \geq 1$. We fix a sequence of linear embeddings $\mathbb{F}_q^n \hookrightarrow \mathbb{F}_q^{n+1}$, and identify $\mathbb{F}_q^n$ with its image in $\mathbb{F}_q^{n+1}$. We define then a stochastic process $(V_n)_{n \geq 0}$ such that each $V_n$ is a vector subspace of $\mathbb{F}_q^n$, as follows: $V_0 = 0$ and, at step $n$, the dimension of $V_n$ increases by $1$ if and only if $B_n = 1$; in this case, $V_n$ is picked at random (uniformly) between all the $q$-dilations of $V_{n-1}$. When $B_n = 0$, one sets $V_n = V_{n-1}$. The $q$-dilations of a subspace $W$ of $\mathbb{F}_q^{n-1}$ are defined as the elements of

$$\mathrm{Dil}(W) = \left\{\, V \subset \mathbb{F}_q^n \ :\ \dim V = \dim W + 1 \ \text{and} \ V \cap \mathbb{F}_q^{n-1} = W \,\right\}. \qquad (13)$$

We prove that, for any subspace $V$ of dimension $k$, $\mathbb{P}(V_n = V) = \theta^k q^{k(k-1)/2} / (-\theta; q)_n$. This implies that $\mathbb{P}(\dim V_n = k) = \binom{n}{k}_q\, \theta^k q^{k(k-1)/2} / (-\theta; q)_n$, which appears in the literature as the $q$-binomial distribution. (We have used here the $q$-Pochhammer symbols $(a; q)_n = \prod_{i=0}^{n-1} (1 - a q^i)$, with $a = -\theta$.)
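The dimension process is easy to simulate; the sketch below (ours, reusing `q_multinomial` from the first sketch) assumes the success probabilities $\mathbb{P}(B_i = 1) = \theta q^{i-1}/(1 + \theta q^{i-1})$ reconstructed above, under which the empirical histogram of $\dim V_n$ should match the $q$-binomial law:

```python
import numpy as np

def q_binomial_pmf(n, q, theta):
    # P(dim V_n = k) proportional to binom(n, k)_q * theta^k * q^(k(k-1)/2)
    w = [q_multinomial([k, n - k], q) * theta**k * q**(k * (k - 1) // 2)
         for k in range(n + 1)]
    z = sum(w)   # equals the q-Pochhammer product prod_{i<n} (1 + theta q^i)
    return np.array(w) / z

rng = np.random.default_rng(2)
n, q, theta, trials = 8, 2, 0.7, 200_000
ps = np.array([theta * q**i / (1 + theta * q**i) for i in range(n)])
dims = (rng.random((trials, n)) < ps).sum(axis=1)   # dim V_n in each trial

print(np.round(np.bincount(dims, minlength=n + 1) / trials, 3))
print(np.round(q_binomial_pmf(n, q, theta), 3))     # the two rows agree
```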

For the multinomial process, the probability concentrates on types close to $p$, i.e. numbers of appearances $k_i$ close to the expected value $n p_i$, for each $i$. In the case of $(V_n)_{n \geq 0}$, the probability also concentrates on a restricted number of dimensions (types). In fact, it is possible to prove an analog of the asymptotic equipartition property; this is the main result of this work, Theorem 1. It can be paraphrased as follows:

for every $\varepsilon > 0$ and almost every $\theta$ (except a countable set), there exist $n_0 \in \mathbb{N}$ and sets $T^{(n)}$ of subspaces of $\mathbb{F}_q^n$, for all $n \geq n_0$, such that $n_0$ is a number that just depends on $\varepsilon$ and $\theta$, and, for any $n$ such that $n \geq n_0$,

$$\mathbb{P}\big(V_n \in T^{(n)}\big) \geq 1 - \varepsilon. \qquad (14)$$

Moreover, the size of $T^{(n)}$ is optimal, up to the first order in the exponential: let $D^{(n)}$ denote the minimal cardinality of a set $S^{(n)}$ of subspaces of $\mathbb{F}_q^n$ such that $\mathbb{P}(V_n \in S^{(n)}) \geq 1 - \varepsilon$;

then

$$\lim_{n \to \infty} \frac{1}{n} \left( \log_q \big|T^{(n)}\big| - \log_q D^{(n)} \right) = 0.$$

The sets $T^{(n)}$ correspond to the "typical subspaces", in analogy with the typical sequences introduced above. We close Section 5 with an application of this theorem to source coding.

2 Combinatorial characterization of Shannon’s information

Let $X$ be a finite random variable that takes values in the set $A = \{a_1, \ldots, a_s\}$. We suppose that, among $n$ independent trials of the variable $X$, the result $a_i$ appears $n_i$ times, for each $i \in \{1, \ldots, s\}$. Evidently, $n_1 + \cdots + n_s = n$.

The number of sequences in $A^n$ that agree with the prescribed counting is given by the multinomial coefficient

$$\binom{n}{n_1, \ldots, n_s} = \frac{n!}{n_1! \cdots n_s!}. \qquad (15)$$

But we could also reason iteratively. Let's consider a partition of $A$ into $t$ disjoint sets, denoted $A_1, \ldots, A_t$. These can be seen as level sets of a new variable $Y$, taking values in a set $B = \{b_1, \ldots, b_t\}$; by definition, $\{Y = b_j\} = \{X \in A_j\}$. There is a surjection $\pi : A \to B$ that sends $a$ to the unique $b_j$ such that $a \in A_j$. The probability $p$ on $A$ can be pushed forward under this surjection; the resulting law satisfies $\pi_* p(b_j) = \sum_{a \in A_j} p(a)$. Our counting problem can be solved as follows: count first the number of sequences in $B^n$ such that $m_j := \sum_{i \,:\, a_i \in A_j} n_i$ values correspond to the group $A_j$, for $j \in \{1, \ldots, t\}$. This equals

$$\binom{n}{m_1, \ldots, m_t}. \qquad (16)$$

Then, for each group $A_j$, count the number of sequences of length $m_j$ (subsequences of the original ones of length $n$) such that every $a_i \in A_j$ appears $n_i$ times. These are

$$\binom{m_j}{(n_i)_{a_i \in A_j}}. \qquad (17)$$

In total, the number of sequences of length $n$ such that $a_i$ appears $n_i$ times, for every $i$, is

$$\binom{n}{m_1, \ldots, m_t} \prod_{j=1}^{t} \binom{m_j}{(n_i)_{a_i \in A_j}}. \qquad (18)$$

The considerations above give the identity

$$\binom{n}{n_1, \ldots, n_s} = \binom{n}{m_1, \ldots, m_t} \prod_{j=1}^{t} \binom{m_j}{(n_i)_{a_i \in A_j}}. \qquad (19)$$

This can be rephrased as follows: the multinomial expansion of $(x_1 + \cdots + x_s)^n$ and the iterated multinomial expansion of $\big(\sum_{j=1}^{t} \big(\sum_{a_i \in A_j} x_i\big)\big)^n$ assign the same coefficient to the monomial $x_1^{n_1} \cdots x_s^{n_s}$.
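The grouping identity (19) is easy to verify for concrete counts; a small Python check of ours (the numbers are arbitrary):

```python
from math import factorial, prod

def multinomial(ks):
    return factorial(sum(ks)) // prod(factorial(k) for k in ks)

# counts n_i for A = {a_1, ..., a_5}, grouped as A_1 = {a_1, a_2}, A_2 = {a_3, a_4, a_5}
ns = [3, 1, 4, 1, 5]
groups = [[0, 1], [2, 3, 4]]
ms = [sum(ns[i] for i in g) for g in groups]        # group totals m_j

lhs = multinomial(ns)
rhs = multinomial(ms) * prod(multinomial([ns[i] for i in g]) for g in groups)
assert lhs == rhs                                   # identity (19)
```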

Equation (19) implies that

$$\ln \binom{n}{n_1, \ldots, n_s} = \ln \binom{n}{m_1, \ldots, m_t} + \sum_{j=1}^{t} \ln \binom{m_j}{(n_i)_{a_i \in A_j}}. \qquad (20)$$

We can see this as a discrete analog of the third axiom of Shannon. The connection can be made explicit using the following proposition.

Proposition 1

[18, Lemma 4.1] Let $n$ be a natural number and $k_1(n), \ldots, k_s(n)$ nonnegative integers such that $\sum_{i=1}^{s} k_i(n) = n$. Suppose that $k_i(n)/n \to p_i$ as $n \to \infty$, for all $i$. Then

$$\lim_{n \to \infty} \frac{1}{n} \ln \binom{n}{k_1(n), \ldots, k_s(n)} = S_1(p), \qquad (21)$$

where $S_1(p)$ denotes Shannon entropy:

$$S_1(p) = -\sum_{i=1}^{s} p_i \ln p_i. \qquad (22)$$

By convention, $0 \ln 0 = 0$. If we take the limit of (20) under the hypotheses of the previous proposition, dividing both sides by $n$ and noting that $m_j(n)/n \to \pi_* p(b_j)$ while the conditional frequencies $n_i/m_j$ tend to $p_i / \pi_* p(b_j)$, we obtain

$$S_1(p) = S_1(\pi_* p) + \sum_{j=1}^{t} \pi_* p(b_j)\, S_1(p|_{Y = b_j}). \qquad (23)$$

This proves combinatorially that Shannon entropy satisfies all the functional equations of the form (23).
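In the same spirit, the chain rule (23) can be verified numerically for any grouping (our own check; the law is arbitrary):

```python
import numpy as np

def shannon(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p = np.array([0.1, 0.2, 0.3, 0.15, 0.25])     # law of X on 5 symbols
groups = [[0, 1], [2, 3, 4]]                  # level sets of Y
pY = np.array([p[g].sum() for g in groups])   # push-forward pi_* p

lhs = shannon(p)
rhs = shannon(pY) + sum(pY[j] * shannon(p[g] / pY[j])
                        for j, g in enumerate(groups))
assert np.isclose(lhs, rhs)                   # equation (23)
```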

Consider now the particular case $A = E_X \times E_Y$, for a second random variable $Y$ taking values on $E_Y$, the partition of $A$ being given by the level sets of the canonical projection $\pi : E_X \times E_Y \to E_Y$. We use the notations introduced in Section 1.1. Since the support of $p|_{Y=y}$ is contained in $E_X \times \{y\}$, isomorphic to $E_X$ by the natural projection $E_X \times E_Y \to E_X$, there is a clear identification of $p|_{Y=y}$ with a law on $E_X$. Therefore, (23) reads

$$S_1(X, Y) = S_1(Y) + \sum_{y \in E_Y} \mathbb{P}(Y = y)\, S_1(X \mid Y = y). \qquad (24)$$

The ensemble of these equations, for a given family of finite sets and surjections between them, constitutes a cocycle condition in information cohomology (see [3] and [24]). These are functional equations that have Shannon entropy as their unique solution.

3 The q-multinomial coefficients

3.1 Definition

Let $q$ be any complex number that is not a root of unity. Given $k_1, \ldots, k_s \in \mathbb{N}$ such that $k_1 + \cdots + k_s = n$, the $q$-multinomial coefficient is defined by the formula

$$\binom{n}{k_1, \ldots, k_s}_q = \frac{[n]_q!}{[k_1]_q! \cdots [k_s]_q!}. \qquad (25)$$

We have used the notation for $q$-factorials introduced in Section 1.

Throughout this paper, we shall assume that $q$ is a fixed prime power. For such $q$, the $q$-binomial coefficient $\binom{n}{k}_q$ counts the number of $k$-dimensional subspaces in $\mathbb{F}_q^n$. More generally, given a set of integers $k_1, \ldots, k_s \geq 0$ such that $k_1 + \cdots + k_s = n$, the $q$-multinomial coefficient $\binom{n}{k_1, \ldots, k_s}_q$ equals the number of flags of vector spaces $0 = V_0 \subset V_1 \subset \cdots \subset V_s = \mathbb{F}_q^n$ such that $\dim V_i - \dim V_{i-1} = k_i$ [19, 20]. We will say that these flags are of type $(k_1, \ldots, k_s)$.

It is possible to introduce a function $x \mapsto [x]_q!$, defined for complex arguments, as the normalized solution of a functional equation that guarantees that $[x]_q! = [x]_q\, [x-1]_q!$, see [2]. When $q > 1$ and $x$ is not a negative integer, this function is given by the formula [17]:

$$[x]_q! = \frac{q^{x(x+1)/2}}{(q-1)^x} \prod_{i=1}^{\infty} \frac{1 - q^{-i}}{1 - q^{-(x+i)}} \qquad (26)$$

$$= \frac{q^{x(x+1)/2}}{(q-1)^x}\, \frac{(q^{-1}; q^{-1})_\infty}{(q^{-x-1}; q^{-1})_\infty}, \qquad (27)$$

where we have used the Pochhammer symbol

$$(a; q)_\infty = \prod_{i=0}^{\infty} (1 - a q^i). \qquad (28)$$

The equivalent expressions for the function $[\,\cdot\,]_q!$ come from the following identity

$$\sum_{n=0}^{\infty} \frac{(a; q)_n}{(q; q)_n}\, z^n = \frac{(az; q)_\infty}{(z; q)_\infty}, \qquad |z| < 1, \ |q| < 1, \qquad (29)$$

known as the $q$-binomial theorem (see [13, p. 30]).
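A quick numerical confirmation of the $q$-binomial theorem in the form (29) (our sketch; the parameters lie inside the region of convergence and the truncation level N is arbitrary):

```python
def poch(a, q, n):
    # finite q-Pochhammer (a; q)_n = prod_{i<n} (1 - a q^i)
    out = 1.0
    for i in range(n):
        out *= 1 - a * q**i
    return out

a, q, z, N = 0.4, 0.5, 0.3, 60     # N terms approximate the infinite versions
lhs = sum(poch(a, q, n) / poch(q, q, n) * z**n for n in range(N))
rhs = poch(a * z, q, N) / poch(z, q, N)
assert abs(lhs - rhs) < 1e-12
```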

Recall [15, p. 92] that an infinite product $\prod_{i=1}^{\infty} (1 + a_i)$ is said to be convergent if

  1. there exists $i_0$ such that $1 + a_i \neq 0$ for all $i \geq i_0$;

  2. $\lim_{N \to \infty} \prod_{i=i_0}^{N} (1 + a_i)$ exists and is different from zero.

An infinite product in the form $\prod_{i=1}^{\infty} (1 + a_i)$ is said to be absolutely convergent when $\prod_{i=1}^{\infty} (1 + |a_i|)$ converges. One can show that absolute convergence implies convergence. Moreover, when the terms $a_i$ are nonnegative, the product is convergent if and only if the series $\sum_i a_i$ converges. The convergence of $\sum_{i \geq 0} |a|\, q^{-i}$ (recall that $q > 1$) gives then the following result, that is used without further comment throughout the paper.

Lemma 1

For every $a \in \mathbb{C}$, the product $(a; q^{-1})_\infty = \prod_{i=0}^{\infty} (1 - a q^{-i})$ converges. Moreover, if $a \notin \{\, q^i : i \in \mathbb{N} \,\}$, then $(a; q^{-1})_\infty \neq 0$.

The function $[\,\cdot\,]_q!$ gives an alternative expression for the $q$-multinomial coefficients,

$$\binom{x}{y_1, \ldots, y_s}_q = \frac{[x]_q!}{[y_1]_q! \cdots [y_s]_q!}, \qquad (30)$$

which in turn extends its definition to complex arguments.

We close this subsection with a remark on the unimodality of the $q$-binomial coefficients.

Lemma 2

For every $k \in \{1, \ldots, n\}$, $\binom{n}{k-1}_q \leq \binom{n}{k}_q$ if and only if $k \leq \frac{n+1}{2}$.

Consider the quotient

$$R(n, k) := \binom{n}{k}_q \bigg/ \binom{n}{k-1}_q = \frac{[n-k+1]_q}{[k]_q}. \qquad (31)$$

Then $R(n, k) \geq 1$ iff $[n-k+1]_q \geq [k]_q$ iff $k \leq \frac{n+1}{2}$, with equality just in the case $k = \frac{n+1}{2}$ (when $n$ is odd).
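A direct check of the symmetry and unimodality of the rows of $q$-binomial coefficients, reusing `q_multinomial` from the first sketch:

```python
for q in (2, 3, 5):
    for n in (6, 7, 10):
        row = [q_multinomial([k, n - k], q) for k in range(n + 1)]
        assert row == row[::-1]                 # symmetry k <-> n - k
        mid = (n + 1) // 2                      # the mode(s) sit here
        assert all(row[k - 1] <= row[k] for k in range(1, mid + 1))
```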

3.2 Asymptotic behavior

The quadratic entropy of a probability law $p = (p_1, \ldots, p_s)$ is defined by the formula (we fix the constant in front: in [24] we have characterized Tsallis $\alpha$-entropy by a system of functional equations, as a cocycle condition in information cohomology, whose general solution is $\lambda\, S_\alpha$ for an arbitrary constant $\lambda$):

$$S_2(p) = 1 - \sum_{i=1}^{s} p_i^2. \qquad (32)$$
Proposition 2

For each $n \in \mathbb{N}$, let $\{k_i(n)\}_{i=1}^{s}$ be a set of positive real numbers such that $\sum_{i=1}^{s} k_i(n) = n$ (we write $k_i$ when $n$ is clear from context). Suppose that, for each $i$, it is verified that $k_i(n)/n \to p_i$ as $n \to \infty$. Then,

$$\lim_{n \to \infty} \frac{2}{n^2} \log_q \binom{n}{k_1(n), \ldots, k_s(n)}_q = 1 - \sum_{i=1}^{s} p_i^2 = S_2(p). \qquad (33)$$

Recall that $f(n) \sim g(n)$ means $f(n)/g(n) \to 1$ as $n \to \infty$. By convention, empty products equal $1$. We simply substitute (26) in (30) (the powers of $q - 1$ cancel):

$$\binom{n}{k_1, \ldots, k_s}_q = q^{\sum_{i<j} k_i k_j}\, \frac{\prod_{i=1}^{s} (q^{-k_i - 1}; q^{-1})_\infty}{(q^{-n-1}; q^{-1})_\infty\, \big((q^{-1}; q^{-1})_\infty\big)^{s-1}}. \qquad (34)$$

We shall prove that, for any sequence $(m_n)_{n \geq 1}$ of positive numbers, $(q^{-m_n - 1}; q^{-1})_\infty \to 1$ if $m_n \to \infty$, and $(q^{-m_n - 1}; q^{-1})_\infty \to (q^{-m - 1}; q^{-1})_\infty$ if $m_n \to m < \infty$.

Remark that $\log_q (q^{-m_n - 1}; q^{-1})_\infty$

can be written as $\int_{\mathbb{N}} f_n\, \mathrm{d}\mu$, where $\mu$ denotes the counting measure on $\mathbb{N}$ and $f_n$ is given by

$$f_n(i) = \log_q\big(1 - q^{-m_n - 1 - i}\big). \qquad (35)$$

Moreover, $|f_n| \leq g$, because $m_n \geq 0$, where $g(i) := -\log_q(1 - q^{-1-i})$, and $g$ is integrable, $\int_{\mathbb{N}} g\, \mathrm{d}\mu < \infty$. Therefore, in virtue of Lebesgue's dominated convergence theorem, $\lim_n \log_q (q^{-m_n - 1}; q^{-1})_\infty = \int_{\mathbb{N}} \lim_n f_n\, \mathrm{d}\mu$, which gives the two limits stated above.