1 Introduction
1.1 Two faces of entropy
It is well known that Shannon entropy is related to the exponential growth of multinomial coefficients. More precisely: given a discrete probability law $p = (p_1, \dots, p_s)$,
$$\lim_{n\to\infty} \frac{1}{n} \ln \binom{n}{p_1 n, \dots, p_s n} = H_1(p) := -\sum_{i=1}^s p_i \ln p_i. \qquad (1)$$
These coefficients have a $q$-analog. For given $q > 0$, $q \neq 1$, the $q$-integers are defined by $[n]_q = 1 + q + \cdots + q^{n-1} = \frac{q^n - 1}{q - 1}$, the $q$-factorials by $[n]_q! = [1]_q [2]_q \cdots [n]_q$ (with $[0]_q! = 1$), and the $q$-multinomial coefficients are
$$\binom{n}{k_1, \dots, k_s}_q = \frac{[n]_q!}{[k_1]_q! \cdots [k_s]_q!}, \qquad (2)$$
where $k_1, \dots, k_s \in \mathbb{N}$ are such that $k_1 + \cdots + k_s = n$. When $q$ is a prime power, these coefficients count the number of flags of vector spaces $\{0\} = V_0 \subset V_1 \subset \cdots \subset V_s = \mathbb{F}_q^n$ such that $\dim V_i - \dim V_{i-1} = k_i$ (here $\mathbb{F}_q$ denotes the finite field of order $q$); we refer to the sequence $(k_1, \dots, k_s)$ as the type of the flag. In particular, the $q$-binomial coefficient $\binom{n}{k}_q := \binom{n}{k,\, n-k}_q$ counts the vector subspaces of dimension $k$ in $\mathbb{F}_q^n$.
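As a sanity check on definition (2) and its counting interpretation, the following sketch computes $q$-multinomial coefficients from the $q$-factorials (the helper names are ours, not the paper's):

```python
from functools import reduce

def q_int(n, q):
    """[n]_q = 1 + q + ... + q^(n-1)."""
    return sum(q**i for i in range(n))

def q_factorial(n, q):
    """[n]_q! = [1]_q [2]_q ... [n]_q, with [0]_q! = 1."""
    return reduce(lambda a, b: a * b, (q_int(i, q) for i in range(1, n + 1)), 1)

def q_multinomial(ks, q):
    """q-multinomial coefficient of the composition ks (summing to n = sum(ks))."""
    result = q_factorial(sum(ks), q)
    for k in ks:
        result //= q_factorial(k, q)  # exact: each intermediate quotient is an integer
    return result

# 35 subspaces of dimension 2 in F_2^4; 21 complete flags in F_2^3
assert q_multinomial([2, 2], 2) == 35
assert q_multinomial([1, 1, 1], 2) == 21
# q = 1 recovers the ordinary multinomial coefficient
assert q_multinomial([2, 3], 1) == 10
```

Setting $q = 1$ recovers the classical coefficients, which is the sense in which (2) is a $q$-deformation of (1)'s multinomials.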
In Section 3 we study in detail the asymptotic behavior of these $q$-multinomial coefficients. In particular, we show that, given a discrete probability law $p = (p_1, \dots, p_s)$,
$$\lim_{n\to\infty} \frac{2}{n^2} \log_q \binom{n}{p_1 n, \dots, p_s n}_q = 1 - \sum_{i=1}^s p_i^2 =: H_2(p). \qquad (3)$$
The function $H_2$ is known as quadratic entropy [5].
More generally, one can introduce a parameterized family of functions $\ln_\alpha$, for $\alpha > 0$, that generalize the usual logarithm through the formula
$$\ln_\alpha(x) = \frac{x^{1-\alpha} - 1}{1 - \alpha} \quad (\alpha \neq 1), \qquad \ln_1(x) := \ln(x). \qquad (4)$$
The surprise of a random event of probability $p$ is then defined as $s_\alpha(p) = \ln_\alpha(1/p)$, following the traditional definitions in information theory. Given a random variable $X$^{1}^{1}1In this work, the range of every random variable is supposed to be a finite set. with law $p$ (a probability on the range $E_X$ of $X$), its entropy is defined as the expected surprise $S_\alpha(X) = \sum_{x \in E_X} p(x) \ln_\alpha(1/p(x))$. This entropy or any real multiple of it can be taken as a generalized information measure. The entropy $S_1$ is the usual Shannon entropy
$$S_1(X) = -\sum_{x \in E_X} p(x) \ln p(x), \qquad (5)$$
whereas $\alpha \neq 1$ implies
$$S_\alpha(X) = \frac{1}{\alpha - 1} \left( 1 - \sum_{x \in E_X} p(x)^\alpha \right). \qquad (6)$$
This function appears in the literature under several denominations: it was introduced by Havrda and Charvát [9] as structural entropy and Aczél and Daróczy [1] call it the generalized information function of degree $\alpha$, but by far the most common name is Tsallis entropy,^{2}^{2}2In the physics literature, it is customary to use the letter $q$ instead of $\alpha$, but we reserve $q$ for the ‘quantum’ parameter that appears in the $q$-integers, $q$-multinomial coefficients, etc. because Tsallis popularized its use in statistical mechanics.
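Formulas (4)-(6) are easy to probe numerically; the sketch below (function names are ours) computes $S_\alpha$ and checks that it approaches Shannon entropy as $\alpha \to 1$:

```python
import math

def tsallis_entropy(p, alpha):
    """S_alpha(p) = (1 - sum_i p_i^alpha) / (alpha - 1); alpha = 1 gives Shannon entropy."""
    if alpha == 1:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (1.0 - sum(x**alpha for x in p)) / (alpha - 1.0)

p = [0.3, 0.7]
# alpha = 2 gives the quadratic entropy 1 - sum_i p_i^2
assert math.isclose(tsallis_entropy(p, 2), 1 - 0.3**2 - 0.7**2)
# S_alpha -> S_1 (Shannon) as alpha -> 1
assert abs(tsallis_entropy(p, 1.0001) - tsallis_entropy(p, 1)) < 1e-3
```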
Given a second variable $Y$ and a law $p$ for the pair $(X, Y)$, the entropies $S_\alpha$ satisfy the equations
$$S_\alpha(X, Y) = S_\alpha(Y) + \sum_{y \in E_Y} p_Y(y)^\alpha\, S_\alpha(X \mid Y = y), \qquad (7)$$
where $S_\alpha(X \mid Y = y)$ is the entropy of the conditional law $p(\cdot \mid Y = y)$, and $p_Y$ is the pushforward of the law $p$ on $E_X \times E_Y$ under the canonical projection $E_X \times E_Y \to E_Y$ (and similarly with the roles of $X$ and $Y$ exchanged). We have shown in [24] that $S_\alpha$ is the only family of measurable real-valued functions that satisfies these functional equations for generic collections of random variables and probabilities, up to a multiplicative constant $c \in \mathbb{R}$. The case $\alpha = 1$ is already treated in [3]. Of course, this depends on a long history of axiomatic characterizations of entropy that begins with Shannon himself; see [22, 1, 5].
In particular, if $X$, $Y$ represent the possible states of two independent systems (e.g. physical systems, random sources), in the sense that $p = p_X \otimes p_Y$, then
$$S_1(X, Y) = S_1(X) + S_1(Y). \qquad (8)$$
This property of Shannon entropy is called additivity. Under the same assumptions, Tsallis entropy verifies (for $\alpha \neq 1$):
$$S_\alpha(X, Y) = S_\alpha(X) + S_\alpha(Y) + (1 - \alpha)\, S_\alpha(X)\, S_\alpha(Y). \qquad (9)$$
One says that Tsallis entropy is nonadditive.^{3}^{3}3Originally, this was called nonextensivity, which explains the name ‘nonextensive statistical mechanics’.
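The nonadditivity relation (9) can be verified directly for product laws; a minimal numerical check:

```python
import math
from itertools import product

def tsallis(p, alpha):
    """Tsallis entropy S_alpha for alpha != 1."""
    return (1.0 - sum(x**alpha for x in p)) / (alpha - 1.0)

p = [0.2, 0.3, 0.5]
r = [0.6, 0.4]
joint = [a * b for a, b in product(p, r)]  # law of two independent systems

for alpha in (0.5, 2.0, 3.0):
    lhs = tsallis(joint, alpha)
    rhs = (tsallis(p, alpha) + tsallis(r, alpha)
           + (1 - alpha) * tsallis(p, alpha) * tsallis(r, alpha))
    assert math.isclose(lhs, rhs)
```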
This property is problematic from the point of view of heuristic justifications for information functions, which have always assumed as ‘intuitive’ that the amount of information given by two independent events should be computed as the sum of the amounts of information given by each one separately (this explains the use of the logarithm to define the surprise).
The initial motivation behind this paper was to better understand these generalized information functions of degree $\alpha$. Tsallis used them as the foundation of nonextensive statistical mechanics, a generalization of Boltzmann-Gibbs statistical mechanics that was expected to describe well some systems with long-range correlations. It is not completely clear which kinds of statistical systems follow these “generalized statistics”.^{4}^{4}4“…the entropy to be used for thermostatistical purposes would be not universal but would depend on the system or, more precisely, on the nonadditive universality class to which the system belongs.”[23, p. xii] There is extensive empirical evidence for the pertinence of the predictions made by nonextensive statistical mechanics [23]. However, very few papers address the microscopic foundations of the theory (for instance, [21, 8, 12]). We present here a novel approach in this direction, based on the combinatorics of flags, but only for the case $\alpha = 2$. However, we indicate in the last section how these ideas could be extended to other cases.
There is a connection between the combinatorial and algebraic characterizations of entropy, which we describe in Section 2 (Shannon entropy) and Section 3.3 (quadratic entropy). The well-known multiplicative relations at the level of multinomial coefficients shed new light on additivity/nonadditivity. In the simplest case, let $p = (p_1, \dots, p_s)$ and $r = (r_1, \dots, r_t)$ be two probability laws; then
$$\binom{n}{(p_i r_j n)_{i,j}} = \binom{n}{(p_i n)_i}\, \prod_{i=1}^s \binom{p_i n}{(p_i r_j n)_j}. \qquad (10)$$
Applying $\frac{1}{n} \ln$ to both sides and taking the limit $n \to \infty$, we recover (8). Equation (10) remains valid for the $q$-multinomial coefficients, but in this case one should apply $\frac{2}{n^2} \log_q$ to both sides to obtain the quadratic entropy:
$$H_2(p \otimes r) = H_2(p) + H_2(r) - H_2(p)\, H_2(r).$$
Thus, asymptotically, the number of flags of type $(p_i r_j n)_{i,j}$ can be computed in terms of the number of flags of type $(p_i n)_i$ and those of type $(p_i r_j n)_j$ (for each value of $i$) through this nonadditive formula.
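The $q$-analog of the multiplicative relation (10) follows directly from definition (2); a small check with $q = 2$ (the joint counts below are arbitrary, chosen for illustration):

```python
from functools import reduce

def q_factorial(n, q):
    """[n]_q! as a product of q-integers."""
    return reduce(lambda a, b: a * b,
                  (sum(q**i for i in range(m)) for m in range(1, n + 1)), 1)

def q_multinomial(ks, q):
    result = q_factorial(sum(ks), q)
    for k in ks:
        result //= q_factorial(k, q)
    return result

q = 2
counts = [[1, 2], [2, 1]]            # joint counts k_{ij} of a product type
rows = [sum(row) for row in counts]  # marginal counts

flat = [k for row in counts for k in row]
lhs = q_multinomial(flat, q)
rhs = q_multinomial(rows, q)
for row in counts:
    rhs *= q_multinomial(row, q)
assert lhs == rhs
```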
1.2 Statistical models
The asymptotic formula (1) plays a key role in information theory. Consider a random source that emits at each time $t \in \mathbb{N}$ a symbol in a finite alphabet $A$, each being an independent realization of an $A$-valued random variable $X$ with law $p$. A message (at time $n$) corresponds to a random sequence $(X_1, \dots, X_n)$ taking values in $A^n$ with law $p^{\otimes n}$. The type of a sequence $(x_1, \dots, x_n) \in A^n$ is the probability distribution on $A$ given by the relative frequency of appearance of each symbol in it; for example, when $A = \{0, 1\}$, the type of a sequence with $k$ ones is $(1 - k/n, k/n)$. A “typical sequence” is expected to have type $p$, and therefore its probability is approximately $\prod_{a \in A} p(a)^{n p(a)} = e^{-n H_1(p)}$. The cardinality of the set of sequences of type $p$ is $\binom{n}{(p(a) n)_{a \in A}} \approx e^{n H_1(p)}$. This implies, according to Shannon, that “it is possible for most purposes to treat the long sequences as though there were just $2^{Hn}$ of them, each with a probability $2^{-Hn}$” [22, p. 397]. This result is known nowadays as the asymptotic equipartition property (AEP), and can be stated more precisely as follows [4, Th. 3.1.2]: given $\epsilon > 0$ and $\delta > 0$, it is possible to find $n_0 \in \mathbb{N}$ and sets $A^{(n)}_\delta \subset A^n$, $n \geq n_0$, such that, for every $n \geq n_0$,

$P\big((X_1, \dots, X_n) \in A^{(n)}_\delta\big) > 1 - \epsilon$, and

for every $(x_1, \dots, x_n) \in A^{(n)}_\delta$,
$$\left| -\frac{1}{n} \log p^{\otimes n}(x_1, \dots, x_n) - H_1(p) \right| < \delta. \qquad (11)$$
Furthermore, if $|A^{(n)}_\delta|$ denotes the cardinality of $A^{(n)}_\delta$, then
$$(1 - \epsilon)\, 2^{n (H_1(p) - \delta)} \leq |A^{(n)}_\delta| \leq 2^{n (H_1(p) + \delta)}. \qquad (12)$$
The set $A^{(n)}_\delta$ can be defined to contain all the sequences whose type is close to $p$, in the sense that the deviation between the type and $p$ is upper-bounded by a small quantity; this is known as strong typicality (see [6, Def. 2.8]).
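The concentration behind the AEP is visible already for a Bernoulli source; the sketch below (thresholds chosen by us for illustration) computes the probability mass and the size of the strongly typical set of sequences whose type is within $\delta$ of $p$:

```python
import math
from math import comb

p, delta = 0.3, 0.05
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # entropy in bits

def typical_stats(n):
    """Mass and cardinality of {x in {0,1}^n : |k/n - p| <= delta}, k = number of ones."""
    ks = [k for k in range(n + 1) if abs(k / n - p) <= delta]
    # P(type in the window), computed with log-weights to avoid float underflow
    mass = sum(math.exp(math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                        + k * math.log(p) + (n - k) * math.log(1 - p)) for k in ks)
    size = sum(comb(n, k) for k in ks)
    return mass, size

mass_small, _ = typical_stats(200)
mass_large, size_large = typical_stats(2000)
assert mass_large > 0.999 and mass_large > mass_small   # the mass concentrates
# the exponential growth rate of the set is close to H (and tends to H as delta -> 0)
assert abs(math.log2(size_large) / 2000 - H) < 0.07
```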
Similar conclusions can be drawn for a system of $n$ independent particles, the state of each one being represented by a random variable $X_i$; in this case, the vector $(X_1, \dots, X_n)$ is called a configuration. The set $A^{(n)}_\delta$ can be thought of as an approximation to the effective phase space (“reasonably probable” configurations) and the entropy as a measure of its size, see [11, Sec. V]. In both cases, messages and configurations, the underlying probabilistic model is a process linked to the multinomial distribution, and the AEP is merely a result on measure concentration around the expected type.
We envisage a new type of statistical model, in which a configuration of a system of $n$ particles is represented by a flag of vector spaces $V_0 \subset V_1 \subset \cdots \subset V_s = \mathbb{F}_q^n$. In the simplest case ($s = 2$) a configuration is just a vector subspace of $\mathbb{F}_q^n$. While the type of a sequence is determined by the number of appearances of each symbol, the type of a flag is determined by its dimensions or, equivalently, by the numbers $k_i = \dim V_i - \dim V_{i-1}$ associated to it; by abuse of language, we refer to $(k_1, \dots, k_s)$ as the type. The cardinality of the set of flags that have type $(p_1 n, \dots, p_s n)$ is $\binom{n}{p_1 n, \dots, p_s n}_q \approx C\, q^{\frac{n^2}{2} H_2(p)}$, where $C$ is an appropriate constant.
To push the analogy further, we need a random process that produces at each time $n$ a flag of subspaces of $\mathbb{F}_q^n$ that would correspond to a generalized message. We can define such a process if we restrict our attention to the binomial case ($s = 2$). This is the purpose of Section 4.
Let $\theta$ be a positive real number, and let $\{\xi_t\}_{t \geq 1}$ be a collection of independent $\{0,1\}$-valued random variables that satisfy $P(\xi_t = 1) = \theta q^{t-1} / (1 + \theta q^{t-1})$, for each $t \geq 1$. We fix a sequence of linear embeddings $\mathbb{F}_q^0 \hookrightarrow \mathbb{F}_q^1 \hookrightarrow \mathbb{F}_q^2 \hookrightarrow \cdots$, and identify each $\mathbb{F}_q^{t-1}$ with its image in $\mathbb{F}_q^t$. We define then a stochastic process $(V_t)_{t \geq 0}$ such that each $V_t$ is a vector subspace of $\mathbb{F}_q^t$, as follows: $V_0 = \{0\}$ and, at step $t$, the dimension of $V_t$ increases by $1$ if and only if $\xi_t = 1$; in this case, $V_t$ is picked at random (uniformly) between all the dilations of $V_{t-1}$. When $\xi_t = 0$, one sets $V_t = V_{t-1}$. The dilations of a subspace $V$ of $\mathbb{F}_q^{t-1}$ are defined as
$$\mathcal{D}(V) = \{\, W \subset \mathbb{F}_q^t \;:\; \dim W = \dim V + 1 \ \text{and}\ W \cap \mathbb{F}_q^{t-1} = V \,\}. \qquad (13)$$
We prove that, for any subspace $V$ of $\mathbb{F}_q^n$ of dimension $k$, $P(V_n = V) = \theta^k q^{k(k-1)/2} / (-\theta; q)_n$. This implies that
$$P(\dim V_n = k) = \binom{n}{k}_q \frac{\theta^k q^{k(k-1)/2}}{(-\theta; q)_n},$$
which appears in the literature as the $q$-binomial distribution. (We have used here the Pochhammer symbol $(-\theta; q)_n = \prod_{i=0}^{n-1} (1 + \theta q^i)$.)
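The law of $\dim V_n$ can be checked without simulating subspaces, since conditionally on the jump pattern only the Bernoulli variables matter. The sketch below assumes the jump probabilities $P(\xi_t = 1) = \theta q^{t-1}/(1 + \theta q^{t-1})$ stated above and verifies that the number of jumps follows the $q$-binomial distribution:

```python
import math

q, theta, n = 2, 0.7, 12

# jump probability at step t = 1, ..., n
jump = [theta * q**(t - 1) / (1 + theta * q**(t - 1)) for t in range(1, n + 1)]

# exact distribution of the number of jumps K, by dynamic programming
dist = [1.0]
for pr in jump:
    new = [0.0] * (len(dist) + 1)
    for k, mass in enumerate(dist):
        new[k] += mass * (1 - pr)   # no jump: dimension stays
        new[k + 1] += mass * pr     # jump: dimension grows by 1
    dist = new

def q_binomial(n, k, q):
    """Gaussian binomial via its product formula; exact integer arithmetic."""
    num = den = 1
    for i in range(k):
        num *= q**(n - i) - 1
        den *= q**(i + 1) - 1
    return num // den

norm = math.prod(1 + theta * q**i for i in range(n))  # (-theta; q)_n
for k in range(n + 1):
    predicted = q_binomial(n, k, q) * theta**k * q**(k * (k - 1) // 2) / norm
    assert math.isclose(dist[k], predicted, rel_tol=1e-9)
```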
For the multinomial process, the probability concentrates on types close to $p$, i.e. numbers of appearances close to the expected values $n p_i$. In the case of the process $(V_n)_{n \geq 0}$, the probability also concentrates on a restricted number of dimensions (types). In fact, it is possible to prove an analog of the asymptotic equipartition property; this is the main result of this work, Theorem 1. It can be paraphrased as follows:
for every $\epsilon > 0$ and almost every $\theta > 0$ (except a countable set), there exist $n_0 \in \mathbb{N}$ and sets $T_n$ of subspaces of $\mathbb{F}_q^n$, for all $n \geq n_0$, such that, for any $n$ such that $n \geq n_0$,
$$P(V_n \in T_n) > 1 - \epsilon. \qquad (14)$$
Moreover, the size of $T_n$ is optimal, up to the first order in the exponential. The sets $T_n$ correspond to the “typical subspaces”, in analogy with the typical sequences introduced above. We close Section 5 with an application of this theorem to source coding.
2 Combinatorial characterization of Shannon’s information
Let $X$ be a finite random variable that takes values in the set $A = \{a_1, \dots, a_s\}$. We suppose that, among $n$ independent trials of the variable $X$, the result $a_i$ appears $k_i$ times, for each $i \in \{1, \dots, s\}$. Evidently, $k_1 + \cdots + k_s = n$.
The number of sequences in $A^n$ that agree with the prescribed counting is given by the multinomial coefficient
$$\binom{n}{k_1, \dots, k_s} = \frac{n!}{k_1! \cdots k_s!}. \qquad (15)$$
But we could also reason iteratively. Let's consider a partition of $A$ into $m$ disjoint sets, denoted $A_1, \dots, A_m$. These can be seen as the level sets of a new variable $Y$, taking values in a set $B = \{b_1, \dots, b_m\}$; by definition, $Y = b_j$ if and only if $X \in A_j$. There is a surjection $\pi : A \to B$ that sends $a \in A$ to the unique $b_j$ such that $a \in A_j$. The probability $p$ on $A$ can be pushed forward under this surjection; the resulting law $\pi_* p$ satisfies $\pi_* p(b_j) = \sum_{a \in A_j} p(a)$. Our counting problem can be solved as follows: count first the number of sequences in $B^n$ such that $n_j := \sum_{i : a_i \in A_j} k_i$ values correspond to the group $A_j$, for $j = 1, \dots, m$. This equals
$$\binom{n}{n_1, \dots, n_m}. \qquad (16)$$
Then, for each group $A_j$, count the number of sequences of length $n_j$ (subsequences of the original ones of length $n$) such that every $a_i \in A_j$ appears $k_i$ times. These are
$$\binom{n_j}{(k_i)_{a_i \in A_j}}. \qquad (17)$$
In total, the number of sequences of length $n$ such that $a_i$ appears $k_i$ times, for every $i$, is
$$\binom{n}{n_1, \dots, n_m} \prod_{j=1}^m \binom{n_j}{(k_i)_{a_i \in A_j}}. \qquad (18)$$
The considerations above give the identity
$$\binom{n}{k_1, \dots, k_s} = \binom{n}{n_1, \dots, n_m} \prod_{j=1}^m \binom{n_j}{(k_i)_{a_i \in A_j}}. \qquad (19)$$
This can be rephrased as follows: the multinomial expansion of $(x_1 + \cdots + x_s)^n$ and its iterated multinomial expansion, with the variables grouped according to the partition $A_1, \dots, A_m$, assign the same coefficient to each monomial $x_1^{k_1} \cdots x_s^{k_s}$.
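Identity (19) is easy to verify mechanically; a sketch with an alphabet of five symbols split into two groups (the counts are arbitrary):

```python
from math import factorial

def multinomial(ks):
    """Ordinary multinomial coefficient of the composition ks."""
    out = factorial(sum(ks))
    for k in ks:
        out //= factorial(k)
    return out

# alphabet {a1, ..., a5} with counts k_i, partitioned into groups {a1, a2} and {a3, a4, a5}
counts = [3, 1, 2, 2, 4]
groups = [[0, 1], [2, 3, 4]]
group_totals = [sum(counts[i] for i in g) for g in groups]

lhs = multinomial(counts)
rhs = multinomial(group_totals)
for g in groups:
    rhs *= multinomial([counts[i] for i in g])
assert lhs == rhs
```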
Equation (19) implies that
$$\ln \binom{n}{k_1, \dots, k_s} = \ln \binom{n}{n_1, \dots, n_m} + \sum_{j=1}^m \ln \binom{n_j}{(k_i)_{a_i \in A_j}}. \qquad (20)$$
We can see this as a discrete analog of the third axiom of Shannon. The connection can be made explicit using the following proposition.
Proposition 1
[18, Lemma 4.1] Let $n$ be a natural number and $k_1, \dots, k_s \in \mathbb{N}$ such that $k_1 + \cdots + k_s = n$. Suppose that $k_i / n \to p_i$ as $n \to \infty$, for all $i \in \{1, \dots, s\}$. Then
$$\frac{1}{n} \ln \binom{n}{k_1, \dots, k_s} \to H_1(p_1, \dots, p_s), \qquad (21)$$
where $H_1$ denotes Shannon entropy:
$$H_1(p_1, \dots, p_s) = -\sum_{i=1}^s p_i \ln p_i. \qquad (22)$$
By convention, $0 \ln 0 = 0$. If we take the limit of (20), divided by $n$, under the hypotheses of the previous proposition, we obtain
$$H_1(p) = H_1(\pi_* p) + \sum_{j=1}^m \pi_* p(b_j)\, H_1\!\left( \left( \frac{p(a_i)}{\pi_* p(b_j)} \right)_{a_i \in A_j} \right). \qquad (23)$$
This proves combinatorially that Shannon entropy satisfies all the functional equations of the form (23).
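Proposition 1 can also be observed numerically: the rate $\frac{1}{n}\ln\binom{n}{k_1,\dots,k_s}$ approaches $H_1(p)$ at speed $O(\log n / n)$. A sketch using the log-gamma function to avoid huge factorials (the tolerance is ours):

```python
import math

def log_multinomial(ks):
    """ln of the multinomial coefficient, via lgamma."""
    n = sum(ks)
    return math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in ks)

p = [0.2, 0.3, 0.5]
H1 = -sum(x * math.log(x) for x in p)

errors = []
for n in (100, 1000, 10000):
    ks = [round(x * n) for x in p]
    errors.append(abs(log_multinomial(ks) / n - H1))

assert errors[0] > errors[1] > errors[2]   # the rate converges ...
assert errors[2] < 0.01                    # ... and is already close at n = 10000
```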
Consider now the particular case $X = (Z, Y)$, for a second random variable $Y$ taking values in a finite set $E_Y$. We use the notations introduced in Section 1.1. The fibers of the canonical projection $E_Z \times E_Y \to E_Y$ partition the range of $X$, and the pushforward of $p$ under this projection is identified with the law $p_Y$ of $Y$. Therefore, (23) reads
$$S_1(Z, Y) = S_1(Y) + \sum_{y \in E_Y} p_Y(y)\, S_1(Z \mid Y = y). \qquad (24)$$
The ensemble of these equations, for a given family of finite sets and surjections between them, constitutes a cocycle condition in information cohomology (see [3] and [24]). These are functional equations that have Shannon entropy as unique solution, up to a multiplicative constant.
3 The $q$-multinomial coefficients
3.1 Definition
Let $q$ be any complex number that is not a root of unity. Given $k_1, \dots, k_s \in \mathbb{N}$ such that $k_1 + \cdots + k_s = n$, the $q$-multinomial coefficient is defined by the formula
$$\binom{n}{k_1, \dots, k_s}_q = \frac{[n]_q!}{[k_1]_q! \cdots [k_s]_q!}. \qquad (25)$$
We have used the notation for $q$-factorials introduced in Section 1.
Throughout this paper, we shall assume that $q$ is a fixed prime power. For such $q$, the binomial coefficient $\binom{n}{k}_q$ counts the number of $k$-dimensional subspaces in $\mathbb{F}_q^n$. More generally, given a set of integers $k_1, \dots, k_s$ such that $k_1 + \cdots + k_s = n$, the multinomial coefficient $\binom{n}{k_1, \dots, k_s}_q$ equals the number of flags of vector spaces $\{0\} = V_0 \subset V_1 \subset \cdots \subset V_s = \mathbb{F}_q^n$ such that $\dim V_i - \dim V_{i-1} = k_i$ [19, 20]. We will say that these flags are of type $(k_1, \dots, k_s)$.
It is possible to introduce a function $x \mapsto [x]_q!$ as the normalized solution of a functional equation that guarantees that $[n]_q! = [1]_q [2]_q \cdots [n]_q$ for every $n \in \mathbb{N}$, see [2]. When $q > 1$ and $x \in \mathbb{N}$, this function is given by the formula [17]:
$$[x]_q! = q^{x(x-1)/2}\, \frac{(q^{-1}; q^{-1})_x}{(1 - q^{-1})^x} \qquad (26)$$
$$= q^{x(x-1)/2}\, \frac{(q^{-1}; q^{-1})_\infty}{(q^{-x-1}; q^{-1})_\infty\, (1 - q^{-1})^x}, \qquad (27)$$
where we have used the Pochhammer symbols
$$(a; q)_x = \prod_{j=0}^{x-1} (1 - a q^j), \qquad (a; q)_\infty = \prod_{j=0}^{\infty} (1 - a q^j). \qquad (28)$$
The equivalent expressions for the function come from the identity
$$(a; q)_\infty = (a; q)_x\, (a q^x; q)_\infty \qquad (29)$$
(see [13, p. 30]).
Recall [15, p. 92] that an infinite product $\prod_{k=1}^\infty (1 + a_k)$ is said to be convergent if

there exists $k_0 \in \mathbb{N}$ such that $1 + a_k \neq 0$ for all $k \geq k_0$;

the limit $\lim_{K\to\infty} \prod_{k=k_0}^{K} (1 + a_k)$ exists and is different from zero.

An infinite product in the form $\prod_{k=1}^\infty (1 + a_k)$ is said to be absolutely convergent when $\prod_{k=1}^\infty (1 + |a_k|)$ converges. One can show that absolute convergence implies convergence. Moreover, when the terms $a_k$ have constant sign, the product is convergent if and only if the series $\sum_{k=1}^\infty a_k$ converges. The convergence of the geometric series $\sum_{k=1}^\infty q^{-k}$, for $q > 1$, gives then the following result, that is used without further comment throughout the paper.
Lemma 1
For every $q > 1$, the product $(q^{-1}; q^{-1})_\infty = \prod_{j=1}^\infty (1 - q^{-j})$ converges. Moreover, if $x_n \to \infty$, then $(q^{-x_n}; q^{-1})_\infty \to 1$.
The function $[x]_q!$ gives an alternative expression for the $q$-multinomial coefficients,
$$\binom{x}{y_1, \dots, y_s}_q = \frac{[x]_q!}{[y_1]_q! \cdots [y_s]_q!}, \qquad (30)$$
which in turn extends their definition to complex arguments.
We close this subsection with a remark on the unimodality of the binomial coefficients.
Lemma 2
For every $n \in \mathbb{N}$ and $0 \leq k < \lfloor n/2 \rfloor$,
$$\binom{n}{k}_q \leq \binom{n}{k+1}_q, \qquad (31)$$
and $\binom{n}{k}_q = \binom{n}{n-k}_q$; hence the sequence $\big( \binom{n}{k}_q \big)_{0 \leq k \leq n}$ is unimodal.
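Lemma 2 (symmetry and unimodality of the Gaussian binomials) can be checked for small cases:

```python
def q_binomial(n, k, q):
    """Gaussian binomial via its product formula; exact integer arithmetic."""
    num = den = 1
    for i in range(k):
        num *= q**(n - i) - 1
        den *= q**(i + 1) - 1
    return num // den

for n in (9, 10):
    row = [q_binomial(n, k, 2) for k in range(n + 1)]
    assert row == row[::-1]                                   # symmetry
    assert all(row[k] <= row[k + 1] for k in range(n // 2))   # increasing up to the middle
```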
3.2 Asymptotic behavior
The quadratic entropy of a probability law $p = (p_1, \dots, p_s)$ is defined by the formula:^{5}^{5}5We fix the constant in front of $H_2$. In [24] we have characterized Tsallis entropy ($\alpha = 2$) with a system of functional equations (as a cocycle in cohomology), whose general solution is $c\, H_2$, for an arbitrary constant $c$.
$$H_2(p) = 1 - \sum_{i=1}^s p_i^2. \qquad (32)$$
Proposition 2
For each $n \in \mathbb{N}$, let $k_1^{(n)}, \dots, k_s^{(n)}$ be a set of positive integers such that $k_1^{(n)} + \cdots + k_s^{(n)} = n$ (we write $k_i$ instead of $k_i^{(n)}$ when $n$ is clear from context). Suppose that, for each $i \in \{1, \dots, s\}$, it is verified that $k_i \to \infty$ as $n \to \infty$. Then,
$$\binom{n}{k_1, \dots, k_s}_q \sim \frac{q^{\sum_{i<j} k_i k_j}}{(q^{-1}; q^{-1})_\infty^{\,s-1}}. \qquad (33)$$
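Proposition 2 can be probed numerically in logarithmic scale, using the standard rewriting $\binom{n}{k_1,\dots,k_s}_q = q^{\sum_{i<j} k_i k_j}\, (q^{-1};q^{-1})_n / \prod_i (q^{-1};q^{-1})_{k_i}$ (the truncation of the infinite product is ours):

```python
import math

q, s = 2, 3

def log_poch(m):
    """log of (q^-1; q^-1)_m = prod_{j=1..m} (1 - q^-j); m = None truncates the infinite product."""
    top = 200 if m is None else m
    return sum(math.log(1 - q**(-j)) for j in range(1, top + 1))

def log_q_multinomial(ks):
    n = sum(ks)
    cross = (n * n - sum(k * k for k in ks)) // 2   # sum_{i<j} k_i k_j
    return cross * math.log(q) + log_poch(n) - sum(log_poch(k) for k in ks)

ks = [30, 30, 40]
n = sum(ks)
cross = (n * n - sum(k * k for k in ks)) // 2
log_limit = cross * math.log(q) - (s - 1) * log_poch(None)
assert abs(log_q_multinomial(ks) - log_limit) < 1e-6
```

Already for $k_i$ of order a few tens the two sides agree to many digits, since the products $(q^{-1};q^{-1})_m$ converge geometrically.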
Recall that $f(n) \sim g(n)$ means $f(n)/g(n) \to 1$ as $n \to \infty$. By convention, $(q^{-1}; q^{-1})_0 = 1$. We simply substitute (26) in (30) (the powers of $(1 - q^{-1})$ cancel, since $\sum_i k_i = n$):
$$\binom{n}{k_1, \dots, k_s}_q = q^{\sum_{i<j} k_i k_j}\, \frac{(q^{-1}; q^{-1})_n}{\prod_{i=1}^s (q^{-1}; q^{-1})_{k_i}}. \qquad (34)$$
We shall prove that, for any sequence $(a_n)_{n \in \mathbb{N}}$ of positive integers, $(q^{-1}; q^{-1})_{a_n} \to (q^{-1}; q^{-1})_\infty$ if $a_n \to \infty$; applied to $n$ and to each $k_i$, this gives (33).
Remark that
$$\ln (q^{-1}; q^{-1})_{a_n} = \sum_{j=1}^{a_n} \ln(1 - q^{-j})$$
can be written as $\int_{\mathbb{N}^*} f_n \, \mathrm{d}\mu$, where $\mu$ denotes the counting measure on $\mathbb{N}^*$ and $f_n$ is given by
$$f_n(j) = \ln(1 - q^{-j})\, \mathbb{1}_{\{j \leq a_n\}}. \qquad (35)$$
Moreover, $|f_n| \leq g$, where $g(j) := -\ln(1 - q^{-j})$, because each factor lies in $(0,1)$, and $g$ is integrable, $\sum_{j \geq 1} -\ln(1 - q^{-j}) < \infty$. Therefore, in virtue of Lebesgue's dominated convergence theorem,
$$\ln (q^{-1}; q^{-1})_{a_n} \to \sum_{j=1}^\infty \ln(1 - q^{-j}) = \ln (q^{-1}; q^{-1})_\infty.$$