# Entropy modulo a prime

Building on work of Kontsevich, we introduce a definition of the entropy of a finite probability distribution in which the "probabilities" are integers modulo a prime p. The entropy, too, is an integer mod p. Entropy mod p is shown to be uniquely characterized by a functional equation identical to the one that characterizes ordinary Shannon entropy, justifying the name. We also establish a sense in which certain real entropies have residues mod p, connecting the concepts of entropy over R and over Z/pZ. Finally, entropy mod p is expressed as a polynomial which is shown to satisfy several identities, linking it to work of Cathelineau, Elbaz-Vincent and Gangl on polylogarithms.


## 1 Introduction

The concept of entropy is applied in almost every branch of science. Less widely appreciated, however, is that from a purely algebraic perspective, entropy has a very simple nature. Indeed, Shannon entropy is characterized nearly uniquely by a single equation, expressing a recursivity property. The purpose of this work is to introduce a parallel notion of entropy for probability distributions whose ‘probabilities’ are not real numbers, but integers modulo a prime p. The entropy of such a distribution is also an integer mod p.

We will see that despite the (current) lack of scientific application, this ‘entropy’ is fully deserving of the name. Indeed, it is characterized by a recursivity equation formally identical to the one that characterizes classical real entropy. It is also directly related to real entropy, via a notion of residue informally suggested by Kontsevich [13].

Among the many types of entropy, the most basic is the Shannon entropy of a finite probability distribution π = (π1, …, πn), defined as

 H(π) = −∑_{i=1}^n πi log πi

(with the convention 0 log 0 = 0). It is this that we will imitate in the mod p setting.

The aforementioned recursivity property concerns the entropy of the composite of two processes, in which the nature of the second process depends on the outcome of the first. Specifically, let π = (π1, …, πn) be a finite probability distribution, and let γ1, …, γn be further distributions, writing γi = (γi1, …, γiki). Their composite is

 π∘(γ1,…,γn)=(π1γ11,…,π1γ1k1, …, πnγn1,…,πnγnkn), (1)

a probability distribution on k1 + ⋯ + kn elements. (Formally, this composition endows the sequence of simplices (Δn) with the structure of an operad.) The chain rule (or recursivity or grouping law) for Shannon entropy is that

 H(π∘(γ1,…,γn))=H(π)+n∑i=1πiH(γi). (2)

The chain rule can be understood in terms of information. Suppose we toss a fair coin and then, depending on the outcome, either roll a fair die or draw fairly from a pack of 52 cards. There are 6 + 52 = 58 possible final outcomes, and their probabilities are given by the composite distribution

 (1/2, 1/2)∘((1/6, …, 1/6), (1/52, …, 1/52)) = (1/12, …, 1/12, 1/104, …, 1/104),

with 6 entries equal to 1/12 and 52 entries equal to 1/104.

Now, the entropy of a distribution π measures the amount of information gained by learning the outcome of an observation drawn from π (measured in bits, if logarithms are taken to base 2). In our example, knowing the outcome of the composite process tells us with certainty the outcome of the initial coin toss, plus with probability 1/2 the outcome of a die roll and with probability 1/2 the outcome of a card draw. Thus, the entropy of the composite distribution should be equal to

 H(1/2, 1/2) + (1/2)H(1/6, …, 1/6) + (1/2)H(1/52, …, 1/52).

This is indeed true, and is an instance of the chain rule.
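This instance of the chain rule is easy to verify numerically. The following sketch (with a small helper H for base-2 Shannon entropy; the helper name is ours) checks it in Python:

```python
from math import log2

def H(dist):
    """Shannon entropy in bits, with the convention 0·log 0 = 0."""
    return -sum(q * log2(q) for q in dist if q > 0)

# Fair coin, then either a fair die (with probability 1/2)
# or a fair draw from a 52-card pack (with probability 1/2).
composite = [1/2 * 1/6] * 6 + [1/2 * 1/52] * 52
lhs = H(composite)
rhs = H([1/2, 1/2]) + 1/2 * H([1/6] * 6) + 1/2 * H([1/52] * 52)
print(abs(lhs - rhs) < 1e-9)  # the chain rule holds
```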

A classical theorem essentially due to Faddeev [10] states that up to a constant factor, Shannon entropy is the only continuous function assigning a real number to each finite probability distribution in such a way that the chain rule holds. In this sense, the chain rule is the characteristic property of entropy.

Our first task will be to formulate the right definition of entropy mod p. An immediate obstacle is that there is no logarithm function mod p, at least in the most obvious sense. Nevertheless, the classical Fermat quotient turns out to provide an acceptable substitute (Section 2). Closely related to the real logarithm is the nonlinear derivation x ↦ −x log x, and its mod p analogue is a ↦ (a − a^p)/p (a p-derivation, in the language of Buium [4]).

The entropy of a mod p probability distribution π = (π1, …, πn), with π1 + ⋯ + πn = 1, is then defined as

 Hp(π) = ∑_{i=1}^n ∂(ai) − ∂(∑_{i=1}^n ai) = (1/p)(1 − ∑_{i=1}^n ai^p) ∈ Z/pZ,

where ∂(a) = (a − a^p)/p and ai ∈ Z is an integer representing πi (Section 3). The definition is independent of the choice of representatives ai. This entropy satisfies a chain rule formally identical to that satisfied by real entropy (Section 4). We prove in Section 5 that up to a constant factor, Hp is the one and only function satisfying the chain rule. This is the main justification for the definition.

Classical Shannon entropy quantifies the information associated with a probability space, but one can also seek to quantify the information lost by a map between probability spaces, seen as a deterministic process. For example, if one chooses uniformly at random a binary number with ten digits, then discards the last three, the discarding process loses three bits.

There is a formal definition of information loss, it includes the definition of entropy as a special case, and it has been uniquely characterized in work of Baez, Fritz and Leinster [2]. The advantage of working with information loss rather than entropy is that the characterizing equations look exactly like the linearity and homomorphism conditions that occur throughout algebra—in contrast to the chain rule. In Section 6, we show that an analogous characterization theorem holds mod p.

We then make precise an idea of Kontsevich linking entropy over R with entropy over Z/pZ. Consider a distribution π = (π1, …, πn) whose probabilities πi are rational numbers. On the one hand, we can take its real entropy HR(π). On the other, whenever p is a prime not dividing the denominator of any πi, we can view π as a probability distribution mod p and therefore take its entropy Hp(π) mod p. Kontsevich suggested viewing Hp(π) as the ‘residue’ of HR(π), and Section 7 establishes that this construction has the basic properties that one would expect from the name.
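As a concrete illustration of this idea, one can view a rational distribution mod p and compute its entropy there directly from the definition of Section 3; the helper names to_mod_p and entropy_mod_p are ours, introduced only for this sketch:

```python
from fractions import Fraction

def to_mod_p(rationals, p):
    """View a rational distribution mod p (denominators must be prime to p)."""
    return [f.numerator * pow(f.denominator, -1, p) % p for f in rationals]

def entropy_mod_p(rep, p):
    """Entropy mod p from integer representatives: (1 - sum a_i^p)/p, reduced mod p."""
    assert sum(rep) % p == 1 % p
    return ((1 - sum(pow(a, p) for a in rep)) // p) % p

# The distribution (1/2, 1/3, 1/6), viewed mod 7, has a well-defined entropy mod 7.
pi = to_mod_p([Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)], 7)
print(pi, entropy_mod_p(pi, 7))  # → [4, 5, 6] 2
```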

Finally, we analyse Hp not as a function but as a polynomial (Sections 8 and 9). We show that

 H(π) = −∑_{0 ≤ r1, …, rn < p, r1+⋯+rn = p} π1^{r1} ⋯ πn^{rn}/(r1! ⋯ rn!)

(which formally is equal to (1/p)((π1 + ⋯ + πn)^p − π1^p − ⋯ − πn^p)). We prove several identities in this polynomial. In the case of distributions on two elements, we find that

 H(π, 1−π) = ∑_{r=1}^{p−1} π^r/r (3)

for π ∈ Z/pZ, and we discuss some properties that this polynomial possesses.

The present work should be regarded as a beginning rather than an end. In information theory, Shannon entropy is just the simplest of a family of fundamental concepts including relative entropy, conditional entropy, and mutual information. It is natural to seek their mod p analogues, and to prove analogous theorems; however, this is not attempted here.

### Related work

This work is a development and extension of a two-and-a-half page note of Kontsevich [13]. In it, Kontsevich did just enough to show that a reasonable definition of entropy mod p must exist, but without actually giving the definition (except for probability distributions on two elements). He also suggested regarding the entropy mod p of a distribution with rational probabilities as the ‘residue’ of its real entropy, but did no more than make the suggestion. The relationship between [13] and the present work is further clarified at the start of Section 7 and the end of Section 9.

Kontsevich’s note appears to have been motivated by questions about polylogarithms. (The polynomial (3) is a truncation of the power series of −log(1−π), and one can consider more generally a finite version of the kth polylogarithm.) That line of enquiry has been pursued by Elbaz-Vincent and Gangl [8, 9]. As recounted in the introduction to [8], some of Kontsevich’s results had already appeared in papers of Cathelineau [5, 6]. The connection between this part of algebra and information theory was noticed at least as far back as 1996 ([6], p. 1327). In the present work, however, polylogarithms play no part and entropy takes centre stage.

Unlike much previous work on characterizations of entropies, we are able to dispense completely with symmetry axioms. For example, Faddeev’s theorem on real entropy [10] characterized it as the unique continuous quantity satisfying the chain rule and invariant under permutation of its arguments. However, a careful reading of the proof shows that the symmetry assumption can be dropped. The axiomatization of entropy via the so-called fundamental equation of information theory, discussed in Section 9, also makes use of a symmetry assumption. While symmetry appears to be essential to that approach (Remark 9), we will not need it.

The chain rule (2) is often stated in the special case k1 = 2, k2 = ⋯ = kn = 1, or occasionally in the different special case n = 2. In the presence of the symmetry axiom, either of these special cases implies the general case, by a routine induction. For example, Faddeev used the first special case, whose asymmetry forced him to impose the symmetry axiom too; but that can be avoided by assuming the chain rule in its general form.

The operation ∂p mentioned above is basic in the theory of p-derivations (as in Buium [4]), which are themselves closely related to Frobenius lifts and the Adams operations on K-theory (as in Joyal [12]).

There is a growing understanding of the homological aspects of entropy. This was exploited directly by Kontsevich [13] and developed at length by Baudot and Bennequin [3]; we touch on it in Section 9.

One can speculate about extending the theory of entropy to fields other than R and Z/pZ, and in particular to the p-adic numbers; however, the p-adic entropy of Deninger [7] is of a different nature.

### Convention

Throughout, p denotes a prime number, possibly 2.

### Acknowledgements

I thank James Borger, Herbert Gangl and Todd Trimble for enlightening conversations.

## 2 Logarithms and derivations

Real entropy is a kind of higher logarithm, in the senses that it has the multiplication-to-addition property

 H(π⊗γ)=H(π)+H(γ)

(in notation defined at the end of Section 4), and that when restricted to uniform distributions, it is the logarithm function itself:

 H(1/n,…,1/n)=logn.

To find the right definition of entropy mod p, we therefore begin by considering logarithms mod p.

Lagrange’s theorem immediately implies that there is no logarithm mod p, or more precisely that the only homomorphism from the multiplicative group (Z/pZ)× to the additive group Z/pZ is trivial. However, there is an acceptable substitute. For an integer a not divisible by p, the Fermat quotient of a mod p is the integer

 qp(a) = (a^{p−1} − 1)/p.

We usually regard qp(a) as an element of Z/pZ. The following lemma sets out the basic properties of Fermat quotients, established by Eisenstein; the proof is elementary and omitted.

###### Lemma

The map qp has the following properties:

1. qp(ab) = qp(a) + qp(b) for all a, b not divisible by p, and qp(1) = 0;

2. qp(a + mp) = qp(a) − m/a for all m and all a not divisible by p;

3. qp(a + p²) = qp(a) for all a not divisible by p.

The lemma implies that qp defines a group homomorphism

 qp : (Z/p²Z)× → Z/pZ.

It is surjective since, by the lemma again, it has a section m ↦ 1 − mp.
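The logarithmic property of the Fermat quotient, and its dependence only on a mod p², are finite statements for any particular p and so can be spot-checked by machine. A minimal sketch (the helper name fermat_quotient is ours):

```python
def fermat_quotient(a, p):
    """q_p(a) = (a^(p-1) - 1)/p, reduced mod p; defined when p does not divide a."""
    assert a % p != 0
    return ((pow(a, p - 1) - 1) // p) % p

p = 7
# Logarithm-like property: q_p(ab) = q_p(a) + q_p(b) in Z/pZ.
for a in range(1, 30):
    for b in range(1, 30):
        if a % p and b % p:
            lhs = fermat_quotient(a * b, p)
            rhs = (fermat_quotient(a, p) + fermat_quotient(b, p)) % p
            assert lhs == rhs
# q_p(a) depends only on a mod p^2, so q_p descends to (Z/p^2 Z)^x.
assert fermat_quotient(3, p) == fermat_quotient(3 + p * p, p)
print("ok")
```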

The Fermat quotient is the closest approximation to a logarithm mod p, in the sense that although there is no nontrivial group homomorphism (Z/pZ)× → Z/pZ, it is a homomorphism (Z/p²Z)× → Z/pZ. It is essentially unique as such:

###### Proposition

Every group homomorphism (Z/p²Z)× → Z/pZ is a scalar multiple of the Fermat quotient.

###### Proof

This follows from the standard fact that the group (Z/p²Z)× is cyclic (Theorem 10.6 of Apostol [1], for instance), together with the observation that qp is nontrivial (being surjective). Indeed, let u be a generator of (Z/p²Z)×; then given a homomorphism φ : (Z/p²Z)× → Z/pZ, we have φ = c·qp where c = φ(u)/qp(u).

Our characterization theorem for entropy mod p will use the following characterization of the Fermat quotient.

###### Proposition

Let f be a function assigning an element of Z/pZ to each integer not divisible by p. The following are equivalent:

1. f(ab) = f(a) + f(b) and f(a + p²) = f(a) for all a, b not divisible by p;

2. f = c·qp for some c ∈ Z/pZ.

###### Proof

Since qp satisfies the conditions in (i), so does any constant multiple of it. Hence (ii) implies (i). The converse follows from Proposition 2.

The entropy of a real probability distribution π = (π1, …, πn) is

 HR(π) = ∑_{i=1}^n ∂R(πi),

where

 ∂R(x) = −x log x if x > 0, and ∂R(x) = 0 if x = 0. (4)

The operator ∂R is a nonlinear derivation, in the sense that

 ∂R(xy)=∂R(x)y+x∂R(y),∂R(1)=0.

A useful viewpoint on the entropy of π = (π1, …, πn) is that it measures the failure of the nonlinear operator ∂R to preserve the sum ∑_i πi:

 HR(π)=n∑i=1∂R(πi)−∂R(n∑i=1πi). (5)

We will define entropy mod p in such a way that the analogue of this equation holds.

The mod p analogue of ∂R is the function ∂p : Z → Z given by

 ∂p(a) = (a − a^p)/p ≡ { −a·qp(a) if p ∤ a;  a/p if p ∣ a }  (mod p).

We usually write ∂p as just ∂, and treat ∂(a) as an integer mod p. Evidently the element ∂(a) of Z/pZ depends only on the residue class of a mod p², so we can also view ∂ (like qp) as a function Z/p²Z → Z/pZ.

###### Lemma

∂(ab) = ∂(a)b + a∂(b) for all a, b ∈ Z, and ∂(1) = 0.

###### Proof

Elementary.
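The derivation property of ∂p can likewise be verified computationally for any particular p; here is a small sketch (the helper name d is ours):

```python
def d(a, p):
    """∂_p(a) = (a - a^p)/p, reduced mod p (an integer, by Fermat's little theorem)."""
    return ((a - pow(a, p)) // p) % p

p = 5
assert d(1, p) == 0
for a in range(-20, 20):
    for b in range(-20, 20):
        # Derivation property mod p: ∂(ab) = ∂(a)·b + a·∂(b).
        assert d(a * b, p) == (d(a, p) * b + a * d(b, p)) % p
print("ok")
```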

## 3 The definition of entropy

For n ≥ 1, write

 Πn = {π ∈ (Z/pZ)^n : π1 + ⋯ + πn = 1}.

An element of Πn will be called a probability distribution mod p, or simply a distribution. We will define the entropy of any such distribution.

A standard elementary lemma (proof omitted) will be repeatedly useful:

###### Lemma

Let a ∈ Z. If a ≡ 1 (mod p) then a^p ≡ 1 (mod p²).

The observations at the end of Section 2 suggest that we define entropy mod p by the analogue of equation (5), replacing ∂R by ∂p. In principle this is impossible, since although ∂p is well-defined on congruence classes mod p², it is not well-defined on congruence classes mod p. Thus, for π ∈ Πn, the term ∂p(πi) is not well-defined. Nevertheless, the strategy can be made to work:

###### Lemma

For all n ≥ 1 and all a1, …, an ∈ Z such that a1 + ⋯ + an ≡ 1 (mod p),

 ∑_{i=1}^n ∂(ai) − ∂(∑_{i=1}^n ai) ≡ (1/p)(1 − ∑_{i=1}^n ai^p)  (mod p).

###### Proof

The right-hand side is an integer, since ∑ ai^p ≡ ∑ ai ≡ 1 (mod p). The lemma is equivalent to the congruence

 ∑(ai − ai^p) − {∑ ai − (∑ ai)^p} ≡ 1 − ∑ ai^p  (mod p²).

Cancelling, this reduces to

 (∑ ai)^p ≡ 1  (mod p²).

But ∑ ai ≡ 1 (mod p), so this holds by Lemma 3.

###### Definition

Let n ≥ 1 and π ∈ Πn. The entropy of π is

 Hp(π) = (1/p)(1 − ∑_{i=1}^n ai^p) ∈ Z/pZ,

where ai ∈ Z represents πi ∈ Z/pZ. We often write Hp as just H.

Lemma 3 guarantees that the definition is independent of the choice of representatives ai, and Lemma 3 gives

 Hp(π) = ∑ ∂p(ai) − ∂p(∑ ai), (6)

as in the real case (equation (5)). But in contrast to the real case, the term ∂p(∑ ai) is not always zero, and if it were omitted then the right-hand side would no longer be independent of the choice of integers ai.
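A small computational check of the definition, of equation (6), and of independence of representatives (the helper names d and H are ours):

```python
def d(a, p):
    """∂_p(a) = (a - a^p)/p, reduced mod p."""
    return ((a - pow(a, p)) // p) % p

def H(rep, p):
    """Entropy mod p from integer representatives a_i with sum ≡ 1 (mod p)."""
    return ((1 - sum(pow(a, p) for a in rep)) // p) % p

p = 5
rep = [2, 2, 2]  # represents (2, 2, 2) ∈ Π3 over Z/5Z, since 2 + 2 + 2 = 6 ≡ 1 (mod 5)

# Equation (6): H_p(π) = Σ ∂(a_i) − ∂(Σ a_i).
assert H(rep, p) == (sum(d(a, p) for a in rep) - d(sum(rep), p)) % p
# Independence of the choice of representatives: shift one entry by p.
assert H([2 + p, 2, 2], p) == H(rep, p)
print(H(rep, p))  # → 1
```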

###### Example

Let n ≥ 1 with p ∤ n. Since n is invertible mod p, there is a uniform distribution

 un = (1/n, …, 1/n) ∈ Πn.

Choose a ∈ Z representing 1/n ∈ Z/pZ. By equation (6) and then the derivation property of ∂,

 Hp(un) = n∂(a) − ∂(na) = −a∂(n).

But ∂(n) = −n·qp(n) and an ≡ 1 (mod p), so Hp(un) = qp(n). This result over Z/pZ is analogous to the formula HR(1/n, …, 1/n) = log n for the real entropy of a uniform distribution.
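The identity Hp(un) = qp(n) in this example can be confirmed mechanically for small cases; a sketch, with assumed helper names H and q:

```python
def H(rep, p):
    """Entropy mod p from integer representatives with sum ≡ 1 (mod p)."""
    return ((1 - sum(pow(a, p) for a in rep)) // p) % p

def q(n, p):
    """Fermat quotient q_p(n), reduced mod p."""
    return ((pow(n, p - 1) - 1) // p) % p

p = 7
for n in range(1, 20):
    if n % p:
        inv = pow(n, -1, p)                # a representative of 1/n in Z/pZ
        assert H([inv] * n, p) == q(n, p)  # H_p(u_n) = q_p(n)
print("ok")
```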

###### Example

Let p = 2. For π ∈ Πn, write supp(π) = {i : πi ≠ 0}. Since π1 + ⋯ + πn = 1, the cardinality of supp(π) must be odd. Directly from the definition of entropy, H(π) is given by

 H(π) = (1/2)(|supp(π)| − 1) = { 0 if |supp(π)| ≡ 1 (mod 4);  1 if |supp(π)| ≡ 3 (mod 4) }.

In preparation for the next example, we record a standard lemma (proof omitted):

###### Lemma

(1/p)·(p choose r) ≡ (−1)^{r−1}/r (mod p) for all r ∈ {1, …, p − 1}.

###### Example

We compute the entropy of a distribution (π, 1−π) ∈ Π2 on two elements. Choose a ∈ Z representing π. Directly from the definition of entropy, and assuming that p is odd,

 H(π, 1−π) = (1/p)(1 − a^p − (1−a)^p) = ∑_{r=1}^{p−1} (−1)^{r+1}·(1/p)·(p choose r)·a^r.

But (1/p)·(p choose r) ≡ (−1)^{r−1}/r (mod p), so by Lemma 3, the coefficient of a^r in the sum is just 1/r. Hence

 H(π, 1−π) = ∑_{r=1}^{p−1} π^r/r.

The function on the right-hand side was the starting point of Kontsevich’s note [13], and we return to it in Section 9. In the case p = 2, we have H(π, 1−π) = 0 for both values of π.
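The closed form for two-element distributions can be checked against the definition for small odd p; the helper names H2 and truncated_log are ours:

```python
def H2(a, p):
    """Entropy mod p of (π, 1−π), computed from an integer a representing π."""
    return ((1 - pow(a, p) - pow(1 - a, p)) // p) % p

def truncated_log(a, p):
    """The truncated logarithm Σ_{r=1}^{p−1} a^r/r (mod p), with 1/r the inverse mod p."""
    return sum(pow(a, r) * pow(r, -1, p) for r in range(1, p)) % p

for p in (3, 5, 7):  # the identity requires p odd
    for a in range(p):
        assert H2(a, p) == truncated_log(a, p)
print("ok")
```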

###### Example

Appending zero probabilities to a distribution does not change its entropy:

 H(π1,…,πn,0,…,0)=H(π1,…,πn).

This is immediate from the definition. But a subtlety of distributions mod p, absent in the standard real setting, is that nonzero ‘probabilities’ can sum to zero. So, one can ask whether

 H(π1, …, πn, τ1, …, τm) = H(π1, …, πn)

whenever τ1, …, τm ∈ Z/pZ with τ1 + ⋯ + τm = 0. The answer is no. For instance, when p = 3, π = (1) and (τ1, τ2, τ3) = (1, 1, 1), Example 3 gives

 H(1, 1, 1, 1) = H(u4) = q3(4) = (1/3)(4² − 1) = 5 ≡ −1 ≢ 0 = H(1),

even though 1 + 1 + 1 ≡ 0 (mod 3).

## 4 The chain rule

Here we formulate the mod p version of the chain rule for entropy, which will later be shown to characterize entropy uniquely up to a constant.

In the Introduction, it was noted that real probability distributions can be composed in a way that corresponds to performing two random processes in sequence. Exactly the same formula (1) defines a composition of probability distributions mod p, where now

 π ∈ Πn,  γi ∈ Πki,  π∘(γ1, …, γn) ∈ Πk1+⋯+kn.

And entropy mod p satisfies exactly the same chain rule for composition:

###### Proposition (Chain rule)

We have

 Hp(π∘(γ1,…,γn))=Hp(π)+n∑i=1πiHp(γi)

for all n, k1, …, kn ≥ 1, all π ∈ Πn, and all γ1 ∈ Πk1, …, γn ∈ Πkn.

###### Proof

Write γi = (γi1, …, γiki). Choose integers ai representing πi and integers bij representing γij, for each i and j. Write Bi = bi1 + ⋯ + biki.

We evaluate in turn the three terms H(π∘(γ1, …, γn)), H(π) and ∑ πiH(γi). First, by Lemma 3 and the derivation property of ∂ (Lemma 2),

 H(π∘(γ1, …, γn)) = ∑_{i=1}^n ∂(ai)Bi + ∑_{i=1}^n ai ∑_{j=1}^{ki} ∂(bij) − ∂(∑_{i=1}^n aiBi).

Second, since γi1 + ⋯ + γiki = 1, we have Bi ≡ 1 (mod p), so aiBi represents πi. Hence

 H(π) = ∑_{i=1}^n ∂(ai)Bi + ∑_{i=1}^n ai∂(Bi) − ∂(∑_{i=1}^n aiBi).

Third,

 ∑_{i=1}^n πiH(γi) = ∑_{i=1}^n ai ∑_{j=1}^{ki} ∂(bij) − ∑_{i=1}^n ai∂(Bi).

The result follows.

A special case of composition is the tensor product of distributions, defined for π ∈ Πn and γ ∈ Πk by

 π⊗γ =π∘(γ,…,γ) =(π1γ1,…,π1γk, …, πnγ1,…,πnγk)∈Πnk.

In the analogous case of real distributions, π⊗γ is the joint distribution of two independent random variables with distributions π and γ.

The chain rule immediately implies a logarithmic property of entropy mod p:

 Hp(π⊗γ) = Hp(π) + Hp(γ)

for all π ∈ Πn and γ ∈ Πk.
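Since everything here is a finite computation for any particular p, the chain rule itself can be spot-checked; a sketch working on integer representatives, with assumed helper names H and compose:

```python
def H(rep, p):
    """Entropy mod p from integer representatives with sum ≡ 1 (mod p)."""
    return ((1 - sum(pow(a, p) for a in rep)) // p) % p

def compose(pi, gammas):
    """π∘(γ1, …, γn) on integer representatives: entrywise products, as in equation (1)."""
    return [a * b for a, gamma in zip(pi, gammas) for b in gamma]

p = 5
pi = [2, 4]                   # 2 + 4 = 6 ≡ 1 (mod 5)
gammas = [[3, 3], [1, 2, 3]]  # each inner list sums to 6 ≡ 1 (mod 5)

lhs = H(compose(pi, gammas), p)
rhs = (H(pi, p) + sum(a * H(g, p) for a, g in zip(pi, gammas))) % p
print(lhs == rhs)  # the chain rule holds
```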

## 5 Unique characterization of entropy

Our main theorem is that up to a constant factor, entropy mod p is the only quantity satisfying the chain rule.

###### Theorem

Let (I : Πn → Z/pZ)_{n≥1} be a sequence of functions. The following are equivalent:

1. I satisfies the chain rule (that is, satisfies the conclusion of Proposition 4 with I in place of Hp);

2. I = c·Hp for some c ∈ Z/pZ.

Since Hp satisfies the chain rule, so does any constant multiple of it. Hence (ii) implies (i). We now begin the proof of the converse.

For the rest of the proof, let (I : Πn → Z/pZ)_{n≥1} be a sequence of functions satisfying the chain rule. Recall that un denotes the uniform distribution (1/n, …, 1/n) ∈ Πn, for n not divisible by p.

###### Lemma
1. I(umn) = I(um) + I(un) for all m, n not divisible by p;

2. I(u1) = 0.

###### Proof

By the chain rule, has the logarithmic property

 I(π⊗γ)=I(π∘(γ,…,γ))=I(π)+I(γ)

for all and . In particular, for all not divisible by ,

 I(umn)=I(um⊗un)=I(um)+I(un),

proving (i). For (ii), take in (i).

###### Lemma

I(1, 0) = 0 = I(0, 1).

###### Proof

To prove that I(1, 0) = 0, we compute I(1, 0, 0) in two ways. On the one hand, using the chain rule,

 I(1,0,0)=I((1,0)∘((1,0),u1))=I(1,0)+1⋅I(1,0)+0⋅I(u1)=2I(1,0).

On the other, using the chain rule again and the fact that I(u1) = 0,

 I(1,0,0)=I((1,0)∘(u1,(1,0)))=I(1,0)+1⋅I(u1)+0⋅I(1,0)=I(1,0).

Hence I(1, 0) = 0. The proof that I(0, 1) = 0 is similar.

###### Lemma

For all n ≥ 1, π ∈ Πn and 0 ≤ i ≤ n,

 I(π1,…,πn)=I(π1,…,πi,0,πi+1,…,πn).

###### Proof

First suppose that i ≥ 1. Then

 (π1, …, πi, 0, πi+1, …, πn) = π∘(u1, …, u1, (1, 0), u1, …, u1),

with the distribution (1, 0) appearing in the ith place. Applying I to both sides, then using the chain rule and I(1, 0) = 0 = I(u1), gives the result. The case i = 0 is proved similarly, using I(0, 1) = 0.

We will prove the characterization theorem by analysing I(un) as n varies. The chain rule will allow us to deduce the value of I(π) for more general distributions π, thanks to the following lemma.

###### Lemma

Let π ∈ Πn with πi ≠ 0 for all i. For each i, let ki be a positive integer representing πi, and write k = k1 + ⋯ + kn. Then

 I(π) = I(uk) − ∑_{i=1}^n kiI(uki).

###### Proof

First note that none of k1, …, kn is a multiple of p, and k ≡ π1 + ⋯ + πn ≡ 1 (mod p), so uk1, …, ukn and uk are well-defined. We have

 π∘(uk1, …, ukn) = (1, …, 1) = uk,

with k entries each equal to 1 = 1/k (since ki represents πi and k ≡ 1 (mod p)). Applying I to both sides and using the chain rule gives the result.

We come now to the most delicate part of the argument. Since Hp(un) = qp(n), and since qp(n) is p²-periodic in n, if I is to be a constant multiple of Hp then I(un) must also be p²-periodic in n. We show this directly.

###### Lemma

I(un+p²) = I(un) for all natural numbers n not divisible by p.

###### Proof

First we prove the existence of a constant c ∈ Z/pZ such that for all n not divisible by p,

 I(un+p) = I(un) − c/n. (7)

(Compare Lemma 2(ii).) An equivalent statement is that n(I(un+p) − I(un)) is independent of n. Since for any two such n we can choose some m ≡ 1 (mod p) exceeding both, it is enough to show that whenever m ≡ 1 (mod p) and p ∤ n with n < m,

 n(I(un+p)−I(un))=I(um+p)−I(um). (8)

To prove this, consider the distribution

 π = (n, 1, …, 1) ∈ Πm−n+1

(with m − n entries equal to 1).

By Lemma 5 and the fact that I(u1) = 0,

 I(π)=I(um)−nI(un).

But also

 π = (n + p, 1, …, 1),

since n + p ≡ n (mod p), so by the same argument,

 I(π)=I(um+p)−(n+p)I(un+p)=I(um+p)−nI(un+p).

Comparing the two expressions for I(π) gives equation (8), thus proving the initial claim.

By induction, equation (7) gives

 I(un+rp) = I(un) − cr/n

for all r ≥ 0 and all n not divisible by p. The result follows by putting r = p.

We can now prove the characterization theorem for entropy modulo p.

###### Proof of Theorem 5

Define f(n) = I(un) for natural numbers n not divisible by p. Lemma 5, Lemma 5 and Proposition 2 together imply that f = c·qp for some c ∈ Z/pZ. By Example 3, an equivalent statement is that I(un) = cHp(un) for all n not divisible by p.

Since both I and cHp satisfy the chain rule, Lemma 5 applies to both; and since I and cHp are equal on uniform distributions, they are also equal on all distributions π such that πi ≠ 0 for all i. Finally, applying Lemma 5 to both I and cHp, we deduce by induction that I(π) = cHp(π) for all distributions π.

A variant of the characterization theorem will be useful. The distributions considered so far can be viewed as probability measures (mod p) on sets of the form {1, …, n}, but it will be convenient to generalize to arbitrary finite sets.

Thus, given a finite set X, write ΠX for the set of families (πx)x∈X of elements of Z/pZ such that ∑x∈X πx = 1. A finite probability space mod p is a finite set X together with an element π ∈ ΠX.

As in the real case, we can take convex combinations of probability spaces. Given a finite probability space (X, π) and a further family ((Yx, γx))x∈X of finite probability spaces, all mod p, we obtain a new probability space

 (∐x∈XYx,∐x∈Xπxγx)

mod p. Here ∐x∈X Yx is the disjoint union of the sets Yx, and ∐x∈X πxγx gives probability πxγxy to an element y ∈ Yx.

The operation of taking convex combinations is simply composition of distributions, in different notation. More precisely, if X = {1, …, n} and Yx = {1, …, kx} (x ∈ X) then the set ∐x Yx is naturally identified with {1, …, k1 + ⋯ + kn}, and under this identification, ∐x πxγx corresponds to the composite distribution π∘(γ1, …, γn).

The entropy of (X, π) is, of course, defined as

 H(π) = (1/p)(1 − ∑x∈X ax^p),

where ax ∈ Z represents πx for each x ∈ X.

This entropy is isomorphism-invariant, in the sense that whenever (X, π) and (Y, σ) are finite probability spaces mod p and there is some bijection f : X → Y satisfying σf(x) = πx for all x ∈ X, then H(σ) = H(π). The chain rule for entropy mod p, translated into the notation of convex combinations, states that

 H(∐x∈Xπxγx)=H(π)+∑x∈XπxH(γx) (9)

for all finite probability spaces (X, π) and (Yx, γx) (x ∈ X) mod p.

###### Corollary

Let I be a function assigning an element of Z/pZ to each finite probability space mod p. The following are equivalent:

1. I is isomorphism-invariant and satisfies the chain rule (9) (with I in place of H);

2. I = c·H for some c ∈ Z/pZ.

###### Proof

We have just observed that H satisfies the conditions in (i), and it follows that (ii) implies (i).

Conversely, take a function I satisfying (i). Restricting I to finite sets of the form {1, …, n} defines, for each n ≥ 1, a function Πn → Z/pZ satisfying the chain rule. Hence by Theorem 5, there is some constant c ∈ Z/pZ such that I(π) = cH(π) for all n and all π ∈ Πn. Now take any finite probability space (Y, σ). We have

 (Y,σ)≅({1,…,n},π)

for some n and π ∈ Πn, and then by isomorphism-invariance of both I and H,

 I(σ)=I(π)=cH(π)=cH(σ),

proving (ii).

###### Remark

This corollary is slightly weaker than our main characterization result, Theorem 5. Indeed, if I is an isomorphism-invariant function on the class of finite probability spaces mod p then in particular, permuting the arguments of a measure does not change the value that I gives it. Thus, the corollary also follows from a weaker version of Theorem 5 in which the putative entropy function is also assumed to be symmetric in its arguments. But Theorem 5 shows that the symmetry assumption is, in fact, unnecessary.

## 6 Information loss

Grothendieck came along and said, ‘No, the Riemann–Roch theorem is not a theorem about varieties, it’s a theorem about morphisms between varieties.’ —Nicholas Katz (quoted in [11], p. 1046).

The entropy of a probability space is a special case of a more general concept, the information loss of a map between probability spaces. This is most easily approached through the real case, as follows.

Given a real probability distribution π on a finite set, the entropy of π is usually understood as the amount of information gained by learning the result of an observation drawn from π. For instance, if π is uniform on a set of 2^k elements then the entropy (taking logarithms to base 2) is k, reflecting the fact that results of draws from π cannot be communicated in fewer than k bits each.

In the same spirit, one can ask how much information is lost by a deterministic process. Consider, for instance, the process of forgetting the suit of a card drawn fairly from a standard 52-card pack. Since the four suits are distributed uniformly, log 4 = 2 bits of information are lost. An alternative viewpoint is that the information loss is the amount of information at the start of the process minus the amount at the end, which is

 H(1/52,…,1/52)−H(1/13,…,1/13)=log52−log13=log4.

If we take logarithms to base 2 then the information loss is, again, 2 bits. Hence the two viewpoints give the same result.

Generally, given a measure-preserving map f : (Y, σ) → (X, π) between finite probability spaces, we can quantify the information lost by f in either of two equivalent ways. We can condition on the outcome x ∈ X, taking for each x the amount of information lost by collapsing the fibre f⁻¹(x):

 ∑_{x : πx ≠ 0} πx H((σy/πx)_{y∈f⁻¹(x)}). (10)

(The argument of H is the restriction of the distribution σ to f⁻¹(x), normalized to sum to 1.) Alternatively, we can subtract the amount of information at the end of the process from the amount at the start:

 H(σ)−H(π). (11)

The two expressions (10) and (11) are equal, as we will show in the analogous mod p case.

Entropy is the special case of information loss where one discards all the information. That is, the entropy of a probability distribution π on a set X is the information loss of the unique map from (X, π) to the one-point space. In this sense, the concept of information loss subsumes the concept of entropy.

The description so far is of information loss over R, which was analysed and characterized in Baez, Fritz and Leinster [2]. (In particular, equation (5) of [2] describes the relationship between information loss and conditional entropy.) We now show that a strictly analogous characterization theorem is available over Z/pZ, even in the absence of an information-theoretic interpretation.

Let and