The concept of entropy is applied in almost every branch of science. Less widely appreciated, however, is that from a purely algebraic perspective, entropy has a very simple nature. Indeed, Shannon entropy is characterized nearly uniquely by a single equation, expressing a recursivity property. The purpose of this work is to introduce a parallel notion of entropy for probability distributions whose ‘probabilities’ are not real numbers, but integers modulo a prime $p$. The entropy of such a distribution is also an integer mod $p$.
We will see that despite the (current) lack of scientific application, this ‘entropy’ is fully deserving of the name. Indeed, it is characterized by a recursivity equation formally identical to the one that characterizes classical real entropy. It is also directly related to real entropy, via a notion of residue informally suggested by Kontsevich.
Among the many types of entropy, the most basic is the Shannon entropy of a finite probability distribution $\pi = (\pi_1, \ldots, \pi_n)$, defined as
$$H(\pi) = -\sum_{i:\, \pi_i > 0} \pi_i \log \pi_i.$$
It is this that we will imitate in the mod $p$ setting.
The aforementioned recursivity property concerns the entropy of the composite of two processes, in which the nature of the second process depends on the outcome of the first. Specifically, let $\gamma = (\gamma_1, \ldots, \gamma_n)$ be a finite probability distribution, and let $\pi^1, \ldots, \pi^n$ be further distributions, writing $\pi^i = (\pi^i_1, \ldots, \pi^i_{k_i})$. Their composite is
$$\gamma \circ (\pi^1, \ldots, \pi^n) = \bigl(\gamma_1 \pi^1_1, \ldots, \gamma_1 \pi^1_{k_1},\ \ldots,\ \gamma_n \pi^n_1, \ldots, \gamma_n \pi^n_{k_n}\bigr), \qquad (1)$$
a probability distribution on $k_1 + \cdots + k_n$ elements. (Formally, this composition endows the sequence of simplices $(\Delta_n)_{n \geq 1}$ with the structure of an operad.) The chain rule (or recursivity or grouping law) for Shannon entropy is that
$$H\bigl(\gamma \circ (\pi^1, \ldots, \pi^n)\bigr) = H(\gamma) + \sum_{i=1}^n \gamma_i H(\pi^i). \qquad (2)$$
The chain rule can be understood in terms of information. Suppose we toss a fair coin and then, depending on the outcome, either roll a fair die or draw fairly from a pack of 52 cards. There are $6 + 52 = 58$ possible final outcomes, and their probabilities are given by the composite distribution
$$\bigl(\tfrac12, \tfrac12\bigr) \circ \Bigl(\bigl(\tfrac16, \ldots, \tfrac16\bigr), \bigl(\tfrac1{52}, \ldots, \tfrac1{52}\bigr)\Bigr).$$
Now, the entropy of a distribution $\pi$ measures the amount of information gained by learning the outcome of an observation drawn from $\pi$ (measured in bits, if logarithms are taken to base $2$). In our example, knowing the outcome of the composite process tells us with certainty the outcome of the initial coin toss, plus with probability $\tfrac12$ the outcome of a die roll and with probability $\tfrac12$ the outcome of a card draw. Thus, the entropy of the composite distribution should be equal to
$$H\bigl(\tfrac12, \tfrac12\bigr) + \tfrac12 H\bigl(\tfrac16, \ldots, \tfrac16\bigr) + \tfrac12 H\bigl(\tfrac1{52}, \ldots, \tfrac1{52}\bigr).$$
This is indeed true, and is an instance of the chain rule.
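The coin–die–cards instance of the chain rule can be checked numerically. The following sketch does so; the function names are ours, introduced only for illustration.

```python
from math import log2

def shannon_entropy(dist):
    """Shannon entropy, in bits, of a finite probability distribution."""
    return -sum(x * log2(x) for x in dist if x > 0)

def compose(gamma, blocks):
    """The composite distribution gamma o (pi^1, ..., pi^n)."""
    return [g * x for g, block in zip(gamma, blocks) for x in block]

coin = [1/2, 1/2]
die = [1/6] * 6
cards = [1/52] * 52

composite = compose(coin, [die, cards])   # 6 + 52 = 58 final outcomes
chain_rule_rhs = (shannon_entropy(coin)
                  + coin[0] * shannon_entropy(die)
                  + coin[1] * shannon_entropy(cards))
assert len(composite) == 58
assert abs(shannon_entropy(composite) - chain_rule_rhs) < 1e-12
```

Both sides evaluate to $1 + \tfrac12\log_2 6 + \tfrac12\log_2 52$ bits.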
A classical theorem essentially due to Faddeev states that up to a constant factor, Shannon entropy is the only continuous function assigning a real number to each finite probability distribution in such a way that the chain rule holds. In this sense, the chain rule is the characteristic property of entropy.
Our first task will be to formulate the right definition of entropy mod $p$. An immediate obstacle is that there is no logarithm function mod $p$, at least in the most obvious sense. Nevertheless, the classical Fermat quotient turns out to provide an acceptable substitute (Section 2). Closely related to the real logarithm is the nonlinear derivation $x \mapsto -x \log x$, and its mod $p$ analogue is $a \mapsto (a - a^p)/p$ (a $p$-derivation, in the language of Buium).
The entropy of a mod $p$ probability distribution $\pi = (\pi_1, \ldots, \pi_n)$, with $\pi_1 + \cdots + \pi_n = 1$ in $\mathbb{Z}/p\mathbb{Z}$, is then defined as
$$H_p(\pi) = \frac{1}{p}\Bigl(1 - \sum_{i=1}^n a_i^p\Bigr) \in \mathbb{Z}/p\mathbb{Z},$$
where $a_i$ is an integer representing $\pi_i \in \mathbb{Z}/p\mathbb{Z}$ (Section 3). The definition is independent of the choice of representatives $a_i$. This entropy satisfies a chain rule formally identical to that satisfied by real entropy (Section 4). We prove in Section 5 that up to a constant factor, $H_p$ is the one and only function satisfying the chain rule. This is the main justification for the definition.
Classical Shannon entropy quantifies the information associated with a probability space, but one can also seek to quantify the information lost by a map between probability spaces, seen as a deterministic process. For example, if one chooses uniformly at random a binary number with ten digits, then discards the last three, the discarding process loses three bits.
There is a formal definition of information loss, it includes the definition of entropy as a special case, and it has been uniquely characterized in work of Baez, Fritz and Leinster . The advantage of working with information loss rather than entropy is that the characterizing equations look exactly like the linearity and homomorphism conditions that occur throughout algebra—in contrast to the chain rule. In Section 6, we show that an analogous characterization theorem holds mod .
We then make precise an idea of Kontsevich linking entropy over $\mathbb{Z}/p\mathbb{Z}$ with entropy over $\mathbb{R}$. Consider a probability distribution $\pi = (\pi_1, \ldots, \pi_n)$ whose probabilities $\pi_i$ are rational numbers. On the one hand, we can take its real entropy $H(\pi)$. On the other, whenever $p$ is a prime not dividing the denominator of any $\pi_i$, we can view $\pi$ as a probability distribution mod $p$ and therefore take its entropy $H_p(\pi)$ mod $p$. Kontsevich suggested viewing $H_p(\pi)$ as the ‘residue’ of $H(\pi)$, and Section 7 establishes that this construction has the basic properties that one would expect from the name.
We also study the polynomial
$$\sum_{k=1}^{p-1} \frac{x^k}{k} \qquad (3)$$
(which formally is equal to $-\log(1-x)$). We prove several identities in this polynomial. In the case of distributions on two elements, we find that
$$H_p(\pi, 1 - \pi) = \sum_{k=1}^{p-1} \frac{\pi^k}{k}$$
for $\pi \in \mathbb{Z}/p\mathbb{Z}$, and we discuss some properties that this polynomial possesses.
The present work should be regarded as a beginning rather than an end. In information theory, Shannon entropy is just the simplest of a family of fundamental concepts including relative entropy, conditional entropy, and mutual information. It is natural to seek their mod $p$ analogues, and to prove analogous theorems; however, this is not attempted here.
This work is a development and extension of a two-and-a-half page note of Kontsevich. In it, Kontsevich did just enough to show that a reasonable definition of entropy mod $p$ must exist, but without actually giving the definition (except for probability distributions on two elements). He also suggested regarding the entropy mod $p$ of a distribution with rational probabilities as the ‘residue’ of its real entropy, but did no more than make the suggestion. The relationship between that note and the present work is further clarified at the start of Section 7 and the end of Section 9.
Kontsevich’s note appears to have been motivated by questions about polylogarithms. (The polynomial (3) is a truncation of the power series of $-\log(1-x)$, and one can consider more generally a finite version of the $m$th polylogarithm.) That line of enquiry has been pursued by Elbaz-Vincent and Gangl [8, 9]. As recounted in the introduction to those papers, some of Kontsevich’s results had already appeared in papers of Cathelineau [5, 6]. The connection between this part of algebra and information theory was noticed at least as far back as 1996 (p. 1327). In the present work, however, polylogarithms play no part and entropy takes centre stage.
Unlike much previous work on characterizations of entropies, we are able to dispense completely with symmetry axioms. For example, Faddeev’s theorem on real entropy  characterized it as the unique continuous quantity satisfying the chain rule and invariant under permutation of its arguments. However, a careful reading of the proof shows that the symmetry assumption can be dropped. The axiomatization of entropy via the so-called fundamental equation of information theory, discussed in Section 9, also makes use of a symmetry assumption. While symmetry appears to be essential to that approach (Remark 9), we will not need it.
The chain rule (2) is often stated in the special case $k_1 = \cdots = k_{n-1} = 1$, $k_n = 2$, or occasionally in the different special case $n = 2$. In the presence of the symmetry axiom, either of these special cases implies the general case, by a routine induction. For example, Faddeev used the first special case, whose asymmetry forced him to impose the symmetry axiom too; but that can be avoided by assuming the chain rule in its general form.
There is a growing understanding of the homological aspects of entropy. This was exploited directly by Kontsevich  and developed at length by Baudot and Bennequin ; we touch on it in Section 9.
One can speculate about extending the theory of entropy to fields other than $\mathbb{R}$ and $\mathbb{Z}/p\mathbb{Z}$, and in particular to the $p$-adic numbers; however, the $p$-adic entropy of Deninger is of a different nature.
Throughout, $p$ denotes a prime number, possibly $2$.
I thank James Borger, Herbert Gangl and Todd Trimble for enlightening conversations.
2 Logarithms and derivations
Real entropy is a kind of higher logarithm, in the senses that it has the multiplication-to-addition property
$$H(\pi \otimes \rho) = H(\pi) + H(\rho)$$
(in notation defined at the end of Section 4), and that when restricted to uniform distributions, it is the logarithm function itself:
$$H(u_n) = \log n,$$
where $u_n = (1/n, \ldots, 1/n)$.
To find the right definition of entropy mod $p$, we therefore begin by considering logarithms mod $p$.
Lagrange’s theorem immediately implies that there is no logarithm mod $p$, or more precisely that the only homomorphism from the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^\times$ to the additive group $\mathbb{Z}/p\mathbb{Z}$ is trivial. However, there is an acceptable substitute. For an integer $a$ not divisible by $p$, the Fermat quotient of $a$ mod $p$ is the integer
$$q(a) = \frac{a^{p-1} - 1}{p}.$$
We usually regard $q(a)$ as an element of $\mathbb{Z}/p\mathbb{Z}$. The following lemma sets out the basic properties of Fermat quotients, established by Eisenstein; the proof is elementary and omitted.
The map $q$ has the following properties:
$q(ab) = q(a) + q(b)$ for all $a, b$ not divisible by $p$, and $q(1) = 0$;
$q(a + tp) = q(a) - t a^{-1}$ in $\mathbb{Z}/p\mathbb{Z}$, for all $a, t \in \mathbb{Z}$ with $a$ not divisible by $p$;
$q(a + p^2) = q(a)$ in $\mathbb{Z}/p\mathbb{Z}$, for all $a$ not divisible by $p$.
The lemma implies that $q$ defines a group homomorphism
$$q \colon (\mathbb{Z}/p^2\mathbb{Z})^\times \to \mathbb{Z}/p\mathbb{Z}.$$
It is surjective, since by the lemma again, it has a section $c \mapsto 1 - cp$.
The Fermat quotient is the closest approximation to a logarithm mod $p$, in the sense that although there is no nontrivial group homomorphism $(\mathbb{Z}/p\mathbb{Z})^\times \to \mathbb{Z}/p\mathbb{Z}$, it is a homomorphism $(\mathbb{Z}/p^2\mathbb{Z})^\times \to \mathbb{Z}/p\mathbb{Z}$. It is essentially unique as such:
Every group homomorphism $(\mathbb{Z}/p^2\mathbb{Z})^\times \to \mathbb{Z}/p\mathbb{Z}$ is a scalar multiple of the Fermat quotient.
This follows from the standard fact that the group $(\mathbb{Z}/p^2\mathbb{Z})^\times$ is cyclic (Theorem 10.6 of Apostol, for instance), together with the observation that $q$ is nontrivial (being surjective). Indeed, let $g$ be a generator of $(\mathbb{Z}/p^2\mathbb{Z})^\times$; then given a homomorphism $f$, we have $f = cq$ where $c = f(g)/q(g)$.
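The properties of the Fermat quotient are easy to verify computationally. A minimal sketch (the function name is ours), checking the homomorphism property, the behaviour under shifts by $p$, and the section $c \mapsto 1 - cp$:

```python
def fermat_quotient(a, p):
    """The Fermat quotient q(a) = (a**(p-1) - 1) / p, reduced mod p.

    Requires that p does not divide a; the numerator is then divisible
    by p, by Fermat's little theorem."""
    assert a % p != 0
    return ((a**(p - 1) - 1) // p) % p

p = 7

# q(ab) = q(a) + q(b) mod p
for a in range(1, 25):
    for b in range(1, 25):
        if a % p != 0 and b % p != 0:
            assert fermat_quotient(a * b, p) == \
                (fermat_quotient(a, p) + fermat_quotient(b, p)) % p

# q(a + tp) = q(a) - t/a mod p
for a in range(1, 15):
    if a % p != 0:
        for t in range(1, 5):
            expected = (fermat_quotient(a, p) - t * pow(a, -1, p)) % p
            assert fermat_quotient(a + t * p, p) == expected

# surjectivity: c -> 1 - c*p is a section of q
assert all(fermat_quotient(1 - c * p, p) == c for c in range(p))
```

(The modular inverse `pow(a, -1, p)` requires Python 3.8 or later.)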
Our characterization theorem for entropy mod $p$ will use the following characterization of the Fermat quotient.
Let $f$ be a function assigning an element of $\mathbb{Z}/p\mathbb{Z}$ to each integer not divisible by $p$. The following are equivalent:
$f(ab) = f(a) + f(b)$ and $f(a + p^2) = f(a)$ for all $a, b$ not divisible by $p$;
$f = cq$ for some $c \in \mathbb{Z}/p\mathbb{Z}$.
Write $\partial(x) = -x \log x$ for $x > 0$, and $\partial(0) = 0$. The entropy of a real probability distribution $\pi = (\pi_1, \ldots, \pi_n)$ is
$$H(\pi) = \sum_{i=1}^n \partial(\pi_i).$$
The operator $\partial$ is a nonlinear derivation, in the sense that
$$\partial(xy) = x\,\partial(y) + \partial(x)\,y.$$
In particular, $\partial(1) = 0$. A useful viewpoint on the entropy of $\pi$ is that it measures the failure of the nonlinear operator $\partial$ to preserve the sum $\sum_i \pi_i$:
$$H(\pi) = \sum_{i=1}^n \partial(\pi_i) - \partial\Bigl(\sum_{i=1}^n \pi_i\Bigr). \qquad (5)$$
We will define entropy mod $p$ in such a way that the analogue of this equation holds.
The mod $p$ analogue of $\partial$ is the function $\partial_p \colon \mathbb{Z} \to \mathbb{Z}/p\mathbb{Z}$ given by
$$\partial_p(a) = \frac{a - a^p}{p} \pmod p.$$
We usually write $\partial_p$ as just $\partial$, and treat $\partial(a)$ as an integer mod $p$. Evidently the element $\partial(a)$ of $\mathbb{Z}/p\mathbb{Z}$ depends only on the residue class of $a$ mod $p^2$, so we can also view $\partial$ (like $q$) as a function $\mathbb{Z}/p^2\mathbb{Z} \to \mathbb{Z}/p\mathbb{Z}$.
$\partial(ab) = a\,\partial(b) + \partial(a)\,b$ for all $a, b \in \mathbb{Z}$, and $\partial(1) = 0$.
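The derivation property is a finite computation for any fixed prime, so it can be spot-checked directly. A sketch (the function name `d` is ours):

```python
def d(a, p):
    """The mod p analogue of the nonlinear derivation: (a - a**p)/p, reduced mod p.

    The numerator is divisible by p for every integer a, by Fermat's
    little theorem."""
    return ((a - a**p) // p) % p

p = 5
assert d(1, p) == 0
# derivation property: d(ab) = a*d(b) + d(a)*b mod p
for a in range(-12, 13):
    for b in range(-12, 13):
        assert d(a * b, p) == (a * d(b, p) + d(a, p) * b) % p
```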
3 The definition of entropy
For $n \geq 1$, write
$$\Pi_n = \Bigl\{\pi = (\pi_1, \ldots, \pi_n) \in (\mathbb{Z}/p\mathbb{Z})^n : \pi_1 + \cdots + \pi_n = 1\Bigr\}.$$
An element of $\Pi_n$ will be called a probability distribution mod $p$, or simply a distribution. We will define the entropy of any such distribution.
A standard elementary lemma (proof omitted) will be repeatedly useful:
Let $a, b \in \mathbb{Z}$. If $a \equiv b \pmod{p}$ then $a^p \equiv b^p \pmod{p^2}$.
The observations at the end of Section 2 suggest that we define entropy mod $p$ by the analogue of equation (5), replacing $\partial$ by its mod $p$ counterpart. In principle this is impossible, since although $\partial$ is well-defined on congruence classes mod $p^2$, it is not well-defined on congruence classes mod $p$. Thus, for $\pi_i \in \mathbb{Z}/p\mathbb{Z}$, the term $\partial(\pi_i)$ is not well-defined. Nevertheless, the strategy can be made to work:
For all $n \geq 1$ and $a_1, \ldots, a_n \in \mathbb{Z}$ such that $a_1 + \cdots + a_n \equiv 1 \pmod p$,
$$\sum_{i=1}^n \partial(a_i) - \partial\Bigl(\sum_{i=1}^n a_i\Bigr) = \frac{1}{p}\Bigl(1 - \sum_{i=1}^n a_i^p\Bigr) \quad \text{in } \mathbb{Z}/p\mathbb{Z}.$$
The right-hand side is an integer, since $\sum_i a_i^p \equiv \sum_i a_i \equiv 1 \pmod p$. The lemma is equivalent to the congruence
$$\sum_{i=1}^n (a_i - a_i^p) - \biggl(\sum_{i=1}^n a_i - \Bigl(\sum_{i=1}^n a_i\Bigr)^p\biggr) \equiv 1 - \sum_{i=1}^n a_i^p \pmod{p^2}.$$
Cancelling, this reduces to
$$\Bigl(\sum_{i=1}^n a_i\Bigr)^p \equiv 1 \pmod{p^2}.$$
But $\sum_i a_i \equiv 1 \pmod p$, so $\bigl(\sum_i a_i\bigr)^p \equiv 1^p = 1 \pmod{p^2}$ by Lemma 3.
Let $n \geq 1$ and $\pi \in \Pi_n$. The entropy of $\pi$ is
$$H_p(\pi) = \frac{1}{p}\Bigl(1 - \sum_{i=1}^n a_i^p\Bigr) \in \mathbb{Z}/p\mathbb{Z},$$
where $a_i \in \mathbb{Z}$ represents $\pi_i \in \mathbb{Z}/p\mathbb{Z}$. We often write $H_p(\pi)$ as just $H(\pi)$.
By the lemma, the entropy can equivalently be written as
$$H(\pi) = \sum_{i=1}^n \partial(a_i) - \partial\Bigl(\sum_{i=1}^n a_i\Bigr), \qquad (6)$$
as in the real case (equation (5)). Altering each $a_i$ by a multiple of $p$ leaves the right-hand side of (6) unchanged, so the definition is independent of the choice of representatives $a_i$. But in contrast to the real case, the term $\partial\bigl(\sum_i a_i\bigr)$ is not always zero, and if it were omitted then the right-hand side would no longer be independent of the choice of integers $a_i$.
Let $n \geq 1$ with $p \nmid n$. Since $n$ is invertible mod $p$, there is a uniform distribution
$$u_n = (1/n, \ldots, 1/n) \in \Pi_n.$$
Choose $b \in \mathbb{Z}$ representing $1/n \in \mathbb{Z}/p\mathbb{Z}$. By equation (6) and then the derivation property of $\partial$,
$$H(u_n) = n\,\partial(b) - \partial(nb) = n\,\partial(b) - \bigl(n\,\partial(b) + \partial(n)\,b\bigr) = -b\,\partial(n).$$
But $\partial(n) = -n\,q(n)$, so $H(u_n) = bn\,q(n) = q(n)$. This result over $\mathbb{Z}/p\mathbb{Z}$ is analogous to the formula $H(u_n) = \log n$ for the real entropy of a uniform distribution.
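Both the independence of representatives and the formula $H(u_n) = q(n)$ can be confirmed numerically. A sketch (helper names are ours):

```python
def entropy_mod_p(dist, p):
    """H_p of a tuple of integer representatives whose sum is 1 mod p."""
    assert sum(dist) % p == 1
    return ((1 - sum(a**p for a in dist)) // p) % p

def fermat_quotient(n, p):
    return ((n**(p - 1) - 1) // p) % p

p = 7

# independence of the choice of integer representatives
assert entropy_mod_p([3, 5], p) \
    == entropy_mod_p([3 + p, 5 - p], p) \
    == entropy_mod_p([3, 5 + 3 * p], p)

# H(u_n) = q(n) for the uniform distribution u_n, with p not dividing n
for n in range(1, 13):
    if n % p != 0:
        b = pow(n, -1, p)   # each entry of u_n is 1/n mod p
        assert entropy_mod_p([b] * n, p) == fermat_quotient(n, p)
```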
Let $p = 2$. For $\pi \in \Pi_n$, write $S(\pi) = \{i : \pi_i = 1\}$. Since $\sum_i \pi_i = 1$ in $\mathbb{Z}/2\mathbb{Z}$, the cardinality of $S(\pi)$ must be odd. Directly from the definition of entropy, taking each $\pi_i$ to be represented by $0$ or $1$, $H(\pi)$ is given by
$$H(\pi) = \frac{1 - |S(\pi)|}{2} \in \mathbb{Z}/2\mathbb{Z}.$$
In preparation for the next example, we record a standard lemma (proof omitted):
$\dfrac{1}{p}\dbinom{p}{r} \equiv \dfrac{(-1)^{r-1}}{r} \pmod p$ for all $0 < r < p$.
We compute the entropy of a distribution $\pi = (\pi_1, \pi_2) \in \Pi_2$ on two elements. Choose $a \in \mathbb{Z}$ representing $\pi_1$; then $1 - a$ represents $\pi_2$. Directly from the definition of entropy, and assuming that $p \neq 2$,
$$H(\pi) = \frac{1}{p}\bigl(1 - a^p - (1 - a)^p\bigr) = -\frac{1}{p}\sum_{r=1}^{p-1}\binom{p}{r}(-a)^r.$$
But $(-1)^r(-1)^{r-1} = -1$, so by Lemma 3, the coefficient of $a^r$ in the sum is just $1/r$. Hence
$$H(\pi_1, \pi_2) = \sum_{r=1}^{p-1} \frac{\pi_1^r}{r}.$$
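The two-element formula can be tested against the definition for any odd prime. A sketch:

```python
def entropy_mod_p(dist, p):
    """H_p of a tuple of integer representatives whose sum is 1 mod p."""
    assert sum(dist) % p == 1
    return ((1 - sum(a**p for a in dist)) // p) % p

p = 7  # any odd prime
for a in range(p):
    # the truncated logarithm: sum over 0 < r < p of a^r / r, mod p
    truncated_log = sum(pow(a, r, p) * pow(r, -1, p) for r in range(1, p)) % p
    assert entropy_mod_p([a, 1 - a], p) == truncated_log
```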
Appending zero probabilities to a distribution does not change its entropy:
$$H(\pi_1, \ldots, \pi_n, 0) = H(\pi_1, \ldots, \pi_n).$$
This is immediate from the definition. But a subtlety of distributions mod $p$, absent in the standard real setting, is that nonzero ‘probabilities’ can sum to zero. So, one can ask whether
$$H(\pi_1, \ldots, \pi_n, \rho_1, \ldots, \rho_m) = H(\pi_1, \ldots, \pi_n)$$
whenever $\rho_1, \ldots, \rho_m \in \mathbb{Z}/p\mathbb{Z}$ with $\rho_1 + \cdots + \rho_m = 0$. The answer is no. For instance, when $p = 2$, $\pi = (1)$ and $\rho = (1, 1)$, Example 3 gives
$$H(1, 1, 1) = \frac{1 - 3}{2} = 1 \neq 0 = H(1),$$
even though $1 + 1 = 0$ in $\mathbb{Z}/2\mathbb{Z}$.
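This counterexample is a one-line check from the definition (sketch; helper name ours):

```python
def entropy_mod_p(dist, p):
    """H_p of a tuple of integer representatives whose sum is 1 mod p."""
    assert sum(dist) % p == 1
    return ((1 - sum(a**p for a in dist)) // p) % p

p = 2
assert entropy_mod_p([1], p) == 0
# appending (1, 1), which sums to 0 mod 2, changes the entropy
assert entropy_mod_p([1, 1, 1], p) == 1
```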
4 The chain rule
Here we formulate the mod $p$ version of the chain rule for entropy, which will later be shown to characterize entropy uniquely up to a constant.
In the Introduction, it was noted that real probability distributions can be composed in a way that corresponds to performing two random processes in sequence. Exactly the same formula (1) defines a composition of probability distributions mod $p$, where now $\gamma \in \Pi_n$ and $\pi^i \in \Pi_{k_i}$, giving $\gamma \circ (\pi^1, \ldots, \pi^n) \in \Pi_{k_1 + \cdots + k_n}$.
And entropy mod $p$ satisfies exactly the same chain rule for composition:
Proposition (Chain rule)
$$H\bigl(\gamma \circ (\pi^1, \ldots, \pi^n)\bigr) = H(\gamma) + \sum_{i=1}^n \gamma_i H(\pi^i)$$
for all $n, k_1, \ldots, k_n \geq 1$, all $\gamma \in \Pi_n$, and all $\pi^1 \in \Pi_{k_1}, \ldots, \pi^n \in \Pi_{k_n}$.
Write $\sigma = \gamma \circ (\pi^1, \ldots, \pi^n)$. Choose $b_i \in \mathbb{Z}$ representing $\gamma_i$ and $a^i_j \in \mathbb{Z}$ representing $\pi^i_j$, for each $i$ and $j$. Then $b_i a^i_j$ represents the $(i, j)$th entry of $\sigma$, and evaluating both sides on these representatives, using equation (6) and the derivation property of $\partial$, gives the result by a direct calculation.
A special case of composition is the tensor product of distributions, defined for $\pi \in \Pi_n$ and $\rho \in \Pi_m$ by
$$\pi \otimes \rho = \pi \circ (\rho, \ldots, \rho) = (\pi_1\rho_1, \ldots, \pi_1\rho_m,\ \ldots,\ \pi_n\rho_1, \ldots, \pi_n\rho_m) \in \Pi_{nm}.$$
In the analogous case of real distributions, $\pi \otimes \rho$ is the joint distribution of two independent random variables distributed according to $\pi$ and $\rho$.
The chain rule immediately implies a logarithmic property of entropy mod $p$:
$$H(\pi \otimes \rho) = H(\pi) + H(\rho)$$
for all $\pi \in \Pi_n$ and $\rho \in \Pi_m$.
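Both the chain rule and the tensor-product property can be spot-checked numerically with arbitrarily chosen distributions (a sketch; helper names are ours):

```python
def entropy_mod_p(dist, p):
    """H_p of a tuple of integer representatives whose sum is 1 mod p."""
    assert sum(dist) % p == 1
    return ((1 - sum(a**p for a in dist)) // p) % p

def compose(gamma, blocks):
    """The composite distribution gamma o (pi^1, ..., pi^n), on representatives."""
    return [g * x for g, block in zip(gamma, blocks) for x in block]

p = 5
gamma = [2, 4]      # 2 + 4 = 6, congruent to 1 mod 5
pi1 = [3, 3]        # sums to 6, congruent to 1 mod 5
pi2 = [1, 2, 3]     # sums to 6, congruent to 1 mod 5

# chain rule: H(gamma o (pi1, pi2)) = H(gamma) + gamma_1 H(pi1) + gamma_2 H(pi2)
sigma = compose(gamma, [pi1, pi2])
rhs = (entropy_mod_p(gamma, p)
       + gamma[0] * entropy_mod_p(pi1, p)
       + gamma[1] * entropy_mod_p(pi2, p)) % p
assert entropy_mod_p(sigma, p) == rhs

# tensor product: H(pi (x) rho) = H(pi) + H(rho)
tensor = compose(pi1, [pi2] * len(pi1))
assert entropy_mod_p(tensor, p) == \
    (entropy_mod_p(pi1, p) + entropy_mod_p(pi2, p)) % p
```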
5 Unique characterization of entropy
Our main theorem is that up to a constant factor, entropy mod $p$ is the only quantity satisfying the chain rule.
Let $\bigl(I \colon \Pi_n \to \mathbb{Z}/p\mathbb{Z}\bigr)_{n \geq 1}$ be a sequence of functions. The following are equivalent:
$I$ satisfies the chain rule (that is, satisfies the conclusion of Proposition 4 with $I$ in place of $H$);
$I = c H_p$ for some $c \in \mathbb{Z}/p\mathbb{Z}$.
For the rest of the proof, let $\bigl(I \colon \Pi_n \to \mathbb{Z}/p\mathbb{Z}\bigr)_{n \geq 1}$ be a sequence of functions satisfying the chain rule. Recall that $u_n$ denotes the uniform distribution $(1/n, \ldots, 1/n) \in \Pi_n$, for $n$ not divisible by $p$.
$I(u_{mn}) = I(u_m) + I(u_n)$ for all $m, n$ not divisible by $p$; moreover, $I(u_1) = 0$ and $I(1, 0) = 0 = I(0, 1)$.
To prove that $I(1, 0) = 0$, we compute $I(1, 0, 0)$ in two ways. On the one hand, using the chain rule,
$$I(1, 0, 0) = I\bigl((1, 0) \circ ((1, 0),\, u_1)\bigr) = I(1, 0) + 1 \cdot I(1, 0) + 0 \cdot I(u_1) = 2\,I(1, 0).$$
On the other, using the chain rule again and the fact that $I(u_1) = 0$ (which follows from applying the chain rule to $u_1 \circ (u_1) = u_1$),
$$I(1, 0, 0) = I\bigl((1, 0) \circ (u_1,\, (1, 0))\bigr) = I(1, 0) + 1 \cdot I(u_1) + 0 \cdot I(1, 0) = I(1, 0).$$
Hence $I(1, 0) = 0$. The proof that $I(0, 1) = 0$ is similar. Finally, applying the chain rule to $u_{mn} = u_m \circ (u_n, \ldots, u_n)$ gives $I(u_{mn}) = I(u_m) + \sum_{i=1}^m \tfrac1m\, I(u_n) = I(u_m) + I(u_n)$.
For all $\pi \in \Pi_n$ and $i \in \{0, 1, \ldots, n\}$,
$$I(\pi_1, \ldots, \pi_i, 0, \pi_{i+1}, \ldots, \pi_n) = I(\pi).$$
First suppose that $i = n$. Then
$$(\pi_1, \ldots, \pi_n, 0) = \pi \circ \bigl(u_1, \ldots, u_1, (1, 0)\bigr).$$
Applying $I$ to both sides, then using the chain rule and $I(1, 0) = 0$, gives the result. The case $i < n$ is proved similarly, using $(0, 1)$ in place of $(1, 0)$.
We will prove the characterization theorem by analysing $I(u_n)$ as $n$ varies. The chain rule will allow us to deduce the value of $I$ for more general distributions $\pi$, thanks to the following lemma.
Let $\pi \in \Pi_n$ with $\pi_i \neq 0$ for all $i$. For each $i$, let $a_i$ be a positive integer representing $\pi_i$, and write $a = a_1 + \cdots + a_n$. Then
$$I(\pi) = I(u_a) - \sum_{i=1}^n \pi_i\, I(u_{a_i}). \qquad (7)$$
First note that none of $a_1, \ldots, a_n, a$ is a multiple of $p$, so $u_{a_1}, \ldots, u_{a_n}$ and $u_a$ are well-defined. We have
$$u_a = \pi \circ (u_{a_1}, \ldots, u_{a_n}),$$
since each entry $\pi_i/a_i$ of the composite is equal to $1/a$ in $\mathbb{Z}/p\mathbb{Z}$ (both being equal to $1$, as $a_i \equiv \pi_i$ and $a \equiv 1 \pmod p$).
Applying $I$ to both sides and using the chain rule gives the result.
We come now to the most delicate part of the argument. Since $H(u_n) = q(n)$, and since $q(n)$ is $p^2$-periodic in $n$, if $I$ is to be a constant multiple of $H$ then $I(u_n)$ must also be $p^2$-periodic in $n$. We show this directly.
$I(u_{n + p^2}) = I(u_n)$ for all natural numbers $n$ not divisible by $p$.
First we prove the existence of a constant $c \in \mathbb{Z}/p\mathbb{Z}$ such that for all $n$ not divisible by $p$,
$$I(u_{n+p}) = I(u_n) - \frac{c}{n}. \qquad (8)$$
To prove this, consider the distribution $u_n$, each of whose entries is represented by any chosen positive integer $b \equiv n^{-1} \pmod p$. By Lemma 5 and the fact that every entry is represented by $b$,
$$I(u_n) = I(u_{nb}) - I(u_b),$$
so by the same argument, with the representative $b + p$ used in one coordinate and $b$ in the others,
$$I(u_n) = I(u_{nb + p}) - \tfrac{1}{n}\, I(u_{b+p}) - \tfrac{n-1}{n}\, I(u_b).$$
Comparing the two expressions for $I(u_n)$ gives
$$I(u_{nb+p}) - I(u_{nb}) = \tfrac{1}{n}\bigl(I(u_{b+p}) - I(u_b)\bigr).$$
Writing $c = I(u_1) - I(u_{1+p})$ and making suitable choices of $n$ and $b$ in this relation (first $b = 1$ with $n \equiv 1 \pmod p$, then general $b$) yields equation (8), thus proving the initial claim.
By induction, equation (8) gives
$$I(u_{n + kp}) = I(u_n) - \frac{kc}{n}$$
for all $k \geq 1$ and $n$ with $p \nmid n$. The result follows by putting $k = p$.
We can now prove the characterization theorem for entropy modulo $p$.
Proof of Theorem 5. If $I = cH$ then $I$ satisfies the chain rule, by Proposition 4. Conversely, suppose that $I$ satisfies the chain rule. By the lemmas above, the function $n \mapsto I(u_n)$ is additive on products and unchanged when $n$ is increased by $p^2$; hence, by the characterization of the Fermat quotient in Section 2, there is a constant $c \in \mathbb{Z}/p\mathbb{Z}$ such that $I(u_n) = c\,q(n) = c\,H(u_n)$ for all $n$ not divisible by $p$. Equation (7), together with the corresponding identity for $H$, then gives $I(\pi) = c\,H(\pi)$ for every distribution $\pi$ with all entries nonzero; and a distribution with zero entries takes the same $I$- and $H$-values as the distribution obtained by deleting those entries. Hence $I = cH$.
A variant of the characterization theorem will be useful. The distributions considered so far can be viewed as probability measures (mod $p$) on sets of the form $\{1, \ldots, n\}$, but it will be convenient to generalize to arbitrary finite sets.
Thus, given a finite set $X$, write $\Pi_X$ for the set of families $\pi = (\pi_x)_{x \in X}$ of elements of $\mathbb{Z}/p\mathbb{Z}$ such that $\sum_{x \in X} \pi_x = 1$. A finite probability space mod $p$ is a finite set $X$ together with an element $\pi \in \Pi_X$.
As in the real case, we can take convex combinations of probability spaces. Given a finite probability space $(X, \pi)$ and a further family $\bigl((Y_x, \rho^x)\bigr)_{x \in X}$ of finite probability spaces, all mod $p$, we obtain a new probability space
$$\sum_{x \in X} \pi_x (Y_x, \rho^x) = \Bigl(\coprod_{x \in X} Y_x,\ \lambda\Bigr)$$
mod $p$. Here $\coprod_{x \in X} Y_x$ is the disjoint union of the sets $Y_x$, and $\lambda$ gives probability $\pi_x \rho^x_y$ to an element $y \in Y_x$.
The operation of taking convex combinations is simply composition of distributions, in different notation. More precisely, if $X = \{1, \ldots, n\}$ and $Y_i = \{1, \ldots, k_i\}$ ($1 \leq i \leq n$) then the set $\coprod_i Y_i$ is naturally identified with $\{1, \ldots, k_1 + \cdots + k_n\}$, and under this identification, $\lambda$ corresponds to the composite distribution $\pi \circ (\rho^1, \ldots, \rho^n)$.
The entropy of $(X, \pi)$ is, of course, defined as
$$H(X, \pi) = \frac{1}{p}\Bigl(1 - \sum_{x \in X} a_x^p\Bigr) \in \mathbb{Z}/p\mathbb{Z},$$
where $a_x \in \mathbb{Z}$ represents $\pi_x$ for each $x \in X$.
This entropy is isomorphism-invariant, in the sense that whenever $(X, \pi)$ and $(Y, \rho)$ are finite probability spaces mod $p$ and there is some bijection $f \colon X \to Y$ satisfying $\rho_{f(x)} = \pi_x$ for all $x \in X$, then $H(X, \pi) = H(Y, \rho)$. The chain rule for entropy mod $p$, translated into the notation of convex combinations, states that
$$H\Bigl(\sum_{x \in X} \pi_x (Y_x, \rho^x)\Bigr) = H(X, \pi) + \sum_{x \in X} \pi_x\, H(Y_x, \rho^x) \qquad (9)$$
for all finite probability spaces $(X, \pi)$ and $(Y_x, \rho^x)$ ($x \in X$) mod $p$.
Let $J$ be a function assigning an element of $\mathbb{Z}/p\mathbb{Z}$ to each finite probability space mod $p$. The following are equivalent:
$J$ is isomorphism-invariant and satisfies the chain rule (9) (with $J$ in place of $H$);
$J = cH$ for some $c \in \mathbb{Z}/p\mathbb{Z}$.
Conversely, take a function $J$ satisfying (i). Restricting $J$ to finite sets of the form $\{1, \ldots, n\}$ defines, for each $n \geq 1$, a function $\Pi_n \to \mathbb{Z}/p\mathbb{Z}$, and this sequence of functions satisfies the chain rule. Hence by Theorem 5, there is some constant $c \in \mathbb{Z}/p\mathbb{Z}$ such that $J(\{1, \ldots, n\}, \pi) = c\,H(\pi)$ for all $n$ and $\pi \in \Pi_n$. Now take any finite probability space $(X, \pi)$. We have
$$(X, \pi) \cong (\{1, \ldots, n\}, \sigma)$$
for some $n$ and $\sigma \in \Pi_n$, and then by isomorphism-invariance of both $J$ and $H$,
$$J(X, \pi) = J(\{1, \ldots, n\}, \sigma) = c\,H(\sigma) = c\,H(X, \pi).$$
This corollary is slightly weaker than our main characterization result, Theorem 5. Indeed, if $J$ is an isomorphism-invariant function on the class of finite probability spaces mod $p$ then in particular, permuting the arguments of a measure does not change the value that $J$ gives it. Thus, the corollary also follows from a weaker version of Theorem 5 in which the putative entropy function is also assumed to be symmetric in its arguments. But Theorem 5 shows that the symmetry assumption is, in fact, unnecessary.
6 Information loss
Grothendieck came along and said, ‘No, the Riemann–Roch theorem is not a theorem about varieties, it’s a theorem about morphisms between varieties.’ —Nicholas Katz (quoted in , p. 1046).
The entropy of a probability space is a special case of a more general concept, the information loss of a map between probability spaces. This is most easily approached through the real case, as follows.
Given a real probability distribution $\pi$ on a finite set, the entropy of $\pi$ is usually understood as the amount of information gained by learning the result of an observation drawn from $\pi$. For instance, if $\pi = u_{2^k}$ then the entropy (taking logarithms to base $2$) is $k$, reflecting the fact that results of draws from $\pi$ cannot be communicated in fewer than $k$ bits each.
In the same spirit, one can ask how much information is lost by a deterministic process. Consider, for instance, the process of forgetting the suit of a card drawn fairly from a standard $52$-card pack. Since the four suits are distributed uniformly, $2$ bits of information are lost. An alternative viewpoint is that the information loss is the amount of information at the start of the process minus the amount at the end, which is
$$H(u_{52}) - H(u_{13}) = \log 52 - \log 13 = \log 4.$$
If we take logarithms to base $2$ then the information loss is, again, $2$ bits. Hence the two viewpoints give the same result.
Generally, given a measure-preserving map $f \colon (X, \pi) \to (Y, \rho)$ between finite probability spaces (so that $\rho_y = \sum_{x \in f^{-1}(y)} \pi_x$), we can quantify the information lost by $f$ in either of two equivalent ways. We can condition on the outcome $y \in Y$, taking for each $y$ the amount of information lost by collapsing the fibre $f^{-1}(y)$:
$$\sum_{y \in Y} \rho_y\, H\Bigl(\tfrac{1}{\rho_y}\,\pi|_{f^{-1}(y)}\Bigr).$$
(The argument of $H$ is the restriction of the distribution $\pi$ to $f^{-1}(y)$, normalized to sum to $1$.) Alternatively, we can subtract the amount of information at the end of the process from the amount at the start:
$$H(X, \pi) - H(Y, \rho).$$
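The equivalence of the two formulas can be illustrated on the card example (a sketch over $\mathbb{R}$; the function names are ours):

```python
from math import log2

def shannon_entropy(dist):
    """Shannon entropy, in bits, of a finite probability distribution."""
    return -sum(x * log2(x) for x in dist if x > 0)

def information_loss(fibres):
    """First formula: sum over outcomes y of rho_y * H(fibre over y, normalized)."""
    total = 0.0
    for fibre in fibres:
        rho_y = sum(fibre)
        if rho_y > 0:
            total += rho_y * shannon_entropy([x / rho_y for x in fibre])
    return total

# forgetting the suit: 52 equiprobable cards -> 13 ranks, 4 cards per rank
fibres = [[1/52] * 4 for _ in range(13)]
pi = [x for fibre in fibres for x in fibre]      # distribution on the 52 cards
rho = [sum(fibre) for fibre in fibres]           # pushforward onto the 13 ranks

loss1 = information_loss(fibres)                 # conditioning on the outcome
loss2 = shannon_entropy(pi) - shannon_entropy(rho)  # start minus end
assert abs(loss1 - 2.0) < 1e-12
assert abs(loss1 - loss2) < 1e-12
```

Both formulas give $2$ bits, as in the discussion above.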
Entropy is the special case of information loss where one discards all the information. That is, the entropy of a probability distribution $\pi$ on a set $X$ is the information loss of the unique map from $(X, \pi)$ to the one-point space. In this sense, the concept of information loss subsumes the concept of entropy.
The description so far is of information loss over $\mathbb{R}$, which was analysed and characterized in work of Baez, Fritz and Leinster. (In particular, equation (5) of that paper describes the relationship between information loss and conditional entropy.) We now show that a strictly analogous characterization theorem is available over $\mathbb{Z}/p\mathbb{Z}$, even in the absence of an information-theoretic interpretation.