 # Trimming the Independent Fat: Sufficient Statistics, Mutual Information, and Predictability from Effective Channel States

One of the most fundamental questions one can ask about a pair of random variables X and Y is the value of their mutual information. Unfortunately, this task is often stymied by the extremely large dimension of the variables. We might hope to replace each variable by a lower-dimensional representation that preserves the relationship with the other variable. The theoretically ideal implementation is the use of minimal sufficient statistics, where it is well-known that either X or Y can be replaced by their minimal sufficient statistic about the other while preserving the mutual information. While intuitively reasonable, it is not obvious or straightforward that both variables can be replaced simultaneously. We demonstrate that this is in fact possible: the information X's minimal sufficient statistic preserves about Y is exactly the information that Y's minimal sufficient statistic preserves about X. As an important corollary, we consider the case where one variable is a stochastic process' past and the other its future and the present is viewed as a memoryful channel. In this case, the mutual information is the channel transmission rate between the channel's effective states. That is, the past-future mutual information (the excess entropy) is the amount of information about the future that can be predicted using the past. Translating our result about minimal sufficient statistics, this is equivalent to the mutual information between the forward- and reverse-time causal states of computational mechanics. We close by discussing multivariate extensions to this use of minimal sufficient statistics.

## I Introduction

How do we elucidate dependencies between variables? This is one of the major challenges facing today’s data-rich sciences, a task often stymied by the curse of dimensionality. One approach to circumventing the curse is to reduce each variable while still preserving its relationships with others. The maximal reduction, the minimal sufficient statistic, is known to work for a single variable at a time. In the multivariate setting, though, it is not straightforward to demonstrate that, as intuition might suggest, all variables can be simultaneously replaced by their minimal sufficient statistics. Here, we prove that this is indeed the case in the two- and three-variable settings.

The need for sufficient statistics arises in many arenas. Consider, for example, the dynamics of a complex system. Any evolving physical system can be viewed as a communication channel that transmits (information about) its past to its future through its present. Shannon information theory tells us that we can monitor the amount of information being transmitted through the present by the past-future mutual information, known as the excess entropy. However, this excess entropy can rarely be calculated from past and future sequence statistics, since the sequences are semi-infinite. This makes calculating the excess entropy an ideal candidate for using sufficient statistics. The latter take the form of either a process’ prescient states or its causal states. Though known for some time, a detailed proof of this relationship was rather involved, as laid out in Ref. .

The proof of our primary result turns on analyzing the information-theoretic relationships among four random variables $X$, $W$, $Z$, and $Y$. All possible informational relationships, in terms of Shannon multivariate information measures, are illustrated in the information diagram [6, 7] (I-diagram) of Fig. 1. This Venn-like diagram decomposes the entropy $H[X, W, Z, Y]$ of the joint random variable into a number of atoms: informational units that cannot be further decomposed using the variables at hand. For example, take the region labeled $a$ in Fig. 1; this region is the conditional entropy $H[W \mid X, Z, Y]$. Similarly, one has the four-variable mutual information $k = I[X:W:Z:Y]$ and the conditional mutual information $d = I[W:Z \mid X, Y]$. The analogy with set theory, while helpful, must be handled with care: Shannon informations form a signed measure. Any atom quantifying the information shared among at least three variables can be negative. In the context of our example, Fig. 1, these are the four three-variable atoms (among them $h$, $m$, and $n$) and the four-variable atom $k$. Negative information has led to a great deal of investigation; see, for example, Refs. [8, 9].

Here we are interested in what happens when $W$ is a sufficient statistic of $X$ about $Y$ and $Z$ is a sufficient statistic of $Y$ about $X$. We denote this $W = X \searrow Y$ and $Z = Y \searrow X$. The resulting (reduced) I-diagram provides a useful and parsimonious view of the relations among the four variables. In particular, it leads us to the main conclusion that each variable can be simultaneously reduced to its sufficient statistic while maintaining the mutual informations. Our development proceeds as follows: Section II defines sufficient statistics and utilizes two of their properties to reduce the informational relationships among the variables. Section III discusses how this result applies to stochastic processes as communication channels. Section IV extends our results to the three-variable case and makes a conjecture about broader applicability. Finally, Section V outlines further directions and applications.

## II Sufficient Statistics

A statistic is a function of random variable samples. Consider the set of all functions of a random variable $X$. These functions are also random variables. Given variables $X$ and $Y$, a variable $V$ forms a Markov chain $X - V - Y$ if $X$ and $Y$ are conditionally independent given $V$. Let $S_{X \to Y}$ denote the set of all variables that are functions of $X$ and that form a Markov chain with $X$ and $Y$. A sufficient statistic of $X$ about $Y$ is an element of $S_{X \to Y}$. (Our definition here is equivalent to the standard one, but in a form that more directly emphasizes the properties we exploit over the next two subsections.) The minimal sufficient statistic $X \searrow Y$ of $X$ about $Y$ is the minimal-entropy sufficient statistic:

$$
X \searrow Y = \operatorname*{arg\,min}_{V} \left\{ H[V] \mid V \in S_{X \to Y} \right\} . \tag{1}
$$

It is unique up to isomorphism.

The minimal sufficient statistic can be directly constructed from variables $X$ and $Y$. Consider the function $f$ mapping each outcome $x$ to the conditional distribution $p(Y \mid X = x)$; then $X \searrow Y = f(X)$ [11, 12]. Put more colloquially, $X \searrow Y$ aggregates the outcomes $x$ that induce the same conditional distribution $p(Y \mid X = x)$. This is an equivalence class over $X$, where the probability of each class is the sum of the probabilities of the outcomes contained in that class.
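The construction just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the joint distribution is a small hypothetical table, and the helper names (`minimal_sufficient_statistic`, `coarse_grain`, `mutual_information`) are ours rather than taken from any library. Outcomes of the first variable are grouped whenever they induce the same conditional distribution over the second, and the mutual information is compared before and after the grouping.

```python
from collections import defaultdict
from math import log2


def minimal_sufficient_statistic(joint):
    """Label each outcome x by the conditional distribution p(Y | X = x).

    `joint` maps (x, y) pairs to probabilities. Outcomes x that induce the
    same conditional distribution receive the same integer label.
    """
    p_x = defaultdict(float)
    for (x, _), p in joint.items():
        p_x[x] += p

    conditionals = defaultdict(dict)
    for (x, y), p in joint.items():
        if p > 0:
            # Rounding guards against floating-point noise when comparing.
            conditionals[x][y] = round(p / p_x[x], 12)

    labels, mapping = {}, {}
    for x in sorted(conditionals):
        key = tuple(sorted(conditionals[x].items()))
        mapping[x] = labels.setdefault(key, len(labels))
    return mapping


def coarse_grain(joint, mapping):
    """Replace x by its class label; class probabilities are summed."""
    reduced = defaultdict(float)
    for (x, y), p in joint.items():
        reduced[(mapping[x], y)] += p
    return dict(reduced)


def mutual_information(joint):
    """Mutual information, in bits, of a two-variable joint distribution."""
    p_a, p_b = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        p_a[a] += p
        p_b[b] += p
    return sum(p * log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in joint.items() if p > 0)


# Hypothetical example: x = 0 and x = 1 induce the same p(Y | X = x).
joint = {
    (0, "a"): 0.10, (0, "b"): 0.10,
    (1, "a"): 0.15, (1, "b"): 0.15,
    (2, "a"): 0.40, (2, "b"): 0.10,
}
mss = minimal_sufficient_statistic(joint)
reduced = coarse_grain(joint, mss)
print(mss)
print(mutual_information(joint), mutual_information(reduced))
```

In this toy table the three outcomes of the first variable collapse to two classes, and the two printed mutual informations agree up to floating-point error. The rounding tolerance is a pragmatic choice for exact tables; estimating the classes from sampled data would need a statistical comparison instead.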

### II.1 Sufficient Statistic as a Function

Our first step in reducing Fig. 1 is to consider the fact that $W$ is a function of $X$. A variable $W$ is a function of $X$ if and only if $H[W \mid X] = 0$. Furthermore, conditional entropies are never increased by conditioning on additional variables. Since conditional entropies are nonnegative, conditioning on variables in addition to $X$ can only yield additional zeros. In terms of the information atoms, the relations:

$$
\begin{aligned}
H[W \mid X] &= a + d + h + l = 0 , \\
H[W \mid X, Y] &= a + d = 0 , \\
H[W \mid X, Z] &= a + l = 0 , \\
H[W \mid X, Z, Y] &= a = 0 ,
\end{aligned}
$$

imply $a = d = h = l = 0$. A symmetric argument, using the fact that $Z$ is a function of $Y$, implies that the four atoms composing $H[Z \mid Y]$ vanish as well. Each of these zeros is marked with an asterisk in Fig. 2.

Figure 1: Information diagram (I-diagram) for four random variables X, W, Z, and Y. Each is depicted as a stadium shape and the information atoms are obtained by forming all possible intersections. Individual atoms are identified with lowercase letters.

### II.2 Sufficient Statistic as a Markov Chain

Variables $X$, $V$, and $Y$ form a Markov chain $X - V - Y$ if and only if $I[X:Y \mid V] = 0$. Said informally, $V$ statistically shields $X$ and $Y$, rendering them conditionally independent. Applied to variable $W$ we find:

$$
I[X:Y \mid W] = 0 \implies m + o = 0 ,
$$

and similarly for $Z$,

$$
I[X:Y \mid Z] = 0 \implies n + o = 0 .
$$

Since $o = I[X:Y \mid W, Z]$ is a conditional mutual information, $o$ is nonnegative by the standard Shannon inequality.

Thus far, $m$ and $n$ are not individually constrained and so could be negative. However, consider $I[X:Z \mid W]$, another conditional mutual information, which is therefore also nonnegative. It consists of $m$ and the atom $I[X:Z \mid W, Y]$. It is already known that the latter vanishes, since it is one of the atoms of $H[Z \mid Y]$; therefore $m$ is nonnegative. Clearly, then, $m$ and $o$ are individually zero, being two nonnegative quantities that sum to zero.

Analogously, we find that $n$ is nonnegative, via $I[W:Y \mid Z] = n + l$ with $l = 0$, and conclude that $n$ and $o$ are individually zero. These vanishing atoms are marked with $0^{\dagger}$ in the simplified I-diagram in Fig. 2.

Figure 2: I-diagram for sufficient statistics: The vanishing information atoms implied by a sufficient statistic being a function of a random variable are labeled 0∗. Those vanishing atoms implied by a sufficient statistic forming a Markov chain are marked with 0†.

From this reduced diagram we can easily read that:

$$
\begin{aligned}
k &= I[X:Y] \\
  &= I[X:Z] \\
  &= I[W:Y] \\
  &= I[W:Z] \\
  &= I[X:W:Z] \\
  &= I[X:W:Y] \\
  &= I[X:Z:Y] \\
  &= I[W:Z:Y] \\
  &= I[X:W:Z:Y] .
\end{aligned}
\tag{2}
$$

Furthermore, one can remove the atoms that vanish to arrive at the reduced I-diagram of Fig. 3. It contains only five nonzero atoms.

Figure 3: Minimal I-diagram containing only nonvanishing atoms in Fig. 2.
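The reduction can be checked numerically on a small example. The sketch below is an illustration only: the joint distribution is a hypothetical one in which each of the two observed variables carries an independent noise bit, the helper names are ours, and each atom is computed by inclusion-exclusion over joint entropies. Variables are ordered as X, W, Z, Y, with W and Z built as the minimal sufficient statistics of X about Y and of Y about X.

```python
from collections import defaultdict
from itertools import combinations, product
from math import log2


def marginal(joint, idx):
    """Marginal distribution over the variable positions listed in `idx`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out


def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def mss_labels(pair_joint, side):
    """Label outcomes on `side` (0 or 1) by the conditional they induce."""
    marg = marginal(pair_joint, (side,))
    cond = defaultdict(dict)
    for pair, p in pair_joint.items():
        cond[pair[side]][pair[1 - side]] = round(p / marg[(pair[side],)], 12)
    labels, mapping = {}, {}
    for s in sorted(cond):
        key = tuple(sorted(cond[s].items()))
        mapping[s] = labels.setdefault(key, len(labels))
    return mapping


def atom(joint, inside, n):
    """I-diagram atom shared by the variables in `inside`, given all others."""
    outside = tuple(i for i in range(n) if i not in inside)
    total = 0.0
    for r in range(len(inside) + 1):
        for subset in combinations(inside, r):
            total -= (-1) ** r * entropy(marginal(joint, subset + outside))
    return total


# Hypothetical joint: correlated bits (w, z), each padded with a noise bit.
q = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
xy = {}
for (w, z), p in q.items():
    for cx, cy in product((0, 1), repeat=2):
        noise = (0.3 if cx else 0.7) * (0.2 if cy else 0.8)
        xy[((w, cx), (z, cy))] = p * noise

to_w = mss_labels(xy, 0)  # minimal sufficient statistic of X about Y
to_z = mss_labels(xy, 1)  # minimal sufficient statistic of Y about X
joint = {(x, to_w[x], to_z[y], y): p for (x, y), p in xy.items()}

names = "XWZY"
nonzero = set()
for r in range(1, 5):
    for inside in combinations(range(4), r):
        if abs(atom(joint, inside, 4)) > 1e-9:
            nonzero.add("".join(names[i] for i in inside))

k = atom(joint, (0, 1, 2, 3), 4)
i_xy = (entropy(marginal(joint, (0,))) + entropy(marginal(joint, (3,)))
        - entropy(marginal(joint, (0, 3))))
print(sorted(nonzero), k, i_xy)
```

For this example only five atoms survive: the two private entropies of X and Y, the atom shared by X and W alone, the atom shared by Z and Y alone, and the four-variable atom, which coincides with $I[X:Y]$. A single example does not prove the general statement, but it is a convenient sanity check on the bookkeeping.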

## III Stochastic Processes as Channels

We find useful application of this result in the analysis of stationary stochastic processes. Computational mechanics is an information-theoretic framework for analyzing structured stochastic processes. There, a process is considered a channel that communicates its (semi-infinite) past $X_{:0}$ to its (semi-infinite) future $X_{0:}$ through the present [2, 13]. (The following suppresses $\infty$ when indexing.) An important process property, the excess entropy $\mathbf{E}$, is the mutual information between the past and future. $\mathbf{E}$ is the amount of uncertainty in the future that can be removed by observing the past.

At first blush, it is not clear how to proceed in computing a mutual information between two infinite-dimensional random variables such as this. The answer lies in the concept of causal states. Causal states play a central role as the minimal effective states of a process’ channel. The forward-time causal states comprise the minimal amount of information from the past required for predicting the future. More precisely, the random variable $\mathcal{S}^+_0$ is the minimal sufficient statistic of the past about the future. Analogously, the reverse-time causal states $\mathcal{S}^-_0$ embody the minimal sufficient statistic of the future about the past: the states needed for optimally retrodicting the past from the future.

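By making the following substitutions: $X \to X_{:0}$, $Y \to X_{0:}$, $W \to \mathcal{S}^+_0$, and $Z \to \mathcal{S}^-_0$ in Eq. (2), we immediately see that the excess entropy (past-future mutual information) has several alternate expressions:

$$
\begin{aligned}
\mathbf{E} &\equiv I[X_{:0} : X_{0:}] && (3) \\
  &= I[X_{:0} : \mathcal{S}^-_0] \\
  &= I[\mathcal{S}^+_0 : X_{0:}] \\
  &= I[\mathcal{S}^+_0 : \mathcal{S}^-_0] . && (4)
\end{aligned}
$$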

The last identity gives our main result: The excess entropy is the mutual information between the forward-time and reverse-time causal states. As such, this provocatively suggests a communication channel between the forward- and reverse-causal-state processes, a channel that determines the amount of information being transmitted through the present. See also Fig. 1 in Ref. , analogous to Fig. 3.

We can interpret this operationally. Consider a past, the particular forward-time causal state it induces, and an instance of the future following this state. This future analogously induces a reverse-time causal state. Considering the above channel between forward- and reverse-time states, the forward state corresponds to a distribution over reverse-time causal states. Sampling a state from this distribution results in a state that gives as much information (retrodictivity) about the past as the particular reverse state determined by the future.
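As a concrete, hedged illustration, consider the Golden Mean Process, a binary process in which a 1 is always followed by a 0. It is our choice of example, not one analyzed in the text. Because it is a first-order Markov chain, finite past and future blocks can stand in for the semi-infinite sequences; for processes with longer memory this truncation would only approximate the quantities involved. The sketch groups past blocks by the conditional distribution they induce over future blocks (a stand-in for forward-time causal states), does the same in reverse, and compares the block mutual information with the state-to-state mutual information. All function names are ours.

```python
from collections import defaultdict
from itertools import product
from math import log2

# Golden Mean Process as a first-order Markov chain over symbols {0, 1}.
TRANSITIONS = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0}}
STATIONARY = {0: 2 / 3, 1: 1 / 3}


def word_probability(word):
    p = STATIONARY[word[0]]
    for a, b in zip(word, word[1:]):
        p *= TRANSITIONS[a].get(b, 0.0)
    return p


def past_future_joint(length):
    """Joint distribution over (past block, future block) of equal length."""
    joint = {}
    for word in product((0, 1), repeat=2 * length):
        p = word_probability(word)
        if p > 0:
            joint[(word[:length], word[length:])] = p
    return joint


def state_map(joint, side):
    """Group blocks on `side` by the conditional they induce on the other."""
    marg = defaultdict(float)
    for pair, p in joint.items():
        marg[pair[side]] += p
    cond = defaultdict(dict)
    for pair, p in joint.items():
        cond[pair[side]][pair[1 - side]] = round(p / marg[pair[side]], 12)
    labels, mapping = {}, {}
    for block in sorted(cond):
        key = tuple(sorted(cond[block].items()))
        mapping[block] = labels.setdefault(key, len(labels))
    return mapping


def mutual_information(joint):
    p_a, p_b = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        p_a[a] += p
        p_b[b] += p
    return sum(p * log2(p / (p_a[a] * p_b[b])) for (a, b), p in joint.items())


blocks = past_future_joint(3)
forward = state_map(blocks, 0)   # past block   -> forward-time state label
reverse = state_map(blocks, 1)   # future block -> reverse-time state label

states = defaultdict(float)
for (past, future), p in blocks.items():
    states[(forward[past], reverse[future])] += p

e_blocks = mutual_information(blocks)
e_states = mutual_information(states)

# Channel from forward-time states to reverse-time states.
p_forward = defaultdict(float)
for (s_fwd, _), p in states.items():
    p_forward[s_fwd] += p
channel = {pair: p / p_forward[pair[0]] for pair, p in states.items()}

print(e_blocks, e_states)
print(channel)
```

Both printed mutual informations come out near 0.2516 bits, and the `channel` dictionary gives, for each forward state, the distribution over reverse states described in the operational interpretation above.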

Continuing, there are a number of related multivariate mutual information identities that follow directly:

$$
\begin{aligned}
\mathbf{E} &= I[X_{:0} : \mathcal{S}^+_0 : \mathcal{S}^-_0] \\
  &= I[X_{:0} : \mathcal{S}^+_0 : X_{0:}] \\
  &= I[X_{:0} : \mathcal{S}^-_0 : X_{0:}] \\
  &= I[\mathcal{S}^+_0 : \mathcal{S}^-_0 : X_{0:}] \\
  &= I[X_{:0} : \mathcal{S}^+_0 : \mathcal{S}^-_0 : X_{0:}] .
\end{aligned}
$$

Furthermore, making use of the vanishing information atoms, we find that the following Markov chains exist:

$$
\begin{aligned}
& X_{:0} - \mathcal{S}^+_0 - \mathcal{S}^-_0 - X_{0:} , \\
& \mathcal{S}^+_0 - X_{:0} - \mathcal{S}^-_0 - X_{0:} , \\
& X_{:0} - \mathcal{S}^+_0 - X_{0:} - \mathcal{S}^-_0 , \text{ and} \\
& \mathcal{S}^+_0 - X_{:0} - X_{0:} - \mathcal{S}^-_0 .
\end{aligned}
$$

Causal states are, as noted, minimal sufficient statistics. This minimality is not necessary in the above development. As defined in Ref. , a prescient state is one that retains all of the past's information about the future and is a function of the past. In contrast to the causal states, prescient states need not be minimal. And so, with little else said, the analogous results follow for predictive and retrodictive prescient states. For example, the excess entropy equals the mutual information between the forward and reverse prescient states.

If we were to lift the restriction that prescient states are functions of the past (or the future), the resulting forward and reverse generative states may interact in their “gauge” informations. That is, the atom labeled $d$ in Fig. 1 may be nonzero; for more on this, see Ref. . The utility of our mutual information identities is then unclear.

The excess entropy, and related information measures, are widely used diagnostics for complex systems, having been applied to detect the presence of organization in dynamical systems [16, 17, 18, 19], in spin systems [20, 21], in Markov random fields [22], in neurobiological systems [23, 24, 25], in long-memory processes [26], and even in human language [27, 28].

With these application domains in mind, we should call out the analytical benefits of using causal states, along the lines analyzed here. The benefits are particularly apparent in Refs. [25, 26], for example. While closed-form expressions for excess entropy of finite-state processes have existed for several years [2, 13], it is only recently that it has been analyzed for truly complex (infinite-state) processes [25, 26]. In this work, identifying and then framing calculations around the causal states led to substantial progress. The detailed results here show why this is true: as sufficient statistics, causal states capture the essential structural information in a process. Similar benefits should also accrue when developing empirical estimation and inference algorithms for related information measures.

Figure 4: Minimal I-diagram involving three variables and their minimal sufficient statistics. This differs from a standard 3-variable I-diagram by the addition of three atoms: $H[X \mid X \searrow YZ]$, $H[Y \mid Y \searrow XZ]$, and $H[Z \mid Z \searrow XY]$.

## IV Multivariate Extensions

The results can be extended to multivariate systems as well as to alternative measures of shared information. Consider a system of three variables $X$, $Y$, and $Z$. The I-diagram of interest involves six variables: $X$, $Y$, $Z$, and their sufficient statistics about the other variables: $X \searrow YZ$, $Y \searrow XZ$, and $Z \searrow XY$. This I-diagram contains $2^6 - 1 = 63$ atoms. It can be substantially simplified along the lines of the previous section. First, note that if four variables form a Markov chain, then the shorter chains obtained by dropping one of the variables are Markov chains as well. Second, recall our primary result, Eq. (2), and note that similar relations hold for the other pairs of variables. Combining these two observations and the methods employed in Section II allows one to determine that $53$ atoms are identically $0$. This reduction results in the I-diagram of Fig. 4, which retains the remaining $10$ atoms.

Remarkably, the structure of this reduced I-diagram allows us to immediately conclude that the total correlation [29], dual total correlation [30], co-information [31, 32], CAEKL mutual information [33], and any other multivariate generalization of the mutual information remain unchanged under substitution of sufficient statistics. That is:

$$
\begin{aligned}
T[X:Y:Z] &= T[X \searrow YZ : Y \searrow XZ : Z \searrow XY] , \\
B[X:Y:Z] &= B[X \searrow YZ : Y \searrow XZ : Z \searrow XY] , \\
I[X:Y:Z] &= I[X \searrow YZ : Y \searrow XZ : Z \searrow XY] , \text{ and} \\
J[X:Y:Z] &= J[X \searrow YZ : Y \searrow XZ : Z \searrow XY] .
\end{aligned}
$$
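Two of these invariances can be spot-checked numerically. The following sketch assumes a hypothetical three-variable distribution (a noisy XOR relation among three bits, each padded with an independent noise bit) and uses helper names of our own. Each variable is replaced by labels for the conditional distribution it induces over the other two, and the total correlation $T$ and dual total correlation $B$ are compared before and after.

```python
from collections import defaultdict
from itertools import product
from math import log2


def marginal(joint, idx):
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out


def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def mss_labels(joint, i):
    """Label outcomes of variable i by the conditional over all the others."""
    marg = marginal(joint, (i,))
    cond = defaultdict(dict)
    for outcome, p in joint.items():
        rest = outcome[:i] + outcome[i + 1:]
        cond[outcome[i]][rest] = round(p / marg[(outcome[i],)], 12)
    labels, mapping = {}, {}
    for s in sorted(cond):
        key = tuple(sorted(cond[s].items()))
        mapping[s] = labels.setdefault(key, len(labels))
    return mapping


def total_correlation(joint):
    n = len(next(iter(joint)))
    singles = sum(entropy(marginal(joint, (i,))) for i in range(n))
    return singles - entropy(joint)


def dual_total_correlation(joint):
    n = len(next(iter(joint)))
    h_all = entropy(joint)
    # H[X_i | rest] = H[all] - H[rest]
    h_rest = [entropy(marginal(joint, tuple(j for j in range(n) if j != i)))
              for i in range(n)]
    return h_all - sum(h_all - h for h in h_rest)


# Hypothetical joint: c is a noisy XOR of a and b; nx, ny, nz are pure noise.
joint = {}
for a, b, flip, nx, ny, nz in product((0, 1), repeat=6):
    c = a ^ b ^ flip
    p = (0.25 * (0.1 if flip else 0.9) * (0.3 if nx else 0.7)
         * 0.5 * (0.2 if nz else 0.8))
    joint[((a, nx), (b, ny), (c, nz))] = p

maps = [mss_labels(joint, i) for i in range(3)]
reduced = defaultdict(float)
for outcome, p in joint.items():
    reduced[tuple(maps[i][outcome[i]] for i in range(3))] += p

print(len(joint), len(reduced))
print(total_correlation(joint), total_correlation(reduced))
print(dual_total_correlation(joint), dual_total_correlation(reduced))
```

In this example the 64 joint outcomes collapse to 8 while both measures are unchanged up to floating-point error. The check covers only $T$ and $B$ on one distribution; it illustrates the identities above rather than establishing them.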

We conjecture that this behavior holds for any number of variables. That is, replacing each variable by its sufficient statistic about the others does not perturb the informational interactions among the variables. Nor does it induce any additional interactions among the sufficient statistics. And so, any multivariate mutual information will be invariant. We further conjecture that this is true of any common information, such as the Gács-Körner common information [34, 35], the Wyner common information [36, 37], and the exact common information [38].

## V Concluding Remarks

We demonstrated that it is proper to replace each variable with a sufficient statistic about its other variables without altering information-theoretic interactions among the variables. This is a great asset in many types of analysis and provides a principled method of performing lossless dimensionality reduction. As an important specific application, we demonstrated how the causal states of computational mechanics allow for the efficient computation of the excess entropy.

Our proof method centered around the use of an I-diagram and its atoms. Steps in our proof, such as identifying that the atom labeled $m$ is nonnegative via its containment in $I[X:Z \mid W]$, are greatly aided by this graphical tool. Despite this, we believe that a superior proof of these results exists: a proof that does not depend on demonstrating atom-by-atom that all but a select few are zero. Such a proof would, hopefully, apply generically and directly to an $N$-variable system, hold for the menagerie of multivariate generalizations of the mutual information, and perhaps apply even to the common informations.

## Acknowledgments

We thank Dowman P. Varn for helpful conversations. This material is based upon work supported by, or in part by, the John Templeton Foundation grant 52095, the Foundational Questions Institute grant FQXi-RFP-1609, and the U. S. Army Research Laboratory and the U. S. Army Research Office under contracts W911NF-13-1-0390 and W911NF-13-1-0340.

## References

• [1] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, second edition, 2006.
• [2] J. P. Crutchfield, C. J. Ellison, and J. R. Mahoney. Time’s barbed arrow: Irreversibility, crypticity, and stored information. Phys. Rev. Lett., 103(9):094101, 2009.
• [3] J. P. Crutchfield and D. P. Feldman. Regularities unseen, randomness observed: Levels of entropy convergence. CHAOS, 13(1):25–54, 2003.
• [4] J. P. Crutchfield. Between order and chaos. Nature Physics, 8(January):17–24, 2012.
• [5] J. P. Crutchfield and C. J. Ellison. The past and the future in the present. arxiv.org:1012.0356 [nlin.CD].
• [6] F. M. Reza. An Introduction to Information Theory. Courier Corporation, 1961.
• [7] R. W. Yeung. Information Theory and Network Coding. Springer, New York, 2008.
• [8] R. G. James, C. J. Ellison, and J. P. Crutchfield. Anatomy of a bit: Information in a time series observation. CHAOS, 21(3):037109, 2011.
• [9] P. L. Williams and R. D. Beer. Nonnegative decomposition of multivariate information. arXiv:1004.2515.
• [10] C. R. Shalizi and J. P. Crutchfield. Computational mechanics: Pattern and prediction, structure and simplicity. J. Stat. Phys., 104:817–879, 2001.
• [11] S. Kamath and V. Anantharam. A new dual to the Gács-Körner common information defined via the Gray-Wyner system. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 1340–1346. IEEE, 2010.
• [12] S. Wolf and J. Wultschleger. Zero-error information and applications in cryptography. In Information Theory Workshop, 2004. IEEE, pages 1–6. IEEE, 2004.
• [13] C. J. Ellison, J. R. Mahoney, and J. P. Crutchfield. Prediction, retrodiction, and the amount of information stored in the present. J. Stat. Phys., 136(6):1005–1034, 2009.
• [14] W. Löhr and N. Ay. On the generative nature of prediction. Advances in Complex Systems, 12(02):169–194, 2009.
• [15] C. J. Ellison, J. R. Mahoney, R. G. James, J. P. Crutchfield, and J. Reichardt. Information symmetries in irreversible processes. CHAOS, 21(3):037107, 2011.
• [16] A. Fraser and H. L. Swinney. Independent coordinates for strange attractors from mutual information. Phys. Rev. A, 33:1134–1140, 1986.
• [17] M. Casdagli and S. Eubank, editors. Nonlinear Modeling, SFI Studies in the Sciences of Complexity, Reading, Massachusetts, 1992. Addison-Wesley.
• [18] J. C. Sprott. Chaos and Time-Series Analysis. Oxford University Press, Oxford, United Kingdom, second edition, 2003.
• [19] H. Kantz and T. Schreiber. Nonlinear Time Series Analysis. Cambridge University Press, Cambridge, United Kingdom, second edition, 2006.
• [20] J. P. Crutchfield and D. P. Feldman. Statistical complexity of simple one-dimensional spin systems. Phys. Rev. E, 55(2):R1239–R1243, 1997.
• [21] I. Erb and N. Ay. Multi-information in the thermodynamic limit. J. Stat. Phys., 115:949–967, 2004.
• [22] W. Bulatek and B. Kaminski. On excess entropies for stationary random fields. Prob. Math. Stat., 29(2):353–367, 2009.
• [23] G. Tononi, O. Sporns, and G. M. Edelman. A measure for brain complexity: Relating functional segregation and integration in the nervous system. Proc. Nat. Acad. Sci. USA, 91:5033–5037, 1994.
• [24] W. Bialek, I. Nemenman, and N. Tishby. Predictability, complexity, and learning. Neural Computation, 13:2409–2463, 2001.
• [25] S. Marzen, M. R. DeWeese, and J. P. Crutchfield. Time resolution dependence of information measures for spiking neurons: Scaling and universality. Front. Comput. Neurosci., 9:109, 2015.
• [26] S. Marzen and J. P. Crutchfield. Statistical signatures of structural organization: The case of long memory in renewal processes. Phys. Lett. A, 380(17):1517–1525, 2016.
• [27] W. Ebeling and T. Poschel. Entropy and long-range correlations in literary English. Europhys. Lett., 26:241–246, 1994.
• [28] L. Debowski. On the vocabulary of grammar-based codes and the logical consistency of texts. IEEE Trans. Info. Th., 2008.
• [29] S. Watanabe. Information theoretical analysis of multivariate correlation. IBM J. Res. Dev., 4(1):66–82, 1960.
• [30] T. S. Han. Linear dependence structure of the entropy space. Inf. Control, 29(4):337–368, 1975.
• [31] A. J. Bell. The co-information lattice. In S. Makino, S. Amari, A. Cichocki, and N. Murata, editors, Proceedings of the Fifth International Workshop on Independent Component Analysis and Blind Signal Separation, volume ICA 2003, pages 921–926, New York, 2003. Springer.
• [32] W. J. McGill. Multivariate information transmission. Psychometrika, 19(2):97–116, 1954.
• [33] C. Chan, A. Al-Bashabsheh, J. B. Ebrahimi, T. Kaced, and T. Liu. Multivariate mutual information inspired by secret-key agreement. Proc. IEEE, 103(10):1883–1913, 2015.
• [34] P. Gacs and J. Korner. Common information is much less than mutual information. Problems Contr. Inform. Th., 2:149–162, 1973.
• [35] H. Tyagi, P. Narayan, and P. Gupta. When is a function securely computable? IEEE Trans. Info. Th., 57(10):6337–6350, 2011.
• [36] A. D. Wyner. The common information of two dependent random variables. IEEE Trans. Info. Th., 21(2):163–179, 1975.
• [37] W. Liu, G. Xu, and B. Chen. The common information of N dependent random variables. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 836–843. IEEE, 2010.
• [38] G. R. Kumar, C. T. Li, and A. El Gamal. Exact common information. In Proc. IEEE ISIT, pages 161–165. IEEE, 2014.