Information-theoretical analysis of the statistical dependencies among three variables: Applications to written language

07/30/2015
by   Damián G. Hernández, et al.

We develop the information-theoretical concepts required to study the statistical dependencies among three variables. Some of these dependencies are pure triple interactions, in the sense that they cannot be explained in terms of a combination of pairwise correlations. We derive bounds for triple dependencies, and characterize the shape of the joint probability distribution of three binary variables with high triple interaction. The analysis also allows us to quantify the amount of redundancy in the mutual information between pairs of variables, and to assess whether the information between two variables is or is not mediated by a third variable. These concepts are applied to the analysis of written texts. We find that the probability that a given word is found in a particular location within the text is not only modulated by the presence or absence of other nearby words, but also by the presence or absence of nearby pairs of words. We identify the words enclosing the key semantic concepts of the text, the triplets of words with high pairwise and triple interactions, and the words that mediate the pairwise interactions between other words.

I Introduction

Imagine a game where, as you read through a piece of text, you occasionally come across a blank space representing a removed or occluded word. Your task is to guess the missing word. This is an example sentence, —— your guess. If you were able to replace the blank space in the previous sentence with “make”, or “try”, or some other related word, you have understood the rules of the game. The task is called the Cloze test Taylor (1953) and is routinely administered to evaluate language proficiency, or expertise in a given subject.

The cues available to the player to solve the task can be divided into two major groups. First, surrounding words restrict the grammatical function of the missing word, since, for example, a conjugated verb cannot usually take the place of a noun, nor vice versa. Second, and assuming that the grammatical function of the word has already been surmised, semantic information provided by the surrounding words is typically helpful. That is, the presence or absence of specific words in the neighborhood of the blank space affects the probability of each candidate missing word. For example, if the word bee is near the blank space, the likelihood of honey is larger than when bee is absent.

In this paper we study the structure of the probabilistic links between words due to semantic connections. In particular, we aim at deciding whether binary interactions between words suffice to describe the structure of dependencies, or whether triple and higher-order interactions are also relevant: Should we only care about the presence or absence of specific words in the vicinity of the blank space, or does the presence or absence of specific pairs (or higher-order combinations) also matter in our ability to guess the missing word? For example, one would expect that the presence of the word cell would increase the probability of words such as cytoplasm, phone or prisoner. The word wax, in turn, is easily associated with ear, candle or Tussaud. However, the conjoint presence of cell and wax points much more specifically to concepts such as bee or honey, and diminishes the probability of words associated with other meanings of cell and wax. Combinations of words, therefore, also matter in the creation of meaning and context. The question is how relevant this effect is, and whether the effect of the pair (cell + wax) is larger than, equal to, or smaller than the sum of the two individual contributions (effect of cell + effect of wax). Here we develop the mathematical methods to estimate these contributions quantitatively.

The problem can be framed in more general terms. In any complex system, the statistical dependence between individual units cannot always be reduced to a superposition of pairwise interactions. Triplet, or even higher-order dependencies may arise either because three or more variables are dynamically linked together, or because some hidden variables, not accessible to measurement, are linked to the visible variables through pairwise interactions.

In 2006, Schneidman and coworkers Schneidman et al. (2006) demonstrated that, in the vertebrate retina, correlations up to pairwise order between neurons could account for approximately 90% of all the statistical dependencies in the joint probability distribution of the whole population. This finding brought relief to the scientific community, since an expansion up to the second order was regarded as sufficient to provide an adequate description of the correlation structure of the full system. As a consequence, not much effort has been dedicated to the detection and the characterization of third or higher-order interactions. To our knowledge, the present work constitutes the first example offering an exact description of third-order dependencies. We derive the relevant information-theoretical measures, and then apply them to actual data.

As a model system, we work with the vast collection of words found in written language, since this system is likely to embody complex statistical dependencies between individual words. The dependencies arise from the syntactic and semantic structures required to map a network of interwoven thoughts into an ordered sequence of symbols, namely, words. The projection from the high-dimensional space of ideas onto the single dimension represented by time can only be made because language encodes meaning in word order, and word relations. In particular, if specific words appear close to each other, they are likely to construct a context, or a topic. The context is important in disambiguating among the several meanings that words usually have. Therefore, language constitutes a model system where individual units (words) can be expected to exhibit high-order interactions.

Statistics and information theory have proved to be useful in understanding language structures. Since Zipf’s empirical law Zipf (1949) on the frequency of words, and the pioneering work of Shannon Shannon (1951) measuring the entropy of printed English, a whole branch of science has followed these lines Grassberger (1989); Ebeling and Pöschel (1994); Montemurro and Zanette (2010). In recent years, the discipline gained momentum with the availability of large data sources on the internet Petersen et al. (2012); Perc (2012); Gerlach and Altmann (2013, 2014).

In this paper we quantify the amount of double and triple interactions between words of a given text. In addition, by means of a careful analysis of the structure of pairwise interactions we distinguish between pairs of variables that interact directly, and pairs of variables that are only correlated because they both interact with a third variable. With these goals in mind, we define and measure dependencies between words using concepts from information theory Shannon (1948); Jaynes (1957); Cover and Thomas (2012), and apply them in later sections to the analysis of written texts.

II Statistical dependencies among three variables

When it comes to quantifying the amount of statistical dependence between two variables X and Y, with joint probability p(x,y) and marginal probabilities p(x) and p(y), Shannon’s mutual information Shannon (1948); Cover and Thomas (2012)

I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\, p(y)}    (1)

stands out for its generality and its simplicity. Throughout this paper we take all logarithms in base 2, and therefore measure all information-theoretical quantities in bits. In Fig. 1, pairwise statistical dependencies are represented by the rods connecting two variables (independent variables appear disconnected). Since I(X;Y) is the Kullback-Leibler divergence Cover and Thomas (2012) between the joint distribution p(x,y) and its independent approximation p(x) p(y), the mutual information is always non-negative. Moreover, X and Y are independent if and only if their mutual information vanishes.
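
As a minimal illustration (ours, not part of the original paper), the following Python sketch estimates I(X;Y) of Eq. (1) in bits from a joint probability table; the function name and the example table are assumptions of the sketch:

import numpy as np

def mutual_information(p_xy):
    """Mutual information (in bits) of a two-dimensional joint probability array."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = p_xy > 0                         # convention: 0 log 0 = 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask])))

# Two perfectly correlated binary variables carry 1 bit of mutual information.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))  # -> 1.0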

Three variables, in turn, may interact in different ways; Fig. 1 illustrates all the possibilities.

Figure 1: Different ways in which three variables may interact. A: The three variables are independent. B: Only pairwise interactions exist. These may involve 1, 2 or 3 links (from left to right). C: The three variables are connected by a single triple interaction. D: Double and triple interactions may coexist. The most general case is illustrated in the bottom-right panel.

In this section, we discuss several quantities that measure the strength of the different interactions. So far, no general consensus has been reached regarding the way in which statistical dependencies between three variables should be quantified Darroch (1962); McGill (1954); Agresti (2014); Martignon et al. (2000); Amari (2001); Bell (2003); Schneidman et al. (2003a); Nemenman (2004); Vitányi (2011); Griffith and Koch (2014). One attempt in the framework of Information Theory is the symmetric quantity I(X;Y;Z), sometimes called the co-information Cover and Thomas (2012); Bell (2003), defined as

I(X;Y;Z) = I(X;Y) - I(X;Y|Z),    (2)

where I(X;Y|Z) is the conditional mutual information,

I(X;Y|Z) = \sum_{z} p(z) \sum_{x,y} p(x,y|z) \log \frac{p(x,y|z)}{p(x|z)\, p(y|z)}.    (3)

The co-information measures the way one of the variables (no matter which) influences the transmission of information between the other two. Positive or negative values of the co-information have often been associated with redundancy or synergy between the three variables, though one should be careful to distinguish between several possible meanings of the words synergy and redundancy (see below, and also Schneidman et al. (2003b); Eyherabide and Samengo (2010)).
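
A short sketch (ours, with an assumed array layout) that evaluates the co-information of Eq. (2) from a three-dimensional joint probability table, using the standard entropy expansion:

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def co_information(p_xyz):
    """Co-information I(X;Y;Z) in bits, via the entropy expansion of Eq. (2)."""
    p = np.asarray(p_xyz, dtype=float)
    h_x, h_y, h_z = entropy(p.sum((1, 2))), entropy(p.sum((0, 2))), entropy(p.sum((0, 1)))
    h_xy, h_xz, h_yz = entropy(p.sum(2)), entropy(p.sum(1)), entropy(p.sum(0))
    return h_x + h_y + h_z - h_xy - h_xz - h_yz + entropy(p)

# Three identical copies of a fair bit are maximally redundant: I(X;Y;Z) = +1 bit.
p_copy = np.zeros((2, 2, 2))
p_copy[0, 0, 0] = p_copy[1, 1, 1] = 0.5
print(co_information(p_copy))   # -> 1.0

With this convention, positive values indicate redundancy (as in the example) and negative values indicate synergy.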

In an attempt to provide a systematic expansion of the different interaction orders, Amari Amari (2001) developed an alternative way of measuring triple and higher-order interactions. His approach unifies concepts from categorical data analysis and maximum entropy techniques. The theory is based on a decomposition of the joint probability distribution as a product of functions, each factor accounting for the interactions of a specific order. The first term embodies the independent approximation, the second term adds all pairwise interactions, and subsequent terms account, in order, for triplets, quadruplets and so forth. This approach constitutes the starting point for the present work.

Given n random variables X_1, …, X_n governed by a joint probability distribution p(x_1, …, x_n), all the marginal distributions of order k can be calculated by summing the values of the joint distribution over n − k of the variables. Since there are n!/[k!(n − k)!] ways of choosing k variables among the original n, this is also the number of marginal distributions of order k. Amari defined the probability distribution p̃_k as the one with maximum entropy among all those that are compatible with all the marginal distributions of order k. The maximization of the entropy under such constraints has a unique solution Csiszár (1975): the distribution allowing the variables to vary with maximal freedom, inasmuch as they still obey the restrictions imposed by the marginals. Hence, p̃_k contains all the statistical dependencies among groups of k variables that were present in the original distribution, but none of the dependencies involving more than k variables.

The interactions of order k are quantified by the decrease of entropy from p̃_{k−1} to p̃_k, which can be expressed as a Kullback-Leibler divergence,

D_k = H_{k-1} - H_k = D_{KL}(\tilde{p}_k \| \tilde{p}_{k-1}) \ge 0,    (4)

where H_k is the entropy of p̃_k. The last equality of Eq. (4) derives from the generalized Pythagoras theorem Amari (2001). As increasing constraints cannot increase the entropy, D_k is always non-negative.

The total amount of interactions within a group of n variables, the so-called multi-information McGill (1954), is defined as the Kullback-Leibler divergence between the actual joint probability distribution and the distribution corresponding to the independent approximation. The multi-information naturally splits into the sum of the different interaction orders,

I_{tot} = D_{KL}(p \| \tilde{p}_1) = H_1 - H_n = \sum_{k=2}^{n} D_k.    (5)

For two variables, there are at most pairwise interactions. Their strength, measured by D_2, coincides with Shannon’s mutual information,

D_2 = H_1 - H_2 = I(X;Y),    (6)

since the distribution with maximum entropy that is compatible with the two univariate marginals is p̃_1(x,y) = p(x) p(y). This result is easily obtained by searching for the joint distribution that maximizes the entropy using Lagrange multipliers for the constraints given by the marginals (Kapur, 1989).

When studying three variables X, Y and Z, we separately quantify the amount of pairwise and of triple interactions. In this context, D_3 measures the amount of statistical dependency that cannot be explained by pairwise interactions, and is defined as

D_3 = H_2 - H_3 = H(\tilde{p}_2) - H(X,Y,Z),    (7)

where H(X,Y,Z) represents the full entropy of the triplet calculated with p(x,y,z).

The distribution p̃_2 contains up to pairwise interactions. If the actual distribution p coincides with p̃_2, there are no third-order interactions. Within Amari’s framework, hence, if D_3 > 0, some of the statistical dependency among triplets cannot be explained in terms of pairwise interactions.
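
A minimal sketch (ours) of Eq. (7) for three variables: the pairwise maximum-entropy model p̃_2 is obtained by iterative proportional fitting, the procedure cited in Appendix B, and D_3 is the resulting entropy difference. Function names and the test distribution are assumptions of the sketch:

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def maxent_pairwise(p_xyz, n_iter=200):
    """Maximum-entropy 3-variable table with the bivariate marginals of p_xyz (IPF)."""
    p = np.asarray(p_xyz, dtype=float)
    q = np.full_like(p, 1.0 / p.size)                 # start from the uniform table
    for _ in range(n_iter):
        for axis in range(3):                         # enforce each bivariate marginal
            target = p.sum(axis=axis, keepdims=True)
            current = q.sum(axis=axis, keepdims=True)
            ratio = np.divide(target, current, out=np.zeros_like(current),
                              where=current > 0)
            q = q * ratio
    return q

def triple_interaction(p_xyz):
    """D3 of Eq. (7): entropy of the pairwise model minus the true entropy."""
    p = np.asarray(p_xyz, dtype=float)
    return entropy(maxent_pairwise(p)) - entropy(p)

# For an independent triplet, D3 vanishes (up to the IPF convergence error).
p_ind = np.multiply.outer(np.multiply.outer([0.3, 0.7], [0.6, 0.4]), [0.2, 0.8])
print(triple_interaction(p_ind))   # ~ 0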

Both I(X;Y;Z) and D_3 are generalizations of the mutual information intended to describe the interactions between three variables, and both of them can be extended to an arbitrary number of variables Amari (2001); Yeung (2002). It is important to notice, however, that the two quantities have different meanings. A vanishing co-information (I(X;Y;Z) = 0) implies that the mutual information between two of the variables remains unaffected if the value of the third variable is changed. However, this does not mean that it suffices to measure only pairs of variables—and thereby obtain the bivariate marginals—to reconstruct the full probability distribution p(x,y,z). Conversely, a vanishing triple interaction (D_3 = 0) ensures that pairwise measurements suffice to reconstruct the full joint distribution. Yet, the value of any of the variables may still affect how much information is transmitted between the other two.

We shall later need to specify the groups of variables whose marginals are used as constraints. We therefore introduce a new notation for the maximum entropy probability distributions and for the maximum entropies. Let A represent a collection of groups of variables. For example, for the triplet (X, Y, Z) we may have A = {(X,Y), (Z)} or A = {(X,Y), (Y,Z), (X,Z)}. When studying the dependencies of k-th order, we shall be working with all the groups that can be formed with k variables. Let p̃_A be the probability distribution of maximum entropy that satisfies the marginal restrictions of A. Under this notation,

\tilde{p}_1 = \tilde{p}_{(X)(Y)(Z)}, \qquad \tilde{p}_2 = \tilde{p}_{(XY)(YZ)(XZ)}.    (8)

Respectively, the maximum entropies are H̃_{(X)(Y)(Z)} and H̃_{(XY)(YZ)(XZ)}. Under the present notation, the mutual information is I(X;Y) = H̃_{(X)(Y)} − H̃_{(XY)}, and the co-information of three variables is written as I(X;Y;Z) = I(X;Y) − I(X;Y|Z), as in Eq. (2).

The amount of pairwise interaction between variables X and Y is known to be bounded by Cover and Thomas (2012)

I(X;Y) \le \min\{H(X), H(Y)\}.    (9)

We have derived an analogous bound for triple interactions (see Appendix A). The resulting inequality links the amount of triple interactions with the co-information I(X;Y;Z),

D_3 \le \min\{I(X;Y), I(X;Z), I(Y;Z)\} - I(X;Y;Z).    (10)

These bounds imply that pure triple interactions, appearing in the absence of pairwise interactions (see Fig. 1C), may only exist if the co-information is negative.

II.1 Characterization of the joint probability distribution of variables with high triple interactions

Two binary variables X and Y can have maximal mutual information (1 bit) in two different situations. For the sake of concreteness, assume that both variables take the values 0 and 1 with probability 1/2. Maximal mutual information is obtained either when Y = X, or when Y = 1 − X. In other words, the joint probability distribution must either vanish when the two variables are equal, or when the two variables are different, as illustrated in Fig. 2A.

Figure 2: A: Density plot of the two bivariate probability distributions that have I(X;Y) = 1 bit. Dark states have zero probability, and white states have probability 1/2. B: Density plot of the two trivariate probability distributions with D_3 = 1 bit. Dark states have zero probability, and white states have probability 1/4. C: Gradual change between a uniform distribution and a XOR distribution, for different values of α (Eq. (13)). D: Amount of triple interactions D_3 as a function of the parameter α.

If the mutual information is high, though perhaps not maximal, then the two variables must still remain somewhat correlated, or anti-correlated. The joint probability distribution, hence, must drop for those states where the variables are equal - or different. In this section we develop an equivalent intuitive picture of the joint probability distribution of triplets with maximal (or, less ambitiously, just high) triple interaction.

Consider three binary variables X, Y, Z taking values in {0, 1} with joint probability distribution

p(x,y,z) = 1/4  if z = x ⊕ y,   and   p(x,y,z) = 0  otherwise,    (11)

as illustrated in Fig. 2B, left side. For this probability distribution, the three univariate marginals are uniform, that is, p(x) = p(y) = p(z) = 1/2. Moreover, the three bivariate marginals are also uniform: p(x,y) = p(x,z) = p(y,z) = 1/4. The full distribution, however, is far from uniform, since only half of the 8 possible states have non-vanishing probability.

The probability distribution of Eq. (11) is henceforth called a XOR distribution. The name is inspired by the fact that two independent binary variables X and Y can be combined into a third dependent variable Z = X ⊕ Y, where ⊕ represents the logical function exclusive-OR. If the two input variables have equal probabilities for the two states 0 and 1, then Eq. (11) describes the joint probability distribution of the triplet (X, Y, Z).

The maximum-entropy probability compatible with uniform bivariate marginals is uniform, p̃_2(x,y,z) = 1/8. The amount of triple interactions is therefore

D_3 = H(\tilde{p}_2) - H(p) = 3 - 2 = 1 \text{ bit},    (12)

and I_{tot} = D_3, i.e. all interactions are tripletwise and D_3 reaches the maximum value allowed for binary variables. Of course, the same amount of triple interactions is obtained for the complementary probability distribution (a so-called nXOR distribution), for which p(x,y,z) = 1/4 when z ≠ x ⊕ y (see Fig. 2B, right side).

So far we have demonstrated that XOR and nXOR distributions contain the maximal amount of triple interactions. Amari Amari (2001) has proved the reciprocal result: If the amount of triple interactions is maximal, then the distribution is either XOR or nXOR. We now demonstrate that if the joint distribution lies somewhere in between a uniform distribution and a XOR (or an nXOR) distribution, then the amount of triple interactions lies somewhere in between 0 and 1, and the correspondence is monotonic. To this end, we consider a family of joint probability distributions parametrized by a constant α, defined as a linear combination of a uniform distribution and a XOR distribution,

p_\alpha(x,y,z) = (1 - \alpha)\, \frac{1}{8} + \alpha\, p_{XOR}(x,y,z),    (13)

where α ∈ [−1, 1]. Varying α from zero to one shifts p_α from the uniform distribution to the XOR probability of Eq. (11) (see Fig. 2C). Negative values of α, in turn, shift the distribution towards the nXOR. All the bivariate marginals of the distribution p_α are uniform, and equal to 1/4. The maximum-entropy model compatible with these marginals is the uniform distribution p̃_2(x,y,z) = 1/8. Hence, the amount of triple interactions is

D_3(\alpha) = \tfrac{1}{2}\left[(1 + \alpha)\log_2(1 + \alpha) + (1 - \alpha)\log_2(1 - \alpha)\right].    (14)

As shown in Fig. 2D, this function is even, and varies monotonically in each of the intervals [−1, 0] and [0, 1]. Therefore, there is a one-to-one correspondence between the similarity of the distribution to the XOR and the amount of triple interactions. The same result is obtained for arbitrary binary distributions, as argued in the last paragraph of Appendix B. As a consequence, we conclude that for binary variables, the XOR distribution is not just one possible example of a distribution with triple interactions; rather, it is the only way in which three binary variables interact in a tripletwise manner. If bivariate marginals are kept fixed, and triple interactions are varied, then the joint probability distribution either gains or loses a XOR-like component, as illustrated in Fig. 2C.
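
The following short check (ours; it assumes the reconstructed forms of Eqs. (11), (13) and (14)) builds the XOR distribution and the interpolated family p_α, and verifies that D_3 grows from 0 to 1 bit as |α| increases:

import numpy as np

def p_xor():
    p = np.zeros((2, 2, 2))
    for x in range(2):
        for y in range(2):
            p[x, y, x ^ y] = 0.25          # Eq. (11): mass only where z = x XOR y
    return p

def p_alpha(alpha):
    return (1 - alpha) / 8 + alpha * p_xor()   # Eq. (13), alpha in [-1, 1]

for a in (0.0, 0.5, 1.0):
    p = p_alpha(a)
    # All bivariate marginals of p_alpha are uniform, so the pairwise maximum-entropy
    # model is the uniform distribution and D3 = 3 - H(p_alpha).
    h = -(p[p > 0] * np.log2(p[p > 0])).sum()
    print(a, 3 - h)   # -> 0.0, ~0.19, 1.0, consistent with Eq. (14)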

III Triplet analysis of pairwise interactions

In a triplet of variables (X, Y, Z), three possible binary interactions can exist, quantified by I(X;Y), I(X;Z) and I(Y;Z). In this section we characterize the amount of overlap between these quantities, we bound their magnitude, and we learn how to distinguish between reducible and irreducible interactions.

III.1 Redundancy among the three mutual informations within a triplet

In the previous section, we saw that when there are only two variables X and Y, D_2 coincides with the mutual information I(X;Y). When there are more than two variables, D_2 can no longer be equated to a mutual information, since there are several mutual informations in play, one per pair of variables: I(X;Y), I(X;Z), I(Y;Z). In this section, we derive a relation between all these quantities for the case of three interacting variables. The multi-information of Eq. (5) decomposes into pairwise and triple interactions,

I_{tot} = D_2 + D_3,    (15)

from where, using the identity I_{tot} = I(X;Y) + I(X;Z) + I(Y;Z) − I(X;Y;Z), we arrive at

D_2 = I(X;Y) + I(X;Z) + I(Y;Z) - \left[ I(X;Y;Z) + D_3 \right].    (16)

The total amount of pairwise dependencies, hence, is in general different from the sum of the three mutual informations. That is, depending on the sign of I(X;Y;Z) + D_3, the amount of pairwise interactions can be larger or smaller than I(X;Y) + I(X;Z) + I(Y;Z). This range of possibilities suggests that the difference I(X;Y) + I(X;Z) + I(Y;Z) − D_2 = I(X;Y;Z) + D_3 may be a useful measure of the amount of redundancy or synergy within the pairwise interactions inside the triplet, and this is the measure that we adopt in the present paper.

This measure coincides with the co-information when there are no triple dependencies, that is, when D_3 = 0. In this case,

I(X;Y;Z) = I(X;Y) + I(X;Z) + I(Y;Z) - D_2.    (17)

Under these circumstances, a positive value of I(X;Y;Z) implies that the sum of the three mutual informations is larger than the total amount of pairwise interactions. The content of the three informations, hence, must somehow overlap. This observation supports the idea that a positive co-information is associated with redundancy among the variables. In turn, a negative value of I(X;Y;Z) implies that although the maximum entropy distribution compatible with the pairwise marginals differs from the independent approximation (that is, although D_2 > 0), when taken two at a time the variables look almost independent, in the sense that the three mutual informations are comparatively small. The statistical dependency between the variables of any pair, hence, only becomes evident when fixing the third variable. This behavior supports the idea that a negative co-information is associated with synergy among the variables.

Of course, when D_3 ≠ 0, the co-information is no longer so simply related to the concepts of synergy and redundancy, at least not if the latter are understood as the difference between the sum of the three informations and D_2. However, below we show that in actual data one can often find a close connection between the amount of triple interactions and the co-information.
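
The decomposition can be checked numerically. The sketch below (ours; it assumes the reconstructed form of Eq. (16)) draws a generic triplet distribution, obtains p̃_1 and p̃_2 by iterative proportional fitting, and verifies that the sum of the three mutual informations exceeds D_2 by exactly I(X;Y;Z) + D_3:

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def maxent(p, groups, n_iter=300):
    """IPF: maximum-entropy table matching the marginals over the listed axis groups."""
    q = np.full_like(p, 1.0 / p.size)
    for _ in range(n_iter):
        for keep in groups:
            drop = tuple(a for a in range(p.ndim) if a not in keep)
            target, current = p.sum(drop, keepdims=True), q.sum(drop, keepdims=True)
            q = q * np.divide(target, current, out=np.zeros_like(current),
                              where=current > 0)
    return q

rng = np.random.default_rng(1)
p = rng.random((2, 2, 2))
p /= p.sum()                                          # a generic triplet distribution

h = {axes: entropy(p.sum(tuple(a for a in range(3) if a not in axes)))
     for axes in [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]}
mi_sum = sum(h[(i,)] + h[(j,)] - h[(i, j)] for i, j in [(0, 1), (0, 2), (1, 2)])
coinfo = h[(0,)] + h[(1,)] + h[(2,)] - h[(0, 1)] - h[(0, 2)] - h[(1, 2)] + h[(0, 1, 2)]

d2 = entropy(maxent(p, [(0,), (1,), (2,)])) - entropy(maxent(p, [(0, 1), (0, 2), (1, 2)]))
d3 = entropy(maxent(p, [(0, 1), (0, 2), (1, 2)])) - entropy(p)

print(mi_sum - d2, coinfo + d3)   # the two sides of Eq. (16) agree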

III.2 Triangular binary interactions

In a group of interacting variables, if X has some degree of statistical dependence with Y, and Y has some statistical dependence with Z, one could expect X and Z to show some kind of statistical interaction due only to the chained dependencies X–Y–Z, even in the absence of a direct connection. Here we demonstrate that, indeed, two strong chained interactions necessarily imply the presence of a third connection closing the triangle. In the pictorial representation of the middle column of Fig. 1, this means that if only two connections exist (there is no link closing the triangle), then the two present interactions cannot be strong. For example, with binary variables, it is not possible to have I(X;Y) = 1 bit, I(Y;Z) = 1 bit, and I(X;Z) = 0. The general inequality reads (see the derivation in Appendix A)

I(X;Z) \ge I(X;Y) + I(Y;Z) - H(Y).    (18)

III.3 Identification of pairwise interactions that are mediated through a third variable

In the previous section we demonstrated that the chained dependencies X–Y–Z can induce some statistical dependency between X and Z. On the other hand, it is also possible for X and Z to interact directly, inheriting their interdependence from no other variable. These two possible scenarios cannot be disambiguated by just measuring the mutual information between pairs of variables. In Appendix C, we explain how, starting from the most general model (illustrated in the lower-right panel of Fig. 1), the analysis of triple interactions allows us to identify those links that can be explained from binary interactions involving other variables, and those that cannot: the so-called irreducible interactions. Briefly stated, we need to evaluate whether the interaction between X and Y (captured by the bivariate marginal p(x,y)) and the interaction between Y and Z (captured by p(y,z)) suffice to explain all pairwise interactions within the triplet, including also the interaction between X and Z. To that end, we compute a measure of the discrepancy between the two corresponding maximum entropy models,

D_{KL}\!\left(\tilde{p}_{(XY)(YZ)(XZ)} \,\|\, \tilde{p}_{(XY)(YZ)}\right) = \tilde{H}_{(XY)(YZ)} - \tilde{H}_{(XY)(YZ)(XZ)}.    (19)

The amount of irreducible interaction, that is, the amount of binary interaction between X and Z that remains unexplained through the chain X–Y–Z, is defined as

II_Y(X;Z) = \tilde{H}_{(XY)(YZ)} - \tilde{H}_{(XY)(YZ)(XZ)}.    (20)

In Sect. V.4, we search for pairs of variables with small irreducible interaction, by computing II_Y(X;Z) using all possible candidate variables Y that may act as mediators. From them, we keep the one giving minimal irreducible interaction, that is, the one for which the chain X–Y–Z provides the best explanation for the interaction between X and Z.
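
A sketch (ours, relying on the reconstructed reading of Eqs. (19)-(20) above): the irreducible interaction through a given mediator is the entropy drop produced by adding the (X,Z) marginal constraint on top of the two chain constraints, with each maximum-entropy model fitted by iterative proportional fitting. The Markov-chain test case is an assumption of the sketch:

import numpy as np
from itertools import product

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def maxent(p, groups, n_iter=300):
    """IPF: maximum-entropy 3-variable table matching the marginals over the axis groups."""
    q = np.full_like(p, 1.0 / p.size)
    for _ in range(n_iter):
        for keep in groups:
            drop = tuple(a for a in range(3) if a not in keep)
            target, current = p.sum(drop, keepdims=True), q.sum(drop, keepdims=True)
            q = q * np.divide(target, current, out=np.zeros_like(current),
                              where=current > 0)
    return q

def irreducible_interaction(p, x=0, y=1, z=2):
    """Entropy drop when the (x,z) constraint is added to the chain x-y-z constraints."""
    chain = [(x, y), (y, z)]
    full = chain + [(x, z)]
    return entropy(maxent(p, chain)) - entropy(maxent(p, full))

# For a Markov chain X -> Y -> Z the X-Z dependence is fully mediated by Y,
# so the irreducible interaction vanishes (up to numerical precision).
px = [0.3, 0.7]
py_x = [[0.8, 0.2], [0.1, 0.9]]
pz_y = [[0.6, 0.4], [0.25, 0.75]]
p = np.zeros((2, 2, 2))
for i, j, k in product(range(2), repeat=3):
    p[i, j, k] = px[i] * py_x[i][j] * pz_y[j][k]
print(irreducible_interaction(p))   # ~ 0, although I(X;Z) itself is not zero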

IV Marginalization and hidden variables

Imagine we have a system of n variables that are linked through just pairwise interactions. In such a system, for any pair of variables there is a third variable producing a vanishing irreducible interaction. By selecting a subset of the variables, we may calculate the corresponding lower-order marginal by marginalizing over the remaining variables. As opposed to the original multivariate distribution, the marginal may well contain triple and higher-order interactions. In other words, there may be pairs of variables that belong to the subset for which there is no other third variable in the subset producing a vanishing irreducible interaction. The high-order interactions in the subset, therefore, result from the fact that not all interacting variables are included in the analysis. Therefore, triple and higher-order statistical dependencies do not necessarily arise due to irreducible triple and higher-order interactions: Just pairwise interactions may suffice to induce them, whenever we marginalize over one or more of the interacting variables. An example of this effect is derived in Appendix D. In the same way, marginalization may introduce spurious pairwise interactions between variables that do not interact directly, as illustrated in Fig. 3.

Figure 3: Examples illustrating the effects of marginalization in a pair of variables (A) or a triplet (B). In each case, the variable represented in black drives the other slave variables, which do not interact directly with each other (top). However, after marginalizing over the driving variable, a statistical dependence between the remaining variables appears. The new interaction can be pairwise (A), or pairwise and tripletwise (B).
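
A minimal numerical version of the scenario of Fig. 3A (ours; the conditional probabilities are illustrative): two variables that never interact directly become statistically dependent once the common driver is marginalized out:

import numpy as np

def mutual_information(p_xy):
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask])))

eps = 0.1                                              # chance that a slave disagrees with the driver
p_given_h = np.array([[1 - eps, eps], [eps, 1 - eps]]) # p(x|h), identical for both slaves
p_xy = np.zeros((2, 2))
for h in range(2):                                     # marginalize over the hidden driver H
    p_xy += 0.5 * np.outer(p_given_h[h], p_given_h[h])
print(mutual_information(p_xy))                        # > 0: a spurious pairwise interaction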

Therefore, even if, by construction, we happen to know that the system under study can only contain pairwise statistical dependencies, it may be important to compute triple and higher-order interactions, whenever one or a few of the relevant variables are not measured.

Virtually all scientific studies focus their analysis on only a subset of all the variables that truly interact in the real system. However, as stated above, neglecting some of the variables typically induces high-order correlations among the remaining variables. If such correlations are interpreted within the reduced framework of the variables under study, they are spurious, at least in the sense that there may well be no mechanistic interaction among the selected variables that gives rise to such high-order interactions. However, if interpreted in a broader sense (i.e., as a mathematical fact that may result as a consequence of marginalization), high-order correlations may be viewed as a footprint of the marginalized variables, which are often inaccessible. As such, they constitute an opportunity to characterize those parts of the system that cannot be described by the values of the recorded variables.

Below we analyze the statistics of written language. We select a group of words (each selected word defines one variable), and we measure the presence or absence of each of these words in different parts of the book. For simplicity, not all the words in the book are included in the analysis, so the discarded words constitute examples of marginalized variables. However, marginalized variables are not always as concrete as non-analyzed words. Other non-registered factors may also influence the presence or absence of specific words, for example, those related to the thematic topic or the style that the author intended for each part of the book. These aspects are latent variables that we do not have access to by simply counting words. An analysis of the high-order statistics among the subgroup of selected words may therefore be useful to characterize such latent variables, which are otherwise inaccessible through automated text analysis.

As an ansatz, we can imagine that each topic affects the statistics of a subgroup of all the words. The fact that topics are not included in the analysis is equivalent to having marginalized over topics. By doing so, we create interactions within the different subgroups of words. If the topics do not overlap too much, from the network of the resulting interactions we may be able to identify communities of highly connected words that are related to certain topics. Variations in the topic can therefore be diagnosed from variations in the high-order statistics.

V Occurrence of words in a book

Before analyzing a book, all its words are taken in lowercase, and spaces and punctuation marks are neglected. Each word is replaced by its base uninflected form using the WordData function from the program Mathematica® wor. In this way, for instance, a word and its plural are considered the same, and verb conjugations are unified as well.

In order to construct the network of interactions between words, we analyze the probability that different words appear near each other. The notion of neighborhood is introduced by segmenting each book into parts. A book containing W words is divided into P parts, so that there are W/P words per part. We analyze the statistics of a subgroup of selected words w_1, …, w_n, and define the variables

x_i = 1 if word w_i appears in the part, and x_i = 0 otherwise.    (21)

The different parts of the book constitute the different samples of the joint probability p(x_1, …, x_n), or of the corresponding marginals. Notice that if word w_i is found in a given part of the book, in that sample x_i = 1, no matter whether the word appeared one or many times. The marginal probability p(x_i = 1) is the average frequency with which word w_i appears in one (any) of the parts. Here, we analyze up to triple dependencies, so we work with joint distributions of at most three variables, p(x_i, x_j, x_k).
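
A minimal sketch of the construction of Eq. (21) (ours; it uses a naive regular-expression tokenizer and skips the lemmatization step performed with Mathematica in the paper):

import re
import numpy as np

def presence_matrix(text, selected_words, n_parts):
    """Binary matrix: one row per part of the book, one column per selected word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    part_len = len(tokens) // n_parts
    x = np.zeros((n_parts, len(selected_words)), dtype=int)
    for s in range(n_parts):
        part = set(tokens[s * part_len:(s + 1) * part_len])
        for i, w in enumerate(selected_words):
            x[s, i] = int(w in part)      # 1 if the word occurs at least once in the part
    return x

# Usage (illustrative): x = presence_matrix(text_of_the_book, ["bee", "cell", "wax"], n_parts)
# Each row of x is then one sample of the joint distribution p(x_1, ..., x_n).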

In the present work, we choose to study words that have an intermediate range of frequencies. We disregard the most frequent words (which are generally stop words such as articles, pronouns and so on) because they predominantly play a grammatical role, and only to a lesser extent do they influence the semantic context Zanette and Montemurro (2002). We also discard the very infrequent words (those appearing only a few times in the whole book), because their rarity induces statistical inaccuracies due to limited sampling Samengo (2002). Discarding words implies that only a seemingly small number of words are analyzed, allowing us to illustrate the fact that even a small number of variables suffices to infer important aspects of the structure of the network of statistical dependencies among words. In other types of data, the limitation in the number of variables may arise from unavoidable technical constraints, and not from a matter of choice.

We analyzed two books, On the Origin of Species (OS) by Charles Darwin and The Analysis of Mind (AM) by Bertrand Russell, both taken from the Project Gutenberg website gut. Each book was divided into parts of a fixed number of words (the specific part sizes differed between OS and AM). Parts should be big enough so that we can still see the structure of semantic interactions, and yet the number of parts should not be so small as to induce inaccuracies due to limited sampling.

In both books, we analyzed 400 words with intermediate frequencies, selected by the total number of times each word appears in the book. Since for these words the number of samples (parts) is much greater than the number of states (2), entropies were calculated with the maximum likelihood estimator. We are able to detect differences in entropy of 0.01 bits (see Appendix E for an analysis of significance). A Bayesian analysis of the estimation error due to finite sampling was also included, allowing us to bound the errors, which depend on the size of the interaction (see Appendix F).

V.1 Statistics of single words

Before studying interactions between two or more words, we characterize the statistical properties of single words. Specifically, we analyze the frequency of individual words, and the predictability of their presence in one (any) part of the book. Within the framework of Information Theory, the natural measure of (un)predictability is entropy.

Using the notation p_i = p(x_i = 1), the entropy is

H(x_i) = -p_i \log_2 p_i - (1 - p_i) \log_2(1 - p_i).    (22)

This quantity is maximal (1 bit) when p_i = 1/2, that is, when the word appears in half of the parts. When w_i appears in either most of the parts or in almost none, H(x_i) approaches zero. For all the analyzed words, p_i ≤ 1/2. In this range, the entropy is a monotonic function of p_i.

The value of p_i, however, is not univocally determined by the number of times n_i that the word appears in the book. If w_i appears at most once per part, then p_i = n_i/P. If w_i tends to appear several times per part, then p_i < n_i/P.

In addition, one can determine whether the fraction of parts containing the word is in accordance with the expected fraction, given the total number of times the word appears in the whole book. If n_i is half the number of parts (that is, n_i = P/2), then p_i = 1/2 implies that the words are distributed as uniformly as they possibly can: Half of the parts do not contain the word, and the other half contain it just once. If, instead, n_i ≫ P/2, a value of p_i = 1/2 corresponds to a highly non-uniform distribution: The word is absent from half of the parts, but it appears many times in the remaining half.

To formalize these ideas, we compared the entropy of each selected word with the entropy that would be expected for a word with the same probability per part, but randomly distributed throughout the book and sampled n_i times. The binomial probability of finding the word k times in one (any) part is

P(k) = \binom{n_i}{k} \left(\frac{1}{P}\right)^{k} \left(1 - \frac{1}{P}\right)^{n_i - k}.    (23)

Equation (23) describes an integer variable. In order to compare with Eq. (22), we define x_i^{bin} as the binary variable measuring the presence/absence of word w_i in one (any) part, assuming that the word is binomially distributed. That is, x_i^{bin} = 0 if k = 0, and x_i^{bin} = 1 if k ≥ 1. The marginal probability of x_i^{bin} is p_i^{bin} = 1 − (1 − 1/P)^{n_i}. This formula is also obtained when all the words in the book are shuffled. In this case k follows a hypergeometric distribution, and the probability of absence is approximately (1 − 1/P)^{n_i}, where the approximation holds when n_i is much smaller than the total number of words W.

Hence, the entropy of the binary variable associated with the binomial (or the shuffled) model is

H^{bin}(x_i) = -p_i^{bin} \log_2 p_i^{bin} - (1 - p_i^{bin}) \log_2(1 - p_i^{bin}).    (24)

The entropy of the variable x_i measured from each book is compared with the entropy of the binomial-derived variable x_i^{bin} in Fig. 4.

Figure 4: Entropy of the 400 selected words in each book (one data point per word), compared to the expected entropy for a binomial variable with the same total count (continuous line), as a function of the total count. Entropies are calculated with the maximum likelihood estimator. The analytical expression of Eq. (24) is represented with the black line, and the gray area corresponds to the percentiles 1%-99% of the dispersion expected in the binomial model, when using a sample of words. Data points outside the gray area, hence, are highly unlikely under the binomial hypothesis, even when allowing for inaccuracies due to limited sampling. A: OS. B: AM.
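
A small sketch of this comparison (ours; function names are illustrative): the measured presence/absence entropy of Eq. (22) versus the binomial prediction of Eq. (24), for a word that occurs many times but clumps into few parts:

import numpy as np

def entropy_bernoulli(q):
    if q in (0.0, 1.0):
        return 0.0
    return float(-q * np.log2(q) - (1 - q) * np.log2(1 - q))

def word_entropy(presence):
    """Eq. (22): entropy of the empirical presence/absence variable (one entry per part)."""
    return entropy_bernoulli(np.mean(presence))

def binomial_entropy(n_occurrences, n_parts):
    """Eq. (24): entropy expected if the occurrences were scattered binomially over the parts."""
    q_absent = (1 - 1 / n_parts) ** n_occurrences     # probability of missing a given part
    return entropy_bernoulli(1 - q_absent)

# A word appearing 40 times but concentrated in 10 of 100 parts: the measured entropy
# falls well below the binomial prediction, the signature of semantically loaded words.
presence = np.array([1] * 10 + [0] * 90)
print(word_entropy(presence), binomial_entropy(40, 100))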

Even if the process were truly binomial, the estimation of the entropy may still fluctuate, due to limited sampling. In Fig. 4, the gray region represents the area expected for 98% of the samples under the binomial hypothesis. We expect 1% of the words to fall above this region, and another 1% below. However, in OS, out of 400 words, none of them appears above, and 15% appear below. In AM, the percentages are 0% and 16.5%. In both cases, the outliers with small entropy are 15 times more numerous than predicted by the binomial model, and no outliers with high entropy were found, although 4 were expected for each book. In both books, hence, individual word entropies were significantly smaller than predicted by the binomial approximation, implying that they are not distributed randomly: In any given part, each word tends to appear many times, or not at all.

A list of the words with the highest difference H^{bin}(x_i) − H(x_i) is shown in Table 1. Interestingly, most of these words are nouns, with the first exception appearing in place 10 (the adjective “rudimentary”) for OS. As reported previously Zanette and Montemurro (2002), words with relevant semantic content are the ones that tend to be most unevenly distributed.

Word (OS) H^{bin} − H Word (AM) H^{bin} − H
bee 0.369 proposition 0.335
cell 0.365 appearance 0.315
slave 0.302 box 0.299
stripe 0.295 datum 0.258
pollen 0.275 animal 0.240
sterility 0.266 objective 0.215
pigeon 0.252 star 0.211
fertility 0.248 content 0.206
nest 0.242 emotion 0.205
rudimentary 0.234 consciousness 0.204
Table 1: Words with the highest difference in entropy H^{bin}(x_i) − H(x_i), expressed in bits. Left: OS. Right: AM.

V.2 Statistics of pairs of words

In principle, there are two possible scenarios in which the mutual information between two variables can be high: (a) in each part of the book the two words either appear together or are both absent, and (b) the presence of one of the words in a given part excludes the presence of the other. In Table 2 we list the pairs of words with highest mutual information. In all these cases, the two words in the pair tend to be either simultaneously present or simultaneously absent (option (a) above).

Word i Word j I(x_i;x_j) H(x_i) H(x_j) (OS) Word i Word j I(x_i;x_j) H(x_i) H(x_j) (AM)
male female 0.242 0.504 0.409 1 2 0.191 0.330 0.337
south america 0.210 0.480 0.560 truth falsehood 0.110 0.429 0.191
reproductive system 0.152 0.290 0.474 response accuracy 0.107 0.306 0.264
north america 0.133 0.429 0.560 depend upon 0.107 0.229 0.616
cell wax 0.122 0.201 0.150 mnemic phenomena 0.095 0.423 0.516
bee cell 0.120 0.330 0.201 mnemic causation 0.090 0.423 0.381
fertile sterile 0.120 0.345 0.330 consciousness conscious 0.089 0.504 0.352
deposit bed 0.109 0.322 0.314 door window 0.086 0.160 0.128
fertility sterility 0.109 0.352 0.322 stimulus response 0.085 0.474 0.306
southern northern 0.107 0.306 0.264 pain pleasure 0.079 0.171 0.181
Table 2: Pairs of words with highest mutual information I(x_i;x_j), together with the individual entropies H(x_i) and H(x_j). Left: OS. Right: AM. The values are in bits.

The words listed in Table 2 are semantically related. In both books, there are examples of words that participate in two pairs: cell is connected to both bee and wax (OS) and mnemic is connected to both phenomena and causation (AM). These examples keep appearing if the lists of Table 2 are extended further down. Their structure corresponds to the double links in the second and third columns of Figs. 1B and 1D. As explained in Sect. III.2, two strong binary links imply that the third link closing the triangle should also be present. Indeed, in OS, america is linked to both south and north (rows 2 and 4 of Table 2). The words south and north are also linked to each other, but they only appear in position 32, with a mutual information that is approximately 1/3 of the two principal links. A similar situation is seen with bee and wax, both connected to cell, although the direct connection between bee and wax appears sooner, in position 16. The same happens in AM with phenomena and causation, linked through mnemic, which are connected to each other in the 39th place of the list. These examples pose the question whether the weakest link in the triangle could be entirely explained as a consequence of the two stronger links. A triplet analysis of pairwise interactions allows us to assess whether such is indeed the case (see Sect. III.3).
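
Operationally, a ranking like the one in Table 2 can be obtained from the presence matrix sketched in Sec. V (the snippet below is ours; it uses the plug-in estimator and illustrative names):

import numpy as np
from itertools import combinations

def pair_mutual_information(xi, xj):
    """Plug-in mutual information (bits) between two binary sample vectors."""
    joint = np.zeros((2, 2))
    np.add.at(joint, (xi, xj), 1)                     # 2x2 contingency table
    joint /= joint.sum()
    pi = joint.sum(axis=1, keepdims=True)
    pj = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pi * pj)[mask])))

def top_pairs(x, words, k=10):
    """Rank word pairs by mutual information; x has one row per part, one column per word."""
    scores = [(pair_mutual_information(x[:, i], x[:, j]), words[i], words[j])
              for i, j in combinations(range(len(words)), 2)]
    return sorted(scores, reverse=True)[:k]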

We finish the pairwise analysis with a graphical representation of the words that are most strongly linked with pairwise connections (left panels of Fig. 5).

Figure 5: Central graph: Network of pairwise interactions in OS. Width of links proportional to the mutual information between the two connected words. Insets: Detail of selected subnetworks. Top graph: links proportional to mutual information. Bottom graph: links proportional to irreducible interaction.

Words belonging to a common topic are displayed in different grey levels (different colors, online), and tend to form clusters. In each cluster (insets in Fig. 5), triplets of words often form triangles of pairwise interactions. In the central plot, and in the top graph of each inset, the width of each link is proportional to the mutual information between the two connected words.

V.3 Statistics of triplets

In order to determine whether triple interactions provide a relevant contribution to the overall dependencies between words, we compare D_3 with the total amount of pairwise interactions within the triplet, D_2.

Figure 6: Fraction of the total interaction within a triplet that corresponds to tripletwise dependencies, D_3/(D_2 + D_3), as a function of the total interaction D_2 + D_3. The grey level of each data point is proportional to the (logarithm of the) number of triplets at that location (scale bars on the right). Values above 0.01 bits are significant (see Appendix). A: OS. B: AM. Dashed line: averages over all triplets with the same total interaction.

Figure 6 shows the fraction of the total interaction that corresponds to triple dependencies, D_3/(D_2 + D_3), as a function of the total interaction D_2 + D_3. The data extends further to the right, but the triplets beyond the plotted range amount to less than 0.4%. The first thing to notice is that the values of the total interaction (values on the horizontal axis) are approximately an order of magnitude smaller than the entropies of individual words (see Fig. 4). Individual entropies range between 0.1 and 0.9 bits, and interactions lie between 0 and 0.05 bits. In order to get an intuition of the meaning of such a difference, we notice that if we want to know whether words w_i, w_j and w_k appear in a given part, the number of binary questions that we need to ask is (depending on the three chosen words) between 0.3 and 2.7 if we assume the words are independent, and between 0.25 and 2.2 if we make use of their mutual dependencies. Although sparing such a small fraction of the questions may seem a meager gain, it can certainly make a difference when processing large amounts of data.

The second thing to notice is that triple interactions are by no means small as compared to the total interactions within the triplet, since there are triplets with D_3/(D_2 + D_3) of order unity. In other words, triple interactions are not negligible when compared to pairwise interactions. In the triplets with large D_3, the departure from the independent assumption resembles the XOR behavior (or the nXOR), in the sense that the states for which x_i ⊕ x_j ⊕ x_k = 1 have a lower (higher) probability than the states with x_i ⊕ x_j ⊕ x_k = 0. The first case corresponds to triplets where all pairs of words tend to appear together, but the three of them are rarely seen together. In the second case, the words tend to appear either the three together or each one on its own, but they are rarely seen in pairs.

Tag
america south north 0.065 0.005 0.16
inherit occasional appearance 0.040 0.96
action wide branch 0.036 0.93
europe perhaps chapter 0.036 0.90
climate expect just 0.035 0.97
speak causation appropriate 0.041 0.93
sense perception natural 0.033 0.90
since actual wholly 0.033 0.90
wish me connection 0.033 0.95
consist should life 0.033 0.92
Table 3: Words with the highest triple information D_3. The first column displays a tag that allows us to identify each triplet in Fig. 7. The last column indicates whether the triplet behaves as a XOR gate (+1) or as an nXOR gate (−1). Top: OS. Bottom: AM. Values in bits.

Table 3 shows the triplets of words with the largest triple information. These interactions are well above the significance threshold of 0.01 bits. The triplet (america, south, north) is similar to a XOR gate, so these words tend to appear in pairs but not all three together. In certain contexts the author uses the combination south america, in other contexts north america, and yet in others he discusses topics that require both south and north but no america.

Most of the triplets in Table 3 have triple information values that are equal in magnitude to the co-information but with opposite sign, that is, D_3 ≈ −I(X;Y;Z). Besides, for these triplets most of the interaction is tripletwise, that is, D_3/(D_2 + D_3) ≈ 1.

Figure 7: Triple information D_3 as a function of the co-information I(X;Y;Z) for all triplets. The grey level of each data point is proportional to the (logarithm of the) number of triplets at that location (scale bars on the right). D_3 values above 0.01 bits are significant (see Appendix). A: OS. B: AM.

To determine whether such a tendency is preserved throughout the population, in Fig. 7 we plot the triple information as a function of the co-information for all triplets. We see that the vast majority of triplets are located along the diagonal D_3 = −I(X;Y;Z). In order to understand why this is so, we analyze how data points are distributed when picking a triplet of words randomly. The cases A, B, C and D of Fig. 1 are ordered in decreasing probability. That is, picking three unrelated words (Fig. 1A) has higher probability than picking a triplet with only pairwise interactions (B), which is still more likely than picking a case with only triple interactions (C), leaving the case of double and triple interactions (D) as the least probable. All cases with no triple interaction (A and B) fall on the horizontal axis in Fig. 7. Therefore, in order to understand why points outside the horizontal axis cluster along the diagonal we must analyze the triplets that do have a triple interaction (panels C and D in Fig. 1). We begin with case C, because it has a higher probability than case D. This case corresponds to D_3 > 0 and I(X;Y) = I(X;Z) = I(Y;Z) = 0. It is easy to see that in these circumstances D_2 = 0, and hence, from Eq. (16), D_3 = −I(X;Y;Z). We continue with the left column of case D, since having a single pairwise interaction has higher probability than having more. This case corresponds to D_3 > 0, with only one of the three mutual informations different from zero, say I(X;Y) > 0 and I(X;Z) = I(Y;Z) = 0. In these circumstances D_2 = I(X;Y), which again implies that D_3 = −I(X;Y;Z). Therefore, all triplets containing some triple interaction and at most a single pairwise interaction fall along the diagonal in Fig. 7. The only outliers are triplets with D_3 > 0 and at least two links with pairwise interactions, which, as derived in Sect. III.2, most likely contain also the third pairwise link. Such highly connected triplets are typically few.

From Eq. (16) we see that the triplets that are near the diagonal are neither synergistic nor redundant, that is, I(X;Y) + I(X;Z) + I(Y;Z) − D_2 ≈ 0. Those located above the diagonal have redundant pairwise information (I(X;Y) + I(X;Z) + I(Y;Z) − D_2 > 0), whereas those below are synergistic. In the two analyzed books, very few triplets are significantly synergistic. Contrastingly, a much larger number of triplets have significant redundant pairwise information (above 0.01 bits). The triplets located far from the diagonal correspond, in both cases, to those with a large total dependency. Table 4 displays the triplets with the highest redundant pairwise interaction, that is, with the largest I(X;Y) + I(X;Z) + I(Y;Z) − D_2.

Tag
bee cell wax 0.089
america south north 0.070
glacial southern northern 0.065
mountain glacial northern 0.062
male female sexual 0.057
leave door window 0.061
stimulus response accuracy 0.039
mnemic phenomena causation 0.038
truth false falsehood 0.036
place 2 1 0.027
Table 4: Triplets with the highest redundant pairwise information I(x_i;x_j) + I(x_i;x_k) + I(x_j;x_k) − D_2. The first column displays a tag that allows us to identify each triplet in Fig. 7. Top: OS. Bottom: AM. Values in bits.

With the exception of data point (america, south, north), the triplets that have highest redundancy tend to be in the lower right part of Fig. 7, whereas the ones with highest triple interaction lie in the upper left corner.

V.4 Identification of irreducible binary interactions

Using the tools of Sect. III.3, here we identify the pairs of words that interact only because the two of them have strong binary interactions with a third word. In the first place, the pairs of words whose mutual information is larger than the significance level (0.01 bits) are selected. For those pairs, the irreducible interaction is calculated by considering all other candidate intermediary words, and selecting the one that minimizes Eq. (20). We observe that many pairs have a low irreducible interaction, implying that their dependency can be understood by a path that goes through a third variable, such as

x_i — x_k — x_j.    (25)

In these situations, the behavior of the pair (x_i, x_j) can be predicted from the dependency between x_i and x_k and the dependency between x_k and x_j.

In Table 5, we list the pairs of words that have the smallest irreducible interaction, including the third word that acts as a mediator.

bee wax 0.093 0.003 cell
south north 0.071 0.001 america
continent south 0.032 0.001 america
lay wax 0.032 0.000 cell
southern arctic 0.031 0.001 northern
phenomena causation 0.042 0.004 mnemic
stimulus accuracy 0.039 0.000 response
place 2 0.028 0.000 1
proposition falsehood 0.024 0.002 truth
proposition door 0.022 0.000 window
Table 5: Pairs of words with the lowest irreducible interaction: the two words, their mutual information, the remaining irreducible interaction, and the mediating word. The first column displays a tag that allows us to identify each triplet in Fig. 7. Top: OS. Bottom: AM. Values in bits.

In these triplets, most of the interaction between the two words is explained in terms of the mediator. Mediators tend to have a high semantic content, and to provide a context in which the other two words interact. Besides, the triplets in Table 5 tend to cluster in the lower right corner of Fig. 7, implying that pairs of words share redundant mutual information.

The number of pairs with significant mutual information (i.e., above 0.01 bits), and whose interaction is largely explained through a third word, is higher in the book OS than in the book AM. Among those pairs of OS, many are explained through the word cell, others through america, northern, glacial, sterility, and so on. The fact that specific words tend to mediate the interaction between many pairs suggests that they may act as hubs in the network.

In the right panels of Fig. 5, we see the network of irreducible interactions. When compared with the network of mutual informations (left panels), the irreducible network contains weaker bonds, as expected, since by definition the irreducible interaction cannot be larger than the corresponding mutual information. In the figure, we can identify some of the pairs of Table 5, whose interaction is mediated by a third word. Such pairs appear with a significantly weaker bond in the right panel, as for example bee-wax (mediator = cell, OS), and stimulus-accuracy (mediator = response, AM). Moreover, one can also identify the pairs whose interaction is intrinsic (that is, not mediated by a third word) as those where the link on the right has approximately the same width as on the left. Notable examples are male-female (OS), and depend-upon (AM).

VI Conclusions

In this paper, we developed the information-theoretical tools to study triple dependencies between variables, and applied them to the analysis of written texts. Previous studies had proposed two different measures to quantify the amount of triple dependencies: the co-information I(X;Y;Z) and the total amount of triple interactions D_3. Given that there is a certain controversy regarding which of these measures should be used, it is important to notice that I(X;Y;Z) is a function of three specific variables X, Y, Z, whereas D_3 is a global measure of all triple interactions within a wider set of n variables, with n ≥ 3. Therefore, it only makes sense to compare the two measures when D_3 is calculated for the same group of variables as I(X;Y;Z), which implies using n = 3.

The two measures have different meanings. Whereas the co-information quantifies the effect of one (any) variable in the information transmission between the other two, the amount of triple interactions measures the increase in entropy that results from approximating the true distribution by the maximum-entropy distribution that only contains up to pairwise interactions. When studied with all generality, these two quantities need not be related, that is, by fixing one of them, one cannot predict the value of the other. When restricting the analysis to binary variables, however, a link between them arises. Three binary variables are characterized by a probability distribution over 8 possible states. Due to the normalization restriction, the distribution is determined once the probabilities of 7 states are fixed. Choosing those 7 numbers is equivalent to choosing the three entropies H(X), H(Y), H(Z), the three mutual informations I(X;Y), I(X;Z), I(Y;Z), and one more parameter. This extra parameter can be either the co-information (in which case the triple interaction is fixed), or the triple interaction (in which case the co-information is fixed). Hence, although in general the co-information and the amount of triple interactions are not related to one another, for binary variables, once the single entropies and the pairwise interactions are determined, I(X;Y;Z) and D_3 become linked. In this particular situation, hence, there is no controversy between the two quantities, because they both provide the same information, only with different scales.

Moreover, we have shown that when pooling together all the triplets in the system, and now without fixing the value of individual entropies or pairwise interactions, D_3 and I(X;Y;Z) often add up to zero. This effect results from the fact that most triplets contain at most a single pairwise interaction. Hence, for most of the triplets the two measures provide roughly the same information. The exception involves the triplets containing at least two binary interactions, which are likely to contain all three interactions, in view of Sect. III.2.

One could repeat the whole analysis presented here, but with x_i equal to the number of times the word appeared in a given part (instead of the binary variable appeared / not appeared). This choice would transform the binary approach into an integer description, which could potentially be more accurate, if enough data are available. It should be borne in mind, however, that the size of the space grows with the cube of the number of states, so serious undersampling problems are likely to appear in most real applications. We chose here the binary description to ensure good statistics. In addition, this choice allowed us to (a) relate triple interactions with the XOR gate, and (b) relate the co-information with the amount of triple interactions.

In the present work we studied interactions between words in written language through a triplet analysis. This approach allowed us to accomplish two goals. First, we detected pure triple dependencies that would not be detectable by studying pairs of variables. Second, we determined whether pairwise interactions can be explained through a third word.

We found that, on average, 11% and 13% of the total interaction within a group of three words is purely tripletwise. On average, triple dependencies are weaker than pairwise interactions. However, in 7% and 9% of the total number of triplets, triple interactions are larger than pairwise ones. Although this is a small fraction of all the triplets, all the 400 selected words participate in at least one such triplet. Hence, if word interactions are to be used to improve the performance in a Cloze test, triple interactions are by no means negligible.

We believe that, in particular for written language, the presence of triple interactions is mainly due to the marginalization over latent topics. For example, the triplet (america, south, north) resembles a XOR gate, so the variables tend to appear two at a time, but not alone, nor the three together. Imagine we include an extra variable (this time, a non-binary variable), specifying the geographic location of the phenomena described in each part of the book. The new variable would take one value in those parts where Darwin describes events of North America, another value for South America, and yet other values for other parts of the globe. If these topic-like variables are included in the analysis, the amount of high-order interactions between words is likely to diminish, because complex word interactions would be mediated by pairwise interactions between words and topics. However, since topic-like variables are not easily amenable to automatic analysis, here we have restricted the study to word-like variables. We conclude that high-order interactions between words are likely to be the footprint of having ignored (marginalized over) topic-like variables.

Acknowledgements.
We thank Agencia Nacional de Promoción Científica y Tecnológica, Comisión Nacional de Energía Atómica and Universidad Nacional de Cuyo for supporting the present research.

Appendix A Mathematical proofs

A.0.1 Derivation of the bound in Eq. (10)

As imposing more restrictions cannot increase the entropy, H̃_{(XY)(YZ)(XZ)} ≤ H̃_{(XY)(YZ)}. Using the fact that H̃_{(XY)(YZ)} = H(X,Y) + H(Z|Y) (see Appendix B), it follows from Eq. (7) that

D_3 \le H(X,Y) + H(Z|Y) - H(X,Y,Z) = I(X;Z|Y).    (26)

This inequality is tight, since a probability distribution exists for which the equality is fulfilled: when H̃_{(XY)(YZ)(XZ)} = H̃_{(XY)(YZ)}, that is, when p̃_{(XY)(YZ)(XZ)} = p̃_{(XY)(YZ)}.

The derivation can be done removing any of the three pairwise restrictions. Therefore,

D_3 \le \min\{ I(X;Y|Z),\ I(X;Z|Y),\ I(Y;Z|X) \} = \min\{ I(X;Y),\ I(X;Z),\ I(Y;Z) \} - I(X;Y;Z),    (27)

where I(X;Y;Z) is the co-information. From Eq. (27), it also follows that

I(X;Y;Z) \le \min\{ I(X;Y),\ I(X;Z),\ I(Y;Z) \} - D_3 \le \min\{ I(X;Y),\ I(X;Z),\ I(Y;Z) \}.    (28)

A.0.2 Derivation of Eq. (18)

Inserting the upper bound of Eq. (26) in Eq. (16),

D_2 \ge I(X;Y) + I(Y;Z) + I(X;Z) - I(X;Y;Z) - I(X;Z|Y) = I(X;Y) + I(Y;Z).    (29)

Therefore,

\tilde{H}_{(XY)(YZ)(XZ)} \le H(X) + H(Y) + H(Z) - I(X;Y) - I(Y;Z).    (30)

In addition, since reducing the number of marginal restrictions cannot diminish the entropy of the maximum entropy distribution,

H(X,Y,Z) \le \tilde{H}_{(XY)(YZ)(XZ)}.    (31)

Combining Eqs. (30) and (31), and using the fact that H(X,Y,Z) ≥ H(X,Z),

H(X,Z) \le H(X) + H(Y) + H(Z) - I(X;Y) - I(Y;Z),

which is equivalent to Eq. (18). Therefore, if I(X;Y) and I(Y;Z) are large, I(X;Z) cannot be too small.

Appendix B Maximum entropy solution

The problem of finding the probability distribution that maximizes the entropy under linear constraints, such as fixing some of the marginals, has a unique solution Csiszár (1975). Although no explicit closed form is known for the case where each variable varies in an arbitrary domain, there are procedures, for example the iterative proportional fitting Csiszár (1975), that converge to the solution.

In some special cases a closed form exists. For example, when the univariate marginals are fixed, the solution is the product of such marginals. Another case is when we look for the maximum entropy distribution of three variables