 # The origins of Zipf's meaning-frequency law

In his pioneering research, G. K. Zipf observed that more frequent words tend to have more meanings, and showed that the number of meanings of a word grows as the square root of its frequency. He derived this relationship from two assumptions: that words follow Zipf's law for word frequencies (a power law dependency between frequency and rank) and Zipf's law of meaning distribution (a power law dependency between number of meanings and rank). Here we show that a single assumption on the joint probability of a word and a meaning suffices to infer Zipf's meaning-frequency law or relaxed versions. Interestingly, this assumption can be justified as the outcome of a biased random walk in the process of mental exploration.


## Introduction

G. K. Zipf (1949) investigated many statistical regularities of language. Some of them have been investigated intensively, such as Zipf’s law for word frequencies (Fedorowicz, 1982; Ferrer-i-Cancho & Gavaldà, 2009; Font-Clos et al., 2013; Ferrer-i-Cancho, 2016a) or Zipf’s law of abbreviation (Strauss et al., 2006; Ferrer-i-Cancho et al., 2013). Others, such as Zipf’s law of meaning distribution, have received less attention. In his pioneering research, Zipf (1945) found that more frequent words tend to have more meanings. The functional dependency between $\mu$, the number of meanings of a word, and $f$, the frequency of a word, has been approximated with (Ilgen & Karaoglan, 2007; Baayen & Moscoso del Prado Martín, 2005; Zipf, 1945)

$$\mu \propto f^{\delta}, \tag{1}$$

where $\delta$ is a constant such that $\delta \approx 0.5$. Eq. 1 defines Zipf’s meaning-frequency law. Equivalently, the meaning-frequency law can be defined as

$$f \propto \mu^{1/\delta}. \tag{2}$$

G. K. Zipf derived the meaning-frequency law assuming two laws, the popular Zipf’s law for word frequencies and the law of meaning distribution. On the one hand, Zipf’s law for word frequencies states the relationship between the frequency of a word and its rank $i$ (the most frequent word has rank $i = 1$, the 2nd most frequent word has rank $i = 2$ and so on) as

$$f \propto i^{-\alpha}, \tag{3}$$

where $\alpha \approx 1$ is a constant (Zipf, 1945, 1949). On the other hand, the law of meaning distribution (Zipf, 1945, 1949) states that

$$\mu \propto i^{-\gamma}, \tag{4}$$

where $\gamma \approx 0.5$. Notice that $i$ is still the rank of a word according to its frequency. The constants $\alpha$, $\gamma$ and $\delta$ can be estimated applying some regression method, as in Zipf’s pioneering research (Zipf, 1945, 1949).
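Eliminating the rank $i$ from Eqs. 3 and 4 gives $\mu \propto f^{\gamma/\alpha}$, so the exponent of Eq. 1 would be $\delta = \gamma/\alpha$. The following minimal sketch checks this numerically on synthetic, noise-free data. The exponents, the rank range and the helper `loglog_slope` are illustrative assumptions, not part of Zipf’s analysis.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

alpha, gamma = 1.0, 0.5                      # illustrative exponents for Eqs. 3 and 4
ranks = range(1, 1001)                       # hypothetical ranks i = 1..1000
freq = [i ** -alpha for i in ranks]          # Eq. 3 up to a constant factor
meanings = [i ** -gamma for i in ranks]      # Eq. 4 up to a constant factor

# Slope of log(mu) against log(f): an estimate of delta in Eq. 1.
delta_hat = loglog_slope(freq, meanings)
print(round(delta_hat, 6))
```

With real data the fit would be noisy and the choice of regression method matters; the sketch only illustrates the algebraic relation between the three exponents.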

Sometimes, power-laws such as those described in Eqs. 1-4 are defined using asymptotic notation. For instance, random typing yields $f = \Theta(i^{-\alpha})$, i.e. for sufficiently large $i$, one has that (Conrad & Mitzenmacher, 2004)

$$a_1 i^{-\alpha} \leq f \leq a_2 i^{-\alpha}, \tag{5}$$

where $a_1$ and $a_2$ are constants such that $0 < a_1 \leq a_2$. Eq. 5 can be seen as a relaxation of Eq. 3. Similarly, Heaps’ law on the relationship between $V$, the number of types, as a function of $T$, the number of tokens, is defined as $V = O(T^{\beta})$ with $0 < \beta < 1$ (Baeza-Yates & Navarro, 2000), a relaxed version of $V \propto T^{\beta}$ (see Font-Clos & Corral (2015) for a critical revision of the power law model for Heaps’ law).

The meaning-frequency law (Eq. 1) and the law of meaning distribution (Eq. 4) predict the number of meanings of a word using different variables as predictors. The meaning-frequency law has been confirmed empirically in various languages: directly through Eq. 1 in Dutch and English (Baayen & Moscoso del Prado Martín, 2005) or indirectly through Eq. 4 and the assumption of Zipf’s law (Zipf, 1945; Ilgen & Karaoglan, 2007) in Turkish and English. Qualitatively, the meaning-frequency law defines a positive correlation between frequency and the number of meanings. Using a proxy of word meaning, the qualitative version of the law has been found in dolphin whistles (Ferrer-i-Cancho & McCowan, 2009) and in chimpanzee gestures (Hobaiter & Byrne, 2014). Thus, the law is a candidate for a universal property of communication.

Zipf (1945) argued that Eq. 1 with $\delta = 1/2$ follows from Eq. 3 with $\alpha = 1$ and Eq. 4 with $\gamma = 1/2$. Indeed, it has been proven that Eq. 1 with $\delta = \gamma/\alpha$ follows from Eqs. 3 and 4 (Ferrer-i-Cancho, 2016b). Here we consider alternative derivations of Eq. 1 or relaxed versions of Eq. 1 from the assumption of a biased random walk (Sinatra et al., 2011; Gómez-Gardeñes & Latora, 2008) over word-meaning associations. The remainder of the article is organized as follows.

First, we will present the mathematical framework.

Second, we will present a minimalist derivation of the meaning-frequency law (Eq. 1) with $\delta = 1/2$ that is based on just one assumption on the joint probability of a word and a meaning. Suppose a word $s_i$ that is connected to $\mu_i$ meanings and a meaning $r_j$ that is connected to $\omega_j$ words. Assuming that the joint probability of $s_i$ and $r_j$ is proportional to $\mu_i$ if $s_i$ and $r_j$ are connected and zero otherwise suffices to obtain Eq. 1 with $\delta = 1/2$. A problem of the argument is that the definition is somewhat arbitrary and theoretically superficial.

Third, we will replace this simplistic assumption by a more fundamental assumption, namely that the joint probability of $s_i$ and $r_j$ is proportional to $(\mu_i \omega_j)^{\phi}$ if $s_i$ and $r_j$ are connected and zero otherwise. This assumption is a more elegant solution for several reasons: it corrects the arbitrariness of the assumption of the minimalist derivation, it fits into standard network theory and it can be embedded into a general theory of communication. From this deeper assumption we derive the meaning-frequency law following three major paths. The 1st path consists of assuming the principle of contrast (Clark, 1987) or the principle of no synonymy (Goldberg, 1995, p. 67), namely $\omega_j \leq 1$ for all meanings, which leads exactly to Eq. 1. The 2nd path consists of assuming that meaning degrees are mean independent of word degrees, which leads to a mirror of the meaning-frequency law (Eq. 2)

$$E[p|\mu] \propto \mu^{1/\delta}, \tag{6}$$

where $E[p|\mu]$ is the expectation of $p$, the probability of a word, knowing that its degree is $\mu$. Notice that $p$ is linked with $f$ as $p = f/T$, where $T$ is the length of a text in tokens, i.e. the total number of tokens (the sum of all type frequencies) (Moreno-Sànchez et al., 2016). The 3rd path makes no assumption to obtain a relaxed version of the meaning-frequency law, namely the number of meanings is bounded above and below by two power-laws over $f$, i.e.

$$b_1 f^{\delta} \leq \mu \leq b_2 f^{\delta},$$

where $b_1$ and $b_2$ are constants such that $0 < b_1 \leq b_2$. The result can be summarized as

$$\mu = \Theta(f^{\delta}). \tag{7}$$

Put together, these three paths strongly suggest that languages are channeled to reproduce Zipf’s meaning-frequency law.

Fourth, we will review a family of optimization models of communication that was put forward to investigate the origins of Zipf’s law for word frequencies (Ferrer-i-Cancho & Díaz-Guilera, 2007) but that has recently been used to shed light on patterns of vocabulary learning and the mapping of words into meanings (Ferrer-i-Cancho, 2016c). Interestingly, models from that family give Eq. 1 with $\delta = 1$ (Ferrer-i-Cancho, 2016b). Crucially, however, the true exponent is $\delta \approx 0.5$ (Zipf, 1945; Ilgen & Karaoglan, 2007). The mismatch should not be surprising. Imagine that a speaker has to choose a word for a certain meaning. Those models assume that, given a meaning, all the words with that meaning are equally likely (Ferrer-i-Cancho & Díaz-Guilera, 2007). However, this simple assumption is not supported by psycholinguistic research (Snodgrass & Vanderwart, 1980). We will show how to modify their definition so that the words that are used for a certain meaning do not need to be equally likely and one can obtain Eq. 1 with $\delta = 1/2$ or relaxed versions. Finally, we will discuss the results, highlighting the connection with biased random walks, and indicate directions for future research.

## A mathematical framework

As in the family of models of communication above, we assume a repertoire of $n$ words, $s_1, \ldots, s_i, \ldots, s_n$, and a repertoire of $m$ meanings, $r_1, \ldots, r_j, \ldots, r_m$. Words and meanings are associated through an $n \times m$ adjacency matrix $A = \{a_{ij}\}$: $a_{ij} = 1$ if $s_i$ and $r_j$ are associated ($a_{ij} = 0$ otherwise). $A$ defines the edges of an undirected bipartite network of word-meaning associations. The degree of the $i$-th word is

$$\mu_i = \sum_{j=1}^{m} a_{ij} \tag{8}$$

while the degree of the $j$-th meaning is

$$\omega_j = \sum_{i=1}^{n} a_{ij}. \tag{9}$$
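As a concrete illustration of Eqs. 8 and 9, the sketch below stores a small adjacency matrix as a list of rows and computes both degree sequences. The matrix and the function names are hypothetical and chosen only for the example.

```python
# Hypothetical toy lexicon: rows are words, columns are meanings.
A = [
    [1, 1, 0, 0],   # word s_1 is linked to meanings r_1 and r_2
    [0, 1, 1, 0],   # word s_2 is linked to meanings r_2 and r_3
    [0, 0, 0, 1],   # word s_3 is linked to meaning r_4
]

def word_degrees(a):
    """mu_i = sum_j a_ij (Eq. 8): number of meanings of each word."""
    return [sum(row) for row in a]

def meaning_degrees(a):
    """omega_j = sum_i a_ij (Eq. 9): number of words of each meaning."""
    return [sum(col) for col in zip(*a)]

mu = word_degrees(A)
omega = meaning_degrees(A)
print(mu)     # [2, 2, 1]
print(omega)  # [1, 2, 1, 1]
```

Both sequences sum to the number of edges of the bipartite network, which gives a quick sanity check on any matrix.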

In human language, the relationship between sound and meaning has been argued to be arbitrary to a large extent (Saussure, 1916; Hockett, 1966; Pinker, 1999). That is, there is no intrinsic relationship between the word form and its meaning. For example, the word "car" is nothing like an actual automobile. An obvious exception are onomatopoeias, which are relatively rare in language. However, despite the immense flexibility of the world’s languages, some sound-meaning associations are preferred by culturally, historically, and geographically diverse human groups (Blasi et al., 2016). The framework above is agnostic concerning the type of association between sound and meaning. By doing that, we are borrowing the abstract perspective of network theory, which is a priori neutral concerning the nature or the origins of the edges (Newman, 2010; Barthélemy, 2011). Our framework could be generalized to accommodate Peirce’s classic types of reference, i.e., iconic, indexical and symbolic (Deacon, 1997), or the state of the art on the iconicity/systematicity distinction (Dingemanse et al., 2015). A crucial reason to remain neutral is that the distinctions above were not made when defining the laws of meaning that are the target of this article.

The framework allows one to model lexical ambiguity: a lexically ambiguous word is a word such that its degree is greater than one. Although the model starts from a flat hierarchy of concepts (by default all concepts have the same degree of generality), a word with an abstract meaning could be approximated either as a word linked to a single abstract concept or as a word linked to the multiple specific meanings it covers (Ferrer-i-Cancho, 2016c). As for the latter approach, the word for vehicle would be linked to the meanings for car, bike, ship, airplane,…

Suppose that $p(s_i, r_j)$ is the joint probability of the unordered pair formed by $s_i$ and $r_j$. By definition,

$$\sum_{i=1}^{n} \sum_{j=1}^{m} p(s_i, r_j) = 1. \tag{10}$$

The probability of $s_i$ is

$$p(s_i) = \sum_{j=1}^{m} p(s_i, r_j) \tag{11}$$

and the probability of $r_j$ is

$$p(r_j) = \sum_{i=1}^{n} p(s_i, r_j). \tag{12}$$

Our model shares the assumption of distributional semantics that the meaning of a word is represented as a vector of the weights of different concepts for that word (Lund & Burgess, 1996). In our framework, the meaning of the word $s_i$ is represented by the $m$-dimensional vector

$$\{p(s_i, r_1), \ldots, p(s_i, r_j), \ldots, p(s_i, r_m)\}.$$

The joint probabilities for all words and meanings define a weighted matrix of the same size as $A$. In the coming sections, we will derive the meaning-frequency law defining $p(s_i, r_j)$ as a function of $A$. Put differently, we will derive the law from a weighted undirected bipartite graph that is built from the unweighted undirected graph defined by $A$. This organization in two graphs (one unweighted and the other weighted) instead of a single weighted graph is borrowed from successful models of communication (Ferrer-i-Cancho, 2016c) and allows one to apply the theory of random walks (Sinatra et al., 2011; Gómez-Gardeñes & Latora, 2008), as we will see later on.

## A minimalist derivation of the law

The meaning-frequency law can be derived by making just one rather simple assumption, i.e.

$$p(s_i, r_j) \propto a_{ij} \mu_i. \tag{13}$$

Applying Eq. 10, one obtains

$$p(s_i, r_j) = c\, a_{ij} \mu_i, \tag{14}$$

where $c$ is a normalization constant defined as

$$c = \frac{1}{\sum_{i=1}^{n} \sum_{j=1}^{m} a_{ij} \mu_i} = \frac{1}{\sum_{i=1}^{n} \mu_i \sum_{j=1}^{m} a_{ij}} = \frac{1}{\sum_{i=1}^{n} \mu_i^2}.$$

Notice that $c$ is not a parameter: its value is determined by the definition of probability in Eq. 10. Applying Eq. 11 to Eq. 14 gives

$$p(s_i) = c \mu_i^2,$$

namely Eq. 1 with $\delta = 1/2$. Our derivation of the strong and relaxed version of the meaning-frequency law is simpler than Zipf’s in the sense that it requires assuming a smaller number of equations (we are assuming only Eq. 13 while Zipf assumed Eqs. 3 and 4). However, the challenge of our approach is the justification of Eq. 13.
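The derivation can be checked on a small example. The sketch below builds the joint probabilities of Eq. 14 for a hypothetical adjacency matrix and confirms that the marginal word probabilities (Eq. 11) are proportional to the squared word degrees. The matrix is an assumption made only for illustration.

```python
# Hypothetical adjacency matrix (rows: words, columns: meanings).
A = [
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
]

mu = [sum(row) for row in A]                    # word degrees, Eq. 8
c = 1.0 / sum(m * m for m in mu)                # normalization constant of Eq. 14
joint = [[c * a_ij * mu[i] for a_ij in row]     # p(s_i, r_j) = c a_ij mu_i
         for i, row in enumerate(A)]
p_word = [sum(row) for row in joint]            # p(s_i), Eq. 11

# If p(s_i) = c mu_i^2, the ratio p(s_i) / mu_i^2 is the same for every word.
ratios = [p / (m * m) for p, m in zip(p_word, mu)]
print(mu, [round(p, 4) for p in p_word])
```

Here the word degrees are 3, 1 and 2, so the word probabilities come out as 9/14, 1/14 and 4/14.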

## A theoretical derivation of the law

The definition of $p(s_i, r_j)$ in Eq. 13 suffices as a model but not for the construction of a real theory of language. Eq. 13 is simple but somewhat arbitrary: the degree of the word, $\mu_i$, contributes raised to 1 but the degree of the meaning, $\omega_j$, has no direct contribution, or one may say that it contributes raised to 0. Therefore, a less arbitrary equation would be

$$p(s_i, r_j) \propto a_{ij} (\mu_i \omega_j)^{\phi}, \tag{15}$$

where $\phi$ is a positive parameter. Applying Eq. 10 to Eq. 15, one obtains

$$p(s_i, r_j) = c\, a_{ij} (\mu_i \omega_j)^{\phi}, \tag{16}$$

with $c = 1/M$ and

$$M = \sum_{i=1}^{n} \sum_{j=1}^{m} a_{ij} (\mu_i \omega_j)^{\phi} = \sum_{i=1}^{n} \mu_i^{\phi} \sum_{j=1}^{m} a_{ij} \omega_j^{\phi}. \tag{17}$$

Notice that is the only parameter of the model given and . Applying Eq. 11 to Eq. 16, one obtains

$$p(s_i) = c \mu_i^{\phi} \sum_{j=1}^{m} a_{ij} \omega_j^{\phi}. \tag{18}$$

Eq. 15 is theoretically appealing for various reasons. If $p(s_i, r_j)$ is regarded as the weight of the association between $s_i$ and $r_j$, it defines the general form of the relationship between the weight of an edge and the product of the degrees of vertices at both ends that is found in real networks (Barrat et al., 2004). For this reason, a unipartite version of Eq. 15 is assumed to study dynamics on networks (Baronchelli et al., 2011). When $\phi = 0$, it matches the definition of models about the origins of Zipf’s law for word frequencies (Ferrer-i-Cancho, 2005b), the variation of the exponent of the law (Ferrer-i-Cancho, 2005a, 2006) and vocabulary learning (Ferrer-i-Cancho, 2016c). When $\phi = 1$, it defines an approximation to the stationary probability of observing a transition involving $s_i$ and $r_j$ in a random walk on a network that is biased to maximize the entropy rate of the walks (Appendix A), thus suggesting that the meaning-frequency law could be a manifestation of a particular random walk process on semantic memory.

Two equivalent linguistic principles, the principle of contrast (Clark, 1987) and the principle of no synonymy (Goldberg, 1995, p. 67), can be implemented in our model as $\omega_j \leq 1$. From an algebraic standpoint, the condition $\omega_j \leq 1$ is equivalent to orthogonality of the word vectors of the matrix $A$. If $\vec{a}_i$ indicates the row vector of $A$ for the $i$-th word, $\vec{a}_i$ and $\vec{a}_k$ (with $i \neq k$) are orthogonal if and only if $\vec{a}_i \cdot \vec{a}_k = 0$, where the dot indicates the scalar product of two vectors. To simplify matters, we assume that there is no row vector of $A$ that equals $\vec{0}$, a vector that has 0 in all components. So far we have used $\mu_i$ and $\omega_j$ to refer, respectively, to the degree of the $i$-th word and the $j$-th meaning. We define $\mu^e_i$ and $\omega^e_i$ as the degree of the word and the degree of the meaning of the $i$-th edge. $\mu^e_i$ and $\omega^e_i$ are components of the vectors $\vec{\mu}^e$ and $\vec{\omega}^e$, respectively. We have $\vec{\mu}^e \cdot \vec{\omega}^e > 0$ because $\mu^e_i, \omega^e_i \geq 1$ by definition. A deeper insight can be obtained with the concept of remaining degree, the degree at one end of the edge after subtracting the unit contribution of the edge (Newman, 2002). The vectors of remaining degrees are then

$$\vec{\mu}'^{e} = \vec{\mu}^{e} - \vec{1}, \qquad \vec{\omega}'^{e} = \vec{\omega}^{e} - \vec{1}.$$

The condition $\omega_j \leq 1$ is equivalent to $\vec{\omega}'^{e} = \vec{0}$. It leads to $\vec{\mu}'^{e} \cdot \vec{\omega}'^{e} = 0$, but trivially, for $\vec{\omega}'^{e}$ being null.

The assumptions $\phi = 1$ and $\omega_j \leq 1$ (orthogonality of row vectors of $A$) transform Eq. 18 into Eq. 13 because $a_{ij} \omega_j^{\phi}$ and $a_{ij}$ are equivalent when $\omega_j$ does not exceed 1. In general, Eq. 18 combined with the principle of contrast gives

$$p(s_i) = c \mu_i^{\phi} \sum_{j=1}^{m} a_{ij} = c \mu_i^{\phi+1}$$

and finally Eq. 1 with

$$\delta = \frac{1}{\phi + 1}.$$

When $\phi = 1$, we get $\delta = 1/2$ again. Interestingly, the principle of contrast follows from the principle of mutual information maximization, a more fundamental principle that allows one to predict vocabulary learning in children and that can be combined with the principle of entropy minimization to predict Zipf’s law for word frequencies (Ferrer-i-Cancho, 2016c). With Eq. 15, we follow Bunge (Bunge, 2013, pp. 32-33), preventing scientific knowledge from becoming “an aggregation of disconnected information” and aspiring to build a “system of ideas that are logically connected among themselves”.
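The effect of the principle of contrast can be illustrated numerically. The following sketch computes $p(s_i)$ from Eq. 18 for a hypothetical matrix in which no meaning is shared by two words, and checks that $p(s_i)/\mu_i^{\phi+1}$ is the same for all words. The matrix, the values of $\phi$ and the function name are assumptions for illustration only.

```python
def word_probabilities(a, phi):
    """Word degrees and p(s_i) from Eq. 18, with c = 1/M as in Eq. 17."""
    mu = [sum(row) for row in a]
    omega = [sum(col) for col in zip(*a)]
    raw = [
        mu[i] ** phi * sum(a_ij * omega[j] ** phi for j, a_ij in enumerate(row))
        for i, row in enumerate(a)
    ]
    M = sum(raw)
    return mu, [x / M for x in raw]

# Hypothetical matrix obeying the principle of contrast (every omega_j <= 1).
A = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 1],
]

for phi in (1.0, 2.0):
    mu, p = word_probabilities(A, phi)
    ratios = [p_i / m ** (phi + 1) for p_i, m in zip(p, mu)]
    delta = 1.0 / (phi + 1.0)
    print(phi, delta, [round(r, 6) for r in ratios])
```

A constant ratio means $p(s_i) \propto \mu_i^{\phi+1}$, that is, Eq. 1 with $\delta = 1/(\phi+1)$ for this matrix.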

It is possible to obtain a relaxed meaning frequency-law under more general conditions. In particular, we would like to get rid of the heavy constraint that meaning degrees cannot exceed one. Suppose that is a constant such that . Some obvious but not very general conditions are for all or for all . It is easy to see that they lead again to Eq. 13 when . A more general condition can be defined as follows. First, we define as the conditional expectation of given for an edge. Here and are the degrees at both ends of an edge. Then suppose that is given and that the is mean independent of , namely for each value of Kolmogorov (1956); Poirier (1995), then the expectation of (as defined in Eq. 18) knowing is

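$$\begin{aligned} E[p(s_i)|\mu_i] &= E\left[ c \mu_i^{\phi} \sum_{j=1}^{m} a_{ij} \omega_j^{\phi} \,\middle|\, \mu_i \right] \\ &= c \mu_i^{\phi} E\left[ \sum_{j=1}^{m} a_{ij} \omega_j^{\phi} \,\middle|\, \mu_i \right] \\ &= c \mu_i^{\phi} \sum_{j=1}^{m} a_{ij} E\left[ \omega_j^{\phi} \,\middle|\, \mu_i \right] \\ &= c \mu_i^{\phi} \sum_{j=1}^{m} a_{ij} E\left[ \omega_j^{\phi} \right] \\ &= c E\left[ \omega_j^{\phi} \right] \mu_i^{\phi+1}, \end{aligned}$$

which can be seen as a regression model (Ritz & Streibig, 2008) for the meaning-frequency law (Eq. 1) with word degree as predictor. Notice that mean independence is a more general condition than mutual or statistical independence but a particular case of uncorrelation (Ferrer-i-Cancho et al., 2014).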

So far, we have seen ways of obtaining the meaning-frequency law from Eq. 15 making further assumptions. It is possible to obtain a relaxed version of the meaning-frequency law making no additional assumption (except Eq. 15 or the biased random walk that justifies it). Eq. 18 can be expressed as

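$$p(s_i) = \mu_i^{\phi} \sum_{j=1}^{m} a_{ij} T_j \tag{19}$$

with

$$T_j = c\, \omega_j^{\phi}.$$

Assuming that

$$T_{min} \leq T_j \leq T_{max},$$

Eq. 19 and the definition of $\mu_i$ in Eq. 8 give

$$T_{min} \mu_i^{\phi+1} \leq p(s_i) \leq T_{max} \mu_i^{\phi+1} \tag{20}$$

or equivalently

$$\left( \frac{p(s_i)}{T_{max}} \right)^{\frac{1}{\phi+1}} \leq \mu_i \leq \left( \frac{p(s_i)}{T_{min}} \right)^{\frac{1}{\phi+1}}.$$

Recalling $\delta = 1/(\phi+1)$, these results can be summarized using asymptotic notation as $p(s_i) = \Theta(\mu_i^{1/\delta})$ or $\mu_i = \Theta(p(s_i)^{\delta})$. The power of the bounds above depends on the gap between $T_{min}$ and $T_{max}$. The gap can be measured with the ratio

$$\frac{T_{max}}{T_{min}} = \frac{\omega_{max}^{\phi}}{\omega_{min}^{\phi}},$$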

where $\omega_{min}$ and $\omega_{max}$ are the minimum and the maximum meaning degree, respectively. The principle of mutual information maximization between words and meanings, a general principle of communication (Ferrer-i-Cancho, 2016c), puts pressure for concordance with the meaning-frequency law. To see it, we consider two cases: $n \leq m$ and $n \geq m$. When $n \leq m$, its maximization predicts $\omega_j \leq 1$ (Appendix B). As unlinked meanings are irrelevant (they do not belong to the support set), we have $\omega_{min} \geq 1$. As pressure for mutual information maximization increases, $\omega_{max}$ tends to 1 and thus $T_{max}/T_{min}$ tends to 1. Put differently, the gap between the upper and the lower bound in Eq. 20 reduces as pressure for mutual information maximization increases. When $n \geq m$, mutual information maximization predicts that $\omega_j = d$, where $d$ is an integer such that $1 \leq d \leq \lfloor n/m \rfloor$ (Appendix B). We have seen above that one obtains the meaning-frequency law (Eq. 1) immediately from Eq. 15 when $\omega_j$ is constant. We conclude that the chance of observing the meaning-frequency law increases as pressure for mutual information maximization increases.
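The bounds of Eq. 20 can be explored on a small example in which meanings are shared by several words. The sketch below is a minimal illustration under stated assumptions: the matrix is hypothetical, $\phi = 1$ is chosen arbitrarily, and $T_{min}$ and $T_{max}$ are computed over linked meanings only.

```python
def word_probabilities(a, phi):
    """Return (mu, omega, c, p) with p(s_i) from Eq. 18 and c = 1/M (Eq. 17)."""
    mu = [sum(row) for row in a]
    omega = [sum(col) for col in zip(*a)]
    raw = [
        mu[i] ** phi * sum(omega[j] ** phi for j, a_ij in enumerate(row) if a_ij)
        for i, row in enumerate(a)
    ]
    c = 1.0 / sum(raw)
    return mu, omega, c, [c * x for x in raw]

# Hypothetical matrix where some meanings are shared (omega_j > 1).
A = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
phi = 1.0
mu, omega, c, p = word_probabilities(A, phi)

linked = [w for w in omega if w > 0]          # unlinked meanings are ignored
t_min = c * min(linked) ** phi                # T_min = c * omega_min^phi
t_max = c * max(linked) ** phi                # T_max = c * omega_max^phi
delta = 1.0 / (phi + 1.0)

for m_i, p_i in zip(mu, p):
    lower = (p_i / t_max) ** delta            # lower bound on mu_i
    upper = (p_i / t_min) ** delta            # upper bound on mu_i
    print(m_i, round(lower, 3), round(upper, 3))
```

The closer $\omega_{max}/\omega_{min}$ is to 1, the tighter the two bounds become around $\mu_i$.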

## A family of optimization models of communication

Here we revisit a family of optimization models of communication (Ferrer-i-Cancho & Díaz-Guilera, 2007) in light of the results of the previous sections. These models share the assumption that the probability that a word $s_i$ is employed to refer to meaning $r_j$ is proportional to $a_{ij}$, i.e.

$$p(s_i|r_j) \propto a_{ij}. \tag{21}$$

Applying

$$\sum_{i=1}^{n} p(s_i|r_j) = 1$$

to Eq. 21, we obtain

$$p(s_i|r_j) = \frac{a_{ij}}{\omega_j}. \tag{22}$$

We adopt the convention $p(s_i|r_j) = 0$ when $\omega_j = 0$.

Eq. 22 defines the probability of transition of a standard (unbiased) random walk to a word (Noh & Rieger, 2004), i.e. given a meaning, all related words are equally likely. This is unrealistic in light of picture naming norms (Snodgrass & Vanderwart, 1980; Duñabeitia et al., 2017). Consider the picture-naming norms compiled by Snodgrass & Vanderwart (1980), who simply asked participants to name 260 black-and-white line drawings of common objects. For some objects (e.g., balloon, banana, sock, star) there was 100% agreement among the participants for the word used to name the pictured object. However, for other objects there was considerable variability in the word used to name the pictured object. Importantly for the present argument, the other words that were used in such cases were not selected with equal likelihood. For example, the picture of a wineglass had 50% agreement, with the word glass (36% of the responses) and the word goblet (14% of the responses) also being used to name the object, showing that the words that could be used for a given meaning are not equally likely. Although subjects tend to provide more specific responses when the concept is presented in textual form than when it is presented in visual form (Tversky & Hemenway, 1983), we used the visual case simply to challenge the assumption of an unbiased random walk in general and justify a more realistic approach.

In contrast to Eq. 22, the fundamental assumption in Eq. 15 leads to

$$p(s_i|r_j) = \frac{a_{ij} \mu_i^{\phi}}{\sum_{k} a_{kj} \mu_k^{\phi}}, \tag{23}$$

namely the transition probabilities of a biased random walk when $\phi > 0$ (Sinatra et al., 2011; Gómez-Gardeñes & Latora, 2008). To see it, notice that the combination of Eqs. 12 and 16 produces

$$p(r_j) = \sum_{i=1}^{n} p(s_i, r_j) = c\, \omega_j^{\phi} \sum_{i=1}^{n} a_{ij} \mu_i^{\phi}. \tag{24}$$

Recalling the definition of conditional probability

$$p(s_i|r_j) = \frac{p(s_i, r_j)}{p(r_j)}$$

and applying Eq. 16 again, one obtains Eq. 23.

Recalling the definition of $\omega_j$ in Eq. 9, it is easy to realize that Eq. 22 is a particular case of Eq. 23 with $\phi = 0$. While the family of models above stems from a concrete definition of a conditional probability, i.e. $p(s_i|r_j)$ in Eq. 22, the general model that we have presented in this article is specified by a definition of the joint probability, i.e. $p(s_i, r_j)$ in Eq. 15.
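The contrast between Eq. 22 and Eq. 23 can be seen on a two-word example. The sketch below computes $p(s_i|r_j)$ from Eq. 23 for $\phi = 0$ and $\phi = 1$. The matrix and the function name are hypothetical.

```python
def p_word_given_meaning(a, phi):
    """p(s_i | r_j) from Eq. 23; columns with omega_j = 0 are left as zeros."""
    mu = [sum(row) for row in a]
    n, m = len(a), len(a[0])
    cond = [[0.0] * m for _ in range(n)]
    for j in range(m):
        z = sum(a[k][j] * mu[k] ** phi for k in range(n))   # denominator of Eq. 23
        if z > 0:
            for i in range(n):
                cond[i][j] = a[i][j] * mu[i] ** phi / z
    return cond

# Hypothetical lexicon: s_1 has three meanings, s_2 has one, and both share r_1.
A = [
    [1, 1, 1],
    [1, 0, 0],
]
unbiased = p_word_given_meaning(A, 0)   # reduces to a_ij / omega_j (Eq. 22)
biased = p_word_given_meaning(A, 1)     # favours the word with the higher degree

print([row[0] for row in unbiased])  # [0.5, 0.5]
print([row[0] for row in biased])    # [0.75, 0.25]
```

With $\phi = 0$ the two words that share $r_1$ are equally likely, whereas with $\phi = 1$ the word with more meanings is preferred, which is the kind of unevenness reported in picture naming norms.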

Models within that family are generated through

$$p(s_i, r_j) = p(s_i|r_j)\, p(r_j), \tag{25}$$

assuming an unbiased random walk from a meaning to a word (Eq. 22) and making different assumptions on $p(r_j)$.

If one assumes that all meanings are equally likely ($p(r_j) = 1/m$), one obtains the 1st model (Ferrer-i-Cancho & Solé, 2003). If one assumes that the probability of a meaning is proportional to its degree ($p(r_j) \propto \omega_j$), one obtains the 2nd model (Ferrer-i-Cancho, 2005b). While in the 2nd model $p(r_j|s_i)$ defines an unbiased random walk from $s_i$ to $r_j$ (all $r_j$’s connected to $s_i$ are equally likely), this is not necessarily the case for the 1st model (Ferrer-i-Cancho & Díaz-Guilera, 2007). Therefore, the 2nd model defines a pure unbiased random walk while the 1st model is unbiased from meanings to words but biased from words to meanings.

Now we will introduce a generalized version of the family of models above consisting of replacing Eq. 22 by Eq. 23 and generating the corresponding variants of the 1st and the 2nd model applying the same procedure as in the original family, namely via Eq. 25. Notice that Eq. 23 defines the probability of reaching $s_i$ from $r_j$ in a biased random walk when $\phi > 0$.

Concerning the 1st model, suppose that the probabilities of the meanings are given a priori (they are independent of the matrix $A$), e.g., all meanings are equally likely. Then it is easy to show that the model yields a relaxed version of the meaning-frequency law, namely the number of meanings is bounded above and below by two power-laws, i.e. (Appendix C)

$$b_1 p(s_i)^{\delta} \leq \mu_i \leq b_2 p(s_i)^{\delta}, \tag{26}$$

where $b_1$ and $b_2$ are constants ($0 < b_1 \leq b_2$) and $\delta = 1/(\phi+1)$. Eq. 26 defines non-trivial bounds when $\phi > 0$ (Appendix C). The case $\phi = 0$ matches that of an optimization model of Zipf’s law for word frequencies (Ferrer-i-Cancho, 2005b, 2016b).

To generate a variant of the 2nd model, recall that Eq. 23 comes from Eq. 15. Eqs. 12 and 16 produce Eq. 24. This variant of the 2nd model derives all probability definitions from Eq. 15. We have shown above that this variant is able to generate the meaning-frequency law.

## Discussion

We have seen that it is possible to obtain the meaning-frequency law (Eqs. 1 and 2) from Eq. 15 making certain assumptions. We have also seen that a relaxed version of the law (Eq. 7) can be obtained from Eq. 15 without making any further assumption. Our findings suggest that word probabilities are channeled somehow to manifest the meaning-frequency law. We have seen that the principle of mutual information maximization contributes to the emergence of the law. Our derivation is theoretically appealing for various reasons. First, it is more parsimonious than G. K. Zipf’s concerning the number of equations that are assumed (we only need Eq. 15 while Zipf involved Eqs. 3 and 4). Second, it can help a family of optimization models of language to reproduce the meaning-frequency law.

Therefore, a crucial assumption is Eq. 16, which we have justified as the outcome of a random walk that is biased to maximize the entropy rate of the paths (Appendix A). A random walk is the correlate in network theory of the concept of mental exploration (navigation without a target or nonstrategic search) in cognitive science and related fields (Baronchelli et al., 2013). Semantic memory processes can be usefully theorized as searches over a network (Thompson & Kello, 2014; Abbott et al., 2015) or some semantic space (Smith et al., 2013). These approaches support the hypothesis of a Markov chain process for memory search (Bourgin et al., 2014), provide a deeper understanding of creativity (Kenett & Austerweil, 2016a) and help to develop efficient navigation strategies (Capitán et al., 2012).

A random walk in a unipartite network of word-word associations has been argued to underlie Zipf’s law for word frequencies (Allegrini et al., 2004). Here we contribute with a new hypothesis linking random walks with a linguistic law: that the meaning-frequency law would be an epiphenomenon of a biased random walk over a bipartite network of word-meaning associations in the process of mental exploration. The bias consists of exploiting local information, namely the degrees of first neighbours (Sinatra et al., 2011). Transitions to nodes with higher degree are preferred. Our model shows that it is possible to approximate the optimal solution to a problem (maximizing the entropy rate of the paths) following an apparently nonstrategic search (Hills et al., 2012; Abbott et al., 2015).

The probability of a word in Eq. 18 defines the probability that a random walker visits the word in the long run. This probability is what the PageRank algorithm estimates in the context of a standard (non-biased) random walk (Page et al., 1998). The assumption of a random walk with the particular bias above could help to improve random walk/PageRank methods to predict the prominence in memory of a word (Griffiths et al., 2007) or the importance of a tag (Jäschke et al., 2007). A virtue of our biased random walk is that it predicts an uneven conditional probability of a word given a meaning (Eq. 23), as happens in real language (Snodgrass & Vanderwart, 1980). A standard (uniform) random walk cannot explain this fact and for that reason the family of optimization models of language revisited above fails to reproduce the meaning-frequency law with $\delta = 1/2$.

Although biased random walks have already been used to solve information retrieval problems (see Duarte2014a and references therein), a bias based on the degree of neighbours has not been considered as far as we know. We hope that our results stimulate further research on linguistic laws and biased random walks in the information sciences. Specifically, we hope that our article becomes the fuel of future empirical research.

## Acknowledgements

We are especially grateful to R. Pastor-Satorras and Massimo Stella for helpful comments and insights. We also thank S. Semple, M. Gentili and E. Bastrakova for helpful discussions. This research was supported by the grant APCOM (TIN2014-57226-P) from MINECO (Ministerio de Economía y Competitividad) and the grant 2014SGR 890 (MACDA) from AGAUR (Generalitat de Catalunya).

## Appendix A Random walks

We will show that Eq. 15 defines the probability of observing a transition between $s_i$ and $r_j$ in any direction in a biased random walk. We will proceed in two steps. First, we will summarize some general results on biased random walks on unipartite networks and then we will adapt them to bipartite networks.

Suppose a unipartite network of $n$ nodes that is defined by an adjacency matrix $B = \{b_{ij}\}$ such that $b_{ij} = 1$ if the $i$-th and the $j$-th node are connected and $b_{ij} = 0$ otherwise. Let $k_i$ be the degree of the $i$-th node, namely,

$$k_i = \sum_{j=1}^{n} b_{ij}.$$

Suppose a random walk over the vertices of a network where $p(j|i)$ is the probability of jumping from $i$ to $j$. A first order approximation to the $p(j|i)$ that maximizes the entropy rate is (Sinatra et al., 2011)

$$p(j|i) = \frac{b_{ij} k_j}{\sum_{l=1}^{n} b_{il} k_l}. \tag{27}$$

We choose a generalization (Gómez-Gardeñes & Latora, 2008)

$$p(j|i) = \frac{b_{ij} k_j^{\phi}}{\sum_{l=1}^{n} b_{il} k_l^{\phi}}, \tag{28}$$

that gives Eq. 27 with $\phi = 1$. The stationary probability of visiting the $i$-th vertex in the biased random walk defined by Eq. 28 is (Gómez-Gardeñes & Latora, 2008)

$$p(i) = \frac{k_i^{\phi} c_i}{T}, \tag{29}$$

where

$$c_i = \sum_{j=1}^{n} b_{ij} k_j^{\phi} \tag{30}$$

and

$$T = \sum_{i=1}^{n} c_i k_i^{\phi}. \tag{31}$$
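That Eq. 29 is stationary for the walk of Eq. 28 can be verified directly on a small graph: applying one step of the chain to the candidate distribution should return the same distribution. The sketch below does this for a hypothetical connected graph on four nodes. The graph and the function name are assumptions for illustration, and the graph is assumed to be undirected with no isolated nodes.

```python
def biased_walk(b, phi):
    """Transition matrix (Eq. 28) and candidate stationary distribution (Eqs. 29-31)."""
    n = len(b)
    k = [sum(row) for row in b]                                            # degrees
    c = [sum(b[i][j] * k[j] ** phi for j in range(n)) for i in range(n)]   # Eq. 30
    P = [[b[i][j] * k[j] ** phi / c[i] for j in range(n)] for i in range(n)]
    T = sum(c[i] * k[i] ** phi for i in range(n))                          # Eq. 31
    pi = [k[i] ** phi * c[i] / T for i in range(n)]                        # Eq. 29
    return P, pi

# Hypothetical undirected graph on 4 nodes (symmetric matrix, no isolated node).
B = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
P, pi = biased_walk(B, 1.0)

# One step of the chain: (pi P)_j = sum_i pi_i P_ij should reproduce pi_j.
pi_next = [sum(pi[i] * P[i][j] for i in range(len(B))) for j in range(len(B))]
print([round(x, 4) for x in pi])
print([round(x, 4) for x in pi_next])
```

The check relies on the symmetry of the adjacency matrix: $\pi_i\, p(j|i) = b_{ij} k_i^{\phi} k_j^{\phi} / T$ is symmetric in $i$ and $j$, so the chain satisfies detailed balance.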

Now we adapt the results above to a bipartite graph of word-meaning associations. As the graph is bipartite, the random walker will be alternating between words and meanings. The probability that the vertex visited is a word is $1/2$ (the same probability for a meaning). Suppose that there are $n$ words and $m$ meanings. Recall that the bipartite network of word-meaning associations is defined by an adjacency matrix $A = \{a_{ij}\}$ such that $a_{ij} = 1$ if the $i$-th word and the $j$-th meaning are connected and $a_{ij} = 0$ otherwise. $\mu_i$ is the degree of the $i$-th word (Eq. 8) whereas $\omega_j$ is the degree of the $j$-th meaning (Eq. 9). The probability of jumping from $r_j$ to $s_i$ becomes (recall Eq. 28)

$$p_v(s_i|r_j) = \frac{a_{ij} \mu_i^{\phi}}{\sum_{l=1}^{n} a_{lj} \mu_l^{\phi}}.$$

The probability of jumping from $s_i$ to $r_j$ is

$$p_v(r_j|s_i) = \frac{a_{ij} \omega_j^{\phi}}{\sum_{l=1}^{m} a_{il} \omega_l^{\phi}}. \tag{32}$$

The stationary probability of visiting the word $s_i$ becomes (recall Eqs. 29 and 30)

$$p_v(s_i) = \frac{\mu_i^{\phi} \sum_{j=1}^{m} a_{ij} \omega_j^{\phi}}{M_v}, \tag{33}$$

where $M_v$ corresponds to $T$ in Eq. 29. Adapting Eqs. 31 and 30, one obtains

$$M_v = \sum_{i=1}^{n} \mu_i^{\phi} \sum_{j=1}^{m} a_{ij} \omega_j^{\phi} + \sum_{j=1}^{m} \omega_j^{\phi} \sum_{i=1}^{n} a_{ij} \mu_i^{\phi} = 2M,$$

where $M$ is defined as in Eq. 17. Applying Eq. 33, it is easy to see that

$$\sum_{i=1}^{n} p_v(s_i) = \frac{1}{2M} \sum_{i=1}^{n} \mu_i^{\phi} \sum_{j=1}^{m} a_{ij} \omega_j^{\phi} = \frac{1}{2},$$

as expected.

The combination of Eqs. 32 and 33 allows one to derive the probability of observing the transition from $s_i$ to $r_j$ as

$$p_v(s_i \rightarrow r_j) = p_v(r_j|s_i)\, p_v(s_i) = c_v a_{ij} (\mu_i \omega_j)^{\phi},$$

where $c_v = 1/M_v$. Similarly, the probability of observing the transition from $r_j$ to $s_i$ is

$$p_v(s_i \leftarrow r_j) = p_v(s_i|r_j)\, p_v(r_j) = c_v a_{ij} (\mu_i \omega_j)^{\phi}.$$

Therefore the stationary probability of observing a transition between $s_i$ and $r_j$ in any direction (from $s_i$ to $r_j$ or from $r_j$ to $s_i$) is

$$p(s_i, r_j) = p_v(s_i \rightarrow r_j) + p_v(s_i \leftarrow r_j) = 2 c_v a_{ij} (\mu_i \omega_j)^{\phi} = c\, a_{ij} (\mu_i \omega_j)^{\phi},$$

with $c = 2c_v = 1/M$, as we wanted to show.

Finally, we will link $p(s_i)$, the probability of a word that is used in the main text to derive the meaning-frequency law, with $p_v(s_i)$. Notice that $p(s_i) = p_v(s_i|S)$, the latter being the probability of visiting vertex $s_i$ knowing that it belongs to the partition $S$, the partition of words. Since the graph is bipartite, $p_v(S)$, the probability that the random walk is visiting a vertex of partition $S$, is $1/2$. The joint probability of visiting vertex $s_i$ and that the vertex belongs to $S$ is

$$p_v(s_i, S) = p_v(S|s_i)\, p_v(s_i) = p_v(s_i).$$

Therefore,

$$p(s_i) = p_v(s_i|S) = \frac{p_v(s_i, S)}{p_v(S)} = 2 p_v(s_i).$$

Then $p(s_i)$ is the stationary probability of visiting $s_i$ in a biased random walk knowing that the vertex is in $S$.
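The link between the bipartite walk and Eq. 18 can be illustrated by embedding a word-meaning matrix into a unipartite graph of $n + m$ nodes and evaluating Eq. 29 on it. The sketch below is a minimal check under stated assumptions: the matrix is hypothetical, $\phi = 1$, and words occupy the first $n$ node indices of the embedded graph.

```python
def bipartite_as_unipartite(a):
    """Adjacency matrix on n + m nodes: words first, then meanings."""
    n, m = len(a), len(a[0])
    b = [[0] * (n + m) for _ in range(n + m)]
    for i in range(n):
        for j in range(m):
            b[i][n + j] = b[n + j][i] = a[i][j]
    return b

def stationary(b, phi):
    """Stationary distribution of Eq. 29 for the walk of Eq. 28."""
    k = [sum(row) for row in b]
    c = [sum(b_ij * k[j] ** phi for j, b_ij in enumerate(row)) for row in b]
    T = sum(c_i * k_i ** phi for c_i, k_i in zip(c, k))
    return [k_i ** phi * c_i / T for c_i, k_i in zip(c, k)]

def word_probabilities(a, phi):
    """p(s_i) from Eq. 18 with c = 1/M (Eq. 17)."""
    mu = [sum(row) for row in a]
    omega = [sum(col) for col in zip(*a)]
    raw = [mu[i] ** phi * sum(a_ij * omega[j] ** phi for j, a_ij in enumerate(row))
           for i, row in enumerate(a)]
    M = sum(raw)
    return [x / M for x in raw]

A = [               # hypothetical lexicon with n = 2 words and m = 3 meanings
    [1, 1, 0],
    [0, 1, 1],
]
phi = 1.0
n = len(A)
p_v = stationary(bipartite_as_unipartite(A), phi)
p = word_probabilities(A, phi)

print(sum(p_v[:n]))                  # probability mass on the word partition
print([2 * x for x in p_v[:n]], p)   # 2 p_v(s_i) compared with p(s_i)
```

For this matrix both words have the same degree and the same neighbourhood structure, so each receives $p_v(s_i) = 1/4$ and $p(s_i) = 1/2$.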

## Appendix B Mutual information maximization

Suppose that $I(S,R)$ is the mutual information between words ($S$) and meanings ($R$), that can be defined as

$$I(S,R) = H(S) - H(S|R), \tag{34}$$

where $H(S)$ is the entropy of words and $H(S|R)$ is the conditional entropy of words given meanings. For the case $n \leq m$, the configurations that maximize $I(S,R)$ when $\phi = 0$ are defined by two conditions (Ferrer-i-Cancho, 2016c)

1. $\mu_i = d$ with $1 \leq d \leq \lfloor m/n \rfloor$ for $i = 1, \ldots, n$.

2. $\omega_j \leq 1$ for $j = 1, \ldots, m$.

When $n \geq m$, those configurations are the symmetric ones, i.e. (Ferrer-i-Cancho, 2016c)

1. $\omega_j = d$ with $1 \leq d \leq \lfloor n/m \rfloor$ for $j = 1, \ldots, m$.

2. $\mu_i \leq 1$ for $i = 1, \ldots, n$.

Here we will show that the configurations that maximize $I(S,R)$ are the same as in the case $\phi = 0$ when $\phi$ is a positive and finite real number ($0 < \phi < \infty$). By symmetry, it suffices to show it for the case $n \leq m$. We will proceed in three steps. First, deriving the configurations minimizing $H(S|R)$. Second, showing that the configurations above yield maximum $I(S,R)$. Third, showing that they are the only configurations.

Step 1: Recall that

 H(S|R) = E[H(S|rj)] = m∑j=1p(rj)H(S|rj)

where is the conditional entropy of words given the meaning . Eq. 24 implies that is equivalent to and then

 H(S|R) = m∑j=1p(rj)≠0p(rj)H(S|rj) = m∑j=1wj>0p(rj)H(S|rj).

Knowing that

 H(S|rj)=−n∑i=1p(si|rj)logp(si|rj)

it is easy to see that when for : by continuity since as (Cover  Thomas, 2006, p. 14) and obviously . Eq. 23 implies that is equivalent to being the only neighbour of , i.e. . Therefore, implies for .

Step 2: notice that the 2nd condition of the case above implies (recall Step 1). The 2nd condition transforms Eq. 17 into

 M=n∑i=1μϕim∑j=1aij=n∑i=1μϕ+1i

and Eq. 18 into

 p(si)=cμϕim∑j=1aij=cμϕ+1i.

Adding the 1st condition, one obtains

 M=n∑i=1dϕ+1=ndϕ+1 p(si)=cμϕ+1i=1Mdϕ+1=1n.

and then (as all words are equally likely). Thus, is taking its maximum possible value whereas is taking its minimum value. As , it follows that is maximum.

Step 3: notice that

• If the 2nd condition fails, then $H(S|R) > 0$ and thus $I(S,R) < \log n$ even if $H(S)$ is maximum because of Eq. 34. Thus, the 2nd condition is required to maximize $I(S,R)$.

• If the 1st condition fails (while the 2nd condition holds), then words are not equally likely as the probability of a word is proportional to a power of its degree ($p(s_i) = c\mu_i^{\phi+1}$, as obtained in Step 2). Then one has that $H(S) < \log n$ and it follows that $I(S,R)$ is not maximum because $I(S,R) \le H(S)$.
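Steps 2 and 3 can be checked with a few lines of code. The sketch assumes that the 2nd condition holds, so that the probability of a word is its degree raised to $\phi + 1$ divided by the normalization $M$. The function name `word_entropy`, the value of `phi` and the degree sequences are illustrative choices, not taken from the text.

```python
import math

def word_entropy(degrees, phi):
    """H(S) in nats when p(s_i) = mu_i**(phi + 1) / M, i.e. when the
    2nd condition holds and word probabilities depend only on degrees."""
    weights = [mu ** (phi + 1) for mu in degrees]
    M = sum(weights)
    return -sum((w / M) * math.log(w / M) for w in weights if w > 0)

phi = 1.5
n = 4
equal = word_entropy([2, 2, 2, 2], phi)     # 1st condition holds (d = 2)
unequal = word_entropy([1, 2, 2, 3], phi)   # 1st condition fails

print(math.isclose(equal, math.log(n)))     # True
print(unequal < math.log(n))                # True
```

With equal degrees every word has probability $1/n$ and the entropy reaches $\log n$. With unequal degrees the distribution is no longer uniform and the entropy falls below that maximum.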

## Appendix C New models

Combining Eqs. 11 and 23, one obtains

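$$p(s_i) = \sum_{j=1}^{m} p(s_i, r_j) = \sum_{j=1}^{m} p(s_i|r_j)\,p(r_j) = \mu_i^{\phi} \sum_{j=1}^{m} a_{ij} T_j \tag{35}$$

with

$$T_j = \frac{p(r_j)}{\sum_{i=1}^{n} a_{ij}\mu_i^{\phi}}.$$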

Suppose that

$$T_{\min} \le T_j \le T_{\max}$$

when . Eq. 35 leads to

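$$\mu_i^{\phi} \sum_{j=1}^{m} a_{ij} T_{\min} \le p(s_i) \le \mu_i^{\phi} \sum_{j=1}^{m} a_{ij} T_{\max}$$

and finally

$$T_{\min}\,\mu_i^{\phi+1} \le p(s_i) \le T_{\max}\,\mu_i^{\phi+1} \tag{36}$$

recalling the definition of $\mu_i$ in Eq. 8. Equivalently,

$$T_{\max}^{-\delta}\,p(s_i)^{\delta} \le \mu_i \le T_{\min}^{-\delta}\,p(s_i)^{\delta}, \tag{37}$$

with

$$\delta = \frac{1}{\phi+1},$$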

namely a relaxed version of the meaning-frequency law when .
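The bounds in Eqs. 36 and 37 can be verified numerically on a toy example. The sketch below assumes a hypothetical adjacency matrix `A`, arbitrary meaning probabilities `p_r` and $\phi = 1$, none of which come from the text. It computes $T_j$ and $p(s_i)$ as in Eq. 35 and then checks both inequalities, taking $T_{\min}$ and $T_{\max}$ over all meanings.

```python
A = [            # hypothetical a_ij: rows are words, columns are meanings
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
p_r = [0.1, 0.4, 0.3, 0.2]   # arbitrary meaning probabilities p(r_j), summing to 1
phi = 1.0
delta = 1.0 / (phi + 1.0)

n, m = len(A), len(A[0])
mu = [sum(row) for row in A]   # word degrees mu_i

# T_j = p(r_j) / sum_i a_ij * mu_i**phi
T = [p_r[j] / sum(A[i][j] * mu[i] ** phi for i in range(n)) for j in range(m)]
# Eq. 35: p(s_i) = mu_i**phi * sum_j a_ij * T_j
p_s = [mu[i] ** phi * sum(A[i][j] * T[j] for j in range(m)) for i in range(n)]
T_min, T_max = min(T), max(T)

eps = 1e-12  # tolerance for floating-point rounding
eq36 = all(
    T_min * mu[i] ** (phi + 1) - eps <= p_s[i] <= T_max * mu[i] ** (phi + 1) + eps
    for i in range(n)
)
eq37 = all(
    T_max ** -delta * p_s[i] ** delta - eps <= mu[i] <= T_min ** -delta * p_s[i] ** delta + eps
    for i in range(n)
)
print(eq36, eq37)
```

In this example the word probabilities add up to one and every word degree lies between the two power-law envelopes, even though the $T_j$ differ across meanings.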

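Notice that Eqs. 36 and 37 define non-trivial bounds in the sense that they are not expected from bounds on the joint probability alone. If the range of variation of $p(s_i, r_j)$ satisfies

$$\pi_{\min} \le p(s_i, r_j) \le \pi_{\max}$$

when $a_{ij} = 1$, then Eq. 11 gives

$$\sum_{j=1}^{m} a_{ij}\pi_{\min} \le p(s_i) \le \sum_{j=1}^{m} a_{ij}\pi_{\max}$$

and then

$$\pi_{\min}\,\mu_i \le p(s_i) \le \pi_{\max}\,\mu_i.$$

Therefore, the finding that

$$b_1\,p(s_i)^{\delta} \le \mu_i \le b_2\,p(s_i)^{\delta},$$

where $b_1$ and $b_2$ are constants, is trivial when $\delta = 1$.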
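The triviality claim follows from a one-line rearrangement of the bounds on $p(s_i)$ in terms of $\mu_i$, spelled out here for convenience. The identification of the constants below is our own reading of the argument, not a formula stated in the text, and it assumes $\pi_{\min} > 0$.

```latex
\pi_{\min}\,\mu_i \le p(s_i) \le \pi_{\max}\,\mu_i
\quad\Longleftrightarrow\quad
\frac{1}{\pi_{\max}}\,p(s_i) \le \mu_i \le \frac{1}{\pi_{\min}}\,p(s_i)
```

That is, bounds on the joint probability alone already give the two-sided inequality with $b_1 = 1/\pi_{\max}$, $b_2 = 1/\pi_{\min}$ and exponent $1$.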

## References

• Abbott et al. (2015) Abbott, JT., Austerweil, JL. & Griffiths, T. 2015. Random walks on semantic networks can resemble optimal foraging. Psychological Review 122, 558–569.
• Allegrini et al. (2004) Allegrini, P., Grigolini, P. & Palatella, L. 2004. Intermittency and scale-free networks: a dynamical model for human language complexity. Chaos, solitons and fractals 20, 95-105.
• Baayen & Moscoso del Prado Martín (2005) Baayen, H. & Moscoso del Prado Martín, F. 2005. Semantic density and past-tense formation in three Germanic languages. Language 81, 666-698.
• Baeza-Yates & Navarro (2000) Baeza-Yates, R. & Navarro, G. 2000. Block addressing indices for approximate text retrieval. Journal of the American Society for Information Science 51(1), 69-82.
• Baronchelli et al. (2011) Baronchelli, A., Castellano, C. & Pastor-Satorras, R. 2011. Voter models on weighted networks. Physical Review E 83, 066117.
• Baronchelli et al. (2013) Baronchelli, A., Ferrer-i-Cancho, R., Pastor-Satorras, R., Chater, N. & Christiansen, M. 2013. Networks in cognitive science. Trends in Cognitive Sciences 17, 348-360.
• Barrat et al. (2004) Barrat, A., Barthélemy, M., Pastor-Satorras, R. & Vespignani, A. 2004. The architecture of complex weighted networks. Proc. Nat. Acad. Sci. USA 101(11), 3747-3752.
• Barthélemy (2011) Barthélemy, M. 2011. Spatial networks. Physics Reports 499(1), 1-101.
• Blasi et al. (2016) Blasi, DE., Wichmann, S., Hammarström, H., Stadler, P. & Christiansen, M. 2016. Sound-meaning association biases evidenced across thousands of languages. Proceedings of the National Academy of Sciences 113(39), 10818-10823.
• Bourgin et al. (2014) Bourgin, DD., Abbott, J., Griffiths, T., Smith, KA. & Vul, E. 2014. Empirical evidence for Markov Chain Monte Carlo in memory search. In Proceedings of the 36th Annual Meeting of the Cognitive Science Society (pp. 224-229).
• Bunge (2013) Bunge, M. 2013. La ciencia. Su método y su filosofía. Pamplona: Laetoli.
• Capitán et al. (2012) Capitán, JA., Borge-Holthoefer, J., Gómez, S., Martínez-Romo, J., Araujo, L., Cuesta, JA. & Arenas, A. 2012. Local-based semantic navigation on a networked representation of information. PLOS ONE 7(8), 1-10.
• Clark (1987) Clark, E. 1987. The principle of contrast: a constraint on language acquisition. In B. MacWhinney (Ed.), Mechanisms of language acquisition. Hillsdale, NJ: Lawrence Erlbaum Associates.
• Conrad & Mitzenmacher (2004) Conrad, B. & Mitzenmacher, M. 2004. Power laws for monkeys typing randomly: the case of unequal probabilities. IEEE Transactions on Information Theory 50(7), 1403-1414.
• Cover & Thomas (2006) Cover, TM. & Thomas, JA. 2006. Elements of information theory (2nd edition). New York: Wiley.
• Deacon (1997) Deacon, TW. 1997. The Symbolic Species: the Co-evolution of Language and the Brain. New York: W. W. Norton & Company.
• Dingemanse et al. (2015) Dingemanse, M., Blasi, DE., Lupyan, G., Christiansen, MH. & Monaghan, P. 2015. Arbitrariness, iconicity, and systematicity in language. Trends in Cognitive Sciences 19(10), 603-615.
• Duarte Torres et al. (2014) Duarte Torres, S., Hiemstra, D., Weber, I. & Pavel, S. 2014. Query recommendation in the information domain of children. Journal of the Association for Information Science and Technology 65(7), 1368–1384.
• Duñabeitia et al. (2017) Duñabeitia, JA., Meyer, DCAS., Boris, B., Pliatsikas, C., Smolka, E. & Brysbaert, M. 2017. MultiPic: A standardized set of 750 drawings with norms for six European languages. The Quarterly Journal of Experimental Psychology, in press.
• Fedorowicz (1982) Fedorowicz, J. 1982. The theoretical foundation of Zipf’s law and its application to the Bibliographic Database Environment. J. Am. Soc. Inf. Sci. 33, 285-293.
• Ferrer-i-Cancho (20051) Ferrer-i-Cancho, R. 2005. The variation of Zipf’s law in human language. European Physical Journal B 44, 249-257.
• Ferrer-i-Cancho (20052) Ferrer-i-Cancho, R. 2005. Zipf’s law from a communicative phase transition. European Physical Journal B 47, 449-457.
• Ferrer-i-Cancho (2006) Ferrer-i-Cancho, R. 2006. When language breaks into pieces. A conflict between communication through isolated signals and language. Biosystems 84, 242-253.
• Ferrer-i-Cancho (20161) Ferrer-i-Cancho, R. 2016. Compression and the origins of Zipf’s law for word frequencies. Complexity 21, 409-411.
• Ferrer-i-Cancho (20162) Ferrer-i-Cancho, R. 2016. The meaning-frequency law in Zipfian optimization models of communication. Glottometrics 35, 28-37.
• Ferrer-i-Cancho (20163) Ferrer-i-Cancho, R. 2016. The optimality of attaching unlinked labels to unlinked meanings. Glottometrics 36, 1-16.
• Ferrer-i-Cancho & Díaz-Guilera (2007) Ferrer-i-Cancho, R. & Díaz-Guilera, A. 2007. The global minima of the communicative energy of natural communication systems. Journal of Statistical Mechanics, P06009.
• Ferrer-i-Cancho & Gavaldà (2009) Ferrer-i-Cancho, R. & Gavaldà, R. 2009. The frequency spectrum of finite samples from the intermittent silence process. Journal of the American Society for Information Science and Technology 60(4), 837-843.
• Ferrer-i-Cancho et al. (2014) Ferrer-i-Cancho, R., Hernández-Fernández, A., Baixeries, J., Dębowski, Ł. & Mačutek, J. 2014. When is Menzerath-Altmann law mathematically trivial? A new approach. Statistical Applications in Genetics and Molecular Biology 13, 633-644.
• Ferrer-i-Cancho et al. (2013) Ferrer-i-Cancho, R., Hernández-Fernández, A., Lusseau, D., Agoramoorthy, G., Hsu, MJ. & Semple, S. 2013. Compression as a universal principle of animal behavior. Cognitive Science 37(8), 1565-1578.
• Ferrer-i-Cancho & McCowan (2009) Ferrer-i-Cancho, R. & McCowan, B. 2009. A law of word meaning in dolphin whistle types. Entropy 11(4), 688-701. doi:10.3390/e11040688
• Ferrer-i-Cancho & Solé (2003) Ferrer-i-Cancho, R. & Solé, RV. 2003. Least effort and the origins of scaling in human language. Proceedings of the National Academy of Sciences USA 100, 788-791.
• Font-Clos et al. (2013) Font-Clos, F., Boleda, G. & Corral, A. 2013. A scaling law beyond Zipf’s law and its relation to Heaps’ law. New Journal of Physics 15, 093033.
• Font-Clos & Corral (2015) Font-Clos, F. & Corral, A. 2015. Log-log convexity of type-token growth in Zipf’s systems. Phys. Rev. Lett. 114, 238701. doi:10.1103/PhysRevLett.114.238701
• Goldberg (1995) Goldberg, A. 1995. Constructions: a construction grammar approach to argument structure. Chicago: Chicago University Press.
• Gómez-Gardeñes & Latora (2008) Gómez-Gardeñes, J. & Latora, V. 2008. Entropy rate of diffusion processes on complex networks. Physical Review E 78, 065102(R).
• Griffiths et al. (2007) Griffiths, T., Steyvers, M. & Firl, A. 2007. Google and the mind. Predicting fluency with PageRank. Psychological Science 18, 1069-1076.
• Hills et al. (2012) Hills, T., Jones, M. & Todd, P. 2012. Optimal foraging in semantic memory. Psychological Review 119, 431–440.
• Hobaiter & Byrne (2014) Hobaiter, C. & Byrne, RW. 2014. The meanings of chimpanzee gestures. Current Biology 24, 1596-1600.
• Hockett (1966) Hockett, CF. 1966. The problem of universals in language. In Universals of language (pp. 1-29). Cambridge, MA: The MIT Press.
• Ilgen & Karaoglan (2007) Ilgen, B. & Karaoglan, B. 2007. Investigation of Zipf’s “law-of-meaning” on Turkish corpora. In 22nd International Symposium on Computer and Information Sciences (ISCIS 2007) (pp. 1-6).
• Jäschke et al. (2007) Jäschke, R., Marinho, L., Hotho, A., Schmidt-Thieme, L. & Stumme, G. 2007. Tag recommendations in folksonomies. In JN. Kok, J. Koronacki, RL. de Mantaras, S. Matwin, D. Mladenič & A. Skowron (Eds.), Knowledge Discovery in Databases: PKDD 2007: 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, Warsaw, Poland, September 17-21, 2007. Proceedings (pp. 506–514). Berlin, Heidelberg: Springer Berlin Heidelberg.
• Kenett & Austerweil (2016a) Kenett, Y. & Austerweil, J. 2016a. Examining search processes in low and high creative individuals with random walks. In Proceedings of the 38th Annual Meeting of the Cognitive Science Society (pp. 313-318).
• Kolmogorov (1956) Kolmogorov, AN. 1956. Foundations of the theory of probability (2nd ed.). New York: Chelsea Publishing Company.
• Lund & Burgess (1996) Lund, K. & Burgess, C. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, and Computers 28(2), 203–208.
• Moreno-Sànchez et al. (2016) Moreno-Sànchez, I., Font-Clos, F. & Corral, A. 2016. Large-scale analysis of Zipf’s law in English texts. PLOS ONE 11(1), 1-19.
• Newman (2002) Newman, MEJ. 2002. Assortative mixing in networks. Phys. Rev. Lett. 89, 208701.
• Newman (2010) Newman, MEJ. 2010. Networks. An introduction. Oxford: Oxford University Press.
• Noh & Rieger (2004) Noh, JD. & Rieger, H. 2004. Random walks on complex networks. Physical Review Letters 92, 118701.
• Page et al. (1998) Page, L., Brin, S., Motwani, R. & Winograd, T. 1998. The PageRank citation ranking: bringing order to the web. Stanford, CA: Stanford Digital Library Technologies Project.
• Pinker (1999) Pinker, S. 1999. Words and rules: The ingredients of language. New York: Perseus Books.
• Poirier (1995) Poirier, DJ. 1995. Intermediate Statistics and Econometrics: A Comparative Approach. Cambridge: MIT Press.
• Ritz & Streibig (2008) Ritz, C. & Streibig, JC. 2008. Nonlinear regression with R. New York: Springer.
• Saussure (1916) Saussure, F. 1916. Cours de linguistique générale (C. Bally, A. Sechehaye & A. Riedlinger, Eds.). Lausanne and Paris: Payot.
• Sinatra et al. (2011) Sinatra, R., Gómez-Gardeñes, J., Lambiotte, R., Nicosia, V. & Latora, V. 2011. Maximal-entropy random walks in complex networks with limited information. Physical Review E 83, 030103(R).
• Smith et al. (2013) Smith, KA., Huber, DE. & Vul, E. 2013. Multiply-constrained semantic search in the Remote Associates Test. Cognition 128(1), 64-75.
• Snodgrass & Vanderwart (1980) Snodgrass, JG. & Vanderwart, M. 1980. A standardized set of 260 pictures: norms for name agreement, image agreement, familiarity, and visual complexity. Journal of Experimental Psychology: Human Learning and Memory 6, 174-215.
• Strauss et al. (2006) Strauss, U., Grzybek, P. & Altmann, G. 2006. Word length and word frequency. In P. Grzybek (Ed.), Contributions to the Science of Text and Language: Text, Speech and Language Technology (Vol. 31, pp. 277-294). Berlin: Springer.
• Thompson & Kello (2014) Thompson, G. & Kello, C. 2014. Walking across Wikipedia: a scale-free network model of semantic memory retrieval. Frontiers in Psychology 5, 86.
• Tversky & Hemenway (1983) Tversky, B. & Hemenway, K. 1983. Categories of environmental scenes. Cognitive Psychology 15, 121-149.
• Zipf (1945) Zipf, GK. 1945. The Meaning-Frequency Relationship of Words. Journal of General Psychology 33, 251-266.
• Zipf (1949) Zipf, GK. 1949. Human behaviour and the principle of least effort. Cambridge (MA), USA: Addison-Wesley.