Results
We represent word-meaning and meaning-meaning relations uncovered by translation dictionaries between each language in the unbiased sample and major modern European languages by constructing a network structure. Two meanings (each represented by a set of English words) are linked if one is translated into the other and then back, and the link is weighted by the number of translation paths, that is, the number of words that represent both meanings (see Methods for detail). Figure 1 illustrates the construction in the case of two languages, Lakhota (primarily spoken in North and South Dakota) and Coast Tsimshian (mostly spoken in northwestern British Columbia and southeastern Alaska). Translation of SUN into Lakhota yields wí and ápawí. While the latter picks up no other meaning, wí is polysemous and carries the additional meanings MOON and month, which are therefore linked to SUN. Such polysemy is also observed in Coast Tsimshian, where gyemk, translated from SUN, covers additional meanings including heat, which is therefore also linked to SUN.
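The translate-and-back-translate construction can be sketched in a few lines. This is a minimal illustration rather than the pipeline used in the study: the function name `polysemy_links` and the two dictionary arguments are hypothetical, and the toy data contain only the Lakhota forms quoted in the text.

```python
from collections import defaultdict


def polysemy_links(forward, backward):
    """Link each seed meaning to the other meanings its translations carry.

    forward:  seed meaning -> list of object-language words (translation)
    backward: object-language word -> list of meanings (back-translation)
    Returns {(seed, other_meaning): weight}; the weight counts the words
    that express both meanings.
    """
    weights = defaultdict(int)
    for seed, words in forward.items():
        for word in words:
            for meaning in backward.get(word, []):
                if meaning != seed:
                    weights[(seed, meaning)] += 1
    return dict(weights)


# Toy data transcribed from the Lakhota example in the text (hypothetical layout).
lakhota_forward = {"SUN": ["wí", "ápawí"]}
lakhota_backward = {"wí": ["SUN", "MOON", "month"], "ápawí": ["SUN"]}

links = polysemy_links(lakhota_forward, lakhota_backward)
print(links)
```

In this sketch a link runs from the Swadesh seed meaning to each additional meaning, and a second word sharing the same pair of meanings would raise that link's weight to 2.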
Each language has its own way of partitioning meanings by words, captured in a semantic network of the language. It is conceivable, however, that a group of languages bears a structural resemblance, perhaps because the speakers share historical or environmental features. A link between SUN and MOON, for example, recurs in both languages, but does not appear in many other languages. SUN is instead linked to divinity and time in Japanese, and to thirst and DAY/DAYTIME in !Xóõ. The question then is the degree to which the observed polysemy patterns are general or sensitive to the environment inhabited by the speech community, the phylogenetic history of the languages, and intrinsic linguistic factors such as literary tradition. We address this question by grouping the individual networks in a number of ways according to properties of their corresponding languages. We first analyze the network of all languages combined, and then those of subgroups.
In Fig. 2, we present the network aggregated over all languages, which exhibits the broad topological structure of polysemies observed in our data. It reveals three almost-disconnected clusters, that is, groups of concepts that are more prone to polysemy among themselves, each of which admits a natural semantic interpretation. The semantically most uniform cluster, colored in blue, includes concepts related to water. A second, smaller cluster, colored in yellow, associates solid natural substances (centered around STONE/ROCK) with their topographic manifestation (MOUNTAIN). The third cluster, in red, is more loosely connected, bridging a terrestrial cluster and a celestial cluster, including less tangible substances such as WIND, SKY, and FIRE, and salient time intervals such as DAY and YEAR. In keeping with many traditional oppositions between EARTH and SKY/heaven, or darkness and light, the celestial and terrestrial components form two subclusters connected most strongly through CLOUD, which shares properties of both. The result reveals a coherent set of relationships among concepts that possibly reflects human cognitive conceptualization of these semantic domains Croft2003 ; Croft2010 ; Vygotsky:thought_lang:02 .
We test whether these relationships are universal rather than particular to properties of linguistic groups, such as the physical environment that human societies inhabit. We first categorized languages by nonlinguistic variables such as geography, topography, climate, and the existence or nonexistence of a literary tradition (Table 2 in Appendix) and constructed a network for each group. A spectral algorithm then clusters Swadesh entries into a hierarchical structure or dendrogram for each language group. Using standard metrics on trees Critchlow:96 ; Dobson:75 ; RobinsonFoulds , we find that the dendrograms of language groups are much closer to each other than to dendrograms with randomly permuted leaves: thus the hypothesis that languages of different subgroups share no semantic structure in common is rejected (see Methods). SEA/OCEAN and SALT are, for example, more related than either is to SUN in every group we tried. In addition, the distances between dendrograms of language groups are statistically indistinguishable from the distances between bootstrapped languages. Figure 3 shows a summary of the statistical tests of 11 different groups. Thus our data analyses provide consistent evidence that all languages share semantic structure, namely the way concepts are clustered in Fig. 2, with no significant influence from environmental or cultural factors.
Another structural feature apparent in Fig. 2 is the heterogeneity of the node degrees and link weights. The numbers of polysemies involving individual meanings are uneven, possibly following a heavy-tailed distribution (Fig. 4). This indicates that concepts not only form clusters within which they are densely connected, but also differ in how polysemous they are. For example, EARTH/SOIL has hundreds of polysemes while SALT has only a few. Having shown that some aspects of the semantic network are universal, we next ask whether the observed heterogeneous degrees of polysemy, possibly a manifestation of varying densities of near conceptual neighbors, arise as artifacts of language family structure in our sample, or whether they are inherent to the concepts themselves. Simply put, is it an intrinsic property of the concept EARTH/SOIL to be extensively polysemous, or is it that a few languages happen to express the same concept in many different ways?
Suppose there is an underlying “universal space” relative to which each language $L$ randomly draws a subset of polysemies for each concept $S$. The number of polysemies $n_{SL}$ should then be linearly proportional to both the tendency of the concept to be polysemous, because it is close to many other concepts, and the tendency of the language to distinguish word senses in basic vocabulary. In our network representation, a proxy for the former is the weighted degree $n_S$ of node $S$, and a proxy for the latter is the total weight $n_L$ of links in language $L$. Writing $N$ for the total weight of all links, the number of polysemies is then expected to be (see Methods):
$$ n_{SL}^{\mathrm{model}} \equiv \frac{n_S \, n_L}{N} \tag{1} $$
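As a worked illustration of the product form described above, the sketch below computes the expected number of polysemies for every concept-language pair from a table of observed counts. The names `n_S`, `n_L`, and `N` are assumed notation for the concept totals, the language totals, and the grand total; the counts and the language labels "A" and "B" are invented for the example.

```python
from collections import defaultdict


def product_model(counts):
    """Expected counts n_S * n_L / N from observed counts[(concept, language)]."""
    n_S = defaultdict(float)  # total polysemies per concept
    n_L = defaultdict(float)  # total polysemies per language
    for (concept, language), n in counts.items():
        n_S[concept] += n
        n_L[language] += n
    N = sum(counts.values())
    return {(s, l): n_S[s] * n_L[l] / N for s in n_S for l in n_L}


# Hypothetical counts for two concepts in two languages.
observed = {
    ("EARTH/SOIL", "A"): 5, ("SALT", "A"): 3,
    ("EARTH/SOIL", "B"): 4, ("SALT", "B"): 0,
}
expected = product_model(observed)
for cell in sorted(expected):
    print(cell, expected[cell])
```

By construction the expected counts reproduce both marginals of the observed table, so any mismatch with the data lies in how counts are distributed across individual concept-language cells.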
This simple model indeed captures the gross features of the data very well (Fig. 5 in the Appendix). Nevertheless, the Kullback-Leibler divergence between the prediction and the empirical data identifies deviations beyond sampling error in three concepts (MOON, SUN, and ASHES) that display a nonlinear increase in the number of polysemies with the tendency of the language to distinguish word senses, as Fig. 6 in the Appendix shows. Accommodating saturation parameters (Table 3 in the Appendix) enables the random sampling model to reproduce the empirical data in good agreement while keeping the two parameters independent, and hence retains the universality over language groups.
Discussion
The similarity relations between word meanings through common polysemies exhibit a universal structure, manifested as intrinsic closeness between concepts, that transcends cultural or environmental factors. Polysemy arises when two or more concepts are fundamental enough to receive distinct vocabulary terms in some languages, yet similar enough to share a common term in others. The highly variable degree of these polysemies indicates such salient concepts are not homogeneously distributed in the conceptual space, and the intrinsic parameter that describes the overall propensity of a word to participate in polysemies can then be interpreted as a measure of the local density around such concept. Our model suggests that given the overall semantic ambiguity observed in the languages, such local density determines the degree of polysemies.
Universal structures in lexical semantics would greatly aid another subject of broad interest, namely reconstruction of human phylogeny using linguistic data Dunn2011 ; Bouckaert2012 . Much progress has been made in reconstructing the phylogenies of word forms from known cognates in various languages, thanks to the ability to measure phonetic similarity and our knowledge of the processes of sound change. However, the relationship between semantic similarity and semantic shift is still poorly understood. The standard view in historical linguistics is that any meaning can change to any other meaning Hock1986 ; Fox1995 , and that no constraint is imposed on what meanings can be compared to detect cognates Nichols1996 . It is, however, generally accepted among historical linguists that language change is gradual, and that words in transition from having one meaning to being extended to another meaning should be polysemous. If this is true, then the weights on different links reflect the probabilities that words in transition over these links will be captured in “snapshots” by language translation at any time. Such semantic shifts can be modeled as diffusion in the conceptual space, or along a universal polysemy network, where our constructed networks can serve as an important input to methods of inferring cognates.
The absence of significant cladistic correlation with the patterns of polysemy suggests that the constructed conceptual space can be extended by utilizing digitally archived dictionaries of the major languages of the world, with some confidence that their expression of these features is not strongly biased by correlations due to language family structure. Large-corpus samples could be used to construct the semantic space in as yet unexplored domains using automated means.
Methods
Polysemy data
High-quality bilingual dictionaries between the object language and the semantic metalanguage for cross-linguistic comparison are used to identify polysemies. The 81 object languages were selected from a phylogenetically and geographically stratified sample of low-level language families or genera, listed in Tab. 1 in the Appendix Dryer1989 . Translations into the object language of each of the 22 word senses from the Swadesh basic vocabulary list were first obtained (see Appendix .1); all translations (that is, all synonyms) were retained. Polysemies were identified by looking up the metalanguage translations (back-translation) of each object-language term. The selected Swadesh word senses and the selected languages are listed in the Appendix.
We use modern European languages as a semantic metalanguage, i.e., bilingual dictionaries between such languages and the other languages in our sample. This could be problematic if these languages themselves display polysemies; for example, English day expresses both DAYTIME, and 24HR PERIOD. In many cases, however, the lexicographer is aware of these issues, and annotates the translation of the object language word accordingly. In the lexical domain chosen for our study, standard lexicographic practice was sufficient to overcome this problem.
Comparing semantic networks between language groups
A hierarchical spectral algorithm clusters the Swadesh word senses. Each sense is assigned a position in a Euclidean space based on its components in the eigenvectors of the weighted adjacency matrix. Each eigenvector is weighted by the square of its eigenvalue, and the senses are clustered by a greedy agglomerative algorithm that repeatedly merges the pair of clusters having the smallest squared Euclidean distance between their centers of mass, which yields a binary tree or dendrogram. We construct a dendrogram for each subgroup of languages defined by nonlinguistic variables such as geography, topography, climate, and the presence or absence of a literary tradition (Table 2 in Appendix).
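The greedy agglomerative step can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: it takes the coordinates of each sense as given (the spectral embedding itself is omitted), the function names are hypothetical, and the four coordinate pairs are invented.

```python
def centroid(points):
    """Center of mass of a list of equal-length coordinate tuples."""
    dim = len(points[0])
    return tuple(sum(p[k] for p in points) / len(points) for k in range(dim))


def agglomerate(coords):
    """Build a binary tree (nested tuples) by repeatedly merging the two
    clusters whose centers of mass are closest in squared Euclidean distance."""
    clusters = [(label, [point]) for label, point in coords.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci, cj = centroid(clusters[i][1]), centroid(clusters[j][1])
                dist = sum((a - b) ** 2 for a, b in zip(ci, cj))
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical 2-D positions standing in for eigenvector components.
positions = {
    "SEA/OCEAN": (0.0, 0.0),
    "SALT": (0.1, 0.0),
    "SUN": (1.0, 1.0),
    "MOON": (1.0, 0.8),
}
tree = agglomerate(positions)
print(tree)
```

The quadratic pair search is adequate for 22 senses; a larger vocabulary would call for a priority queue or a library routine.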
The structural distance between the dendrograms of each pair of language subgroups is measured by two standard tree metrics. The triplet distance Dobson:75 ; Critchlow:96 is the fraction of the distinct triplets of senses that are assigned a different topology in the two trees: that is, those for which the trees disagree as to which pair of senses is more closely related to each other than either is to the third. The Robinson-Foulds distance RobinsonFoulds is the number of “cuts” on which the two trees disagree, where a cut is a separation of the leaves into two sets resulting from removing an edge of the tree.
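Both metrics can be sketched on trees written as nested tuples of sense labels. The representation, the function names, and the two five-leaf trees below are assumptions made for illustration; in particular, the Robinson-Foulds value here is the raw count of non-trivial cuts found in exactly one tree, and some conventions divide that count by two.

```python
from itertools import combinations


def leaf_set(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """Leaf sets below every internal node."""
    if isinstance(tree, str):
        return set()
    found = {leaf_set(tree)}
    for child in tree:
        found |= clades(child)
    return found


def closest_pair(tree_clades, triplet):
    """The two senses of the triplet that some clade separates from the third."""
    for clade in tree_clades:
        inside = clade & triplet
        if len(inside) == 2:
            return inside
    return None  # unresolved; does not occur in a binary tree


def triplet_distance(t1, t2):
    c1, c2 = clades(t1), clades(t2)
    triplets = [frozenset(t) for t in combinations(sorted(leaf_set(t1)), 3)]
    differing = sum(closest_pair(c1, t) != closest_pair(c2, t) for t in triplets)
    return differing / len(triplets)


def cuts(tree):
    """Non-trivial separations of the leaves induced by removing one edge."""
    everything = leaf_set(tree)
    return {
        frozenset({clade, everything - clade})
        for clade in clades(tree)
        if 1 < len(clade) < len(everything) - 1
    }


def robinson_foulds(t1, t2):
    return len(cuts(t1) ^ cuts(t2))


# Two hypothetical dendrograms that differ only inside the water cluster.
tree_a = ((("SEA/OCEAN", "SALT"), "LAKE"), ("SUN", "MOON"))
tree_b = ((("SEA/OCEAN", "LAKE"), "SALT"), ("SUN", "MOON"))
print(triplet_distance(tree_a, tree_b), robinson_foulds(tree_a, tree_b))
```

Of the ten triplets in this example only {SEA/OCEAN, SALT, LAKE} changes topology, and each tree has one cut that the other lacks.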
For each pair of subgroups, we perform two types of bootstrap experiments. First, we compare the distance between their dendrograms to the distribution of distances we would see under the hypothesis that the two subgroups have no shared lexical structure. Were this null hypothesis true, the distribution of distances would be unchanged under random permutation of the senses at the leaves of each tree. (For simplicity, the topology of the dendrograms is kept fixed.) Comparing the observed distance against the resulting distribution gives a p value, reported in Figure 3. These p values are small enough to decisively reject the null hypothesis. Indeed, for most pairs of groups the Robinson-Foulds distance is smaller than that observed in any of the 1000 bootstrap trials, as marked in the table. This gives overwhelming evidence that the semantic network has universal aspects that apply across language subgroups: for instance, in every group we tried, SEA/OCEAN and SALT are more related than either is to SUN.
In the second bootstrap experiment, the null hypothesis is that the nonlinguistic variables have no effect on the semantic network, and that the differences between language groups simply result from random sampling: for instance, the similarity between the Americas and Eurasia is what one would expect from any disjoint subgroups of the 81 languages of the given sizes, 29 and 20 respectively. To test this null hypothesis, we generate random pairs of disjoint language subgroups with the same sizes as the groups in question, and measure the distribution of their distances. The p values, also reported in Figure 3, are not small enough to reject this null hypothesis. Thus, at least given the current data set, there is no statistical distinction between random sampling and the empirical data, further supporting our thesis that the semantic network is, at least in part, universal.
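The two resampling schemes can be sketched as below. This is a hedged illustration: the tree distance is a simplified rooted-clade count standing in for the Robinson-Foulds and triplet metrics, all function names are hypothetical, and the example tree and the language labels `lang0` to `lang80` are invented.

```python
import random


def leaves(tree):
    """Leaf labels of a nested-tuple tree, left to right."""
    return [tree] if isinstance(tree, str) else [x for c in tree for x in leaves(c)]


def clades(tree):
    """Leaf sets below every internal node."""
    if isinstance(tree, str):
        return set()
    found = {frozenset(leaves(tree))}
    for child in tree:
        found |= clades(child)
    return found


def clade_distance(t1, t2):
    """Number of clades present in exactly one tree (stand-in metric)."""
    return len(clades(t1) ^ clades(t2))


def permute_leaves(tree, rng):
    """Shuffle the senses at the leaves while keeping the topology fixed."""
    names = leaves(tree)
    shuffled = names[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(names, shuffled))

    def relabel(node):
        return mapping[node] if isinstance(node, str) else tuple(relabel(c) for c in node)

    return relabel(tree)


def permutation_p_value(t1, t2, trials=1000, seed=0):
    """Fraction of leaf-permuted tree pairs at least as close as the observed pair."""
    rng = random.Random(seed)
    observed = clade_distance(t1, t2)
    hits = sum(
        clade_distance(permute_leaves(t1, rng), permute_leaves(t2, rng)) <= observed
        for _ in range(trials)
    )
    return observed, hits / trials


def random_disjoint_subgroups(languages, size_a, size_b, rng):
    """Two disjoint random subgroups, as used in the second experiment."""
    picked = rng.sample(languages, size_a + size_b)
    return picked[:size_a], picked[size_a:]


tree = ((("SEA/OCEAN", "SALT"), "LAKE"), (("SUN", "MOON"), "STAR"))
observed, p_value = permutation_p_value(tree, tree)
group_a, group_b = random_disjoint_subgroups(
    [f"lang{i}" for i in range(81)], 29, 20, random.Random(1)
)
print(observed, p_value, len(group_a), len(group_b))
```

In the second experiment each random pair of subgroups would be turned into two dendrograms and their distance recorded, giving the reference distribution against which the observed distance is compared.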
Null model
The model treats all concepts as independent members of an unbiased sample, so that the aggregate summary statistics of the empirical data reflect the underlying structure. The simplest model then assumes no interaction between concepts and languages: the number of polysemies of concept $S$ in language $L$, that is, $n_{SL}$, is linearly proportional to both the tendency of the concept to be polysemous and the tendency of the language to distinguish word senses; and these tendencies are estimated from the marginal distributions of the observed data as the fraction of polysemy associated with the concept, $p_S = n_S/N$, and the fraction of polysemy in the language, $p_L = n_L/N$, respectively, where $n_S = \sum_L n_{SL}$, $n_L = \sum_S n_{SL}$, and $N = \sum_{S,L} n_{SL}$. The model can, therefore, be expressed as $p_{SL}^{\mathrm{model}} = p_S \, p_L$, a product of the two.
To test the model, we compare the Kullback-Leibler (KL) divergence of ensembles of the model with that of the observation Cover1991 . Ensembles are generated by the multinomial distribution according to the probability $p_{SL}^{\mathrm{model}}$. The KL divergence is an appropriate measure for testing typicality of this random process because it is the leading exponential approximation (by Stirling’s formula) to the log of the multinomial distribution produced by Poisson sampling (see Appendix .4). The KL divergence of ensembles is $D(p_{SL}^{\mathrm{ensemble}} \,\|\, p_{SL}^{\mathrm{model}})$, where $p_{SL}^{\mathrm{ensemble}}$ is the number of polysemies that the model generates divided by $N$, and the KL divergence of the empirical observation is $D(p_{SL}^{\mathrm{data}} \,\|\, p_{SL}^{\mathrm{model}})$. Note that $p_{SL}^{\mathrm{data}}$ is $n_{SL}/N$, and that it differs from the expected value of the model, $n_S n_L / N^2$. The p value is the cumulative probability of $D(p_{SL}^{\mathrm{ensemble}} \,\|\, p_{SL}^{\mathrm{model}})$ to the right of $D(p_{SL}^{\mathrm{data}} \,\|\, p_{SL}^{\mathrm{model}})$.
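The typicality test can be sketched as a small Monte Carlo routine. It is a minimal illustration under assumptions: each list entry stands for one concept-language cell, the function names are hypothetical, the counts and probabilities are invented, and every cell with a nonzero count is assumed to have nonzero model probability.

```python
import math
import random


def kl_divergence(freqs, probs):
    """D(q || p) = sum_k q_k log(q_k / p_k), skipping empty cells."""
    return sum(q * math.log(q / p) for q, p in zip(freqs, probs) if q > 0)


def kl_p_value(counts, probs, trials=500, seed=0):
    """One-sided p value: share of multinomial ensembles whose KL divergence
    from the model is at least as large as that of the observed counts."""
    rng = random.Random(seed)
    total = sum(counts)
    d_data = kl_divergence([n / total for n in counts], probs)
    cells = list(range(len(probs)))
    exceed = 0
    for _ in range(trials):
        draw = [0] * len(probs)
        for k in rng.choices(cells, weights=probs, k=total):
            draw[k] += 1
        if kl_divergence([n / total for n in draw], probs) >= d_data:
            exceed += 1
    return d_data, exceed / trials


# Hypothetical example: four cells, a uniform model, and strongly skewed data.
d_data, p_value = kl_p_value([30, 0, 0, 30], [0.25, 0.25, 0.25, 0.25])
print(d_data, p_value)
```

A small p value means the observed table is less typical of the model than almost every sampled ensemble, which is the sense in which a model is ruled out.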
Acknowledgments
HY acknowledges support from CABDyN Complexity Centre, and the support of research grants from the National Science Foundation (no. SMA1312294). WC and LS acknowledge support from the University of New Mexico Resource Allocation Committee. TB, JW, ES, CM, and HY acknowledge the Santa Fe Institute, and the Evolution of Human Languages program. Authors thank Ilia Peiros, George Starostin, and Petter Holme for helpful comments. W.C. and T.B. conceived of the project and participated in all methodological decisions. L.S. and W.C. collected the data, H.Y., J.W., E.S., and T.B. did the modeling and statistical analysis. I.M. and W.C. provided the crosslinguistic knowledge. H.Y., E.S., and C.M. did the network analysis. The manuscript was written mainly by H.Y., E.S., W.C., C.M., and T.B., and all authors agreed on the final version.
References
(1) Whorf BL, Language, Thought and Reality: Selected Writing. (MIT Press, Cambridge, 1956).
(2) Fodor JA, The language of thought. (Harvard Univ., New York, 1975).
(3) Wierzbicka A, Semantics: primes and universals. (Oxford University Press, 1996).
(4) Lucy JA, Grammatical categories and cognition: a case study of the linguistic relativity hypothesis. (Cambridge University Press, 1992).
(5) Levinson SC, Space in language and cognition: explorations in cognitive diversity. (Cambridge University Press, 2003).
(6) Choi S, Bowerman M (1991) Learning to express motion events in English and Korean: The influence of language-specific lexicalization patterns. Cognition 41, 83–121.
(7) Majid A, Boster JS, Bowerman M (2008) Cognition 109, 235–250.
(8) Croft W (2010) Relativity, linguistic variation and language universals. CogniTextes 4, 303.
(9) Evans N, Levinson SC (2009) The myth of language universals: language diversity and its importance for cognitive science. Behav. Brain Sci. 21, 429–492.
(10) Comrie B, Language universals and linguistic typology, 2nd ed. (University of Chicago Press, 1989).
(11) Croft W, Typology and universals, 2nd ed. (Cambridge University Press, 2003).
(12) Henrich J, Heine SJ, Norenzayan A (2010) The weirdest people in the world? Behav. Brain Sci. 33, 175.
(13) Shopen T (ed.), Language typology and syntactic description, 2nd ed. (3 volumes) (Cambridge University Press, Cambridge, 2007).
(14) Croft W, Cruse DA, Cognitive linguistics. (Cambridge University Press, 2004).
(15) Koptjevskaja-Tamm M, Vanhove M (2012) New directions in lexical typology. Linguistics 50, 3.
(16) Brown CH (1976) General principles of human anatomical partonomy and speculations on the growth of partonomic nomenclature. Am. Ethnol. 3, 400–424.
(17) Witkowski SR, Brown CH (1978) Lexical universals. Ann. Rev. of Anthropol. 7, 427–51.
(18) Brown CH (1983) Where do cardinal direction terms come from? Anthropological Linguistics 25, 121–161.
(19) Viberg Å (1983) The verbs of perception: a typological study. Linguistics 21, 123–162.
(20) Evans N, Multiple semiotic systems, hyperpolysemy, and the reconstruction of semantic change in Australian languages. in Diachrony within synchrony: language history and cognition (Peter Lang, Frankfurt, 1992).
(21) Derrig S (1978) Metaphor in the color lexicon. Chicago Linguistic Society, the Parasession on the Lexicon, 85–96.
(22) Swadesh M (1952) Lexicostatistical dating of prehistoric ethnic contacts. P. Am. Philos. Soc. 96, 452–463.
(23) Vygotsky L, Thought and Language. (MIT Press, Cambridge, MA, 2002).
(24) Critchlow DE, Pearl DK, Qian CL (1996) The triples distance for rooted bifurcating phylogenetic trees. Syst. Biol. 45, 323–334.
(25) Dobson AJ, Comparing the Shapes of Trees, Combinatorial mathematics III. (Springer-Verlag, New York, 1975).
(26) Robinson DF, Foulds LR (1981) Comparison of phylogenetic trees. Math. Biosci. 53, 131–147.
(27) Dunn M, et al. (2011) Evolved structure of language shows lineage-specific trends in word-order universals. Nature 473, 79–82.
(28) Bouckaert R, et al. (2012) Mapping the origins and expansion of the Indo-European language family. Science 337, 957.
(29) Fox A, Linguistic reconstruction: an introduction to theory and method. (Oxford University Press, 1995).
(30) Hock HH, Principles of historical linguistics. (Mouton de Gruyter, Berlin, 1986).
(31) Nichols J, The comparative method as heuristic. in The comparative method reviewed: regularity and irregularity in language change (Oxford University Press, 1996).
(32) Dryer MS (1989) Large linguistic areas and language sampling. Studies in Language 13, 257–292.
(33) Cover TM, Thomas JA, Elements of Information Theory. (Wiley, New York, 1991).
(34) Brown CH (1979) A theory of lexical change (with examples from folk biology, human anatomical partonomy and other domains). Anthropol. Linguist. 21, 257–276.
(35) Brown CH, Witkowski SR (1981) Figurative language in a universalist perspective. Am. Ethnol. 8, 596–615.
Appendix
.1 Criteria for selection of meanings
Our translations use only lexical concepts as opposed to grammatical inflections or function words. For the purpose of universality and stability of meanings across cultures, we chose entries from the Swadesh 200-word list of basic vocabulary. Among these, we have selected categories that are likely to have single-word representation for meanings, and for which the referents are material entities or natural settings rather than social or conceptual abstractions. We have selected 22 words in domains concerning natural and geographic features, so that the web of polysemy will produce a connected graph whose structure we can analyze, rather than having an excess of disconnected singletons. We have omitted body parts, which by the same criteria would provide a similarly appropriate connected domain, because these have been considered previously Brown1976 ; Brown1979 ; WitkowskiBrown1978 ; BrownWitkowski1981 . The final set of 22 words is as follows:

Celestial Phenomena and Related Time Units:
STAR, SUN, MOON, YEAR, DAY/DAYTIME, NIGHT 
Landscape Features:
SKY, CLOUD(S), SEA/OCEAN, LAKE, RIVER, MOUNTAIN 
Natural Substances:
STONE/ROCK, EARTH/SOIL, SAND, ASH(ES), SALT, SMOKE, DUST, FIRE, WATER, WIND
.2 Language List
The languages included in our study are listed in Tab. 1. Notes: Oceania includes Southeast Asia; the Papuan languages do not form a single phylogenetic group in the view of most historical linguists; other families in the table vary in their degree of acceptance by historical linguists. The classification at the genus level, which is of greater importance to our analysis, is generally agreed upon.
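Region  Family  Genus  Language
Africa  Khoisan  Northern  Ju’hoan
Africa  Khoisan  Central  Khoekhoegowab
Africa  Khoisan  Southern  !Xóõ
Africa  Niger-Kordofanian  NW Mande  Bambara
Africa  Niger-Kordofanian  Southern W. Atlantic  Kisi
Africa  Niger-Kordofanian  Defoid  Yorùbá
Africa  Niger-Kordofanian  Igboid  Igbo
Africa  Niger-Kordofanian  Cross River  Efik
Africa  Niger-Kordofanian  Bantoid  Swahili
Africa  Nilo-Saharan  Saharan  Kanuri
Africa  Nilo-Saharan  Kuliak  Ik
Africa  Nilo-Saharan  Nilotic  Nandi
Africa  Nilo-Saharan  Bango-Bagirmi-Kresh  Kaba Démé
Africa  Afro-Asiatic  Berber  Tumazabt
Africa  Afro-Asiatic  West Chadic  Hausa
Africa  Afro-Asiatic  E Cushitic  Rendille
Africa  Afro-Asiatic  Semitic  Iraqi Arabic
Eurasia  Basque  Basque  Basque
Eurasia  Indo-European  Armenian  Armenian
Eurasia  Indo-European  Indic  Hindi
Eurasia  Indo-European  Albanian  Albanian
Eurasia  Indo-European  Italic  Spanish
Eurasia  Indo-European  Slavic  Russian
Eurasia  Uralic  Finnic  Finnish
Eurasia  Altaic  Turkic  Turkish
Eurasia  Altaic  Mongolian  Khalkha Mongolian
Eurasia  Japanese  Japanese  Japanese
Eurasia  Chukotkan  Kamchatkan  Itelmen (Kamchadal)
Eurasia  Caucasian  NW Caucasian  Kabardian
Eurasia  Caucasian  Nax  Chechen
Eurasia  Kartvelian  Kartvelian  Georgian
Eurasia  Dravidian  Dravidian Proper  Badaga
Eurasia  Sino-Tibetan  Chinese  Mandarin
Eurasia  Sino-Tibetan  Karen  Karen (Bwe)
Eurasia  Sino-Tibetan  Kuki-Chin-Naga  Mikir
Eurasia  Sino-Tibetan  Burmese-Lolo  Hani
Eurasia  Sino-Tibetan  Naxi  Naxi
Oceania  Hmong-Mien  Hmong-Mien  Hmong Njua
Oceania  Austroasiatic  Munda  Sora
Oceania  Austroasiatic  Palaung-Khmuic  Minor Mlabri
Oceania  Austroasiatic  Aslian  Semai (Sengoi)
Oceania  Daic  Kam-Tai  Thai
Oceania  Austronesian  Oceanic  Trukese
Oceania  Papuan  Middle Sepik  Kwoma
Oceania  Papuan  E NG Highlands  Yagaria
Oceania  Papuan  Angan  Baruya
Oceania  Papuan  C and SE New Guinea  Kolari
Oceania  Papuan  West Bougainville  Rotokas
Oceania  Papuan  East Bougainville  Buin
Oceania  Australian  Gunwinguan  Nunggubuyu
Oceania  Australian  Maran  Mara
Oceania  Australian  Pama-Nyungan  E and C Arrernte
Americas  Eskimo-Aleut  Aleut  Aleut
Americas  Na-Dene  Haida  Haida
Americas  Na-Dene  Athapaskan  Koyukon
Americas  Algic  Algonquian  Western Abenaki
Americas  Salishan  Interior Salish  Thompson Salish
Americas  Wakashan  Wakashan  Nootka (Nuu-chah-nulth)
Americas  Siouan  Siouan  Lakhota
Americas  Caddoan  Caddoan  Pawnee
Americas  Iroquoian  Iroquoian  Onondaga
Americas  Coastal Penutian  Tsimshianic  Coast Tsimshian
Americas  Coastal Penutian  Klamath  Klamath
Americas  Coastal Penutian  Wintuan  Wintu
Americas  Coastal Penutian  Miwok  Northern Sierra Miwok
Americas  Gulf  Muskogean  Creek
Americas  Mayan  Mayan  Itzá Maya
Americas  Hokan  Yanan  Yana
Americas  Hokan  Yuman  Cocopa
Americas  Uto-Aztecan  Numic  Tümpisa Shoshone
Americas  Uto-Aztecan  Hopi  Hopi
Americas  Otomanguean  Zapotecan  Quiavini Zapotec
Americas  Paezan  Warao  Warao
Americas  Paezan  Chimúan  Mochica/Chimu
Americas  Quechuan  Quechua  Huallaga Quechua
Americas  Araucanian  Araucanian  Mapudungun (Mapuche)
Americas  Tupí-Guaraní  Tupí-Guaraní  Guaraní
Americas  Macro-Arawakan  Harákmbut  Amarakaeri
Americas  Macro-Arawakan  Maipuran  Piro
Americas  Macro-Carib  Carib  Carib
Americas  Macro-Carib  Peba-Yaguan  Yagua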
.3 Language groups
We performed several tests to see if the structure of the polysemy network depends in a statistically significant way on typological features, including the presence or absence of a literary tradition, geography, topography, and climate. The typological features tested, together with the number of languages having each feature, are listed in Tab. 2.
Variable  Subset  Size
Geography  Americas  29
Geography  Eurasia  20
Geography  Africa  17
Geography  Oceania  15
Climate  Humid  38
Climate  Cold  30
Climate  Arid  13
Topography  Inland  45
Topography  Coastal  36
Literary tradition  Some or long literary tradition  28
Literary tradition  No literary tradition  53
Table 2: Various groups of languages based on nonlinguistic variables. For each variable we measured the difference between the subsets’ semantic networks, defined as the tree distance between the dendrograms of Swadesh words generated by spectral clustering.
.4 Model for Degree of Polysemy
.4.1 Aggregation of language samples
We now consider more formally the reasons sample aggregates may not simply be presumed as summary statistics, because they entail implicit generating processes that must be tested. By demonstrating an explicit algorithm that assigns probabilities to samples of Swadesh node degrees, presenting significance measures consistent with the aggregate graph and the sampling algorithm, and showing that the languages in our dataset are typical by these measures, we justify the use and interpretation of the aggregate graph (Fig. 2 ).
We begin by introducing an error measure appropriate to independent sampling from a general mean degree distribution. We then introduce calibrated forms for this distribution that reproduce the correct sample means as functions of both Swadesh-entry and language-weight properties.
The notion of consistency with random sampling is generally scale-dependent. In particular, the existence of synonymous polysemy may cause individual languages to violate criteria of randomness, but if the particular duplicated polysemes are not correlated across languages, even small groups of languages may rapidly converge toward consistency with a random sample. Therefore, we do not present only a single acceptance/rejection criterion for our dataset, but rather show the smallest groupings for which sampling is consistent with randomness, and then demonstrate a model that reproduces the excess but uncorrelated synonymous polysemy within individual languages.
.4.2 Independent sampling from the aggregate graph
Figure 2 treats all words in all
languages as independent members of an unbiased sample. To test the
appropriateness of the aggregate as a summary statistic, we ask: do
random samples, with link numbers equal to those in observed
languages, and with link probabilities proportional to the weights in
the aggregate graph, yield ensembles of graphs within which the actual
languages in our data are typical?
Statistical tests
The appropriate summary statistic to test for typicality in ensembles produced by random sampling (of links or link-ends) is the Kullback-Leibler (KL) divergence of the sample counts from the probabilities with which the samples were drawn Cover1991 . This is because the KL divergence is the leading exponential approximation (by Stirling’s formula) to the log of the multinomial distribution produced by Poisson sampling.
The appropriateness of a random-sampling model may be tested independently of how the aggregate link numbers are used to generate an underlying probability model. In this section, we will first evaluate a variety of underlying probability models under Poisson sampling, and then we will return to tests for deviations from independent Poisson samples. We first introduce notation: For a single language, the relative degree (link frequency), which is used as the normalization of a probability, is denoted as , and for the joint configuration of all words in all languages, the link frequency of a single entry relative to the total will be denoted .
Corresponding to any of these, we may generate samples of links to define the null model for a random process, which we denote , , etc. We will generally use samples with exactly the same number of total links as the data. The corresponding sample frequencies will be denoted by and , respectively.
Finally, the calibrated model, which we define from properties of aggregated graphs, will be the prior probability from which samples are drawn to produce p values for the data. We denote the model probabilities (which are used in sampling as “true” probabilities rather than sample frequencies) by , , and .
For links sampled independently from the distribution for language , the multinomial probability of a particular set may be written, using Stirling’s formula to leading exponential order, as
(2) 
where the KullbackLeibler (KL) divergence Cover1991
(3) 
For later reference, note that the leading quadratic approximation to Eq. (3) is
(4) 
so that the variance of fluctuations in each word is proportional to its expected frequency.
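For reference, the standard route from a multinomial probability to the KL divergence and its quadratic approximation can be written out as follows. The notation is assumed for this sketch and may differ from that of Eqs. (2)–(4): $n$ links in total, $n_S$ of them on concept $S$, sample frequency $q_S = n_S/n$, and model probability $p_S$.

```latex
% Assumed notation (illustrative only): n links in total, n_S on concept S,
% sample frequency q_S = n_S / n, model probability p_S.
\begin{align}
P(\{n_S\}) &= \frac{n!}{\prod_S n_S!} \prod_S p_S^{\,n_S}, \\
\log P(\{n_S\}) &\approx n \log n - \sum_S n_S \log n_S + \sum_S n_S \log p_S
  && \text{(Stirling: } \log m! \approx m \log m - m\text{)} \\
 &= -n \sum_S q_S \log \frac{q_S}{p_S} \;=\; -n \, D(q \,\|\, p), \\
D(q \,\|\, p) &\approx \frac{1}{2} \sum_S \frac{(q_S - p_S)^2}{p_S}
  && \text{(second order in } q_S - p_S\text{, using } \textstyle\sum_S (q_S - p_S) = 0\text{)} .
\end{align}
```

Multiplying the last line by $n$ gives $\tfrac{1}{2} \sum_S (n_S - n p_S)^2 / (n p_S)$, a Gaussian form in which the count on each concept fluctuates with variance $n p_S$, that is, in proportion to its expected frequency.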
As a null model for the joint configuration over all languages in our set, if links are drawn independently from the distribution , the multinomial probability of a particular set is given by
(5) 
where
Footnote 1: As long as we calibrate to agree with the per-language link frequencies in the data, the data will always be counted as more typical than almost all random samples, and its probability will come entirely from the KL divergences in the individual languages.
Multinomial samples of assignments to each of the pairs, with links total drawn from distribution , will produce KL divergences uniformly distributed in the coordinate , corresponding to the uniform increment of cumulative probability in the model distribution. We may therefore use the cumulative probability to the right of (one-sided p value), in the distribution of samples , as a test of consistency of our data with the model of random sampling.
In the next two subsections we will generate and test candidates for which are different functions of the mean link numbers on Swadesh concepts and the total link numbers in languages.
Product model with intrinsic property of concepts
In general we wish to consider the consistency of joint configurations with random sampling, as a function of an aggregation scale. To do this, we will rank-order languages by increasing , form non-overlapping bins of 1, 3, or 9 languages, and test the resulting binned degree distributions against different mean-degree and sampling models. We denote by the average total link number in a bin, and by the average link number per Swadesh entry in the bin. The simplest model, which assumes no interaction between concept and language properties, makes the model probability a product of its marginals. It is estimated from data without regard to binning, as
(7) 
The independent mean values are thereby specified in terms of sample estimators.
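The binning step can be sketched as follows, assuming that languages are ranked by their total link number. The function name `bin_languages`, the dictionary layout, and the toy counts for four languages "a" to "d" are hypothetical.

```python
from collections import defaultdict


def bin_languages(counts, bin_size):
    """Rank languages by total link number and average within non-overlapping bins.

    counts: {(concept, language): number of links}
    Returns one dict per bin with the member languages, the mean total link
    number, and the mean link number per concept.
    """
    totals = defaultdict(int)
    for (_, language), n in counts.items():
        totals[language] += n
    ranked = sorted(totals, key=lambda lang: (totals[lang], lang))
    bins = []
    for start in range(0, len(ranked), bin_size):
        members = ranked[start:start + bin_size]
        per_concept = defaultdict(float)
        for (concept, language), n in counts.items():
            if language in members:
                per_concept[concept] += n / len(members)
        bins.append({
            "languages": members,
            "mean_total": sum(totals[lang] for lang in members) / len(members),
            "mean_per_concept": dict(per_concept),
        })
    return bins


# Hypothetical link counts for four languages, binned in pairs.
toy_counts = {
    ("SUN", "a"): 1, ("MOON", "a"): 1,
    ("SUN", "b"): 3,
    ("SUN", "c"): 2, ("MOON", "c"): 4,
    ("SUN", "d"): 8,
}
binned = bin_languages(toy_counts, bin_size=2)
for entry in binned:
    print(entry)
```

With 81 languages, bin sizes of 1, 3, and 9 divide the sample evenly, so no bin is left partially filled.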
The KL divergence of the joint configuration of links in the actual data from this model, under whichever binning is used, becomes
(8) 
As we show in Fig. 7 below, even for 9-language bins, which we expect to average over a large amount of language-specific fluctuation, the product model is ruled out at the level.
We now show that a richer model, describing interaction between word and language properties, accepts not only the 9-language aggregate, but also the 3-language aggregate with a small adjustment of the language size to which words respond (to produce consistency with word-size and language-size marginals). Only fluctuation statistics at the level of the joint configuration of 81 individual languages remain strongly excluded by the null model of random sampling.
Product model with saturation
An inspection of the deviations of our data from the product model shows that the initial propensity of a word to participate in polysemies, as inferred in languages where that word has few links, in general overestimates the number of links (degree). Put differently, languages seem to place limits on the weight of single polysemies, favoring distribution over distinct polysemies, but the number of potential distinct polysemies is a parameter independent of the likelihood that the available polysemies will be formed. Interpreted in terms of our supposed semantic space, the proximity of target words to a Swadesh entry may determine the likelihood that they will be polysemous with it, but the total number of proximal targets may vary independently of their absolute proximity. These limits on the number of neighbors of each concept are captured by 22 additional parameters.
Meaning category  Saturation:  Propensity 

STAR  0.025  
SUN  0.126  
YEAR  0.021  
SKY  0.080  
SEA/OCEAN  0.026  
STONE/ROCK  0.041  
MOUNTAIN  0.049  
DAY/DAYTIME  0.087  
SAND  0.026  
ASH(ES)  0.068  
SALT  0.007  
FIRE  0.065  
SMOKE  0.031  
NIGHT  0.034  
DUST  0.065  
RIVER  0.048  
WATER  0.073  
LAKE  0.047  
MOON  0.997  
EARTH/SOIL  0.116  
CLOUD(S)  0.033  
WIND  0.051 
To accommodate this characteristic, we revise the model of Eq. (7) to the following function:
where the degree number for each Swadesh entry is proportional to and language size, but is bounded by , the number of proximal concepts. The corresponding model probability for each language then becomes
(9) 
As all we recover the product model, with and .
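The behavior described above (linear growth with language size at small sizes, a ceiling at large sizes, and recovery of the product model when the ceiling is removed) can be illustrated with a short sketch. The specific saturating form used here is an assumption chosen only because it has these limits; it is not claimed to be the functional form of Eq. (9), and the names `propensity`, `bound`, and `language_size` are hypothetical.

```python
def saturating_degree(propensity, bound, language_size):
    """Expected links for one concept (assumed form).

    Approximately propensity * language_size when that product is small
    compared with `bound`, and approaching `bound` from below when it is large.
    """
    linear = propensity * language_size
    return linear * bound / (bound + linear)

def model_probabilities(propensities, bounds, language_size):
    """Normalize the saturating degrees into a probability over concepts for one language."""
    degrees = [saturating_degree(p, b, language_size)
               for p, b in zip(propensities, bounds)]
    total = sum(degrees)
    return [d / total for d in degrees]
```

With a very large `bound` the degree reduces to `propensity * language_size`, which is the product-model limit mentioned in the text.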
A first-level approximation to fit parameters and is given by minimizing the weighted mean-square error
(10) 
The function (10) assigns equal penalty to squared error within each language bin , proportional to the variance expected from Poisson sampling. The fit values obtained for and do not depend sensitively on the size of bins except for the Swadesh entry MOON in the case where all 81 single-language bins are used. MOON has so few polysemies, but the MOON/month polysemy is so likely to be found, that the language Itelmen, with only one link, has this polysemy. This point leads to instabilities in fitting in single-language bins. For bins of size 3–9 the instability is removed. Representative fit parameters across this range are shown in Table 3. Examples of the saturation model for two words, plotted against the 9-language binned degree data in Fig. 6, show the range of behaviors spanned by Swadesh entries.
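A minimal sketch of a Poisson-weighted least-squares fit for a single Swadesh entry is given below. It assumes a saturating mean of the form `p*n*B / (B + p*n)` and divides each squared residual by the model mean (the Poisson variance); both the functional form and the coarse grid search are illustrative assumptions, not the fitting procedure used for Table 3.

```python
def saturating_degree(propensity, bound, size):
    """Assumed saturating mean: linear in size when small, capped by `bound` when large."""
    linear = propensity * size
    return linear * bound / (bound + linear)

def poisson_weighted_error(observed, sizes, propensity, bound):
    """Sum over bins of squared error divided by the Poisson variance (the model mean)."""
    err = 0.0
    for k, n in zip(observed, sizes):
        mean = saturating_degree(propensity, bound, n)
        err += (k - mean) ** 2 / mean
    return err

def grid_fit(observed, sizes, propensities, bounds):
    """Return the (propensity, bound) pair on the grid with the smallest weighted error."""
    return min(((p, b) for p in propensities for b in bounds),
               key=lambda pb: poisson_weighted_error(observed, sizes, *pb))
```

In practice a continuous optimizer would replace the grid; the grid is used here only to keep the sketch dependency-free.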
The least-squares fits to and do not directly yield a probability model consistent with the marginals for language size, which in our data are fixed parameters rather than sample variables to be explained. They closely approximate the marginal N (deviations link for every ) but lead to mild violations . We corrected for this by altering the saturation model to suppose that, rather than word properties interacting with the exact value , they interact with a (word-independent but language-dependent) multiplier , so that the model for in each language becomes
in terms of the least-squares coefficients and of Table 3. The values of are solved with Newton’s method to produce , and we checked that they preserve within small fractions of a link. The resulting adjustment parameters are plotted versus for individual languages in Fig. 8. Although they were computed individually for each , they form a smooth function of , possibly suggesting a refinement of the product model, but also perhaps reflecting systematic interaction of small-language degree distributions with the error function (10).
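The Newton step can be sketched as follows, under the assumption that the multiplier rescales the language size seen by every word and that the constraint is that the model degrees sum to the observed total link number of the language. The saturating form, the function name `solve_multiplier`, and its arguments are hypothetical.

```python
def solve_multiplier(propensities, bounds, size, tol=1e-12, max_iter=50):
    """Find chi such that sum_S f_S(chi * size) = size, with f_S(x) = p*x*B / (B + p*x)."""
    chi = 1.0                                     # start from the unadjusted language size
    for _ in range(max_iter):
        x = chi * size
        total = sum(p * x * b / (b + p * x) for p, b in zip(propensities, bounds))
        # derivative of the summed degrees with respect to chi
        slope = size * sum(p * b * b / (b + p * x) ** 2
                           for p, b in zip(propensities, bounds))
        step = (total - size) / slope
        chi -= step
        if abs(step) < tol:
            break
    return chi
```

Because each term is increasing and concave in the multiplier, the iteration approaches the root monotonically from below whenever the bounds sum to more than the language size.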
With the resulting joint distribution , tests of the joint degree counts in our dataset for consistency with multinomial sampling in 9 nine-language bins are shown in Fig. 7, and results of tests using 27 three-language bins are shown in Fig. 9. Binning nine languages clearly averages over enough language-specific variation to make the data strongly typical of a random sample (), while the product model (which also preserves marginals) is excluded at the level. The marginal acceptance of the data even for the joint configuration of three-language bins () suggests that language size is an excellent explanatory variable and that residual language variations cancel to good approximation even in small aggregations.
Single instances as to aggregate representation
The preceding subsection showed that intermediate scales of aggregation of our language data are sufficiently random that they can be used to refine probability models for mean degree as a function of parameters in the globally-aggregated graph. The saturation model, with data-consistent marginals and multinomial sampling, is weakly plausible for bins of as few as three languages. Down to this scale, we have therefore not been able to show a requirement for deviations from the independent sampling of links entailed by the use of the aggregate graph as a summary statistic. However, we were unable to find a further refinement of the mean distribution that would reproduce the properties of single language samples. In this section we characterize the nature of their deviation from independent samples of the saturation model, show that it may be reproduced by models of non-independent (clumpy) link sampling, and suggest that these reflect excess synonymous polysemy.
Power tests and uneven distribution of single-language $p$-values
To evaluate the contribution of individual languages versus language aggregates to the acceptance or rejection of random-sampling models, we computed $p$-values for individual languages or language bins, using the KL-divergence (3). A plot of the single-language $p$-values for both the null (product) model and the saturation model is shown in Fig. 10. Histograms for both single languages (from the values in Fig. 10) and aggregate samples formed by binning consecutive groups of three languages are shown in Fig. 11.
For samples from a random model, $p$-values would be uniformly distributed in the unit interval, and histogram counts would have a multinomial distribution with single-bin fluctuations depending on the total sample size and bin width. Therefore, Fig. 11 provides a power test of our summary statistics. The variance of the multinomial may be estimated from the large-$p$-value body where the distribution is roughly uniform, and the excess of counts in the small-$p$-value tail, more than one standard deviation above the mean, provides an estimate of the number of languages that can be confidently said to violate the random-sampling model.
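The counting argument in this paragraph can be sketched as below: histogram the $p$-values, estimate the mean and standard deviation of bin counts from the roughly uniform body, and sum the counts in the lowest bins that exceed the mean by more than one standard deviation. The bin width, the cut-off `body_start` defining the body, and the number of tail bins are illustrative assumptions, not the settings behind Fig. 11.

```python
from statistics import mean, pstdev

def excess_low_p(p_values, bins=10, tail_bins=1, body_start=0.5):
    """Estimate how many samples in the low-p tail exceed what a uniform distribution allows."""
    counts = [0] * bins
    for p in p_values:
        counts[min(int(p * bins), bins - 1)] += 1   # p = 1.0 falls in the last bin
    # bins whose lower edge is at or above body_start form the "uniform body"
    body = [c for i, c in enumerate(counts) if i / bins >= body_start]
    threshold = mean(body) + pstdev(body)           # one standard deviation above the body mean
    return sum(max(0.0, c - threshold) for c in counts[:tail_bins])
```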
From the upper panel of Fig. 11, with a total sample of 81 languages, we can estimate a number of excess languages at the lowest $p$-values of 0.05 and 0.1, with an additional 2–3 languages rejected by the product model in the $p$-value range . Comparable plots in Fig. 11 (lower panel) for the 27 three-language aggregate distributions are marginally consistent with random sampling for the saturation model, as expected from Fig. 9 above. We will show in the next section that a more systematic trend in language fluctuations with size provides evidence that the cause of these rejections is excess variance due to repeated attachment of links to a subset of nodes.
Excess fluctuations in degree of polysemy
If we define the size-weighted relative variance of a language analogously to the error term in Eq. (10), as
(11) 
Fig. 12 shows that has high rank correlation with
and a roughly linear regression over most of the range.^{2} Two languages (Itelmen and Hindi), which appear as large outliers relative to the product model, are within the main dispersion in the saturation model, showing that their discrepancy is corrected in the mean link number. We may therefore understand a large fraction of the improbability of languages as resulting from excess fluctuations of their degree numbers relative to the expectation from Poisson sampling.
^{2} Recall from Eq. (4) that the leading quadratic term in the KL-divergence differs from in that it presumes Poisson fluctuation with variance at the level of each word, rather than uniform variance across all words in a language. The relative variance is thus a less specific error measure.
Fig. 13 then shows the relative variance from the saturation model, plotted versus total average link number for both individual languages and three-language bins. The binned languages show no significant regression of relative variance away from the value unity for Poisson sampling, whereas single languages show a systematic trend toward larger variance in larger languages, a pattern that we will show is consistent with “clumpy” sampling of a subset of nodes. The disappearance of this clumping in binned distributions shows that the clumps are uncorrelated among languages at similar .
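One reading of the size-weighted relative variance, consistent with the statements that it uses a uniform variance across all words in a language and takes the value unity under Poisson sampling, is the summed squared deviation from the model means divided by the total link number of the language. The sketch below implements that reading; it is an assumption, since the exact expression of Eq. (11) is not reproduced here.

```python
def relative_variance(counts, model_probs):
    """Summed squared deviation of observed degrees from model means, per link.

    counts[s]      : observed links of concept s in one language (or bin)
    model_probs[s] : model probability of concept s for that language
    Under independent Poisson sampling with means n * model_probs[s], the
    expected value of this quantity is 1.
    """
    n_links = sum(counts)
    return sum((c - n_links * p) ** 2
               for c, p in zip(counts, model_probs)) / n_links
```

Values well above one for a single language would then indicate more dispersion across concepts than Poisson sampling of the mean model predicts.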
Correlated link assignments
We may retain the mean degree distributions, while introducing a systematic trend of relative variance with , by modifying our sampling model away from strict Poisson sampling to introduce “clumps” of links. To remain within the use of minimal models, we modify the sampling procedure by a single parameter which is independent of word , language-size , or particular language .
We introduce the sampling model as a function of two parameters, and show that one function of these is constrained by the regression of excess variance. (The other may take any interior value, so we have an equivalence class of models.) In each language, select a number of Swadesh entries randomly. Let the Swadesh indices be denoted . We will take some fraction of the total links in that language, and assign them only to the Swadesh entries whose indices are in this privileged set. Introduce a parameter that will determine that fraction.
We require that correlated link assignments be consistent with the mean determined by our model fit, since binning of data has shown no systematic effect on mean parameters. Therefore, for the random choice , introduce the normalized density on the privileged links
(12) 
if and otherwise. Denote the aggregated weight of the links in the privileged set by
(13) 
Then introduce a modified probability distribution based on the randomly selected links, in the form
(14) 
Multinomial sampling of links from the distribution will produce a size-dependent variance of the kind we see in the data. The expected degrees given any particular set will not agree with the means in the aggregate graph, but the ensemble mean over random samples of languages will equal , and binned groups of languages will converge toward it according to the central-limit theorem.
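The construction can be sketched with one mixture that satisfies the two constraints stated in the text: the modified distribution is normalized for every privileged set, and its average over uniformly random privileged sets equals the mean model. The particular mixture below, and the names `clumpy_distribution`, `sample_language`, `k`, and `q`, are assumptions for illustration and are not claimed to coincide with Eq. (14).

```python
import random
from itertools import combinations

def clumpy_distribution(p, privileged, q):
    """Move a fraction q * weight of the probability onto the privileged entries.

    p          : mean model probabilities over concepts (sums to 1)
    privileged : indices of the randomly selected concepts
    q          : clumping parameter in [0, 1]
    """
    boosted = set(privileged)
    weight = sum(p[s] for s in boosted)           # aggregated weight of the privileged set
    return [(1.0 - q * weight) * ps + (q * ps if s in boosted else 0.0)
            for s, ps in enumerate(p)]

def sample_language(p, k, q, n_links, rng):
    """Draw degree counts for one language with k privileged concepts."""
    privileged = rng.sample(range(len(p)), k)
    tilde = clumpy_distribution(p, privileged, q)
    counts = [0] * len(p)
    for s in rng.choices(range(len(p)), weights=tilde, k=n_links):
        counts[s] += 1
    return counts
```

For example, `sample_language(p, 2, 0.5, 100, random.Random(0))` draws 100 links for one language; averaging `clumpy_distribution` over all privileged sets of a given size returns `p`, while any single set concentrates links on its members.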
The proof that the relative variance increases linearly in comes from the expansion of the expectation of Eq. (11) for random samples, denoted
(15)  
The first expectation over is constant (of order unity) for Poisson samples, and the second expectation (over the sets that generate ) does not depend on except in the prefactor. Cross-terms vanish because link samples are not correlated with samples of . Both terms in the third line of Eq. (15) scale under binning as . The first term is invariant due to Poisson sampling, while in the second term, the central-limit theorem reduction of the variance in samples over cancels growth in the prefactor due to aggregation.
Because the linear term in Eq. (15) does not systematically change under binning, we interpret the vanishing of the regression for three-language bins in Fig. 13 as a consequence of fitting of the mean value to binned data as sample estimators.^{3} Therefore, we must choose parameters and so that regression coefficients in the data are typical in the model of clumpy sampling, while regressions including zero have nonvanishing weight in models of three-bin aggregations.
^{3} We have verified this by generating random samples from the model (15), fitting a saturation model to binned sample configurations using the same algorithms as we applied to our data, and then performing regressions equivalent to those in Fig. 13. In about of cases the fitted model showed regression coefficients consistent with zero for three-language bins. The typical behavior when such models were fit to random sample data was that the three-bin regression coefficient decreased from the single-language regression by .
Fig. 14 compares the range of regression coefficients obtained for random samples of languages with the values in our data, from either the original saturation model , or the clumpy model randomly resampled for each language in the joint configuration. Parameters used were (, ).^{4} With these parameters, of links were assigned in excess to of words, with the remaining of links assigned according to the mean distribution.
^{4} Solutions consistent with the regression in the data may be found for ranging from 3–17. was chosen as an intermediate value, consistent with the typical numbers of nodes appearing in our samples by inspection.
The important features of the graph are: 1) Binning does not change the mean regression coefficient, verifying that Eq. (15) scales homogeneously as . However, the variance for binned data increases due to the reduced number of sample points; 2) the observed regression slope 0.012 seen in the data is far out of the support of multinomial sampling from , whereas with these parameters, it becomes typical under while still leaving significant probability for the three-language binned regression around zero (even without ex-post fitting).