Human languages order information efficiently

10/09/2015 · by Daniel Gildea, et al.

Most languages use the relative order between words to encode meaning relations. Languages differ, however, in what orders they use and how these orders are mapped onto different meanings. We test the hypothesis that, despite these differences, human languages might constitute different `solutions' to common pressures of language use. Using Monte Carlo simulations over data from five languages, we find that their word orders are efficient for processing in terms of both dependency length and local lexical probability. This suggests that biases originating in how the brain understands language strongly constrain how human languages change over generations.


1 Introduction

We test the hypothesis that language change is subject to small but persistent biases that result, on average, in languages that are easier to process. Biases for grammars with higher processing efficiency could be the direct result of abstract learning biases [74, 21] or they could result from the pressures of language use [11, 42, 43], such as preferences that have been hypothesized to operate during language production [30, 48, 58] or biases originating in comprehension [37, 68, 71, 81].

If language change is indeed subject to biases towards languages with higher processing efficiency, and if these biases are sufficiently strong, they should accumulate over historical time, leading natural languages that have existed for sufficiently long to have higher than expected processing efficiency. This is the primary hypothesis we set out to test. Some evidence suggests that the sound structure and lexicon of natural languages exhibit properties that are expected under this hypothesis [35, 64, 69, 70, 85]. At those levels of linguistic organization, studies over the last couple of years have also provided more direct correlational evidence that language change is affected by processing [82]. Miniature language learning experiments have documented similar biases during language acquisition and shown that these biases can accumulate over generations of learners [50].

The level of linguistic organization that has remained elusive with regard to this question, however, is also arguably the one that makes human languages most unique compared to all other animal communication systems: syntax –or some aspects of syntax (recursion)– gives human languages infinite expressivity with finite means [45, 67], and it is syntax that has been taken to be the defining property of human languages (e.g., [40]; but see [72]). Whether at least some properties of the syntactic systems of languages can be derived from the fact that languages need to be processed continues to be a hotly debated question (for recent high-impact reviews, see [21, 51, 50, 72]). One reason why this question has not been directly addressed, as we detail below, is that until very recently it has been impossible to directly test whether the syntax of natural languages tends to facilitate processing efficiency. Here we present the results of several large-scale computational simulations that address these questions. For the purpose of presentation, we group these simulations into Studies 1 and 2.

Study 1 tests, and confirms, the hypothesis that natural languages have word orders that make them easier to process than expected by chance. From this it does not follow that natural languages have optimal or even close to optimal processing efficiency. Processing efficiency is presumably just one of several factors that might bias language change (the ease of acquisition of a grammar being another constraint). Still, if biases towards efficient processing are among the most influential factors influencing language change, we would expect human languages to have word orders that are close to optimal in terms of processing efficiency. This hypothesis is tested in Study 2. Taken together, Studies 1 and 2 suggest that language processing exerts a surprisingly strong bias on language change.

The hypothesis we test is one that has long intrigued language researchers. The pressures inherent to language processing have long been assumed to shape languages over time, including not only phonology and the lexicon [54, 58, 68, 46, 85], but also syntactic structure [11, 5, 6, 42, 74]. However, until relatively recently it has been virtually impossible to obtain reliable estimates of the processing efficiency of a language. Imagine one were to obtain such estimates experimentally (e.g., by obtaining estimates of the word-by-word processing times a native speaker of that language experiences while reading sentences from that language). A reliable estimate of the processing efficiency of an entire language would require reading data for a representative sample of the language. Ideally, this sample would be representative in terms of its lexical and grammatical distributions –i.e., it should contain both low and high frequency words, more and less complex syntactic structures, and so on. Further, reliable estimates would require that individual differences in, for instance, reading abilities are averaged out. In short, hundreds of readers would likely have to read thousands of sentences. This alone is a daunting task. With one of the two most commonly used methods to obtain word-by-word reading time estimates (self-paced reading), it takes between 0.5 and 1 hour to obtain reading times for 100 sentences. So, to obtain data from 100 readers on, say, 1000 sentences from a language –which would still not be a lot of sentences–, we would require about 500-1000 participant hours.

However, by far the biggest challenge lies in establishing a chance level against which to compare the processing efficiency of a language. This requires estimates of processing efficiency from a large set of randomized variants of a language (see below), which further increases the required experimental data by several orders of magnitude. Assessing the processing efficiency of a language based on human data is thus prohibitively expensive and time-consuming. The smallest study we present below would correspond to 500,000 participant hours. At the New York State minimum wage (as of 12/31/2014), this approach to assessing processing efficiency would cost over 4 million US dollars per language, for a total of roughly 20 million dollars for the five languages we examine. It would also arguably provide an utterly anti-conservative estimate of chance (to say the least): without extensive training on the new language variant, participants would experience massive interference from their native language, making it appear as if human languages are highly efficient simply because they are the variants participants are familiar with (it takes most learners of a language years to reach approximately native-like processing speeds).
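For concreteness, the arithmetic behind these figures, assuming the lower bound of 500 participant hours per language sample (see above), the 1000 randomized variants used below, and the New York State minimum wage of $8.75/hour that took effect on 12/31/2014:

\[
1000 \times 500 = 500{,}000 \text{ participant hours}, \qquad 500{,}000 \times \$8.75/\text{hour} \approx \$4.4 \text{ million per language}.
\]

Across the five languages examined here, this comes to roughly the 20 million dollars stated above.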

Here, we take an alternative approach. We take advantage of advances in computational psycholinguistics, natural language processing, and the availability of large linguistic databases. Rather than obtain estimates of processing efficiency from human readers, we automatically estimate the processing efficiency of a language from large linguistically annotated collections of text (syntactically annotated corpora). This is now possible because psycholinguistic research has identified grammar-dependent measures of processing efficiency. Here, we focus on two properties that are known to affect word-by-word processing times: a word’s Shannon information in context (i.e., its surprisal [39, 56]) and the length of the dependencies that are integrated at the word (dependency length [28, 29]).

We describe and further motivate these two measures in more detail below. For now, it suffices to say that processing difficulty (as assessed through, e.g., per-word reading times) is positively correlated with surprisal and dependency length. If a bias for processing efficiency affects the development of languages over time, it is thus expected that natural languages have lower average surprisal and shorter average dependency lengths than expected by chance.

We test these predictions against data from five languages: Arabic (Modern Standard), Czech, English (American), German, and Mandarin Chinese. These five languages were chosen for two reasons. First, we aimed for representative linguistic coverage. Languages often share linguistic properties simply because they are historically related or because they have co-existed in geographic proximity over long periods of time, with the ensuing language contact leading to lexical and grammatical borrowings. Here, we are interested in testing hypotheses that are assumed to apply universally across all languages. The less historically and geographically related the languages in our sample are, the more likely any effect found on this sample is to generalize beyond the particular sample to any language.

The five languages we investigated represent three major language families (Sino-Tibetan, Indo-European, and Semitic) and four language subfamilies (Chinese, Balto-Slavic, Germanic, and Arabic). The five languages also differ in a variety of linguistic properties that are known to be relevant to processing difficulty. For example, three of the languages in our sample have dominant Subject-Verb-Object (SVO) order, one of them has dominant VSO order (Arabic), and one has no dominant word order (German). The languages also differ in whether they productively use morphological means to mark grammatical relations, such as using case (Arabic, Czech, German), or not (English, Mandarin). As a third and final example, the languages differ in whether and how they express certain arguments to the verb. For example, pronominal elements in subject position (e.g., I, you, he) can optionally be omitted in Mandarin, are realized as suffixes on the verb in Czech, but are more or less obligatorily realized as separate words in English and German. Any of these properties could theoretically affect the measures we assess in our studies.

Second, as we describe next, sufficiently large electronic corpora with the necessary linguistic annotations are now available for these languages. Corpus size is critical for our purpose: the accuracy and reliability of the processing efficiency estimates described below increase with the number of words and sentences in the corpus. The corpora we employ in our studies are the largest available corpora for the five languages with the required linguistic annotation. The methods we use to obtain surprisal and dependency length estimates further increase the robustness of these estimates.

2 Data

The data for all languages comes from newspaper corpora. For English, we also had access to a corpus of conversational speech data with the required annotations. An overview of the corpora is provided in Table 1.

Language                       Sentences   Sentence length: mean (std dev)   min   max
Arabic (Modern Standard)           6,776          35.4 (26.9)                  1   387
Czech                             72,703          14.8 (9.6)                   1   166
English (American, written)       39,832          20.9 (10.1)                  1   122
English (American, spoken)        17,968           7.9 (8.2)                   1    92
German                            45,422          15.5 (9.6)                   1   115
Mandarin Chinese                  28,289          23.8 (16.4)                  1   212
Table 1: Overview of corpora used in the current studies

Specifically, the Arabic data consists of 6776 sentences from the Penn Arabic Treebank, in the dependency representation of the Prague Arabic Dependency Treebank version 1 [38]. The Czech data consists of 72,703 sentences from the Prague Dependency Treebank version 1 [7], as used in the CoNLL 2006 dependency parsing evaluation [10]. The English data comes from two sources. For written data, we use the 39,832 sentences from the Wall Street Journal portion of the Penn Treebank version 3 [65]. For spoken data, we use 17,968 sentences from the Switchboard corpus of spoken English [32]. The German data consists of 45,422 sentences from the TIGER corpus [9], which primarily consists of articles from the German newspaper “Frankfurter Rundschau”. Finally, the Mandarin Chinese data consists of 28,289 sentences from the Penn Chinese Treebank version 6.0 [83]. This includes newswire from Xinhua News Agency, articles from Sinorama Magazine, news from the website of the Hong Kong Special Administrative Region and transcripts from various broadcast news programs.

All corpora consist of sentences that have been manually annotated with syntactic structure. The annotation schemes differ somewhat between languages. An example from the English corpus is shown in Figure 1.

(S (SBAR (WHADVP (WRB When)) (S (NP (DT the) (NN man)) (VP (VBD arrived)))) (, ,) (NP (PRP I)) (VP (VBD left)) (. .))

Figure 1: Example syntax tree

We automatically converted the different syntactic annotations into a dependency representation, as shown in Figure 2. We use the dependency representation because dependency length has been shown to be an important variable affecting human language processing (see below). The dependency representation is a directed graph specifying, for each word in the sentence, the head word (or ‘sender’ [23]) that it modifies. For example, subjects and direct objects modify the main verb of the clause; determiners, adjectives, and relative pronouns typically modify nouns; prepositions can modify nouns or verbs; prepositions are modified by the object nouns; and so on. We convert trees to dependency representations using a set of rules which specify which child of each node in the tree is the head child, i.e., the main component of the phrase [62, 16]. Recursively choosing a head child for each node from the top down, we find a head word for each node in the tree. At each node in the tree, dependency relations are created indicating that the head word of the head child is modified by the head word of each other child. The dependency representation tends to be robust to the details of the syntactic annotation schemes used by various corpora.
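The conversion can be illustrated with a small sketch. The head-rule table below is a toy stand-in for the Collins [16] rules actually used, and the tree encoding is simplified, but the logic is the same in spirit: choose a head child at each node, percolate head words upward, and record a dependency from every non-head child's head word to the node's head word.

from typing import List, Tuple, Union

Tree = Union[str, Tuple[str, List["Tree"]]]

# For each phrase label, the preferred label of its head child (toy rules,
# not the full Collins rule set).
HEAD_RULES = {"S": "VP", "SBAR": "WHADVP", "NP": "NN", "VP": "VBD", "WHADVP": "WRB"}

def head_word(tree: Tree) -> str:
    """Recursively percolate head words up from the leaves."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    preferred = HEAD_RULES.get(label)
    head_child = next((c for c in children
                       if not isinstance(c, str) and c[0] == preferred),
                      children[-1])            # fall back to the last child
    return head_word(head_child)

def dependencies(tree: Tree, arcs=None) -> List[Tuple[str, str]]:
    """Collect (modifier_head_word, governing_head_word) pairs at every node."""
    if arcs is None:
        arcs = []
    if isinstance(tree, str):
        return arcs
    _, children = tree
    h = head_word(tree)
    for child in children:
        ch = head_word(child)
        if ch != h:
            arcs.append((ch, h))   # each non-head child's head modifies the node's head
        dependencies(child, arcs)
    return arcs

# The tree of Figure 1, without punctuation:
fig1 = ("S", [("SBAR", [("WHADVP", [("WRB", ["When"])]),
                        ("S", [("NP", [("DT", ["the"]), ("NN", ["man"])]),
                               ("VP", [("VBD", ["arrived"])])])]),
              ("NP", [("PRP", ["I"])]),
              ("VP", [("VBD", ["left"])])])

print(dependencies(fig1))
# [('When', 'left'), ('arrived', 'When'), ('man', 'arrived'), ('the', 'man'), ('I', 'left')]

The actual conversion additionally labels each arc with the syntactic categories of the head and modifier (and the SBJ function tag), as described below.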

Figure 2: Dependency structure, converted from the syntactic tree in Figure 1. (Dependency arcs over the words "When the man arrived I left", with dependency types DTNN, SBJS, SSBAR, SBARS, and SBJS.)

For English, German, and Chinese, we extracted dependencies from constituent representations, converting the representation of Figure 1 to that of Figure 2. Specifically, we extract dependencies using the head-finding rules of Collins [16]. Our dependency types consist of pairs of syntactic categories, with one element representing the category of the maximal projection of the head, and one representing the category of the maximal projection of the modifier. Additionally, we include a special subject type in order to differentiate verb subjects and direct objects, by using the “SBJ” function tag in the Penn treebank annotation (see Figure 2).

For Czech and Arabic, our data was originally annotated in a dependency representation. We take advantage of the relation labels provided, which include relations such as subject, object, attribute, and so on. Our dependency types consist of both the relation of a word and the relation of its parent, in order to allow us to distinguish between, for example, an attribute relation in a subject noun phrase and an attribute relation modifying a verb in a relative clause.

3 Estimating the Processing Efficiency of Languages

As outlined in the introduction, we focus on two measures of processing efficiency that have received broad empirical support: surprisal [39, 56] and dependency length [28, 29].

The surprisal of a word is identical to its Shannon information (in bits) in context, which is defined as the logarithm (to base 2) of the inverse of its probability in context.

\text{surprisal}(w_i) = \log_2 \frac{1}{P(w_i \mid \text{context})}    (1)
                      = -\log_2 P(w_i \mid \text{context})    (2)
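As a concrete numerical illustration (probabilities chosen purely for exposition, not taken from the corpora):

\[
-\log_2 \tfrac{1}{8} = 3 \text{ bits}, \qquad -\log_2 0.99 \approx 0.014 \text{ bits},
\]

so a word that is one-in-eight likely in its context carries roughly 200 times as much information as a word that is nearly certain given its context.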

A word’s surprisal (conditioned on all relevant preceding context) has been shown to be identical to the relative entropy (or Kullback-Leibler divergence) between the distribution over all possible parses prior to the word and the distribution over all possible parses after processing the word [55]. Surprisal can thus be understood as a measure of the amount of syntactic belief-updating that is associated with processing the word. Crucially, a word’s surprisal has been found to be a good predictor of its reading times in context [8, 18, 25, 66, 75]. For example, in a large-scale reading experiment, Smith and Levy [75] found that per-word reading times were linear in the word’s surprisal. This relation held over six orders of magnitude in the probability, from almost perfectly predictable instances of words to barely predictable instances. Surprisal has also been found to be reflected in neural responses.

Dependency length, too, has been found to affect processing difficulty, with longer dependencies leading to longer reading times at their integration point. Consider the word left in the example in Figure 2. Two dependencies end –and are thus assumed to be integrated– at the word left. One is the dependency between the verb left and its subject (I). This dependency is local. The other dependency is between the verb and its temporal modifier (when the man arrived). This dependency is non-local. Psycholinguistic research has found that non-local dependencies tend to cause processing difficulty ([28, 29, 36]; though see [57, 79] for discussion). There is also evidence that cross-linguistically speakers prefer shorter dependencies over longer ones when their language provides them with two ways of encoding a message (e.g., for Basque [73]; English: [2, 1, 59]; Japanese [84]; Korean [14]; for reviews and discussion, see [41, 47]).

Here, we estimate these two measures for entire languages. That is, unlike in psycholinguistic work, which has focused on the word-by-word effects of surprisal and dependency length on language processing, we are estimating surprisal and dependency length at the system level. To us, these measures are of interest because they provide an estimate of the average processing difficulty a native speaker of a language experiences while processing that language. This allows us to test whether natural languages have lower average surprisal and shorter average dependency lengths than expected by chance. An overview of the procedure is given in Figure 3.

Figure 3: Overview of procedure used to compare the processing efficiency of natural languages (measured in terms of their average information density and dependency length) against the baseline efficiency expected by chance. Weighted grammars are the random reorderings of sets of dependencies (see text).

There are other factors that are known to contribute to processing efficiency. For example, among the primary contributors to word-by-word processing times are lexical properties. To name just a few of these properties, a word’s length, frequency, neighborhood density, part of speech, and morphological structure are all correlated with the average time it takes to comprehend or produce that word [3, 4, 8, 17, 18, 60, 63]. As expected under the general hypothesis tested here, several studies have found the lexicons of languages to exhibit properties that are consistent with the hypothesis that pressures for processing efficiency shape the phonology of words over time [12, 15, 35, 64, 69, 70, 85]. Here, however, we are interested in a grammatical property of languages –specifically, word order– and how it affects processing efficiency. It is grammatical properties that differ between grammatical systems, thus allowing processing preferences to affect the ‘selection’ of these properties over time. The approach we present below therefore holds constant all context-insensitive lexical properties, ruling these factors out as an explanation for hypothetical preferences for certain grammatical systems.

3.1 Estimating Processing Efficiency

3.1.1 Surprisal and Information Density

We estimate surprisal by means of a trigram model, which conditions a word’s probability on the previous two words. For example, the probability of the sentence in Figure 2 would be modeled as:

P(\textit{When the man arrived I left}) = P(\textit{When} \mid \langle s \rangle \, \langle s \rangle) \cdot P(\textit{the} \mid \langle s \rangle \, \textit{When}) \cdot P(\textit{man} \mid \textit{When the}) \cdot P(\textit{arrived} \mid \textit{the man}) \cdot P(\textit{I} \mid \textit{man arrived}) \cdot P(\textit{left} \mid \textit{arrived I})

where \langle s \rangle indicates a sentence boundary.

N-gram models of this type are widely used in speech recognition [49, 33] and machine translation [53]. N-gram models like the ones used here are also known to provide good approximations to computationally far more complex language models, such as probabilistic phrase structure grammars (see e.g., [27]). One reason for this is presumably that the local context of a word often captures many semantic phenomena through the co-occurrence of related words (e.g., read and book in the trigram read the book). Trigrams also capture local syntactic patterns, such as the requirement of accusative case after certain prepositions (e.g., to me) or subject-verb agreement (e.g., man arrives).

N-gram models also have two properties that make them particularly appealing for the current purpose. First, estimating n-gram probabilities from corpora is far less computationally complex than estimating the same probabilities from structurally more complex models (such as probabilistic phrase structure grammars). Since, as we detail below, this modeling needs to be repeated many times for each language, computational simplicity is critical for the current study. Second, n-gram models have also been successfully used as models of human language processing [8, 25, 48]. In fact, recent studies have argued that models that primarily rely on the information captured by local context (such as the two preceding words) fare better in explaining word-by-word variation in human processing times than structurally more complex models ([25]; but see also [24, 77]). Indeed, the finding we mentioned above, that a word’s probability in context is log-linearly related to the processing difficulty it causes, was based on a trigram estimate of the type employed here [75]. In short, trigram models are well-suited for the current purpose of estimating processing efficiency. One reason for this might be that human language processing preferentially relies on more local information –for example, because non-local information will tend to be less informative or because non-local information will be more costly or less reliably retrieved from memory (consistent with the observation that non-local dependencies are harder to process).

In order to obtain reliable estimates of a word’s trigram probability even when the preceding two words were rarely (or never) observed in the training corpus, we smooth our trigram probabilities using the interpolated Kneser-Ney method [52, 13]. Kneser-Ney is a technique that assigns probability to unseen n-grams according to a measure of how likely the words in the trigram are to combine with new words. Using Kneser-Ney smoothed trigram probabilities has two advantages over alternative n-gram models. First, Kneser-Ney smoothing performs well across a wide variety of tasks and is considered one of the most effective methods of dealing with unobserved trigrams. Second, it is specifically Kneser-Ney smoothed trigram estimates of surprisal that recent work found to be linearly correlated with reaction times [75]. This makes this particular approach well-suited for our purpose of estimating the average processing efficiency of a language.
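To make the estimation step concrete, here is a minimal, self-contained sketch of computing per-word surprisal from a trigram model. For brevity it uses simple linear interpolation of trigram, bigram, and add-one-smoothed unigram relative frequencies rather than the interpolated Kneser-Ney estimator used in the studies; the interpolation weights are illustrative.

import math
from collections import Counter

BOS = "<s>"                       # sentence-boundary padding symbol
L3, L2, L1 = 0.6, 0.3, 0.1        # interpolation weights (illustrative)

def count_ngrams(sentences):
    uni, bi, tri = Counter(), Counter(), Counter()
    for sent in sentences:
        toks = [BOS, BOS] + sent
        for i in range(2, len(toks)):
            uni[toks[i]] += 1
            bi[(toks[i - 1], toks[i])] += 1
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
    return uni, bi, tri

def prob(w, u, v, uni, bi, tri):
    """Interpolated estimate of P(w | u v)."""
    p3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    p2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
    p1 = (uni[w] + 1) / (sum(uni.values()) + len(uni) + 1)   # add-one floor
    return L3 * p3 + L2 * p2 + L1 * p1

def surprisals(sentence, uni, bi, tri):
    """Per-word surprisal, in bits, of a test sentence."""
    toks = [BOS, BOS] + sentence
    return [-math.log2(prob(toks[i], toks[i - 2], toks[i - 1], uni, bi, tri))
            for i in range(2, len(toks))]

train = [["when", "the", "man", "arrived", "i", "left"],
         ["the", "man", "left"]]
print(surprisals(["the", "man", "arrived"], *count_ngrams(train)))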

Surprisal and information density can be estimated at different levels of linguistic description. For example, in the psycholinguistic literature on sentence processing, surprisal is usually calculated per word [25, 56, 75]. However, psycholinguistic research on phonetic production has also calculated information density at the sub-lexical level (e.g., the information per sound in a word, [15, 78]). Natural languages could theoretically be efficient at one level but not the other.

Here, we consider two estimates of information density. The first estimate is the by-word information density based on the unnormalized per-word information derived from the trigram model. This is essentially the same measure that has been found to correlate linearly with word-by-word reading times in English [75].

The second estimate is a normalized by-character estimate of the amount of information per sound or writing unit. For this second estimate, we first counted the number of unique characters in the database (see Data above). Specifically, we used the logarithm to base 2 of that count, thereby measuring the number of bits one would need to encode all unique characters observed in the database for each language. For example, there were 48 unique characters (5.6 bits) in our English corpora (this includes special symbols like $) and 4394 unique characters (12.1 bits) in our Mandarin database. We then normalized the information content of each word by the number of letters in that word multiplied by the per-character bits for that language. This normalization has the advantage that it applies the same standard across different writing systems. For example, Mandarin Chinese employs a logographic writing system, so that there are no letters. For spoken language, our normalization approximates the number of phonemes in a word and its spoken duration, while also taking into account the number of distinct sounds in the language. For written language, our normalization corresponds directly to information per character, taking into account the number of distinct symbols used in the database.

We note that our results are not sensitive to the choice of normalization: all results were qualitatively similar without any length normalization. Furthermore, the specific normalization procedure chosen here only affects comparisons across languages (which is not of theoretical interest here), as the normalization constant does not vary within one language (see Equation 5 below, where only the word length varies by word, whereas the per-character bits are a constant factor).
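A small sketch of the per-character normalization just described (anticipating Equation 5 below); the per-word surprisal values here are placeholders that would in practice come from the trigram model:

import math

def per_character_bits(words):
    """log2 of the number of unique characters observed in the corpus."""
    return math.log2(len({ch for w in words for ch in w}))

def by_character_density(words, surprisals, char_bits):
    """Average normalized per-character information over the word tokens."""
    return sum(s / (len(w) * char_bits) for w, s in zip(words, surprisals)) / len(words)

corpus = ["when", "the", "man", "arrived", "i", "left"]
char_bits = per_character_bits(corpus)      # log2(13) ≈ 3.7 bits for this toy corpus
print(by_character_density(corpus, [5.0, 2.0, 4.0, 7.0, 3.0, 6.0], char_bits))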

3.1.2 Dependency Length

Our other measure of processing difficulty is dependency length. This metric can be read off the dependency trees by counting the number of words from each modifier to its head in the linear order of the sentence. For example, in the dependency structure in Figure 2, the word left is the fifth word from the word when; the SBARS dependency between when and left thus has length 5. The SBJS dependency between the words I and left, on the other hand, has length 1. In our experiments, we compute the average length of all the dependencies in all sentences. In Figure 2, we have dependencies of length 1, 1, 1, 3, and 5, for an average length of 2.2.
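The computation is straightforward; a sketch over the arcs of Figure 2 (with 1-indexed word positions) reproduces the average of 2.2 given above:

def average_dependency_length(arcs):
    """arcs: list of (modifier_position, head_position) pairs."""
    return sum(abs(head - mod) for mod, head in arcs) / len(arcs)

# When(1) the(2) man(3) arrived(4) I(5) left(6), arcs as in Figure 2:
figure2_arcs = [(2, 3), (3, 4), (4, 1), (1, 6), (5, 6)]
print(average_dependency_length(figure2_arcs))   # 2.2, as in the text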

A number of different metrics have been proposed to measure dependency length. For example, dependency length is sometimes measured in terms of the number of intervening non-discourse given referents [29], or in terms of the syntactic complexity of intervening material. All of these measures tend to be highly correlated [76, 80]. For the current purpose, we measure dependency length in words (following [23, 31, 43, 44, 59, 73]). This measure has the advantage that it is easy to calculate and achieves broad-coverage (see also [18]).

3.2 Estimating Chance

To obtain a chance baseline against which to compare the processing efficiency of each language, we create 1000 variants of each language. Specifically, we obtain 1000 pseudo-grammars by randomly re-ordering the dependency structures described above, while keeping the dependency relations between heads and their dependents intact. Each pseudo-grammar thus describes a theoretically possible reordering of the actual human language. Critically, each variant holds constant:

  • all context-insensitive lexical properties, including all semantic and phonological factors at the level of the word

  • the number and identity of the sentences in the corpus

  • the number of words in each sentence (which is known to affect estimates of the per-word information) and in the corpus

  • the identity of the words in each sentence (including their part of speech) and in the corpus

  • the number of heads, dependents, and dependencies in each sentence and in the corpus

  • the frequency of different types of dependencies in each sentence and in the corpus

We then measure the average information density and dependency length of each variant of a language, allowing us to compare the information density and dependency length of the actual languages against what is expected by chance (i.e., against the distribution of information density and dependency lengths observed for the 1000 pseudo-grammars derived from that language).

For our representation of a possible fixed order, we use weighted grammars [31]. In this representation, each dependency type (e.g., SBJS in Figure 2) is assigned a numeric weight between -1 and 1. The head itself always has weight zero. Dependencies with negative weights appear to the left of the head, and dependencies with positive weights to the right. For all studies reported below, these weights were held constant for each dependency type. More specifically, we held orders constant within each set of dependencies, where a set refers to all dependency types that end in the same head. One example of a dependency set is all dependencies that end in a head noun (i.e., all noun phrase-internal dependencies). Weights thus define a deterministic order over all dependents of a head, from left to right in order of their numeric weights. For example, with regard to the head of the sentence (S), a given pseudo-grammar might define the order SBJ SBAR S PP NP. The relative order for the four dependency types in this rule (SBJS, SBARS, PPS, and NPS) then also implies an order of SBJ S NP for sentences in which only the SBJ and NP dependencies connect to S. As we show in Control Study 1, this is a conservative assumption for the calculation of chance for both information density and dependency length, i.e., it biases against the hypothesis tested here.

An example of a possible re-ordering of the example sentence of Figure 2 is shown in Figure 4.

Figure 4: Word order of a pseudo-grammar for the same sentence shown in Figure 2. (The words are reordered as "When arrived the man left I", with the same dependency arcs as in Figure 2.)
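A minimal sketch of this weighted-grammar reordering, applied to the dependency structure of Figure 2; the specific weights below are made up, chosen so that the output matches the order shown in Figure 4:

import random

def random_pseudo_grammar(dep_types, rng=random.Random(0)):
    """One random weight per dependency type."""
    return {t: rng.uniform(-1, 1) for t in dep_types}

def linearize(head, children, weights):
    """Order `head` and its dependents according to the grammar's weights.
    `children` maps each word to its list of (dependency_type, dependent) pairs."""
    deps = sorted(children.get(head, []), key=lambda d: weights[d[0]])
    words = []
    for dep_type, dependent in deps:
        if weights[dep_type] < 0:                       # left of the head
            words += linearize(dependent, children, weights)
    words.append(head)
    for dep_type, dependent in deps:
        if weights[dep_type] >= 0:                      # right of the head
            words += linearize(dependent, children, weights)
    return words

# Dependency structure of Figure 2 ("left" is the root):
children = {"left": [("SBARS", "arrived"), ("SBJS", "I")],
            "arrived": [("SSBAR", "When"), ("SBJS", "man")],
            "man": [("DTNN", "the")]}

# These particular weights reproduce the reordering shown in Figure 4:
weights = {"SBARS": -0.5, "SSBAR": -0.7, "DTNN": -0.3, "SBJS": 0.4}
print(" ".join(linearize("left", children, weights)))   # When arrived the man left I

A random pseudo-grammar simply replaces the hand-picked weights with a draw from random_pseudo_grammar.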

For each pseudo-grammar specified by a set of weights \theta, we estimate the information density and dependency length with the following procedure:

  1. Order the training portion of our corpus according to \theta.

  2. Estimate a Kneser-Ney trigram language model p_\theta from the training corpus.

  3. Order the test portion of our corpus according to \theta.

  4. Compute the average per-word information, I_w(\theta), and normalized per-character information, I_c(\theta), in the test data according to p_\theta, where N is the number of word tokens in the test data, w_t is the t-th word token in the test data, w_{t-2}\, w_{t-1} are the two preceding word tokens, and p_\theta(w_t \mid w_{t-2}\, w_{t-1}) is the probability according to the trigram model:

      I(w_t) = -\log_2 p_\theta(w_t \mid w_{t-2}\, w_{t-1})    (3)

      I_w(\theta) = \frac{1}{N} \sum_{t=1}^{N} I(w_t)    (4)

      I_w(\theta) is thus the average per-word information of a language sample, which we refer to below as the by-word information density. For the by-character estimate of information density:

      I_c(\theta) = \frac{1}{N} \sum_{t=1}^{N} \frac{I(w_t)}{|w_t| \cdot \log_2 C}    (5)

      where |w_t| is the length of w_t in characters, and C is the number of unique characters in the database.

  5. Compute the average dependency length over all dependencies in the test data.

In all experiments, we use 9/10s of the available data as training data in step 1 above, and the remaining 1/10 as test data in steps 4 and 5. This procedure takes several hours (a few minutes per random order) of computer time, as it involves building a large table of n-gram counts for each new random order considered.

Figure 5: Average information density and dependency length of randomly generated pseudo-grammars. Individual data points show the 1000 samples for each language. Contour lines show a 2D density estimation based on a bivariate normal kernel, summarizing the distribution of the random pseudo-grammars. One panel shows the by-character estimate of information density and the other the by-word estimate.

Figure 5 shows the information density and dependency length of all 1000 samples for each of the five languages. As indicated by the non-parametric smoother in that figure, information density and dependency length are positively correlated in the random pseudo-grammars. Although the strength of this correlation differs across languages (see Table 2), the correlation is significant in all languages. This means that shorter dependencies (i.e., keeping words that belong together adjacent to each other) also tend to reduce information density. This correlation makes intuitive sense. Recall that we are using a trigram language model to estimate information density. To the extent that the syntactic dependencies annotated in the corpora we employed (see Data above) capture relevant statistical dependencies between words, it is expected that trigram probabilities will be higher (and information density estimates lower) for word orders that keep syntactically dependent words close together (and thus more often within the three-word window). It is, however, an interesting question for future research whether the correlation we observe here holds even when computationally more complex estimates of word probabilities are used.

              Pearson correlation
              by-character   by-word
Arabic            0.50         0.41
Czech             0.25         0.47
English           0.40         0.62
German            0.15         0.43
Mandarin          0.23         0.24
Table 2: Correlation of information density and dependency length in the random samples created for each language.

4 Results

We first compare the actual information density and dependency length of the five languages in our sample against the pseudo-grammars derived from them. Then we present three control studies that serve to illustrate the robustness of our results.

4.1 Study 1: Comparing the information density and dependency length of human languages to chance

Figure 6 shows both the actual human languages and the 1000 random samples for each of them on a plane defined by the two measures of processing efficiency considered here. Table 3 provides a numerical summary. As can be seen, the processing efficiency of actual Arabic, Czech, English, German, and Mandarin Chinese is considerably better than expected by chance. Specifically, applying a standard significance criterion of p < .05, all five languages have lower information density than expected by chance, and all languages but Mandarin have shorter dependency lengths than expected by chance.
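The "higher than random" counts in Table 3 amount to an empirical Monte Carlo comparison; a sketch of that comparison (with illustrative scores, not the actual corpus values):

def compare_to_chance(actual_score, random_scores):
    """Number of random pseudo-grammars that do as well as or better than the
    actual language (lower is better), and the corresponding empirical p-value."""
    better = sum(1 for s in random_scores if s <= actual_score)
    return better, better / len(random_scores)

# Illustrative values only: 39 of 1000 random reorderings with lower by-character
# information density than actual Czech would give an empirical p of .039.
random_scores = [0.430] * 39 + [0.445] * 961
print(compare_to_chance(0.435, random_scores))   # (39, 0.039)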

Figure 6: Illustration of the processing efficiency of actual human languages (solid shapes) compared to their processing efficiency expected by chance (contour lines). Processing efficiency is measured in terms of information density (y-axis) and dependency length (x-axis), which are both known to be positively correlated with processing times. Processing efficiency is thus higher the lower the average information density and the lower the average dependency length. Contour lines show a 2-dimensional density estimation based on a bivariate normal kernel, summarizing the distribution of the random pseudo-grammars. One panel uses by-character information density; the other uses by-word information density.
             Average information density (by-character)   Average information density (by-word)   Average dependency length
             actual      higher than random               actual       higher than random          actual     higher than random
Arabic       0.143       0/1000                            9.434       0/1000                      3.21       0/1000
Czech        0.435       39/1000                          12.150       0/1000                      2.94       0/1000
English      0.377       0/1000                            8.781       0/1000                      2.25       0/1000
German       0.380       19/1000                          10.395       0/1000                      3.28       16/1000
Mandarin     0.159       0/1000                            9.760       0/1000                      3.44       220/1000
Table 3: Mean information density and dependency length of each actual human language (on test data) compared to 1000 random pseudo-grammars of that language when constituent order is assumed to be fixed within each dependency type. The "higher than random" columns give the number of random pseudo-grammars that score lower than (i.e., outperform) the actual language.

Next, we present three control studies that demonstrate the robustness of our findings. Since these studies are computationally demanding, we limit them to one of the two information density estimates. We chose to focus on the per-character estimate, as we take it to be less reflective of properties specific to the writing systems of the languages (such as what constitutes a written word). For example, whereas compounds are generally written as one word in German (e.g., Rotwein), they tend to be written as separate words in English (e.g., red wine); the per-character estimate of information density is not affected by this orthographic decision. Additionally, this is the more conservative approach given the results in Table 3, which are stronger for by-word information density.

4.2 Control study 1: Fixed vs. flexible constituent order

The results in Table 3 are based on pseudo-grammars that were calculated under the assumption that languages have fixed constituent orders within a dependency type. Interestingly, this assumption approximates, but does not quite match, what is observed for human languages. Table 4 provides a measure of the word order consistency of the languages in our sample.

Percentage
Arabic 92.7%
Czech 74.5%
English 92.8%
German 88.7%
Mandarin 90.1%
Table 4: Average percentage of most frequent ordering within a dependency set, averaged across all dependency sets, for the languages in our sample.

For the calculation of chance for information density, the fixed-order assumption made in Study 1 is expected to be conservative, biasing against the hypothesis we are testing: on average, fixed constituent orders increase the predictability of words, thereby lowering the average information density. This should give the pseudo-grammars derived for Study 1 a distinct advantage compared to the actual human languages, which often do not have fixed constituent orders (or at least not entirely fixed orders). For example, even English, which is considered a relatively fixed order language, allows constituent order variation. Most obviously this holds for alternations, such as the choice between active and passive or heavy noun phrase shift (e.g., he put the book on the table vs. he put on the table the book, but he put the book he had gotten from a long lost friend on the table vs. he put on the table the book he had gotten from a long lost friend). Generally, the assumption of fixed constituent orders in Study 1 should thus be conservative with regard to information density.

However, for the calculation of chance for dependency length, the consequences of the assumption of fixed constituent order are less clear. It is possible that this assumption made the dependency length results anti-conservative. We therefore repeated Study 1 while allowing constituent order to vary completely freely. That is, rather than use the weighted grammar approach described above in creating random pseudo-grammars, we randomly ordered all dependents for each instance of a dependency.

Table 5 summarizes the results. For all languages in our sample, both the by-character information density and the dependency length of the actual human languages were better than those observed for any of the 1000 pseudo-grammars. Control Study 1 thus replicates the results of Study 1 and shows that the assumption of fixed constituent order made in Study 1 biases against our hypothesis, relative to allowing constituent order freedom.

             Average information density (by-character)   Average dependency length
             actual      higher than random               actual     higher than random
Arabic       0.143       0/1000                           3.21       0/1000
Czech        0.435       0/1000                           2.94       0/1000
English      0.377       0/1000                           2.25       0/1000
German       0.380       0/1000                           3.28       0/1000
Mandarin     0.159       0/1000                           3.44       0/1000
Table 5: Mean information density and dependency length of each actual human language (on test data) compared to 1000 random pseudo-grammars of that language when constituent order is completely randomized.

We further note that the results of Study 1 were also replicated when constituents were allowed to order freely but the position of dependents relative to the head was held constant (e.g., a dependent type that occurred to the right of its head remained to the right). Taken together, this suggests that the results obtained in Study 1 are robust to assumptions about constituent order freedom in the calculation of the chance baseline.

4.3 Control study 2: Sensitivity to Genre and Mode

While our primary datasets are taken from newspaper text, we wanted to test whether our results were sensitive to the genre of the corpus, and in particular whether edited, written text might have different properties than spontaneous, spoken text. Unfortunately, large syntactically annotated corpora are available only for very few languages. Here we test our hypothesis against conversational speech data from English.

Repeating Study 1 on the English Switchboard corpus of conversational speech, we again find that the processing efficiency of actual English is better than expected by chance. Actual conversational English had better per-character information density than all of 1000 random word orders, and better dependency length than all of 1000 random orders (both empirical p < .001).

4.4 Control study 3: Sensitivity to Corpus Size

Finally, we tested the sensitivity of our results to the amount of text available for estimating the parameters of the Kneser-Ney trigram model. We thus repeated the analysis reported above using a much smaller data set of 1000 sentences randomly drawn from the Wall Street Journal (i.e., about 2.5% of the original corpus). Unsurprisingly, information density estimates were higher compared to the main study (reported in Table 3). This is a direct consequence of the reduced data size: for smaller corpora, there will be more n-grams in the test data that were never observed in the training data, and these n-grams are assigned low probability (and thus high information). The estimates based on a smaller corpus are also expected to be less reliable because a large proportion of the n-grams in the test data will be unseen in the training data, regardless of the word order used. To quantify this effect, using actual English word order, we find that with 1000 training sentences only 10% of bigram tokens in the test data have been observed in the training data, even when excluding test words that never occur in the training data. In contrast, with our full training set for English, the corresponding figure is 31%. Despite the low coverage of n-grams in our 1000-sentence training set, we again find that actual English has better per-character information density than all of 1000 random word orders, and better dependency length than all of 1000 random orders (both empirical p < .001).

4.5 Summary

We find that the five languages we investigated all have significantly higher processing efficiency than expected by chance. This holds for both measures of processing efficiency considered here. For information density, all five languages fall at or above the 95th percentile of the distribution defined by the random pseudo-grammars (i.e., they outperform at least 95% of the random reorderings). For dependency length, four of the five languages fall at or above the 95th percentile, and one language (Mandarin) falls at approximately the 75th percentile. Our findings held regardless of the size of the corpus and, more importantly, for both written and spoken language. Taken together, this suggests that language use –specifically, pressures that originate in the incremental processing of language– shapes the grammar of languages over time.

We note that information density and dependency length were not independent in our random samples. It is thus possible that what we have –following the literature– treated as two independent measures of processing efficiency is in reality driven by one underlying cause. This would not affect the conclusion that the processing efficiency of human languages is better than expected by chance. Further, it is worth noting that the correlations in Table 2 are only mild to moderate. Indeed, Study 2 finds that information density and dependency length can be optimized separately.

5 Study 2: Are natural languages optimal with regard to information density and dependency length?

Next, we tested whether an even stronger claim can be made. Specifically, we asked whether the pressures for efficient processing are sufficiently strong to constrain language change to the subspace of possible grammars that is optimal (or very close to optimal) in terms of processing efficiency. As outlined in the introduction, many pressures of language use have been hypothesized to bias and constrain language change, thereby contributing to cross-linguistically observed properties of languages. We thus did not expect languages to be optimal in terms of processing efficiency. We begin by describing the procedure used to estimate the minimum possible information density for each language. Then we describe the procedure used to estimate the minimum possible dependency length for each language. Finally, we present a procedure that jointly optimizes both information density and dependency length for each language, allowing us to compare human languages against pseudo-grammars that optimally trade off the two major contributors to processing efficiency. The results of these three calculations are presented and discussed at the end of this section.

5.1 Computing pseudo-grammars with optimal information density

We begin by describing the procedure used to calculate the minimum possible information density for each language. We search for the pseudo-grammar weights that minimize the average information density:

\hat{\theta}_{ID} = \arg\min_{\theta} I(\theta)

where I(\theta) is the average information density of the test data under the pseudo-grammar with weights \theta.

In order to find \hat{\theta}_{ID}, we optimize one weight at a time, holding all others fixed, and iterating through the set of weights to be set. The objective function describing information density is piecewise constant: it does not change until one weight crosses some other weight, causing two dependents to reverse order, at which point the objective jumps discontinuously. This non-differentiability implies that gradient-based methods do not apply. However, because the objective function only changes at points where one weight crosses another's value, the set of segments of weight values with different values of the objective function can be exhaustively enumerated. In fact, the only significant points are the values of other weights for dependency types which occur in the corpus attached to the same head as the dependency being optimized. We build a table of interacting dependencies as a preprocessing step on the data, and then, when optimizing a weight, consider the sequence of values between consecutive interacting weights. For each value in this sequence, we evaluate the objective function on the test corpus and choose the value yielding the minimum value of the objective.
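A compact sketch of this coordinate-wise search; here, objective and interacting stand in for the corpus-level information density computation and the table of interacting dependency types, and the toy usage at the end is purely illustrative:

def candidate_values(other_weights):
    """One probe value per segment between consecutive interacting weights."""
    points = sorted(set(other_weights) | {-1.0, 1.0})
    return [(a + b) / 2 for a, b in zip(points, points[1:])]

def optimize_weights(weights, interacting, objective, sweeps=20):
    """Greedy coordinate descent over a dict of weights keyed by dependency type."""
    for _ in range(sweeps):
        improved = False
        for dep_type in weights:
            others = [weights[t] for t in interacting[dep_type]]
            best_w, best_val = weights[dep_type], objective(weights)
            for cand in candidate_values(others):
                trial = {**weights, dep_type: cand}
                val = objective(trial)
                if val < best_val:
                    best_w, best_val, improved = cand, val, True
            weights[dep_type] = best_w
        if not improved:
            break                  # no weight moved: a local optimum was reached
    return weights

# Toy usage: a piecewise-constant objective that simply prefers SBJ before OBJ.
toy_interacting = {"SBJ-S": ["OBJ-S"], "OBJ-S": ["SBJ-S"]}
toy_objective = lambda w: 0.0 if w["SBJ-S"] < w["OBJ-S"] else 1.0
print(optimize_weights({"SBJ-S": 0.9, "OBJ-S": -0.9}, toy_interacting, toy_objective))
# {'SBJ-S': -0.95, 'OBJ-S': -0.9}: the subject weight has moved below the object weight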

This search procedure is similar to one previously used for finding the grammar that minimizes dependency length [31]. Our present problem, however, is considerably more computationally intensive: when evaluating each possible value of each weight, we must examine the entire test corpus, whereas, when optimizing dependency length, one can take a shortcut in evaluation by only considering sentences containing the dependency type whose weight has been modified. In our case, since all the parameters of the n-gram language model are subject to change at each step, we must evaluate the language model on every word in the test corpus.

This optimization process is not guaranteed to find the global minimum, but it is guaranteed to converge, simply because there are a finite number of objective function values and the objective function must decrease at each step at which weights are adjusted. Running the optimization procedure from different random initializations, we find that, while the final grammars are not identical, they are very close in terms of our objective function, which indicates that we are likely to be close to the global optimum. For example, in ten runs optimizing the by-word information density of English, the final values of the objective function showed only negligible variance.

For our experiments, we find that the optimization procedure converges after several days of computer time. However, the procedure reaches points very close to the eventual optimum within several hours.

5.2 Computing pseudo-grammars with optimal dependency length

We also optimize our pseudo-grammars in order to find the weights giving the lowest dependency length:

\hat{\theta}_{DL} = \arg\min_{\theta} D(\theta)

where D(\theta) is the average dependency length for the pseudo-grammar with weights \theta. The search over weights uses the same algorithm described above.

Figure 7: Comparison of human languages to grammars that optimize information density and dependency length. Gray-shaded lines show grammars that are optimal under a given trade-off between information density and dependency length (bottom-most point: grammar with optimal information density; left-most point: grammar with optimal dependency length; in between: grammars that optimize weighted sums of information density and dependency length). To facilitate cross-language comparison, information density is normalized by character and both information density and dependency length are standardized, so that the axes represent z-values. Contour lines and fill show the distribution of the random reorderings of all languages combined.

5.3 Joint Optimization

There may be a trade-off between information density and dependency length. Although we found information density and dependency length to be positively correlated in the random pseudo-grammars generated for Study 1, these correlations were mild to moderate. It is therefore possible that optimization of information density trades off against optimization of dependency length (and vice versa). We therefore investigate the effect on dependency length of optimizing for information density, and vice versa. We also experiment with a joint objective function that combines information density and dependency length:

E_\lambda(\theta) = \lambda \cdot I(\theta) + (1 - \lambda) \cdot D(\theta)

We then applied the same optimization algorithm described above to this joint measure of processing efficiency. The separate optimizations described in the previous two sections correspond to \lambda = 1 and \lambda = 0, respectively. We also considered five intermediate values of \lambda (note that these weights are hard to interpret by themselves: information density and dependency length are on different scales, as shown in Table 3 above).

5.4 Results

Table 6 shows the results of optimizing for different objective functions in each column. Rows show the by-character information density and dependency length for each language. The left half of the table shows the information density and dependency length of the optimal pseudo-grammars derived by optimizing information density (ID), dependency length (DL), or both jointly with equal weights (ID & DL). The right half of the table shows the information density and dependency length of the actual human language, as well as the mean of the random pseudo-grammars generated for Study 1.

Our first observation from Table 6 is that optimizing either information density or dependency length indeed comes at the expense of the other property (despite the overall positive correlations between information density and dependency length in the random pseudo-grammars, cf. Figure 5). Further, the average information density and dependency length of all five natural languages is overall closer to the joint optimum than to either of the separate optima, suggesting that natural languages indeed trade off information density and dependency length.

                          Optimized for                         Actual       Mean of random
                          ID        ID & DL     DL              language     pseudo-grammars
Arabic
  information density     0.135     0.141       0.143           0.143        0.146
  dependency length       4.42      2.73        2.73            3.21         4.75
Czech
  information density     0.413     0.431       0.432           0.435        0.436
  dependency length       3.41      2.45        2.45            2.94         3.27
English
  information density     0.372     0.386       0.391           0.377        0.394
  dependency length       3.45      2.01        2.00            2.25         2.79
German
  information density     0.358     0.380       0.383           0.380        0.386
  dependency length       3.76      2.18        2.18            3.28         3.91
Mandarin
  information density     0.156     0.163       0.162           0.159        0.164
  dependency length       3.06      2.55        2.55            3.44         4.56
Table 6: Results of separate and joint optimization of information density (ID) and dependency length (DL). Information density is the by-character estimate. For comparison, the two rightmost columns provide the information density and dependency length of the actual human languages as well as the mean of the random pseudo-grammars (repeated from Table 3).

This means that the jointly optimized pseudo-grammars provide the most relevant point of comparison for natural languages, since we are interested in understanding whether the grammars of natural languages are close to optimal in their overall processing efficiency. One way to further illustrate this trade-off is to look at the equi-weighted joint optimization (λ = 0.5). These jointly optimized grammars do not unambiguously outperform actual natural languages. For two of the five languages, Arabic and Czech, the equi-weighted jointly optimized pseudo-grammar has both lower information density and lower dependency length. For the other three languages, however, the optimized pseudo-grammar is better on one dimension of processing efficiency but worse on the other (e.g., for the jointly optimized pseudo-grammar of Mandarin, a slight improvement in dependency length results in a considerable worsening in information density, compared to actual Mandarin). This is visualized in Figure 8.

Figure 8: Comparison of human languages and the optimized languages derived from them, in terms of by-character information density and dependency length. To put all languages on a comparable scale, the two axes are standardized (i.e., they represent z-values). Solid shapes show the human languages. Arrows point to the optimal pseudo-grammars. Contour lines show a 2D density estimation based on a bivariate normal kernel, summarizing the distribution of the random pseudo-grammars.

The full results of the different joint optimizations, varying the weighting parameter λ, are shown in Figure 7. If language processing were just one among many equally important factors that shape word order preferences over time, the processing efficiency of optimized grammars should far outperform that of the actual human languages. The gray lines in Figure 7 can be thought of as representing a ‘frontier’ of optimality within the space of possible grammars for each of the languages in our sample.

None of the languages in our sample lies on this frontier. That is, all languages could theoretically change to have better information density and dependency length. However, we also find that the word orders of some languages (Arabic and English) have close to optimal processing efficiency. Of the five languages investigated here, only the word order of German clearly has non-optimal processing efficiency. One possible reason for such striking differences between languages might be differences in how they use means other than word order to convey the relations between words in a sentence (e.g., word-internal structure and morphosyntactic means). It is also possible that the historical development of some languages has been more strongly affected by other factors of language use (such as ease of production or learnability). The current computational simulations cannot distinguish between these possibilities. The approach applied here does, however, point a way forward: as better quantitative models of these other factors become available, future simulations can investigate to what extent other aspects of language use shape the cultural evolution of language.

We further note that our optimization procedure held constant the headedness of each dependency type. As mentioned above, this is close to, but not identical to, what the human languages in our sample do. It is an open question how a relaxation of the constant-headedness constraint would affect our results. Varying headedness arguably makes a language harder to learn, so that the assumption of constant headedness can be seen as holding constant what likely constitutes an additional (third) constraint that languages need to balance.

6 Discussion

The present results suggest that language processing affects language change: all natural languages for which we could test the hypothesis have word orders that make them easier to process than expected by chance. Specifically, the average information density and dependency length of the natural languages in our sample are lower than would be expected if language change were not subject to a bias toward systems with high processing efficiency.

To the best of our knowledge, this is the first cross-linguistic broad-coverage study of the processing efficiency of natural languages. The measures of processing efficiency we have employed here are two of the best documented correlates of processing complexity. By calculating the average information density and dependency length of natural languages based on large collections of text from these languages, we were able to sidestep the insurmountable challenges that would be associated with a behavioral approach to this question (see Introduction). There are three directly related previous findings that we are aware of [23, 31, 26]. Gildea and Temperley [31] investigated the average dependency length of two closely related languages, English and German. Ferrer i Cancho [23] investigated Czech and Romanian, while Futrell et al. [26] studied 37 languages spoken worldwide. These studies found that the languages studied had shorter average dependency length than expected by chance. Our contribution is to assess –for the first time, to our knowledge– the joint effect of two of the biggest contributors to the grammatical processing efficiency of a language, information density and dependency length. As the trade-off between these factors in Study 2 shows, it is crucial to investigate the effect of multiple contributors to processing efficiency simultaneously.

Some other recent studies complement the approach taken here. These studies have tested whether learners of miniature languages designed by experimenters prefer languages that increase processing efficiency [20, 21, 22]. In the most informative of these studies, great care is taken to rule out native language biases as the source of the observed preferences (cf. [34]). For example, [19] finds that language learners prefer languages that reduce unnecessary uncertainty about the syntactic structure of sentences. Studies like this provide evidence that processing preferences can bias the outcome of language learning and thus provide support for one causal pathway through which processing preferences could come to affect language change, thereby shaping languages over time.

There are several caveats that apply to our study. The most obvious perhaps is that we have only considered the syntactic dependencies that are annotated in the available syntactic corpora. These dependencies constitute an impoverished subset of all the semantic and syntactic dependencies that human comprehenders process when listening or reading. For example, one obvious omission in our approach is that we did not consider the internal structure of words, the complexity of which varies starkly across languages. Another simplifying assumption we have implicitly made in our studies is the focus on information density and dependency length. While these two measures of processing efficiency are arguably the best documented ones, there are other properties of grammatical systems that are known to affect processing efficiency (e.g., interference in memory due to similar words or referents, [57, 61]). As far as we can tell, neither of these simplifying assumptions is likely to have biased the results in favor of the hypothesis tested here (for that to be the case, the measures of processing efficiency we have employed here would have to be systematically inversely correlated with other measures or properties of the languages under study).

Acknowledgments

The authors thank Masha Fedzechkina, Chigusa Kurumada, and Olga Nikolayeva for feedback on earlier versions of this paper. This work was partially funded by National Science Foundation award IIS-1446996 to DG and National Science Foundation CAREER award IIS-1150028 to TFJ. The views expressed here are not necessarily those of the funding agencies.

References

  • [1] Jennifer E Arnold, Thomas Wasow, Ash Asudeh, and Peter Alrenga. Avoiding attachment ambiguities: The role of constituent ordering. Journal of Memory and Language, 51:55–70, 2004.
  • [2] Jennifer E Arnold, Thomas Wasow, Anthony Losongco, and Ryan Ginstrom. Heaviness vs. newness: the effects of structural complexity and discourse status on constituent ordering. Language, 76(1):28–55, 2000.
  • [3] R H Baayen. Morphological influences on the recognition of monosyllabic monomorphemic words. Journal of Memory and Language, 55:290–313, 2006.
  • [4] David A Balota, Michael J Cortese, Susan D Sergent-Marshall, Daniel H Spieler, and Melvin J Yap. Visual Word Recognition of Single-Syllable Words. Journal of experimental psychology: General, 133(2):283–316, 2004.
  • [5] Elizabeth Bates and Brian MacWhinney. Functionalist approaches to grammar. In Language acquisition: the state of the art, pages 173–218. 1982.
  • [6] Elizabeth Bates and Brian MacWhinney. Competition, Variation, and Language Learning. In Brian MacWhinney, editor, Mechanisms of Language Acquisition, chapter 6, pages 157–194. 1987.
  • [7] A. Böhmová, J. Hajič, E. Hajičová, and B. Hladká. The PDT: a 3-level annotation scenario. In A. Abeillé, editor, Treebanks: Building and Using Parsed Corpora, volume 20 of Text, Speech and Language Technology, chapter 7. Kluwer Academic Publishers, Dordrecht, 2003.
  • [8] Marisa Ferrara Boston, John Hale, Reinhold Kliegl, Umesh Patil, and Shravan Vasishth. Parsing costs as predictors of reading difficulty: An evaluation using the Potsdam Sentence Corpus. Journal of Eye Movement Research, 2(1):1–12, 2008.
  • [9] S. Brants, S. Dipper, S. Hansen, W. Lezius, and G. Smith. The TIGER treebank. In Proc. of the 1st Workshop on Treebanks and Linguistic Theories (TLT), 2002.
  • [10] Sabine Buchholz and Erwin Marsi. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning, pages 149–164. Association for Computational Linguistics, 2006.
  • [11] Joan Bybee and Paul J Hopper. Frequency and the Emergence of Linguistic Structure. 2001.
  • [12] Joan Bybee and Joanne Scheibman. The effect of usage on degrees of constituency: the reduction of don't in English. Linguistics, 37(4):575–596, 1999.
  • [13] Stanley F Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–393, 1999.
  • [14] Hye-won Choi. Length and Order: A Corpus Study of Korean Dative-Accusative Construction. 담화와 인지, 14(3):207–227, 2007.
  • [15] Uriel Cohen Priva. Using Information Content to Predict Phone Deletion. In Natasha Abner and Jason Bishop, editors, Proceedings of the 27th West Coast Conference on Formal Linguistics, pages 90–98, 2008.
  • [16] Michael John Collins. Head-driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania, Philadelphia, 1999.
  • [17] Fermín Moscoso del Prado Martín, Aleksandar Kostić, and R Harald Baayen. Putting the bits together: an information theoretical perspective on morphological processing. Cognition, 94(1):1–18, November 2004.
  • [18] Vera Demberg and Frank Keller. Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 109(2):193–210, November 2008.
  • [19] Maryia Fedzechkina. Communicative Efficiency, Language Learning, and Language Universals. PhD thesis, University of Rochester, 2014.
  • [20] Maryia Fedzechkina, T Florian Jaeger, and Elissa L Newport. Functional Biases in Language Learning: Evidence from Word Order and Case-Marking Interaction. In 33rd Annual Meeting of the Cognitive Science Society, number 2004, pages 318–323, 2011.
  • [21] Maryia Fedzechkina, T. Florian Jaeger, and Elissa L. Newport. Language learners restructure their input to facilitate efficient communication. Proceedings of the National Academy of Sciences of the United States of America, pages 1–6, October 2012.
  • [22] Maryia Fedzechkina, T Florian Jaeger, and Elissa L Newport. Communicative biases shape structures of newly acquired languages. In M. Knauff, N. Pauen, N.Sebanz, and I. Wachsmuth, editors, Proceedings of the 35th Annual Meeting of the Cognitive Science Society (CogSci13), pages 430–435. Cognitive Science Society, Austin, TX, 2013.
  • [23] Ramon Ferrer i Cancho. Euclidean distance between syntactically linked words. Physical Review E, 70(056135):1–5, 2004.
  • [24] Victoria Fossum and Roger Levy. Sequential vs. hierarchical syntactic models of human incremental sentence processing. In Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics, pages 61–69. Association for Computational Linguistics, 2012.
  • [25] Stefan L Frank and Rens Bod. Insensitivity of the human sentence-processing system to hierarchical structure. Psychological Science, 22(6):829–834, 2011.
  • [26] Richard Futrell, Kyle Mahowald, and Edward Gibson. Large-scale evidence of dependency length minimization in 37 languages. Proceedings of the National Academy of Sciences, 112(33):10336–10341, 2015.
  • [27] Dmitriy Genzel and Eugene Charniak. Entropy Rate Constancy in Text. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 199–206, 2002.
  • [28] E Gibson. Linguistic complexity: locality of syntactic dependencies. Cognition, 68(1):1–76, August 1998.
  • [29] Edward Gibson. The Dependency Locality Theory: A Distance-Based Theory of Linguistic Complexity. In Alec Marantz, Yasushi Miyashita, and Wayne O’Neil, editors, Image, language, brain: Papers from the first mind articulation symposium, chapter 5, pages 95–126. 2000.
  • [30] Edward Gibson, Steven T Piantadosi, Kimberly Brink, Leon Bergen, Eunice Lim, and Rebecca Saxe. A noisy-channel account of crosslinguistic word-order variation. Psychological Science, 24(7):1079–88, July 2013.
  • [31] Daniel Gildea and David Temperley. Do grammars minimize dependency length? Cognitive Science, 34(2):286–310, 2010.
  • [32] J. Godfrey, E. Holliman, and J. McDaniel. SWITCHBOARD: Telephone speech corpus for research and development. In IEEE ICASSP-92, pages 517–520, San Francisco, 1992. IEEE.
  • [33] Ben Gold, Nelson Morgan, and Dan Ellis. Speech and audio signal processing: processing and perception of speech and music. John Wiley & Sons, 2011.
  • [34] Adele E Goldberg. Substantive learning bias or an effect of familiarity? comment on. Cognition, 127(3):420–426, 2013.
  • [35] Peter Graff and T Florian Jaeger. Locality and Feature Specificity in OCP Effects: Evidence from Aymara, Dutch, and Javanese. In Proceedings of the Main Session of the 45th Meeting of the Chicago Linguistic Society, pages 1–15, 2009.
  • [36] Daniel Grodner and Edward Gibson. Consequences of the serial nature of linguistic input for sentenial complexity. Cognitive science, 29(2):261–90, March 2005.
  • [37] Gregory R Guy. Form and function in linguistic variation. In Gregory R Guy, Crawford Feagin, Deborah Schiffrin, and John Baugh, editors, Towards a social science of language: Papers in honor of William Labov. Volume 1: Variation and change in language and society, pages 221–252. Benjamins Publishing Company, Amsterdam, 1996.
  • [38] Jan Hajič, Otakar Smrž, Petr Zemánek, Jan Šnaidauf, and Emanuel Beška. Prague Arabic Dependency Treebank: Development in data and tools. In Proc. of the NEMLAR Intern. Conf. on Arabic Language Resources and Tools, pages 110–117, 2004.
  • [39] John Hale. A Probabilistic Earley Parser as a Psycholinguistic Model. In NAACL ’01 Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies, pages 1–8, 2001.
  • [40] Marc D Hauser, Noam Chomsky, and W Tecumseh Fitch. The faculty of language: what is it, who has it, and how did it evolve? science, 298(5598):1569–1579, 2002.
  • [41] J. A. Hawkins. Cross-linguistic variation and efficiency. Oxford University Press, Oxford, UK, 2014.
  • [42] John Hawkins. A Performance Theory of Order and Constituency. Cambridge University Press, Cambridge, UK, 1994.
  • [43] John A Hawkins. Efficiency and complexity in grammars. Oxford University Press, Oxford, UK, 2004.
  • [44] John A. Hawkins. Processing typology and why psychologists need to know about it. New Ideas in Psychology, 25(2):87–107, August 2007.
  • [45] W. von Humboldt. Linguistic Variability and Intellectual Development. University of Pennsylvania Press, Philadelphia, PA, 1972.
  • [46] Elizabeth Hume and Frédéric Mailhot. The Role of Entropy and Surprisal in Phonologization and Language Change. In Alan C. L. Yu, editor, Origins of Sound Patterns: Approaches to Phonologization, pages 29–50. Oxford University Press, Oxford, UK, 2013.
  • [47] T F Jaeger and E J Norcliffe. The cross-linguistic study of sentence production. Language and Linguistics Compass, 3:1–22, 2009.
  • [48] T Florian Jaeger. Redundancy and reduction: speakers manage syntactic information density. Cognitive Psychology, 61(1):23–62, August 2010.
  • [49] Frederick Jelinek. Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA, 1997.
  • [50] Simon Kirby, Hannah Cornish, and Kenny Smith. Cumulative cultural evolution in the laboratory: an experimental approach to the origins of structure in human language. Proceedings of the National Academy of Sciences of the United States of America, 105(31):10681–6, August 2008.
  • [51] Simon Kirby, Mike Dowman, and Thomas L Griffiths. Innateness and culture in the evolution of language. Proceedings of the National Academy of Sciences, 104(12):5241–5245, 2007.
  • [52] Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 181–184, Detroit, MI, 1995. IEEE.
  • [53] Philipp Koehn. Statistical machine translation. Cambridge University Press, 2009.
  • [54] K J Kohler. The Phonetics/Phonology Issue in the Study of Articulatory Reduction. Phonetica, 48(2-4):180–192, 1991.
  • [55] Roger Levy. Probabilistic Models of Word Order and Syntactic Discontinuity. PhD thesis, Stanford University, 2005.
  • [56] Roger Levy. Expectation-based syntactic comprehension. Cognition, 106(3):1126–77, March 2008.
  • [57] Richard L Lewis, Shravan Vasishth, and Julie a Van Dyke. Computational principles of working memory in sentence comprehension. Trends in Cognitive Sciences, 10(10):447–54, October 2006.
  • [58] Björn Lindblom. Explaining phonetic variation: A sketch of the H&H theory. In W J Hardcastle and A Marchal, editors, Speech Production and Speech Modeling, pages 403–439. Kluwer Academic Publishers, 1990.
  • [59] Barbara Lohse, John A Hawkins, and Thomas Wasow. Domain Minimization in English Verb-Particle Constructions. Language, 80(2):238–261, 2004.
  • [60] Paul A Luce and David B Pisoni. Recognizing Spoken Words: The Neighborhood Activation Model. Ear and Hearing, 19(1):1–36, 1998.
  • [61] Maryellen C MacDonald. How language production shapes language form and comprehension. Frontiers in Psychology, 4(April):226, January 2013.
  • [62] David Magerman. Natural Language Parsing as Statistical Pattern Recognition. PhD thesis, Stanford University, 1994.
  • [63] James S Magnuson, James A Dixon, Michael K Tanenhaus, and Richard N Aslin. The dynamics of lexical competition during spoken word recognition. Cognitive science, 31(1):133–56, February 2007.
  • [64] D Yu Manin. Experiments on predictability of word in context and information rate in natural language. pages 1–12, 2006.
  • [65] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2):313–330, June 1993.
  • [66] Scott A. McDonald and Richard C. Shillcock. Low-level predictive inference in reading: the influence of transitional probabilities on eye movements. Vision Research, 43(16):1735–1751, July 2003.
  • [67] Martin A Nowak, Joshua B Plotkin, and Vincent AA Jansen. The evolution of syntactic communication. Nature, 404(6777):495–498, 2000.
  • [68] John J Ohala. Discussion of Bjoern Lindblom’s ’Phonetic Invariance and the adaptive nature of speech’. In Working Models of Human Perception. Academic Press, London, UK, 1988.
  • [69] Steven T Piantadosi, Harry Tily, and Edward Gibson. Word lengths are optimized for efficient communication. PNAS, 108(9):3526–3529, 2011.
  • [70] Steven T Piantadosi, Harry Tily, and Edward Gibson. The communicative function of ambiguity in language. Cognition, 122(3):280–91, March 2012.
  • [71] Janet B Pierrehumbert. Word-specific phonetics. In Carlos Gussenhoven and Natasha Warner, editors, Laboratory phonology 7, pages 101–139. Mouton de Gruyter, Berlin, 2002.
  • [72] Steven Pinker and Ray Jackendoff. The faculty of language: what’s special about it? Cognition, 95(2):201–36, March 2005.
  • [73] Idoia Ros, Mike Santesteban, Kumiko Fukumura, and Itziar Laka. Aiming at shorter dependencies: the role of agreement morphology. Language, Cognition, and Neuroscience, 2015.
  • [74] Dan I Slobin. Language Change in Childhood and in History. Working Papers of the Language Behavior Research Laboratory, 41:185–214, 1975.
  • [75] Nathaniel J Smith and Roger Levy. The effect of word predictability on reading time is logarithmic. Cognition, 128(3):302–19, September 2013.
  • [76] Benedikt M Szmrecsányi. On Operationalizing Syntactic Complexity. In JADT 2004 : 7es Journées internationales d’Analyse statistique des Données Textuelles, pages 1031–1038, 2004.
  • [77] Marten van Schijndel and William Schuler. Hierarchic syntax improves reading time prediction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-15), 2015.
  • [78] R J J H van Son and Louis C W Pols. How efficient is speech? Proceedings Institute of Phonetic Sciences, University of Amsterdam, 25:171–184, 2003.
  • [79] Shravan Vasishth and R L Lewis. An activation-based model of sentence processing as skilled memory retrieval. Cognitive science, 29(3):375–419, 2005.
  • [80] Thomas Wasow. Post-verbal behavior. CSLI Publications, Stanford, CA, 2002.
  • [81] Andrew Wedel. Exemplar models, evolution and language change. The Linguistic Review, 23:247–274, 2006.
  • [82] Andrew Wedel, Scott Jackson, and Abby Kaplan. Functional Load and the Lexicon: Evidence that Syntactic Category and Frequency Relationships in Minimal Lemma Pairs Predict the Loss of Phoneme contrasts in Language Change. Language and Speech, 56(3):395–417, July 2013.
  • [83] Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. The penn chinese treebank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11:207–238, 2005.
  • [84] Hiroko Yamashita and Franklin Chang. ”Long before short” preference in the production of a head-final language. Cognition, 81:45–55, 2001.
  • [85] George K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, New York, 1949.