Phonotactic Complexity and its Trade-offs

by   Tiago Pimentel, et al.
University of Cambridge

We present methods for calculating a measure of phonotactic complexity—bits per phoneme—that permits a straightforward cross-linguistic comparison. When given a word, represented as a sequence of phonemic segments such as symbols in the international phonetic alphabet, and a statistical model trained on a sample of word types from the language, we can approximately measure bits per phoneme using the negative log-probability of that word under the model. This simple measure allows us to compare the entropy across languages, giving insight into how complex a language's phonotactics are. Using a collection of 1016 basic concept words across 106 languages, we demonstrate a very strong negative correlation of -0.74 between bits per phoneme and the average length of words.



1 Introduction

One prevailing view on system-wide phonological complexity is that as one aspect increases in complexity (e.g., size of phonemic inventory), another reduces in complexity (e.g., degree of phonotactic interactions). Underlying this claim, the so-called compensation hypothesis (Martinet, 1955; Moran and Blasi, 2014), is the conjecture that languages are, generally speaking, of roughly equivalent complexity, i.e., no language is overall inherently more complex than another. This conjecture is widely accepted in the literature and dates back at least to the work of Hockett (1958). Since a language may be more complex than another along any one axis, this conjecture has a corollary: compensatory relationships between different types of complexity must exist. Such compensation has been hypothesized to be the result of natural processes of historical change, and is sometimes attributed to a potential linguistic universal of equal communicative capacity (Pellegrino et al., 2011; Coupé et al., 2019).

Figure 1: Bits-per-phoneme vs average word length using an LSTM language model.

Methods for making hypotheses about linguistic complexity objectively measurable and testable have long been of interest, though existing measures are typically relatively coarse; see, e.g., Moran and Blasi (2014) and § 2 below for reviews. Briefly, counting-based measures such as inventory sizes (e.g., numbers of vowels, consonants, syllables) typically play a key role in assessing phonological complexity. Yet, in addition to their categorical nature, such measures generally do not capture longer-distance (e.g., cross-syllabic) phonological dependencies such as vowel harmony. In this paper, we take an information-theoretic view of phonotactic complexity, and advocate for a measure that permits straightforward cross-linguistic comparison: bits per phoneme. For each language, a statistical language model over words (represented as phonemic sequences) is trained on a sample of types from the language, and then used to calculate the bits per phoneme for new samples, thus providing an upper bound on the actual entropy (Brown et al., 1992).

Characterizing phonemes via information-theoretic measures goes back to Cherry et al. (1953), who discussed the information content of phonemes in isolation, based on the presence or absence of distinctive features, as well as in groups, e.g., trigrams or possibly syllables. Here we leverage modern recurrent neural language modeling methods to build models over full word forms represented as phoneme strings, thus capturing any dependencies over longer distances (e.g., harmony) in assigning probabilities to phonemes in sequence. By training and evaluating on comparable corpora in each language, consisting of concept-aligned words, we can characterize and compare their phonotactics. While probabilistic characterizations of phonotactics have been employed extensively in psycholinguistics (see § 2.4), such methods have generally been used to assess single words within a lexicon (e.g., classifying high versus low probability words during stimulus construction), rather than information-theoretic properties of the lexicon as a whole, which our work explores.

The empirical portion of our paper exploits this information-theoretic take on complexity to examine multiple aspects of phonotactic complexity:

  1. Bits-per-phoneme and Word Length: In § 5.1, we show a very high negative correlation of -0.74 between bits per phoneme and average word length for the same 1016 basic concepts across 106 languages. This correlation is plotted in Fig. 1. In contrast, conventional phonotactic complexity measures, e.g., number of consonants in an inventory, demonstrate poor correlation with word length. Our results are consistent with Pellegrino et al. (2011), who show a similar correlation in speech. [Footnote 1: See also Coupé et al. (2019), where syllable-based bigram models are used to establish a comparable information rate in speech across 17 typologically diverse languages.] We additionally establish, in § 5.2, that the correlation persists when controlling for characteristics of long words, e.g., early versus late positions in the word.

  2. Constraining Language: Despite often being thought of as adding complexity, processes like vowel harmony and final-obstruent devoicing improve the predictability of subsequent segments by constraining the number of well-formed forms. Thus, they reduce complexity as measured in bits per phoneme. We validate our models by systematically removing certain constraints in our corpora in § 5.3.

  3. Intra- versus Inter-Family Correlation: Additionally, we present results in § 5.4 showing that our complexity measure correlates with word length not only across a diverse set of languages, but also within language families. Standard measures of phonotactic complexity do not show such correlations.

  4. Explicit feature representations: We also find (in § 5.5) that methods for including features explicitly in the representation, using methods described in § 4.1, yield little benefit except in an extremely low-resource condition.

Our methods permit a straightforward cross-linguistic comparison of phonotactic complexity, which we use to demonstrate an intriguing trade-off with word length. [Footnote 2: Code to train these models and reproduce the results is available.] Before motivating and presenting our methods, we next review related work on measuring complexity and phonotactic modeling.

2 Background: Phonological Complexity

2.1 Linguistic complexity

Linguistic complexity is a nuanced topic. For example, one can judge a particular sentence to be syntactically complex relative to other sentences in the language. However, one can also describe a language as a whole as being complex in one aspect or another, e.g., polysynthetic languages are often deemed morphologically complex. In this paper, we look to characterize phonotactics at the language level. However, we use methods more typically applied to specific sentences in a language, for example in the service of psycholinguistic experiments.

In cross-linguistic studies, the term complexity is chiefly used in two senses, which Moran and Blasi (2014) follow Miestamo (2006) in calling relative and absolute. Relative complexity metrics are those that capture the difficulty of learning or processing language, which Miestamo (2006) points out may vary depending on the individual (hence, is relative to the individual being considered). For example, vowel harmony, which we will touch upon later in the paper, may make vowels more predictable for a native speaker, hence less difficult to process; for a second language learner, however, vowel harmony may increase difficulty of learning and speaking. Absolute complexity measures, in contrast, assess the number of parts of a linguistic (sub-)system, e.g., number of phonemes or licit syllables.

In the sentence processing literature, surprisal (Hale, 2001; Levy, 2008) is a widely used measure of processing difficulty, defined as the negative log probability of a word given the preceding words. Words that are highly predictable from the preceding context have low surprisal, and those that are not predictable have high surprisal. The phonotactic measure we advocate for in § 3 is related to surprisal, though at the phoneme level rather than the word level, and over words rather than sentences. Measures related to phonotactic probability have been used in a range of psycholinguistic studies (see § 2.4), though generally to characterize single words within a language (e.g., high versus low probability words) rather than for cross-linguistic comparison as we do here. Returning to the distinction made by Miestamo (2006), we will remain agnostic in this paper as to which class (relative or absolute) such probabilistic complexity measures fall within, as well as whether the trade-offs that we document are bona fide instances of complexity compensation or are due to something else, e.g., related to the communicative capacity as hypothesized by Pellegrino et al. (2011). We bring up this terminological distinction primarily to situate our use of complexity within the diverse usage in the literature.

We also point out that an important motivation for those advocating the use of absolute over relative measures to characterize linguistic complexity in cross-linguistic studies is a practical one. Miestamo (2006, 2008) claims that relative complexity measures are infeasible for broadly cross-linguistic studies because they rely on psycholinguistic data, which is neither common enough nor sufficiently easily comparable across languages to support reliable comparison. In this study, we demonstrate that surprisal and related measures are not subject to the practical obstacles raised by Miestamo, regardless of which class of complexity they fall into.

2.2 Measures of Phonological Complexity

The complexity of phonemes has long been studied in linguistics, including early work on the topic by Zipf (1935), who argued that a phoneme’s articulatory effort was related to its frequency. Trubetzkoy introduced the notion of markedness of phonological features, which bears some indirect relation to both frequency and articulatory complexity. Phonological complexity can be formulated in terms of language production (e.g., complexity of planning or articulation) or in terms of language processing (e.g., acoustic confusability or predictability), a distinction often framed around the ideas of articulatory complexity and perceptual salience; see, e.g., Maddieson (2009). One recent instantiation of this was the inclusion of both focalization and dispersion to model vowel system typology (Cotterell and Eisner, 2017).

It is also natural to ask questions about the phonological complexity of an entire language in addition to that of individual phonemes, whether articulatory or perceptual, phonemic or phonotactic. Measures of such complexity that allow for cross-linguistic comparison are non-trivial to define. We review several previously proposed metrics here.

Size of phoneme inventory.

The most basic metric proposed for measuring phonological complexity is the number of distinct phonemes in the language’s phonemic inventory (Nettle, 1995). There has been considerable historical interest in counting both the number of vowels and the number of consonants; see, e.g., Hockett (1955), Greenberg et al. (1978), and Maddieson and Disner (1984). Phoneme inventory size has its limitations: it ignores the phonotactics of the language. It does, however, have the advantage that it is relatively easy to compute without further linguistic analysis. Correlations between the size of vowel and consonant inventories (measured in number of phonemes) have been extensively studied, with contradictory results presented in the literature; see, e.g., Moran and Blasi (2014) for a review. Increases in phonemic inventory size are also hypothesized to negatively correlate with word length measured in phonemes (Moran and Blasi, 2014). Nettle (1995) demonstrated an inverse relationship between the size of the segmental inventory and the mean word length for 10 languages, and Moran and Blasi (2014) found similar results (with some qualifications) for a much larger collection of languages. [Footnote 3: Note that by examining negative correlations between word length and inventory size within the context of complexity compensation, word length is also being taken implicitly as a complexity measure, as we shortly make explicit.] We will explore phoneme inventory size as a baseline in our studies in § 5.

Markedness in Phoneme Inventory.

A refinement of phoneme inventory size takes into account markedness of the individual phonemes. McWhorter (2001) argues that one should judge the complexity of an inventory by counting the cross-linguistic frequency of the phonemes in the inventory, channeling the spirit of Greenberg (1966). Thus, a language that has fewer phonemes, but contains cross-linguistically marked ones such as clicks, could be more complex. [Footnote 4: McWhorter (2001) was one of the first to offer a quantitative treatment of linguistic complexity at all levels. Note, however, that he rejects the equal complexity hypothesis, arguing that creoles are simpler than other languages. As our data contains no creoles, we cannot address this hypothesis; rather, we only compare non-creole languages.] McWhorter justifies this definition with the observation that no attested language has a phonemic inventory that consists only of marked segments. Beyond frequency, Lindblom and Maddieson (1988) propose a tripartite markedness rating scheme for various consonants. In this paper, we are principally looking at phonotactic complexity, though we did examine the joint training of models across languages, which can be seen as modeling some degree of typicality and markedness.

Word length.

As stated earlier, word length, measured in the number of phonemes in a word, has been shown to negatively correlate with other complexity measures, such as phoneme inventory size (Nettle, 1995; Moran and Blasi, 2014). To the extent that this is interpreted as being a compensatory relation, this would indicate that word length is being taken as an implicit measure of complexity. Alternatively, word length has a natural interpretation in terms of information rate, so trade-offs could be attributed to communicative capacity (Pellegrino et al., 2011; Coupé et al., 2019).

Number of Licit Syllables.

Phonological constraints extend beyond individual units to the structure of entire words themselves, as we discussed above; so why stop at counting phonemes? One step in that direction is to investigate the syllabic structure of language, and count the number of possible licit syllables in the language (Maddieson and Disner, 1984; Shosted, 2006). Syllabic complexity brings us closer to a more holistic measure of phonological complexity. Take, for instance, the case of Mandarin Chinese. At first blush, one may assume that Mandarin has a complex phonology due to an above-average-sized phonemic inventory (including tones); closer inspection, however, reveals a more constrained system. Mandarin only admits two codas: /n/ and /ŋ/.

While syllable inventories and syllable-based measures of phonotactic complexity (e.g., the highest complexity syllable type in Maddieson, 2006) do incorporate more of the constraints at play in a language than segment-based measures, (a) they remain relatively simple counting measures; and (b) phonological constraints do not end at the syllable boundary. Phenomena such as vowel harmony operate at the word level. Further, the combinatorial possibilities captured by a syllabic inventory, as discussed by Maddieson (2009), can be seen as a sort of categorical version of a distribution over forms. Stochastic models of word-level phonotactics permit us to go beyond simple enumeration of a set, and characterize the distribution in more robust information-theoretic terms.

2.3 Phonotactics

Beyond characterizing the complexity of phonemes in isolation or the number of syllables, one can also look at the system determining how phonemes combine to form longer sequences in order to create words. The study of which sequences of phonemes constitute natural-sounding words is called phonotactics. For example, as Chomsky and Halle (1965) point out in their oft-cited example, brick is an actual word in English; blick is not an actual word in English, but is judged to be a possible word by English speakers; and bnick is neither an actual nor a possible word in English, due to constraints on its phonotactics. [Footnote 5: For convenience, we just use standard orthography to represent actual and possible words, rather than phoneme strings.]

Psycholinguistic studies often use phonotactic probability to characterize stimuli within a language; see § 2.4 for details. For example, Goldrick and Larson (2008) demonstrate that both articulatory complexity and phonotactic probability influence the speed and accuracy of speech production. Measures of the overall complexity of a phonological system must thus also account for phonotactics.

Cherry et al. (1953) took an explicitly information-theoretic view of phonemic structure, including discussions of both encoding phonemes as feature bundles and the redundancy within groups of phonemes in sequence. This perspective of phonemic coding has led to work on characterizing the explicit rules or constraints that lead to redundancy in phoneme sequences, including morpheme structure rules (Halle, 1959) or conditions (Stanley, 1967). Recently, Futrell et al. (2017) take such approaches as inspiration for a generative model over feature dependency graphs. We, too, examine decomposition of phonemes into features for representation in our model (see § 4.1), though in general this only provided modeling improvements over atomic phoneme symbols in a low-resource scenario.

Much of the work in phonotactic modeling is intended to explain the sorts of grammaticality judgments exemplified by the examples of Chomsky and Halle (1965) discussed earlier. Recent work is typically founded on the commonly held perspective that such judgments are gradient, hence amenable to stochastic modeling, e.g., Hayes and Wilson (2008) and Futrell et al. (2017), though cf. Gorman (2013). [Footnote 6: Gradient judgments would account for the fact that bwick is typically judged to be a possible English word like blick but not as good. In other words, bwick is better than bnick but not as good as blick.] In this paper, however, we are looking at phonotactic modeling as the means for assessing phonotactic complexity and discovering potential evidence of trade-offs cross-linguistically, and are not, strictly speaking, evaluating the model on its ability to capture such judgments, gradient or otherwise.

2.4 Phonotactic probability and surprisal

A word’s phonotactic probability has been shown to influence both processing and learning of language. Words with high phonotactic probabilities (see brief notes on the operationalization of this below) have been shown to speed speech processing, both recognition (e.g., Vitevitch and Luce, 1999) and production (e.g., Goldrick and Larson, 2008). Phonotactically probable words in a language have also been shown to be easier to learn (Storkel, 2001, 2003; Coady and Aslin, 2004, inter alia), although such an effect is also influenced by neighborhood density (Coady and Aslin, 2003), as are the speech processing effects (Vitevitch and Luce, 1999). Informally, phonological neighborhood density is the number of similar-sounding words in the lexicon. To the extent that high phonotactic probability implies phonotactic patterns frequent in the lexicon, neighborhood density typically correlates to some degree with phonotactic probability, i.e., dense neighborhoods will typically consist of phonotactically probable words. Some effort has been made to disentangle the effect of these two characteristics (Vitevitch and Luce, 1999; Storkel et al., 2006; Storkel and Lee, 2011, inter alia).

Within the psycholinguistics literature referenced above, phonotactic probability was typically operationalized by summing or averaging the frequency with which single phonemes and phoneme bigrams occur, either overall or in certain word positions (initial, medial, final); and neighborhood density of a word is typically the number of words in a lexicon that have Levenshtein distance 1 from the word (see, e.g., Storkel and Hoover, 2010). Note that these measures are used to characterize specific words, i.e., given a lexicon, these measures allow for the designation of high versus low phonotactic probability words and high versus low neighborhood density words, which is useful for designing experimental stimuli. Our bits-per-phoneme measure, in contrast, is used to characterize the distribution over a sample of a language rather than specific individual words in that language.
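The neighborhood density operationalization described above can be sketched in a few lines. This is a minimal illustration, assuming each character of a string stands in for one phoneme; the function names and the toy lexicon are hypothetical and are not taken from the studies cited.

```python
def levenshtein(a, b):
    """Edit distance between two phoneme sequences (unit-cost edits)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute x with y (free if equal)
            ))
        prev = curr
    return prev[-1]


def neighborhood_density(word, lexicon):
    """Number of lexicon entries at Levenshtein distance exactly 1 from word."""
    return sum(1 for other in lexicon if levenshtein(word, other) == 1)


# Hypothetical toy lexicon; one character is treated as one phoneme.
toy_lexicon = ["kat", "bat", "kab", "kart", "at", "dog"]
density = neighborhood_density("kat", toy_lexicon)
```

Here the neighbors of "kat" are "bat", "kab", "kart", and "at", so the density is 4. Note that this is a per-word statistic, in contrast to the lexicon-level bits-per-phoneme measure.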

Other work has made use of phonotactic probability to examine how such processing and learning considerations may impact the lexicon. Dautriche et al. (2017) take phonotactic probability as one component of ease of processing and learning (the other being perceptual confusability) that might influence how lexicons become organized over time. They operationalize phonotactic probability via generative phonotactic models (phoneme n-gram models and probabilistic context-free grammars with syllable structure), hence closer to the approaches described in this paper than the work cited earlier in this section. Generating artificial lexicons from such models, they find that real lexicons demonstrate higher network density (as indicated by Levenshtein distances, frequency of minimal pairs, and other measures) than the randomly generated lexicons, suggesting that the pressure towards highly clustered lexicons is driven by more than just phonotactic probability.

Evidence of pressure towards communication efficiency in the lexicon has focused on both phonotactic probability and word length. The information content, as measured by the probability of a word in context, is shown to correlate with orthographic length (taken as a proxy for phonological word length) (Piantadosi et al., 2009, 2011). Piantadosi et al. (2012) show that words with lower bits per phoneme have higher rates of homophony and polysemy, in support of their hypothesis that words that are easier to process will have higher levels of ambiguity. Relatedly, Mahowald et al. (2018) demonstrate, in nearly all of the 96 languages investigated, a high correlation between orthographic probability (as a proxy for phonotactic probability) and frequency, i.e., frequently used forms tend to be phonotactically highly probable, at least within the word lengths examined (3-7 symbols). A similar perspective on the role of predictability in phonology holds that words that are high probability in context (i.e., low surprisal) tend to be reduced, and those that are low probability in context are prone to change (Hume and Mailhot, 2013) or to some kind of enhancement (Hall et al., 2018). As Cohen Priva and Jaeger (2018) point out, frequency, predictability and information content (what they call informativity and operationalize as expected predictability) are related and easily confounded, hence the perspectives presented by these papers are closely related. Again, for these studies and those cited earlier, such measures are used to characterize individual words within a language rather than the lexicon as a whole.

3 The Probabilistic Lexicon

In this work, we are interested in a hypothetical phonotactic distribution $p_{\mathrm{lex}} : \Sigma^* \to \mathbb{R}_+$ over the lexicon. In the context of phonology, we interpret $\Sigma^*$ as all “universally possible phonological surface forms,” following Hayes and Wilson (2008). [Footnote 7: Hayes and Wilson (2008) label $\Sigma^*$ as $\Omega$.] The distribution $p_{\mathrm{lex}}$, then, assigns a probability to every possible surface form $x \in \Sigma^*$. In the special case that $p_{\mathrm{lex}}$ is a log-linear model, we arrive at what is known as a maximum entropy grammar (Goldwater and Johnson, 2003; Jäger, 2007). A good distribution should assign high probability to phonotactically valid words, including non-existent ones, but little probability to phonotactic impossibilities. For instance, the possible English word blick should receive much higher probability than bnick, which is not a possible English word. The lexicon of a language, then, is considered to be generated as samples without replacement from $p_{\mathrm{lex}}$.

If we accept the existence of the distribution $p_{\mathrm{lex}}$, then a natural manner by which we should measure the phonological complexity of language is through Shannon’s entropy (Cover and Thomas, 2012):

$$H(p_{\mathrm{lex}}) = -\sum_{x \in \Sigma^*} p_{\mathrm{lex}}(x) \log p_{\mathrm{lex}}(x) \tag{1}$$

The units of $H(p_{\mathrm{lex}})$ are bits as we take $\log$ to be base 2. Specifically, we will be interested in bits per phoneme, that is, how much information each phoneme in a word conveys, on average.
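As a concrete illustration of entropy measured in bits, the following minimal sketch computes Shannon entropy for small finite distributions. The distributions are invented toy examples, not estimates from any language, and `entropy_bits` is a hypothetical helper name.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of a finite distribution given as probabilities."""
    # Terms with p == 0 contribute nothing (the limit of p * log p is 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Toy distributions over four outcomes (illustrative values only).
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
deterministic = [1.0]

h_uniform = entropy_bits(uniform)        # 2 bits: four equally likely outcomes
h_skewed = entropy_bits(skewed)          # lower: the outcome is more predictable
h_deterministic = entropy_bits(deterministic)  # 0 bits: no uncertainty
```

The uniform distribution attains the maximum of 2 bits for four outcomes, while skewing the probabilities lowers the entropy, which matches the reading of bits as units of predictability.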

3.1 Linguistic Rationale

Here we seek to make a linguistic argument for the adoption of bits per phoneme as a metric for complexity in the phonological literature. Bits are fundamentally units of predictability: if the entropy of a distribution is higher (more bits), it is less predictable; if the entropy is lower (fewer bits), it is more predictable, with an entropy of 0 indicating determinism.

Holistic Treatment.

When we just count the number of distinctions in individual parts of the phonology, e.g., number of vowels or number of consonants, we do not get a holistic picture of how these pieces interact. A simple probabilistic treatment will inherently capture nuanced interactions. Indeed, it is not clear how to balance the number of consonants, the number of vowels and the number of tones to get a single number of phonological complexity. Probabilistically modeling phonological strings, however, does capture this. We judge the complexity of a phonological system as its entropy.

Longer-Distance Dependencies.

To the best of the authors’ knowledge, the largest phonological unit that has been considered in the context of cross-linguistic phonological complexity is the syllable, as discussed in § 2.2. However, the syllable clearly has limitations. It cannot capture, tautologically, cross-syllabic phonological processes, which abound in the languages of the world. For instance, vowel and consonant harmony are quite common cross-linguistically. Naturally, a desideratum for any measure of phonological complexity is to consider all levels of phonological processes. Examples of vowel harmony in Turkish are presented in Tab. 1.

English   Turkish   English   Turkish
ear       kulak     throat    boğaz
rain      yağmur    foam      köpük
summit    zirve     claw      pençe
nail      tırnak    herd      sürü
horse     beygir    dog       köpek
Table 1: Turkish evinces two types of vowel harmony, front-back and round-unround. Here we focus on just front-back harmony. The examples in the table above are such that all vowels in a word are either back (ı, u, a, o) or front (i, ü, e, ö), which is generally the case.
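The front-back pattern in Table 1 can be checked mechanically. The sketch below is a simplified orthographic check, assuming the vowel classes listed in the caption; the helper names are hypothetical, and the disharmonic counterexample "kitap" is added here for illustration and does not come from the table.

```python
import unicodedata


def _nfc(text):
    """Normalize to composed form so that ü, ö, etc. are single characters."""
    return unicodedata.normalize("NFC", text)


# Vowel classes as given in the caption of Table 1.
BACK_VOWELS = set(_nfc("ıuao"))
FRONT_VOWELS = set(_nfc("iüeö"))


def is_front_back_harmonic(word):
    """True if every vowel in the word is back, or every vowel is front."""
    vowels = [ch for ch in _nfc(word) if ch in BACK_VOWELS or ch in FRONT_VOWELS]
    return all(v in BACK_VOWELS for v in vowels) or all(v in FRONT_VOWELS for v in vowels)


TABLE_1_WORDS = ["kulak", "boğaz", "yağmur", "köpük", "zirve",
                 "pençe", "tırnak", "sürü", "beygir", "köpek"]

all_harmonic = all(is_front_back_harmonic(w) for w in TABLE_1_WORDS)
```

A categorical check like this only says whether a form obeys the constraint; a probabilistic model additionally quantifies how much predictability the constraint contributes.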

Frequency Information.

None of the previously proposed phonological complexity measures deals with the fact that certain patterns are more frequent than others; probability models inherently handle this as well. Indeed, consider the role of the unvoiced velar fricative /x/ in English; while not part of the canonical consonant inventory, /x/ nevertheless appears in a variety of loanwords. For instance, many native English speakers do pronounce the last name of composer Johann Sebastian Bach as /bax/. Moreover, English phonology acts upon /x/ as one would expect: consider Halle’s (1978) example Sandra out-Bached Bach, where the second word is pronounced /out-baxt/ with a final /t/ rather than a /d/. We conclude that /x/ is in the consonant inventory of at least some native English speakers. However, counting it on equal status with the far more common /k/ when determining complexity seems incorrect. Our probabilistic metric covers this corner case elegantly.

Relatively Modest Annotation Requirements.

Many of these metrics require a linguist’s analysis of the language. This is a tall order for many languages. Our probabilistic approach only requires relatively simple annotations, namely, a Swadesh (1955)-style list in the international phonetic alphabet (IPA) to estimate a distribution. When discussing why he limits himself to counting complexities, Maddieson (2009) writes:

“[t]he factors considered in these studies only involved the inventories of consonant and vowel contrasts, the tonal system, if any, and the elaboration of the syllable canon. It is relatively easy to find answers for a good many languages to such questions as ‘how many consonants does this language distinguish?’ or ‘how many types of syllable structures does this language allow?’ ”

The moment one searches for data on more elaborate notions of complexity, e.g., the existence of vowel harmony, one is faced with the paucity of data: a linguist must have analyzed the data.

3.2 Constraints Reduce Entropy

Many phonologies in the world employ hard constraints, e.g., a syllable-final obstruent must be devoiced or the vowels in a word must be harmonic. Using our definition of phonological complexity as entropy, we can prove a general result that any hard-constraining process will reduce entropy, thus making the phonology less complex. That this holds for any hard constraint, be it vowel harmony or final-obstruent devoicing, follows from the fact that conditioning reduces entropy.
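The inequality invoked here, that conditioning reduces entropy on average, i.e., H(Y | X) ≤ H(Y), can be illustrated numerically. The sketch below uses an invented joint distribution over the backness of two successive vowels; the probabilities are assumptions chosen for illustration, not estimates from Turkish or any other language, and the example illustrates the inequality rather than proving the general result.

```python
import math
from collections import defaultdict


def entropy_bits(dist):
    """Shannon entropy in bits of a distribution given as {outcome: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginal(joint, index):
    """Marginal distribution of one coordinate of a joint {(x, y): prob}."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[outcome[index]] += p
    return dict(out)


def conditional_entropy_bits(joint):
    """H(second | first) in bits for a joint {(first, second): prob}."""
    first = marginal(joint, 0)
    return -sum(p * math.log2(p / first[x])
                for (x, _), p in joint.items() if p > 0)


# Hypothetical joint distribution: the second vowel usually agrees with the first.
harmonic = {("front", "front"): 0.45, ("front", "back"): 0.05,
            ("back", "front"): 0.05, ("back", "back"): 0.45}
# Hypothetical joint distribution with no dependency between the two vowels.
free = {(a, b): 0.25 for a in ("front", "back") for b in ("front", "back")}

h_second = entropy_bits(marginal(harmonic, 1))            # about 1 bit
h_second_given_first = conditional_entropy_bits(harmonic)  # about 0.47 bits
h_free_given_first = conditional_entropy_bits(free)        # about 1 bit
```

When the second vowel is strongly tied to the first, knowing the first vowel removes roughly half a bit of uncertainty; when the vowels are independent, conditioning changes nothing.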

3.3 A Variational Upper Bound

If we want to compute eq. 1, we are immediately faced with two problems. First, we do not know $p_{\mathrm{lex}}$: we simply assume the existence of such a distribution from which the words of the lexicon were drawn. Second, even if we did know $p_{\mathrm{lex}}$, computation of $H(p_{\mathrm{lex}})$ would be woefully intractable, as it involves an infinite sum. Following Brown et al. (1992), we tackle both of these issues together. Note that this line of reasoning follows Cotterell et al. (2018) and Mielke et al. (2019), who use a similar technique for measuring language complexity at the sentence level.

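We start with a basic inequality from information theory. For any distribution $q_{\mathrm{lex}}$ with the same support as $p_{\mathrm{lex}}$, the cross-entropy provides an upper bound on the entropy, i.e.,

$$H(p_{\mathrm{lex}}) \le H(p_{\mathrm{lex}}, q_{\mathrm{lex}}) \tag{2}$$

where cross-entropy is defined as

$$H(p_{\mathrm{lex}}, q_{\mathrm{lex}}) = -\sum_{x \in \Sigma^*} p_{\mathrm{lex}}(x) \log q_{\mathrm{lex}}(x) \tag{3}$$

Note that eq. 2 is tight if and only if $p_{\mathrm{lex}} = q_{\mathrm{lex}}$. We still are not done, as eq. 3 still requires knowledge of $p_{\mathrm{lex}}$ and involves an infinite sum. However, we are now in a position to exploit samples from $p_{\mathrm{lex}}$. Specifically, given $N$ samples $\tilde{x}^{(1)}, \ldots, \tilde{x}^{(N)} \sim p_{\mathrm{lex}}$, we approximate

$$H(p_{\mathrm{lex}}, q_{\mathrm{lex}}) \approx -\frac{1}{N} \sum_{i=1}^{N} \log q_{\mathrm{lex}}(\tilde{x}^{(i)}) \tag{4}$$

with equality if we let $N \to \infty$. In information theory, this equality in the limit is called the asymptotic equipartition property and follows easily from the weak law of large numbers. Now, we have an empirical procedure for estimating an upper bound on $H(p_{\mathrm{lex}})$. For the rest of the paper, we will use the right-hand side of eq. 4 as a surrogate for the phonotactic complexity of a language.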
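The bound and its sample-based estimate can be sketched on a toy example where both distributions are known. Everything below is an illustrative assumption: the three-word "lexicon", the probabilities, the helper names, and in particular the choice to normalize total bits by the total number of phonemes (one character per phoneme, no end-of-string symbol) to obtain bits per phoneme.

```python
import math


def cross_entropy_bits(p, q):
    """Exact cross-entropy H(p, q) in bits for finite dicts {word: prob}."""
    return -sum(p[x] * math.log2(q[x]) for x in p if p[x] > 0)


def empirical_cross_entropy_bits(samples, q):
    """Sample average of -log2 q(x), the right-hand side of eq. 4."""
    return -sum(math.log2(q[x]) for x in samples) / len(samples)


def bits_per_phoneme(samples, q):
    """Total bits divided by total phonemes (normalization is an assumption)."""
    total_bits = -sum(math.log2(q[x]) for x in samples)
    total_phonemes = sum(len(x) for x in samples)
    return total_bits / total_phonemes


# Hypothetical true distribution p and model q over three toy word forms.
p = {"ba": 0.5, "bab": 0.25, "abba": 0.25}
q = {"ba": 0.25, "bab": 0.25, "abba": 0.5}

entropy = cross_entropy_bits(p, p)   # H(p) = H(p, p)
upper_bound = cross_entropy_bits(p, q)

# A sample whose proportions happen to match p exactly.
samples = ["ba", "ba", "bab", "abba"]
estimate = empirical_cross_entropy_bits(samples, q)
bpp = bits_per_phoneme(samples, q)
```

In this toy case the entropy is 1.5 bits, the cross-entropy under the mismatched model is 1.75 bits, and the sample estimate recovers the cross-entropy exactly because the sample proportions equal the true probabilities. With real data only the estimate is available, which is why it serves as a surrogate.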

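How to choose $q_{\mathrm{lex}}$?

Choosing a good $q_{\mathrm{lex}}$ is a two-step process. First, we choose a variational family $\mathcal{Q}$. Then, we choose a specific $q_{\mathrm{lex}} \in \mathcal{Q}$ by minimizing the right-hand side of eq. 4:

$$q_{\mathrm{lex}} = \operatorname*{arg\,min}_{q \in \mathcal{Q}} \; -\frac{1}{N} \sum_{i=1}^{N} \log q(\tilde{x}^{(i)})$$

This procedure corresponds to maximum likelihood estimation. In this work, we consider two variational families: (i) a phoneme $n$-gram model, and (ii) a phoneme-level RNN language model. We describe each in § 4.1.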

3.4 A Note on Types and Tokens

To make the implicit explicit, in this work we will exclusively be modeling types, rather than tokens. We briefly justify this decision from both theoretical and practical concerns. On the theoretical side, a token-based model is unlikely to correctly model an OOV distribution, as very frequent tokens often display unusual phonotactics for historical reasons. A classic example comes from English: consider the appearance of /ð/. Judging by token frequency, /ð/ is quite common, as it starts some of the most common words in the language: the, they, that, etc. However, novel words categorically avoid initial /ð/. From a statistical point of view, one way to justify type-level modeling is through the Pitman–Yor process Ishwaran and James (2003). Goldwater et al. (2006) showed that type-level modeling is a special case of this stochastic process, writing that they “justif[y] the appearance of type frequencies in formal analyses of natural language.”

Practically, using token-level frequencies, even in a dampened form, is not possible due to the large selection of languages we model. Most of the languages we consider do not have corpora large enough to get reasonable token estimates. Moreover, as many of the languages we consider have a small number of native speakers, and, in extreme cases, are endangered, the situation is unlikely to remedy itself, forcing the phonotactician to rely on types.

4 Methods

4.1 Phoneme-Level Language Models



Let $\Sigma$ be a discrete alphabet of symbols from the international phonetic alphabet (IPA), including special beginning-of-string and end-of-string symbols. A character-level language model (LM) models a probability distribution over $\Sigma^*$:

$$p(\mathbf{x}) = \prod_{i=1}^{|\mathbf{x}|} p(x_i \mid \mathbf{x}_{<i}) \qquad (6)$$


Trigram LM.

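$n$-grams assume the sequence follows an $(n-1)$-order Markov model, conditioning the probability of a phoneme on the $n-1$ previous ones:

$$p(x_i \mid \mathbf{x}_{<i}) = p(x_i \mid x_{i-1}, \ldots, x_{i-n+1}) \qquad (7)$$

where we assume the string $\mathbf{x}$ is properly padded with beginning and end-of-string symbols.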

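The trigram model used in this work is estimated as the deleted interpolation Jelinek (1980) of the trigram, bigram and unigram relative frequency estimates:

$$p(x_i \mid \mathbf{x}_{<i}) = \alpha_3\, \hat{p}(x_i \mid x_{i-2}, x_{i-1}) + \alpha_2\, \hat{p}(x_i \mid x_{i-1}) + \alpha_1\, \hat{p}(x_i) \qquad (8)$$

where $\hat{p}$ denotes a relative frequency estimate and the mixture parameters $\alpha_1, \alpha_2, \alpha_3$ were estimated via Bayesian optimization with a Gaussian prior maximizing the expected improvement on a validation set, as discussed by Snoek et al. (2012).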
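A minimal sketch of an interpolated trigram model over phoneme strings may make the mixture concrete. It is not the authors' implementation: the mixture weights are fixed by hand here rather than tuned by Bayesian optimization, the end-of-string symbol is counted as a predicted symbol, and the three training "words" are invented.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"

def train_counts(words):
    """Collect n-gram and context counts from padded phoneme sequences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    ctx1, ctx2 = Counter(), Counter()  # counts of one- and two-symbol contexts
    for word in words:
        seq = [BOS, BOS] + list(word) + [EOS]
        for i in range(2, len(seq)):
            uni[seq[i]] += 1
            bi[(seq[i - 1], seq[i])] += 1
            tri[(seq[i - 2], seq[i - 1], seq[i])] += 1
            ctx1[seq[i - 1]] += 1
            ctx2[(seq[i - 2], seq[i - 1])] += 1
    return uni, bi, tri, ctx1, ctx2

def interpolated_prob(x, prev1, prev2, counts, alphas):
    """Mixture of trigram, bigram and unigram relative frequencies."""
    uni, bi, tri, ctx1, ctx2 = counts
    a3, a2, a1 = alphas
    p1 = uni[x] / sum(uni.values())
    # Unseen contexts contribute zero from the higher-order estimates.
    p2 = bi[(prev1, x)] / ctx1[prev1] if ctx1[prev1] else 0.0
    p3 = tri[(prev2, prev1, x)] / ctx2[(prev2, prev1)] if ctx2[(prev2, prev1)] else 0.0
    return a3 * p3 + a2 * p2 + a1 * p1

def bits_per_phoneme(word, counts, alphas):
    """Average negative log2-probability per predicted symbol (EOS included)."""
    seq = [BOS, BOS] + list(word) + [EOS]
    bits = 0.0
    for i in range(2, len(seq)):
        bits -= math.log2(interpolated_prob(seq[i], seq[i - 1], seq[i - 2], counts, alphas))
    return bits / (len(seq) - 2)

counts = train_counts(["tak", "tat", "kat"])  # invented toy lexicon
alphas = (0.5, 0.3, 0.2)                      # hand-picked, must sum to one
example_bits = bits_per_phoneme("tak", counts, alphas)
```

Because the unigram term uses plain relative frequencies, a phoneme never seen in training would receive probability zero; a real system would need some additional smoothing for that case.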

Recurrent Neural LM.

Recurrent neural networks excel in language modeling, being able to capture complex distributions Mikolov et al. (2010); Sundermeyer et al. (2012). Empirically, recent work has observed dependencies on up to around 200 tokens Khandelwal et al. (2018). We use a character-level Long Short-Term Memory (LSTM, Hochreiter and Schmidhuber, 1997) language model, which is the state of the art for character-level language modeling Merity et al. (2018).

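Our architecture receives a sequence of tokens $\mathbf{x} \in \Sigma^*$ and embeds each token $x_i \in \Sigma$ using a dictionary-lookup embedding table. This results in vectors $\mathbf{z}_i \in \mathbb{R}^d$ which are fed into an LSTM. This LSTM produces a high-dimensional representation of the sequence, often termed hidden states:

$$\mathbf{h}_i = \mathrm{LSTM}(\mathbf{z}_{i-1}, \mathbf{h}_{i-1}) \in \mathbb{R}^d \qquad (9)$$

These representations are then fed into a softmax to produce a distribution over the next character:

$$p(x_i \mid \mathbf{x}_{<i}) = \mathrm{softmax}(\mathbf{W} \mathbf{h}_i + \mathbf{b}) \qquad (10)$$

where $\mathbf{W} \in \mathbb{R}^{|\Sigma| \times d}$ is a final projection matrix and $\mathbf{b} \in \mathbb{R}^{|\Sigma|}$ is a bias term. In our implementation, $\mathbf{h}_0$ is a vector of all zeros and $\mathbf{z}_0$ is the lookup embedding for the beginning-of-string token.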
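The forward computation described above (embedding lookup, LSTM step, softmax over the next symbol) can be sketched in plain Python. This is an untrained toy with random weights and a hypothetical five-symbol vocabulary, written only to show the data flow; the gate layout, initialization range and all names are assumptions, and a real model would be trained with a deep learning library.

```python
import math
import random

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def init_params(vocab_size, d, rng):
    return {
        "embed": random_matrix(vocab_size, d, rng),  # lookup table, one row per symbol
        "W_z": random_matrix(4 * d, d, rng),         # input-to-gates weights
        "W_h": random_matrix(4 * d, d, rng),         # hidden-to-gates weights
        "gate_bias": [0.0] * (4 * d),
        "W": random_matrix(vocab_size, d, rng),      # final projection matrix
        "b": [0.0] * vocab_size,                     # output bias
    }

def lstm_step(z, h, c, params):
    """One LSTM step: returns the new hidden state and cell state."""
    d = len(h)
    pre = [a + b + k for a, b, k in zip(matvec(params["W_z"], z),
                                        matvec(params["W_h"], h),
                                        params["gate_bias"])]
    i_gate = [sigmoid(v) for v in pre[:d]]
    f_gate = [sigmoid(v) for v in pre[d:2 * d]]
    o_gate = [sigmoid(v) for v in pre[2 * d:3 * d]]
    cand = [math.tanh(v) for v in pre[3 * d:]]
    c_new = [f * cj + i * g for f, cj, i, g in zip(f_gate, c, i_gate, cand)]
    h_new = [o * math.tanh(cj) for o, cj in zip(o_gate, c_new)]
    return h_new, c_new

def word_bits(word, vocab, params, d):
    """Total negative log2-probability of a word, including end-of-string."""
    ids = [vocab[s] for s in ["<s>"] + list(word) + ["</s>"]]
    h, c = [0.0] * d, [0.0] * d  # initial hidden state is all zeros
    bits = 0.0
    for prev, nxt in zip(ids, ids[1:]):
        h, c = lstm_step(params["embed"][prev], h, c, params)
        logits = [u + v for u, v in zip(matvec(params["W"], h), params["b"])]
        bits -= math.log2(softmax(logits)[nxt])
    return bits

rng = random.Random(0)
vocab = {s: i for i, s in enumerate(["<s>", "</s>", "a", "k", "t"])}
params = init_params(len(vocab), 8, rng)
total_bits = word_bits("tak", vocab, params, 8)
```

With small random weights the output distribution is close to uniform, so each of the four predictions costs roughly log2(5) bits; training would lower this on words that follow the language's phonotactics.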

Phoneme Embedding LM.

When developing a phoneme-level recurrent neural LM, one can use a base of phonemic features, e.g., Phoible (Moran et al., 2014), to implement a multi-hot embedding such that similar phonemes will have similar embedding representations. A phoneme $x_i$ will have a set of binary attributes $a_i^{(k)}$ (e.g. stress, sonorant, nasal), each with its corresponding embedding representation $\mathbf{z}^{(k)}$. A phoneme embedding will, then, be composed of the element-wise average of the lookup embeddings of its features:

$$\mathbf{z}_i = \frac{\sum_k a_i^{(k)} \mathbf{z}^{(k)}}{\sum_k a_i^{(k)}} \qquad (11)$$

where $a_i^{(k)}$ is 1 if phoneme $i$ presents attribute $k$ (and 0 otherwise) and $\mathbf{z}^{(k)}$ is the lookup embedding of attribute $k$. This architecture forces similar phonemes, measured in terms of overlap in distinctive features, to have similar representations.
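The averaging of attribute embeddings can be illustrated with a few lines of Python. The attribute names, the two-dimensional vectors, and the feature assignments below are invented for the example and do not come from Phoible.

```python
def phoneme_embedding(attributes, attribute_embeddings):
    """Element-wise average of the embeddings of the attributes a phoneme has."""
    active = [attribute_embeddings[k] for k in attributes]
    dim = len(active[0])
    return [sum(vec[j] for vec in active) / len(active) for j in range(dim)]

# Hypothetical attribute embeddings (in a real model these are learned).
attribute_embeddings = {
    "sonorant": [1.0, 0.0],
    "nasal": [0.0, 1.0],
    "voiced": [1.0, 1.0],
}

# Hypothetical feature assignments for three phonemes.
features = {
    "m": ["sonorant", "nasal", "voiced"],
    "l": ["sonorant", "voiced"],
    "b": ["voiced"],
}

embeddings = {ph: phoneme_embedding(attrs, attribute_embeddings)
              for ph, attrs in features.items()}
```

Phonemes that share more attributes end up closer together: here /m/ and /l/ share two attributes, while /b/ shares only voicing with either.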

4.2 NorthEuraLex Data

We make use of data from the NorthEuraLex corpus Dellert and Jäger (2017). The corpus is a concept-aligned multi-lingual lexicon with data from 107 languages. The lexicon contains 1016 “basic” concepts. Importantly, NorthEuraLex is appealing for our study as all the words are written in a unified IPA scheme. A sample of the lexicon is provided in Tab. 2. For the results reported in this paper, we omitted Mandarin, since no tone information was included in its annotations, causing its phonotactics to be greatly underspecified. No other tonal languages were included in the corpus, so all reported results are over 106 languages.

Why is Base-Concept Aligned Important?

Making use of data that is concept aligned across the languages provides a certain amount of control (to the extent possible) of the influence of linguistic content on the forms that we are modeling. In other words, these forms should be largely comparable across the languages in terms of how common they are in the active vocabulary of adult speakers. Further, base concepts as defined for the collection are more likely to be lemmas without inflection, thus reducing the influence of morphological processes on the results. (Most of the concepts in the dataset do not contain function words, and verbs are in the bare infinitive form, e.g., have instead of to have, although there are a few exceptions. For example, the German word hundert is represented as a hundred in English.)

To test this latter assertion, we made use of the UniMorph morphological database Kirov et al. (2018) to look up words and assess the percentage that correspond to lemmas or base forms. Of the 106 languages in our collection, 48 are also in the UniMorph database, and 46 annotate their lemmas in a way that allowed for simple string matching with our word forms. For these 46 languages, on average we found 313 words in UniMorph of the 1016 concepts (median 328). A mean of 87.2% (median 93.3%; minimum 58.6%) of these matched lemmas for that language in the UniMorph database. This rough string matching approach provides some indication that the items in the corpus are largely composed of such base forms.

Concept      Language        Word      IPA
eye          portuguese      olho      /oʎu/
ear          finnish         korva     /kɔrʋɑ/
give         north karelian  antua     /ɑntʊɑ/
tooth        veps            hambaz    /hɑmbɑz/
black        northern sami   čáhppes   /t͡ʃaahppes/
immediately  hill mari       töpök     /tørøk/
Table 2: Sample of the lexicon in NorthEuraLex corpus.

Dataset Limitations.

Unfortunately, there is less typological diversity in our dataset than we would ordinarily desire. NorthEuraLex draws its languages from 21 distinct language families that are spoken in Europe and Asia. This excludes languages indigenous to the Americas, Australia, Africa and South-East Asia. (Inuit languages, which are genetically related to the languages of Siberia, are included in the lexicon.) While lamentable, we know of no other concept-aligned lexicon with broader typological diversity that is written in a unified phonetic alphabet, so we must save studies of a more typologically diverse set of languages for future work.

In addition, we note that the process of base concept selection and identification of corresponding forms from each language (detailed in Dellert, 2015, 2017) was non-trivial, and some of the corpus design decisions may have resulted in somewhat biased samples in some languages. For example, there was an attempt to minimize the frequency of loanwords in the dataset, which may make the lexicons in loanword-heavy languages, such as English with its extensive Latinate vocabulary, somewhat less representative of everyday use than in other languages. Similarly, the creation of a common IPA representation over this number of languages required choices that could potentially result in corpus artifacts. As with the issue of linguistic diversity, we acknowledge that the resource has some limitations but claim that it is the best currently available dataset for this work.

Figure 2: Per-phoneme complexity vs average word length under both a trigram and an LSTM language model.
Figure 3: Conventional measures of phonological complexity vs average word length. These complexity measures are based on inventory size.

Splitting the Data.

We split the data at the concept level into 10 folds, used for cross-validation. We create train-dev-test splits where the training portion has 8 folds (roughly 80% of the concepts) and the dev and test portions have 1 fold each (roughly 10% of the concepts each). We then create language-specific sets containing each language's words for the concepts in the corresponding portion. Cross-validation allows us to have all 1016 concepts in our test sets (although evaluated using different model instances), and we conduct the following studies using all of them.
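The concept-level split can be sketched as follows. The shuffling seed, the rotation scheme (the fold after the test fold serves as the dev fold) and the use of integer concept ids are assumptions made for this illustration; the text does not specify them.

```python
import random

def concept_folds(concepts, n_folds=10, seed=0):
    """Shuffle concept ids and deal them into n_folds roughly equal folds."""
    shuffled = list(concepts)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]

def rotate_splits(folds):
    """Build one (train, dev, test) split per fold, rotating the test fold."""
    n = len(folds)
    result = []
    for k in range(n):
        dev_index = (k + 1) % n  # assumed choice of dev fold
        test = folds[k]
        dev = folds[dev_index]
        train = [c for j, fold in enumerate(folds)
                 if j not in (k, dev_index) for c in fold]
        result.append((train, dev, test))
    return result

concepts = list(range(1016))  # stand-ins for the 1016 concept ids
folds = concept_folds(concepts)
all_splits = rotate_splits(folds)
```

Since the split is made over concepts rather than over individual word forms, the same concept lands in the same portion for every language, and each concept is used for testing exactly once across the ten rotations.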

4.3 Artificial Languages

In addition to naturally occurring languages, we are also interested in artificial ones. Why? We wish to validate our models in a controlled setting, quantifying the contribution of specific linguistic phenomena to our complexity measure. Thus, developing artificial languages that differ from natural ones with respect to only one phonological property is useful.

The Role of Final-Obstruent Devoicing.

Final-obstruent devoicing reduces phonological complexity under our information-theoretic metric. The reason is simple: there are fewer valid syllables as all those with voiced final obstruents are ruled out. Indeed, this point is also true of the syllable counting metric discussed in § 2.2. One computational notion of complexity might say that the complexity of the phonology is equal to the number of states required to encode the transduction from an underlying form to a surface form in a minimal finite-state transduction. Note that all SPE-style rules may be so encoded Kaplan and Kay (1994). Thus, the complexity of the phonotactics could be said to be related to the number of SPE-style rules that operate. In contrast, under our metric, any process that constrains the number of possibilities will, inherently, reduce complexity. The studies in § 5.3 allow us to examine the magnitude of such a reduction, and validate our models with respect to this expected behavior.

We create two artificial datasets without final-obstruent devoicing based on the German and Dutch portions of NorthEuraLex. We reverse the process, using the orthography as a guide. For example, the German /uk/ is converted to /ug/ based on the orthography Zug.
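One way to picture the orthography-guided reversal is a small rule table keyed on the final letter of the spelling. The table `REVOICE` and the function below are hypothetical and cover only three stops; the actual procedure used for the German and Dutch data is not described in this level of detail.

```python
# Hypothetical mapping: final orthographic letter -> (devoiced IPA, voiced IPA).
REVOICE = {"g": ("k", "g"), "d": ("t", "d"), "b": ("p", "b")}

def undo_final_devoicing(ipa, orthography):
    """Restore a voiced final obstruent when the spelling indicates one."""
    if not ipa or not orthography:
        return ipa
    rule = REVOICE.get(orthography[-1].lower())
    if rule and ipa.endswith(rule[0]):
        return ipa[: -len(rule[0])] + rule[1]
    return ipa

revoiced = undo_final_devoicing("uk", "Zug")     # spelling ends in <g>
unchanged = undo_final_devoicing("rat", "Rat")   # spelling ends in <t>
```

Words whose spelling already ends in a voiceless letter are left untouched, so only forms affected by the devoicing rule change.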

The Role of Vowel Harmony.

Like final-obstruent devoicing, vowel harmony plays a role in reducing the number of licit syllables. In contrast to final-obstruent devoicing, however, vowel harmony acts cross-syllabically. Consider the Turkish lexicon, where most, but not all, basic lexical items obey vowel harmony. Processes like this reduce the entropy of the lexicon distribution $p_{\mathrm{lex}}$ and, thus, can be considered as creating a less complex phonotactics.

For vowel harmony, we create 10 artificial datasets by randomly replacing each vowel in a word with a vowel sampled (with replacement) from that language’s vowel inventory. This breaks all vowel harmony, but keeps the syllabic structure.
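The vowel-replacement procedure can be sketched as below. Uniform sampling over the inventory is an assumption (the text only says the vowel is sampled with replacement), and the inventory and example word are invented.

```python
import random

def break_vowel_harmony(word, vowel_inventory, rng):
    """Replace every vowel segment with one drawn from the vowel inventory."""
    vowels = sorted(vowel_inventory)  # fixed order keeps sampling reproducible
    return [rng.choice(vowels) if seg in vowel_inventory else seg for seg in word]

rng = random.Random(0)
inventory = {"a", "e", "i", "o", "u", "y", "ø", "ɯ"}  # hypothetical vowel inventory
word = list("kitap")                                   # hypothetical segment list
variants = [break_vowel_harmony(word, inventory, rng) for _ in range(10)]
```

Consonants stay in place and every vowel slot still holds a vowel, so word length and syllabic structure are preserved while any dependency between the vowels is destroyed.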

Measure Pearson Spearman
Number of:
          phonemes -0.047 -0.054
          vowels -0.164 -0.162
          consonants 0.030 0.045
Bits per phoneme:
          unigram -0.217 -0.222
          trigram -0.682 -0.672
          LSTM -0.762 -0.744
Table 3: Pearson and Spearman rank correlation coefficients between complexity measures and average word length in phoneme segments.

5 Results

5.1 Study 1: Bits Per Phoneme Negatively Correlates with Word Length

As stated earlier, Pellegrino et al. (2011) investigated a complexity trade-off with the information density of speech. From a 7-language study they found a strong correlation between the information density and the syllabic complexity of a language. One hypothesis adduced to explain these findings is that, for functional reasons, the rate of linguistic information is very similar cross-linguistically. Inspired by their study, we conduct a similar study with our phonotactic setup. We hypothesize that the bits per phoneme for a given concept correlates with the number of phonemes in the word. Moreover, the bits per word should be similar across languages.

We consider the relation between the average bits per phoneme of a held-out portion of a language’s lexicon, as measured by our best language model, and the average length of the words in that language. We present the results in Figures 2 and 3 and in Tab. 3. We find a strong correlation under the LSTM LM (Spearman’s $\rho = -0.744$). At the same time, we only see a weak correlation under conventional measures of phonotactic complexity, such as vowel inventory size (Spearman’s $\rho = -0.162$). In Fig. 4, we plot the kernel density estimate and histogram densities (both 10 and 100 bins) of word-level complexity (bits per word).
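For readers who want to reproduce this kind of language-level analysis, a Spearman correlation can be computed as the Pearson correlation of ranks. The sketch below uses four made-up (average word length, bits per phoneme) pairs; it does not use the paper's data.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up per-language averages: word length and bits per phoneme.
avg_length = [4.0, 5.0, 6.0, 7.0]
bits_per_phoneme = [3.6, 3.3, 3.4, 2.9]
rho = spearman(avg_length, bits_per_phoneme)
```

A negative value indicates that languages with longer words tend to carry fewer bits per phoneme, which is the direction of the trade-off reported in Tab. 3.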

Figure 4: Kernel density estimate (KDE) of the average phonotactic complexity per word across 106 different languages. Different languages tend to present similar complexities (bits per word).

5.2 Study 2: Possible Confounds for Negative Correlations

One possible confound for these results is that phonemes later in a word may in general have higher probability given the previous phonemes than those earlier in the string. This sort of positional effect was demonstrated in Dutch van Son and Pols (2003), where position in the word accounted for much of the variance in segmental information. (We briefly note that the van Son and Pols (2003) study did not make use of a train/dev/test split of their data, but rather simply analyzed raw relative frequency over their Dutch corpus. As a result, all positions beyond any word onset that is unique in their corpus would have probability one, leading to a more extreme position effect than we would observe using regularization and validating on unseen forms.) To ensure that we are not simply replicating such a positional effect across many languages, we performed several additional analyses.

Truncated words.

First, we calculated the bits-per-phoneme for just the first three positions in the word, and then looked at the correlation between this word-onset bits-per-phoneme and the average (full) word length in phoneme segments. In other words, for the purpose of calculating bits-per-phoneme, we truncated all words to a maximum of three phonemes, and in such a way explicitly eliminated the contribution of positions later in any word. Using the LSTM model, this yielded a weaker Spearman correlation than the $\rho = -0.744$ obtained without such truncation (reported in Tab. 3). This suggests that there is a contribution of later positions to the effect presented in Tab. 3 that we lose by eliding them, but that even in the earlier positions of the word we are seeing a trade-off with full average word length.
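The truncation amounts to averaging only the first few per-position surprisals of each word. A minimal sketch, with made-up per-position bit values:

```python
def bits_per_phoneme(surprisals, max_positions=None):
    """Mean surprisal in bits, optionally over only the first max_positions phonemes."""
    kept = surprisals if max_positions is None else surprisals[:max_positions]
    return sum(kept) / len(kept)

word_surprisals = [4.0, 3.0, 2.0, 1.0, 1.0]    # made-up bits for a 5-phoneme word
full = bits_per_phoneme(word_surprisals)       # all positions
onset = bits_per_phoneme(word_surprisals, 3)   # first three positions only
```

In this toy word the later positions are cheaper, so the onset-only value is higher than the full-word value; the truncated analysis removes exactly this positional contribution.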

Correlation with phoneme position.

We next looked to measure a position effect directly, by calculating the correlation between word position and bits for that position across all languages. Here we find a negative Spearman correlation, which again supports the contention that later positions in general require fewer bits to encode. Nonetheless, this correlation is still weaker than the per-language word-length one ($\rho = -0.744$).

Per-word correlations.

We also calculated the correlation between word length and bits per phoneme across all languages (without averaging per language here). The Spearman correlation between these factors, at the word level using all languages, is $\rho = -0.312$. Analyzing each language individually, there is an average Spearman’s $\rho = -0.257$ between bits per phoneme and word length. The minimum negative (i.e., highest magnitude) correlation of any language in the set is $\rho = -0.607$. These per-word correlations are reported in the upper half of Tab. 4.

Measure Pearson Spearman
Per Word:
      all languages -0.269 -0.312
      each language (avg) -0.220 -0.257
      each language (min) -0.561 -0.607
Per Language:
      Fake (avg) -0.270 -0.254
      Fake (min) -0.586 -0.568
      Real -0.762 -0.744
Table 4: Pearson and Spearman rank correlation coefficients between complexity measures and word length in phoneme segments. All correlations are statistically significant with .

Permuted ‘language’ correlations.

Finally, to determine if our language effects perhaps arise due to the averaging of word lengths and bits per phoneme for each language, we ran a permutation test on languages. We shuffle words (with their pre-calculated bits-per-phoneme values) into 106 sets with the same size as the original languages—thus creating fake ‘languages’. We take the average word length and bits per phoneme in each of these fake languages and compare the correlation—returning to the ‘language’ level this time—with the original correlation. After running this test for permutations, we found no shuffled set with an equal or higher Spearman (or Pearson) correlation than the real set. Thus, with a strong confidence () we can state there is a language level effect. Average and minimum negative correlations for these ‘fake’ languages (as well as the real set for ease of comparison) are presented in the lower half of Tab. 4.
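The shuffling procedure can be sketched as follows. This is an illustration on synthetic data, not the paper's setup: it uses Pearson rather than Spearman correlation to stay short, 200 permutations, 20 synthetic "languages" of 50 words each, and a p-value with an add-one correction; all of these are assumptions.

```python
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def language_level_correlation(languages):
    """languages: list of lists of (word_length, bits_per_phoneme) records."""
    avg_len = [sum(n for n, _ in lang) / len(lang) for lang in languages]
    avg_bpp = [sum(b for _, b in lang) / len(lang) for lang in languages]
    return pearson(avg_len, avg_bpp)

def permutation_p_value(languages, n_permutations, rng):
    """Share of shuffled 'languages' with a correlation at least as negative."""
    observed = language_level_correlation(languages)
    pooled = [w for lang in languages for w in lang]
    sizes = [len(lang) for lang in languages]
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        fake, start = [], 0
        for size in sizes:  # fake languages keep the original sizes
            fake.append(pooled[start:start + size])
            start += size
        if language_level_correlation(fake) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_permutations + 1)

# Synthetic data with a built-in language-level trade-off plus word-level noise.
rng = random.Random(0)
languages = []
for k in range(20):
    languages.append([(4 + 0.5 * k + rng.uniform(-3, 3),
                       4 - 0.1 * k + rng.uniform(-1, 1)) for _ in range(50)])

observed, p_value = permutation_p_value(languages, 200, rng)
```

Each word keeps its own length and bits-per-phoneme value when shuffled, so any word-level correlation survives in the fake languages; only the grouping by language is destroyed, which is what isolates the language-level effect.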

Model      Language   Orig    Art     Diff
trigram    German     3.703   3.708   0.005 (0.13%)
trigram    Dutch      3.607   3.629   0.022 (0.58%)
LSTM       German     3.230   3.268   0.038 (1.18%)
LSTM       Dutch      3.161   3.191   0.030 (0.95%)
Table 5: Complexities for original and artificial languages when removing final-obstruent devoicing.  represents a statistically significant difference with

5.3 Study 3: Constraining Languages Reduces Phonotactic Complexity

Final-obstruent devoicing and vowel harmony reduce the number of licit syllables in a language, hence reducing the entropy. To determine the magnitude that such effects can have on the measure for our different model types, we conduct two studies. In the first, we remove final-obstruent devoicing from the German and Dutch languages in NorthEuraLex, as discussed in § 4.3. In the second study, we remove vowel harmony from 10 languages that have it, as also explained in § 4.3. (The languages with vowel harmony are: bua, ckt, evn, fin, hun, khk, mhr, mnc, myv, tel, and tur.)

After deriving two artificial languages without obstruent devoicing from German and Dutch, we used 10-fold cross-validation to train models for each language. The statistical significance of differences between natural and artificial languages was analyzed using paired permutation tests between the pairs. Results are presented in Tab. 5. We see that the $n$-gram can capture this change in complexity for Dutch, but not for German. At the same time, the LSTM shows a statistically significant increase of bits per phoneme when we remove obstruent devoicing from both languages. Fig. 5 presents a similar impact on complexity from vowel harmony removal, as evidenced by the fact that all points fall above the equality line. Average complexity also increased under our LSTM models.
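A paired permutation test of this kind can be implemented by randomly flipping the sign of each paired difference. The sketch below is an assumption about the details: it pairs values item by item, uses a two-sided test on the mean difference with an add-one correction, and runs on invented numbers.

```python
import random

def paired_permutation_test(original, artificial, n_permutations, rng):
    """Two-sided sign-flip test on the mean of the paired differences."""
    diffs = [a - o for o, a in zip(original, artificial)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis each pair's labels are exchangeable.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)

rng = random.Random(0)
# Invented per-item complexities: the artificial variant is consistently higher.
original = [3.0 + 0.1 * i for i in range(20)]
artificial = [o + 0.03 + 0.002 * i for i, o in enumerate(original)]
p_value = paired_permutation_test(original, artificial, 1000, rng)
```

When every difference has the same sign, almost no random sign assignment reaches the observed mean, so the p-value is small; when the differences cancel out, every permutation does and the p-value is one.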

In both of these artificial language scenarios, the LSTM models appeared more sensitive to the constraint removal, as expected.

Figure 5: Complexities for natural and artificial languages when removing vowel harmony. A paired permutation test showed that all differences are statistically significant.

5.4 Study 4: Negative Trade-off Persists Within and Across Families

Moran and Blasi (2014) investigated the correlation between the number of phonological units in a language and its average word length across a large and varied set of languages. They found that, while these measures of phonotactic complexity (number of vowels, consonants or phonemes in a language) are correlated with word length when measured across a varied set of languages, such a correlation usually does not hold within language families. We hypothesize that this is due to their measures being rather coarse approximations to phonotactic complexity, so that only large changes in the language would show significant correlation given the noise. We also hypothesize that our complexity measure is less noisy, hence should be able to yield significant correlations both within and across families.

Results in Tab. 3 show a strong correlation for the LSTM measure, while they show a weak one for conventional measures of complexity. As stated before, Moran and Blasi (2014) found that vowel inventory size shows a strong correlation to word length on a diverse set of languages, but, as mentioned in § 4.2, our dataset is more limited than desired. To test if we can mitigate this effect we average the complexity measures and word length per family (instead of per language) and calculate the same correlations again. These results are presented in Tab. 6 and show that when we average these complexity measures per family we indeed find a stronger correlation between vowel inventory size and average word length, although with a higher null hypothesis probability (Spearman’s $\rho = -0.367$). We also see our LSTM-based measure still shows a strong correlation (Spearman’s $\rho = -0.526$).

Measure Pearson Spearman
Number of:
          phonemes -0.214 -0.095
          vowels -0.383 -0.367
          consonants -0.147 -0.092
Bits per phoneme:
          unigram -0.267 -0.232
          trigram -0.621 -0.520
          LSTM -0.778 -0.526
Table 6: Pearson and Spearman correlation between complexity measures and word length in phoneme segments averaged across language families.

We now analyze these correlations within families, for all language families in our dataset with at least 4 languages. These results are presented in Tab. 7. Our LSTM-based phonotactic complexity measure shows strong intra-family correlation with average word length for all five analyzed language families ($-1.0 \le \rho \le -0.662$). At the same time, vowel inventory size only shows a negative statistically significant correlation within Turkic.

Family LSTM Vowels # Langs
Dravidian -1.0 -0.894 4
Indo-European -0.662 -0.218 37
Nakh-Daghestanian -0.771 -0.530 6
Turkic -0.690 -0.773 8
Uralic -0.874  0.363 26
Statistically significant with
Statistically significant with
Table 7: Spearman correlation between complexity measures and average word length per language family. Phonotactic complexity in bits per phoneme presents very strong intra-family correlation with word length in three of the five families. Size of vowel inventory presents intra-family correlation in Turkic and Uralic.

5.5 Study 5: Explicit feature representations do not generally improve models

Tab. 3 presents strong correlations when using an LSTM with standard one-hot lookup embedding. Here we train LSTMs with three different phoneme embedding models: (1) a typical Lookup embedding, in which each phoneme has an associated embedding; (2) a phoneme-feature-based embedding, as explained in § 4.1; (3) the concatenation of the Lookup and the Phoneme embedding. We train each of these variants in two settings: with fully independent models for each language, and with independent models that share embedding weights across languages.

We first analyze these model variants under the same lens as used in Study 1. Tab. 8 shows the correlations between the complexity measure resulting from each of these models and the average number of phonemes in a word. We find strong correlations for all of them ($-0.752 \le \rho \le -0.740$). We also present in Tab. 8 these models’ cross-entropy, averaged across all languages. At least for the methods that we are using here, we derived no benefit from either more explicit featural representations of the phonemes or sharing the embeddings across languages.

Model Complexity Spearman
          unigram 4.477 -0.222
          trigram 3.270 -0.672
Independent Embeddings:
          Lookup 2.976 -0.744
          Phoneme 2.992 -0.741
          Lookup + Phoneme 2.975 -0.752
Shared Embeddings:
          Lookup 2.988 -0.743
          Phoneme 2.977 -0.744
          Lookup + Phoneme 2.982 -0.740
Table 8: Average cross-entropy across all languages and the correlation between complexity and average word length for different models.

We also investigated scenarios using less training data, and it was only in very sparse scenarios (e.g., using just 10% of the training data used in our standard trials, or 81 example words) where we observed even a small benefit to explicit feature representations and shared embeddings.

6 Conclusion

We have presented methods for calculating a well-motivated measure of phonotactic complexity: bits per phoneme. This measure is derived from information theory and its value is calculated using the probability distribution of a language model. We demonstrate that cross-linguistic comparison is straightforward using such a measure, and find a strong negative correlation with average word length. This trade-off with word length can be seen as an example of complexity compensation or perhaps related to communicative capacity.


We thank Damián E. Blasi for his feedback on previous versions of this paper and the anonymous reviewers, as well as action editor Eric Fosler-Lussier, for their constructive and detailed comments—the paper is much improved as a result.


  • Brown et al. (1992) Peter F. Brown, Vincent J. Della Pietra, Robert L. Mercer, Stephen A. Della Pietra, and Jennifer C. Lai. 1992. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1):31–40.
  • Cherry et al. (1953) E. Colin Cherry, Morris Halle, and Roman Jakobson. 1953. Toward the logical description of languages in their phonemic aspect. Language, pages 34–46.
  • Chomsky and Halle (1965) Noam Chomsky and Morris Halle. 1965. Some controversial questions in phonological theory. Journal of Linguistics, 1(2):97–138.
  • Coady and Aslin (2003) Jeffry A. Coady and Richard N. Aslin. 2003. Phonological neighbourhoods in the developing lexicon. Journal of Child Language, 30(2):441–469.
  • Coady and Aslin (2004) Jeffry A. Coady and Richard N. Aslin. 2004. Young children’s sensitivity to probabilistic phonotactics in the developing lexicon. Journal of Experimental Child Psychology, 89(3):183–213.
  • Cotterell and Eisner (2017) Ryan Cotterell and Jason Eisner. 2017. Probabilistic typology: Deep generative models of vowel inventories. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1182–1192, Vancouver, Canada. Association for Computational Linguistics.
  • Cotterell et al. (2018) Ryan Cotterell, Sebastian J. Mielke, Jason Eisner, and Brian Roark. 2018. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 536–541, New Orleans, Louisiana. Association for Computational Linguistics.
  • Coupé et al. (2019) Christophe Coupé, Yoon Oh, Dan Dediu, and François Pellegrino. 2019. Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche. Science Advances, 5(9):eaaw2594.
  • Cover and Thomas (2012) Thomas M. Cover and Joy A. Thomas. 2012. Elements of Information Theory. John Wiley & Sons.
  • Dautriche et al. (2017) Isabelle Dautriche, Kyle Mahowald, Edward Gibson, Anne Christophe, and Steven T. Piantadosi. 2017. Words cluster phonetically beyond phonotactic regularities. Cognition, 163:128–145.
  • Dellert (2015) Johannes Dellert. 2015. Compiling the Uralic dataset for NorthEuraLex, a lexicostatistical database of northern Eurasia. In First International Workshop on Computational Linguistics for Uralic Languages.
  • Dellert (2017) Johannes Dellert. 2017. Information-Theoretic Causal Inference of Lexical Flow. Ph.D. thesis, University of Tübingen.
  • Dellert and Jäger (2017) Johannes Dellert and Gerhard Jäger. 2017. Northeuralex (version 0.9).
  • Futrell et al. (2017) Richard Futrell, Adam Albright, Peter Graff, and Timothy J. O’Donnell. 2017. A generative model of phonotactics. Transactions of the Association for Computational Linguistics, 5:73–86.
  • Goldrick and Larson (2008) Matthew Goldrick and Meredith Larson. 2008. Phonotactic probability influences speech production. Cognition, 107(3):1155–1164.
  • Goldwater and Johnson (2003) Sharon Goldwater and Mark Johnson. 2003. Learning OT constraint rankings using a maximum entropy model. In Proceedings of the Stockholm Workshop on Variation within Optimality Theory.
  • Goldwater et al. (2006) Sharon Goldwater, Mark Johnson, and Thomas L. Griffiths. 2006. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems, pages 459–466.
  • Gorman (2013) Kyle Gorman. 2013. Generative phonotactics. Ph.D. thesis, University of Pennsylvania.
  • Greenberg (1966) Joseph Greenberg. 1966. Language universals, with special reference to feature hierarchies. The Hague: Mouton.
  • Greenberg et al. (1978) Joseph H. Greenberg, Charles A. Ferguson, and Edith Moravcsik, editors. 1978. Universals of human language. Vol. 2: Phonology. Stanford University Press.
  • Hale (2001) John Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the 2nd meeting of the North American Chapter of the Association for Computational Linguistics, pages 1–8.
  • Hall et al. (2018) Kathleen Currie Hall, Elizabeth Hume, T. Florian Jaeger, and Andrew Wedel. 2018. The role of predictability in shaping phonological patterns. Linguistics Vanguard, 4(s2).
  • Halle (1959) Morris Halle. 1959. The sound pattern of Russian. Mouton, The Hague.
  • Halle (1978) Morris Halle. 1978. Knowledge unlearned and untaught: What speakers know about the sounds of their language. In M. Halle, J. Bresnan, and G. Miller, editors, Linguistic Theory and Psychological Reality, pages 294–303. The MIT Press, Cambridge, MA.
  • Hayes and Wilson (2008) Bruce Hayes and Colin Wilson. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3):379–440.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • Hockett (1955) Charles Francis Hockett. 1955. A manual of phonology. Waverly Press, Baltimore, MD.
  • Hockett (1958) Charles Francis Hockett. 1958. A course in modern linguistics. Macmillan, New York.
  • Hume and Mailhot (2013) Elizabeth Hume and Frédéric Mailhot. 2013. The role of entropy and surprisal in phonologization and language change. In Alan C.L. Yu, editor, Origins of sound change: Approaches to phonologization, pages 29–47. Oxford University Press, Oxford.
  • Ishwaran and James (2003) Hemant Ishwaran and Lancelot F. James. 2003. Generalized weighted Chinese restaurant processes for species sampling mixture models. Statistica Sinica, pages 1211–1235.
  • Jäger (2007) Gerhard Jäger. 2007. Maximum entropy models and stochastic optimality theory. Architectures, Rules, and Preferences: Variations on Themes by Joan W. Bresnan. Stanford: CSLI, pages 467–479.
  • Jelinek (1980) Frederick Jelinek. 1980. Interpolated estimation of Markov source parameters from sparse data. In Proc. Workshop on Pattern Recognition in Practice, 1980.

  • Kaplan and Kay (1994) Ronald M. Kaplan and Martin Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20(3):331–378.
  • Khandelwal et al. (2018) Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 284–294, Melbourne, Australia. Association for Computational Linguistics.
  • Kirov et al. (2018) Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sebastian J. Mielke, Arya McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. UniMorph 2.0: Universal morphology. In Proceedings of the 11th Language Resources and Evaluation Conference.
  • Levy (2008) Roger Levy. 2008. Expectation-based syntactic comprehension. Cognition, 106(3):1126–1177.
  • Lindblom and Maddieson (1988) Björn Lindblom and Ian Maddieson. 1988. Phonetic universals in consonant systems. Language, Speech, and Mind, pages 62–78.
  • Maddieson (2006) Ian Maddieson. 2006. Correlating phonological complexity: Data and validation. Linguistic Typology, 10(1):106–123.
  • Maddieson (2009) Ian Maddieson. 2009. Calculating phonological complexity. In François Pellegrino, Ioana Chitoran, Egidio Marsico, and Christophe Coupé, editors, Approaches to phonological complexity, pages 85–110. Mouton de Gruyter, Berlin, Germany.
  • Maddieson and Disner (1984) Ian Maddieson and Sandra Ferrari Disner. 1984. Patterns of sounds. Cambridge University Press.
  • Mahowald et al. (2018) Kyle Mahowald, Isabelle Dautriche, Edward Gibson, and Steven T. Piantadosi. 2018. Word forms are structured for efficient use. Cognitive Science, 42(8):3116–3134.
  • Martinet (1955) André Martinet. 1955. Économie des changements phonétiques. Éditions A. Francke S. A.
  • McWhorter (2001) John McWhorter. 2001. The world’s simplest grammars are creole grammars. Linguistic Typology, 5(2):125–166.
  • Merity et al. (2018) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240.
  • Mielke et al. (2019) Sebastian J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4975–4989, Florence, Italy. Association for Computational Linguistics.
  • Miestamo (2006) Matti Miestamo. 2006. On the feasibility of complexity metrics. In FinEst Linguistics, Proceedings of the Annual Finnish and Estonian Conference of Linguistics, pages 11–26.
  • Miestamo (2008) Matti Miestamo. 2008. Grammatical complexity in a cross-linguistic perspective. In Matti Miestamo, Kaius Sinnemäki, and Fred Karlsson, editors, Language complexity: Typology, contact, change, pages 23–41. John Benjamins, Amsterdam, The Netherlands.
  • Mikolov et al. (2010) Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
  • Moran and Blasi (2014) Steven Moran and Damián Blasi. 2014. Cross-linguistic comparison of complexity measures in phonological systems. In Frederick J. Newmeyer and Laurel B. Preston, editors, Measuring grammatical complexity, pages 217–240. Oxford University Press, Oxford, UK.
  • Moran et al. (2014) Steven Moran, Daniel McCloy, and Richard Wright, editors. 2014. PHOIBLE Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.
  • Nettle (1995) Daniel Nettle. 1995. Segmental inventory size, word length, and communicative efficiency. Linguistics, 33:359–367.
  • Pellegrino et al. (2011) François Pellegrino, Ioana Chitoran, Egidio Marsico, and Christophe Coupé. 2011. A cross-language perspective on speech information rate. Language, 87(3):539–558.
  • Piantadosi et al. (2011) Steven T. Piantadosi, Harry Tily, and Edward Gibson. 2011. Word lengths are optimized for efficient communication. Proceedings of the National Academy of Sciences, 108(9):3526–3529.
  • Piantadosi et al. (2012) Steven T. Piantadosi, Harry Tily, and Edward Gibson. 2012. The communicative function of ambiguity in language. Cognition, 122(3):280–291.
  • Piantadosi et al. (2009) Steven T. Piantadosi, Harry J. Tily, and Edward Gibson. 2009. The communicative lexicon hypothesis. In The 31st annual meeting of the Cognitive Science Society (CogSci09), pages 2582–2587.
  • Priva and Jaeger (2018) Uriel Cohen Priva and T. Florian Jaeger. 2018. The interdependence of frequency, predictability, and informativity in the segmental domain. Linguistics Vanguard, 4(s2).
  • Shosted (2006) Ryan K. Shosted. 2006. Correlating complexity: A typological approach. Linguistic Typology, 10(1):1–40.
  • Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959.
  • van Son and Pols (2003) R.J.J.H. van Son and Louis C.W. Pols. 2003. Information structure and efficiency in speech production. In Eighth European Conference on Speech Communication and Technology (Eurospeech).
  • Stanley (1967) Richard Stanley. 1967. Redundancy rules in phonology. Language, pages 393–436.
  • Storkel (2001) Holly L. Storkel. 2001. Learning new words: Phonotactic probability in language development. Journal of Speech, Language, and Hearing Research, 44(6):1321–1337.
  • Storkel (2003) Holly L. Storkel. 2003. Learning new words II: Phonotactic probability in verb learning. Journal of Speech, Language, and Hearing Research, 46(6):1312–1323.
  • Storkel et al. (2006) Holly L. Storkel, Jonna Armbrüster, and Tiffany P. Hogan. 2006. Differentiating phonotactic probability and neighborhood density in adult word learning. Journal of Speech, Language, and Hearing Research, 49(6):1175–1192.
  • Storkel and Hoover (2010) Holly L. Storkel and Jill R. Hoover. 2010. An online calculator to compute phonotactic probability and neighborhood density on the basis of child corpora of spoken American English. Behavior Research Methods, 42(2):497–506.
  • Storkel and Lee (2011) Holly L. Storkel and Su-Yeon Lee. 2011. The independent effects of phonotactic probability and neighbourhood density on lexical acquisition by preschool children. Language and Cognitive Processes, 26(2):191–211.
  • Sundermeyer et al. (2012) Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association.
  • Swadesh (1955) Morris Swadesh. 1955. Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics, 21(2):121–137.
  • Trubetzkoy (1938) Nikolaï Sergeyevich Trubetzkoy. 1938. Grundzüge der Phonologie. Vandenhoeck & Ruprecht, Göttingen, Germany.
  • Vitevitch and Luce (1999) Michael S. Vitevitch and Paul A. Luce. 1999. Probabilistic phonotactics and neighborhood activation in spoken word recognition. Journal of Memory and Language, 40(3):374–408.
  • Zipf (1935) George Kingsley Zipf. 1935. The psycho-biology of language: An introduction to dynamic philology. MIT Press, Cambridge.