Hrate package for R
The average uncertainty associated with words is an information-theoretic concept at the heart of quantitative and computational linguistics. The entropy has been established as a measure of this average uncertainty, also called average information content. We here use parallel texts of 21 languages to establish the number of tokens at which word entropies converge to stable values. These convergence points are then used to select texts from a massively parallel corpus, and to estimate word entropies across more than 1000 languages. Our results help to establish quantitative language comparisons, to understand the performance of multilingual translation systems, and to normalize semantic similarity measures.
The predictability of symbols in strings is the most fundamental concept of information encoding in general, and natural language production in particular. Shannon defined the entropy, or average information content, as a measure of uncertainty or “choice” inherent to strings of symbols. Since then, Shannon and others have undertaken great efforts to precisely estimate the entropy of written English [4, 12, 8, 27], and other languages [1, 17, 18].
In computational linguistics, entropy and related measures have been widely applied to tackle problems relating to machine translation [2, 20], distributional semantics [10, 22, 26, 25], information retrieval [3, 30, 16], and multiword expressions [33, 24]. All these accounts crucially hinge upon estimating the probability and uncertainty associated with words, i.e. the word entropy, in a given text and language.
There are two central questions associated with this estimation: 1) what is the text size (in number of tokens) at which word entropies converge to a stable value? 2) How much systematic difference in word entropies do we find across different languages? The first question is related to the problem of data sparsity. The performance of NLP tools relying on word probabilities is lower-bounded by the minimum text size at which word entropies can still be reliably estimated. The second question relates to the applicability of NLP tools across different languages. The performance of a single tool can be systematically biased in a specific language due to generally higher or lower word entropies.
In this study, we use a state-of-the-art method to estimate block entropies, and also implement a source entropy estimator. This allows us to establish word entropy convergence points for parallel texts of up to 30M tokens in 21 languages. Based on these analyses, we select texts with sufficiently large token counts from a massively parallel corpus and estimate word entropies across 1360 texts and 1001 languages.
A fundamental property of words is their frequency of occurrence, and hence their probability in language production. This probability is at the heart of many applications in natural language processing.
For example, the probability of an expert translating the English word the into German der can be estimated as the conditional probability $p(\text{der} \mid \text{the})$, i.e. the probability of finding der in a word-aligned German translation where we have the in the English original. More generally, we could have a set of six possible German translations of the, each assigned an estimated probability.
In the uniform case, all of these are assigned probability 1/6, and there is maximum uncertainty about which German word corresponds to English the. In the clearest case, one of the probabilities is 1 and the others are 0. The entropy over such a distribution of word probabilities measures exactly this uncertainty or “choice”. In our example of German articles, the word entropy is $\log_2 6 \approx 2.6$ bits in the uniform case and 0 bits in the clear case. Entropy calculation is detailed below (Section 4).
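The two extreme cases can be checked numerically. The following is a minimal sketch, assuming six candidate translations and entropy measured in bits (base-2 logarithm); the function name entropy_bits is ours and only serves as an illustration.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution.

    Terms with probability 0 are skipped, following the usual
    convention that 0 * log(0) counts as 0.
    """
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Uniform case: six candidate translations, each with probability 1/6.
uniform = [1 / 6] * 6
# Clear case: one candidate is certain, the others are impossible.
clear = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

h_uniform = entropy_bits(uniform)
h_clear = entropy_bits(clear)
print(round(h_uniform, 2), h_clear)
```

The uniform case yields log2(6), which is roughly 2.6 bits, and the clear case yields 0 bits.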
Given the example above, the task of a machine translation system is to map the English word onto the correct choice from the set of German translations. The maximum entropy approach, for instance, requires that the probabilities are estimated as accurately as possible from the given training data. Probabilities which cannot be estimated reliably are then chosen in a way to maximize the overall entropy of the model.
Crucially, the translation difficulty in such an account is a direct function of the entropy of the word distribution. In the English-to-German mapping - with uniform probabilities - there are 2.6 bits of “choice”, whereas in the German-to-English mapping there are 0 bits of “choice”.
This is closely related to Koehn’s finding that when translating into English from 10 different source languages, BLEU scores negatively correlate with the number of words in the source language. Hence, it is a harder problem to translate into a language with higher word entropies than to translate into a language with lower word entropies, everything else being equal. From this perspective, estimating the entropy of word distributions can help to predict translation performance.
Measures of semantic content and similarity [25, 26, 22, 10] often also rely on information-theoretic concepts. For example, the classical study by Resnik defines the similarity between two concepts $c_1$ and $c_2$ in a semantic hierarchy as
$$\mathrm{sim}(c_1, c_2) = \max_{c \in S(c_1, c_2)} \left[-\log p(c)\right],$$
with $S(c_1, c_2)$ being the set of concepts which subsume both $c_1$ and $c_2$. To get the information content $-\log p(c)$, the probability of a shared concept is estimated from corpus data as
$$\hat{p}(c) = \frac{\mathrm{freq}(c)}{N},$$
where $\mathrm{freq}(c)$ is the count of word tokens (nouns in this case) per concept (e.g. occurrences of dollar, euro, and coin would count towards the frequency of the concept money), and $N$ is the overall number of word tokens observed. Note that via the estimated $\hat{p}(c)$ this similarity measure depends on the overall distribution of word probabilities, i.e. the word entropy. Namely, for a language with more word types of overall lower token frequencies the probability $\hat{p}(c)$ is biased to be lower on average, and hence the information content $-\log p(c)$ is biased to be higher.
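The dependence of this similarity measure on corpus counts can be made concrete with a toy calculation. The sketch below assumes a base-2 logarithm and uses invented counts for two hypothetical subsuming concepts (money and entity); none of these numbers come from the corpora used here.

```python
import math

def information_content(concept_count, total_tokens):
    """Information content -log2 p(c), with p(c) = freq(c) / N."""
    return -math.log2(concept_count / total_tokens)

def resnik_similarity(subsumer_counts, total_tokens):
    """Maximum information content over all concepts subsuming both inputs."""
    return max(information_content(f, total_tokens)
               for f in subsumer_counts.values())

# Hypothetical counts: tokens of dollar, euro, coin, ... all count towards
# 'money'; the root concept 'entity' covers every noun token.
total_tokens = 1000
shared_subsumers = {"money": 30, "entity": 1000}

sim = resnik_similarity(shared_subsumers, total_tokens)

# The same concept spread over more, rarer word types: a lower count for
# 'money' gives a higher information content and a higher similarity score.
sim_rarer = resnik_similarity({"money": 10, "entity": 1000}, total_tokens)
print(round(sim, 2), round(sim_rarer, 2))
```

The root concept contributes zero information, so the score is driven by the most specific shared concept, and it rises as that concept's estimated probability falls.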
Similar considerations apply to finding hypernyms in vector spaces, measuring semantic content to establish asymmetries between derived words and their base forms, as well as measuring differences in semantic content to establish hyponym-hypernym relations.
Overall, estimating a) the convergence points (in number of word tokens), and b) reliable approximations for word entropies are crucial prerequisites for understanding the performance of translation systems, semantic similarity measures, and more generally, any NLP tool that relies on probabilities of words.
To control for constant content across languages, we use two sets of parallel texts: 1) the European Parliament Corpus (EPC), and 2) the Parallel Bible Corpus (PBC) (last accessed on 09/03/2016). Details about the corpora can be seen in Table 1. The general advantage of the EPC is that it is big in terms of numbers of word tokens per language (ca. 30M), whereas the PBC is smaller (ca. 280K word tokens per language), but massively parallel in terms of encompassing languages.
The basic information encoding unit chosen in this study is the word. Earlier studies on the entropy of English [28, 12, 27] often chose letters instead. However, in computational linguistics, word tokens are a common working unit.
A word token is here defined as a string of alphanumeric UTF-8 characters delimited by white spaces, with all letters converted to lower case and punctuation removed. Note that scripts of Mandarin Chinese (cmn) and Khmer (khm), for instance, delimit phrases and sentences by white spaces, rather than words. However, such scripts constitute a negligible proportion of our sample (%). In fact, % of the texts are written in Latin script.
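This token definition can be sketched in a few lines. The regular expression below is our assumption about one reasonable way to implement it, not the preprocessing actually used for the corpora; note that Python's \w class also keeps underscores, and that removing punctuation merges hyphenated words.

```python
import re

def tokenize(text):
    """Lowercase the text, remove punctuation, and split on white space."""
    text = text.lower()
    # For str patterns, \w matches Unicode letters and digits (plus '_');
    # every character that is neither \w nor white space is deleted.
    text = re.sub(r"[^\w\s]", "", text)
    return text.split()

tokens = tokenize("In the beginning, God created the heavens and the earth.")
print(tokens)
```

Because splitting relies on white space alone, scripts that do not separate words by spaces yield phrase-like or sentence-like "tokens", which is the limitation noted above.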
Given words as basic information encoding units, we can estimate the word entropy as outlined in the following.
Assume a text is a random variable $T$ created by a process of drawing and concatenating tokens from a set (or vocabulary) of word types $V = \{w_1, w_2, \dots, w_W\}$, with vocabulary size $W = |V|$. Word type probabilities are distributed according to $p(w) = \Pr\{T = w\}$ for $w \in V$. Given these definitions, the entropy of $T$ can be calculated as
$$H(T) = -\sum_{i=1}^{W} p(w_i) \log_2 p(w_i).$$
$H(T)$ can be seen as the average information content of word types. A crucial step towards estimating $H(T)$ is to reliably approximate the probabilities $p(w_i)$.
In a text, each word type $w_i$ has a token frequency $f_i$.
Take the first verse of the English Bible as a text.
in the beginning god created the heavens and the earth
and the earth was waste and empty […]
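In this example, the word type the occurs 4 times, the word type and occurs 3 times, etc. As a simple approximation, the probability $p(w_i)$ of a word type $w_i$ can be estimated via the maximum likelihood method:
$$\hat{p}(w_i) = \frac{f_i}{\sum_{j=1}^{W} f_j},$$
where $f_i$ is the token frequency of word type $w_i$ and the denominator is the overall number of word tokens $N$. For the 17 tokens of the Bible text shown above we would thus have:
$$\hat{H}(T) = -\left(\frac{4}{17}\log_2\frac{4}{17} + \frac{3}{17}\log_2\frac{3}{17} + \dots + \frac{1}{17}\log_2\frac{1}{17}\right) \approx 3.2 \text{ bits}.$$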
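The maximum likelihood calculation for the Bible example can be reproduced as follows. This is a minimal sketch over the 17 tokens quoted above (the elided continuation of the verse is ignored); the function name plugin_entropy is ours.

```python
import math
from collections import Counter

def plugin_entropy(tokens):
    """Unigram entropy in bits from maximum likelihood probabilities f_i / N."""
    counts = Counter(tokens)   # token frequency f_i per word type
    n = len(tokens)            # overall number of word tokens N
    return sum(-(f / n) * math.log2(f / n) for f in counts.values())

verse = ("in the beginning god created the heavens and the earth "
         "and the earth was waste and empty").split()

counts = Counter(verse)
h = plugin_entropy(verse)
print(counts["the"], counts["and"], len(verse), round(h, 2))
```

The counts match the frequencies named in the text (4 for the, 3 for and), and the resulting estimate is roughly 3.2 bits.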
However, there are two main caveats with this so-called plug-in approach:
Firstly, it has been shown that the maximum likelihood estimator is unreliable, especially for small text sizes [19, 9], i.e. a small number of tokens $N$. A range of remedies have been proposed to overcome this problem [27, 19, 9, 13].
Secondly, estimating the entropy from raw counts assumes that word tokens are drawn independently from a multinomial distribution, meaning there are no dependencies between words. Clearly, this requirement is not met for natural languages.
To overcome this problem, instead of using unigrams as “blocks” of information encoding, we could use bigrams, trigrams, n-grams, and thus increase block sizes to 2, 3, n. This yields block entropies defined as
$$H_k = -\sum_{b} p(b) \log_2 p(b),$$
where $k$ is the block size and the sum runs over all distinct blocks $b$ of $k$ consecutive word tokens. If $k$ is big enough to take into account most long-range correlations between word tokens, then the block entropy per token, $H_k / k$, is a close approximation of the entropy of the text. However, since the number of different blocks grows exponentially with $k$, very big corpora are needed to get reliable estimates. Note that Schürmann & Grassberger use an English corpus of 70M words and assert that entropy estimation beyond a block size of 5 letters (not words) is already unreliable. We will therefore stick with block sizes of 1, i.e. unigram entropies here.
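The sparsity problem for larger blocks can be seen even on the short Bible example. The sketch below assumes overlapping blocks of consecutive tokens and plain maximum likelihood probabilities; both choices are ours for illustration and differ from the NSB estimator used for the actual results.

```python
import math
from collections import Counter

def block_entropy(tokens, k=1):
    """Plug-in entropy in bits over overlapping blocks of k consecutive tokens."""
    blocks = [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    counts = Counter(blocks)
    n = len(blocks)
    return sum(-(f / n) * math.log2(f / n) for f in counts.values())

verse = ("in the beginning god created the heavens and the earth "
         "and the earth was waste and empty").split()

for k in (1, 2, 3):
    n_blocks = len(verse) - k + 1
    n_types = len({tuple(verse[i:i + k]) for i in range(n_blocks)})
    # Block size, number of blocks, number of distinct blocks, entropy.
    print(k, n_blocks, n_types, round(block_entropy(verse, k), 2))
```

For block sizes of 2 and 3, nearly every block in this text is unique, so the estimate is driven by the text length rather than by the language. This is the reason for restricting the analysis to unigrams.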
Applied to the problem of estimating word entropies, the method works as follows: for any given word token $t_i$ in a text, find the longest match-length $l$ for which the string $t_i^{i+l-1}$ (the tokens from position $i$ to position $i+l-1$) matches a preceding string in $t_1^{i-1}$. Formally, define $L_i$ as
$$L_i = 1 + \max\{\, l \ge 0 : t_i^{i+l-1} = t_j^{j+l-1} \text{ for some } 1 \le j \le i-1 \,\}.$$
This is an adaptation of Gao et al.’s match-length definition. (Note that Gao et al. give a more general definition that also holds for the so-called sliding window, or LZ77, estimator.)
To illustrate this, take the example of the English Bible again:
in the beginning god created the heavens and the
earth and the earth was waste and empty […]
For the word token beginning, in position $i = 3$, there is no match in the preceding string (in the). Hence, the match-length is $L_3 = 0 + 1 = 1$. If we look at the word token and in position $i = 11$, then the longest matching string is and the earth. Hence, the match-length is $L_{11} = 3 + 1 = 4$.
Note that the average match-lengths across all word tokens reflect the redundancy in the string, which is the inverse of unpredictability. Based on this connection, Gao et al. (Equation 6) show that the entropy of the string can be approximated as
$$\hat{h} = \frac{1}{N} \sum_{i=2}^{N} \frac{\log_2 i}{L_i},$$
where $N$ is the number of tokens and $L_i$ is the match-length at position $i$.
In other words, as the number of tokens $N$ approaches infinity, $\hat{h}$ reflects the average information content of a token conditioned on all preceding tokens. So $h$ accounts for all statistical dependencies between tokens.
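The match-length procedure can be sketched directly from its description. The code below is a naive quadratic-time illustration, not the implementation used for the results. It assumes that a match must start before the current position but may run into it, and it takes the estimator to be the average of log2(i) / L_i over positions 2 to N.

```python
import math

def match_lengths(tokens):
    """For each position, 1 + the length of the longest string starting there
    that also starts at some earlier position."""
    n = len(tokens)
    lengths = []
    for i in range(n):                  # index i is text position i + 1
        longest = 0
        for j in range(i):              # candidate earlier start positions
            l = 0
            while i + l < n and tokens[j + l] == tokens[i + l]:
                l += 1
            longest = max(longest, l)
        lengths.append(longest + 1)
    return lengths

def source_entropy(tokens):
    """Match-length based entropy estimate in bits per token."""
    n = len(tokens)
    lengths = match_lengths(tokens)
    # The formula uses 1-based positions, so position i is index i - 1.
    return sum(math.log2(i) / lengths[i - 1] for i in range(2, n + 1)) / n

verse = ("in the beginning god created the heavens and the earth "
         "and the earth was waste and empty").split()

lengths = match_lengths(verse)
print(lengths[2], lengths[10])   # positions 3 (beginning) and 11 (and)
```

The two printed match-lengths correspond to the worked example: 1 for beginning and 4 for the second occurrence of and. An estimate from 17 tokens is of course far from converged; the function is only meant to show the mechanics.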
We will call $h$ and its approximation $\hat{h}$ the source entropy, after Shannon’s formulation of the entropy of an information source. (Note that in the limit, i.e. as block sizes approach infinity, block and source entropies are the same. Also, for an independent and identically distributed (i.i.d.) random variable, the block entropy of block size 1, i.e. $H_1$, is identical to the source entropy $h$. However, as pointed out above, in natural languages words are not independently distributed.)
We estimate unigram entropies, i.e. entropies for block sizes of 1 ($k = 1$), using the Python implementation of the Nemenman-Shafee-Bialek (NSB) estimator. This estimator has a faster convergence rate compared to other block entropy estimators. (See https://gist.github.com/shhong/1021654/)
We first assess the text sizes at which both block and source entropies converge to a stable value using a subset of 21 languages (Section 5.1). We then select texts from the full PBC corpus based on the minimum number of tokens needed for convergence, estimate their entropies, and compare their spread on an entropy spectrum (Section 5.2). Finally, in Section 5.3, we investigate the correlation between block and source entropies.
Figure 1 illustrates the convergence of block and source entropies across 21 languages of the EPC. Note that block entropies are generally higher than source entropies. This is expected given that uncertainty is reduced for source entropies by taking the preceding co-text into account.
We further establish convergence points by calculating SDs of entropies for step sizes of 10K tokens, illustrated in Figure 2. If we choose as a convergence criterion, then we get the convergence points per language given in Table 2.
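One possible reading of such an SD-based criterion is sketched below. The window of consecutive estimates, the threshold value, and the use of the plug-in estimator are all our assumptions for illustration; they are not the settings behind Figure 2 or Table 2.

```python
import math
import statistics
from collections import Counter

def plugin_entropy(tokens):
    """Unigram entropy in bits from maximum likelihood probabilities."""
    counts = Counter(tokens)
    n = len(tokens)
    return sum(-(f / n) * math.log2(f / n) for f in counts.values())

def convergence_point(tokens, step, window, threshold):
    """Smallest text size at which the SD of the last `window` entropy
    estimates (taken every `step` tokens) drops below `threshold`.
    Returns None if the criterion is never met."""
    sizes = list(range(step, len(tokens) + 1, step))
    estimates = [plugin_entropy(tokens[:n]) for n in sizes]
    for end in range(window, len(estimates) + 1):
        if statistics.stdev(estimates[end - window:end]) < threshold:
            return sizes[end - 1]
    return None

# Synthetic text: a 50-type vocabulary repeated in a fixed cycle, so every
# estimate at a multiple of 100 tokens equals log2(50).
tokens = [str(i % 50) for i in range(5000)]
point = convergence_point(tokens, step=100, window=5, threshold=0.01)
print(point)
```

On real texts the estimates drift before settling, so the reported convergence point depends on the chosen step size, window, and threshold.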
Note that all 21 languages converge to stable entropies below text sizes of 100K tokens, with a maximum of 70K tokens (bg and ro) and an average of 35K for source and 38K for block entropies. This is an encouraging result, since texts with a minimum of around 70K tokens are available for a wide range of languages in the PBC.
Based on the convergence analyses, we choose 100K tokens as a cut-off point for inclusion of PBC texts. This leaves us with 1352 texts for block entropies, and 1360 texts for source entropies. (The number of texts is lower for block entropies since estimations failed for 8 texts due to some punctuation marks that were not correctly removed.) Both cover 1001 languages. Figure 3 is a density plot of the estimated entropy values. Block entropies are approximately normally distributed around a mean of (), and source entropies around a mean of (). Again, source entropies are systematically lower than block entropies.
It is remarkable that given the wide range of potential entropies - from 0 to 16 - most natural languages fall on a relatively narrow spectrum. For example, block entropies mainly fall in the range between 7 and 12, thus only covering around 30% of the possible range.
The similarities in convergence lines in Figure 1 suggest that there is a correlation between block and source entropy estimates. Figure 4 illustrates this correlation by plotting block entropies (x-axis) versus source entropies (y-axis). The Pearson correlation is strong (), suggesting that despite the differences in the estimation methods there is a strong connection between them. (Five clear outliers were removed here. These are texts of languages like Chinese (cmn) and Khmer (khm), which delimit phrases and sentences, rather than words, by white spaces. These have incommensurable source and block entropies.)
Independent of the estimation method - using block or source entropies - texts of 100K tokens are generally sufficient to estimate values reliably. This was shown across 21 languages of the EPC. Of course, this is not to say that there are no texts/languages in a bigger corpus like the PBC for which convergence might take longer. Note that the 21 EPC languages cover around % of the full range of values across the PBC sample. For example, block entropies range from for English to for Finnish in the EPC, and from to in the PBC.
Moreover, there is a strong correlation between block and source entropies. This is somewhat surprising considering that the probabilities of word occurrences are generally assumed to be strongly dependent on co-text. Of course, there is a co-text effect. It yields source entropies which are lower than block entropies. However, the difference between them is systematic and allows us to predict source entropies from unigram block entropies. Namely, a linear model fitted through the points in Figure 4 can be specified as
Via Equation 10 we can convert block entropies into source entropies with a mean difference of . Note that estimating source entropies requires searching strings of length . As increases, the CPU time per additional word token increases linearly, whereas unigram block entropies can be estimated based on dictionaries of word types and their token frequencies, and the processing time per additional word token is constant. Hence, Equation 10 can help to reduce processing cost.
Based on estimated entropies per language, we can predict the difficulty of pairwise translations between languages, i.e. translation system performance. Koehn used a probabilistic phrase-based model to translate between the (then available) 11 languages of the EPC. As is shown in Figure 5, Koehn’s BLEU scores for pairwise translations correlate with the pairwise ratios of entropies (, ).
For example, Finnish (fi) is a high entropy language () compared to English (en) (). Translating from Finnish to English gives a BLEU score of , and from English to Finnish . As pointed out above, this is due to the fact that translating into a higher entropy language means having more “choice” - or uncertainty - when translating words (or phrases). So the low performance from English to Finnish is predicted by a low English-to-Finnish entropy ratio (), compared to the Finnish-to-English ratio ().
In Section 2, some examples were given of studies that use information content in a distributional semantics context. Resnik , for instance, builds a similarity measure for concepts and words based on the information content, e.g. . Remember that the entropy as defined in Equation 4 is the average information content. It is clear from the range of word entropies in Figure 3 that languages can be strongly biased to have words with either high or low information contents on average. For example, words in Finnish are biased to have higher information content on average than words in English. This bias is a problem for the cross-linguistic application of similarity measures. It can be overcome by normalization using the estimated entropy:
The entropy, average information content, uncertainty or “choice” is a core property of words. Words, in turn, constitute fundamental building blocks in computational linguistics. Understanding word entropies is therefore a prerequisite for evaluation and improvement of the performance of NLP systems.
We have here established convergence points for 21 languages, illustrating that average word entropies can be reliably estimated with text sizes below 100K tokens. Based on these findings, we estimated entropies across 1360 texts and 1001 languages. Furthermore, we have shown empirically that there is a strong correlation between block and source entropies across these languages.
Overall, our results help to understand better the performance of multilingual translation systems, and to make measures in distributional semantics cross-linguistically applicable.
CB is funded by the DFG Center for Advanced Studies Words, Bones, Genes, Tools, and the ERC grant EVOLAEMP at the University of Tübingen.
The Journal of Machine Learning Research, 10:1469–1484, 2009.
S. Padó, A. Palmer, M. Kisselew, and J. Šnajder. Measuring semantic content to assess asymmetry in derivation. In Workshop on Advances in Distributional Semantics, 2015.