
The word entropy of natural languages

by Christian Bentz et al.

The average uncertainty associated with words is an information-theoretic concept at the heart of quantitative and computational linguistics. Entropy has been established as a measure of this average uncertainty, also called average information content. Here we use parallel texts of 21 languages to establish the number of tokens at which word entropies converge to stable values. These convergence points are then used to select texts from a massively parallel corpus and to estimate word entropies across more than 1000 languages. Our results help to establish quantitative language comparisons, to understand the performance of multilingual translation systems, and to normalize semantic similarity measures.
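To make the quantity concrete, below is a minimal sketch of a unigram word-entropy estimate and its trajectory over growing text prefixes, the kind of curve whose flattening point marks a convergence value. It uses the simple plug-in (maximum-likelihood) estimator and a hypothetical whitespace-tokenised file bible_eng.txt; the estimators, tokenisation, and corpora used in the paper may differ.

```python
import math
from collections import Counter

def word_entropy(tokens):
    """Plug-in (maximum-likelihood) estimate of unigram word entropy in bits."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_trajectory(tokens, step=1000):
    """Entropy estimates on growing prefixes, to see where the estimate stabilises."""
    return [(n, word_entropy(tokens[:n])) for n in range(step, len(tokens) + 1, step)]

if __name__ == "__main__":
    # Hypothetical input file; any whitespace-tokenised text in one language would do.
    with open("bible_eng.txt", encoding="utf-8") as f:
        tokens = f.read().lower().split()

    for n, h in entropy_trajectory(tokens, step=5000):
        print(f"{n:>8} tokens  H = {h:.3f} bits")
```

Printing the trajectory for several languages side by side is one way to read off the token count at which the estimates stop changing appreciably, which is the convergence point the abstract refers to.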


