The word entropy of natural languages

06/22/2016
by Christian Bentz, et al.

The average uncertainty associated with words is an information-theoretic concept at the heart of quantitative and computational linguistics. Entropy has been established as a measure of this average uncertainty, also called average information content. Here we use parallel texts of 21 languages to determine the number of tokens at which word entropies converge to stable values. These convergence points are then used to select texts from a massively parallel corpus and to estimate word entropies across more than 1000 languages. Our results help to establish quantitative language comparisons, to understand the performance of multilingual translation systems, and to normalize semantic similarity measures.
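The abstract does not specify which entropy estimator is used, so the following is only a minimal sketch: it assumes the simple plug-in (maximum likelihood) estimate of unigram word entropy over whitespace-separated tokens, and computes it on growing prefixes of a token stream to illustrate the kind of convergence behavior the abstract describes. The function names (`word_entropy`, `entropy_by_sample_size`) and the toy text are illustrative, not taken from the paper.

```python
import math
from collections import Counter

def word_entropy(tokens):
    """Plug-in (maximum likelihood) word entropy in bits:
    H = -sum_w p(w) * log2 p(w), where p(w) is the relative token frequency."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_by_sample_size(tokens, step=1000):
    """Entropy of growing prefixes of the token stream, to see where the
    estimate levels off (a rough analogue of the convergence points
    discussed in the abstract)."""
    return [(n, word_entropy(tokens[:n]))
            for n in range(step, len(tokens) + 1, step)]

if __name__ == "__main__":
    # Toy whitespace-tokenized text; real analyses would use the parallel
    # corpora described in the paper.
    text = "the cat sat on the mat and the dog sat on the rug " * 200
    tokens = text.split()
    for n, h in entropy_by_sample_size(tokens, step=500):
        print(f"{n:6d} tokens  H = {h:.3f} bits")
```

Note that the plug-in estimator systematically underestimates entropy for small samples, which is one reason it matters to know how many tokens are needed before the estimate stabilizes; the paper may well rely on a less biased estimator.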

Related research:

11/14/2016 | Quantitative Entropy Study of Language Complexity
We study the entropy of Chinese and English texts, based on characters i...

04/12/2021 | Towards a parallel corpus of Portuguese and the Bantu language Emakhuwa of Mozambique
Major advancement in the performance of machine translation models has b...

05/07/2020 | Phonotactic Complexity and its Trade-offs
We present methods for calculating a measure of phonotactic complexity—b...

02/03/2021 | Disambiguatory Signals are Stronger in Word-initial Positions
Psycholinguistic studies of human word processing and lexical access pro...

03/04/2016 | Parallel Texts in the Hebrew Bible, New Methods and Visualizations
In this article we develop an algorithm to detect parallel texts in the ...

08/05/2020 | Generalized Word Shift Graphs: A Method for Visualizing and Explaining Pairwise Comparisons Between Texts
A common task in computational text analyses is to quantify how two corp...