Are All Languages Equally Hard to Language-Model?

06/10/2018
by Ryan Cotterell, et al.

For general modeling methods applied to diverse languages, a natural question is: how well should we expect our models to work on languages with differing typological profiles? In this work, we develop an evaluation framework for fair cross-linguistic comparison of language models, using translated text so that all models are asked to predict approximately the same information. We then conduct a study on 21 languages, demonstrating that in some languages, the textual expression of the information is harder to predict with both n-gram and LSTM language models. We show complex inflectional morphology to be a cause of performance differences among languages.
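The comparison hinges on encoding parallel translations with a language model and comparing total bits: because the sentences convey approximately the same information, the bit counts are directly comparable across languages. A minimal sketch of this idea, using an add-one-smoothed character bigram model on toy stand-in sentences (the corpus, sentences, and function name here are illustrative assumptions, not the paper's actual setup):

```python
import math
from collections import Counter

def char_bigram_bits(train, test):
    """Total bits to encode `test` under an add-one-smoothed
    character bigram model estimated from `train`."""
    pad = "\x02"  # start-of-string marker (an implementation choice)
    bigrams, unigrams = Counter(), Counter()
    for s in train:
        s = pad + s
        for a, b in zip(s, s[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    vocab = {c for s in train + test for c in s} | {pad}
    V = len(vocab)
    bits = 0.0
    for s in test:
        s = pad + s
        for a, b in zip(s, s[1:]):
            # Add-one (Laplace) smoothed conditional probability p(b | a)
            p = (bigrams[(a, b)] + 1) / (unigrams[a] + V)
            bits += -math.log2(p)
    return bits

# Toy "parallel" sentences (hypothetical stand-ins for a multiway corpus):
english = ["the dog sees the cat", "the cat sees the dog"]
other_lang = ["koira naekee kissan", "kissa naekee koiran"]

# Train on one translation pair's first sentence, score the second;
# since the test sentences are translations of each other, the totals
# measure how hard each language's surface form is to predict.
print(char_bigram_bits(english[:1], english[1:]))
print(char_bigram_bits(other_lang[:1], other_lang[1:]))
```

The paper's study replaces this toy model with n-gram and LSTM language models and a 21-language multiway parallel corpus, but the accounting is the same: fewer total bits means the language's textual expression of the shared information is easier to predict.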


