What Kind of Language Is Hard to Language-Model?

06/11/2019
by Sebastian J. Mielke, et al.

How language-agnostic are current state-of-the-art NLP tools? Are there some types of language that are easier to model with current methods? In prior work (Cotterell et al., 2018) we attempted to address this question for language modeling, and observed that recurrent neural network language models do not perform equally well over all the high-resource European languages found in the Europarl corpus. We speculated that inflectional morphology may be the primary culprit for the discrepancy. In this paper, we extend these earlier experiments to cover 69 languages from 13 language families using a multilingual Bible corpus. Methodologically, we introduce a new paired-sample multiplicative mixed-effects model to obtain language difficulty coefficients from at-least-pairwise parallel corpora. In other words, the model is aware of inter-sentence variation and can handle missing data. Exploiting this model, we show that "translationese" is not any easier to model than natively written language in a fair comparison. To find out what features difficult languages have in common, we attempt but fail to reproduce our earlier (Cotterell et al., 2018) observation about morphological complexity, and instead reveal far simpler statistics of the data that seem to drive complexity in a much larger sample.
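
To make the idea of a multiplicative model with per-language difficulty coefficients more concrete, here is a minimal sketch. It assumes, purely for illustration, that the cost of modeling sentence i in language j factors as cost ≈ exp(n_i) * exp(d_j), where n_i is a sentence effect and d_j is the language difficulty. The function name `fit_difficulties`, the input layout, and the alternating least-squares fit on log-costs are all assumptions of this sketch; they are a simple stand-in and not the estimation procedure of the paper, which the abstract does not spell out. Missing (sentence, language) pairs are simply absent from the input.

```python
import math


def fit_difficulties(costs, iterations=200):
    """Fit log(cost[s, l]) ~= n[s] + d[l] by alternating least squares.

    costs maps (sentence_id, language) -> positive cost (e.g. bits needed
    to encode that sentence). Pairs may be missing, but every sentence and
    every language must appear at least once. Returns (n, d), where d is
    centred so that the difficulties average to zero.
    """
    logs = {key: math.log(value) for key, value in costs.items()}
    sentences = sorted({s for s, _ in logs})
    languages = sorted({l for _, l in logs})
    n = {s: 0.0 for s in sentences}  # sentence effects (log scale)
    d = {l: 0.0 for l in languages}  # language difficulties (log scale)

    for _ in range(iterations):
        # Best sentence effect given the current difficulties.
        for s in sentences:
            res = [logs[s, l] - d[l] for l in languages if (s, l) in logs]
            n[s] = sum(res) / len(res)
        # Best difficulty given the current sentence effects.
        for l in languages:
            res = [logs[s, l] - n[s] for s in sentences if (s, l) in logs]
            d[l] = sum(res) / len(res)

    # The model is only identified up to a constant shift between n and d,
    # so fix the average difficulty at zero.
    shift = sum(d.values()) / len(d)
    d = {l: v - shift for l, v in d.items()}
    n = {s: v + shift for s, v in n.items()}
    return n, d
```

In this sketch, exp(d_j) reads as a multiplicative factor: a language with d_j = 0.1 costs roughly 10% more per sentence than the average language in the sample. Because each sentence carries its own effect, a language is not penalized for happening to cover only the harder sentences, which is the intuition behind handling inter-sentence variation and missing data.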

research
12/11/2020

Morphology Matters: A Multilingual Language Modeling Analysis

Prior studies in multilingual language modeling (e.g., Cotterell et al.,...
research
06/10/2018

Are All Languages Equally Hard to Language-Model?

For general modeling methods applied to diverse languages, a natural que...
research
04/25/2021

XLM-T: A Multilingual Language Model Toolkit for Twitter

Language models are ubiquitous in current NLP, and their multilingual ca...
research
03/19/2021

MuRIL: Multilingual Representations for Indian Languages

India is a multilingual society with 1369 rationalized languages and dia...
research
05/11/2020

Neural Polysynthetic Language Modelling

Research in natural language processing commonly assumes that approaches...
research
06/07/2021

Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study

Recent research in multilingual language models (LM) has demonstrated th...
