Morphology Matters: A Multilingual Language Modeling Analysis

12/11/2020
by   Hyunji Hayley Park, et al.

Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger set of typological features. We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained on BPE-segmented data. We also investigate linguistically motivated subword segmentation strategies, such as Morfessor and Finite-State Transducers (FSTs), and find that these strategies yield better performance and reduce the impact of a language's morphology on language modeling.
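
As a rough illustration of the surprisal measure mentioned above, the sketch below sums the surprisal, in bits, over the subword tokens of one sequence. Summing over the whole sequence, rather than averaging per token, keeps the number comparable when the same text is split into different numbers of segments. This is only a minimal sketch, not the authors' pipeline: the function name `total_surprisal_bits` and the probability values are hypothetical and chosen purely for illustration.

```python
import math


def total_surprisal_bits(token_probs):
    """Return the summed surprisal (-log2 p) over all tokens of one sequence."""
    return sum(-math.log2(p) for p in token_probs)


# Hypothetical probabilities (made up for illustration) that a model might
# assign to the tokens of the same sentence under two different segmentations.
bpe_probs = [0.5, 0.25, 0.25]   # three subword tokens
morph_probs = [0.5, 0.5]        # two subword tokens

bpe_bits = total_surprisal_bits(bpe_probs)
morph_bits = total_surprisal_bits(morph_probs)

print(f"segmentation A: {bpe_bits:.1f} bits")
print(f"segmentation B: {morph_bits:.1f} bits")
```

In this toy case the second segmentation yields the lower total surprisal, which is the sense in which one segmentation strategy can be said to perform better than another on the same text.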
