Language Embeddings Sometimes Contain Typological Generalizations

01/19/2023
by   Robert Östling, et al.
0

To what extent can neural network models learn generalizations about language structure, and how do we find out what they have learned? We explore these questions by training neural models for a range of natural language processing tasks on a massively multilingual dataset of Bible translations in 1295 languages. The learned language representations are then compared to existing typological databases as well as to a novel set of quantitative syntactic and morphological features obtained through annotation projection. We conclude that some generalizations are surprisingly close to traditional features from linguistic typology, but that most of our models, as well as those of previous work, do not appear to have made linguistically meaningful generalizations. Careful attention to details in the evaluation turns out to be essential to avoid false positives. Furthermore, to encourage continued work in this field, we release several resources covering most or all of the languages in our data: (i) multiple sets of language representations, (ii) multilingual word embeddings, (iii) projected and predicted syntactic and morphological features, (iv) software to provide linguistically sound evaluations of language representations.

READ FULL TEXT

page 22

page 23

page 24

page 25

page 26

page 27

page 28

page 29

research
04/09/2020

On the Language Neutrality of Pre-trained Multilingual Representations

Multilingual contextual embeddings, such as multilingual BERT (mBERT) an...
research
12/22/2016

Continuous multilinguality with language vectors

Most existing models for multilingual natural language processing (NLP) ...
research
10/02/2020

Syntax Representation in Word Embeddings and Neural Networks – A Survey

Neural networks trained on natural language processing tasks capture syn...
research
11/01/2019

On the Linguistic Representational Power of Neural Machine Translation Models

Despite the recent success of deep neural networks in natural language p...
research
02/23/2018

From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings

A core part of linguistic typology is the classification of languages ac...
research
04/06/2020

A Systematic Analysis of Morphological Content in BERT Models for Multiple Languages

This work describes experiments which probe the hidden representations o...
research
04/30/2020

Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures

The use of linguistic typological resources in natural language processi...

Please sign up or login with your details

Forgot password? Click here to reset