Learning Word Vectors for 157 Languages

02/19/2018
by Edouard Grave, et al.

Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient in the successful application of these representations is to train them on very large corpora and to use these pre-trained models in downstream tasks. In this paper, we describe how we trained such high-quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the Common Crawl project. We also introduce three new word analogy datasets, for French, Hindi and Polish, to evaluate these word vectors. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exist, showing very strong performance compared to previous models.
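As a rough illustration of what a word analogy evaluation involves, the sketch below answers questions of the form "a is to b as c is to ?" by ranking candidate words by cosine similarity to the vector b - a + c, then reports accuracy over a list of questions. The abstract does not spell out the scoring procedure, so this common vector-offset approach is an assumption, and the three-dimensional vectors, the vocabulary, and the function names (`solve_analogy`, `analogy_accuracy`) are invented for the example.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only;
# real pre-trained word vectors typically have hundreds of dimensions.
TOY_VECTORS = {
    "paris": [0.9, 0.1, 0.8],
    "france": [0.9, 0.1, 0.1],
    "rome": [0.1, 0.9, 0.8],
    "italy": [0.1, 0.9, 0.1],
    "berlin": [0.5, 0.5, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    The target is b - a + c; the three query words are excluded
    from the candidates, as is usual for this kind of evaluation.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(vectors, a, b, c) == expected
                  for a, b, c, expected in questions)
    return correct / len(questions)


if __name__ == "__main__":
    questions = [
        ("paris", "france", "rome", "italy"),
        ("france", "paris", "italy", "rome"),
    ]
    print(solve_analogy(TOY_VECTORS, "paris", "france", "rome"))
    print(analogy_accuracy(TOY_VECTORS, questions))
```

With real pre-trained vectors, the candidate set would be the full vocabulary (or a frequency-limited subset) rather than five words, and the ranking would normally be done with matrix operations instead of a Python loop.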

research
12/26/2017

Advances in Pre-Training Distributed Word Representations

Many Natural Language Processing applications nowadays rely on pre-train...
research
10/25/2018

Learning Neural Emotion Analysis from 100 Observations: The Surprising Effectiveness of Pre-Trained Word Representations

Deep Learning has drastically reshaped virtually all areas of NLP. Yet o...
research
07/15/2016

Enriching Word Vectors with Subword Information

Continuous word representations, trained on large unlabeled corpora are ...
research
10/29/2017

Personalized word representations Carrying Personalized Semantics Learned from Social Network Posts

Distributed word representations have been shown to be very useful in va...
research
02/15/2018

Deep contextualized word representations

We introduce a new type of deep contextualized word representation that ...
research
11/24/2019

Causally Denoise Word Embeddings Using Half-Sibling Regression

Distributional representations of words, also known as word vectors, hav...
research
06/10/2016

WordNet2Vec: Corpora Agnostic Word Vectorization Method

A complex nature of big data resources demands new methods for structuri...
