Massively Multilingual Word Embeddings

02/05/2016
by Waleed Ammar, et al.

We introduce new methods for estimating and evaluating embeddings of words in more than fifty languages in a single shared embedding space. Our estimation methods, multiCluster and multiCCA, use dictionaries and monolingual data; they do not require parallel data. Our new evaluation method, multiQVEC-CCA, is shown to correlate better than previous ones with two downstream tasks (text categorization and parsing). We also describe a web portal for evaluation that will facilitate further research in this area, along with open-source releases of all our methods.
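As a rough illustration of how dictionaries alone (without parallel text) can tie words from many languages together, the sketch below groups words into multilingual clusters by treating dictionary entries as edges and taking connected components. This is a minimal sketch under stated assumptions, not the released implementation: the function name `cluster_by_dictionary`, the `(language, word)` node format, and the toy entries are all hypothetical, and the abstract does not specify the exact procedure.

```python
from collections import defaultdict


def cluster_by_dictionary(pairs):
    """Group (lang, word) nodes into connected components.

    `pairs` is an iterable of ((lang, word), (lang, word)) translation
    entries. This input format is an assumption made for illustration.
    """
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Each dictionary entry links two words into the same cluster.
    for a, b in pairs:
        union(a, b)

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return list(clusters.values())


# Hypothetical toy dictionary entries, for illustration only.
toy_pairs = [
    (("en", "dog"), ("fr", "chien")),
    (("fr", "chien"), ("de", "Hund")),
    (("en", "cat"), ("fr", "chat")),
]
clusters = cluster_by_dictionary(toy_pairs)
```

In a pipeline of this kind, each word in the monolingual corpora could then be replaced by its cluster identifier before training a single embedding model, so that translations share one vector; whether the paper's methods proceed exactly this way should be checked against the full text.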

research
07/21/2021

Debiasing Multilingual Word Embeddings: A Case Study of Three Indian Languages

In this paper, we advance the current state-of-the-art method for debias...
research
09/30/2021

Phonetic Word Embeddings

This work presents a novel methodology for calculating the phonetic simi...
research
06/11/2020

A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

We use the multilingual OSCAR corpus, extracted from Common Crawl via la...
research
04/23/2018

Bilingual Embeddings with Random Walks over Multilingual Wordnets

Bilingual word embeddings represent words of two languages in the same s...
research
03/12/2018

Concept2vec: Metrics for Evaluating Quality of Embeddings for Ontological Concepts

Although there is an emerging trend towards generating embeddings for pr...
research
05/01/2020

Why Overfitting Isn't Always Bad: Retrofitting Cross-Lingual Word Embeddings to Dictionaries

Cross-lingual word embeddings (CLWE) are often evaluated on bilingual le...
research
02/18/2019

CBOW Is Not All You Need: Combining CBOW with the Compositional Matrix Space Model

Continuous Bag of Words (CBOW) is a powerful text embedding method. Due ...
