A Strong Baseline for Learning Cross-Lingual Word Embeddings from Sentence Alignments

08/18/2016
by Omer Levy, et al.

While cross-lingual word embeddings have been studied extensively in recent years, the qualitative differences between the different algorithms remain vague. We observe that whether or not an algorithm uses a particular feature set (sentence IDs) accounts for a significant performance gap among these algorithms. This feature set is also used by traditional alignment algorithms, such as IBM Model-1, which demonstrate similar performance to state-of-the-art embedding algorithms on a variety of benchmarks. Overall, we observe that different algorithmic approaches for utilizing the sentence ID feature space result in similar performance. This paper draws both empirical and theoretical parallels between the embedding and alignment literature, and suggests that adding additional sources of information, which go beyond the traditional signal of bilingual sentence-aligned corpora, may substantially improve cross-lingual word embeddings, and that future baselines should at least take such features into account.
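
To make the sentence-ID feature set concrete, here is a minimal sketch of the idea the abstract describes: every word, in either language, is represented only by the IDs of the aligned sentence pairs it occurs in, and translation candidates are ranked by overlap in that space. This is an illustration under stated assumptions, not the authors' exact pipeline; the toy corpus, the binary sentence-ID vectors with cosine similarity, and all names (aligned_pairs, sid_vector, translation_candidates) are hypothetical stand-ins for the embedding and alignment models the paper actually compares.

```python
import numpy as np
from collections import defaultdict

# Toy sentence-aligned corpus (hypothetical data); each English/French
# pair shares one sentence ID, which is the only cross-lingual signal used.
aligned_pairs = [
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
    ("the cat eats",   "le chat mange"),
]

# Map every word, in either language, to the set of sentence IDs it occurs in.
sid_sets = defaultdict(set)
for sid, (src, tgt) in enumerate(aligned_pairs):
    for word in src.split():
        sid_sets[("en", word)].add(sid)
    for word in tgt.split():
        sid_sets[("fr", word)].add(sid)

n_sents = len(aligned_pairs)

def sid_vector(key):
    """Binary vector over sentence IDs for a (language, word) key."""
    v = np.zeros(n_sents)
    v[list(sid_sets[key])] = 1.0
    return v

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def translation_candidates(word, lang="en", target="fr"):
    """Rank target-language words by similarity in the sentence-ID feature space."""
    q = sid_vector((lang, word))
    cands = [(w, cosine(q, sid_vector((l, w)))) for (l, w) in sid_sets if l == target]
    return sorted(cands, key=lambda x: -x[1])

print(translation_candidates("cat"))  # "chat" shares exactly the same sentence IDs and ranks first
```

The paper's observation is that methods which exploit this sentence-ID signal, whether framed as embedding learning or as traditional alignment, end up performing similarly, which is why such a baseline is worth including in future comparisons.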

Related research

06/12/2019 - Analyzing the Limitations of Cross-lingual Word Embedding Mappings
Recent research in cross-lingual word embeddings has almost exclusively ...

03/04/2018 - Concatenated p-mean Word Embeddings as Universal Cross-Lingual Sentence Representations
Average word embeddings are a common baseline for more sophisticated sen...

04/01/2016 - Cross-lingual Models of Word Embeddings: An Empirical Comparison
Despite interest in using cross-lingual knowledge to learn word embeddin...

04/11/2019 - Scalable Cross-Lingual Transfer of Neural Sentence Embeddings
We develop and investigate several cross-lingual alignment approaches fo...

11/01/2018 - A Stronger Baseline for Multilingual Word Embeddings
Levy, Søgaard and Goldberg's (2017) S-ID (sentence ID) method applies wo...

04/13/2020 - Compass-aligned Distributional Embeddings for Studying Semantic Differences across Corpora
Word2vec is one of the most used algorithms to generate word embeddings ...

03/25/2022 - Probabilistic Embeddings with Laplacian Graph Priors
We introduce probabilistic embeddings using Laplacian priors (PELP). The...
