Why Overfitting Isn't Always Bad: Retrofitting Cross-Lingual Word Embeddings to Dictionaries

05/01/2020
by Mozhi Zhang, et al.

Cross-lingual word embeddings (CLWE) are often evaluated on bilingual lexicon induction (BLI). Recent CLWE methods use linear projections, which underfit the training dictionary, to generalize on BLI. However, underfitting can hinder generalization to other downstream tasks that rely on words from the training dictionary. We address this limitation by retrofitting CLWE to the training dictionary, which pulls training translation pairs closer in the embedding space and overfits the training dictionary. This simple post-processing step often improves accuracy on two downstream tasks, despite lowering BLI test accuracy. We also retrofit to both the training dictionary and a synthetic dictionary induced from CLWE, which sometimes generalizes even better on downstream tasks. Our results confirm the importance of fully exploiting the training dictionary in downstream tasks and explain why BLI is a flawed CLWE evaluation.
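
To make the post-processing step concrete, below is a minimal sketch of dictionary retrofitting in the style of Faruqui et al. (2015), adapted to a shared cross-lingual space: each word vector is iteratively averaged with the current vectors of its dictionary translations while staying anchored to its original vector. The function name, hyperparameters (alpha, beta, iters), and their defaults are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def retrofit(emb, pairs, alpha=1.0, beta=1.0, iters=10):
    """Pull dictionary translation pairs closer in a shared CLWE space.

    emb:   dict mapping each word (source or target language) to its
           projected embedding (np.ndarray).
    pairs: training dictionary as (src_word, tgt_word) tuples.
    alpha: weight anchoring a word to its original vector.
    beta:  weight pulling a word toward its translations.
    Hyperparameter names and defaults are illustrative assumptions.
    """
    new = {w: v.copy() for w, v in emb.items()}
    # Build the dictionary graph: each word's translations are its neighbors.
    nbrs = {}
    for s, t in pairs:
        nbrs.setdefault(s, []).append(t)
        nbrs.setdefault(t, []).append(s)
    for _ in range(iters):
        for w, ns in nbrs.items():
            ns = [n for n in ns if n in new]
            if w not in emb or not ns:
                continue
            # Closed-form coordinate update: a weighted average of the
            # word's original vector and its translations' current vectors.
            new[w] = (alpha * emb[w] + beta * sum(new[n] for n in ns)) \
                     / (alpha + beta * len(ns))
    return new

# Toy usage: the translation pair moves closer after retrofitting.
emb = {"dog": np.array([0.9, 0.1]), "perro": np.array([0.1, 0.9])}
retro = retrofit(emb, [("dog", "perro")])
```

Under the same assumptions, retrofitting to the synthetic dictionary described above would simply amount to extending `pairs` with translation pairs induced from the CLWE itself before calling `retrofit`.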

Related research

06/05/2019
A Resource-Free Evaluation Metric for Cross-Lingual Word Embeddings Based on Graph Modularity
Cross-lingual word embeddings encode the meaning of words from different...

02/01/2019
How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions
Cross-lingual word embeddings (CLEs) enable multilingual modeling of mea...

11/08/2019
Should All Cross-Lingual Embeddings Speak English?
Most of recent work in cross-lingual word embeddings is severely Angloce...

02/18/2019
CBOW Is Not All You Need: Combining CBOW with the Compositional Matrix Space Model
Continuous Bag of Words (CBOW) is a powerful text embedding method. Due ...

06/10/2016
Unsupervised Learning of Word-Sequence Representations from Scratch via Convolutional Tensor Decomposition
Unsupervised text embeddings extraction is crucial for text understandin...

09/03/2019
On the Downstream Performance of Compressed Word Embeddings
Compressing word embeddings is important for deploying NLP models in mem...

02/05/2016
Massively Multilingual Word Embeddings
We introduce new methods for estimating and evaluating embeddings of wor...
