Bilingual Distributed Word Representations from Document-Aligned Comparable Data

09/24/2015
by Ivan Vulić, et al.

We propose a new model for learning bilingual word representations from non-parallel document-aligned data. Following recent advances in word representation learning, our model learns dense real-valued word vectors, that is, bilingual word embeddings (BWEs). Prior work on inducing BWEs relied heavily on parallel sentence-aligned corpora and/or readily available translation resources such as dictionaries. In contrast, we show that BWEs can be learned solely from document-aligned comparable data, without any additional lexical resources or syntactic information. We compare our approach with previous state-of-the-art models for learning bilingual word representations from comparable data that rely on the framework of multilingual probabilistic topic modeling (MuPTM), as well as with distributional local context-counting models. We demonstrate the utility of the induced BWEs in two semantic tasks: (1) bilingual lexicon extraction and (2) suggesting word translations in context for polysemous words. Our simple yet effective BWE-based models significantly outperform the MuPTM-based and context-counting representation models learned from comparable data, as well as prior BWE-based models, and achieve the best reported results on both tasks for all three tested language pairs.
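To make the bilingual lexicon extraction task concrete, here is a minimal sketch of how translation candidates can be read off a shared bilingual embedding space by ranking target-language words by cosine similarity to a source word. This is an illustration under stated assumptions, not the training procedure or evaluation code of the paper: the function names, the toy three-dimensional vectors, and the Spanish-English word pairs are all hypothetical, and the abstract does not specify which similarity measure is used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def translation_candidates(source_word, source_vecs, target_vecs, top_k=1):
    """Rank target-language words by similarity to the source word's vector.

    Assumes both vocabularies are already embedded in one shared space,
    which is what a bilingual word embedding model is meant to provide.
    """
    query = source_vecs[source_word]
    ranked = sorted(
        target_vecs,
        key=lambda word: cosine(query, target_vecs[word]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical toy embeddings, invented purely for illustration.
SOURCE_VECS = {
    "casa": [0.9, 0.1, 0.0],
    "perro": [0.1, 0.9, 0.2],
}
TARGET_VECS = {
    "house": [0.8, 0.2, 0.1],
    "dog": [0.0, 1.0, 0.3],
    "tree": [0.1, 0.1, 0.9],
}

if __name__ == "__main__":
    for word in SOURCE_VECS:
        print(word, translation_candidates(word, SOURCE_VECS, TARGET_VECS))
```

With real embeddings, the same ranking would run over the full target vocabulary, and a larger `top_k` would return a list of candidate translations rather than a single best match.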


