Crosslingual Document Embedding as Reduced-Rank Ridge Regression

04/08/2019
by Martin Josifoski, et al.

There has recently been much interest in extending vector-based word representations to multiple languages, such that words can be compared across languages. In this paper, we shift the focus from words to documents and introduce a method for embedding documents written in any language into a single, language-independent vector space. For training, our approach leverages a multilingual corpus where the same concept is covered in multiple languages (but not necessarily via exact translations), such as Wikipedia. Our method, Cr5 (Crosslingual reduced-rank ridge regression), starts by training a ridge-regression-based classifier that uses language-specific bag-of-word features in order to predict the concept that a given document is about. We show that, when constraining the learned weight matrix to be of low rank, it can be factored to obtain the desired mappings from language-specific bags-of-words to language-independent embeddings. As opposed to most prior methods, which use pretrained monolingual word vectors, postprocess them to make them crosslingual, and finally average word vectors to obtain document vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as document-level. Moreover, since our algorithm uses the singular value decomposition as its core operation, it is highly scalable. Experiments show that our method achieves state-of-the-art performance on a crosslingual document retrieval task. Finally, although not trained for embedding sentences and words, it also achieves competitive performance on crosslingual sentence and word retrieval tasks.
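The core idea — a ridge-regression classifier whose weight matrix is constrained to low rank and then factored into a bag-of-words-to-embedding map — can be sketched in a few lines. The function below is not the authors' implementation, only a minimal illustration of reduced-rank ridge regression: fit the full ridge solution, take the SVD of the fitted values to find the top-r directions, and factor the resulting low-rank weight matrix. All names (`reduced_rank_ridge`, `A`, `B`) are hypothetical.

```python
import numpy as np

def reduced_rank_ridge(X, Y, lam=1.0, rank=2):
    """Illustrative reduced-rank ridge regression (not the Cr5 codebase).

    X: (n_docs, n_features) bag-of-words matrix.
    Y: (n_docs, n_concepts) concept-label indicator matrix.
    Returns a factorization W_r = A @ B of the rank-constrained weights.
    """
    d = X.shape[1]
    # Full ridge solution: W = (X^T X + lam * I)^{-1} X^T Y
    W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
    # SVD of the fitted values; keep the top-`rank` right singular vectors
    _, _, Vt = np.linalg.svd(X @ W, full_matrices=False)
    V_r = Vt[:rank].T                  # (n_concepts, rank)
    # Factor the low-rank weight matrix:
    A = W @ V_r   # maps bag-of-words features to rank-dim document embeddings
    B = V_r.T     # maps embeddings back to concept scores
    return A, B
```

In this picture, `A` plays the role of the language-specific mapping from bags-of-words into the shared embedding space: a new document's embedding is simply its bag-of-words vector multiplied by `A`.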


