Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining

05/21/2021
by Ivana Kvapilíková, et al.

Existing models of multilingual sentence embeddings require large parallel data resources that are not available for low-resource languages. We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on monolingual data. We first produce a synthetic parallel corpus using unsupervised machine translation, and then use it to fine-tune a pretrained cross-lingual masked language model (XLM) to derive the multilingual sentence representations. We evaluate the quality of the representations on two parallel corpus mining tasks, where they improve on vanilla XLM by up to 22 F1 points. In addition, we observe that a single synthetic bilingual corpus can improve results for other language pairs.
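
To make the evaluation setting concrete, the following is a minimal sketch of parallel corpus mining with sentence embeddings and of the F1 score used to judge it. It is not the authors' pipeline: the nearest-neighbour cosine search, the `threshold` value, the helper names (`mine_pairs`, `f1_score`), and the toy 3-dimensional vectors are all assumptions made for illustration. In practice the embeddings would come from a fine-tuned multilingual encoder.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def mine_pairs(src_emb, tgt_emb, threshold=0.5):
    """Pair each source sentence with its most similar target sentence.

    A pair is kept only if its cosine similarity reaches `threshold`
    (a hypothetical cut-off chosen for this sketch).
    """
    sim = l2_normalize(src_emb) @ l2_normalize(tgt_emb).T
    best = sim.argmax(axis=1)
    return [(i, int(j)) for i, j in enumerate(best) if sim[i, j] >= threshold]


def f1_score(predicted, gold):
    """Harmonic mean of precision and recall over mined sentence pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy stand-ins for sentence embeddings in two languages (assumed values).
src = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
tgt = np.array([[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.5, 0.5, 0.5]])
gold = [(0, 1), (1, 0)]  # assumed gold alignments for the toy data

pairs = mine_pairs(src, tgt, threshold=0.8)
score = f1_score(pairs, gold)
print(pairs, score)
```

In this toy case the third source sentence has no close counterpart, so the threshold filters it out and only the two aligned pairs are returned.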

research
04/22/2023

L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT

The multilingual Sentence-BERT (SBERT) models map different languages to...
research
10/25/2019

Exploring Multilingual Syntactic Sentence Representations

We study methods for learning sentence embeddings with syntactic structu...
research
06/05/2019

Learning Bilingual Sentence Embeddings via Autoencoding and Computing Similarities with a Multilayer Perceptron

We propose a novel model architecture and training algorithm to learn bi...
research
03/04/2016

A Bayesian Model of Multilingual Unsupervised Semantic Role Induction

We propose a Bayesian model of unsupervised semantic role induction in m...
research
11/01/2018

A Stronger Baseline for Multilingual Word Embeddings

Levy, Søgaard and Goldberg's (2017) S-ID (sentence ID) method applies wo...
research
04/07/2022

Automatic WordNet Construction using Word Sense Induction through Sentence Embeddings

Language resources such as wordnets remain indispensable tools for diffe...
research
05/17/2022

OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval

Aligning parallel sentences in multilingual corpora is essential to cura...
