Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

12/26/2018
by   Mikel Artetxe, et al.
0

We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different language families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the resulting sentence embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification. Our approach sets a new state-of-the-art on zero-shot cross-lingual natural language inference for all the 14 languages in the XNLI dataset but one. We also achieve very competitive results in cross-lingual document classification (MLDoc dataset). Our sentence embeddings are also strong at parallel corpus mining, establishing a new state-of-the-art in the BUCC shared task for 3 of its 4 language pairs. Finally, we introduce a new test set of aligned sentences in 122 languages based on the Tatoeba corpus, and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages. Our PyTorch implementation, pre-trained encoder and the multilingual test set will be freely available.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
12/19/2022

Cross-Lingual Retrieval Augmented Prompt for Low-Resource Languages

Multilingual Pretrained Language Models (MPLMs) have shown their strong ...
research
07/02/2020

Bayesian multilingual topic model for zero-shot cross-lingual topic identification

This paper presents a Bayesian multilingual topic model for learning lan...
research
05/24/2018

A Corpus for Multilingual Document Classification in Eight Languages

Cross-lingual document classification aims at training a document classi...
research
04/28/2023

Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings?

Dense vector representations for textual data are crucial in modern NLP....
research
09/15/2019

Emu: Enhancing Multilingual Sentence Embeddings with Semantic Specialization

We present Emu, a system that semantically enhances multilingual sentenc...
research
09/09/2021

Subword Mapping and Anchoring across Languages

State-of-the-art multilingual systems rely on shared vocabularies that s...
research
05/17/2022

OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval

Aligning parallel sentences in multilingual corpora is essential to cura...

Please sign up or login with your details

Forgot password? Click here to reset