Subword Mapping and Anchoring across Languages

09/09/2021
by Giorgos Vernikos, et al.

State-of-the-art multilingual systems rely on shared vocabularies that sufficiently cover all considered languages. To this end, a simple and frequently used approach makes use of subword vocabularies constructed jointly over several languages. We hypothesize that such vocabularies are suboptimal due to false positives (identical subwords with different meanings across languages) and false negatives (different subwords with similar meanings). To address these issues, we propose Subword Mapping and Anchoring across Languages (SMALA), a method to construct bilingual subword vocabularies. SMALA extracts subword alignments using an unsupervised state-of-the-art mapping technique and uses them to create cross-lingual anchors based on subword similarities. We demonstrate the benefits of SMALA for cross-lingual natural language inference (XNLI), where it improves zero-shot transfer to an unseen language without any task-specific data, solely by sharing subword embeddings. Moreover, in neural machine translation, we show that joint subword vocabularies obtained with SMALA lead to higher BLEU scores on sentences that contain many false positives and false negatives.
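As a rough illustration of anchoring by subword similarity, the sketch below pairs subwords from two languages whose embeddings are mutual nearest neighbours under cosine similarity. It assumes the embeddings have already been mapped into a shared space; the mutual-nearest-neighbour criterion, the `threshold` parameter, the function names, and the toy 2-d vectors are all hypothetical choices made for this example and are not taken from the abstract, which does not specify the exact procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(query, candidates):
    """Return (subword, similarity) of the candidate closest to `query`."""
    return max(
        ((word, cosine(query, vec)) for word, vec in candidates.items()),
        key=lambda pair: pair[1],
    )

def find_anchors(src_emb, tgt_emb, threshold=0.9):
    """Pair subwords that are mutual nearest neighbours above `threshold`.

    Assumption: both embedding tables already live in one shared space.
    """
    anchors = []
    for src_word, src_vec in src_emb.items():
        tgt_word, sim = nearest(src_vec, tgt_emb)
        back_word, _ = nearest(tgt_emb[tgt_word], src_emb)
        if back_word == src_word and sim >= threshold:
            anchors.append((src_word, tgt_word))
    return anchors

# Toy English and German embeddings (made-up values for illustration only).
src_emb = {"gift": [1.0, 0.1], "house": [0.1, 1.0], "poison": [-1.0, 0.2]}
tgt_emb = {"gift": [-0.9, 0.3], "haus": [0.2, 0.95], "geschenk": [0.95, 0.15]}

anchors = find_anchors(src_emb, tgt_emb)

# Identical strings that were NOT anchored to each other are candidate
# false positives: same surface form, different meaning.
shared = sorted(set(src_emb) & set(tgt_emb))
false_positives = [w for w in shared if (w, w) not in anchors]

print(anchors)
print(false_positives)
```

In this toy setup, English "gift" is anchored to German "geschenk" rather than to the identically spelled German "gift", so the shared string is flagged as a candidate false positive, while "house" and "haus" illustrate a false negative that anchoring recovers.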

