word2word: A Collection of Bilingual Lexicons for 3,564 Language Pairs

11/27/2019
by Yo Joong Choe, et al.

We present word2word, a publicly available dataset and an open-source Python package for cross-lingual word translations extracted from sentence-level parallel corpora. Our dataset provides top-k word translations in 3,564 (directed) language pairs across 62 languages in OpenSubtitles2018 (Lison et al., 2018). To obtain this dataset, we use a count-based bilingual lexicon extraction model based on the observation that not only source and target words but also source words themselves can be highly correlated. We illustrate that the resulting bilingual lexicons have high coverage and attain competitive translation quality for several language pairs. We wrap our dataset and model in an easy-to-use Python library, which supports downloading and retrieving top-k word translations in any of the supported language pairs as well as computing top-k word translations for custom parallel corpora.
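To make the idea of count-based top-k extraction concrete, here is a minimal sketch in plain Python. It is not the model described above: it ranks target words by the raw co-occurrence estimate p(y|x) = c(x, y) / c(x) over sentence pairs, with no correction for correlated source words. The function name `top_k_translations` and the toy corpus are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def top_k_translations(pairs, k=2):
    """Rank target words for each source word by p(y|x) = c(x, y) / c(x).

    `pairs` is an iterable of (source_sentence, target_sentence) strings.
    This is a plain co-occurrence baseline, an assumption made for
    illustration, not the extraction model the abstract describes.
    """
    src_count = Counter()        # c(x): sentence pairs containing source word x
    cooc = defaultdict(Counter)  # c(x, y): sentence pairs containing both x and y
    for src, tgt in pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        for x in src_words:
            src_count[x] += 1
            for y in tgt_words:
                cooc[x][y] += 1

    lexicon = {}
    for x, counts in cooc.items():
        # Sort by descending count; break ties alphabetically for determinism.
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        lexicon[x] = [(y, c / src_count[x]) for y, c in ranked[:k]]
    return lexicon


# Hypothetical toy parallel corpus (English -> French).
toy_corpus = [
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
    ("the cat eats", "le chat mange"),
]
lexicon = top_k_translations(toy_corpus, k=2)
print(lexicon["cat"])
```

On this toy corpus the baseline scores "le" as highly as "chat" for the source word "cat", because "le" appears in every target sentence. Frequent words that co-occur with everything are a known weakness of raw co-occurrence counts, and plausibly the sort of confound that motivates accounting for correlations among source words.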

