MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora

05/21/2020
by Lifeng Han, et al.

Multi-word expressions (MWEs) are an active research topic in natural language processing (NLP), covering MWE detection, MWE decomposition, and the exploitation of MWEs in other NLP fields such as Machine Translation (MT). However, the availability of bilingual or multi-lingual MWE corpora is very limited. The only bilingual MWE corpus that we are aware of is from the PARSEME (PARSing and Multi-word Expressions) EU Project, a small collection of only 871 pairs of English-German MWEs. In this paper, we present multi-lingual and bilingual MWE corpora that we have extracted from root parallel corpora. After filtering, our collections contain 3,159,226 and 143,042 bilingual MWE pairs for German-English and Chinese-English respectively. We examine the quality of these extracted bilingual MWEs in MT experiments. Our initial experiments applying MWEs in MT show improved translation of MWE terms in qualitative analysis and better overall evaluation scores in quantitative analysis, on both the German-English and Chinese-English language pairs. We follow a standard experimental pipeline to create our MultiMWE corpora, which are available online. Researchers can use these free corpora to train their own models or include them in a knowledge base as model features.

research
04/05/2018

Chinese-Portuguese Machine Translation: A Study on Building Parallel Corpora from Comparable Texts

Although there are increasing and significant ties between China and Por...
research
11/07/2020

AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations

In this work, we present the construction of multilingual parallel corpo...
research
10/11/2017

Word Translation Without Parallel Data

State-of-the-art methods for learning cross-lingual word embeddings have...
research
10/28/2021

Empirical Analysis of Korean Public AI Hub Parallel Corpora and in-depth Analysis using LIWC

Machine translation (MT) system aims to translate source language into t...
research
04/25/2021

Potential Idiomatic Expression (PIE)-English: Corpus for Classes of Idioms

We present a fairly large, Potential Idiomatic Expression (PIE) dataset ...
research
09/11/2015

A Parallel Corpus of Translationese

We describe a set of bilingual English--French and English--German paral...
research
04/09/2021

Chinese Character Decomposition for Neural MT with Multi-Word Expressions

Chinese character decomposition has been used as a feature to enhance Ma...