Chinese-Portuguese Machine Translation: A Study on Building Parallel Corpora from Comparable Texts

04/05/2018
by Siyou Liu, et al.

Although ties between China and Portuguese-speaking countries are significant and growing, few parallel corpora exist for the Chinese-Portuguese language pair. Both languages have very large speaker populations, with 1.2 billion native Chinese speakers and 279 million native Portuguese speakers; the language pair, however, can be considered low-resource in terms of available parallel corpora. In this paper, we describe our methods for curating Chinese-Portuguese parallel corpora and evaluating their quality. We extracted bilingual data from Macao government websites and proposed a hierarchical strategy to build a large parallel corpus. Experiments are conducted on existing corpora and our own using both Phrase-Based Machine Translation (PBMT) and state-of-the-art Neural Machine Translation (NMT) models. The results of this work can be used as a benchmark for future Chinese-Portuguese MT systems. The approach used in this paper also serves as an example of how to boost the performance of MT systems for low-resource language pairs.


