Learning variable length units for SMT between related languages via Byte Pair Encoding

10/20/2016
by   Anoop Kunchukuttan, et al.

We explore the use of segments learnt using Byte Pair Encoding (referred to as BPE units) as basic units for statistical machine translation between related languages and compare them with orthographic syllables, which are currently the best-performing basic units for this translation task. BPE identifies the most frequent character sequences as basic units, while orthographic syllables are linguistically motivated pseudo-syllables. We show that BPE units modestly outperform orthographic syllables as units of translation, showing up to 11% increase in BLEU score. While orthographic syllables can be used only for languages whose writing systems use vowel representations, BPE is writing-system independent, and we show that BPE outperforms other units for non-vowel writing systems too. Our results are supported by extensive experimentation spanning multiple language families and writing systems.
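To make concrete the idea that BPE picks the most frequent character sequences as basic units, here is a minimal sketch of BPE merge learning on a toy word-frequency list. It is an illustration under stated assumptions, not the authors' pipeline: the function names (`learn_bpe`, `merge_pair`, `segment`), the toy corpus, the absence of an end-of-word marker, and the tie-breaking rule are all hypothetical choices made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a {word: frequency} dict."""
    # Start with every word split into single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Assumption: ties are broken by the lexicographically largest pair,
        # purely to keep this sketch deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Split `word` into BPE units by replaying the learnt merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Hypothetical toy corpus: word -> frequency.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_merges = learn_bpe(toy_corpus, num_merges=3)
print(toy_merges)
print(segment("lowest", toy_merges))
```

The number of merges controls the unit size: with few merges the units stay close to characters, and with many merges they approach whole words. Nothing in the procedure refers to vowels or syllable structure, which is why it applies regardless of the writing system.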


