wubi2en: Character-level Chinese-English Translation through ASCII Encoding

05/09/2018
by Mi Xue Tan, et al.

Character-level Neural Machine Translation (NMT) models have recently achieved impressive results on many language pairs. They do particularly well for Indo-European language pairs, where the languages share the same writing system. However, for translating between Chinese and English, the gap between the two writing systems poses a major challenge because the individual linguistic units lack a systematic correspondence. In this paper, we enable character-level NMT for Chinese by breaking down Chinese characters into linguistic units similar to those of Indo-European languages using the Wubi encoding scheme. We show promising results from training Wubi-based models at the subword and character levels with recurrent as well as convolutional models.
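As a rough illustration of the idea of rewriting Chinese text as ASCII sequences, here is a minimal Python sketch. It is not the authors' pipeline: the function names `encode` and `decode`, the space separator, and the tiny `TOY_WUBI` table are all assumptions made for this example, and the codes in the table are illustrative placeholders that should be replaced by a real Wubi dictionary. Real Wubi codes can also be shared by several characters, which this sketch does not handle.

```python
# Toy character-to-code table (hypothetical, for illustration only).
# These entries are placeholders, not a verified Wubi dictionary.
TOY_WUBI = {
    "中": "khk",
    "国": "lgyi",
    "人": "w",
}


def encode(text, table=TOY_WUBI, sep=" "):
    """Replace each character found in the table with its ASCII code.

    Characters missing from the table are passed through unchanged.
    Codes are joined with `sep` so the sequence can be split again later.
    Assumes `text` does not itself contain `sep`.
    """
    return sep.join(table.get(ch, ch) for ch in text)


def decode(encoded, table=TOY_WUBI, sep=" "):
    """Invert `encode`, assuming every code maps to exactly one character."""
    reverse = {code: ch for ch, code in table.items()}
    return "".join(reverse.get(tok, tok) for tok in encoded.split(sep))


if __name__ == "__main__":
    ascii_form = encode("中国人")
    print(ascii_form)
    print(decode(ascii_form))
```

Once text is in this form, each code can be treated as a short "word" made of Latin letters, so the same character-level or subword-level segmentation used for English can be applied to the Chinese side.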


research
11/07/2019

SubCharacter Chinese-English Neural Machine Translation with Wubi encoding

Neural machine translation (NMT) is one of the best methods for understa...
research
09/07/2018

Neural Machine Translation of Logographic Languages Using Sub-character Level Information

Recent neural machine translation (NMT) systems have been greatly improv...
research
10/20/2016

Learning variable length units for SMT between related languages via Byte Pair Encoding

We explore the use of segments learnt using Byte Pair Encoding (referred...
research
03/01/2019

Chinese-Japanese Unsupervised Neural Machine Translation Using Sub-character Level Information

Unsupervised neural machine translation (UNMT) requires only monolingual...
research
05/26/2020

The 'Letter' Distribution in the Chinese Language

Corpus-based statistical analysis plays a significant role in linguistic...
research
05/31/2019

Investigating an Effective Character-level Embedding in Korean Sentence Classification

Different from the writing systems of many Romance and Germanic language...
research
11/23/2022

Breaking the Representation Bottleneck of Chinese Characters: Neural Machine Translation with Stroke Sequence Modeling

Existing research generally treats Chinese character as a minimum unit f...
