SubCharacter Chinese-English Neural Machine Translation with Wubi encoding

11/07/2019
by   Wei Zhang, et al.

Neural machine translation (NMT) is one of the best methods for understanding the differences in semantic rules between two languages. Subword-level models have achieved impressive results, especially for Indo-European languages. However, when the translation task involves Chinese, semantic granularity remains at the word and character level, so there is still a need for a more fine-grained translation model for Chinese. In this paper, we introduce a simple and effective method for Chinese translation at the sub-character level. Our approach uses the Wubi input method to encode Chinese characters as sequences of Latin letters; byte-pair encoding (BPE) is then applied. Our method for Chinese-English translation eliminates the need for a complicated word segmentation algorithm during preprocessing. Furthermore, our method allows for sub-character-level neural translation based on a recurrent neural network (RNN) architecture, without preprocessing. The empirical results show that for Chinese-English translation tasks, our sub-character-level model achieves a BLEU score comparable to that of the subword model, despite having a much smaller vocabulary. Additionally, the small vocabulary is highly advantageous for NMT model compression.
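The two-step pipeline described in the abstract (Wubi encoding followed by BPE) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the three-entry `WUBI_TABLE`, the function names `to_wubi` and `learn_bpe`, and the end-of-word marker `</w>` are all hypothetical choices made for this example, and the codes shown should be checked against a full Wubi table before any real use.

```python
from collections import Counter

# Hypothetical, tiny lookup table for illustration only. A real system would
# load a complete Wubi table; verify these codes against one before relying on them.
WUBI_TABLE = {"中": "khk", "国": "lgyi", "人": "wwww"}


def to_wubi(text, table=WUBI_TABLE):
    """Map each character to its letter code; unknown characters pass through."""
    return " ".join(table.get(ch, ch) for ch in text)


def learn_bpe(tokens, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of letter-code tokens."""
    # Each token starts as a sequence of single letters plus an end-of-word marker.
    vocab = Counter(tuple(tok) + ("</w>",) for tok in tokens)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by token frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every token, fusing each occurrence of the most frequent pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


encoded = to_wubi("中国人")
merges = learn_bpe(encoded.split(), num_merges=5)
```

Because every character becomes a short string over a small Latin alphabet, the symbol inventory that BPE starts from is tiny, which is consistent with the small vocabulary the abstract reports.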


