DeepAI AI Chat
Log In Sign Up

SubCharacter Chinese-English Neural Machine Translation with Wubi encoding

by   Wei Zhang, et al.

Neural machine translation (NMT) is one of the best methods for understanding the differences in semantic rules between two languages. Especially for Indo-European languages, subword-level models have achieved impressive results. However, when the translation task involves Chinese, semantic granularity remains at the word and character level, so there is still need more fine-grained translation model of Chinese. In this paper, we introduce a simple and effective method for Chinese translation at the sub-character level. Our approach uses the Wubi method to translate Chinese into English; byte-pair encoding (BPE) is then applied. Our method for Chinese-English translation eliminates the need for a complicated word segmentation algorithm during preprocessing. Furthermore, our method allows for sub-character-level neural translation based on recurrent neural network (RNN) architecture, without preprocessing. The empirical results show that for Chinese-English translation tasks, our sub-character-level model has a comparable BLEU score to the subword model, despite having a much smaller vocabulary. Additionally, the small vocabulary is highly advantageous for NMT model compression.


page 1

page 2

page 3

page 4


Word, Subword or Character? An Empirical Study of Granularity in Chinese-English NMT

Neural machine translation (NMT), a new approach to machine translation,...

Neural Machine Translation of Logographic Languages Using Sub-character Level Information

Recent neural machine translation (NMT) systems have been greatly improv...

wubi2en: Character-level Chinese-English Translation through ASCII Encoding

Character-level Neural Machine Translation (NMT) models have recently ac...

Logographic Subword Model for Neural Machine Translation

A novel logographic subword model is proposed to reinterpret logograms a...

Reduce Indonesian Vocabularies with an Indonesian Sub-word Separator

Indonesian is an agglutinative language since it has a compounding proce...