
2kenize: Tying Subword Sequences for Chinese Script Conversion

by Pranav A, et al.

Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches perform poorly because they do not take into account that a single Simplified Chinese character can correspond to multiple Traditional Chinese characters. Here, we propose a model that can disambiguate between these mappings and convert between the two scripts. The model is based on subword segmentation, two language models, and a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert a pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths lie in handling code-mixing and named entities.
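To make the one-to-many problem concrete, the following is a minimal toy sketch and not the 2kenize model itself. It assumes a tiny hand-written candidate table and hypothetical bigram counts (`CANDIDATES`, `TOY_BIGRAM_COUNTS`, and the function names are all illustrative) in place of the subword segmentation and language models described above. It only shows why context is needed: the simplified character 发 maps to 發 in 发现 but to 髮 in 头发.

```python
from itertools import product

# Toy one-to-many table (simplified -> possible traditional forms).
# Illustrative only; a real table covers thousands of characters.
CANDIDATES = {
    "发": ["發", "髮"],  # "emit/develop" vs. "hair"
    "头": ["頭"],
    "现": ["現"],
}

# Hypothetical bigram counts standing in for a Traditional Chinese
# language model; the numbers are made up for this sketch.
TOY_BIGRAM_COUNTS = {"頭髮": 50, "發現": 80, "頭發": 1, "髮現": 0}


def candidates(text):
    """Enumerate every traditional string the table allows for `text`."""
    options = [CANDIDATES.get(ch, [ch]) for ch in text]
    return ["".join(combo) for combo in product(*options)]


def score(text):
    """Sum the toy counts of all adjacent character pairs in `text`."""
    return sum(TOY_BIGRAM_COUNTS.get(text[i:i + 2], 0)
               for i in range(len(text) - 1))


def convert(text):
    """Pick the highest-scoring candidate conversion."""
    return max(candidates(text), key=score)


print(convert("头发"))  # context selects the "hair" reading
print(convert("发现"))  # context selects the "emit/develop" reading
```

Exhaustive enumeration grows exponentially with the number of ambiguous characters, so it is only workable for short strings like these; it is used here purely to keep the illustration small.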



