2kenize: Tying Subword Sequences for Chinese Script Conversion

05/07/2020
by Pranav A, et al.

Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches perform poorly because they do not account for the fact that a single simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between these mappings and convert between the two scripts. The model is based on subword segmentation, two language models, and a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert the pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths are in dealing with code-mixing and named entities.
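To make the one-to-many problem concrete, here is a minimal Python sketch. It is not the 2kenize model: it works on single characters rather than subwords, and it replaces the two language models with a tiny table of hypothetical bigram counts. The names `CANDIDATES`, `BIGRAM_COUNTS`, `score`, and `convert` are illustrative assumptions. The simplified character 发 maps to either 發 or 髮, and the neighbouring character decides which one fits.

```python
import math
from itertools import product

# Toy one-to-many table: simplified character -> possible traditional forms.
# Characters missing from the table are passed through unchanged.
CANDIDATES = {
    "发": ["發", "髮"],
    "头": ["頭"],
}

# Hypothetical bigram counts standing in for a traditional-script
# language model. These numbers are made up for illustration only.
BIGRAM_COUNTS = {
    ("頭", "髮"): 50,
    ("發", "展"): 80,
    ("頭", "發"): 1,
    ("髮", "展"): 1,
}


def score(seq):
    """Sum of add-one-smoothed log bigram counts over adjacent characters."""
    return sum(
        math.log(BIGRAM_COUNTS.get(pair, 0) + 1)
        for pair in zip(seq, seq[1:])
    )


def convert(simplified):
    """Enumerate every candidate conversion and keep the best-scoring one."""
    options = [CANDIDATES.get(ch, [ch]) for ch in simplified]
    best = max(product(*options), key=score)
    return "".join(best)


print(convert("头发"))  # hair
print(convert("发展"))  # develop
```

Exhaustive enumeration grows exponentially with the number of ambiguous characters, so it is only workable for short inputs like these.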
