Which Encoding is the Best for Text Classification in Chinese, English, Japanese and Korean?

08/08/2017
by   Xiang Zhang, et al.
0

This article offers an empirical study on the different ways of encoding Chinese, Japanese, Korean (CJK) and English languages for text classification. Different encoding levels are studied, including UTF-8 bytes, characters, words, romanized characters and romanized words. For all encoding levels, whenever applicable, we provide comparisons with linear models, fastText and convolutional networks. For convolutional networks, we compare between encoding mechanisms using character glyph images, one-hot (or one-of-n) encoding, and embedding. In total there are 473 models, using 14 large-scale text classification datasets in 4 languages including Chinese, English, Japanese and Korean. Some conclusions from these results include that byte-level one-hot encoding based on UTF-8 consistently produces competitive results for convolutional networks, that word-level n-grams linear models are competitive even without perfect word segmentation, and that fastText provides the best result using character-level n-gram encoding but can overfit when the features are overly rich.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
09/04/2015

Character-level Convolutional Networks for Text Classification

This article offers an empirical exploration on the use of character-lev...
research
05/31/2019

Investigating an Effective Character-level Embedding in Korean Sentence Classification

Different from the writing systems of many Romance and Germanic language...
research
11/14/2016

Character-level Convolutional Network for Text Classification Applied to Chinese Corpus

This article provides an interesting exploration of character-level conv...
research
03/18/2019

A Multilingual Encoding Method for Text Classification and Dialect Identification Using Convolutional Neural Network

This thesis presents a language-independent text classification model by...
research
03/11/2019

An Innovative Word Encoding Method For Text Classification Using Convolutional Neural Network

Text classification plays a vital role today especially with the intensi...
research
12/09/2022

Moto: Enhancing Embedding with Multiple Joint Factors for Chinese Text Classification

Recently, language representation techniques have achieved great perform...
research
03/24/2017

Binarsity: a penalization for one-hot encoded features

This paper deals with the problem of large-scale linear supervised learn...

Please sign up or login with your details

Forgot password? Click here to reset