Investigating an Effective Character-level Embedding in Korean Sentence Classification

05/31/2019
by Won Ik Cho, et al.

Unlike the writing systems of many Romance and Germanic languages, some languages and language families compose characters from complex conjunct forms. Where the conjuncts consist of components representing consonant(s) and a vowel, various character encoding schemes can be adopted beyond a plain one-hot vector. However, little work has been done on intra-language comparison of the performance obtained with each representation. In this study, using Korean, which is character-rich and agglutinative, we investigate which encoding scheme is the most effective among Jamo-level one-hot, character-level one-hot, character-level dense, and character-level multi-hot. Classification performance with each scheme is evaluated on two corpora: one for binary sentiment analysis of movie reviews, and the other for multi-class identification of intention types. The results show that the character-level features generally yield higher performance, although the Jamo-level features may be compatible with attention-based models given an adequate parameter size.
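To make the difference between these schemes concrete, the following is a minimal sketch and not the authors' implementation. It relies on the standard Unicode layout of precomposed Hangul syllables (19 initial consonants, 21 vowels, 28 final slots including "no final"). The function names, the 68-dimensional multi-hot layout, and the toy vocabulary are assumptions made for illustration; the dimensions used in the paper may differ. The character-level dense scheme is omitted because it requires a learned embedding table.

```python
# Precomposed Hangul syllables occupy U+AC00..U+D7A3 and are laid out
# arithmetically: code = BASE + (initial * 21 + medial) * 28 + final.
HANGUL_BASE = 0xAC00
N_INITIAL, N_MEDIAL, N_FINAL = 19, 21, 28  # N_FINAL counts "no final consonant"


def decompose(syllable):
    """Return (initial, medial, final) Jamo indices of one Hangul syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < N_INITIAL * N_MEDIAL * N_FINAL:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    initial, rest = divmod(offset, N_MEDIAL * N_FINAL)
    medial, final = divmod(rest, N_FINAL)
    return initial, medial, final


def multi_hot(syllable):
    """Character-level multi-hot: one bit per Jamo slot (assumed 68 dims)."""
    initial, medial, final = decompose(syllable)
    vec = [0] * (N_INITIAL + N_MEDIAL + N_FINAL)
    vec[initial] = 1
    vec[N_INITIAL + medial] = 1
    vec[N_INITIAL + N_MEDIAL + final] = 1
    return vec


def char_one_hot(syllable, vocab):
    """Character-level one-hot over a (hypothetical) syllable vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab[syllable]] = 1
    return vec


# Toy vocabulary for illustration only; a real one covers the corpus.
vocab = {"한": 0, "국": 1, "어": 2}
jamo_indices = [decompose(ch) for ch in "한국어"]   # Jamo-level view
one_hot_seq = [char_one_hot(ch, vocab) for ch in "한국어"]
multi_hot_seq = [multi_hot(ch) for ch in "한국어"]
```

In this sketch a Jamo-level encoding yields up to three symbols per syllable and so a longer sequence, the character-level one-hot vector grows with the number of distinct syllables, and the multi-hot vector stays at a fixed, small size while still exposing the consonant and vowel components.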


research
08/08/2017

Which Encoding is the Best for Text Classification in Chinese, English, Japanese and Korean?

This article offers an empirical study on the different ways of encoding...
research
10/08/2018

End-to-End Text Classification via Image-based Embedding using Character-level Networks

For analysing and/or understanding languages having no word boundaries b...
research
05/09/2018

wubi2en: Character-level Chinese-English Translation through ASCII Encoding

Character-level Neural Machine Translation (NMT) models have recently ac...
research
01/08/2017

Sentence-level dialects identification in the greater China region

Identifying the different varieties of the same language is more challen...
research
05/18/2022

Exploring the Advantages of Dense-Vector to One-Hot Encoding of Intent Classes in Out-of-Scope Detection Tasks

This work explores the intrinsic limitations of the popular one-hot enco...
research
03/18/2019

A Multilingual Encoding Method for Text Classification and Dialect Identification Using Convolutional Neural Network

This thesis presents a language-independent text classification model by...
research
03/30/2022

Does Configuration Encoding Matter in Learning Software Performance? An Empirical Study on Encoding Schemes

Learning and predicting the performance of a configurable software syste...
