Investigating an Effective Character-level Embedding in Korean Sentence Classification

05/31/2019 ∙ by Won Ik Cho et al. ∙ Seoul National University

Different from the writing systems of many Romance and Germanic languages, some languages or language families show complex conjunct forms in character composition. For such cases, where the conjuncts consist of components representing consonant(s) and vowel(s), various character encoding schemes can be adopted beyond merely making up a one-hot vector. However, little work has been done on intra-language comparison of the performance obtained with each representation. In this study, utilizing the Korean language, which is character-rich and agglutinative, we investigate which encoding scheme is the most effective among Jamo-level one-hot, character-level one-hot, character-level dense, and character-level multi-hot. Classification performance with each scheme is evaluated on two corpora: one on binary sentiment analysis of movie reviews, and the other on multi-class identification of intention types. The results show that the character-level features yield higher performance in general, although the Jamo-level features may become comparable with attention-based models if an adequate parameter size is guaranteed.


1 Introduction

Ever since an early approach exploiting character features for neural network-based natural language processing (NLP) [Zhang et al.2015], character-level embedding (throughout this paper, the terms embedding and encoding are used interchangeably depending on the context) has been widely used for many tasks such as machine translation [Ling et al.2015], noisy document representation [Dhingra et al.2016], language correction [Xie et al.2016], and word segmentation [Cho et al.2018a]. However, little consideration has been given to intra-language performance comparison across representation types. Unlike English, a Germanic language written with an alphabet comprising 26 characters, many languages used in East Asia are written with scripts whose characters can be further decomposed into sub-parts representing individual consonants or vowels. This implies that (sub-)character-level representation for such languages can be managed with more than just a simple one-hot encoding.

In this paper, a comparative experiment is conducted on Korean, a representative language with a featural writing system [Daniels and Bright1996]. To be specific, the Korean alphabet Hangul consists of the letters Jamo denoting consonants and vowels. The letters are combined into a morpho-syllabic block referred to as a character, which is equivalent to the phonetic unit syllable in terms of Korean morpho-phonology. The conjunct form of a character is {Syllable: CV(C)}; this notation implies that there should be at least one consonant (namely cho-seng, the first sound) and one vowel (namely cwung-seng, the second sound) in a character. An additional consonant (namely cong-seng, the third sound) is auxiliary. However, in decomposing the characters, three slots are always held to guarantee a space for each component; an empty cell takes the place of the third entry if there is no auxiliary consonant. The number of possible sub-characters, or (composite) consonants/vowels, that can fill each slot is 19, 21, and 27, respectively. For instance, in the syllable ‘간 (kan)’, the three clockwise-arranged letters ㄱ, ㅏ, and ㄴ, which sound kiyek (stands for k; among 19 candidates), ah (stands for a; among 21 candidates), and niun (stands for n; among 27 candidates), refer to the first, the second, and the third sound, respectively.
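The block-to-Jamo correspondence can be computed directly from Unicode code points. Below is a minimal Python sketch of such a decomposition, assuming the standard Hangul syllable layout; the paper itself relies on the Hangul Toolkit introduced later, and the table variables CHO, JUNG, and JONG are ours, introduced only for illustration.

```python
# A minimal sketch of Jamo decomposition via Unicode arithmetic.
# The paper uses the Hangul Toolkit; this standalone version only
# illustrates the 19/21/28 indexing of cho-/cwung-/cong-seng.

CHO  = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")                      # 19 first sounds
JUNG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")                  # 21 second sounds
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 third sounds + empty slot

def decompose(syllable: str):
    """Return the (cho, cwung, cong) letters of one composed Hangul block."""
    code = ord(syllable) - 0xAC00          # composed Hangul syllables start at U+AC00
    if not 0 <= code < 11172:              # 19 * 21 * 28 possible blocks
        raise ValueError("not a composed Hangul syllable")
    cho, rest = divmod(code, 21 * 28)
    jung, jong = divmod(rest, 28)
    return (CHO[cho], JUNG[jung], JONG[jong])

print(decompose("간"))                      # ('ㄱ', 'ㅏ', 'ㄴ')
```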

To compare five different Jamo/character-level embedding methodologies that are possible in Korean, we first review the related studies and previous approaches. Then, two datasets are introduced, namely binary sentiment classification and multi-class intention identification, to investigate the performance of each representation under recurrent neural network (RNN)-based analysis. After searching for an effective encoding scheme, we demonstrate how the result can be adopted in other tasks and discuss whether a similar approach can be applied to other languages with complex writing systems.

2 Related Work

Inter-language comparison with word and character embedding was thoroughly investigated in  zhang2017encoding for Chinese, Japanese, Korean, and English. The paper investigates the languages via representations including character, byte, romanized character, morpheme, and romanized morpheme. The observed tendency for Korean suggests that adopting romanized characters outperforms utilizing the raw character-level features by a significant margin, though the performance was far below that of the morpheme-level features. However, in that experiment, decomposition of the morpho-syllabic blocks was not conducted, and the experiment made use of morphological analysis, which lies outside the scope of this study. We concluded that more attention should be paid to the different character embedding methodologies for Korean. Here, to reduce ambiguity, we denote a morpho-syllabic block which consists of consonant(s) and a vowel by character, and the individual components by Jamo. A Jamo sequence is spread in the order of the first to the third sound when a character is decomposed.

There has been little study on an effective text encoding scheme for Korean, a language with a distinguished character structure that can be decomposed into sub-characters. A comparative study on the hierarchical constituents of Korean morpho-syntax was first organized in  lee2016comparison, comparing the performance of Jamo-, character-, morpheme-, and eojeol (word)-level embeddings for the task of text reconstruction. In the quantitative analysis using edit distance and accuracy, the Jamo-level feature showed a slightly better result than the character-level one. The (sub-)character-level representations presented outcomes far better than the morpheme- or eojeol-level cases, unlike the classification task in  zhang2017encoding, which seems to originate in the property of the task.

From a more comprehensive viewpoint,  stratos2017sub showed that Jamo-level features combined with word- and character-level ones display better performance in a parsing task. With more elaborate character processing, especially involving Jamos,  shin2017korean and  cho2018character recently made progress on classification tasks.  song2018sequence successfully aggregated the sparse features into a multi-hot representation, enhancing the output of an error correction task. In a slightly different manner,  cho2018real applied dense vectors for the representation of the characters, obtained by skip-gram  [Mikolov et al.2013], improving the naturalness of word segmentation for noisy Korean text. To figure out the overall tendency, we implement the aforementioned Jamo/character-level features and discuss the results on classification tasks. The details of each approach are described in the following section.

Representation          Property          Dimension  Feature type
(i) Shin2017            Jamo-level        67         one-hot
(ii) Cho2018c           Jamo-level        118        one-hot
(iii) Cho2018a-Sparse   character-level   2,534      one-hot
(iv) Cho2018a-Dense     character-level   100        dense
(v) Song2018            character-level   67         multi-hot
Table 1: A description of the Jamo/character-level features (i-v).

3 Experiment

In this section, we describe the featurization of five (sub-)character embedding methodologies, namely (i) Jamo  [Shin et al.2017, Stratos2017], (ii) modified Jamo  [Cho et al.2018c], (iii) sparse character vectors, (iv) dense character vectors  [Cho et al.2018a] trained with fastText  [Bojanowski et al.2016], and (v) multi-hot character vectors  [Song et al.2018]. We featurize only Jamo/characters; no other symbols such as numbers and special letters are taken into account.

For (i), we used one-hot vectors of dimension 67 (= 19 + 21 + 27), which is smaller in width than the ones suggested in  shin2017korean and  stratos2017sub due to the omission of special symbols. Similarly, for (ii), we made up 118-dim one-hot vectors considering the cases in which Jamo are used as single (or composite) consonants or vowels, as frequently observed in social media text. These cases make up an additional array of dimension 51.
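As an illustration of how feature (i) could be constructed, the sketch below spreads each character into three 67-dim one-hot vectors, reusing the decompose helper and the CHO/JUNG/JONG tables from the earlier sketch; the index layout (cho-seng in dims 0-18, cwung-seng in 19-39, cong-seng in 40-66) and the zero vector for an empty third slot are our assumptions for illustration.

```python
import numpy as np

# A minimal sketch of feature (i): each character is spread into three
# 67-dim one-hot vectors (19 + 21 + 27), with a zero vector for an empty
# cong-seng slot. Assumes decompose, CHO, JUNG, JONG from the earlier sketch.

def jamo_onehot_sequence(text):
    vecs = []
    for ch in text:
        cho_l, jung_l, jong_l = decompose(ch)
        for letter, offset, table in [(cho_l, 0, CHO),
                                      (jung_l, 19, JUNG),
                                      (jong_l, 40, JONG[1:])]:
            v = np.zeros(67, dtype=np.float32)
            if letter:                           # empty third sound -> zero vector
                v[offset + table.index(letter)] = 1.0
            vecs.append(v)
    return np.stack(vecs)                        # shape: (3 * len(text), 67)

print(jamo_onehot_sequence("간").shape)          # (3, 67)
```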

For (iii) and (iv), we adopted a vector set distributed publicly in  cho2018real, reported to be extracted from a drama script corpus of size 2M. Constructing the vectors of (iii) is intuitive: a 2,534-dimensional one-hot vector is assigned to each character in the corpus. The case of (iv) may seem awkward in the sense of using characters as a meaningful token, but we concluded that Korean characters should be regarded as word pieces  [Sennrich et al.2015] or subword n-grams  [Bojanowski et al.2016] concerning the nature of their composition. All the characters are reported to be treated as separate tokens (subwords) in the training phase that uses skip-gram  [Mikolov et al.2013]. Although the number of possible character combinations in Korean is precisely 11,172 (= 19 * 21 * 28), the number of those used in real life reaches only about 2,500  [Kwon et al.1995]. Since the description says that the corpus is depunctuated and consists of around 2,500 Korean syllables, we exploited the dictionary of 100-dim fastText-embedded vectors provided in the paper, and extracted the list of characters to construct a one-hot vector dictionary. (Two types of embeddings were omitted, namely the Jamo-based fastText and the 11,172-dim one-hot vectors; the former was considered inefficient since there are only 118 symbols at most, and the latter was assumed to require a huge computation.)
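The dense vectors of (iv) are the ones distributed by cho2018real; we did not retrain them. For reference, a character-level skip-gram embedding of this kind could be reproduced roughly as sketched below, where the file name, the tiny placeholder corpus, the hyperparameters, and the use of the current fastText Python API (train_unsupervised, get_word_vector) are assumptions rather than the original pipeline.

```python
import fasttext  # pip install fasttext; API assumed as of fasttext >= 0.9

# A rough sketch of training character-level dense vectors like feature (iv):
# every Hangul character is treated as a separate token, so the corpus is
# written out with spaces between characters before skip-gram training.

def write_char_tokenized(sentences, path="char_corpus.txt"):
    with open(path, "w", encoding="utf-8") as f:
        for sent in sentences:
            chars = [ch for ch in sent if ch.strip()]   # drop whitespace
            f.write(" ".join(chars) + "\n")

# Placeholder sentences; a real corpus (e.g., a drama script corpus) goes here.
write_char_tokenized(["배고파", "이상한 영화였다"])
model = fasttext.train_unsupervised("char_corpus.txt",
                                    model="skipgram", dim=100, minCount=1)
vec = model.get_word_vector("배")                       # 100-dim dense vector
```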

(v) is a hybrid of Jamo and character-level features; three vectors indicating the first to the third sound of a character, namely the ones with dimension 19, 21, and 27 each, are concatenated into a single multi-hot vector. This preserves the conciseness of the Jamo-level one-hot encodings and also maintains the characteristics of conjunct forms. In summary, (i) utilizes 67-dim one-hot vectors, (ii) 118-dim one-hot vectors, (iii) 2,534-dim one-hot vectors, (iv) 100-dim dense vectors, and (v) 67-dim multi-hot vectors (Table 1).
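A sketch of feature (v) follows: the 19-, 21-, and 27-dim one-hot segments of the three sounds are concatenated into a single 67-dim multi-hot vector per character. It again reuses the decompose helper from the earlier sketch, and the index layout is our assumption.

```python
import numpy as np

# A minimal sketch of feature (v): one 67-dim multi-hot vector per character,
# with up to three entries set (first/second/third sound). Assumes decompose,
# CHO, JUNG, JONG from the earlier sketch.

def char_multihot(ch):
    cho_l, jung_l, jong_l = decompose(ch)
    v = np.zeros(67, dtype=np.float32)
    v[CHO.index(cho_l)] = 1.0                 # dims 0-18
    v[19 + JUNG.index(jung_l)] = 1.0          # dims 19-39
    if jong_l:                                # dims 40-66; empty slot stays zero
        v[40 + JONG.index(jong_l) - 1] = 1.0
    return v

print(char_multihot("간").nonzero()[0])        # [ 0 19 43]
```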

3.1 Task description

For evaluation, we employed two classification tasks that can be conducted with character-level embeddings. Due to a shortage of reliable open-source data for Korean, we selected datasets that provide a clear description of the annotation.

                        NSMC                    3i4K
                        BiLSTM    BiLSTM-SA     BiLSTM            BiLSTM-SA
(i) Shin2017            0.8176    0.8302        0.8759 (0.7603)   0.8764 (0.7798)
(ii) Cho2018c           0.8179    0.8251        0.8694 (0.7601)   0.8817 (0.7951)
(iii) Cho2018a-Sparse   0.8299    0.8301        0.8794 (0.7981)   0.8779 (0.7910)
(iv) Cho2018a-Dense     0.8347    0.8414        0.8823 (0.7934)   0.8869 (0.7977)
(v) Song2018            0.8289    0.8353        0.8781 (0.7758)   0.8843 (0.7949)
Table 2: Performance comparison in accuracy (F1-score in parentheses). Only accuracy is provided for NSMC since its labels are equally distributed. The two best models regarding accuracy (and F1-score for 3i4K) are marked in bold (and underlined) for both tasks in the original table.

3.1.1 Naver sentiment movie corpus

The corpus NSMC (https://github.com/e9t/nsmc) is a widely used benchmark for the evaluation of Korean language models. The annotation follows  maas2011learning and comprises 150K and 50K documents for the train and test sets, respectively. The authors assign a positive label to the reviews with a score higher than 8 and a negative label to the ones with a score lower than 5 (on a 10-point scale), adopting a binary labeling system. To prevent confusion coming from gray-zone data, neutral reviews were removed. The positive and negative instances are equally distributed in both the train and test sets. The dataset is expected to display how well each character-level encoding scheme conveys information regarding lexical semantics.

3.1.2 Intonation-aided intention identification for Korean

The corpus 3i4K (https://github.com/warnikchow/3i4k)  [Cho et al.2018b] is a recently distributed open-source dataset for multi-class intention identification. The labels, seven in total, include fragment and five clear-cut cases (statement, question, command, rhetorical question (RQ), and rhetorical command (RC)). The remaining class is for intonation-dependent utterances whose intention mainly depends on the prosody assigned to underspecified sentence enders, considering the head-finality of the Korean language. Since the labels are elaborately defined and the corpus is largely hand-labeled (or hand-generated), the corpus size is relatively small (61K in total) and some classes possess a tiny volume (e.g., about 1.7K for RQs and 1.1K for RCs). However, such challenging factors expose aspects of the evaluation that might otherwise be overlooked.

3.2 Feature engineering

In the first task, to accommodate the average document size, the length of the Jamo or character sequence was fixed to a maximum of (i-ii) 420 and (iii-v) 140 (the data description says the maximum volume of the input characters is 140). Similarly, in the second task, the maxima were (i-ii) 240 and (iii-v) 80 (the number of utterances with length longer than 80 was under 40). The length for (i-ii) being three times that of (iii-v) comes from the spreading of the sub-characters of each character.

For both tasks, the documents were numericalized in such a way that the tokens are placed at the right end of the feature, to preserve the Jamos or characters which may incorporate syntactically influential components of the phrases in a head-final language. For example, for the sentence “배고파 (pay-ko-pha, I’m hungry)”, the vector sequence is arranged in the form of [0 ... 0 v1 v2 v3], where v1, v2, and v3 denote the vector embeddings of the characters pay, ko, and pha, respectively. Here, pha encompasses the head of the phrase with the highest hierarchy in the sentence, which assigns the sentence the speech act of a statement. The spaces between eojeols were represented as zero vector(s).
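The right-aligned layout can be obtained by pre-padding with zero vectors, as in the following sketch; the helper name and the truncation-from-the-left behavior for overlong inputs are our assumptions (spaces between eojeols would likewise be inserted as zero vectors before padding).

```python
import numpy as np

# A minimal sketch of the right-aligned ("pre"-padded) layout described above:
# zero vectors fill the left side so the sentence-final characters, which tend
# to carry head information in Korean, stay at the right end.

def right_align(vectors, max_len):
    """vectors: (T, D) array for one document; returns (max_len, D)."""
    dim = vectors.shape[1]
    out = np.zeros((max_len, dim), dtype=np.float32)
    keep = vectors[-max_len:]                 # truncate from the left if too long
    out[max_len - len(keep):] = keep          # copy onto the right end
    return out

doc = np.ones((3, 67), dtype=np.float32)      # stands for three encoded characters
print(right_align(doc, max_len=140).shape)    # (140, 67)
```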

Looking into the content of the corpora, the first dataset (NSMC) contains many incomplete characters such as solely used sub-characters (e.g., ㄱㄱ, ㅠㅠ) and non-Korean symbols (e.g., Chinese characters, special symbols, punctuation). The former were treated as characters, whereas the latter were ignored in all features. Although (i, iii, iv) do not represent the former symbols as non-zero vectors while (ii, v) do, we concluded that this does not threaten the fairness of the evaluation, since a wider range of representation is an inherent advantage of those features. The second dataset (3i4K) contains only full characters, inducing no disturbance in the featurization.

3.3 Implementation

The implementation was done with the Hangul Toolkit (https://github.com/bluedisk/hangul-toolkit), fastText (https://pypi.org/project/fasttext/), and Keras [Chollet and others2015], which were used for character decomposition, dense vector embedding, and RNN-based training, respectively. For the RNN models, bidirectional long short-term memory (BiLSTM)  [Schuster and Paliwal1997] and self-attentive sentence embedding (BiLSTM-SA)  [Lin et al.2017] were applied.

In the vanilla BiLSTM, an autoregressive system representatively utilized for time-series analysis, a fully connected layer (FC) is connected to the last hidden state of the BiLSTM, finally inferring the output with a softmax activation. In the BiLSTM with self-attentive embedding, a context vector, whose length equals that of the hidden states of the BiLSTM, is jointly trained along with the network so that it can provide the weight assigned to each hidden state. The weights are obtained by making up an attention vector via a column-wise multiplication of the context vector and the hidden states. The model specification is provided as supplementary material.
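For concreteness, a minimal Keras sketch of the vanilla BiLSTM classifier is given below, following the widths listed in the supplementary material (32 LSTM units per direction, a 128-wide ReLU layer, a softmax output). The L, D, and N values correspond to the NSMC task with feature (v), and the optimizer settings are left at defaults here; the actual settings are in the supplement.

```python
from tensorflow.keras import layers, models

# A minimal sketch of the vanilla BiLSTM classifier (not the authors' code).
L, D, N = 140, 67, 2                                # NSMC, feature (v) case

inputs = layers.Input(shape=(L, D))
h = layers.Bidirectional(layers.LSTM(32))(inputs)   # last hidden state, width 64
h = layers.Dense(128, activation="relu")(h)         # FC of width 128
outputs = layers.Dense(N, activation="softmax")(h)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```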

3.4 Result

3.4.1 Performance

The evaluation phase displays the pros and cons of the conventional methodologies. In both tasks, (iv) Cho2018a-Dense and (v) Song2018 showed significant performance. We assume that the result comes from a distinguishing property of (iv, v): they do not break up the syllabic blocks and at the same time provide local semantics to the models, (iv) by employing skip-gram  [Mikolov et al.2013] and (v) by using a multi-hot encoding that assigns its own role to each component of the vector representation.

(iii) Cho2018a-Sparse preserves the blocks as well, but its one-hot encoding hardly gives any information on each character, which led to an insignificant result in both tasks. Although comparable performance is achieved with BiLSTM, the models reached saturation fast and displayed overfitting, while the models with the other features showed a gradual increase in accuracy. The reason for both the fast saturation and the performance degeneration (with attention) seems to be the little flexibility coming from the vast parameter set.

3.4.2 Using self-attentive embedding

The advantage of using the self-attentive embedding was most emphasized in the Jamo-level features (i) (for NSMC) and (ii) (for 3i4K), and least in (iii) (for both tasks) (Table 2). We assume that the relatively greater improvement using (i, ii) originates in the decomposability of the blocks. If a sequence of blocks is decomposed into a sequence of sub-characters, the morphemes can be highlighted to provide more syntactically/semantically meaningful information to the system, especially those that could not have been revealed in the block-preserving environment (iii-v). For example, the Korean word ‘이상한 (i-sang-han, strange)’ can be split into ‘이상하 (i-sang-ha, the root of the word)’ and ‘-ㄴ (-n, a particle that makes the root an adjective)’, making up circumstances in which the presence and role of the morphemes are pointed out.

3.4.3 Decomposability vs. Local semantics

The point described above is a weak point of the character-level features (iii-v) in the sense that their characters cannot be decomposed, even in the sparse multi-hot encoding. The higher performance of (iv-v) compared to the Jamo-level features, as currently displayed, can therefore be explained as a result of preserving the cluster of letters. However, if the computation resources are sufficient for exploiting deeper networks (e.g., stacked BiLSTM, Transformer  [Vaswani et al.2017], BERT  [Devlin et al.2018]), we infer that (i-ii) may also show comparable performance. Nevertheless, it is still quite impressive that (iv) scores the highest even though the utilized dictionary does not incorporate all the available character combinations. This is suspected to be where the distributional semantics of the word pieces comes into play.

                        BiLSTM                   BiLSTM-SA
                        Param.s    Time/epoch    Param.s    Time/epoch
(i) Shin2017            34,178     13.5m         297,846    18m
(ii) Cho2018c           47,234     16m           310,902    20.5m
(iii) Cho2018a-Sparse   665,730    33m           772,318    38.5m
(iv) Cho2018a-Dense     42,626     6.5m          149,214    6m
(v) Song2018            34,178     6m            140,766    6m
Table 3: Computation burden (trainable parameters and training time) for the NSMC models.

3.4.4 Computation efficiency

Although we investigate only the classification tasks, which take a relatively short amount of time for training and inference, the measurement of parameter volume and complexity is still meaningful (Table 3). It is observed that (v) yields comparable or better performance with respect to the other schemes, accompanied by a smaller computation burden. Besides, we argue that the multi-hot encoding (v) has a significant advantage over the rest in terms of multiple usages: it possesses both the conciseness of the sub-character (Jamo)-level features and the local semantics of the character-level features. For these reasons, the derived models are fast in training and also have the potential to be effectively used in sentence reconstruction or generation, as shown in  song2018sequence, where applying a large-dimensional one-hot encoding has been considered challenging.

4 Discussion

The primary goal of this paper is to search for a Jamo/character-level encoding scheme that best resolves the given task in Korean NLP. Experimentally, we found that the fastText-embedded vectors outperform the other features when provided with the same environment (model architecture). It is highly probable that distributional semantics plays a significant role in NLP tasks concerning syntax-semantics, at least in feature-based approaches  [Mikolov et al.2013, Pennington et al.2014], albeit the sparse encodings may outperform them in recent fine-tuning approaches such as  devlin2018bert. However, we experimentally verified that even with traditional feature-based systems, the sparse encoding schemes perform comparably to the dense one, especially displaying computation efficiency in the multi-hot case.

Hereby, we want to emphasize that the utility of the comparison result is not restricted to Korean, in that the introduced character encoding schemes are also available for other languages. Although the Korean writing system is unique, the Japanese language incorporates several morae (e.g., the small tsu) that approximately correspond to the third sound (cong-seng) of Korean characters, which may let Japanese characters be encoded in a manner similar to the Korean cases. Also, each character of the Chinese language (and kanji in Japanese) can be further decomposed into sub-characters (bushu in Japanese) that also carry meaning, as suggested in  nguyen2017sub (e.g., 鯨 “whale” to 魚 “fish” and 京 “capital city”).

Besides, many other languages used in South Asia (India), such as Telugu, Tamil, and Kannada, as well as those written in Devanagari, have writing systems of the Abugida type (https://en.wikipedia.org/wiki/Writing_system)  [Daniels and Bright1996], in which a character is a composition of a consonant and a vowel. The cases are not the same as Korean in view of the writing system, since the featural decomposition of Abugida characters is not represented as a segmentation of the glyph. However, for example, instead of listing all the CV combinations, one can simplify the representation by segmenting the property of a character into consonant and vowel and making up a two-hot encoded vector. A similar kind of character embedding can be applied to many native Philippine languages such as Ilocano. Moreover, we believe that the argued type of featurization is robust in combating noisy user-generated texts.
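As a toy illustration of the two-hot idea, the sketch below encodes a CV syllable with a hypothetical consonant/vowel inventory; a real abugida would additionally require script-specific decomposition of each glyph into its consonant and vowel sign.

```python
import numpy as np

# A toy sketch of the two-hot encoding suggested above, with a hypothetical
# consonant/vowel inventory (not tied to any real script's Unicode layout).

CONSONANTS = ["k", "g", "c", "t", "n", "p", "m", "y", "r", "l"]   # illustrative
VOWELS     = ["a", "i", "u", "e", "o"]                            # illustrative

def two_hot(consonant, vowel):
    v = np.zeros(len(CONSONANTS) + len(VOWELS), dtype=np.float32)
    v[CONSONANTS.index(consonant)] = 1.0
    v[len(CONSONANTS) + VOWELS.index(vowel)] = 1.0
    return v                           # 15 dims instead of 50 CV combinations

print(two_hot("k", "a").nonzero()[0])  # [ 0 10]
```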

5 Conclusion

In this study, we have reviewed five different types of (sub-)character-level embedding for a character-rich language. It is remarkable that the dense and multi-hot representations perform best in the given classification tasks, and specifically, the latter has the potential to be utilized beyond classification tasks due to its conciseness and computation efficiency. It is expected that the overall performance tendency may provide a useful reference for the text processing of other character-rich languages with conjunct forms in the writing system, including Japanese, Chinese, and the languages of various South and Southeast Asian regions. A brief tutorial on 3i4K using the embedding methodologies presented in this paper is available online (https://github.com/warnikchow/dlk2nlp#8-character-embedding).

References

Supplementary Material

BiLSTM and BiLSTM-SA model specification

Variables

  • Sequence length (L) and the number of output classes (N) depend on the task. For NSMC, L = 420 for feature (i-ii) and 140 for (iii-v). For 3i4K, L=240 for feature (i-ii) and 80 for (iii-v). N equals 2 and 7 for NSMC and 3i4K, respectively.

  • Character vector dimension (D) depends on the feature. For features (i-v), D equals 67, 118, 2,534, 100, and 67, respectively.

BiLSTM

  • Input dimension: (L, D)

  • RNN hidden layer width: 64 (= 32 * 2, bidirectional)

  • The width of FCN connected to the last hidden layer: 128 (Activation: ReLU)

  • Output layer width: N (Activation: softmax)

BiLSTM-SA

  • Input dimension: (L, D)

  • The dimension of the RNN hidden layer sequence output: (64 (= 32 * 2), L)
    each hidden state connected to an FCN of width 64 (Activation: tanh; corresponds to the attention dimension in  lin2017structured) [a]

  • Auxiliary zero vector size: 64
    connected to FCN of width 64 (Activation: ReLU, Dropout  [Srivastava et al.2014]: 0.3)
    connected to FCN of width 64 (Activation: ReLU) [b]

  • The vector sequence [a] is dot-multiplied column-wise with [b] to yield a layer of length L
    connected to an attention vector of size L (Activation: softmax)
    multiplied column-wise with the hidden state sequence to yield a weighted sum [c] of width 64
    [c] is connected to an FCN of width 256 (Activation: ReLU, Dropout: 0.3) * 2 (multi-layer)

  • Output layer width: N (Activation: softmax)
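A minimal Keras sketch corresponding to the BiLSTM-SA specification above is given below; it is our interpretation of the listed layers, not the authors' released code. Feeding the auxiliary zero vector as an explicit input makes branch [b] act as a jointly trained context vector (only the biases of its dense layers contribute), matching the description in Section 3.3; the two 256-wide layers follow our reading of "* 2 (multi-layer)".

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

L, D, N = 140, 67, 2                                     # NSMC, feature (v) case

seq_in = layers.Input(shape=(L, D))
zero_in = layers.Input(shape=(64,))                      # fed with zeros at train/test time

h = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(seq_in)  # (L, 64)
a = layers.Dense(64, activation="tanh")(h)               # [a]: per-timestep projection

b = layers.Dense(64, activation="relu")(zero_in)
b = layers.Dropout(0.3)(b)
b = layers.Dense(64, activation="relu")(b)               # [b]: context vector, (64,)
b = layers.Reshape((64, 1))(b)

scores = layers.Dot(axes=(2, 1))([a, b])                 # (L, 1) attention logits
att = layers.Softmax(axis=1)(scores)                     # weights over timesteps
c = layers.Dot(axes=(1, 1))([att, h])                    # (1, 64) weighted sum [c]
c = layers.Reshape((64,))(c)

x = layers.Dense(256, activation="relu")(c)
x = layers.Dropout(0.3)(x)
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.3)(x)
out = layers.Dense(N, activation="softmax")(x)

model = models.Model([seq_in, zero_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(5e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
# Usage: model.fit([X, np.zeros((len(X), 64))], Y, batch_size=64, ...)
```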

Settings

  • Optimizer: Adam  [Kingma and Ba2014] (Learning rate: 0.0005)

  • Loss function: Categorical cross-entropy

  • Batch size: 64 for NSMC, 16 for 3i4K (due to the difference in the corpus volume)

  • For 3i4K, class weights were taken into account to compensate for the volume imbalance.

  • Device: Nvidia Tesla M40 24GB