Recently, many service platforms such as Amazon Alexa (https://developer.amazon.com/alexa) help third-party developers create a wide range of natural language-based chatbots. Customers can access those customized chatbots either through a voice interface (employing a speech recognition system) or a textual interface (using chat applications or a custom-built textual interface).
Natural language understanding for a chatbot mainly consists of two components: (1) an intent classification module that analyzes the user intent of an input sentence, and (2) an entity extraction module that extracts entities from the input sentence. Table 1 shows how a Korean input sentence is analyzed (the English translation is given in brackets). The first step is to classify the user intent of the input sentence, which is “play_music” in this example. Chatbot developers should predefine the set of possible intents. Once the intent is correctly classified, it is relatively easy to extract entities from the input sentence, since the classified intent helps determine the possible entity types for the input sentence.
|Input|모짜르트 클래식 틀어줘 (Play Mozart’s classical music)|
When users access a chatbot through the textual interface, input sentences may occasionally contain spelling or spacing errors. For example, users may confuse the spellings of similarly pronounced words, omit required spaces between words, or simply make typos.
Since Korean is an agglutinative language, a word may contain two or more morphemes that jointly determine its meaning. Thus, many researchers first tokenize Korean input sentences into morphemes with a morphological analyzer, and then feed the resultant morphemes into a classification system (Choi18; Park18_2; Hur17). Such approaches based on morpheme embedding work well on grammatically correct sentences but perform poorly on erroneous inputs, as shown in section 4.3. Errors in input sentences are likely to cause problems during morphological analysis; occasionally, morphemes carrying significant meaning are miscategorized as meaningless morphemes due to spelling errors. Table 2 briefly illustrates such a case: the important clue “파케스트”, a typo of the Korean morpheme “팟케스트(podcast)”, is split into two meaningless morphemes because of the spelling error.
In this paper, a novel approach called Integrated Eojeol Embedding (IEE) is proposed to handle the classification problem. Eojeol is a Korean term roughly meaning a word; in this paper, an Eojeol is operationally defined as a sequence of Korean characters separated by spaces. Detailed examples are given in section 3.1.
The main idea of IEE is to feed Eojeol embedding vectors into the sentence classification network, instead of morpheme or other subword-unit embedding vectors. For an Eojeol w, subword unit-based Eojeol embedding vectors are first calculated from the different subword units of w, and the resultant vectors are integrated to form a single IEE vector. By doing so, the algorithm significantly reduces the effect of incorrect subword unit tokenization caused by spelling or other errors, while retaining the benefits of pre-trained subword unit embedding vectors such as GloVe (pennington14) or BPEmb (heinzerling2018bpemb).
|With Typo|딴 파케스트로 플레이|
|Morphemes|딴(another) / 파케(podca) / 스트로(to st) / 플레이(play)|
|Corrected|딴 팟케스트로 플레이|
|Morphemes|딴(another) / 팟케스트(podcast) / 로(to) / 플레이(play)|
|Meaning|Play another podcast|
Also, two noise insertion methods, called Jamo dropout and space-missing sentence generation, are proposed to automatically insert noisy data into the training corpus and further improve the performance of the proposed IEE approach. The proposed system outperforms the baseline system by over 18%p on the erroneous sentence classification task in terms of sentence accuracy.
2 Related Works
There exists a wide range of previous work on English sentence classification. Kim14 employed a CNN with max-pooling for sentence classification, Bowman15 used BiLSTMs to obtain sentence embeddings for natural language inference tasks, and Zhou15 combined an LSTM with a CNN. Recent works such as Im17 or Yoon18 explored the self-attention mechanism for sentence encoding. However, to the best of our knowledge, the task of classifying erroneous sentences has received much less attention.
This paper mainly focuses on integrating multiple embeddings. Yin16 also considered this idea; they treated different embeddings as multi-channel inputs and extended the algorithm of Kim14 to handle them. This paper differs from their work in that we generalize their approach: unlike theirs, IEE does not require the input embeddings to share the same subword tokenization while integrating various embeddings.
Besides, Choi18 proposed morpheme-based Korean GloVe word embedding vectors, trained with the GloVe algorithm (pennington14) on unstructured text, along with a Korean word analogy corpus to evaluate them. Choi18 also used the trained Korean GloVe embedding vectors with the algorithm of Kim14 to train a Korean sentence classifier for chatbots. Park18 focused on the sentence classification problem for sentences with spacing errors, but ignored other error types such as typos.
In this paper, the proposed IEE vectors are fed into the network proposed in Choi18. The overall system performance is compared against the original sentence classification systems of Choi18 and Kim14 to clarify the effect of the IEE vectors.
3 Classifying Erroneous Korean Sentences
In this section, an algorithm to correctly classify erroneous Korean sentences is proposed.
3.1 Brief Introduction to Korean Word and Syntactic Structures
In this subsection, the structures of Korean words and sentences are briefly described to help the reader understand this paper. An Eojeol is defined as a sequence of Korean characters separated by spaces. A given input sentence s is tokenized on spaces to obtain its constituent Eojeol list. An Eojeol contains one or more morphemes, which can be extracted using a Korean morphological analyzer. Table 3 shows the Eojeols and morphemes of the example sentence from Table 2.
|Sentence|딴 팟케스트로 플레이|
|Eojeols|딴 / 팟케스트로 / 플레이|
|Morphemes|딴 / 팟케스트 / 로 / 플레이|
|Initial|ㄱ, ㄴ, ㄷ, ㄹ, ㅁ, ㅂ, ㅅ, ㅇ, ㅈ, ㅊ, ㅋ, ㅌ, ㅍ, ㅎ, ㄲ, ㄸ, ㅃ, ㅆ, ㅉ|
|Medial|ㅏ, ㅐ, ㅑ, ㅒ, ㅓ, ㅔ, ㅕ, ㅖ, ㅗ, ㅘ, ㅙ, ㅚ, ㅛ, ㅜ, ㅝ, ㅞ, ㅟ, ㅠ, ㅡ, ㅢ, ㅣ|
|Final|ㄱ, ㄴ, ㄷ, ㄹ, ㅁ, ㅂ, ㅅ, ㅇ, ㅈ, ㅊ, ㅋ, ㅌ, ㅍ, ㅎ, ㄲ, ㅆ, ㄳ, ㄵ, ㄶ, ㄺ, ㄻ, ㄼ, ㄽ, ㄾ, ㄿ, ㅀ, ㅄ|
A Korean character consists of two or three Korean alphabet letters, or Jamos: a consonant called the initial (choseong in Korean), a vowel called the medial (jungseong), and optionally another consonant called the final (jongseong). 19 consonants can serve as the initial, 21 vowels as the medial, and 27 consonants as the final. Table 4 lists the possible Jamo candidates for each position. Theoretically, there can be 11,172 (= 19 × 21 × 28, counting the case with no final) different Korean characters in total, but only 2,000 to 3,000 characters are used in practice.
Table 5 shows a few examples of Korean characters and their constituent Jamos. The first example has no final but is still a valid Korean character. Excluding duplicates, there are 51 Korean Jamos in total: 30 consonants and 21 vowels.
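The initial/medial/final composition described above follows directly from the Unicode layout of Hangul syllables, so a character can be decomposed into its Jamos with simple arithmetic. The sketch below illustrates this standard technique; the function name is ours.

```python
# Decompose a precomposed Hangul syllable into its constituent Jamos,
# using the Unicode arithmetic of the U+AC00 syllable block.
INITIALS = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")            # 19 initials
MEDIALS = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")         # 21 medials
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 finals + empty

def decompose(ch: str):
    """Return the (initial, medial, final) Jamos of a Hangul syllable."""
    idx = ord(ch) - 0xAC00                # syllables start at U+AC00
    if not (0 <= idx < 11172):            # 19 * 21 * 28 = 11,172 syllables
        return (ch,)                      # non-syllable characters pass through
    return (INITIALS[idx // 588],         # 588 = 21 * 28
            MEDIALS[(idx % 588) // 28],
            FINALS[idx % 28])
```

For example, `decompose("팟")` yields ㅍ, ㅏ, and ㅅ, while a character without a final, such as 가, yields an empty final slot.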
3.2 Integrated Eojeol Embedding
Figure 1 illustrates the network architecture used to calculate the IEE vector of a given Eojeol w. For an Eojeol w, t types of subword unit lists are generated first. In this paper, four types of subword unit lists are considered: the Jamo list, the character list, the byte-pair encoding (BPE) unit list, and the morpheme list; the four lists generally have different lengths. Table 6 shows the subword unit lists of the Eojeol “팟케스트로”.
|Unit|Subword Unit List|
|Jamo|ㅍ / ㅏ / ㅅ / ㅋ / ㅔ / ㅅ / ㅡ / ㅌ / ㅡ / ㄹ / ㅗ|
|Character|팟 / 케 / 스 / 트 / 로|
|BPE unit|_ / 팟 / 케 / 스트로|
|Morpheme|팟케스트 / 로|
Each subword unit list is then fed into a subword unit merge (SUM) network. For a given subword unit list, the SUM network first converts each list item into its corresponding embedding vector to obtain an embedding matrix. Afterward, one-dimensional depthwise separable convolutions (Chollet17) with multiple kernel sizes are applied to the matrix, followed by max-pooling and layer normalization (Ba16). The output of the SUM network is the subword unit-based Eojeol embedding vector. The Integrated Eojeol Embedding vector is then calculated by integrating all the subword unit-based Eojeol embedding vectors. Three different algorithms are proposed to construct the IEE from the t subword unit types.
IEE by Concatenation. All the subword unit-based Eojeol embedding vectors e_1, …, e_t are concatenated to form one IEE vector: IEE(w) = [e_1; e_2; …; e_t].
IEE by Weighted Sum. A weight for each subword unit-based Eojeol embedding vector is trained, and the IEE vector is defined as the weighted sum of those vectors. More precisely, IEE(w) = Σ_{i=1}^{t} a_i · e_i, where a_i is the trained weight for the i-th subword unit type, a_i ≥ 0, and Σ_{i=1}^{t} a_i = 1.
IEE by Max Pooling. The j-th element of the IEE vector is set to the maximal value among the j-th elements of the subword unit-based Eojeol embedding vectors: IEE(w)[j] = max_{1 ≤ i ≤ t} e_i[j].
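The three integration methods above can be written in a few lines of NumPy. This is a toy sketch with made-up sizes and weights; in the actual system, each row of E would come from a SUM network and the weighted-sum coefficients would be trained.

```python
import numpy as np

# Toy illustration of the three IEE integration methods. Each row of E stands
# for one subword unit type's Eojeol embedding vector e_i (dimension f).
f, t = 4, 3                                      # illustrative sizes only
E = np.arange(t * f, dtype=float).reshape(t, f)

iee_concat = E.reshape(-1)         # IEE by Concatenation: dimension t * f
alpha = np.array([0.5, 0.3, 0.2])  # weights summing to 1 (made up here)
iee_wsum = alpha @ E               # IEE by Weighted Sum: dimension f
iee_max = E.max(axis=0)            # IEE by Max Pooling: dimension f
```

Note that only concatenation changes the output dimension, which matters for the classifier input size discussed in the next subsection.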
3.3 IEE-based Sentence Classification
Figure 2 illustrates the network architecture for sentence classification using the proposed IEE approach. A given Korean sentence s is treated as a list of Eojeols. For each Eojeol, its IEE vector is first calculated with the network proposed in section 3.2, yielding an IEE matrix whose rows are the per-Eojeol IEE vectors. The dimension of each IEE vector is t times the convolutional filter size for IEE by Concatenation, and equal to the filter size otherwise.
Once the IEE matrix is calculated, depthwise separable convolutions with multiple kernel sizes are applied to it, followed by max-pooling and layer normalization. Finally, two fully connected feed-forward layers are applied to obtain the final output score vector, with one score per possible class. The overall network architecture is a variation of the architecture proposed in Choi18, with its input replaced by the newly proposed IEE vectors.
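The forward pass just described can be sketched in plain NumPy. This is a deliberately simplified stand-in, not the paper's implementation: an ordinary 1-D convolution replaces the depthwise separable convolution, layer normalization is omitted, and all weights are random.

```python
import numpy as np

# Simplified forward pass of the IEE-based sentence classifier:
# IEE matrix -> 1-D convolution over Eojeols -> max-pool -> two dense layers.
rng = np.random.default_rng(0)
n, d, f, k, c = 6, 8, 5, 3, 4       # Eojeols, IEE dim, filters, kernel, classes

X = rng.normal(size=(n, d))         # IEE matrix of the sentence (one row per Eojeol)
W_conv = rng.normal(size=(k, d, f))

# Valid 1-D convolution over the Eojeol axis -> shape (n - k + 1, f)
conv = np.stack([np.einsum("kd,kdf->f", X[i:i + k], W_conv)
                 for i in range(n - k + 1)])
pooled = conv.max(axis=0)           # max-pooling over time -> shape (f,)

W1, W2 = rng.normal(size=(f, f)), rng.normal(size=(f, c))
hidden = np.maximum(0, pooled @ W1)  # feed-forward layer with ReLU
scores = hidden @ W2                 # final output scores, one per class
```

The predicted intent would be the argmax of `scores`; in training, these scores would be fed to a softmax cross-entropy loss.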
3.4 Noise Insertion Methods to Improve the Integrated Eojeol Embedding
In this subsection, two noise insertion methods that further improve the performance of the IEE vectors are proposed. The first method, called Jamo dropout, masks Jamos during the training phase with Jamo dropout probability jdp. Instead of masking some elements of the input embedding vector as regular dropout (Srivastava14) does, Jamo dropout masks the whole Jamo embedding vector. Also, if any subword unit of a different type contains a masked Jamo, the embedding vector of that subword unit is masked as well. The method has two expected roles. First, it introduces additional noise during the training phase, making the trained classifier work better on noisy inputs. Second, masking the subword units that contain masked Jamos allows the system to learn to focus on the Jamo embeddings when an input subword unit is missing from the trained subword unit vocabulary.
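Jamo dropout can be sketched as follows, under an illustrative data layout of our own devising (the paper does not specify one): each subword unit records the span of Jamo positions it covers, and a unit is masked whenever any Jamo inside its span is masked.

```python
import random

def jamo_dropout(jamos, subunits, jdp, rng=random):
    """Mask each Jamo with probability jdp; a subword unit is also masked if
    any Jamo in its span is masked. `subunits` holds (unit, (start, end))."""
    masked = [rng.random() < jdp for _ in jamos]
    jamos_out = ["<MASK>" if m else j for j, m in zip(jamos, masked)]
    # A subword unit is dropped if ANY of its constituent Jamos was masked.
    dropped_units = {i for i, (_, (s, e)) in enumerate(subunits)
                     if any(masked[s:e])}
    return jamos_out, dropped_units
```

In the real system the masking would replace embedding vectors rather than tokens, but the span logic is the same.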
Compared to the morpheme-embedding-based approach, the Eojeol-embedding-based approach is expected to perform poorly on sentences without spaces, since Eojeols are obtained by tokenizing the input sentence on spaces. Therefore, a second noise insertion method called space-missing sentence generation is introduced to resolve this impediment. The method prepares the system for input sentences that lack necessary spaces by automatically adding such sentences to the training corpus. More precisely, sentences are randomly chosen from the training corpus with probability msp (missing space probability), all spaces in the chosen sentences are removed, and the space-removed sentences are inserted into the training corpus. Space-missing sentence generation is applied only once, before the training phase.
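Space-missing sentence generation reduces to a few lines of code. The sketch below assumes the corpus is a list of sentence strings; the function and variable names are ours, not the paper's.

```python
import random

def add_space_missing_sentences(corpus, msp, rng=random):
    """With probability msp, append a space-stripped copy of each sentence
    that contains at least one space. Applied once, before training."""
    generated = [s.replace(" ", "") for s in corpus
                 if " " in s and rng.random() < msp]
    return corpus + generated
```

With msp = 1.0, every spaced sentence gains a space-free duplicate; with the paper's msp = 0.4, roughly 40% of them do.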
4 Experiments
In this section, experimental settings and evaluation results are described.
4.1 Corpus
The intent classification corpus proposed in Choi18 contains 127,322 manually annotated, grammatically correct Korean sentences covering 48 intents. Examples include weather (e.g., 오늘 날씨 어때 How is the weather today), fortune (e.g., 내일 운세 알려줘 Tell me tomorrow’s fortune), music (e.g., 음악 틀어줘 Play music), and podcast (e.g., 팟캐스트 틀어줘 Play podcast). The sentences of each intent are randomly divided in an 8:1:1 ratio into train, validation, and test datasets. The test dataset is called the WF (Well-Formed) corpus throughout this paper and consists of 12,711 sentences.
The KM (Korean Mis-spelling) corpus is manually annotated to measure system performance on erroneous input sentences. It is annotated with 46 intents, two fewer than the WF corpus. The two removed intents are OOD and Common: OOD is the intent for meaningless sentences, and Common is the intent for short answers such as “응 (yes)”. The sentences of the Common intent are too short for their meaning to be recovered once errors are inserted.
For the 46 intents of the KM corpus, two annotators are asked to create sentences for 23 intents each. For each intent, an annotator first creates 45 sentences without errors and then inserts errors into the created sentences. The error insertion guideline is given in Table 7. As a result, the KM corpus contains 2,070 erroneous sentences in total, and it is used only for testing.
|RULE 1. For each sentence, insert one or more errors. Some recommendations are:|
|- Remove or duplicate a Jamo.|
|- Swap the order of two or more Jamos.|
|- Replace a Jamo with a similarly pronounced one.|
|- Replace a Jamo with one located nearby on the keyboard.|
|RULE 2. The erroneous sentences should remain understandable.|
|- Each annotator tries to classify the other’s sentences; misclassified sentences are reworked.|
|RULE 3. Of the 45 sentences for each intent, 25 should contain only valid characters, and 20 should contain one or more invalid characters.|
|- “Invalid” Korean characters are defined as those lacking an initial or a medial.|
Also, a space-missing (SM) test corpus is generated from the WF and KM corpora to evaluate system performance on sentences without necessary spaces. The corpus consists of sentences from the other two test corpora, with the spaces between words randomly removed with probability 0.5 and at least one space removed from each original sentence. The SM corpus contains 14,781 sentences and is also used only for testing.
4.2 Experimental Setup
Sentence accuracy is used as the criterion to measure the performance of sentence classification systems on each corpus. Sentence accuracy (SA) is defined as the number of sentences whose intent is classified correctly, divided by the total number of sentences in the corpus.
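Under this definition, SA is simply the fraction of exactly correct intent predictions; a minimal sketch (function name ours):

```python
def sentence_accuracy(predicted, gold):
    """SA = (# sentences with correctly classified intent) / (# sentences)."""
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```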
Throughout the experiments, jdp is set to 0.05 and msp to 0.4; both values are chosen through a grid search. For each experiment configuration, three training and test runs are carried out, and the average of the three test results on each test corpus is presented as the final system performance.
The ADAM optimizer (Kingma14) with a learning rate warm-up scheme is applied. The learning rate increases from 0.0 to 0.1 over the first three epochs, and exponential learning rate decay with a decay rate of 0.75 is applied after five epochs of training. After each epoch, the trained classifier is evaluated on the validation dataset, and training stops when the validation accuracy has not improved for four consecutive epochs. The minibatch size is set to 128. Dropout (Srivastava14) with a rate of 0.1 is applied between layers.
The Korean GloVe embedding vectors proposed by Choi18 are used as the morpheme embedding vectors; their dimension is 300. BPEmb (heinzerling2018bpemb) is used for the BPE unit embeddings, with vector dimension and vocabulary size experimentally set to 300 and 25,000, respectively. The morpheme and BPE unit embedding vectors are fixed during training, while the Jamo and character embedding vectors are trained together with the network parameters. The dimensions of the Jamo and character embedding vectors are both set to 300, and the convolutional filter size is set to 128.
4.3 Evaluation Results
In this subsection, system configurations are denoted as follows: IEE-Concat represents IEE by Concatenation, IEE-WS represents IEE by Weighted Sum, and IEE-Max represents IEE by Max-pooling, as described in section 3.2. The applied subword units are listed in square brackets: M for morpheme, B for BPE unit, J for Jamo, and C for character. The noise insertion methods used are indicated after a + sign: SG for space-missing sentence generation, JD for Jamo dropout, and ALL for both.
|System|WF|KM|SM|
|Choi et al. (Choi18)|96.91|50.29|87.46|
Table 8 compares the performance of the proposed system with the baseline systems. The first and second baselines are based on the algorithm proposed by Kim14; the source code is retrieved from the authors’ repository and modified to accept Korean sentences as input. Baseline Kim-1 takes the Eojeol list of an input sentence as its input, while baseline Kim-2 takes the morpheme list instead. The third baseline is the system proposed in Choi18, which also receives the list of Korean morphemes as its input. The fourth baseline, M-BERT, is the multilingual version of BERT (bert); we downloaded the pretrained model from the authors’ repository and fine-tuned it for Korean sentence classification. For the proposed approach, three selected system configurations are presented.
As the table shows, the performance on the KM corpus improves by over 17%p compared to the baseline systems. The proposed system also outperforms the baselines on the WF and SM corpora by integrating information from different types of subword units.
Another set of experiments is carried out to measure the effect of each noise insertion method and to compare the three proposed IEE approaches. All four subword unit types are used throughout these experiments, and the evaluation results are presented in Table 9. As the table shows, using the IEE approaches with the two proposed noise insertion methods dramatically improves performance on the erroneous sentence classification task; the performance on the KM corpus increases by about 14 to 15%p for every proposed IEE approach.
As expected in section 3.4, the IEE approaches show relatively low performance on the SM corpus without the SG method, compared to the baseline systems in Table 8. However, the evaluation results suggest that this performance drop can be handled efficiently by applying SG: the system’s performance on the SM corpus reaches up to 89.40%, about 2%p higher than the SM performance of the baseline systems. Among the three proposed IEE approaches, one performed best in most cases; a second performed slightly better on the KM corpus but noticeably worse on the WF corpus; and the third performed much worse than the other two on the KM corpus.
To separate the contribution of the Eojeol-based approach from that of the two noise insertion methods, two subword unit-based integrated embedding approaches are newly defined. The integrated morpheme embedding (IME) approach creates an integrated embedding vector for each morpheme by integrating the Jamo-based and character-based morpheme embedding vectors with the pre-trained morpheme embedding vector. The integrated BPE unit embedding (IBE) approach creates an integrated embedding vector for each BPE unit in the same manner. The two approaches are otherwise identical to the IEE approach, except that IME and IBE feed morpheme and BPE unit embedding vectors, respectively, into the network in Figure 2, while IEE feeds Eojeol embedding vectors. Since BPE embeddings cannot be integrated into IME, an IEE configuration without the BPE subword unit is used for the comparison with IME; for the same reason, a configuration without the morpheme subword unit is used for the comparison with IBE. Only concatenation is considered as the integration method in these experiments.
Table 10 shows the comparison results. The two noise insertion methods work well on IME and IBE as well; however, on the KM corpus, the Eojeol-based embedding approach outperforms the morpheme-based and BPE-based approaches by about 3%p. This result shows that the Eojeol-based embedding approach handles subword unit analysis errors more effectively than subword unit-based integrated embedding approaches such as IME or IBE.
Finally, experiments are conducted with various subword unit configurations to determine the effect of each subword unit on system performance. Table 11 shows the comparison results. The SG noise insertion method alone is applied to the systems without the Jamo subword unit, while both noise insertion methods are applied in all other cases. The evaluation results are compared between two groups: configurations with the Jamo subword unit and configurations without it.
Several interesting facts can be observed from the table. First, the Jamo subword unit dramatically improves system performance on the KM corpus; the best-performing configuration on that corpus reaches 69.16%. However, using the Jamo subword unit alone results in a relatively low 95.19% on the WF corpus, due to the lack of pre-trained embeddings. By combining the Jamo subword unit with other subword units such as morphemes or BPE units, the system achieves excellent performance on the WF corpus as well.
Additionally, the configuration combining the morpheme embedding vectors with the BPE embedding vectors achieves the best WF corpus performance of 97.47%. This result suggests that further integrating different types of pre-trained subword embeddings can yield additional performance improvements. The evaluation results also show that the Jamo subword unit is more effective than the character subword unit in terms of KM corpus performance. Considering that Korean users type in Jamos rather than whole characters, this result is quite understandable.
4.4 Error Analysis
Several examples are examined to understand why the proposed Eojeol-based approach works better than the existing morpheme-based approaches. These examples are presented in Table 12; important clues for sentence classification are marked in bold.
|Case 1. Is the light turned on now?|
|T|S|지금 불링 켜졌어?|
|T|M|지금 불링 켜다(turned on)/지다/었/어|
|T|B|_지금 _불/링 _켜(turned on)/졌/어|
|C|S|지금 불이 켜졌어?|
|C|M|지금 불/이 켜다/지다/었/어|
|C|B|_지금 _불이 _켜/졌/어|
|Case 2. Lottery number.|
In Case 1 (T: with typo, C: corrected), the vital clue “불(light)” is not extracted as a morpheme due to the spelling error; however, the clue is successfully recovered in the BPE subword units. In Case 2, the typo can be handled by considering the Jamo subword units: the Jamo subword unit lists of the correct and misspelled sentences are identical, while their morpheme and BPE subword unit lists differ. As these examples show, by integrating multiple types of subword unit embeddings, the system can obtain a clue from the other subword unit types when one type fails to recover a vital clue.
The proposed algorithm is still weak on redundantly spaced sentences, exemplified by “ㅈ ㅗ명 꺼줘”, a misspelling of “조명 꺼줘(Turn off the light)”. In this sentence, the vital clue “조명(light)” is split into two different Eojeols, “ㅈ” and “ㅗ명”, by the misplaced space. Because the clue is split across two Eojeols, the proposed Eojeol-based algorithm fails to recover it and to produce the correct classification result.
5 Conclusion
In this paper, a novel approach called Integrated Eojeol Embedding is proposed to handle the Korean erroneous sentence classification task. Two noise insertion methods are additionally proposed to overcome the weakness of Eojeol-embedding-based approaches and to add noise to the training data automatically. The proposed system is evaluated on an intent classification corpus for Korean chatbots. The evaluation results show an improvement of over 18%p on the erroneous sentence classification task and 0.5%p on the grammatically correct sentence classification task, compared to the baseline system.
Although the proposed algorithm is tested only on the Korean chatbot intent classification task, it can be applied to other types of sentence classification, such as sentiment analysis or comment categorization. Nor is the algorithm restricted to Korean text; for example, it could be applied to English text to integrate English GloVe embedding vectors with BPE unit embedding vectors.
Our next work will be to investigate the performance of the proposed algorithm further and expand the algorithm to cover other languages.