Machines Getting with the Program: Understanding Intent Arguments of Non-Canonical Directives

12/01/2019 ∙ by Won Ik Cho, et al. ∙ Seoul National University

Modern dialog managers face the challenge of having to fulfill human-level conversational skills as part of common user expectations, including but not limited to discourse with no clear objective. Along with these requirements, agents are expected to extrapolate intent from the user's dialogue even when subjected to non-canonical forms of speech. This depends on the agent's comprehension of paraphrased forms of such utterances. In low-resource languages, the lack of data is a bottleneck that prevents advances in the comprehension performance of these types of agents. In this paper, we demonstrate the necessity of extracting the intent arguments of non-canonical directives and define guidelines for building paired corpora for this purpose. Following the guidelines, we label a dataset consisting of 30K instances of question/command-intent pairs, including annotations for a classification task that predicts the utterance type. We also propose a method for mitigating class imbalance in the final dataset, and demonstrate the potential applications of the corpus generation method and dataset.


1 Introduction

Smart agents such as Amazon Echo and Google Home have seen relatively wide market adoption. Users have become familiar with formulating questions and orders in a way that these agents can easily comprehend and act upon. Given this trend, particularly for cases where questions can have various forms such as yes/no, alternative, wh-, echo, and embedded [Huddleston1994], a number of analysis techniques have been studied in the domains of semantic role labeling [Shen and Lapata2007] and entity recognition [Mollá et al.2006]. More recently, various question answering tasks have been proposed [Yang et al.2015], yielding systems with significant advances in performance. Studies on the parsing of canonical imperatives [Matuszek et al.2013] have also been conducted for many household agents.

However, discerning the intent of a conversational, non-canonical sentence (question or command) and extracting its intent argument remains a challenge. Additional complexity is introduced when the target text comes from speech recognition, as the result may not contain punctuation. For example, given an unclear declarative question [Gunlogson2002] such as “poppa joe you want me to go now”, a human listener can interpret it as asking ‘if Joe wants the speaker to go now’, but this can be challenging for a machine. Also, sometimes the speech act itself can be hard to guess from the sentence form, as in inferring “why don’t you just call the police” as a representation of the to-do-list item ‘to call the police’ (Figure 1). Although many advanced dialog management systems may generate a plausible reaction to such input utterances, this is different from extracting the exact intent argument (a question set or a to-do-list) needed for an actual operation.

Figure 1: A diagram of the proposed extraction scheme. Unlike in Korean, the language investigated here, in the English translation the wh-related noun (here, destination) is placed at the head of the sentence.

Complexities like the example discussed above have not seen much exploration outside of English, especially in the context of languages with a markedly different syntax or those that do not use Latin-like alphabets. As a concrete example, Korean morphology is agglutinative, its syntax is head-final, and scrambling (non-deterministic permutation of word/phrase order) is common among native speakers. Specifically, the agglutinative property of Korean requires additional morphological analysis, which makes it challenging to identify the component of the sentence most strongly connected to the core intent. The head-finality introduces a further layer of complexity, where an under-specified sentence ender incorporates a prosodic cue that requires disambiguation to comprehend the original intent [Yun2019, Cho et al.2019a]. Finally, scrambling, which frequently happens in spoken utterances, demands further analysis on top of recognizing the entities and extracting the relevant phrases. These factors make it difficult for dialog managers to directly apply the conventional analysis methods used for Germanic or other Indo-European languages.

In this paper, we explore these aspects in the context of Korean, a less explored, low-resource language with various non-canonical expressions. Building on this, we propose a structured sentence annotation scheme which can help enrich human-like conversation with artificial intelligence (AI). For automation, we annotate an existing corpus and then augment the dataset to mitigate class imbalance, demonstrating the flexibility, practicality, and extensibility of the proposed methods. To show that the scheme is not limited to Korean, we demonstrate the methodology using English examples and supplement specific cases with Korean.

To begin with, in section 2, we present the theoretical background of this study. We then discuss the detailed procedure with examples, along with an explanation of how it fits with modern natural language understanding (NLU) systems and an evaluation framework.

2 Concept and Related Work

The foundation of this proposal is based on studies of intent classification and slot-filling [Liu and Lane2016]. The theoretical background builds on the literature on speech acts [Searle1976] and formal semantics [Portner2004]. Although many task-oriented systems identify intents as specific actions that the agent should take [Li et al.2018], to make such intent categories generic with respect to sentence semantics, we hypothesized that it would be beneficial to represent them in a structured format. We believe the problem closest to this task is formulating a question set (QS) or to-do-list (TDL) with multiple possible utterance permutations (Table 1) [Portner2004]. While these concepts relate more strongly to syntactic properties, we extend them to the speech act level to reflect common patterns in human dialog.

Type            Denotations              Discourse Component    Force
Declaratives    proposition (p)          Common Ground          Assertion
Interrogatives  set of propositions (q)  Question Set           Asking
Imperatives     property (P)             To-Do List Function    Requiring
Table 1: Clause types and their properties (Portner, 2004).

For directives identifiable as either a question or a command, conventional systems depend on slot-filling to extract the item and argument [Li et al.2018, Haghani et al.2018], where the number of categories is generally restricted. In non-task-oriented dialogues, by contrast, the presence of a specific domain is not assumed. Thus, we conclude that the arguments should be in natural language form rather than structured data, for example by rewriting the utterances into nominalized or simplified terms corresponding to the source text. There have been studies on paraphrasing questions with regard to their core content [Dong et al.2017], but little has been done on its structured formalization. Our study also targets the extraction of commands, which is equally essential but has not been widely explored outside of the robotics domain [Matuszek et al.2010, Matuszek et al.2013].

The work most related to ours is semantic parsing [Berant and Liang2014, Su and Yan2017] and structured query language (SQL) generation [Zhong et al.2017], which propose seq2seq-like architectures [Sutskever et al.2014] to transform a natural language input into a structured format. These approaches provide the core content of the directive utterances as a sequence of queries, utilizing it for paraphrasing [Berant and Liang2014] or code generation [Zhong et al.2017]. However, the source sentence formats they assume are usually canonical and mostly information-seeking, rather than conversational.

Our motivation builds on the observation that real-world input utterances (e.g., smart speaker commands), particularly in Korean, can diverge from the expected input form, to the point that non-canonical utterances require actual comprehension on top of classification as a question or command. Moreover, as we discuss in the latter part of this work, we intend the extracted natural language terms to be re-usable as building blocks for efficient paraphrasing, following the approach of [Berant and Liang2014].

Recently, from a related perspective with a stronger emphasis on linguistic context, guidelines for identifying non-canonical natural language questions and commands have been suggested for Korean [Cho et al.2018a]. We build on this corpus for the initial dataset creation, and extend the dataset with additional human-annotated sentences.

Figure 2: A brief illustration of the labeling and annotation process. The lexicons on the right side denote the heads of the arguments (which go to the tail of a phrase in Korean). Multiple lists denote the rare cases where a question and a command co-exist. Strong requirements are explained later, since they depend on an empirical study and may not be a universal phenomenon.

3 Proposed Scheme

In this section, we describe the proposed annotation scheme along with the motivation of this work. As discussed in the first section, our goal is to propose guidelines for annotating conversational, non-canonical questions and commands. These forms appear frequently in everyday life, but unlike canonical input, extracting the core intent in an algorithmic manner is not straightforward. We suggest that a data-driven methodology be introduced for this task, which can be done by creating a corpus annotated with the core content of the utterances. In this paper, all of the example sentences and the proposed structured scheme are provided in English for demonstrative purposes. Although the corpus we annotate is in Korean, as we demonstrate throughout the paper, the method is expected to be applicable to other languages as well.

3.1 Identifying Directives

Identifying directive utterances is a fundamental part of this work. Thus, we first describe in more detail the corpus whose guidelines distinguish such utterances from non-directives such as fragments and statements [Cho et al.2018a].

For questions, interrogatives involving do-support (1a) or wh-movement (1b) were primarily considered. (Note that this does not hold for Korean, which is wh-in-situ; a more complicated, audio-aided identification is required in those cases, as in [Cho et al.2019a].) Questions in embedded form were also counted, possibly with predicates such as wonder (1c). A large number of declarative questions (1d) [Gunlogson2002] were also taken into account. Since the corpus utilized in both [Cho et al.2018a] and this annotation process does not contain punctuation marks, the final work was carried out on clear-cut questions selected by majority voting among the annotators, while removing utterances whose interpretation requires acoustic features. For all question types, those with a rhetorical tone (1e) were removed, since their discourse component usually does not function as an effective question set [Rohde2006].
(1) a. did I ever tell you about how
    b. how many points you got left on your license
    c. wonder where powell and carney are
    d. you going to attack me too
    e. why we always gotta do this
For commands, imperatives with a covert subject (2a) and those with modal verbs such as should (2b) were primarily counted. Requests in question form were also taken into account (2c,d). All of these types include prohibitions (2e). Conditionalized imperatives were considered commands only if the conditional clause does not negate the to-do-list, as in (2f) but not as in (2g). As in the question case, utterances with a rhetorical tone or usage (2h,i) were removed despite having an imperative structure [Han2000, Kaufmann2016]. All other types of utterances besides questions and commands were considered non-directive. (We aim to describe the utterance types that count as non-directive in other languages as well, even if a 1:1 mapping might not be possible through translation; we plan to publish an expansion of this work specific to English sentences, accompanied by sample corpora, as separate work.)
(2) a. well do something about it
    b. you should contact my administration
    c. why don’t you get undressed
    d. would you stay with me while i sleep a little
    e. don’t be in such a hurry
    f. let my daughter go or i’ll take you out
    g. shoot me if you can
    h. have a pleasant evening
    i. tell me that’s not the same guy

3.2 Extracting Intent Arguments

This section exhibits example annotations of intent arguments for non-canonical directives, as shown in Figure 2. We note again that while we describe the procedure with simplified English examples, the actual data and process involved significantly greater diversity and complexity.

3.2.1 Questions

For the three major question types, which we define as yes/no, alternative, and wh- (note that these are not syntactic categories here, but rather levels of speech act), we applied different annotation rules. For yes/no questions, we employ an if- clause which constrains the candidate answers to yes or no (3a). For alternative questions, we employ a whether - or to - clause accompanied by a list of possible answers (3b). For wh- questions, the extraction process starts with a lexicon that corresponds to the displayed wh- particle (3c,d). Notably, some alternative questions also show a format close to that of wh-questions, possibly with between corresponding to whether - or to - (3e).
(3) a. did I ever tell you about how
    → if the speaker told the addressee about the procedure
    b. you hungry or thirsty or both
    → whether the addressee is hungry or thirsty
    c. how many points you got
    → the number of points that the addressee got
    d. i want to know about treadstone
    → the information about treadstone
    e. you know which is hotter in hawaii or guam
    → the place that is hotter between hawaii and guam

3.2.2 Commands

Since the main intent of a command is analogous to a to-do-list, we annotated a structured list of actions the addressee may take. All of these lists start with a to-infinitive (4a), possibly not to for prohibitions (4b). During this process, non-content-related lexicons such as politeness markers (e.g., please) were not considered in the extraction (4c).
(4) a. i suggest that you ask your wife
    → to ask one's wife
    b. yeah but don’t pick me up
    → not to pick the speaker up
    c. please don’t tell my daddy
    → not to tell the speaker's daddy

3.2.3 Phrase Structure

As discussed above, the arguments of questions are transformed into an if- clause, a whether- clause, or a the- phrase. Following this logic, the arguments of commands are rewritten as either a to-clause or a not to-clause. Except for wh- questions and some alternative questions, all the (pseudo-)paraphrased sentences have more than one predicate, containing at least one verb.
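To make this mapping concrete, the following minimal sketch shows how the clause/phrase markers above could wrap an extracted core content. The function name and the template strings are illustrative assumptions for demonstration, not the released annotation tooling.

```python
# Argument templates for the clause/phrase markers of Sections 3.2.1-3.2.3.
# The templates are illustrative assumptions, not the authors' released code.
ARGUMENT_TEMPLATES = {
    "yes/no":      "if {content}",        # (3a) if the speaker told ...
    "alternative": "whether {content}",   # (3b) whether the addressee is hungry or thirsty
    "wh":          "the {content}",       # (3c,d) the number of points ...
    "requirement": "to {content}",        # (4a) to ask one's wife
    "prohibition": "not to {content}",    # (4b,c) not to pick the speaker up
}

def wrap_argument(utterance_type: str, content: str) -> str:
    """Wrap the extracted core content in the marker signaling the
    utterance type of the original directive."""
    return ARGUMENT_TEMPLATES[utterance_type].format(content=content)

print(wrap_argument("prohibition", "tell the speaker's daddy"))
# -> not to tell the speaker's daddy
```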

Note that, unlike in the English examples displayed above, in the Korean samples the components that determine the phrase structure are all placed at the end of the sentence, owing to head-finality. As discussed in the experiment analysis, this property sometimes appears to benefit automatic inference in an autoregressive setting.

3.2.4 Coreference

Coreference is a critical issue when extracting information from text. It appears frequently in conversational utterances, in the form of pronouns or anaphora. In the annotation process, we decided to preserve such lexicons, with the exception of I/we and you, since these refer to the dialog participants. The concepts corresponding to the two were replaced with either the speaker(s) or the addressee, as shown in (3a-c) and (4b,c), and in some cases with one(self) to make the result sound more natural (4a).
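As a rough illustration of this participant normalization, the sketch below applies simplified English substitution rules. The actual annotation was performed manually on Korean utterances, so the patterns here are assumptions for demonstration only.

```python
import re

# Simplified English rules for the speaker/addressee replacement of
# Section 3.2.4; the real process was manual annotation on Korean data.
PRONOUN_MAP = [
    (r"\b(i|we)\b", "the speaker"),
    (r"\bmy\b",     "the speaker's"),
    (r"\byou\b",    "the addressee"),
    (r"\byour\b",   "the addressee's"),
]

def normalize_participants(argument: str) -> str:
    for pattern, replacement in PRONOUN_MAP:
        argument = re.sub(pattern, replacement, argument, flags=re.IGNORECASE)
    return argument

print(normalize_participants("don't tell my daddy"))
# -> don't tell the speaker's daddy
```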

Type                     Corresponding expression    Korean
Questions
  Yes/no                 whether or not              -(in)ci, yepwu
  Alternative            what is/to do between       -lang -cwung -han/hal kes
  Wh- questions
    Who                  person, identity            sa-lam, ceng-chey
    What                 meaning                     uy-mi
    Where                location, place             wi-chi, cang-so
    When                 time, period, hour          si-kan, ki-kan, si-kak
    Why                  reason                      i-yu
    How                  method, measure             pang-pep, tay-chayk
Commands
  Prohibitions           Prohibition: not to -       -ci anh-ki
  Requirements           Requirement: to -           -(ha)-ki
  Strong Requirements    Requirement: to -           -(ha)-ki
Table 2: Structured annotation scheme for the Korean language; more details available in Cho et al. (2018b).

3.2.5 Spatial-Temporal and Subjective Factors

Unlike other question or command corpora, the proposed scheme includes content requiring an understanding of spatial (5a) and temporal (5b) dependencies. These factors relate to the coreference discussed in the previous section, in particular involving lexicons such as there and then. Also, since the dialog is non-task-oriented, the content may incorporate subjective information, such as the current thoughts of the speaker or the addressee. The proposed scheme does not ignore such factors in the intent argument (5c,d), ensuring that the core content is preserved.
(5) a. put your right foot there
    → to put the right foot there
    b. i i don’t want to see you tomorrow
    → not to meet tomorrow
    c. any ideas about the colour
    → the idea about the colour
    d. i think you ought to know what our chances are
    → to be aware about the speaker’s chances

4 Dataset Construction

4.1 Corpus Annotation

During the labeling and annotation process, we used the corpus constructed in [Cho et al.2018a], a Korean single-utterance corpus for identifying directives/non-directives that contains a wide variety of non-canonical directives. The tagging of questions and commands was performed by three native speakers, eventually resulting in an inter-annotator agreement (IAA) of κ = 0.85 [Fleiss1971].
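For reference, Fleiss' kappa over three annotators can be computed as in the following toy sketch, using the statsmodels implementation; the example counts are invented for illustration, not our data.

```python
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# Toy illustration of Fleiss' kappa (Fleiss, 1971). Each row is one
# utterance; the columns count how many of the three annotators chose
# each label (question / command / non-directive). Counts are invented.
ratings = np.array([
    [3, 0, 0],   # all three annotators agree: question
    [0, 3, 0],   # all agree: command
    [2, 1, 0],   # two say question, one says command
    [0, 0, 3],   # all agree: non-directive
])

print(f"kappa = {fleiss_kappa(ratings):.3f}")
```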

More relevant to this paper, in our previous work [Cho et al.2018b], an annotation guideline for the Korean language was proposed. The dataset that was created and verified contains about 30K directive utterances and their intent arguments. We want to emphasize that our task is not precisely an annotation task, but closer to a story generation or summarization task with lax constraints on the expected answer. Although the written natural language argument may not be identical across addressees, we hypothesize that there is a plausible semantic boundary for each utterance.

In Korean, due to head-finality, all of the structured expressions used to construct the phrase structure (Section 3.2.3) go to the end of the intent arguments (Table 2). However, from a cross-linguistic perspective, this does not necessarily change the role of the intent arguments. For example, the Korean sentence SENT = “mwe ha-ko siph-ni (what do you want to do)”, which has the intent argument ARG = ‘cheng-ca-ka ha-ko siph-un kes (the thing that the addressee wants to do)’, can be rewritten as SENT* = “ARG-i mwu-ess-ip-ni-kka”. Here, SENT* can be interpreted as “what is ARG” or “tell me about ARG”, where the core content ARG is not damaged in the translation process. Though demonstrated for only one pair of languages, this kind of rewriting suggests that the natural-language-formatted intent argument can be robust in preserving the purpose of the input directives. We claim that the constraints of our method, which uses nominalized and structured terms, help guarantee this. While it is difficult to prove that this holds for all possible languages or language pairs, we at least expect the assumption to hold between head-first and head-final languages.

Specific constraints when creating a Korean dataset are discussed in the two following sections.

4.1.1 Strong Requirements

The term strong requirement is not an established academic term, but was coined in [Cho et al.2018b] to describe a pattern observed in the corpus. Simply put, it is the co-existence of a prohibitive (PH) expression and a canonical requirement (REQ), as in the sentence “don’t go outside, just stay in the house”. Although the prohibitive expression comes immediately before the requirement, there is no guarantee that such a forbidding expression will be part of the core content of the final sentence. In such cases, simply expressing it as “just stay in the house” is a more concise form better suited for argument extraction, which yields the ideal final form: ‘to stay in the house’. In Korean, scrambling is common, so both [PH+REQ] and [REQ+PH] are valid orderings. In our work, we did not encounter cases where scrambling caused the utterance to be interpreted as a prohibition.
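A toy heuristic for this reduction might look as follows. The comma-based clause split and the don't test are deliberate oversimplifications of what the annotators did by hand; real utterances would require proper clause segmentation.

```python
# Illustrative heuristic for the strong-requirement case: when a
# prohibition immediately precedes a requirement ([PH + REQ]), only the
# requirement clause enters the intent argument. Splitting on a comma
# and testing for a leading "don't" are toy assumptions.
def reduce_strong_requirement(utterance: str) -> str:
    clauses = [c.strip() for c in utterance.split(",")]
    if len(clauses) == 2 and clauses[0].startswith("don't"):
        return clauses[1]  # drop the PH clause, keep the REQ clause
    return utterance

print(reduce_strong_requirement("don't go outside, just stay in the house"))
# -> just stay in the house
```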

4.1.2 Speaker/Addressee Notation

We consider the notation of coreference significant in this work. Subject omission is a common pattern in casual spoken Korean. This differs from English, where the agent and the experiencer are explicit. Intent arguments in Korean can thus be vague or implicit in denoting the speaker/addressee. For these reasons, to minimize ambiguity, we created two separate corpora: one with the speaker/addressee notation, and one without. In the former corpus, we classify all possible cases into one of five categories: only the speaker (hwa-ca), only the addressee (cheng-ca), both (hwa-ca-wa cheng-ca), none, and unknown. We believe this information will be beneficial both for disambiguation during analysis and for further research. As for the latter, while the orientation must be inferred from context, the expressions are closer to what one would encounter in everyday life. We also believe that such ambiguity, which introduces stronger context dependencies, is a crucial piece of future advancements in natural language understanding of high-context languages.

Intention  Type           Original  Augmented     Sum
Question   Yes/no Q          5,715          -    5,715
           Alternative Q       229      4,000    4,229
           Wh- Q            11,988      8,000   19,988
Command    Prohibition         478      4,000    4,478
           Requirement      12,302          -   12,302
           Strong REQ.         125      4,000    4,125
Total                       30,837     20,000   50,837
Table 3: The final composition of the dataset.

4.2 Corpus Augmentation

Above, we used an existing dataset to annotate intent arguments for question and command utterances. During this work, we found an imbalance in the dataset, specifically a shortage of data for some utterance types. Additionally, we concluded that the amount of parallel data was not large enough for wh-questions to be useful in real life, also taking into account that argument extraction from wh- questions involves abstracting the wh-related concept. To mitigate these issues, we increased the dataset size by obtaining various types of sentences from intent arguments, specifically via human-aided sentence rewriting.

First, alternative questions, prohibitions, and strong requirements needed additional data to ensure class balance across utterance types, or at least a sufficient count for automation. To this end, we manually wrote 400 intent arguments for each of the three types. In deciding the intent arguments, the topic of the sentences to be generated was also carefully considered: sentences were created at a 1:1:1:1:4 ratio for mail, schedule, house control, weather, and other free topics. This reflects the topic characteristics of the dataset used in Section 4.1, and its purpose is to build a corpus oriented toward the future advancement of smart agents.
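As a quick arithmetic check of this split, 400 arguments at a 1:1:1:1:4 ratio yield 50 arguments each for mail, schedule, house control, and weather, and 200 for free topics:

```python
# Worked example of the 1:1:1:1:4 topic split for the 400 manually
# written intent arguments per under-represented type.
topics = ["mail", "schedule", "house control", "weather", "free"]
weights = [1, 1, 1, 1, 4]
total = 400

unit = total // sum(weights)  # 400 / 8 = 50
counts = {t: w * unit for t, w in zip(topics, weights)}
print(counts)
# {'mail': 50, 'schedule': 50, 'house control': 50, 'weather': 50, 'free': 200}
```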

For the second goal, regarding wh-questions, 800 intent arguments were constructed. The topics considered in this process are identical to the above. However, the wh-particles that allow a natural transformation between wh-particles and wh-related terms, as can occur in wh-questions, were not allowed. This means the intent arguments were created so that they expose only the nominalized format and not the wh-particles, e.g., the weather of tomorrow rather than what the weather is like tomorrow. The same principle was applied when constructing the additional phrases for the alternative questions above.

With the 2,000 arguments constructed through the approach discussed above, we requested participants to write ten utterances per phrase, as diverse as possible (the detailed guideline is to be published as a separate article). The paraphrasing process resulted in a total of 20,000 argument-directive pairs constructed from the 2,000 arguments. Examples of the various question and command expressions obtained for a phrase include, e.g.,
Argument: The most important concept in algebra
Topic: Free, Type: wh- question
just pick me one the most important concept in algebra
what do you think the core concept in algebra is
which concept is the most important in algebra
what should i remember among various concepts in algebra (various versions in Korean)
The composition of the entire dataset, including the data created by augmenting the original data, is shown in Table 3. We balanced the ratio between utterance types so that common utterances which were not statistically well represented in the corpus had enough training samples. Additionally, we increased the absolute count of wh-questions, where our approach can prove most effective. As a result, the class imbalance, which was problematic at the outset, has been partially resolved.

5 Experiments

5.1 Format

The final format of the corpus is as follows:
  [Utterance #] [Label] [Sentence] [Argument]

Here, the label denotes one of the six utterance types described in Section 4.1, and the utterance and intent argument are in raw text form. As stated in Section 4.1.2, there are two versions of the corpus: with and without the speaker/addressee notation. Both are to be distributed online, but only the latter is utilized in the experiments and is currently available online (https://github.com/warnikchow/sae4k).
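A minimal loader for this format might look like the sketch below. The tab delimiter, column order, absence of a header row, and file name are assumptions to be checked against the repository README.

```python
import csv

# Minimal loader sketch for the corpus format above, assuming one
# tab-separated line per instance in the order
# [utterance #, label, sentence, argument], with no header row.
def load_corpus(path: str):
    rows = []
    with open(path, encoding="utf-8") as f:
        for idx, label, sentence, argument in csv.reader(f, delimiter="\t"):
            rows.append({"id": int(idx), "label": int(label),
                         "sentence": sentence, "argument": argument})
    return rows

corpus = load_corpus("sae4k_v1.txt")  # hypothetical file name
print(corpus[0]["sentence"], "->", corpus[0]["argument"])
```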

In the experiment utilizing the seq2seq approach [Sutskever et al.2014], we aim to infer the intent argument directly rather than identifying the label, giving the sentence as input and the argument as output. Moreover, correct inference of the intent argument is not independent of the identification of the exact utterance type. Thus, we need metrics for both classification and generation, which are discussed in the Evaluation section.

5.2 Automation

Although the volume may not be large enough for full automation, we experimented with the corpus to observe how the proposed scheme works. We implemented a recurrent neural network (RNN)-based seq2seq model with attention [Cho et al.2014, Luong et al.2015] and a Transformer [Vaswani et al.2017]. Due to the agglutinative nature of Korean, morpheme-level tokenization was done with Mecab (https://bitbucket.org/eunjeon/mecab-ko-dic/src/master/) via the KoNLPy [Park and Cho2014] Python wrapper.

For the RNN seq2seq with attention, which utilized a morpheme sequence of maximum length 25, the hidden layer width and dropout rate [Srivastava et al.2014] were set to 256 and 0.1, respectively. Training was stopped after 100,000 iterations, just before the training loss began to increase.

For the Transformer, which adopts a much more compact configuration than the original paper [Vaswani et al.2017], the maximum morpheme sequence length was also set to 25, with a hidden layer width of 512 and a dropout rate of 0.5. Additionally, the number of attention heads was set to 4, and a total of two layers were stacked, considering the size of the training data.
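For concreteness, the reduced Transformer configuration could be instantiated roughly as follows in PyTorch. The hyperparameters follow the text; everything not stated in the paper (e.g., the feed-forward width) is an assumption, and this is a sketch rather than the authors' actual implementation.

```python
import torch.nn as nn

# Sketch of the reduced Transformer configuration described above:
# 2 layers, 4 attention heads, width 512, dropout 0.5, max length 25.
model = nn.Transformer(
    d_model=512,           # hidden layer width, as stated in the text
    nhead=4,               # attention heads, as stated in the text
    num_encoder_layers=2,  # two stacked layers, as stated in the text
    num_decoder_layers=2,
    dim_feedforward=512,   # assumption: not specified in the paper
    dropout=0.5,
)
MAX_LEN = 25  # maximum morpheme-sequence length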

5.3 Evaluation

As in many other translation or generation tasks, the most contentious part of the implementation is probably the evaluation measure. Considering that paraphrasing is a monolingual translation, several candidate answers can be felicitous for a given input utterance. That is, the same phrase can be expressed in various similar ways without harming the core content.

Such flexibility points to the differing viewpoints of translation/paraphrasing/summarization versus open-ended generation. Neither kind of task has a single exact answer, but for the former, at least a rough boundary exists regarding how tolerable an output is. In our task, which is close to the former, the answer has to be a formatted expression. However, if we use only BLEU [Papineni et al.2002] or ROUGE [Lin2004] as a measure, expression diversity can yield a poor evaluation result even when the output is semantically tolerable. Also, in the corpus construction, we explicitly set formats for the different utterance types, which requires correct identification of the speech act and can therefore largely influence the accurate inference of an argument.

In this regard, we first surveyed proper evaluation methods for automatic and quantitative analysis of the results. Part of our conclusion is that automatic analysis of semantic similarity can be performed by adapting the recent BERT-based scoring system (https://github.com/Tiiiger/bert_score) [Zhang et al.2019]. Such an approach can be adopted regardless of whether the label is correctly inferred, and it reflects the common-sense knowledge inherited from pre-trained language models. Moreover, when the label is correct, format-related tokens (e.g., the method, whether, not to) in the output overlap with those in the gold data, so lexical similarity can also be taken into account, possibly as an extra point; it can be represented by ROUGE against the gold standard.

For a fair evaluation, we decided to aggregate both kinds of evaluation values. The final score was obtained by averaging the two results, namely ROUGE-1 and BERTScore. This prevents the case where a format difference caused by a wrong label leads to a wrong judgment based on lexical features alone.
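The aggregate metric can be sketched as below, pairing a simple character-level ROUGE-1 re-implementation with the BERTScore package; the exact ROUGE variant and the averaging details are assumptions on our part.

```python
from collections import Counter
from bert_score import score as bert_score

# Sketch of the final metric: the average of character-level ROUGE-1
# and BERTScore. The character-level ROUGE-1 here is a simple unigram-F1
# re-implementation, an assumed approximation of the variant used.
def rouge1_char_f1(hyp: str, ref: str) -> float:
    hyp_c, ref_c = Counter(hyp.replace(" ", "")), Counter(ref.replace(" ", ""))
    overlap = sum((hyp_c & ref_c).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(hyp_c.values()), overlap / sum(ref_c.values())
    return 2 * p * r / (p + r)

def final_score(hyps, refs):
    _, _, f_bert = bert_score(hyps, refs, lang="ko")  # BERTScore F1 per pair
    rouge = [rouge1_char_f1(h, r) for h, r in zip(hyps, refs)]
    return (sum(rouge) / len(rouge) + float(f_bert.mean())) / 2
```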

5.4 Result

The validation results are shown in Table 4. For clarity, we report both BERTScore and ROUGE-1. Note that for ROUGE-1, character-level comparison was used, regardless of the tokenizer adopted in training and inference.

             RNN s2s + Attention   Transformer   Transformer
Test split   9:1                   9:1           7:3
Iterations   100,000               10,000        10,000
ROUGE-1      0.5335                0.5732        0.5383
BERTScore    0.7693                0.9724        0.8601
Total        0.6514                0.7728        0.6992
Table 4: Validation result with the test set.

The results show the advantages of (a) adopting the Transformer [Vaswani et al.2017] and (b) setting aside a larger volume of data for training. (a) is evident from both ROUGE-1 and BERTScore: the Transformer outperforms the RNN at the same split, and even at the 7:3 split with fewer iterations. (b) is observed between the two Transformer models. The main reason for the difference is assumed to be out-of-vocabulary (OOV) terms in the test set, which confuse the system at inference time and lead to the decoding of unrelated terms.

Although the numerical values concern only the quantitative analysis, we can check the validity of each model with the output for a test utterance fed as a common input. For example, from the original sentence:
(6) “저번처럼 가지 말고 백화점 세일은 미리 가서 대기하렴” / “This time, please go to the department store earlier (than its opening time) and wait there for the upcoming sale event”
the following outputs are obtained from each model:
(6) a. RNN seq2seq with attention: 백화점 가 미리 가 서 대기 대기 대기 대기 대기 대기 대기 대기 대기 대기 대기 대기 / department store, go earlier (than its opening time), and wait wait wait wait wait wait wait wait wait wait wait wait
    b. Transformer (split 9:1): 백화점 세일 은 미리 가 서 대기 하 기 / to go to the department store earlier (than its opening time) and wait for the sale event
    c. Transformer (split 7:3): 백화점 가 서 미리 가 서 도와 주 기 / to go to the department store earlier (than its opening time) and help (something)
Considering that the given utterance (6) is a strong requirement, i.e., a series of a (less meaningful) prohibition and a (substantial) requirement, it is encouraging that all three models succeeded in placing the department store (백화점, payk-hwa-cem) at the very beginning of the output, ignoring the prohibition in the first clause. However, note that in (6a) the well-known degeneration into word repetition occurred with the RNN model, while the two Transformer models cope with it and find the right place to finish the inference. This matters for matching the speech act type correctly, especially in a head-final language such as Korean, since stably predicting the accurate tail of the phrase is not guaranteed in auto-regressive inference.

Moreover, comparing (6b) and (6c), where the tails of the clauses (regarding speech act) were correctly inferred, the latter fails to choose the lexicon for wait, instead picking up help, which may have been trained in strong correlation with terms such as go earlier. Here, it is also assumed that a loanword such as sale (세일, seyil), expected to be OOV at test time, might have caused the failure in (6c).

The gold standard for (6) is ‘백화점 세일은 미리 가서 대기하기, to go to the department store earlier and wait for the sale event’, which is identical to (6b) once the decomposed morphemes are merged. This suggests that the self-attention-based architecture and the supplemented dataset both contribute to stable inference. Below are more samples from the Transformer model, covering some tricky input sentences (7-8) and wh- questions (9-10). Note that punctuation marks are removed from all input sentences, and the output phrases were not polished, so as to show their original shape.
(7) “수영을 온천에서는 하면 안됩니다” / “it is prohibited to swim in an onsen”
    → 온천 에서 수영 하 지 않 기 / not to swim in an onsen
(8) “박사 졸업과 결혼 준비를 비교한다면 어떤게 더 지옥같아” / “which is more hell if you compare your phd with your wedding preparation”
    → 박사 졸업 과 결혼 준비 중 더 힘들 었 던 것 / the tougher process (for the addressee) between getting a phd and preparing a wedding
(9) “오늘 눈이 얼마나 오니” / “how much does it snow today”
    → 오늘 강설량 / the amount of today’s snowfall
(10) “몇 도 기준으로 열대야라고 해” / “from what temperature is it called a tropical night”
    → 열대야 기준 온도 / the reference temperature of a tropical night
We expect formalizations such as (7) to be useful for real-life commands to social robots, and (8) to be meaningful if smart agents become more human-like, though that is a distant prospect. Also, as in the two wh-questions (9-10), the nominalization of wh-related features may help NLU modules efficiently obtain answers to information-seeking questions that are not in canonical form. Not all the results were good, especially for some intonation-dependent utterances (11) and the most challenging ones incorporating various OOV/loanwords (12).
(11) “꼭 대학원을 가야겠어” / “should you go to grad school”
    → 대학원 진학 하 기 / to go to grad school
(12) “인터파크 스팸차단했니” / “did you ban the mails from interpark”
    → 인터 파크 티켓 차단 여부 / if the addressee banned the tickets from interpark (an online shopping mall in Korea)
Building on these preliminary results, we aim to construct a more reliable extraction system, whose main feature will be the utilization of a pre-trained language model that can compensate for the deficit of training data and the appearance of OOVs. Content-preserving and controllable sentence generation also look to be promising strategies that fit the core of our task.

6 Application

Since the proposed approach concerns formal semantics and the task domain is not specified, we expect our study to be meaningful for general AI that talks with human beings without making users feel isolated. Recalling that humans, too, react differently to directive and non-directive utterances, our two-way approach makes sense. In non-task-oriented dialog as well, our scheme may be useful for avoiding inadvertently ignoring the users' will.

Beyond its application to spoken language understanding (SLU) modules within smart agents, our approach can be utilized for building paraphrase corpora or supporting semantic web search. Along with the boosted performance of recent text generation and reconstruction algorithms, we expect larger versions of the dataset to be constructed and utilized with real-life personal agents.

7 Conclusion

The significance of this research lies in establishing a creation and augmentation methodology for the summarization and paraphrasing of less explored sentence units, and in distributing the results. In this paper, only dataset acquisition and its application to directive utterances are presented, but implementing automatic question/command generation and sentence similarity tests using this concept is also possible. Besides, we have shown a baseline system that automatically extracts intent arguments from non-canonical Korean questions/commands utilizing the constructed dataset and some up-to-date architectures, implying that the methodology is practically meaningful. In future work, we plan to extend this typologically by showing that the annotation/generation scheme is applicable to other languages. We hope that the released annotation scheme and datasets spur active research on automatic key-phrase/argument extraction in Korean natural language processing (NLP) and other low-resource languages.

8 Acknowledgements

This research was supported by Projects for Research and Development of Police science and Technology under Center for Research and Development of Police science and Technology and Korean National Police Agency funded by the Ministry of Science, ICT and Future Planning (PA-J000001-2017-101). Also, this work was supported by the Technology Innovation Program (10076583, Development of free-running speech recognition technologies for embedded robot system) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).

The corpus construction would not have been possible without the help of eight great participants, namely Eunah Koh, Kyung Seo Ki, Sang Hyun Kim, Kimin Ryu, Dongho Lee, Yoon Kyung Lee, Minhwa Chung, and Ye Seul Jung. Also, the authors appreciate Siyeon Natalie Park for suggesting a great idea for the title. Finally, we appreciate the helpful advice provided by Reinald Kim Amplayo, Jong In Kim, Jio Chung, and Kyuwhan Lee.

9 Bibliographical References
