Speech Intention Understanding in a Head-final Language: A Disambiguation Utilizing Intonation-dependency

11/10/2018 ∙ by Won Ik Cho, et al.

For a large portion of real-life utterances, the intention cannot be decided by semantics or syntax alone. Although not all socio-linguistic and pragmatic information can be digitized, phonetic features are at least indispensable in understanding the spoken language. Especially in head-final languages such as Korean, sentence-final intonation has great importance in identifying the speaker's intention. This paper suggests a system which identifies the intention of an utterance, given its acoustic feature and text. The proposed multi-stage classification system decides whether a given utterance is a fragment, statement, question, command, or a rhetorical one, utilizing the intonation-dependency that comes from head-finality. Based on an intuitive understanding of the Korean language, which informed the data annotation, we construct a network identifying the intention of a speech and validate its utility with sample sentences. The system, if combined with a speech recognizer, is expected to be flexibly inserted into various language understanding modules.







1 Introduction

Understanding the intention of a speech involves all aspects of phonetics, semantics, and syntax. For example, even when an utterance is given a syntactic structure of declarative/interrogative/imperative form, the speech act may differ depending on semantics and pragmatics [1]. Besides, phonetic features such as prosody can influence the actual intention, which may differ from the illocutionary act that we catch at first glance [2].

In text data, punctuation plays a dominant role in conveying such phonetic features. But nowadays, when artificial intelligence (AI) agents with microphones are widely used in daily life, such linguistic features can be inadvertently omitted. Although a transcribed sentence from the speech recognition module of the system may be punctuated using a language model, it is more accurate to consider the effect of the acoustic data in determining an intention.

In this study, the language investigated is Korean, a representative language with head-final syntax. Korean is agglutinative and morphologically rich (as is Japanese), and most importantly, the intention of a sentence is significantly influenced by the phonetic properties of its sentence enders [3]. Consider the following sentence, whose meaning differs with the sentence-final intonation:
(1) 천천히 가고 있어

chen-chen-hi ka-ko iss-e

slowly go-PROG be-SE (SE denotes underspecified sentence enders: final particles whose roles vary.)
With a high rise intonation, this sentence becomes a question (Are you going slowly?); given a fall or fall-rise intonation, it becomes a statement ((I am) going slowly.); and given a low rise or level intonation, it becomes a command (Go slowly.). This phenomenon partially originates in particular constituents of Korean utterances, such as the multi-functional particle ‘-어 (-e)’ and other sentence enders that determine the sentence type [4].
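The intonation-to-intention mapping of sentence (1) described above can be written down as a simple lookup; this is illustrative only, not part of the proposed system:

```python
# Mapping for sentence (1) only, taken directly from the example above.
INTENTION_BY_INTONATION = {
    "high rise": "question",   # "Are you going slowly?"
    "fall": "statement",       # "(I am) going slowly."
    "fall-rise": "statement",
    "low rise": "command",     # "Go slowly."
    "level": "command",
}

def disambiguate(intonation: str) -> str:
    """Return the intention of sentence (1) given its final intonation."""
    return INTENTION_BY_INTONATION[intonation]
```

The same transcript thus yields three distinct intentions depending only on the final contour, which is exactly the ambiguity the proposed system targets.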

In this paper, we propose a partially improvable multi-stage system that identifies the user intention of spoken Korean, with a disambiguation via additional information on intonation. The system classifies utterances into six categories: fragment, statement, question, command, rhetorical question, and rhetorical command. Since the system does not incorporate a speech recognition module, it receives the acoustic feature and transcript (assumed perfectly transcribed) of an utterance and infers the intention, as in [5]. To this end, 7,000 speech utterances were manually tagged with intonation, and a total of 57,427 lines of text utterances were semi-automatically collected or generated, including 19,318 manually tagged lines. In the following section, we take a brief look at the literature on intention classification. In Sections 3 and 4, the architecture of the proposed system is described with a detailed implementation scheme. Afterward, the system is evaluated quantitatively and qualitatively with the test set.

2 Related Work

The most important among the various areas related to this paper is the study of the intention of a sentence. Unlike the syntactic concept of sentence form presented in [6], intention has been studied pragmatically in the areas of illocutionary acts and speech acts [7, 1]. It is also closely related to situation entity types [8] and speech intention [5], concepts by which this study has been influenced.

2.1 Sentence Types of Korean

There is no doubt that this research can be extended cross-lingually, but for the data annotation, research on the sentence types of Korean was essential. Although a large portion of the annotation process depended on the intuition of the annotators, we referred to various studies on syntax-semantics and speech acts [9, 10, 11] to handle complicated cases regarding optatives, permissives, promisives, request-suggestions, and rhetorical questions.

2.2 Intonation-related Studies

Studies on intonation and sentence types are crucial to this work. However, data-driven approaches to the relationship between intention and intonation have rarely been taken, especially for syllable-timed languages such as Korean. Therefore, to build the dataset, we referred to a guideline on labeling the intonation of the Korean language [12], in which the relationship between various sentence-final intonations and sentence forms is described. The task of associating that relationship with an intention is done in this paper.

3 System Concept

The proposed system incorporates two modules: (1) a module classifying the utterances into fragments (FR), five clear-cut cases (CCs), and intonation-dependent utterances (IU), and (2) a module for the identification of IUs (Fig. 1).
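The two-stage control flow above can be sketched as follows; `classify_fci`, `classify_intonation`, and `identify_iu_intention` are hypothetical stand-ins for the trained modules, not names from the paper:

```python
def classify_intention(transcript, acoustic_feature,
                       classify_fci, classify_intonation,
                       identify_iu_intention):
    """Two-stage flow: decide FR / CC / IU from the text alone, and
    consult the acoustic feature only for intonation-dependent input."""
    label = classify_fci(transcript)
    if label != "intonation-dependent":
        # Fragments and the five clear-cut cases need no acoustic input.
        return label
    # Only IUs reach the intonation classifier and the aided identifier.
    intonation = classify_intonation(acoustic_feature)
    return identify_iu_intention(transcript, intonation)
```

The point of the split is that the (comparatively expensive) acoustic path is exercised only for the minority of utterances whose text alone underdetermines the intention.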

3.1 FCI module: Distinguishing FCI

Fragments (FR): From a linguistic point of view, fragments often refer to single noun/verb phrases where ellipsis has occurred [13]. However, in this study, we also included incomplete sentences whose intention is underspecified. If the input sentence is not a fragment, it was assumed to belong to the clear-cut cases or to be an intonation-dependent utterance. (There were also context-dependent cases where the intention was hard to decide even given the intonation, but their portion was tiny.)
Clear-cut cases (CCs): Clear-cut cases incorporate utterances of five categories: statement, question, command, rhetorical question, and rhetorical command, as detailed in the annotation guideline. Briefly, questions are utterances that require the addressee to answer, and commands are ones that require the addressee to act. Even if the sentence form is declarative, words such as wonder or should can make the sentence a question or a command. Statements are descriptive sentences that fit neither case.

Rhetorical questions (RQ) are questions that do not require an answer [14] because the answer is already in the speaker's mind. Similarly, rhetorical commands (RC) are idiomatic expressions in which the imperative structure does not convey a mandatory to-do item (e.g., Have a nice day). The sentences in these categories are functionally similar to statements but were tagged separately since they usually show a non-neutral tone.
Intonation-dependent utterances (IU): With the decision criteria for the clear-cut cases, we investigated whether the intention of a given sentence depends on intonation, which is affected by both the final particle and the content. There have been studies on the speech act or topic of a sentence that dealt with final particles and adverbs [15]. However, to the best of our knowledge, there has been no explicit guideline for a text-based identification of Korean utterances whose interpretation is influenced by intonation. Thus, we set some principles for the annotation, referring to the maxims of conversation [16] (e.g., do not consider sentences containing too specific information as questions), and provided them to the annotators (https://drive.google.com/open?id=1AvxzEHr7wccMw7LYh0J3Xbx5GLFfcvMW).

Figure 1: A brief illustration of the structure of the proposed system. The manually generated text is given as the transcript.

3.2 IU module: Identifying IU

For the identification of the intonation types and the decision of intention, we constructed two submodules: an intonation classifier and an intonation-aided intention identifier.
Intonation classifier: Based on an observation of 7,000 Seoul Korean utterances from the speech corpus utilized in [17], five types of intonation were taken into account, as a simple modification of the empirical methods explained in [12]. The intonation types were High rise (HR; 1,683), Low rise (LR; 169), Fall rise (FR; 428), Level (LV; 996), and Fall (FL; 3,724), with the numbers indicating the volume of the instances. The labels were tagged by three Seoul Korean natives, referring to a real-time spectrogram (https://www.foobar2000.org/) and especially considering the onset pitches of the last three syllables. To deal with the syllable-timedness of Korean, the classifier utilized a mel spectrogram augmented with an energy contour (MS+E) that emphasizes the syllable onsets. Consistent with this hypothesis, MS+E outperformed a simple mel spectrogram and various hand-crafted features, with a fixed network.
Intonation-aided intention identifier: To utilize the intonation information obtained as the output of the previous module, an additional dataset adopting two inputs (namely, text and intonation label) was constructed based on the collected IUs. Given a single transcript of an utterance, multiple intonation types were allowed for one intention (e.g., for sentence (1), HR corresponds with question, FR/FL with statement, and LR/LV with command) but not vice versa. We obtained 4,380 tuples for the identification. The architecture is described in the following section.

Primary decision                          Tagged     Total
(for FCI module)
  Fragments                                  384     4,584
  Clear-cut cases
    statements                             8,032    17,921
    questions                              3,563    17,688
    commands                               4,571    12,700
    rhetorical questions                     613     1,643
    rhetorical commands                      572     1,046
(for IU module)
  Intonation-dependent utterances          1,583     1,845
Total                                     19,318    57,427
Table 1: The composition of the corpus (left: manually tagged lines; right: total lines after supplementation).

4 Experiment

4.1 Corpora

To cover a variety of topics, the utterances used for training and validation were collected from (i) the corpus provided by the Seoul National University Speech Language Processing Lab (http://slp.snu.ac.kr/), (ii) a set of frequently used words released by the National Institute of Korean Language (https://www.korean.go.kr/), and (iii) manually created questions/commands.

From (i), which contains short utterances on topics such as e-mail, housework, weather, transportation, and stocks, 20K utterances were randomly chosen, and three Seoul Korean L1 speakers annotated them into seven categories: FR, IU, and the five CC categories. The annotators were well educated on the guideline and thoroughly debated the conflicts that occurred during the annotation process. The resulting inter-annotator agreement (IAA) was κ = 0.85 [18], and the final decision was made by majority voting. The composition is stated in Table 1.
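For reference, the kappa statistic of [18] can be computed from a matrix of per-item category counts; a self-contained sketch (not the tooling actually used for the annotation):

```python
import numpy as np

def fleiss_kappa(ratings):
    """Fleiss' kappa for an (n_items, n_categories) matrix of counts,
    where each row sums to the (constant) number of raters."""
    ratings = np.asarray(ratings, dtype=float)
    n_raters = ratings.sum(axis=1)[0]
    # Per-item agreement P_i, then the mean observed agreement P-bar.
    p_i = ((ratings ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category proportions.
    p_j = ratings.sum(axis=0) / ratings.sum()
    p_e = (p_j ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)
```

With three annotators, each row of the input holds how many of the three chose each of the seven categories for one utterance.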

Taking into account the shortage of utterances, (i)-(iii) were utilized for data supplementation. (i) contained various neutral statements and rhetorical utterances. From (ii), single nouns were collected and augmented into FR. The utterances in (iii), which were generated on purpose, were augmented directly into question and command. The composition of the final corpus is also stated in Table 1.

4.2 Implementation

The system was implemented with Librosa (https://github.com/librosa/librosa), fastText (https://pypi.org/project/fasttext/), and Keras [19], which were used for the extraction of acoustic features, character vector embedding, and building the neural network models, respectively. As stated earlier, the mel spectrogram and energy were utilized as the acoustic feature. For the text feature, character vectors obtained via the skip-gram [20] of fastText [21] were utilized, for the richness of characters and to avoid using a morphological analyzer.

The system architecture can be roughly described as a combination of a convolutional neural network (CNN) [22, 23] and a bidirectional long short-term memory network with self-attention (BiLSTM-SA) [24, 25]. First, the FCI module was constructed using solely BiLSTM-SA (context vector dim: 64). The CNN was good at capturing the syntactic distinctions that come from the length of the utterances or the presence of specific sentence enders, but it was not effective in handling the scrambling of Korean, worsening the performance of the concatenated network. The architecture that concatenates the CNN and BiLSTM-SA was utilized for the intonation classifier, in the sense that the identification of intonation types concerns shape-related properties (e.g., of the mel spectrogram). For the intention identifier, a modified attention network was used. In detail, the output of a multi-layer perceptron (MLP), which takes the one-hot encoded intonation label as input, is column-wise multiplied with the hidden layer sequence of the BiLSTM, and the resulting weighted sum is used for multi-class classification. The neural network structure is described in Fig. 2.
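The column-wise multiplication and weighted sum described above can be expressed in a few lines of NumPy; the dimensions here are illustrative assumptions (T time steps, d hidden units), not the paper's exact sizes:

```python
import numpy as np

def intonation_aided_attention(hidden_states, attention_weights):
    """hidden_states: (T, d) BiLSTM hidden sequence.
    attention_weights: (T,) output of the MLP fed with the one-hot
    intonation label (assumed softmax-normalized over time).
    Returns the weighted sum over time, a (d,) context vector."""
    weighted = hidden_states * attention_weights[:, None]  # column-wise product
    return weighted.sum(axis=0)                            # weighted sum over T
```

The intonation label thus modulates which time steps of the character sequence dominate the final classification, rather than being concatenated as a plain extra feature.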

Figure 2: Neural network architecture for IU identification. Column-wise multiplication and a weighted sum are conducted over the character BiLSTM hidden layers given the IU input.

In brief, the FCI module adopts a self-attentive char-BiLSTM (acc: 0.88, F1: 0.75), and the intention identifier adopts an attention-based char-BiLSTM (acc: 0.90, F1: 0.80) aided by the intonation classifier (acc: 0.77, F1: 0.46). For all the modules, the dataset was split into training and test sets at a ratio of 9:1, and class weights were taken into account during training to address the imbalance in the volume of each utterance type. The model specification is stated in Table 2.
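One common way to derive such class weights is the "balanced" heuristic (n_samples / (n_classes × n_c)); the exact weighting scheme used in the paper is not specified, so this is a sketch using the totals from Table 1:

```python
def balanced_class_weights(counts):
    """'Balanced' class weights: rarer classes get proportionally
    larger weights, so each class contributes equally to the loss."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}

# Total line counts per category, from Table 1.
COUNTS = {"fragment": 4584, "statement": 17921, "question": 17688,
          "command": 12700, "rhetorical question": 1643,
          "rhetorical command": 1046, "intonation-dependent": 1845}
weights = balanced_class_weights(COUNTS)
```

Under this scheme the rare rhetorical classes weigh an order of magnitude more than statements, which is the intended compensation for the imbalance noted above.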

4.3 Evaluation and Discussion

Arithmetically, our model yields a total accuracy of 0.87, assuming correct ASR and taking into account the difference in accuracy between IUs and non-IUs. However, to investigate the practical reliability of the whole system, we separately constructed a test set of 1,000 challenging utterances, given depunctuated drama lines, recorded audio [17], and manually tagged six-class labels (FR and the five CCs). On this test speech, our system recorded an accuracy of 0.65.

The analysis of the test results suggests that the system is relatively weak at identifying FRs correctly. Short nouns or noun phrases were well identified as FR, but some long predicates or incomplete sentences were identified as CC. For IUs, utterances with frequently appearing sentence enders were identified as IU even when the content or length of the utterance made an IU reading awkward. In short, FR identification tends toward false negatives and IU identification toward false positives, implying a shortage of such utterances that should be supplemented consistently.

Once an utterance was classified as IU, intonation-aided inference was quite successful for questions and statements, but the system sometimes confused commands with statements due to their similar prosody containing descending factors. The performance degradation also partially originated in the unexpected prosody of the voice actors. In contrast, the system performed well on the inference of CCs, except for some cases regarding RCs/RQs that are sometimes difficult even for humans.

Despite some weak points displayed in the FCI/IU modules, the proposed system demonstrates a novel methodology for disambiguation via intonation-dependency. The system is expected to be cross-lingually extensible not only to head-final languages such as Japanese, but also to non-head-final ones such as English, considering confusing utterances such as declarative questions.

The proposed system can be compared to a multimodal system recently suggested for English [5], which showed an accuracy of 0.83 on a test set split from its corpus. Such systems are doubtless easier to train and may be accurate in the sense that fewer human factors are engaged. Nevertheless, our approach is meaningful for cases where labeled speech data are scarce; the whole system can be partially improved by augmenting additional text or speech. Moreover, the efficiency of the proposed system lies in utilizing the acoustic data only for the text that requires additional intonation information, avoiding unnecessary computation and preventing confusion from an anomalous prosody of the user.

CNN     Input size (single channel)    (50, 100, 1)
        # Filters                      32
        Conv layer                     (3, 100)
        Max pooling                    (2, 1)
        # Conv layers                  2
BiLSTM  Input size                     (50, 100)
        Hidden layer nodes             32
MLP     Hidden layer nodes             128
Others  Optimizer                      Adam (0.0005)
        Batch size                     128
        Dropout                        0.3 (for MLPs)
        Activations                    ReLU (MLPs, CNN); Softmax (attention, output)
Table 2: Architecture specification. The character sequence length was set to 50, considering the utterance lengths. Taking into account the head-finality of Korean, the last 50 syllables were utilized, including spaces, to incorporate information on segmentation.
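The last-50-character truncation described in the Table 2 caption can be sketched as follows; the padding symbol for short inputs is an assumption:

```python
def tail_characters(utterance: str, max_len: int = 50, pad: str = "_") -> str:
    """Keep the LAST max_len characters (spaces included), since the
    intention-bearing sentence enders of Korean sit sentence-finally;
    shorter inputs are left-padded to a fixed length."""
    tail = utterance[-max_len:]
    return pad * (max_len - len(tail)) + tail
```

Truncating from the front rather than the back preserves the sentence enders, which Table 2's caption identifies as the crucial region under head-finality.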

5 Conclusion

In this paper, we proposed a multi-stage system for the identification of speech intention. The system first checks whether the speech is a fragment or has a clearly determinable intention; if neither, it conducts an intonation-aided decision, associating the underspecified utterance with its true intention. For data-driven training of the modules, 7K speech and 57K text data were collected or manually tagged, yielding an arithmetic system accuracy of 0.87 and a practical counterpart of 0.65 on an additionally constructed challenging test set. A possible application of the proposed system is the SLU module of smart agents, especially ones targeting free-style conversation with humans. Our future work aims to extend the IU module into a multimodal system, which can be made reliable only by building a large-scale and accurately tagged speech DB.


  • [1] Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer, “Dialogue act modeling for automatic tagging and recognition of conversational speech,” Computational linguistics, vol. 26, no. 3, pp. 339–373, 2000.
  • [2] Atissa Banuazizi and Cassandre Creswell, “Is that a real question? final rises, final falls, and discourse function in yes-no question intonation,” CLS, vol. 35, pp. 1–14, 1999.
  • [3] Mary Shin Kim, “Evidentiality in achieving entitlement, objectivity, and detachment in korean conversation,” Discourse Studies, vol. 7, no. 1, pp. 87–108, 2005.
  • [4] Miok D Pak, “Types of clauses and sentence end particles in korean,” Korean Linguistics, vol. 14, no. 1, pp. 113–156, 2008.
  • [5] Yue Gu, Xinyu Li, Shuhong Chen, Jianyu Zhang, and Ivan Marsic, “Speech intention classification with multimodal deep learning,” in Canadian Conference on Artificial Intelligence. Springer, 2017, pp. 260–271.
  • [6] Jerrold M Sadock and Arnold M Zwicky, “Speech act distinctions in syntax,” Language typology and syntactic description, vol. 1, pp. 155–196, 1985.
  • [7] John R Searle, “A classification of illocutionary acts,” Language in society, vol. 5, no. 1, pp. 1–23, 1976.
  • [8] Annemarie Friedrich, Alexis Palmer, and Manfred Pinkal, “Situation entity types: automatic classification of clause-level aspect,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, vol. 1, pp. 1757–1768.
  • [9] Chung-hye Han, The structure and interpretation of imperatives: mood and force in Universal Grammar, Psychology Press, 2000.
  • [10] Miok Pak, “Jussive clauses and agreement of sentence final particles in korean,” Japanese/Korean Linguistics, vol. 14, pp. 295–306, 2006.
  • [11] Saetbyol Seo, The syntax of jussives: speaker and hearer at the syntax-discourse interface, Ph.D. thesis, Seoul National University, 2017.
  • [12] Sun-Ah Jun, “K-tobi (korean tobi) labelling conventions,” ms, Version, vol. 3, 2000.
  • [13] Jason Merchant, “Fragments and ellipsis,” Linguistics and philosophy, vol. 27, no. 6, pp. 661–738, 2005.
  • [14] Hannah Rohde, “Rhetorical questions as redundant interrogatives,” 2006.
  • [15] Jeesun Nam, “A novel dichotomy of the korean adverb nemwu in opinion classification,” Studies in Language. International Journal sponsored by the Foundation “Foundations of Language”, vol. 38, no. 1, pp. 171–209, 2014.
  • [16] Stephen C Levinson, Presumptive meanings: The theory of generalized conversational implicature, MIT press, 2000.
  • [17] Joun Yeop Lee, Sung Jun Cheon, Byoung Jin Choi, Nam Soo Kim, and Eunwoo Song, “Acoustic modeling using adversarially trained variational recurrent neural network for speech synthesis,” Proc. Interspeech 2018, pp. 917–921, 2018.
  • [18] Joseph L Fleiss, “Measuring nominal scale agreement among many raters.,” Psychological bulletin, vol. 76, no. 5, pp. 378, 1971.
  • [19] François Chollet et al., “Keras,” https://github.com/fchollet/keras, 2015.
  • [20] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
  • [21] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov, “Enriching word vectors with subword information,” CoRR, vol. abs/1607.04606, 2016.
  • [22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [23] Yoon Kim, “Convolutional neural networks for sentence classification,” arXiv preprint arXiv:1408.5882, 2014.
  • [24] Mike Schuster and Kuldip K Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
  • [25] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio, “A structured self-attentive sentence embedding,” arXiv preprint arXiv:1703.03130, 2017.