Dialogue Act Sequence Labeling using Hierarchical encoder with CRF

09/13/2017 ∙ by Harshit Kumar, et al. ∙ 0

Dialogue Act recognition associates dialogue acts (i.e., semantic labels) with utterances in a conversation. The problem of associating semantic labels with utterances can be treated as a sequence labeling problem. In this work, we build a hierarchical recurrent neural network using bidirectional LSTM as a base unit and the conditional random field (CRF) as the top layer to classify each utterance into its corresponding dialogue act. The hierarchical network learns representations at multiple levels, i.e., word level, utterance level, and conversation level. The conversation level representations are input to the CRF layer, which takes into account not only all previous utterances but also their dialogue acts, thus modeling the dependency among both labels and utterances, an important consideration of natural dialogue. We validate our approach on two different benchmark data sets, Switchboard and Meeting Recorder Dialogue Act, and show performance improvement over the state-of-the-art methods by 2.2% and 4.1% absolute points, respectively. It is worth noting that the inter-annotator agreement on the Switchboard data set is 84%, and our method is able to achieve an accuracy of about 79% despite being trained on the noisy data.





Dialogue Acts (DA) are semantic labels attached to utterances in a conversation that serve to concisely characterize speakers’ intention in producing those utterances. The identification of DAs eases the interpretation of utterances and helps in understanding a conversation. One of the primary applications of DAs [Higashinaka et al.2014] is in building a natural language dialogue system, where knowing the DAs of the past utterances helps in predicting the DA of the current utterance, thus limiting the number of candidate utterances to be generated for the current turn. For example, if the previous utterance is of type Greeting, then the next utterance is most likely going to be of the same type, i.e., Greeting. Table 1 shows a snippet of a conversation exhibiting such dependency among DAs. Another application of DA identification is in building a conversation summarizer, where DAs can be used to generate a summary of a conversation by collecting pairs of utterances that have specific DA labels.

| # | Utterance | DA |
|---|---|---|
| 1 | U: Hi | Greeting |
| 2 | S: Hi, How are you? | Greeting |
| 3 | U: I recently visited canary island | Statement |
| 4 | S: I am sure you had a nice time. | Statement |
| 5 | U: yes, but it is an expensive place, | Opinion |
| 6 | S: Aren't all tourist places expensive? | Y/N question |
| 7 | U: yes, most are | Ack |
| 8 | U: but | Abandon |
| 9 | U: i liked the food, especially curry | Statement |

Table 1: A snippet of a conversation showing a few exchanges between a User (U) and a System (S).

DA recognition is a well-understood problem, and several approaches, ranging from multi-class classification to structured prediction, have been applied to it [Grau et al.2004, Ang, Liu, and Shriberg2005, Stolcke et al.2006, Lendvai and Geertzen2007, Tavafi et al.2013]. These approaches use handcrafted features, often designed with the characteristics of the underlying data in mind, and therefore do not scale well across datasets. Furthermore, in a natural conversation, there is a strong dependency among consecutive utterances and among consecutive DAs, as is evident from the previous Greeting example, so it is important that any model account for these dependencies. However, standard multi-class classifiers such as Naïve Bayes do not account for any of these dependencies and classify DAs independently, whereas structured prediction algorithms such as HMM take into account only the label dependency, not the dependencies among utterances. For the DA recognition task, one of the earlier works [Grau et al.2004] used Naïve Bayes and reported an accuracy of 66% on the Switchboard (SwDA) corpus. The SwDA corpus has since become the standard corpus for the DA recognition task because of its widespread use, and it serves as a benchmark for comparing different algorithms. Furthermore, although structured prediction algorithms such as HMM [Stolcke et al.2006] and SVM-HMM [Lendvai and Geertzen2007, Tavafi et al.2013] have been applied to this corpus (the HMM reaches 71.0%; see Table 4), they are still far from the human inter-annotator agreement of 84% on the SwDA corpus.

The emergence of deep learning has dramatically improved the state-of-the-art across several domains [LeCun, Bengio, and Hinton2015], from image classification to natural language generation. Recent studies [Blunsom and Kalchbrenner2013, Lee and Dernoncourt2016, Khanpour, Guntakandla, and Nielsen2016, Ji, Haffari, and Eisenstein2016] have used deep learning models for the DA recognition task, and have shown promising results. However, most of these models do not leverage the implicit and intrinsic dependencies among DAs. A further limitation of existing methods is that they consider a conversation as a flat structure, attempting to recognize each DA in isolation. A conversation naturally has a hierarchical structure, i.e., a conversation is made up of utterances, utterances are made up of words, and so on. In our method, we make use of this structure to build a hierarchical recurrent neural network with four layers, the first three layers representing words, utterances, and conversation, and the fourth layer representing the CRF (classification) layer. Among these four layers, the first three capture the dependencies among utterances, whereas the fourth captures the dependencies among dialogue acts, hence accounting for both kinds of dependencies. Our method contrasts with existing methods, which capture only one kind of dependency: either utterance dependency [Blunsom and Kalchbrenner2013] or label dependency [Huang, Xu, and Yu2015, Ma and Hovy2016].
The main contributions of this paper are as follows:

  • We propose a Hierarchical Bi-LSTM-CRF (Bi-directional Long Short Term Memory with CRF) model for the DA recognition task that can capture both kinds of dependencies, i.e., among dialogue acts and among utterances.

  • We evaluate the proposed method on two benchmark datasets, SwDA and MRDA, and show performance improvement over the state-of-the-art by a significant margin. For the SwDA dataset, our method achieves an accuracy of 79.2% compared to the state-of-the-art accuracy of 77.0%, a step closer to the human-reported inter-annotator agreement of 84%. On MRDA, our method achieves an accuracy of 90.9% compared to the state-of-the-art accuracy of 86.8%.

  • We analyze the effect of incorporating linguistic features, and additional context through intra-attention [Paulus, Xiong, and Socher2017], on top of the proposed model. These additional variations do not result in any performance improvement. Although additional context does not boost the performance, it does help the model converge during training.

Related Work

DA recognition is a supervised classification problem that assigns a DA label to each utterance in a conversation. Several approaches tackle this problem in different ways, and most of them can be grouped into the following two categories: 1) those that predict the entire DA sequence for all utterances in a conversation, in other words, those that treat DA identification as a sequence labeling problem [Stolcke et al.2006, Lendvai and Geertzen2007, Zimmermann2009, Lee and Dernoncourt2016]; 2) those that predict a DA label for each utterance independently [Tavafi et al.2013, Khanpour, Guntakandla, and Nielsen2016, Ji, Haffari, and Eisenstein2016]. Before deep learning based models, the best reported accuracy on the benchmark SwDA dataset was obtained with an HMM [Stolcke et al.2006], using hand-crafted features along with contextual and lexical information, while the best on the MRDA dataset was obtained by [Lendvai and Geertzen2007] using a naive Bayesian formulation.

Recently, researchers have started using deep learning based models for this task [Lee and Dernoncourt2016, Khanpour, Guntakandla, and Nielsen2016, Tavafi et al.2013], and have shown significant improvements over previous models. [Lee and Dernoncourt2016] propose a model based on CNNs and RNNs that incorporates preceding short texts as context to classify current DAs; the CNN based model performs better than the RNN based model for both SwDA and MRDA data sets. In another work, [Blunsom and Kalchbrenner2013] build a sentence representation using a combination of Hierarchical CNN (HCNN) and RNN, followed by the classification of these sentence representations into corresponding DAs. However, [Blunsom and Kalchbrenner2013] predict the dialogue act of each utterance individually, i.e., they do not take into account the label dependency. In another line of work, [Ji, Haffari, and Eisenstein2016] propose a Latent Variable Recurrent Neural Network (LVRNN) that tackles the problems of dialogue act classification and dialogue generation simultaneously. They use the context vector of the previous utterance to predict the DA label of the next utterance, which is then used, along with the previous utterance vector, to generate the next utterance. Although this model takes into account the utterance dependency, it does not capture the dependencies among labels directly.

There has been some work on using conditional random fields with LSTM models [Huang, Xu, and Yu2015, Ma and Hovy2016] for sequence tagging tasks such as POS tagging and named entity recognition. However, they do not make use of the hierarchical structure of language, and therefore, although they take into account the label dependency, they are unable to capture the dependencies among utterances in a principled way.


Figure 1: An illustration of our proposed hierarchical Bi-LSTM CRF model. The input is a conversation $C_i$ consisting of $R_i$ utterances $u_1, u_2, \dots, u_{R_i}$, with each utterance $u_j$ itself being a sequence of words $w_1, w_2, \dots, w_{S_j}$. As can be seen, there are four main layers, viz. embedding, utterance encoder, conversation encoder, and CRF classifier. The output is a DA prediction for each utterance in the conversation.

Before describing the proposed model in detail, we first set the mathematical notation for the problem of DA identification. Suppose we have a set $D$ of $N$ conversations or dialogues, i.e., $D = (C_1, C_2, \dots, C_N)$, with $(Y_1, Y_2, \dots, Y_N)$ the corresponding target DAs. Each conversation $C_i$ is itself a sequence of $R_i$ utterances, $C_i = (u_1, u_2, \dots, u_{R_i})$, with $Y_i = (y_1, y_2, \dots, y_{R_i})$ being the corresponding target DAs. In other words, for each utterance $u_j$ in each conversation, we have an associated target label $y_j \in \mathcal{Y}$, where $\mathcal{Y}$ is the set of all possible DAs. Each utterance $u_j$ is in turn a sequence of $S_j$ words strung together, i.e., $u_j = (w_1, w_2, \dots, w_{S_j})$.

The whole sequence of utterances in each conversation can be considered as a single very long chain of words, with output tags or labels appearing only sparsely, i.e., at the end of each utterance. However, such a construct suffers from extremely long sequence lengths, which severely hamper neural network training, as backpropagation through time becomes impractical due to vanishing/exploding gradients at extreme lengths. To mitigate this problem, we take into consideration the hierarchical nature of dialogues and conversations, and opt to use a hierarchical recurrent encoder. Hierarchical recurrent encoders have been used previously by [Sordoni et al.2015, Serban et al.2016, Serban et al.2017, Dehghani et al.2017], and have been shown to perform better than standard non-hierarchical models. We propose a hierarchical recurrent encoder, where the first encoder operates at the utterance level, encoding each word in each utterance, and the second encoder operates at the conversation level, encoding each utterance in the conversation, based on the representations of the previous encoder. Together, these two encoders ensure that the output of the second encoder captures the dependencies among utterances.

The output of the second encoder can be followed by any type of classification module that takes in the representation of each utterance. In our formulation, we combine the hierarchical encoder with a linear chain conditional random field (CRF) [Lafferty, McCallum, and Pereira2001] for structured prediction. DA identification can be treated as a sequence labeling problem and can be tackled naively by assigning a label to each element of the sequence independently. However, the implicit dependencies among consecutive elements in a sequence mean that, instead of labeling each item independently, structured prediction models such as hidden Markov models, conditional random fields, etc., are naturally a better choice. An illustration of the complete proposed model (a combination of a word embedding layer, a recurrent hierarchical encoder, and a CRF based classification layer) is shown in Figure 1. The proposed model is trainable end-to-end, and constructs and captures representations at multiple levels of granularity, e.g., word level, utterance level, and conversation level.

Hierarchical Recurrent Encoder

For a given conversation, each word of each utterance is processed by an embedding layer which converts one-hot vocabulary vectors to dense representations, followed by a word-level bidirectional LSTM [Hochreiter and Schmidhuber1997], which serves as the first encoder in our hierarchical encoder. The embedding layer can be initialized using pretrained embeddings such as Word2Vec [Mikolov et al.2013] or Glove [Pennington, Socher, and Manning2014]. Since we consider bidirectional LSTMs, the representation of each word is obtained by concatenating the outputs from the forward and backward RNNs at that time-step. For an utterance $u_j$ comprised of a sequence of words $(w_1, w_2, \dots, w_{S_j})$, the series of operations is as follows:

$$
\begin{aligned}
e_k &= f_{\text{emb}}(w_k), \\
\overrightarrow{h}_k &= \overrightarrow{f}_{\text{utt}}\left(e_k, \overrightarrow{h}_{k-1}\right), \qquad
\overleftarrow{h}_k = \overleftarrow{f}_{\text{utt}}\left(e_k, \overleftarrow{h}_{k+1}\right), \\
h_k &= \left[\overrightarrow{h}_k ; \overleftarrow{h}_k\right], \qquad k = 1, \dots, S_j
\end{aligned}
$$

Here, $f_{\text{emb}}$ represents the embedding layer, whereas $f_{\text{utt}}$ denotes the utterance-level encoder in our hierarchical encoder. Note that the embedding layer could in principle capture finer granularities, such as character level [Kim et al.2016] or subword level [Sennrich, Haddow, and Birch2017] embeddings, which would potentially increase the depth of our hierarchical encoder. In order to keep the complexity of the model manageable, we decided to skip additional finer grained levels.

Due to the hierarchical nature of conversations, the representation of each utterance $u_j$, denoted by $v_j$, can be obtained by combining the representations of its constituent words. The combination can be done in many possible ways, e.g., average-pooling, max-pooling, or last-pooling. In the case of last pooling, we simply take the output of the word-level encoder at the last time-step $S_j$, denoted $h_{S_j}$, as the representation of the entire utterance, i.e.,

$$
v_j = h_{S_j}
$$

This is because the final time-step contains context of all the words and time-steps preceding it, and serves as a good approximation to a representation of the entire utterance. At this stage, we have a sequence of utterance representations $(v_1, v_2, \dots, v_{R_i})$, corresponding to the conversation $C_i$ consisting of utterances $(u_1, u_2, \dots, u_{R_i})$. This sequence of utterance representations is then passed on to the conversation-level encoder, which is realized by means of another bidirectional LSTM. Once again, we concatenate the vectors obtained from the forward and backward RNNs at each time-step to form the final representation of each utterance. For each utterance $u_j$, the representation $v_j$ is transformed via the conversation-level encoder to obtain another representation $g_j$ as follows:

$$
\overrightarrow{g}_j = \overrightarrow{f}_{\text{conv}}\left(v_j, \overrightarrow{g}_{j-1}\right), \qquad
\overleftarrow{g}_j = \overleftarrow{f}_{\text{conv}}\left(v_j, \overleftarrow{g}_{j+1}\right), \qquad
g_j = \left[\overrightarrow{g}_j ; \overleftarrow{g}_j\right]
$$

Here, $f_{\text{conv}}$ denotes the conversation-level RNN that forms the second level in our hierarchical encoder. For a conversation $C_i$, we are left with a representation $g_j$ for each utterance $u_j$, which can be passed forward to a classification layer.
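The pooling options mentioned above can be illustrated on toy word-level states. The following is a minimal sketch in plain Python, not the authors' implementation; the function names and the toy vectors are assumptions made purely for illustration.

```python
from typing import List

Vector = List[float]


def last_pool(states: List[Vector]) -> Vector:
    """Use the state at the final time-step as the utterance vector."""
    return list(states[-1])


def mean_pool(states: List[Vector]) -> Vector:
    """Average the states over all time-steps, dimension by dimension."""
    n = len(states)
    return [sum(col) / n for col in zip(*states)]


def max_pool(states: List[Vector]) -> Vector:
    """Take the maximum over time-steps, dimension by dimension."""
    return [max(col) for col in zip(*states)]


# Toy word-level encoder outputs for a three-word utterance (hypothetical values).
word_states = [[0.0, 1.0], [2.0, -1.0], [4.0, 3.0]]

utterance_last = last_pool(word_states)   # [4.0, 3.0]
utterance_mean = mean_pool(word_states)   # [2.0, 1.0]
utterance_max = max_pool(word_states)     # [4.0, 3.0]
```

Whichever pooling is chosen, the result is one fixed-size vector per utterance, which is what the conversation-level encoder consumes.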

Linear Chain CRF

In our proposed model, the classifier of choice is a linear chain CRF, which enables us to model dependencies among labels. Note that the dependencies among utterances have already been captured by the bidirectional encoders. In sequence tagging, greedily predicting the tag at each time-step might not lead to the optimal solution; instead, it is better to look at correlations between labels in neighborhoods in order to jointly decode the best chain of tags. CRFs are undirected graphical models that model the conditional probability of a label sequence given an observed example sequence. Now, for a given conversation $C_i$, with utterances $(u_1, u_2, \dots, u_{R_i})$ and corresponding associated dialogue acts $Y_i = (y_1, y_2, \dots, y_{R_i})$, the probability of predicting the sequence of dialogue acts can be written as:

$$
p(Y_i \mid C_i; \theta) = \frac{\prod_{j=1}^{R_i} \psi\left(y_{j-1}, y_j, g_j; \theta\right)}{\sum_{Y' \in \mathcal{Y}^{R_i}} \prod_{j=1}^{R_i} \psi\left(y'_{j-1}, y'_j, g_j; \theta\right)}
$$

where $g_j$ is the dense representation of each utterance $u_j$ obtained from the second level encoder, and $y_0$ denotes a start state. Here $\theta$ is the set of parameters corresponding to the CRF layer, and $\psi$ is the feature function, providing us with unary and pairwise potentials. The CRF layer in our proposed model is parameterized by a state transition matrix $A$, which models the transition from a label $a$ to a label $b$ at any time-step. The state transition matrix is of size $T \times T$, for a tag-set of size $T$, and is position independent, i.e., it remains the same for each pair of consecutive time-steps. The transition matrix provides us with the pairwise feature function for the CRF, while the output of the hierarchical encoder, i.e., $g_j$, is considered as the unary feature function. We do not opt for higher order potentials, and restrict ourselves to only pairwise potentials, since the target sequence is a chain of tags.
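To make the linear chain CRF described above concrete, here is a small sketch in plain Python that scores a label sequence from unary scores and a position-independent transition matrix, and normalizes it with the forward algorithm. It is an illustration under stated assumptions rather than the paper's code: scores are kept in log space, there is no explicit start-state term, and the names `unary`, `trans`, and all numeric values are hypothetical.

```python
import itertools
import math
from typing import List, Sequence


def sequence_score(unary: List[List[float]], trans: List[List[float]],
                   labels: Sequence[int]) -> float:
    """Unnormalized log-score: unary terms plus pairwise transition terms."""
    score = sum(unary[j][y] for j, y in enumerate(labels))
    score += sum(trans[a][b] for a, b in zip(labels, labels[1:]))
    return score


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) over a list of values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(unary: List[List[float]], trans: List[List[float]]) -> float:
    """Forward algorithm: log of the summed exp(score) over all label sequences."""
    num_labels = len(trans)
    alpha = list(unary[0])
    for j in range(1, len(unary)):
        alpha = [
            unary[j][b] + log_sum_exp([alpha[a] + trans[a][b] for a in range(num_labels)])
            for b in range(num_labels)
        ]
    return log_sum_exp(alpha)


def sequence_log_prob(unary, trans, labels) -> float:
    """Log-probability of one label sequence under the CRF."""
    return sequence_score(unary, trans, labels) - log_partition(unary, trans)


# Toy problem: 3 utterances, 2 labels (all numbers are made up).
unary = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]   # one row per utterance
trans = [[0.5, -0.5], [-1.0, 1.0]]             # trans[a][b]: label a -> label b

# Sanity check: probabilities over all 2**3 label sequences sum to one.
total_prob = sum(
    math.exp(sequence_log_prob(unary, trans, labels))
    for labels in itertools.product(range(2), repeat=3)
)
```

In the full model the unary scores would come from the conversation-level encoder output; here they are fixed numbers so that the normalization can be checked by brute-force enumeration.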

To learn the CRF parameters, we use maximum likelihood estimation. For the given training set $D$, i.e., $(C_i, Y_i)$ pairs, the log likelihood can be written as:

$$
\mathcal{L}(\Theta) = \sum_{i=1}^{N} \log p(Y_i \mid C_i; \Theta)
$$

where $\Theta$ is the set of network parameters, i.e., the parameters of all layers, viz. the word embedding layer, the hierarchical recurrent encoders, and the CRF classifier. At test time, the optimal label sequence can be obtained with dynamic programming [Rabiner1989], using the Viterbi algorithm [Viterbi1967], i.e.,

$$
Y^{*} = \arg\max_{Y \in \mathcal{Y}^{R}} \; p(Y \mid C; \Theta)
$$
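The Viterbi decoding step can be sketched as follows. This is a minimal plain-Python illustration, assuming log-space unary scores and a position-independent transition matrix with no start-state term; the variable names and toy numbers are hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def viterbi(unary: List[List[float]], trans: List[List[float]]) -> Tuple[List[int], float]:
    """Return the highest-scoring label sequence and its score."""
    num_labels = len(trans)
    best = list(unary[0])              # best score of any path ending in each label
    backpointers: List[List[int]] = []
    for j in range(1, len(unary)):
        new_best, pointers = [], []
        for b in range(num_labels):
            # Best previous label for a path that ends in label b at step j.
            prev = max(range(num_labels), key=lambda a: best[a] + trans[a][b])
            pointers.append(prev)
            new_best.append(best[prev] + trans[prev][b] + unary[j][b])
        best = new_best
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    last = max(range(num_labels), key=lambda b: best[b])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Toy problem: 3 utterances, 2 labels (all numbers are made up).
unary = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
trans = [[0.5, -0.5], [-1.0, 1.0]]
best_path, best_score = viterbi(unary, trans)   # ([1, 1, 1], 4.5)
```

In this toy case, choosing the label greedily from the unary scores alone would pick label 0 for the first utterance, whereas joint decoding picks label 1 everywhere because of the transition scores. This is the kind of situation in which decoding the whole chain jointly differs from per-utterance classification.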



In this section we describe the experimental evaluation of our approach.

| Dataset | $\lvert C \rvert$ | $\lvert V \rvert$ | Training | Validation | Testing |
|---|---|---|---|---|---|
| MRDA | 5 | 10K | 51 (76K) | 11 (15K) | 11 (15K) |
| SwDA | 42 | 19K | 1003 (173K) | 112 (22K) | 19 (4K) |

Table 2: $\lvert C \rvert$ is the number of Dialogue Act classes, $\lvert V \rvert$ is the vocabulary size. Training, Validation and Testing indicate the number of conversations (number of utterances) in the respective splits.


We evaluate the performance of our model on two benchmark datasets used in several prior studies for the DA identification task, viz.:

  • SwDA: The Switchboard Dialogue Act Corpus [Jurafsky1997] consists of 1155 annotated human-to-human telephone conversations. Each utterance in a conversation is labeled with one of the 42 classes of the compact DAMSL taxonomy [Core and Allen1997], such as STATEMENT-OPINION, STATEMENT-NON-OPINION, BACKCHANNEL, etc.

  • MRDA: The ICSI Meeting Recorder Dialogue Act corpus [Janin et al.2003, Ang, Liu, and Shriberg2005] contains 72 hours of naturally occurring multi-party meetings that were first converted into 75 word-level conversations, and then hand annotated with DAs using the Meeting Recorder Dialogue Act Tagset. The original MRDA tag set had 11 general tags and 39 specific tags. The MRDA scheme provides several class-maps and corresponding scripts for grouping related tags together into a smaller number of DAs. For this work, we use the most widely used class-map, which groups all tags into 5 DAs, i.e., Statement (S), Question (Q), Floorgrabber (F), Backchannel (B), and Disruption (D).

Table 2 presents statistics for both datasets. For SwDA, train and test sets are provided but not a validation set, so we follow the standard practice of taking a part of the training set as the validation set [Lee and Dernoncourt2016]. Because of the noisy and informal nature of the utterances, we performed a series of pre-processing steps. For both datasets, exclamations and commas were stripped, and characters were converted to lower-case. The datasets are also highly imbalanced in terms of label distribution: the DA labels non-opinion (sd) and backchannel (b) account for a disproportionate share of the utterances in SwDA, while more than 50% of the utterances in MRDA have the DA label statement (s).
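The normalization steps described above are simple enough to sketch directly. The snippet below is an illustration only; it assumes that "exclamations" refers to exclamation marks, and the function name `preprocess` is hypothetical.

```python
def preprocess(utterance: str) -> str:
    """Strip exclamation marks and commas, then lower-case the text."""
    cleaned = utterance.replace("!", "").replace(",", "")
    return cleaned.lower()


example = preprocess("Hi, How are you?")   # "hi how are you?"
```

Other punctuation, such as the question mark, is left untouched in this sketch, since the text only mentions exclamations and commas.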

| Parameter | Range | Final |
|---|---|---|
| Pooling | Last / Mean | Last |
| Word Embedding | Glove / Word2Vec | 300D Glove |
| Bidirectional | True / False | True |
| Hidden Size | | |
| Learning Rate | | |
| Stacked LSTM Layers | | 1 |

Table 3: Hyperparameter tuning: the Range column lists the various values tried, while the Final column lists the final value chosen for the corresponding hyperparameter.

Hyperparameter Tuning

Conversations with the same number of utterances were grouped together into mini-batches, and each utterance in a mini-batch was padded to the maximum length for that batch, with a cap on the maximum batch size. We used regularization in the form of weight decay, together with the Adadelta optimizer. All other hyper-parameters were selected by tuning one hyper-parameter at a time while keeping the others fixed. The hyper-parameters were tuned using the SwDA validation set. The final set of hyper-parameters was then used to train two different models, one each on the SwDA and MRDA training datasets. Table 3 lists the range of values that we experimented with for each parameter, and the final value that was selected. The word vectors were initialized with the 300-dimensional Glove embeddings [Pennington, Socher, and Manning2014], and were also updated during training. Dropout was applied to the embeddings obtained from the output of each encoder. The learning rate was reduced by a constant factor at fixed epoch intervals. Early stopping on the validation set, with a fixed patience, was also used. Increasing the number of stacked LSTM layers reduced the accuracy of the model, so we settled on a single layer.
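The batching scheme described above (group conversations by their number of utterances, then pad each utterance to the longest one in the batch) can be sketched in plain Python. This is an assumption-laden illustration: `PAD_ID`, `MAX_BATCH_SIZE`, and the toy word ids are hypothetical values chosen for the example, not settings from the paper.

```python
from collections import defaultdict
from typing import Dict, List

Utterance = List[int]              # a list of word ids
Conversation = List[Utterance]

PAD_ID = 0          # hypothetical padding index
MAX_BATCH_SIZE = 2  # hypothetical cap, chosen only for this example


def make_batches(conversations: List[Conversation]) -> List[List[Conversation]]:
    """Group conversations by utterance count, then pad utterances per batch."""
    by_length: Dict[int, List[Conversation]] = defaultdict(list)
    for conv in conversations:
        by_length[len(conv)].append(conv)

    batches = []
    for group in by_length.values():
        for start in range(0, len(group), MAX_BATCH_SIZE):
            batch = group[start:start + MAX_BATCH_SIZE]
            # Longest utterance in this batch determines the padded length.
            max_len = max(len(utt) for conv in batch for utt in conv)
            padded = [
                [utt + [PAD_ID] * (max_len - len(utt)) for utt in conv]
                for conv in batch
            ]
            batches.append(padded)
    return batches


toy = [
    [[5, 6], [7]],           # 2 utterances
    [[8], [9, 10, 11]],      # 2 utterances
    [[12, 13, 14, 15]],      # 1 utterance
]
batches = make_batches(toy)
```

Because every conversation in a batch has the same number of utterances, no padding is needed along the conversation axis; only the word axis is padded.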

Results and Discussion

The results reported in this section are based on the hyper-parameter values tuned in the previous section. The Hierarchical Bi-LSTM-CRF model is compared against seven different baseline models.

  • DRLM-Conditional [Ji, Haffari, and Eisenstein2016] - a latent variable recurrent neural network architecture for joint modeling of utterance and DA label.

  • LSTM-Softmax [Khanpour, Guntakandla, and Nielsen2016] - Bidirectional LSTMs on word embeddings followed by a softmax classifier.

  • RCNN[Blunsom and Kalchbrenner2013] - Hierarchical CNN on word embeddings to model utterances followed by a RNN to capture context, with a softmax classifier.

  • CNN [Lee and Dernoncourt2016] - An utterance level CNN followed by a conversation CNN, with softmax classifiers. The utterance and conversation layers only consider the current utterance and a limited number of preceding ones.

  • CRF - Simple baseline with pre-trained word embeddings followed by a CRF classifier.

  • LR - Simple baseline with pre-trained word embeddings followed by a logistic regression classifier.

| Model | Acc (%) |
|---|---|
| Hierarchical Bi-LSTM-CRF | 79.2 |
| DRLM-Conditional (Ji et al. 2016) | 77.0 |
| LSTM-Softmax (Khanpour et al. 2016) | 75.8* |
| RCNN [Blunsom and Kalchbrenner2013] | 73.9 |
| CNN [Lee and Dernoncourt2016] | 73.1 |
| CRF | 72.2 |
| LR | 71.4 |
| HMM [Stolcke et al.2006] | 71.0 |

* The paper claimed accuracy of 80.1. Personal correspondence with the authors revealed that a non-standard test set was used by accident.

Table 4: Comparing accuracy of our method (Hierarchical Bi-LSTM-CRF) with other methods in the literature on the SwDA dataset.

Table 4 compares the results obtained using our model with previous models. The results show that our Hierarchical Bi-LSTM-CRF model outperforms the state-of-the-art: it improves the DA labeling accuracy over the DRLM-Conditional model by 2.2 absolute points. To analyze the results further, we examined the confusion matrix to see which labels are assigned to utterances correctly and incorrectly. Table 5 shows the confusion matrix of our proposed model for the SwDA dataset. The most confused pairs are (sd, sv) and (aa, b), which represent (statement-non-opinion, statement-opinion) and (agree-accept, acknowledge), respectively. The total number of utterances for each of the DAs ’sd’, ’sv’, ’aa’, and ’b’ is given in brackets in Table 5. Of the utterances with true label non-opinion, 7.8% were predicted incorrectly as opinion, whereas 87.7% were predicted correctly. Similarly, 27.9% of the utterances with true label opinion were predicted incorrectly as non-opinion, whereas 66% were predicted correctly. On further analysis of the cause of the confusion within these two class pairs, we found utterances that were classified correctly by the model but were counted as misclassified because of bias in the ground truth. For some utterances, the classes were not distinguishable even by humans because of their subjectivity.

Table 5: Confusion matrix of the Hierarchical Bi-LSTM-CRF model for the SwDA dataset (10 DA class labels), where the row denotes the true label and the column denotes the predicted label. The number in brackets beside the DA label in the first cell of each row is the count of utterances with that DA label.
| Utt no | Utterance | True DA Label | Predicted DA Label |
|---|---|---|---|
| 1692 | This is quite a long distance. | non-opinion (sd) | opinion (sv) |
| 1720 | This is a little bigger than a tea cup. | non-opinion (sd) | opinion (sv) |
| 1789 | we’re supposed to appreciate them. | non-opinion (sd) | opinion (sv) |
| 77 | they could do something about that | opinion (sv) | non-opinion (sd) |
| 739 | i need to start jog something again. | opinion (sv) | non-opinion (sd) |
| 112 | i thought it was up there. | opinion (sv) | non-opinion (sd) |
| 1121 | Yeah. | agree/accept (aa) | backchannel (b) |
| 1334 | Yeah. | agree/accept (aa) | backchannel (b) |
| 1337 | Sure | agree/accept (aa) | backchannel (b) |
| 1362 | Yeah | backchannel (b) | agree/accept (aa) |
| 1371 | Yeah | backchannel (b) | agree/accept (aa) |
| 1372 | # Oh Yeah. # | backchannel (b) | agree/accept (aa) |

Table 6: Examples of utterances from the confused pairs (non-opinion, opinion) and (agree/accept, backchannel).

We show examples of some of these cases in Table 6. For instance, utterance no. 1692 seems to be an opinion (’sv’) and is also predicted as ’sv’, but its true label is non-opinion (’sd’). Similarly, the underlying text of utterance no. 1334 is ’Yeah’, and its true label is agree/accept (’aa’). The underlying text of utterances no. 1362 and 1371 is also ’Yeah’, but this time the true label is backchannel (’b’). This means that two utterances with the same underlying text have two different DA associations. We accept this as a characteristic of the SwDA dataset; it is consistent with the inter-labeler agreement of 84% reported by the authors who created the dataset.
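Per-label percentages of the kind quoted in the confusion analysis are row-normalized confusion counts. The sketch below shows the computation in plain Python on made-up labels; the toy lists are hypothetical and are not SwDA predictions, and the function names are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List


def confusion_counts(true: List[str], pred: List[str]) -> Counter:
    """Count (true label, predicted label) pairs."""
    return Counter(zip(true, pred))


def row_rates(counts: Counter, label: str) -> Dict[str, float]:
    """Percentage of utterances with true label `label` assigned to each prediction."""
    row = {p: c for (t, p), c in counts.items() if t == label}
    total = sum(row.values())
    return {p: 100.0 * c / total for p, c in row.items()}


# Toy labels only; these are not SwDA results.
true = ["sd", "sd", "sd", "sd", "sv", "sv", "aa", "b"]
pred = ["sd", "sd", "sd", "sv", "sd", "sv", "b", "b"]

counts = confusion_counts(true, pred)
sd_rates = row_rates(counts, "sd")   # {"sd": 75.0, "sv": 25.0}
accuracy = 100.0 * sum(c for (t, p), c in counts.items() if t == p) / len(true)
```

Row normalization makes the rates comparable across labels with very different frequencies, which matters for imbalanced label distributions.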

| Model | Acc (%) |
|---|---|
| Hierarchical Bi-LSTM-CRF | 90.9 |
| LSTM-Softmax (Khanpour et al. 2016) | 86.8 |
| CNN [Lee and Dernoncourt2016] | 84.6 |
| Naive Bayes [Lendvai and Geertzen2007] | 82.0 |

Table 7: Comparing accuracy of our method (Hierarchical Bi-LSTM-CRF) with other methods in the literature on the MRDA dataset.

The results on the MRDA dataset are shown in Table 7. From this table, it is clear that our method outperforms the state-of-the-art by a significant margin i.e. by 4.1%. Table 8 shows the confusion matrix for the MRDA dataset. Except for the class label ’B’, all other DA class labels are predicted accurately. Approximately 21% of DA class label ’B’ are incorrectly predicted as ’S’. One of the reasons for this behavior is that the MRDA dataset is highly imbalanced, with more than 50% of the utterances labeled as class ’S’.

Table 8: Confusion matrix of the Hierarchical Bi-LSTM-CRF model for the MRDA dataset, where the row denotes the true DA label and the column denotes the predicted DA label. The number in brackets beside the DA label in the first cell of each row is the count of utterances with that DA label.

Effect of Hierarchy and Label Dependency

In this section, we discuss the influence of adding hierarchical layers (utterance layer, conversation layer) and classification layer on accuracy. In particular, we perform ablation studies by evaluating the model layer by layer to understand if the addition of new layers provides any improvement in performance.

The first model, WE, is a plain two layer network with a word embedding layer followed by the classification layer, i.e., the pre-trained Glove word embeddings are fed as input to the classification layer. No form of dependency (within utterances, across utterances, or across DA labels) is captured by the encoder here. The second model, WE+UL, is a three layer network that takes word embeddings as input. The output of the WE layer is input to the utterance layer to learn utterance vectors. Each utterance vector is a compositional representation of all words in that utterance. The utterance vector is fed directly as input to the classification layer to predict the label. Dependencies across utterances are not captured here. The third model, WE+UL+CL, is a four layer network similar to the proposed hierarchical Bi-LSTM-CRF model, except that the final layer can be either a logistic regression (LR) or a CRF based classifier.

| Model | Accuracy with LR | Accuracy with CRF |
|---|---|---|
| WE | 71.4 | 72.2 |
| WE+UL | 72.2 | 72.7 |
| WE+UL+CL | 74.1 | 79.2 |

Table 9: WE is Word Embedding layer, UL is Utterance Layer, CL is Conversation Layer, LR is Logistic Regression and CRF is Conditional Random Field.

Table 9 shows the results of the various networks with both LR and CRF layers. From the table, we observe that the models WE, WE+UL, and WE+UL+CL with an LR layer at the top produce accuracies of 71.4%, 72.2%, and 74.1%, respectively. If LR is replaced with CRF in the final layer, the accuracies of WE, WE+UL, and WE+UL+CL (Hierarchical Bi-LSTM-CRF) are 72.2%, 72.7%, and 79.2%, respectively. These results show that adding the utterance layer and the conversation layer improves the results, and that replacing LR with CRF improves them further. Note that the accuracy of WE+UL with LR and that of WE with CRF are the same. The output of the utterance layer at each time step is a vector representing the context of the utterance up to that word, and the vector at the last time step is the final representation of the utterance. Adding an utterance layer therefore generates a compositional vector of all words in an utterance, which serves as a good representation of the utterance. Adding the utterance layer to the WE model and replacing LR with CRF in it produce more or less the same result. The addition of the conversation layer results in a major improvement in accuracy: approximately 1.9 absolute points with LR in the final layer, and 6.5 absolute points with CRF. This is because the output of the conversation layer for an utterance is a vector capturing the context of the utterance itself and of the utterances preceding it.

Effect of Linguistic Features and Context

For Dialogue Act identification, linguistic features [Tavafi et al.2013] and context information [Ribeiro, Ribeiro, and de Matos2015] have been shown to improve the performance of the underlying model. In our model, we add linguistic features, in particular the part-of-speech (POS) tags associated with the words in an utterance. More specifically, we add a POS tag layer with POS tag embeddings followed by an encoder, working in parallel to the utterance encoder, to learn a representation for the POS tag sequence associated with each utterance. This representation is concatenated with the utterance vector at the conversation layer, right before the vectors are fed to the CRF layer. The results show that the addition of POS reduces the accuracy by approximately 1.3 absolute points (from 79.2% to 77.9%; see Table 10).

| Extension | Accuracy (%) |
|---|---|
| POS | 77.9 |
| Context, length 10 | 77.4 |
| Context, length 5 | 78.3 |
| Context, length 3 | 78.1 |

Table 10: Accuracy obtained using two extensions to the Hierarchical Bi-LSTM-CRF model.

In another extension, we explore capturing the context of an utterance through intra-attention [Paulus, Xiong, and Socher2017], and concatenating it to the utterance vector to produce a new utterance vector. Recent research [Cho et al.2014] has shown that LSTM performance deteriorates as the length of the input sentence increases, since LSTMs are not able to capture long context. Therefore, capturing context explicitly through attention [Bahdanau, Cho, and Bengio2015] is an alternative way to model long-term dependencies. In our model, after obtaining utterance vectors from the conversation layer, a normalized attention weight vector is computed for each utterance vector from its similarity to the previous utterance vectors. These attention weights are then used to compute the context vector as a weighted sum of the previous utterance vectors. The context vector is concatenated to the utterance vector produced by the conversation layer to obtain a new utterance vector, which is input to the classification layer. We experimented with this attention by varying the length of the context (the number of previous utterances), i.e., 3, 5, and 10. In a conversation, an utterance at time step $t$ is mostly dependent upon the previous two or three utterances. Modeling overly long dependencies therefore reduces the performance, as shown in Table 10.
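The attention-based context described above can be sketched in a few lines of plain Python. This is only an illustration under explicit assumptions: the similarity function is taken to be a dot product (the text does not specify it), a zero context vector is used for the first utterance, and the function names and toy vectors are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity (an assumed choice of similarity function)."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    """Normalize scores into attention weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(utterances: List[Vector], t: int, context_len: int) -> Vector:
    """Concatenate utterance t with an attention-weighted sum of its predecessors."""
    current = utterances[t]
    previous = utterances[max(0, t - context_len):t]
    if not previous:
        # No history for the first utterance: use a zero context vector (assumption).
        context = [0.0] * len(current)
    else:
        weights = softmax([dot(current, p) for p in previous])
        context = [sum(w * p[d] for w, p in zip(weights, previous))
                   for d in range(len(current))]
    return current + context


# Toy conversation-level vectors (hypothetical values).
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
augmented = attend(vectors, t=2, context_len=3)   # [1.0, 1.0, 0.5, 0.5]
```

The `context_len` argument plays the role of the context length varied in Table 10; the augmented vector has twice the dimensionality of the original utterance vector.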

Overall, adding context or POS representations to the Hierarchical Bi-LSTM-CRF model does not improve the performance, which suggests that these additions do not contribute any new information to the existing model. The original hierarchical encoder appears to have all the information it needs to model the utterance representations and the dependencies among them. Although additional context does not help performance, it helps considerably with convergence. We observed that training the model with additional context results in much faster convergence than training without it. For both the SwDA and MRDA datasets, the accuracy after the first epoch was higher when the model was trained with additional context than without it.


In this paper, we used a Hierarchical Bi-LSTM-CRF model for labeling a sequence of utterances in a conversation with Dialogue Acts. The proposed model captures long-term dependencies between words in an utterance and across utterances, thus generating a vector representation for each utterance in a conversation. The sequence of vectors corresponding to the utterances in a conversation is sent to a CRF-based classifier to model the dependencies between the Dialogue Act labels and the utterance representations. We demonstrated the efficacy of our model on two popular datasets, SwDA and MRDA. Experimental results show that our proposed model outperforms the state of the art on both datasets.


  • [Ang, Liu, and Shriberg2005] Ang, J.; Liu, Y.; and Shriberg, E. 2005. Automatic dialog act segmentation and classification in multiparty meetings. In ICASSP.
  • [Bahdanau, Cho, and Bengio2015] Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. ICLR.
  • [Blunsom and Kalchbrenner2013] Blunsom, P., and Kalchbrenner, N. 2013. Recurrent convolutional neural networks for discourse compositionality. In Proceedings of the 2013 Workshop on Continuous Vector Space Models and their Compositionality.
  • [Cho et al.2014] Cho, K.; van Merriënboer, B.; Bahdanau, D.; and Bengio, Y. 2014. On the properties of neural machine translation: Encoder–decoder approaches. Syntax, Semantics and Structure in Statistical Translation.
  • [Core and Allen1997] Core, M. G., and Allen, J. 1997. Coding dialogs with the DAMSL annotation scheme. In AAAI Fall Symposium On Communicative Action In Humans And Machines.
  • [Dehghani et al.2017] Dehghani, M.; Rothe, S.; Alfonseca, E.; and Fleury, P. 2017. Learning to attend, copy, and generate for session based query suggestion. In CIKM.
  • [Grau et al.2004] Grau, S.; Sanchis, E.; Castro, M. J.; and Vilar, D. 2004. Dialogue act classification using a Bayesian approach. In 9th Conference Speech and Computer.
  • [Higashinaka et al.2014] Higashinaka, R.; Imamura, K.; Meguro, T.; Miyazaki, C.; Kobayashi, N.; Sugiyama, H.; Hirano, T.; Makino, T.; and Matsuo, Y. 2014. Towards an open-domain conversational system fully based on natural language processing. In COLING.
  • [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9.
  • [Huang, Xu, and Yu2015] Huang, Z.; Xu, W.; and Yu, K. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
  • [Janin et al.2003] Janin, A.; Baron, D.; Edwards, J.; Ellis, D.; Gelbart, D.; Morgan, N.; Peskin, B.; Pfau, T.; Shriberg, E.; Stolcke, A.; et al. 2003. The ICSI meeting corpus. In ICASSP.
  • [Ji, Haffari, and Eisenstein2016] Ji, Y.; Haffari, G.; and Eisenstein, J. 2016. A latent variable recurrent neural network for discourse relation language models. In NAACL-HLT.
  • [Jurafsky1997] Jurafsky, D. 1997. Switchboard SWBD-DAMSL shallow-discourse-function annotation coders manual. www.dcs.shef.ac.uk/nlp/amities/files/bib/ics-tr-97-02.pdf.
  • [Khanpour, Guntakandla, and Nielsen2016] Khanpour, H.; Guntakandla, N.; and Nielsen, R. 2016. Dialogue act classification in domain-independent conversations using a deep recurrent neural network. In COLING.
  • [Kim et al.2016] Kim, Y.; Jernite, Y.; Sontag, D.; and Rush, A. M. 2016. Character-aware neural language models. In AAAI.
  • [Lafferty, McCallum, and Pereira2001] Lafferty, J. D.; McCallum, A.; and Pereira, F. C. N. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.
  • [LeCun, Bengio, and Hinton2015] LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature.
  • [Lee and Dernoncourt2016] Lee, J. Y., and Dernoncourt, F. 2016. Sequential short-text classification with recurrent and convolutional neural networks. In Proceedings of NAACL-HLT.
  • [Lendvai and Geertzen2007] Lendvai, P., and Geertzen, J. 2007. Token-based chunking of turn-internal dialogue act sequences. In SIGDIAL Workshop on Discourse and Dialogue.
  • [Ma and Hovy2016] Ma, X., and Hovy, E. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In ACL.
  • [Mikolov et al.2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.
  • [Paulus, Xiong, and Socher2017] Paulus, R.; Xiong, C.; and Socher, R. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
  • [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In EMNLP.
  • [Rabiner1989] Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE.
  • [Ribeiro, Ribeiro, and de Matos2015] Ribeiro, E.; Ribeiro, R.; and de Matos, D. M. 2015. The influence of context on dialogue act recognition. arXiv preprint arXiv:1506.00839.
  • [Sennrich, Haddow, and Birch2017] Sennrich, R.; Haddow, B.; and Birch, A. 2017. Neural machine translation of rare words with subword units. In ACL.
  • [Serban et al.2016] Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A. C.; and Pineau, J. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI.
  • [Serban et al.2017] Serban, I. V.; Sordoni, A.; Lowe, R.; Charlin, L.; Pineau, J.; Courville, A. C.; and Bengio, Y. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI.
  • [Sordoni et al.2015] Sordoni, A.; Bengio, Y.; Vahabi, H.; Lioma, C.; Grue Simonsen, J.; and Nie, J.-Y. 2015. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In CIKM.
  • [Stolcke et al.2006] Stolcke, A.; Ries, K.; Coccaro, N.; Shriberg, E.; Bates, R.; Jurafsky, D.; Taylor, P.; Martin, R.; Van Ess-Dykema, C.; and Meteer, M. 2006. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics 26(3).
  • [Tavafi et al.2013] Tavafi, M.; Mehdad, Y.; Joty, S. R.; Carenini, G.; and Ng, R. T. 2013. Dialogue act recognition in synchronous and asynchronous conversations. In SIGDIAL.
  • [Viterbi1967] Viterbi, A. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory.
  • [Zimmermann2009] Zimmermann, M. 2009. Joint segmentation and classification of dialog acts using conditional random fields. In InterSpeech.