In natural language processing research, the dialogue act (DA) concept plays an important role. DAs are semantic labels associated with each utterance in a conversational dialogue that indicate the speaker’s intention, e.g., question, backchannel, statement-non-opinion, statement-opinion. A key to modelling dialogue is detecting the intent of the speaker: correctly identifying a question gives an important clue for producing an appropriate response.
|A||Is there anyone who doesn’t know Nancy?|
|A||Do you - Do you know Nancy ?|
|B||I know Nancy|
As can be observed in Table 1, DA classification relies on the conversational context, i.e., predicting an utterance’s DA requires knowledge of the previous sentences and their associated act labels. For example, if a speaker asks a question, the interlocutor will answer with a response; analogously, a “Greeting” or a “Farewell” will be followed by a similar dialogue act. This means that in a conversation there is a sequential structure in the emitted dialogue acts. This provides the basis for adopting a novel perspective on the DA classification problem, i.e., moving from a multi-class classification task to a sequence labeling one.
Limitations of current models:
Current state-of-the-art models rely on a linear Conditional Random Field (CRF) combined with a recurrent neural network based encoder [crf_multi_task, LSTM_CRF, DAClassifCxt] to model DA sequential dependencies. Unfortunately, such approaches only capture local dependencies between two adjacent dialogue acts. For instance, in the example in Table 1, the last statement “I know Nancy” is a response to the first question “Is there anyone who doesn’t know Nancy?”, and knowledge of the previous backchannel does not help the prediction of the last dialogue act. Therefore, we must consider dependencies between labels with a scope that is wider than two successive utterances. In Neural Machine Translation (NMT), the problem of global dependencies has been addressed using seq2seq models [seq2seq] that follow the encoder-decoder framework. The encoder embeds an input sentence into a single hidden vector which contains both global and local dependencies, and the hidden vector is then decoded to produce an output sequence. In this work, we propose a seq2seq architecture tailored towards DA classification, paving the way for further innovations inspired by advances in NMT research.
Contributions: In this work (1) we formalise the Dialogue Act Prediction problem in a way that emphasises the relations between DA classification and NMT, (2) we demonstrate that the seq2seq architecture is better suited to the DA classification task and (3) we present a seq2seq model leveraging NMT techniques that reaches an accuracy of 85%, outperforming the state of the art by a margin of around 2%, on the Switchboard Dialogue Act Corpus (SwDA) [switchboard_da] and a state-of-the-art accuracy score of 91.6% on the Meeting Recorder Dialogue Act corpus (MRDA). This seq2seq model exploits a hierarchical encoder with a novel guided attention mechanism that fits our setting without any handcrafted features. We fine-tune our seq2seq model using a sequence-level training objective that makes use of the beam search algorithm. To our knowledge, this is among the first seq2seq models proposed for DA classification.
Several approaches have been proposed to tackle the DA classification problem. These methods can be divided into two different categories. The first class of methods relies on the independent classification of each utterance using various techniques, such as HMM [handcrafted_HMM], SVM [svm_dialog] and Bayesian Network [bayesian_dialog]. The second class, which achieves better performance, leverages the context to improve the classifier performance by using deep learning approaches to capture contextual dependencies between input sentences [RNN_CTX3_Softmax, LSTM_softmax]. Another refinement of input context-based classification is the modelling of inter-tag dependencies. This task is tackled as sequence-based classification where output tags are considered as a DA sequence [crf_multi_task, hierarchical_LSTM_CRF, handcrafted_HMM, LSTM_CRF, DAClassifCxt].
Two classical benchmarks are adopted to evaluate DA classification systems: the Switchboard Dialogue Act Corpus (SwDA) [switchboard_da] and the Meeting Recorder Dialogue Act corpus (MRDA) [mrda]. State-of-the-art techniques achieve an accuracy of 82.9% on SwDA [crf_multi_task, DAClassifCxt]. They adopt a hierarchical encoder to capture input contextual dependencies and a CRF to model inter-tag dependencies. The main limitation of this architecture is that a linear-chain CRF can only capture dependencies at a local level and fails to capture non-local dependencies. In this paper, we tackle this issue with a sequence-to-sequence model using a guided attention mechanism.
Seq2seq models have been successfully applied to NMT, where modeling non-local dependencies is a crucial challenge. DA classification can be seen as a problem where the goal is to map a sequence of utterances to a sequence of DAs. Thus, it can be formulated as a sequence-to-sequence problem very similar to NMT.
The general architecture of our seq2seq models [seq2seq] follows a classical encoder-decoder approach with attention [attention]. We use GRU cells [grupaper], since they are faster to train than LSTM ones [lstm_gru]. Recent advances have improved both the learning and the inference process, producing sequences that are more coherent by means of sequence-level losses [beam_seach_optimization] and various beam search settings [beam_search_alpha, dbs]. The closest setting where seq2seq models have been successfully used is dependency parsing [dependancy], where output dependencies are crucial to achieve state-of-the-art performance. In our work we adjust NMT techniques to the specifics of DA classification.
3 Problem statement
DA classification as an NMT problem
First, let us define the setting we adopt in this work. We have a set of conversations with the corresponding set of DA label sequences. A conversation is a sequence of utterances with the corresponding sequence of DA labels. Thus, each utterance is associated with a unique DA label drawn from the set of all possible dialogue acts. Finally, an utterance can be seen as a sequence of words. In NMT, the goal is to associate with any sentence in a source language a sentence in a target language, each sentence being a sequence of words. Using this formalism, it is straightforward to notice two main similarities between DA classification and NMT. (1) In both NMT and DA classification, the goal is to maximise the likelihood of the output sequence given the input sequence. (2) For the two tasks, there are strong dependencies between the units composing both the input and output sequences. In NMT, those units are words; in DA classification, those units are utterances and DA labels.
Specifics of DA classification
While NMT and DA classification are similar from some points of view, three differences are immediately apparent. (1) In NMT, the input units are words, whereas in DA classification the input units are utterances, which are themselves sequences of words. Considering the set of all possible utterance sequences as input (taking context into account leads to superior performance) implies that the dimension of the input space is several orders of magnitude larger than in standard NMT. (2) In DA classification, we have a perfect alignment between input and output sequences (hence the two sequences have the same length). Some languages, e.g., French, English and Italian, share a partial alignment, but in DA classification we have a strong mapping between each utterance and its label. (3) In NMT, the input space (number of words in the source language) is approximately the same size as the output space (number of words in the target language). In our case, the output space (number of DA tags) has a limited size, with a dimension that is many orders of magnitude smaller than that of the input space.
In the following, we propose an end-to-end seq2seq architecture for DA classification that addresses the first difference using a hierarchical encoder, exploits the second through a guided attention mechanism, and exploits the third by using beam search during both training and inference, taking advantage of the limited dimension of the output space.
In a seq2seq model, the encoder takes a sequence of sentences, represents it as a single vector and then passes it to the decoder for tag generation.
In this section we introduce the different encoders we consider in our experiments. We exploit the hierarchical structure of the dialogue to reduce the input space size and to preserve the word/sentence structure. During both training and inference, the context size is fixed. Formally, an encoder takes as input a fixed number of utterances and outputs a vector which serves to initialise the hidden state of the decoder. The first level of the encoder computes an embedding of each utterance based on the words composing it, and the next levels compute the output vector based on these utterance embeddings.
Vanilla RNN encoder: The vanilla RNN encoder introduced by [seq2seq] is considered as a baseline encoder. In the vanilla encoder, each utterance embedding is computed directly from the embeddings of its words. To better model dependencies between consecutive utterances, we use a bidirectional GRU [grupaper]:
Hierarchical encoders: The vanilla encoder can be improved by computing the utterance embedding using a bi-GRU. This hierarchical encoder (HGRU) is in line with the one introduced by [hierarchical_encoder]. Formally, the utterance embedding is defined as follows:
The encoder output is then computed using Equation 1. Intuitively, the first GRU layer (Equation 2) models dependencies between words (the hidden state of the word-level GRU is reset at each new utterance), and the second layer models dependencies between utterances.
Persona hierarchical encoders: In SwDA, a speaker turn can be split into several utterances. For example, if speaker A is interacting with speaker B we might encounter the sequence (AAABBBAA); in SwDA, around two thirds of the sentences have at least one AA or BB. We propose a novel Persona Hierarchical encoder (PersoHGRU) to better model speaker-utterance dependencies. We introduce a persona layer between the word and the sentence levels, see Figure 1:
The encoder output is then obtained following Equation 1, where the utterance-level representation is replaced by the persona-level one.
In this section, we introduce the different decoders we compare in our experiments. We introduce a novel form of attention that we name guided attention. Guided attention leverages the perfect alignment between input and output sequences. The decoder computes the probability of the sequence of output tags based on the encoder output (see Equation 1).
Vanilla decoder: The vanilla decoder is similar to the one introduced by [seq2seq].
Decoders with attention: In NMT, the attention mechanism forces the seq2seq model to learn to focus on specific parts of the sequence each time a new word is generated and lets the decoder correctly align the input sequence with the output sequence. In our case, we follow the approach described by [badau] and we define the context vector as:
where the attention weight scores how well the inputs around position j and the output at position i match. Since we have a perfect alignment, we know a priori on which utterance the decoder needs to focus more at each time step. Taking this aspect of the problem into account, we propose three different attention mechanisms.
Vanilla attention: This attention represents our baseline attention mechanism and it is the one proposed by [badau], where:
and the alignment model is parametrised as a feedforward neural network.
Hard guided attention: The hard guided attention forces the decoder to focus only on the aligned utterance while predicting the corresponding tag:
Soft guided attention: The soft guided attention guides the decoder to mainly focus on the aligned utterance while predicting the corresponding tag, but allows it to have a limited focus on other parts of the input sequence.
where the alignment model is parametrised as a feedforward neural network.
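The difference between vanilla and hard guided attention can be illustrated with a minimal sketch in plain Python. It assumes that vanilla attention normalises alignment scores with a softmax over the input utterances and that hard guided attention amounts to one-hot weights on the aligned utterance. The function names and the list-based representation are illustrative only and are not taken from the original implementation; the soft guided variant is not sketched here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def vanilla_attention_weights(scores):
    """scores[i][j] is an alignment score between output step i and input utterance j.

    Each row is normalised independently, so the decoder may spread its focus
    over every utterance in the context.
    """
    return [softmax(row) for row in scores]


def hard_guided_attention_weights(num_steps):
    """One-hot weights: output step i attends only to input utterance i.

    This relies on the perfect alignment between utterances and tags and
    needs no learned parameters.
    """
    return [[1.0 if j == i else 0.0 for j in range(num_steps)]
            for i in range(num_steps)]


def context_vectors(weights, encoder_states):
    """Weighted sum of encoder states for each output step."""
    dim = len(encoder_states[0])
    return [[sum(w * h[d] for w, h in zip(row, encoder_states))
             for d in range(dim)]
            for row in weights]
```

With hard guided weights, the context vector at step i is simply the encoder state of utterance i, whereas vanilla weights produce a mixture of all encoder states.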
Training and inference
In this section, we describe the training and inference strategies used for our models. A seq2seq model aims to find the best output sequence for a given source sequence. This poses a computational challenge when the output vocabulary size is large, since even with beam search it is expensive to explore multiple paths. Since our output vocabulary is limited to the set of DA tags, we do not run into this problem and we can use beam search during both training and inference.
Beam search: In our work we measure the sequence likelihood based on the following formula:
where the likelihood of the current target is normalised by its length, using a length normalisation coefficient [beam_search_alpha]. At each time step, only the most likely sequences are kept, their number corresponding to the beam size.
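The beam search procedure described above can be sketched as follows. This is an illustrative sketch only: the length penalty shown is the GNMT-style one, which is an assumption since the exact formula is not recoverable from the text, and `step_log_probs` is a hypothetical callback standing in for the decoder.

```python
def length_penalty(length, alpha):
    """GNMT-style length penalty (assumed form); alpha = 0 disables normalisation."""
    return ((5.0 + length) / 6.0) ** alpha


def beam_search(step_log_probs, num_steps, tags, beam_size, alpha=0.0):
    """Return the `beam_size` best tag sequences with their normalised scores.

    step_log_probs(prefix, t) is a hypothetical interface: it returns a dict
    mapping each tag to its log-probability at step t given the tuple `prefix`
    of previously emitted tags.
    """
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for t in range(num_steps):
        candidates = []
        for prefix, logp in beams:
            dist = step_log_probs(prefix, t)
            for tag in tags:
                candidates.append((prefix + (tag,), logp + dist[tag]))
        # Keep only the most likely sequences under the normalised score.
        candidates.sort(key=lambda c: c[1] / length_penalty(len(c[0]), alpha),
                        reverse=True)
        beams = candidates[:beam_size]
    return [(list(seq), logp / length_penalty(len(seq), alpha))
            for seq, logp in beams]
```

Because the number of DA tags is small, expanding every tag for every hypothesis at each step stays cheap, which is what makes beam search affordable during training as well as inference.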
Training objective: For training we follow [beam_seach_optimization] and train our model until convergence with a token-level loss, then fine-tune it by minimising the expected risk defined as:
where the candidate set contains the sequences generated by the model using a beam search algorithm for the given input, and the cost is defined, for a given candidate sequence and a target, as:
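As a rough illustration of an expected risk objective over beam candidates, the sketch below renormalises the model probabilities over the candidate set and averages a per-sequence cost. The Hamming-style cost used here is an assumption made for the example, since the cost definition is not recoverable from the text, and all function names are hypothetical.

```python
import math


def hamming_cost(candidate, target):
    """Fraction of positions where the candidate tag differs from the target tag.

    Assumed cost for illustration; it relies on the candidate and the target
    having the same length, which holds under the perfect alignment.
    """
    mismatches = sum(c != t for c, t in zip(candidate, target))
    return mismatches / len(target)


def expected_risk(candidates, log_probs, target, cost=hamming_cost):
    """Expected cost under the model distribution renormalised over the beam.

    candidates: list of tag sequences produced by beam search.
    log_probs:  model log-probability of each candidate (same order).
    """
    m = max(log_probs)
    weights = [math.exp(lp - m) for lp in log_probs]  # stable renormalisation
    z = sum(weights)
    return sum(cost(c, target) * w / z for c, w in zip(candidates, weights))
```

Minimising this quantity pushes probability mass towards candidates with low cost, which is a sequence-level signal rather than a per-token one.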
GRU/HGRU CRF baseline
State-of-the-art models use conditional random fields, which model dependencies between tags, on top of a GRU or an HGRU encoder that computes an embedding of a variable number of utterances. We have implemented our own CRF following the work of [hierarchical_LSTM_CRF]:
Here the CRF layer has its own set of parameters, and a feature function provides the unary and pairwise potentials. The dense representation of each utterance output by the encoder can be seen as the unary feature function.
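To make the role of unary and pairwise potentials concrete, here is a minimal sketch of scoring and Viterbi decoding for a generic linear-chain CRF. It is not the implementation used in this work; the potentials are given as plain nested lists and the function names are illustrative.

```python
def sequence_score(tags, unary, pairwise):
    """Score of a tag sequence: unary potentials plus adjacent-pair potentials.

    unary[t][k]    : potential of tag k at position t (e.g. from the encoder).
    pairwise[a][b] : potential of tag a being followed by tag b.
    """
    score = sum(unary[t][tag] for t, tag in enumerate(tags))
    score += sum(pairwise[a][b] for a, b in zip(tags, tags[1:]))
    return score


def viterbi(unary, pairwise):
    """Return the highest-scoring tag sequence and its score."""
    num_tags = len(unary[0])
    best = list(unary[0])  # best score of any path ending in each tag
    backpointers = []
    for t in range(1, len(unary)):
        new_best, pointers = [], []
        for cur in range(num_tags):
            prev = max(range(num_tags), key=lambda p: best[p] + pairwise[p][cur])
            new_best.append(best[prev] + pairwise[prev][cur] + unary[t][cur])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    last = max(range(num_tags), key=lambda k: best[k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, max(best)
```

The pairwise term only couples adjacent tags, which is why such a model captures dependencies between distant dialogue acts only indirectly.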
5 Experimental Protocol
In this section we describe the experimental protocols adopted for the evaluation of our approach.
We consider two classical datasets for Dialogue Act Classification: The Switchboard Dialogue Act Corpus and the MRDA. Since our models explicitly generate a sequence of tags we compute the accuracy on the last generated tag.
Both datasets are already segmented in utterances and each utterance is segmented in words.
For each dataset, we split each conversation into sequences of utterances of fixed length. This length is a hyperparameter; experiments have shown that 5 leads to the best results.
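The preprocessing and evaluation protocol described above can be sketched as follows. The sketch assumes overlapping windows with a stride of one utterance, which is one plausible reading of the text rather than a confirmed detail, and the function names are illustrative.

```python
def sliding_windows(utterances, tags, length=5):
    """Split one conversation into (utterances, tags) windows of fixed length.

    Assumption: windows overlap and advance by one utterance at a time, so
    each window ends on a different utterance of the conversation.
    """
    windows = []
    for end in range(length, len(utterances) + 1):
        windows.append((utterances[end - length:end], tags[end - length:end]))
    return windows


def last_tag_accuracy(predicted_sequences, gold_sequences):
    """Accuracy computed on the last tag of each generated sequence only."""
    correct = sum(p[-1] == g[-1]
                  for p, g in zip(predicted_sequences, gold_sequences))
    return correct / len(gold_sequences)
```

Scoring only the last tag means that earlier positions in a window act as context: errors on them do not count against the model directly.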
SwDA: The Switchboard-1 corpus is a telephone speech corpus [switchboard_da], consisting of about 2,400 two-sided telephone conversations among 543 speakers with about 70 provided conversation topics. The dataset includes information about the speakers and the topics and has 42 different tags. In this dataset, global dependencies play a key role due to the large amount of backchannel (19%), abandoned or turn-exit (5%) and uninterpretable (1%) acts. In this context, any model that only takes local dependencies into account will fail to extract the information needed to distinguish between ambiguous tags. For the confusion matrix, we follow [crf_multi_task] and present it for 10 tags only: statement-non-opinion (sd), backchannel (b), statement-opinion (sv), conventional-closing (fc), wh-question (qw), response acknowledgement (bk), hedge (h), open-question (qo), other answers (no), thanking (ft).
MRDA: The ICSI Meeting Recorder Dialogue Act corpus [mrda_2] contains 72 hours of naturally occurring multi-party meetings that were first converted into 75 word-level conversations, and then hand-annotated with DAs using the Meeting Recorder Dialogue Act Tagset. In this work we use 5 DAs, i.e., statements (s), questions (q), floorgrabber (f), backchannel (b), disruption (d).
Train/Dev/Test Splits: For both SwDA and MRDA we follow the official split introduced by [handcrafted_HMM]. Thus, our model can directly be compared to [crf_multi_task, LSTM_CRF, hierarchical_LSTM_CRF, DAClassifCxt].
All the hyper-parameters have been optimised on the validation set using accuracy computed on the last tag of the sequence. The embedding layer is initialised with pretrained fastText word vectors of size 300 [fastText], trained with subword information (on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset), and updated during training. (In our work we rely on the same pretrained embedding, word2vec [word2vec], instead of GloVe [glove].) Hyperparameter selection has been done using a random search on a fixed grid. Models have been implemented in PyTorch and trained on a single NVIDIA P100.
Parameters for SwDA: We used the Adam optimizer [adam] with a learning rate of 0.01, which is updated using a scheduler with a patience of 20 epochs and a decrease rate of 0.5. The gradient norm is clipped to 5.0, weight decay is set to 1e−5, and dropout [dropout] is set to 0.2. The maximum sequence length is set to 20. The best performing model has an encoder of size 128 and a decoder of size 48. For the vanilla encoder (VGRU), we use two layers for the BiGRU layer. For hierarchical models, we use a BiGRU with a single layer.
Parameters for MRDA: We used the AdamW optimizer [adamw] with a learning rate of 0.001, which is updated using a scheduler with a patience of 15 epochs and a decrease rate of 0.5. The gradient norm is clipped to 5.0, weight decay is set to 5e−5, and dropout [dropout] is set to 0.3. The maximum sequence length is set to 30. The best performing model has an encoder of size 40 and a decoder of size 400. For the vanilla encoder (VGRU), we use two layers for the BiGRU layer. For hierarchical models, we use a BiGRU with a single layer.
6 Experiments & Results
In this section we propose a set of experiments in order to investigate the performance of our model compared to existing approaches with respect to the difficulties highlighted in the introduction.
Experiment 1: Are seq2seq models better suited to DA prediction than CRFs?
Current state-of-the-art models are built on CRFs. In this first section, we aim at comparing a seq2seq model with a CRF-based model. To provide a fair comparison we perform the same grid search for all models on a fixed grid. At this step, we use neither attention nor beam search during training or inference. As shown in Table 2, with a vanilla RNN encoder the seq2seq model significantly outperforms the CRF on SwDA and MRDA. With an HGRU, the seq2seq model exhibits significantly higher results on SwDA and reaches comparable performance on MRDA. This behaviour suggests that a model based on a seq2seq architecture tends to achieve higher scores on DA classification than a CRF-based model.
Global dependencies analysis: In Table 3 we present two examples where our seq2seq model uses contextual information to disambiguate the tag and to predict the correct label. In the first example, “It can be a pain” without context can be interpreted either as statement-non-opinion (sd) or as statement-opinion (sv). Our seq2seq model uses the surrounding context (two sentences before) to disambiguate and assign the sv label. In the second example, the correct tag assigned to “Oh, okay” is a response acknowledgement (bk) and not a backchannel (b). The key difference between bk and b is that an utterance labelled with bk has to be produced within a question-answer context, whereas b is a continuer (this analysis is supported by Section 5.1.1 of the SwDA coder manual, https://web.stanford.edu/~jurafsky/ws97/manual.august1.html). In our example, the global context is a question/reply situation: the first speaker asks a question (“What school is it”), the second replies, and then the first speaker answers the reply. This observation reflects the fact that CRF models only handle local dependencies, whereas seq2seq models consider global ones as well.
|How long does that take you to get to work?||qw||qw||qw|
|Uh, about forty-five, fifty minutes.||sd||sd||sd|
|It can be a pain .||sd||sd||sv|
|So, what school is it?||qw||qw||qw|
|Uh, University of Rochester.||sd||sd||sd|
|Enc. / Dec.||att.||soft guid.||hard guid.|
Experiment 2: What is the best encoder?
In Table 4, we present the results of the three encoders presented in Section 4 on both datasets. For SwDA and MRDA, we observe that a seq2seq model equipped with a hierarchical encoder outperforms models with a vanilla RNN encoder, while reducing the number of learned parameters.
The decoder without attention does not play well with the PersoHGRU encoder. When combined with a guided attention mechanism, the PersoHGRU exhibits competitive accuracy on SwDA. However, on MRDA, adding a persona layer harms the accuracy. This suggests either that the information related to the speaker is irrelevant for our task (no improvement is observed when adding persona information), or that the considered hierarchy is not the optimal structure to leverage this information. Further investigations with several persona-based models inspired by the work of [persona_based] show the same poor improvement in terms of accuracy.
Our final model makes use of the HGRU encoder since in most of the settings it exhibits superior performance.
Experiment 3: Which attention mechanism to use?
The seq2seq encodes a source sentence into a fixed-length vector from which a decoder generates a sequence of tags. Attention forces the decoder to strengthen its focus on the source sentences that are relevant to predicting a label.
In NMT [attention], complementing a seq2seq model with attention helps generate better sentences. In Table 4 we see that, in most cases, the use of a simple attention mechanism provides a rather small improvement with VGRU and slightly harms the performance with an HGRU encoder. In the case of a seq2seq model composed of a PersoHGRU encoder and a decoder without attention, the learning fails: the decrease of the training loss is relatively small and the model fails to generalise. It appears that in DA classification, where sequences are short (5 tags), vanilla attention does not have as much impact as in NMT (which has longer sequences with more complex global dependencies).
If we consider an HGRU encoder, we observe that our proposed guided attention mechanisms improve dev accuracy, which demonstrates the importance of easing the task by using prior knowledge on the alignment between the utterances and the tags. Indeed, while decoding there is a direct correspondence between labels and utterances, meaning that each tag is associated with exactly one utterance. The soft guided attention will mainly focus on the current utterance with a small additional focus on the context, whereas hard guided attention will only consider the current utterance. The improvement due to guided attention demonstrates that the alignment between input and output is a key prior to include in our model.
Attention analysis: Figure 2 shows a representative example of the attention weights of the three different mechanisms. The seq2seq with a normal attention mechanism is characterised by a weight matrix far from the identity (especially the lower right part). While decoding the last tags, this lack of focus leads to a wrongly predicted label for a simple utterance: “Uh-Huh” (backchannel). Both guided attention mechanisms focus more on the sentence associated with the tag, at each time step, and predict successfully the last DA.
Since the hard guided attention decoder exhibits the best overall results (on both SwDA and MRDA) and does not require any additional parameters, we will use it for our final model.
Experiment 4: How to leverage beam search to improve the performance?
Beam Search allows the seq2seq model to consider alternative paths in the decoding phase.
Beam Search during inference: Using beam search provides only a small improvement. The considered beam sizes are small compared to other applications [mmi]. While increasing the beam size, we see that the beam search becomes very conservative [diversity_bea_search] and tends to output labels highly represented in the training set (e.g., sd for SwDA).
Compared to NMT, the output size is drastically smaller for DA classification. When considering alternative paths with a small output space in imbalanced datasets, the beam search is more likely to consider very unlikely sequences as alternatives (e.g., “s s s s s”).
Fine-tuning with a sequence loss: As previously mentioned, using beam search during inference only leads to a limited improvement in accuracy. We fine-tune a seq2seq model composed of an HGRU encoder and a decoder with hard guided attention (this model has been selected in the previous steps) with the sequence-level loss described in Section 4. Table 5 shows that this fine-tuning step improves the performance by 1% on SwDA (84% vs 85%) and by 1.2% on MRDA (90.4% vs 91.6%).
Our final model is composed of an HGRU encoder and a decoder with hard guided attention, fine-tuned with separate beam search settings for SwDA and for MRDA.
Experiment 5: Comparison with state-of-the-art models
In this section, we compare the performance of our final model with other state-of-the-art models and analyse the performance of the models. Table 6 shows the performance of the best performing model on the test set. Our model achieves an accuracy of 85% on the SwDA corpus. This model outperforms [LSTM_CRF] and [DAClassifCxt], which achieve an accuracy of 82.9%. On MRDA, our best performing model reaches an accuracy of 91.6%, whereas current state-of-the-art systems [LSTM_CRF, hierarchical_LSTM_CRF] achieve 92.2% and 91.7% respectively.
In this work, we have presented a novel approach to the DA classification problem. We have shown that our seq2seq model, using a newly devised guided attention mechanism, achieves state-of-the-art results thanks to its ability to better model global dependencies.
Appendix A
Additional details on the datasets
Tags in SwDA: SwDA extends the Switchboard-1 corpus with tags from the SWBD-DAMSL tagset. The 220 tags were reduced to 42 tags. The resulting tags include dialogue acts like statement-non-opinion, acknowledge, statement-opinion, agree/accept, etc. The average speaker turns per conversation, tokens per conversation, and tokens per utterance are 195.2, 1,237.8, and 7.0, respectively.
Full results for Experiment 4: How to leverage beam search to improve the performance?
Table 8 shows the influence of varying the beam size during inference.
|Encoder / Decoder||GRU att.||GRU soft guid. att.||GRU hard guid. att.|
The confusion matrix on SwDA (see Figure 3) illustrates that our model faces the same difficulties as human annotators: sd is often confused with sv, bk with b, and qo with qw. Due to the high imbalance of SwDA, our system fails to recognise underrepresented labels (e.g., no and ft).
The confusion matrix on MRDA shows that, here, DA classification is easier than on SwDA, with fewer tags and classes that are more easily distinguished. Our model reaches a perfect score at recognising questions. One of the reasons for the mislabelling between backchannel (b) and statement (s) is that the MRDA dataset is highly imbalanced, with more than 50% of the utterances labelled as class s.