
Do Neural Dialog Systems Use the Conversation History Effectively? An Empirical Study

by   Chinnadhurai Sankar, et al.

Neural generative models have become increasingly popular for building conversational agents. They offer flexibility, can be easily adapted to new domains, and require minimal domain engineering. A common criticism of these systems is that they seldom understand or use the available dialog history effectively. In this paper, we take an empirical approach to understanding how these models use the available dialog history by studying the sensitivity of the models to artificially introduced unnatural changes or perturbations to their context at test time. We experiment with 10 different types of perturbations on 4 multi-turn dialog datasets and find that commonly used neural dialog architectures like recurrent and transformer-based seq2seq models are rarely sensitive to most perturbations such as missing or reordering utterances, shuffling words, etc. We also open-source our code, which we believe will serve as a useful diagnostic tool for evaluating dialog systems in the future.



1 Introduction

With recent advancements in generative models of text Wu et al. (2016); Vaswani et al. (2017); Radford et al. (2018), neural approaches to building chit-chat and goal-oriented conversational agents Sordoni et al. (2015); Vinyals and Le (2015); Serban et al. (2016); Bordes and Weston (2016); Serban et al. (2017b) have gained popularity, with the hope that advancements in tasks like machine translation Bahdanau et al. (2015) and abstractive summarization See et al. (2017) should translate to dialog systems as well. While these models have demonstrated the ability to generate fluent responses, they still lack the ability to “understand” and process the dialog history to produce coherent and interesting responses. They often produce boring and repetitive responses like “Thank you.” Li et al. (2015); Serban et al. (2017a) or meander away from the topic of conversation. This has often been attributed to the manner and extent to which these models use the dialog history when generating responses. However, there has been little empirical investigation to validate these speculations.

In this work, we take a step in that direction and confirm some of these speculations, showing that models do not make use of a lot of the information available to them, by subjecting the dialog history to a variety of synthetic perturbations. We then empirically observe how recurrent Sutskever et al. (2014) and transformer-based Vaswani et al. (2017) sequence-to-sequence (seq2seq) models respond to these changes. The central premise of this work is that models make minimal use of certain types of information if they are insensitive to perturbations that destroy them. Worryingly, we find that 1) both recurrent and transformer-based seq2seq models are insensitive to most kinds of perturbations considered in this work, 2) both are particularly insensitive even to extreme perturbations such as randomly shuffling or reversing words within every utterance in the conversation history (see Table 1), and 3) recurrent models are more sensitive to the ordering of utterances within the dialog history, suggesting that they could be modeling conversation dynamics better than transformers.

2 Related Work

Since this work aims at investigating and gaining an understanding of the kinds of information a generative neural response model learns to use, the most relevant pieces of work are where similar analyses have been carried out to understand the behavior of neural models in other settings. An investigation into how LSTM based unconditional language models use available context was carried out by Khandelwal et al. (2018). They empirically demonstrate that models are sensitive to perturbations only in the nearby context and typically use only about 150 words of context. On the other hand, in conditional language modeling tasks like machine translation, models are adversely affected by both synthetic and natural noise introduced anywhere in the input (Belinkov and Bisk, 2017). Understanding what information is learned or contained in the representations of neural networks has also been studied by “probing” them with linear or deep models Adi et al. (2016); Subramanian et al. (2018); Conneau et al. (2018).

Several works have recently pointed out the presence of annotation artifacts in common text and multi-modal benchmarks. For example, Gururangan et al. (2018) demonstrate that hypothesis-only baselines for natural language inference obtain results significantly better than random guessing. Kaushik and Lipton (2018) report that reading comprehension systems can often ignore the entire question or use only the last sentence of a document to answer questions. Anand et al. (2018) show that an agent that does not navigate or even see the world around it can answer questions about it as well as one that does. These pieces of work suggest that while neural methods have the potential to learn the task specified, the task's design could lead them to do so in a manner that doesn’t use all of the available information within the task.

Recent work has also investigated the inductive biases that different sequence models learn. For example, Tran et al. (2018) find that recurrent models are better at modeling hierarchical structure while Tang et al. (2018) find that feedforward architectures like the transformer and convolutional models are not better than RNNs at modeling long-distance agreement. Transformers however excel at word-sense disambiguation. We analyze whether the choice of architecture and the use of an attention mechanism affect the way in which dialog systems use information available to them.

Figure 1: The increase in perplexity for different models when only presented with the most recent utterances from the dialog history for Dailydialog (left) and bAbI dialog (right) datasets. Recurrent models with attention fare better than transformers, since they use more of the conversation history.

3 Experimental Setup

Following the recent line of work on generative dialog systems, we treat the problem of generating an appropriate response given a conversation history as a conditional language modeling problem. Specifically, we want to learn a conditional probability distribution $P_\theta(y \mid x)$, where $y$ is a reasonable response given the conversation history $x$. The conversation history is typically represented as a sequence of utterances $x_1, x_2, \ldots, x_n$, where each utterance $x_i$ is itself comprised of a sequence of words $x_{i_1}, x_{i_2}, \ldots, x_{i_k}$. The response $y$ is a single utterance, also comprised of a sequence of words $y_1, y_2, \ldots, y_m$. The overall conditional probability is factorized autoregressively as

$$P_\theta(y \mid x) = \prod_{i=1}^{m} P_\theta(y_i \mid y_{<i}, x_1, \ldots, x_n).$$

$P_\theta$, in this work, is parameterized by a recurrent or transformer-based seq2seq model. The crux of this work is to study how the learned probability distribution behaves as we artificially perturb the conversation history $x_1, \ldots, x_n$. We measure behavior by looking at how much the per-token perplexity increases under these changes. For example, one could think of shuffling the order in which $x_1, \ldots, x_n$ is presented to the model and observe how much the perplexity of $y$ under the model increases. If the increase is only minimal, we can conclude that the ordering of $x_1, \ldots, x_n$ isn’t informative to the model. For a complete list of perturbations considered in this work, please refer to Section 3.2. All models are trained without any perturbations and sensitivity is studied only at test time.
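The measurement described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: it assumes that some trained model (not shown) has already produced the natural-log probability of every response token, once with the clean history and once with the perturbed history. The function names `corpus_perplexity` and `perplexity_increase` are hypothetical.

```python
import math
from typing import Sequence


def corpus_perplexity(responses: Sequence[Sequence[float]]) -> float:
    """Per-token perplexity over a set of responses.

    Each inner sequence holds the natural-log probabilities that a model
    assigned to the tokens of one response, given the dialog history.
    """
    total_log_prob = sum(sum(r) for r in responses)
    total_tokens = sum(len(r) for r in responses)
    if total_tokens == 0:
        raise ValueError("need at least one token")
    # Perplexity is the exponential of the average negative log-likelihood.
    return math.exp(-total_log_prob / total_tokens)


def perplexity_increase(
    clean: Sequence[Sequence[float]],
    perturbed: Sequence[Sequence[float]],
) -> float:
    """Increase in perplexity when the same responses are scored
    against a perturbed history instead of the clean one."""
    return corpus_perplexity(perturbed) - corpus_perplexity(clean)
```

A small increase means the model scores the reference responses almost as well without the destroyed information, which is the sense in which a model is called insensitive to a perturbation.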

3.1 Datasets

We experiment with four multi-turn dialog datasets.

bAbI dialog

is a synthetic goal-oriented multi-turn dataset (Bordes and Weston, 2016) consisting of 5 different tasks for restaurant booking with increasing levels of complexity. We consider Task 5 in our experiments since it is the hardest and is a union of all four tasks. It contains dialogs with an average of user utterances per dialog.

Persona Chat

is an open domain dataset (Zhang et al., 2018) with multi-turn chit-chat conversations between turkers who are each assigned a “persona” at random. It comprises of dialogs with an average of turns per dialog.


Dailydialog

is an open domain dataset (Li et al., 2017) which consists of dialogs that resemble day-to-day conversations across multiple topics. It comprises of dialogs with an average of turns per dialog.


MutualFriends

is a multi-turn goal-oriented dataset (He et al., 2017) where two agents must discover which friend of theirs is mutual based on the friends’ attributes. It contains dialogs with an average of utterances per dialog.

| Dataset | Model | Test PPL | Only Last | Shuf | Rev | Drop First | Drop Last | Word Drop | Verb Drop | Noun Drop | Word Shuf | Word Rev |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dailydialog | seq2seq_lstm | 32.90 | 1.70 | **3.35** | **4.04** | 0.13 | **5.08** | 1.58 | 0.87 | 1.06 | **3.37** | 3.10 |
| Dailydialog | seq2seq_lstm_att | 29.65 | **4.76** | 2.54 | 3.31 | **0.32** | 4.84 | **2.03** | **1.37** | 2.22 | 2.82 | **3.29** |
| Dailydialog | transformer | 28.73 | 3.28 | 0.82 | 1.25 | 0.27 | 2.43 | 1.20 | 0.63 | **2.60** | 0.15 | 0.26 |
| Persona Chat | seq2seq_lstm | 43.24 | 3.27 | 6.29 | **13.11** | 0.47 | **6.10** | 1.81 | 0.68 | 0.75 | 1.29 | 1.95 |
| Persona Chat | seq2seq_lstm_att | 42.90 | **4.44** | **6.70** | 11.61 | **2.99** | 5.58 | **2.47** | **1.11** | **1.20** | **2.03** | **2.39** |
| Persona Chat | transformer | 40.78 | 1.90 | 1.22 | 1.41 | 0.1 | 1.59 | 0.54 | 0.40 | 0.32 | 0.01 | 0.00 |
| MutualFriends | seq2seq_lstm | 14.17 | 1.44 | **1.42** | 1.24 | 0.00 | 0.76 | 0.28 | 0.00 | 0.61 | 0.31 | 0.56 |
| MutualFriends | seq2seq_lstm_att | 10.60 | **32.13** | 1.24 | 1.06 | 0.08 | **1.35** | **1.56** | 0.15 | **3.28** | **2.35** | **4.59** |
| MutualFriends | transformer | 10.63 | 20.11 | 1.06 | **1.62** | **0.12** | 0.81 | 0.75 | **0.16** | 1.50 | 0.07 | 0.13 |
| bAbI dialog: Task 5 | seq2seq_lstm | 1.28 | 1.31 | **43.61** | **40.99** | 0.00 | 4.28 | 0.38 | 0.01 | 0.10 | 0.09 | 0.42 |
| bAbI dialog: Task 5 | seq2seq_lstm_att | 1.06 | **9.14** | 41.21 | 34.32 | 0.00 | **6.75** | **0.64** | 0.03 | 0.22 | **0.25** | **1.10** |
| bAbI dialog: Task 5 | transformer | 1.07 | 4.06 | 0.38 | 0.62 | 0.00 | 0.21 | 0.36 | **0.25** | **0.37** | 0.00 | 0.00 |

Table 2: Model performance across multiple datasets and sensitivity to different perturbations. The Test PPL column reports the test set perplexity (without perturbations) of each model. The remaining ten columns report the increase in perplexity when models are subjected to different perturbations: Only Last through Drop Last are utterance-level perturbations, and Word Drop through Word Rev are word-level perturbations. The mean and standard deviation across 5 runs are reported. The Only Last column presents models with only the last utterance from the dialog history. The model that exhibits the highest sensitivity (higher the better) to a particular perturbation on a dataset is in bold. seq2seq_lstm_att are the most sensitive models 24/40 times, while transformers are the least with 6/40 times.
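The tally in the caption of Table 2 can be checked directly from the reported perplexity increases. The sketch below copies the ten perturbation columns of Table 2 and counts, for each dataset and perturbation, which model shows the largest increase. The dataset labels are used only as dictionary keys; the labels for the first and third groups of rows are inferred from Section 3.1 and are an assumption. Leaving ties uncredited is also an assumption of this sketch, and `count_most_sensitive` is a hypothetical helper name.

```python
from collections import Counter

# Column order: Only Last, Shuf, Rev, Drop First, Drop Last,
# Word Drop, Verb Drop, Noun Drop, Word Shuf, Word Rev
TABLE_2 = {
    "Dailydialog": {
        "seq2seq_lstm": [1.70, 3.35, 4.04, 0.13, 5.08, 1.58, 0.87, 1.06, 3.37, 3.10],
        "seq2seq_lstm_att": [4.76, 2.54, 3.31, 0.32, 4.84, 2.03, 1.37, 2.22, 2.82, 3.29],
        "transformer": [3.28, 0.82, 1.25, 0.27, 2.43, 1.20, 0.63, 2.60, 0.15, 0.26],
    },
    "Persona Chat": {
        "seq2seq_lstm": [3.27, 6.29, 13.11, 0.47, 6.10, 1.81, 0.68, 0.75, 1.29, 1.95],
        "seq2seq_lstm_att": [4.44, 6.70, 11.61, 2.99, 5.58, 2.47, 1.11, 1.20, 2.03, 2.39],
        "transformer": [1.90, 1.22, 1.41, 0.1, 1.59, 0.54, 0.40, 0.32, 0.01, 0.00],
    },
    "MutualFriends": {
        "seq2seq_lstm": [1.44, 1.42, 1.24, 0.00, 0.76, 0.28, 0.00, 0.61, 0.31, 0.56],
        "seq2seq_lstm_att": [32.13, 1.24, 1.06, 0.08, 1.35, 1.56, 0.15, 3.28, 2.35, 4.59],
        "transformer": [20.11, 1.06, 1.62, 0.12, 0.81, 0.75, 0.16, 1.50, 0.07, 0.13],
    },
    "bAbI dialog: Task 5": {
        "seq2seq_lstm": [1.31, 43.61, 40.99, 0.00, 4.28, 0.38, 0.01, 0.10, 0.09, 0.42],
        "seq2seq_lstm_att": [9.14, 41.21, 34.32, 0.00, 6.75, 0.64, 0.03, 0.22, 0.25, 1.10],
        "transformer": [4.06, 0.38, 0.62, 0.00, 0.21, 0.36, 0.25, 0.37, 0.00, 0.00],
    },
}


def count_most_sensitive(table):
    """Count how often each model has the strictly largest perplexity
    increase for a (dataset, perturbation) pair. Ties credit no model."""
    wins = Counter()
    for rows in table.values():
        models = list(rows)
        n_columns = len(rows[models[0]])
        for col in range(n_columns):
            values = [rows[m][col] for m in models]
            best = max(values)
            if values.count(best) == 1:
                wins[models[values.index(best)]] += 1
    return wins


wins = count_most_sensitive(TABLE_2)
```

Under these assumptions, `wins` credits seq2seq_lstm_att 24 times and transformer 6 times, in agreement with the caption. The one uncredited cell is Drop First on bAbI dialog, where all three models report 0.00.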

3.2 Types of Perturbations

We experimented with several types of perturbation operations at the utterance and word (token) levels. All perturbations are applied in isolation.

Utterance-level perturbations

We consider the following operations: 1) Shuf, which shuffles the sequence of utterances in the dialog history; 2) Rev, which reverses the order of utterances in the history (but maintains word order within each utterance); 3) Drop, which completely drops certain utterances; and 4) Truncate, which truncates the dialog history to contain only the k most recent utterances, where k ≤ n and n is the length of the dialog history.
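The utterance-level operations can be sketched as plain list transformations. This is an illustrative sketch rather than the authors' released code: it assumes the history is a list of utterance strings ordered oldest first, it implements Drop as the two variants named in the Table 2 header (Drop First and Drop Last), and all function names are hypothetical.

```python
import random
from typing import List, Optional

History = List[str]  # one string per utterance, oldest first


def shuf(history: History, rng: Optional[random.Random] = None) -> History:
    """Randomly permute the utterances; words inside each one are untouched."""
    if rng is None:
        rng = random.Random()
    shuffled = list(history)
    rng.shuffle(shuffled)
    return shuffled


def rev(history: History) -> History:
    """Reverse the order of utterances, keeping word order within each."""
    return list(reversed(history))


def drop_first(history: History) -> History:
    """Remove the oldest utterance."""
    return history[1:]


def drop_last(history: History) -> History:
    """Remove the most recent utterance."""
    return history[:-1]


def truncate(history: History, k: int) -> History:
    """Keep only the k most recent utterances (k >= 1)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return history[-k:]
```

In this sketch, `truncate(history, 1)` corresponds to the Only Last setting, where the model sees only the last utterance of the dialog history.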

Word-level perturbations

We consider similar operations but at the word level within every utterance: 1) word-shuffle, which randomly shuffles the words within an utterance; 2) reverse, which reverses the ordering of words; 3) word-drop, which drops 30% of the words uniformly; 4) noun-drop, which drops all nouns; and 5) verb-drop, which drops all verbs.
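A minimal sketch of the word-level operations follows, with several assumptions stated up front: words are obtained by whitespace splitting; word-drop is read as removing a uniformly sampled 30% of word positions (dropping each word independently with probability 0.3 is another plausible reading); and noun-drop and verb-drop receive part-of-speech tags from an external tagger that is not shown, with Penn Treebank style prefixes ("NN", "VB") assumed. The source does not specify the tokenizer or the tagger, and all names here are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def word_shuffle(utterance: str, rng: random.Random) -> str:
    """Randomly permute the words of one utterance."""
    words = utterance.split()
    rng.shuffle(words)
    return " ".join(words)


def word_reverse(utterance: str) -> str:
    """Reverse the order of words in one utterance."""
    return " ".join(reversed(utterance.split()))


def word_drop(utterance: str, rng: random.Random, rate: float = 0.3) -> str:
    """Drop a uniformly sampled fraction `rate` of the word positions."""
    words = utterance.split()
    n_drop = int(round(rate * len(words)))
    dropped = set(rng.sample(range(len(words)), n_drop))
    return " ".join(w for i, w in enumerate(words) if i not in dropped)


def drop_by_tag(tagged: Sequence[Tuple[str, str]], tag_prefix: str) -> str:
    """Drop every word whose POS tag starts with `tag_prefix`.

    `tagged` is a sequence of (word, tag) pairs from an external tagger.
    Use "NN" for noun-drop and "VB" for verb-drop under Penn Treebank tags.
    """
    return " ".join(w for w, tag in tagged if not tag.startswith(tag_prefix))


def perturb_history(history: List[str], fn) -> List[str]:
    """Apply a single-utterance perturbation to every utterance."""
    return [fn(u) for u in history]
```

For example, `perturb_history(history, word_reverse)` reverses the words inside every utterance while leaving the order of utterances unchanged, which matches the description that word-level operations act within every utterance.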

3.3 Models

We experimented with two different classes of models - recurrent and transformer-based sequence-to-sequence generative models. All data loading, model implementations and evaluations were done using the ParlAI framework. We used the default hyper-parameters for all the models as specified in ParlAI.

Recurrent Models

We trained a seq2seq (seq2seq_lstm) model where the encoder and decoder are parameterized as LSTMs Hochreiter and Schmidhuber (1997). We also experiment with using decoders that use an attention mechanism (seq2seq_lstm_att) Bahdanau et al. (2015). The encoder and decoder LSTMs have 2 layers with 128 dimensional hidden states with a dropout rate of 0.1.


Transformer

Our transformer Vaswani et al. (2017) model uses 300 dimensional embeddings and hidden states, 2 layers and 2 attention heads with no dropout. This model is significantly smaller than the ones typically used in machine translation since we found that the model that resembled Vaswani et al. (2017) significantly overfit on all our datasets.

While the models considered in this work might not be state-of-the-art on the datasets considered, we believe these models are still competitive and used commonly enough at least as baselines, that the community will benefit by understanding their behavior. In this paper, we use early stopping with a patience of on the validation set to save our best model. All models achieve close to the perplexity numbers reported for generative seq2seq models in their respective papers.

4 Results & Discussion

Our results are presented in Table 2 and Figure 1. Table 2 reports the perplexities of different models on the test set in the second column, followed by the increase in perplexity when the dialog history is perturbed using the method specified in the column header. Rows correspond to models trained on different datasets. Figure 1 presents the change in perplexity for models when presented only with the most recent utterances from the dialog history.

We make the following observations:

  1. Models tend to show only tiny changes in perplexity in most cases, even under extreme changes to the dialog history, suggesting that they use far from all the information that is available to them.

  2. Transformers are insensitive to word-reordering, indicating that they could be learning bag-of-words like representations.

  3. The use of an attention mechanism in seq2seq_lstm_att and transformers makes these models use more information from earlier parts of the conversation than vanilla seq2seq models as seen from increases in perplexity when using only the last utterance.

  4. While transformers converge faster and to lower test perplexities, they don’t seem to capture the conversational dynamics across utterances in the dialog history and are less sensitive to perturbations that scramble this structure than recurrent models.

5 Conclusion

This work studies the behaviour of generative neural dialog systems in the presence of synthetically introduced perturbations to the dialog history that they condition on. We find that both recurrent and transformer-based seq2seq models are not significantly affected even by drastic and unnatural modifications to the dialog history. We also find subtle differences between the way in which recurrent and transformer-based models use available context. We open-source our code and believe that this paradigm of studying model behavior by introducing perturbations that destroy different kinds of structure present within the dialog history can be a useful diagnostic tool. We also foresee this paradigm being useful when building new dialog datasets to understand the kinds of information models use to solve them.


We would like to acknowledge NVIDIA for donating GPUs and a DGX-1 computer used in this work. We would also like to thank the anonymous reviewers for their constructive feedback. Our code is available at