
Analysing the potential of seq-to-seq models for incremental interpretation in task-oriented dialogue

We investigate how encoder-decoder models trained on a synthetic dataset of task-oriented dialogues process disfluencies, such as hesitations and self-corrections. We find that, contrary to earlier results, disfluencies have very little impact on the task success of seq-to-seq models with attention. Using visualisation and diagnostic classifiers, we analyse the representations that are incrementally built by the model, and discover that models develop little to no awareness of the structure of disfluencies. However, adding disfluencies to the data appears to help the model create clearer representations overall, as evidenced by the attention patterns the different models exhibit.


1 Introduction

The use of Recurrent Neural Networks (RNNs) to tackle sequential language tasks has become standard in natural language processing, after impressive accomplishments in speech recognition, machine translation, and entailment (e.g., sutskever2014sequence; bahdanau2015attention; kalchbrenner2014convolutional). Recently, RNNs have also been exploited as tools to model dialogue systems. Inspired by neural machine translation, researchers such as ritter and vinyals2015neural pioneered an approach to open-domain chit-chat conversation based on sequence-to-sequence models (sutskever2014sequence). In this paper, we focus on task-oriented dialogue, where the conversation serves to fulfil an independent goal in a given domain. Current neural dialogue models for task-oriented dialogue tend to equip systems with external memory components (bordes2016learning), since key information needs to be stored for potentially long time spans. One of our goals here is to analyse to what extent sequence-to-sequence models without external memory can deal with this challenge.

In addition, we consider language realisations that include disfluencies common in dialogue interaction, such as repetitions and self-corrections (e.g., I’d like to make a reservation for six, I mean, for eight people). Disfluencies have been investigated extensively in psycholinguistics, with a range of studies showing that they affect sentence processing in intricate ways (levelt1983monitoring; tree1995effects; bailey2003disfluencies; FerreiraBailey2004; LauFerreira2005; brennan2001listeners). Most computational work on disfluencies, however, has focused on detection rather than on disfluency processing and interpretation (e.g., stolcke1996statistical; heeman1999speech; zwarts2010detecting; qian2013disfluency; HoughPurver2014; HoughSchlangen2017). In contrast, our aim is to get a better understanding of how RNNs process disfluent utterances and to analyse the impact of such disfluencies on a downstream task—in this case, issuing an API request reflecting the preferences of the user in a task-oriented dialogue.

For our experiments, we use the synthetic dataset bAbI (bordes2016learning) and a modified version of it called bAbI+ which includes disfluencies (shalyminov2017challenging). The dataset contains simple dialogues between a user and a system in the restaurant reservation domain, which terminate with the system issuing an API call that encodes the user’s request. In bAbI+, disfluencies are probabilistically inserted into user turns, following distributions in human data. Thus, while the data is artificial and certainly simplistic, its goal-oriented nature offers a rare opportunity: by assessing whether the system issues the right API call, we can study, in a controlled way, whether and how the model builds up a relevant semantic/pragmatic interpretation when processing a disfluent utterance—a key aspect that would not be available with unannotated natural data.

2 Data

In this section, we discuss the two datasets we use for our experiments: bAbI (bordes2016learning) and bAbI+ (shalyminov2017challenging).

2.1 bAbI

The bAbI dataset consists of a series of synthetic dialogues in English, representing human-computer interactions in the context of restaurant reservations. The data is broken down into six subtasks that individuate different abilities that dialogue systems should have to conduct a successful conversation with a human. We focus on Task 1, which tests the capacity of a system to ask the right questions and integrate the answers of the user to issue an API call that matches the user’s preferences regarding four semantic slots: cuisine, location, price range, and party size. A sample dialogue can be found in example LABEL:ex:repair, Section LABEL:sec:error.
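As an illustration of what matching the user's preferences on the four slots involves, the following minimal sketch parses an API call and compares it with a gold call slot by slot. The surface form `api_call <cuisine> <location> <party size> <price range>`, the slot order, and the function names `parse_api_call` and `slot_matches` are assumptions made for this example; they are not taken from the text above.

```python
from typing import Dict

# Assumed slot order of a Task 1 API call (hypothetical, for illustration only).
SLOTS = ("cuisine", "location", "party_size", "price_range")


def parse_api_call(utterance: str) -> Dict[str, str]:
    """Split an utterance such as 'api_call italian rome six cheap' into slots."""
    tokens = utterance.split()
    if len(tokens) != len(SLOTS) + 1 or tokens[0] != "api_call":
        raise ValueError(f"not a well-formed API call: {utterance!r}")
    return dict(zip(SLOTS, tokens[1:]))


def slot_matches(predicted: str, gold: str) -> Dict[str, bool]:
    """Report, for every slot, whether the predicted value equals the gold value."""
    pred_slots, gold_slots = parse_api_call(predicted), parse_api_call(gold)
    return {slot: pred_slots[slot] == gold_slots[slot] for slot in SLOTS}


def is_correct(predicted: str, gold: str) -> bool:
    """An API call counts as correct only if all four slots match."""
    return all(slot_matches(predicted, gold).values())
```

Reporting matches per slot, rather than a single exact-match flag, makes it possible to see which preference was lost when a call is wrong.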


The training data for Task 1 is deliberately kept simple and small, consisting of 1000 dialogues with on average 5 user and 7 system utterances. An additional 1000 dialogues each, based on different user queries, are available for validation and testing. The overall vocabulary contains 86 distinct words. There are 7 distinct system utterances and 300 possible API calls.


Together with the dataset, bordes2016learning present several baseline models for the task. All the methods proposed are retrieval-based, i.e., the models are trained to select the best system response from a set of candidate responses (in contrast to the models we investigate in the present work, which are generative; see Section LABEL:sec:results). The baseline models include classical information retrieval (IR) methods such as TF-IDF and nearest neighbour approaches, as well as an end-to-end recurrent neural network. bordes2016learning demonstrate that the end-to-end recurrent architecture, a memory network (sukhbaatar2015memory), outperforms the classical IR methods as well as supervised embeddings, obtaining 100% accuracy on retrieving the correct API calls.
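To make the retrieval setting concrete, the sketch below selects the candidate response with the highest TF-IDF cosine similarity to the dialogue context. It is a generic illustration, not the baseline implementation of bordes2016learning: the smoothed IDF formula, the whitespace tokenisation, and the names `tfidf_vectors`, `cosine`, and `retrieve` are assumptions made for this example.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Map each tokenised document to a sparse TF-IDF vector (smoothed IDF)."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    return [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()} for doc in docs]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(context: str, candidates: List[str]) -> str:
    """Return the candidate response most similar to the dialogue context."""
    docs = [cand.split() for cand in candidates] + [context.split()]
    vectors = tfidf_vectors(docs)
    query, cand_vectors = vectors[-1], vectors[:-1]
    scores = [cosine(query, vec) for vec in cand_vectors]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]
```

A retrieval model of this kind can only ever return one of the given candidates, whereas a generative model produces the response token by token.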

2.2 bAbI+

shalyminov2017challenging observe that the original bAbI data lack naturalness and variation common in actual dialogue interaction. To introduce such variation while keeping lexical variation constant, they insert speech disfluencies, using a fixed set of templates that are probabilistically applied to the user turns of the original bAbI Task 1 dataset. In particular, three types of disfluencies are introduced: hesitations (Example 1), restarts LABEL:ex:restart, and self-corrections LABEL:ex:correction, in around 21%, 40% and 5% of the user’s turns, respectively. The inserted material is in italics in the examples.

Example 1. We will be uhm eight
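The probabilistic insertion described above can be sketched as follows. This is a minimal illustration, not the bAbI+ generation code of shalyminov2017challenging: only the three per-turn rates come from the text, while the concrete templates ("uhm", "uhm yeah", "no sorry"), the independent application of the three types, and all function names are assumptions made for this example.

```python
import random
from typing import Dict, List, Sequence

# Per-turn rates reported for bAbI+; applying them independently is an assumption.
RATES = {"hesitation": 0.21, "restart": 0.40, "correction": 0.05}


def add_hesitation(tokens: List[str], rng: random.Random) -> List[str]:
    """Insert a filler after a randomly chosen token (hypothetical template)."""
    if not tokens:
        return tokens
    pos = rng.randrange(1, len(tokens) + 1)
    return tokens[:pos] + ["uhm"] + tokens[pos:]


def add_restart(tokens: List[str], rng: random.Random) -> List[str]:
    """Abandon the turn after a random prefix, then start it over."""
    if not tokens:
        return tokens
    pos = rng.randrange(1, len(tokens) + 1)
    return tokens[:pos] + ["uhm", "yeah"] + tokens


def add_correction(tokens: List[str], slot_values: Sequence[str],
                   rng: random.Random) -> List[str]:
    """Precede one slot value with a wrong value and an editing phrase."""
    positions = [i for i, tok in enumerate(tokens) if tok in slot_values]
    if not positions:
        return tokens
    i = rng.choice(positions)
    wrong = [value for value in slot_values if value != tokens[i]]
    if not wrong:
        return tokens
    return tokens[:i] + [rng.choice(wrong), "no", "sorry"] + tokens[i:]


def make_disfluent(turn: str, slot_values: Sequence[str], rng: random.Random,
                   rates: Dict[str, float] = RATES) -> str:
    """Apply each disfluency type to a user turn with its own probability."""
    tokens = turn.split()
    if rng.random() < rates["correction"]:
        tokens = add_correction(tokens, slot_values, rng)
    if rng.random() < rates["hesitation"]:
        tokens = add_hesitation(tokens, rng)
    if rng.random() < rates["restart"]:
        tokens = add_restart(tokens, rng)
    return " ".join(tokens)
```

Passing a seeded `random.Random` instance keeps the generated data reproducible. Because every inserted word is drawn from a fixed template or from the existing slot vocabulary, the sketch adds surface variation without introducing new content words.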