The use of Recurrent Neural Networks (RNNs) for sequential language tasks has become standard in natural language processing, following impressive results in speech recognition, machine translation, and entailment (e.g., sutskever2014sequence; bahdanau2015attention; kalchbrenner2014convolutional). Recently, RNNs have also been used to model dialogue systems. Inspired by neural machine translation, researchers such as ritter and vinyals2015neural pioneered an approach to open-domain chit-chat conversation based on sequence-to-sequence models (sutskever2014sequence). In this paper, we focus on task-oriented dialogue, where the conversation serves to fulfil an independent goal in a given domain. Current neural models for task-oriented dialogue tend to equip systems with external memory components (bordes2016learning), since key information needs to be stored for potentially long time spans. One of our goals here is to analyse to what extent sequence-to-sequence models without external memory can deal with this challenge.
In addition, we consider language realisations that include disfluencies common in dialogue interaction, such as repetitions and self-corrections (e.g., I’d like to make a reservation for six, I mean, for eight people). Disfluencies have been investigated extensively in psycholinguistics, with a range of studies showing that they affect sentence processing in intricate ways (levelt1983monitoring; tree1995effects; bailey2003disfluencies; FerreiraBailey2004; LauFerreira2005; brennan2001listeners). Most computational work on disfluencies, however, has focused on detection rather than on disfluency processing and interpretation (e.g., stolcke1996statistical; heeman1999speech; zwarts2010detecting; qian2013disfluency; HoughPurver2014; HoughSchlangen2017). In contrast, our aim is to get a better understanding of how RNNs process disfluent utterances and to analyse the impact of such disfluencies on a downstream task—in this case, issuing an API request reflecting the preferences of the user in a task-oriented dialogue.
For our experiments, we use the synthetic dataset bAbI (bordes2016learning) and a modified version of it called bAbI+ which includes disfluencies (shalyminov2017challenging). The dataset contains simple dialogues between a user and a system in the restaurant reservation domain, which terminate with the system issuing an API call that encodes the user’s request. In bAbI+, disfluencies are probabilistically inserted into user turns, following distributions in human data. Thus, while the data is artificial and certainly simplistic, its goal-oriented nature offers a rare opportunity: by assessing whether the system issues the right API call, we can study, in a controlled way, whether and how the model builds up a relevant semantic/pragmatic interpretation when processing a disfluent utterance—a key aspect that would not be available with unannotated natural data.
In this section, we discuss the two datasets we use for our experiments: bAbI (bordes2016learning) and bAbI+ (shalyminov2017challenging).
The bAbI dataset consists of a series of synthetic dialogues in English, representing human-computer interactions in the context of restaurant reservations. The data is broken down into six subtasks that individuate different abilities that dialogue systems should have to conduct a successful conversation with a human. We focus on Task 1, which tests the capacity of a system to ask the right questions and integrate the answers of the user to issue an API call that matches the user’s preferences regarding four semantic slots: cuisine, location, price range, and party size. A sample dialogue can be found in example LABEL:ex:repair, Section LABEL:sec:error.
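The Task 1 success criterion described above can be illustrated with a minimal sketch. It assumes that an API call is a whitespace-separated string beginning with the token `api_call` followed by one value per slot, in the order cuisine, location, party size, price range; this format, the slot order, and all names below (`ApiCall`, `parse_api_call`, `api_call_accuracy`, `mismatched_slots`) are illustrative assumptions rather than the dataset's documented interface.

```python
from typing import List, NamedTuple, Optional


class ApiCall(NamedTuple):
    """The four semantic slots of a Task 1 API call (slot order is an assumption)."""
    cuisine: str
    location: str
    party_size: str
    price_range: str


def parse_api_call(utterance: str) -> Optional[ApiCall]:
    """Parse 'api_call <cuisine> <location> <party_size> <price_range>'.

    Returns None if the utterance does not have this (assumed) shape.
    """
    tokens = utterance.strip().split()
    if len(tokens) != 5 or tokens[0] != "api_call":
        return None
    return ApiCall(*tokens[1:])


def mismatched_slots(predicted: ApiCall, gold: ApiCall) -> List[str]:
    """Names of the slots on which the predicted call differs from the gold call."""
    return [name for name in ApiCall._fields
            if getattr(predicted, name) != getattr(gold, name)]


def api_call_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of dialogues whose predicted API call matches the gold call on all slots."""
    if not gold:
        return 0.0
    correct = 0
    for pred_utt, gold_utt in zip(predicted, gold):
        pred_call = parse_api_call(pred_utt)
        if pred_call is not None and pred_call == parse_api_call(gold_utt):
            correct += 1
    return correct / len(gold)
```

A call counts as correct here only if all four slots match; `mismatched_slots` can be used to see which slot a system got wrong.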
The training data for Task 1 is deliberately kept simple and small, consisting of 1000 dialogues with on average 5 user and 7 system utterances. An additional 1000 dialogues each, based on different user queries, are available for validation and for testing. The overall vocabulary contains 86 distinct words. There are 7 distinct system utterances and 300 possible API calls.
Together with the dataset, bordes2016learning present several baseline models for the task. All the methods proposed are retrieval-based, i.e., the models are trained to select the best system response from a set of candidate responses (in contrast to the models we investigate in the present work, which are generative; see Section LABEL:sec:results). The baseline models include classical information retrieval (IR) methods such as TF-IDF and nearest neighbour approaches, as well as an end-to-end recurrent neural network. bordes2016learning demonstrate that the end-to-end recurrent architecture, a memory network (sukhbaatar2015memory), outperforms the classical IR methods as well as supervised embeddings, obtaining 100% accuracy on retrieving the correct API calls.
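To make the notion of a retrieval-based baseline concrete, the following is a minimal sketch of TF-IDF response selection: each candidate response is scored by cosine similarity to the dialogue context and the highest-scoring candidate is returned. It is not the implementation of bordes2016learning; the smoothed IDF formula, the whitespace tokenisation, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tfidf_vectors(docs: List[str]) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    """Build sparse TF-IDF vectors for the documents, plus the IDF table used."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed IDF (an assumption; other weighting variants are equally plausible).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenised]
    return vectors, idf


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is all-zero."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_response(context: str, candidates: List[str]) -> str:
    """Return the candidate response most similar to the dialogue context."""
    vectors, idf = tfidf_vectors(candidates)
    # Words unseen in the candidate set get weight 0 and do not affect the ranking.
    query = {t: tf * idf.get(t, 0.0) for t, tf in Counter(context.lower().split()).items()}
    scores = [cosine(query, vec) for vec in vectors]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]
```

A generative model, by contrast, produces the response token by token and is not restricted to a fixed candidate set.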
shalyminov2017challenging observe that the original bAbI data lack the naturalness and variation common in actual dialogue interaction. To introduce such variation while keeping lexical variation constant, they insert speech disfluencies, using a fixed set of templates that are probabilistically applied to the user turns of the original bAbI Task 1 dataset. In particular, three types of disfluencies are introduced: hesitations (1), restarts LABEL:ex:restart, and self-corrections LABEL:ex:correction, in around 21%, 40% and 5% of the user's turns, respectively. (The inserted material is in italics in the examples.)
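The probabilistic, template-based insertion described above can be sketched as follows. This is not the bAbI+ generation code: the templates (`uhm`, `uhm yeah`, `no sorry`), the independent sampling of the three disfluency types, and all function names are assumptions made for illustration. Only the approximate per-turn rates are taken from the text.

```python
import random
from typing import List, Sequence

# Approximate per-turn rates reported for bAbI+.
RATES = {"hesitation": 0.21, "restart": 0.40, "correction": 0.05}
HESITATIONS = ["uhm", "uh"]  # hypothetical filler inventory


def add_hesitation(tokens: List[str], rng: random.Random) -> List[str]:
    """Insert one filler after a randomly chosen token."""
    pos = rng.randrange(1, len(tokens) + 1)
    return tokens[:pos] + [rng.choice(HESITATIONS)] + tokens[pos:]


def add_restart(tokens: List[str], rng: random.Random) -> List[str]:
    """Utter an initial fragment, then restart the whole utterance."""
    cut = rng.randint(1, len(tokens))
    return tokens[:cut] + ["uhm", "yeah"] + tokens


def add_correction(tokens: List[str], rng: random.Random,
                   slot_values: Sequence[str]) -> List[str]:
    """Precede a slot value with a wrong value followed by an editing phrase."""
    positions = [i for i, tok in enumerate(tokens) if tok in slot_values]
    if not positions:
        return tokens
    i = rng.choice(positions)
    alternatives = [v for v in slot_values if v != tokens[i]]
    if not alternatives:
        return tokens
    return tokens[:i] + [rng.choice(alternatives), "no", "sorry"] + tokens[i:]


def make_disfluent(utterance: str, rng: random.Random,
                   slot_values: Sequence[str] = ()) -> str:
    """Apply each disfluency type independently with its rate (an assumption)."""
    tokens = utterance.split()
    if not tokens:
        return utterance
    if rng.random() < RATES["correction"]:
        tokens = add_correction(tokens, rng, slot_values)
    if rng.random() < RATES["hesitation"]:
        tokens = add_hesitation(tokens, rng)
    if rng.random() < RATES["restart"]:
        tokens = add_restart(tokens, rng)
    return " ".join(tokens)
```

Passing a seeded `random.Random` instance keeps the generated data reproducible. Because every template only adds tokens from a small fixed inventory, or slot values already in the vocabulary, lexical variation stays essentially constant.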
(1) We will be uhm eight