Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems

Jesse Dodge et al., Facebook, 11/21/2015

A long-term goal of machine learning is to build intelligent conversational agents. One recent popular approach is to train end-to-end models on a large amount of real dialog transcripts between humans (Sordoni et al., 2015; Vinyals & Le, 2015; Shang et al., 2015). However, this approach leaves many questions unanswered, as the precise successes and shortcomings of each model are hard to assess. A contrasting recent proposal is the bAbI tasks (Weston et al., 2015b), synthetic data that measure the ability of learning machines at various reasoning tasks over toy language. Unfortunately, those tests are very small and hence may encourage methods that do not scale. In this work, we propose a suite of new tasks of a much larger scale that attempt to bridge the gap between the two regimes. Choosing the domain of movies, we provide tasks that test the ability of models to answer factual questions (utilizing OMDb), provide personalization (utilizing MovieLens), carry short conversations about the two, and finally to perform on natural dialogs from Reddit. We provide a dataset covering 75k movie entities, with 3.5M training examples. We present results of various models on these tasks and analyze their performance.

1 Introduction

With the recent adoption of Recurrent Neural Networks (RNNs) and the large quantities of conversational data available on websites like Twitter or Reddit, a new type of dialog system is emerging. Such end-to-end dialog systems (Ritter et al., 2011; Shang et al., 2015; Vinyals & Le, 2015; Sordoni et al., 2015) directly generate a response given the last user utterance and (potentially) the context from previous dialog turns, without relying on an intermediate dialog state tracking component as in traditional dialog systems (e.g., Henderson (2015)). These methods are trained to imitate user-user conversations and do not need any hand-coding of attributes and labels for dialog states and goals, as state tracking methods do. Being trained on large corpora, they are robust to many language variations and seem to mimic human conversations to some extent.

In spite of their flexibility and representational power, these neural network based methods lack pertinent goal-oriented frameworks to validate their performance. Indeed, traditional systems have a wide range of well defined evaluation paradigms and benchmarks that measure their ability to track user states and/or to reach user-defined goals (Walker et al., 1997; Paek, 2001; Griol et al., 2008; Williams et al., 2013). Recent end-to-end models, on the other hand, rely either on very few human scores (Vinyals & Le, 2015), crowdsourcing (Ritter et al., 2011; Shang et al., 2015) or machine translation metrics like BLEU (Sordoni et al., 2015) to judge the quality of the generated language only. This is problematic because these evaluations do not assess whether end-to-end systems can conduct dialog to achieve pre-defined objectives, but simply whether they can generate correct language that fits the context of the dialog; in other words, they quantify their chit-chatting abilities.

To fill this gap, this paper proposes a collection of four tasks designed to evaluate different prerequisite qualities of end-to-end dialog systems. Focusing on the movie domain, we propose to test whether systems are able to jointly perform: (1) question answering (QA), (2) recommendation, (3) a mix of recommendation and QA, and (4) general dialog about the topic, which we call chit-chat. All four tasks have been chosen because they test basic capabilities we expect of a dialog system performing insightful movie recommendation, while evaluation on each of them can be well defined without a human in the loop (e.g., via Wizard-of-Oz strategies (Whittaker et al., 2002)). Our ultimate goal is to validate whether a single model can solve all four tasks at once, which we assert is a prerequisite for an end-to-end dialog system meant to act as a movie recommendation assistant, and by extension for a general dialog agent as well. At the same time we advocate developing methods that involve no special engineering for this domain, and hence should generalize easily to learning on tasks and data from other domains.

In contrast to the bAbI tasks, which test basic capabilities of story understanding systems (Weston et al., 2015b), our tasks have been created using large-scale real-world sources (OMDb: http://en.omdb.org, MovieLens: http://movielens.org and Reddit: http://reddit.com/r/movie). Overall, the dataset covers 75k movie entities (movie, actor, director, genre, etc.) with 3.5M training examples: even though the dataset is restricted to a single domain, it is large and allows a great variety of discussions, language and user goals. We evaluate on these tasks the performance of various neural network models that can potentially create end-to-end dialogs, ranging from simple supervised embedding models (Bai et al., 2009) and RNNs with Long Short-Term Memory units (LSTMs) (Hochreiter & Schmidhuber, 1997) to attention-based models, in particular Memory Networks (Sukhbaatar et al., 2015). To validate the quality of our results, we also apply our best performing model, Memory Networks, in other conditions by comparing it on the Ubuntu Dialog Corpus (Lowe et al., 2015) against baselines trained by the authors of the corpus. We show that Memory Networks outperform all baselines by a wide margin.

2 The Movie Dialog Dataset

We introduce a set of four tasks to test the abilities of end-to-end dialog systems, focusing on the domain of movies and movie-related entities. Together with their combination, they aim to test five abilities which we postulate as being key to a fully functional general dialog system (i.e., not specific to movies per se):

  • QA Dataset: Tests the ability to answer factoid questions that can be answered without relation to previous dialog. The context consists of the question only.

  • Recommendation Dataset: Tests the ability to provide personalized responses to the user via recommendations (in this case, of movies) rather than universal facts as above.

  • QA+Recommendation Dataset: Tests the ability to maintain short dialogs involving both factoid and personalized content, where conversational state has to be maintained.

  • Reddit Dataset: Tests the ability to identify the most likely replies in discussions on Reddit.

  • Joint Dataset: All our tasks are dialogs. They can be combined into a single dataset, testing the ability of an end-to-end model to perform well at all skills at once.

Sample input contexts and target replies from the tasks are given in Tables 1-4. The datasets are available at: http://fb.ai/babi.

2.1 Question Answering (QA)

The first task we build tests whether a dialog agent is capable of answering simple factual questions. The dataset was built from the Open Movie Database (OMDb, downloaded from http://beforethecode.com/projects/omdb/download.aspx), which contains metadata about movies. The subset we consider contains 15k movies, 10k actors and 6k directors. We also matched these movies to the MovieLens dataset (http://grouplens.org/datasets/movielens/) to attribute tags to each movie. We build a knowledge base (KB) directly from the combined data, stored as triples such as (The Dark Horse, starred_actor, Bette Davis) and (Moonraker, has_tag, james bond), with 8 different relation types involving director, writer, actor, release date, genre, tags, rating and imdb votes.

We distinguish 11 classes of question, corresponding to different kinds of edges in our KB: actor to movie (“What movies did Michael J Fox star in?”), movie to actors (“Who starred in Back to The Future?”), movie to director, director to movie, movie to writer, writer to movie, movie to tags, tag to movie, movie to year, movie to genre and movie to language. For each question type there is a set of possible answers. Using SimpleQuestions, an existing open-domain question answering dataset based on Freebase (Bordes et al., 2015), we identified the subset of questions posed by human annotators that covered our question types. We expanded this set to cover all of our KB by substituting the actual entities in those questions with placeholders, so that they could be applied to other entities as well; e.g., if the original question written by an annotator was “What movies did Michael J Fox star in?”, we created the pattern “What movies did [@actor] star in?”, which we instantiate for every other actor in our set, and repeat this for all annotations. We split the questions into training, development and test sets with 96k, 10k and 10k examples, respectively.
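As an illustration of this template-substitution step, here is a minimal Python sketch; the toy KB, the pattern and the function names are ours, not the actual generation code.

```python
# Hypothetical toy KB stored as (subject, relation, object) triples, as in Sec. 2.1.
KB = [
    ("Back to The Future", "starred_actors", "Michael J Fox"),
    ("Teen Wolf", "starred_actors", "Michael J Fox"),
    ("Grave Encounters", "directed_by", "Stuart Ortiz"),
]

# Pattern derived from a human-written question, with the entity replaced by a
# typed placeholder (the "actor to movie" question class).
PATTERN = "What movies did [@actor] star in?"

def expand_actor_pattern(pattern, kb):
    """Instantiate the pattern for every actor in the KB; the answer set is the
    list of movies connected to that actor by a starred_actors edge."""
    examples = []
    actors = sorted({obj for (_, rel, obj) in kb if rel == "starred_actors"})
    for actor in actors:
        question = pattern.replace("[@actor]", actor)
        answers = [subj for (subj, rel, obj) in kb
                   if rel == "starred_actors" and obj == actor]
        examples.append((question, answers))
    return examples

print(expand_actor_pattern(PATTERN, KB))
# [('What movies did Michael J Fox star in?', ['Back to The Future', 'Teen Wolf'])]
```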

Task 1: Factoid Question Answering (QA)
  What movies are about open source? Revolution OS
  Ruggero Raimondi appears in which movies? Carmen
  What movies did Darren McGavin star in? Billy Madison, The Night Stalker, Mrs. Pollifax-Spy
  Can you name a film directed by Stuart Ortiz? Grave Encounters
  Who directed the film White Elephant? Pablo Trapero
  What is the genre of the film Dial M for Murder? Thriller, Crime
  What language is Whity in? German
Table 1: Sample input contexts and target replies (in red) from Task 1.

To simplify evaluation, rather than requiring the generation of sentences containing the answers, we simply ask a model to output a ranked list of candidate answers. We then use standard ranking metrics to evaluate the list, making the results easy to interpret. Our main results report the hits@1 metric (i.e., is the top-ranked answer correct?); other metrics are given in the appendix.
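For concreteness, here is a minimal sketch of the hits@k ranking metric used in this and the following tasks; the function names are ours, and the model's output is assumed to be a list sorted by decreasing score.

```python
def hits_at_k(ranked_answers, gold_answers, k=1):
    """1 if any correct answer appears in the top k of the ranked list, else 0."""
    return int(any(a in gold_answers for a in ranked_answers[:k]))

def mean_hits_at_k(predictions, k=1):
    """predictions: iterable of (ranked_answers, gold_answers) pairs."""
    predictions = list(predictions)
    return sum(hits_at_k(r, g, k) for r, g in predictions) / len(predictions)

# Example: the top-ranked answer is correct, so hits@1 = 1.
ranked = ["Revolution OS", "Hackers", "The Social Network"]
gold = {"Revolution OS"}
assert hits_at_k(ranked, gold, k=1) == 1
```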

2.2 Recommendation Dataset

Not all questions about movies in dialogs have an objective answer independent of the person asking; indeed, much of human dialog is based on opinions and personalized responses. One of the simplest dialogs of this type to evaluate is recommendation, where we can utilize existing data resources. We again employ the MovieLens dataset, which features a user-item matrix of movie ratings from 1 to 5. We filtered the set of movies to be the same set as in the QA task and additionally only kept movies that had at least 2 ratings, giving around 11k movies.

To use this data for evaluating dialog, we generate dialog exchanges from it. We first select a user at random; this will be the user who is participating in the dialog. We then sample 1-8 movies that the user has rated 5 and form a statement intended to express the user’s feelings about these movies, according to a fixed set of natural language templates, one of which is selected randomly. See Table 2 for some examples. From the remaining set of movies the same user gave a rating of 5, we select one to be the answer.
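A small sketch of this generation procedure is shown below, assuming a MovieLens-style ratings table; the ratings, the templates and the random seed are illustrative only, not the actual ones used.

```python
import random

# Hypothetical MovieLens-style ratings: user id -> {movie: rating}.
RATINGS = {
    "u1": {"Schindler's List": 5, "The Fugitive": 5, "Apocalypse Now": 5,
           "Pulp Fiction": 5, "The Godfather": 5,
           "The Hunt for Red October": 5, "Waterworld": 2},
}

# Illustrative statement templates (the real fixed set is not listed here).
TEMPLATES = [
    "{movies} are films I really liked. Can you suggest a film?",
    "Some movies I like are {movies}. Can you suggest something else I might like?",
]

def make_example(user, rng):
    """Build one (statement, answer) pair: mention 1-8 movies the user rated 5,
    and hold out another 5-rated movie as the target reply."""
    fives = sorted(m for m, r in RATINGS[user].items() if r == 5)
    rng.shuffle(fives)
    n = rng.randint(1, min(8, len(fives) - 1))
    mentioned, answer = fives[:n], fives[n]
    statement = rng.choice(TEMPLATES).format(movies=", ".join(mentioned))
    return statement, answer

print(make_example("u1", random.Random(0)))
```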

Task 2: Recommendation
Schindler’s List, The Fugitive, Apocalypse Now, Pulp Fiction, and The Godfather are films I really liked.
Can you suggest a film? The Hunt for Red October
Some movies I like are Heat, Kids, Fight Club, Shaun of the Dead, The Avengers, Skyfall, and Jurassic Park.
Can you suggest something else I might like? Ocean’s Eleven
Table 2: Sample input contexts and target replies (in red) from Task 2.

There are 110k users in the training set, 1k users in the development set and 1k in the test set. We follow the procedure above, sampling users with replacement, and generate 1M training examples and 10k development and test examples, respectively. To evaluate the performance of a model, just as in the first task, we evaluate a ranked list of answers. In our main results we measure hits@100, i.e., 1 if the provided answer is in the top 100 and 0 otherwise, rather than hits@1, as this task is harder than the last.

Note that we expect absolute hits@k numbers to be lower for this task than for QA due to incomplete labeling (“missing ratings”): in recommendation there is no single exact right answer, so it is not surprising that the actual true label is not always at the top position; the model’s other top predictions may be good as well, but we do not have labels for them. One can thus view the ranking metric as a kind of lower bound on the performance that would be obtained by labeling all the predictions with human annotations, which would be time consuming and no longer automatic, and hence undesirable for algorithm development. This is standard in recommendation, see e.g. Cremonesi et al. (2010).

2.3 QA+Recommendation Dialog

The tasks presented so far only involve questions followed by responses, with no context from previous dialog. This task aims at evaluating responses in the context of multiple previous exchanges, while remaining straightforward enough that evaluation and analysis are still tractable. We hence combine the question answering and recommendation tasks from before in a multi-response dialog, where dialogs consist of 3 exchanges (3 turns from each participant).

The first exchange requires a recommendation similar to Task 2, except that the user also specifies a genre or topic they are interested in, e.g. “I’m looking for a Music movie”, where the answer might be “School of Rock”, as in the example of Table 3.

In the second exchange, given the model’s response (movie suggestion), the user asks a factoid question about that suggestion, e.g. “What else is that about?”, “Who stars in that?” and so on. This question refers back to the previous dialog, making context important.

In the third exchange, the user asks for an alternative recommendation and provides extra information about their tastes, e.g. “I like Tim Burton movies more”. Again, the context of the last two exchanges should help to give the best performance.

Task 3: QA + Recommendation Dialog
I loved Billy Madison, My Neighbor Totoro, Blades of Glory, Bio-Dome, Clue, and Happy Gilmore.
I’m looking for a Music movie. School of Rock
What else is that about? Music, Musical, Jack Black, school, teacher, Richard Linklater, rock, guitar
I like rock and roll movies more. Do you know anything else? Little Richard
Tombstone, Legends of the Fall, Braveheart, The Net, Outbreak, and French Kiss are films I really liked.
I’m looking for a Fantasy movie. Jumanji
Who directed that? Joe Johnston
I like Tim Burton movies more. Do you know anything else? Big Fish
Table 3: Sample input contexts and target replies (in red) from Task 3.

We thus generate 1M examples of such 6-line dialogs (3 turns from each participant) for training, and 10k each for development and testing. We can evaluate the performance of models across all the lines of dialog (e.g., all 30k responses from the test set), but also only on the 1st (Recommendation), 2nd (QA) or 3rd exchange (Similarity) for a more fine-grained analysis. We again use a ranking metric (here, hits@10), just as in our previous tasks.

2.4 Reddit Discussion

Our fourth task is to predict responses in movie discussions using real conversation data taken directly from Reddit, a website where registered community members can submit content in various areas of interest, called “subreddits”. To match our other tasks we selected the movie subreddit (https://www.reddit.com/r/movies), drawing from the dataset available at https://www.reddit.com/r/datasets/comments/3bxlg7.

The original discussion data is potentially between multiple participants. To simplify the setup, we flatten this to appear as two participants (parent and comment), just as in our other tasks. In this way we collected 1M dialogs, of which 10k are reserved for a development set, and another 10k for the test set. Of the dialogs, 76% involve a single exchange, 17% have at least two exchanges, and 7% have at least three exchanges (the longest exchange is length 50).
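One plausible way to perform this flattening is sketched below; the paper does not give the exact procedure, so the tree walk and the data structure here are only an assumption for illustration.

```python
# Hypothetical comment-tree node: (comment_text, [child_nodes]).
def flatten(node, context=()):
    """Walk a comment tree and yield (context, response) exchanges, where the
    context is the chain of ancestor comments; this reduces a multi-participant
    discussion to two alternating roles (parent and comment)."""
    text, children = node
    for child in children:
        child_text, _ = child
        yield (context + (text,), child_text)
        yield from flatten(child, context + (text,))

post = ("I think the Terminator movies really suck, ...",
        [("C'mon the second one was still pretty cool..", [])])
for ctx, reply in flatten(post):
    print(len(ctx), "turn(s) of context ->", reply)
```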

Task 4: Reddit Discussion
I think the Terminator movies really suck, I mean the first one was kinda ok, but after that they got really
cheesy. Even the second one which people somehow think is great. And after that… forgeddabotit.
C’mon the second one was still pretty cool.. Arny was still so badass, as was Sararah Connor’s character..
and the way they blended real action and effects was perhaps the last of its kind…
Table 4: Sample input contexts and target replies (in red) from Task 4.

To evaluate the performance of models, we again separate the problem of evaluating the quality of a response from that of language generation by considering a ranking setup, in line with other recent works (Sordoni et al., 2015). We proceed as follows: we select a further 10k comments for the development set and another 10k for the test set which have not appeared elsewhere in our dataset, and use these as potential candidates for ranking during evaluation. For each exchange, given the input context, we rank 10,001 possible candidates: the true response given in the dataset, plus the 10k “negative” candidates just described. The model has to rank the true response as high as possible. As with the recommendation task described before, we do not expect absolute hits@k performance to be as high as for QA due to incomplete labeling. As with Task 3, we can evaluate on all the data, or only on the 1st, 2nd or 3rd exchange, and so on. We also identified the subset of the test set where there is a match with at least two entities from Tasks 1-3, one appearing in the input and the other in the response: this subset serves to evaluate the impact of using a knowledge base when conducting such a dialog.

2.5 Joint Task

Finally, we consider a task made of the combination of all four of the previous ones. At both training and test time, examples consist of exchanges from any of the datasets, sampled at random; the conversation is ‘reset’ at each sample, so that the context history only ever includes exchanges from the current conversation.

We consider this to be the most important task, as it tests whether a model can not only produce chit-chat (Task 4) but can also provide meaningful answers during dialog (Tasks 1-3). On the other hand, the point of delineating the separate tasks is to evaluate exactly which types of dialog a model is succeeding at or not. That all the datasets are in the same domain is crucial to testing the ability of models to perform well on all tasks jointly. If the domains were different, the vocabularies would be trivially non-overlapping, allowing a single model to effectively learn a separate model for each task.

2.6 Relation to Existing Evaluation Frameworks

Traditional dialog systems consist of two main modules: (1) a dialog state tracking component that tracks what has happened in a dialog, incorporating system outputs, user utterances, context from previous turns and other external information into a pre-defined explicit state structure, and (2) a response generator. Evaluation of the dialog state tracking stage has been well defined since the PARADISE framework (Walker et al., 1997) and subsequent initiatives (Paek, 2001; Griol et al., 2008), including recent competitions (Williams et al., 2013; Henderson et al., 2014) as well as situated variants (Rojas-Barahona et al., 2012). However, these require fine-grained data annotations labeling the internal dialog state and precisely defined user intents (goals). As a result, they do not really scale to large domains and to dialogs with high variability in language. Because of language ambiguity and variation, evaluation of the response generation step is complicated and usually relies on human judgement (Walker et al., 2003).

End-to-end dialog systems do not rely on an explicit internal state and hence do not have state tracking modules; they directly generate responses given user utterances and dialog context, and so cannot be evaluated using state tracking test-beds. Unfortunately, as for response generation modules, their evaluation is ill-defined, since it is difficult to objectively rate the fit of returned responses at scale. Most existing work (Ritter et al., 2011; Shang et al., 2015; Vinyals & Le, 2015; Sordoni et al., 2015) chose to use human ratings, which do not easily scale. Sordoni et al. (2015) also use the BLEU score to compare to actual user utterances, but this is not a completely satisfying measure of success, especially in a chit-chat setting where there are no clear goals and hence no clear measures of success. Lowe et al. (2015) use a ranking evaluation similar to ours, but only in a chit-chat setting.

Our approach of providing a collection of tasks to be jointly solved is related to the evaluation framework of the bAbI tasks (Weston et al., 2015a) and to the collection of sequence prediction tasks of Joulin & Mikolov (2015). However, unlike them, our Tasks 1-3 are much closer to real dialog, being built from human-written text, and Task 4 actually involves real dialog from Reddit. Our tasks are designed so that each tests one or more key characteristics a dialog system should have, while an unambiguous answer is expected after each dialog act. In that sense, it follows the notion of dialog evaluation by a reference answer introduced in Hirschman et al. (1990). The application to movie recommendation is connected to the TV program suggestion task of Ramachandran et al. (2014), except that we frame it so that we can generate a systematic evaluation from it, whereas they rely only on small-scale human judgement.

3 Models

3.1 Memory Networks

Memory Networks (Weston et al., 2015c; Sukhbaatar et al., 2015) are a recent class of models that perform language understanding by incorporating a memory component that potentially includes both long-term memory (e.g., to remember facts about the world) and short-term context (e.g., the last few turns of dialog). They have only been evaluated in a few setups: question answering (Bordes et al., 2015), language modeling (Sukhbaatar et al., 2015; Hill et al., 2015), and language understanding on the bAbI tasks (Weston et al., 2015a), but not so far on dialog tasks such as ours.

We employ the MemN2N architecture of Sukhbaatar et al. (2015) in our experiments, with some additional modifications to construct both long-term and short-term context memories. At any given time step $t$ we are given as input the history of the current conversation: the messages from the user $x_1, \dots, x_t$ and the corresponding responses from the model itself $r_1, \dots, r_{t-1}$. At the current time step only the new input $x_t$ is given, and the model has to respond.

Retrieving long-term memories

For each word in the last messages of the conversation we perform a hash lookup to return all long-term memories (sentences) from a database that also contain that word. Words above a certain frequency cutoff can be ignored to avoid retrieving sentences that only share syntax or unimportant words. We employ the movie knowledge base of Sec. 2.1 for our long-term memories, but potentially any text dataset could be used. See Table 5 for an example of this process.
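A minimal sketch of this word-hash retrieval over the KB, using an inverted index, is shown below; the sentences, the stopword list (standing in for the frequency cutoff) and the function names are illustrative.

```python
from collections import defaultdict

# Hypothetical KB entries (flattened triples) used as long-term memories.
KB_SENTENCES = [
    "Shaolin Soccer directed_by Stephen Chow",
    "Kung Fu Hustle starred_actors Stephen Chow",
    "The God of Cookery written_by Stephen Chow",
]

# Stand-in for the frequency cutoff on overly common words.
FREQUENT = {"the", "of", "a", "i", "to", "any", "more"}

def build_inverted_index(sentences):
    """Map each non-frequent word to the indices of KB sentences containing it."""
    index = defaultdict(set)
    for i, s in enumerate(sentences):
        for w in s.lower().split():
            if w not in FREQUENT:
                index[w].add(i)
    return index

def hash_lookup(message, index, sentences):
    """Return all long-term memories sharing at least one rare word with the message."""
    hits = set()
    for w in message.lower().split():
        hits |= index.get(w, set())
    return [sentences[i] for i in sorted(hits)]

index = build_inverted_index(KB_SENTENCES)
print(hash_lookup("any more stephen chow films", index, KB_SENTENCES))
# all three KB sentences are returned, via the words 'stephen' and 'chow'
```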

Attention over memories

The sentences returned from the hashing step, plus the messages from the current conversation, form the memory of the Memory Network (we also add time features to each memory to denote its position, following Sukhbaatar et al. (2015)).

The last user input $x$ is embedded using a matrix $A$ of size $d \times V$, where $d$ is the embedding dimension and $V$ is the size of the vocabulary, giving $q = Ax$. Each memory $c_i$ is embedded using the same matrix, giving $m_i = A c_i$. The match between the input and the memories is then computed by taking the inner product followed by a softmax:

$$p_i = \mathrm{Softmax}(q^\top m_i),$$

giving a probability vector over the memories. The output memory representation is then constructed with

$$o = R \sum_i p_i m_i,$$

where $R$ is a $d \times d$ rotation matrix (optionally, different dictionaries can be used for inputs, memories and outputs instead of being shared). The memory output $o$ is then added to the original input, $q \leftarrow o + q$. This procedure can then be stacked in what is called multiple “hops” of attention over the memory.

Generating the final prediction

The final prediction is then defined as

$$\hat{a} = \mathrm{Softmax}(q^\top W y_1, \dots, q^\top W y_C),$$

where there are $C$ candidate responses $y_1, \dots, y_C$ and $W$ is of dimension $d \times V$. For Tasks 1-3 the candidates are the set of words in the vocabulary, which are ranked for final evaluation, whereas for Task 4 the candidates are target responses (sentences).

The whole model is trained using stochastic gradient descent by minimizing a standard cross-entropy loss between $\hat{a}$ and the true label $a$.
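To make the above concrete, here is a small numpy sketch of a single hop of attention and the final candidate scoring under the notation used above; the weights are random and this is a forward pass only, not the actual Torch training code.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, C = 1000, 32, 50        # vocabulary size, embedding dim, number of candidates

A = rng.normal(scale=0.1, size=(d, V))   # shared input/memory embedding matrix
R = rng.normal(scale=0.1, size=(d, d))   # "rotation" applied to the memory output
W = rng.normal(scale=0.1, size=(d, V))   # maps the controller state to candidates

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def memn2n_hop(x, memories):
    """One hop of attention; x and memories are bag-of-words vectors of size V."""
    q = A @ x                      # q = A x
    m = A @ memories.T             # m_i = A c_i, shape (d, N)
    p = softmax(q @ m)             # p_i = Softmax(q^T m_i)
    o = R @ (m @ p)                # o = R * sum_i p_i m_i
    return o + q                   # updated controller state

def predict(x, memories, candidates):
    """Score C candidate responses (rows of `candidates`, bag-of-words of size V)."""
    u = memn2n_hop(x, memories)    # stack further hops here for a multi-hop model
    return softmax(u @ W @ candidates.T)   # \hat{a} over the candidates

x = rng.integers(0, 2, size=V).astype(float)               # last user message
memories = rng.integers(0, 2, size=(8, V)).astype(float)   # KB facts + dialog history
candidates = np.eye(V)[:C]                                  # e.g. single-word answers
print(predict(x, memories, candidates).shape)               # (50,)
```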

Long-Term Shaolin Soccer directed_by Stephen Chow
Memories    Shaolin Soccer written_by Stephen Chow
Shaolin Soccer starred_actors Stephen Chow
Shaolin Soccer release_year 2001
Shaolin Soccer has_genre comedy
Shaolin Soccer has_tags martial arts, kung fu soccer, stephen chow
Kung Fu Hustle directed_by Stephen Chow
Kung Fu Hustle written_by Stephen Chow
Kung Fu Hustle starred_actors Stephen Chow
Kung Fu Hustle has_genre comedy action
Kung Fu Hustle has_imdb_votes famous
Kung Fu Hustle has_tags comedy, action, martial arts, kung fu, china, soccer, hong kong, stephen chow
The God of Cookery directed_by Stephen Chow
The God of Cookery written_by Stephen Chow
The God of Cookery starred_actors Stephen Chow
The God of Cookery has_tags hong kong Stephen Chow
From Beijing with Love directed_by Stephen Chow
From Beijing with Love written_by Stephen Chow
From Beijing with Love starred_actors Stephen Chow, Anita Yuen
                  and more
Short-Term   1) I’m looking a fun comedy to watch tonight, any ideas?
Memories     2) Have you seen Shaolin Soccer? That was zany and great.. really funny but in a whacky way.
Input             3) Yes! Shaolin Soccer and Kung Fu Hustle are so good I really need to find some more Stephen Chow
films I feel like there is more awesomeness out there that I haven’t discovered yet …
Output           4) God of Cookery is pretty great, one of his mid 90’s hong kong martial art comedies.
Table 5: Memory Network long-term and short-term memories. Blue underlined text indicates those words that hashed into the knowledge base to recall sentences from the long-term memory. Those, along with the recent short-term context (lines labeled 1 and 2) are used as input memories to the Memory Network along with the input (labeled 3). The desired goal is to output dialog line 4.

3.2 Supervised Embedding Models

While one of the major uses of word embedding models is to learn unsupervised embeddings over large unlabeled datasets, as in Word2Vec (Mikolov et al., 2013), there are also very effective word embedding models for training supervised models when labeled data is available. The simplest approach, which works surprisingly well, is to sum the word embeddings of the input and the target independently and then compare them with a similarity metric such as the inner product or cosine similarity. A ranking loss is used to ensure the correct targets are ranked higher than any other targets. Several variants of this approach exist. For matching two documents, supervised semantic indexing (SSI) was shown to be superior to unsupervised latent semantic indexing (LSI) (Bai et al., 2009). Similar methods were shown to outperform SVD for recommendation (Weston et al., 2013). However, we do not expect this method to work as well on question answering tasks, as all the memorization must occur in the individual word embeddings, which was shown to perform poorly in Bordes et al. (2014). For example, consider asking the question “who was born in Paris?” and requiring the word embedding for Paris to effectively contain all the pertinent information. However, for rarer items requiring less storage, performance may not be as degraded. In general we believe this is a surprisingly strong baseline that is often neglected in evaluations. Our implementation corresponds to a Memory Network with no attention over the memory.
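Below is a minimal numpy sketch of this baseline: bag-of-words inputs and labels are embedded by summing word vectors, scored by an inner product, and trained with a margin ranking loss against sampled negatives. The matrices and the margin value here are placeholders, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 32
U = rng.normal(scale=0.1, size=(V, d))   # input-side word embeddings
W = rng.normal(scale=0.1, size=(V, d))   # label-side word embeddings (could be tied to U)

def embed(bow, E):
    """Sum the embeddings of the words in a bag-of-words count vector."""
    return bow @ E

def score(input_bow, label_bow):
    """Inner product between the summed input and label embeddings."""
    return embed(input_bow, U) @ embed(label_bow, W)

def margin_ranking_loss(input_bow, pos_bow, neg_bow, margin=0.1):
    """Hinge loss pushing the correct target above a sampled incorrect one."""
    return max(0.0, margin - score(input_bow, pos_bow) + score(input_bow, neg_bow))

x = rng.integers(0, 2, size=V).astype(float)       # input message (+ context)
y_pos = rng.integers(0, 2, size=V).astype(float)   # correct response
y_neg = rng.integers(0, 2, size=V).astype(float)   # sampled negative response
print(margin_ranking_loss(x, y_pos, y_neg))
```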

3.3 Recurrent Language Models

Recurrent Neural Networks (RNNs) have proven successful at several tasks involving natural language, such as language modeling (Mikolov et al., 2011), and have recently been applied to dialog (Sordoni et al., 2015; Vinyals & Le, 2015; Shang et al., 2015). However, LSTMs are not known to perform well on tasks such as QA or item recommendation, so we expect them to find our datasets challenging.

There are a large number of variants of RNNs, including Long Short-Term Memory units (LSTMs) (Hochreiter & Schmidhuber, 1997), bidirectional LSTMs (Graves et al., 2012), seq2seq models (Sutskever et al., 2014), RNNs that take into account the document context (Mikolov & Zweig, 2012) and RNNs that perform attention over their input in various different ways (Bahdanau et al., 2015; Hermann et al., 2015; Rush et al., 2015). Evaluating all these variants is beyond the scope of this work, and we instead use standard LSTMs as our baseline method (we used the code available at https://github.com/facebook/SCRNNs). However, we note that LSTMs with attention have many properties in common with Memory Networks if the attention is applied over the same memory setup.

3.4 Question Answering Systems

For the particular case of Task 1 we can apply existing question answering systems. There has been a recent surge of interest in systems that try to answer a question posed in natural language by converting it into a database search over a knowledge base (Berant & Liang, 2014; Kwiatkowski et al., 2013; Fader et al., 2014), a setup that is also natural for our QA task. However, such systems cannot easily solve any of our other tasks; for example, our recommendation Task 2 does not involve looking up a factoid answer in a database. Nevertheless, this allows us to compare end-to-end systems that must perform well on all our tasks against a standard QA benchmark. We chose the method of Bordes et al. (2014) as our baseline (using the ‘Path Representation’ for the knowledge base, as described in Sec. 3.1 of that paper). This system learns embeddings that match questions to database entries and then ranks the set of entries, and has been shown to achieve good performance on the WebQuestions benchmark (Berant et al., 2013).

3.5 Singular Value Decomposition

Singular Value Decomposition (SVD) is a standard benchmark for recommendation, being at the core of the best ensemble results in the Netflix challenge; see Koren & Bell (2011) for a review. However, it has been shown to be outperformed by other flavors of matrix factorization, in particular those using a ranking loss rather than a squared loss (Weston et al., 2013), which we compare to (cf. Sec. 3.2), as well as by improvements like SVD++ (Koren, 2008). Collaborative filtering methods are applicable to Task 2, but cannot easily be used for any of the other tasks. Even for Task 2, while our dialog models use textual input as shown in Table 2, SVD requires a user-item matrix, so for this baseline we preprocessed the text to assign each entity an ID and threw away all other text. In contrast, the end-to-end dialog models have to learn to process the text as part of the task.
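For reference, here is a minimal numpy sketch of an SVD-style recommender over a user-item matrix of the kind used for this baseline; the toy ratings, the mean-centering and the rank are illustrative choices, not the exact setup used in the experiments.

```python
import numpy as np

# Hypothetical user-item rating matrix (0 = unrated), after mapping entities to IDs.
ratings = np.array([[5, 0, 4, 0],
                    [4, 5, 0, 1],
                    [0, 4, 5, 0]], dtype=float)

k = 2                                   # rank of the factorization
rated = ratings > 0
mean = ratings[rated].mean()
X = np.where(rated, ratings - mean, 0.0)             # mean-center, zero-fill unrated
U, s, Vt = np.linalg.svd(X, full_matrices=False)
approx = mean + (U[:, :k] * s[:k]) @ Vt[:k]           # predicted scores for all pairs

user = 0
unseen = np.where(~rated[user])[0]
print("recommend item", unseen[np.argmax(approx[user, unseen])], "to user", user)
```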

3.6 Information Retrieval Models

To select candidate responses a standard baseline is nearest neighbour information retrieval (IR) (Isbell et al., 2000; Jafarpour et al., 2010; Ritter et al., 2011; Sordoni et al., 2015). Two simple variants are often tried: given an input message, either (i) find the most similar message in the (training) dataset and output the response from that exchange; or (ii) find the most similar response to the input directly. In both cases the standard measure of similarity is tf-idf weighted cosine similarity between the bags of words. Note that the Supervised Embedding Models of Sec. 3.2 effectively implement the same kind of model as (ii), but with a learnt similarity measure. It has been shown previously that method (ii) performs better (Ritter et al., 2011), and our initial IR experiments showed the same result. Note that while (non-learning) IR systems can also be applied to other tasks such as QA (Kolomiyets & Moens, 2011), they require significant tuning to do so. Here we stick to a vanilla vector space model and hence only apply an IR baseline to Task 4.
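A small sketch of variant (ii), scoring candidate responses directly against the input with tf-idf weighted cosine similarity, is given below; for simplicity the idf statistics here are computed over the toy candidate set itself rather than over a full training corpus.

```python
import math
from collections import Counter

def tfidf(docs):
    """tf-idf weighted bag-of-words vectors for a list of token lists."""
    df = Counter(w for doc in docs for w in set(doc))
    n = len(docs)
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(doc).items()}
            for doc in docs]

def cosine(a, b):
    dot = sum(v * b[w] for w, v in a.items() if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Variant (ii): score candidate responses directly against the input message.
candidates = ["the second one was still pretty cool".split(),
              "i prefer romantic comedies".split()]
query = "the terminator movies really suck even the second one".split()
vecs = tfidf(candidates + [query])
scores = [cosine(vecs[-1], v) for v in vecs[:-1]]
print("best candidate:", max(range(len(candidates)), key=scores.__getitem__))
```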

4 Results

Our main results across all the models and tasks are given in Table 6. Supervised Embeddings and Memory Networks are tested in two settings: trained and tested on each task separately, or jointly on the combined Task 5. Other methods are only evaluated on individual tasks. In all cases, parameter search was performed on the development sets; parameter choices are provided in the appendix.

QA Task Recs Task QA+Recs Task Reddit Task
Methods (hits@1) (hits@100) (hits@10) (hits@10)
QA System (Bordes et al., 2014) 90.7 n/a n/a n/a

SVD n/a 19.2 n/a n/a
IR n/a n/a n/a 23.7
LSTM  6.5 27.1 19.9 11.8
Supervised Embeddings 50.9 29.2 65.9 27.6
MemN2N 79.3 28.6 81.7 29.2
Joint Supervised Embeddings 43.6 28.1 58.9 14.5
Joint MemN2N 83.5 26.5 78.9 26.6
Table 6: Test results across all tasks. Of the methods tested, supervised embeddings, LSTMs and MemN2N are easily applicable to all tasks; the other methods are standard benchmarks for individual tasks. The final two rows are models trained jointly on all tasks at once (the Combined Task). Evaluation uses the hits@k metric (in percent), with the value of k given in the second header row.

Answering Factual Questions

Memory Networks and the baseline QA system are the two methods that have an explicit long-term memory via access to the knowledge base (KB). On the task of answering factual questions, where the answers are contained in the KB, they outperform the other methods convincingly, with LSTMs being particularly poor. The latter is not unexpected, as that method is good at language modeling, not question answering; see e.g. Weston et al. (2015b). The baseline QA system, which is designed for this task, is superior to Memory Networks, indicating there is still room for improvement in that model. On the other hand, the latter’s much more general design allows it to perform well on our other dialog tasks, whereas the former is task-specific.

Making Recommendations

In this task a long-term memory does not bring any improvement: LSTMs, Supervised Embeddings and Memory Networks all perform similarly, and all outperform the SVD baseline. Here, we conjecture that LSTMs perform well because this task looks much more like a language modeling task, i.e. the input is a sequence of similar recommendations.

Using Dialog History

In both QA+Recommendations (Task 3) and Reddit (Task 4), Memory Networks outperform Supervised Embeddings due to their better use of context. This can be seen by breaking down the results by length of context: on the first response they perform similarly, but Memory Networks show a relative improvement on the second and third responses, see Tables 9 and 10 in the appendix. Note that these improvements come from the short-term memory (dialog history), not from the use of the KB: we also show Memory Networks results without access to the KB and they perform similarly. We believe the QA performance in these cases is not hindered by the lack of a KB because the questions are based on fewer relations than in Task 1, so it is easier to store the knowledge directly in the word embeddings. The baseline IR model in Task 4 benefits from context too; it is compared with and without context in Table 10. LSTMs perform poorly: the posts in Reddit are quite long and the memory of the LSTM is relatively short, as pointed out by Sordoni et al. (2015). In that work a linear reranker that used LSTM predictions as features was employed to better effect. Testing more powerful recurrent networks such as LSTMs with attention on these benchmarks remains future work (although the latter are related to Memory Networks, which we do report).

Joint Learning

A truly end-to-end dialog system has to be good at all the skills in Tasks 1-4 (and more besides, i.e. this is necessary, but not sufficient). We thus report results on our Combined Task for Supervised Embeddings and Memory Networks. Supervised Embeddings still have the same failings as before on Tasks 1 and 3, but now perform even more poorly due to the difficulty of encoding all the necessary skills in the word embeddings; e.g., they now do significantly worse on Task 4. This is despite trying word embeddings of up to 2000 dimensions. Memory Networks fare better, with only a slight loss in performance on Tasks 2-4 and a slight gain on Task 1. In their case, the modeling power lies not only in the word embeddings but also in the attention over the long-term and short-term memory, so less capacity is needed in the word embeddings. However, the best achievable models would presumably show an improvement from training across all the tasks, not a loss, and would perform at least as well as all the individual task baselines (i.e., in this case, perform better on Task 1).

5 Ubuntu Dialogue Corpus Results

Validation Test
Methods (hits@1) (hits@1)
IR n/a 48.81
RNN n/a 37.91
LSTM n/a 55.22
MemN2N 1-hop 57.23 56.25
MemN2N 2-hops 64.28 63.51
MemN2N 3-hops 64.31 63.72
MemN2N 4-hops 64.01 62.82
Table 7: Ubuntu Dialog Corpus results. The evaluation is retrieval-based, similar to that of Reddit (Task 4): for each dialog, the correct answer is mixed among 10 random candidates, and hits@1 (in %) is reported. The IR, RNN and LSTM baselines were run by Lowe et al. (2015).

As no other authors have yet published results on our new benchmark, to validate the quality of our results we also apply our best performing model in other conditions by comparing it on the Ubuntu Dialog Corpus (Lowe et al., 2015). In particular, this also allows us to compare against more sophisticated LSTM models that are trained discriminatively using metric learning, as well as additional baseline methods, all trained by the authors of that corpus. The Ubuntu Dialog Corpus contains almost 1M dialogs of more than 7 turns on average (900k dialogs for training, 20k for validation and 20k for testing), and 100M words. The corpus was scraped from the Ubuntu IRC channel logs, where users ask questions about issues they are having with Ubuntu and get answers from other users. Most chats involve more than two users, but a series of heuristics was used to disentangle them into dyadic dialogs.

The evaluation is similar to that of Reddit (Task 4): each correct answer has to be retrieved among a set of 10 candidates, i.e. mixed with 9 randomly chosen utterances. We report hits@1 in Table 7. (Results for the baselines from Lowe et al. (2015) differ from those in v3 of their arXiv paper because the corpus has been updated since then; all results in Table 7 use the latest version of the corpus.) We used the same MemN2N architecture as before; all models were selected using validation accuracy. On this dataset, which has longer dialogs than those of the Movie Dialog dataset, we can see that running more hops over the memory with the MemN2N improves performance: the 1-hop model performs similarly to the LSTM, but with 2 hops and more we gain over 8% compared to the previous best reported model. Using even more hops still improves over 1 hop but not much over 2 hops.

6 Conclusion

We have presented a new set of benchmark tasks designed to evaluate end-to-end dialog systems. The Movie Dialog dataset measures how well such models perform at goal-driven dialog, with both objective and subjective goals (via evaluation metrics on question answering and recommendation tasks), as well as at less goal-driven chit-chat. A true end-to-end model should perform well at all these tasks, which is a necessary but not sufficient condition for a fully functional dialog agent.

We showed that some end-to-end neural network models can perform reasonably well across all tasks compared to standard per-task baselines. Specifically, Memory Networks that incorporate short-term and long-term memory can utilize local context and knowledge bases of facts to boost performance. We believe this is promising because these same architectures also perform well on a separate dialog task, the Ubuntu Dialog Corpus, have previously been shown to work well on the synthetic but challenging bAbI tasks of Weston et al. (2015a), and require no special engineering for the tasks or domain. However, some limitations remain: in particular, they do not perform as well as stand-alone QA systems on QA, and performance is degraded rather than improved when training on all four tasks at once. Future work should try to overcome these problems.

While our dataset focuses on movies, there is nothing specific in the task design that could not be transferred immediately to other domains, for example sports, music or restaurants. Future work should create new tasks in this and other domains, firstly to ensure that models are not overtuned for these goals, and secondly to test further skills and to motivate the development of algorithms that are skillful at them.

References

Appendix A Further Experimental Details

Dictionary

For all models we built a dictionary using all the known entities in the KB (e.g. “Bruce Willis” and “Die Hard” are single dictionary elements). This allows us to output a single symbol for QA and Recommendation in order to predict an entity, rather than having to construct the answer out of words, making training and evaluation of the task simpler. The rest of the dictionary is built of unigrams that are not covered by our entity dictionary, where we removed words (but not entities) with frequency less than 5. Overall this gives a dictionary of size 189472, which includes 75542 entities. All entries and texts were lower-cased. Our text parser that converts text to this dictionary representation is then very simple: it goes left to right, consuming the largest n-gram at each step.
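A minimal sketch of this greedy left-to-right parser follows; the toy entity dictionary is illustrative.

```python
# Hypothetical entity dictionary; the real one contains 75542 entities.
ENTITIES = {"bruce willis", "die hard", "the fifth element"}
MAX_LEN = max(len(e.split()) for e in ENTITIES)

def parse(text):
    """Greedy left-to-right parse: consume the largest n-gram that is a known
    entity at each step, falling back to unigrams otherwise."""
    words = text.lower().split()
    tokens, i = [], 0
    while i < len(words):
        for n in range(min(MAX_LEN, len(words) - i), 0, -1):
            gram = " ".join(words[i:i + n])
            if n == 1 or gram in ENTITIES:
                tokens.append(gram)
                i += n
                break
    return tokens

print(parse("Bruce Willis stars in Die Hard"))
# ['bruce willis', 'stars', 'in', 'die hard']
```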

Memory Networks

For most of the tasks the optimal number of hops was 1, except for Task 3 where 2 or 3 hops outperform 1. See Table 9 and the parameter choices in Sec. B. For the joint task (Task 5), to achieve best performance we increased the capacity compared to the individual task models by using different dictionaries for the input, memory and output layers, see Sec. B. Additionally, we pre-trained the weights by training without the long-term memory for speed.

Supervised Embedding Models

We tried two flavors of supervised embedding model: (i) a model that shares a single set of word embeddings between the input+context and the label (“single dictionary model”); and (ii) a model with two separate sets of word embeddings, one for the input+context and one for the label (“two dictionary model”). The input and context are concatenated together to form a bag of words in either case. It turns out method (i) works better on Tasks 1 and 4, and method (ii) works better on Tasks 2 and 3. Some of the reasons for this are easy to understand: on Tasks 2 and 3 (recommendations) a single dictionary model favors predicting the same movies that are already in the input context, which are never correct. On Tasks 1 and 4, however, the two dictionary model appears to overfit to some degree. This partially explains why the model is worse overall on the joint dataset (Task 5). See Sec. B for more details.

LSTMs

LSTMs performed poorly on Task 4 and we spent some time trying to improve these results. Despite the perplexity looking reasonable (96 on the training set and 105 on the validation set) after training for 6 days, we still obtain poor results when distinguishing between candidates. We also tried Seq2Seq models (without attention or metric learning) and did not obtain improvements. Part of the problem is that posts in Reddit vary from very short (a few words) to very long (several paragraphs), and one natural procedure to try, computing the probability of each candidate sequence seeded by the input, gives very unbalanced results and tends to select the shorter candidates, ending up with worse than random performance. Further, the whole procedure is computationally very slow compared to all other methods tested. Memory Networks and supervised embeddings need to compute the inner product between embedded inputs and outputs, and hence the candidates can be embedded once and cached for the whole test set. This trick is not applicable to the method described above, rendering it much slower. To deal with the speed issue one can use our supervised embedding model as a first step and then rerank only the top 100 results with the LSTM to make it tractable; however, performance is still poor, as mentioned. We obtained improved results by instead adopting the approach of Narasimhan et al. (2015): we take the representation of a dialog message as the average embedding over the hidden states as the symbols are consumed (at each step of the recurrence). We also note that Lowe et al. (2015) report good results (on a different dataset, the Ubuntu Corpus) by training an additional metric learner on top of an LSTM representation, which we have not tried. However, we do compare that approach to Memory Networks on that corpus in Section 5.

Information Retrieval

Aside from the models described in the main paper, we also experimented with a hybrid relevance feedback approach: find the most similar message in the history, add its response to the query (with a certain weight), and then score candidate responses with the combined input. However, the relevance feedback model did not help: as we increase the feedback parameter (how much weight to give the retrieved response), the model only degrades; see Table 10 for the performance with a weight of 0.5.

Appendix B Optimal hyper-parameter values

Hyperparameters of all learning models have been set using grid search on the validation set. The main hyperparameters are the embedding dimension, the learning rate, the number of dictionaries, the number of hops for MemNNs, and the unfolding depth (blen) for LSTMs. All models are implemented in the Torch library (see torch.ch).

Task 1 (QA)

  • QA System of Bordes et al. (2014): , .

  • Supervised Embedding Model: , , .

  • MemN2N: , , , .

  • LSTM: , , .

Task 2 (Recommendation)

  • SVD: .

  • Supervised Embedding Model: , , .

  • MemN2N: , , , .

  • LSTM: , , .

Task 3 (QA+Recommendation)

  • Supervised Embedding Model: , , .

  • MemN2N: , , , .

  • LSTM: , , .

Task 4 (Reddit)

  • Supervised Embedding Model: , , .

  • MemN2N: , , , .

  • LSTM: , , .

Joint Task

We chose hyperparameters by taking the mean performance over the four tasks, after scaling each task by the best performing model on that task on the development set in order to normalize the metrics.

  • Supervised Embedding Model: , , .

  • MemN2N: , , .

Ubuntu Dialog Corpus

Hyperparameters of the MemN2N have been set using grid search on the validation set; we report models with 1 to 4 hops in Table 7, with the other hyperparameters selected by the same grid search.

Appendix C Further Detailed Results

C.1 Breakdown of Task 1 (QA) results by question type

QA System of Bordes et al. (2014) Supervised Embeddings MemN2N
Task h@1 h@10 h@1 h@10 h@1 h@10
writer to movie 98.7 98.7 77.3 90.8 77.6 95.5
tag to movie 71.8 71.8 53.4 96.1 61.4 88.6
movie to year 89.8 89.8   3.4 25.4 87.3 92.1
movie to writer 88.8 89.5 61.7 93.6 73.5 84.1
movie to tags 84.5 85.3 36.8 92.0 79.9 95.1
movie to language 94.6 94.8 45.2 84.7 90.1 97.6
movie to genre 93.0 93.5 46.4 95.0 92.5 99.4
movie to director 88.2 88.2 52.3 90.1 78.3 87.1
movie to actors 88.5 88.5 64.5 95.2 68.4 87.2
director to movie 98.3 98.3 61.4 93.8 71.5 91.0
actor to movie 98.9 98.9 79.0 89.4 76.7 96.7
total 90.7 91.0 50.9 82.97 78.9 91.8
Table 8: QA task test performance per question type (h@1 / h@10 metrics).

C.2 Breakdown of Task 3 (QA+Recommendation) results by response type


Whole Response 1 Response 2 Response 3
Methods Test Set (Recs) (QA) (Similar)
Supervised Embeddings 56.0 56.7 76.2 38.8
LSTM 19.9 35.3 14.3  9.2
MemN2N (1 hop) 70.5 47.0 89.2 76.5
MemN2N (2 hops) 76.8 53.4 90.1 88.6
MemN2N (3 hops) 75.4 52.6 90.0 84.2
MemN2N (3 hops, -KB) 75.9 54.3 85.0 91.5
Table 9: QA+Recommendation task test results (h@10 metric). The last row shows MemN2N without access to a long-term memory (KB).

C.3 Breakdown of Task 4 (Reddit) results by response type

Whole Entity
Methods Test Set Matched Response 1 Response 2 Response 3+
IR (query+context) 23.7 49.0 21.1 26.4 30.0
IR (query) 23.1 48.3 21.1 25.7 27.9
IR (query) RF=0.05 19.2 40.8 18.3 21.2 21.4
Supervised Embeddings 27.6 54.1 24.8 30.4 33.1
MemN2N (-KB) 29.6 57.0 25.6 34.2 37.2
MemN2N 29.2 56.4 25.4 32.9 37.0
Table 10: Reddit task test results (h@10 metric). MemN2N (-KB) is the Memory Network model without access to the knowledge base.