With the recent employment of Recurrent Neural Networks (RNNs) and the large quantities of conversational data available on websites like Twitter or Reddit, a new type of dialog system is emerging. Such end-to-end dialog systems (Ritter et al., 2011; Shang et al., 2015; Vinyals & Le, 2015; Sordoni et al., 2015) directly generate a response given the last user utterance and (potentially) the context from previous dialog turns without relying on the intermediate use of a dialog state tracking component like in traditional dialog systems (e.g. in Henderson (2015)). These methods are trained to imitate user-user conversations and do not need any hand-coding of attributes and labels for dialog states and goals like state tracking methods do. Being trained on large corpora, they are robust to many language variations and seem to mimic human conversations to some extent.
In spite of their flexibility and representational power, these neural network based methods lack pertinent goal-oriented frameworks to validate their performance. Indeed, traditional systems have a wide range of well defined evaluation paradigms and benchmarks that measure their ability to track user states and/or to reach user-defined goals (Walker et al., 1997; Paek, 2001; Griol et al., 2008; Williams et al., 2013). Recent end-to-end models, on the other hand, rely either on very few human scores (Vinyals & Le, 2015), crowdsourcing (Ritter et al., 2011; Shang et al., 2015) or machine translation metrics like BLEU (Sordoni et al., 2015) to judge the quality of the generated language only. This is problematic because these evaluations do not assess if end-to-end systems can conduct dialog to achieve pre-defined objectives, but simply whether they can generate correct language that could fit in the context of the dialog; in other words, they quantify their chit-chatting abilities.
To fill in this gap, this paper proposes a collection of four tasks designed to evaluate different pre-requisite qualities of end-to-end dialog systems. Focusing on the movie domain, we propose to test if systems are able to jointly perform: (1) question-answering (QA), (2) recommendation, (3) a mix of recommendation and QA and (4) general dialog about the topic, which we call chit-chat. All four tasks have been chosen because they test basic capabilities we expect a dialog system performing insightful movie recommendation should have while evaluation on each of them can be well defined without the need of human-in-the-loop (e.g. via Wizard-of-Oz strategies (Whittaker et al., 2002)). Our ultimate goal is to validate if a single model can solve the four tasks at once, which we assert is a pre-requisite for an end-to-end dialog system supposed to act as a movie recommendation assistant, and by extension a general dialog agent as well. At the same time we advocate developing methods that make no special engineering for this domain, and hence should generalize to learning on tasks and data from other domains easily.
In contrast to the bAbI tasks which test basic capabilities of story understanding systems (Weston et al., 2015b), the tasks have been created using large-scale real-world sources: OMDb (http://en.omdb.org), MovieLens (http://movielens.org) and Reddit (http://reddit.com/r/movie). Overall, the dataset covers 75k movie entities (movie, actor, director, genre, etc.) with 3.5M training examples: even if the dataset is restricted to a single domain, it is large and allows a great variety of discussions, language and user goals. We evaluate on these tasks the performance of various neural network models that can potentially create end-to-end dialogs, ranging from simple supervised embedding models (Bai et al., 2009) and RNNs with Long Short-Term Memory (LSTMs) (Hochreiter & Schmidhuber, 1997) to attention-based models, in particular Memory Networks (Sukhbaatar et al., 2015). To validate the quality of our results, we also apply our best performing model, Memory Networks, in other conditions by comparing it on the Ubuntu Dialog Corpus (Lowe et al., 2015) against baselines trained by the authors of the corpus. We show that Memory Networks outperform all baselines by a wide margin.
2 The Movie Dialog Dataset
We introduce a set of four tasks to test the ability of end-to-end dialog systems, focusing on the domain of movies and movie related entities. They aim to test five abilities which we postulate as being key towards a fully functional general dialog system (i.e., not specific to movies per se):
QA Dataset: Tests the ability to answer factoid questions that can be answered without relation to previous dialog. The context consists of the question only.
Recommendation Dataset: Tests the ability to provide personalized responses to the user via recommendations (in this case, of movies) rather than universal facts as above.
QA+Recommendation Dataset: Tests the ability of maintaining short dialogs involving both factoid and personalized content where conversational state has to be maintained.
Reddit Dataset: Tests the ability to identify most likely replies in discussions on Reddit.
Joint Dataset: All our tasks are dialogs. They can be combined into a single dataset, testing the ability of an end-to-end model to perform well at all skills at once.
2.1 Question Answering (QA)
The first task we build is to test whether a dialog agent is capable of answering simple factual questions. The dataset was built from the Open Movie Database (OMDb; downloaded from http://beforethecode.com/projects/omdb/download.aspx), which contains metadata about movies. The subset we consider contains 15k movies, 10k actors and 6k directors. We also matched these movies to the MovieLens dataset (http://grouplens.org/datasets/movielens/) to attribute tags to each movie. We build a knowledge base (KB) directly from the combined data, stored as triples such as (The Dark Horse, starred_actor, Bette Davis) and (Moonraker, has_tag, james bond), with 8 different relation types involving director, writer, actor, release date, genre, tags, rating and imdb votes.
We distinguish 11 classes of question, corresponding to different kinds of edges in our KB: actor to movie (“What movies did Michael J Fox star in?”), movie to actors (“Who starred in Back to The Future?”), movie to director, director to movie, movie to writer, writer to movie, movie to tags, tag to movie, movie to year, movie to genre and movie to language. For each question type there is a set of possible answers. Using SimpleQuestions, an existing open-domain question answering dataset based on Freebase (Bordes et al., 2015) we identified the subset of questions posed by those human annotators that covered our question types. We expanded this set to cover all of our KB by substituting the actual entities in those questions to also apply them to other questions, e.g. if the original question written by an annotator was “What movies did Michael J Fox star in?”, we created a pattern “What movies did [@actor] star in?” which we substitute for any other actors in our set, and repeat this for all annotations. We split the questions into training, development and test sets with 96k, 10k and 10k examples, respectively.
|Task 1: Factoid Question Answering (QA)|
|What movies are about open source? Revolution OS|
|Ruggero Raimondi appears in which movies? Carmen|
|What movies did Darren McGavin star in? Billy Madison, The Night Stalker, Mrs. Pollifax-Spy|
|Can you name a film directed by Stuart Ortiz? Grave Encounters|
|Who directed the film White Elephant? Pablo Trapero|
|What is the genre of the film Dial M for Murder? Thriller, Crime|
|What language is Whity in? German|
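The template expansion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `make_pattern` and `expand`, and the toy `actor_to_movies` index, are hypothetical and are not taken from the authors' pipeline.

```python
def make_pattern(question, entity, slot):
    """Turn an annotated question into a pattern by replacing its entity with a typed placeholder."""
    return question.replace(entity, "[@" + slot + "]")


def expand(pattern, slot, kb_index):
    """Instantiate the pattern for every entity of the slot type, paired with its KB answers."""
    placeholder = "[@" + slot + "]"
    return [(pattern.replace(placeholder, entity), answers)
            for entity, answers in sorted(kb_index.items())]


# Hypothetical toy index (actor -> movies), standing in for the real knowledge base.
actor_to_movies = {
    "Michael J Fox": ["Back to The Future"],
    "Darren McGavin": ["Billy Madison", "The Night Stalker"],
}

pattern = make_pattern("What movies did Michael J Fox star in?", "Michael J Fox", "actor")
examples = expand(pattern, "actor", actor_to_movies)
```

Each element of `examples` is a (question, answers) pair, so one annotated question yields one example per actor in the index.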
To simplify evaluation rather than requiring the generation of sentences containing the answers, we simply ask a model to output a list, which is ranked as the possible set of answers. We then use standard ranking metrics to evaluate the list, making the results easy to interpret. Our main results report the hits@1 metric (i.e. is the top answer correct); other metrics are given in the appendix.
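As an illustration of the ranking metric, here is a minimal sketch of hits@k. It assumes that an example counts as correct when any of its gold answers appears in the top k positions; the function names are ours, not the authors'.

```python
def hits_at_k(ranked, gold, k):
    """Return 1.0 if any gold answer is among the first k ranked candidates, else 0.0."""
    gold = set(gold)
    return float(any(candidate in gold for candidate in ranked[:k]))


def mean_hits_at_k(examples, k):
    """Average hits@k over an iterable of (ranked_candidates, gold_answers) pairs."""
    examples = list(examples)
    return sum(hits_at_k(ranked, gold, k) for ranked, gold in examples) / len(examples)
```

With k = 1 this reduces to checking whether the top-ranked answer is correct.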
2.2 Recommendation Dataset
Not all questions about movies in dialogs have an objective answer, independent of the person asking; indeed much of human dialog is based on opinions and personalized responses. One of the simplest dialogs of this type to evaluate is that of recommendation, where we can utilize existing data resources. We again employ the MovieLens dataset which features a user item matrix of movie ratings, rated from 1 to 5. We filtered the set of movies to be the same set as in the QA task and additionally only kept movies that had at least 2 ratings, giving around 11k movies.
To use this data for evaluating dialog, we then use it to generate dialog exchanges. We first select a user at random; this will be the user who is participating in the dialog, and then sample 1-8 movies that the user has rated 5. We then form a statement intended to express the user’s feelings about these movies, according to a fixed set of natural language templates, one of which is selected randomly. See Table 2 for some examples. From the remaining set of movies the same user gave a rating of 5, we select one to be the answer.
|Task 2: Recommendation|
|Schindler’s List, The Fugitive, Apocalypse Now, Pulp Fiction, and The Godfather are films I really liked.|
|Can you suggest a film? The Hunt for Red October|
|Some movies I like are Heat, Kids, Fight Club, Shaun of the Dead, The Avengers, Skyfall, and Jurassic Park.|
|Can you suggest something else I might like? Ocean’s Eleven|
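The generation procedure above can be sketched as follows. This is a hedged illustration, not the authors' generator: the two templates and two questions are copied from the examples in the table, while `make_example`, `join_titles`, and the nested-dict layout of `ratings` are assumptions made for the sketch.

```python
import random

TEMPLATES = [
    "{movies} are films I really liked.",
    "Some movies I like are {movies}.",
]
QUESTIONS = [
    "Can you suggest a film?",
    "Can you suggest something else I might like?",
]


def join_titles(titles):
    """Join movie titles into an English list."""
    if len(titles) == 1:
        return titles[0]
    if len(titles) == 2:
        return titles[0] + " and " + titles[1]
    return ", ".join(titles[:-1]) + ", and " + titles[-1]


def make_example(ratings, rng, max_context=8):
    """Build one (dialog input, answer) pair from ratings[user][movie] -> rating in 1..5."""
    user = rng.choice(sorted(ratings))
    liked = sorted(m for m, r in ratings[user].items() if r == 5)
    if len(liked) < 2:
        return None  # need at least one context movie plus one held-out answer
    n_context = rng.randint(1, min(max_context, len(liked) - 1))
    context = rng.sample(liked, n_context)
    remaining = [m for m in liked if m not in context]
    answer = rng.choice(remaining)  # held-out movie the same user also rated 5
    statement = rng.choice(TEMPLATES).format(movies=join_titles(context))
    return statement + " " + rng.choice(QUESTIONS), answer
```

Calling `make_example` repeatedly with the same `random.Random` instance samples users with replacement.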
There are 110k users in the training set, 1k users in the development set and 1k in the test set. We follow the procedure above, sampling users with replacement, and generate 1M training examples and 10k development and test set examples, respectively. To evaluate the performance of a model, just as in the first task, we evaluate a ranked list of answers. In our main results we measure hits@100, i.e. 1 if the provided answer is in the top 100, and 0 otherwise, rather than hits@1 as this task is harder than the last.
Note that we expect absolute hits@k numbers to be lower for this task than for QA due to incomplete labeling (“missing ratings”): in recommendation there is no exact right answer, and it is not surprising the actual single true label is not always at the top position, i.e. the top predictions of the model may be good as well, but we do not have their labels. One can thus view the ranking metric as a kind of lower bound on performance of actually labeling all the predictions using human annotations, which would be time consuming and no longer automatic, and hence undesirable for algorithm development. This is standard in recommendation, see e.g. Cremonesi et al. (2010).
2.3 QA+Recommendation Dialog
The tasks presented so far only involve questions followed by responses, with no context from previous dialog. This task aims at evaluating responses in the context of multiple previous exchanges, while remaining straightforward enough that evaluation and analysis are still tractable. We hence combine the question answering and recommendation tasks from before in a multi-response dialog, where dialogs consist of 3 exchanges (3 turns from each participant).
The first exchange requires a recommendation similar to Task 2, except that the user also specifies what genre or topic they are interested in, e.g. “I’m looking for a Music movie”, where the answer might be “School of Rock”, as in the example of Table 3.
In the second exchange, given the model’s response (movie suggestion), the user asks a factoid question about that suggestion, e.g. “What else is that about?”, “Who stars in that?” and so on. These questions refer back to the previous dialog, making context important.
In the third exchange, the user asks for an alternative recommendation, and provides extra information about their tastes, e.g. “I like Tim Burton movies more”. Again, context of the last two exchanges should help for best performance.
|Task 3: QA + Recommendation Dialog|
|I loved Billy Madison, My Neighbor Totoro, Blades of Glory, Bio-Dome, Clue, and Happy Gilmore.|
|I’m looking for a Music movie. School of Rock|
|What else is that about? Music, Musical, Jack Black, school, teacher, Richard Linklater, rock, guitar|
|I like rock and roll movies more. Do you know anything else? Little Richard|
|Tombstone, Legends of the Fall, Braveheart, The Net, Outbreak, and French Kiss are films I really liked.|
|I’m looking for a Fantasy movie. Jumanji|
|Who directed that? Joe Johnston|
|I like Tim Burton movies more. Do you know anything else? Big Fish|
We thus generate 1M examples of such 6 line dialogs (3 turns from each participant) for training, and 10k for development and testing respectively. We can evaluate the performance of models across all the lines of dialog (e.g., all 30k responses from the test set), but also only on the 1st (Recommendation), 2nd (QA) or 3rd exchange (Similarity) for a more fine-grained analysis. We again use a ranking metric (here, hits@10), just as in our previous tasks.
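The per-exchange breakdown can be computed with a small helper such as the one below. This is a sketch under assumptions: the input format (exchange index, ranked candidates, gold answers) and the name `hits_by_exchange` are ours.

```python
from collections import defaultdict


def hits_by_exchange(results, k=10):
    """Compute hits@k separately for each exchange index (1 = recommendation, 2 = QA, 3 = similarity).

    results: iterable of (exchange_index, ranked_candidates, gold_answers).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for index, ranked, gold in results:
        gold = set(gold)
        totals[index] += 1
        hits[index] += int(any(c in gold for c in ranked[:k]))
    return {index: hits[index] / totals[index] for index in sorted(totals)}
```

Averaging over all entries, irrespective of exchange index, gives the overall score across all lines of dialog.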
2.4 Reddit Discussion
Our fourth task is to predict responses in movie discussions using real conversation data taken directly from Reddit, a website where registered community members can submit content in various areas of interest, called “subreddits”. We selected the movie subreddit (https://www.reddit.com/r/movies, selecting from the dataset available at https://www.reddit.com/r/datasets/comments/3bxlg7) to match our other tasks.
The original discussion data is potentially between multiple participants. To simplify the setup, we flatten this to appear as two participants (parent and comment), just as in our other tasks. In this way we collected 1M dialogs, of which 10k are reserved for a development set, and another 10k for the test set. Of the dialogs, 76% involve a single exchange, 17% have at least two exchanges, and 7% have at least three exchanges (the longest exchange is length 50).
|Task 4: Reddit Discussion|
|I think the Terminator movies really suck, I mean the first one was kinda ok, but after that they got really|
|cheesy. Even the second one which people somehow think is great. And after that… forgeddabotit.|
|C’mon the second one was still pretty cool.. Arny was still so badass, as was Sararah Connor’s character..|
|and the way they blended real action and effects was perhaps the last of its kind…|
To evaluate the performance of models, we again separate the problem of evaluating the quality of a response from that of language generation by considering a ranking setup, in line with other recent works (Sordoni et al., 2015). We proceed as follows: we select a further 10k comments for the development set and another 10k for the test set which have not appeared elsewhere in our dataset, and use these as potential candidates for ranking during evaluation. For each exchange, given the input context, we rank 10001 possible candidates: the true response given in the dataset, plus the 10k “negative” candidates just described. The model has to rank the true response as high as possible. Similar to recommendation as described before we do not expect absolute hits@k performance to be as high as for QA due to incomplete labeling. As with Task 3, we can evaluate on all the data, or only on the 1st, 2nd or 3rd exchange, and so on. We also identified the subset of the test set where there is an entity match with at least two entities from Tasks 1-3, where one of the entities appears in the input, and the other in the response: this subset serves to evaluate the impact of using a knowledge base for conducting such a dialog.
2.5 Joint Task
Finally, we consider a task made of the combination of all four of the previous ones. At both training and test time examples consist of exchanges from any of the datasets, sampled at random, whereby the conversation is ‘reset’ at each sample, so that the context history only ever includes exchanges from the current conversation.
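A minimal sketch of this sampling scheme is given below. It assumes that tasks are drawn uniformly at random, which the text does not specify, and that each dialog is stored as a list of (user message, response) pairs; both function names are hypothetical.

```python
import random


def joint_stream(datasets, n, rng):
    """Yield n (task, dialog) pairs, drawing the task uniformly at random (an assumption)."""
    names = sorted(datasets)
    for _ in range(n):
        task = rng.choice(names)
        yield task, rng.choice(datasets[task])


def contexts(dialog):
    """Yield (history, user_message, response) for each exchange of one dialog.

    The history starts empty for every dialog, which is the 'reset' between samples.
    """
    history = []
    for user_message, response in dialog:
        yield list(history), user_message, response
        history += [user_message, response]
```

Because `contexts` is called once per sampled dialog, no exchange from a previous sample can leak into the context history.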
We consider this to be the most important task, as it tests whether a model can not only produce chit-chat (Task 4) but also can provide meaningful answers during dialog (Tasks 1-3). On the other hand, the point of delineating the separate tasks is to evaluate exactly which types of dialog a model is succeeding at or not. That all the datasets are in the same domain is crucial to testing the ability of models at performing well on all tasks jointly. If the domains were different, then the vocabularies would be trivially non-overlapping, allowing a model to learn effectively separate models inside a single one.
2.6 Relation to Existing Evaluation Frameworks
Traditional dialog systems consist of two main modules: (1) a dialog state tracking component that tracks what has happened in a dialog, incorporating into a pre-defined explicit state structure system outputs, user utterances, context from previous turns, and other external information, and (2) a response generator. Evaluation of the dialog state tracking stage is well defined since the PARADISE framework (Walker et al., 1997) and subsequent initiatives (Paek, 2001; Griol et al., 2008), including recent competitions (Williams et al., 2013; Henderson et al., 2014) as well as situated variants (Rojas-Barahona et al., 2012). However, they require fine grained data annotations in terms of labeling internal dialog state and precisely defined user intent (goals). As a result, they do not really scale to large domains and dialogs with high variability in terms of language. Because of language ambiguity and variation, evaluation of the response generation step is complicated and usually relies on human judgement (Walker et al., 2003).
End-to-end dialog systems do not rely on explicit internal state and hence do not have state tracking modules; they directly generate responses given user utterances and dialog context and hence cannot be evaluated using state tracking test-beds. Unfortunately, as for response generator modules, their evaluation is ill-defined as it is difficult to objectively rate at scale the fit of returned responses. Most existing work (Ritter et al., 2011; Shang et al., 2015; Vinyals & Le, 2015; Sordoni et al., 2015) chose to use human ratings, which does not easily scale. Sordoni et al. (2015) also use the BLEU score to compare to actual user utterances but this is not a completely satisfying measure of success, especially when used in a chit-chat setting where there are no clear goals and hence measures of success. Lowe et al. (2015) use a similar ranking evaluation to ours, but only in a chit-chat setting.
Our approach of providing a collection of tasks to be jointly solved is related to the evaluation framework of the bAbI tasks (Weston et al., 2015a) and of the collection of sequence prediction tasks of Joulin & Mikolov (2015). However, unlike them, our Tasks 1-3 are much closer to real dialog, being built from human-written text, and with Task 4 actually involving real dialog from Reddit. The design of our tasks is such that all test one or more key characteristics a dialog system should have but also that an unambiguous answer is expected after each dialog act. In that sense, it follows the notion of dialog evaluation by a reference answer introduced by Hirschman et al. (1990). The application of movie recommender systems is connected to that of TV program suggestion proposed by Ramachandran et al. (2014), except that we frame it so that we can generate systematic evaluation from it, where they only rely on human judgement at small scale.
3.1 Memory Networks
Memory Networks (Weston et al., 2015c; Sukhbaatar et al., 2015) are a recent class of models that perform language understanding by incorporating a memory component that potentially includes both long-term memory (e.g., to remember facts about the world) and short-term context (e.g., the last few turns of dialog). They have only been evaluated in a few setups: question answering (Bordes et al., 2015), language modeling (Sukhbaatar et al., 2015; Hill et al., 2015), and language understanding on the bAbI tasks (Weston et al., 2015a), but not so far on dialog tasks such as ours.
We employ the MemN2N architecture of Sukhbaatar et al. (2015) in our experiments, with some additional modifications to construct both long-term and short-term context memories. At any given time step $t$ we are given as input the history of the current conversation: messages from the user $c^u_i$ at time step $i$ and the corresponding responses from the model itself $c^r_i$ at the corresponding time steps, $i = 1, \dots, t-1$. At the current time $t$ we are only given the input $c^u_t$ and the model has to respond.
Retrieving long-term memories
For each word in the last $N$ messages we perform a hash lookup to return all long-term memories (sentences) from a database that also contain that word. Words above a certain frequency cutoff can be ignored to avoid sentences that only share syntax or unimportant words. We employ the movie knowledge base of Sec. 2.1 for our long-term memories, but potentially any text dataset could be used. See Figure 5 for an example of this process.
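The hash lookup can be pictured as an inverted index from words to the sentences that contain them. The sketch below is an assumption-laden illustration: it measures word frequency as the number of sentences containing the word, uses a naive whitespace tokenizer, and all function names are ours.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Naive lowercase whitespace tokenizer (a simplification)."""
    return text.lower().replace(",", " ").split()


def build_index(sentences, max_freq):
    """Map each word occurring in at most max_freq sentences to the ids of those sentences."""
    counts = Counter(w for s in sentences for w in set(tokenize(s)))
    index = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for word in set(tokenize(sentence)):
            if counts[word] <= max_freq:  # skip words above the frequency cutoff
                index[word].add(i)
    return index


def retrieve(index, sentences, messages):
    """Return every stored sentence sharing an indexed word with the given messages."""
    ids = set()
    for message in messages:
        for word in tokenize(message):
            ids |= index.get(word, set())
    return [sentences[i] for i in sorted(ids)]
```

Passing only the most recent messages to `retrieve` corresponds to restricting the lookup to the last few turns of the conversation.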
Attention over memories
The sentences $h_i$, $i = 1, \dots, |h|$, returned from the hashing step plus the messages from the current conversation form the memory of the Memory Network (we also add time features to each memory to denote their position, following Sukhbaatar et al. (2015)):
$$m = (c^u_1, c^r_1, \dots, c^u_{t-1}, c^r_{t-1}, h_1, \dots, h_{|h|}).$$
The last user input $c^u_t$ is embedded using a matrix $A$ of size $d \times V$ where $d$ is the embedding dimension and $V$ is the size of the vocabulary, giving $u = A c^u_t$. Each memory $m_i$ is embedded using the same matrix, giving $\bar{m}_i = A m_i$. The match between the input and the memories is then computed by taking the inner product followed by a softmax:
$$p_i = \mathrm{Softmax}(u^\top \bar{m}_i),$$
giving a probability vector over the memories. The memory output is then constructed as
$$o = R \sum_i p_i \bar{m}_i,$$
where $R$ is a $d \times d$ rotation matrix (optionally, different dictionaries can be used for inputs, memories and outputs instead of being shared). The memory output is then added to the original input, $u_2 = o + u$. This procedure can then be stacked in what is called multiple “hops” of attention over the memory.
Generating the final prediction
The final prediction is then defined as:
$$\hat{a} = \mathrm{Softmax}(u^\top A' y_1, \dots, u^\top A' y_C),$$
where $u$ is the input representation after the final hop, there are $C$ candidate responses in $y$, and $A'$ is of dimension $d \times V$. For Tasks 1-3 the candidates are the set of words in the vocabulary, which are ranked for final evaluation, whereas for Task 4 the candidates are target responses (sentences).
The whole model is trained using stochastic gradient descent by minimizing a standard cross-entropy loss between $\hat{a}$ and the true label $a$.
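To make the attention and prediction steps concrete, here is a minimal NumPy sketch of a forward pass. It is not the authors' implementation: it assumes bag-of-words input vectors, reuses the same rotation matrix at every hop, omits the time features, and the names `forward`, `A_out`, and the argument layout are our own.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def forward(A, A_out, R, query, memories, candidates, hops=1):
    """Score candidate responses with attention over memories.

    A, A_out: (d, V) embedding matrices for inputs/memories and for candidates.
    R: (d, d) matrix applied to the attended memory output.
    query: (V,) bag-of-words vector of the last user input.
    memories: (n, V) and candidates: (C, V) bag-of-words matrices.
    """
    u = A @ query                       # embedded input, shape (d,)
    M = memories @ A.T                  # embedded memories, shape (n, d)
    for _ in range(hops):
        p = softmax(M @ u)              # attention weights over the n memories
        o = R @ (p @ M)                 # weighted sum of memories, then rotation
        u = u + o                       # add memory output to the input
    return softmax((candidates @ A_out.T) @ u)  # distribution over the C candidates
```

The returned vector can be sorted to obtain the ranked candidate list used by the hits@k metrics; training would minimize the cross-entropy between this vector and the true label.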
|Long-Term||Shaolin Soccer directed_by Stephen Chow|
|Memories||Shaolin Soccer written_by Stephen Chow|
|Shaolin Soccer starred_actors Stephen Chow|
|Shaolin Soccer release_year 2001|
|Shaolin Soccer has_genre comedy|
|Shaolin Soccer has_tags martial arts, kung fu soccer, stephen chow|
|Kung Fu Hustle directed_by Stephen Chow|
|Kung Fu Hustle written_by Stephen Chow|
|Kung Fu Hustle starred_actors Stephen Chow|
|Kung Fu Hustle has_genre comedy action|
|Kung Fu Hustle has_imdb_votes famous|
|Kung Fu Hustle has_tags comedy, action, martial arts, kung fu, china, soccer, hong kong, stephen chow|
|The God of Cookery directed_by Stephen Chow|
|The God of Cookery written_by Stephen Chow|
|The God of Cookery starred_actors Stephen Chow|
|The God of Cookery has_tags hong kong Stephen Chow|
|From Beijing with Love directed_by Stephen Chow|
|From Beijing with Love written_by Stephen Chow|
|From Beijing with Love starred_actors Stephen Chow, Anita Yuen|
|…and more …|
|Short-Term||1) I’m looking a fun comedy to watch tonight, any ideas?|
|Memories||2) Have you seen Shaolin Soccer? That was zany and great.. really funny but in a whacky way.|
|Input||3) Yes! Shaolin Soccer and Kung Fu Hustle are so good I really need to find some more Stephen Chow|
|films I feel like there is more awesomeness out there that I haven’t discovered yet …|
|Output||4) God of Cookery is pretty great, one of his mid 90’s hong kong martial art comedies.|
3.2 Supervised Embedding Models
While one of the major uses of word embedding models is to learn unsupervised embeddings over large unlabeled datasets such as in Word2Vec (Mikolov et al., 2013), there are also very effective word embedding models for training supervised models when labeled data is available. The simplest approach, which works surprisingly well, is to sum the word embeddings of the input and the target independently and then compare them with a similarity metric such as inner product or cosine similarity. A ranking loss is used to ensure the correct targets are ranked higher than any other targets. Several variants of this approach exist. For matching two documents, supervised semantic indexing (SSI) was shown to be superior to unsupervised latent semantic indexing (LSI) (Bai et al., 2009). Similar methods were shown to outperform SVD for recommendation (Weston et al., 2013). However, we do not expect this method to work as well on question answering tasks, as all the memorization must occur in the individual word embeddings, which was shown to perform poorly in (Bordes et al., 2014). For example, consider asking the question “who was born in Paris?” and requiring the word embedding for Paris to effectively contain all the pertinent information. However, for rarer items requiring less storage, performance may not be as degraded. In general we believe this is a surprisingly strong baseline that is often neglected in evaluations. Our implementation corresponds to a Memory Network with no attention over memory.
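The scoring and ranking loss described above can be sketched as follows. This is an illustration only: it assumes a single embedding matrix shared by inputs and targets, an inner-product similarity, and a hinge-style margin ranking loss with a hypothetical margin of 0.1.

```python
import numpy as np


def embed(W, bag):
    """Sum of word embeddings: W is (d, V), bag is a (V,) vector of word counts."""
    return W @ bag


def score(W, x_bag, y_bag):
    """Inner-product similarity between the summed embeddings of input and target."""
    return float(embed(W, x_bag) @ embed(W, y_bag))


def margin_ranking_loss(W, x_bag, pos_bag, neg_bags, margin=0.1):
    """Penalize every negative target scoring within `margin` of the correct target."""
    pos = score(W, x_bag, pos_bag)
    return sum(max(0.0, margin - pos + score(W, x_bag, neg)) for neg in neg_bags)
```

The loss is zero once the correct target outscores all sampled negatives by at least the margin.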
3.3 Recurrent Language Models
Recurrent Neural Networks (RNNs) have proven successful at several tasks involving natural language, such as language modeling (Mikolov et al., 2011), and have been applied recently to dialog (Sordoni et al., 2015; Vinyals & Le, 2015; Shang et al., 2015). LSTMs are not, however, known for tasks such as QA or item recommendation, and so we expect them to find our datasets challenging.
There are a large number of variants of RNNs, including Long-Short Term Memory activation units (LSTMs) (Hochreiter & Schmidhuber, 1997), bidirectional LSTMs (Graves et al., 2012), seq2seq models (Sutskever et al., 2014), RNNs that take into account the document context (Mikolov & Zweig, 2012) and RNNs that perform attention over their input in various different ways (Bahdanau et al., 2015; Hermann et al., 2015; Rush et al., 2015). Evaluating all these variants is beyond the scope of this work and we instead use standard LSTMs as our baseline method (we used the code available at https://github.com/facebook/SCRNNs). However, we note that LSTMs with attention have many properties in common with Memory Networks if the attention is applied over the same memory setup.
3.4 Question Answering Systems
For the particular case of Task 1 we can apply existing question answering systems. There has been a recent surge in interest in such systems that try to answer a question posed in natural language by converting it into a database search over a knowledge base (Berant & Liang, 2014; Kwiatkowski et al., 2013; Fader et al., 2014), which is a setup natural for our QA task also. However, such systems cannot easily solve any of our other tasks, for example our recommendation Task 2 does not involve looking up a factoid answer in a database. Nevertheless, this allows us to compare the performance of end-to-end systems performant on all our tasks to a standard QA benchmark. We chose the method of Bordes et al. (2014) as our baseline, using the ‘Path Representation’ for the knowledge base as described in Sec. 3.1 of Bordes et al. (2014). This system learns embeddings that match questions to database entries, and then ranks the set of entries, and has been shown to achieve good performance on the WebQuestions benchmark (Berant et al., 2013).
3.5 Singular Value Decomposition
Singular Value Decomposition (SVD) is a standard benchmark for recommendation, being at the core of the best ensemble results in the Netflix challenge, see Koren & Bell (2011) for a review. However, it has been shown to be outperformed by other flavors of matrix factorization, in particular by using a ranking loss rather than squared loss (Weston et al., 2013) which we will compare to (cf. Sec. 3.2), as well as improvements like SVD++ (Koren, 2008). Collaborative filtering methods are applicable to Task 2, but cannot easily be used for any of the other tasks. Even for Task 2, while our dialog models use textual input, as shown in Table 2, SVD requires a user item matrix, so for this baseline we preprocessed the text to assign each entity an ID, and discarded all other text. In contrast, the end-to-end dialog models have to learn to process the text as part of the task.
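As a rough illustration of an SVD-style recommender over a user item matrix, the sketch below uses a plain truncated SVD from NumPy and treats missing ratings as zeros. Both choices are simplifying assumptions; the text does not specify how the SVD baseline was fit, and the function names are ours.

```python
import numpy as np


def svd_scores(ratings, rank):
    """Low-rank reconstruction of a (users, items) rating matrix via truncated SVD."""
    U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]


def recommend(ratings, user, rank, k):
    """Return the indices of the k highest-scoring items the user has not rated."""
    scores = svd_scores(ratings, rank)[user].copy()
    scores[ratings[user] > 0] = -np.inf  # never recommend already-rated items
    return [int(i) for i in np.argsort(-scores)[:k]]
```

Note that the only input is the matrix of entity IDs and ratings, which is why this kind of baseline needs the text to be preprocessed away first.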
3.6 Information Retrieval Models
To select candidate responses a standard baseline is nearest neighbour information retrieval (IR) (Isbell et al., 2000; Jafarpour et al., 2010; Ritter et al., 2011; Sordoni et al., 2015). Two simple variants are often tried: given an input message, either (i) find the most similar message in the (training) dataset and output the response from that exchange; or (ii) find the most similar response to the input directly. In both cases the standard measure of similarity is tf-idf weighted cosine similarity between the bags of words. Note that the Supervised Embedding Models of Sec. 3.2 effectively implement the same kind of model (ii) but with a learnt similarity measure. It has been shown previously that method (ii) performs better (Ritter et al., 2011), and our initial IR experiments showed the same result. Note that while (non-learning) IR systems can also be applied to other tasks such as QA (Kolomiyets & Moens, 2011) they require significant tuning to do so. Here we stick to a vanilla vector space model and hence only apply an IR baseline to Task 4.
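Variant (ii) can be sketched in pure Python as below. This is a hedged illustration: the idf formula `log(n / df) + 1`, the whitespace tokenizer, and the function names are assumptions, not details taken from the experiments.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return sparse tf-idf vectors (dicts) for the docs, plus the idf table."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(w for tokens in tokenized for w in set(tokens))
    n = len(docs)
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}  # assumed idf variant
    vectors = [{w: c * idf[w] for w, c in Counter(tokens).items()} for tokens in tokenized]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_response(message, responses):
    """Variant (ii): return the candidate response most similar to the input message."""
    vectors, idf = tfidf_vectors(responses)
    query = {w: c * idf.get(w, 0.0) for w, c in Counter(message.lower().split()).items()}
    scores = [cosine(query, v) for v in vectors]
    return responses[max(range(len(responses)), key=scores.__getitem__)]
```

Variant (i) would instead compare the input against stored messages and return the response paired with the best match.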
Our main results across all the models and tasks are given in Table 4. Supervised Embeddings and Memory Networks are tested in two settings: trained and tested on all tasks separately, or jointly on the combined Task 5. Other methods are only evaluated on independent tasks. In all cases, parameter search was performed on the development sets; parameter choices are provided in the appendix.
|Methods||QA Task||Recs Task||QA+Recs Task||Reddit Task|
|QA System (Bordes et al., 2014)||90.7||n/a||n/a||n/a|
|Joint Supervised Embeddings||43.6||28.1||58.9||14.5|
Answering Factual Questions
Memory Networks and the baseline QA system are the two methods that have an explicit long-term memory via access to the knowledge base (KB). On the task of answering factual questions where the answers are contained in the KB, they outperform the other methods convincingly, with LSTMs being particularly poor. The latter is not unexpected as that method is good at language modeling, not question answering, see e.g. Weston et al. (2015b). The baseline QA system, which is designed for this task, is superior to Memory Networks, indicating there is still room for improvement in that model. On the other hand, the latter’s much more general design allows it to perform well on our other dialog tasks, whereas the former is task specific.
In the recommendation task (Task 2), a long-term memory does not bring any improvement, with LSTMs, Supervised Embeddings and Memory Networks all performing similarly, and all outperforming the SVD baseline. Here, we conjecture LSTMs can perform well because it looks much more like a language modeling task, i.e. the input is a sequence of similar recommendations.
Using Dialog History
In both QA+Recommendations (Task 3) and Reddit (Task 4) Memory Networks outperform Supervised Embeddings due to their better use of context. This can be seen by breaking down the results by length of context: in the first response they perform similarly, but Memory Networks show a relative improvement on the second and third responses, see Tables 9 and 10 in the appendix. Note that these improvements come from the short term memory (dialog history), not from the use of the KB, as we show Memory Networks results without access to the KB and they perform similarly. We believe the QA performance in these cases is not hindered by the lack of a KB because we ask questions based on fewer relations than in Task 1 and it is easier to store the knowledge directly in the word embeddings. The baseline IR model in Task 4 benefits from context too; it is compared with and without context in Table 10. LSTMs perform poorly: the posts in Reddit are quite long and the memory of the LSTM is relatively short, as pointed out by Sordoni et al. (2015). In that work they employed a linear reranker that used LSTM prediction as features to better effect. Testing more powerful recurrent networks such as LSTMs with attention on these benchmarks remains as future work (although the latter is related to Memory Networks, which we do report).
A truly end-to-end dialog system has to be good at all the skills in Tasks 1-4 (and more besides, i.e. this is necessary, but not sufficient). We thus report results on our Combined Task for Supervised Embeddings and Memory Networks. Supervised Embeddings still have the same failings as before on Tasks 1 and 3, but now seem to perform even more poorly due to the difficulty of encoding all the necessary skills in the word embeddings, so e.g., they now do significantly worse on Task 4. This is despite us trying word embeddings of up to 2000 dimensions. Memory Networks fare better, having only a slight loss in performance on Tasks 2-4 and a slight gain in Task 1. In their case, the modeling power is not only in the word embeddings, but also in the attention over the long-term and short-term memory, so it does not need as much capacity in the word embeddings. However, the best achievable models would presumably have some improvement from training across all the tasks, not a loss, and would perform at least as well as all the individual task baselines (i.e. in this case, perform better at Task 1).
5 Ubuntu Dialogue Corpus Results
As no other authors have yet published results on our new benchmark, we validate the quality of our results by also evaluating our best performing model on the Ubuntu Dialog Corpus (Lowe et al., 2015). In particular, this also allows us to compare to more sophisticated LSTM models that are trained discriminatively using metric learning, as well as additional baseline methods, all trained by the authors. The Ubuntu Dialog Corpus contains almost 1M dialogs of more than 7 turns on average (900k dialogs for training, 20k for validation and 20k for testing), and 100M words. The corpus was scraped from the Ubuntu IRC channel logs, where users ask questions about issues they are having with Ubuntu and get answers from other users. Most chats can involve more than two users, but a series of heuristics was used to disentangle them into dyadic dialogs.
The evaluation is similar to that of Reddit (Task 4): each correct answer has to be retrieved among a set of 10, mixed with 9 randomly chosen candidate utterances. We report the Hits@1 in Table 7 (the results for the baselines from Lowe et al. (2015) differ from those in v3 of the arXiv paper, because the corpus has been updated since then; all results in Table 7 use the latest version of the corpus). We used the same MemN2N architecture as before. All models were selected using validation accuracy. On this dataset, which has longer dialogs than those from the Movie Dialog Corpus, we can see that running more hops on the memory with the MemN2N improves performance: the 1-hop model performs similarly to the LSTM, but with 2 hops or more we gain more than an 8% increase over the previous best reported model. Using even more hops still improves over 1 hop, but not much over 2 hops.
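The retrieval metric described above can be sketched in a few lines. This is only an illustrative sketch, not the evaluation code used for Table 7: the function names `hits_at_k` and `mean_hits_at_k`, the toy scores, and the tie-breaking rule (ties count in favor of the correct answer) are all assumptions.

```python
from typing import List, Sequence, Tuple


def hits_at_k(scores: Sequence[float], correct_index: int, k: int = 1) -> bool:
    """Return True if the correct candidate is ranked within the top k."""
    correct_score = scores[correct_index]
    # Count candidates scored strictly higher than the correct one.
    # Assumption: ties are resolved in favor of the correct answer.
    rank = sum(1 for s in scores if s > correct_score)
    return rank < k


def mean_hits_at_k(examples: List[Tuple[Sequence[float], int]], k: int = 1) -> float:
    """Average hits@k over (scores, correct_index) pairs, as a percentage."""
    hits = [hits_at_k(scores, idx, k) for scores, idx in examples]
    return 100.0 * sum(hits) / len(hits)


# Two toy examples, each with 10 candidates (1 correct + 9 distractors).
# The scores are made up for illustration.
examples = [
    ([0.9, 0.1, 0.2, 0.3, 0.0, 0.4, 0.5, 0.2, 0.1, 0.3], 0),  # correct ranked first
    ([0.2, 0.8, 0.1, 0.3, 0.0, 0.4, 0.5, 0.2, 0.1, 0.3], 0),  # correct not first
]
print(mean_hits_at_k(examples, k=1))
```

With 10 candidates per example, a random scorer would obtain a Hits@1 of about 10%, which is a useful reference point when reading the table.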
We have presented a new set of benchmark tasks designed to evaluate end-to-end dialog systems. The movie dialog dataset measures how well such models perform both at goal-driven dialog, with objective and subjective goals evaluated through metrics on question answering and recommendation tasks, and at less goal-driven chit-chat. A true end-to-end model should perform well at all these tasks, which is a necessary but not sufficient condition for a fully functional dialog agent.
We showed that some end-to-end neural network models can perform reasonably well across all tasks compared to standard per-task baselines. Specifically, Memory Networks that incorporate short-term and long-term memory can utilize local context and knowledge bases of facts to boost performance. We believe this is promising because we showed these same architectures also perform well on a separate dialog task, the Ubuntu Dialog Corpus, have been shown previously to work well on the synthetic but challenging bAbI tasks of Weston et al. (2015a), and have no special engineering for the tasks or domain. However, some limitations remain: in particular, they do not perform as well as stand-alone QA systems for QA, and performance is degraded rather than improved when training on all four tasks at once. Future work should try to overcome these problems.
While our dataset focused on movies, there is nothing specific to the task design that could not be transferred immediately to other domains, for example sports, music, restaurants, and so on. Future work should create new tasks in this and other domains, firstly to ensure that models are not overtuned for these goals, and secondly to test further skills and to motivate the development of algorithms that are skillful at them.
- Bahdanau et al. (2015) Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. ICLR 2015, 2015.
- Bai et al. (2009) Bai, Bing, Weston, Jason, Grangier, David, Collobert, Ronan, Sadamasa, Kunihiko, Qi, Yanjun, Chapelle, Olivier, and Weinberger, Kilian. Supervised semantic indexing. In Proceedings of the 18th ACM conference on Information and knowledge management, pp. 187–196. ACM, 2009.
- Berant & Liang (2014) Berant, Jonathan and Liang, Percy. Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL’14), Baltimore, USA, 2014.
- Berant et al. (2013) Berant, Jonathan, Chou, Andrew, Frostig, Roy, and Liang, Percy. Semantic parsing on freebase from question-answer pairs. In EMNLP, pp. 1533–1544, 2013.
- Bordes et al. (2014) Bordes, Antoine, Chopra, Sumit, and Weston, Jason. Question answering with subgraph embeddings. In Proc. EMNLP, 2014.
- Bordes et al. (2015) Bordes, Antoine, Usunier, Nicolas, Chopra, Sumit, and Weston, Jason. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075, 2015.
- Cremonesi et al. (2010) Cremonesi, Paolo, Koren, Yehuda, and Turrin, Roberto. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the fourth ACM conference on Recommender systems, pp. 39–46. ACM, 2010.
- Fader et al. (2014) Fader, Anthony, Zettlemoyer, Luke, and Etzioni, Oren. Open question answering over curated and extracted knowledge bases. In Proceedings of 20th SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’14), New York City, USA, 2014. ACM.
- Graves et al. (2012) Graves, Alex et al. Supervised sequence labelling with recurrent neural networks, volume 385. Springer, 2012.
- Griol et al. (2008) Griol, David, Hurtado, Lluís F, Segarra, Encarna, and Sanchis, Emilio. A statistical approach to spoken dialog systems design and evaluation. Speech Communication, 50(8):666–682, 2008.
- Henderson (2015) Henderson, Matthew. Machine learning for dialog state tracking: A review. In Proceedings of The First International Workshop on Machine Learning in Spoken Language Processing, 2015.
- Henderson et al. (2014) Henderson, Matthew, Thomson, Blaise, and Williams, Jason. The second dialog state tracking challenge. In 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 263, 2014.
- Hermann et al. (2015) Hermann, Karl Moritz, Kočiský, Tomáš, Grefenstette, Edward, Espeholt, Lasse, Kay, Will, Suleyman, Mustafa, and Blunsom, Phil. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS), 2015. URL http://arxiv.org/abs/1506.03340.
- Hill et al. (2015) Hill, Felix, Bordes, Antoine, Chopra, Sumit, and Weston, Jason. The goldilocks principle: Reading children’s books with explicit memory representations. arXiv preprint arXiv:1511.02301, 2015.
- Hirschman et al. (1990) Hirschman, Lynette, Dahl, Deborah A, McKay, Donald P, Norton, Lewis M, and Linebarger, Marcia C. Beyond class a: A proposal for automatic evaluation of discourse. Technical report, DTIC Document, 1990.
- Hochreiter & Schmidhuber (1997) Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Isbell et al. (2000) Isbell, Charles Lee, Kearns, Michael, Kormann, Dave, Singh, Satinder, and Stone, Peter. Cobot in lambdamoo: A social statistics agent. In AAAI/IAAI, pp. 36–41, 2000.
- Jafarpour et al. (2010) Jafarpour, Sina, Burges, Christopher JC, and Ritter, Alan. Filter, rank, and transfer the knowledge: Learning to chat. Advances in Ranking, 10, 2010.
- Joulin & Mikolov (2015) Joulin, Armand and Mikolov, Tomas. Inferring algorithmic patterns with stack-augmented recurrent nets. arXiv preprint arXiv:1503.01007, 2015.
- Kolomiyets & Moens (2011) Kolomiyets, Oleksandr and Moens, Marie-Francine. A survey on question answering technology from an information retrieval perspective. Information Sciences, 181(24):5412–5434, 2011.
- Koren (2008) Koren, Yehuda. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 426–434. ACM, 2008.
- Koren & Bell (2011) Koren, Yehuda and Bell, Robert. Advances in collaborative filtering. In Recommender systems handbook, pp. 145–186. Springer, 2011.
- Kwiatkowski et al. (2013) Kwiatkowski, Tom, Choi, Eunsol, Artzi, Yoav, and Zettlemoyer, Luke. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP’13), Seattle, USA, October 2013.
- Lowe et al. (2015) Lowe, Ryan, Pow, Nissan, Serban, Iulian, and Pineau, Joelle. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909, 2015.
- Mikolov & Zweig (2012) Mikolov, Tomas and Zweig, Geoffrey. Context dependent recurrent neural network language model. In SLT, pp. 234–239, 2012.
- Mikolov et al. (2011) Mikolov, Tomáš, Kombrink, Stefan, Burget, Lukáš, Černockỳ, Jan Honza, and Khudanpur, Sanjeev. Extensions of recurrent neural network language model. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pp. 5528–5531. IEEE, 2011.
- Mikolov et al. (2013) Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- Narasimhan et al. (2015) Narasimhan, Karthik, Kulkarni, Tejas, and Barzilay, Regina. Language understanding for text-based games using deep reinforcement learning. arXiv preprint arXiv:1506.08941, 2015.
- Paek (2001) Paek, Tim. Empirical methods for evaluating dialog systems. In Proceedings of the workshop on Evaluation for Language and Dialogue Systems-Volume 9, pp. 2. Association for Computational Linguistics, 2001.
- Ramachandran et al. (2014) Ramachandran, Deepak, Yeh, Peter Z, Jarrold, William, Douglas, Benjamin, Ratnaparkhi, Adwait, Provine, Ronald, Mendel, Jeremy, and Emfield, Adam. An end-to-end dialog system for tv program discovery. In Spoken Language Technology Workshop (SLT), 2014 IEEE, pp. 602–607. IEEE, 2014.
- Ritter et al. (2011) Ritter, Alan, Cherry, Colin, and Dolan, William B. Data-driven response generation in social media. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 583–593. Association for Computational Linguistics, 2011.
- Rojas-Barahona et al. (2012) Rojas-Barahona, Lina M, Lorenzo, Alejandra, and Gardent, Claire. An end-to-end evaluation of two situated dialog systems. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 10–19. Association for Computational Linguistics, 2012.
- Rush et al. (2015) Rush, Alexander M, Chopra, Sumit, and Weston, Jason. A neural attention model for abstractive sentence summarization. Proceedings of EMNLP, 2015.
- Shang et al. (2015) Shang, Lifeng, Lu, Zhengdong, and Li, Hang. Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364, 2015.
- Sordoni et al. (2015) Sordoni, Alessandro, Galley, Michel, Auli, Michael, Brockett, Chris, Ji, Yangfeng, Mitchell, Margaret, Nie, Jian-Yun, Gao, Jianfeng, and Dolan, Bill. A neural network approach to context-sensitive generation of conversational responses. Proceedings of NAACL, 2015.
- Sukhbaatar et al. (2015) Sukhbaatar, Sainbayar, Szlam, Arthur, Weston, Jason, and Fergus, Rob. End-to-end memory networks. Proceedings of NIPS, 2015.
- Sutskever et al. (2014) Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc VV. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.
- Vinyals & Le (2015) Vinyals, Oriol and Le, Quoc. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.
- Walker et al. (1997) Walker, Marilyn A, Litman, Diane J, Kamm, Candace A, and Abella, Alicia. Paradise: A framework for evaluating spoken dialogue agents. In Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics, pp. 271–280. Association for Computational Linguistics, 1997.
- Walker et al. (2003) Walker, Marilyn A, Prasad, Rashmi, and Stent, Amanda. A trainable generator for recommendations in multimodal dialog. In INTERSPEECH, 2003.
- Weston et al. (2015a) Weston, J., Bordes, A., Chopra, S., and Mikolov, T. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015a.
- Weston et al. (2013) Weston, Jason, Yee, Hector, and Weiss, Ron J. Learning to rank recommendations with the k-order statistic loss. In Proceedings of the 7th ACM conference on Recommender systems, pp. 245–248. ACM, 2013.
- Weston et al. (2015b) Weston, Jason, Bordes, Antoine, Chopra, Sumit, and Mikolov, Tomas. Towards ai-complete question answering: a set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015b.
- Weston et al. (2015c) Weston, Jason, Chopra, Sumit, and Bordes, Antoine. Memory networks. Proceedings of ICLR, 2015c.
- Whittaker et al. (2002) Whittaker, Steve, Walker, Marilyn A, and Moore, Johanna D. Fish or fowl: A wizard of oz evaluation of dialogue strategies in the restaurant domain. In LREC, 2002.
- Williams et al. (2013) Williams, Jason, Raux, Antoine, Ramachandran, Deepak, and Black, Alan. The dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, pp. 404–413, 2013.
Appendix A Further Experimental Details
For all models we built a dictionary using all the known entities in the KB (e.g. “Bruce Willis” and “Die Hard” are single dictionary elements). This allows us to output a single symbol for QA and Recommendation in order to predict an entity, rather than having to construct the answer out of words, making training and evaluation of the task simpler. The rest of the dictionary is built of unigrams that are not covered by our entity dictionary, where we removed other words (but not entities) with frequency less than 5. Overall this gives a dictionary of size 189472, which includes 75542 entities. All entries and texts were lower-cased. Our text parser to convert to the dictionary representation is then very simple: it goes left to right, consuming the largest n-gram at each step.
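The left-to-right parser described above can be sketched as a greedy longest-match tokenizer. This is a minimal sketch under stated assumptions rather than the parser actually used: the function name `greedy_parse`, the `max_n` cap on n-gram length, the whitespace splitting, and the choice to emit unknown unigrams unchanged are all hypothetical.

```python
from typing import List, Set


def greedy_parse(text: str, dictionary: Set[str], max_n: int = 4) -> List[str]:
    """Tokenize left to right, consuming the longest dictionary n-gram at each step."""
    words = text.lower().split()  # all entries and texts are lower-cased
    tokens: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest span first, then shrink down to a single word.
        for n in range(min(max_n, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            # Assumption: a unigram not in the dictionary is emitted as-is.
            if n == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += n
                break
    return tokens


# Toy dictionary: two multi-word entities plus a few unigrams.
dictionary = {"bruce willis", "die hard", "who", "starred", "in"}
print(greedy_parse("Who starred in Die Hard", dictionary))
```

Because the longest match is tried first, the multi-word entity "die hard" is emitted as one symbol rather than as the two words "die" and "hard".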
For most of the tasks the optimal number of hops was 1, except for Task 3 where 2 or 3 hops outperform 1. See Table 9 and the parameter choices in Sec. B. For the joint task (Task 5), to achieve best performance we increased the capacity compared to the individual task models by using different dictionaries for the input, memory and output layers, see Sec. B. Additionally, we pre-trained the weights by training without the long-term memory for speed.
Supervised Embedding Models
We tried two flavors of supervised embedding model: (i) a “single dictionary model”; and (ii) a “two dictionary model”. That is, the latter has two sets of word embeddings, depending on whether the word is in the input+context or in the label. The input and context are concatenated together to form a bag of words in either case. It turns out method (i) works better on Tasks 1 and 4, and method (ii) works better on Tasks 2 and 3. Some of the reasons why that is so are easy to understand: on Tasks 2 and 3 (recommendations) a single dictionary model favors predicting the same movies that are already in the input context, which are never correct. However, on Tasks 1 and 4 the two dictionary model appears to overfit to some degree. This partially explains why the model overall is worse on the joint dataset (Task 5). See Sec. B for more details.
LSTMs performed poorly on Task 4 and we spent some time trying to improve these results. Despite the perplexity looking reasonable (96 on the training set, and 105 on the validation set) after training for 6 days, we still obtain poor results distinguishing between candidates. We also tried Seq2Seq models (without attention or metric learning) and did not obtain improvements. Part of the problem is that posts in Reddit vary from very short (a few words) to very long (several paragraphs), and one natural procedure to try (computing the probability of those sequences seeded by the input) gives very unbalanced results and tends to select the shorter ones, ending up with worse than random performance. Further, computationally the whole procedure is then very slow compared to all other methods tested. Memory Networks and supervised embeddings need to compute the inner product between embedded inputs and outputs, and hence the candidates can be embedded once and cached for the whole test set. This trick is not applicable to the method described above, rendering it much slower. To deal with the speed issue one can use our supervised embedding model as a first step, and then only rerank the top 100 results with the LSTM to make it tractable; however, performance is still poor, as mentioned. We obtained improved results by instead adopting the approach of Narasimhan et al. (2015): we take the representation for a dialog message as the average embedding over the hidden states as the symbols are consumed (at each step of the recurrence). We also note that Lowe et al. (2015) report good results (on a different dataset, the Ubuntu Corpus) by training an additional metric learner on top of an LSTM representation, which we have not tried. However, we do compare that approach to Memory Networks on that corpus in Section 5.
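The caching argument above can be illustrated with a small sketch: when the score is an inner product between an embedded input and an embedded candidate, every candidate needs to be embedded only once. This is an illustration under assumptions, not the models from the paper: the bag-of-words sum in `embed`, the 2-dimensional toy `table`, and the helper names `dot` and `rank` are all hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def embed(tokens: List[str], table: Dict[str, Vector], dim: int) -> Vector:
    """Bag-of-words embedding: sum of token vectors (unknown tokens add zero)."""
    out = [0.0] * dim
    for tok in tokens:
        for j, v in enumerate(table.get(tok, [0.0] * dim)):
            out[j] += v
    return out


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy embedding table; the values are made up for illustration.
table = {"ubuntu": [1.0, 0.0], "install": [0.5, 0.5], "movie": [0.0, 1.0]}
candidates = [["install", "ubuntu"], ["movie"]]

# Embed every candidate once; the cache is reused for all test queries.
cache = [embed(c, table, dim=2) for c in candidates]


def rank(query: List[str]) -> int:
    """Return the index of the highest-scoring cached candidate."""
    q = embed(query, table, dim=2)
    scores = [dot(q, c) for c in cache]
    return max(range(len(scores)), key=scores.__getitem__)


print(rank(["ubuntu"]))
print(rank(["movie"]))
```

By contrast, scoring a candidate by its sequence probability conditioned on the input requires a fresh forward pass for every (input, candidate) pair, so nothing on the candidate side can be precomputed.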
Aside from the models described in the main paper, we also experimented with a hybrid relevance feedback approach: find the most similar message in the history, add the response to the query (with a certain weight) and then score candidate responses with the combined input. However, the relevance feedback model did not help: as we increase the feedback parameter (how much to use the retrieved response) the model only degrades; see Table 10 for the performance when adding the response with a weight of 0.5.
Appendix B Optimal hyper-parameter values
Hyperparameters of all learning models have been set using grid search on the validation set. The main hyperparameters are embedding dimension, learning rate, number of dictionaries, number of hops for MemNNs, and unfolding depth blen for LSTMs. All models are implemented in the Torch library (see torch.ch).
Task 1 (QA)
QA System of Bordes et al. (2014): , .
Supervised Embedding Model: , , .
MemN2N: , , , .
LSTM: , , .
Task 2 (Recommendation)
Supervised Embedding Model: , , .
MemN2N: , , , .
LSTM: , , .
Task 3 (QA+Recommendation)
Supervised Embedding Model: , , .
MemN2N: , , , .
LSTM: , , .
Task 4 (Reddit)
Supervised Embedding Model: , , .
MemN2N: , , , .
LSTM: , , .
We chose hyperparameters by taking the mean performance over the four tasks, after scaling each task by the best performing model on that task on the development set in order to normalize the metrics.
Supervised Embedding Model: , , .
MemN2N: , , .
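The selection criterion described above for the joint task can be sketched as follows. This is a sketch under assumptions: we read "scaling each task by the best performing model" as dividing each dev score by the best dev score on that task, and the function names and all numbers below are hypothetical, chosen only for illustration.

```python
from typing import Dict, List


def normalized_mean(scores: Dict[str, float], best: Dict[str, float]) -> float:
    """Mean over tasks of (score / best score on that task)."""
    return sum(scores[t] / best[t] for t in best) / len(best)


def select_config(configs: List[Dict], best: Dict[str, float]) -> Dict:
    """Pick the configuration with the highest normalized mean dev score."""
    return max(configs, key=lambda c: normalized_mean(c["dev"], best))


# Hypothetical best dev-set score per task (not values from the paper).
best_per_task = {"task1": 90.0, "task2": 30.0, "task3": 50.0, "task4": 20.0}

# Two hypothetical joint-model configurations with their dev scores.
configs = [
    {"name": "A", "dev": {"task1": 90.0, "task2": 15.0, "task3": 30.0, "task4": 10.0}},
    {"name": "B", "dev": {"task1": 63.0, "task2": 27.0, "task3": 40.0, "task4": 14.0}},
]
print(select_config(configs, best_per_task)["name"])
```

In this toy example the unnormalized means are 36.25 for A and 36.0 for B, so a raw average would prefer A, which is strong only on the task with the largest metric scale. After normalization B is preferred, since it is closer to the best model on three of the four tasks.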
Ubuntu Dialog Corpus
Hyperparameters of the MemN2N have been set using grid search on the validation set. We report the best models with in the paper; other hyperparameters were , .
Appendix C Further Detailed Results
C.1 Breakdown of Task 1 (QA) results by question type
|QA System of Bordes et al. (2014)||Supervised Embeddings||MemN2N|
|writer to movie||98.7||98.7||77.3||90.8||77.6||95.5|
|tag to movie||71.8||71.8||53.4||96.1||61.4||88.6|
|movie to year||89.8||89.8||3.4||25.4||87.3||92.1|
|movie to writer||88.8||89.5||61.7||93.6||73.5||84.1|
|movie to tags||84.5||85.3||36.8||92.0||79.9||95.1|
|movie to language||94.6||94.8||45.2||84.7||90.1||97.6|
|movie to genre||93.0||93.5||46.4||95.0||92.5||99.4|
|movie to director||88.2||88.2||52.3||90.1||78.3||87.1|
|movie to actors||88.5||88.5||64.5||95.2||68.4||87.2|
|director to movie||98.3||98.3||61.4||93.8||71.5||91.0|
|actor to movie||98.9||98.9||79.0||89.4||76.7||96.7|
C.2 Breakdown of Task 3 (QA+Recommendation) results by response type
||Whole||Response 1||Response 2||Response 3|
|MemN2N (1 hop)||70.5||47.0||89.2||76.5|
|MemN2N (2 hops)||76.8||53.4||90.1||88.6|
|MemN2N (3 hops)||75.4||52.6||90.0||84.2|
|MemN2N (3 hops, -KB)||75.9||54.3||85.0||91.5|
C.3 Breakdown of Task 4 (Reddit) results by response type
|Methods||Test Set||Matched||Response 1||Response 2||Response 3+|
|IR (query) RF=0.05||19.2||40.8||18.3||21.2||21.4|