Towards Deep Conversational Recommendations

12/18/2018 ∙ by Raymond Li, et al.

There has been growing interest in using neural networks and deep learning techniques to create dialogue systems. Conversational recommendation is an interesting setting for the scientific exploration of dialogue with natural language, as the associated discourse involves goal-driven dialogue that often transforms naturally into more free-form chat. This paper provides two contributions. First, until now there has been no publicly available large-scale dataset consisting of real-world dialogues centered around recommendations. To address this issue and to facilitate our exploration here, we have collected ReDial, a dataset consisting of over 10,000 conversations centered around the theme of providing movie recommendations. We make this data available to the community for further research. Second, we use this dataset to explore multiple facets of conversational recommendations. In particular, we explore new neural architectures, mechanisms, and methods suitable for composing conversational recommendation systems. Our dataset allows us to systematically probe model sub-components addressing different parts of the overall problem domain, ranging from sentiment analysis and cold-start recommendation generation to detailed aspects of how natural language is used in this setting in the real world. We combine such sub-components into a full-blown dialogue system and examine its behavior.




1 Introduction

Deep-learning-based approaches to creating dialogue systems provide extremely flexible solutions for the fundamental algorithms underlying dialogue systems. In this paper we explore fundamental algorithmic elements of conversational recommendation systems through examining a suite of neural architectures for sub-problems of conversational recommendation making.

It is well known that deep learning techniques require considerable amounts of data to be effective. Addressing this need, we provide a new dataset of 10,000 dialogues to the community to facilitate the study of discourse with natural language when making recommendations is an explicit goal of the exchange. Our setting of interest and our new dataset, named ReDial, are centered around conversations about movies, where one party in the conversation is seeking recommendations and the other party is providing them. Our decision to focus on this domain is motivated in part by the following.

A good discussion with a friend, librarian, movie rental store clerk or movie fan can be an enjoyable experience, leading to new ideas for movies that one might like to watch. We shall refer to this general setting as conversational movie recommendation. While dialogue systems are sometimes characterized as falling into the categories of goal-directed dialogue vs chit-chat, we observe that discussions about movies often combine various elements of chit-chat, goal-directed dialogue, and even question answering in a natural way. As such the practical goal of creating conversational recommendation systems provides an excellent setting for the scientific exploration of the continuum between these tasks.

This paper makes a number of contributions. First, we provide the only real-world, two-party conversational corpus of this form (that we are aware of) to the community. We outline the data-collection procedure in Section 3. Second, we use this corpus to systematically propose and evaluate neural models for key sub-components of an overall conversational recommendation system. We focus our exploration on three key elements of such a system: 1) Making recommendations: we examine sampling-based methods for learning to make recommendations in the cold-start setting using an autoencoder (Sedhain et al., 2015). We present this model in Section 4.3 and evaluate it in Section 5. Prior work with such models has not examined the cold-start setting, which must be addressed in our dialogue set-up. 2) Classifying opinions, or the sentiment of a dialogue participant with respect to a particular movie: throughout the dialogue, whenever a new movie is discussed, we instantiate an RNN-based sentiment-prediction model. This model is used to populate the autoencoder-based recommendation engine above. We present this model component and our analysis of its behavior and performance in Sections 4.2 and 5 respectively. 3) We compose the components outlined above into a complete neural dialogue model for conversation and recommendation. For this aspect of the problem we examine a novel formulation of a hierarchical recurrent encoder-decoder (HRED) model (Sordoni et al., 2015) with a switching mechanism inspired by Gulcehre et al. (2016) that allows suggested movies to be integrated into the model for the dialogue acts of the recommender. As our new dataset is relatively small for neural network techniques, our modular approach allows one to train sub-components on other, larger data sources, whereas naïvely training end-to-end neural models from scratch using only our collected dialogue data can lead to overfitting.

2 Related Work

While we are aware of no large-scale public dataset of human-to-human dialogue on the subject of movie recommendations, we review the most relevant related work below. We also review prior work on related methods in Section 4, just prior to introducing each component of our model.

Dodge et al. (2015) introduced four movie dialogue datasets comprising the Facebook Movie Dialog Data Set: a QA dataset, a recommendation dataset, and a QA + recommendation dataset. All three are synthetic datasets built from the classic MovieLens ratings dataset (Harper and Konstan, 2016) and the Open Movie Database. Others have also explored procedures for generating synthetic dialogues from ratings data (Suglia et al., 2017). The fourth dataset is a Reddit dataset composed of around 1M dialogues from the movie subreddit. The recommendation dataset is the closest to what we propose; however, it is synthetically generated from natural language patterns, and the answers are always a single movie name. The Reddit dataset is also similar to ours in the sense that it consists of natural conversations on the topic of movies; however, the exchanges are more free-form, and obtaining a good recommendation is not a goal of the discourse.

Krause et al. (2017) introduce a dataset of self-dialogues collected for the Amazon Alexa Prize competition using Amazon Mechanical Turk (AMT). The workers are asked to imagine a conversation between two individuals on a given topic and to play both roles. The topics mostly concern movies, music, and sports. The conversations are not specifically about movie recommendations, but have the advantage of being quite natural compared to the Facebook Movie Dialog Data Set. They use this data to develop a chat bot made of several components, including: a rule-based component, a matching-score component that compares the context with similar conversations from the data to output a message from the data, and a (generative) recurrent neural network (RNN). They perform human evaluation of the matching-score component.

Some older work from the PhD thesis of Johansson (2004) involved collecting a movie recommendation themed dialogue corpus with 24 dialogues, consisting of 2684 utterances and a mean of 112 utterances per dialogue. In contrast, our corpus has over 10k conversations and 160k utterances. See Serban et al. (2015) for an updated survey of corpora for data-driven dialogue systems.

The recommender-systems literature has also proposed models for conversational systems. These approaches are goal-oriented and combine various modules, each designed (and trained) independently (Göker et al., 2011; Greco et al., 2017). Further, these approaches either rely on tracking the state of the dialogue using slot-value pairs (Widyantoro and Baizal, 2014; Wärnestål et al., 2007) or focus on different objectives, such as minimizing the number of user queries required to obtain good recommendations (Christakopoulou et al., 2016). Other approaches (He et al., 2015; Das et al., 2017a; Li et al., 2017; Sun and Zhang, 2018) use reinforcement learning to train goal-oriented dialogue systems. Sun and Zhang (2018) apply it to conversational recommendations: a simulated user enables training the dialogue agent to extract the facet values needed to make an appropriate recommendation. In contrast, we propose a conditional generative model of (natural language) recommendation conversations, and our contributed dataset allows one both to train sub-modules and to explore end-to-end trainable models.

3 ReDial Dataset Collection

Here we formalize the setup of a conversation involving recommendations for the purposes of data collection. To provide some additional structure to our data (and models) we define one person in the dialogue as the recommendation seeker and the other as the recommender. To obtain data in this form, we developed an interface and pairing mechanism mediated by Amazon Mechanical Turk (AMT). Our task setup is very similar to that used by Das et al. (2017b) to collect dialogue data around an image guessing game, except that we focus on movie recommendations. We pair up AMT workers and give each of them a role. The movie seeker has to explain what kind of movies they like and to ask for movie suggestions. The recommender tries to understand the seeker's movie tastes and recommends movies. All exchanges of information and recommendations are made using natural language.

We add additional instructions to improve data quality and to guide the workers toward the kind of dialogue we expect. We ask them to use formal language and to exchange roughly ten messages per conversation at a minimum. We also require that at least four different movies are mentioned in every conversation. Finally, we ask them to converse only about movies, and notably not to mention Mechanical Turk or the task itself. See Figure 4 in the supplementary material for a screenshot of the interface.

In addition, we ask that every movie mention be tagged using the ‘@’ symbol. When workers type ‘@’, the following characters are used to find matching movie names, and workers can choose a movie from that list. This allows us to detect exactly which movies are mentioned and when. We gathered entities from DBpedia that were of type <> to obtain a list of movies, but we also allow workers to add movies to the list if they are not already present. We also obtained movie release dates from DBpedia. Note that the year or release date of a movie can be essential to differentiate movies that share the same name but were released at different dates.
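For illustration only, the ‘@’ autocomplete described above can be sketched as a case-insensitive prefix match over the movie list. The helper name and its interface are hypothetical; the paper does not specify the actual matching logic.

```python
def match_movies(prefix, movie_list):
    """Return the movies whose title starts with the typed prefix (case-insensitive)."""
    p = prefix.lower()
    return [title for title in movie_list if title.lower().startswith(p)]
```

Appending the release year to each title in `movie_list`, as the dataset does, lets workers disambiguate movies that share a name.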

Table 1a. (Above) ReDial data statistics. For the movie dialogue forms, the numbers shown represent the seeker’s answers.

    # conversations              10006
    # utterances                182150
    # users                        956
    # movie mentions             51699
        seeker mentioned         16278
        recommender suggested    35421
    Seen label:
        seen                     31694
        not seen                 16516
        did not say               3489
    Liked label:
        liked (81%)              41998
        disliked (4.9%)           2556
        did not say (14%)         7145
Table 1b. (Right) Conversation excerpts (HUMAN followed by a response from the SEEKER) and model outputs (OURS, our proposed approach, compared to HRED, a generic dialogue model; Sordoni et al., 2015).
HUMAN: hello, can i help you find a movie ?
HRED: hi what kind of movies do you like
OURS: hi, how can i help you ?
SEEKER: hello ! i would like to find a suspenseful, but clean family friendly movie .
SEEKER: 2001 : a space odyssey might be a great option. anything else that you would suggest ?
HUMAN: you can try planet of the apes the older one is quite suspenseful and family friendly .
HRED: i haven’t seen that one yet but i heard it was good. i haven’t seen that one. have you seen the last house on the left ?
OURS: star wars : the force awakens is also a good one return of the jedi all good movies
SEEKER: those sound good ! i ’m going to look into those movies.
HUMAN: i hope you enjoy, have a nice one
HRED: have you seen foxcatcher ? it ’s about a man who has a rich guy.
OURS: i hope i was able to help you find a good movie to watch
SEEKER: thank you for your help ! have a great night ! good bye

Workers are (separately from the ongoing discussion) asked three questions for each movie: (1) whether the movie was mentioned by the seeker or was a suggestion from the recommender (“suggested” label); (2) whether the seeker has seen the movie (“seen” label): one of Seen it, Haven’t seen it, or Didn’t say; (3) whether the seeker liked the movie or the suggestion (“liked” label): one of Liked, Didn’t like, Didn’t say. We refer to these additional labels as movie dialogue forms. Both workers have to answer these forms, even though they concern only the seeker’s movie tastes. We use these answers to validate the data collection: the two workers agreeing on the forms is generally an indicator of conscientious workers. Ideally, the two participants would give the same answer to every form, but it is possible that their answers do not coincide (because of carelessness or dialogue ambiguity). The released dataset provides both workers’ answers. The movie dialogue forms therefore allow us to evaluate sub-components of an overall neural dialogue system more systematically; for example, one can train and evaluate a sentiment analysis model directly using these labels. We believe that predicting sentiment from dialogues poses an interesting sub-challenge within conversational recommendation, as the sentiment can be expressed in a question-answer form over several dialogue utterances.
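The agreement check described above can be sketched as follows. The form representation (a dictionary with the three label keys) and the helper name are our own illustrative assumptions, not the dataset's released schema.

```python
def forms_agree(seeker_form, recommender_form):
    """True when both workers gave identical answers on a movie's dialogue form."""
    labels = ("suggested", "seen", "liked")
    return all(seeker_form[k] == recommender_form[k] for k in labels)
```

A simple per-conversation agreement rate over all mentioned movies could then serve as the conscientiousness indicator mentioned above.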

In each conversation, the number of movies mentioned varies, so we have different numbers of movie dialogue form answers for each conversation. The distribution of the different classes of the movie dialogue form is shown in Table 1a. The liked/disliked/did not say label is highly imbalanced. This is standard for recommendation data (Marlin et al., 2007), since people are naturally more likely to talk about movies that they like, and the recommender’s objective is to recommend movies that the seeker is likely to like. Table 1b shows an example of conversation from the dataset.

For the AMT HIT we collect data in English and restrict the data collection to countries where English is the main language. Pairing workers together slows down the data collection, since two people must be online at the same time to do the task, so a large pool of workers is required to make the collection possible. Meanwhile, the task is quite demanding, and we have to select qualified workers. The HIT reward and qualification requirements were decisive in obtaining good conversation quality while still ensuring that people could get paired together. We launched preliminary HITs to find a compromise and finally set the reward to $0.50 per person for each completed conversation (so each conversation costs us $1, plus taxes), and we require that workers meet the following criteria: (1) an approval percentage greater than 95; (2) more than 1000 approved HITs; and (3) location in the United States, Canada, the United Kingdom, Australia, or New Zealand.

4 Our Approach

We aim to develop an agent capable of chatting with a partner and asking questions about their movie tastes in order to make movie recommendations. One might therefore characterize our system as a recommendation “chat-bot”. The complete architecture of our approach is illustrated in Figure 1. Starting from the bottom of Figure 1, there are four sub-components: (1) a hierarchical recurrent encoder following the HRED (Sordoni et al., 2015) architecture, using general-purpose representations based on the GenSen model (Subramanian et al., 2018); (2) a switching decoder inspired by Gulcehre et al. (2016), modeling the dialogue acts generated by the recommender; (3) after each dialogue act, our model detects whether a movie entity has been discussed (with the @identifier convention) and instantiates an RNN focused on classifying the seeker’s sentiment or opinion regarding that entity; as such, there are as many of these RNNs as there are movie entities discussed in the discourse. The sentiment analysis RNNs are used to indicate the user opinions forming the input to (4), an autoencoder-based recommendation module (Sedhain et al., 2015). The autoencoder recommender’s output is used by the decoder through a switching mechanism. Some of these components can be pre-trained on external data, thus compensating for the small data size. Notably, the switching mechanism allows us to include the recommendation engine, which we trained using the significantly larger MovieLens data. We provide more details for each of these components below and describe the training procedure in the supplementary materials.

Figure 1: Our proposed model for conversational recommendations.

4.1 Our Hierarchical Recurrent Encoder

Our dialogue model is reminiscent of the hierarchical recurrent encoder-decoder (HRED) architecture proposed and developed in Sordoni et al. (2015) and Serban et al. (2016). We reuse their hierarchical architecture, but we modify the decoder so that it can take explicit movie recommendations into account, and we modify the encoder to take general-purpose sentence (GenSen) representations, arising from a bidirectional Gated Recurrent Unit (GRU) (Cho et al., 2014), as input. Since our new dataset consists of about 10k dialogues (which is relatively small for deep learning techniques), we use pre-trained GenSen representations obtained from the encoder outlined in Subramanian et al. (2018). These representations have led to higher performance across a variety of new tasks in lower-data regimes (e.g. with only 10k examples). We use the embeddings and first layer of the GenSen sentence encoder, which are pre-trained on multiple language tasks, and we keep them frozen during training of our model. To handle movie entities marked with the @movie convention, @movie tokens in the input are replaced by the corresponding word tokens of the movie’s title.

More formally, we model each utterance as a sequence of tokens, where the tokens are either words from a vocabulary V or movie names from a set of movies M. We also append a scalar to each utterance to indicate the role (recommender or seeker), so that a dialogue of T utterances can be represented as (u_1, ..., u_T). We use a GRU to encode utterances and dialogues. Given an input sequence (x_1, ..., x_N), the network computes reset gates r_n, input (update) gates z_n, new gates h̃_n, and forward hidden states h_n as follows:

    r_n = σ(W_r x_n + U_r h_{n−1} + b_r)
    z_n = σ(W_z x_n + U_z h_{n−1} + b_z)
    h̃_n = tanh(W_h x_n + U_h (r_n ⊙ h_{n−1}) + b_h)
    h_n = (1 − z_n) ⊙ h_{n−1} + z_n ⊙ h̃_n

where the W, U, and b are the learned parameters, σ is the logistic sigmoid, and ⊙ denotes element-wise multiplication. In the case of a bi-directional GRU, the backward hidden state is computed the same way, but takes the inputs in reverse order. In a multi-layer GRU, the hidden states of the first layer (or the concatenation of the forward and backward hidden states of the first layer for a bi-directional GRU) are passed as inputs to the second layer, and so on. For the utterance encoder, words are embedded in a 2048-dimensional space. Each utterance is then passed to the sentence-encoder bi-directional GRU, and the final hidden state of the last layer is used as the utterance representation u_t, yielding a sequence of utterance representations (u_1, ..., u_T). To assist the conversation encoder, we append a binary-valued scalar to each utterance representation u_t, indicating whether the sender is the seeker or the recommender. The resulting sequence is passed to the conversation-encoder unidirectional GRU, which produces a conversation representation h_t at each step of the dialogue.
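As an illustration of the GRU update described above, here is a minimal single-unit (scalar) sketch. Real encoders use vector-valued states and learned weight matrices; the parameter-dictionary interface is purely for exposition.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU step for a single unit; p holds scalar weights and biases."""
    r = sigmoid(p["W_r"] * x + p["U_r"] * h_prev + p["b_r"])           # reset gate
    z = sigmoid(p["W_z"] * x + p["U_z"] * h_prev + p["b_z"])           # input (update) gate
    n = math.tanh(p["W_n"] * x + p["U_n"] * (r * h_prev) + p["b_n"])   # new gate
    return (1 - z) * h_prev + z * n  # interpolate previous state and new gate
```

With all-zero parameters both gates equal 0.5 and the new gate is 0, so the state halves at each step, which makes the interpolation easy to check by hand.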

4.2 Dynamically Instantiated RNNs for Movie Sentiment Analysis

In a test setting, users would not provide explicit ratings about movies mentioned in the conversation. Their sentiment can, however, be inferred from the utterances themselves. Therefore, to drive our autoencoder-based recommendation module, we build a model that takes as input both the dialogue and a movie name, and predicts for that movie the answers to the associated movie dialogue form. We remind the reader that both workers answer the movie dialogue form, even though it only concerns the seeker’s movie tastes. It often happens that the two workers do not agree on all the answers to the forms. This may either come from a real ambiguity in the dialogue or from worker carelessness (data noise), so the model predicts different answers for the seeker and for the recommender. For each participant it learns to predict three labels: the “suggested” label (binary), the “seen” label (categorical with three classes), and the “liked” label (categorical with three classes), giving 7 output dimensions per participant and 14 in total.

Let D = {(x_i, y_i)}_i denote the training set, where x_i = (C_i, m_i) is a pair consisting of a dialogue C_i and a movie name m_i that is mentioned in C_i, and y_i contains the labels of the movie dialogue form corresponding to movie m_i in dialogue C_i. So if 5 movies were mentioned in a dialogue, this dialogue appears 5 times in a training epoch.

The model is based on a hierarchical encoder (Section 4.1). For sentiment analysis, we modify the utterance encoder to take the movie into account. After the first layer of the utterance-encoder GRU (which is pre-trained), we add a dimension to the hidden states that indicates, for each word, whether it is part of a movie mention. For example, if we condition on the movie The Sixth Sense, then the input ["<s>", "you", "would", "like", "the", "sixth", "sense", ".", "</s>"] produces the movie mention feature: [0, 0, 0, 0, 1, 1, 1, 0, 0]. The utterance and conversation encoding then continue as described in Section 4.1, producing a dialogue representation at each dialogue step.
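The movie mention feature above can be computed with a simple token-span match. This is an illustrative sketch (the helper name and token-list interface are our assumptions, not the authors' code), reproducing the paper's example:

```python
def mention_mask(tokens, movie_tokens):
    """Mark 1 for each token belonging to a mention of the given movie title."""
    n, m = len(tokens), len(movie_tokens)
    mask = [0] * n
    for i in range(n - m + 1):
        if tokens[i:i + m] == movie_tokens:  # exact title span found
            for j in range(i, i + m):
                mask[j] = 1
    return mask
```

In the dataset itself, mentions are tagged with the ‘@’ convention, so spans can be located exactly rather than by string matching.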

The dialogue representation at the last utterance is passed through a fully connected layer, producing a 14-dimensional vector. We apply a sigmoid to the first component to obtain the predicted probability that the seeker answered that the movie was suggested by the recommender. We apply a softmax to the next three components to obtain the predicted probabilities for the seeker’s answer to the not-seen/seen/did-not-say variable, and a softmax to the following three components to obtain the predicted probabilities for the seeker’s answer to the disliked/liked/did-not-say variable. The last 7 components are treated the same way to obtain the probabilities of the answers according to the recommender. We denote the parameters of the neural network by θ and the model’s prediction by ŷ_i. We minimize the sum of the three corresponding cross-entropy losses.
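The way the 14-dimensional output is split into per-participant heads can be sketched as follows; the function name and the raw-logits interface are illustrative assumptions:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def form_predictions(logits):
    """Split 14 logits into (suggested, seen, liked) probabilities, seeker then recommender."""
    assert len(logits) == 14
    out = []
    for chunk in (logits[:7], logits[7:]):           # seeker's answers, then recommender's
        suggested = 1.0 / (1.0 + math.exp(-chunk[0]))  # sigmoid for the binary label
        seen = softmax(chunk[1:4])                     # not seen / seen / did not say
        liked = softmax(chunk[4:7])                    # disliked / liked / did not say
        out.append((suggested, seen, liked))
    return out
```

Each categorical head then feeds a cross-entropy term, and the three losses are summed as described above.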

4.3 The Autoencoder Recommender

At the start of each conversation, the recommender has no prior information on the movie seeker (cold start). During the conversation, the recommender gathers information about the movie seeker and (implicitly) builds a profile of the seeker’s movie preferences. Sedhain et al. (2015) developed a user-based autoencoder for collaborative filtering (U-Autorec), a model capable of predicting ratings for users not seen in the training set. We use a similar model and pre-train it with MovieLens data (Harper and Konstan, 2016).

We have N_u users, N_m movies, and a partially observed user-movie rating matrix R. Each user u can be represented by a partially observed vector r_u = (R_{u,1}, ..., R_{u,N_m}). Sedhain et al. (2015) project r_u into a smaller space with a fully connected layer, then retrieve the full ratings vector h(r_u; θ) with another fully connected layer. So during training they minimize the following loss:

    L(θ) = Σ_u ‖ r_u − h(r_u; θ) ‖²_O + λ ‖θ‖²

where ‖·‖²_O is the squared norm when considering the contribution of observed ratings only, and λ controls the regularization strength.
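The observed-ratings-only reconstruction loss can be sketched as below. This is a minimal illustration, assuming dense rating vectors and an explicit index set of observed entries; the function name is hypothetical.

```python
def autorec_loss(r, r_hat, observed, theta, lam):
    """Squared error over observed entries only, plus L2 regularization on theta."""
    err = sum((r[i] - r_hat[i]) ** 2 for i in observed)  # unobserved entries contribute nothing
    reg = lam * sum(w * w for w in theta)                # regularization term
    return err + reg
```

Restricting the error to observed entries is what lets the autoencoder train on a sparse rating matrix without treating missing ratings as zeros.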

To improve the performance of this model in the early stage of performing recommendations (i.e. in the cold-start setting) we train this model as a denoising autoencoder (Vincent et al., 2008). We denote by N_u the number of observed ratings in the user vector r_u. During training, we sample the number of inputs kept, p, uniformly at random in {1, ..., N_u}. Then we draw p inputs uniformly without replacement among all the observed inputs in r_u, which gives us a noisy user vector r̃_u. The term inside the sum of the loss above becomes ‖ r_u − h(r̃_u; θ) ‖²_O. The validation procedure is not changed: the complete input from the training set is used at validation or test time.
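The input-corruption step of this denoising procedure can be sketched as follows; the helper name and the dict-of-ratings representation are illustrative assumptions.

```python
import random

def corrupt(user_ratings, rng):
    """Keep a uniformly-sized random subset of a user's observed ratings."""
    observed = sorted(user_ratings)        # observed movie ids
    p = rng.randint(1, len(observed))      # number of inputs kept, uniform in {1..N_u}
    kept = rng.sample(observed, p)         # drawn uniformly without replacement
    return {m: user_ratings[m] for m in kept}
```

The reconstruction target remains the full observed vector, so the model learns to predict held-out ratings from partial input, mimicking the cold-start regime.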

4.4 Our Decoder with a Movie Recommendation Switching Mechanism

Let us place ourselves at step t in a dialogue. The sentiment analysis RNNs presented above predict, for each movie mentioned so far, whether the seeker liked it or not, using the previous utterances. These predictions are used to create an input for the recommendation system. The recommendation system uses this input to produce a full vector of ratings r_t. The hierarchical encoder (Section 4.1) produces the current context h_t using the previous utterances. The recommendation vector r_t and the context h_t are used by the decoder to predict the next utterance by the recommender.

For the decoder, a GRU decodes the context to predict the next utterance step by step. To select between the two types of tokens (words or movie names), we use a switch, as Gulcehre et al. (2016) did for the pointer softmax. The decoder GRU’s hidden state h′_0 is initialized with the context h_t, and the sentence is decoded as follows:

    h′_n = GRU(h′_{n−1}, w_{n−1}),    P_word(w_n) = softmax(W h′_n + b)

where P_word is the predicted probability distribution for the next token w_n, knowing that this token is a word. The recommendation vector r_t is used to obtain a predicted probability distribution for the next token, knowing that this token is a movie name: P_movie(w_n) = softmax(r_t). Note that we use the same movie distribution during the whole utterance decoding. Indeed, while the recommender’s message is being decoded, it does not gather additional information about the seeker’s movie preferences, so the movie distribution should not change. A switching network conditioned on the context h_t and the hidden state h′_n predicts the probability that the next token is a word and not a movie name.
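The way the switch combines the two distributions can be sketched as a simple mixture over the joint vocabulary (words followed by movie names); the function name is an illustrative assumption:

```python
def mixed_distribution(p_word_switch, p_word, p_movie):
    """Combine word and movie distributions via the switch probability."""
    words = [p_word_switch * p for p in p_word]            # next token is a word
    movies = [(1.0 - p_word_switch) * q for q in p_movie]  # next token is a movie name
    return words + movies  # a valid distribution over the joint vocabulary
```

Because both inputs are themselves distributions, the mixture sums to one, so the decoder can sample the next token from it directly.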

Such a switching mechanism allows us to include an explicit recommendation system in the dialogue agent. One issue with this method is that the recommendations are conditioned on the movies mentioned in the dialogue, but not directly on the language. For example, our system would be unable to provide recommendations to someone who simply asks for “a good sci-fi movie”. Initial experiments conditioning the recommendation system on the dialogue hidden state led to overfitting; this could be an interesting direction for future work. Another issue is that the mechanism relies on the use of the ‘@’ symbol to mention movies, which could be addressed by adding an entity recognition module.

5 Experiments

We propose to evaluate the recommendation and sentiment-analysis modules separately using established metrics. We believe that these individual metrics will improve when modules are more tightly coupled in the recommendation system and thus provide a proxy to overall dialogue quality. We also perform an utterance-level human evaluation to compare responses generated by different models in similar settings.

Evaluating models in a fully interactive setting, conversing with a human, is the ultimate testing environment. However, evaluating even one response utterance at a time is an open challenge (e.g., Liu et al. (2016)). We leave such evaluation for future work.

(a) (top row) Confusion matrices for the seen label. (bottom row) Confusion matrices for the liked label. (left column) Baseline GRU experiment. (middle column) Our method with separate objectives. (right column) Our method, jointly trained. We also provide Cohen’s kappa coefficient for each matrix.
(b) Confusion matrix for the Cartesian product predictions of seen and liked labels using our method.
Figure 2: Confusion matrices for movie sentiment analysis on the validation set.

Movie sentiment analysis performance:

We use the movie dialogue forms from our data to train and evaluate our proposed RNN-based movie sentiment analysis formulation. The results obtained for the seeker’s answers and the recommender’s answers are highly similar, so we present results only for the seeker’s answers. We focus on understanding whether models are able to correctly infer the seen vs. not-seen and liked vs. not-liked assessments from the forms. Because of the class imbalance (81% of movie mentions were liked vs. 4.9% disliked; see Table 1a), we weight the loss to compensate.

We compare with two simpler approaches. First, a baseline approach in which we pass the GenSen encodings of the sentences between the first and the last mention of a movie into a GRU layer, followed by a fully connected layer from the last hidden state; the prediction is made from the mean probability over all the sentences. Second, instead of using a single hierarchical encoder that is jointly trained to predict the three labels (suggested, seen, and liked), we train the same model with only one of the three objectives (seen or liked) and demonstrate that joint training regularizes the model. Figure 2a shows the confusion matrices for the seen and liked prediction tasks for, from left to right, the baseline model, our model trained on single objectives, and our method outlined in Section 4.2 and illustrated in the blue region of Figure 1. We also provide Cohen’s kappa coefficient (Cohen, 1960) for each model and prediction task; Cohen’s kappa measures the agreement between the true labels and the predictions. For each prediction task, our jointly trained model has a higher kappa coefficient than the two other baselines. The full confusion matrix for the Cartesian product of predictions is shown in Figure 2b. All results are on the validation set.
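Cohen's kappa, used above to summarize each confusion matrix, compares observed agreement against agreement expected by chance. A minimal sketch (illustrative helper, not the evaluation code used in the paper):

```python
def cohens_kappa(cm):
    """Cohen's kappa from a square confusion matrix cm[true][predicted]."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    p_o = sum(cm[i][i] for i in range(k)) / n                 # observed agreement
    row = [sum(cm[i]) for i in range(k)]                      # true-class marginals
    col = [sum(cm[i][j] for i in range(k)) for j in range(k)] # predicted-class marginals
    p_e = sum(row[i] * col[i] for i in range(k)) / (n * n)    # chance agreement
    return (p_o - p_e) / (1.0 - p_e)
```

Kappa is 1 for perfect agreement and 0 when the predictor does no better than chance, which is why it is more informative than raw accuracy under the class imbalance described above.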

    Training procedure      MovieLens    ReDial, no pre-training    ReDial, pre-trained on MovieLens
    Standard (baseline)     (0.820)                                 0.35
    Denoising Autorec       (0.805)                                 0.33

Table 2: RMSE for movie recommendations. RMSE for ReDial ratings is shown on a 0–1 scale; for the MovieLens experiment, we show the RMSE on a 0.5–5 scale in parentheses. (Due to an error in our code, the original published version of the paper incorrectly reported some of these results. Results are now updated to the ones from our newly released accompanying code; the new results do not alter the study’s conclusions.)

Movie recommendation quality:

We use the “latest” MovieLens dataset (retrieved September 2017), which contains 26 million ratings across 46,000 movies, given by 270,000 users. It contains 2.6 times more ratings, but also across 4.6 times more movies, than MovieLens-10M, the dataset used in Sedhain et al. (2015). First, we evaluate the model on the MovieLens dataset. Randomly chosen user-item ratings are held out for validation and test, and only training ratings are used as inputs. Following Sedhain et al. (2015), we split the ratings into training, validation, and test sets, repeated this splitting procedure five times, and report the average RMSE.
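The RMSE reported here is the standard root-mean-squared error over held-out ratings; for completeness, a minimal sketch:

```python
import math

def rmse(predicted, actual):
    """Root-mean-squared error over held-out ratings."""
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)
```

Note that RMSE values on the 0–1 ReDial scale and the 0.5–5 MovieLens scale are not directly comparable, which is why Table 2 reports them separately.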

We also examine how the model performs on the ratings from our data (ReDial), with and without pre-training on MovieLens. This experiment ignores the conversational aspect of our data and focuses only on the like/dislike ratings provided by users. We consider only the ratings given by the movie seeker, and we ignore the responses where they answered “did not say either way”. We end up with a set of binary ratings for each conversation. To place ourselves in the setting of a recommender that meets a new movie seeker (the cold-start setting), we consider each conversation as a separate user. Randomly chosen conversations are held out for validation, and each rating, in turn, is predicted using all other ratings from the same conversation as inputs. For pre-training, we binarize the MovieLens observations (which range between 0.5 and 5) by choosing a rating threshold that gives a similar distribution of “liked” and “disliked” labels as in our data: ratings at or above the threshold are considered “liked”, and ratings below it “disliked”. In each experiment, for the two training procedures (standard and denoising), we perform a hyper-parameter search on the validation set.
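The threshold selection described above can be sketched as picking, among the observed rating values, the cutoff whose “liked” fraction best matches the target proportion; the helper name is an illustrative assumption.

```python
def pick_threshold(ratings, target_liked_frac):
    """Choose the rating threshold whose 'liked' fraction best matches the target."""
    n = len(ratings)
    candidates = sorted(set(ratings))           # every observed rating value is a candidate cutoff
    def liked_frac(t):
        return sum(r >= t for r in ratings) / n
    best = min(candidates, key=lambda t: abs(liked_frac(t) - target_liked_frac))
    return best, [1 if r >= best else 0 for r in ratings]
```

Matching the liked/disliked proportions keeps the pre-training distribution close to the (highly imbalanced) ReDial label distribution.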

Table 2 shows the RMSE obtained on the test set. In the experiment on the MovieLens dataset, the denoising training procedure brings a slight improvement over the standard training procedure. After pre-training on MovieLens, the performance of the models on our data is significantly improved.

Overall dialogue quality assessment:

Figure 3: Results of human assessment of dialogue quality. The percentages are relative to the total number of ranking tasks, so that bars of the same color sum to 1.

We ran a user study to assess the overall quality of our model’s responses compared to HRED. Ten participants were each presented with ten complete real dialogues from our validation set, performing 56 ranking tasks, one for each recommender utterance in those ten dialogues. At the point where the human recommender provided their response in the real dialogue, we show, in random order: the text generated by our HRED baseline, the text generated by our model, and the true response. The participant is asked to rank the responses from 1 to 3, with 1 being the best and 3 the worst. We allow ties, so multiple responses could be given the same rank (e.g., a ranking of 1, 2, 2 was possible if one response was clearly the best but the other two were of equivalent quality). Figure 3 shows the percentage of times each model received each rank. The true response was ranked first 349 times, our model 267 times, and HRED 223 times.
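Aggregating the study’s rankings into the per-model percentages shown in Figure 3 amounts to simple counting. A minimal sketch, with hypothetical model names:

```python
def rank_percentages(rankings, ranks=(1, 2, 3)):
    """Turn a list of per-task rankings into per-model rank frequencies.

    `rankings` is a list of dicts mapping model name -> rank (1 best,
    ties allowed). For each model, the returned fractions over the three
    ranks sum to 1, matching the normalization used in Figure 3.
    """
    models = rankings[0].keys()
    n = len(rankings)
    return {m: {rank: sum(task[m] == rank for task in rankings) / n
                for rank in ranks}
            for m in models}
```

Because ties are allowed, several models may share rank 1 on the same task, so for a fixed rank the fractions across models need not sum to 1.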

6 Discussion and Conclusions

We presented ReDial, a new, high-utility dataset of real-world, human-generated conversations around the theme of providing movie recommendations. While 10,000 conversations are likely insufficient to train an end-to-end neural model from scratch, we believe this shortage of data is a systematic problem in goal-oriented dialogue settings and needs to be addressed on the modeling side. We use this dataset to explore a novel modular formulation of a fully neural architecture for conversational movie recommendations. The dataset has been collected in such a way that subtasks such as sentiment analysis and movie recommendation can be explored and evaluated separately or within the context of a complete dialogue system.

We introduced a novel overall architecture for this problem domain which leverages general-purpose sentence representations and hierarchical encoder-decoder architectures, extending them with dynamically instantiated RNN models that drive an autoencoder-based recommendation engine. We find tremendous benefit from this modularization in that it allows one to pre-train the recommendation engine on other, larger data sources specialized for the recommendation task alone. Further, our proposed switching mechanism allows one to integrate recommendations within a recurrent decoder, mixing high-quality suggestions into the overall dialogue framework.
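A minimal sketch of the switching idea: a scalar switch probability decides how much probability mass goes to the recommender’s distribution over movies versus the decoder’s distribution over words, yielding one normalized distribution over the concatenated vocabulary. This is a simplification of the actual mechanism, in which the switch is predicted by the network at every decoding step:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def switch_mix(word_logits, movie_scores, switch_prob):
    """Mix a word distribution and a movie distribution via the switch.

    Returns one distribution over [words..., movies...] that sums to 1;
    `switch_prob` is the probability that the next token is a movie entity.
    """
    words = softmax(word_logits)
    movies = softmax(movie_scores)
    return ([(1 - switch_prob) * w for w in words]
            + [switch_prob * m for m in movies])
```

Decoding then simply picks (or beam-searches over) the combined distribution, so movie mentions and ordinary words compete in a single output space.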

Our proposed architecture is not specific to movies and applies to other types of products, provided a conversational recommendation dataset is available for that domain. Our utterance-level evaluation compares the responses generated by different models in a given context, controlling for confounding variables to some extent. In that setting, our model outperforms the HRED baseline. However, we have not yet evaluated whole conversations between our model and a human user. Future work could improve this evaluation by asking the human evaluators more precise questions: instead of asking which response is best in a general way, we could ask, for example, which response provides the best recommendation given the context, or which is the most fluent. This would give us insight into which parts of the model could be improved.


Appendix A Data collection interface

(a) Seeker interface
(b) Recommender interface
Figure 4: Data collection interface.

Appendix B Training Procedure

Two parts of our model are frozen after pre-training: the first layer of our utterance encoder (which uses the GenSen representations [Subramanian et al., 2018]) and the sentiment analysis module (after being pre-trained to predict the movie ratings). The recommender system is pre-trained on the MovieLens dataset but is fine-tuned with the rest of the network. Our model is then trained with teacher forcing on our dialogue dataset. The training objective is to maximize the likelihood of the next recommender utterance (which may contain both words and movie entities). Thus, the switching decoder and the hierarchical encoder (except for the utterance encoder’s first layer) are trained from scratch on our dataset. We retain 20% of the training conversations for the validation set. We optimize the parameters with Adam [Kingma and Ba, 2014]. At test time, we use a beam search of size 10 to generate the response.
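The freeze-and-fine-tune scheme above can be illustrated with a toy update step that skips frozen parameter groups. The parameter names are hypothetical, and the real model uses Adam rather than the plain gradient step shown here:

```python
def sgd_step(params, grads, frozen, lr):
    """Apply one gradient step, leaving frozen parameter groups untouched.

    `params` and `grads` map group names to flat lists of floats;
    `frozen` is the set of group names excluded from optimization,
    mirroring the frozen utterance-encoder layer and sentiment module.
    """
    return {name: (value if name in frozen
                   else [v - lr * g for v, g in zip(value, grads[name])])
            for name, value in params.items()}
```

In a framework like PyTorch the same effect is typically achieved by disabling gradients on the frozen modules and passing only the remaining parameters to the optimizer.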

Appendix C Conversation samples

We show in Tables 3–7 some randomly picked conversations from the validation set, together with the responses generated by HRED and by our model. At each utterance, we condition on the ground-truth previous turns (labeled “SEEKER” and “HUMAN” in the tables) to generate the response.

SEEKER: hi ! i ’m looking for an action filled movie similar to jurassic park
HUMAN: ok , have you seen jurassic world ?
HRED: hi ! i love that one ! have you seen the conjuring or the conjuring 2 ?
OURS: hi what kind of movies do you like ?
SEEKER: yes i loved that movie as well . are there any good movies without dinosaurs ?
HUMAN: have you seen king kong ?
HRED: no , i have n’t seen it yet . have you seen the avengers ? it ’s about a couple of friends who are their own .
OURS: i have n’t seen that one . i will have to check it out . have you seen jurassic park
SEEKER: yes ! that was another great giant monster movie : )
HUMAN: what about jaws & amp ; jaws 2 ? oldies but goodies scary action ? suspense gives me the rumblies
HRED: i have n’t seen that one yet but i heard it was good . i have n’t seen it yet . i have n’t seen it yet . i have n’t seen it yet .
OURS: i have n’t seen that one yet . i will have to check it out . what about walking with dinosaurs ?
SEEKER: i absolutely loved the first one . one of my favorite movies ever : ) honestly i ca n’t say i remember much about the second one . jaws 3-d was quite funny and jaws : the revenge was just weird . i do love suspense too …
HUMAN: i like suspense but sometimes i can barely handle it ! it gives me anxiety ! lol
HRED: no , i have n’t seen it yet . i will have to check it out . thanks for the chat .
OURS: i have n’t seen that one . i will have to check it out . i hope i helped .
SEEKER: that ’s understandable . as long as it ’s not too gory i do n’t generally have a problem with those things .
HUMAN: well , it was great chatting with you ! have a great one !
HRED: i ’m glad i was able to help you today
OURS: have you seen immortals ? it ’s a war movie but it ’s pretty good
SEEKER: you too ! thanks !
Table 3: Sample conversation from validation set.
HUMAN: hello .
HRED: hi what kind of movies do you like
OURS: hi , how can i help you ?
HUMAN: what type of movie can i suggest to you ?
HRED: what kind of movies do you like ? i ’m good
OURS: what kind of movies do you like ? what kind of movies do you like ?
SEEKER: i am looking for a slap stick comedy from the 80 ’s or 90 ’s
HUMAN: not familiar with this , please give me a minute .
HRED: have you seen the conjuring or the conjuring 2 ? it was really good and very funny
OURS: what kind of movies do you like ? i ’m good , thanks for asking ! have a great day !
SEEKER: okay movies like airplane ! and airplane ii : the sequel
HUMAN: oh , okay . i saw those and thought they were great . did you see caddyshack or dumb and dumber ?
HRED: i have n’t seen that one yet but i heard it was good . have you seen the last house on the left ? it ’s pretty good .
OURS: have you seen coming to america ? or the naked gun hot shots !
SEEKER: yes , those were pretty good
HUMAN: how about spaceballs , i thought this was pretty funny .
HRED: have you seen foxcatcher ? it ’s a comedy about a man on a true story
OURS: have you seen dumb and dumber to ? it ’s a classic !
SEEKER: haha that one was great
HUMAN: okay , let ’ me see if i can offer one more .
HRED: have you seen foxcatcher ? it ’s a comedy about a man that turns out on a true story .
OURS: i have n’t seen that one yet . i ’ll have to look it up .
SEEKER: awesome
HUMAN: robin hood : men in <unk> ? this one had me rolling .
HRED: have you seen foxcatcher ? it ’s about a man who has a rich guy who is a <unk> rich guy
OURS: have you seen monty python ’s life of brian ? that ’s a good one !
SEEKER: haha i forgot about that one ! i ’ll be watching that one tonight
HUMAN: great .
HRED: it stars nanjiani , zoe kazan , holly hunter , ray romano , adeel akhtar , and anupam <unk> no , i have n’t seen it yet .
OURS: have you seen caddyshack ii ? it ’s a good one .
SEEKER: thanks for all of your help i hope you have a great day
HUMAN: welcome , glad to help . have a nice day . good bye
HRED: you ’re welcome ! have a great day !
OURS: you ’re welcome ! have a great day !
Table 4: Sample conversation from validation set
SEEKER: hi !
HUMAN: hi ! how is your morning going ?
HRED: hi , what kind of movies do you like
OURS: hello ! what kind of movie are you looking for ?
SEEKER: great , thanks ! i was thinking of starting the day off with a great comedy , any suggestions ?
HUMAN: i love the hangover it always makes me laugh . and i bet lots of people are hungover this morning ! airplane ! is another classic of mine .
HRED: that ’s a good one ! have you seen the lego movie ? it ’s a great movie !
OURS: i have n’t seen that one ! what kind of movies do you like ? i can help you with that .
SEEKER: i ’ve seen the hangover it ’s really good ! i thought about watching knocked up have you seen it ?
HUMAN: i enjoyed that movie . i enjoyed spy it ’s a well done parody .
HRED: it ’s about a man who gets revenge on a cyborg and a hockey team . it ’s really good . it ’s really good . i also like the girl on the train
OURS: i have n’t seen that one . i will have to check it out . i hope i was able to help you find a good one ! have a great day !
SEEKER: thanks ! i ’ll check it out !
HUMAN: sure ! have a great day !
HRED: you ’re welcome ! have a great day !
OURS: have you seen revenge of the nerds ? it ’s really good . it ’s a classic .
SEEKER: you too .
Table 5: Sample conversation from validation set
HUMAN: good day !
HRED: hi what kind of movies do you like
OURS: hi , how can i help you ?
SEEKER: hey i ’m really into romance movies any good suggestions ? thanks !
HUMAN: hmm . let me think for a sec . have you seen the big sick ? it ’s romantic and funny
HRED: what kind of movies do you like to watch ? have you seen the conjuring 2 ?
OURS: what kind of movies do you like ? i ’m good , and you ?
SEEKER: i like movies like beauty and the beast and a walk to remember oh and lol i have never seen the the big sick or never been kissed and romeo and juliet
HUMAN: i highly recommend ! great story ! have you seen the notebook ? that ’s a big one <unk> romance fans
HRED: i have n’t seen that one yet but i ’ve heard good things about it !
OURS: i have n’t seen that one yet but i have heard good things about the longest ride and the fault in our stars
SEEKER: i love the the notebook a really great film .
HUMAN: and if you liked romeo and juliet then i have to reccommend titanic for leo . also a recent sweet romantic comedy was the duff
HRED: i have n’t seen that one , but i ’ve heard good things about it !
OURS: i have n’t seen that one yet but i ’ve heard good things about it have you seen message in a bottle ?
SEEKER: ahh the great titanic what a classic thanks for recommending that one . i will definitely have to check out the the duff thanks so much for your help !
HUMAN: no problem ! i have one more to suggest . phantom thread ! it ’s new and i did n’t like it as much but a lot of others do ! it <unk> romance in a unique way no problem at all ! i hope you enjoy the films you decide to watch and enjoy the rest of your day !
HRED: you ’re welcome ! have a great day !
OURS: you ’re welcome ! have a great night !
SEEKER: thanks i will defiantly check out the phantom thread i really appreciate it and you have a great day !
Table 6: Sample conversation from validation set
SEEKER: hello
HUMAN: hello
HRED: hi , how can i help you ?
OURS: hello , what kind of movie are you looking for ?
SEEKER: how are you this evening ?
HUMAN: what type of movies do you like ? good , thanks
HRED: what kind of movies do you like ? i ’m good
OURS: can i help you find a movie ?
SEEKER: i am looking for a musical . what do you suggest ?
HUMAN: have you seen enchanted ? maybe you ’d like la la land .
HRED: have you seen get out yet ? it ’s pretty good
OURS: i have n’t seen that one yet but have heard good things about it have you seen prometheus ?
SEEKER: no i have n’t seen that one . i have been meaning to watch it though . oh that ’s a good suggestion . what about something older ?
HUMAN: the rocky horror picture show is a great one .
HRED: i have n’t seen that one yet but heard good things about ender ’s game i have n’t seen it yet but i have n’t seen it yet but i ’ve heard of it .
OURS: i ’m trying to think of something along the same way . sweeney todd : the demon barber of fleet street is another good one
SEEKER: i ’ve seen that before when i was younger . i did n’t know whether to laugh or be scared . lol
HUMAN: fiddler on the roof maybe ? sorry , i have n’t seen many musicals
HRED: i have n’t seen that one . i will have to check it out . thanks for chatting !
OURS: i have n’t seen grease but i have heard good things about it .
SEEKER: yes ! that ’s great ! thank you for the suggestions ! i try to watch them soon , especially enchanted ! hope you have a good night !
HUMAN: thanks you too
HRED: you ’re welcome ! have a great night !
OURS: no problem ! have a good day !
Table 7: Sample conversation from validation set


Figure 5: 2D embedding of movies in our conversation database. The edge weight in the similarity matrix is proportional to the number of co-occurrences in the same dialogue. Left: all movies, colored by number of occurrences from light blue (low) to red (high). Right: names of movies with highest number of occurrences. Embedding via Jacomy et al. [2014].