User Memory Reasoning for Conversational Recommendation

by Hu Xu, et al.
University of Illinois at Chicago

We study a conversational recommendation model which dynamically manages users' past (offline) preferences and current (online) requests through a structured and cumulative user memory knowledge graph, to allow for natural interactions and accurate recommendations. For this study, we create a new Memory Graph (MG) <–> Conversational Recommendation parallel corpus called MGConvRex with 7K+ human-to-human role-playing dialogs, grounded on a large-scale user memory bootstrapped from real-world user scenarios. MGConvRex captures human-level reasoning over user memory and has disjoint training/testing sets of users for zero-shot (cold-start) reasoning for recommendation. We propose a simple yet expandable formulation for constructing and updating the MG, and a reasoning model that predicts optimal dialog policies and recommendation items in unconstrained graph space. The prediction of our proposed model inherits the graph structure, providing a natural way to explain the model's recommendation. Experiments are conducted for both offline metrics and online simulation, showing competitive results.


1 Introduction

Figure 1: A conceptual illustration of Memory-grounded conversational recommendation. (1) Past (offline) user preferences are captured as an initial Memory Graph (MG). (2) Conversational recommendation allows users to express preferences and requirements through dialogs. (3) Our MGConvRex corpus is grounded on user memory, which represents user’s past history as well as newly added preferences.

Conversational recommendation systems have recently gained traction in the dialog community. Such a system aims to learn up-to-date (online) user preferences, instead of using static (offline) preferences as in traditional recommender systems (e.g. collaborative filtering (CF)). Most existing works focus on combining a static recommender system with a dialog system by updating user preferences via asking relevant questions (often referred to as “System Ask User Respond (SAUR)”) Zhang et al. (2018). However, this “short-term” update unnaturally isolates users’ history from their preferences in the current dialog (which are possibly forgotten after the dialog is finished). An intelligent system should be able to dynamically maintain and reason over users’ knowledge for current (and possibly future) recommendations.

To this end, we introduce a novel concept called user memory graph to holistically represent the knowledge about users and associated items. This user memory graph may contain any static knowledge obtained offline (e.g. items, attributes, the history of users and past dialogs) and users’ knowledge obtained online (e.g. from state tracking of the current dialog), as illustrated in Figure 1. The user memory graph naturally has the following benefits. (1) Holistic reasoning considers all available knowledge about users and items together to generate the dialog policy. We believe this is the core problem in conversational recommendation because asking a good question or finding a good candidate item requires exploring the “soft match” of the knowledge between users and items Zhang et al. (2018). (Footnote 2: In contrast, task-oriented dialog focuses on matching hard constraints (e.g. a DB query) against available records, although the differences can be blurry.) (2) Zero-shot (cold-start) reasoning for users/items unseen during training. The user memory graph naturally separates user/item knowledge from the reasoning process of the policy. As a result, one can train a user/item-agnostic model that can later be applied to the user memory graph of a new user (obtained after the model is deployed). In contrast, most CF-based systems “overfit” to existing users/items (in their embeddings). (3) Open space policy is a key challenge in conversational recommendation because of the innumerable items involved in the dialog policy. This requires a flexible policy space that covers all items (and possibly all valid values and slots for acquiring preferences) instead of a pre-defined fixed space. (Footnote 3: We widely reuse the terms from task-oriented dialog to make this paper easier to read, although slots and values can be special cases of entities in a user knowledge graph.) The user memory graph can be a basis for the policy because it contains all these valid entities for the current dialog.
In summary, this paper aims to address the following problem:

User Memory Reasoning for Recommendation: Assume an agent is involved in a conversational recommendation with a user. The agent (1) constructs a user memory graph based on the history knowledge of the user, the candidate items, and their associated slots and values (Footnote 4: The construction procedure for the user memory graph is omitted here for brevity and detailed in Section 4.); (2) without loss of generality, updates the graph with new knowledge from the current turn, in the form of tuples; and (3) performs reasoning over the graph to yield a dialog policy that either (i) performs more rounds of interaction to collect users’ knowledge (e.g. via question answering), or (ii) recommends items to the user.

To this end, we first collect a dataset for this problem, as existing public datasets hardly meet the needs of this paper for the following reasons. (1) Lacking users’ history and thus dialogs referring to the history (e.g. the 2nd and 4th turn in Figure 1). One reason is that most datasets aim for task-oriented systems, where users’ history and reasoning are not core issues to solve. (2) Lacking fine-grained annotation (for updating the user memory graph). Most public datasets for conversational recommendation are combinations of datasets for recommender systems and dialogs transcribed separately Li et al. (2018a); Zhang et al. (2018). The process is not designed for knowledge-grounded dialogs and makes it hard to annotate entity-level knowledge. (3) Lacking human-level reasoning. The goal of transcribing for existing datasets is not to reason over existing knowledge from both users and items. Some actions are taken at the transcribers’ will Li et al. (2018a). The collected dataset is called Memory Graph Conversational Recommendation (MGConvRex), containing 7.6K+ dialogs with 73K turns based on real-world users’ behavior. It is annotated with dialog acts, items, slots, values, and sentiment polarities that capture human-level reasoning of dialog policy (see Section 3 and the Appendix for more details of data collection).

To construct the user memory graph, we define a simple yet flexible ontology, as detailed in Section 4. One challenge in conversational recommendation is to deal with the open space policy. This needs a flexible formation of policy space that differs dialog-by-dialog. We propose a baseline called user memory graph reasoner (UMGR), which preserves the structure of the user memory graph during reasoning and generates policy based on the graph. This also potentially allows for the interpretability of dialog policy.

In summary, the contributions of this paper are as follows: (1) We propose a novel task of user memory reasoning for conversational recommendation; (2) We collect a dataset and propose an ontology to construct the user memory graph; (3) We propose a baseline for reasoning dialog policy over the user memory graph. Experimental results show that such a reasoning model is promising.

2 Related Work

Conversational Recommendation is one important type of information seeking dialog system Zhang et al. (2018). Existing studies focus on combining a recommender system with a dialog state tracking system, through the “System Ask User Respond (SAUR)” paradigm. Once enough user preference is collected, such systems often make personalized recommendations to the user. For instance, (Li et al., 2018a) propose to mitigate the cold-start problem by learning new users’ preferences during conversations and linking the learned preferences to existing similar users in a traditional recommender system. (Sun and Zhang, 2018) propose to update a recommender system in the latent space with the latent space of dialog state tracking and tune the dialog policy via reinforcement learning. The updates are short-term and very close to a task-oriented dialog system.

(Kang et al., 2019) propose a self-play reinforcement learning (RL) setting to boost the performance of a text-to-text dialog model. (Zhang et al., 2018) leverages reviews to mimic online conversations to update an existing user’s preference and re-rank items. In Misu et al. (2010), the user memory/knowledge is represented as a probabilistic state with a fixed hierarchical structure of Markov probabilistic model to predict dialog actions. However, it lacks the flexibility for encoding richer and fine-grained knowledge and accumulating new knowledge about users for long-term use. Zhou et al. (2020) demonstrate the usage of user profile and users’ interests from ongoing dialog in a social chatbot. To the best of our knowledge, none of the existing systems (or datasets) aims to build an explicit user memory for reasoning and long-term use.

Task-oriented Dialog Systems are widely studied with multiple popular benchmark datasets Henderson et al. (2014); Wen et al. (2016); Budzianowski et al. (2018); Eric et al. (2019); Rastogi et al. (2019). Most of the state-of-the-art approaches Wu et al. (2019); Gao et al. (2019); Chao and Lane (2019) focus on improving dialog state tracking with span-based pointer networks for unseen values, which predict information that is essential for completing a specified task (e.g. hotel/air ticket booking, etc.). Datasets for task-oriented systems typically lack users’ history, probably because users’ history is not very important to correctly locate a record for the current dialog. Although certain types of dialog acts, slots, and values are shareable between task-oriented systems and conversational recommendation, the core problem of conversational recommendation is to reason and to rank items or questions to ask.

Graph Reasoning is essential for generating dialog policy from the proposed user memory graph, where the graph can be viewed as a structured form of state representation. There are many studies on leveraging knowledge graphs for recommender systems. For example, (Xian et al., 2019) introduced a graph-based recommender (not dialog) system that is trained via reinforcement learning. Graph neural networks, which aim to learn hidden representations over discrete graph structures Scarselli et al. (2008); Duvenaud et al. (2015); Defferrard et al. (2016); Kipf and Welling (2016), have become popular in recent years. They are leveraged in this paper to learn structure-preserving (and thus explainable) reasoning. A number of extensions to the original graph neural network have been proposed Li et al. (2015); Pham et al. (2017), most notably R-GCNs Schlichtkrull et al. (2018), which can be applied to large-scale and multi-relational graphs (relations are associated with typed embeddings).

A few works have recently been proposed to allow knowledge graph reasoning in dialog systems. Moon et al. (2019a, b) propose a new corpus to learn knowledge graph paths that connect dialog turns. Tuan et al. (2019) introduces a knowledge-grounded dialog generation task given a knowledge graph that is dynamically updated. However, these works often focus on response generation and do not address the reasoning of user knowledge in conversational recommendations.

| Role | Dialog Act | Description | Example |
|---|---|---|---|
| User | Greeting | Greeting to the agent | I’d like to find a place to eat. |
| User | Inform | Actively inform the agent of your preference | I’d like to find a thai restaurant. |
| User | Answer | Answer to a question from the agent | I prefer thai food. |
| User | Reply | Reply to a recommendation | I’ll give it a try. |
| User | Open question (OQ) | Actively ask an open question about a recommended item | What kind of food do they serve? |
| User | Yes/no question (YNQ) | Actively ask a yes/no question about a recommended item | Do they serve thai food? |
| User | Thanks | Thanks the agent | Thanks for your help. |
| Agent | Greeting | Greeting to the user | How may I help you today? |
| Agent | Open question (OQ) | Ask an open question about a slot to the user | What kind of food do you prefer? |
| Agent | Yes/no question (YNQ) | Ask a yes/no question about a value of a slot | I saw you’ve been to thai restaurant, do you still like that? |
| Agent | Recommendation (REC) | Recommend items to the user | How about burger king, which serves fast food? |
| Agent | Answer (ANS) | Answers user’s questions on an item | They serve thai food. |
| Agent | Thanks | Thanks the user | Enjoy your meal. |

Table 1: Dialog acts for agent and user: the spans of items/slot values are italicized.

3 MGConvRex Dataset

This section describes the construction of the MGConvRex dataset. MGConvRex aims to contain dialogs that draw on the user’s history and on fine-grained user preferences to update the user memory graph. As such, we propose to leverage existing data from recommender systems that carry users’ past behavior to harvest large-scale dialog scenarios. (Footnote 5: We focus on the restaurant domain at this stage.) Then we define fine-grained dialog acts, slots, values and sentiment polarities to turn unstructured utterances into structured knowledge for memory graph updates.

This section is organized as follows. (1) We detail the curation of dialog scenarios in Section 3.1. (2) We then define structured knowledge such as dialog acts, slots, values, and sentiment polarities for MGConvRex, as detailed in Section 3.2. (3) Next, we describe the process for transcribing human-to-human simulated dialogs in a Wizard-of-Oz environment Henderson et al. (2014); Wen et al. (2016); Budzianowski et al. (2018); Eric et al. (2019) (Section 3.3). (4) Lastly, we define the ontology for annotating the structured knowledge in utterances, and provide the statistics of the dataset in Section 3.4. As a result, MGConvRex can be used for a broader scope of research in conversational recommendation, including but not limited to policy reasoning, natural language understanding (e.g. intent detection, slot filling, sentiment analysis), natural language generation, etc.

3.1 Dialog Scenarios

We use scenario to refer to a pre-defined user-agent setting used to collect a dialog between two crowd workers, where one plays the user and the other plays the agent. Scenarios in conversational recommendation can be generated from user behaviors in recommender system datasets. This reduces the need to curate synthetic dialog scenarios, as in datasets for task-oriented dialog systems Li et al. (2016, 2018b).

We assume each item is associated with values and each value is associated with at least one slot. Let be a binary number. We define a scenario consisting of the following parts: , where is a user, is about the candidate items and their associated values , is about the user’s past history ( visited items and their values) that is known to the agent, indicates values with their associated slots, is the user preference (which value the user prefers for a slot) and is the ground-truth items. (Footnote 6: To reduce the load of transcribers, a user’s past history only contains visited items at this stage.)

We create dialog scenarios as follows: (1) for each user, we draw visited items and one item as the ground-truth item (Footnote 7: We use 1 ground-truth item to reduce the load of the transcribers and increase the difficulty of reasoning.), and we use the values and associated slots of the ground-truth item as the user preference; (2) we negatively sample items and combine them with the ground-truth item as candidate items.

To keep human reasoning difficult, we enforce certain similarities between the candidate items and the ground-truth items (for example, all locations are in the same state). For the same user, we also create a duplicated scenario in which the user’s history is withheld, so the agent player can only use knowledge from the current dialog for recommendation.
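The scenario recipe above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the dictionary layout, and the choice to hold the ground-truth item out of the user's visited items are all assumptions made for the example, and the similarity constraint between candidates is omitted.

```python
import random


def build_scenario(user, visited_pool, catalog, num_visited, num_negatives, rng):
    """Build one dialog scenario (hypothetical structure).

    catalog maps each item to its {slot: value} description;
    visited_pool is the set of items the user interacted with.
    """
    # Assumption: the ground-truth item is held out from the user's own items.
    ground_truth = rng.choice(sorted(visited_pool))
    history_pool = sorted(i for i in visited_pool if i != ground_truth)
    history = rng.sample(history_pool, min(num_visited, len(history_pool)))
    # Negative sampling: draw candidates the user never interacted with.
    negative_pool = sorted(i for i in catalog if i not in visited_pool)
    negatives = rng.sample(negative_pool, min(num_negatives, len(negative_pool)))
    candidates = negatives + [ground_truth]
    rng.shuffle(candidates)
    return {
        "user": user,
        "candidates": candidates,
        "history": history,
        # The user preference mirrors the slots/values of the ground truth.
        "preference": dict(catalog[ground_truth]),
        "ground_truth": ground_truth,
    }


def without_history(scenario):
    """Duplicate a scenario but hide the visited items from the agent."""
    return {**scenario, "history": []}
```

Pairing each scenario with its `without_history` copy gives the two conditions that Table 2 reports as dialogs with and without history.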

| Dataset | All: # of Dial. | All: # of Turns | All: Avg. # of Turns | w/ History: # of Dial. | w/ History: Avg. # of Turns | w/o History: # of Dial. | w/o History: Avg. # of Turns |
|---|---|---|---|---|---|---|---|
| Train | 4985 | 48457 | 9.72 | 2418 | 9.62 | 2567 | 9.81 |
| Dev | 263 | 2466 | 9.38 | 121 | 9.16 | 142 | 9.56 |
| Test | 2367 | 23048 | 9.74 | 1160 | 9.62 | 1207 | 9.85 |

Table 2: Statistics of the Dataset: Dialogs w/ or w/o History indicates whether scenarios include visited items.

3.2 Dialog Acts, Slots, Values and Sentiment Polarities

We further define the following knowledge for curating structured information for graph updates.

Dialog Acts: Table 1 shows the dialog acts for both the user and the agent. Note that besides the System Ask – User Respond (SAUR) paradigm Sun and Zhang (2018); Li et al. (2018a); Zhang et al. (2018), we also propose a User Ask – System Respond (UASR) paradigm that allows users to actively participate in a recommendation. Acts such as Open question, Yes/no question and Inform are designed for this purpose.

Slots and Values: We select popular slots with a total of 470+ values for the restaurant domain. To help transcribers use some values naturally in utterances, we change some values (such as the price range $) into English words (e.g. “cheap”).

Sentiment Polarity: We define a user’s preference expressed in a conversation as pairs of opinion targets (an item or a value) and their associated sentiment polarities Hu and Liu (2004). We adopt 3 types of polarities, pos_on, neg_on and neu_on, to represent positive, negative and neutral polarity, respectively. (Footnote 8: We do not deal with emotions (e.g. sad), although existing works may use sentiment to indicate emotions.)

3.3 Wizard-of-Oz Collection and Annotation

We build a Wizard-of-Oz system that randomly pairs two crowd workers in a chat session, where each scenario is split into two parts: one for the user and one for the agent. A conversation works like a game between the user and the agent: the agent needs to reason about the user’s current preference and find the ground-truth item, while the user can share information from the preference or confirm a recommended item but cannot reveal the ground-truth item directly. The guidelines and screenshots of the Wizard-of-Oz UI can be found in the Appendix.

3.4 Summary of MGConvRex

We annotate dialog acts, items, slots, values, and users’ utterance-level and entity-level sentiment. The dialogs are split into training, development, and testing sets with non-overlapping users for zero-shot reasoning on unseen users. The statistics of MGConvRex are in Table 2. For scenarios with users’ history, we notice that the average number of turns is slightly lower than for those without users’ history. We further plot the agent’s dialog acts to study the behavior of the agent players, as in Figure 2, where agent players seem to use more yes/no questions to confirm the users’ preferences exhibited in the history. We discuss more details in the Appendix.

Figure 2: Distribution of dialog acts from agent side: w/ his indicates scenarios have users’ history.

4 User Memory Graph

In this section, we describe the formulation of a user memory graph based on a scenario and annotated user preference. There are many design choices for constructing a user memory graph. Our goal is to model user knowledge and scenarios in a way that is easy to extend and maintain.

Entity Types Explanation
user entities
memory entities
item entities
slot entities
value entities
Relation Types
a user has a memory entity
a memory is about an item
an item has a value
a value belongs to a slot
is positive on a value or item
is negative on a value or item
is neutral on a value or item
Table 3: Ontology of user memory graph: bolded relations are used for graph updates or accumulation.
Figure 3: User memory graph construction and updates based on the dialog in Table 4.

4.1 Construction

As a reminder, a user memory graph is essentially a heterogeneous graph with typed entities and relations. We first define the ontology (or meta entities and relations) in Table 3. The user memory contains the available items for a dialog scenario. An item can be associated with multiple values, and each value is associated with its slot. In this way, value and slot entities are expandable, and new values or slots (or even slots of slots) can be easily added. Further, each user has their own entity and several associated memory entities. We define a memory entity to model an event or experience of the user, such as visiting a restaurant or having a conversation, as in the current dialog. The advantage of allowing multiple memory entities is that a user may have different opinions on the same target (items or values) from very different experiences (e.g. liking Thai food for lunch but not dinner). To express a user’s history on visited items, we use a relation to connect a memory entity with a visited item. As an example, we demonstrate the construction for a user in the first graph in Figure 3. We will keep using this example to demonstrate the updates of the user memory graph for the dialog in Table 4.

4.2 Update

Updates to the user memory graph are assumed to leverage the outputs of natural language understanding (NLU) or state tracking. (Footnote 9: We leave the language understanding parts to future work; the baselines of this paper use ground truths from annotations.) For simplicity, we use the 3 sentiment relations (positive, negative, and neutral; see Table 3) to update a user memory graph. These relations associate values/items (opinion targets) with the memory entity of the current dialog. We believe humans have a more complex memory system in their brains, and we expect more complex memory update systems (e.g. with error correction) in future work.

From the first turn of the user in Table 4, we know that the user likes Thai food, and the user memory graph is updated with a new triple that links the memory entity of the current dialog to the value Thai through a positive sentiment relation. Following the second turn of the user, we know that the user is still interested in affordable restaurants, indicated by another new triple. Then the agent can infer a recommendation, Basil, which can be explained by three paths in the graph, where the last path draws the relevance from a visited item to the current recommendation. As we can see, sentiment relations serve as the bridge that connects a user to items and enable potential reasoning for recommendation.
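To make the update and the path-based explanation concrete, here is a small sketch that stores the graph as triples and enumerates paths from the user to a recommended item. It is only an illustration under assumptions: the relation names (`has_memory`, `about_item`, `has_value`, `belongs_to`), the entity names `m_visit`, `m_dialog` and `visited_item`, and the use of inverse edges for path finding are hypothetical; only `pos_on` and the dialog content come from the text.

```python
from collections import defaultdict


class MemoryGraph:
    """Toy user memory graph stored as (head, relation, tail) triples."""

    def __init__(self):
        self.triples = set()
        self.edges = defaultdict(set)

    def add(self, head, relation, tail):
        self.triples.add((head, relation, tail))
        self.edges[head].add((relation, tail))
        # Index the inverse direction too, so paths can cross item-value links.
        self.edges[tail].add((relation + "^-1", head))

    def paths(self, start, goal, max_hops=4):
        """Enumerate simple paths (lists of triples) from start to goal."""
        found = []

        def walk(node, path, seen):
            if node == goal and path:
                found.append(list(path))
                return
            if len(path) == max_hops:
                return
            for relation, nxt in sorted(self.edges[node]):
                if nxt not in seen:
                    path.append((node, relation, nxt))
                    walk(nxt, path, seen | {nxt})
                    path.pop()

        walk(start, [], {start})
        return found


graph = MemoryGraph()
# Offline knowledge: one past visit and the candidate item Basil.
graph.add("user", "has_memory", "m_visit")
graph.add("m_visit", "about_item", "visited_item")
graph.add("visited_item", "has_value", "affordable")
graph.add("Basil", "has_value", "affordable")
graph.add("Basil", "has_value", "Thai")
graph.add("Thai", "belongs_to", "food")
graph.add("affordable", "belongs_to", "price")
# Online updates from the dialog in Table 4.
graph.add("user", "has_memory", "m_dialog")
graph.add("m_dialog", "pos_on", "Thai")        # "I like Thai food."
graph.add("m_dialog", "pos_on", "affordable")  # "yes."

explanations = graph.paths("user", "Basil")
```

In this toy graph, `explanations` holds two paths through the current dialog (via Thai and via affordable) and one path through the visited item, which mirrors the three kinds of explanation described above.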

| Role | Utterance |
|---|---|
| Agent | what kinds of food do you like ? |
| User | I like Thai food. |
| Agent | are you still interested in affordable restaurant ? |
| User | yes. |
| Agent | how about Basil, which is affordable and serves Thai food. |

Table 4: An example dialog corresponding to the graph updates in Figure 3.

5 User Memory Graph Reasoner

In this section, we propose a model called User Memory Graph Reasoner (UMGR), which uses the user memory graph to reason about dialog policy (Figure 4). As discussed in the introduction, we aim to resolve the issue of open space policy in conversational recommendation. We define the inputs and outputs as follows, mapping certain entities from the user memory graph to the policy space.

Input: (1) the past dialog acts up to the current turn from the user; (2) the updated user memory graph.

Output: the dialog policy for the current turn, which consists of a dialog act together with entities drawn from the spaces of candidate items, slots, and values.

Note that the item, slot, and value parts of the policy can be interpreted as the arguments of dialog acts and are essentially rankings over their corresponding entity sets. For example, when the dialog act is Recommendation, the top-1 item will be provided to the user. Similarly, Open Question is related to the top-1 slot and Yes/no Question is related to the top-1 value. As such, the policy space of UMGR is determined by the user memory graph, so only valid entities can be generated. A structure-preserving model is preferred for reasoning, where all entities in the policy are generated in one holistic reasoning process.
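The mapping from a predicted dialog act to its top-1 argument can be sketched as follows. This is a simplified illustration rather than the model's actual decoding code: `decode_policy`, `ARGUMENT_TYPE`, and the score dictionaries are hypothetical names, and acts other than the three listed are treated as taking no argument.

```python
# Which entity type each agent dialog act takes as its argument.
ARGUMENT_TYPE = {
    "Recommendation": "item",
    "Open Question": "slot",
    "Yes/no Question": "value",
}


def decode_policy(act_scores, entity_scores, valid_entities):
    """Pick the best act, then the top-1 valid entity of the type it needs.

    act_scores: {act: score}
    entity_scores: {entity_type: {entity: score}}
    valid_entities: {entity_type: set of entities in the memory graph}
    """
    act = max(act_scores, key=act_scores.get)
    entity_type = ARGUMENT_TYPE.get(act)
    if entity_type is None:  # e.g. Greeting or Thanks take no argument here
        return act, None
    scores = entity_scores[entity_type]
    # Only entities present in the user memory graph may be generated.
    ranked = sorted(
        (e for e in scores if e in valid_entities[entity_type]),
        key=lambda e: scores[e],
        reverse=True,
    )
    return act, (ranked[0] if ranked else None)
```

Because the candidate set is read from `valid_entities` at call time, the same function works for any dialog without a fixed output vocabulary, which is the point of the open space policy.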

Figure 4: Overview of the User Memory Graph Reasoner (UMGR) architecture.

UMGR first encodes past dialog acts and entities into hidden representations.


Both dialog acts and entities are embedded by embedding layers, and the past dialog acts are further encoded by an LSTM encoder. Then we incorporate a Relational Graph Convolutional Network (R-GCN) Schlichtkrull et al. (2018) into UMGR for reasoning. R-GCN is a GCN Kipf and Welling (2016) with typed relations, where each relation is associated with its own weights to enable reasoning over a heterogeneous graph. Each entity is encoded by multiple layers of R-GCN as follows:


Here, each entity has a hidden state at every R-GCN layer, its neighbors are grouped by relation type, and each relation type has its own weight in each layer to transform a neighbor’s hidden state. The R-GCN layer updates the hidden state of each entity with the incoming messages, in the form of its neighbors’ hidden states, type by type. Then R-GCN sums over all types before passing the result through the GELU activation Hendrycks and Gimpel (2016). The hidden state of each entity in the next layer is computed via a residual connection He et al. (2016) (to keep the original entity information instead of just the neighbors’ information) and layer normalization.
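The layer update described in this paragraph can be written out as a sketch. The notation below is an assumption introduced for readability, since the original symbols are not recoverable here: $h_i^{(l)}$ is the hidden state of entity $i$ at layer $l$, $\mathcal{N}_i^{r}$ is its set of neighbors under relation type $r$, $\mathcal{R}$ is the set of relation types, and $W_r^{(l)}$ is the per-relation weight. Any normalization constant or self-loop term from the original R-GCN formulation is left out because the text does not mention one.

```latex
\tilde{h}_i^{(l)} = \mathrm{GELU}\Big( \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^{r}} W_r^{(l)} \, h_j^{(l)} \Big),
\qquad
h_i^{(l+1)} = \mathrm{LayerNorm}\big( h_i^{(l)} + \tilde{h}_i^{(l)} \big)
```

The first expression is the type-by-type message passing followed by the activation; the second adds the residual connection and layer normalization.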


The hidden states from the last layer of R-GCN are passed into an aggregation layer.


The aggregation layer has its own weights. The purpose of having an aggregation layer is to leverage the information in the user memory graph for predicting the dialog acts. The loss for dialog acts is defined as


where the hidden states of the dialog acts and of the graph are merged by concatenation and passed to a multi-layer perceptron for dialog acts, and the target is the label of the dialog act. Further, all item, slot and value entities are trained by log loss for ranking. For example, the loss for candidate items is computed as



where a separate multi-layer perceptron is used for items. Similarly, we obtain losses for slot entities and value entities, respectively. The total loss is the sum over all losses for dialog acts, items, slots and values:


where the weights are hyper-parameters that balance losses of different scales. Note that during training and prediction, all invalid entities (e.g. those that do not appear in a user memory graph) are masked out. As we can see, unlike traditional recommender systems, UMGR makes no assumption about the users/items in the training set and provides the capability of zero-shot reasoning. The policy space is open-ended because the entities in a policy are determined by the rankings of entities in the user memory graph instead of a pre-defined set for the model.
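A minimal sketch of the masking and the weighted total loss, in plain Python, may help here. It assumes that the ranking log loss is a softmax cross-entropy over the valid entities only, which is one plausible reading of the text rather than a confirmed detail; `masked_log_loss` and `total_loss` are hypothetical names.

```python
import math


def masked_log_loss(scores, valid, target):
    """Softmax cross-entropy restricted to entities in the memory graph.

    scores: {entity: raw score}; valid: set of entities in the graph;
    target: the annotated entity, which must be valid.
    """
    assert target in valid
    logits = [scores[e] for e in valid]
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - scores[target]


def total_loss(losses, weights):
    """Weighted sum of the per-head losses (dialog act, item, slot, value)."""
    return sum(weights[name] * losses[name] for name in losses)
```

Entities outside `valid` never enter the normalizer, so a high score on an entity that is absent from the current graph cannot affect the loss or the ranking.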

6 Experiments

This section conducts experiments on baselines for reasoning dialog policy.

6.1 Evaluation Metrics

We propose the following metrics to evaluate UMGR both offline (against the collected testing dialogs) and online (against a user simulator running on testing scenarios in MGConvRex).

6.1.1 Offline metrics

We propose the following offline metrics to evaluate UMGR. Note that all offline metrics assume UMGR uses annotations (ground-truth) of past turns (e.g. on constructing a user memory graph). Act Accuracy & F1 are reported for all predicted dialog acts against annotated turn acts in testing.

Entity Matching Rate (EMR@k, k = 1, 3, 5) measures turn-level top-k entities against the testing set. These metrics are evaluated only on correctly predicted dialog acts, since the types of predicted entities (items, slots, or values) depend on the predicted dialog acts.
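One way to compute such a metric is sketched below. The exact normalization used in the paper is not stated here, so this version makes two assumptions: a turn counts only when its dialog act was predicted correctly and it has annotated entities, and a hit means that any annotated entity appears in the top k. The turn dictionary layout is hypothetical.

```python
def entity_matching_rate(turns, k):
    """EMR@k over turns whose dialog act was predicted correctly.

    Each turn is a dict with keys: pred_act, gold_act,
    ranked_entities (best first), gold_entities (a set).
    """
    hits = 0
    total = 0
    for turn in turns:
        if turn["pred_act"] != turn["gold_act"]:
            continue  # the entity type depends on the act, so skip wrong acts
        if not turn["gold_entities"]:
            continue  # acts without entity arguments, e.g. Thanks
        total += 1
        top_k = turn["ranked_entities"][:k]
        hits += any(entity in turn["gold_entities"] for entity in top_k)
    return hits / total if total else 0.0
```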

Item Matching Rate (IMR) measures dialog-level predicted items against the ground-truth items.

6.1.2 Online metrics

In addition to offline evaluation, we use a user simulator (see Appendix) to dynamically evaluate the performance of recommendation. This mitigates the assumption in offline metrics that all past turns are correct, which limits the interactive evaluation of conversations.

Success Rate tracks whether the interaction with the user simulator yields the ground-truth item. We use the same testing-set scenarios as in the offline evaluation. Each simulated dialog is capped at 11 turns. We run the simulations 3 times and average the results.

| Methods | Act Acc. (offline) | Act F1 (offline) | EMR@1 (offline) | EMR@3 (offline) | EMR@5 (offline) | IMR (offline) | Success Rate (online) |
|---|---|---|---|---|---|---|---|
| RandomAgent | 18.17 | 18.24 | 1.5 | 1.5 | 1.5 | 6.55 | 6.0 |
| RecAgent | 25.89 | 6.86 | 2.7 | 2.7 | 2.7 | 39.16 | 39.21 |
| Pretrained Emb. | 64.3 | 54.79 | 13.75 | 29.02 | 36.7 | 9.97 | 9.73 |
| MemoryNetwork | 59.46 | 53.78 | 13.85 | 29.46 | 35.82 | 4.73 | 6.31 |
| UMGR (Proposed) | 65.7 | 56.54 | 33.92 | 48.47 | 52.54 | 67.93 | 71.03 |
| - Prev. User Act Only | 63.47 | 54.64 | 33.66 | 46.69 | 50.59 | 69.71 | 69.76 |
| - No Dialog Acts | 42.37 | 32.72 | 31.52 | 43.66 | 46.89 | 67.6 | 66.1 |
| - Static | 64.31 | 55.25 | 18.03 | 36.9 | 45.31 | 27.5 | 37.26 |
Table 5: Results of both offline and online evaluation: EMR stands for entity matching rate, which compares all types of predicted entities against annotated ones when the dialog act is predicted correctly; IMR stands for item matching rate, which evaluates predicted items against the ground-truth item across all turns in a dialog.
Figure 5: Visualization of item-level conversational reasoning, given an example dialog. Darker color indicates more salient items for recommendation at each given turn (row), predicted by our UMGR model.

6.2 Compared Methods

RandomAgent: we implement a baseline agent that randomly picks a dialog act and randomly picks a candidate item/slot/value as the dialog policy.

RecAgent: this agent always chooses Recommendation as the dialog act and selects a random item from the candidate items that has not been tried yet (it memorizes all trials). This is a strong (yet annoying) rule-based baseline that does not collect or use any user preference.
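The RecAgent rule is simple enough to sketch directly. The class below is a hypothetical rendering: the `act` method name, the `(act, item)` return shape, and the fallback to Thanks once every candidate has been tried are assumptions made for the example.

```python
import random


class RecAgent:
    """Always recommends a random candidate item it has not tried yet."""

    def __init__(self, candidates, rng=None):
        self.untried = list(candidates)  # remembers which items are left
        self.rng = rng or random.Random()

    def act(self):
        if not self.untried:
            # Assumption: nothing left to recommend, so close the dialog.
            return ("Thanks", None)
        index = self.rng.randrange(len(self.untried))
        return ("Recommendation", self.untried.pop(index))
```

Because the agent never repeats an item, its chance of reaching the ground-truth item depends only on the number of candidates and the turn limit, which helps explain why it is a strong baseline despite ignoring user preferences.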

Memory Network Sukhbaatar et al. (2015); Bordes et al. (2016): we adapt a memory network that encodes the user memory graph as triples. The memory can be updated as new triples are added. Note that memory networks cannot deal with open space policy because of the attention-based aggregation of triple memories. As such, we enumerate all possible combinations of dialog acts and entities in the user memory as the policy space. Specifically, all items in a scenario are indexed to differentiate candidate items for policy generation. The inputs of the memory network are the encoded dialog acts (the same as UMGR). We adopt 5 hops for memory networks.

Pretrained Embeddings: we pre-train the graph embeddings and utilize these as graph encoder for predicting dialog policy (without R-GCN layers in UMGR). The graph embeddings are trained from all scenarios in the training set using the TransE-based graph prediction approaches Nickel et al. (2016). While this approach is widely used in the related literature and carries cross-scenario knowledge, we show that using pre-trained graph embedding alone is sub-optimal for a particular user’s scenario and the dialog policy needs to perform dynamic reasoning over the user memory graph.

UMGR (Proposed): this is the proposed model in Section 5. To enable zero-shot reasoning during inference, all items share the same embeddings and UMGR purely learns to leverage the graph structure for reasoning about the policy. We adopt 5 layers of R-GCN, and all hidden states have size 384. The maximum number of past acts is set to 10. The loss weights are set to 1, 10, 10, and 100 based on the scales of the losses. We choose a batch size of 160. We further conduct the following ablation studies on UMGR:

- Prev. User Act Only: this ablation study only uses the most recent dialog act from the user. We use this to show how many past dialog acts are needed for policy generation.

- No Dialog Acts: this study removes the dialog acts encoder, investigating the importance of the dialog acts for recommendation.

- Static: this study uses the initial user memory graph without any updates during the conversation. We use this study to demonstrate that dynamic updates of the user memory graph are crucial for reasoning about a better dialog policy.
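
The reported hyperparameters and the three ablations can be gathered into one configuration object, sketched below. All field and function names are hypothetical, the order in which the four weights map onto the individual losses is an assumption (the text gives only the values), and past acts are assumed to be `(speaker, act)` tuples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UMGRConfig:
    num_rgcn_layers: int = 5
    hidden_size: int = 384
    max_past_acts: int = 10
    batch_size: int = 160
    loss_weights: tuple = (1.0, 10.0, 10.0, 100.0)
    # Ablation switches (hypothetical names).
    prev_user_act_only: bool = False  # "Prev. User Act Only"
    use_dialog_acts: bool = True      # False gives "No Dialog Acts"
    static_graph: bool = False        # True gives "Static"


def total_loss(losses, config):
    """Weighted sum of the individual loss terms."""
    if len(losses) != len(config.loss_weights):
        raise ValueError("expected one loss value per weight")
    return sum(w * l for w, l in zip(config.loss_weights, losses))


def select_past_acts(past_acts, config):
    """Pick the dialog acts fed to the encoder under the active ablation."""
    if not config.use_dialog_acts:
        return []
    if config.prev_user_act_only:
        user_acts = [a for a in past_acts if a[0] == "user"]
        return user_acts[-1:]
    return past_acts[-config.max_past_acts:]
```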

6.3 Results and Discussion

The results are summarized in Table 5. Overall, the proposed UMGR architecture outperforms the other baselines in both offline and online evaluation. Ablations. We notice that dynamically updating the user memory graph with users’ new preferences is crucial for recommendation, as indicated by UMGR - Static, which forbids updating the user memory graph. Removing the previous dialog context degrades the performance as expected (UMGR - Prev. User Act Only), although the UMGR architecture still maintains competitive performance. Similarly, while UMGR - No Dialog Acts does not take past dialog acts as input, its results on non-act prediction metrics are relatively competitive. Both of these ablation studies indicate that the user memory graph contains enough information for the model to perform dialog reasoning.

UMGR vs. Memory Network. We notice that memory networks may not be suitable for complex reasoning over a user memory graph. This may be caused by the following reasons: (1) triples in memory are disconnected, which limits joint reasoning over multiple triples; (2) a memory network is not structure-preserving, which makes it difficult to align entities in triples with the output policy, for example when ranking items; (3) existing research using memory networks Bordes et al. (2016); Eric and Manning (2017); Madotto et al. (2018) assumes a static memory, which carries a great amount of knowledge from training to testing. Memory networks are therefore a poor fit for our zero-shot reasoning setting, where no user or item knowledge can be carried over to testing directly.

UMGR vs. Rule-based Agent. We notice that RecAgent is a good rule-based baseline regarding the performance of recommendation. One advantage of RecAgent is that it can easily remember the recommended items tried in previous turns. However, frequent acts of recommendation can be annoying to the user.

UMGR vs. Pre-trained Graph Embeddings. We confirm that static pre-trained graph embeddings provide general representations of memory graphs but have a limited capability for reasoning about a particular user’s scenario. This study indicates that UMGR is capable of personalized recommendation.

Discussion. We first examine the generated dialog acts. UMGR typically asks a few questions and then makes a few recommendations. We observe that UMGR may make more recommendations than the agent workers in MGConvRex do. This may be caused by frequent patterns of dialog acts in conversational recommendation: different types of non-recommendation acts are frequently followed by a recommendation act. As a result, a neural network prefers frequent patterns to diverse details of reasoning. We believe more diverse and detailed reasoning is an important direction for future improvement. Meanwhile, we argue that human reasoning performance is very limited given the vast number of candidate items in real-world recommendation. Learning the behavior from humans is just a beginning. We expect research on automatic reasoning over large-scale user knowledge in future work.

Visualization of Item-level Reasoning. Figure 5 shows an example dialog together with the prominence scores of candidate items for recommendation at each turn, as predicted by our model (a darker color indicates a more salient item for recommendation). At the beginning of the dialog, the prominence scores (and thus the ranking among the candidate items) are soft-initialized to reflect the user’s offline preferences, as indicated in the user memory graph. We can see that UMGR nearly predicts the ground-truth item from the start. As the dialog progresses and the system collects (or confirms) new user knowledge or a request (e.g. updated slots, opinions on recommended items, “Toronto,ON”, etc.), the proposed UMGR model dynamically updates the ranking of the relevant items, reflecting the online preferences. Overall, UMGR effectively incorporates both online and offline preferences through a structured user memory graph, allowing for natural interactions and accurate recommendations.

7 Conclusion

This paper proposes a novel problem of user memory graph reasoning for conversational recommendation. We expect to release a conversational recommendation dataset with a grounded user memory graph from the behaviors of real-world users. The proposed user memory graph has the benefits of accumulating knowledge for a user to reason dialog policy. We propose a baseline model called UMGR that performs reasoning over such a user memory graph in open space policy. UMGR is structure-preserving for policy generation and provides zero-shot reasoning capability for user memory graphs that have never been seen before. Experimental results demonstrate the effectiveness of UMGR over a wide spectrum of metrics.


We thank Rajen Subba, Alborz Geramifard, and Hao Zhou for insightful discussions. Thanks to Gerald Demeunynck for the discussion and improvement on the process of data collection.


  • Bordes et al. (2016) Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2016. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683.
  • Budzianowski et al. (2018) Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - a large-scale multi-domain wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

  • Chao and Lane (2019) Guan-Lin Chao and Ian Lane. 2019. Bert-dst: Scalable end-to-end dialogue state tracking with bidirectional encoder representations from transformer. In Annual Conference of the International Speech Communication Association (INTERSPEECH).
  • Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pages 3844–3852.
  • Duvenaud et al. (2015) David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. 2015. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232.
  • Eric et al. (2019) Mihail Eric, Rahul Goel, Shachi Paul, Adarsh Kumar, Abhishek Sethi, Peter Ku, Anuj Kumar Goyal, Sanchit Agarwal, Shuyang Gao, and Dilek Hakkani-Tur. 2019. Multiwoz 2.1: Multi-domain dialogue state corrections and state tracking baselines. arXiv preprint arXiv:1907.01669.
  • Eric and Manning (2017) Mihail Eric and Christopher D Manning. 2017. Key-value retrieval networks for task-oriented dialogue. arXiv preprint arXiv:1705.05414.
  • Gao et al. (2019) Shuyang Gao, Abhishek Sethi, Sanchit Agarwal, Tagyoung Chung, and Dilek Hakkani-Tur. 2019. Dialog state tracking: A neural reading comprehension approach. In Special Interest Group on Discourse and Dialogue (SIGDIAL).
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
  • Henderson et al. (2014) Matthew Henderson, Blaise Thomson, and Jason D Williams. 2014. The second dialog state tracking challenge. In Special Interest Group on Discourse and Dialogue (SIGDIAL).
  • Hendrycks and Gimpel (2016) Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.
  • Hu and Liu (2004) Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168–177.
  • Kang et al. (2019) Dongyeop Kang, Anusha Balakrishnan, Pararth Shah, Paul A Crook, Y-Lan Boureau, and Jason Weston. 2019. Recommendation as a communication game: Self-supervised bot-play for goal-oriented dialogue. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1951–1961.
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
  • Li et al. (2018a) Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin, and Chris Pal. 2018a. Towards deep conversational recommendations. In Advances in Neural Information Processing Systems, pages 9725–9735.
  • Li et al. (2016) Xiujun Li, Zachary C Lipton, Bhuwan Dhingra, Lihong Li, Jianfeng Gao, and Yun-Nung Chen. 2016. A user simulator for task-completion dialogues. arXiv preprint arXiv:1612.05688.
  • Li et al. (2018b) Xiujun Li, Sarah Panda, Jingjing Liu, and Jianfeng Gao. 2018b. Microsoft dialogue challenge: Building end-to-end task-completion dialogue systems. arXiv preprint arXiv:1807.11125.
  • Li et al. (2015) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2015. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493.
  • Madotto et al. (2018) Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2018. Mem2seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1468–1478.
  • Misu et al. (2010) Teruhisa Misu, Komei Sugiura, Kiyonori Ohtake, Chiori Hori, Hideki Kashioka, Hisashi Kawai, and Satoshi Nakamura. 2010. Modeling spoken decision making dialogue and optimization of its dialogue strategy. In Proceedings of the SIGDIAL 2010 Conference, pages 221–224, Tokyo, Japan. Association for Computational Linguistics.
  • Moon et al. (2019a) Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba. 2019a. Opendialkg: Explainable conversational reasoning with attention-based walks over knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 845–854.
  • Moon et al. (2019b) Seungwhan Moon, Pararth Shah, Rajen Subba, and Anuj Kumar. 2019b. Memory grounded conversational reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 145–150.
  • Nickel et al. (2016) Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio. 2016. Holographic embeddings of knowledge graphs. AAAI.
  • Pham et al. (2017) Trang Pham, Truyen Tran, Dinh Phung, and Svetha Venkatesh. 2017. Column networks for collective classification. In Thirty-First AAAI Conference on Artificial Intelligence.

  • Rastogi et al. (2019) Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2019. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. In Association for the Advancement of Artificial Intelligence (AAAI).
  • Scarselli et al. (2008) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80.
  • Schlichtkrull et al. (2018) Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer.
  • Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in neural information processing systems, pages 2440–2448.
  • Sun and Zhang (2018) Yueming Sun and Yi Zhang. 2018. Conversational recommender system. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 235–244. ACM.
  • Tuan et al. (2019) Yi-Lin Tuan, Yun-Nung Chen, and Hung-yi Lee. 2019. DyKgChat: Benchmarking dialogue generation grounding on dynamic knowledge graphs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1855–1865, Hong Kong, China. Association for Computational Linguistics.
  • Wen et al. (2016) Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2016. A network-based end-to-end trainable task-oriented dialogue system. In European Chapter of the Association for Computational Linguistics (EACL).
  • Wu et al. (2019) Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019. Transferable multi-domain state generator for task-oriented dialogue systems. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Xian et al. (2019) Yikun Xian, Zuohui Fu, S. Muthukrishnan, Gerard de Melo, and Yongfeng Zhang. 2019. Reinforcement knowledge graph reasoning for explainable recommendation. In SIGIR.
  • Zhang et al. (2018) Yongfeng Zhang, Xu Chen, Qingyao Ai, Liu Yang, and W Bruce Croft. 2018. Towards conversational search and recommendation: System ask, user respond. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 177–186. ACM.
  • Zhou et al. (2020) Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2020. The design and implementation of xiaoice, an empathetic social chatbot. Computational Linguistics, 46(1):53–93.

Appendix A Appendix

This appendix contains the two guidelines used for building the MGConvRex dataset, a transcription guideline and an annotation guideline, followed by the statistics of the dataset and a sample implementation of a user simulator.

A.1 Transcription Guideline

A.1.1 Motivation

Getting irrelevant restaurant recommendations is a frustrating experience. The ideal recommendation system should be able to provide better recommendations by understanding your current needs, your restaurant preferences, and your restaurant history.

A.1.2 Overview

In this project, you will generate a dialog between an imaginary person (user) and an imaginary recommendation system (assistant; we refer to the agent as “assistant” in these guidelines). You will play one of the two roles, which will be randomly assigned to you. You will automatically be paired with someone else who will play the other role.

User: A user is expected to interact with an assistant to get a restaurant recommendation. The user will already know his/her general restaurant preferences and also the exact name of the restaurant he/she wants to go to. Further, information about restaurants that the user has visited in the past will be available and shown to the user.

Assistant: An assistant is expected to interact with the user and work towards recommending a restaurant the user wants to go to in the future. The assistant will have access to information about restaurants that the user has previously visited and a list of candidate restaurants.

A.1.3 Task

You will be randomly assigned a single role: either user or assistant. You will see your assignment in the top left corner of the screen, “You are: the user” or “You are: the assistant”.

User: You will interact with the assistant, to get the correct restaurant recommendation from the assistant. You will be provided with the following information:

  • Restaurant preference over 10 characteristics (or slots).

  • The restaurant you will go to: “Ground-Truth restaurant”.

  • You will optionally have information about restaurants that you have visited in the past.

As a user player, you are expected to:

  • Answer the questions the assistant asks about your preference.

  • Reject incorrect restaurant recommendations.

  • Ask questions about the recommended restaurant to justify why you accept or reject the recommendation.

  • If needed, use the information in your visited restaurant to help inform the assistant about your preference.

  • The frequency of characteristics (or slots) shared by multiple restaurants is indicated in (…), e.g. “(3) parking lot” means this user has been to 3 restaurants with parking lots.

  • When you use information from your visited restaurants in one of your responses, make sure to click the “Use Fact” button.

Figure 6: Screenshot of transcription UI for User.
Figure 7: Screenshot of transcription UI for Assistant.

Assistant: You will interact with the User, to give the desired recommendation (ground-truth restaurant) to the user. You will be provided with the following information:

  • Name of the user.

  • A list of candidate restaurants, and their characteristics (slots). One of the restaurants in this list is the desired or ground-truth restaurant the user is looking for.

  • Optionally, the characteristics (slots) and values of the restaurants the user has visited (visited restaurants). (See the definitions of slots below.) The frequency of slots shared by multiple restaurants is indicated in (…), e.g. “(3) parking lot” means this user has been to 3 restaurants with parking lots. The visited restaurants section may or may not be given to you. If it is given, your goal is to utilize (by clicking “Use Fact”) the information from visited restaurants as much as possible to provide the desired recommendation to the user.
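
The “(3) parking lot” style counts shown in the UI can be derived from the visited restaurants with a simple tally. The sketch below assumes each visited restaurant is a dictionary mapping a slot name to a list of values; that representation and both function names are illustrative assumptions, not the format used by the actual UI.

```python
from collections import Counter


def slot_value_frequencies(visited_restaurants):
    """Count how many visited restaurants carry each (slot, value) pair."""
    counts = Counter()
    for restaurant in visited_restaurants:
        for slot, values in restaurant.items():
            # set() guards against a value listed twice for the same restaurant.
            for value in set(values):
                counts[(slot, value)] += 1
    return counts


def format_frequency(count, label):
    """Render a count the way the guideline displays it, e.g. '(3) parking lot'."""
    return f"({count}) {label}"
```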

To make an efficient recommendation, you are expected to:

  • Ask the user questions about their restaurant preference.

  • If the visited restaurants are available, investigate their slots and values to reduce the number of questions you may need.

  • Recommend restaurants to the user based on your knowledge of their preference, their visited restaurants, the information of the candidate restaurants, and from the answers the user gives to your questions.

  • Intelligently apply the information the user gives to you to guide your conversation.

  • Recommend the desired restaurant.

A.1.4 Instructions

This section describes the details of transcription. In general, transcribers are required to follow the pre-defined dialog acts, slots and values, but are free to make up utterances based on this pre-defined metadata.

Dialog Acts are the intents of one utterance from a player. Note that the user and assistant have their own set of dialog acts, as shown in Table 1. You can only use these pre-defined dialog acts in your utterance.

Slots refer to 10 pre-defined characteristics of restaurants.

Values: each slot is associated with multiple values; for example, the slot Parking can take the value street. Note that a slot can take multiple values at the same time. In the UI, these values are separated by “”. For example, Parking = street garage means that a restaurant has both street and garage parking. DO NOT include “” in your responses; instead, use one or multiple values naturally in the utterance, e.g. “I prefer street or garage parking.” You do not have to write out all the values of a slot in one utterance. For example, the category slot usually has many values and you do not need to list them all in your utterance.

You will need to write the values exactly as you see them in the UI, including the underscores “_” and commas “,” and excluding “”. For example, type “Bonfyre_American_Grille” but not “Bonfyre American Grille”. The full lists of values and their slots are at the end of the guidelines.
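
A cleaning or checking script could enforce the underscore convention automatically. The helper below is a hypothetical sketch of that idea (the function names are not from the paper): it converts a name written with spaces into the underscore form and checks whether an utterance contains a value exactly as it appears in the UI.

```python
def to_ui_value(name):
    """Join whitespace-separated words with underscores, as values appear in the UI."""
    return "_".join(name.split())


def mentions_value(utterance, ui_value):
    """True if the utterance contains the value exactly in its UI form."""
    return ui_value in utterance
```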

Items and their Names: Each item (restaurant) has an item name and has multiple values and their associated slots. An item is typically associated with a recommendation act from the assistant side. When recommending a restaurant (item), you are expected to mention the restaurant name (item name), which follows the same rule as writing a value in an utterance.

A.1.5 Important Notes

During transcribing, it is important to keep these things in mind:

  • A dialog can end with either a user or an assistant response.

  • The person who plays the user, however, will be the one to terminate the session by pressing the button “Dialog is done!”

  • The user should NEVER give all of their preference to the assistant in a single utterance.

  • The user should NEVER give the ground-truth restaurant to the assistant.

  • When you use content from the visited restaurants in your response, make sure to click the corresponding “Use Fact” buttons before sending your response. The click will be recorded.

  • If the user player has sent more than 10 responses (20 including the responses from the assistant), it is up to the user player to decide whether to stop the current dialog or to continue.

The following actions should be avoided.

  • Do not engage in the transcribed dialog with the other person about the transcription task itself and do not go off-topic.

  • Do not share any of your personal information. Always be “in your character”, i.e., speak as the user or the assistant.

  • NO INDECENCY / DISRESPECT / HARASSMENT. Keep your messages decent and respectful towards the other person. Any violations will result in a ban on further tasks.

  • Do not directly copy any of the utterances from this guideline or UI.

  • Do not repeat or template your answers; that is to say, do not create one set of responses ahead of time and then make small changes to them over and over. Please always generate unique and new responses.

A.1.6 Feedback

After the transcription of one dialog is over, both sides need to give feedback about the transcribed dialog, including:

  1. Rate the dialog (1-5) based on the smoothness and coherence of the whole dialog and whether it closely follows this guideline.

  2. Rate the other side (1-5): whether the other side closely follows this guideline.

  3. (Optional) feedback about this transcription task.

A.2 Annotation Guideline

In this task, you will get a transcribed dialog between a user and an assistant, in which the assistant helps the user find the desired restaurant to go to. You will annotate the utterances with dialog acts, slots, values, item names and sentiment on values or item names. For your reference, the transcription guideline is detailed in Section A.1. This annotation task will be further supported by a QA process before and during the annotation to resolve hard cases.

A.2.1 Task

In this annotation task, you are required to label the following data:

  1. dialog quality: good or bad about the whole dialog.

  2. dialog acts (or utterance-level intents), as defined in Table 1.

  3. Label spans of values (or item name) from each utterance and their corresponding slots (or item).

  4. Utterance-level sentiment of each utterance, and optionally span-level sentiment towards a value (or item name) if it is different from the utterance-level sentiment.

You first need to read through the dialog once and label the overall dialog quality, and if it is good, label dialog acts. Then you need to read through the utterances again and label spans of values (or item name), their corresponding slots (item), and sentiment.

A.2.2 Dialog Quality

For the entire dialog, you will need to label the dialog quality as either good or bad. This step is to further ensure the quality of the transcribed dialog. If the dialog quality is labeled as bad, you can skip annotating the current dialog further.

A.2.3 Dialog Acts

Each utterance must have at least one dialog act. The dialog acts are pre-defined in Table 1. Note that there are different sets of dialog acts for the roles of user and assistant. If you believe one utterance is associated with multiple dialog acts, you need to label all of them. We summarize a few important tips for the user and the assistant separately as follows.

Dialog Acts for User: There are a few key differences among reply, answer, inform, open and yes/no question.

  • reply, open and yes/no question are always related to a (previously) recommended restaurant (from the assistant). The item name (restaurant name) may or may not show up in the to-be-labeled utterance.

  • answer and inform are always related to a value. Note that the value (span) may not show up in an answer (e.g. “Yes, I like that location.”)

  • open question DOES NOT have a value show up in the utterance but only the explicit or implicit slot (e.g. “what type of food do they serve ?” [category]), whereas yes/no question must have a value show up (e.g. “do they serve Italian food ?”).

  • inform, open and yes/no question indicate a user actively providing information, while reply and answer indicate a user passively giving information.

Dialog Acts for Assistant: The key differences among recommendation, open question, yes/no question and answer are as follows:

  • recommendation and answer are always related to a restaurant (item). A recommendation act may have additional values show up, besides the restaurant name. The restaurant typically may not show up in answer (e.g. “it serves italian food.”)

  • open and yes/no question are always only about slots. But the slot itself may not show up in the utterance directly (e.g. “what kinds of food do they serve ?”).

  • yes/no question always has a value show up: “do you like italian food ?”

Note that you always need to annotate the true intent of having an utterance, not the surface form of an utterance. For example, a recommendation can have a surface form that looks like a question (e.g., “how about burger_king ?” and “why not try burger_king ?”).
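
Several of the rules above are mechanical and could be checked by a script during QA. The following sketch covers only three of them (every utterance needs an act, a yes/no question needs a value span, and an utterance labeled solely as an open question should have none); the function name and the label strings are assumptions for illustration, not the tooling actually used.

```python
def check_question_annotation(dialog_acts, value_spans):
    """Return a list of rule violations for one annotated utterance."""
    problems = []
    if not dialog_acts:
        problems.append("utterance has no dialog act")
    if "yes/no question" in dialog_acts and not value_spans:
        problems.append("yes/no question without a value span")
    # Only flag open questions when they are the sole act: with multiple acts,
    # a value span may legitimately belong to another act such as inform.
    if dialog_acts == ["open question"] and value_spans:
        problems.append("open question with a value span")
    return problems
```

A check like this cannot judge the true intent behind a surface form, so it complements rather than replaces the human QA process.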

A.2.4 Spans of Values, their Slots, Items, and Sentiment

We expect you to label spans of words in the utterance that are values (of slots) or item names. All possible values that you can label are listed at the end of this guideline (we omit the list in this appendix for brevity). You are also required to label slots when the values do not appear in an utterance (e.g., an open question) by labeling the slots at the utterance level (similar to labeling a dialog act). Finally, label utterance-level sentiment (one of positive, negative and neutral) and span-level sentiment (if it differs from utterance-level sentiment).

We expect you to perform the following steps (after you finish labeling dialog acts):

  1. label spans of words as values (or item names).

  2. select the corresponding slot (or item).

  3. label utterance-level slot (open question).

  4. label utterance-level sentiment and check and label span-level sentiment.

A.2.5 Important Notes

  • Dialog utterances may have typos (e.g., extra spaces, wrong letter case); label the spans to the best of your ability, even if errors are present.

  • Do not label spans about the slots themselves (e.g., location, category, price, etc.), such as the words “where”, “located at”, “kinds of”, “price range”, “parking” etc. Labeled spans should only be about pre-defined values or item names (restaurant names).

  • Utterances from the assistant side DO NOT have sentiments.

  • Only user utterances that fall under the following dialog acts have sentiment: answer, inform, and reply.
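
The two sentiment rules above reduce to a single predicate. The sketch below is illustrative only; the function name and the lowercase label strings are assumptions.

```python
# Acts under which a user utterance may carry sentiment, per the notes above.
SENTIMENT_ACTS = {"answer", "inform", "reply"}


def may_carry_sentiment(role, dialog_acts):
    """True only for user utterances labeled with answer, inform, or reply."""
    return role == "user" and any(act in SENTIMENT_ACTS for act in dialog_acts)
```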

A.3 Datasets

A.3.1 Data Cleaning

After annotation, the data will go through a data cleaning process via scripts to fix typos and illegal combinations of dialog acts, items, slots, and values. The cleaned data will be integrated with scenarios of each transcription task to form the final datasets.

A.3.2 Statistics

Role | Utterance | Acts
User | hello. | Greeting
Assistant | Hello Will! Are you still living in the area. ? | YNQ
User | Yes. [pos_on] | ANS
Assistant | Ok great! Do you want a with your meal? | YNQ
User | No just are fine. | ANS
Assistant | My system shows restaurant. They also offer wifi. | REC
User | Is the food over there? I’m tight on budget. | YNQ, inform
Assistant | It is on the side of the restaurants. | ANS
User | Great. I’ll try them out. Thanks. | Reply & Thanks
Assistant | Thank you and enjoy your meal | Thanks
Table 6: An example dialog in MGConvRex with slots and sentiment polarities annotated.
Figure 8: Distribution of user dialog acts
Figure 9: Distribution of agent dialog acts
Figure 10: Distribution of slots asked in Open and Yes/no questions from the agent

Besides the statistics of MGConvRex in Table 2, we further study the distributions of dialog acts and slots to learn more about the preferred behavior of crowd workers. From Figure 8, we can see that a user worker mainly uses the Answer act in response to the agent. More importantly, the user player is very active and likes to use the Inform act without being asked a question. User players are sometimes even more cooperative, examining the user memory and informing more salient preferences, as indicated by the larger share of Inform acts in scenarios with history. From Figure 9, we can see that an agent worker uses both Open questions and Yes/no questions to collect preferences. Yes/no questions are more frequent in scenarios with history, where they confirm users’ preferences. Figure 10 shows the distribution of slots for Open and Yes/no questions asked by the agent player. category, location, and price are the slots most often used to collect user preferences and to distinguish different candidate items. One example dialog is shown in Table 6.

A.4 User Simulator

Our online evaluation is conducted against user simulators in a simulation environment within our framework. Here we briefly describe one simulator, given in Algorithm 1. Note that although our pre-defined dialog acts for the user can be either passive or active (as in Table 1), we mostly focus on a passive user (less likely to use Inform or Open / Yes/no questions, e.g. a 0.2 chance of making an Inform act) because an active user's behavior can vary from domain to domain and is hard to implement. In the implementation, the user mostly follows the dialog acts from the agent and provides information accordingly in a case-by-case fashion.

Input: …, …, …, … from the agent; … and … from scenario
Output: …, …, …, …, … from the user
def …:
    if … then
        …
    else
        …
    end if
    return …
…
switch … do
    case INIT do  // first turn
        …
    end case
    case REC do
        if … then
            …
        else
            …
        end if
        …
    end case
    case OQ or YNQ do
        …
    end case
    case ANS do
        if … then
            …
        else
            …
        end if
        …
    end case
    case THANKS do
        if … then
            …
        else
            …
        end if
        …
    end case
    …
end switch
Algorithm 1 Algorithm for User Simulator
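
To give a feel for the case-by-case behavior of a passive user, here is a minimal Python sketch. It follows only what the prose states (the user mostly mirrors the agent's act and has a small chance, 0.2 by default, of volunteering an Inform act) and the case labels INIT, REC, OQ, YNQ, ANS and THANKS from Algorithm 1. Everything else is an assumption made for the example: the function signature, the payload formats, the choice to attach the optional Inform to a rejected recommendation, and the behavior in the ANS and THANKS cases. It is not a reconstruction of the authors' algorithm.

```python
import random


def simulate_user_turn(agent_act, agent_payload, preferences, target_item,
                       rng, inform_prob=0.2):
    """Return the list of (user_act, payload) pairs answering one agent turn.

    preferences: dict mapping slot -> preferred value (from the scenario).
    target_item: the ground-truth item (from the scenario).
    """
    if agent_act == "INIT":  # first turn
        return [("GREETING", None)]
    if agent_act == "REC":
        if agent_payload == target_item:
            return [("REPLY", "accept")]
        acts = [("REPLY", "reject")]
        # Assumption: the passive user occasionally volunteers one preference.
        if preferences and rng.random() < inform_prob:
            slot = rng.choice(sorted(preferences))
            acts.append(("INFORM", (slot, preferences[slot])))
        return acts
    if agent_act == "OQ":  # payload is the slot being asked about
        return [("ANS", (agent_payload, preferences.get(agent_payload)))]
    if agent_act == "YNQ":  # payload is a (slot, value) pair to confirm
        slot, value = agent_payload
        return [("ANS", (slot, preferences.get(slot) == value))]
    if agent_act == "THANKS":
        return [("THANKS", None)]
    # ANS or any other act: acknowledge without adding information (assumption).
    return [("REPLY", None)]
```

Passing a seeded `random.Random` instance as `rng` keeps simulated dialogs reproducible across evaluation runs.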