Estimation-Action-Reflection: Towards Deep Interaction Between Conversational and Recommender Systems

02/21/2020 ∙ by Wenqiang Lei, et al. ∙ National University of Singapore, University of Virginia, Hefei University of Technology

Recommender systems are embracing conversational technologies to obtain user preferences dynamically, and to overcome inherent limitations of their static models. A successful Conversational Recommender System (CRS) requires proper handling of interactions between conversation and recommendation. We argue that three fundamental problems need to be solved: 1) what questions to ask regarding item attributes, 2) when to recommend items, and 3) how to adapt to the users' online feedback. To the best of our knowledge, no unified framework addresses these problems. In this work, we fill this gap by proposing a new CRS framework named Estimation-Action-Reflection, or EAR, which consists of three stages to better converse with users: (1) Estimation, which builds predictive models to estimate user preference on both items and item attributes; (2) Action, which learns a dialogue policy to determine whether to ask attributes or recommend items, based on the Estimation stage and conversation history; and (3) Reflection, which updates the recommender model when a user rejects the recommendations made by the Action stage. We present two conversation scenarios, binary questions on LastFM and enumerated questions on Yelp, and conduct extensive experiments on both datasets. Our experiments demonstrate significant improvements over the state-of-the-art method CRM [32], corresponding to fewer conversation turns and a higher level of recommendation hits.




1. Introduction

Recommender systems are emerging as an important means of facilitating users’ information seeking (MF; BPR; NCF; ACF). However, much prior work in the area relies solely on offline historical data to build the recommender model (henceforth, the static recommender system). This offline focus gives the recommender an inherent limitation: it is optimized for offline performance, which may not match online user behavior. User preference can be diverse and often drifts with time; as such, it is difficult to know the exact intent of a user when he uses a service, even when the training data is sufficient.

The rapid development of conversational techniques (nips18/DeepConv; Liao:2018; sigir18/chatmore; acl18/sequicity; jin2018explicit) brings an unprecedented opportunity that allows a recommender system to dynamically obtain user preferences through conversations with users. This possibility is envisioned as the conversational recommender system (CRS), for which the community has started to explore various settings. (zhang2018towards) built a conversational search engine by focusing on document representation. (nips18/DeepConv) developed a dialogue system to suggest movies for cold-start users, contributing to language understanding and generation for the purpose of recommendation, but did not model users’ interaction histories (e.g., clicks, ratings). In contrast, (christakopoulou2018q) does consider user click history in recommending, but their CRS only handles single-round recommendation. That is, their model considers a scenario in which the CRS session terminates after making a single recommendation, regardless of whether the recommendation is satisfactory or not. While this is a significant advance, we feel this scenario is unrealistic in actual deployments.

In particular, we believe CRS models should inherently adopt a multi-round setting: a CRS converses with a user to recommend items based on his click history (if any). At each round, the CRS is allowed to choose two types of actions — either explicitly asking whether a user likes a certain item attribute or recommending a list of items. In a session, the CRS may alternate between these actions multiple times, with the goal of finding desirable items while minimizing the number of interactions. This multi-round setting is more challenging than the single-round setting, as the CRS needs to strategically plan its actions. The key in performing such planning, from our perspective, lies in the interaction between the conversational component (CC; responsible for interacting with the user) and the recommender component (RC; responsible for estimating user preference – e.g., generating the recommendation list). We summarize three fundamental problems toward the deep interaction between CC and RC as follows:


  • What attributes to ask? A CRS needs to choose which attribute to ask the user about. For example, in music recommendation, it may ask “Would you like to listen to classical music?”, expecting a binary yes/no response (see Footnote 1). If the answer is “yes”, it can focus on items containing the attribute, benefiting the RC by reducing uncertainty in item ranking. However, if the answer is “no”, the CRS expends a conversation turn with less gain to the RC. To achieve the goal of hitting the right items in fewer turns, the CC must consider whether the user will like the asked attribute. This is exactly the job of the RC, which scrutinizes the user’s historical behavior. (Footnote 1: Note that it is possible to compose questions eliciting an enumerated response; i.e., “Which music genre would you consider? I have pop, funk …”. However, this is a design choice depending on the domain requirements. In describing our method, we consider the basic single-attribute case. However, in experiments, we also justify the effectiveness of EAR in asking such enumerated questions on Yelp. For the purpose of exposition, we have chosen to avoid open questions that do not constrain user response for now. Even interpreting user responses to such questions is considered a challenging task (chen2018hierarchical).)

  • When to recommend items? With sufficient certainty, the CC should push the recommendations generated by the RC. A good time to push recommendations is when 1) the candidate space is small enough; 2) asking additional questions is determined to be less useful or helpful, from the perspective of either information gain or user patience; and 3) the RC is confident that the top recommendations will be accepted by the user. Determining the appropriate timing should take both the conversation history of the CC and the preference estimation of the RC into account.

  • How to adapt to users’ online feedback? After each turn, the user gives feedback; i.e., “yes”/“no” to a queried attribute, or “accept”/“reject” to the recommended items. (1) For “yes” on the attribute, both user profile and item candidates need to be updated to generate better recommendations; this requires the offline RC training to take such updates into account. (2) For “no”, the CC needs to adjust its strategy accordingly. (3) If the recommended items are rejected, the RC model needs to be updated to incorporate such a negative signal. Although each adjustment seems to impact only the RC or only the CC, we show that such actions impact both.

Towards the deep interaction between CC and RC, we propose a new solution named Estimation–Action–Reflection (EAR), which consists of three stages. Note that the stages do not necessarily align with each of the above problems. (a) Estimation, which builds predictive models offline to estimate user preference on items and item attributes. Specifically, we train a factorization machine (rendle2010factorization) (FM) using user profiles and item attributes as input features. Our Estimation stage incorporates two novel advances: 1) the joint optimization of FM on the two tasks of item prediction and attribute prediction, and 2) the adaptive training of conversation data with online user feedback on attributes. (b) Action, which learns the conversational strategy that determines whether to ask or recommend, and what attribute to ask. We train a policy network with reinforcement learning, optimizing the reward of shorter turns and successful recommendations based on the FM’s estimation of user preferred items and attributes, and the dialogue history. (c) Reflection, which adapts the CRS with the user’s online feedback. Specifically, when a user rejects the recommended items, we construct new training triplets by treating the items as negative instances and update the FM in an online manner. In summary, the main contributions of this work are as follows:


  • We comprehensively consider a multi-round CRS scenario that is more realistic than previous work, highlighting the importance of researching into the interactions between the RC and CC to build an effective CRS.

  • We propose a three-stage solution, EAR, integrating and revising several RC and CC techniques to construct a solution that works well for conversational recommendation.

  • We build two CRS datasets by simulating user conversations to make the task suitable for offline academic research. We show our method outperforms several state-of-the-art CRS methods and provide insight on the task.

2. Multi-round Conversational Recommendation Scenario

Figure 1. The workflow of our multi-round conversational recommendation scenario. The system may recommend items multiple times, and the conversation ends only if the user accepts the recommendation or chooses to quit.

Following (christakopoulou2018q), we denote one trial of recommendation as a round. This paper considers conversational recommendation as an inherently multi-round scenario, where a CRS interacts with the user by asking attributes and recommending items multiple times until the task succeeds or the user leaves. To distinguish the two, we term the setting single-round where the CRS only makes recommendations once, ending the session regardless of the outcome, as in (Sun:2018:CRS:3209978.3210002; christakopoulou2018q).

We now introduce the notation used to formalize our setting. Let $u$ denote a user from the user set $\mathcal{U}$ and $v$ denote an item from the item set $\mathcal{V}$. Each item $v$ is associated with a set of attributes $\mathcal{P}_v$ which describe its properties, such as music genre “classical” or “jazz” for songs in LastFM, or tags such as “nightlife”, “serving burgers”, or “serving wines” for businesses in Yelp. We denote the set of all attributes as $\mathcal{P}$ and use $p$ to denote a specific attribute. Following (Sun:2018:CRS:3209978.3210002; zhang2018towards), a CRS session starts with $u$’s specification of a preferred attribute $p_0$; the CRS then narrows the candidate items to those that contain $p_0$. Then in each turn $t$ ($t = 1, 2, \ldots, T$; $T$ denotes the last turn of the session), the CRS needs to choose an action: recommend or ask:


  • If the action is recommend, we denote the recommended item list as $\mathcal{V}_t$ and the action as $a_{rec}$. Then the user examines whether $\mathcal{V}_t$ contains his desired item. If the feedback is positive, this session succeeds and can be terminated. Otherwise, we mark $\mathcal{V}_t$ as rejected and move to the next round.

  • If the action is ask (where the asked attribute is denoted as $p_t$ and the action as $a_{ask}(p_t)$), the user states whether he prefers items that contain the attribute $p_t$ or not. If the feedback is positive, we add $p_t$ into $\mathcal{P}_u$, which denotes the preferred attributes of the user in the current session. Otherwise, we mark $p_t$ as rejected; regardless of rejection or not, we move to the next turn.

This whole process naturally forms an interaction loop (Figure 1) where the CRS may ask zero to many questions before making recommendations. A session terminates if a user accepts the recommendations or leaves due to impatience. We set the main goal of the CRS as making desired recommendations within as few rounds as possible.
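The interaction loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: the function name `run_session`, the dictionary layout of `user`, and the `toy_policy` callable are all hypothetical, and the 15-turn cap is borrowed from the simulator settings in Section 4.1.2.

```python
def run_session(policy, user, max_turns=15):
    """Run one multi-round session and return (success, turns_used)."""
    preferred, rejected_attrs, rejected_items = set(), set(), set()
    for turn in range(1, max_turns + 1):
        # The policy returns ("recommend", item_list) or ("ask", attribute).
        action, payload = policy(preferred, rejected_attrs, rejected_items)
        if action == "recommend":
            if user["target"] in payload:
                return True, turn            # user accepts, session succeeds
            rejected_items.update(payload)   # rejected list, keep going
        else:
            if payload in user["attributes"]:
                preferred.add(payload)       # positive feedback on attribute
            else:
                rejected_attrs.add(payload)  # negative feedback on attribute
    return False, max_turns                  # user leaves out of impatience


def toy_policy(preferred, rejected_attrs, rejected_items):
    """Hypothetical policy: ask about 'jazz' once, then always recommend."""
    if not preferred and "jazz" not in rejected_attrs:
        return "ask", "jazz"
    return "recommend", ["v1", "v3"]


result = run_session(toy_policy, {"target": "v3", "attributes": {"jazz"}})
```

Here the toy session asks one question and then succeeds on the second turn; a real policy would be the learned network of the Action stage.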

3. Proposed Methods

EAR consists of a recommendation component and a conversation component (RC and CC) which interact intensively in the three-stage conversational process. The system starts working at the estimation stage where the RC ranks candidate items and item attributes for the user, so as to support the action decision of the CC. After the estimation stage, the system moves to the action stage where the CC decides whether to choose an attribute to ask, or make a recommendation according to the ranked candidates and attributes, and the dialogue history. If the user likes the attribute asked by the CC, the CC feeds this attribute back to the RC to make a new estimation; otherwise, the system stays at the action stage: it updates the dialogue history and chooses another action. Once a recommendation is rejected by a user, the CC sends the rejected items back to the RC, triggering the reflection stage where the RC adjusts its estimations. After that, the system enters the estimation stage again.

3.1. Estimation

As discussed before, the multi-round conversational scenario brings new challenges to the traditional RC. Specifically, the CC interacts with a user and accumulates evidence on his preferred attributes, denoted as $\mathcal{P}_u$ (Footnote 2: We detail how to obtain such data in the experiments, Section 4.1.2.). Importantly, different from traditional recommendation methods (BPR; NCF), the RC here needs to make full use of $\mathcal{P}_u$, aiming to accurately predict both the user’s preferred items and preferred attributes. These two goals exert positive influence on EAR, where the first directly contributes to the success rate of recommendation, and the second guides the CC to choose better attributes to ask users so as to shorten the conversation. In the following, we first introduce the basic form of the recommendation method, followed by detail on how we adapt our proposed method to achieve both goals simultaneously.

3.1.1. Basic Recommendation Method

We choose the factorization machine (FM) (rendle2010factorization) as our predictive model due to its success and wide usage in recommendation tasks. However, FM considers all pairwise interactions between input features, which is costly and may introduce undesired interactions that negatively affect our two goals. Thus, we only keep the interactions that are useful to our task and remove the others. Given user $u$, his preferred attributes in the conversation $\mathcal{P}_u$, and the target item $v$, we predict how likely $u$ will like $v$ in the conversation session as:

$$\hat{y}(u, v, \mathcal{P}_u) = \mathbf{u}^{T}\mathbf{v} + \sum_{p_i \in \mathcal{P}_u} \mathbf{v}^{T}\mathbf{p}_i \qquad (1)$$

where $\mathbf{u}$ and $\mathbf{v}$ denote the embedding for user $u$ and item $v$, respectively, and $\mathbf{p}_i$ denotes the embedding for attribute $p_i \in \mathcal{P}_u$. Bias terms are omitted for clarity. The first term models the general interest of the user on the target item, a common term in FM models (NCF). The second term models the affinity between the target item and user preferred attributes. We have also tried to include $v$’s attributes into FM, but found it brings no benefits. One possible reason is that the item embedding $\mathbf{v}$ may have already encoded its attribute information. Thus we also omit it.
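As a small worked example of the two terms just described (general user-item interest plus item-attribute affinity), the score can be computed as below. This is a sketch under stated assumptions: embeddings are plain Python lists, bias terms are omitted as in the text, and the names `dot` and `item_score` are ours rather than the authors'.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def item_score(user_emb, item_emb, pref_attr_embs):
    """User-item interest plus item affinity to each preferred attribute."""
    general_interest = dot(user_emb, item_emb)
    attribute_affinity = sum(dot(item_emb, p) for p in pref_attr_embs)
    return general_interest + attribute_affinity


# Toy 2-d embeddings: u.v = 2, v.p1 = 3, v.p2 = 5, so the score is 10.
score = item_score([1.0, 0.0], [2.0, 3.0], [[0.0, 1.0], [1.0, 1.0]])
```

With no preferred attributes the score reduces to the plain user-item inner product, which is the static-recommender case.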

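To train the FM, we optimize the pairwise Bayesian Personalized Ranking (BPR) (BPR) objective. Specifically, given a user $u$, it assumes the interacted items (e.g., visited restaurants, listened music) should be assigned higher scores than those not interacted with. The loss function of traditional BPR is:

$$L_{bpr} = \sum_{(u, v, v') \in \mathcal{D}_1} -\ln \sigma\big(\hat{y}(u, v, \mathcal{P}_u) - \hat{y}(u, v', \mathcal{P}_u)\big) + \lambda_{\Theta}\lVert\Theta\rVert^2 \qquad (2)$$

where $\mathcal{D}_1$ is the set of pairwise instances for BPR training, $\mathcal{D}_1 := \{(u, v, v') \mid v' \in \mathcal{V}_u^{-}\}$, where $v$ is the interacted item of the conversation session (i.e., the ground truth item of the session), $\mathcal{V}_u^{-} := \mathcal{V} \setminus \mathcal{V}_u^{+}$ denotes the set of non-interacted items of user $u$ and $\mathcal{V}_u^{+}$ denotes the items interacted by $u$. $\sigma$ is the sigmoid function, $\Theta$ denotes the model parameters, and $\lambda_{\Theta}$ is the regularization parameter to prevent overfitting.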
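The pairwise objective above can be illustrated with a short, dependency-free sketch. It assumes the standard BPR form (negative log-sigmoid of the score margin plus an L2 penalty); the function names and the flat `params` list are hypothetical, and the default `reg=0.001` simply mirrors the regularization strength reported in Section 4.1.3.

```python
import math


def neg_log_sigmoid(x):
    """Numerically stable -ln(sigmoid(x)), i.e. softplus(-x)."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def bpr_loss(score_pairs, params, reg=0.001):
    """BPR loss over (positive_score, negative_score) pairs plus L2 penalty."""
    ranking = sum(neg_log_sigmoid(pos - neg) for pos, neg in score_pairs)
    penalty = reg * sum(w * w for w in params)
    return ranking + penalty


tied = bpr_loss([(0.0, 0.0)], [], reg=0.0)        # equal scores give ln 2
well_ranked = bpr_loss([(2.0, 0.0)], [], reg=0.0)
mis_ranked = bpr_loss([(0.0, 2.0)], [], reg=0.0)
```

The loss shrinks as the interacted item outscores the non-interacted one, which is the ranking behavior the objective is meant to encourage.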

3.1.2. Attribute-aware BPR for Item Prediction.

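However, in our scenario, the emphasis of CRS is to rank the items that contain the user preferred attributes well. For example, if $u$ specifies “Mexican restaurant” as his preferred attribute, a good CRS needs to rank his preferred restaurants among all available Mexican restaurants. To capture this, we propose to sample two types of negative examples:

$$\mathcal{V}_u^{-} := \mathcal{V} \setminus \mathcal{V}_u^{+}, \qquad \hat{\mathcal{V}}_u^{-} := \mathcal{V}_{cand} \setminus \mathcal{V}_u^{+} \qquad (3)$$

where $\mathcal{V}_u^{-}$ is the same set of negative samples as in the traditional BPR setting, i.e., all non-interacted items of $u$. $\mathcal{V}_{cand}$ denotes the current candidate items satisfying the partially known preference $\mathcal{P}_u$ in the conversation, and $\hat{\mathcal{V}}_u^{-}$ is the subset of $\mathcal{V}_{cand}$ that excludes the observed items $\mathcal{V}_u^{+}$. The two types of pairwise training instances are defined as:

$$\mathcal{D}_1 := \{(u, v, v') \mid v' \in \mathcal{V}_u^{-}\}, \qquad \mathcal{D}_2 := \{(u, v, v') \mid v' \in \hat{\mathcal{V}}_u^{-}\} \qquad (4)$$

We then train the FM model by optimizing over both $\mathcal{D}_1$ and $\mathcal{D}_2$:

$$L_{item} = \sum_{(u, v, v') \in \mathcal{D}_1} -\ln \sigma\big(\hat{y}(u, v, \mathcal{P}_u) - \hat{y}(u, v', \mathcal{P}_u)\big) + \sum_{(u, v, v') \in \mathcal{D}_2} -\ln \sigma\big(\hat{y}(u, v, \mathcal{P}_u) - \hat{y}(u, v', \mathcal{P}_u)\big) + \lambda_{\Theta}\lVert\Theta\rVert^2 \qquad (5)$$

where the first loss learns $u$’s general preference, and the second loss learns $u$’s specific preference given the current candidates. It is worth noting that adding the second loss for training is critical for the model to rank well on the current candidates. This is very important for CRS since the candidate items dynamically change with user feedback along the conversation. However, the state-of-the-art method CRM (Sun:2018:CRS:3209978.3210002) does not account for this factor, and is thus insufficient in considering the interaction between the CC and RC.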
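The two kinds of negative examples can be sketched as follows. This is an illustrative sampler only: the function name, the tuple layout, and the choice of drawing one negative of each kind per call are our assumptions, not details given in the text.

```python
import random


def sample_training_triples(user, target, all_items, interacted, candidates, rng):
    """Draw one general negative and one candidate-restricted negative."""
    # Type 1: any item the user has not interacted with.
    general_negatives = sorted(set(all_items) - set(interacted))
    # Type 2: non-interacted items that still satisfy the known attributes.
    candidate_negatives = sorted(set(candidates) - set(interacted))
    d1 = (user, target, rng.choice(general_negatives))
    d2 = (user, target, rng.choice(candidate_negatives)) if candidate_negatives else None
    return d1, d2


rng = random.Random(0)
d1, d2 = sample_training_triples(
    user="u1",
    target="a",
    all_items=["a", "b", "c", "d", "e"],
    interacted={"a"},
    candidates={"a", "b", "c"},   # e.g. all Mexican restaurants
    rng=rng,
)
```

The second triple always pits the ground truth item against a competitor from the current candidate set, which is what pushes the model to rank well among, say, Mexican restaurants only.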

3.1.3. Attribute Preference Prediction.

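We formulate the task of the second goal of accurate attribute prediction separately. This prediction of attribute preference is mainly used in the CC to support the action on which attribute to ask (cf. Sec. 3.2). As such, we take $u$’s preferred attributes in the current session into account:

$$\hat{g}(p \mid u, \mathcal{P}_u) = \mathbf{u}^{T}\mathbf{p} + \sum_{p_i \in \mathcal{P}_u} \mathbf{p}^{T}\mathbf{p}_i \qquad (6)$$

which estimates $u$’s preference on attribute $p$, given $u$’s current preferred attributes $\mathcal{P}_u$. To train the model, we also employ BPR loss, and assume that the attributes of the ground truth item $v$ (of the session) should be ranked higher than other attributes:

$$L_{attr} = \sum_{(u, p, p') \in \mathcal{D}_3} -\ln \sigma\big(\hat{g}(p \mid u, \mathcal{P}_u) - \hat{g}(p' \mid u, \mathcal{P}_u)\big) + \lambda_{\Theta}\lVert\Theta\rVert^2 \qquad (7)$$

where the pairwise training data $\mathcal{D}_3$ is defined as:

$$\mathcal{D}_3 := \{(u, p, p') \mid p \in \mathcal{P}_v,\ p' \in \mathcal{P} \setminus \mathcal{P}_v\} \qquad (8)$$

where $\mathcal{P}_v$ denotes item $v$’s attributes.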
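A compact sketch of the attribute-prediction task follows. The pair construction is taken from the text (attributes of the ground truth item rank above all others); the scoring function is an assumption that mirrors the item-prediction form, with a user term plus affinity to the already-confirmed attributes. All names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attribute_score(user_emb, attr_emb, pref_attr_embs):
    """Assumed form: user-attribute interest plus affinity to confirmed attributes."""
    return dot(user_emb, attr_emb) + sum(dot(attr_emb, p) for p in pref_attr_embs)


def attribute_pairs(user, target_attrs, all_attrs):
    """(user, positive, negative) triples: target item's attributes vs. the rest."""
    negatives = sorted(set(all_attrs) - set(target_attrs))
    return [(user, pos, neg) for pos in sorted(target_attrs) for neg in negatives]


pairs = attribute_pairs("u1", {"jazz", "live"}, ["jazz", "live", "rock", "pop"])
score = attribute_score([1.0, 2.0], [0.5, 0.5], [[2.0, 0.0]])  # 1.5 + 1.0
```

Each triple would then be scored and fed to the same pairwise BPR loss used for item prediction.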

3.1.4. Multi-task Training.

We perform joint training on the two tasks of item prediction and attribute prediction, which has the potential of mutual benefits since their parameters are shared. The multi-task training objective is:

$$L = L_{item} + L_{attr} \qquad (9)$$

Specifically, we first train the model with $L_{item}$. After it converges, we continue optimizing the model using $L_{attr}$. We iterate the two steps until convergence under both losses. Empirically, 2-3 iterations are sufficient for convergence.

3.2. Action

After the estimation stage, the action stage finds the best strategy for when to recommend. We adopt reinforcement learning (RL) to tackle this multi-round decision making problem, aiming to accomplish successful recommendation in fewer turns. It is worth noting that since our focus is on conversational recommendation strategy, as opposed to fluent dialogue (the language part), we use templates as wrappers to handle user utterances and system response generation. That is to say, this work serves as an upper bound study of real applications, as we do not include the errors of language understanding and generation.

3.2.1. State Vector.

The state vector is a bridge for the interaction between the CC and RC. We encode information from the RC and dialogue history into a state vector, providing it to the CC to choose actions. The state vector is a concatenation of four component vectors that encode signals from different perspectives:

$$\mathbf{s} = \mathbf{s}_{ent} \oplus \mathbf{s}_{pre} \oplus \mathbf{s}_{his} \oplus \mathbf{s}_{len} \qquad (10)$$

Each of the vector components captures an assumption on asking which attribute could be most useful, or whether now is a good time to push a recommendation. They are defined as follows:


  • $\mathbf{s}_{ent}$: This vector encodes the entropy information of each attribute among the attributes of the current candidate items $\mathcal{V}_{cand}$. The intuition is that asking attributes with large entropy helps to reduce the candidate space, thus benefiting finding desired items in fewer turns. Its size is the attribute space size $|\mathcal{P}|$, where the $i$-th dimension denotes the entropy of the attribute $p_i$.

  • $\mathbf{s}_{pre}$: This vector encodes $u$’s preference on each attribute. It is also of size $|\mathcal{P}|$, where each dimension is evaluated by Equation (6) on the corresponding attribute. The intuition is that the attribute with high predicted preference is likely to receive positive feedback, which also helps to reduce the candidate space.

  • $\mathbf{s}_{his}$: This vector encodes the conversation history. Its size is the number of maximum turns $T$, where each dimension $t$ encodes user feedback at turn $t$. Specifically, we use -1 to represent recommendation failure, 0 to represent asking an attribute that $u$ disprefers, and 1 to represent successfully asking about an attribute that $u$ desires. This state is useful to determine when to recommend items. For example, if the system has asked about a number of attributes of which $u$ approves, it may be a good time to recommend.

  • $\mathbf{s}_{len}$: This vector encodes the length of the current candidate list. The intuition is that if the candidate list is short enough, EAR should turn to recommending to avoid wasting more turns. We divide the length into ten categorical (binary) features to facilitate the RL training.

It is worth noting that besides $\mathbf{s}_{his}$, the other three vectors are all derived from the RC component. We claim that this is a key difference from existing conversational systems (zhang2018towards; nips18/DeepConv; Sun:2018:CRS:3209978.3210002; christakopoulou2018q; Liao:2018); i.e., the CC needs to take information from the RC to decide the dialogue action. In contrast to EAR, the recent conversational recommendation method CRM (Sun:2018:CRS:3209978.3210002) makes decisions based only on the belief tracker that records the preferred attributes of the user, which makes it less informative. As such, CRM is less effective, especially when the number of attributes is large (their experiments only deal with 5 attributes, which is insufficient for real-world applications).
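To make the four components concrete, the sketch below assembles a state vector from toy data. Several details are assumptions on our part, since the text does not fix them: entropy is taken as the binary entropy of "item has the attribute" over the candidates, unused history slots are zero-padded (which coincides with the code for a rejected attribute), and the ten length buckets use log2 edges. Function and variable names are hypothetical.

```python
import math


def attribute_entropy(candidate_attrs, attribute):
    """Binary entropy of 'item has this attribute' over candidates (assumed)."""
    if not candidate_attrs:
        return 0.0
    q = sum(attribute in attrs for attrs in candidate_attrs.values()) / len(candidate_attrs)
    if q == 0.0 or q == 1.0:
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))


def length_vector(n_candidates, bins=10):
    """One-hot bucket of the candidate count; log2 bucket edges are assumed."""
    idx = 0 if n_candidates <= 0 else min(bins - 1, n_candidates.bit_length() - 1)
    return [1 if i == idx else 0 for i in range(bins)]


def build_state(candidate_attrs, all_attributes, preference_scores, feedback, max_turns=15):
    """Concatenate entropy, preference, history and length components."""
    s_ent = [attribute_entropy(candidate_attrs, p) for p in all_attributes]
    s_pre = [preference_scores[p] for p in all_attributes]
    s_his = feedback + [0] * (max_turns - len(feedback))  # zero padding is assumed
    s_len = length_vector(len(candidate_attrs))
    return s_ent + s_pre + s_his + s_len


candidates = {"v1": {"jazz", "live"}, "v2": {"jazz"}}
state = build_state(candidates, ["jazz", "live"], {"jazz": 0.8, "live": 0.1}, [1, -1])
```

In this toy case "jazz" has zero entropy (every candidate has it) while "live" splits the candidates evenly, so asking about "live" is the more informative question.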

3.2.2. Policy Network and Rewards

The conversation action is chosen by a policy network in our CC. In order to demonstrate the efficacy of our designed state vector, we purposely choose a simple policy network, a two-layer multi-layer perceptron, which can be optimized with the standard policy gradient method. It contains two fully-connected layers and maps the state vector $\mathbf{s}$ into the action space. The output layer is normalized to be a probability distribution over all actions by softmax. In terms of the action space, we follow the previous method (Sun:2018:CRS:3209978.3210002), which includes all attributes and a dedicated action for recommendation. To be specific, we define the action space as $\mathcal{A} = \{a_{rec}\} \cup \{a_{ask}(p) \mid p \in \mathcal{P}\}$, which is of size $|\mathcal{P}| + 1$. After the CC takes an action at each turn, it will receive an immediate reward from the user (or user simulator). This will guide the CC to learn the optimal policy that optimizes long-term reward. In EAR, we design four kinds of rewards, namely: (1) $r_{suc}$, a strongly positive reward when the recommendation is successful, (2) $r_{ask}$, a positive reward when the user gives positive feedback on the asked attribute, (3) $r_{quit}$, a strongly negative reward if the user quits the conversation, (4) $r_{prev}$, a slightly negative reward on every turn to discourage overly lengthy conversations. The intermediate reward $r_t$ at turn $t$ is the sum of the above four rewards, $r_t = r_{suc} + r_{ask} + r_{quit} + r_{prev}$.

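We denote the policy network as $\pi(a_t \mid \mathbf{s}_t)$, which returns the probability of taking action $a_t$ given the state $\mathbf{s}_t$. Here $a_t \in \mathcal{A}$ and $\mathbf{s}_t$ denote the action to take and the state vector of the $t$-th turn, respectively. To optimize the policy network, we use the standard policy gradient method (REINFORCE), formulated as follows:

$$\theta \leftarrow \theta + \alpha \nabla_{\theta} \log \pi_{\theta}(a_t \mid \mathbf{s}_t)\, R_t \qquad (11)$$

where $\theta$ denotes the parameter of the policy network, $\alpha$ denotes the learning rate of the policy network, and $R_t$ is the total reward accumulating from turn $t$ to the final turn $T$: $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, where $\gamma$ is a discount factor which discounts future rewards over immediate reward.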
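The pieces of the policy-gradient update can be illustrated without any deep learning library. The sketch below assumes standard discounting (each step further into the future is multiplied by one more factor of the discount) and shows the gradient with respect to the output logits only; backpropagation through the two hidden layers is left out. The toy reward sequence combines the reward values reported in Section 4.1.3, and all function names are ours.

```python
import math


def softmax(logits):
    """Normalize logits into a probability distribution over actions."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discounted_returns(rewards, gamma=0.7):
    """Return R_t for every turn, accumulating rewards from t to the end."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def logit_gradients(logits, action, ret):
    """Gradient of ret * log softmax(logits)[action] w.r.t. the logits."""
    probs = softmax(logits)
    return [ret * ((1.0 if i == action else 0.0) - p) for i, p in enumerate(probs)]


# Turn 1: accepted question (0.1 - 0.1); turn 2: rejected question (-0.1);
# turn 3: successful recommendation (1 - 0.1).
returns = discounted_returns([0.0, -0.1, 0.9], gamma=0.5)
grads = logit_gradients([0.2, -0.4, 1.0], action=2, ret=returns[2])
```

A gradient-ascent step would add the learning rate times these gradients to the logits' upstream parameters, raising the probability of actions that led to high return.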

3.3. Reflection

This stage also implements the interaction between the CC and RC. It is triggered when the CC pushes the recommended items to the user but gets rejected, so as to update the RC model for better recommendations in future turns. In the traditional static recommender system training scenario (BPR; NCF), one issue is the absence of true negative samples, since users do not explicitly indicate what they dislike. In our conversational case, the rejection feedback is an explicit signal of user dislikes, which is highly valuable to utilize; moreover, it indicates that the offline learned FM model improperly assigns high scores to the rejected items. To leverage this source of feedback, we treat the rejected items as negative samples, constructing more training examples to refresh the FM model. Following the offline training process, we also optimize the BPR loss:

$$L_{ref} = \sum_{(u, v, v') \in \mathcal{D}_4} -\ln \sigma\big(\hat{y}(u, v, \mathcal{P}_u) - \hat{y}(u, v', \mathcal{P}_u)\big) + \lambda_{\Theta}\lVert\Theta\rVert^2 \qquad (12)$$

where $\mathcal{D}_4 := \{(u, v, v') \mid v \in \mathcal{V}_u^{+} \wedge v' \in \mathcal{V}_t\}$, with $\mathcal{V}_u^{+}$ the items historically interacted by $u$ and $\mathcal{V}_t$ the rejected item list. Note that this stage is performed in an online fashion, where we do not have access to the ground truth positive item. Thus, we treat the historically interacted items as the positive items to pair with the rejected items. We put all examples in $\mathcal{D}_4$ into a batch and perform batch gradient descent. Empirically, it takes 3-5 epochs to converge, sufficiently efficient for online use.
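The construction of the reflection batch is simple enough to show directly. This sketch only builds the training triples (historically interacted items as positives, rejected items as negatives); the function name is hypothetical, and the subsequent BPR update on the FM is not shown.

```python
def reflection_triples(user, interacted_items, rejected_items):
    """Pair every historically interacted item with every rejected item."""
    return [
        (user, positive, negative)
        for positive in interacted_items
        for negative in rejected_items
    ]


# Two past interactions and a rejected list of three items give six triples.
batch = reflection_triples("u1", ["v3", "v8"], ["v10", "v11", "v12"])
```

The batch size grows with the product of history length and list length, so with a recommendation list of 10 items it stays small enough for a few quick epochs.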

Note that although it sounds reasonable to also update the policy network of the CC (since the rejection feedback implies that it was not an appropriate time to push a recommendation), we currently do not do so because updating an RL agent online is difficult, and we leave it for future work.

4. Experiments

EAR (Footnote 3: Datasets, source code and demos are available at our project homepage.) is built on the guiding ideology of interaction between the CC and RC. To validate this ideology, we first evaluate the whole system to examine the overall effect brought by the interaction. Then, we perform an ablation study to investigate the effect of interaction on each individual component. Specifically, we have the following research questions (RQ) to guide experiments on two datasets.


  • RQ1. How does the overall performance of EAR compare with existing conversational recommendation methods?

  • RQ2. How do the attribute-aware BPR and multi-task training of the estimation stage contribute to the RC?

  • RQ3. Is the state vector designed for the CC in the action stage appropriate?

  • RQ4. Is the online model update of the reflection stage useful in obtaining better recommendation?

4.1. Settings

4.1.1. Datasets

We conduct experiments on two datasets: Yelp for business recommendation and LastFM for music artist recommendation. First, we follow the common setting of recommendation evaluation (NCF; BPR) that reduces the data sparsity by pruning the users that have fewer than 10 reviews. We split the user–item interactions in the ratio of 7:2:1 for training, validation and testing. Table 1 summarizes the statistics of the datasets.

Dataset    #users    #items    #interactions    #attributes
Yelp       27,675    70,311    1,368,606        590
LastFM     1,801     7,432     76,693           33
Table 1. Dataset statistics.

For the item attributes, we preprocess the original attributes of the datasets by merging synonyms and eliminating low frequency attributes, resulting in 590 attributes in Yelp and 33 attributes in LastFM. In real applications, asking about attributes in a large attribute space (e.g., on the Yelp dataset) causes overly lengthy conversations. We therefore consider both the binary question setting (on LastFM) and the enumerated question setting (on Yelp). To enable the enumerated question setting, we build a two-level taxonomy on the attributes of the Yelp data. For example, the parent attribute of {“wine”, “beer”, “whiskey”} is “alcohol”. We create 29 such parent attributes on top of the 590 attributes, such as “nightlife”, “event planning & services”, “dessert types” etc. In the enumerated question setting, the system chooses one parent attribute to ask. That is to say, we change the size of the output space of the policy network to be $29 + 1 = 30$ (the 29 parent attributes plus the recommendation action). At the same time, it also displays all its child attributes and asks the user to choose from them (the user can reply with multiple child attributes). Note that choosing what kinds of questions to ask is an engineering design choice; here we evaluate our model on both settings.

4.1.2. User Simulator For Multi-round Scenario.

Because conversational recommendation is a dynamic process, we follow (zhang2018towards; Sun:2018:CRS:3209978.3210002) to create a user simulator to enable the CRS training and evaluation. We simulate a conversation session for each observed interaction between users and items. Specifically, given an observed user–item interaction $(u, v)$, we treat $v$ as the ground truth item to seek and its attributes $\mathcal{P}_v$ as the oracle set of attributes preferred by the user in this session. At the beginning, we randomly choose an attribute from the oracle set as the user’s initialization to the session. Then the session goes in the loop of the “model acts – simulator responds” process as introduced in Section 2. We set the max turn of a session to 15 and standardize the recommendation list length as 10.
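The simulator described above can be sketched as a small class. This is an illustrative reading of the description, not the released simulator: the class name, method names, and the `item_attributes` dictionary are hypothetical, and the random generator is passed in so that runs are reproducible.

```python
import random


class SimulatedUser:
    """Answers questions from one observed interaction (ground truth item)."""

    def __init__(self, target_item, item_attributes, rng):
        self.target = target_item
        # Oracle set: attributes of the ground truth item.
        self.oracle = set(item_attributes[target_item])
        # The session starts from one randomly chosen oracle attribute.
        self.initial_attribute = rng.choice(sorted(self.oracle))

    def answer_ask(self, attribute):
        """Say 'yes' only to attributes of the ground truth item."""
        return attribute in self.oracle

    def answer_recommend(self, items):
        """Accept a list only if it contains the ground truth item."""
        return self.target in items


item_attributes = {"v1": ["jazz", "live"], "v2": ["rock"]}
user = SimulatedUser("v1", item_attributes, random.Random(0))
```

One such simulated user would be created for every observed user-item interaction, and the CRS would then converse with it for at most 15 turns.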

4.1.3. Training Details

Following CRM (Sun:2018:CRS:3209978.3210002), the training process is divided into offline and online stages. The offline training builds the RC (i.e., FM) and initializes the policy network (PN) by letting them optimize performance on the offline dialogue history. Due to the scarcity of conversational recommendation dialogue history, we follow CRM (Sun:2018:CRS:3209978.3210002) and simulate dialogue history by building a rule-based CRS that interacts with the simulator introduced in Section 4.1.2. Specifically, the strategy for determining which attribute to ask about is to choose the attribute with the maximum entropy. Each turn, the system chooses the recommendation action with probability $10/\max(|\mathcal{V}_{cand}|, 10)$, where $\mathcal{V}_{cand}$ is the current candidate set. The intuition is that the confidence of recommendation grows when the candidate size is smaller. We train the RC to give the ground-truth item and oracle attributes higher ranks given the attributes confirmed by users in dialogue histories, while training the policy to mimic the rule-based strategy on the history. Afterwards, we conduct online training, optimizing the PN by letting EAR interact with the user simulator through reinforcement learning.

We tuned all hyper-parameters on the validation set, and empirically set them as follows: The embedding size of FM is set as 64. We employ the multi-task training mechanism to optimize FM as described in Section 3.1.4, using SGD with a regularization strength of 0.001. The learning rate for the first task (item prediction) and second task (attribute prediction) is set to 0.01 and 0.001, respectively. The size of the two hidden layers in the PN is set as 64. When the pre-trained model is initialized, we use the REINFORCE algorithm to train the PN. The four rewards are set as: $r_{suc}$=1, $r_{ask}$=0.1, $r_{quit}$=-0.3, and $r_{prev}$=-0.1, and the learning rate is set as . The discount factor $\gamma$ is set to be 0.7.

4.1.4. Baselines.

As our multi-round conversational recommendation scenario is new, there are few suitable baselines. We compare our overall performance with the following three:


  • Max Entropy. This method follows the rule we used to generate the conversation history in Section 4.1.3. Each turn it asks the attribute that has the maximum entropy among the candidate items. It is claimed in (dhingra2017towards) that maximum entropy is the best strategy when language understanding is precise. It is worth noting that, in the enumerated question setting, the entropy of an attribute is calculated as the sum of the entropies of its child attributes in the taxonomy (a similar approach is used for attribute preference calculation).

  • Abs Greedy (christakopoulou2016towards). This method recommends items in every turn without asking any question. Once the recommendation is rejected, it updates the model by treating the rejected items as negative examples. According to (christakopoulou2016towards), this method achieves equivalent or better performance than popular bandit algorithms like Upper Confidence Bounds (auer2002using) and Thompson Sampling.


  • CRM (Sun:2018:CRS:3209978.3210002). This is a state-of-the-art CRS. Similar to EAR, it integrates a CC and RC by feeding the belief tracker results to FM for item prediction, but without considering much interaction between them. It was originally designed for single-round recommendation. To achieve a fair comparison, we adapt it to the multi-round setting by following the same offline and online training as EAR.

It is worth noting that although there are other recent conversational recommendation methods (zhang2018towards; nips18/DeepConv; christakopoulou2016towards; Liao:2018), they are ill-suited for comparison due to their different task settings. For example, (zhang2018towards) focuses on document representation, which is unnecessary in our case. It also lacks a conversation policy component to decide which action to take and when. (nips18/DeepConv) focuses more on language understanding and generation. We summarize the settings of these methods in Table 6 and discuss the differences in Section 5.

4.1.5. Evaluation Metrics

We use the success rate (SR@t) (Sun:2018:CRS:3209978.3210002) to measure the ratio of successful conversations, i.e., conversations in which the ground truth item is recommended by turn t. We also report the average number of turns (AT) needed to end the session. A larger SR denotes better recommendation and a smaller AT denotes more efficient conversation. When studying the offline training of the RC model, we use the AUC score, which is a surrogate of the BPR objective (BPR). We conduct a one-sample paired t-test to judge statistical significance.
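
The two conversation-level metrics can be computed as in the sketch below. It assumes each session is summarized by the turn at which the ground truth item was recommended, or None if it never was, and that failed sessions count as the maximum of 15 turns when averaging; both the data layout and the function names are hypothetical.

```python
def success_rate_at(success_turns, t):
    """SR@t: fraction of sessions that hit the ground truth item by turn t."""
    hits = sum(1 for turn in success_turns if turn is not None and turn <= t)
    return hits / len(success_turns)


def average_turns(success_turns, max_turn=15):
    """AT: mean session length, counting failed sessions as max_turn (assumption)."""
    total = sum(turn if turn is not None else max_turn for turn in success_turns)
    return total / len(success_turns)


# Four toy sessions: success at turns 3, 7 and 15, and one failure.
sessions = [3, 7, None, 15]
sr_at_5 = success_rate_at(sessions, 5)
sr_at_15 = success_rate_at(sessions, 15)
at = average_turns(sessions)
```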

4.2. Performance Comparison (RQ1)

Figure 2. Success Rate* of compared methods at different conversation turns on Yelp and LastFM (RQ1).
             LastFM          Yelp
             SR@15   AT      SR@15   AT
Abs Greedy   0.209   13.63   0.271   12.26
Max Entropy  0.290   13.61   0.919   5.77
CRM          0.325   13.43   0.923   5.33
EAR          0.429*  12.45*  0.971*  4.71*
Table 2. SR@15 and AT of compared methods. * denotes that the improvement of EAR over the other methods is statistically significant (RQ1).

Figure 2 shows the recommendation Success Rate* (SR*)@t at different turns (t = 1 to 15). SR* denotes the success rate of each method compared against the strongest baseline, CRM, which serves as the reference in the figure. Table 2 shows the final success rate and the average number of turns. As can be clearly seen, our EAR model significantly outperforms the other methods. This validates our hypothesis that considering extensive interactions between the CC and RC is an effective strategy for building a conversational recommender system. We also make the following observations:

Compared with Abs Greedy, the three attribute-based methods (EAR, Max Entropy and CRM) have a nearly zero success rate at the beginning of a conversation. This is because these methods tend to ask questions at the very beginning. As the conversation proceeds, Abs Greedy (which only recommends items) gradually falls behind the attribute-based methods, demonstrating the efficacy of asking about attributes in the conversational recommendation scenario. Note that Abs Greedy falls much further behind the attribute-based methods on Yelp than on LastFM. The key reason is that the Yelp setting asks enumerated questions, and a user’s response with multiple finer-grained attributes sharply shrinks the set of candidate items.

CRM generally underperforms our EAR method. One of the key reasons is that its state vector cannot help the CC learn a sophisticated strategy for asking and recommending, especially in a much larger action space, i.e., a larger number of attributes (nearly 30 in our experiments versus 5 in theirs (Sun:2018:CRS:3209978.3210002)). This result suggests that, in a more complex multi-round scenario, the CC needs to make comprehensive use of information from both the CC (e.g., dialogue histories) and the RC (e.g., statistics like attribute preference estimation) when formulating a recommendation strategy.

Interestingly, Figure 2 indicates that on Yelp, EAR’s gain over CRM enlarges in Turns 1–3, shrinks in Turns 4–6 and widens again afterwards. On LastFM, however, the gain increases steadily. This phenomenon reveals that our EAR system can learn different strategies in different settings. On the Yelp dataset, the CRS asks enumerated questions where the user can choose finer-grained attributes, resulting in a sharp reduction of the candidate space. The strategy that the EAR system learns is more aggressive: it attempts to ask about attributes that can sharply shrink the candidate space and makes decisive recommendations in the early turns when it is confident. If this aggressive strategy fails, it switches to a more patient strategy that asks more questions without recommending, causing less success in the middle turns (e.g., Turns 5–7). However, this strategy pays off in the long term, making recommendations more successful in the latter half of conversations (e.g., after Turn 7). In contrast, CRM is only able to follow the strategy of asking more attributes at the beginning and making recommendations later. On the LastFM dataset, the setting is limited to binary attributes, which are less efficient at reducing the candidate space. Both EAR and CRM adapt and ask more questions at the outset before making recommendations. However, as EAR incorporates a better CC and RC and models their interaction better, it significantly outperforms CRM.

4.3. Effectiveness of Estimation Designs (RQ2)

There are two key designs in the estimation stage that trains the recommendation model FM offline: the attribute-aware BPR that samples negatives with attribute matching considered, and the multi-task training that jointly optimizes item prediction and attribute prediction tasks. Table 3 shows offline AUC scores on the two tasks of three methods: FM, FM with attribute-aware BPR (FM+A), and FM+A with multi-task training (FM+A+MT).

          LastFM              Yelp
          Item    Attribute   Item    Attribute
FM        0.521   0.727       0.834   0.654
FM+A      0.724   0.629       0.866   0.638
FM+A+MT   0.742*  0.760*      0.870*  0.896*
Table 3. Offline AUC score of FM, FM with attribute-aware BPR (FM+A) and with multi-task training for item recommendation and attribute prediction (FM+A+MT). * denotes that the improvement of FM+A+MT over FM and FM+A is statistically significant (RQ2).

As can be seen, the attribute-aware BPR significantly boosts the performance of item ranking, which is highly beneficial for ranking the ground truth item high. Interestingly, it harms the performance of attribute prediction: on LastFM, for example, FM+A has a much lower AUC score (0.629) than FM (0.727). The reason might be that the attribute-aware BPR loss guides the model to fit specifically to item ranking in the candidate list. Without the attribute prediction task being optimized as well, its performance may suffer. This implies the necessity of explicitly optimizing the attribute prediction task. As expected, the best performance is achieved when multi-task training is added. FM+A+MT significantly enhances the performance of both tasks, validating the effectiveness and rationality of our multi-task training design.
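
For reference, the offline AUC reported in Table 3 can be read as the fraction of (positive, negative) pairs that the model orders correctly, which is why it acts as a surrogate of the pairwise BPR objective. The sketch below is a generic pairwise AUC under that reading, with ties counted as half; the function name and the toy scores are hypothetical and do not reproduce the paper's evaluation code.

```python
def pairwise_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    correct = 0.0
    for positive in positive_scores:
        for negative in negative_scores:
            if positive > negative:
                correct += 1.0
            elif positive == negative:
                correct += 0.5  # ties count as half a correct pair
    return correct / (len(positive_scores) * len(negative_scores))


# Toy scores: the second positive (0.4) is outranked by one negative (0.5).
auc = pairwise_auc([0.9, 0.4], [0.5, 0.3, 0.1])
```

Here five of the six pairs are ordered correctly, so the AUC is 5/6.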

4.4. Ablation Studies on State Vector (RQ3)

What information helps in decision making? Let us examine the effects of the four forms of information included in the EAR state vector s (Equation 10), by ablating each information type from the feature vector (Table 4).

     Yelp                            LastFM
     SR@5    SR@10   SR@15   AT      SR@5    SR@10   SR@15   AT
     0.614   0.895   0.969   4.81    0.051   0.190   0.346   12.82
     0.596   0.857   0.959   5.06    0.024   0.231   0.407   12.55
     0.624   0.894   0.949   4.79    0.021   0.236   0.424   12.50
     0.550   0.846   0.952   5.44    0.013   0.230   0.416   12.56
EAR  0.629*  0.907*  0.971*  4.71*   0.020   0.243*  0.429*  12.45*
Table 4. Performance of removing one component of the state vector (Equation 10) from our EAR. * denotes that the improvement of EAR over the model with the removed component is statistically significant (RQ3).

Comparing the performance drop of each method, we uncover differences that corroborate the intrinsic difference between the two conversational settings. The most important factor depends on the question type: entropy for LastFM (binary questions) and candidate list length for Yelp (enumerated questions). The entropy information is crucial for LastFM, which is in line with the claim in (dhingra2017towards) that maximum entropy is the best strategy when language understanding is precise. If we ablate it on LastFM, SR@5 reaches 0.051, but later SR suffers greatly, due to the system’s over-aggressiveness in recommending items before obtaining sufficient relevant attribute evidence. As for the enumerated question setting (Yelp), the candidate list length is the most important, because the candidate item list shrinks more sharply and its length is helpful when deciding when to recommend.

Apart from entropy and candidate list length, the remaining two factors (attribute preference and conversation history) both contribute positively. Their impact is sensitive to datasets and metrics. For example, the attribute preference strongly affects SR@5 and SR@10 on Yelp, but does not show a significant impact on SR@15. This inconsistency provides evidence for the intrinsic difficulty of decision making in the conversational recommendation scenario, which has yet to be extensively studied.
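
The ablation protocol above can be pictured with the following sketch, which builds a state vector by concatenating four components and drops one of them on request. The component names, the toy encodings and their dimensions are placeholders chosen for illustration (Equation 10 defines the actual components), and dropping a component rather than, say, zeroing it out is likewise an assumption.

```python
# Hypothetical component names; Equation 10 defines the actual state vector.
COMPONENTS = ("entropy", "preference", "history", "length")


def build_state(features, ablate=None):
    """Concatenate the state components in a fixed order, skipping the ablated one."""
    if ablate is not None and ablate not in COMPONENTS:
        raise ValueError(f"unknown component: {ablate}")
    state = []
    for name in COMPONENTS:
        if name != ablate:
            state.extend(features[name])
    return state


# Toy encodings for one conversation turn (placeholder values).
features = {
    "entropy": [0.81, 1.0],
    "preference": [0.3, 0.7],
    "history": [1, -1, 0],
    "length": [0, 1, 0],
}
full_state = build_state(features)
no_entropy_state = build_state(features, ablate="entropy")
```

Each ablated variant would then be used to train and evaluate the policy network separately, so that a performance drop can be attributed to the missing component.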

4.5. Investigation on Reflection (RQ4)

         Yelp                            LastFM
         SR@5    SR@10   SR@15   AT      SR@5    SR@10   SR@15   AT
-update  0.629   0.905   0.970   4.72    0.020   0.217   0.393   12.67
EAR      0.629   0.907   0.971   4.71    0.020   0.243*  0.429*  12.45*
Table 5. Performance after removing the online update module in the reflection stage. * denotes that the improvement of EAR over the variant without the update module is statistically significant (RQ4).
Figure 3. Percentage of bad updates w.r.t. the offline model’s AUC on the users on Yelp (RQ4).
1. Q? 2. Question Space 3. Explicit 4. Multi-round 5. Main Focus
Online bandits (christakopoulou2016towards; wu2016contextual; wu2018learning) N.A. Exploration-exploitation trade-off in item selection
REDIAL (NIPS’18) (nips18/DeepConv) Free texts End-to-end generation of natural language response
KMD (MM’18) (Liao:2018) Free texts End-to-end generation of text and image response
Q&R (KDD’18) (christakopoulou2018q) Attributes Question asking and single-round recommendation
MMN (CIKM’18) (zhang2018towards) Attributes Attribute-product match in conversational search
CRM (SIGIR’18) (Sun:2018:CRS:3209978.3210002) Attributes Shallow combination between CC and RC
VDARIS (KDD’19) (yu2019visual) N.A. User’s click and comment on recommended items
EAR (our method) Attributes Deep interaction between CC and RC
Table 6. Recent conversational recommender summary: 1) whether it asks about attributes, 2) question space, 3) any explicit strategy w.r.t. recommendation timing, 4) whether it considers multi-round recommendations, and 5) its main focus.

To understand the impact of the online update in the reflection stage, we start with an ablation study. Table 5 shows the variant of EAR that removes the online update. We find that the trends differ across the two datasets: the updating strategy helps a lot on LastFM but has only a very minor effect on Yelp.

To investigate this phenomenon, we examine the individual items on Yelp. We find that updating does not always help ranking, especially when the offline model already ranks the ground truth item high (but not in the top 10). In this case, an update is highly likely to pull down the ranking position of the ground truth item. To gain statistical evidence for this observation, we term such updates bad updates, and show the percentage of bad updates with respect to the offline model’s AUC on the users. As seen from Figure 3, there is a clear positive correlation between bad updates and AUC score. For example, 3.5% of the bad updates come from users with an offline AUC of 0.9.

This explains why the online update works well for LastFM but not for Yelp: our recommendation model performs better on Yelp than on LastFM (0.870 vs. 0.742 in AUC, as shown in Table 3). This means the items on Yelp are more likely to get a higher AUC, resulting in more bad updates. More such observations and analyses will help further the community’s understanding of the efficacy of online updates. Although work on bandit algorithms has been devoted to exploring this question (kuleshov2014algorithms; chu2011contextual; li2016collaborative; gentile2014online; wu2016contextual), the issue has largely been unaddressed in the context of conversational recommender systems.
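
The notion of a bad update used in this section can be sketched as a simple check: an update is bad if the ground truth item ranks lower after the update than before it. The function names and the toy scores below are hypothetical, and the sketch assumes that higher scores mean higher ranks.

```python
def rank_of(target, scores):
    """1-based rank of the target item when items are sorted by descending score."""
    target_score = scores[target]
    return 1 + sum(1 for score in scores.values() if score > target_score)


def is_bad_update(target, scores_before, scores_after):
    """An update is bad if it pushes the ground truth item further down the list."""
    return rank_of(target, scores_after) > rank_of(target, scores_before)


# Toy example: the target "t" is ranked 3rd before the update and 4th after it.
before = {"a": 0.9, "b": 0.8, "t": 0.7, "c": 0.1}
after = {"a": 0.85, "b": 0.8, "t": 0.6, "c": 0.65}
bad = is_bad_update("t", before, after)
```

Counting such cases per user and grouping users by their offline AUC gives the kind of breakdown plotted in Figure 3.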

5. Related Work

The offline static recommendation task is formulated as estimating the affinity score between a user and an item (NCF). This is usually achieved by learning user preferences through the historical user-item interactions such as clicking and purchasing. The representative methods are Matrix Factorization (MF) (MF) and Factorization Machine (FM) (rendle2010factorization). Neural FM (NFM) and DeepFM (guo2017deepfm) have improved FM’s representation ability with deep neural networks.

(fastMF; iCD; sigir/EbesuSF18) utilize users’ implicit feedback, commonly optimizing the BPR loss (BPR). (cheng2019mmalfm; cheng2018aspect) exploit users’ reviews and image information. However, such static recommendation methods suffer from the intrinsic limitation of not being able to capture users’ dynamic preferences.

This intrinsic limitation motivates online recommendation. Its target is to adapt the recommendation results to the user’s online actions (li2015online). Many works model it as a multi-arm bandit problem (wu2016contextual; wang2017factorization; wu2018learning), strategically presenting items to users to obtain useful feedback. (zhang2019toward) makes a preliminary effort to extend the bandit framework to query attributes. While achieving remarkable progress, the bandit-based solutions are still insufficient: 1) Such methods focus on the exploration–exploitation trade-off in cold start settings. However, in the warm start scenario, capturing the user’s dynamic preference is critical, as preference drift is common; 2) The mathematical formulation of the multi-arm bandit problem limits such methods to recommending only one item at a time. This constraint limits their application, as we usually need to recommend a list of items.

Conversational recommender systems provide a new possibility for capturing dynamic feedback, as they enable a system to interact with users using natural language. However, they also pose challenges to researchers, leading to various settings and problem formulations (christakopoulou2016towards; nips18/DeepConv; Liao:2018; christakopoulou2018q; zhang2018towards; Sun:2018:CRS:3209978.3210002; priyogi2019preference; liao2019deep; yu2019visual; ayundhita2019ontology; sardella2019approach; zhang2019toward). Table 6 summarizes the key aspects of these works. Generally, prior work considers conversational recommendation only under simplified settings. For example, (christakopoulou2016towards; yu2019visual) only allow the CRS to recommend items without asking users about their preferred attributes. The Q&R work (christakopoulou2018q) proposes to jointly optimize the two tasks of attribute and item prediction, but restricts the whole conversation to two turns: one turn for asking and one turn for recommending. CRM (Sun:2018:CRS:3209978.3210002) extends the conversation to multiple turns but still follows the single-round setting. MMN (zhang2018towards) focuses on document representation, aiming to learn a better matching function between attributes and product descriptions under a conversation setting. Unfortunately, it does not build a dialogue policy to decide when to ask or make recommendations. In contrast, the situations in various real applications are complex: the CRS needs to strategically ask about attributes and make recommendations in multiple rounds, achieving successful recommendations in the fewest turns. In recent work, only (nips18/DeepConv) considers this multi-round scenario, but it focuses on language understanding and generation, without explicitly modeling the conversational strategy.

6. Conclusion and Future Work

In this work, we redefine the conversational recommendation task as one where the RC and CC closely support each other so as to achieve the goal of accurate recommendation in fewer turns. We decompose the task into three key problems, namely, what to ask, when to recommend, and how to adapt to user feedback. We then propose EAR, a new three-stage solution that accounts for the three problems in a unified framework. For each stage, we design our method to carefully account for the interactions between the RC and CC. Through extensive experiments on two datasets, we demonstrate the effectiveness of EAR and provide additional insights into the conversational strategy and online updates.

Our work represents the first step towards exploring how the CC and RC can collaborate closely to provide a quality recommendation service in this multi-round scenario. Naturally, there are a few loose ends for further investigation, especially with respect to incorporating user feedback. In the future, we will consider refreshing the policy network so that it takes better actions. We will also extend EAR to consider the explore–exploit balance, which is the key problem for traditional interactive recommender systems. Lastly, we will deploy our system in online applications that interact with real users to gain more insights for further improvements.

Acknowledgement: This research is part of NExT++ research and also supported by the National Natural Science Foundation of China (61972372). NExT++ is supported by the National Research Foundation, Prime Minister’s Office, Singapore under its IRC@SG Funding Initiative. We would like to thank the anonymous reviewers for their valuable reviews.