Collaborative Filtering with A Synthetic Feedback Loop

10/21/2019 ∙ by Wenlin Wang, et al.

We propose a novel learning framework for recommendation systems, assisting collaborative filtering with a synthetic feedback loop. The proposed framework consists of a "recommender" and a "virtual user." The recommender is formulated as a collaborative-filtering method, recommending items according to observed user behavior. The virtual user estimates rewards for the recommended items and generates the influence of those rewards on observed user behavior. The recommender connected with the virtual user constructs a closed loop, which recommends items to users and imitates the unobserved feedback of the users to the recommended items. The synthetic feedback is used to augment observed user behavior and improve recommendation results. Such a model can be interpreted as inverse reinforcement learning, which can be learned effectively via rollout (simulation). Experimental results show that the proposed framework is able to boost the performance of existing collaborative filtering methods on multiple datasets.







1 Introduction

Recommendation systems are important modules for abundant online applications, helping users explore items of potential interest. As one of the most effective approaches, collaborative filtering Sarwar et al. (2001); Koren and Bell (2015); He et al. (2017) and its deep-neural-network-based variants He et al. (2017); Wu et al. (2016); Liang et al. (2018); Li and She (2017); Yang et al. (2017); Wang et al. (2018a) have been widely studied. These methods leverage patterns across similar users and items to predict user preferences, demonstrating encouraging results in recommendation tasks Bennett and Lanning (2007); Hu et al. (2008); Schedl (2016). Among these works, besides "user-item" pairs, side information, e.g., user reviews and scores on items, is often involved and has achieved remarkable success Menon et al. (2011); Fang and Si (2011). Such side information is a kind of user feedback to the recommended items, which is promising for improving recommendation systems.

Unfortunately, both the user-item pairs and the user feedback are extremely sparse compared with the search space of items. What is worse, when a recommendation system is trained on static observations, the feedback is unavailable until the system is deployed in real-world applications — in both the training and validation phases, the target system has no access to any feedback because no one has observed the recommended items. Therefore, the recommendation system may suffer from overfitting, and its performance may degrade accordingly, especially in the initial phase of deployment. Although real-world recommendation systems are usually updated in an online manner with the assistance of increasing observed user behavior Rendle and Schmidt-Thieme (2008); Agarwal et al. (2010); He et al. (2016), introducing a feedback mechanism during the training phase can potentially improve the efficiency of the initial systems. However, this is neglected by existing learning frameworks.

Motivated by the above observations, we propose a novel framework that achieves collaborative filtering with a synthetic feedback loop (CF-SFL). As shown in Figure 1, the proposed framework consists of a “recommender” and a “virtual user.” The recommender is a collaborative filtering (CF) model that predicts items from observed user behavior. The observed user behavior reflects the intrinsic preferences of users, while the recommended items represent the potential user preferences estimated by the model. Taking the fusion of the observed user behavior and the recommended items as input, the virtual user, the key module of our model, imitates real-world scenarios and synthesizes user feedback. In particular, the virtual user contains a reward estimator and a feedback generator: the reward estimator estimates rewards based on the fused inputs (the compatible representation of the user observation and its recommended items), learned with a generative adversarial regularizer. The feedback generator provides feedback embeddings to augment the original user embeddings, conditioned on the estimated rewards as well as the fused inputs. Such a framework constructs a closed loop between the target CF model and the virtual user, synthesizing user feedback as side information to improve recommendation results.

Figure 1: Illustration of our proposed CF-SFL framework for collaborative filtering.

The proposed CF-SFL framework can be interpreted as an inverse reinforcement learning (IRL) approach, in which the recommender learns to recommend items to users (policy) with the estimated guidance (feedback) from the proposed virtual user. The proposed feedback loop can be understood as an effective rollout procedure for recommendation, jointly updating the recommender (policy) and the virtual user (the reward estimator and the feedback generator). Consequently, even if side information (e.g., real-world user feedback) is unobservable, our algorithm is still applicable, synthesizing feedback in both the training and inference phases. The proposed framework is general and compatible with most CF methods. Experimental results show that the performance of existing approaches can be remarkably improved within the proposed framework.

2 Proposed Framework

In this section, we first describe the problem of interest and then give a detailed description of each module included in the framework.

2.1 Problem Statement

Suppose that we have N users and M items in total. We denote the observed user-item matrix as X = [x_1, …, x_N] ∈ {0,1}^{M×N}, where each vector x_u ∈ {0,1}^M, u = 1, …, N, represents the observed behavior of user u. x_{iu} = 1 indicates that the i-th item has been bought or reviewed by the u-th user; otherwise, the i-th item is either irrelevant to the u-th user or we do not have enough knowledge about their relationship. The desired recommendation system aims to predict each user’s preference, denoted as r_u ∈ R^M, whose element r_{iu} indicates the preference of the u-th user for the i-th item. Accordingly, the system recommends to each user the items with large r_{iu}’s.
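To make the notation concrete, here is a minimal sketch with made-up toy data: an observed binary user-item matrix X and a hypothetical score matrix R produced by some CF model. Masking already-observed items before ranking is an illustrative convention, not a detail prescribed by the paper.

```python
import numpy as np

# Toy observed user-item matrix X in {0,1}^{M x N}: M=4 items, N=3 users.
# X[i, u] = 1 means user u has bought/reviewed item i (values are made up).
X = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
])

# Hypothetical predicted preferences r_iu from some CF model (also made up).
R = np.array([
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.1],
    [0.7, 0.9, 0.3],
    [0.1, 0.2, 0.95],
])

def recommend_top_k(R, X, k=1):
    """Recommend the k unobserved items with the largest scores per user."""
    scores = np.where(X == 1, -np.inf, R)   # mask already-observed items
    return np.argsort(-scores, axis=0)[:k]  # item indices, one column per user

print(recommend_top_k(R, X, k=1))  # [[1 3 2]]
```

The system thus recommends, per user, the unobserved items with the largest r_{iu}'s.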

Ideally, for each user, x_u contains only partial (actually, very sparse) information about the user’s preference, and a practical recommendation system works dynamically with a closed loop — users often generate feedback to the recommended items, while the recommendation system considers this feedback to revise the recommended items in the future. Therefore, we can formulate the whole recommendation process as

r_u^t = F(x_u, e_u^{t-1}),   e_u^t = G(x_u, r_u^t),   (1)

where F represents the target recommender and G represents the coupled feedback mechanism of the system. e_u^t is the embedding of the user’s feedback to historically recommended items. At each time t, the recommender predicts preferred items according to observed user behaviors and previous feedback, and the user generates feedback to the recommender. Note that (1) is different from existing sequential recommendation models Mishra et al. (2015); Wang et al. (2016a) because those methods ignore the feedback loop as well, merely updating the recommender according to sequential observations, i.e., x_u^t for different time t’s. (When the static observation x_u in (1) is replaced with a sequential observation x_u^t, (1) is naturally extended to a sequential recommendation system with a feedback loop. In this work, we focus on the case with static observations and train a recommender system accordingly.)
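The closed loop in (1) can be sketched as follows. The random linear maps below are mere placeholders standing in for the learned recommender and feedback mechanism (in the paper, both are neural networks), and all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

M, d = 5, 3                                      # items, feedback-embedding size
x_u = rng.integers(0, 2, size=M).astype(float)   # observed behavior of one user

# Random linear maps standing in for the learned recommender F and the
# feedback mechanism G; in the paper both are neural networks.
W_f = rng.normal(size=(M, M + d))
W_g = rng.normal(size=(d, 2 * M))

def F(x, e):
    """Recommender: preference scores from behavior + previous feedback."""
    return np.tanh(W_f @ np.concatenate([x, e]))

def G(x, r):
    """Feedback mechanism: feedback embedding from behavior + recommendation."""
    return np.tanh(W_g @ np.concatenate([x, r]))

# The closed loop of (1): alternate recommendation and feedback.
e = np.zeros(d)            # e_u^0: no feedback before the first step
for t in range(3):
    r = F(x_u, e)          # r_u^t = F(x_u, e_u^{t-1})
    e = G(x_u, r)          # e_u^t = G(x_u, r_u^t)
```

Each pass through the loop refines the recommendation using the feedback produced for the previous one.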

Unfortunately, the feedback information is often unavailable in the training and inference phases. Accordingly, most existing collaborative-filtering-based recommendation methods ignore the feedback loop in the system, learning the target system purely from the static observed user-item matrix X Liang et al. (2018); Li and She (2017). Although in some scenarios side information like user reviews is associated with the observation matrix, the methods using such information often treat it as a kind of static knowledge rather than dynamic feedback. They mainly focus on fitting the ground-truth recommended items with the recommender given the fixed static observations, while ignoring the imitation of the whole recommendation-feedback loop in (1). Without the feedback mechanism G, the recommender F tends to overfit the observed user behavior and static side information, and may therefore degrade in practical dynamic scenarios.

To overcome the problems mentioned above, we propose a collaborative filtering framework with a synthetic feedback loop (CF-SFL), which explains the whole recommendation process from the viewpoint of reinforcement learning. As shown in Figure 1, besides the traditional recommendation module, the proposed framework further introduces a virtual user, which imitates the recommendation-feedback loop even if the real-world feedback is unobservable.

2.2 Recommender

In our framework, the recommender implements the function F in (1), which takes the observed user behavior x_u and his/her previous feedback embedding e_u^{t-1} as input and recommends items accordingly. In principle, the recommender can be defined with high flexibility: it can be an arbitrary collaborative filtering method that predicts items from user representations, such as WMF Hu et al. (2008), CDAE Wu et al. (2016), VAE Liang et al. (2018), etc. In this work, we formulate the recommender from the viewpoint of reinforcement learning.

In particular, the recommendation-feedback loop generates a sequence of interactions between each user and the recommender, {(s_u^t, r_u^t)}, for t = 1, …, T. Here, s_u^t = [x_u; e_u^{t-1}] is the representation of user u at time t, which is a sample in the state space S describing user preference. r_u^t indicates the recommended items provided by the recommender, which is a sample in the action space A of the recommender. Accordingly, we can model the recommendation-feedback loop as a Markov Decision Process (MDP) (S, A, P, R), where P is the transition probability of user preference and R is the reward function used to evaluate recommended items. The recommender works as a policy parametrized by θ, i.e., π_θ(r | s), which corresponds to the distribution of items conditioned on user preference. The target recommender should be an optimal policy that maximizes the expected reward: θ* = argmax_θ E_{(s,r)∼π_θ}[R(s, r)], where R(s, r) means the reward over the state-action pair (s, r). For the u-th user, given s_u^t, the recommender selects potentially preferred items via

r_u^t ∼ π_θ(· | s_u^t).   (2)
Note that, different from traditional reinforcement learning tasks, in which the full states and rewards are available while the environment dynamics have limited accessibility, our recommender receives only partial information about the state — it does not observe the user’s feedback embedding e_u^{t-1}. In other words, to optimize the recommender, we need to build a reward model and a feedback generator jointly, which motivates us to introduce a virtual user into the framework.
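As a toy illustration of the policy view (not the paper's actual architecture), a linear softmax policy over M items, conditioned on the state s_u^t = [x_u; e_u^{t-1}], can be sampled as follows; all parameters are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
M, d = 5, 3

# State s_u^t = [x_u ; e_u^{t-1}]: observed behavior plus previous feedback.
x_u = rng.integers(0, 2, size=M).astype(float)
e_prev = np.zeros(d)
s = np.concatenate([x_u, e_prev])

# Toy linear softmax policy pi_theta over the M items, given the state.
theta = rng.normal(size=(M, M + d))

def policy(s):
    logits = theta @ s
    p = np.exp(logits - logits.max())  # numerically stable softmax
    return p / p.sum()

p = policy(s)
item = rng.choice(M, p=p)  # recommend by sampling from pi_theta(. | s)
```

Since e_u^{t-1} is unobserved at this point, the sketch initializes it to zeros; the virtual user introduced next is what fills it in.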

2.3 Virtual User

The virtual user module aims to implement the feedback function G in (1), which not only models the reward of the items provided by the recommender but also generates feedback to complete the representation of the state. Accordingly, the virtual user contains the following two modules:

Reward Estimator. The reward estimator parametrizes the reward function, which takes the current prediction and the user preference as input and evaluates their compatibility. In this work, we implement the estimator with parameters φ, defined as

R_φ(x_u, r_u^t) = σ(g_φ(h(x_u, r_u^t))),   (3)

where we use the static part of the state, i.e., the observed user behavior x_u, as input. h is the fusion function, which merges x_u and r_u^t into a real-valued vector (the fused input is shown in Figure 5 and described in the Appendix). g_φ is a single-value regression function that translates the fused input into a single reward value, and the sigmoid function σ restricts the reward to lie between 0 and 1.

Feedback Generator. The feedback generator connects the reward estimator with the recommender by generating a feedback embedding, i.e.,

e_u^t = G_ψ(h(x_u, r_u^t), R_φ(x_u, r_u^t)),   (4)

where ψ represents the parameters of the generator. Specifically, the parametric function G_ψ considers the fused input and the estimated reward and returns a feedback embedding to the recommender. The reward indicates the compatibility between the recommended items and the user preference, and e_u^t, which is a vector rather than a scalar like the reward, further enriches the information of the reward to generate the feedback embedding. Consequently, the recommender receives the informative feedback as a complementary component of the static observation and makes an improved recommendation via (2).
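The two modules of the virtual user can be sketched together as below. Concatenation stands in for the learned fusion function h (the paper's actual fusion is described in its appendix), and the parameters are random placeholders, so this is a shape-level illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
M, d = 5, 3

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse(x, r):
    """Fusion function h: here plain concatenation; the paper's learned
    fusion is described in its appendix."""
    return np.concatenate([x, r])

# Random placeholder parameters: phi (estimator) and psi (generator).
w_phi = rng.normal(size=2 * M)
W_psi = rng.normal(size=(d, 2 * M + 1))

def reward(x, r):
    """R_phi: scalar reward in (0, 1) for a (behavior, recommendation) pair."""
    return sigmoid(w_phi @ fuse(x, r))

def feedback(x, r):
    """G_psi: feedback embedding from the fused input plus the estimated reward."""
    return np.tanh(W_psi @ np.concatenate([fuse(x, r), [reward(x, r)]]))

x = rng.integers(0, 2, size=M).astype(float)
r = rng.random(M)
```

Note that the scalar reward is appended to the fused input before the generator is applied, mirroring how the feedback embedding is conditioned on both.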

3 Learning Algorithm

3.1 Learning task

Figure 2: Unrolling the recurrent CF-SFL framework into an iterative learning process with multiple time steps T.

Based on the proposed framework, we need to jointly learn the policy π_θ corresponding to the recommender, the reward estimator R_φ, and the feedback generator G_ψ. Suppose that we have a set of labeled samples D = {(x_u, r_u*)}, where x_u is the historical behavior of user u derived from the user-item matrix X and r_u* is the ground truth of the recommended items for the user based on his/her behavior x_u. We formulate the learning task as the following min-max optimization problem:

min_{θ,ψ} max_φ  L(θ; D) + E_{(x_u, r_u*)∼D}[log R_φ(x_u, r_u*)] + E_{r_u∼π_θ}[log(1 − R_φ(x_u, r_u))].   (5)

In particular, the first term in (5) can be any supervised loss based on the labeled data D, e.g., the evidence lower bound (ELBO) proposed for VAEs Liang et al. (2018) (and used in our work). This term ensures that the recommender fits the ground-truth labeled data. The remaining terms consider the following two types of interactions among the three modules:

  • The collaboration between the recommender policy π_θ and the feedback generator G_ψ towards a better predictive recommender.

  • The adversarial game between the recommender policy π_θ and the reward estimator R_φ.

Accordingly, given the current reward model, we update the recommender policy and the feedback generator to maximize the expected reward derived from the generated user preference and the recommended items. Conversely, given the recommender policy and the feedback generator, we improve the reward estimator by sharpening its criterion — the updated reward estimator maximizes the expected reward derived from the generated user preference and the ground-truth items while minimizing the expected reward based on the recommended items. Therefore, we solve (5) via alternating optimization. The update of θ and ψ is achieved by minimizing

L(θ; D) + E_{r_u∼π_θ}[log(1 − R_φ(x_u, r_u))],   (6)

and the update of φ is achieved by maximizing

E_{(x_u, r_u*)∼D}[log R_φ(x_u, r_u*)] + E_{r_u∼π_θ}[log(1 − R_φ(x_u, r_u))].   (7)

Both updating steps can be solved effectively via stochastic gradient descent.
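One concrete reading of the alternating updates, assuming the estimator outputs rewards in (0, 1), is the following pair of GAN-style objectives. The exact loss forms in the paper may differ, so treat these as an illustrative assumption rather than the definitive implementation.

```python
import numpy as np

def estimator_objective(R_expert, R_model):
    """Maximized w.r.t. the reward estimator: high reward on ground-truth
    (expert) items, low reward on the recommender's own predictions."""
    return np.mean(np.log(R_expert)) + np.mean(np.log1p(-R_model))

def recommender_loss(sup_loss, R_model):
    """Minimized w.r.t. the recommender and feedback generator: a supervised
    term (e.g., an ELBO-based loss) plus the adversarial term."""
    return sup_loss + np.mean(np.log1p(-R_model))
```

An estimator that separates expert from model rewards (say 0.9 vs. 0.1) scores strictly higher under `estimator_objective` than one that cannot tell them apart (0.5 vs. 0.5), which is exactly the "sharpened criterion" described above.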

3.2 Unrolling for learning and inference

Because the proposed framework contains a closed loop among learnable modules, during training we unroll the loop and let the recommender interact with the virtual user for T steps. Specifically, at the initial step, the recommender takes the observed user behavior x_u and an all-zero initial feedback embedding e_u^0 to make recommendations. At each step t, the recommender passes the items predicted from x_u and e_u^{t-1} to the virtual user, and receives the feedback embedding e_u^t. The loss is defined according to the output of the last step, i.e., r_u^T, and the modules are updated accordingly. After the model is learned, in the testing phase we infer the recommended items in the same manner, unrolling the feedback loop and deriving r_u^T as the final recommendation. The unrolling process is illustrated in Figure 2, and the detailed scheme of our learning algorithm is shown in Algorithm 1 in the appendix.

4 CF-SFL as Inverse Reinforcement Learning

Our CF-SFL framework automatically discovers informative user feedback as side information and gradually improves the training of the recommender. Theoretically, it closely connects with inverse reinforcement learning (IRL). Specifically, we jointly learn the reward function R_φ and the policy (recommender) π_θ from the expert trajectories D (the observed labeled data). D typically consists of state-action pairs generated from an expert policy π_E with the corresponding environment dynamics P. The goal of IRL is to recover the optimal reward function as well as the corresponding recommender. Formally, IRL is defined as:

IRL(π_E) = argmax_R ( min_π −E_π[R(s, r)] ) + E_{π_E}[R(s, r)].   (8)

Intuitively, the objective enforces the expert policy to induce higher rewards (the E_{π_E}[R(s, r)] part) than all other policies. This objective is sub-optimal if the expert trajectories are noisy, i.e., the expert is not perfect and its trajectories are not optimal; in that case, the learned policy always performs worse than the expert one. Besides, this ill-defined IRL objective often admits multiple solutions due to the flexible solution space, i.e., one can assign an arbitrary reward to trajectories not from the expert, as long as these trajectories yield lower rewards than the expert trajectories. To alleviate these issues, some constraints are placed on the objective, e.g., a convex reward-function regularizer Ω, which yields:

IRL_Ω(π_E) = argmax_R −Ω(R) + ( min_π −E_π[R(s, r)] ) + E_{π_E}[R(s, r)].   (9)
To imitate the expert policy and provide better generalization, we adopt the adversarial regularizer of Ho and Ermon (2016), which defines Ω with the following form:

Ω(R) = E_{π_E}[g(R(s, r))],

where g places a low penalty on reward functions that assign a sufficiently positive value to expert state-action pairs; however, if R_φ assigns a low value (close to zero, which is the lower bound) to the expert pairs, then the regularizer penalizes R_φ heavily. With the induced adversarial regularizer, we obtain a new imitation learning algorithm for the recommender:

min_θ Ω*(ρ_{π_θ} − ρ_{π_E}),   (10)

where Ω* is the convex conjugate of Ω and ρ_π denotes the occupancy measure of policy π.
Intuitively, we want to find a saddle point of the expression:

min_θ max_φ  E_{(s,r)∼π_θ}[log(1 − R_φ(s, r))] + E_{(s,r)∼π_E}[log R_φ(s, r)],   (11)

where s denotes the state and π_E the expert policy. Note that equation (11) is derived from the objective of traditional IRL. However, distinct from the traditional approach, we propose a feedback generator to provide feedback to the recommender. In terms of the reward estimator, it tends to assign lower rewards to the results predicted by the recommender and higher rewards to the expert policy π_E, aiming to discriminate π_θ from π_E:

max_φ  E_{(s,r)∼π_E}[log R_φ(s, r)] + E_{(s,r)∼π_θ}[log(1 − R_φ(s, r))].   (12)
Similar to standard IRL, we update the generator (i.e., the recommender together with the feedback generator) to maximize the expected reward with respect to R_φ, moving towards expert-like regions of the user-item space. In practice, we incorporate the feedback embedding e_u to update the user preference, and the objective of the recommender is:

max_θ  E_{(s,r)∼π_θ}[log R_φ(s, r)],   (13)

where s = [x_u; e_u].

5 Related Work

Collaborative Filtering. Collaborative filtering (CF) can be roughly categorized into two groups: CF with implicit feedback Bayer et al. (2017); Hu et al. (2008) and CF with explicit feedback Koren (2008); Liu et al. (2010). In implicit CF, user-item interactions are binary in nature (i.e., 1 if clicked and 0 otherwise), as opposed to explicit CF, where item ratings (e.g., 1-5 stars) are typically the subject of interest. Implicit CF has been widely studied, with examples including factorization of user-item interactions He et al. (2016); Koren (2008); Liu et al. (2016); Rendle (2010); Rennie and Srebro (2005) and ranking-based approaches Rendle et al. (2009). Our CF-SFL is a new framework for implicit CF.

Currently, neural-network-based models have achieved state-of-the-art performance for various recommender systems Cheng et al. (2016); He et al. (2018, 2017); Zhang et al. (2018); Liang et al. (2018). Among these methods, NCF He et al. (2017) casts the well-established matrix factorization algorithm into an entirely neural framework, combining the shallow inner-product-based learner with a series of stacked nonlinear transformations. This method outperforms various traditional baselines and has motivated many follow-up works such as NFM He et al. (2017), DeepFM Guo et al. (2017), and Wide & Deep Cheng et al. (2016). Recently, deep learning approaches Wang et al. (2016b, c), especially deep generative models Chen et al. (2017); Yang et al. (2019); Wang et al. (2018b, 2019a, 2019b), have achieved remarkable success. The VAE of Liang et al. (2018) uses variational inference to scale the algorithm to large-scale datasets and has shown significant success in recommender systems through the use of a multinomial likelihood. Our CF-SFL is a general framework that can adapt to these models seamlessly.

RL in Collaborative Filtering. Among RL-based methods, contextual multi-armed bandits were first utilized to model the interactive nature of recommender systems. Thompson Sampling (TS) Chapelle and Li (2011); Kveton et al. (2015); Zhang et al. (2017) and the Upper Confidence Bound (UCB) Li et al. (2010) are used to balance the trade-off between exploration and exploitation. Zhao et al. (2013) combine matrix factorization with bandits to include latent vectors of items and users for better exploration. The MDP-based CF model can be viewed as a partially observable MDP (POMDP) with partial observation of user preferences Sunehag et al. (2015). Value-function approximation and policy-based optimization can be employed to solve the MDP. Zheng et al. (2018) and Taghipour and Kardan (2008) modeled web-page recommendation as a Q-learning problem and learned to make recommendations from web usage data. Sunehag et al. (2015) introduced agents that successfully address sequential decision problems. Zhao et al. (2018) propose a novel page-wise recommendation framework based on deep reinforcement learning. In this paper, we consider the recommendation procedure as sequential interactions between virtual users and the recommender, and leverage feedback from virtual users to improve the recommendation. Recently, Chen et al. (2019) proposed an off-policy correction technique and successfully applied it in real-world applications.

6 Experiments

                       ML-20M     Netflix    MSD
  # of users           136,677    463,435    571,355
  # of items           20,108     17,769     41,140
  # of interactions    10.0M      56.9M      33.6M
  # of held-out users  10.0K      40.0K      50.0K
  sparsity (%)         0.36

Table 1: Basic information of the considered datasets.


We investigate the effectiveness of the proposed CF-SFL framework on three benchmark datasets for recommendation systems: (i) MovieLens-20M (ML-20M), from a movie recommendation service, containing tens of millions of user-movie ratings; (ii) Netflix-Prize (Netflix), another user-movie ratings dataset, collected for the Netflix Prize Bennett and Lanning (2007); and (iii) Million Song Dataset (MSD), a user-song rating dataset released as part of the Million Song Dataset Bertin-Mahieux et al. (2011). To directly compare with existing work, we employ the same pre-processing procedure as Liang et al. (2018). Summary statistics of these datasets are provided in Table 1.

Evaluation Metrics

We employ Recall@r together with NDCG@r as the evaluation metrics for recommendation, which measure the similarity between the recommended items and the ground truth. Recall@r weights the top-r recommended items equally, while NDCG@r ranks the top-r items and emphasizes the importance of items with high ranks.
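These two metrics can be sketched as follows; the Recall@r normalization by min(r, number of held-out items) follows the convention of Liang et al. (2018), and the scores/relevant-item lists below are made-up toy data.

```python
import numpy as np

def recall_at_r(scores, relevant, r):
    """Recall@r: fraction of held-out relevant items in the top-r list,
    normalized by min(r, number of relevant items)."""
    top = np.argsort(-scores)[:r]
    return len(set(top) & set(relevant)) / min(r, len(relevant))

def ndcg_at_r(scores, relevant, r):
    """NDCG@r: rank-aware metric; hits near the top of the list count more."""
    top = np.argsort(-scores)[:r]
    hits = np.array([1.0 if i in set(relevant) else 0.0 for i in top])
    discounts = 1.0 / np.log2(np.arange(2, r + 2))  # 1/log2(rank+1)
    dcg = float(hits @ discounts)
    ideal = float(discounts[:min(r, len(relevant))].sum())
    return dcg / ideal

scores = np.array([0.9, 0.1, 0.8, 0.3, 0.2])  # model scores for 5 items
relevant = [0, 3]                              # held-out ground-truth items
print(recall_at_r(scores, relevant, 2), round(ndcg_at_r(scores, relevant, 2), 3))  # 0.5 0.613
```

Here the top-2 list is [0, 2]; only item 0 is relevant, so Recall@2 = 0.5, and NDCG@2 rewards the hit for occurring in the first position.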

[Table 2 lists, for each module, its stack of layers: the recommender uses tanh and ReLU layers with a softmax output; the reward estimator uses ReLU layers (x2) with a sigmoid output; the feedback generator uses ReLU and tanh layers.]

Table 2: Architecture of our CF-SFL framework.


For our CF-SFL framework, the architectures of the recommender, reward estimator, and feedback generator are shown in Table 2. To represent the user preference, we normalize x_u and e_u independently and concatenate the two into a single vector. To learn the model, we pre-train the recommender (150 epochs for ML-20M and 75 epochs for Netflix and MSD) and then optimize the entire framework (50 epochs for ML-20M and 25 epochs for the other two). ℓ2 regularization with a penalty term is applied to the recommender, and the Adam optimizer Kingma and Ba (2014) is employed for mini-batch training.


To demonstrate the superiority of our framework, we consider multiple state-of-the-art approaches as baselines, which can be categorized into two types: (i) linear models: SLIM Ning and Karypis (2011) and WMF Hu et al. (2008); and (ii) deep-neural-network-based models: CDAE Wu et al. (2016), VAE Liang et al. (2018), and aWAE Zhong and Zhang (2018). It should be noted that our CF-SFL is a generalized framework compatible with all these approaches. In particular, as shown in Table 2, we implement our recommender as the VAE-based model Liang et al. (2018) for a fair comparison. In the following experiments, we show that, beyond this setting, the recommender can be implemented with other existing models as well.

Performance Analysis

  Methods   ML-20M                    Netflix                   MSD
            R@20  R@50  NDCG@100      R@20  R@50  NDCG@100      R@20  R@50  NDCG@100
  SLIM      0.370 0.495 0.401         0.347 0.428 0.379         -     -     -
  WMF       0.360 0.498 0.386         0.316 0.404 0.351         0.211 0.312 0.257
  CDAE      0.391 0.523 0.418         0.343 0.428 0.376         0.188 0.283 0.237
  aWAE      0.391 0.532 0.424         0.354 0.441 0.381         -     -     -
  VAE       0.395 0.537 0.426         0.351 0.444 0.386         0.266 0.364 0.316
  VAE†      0.395 0.535 0.425         0.350 0.444 0.386         0.260 0.356 0.311
  VAE‡      0.396 0.536 0.426         0.352 0.445 0.387         0.263 0.360 0.314
  CF-SFL    0.404 0.542 0.435         0.355 0.449 0.394         0.273 0.369 0.323
Table 3: Performance comparison between our CF-SFL framework and various baselines. VAE† denotes results based on our own runs; VAE‡ is the VAE model with our reward estimator.
Figure 3: Performance (NDCG@100) boosting on the validation sets.

All the evaluation metrics are averaged across all the test sets.

(i) Quantitative results: We test various methods and report the results in Table 3. With the proposed CF-SFL framework, we significantly improve the performance of the baselines on all the evaluation metrics. These experimental results demonstrate the power of the proposed CF-SFL framework, which provides informative feedback as side information. In particular, we observe that the performance of the base model (VAE) is similar to that of its variant with the reward estimator. This implies that simply learning feedback from the reward estimator via back-propagation is inefficient. Compared with such a naive strategy, the proposed CF-SFL provides more informative feedback to the recommender and is thus able to improve recommendation results more effectively.

Figure 4: Blue curves summarize NDCG@100 and red curves report the computational cost of model inference in each epoch. In each sub-figure, we vary the time steps T from 0 to 12 (T=0 is the base recommender).

(ii) Learning comparison: In Figure 3, we show the training trajectories of the baselines (VAE, VAE + reward estimator) and CF-SFL with multiple time steps. There are several interesting findings. (a) The performance of the base VAE does not improve after the pre-training steps, e.g., 75 epochs for Netflix. In comparison, the proposed CF-SFL framework further improves the performance once the whole model is triggered. (b) CF-SFL yields fast convergence once the whole framework is activated. (c) Consistent with the results in Table 3, the trajectory of the VAE with the reward estimator in Figure 3 is similar to that of the base VAE. In contrast, the trajectories of our CF-SFL methods are smoother and converge to a better local minimum. This phenomenon further verifies that our CF-SFL learns informative user feedback with better stability. (d) As the number of time steps increases within a particular range (e.g., on ML-20M), CF-SFL converges faster and performs better. One possible explanation lies in learning with our unrolled structure — parameters are shared across different time steps, and a more accurate gradient is found towards the local minimum. (e) We find that ML-20M and MSD are more sensitive to the choice of T than Netflix. Therefore, the choice of T should be adjusted per dataset.

(iii) CF-SFL with dynamic time steps: As shown in Figure 2, the learning of CF-SFL involves a recurrent structure with T time steps. We investigate the choice of T and report its influence on the performance of our method. Specifically, the NDCG@100 for different T's is shown in Figure 4. Within 6 time steps, CF-SFL consistently boosts the performance on all three datasets. Even with larger time steps, the results remain stable. Additionally, the inference time of CF-SFL is linear in the number of time steps T. To achieve a trade-off between performance and efficiency, in our experiments we fix T per dataset, using one value for ML-20M and Netflix and another for MSD.

  Recommender     w/o CF-SFL   w/ CF-SFL   Gain (×10⁻³)
  WARP            0.31228      0.33987     +27.59
  MF              0.41587      0.41902     +3.15
  DAE             0.42056      0.42307     +2.51
  VAE             0.42546      0.43472     +9.26
  VAE-(Gaussian)  0.42019      0.42751     +7.32
  VAE-()          0.42027      0.42539     +5.02
  VAE-Linear      0.41563      0.41597     +0.34

Table 4: Performance of our CF-SFL with various recommenders.

Generalization Study

As aforementioned, our CF-SFL is a generalized framework compatible with many existing collaborative filtering approaches. We study the usefulness of CF-SFL on different recommenders and present the results in Table 4. Specifically, two types of recommenders are considered: linear approaches, such as WARP Weston et al. (2011) and MF Hu et al. (2008), and deep learning methods, e.g., DAE Liang et al. (2018) and variations of the VAE in Liang et al. (2018). We find that our CF-SFL is capable of generalizing to most of the existing collaborative filtering approaches and boosts their performance accordingly. The gains achieved by CF-SFL may vary depending on the choice of recommender.

7 Conclusion

We propose the CF-SFL framework to simulate user feedback. It constructs a virtual user to provide informative side information as user feedback. Mathematically, we formulate the framework as an IRL problem and learn the optimal policy by feeding back the action and reward. Specifically, a recurrent architecture is built to unroll the framework for efficient learning. Empirically, we improve the performance of state-of-the-art collaborative filtering methods by a non-trivial margin. Our framework serves as a practical solution making IRL feasible for large-scale collaborative filtering. It will be interesting to investigate the framework in other applications, such as sequential recommender systems.


  • D. Agarwal, B. Chen, and P. Elango (2010) Fast online learning through offline initialization for time-sensitive recommendation. In KDD, Cited by: §1.
  • I. Bayer, X. He, B. Kanagal, and S. Rendle (2017) A generic coordinate descent framework for learning from implicit feedback. In WWW, Cited by: §5.
  • J. Bennett and S. Lanning (2007) The netflix prize. In KDD cup and workshop, Cited by: §1, §6.
  • T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere (2011) The million song dataset.. In Ismir, Cited by: §6.
  • O. Chapelle and L. Li (2011) An empirical evaluation of thompson sampling. In NIPS, Cited by: §5.
  • C. Chen, C. Li, L. Chen, W. Wang, Y. Pu, and L. Carin (2017) Continuous-time flows for deep generative models. arXiv preprint arXiv:1709.01179. Cited by: §5.
  • M. Chen, A. Beutel, P. Covington, S. Jain, F. Belletti, and E. H. Chi (2019) Top-k off-policy correction for a reinforce recommender system. In WSDM, Cited by: §5.
  • H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al. (2016) Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, Cited by: §5.
  • Y. Fang and L. Si (2011) Matrix co-factorization for recommendation with rich side information and implicit feedback. In Proceedings of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems, Cited by: §1.
  • H. Guo, R. Tang, Y. Ye, Z. Li, and X. He (2017) Deepfm: a factorization-machine based neural network for ctr prediction. arXiv preprint arXiv:1703.04247. Cited by: §5.
  • X. He, X. Du, X. Wang, F. Tian, J. Tang, and T. Chua (2018) Outer product-based neural collaborative filtering. arXiv preprint arXiv:1808.03912. Cited by: §5.
  • X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua (2017) Neural collaborative filtering. In WWW, Cited by: §1, §5.
  • X. He, H. Zhang, M. Kan, and T. Chua (2016) Fast matrix factorization for online recommendation with implicit feedback. In SIGIR, Cited by: §1, §5.
  • J. Ho and S. Ermon (2016) Generative adversarial imitation learning. In NIPS, Cited by: §4.
  • Y. Hu, Y. Koren, and C. Volinsky (2008) Collaborative filtering for implicit feedback datasets. In ICDM, Cited by: §1, §2.2, §5, §6, §6.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.
  • Y. Koren and R. Bell (2015) Advances in collaborative filtering. In Recommender systems handbook, Cited by: §1.
  • Y. Koren (2008) Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD, Cited by: §5.
  • B. Kveton, C. Szepesvari, Z. Wen, and A. Ashkan (2015) Cascading bandits: learning to rank in the cascade model. In ICML, Cited by: §5.
  • L. Li, W. Chu, J. Langford, and R. E. Schapire (2010) A contextual-bandit approach to personalized news article recommendation. In WWW, Cited by: §5.
  • X. Li and J. She (2017) Collaborative variational autoencoder for recommender systems. In KDD, Cited by: §1, §2.1.
  • D. Liang, R. G. Krishnan, M. D. Hoffman, and T. Jebara (2018) Variational autoencoders for collaborative filtering. WWW. Cited by: §1, §2.1, §2.2, §3.1, §5, §6, §6, §6.
  • N. N. Liu, E. W. Xiang, M. Zhao, and Q. Yang (2010) Unifying explicit and implicit feedback for collaborative filtering. In CIKM, Cited by: §5.
  • Y. Liu, P. Zhao, X. Liu, M. Wu, and X. Li (2016) Learning optimal social dependency for recommendation. arXiv preprint arXiv:1603.04522. Cited by: §5.
  • A. K. Menon, K. Chitrapura, S. Garg, D. Agarwal, and N. Kota (2011) Response prediction using collaborative filtering with hierarchies and side-information. In KDD, Cited by: §1.
  • R. Mishra, P. Kumar, and B. Bhasker (2015) A web recommendation system considering sequential information. Decision Support Systems. Cited by: §2.1.
  • X. Ning and G. Karypis (2011) Slim: sparse linear methods for top-n recommender systems. In ICDM, Cited by: §6.
  • S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme (2009) BPR: bayesian personalized ranking from implicit feedback. In UAI, Cited by: §5.
  • S. Rendle and L. Schmidt-Thieme (2008) Online-updating regularized kernel matrix factorization models for large-scale recommender systems. In Recsys, Cited by: §1.
  • S. Rendle (2010) Factorization machines. In ICDM, Cited by: §5.
  • J. D. Rennie and N. Srebro (2005) Fast maximum margin matrix factorization for collaborative prediction. In ICML, Cited by: §5.
  • B. Sarwar, G. Karypis, J. Konstan, and J. Riedl (2001) Item-based collaborative filtering recommendation algorithms. In WWW, Cited by: §1.
  • M. Schedl (2016) The lfm-1b dataset for music retrieval and recommendation. In ICMR, Cited by: §1.
  • P. Sunehag, R. Evans, G. Dulac-Arnold, Y. Zwols, D. Visentin, and B. Coppin (2015) Deep reinforcement learning with attention for slate markov decision processes with high-dimensional states and actions. arXiv preprint arXiv:1512.01124. Cited by: §5.
  • N. Taghipour and A. Kardan (2008) A hybrid web recommender system based on q-learning. In SAC, Cited by: §5.
  • Q. Wang, H. Yin, Z. Hu, D. Lian, H. Wang, and Z. Huang (2018a) Neural memory streaming recommender networks with adversarial training. In KDD, Cited by: §1.
  • W. Wang, H. Yin, S. Sadiq, L. Chen, M. Xie, and X. Zhou (2016a) Spore: a sequential personalized spatial item recommender system. In ICDE, Cited by: §2.1.
  • W. Wang, C. Chen, W. Chen, P. Rai, and L. Carin (2016b) Deep metric learning with data summarization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 777–794. Cited by: §5.
  • W. Wang, C. Chen, W. Wang, P. Rai, and L. Carin (2016c) Earliness-aware deep convolutional networks for early time series classification. arXiv preprint arXiv:1611.04578. Cited by: §5.
  • W. Wang, Z. Gan, H. Xu, R. Zhang, G. Wang, D. Shen, C. Chen, and L. Carin (2019a) Topic-guided variational autoencoders for text generation. arXiv preprint arXiv:1903.07137. Cited by: §5.
  • W. Wang, Y. Pu, V. K. Verma, K. Fan, Y. Zhang, C. Chen, P. Rai, and L. Carin (2018b) Zero-shot learning via class-conditioned deep generative models. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §5.
  • W. Wang, C. Tao, Z. Gan, G. Wang, L. Chen, X. Zhang, R. Zhang, Q. Yang, R. Henao, and L. Carin (2019b) Improving textual network learning with variational homophilic embeddings. arXiv preprint arXiv:1909.13456. Cited by: §5.
  • J. Weston, S. Bengio, and N. Usunier (2011) Wsabie: scaling up to large vocabulary image annotation. In IJCAI, Cited by: §6.
  • Y. Wu, C. DuBois, A. X. Zheng, and M. Ester (2016) Collaborative denoising auto-encoders for top-n recommender systems. In ICDM, Cited by: §1, §2.2, §6.
  • C. Yang, L. Bai, C. Zhang, Q. Yuan, and J. Han (2017) Bridging collaborative filtering and semi-supervised learning: a neural approach for POI recommendation. In KDD, Cited by: §1.
  • Q. Yang, Z. Huo, D. Shen, Y. Cheng, W. Wang, G. Wang, and L. Carin (2019) An end-to-end generative architecture for paraphrase generation. Cited by: §5.
  • R. Zhang, C. Li, C. Chen, and L. Carin (2017) Learning structural weight uncertainty for sequential decision-making. arXiv preprint arXiv:1801.00085. Cited by: §5.
  • S. Zhang, L. Yao, A. Sun, S. Wang, G. Long, and M. Dong (2018) NeuRec: on nonlinear transformation for personalized ranking. arXiv preprint arXiv:1805.03002. Cited by: §5.
  • X. Zhao, L. Zhang, Z. Ding, L. Xia, J. Tang, and D. Yin (2018) Recommendations with negative feedback via pairwise deep reinforcement learning. arXiv preprint arXiv:1802.06501. Cited by: §5.
  • X. Zhao, W. Zhang, and J. Wang (2013) Interactive collaborative filtering. In CIKM, Cited by: §5.
  • G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. J. Yuan, X. Xie, and Z. Li (2018) DRN: a deep reinforcement learning framework for news recommendation. In WWW, Cited by: §5.
  • J. Zhong and X. Zhang (2018) Wasserstein autoencoders for collaborative filtering. arXiv preprint arXiv:1809.05662. Cited by: §6.

Appendix A Appendix

1:  Input: a user-item matrix with its observed user-item pairs, the unrolling step, and the batch size.
2:  Output: recommender, reward estimator, and feedback generator.
3:  Initialization: randomly initialize the recommender, the reward estimator, and the feedback generator;
/* stage 1: pretrain the recommender */
4:  while not converged do
5:     Sample a batch of user-item pairs;
6:     Update the recommender by minimizing its reconstruction loss.
7:  end while
/* stage 2: pretrain the reward estimator */
8:  while not converged do
9:     Sample a batch of observed user-item pairs and calculate their rewards;
10:     Sample another batch of users and initialize their feedback embeddings;
11:     Infer the recommended items and calculate their rewards;
12:     Update the reward estimator by maximizing (8).
13:  end while
/* stage 3: alternately train all the modules */
14:  while not converged do
15:     Sample a batch of observations and initialize the feedback embedding;
/* update recommender and feedback generator */
16:     Feed the observations and the feedback into the recommender and unroll the recurrent structure for the given number of steps.
17:     Collect the corresponding rewards.
18:     Update the recommender and the feedback generator by minimizing (7).
/* update the reward estimator */
19:     Sample a batch of observed user-item pairs and calculate their rewards;
20:     Sample a batch of users, infer the recommended items, and calculate their rewards;
21:     Update the reward estimator by maximizing (8).
22:  end while
Algorithm 1 CF-SFL training with stochastic optimization
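The three-stage procedure above can be sketched end-to-end in code. The sketch below is a toy stand-in rather than the paper's implementation: it uses plain matrix factorization as the recommender, a linear critic as the reward estimator, and a simple additive feedback embedding; all names, shapes, learning rates, and update rules are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy user-item matrix (shapes and symbols here are illustrative).
n_users, n_items, dim, lr = 8, 12, 4, 0.05
X = (rng.random((n_users, n_items)) < 0.3).astype(float)

def top_k_profile(scores, k=3):
    """Binary profile of the k highest-scoring items (stand-in recommender output)."""
    prof = np.zeros_like(scores)
    prof[np.argsort(scores)[-k:]] = 1.0
    return prof

# Stage 1: pretrain the recommender (plain matrix factorization as a
# stand-in for the paper's collaborative-filtering module).
U = rng.normal(scale=0.1, size=(n_users, dim))
V = rng.normal(scale=0.1, size=(n_items, dim))
for _ in range(200):
    err = U @ V.T - X
    U, V = U - lr * err @ V, V - lr * err.T @ U

# Stage 2: pretrain the reward estimator (a linear critic trained to score
# observed behavior above recommended-item profiles).
w = np.zeros(n_items)
recs = np.stack([top_k_profile(s) for s in U @ V.T])
for _ in range(200):
    w -= 0.1 * (recs - X).mean(axis=0)  # push real behavior up, recommendations down

# Stage 3: alternately update all modules, unrolling the recommender with
# synthetic feedback before each update.
T = 2  # unrolling steps
for _ in range(50):
    feedback = np.zeros((n_users, n_items))
    for _ in range(T):
        scores = U @ V.T + feedback               # fuse behavior with feedback
        recs = np.stack([top_k_profile(s) for s in scores])
        reward = recs @ w                         # estimated reward per user
        feedback = 0.5 * feedback + 0.5 * recs * reward[:, None]
    err = U @ V.T - X
    U, V = U - lr * err @ V, V - lr * err.T @ U   # recommender update
    w -= 0.1 * (recs - X).mean(axis=0)            # reward-estimator update

print("mean reconstruction error:", float(np.abs(U @ V.T - X).mean()))
```

The closed loop appears in stage 3: the recommender's output is scored by the reward estimator, converted into synthetic feedback, and fed back into the recommender before the next parameter update.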

A.1 Fusion function

Here we give a detailed description of the proposed fusion function. A straightforward way to build it is to concatenate the observed user behavior with the recommended items and feed the result into a linear layer that learns a lower-dimensional representation. In practice, however, this is infeasible: the item dimension is extremely high, and concatenation makes the problem even worse. To this end, we introduce a sparse layer containing a lookup table W. Writing x for the observed user behavior and r for the recommended items (both binary vectors over the items), once r has been inferred from the observation x we build the fused input as

z = (1/n) Σ_i δ(s_i = 1) W_i,   with s_i = max(x_i, r_i),

where δ(·) is the Dirac delta function, taking value 1 if s_i = 1, n is the number of 1s in s, and W_i is the i-th row of the lookup table. The parameters of the lookup table are learned automatically during the training phase. We show an example illustrating the working scheme of the proposed fusion function in Figure 5.

The benefits of the proposed approach are twofold: 1) it reduces the computational cost of the standard linear transformation under the general sparse setup and saves parameters in our adversarial learning framework; 2) the lookup table is shared between the observed behavior and the recommended items, building a unified space for a user's existing and missing preferences. Empirically, such shared knowledge boosts the performance of our CF-SFL framework.
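As a concrete illustration, the sparse fusion layer can be sketched as follows. This is a hypothetical reading of the layer, assuming the fused input averages the shared lookup-table rows of the items active in either the observed behavior or the recommendation; the names `fuse`, `W`, `x`, and `r` are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

n_items, d = 10, 4                 # item-vocabulary size and embedding size
W = rng.normal(size=(n_items, d))  # shared lookup table

def fuse(x, r, W):
    """Average the lookup-table rows of the items active in the union of the
    observed behavior x and the recommended items r (both 0/1 vectors),
    instead of materializing the high-dimensional concatenation."""
    active = np.flatnonzero(np.maximum(x, r))  # indices with a 1 in x or r
    return W[active].mean(axis=0)

x = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0], dtype=float)  # observed items
r = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0], dtype=float)  # recommended items
z = fuse(x, r, W)
print(z.shape)  # → (4,): a fixed low-dimensional fused input
```

Only the handful of active rows of the lookup table are touched, so the cost scales with the number of active items rather than with the full item dimension.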

Figure 5: An example of the working scheme of our fusion function. The user behavior and the recommended items share the same lookup table, whose active rows are averaged to form the fused input for the given example. This method works efficiently when the inputs are sparse.