Over the past decade, recommender systems have shown great effectiveness and become an integral part of our daily lives. Recommendation is by nature an interactive process: a recommender agent suggests items based on the user profile; users provide feedback on the suggested items; the agent updates the user profile and makes further recommendations. This interactive recommendation paradigm has been widely deployed in real-world systems (e.g., personalized music recommendation in Spotify (https://www.spotify.com/), product recommendation in Amazon (https://www.amazon.com/), and image recommendation in Pinterest (https://www.pinterest.com/)) and has attracted a lot of interest from the research community (Steck et al., 2015; Zou et al., 2020).
A key challenge in interactive recommendation is to suggest items with insufficient observations, especially in interactive collaborative filtering, where there is no content data to represent users and items and the only observations are users' ratings (Zhao et al., 2013). This poses a "chicken-or-the-egg" problem: satisfying recommendations require adequate observations of the user's preferences. The problem is also unavoidable, because we have only partial observations, or none at all, for cold-start users and for warm-start users whose tastes drift, and these constitute the main user group. Therefore, a persistent and critical problem in interactive collaborative filtering is how to quickly capture a user's interests without compromising his/her recommendation experience, i.e., how to balance the goal of learning the user profile (exploration) against that of making accurate recommendations (exploitation).
Existing approaches mainly study this problem in two directions: (1) multi-armed bandit (MAB) approaches and (2) meta-learning approaches. (1) The MAB approaches formulate the problem as multi-armed bandits or contextual bandits and solve it with intricate exploration strategies, such as GLM-UCB and Thompson Sampling (Li et al., 2010; Chapelle and Li, 2011; Zhao et al., 2013). However, to achieve provably low regret bounds, these approaches optimize the recommendations for the worst case, which results in overly pessimistic recommendations that may fail to achieve the overall optimal performance. Additionally, these methods are usually computationally intractable for non-linear models, which greatly limits their usage in recent advanced deep models (Cheng et al., 2016; He et al., 2017). (2) More recently, meta-learning approaches, which can quickly adapt a model to newly encountered tasks, have been leveraged to solve the cold-start recommendation problem. Existing methods treat suggesting items to different users as different tasks and aim to learn a learning algorithm that can quickly identify user preferences after observing a small set of recommendations, i.e.,
the support set. The meta-learning perspective is appealing since it avoids the complexity of hand-designing sophisticated exploration policies and lets us take advantage of deep neural networks. However, these approaches ignore performance on the support set itself, which may lead to the recommendation of highly irrelevant items and a terrible user experience during the phase of constructing the support set. Even worse, these methods perform poorly when faced with drifting user tastes or a poor-quality support set, owing to their inability to actively explore users' interests and their excessive dependence on the heuristically selected support set.
Rather than hand-designing sophisticated exploration policies, we propose a framework named neural interactive collaborative filtering (NICF), which regards interactive collaborative filtering as a meta-learning problem and attempts to learn a neural exploration policy that can adaptively select recommendations so as to balance exploration and exploitation for different users. In our method, the exploration policy is structured as a sequential neural network consisting of two parts. The first part embeds the user profile by feeding past recommendations and user feedback into multi-channel stacked self-attention blocks, which separately capture the information in different kinds of user feedback. The second part, the policy layer, generates the recommendation with a multi-layer perceptron. The sequential neural network can therefore update the user profile based on the historical recommendations, and the exploration policy is encoded in the weights of the network. In this work, we propose to directly optimize the weights of the exploration policy by maximizing users' overall satisfaction throughout the recommendation journey with an efficient reinforcement learning (RL) algorithm. This is meaningful in two respects: (1) the ultimate goal of exploration/exploitation is to maximize users' overall engagement during the interactive recommendation; (2) from the perspective of reinforcement learning, the satisfying recommendations triggered by an exploratory recommendation can be viewed as an exploration bonus (delayed reward) for its contribution to improving the quality of the user profile. Optimizing the sum of immediate and delayed rewards thus amounts to balancing the rewards for providing accurate personalized recommendations against the rewards for exploring the user's interests, which can be effectively solved by RL.
By doing so, the learned exploration policy can act as the learning process for interactive recommendation and constantly adapt its strategy when deployed on cold-start or warm-start recommendation (analyzed in Section 4.5).
NICF exhibits the following desirable features: (1) It avoids the over-pessimism and complexity of existing hand-designed exploration policies for interactive collaborative filtering. (2) It can be incorporated with any advanced deep recommendation model (Wang et al., 2015; Cheng et al., 2016), which can capture much richer non-linear user-item interactions. (3) By balancing the goals of exploration and exploitation, it alleviates the risk of losing users caused by the full exploitation of existing meta-learning methods. Lastly, to verify its advantage over the state of the art, we conduct extensive experiments and analysis on three benchmark datasets (MovieLens: https://grouplens.org/datasets/movielens/, EachMovie: https://www.librec.net/datasets.html, and Netflix: https://www.kaggle.com/netflix-inc/netflix-prize-data). The experimental results demonstrate its significant advantage over state-of-the-art methods and illustrate the knowledge learned by NICF.
Our main contributions presented in this paper are as follows:
We formally propose to employ reinforcement learning to solve the cold-start and warm-start recommendation under the interactive collaborative filtering setting.
We propose to represent the exploration policy with multi-channel stacked self-attention neural networks and learn the policy network by maximizing users’ satisfaction.
We perform extensive experiments on three real-world benchmark datasets to demonstrate the effectiveness of our NICF approach and the knowledge learned by it.
In this section, we first formalize interactive collaborative filtering from the perspective of the multi-armed bandit and then briefly recapitulate the widely used approaches and their limitations.
2.1. A Multi-Armed Bandit Formulation
In a typical recommender system, we have a set of users $\mathcal{U}$ and a set of items $\mathcal{I}$. The users' feedback on items can be represented by a preference matrix $\mathbf{R}$, where $r_{u,i}$ is the preference for item $i$ by user $u$. Here, $r_{u,i}$ can be either explicitly provided by the user in the form of a rating, like/dislike, etc., or inferred from implicit interactions such as views, plays, and purchases. In the explicit setting, $\mathbf{R}$ typically contains graded relevance (e.g., 1-5 ratings), while in the implicit setting it is often binary. Without loss of generality, we consider the following process in discrete timesteps. At each timestep $t = 1, \dots, T$, the system delivers an item $i_t$ to the target user $u$; the user then gives feedback $r_{u,i_t}$, i.e., the feedback collected by the system from user $u$ on the recommended item $i_t$ at timestep $t$. In other words, $r_{u,i_t}$ is the "reward" collected by the system from the target user. After receiving the feedback, the system updates its model and decides which item to recommend next. Let us denote by $s_t = \{i_1, r_{u,i_1}, \dots, i_{t-1}, r_{u,i_{t-1}}\}$ the available information (the support set) the system has for the target user at timestep $t$.
Then, the item $i_t$ is selected according to a policy $\pi$, which is defined as a function from the current support set to the selected item, $i_t = \pi(s_t)$. In the interactive recommendation process, the total $T$-trial payoff of $\pi$ is defined as $\sum_{t=1}^{T} r_{u,i_t}$. For any user $u$, our goal is to design a policy $\pi$ so that the expected total payoff $G_T(\pi) = \mathbb{E}_\pi\left[\sum_{t=1}^{T} r_{u,i_t}\right]$ is maximized.
Similarly, we can define the optimal expected $T$-trial payoff as $G_T^* = \mathbb{E}\left[\sum_{t=1}^{T} r_{u,i_t^*}\right]$, where $i_t^*$ is the optimal recommendation with maximum expected reward at timestep $t$. Usually, in MAB, we would like to minimize the regret, defined as $G_T^* - G_T(\pi)$. In a recommender system, however, it is more intuitive to directly maximize the cumulative reward $G_T(\pi)$, which is equivalent to minimizing the regret.
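To make the $T$-trial payoff and regret concrete, here is a minimal simulation sketch; all names are hypothetical, and rewards are assumed Bernoulli for illustration:

```python
import numpy as np

def run_policy(policy, true_probs, T, rng):
    """Simulate T trials of an interactive session; the reward for
    recommending item i is assumed to be Bernoulli(true_probs[i])."""
    support = []                              # the support set s_t: (item, reward) pairs
    payoff = 0.0
    for _ in range(T):
        item = policy(support)                # i_t = pi(s_t)
        reward = float(rng.random() < true_probs[item])
        support.append((item, reward))
        payoff += reward
    return payoff                             # total T-trial payoff of pi

rng = np.random.default_rng(0)
true_probs = np.array([0.2, 0.8, 0.5])
always_worst = lambda support: 0              # a deliberately bad policy
payoff = run_policy(always_worst, true_probs, T=100, rng=rng)
regret = 100 * true_probs.max() - payoff      # optimal expected payoff minus achieved payoff
```

Maximizing the cumulative payoff and minimizing this regret are two views of the same objective.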
2.2. Multi-Armed Bandit Based Approaches
Currently, the exploration techniques in interactive collaborative filtering are mainly based on probabilistic matrix factorization (PMF) (Mnih and Salakhutdinov, 2008). Here, $\mathbf{p}_u \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$ and $\mathbf{q}_i \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$ are the user and item feature vectors with a zero-mean Gaussian prior distribution, and $\sigma^2$ is the prior variance. During the learning procedure, current approaches, as shown in Figure 1(a), iterate between two steps: (1) obtaining the posterior distributions of the user and item feature vectors after the $t$-th interaction, denoted $\mathcal{N}(\mathbf{p}_{u,t}, \boldsymbol{\Sigma}_{u,t})$ and $\mathcal{N}(\mathbf{q}_{i,t}, \boldsymbol{\Sigma}_{i,t})$; the mean and variance terms can be obtained via MCMC-Gibbs sampling (see (Zhao et al., 2013)); (2) heuristically selecting the item for the $(t+1)$-th recommendation with the aim of maximizing the cumulative reward. Two main strategies have been explored to select the items in interactive collaborative filtering:
Thompson Sampling (Chapelle and Li, 2011)
At timestep $t$ for user $u$, this method suggests the item with the maximum sampled value, $i_t = \arg\max_i \tilde{\mathbf{p}}_u^\top \tilde{\mathbf{q}}_i$, where $\tilde{\mathbf{p}}_u$ and $\tilde{\mathbf{q}}_i$ are sampled from the posterior distributions of the user and item feature vectors (Kaufmann et al., 2012).
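A minimal sketch of this sampling step (hypothetical names and toy posteriors; the actual posteriors come from PMF):

```python
import numpy as np

def thompson_select(p_mean, p_cov, q_means, q_covs, rng):
    """Draw user and item vectors from their Gaussian posteriors and
    recommend the item with the maximal sampled score p~ . q~."""
    p = rng.multivariate_normal(p_mean, p_cov)            # sampled user vector
    scores = [p @ rng.multivariate_normal(mu, cov)        # sampled item scores
              for mu, cov in zip(q_means, q_covs)]
    return int(np.argmax(scores))

rng = np.random.default_rng(1)
p_mean, p_cov = np.array([1.0, 1.0]), 0.01 * np.eye(2)
q_means = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]  # item 1 is clearly disliked
q_covs = [0.01 * np.eye(2), 0.01 * np.eye(2)]
item = thompson_select(p_mean, p_cov, q_means, q_covs, rng)
```

Because the sampled score grows with posterior uncertainty, rarely shown items occasionally win the argmax, which is exactly the source of exploration in Thompson Sampling.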
Upper Confidence Bound
It is based on the principle of optimism in the face of uncertainty: choose the item plausibly liked by the user. (Zhao et al., 2013) designs a general solution, the Generalized Linear Model Bandit-Upper Confidence Bound (GLM-UCB), which combines UCB with PMF as
$$ i_t = \arg\max_i \; \rho(\mathbf{p}_{u,t}^\top \mathbf{q}_{i,t}) + c\sqrt{\log t}\,\|\mathbf{q}_{i,t}\|_{2,\boldsymbol{\Sigma}_{u,t}}, $$
where $\rho(\cdot)$ is the sigmoid function, $\rho(x) = 1/(1+e^{-x})$, $c$ is a constant with respect to $t$, and $\|\mathbf{q}_{i,t}\|_{2,\boldsymbol{\Sigma}_{u,t}} = \sqrt{\mathbf{q}_{i,t}^\top \boldsymbol{\Sigma}_{u,t} \mathbf{q}_{i,t}}$ is a 2-norm based on $\boldsymbol{\Sigma}_{u,t}$, which measures the uncertainty of the estimated rating at the $t$-th interaction.
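The selection rule can be sketched as follows (a toy illustration with hypothetical values; `inv_gram` stands in for the posterior covariance of the user vector):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glm_ucb_select(p_u, q_items, inv_gram, t, c=1.0):
    """Score each item by sigmoid(p^T q) + c*sqrt(log t)*||q||_{Sigma}
    and pick the maximiser."""
    scores = []
    for q in q_items:
        width = np.sqrt(q @ inv_gram @ q)     # uncertainty of the estimated rating
        scores.append(sigmoid(p_u @ q) + c * np.sqrt(np.log(t)) * width)
    return int(np.argmax(scores))

p_u = np.array([1.0, 0.0])
q_items = [np.array([0.5, 0.0]), np.array([0.0, 0.5])]
# Item 0 has been shown often (low uncertainty); item 1 never (high uncertainty).
inv_gram = np.diag([0.01, 10.0])
choice = glm_ucb_select(p_u, q_items, inv_gram, t=10)
```

Here the uncertain item wins despite a lower estimated rating, illustrating how the confidence bonus drives exploration.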
The above-discussed approaches show the possible limitations of MAB-based methods: (1) Owing to the difficulty of updating the posterior distribution for non-linear models, they are only applicable to linear user-item interaction models, which greatly limits their usage with effective neural-network-based models (He et al., 2017; Xue et al., 2017). (2) Several crucial hyper-parameters (e.g., the variance $\sigma^2$ of the prior distribution and the exploration hyper-parameter $c$) are introduced, which increases the difficulty of finding the optimal recommendations. (3) The sophisticated approaches (Thompson Sampling and GLM-UCB) are potentially overly pessimistic, since they usually optimize the recommendations for the worst case in order to achieve provably good regret bounds.
2.3. Meta-learning Based Approach
Meta-learning based approaches aim to learn a learning procedure that can quickly capture users' interests after observing a small support set. As shown in Figure 1(b), we present an example framework, MELU (Lee et al., 2019), which adapts Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) for fast model adaptation to cold-start users. Specifically, assume the recommender agent is modeled with a neural network parameterized by $\theta$. MELU aims to learn an initialization $\theta$ that can identify users' interests after being updated with a small support set $s_u$. Formally, $\theta$ is learned by minimizing a loss over the support set after updating $\theta$ to $\theta_u$ as
$$ \min_\theta \sum_u \mathcal{L}(\pi_{\theta_u}), \qquad \theta_u = \theta - \alpha \nabla_\theta \mathcal{L}_{s_u}(\pi_\theta), $$
where $\pi_\theta$ is the recommendation policy parameterized by $\theta$, and $\mathcal{L}$ usually corresponds to an accuracy measure, such as MSE or cross entropy.
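A first-order sketch of this bilevel update on a least-squares toy model (MELU itself adapts a deep recommender; everything below, including the helper names, is a simplified stand-in):

```python
import numpy as np

def mse_grad(theta, X, y):
    """Gradient of the squared loss 0.5 * ||X @ theta - y||^2."""
    return X.T @ (X @ theta - y)

def maml_step(theta, tasks, alpha=0.01, beta=0.01):
    """One meta-update: adapt theta on each task's support set, then move
    the initialisation to lower the post-adaptation (query) loss.
    This is the first-order approximation of the MAML objective."""
    meta_grad = np.zeros_like(theta)
    for (Xs, ys), (Xq, yq) in tasks:                      # (support, query) per user
        theta_u = theta - alpha * mse_grad(theta, Xs, ys) # inner update
        meta_grad += mse_grad(theta_u, Xq, yq)            # outer (meta) gradient
    return theta - beta * meta_grad / len(tasks)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
w_true = np.array([1.0, -1.0, 0.5])
tasks = [((X, X @ w_true), (X, X @ w_true))]              # one toy user task
theta = np.zeros(3)
for _ in range(1000):
    theta = maml_step(theta, tasks)
final_loss = 0.5 * np.sum((X @ theta - X @ w_true) ** 2)
```

The key point is that the outer loss is evaluated after the inner adaptation, so the learned initialization is one that adapts well, not one that fits well directly.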
The meta-learning approach is appealing since it avoids the complexity of hand-designing sophisticated exploration policies and enables us to take advantage of deep neural networks. However, existing meta-learning approaches have not addressed how to select the support set without compromising the user experience. This results in two problems: (1) highly irrelevant items may be recommended, giving a terrible user experience during the phase of constructing the support set; (2) these methods perform poorly when faced with drifting user tastes or a poor-quality support set, owing to their fully exploitative strategy and their inability to actively explore users' interests.
In the following, we address these limitations by employing a neural network based exploration policy, which directly learns to explore for interactive collaborative filtering.
3. Neural Interactive Collaborative Filtering
We first present the general neural interactive collaborative filtering framework, elaborating how to formulate the exploration in cold-start and warm-start recommendation as a meta-RL task, i.e., a bandit problem within an MDP. To exploit DNNs for modeling user-item interactions, we then propose an instantiation of NICF, using stacked self-attention neural networks to represent the recommendation policy under interactive collaborative filtering. Lastly, we present an efficient policy learning method for interactive collaborative filtering.
3.1. General Framework
Rather than hand-designing exploration strategies for cold-start or warm-start users, we take a different approach in this work and aim to learn a neural-network-based exploration strategy whereby the recommender agent can rapidly capture each user's interests and hence maximize the cumulative user engagement in the system. That is, we would like to learn a general procedure (a sequential neural network) that takes as input a set of items from any user's history and produces a scoring function that can be applied to new test items, balancing the goals of learning the user profile and making accurate recommendations (as shown in Figure 1(c)).
In this formulation, we notice that interactive collaborative filtering is equivalent to a meta-learning problem whose objective is to learn a learning algorithm that takes the user's history as input and outputs a model (policy function) that can be applied to new items. From the perspective of meta-learning, the neural-network-based policy function is a low-level system, which learns quickly and is primarily responsible for exploring users' interests, and we want to optimize this low-level system with a slower, higher-level system that works across users to tune and improve it (Duan et al., 2016). Specifically, for every user $u$, the agent executes a sequential-neural-network-based policy, which constantly updates its recommendations based on the recommended items and the user's feedback. The slower, higher-level system optimizes the weights of the sequential neural network end-to-end to maximize the cumulative reward, which can be viewed as a reinforcement learning problem and optimized with an RL algorithm.
From the perspective of RL, applying RL to cold-start and warm-start recommendation is also meaningful, since the users' preferences gathered by exploratory recommendations can trigger many more satisfying recommendations, which can be viewed as the delayed reward of those recommendations, and RL is designed to maximize the sum of delayed and immediate rewards in a global view. Therefore, applying RL directly achieves the goal of balancing exploration and exploitation in interactive collaborative filtering. In detail, as an RL problem, the MDP $(\mathcal{S}, \mathcal{A}, \mathcal{T}, r)$ is defined as follows: (1) the state set $\mathcal{S}$ consists of the support sets $s_t$; (2) the action set $\mathcal{A}$ is equivalent to the item set $\mathcal{I}$; (3) $\mathcal{T}(s_{t+1} \mid s_t, i_t)$ is the transition function, giving the probability of seeing state $s_{t+1}$ after taking action $i_t$ at $s_t$; in our case, the uncertainty comes from the user's rating $r_{u,i_t}$ w.r.t. $u$ and $i_t$; (4) the reward $r$ is set based on the user's feedback, i.e., the user's rating.
3.2. Self-Attentive Neural Policy
In this work, the exploration policy is parameterized with multi-channel stacked self-attention neural networks, which separately capture the information in different kinds of user behavior, since the different rewarding recommendations for a specific user are usually extremely imbalanced (e.g., liked items are usually much fewer than disliked items) (Zou et al., 2019a; Zhao et al., 2018b; Kang and McAuley, 2018). Figure 2 presents the neural architecture of the exploration policy, which consists of an embedding layer, self-attentive blocks, and a policy layer.
Given $s_t$, the entire set of rated items $\{i_1, \dots, i_{t-1}\}$ is converted into item embedding vectors of dimension $d$ by embedding each item in a continuous space, which, in the simplest case, is a lookup into an embedding matrix $\mathbf{E} \in \mathbb{R}^{|\mathcal{I}| \times d}$.
To better represent the observation $s_t$, as shown in Figure 2, we process differently rated items separately by employing multi-channel stacked self-attentive neural networks. Denote the items rated with score $r$ ($r = 1, \dots, R$) as an embedding matrix $\mathbf{E}_r$. The self-attention operation takes the embedding $\mathbf{E}_r$ as input, converts it to three matrices through linear projections, and feeds them into an attention layer
$$ \mathbf{S}_r = \text{Attention}(\mathbf{E}_r\mathbf{W}^Q, \mathbf{E}_r\mathbf{W}^K, \mathbf{E}_r\mathbf{W}^V), $$
where $\mathbf{W}^Q, \mathbf{W}^K, \mathbf{W}^V \in \mathbb{R}^{d \times d}$ are the projection matrices. These projections make the model more flexible. The attention function is the scaled dot-product attention
$$ \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\right)\mathbf{V}, $$
where $\mathbf{Q}$ represents the queries, $\mathbf{K}$ the keys, and $\mathbf{V}$ the values (each row represents an item). The scale factor $\sqrt{d}$ avoids overly large values of the inner product, especially when the dimensionality is high. Due to the sequential nature of the recommendations, the attention layer should only consider the first $t-1$ items when formulating the $t$-th policy. Therefore, we modify the attention by forbidding all attention links from each position to later positions.
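A minimal NumPy sketch of this masked attention (shapes and names are illustrative; the real model runs one such operation per feedback channel):

```python
import numpy as np

def causal_attention(E, Wq, Wk, Wv):
    """Scaled dot-product self-attention in which position t may only
    attend to positions <= t (links to the future are forbidden)."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones_like(logits), k=1).astype(bool)
    logits[mask] = -1e9                       # forbid attending to later items
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                        # one output row per position

rng = np.random.default_rng(0)
E = rng.standard_normal((4, 8))               # 4 rated items, dimension 8
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
S = causal_attention(E, Wq, Wk, Wv)
```

With the mask in place, the first row can only attend to itself, so its output is exactly the first value vector, which is a convenient sanity check.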
Point-Wise Feed-Forward Layer
To endow the model with nonlinearity and to consider interactions between different latent dimensions, we apply a point-wise two-layer feed-forward network to $\mathbf{s}_i$ (the $i$-th row of the self-attention layer $\mathbf{S}_r$) as
$$ \mathbf{f}_i = \text{ReLU}(\mathbf{s}_i\mathbf{W}^{(1)} + \mathbf{b}^{(1)})\mathbf{W}^{(2)} + \mathbf{b}^{(2)}, $$
where $\text{ReLU}$ is the rectified linear unit, $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ are $d \times d$ matrices, and $\mathbf{b}^{(1)}$ and $\mathbf{b}^{(2)}$ are $d$-dimensional vectors.
Stacking Self-Attention Block
The self-attention layer and the point-wise feed-forward layer together form a self-attention block, and blocks can be stacked to learn more complex item transitions. Specifically, the $b$-th ($b > 1$) block is defined as
$$ \mathbf{S}_r^{(b)} = \text{SA}(\mathbf{F}_r^{(b-1)}), \qquad \mathbf{F}_r^{(b)} = \text{FFN}(\mathbf{S}_r^{(b)}), $$
and the 1st block is defined as $\mathbf{S}_r^{(1)} = \mathbf{S}_r$ and $\mathbf{F}_r^{(1)} = \mathbf{F}_r$.
After $b$ self-attention blocks that adaptively and hierarchically extract the information of previously rated items, we predict the next item score based on $\mathbf{F}^{(b)} = [\mathbf{F}_1^{(b)}; \dots; \mathbf{F}_R^{(b)}]$, where $R$ is the maximal reward. Denoting the predicted cumulative reward of recommending item $i$ as $Q(s_t, i)$, the policy layer is processed by two feed-forward layers as
$$ Q(s_t, \cdot) = \text{ReLU}(\mathbf{F}^{(b)}\mathbf{W}^{(3)} + \mathbf{b}^{(3)})\mathbf{W}^{(4)} + \mathbf{b}^{(4)}, $$
where $\mathbf{W}^{(3)}$ and $\mathbf{W}^{(4)}$ are weight matrices and $\mathbf{b}^{(3)}$ and $\mathbf{b}^{(4)}$ are bias terms. With the estimated $Q(s_t, \cdot)$, the recommendation is generated by selecting the item with maximal Q-value, $i_t = \arg\max_i Q(s_t, i)$.
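The policy layer can be sketched as follows (the dimensions and variable names are illustrative only, not the paper's actual configuration):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def q_values(s_t, W1, b1, W2, b2):
    """Two feed-forward layers mapping the state representation s_t to one
    Q-value per candidate item; the recommendation is the argmax."""
    return relu(s_t @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d, hidden, n_items = 16, 32, 100
s_t = rng.standard_normal(d)                  # state vector from the attention blocks
W1, b1 = rng.standard_normal((d, hidden)), np.zeros(hidden)
W2, b2 = rng.standard_normal((hidden, n_items)), np.zeros(n_items)
q = q_values(s_t, W1, b1, W2, b2)
recommendation = int(np.argmax(q))            # i_t = argmax_i Q(s_t, i)
```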
3.3. Policy Learning
We use Q-learning (Mnih et al., 2013) to learn the weights for the exploration policy. In the $t$-th trial, the recommender agent observes the support set $s_t$ and chooses the item $i_t$ with an $\epsilon$-greedy policy w.r.t. the approximated value function $Q(s_t, \cdot)$ (i.e., with probability $1-\epsilon$ selecting the action with maximal Q-value, and with probability $\epsilon$ choosing an action uniformly at random). The agent then receives the response $r_{u,i_t}$ from the user and updates the observed set to $s_{t+1}$. Finally, we store the experience $(s_t, i_t, r_{u,i_t}, s_{t+1})$ in a large replay buffer, from which mini-batches are sampled for training.
Usually, training an RL agent is much more challenging than supervised learning (Sutton and Barto, 2018). Additionally, in recommender systems, the large-scale action and state spaces greatly increase the difficulty of training a reinforcement-learning-based recommender agent (Zou et al., 2020; Chen et al., 2019a). To reduce the difficulty, we adopt a discount factor $\gamma$ that is constantly increased during training as a function of the current epoch, the total number of epochs, and a hyper-parameter. Since a larger $\gamma$ means planning over longer future horizons, the increasing $\gamma$ can be treated as an increasingly difficult curriculum (Bengio et al., 2009), which gradually guides the learning agent from 1-horizon (greedy) solutions, through 2-horizon solutions, and so on, to the overall optimal solution. This is much more efficient than finding the optimal recommender policy from scratch.
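The two ingredients discussed above, the one-step Q-learning target and a discount factor that grows with the training epoch, can be sketched as follows (the paper's exact schedule is not reproduced here; `gamma_schedule` is an illustrative linear stand-in):

```python
from collections import deque

# Replay buffer storing (s_t, i_t, r_t, s_{t+1}) experience tuples.
replay_buffer = deque(maxlen=10000)

def td_target(reward, next_q_max, gamma, done):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    return reward if done else reward + gamma * next_q_max

def gamma_schedule(epoch, total_epochs, gamma_max=0.99):
    """Illustrative curriculum: raise gamma with the epoch so that early
    training optimises near-greedy (short-horizon) solutions first."""
    return gamma_max * epoch / total_epochs
```

At epoch 0 the target collapses to the immediate reward (a greedy, 1-horizon objective); by the final epoch the full discounted return is optimized.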
In this section, we conduct extensive experiments on three benchmark datasets to evaluate the effectiveness of NICF. We mainly focus on answering the following research questions:
RQ1: How can NICF outperform existing interactive collaborative filtering algorithms for the cold-start users?
RQ2: Can the NICF be applied to warm-start users with drifting taste, i.e., those whose interests change over time?
RQ3: What’s the influence of various components in NICF?
RQ4: What kind of knowledge learned by NICF for cold-start recommendations?
In what follows, we first introduce our experimental settings and then answer the above four research questions.
4.1. Experimental Settings
We experiment with three real-world benchmark datasets: MovieLens (1M), EachMovie, and Netflix. Table 1 lists the statistics of the three datasets.
Due to the interactive nature of recommender systems, an online experiment with true interactions from real users would be ideal, but it is not always possible (Li et al., 2010; Zhao et al., 2013). Following the setting of interactive collaborative filtering (Zhao et al., 2013; He et al., 2017), we assume that the ratings recorded in the datasets are users' instinctive actions, not biased by the recommendations provided by the system. In this way, the records can be treated as unbiased feedback in an interactive setting. Additionally, we assume that a rating of no less than 4 indicates a satisfying recommendation, and that lower ratings indicate dissatisfaction. These assumptions define a simulation environment for training and evaluating our proposed algorithm, in which the learning agent is expected to keep track of users' interests and recommend successful items over a long horizon.
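Under these assumptions, the offline simulator is essentially the following (a hypothetical minimal sketch; how unrated items are handled is a simplification here):

```python
class OfflineRecEnv:
    """Toy simulator built from logged ratings: recommending item i to
    user u yields reward 1 if the logged rating r_{u,i} >= 4, else 0.
    Items without a logged rating are treated as dissatisfying here."""

    def __init__(self, ratings):
        self.ratings = ratings            # dict: (user, item) -> rating in 1..5

    def step(self, user, item):
        return 1 if self.ratings.get((user, item), 0) >= 4 else 0

env = OfflineRecEnv({(0, 1): 5, (0, 2): 3})
```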
4.1.2. Compared Methods
We compare our model with state-of-the-art methods from different types of recommendation approaches, including:
Random: The random policy is executed in every recommendation, which is a baseline used to estimate the worst performance that should be obtained.
Pop: It ranks the items according to their popularity, measured by the number of times they have been rated. This is a widely used simple baseline. Although it is not personalized, it is surprisingly competitive in evaluation, as users tend to consume popular items.
MF (Koren et al., 2009): It makes recommendations based on the ratings of other users who have rated similarly to the target user. For cold-start recommendation, we always act greedily w.r.t. the estimated scores and update the user's latent factors after every interaction.
BPR (Rendle et al., 2009): It optimizes the MF model with a pairwise ranking loss, which is a state-of-the-art model for item recommendation.
ICF (Zhao et al., 2013): Interactive collaborative filtering combines probabilistic matrix factorization (Mnih and Salakhutdinov, 2008) with different exploration techniques, including GLM-UCB (generalized LinUCB (Li et al., 2010)), TS (Chapelle and Li, 2011), and $\epsilon$-greedy (Sutton and Barto, 2018), which are strong baselines for handling the exploration/exploitation dilemma in recommender systems.
NICF: Our proposed approach for learning to explore in cold-start or warm-start recommendation.
| | MovieLens (1M) | EachMovie | Netflix |
| # Interactions Per User | 165.60 | 1732.42 | 209.25 |
| # Interactions Per Item | 269.89 | 45.89 | 5654.50 |
4.1.3. Evaluation Metrics
Given a cold-start or warm-start user, a well-defined exploration strategy should recommend items that deliver the maximal amount of information useful for estimating the user's preferences. Previously, this kind of exploration was achieved by improving the diversity of recommendations (Cheng et al., 2017; Zou et al., 2019b). Hence, to study the learned exploration strategy, we evaluate the model on both the accuracy and the diversity of the generated recommendations. Given the ordered list of items, we adopt three widely used metrics from the recommender systems literature:
Cumulative Precision. A straightforward measure is the number of positive interactions collected during the total interactions,
$$ \text{precision@}T = \frac{1}{\#\text{users}} \sum_u \sum_{t=1}^{T} b_t. $$
For all three datasets, we define $b_t = 1$ if $r_{u,i_t} \ge 4$, and $b_t = 0$ otherwise.
Cumulative Recall. We can also check the recall during the $T$ timesteps of the interactions,
$$ \text{recall@}T = \frac{1}{\#\text{users}} \sum_u \sum_{t=1}^{T} \frac{b_t}{\#\text{satisfying items of } u}. $$
Cumulative $\alpha$-NDCG. $\alpha$-NDCG generalizes NDCG to the diversity of the recommendation list, formulated as
$$ \alpha\text{-NDCG@}T = \frac{1}{Z} \sum_{t=1}^{T} \frac{\sum_{k} g_k(t)\,(1-\alpha)^{c_{k,t-1}}}{\log_2(1+t)}, $$
where $g_k(t)$ indicates whether the $t$-th recommended item covers topic $k$, and $c_{k,t-1}$ is the number of times topic $k$ has appeared in the ranking of the recommendation list up to (and including) the $(t-1)$-th position. Here, a topic is a property of items or users, and $Z$ is the normalization factor.
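The two accuracy metrics can be computed as follows (hypothetical helper names; `rewards_per_user[u][t]` plays the role of $b_t$ for user `u`):

```python
def cumulative_precision(rewards_per_user, T):
    """Average number of satisfying recommendations in the first T trials."""
    return sum(sum(r[:T]) for r in rewards_per_user) / len(rewards_per_user)

def cumulative_recall(rewards_per_user, n_relevant_per_user, T):
    """Average fraction of each user's satisfying items retrieved in T trials."""
    recalls = [sum(r[:T]) / n
               for r, n in zip(rewards_per_user, n_relevant_per_user)]
    return sum(recalls) / len(recalls)

# Two simulated users: b_t sequences over 3 trials.
rewards = [[1, 0, 1], [0, 0, 1]]
prec = cumulative_precision(rewards, T=3)
rec = cumulative_recall(rewards, n_relevant_per_user=[2, 1], T=3)
```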
4.1.4. Parameter Setting
These datasets are split into three user-disjoint sets: 85% of the users' data forms the training set, whose ratings are used to learn the parameters of the models; 5% of the users' data is used for tuning hyper-parameters, including the learning rate, hidden units, and early stopping; and the remaining 10% of the users go through the interactive recommendation process for 40 timesteps to evaluate the effectiveness of the different methods. For all methods except Random and Pop, grid search is applied to find the optimal settings, including the latent dimension and the learning rate. We report the result of each method with its optimal hyper-parameter settings on the validation data. We implement our proposed methods in TensorFlow and the code is available at https://github.com/zoulixin93/NICF. The optimizer is Adam (Kingma and Ba, 2014). We stack two self-attentive blocks in the default setting. The capacity of the replay buffer for Q-learning is set to 10,000 in the experiments. The exploration factor $\epsilon$ decays from 1 to 0 during the training of the neural network.
Note: statistically significant improvements (i.e., two-sided $t$-test) over the best baseline are marked in Table 2.
4.2. Performance comparison on cold-start cases (RQ1)
Table 2 reports the cumulative precision and recall throughout 40 trial recommendations for cold-start cases. The results are quite consistent with our intuition. We make the following observations:
(1) Our method NICF outperforms the other baselines on the three benchmark datasets. We can see that NICF achieves the best precision and recall over all three benchmark datasets, significantly outperforming the state-of-the-art methods by a large margin (on average, the relative improvements in cumulative precision@40 over the best baseline are 9.43%, 4.59%, and 6.65% on the three benchmark datasets, respectively). This means that for cold-start recommendation, our proposed method can quickly capture users' interests and adapt its strategy to cater to new users.
(2) The GLM-UCB and TS algorithms generally work better than the greedy methods MF, BPR, and MLP, and the heuristic $\epsilon$-greedy method. In most cases, TS and GLM-UCB also exceed the other baseline algorithms on the EachMovie and Netflix datasets (according to cumulative precision and recall). This means that exploration that accounts for the uncertainty of the user and item posterior distributions is more promising than random exploration. Nevertheless, TS and GLM-UCB fail to outperform our proposed NICF algorithm.
(3) Overall, the meta-learning method, MELU, consistently outperforms the traditional baselines on average as shown in Table 2, and is much better than all other baselines on MovieLen (1M), which indicates that meta-learning method helps improve the recommendation accuracy on cold-start recommendation.
4.3. Performance comparison on warm-start cases with taste drift (RQ2)
Through this experiment, we aim to answer the question of whether the algorithms are also applicable to warm-start users to follow up their interests throughout the interactions, especially when their tastes are changing over time. To do this, we first divide the rating records of the users (whose ratings are more than 80) into two periods (set 1 and set 2). For the selected user, the set 1 (20 items) is used as the historical interactions for the user and set 2 as the simulation for his/her taste drift. Then, we employ the genre information of the items as an indication of the user interest (Zhao et al., 2013)
. That is, we calculate the cosine similarity between the genre vectors of the two periods. We choose the users with the smallest cosine similarity as an indication that they have significant taste drifting across the two periods. Since the genre information of EachMovie is not available, we only conduct experiments on MovieLens (1M) and Netflix datasets (the genre of Netflix dataset is crawled by using IMDBpy777https://github.com/alberanid/imdbpy). Specifically, we respectively selected 4,600 users and 96,037 users from MovieLens (1M) and Netflix datasets to train and evaluate on warm-start recommendations.
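The drift-detection step can be sketched as follows (hypothetical names; genre IDs are integers here, and the thresholding over users is omitted):

```python
import numpy as np

def genre_vector(items, item_genres, n_genres):
    """Count how often each genre appears among a period's rated items."""
    v = np.zeros(n_genres)
    for i in items:
        for g in item_genres[i]:
            v[g] += 1
    return v

def taste_similarity(period1, period2, item_genres, n_genres):
    """Cosine similarity between the two periods' genre vectors; users
    with the smallest similarity are taken as having drifting tastes."""
    a = genre_vector(period1, item_genres, n_genres)
    b = genre_vector(period2, item_genres, n_genres)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

item_genres = {0: [0], 1: [0], 2: [1], 3: [1]}           # item -> genre ids
sim = taste_similarity([0, 1], [2, 3], item_genres, 2)   # disjoint genres
```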
Table 3 reports the cumulative precision and recall throughout 40 trial recommendations for warm-start users with drifting interests. In Table 3, it can be seen that our proposed method outperforms the baselines on both datasets. Compared with the best baseline, the improvement is up to 7.92% on the MovieLens (1M) dataset and 6.43% on the Netflix dataset, which means that for warm-start users, our proposed method can keep track of users' drifting tastes and adapt its strategy to cater to them.
Note: statistically significant improvements (i.e., two-sided $t$-test) over the best baseline are marked in Table 3.
| # Blocks | MovieLens (1M) | EachMovie | Netflix |
| 0 blocks (b=0) | 16.7368 | 17.0276 | 14.1250 |
| 1 block (b=1) | 20.9818 | 24.9333 | 18.0429 |
| 3 blocks (b=3) | 21.4544 | 25.1063 | 18.6074 |
4.4. Ablation Study (RQ3)
Since there are many components in our framework, we analyze their impacts via an ablation study. Table 4 shows the performance of our default method and its 4 variants on the three datasets (with d = 30). We introduce the variants and analyze their effects respectively:
(1) LSTM: Replacing the self-attention blocks with LSTM cells, to verify the effectiveness of self-attention for interactive collaborative filtering. Specifically, we adopt a two-layer LSTM with a hidden dimension of 30. The results imply that applying stacked self-attention blocks is beneficial for interactive collaborative filtering.
(2) $\gamma = 0$: This means learning without RL, i.e., training the multi-channel stacked self-attention recommendation policy without considering the delayed reward, so that the model delivers items in a fully exploitative way without exploration. Not surprisingly, the results are much worse than the default setting.
(3) Number of blocks: Not surprisingly, results are inferior with zero blocks, since the model would then depend only on the last item. The variant with one block performs reasonably well, and three blocks perform slightly better than two, meaning that the hierarchical self-attention structure is helpful for learning more complex item transitions.
(4) Multi-head: The authors of the Transformer (Vaswani et al., 2017) found it useful to use "multi-head" attention. However, performance with two heads is consistently and slightly worse than single-head attention in our case. This might be due to the small $d$ in our problem ($d = 512$ in the Transformer), which is not suitable for decomposition into smaller subspaces.
4.5. Analysis on Diversity (RQ4)
Diversity and accuracy
Some existing works (Cheng et al., 2017) explore users’ interests by improving recommendation diversity. This is an indirect way to maintain exploration, and the underlying assumption has not been verified. Intuitively, diverse recommendations bring more information about users’ interests and item attributes. Here, we conduct experiments to see whether NICF, which directly learns to explore, can improve recommendation diversity. Since genre information is only available for MovieLens (1M) and Netflix, we analyze recommendation diversity on these two datasets. Figure 3 shows the cumulative α-NDCG over the first 40 rounds of recommendation. We can see that NICF, trained by directly learning to explore, favors recommending more diverse items. The results verify that exploring users’ interests can increase recommendation diversity, and that enhancing diversity is in turn a means of improving exploration.
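The diversity measure used here can be illustrated with a small sketch of α-NDCG (Clarke et al., 2008), where an item's genres play the role of subtopics and a genre's gain decays geometrically each time it reappears. The genre sets below are hypothetical, and for brevity the ideal ranking is passed in directly rather than constructed greedily as in the original definition.

```python
import math

def alpha_ndcg(ranked_genres, ideal_genres, alpha=0.5, k=None):
    """Cumulative alpha-NDCG: each item is a set of genres (subtopics).
    A genre's gain decays by (1 - alpha) on every repeat appearance,
    so redundant recommendations earn less credit."""
    def alpha_dcg(ranking):
        seen = {}          # genre -> number of times already shown
        score = 0.0
        for rank, genres in enumerate(ranking[:k], start=1):
            gain = sum((1 - alpha) ** seen.get(g, 0) for g in genres)
            for g in genres:
                seen[g] = seen.get(g, 0) + 1
            score += gain / math.log2(rank + 1)   # standard DCG discount
        return score
    ideal = alpha_dcg(ideal_genres)
    return alpha_dcg(ranked_genres) / ideal if ideal > 0 else 0.0

# A diverse ranking scores higher than a redundant one with the same length.
diverse   = [{"action"}, {"comedy"}, {"drama"}]
redundant = [{"action"}, {"action"}, {"action"}]
ideal     = [{"action"}, {"comedy"}, {"drama"}]
print(alpha_ndcg(diverse, ideal))    # 1.0
print(alpha_ndcg(redundant, ideal))  # < 1.0
```

Accumulating this score over the 40 recommendation rounds yields the curves of Figure 3.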
The knowledge learned by NICF
To gain better insight into NICF, we take a close look at the exploration policy by visualizing the sequential decision tree it has learned. Due to space limitations, we only present the first four rounds of recommendations on the MovieLens (1M) dataset. As shown in Figure 4, without using genre information, NICF explores a user’s interests by recommending similar movies with somewhat different topics if the user liked the previous movie, or by changing the genre if the movie was negatively labeled. This indicates that NICF can effectively track users’ interests and adapt its strategy to balance exploration and exploitation for cold-start recommendations.
5. Related Work
We summarize the related literature on traditional recommender systems, interactive recommender systems, and meta-learning based recommender systems as follows.
Traditional recommender system
Training models on historical records in a supervised fashion and recommending the items with the maximum estimated score has been common practice in the majority of models, including factorization methods (Rendle, 2010; Hoyer, 2004; Koren et al., 2009) and various deep neural models, such as multilayer perceptrons (Cheng et al., 2016; He et al., 2017), denoising auto-encoders (Wu et al., 2016), convolutional neural networks (CNNs) (Tang and Wang, 2018), recurrent neural networks (RNNs) (Li et al., 2017; Gu et al., 2020), memory networks (Chen et al., 2018), and attention architectures (Zhou et al., 2018; Bai et al., 2019). Based on partially observed historical interactions, these models learn the user profile (Zhou et al., 2018; Gu et al., 2020; Chen et al., 2019b; Gu et al., 2016) and predict a customer’s feedback with a learned function that maximizes well-defined ranking metrics, such as Precision and NDCG (Clarke et al., 2008). However, most of them are myopic: the learned policies greedily maximize the estimated feedback and cannot purposely explore users’ interests for cold-start or warm-start users from a long-term view.
Interactive recommender system
Interactive recommendation, as a trend in the development of recommender systems, has been widely studied in recent years. The research follows two main directions: (1) contextual bandits and (2) reinforcement learning. (1) In contextual bandits, the main focus is on balancing exploration and exploitation while achieving a bounded regret (i.e., the performance gap between the optimal and the suggested recommendations) in the worst case. Many contextual bandit based recommender systems have been developed for different recommendation tasks, such as news recommendation (Li et al., 2010), diversified movie set recommendation (Qin et al., 2014), collaborative filtering (Wang et al., 2017; Zhao et al., 2013), online advertising (Zeng et al., 2016), and e-commerce recommendation (Wu et al., 2017). However, these methods are usually intractable for non-linear models and potentially overly pessimistic about the recommendations. (2) Reinforcement learning is well suited to modeling the interactive recommender system. However, there are still many difficulties in directly applying RL, such as off-policy training (Chen et al., 2019a; Zou et al., 2020), off-policy evaluation (Gilotte et al., 2018), and large action spaces (Dulac-Arnold et al., 2015; Zhao et al., 2018a), and existing work concentrates on optimizing metrics with delayed attributes, such as diversity (Zou et al., 2019b) and browsing depth (Zou et al., 2019a). To the best of our knowledge, ours is the first work to analyze the use of RL for exploring users’ interests in interactive collaborative filtering.
Meta-learning based recommender system
Meta-learning, also called learning-to-learn, aims to train a model that can rapidly adapt to a new task with a few samples (Finn et al., 2017; Koch et al., 2015; Santoro et al., 2016), which makes it naturally suited to solving the cold-start problem after collecting a handful of trial recommendations. For example, Vartak et al. (2017) treated recommendation for one user as one task and learned to adapt neural networks across different tasks based on task information. Lee et al. (2019) proposed to learn the initial weights of the neural networks for cold-start users based on model-agnostic meta-learning (MAML) (Finn et al., 2017). At the same time, Pan et al. (2019) proposed a meta-learning based approach that learns to generate desirable initial embeddings for new ad IDs. However, all these methods ignore the performance on the support set, which also greatly influences user engagement with the recommender system. Additionally, the full-exploitation principle after a few trials inevitably leads to locally optimal recommendations.
In this work, we study collaborative filtering in an interactive setting and focus on recommendations for cold-start users or warm-start users with drifting tastes. To quickly catch up with users’ interests, we propose to represent the exploration strategy with a multi-channel stacked self-attention neural network and learn it from the data. In our proposed method, the exploration strategy is encoded in the weights of the neural network, which are trained with efficient Q-learning by maximizing cold-start or warm-start users’ satisfaction in limited trials. The key insight is that the satisfying recommendations triggered by an exploring recommendation can be viewed as the delayed reward for the information it gathered; an exploration strategy that seamlessly integrates constructing the user profile into making accurate recommendations can therefore be directly optimized by maximizing the overall satisfaction with reinforcement learning. Extensive experiments and analyses on three benchmark collaborative filtering datasets demonstrate the knowledge learned by our proposed method and its advantage over state-of-the-art methods.
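The delayed-reward insight can be illustrated with a toy tabular Q-learning example. The two-state MDP, the state names, and the reward values below are invented for illustration only and are not the paper's actual environment or algorithm.

```python
import random

# Toy illustration: action "explore" gives no immediate reward but reveals the
# user's interest, enabling a high-reward recommendation one step later, while
# "exploit" pays a small reward at once. Q-learning propagates the delayed
# reward back to the exploring action, so exploration ends up preferred.
random.seed(0)
gamma, lr = 0.9, 0.1
Q = {("start", "explore"): 0.0,
     ("start", "exploit"): 0.0,
     ("informed", "recommend"): 0.0}

for _ in range(2000):
    a = random.choice(("explore", "exploit"))
    if a == "exploit":
        # terminal transition with a small immediate reward
        Q[("start", a)] += lr * (0.3 - Q[("start", a)])
    else:
        # no immediate reward; value comes from the follow-up recommendation
        Q[("start", a)] += lr * (0.0 + gamma * Q[("informed", "recommend")]
                                 - Q[("start", a)])
        # the informed recommendation is satisfying (reward 1.0)
        Q[("informed", "recommend")] += lr * (1.0 - Q[("informed", "recommend")])

print(Q[("start", "explore")] > Q[("start", "exploit")])  # True
```

After training, the value of exploring (≈ γ · 1.0) exceeds the myopic payoff of exploiting (0.3), which is exactly the effect that training the exploration strategy with Q-learning exploits at scale.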
This research was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada. The authors thank all the anonymous reviewers for their valuable comments.
- CTRec: a long-short demands evolution model for continuous-time recommendation. In SIGIR’19, pp. 675–684. Cited by: §5.
- Curriculum learning. In ICML’09, pp. 41–48. Cited by: §3.3.
- An empirical evaluation of Thompson sampling. In NIPS’11, pp. 2249–2257. Cited by: §1, §2.2, 6th item.
- Top-k off-policy correction for a reinforce recommender system. In WSDM’19, pp. 456–464. Cited by: §3.3, §5.
- Semi-supervised user profiling with heterogeneous graph attention networks. In IJCAI’19, pp. 2116–2122. Cited by: §5.
- Sequential recommendation with user memory networks. In WSDM’18, pp. 108–116. Cited by: §5.
- Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pp. 7–10. Cited by: §1, §1, §5.
- Learning to recommend accurate and diverse items. In WWW’17, pp. 183–192. Cited by: §4.1.3, §4.5.
- Novelty and diversity in information retrieval evaluation. In SIGIR’08, pp. 659–666. Cited by: §5.
- RL²: fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779. Cited by: §3.1.
- Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679. Cited by: §5.
- Model-agnostic meta-learning for fast adaptation of deep networks. In ICML’17, pp. 1126–1135. Cited by: §2.3, 7th item, §5.
- Offline a/b testing for recommender systems. In WSDM’18, pp. 198–206. Cited by: §5.
- Hierarchical user profiling for e-commerce recommender systems. In WSDM’20, pp. 223–231. Cited by: §5.
- HLGPS: a home location global positioning system in location-based social networks. In ICDM’16, pp. 901–906. Cited by: §5.
- Neural collaborative filtering. In WWW’17, pp. 173–182. Cited by: §1, §2.2, 4th item, §4.1.1, §5.
- Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research 5 (Nov), pp. 1457–1469. Cited by: §5.
- Self-attentive sequential recommendation. In ICDM’18, pp. 197–206. Cited by: §3.2.
- Thompson sampling: an asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, pp. 199–213. Cited by: §2.2.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.4.
- Siamese neural networks for one-shot image recognition. In ICML’15 deep learning workshop, Vol. 2. Cited by: §5.
- Matrix factorization techniques for recommender systems. Computer (8), pp. 30–37. Cited by: 3rd item, §5.
- MeLU: meta-learned user preference estimator for cold-start recommendation. In SIGKDD’19, pp. 1073–1082. Cited by: §2.3, 7th item, §5.
- Neural attentive session-based recommendation. In CIKM’17, pp. 1419–1428. Cited by: §5.
- A contextual-bandit approach to personalized news article recommendation. In WWW’10, pp. 661–670. Cited by: §1, 6th item, §4.1.1, §5.
- Probabilistic matrix factorization. In NIPS’08, pp. 1257–1264. Cited by: §2.2, 6th item.
- Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §3.3.
- Warm up cold-start advertisements: improving ctr predictions via learning to learn id embeddings. In SIGIR’19, pp. 695–704. Cited by: §5.
- Contextual combinatorial bandit and its application on diversified online recommendation. In SDM’14, pp. 461–469. Cited by: §5.
- BPR: Bayesian personalized ranking from implicit feedback. In UAI’09, pp. 452–461. Cited by: 5th item.
- Factorization machines. In ICDM’10, pp. 995–1000. Cited by: §5.
- Meta-learning with memory-augmented neural networks. In ICML’16, pp. 1842–1850. Cited by: §5.
- Interactive recommender systems: tutorial. In RecSys’15, pp. 359–360. Cited by: §1.
- Reinforcement learning: an introduction. MIT press. Cited by: §3.3, §3.3, 6th item.
- Personalized top-n sequential recommendation via convolutional sequence embedding. In WSDM’18, pp. 565–573. Cited by: §5.
- A meta-learning perspective on cold-start recommendations for items. In NIPS’17, pp. 6904–6914. Cited by: §5.
- Attention is all you need. In NIPS’17, pp. 5998–6008. Cited by: §4.4.
- Collaborative deep learning for recommender systems. In SIGKDD’15, pp. 1235–1244. Cited by: §1.
- Factorization bandits for interactive recommendation. In AAAI’17, pp. 2695–2702. Cited by: §5.
- Returning is believing: optimizing long-term user engagement in recommender systems. In CIKM’17, pp. 1927–1936. Cited by: §5.
- Collaborative denoising auto-encoders for top-n recommender systems. In WSDM’16, pp. 153–162. Cited by: §5.
- Deep matrix factorization models for recommender systems. In IJCAI’17, pp. 3203–3209. Cited by: §2.2, 4th item.
- Online context-aware recommendation with time varying multi-armed bandit. In SIGKDD’16, pp. 2025–2034. Cited by: §5.
- Deep reinforcement learning for page-wise recommendations. In RecSys’18, pp. 95–103. Cited by: §5.
- Recommendations with negative feedback via pairwise deep reinforcement learning. In SIGKDD’18, pp. 1040–1048. Cited by: §3.2.
- Interactive collaborative filtering. In CIKM’13, pp. 1411–1420. Cited by: §1, §1, §2.2, §2.2, 6th item, §4.1.1, §4.3, §5.
- Deep interest network for click-through rate prediction. In SIGKDD’18, pp. 1059–1068. Cited by: §5.
- Reinforcement learning to optimize long-term user engagement in recommender systems. In SIGKDD’19, pp. 2810–2818. Cited by: §3.2, §5.
- Reinforcement learning to diversify top-n recommendation. In DASFAA’19, pp. 104–120. Cited by: §4.1.3, §5.
- Pseudo dyna-q: a reinforcement learning framework for interactive recommendation. In WSDM’20, pp. 816–824. Cited by: §1, §3.3, §5.