Attention-based Mixture Density Recurrent Networks for History-based Recommendation

by   Tian Wang, et al.

The goal of personalized history-based recommendation is to automatically output a distribution over all the items given a sequence of previous purchases of a user. In this work, we present a novel approach that uses a recurrent network for summarizing the history of purchases, continuous vectors representing items for scalability, and a novel attention-based recurrent mixture density network, which outputs each component in a mixture sequentially, for modelling a multi-modal conditional distribution. We evaluate the proposed approach on two publicly available datasets, MovieLens-20M and RecSys15. The experiments show that the proposed approach, which explicitly models the multi-modal nature of the predictive distribution, is able to improve the performance over various baselines in terms of precision, recall and nDCG.







Recommender systems have become an indispensable part of the e-commerce industry, helping customers to sort out items of interest from large inventories. Among the most popular techniques are matrix factorization (MF) based models (see, e.g., Hu, Koren, and Volinsky, 2008; Koren, Bell, and Volinsky, 2009; Rendle et al., 2009), which decompose a user–item matrix into user and item matrices. Such an approach treats recommendation as a matrix completion/imputation problem, where missing entries in the original matrix are estimated by the dot product between the corresponding user and item factors. Despite their popularity in recommender systems, MF-based models have their limitations. First, MF-based models aim at reconstructing user history instead of predicting future behaviors; the underlying assumption is that user preference is static over time. Second, most such approaches omit ordering information in a user history. To address these issues, an increasing number of recent works treat user behaviors as sequences and predict future events based on history (see, e.g., Hidasi et al., 2015; Tan, Xu, and Liu, 2016; De Boom et al., 2017).

Recurrent neural networks (RNNs) are one of the most widely used techniques for sequence modelling (see, e.g., Mikolov et al., 2010; Bahdanau, Cho, and Bengio, 2014; Sutskever, Vinyals, and Le, 2014). RNNs have recently been considered for history-based recommendation systems (see, e.g., Hidasi et al., 2015; Tan, Xu, and Liu, 2016; Wu et al., 2017). In the implicit feedback scenario, where binary user-item interactions are recorded, the previously mentioned works pose recommendation as a classification task. For a recommender system containing millions of items, a classification approach must calculate a score for each user-item pair, leading to a scalability issue in both training and prediction. Besides scalability, De Boom et al. (2017) demonstrated that such an approach fails to recommend relevant items to users on their dataset containing more than 6 million songs.

Instead, De Boom et al. (2017) formulated the problem as regression rather than classification to address the scalability and performance degradation issues. However, due to the noisy and multi-modal nature of user behavior, building a mapping from history to future is an ill-posed problem, and regression may not be suitable.

In this work, we present an approach to history-based recommendation by generating a conditional distribution over future item vectors. The proposed model uses a recurrent network for summarizing a user history, and an attention-based recurrent mixture density network, which generates each component in a mixture network sequentially, for modelling a multi-modal conditional distribution.

Our model is evaluated on MovieLens-20M (Harper and Konstan, 2016) and RecSys15 in an implicit feedback setting. Experimental results demonstrate that the proposed model improves recall, precision, and nDCG significantly compared to various baselines. A comprehensive analysis of model configurations shows that increasing the number of mixture components improves recommendations by better capturing multi-modality in user behavior.

Our model explores a new direction of recommender systems by conducting density estimation over continuous item representation. It provides an effective way to generate components for a mixture network, which potentially benefits all applications using multiple mixtures.

Related Work

A History-based Recommender System with a Recurrent Neural Network

RNNs were first used to model a user history in recommender systems by Hidasi et al. (2015). In that work, an RNN was used to model previous items and predict the next one in a user sequence. Tan, Xu, and Liu (2016) improved the recommender system performance on a similar architecture with data augmentation. To better leverage item features, Hidasi et al. (2016) introduced a parallel RNN architecture to jointly model user behaviors and item features. Wu et al. (2017) proposed a new architecture using separate recurrent neural networks to update user and item representations in a temporal fashion.

There are two major differences in the proposed approach from the previously mentioned work. First, our work frames the task of implicit-feedback recommendation as density estimation in a continuous space rather than classification with a discrete output space. Second, unlike most of the earlier works, where the whole systems were trained end-to-end, the proposed model leverages an external algorithm to extract item representation, allowing the system to cope with new items more easily.

More recently, De Boom et al. (2017) proposed a history-based recommender system with pretrained continuous item representations as a regression problem. In their work, a recurrent neural network read through a user’s history, as a sequence of listened songs, and extracted a fixed-length user taste vector, which was later used to predict future songs.

The major difference between the proposed work and the work by De Boom et al. (2017) is the assumption on the number of modes in the distribution of user behaviors. The proposed model considers the mapping from history to a future behaviour as a probability distribution with multiple modes, unlike their work, in which such a distribution is assumed to be unimodal. We do so by using a variant of the mixture density network (Bishop, 1994) to explicitly model user behavior. Their approach can be considered a special case of the proposed model with a single mixture component.

Continuous Item Representation

Inspired by the recent advance in word representation learning (Mikolov et al., 2013), various methods have been proposed to embed an item in a distributed vector that encodes useful information for recommendation. Barkan and Koenigstein (2016) learned item vectors using an approach similar to Word2Vec (Mikolov et al., 2013) by treating each item as a word without considering the order of items. Liang et al. (2016) jointly factorized a user-item matrix and item-item matrix to obtain item vectors. In the work by Liu et al. (2017), a vector was learned for each pin in Pinterest using a Word2Vec inspired model.

In this paper, we use external knowledge to extract item representation, instead of training jointly. Such an approach is effective, because it enables the use of any recent advances in representation learning and has a potential to incorporate new items unseen during training.


Recommendation Framework

In the implicit feedback setting, a user behavior is recorded as a sequence of interacted items, which can be a mixture of various behaviors, including viewing, purchasing, searching and others. For simplicity, we only focus on the viewing behavior in our model. We frame the task of recommendation as a sequence modelling problem with the goal of predicting the future directly.

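Given a splitting index $t$ and a user behavior sequence $S = (s_1, \ldots, s_T)$, the sequence can be split into the history $S_{\le t} = (s_1, \ldots, s_t)$ and the future $S_{>t} = (s_{t+1}, \ldots, s_T)$. A recommender system, parametrized by $\theta$, aims at modelling the probability of future items conditioned on historical items, $p(S_{>t} \mid S_{\le t}; \theta)$. For simplicity, we omit $\theta$ in our notation and assume that the items in $S_{>t}$ are independent:

$$p(S_{>t} \mid S_{\le t}) = \prod_{i=t+1}^{T} p(s_i \mid S_{\le t}). \tag{1}$$

This conditional probability can be approximated by, for instance, an $n$-gram conditional probability

$$p(s_i \mid S_{\le t}) \approx p(s_i \mid s_{t-n+2}, \ldots, s_t), \tag{2}$$

where $s_{t-n+2}, \ldots, s_t$ are the previously viewed items.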


An $n$-gram statistics table records the number of occurrences of each item $n$-gram in the training corpus. Based on this, the approximated conditional probability can be expressed as

$$p(s_i \mid s_{t-n+2}, \ldots, s_t) = \frac{c(s_{t-n+2}, \ldots, s_t, s_i)}{\sum_{s} c(s_{t-n+2}, \ldots, s_t, s)}, \tag{3}$$

where $c(\cdot)$ is the count in the training corpus. When $n$ equals two, this setting is a variant of item-to-item collaborative filtering (Linden, Smith, and York, 2003), where the temporal dependency among items is ignored.

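Conditioned on a seed item $s_j$ in a user history, item-to-item collaborative filtering recommends the item having the highest co-view probability

$$\hat{s} = \arg\max_{s} \, p(s \mid s_j). \tag{4}$$

The statistics table contains the number of occurrences $c(s_i, s_j)$ of each item pair $(s_i, s_j)$. One can estimate the item-to-item conditional probability by

$$p(s_i \mid s_j) = \frac{c(s_i, s_j)}{\sum_{s} c(s, s_j)}. \tag{5}$$

With the pairwise conditional probability, $p(s_i \mid S_{\le t})$ can be approximated using $p(s_i \mid s_j)$ by randomly sampling an item $s_j$ from the history $S_{\le t}$. To stabilize the result, we take the average of the approximated probability

$$p(s_i \mid S_{\le t}) \approx \frac{1}{t} \sum_{j=1}^{t} p(s_i \mid s_j). \tag{6}$$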


A major limitation of such count-based methods is data sparsity, as a large number of $n$-grams do not occur in the training corpus.
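The count-based estimate described above can be illustrated with a small Python sketch. It is not the authors' implementation: the function names (`pair_counts`, `cond_prob`, `score`) are hypothetical, and counting each unordered pair once per sequence is an assumption made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def pair_counts(sequences):
    """Count, for every item pair, the number of sequences containing both items."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Assumption: a pair is counted once per sequence, ignoring order.
        for a, b in combinations(set(seq), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def cond_prob(counts, item, seed):
    """Estimate p(item | seed) as count(seed, item) / total count involving seed."""
    total = sum(counts[seed].values())
    return counts[seed][item] / total if total else 0.0


def score(counts, item, history):
    """Average the pairwise estimates over every seed item in the history."""
    return sum(cond_prob(counts, item, s) for s in history) / len(history)
```

An item pair that never co-occurs in the training sequences receives probability zero here, which is the sparsity problem the paragraph above refers to.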

To address the data sparsity issue in count-based methods, Bengio et al. (2003) proposed the neural language model, in which each word is represented as a continuous vector. In this paper, we take a similar approach by representing each item $s$ using a continuous vector $\mathbf{x} \in \mathbb{R}^d$. Unlike earlier works, which use continuous representations as input only and model a discrete probability distribution through classification, we instead choose to directly model the probability density function over continuous item representations, $p(\mathbf{x} \mid S_{\le t})$, where $S_{\le t}$ denotes the user history. The $k$ items with the highest likelihoods are recommended accordingly.

In doing so, there are three major technical questions. The first question is how to construct the item vector $\mathbf{x}$. The second one is how to represent a user history $S_{\le t}$. Lastly, we must decide how to construct such a probability density function $p(\mathbf{x} \mid S_{\le t})$. We answer each of these questions in the following subsections.

Item Representation

In our model, item embeddings are pretrained and kept fixed during training. Although no assumption is posed on item embeddings, the distance between two embeddings should be able to explain certain relationships between two items, such as content similarity and co-purchase likelihood. That is, the closer the distance is, the stronger the relationship should be.

In this paper, we train item embeddings in a way similar to the continuous bag-of-words model (Mikolov et al., 2013) by treating a user sequence of items as a sentence, and each item as a word. Under this setting, the distance between two items in the vector space can be explained by their chance of co-occurring in a sequence: the closer the distance is, the higher the chance that the two items occur in the same sequence.

As a result, an item embedding matrix $E \in \mathbb{R}^{|V| \times d}$ is generated, where $|V|$ is the number of items and each row is the representation of an item. For each item $s$, its $d$-dimensional vector representation $\mathbf{x}$ can be retrieved as the corresponding row of $E$.

History Representation

A user’s history is recorded as a sequence of items, either viewed or purchased, $S_{\le t} = (s_1, \ldots, s_t)$. After mapping each item to its vector, we get a sequence of item vectors $(\mathbf{x}_1, \ldots, \mathbf{x}_t)$. In this paper, we experimented with three alternatives to represent user history.

Continuous Bag-of-Items Representation (CBoI)

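The first proposed method is to simply bag all the items in the history $S_{\le t}$ into a single vector $\mathbf{b} \in \mathbb{R}^{|V|}$, where $|V|$ is the number of items. Any element of $\mathbf{b}$ corresponding to an item existing in $S_{\le t}$ is assigned the frequency of that item, and 0 otherwise. This vector is multiplied from the left by the (transposed) item embedding matrix $E$, whose rows are item vectors:

$$\mathbf{h}^{\text{CBoI}} = E^\top \mathbf{b}. \tag{7}$$

We call this representation a continuous bag-of-items (CBoI). In this approach, the ordering of history items does not affect the representation.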
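A minimal NumPy sketch of the continuous bag-of-items computation follows. It assumes items are encoded as integer indices into an embedding matrix with one row per item; the function name `cboi` is hypothetical.

```python
import numpy as np


def cboi(history, embeddings):
    """Continuous bag-of-items vector for a history of item indices.

    history: sequence of non-negative integer item indices.
    embeddings: array of shape (num_items, dim), one row per item (assumed layout).
    """
    # Frequency of each item in the history, zero for items not present.
    counts = np.bincount(history, minlength=embeddings.shape[0])
    # Frequency-weighted sum of the item vectors.
    return embeddings.T @ counts
```

Because the result is a frequency-weighted sum, permuting the history leaves the vector unchanged, which matches the remark that ordering does not affect this representation.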

Recurrent Representation (RNN)

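Recurrent neural networks (RNNs) have become one of the most popular techniques for modelling a sequence. Long short-term memory units (LSTM, Hochreiter and Schmidhuber, 1997) and gated recurrent units (GRU, Cho et al., 2014) are the two most popular variants of RNNs. In this paper, we work with GRUs, which have the following update rule:

$$
\begin{aligned}
\mathbf{r}_i &= \sigma(W_r \mathbf{x}_i + U_r \mathbf{h}_{i-1}), \\
\mathbf{z}_i &= \sigma(W_z \mathbf{x}_i + U_z \mathbf{h}_{i-1}), \\
\tilde{\mathbf{h}}_i &= \tanh(W \mathbf{x}_i + U (\mathbf{r}_i \odot \mathbf{h}_{i-1})), \\
\mathbf{h}_i &= (1 - \mathbf{z}_i) \odot \mathbf{h}_{i-1} + \mathbf{z}_i \odot \tilde{\mathbf{h}}_i,
\end{aligned} \tag{8}
$$

where $\sigma$ is a sigmoid function, $\mathbf{x}_i$ is the input at the $i$-th timestep, and $\odot$ is element-wise multiplication.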
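For readers who prefer code, the sketch below implements one common GRU convention in NumPy and uses it to read a sequence of item vectors. The parameter names (`Wr`, `Ur`, `Wz`, `Uz`, `W`, `U`), the omission of bias terms, and the gate convention are assumptions for illustration and may differ from the authors' setup.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def gru_step(x, h, p):
    """One GRU update; p maps names to weight matrices (biases omitted)."""
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)            # reset gate
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)            # update gate
    h_tilde = np.tanh(p["W"] @ x + p["U"] @ (r * h))  # candidate state
    return (1.0 - z) * h + z * h_tilde                # gated interpolation


def encode(xs, p, hidden_dim):
    """Read item vectors in order, starting from a zero hidden state."""
    h = np.zeros(hidden_dim)
    states = []
    for x in xs:
        h = gru_step(x, h, p)
        states.append(h)
    return np.stack(states)  # shape: (sequence length, hidden_dim)
```

Unlike the bag-of-items vector, the states returned by `encode` depend on the order in which items are read.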

After converting each item into a vector representation, the sequence of item vectors is read by a recurrent neural network. We initialize the recurrent hidden state $\mathbf{h}_0$ as 0. For each item in the history, we get

$$\mathbf{h}_i = \phi(\mathbf{x}_i, \mathbf{h}_{i-1}) \tag{9}$$

for $i = 1, \ldots, t$, where $\phi$ is the GRU recurrent activation function defined in Eq. (8).


With , recurrent user representation is computed by

Attention-based Representation (RNN-ATT)

Inspired by the success of the attention mechanism in machine translation (Bahdanau, Cho, and Bengio, 2014), the proposed method incorporates an attention mechanism into the recurrent history representation when it is used with the recurrent decoder described later in Eq. (16). After the recurrent representation is generated in the same way as in Eq. (9), we use a separate bidirectional recurrent neural network to read the history and generate a sequence of annotated vectors $(\mathbf{a}_1, \ldots, \mathbf{a}_t)$. For the $k$-th mixture component, the attention-based history representation is calculated as

$$\mathbf{c}_k = \sum_{i=1}^{t} \alpha_{k,i} \, \mathbf{a}_i, \tag{11}$$

where the attention weight is computed by

$$\alpha_{k,i} = \frac{\exp\big(f(\mathbf{a}_i, \mathbf{z}_{k-1})\big)}{\sum_{j=1}^{t} \exp\big(f(\mathbf{a}_j, \mathbf{z}_{k-1})\big)}. \tag{12}$$

In Eq. (12), $\mathbf{z}_{k-1}$ is the hidden state of the recurrent neural network in the decoder calculated in Eq. (16), and the function $f$ defines the relevance score of the $i$-th item with respect to $\mathbf{z}_{k-1}$.
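The attention step can be sketched as a softmax over relevance scores followed by a weighted sum of the annotated vectors. The source does not specify the relevance function, so the dot product used below is purely an assumption, as are the names `attend`, `annotations`, and `query`.

```python
import numpy as np


def attend(annotations, query):
    """Softmax attention over annotated vectors.

    annotations: array of shape (T, d), one annotated vector per history item.
    query: array of shape (d,), standing in for the decoder hidden state.
    Returns (context vector, attention weights).
    """
    scores = annotations @ query       # assumed dot-product relevance score
    scores = scores - scores.max()     # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ annotations, weights
```

With a zero query every history item receives the same weight; a query aligned with one annotated vector concentrates the weight on that item.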

Mixture Density Network

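A mixture density network (MDN, Bishop, 1994) formulates the likelihood of an item vector $\mathbf{x}$ conditioned on a user history (represented by $\mathbf{h}$) as a linear combination of kernel functions

$$p(\mathbf{x} \mid \mathbf{h}) = \sum_{k=1}^{K} \alpha_k(\mathbf{h}) \, \mathcal{N}\big(\mathbf{x}; \boldsymbol{\mu}_k(\mathbf{h}), \Sigma_k(\mathbf{h})\big), \tag{13}$$

where $K$ is the number of components used in the mixture. Each kernel is a multivariate Gaussian density function:

$$\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^\top \Sigma_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right), \tag{14}$$

where $d$ is the dimension of $\mathbf{x}$. In order to reduce the computational complexity, the covariance matrix $\Sigma_k$ is assumed to be diagonal, containing only entries for element-wise variances.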
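As an illustration of the mixture likelihood with diagonal covariances, the following NumPy sketch evaluates the log-density of one item vector. The function name and argument layout are hypothetical, and the mixture parameters are taken as given rather than produced by a decoder.

```python
import numpy as np


def mdn_log_likelihood(x, alpha, mu, sigma):
    """Log-density of x under a diagonal-covariance Gaussian mixture.

    x: (d,) item vector.
    alpha: (K,) mixture weights, positive and summing to 1.
    mu: (K, d) component means.
    sigma: (K, d) element-wise standard deviations.
    """
    d = x.shape[0]
    # Log of each diagonal Gaussian kernel, one value per component.
    log_kernel = (
        -0.5 * np.sum(((x - mu) / sigma) ** 2, axis=1)
        - np.sum(np.log(sigma), axis=1)
        - 0.5 * d * np.log(2.0 * np.pi)
    )
    a = np.log(alpha) + log_kernel
    m = a.max()
    return m + np.log(np.sum(np.exp(a - m)))  # log-sum-exp over components
```

Scoring every candidate item vector with this function and keeping the highest-scoring items corresponds to recommending the items with the highest likelihood under the predicted density.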

We propose two methods for generating the parameters of a mixture density network with $K$ components.

Feedforward decoder (FF)

After a user history is encoded into a single user representation , the parameters for the -th mixture–, , and – are generated by


where , , and .

Recurrent decoder (RNN)

In addition to the feedforward decoder, we propose a recurrent decoder. For a mixture density network with $K$ components, the recurrent decoder iterates $K$ times. In each iteration, an RNN takes the history representation as input and generates the parameters of one mixture component. At the $k$-th step, the $k$-th component’s parameters are calculated as


where is a recurrent activation function, and are shared among all mixtures.

After all are generated, the mixture weight is calculated by


where .

Figure 1: Architecture of recurrent decoder with attention-based history representation

Alternatively with the attention-based history representation described in Eq. (11), is replaced by at the -th iteration in Eq. (16). At the -th step, for annotated vectors , attention weight is computed by


Attention-based recurrent representation is computed by


where is the output from Recurrent Representation calculated in Eq. (9). The architecture for recurrent decoder with the attention-based encoder is illustrated in Fig. 1. The attention mechanism allows a model to automatically search for items in the user history relevant to each mixture component.

Experimental Settings


There are multiple configurations of the proposed methods. First, there are three ways to represent a user history: (1) continuous bag-of-items (CBoI), (2) recurrent representation (RNN), and (3) attention-based representation (RNN-ATT). Then, there are two ways to generate mixture parameters: (1) feedforward decoder (FF) and (2) recurrent decoder (RNN). We denote all models evaluated in our experiments by

  1. CBoI-FF-$K$
  2. RNN-FF-$K$
  3. RNN-RNN-$K$
  4. RNN-ATT-RNN-$K$,

where $K$ denotes the number of mixture components in a mixture density network. We test four values of $K$: 1, 2, 4, and 8.

Note that, when $K$ is equal to 1, a mixture model can only output a unimodal Gaussian distribution. This is similar to the work by De Boom et al. (2017), where regression can be viewed as a unimodal Gaussian with an identity covariance matrix.

We consider following baselines:

Recently Viewed Items (RVI)

recommends items a user has viewed in the history, ranked by recency. Although this technique is not a collaborative filtering method, it is widely used as a personalized recommendation module in production systems. In a previous work by Song, Elkahky, and He (2016), a similar approach (Prev-day Click) was adopted as a baseline method, and outperformed all MF-based models in their experiment.

Item-to-Item Collaborative Filtering (Item-CF)

uses a single item as a seed instead of using a whole user history, as described in Eq. (6). Recommended items are ranked by an estimated conditional probability.

Implicit Matrix Factorization (IMF)

is implemented according to Hu, Koren, and Volinsky (2008) using the implicit package. The model is fit using history and future sequences in the training set, and history sequences in the validation and test sets.

All mixture density network models uses 256 as , and are trained using Adam (Kingma and Ba, 2014) to maximize the log-likelihood defined as


where is the number of user sequences in the training set, and is the length of the -th sequence.

In all RNN-based models, a one-layer GRU with 256 hidden units is used. We early-stop training based on F1@20 on a validation set, and report metric on a test set using the best model according to the validation performance.

For implicit matrix factorization, we perform grid search over the number of factors on a validation set, and report the metric using the best model on a test set.

Item embeddings are trained using the continuous bag-of-words model from the FastText package (Bojanowski et al., 2016), with the item embedding dimension set to 100 and the window size set to 5. All sequences in the training set are used for embedding learning. After training, each item vector is normalized by its norm.


We evaluated our model on two publicly available datasets.


MovieLens-20M (Harper and Konstan, 2016) is a classic explicit-feedback collaborative filtering dataset for movie recommendation, in which (user, movie, rating, timestamp) tuples are recorded. We transform MovieLens-20M into an implicit-feedback dataset by only taking records with ratings greater than or equal to 4 as positive observations. User behavior sequences are sorted by time, and those containing more than 15 implicit positive observations are included. The last 15 movies viewed by each user are split into 10 and 5, as history and future respectively. By the nature of this dataset, there are no duplicate items in a user sequence. After preprocessing, 75,962 sequences are kept. 80%, 10%, and 10% of sequences are randomly split into training, validation, and test sets, respectively. A movie vocabulary is built using the training set, containing 16,253 unique movies.
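The preprocessing steps for MovieLens-20M can be sketched in plain Python as follows. This is an illustrative reading of the description, not the authors' script: the function name `build_sequences` and its parameters are hypothetical, and the train/validation/test split and vocabulary construction are left out.

```python
from collections import defaultdict


def build_sequences(ratings, min_rating=4.0, min_len=15, n_history=10, n_future=5):
    """Turn (user, movie, rating, timestamp) tuples into history/future pairs.

    Returns a dict mapping each kept user to (history, future) lists of movies.
    """
    by_user = defaultdict(list)
    for user, movie, rating, ts in ratings:
        if rating >= min_rating:          # keep positive observations only
            by_user[user].append((ts, movie))

    window = n_history + n_future
    result = {}
    for user, events in by_user.items():
        if len(events) <= min_len:        # keep users with more than min_len positives
            continue
        events.sort()                     # chronological order
        recent = [movie for _, movie in events[-window:]]
        result[user] = (recent[:n_history], recent[n_history:])
    return result
```

The strict inequality follows the wording "more than 15" in the text; if sequences of exactly 15 positives were meant to be kept, the comparison would need to change accordingly.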


RecSys15 is an implicit feedback dataset, containing click and purchase events from an online e-commerce website. We only work with the training file in the original dataset, and keep the click events with timestamps. We filter out sequences of length less than 15, and use the final 2 clicks as the future and the first 13 clicks as the history. We do not filter out duplicate items, and as a result the same item could appear in both history and target parts. After preprocessing, we are left with 168,202 sequences. 80%, 10%, and 10% of sequences are randomly split into training, validation, and test sets, respectively. An item vocabulary is built only using items in the training set, leaving us with 32,117 unique items.

Figure 2: Precision, Recall, and nDCG with varying number of mixture components on (a) MovieLens-20M and (b) RecSys15. On RecSys15, Implicit Matrix Factorization scored 0.0176, 0.0878, and 0.0402 for precision@10, recall@10, and nDCG@10, respectively, and is not shown in the figure.


There are various metrics that could be used to evaluate the performance of a recommender system. In this paper, we use precision, recall, and nDCG. Higher values indicate better performance under these metrics.

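We denote the top-$k$ recommended items by $\hat{Y}_{1:k} = (\hat{y}_1, \ldots, \hat{y}_k)$, where the items are ranked by the recommendation system; and we denote the target items by $Y$.

Precision@k,

$$\text{Precision@}k = \frac{|\hat{Y}_{1:k} \cap Y|}{k},$$

calculates the fraction of top-$k$ recommended items which overlap with the target items.

Recall@k,

$$\text{Recall@}k = \frac{|\hat{Y}_{1:k} \cap Y|}{|Y|},$$

calculates the fraction of target items which overlap with the top-$k$ recommended items.

nDCG@k computes the quality of ranking by comparing the recommendation DCG with the optimal DCG (Järvelin and Kekäläinen, 2002). In implicit feedback datasets, relevance scores for items in the target set are set to 1. DCG@k is calculated as

$$\text{DCG@}k = \sum_{i=1}^{k} \frac{\mathbb{1}[\hat{y}_i \in Y]}{\log_2(i + 1)}.$$

The optimal DCG is calculated as

$$\text{IDCG@}k = \sum_{i=1}^{\min(k, |Y|)} \frac{1}{\log_2(i + 1)}.$$

nDCG@k is calculated as

$$\text{nDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}.$$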
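The three metrics can be written compactly in Python. The sketch assumes binary relevance, a ranked list without repeated items, and the common logarithmic discount $1/\log_2(i+1)$ at rank $i$; the exact DCG variant used by the authors is not recoverable from the text, so treat the discount as an assumption.

```python
import math


def precision_at_k(ranked, targets, k):
    """Fraction of the top-k recommendations that are target items."""
    return len(set(ranked[:k]) & set(targets)) / k


def recall_at_k(ranked, targets, k):
    """Fraction of target items that appear in the top-k recommendations."""
    return len(set(ranked[:k]) & set(targets)) / len(targets)


def ndcg_at_k(ranked, targets, k):
    """DCG of the ranking divided by the DCG of an ideal ranking (binary gains)."""
    targets = set(targets)
    # enumerate is zero-based, so rank i + 1 gets discount 1 / log2(i + 2).
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, item in enumerate(ranked[:k])
        if item in targets
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(targets))))
    return dcg / ideal
```

A ranking that places all target items at the top yields an nDCG of 1, and moving a target item further down the list lowers the score.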



Table 1 summarizes the results of our experiments. As MovieLens has no duplicate items in a sequence, RVI is not used on that dataset. From the results on MovieLens, we first observe that the proposed RNN-ATT-RNN-{4,8} models consistently outperformed the other methods in all the metrics by large margins. Second, we see that CBoI-FF does not work well regardless of the number of components used, while the performance is substantially improved with the recurrent encoder. Third, comparing the two baseline models, Item-CF outperforms IMF by a good margin across all metrics.

On RecSys15, besides the similar trends we see from MovieLens, there are several new observations. First, RVI outperforms all the models except for RNN-ATT-RNN-{2,4,8} on Precision@10 and Recall@10. This result is in line with Song, Elkahky, and He (2016), who also observed competitive performance from using the previous day’s clicks for news recommendation. Second, IMF is the worst performing model on this dataset. We conjecture that this is because IMF only recommends items a user has not interacted with, while in click streams like RecSys15, items in the history are likely to reappear in the future.

To better understand the effect of mixture components on various model architectures, we group the results by the number of components used across various models and visualize them in Fig. 2. We observe that RNN-ATT-RNN-$K$ achieves the most visible improvement as the number of mixture components increases. We also notice that for all MDN-based architectures, using two mixture components always achieves a better result than using one. However, unless the attention mechanism is used, we see diminishing improvements with more components.

The experiments have revealed that it is clearly beneficial to capture the multimodal nature of prediction in a recommender system. This is however only possible with the right choice of user representation and right mechanism for generating mixture parameters. In these experiments, our novel approach, the attention-based recurrent history representation combined with the recurrent decoder, was found to be the best choice in both datasets. We have further learned that the user preference is not static across time, and that it is beneficial to model the user history as a sequence rather than a bag.

Conclusion & Future Work

In this paper, we proposed a method to construct a recommender system by generating the probability density function over future item vectors. The proposed model combines a recurrent user history representation with a mixture density network, where a novel attention-based recurrent mixture density network is proposed to output each mixture component sequentially. The experiments on two publicly available datasets, MovieLens-20M and RecSys15, have demonstrated significant improvement in recall, precision, and nDCG compared against various baselines, validating the advantage of modelling the multimodal nature of the predictive distribution in a recommendation system.

(a) MovieLens-20M

| Model | P@10 | Model | P@20 | Model | R@10 | Model | R@20 | Model | nDCG@10 | Model | nDCG@20 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CBoI-FF-4 | 0.0283 | IMF | 0.0255 | CBoI-FF-4 | 0.0567 | IMF | 0.1022 | CBoI-FF-1 | 0.0422 | CBoI-FF-1 | 0.0609 |
| CBoI-FF-1 | 0.0286 | CBoI-FF-1 | 0.0273 | CBoI-FF-1 | 0.0573 | CBoI-FF-1 | 0.1094 | CBoI-FF-8 | 0.0424 | CBoI-FF-4 | 0.0610 |
| CBoI-FF-8 | 0.0286 | CBoI-FF-4 | 0.0274 | CBoI-FF-2 | 0.0573 | CBoI-FF-4 | 0.1097 | CBoI-FF-4 | 0.0426 | CBoI-FF-2 | 0.0615 |
| CBoI-FF-2 | 0.0289 | CBoI-FF-2 | 0.0275 | CBoI-FF-8 | 0.0578 | CBoI-FF-2 | 0.1102 | CBoI-FF-2 | 0.0426 | CBoI-FF-8 | 0.0616 |
| IMF | 0.0301 | CBoI-FF-8 | 0.0277 | IMF | 0.0603 | CBoI-FF-8 | 0.1107 | IMF | 0.0492 | IMF | 0.0633 |
| Item-CF | 0.0337 | Item-CF | 0.0288 | Item-CF | 0.0674 | Item-CF | 0.1150 | RNN-FF-1 | 0.0514 | RNN-FF-1 | 0.0711 |
| RNN-FF-1 | 0.0343 | RNN-FF-8 | 0.0307 | RNN-FF-1 | 0.0686 | RNN-FF-8 | 0.1229 | RNN-FF-2 | 0.0524 | RNN-FF-8 | 0.0715 |
| RNN-FF-2 | 0.0348 | RNN-FF-1 | 0.0308 | RNN-FF-2 | 0.0695 | RNN-FF-1 | 0.1234 | RNN-FF-8 | 0.0525 | Item-CF | 0.0716 |
| RNN-FF-4 | 0.0350 | RNN-FF-4 | 0.0311 | RNN-FF-4 | 0.0700 | RNN-FF-4 | 0.1245 | RNN-RNN-4 | 0.0526 | RNN-FF-4 | 0.0724 |
| RNN-RNN-4 | 0.0350 | RNN-FF-2 | 0.0313 | RNN-RNN-4 | 0.0701 | RNN-FF-2 | 0.1255 | RNN-FF-4 | 0.0528 | RNN-FF-2 | 0.0725 |
| RNN-FF-8 | 0.0351 | RNN-RNN-2 | 0.0315 | RNN-FF-8 | 0.0702 | RNN-RNN-2 | 0.1260 | RNN-RNN-1 | 0.0534 | RNN-RNN-4 | 0.0732 |
| RNN-RNN-1 | 0.0352 | RNN-ATT-RNN-2 | 0.0316 | RNN-RNN-1 | 0.0705 | RNN-RNN-8 | 0.1267 | RNN-RNN-2 | 0.0535 | RNN-RNN-2 | 0.0735 |
| RNN-RNN-2 | 0.0352 | RNN-RNN-8 | 0.0317 | RNN-RNN-2 | 0.0705 | RNN-ATT-RNN-2 | 0.1267 | RNN-ATT-RNN-1 | 0.0536 | RNN-RNN-8 | 0.0741 |
| RNN-ATT-RNN-1 | 0.0356 | RNN-RNN-4 | 0.0318 | RNN-ATT-RNN-1 | 0.0712 | RNN-RNN-4 | 0.1273 | RNN-RNN-8 | 0.0542 | RNN-RNN-1 | 0.0742 |
| RNN-ATT-RNN-2 | 0.0356 | RNN-RNN-1 | 0.0320 | RNN-ATT-RNN-2 | 0.0712 | RNN-RNN-1 | 0.1281 | Item-CF | 0.0544 | RNN-ATT-RNN-1 | 0.0743 |
| RNN-RNN-8 | 0.0358 | RNN-ATT-RNN-1 | 0.0322 | RNN-RNN-8 | 0.0717 | RNN-ATT-RNN-1 | 0.1287 | RNN-ATT-RNN-2 | 0.0544 | RNN-ATT-RNN-2 | 0.0745 |
| RNN-ATT-RNN-8 | 0.0363 | RNN-ATT-RNN-4 | 0.0324 | RNN-ATT-RNN-8 | 0.0727 | RNN-ATT-RNN-4 | 0.1296 | RNN-ATT-RNN-8 | 0.0548 | RNN-ATT-RNN-4 | 0.0756 |
| RNN-ATT-RNN-4 | 0.0365 | RNN-ATT-RNN-8 | 0.0327 | RNN-ATT-RNN-4 | 0.0731 | RNN-ATT-RNN-8 | 0.1307 | RNN-ATT-RNN-4 | 0.0552 | RNN-ATT-RNN-8 | 0.0757 |

(b) RecSys15

| Model | P@10 | Model | P@20 | Model | R@10 | Model | R@20 | Model | nDCG@10 | Model | nDCG@20 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| IMF | 0.0176 | IMF | 0.0127 | IMF | 0.0878 | IMF | 0.1267 | IMF | 0.0402 | IMF | 0.0523 |
| CBoI-FF-1 | 0.0312 | CBoI-FF-1 | 0.0215 | CBoI-FF-1 | 0.1559 | CBoI-FF-8 | 0.2146 | CBoI-FF-1 | 0.0996 | CBoI-FF-1 | 0.1215 |
| CBoI-FF-8 | 0.0318 | CBoI-FF-8 | 0.0215 | CBoI-FF-8 | 0.1592 | CBoI-FF-1 | 0.2148 | CBoI-FF-8 | 0.1020 | CBoI-FF-8 | 0.1229 |
| CBoI-FF-4 | 0.0322 | CBoI-FF-4 | 0.0217 | CBoI-FF-4 | 0.1610 | CBoI-FF-4 | 0.2166 | CBoI-FF-4 | 0.1031 | CBoI-FF-4 | 0.1241 |
| CBoI-FF-2 | 0.0323 | CBoI-FF-2 | 0.0217 | CBoI-FF-2 | 0.1616 | CBoI-FF-2 | 0.2171 | CBoI-FF-2 | 0.1032 | CBoI-FF-2 | 0.1242 |
| Item-CF | 0.0487 | RVI | 0.0273 | Item-CF | 0.2435 | RVI | 0.2728 | RNN-RNN-1 | 0.1531 | RNN-FF-2 | 0.1242 |
| RNN-RNN-1 | 0.0491 | RNN-FF-1 | 0.0320 | RNN-RNN-1 | 0.2453 | RNN-FF-2 | 0.3198 | Item-CF | 0.1536 | RVI | 0.1793 |
| RNN-FF-1 | 0.0496 | RNN-FF-2 | 0.0320 | RNN-FF-1 | 0.2478 | RNN-FF-1 | 0.3201 | RNN-FF-1 | 0.1547 | RNN-RNN-1 | 0.1833 |
| RNN-FF-2 | 0.0514 | RNN-RNN-1 | 0.0321 | RNN-FF-2 | 0.2571 | RNN-RNN-1 | 0.3212 | RNN-FF-4 | 0.1601 | RNN-FF-1 | 0.1839 |
| RNN-ATT-RNN-1 | 0.0516 | RNN-RNN-2 | 0.0324 | RNN-RNN-2 | 0.2581 | RNN-RNN-2 | 0.3235 | RNN-FF-2 | 0.1603 | Item-CF | 0.1867 |
| RNN-RNN-2 | 0.0516 | RNN-FF-4 | 0.0324 | RNN-ATT-RNN-1 | 0.2582 | RNN-FF-4 | 0.3237 | RNN-ATT-RNN-1 | 0.1612 | RNN-FF-4 | 0.1875 |
| RNN-FF-4 | 0.0516 | RNN-FF-8 | 0.0327 | RNN-FF-4 | 0.2582 | RNN-FF-8 | 0.3269 | RNN-RNN-2 | 0.1612 | RNN-RNN-2 | 0.1886 |
| RNN-FF-8 | 0.0522 | RNN-ATT-RNN-1 | 0.0331 | RNN-FF-8 | 0.2612 | RNN-RNN-8 | 0.3306 | RNN-FF-8 | 0.1623 | RNN-FF-8 | 0.1906 |
| RVI | 0.0524 | RNN-RNN-8 | 0.0331 | RVI | 0.2621 | RNN-ATT-RNN-1 | 0.3313 | RNN-RNN-8 | 0.1643 | RNN-ATT-RNN-1 | 0.1913 |
| RNN-RNN-8 | 0.0529 | Item-CF | 0.0334 | RNN-RNN-8 | 0.2645 | RNN-RNN-4 | 0.3337 | RNN-RNN-4 | 0.1666 | RNN-RNN-8 | 0.1925 |
| RNN-RNN-4 | 0.0534 | RNN-RNN-4 | 0.0334 | RNN-RNN-4 | 0.2668 | Item-CF | 0.3342 | RNN-ATT-RNN-2 | 0.1676 | RNN-RNN-4 | 0.1943 |
| RNN-ATT-RNN-2 | 0.0536 | RNN-ATT-RNN-2 | 0.0336 | RNN-ATT-RNN-2 | 0.2682 | RNN-ATT-RNN-2 | 0.3362 | RVI | 0.1751 | RNN-ATT-RNN-2 | 0.1962 |
| RNN-ATT-RNN-4 | 0.0559 | RNN-ATT-RNN-4 | 0.0344 | RNN-ATT-RNN-4 | 0.2797 | RNN-ATT-RNN-4 | 0.3439 | RNN-ATT-RNN-4 | 0.1796 | RNN-ATT-RNN-4 | 0.2054 |
| RNN-ATT-RNN-8 | 0.0589 | RNN-ATT-RNN-8 | 0.0358 | RNN-ATT-RNN-8 | 0.2944 | RNN-ATT-RNN-8 | 0.3578 | RNN-ATT-RNN-8 | 0.1903 | RNN-ATT-RNN-8 | 0.2140 |

Table 1: Recall, Precision and nDCG on MovieLens and RecSys15. Results are sorted by each metric, so each pair of Model and metric columns is ordered independently.

To explore the full potential of our model, there are several areas in which more research needs to be done. First, to better understand our model, more thorough analysis on the learned mixture components and the attention weights should be conducted. Second, we use embeddings pretrained using the word2vec objective, which leads to embeddings that learn the distributional, user-behavior based properties of items. One way to extend our model is to incorporate content-based attributes into the item embeddings we use, and create a hybrid recommender system.


TW sincerely thanks Tommy Chen, Andrew Drozdov, Daniel Galron, Timothy Heath, Alex Shen, Krutika Shetty, Stephen Wu, Lijia Xie, and Kelly Zhang for helpful discussions and insightful feedback. KC thanks eBay, TenCent, Facebook, Google and NVIDIA for their support, and was partly supported by Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI).


  • Bahdanau, Cho, and Bengio (2014) Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Barkan and Koenigstein (2016) Barkan, O., and Koenigstein, N. 2016. Item2vec: neural item embedding for collaborative filtering. In Machine Learning for Signal Processing (MLSP), 2016 IEEE 26th International Workshop on, 1–6. IEEE.
  • Bengio et al. (2003) Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C. 2003. A neural probabilistic language model. Journal of machine learning research 3(Feb):1137–1155.
  • Bishop (1994) Bishop, C. M. 1994. Mixture density networks.
  • Bojanowski et al. (2016) Bojanowski, P.; Grave, E.; Joulin, A.; and Mikolov, T. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
  • Cho et al. (2014) Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
  • De Boom et al. (2017) De Boom, C.; Agrawal, R.; Hansen, S.; Kumar, E.; Yon, R.; Chen, C.-W.; Demeester, T.; and Dhoedt, B. 2017. Large-scale user modeling with recurrent neural networks for music discovery on multiple time scales. Multimedia Tools and Applications 1–23.
  • Harper and Konstan (2016) Harper, F. M., and Konstan, J. A. 2016. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5(4):19.
  • Hidasi et al. (2015) Hidasi, B.; Karatzoglou, A.; Baltrunas, L.; and Tikk, D. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939.
  • Hidasi et al. (2016) Hidasi, B.; Quadrana, M.; Karatzoglou, A.; and Tikk, D. 2016. Parallel recurrent neural network architectures for feature-rich session-based recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, 241–248. ACM.
  • Hochreiter and Schmidhuber (1997) Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
  • Hu, Koren, and Volinsky (2008) Hu, Y.; Koren, Y.; and Volinsky, C. 2008. Collaborative filtering for implicit feedback datasets. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on, 263–272. Ieee.
  • Järvelin and Kekäläinen (2002) Järvelin, K., and Kekäläinen, J. 2002. Cumulated gain-based evaluation of ir techniques. ACM Transactions on Information Systems (TOIS) 20(4):422–446.
  • Kingma and Ba (2014) Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Koren, Bell, and Volinsky (2009) Koren, Y.; Bell, R.; and Volinsky, C. 2009. Matrix factorization techniques for recommender systems. Computer 42(8).
  • Liang et al. (2016) Liang, D.; Altosaar, J.; Charlin, L.; and Blei, D. M. 2016. Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence. In Proceedings of the 10th ACM conference on recommender systems, 59–66. ACM.
  • Linden, Smith, and York (2003) Linden, G.; Smith, B.; and York, J. 2003. Amazon. com recommendations: Item-to-item collaborative filtering. IEEE Internet computing 7(1):76–80.
  • Liu et al. (2017) Liu, D. C.; Rogers, S.; Shiau, R.; Kislyuk, D.; Ma, K. C.; Zhong, Z.; Liu, J.; and Jing, Y. 2017. Related pins at pinterest: The evolution of a real-world recommender system. In Proceedings of the 26th International Conference on World Wide Web Companion, 583–592. International World Wide Web Conferences Steering Committee.
  • Mikolov et al. (2010) Mikolov, T.; Karafiát, M.; Burget, L.; Cernockỳ, J.; and Khudanpur, S. 2010. Recurrent neural network based language model. In Interspeech, volume 2,  3.
  • Mikolov et al. (2013) Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • Rendle et al. (2009) Rendle, S.; Freudenthaler, C.; Gantner, Z.; and Schmidt-Thieme, L. 2009. Bpr: Bayesian personalized ranking from implicit feedback. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, 452–461. AUAI Press.
  • Song, Elkahky, and He (2016) Song, Y.; Elkahky, A. M.; and He, X. 2016. Multi-rate deep learning for temporal recommendation. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, 909–912. ACM.
  • Sutskever, Vinyals, and Le (2014) Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, 3104–3112.
  • Tan, Xu, and Liu (2016) Tan, Y. K.; Xu, X.; and Liu, Y. 2016. Improved recurrent neural networks for session-based recommendations. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 17–22. ACM.
  • Wu et al. (2017) Wu, C.-Y.; Ahmed, A.; Beutel, A.; Smola, A. J.; and Jing, H. 2017. Recurrent recommender networks. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 495–503. ACM.