Introduction
Recommender systems have become an indispensable part of the e-commerce industry, helping customers sort out items of interest from large inventories. Among the most popular techniques are matrix factorization (MF) based models (see, e.g., Hu, Koren, and Volinsky, 2008; Koren, Bell, and Volinsky, 2009; Rendle et al., 2009), which decompose a user–item matrix into user and item matrices. Such an approach treats recommendation as a matrix completion/imputation problem, where missing entries in the original matrix are estimated by the dot product between the corresponding user and item factors. Despite their popularity in recommender systems, MF-based models have their limitations. First, MF-based models aim at reconstructing user history instead of predicting future behaviors; the underlying assumption is that user preference is static over time. Second, most of such approaches omit the ordering information in a user history. To address these issues, an increasing number of recent works have begun to treat user behaviors as sequences and to predict future events based on history
(see, e.g., Hidasi et al., 2015; Tan, Xu, and Liu, 2016; De Boom et al., 2017).

Recurrent neural networks (RNNs) are among the most widely used techniques for sequence modelling (see, e.g., Mikolov et al., 2010; Bahdanau, Cho, and Bengio, 2014; Sutskever, Vinyals, and Le, 2014), and have recently been considered for history-based recommender systems (see, e.g., Hidasi et al., 2015; Tan, Xu, and Liu, 2016; Wu et al., 2017). In the implicit-feedback scenario, where binary user–item interactions are recorded, the previously mentioned works pose recommendation as a classification task. For a recommender system containing millions of items, a classification approach must calculate a score for each user–item pair, leading to a scalability issue in both training and prediction. Beyond scalability, De Boom et al. (2017) demonstrated that such an approach fails to recommend relevant items to users on their dataset containing more than 6 million songs.
Instead, De Boom et al. (2017) formulated the problem as regression rather than classification to address the scalability and performance-degradation issues. However, due to the noisy and multimodal nature of user behavior, building a mapping from history to future is an ill-posed problem, and regression may not be suitable.
In this work, we present an approach to history-based recommendation that generates a conditional distribution over future item vectors. The proposed model uses a recurrent network to summarize a user history, and an attention-based recurrent mixture density network, which generates each component in the mixture sequentially, to model a multimodal conditional distribution.
Our model is evaluated on MovieLens-20M (Harper and Konstan, 2016) and RecSys15 (http://2015.recsyschallenge.com/) in an implicit-feedback setting. Experimental results demonstrate that the proposed model improves recall, precision, and nDCG significantly compared to various baselines. A comprehensive analysis of model configurations shows that increasing the number of mixture components improves recommendations by better capturing the multimodality in user behavior.
Our model explores a new direction for recommender systems by conducting density estimation over continuous item representations. It provides an effective way to generate components for a mixture network, which potentially benefits any application using mixture models.
Related Work
A History-based Recommender System with a Recurrent Neural Network
RNNs were first used to model a user history in recommender systems by Hidasi et al. (2015), where an RNN was used to model previous items and predict the next one in a user sequence. Tan, Xu, and Liu (2016) improved recommendation performance on a similar architecture with data augmentation. To better leverage item features, Hidasi et al. (2016) introduced a parallel RNN architecture to jointly model user behaviors and item features. Wu et al. (2017) proposed a new architecture that uses separate recurrent neural networks to update user and item representations in a temporal fashion.
There are two major differences between the proposed approach and the previously mentioned work. First, our work frames the task of implicit-feedback recommendation as density estimation in a continuous space rather than classification over a discrete output space. Second, unlike most of the earlier works, in which the whole system was trained end-to-end, the proposed model leverages an external algorithm to extract item representations, allowing the system to cope with new items more easily.
More recently, De Boom et al. (2017) formulated a history-based recommender system with pre-trained continuous item representations as a regression problem. In their work, a recurrent neural network read through a user’s history, as a sequence of listened songs, and extracted a fixed-length user taste vector, which was later used to predict future songs.
The major difference between the proposed work and the work by De Boom et al. (2017) is the assumption on the number of modes in the distribution of user behaviors. The proposed model treats the mapping from history to a future behavior as a probability distribution with multiple modes, unlike their work, in which such a distribution is assumed to be unimodal. We do so by using a variant of the mixture density network (Bishop, 1994) to explicitly model user behavior. Their approach can be considered a special case of the proposed model with a single mixture component.

Continuous Item Representation
Inspired by recent advances in word representation learning (Mikolov et al., 2013), various methods have been proposed to embed an item into a distributed vector that encodes useful information for recommendation. Barkan and Koenigstein (2016) learned item vectors using an approach similar to Word2Vec (Mikolov et al., 2013) by treating each item as a word, without considering the order of items. Liang et al. (2016) jointly factorized a user–item matrix and an item–item matrix to obtain item vectors. In the work by Liu et al. (2017), a vector was learned for each pin on Pinterest using a Word2Vec-inspired model.
In this paper, we use external knowledge to extract item representations instead of training them jointly. Such an approach is effective because it enables the use of any recent advances in representation learning and has the potential to incorporate new items unseen during training.
Model
Recommendation Framework
In the implicit feedback setting, a user behavior is recorded as a sequence of interacted items, which can be a mixture of various behaviors, including viewing, purchasing, searching and others. For simplicity, we only focus on the viewing behavior in our model. We frame the task of recommendation as a sequence modelling problem with the goal of predicting the future directly.
Given a splitting index $M$ and a user behavior sequence $S = (v_1, \ldots, v_T)$, the sequence can be split into the history $H = (v_1, \ldots, v_M)$ and the future $F = (v_{M+1}, \ldots, v_T)$. A recommender system, parametrized by $\theta$, aims at modelling the probability of future items conditioned on historical items, $p_\theta(F \mid H)$. For simplicity, we omit $\theta$ in our notation and assume that the items in $F$ are independent:

$$p(F \mid H) = \prod_{t=M+1}^{T} p(v_t \mid H). \tag{1}$$
This conditional probability can be approximated by, for instance, an $n$-gram conditional probability

$$p(v_t \mid H) \approx p(v_t \mid v_{t-n+1}, \ldots, v_{t-1}), \tag{2}$$

where $v_{t-n+1}, \ldots, v_{t-1}$ are previously viewed items.
An $n$-gram statistics table records the number of occurrences of each item $n$-gram in the training corpus. Based on this, the approximated conditional probability can be expressed as

$$p(v_t \mid v_{t-n+1}, \ldots, v_{t-1}) = \frac{c(v_{t-n+1}, \ldots, v_{t-1}, v_t)}{c(v_{t-n+1}, \ldots, v_{t-1})}, \tag{3}$$

where $c(\cdot)$ is the count in the training corpus. When $n$ equals two, this setting is a close variant of item-to-item collaborative filtering (Linden, Smith, and York, 2003), in which the temporal dependency among items is ignored.
Conditioned on a seed item $v_s$ in a user history, item-to-item collaborative filtering recommends the items $v$ having the highest co-view probability

$$v^{*} = \operatorname*{argmax}_{v} \; p(v \mid v_s). \tag{4}$$

The statistics table contains the number of occurrences $c(v_i, v_j)$ of each item pair $(v_i, v_j)$. One can estimate the item-to-item conditional probability by

$$p(v_j \mid v_i) = \frac{c(v_i, v_j)}{c(v_i)}. \tag{5}$$
With the pairwise conditional probability, $p(F \mid H)$ can be approximated using $p(F \mid v)$ by randomly sampling an item $v \in H$. To stabilize the result, we take the average of the approximated probability over all items in the history:

$$p(F \mid H) \approx \frac{1}{|H|} \sum_{v \in H} \prod_{v' \in F} p(v' \mid v). \tag{6}$$

A major limitation of such count-based methods is data sparsity, as a large number of $n$-grams do not occur in the training corpus.
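The count-based estimators of Eqs. (5)–(6) can be sketched in a few lines. The following is a minimal illustration on toy data; the variable names and the three-sequence corpus are hypothetical, not from the paper:

```python
from collections import Counter

# Toy viewing sequences; in practice these come from the training corpus.
sequences = [["a", "b", "c"], ["a", "b"], ["b", "c"]]

item_count = Counter()
pair_count = Counter()
for seq in sequences:
    item_count.update(seq)
    for i, vi in enumerate(seq):
        for vj in seq[i + 1:]:
            pair_count[(vi, vj)] += 1  # ordered co-view pair (v_i before v_j)

def p_item_given_item(vj, vi):
    """Estimate p(v_j | v_i) = c(v_i, v_j) / c(v_i), as in Eq. (5)."""
    return pair_count[(vi, vj)] / item_count[vi]

def p_future_given_history(future, history):
    """Average the per-seed estimates over the history, as in Eq. (6)."""
    total = 0.0
    for vi in history:
        p = 1.0
        for vj in future:
            p *= p_item_given_item(vj, vi)
        total += p
    return total / len(history)
```

The sparsity limitation is visible even here: any pair unseen in the corpus gets probability exactly zero.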
To address the data sparsity issue in count-based methods, Bengio et al. (2003) proposed a neural language model, in which each word is represented as a continuous vector. In this paper, we take a similar approach by representing each item $v$ as a continuous vector $e_v \in \mathbb{R}^d$. Unlike earlier works, which use continuous representations as input only and model the discrete probability distribution as classification, we instead choose to directly model the probability density function $p(e \mid H)$ over continuous item representations. Items with the highest likelihoods are recommended accordingly.

In doing so, there are three major technical questions. The first is how to construct the item vector $e_v$. The second is how to represent a user history $H$. Lastly, we must decide how to construct the probability density function $p(e \mid H)$. We answer each of these questions in the following subsections.
Item Representation
In our model, item embeddings are pre-trained and kept fixed during training. Although no assumption is imposed on the item embeddings, the distance between two embeddings should be able to explain certain relationships between the two items, such as content similarity and co-purchase likelihood. That is, the closer the distance, the stronger the relationship should be.
In this paper, we train item embeddings in a way similar to the continuous bag-of-words model (Mikolov et al., 2013) by treating a user’s sequence of items as a sentence and each item as a word. Under this setting, the distance between two items in vector space can be explained by their chance of co-occurrence in a sequence: the closer the distance, the higher the chance that the two items occur in the same sequence.
As a result, a valid item embedding matrix $E \in \mathbb{R}^{|V| \times d}$ is generated, where each row is the representation of an item. For each item $v$, its $d$-dimensional vector representation $e_v$ can be retrieved from the corresponding row of $E$.
History Representation
A user’s history is recorded as a sequence of items, either viewed or purchased, $H = (v_1, \ldots, v_M)$. After mapping each item to its vector, we get a sequence of item vectors $(e_{v_1}, \ldots, e_{v_M})$. In this paper, we experiment with three alternatives for representing a user history.
Continuous Bag-of-Items Representation (CBoI)

The first proposed method is to simply bag all the items in $H$ into a single vector $b \in \mathbb{R}^{|V|}$. Each element of $b$ corresponding to an item in $H$ is assigned the frequency of that item, and 0 otherwise. This vector is multiplied from the left by the item embedding matrix:

$$h = E^{\top} b. \tag{7}$$

We call this representation a continuous bag-of-items (CBoI). In this approach, the ordering of history items does not affect the representation.
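Eq. (7) amounts to a frequency-weighted sum of embedding rows. A minimal NumPy sketch (the random embedding matrix and integer item ids are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, dim = 5, 4
E = rng.normal(size=(num_items, dim))  # item embedding matrix, one row per item

def cboi(history, E):
    """Continuous bag-of-items, Eq. (7): h = E^T b, where b holds item frequencies."""
    b = np.zeros(E.shape[0])
    for item in history:
        b[item] += 1.0
    return E.T @ b

h = cboi([1, 3, 3], E)
```

Because only frequencies enter $b$, permuting the history leaves $h$ unchanged, which is exactly the order-insensitivity noted above.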
Recurrent Representation (RNN)
Recurrent neural networks (RNNs) have become one of the most popular techniques for modelling a sequence. Long short-term memory units (LSTM, Hochreiter and Schmidhuber, 1997) and gated recurrent units (GRU, Cho et al., 2014) are the two most popular variants of RNNs. In this paper, we work with GRUs, which have the following update rule:

$$\begin{aligned} r_t &= \sigma(W_r x_t + U_r h_{t-1}) \\ z_t &= \sigma(W_z x_t + U_z h_{t-1}) \\ \tilde{h}_t &= \tanh(W x_t + U (r_t \odot h_{t-1})) \\ h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t, \end{aligned} \tag{8}$$

where $\sigma$ is the sigmoid function, $x_t$ is the input at the $t$-th timestep, and $\odot$ is element-wise multiplication.

After converting each item into its vector representation, the sequence of item vectors is read by a recurrent neural network. We initialize the recurrent hidden state as $h_0 = 0$. For each item in the history, we compute

$$h_t = \phi(h_{t-1}, e_{v_t}) \tag{9}$$

for $t = 1, \ldots, M$, where $\phi$ is the GRU recurrent activation function defined in Eq. (8). With $h_M$, the recurrent user representation is computed as

$$h = h_M. \tag{10}$$
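The GRU update rule of Eq. (8) and the left-to-right read of Eq. (9) can be written directly in NumPy. This is a sketch with randomly initialized, untrained weights (biases omitted for brevity), not the paper's trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Plain-NumPy GRU cell following the update rule in Eq. (8)."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        shape_in, shape_h = (hidden_dim, input_dim), (hidden_dim, hidden_dim)
        self.Wr, self.Ur = rng.normal(0, 0.1, shape_in), rng.normal(0, 0.1, shape_h)
        self.Wz, self.Uz = rng.normal(0, 0.1, shape_in), rng.normal(0, 0.1, shape_h)
        self.W, self.U = rng.normal(0, 0.1, shape_in), rng.normal(0, 0.1, shape_h)
        self.hidden_dim = hidden_dim

    def step(self, h_prev, x):
        r = sigmoid(self.Wr @ x + self.Ur @ h_prev)            # reset gate
        z = sigmoid(self.Wz @ x + self.Uz @ h_prev)            # update gate
        h_tilde = np.tanh(self.W @ x + self.U @ (r * h_prev))  # candidate state
        return (1.0 - z) * h_prev + z * h_tilde

    def encode(self, item_vectors):
        """Read the sequence as in Eq. (9); the final state is the user representation."""
        h = np.zeros(self.hidden_dim)
        for x in item_vectors:
            h = self.step(h, x)
        return h

cell = GRUCell(input_dim=4, hidden_dim=8)
user_rep = cell.encode([np.ones(4), -np.ones(4), np.ones(4)])
```

Unlike CBoI, reordering the inputs changes the final state, so this representation is order-sensitive.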
Attention-based Representation (RNN-ATT)
Inspired by the success of the attention mechanism in machine translation (Bahdanau, Cho, and Bengio, 2014), the proposed method incorporates an attention mechanism into the recurrent history representation when it is used with the recurrent decoder later described in Eq. (16). After $(h_1, \ldots, h_M)$ is generated in the same way as in Eq. (9), we use a separate bidirectional recurrent neural network to read $(e_{v_1}, \ldots, e_{v_M})$ and generate a sequence of annotation vectors $(z_1, \ldots, z_M)$. For the $k$-th mixture component, the attention-based history representation $c_k$ is calculated as

$$c_k = \sum_{t=1}^{M} \alpha_{k,t} z_t, \tag{11}$$

where the attention weight $\alpha_{k,t}$ is computed by

$$\alpha_{k,t} = \frac{\exp(f_{\text{att}}(s_{k-1}, z_t))}{\sum_{t'=1}^{M} \exp(f_{\text{att}}(s_{k-1}, z_{t'}))}. \tag{12}$$

In Eq. (12), $s_{k-1}$ is the hidden state of the recurrent neural network in the decoder calculated in Eq. (16), and the function $f_{\text{att}}$ defines the relevance score of the $t$-th item with respect to $s_{k-1}$.
Mixture Density Network
A mixture density network (MDN, Bishop, 1994) formulates the likelihood of an item vector $e$ conditioned on a user history (represented by $h$) as a linear combination of kernel functions:

$$p(e \mid h) = \sum_{k=1}^{K} \pi_k \, \phi_k(e \mid h), \tag{13}$$

where $K$ is the number of components used in the mixture. Each kernel $\phi_k$ is a multivariate Gaussian density function:

$$\phi_k(e \mid h) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_k|^{1/2}} \exp\!\left( -\frac{1}{2} (e - \mu_k)^{\top} \Sigma_k^{-1} (e - \mu_k) \right). \tag{14}$$

To reduce the computational complexity, the covariance matrix $\Sigma_k$ is assumed to be diagonal, containing only entries for the element-wise variances $\sigma_k^2$.

We propose two methods for generating the parameters of a mixture density network with $K$ components.
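With diagonal covariances, the mixture density of Eqs. (13)–(14) factorizes over dimensions and is conveniently evaluated in log space. A minimal sketch (function name and array layout are our own choices):

```python
import numpy as np

def mdn_log_density(e, pi, mu, sigma2):
    """Log-density of an item vector e under the diagonal-covariance Gaussian
    mixture of Eqs. (13)-(14). pi: (K,), mu: (K, d), sigma2: (K, d)."""
    d = e.shape[0]
    # Per-component log phi_k: normalizer plus quadratic term, both diagonal.
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(sigma2), axis=1))
    log_quad = -0.5 * np.sum((e - mu) ** 2 / sigma2, axis=1)
    log_comp = np.log(pi) + log_norm + log_quad
    # Log-sum-exp over components for numerical stability.
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum())
```

Scoring candidate items then amounts to evaluating this density at each item vector and ranking by the result.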
Feedforward decoder (FF)
After a user history is encoded into a single user representation $h$, the parameters of the $k$-th mixture component, $\pi_k$, $\mu_k$, and $\sigma_k^2$, are generated by

$$\mu_k = W_{\mu}^{(k)} h + b_{\mu}^{(k)}, \qquad \sigma_k^2 = \exp\!\left( W_{\sigma}^{(k)} h + b_{\sigma}^{(k)} \right), \qquad \pi = \operatorname{softmax}(W_{\pi} h + b_{\pi}), \tag{15}$$

where $W_{\mu}^{(k)}, W_{\sigma}^{(k)} \in \mathbb{R}^{d \times d_h}$, and $W_{\pi} \in \mathbb{R}^{K \times d_h}$ for a user representation of dimensionality $d_h$.
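The parameterization choices in Eq. (15) (exp for positivity of variances, softmax for normalized weights) can be sketched as follows; the weight shapes and random initialization are illustrative assumptions, not the trained model:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def ff_decoder(h, W_mu, b_mu, W_sigma, b_sigma, W_pi, b_pi, K):
    """Feedforward decoder sketch for Eq. (15): linear maps give the means,
    exp ensures positive variances, softmax normalizes the mixture weights."""
    mu = (W_mu @ h + b_mu).reshape(K, -1)
    sigma2 = np.exp(W_sigma @ h + b_sigma).reshape(K, -1)
    pi = softmax(W_pi @ h + b_pi)
    return pi, mu, sigma2

rng = np.random.default_rng(0)
K, d, d_h = 4, 3, 8
h = rng.normal(size=d_h)
pi, mu, sigma2 = ff_decoder(
    h,
    rng.normal(size=(K * d, d_h)), rng.normal(size=K * d),
    rng.normal(size=(K * d, d_h)), rng.normal(size=K * d),
    rng.normal(size=(K, d_h)), rng.normal(size=K),
    K,
)
```

By construction the output is always a valid mixture: the weights sum to one and every variance is strictly positive.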
Recurrent decoder (RNN)
In addition to the feedforward decoder, we propose a recurrent decoder. For a mixture density network with $K$ components, the recurrent decoder iterates $K$ times. In each iteration, an RNN takes the history representation $h$ as input and generates the parameters of one mixture component. At the $k$-th step, the $k$-th component’s parameters are calculated as

$$s_k = \phi(s_{k-1}, h), \qquad \mu_k = W_{\mu} s_k + b_{\mu}, \qquad \sigma_k^2 = \exp(W_{\sigma} s_k + b_{\sigma}), \tag{16}$$

where $\phi$ is a recurrent activation function, and $W_{\mu}$, $W_{\sigma}$, $b_{\mu}$, and $b_{\sigma}$ are shared among all mixture components. After all $s_1, \ldots, s_K$ are generated, the mixture weight $\pi_k$ is calculated by

$$\pi_k = \frac{\exp(w_{\pi}^{\top} s_k)}{\sum_{k'=1}^{K} \exp(w_{\pi}^{\top} s_{k'})}, \tag{17}$$

where $w_{\pi}$ is a learned weight vector.
Alternatively, with the attention-based history representation described in Eq. (11), $h$ is replaced by $c_k$ at the $k$-th iteration in Eq. (16). At the $k$-th step, the attention weight $\alpha_{k,t}$ is computed by

$$\alpha_{k,t} = \frac{\exp(f_{\text{att}}(s_{k-1}, h_t))}{\sum_{t'=1}^{M} \exp(f_{\text{att}}(s_{k-1}, h_{t'}))}, \tag{18}$$

and the attention-based recurrent representation $c_k$ is computed by

$$c_k = \sum_{t=1}^{M} \alpha_{k,t} h_t, \tag{19}$$

where $h_t$ is the output from the recurrent representation calculated in Eq. (9). The architecture of the recurrent decoder with the attention-based encoder is illustrated in Fig. 1. The attention mechanism allows the model to automatically search for items in the user history relevant to each mixture component.
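A single attention read as in Eqs. (18)–(19) can be sketched as follows. The additive (MLP-style) scorer is one plausible form of $f_{\text{att}}$, assumed here for concreteness; the weight shapes are our own:

```python
import numpy as np

def attention_read(s_prev, H, W_s, W_h, v):
    """One attention read for mixture component k, Eqs. (18)-(19): score each
    encoder state h_t against the decoder state s_{k-1} with an additive scorer
    (an assumed form of f_att), softmax the scores, and return the context c_k.
    H: (M, d_h) matrix whose rows are the encoder states h_1..h_M."""
    scores = np.tanh(H @ W_h.T + W_s @ s_prev) @ v   # (M,) relevance scores
    scores = scores - scores.max()                   # stabilize the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()    # attention weights, Eq. (18)
    c = alpha @ H                                    # weighted sum, Eq. (19)
    return c, alpha

rng = np.random.default_rng(1)
M, d_h, d_s, d_a = 6, 8, 8, 5
H = rng.normal(size=(M, d_h))
c, alpha = attention_read(
    rng.normal(size=d_s), H,
    rng.normal(size=(d_a, d_s)), rng.normal(size=(d_a, d_h)), rng.normal(size=d_a),
)
```

Because $s_{k-1}$ changes at every decoder step, each mixture component attends to a different weighting of the history.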
Experimental Settings
Models
There are multiple configurations of the proposed method. First, there are three ways to represent a user history: (1) continuous bag-of-items (CBoI), (2) recurrent representation (RNN), and (3) attention-based representation (RNN-ATT). Then, there are two ways to generate the mixture parameters: (1) a feedforward decoder (FF) and (2) a recurrent decoder (RNN). We denote the models evaluated in our experiments by

CBoI-FF-$K$

RNN-FF-$K$

RNN-RNN-$K$

RNN-ATT-RNN-$K$,

where $K$ denotes the number of mixture components in the mixture density network. We test four values of $K$: 1, 2, 4, and 8.
Note that, when $K$ is equal to 1, a mixture model can only output a unimodal Gaussian distribution. This is similar to the work by De Boom et al. (2017), where regression can be viewed as a unimodal Gaussian with an identity covariance matrix.

We consider the following baselines:
 Recently Viewed Items (RVI)

recommends the items a user has viewed in the history, ranked by recency. Although this technique is not a collaborative filtering method, it is widely used as a personalized recommendation module in production systems. In previous work by Song, Elkahky, and He (2016), a similar approach (Prev-day Click) was adopted as a baseline method and outperformed all MF-based models in their experiment.
 Item-to-Item Collaborative Filtering (ItemCF)

uses a single item as a seed instead of the whole user history, as described in Eq. (6). Recommended items are ranked by the estimated conditional probability.
 Implicit Matrix Factorization (IMF)

is implemented according to Hu, Koren, and Volinsky (2008) using the implicit package (https://github.com/benfred/implicit). The model is fit using the history and future sequences in the training set, and the history sequences in the validation and test sets.
All mixture density network models use a 256-dimensional user representation $h$, and are trained using Adam (Kingma and Ba, 2014) to maximize the log-likelihood

$$\mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=M_n+1}^{T_n} \log p\!\left( e_{v_t^{(n)}} \;\middle|\; H^{(n)} \right), \tag{20}$$

where $N$ is the number of user sequences in the training set and $T_n$ is the length of the $n$-th sequence.
In all RNN-based models, a one-layer GRU with 256 hidden units is used. We early-stop training based on F1@20 on a validation set, and report metrics on a test set using the best model according to the validation performance.

For implicit matrix factorization, we perform a grid search over the number of factors on a validation set, and report metrics on a test set using the best model.
Item embeddings are trained using the continuous bag-of-words model from the FastText package (Bojanowski et al., 2016), with the item embedding dimension set to 100 and the window size to 5. All sequences in the training set are used for embedding learning. After training, each item vector is normalized by its $\ell_2$ norm.
Datasets
We evaluated our model on two publicly available datasets.
MovieLens-20M

MovieLens-20M (Harper and Konstan, 2016) is a classic explicit-feedback collaborative filtering dataset for movie recommendation, in which (user, movie, rating, timestamp) tuples are recorded. We transform MovieLens-20M into an implicit-feedback dataset by taking only records with ratings greater than or equal to 4 as positive observations. User behavior sequences are sorted by time, and those containing more than 15 implicit positive observations are included. The last 15 movies viewed by each user are split into 10 and 5, as history and future, respectively. By the nature of this dataset, there are no duplicate items in a user sequence. After preprocessing, 75,962 sequences are kept. 80%, 10%, and 10% of the sequences are randomly split into training, validation, and test sets, respectively. A movie vocabulary is built using the training set, containing 16,253 unique movies.
RecSys15
RecSys15 (http://2015.recsyschallenge.com/) is an implicit-feedback dataset containing click and purchase events from an online e-commerce website. We only work with the training file in the original dataset, and keep the click events with timestamps. We filter out sequences shorter than 15, and use the final 2 clicks as the future and the first 13 clicks as the history. We do not filter out duplicate items, so the same item can appear in both the history and target parts. After preprocessing, we are left with 168,202 sequences. 80%, 10%, and 10% of the sequences are randomly split into training, validation, and test sets, respectively. An item vocabulary is built using only the items in the training set, leaving us with 32,117 unique items.
Metric
Various metrics can be used to evaluate the performance of a recommender system. In this paper, we use precision, recall, and nDCG; higher values indicate better performance under all three metrics.

We denote the top-$k$ recommended items, ranked by the recommender system, by $R_k$, and the target items by $T$.
 Precision@k

calculates the fraction of the top-$k$ recommended items that overlap with the target items:

$$\text{Precision@}k = \frac{|R_k \cap T|}{k}. \tag{21}$$

 Recall@k

calculates the fraction of target items that overlap with the top-$k$ recommended items:

$$\text{Recall@}k = \frac{|R_k \cap T|}{|T|}. \tag{22}$$

 nDCG@k

computes the quality of the ranking by comparing the recommendation DCG with the optimal DCG (Järvelin and Kekäläinen, 2002). In implicit-feedback datasets, the relevance score of each item in the target set is set to 1. DCG@k is calculated as

$$\text{DCG@}k = \sum_{i=1}^{k} \frac{\mathbb{1}[r_i \in T]}{\log_2(i + 1)}, \tag{23}$$

where $r_i$ is the $i$-th recommended item. The optimal DCG is calculated as

$$\text{IDCG@}k = \sum_{i=1}^{\min(k, |T|)} \frac{1}{\log_2(i + 1)}. \tag{24}$$

nDCG@k is calculated as

$$\text{nDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}. \tag{25}$$
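The three metrics above are straightforward to implement for binary relevance. A minimal sketch (function names are our own):

```python
import numpy as np

def precision_at_k(recommended, targets, k):
    """Eq. (21): fraction of the top-k recommendations that are target items."""
    return len(set(recommended[:k]) & set(targets)) / k

def recall_at_k(recommended, targets, k):
    """Eq. (22): fraction of target items found in the top-k recommendations."""
    return len(set(recommended[:k]) & set(targets)) / len(targets)

def ndcg_at_k(recommended, targets, k):
    """Eqs. (23)-(25): binary-relevance DCG normalized by the optimal DCG."""
    target_set = set(targets)
    # Position i (0-based) contributes 1/log2(i + 2) if the item is relevant.
    dcg = sum(1.0 / np.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in target_set)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(k, len(targets))))
    return dcg / idcg
```

Note that nDCG, unlike precision and recall, is sensitive to where in the top-$k$ list the hits occur.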
Result
Table 1 summarizes the results of our experiments. As MovieLens has no duplicate items within a sequence, RVI is not used on that dataset. From the results on MovieLens, we first observe that the proposed RNN-ATT-RNN-{4,8} models consistently outperform the other methods in all metrics by large margins. Second, we see that CBoI-FF does not work well regardless of the number of components used, while performance is substantially improved with the recurrent encoder. Third, comparing the two baseline models, ItemCF outperforms IMF by a good margin across all metrics.
On RecSys15, besides the trends similar to those on MovieLens, there are several new observations. First, RVI outperforms all models except RNN-ATT-RNN-{2,4,8} on Precision@10 and Recall@10. This result is in line with Song, Elkahky, and He (2016), who also observed competitive performance from using the previous day's clicks in news recommendation. Second, IMF is the worst-performing model on this dataset. We conjecture that this is because IMF only recommends items a user has not interacted with, while in click streams like RecSys15, items in the history are likely to reappear in the future.
To better understand the effect of mixture components on the various model architectures, we group the results by the number of components used and visualize them in Fig. 2. We observe that RNN-ATT-RNN achieves the most visible improvement as the number of mixture components increases. We also notice that, for all MDN-based architectures, using two mixture components always achieves better results than using one. However, unless the attention mechanism is used, we see diminishing improvements with more components.
The experiments have revealed that it is clearly beneficial to capture the multimodal nature of prediction in a recommender system. This is, however, only possible with the right choice of user representation and the right mechanism for generating mixture parameters. In these experiments, our novel approach, the attention-based recurrent history representation combined with the recurrent decoder, was found to be the best choice on both datasets. We have further learned that user preference is not static over time, and that it is beneficial to model the user history as a sequence rather than as a bag.
Conclusion & Future Work
In this paper, we proposed a method for constructing a recommender system by generating a probability density function over future item vectors. The proposed model combines a recurrent user history representation with a mixture density network, where a novel attention-based recurrent mixture density network outputs each mixture component sequentially. Experiments on two publicly available datasets, MovieLens-20M and RecSys15, demonstrated significant improvements in recall, precision, and nDCG compared against various baselines, validating the advantage of modelling the multimodal nature of the predictive distribution in a recommender system.
To explore the full potential of our model, more research is needed in several areas. First, to better understand the model, a more thorough analysis of the learned mixture components and the attention weights should be conducted. Second, we use embeddings pre-trained with the word2vec objective, which leads to embeddings that capture the distributional, user-behavior-based properties of items. One way to extend our model is to incorporate content-based attributes into the item embeddings and create a hybrid recommender system.
Acknowledgments
TW sincerely thanks Tommy Chen, Andrew Drozdov, Daniel Galron, Timothy Heath, Alex Shen, Krutika Shetty, Stephen Wu, Lijia Xie, and Kelly Zhang for helpful discussions and insightful feedback. KC thanks eBay, TenCent, Facebook, Google, and NVIDIA for their support, and was partly supported by the Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI).
References
 Bahdanau, Cho, and Bengio (2014) Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
 Barkan and Koenigstein (2016) Barkan, O., and Koenigstein, N. 2016. Item2vec: neural item embedding for collaborative filtering. In Machine Learning for Signal Processing (MLSP), 2016 IEEE 26th International Workshop on, 1–6. IEEE.
 Bengio et al. (2003) Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C. 2003. A neural probabilistic language model. Journal of machine learning research 3(Feb):1137–1155.
 Bishop (1994) Bishop, C. M. 1994. Mixture density networks.
 Bojanowski et al. (2016) Bojanowski, P.; Grave, E.; Joulin, A.; and Mikolov, T. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
 Cho et al. (2014) Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
 De Boom et al. (2017) De Boom, C.; Agrawal, R.; Hansen, S.; Kumar, E.; Yon, R.; Chen, C.W.; Demeester, T.; and Dhoedt, B. 2017. Largescale user modeling with recurrent neural networks for music discovery on multiple time scales. Multimedia Tools and Applications 1–23.
 Harper and Konstan (2016) Harper, F. M., and Konstan, J. A. 2016. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5(4):19.
 Hidasi et al. (2015) Hidasi, B.; Karatzoglou, A.; Baltrunas, L.; and Tikk, D. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939.
 Hidasi et al. (2016) Hidasi, B.; Quadrana, M.; Karatzoglou, A.; and Tikk, D. 2016. Parallel recurrent neural network architectures for feature-rich session-based recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, 241–248. ACM.
 Hochreiter and Schmidhuber (1997) Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
 Hu, Koren, and Volinsky (2008) Hu, Y.; Koren, Y.; and Volinsky, C. 2008. Collaborative filtering for implicit feedback datasets. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on, 263–272. IEEE.
 Järvelin and Kekäläinen (2002) Järvelin, K., and Kekäläinen, J. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20(4):422–446.
 Kingma and Ba (2014) Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 Koren, Bell, and Volinsky (2009) Koren, Y.; Bell, R.; and Volinsky, C. 2009. Matrix factorization techniques for recommender systems. Computer 42(8).
 Liang et al. (2016) Liang, D.; Altosaar, J.; Charlin, L.; and Blei, D. M. 2016. Factorization meets the item embedding: Regularizing matrix factorization with item cooccurrence. In Proceedings of the 10th ACM conference on recommender systems, 59–66. ACM.
 Linden, Smith, and York (2003) Linden, G.; Smith, B.; and York, J. 2003. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet computing 7(1):76–80.
 Liu et al. (2017) Liu, D. C.; Rogers, S.; Shiau, R.; Kislyuk, D.; Ma, K. C.; Zhong, Z.; Liu, J.; and Jing, Y. 2017. Related pins at Pinterest: The evolution of a real-world recommender system. In Proceedings of the 26th International Conference on World Wide Web Companion, 583–592. International World Wide Web Conferences Steering Committee.
 Mikolov et al. (2010) Mikolov, T.; Karafiát, M.; Burget, L.; Cernockỳ, J.; and Khudanpur, S. 2010. Recurrent neural network based language model. In Interspeech, volume 2, 3.
 Mikolov et al. (2013) Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

 Rendle et al. (2009) Rendle, S.; Freudenthaler, C.; Gantner, Z.; and Schmidt-Thieme, L. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, 452–461. AUAI Press.
 Song, Elkahky, and He (2016) Song, Y.; Elkahky, A. M.; and He, X. 2016. Multi-rate deep learning for temporal recommendation. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, 909–912. ACM.
 Sutskever, Vinyals, and Le (2014) Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, 3104–3112.
 Tan, Xu, and Liu (2016) Tan, Y. K.; Xu, X.; and Liu, Y. 2016. Improved recurrent neural networks for session-based recommendations. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 17–22. ACM.
 Wu et al. (2017) Wu, C.Y.; Ahmed, A.; Beutel, A.; Smola, A. J.; and Jing, H. 2017. Recurrent recommender networks. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 495–503. ACM.