Scalable Recommender Systems through Recursive Evidence Chains

07/05/2018 · Elias Tragas et al. · University of Toronto, Borealis AI

Recommender systems can be formulated as a matrix completion problem, predicting ratings from user and item parameter vectors. Optimizing these parameters by subsampling data becomes difficult as the number of users and items grows. We develop a novel approach to generate all latent variables on demand from the ratings matrix itself and a fixed pool of parameters. We estimate missing ratings using chains of evidence that link them to a small set of prototypical users and items. Our model automatically addresses the cold-start and online learning problems by combining information across both users and items. We investigate the scaling behavior of this model, and demonstrate competitive results with respect to current matrix factorization techniques in terms of accuracy and convergence speed.




1 Introduction

The central aim of model-based collaborative-filtering methods is to predict a user’s rating of an item from a small number of recorded preferences in the system. An effective approach to this problem is to formulate it as matrix factorization. One can approximate a ratings matrix R ∈ ℝ^{n×m} as a low-rank factorization R ≈ UVᵀ, where U ∈ ℝ^{n×d} and V ∈ ℝ^{m×d} srebro2003weighted ; rennie2005fast . The d-dimensional rows u_i and v_j of U and V are commonly referred to as the latent user and item vectors. Under this framework, each ratings entry r_ij is approximated by the inner product ⟨u_i, v_j⟩.
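As a toy illustration (not the paper's code; the sizes and random values below are arbitrary assumptions), the low-rank prediction rule can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 5, 4, 3          # toy sizes; d is the latent dimension

U = rng.normal(size=(n_users, d))      # latent user vectors u_i (rows of U)
V = rng.normal(size=(n_items, d))      # latent item vectors v_j (rows of V)

R_hat = U @ V.T                        # low-rank reconstruction of the ratings matrix
r_23 = float(U[2] @ V[3])              # a single predicted rating, <u_2, v_3>
```

Each entry of `R_hat` equals the inner product of the corresponding latent user and item vectors.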

(a) Latent factor models

(b) Rowless factor model

(c) Proposed method
Figure 1.1: Dotted lines denote generated embeddings, as opposed to embeddings stored in memory. Ellipses show the direction in which the data can scale without adding new parameters.

A major drawback to these methods is that the number of parameters to be optimized grows with the number of users and items. Because the training loss is a coupled function of all of these parameters, stochastic optimization through data mini-batches resembles a form of blockwise coordinate optimization. We conjecture that there is room for improvement in the stochastic optimization of large models of this form, by directly enforcing a coupling between parameters of similar rows and columns.

In this paper, we propose our Recursive Evidence Chains (REC) algorithm. The central idea of our algorithm is that we do not store the entire latent matrices U and V, but rather only a very small subset of them (the prototypes); we then learn functions, parametrized by neural networks, to recursively generate the latent representations of non-prototypical users and items on demand.

Another challenge for collaborative-filtering methods is the classic “cold-start” problem, where new users have few ratings and the system has limited information on the preferences of such users. A side-benefit of applying recursive chains to generate latent representations on-demand is that the coupling of parameters between users provides a natural form of information-sharing, or regularization.

2 Background

In this section we introduce the notion of rowless and columnless matrix factorization techniques, as well as explore how they relate to coupled parameter optimization and online learning. In the next section, we will show how our proposed algorithm REC combines and generalizes these approaches.

Rowless collaborative filtering

Instead of explicitly storing an embedding for every row and column of a matrix, rowless methods generate row embeddings on demand. They estimate each embedding as a function of that row's rated item embeddings together with the corresponding ratings themselves. One such function could be a neural net f which maps any given (item embedding, rating) pair to a user embedding. Given such a net, we can generate a user embedding u_i by averaging f over all of user i's rated items, which we denote by R_i. In precise mathematical terms, we have

u_i = (1/|R_i|) Σ_{j ∈ R_i} f(v_j, r_ij),

where v_j denotes the embedding for item j and r_ij denotes user i's rating of item j. After generating u_i, we can make a prediction by setting r̂_ij = ⟨u_i, v_j⟩.
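The generation step can be sketched as follows, with a linear map standing in for the neural net f (the map, the toy embeddings, and the ratings are all illustrative assumptions, not the paper's actual model):

```python
import numpy as np

d = 3
rng = np.random.default_rng(0)

# Placeholder for the net f: maps an (item embedding, rating) pair to a user embedding.
W = rng.normal(size=(d + 1, d))
def f(v_j, r_ij):
    return np.concatenate([v_j, [r_ij]]) @ W

# Toy evidence: embeddings and ratings for the items user i has rated (the set R_i).
item_embeddings = {2: rng.normal(size=d), 7: rng.normal(size=d)}
ratings_of_user = {2: 5.0, 7: 3.0}

# u_i = mean over j in R_i of f(v_j, r_ij)
u_i = np.mean([f(item_embeddings[j], r) for j, r in ratings_of_user.items()], axis=0)

# Predict a rating for a rated item via the inner product <u_i, v_j>.
r_hat = float(u_i @ item_embeddings[7])
```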

Properties of rowless methods

Rowless methods perform well in online settings das2016chains ; verga2016generalizing , since they can handle an unbounded number of novel rows without requiring retraining. In addition, they have O(md) parameter complexity, since row embeddings are automatically generated from the m item embeddings. We will show in our experiments in Section 5 that using an aggregation function to couple multiple embeddings can improve the rate of convergence of mini-batch gradient-based optimization methods.

(a) Information flow for a rowless factor model

(b) Graphical model of making predictions in a rowless factor model.
Figure 2.1: Two alternative representations of information flow in a rowless model. We see that multiple item embeddings and their ratings are used to define u_i. In turn, u_i and v_j are used to define r̂_ij.

Columnless collaborative filtering

By making the necessary modifications, we can imagine a columnless model:

v_j = (1/|R^j|) Σ_{i ∈ R^j} g(u_i, r_ij),

where R^j denotes the set of rows in which a rating for item j exists, and g represents a neural network that maps a given (user embedding, rating) pair to an item embedding. Analogous to the rowless case, columnless methods have O(nd) space complexity.

Combining rowless and columnless methods

Rowless and columnless methods rely on O(md) and O(nd) parameters respectively, where d is the latent factor size. In contrast, Singular Value Decomposition (SVD) has parameter complexity O((n+m)d). Our proposed algorithm REC, which we present in the next section, leverages recursion to achieve constant parameter scaling with respect to the dataset size.

3 Recursive Evidence Chains

In this section we introduce a general framework combining both rowless and columnless matrix factorization. Instead of allocating full embeddings for only columns or only rows, we pick a constant number of prototype users n_p and prototype items m_p such that n_p ≪ n and m_p ≪ m. We select our prototypes to be the users and items with the most ratings. To aid with visualization, we sort our matrices such that the prototype users are the first n_p rows of the matrix, and similarly so for items.

Each prototype user and item receives an embedding, while all non-prototypical users and items are predicted on demand. Intuitively, because our method is rowless, we are able to predict each missing user embedding u_i as a function of that user's ratings and the embeddings of each item the user rated. Since our method is also columnless, we predict each missing item embedding v_j as a function of that item's ratings and the embeddings of each user who rated it. This introduces a recursion that continues until we reach the prototypical users and items. An example of this type of recursive structure is shown in Figures 3.1 and 3.2.

Figure 3.1: Information flow for making predictions with recursive evidence chains.
Figure 3.2: Graphical model of recursive evidence chains.
Figure 3.3: A trace of our model making predictions on ML-100K. Edge directions denote information flow; prototype vectors are shaded. Note that User 13 rated Babe, which is not a prototype, but Babe has been rated by prototype users, which allows us to generate its latent vector.
Figure 3.4: Recursive prediction in REC

3.1 Predicting latent factors in REC

We denote by f and g two feed-forward networks, parametrized respectively by θ_f and θ_g. The former maps an (item latent factor, rating) pair to a user latent factor, while the latter maps a (user latent factor, rating) pair to an item latent factor.

When collecting evidence to generate an embedding, we decompose the problem according to whether or not our evidence stems from a prototypical user or item. For users, we define the latent factors as

u_i = its stored prototype embedding, if i is a prototype user;
u_i = (1/|R_i|) Σ_{j ∈ R_i} f(v_j, r_ij), otherwise,

where we recall that R_i is the set of items rated by the i-th user. Similarly, for items, we have

v_j = its stored prototype embedding, if j is a prototype item;
v_j = (1/|R^j|) Σ_{i ∈ R^j} g(u_i, r_ij), otherwise,

where R^j is the set of users that gave a rating for item j.

With this simple formulation, it is almost impossible for an embedding generation step of REC to finish. If at least one rating is shared between a non-prototype row and a non-prototype column, attempting to generate their embeddings will cause an infinite loop, since each predicted embedding will request the other's value. To address this, we introduce a Max Depth constant D. If a recursive call has depth greater than or equal to D in the stack, and the requested embedding is not a prototype, we return None and ignore that embedding's value in the summation. This guarantees that REC always terminates.

Predictions with REC

For a given rating r_ij, we generate our prediction by computing both u_i and v_j and then setting r̂_ij = ⟨u_i, v_j⟩. It is possible for one or both of our generated embeddings to be undefined; this can occur if the shortest path to the prototypes is longer than our given Max Depth D. In this case, we return the mean of the dataset.

Training REC

Given the above, we are able to train REC end-to-end using SGD. We jointly optimize REC's set of parameters {U_P, V_P, θ_f, θ_g}, where U_P are the user prototype embeddings, V_P the item prototype embeddings, and θ_f, θ_g the parameters of our generator nets. While the definitions of our latent feature vectors rely on piecewise functions, they are sub-differentiable and thus easy to optimize in a framework that supports automatic differentiation, such as PyTorch or TensorFlow. We use the standard matrix factorization loss function, with regularization terms on the magnitude of our prototype embeddings and generator net parameters.
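A minimal sketch of such a loss (squared error plus L2 penalties; the function name, the flat parameter list, and the value of `lam` are our illustrative choices, not the paper's settings):

```python
import numpy as np

def rec_loss(ratings, preds, params, lam=1e-4):
    """Squared-error matrix factorization loss plus L2 regularization on the
    prototype embeddings and generator-net parameters (all passed in `params`)."""
    err = np.sum((np.asarray(ratings) - np.asarray(preds)) ** 2)
    reg = lam * sum(np.sum(p ** 2) for p in params)
    return float(err + reg)
```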

procedure recursive_predict_rating(i, j)
   u ← user_vector(i, 0)
   v ← item_vector(j, 0)
   if u = None or v = None:
      return dataset_mean
   return ⟨u, v⟩

procedure user_vector(i, depth)
   if is_prototype_user(i):
      return embedding_for_user(i)
   if depth ≥ D:
      return None
   user_embeddings ← [ ]
   for j in R_i:
      v ← item_vector(j, depth + 1)
      if v = None:
         continue
      user_embeddings.append(f(v, r_ij))
   if user_embeddings is empty:
      return None
   return mean(user_embeddings)

procedure item_vector(j, depth)
   if is_prototype_item(j):
      return embedding_for_item(j)
   if depth ≥ D:
      return None
   item_embeddings ← [ ]
   for i in R^j:
      u ← user_vector(i, depth + 1)
      if u = None:
         continue
      item_embeddings.append(g(u, r_ij))
   if item_embeddings is empty:
      return None
   return mean(item_embeddings)

Algorithm 1 Prediction in REC with Max Depth
Figure 3.5: Pseudocode for REC with Max Depth. f and g here denote the neural networks used to generate our user and item vectors respectively.
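The recursion can be exercised with a runnable Python sketch under simplifying assumptions: a toy ratings dictionary, linear stand-ins for f and g, and plain dicts in place of learned parameter tensors.

```python
import numpy as np

d, D_MAX = 3, 2
rng = np.random.default_rng(0)

# Toy data: ratings[(i, j)] = r_ij; only prototypes have stored embeddings.
ratings = {(0, 0): 5.0, (1, 0): 4.0, (1, 1): 3.0, (2, 1): 2.0}
proto_users = {0: rng.normal(size=d)}
proto_items = {0: rng.normal(size=d)}
Wf = rng.normal(size=(d + 1, d))  # placeholder for f: (item embedding, rating) -> user
Wg = rng.normal(size=(d + 1, d))  # placeholder for g: (user embedding, rating) -> item

def f(v, r): return np.concatenate([v, [r]]) @ Wf
def g(u, r): return np.concatenate([u, [r]]) @ Wg

def user_vector(i, depth):
    if i in proto_users:                      # prototypes terminate the recursion
        return proto_users[i]
    if depth >= D_MAX:                        # Max Depth cutoff
        return None
    evidence = [f(item_vector(j, depth + 1), r)
                for (ii, j), r in ratings.items()
                if ii == i and item_vector(j, depth + 1) is not None]
    return np.mean(evidence, axis=0) if evidence else None

def item_vector(j, depth):
    if j in proto_items:
        return proto_items[j]
    if depth >= D_MAX:
        return None
    evidence = [g(user_vector(i, depth + 1), r)
                for (i, jj), r in ratings.items()
                if jj == j and user_vector(i, depth + 1) is not None]
    return np.mean(evidence, axis=0) if evidence else None

def predict(i, j, dataset_mean=3.5):
    u, v = user_vector(i, 0), item_vector(j, 0)
    return dataset_mean if u is None or v is None else float(u @ v)
```

Without caching, the evidence terms here are recomputed twice per rating, which is exactly the waste the complexity controls of Section 4 address.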

4 Complexity controls

While REC with an assigned Max Depth can converge, it suffers from wasted computation. In this section we introduce complexity controls to minimize the amount of computation done by REC, while still retaining enough information to accurately predict a given rating. When we combine these complexity controls, the amount of computation required is reduced by 3 orders of magnitude compared to an implementation using only Max Depth. The impact of our complexity controls can be seen in Figure 4.1.

Cycle Blocking (CB) We begin by eliminating cycles from the computation. Before using an embedding b as evidence to generate an embedding a, we first check whether a is already acting as evidence for b earlier in the call stack. If so, we ignore b.

Caching (CA) When generating an embedding, some intermediate embedding may be needed multiple times in the call stack. In that case, it and all of its dependencies would be generated multiple times. To avoid this repeated computation, we cache the result of every predicted embedding computation and return the cached value instead of recomputing it. These embedding caches are wiped after every gradient update, since they no longer reflect what the model would generate. This optimization reduces the number of requests for embeddings by two orders of magnitude.
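One way to implement such a cache is simple memoization keyed on the embedding's identity and wiped after each update (a sketch; the key scheme and function names are our own, not the paper's):

```python
# Cache of generated embeddings, valid only for the current parameter values.
cache = {}

def cached(kind, idx, compute):
    """Return the embedding for (kind, idx), computing it at most once per epoch of the cache."""
    key = (kind, idx)
    if key not in cache:
        cache[key] = compute()
    return cache[key]

def on_gradient_update():
    # Stale embeddings no longer reflect the updated parameters, so drop them all.
    cache.clear()
```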

Evidence Limit (EL) On larger datasets such as ML-10M, a single user or item can have thousands of ratings. This means that generating one embedding may require the intermediate generation of thousands of other embeddings, a costly procedure at each level of our recursive algorithm. Instead, we define an evidence limit E: when generating an embedding, we randomly select at most E ratings and use only those to generate the embedding. We set E to 80 for all of our experiments unless otherwise specified. Using the evidence limit roughly halves the number of embeddings we generate on ML-100K, with the gains increasing on larger datasets.

Prototype Prioritization (PP) Once an evidence limit is introduced, picking the right embeddings to explore becomes an optimization problem. In the worst case, every single embedding we select is unable to reach the prototype section by the time Max Depth is reached, and will therefore return None. To avoid this where possible, we use Prototype Prioritization: instead of sampling uniformly at random, we greedily select any available prototypes to generate our embedding. If the number of available prototypes is less than E, we randomly sample the rest to ensure that we still generate from E embeddings. On ML-100K this reduces the number of generated embeddings by 36%.
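A sketch of this selection rule (the function and argument names are ours, not the paper's):

```python
import random

def select_evidence(candidates, prototypes, E, rng=random.Random(0)):
    """Greedily take available prototypes first, then fill the remaining
    evidence slots by sampling the non-prototypes, up to the limit E."""
    protos = [c for c in candidates if c in prototypes]
    others = [c for c in candidates if c not in prototypes]
    chosen = protos[:E]
    if len(chosen) < E:
        chosen += rng.sample(others, min(E - len(chosen), len(others)))
    return chosen
```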

Telescoping Evidence Limit (TEL) The deeper into the recursion stack an embedding is generated, the weaker the evidence it provides about the requested rating. To reduce the number of distantly related embeddings we collect as evidence, we use a Telescoping Evidence Limit: a depth-aware version of E that halves its value at each level. Thus E_0 = E, E_1 = E/2, and in general E_d = E/2^d.
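The telescoping schedule is just integer halving per depth (flooring at 1 is our own choice to keep at least one piece of evidence; the paper does not specify the rounding):

```python
def telescoping_limit(E, depth):
    """Evidence limit at recursion depth d: E_d = E / 2^d, halved at each level."""
    return max(1, E >> depth)

# With the paper's default E = 80, depths 0..3 give limits 80, 40, 20, 10.
```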

Figure 4.1: Embeddings generated for a mini-batch of size 10 on ML-100K. Max Depth is set to 2 for all experiments. On the left, we see the behaviour for 6 different implementations of REC using various complexity controls. On the right, we add the cache and note its ability to reduce the number of failed requests.

5 Experiments

Datasets. We evaluate our model on three collaborative filtering datasets: MovieLens-100K, MovieLens-1M and MovieLens-10M harper2016movielens . The ML-100K dataset contains 100,000 ratings of 1,682 movies by 943 users; the ML-1M dataset contains approximately 1 million ratings of 3,900 movies by 6,040 users; and the ML-10M dataset contains approximately 10 million ratings of 10,681 movies by 71,567 users. For our experiments, we take an 80/20 train/validation split of each dataset. Our hyperparameters are optimized against this choice of validation set.

Implementation details. The standard configuration for REC, unless otherwise specified, is as follows: the numbers of prototypical users and items, n_p and m_p, are set to 50, and the Evidence Limit E and Max Depth D are set to 80 and 4 respectively. We use 3-layer feed-forward neural networks with 200 neurons in each hidden layer. The activation functions of each net are ReLUs, with the exception of the last layer, which is linear.

In addition, we introduce a pretraining phase in our REC experimentation: we first run PMF on the prototypical block alone before training on all parameters. This procedure ensures that the prototype embeddings are already accurate enough to recombine into sensible ratings, thereby reducing the need for later updates to their distribution. We found that this short constant-time procedure sped up the optimization process considerably. Lastly, we set the default batch size to 1000 and use the Adam optimizer Adam:14 with a learning rate of and regularization parameter . Our metric of evaluation across all experiments is test root-mean-square error (RMSE).

Test RMSE comparisons. We begin by comparing the performance of REC to the following collaborative-filtering algorithms: PMF mnih2008probabilistic , NNMF dziugaite2015neural , Biased-MF koren2009matrix , and CF-NADE zheng2016neural . For ML-100K and ML-1M we use the standard configuration, whereas for ML-10M the only changes we make are setting the Evidence Limit to 40 and the Max Depth to 3. In all three experiments, we train for 2000 iterations. Table 1 gives the comparison scores. We find that for ML-100K, REC achieves a test RMSE comparable to standard collaborative-filtering algorithms. For ML-1M and ML-10M, while our method does not reach state-of-the-art performance, especially compared to CF-NADE and Biased-MF, we believe that this can be substantially improved by tuning the numbers of prototype users and items.

Model ML-100K ML-1M ML-10M
PMF mnih2008probabilistic 0.952 0.883 -
NNMF dziugaite2015neural 0.903 0.843 -
Biased-MF koren2009matrix 0.911 0.852 0.803
CF-NADE zheng2016neural - 0.829 0.771
REC 0.910 0.882 0.846
Table 1: Test RMSE results on ML-100K, ML-1M, and ML-10M for various models. Scores reported for PMF, NNMF, and Biased-MF (ML-100K/ML-1M) were taken from dziugaite2015neural . Scores reported for CF-NADE and Biased-MF (ML-10M) were taken from zheng2016neural . Note that these results were obtained using a 90/10 train/valid split, whereas REC uses an 80/20 split.

5.1 Coupling of parameters leads to faster convergence

We compare the performance of REC to PMF in the early stages of the training process. Again, we use the standard configuration for REC on ML-100K but change the Evidence Limit and Max Depth to 20 and 2 for ML-1M and ML-10M. For PMF, the same 80/20 train/valid split is used to partition the datasets, and we choose batch sizes of 1000, 5000, and 10000 for ML-100K, ML-1M, and ML-10M respectively. We increase the batch sizes because PMF, as opposed to REC, requires larger batches to train accurately as the dataset size grows. For consistency, we also applied the same pretraining procedure used in REC to PMF.

In Figure 5.1, we see that for all three datasets, REC converges to a low RMSE in under 30 iterations. Furthermore, we find that REC learns very well early in the training process. On the other hand, PMF cannot match this RMSE in that many iterations; in fact, it only does so at around iterations 400, 360, and 500 for ML-100K, ML-1M, and ML-10M respectively. As for wall-clock statistics, it takes around 10, 45 and 70 seconds to complete one REC iteration on ML-100K, ML-1M and ML-10M respectively.

As demonstrated in Figure 5.1, REC has similar test RMSE convergence curves across increasingly large datasets while maintaining a constant number of parameters. This highlights the attractive scalability properties of REC. In contrast, we observe that the number of iterations it takes for PMF to converge depends on the size of the dataset.

Figure 5.1: Performance of REC and PMF on ML-100K, ML-1M and ML-10M over the early training iterations.

5.2 Constant Scale Online Predictions Without Retraining

Here, we evaluate REC's performance on an online learning problem. We do this by training REC on a small subset of all rows and columns, and then testing its performance against increasingly large sets of new rows and columns. This approximates new content and subscribers being added to a recommender system over time. To keep our experimental setup simple, we assume that we have access to the entire dataset for the purpose of determining our prototypes. We begin by selecting 50 prototypes for both users and items. We then train to convergence on 20% of the rows and columns of our dataset, where this 20% includes the prototypes. We reintroduce 20% of the total rows and columns at a time to both the train and test sets, and evaluate REC's performance on the test set at each step. The incremental findings of this experiment can be seen in Figure 5.2(a), while the final results are shown in Table 2.

Model ML-100K ML-1M ML-10M
Test RMSE on full novel dataset 0.967 0.936 0.889
Parameters for REC (in millions) .17 .17 .17
Percent of data seen in training 6.02% 6.2% 6.03%
Number of new rows and columns in test set 2020 7951 65798
Table 2: Online test results on ML-100K, ML-1M, and ML-10M. Recall that the number of prototype users and prototype items are fixed to 50 for all datasets.

5.3 Cold Start

In this experiment on ML-100K, we show that REC generates accurate predictions for users with few ratings. To simulate a cold-start setting, we use the formulation in BKW:17 : out of our training set, we randomly select a set of users to be our cold-start users and drop all but a small number of their ratings. After training, we log REC's performance on the entire validation set. In order to study the effect of both factors on our model, we report several configurations of the number of cold-start users and the number of retained ratings. Note that the case where no ratings are dropped is regular REC. While REC underperforms when compared to the results given in BKW:17 , it significantly outperforms guessing the mean (1.15 RMSE), and our experiments using REC exhibit trends similar to those in BKW:17 , indicating that REC is a promising technique for addressing cold-start problems.

(a) REC’s incremental performance on the test set as a function of percentage of total rows and columns available.
(b) Cold start analysis for REC and GC-MC. Scores reported for GC-MC were taken from BKW:17 .
Figure 5.2: REC’s performance to handle two sparsity problems: online learning and cold start

6 Related Work

Incorporating deep learning techniques into collaborative filtering has been an active area of research ZYS:17 . In dziugaite2015neural , the authors introduced Neural Network Matrix Factorization (NNMF), which uses a neural network to factorize the ratings matrix. CF-NADE zheng2016neural , a neural autoregressive architecture for collaborative filtering inspired by Restricted Boltzmann Machines for Collaborative Filtering (RBM-CF) salakhutdinov2007restricted and the Neural Autoregressive Distribution Estimator (NADE) larochelle2011neural , currently maintains the highest state-of-the-art performance across the MovieLens and Netflix datasets, though with high complexity overhead. Neural Collaborative Filtering (NCF) he2017neural provides a systematic study of applying various neural network architectures to the collaborative filtering problem.

Graph Convolutional Matrix Completion (GC-MC) BKW:17 , similar to REC, also interprets the matrix completion problem as a bipartite user-item graph in which observed ratings represent links; a graph convolutional auto-encoder framework is used to predict the links. The main difference is that GC-MC stores the latents of every user and item, whereas REC stores only a small subset and generates the rest through neural networks.

Online collaborative filtering methods have also been an active vein of research. The authors of abernethy2007online proposed an algorithm for learning a low-rank matrix factor model in an online manner that scales linearly with the matrix dimensions and the number of ratings. An online algorithm to learn rank-prediction rules for a user, using the ratings of other users, was proposed in crammer2002pranking . In bresler2014latent , the authors present an algorithm that learns to group users into one of a fixed number of user types in an online fashion, where each user type has its own established probability of liking each item.

7 Conclusion

In this paper, we proposed REC, a generalization of rowless and columnless matrix factorization techniques where user and item embeddings are generated through recursive evidence chains. Our model has a variety of interesting and attractive properties, such as constant parameter scaling, fast training, and the ability to handle both online learning and the cold start problem. We demonstrate its performance on standard datasets and find that it has competitive performance to existing collaborative-filtering algorithms.


  • [1] Jacob Abernethy, Kevin Canini, John Langford, and Alex Simma. Online collaborative filtering. University of California at Berkeley, Tech. Rep, 2007.
  • [2] Guy Bresler, George H Chen, and Devavrat Shah. A latent source model for online collaborative filtering. In Advances in Neural Information Processing Systems, pages 3347–3355, 2014.
  • [3] Koby Crammer and Yoram Singer. Pranking with ranking. In Advances in neural information processing systems, pages 641–647, 2002.
  • [4] Rajarshi Das, Arvind Neelakantan, David Belanger, and Andrew McCallum. Chains of reasoning over entities, relations, and text using recurrent neural networks. arXiv preprint arXiv:1607.01426, 2016.
  • [5] Gintare Karolina Dziugaite and Daniel M Roy. Neural network matrix factorization. arXiv preprint arXiv:1511.06443, 2015.
  • [6] F Maxwell Harper and Joseph A Konstan. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.
  • [7] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pages 173–182. International World Wide Web Conferences Steering Committee, 2017.
  • [8] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. ArXiv e-prints, December 2014.
  • [9] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8), 2009.
  • [10] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 29–37, 2011.
  • [11] Andriy Mnih and Ruslan R Salakhutdinov. Probabilistic matrix factorization. In Advances in neural information processing systems, pages 1257–1264, 2008.
  • [12] Jasson DM Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd international conference on Machine learning, pages 713–719. ACM, 2005.
  • [13] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. Restricted boltzmann machines for collaborative filtering. In Proceedings of the 24th international conference on Machine learning, pages 791–798. ACM, 2007.
  • [14] Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 720–727, 2003.
  • [15] R. van den Berg, T. N. Kipf, and M. Welling. Graph Convolutional Matrix Completion. ArXiv e-prints, June 2017.
  • [16] Patrick Verga, Arvind Neelakantan, and Andrew McCallum. Generalizing to unseen entities and entity pairs with row-less universal schema. arXiv preprint arXiv:1606.05804, 2016.
  • [17] S. Zhang, L. Yao, and A. Sun. Deep Learning based Recommender System: A Survey and New Perspectives. ArXiv e-prints, July 2017.
  • [18] Yin Zheng, Bangsheng Tang, Wenkui Ding, and Hanning Zhou. A neural autoregressive approach to collaborative filtering. In Proceedings of the 33nd International Conference on Machine Learning, pages 764–773, 2016.