Recommender systems that recommend items based on users' interests have become increasingly popular on many web sites. Collaborative filtering (CF) techniques, which lie behind these recommender systems, have been developed for many years and remain an active area in both academia and industry. Currently, CF faces two major challenges: how to handle large-scale datasets and how to leverage the rich information in the collected data.
The traditional approach to these problems is to design a specific model for each problem, i.e., to write code for each model, which demands great engineering effort. Matrix factorization (MF) is one of the most popular CF methods, and many variants of the matrix factorization model have been studied extensively. However, we find that the majority of matrix factorization models share common patterns, which motivates us to unify them into a single model. We call this model feature-based matrix factorization. Moreover, we have written a toolkit for solving the general feature-based matrix factorization problem, which saves the engineering effort of implementing each specific model. Using the toolkit, we obtained the best single model on track 1 of KDDCup’11.
This article serves as a technical report for our toolkit for feature-based matrix factorization (http://apex.sjtu.edu.cn/apex_wiki/svdfeature). We try to address three questions in this report: what the model is, how such a model can be used, and what issues arise in engineering and efficient computation.
2 What is feature based MF
In this section, we will describe the model of feature based matrix factorization, starting from the example of linear regression, and then going to the full definition of our model.
2.1 Start from linear regression
Let’s start from the basic collaborative filtering models. The simplest baselines consider only the mean effect of the user or of the item, as in the following two models.
$$\hat{r}_{ui} = \mu + b_u \tag{1}$$
$$\hat{r}_{ui} = \mu + b_i \tag{2}$$
Here $\mu$ is a constant indicating the global mean rating. Equation 1 describes a model that considers the users’ mean effect, while Equation 2 captures the items’ mean effect. A more complex model that considers neighborhood information is as follows
$$\hat{r}_{ui} = b_u + b_i + \sum_{j \in R(u)} s_{ij}\,(r_{uj} - \bar{b}_u) \tag{3}$$
Here $R(u)$ is the set of items rated by user $u$, and $\bar{b}_u$ is the pre-calculated average rating of user $u$. $s_{ij}$ is the similarity parameter from $i$ to $j$; it is a parameter that we train from data rather than computing it directly as memory-based methods do. Note that $\bar{b}_u$ differs from $b_u$, since it is pre-calculated rather than learned. This is a neighborhood model that takes the neighborhood effect of items into consideration.
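Assuming we want to implement all three models, it would be wasteful to write separate code for each of them. Comparing the models, it is clear that all three are special cases of the linear regression problem described by Equation 4
$$y = \sum_k w_k x_k \tag{4}$$
Suppose we have $n$ users, $m$ items, and a total of $h$ possible $s_{ij}$ in Equation 3. We can define the feature vector $x = [x_0, \ldots, x_{n+m+h-1}]$ for the user-item pair $\langle u, i \rangle$ as follows
$$x_k = \begin{cases} 1 & k = u \\ 1 & k = i + n \\ r_{uj} - \bar{b}_u & k = \mathrm{index}(i,j) + n + m,\ j \in R(u) \\ 0 & \text{otherwise} \end{cases} \tag{5}$$
where $\mathrm{index}(i,j)$ maps each possible pair $\langle i, j \rangle$ to an integer in $[0, h)$. The corresponding layout of the weight $w$ is shown in Equation 6. Note that the choice of pairs $\langle i, j \rangle$ can be flexible. We can choose only possible neighbors instead of enumerating all the pairs.
$$w = [\,\underbrace{b_0, \ldots, b_{n-1}}_{\text{user biases } b_u},\ \underbrace{b_0, \ldots, b_{m-1}}_{\text{item biases } b_i},\ \underbrace{s_{\mathrm{index}^{-1}(0)}, \ldots, s_{\mathrm{index}^{-1}(h-1)}}_{\text{similarities } s_{ij}}\,] \tag{6}$$
In other words, Equation 3 can be rewritten in the following form
$$\hat{r}_{ui} = b_u \times 1 + b_i \times 1 + \sum_{j \in R(u)} s_{ij} \times (r_{uj} - \bar{b}_u) \tag{7}$$
where $b_u$, $b_i$, $s_{ij}$ correspond to the weights of the linear regression, and the coefficients to the right of the weights are the input features. In summary, under this framework, the only thing we need to do is lay out the parameters into a feature vector. In our case, we assign the first $n$ features to $b_u$, then $b_i$ and $s_{ij}$, and transform the input data into the input format of linear regression. Finally, we use a linear regression solver to work the problem out.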
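The layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `build_features`, `predict`, and `pair_index` are hypothetical and not part of the toolkit, features are stored sparsely as a dictionary from position to value, and the toy weights are made up.

```python
from typing import Dict, List, Tuple

def build_features(u: int, i: int,
                   rated: Dict[int, float],   # items rated by user u -> rating
                   user_mean: float,          # pre-calculated average rating of user u
                   n_users: int, n_items: int,
                   pair_index: Dict[Tuple[int, int], int]) -> Dict[int, float]:
    """Sparse feature vector {position: value} for the user-item pair (u, i)."""
    x = {u: 1.0, n_users + i: 1.0}            # user-bias slot and item-bias slot
    for j, r_uj in rated.items():
        k = pair_index.get((i, j))            # only the chosen neighbor pairs
        if k is not None:
            x[n_users + n_items + k] = r_uj - user_mean
    return x

def predict(w: List[float], x: Dict[int, float]) -> float:
    """Linear regression output: sum of w_k * x_k over the nonzero features."""
    return sum(w[k] * v for k, v in x.items())

# Toy setup (hypothetical numbers): 2 users, 3 items, 2 trainable similarity pairs.
pair_index = {(0, 1): 0, (0, 2): 1}
w = [0.1, 0.2,        # user biases
     0.3, 0.0, 0.0,   # item biases
     0.5, -0.5]       # similarities for pairs (0, 1) and (0, 2)
x = build_features(u=1, i=0, rated={1: 4.0, 2: 2.0}, user_mean=3.0,
                   n_users=2, n_items=3, pair_index=pair_index)
score = predict(w, x)
```

Keeping the vector sparse matters here because only a handful of the positions are nonzero for any single user-item pair.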
2.2 Feature based matrix factorization
The previous section shows that some baseline CF algorithms are linear regression problems. In this section, we discuss a feature-based generalization of matrix factorization. A basic matrix factorization model is stated in Equation 8:
$$\hat{r}_{ui} = \mu + b_u + b_i + p_u^T q_i \tag{8}$$
The bias terms have the same meaning as in the previous section. We also have two factor terms, $p_u$ and $q_i$: $p_u$ models the latent preference of user $u$, and $q_i$ models the latent property of item $i$.
Inspired by the idea of the previous section, we can obtain a direct generalization of matrix factorization.
$$\hat{r}_{ui} = \mu + \sum_j w_j x_j + b_u + b_i + p_u^T q_i \tag{9}$$
Equation 9 adds a linear regression term to the traditional matrix factorization model. This allows us to add more bias information, such as neighborhood information and time bias information. However, we may also need a more flexible factor part. For example, we may want a time-dependent user factor $p_u(t)$ or a hierarchy-dependent item factor $q_i(h)$. As the previous section suggests, a direct way to include such flexibility is to use features in the factors as well. So we define our feature-based matrix factorization as follows
$$y = \mu + \Big(\sum_j b^{(g)}_j \gamma_j + \sum_j b^{(u)}_j \alpha_j + \sum_j b^{(i)}_j \beta_j\Big) + \Big(\sum_j p_j \alpha_j\Big)^T \Big(\sum_j q_j \beta_j\Big) \tag{10}$$
The input consists of three kinds of features $\langle \alpha, \beta, \gamma \rangle$, which we call the user feature, item feature and global feature, respectively. The first part of Equation 10 collects the bias terms, and the second part is the factorization term. The names of these features explain their meanings: $\alpha$ describes the user aspects, $\beta$ describes the item aspects, and $\gamma$ describes global bias effects. Figure 1 shows the idea of the procedure.
We can see that basic matrix factorization is a special case of Equation 10. For predicting the user-item pair $\langle u, i \rangle$, define
$$\gamma = \emptyset, \quad \alpha_h = \begin{cases} 1 & h = u \\ 0 & h \neq u \end{cases}, \quad \beta_h = \begin{cases} 1 & h = i \\ 0 & h \neq i \end{cases} \tag{11}$$
We are not limited to simple matrix factorization. The framework lets us incorporate neighborhood information through $\gamma$, and a time-dependent user factor by modifying $\alpha$. Section 3 presents a detailed description of this.
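To make the special-case claim concrete, the following is a minimal sketch of the prediction in Equation 10 with sparse user, item, and global features, checked against plain matrix factorization on a toy model. The function names, the dictionary-based feature encoding, and all numbers are assumptions made for illustration; they are not the toolkit's interface.

```python
from typing import Dict, List

Sparse = Dict[int, float]   # feature index -> feature value

def combine(vectors: List[List[float]], feats: Sparse) -> List[float]:
    """Feature-weighted sum of factor vectors, e.g. sum_j p_j * alpha_j."""
    out = [0.0] * len(vectors[0])
    for j, v in feats.items():
        for k, value in enumerate(vectors[j]):
            out[k] += v * value
    return out

def predict(mu: float, bg: List[float], bu: List[float], bi: List[float],
            P: List[List[float]], Q: List[List[float]],
            gamma: Sparse, alpha: Sparse, beta: Sparse) -> float:
    """Global mean + feature-weighted biases + inner product of combined factors."""
    bias = (sum(bg[j] * v for j, v in gamma.items())
            + sum(bu[j] * v for j, v in alpha.items())
            + sum(bi[j] * v for j, v in beta.items()))
    p = combine(P, alpha)
    q = combine(Q, beta)
    return mu + bias + sum(a * b for a, b in zip(p, q))

# Toy model (hypothetical numbers): 2 users, 3 items, 2 latent dimensions.
mu = 3.0
bu, bi, bg = [0.5, -0.5], [0.1, 0.2, 0.3], []
P = [[1.0, 0.0], [0.0, 1.0]]
Q = [[1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]

u, i = 1, 2
# Indicator features: no global feature, one active user, one active item.
feature_based = predict(mu, bg, bu, bi, P, Q, gamma={}, alpha={u: 1.0}, beta={i: 1.0})
basic_mf = mu + bu[u] + bi[i] + sum(a * b for a, b in zip(P[u], Q[i]))
```

With indicator features the two values coincide; richer models only change how `alpha`, `beta`, and `gamma` are filled in.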
2.3 Active function and loss function
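Next, we need to choose an active function $f(\cdot)$ that is applied to the output $y$ of the feature-based matrix factorization, so that the prediction is $\hat{r} = f(y)$, together with a loss function that measures the error of the prediction.
Common choices of active function and loss are listed as follows:
- identity function, L2 loss, original matrix factorization: $\hat{r} = f(y) = y$, $\text{Loss} = (r - \hat{r})^2 + \text{regularization}$.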
2.4 Model Learning
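To update the model, we use the following update rule
$$p_i = p_i + \eta\Big(\hat{e}\,\alpha_i \sum_j q_j \beta_j - \lambda_1 p_i\Big)$$
$$q_i = q_i + \eta\Big(\hat{e}\,\beta_i \sum_j p_j \alpha_j - \lambda_2 q_i\Big)$$
$$b^{(g)}_i = b^{(g)}_i + \eta\big(\hat{e}\,\gamma_i - \lambda_3 b^{(g)}_i\big)$$
$$b^{(u)}_i = b^{(u)}_i + \eta\big(\hat{e}\,\alpha_i - \lambda_4 b^{(u)}_i\big)$$
$$b^{(i)}_i = b^{(i)}_i + \eta\big(\hat{e}\,\beta_i - \lambda_5 b^{(i)}_i\big)$$
Here $\hat{e} = r - \hat{r}$ is the difference between the true rating and the predicted rating. This rule is valid for both the logistic likelihood loss and the L2 loss. For other losses, we replace $\hat{e}$ with the corresponding gradient. $\eta$ is the learning rate, and the $\lambda$s are regularization parameters that define the strength of regularization.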
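A single stochastic gradient step of this kind might look as follows. This is a sketch under simplifying assumptions: the identity active function with L2 loss, one shared regularization weight `lam` instead of separate ones per parameter group, and a plain dictionary `m` holding the parameters. All names are hypothetical.

```python
from typing import Dict, List

Sparse = Dict[int, float]   # feature index -> feature value

def combine(vecs: List[List[float]], feats: Sparse, dim: int) -> List[float]:
    """Feature-weighted sum of factor vectors."""
    out = [0.0] * dim
    for j, v in feats.items():
        for k in range(dim):
            out[k] += v * vecs[j][k]
    return out

def predict(m: dict, gamma: Sparse, alpha: Sparse, beta: Sparse) -> float:
    p = combine(m["P"], alpha, m["dim"])
    q = combine(m["Q"], beta, m["dim"])
    bias = (sum(m["bg"][j] * v for j, v in gamma.items())
            + sum(m["bu"][j] * v for j, v in alpha.items())
            + sum(m["bi"][j] * v for j, v in beta.items()))
    return m["mu"] + bias + sum(a * b for a, b in zip(p, q))

def sgd_step(m: dict, gamma: Sparse, alpha: Sparse, beta: Sparse,
             r: float, eta: float = 0.05, lam: float = 0.01) -> float:
    """One update on a single sample; returns the error before the update."""
    # Combined factors are computed once, before any parameter changes.
    p = combine(m["P"], alpha, m["dim"])
    q = combine(m["Q"], beta, m["dim"])
    err = r - predict(m, gamma, alpha, beta)
    for j, v in alpha.items():
        m["P"][j] = [pj + eta * (err * v * qk - lam * pj)
                     for pj, qk in zip(m["P"][j], q)]
        m["bu"][j] += eta * (err * v - lam * m["bu"][j])
    for j, v in beta.items():
        m["Q"][j] = [qj + eta * (err * v * pk - lam * qj)
                     for qj, pk in zip(m["Q"][j], p)]
        m["bi"][j] += eta * (err * v - lam * m["bi"][j])
    for j, v in gamma.items():
        m["bg"][j] += eta * (err * v - lam * m["bg"][j])
    return err

# Toy model with one user and one item (hypothetical numbers).
model = {"dim": 2, "mu": 3.0, "bg": [], "bu": [0.0], "bi": [0.0],
         "P": [[0.1, 0.1]], "Q": [[0.1, 0.1]]}
sample = ({}, {0: 1.0}, {0: 1.0})          # gamma, alpha, beta
err_before = sgd_step(model, *sample, r=4.0)
err_after = 4.0 - predict(model, *sample)
```

Only parameters whose features are nonzero are touched, so the cost of a step is proportional to the number of active features in the sample.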
3 What information can be included
In this section, we will present some examples to illustrate the usage of our feature-based matrix factorization model.
3.1 Basic matrix factorization
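The basic matrix factorization model is defined by the following equation
$$y = \mu + b_u + b_i + p_u^T q_i$$
and the corresponding feature representation is
$$\gamma = \emptyset, \quad \alpha_h = \begin{cases} 1 & h = u \\ 0 & h \neq u \end{cases}, \quad \beta_h = \begin{cases} 1 & h = i \\ 0 & h \neq i \end{cases}$$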
3.2 Pairwise rank model
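For the ranking model, we are interested in the order of two items $i$ and $j$ given a user $u$. A pairwise ranking model is described as follows
$$P(r_{ui} > r_{uj}) = \mathrm{sigmoid}\big(\mu + b_i - b_j + p_u^T (q_i - q_j)\big)$$
The corresponding feature representation is
$$\gamma = \emptyset, \quad \alpha_h = \begin{cases} 1 & h = u \\ 0 & h \neq u \end{cases}, \quad \beta_h = \begin{cases} 1 & h = i \\ -1 & h = j \\ 0 & \text{otherwise} \end{cases}$$
using sigmoid as the active function and log-likelihood as the loss function. Note that the feature representation gives one extra $b_u$, which is not desirable. We can remove it by applying strong regularization to $b_u$, which penalizes it toward $0$.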
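The pairwise setup can be illustrated with a short sketch. The helper names are hypothetical, and the negative log-likelihood is written here as a quantity to minimize; the point is only that a signed item feature turns the item-side terms into a difference between the two items.

```python
import math
from typing import Dict, Tuple

Sparse = Dict[int, float]   # feature index -> feature value

def pairwise_features(u: int, i: int, j: int) -> Tuple[Sparse, Sparse]:
    """User indicator, plus +1 on the preferred item and -1 on the other item."""
    alpha = {u: 1.0}
    beta = {i: 1.0, j: -1.0}
    return alpha, beta

def sigmoid(y: float) -> float:
    return 1.0 / (1.0 + math.exp(-y))

def neg_log_likelihood(r: float, y: float) -> float:
    """Logistic loss for a label r in {0, 1} and a raw model output y."""
    p = sigmoid(y)
    return -(r * math.log(p) + (1.0 - r) * math.log(1.0 - p))

alpha, beta = pairwise_features(u=0, i=2, j=5)

# With these features the item-bias part of the model becomes b_i - b_j.
item_bias = {2: 0.7, 5: 0.2}                     # hypothetical learned biases
bias_difference = sum(v * item_bias[h] for h, v in beta.items())
loss_at_zero = neg_log_likelihood(1.0, 0.0)      # an undecided model
```

The same signed weights apply to the item factors, so the factor part becomes the inner product of the user factor with the difference of the two item factors.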
3.3 Temporal Information
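A model that includes temporal information can be described as follows
$$y = \mu + b_u(t) + b_i(t) + b_u + b_i + \big(p_u + p_u(t)\big)^T q_i$$
We can include $b_i(t)$ using the global feature, and $b_u(t)$, $p_u(t)$ using the user feature. For example, we can define a time interpolation model as follows
$$y = \mu + b_i + b^s_u \frac{e - t}{e - s} + b^e_u \frac{t - s}{e - s} + \Big(p^s_u \frac{e - t}{e - s} + p^e_u \frac{t - s}{e - s}\Big)^T q_i$$
Here $s$ and $e$ denote the start and end times of all the ratings. A rating made later is affected more by $p^e$ and $b^e$, while earlier ratings are affected more by $p^s$ and $b^s$. For this model, we can define
$$\gamma = \emptyset, \quad \alpha_h = \begin{cases} \frac{e - t}{e - s} & h = u \\ \frac{t - s}{e - s} & h = u + n \\ 0 & \text{otherwise} \end{cases}, \quad \beta_h = \begin{cases} 1 & h = i \\ 0 & h \neq i \end{cases}$$
Note that we arrange the $p^s$ in the first $n$ features and then the $p^e$ in the next $n$ features.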
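As an illustration of the time interpolation idea, the sketch below builds the user and item features for one rating. The function name and argument names are hypothetical; the layout assumed is the one described above, with the start-of-period user parameters in the first block of user features and the end-of-period parameters in the next block.

```python
from typing import Dict, Tuple

Sparse = Dict[int, float]   # feature index -> feature value

def time_interpolation_features(u: int, i: int, t: float,
                                start: float, end: float,
                                n_users: int) -> Tuple[Sparse, Sparse]:
    """Split the user's weight between a start slot and an end slot by time."""
    w_start = (end - t) / (end - start)    # weight of the start-of-period parameters
    w_end = (t - start) / (end - start)    # weight of the end-of-period parameters
    alpha = {u: w_start, u + n_users: w_end}
    beta = {i: 1.0}
    return alpha, beta

# A rating made three quarters of the way through the observed period.
alpha, beta = time_interpolation_features(u=3, i=7, t=75.0,
                                          start=0.0, end=100.0, n_users=10)
```

Because the two weights always sum to one, the user's bias and factor move linearly from the start values to the end values as time advances.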
3.4 Neighborhood information
3.5 Hierarchical information
In the Yahoo! Music Dataset, some tracks belong to the same artist. We can include such hierarchical information by adding it to the item feature. The model is described as follows
$$y = \mu + b_u + b_t + b_a + p_u^T (q_t + q_a)$$
Here $t$ denotes the track and $a$ denotes the corresponding artist. This model can be formalized as feature-based matrix factorization by redefining the item feature.
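One possible way to redefine the item feature is sketched below. The index layout (track slots first, artist slots after them) and the names `hierarchical_item_features` and `artist_of` are assumptions made for illustration only.

```python
from typing import Dict

Sparse = Dict[int, float]   # feature index -> feature value

def hierarchical_item_features(track: int, artist_of: Dict[int, int],
                               n_tracks: int) -> Sparse:
    """Activate the track's own slot and, if known, the slot of its artist."""
    beta = {track: 1.0}
    artist = artist_of.get(track)
    if artist is not None:
        beta[n_tracks + artist] = 1.0      # artist slots follow the track slots
    return beta

# Hypothetical catalogue: tracks 0 and 1 share artist 0; track 2 has no artist.
artist_of = {0: 0, 1: 0}
beta_track0 = hierarchical_item_features(0, artist_of, n_tracks=3)
beta_track1 = hierarchical_item_features(1, artist_of, n_tracks=3)
beta_track2 = hierarchical_item_features(2, artist_of, n_tracks=3)
```

Tracks by the same artist share the artist slot, so its bias and factor are learned from the ratings of all of that artist's tracks.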
4 Efficient training for SVD++
Feature-based matrix factorization can naturally incorporate implicit and explicit information. We can simply add this information to the user feature $\alpha$. The model configuration is shown as follows:
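Here we omit the details of the bias term. The implicit and explicit feedback information is given by $\sum_j \xi_j d_j$, where $\xi$ is the feature vector of feedback information, with $\xi_j = \frac{1}{\sqrt{|R(u)|}}$ for implicit feedback and $\xi_j = \frac{r_{uj} - \bar{b}_u}{\sqrt{|R(u)|}}$ for explicit feedback, and $d_j$ is the parameter of the implicit and explicit feedback factor. We state the implicit and explicit information explicitly in Equation 34.
$$y = \text{bias} + \Big(\sum_j \xi_j d_j + \sum_j p_j \alpha_j\Big)^T \Big(\sum_j q_j \beta_j\Big) \tag{34}$$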
Although Equation 34 shows that we can easily incorporate implicit and explicit information into the model, stochastic gradient training is actually very costly, since the update cost is linear in the number of nonzero entries of $\xi$, which can be very large if a user has rated many items. This greatly slows down training, so we need an optimized training method. To show the idea of the optimized method, let’s first define a derived user implicit and explicit factor $p_{im}$ as follows:
$$p_{im} = \sum_j \xi_j d_j \tag{35}$$
The update of $d_i$ after one step is given by the following equation
$$\Delta d_i = \eta\,\hat{e}\,\xi_i \Big(\sum_j q_j \beta_j\Big) \tag{36}$$
The resulting difference in $p_{im}$ is given by
$$\Delta p_{im} = \eta\,\hat{e}\,\Big(\sum_j \xi_j^2\Big) \Big(\sum_j q_j \beta_j\Big) \tag{37}$$
Given a group of samples with the same user, we need to do gradient descent on each training sample. The simplest way is to do the following steps for each sample: (1) calculate $p_{im}$ to get the prediction; (2) update all $d_j$ associated with the implicit and explicit feedback. In this way, $p_{im}$ has to be recalculated from the updated $d$ every time. However, to get the new $p_{im}$ we do not need to update each $d_j$. Instead, we only need to update $p_{im}$ using Equation 37. What is more, there is a relation between $\Delta p_{im}$ and $\Delta d_i$ as follows:
$$\Delta d_i = \frac{\xi_i}{\sum_k \xi_k^2}\,\Delta p_{im} \tag{38}$$
We emphasize that Equation 38 holds even for multiple updates, provided that the user is the same in all the samples. Note that the above analysis does not consider the regularization term. If L2 regularization of $d_i$ is used during the update as follows:
$$\Delta d_i = \eta\Big(\hat{e}\,\xi_i \sum_j q_j \beta_j - \lambda d_i\Big) \tag{39}$$
the corresponding change in $p_{im}$ looks very similar
$$\Delta p_{im} = \eta\Big(\hat{e}\,\Big(\sum_j \xi_j^2\Big) \sum_j q_j \beta_j - \lambda p_{im}\Big) \tag{40}$$
However, the relation in Equation 38 no longer holds strictly. We can still use it, since it holds approximately when the regularization term is small. Using these results, we can develop a fast algorithm for feature-based matrix factorization with implicit and explicit feedback information. The algorithm is shown in Algorithm 1.
The basic idea is to group the data of the same user together, since samples of the same user share the same implicit and explicit feedback information. Algorithm 1 allows us to calculate the implicit feedback factor only once per user, greatly reducing the computation time.
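The grouping idea can be sketched as follows. This is not the toolkit's implementation: it is a simplified illustration that drops the bias terms and the regularization (the case in which the relation between the two updates is exact), uses one ordinary user factor `p_u`, and uses hypothetical names throughout.

```python
from typing import Dict, List, Tuple

Vec = List[float]

def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))

def train_user_group(samples: List[Tuple[int, float]],  # (item, rating) pairs of ONE user
                     xi: Dict[int, float],              # that user's feedback features
                     D: Dict[int, Vec],                 # feedback factors, one per feature
                     p_u: Vec,                          # ordinary user factor
                     Q: Dict[int, Vec],                 # item factors
                     eta: float = 0.05) -> None:
    dim = len(p_u)
    norm2 = sum(v * v for v in xi.values())
    # Step 1: compute the derived feedback factor once for the whole group.
    p_im = [sum(v * D[j][k] for j, v in xi.items()) for k in range(dim)]
    p_old = list(p_im)
    # Step 2: SGD over the user's samples, updating p_im instead of every D[j].
    for item, r in samples:
        q = list(Q[item])                               # item factor before the update
        user_vec = [a + b for a, b in zip(p_im, p_u)]
        err = r - dot(user_vec, q)
        for k in range(dim):
            p_im[k] += eta * err * norm2 * q[k]
            p_u[k] += eta * err * q[k]
            Q[item][k] += eta * err * user_vec[k]
    # Step 3: push the accumulated change back to each feedback factor once.
    if norm2 > 0:
        for j, v in xi.items():
            for k in range(dim):
                D[j][k] += v / norm2 * (p_im[k] - p_old[k])

# Toy data (hypothetical): one user with three feedback features and three ratings.
xi = {0: 0.5, 1: -0.25, 2: 0.5}
D = {0: [0.1, 0.2], 1: [0.3, -0.1], 2: [0.0, 0.4]}
Q = {7: [0.5, 0.1], 8: [-0.2, 0.3]}
p_u = [0.1, 0.1]
train_user_group([(7, 4.0), (8, 2.0), (7, 3.0)], xi, D, p_u, Q)
```

Per sample, the work no longer depends on how many feedback entries the user has; the feedback factors are touched only twice per user group, once to build the derived factor and once to write the change back.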
5 How large-scale data is handled
Recommender systems confront large-scale data in practice, and handling it is a must when dealing with real problems. For example, the Yahoo! Music Dataset consists of more than 200M ratings. A toolkit that is robust to the input data size is desirable for real applications.
5.1 Input data buffering
The input training data can be extremely large in real applications, so we do not try to load all of it into memory. Instead, we buffer the training data in binary format on the hard disk. We use stochastic gradient descent to train our model, so we only need to iterate linearly over the data, provided that the data is shuffled before buffering.
Therefore, our solution requires the input features to be shuffled beforehand; a buffering program then creates a binary buffer from the input features. The training procedure reads the data from the hard disk and uses stochastic gradient descent to train the model. This buffering approach makes the memory cost invariant to the input data size, and allows us to train models over large-scale input data as long as the parameters fit into memory.
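A minimal sketch of this buffering scheme is given below. The record layout (two 32-bit integers and a 32-bit float) and the function names are assumptions made for illustration and do not describe the toolkit's actual file format; for brevity the toy writer shuffles in memory, whereas a real data set would be shuffled by an external step.

```python
import os
import random
import struct
import tempfile
from typing import Iterable, Iterator, Tuple

Row = Tuple[int, int, float]
RECORD = struct.Struct("<iif")   # hypothetical layout: user id, item id, rating

def write_buffer(path: str, rows: Iterable[Row], seed: int = 0) -> None:
    """Shuffle once, then store the rows as fixed-size binary records."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    with open(path, "wb") as f:
        for u, i, r in rows:
            f.write(RECORD.pack(u, i, r))

def read_buffer(path: str, chunk_records: int = 4096) -> Iterator[Row]:
    """Stream records from disk; memory use is bounded by the chunk size."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size * chunk_records)
            if not chunk:
                break
            for record in RECORD.iter_unpack(chunk):
                yield record

rows = [(0, 10, 4.0), (1, 11, 2.5), (0, 12, 5.0)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.buffer")
    write_buffer(path, rows)
    loaded = list(read_buffer(path, chunk_records=2))   # one linear pass over the file
```

Each training round is then a single sequential scan of the file, which is the access pattern that disks serve best.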
5.2 Execution pipeline
Although input data buffering solves the problem of large-scale data, it still suffers from the cost of reading the data from the hard disk. To minimize the cost of I/O, we use a pre-fetching strategy. We create an independent thread to fetch the buffered data into a memory queue; the training program then reads the data from the memory queue and does the training. The procedure is shown in Figure 2.
This pipeline style of execution removes the burden of I/O from the training thread. As long as the I/O speed is similar to or faster than the training speed, the cost of I/O is negligible, and our experience on KDDCup’11 confirms the success of this strategy. With input buffering and pipeline execution, we can train a model with test RMSE=22.16 for track 1 of KDDCup’11 (kddcup.yahoo.com) using less than 2G of memory, without significantly increasing the training time.
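The pre-fetching pipeline can be sketched with a bounded queue and a background thread. The `prefetch` helper below is a hypothetical illustration using the Python standard library, not the toolkit's code; any iterable of training records can stand in for the disk reader.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
_DONE = object()   # sentinel that marks the end of the stream

def prefetch(source: Iterable[T], max_items: int = 1024) -> Iterator[T]:
    """Yield items from `source` while a background thread reads ahead."""
    buffer: queue.Queue = queue.Queue(maxsize=max_items)

    def worker() -> None:
        try:
            for item in source:
                buffer.put(item)       # blocks when the queue is full
        finally:
            buffer.put(_DONE)          # always signal the consumer to stop

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()            # blocks until the reader delivers
        if item is _DONE:
            return
        yield item

# The training loop simply iterates; reading overlaps with computation.
consumed = [record for record in prefetch(range(10), max_items=3)]
```

The bounded queue keeps memory use fixed: the reader pauses when it is far enough ahead, and the trainer waits only when the reader falls behind.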
6 Related work and discussion
The work most closely related to feature-based matrix factorization is the Factorization Machine (Rendle, 2010). The reader can refer to libFM (http://www.libfm.org) for a toolkit for factorization machines. Strictly speaking, our toolkit implements a restricted case of the factorization machine, and is more useful in some aspects. We support global features that do not need to be taken into the factorization part, which is important for bias features such as user day bias, neighborhood-based features, etc. The division of features also gives hints for model design. For global features, we should consider what aspects may influence the overall rating. For user and item features, we should consider how to better describe the user preference and the item property. Our model is also related to regression-based latent factor models (Agarwal and Chen, 2009) and Matchbox (Stern et al., 2009); the difference is that in feature-based matrix factorization, the user/item features can be associated with temporal information and other context information to better describe the preference or property in the current context. Our current model also has shortcomings. It does not support multiple distinct factorizations at present. For example, sometimes we may want to introduce a user vs time tensor factorization together with the user vs item factorization. We will try to overcome these drawbacks in future work.
References
-  Deepak Agarwal and Bee-Chung Chen. Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’09, pages 19–28, New York, NY, USA, 2009. ACM.
-  Gideon Dror, Noam Koenigstein, Yehuda Koren, and Markus Weimer. The Yahoo! Music dataset and KDD-Cup’11. In KDD-Cup Workshop, 2011.
-  Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’08, pages 426–434, New York, NY, USA, 2008. ACM.
-  Yehuda Koren. Collaborative filtering with temporal dynamics. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’09, pages 447–456, New York, NY, USA, 2009. ACM.
-  Arkadiusz Paterek. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD Cup and Workshop, volume 2007, 2007.
-  Steffen Rendle. Factorization machines. In Proceedings of the 10th IEEE International Conference on Data Mining. IEEE Computer Society, 2010.
-  Jason D. M. Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd international conference on Machine learning, ICML ’05, pages 713–719, New York, NY, USA, 2005. ACM.
-  Nathan Srebro, Jason D. M. Rennie, and Tommi S. Jaakkola. Maximum-Margin Matrix Factorization. In Advances in Neural Information Processing Systems 17, volume 17, pages 1329–1336, 2005.
-  David H. Stern, Ralf Herbrich, and Thore Graepel. Matchbox: large scale online bayesian recommendations. In Proceedings of the 18th international conference on World wide web, WWW ’09, pages 111–120, New York, NY, USA, 2009. ACM.