Python implementation of 'Relational learning via collective matrix factorization' with some extensions
This work explores the ability of collective matrix factorization models in recommender systems to make predictions about users and items for which there is side information available but no feedback or interactions data, and proposes a new formulation with a faster cold-start prediction formula that can be used in real-time systems. While these cold-start recommendations are not as good as warm-start ones, they were found to be of better quality than non-personalized recommendations, and predictions about new users were found to be more reliable than those about new items. The formulation proposed here resulted in improved cold-start recommendations in many scenarios, at the expense of worse warm-start ones.
This work aims to explore the quality of cold-start recommendations derived from collective matrix factorization models in collaborative filtering with explicit-feedback data in the form of ratings. Recommender systems based on collaborative filtering are typically constructed solely from data about user-item interactions, such as movies rated by different users, which results in domain-independent and easily implementable models, but has the disadvantage of only being able to make recommendations about users and items for which interactions data is available (known as warm-start recommendations in the literature).
In many settings, however, there is additional side information available about users and/or items, which is not used in the most common models such as low-rank matrix factorization or kNN-based formulas, but which can be used both to improve recommendation models that take interactions data and to make recommendations in the absence of interactions data (so-called cold-start recommendations).
This work focuses on the second case: studying recommendations from matrix factorization models that are based on attributes data without interactions data.
Collective matrix factorization is an extension of the low-rank factorization model that tries to incorporate attributes about the users and/or items by also factorizing the matrices associated with their side information, sharing the latent factors between them.
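More formally, recommendation models based on low-rank matrix factorization try to factorize a partially-observed matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ of user-item interactions (e.g. movie ratings), where $m$ is the number of users and $n$ is the number of items, into the product of two lower-dimensional matrices $\mathbf{A} \in \mathbb{R}^{m \times k}$ and $\mathbf{B} \in \mathbb{R}^{n \times k}$, where $k \ll m$ and $k \ll n$, which can be thought of as latent factors determined for each user and item, by minimizing some loss function, such as squared loss, defined only on the entries of $\mathbf{X}$ that are known (hereafter denoted by the indicator function $I_x$), e.g.:
$$\min_{\mathbf{A}, \mathbf{B}} \; \lVert I_x \odot (\mathbf{X} - \mathbf{A}\mathbf{B}^T) \rVert^2$$
Having obtained these matrices, it is then possible to predict the values of $\mathbf{X}$ for entries that are not known by the dot product $\mathbf{a}_u \cdot \mathbf{b}_i$ of the corresponding rows of $\mathbf{A}$ and $\mathbf{B}$ for user $u$ and item $i$. Recommendations are then made by sorting these predictions in decreasing order.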
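As a minimal illustration of the loss over known entries and of the prediction-and-sort step, the NumPy sketch below may help. It assumes unknown ratings are stored as NaN; the function names `masked_squared_loss` and `recommend` are hypothetical and not part of any library.

```python
import numpy as np

def masked_squared_loss(X, A, B):
    """Squared error computed over the observed (non-NaN) entries of X only."""
    observed = ~np.isnan(X)
    residuals = np.where(observed, X - A @ B.T, 0.0)
    return float(np.sum(residuals ** 2))

def recommend(A, B, user, known_items, top_n=5):
    """Rank items the user has not interacted with by predicted score."""
    scores = B @ A[user]                 # dot product of a_u with every b_i
    scores[list(known_items)] = -np.inf  # exclude items already seen
    return np.argsort(-scores)[:top_n]   # decreasing order of prediction

# Toy data (assumed for illustration): 4 users, 6 items, k = 2
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 2))
B = rng.normal(size=(6, 2))
X = A @ B.T
X[0, 1] = np.nan                         # pretend one rating is unknown
loss = masked_squared_loss(X, A, B)
top = recommend(A, B, user=0, known_items=[0, 2], top_n=3)
```

Here the loss is zero only because the toy ratings were generated from the factors themselves; with real data the factors would be fitted first.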
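In most implementations, this model is improved by centering the data (subtracting the global mean $\mu$ from each entry), adding user and item biases $\mathbf{U}_b$ (row vector) and $\mathbf{I}_b$ (column vector), which might be treated as model parameters or obtained through a simple heuristic before attempting to obtain optimal values for the latent-factor matrices $\mathbf{A}$ and $\mathbf{B}$, and by adding regularization on all the model parameters, resulting in the following problem:
$$\min_{\mathbf{A}, \mathbf{B}, \mathbf{U}_b, \mathbf{I}_b} \; \lVert I_x \odot (\mathbf{X} - \mathbf{A}\mathbf{B}^T - \mathbf{U}_b - \mathbf{I}_b - \mu) \rVert^2 + \lambda \left( \lVert \mathbf{A} \rVert^2 + \lVert \mathbf{B} \rVert^2 + \lVert \mathbf{U}_b \rVert^2 + \lVert \mathbf{I}_b \rVert^2 \right)$$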
This is a non-convex optimization problem for which local minima can be found either by gradient-based methods or, more typically, by the ALS (alternating least-squares) algorithm, which takes advantage of the fact that, when one of the low-rank matrices is held constant, the optimal values for the other can be obtained through a closed-form solution that amounts to solving linear systems. The algorithm then alternates between solving for one matrix while holding the other constant until convergence.
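The alternating scheme can be sketched as follows. This is a simplified illustration rather than the implementation used in this work: it omits centering and biases, uses a single regularization value `lam`, and all function names are hypothetical.

```python
import numpy as np

def als_update_rows(X, mask, fixed, lam):
    """Solve for one factor matrix while `fixed` (the other one) is held constant."""
    k = fixed.shape[1]
    out = np.zeros((X.shape[0], k))
    for r in range(X.shape[0]):
        F = fixed[mask[r]]                    # factors of the observed columns
        lhs = F.T @ F + lam * np.eye(k)
        rhs = F.T @ X[r, mask[r]]
        out[r] = np.linalg.solve(lhs, rhs)    # closed-form ridge solution
    return out

def als(X, mask, k=2, lam=0.1, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.normal(scale=0.1, size=(X.shape[0], k))
    B = rng.normal(scale=0.1, size=(X.shape[1], k))
    for _ in range(n_iter):
        A = als_update_rows(X, mask, B, lam)      # users, items fixed
        B = als_update_rows(X.T, mask.T, A, lam)  # items, users fixed
    return A, B

def objective(X, mask, A, B, lam):
    err = np.where(mask, X - A @ B.T, 0.0)
    return np.sum(err ** 2) + lam * (np.sum(A ** 2) + np.sum(B ** 2))

# Synthetic rank-2 data with roughly 70% of the entries observed
rng = np.random.default_rng(1)
X_full = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 15))
mask = rng.random(X_full.shape) < 0.7
A1, B1 = als(X_full, mask, lam=0.1, n_iter=1)
A20, B20 = als(X_full, mask, lam=0.1, n_iter=20)
```

Because each half-step exactly minimizes the regularized objective over one block of parameters, the objective cannot increase from one iteration to the next.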
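The main idea behind collective matrix factorization is to jointly factorize the interactions matrix $\mathbf{X}$ along with the user attributes matrix $\mathbf{U} \in \mathbb{R}^{m \times p}$ and the item attributes matrix $\mathbf{I} \in \mathbb{R}^{n \times q}$, introducing new matrices $\mathbf{C} \in \mathbb{R}^{p \times k}$ and $\mathbf{D} \in \mathbb{R}^{q \times k}$ for the user and item attributes (assuming there is data about both user and item attributes), but sharing the latent-factor matrices $\mathbf{A}$ and $\mathbf{B}$ between factorizations:
$$\min_{\mathbf{A}, \mathbf{B}, \mathbf{C}, \mathbf{D}, \mathbf{U}_b, \mathbf{I}_b} \; \lVert I_x \odot (\mathbf{X} - \mathbf{A}\mathbf{B}^T - \mathbf{U}_b - \mathbf{I}_b - \mu) \rVert^2 + \lVert \mathbf{U} - \mathbf{A}\mathbf{C}^T \rVert^2 + \lVert \mathbf{I} - \mathbf{B}\mathbf{D}^T \rVert^2 + \lambda \left( \lVert \mathbf{A} \rVert^2 + \lVert \mathbf{B} \rVert^2 + \lVert \mathbf{C} \rVert^2 + \lVert \mathbf{D} \rVert^2 + \lVert \mathbf{U}_b \rVert^2 + \lVert \mathbf{I}_b \rVert^2 \right)$$
Up to this point, the problem is equivalent to factorizing an extended block matrix and can be solved using the same methods as before.
The new matrices $\mathbf{C}$ and $\mathbf{D}$ are not used in the prediction formula, but their presence in the minimization objective allows obtaining better estimates for $\mathbf{A}$ and $\mathbf{B}$. Informally, the shared factors now need to explain both the interactions and the side information, making them less prone to overfitting the observed interactions and forcing these latent factors to relate to the non-latent attributes, thereby generalizing better to new data.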
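The block-matrix equivalence can be checked numerically on a small case. The sketch below makes simplifying assumptions that are not from the text: the interactions matrix is fully observed, only user attributes are present, and weights, biases and regularization are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p, k = 5, 4, 3, 2                  # users, items, user attributes, factors
X = rng.normal(size=(m, n))              # interactions (fully observed here)
U = rng.normal(size=(m, p))              # user attributes
A = rng.normal(size=(m, k))              # shared user factors
B = rng.normal(size=(n, k))              # item factors
C = rng.normal(size=(p, k))              # attribute factors

# Sum of the two separate reconstruction errors, both using the same A
separate = np.sum((X - A @ B.T) ** 2) + np.sum((U - A @ C.T) ** 2)

# Single factorization of the extended matrix [X | U] ~ A [B; C]^T
stacked = np.sum((np.hstack([X, U]) - A @ np.vstack([B, C]).T) ** 2)
```

Both quantities agree up to floating-point error, which is why the same solvers apply as long as no extra weights or transformations are introduced.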
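There are many logical improvements upon this model: the matrices might not share all the latent factors but have independent parts, each factorization might have a different weight, and each matrix might have its own regularization hyperparameter, among others. In particular, this work also applied a sigmoid transformation to all binary variables in the side information matrices, took the user and item biases as model parameters to which regularization was also applied, and divided the sum of residuals from each matrix by its number of entries so that their contribution is not driven by the relative size of each, resulting in an optimization problem of the form:
$$\min_{\mathbf{A}, \mathbf{B}, \mathbf{C}, \mathbf{D}, \mathbf{U}_b, \mathbf{I}_b} \; \frac{w_X}{|I_x|} \lVert I_x \odot (\mathbf{X} - \mathbf{A}\mathbf{B}^T - \mathbf{U}_b - \mathbf{I}_b - \mu) \rVert^2 + \frac{w_U}{mp} \lVert \mathbf{U} - \sigma(\mathbf{A}\mathbf{C}^T) \rVert^2 + \frac{w_I}{nq} \lVert \mathbf{I} - \sigma(\mathbf{B}\mathbf{D}^T) \rVert^2 + \lambda \left( \lVert \mathbf{A} \rVert^2 + \lVert \mathbf{B} \rVert^2 + \lVert \mathbf{C} \rVert^2 + \lVert \mathbf{D} \rVert^2 + \lVert \mathbf{U}_b \rVert^2 + \lVert \mathbf{I}_b \rVert^2 \right)$$
where $w_X$, $w_U$ and $w_I$ are the weights of each factorization, $|I_x|$ is the number of known entries of the interactions matrix, and $\sigma$ denotes the sigmoid function, applied only to the columns that hold binary variables.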
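A rough sketch of how such a weighted, size-normalized objective could be evaluated is given below. It is an assumption-laden illustration and not the code from the released implementation: the function `cmf_loss` and its arguments are hypothetical, the sigmoid is applied to all user-attribute columns (as if they were all binary) and to none of the item-attribute columns, and a single regularization value is used.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cmf_loss(X, mask, U, I_attr, A, B, C, D, user_bias, item_bias, mu,
             weights=(1.0, 1.0, 1.0), lam=1.0, binary_user_attrs=True):
    w_main, w_user, w_item = weights
    pred = A @ B.T + user_bias[:, None] + item_bias[None, :] + mu
    # Each squared-error term is divided by its own number of entries
    err_main = np.sum(np.where(mask, X - pred, 0.0) ** 2) / mask.sum()
    U_hat = A @ C.T
    if binary_user_attrs:                 # assumed: every user column is binary
        U_hat = sigmoid(U_hat)
    err_user = np.mean((U - U_hat) ** 2)
    err_item = np.mean((I_attr - B @ D.T) ** 2)
    reg = lam * sum(np.sum(P ** 2) for P in (A, B, C, D, user_bias, item_bias))
    return w_main * err_main + w_user * err_user + w_item * err_item + reg

# Toy check: data generated exactly from the parameters gives zero error
rng = np.random.default_rng(0)
m, n, p, q, k = 6, 5, 4, 3, 2
A, B = rng.normal(size=(m, k)), rng.normal(size=(n, k))
C, D = rng.normal(size=(p, k)), rng.normal(size=(q, k))
bu, bi, mu = rng.normal(size=m), rng.normal(size=n), 3.5
X = A @ B.T + bu[:, None] + bi[None, :] + mu
mask = np.ones_like(X, dtype=bool)
U, I_attr = sigmoid(A @ C.T), B @ D.T
loss_no_reg = cmf_loss(X, mask, U, I_attr, A, B, C, D, bu, bi, mu, lam=0.0)
loss_shifted = cmf_loss(X + 1.0, mask, U, I_attr, A, B, C, D, bu, bi, mu, lam=0.0)
```

Shifting every rating by one unit yields a loss of exactly the main weight, which shows that the per-entry normalization makes each term independent of matrix size.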
This problem is no longer solvable through ALS, but can still be solved using gradient-based methods. The optimization was performed using the L-BFGS solver (a limited-memory quasi-Newton method) in SciPy, with the gradients being calculated through Tensorflow. The implementation used here was made open-source and is freely available at https://github.com/david-cortes/cmfrec.
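For readers unfamiliar with this setup, the sketch below fits a plain masked matrix factorization with SciPy's L-BFGS-B solver. Note the differences from what is described above: the gradients are derived by hand instead of being obtained through Tensorflow, the model has no biases or side information, and the function name `loss_and_grad` is hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def loss_and_grad(theta, X, mask, k, lam):
    """Return the regularized squared loss and its gradient for flat parameters."""
    m, n = X.shape
    A = theta[:m * k].reshape(m, k)
    B = theta[m * k:].reshape(n, k)
    R = np.where(mask, A @ B.T - X, 0.0)      # residuals on observed entries only
    loss = 0.5 * np.sum(R ** 2) + 0.5 * lam * (np.sum(A ** 2) + np.sum(B ** 2))
    grad_A = R @ B + lam * A
    grad_B = R.T @ A + lam * B
    return loss, np.concatenate([grad_A.ravel(), grad_B.ravel()])

# Synthetic low-rank data with about 60% of the entries observed
rng = np.random.default_rng(0)
m, n, k, lam = 30, 20, 3, 0.1
X = rng.normal(size=(m, k)) @ rng.normal(size=(k, n))
mask = rng.random((m, n)) < 0.6
theta0 = rng.normal(scale=0.1, size=(m + n) * k)
initial_loss = loss_and_grad(theta0, X, mask, k, lam)[0]

# jac=True tells SciPy that the function returns (loss, gradient)
result = minimize(loss_and_grad, theta0, args=(X, mask, k, lam),
                  jac=True, method="L-BFGS-B")
A_fit = result.x[:m * k].reshape(m, k)
B_fit = result.x[m * k:].reshape(n, k)
```

All parameters are packed into one flat vector because the solver operates on a single array; the same pattern extends to objectives with more matrices.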
In many implementations of low-rank matrix factorization, the regularization parameters are scaled by the number of ratings by each user and for each movie, but since this model was adding the side information matrices, this idea was not incorporated in the final objective formula.
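It can be seen that this optimization objective will produce values for the latent factors $\mathbf{a}_u$ and $\mathbf{b}_i$ as long as there is either interactions data or side information about a given user $u$ or item $i$. If not applying sigmoid transformations, it is also possible to obtain values of $\mathbf{a}_u$ and $\mathbf{b}_i$ for new users and items based on side information alone, without refitting the model entirely, by using the same closed-form solution as ALS while holding everything else constant:
$$\mathbf{a}_u = \left( \mathbf{C}^T\mathbf{C} + \lambda\,\mathrm{Id}_k \right)^{-1} \mathbf{C}^T \mathbf{u}_u$$
and similarly for items:
$$\mathbf{b}_i = \left( \mathbf{D}^T\mathbf{D} + \lambda\,\mathrm{Id}_k \right)^{-1} \mathbf{D}^T \mathbf{i}_i$$
where $\mathbf{u}_u$ and $\mathbf{i}_i$ are the attribute vectors of the new user and item, $\mathbf{C}$ and $\mathbf{D}$ are the fitted attribute-factor matrices, $\mathrm{Id}_k$ is the $k \times k$ identity matrix, and $\lambda$ is the regularization strength after accounting for the weight and scaling of the corresponding side-information term.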
If using sigmoid or other transformations, such values might still be obtained from solving smaller optimization problems through gradient-based methods. Calculating parameters for new users/items this way, while faster than refitting the entire model from scratch, is still a rather slow process and not fast enough to be used in live systems.
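The closed-form case (no sigmoid transformation) amounts to a small ridge regression per new user. A minimal sketch follows, assuming an unweighted squared loss with plain L2 regularization; the function name `cold_start_factors` is hypothetical.

```python
import numpy as np

def cold_start_factors(u, C, lam):
    """Latent factors for a new user from attributes u (length p), with C (p x k) fixed."""
    k = C.shape[1]
    return np.linalg.solve(C.T @ C + lam * np.eye(k), C.T @ u)

# Toy check: noise-free attributes generated from known factors
rng = np.random.default_rng(0)
C = rng.normal(size=(8, 3))              # 8 attributes, k = 3
a_true = rng.normal(size=3)
u_new = C @ a_true
a_hat = cold_start_factors(u_new, C, lam=1e-8)
```

With negligible regularization the true factors are recovered, and the same function applies to new items by passing the item attribute vector together with the item attribute-factor matrix. Each call solves a `k x k` linear system, which is the per-request cost referred to above.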
Other approaches similar in spirit have also been proposed, but they are aimed at warm-start recommendations only.
Collective matrix factorization as presented here is not the only cold-start-capable model that has been proposed for integrating side information into low-rank matrix factorization models. For example, a Bayesian formulation has been proposed which assumes a decomposable generative model that, informally, can be thought of as calculating a base score for each factor derived from the item attributes, plus an offset based on the observed behavior.
While that formulation assumed counts data for the attributes and took a Bayesian approach to the problem, the idea of decomposing the low-rank matrices into additive components can be used in other settings by following a different optimization route.
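As an alternative to the model from the previous section, this work also evaluated a different formulation based on minimizing a loss function as follows:
$$\min_{\mathbf{A}, \mathbf{B}, \mathbf{C}, \mathbf{D}, \mathbf{U}_b, \mathbf{I}_b} \; \lVert I_x \odot (\mathbf{X} - (\mathbf{A} + \mathbf{U}\mathbf{C})(\mathbf{B} + \mathbf{I}\mathbf{D})^T - \mathbf{U}_b - \mathbf{I}_b - \mu) \rVert^2 + \lambda \left( \lVert \mathbf{A} \rVert^2 + \lVert \mathbf{B} \rVert^2 + \lVert \mathbf{C} \rVert^2 + \lVert \mathbf{D} \rVert^2 + \lVert \mathbf{U}_b \rVert^2 + \lVert \mathbf{I}_b \rVert^2 \right)$$
where $\mathbf{U}$ and $\mathbf{I}$ are the user and item attribute matrices, $\mathbf{C}$ and $\mathbf{D}$ map attributes to latent factors, and $\mathbf{A}$ and $\mathbf{B}$ now play the role of free offsets.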
Just like before, optimization was done through L-BFGS and the gradients were obtained through Tensorflow.
Informally, this model tries to calculate a base matrix of latent factors as a linear combination of the user/item attributes, to which a free offset is added based on the observed interactions data in order to obtain the final latent factors. It will be referred to hereafter as the "offsets" model.
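This alternative formulation presents a computational advantage for cold-start recommendations compared to the previous formulation, as the latent factors based on attributes can now be calculated by a simple vector-matrix product instead of solving a larger linear system, while the offsets are zero in the absence of any interaction data, which makes it suitable for producing cold-start recommendations in real time. Contrary to the models presented so far, here the low-rank matrices related to the attributes are also used in the prediction formula. Predictions for a new user $u$ over the known items would be given by:
$$\hat{\mathbf{x}}_u = (\mathbf{B} + \mathbf{I}\mathbf{D})\,\mathbf{C}^T\mathbf{u}_u + \mathbf{I}_b + \mu$$
where $\mathbf{u}_u$ is the attribute vector of the new user, $\mathbf{C}$ and $\mathbf{D}$ are the matrices that map user and item attributes to latent factors, $\mathbf{B}$ holds the item offsets, $\mathbf{I}$ is the item attributes matrix, $\mathbf{I}_b$ are the item biases and $\mu$ is the global mean.
(For new items, the offset $\mathbf{b}_i$ and the item bias would be zero, while the attribute-based factors $\mathbf{D}^T\mathbf{i}_i$ would be calculated in the same way.)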
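The cold-start prediction for a new user then reduces to a few matrix products, as the sketch below illustrates. All names are hypothetical, and the shapes are assumptions: user attributes of length `p`, `C` of shape `(p, k)`, item attributes of shape `(n, q)` and `D` of shape `(q, k)`.

```python
import numpy as np

def predict_new_user(user_attr, C, B_offsets, item_attr, D, item_bias, mu):
    """Scores over all known items for a user with attributes but no ratings."""
    user_factors = user_attr @ C                 # attribute part only, zero offset
    item_factors = B_offsets + item_attr @ D     # offset plus attribute part
    return item_factors @ user_factors + item_bias + mu

# Toy parameters standing in for a fitted model
rng = np.random.default_rng(0)
n, p, q, k = 5, 4, 3, 2
C, D = rng.normal(size=(p, k)), rng.normal(size=(q, k))
B_offsets = rng.normal(size=(n, k))
item_attr = rng.normal(size=(n, q))
item_bias, mu = rng.normal(size=n), 3.6

scores = predict_new_user(rng.normal(size=p), C, B_offsets, item_attr, D, item_bias, mu)
ranking = np.argsort(-scores)                    # items in decreasing predicted score
baseline = predict_new_user(np.zeros(p), C, B_offsets, item_attr, D, item_bias, mu)
```

Since the item-side factors can be precomputed once, serving a new user costs one vector-matrix product and one matrix-vector product. A user whose attributes are all zero falls back to the non-personalized score given by the item bias plus the global mean.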
It also has the advantages of not requiring any special transformation for variables that are limited in range (e.g. binary or non-negative), and of having fewer hyperparameters to tune.
Compared to other approaches for cold-start recommendations, including approaches based on user-wise regressions, this model can work in the absence of side information for either users or items, thus being usable in all the different cold-start scenarios, and its parameters are optimized to recommend items based on both attributes and observed interactions.
A further decomposition, in which user latent factors are determined separately in order to combine them with those derived from the interactions data for items and with those derived from the item attributes, has also been explored in the literature (the "decoupled" model) and was briefly attempted here, but the results, in line with those previously reported, were far below every other model and were left out of the analysis.
Both of these models were evaluated using the MovieLens 1M dataset, complemented with the movie tag genome data taken from the MovieLens latest dataset (last updated 08/2017 at the time these experiments were run), and with demographic and geographical information about users, the latter linked to them through their zip code. Unfortunately, later (and larger) releases of the MovieLens dataset no longer include user attributes, nor was the author aware of any larger, public dataset with side information about both users and items, thus it was not possible to evaluate the models on bigger datasets.
The MovieLens 1M dataset contains 1,000,209 ratings on 3,952 movies by 6,040 users, in a timespan from 2000 to 2003. Information from the tag genome dataset, consisting of 1,128 attributes under each of which movies are assigned a continuous value (which can also be negative), was available for only 3,028 of these movies, but demographic information was available for all users, including their age group (7 buckets), occupation (21 categories), gender, and zip code (not used directly), which were taken as binary variables. Additionally, information about the US region of the user (or whether they were not from the US) was added as binary variables by linking users through the zip code, using free zip code databases to determine the region (http://federalgovernmentzipcodes.us/, http://www.fonz.net/blog/archives/2008/04/06/csv-of-states-and-state-abbreviations/, https://www.infoplease.com/us/states/sizing-states).
Recommendations were evaluated by randomly splitting the ratings data into a training set and four test sets in order to evaluate the different possible cold-start scenarios and compare them to warm-start recommendations, containing:
1. only users and items that were in the training data,
2. users that were not in the training set but items that were,
3. users that were in the training set but items that were not and had tags available,
4. users and items that were not in the training set (exact same users as in 2, and only items that were also in 3),
with each test set containing at least 5 ratings from each user included in that set, having sizes as follows:
|Set||Ratings||Users||Items|
|Train set||478,105||4,530||2,753|
|Test set 1: training users, training items||81,765||3,527||2,518|
|Test set 2: new users, training items||187,456||1,510||2,576|
|Test set 3: training users, new items||170,579||4,278||759|
|Test set 4: new users, new items||57,021||1,426||736|
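One possible way to produce such a split is sketched below. It is a simplified illustration and not the procedure used for the numbers above: the held-out fractions are arbitrary assumptions, the function name `split_scenarios` is hypothetical, and the requirements of at least 5 ratings per user and of tag availability are not enforced.

```python
import numpy as np

def split_scenarios(user_ids, item_ids, new_user_frac=0.25, new_item_frac=0.25,
                    warm_test_frac=0.15, seed=0):
    """Boolean masks over the ratings for the training set and four test sets."""
    rng = np.random.default_rng(seed)
    users, items = np.unique(user_ids), np.unique(item_ids)
    new_users = rng.choice(users, size=int(new_user_frac * users.size), replace=False)
    new_items = rng.choice(items, size=int(new_item_frac * items.size), replace=False)
    u_new = np.isin(user_ids, new_users)      # rating belongs to a held-out user
    i_new = np.isin(item_ids, new_items)      # rating belongs to a held-out item
    warm = ~u_new & ~i_new
    held_out = warm & (rng.random(user_ids.size) < warm_test_frac)
    return {
        "train": warm & ~held_out,
        "test1_warm": held_out,
        "test2_new_users": u_new & ~i_new,
        "test3_new_items": ~u_new & i_new,
        "test4_both_new": u_new & i_new,
    }

# Toy ratings log: one (user, item) pair per rating
rng = np.random.default_rng(1)
user_ids = rng.integers(0, 40, size=2000)
item_ids = rng.integers(0, 30, size=2000)
sets = split_scenarios(user_ids, item_ids)
coverage = sum(mask.astype(int) for mask in sets.values())
```

Every rating lands in exactly one set, and the users in test set 4 are by construction a subset of those in test set 2, mirroring the description above.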
In the case of the first model, which allows for setting weights for each factorization, different weights were tried to examine their impact on the quality of cold-start and warm-start recommendations. The models were also fit to side information about users only, items only, or both, and compared to the same model without side information. Except when stated otherwise, the models were trained including side information about users and items that had no ratings data: there were tags available for 10,993 movies, many of which were in neither the training nor the test sets. All the models were fit with the same number of latent factors and the same regularization. The dimensionality of the tags data was reduced by taking only their first 50 principal components, as the number of columns is too large for the second model, and the first model also seems to benefit from reducing dimensionality. Intuitively, however, the first model should not need this type of dimensionality reduction, as it performs it implicitly, but taking advantage of this rich side information would require setting the regularization parameters differently for each matrix, which was not explored here. Some trial and error (not recorded here) suggests that also adding non-shared latent factors brings a slight improvement, particularly when using only item attributes. Models were evaluated in terms of their RMSE (root mean squared error) and NDCG@5 (normalized discounted cumulative gain at 5), the latter calculated on a per-user basis and averaged across all users. The definition of DCG was taken as follows:
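As an illustration of how these two metrics can be computed for one user, the sketch below assumes a common DCG form in which the rating at rank $r$ is divided by $\log_2(r + 1)$. This is an assumption that may differ from the exact definition used in this work, and the function names are hypothetical.

```python
import numpy as np

def dcg_at_k(relevance_in_ranked_order, k=5):
    """DCG with a log2 discount; gain is the raw rating (assumed variant)."""
    rel = np.asarray(relevance_in_ranked_order, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(rel / discounts))

def ndcg_at_k(true_ratings, predicted_scores, k=5):
    """DCG of the predicted ordering divided by the DCG of the ideal ordering."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    order = np.argsort(-np.asarray(predicted_scores, dtype=float))
    ideal_dcg = dcg_at_k(np.sort(true_ratings)[::-1], k)
    if ideal_dcg <= 0:
        return 0.0
    return dcg_at_k(true_ratings[order], k) / ideal_dcg

def rmse(true_ratings, predicted):
    diff = np.asarray(true_ratings, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# One user's held-out ratings and two candidate sets of model scores
ratings = [5, 3, 1]
perfect = ndcg_at_k(ratings, [0.9, 0.5, 0.1])   # ordering matches the ratings
reversed_order = ndcg_at_k(ratings, [0.1, 0.5, 0.9])
```

Averaging `ndcg_at_k` over all users in a test set gives the per-user NDCG@5 described above, while RMSE is computed over all held-out ratings at once.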
|Model||Weight (ratings)||Weight (user attributes)||Weight (item attributes)||RMSE||NDCG@5|
|CMF, item attributes||1||-||1||1.0128||0.7304|
|CMF, item attributes||1.5||-||0.5||1.0136||0.7212|
|CMF, item attributes||0.5||-||1.5||1.0194||0.7302|
|Offsets, item attributes||-||-||-||0.9558||0.7387|

|Model||Weight (ratings)||Weight (user attributes)||Weight (item attributes)||RMSE||NDCG@5|
|CMF, user and item attributes||1||1||1||0.9955||0.7717|
|CMF, user and item attributes||2.14||0.43||0.43||0.9986||0.7579|
|CMF, user and item attributes||0.43||1.29||1.29||1.0024||0.7657|
|Offsets, user and item attributes||-||-||-||0.9478||0.7642|

|Model||Weight (ratings)||Weight (user attributes)||Weight (item attributes)||RMSE||NDCG@5|
|CMF, user and item attributes||1||1||1||1.063||0.7226|
|CMF, user and item attributes||2.14||0.43||0.43||1.0684||0.7239|
|CMF, user and item attributes||0.43||1.29||1.29||1.0598||0.7296|
|Offsets, user and item attributes||-||-||-||1.0086||0.7247|
This work proposed an enhancement to the collective matrix factorization model in order to deal with binary data, and proposed an alternative formulation, the "offsets" model, that is able to make fast recommendations for new users and items and does not require any transformation for attributes data that is limited in range.
Cold-start recommendations are understandably not as good as warm-start ones, and the offsets model did not manage to beat non-personalized recommendations for new users when fit to user side information only, although it is significantly better than random recommendations and it did beat non-personalized recommendations when item attributes were added too. The coverage of these recommendations, however, is wider than that of "most-popular" lists, as they can recommend new items when there is information available about them, which in practice can be more valuable than an improvement in offline metrics.
Predictions about new users turned out to be of better quality than predictions about new items, despite the side information available about items in this experiment being far more detailed, although the difference in the evaluated metrics is not large. Surprisingly, adding side information about users also resulted in a significant increase in the quality of predictions for new items, more so than one would expect based on the improvement seen in the warm-start scenario.
In the original formulation, adding side information about users and items that are not in the interactions (ratings) training data seems to worsen warm-start recommendations overall by a very small margin, but better hyperparameter tuning might allow the model to benefit from this extra information.
Contrary to what one would expect, giving more weight to the factorization of side information in the original model did not result in better cold-start recommendations in most scenarios, nor did giving more weight to the factorization of the main matrix lead to better warm-start recommendations, but rather, the same weights that led to better warm-start recommendations generally also led to better cold-start ones.
Compared to the original CMF, the offsets model resulted in improved cold-start recommendations for new items being recommended to old users when there is no user side information and, depending on the metric being evaluated, also for new items recommended to new users (different runs of all models resulted in slight variations, with the model having the highest NDCG@5 not being consistently the same one across runs in the last scenario), but the original CMF model performed slightly better in recommending old items to new users. Warm-start recommendations from the offsets model, however, while generally producing better metrics than non-personalized recommendations, lag behind those from a model without any side information at all.
It should be noted too that the offsets model required 3-4 times more L-BFGS iterations to reach convergence compared to the original model, and in the case of side information for both users and items, was stopped before convergence at 800 iterations. Letting it run for more than 2,000 iterations resulted in significantly degraded results everywhere (not reported here).
Collective matrix factorization models are harder to tune correctly in terms of their hyperparameters when compared with typical matrix factorization models based on interactions alone, and adding more side information with a bad choice of hyperparameters results in worse performance metrics compared to discarding it.
As a remark, this work only evaluated recommendations on one dataset (MovieLens 1M), and it remains to be seen whether the results will be consistent across different datasets. It also remains to be seen whether the type of attribute data has some influence on the results from each model: the user data here consisted of all-binary columns, while the items data consisted of all-continuous and normally-distributed columns, which might have had an impact on the difference seen between models that take user or item side information.
One serious limitation of both of these models as implemented here is their impracticality in the scenario of implicit data, as running the L-BFGS procedure would require allocating the full user-item interaction matrix, but perhaps the main idea behind the offsets model could also be used with a different optimization procedure.
An interesting possibility that was not explored is to use Bayesian approaches with non-independent latent factors or with some hierarchical structure, which tend to outperform simpler models based on minimization of squared loss such as the ones experimented with in this work.
Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 452–461. AUAI Press, 2009.
Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, pages 880–887. ACM, 2008.