I Introduction
Nowadays, the Internet floods users with useless information. Hence, recommender systems are useful to supply them with content that may be of interest. Recommender systems have been a popular research topic over the past 20 years, with work aiming to make them more accurate and effective along many dimensions (social dimension [Ma2008][Ma2011][Noel2012], geographical dimension [Levandoski2012][Wang2013], diversification aspect [Abbar2013][Ziegler2005][Zhang2008], etc.).
Collaborative Filtering (CF) [Resnick1997] is one of the most commonly used recommendation methods. CF consists of predicting whether, or how much, a user will like (or dislike) an item by leveraging the knowledge of the user's preferences, as well as those of other users. In practice, users interact with and express their opinions on only a small subset of items, which makes the corresponding user-item rating matrix very sparse. Consequently, in a recommender system, this data sparsity induces two main problems: (1) the lack of data to effectively model user preferences (new users suffer from the cold-start problem [Sedhain2014]), and (2) the lack of data to effectively model item characteristics (new items suffer from the cold-start problem since no user has yet rated them).
On the other hand, besides this sparse user-item rating matrix, there are often many other data sources available to a recommender system, which can provide useful information that describes user interests and item characteristics. Examples of such diverse data sources are numerous: a user's social network, a user’s topics of interest, tags associated with items, etc. These valuable data sources may supply useful information to enhance a recommendation system in modeling user preferences and item characteristics more accurately and thus, hopefully, to make more precise recommendations. Previous research work has demonstrated the effectiveness of using external data sources for recommender systems [Ma2008][Ma2011][Noel2012]. However, most of the proposed solutions focus on the use of only one kind of data provided by an online service (e.g., social network in [Ma2008] or geolocation information in [Levandoski2012][Wang2013]). Extending these solutions into a unified framework that considers multiple and diverse data sources is itself a challenging research problem.
Furthermore, these diverse data sources are typically managed by clusters at different data centers, thus requiring the development of new distributed recommendation algorithms to effectively handle this constantly growing data. In order to make better use of these different data sources, we propose a new distributed collaborative filtering algorithm, which exploits and combines multiple and diverse data sources to improve recommendation quality. To the best of our knowledge, this is the first attempt to propose such a distributed recommendation algorithm. In summary, the contributions of this paper are:

A new recommendation algorithm, based on matrix factorization, which leverages multiple and diverse data sources. This allows better modeling of user preferences and item characteristics.

A distributed version of this algorithm that mainly computes factorizations of matrices by exchanging intermediate latent feature matrices in a coordinated manner.

A thorough comparative analysis with state-of-the-art recommendation algorithms on different datasets.
This paper is organized as follows: Section II provides two use cases; Section III presents the main concepts used in this paper; Section IV describes our multi-source recommendation model; Section V gives our distributed multi-source recommendation algorithm; Section VI describes our experimental evaluation; Section VII discusses the related work; finally, Section VIII concludes and provides future directions.
II Use Cases
Let us illustrate our motivation with two use cases, one with internal data sources, one with external data sources.
II-A Diverse internal data sources
Consider John, a user who has rated a few movies he saw on a movie recommender system. In that same recommender system, John has also expressed his topics of interest regarding the movie genres he likes. He also maintains a list of friends, whom he trusts and follows to get insight on interesting movies. Finally, John has annotated several movies he saw with tags that describe their contents.
In this example, the same recommender system holds many valuable data sources (topics of interest, friends list, and annotations), which may be used to accurately model John’s preferences and movies’ characteristics, and thus, hopefully to make more precise recommendations. In this first scenario, we suppose that these diverse data sources are hosted over different clusters of the same data center of the recommender system. It is obvious that a centralized recommendation algorithm induces a massive data transfer, which will cause a bottleneck problem in the data center. This clearly shows the importance of developing a distributed recommendation solution.
II-B Diverse external data sources
Let us now consider that John is a regular user of a movie recommender system and of many other online services. John uses Google as a search engine, Facebook to communicate and exchange with his friends, and maybe other online services such as the Epinions social network, IMDb (an online database of information related to films), Movielens, etc., as illustrated in Figure 1.
In this second use case, we believe that by exploiting and combining all these valuable data sources provided by different online services, we could make the recommender system more precise. The data sources are located and distributed over the clusters of different data centers, which are geographically distributed. In this second use case, we assume that the recommendation system can identify and link entities that refer to the same users and items across the different data sources. We envision that the connection of these online services may be greatly helped by initiatives like OpenID (http://openid.net/), which promotes a unified user identity across different services. In addition, we assume that the online services are willing to help the recommender system through contracts that can be established.
III Definitions and Background
In this section, we introduce the data model we use, and a CF algorithm based on matrix factorization. Then, we describe the recommendation problem we study.
III-A Data Model
We use matrices to represent all the data manipulated in our recommendation algorithm. Matrices are very useful mathematical structures to represent numbers, and several techniques from matrix theory and linear algebra can be used to analyze them in various contexts. Hence, we assume that any data source can be represented using a matrix, whose value in position $(i, j)$ is a correlation that may exist between the $i$-th and $j$-th elements. We distinguish mainly three different kinds of data matrices:
Users’ preferences history: In a recommender system, there are two classes of entities, which are referred to as users and items. Users have preferences for certain items, and these preferences are extracted from the data. The data itself is represented as a matrix, giving for each user-item pair a value that represents the degree of preference of that user for that item.
Users’ attributes: A data source may supply information on users using two classes of entities, which are referred to as users and attributes. An attribute may refer to any abstract entity that has a relation with users. We also use matrices to represent such data, where for each user-attribute pair, a value represents their correlation (e.g., term frequency-inverse document frequency (tf-idf) [BaezaYates2010]). The way this correlation is computed is out of the scope of this paper.
Items’ attributes: Similarly, a data source may embed information that describes items using two classes of entities, namely items and attributes. Here, an attribute refers to any abstract entity that has a relation with items. Matrices are used to represent these data, where for each attribute-item pair, a value is associated to represent their correlation (e.g., tf-idf). The way this correlation is computed is also beyond the scope of this paper.
Table I gives examples of attributes that may describe both users and items, as well as the meaning of the correlations. Note that these three kinds of matrices are sparse, meaning that most entries are missing. A missing value implies that we have no explicit information regarding the corresponding entry.
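Table I: Examples of attributes and of the meaning of their correlations.

| Attribute 1 | Attribute 2 | Example of correlation |
|---|---|---|
| User | User | Similarity between the two users |
| User | Topic | Interest of the user in the topic |
| Item | Topic | Topic of the items |
| Item | Item | Similarity between two items |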
III-B Matrix Factorization (MF) Models
Matrix factorization aims at decomposing a user-item rating matrix $R$ of dimension $m \times n$ (for $m$ users and $n$ items) containing observed ratings $R_{ij}$ into a product $R \approx U^T V$ of latent feature matrices $U$ and $V$ of rank $K$. In this initial MF setting, we designate $U_i$ and $V_j$ as the $i$-th and $j$-th columns of $U$ and $V$, such that $U_i^T V_j$ acts as a measure of similarity between user $i$ and item $j$ in their respective $K$-dimensional latent spaces.
However, there remains the question of how to learn $U$ and $V$ given that $R$ may be incomplete (i.e., it contains missing entries). One answer is to define a reconstruction error objective over the observed entries, to be minimized as a function of $U$ and $V$, and then use gradient descent to optimize it; formally, we can optimize the following MF objective [Salakhutdinov2007]: $\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{n} I_{ij}\,(R_{ij} - U_i^T V_j)^2$, where $I_{ij}$ is the indicator function that is equal to 1 if user $i$ rated item $j$ and equal to 0 otherwise. Also, in order to avoid overfitting, two regularization terms are added to the previous equation (i.e., $\frac{\lambda_U}{2}\|U\|_F^2$ and $\frac{\lambda_V}{2}\|V\|_F^2$).
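The following is a minimal sketch of this optimization, not the implementation used in the paper. It assumes the objective of [Salakhutdinov2007] (squared error over observed entries plus Frobenius regularization with a factor of one half), stores users and items as columns of `U` and `V`, and uses dense NumPy arrays for brevity. The function names, learning rate, and toy matrix are illustrative assumptions.

```python
import numpy as np

def mf_objective(R, I, U, V, lam_u, lam_v):
    """Regularized squared error computed over observed entries only (I == 1)."""
    err = I * (R - U.T @ V)
    return (0.5 * np.sum(err ** 2)
            + 0.5 * lam_u * np.sum(U ** 2)
            + 0.5 * lam_v * np.sum(V ** 2))

def mf_gradient_descent(R, I, k=3, lam_u=0.1, lam_v=0.1, lr=0.01, iters=300, seed=0):
    """Plain batch gradient descent on the objective above (illustrative settings)."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = 0.1 * rng.standard_normal((k, m))  # column i is the latent vector of user i
    V = 0.1 * rng.standard_normal((k, n))  # column j is the latent vector of item j
    history = []
    for _ in range(iters):
        err = I * (U.T @ V - R)            # zero wherever the rating is missing
        grad_U = V @ err.T + lam_u * U     # derivative with respect to U
        grad_V = U @ err + lam_v * V       # derivative with respect to V
        U -= lr * grad_U
        V -= lr * grad_V
        history.append(mf_objective(R, I, U, V, lam_u, lam_v))
    return U, V, history

# Toy 4 x 5 rating matrix; 0 marks a missing entry (assumed convention).
R_toy = np.array([[5, 3, 0, 1, 4],
                  [4, 0, 0, 1, 3],
                  [1, 1, 0, 5, 0],
                  [0, 1, 5, 4, 2]], dtype=float)
I_toy = (R_toy > 0).astype(float)
U_hat, V_hat, losses = mf_gradient_descent(R_toy, I_toy)
predictions = U_hat.T @ V_hat              # includes estimates for the missing entries
```

A production implementation would iterate over the observed entries only instead of forming the dense error matrix, so that the cost stays proportional to the number of ratings.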
III-C Problem Definition
The problem we address in this paper is different from that in traditional recommender systems, which consider only the user-item rating matrix. In this paper, we incorporate information coming from multiple and diverse data matrices to improve recommendation quality. We define the problem we study in this paper as follows. Given:

a user-item rating matrix;

data matrices that describe the user preferences, distributed over different clusters;

data matrices that describe the items’ characteristics, distributed over different clusters;
how can we effectively and efficiently predict the missing values of the user-item matrix by exploiting and combining these different data matrices?
IV Recommendation Model
In this section, we first give an overview of our recommendation model using an example. Then, we introduce the factor analysis method for our model that uses probabilistic matrix factorization.
IV-A Recommendation Model Overview
Let us first consider as an example the user-item rating matrix of a recommender system (see Figure 2). There are 5 users who rated 5 movies on a 5-point integer scale to express the extent to which they like each item. Also, as illustrated in Figure 2, the recommender system provider holds three data matrices that provide information describing users and items. Note that only part of the users and items of these data matrices overlap with those of the user-item rating matrix.

Matrix (1): provides the social network of some of the users, where each value in the matrix represents the trust between two users.

Matrix (2): provides information about the interests of some of the users, where for each user-topic pair in the matrix, a value is associated, which represents the interest of the user in this topic.

Matrix (3): provides information about the genre of some of the movies.
The problem we study in this paper is how to predict the missing values of the user-item matrix effectively and efficiently by combining all these data matrices. Motivated by the intuition that more information will help to improve a recommender system, and inspired by the solution proposed in [Ma2008], we propose to disseminate the data matrices of the data sources in the user-item matrix, by factorizing all these matrices simultaneously and seamlessly as illustrated in Figure 2. Each factorization produces $k$-dimensional user latent feature matrices, $k$-dimensional item latent feature matrices, and factor matrices for the attributes. In the example given in Figure 2, we use 3 dimensions to perform the factorizations of the matrices. Once done, we can predict the missing values in the user-item matrix using the product of the user and item latent feature matrices. In the following sections, we present the details of our recommendation model.
IV-B User-Item Rating Matrix Factorization
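Suppose that a Gaussian distribution gives the probability of an observed entry $R_{ij}$ in the user-item matrix $R$ as follows:

$$p(R_{ij} \mid U_i^T V_j, \sigma^2) = \mathcal{N}(R_{ij} \mid U_i^T V_j, \sigma^2) \tag{1}$$

where $\mathcal{N}(x \mid \mu, \sigma^2)$ is the probability density function of the Gaussian distribution with mean $\mu$ and variance $\sigma^2$. The idea is to give the highest probability to $R_{ij} \approx U_i^T V_j$ as given by the Gaussian distribution. Hence, the probability of observing approximately the entries of $R$ given the feature matrices $U$ and $V$ is:

$$p(R \mid U, V, \sigma^2) = \prod_{i}\prod_{j} \left[\mathcal{N}(R_{ij} \mid U_i^T V_j, \sigma^2)\right]^{I_{ij}} \tag{2}$$

where $I_{ij}$ is the indicator function that is equal to 1 if user $i$ rated item $j$ and equal to 0 otherwise. Similarly, we place zero-mean spherical Gaussian priors [Dueck2004][Ma2008][Salakhutdinov2007] on user and item feature vectors:

$$p(U \mid \sigma_U^2) = \prod_{i} \mathcal{N}(U_i \mid 0, \sigma_U^2 \mathbf{I}), \qquad p(V \mid \sigma_V^2) = \prod_{j} \mathcal{N}(V_j \mid 0, \sigma_V^2 \mathbf{I}) \tag{3}$$

Hence, through a simple Bayesian inference, we have:

$$p(U, V \mid R, \sigma^2, \sigma_U^2, \sigma_V^2) \propto p(R \mid U, V, \sigma^2)\, p(U \mid \sigma_U^2)\, p(V \mid \sigma_V^2) \tag{4}$$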
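As a worked step, the log of the posterior in Equation 4 can be related to the regularized objective of Section III-B. The derivation below follows the standard probabilistic matrix factorization argument of [Salakhutdinov2007]. The notation is an assumption made here for illustration: $R$ is the rating matrix, $U_i$ and $V_j$ are latent columns, $I_{ij}$ is the indicator of observed ratings, $\sigma^2$, $\sigma_U^2$, $\sigma_V^2$ are the variances of the likelihood and of the two priors, and $C$ collects the terms that do not depend on $U$ or $V$.

```latex
\begin{aligned}
\ln p(U, V \mid R, \sigma^2, \sigma_U^2, \sigma_V^2)
  &= -\frac{1}{2\sigma^2} \sum_{i}\sum_{j} I_{ij}\,\bigl(R_{ij} - U_i^T V_j\bigr)^2
     - \frac{1}{2\sigma_U^2} \sum_{i} U_i^T U_i
     - \frac{1}{2\sigma_V^2} \sum_{j} V_j^T V_j + C \\[4pt]
\text{so that maximizing it is equivalent to minimizing}\quad
E(U, V) &= \frac{1}{2} \sum_{i}\sum_{j} I_{ij}\,\bigl(R_{ij} - U_i^T V_j\bigr)^2
     + \frac{\lambda_U}{2}\,\lVert U \rVert_F^2
     + \frac{\lambda_V}{2}\,\lVert V \rVert_F^2,
\qquad \lambda_U = \frac{\sigma^2}{\sigma_U^2},\;\; \lambda_V = \frac{\sigma^2}{\sigma_V^2}.
\end{aligned}
```

With the variances held fixed, the regularization weights are therefore ratios of the noise variance to the prior variances.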
IV-C Matrix factorization for data sources that describe users
Now let us consider a user-attribute matrix of users and attributes, which describes users. We define the conditional distribution over the observed matrix values as:
(5) 
where the indicator function is equal to 1 if the user has a correlation with the attribute (in the data matrix) and equal to 0 otherwise. Similarly, we place zero-mean spherical Gaussian priors on feature vectors:
(6) 
Hence, similar to Equation 4, through a simple Bayesian inference, we have:
(7) 
IV-D Matrix factorization for data sources that describe items
Now let us consider an item-attribute matrix of items and attributes, which describes items. We also define the conditional distribution over the observed matrix values as:
(8) 
where the indicator function is equal to 1 if the item is correlated to the attribute (in the data source) and equal to 0 otherwise. We also place zero-mean spherical Gaussian priors on feature vectors:
(9) 
Hence, through a Bayesian inference, we also have:
(10) 
IV-E Recommendation Model
Considering data matrices that describe users, data matrices that describe items, and based on the graphical model given in Figure 3, we model the conditional distribution over the observed ratings as:
(11) 
Hence, we can infer the log of the posterior distribution for the recommendation model as follows:
(12) 
where $\|\cdot\|_F$ denotes the Frobenius norm, and the $\lambda$ coefficients are regularization parameters. A local minimum of the objective function given by Equation 12 can be found using Gradient Descent (GD). We present the centralized GD algorithm in detail in Appendix A and a distributed version of this algorithm in the next section.
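To make the idea of a joint factorization concrete, here is a hedged sketch with one rating matrix and a single user data matrix. Equation 12 is not reproduced in this text, so the objective below is an assumption: a SoRec-style [Ma2008] sum of two masked squared errors that share the user factors `U`, plus Frobenius regularization. The names `S`, `Z`, `alpha`, and `lam` are illustrative and not taken from the paper.

```python
import numpy as np

def joint_objective(R, IR, S, IS, U, V, Z, alpha, lam):
    """Assumed joint loss: rating error + weighted user-attribute error + regularization."""
    e_r = IR * (R - U.T @ V)   # errors on observed ratings
    e_s = IS * (S - U.T @ Z)   # errors on observed user-attribute correlations
    reg = np.sum(U ** 2) + np.sum(V ** 2) + np.sum(Z ** 2)
    return 0.5 * np.sum(e_r ** 2) + 0.5 * alpha * np.sum(e_s ** 2) + 0.5 * lam * reg

def joint_gradients(R, IR, S, IS, U, V, Z, alpha, lam):
    """Gradients of the assumed loss; U receives a contribution from both matrices."""
    e_r = IR * (U.T @ V - R)
    e_s = IS * (U.T @ Z - S)
    grad_U = V @ e_r.T + alpha * (Z @ e_s.T) + lam * U
    grad_V = U @ e_r + lam * V
    grad_Z = alpha * (U @ e_s) + lam * Z
    return grad_U, grad_V, grad_Z

def fit_joint(R, IR, S, IS, k=3, alpha=0.5, lam=0.1, lr=0.01, iters=300, seed=0):
    """Batch gradient descent on all three factor matrices (illustrative settings)."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((k, R.shape[0]))  # shared user factors
    V = 0.1 * rng.standard_normal((k, R.shape[1]))  # item factors
    Z = 0.1 * rng.standard_normal((k, S.shape[1]))  # attribute factors
    for _ in range(iters):
        g_u, g_v, g_z = joint_gradients(R, IR, S, IS, U, V, Z, alpha, lam)
        U -= lr * g_u
        V -= lr * g_v
        Z -= lr * g_z
    return U, V, Z
```

The design point the sketch illustrates is that the gradient with respect to the user factors is a sum of independent contributions, one per data matrix, which is what makes a distributed evaluation possible.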
V Distributed Recommendation
In this section, we first present a distributed version of the centralized collaborative filtering algorithm (the GD algorithm of Appendix A), which minimizes Equation 12, and then we carry out a complexity analysis to show that it can scale to large datasets.
V-A Distributed CF Algorithm
In this section, we show how to deploy the centralized GD algorithm of Appendix A in a distributed setting over different clusters and how to generate predictions. This distribution is mainly motivated by: (i) the need to scale up our collaborative filtering algorithm to very large datasets, i.e., to parallelize part of the computation, and (ii) the need to avoid transferring the raw data matrices (mainly for privacy concerns). Instead, we only exchange common latent features.
Based on the observation that several parts of the centralized CF algorithm can be executed in parallel and separately on different clusters, we propose to distribute it using Algorithms 1, 2, and 3. Algorithm 1 is executed by the master cluster that handles the user-item rating matrix, whereas each slave cluster that handles data matrices about users’ attributes executes an instance of Algorithm 2, and each slave cluster that handles data matrices about items’ attributes executes an instance of Algorithm 3.
Basically, the first step of this distributed algorithm is an initialization phase, where each cluster (master and slaves) initializes its latent feature matrices with small random values (in Algorithms 1, 2, and 3). Next, in Algorithm 1, the master cluster computes part of the partial derivative of the objective function given in Equation 12 with respect to the user latent feature matrix $U$ (this is a part of the corresponding step of the centralized GD algorithm). Then, for each user, the master cluster sends the user's latent feature vector to the participant slave clusters that share attributes about that user. Then, the master cluster waits for the responses of all these participant slave clusters.
Next, each slave cluster that receives users’ latent features replaces its corresponding user latent feature vectors with the ones received from the master cluster (in Algorithm 2). Then, the slave cluster computes its own part of the partial derivative of the objective function given in Equation 12 with respect to $U$. Next, the slave cluster keeps in this partial derivative only the vectors of users that are shared with the master cluster, and sends these remaining feature vectors to the master cluster. Finally, the slave cluster updates its local user and attribute latent feature matrices.
As for the master cluster, each user latent feature matrix received from a slave cluster is added to the master's partial derivative of the objective function with respect to $U$ (in Algorithm 1). This addition is performed using an algebraic operator defined as follows:
Definition 1.
For two matrices and , returns the matrix where:
Once the master cluster has received all the partial derivatives of the objective function with respect to the user latent feature matrix $U$ from all the user participant sites, it has computed the global derivative of the objective function given in Equation 12 with respect to $U$. A similar operation is performed for the item slave clusters in Algorithm 1 to compute the global derivative of the objective function given in Equation 12 with respect to the item latent feature matrix $V$, as in the centralized GD algorithm. Finally, the master cluster updates the user and item latent feature matrices $U$ and $V$, and evaluates the objective function (in Algorithm 1). The convergence of the whole algorithm is also checked in Algorithm 1. Note that all the involved clusters that hold data matrices on users’ and items’ attributes execute their respective algorithms in parallel.
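The exchange described above can be sketched in a single process as follows. This is a simplified illustration, not Algorithms 1 and 2 themselves. It assumes a squared-error loss with a hypothetical weight `alpha` on the slave's data matrix, and it reads the addition operator of Definition 1 as "add each received column to the master column of the same user", which is an interpretation since the definition body is not available here. `shared_ids` is a hypothetical array mapping slave columns to master user indices.

```python
import numpy as np

def master_partial_grad(R, IR, U, V, lam):
    """Master side: part of the derivative w.r.t. U that needs only the rating matrix."""
    err = IR * (U.T @ V - R)
    return V @ err.T + lam * U

def slave_partial_grad(S, IS, U_shared, Z, alpha):
    """Slave side: contribution of one user data matrix, computed from the
    latent vectors U_shared received from the master (one column per shared user)."""
    err = IS * (U_shared.T @ Z - S)
    return alpha * (Z @ err.T)

def merge(grad_master, grad_slave, shared_ids):
    """Assumed reading of the addition operator: columns of shared users are summed,
    all other columns of the master gradient are left unchanged.
    shared_ids must not contain duplicates."""
    merged = grad_master.copy()
    merged[:, shared_ids] += grad_slave
    return merged

# One simulated round: the master knows 5 users, the slave describes users 0, 2, 3.
rng = np.random.default_rng(0)
k = 3
U, V, Z = (rng.standard_normal((k, n)) for n in (5, 6, 4))
R, IR = rng.random((5, 6)), (rng.random((5, 6)) < 0.5).astype(float)
shared_ids = np.array([0, 2, 3])
S, IS = rng.random((3, 4)), np.ones((3, 4))

g_master = master_partial_grad(R, IR, U, V, lam=0.1)
g_slave = slave_partial_grad(S, IS, U[:, shared_ids], Z, alpha=0.5)  # only k x 3 values travel
g_total = merge(g_master, g_slave, shared_ids)
```

Only latent vectors and partial derivatives of size `k` per shared user cross the network in this sketch; the raw matrix `S` never leaves the slave.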
V-B Complexity Analysis
The main computation of the GD algorithm evaluates the objective function in Equation 12 and its derivatives. Because of the extreme sparsity of the matrices, the computational complexity of evaluating the objective function is proportional to the number of nonzero entries in the data matrices. The computational complexities for the derivatives are also proportional to the number of nonzero entries in the data matrices. Hence, the total computational complexity of one iteration of this gradient descent algorithm is linear with respect to the number of observations in the data matrices. This complexity analysis indicates that our algorithm is efficient and can scale to very large datasets.
VI Experimental Evaluation
In this section, we carry out several experiments to mainly address the following questions:

What is the amount of data transferred?

How does the number of user and item sources affect the accuracy of predictions?

What is the performance comparison on users with different observed ratings?

Can our algorithm achieve good performance even if users have no observed ratings?

How does our approach compare with the state-of-the-art collaborative filtering algorithms?
In the rest of this section, we introduce our datasets and experimental methodology, and address these questions (question 1 in Section VI-C, question 2 in Section VI-D, questions 3 and 4 in Section VI-E, and question 5 in Section VI-F).
VI-A Description of the Datasets
The first round of our experiments is based on a dataset from Delicious, described and analyzed in [Wetzker2008][Bouadjenek2013b][Bouadjenek2014] (http://data.dailabor.de/corpus/delicious/). Delicious is a bookmarking system, which provides users with a means to freely annotate Web pages with tags. In this scenario, we want to recommend interesting Web pages to users. This dataset contains 425,183 tags, 1,321,039 Web pages, and 318,769 users. The user-item matrix contains 2,265,207 entries (from these counts, a density of about 0.0005%). Each entry of the user-item matrix represents the degree to which a user interacted with an item, expressed on a scale. The dataset contains a user-tag matrix with 4,598,815 entries, where each entry expresses the interest of a user in a tag on a scale. Lastly, the dataset contains an item-tag matrix with 4,403,244 entries, where each entry expresses the coverage of a tag in a Web page on a scale. The user-tag and item-tag matrices are used as the user data matrix and the item data matrix, respectively. However, to simulate having many data matrices that describe both users and items, we have randomly split each of these two matrices into 10 matrices along both columns and rows. These new matrices kept their property of sparsity. Hence, we end up with a user-item rating matrix, 10 user data matrices (with approximately 459,000 entries each), and 10 item data matrices (with approximately 440,000 entries each).
The second round of experiments is based on one of the datasets given by the HetRec 2011 workshop (http://ir.ii.uam.es/hetrec2011/datasets.html), and reflects a real use case. This dataset is an extension of the Movielens dataset, which contains personal ratings, and data coming from other data sources (mainly IMDb and Rotten Tomatoes). This dataset includes ratings that range from 1 to 5, in steps of 0.5. It contains a user-item matrix of 2,113 users, 10,109 movies, and 855,597 ratings (from these counts, a density of about 4%). This dataset also includes a user-tag matrix of 9,078 tags describing the users with 21,324 entries on a scale, which is used as a user data matrix. Lastly, this dataset contains four item data matrices: (1) an item-tag matrix with 51,794 entries, (2) an item-genre matrix with 20 genres and 20,809 entries, (3) an item-actor matrix with 95,321 actors and 213,742 entries, and (4) an item-location matrix with 187 locations and 13,566 entries.
VI-B Methodology and metrics
We have implemented our distributed collaborative filtering algorithm and integrated it into Peersim [Montresor2009], a well-known distributed computing simulator. We use two different training data settings (80% and 60%) to test the algorithms. We randomly select part of the ratings from the user-item rating matrix as the training data (80% or 60%) to predict the remaining ratings (respectively 20% or 40%). The random selection was carried out 5 times independently, and we report the average results. To measure the prediction quality of the different recommendation methods, we use the Root Mean Square Error (RMSE), for which a smaller value means better performance. We refer to our method as Distributed Probabilistic Matrix Factorization (DPMF).
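The evaluation protocol above can be sketched as follows. This is an illustrative outline under stated assumptions, not the Peersim-based setup of the paper: ratings are assumed to be stored as rows of (user, item, value), and `fit_predict` is a hypothetical callable that trains on the training rows and returns one prediction per test row.

```python
import numpy as np

def rmse(actual, predicted):
    """Root Mean Square Error; smaller is better."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def random_split(n_ratings, train_fraction, rng):
    """Return (train, test) index arrays for one random split of the ratings."""
    perm = rng.permutation(n_ratings)
    cut = int(round(train_fraction * n_ratings))
    return perm[:cut], perm[cut:]

def average_rmse(ratings, fit_predict, train_fraction=0.8, repeats=5, seed=0):
    """Average test RMSE over several independent random splits.

    ratings: array with rows (user, item, value).
    fit_predict: hypothetical callable (train_rows, test_rows) -> predictions.
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(repeats):
        train_idx, test_idx = random_split(len(ratings), train_fraction, rng)
        preds = fit_predict(ratings[train_idx], ratings[test_idx])
        scores.append(rmse(ratings[test_idx, 2], preds))
    return float(np.mean(scores))

def global_mean_baseline(train_rows, test_rows):
    """Trivial stand-in model: predict the mean training rating everywhere."""
    return np.full(len(test_rows), train_rows[:, 2].mean())
```

Calling `average_rmse(ratings, global_mean_baseline, train_fraction=0.6)` would give the second training setting; any real model can be plugged in through the same callable.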
VI-C Data transfer
Let us consider an example where a recommender system uses a social network as a source of information to improve its precision. Let us assume that the social network contains 40 million unique users with 80 billion asymmetric connections (a density of $80 \times 10^{9} / (40 \times 10^{6})^{2} = 0.005\%$). If we only consider the connections, the size of the user-user matrix representing this social network is $80 \times 10^{9} \times 16 = 1.28 \times 10^{12}$ bytes, i.e., about 1.28 TB (assuming that we need 8 bytes to encode a double to represent the strength of the relation between two users, and 8 bytes to encode a long that represents the key for the entry of the value in the user-user matrix). Hence, for the execution of the centralized collaborative filtering algorithm, about 1.28 TB of data need to be transferred through the network. However, if we assume that 10% of the users are common to the recommender system and the social network, each iteration of the DPMF algorithm requires the transfer of $4 \times 10^{6} \times 10 \times 8 \times 2 = 6.4 \times 10^{8}$ bytes, i.e., about 640 MB (assuming that we use 10 dimensions for the factorization, that we need 8 bytes to encode a double value in a latent user vector, and a round trip of transfer for the latent vectors between Algorithm 1 and Algorithm 2). Hence, if the algorithm requires 100 iterations to converge (roughly the number of iterations needed in our experiment), the total amount of data transferred is about 64 GB, which represents about 5% of the data transferred in the centralized solution. Finally, the total amount of data transferred depends on the density of the source, the total number of common attributes, the number of latent dimensions used for the factorization, and the number of iterations needed for the algorithm to converge. These parameters can make DPMF very competitive compared to the centralized solution in terms of data transfer.
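The arithmetic of this example can be reproduced with a few lines. The sketch below uses only the assumptions stated in the paragraph (40 million users, 80 billion connections, 8-byte values and keys, 10% common users, 10 latent dimensions, one round trip per iteration, 100 iterations); the constant names are illustrative, and sizes are reported in bytes to avoid any ambiguity between decimal and binary units.

```python
USERS = 40_000_000                 # unique users in the social network
CONNECTIONS = 80_000_000_000       # asymmetric connections (nonzero entries)
BYTES_PER_VALUE = 8                # double encoding the strength of a relation
BYTES_PER_KEY = 8                  # long encoding the key of an entry
COMMON_PERCENT = 10                # share of users known to both systems
LATENT_DIMS = 10                   # dimensions used for the factorization
ITERATIONS = 100                   # iterations until convergence

# Density of the user-user matrix.
density = CONNECTIONS / USERS ** 2

# Centralized solution: every stored entry (key + value) is shipped once.
centralized_bytes = CONNECTIONS * (BYTES_PER_KEY + BYTES_PER_VALUE)

# Distributed solution: latent vectors of common users, sent and returned.
common_users = USERS * COMMON_PERCENT // 100
per_iteration_bytes = common_users * LATENT_DIMS * BYTES_PER_VALUE * 2
distributed_bytes = per_iteration_bytes * ITERATIONS

ratio = distributed_bytes / centralized_bytes

print(f"density:            {density:.6%}")
print(f"centralized:        {centralized_bytes:,} bytes")
print(f"per iteration:      {per_iteration_bytes:,} bytes")
print(f"distributed total:  {distributed_bytes:,} bytes")
print(f"distributed share:  {ratio:.1%}")
```

Changing `COMMON_PERCENT`, `LATENT_DIMS`, or `ITERATIONS` shows how quickly the advantage of the distributed scheme shrinks or grows with these parameters.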
VI-D Impact of the number of sources
Figures 4 and 5 show the results obtained on the two datasets, while varying the number of sources. Note that source=0 means that we factorize only the user-item matrix.
In Figure 4, the green curve represents the impact of adding item sources only, the red curve the impact of adding user sources only, and the blue curve the impact of adding both sources (e.g., 2 sources means we add 2 item and 2 user sources). First, the results show that adding more sources helps to improve the performance, confirming our initial intuition. The additional data sources have certainly contributed to refine users’ preferences and items’ characteristics. Second, we observe that sources that describe users are more helpful than sources that describe items (about 10% gain). However, we consider this observation to be specific to this dataset, and cannot be generalized. Third, we notice that combining both data sources provides the best performance (blue curve, about 8% with respect to the use of only user sources). Lastly, the best gain obtained with respect to the PMF method (source=0) is about 32%.
Figure 5 shows the results obtained on the Movielens dataset. These results are quite different from those obtained on the Delicious dataset. Indeed, we observe that data matrices 1, 2, and 3 have a positive impact on the results; however, data matrices 4 and 5 decrease the performance. This is probably because the data embedded in these two matrices are not meaningful enough to infer items’ characteristics. In general, the best performance is obtained using the first three data matrices, with a gain of 10% with respect to PMF (source=0).
VI-E Performance on users and items with different ratings
We stated previously that the data sparsity in a recommender system induces mainly two problems: (1) the lack of data to effectively model user preferences, and (2) the lack of data to effectively model item characteristics. Hence, in this section we study the ability of our method to provide accurate recommendations for users who supply few ratings, and for items that have few ratings (or no ratings at all). We show the results for different user ratings in Figure (a), and for different item ratings in Figure (b), on the Delicious dataset. We group users and items into 10 classes based on the number of observed ratings: “0”, “1-5”, “6-10”, “11-20”, “21-40”, “41-80”, “81-160”, “161-320”, “321-640”, and “>640”. We also show the performance of the Probabilistic Matrix Factorization (PMF) method [Salakhutdinov2007], and of our method using 5 and 10 data matrices. In Figure (a), on the X-axis, users are grouped and ordered with respect to the number of ratings they have assigned. For example, for users with no ratings (0 on the X-axis), we got an average RMSE of 0.37 using the PMF method. Similarly, in Figure (b), on the X-axis, items are grouped and ordered with respect to the number of ratings they have obtained.
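The grouping into ten classes (0, 1 to 5, 6 to 10, 11 to 20, 21 to 40, 41 to 80, 81 to 160, 161 to 320, 321 to 640, and more than 640 observed ratings) and the per-class RMSE can be sketched as follows. The data layout and the `predict` callable are hypothetical; only the class boundaries come from the text.

```python
import bisect
import math
from collections import Counter, defaultdict

# Inclusive upper bounds of the first nine classes; larger counts fall in ">640".
UPPER_BOUNDS = [0, 5, 10, 20, 40, 80, 160, 320, 640]
LABELS = ["0", "1-5", "6-10", "11-20", "21-40", "41-80",
          "81-160", "161-320", "321-640", ">640"]

def rating_class(n_ratings):
    """Map a number of observed ratings to the label of its class."""
    return LABELS[bisect.bisect_left(UPPER_BOUNDS, n_ratings)]

def rmse_per_class(test_rows, train_counts, predict):
    """RMSE of the test ratings, grouped by the class of the rating user.

    test_rows: iterable of (user, item, rating).
    train_counts: dict user -> number of training ratings (missing means 0).
    predict: hypothetical callable (user, item) -> predicted rating.
    """
    squared_error = defaultdict(float)
    n_rows = Counter()
    for user, item, rating in test_rows:
        label = rating_class(train_counts.get(user, 0))
        squared_error[label] += (rating - predict(user, item)) ** 2
        n_rows[label] += 1
    return {label: math.sqrt(squared_error[label] / n_rows[label]) for label in n_rows}
```

The same function gives the item-side breakdown if `train_counts` is keyed by item and the class lookup uses the item of each test row instead of the user.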
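Table II: RMSE of the compared methods, and improvement of DPMF over each of them.

| Dataset | Training | U. Mean | I. Mean | NMF | PMF | SoRec | PTBR | Matchbox | HeteroMF | DPMF |
|---|---|---|---|---|---|---|---|---|---|---|
| Delicious | 80% | 0.4389 | 0.4280 | 0.3814 | 0.3811 | 0.3566 | 0.3499 | 0.3297 | 0.3301 | 0.2939 |
| | | 33.03% | 31.33% | 22.94% | 22.88% | 17.58% | 16.00% | 10.85% | 10.96% | Improvement |
| | 60% | 0.3965 | 0.4087 | 0.3779 | 0.3911 | 0.3681 | 0.3599 | 0.3387 | 0.3434 | 0.3047 |
| | | 23.15% | 25.44% | 19.37% | 22.09% | 17.22% | 15.33% | 10.03% | 11.26% | Improvement |
| Movielens | 80% | 0.8399 | 0.8467 | 0.7989 | 0.8106 | 0.774 | 0.7801 | 0.7605 | 0.7788 | 0.7658 |
| | | 8.82% | 9.55% | 4.14% | 5.52% | 1.05% | 1.83% | 0.69% | 1.66% | Improvement |
| | 60% | 0.9478 | 0.9667 | 0.9011 | 0.9096 | 0.882 | 0.8912 | 0.8399 | 0.8360 | 0.8365 |
| | | 11.74% | 13.46% | 7.16% | 8.03% | 5.15% | 6.13% | 0.40% | 0.05% | Improvement |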
The results show that our method is more effective than the PMF method in providing accurate recommendations for both users and items with few ratings (from 0 to about 100 on the X-axis). Also, the experiments show that the more data matrices we add, the more accurate the recommendations are. However, for clarity, we only plot the results obtained for our method while using 5 and 10 data matrices. Finally, we also noticed that the performance is better when predicting ratings for items that have few ratings than for users who rated few items. A possible explanation is that users’ preferences change over time, which increases the error margin.
VI-F Performance comparison
To demonstrate the performance behavior of our algorithm, we compared it with eight other state-of-the-art algorithms: User Mean, which uses the mean value of every user; Item Mean, which uses the mean value of every item; NMF [Lee1999]; PMF [Salakhutdinov2007]; SoRec [Ma2008]; PTBR [Guy2010]; MatchBox [Stern2009]; and HeteroMF [Jamali2013]. The results of the comparison are shown in Table II. The optimal parameters of each method are selected, and we report the final performance on the test set. The percentages in Table II are the improvement rates of our method over the corresponding approaches.
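The improvement rates can be recomputed from the RMSE values. The sketch below assumes that the rate is the relative RMSE reduction, (baseline - DPMF) / baseline; the paper does not spell the formula out, but this definition reproduces the percentages reported for Delicious with 80% of training data to within rounding. The dictionary simply copies that row of Table II.

```python
def improvement_rate(baseline_rmse, dpmf_rmse):
    """Relative RMSE reduction of DPMF over a baseline, in percent (assumed definition)."""
    return 100.0 * (baseline_rmse - dpmf_rmse) / baseline_rmse

# RMSE values copied from Table II (Delicious, 80% training).
DELICIOUS_80 = {
    "U. Mean": 0.4389, "I. Mean": 0.4280, "NMF": 0.3814, "PMF": 0.3811,
    "SoRec": 0.3566, "PTBR": 0.3499, "Matchbox": 0.3297, "HeteroMF": 0.3301,
}
DPMF_DELICIOUS_80 = 0.2939

rates = {name: improvement_rate(value, DPMF_DELICIOUS_80)
         for name, value in DELICIOUS_80.items()}

for name, rate in rates.items():
    print(f"{name:>9}: {rate:6.2f}%")
```

Under this definition a baseline with a lower RMSE than DPMF yields a negative rate, so the sign of the result indicates which of the two methods performs better.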
First, from the results, we see that our method outperforms almost all the other approaches in all the settings of both datasets. Second, only Matchbox and HeteroMF slightly outperform our method on the Movielens dataset. Third, the RMSE values generated by all the methods on the Delicious dataset are lower than those on the Movielens dataset. This is due to the fact that the rating scale is different between the two datasets. Fourth, the margin by which our method outperforms the other methods is larger on the Delicious dataset than on the Movielens dataset (10% to 33% on Delicious and 0.05% to 11% on Movielens). Possible reasons are that: (1) the Movielens dataset contains less data (fewer users and fewer items), (2) there are fewer data matrices to add in the Movielens dataset, and (3) the data matrices of the Delicious dataset are of higher quality. Lastly, even if we use several data matrices in our method, using 80% of training data still provides more accurate predictions than using 60% of training data. We explain this by the fact that the data of the user-item matrix are the main resources to train an effective recommendation model. Clearly, an external source of data cannot replace the user-item rating matrix, but can be used to enhance it.
VII Related work
Enhanced recommendation: Many researchers have started exploring social relations to improve recommender systems (including implicit social information, which can be employed to improve traditional recommendation methods [Ma2013]), essentially to tackle the cold-start problem [Lin2013][Ma2008][Sedhain2014]. However, as pointed out in [Sedhain2013], only a small subset of user interactions and activities are actually useful for social recommendation.
In collaborative filtering-based approaches, Liu and Lee [Liu2010] proposed very simple heuristics to increase recommendation effectiveness by incorporating social network information. Guy et al. [Guy2009] proposed a ranking function of items based on social relationships. This ranking function has been further improved in [Guy2010] to include social content such as terms related to the user. More recently, Wang et al. [Wang2017] proposed a novel method for interactive social recommendation, which not only simultaneously explores user preferences and exploits the effectiveness of personalization in an interactive way, but also adaptively learns different weights for different friends. Also, Xiao et al. [Xiao2017] proposed a novel user preference model for recommender systems that considers the visibility of both items and social relationships.
In the context of matrix factorization, following the intuition that a person’s social network will affect her behaviors on the Web, Ma et al. [Ma2008] propose to factorize both the users’ social network and the rating records matrices. The main idea is to fuse the user-item matrix with the users’ social trust networks by sharing a common latent low-dimensional user feature matrix. This approach has been improved in [Ma2009] by taking into account only trusted friends for recommendation while sharing the user latent dimensional matrix. A similar approach has been proposed in [Jamali2010] and [Yang2012], which include in the factorization process trust propagation and trust propagation with inferred circles of friends in social networks, respectively. In this same context, other approaches have been proposed to consider social regularization terms while factorizing the rating matrix. The idea is to handle friends with dissimilar tastes differently in order to represent the taste diversity of each user’s friends [Ma2011][Noel2012]. A number of these methods are reviewed, analyzed and compared in [Yang2014].
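The idea of sharing a common latent user feature matrix between two factorizations can be sketched as follows. This is a simplified illustration under stated assumptions and not the exact model of any cited work: it uses a plain squared loss with L2 regularization and stochastic gradient updates, and the names `cofactorize`, `sgd_pass`, the hyperparameters and the toy data are all hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_error(entries, left, right):
    """Sum of squared errors of left[a] . right[b] over observed (a, b, value) entries."""
    return sum((v - dot(left[a], right[b])) ** 2 for a, b, v in entries)


def sgd_pass(entries, left, right, lr, reg):
    """One stochastic gradient pass over the observed entries, in place."""
    for a, b, v in entries:
        err = v - dot(left[a], right[b])
        for f in range(len(left[a])):
            lf, rf = left[a][f], right[b][f]
            left[a][f] += lr * (err * rf - reg * lf)
            right[b][f] += lr * (err * lf - reg * rf)


def cofactorize(ratings, links, n_users, n_items, rank=2,
                lr=0.02, reg=0.01, epochs=300, seed=0):
    """Factorize ratings ~ U V^T and social links ~ U Z^T with the same U."""
    rng = random.Random(seed)

    def init(n):
        return [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n)]

    U, V, Z = init(n_users), init(n_items), init(n_users)
    for _ in range(epochs):
        sgd_pass(ratings, U, V, lr, reg)  # user-item ratings update U and V
        sgd_pass(links, U, Z, lr, reg)    # social links update the same U, and Z
    return U, V, Z


# Hypothetical toy data: (user, item, rating) and (user, trusted user, weight).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 2, 1.0)]
links = [(0, 1, 1.0), (1, 0, 1.0), (2, 0, 1.0)]

U0, V0, Z0 = cofactorize(ratings, links, 3, 3, epochs=0)  # initial factors only
U, V, Z = cofactorize(ratings, links, 3, 3)
before = squared_error(ratings, U0, V0) + squared_error(links, U0, Z0)
after = squared_error(ratings, U, V) + squared_error(links, U, Z)
print(before, after)
```

Because both passes update the same `U`, a user with few ratings still receives gradient signal from the social links, which is the intuition behind using such side information against data sparsity.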
Also, a few works consider cross-domain recommendation, where a user’s acquired knowledge in a particular domain can be transferred and exploited in several other domains, or where joint, personalized recommendations of items in multiple domains are offered, e.g., suggesting not only a particular movie, but also music CDs, books or video games somehow related to that movie. Based on the type of relations between the domains, Fernández-Tobías et al. [fernandez2012cross] propose to categorize cross-domain recommendation as: (i) content-based relations (common items between domains) [Abel2012], and (ii) collaborative filtering-based relations (common users between domains) [Winoto2008][Berkovsky2007]. However, almost none of these algorithms is distributed.
Distributed recommendation: Several decentralized recommendation solutions have been proposed, mainly from a peer-to-peer perspective, basically for collaborative filtering [Kermarrec2012] and for search and recommendation [Draidi2011]. The goal of these solutions is to decentralize the recommendation process.
Other works have investigated distributed recommendation algorithms to tackle the problem of scalability. Liu et al. [Liu2010] provide a multiplicative-update method. This approach is also applied to squared loss and to nonnegative matrix factorization with an “exponential” loss function. Each of these algorithms in essence takes an embarrassingly parallel matrix factorization algorithm developed previously and directly distributes it across a MapReduce cluster. Gemulla et al. [Gemulla2011] provide a novel algorithm to approximately factor large matrices with millions of rows, millions of columns, and billions of nonzero elements. The approach depends on a variant of Stochastic Gradient Descent (SGD), an iterative stochastic optimization algorithm. Gupta et al. [Gupta1997] describe scalable parallel algorithms for sparse matrix factorization and analyze their performance and scalability. Finally, Yu et al. [Yu2012] use coordinate descent, a classical optimization approach, for a parallel scalable implementation of matrix factorization for recommender systems. More recently, Shin et al. [Shin2017] proposed two distributed tensor factorization methods, CDTF and SALS. Both methods scale with all aspects of the data and show a trade-off between convergence speed and memory requirements.
However, note that almost all the works described above focus mainly on decentralizing and parallelizing the matrix factorization computation. To the best of our knowledge, none of the existing distributed solutions proposes a distributed recommendation approach using diverse data sources.
VIII Conclusion
In this paper, we proposed a new distributed collaborative filtering algorithm, which uses and combines multiple and diverse data matrices provided by online services to improve recommendation quality. Our algorithm is based on the factorization of matrices and the sharing of common latent features with the recommender system. This algorithm has been designed for a distributed setting, where the objective is to avoid sending the raw data and to parallelize the matrix computation. All the algorithms presented have been evaluated using two different datasets, Delicious and Movielens. The results show the effectiveness of our approach. Our method consistently outperforms almost all the state-of-the-art approaches in all the settings of both datasets. Only Matchbox and HeteroMF slightly outperform our method on the Movielens dataset.
References
Appendix A Gradient-Based Optimization
We seek to optimize the sum of the objective function in Equation 12, and we use gradient descent for this purpose (see the Gradient Descent algorithm). The computation contains parts that may be computed separately and in parallel with no dependency. These parts are computed by Algorithms 1, 2, and 3. The algebraic operator used is the one given in Definition 1.
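The general pattern of gradient descent on a sum of terms whose gradients have no dependency on each other can be sketched as follows. This is only an illustration under stated assumptions: the objective below is a toy weighted sum of squared distances, not the objective of Equation 12, and the names `partial_gradient`, `gradient_descent` and `sources` are hypothetical. Python threads stand in for the separate workers here; they show the independence of the parts rather than an actual speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_gradient(x, source):
    """Gradient of one term f_k(x) = w_k * ||x - t_k||^2.
    It depends only on the shared x and on its own source (w_k, t_k)."""
    weight, target = source
    return [2.0 * weight * (xi - ti) for xi, ti in zip(x, target)]


def gradient_descent(sources, dim, lr=0.1, steps=200):
    """Minimize sum_k f_k(x) by summing independently computed partial gradients."""
    x = [0.0] * dim
    with ThreadPoolExecutor() as pool:
        for _ in range(steps):
            # Each part is computed separately, with no dependency on the others.
            parts = list(pool.map(lambda s: partial_gradient(x, s), sources))
            # The full gradient is the sum of the parts.
            grad = [sum(p[d] for p in parts) for d in range(dim)]
            x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x


# Hypothetical (weight, target) pairs; the minimizer is the weighted mean of the targets.
sources = [(1.0, [1.0, 2.0]), (0.5, [3.0, 0.0]), (0.5, [5.0, 4.0])]
x_star = gradient_descent(sources, dim=2)
print(x_star)
```

For this toy objective the minimizer is known in closed form (the weighted mean of the targets), which makes it easy to check that the summed partial gradients drive the iterate to the right point.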