Deep Learning for Recommender Systems
Collaborative filtering (CF) is a successful approach commonly used by many recommender systems. Conventional CF-based methods use the ratings given to items by users as the sole source of information for learning to make recommendations. However, the ratings are often very sparse in many applications, causing CF-based methods to degrade significantly in their recommendation performance. To address this sparsity problem, auxiliary information such as item content information may be utilized. Collaborative topic regression (CTR) is an appealing recent method taking this approach which tightly couples the two components that learn from two different sources of information. Nevertheless, the latent representation learned by CTR may not be very effective when the auxiliary information is very sparse. To address this problem, we generalize recent advances in deep learning from i.i.d. input to non-i.i.d. (CF-based) input and propose in this paper a hierarchical Bayesian model called collaborative deep learning (CDL), which jointly performs deep representation learning for the content information and collaborative filtering for the ratings (feedback) matrix. Extensive experiments on three real-world datasets from different domains show that CDL can significantly advance the state of the art.
Due to the abundance of choice in many online services, recommender systems (RS) now play an increasingly significant role [40]. For individuals, using RS allows us to make more effective use of information. Besides, many companies (e.g., Amazon and Netflix) have been using RS extensively to target their customers by recommending products or services. Existing methods for RS can roughly be categorized into three classes [6]: content-based methods, collaborative filtering (CF) based methods, and hybrid methods. Content-based methods [17] make use of user profiles or product descriptions for recommendation. CF-based methods [23, 27] use the past activities or preferences, such as user ratings on items, without using user or product content information. Hybrid methods [1, 18, 12] seek to get the best of both worlds by combining content-based and CF-based methods.
Because of privacy concerns, it is generally more difficult to collect user profiles than past activities. Nevertheless, CF-based methods do have their limitations. The prediction accuracy often drops significantly when the ratings are very sparse. Moreover, they cannot be used for recommending new products which have yet to receive rating information from users. Consequently, it is inevitable for CF-based methods to exploit auxiliary information and hence hybrid methods have gained popularity in recent years.
According to whether two-way interaction exists between the rating information and auxiliary information, we may further divide hybrid methods into two subcategories: loosely coupled and tightly coupled methods. Loosely coupled methods like [29] process the auxiliary information once and then use it to provide features for the CF models. Since information flow is one-way, the rating information cannot provide feedback to guide the extraction of useful features. For this subcategory, improvement often has to rely on a manual and tedious feature engineering process. In contrast, tightly coupled methods like [34] allow two-way interaction. On one hand, the rating information can guide the learning of features. On the other hand, the extracted features can further improve the predictive power of the CF models (e.g., based on matrix factorization of the sparse rating matrix). With two-way interaction, tightly coupled methods can automatically learn features from the auxiliary information and naturally balance the influence of the rating and auxiliary information. This is why tightly coupled methods often outperform loosely coupled ones [35].
Collaborative topic regression (CTR) [34] is a recently proposed tightly coupled method. It is a probabilistic graphical model that seamlessly integrates a topic model, latent Dirichlet allocation (LDA) [5], and a model-based CF method, probabilistic matrix factorization (PMF) [27]. CTR is an appealing method in that it produces promising and interpretable results. Nevertheless, the latent representation learned is often not effective enough, especially when the auxiliary information is very sparse. It is this representation learning problem that we will focus on in this paper.
On the other hand, deep learning models have recently shown great potential for learning effective representations and deliver state-of-the-art performance in computer vision [38] and natural language processing [15, 26] applications. In deep learning models, features are learned in a supervised or unsupervised manner. Although they are more appealing than shallow models in that the features can be learned automatically (e.g., effective feature representation is learned from text content), they are inferior to shallow models such as CF in capturing and learning the similarity and implicit relationship between items. This calls for integrating deep learning with CF by performing deep learning collaboratively.

Unfortunately, very few attempts have been made to develop deep learning models for CF. [28] uses restricted Boltzmann machines instead of the conventional matrix factorization formulation to perform CF and [9] extends this work by incorporating user-user and item-item correlations. Although these methods involve both deep learning and CF, they actually belong to CF-based methods because they do not incorporate content information like CTR, which is crucial for accurate recommendation. [24] uses low-rank matrix factorization in the last weight layer of a deep network to significantly reduce the number of model parameters and speed up training, but it is for classification instead of recommendation tasks. On music recommendation, [21, 39] directly use conventional CNN or deep belief networks (DBN) to assist representation learning for content information, but the deep learning components of their models are deterministic without modeling the noise and hence they are less robust. The models achieve performance boost mainly by loosely coupled methods without exploiting the interaction between content information and ratings. Besides, the CNN is linked directly to the rating matrix, which means the models will perform poorly when the ratings are sparse, as shown in the following experiments.
To address the challenges above, we develop a hierarchical Bayesian model called collaborative deep learning (CDL) as a novel tightly coupled method for RS. We first present a Bayesian formulation of a deep learning model called stacked denoising autoencoder (SDAE) [32]. With this, we then present our CDL model which tightly couples deep representation learning for the content information and collaborative filtering for the ratings (feedback) matrix, allowing two-way interaction between the two. Experiments show that CDL significantly outperforms the state of the art. Note that although we present CDL as using SDAE for its feature learning component, CDL is actually a more general framework which can also admit other deep learning models such as deep Boltzmann machines [25], recurrent neural networks [10], and convolutional neural networks [16].

The main contributions of this paper are summarized below:
By performing deep learning collaboratively, CDL can simultaneously extract an effective deep feature representation from content and capture the similarity and implicit relationship between items (and users). The learned representation may also be used for tasks other than recommendation.
Besides the algorithm for attaining maximum a posteriori (MAP) estimates, we also derive a sampling-based algorithm for the Bayesian treatment of CDL, which, interestingly, turns out to be a Bayesian generalized version of back-propagation.
To the best of our knowledge, CDL is the first hierarchical Bayesian model to bridge the gap between state-of-the-art deep learning models and RS. Besides, due to its Bayesian nature, CDL can be easily extended to incorporate other auxiliary information to further boost the performance.
Extensive experiments on three real-world datasets from different domains show that CDL can significantly advance the state of the art.
Similar to the work in [34], the recommendation task considered in this paper takes implicit feedback [13] as the training and test data. The entire collection of $J$ items (articles or movies) is represented by a $J$-by-$S$ matrix $\mathbf{X}_c$, where row $j$ is the bag-of-words vector $\mathbf{X}_{c,j*}$ for item $j$ based on a vocabulary of size $S$. With $I$ users, we define an $I$-by-$J$ binary rating matrix $\mathbf{R} = [R_{ij}]_{I \times J}$. For example, in the dataset citeulike-a, $R_{ij} = 1$ if user $i$ has article $j$ in his or her personal library and $R_{ij} = 0$ otherwise. Given part of the ratings in $\mathbf{R}$ and the content information $\mathbf{X}_c$, the problem is to predict the other ratings in $\mathbf{R}$. Note that although we focus on movie recommendation (where plots of movies are considered as content information) and article recommendation like [34] in this paper, our model is general enough to handle other recommendation tasks (e.g., tag recommendation).

The matrix $\mathbf{X}_c$ plays the role of clean input to the SDAE while the noise-corrupted matrix, also a $J$-by-$S$ matrix, is denoted by $\mathbf{X}_0$. The output of layer $l$ of the SDAE is denoted by $\mathbf{X}_l$, which is a $J$-by-$K_l$ matrix. Similar to $\mathbf{X}_c$, row $j$ of $\mathbf{X}_l$ is denoted by $\mathbf{X}_{l,j*}$. $\mathbf{W}_l$ and $\mathbf{b}_l$ are the weight matrix and bias vector, respectively, of layer $l$, $\mathbf{W}_{l,*n}$ denotes column $n$ of $\mathbf{W}_l$, and $L$ is the number of layers. For convenience, we use $\mathbf{W}^+$ to denote the collection of all layers of weight matrices and biases. Note that an $L/2$-layer SDAE corresponds to an $L$-layer network.

We are now ready to present details of our CDL model. We first briefly review SDAE and give a Bayesian formulation of SDAE. This is then followed by the presentation of CDL as a hierarchical Bayesian model which tightly integrates the ratings and content information.
SDAE [32] is a feedforward neural network for learning representations (encoding) of the input data by learning to predict the clean input itself in the output, as shown in Figure 2. Usually the hidden layer in the middle, i.e., $\mathbf{X}_2$ in the figure, is constrained to be a bottleneck and the input layer $\mathbf{X}_0$ is a corrupted version of the clean input data. An SDAE solves the following optimization problem:

$$\min_{\{\mathbf{W}_l\}, \{\mathbf{b}_l\}} \; \|\mathbf{X}_c - \mathbf{X}_L\|_F^2 + \lambda \sum_l \|\mathbf{W}_l\|_F^2,$$

where $\lambda$ is a regularization parameter and $\|\cdot\|_F$ denotes the Frobenius norm.
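The SDAE objective, a squared reconstruction error between the clean input and the network output plus a weighted Frobenius-norm penalty on the weights, can be evaluated in a few lines of NumPy. The following is a minimal sketch and not the authors' implementation: the function names, the toy sizes, the corruption probability of 0.3, and the value of `lam` are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sdae_forward(X0, weights, biases):
    """Propagate the corrupted input X0 and return the output of every layer."""
    outputs = []
    X = X0
    for W, b in zip(weights, biases):
        X = sigmoid(X @ W + b)
        outputs.append(X)
    return outputs

def sdae_objective(Xc, X0, weights, biases, lam):
    """Reconstruction error against the clean input plus weight decay."""
    XL = sdae_forward(X0, weights, biases)[-1]
    reconstruction = np.sum((Xc - XL) ** 2)
    weight_decay = lam * sum(np.sum(W ** 2) for W in weights)
    return reconstruction + weight_decay

# Toy example: 5 items, vocabulary of 8 words, bottleneck of 3 units (all hypothetical).
rng = np.random.default_rng(0)
J, S, K = 5, 8, 3
Xc = rng.random((J, S))                 # clean input
X0 = Xc * (rng.random((J, S)) >= 0.3)   # masking noise: zero out entries at random
weights = [rng.normal(scale=0.1, size=(S, K)), rng.normal(scale=0.1, size=(K, S))]
biases = [np.zeros(K), np.zeros(S)]
loss = sdae_objective(Xc, X0, weights, biases, lam=0.01)
print(loss)
```

Note that the network is fed the corrupted matrix but scored against the clean one, which is what makes the autoencoder "denoising".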
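If we assume that both the clean input $\mathbf{X}_c$ and the corrupted input $\mathbf{X}_0$ are observed, similar to [4, 19, 3, 7], we can define the following generative process:

1. For each layer $l$ of the SDAE network,
   (a) For each column $n$ of the weight matrix $\mathbf{W}_l$, draw $\mathbf{W}_{l,*n} \sim \mathcal{N}(\mathbf{0}, \lambda_w^{-1}\mathbf{I})$.
   (b) Draw the bias vector $\mathbf{b}_l \sim \mathcal{N}(\mathbf{0}, \lambda_w^{-1}\mathbf{I})$.
   (c) For each row $j$ of $\mathbf{X}_l$, draw

   $$\mathbf{X}_{l,j*} \sim \mathcal{N}\big(\sigma(\mathbf{X}_{l-1,j*}\mathbf{W}_l + \mathbf{b}_l),\; \lambda_s^{-1}\mathbf{I}\big). \qquad (1)$$

2. For each item $j$, draw a clean input $\mathbf{X}_{c,j*} \sim \mathcal{N}(\mathbf{X}_{L,j*}, \lambda_n^{-1}\mathbf{I})$.

Footnote 1: Note that while generation of the clean input $\mathbf{X}_c$ from $\mathbf{X}_L$ is part of the generative process of the Bayesian SDAE, generation of the noise-corrupted input $\mathbf{X}_0$ from $\mathbf{X}_c$ is an artificial noise injection process to help the SDAE learn a more robust feature representation.

Note that if $\lambda_s$ goes to infinity, the Gaussian distribution in Equation (1) will become a Dirac delta distribution [31] centered at $\sigma(\mathbf{X}_{l-1,j*}\mathbf{W}_l + \mathbf{b}_l)$, where $\sigma(\cdot)$ is the sigmoid function. The model will degenerate to be a Bayesian formulation of SDAE. That is why we call it generalized SDAE.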
Note that the first $L/2$ layers of the network act as an encoder and the last $L/2$ layers act as a decoder. Maximization of the posterior probability is equivalent to minimization of the reconstruction error with weight decay taken into consideration.
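Using the Bayesian SDAE as a component, the generative process of CDL is defined as follows:

1. For each layer $l$ of the SDAE network,
   (a) For each column $n$ of the weight matrix $\mathbf{W}_l$, draw $\mathbf{W}_{l,*n} \sim \mathcal{N}(\mathbf{0}, \lambda_w^{-1}\mathbf{I})$.
   (b) Draw the bias vector $\mathbf{b}_l \sim \mathcal{N}(\mathbf{0}, \lambda_w^{-1}\mathbf{I})$.
   (c) For each row $j$ of $\mathbf{X}_l$, draw $\mathbf{X}_{l,j*} \sim \mathcal{N}\big(\sigma(\mathbf{X}_{l-1,j*}\mathbf{W}_l + \mathbf{b}_l),\; \lambda_s^{-1}\mathbf{I}\big)$.
2. For each item $j$,
   (a) Draw a clean input $\mathbf{X}_{c,j*} \sim \mathcal{N}(\mathbf{X}_{L,j*}, \lambda_n^{-1}\mathbf{I})$.
   (b) Draw a latent item offset vector $\boldsymbol{\epsilon}_j \sim \mathcal{N}(\mathbf{0}, \lambda_v^{-1}\mathbf{I}_K)$ and then set the latent item vector to be: $\mathbf{v}_j = \boldsymbol{\epsilon}_j + \mathbf{X}_{\frac{L}{2},j*}^T$.
3. Draw a latent user vector for each user $i$: $\mathbf{u}_i \sim \mathcal{N}(\mathbf{0}, \lambda_u^{-1}\mathbf{I}_K)$.
4. Draw a rating $R_{ij}$ for each user-item pair $(i, j)$: $R_{ij} \sim \mathcal{N}(\mathbf{u}_i^T\mathbf{v}_j, C_{ij}^{-1})$.

Here $\lambda_w$, $\lambda_n$, $\lambda_u$, $\lambda_s$, and $\lambda_v$ are hyperparameters and $C_{ij}$ is a confidence parameter similar to that for CTR ($C_{ij} = a$ if $R_{ij} = 1$ and $C_{ij} = b$ otherwise). Note that the middle layer $\mathbf{X}_{L/2}$ serves as a bridge between the ratings and content information. This middle layer, along with the latent offset $\boldsymbol{\epsilon}_j$, is the key that enables CDL to simultaneously learn an effective feature representation and capture the similarity and (implicit) relationship between items (and users). Similar to the generalized SDAE, for computational efficiency, we can also take $\lambda_s$ to infinity.

The graphical model of CDL when $\lambda_s$ approaches positive infinity is shown in Figure 1, where, for notational simplicity, we use $\mathbf{x}_0$, $\mathbf{x}_{L/2}$, and $\mathbf{x}_L$ in place of $\mathbf{X}_{0,j*}^T$, $\mathbf{X}_{\frac{L}{2},j*}^T$, and $\mathbf{X}_{L,j*}^T$, respectively.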
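Based on the CDL model above, all parameters could be treated as random variables so that fully Bayesian methods such as Markov chain Monte Carlo (MCMC) or variational approximation methods [14] may be applied. However, such treatment typically incurs high computational cost. Besides, since CTR is our primary baseline for comparison, it would be fair and reasonable to take an approach analogous to that used in CTR. Consequently, we devise below an EM-style algorithm for obtaining the MAP estimates, as in [34].

As in CTR, maximizing the posterior probability is equivalent to maximizing the joint log-likelihood of $\mathbf{U}$, $\mathbf{V}$, $\{\mathbf{X}_l\}$, $\mathbf{X}_c$, $\{\mathbf{W}_l\}$, $\{\mathbf{b}_l\}$, and $\mathbf{R}$ given $\lambda_u$, $\lambda_v$, $\lambda_w$, $\lambda_s$, and $\lambda_n$:

$$\begin{aligned}
\mathcal{L} = {} & -\frac{\lambda_u}{2}\sum_i \|\mathbf{u}_i\|_2^2 - \frac{\lambda_w}{2}\sum_l \big(\|\mathbf{W}_l\|_F^2 + \|\mathbf{b}_l\|_2^2\big) \\
& - \frac{\lambda_v}{2}\sum_j \|\mathbf{v}_j - \mathbf{X}_{\frac{L}{2},j*}^T\|_2^2 - \frac{\lambda_n}{2}\sum_j \|\mathbf{X}_{L,j*} - \mathbf{X}_{c,j*}\|_2^2 \\
& - \frac{\lambda_s}{2}\sum_l\sum_j \|\sigma(\mathbf{X}_{l-1,j*}\mathbf{W}_l + \mathbf{b}_l) - \mathbf{X}_{l,j*}\|_2^2 - \sum_{i,j}\frac{C_{ij}}{2}\big(R_{ij} - \mathbf{u}_i^T\mathbf{v}_j\big)^2.
\end{aligned}$$

If $\lambda_s$ goes to infinity, the likelihood becomes:

$$\begin{aligned}
\mathcal{L} = {} & -\frac{\lambda_u}{2}\sum_i \|\mathbf{u}_i\|_2^2 - \frac{\lambda_w}{2}\sum_l \big(\|\mathbf{W}_l\|_F^2 + \|\mathbf{b}_l\|_2^2\big) \\
& - \frac{\lambda_v}{2}\sum_j \|\mathbf{v}_j - f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T\|_2^2 - \frac{\lambda_n}{2}\sum_j \|f_r(\mathbf{X}_{0,j*}, \mathbf{W}^+) - \mathbf{X}_{c,j*}\|_2^2 \\
& - \sum_{i,j}\frac{C_{ij}}{2}\big(R_{ij} - \mathbf{u}_i^T\mathbf{v}_j\big)^2, \qquad (2)
\end{aligned}$$

where the encoder function $f_e(\cdot, \mathbf{W}^+)$ takes the corrupted content vector $\mathbf{X}_{0,j*}$ of item $j$ as input and computes the encoding of the item, and the function $f_r(\cdot, \mathbf{W}^+)$ also takes $\mathbf{X}_{0,j*}$ as input, computes the encoding and then the reconstructed content vector of item $j$. For example, if the number of layers $L = 6$, $f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)$ is the output of the third layer while $f_r(\mathbf{X}_{0,j*}, \mathbf{W}^+)$ is the output of the sixth layer.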
From the perspective of optimization, the third term in the objective function in Equation (2) above is equivalent to a multi-layer perceptron using the latent item vectors $\mathbf{v}_j$ as target while the fourth term is equivalent to an SDAE minimizing the reconstruction error. Seen from the viewpoint of neural networks (NN), when $\lambda_s$ approaches positive infinity, training of the probabilistic graphical model of CDL in Figure 1 (left) would degenerate to simultaneously training two neural networks overlaid together with a common input layer (the corrupted input) but different output layers, as shown in Figure 3. Note that the second network is much more complex than typical neural networks due to the involvement of the rating matrix.

When the ratio $\lambda_n/\lambda_v$ approaches positive infinity, it will degenerate to a two-step model in which the latent representation learned using SDAE is put directly into the CTR. Another extreme happens when $\lambda_n/\lambda_v$ goes to zero, where the decoder of the SDAE essentially vanishes. On the right of Figure 1 is the graphical model of the degenerated CDL when $\lambda_n/\lambda_v$ goes to zero. As demonstrated in the experiments, the predictive performance will suffer greatly for both extreme cases.
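To make the roles of the individual terms concrete, the joint log-likelihood in the limit of an infinitely precise network (Equation (2)) can be written as a small function. This is a sketch under stated assumptions: the function name is hypothetical, the encoder output `enc` and the reconstruction `rec` are taken as precomputed arrays rather than produced by an actual network, user and item vectors are stored as rows, and every hyperparameter value in the example call is a placeholder rather than a value from the paper.

```python
import numpy as np

def cdl_log_likelihood(U, V, enc, rec, Xc, R, weights, biases,
                       lam_u, lam_v, lam_n, lam_w, a, b):
    """Joint log-likelihood of CDL, up to a constant, with the network treated as deterministic.

    U: I x K user vectors, V: J x K item vectors (rows),
    enc: J x K encoder outputs, rec: J x S reconstructions,
    Xc: J x S clean content, R: I x J binary ratings.
    """
    C = np.where(R == 1, a, b)                                  # confidence per entry
    term_u = -0.5 * lam_u * np.sum(U ** 2)                      # prior on user vectors
    term_w = -0.5 * lam_w * (sum(np.sum(W ** 2) for W in weights)
                             + sum(np.sum(bias ** 2) for bias in biases))
    term_v = -0.5 * lam_v * np.sum((V - enc) ** 2)              # item vector vs. encoding
    term_n = -0.5 * lam_n * np.sum((rec - Xc) ** 2)             # reconstruction error
    term_r = -0.5 * np.sum(C * (R - U @ V.T) ** 2)              # weighted rating error
    return term_u + term_w + term_v + term_n + term_r

# Toy check: with zero factors and a perfect reconstruction only the rating term remains.
R = np.array([[1, 0], [0, 1]])
U = np.zeros((2, 3))
V = np.zeros((2, 3))
enc = np.zeros((2, 3))
Xc = np.ones((2, 4))
rec = Xc.copy()
value = cdl_log_likelihood(U, V, enc, rec, Xc, R, [], [],
                           lam_u=0.1, lam_v=10.0, lam_n=1.0, lam_w=1e-3, a=1.0, b=0.01)
print(value)
```

Scaling `lam_n` far above `lam_v` makes the reconstruction term dominate the network's training signal, while scaling it toward zero removes the decoder's influence, which mirrors the two degenerate cases discussed above.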
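For $\mathbf{u}_i$ and $\mathbf{v}_j$, coordinate ascent similar to [34, 13] is used. Given the current $\mathbf{W}^+$, we compute the gradients of $\mathcal{L}$ with respect to $\mathbf{u}_i$ and $\mathbf{v}_j$ and set them to zero, leading to the following update rules:

$$\mathbf{u}_i \leftarrow (\mathbf{V}\mathbf{C}_i\mathbf{V}^T + \lambda_u\mathbf{I}_K)^{-1}\mathbf{V}\mathbf{C}_i\mathbf{R}_i,$$

$$\mathbf{v}_j \leftarrow (\mathbf{U}\mathbf{C}_j\mathbf{U}^T + \lambda_v\mathbf{I}_K)^{-1}\big(\mathbf{U}\mathbf{C}_j\mathbf{R}_j + \lambda_v f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T\big),$$

where $\mathbf{U} = (\mathbf{u}_i)_{i=1}^{I}$, $\mathbf{V} = (\mathbf{v}_j)_{j=1}^{J}$, $\mathbf{C}_i = \mathrm{diag}(C_{i1}, \ldots, C_{iJ})$ is a diagonal matrix, $\mathbf{R}_i = (R_{i1}, \ldots, R_{iJ})^T$ is a column vector containing all the ratings of user $i$, and $C_{ij}$ reflects the confidence controlled by $a$ and $b$ as discussed in [13].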
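The closed-form coordinate-ascent updates for the user and item vectors amount to solving one small regularized least-squares system per user and per item. The sketch below illustrates this with NumPy. It is not the authors' code: the function names are hypothetical, factor matrices are stored with one row per user or item (the transpose of the convention in the text), the encoder output `enc` is assumed to be given, and the hyperparameter values in the example are placeholders.

```python
import numpy as np

def update_users(V, R, lam_u, a, b):
    """Solve for every user vector with the item vectors V (J x K) held fixed."""
    I, J = R.shape
    K = V.shape[1]
    U = np.zeros((I, K))
    for i in range(I):
        c = np.where(R[i] == 1, a, b)              # diagonal of the confidence matrix
        A = (V.T * c) @ V + lam_u * np.eye(K)      # V C_i V^T + lam_u I
        rhs = (V.T * c) @ R[i]                     # V C_i R_i
        U[i] = np.linalg.solve(A, rhs)
    return U

def update_items(U, R, enc, lam_v, a, b):
    """Solve for every item vector with the user vectors U (I x K) held fixed.

    enc (J x K) holds the encoder output for each item, which acts as the prior mean.
    """
    I, J = R.shape
    K = U.shape[1]
    V = np.zeros((J, K))
    for j in range(J):
        c = np.where(R[:, j] == 1, a, b)
        A = (U.T * c) @ U + lam_v * np.eye(K)      # U C_j U^T + lam_v I
        rhs = (U.T * c) @ R[:, j] + lam_v * enc[j]
        V[j] = np.linalg.solve(A, rhs)
    return V

# Toy data: 4 users, 5 items, 3 latent dimensions (all hypothetical).
rng = np.random.default_rng(0)
R = (rng.random((4, 5)) < 0.4).astype(float)
enc = rng.normal(size=(5, 3))
U = update_users(enc.copy(), R, lam_u=0.1, a=1.0, b=0.01)
V = update_items(U, R, enc, lam_v=10.0, a=1.0, b=0.01)
```

Using `np.linalg.solve` avoids forming an explicit inverse. An item that no user has rated is still pulled toward its encoding by the `lam_v * enc[j]` term, which is how content information enters the factorization.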
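Given $\mathbf{U}$ and $\mathbf{V}$, we can learn the weights $\mathbf{W}_l$ and biases $\mathbf{b}_l$ for each layer using the back-propagation learning algorithm. The gradients of the likelihood with respect to $\mathbf{W}_l$ and $\mathbf{b}_l$ are as follows:

$$\begin{aligned}
\nabla_{\mathbf{W}_l}\mathcal{L} = {} & -\lambda_w\mathbf{W}_l - \lambda_v\sum_j \nabla_{\mathbf{W}_l} f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T\big(f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T - \mathbf{v}_j\big) \\
& - \lambda_n\sum_j \nabla_{\mathbf{W}_l} f_r(\mathbf{X}_{0,j*}, \mathbf{W}^+)\big(f_r(\mathbf{X}_{0,j*}, \mathbf{W}^+) - \mathbf{X}_{c,j*}\big), \\
\nabla_{\mathbf{b}_l}\mathcal{L} = {} & -\lambda_w\mathbf{b}_l - \lambda_v\sum_j \nabla_{\mathbf{b}_l} f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T\big(f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T - \mathbf{v}_j\big) \\
& - \lambda_n\sum_j \nabla_{\mathbf{b}_l} f_r(\mathbf{X}_{0,j*}, \mathbf{W}^+)\big(f_r(\mathbf{X}_{0,j*}, \mathbf{W}^+) - \mathbf{X}_{c,j*}\big).
\end{aligned}$$

By alternating the update of $\mathbf{U}$, $\mathbf{V}$, $\mathbf{W}_l$, and $\mathbf{b}_l$, we can find a local optimum for $\mathcal{L}$. Several commonly used techniques such as using a momentum term may be used to alleviate the local optimum problem. For completeness, we also provide a sampling-based algorithm for CDL in the appendix.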
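Let $D$ be the observed test data. Similar to [34], we use the point estimates of $\mathbf{u}_i$, $\mathbf{W}^+$ and $\boldsymbol{\epsilon}_j$ to calculate the predicted rating:

$$E[R_{ij} \mid D] \approx E[\mathbf{u}_i \mid D]^T\big(E[f_e(\mathbf{X}_{0,j*}, \mathbf{W}^+)^T \mid D] + E[\boldsymbol{\epsilon}_j \mid D]\big),$$

where $E[\cdot]$ denotes the expectation operation. In other words, we approximate the predicted rating as:

$$R_{ij}^* \approx (\mathbf{u}_i^*)^T\big(f_e(\mathbf{X}_{0,j*}, \mathbf{W}^{+*})^T + \boldsymbol{\epsilon}_j^*\big) = (\mathbf{u}_i^*)^T\mathbf{v}_j^*.$$

Note that for any new item $j$ with no rating in the training data, its offset $\boldsymbol{\epsilon}_j^*$ will be $\mathbf{0}$.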
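Prediction from point estimates reduces to an inner product between a user vector and an item vector, where the item vector is the item's encoding plus its learned offset. The sketch below shows this together with a simple top-M ranking. The function names are hypothetical, and skipping items already in the user's training library is an assumption about the evaluation protocol rather than something stated in the text.

```python
import numpy as np

def predict_scores(U, enc, eps):
    """Predicted ratings for all user-item pairs.

    U: I x K user vectors, enc: J x K item encodings, eps: J x K item offsets.
    An item without training ratings simply has an all-zero row in eps.
    """
    V = enc + eps
    return U @ V.T

def top_m(scores_i, seen, M):
    """Indices of the M highest-scoring items for one user, skipping seen items."""
    order = np.argsort(-scores_i)
    return [int(j) for j in order if int(j) not in seen][:M]

# Toy example: one user, three items, item 1 is new (zero offset).
U = np.array([[1.0, 0.0]])
enc = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
eps = np.zeros_like(enc)
scores = predict_scores(U, enc, eps)
recommended = top_m(scores[0], seen={2}, M=2)
print(scores, recommended)
```

Because a new item's score depends only on its encoding, content alone is enough to rank it, which is what lets the model recommend items that have no ratings yet.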
Extensive experiments are conducted on three real-world datasets from different domains to demonstrate the effectiveness of our model both quantitatively and qualitatively. (Footnote 2: Code and data are available at www.wanghao.in.)
We use three datasets from different real-world domains, two from CiteULike and one from Netflix, for our experiments. (Footnote 3: CiteULike allows users to create their own collections of articles. There are abstract, title, and tags for each article. More details about the CiteULike data can be found at http://www.citeulike.org.) The first two datasets, from [35], were collected in different ways, specifically, with different scales and different degrees of sparsity to mimic different practical situations. The first dataset, citeulike-a, is mostly from [34]. The second dataset, citeulike-t, was collected independently of the first one. They manually selected seed tags and collected all the articles with at least one of those tags. Similar to [34], users with fewer than articles are not included. As a result, citeulike-a contains users and items. For citeulike-t, the numbers are and . We can see that citeulike-t contains more users and items than citeulike-a. Also, citeulike-t is much sparser as only of its user-item matrix entries contain ratings but citeulike-a has ratings in of its user-item matrix entries.
The last dataset, Netflix, consists of two parts. The first part, with ratings and movie titles, is from the Netflix challenge dataset. The second part, with plots of the corresponding movies, was collected by us from IMDB (Footnote 4: http://www.imdb.com). Similar to [41], in order to be consistent with the implicit feedback setting of the first two datasets, we extract only positive ratings (rating ) for training and testing. After removing users with fewer than positive ratings and movies without plots, we have users, movies, and ratings in the final dataset.
We follow the same procedure as that in [34] to preprocess the text information (item content) extracted from the titles and abstracts of the articles and the plots of the movies. After removing stop words, the top $S$ discriminative words according to the tf-idf values are chosen to form the vocabulary ($S$ is , , and for the three datasets).
For each dataset, similar to [35, 36], we randomly select $P$ items associated with each user to form the training set and use all the rest of the dataset as the test set. To evaluate and compare the models under both sparse and dense settings, we set $P$ to and , respectively, in our experiments. For each value of $P$, we repeat the evaluation five times with different randomly selected training sets and the average performance is reported.
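The per-user split described above can be sketched with the standard library alone. The function name, the dictionary-of-sets input format, and the seed handling are assumptions for illustration; the text does not say how users with fewer items than the training quota are treated, so this sketch simply places all of their items in the training set.

```python
import random

def split_per_user(libraries, P, seed=0):
    """Pick P items per user for training and keep the rest for testing.

    libraries maps each user to the set of items in that user's library.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for user, items in libraries.items():
        ordered = sorted(items)                     # fixed order for reproducibility
        chosen = set(rng.sample(ordered, min(P, len(ordered))))
        train[user] = chosen
        test[user] = set(ordered) - chosen
    return train, test

# Toy libraries (hypothetical item ids).
libraries = {"u1": {1, 2, 3, 4}, "u2": {5, 6, 7}}
train, test = split_per_user(libraries, P=1, seed=0)
print(train, test)
```

Repeating the call with different `seed` values gives the independent random training sets over which the reported performance is averaged.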
As in [34, 22, 35], we use recall as the performance measure because the rating information is in the form of implicit feedback [13, 23]. Specifically, a zero entry may be due to the fact that the user is not interested in the item, or that the user is not aware of its existence. As such, precision is not a suitable performance measure. Like most recommender systems, we sort the predicted ratings of the candidate items and recommend the top $M$ items to the target user. The recall@$M$ for each user is then defined as:

$$\text{recall@}M = \frac{\text{number of items that the user likes among the top } M}{\text{total number of items that the user likes}}.$$
The final result reported is the average recall over all users.
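Recall at a cutoff is straightforward to compute once each user has a ranked list of candidate items. The following sketch uses hypothetical function names and skips users with no liked items in the test set, since the ratio is undefined for them; that handling is an assumption, not something the text specifies.

```python
def recall_at_m(ranked_items, liked_items, M):
    """Fraction of the user's liked items that appear in the top M of the ranking."""
    if not liked_items:
        return None                      # undefined when the user has no liked items
    hits = sum(1 for item in ranked_items[:M] if item in liked_items)
    return hits / len(liked_items)

def mean_recall_at_m(rankings, liked, M):
    """Average recall over all users for whom it is defined."""
    values = [recall_at_m(rankings[u], liked[u], M) for u in rankings]
    values = [v for v in values if v is not None]
    return sum(values) / len(values)

# Toy example with two users (hypothetical item ids).
rankings = {"u1": [3, 1, 2, 0], "u2": [0, 1, 2, 3]}
liked = {"u1": {1, 0}, "u2": {3}}
print(recall_at_m(rankings["u1"], liked["u1"], 2))   # 0.5
print(mean_recall_at_m(rankings, liked, 2))          # 0.25
```

The denominator counts only items the user is known to like, so unobserved entries, which may be either disliked or merely unseen, never count against the model.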
Another evaluation metric is the mean average precision (mAP). Exactly the same as [21], we set the cutoff point at for each user.

The models included in our comparison are listed as follows:
CMF: Collective Matrix Factorization [30] is a model incorporating different sources of information by simultaneously factorizing multiple matrices. In this paper, the two factorized matrices are $\mathbf{R}$ and $\mathbf{X}_c$.
SVDFeature: SVDFeature [8] is a model for feature-based collaborative filtering. In this paper we use the content information as raw features to feed into SVDFeature.
CTR: Collaborative Topic Regression [34] is a model performing topic modeling and collaborative filtering simultaneously as mentioned in the previous section.
CDL: Collaborative Deep Learning is our proposed model as described above. It allows different levels of model complexity by varying the number of layers.
In the experiments, we first use a validation set to find the optimal hyperparameters for CMF, SVDFeature, CTR, and DeepMusic. For CMF, we set the regularization hyperparameters for the latent factors of different contexts to . After the grid search, we find that CMF performs best when the weights for the rating matrix and content matrix (BOW) are both in the sparse setting. For the dense setting the weights are and , respectively. For SVDFeature, the best performance is achieved when the regularization hyperparameters for the users and items are both with the learning rate equal to . For DeepMusic, we find that the best performance is achieved using a CNN with two convolutional layers. We also try our best to tune the other hyperparameters. For CTR, we find that it can achieve good prediction performance when , , , , and (note that $a$ and $b$ determine the confidence parameters $C_{ij}$). For CDL, we directly set , , and perform grid search on the hyperparameters $\lambda_u$, $\lambda_v$, $\lambda_n$, and $\lambda_w$. For the grid search, we split the training data and use 5-fold cross validation.
Table 1: mAP for the three datasets in the sparse setting

Model        citeulike-a   citeulike-t   Netflix
CDL          0.0514        0.0453        0.0312
CTR          0.0236        0.0175        0.0223
DeepMusic    0.0159        0.0118        0.0167
CMF          0.0164        0.0104        0.0158
SVDFeature   0.0152        0.0103        0.0187
We use a masking noise with a noise level of to get the corrupted input $\mathbf{X}_0$ from the clean input $\mathbf{X}_c$. For CDL with more than one layer of SDAE ($L > 2$), we use a dropout rate [2, 33, 11] of to achieve adaptive regularization. In terms of network architecture, the number of hidden units $K_l$ is set to 200 for $l$ such that $l \neq L/2$ and $0 < l < L$. While both $K_0$ and $K_L$ are equal to the number of words in the dictionary, $K_{L/2}$ is set to $K$, which is the number of dimensions of the learned representation. For example, the 2-layer CDL model ($L = 4$) has a Bayesian SDAE of architecture ‘8000-200-50-200-8000’ for the citeulike-a dataset.
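The symmetric architecture and the masking corruption can both be expressed in a few lines. The sketch below uses hypothetical helper names. It reproduces the ‘8000-200-50-200-8000’ example for a 2-layer model, and the noise level in the usage lines is only a placeholder because the value used in the experiments is not given in this text.

```python
import random

def sdae_layer_sizes(vocab_size, hidden_units, code_size, n_sdae_layers):
    """Unit counts from input to output for an SDAE with a symmetric decoder.

    n_sdae_layers is the number of encoder layers, i.e. half the network depth.
    """
    encoder = [vocab_size] + [hidden_units] * (n_sdae_layers - 1) + [code_size]
    return encoder + encoder[-2::-1]     # mirror the encoder, excluding the bottleneck

def masking_noise(row, noise_level, rng):
    """Set each entry of a content vector to zero with probability noise_level."""
    return [0 if rng.random() < noise_level else x for x in row]

print(sdae_layer_sizes(8000, 200, 50, 2))    # [8000, 200, 50, 200, 8000]

rng = random.Random(0)
clean = [1, 0, 3, 2, 0, 1]
corrupted = masking_noise(clean, noise_level=0.5, rng=rng)   # placeholder noise level
```

The number of weight layers is one less than the length of the returned list, so a 2-layer SDAE yields a 4-layer network, consistent with the example architecture.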
Figures 4 and 5 show the results that compare CDL, CTR, DeepMusic, CMF, and SVDFeature using the three datasets under both the sparse () and dense () settings. We can see that CTR is a strong baseline which beats DeepMusic, CMF, and SVDFeature in all datasets even though DeepMusic has a deep architecture. In the sparse setting, CMF outperforms SVDFeature most of the time and sometimes even achieves performance comparable to CTR. DeepMusic performs poorly due to lack of ratings and overfitting. In the dense setting, SVDFeature is significantly better than CMF for citeulike-a and citeulike-t but is inferior to CMF for Netflix. DeepMusic is still slightly worse than CTR due to the reasons mentioned in Section 1. To focus more specifically on comparing CDL with CTR, we can see that for citeulike-a, 2-layer CDL outperforms CTR by a margin of 4.2% to 6.0% in the sparse setting and 3.3% to 4.6% in the dense setting. If we increase the number of layers to 3 ($L = 6$), the margin will go up to 5.8% to 8.0% and 4.3% to 5.8%, respectively. Similarly for citeulike-t, 2-layer CDL outperforms CTR by a margin of 10.4% to 13.1% in the sparse setting and 4.7% to 7.6% in the dense setting. When the number of layers is increased to 3, the margin will even go up to 11.0% to 14.9% and 5.2% to 8.2%, respectively. For Netflix, 2-layer CDL outperforms CTR by a margin of 1.9% to 5.9% in the sparse setting and 1.5% to 2.0% in the dense setting. As we can see, seamless and successful integration of deep learning and RS requires careful designs to avoid overfitting and achieve significant performance boost.
Table 1 shows the mAP for all models in the sparse settings. We can see that the mAP of CDL is almost or more than twice that of CTR. Tables 2 and 3 show the recall@300 results when CDL with different numbers of layers are applied to the three datasets under both the sparse and dense settings. As we can see, for citeulike-t and Netflix, the recall increases as the number of layers increases. For citeulike-a, CDL starts to overfit when it exceeds two layers. Since the standard deviation is always very small ( ), we do not include it in the figures and tables as it is not noticeable anyway.

Table 2: Recall@300 (%) in the sparse setting

#layers       1       2       3
citeulike-a   27.89   31.06   30.70
citeulike-t   32.58   34.67   35.48
Netflix       29.20   30.50   31.01
Note that the results are somewhat different for the first two datasets although they are from the same domain. This is due to the different ways in which the datasets were collected, as discussed above. Specifically, both the text information and the rating matrix in citeulike-t are much sparser. (Footnote 5: Each article in citeulike-a has words on average and that for citeulike-t is .) By seamlessly integrating deep representation learning for content information and CF for the rating matrix, CDL can handle both the sparse rating matrix and the sparse text information much better and learn a much more effective latent representation for each item and hence each user.
Figure 6 shows the results for different values of $\lambda_n$ using citeulike-t under the dense setting. We set , , and to and . Similar phenomena are observed when the number of layers and the value of $P$ are varied but they are omitted here due to space constraints. As mentioned in the previous section, when $\lambda_n$ is extremely large, $\lambda_n/\lambda_v$ will approach positive infinity so that CDL degenerates to two separate models. In this case the latent item representation will be learned by the SDAE in an unsupervised manner and then it will be put directly into (a simplified version of) the CTR. Consequently, there is no interaction between the Bayesian SDAE and the collaborative filtering component based on matrix factorization and hence the prediction performance will suffer greatly. For the other extreme when $\lambda_n$ is extremely small, $\lambda_n/\lambda_v$ will approach zero so that CDL degenerates to that in Figure 1 in which the decoder of the Bayesian SDAE component essentially vanishes. This way the encoder of the Bayesian SDAE component will easily overfit the latent item vectors learned by simple matrix factorization. As we can see in Figure 6, the prediction performance degrades significantly as $\lambda_n$ gets very large or very small. When , the recall@$M$ is already very close to (or even worse than) the result of PMF.
To gain a better insight into CDL, we first take a look at two example users in the citeulike-t dataset and represent the profile of each of them using the top three matched topics. We examine the top 10 recommended articles returned by a 3-layer ($L = 6$) CDL and CTR. The models are trained under the sparse setting (). From Table 4, we can speculate that user I might be a computer scientist with focus on tag recommendation, as clearly indicated by the first topic in CDL and the second one in CTR. CDL correctly recommends many articles on tagging systems while CTR focuses on social networks instead. When digging into the data, we find that the only rated article in the training data is ‘What drives content tagging: the case of photos on Flickr’, which is an article that talks about the impact of social networks on tagging behaviors. This may explain why CTR focuses its recommendation on social networks. On the other hand, CDL can better understand the key points of the article (i.e., tagging and CF) to make appropriate recommendations accordingly. Consequently, the precision of CDL and CTR is 70% and 10%, respectively.
Table 3: Recall@300 (%) in the dense setting

#layers       1       2       3
citeulike-a   58.35   59.43   59.31
citeulike-t   52.68   53.81   54.48
Netflix       69.26   70.40   70.42
user I (CDL)  in user’s lib?  
top 3 topics  1. search, image, query, images, queries, tagging, index, tags, searching, tag  
2. social, online, internet, communities, sharing, networking, facebook, friends, ties, participation  
3. collaborative, optimization, filtering, recommendation, contextual, planning, items, preferences  
top 10 articles  1. The structure of collaborative tagging Systems  yes 
2. Usage patterns of collaborative tagging systems  yes  
3. Folksonomy as a complex network  no  
4. HT06, tagging paper, taxonomy, Flickr, academic article, to read  yes  
5. Why do tagging systems work  yes  
6. Information retrieval in folksonomies: search and ranking  no  
7. tagging, communities, vocabulary, evolution  yes  
8. The complex dynamics of collaborative tagging  yes  
9. Improved annotation of the blogosphere via autotagging and hierarchical clustering  no  
10. Collaborative tagging as a tripartite network  yes  
user I (CTR)  in user’s lib?  
top 3 topics  1. social, online, internet, communities, sharing, networking, facebook, friends, ties, participation  
2. search, image, query, images, queries, tagging, index, tags, searching, tag  
3. feedback, event, transformation, wikipedia, indicators, vitamin, log, indirect, taxonomy  
top 10 articles  1. HT06, tagging paper, taxonomy, Flickr, academic article, to read  yes 
2. Structure and evolution of online social networks  no  
3. Group formation in large social networks: membership, growth, and evolution  no  
4. Measurement and analysis of online social networks  no  
5. A face(book) in the crowd: social searching vs. social browsing  no  
6. The strength of weak ties  no  
7. Flickr tag recommendation based on collective knowledge  no  
8. The computermediated communication network  no  
9. Social capital, selfesteem, and use of online social network sites: A longitudinal analysis  no  
10. Increasing participation in online communities: A framework for humancomputer interaction  no  
user II (CDL)  in user’s lib?  
top 3 topics  1. flow, cloud, codes, matter, boundary, lattice, particles, galaxies, fluid, galaxy  
2. mobile, membrane, wireless, sensor, mobility, lipid, traffic, infrastructure, monitoring, ad  
3. hybrid, orientation, stress, fluctuations, load, temperature, centrality, mechanical, twodimensional, heat  
top 10 articles  1. Modeling the flow of dense suspensions of deformable particles in three dimensions  yes 
2. Simplified particulate model for coarsegrained hemodynamics simulations  yes  
3. Lattice Boltzmann simulations of blood flow: nonnewtonian rheology and clotting processes  yes  
4. A genomewide association study for celiac disease identifies risk variants  yes  
5. Efficient and accurate simulations of deformable particles  yes  
6. A multiscale model of thrombus development  yes  
7. Multiphase hemodynamic simulation of pulsatile flow in a coronary artery  yes  
8. Lattice Boltzmann modeling of thrombosis in giant aneurysms  yes  
9. A lattice Boltzmann simulation of clotting in stented aneurysms  yes  
10. Predicting dynamics and rheology of blood flow  yes  
user II (CTR)  in user’s lib?  
top 3 topics  1. flow, cloud, codes, matter, boundary, lattice, particles, galaxies, fluid, galaxy  
2. transition, equations, dynamical, discrete, equation, dimensions, chaos, transitions, living, trust  
3. mobile, membrane, wireless, sensor, mobility, lipid, traffic, infrastructure, monitoring, ad  
top 10 articles  1. Multiphase hemodynamic simulation of pulsatile flow in a coronary artery  yes 
2. The metallicity evolution of starforming galaxies from redshift 0 to 3  no  
3. Formation versus destruction: the evolution of the star cluster population in galaxy mergers  no  
4. Clearing the gas from globular clusters  no  
5. Macroscopic effects of the spectral structure in turbulent flows  no  
6. The WiggleZ dark energy survey  no  
7. Lattice-Boltzmann simulation of blood flow in digitized vessel networks  no  
8. Global properties of ‘ordinary’ early-type galaxies  no  
9. Proteus: a direct forcing method in the simulations of particulate flows  yes  
10. Analysis of mechanisms for platelet near-wall excess under arterial blood flow conditions  yes 
User III  Movies in the training set: Moonstruck, True Romance, Johnny English, American Beauty, The Princess Bride, Top Gun, Double Platinum, Rising Sun, Dead Poets Society, Waiting for Guffman  
# training samples  2  4  10 
Top 10 recommended movies by CTR  Swordfish  Pulp Fiction  Best in Show 
A Fish Called Wanda  A Clockwork Orange  Chocolat  
Terminator 2  Being John Malkovich  Good Will Hunting  
A Clockwork Orange  Raising Arizona  Monty Python and the Holy Grail  
Sling Blade  Sling Blade  Being John Malkovich  
Bridget Jones’s Diary  Swordfish  Raising Arizona  
Raising Arizona  A Fish Called Wanda  The Graduate  
A Streetcar Named Desire  Saving Grace  Swordfish  
The Untouchables  The Graduate  Tootsie  
The Full Monty  Monster’s Ball  Saving Private Ryan  
# training samples  2  4  10 
Top 10 recommended movies by CDL  Snatch  Pulp Fiction  Good Will Hunting 
The Big Lebowski  Snatch  Best in Show  
Pulp Fiction  The Usual Suspects  The Big Lebowski  
Kill Bill  Kill Bill  A Few Good Men  
Raising Arizona  Memento  Monty Python and the Holy Grail  
The Big Chill  The Big Lebowski  Pulp Fiction  
Tootsie  One Flew Over the Cuckoo’s Nest  The Matrix  
Sense and Sensibility  As Good as It Gets  Chocolat  
Sling Blade  Goodfellas  The Usual Suspects  
Swingers  The Matrix  Caddyshack  

From the matched topics returned by both CDL and CTR, user II is probably a researcher working on blood flow dynamics, particularly in the field of medical science. CDL correctly captures the user profile and achieves a precision of 100%. CTR, however, recommends quite a few articles on astronomy instead. When examining the data, we find that the only rated article returned by CTR is ‘Simulating deformable particle suspensions using a coupled lattice-Boltzmann and finite-element method’. As expected, this article is on deformable particle suspension and the flow of blood cells. CTR might have misinterpreted this article, basing its recommendations on words like ‘flows’ and ‘formation’ taken separately. This explains why CTR recommends articles like ‘Formation versus destruction: the evolution of the star cluster population in galaxy mergers’ (formation) and ‘Macroscopic effects of the spectral structure in turbulent flows’ (flows). As a result, its precision is only 30%.
From these two users, we can see that with a more effective representation, CDL can capture the key points of articles and the user preferences more accurately (e.g., user I). It can also better model the co-occurrence of words and the relations between them (e.g., user II).
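The precision figures quoted for user II can be checked directly against the ‘in user’s lib?’ column of the two recommendation lists. The following is a minimal sketch, assuming precision is simply the fraction of the top 10 recommended articles that appear in the user's library; the function name `precision_at_k` and the flag lists are illustrative and are transcribed from the table rather than taken from the authors' code.

```python
def precision_at_k(in_library, k=10):
    """Fraction of the top-k recommendations that are in the user's library.

    in_library: list of booleans, one per recommended item, in ranked order.
    """
    top_k = in_library[:k]
    if not top_k:
        raise ValueError("need at least one recommendation")
    return sum(top_k) / len(top_k)


# Flags transcribed from the "in user's lib?" column for user II.
cdl_flags = [True] * 10  # all ten CDL recommendations are marked "yes"
ctr_flags = [True, False, False, False, False,
             False, False, False, True, True]  # items 1, 9 and 10 are "yes"

cdl_precision = precision_at_k(cdl_flags)
ctr_precision = precision_at_k(ctr_flags)
print(f"CDL: {cdl_precision:.0%}, CTR: {ctr_precision:.0%}")
# prints "CDL: 100%, CTR: 30%"
```

The same function applies to the movie case study if each recommended title is flagged by whether it appears in the held-out ratings of the user.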
We next present another case study, on the Netflix dataset under the dense setting ($P = 10$). In this case study, we choose one user (user III) and vary the number of ratings (positive feedback) given by that user in the training set from 1 to 10. The partition of training and test data remains the same for all other users. The goal is to examine how the recommendations of CTR and CDL adapt as user III expresses a preference for more and more movies. Table 5 shows the recommendation lists of CTR and CDL when the number of training samples is set to 2, 4, and 10. When there are only two training samples, the two movies user III likes are ‘Moonstruck’ and ‘True Romance’, both of which are romance movies. At this point the precision of CTR and CDL is close (20% and 30%, respectively). When two more samples are added, the precision of CDL rises to 50% while that of CTR remains unchanged (20%). This is because the two new movies, ‘Johnny English’ and ‘American Beauty’, are action and drama movies. CDL captures the user’s change of taste and gets two more recommendations right, whereas CTR fails to do so. Similar phenomena can be observed when the number of training samples increases from 4 to 10. From this case study, we can see that CDL is sensitive enough to changes in user taste and hence can provide more accurate recommendations.
Following the update rules in this paper, the computational complexity of updating $u_i$ is $O(K^2 J + K^3)$, where $K$ is the dimensionality of the learned representation and $J$ is the number of items. The complexity for $v_j$ is $O(K^2 I + K^3 + S K_1)$, where $I$ is the number of users, $S$ is the size of the vocabulary, and $K_1$ is the dimensionality of the output in the first layer. Note that the third term, $O(S K_1)$, is the cost of computing the output of the encoder, which is dominated by the computation of the first layer. For the update of all the weights and biases, the complexity is $O(J S K_1)$, since the computation is dominated by the first layer. Thus for a complete epoch the total time complexity is $O(J S K_1 + K^2 J^2 + K^2 I^2 + K^3)$.
All our experiments are conducted on servers with Intel E5-2650 CPUs and NVIDIA Tesla M2090 GPUs each. Using the MATLAB implementation with GPU/C++ acceleration, each epoch takes only about seconds and each run takes epochs for the first two datasets. For Netflix it takes about seconds per epoch and needs far fewer epochs (about ) to reach satisfactory recommendation performance. Since Netflix is much larger than the other two datasets, this shows that CDL is very scalable. We expect that changing the implementation to a pure C++/CUDA one would significantly reduce the time cost.
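To make the per-user cost in the complexity analysis more concrete, the sketch below shows one user-vector update in NumPy. It assumes the standard CTR-style closed-form update, in which the user vector solves a K-by-K regularized least-squares system built from the item latent matrix and per-item confidence weights. The function name `update_user_vector`, the array layout, and the confidence values 1.0 and 0.01 are illustrative assumptions, not the authors' MATLAB implementation.

```python
import numpy as np


def update_user_vector(V, c_i, r_i, lambda_u):
    """One closed-form update of a single user's latent vector (a sketch).

    V        : (K, J) array, one latent column per item (assumed layout)
    c_i      : (J,) confidence weights for this user's ratings
    r_i      : (J,) this user's ratings (1 = positive feedback, 0 = none)
    lambda_u : scalar regularization strength
    """
    K = V.shape[0]
    # Scaling column j of V by c_i[j] and multiplying by V.T builds a
    # K x K matrix; this product costs on the order of K^2 * J operations.
    A = (V * c_i) @ V.T + lambda_u * np.eye(K)
    # Right-hand side: on the order of K * J operations.
    b = V @ (c_i * r_i)
    # Solving the K x K linear system costs on the order of K^3 operations.
    return np.linalg.solve(A, b)


# Toy usage with random item vectors (values are purely illustrative).
rng = np.random.default_rng(0)
K, J = 3, 5
V = rng.normal(size=(K, J))
r_i = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
c_i = np.where(r_i > 0, 1.0, 0.01)  # higher confidence for observed feedback
u_i = update_user_vector(V, c_i, r_i, lambda_u=0.1)
print(u_i.shape)  # (3,)
```

Using `np.linalg.solve` rather than an explicit matrix inverse is a numerical-stability choice of this sketch; it does not change the cubic dependence on K.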
We have demonstrated in this paper that state-of-the-art performance can be achieved by jointly performing deep representation learning for the content information and collaborative filtering for the ratings (feedback) matrix. As far as we know, CDL is the first hierarchical Bayesian model to bridge the gap between state-of-the-art deep learning models and RS. In terms of learning, besides the algorithm for attaining the MAP estimates, we also derive a sampling-based algorithm for the Bayesian treatment of CDL as a Bayesian generalized version of backpropagation.
Among the possible extensions to CDL, the bag-of-words representation may be replaced by more powerful alternatives, such as [20]. The Bayesian nature of CDL also offers a potential performance boost if other side information is incorporated as in [37]. Besides, as remarked above, CDL actually provides a framework that can also admit deep learning models other than SDAE. One promising choice is the convolutional neural network model which, among other things, can explicitly take the context and order of words into account. A further performance boost may be possible when using such deep learning models.
This research has been partially supported by research grant FSGRF14EG36.
A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
Automatic tag expansion using visual similarity for photo sharing websites. Multimedia Tools Appl., 49(1):81–99, 2010.
A Guide to Distribution Theory and Fourier Transforms. World Scientific, 2003.
Temporal QoS-aware web service recommendation via non-negative tensor factorization. In WWW, pages 585–596, 2014.
For completeness we also derive a sampling-based algorithm for the Bayesian treatment of CDL. It turns out to be a Bayesian and generalized version of the well-known backpropagation (BP) learning algorithm. Due to space constraints we only list the results here without detailed derivation.
For : We denote the concatenation of and as . Similarly, the concatenation of and is denoted as . The subscripts of are ignored. Then
For (): Similarly, we denote the concatenation of and as and have
Note that for the last layer () the second Gaussian would be instead.
For (): Similarly, we have
For : The posterior
For : The posterior
Interestingly, if goes to infinity and adaptive rejection Metropolis sampling (which involves using the gradients of the objective function to approximate the proposal distribution) is used, the sampling for turns out to be a Bayesian generalized version of BP. Specifically, as Figure 7 shows, after getting the gradient of the loss function at one point (the red dashed line on the left), the next sample would be drawn in the region under that line, which is equivalent to a probabilistic version of BP. If a sample is above the curve of the loss function, a new tangent line (the black dashed line on the right) would be added to better approximate the distribution corresponding to the loss function. After that, samples would be drawn from the region under both lines. During the sampling, besides searching for local optima using the gradients (MAP), the algorithm also takes the variance into consideration. That is why we call it Bayesian generalized backpropagation.