1 Introduction
Modern technology gives users access to an abundance of information, and this deluge of data makes it difficult to find what is desired. The problem is of particular concern to companies that are trying to sell products (e.g., Amazon or Walmart) or recommend movies (e.g., Netflix). To lessen the severity of information overload, recommender systems help a user find what he or she is looking for. Two commonly used classes of recommender systems are content-based filters and collaborative filters.
Content-based filters (CBF) make recommendations based on item/user descriptions and users’ ratings of the items. Creating item/user descriptions that are predictive of how a user will rate an item, however, is not a trivial process. Collaborative filtering (CF) techniques, on the other hand, use correlations between users’ ratings to infer a user’s ratings of unrated items, and so make recommendations without having to understand the item or the user. CF does not depend on item descriptions and tends to produce higher accuracies than CBF. However, CF suffers from the cold-start problem, which occurs when an item cannot be recommended unless it has been rated before (the first-rater problem) or when a user has not rated any items (the new-user problem). This is particularly important in domains where new items are frequently added and users are more interested in the new items. For example, many users are more interested in, and more likely to purchase, new styles of shoes than outdated styles, and many users are more interested in watching newly released movies than older movies. Recommending old items has the potential to drive away customers. In addition, making inappropriate recommendations for new users who have not built a profile can also drive away users.
One approach for addressing the cold-start problem is a hybrid recommender system that leverages the advantages of multiple recommendation systems. Developing hybrid models is a significant research direction [4, 18, 12, 20, 6, 7, 13]. Many hybrid approaches combine a content-based filter with a collaborative filter through methods such as averaging the predicted ratings or combining the top recommendations from both techniques [2]. In this paper, we present a neural network model with latent input variables (latent neural network or LNN) as a hybrid recommendation algorithm that addresses the cold-start problem. LNN uses a matrix of item ratings and item/user descriptions to simultaneously train the weights in a neural network and induce a set of latent input variables for matrix factorization. Using a neural network allows for flexible architecture configurations to model higher-order dependencies in the data.
LNN is based on the idea of generative backpropagation (GenBP) [9] and expands upon unsupervised backpropagation (UBP) [8]. Both GenBP and UBP are neural network methods that induce a set of latent input variables. The latent input variables form an internal representation of observed values. When the latent input variables are fewer than the observed variables, both methods are dimensionality reduction techniques. GenBP adjusts its latent inputs while holding the network weights constant. It has been used to generate labels for images [5] and for natural language [1]. UBP differs from GenBP in that it trains network weights simultaneously with the latent inputs, instead of training the weights as a preprocessing step. LNN is a further development of UBP that incorporates input features among the latent input variables. By incorporating user/item descriptions as input features, LNN is able to address the cold-start problem. We find that LNN outperforms other content-based filters and hybrid filters on the cold-start problem. Additionally, LNN outperforms its predecessor (UBP) and maintains an accuracy similar to matrix factorization (which cannot handle the cold-start problem) on non-cold-start recommendations.
2 Related Work
Matrix factorization (MF) has become a popular technique, in part due to its effectiveness with the data used in the Netflix competition [10, 16], and is widely considered a state-of-the-art recommendation technique. MF is a linear dimensionality reduction technique that factors the rating matrix into two much smaller matrices. These smaller matrices can then be combined to predict all of the missing ratings in the original matrix. It was previously shown that MF could be represented with a neural network model involving one hidden layer and linear activation functions [21]. By using nonlinear activation functions, unsupervised backpropagation (UBP) may be viewed as a nonlinear generalization of MF. UBP is related to nonlinear PCA (NLPCA), which was used as a means of imputing missing values (a task similar to recommending items) [19]. UBP trains in three phases: it first initializes the latent variables, then the weights of the model, and finally updates the weights and latent variables simultaneously. LNN further builds on UBP and NLPCA by integrating item or user descriptions with the latent input variables.
Pure collaborative filtering (CF) techniques are not able to handle the cold-start problem for items or users. As a result, several hybrid methods have been developed that incorporate item and/or user descriptions into collaborative filtering approaches. The most common, as surveyed by Burke [2], involve using separate CBF and CF techniques and then combining their outputs (e.g., a weighted average, merging the output from both techniques, or switching depending on the context), or using the output from one technique as input to another. Content-boosted collaborative filtering [14] uses CBF to fill in the missing values in the ratings matrix and then passes the dense ratings matrix to a collaborative filtering method (in their implementation, a neighbor-based CF). Other work addresses the cold-start problem by building user/item descriptions for later use in a recommendation system [22].
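As a concrete illustration of the matrix factorization described in this section, the following is a minimal sketch that fits two small factor matrices to the known ratings by stochastic gradient descent and then combines them to predict a missing rating. The function names, the toy ratings, and the hyperparameter values are assumptions made for illustration only; they are not taken from the implementations evaluated in this paper.

```python
import math
import random


def predict(P, Q, item, user):
    """Predicted rating: dot product of the item and user factor vectors."""
    return sum(p * q for p, q in zip(P[item], Q[user]))


def factorize(ratings, n_items, n_users, k=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Fit item factors P and user factors Q to the known ratings by SGD.

    `ratings` is a list of (item, user, value) triples; missing ratings are
    simply absent from the list.
    """
    rng = random.Random(seed)
    P = [[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(n_items)]
    Q = [[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(n_users)]
    order = list(ratings)
    for _ in range(epochs):
        rng.shuffle(order)
        for item, user, value in order:
            err = value - predict(P, Q, item, user)
            for f in range(k):
                p, q = P[item][f], Q[user][f]
                # Gradient step on the squared error with L2 regularization.
                P[item][f] += lr * (err * q - reg * p)
                Q[user][f] += lr * (err * p - reg * q)
    return P, Q


def rmse(P, Q, ratings):
    """Root-mean-squared error over a list of (item, user, value) triples."""
    total = sum((v - predict(P, Q, i, u)) ** 2 for i, u, v in ratings)
    return math.sqrt(total / len(ratings))


# Toy 3x3 rating matrix with one missing entry at (item 2, user 2).
known = [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 2.5),
         (1, 0, 2.0), (1, 1, 4.0), (1, 2, 5.0),
         (2, 0, 1.5), (2, 1, 3.0)]
P, Q = factorize(known, n_items=3, n_users=3)
print("training RMSE:", rmse(P, Q, known))
print("predicted rating for the missing entry:", predict(P, Q, 2, 2))
```

Note that an item with no known ratings never has its factor vector updated, which is the cold-start limitation of pure collaborative filtering discussed above.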
3 Latent Neural Network
In this section, we formally describe latent neural networks (LNN). At a high level, an LNN is a neural network with latent input variables induced using generative backpropagation. Put simply, generative backpropagation calculates the gradient of the error with respect to the latent inputs and updates them in a manner similar to how the weights are updated in the backpropagation algorithm.
3.1 Preliminaries
In order to formally describe LNNs, we define the following terms.

Let be a given sparse user/item rating matrix, where is the number of items and is the number of users.

Let be an matrix, representing the given portion of the item profiles.

Let be an matrix, representing the latent portion of the item profiles.

If is the rating for item by user in , then is the predicted rating when and
are concatenated into a single vector
and then fed forward into the LNN. 
Let be the weight that feeds from unit to unit in the LNN.

For each network unit on hidden layer , let be the net input into the unit, be the output or activation value of the unit, and be an error term associated with the unit.

Let be the number of hidden layers in the LNN.

Let be a vector representing the gradient with respect to the weights of the LNN, such that is the component of the gradient that is used to refine .

Let be a vector representing the gradient with respect to the latent inputs of the LNN, such that is the component of the gradient that is used to refine .
We use item descriptions, but user descriptions could easily be used instead by transposing the rating matrix and supplying user descriptions in place of item descriptions.
Because generative backpropagation, which computes the gradient with respect to the latent inputs, is less commonly used, we provide a derivation of it here. We compute each gradient component from the presentation of a single element, since we assume that the rating matrix is typically high-dimensional and sparse. It is significantly more efficient to train with the presentation of each known element individually. We begin by defining an error signal for an individual element, and then express the gradient as the partial derivative of this error signal with respect to each latent input (the non-latent inputs do not change):
(1) 
The intrinsic input affects the value of the error signal through the net value of a unit and further through the output of a unit. Using the chain rule, Equation 1 becomes:
(2)
where and represent, respectively, the output values and the net input values of the output nodes (the layer). The backpropagation algorithm calculates (which is for a network unit) as the error term associated with a network unit. Thus, to calculate , the only additional calculation to the backpropagation algorithm that needs to be made is
. For a single layer perceptron (0 hidden layers):
which is nonzero only when equals and is equal to since the error is being calculated with respect to a single element in . When there are no hidden layers () and using the error from a single element :
(3) 
If there is at least one hidden layer (), then,
where and are vectors that represent the output values and the net values for the units in the hidden layer. As part of the error term for the units in the layer, backpropagation calculates as the error term associated with each network unit. Thus, the only additional calculation for is:
As before, is nonzero only when equals . For networks with at least one hidden layer:
(4) 
Equation 4 is a strict generalization of Equation 3. Equation 3 only considers the one output unit, , for which a known target value is being presented, whereas Equation 4 sums over each unit, , into which the intrinsic value feeds.
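The derivation above can be checked numerically. The following sketch assumes a network with a single tanh hidden layer, one linear output unit, no bias terms, and the error signal E = 0.5 (x - x_hat)^2, with the error term of a unit taken as the negative derivative of E with respect to its net input; these choices, the function names, and the numeric values are illustrative assumptions rather than the configuration used in our experiments. It computes the gradient with respect to the latent inputs by summing, over the units each latent input feeds into, the connecting weight times that unit's error term, and compares the result with a finite-difference estimate.

```python
import math


def forward(q, W1, w2):
    """Return (hidden activations, prediction) for input vector q."""
    hidden = [math.tanh(sum(q[i] * W1[i][j] for i in range(len(q))))
              for j in range(len(w2))]
    return hidden, sum(h * w for h, w in zip(hidden, w2))


def error_signal(a, v, x, W1, w2):
    """E = 0.5 * (x - x_hat)^2 for description a and latent vector v."""
    _, x_hat = forward(a + v, W1, w2)
    return 0.5 * (x - x_hat) ** 2


def latent_gradient(a, v, x, W1, w2):
    """dE/dv via backpropagated error terms (negative sum of weight * delta)."""
    hidden, x_hat = forward(a + v, W1, w2)
    delta_out = x - x_hat                       # error term of the output unit
    delta_hid = [w2[j] * delta_out * (1.0 - hidden[j] ** 2)
                 for j in range(len(w2))]       # error terms of hidden units
    offset = len(a)                             # latent inputs follow the description
    return [-sum(W1[offset + i][j] * delta_hid[j] for j in range(len(w2)))
            for i in range(len(v))]


def numeric_gradient(a, v, x, W1, w2, eps=1e-6):
    """Central finite-difference estimate of dE/dv."""
    grads = []
    for i in range(len(v)):
        up = v[:i] + [v[i] + eps] + v[i + 1:]
        down = v[:i] + [v[i] - eps] + v[i + 1:]
        diff = error_signal(a, up, x, W1, w2) - error_signal(a, down, x, W1, w2)
        grads.append(diff / (2 * eps))
    return grads


a = [1.0, 0.0]                                  # given (non-latent) inputs
v = [0.3, -0.2]                                 # latent inputs
W1 = [[0.1, -0.4, 0.2], [0.5, 0.3, -0.1],       # 4 inputs x 3 hidden units
      [-0.3, 0.2, 0.4], [0.6, -0.5, 0.1]]
w2 = [0.7, -0.2, 0.5]                           # hidden-to-output weights
x = 4.0                                         # known rating being presented
analytic = latent_gradient(a, v, x, W1, w2)
numeric = numeric_gradient(a, v, x, W1, w2)
print(analytic, numeric)
```

Only the latent part of the input vector receives a gradient; the given description inputs are treated as constants, as in the derivation.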
3.2 Three-Phase Training
To integrate generative backpropagation into the training process, LNN uses three phases to train the intrinsic (latent) vectors and the network weights: 1) the first phase computes an initial estimate for the intrinsic vectors, 2) the second phase computes an initial estimate for the network weights, and 3) the third phase refines them both together. All three phases train using stochastic gradient descent. In phase 1, the intrinsic vectors are induced while there are no hidden layers to form nonlinear separations among them. Likewise, phase 2 gives the weights a chance to converge without having to train against moving inputs. These two preprocessing phases initialize the system (consisting of both intrinsic vectors and weights) to a good starting point, such that gradient descent is more likely to find a local optimum of higher quality. Empirical results show that three-phase training produces more accurate results than single-phase training, which only refines the intrinsic vectors and the weights together (see [8]).
Pseudocode for the LNN algorithm, which trains the intrinsic vectors and the weights in three phases, is given in Algorithm 1. LNN calls the train_epoch function (shown in Algorithm 2), which performs a single epoch of training. A detailed description of LNN follows.
Matrices containing the known data values and the item descriptions are passed into LNN along with the parameters (defined below). LNN returns the latent vectors and the network weights. The weights form a set or ragged matrix of weight values for an MLP that maps from each item's input vector to an approximation of that item's known ratings.
Lines 1–9 perform the first phase of training, which computes an initial estimate for the latent vectors. Lines 1–4 initialize the model variables. A temporary weight matrix represents the weights of a single-layer perceptron, and its elements and those of the latent vectors are initialized with small random values. Our implementation draws values from a Normal distribution with a mean of 0 and a deviation of 0.01. The single-layer perceptron is a temporary model that is only used in phase 1 for the initial training of the latent vectors. A learning rate is also set, along with a variable that stores the previous error score; no error has been measured when this variable is initialized. Lines 5–9 train the latent vectors and the temporary perceptron weights until convergence is detected. The temporary weights may then be discarded. We note that many techniques could be used to detect convergence. Our implementation decays the learning rate whenever predictions fail to improve by a sufficient amount. Convergence is detected when the learning rate falls below a threshold parameter. A second parameter specifies the amount of improvement that is expected after each epoch, or else the learning rate is decayed. A third parameter is the regularization term used in train_epoch.
Lines 10–17 perform the second phase of training. This phase differs from the first phase in two ways: 1) a multilayer perceptron is used instead of a temporary single-layer perceptron, and 2) the latent vectors are held constant during this phase.
Lines 18–23 perform the third phase of training. In this phase, the same multilayer perceptron that is used in phase 2 is used again, but the latent vectors and the weights are both refined together. Also, no regularization is used in the third phase.
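The convergence test used within each phase can be sketched as follows. This is a minimal illustration under stated assumptions: the names train_until_converged, min_lr, min_improvement and decay are hypothetical, the improvement test is a relative one, and the learning rate is halved on each decay; the text above only specifies that the learning rate is decayed when predictions fail to improve sufficiently and that training stops once it falls below a threshold.

```python
import math


def train_until_converged(run_epoch, lr=0.01, min_lr=1e-4,
                          min_improvement=1e-3, decay=0.5):
    """Repeat run_epoch(lr) -> error until the learning rate drops below min_lr.

    The learning rate is decayed whenever the error fails to improve by at
    least the fraction `min_improvement` of the previous error.
    Returns (number of epochs run, last measured error).
    """
    prev_error = math.inf          # no error has been measured yet
    epochs = 0
    while lr >= min_lr:
        error = run_epoch(lr)
        epochs += 1
        improved = math.isinf(prev_error) or \
            (prev_error - error) > min_improvement * prev_error
        if not improved:
            lr *= decay
        prev_error = error
    return epochs, prev_error


def make_fake_epoch(floor=0.1, shrink=0.9):
    """Stand-in for train_epoch: the error shrinks by 10% per call, then plateaus."""
    state = {"error": 1.0}

    def run_epoch(lr):
        state["error"] = max(floor, state["error"] * shrink)
        return state["error"]

    return run_epoch


epochs, final_error = train_until_converged(make_fake_epoch(), lr=0.01, min_lr=0.001)
print(epochs, final_error)
```

In a three-phase schedule, a loop of this kind would be run once per phase, with the callback refining the temporary perceptron and the latent vectors in phase 1, only the MLP weights in phase 2, and both in phase 3.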
3.3 Stochastic gradient descent
For completeness, we describe train_epoch, given in Algorithm 2, which performs a single epoch of training by stochastic gradient descent. This algorithm is very similar to an epoch of traditional backpropagation, except that it presents each element individually, instead of presenting each vector, and it conditionally refines the latent variables as well as the weights.
Line 1 presents each known element in random order. Line 2 concatenates the item's latent vector with the corresponding item description. Line 3 computes a predicted value for the presented element. Note that efficient implementations of line 3 should only propagate values into the output unit that corresponds to the presented element. Lines 4–10 compute an error term for that output unit and for each hidden unit in the network. The activation of the other output units is not computed, so the error on those units is 0. Lines 11–14 refine the weights by gradient descent. Line 15 specifies that the latent vectors should only be refined during phases 1 and 3. Lines 16–19 refine the latent vector by gradient descent. Line 22 computes the root-mean-squared error of the MLP over the known elements of the rating matrix.
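The epoch described above can be sketched in a few lines. This is an illustrative approximation rather than a transcription of Algorithm 2: it assumes one tanh hidden layer, one linear output unit per user, no bias terms, and an RMSE accumulated during the epoch; the names train_epoch arguments, small_random, and the toy data are assumptions.

```python
import math
import random


def train_epoch(known, A, V, W1, W2, lr, reg, refine_latents, rng):
    """Run one SGD epoch over the known ratings and return the epoch RMSE.

    known: list of (item, user, rating); A: item descriptions; V: latent
    vectors; W1: input-to-hidden weights; W2: hidden-to-output weights with
    one output unit (column) per user. V, W1 and W2 are updated in place.
    """
    order = list(known)
    rng.shuffle(order)                      # present elements in random order
    n_hidden = len(W2)
    squared_error = 0.0
    for item, user, rating in order:
        q = A[item] + V[item]               # description concatenated with latents
        hid = [math.tanh(sum(q[i] * W1[i][j] for i in range(len(q))))
               for j in range(n_hidden)]
        # Only the output unit of the presented user is evaluated.
        pred = sum(hid[j] * W2[j][user] for j in range(n_hidden))
        d_out = rating - pred
        d_hid = [W2[j][user] * d_out * (1.0 - hid[j] ** 2)
                 for j in range(n_hidden)]
        squared_error += d_out ** 2
        # Descent direction for the latent inputs, using pre-update weights.
        off = len(A[item])
        v_step = [sum(W1[off + i][j] * d_hid[j] for j in range(n_hidden))
                  for i in range(len(V[item]))]
        for j in range(n_hidden):           # refine the weights
            W2[j][user] += lr * (d_out * hid[j] - reg * W2[j][user])
            for i in range(len(q)):
                W1[i][j] += lr * (d_hid[j] * q[i] - reg * W1[i][j])
        if refine_latents:                  # skipped while latents are held constant
            for i in range(len(V[item])):
                V[item][i] += lr * (v_step[i] - reg * V[item][i])
    return math.sqrt(squared_error / len(order))


def small_random(rows, cols, rng, scale=0.1):
    """Matrix of small Gaussian values (the scale is an illustrative choice)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # toy binary item descriptions
known = [(0, 0, 4.0), (0, 1, 3.0), (0, 2, 5.0),
         (1, 0, 2.0), (1, 1, 1.0), (1, 2, 3.0),
         (2, 0, 3.0), (2, 1, 2.0)]
V = small_random(3, 2, rng)                 # 3 items, 2 latent inputs each
W1 = small_random(4, 3, rng)                # 4 inputs, 3 hidden units
W2 = small_random(3, 3, rng)                # 3 hidden units, 3 users
history = [train_epoch(known, A, V, W1, W2, 0.01, 0.0, True, rng)
           for _ in range(300)]
print(history[0], history[-1])
```

Passing refine_latents=False reproduces the behavior of the second phase, in which the latent vectors are held constant while the weights converge.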
4 Experimental Results
In this section we present the results from our experiments. We examine LNN using the MovieLens data set (http://www.grouplens.org) from the HetRec 2011 workshop [3]. We use this data set because it provides descriptions for the movies in addition to the ratings matrix. There are few data sets that provide user/item descriptions in addition to the ratings matrix (e.g., the Netflix data only contains user ratings). Some data sets provide unstructured data, such as Twitter information or a set of friends on last.fm, from which input variables could be created. As this paper focuses on the performance of LNN rather than feature creation from unstructured data, we chose to use the MovieLens data set. Also, running state-of-the-art recommendation systems can take a long time: it was reported that running Bayesian probabilistic MF took 188 hours on the Netflix data [17]. Using a smaller data set allows for a more extensive evaluation and facilitates cross-validation. The MovieLens data set contains 2113 users and 10197 movies with 855598 ratings. On average, there are 405 ratings per user and 84 ratings per movie. For item descriptions, we use the genre(s) of the movie as a set of binary variables indicating whether the movie belongs to each of the 19 genres.
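The binary genre encoding described above amounts to one indicator variable per genre. A minimal sketch follows; the function name encode_genres and the three-genre vocabulary are hypothetical and chosen only to keep the example short (the data set uses 19 genres).

```python
def encode_genres(movie_genres, vocabulary):
    """Return one binary row per movie with a 1 for each genre it belongs to."""
    index = {genre: i for i, genre in enumerate(vocabulary)}
    rows = []
    for genres in movie_genres:
        row = [0] * len(vocabulary)
        for genre in genres:
            row[index[genre]] = 1
        rows.append(row)
    return rows


vocabulary = ["Action", "Comedy", "Drama"]          # illustrative subset
descriptions = encode_genres([["Action", "Drama"], ["Comedy"]], vocabulary)
print(descriptions)  # [[1, 0, 1], [0, 1, 0]]
```

Two movies with identical genre sets receive identical description vectors, which is why a distance of 0 between descriptions is meaningful in the cold-start experiments.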
We use LNN with and without three-phase training. These are equivalent to a hybrid UBP and a hybrid NLPCA technique, respectively. LNN with three-phase training is denoted as LNN. We compare LNN with several other recommendation systems: 1) content-boosted collaborative filtering (CBCF), 2) content-based filtering (CBF), 3) nonlinear principal component analysis (NLPCA), 4) unsupervised backpropagation (UBP), and 5) matrix factorization (MF). For each recommendation system, we test several parameter settings. CBF uses a single learning algorithm to learn the rating preferences of a user. We experiment using naïve Bayes (as is commonly used [14]), linear regression, a decision tree, and a neural network trained with backpropagation. The same learning algorithms are also used for CBCF, and the number of neighbors ranges from 1 to 64. For MF, the number of latent variables ranges from 2 to 32 and the regularization term from 0.001 to 0.1. In addition to the values used for MF for the number of latent variables and the regularization term, the number of nodes in the hidden layer ranges from 0 to 32 for UBP, NLPCA, and both LNN variants. For each experiment, we randomly select 20% of the ratings as a test set. We then use 10% of the training set as a validation set for parameter selection. Using the selected parameters, we evaluate on the test set and also using 10-fold cross-validation.
4.1 Results
The results comparing LNN with the other recommendation approaches are shown in Table 1. We report the mean absolute error (MAE) for each approach. The bold values represent the lowest means within 0.002. The algorithms that use latent variables achieve significantly lower error than those that do not (CBCF and CBF), demonstrating the predictive power of latent variables for item recommendation. Latent inputs also allow one to bypass feature engineering, which is often a difficult process.
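Table 1: Mean absolute error (MAE) for each recommendation approach.

           | CBCF   | CBF    | LNN    | LNN    | MF     | NLPCA  | UBP
-----------|--------|--------|--------|--------|--------|--------|-------
Validation | 0.7709 | 0.8781 | 0.5885 | 0.5877 | 0.5886 | 0.6058 | 0.5942
Test       | 0.7767 | 0.8831 | 0.5795 | 0.5810 | 0.5779 | 0.5971 | 0.5942
10-fold CV | 0.7754 | 0.8695 | 0.5781 | 0.5778 | 0.5760 | 0.5915 | 0.5915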
Adding the item descriptions to NLPCA and UBP (the two LNN variants) improves performance compared to using only the latent variables. The performance of both LNN variants is similar to that of matrix factorization, which is widely considered state-of-the-art in recommendation systems when comparing MAE. The power of LNN becomes apparent when it is faced with the cold-start problem, which we address in the following section. As discussed previously, MF and other pure collaborative filtering techniques are not able to address the cold-start problem, despite performing very well on items that have previously been rated a sufficient number of times. (They also suffer from the gray sheep problem, which occurs when an item has only been rated a small number of times.) Both LNN variants are capable of addressing the cold-start problem while still obtaining performance similar to matrix factorization.
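The evaluation protocol used for these results (a 20% test split, 10% of the remaining ratings held out for validation, and MAE as the error measure) can be sketched as follows. The function names and the fixed random seed are assumptions for illustration; the sketch does not include the 10-fold cross-validation loop.

```python
import random


def split_ratings(ratings, test_fraction=0.2, validation_fraction=0.1, seed=0):
    """Randomly split ratings into (train, validation, test) lists.

    The validation set is taken as a fraction of what remains after the
    test set has been removed.
    """
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    n_test = int(round(test_fraction * len(shuffled)))
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_validation = int(round(validation_fraction * len(rest)))
    validation, train = rest[:n_validation], rest[n_validation:]
    return train, validation, test


def mean_absolute_error(predictions, targets):
    """Average absolute difference between predicted and true ratings."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


ratings = [(item, user, 3.0) for item in range(10) for user in range(10)]
train, validation, test = split_ratings(ratings)
print(len(train), len(validation), len(test))        # 72 8 20
print(mean_absolute_error([1.0, 2.0, 3.0], [1.0, 3.0, 5.0]))  # 1.0
```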
4.2 Cold Start Problem
To examine the cold-start problem, we remove the ratings for the top 10 most rated movies, both individually and collectively. The number of removed ratings for a single movie ranged from 1263 to 1670, and 15,131 ratings were removed for all top 10. The recommendation systems were trained on the remaining ratings using the parameter settings found in the previous experiments. For LNN, predicting a new item poses an additional challenge since the latent variables for the new items have not been induced. We find that using 0 values for the latent inputs often produces worse results than CBF. A CBF creates a model for each user based on item descriptions and corresponding user ratings. LNN, on the other hand, produces a single model, which is beneficial when using all of the ratings because the mutual information between users and items can be shared. The shared information is contained in the latent variables. The quality of the latent variables depends on the number of ratings that a user has given and/or an item has received.
To compensate for the lack of latent variables for the new items, we utilize the new_item_prediction function, which takes a vector representing the description of the new item and is outlined in Algorithm 3. At a high level, new_item_prediction uses the description to find the new item's nearest neighbors. The induced latent input variables for each neighbor are concatenated with the new item's description and fed into a trained LNN to predict a rating for the new item. The weighted mode of the predicted ratings of the new item is then returned. The rating from each neighbor is weighted according to how many times the neighbor has been rated. By weighting, we mean that when selecting the mode from a set of numbers, the predicted rating is added to the set once for every time that the neighbor item has been rated. We chose to use the mode rather than the mean because the mode is more robust to outliers and achieves better empirical results on the validation sets in our experimentation. We next describe new_item_prediction in more detail.
Lines 1–2 initialize a counter that keeps track of how many times each rating has been predicted for the new item, setting all values to 0. Line 3 initializes the number of nearest neighbors to search for to 100 and sets the distance threshold to 0. We chose 100 neighbors because it was generally more than enough neighbors to produce good results. As we used binary item descriptions of movie genres, we only considered using the latent variables from items that have the same genre(s) (i.e., a distance of 0). These values come into play in line 7, where an item is not used if its distance is greater than the distance threshold (in this case 0) or if it has not been rated at least 50 times. The value of 50 was chosen based on the evaluation of a content-based predictor [15]. The number of times that an item has been rated helps to determine the quality of the induced latent variables for an item and provides a confidence level for the latent variables. Line 4 finds the closest neighbors and inserts their indexes into an array. Lines 5–10 count the number of times that each rating is predicted, weighted by the number of times that the neighbor item has been rated. We use a linear weighting, such that the prediction for an item that has been rated 100 times will count for 100 ratings of that predicted value. This helps to discount items that have only been rated a few times and whose latent variables may not be set to good values. Line 13 returns the index (rating) that has the max count (i.e., the mode).
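The procedure just described can be sketched as follows. This is an illustration under several assumptions, not a transcription of Algorithm 3: the distance between descriptions is taken to be the Hamming distance, neighbors are found by a brute-force sort rather than a tree-based search, predicted ratings are rounded to the nearest half point before voting, and `predict` stands for any trained model that maps a concatenated description and latent vector to a rating. All function and parameter names are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two binary descriptions differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def new_item_prediction(a_new, descriptions, latents, rating_counts, predict,
                        k=100, max_distance=0, min_ratings=50, step=0.5):
    """Weighted mode of ratings predicted from the neighbors' latent vectors.

    Returns None when no neighbor passes the distance and rating-count filters.
    """
    by_distance = sorted(range(len(descriptions)),
                         key=lambda i: hamming(a_new, descriptions[i]))
    votes = Counter()
    for i in by_distance[:k]:
        too_far = hamming(a_new, descriptions[i]) > max_distance
        if too_far or rating_counts[i] < min_ratings:
            continue
        # New item's description concatenated with the neighbor's latents.
        raw = predict(list(a_new) + list(latents[i]))
        rating = round(raw / step) * step
        votes[rating] += rating_counts[i]   # linear weighting by rating count
    if not votes:
        return None
    return max(votes, key=votes.get)


# Toy data: five existing items with one latent input each.
descriptions = [[1, 0], [1, 0], [1, 0], [1, 0], [0, 1]]
latents = [[0.0], [0.0], [1.0], [1.5], [2.0]]
rating_counts = [60, 60, 200, 10, 500]


def toy_model(q):
    """Stand-in for a trained LNN: 3.0 plus the latent input."""
    return 3.0 + q[-1]


result = new_item_prediction([1, 0], descriptions, latents, rating_counts, toy_model)
print(result)  # 4.0
```

In this toy run, two neighbors vote for 3.0 with a combined weight of 120 and one neighbor votes for 4.0 with a weight of 200, so the weighted mode differs from the unweighted one; the item with only 10 ratings and the item with a different genre vector are excluded.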
The results for recommending new items using new_item_prediction are provided in Table 2. The values at the top of the table correspond to the movie id in the MovieLens data set. The bold values represent the lowest MAE value obtained. No single recommendation system produces the lowest MAE for all of the items, suggesting that some recommendation systems are better than others for a given user and/or item, as has been suggested previously [11]. The two LNN variants each score the lowest MAE for several movies individually. With the exception of movies 2571 and 2959, for which CBCF has the lowest MAE in Table 2, the LNN variants produce the lowest MAE for all of the movies when they have not been previously rated. When holding out all 10 items, LNN produces the lowest MAE value. This shows the importance of using latent variables. CBCF uses CBF to create a dense matrix (except for the ratings corresponding to the active user) and then uses a collaborative filtering technique on the dense matrix to recommend items to the user. Thus, more emphasis is given to the CBF, which generally produces poorer item recommendations than a collaborative filtering approach. LNN, on the other hand, utilizes the latent variables and their predictive power.
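Table 2: MAE when recommending previously unrated (held-out) movies. Column headings are MovieLens movie ids.

alg  | 2571  | 2858  | 2959  | 296   | 318   | 356   | 480   | 4993  | 5952  | 7153  | top10
-----|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|------
CBCF | 0.889 | 0.898 | 0.875 | 0.742 | 0.929 | 0.760 | 0.720 | 0.755 | 1.053 | 0.981 | 0.896
CBF  | 0.957 | 0.905 | 0.920 | 0.870 | 0.965 | 0.866 | 0.766 | 0.790 | 1.121 | 1.041 | 0.972
LNN  | 1.175 | 0.689 | 0.894 | 0.666 | 0.789 | 0.593 | 0.552 | 0.558 | 0.577 | 0.523 | 0.859
LNN  | 1.189 | 0.690 | 0.906 | 0.680 | 0.810 | 0.595 | 0.541 | 0.587 | 0.566 | 0.521 | 0.847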
4.3 Efficiency
As with most neural network models, the running time of LNN cannot be stated precisely, since it depends on the number of iterations until convergence. In our experiments, LNN always converges regardless of the parameter settings. However, some parameter settings did require longer to reach convergence than others. The average time in seconds required to run each algorithm using the parameter settings found in the previous experiments is shown in Table 3. The additional complexity of LNN requires more time to train. However, it has the benefit that a new model does not have to be induced in order to recommend new or unrated items, as is the case with MF, NLPCA, and UBP. To recommend new items, LNN uses a k-d tree for the nearest-neighbor search, which has logarithmic average-case search and insert complexities.
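Table 3: Average time in seconds required to run each algorithm.

                | CBCF   | CBF | LNN  | LNN   | MF  | NLPCA | UBP
----------------|--------|-----|------|-------|-----|-------|-----
Train           | 2278.2 | 9.1 | 43.4 | 60.2  | 4.8 | 5.8   | 5.8
Avg. 10-fold CV | 2432.7 | 9.6 | 53.9 | 193.4 | 7.6 | 8.5   | 10.0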
5 Conclusions and Future Work
In this paper, we presented a neural network with latent input variables, which we call a latent neural network (LNN), that is capable of recommending unrated items to users or items to new users. The latent variables and input variables allow information and correlations among the rated items to be represented while also incorporating the item descriptions in the recommendation. Thus, LNN is a hybrid recommendation algorithm that leverages the advantages of collaborative filtering and content-based filtering.
Empirically, an LNN is able to achieve results similar to state-of-the-art collaborative filtering techniques such as matrix factorization while also addressing the cold-start problem. Compared with other hybrid filters and content-based filtering, LNN achieves much lower error when recommending previously unrated items. As LNN achieves error rates similar to the state-of-the-art filtering techniques and can make recommendations for previously unrated items, LNN does not have to be retrained once new items are rated in order to recommend them.
As LNN is built on a neural network, it is capable of modeling higher-order dependencies and nonlinearities in the data. However, the data in the MovieLens data set and many similar data sets is well suited to linear models such as matrix factorization. This may be due in part to the fact that many of the data sets are inherently sparse, so nonlinear models could overfit them and generalize poorly. As a direction of future work, we are examining how to better incorporate the nonlinear component of LNN. We are also looking at integrating both user and item descriptions with latent input variables to address the new-user problem and the new-item problem in a single model.
References

[1] Y. Bengio, H. Schwenk, J. Senécal, F. Morin, and J. Gauvain. Neural probabilistic language models. In Innovations in Machine Learning, pages 137–186. Springer, 2006.
[2] R. D. Burke. Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction, 12(4):331–370, 2002.
[3] I. Cantador, P. Brusilovsky, and T. Kuflik. 2nd workshop on information heterogeneity and fusion in recommender systems (HetRec 2011). In Proceedings of the 5th ACM Conference on Recommender Systems, RecSys 2011, New York, NY, USA, 2011. ACM.
[4] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin. Combining content-based and collaborative filters in an online newspaper. In Proceedings of the ACM SIGIR ’99 Workshop on Recommender Systems: Algorithms and Evaluation, Berkeley, California, 1999. ACM.
[5] D. Cohen and J. Shawe-Taylor. Daugman’s Gabor transform as a simple generative back propagation network. Electronics Letters, 26(16):1241–1243, 1990.
[6] P. Cremonesi, R. Turrin, and F. Airoldi. Hybrid algorithms for recommending new items. In Proceedings of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems, HetRec ’11, pages 33–40, New York, NY, USA, 2011. ACM.
[7] P. Forbes and M. Zhu. Content-boosted matrix factorization for recommender systems: experiments with recipe recommendation. In B. Mobasher, R. D. Burke, D. Jannach, and G. Adomavicius, editors, RecSys, pages 261–264. ACM, 2011.
[8] M. S. Gashler, M. R. Smith, R. Morris, and T. Martinez. Missing value imputation with unsupervised backpropagation. Computational Intelligence, to appear, 2014.
[9] G. E. Hinton. Generative backpropagation. In Abstracts 1st INNS, 1988.
[10] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
[11] J. Lee, M. Sun, G. Lebanon, and S.-J. Kim. Automatic feature induction for stagewise collaborative filtering. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 314–322. Curran Associates, Inc., 2012.
[12] Q. Li and B. M. Kim. Clustering approach for hybrid recommender system. In Web Intelligence, pages 33–38. IEEE Computer Society, 2003.
[13] J. Lin, K. Sugiyama, M.-Y. Kan, and T.-S. Chua. Addressing cold-start in app recommendation: latent user models constructed from Twitter followers. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’13, pages 283–292, New York, NY, USA, 2013. ACM.

[14] P. Melville, N. Shah, L. Mihalkova, and R. J. Mooney. Experiments on ensembles with missing and noisy data. In Multiple Classifier Systems, volume 3077 of Lecture Notes in Computer Science, pages 293–302, 2004.
[15] T. M. Mitchell. Machine Learning, volume 1. McGraw-Hill New York, 1997.
[16] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20. Curran Associates, Inc., 2007.

[17] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[18] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock. Methods and metrics for cold-start recommendations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’02, pages 253–260, New York, NY, USA, 2002. ACM.
[19] M. Scholz, F. Kaplan, C. L. Guy, J. Kopka, and J. Selbig. Nonlinear PCA: a missing data approach. Bioinformatics, 21(20):3887–3895, 2005.
[20] X. Su, R. Greiner, T. M. Khoshgoftaar, and X. Zhu. Hybrid collaborative filtering algorithms using a mixture of experts. In Web Intelligence, pages 645–649. IEEE Computer Society, 2007.
[21] G. Takács, I. Pilászy, B. Németh, and D. Tikk. Scalable collaborative filtering approaches for large recommender systems. The Journal of Machine Learning Research, 10:623–656, 2009.
[22] K. Zhou, S.-H. Yang, and H. Zha. Functional matrix factorizations for cold-start recommendation. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’11, pages 315–324, 2011.