1 Introduction
Online services such as social media and e-commerce have played a key role in generating massive data sources for information systems. Since this information explosion complicates users’ lives and even makes such systems difficult to use, recommender systems aim to offer personalized recommendations that minimize confusion and increase the chance of reaching meaningful information. Based on the available data and the nature of the application domain, there are two main approaches to producing favorable recommendations: collaborative filtering, which learns only from past interactions of users, and content-based methods, which learn the taste of users from content features. However, both approaches have strengths and flaws. Collaborative filtering does not require domain expertise to mine information from data sources and works well for complex objects such as movies, books and music, where variations in taste are much sparser than variations in preferences; content-based filtering works better when preference data is sparse and cold-start is an issue. In practice, companies follow a middle way and use hybrid systems of these two approaches. Nevertheless, hybrid recommender systems are seldom investigated in the literature. Therefore, we present a general framework that uses both aspects in a compact deep neural network architecture.
Among the various applied methods, matrix factorization is the best-known collaborative filtering approach. Matrix factorization projects users and items into a shared latent space by decomposing the rating matrix into low-dimensional latent factors. To model an interaction between a user and an item, recommender systems use the inner product of their latent factors. In [14], a deep collaborative filtering (DCF) method is proposed to combine probabilistic matrix factorization (PMF) with marginalized denoising autoencoders (mDA). The latent factors are extracted from the hidden layer of deep networks and used to feed the matrix factorization components. A collaborative topic modeling approach is proposed by Wang and Blei [18] for recommending scientific articles to online communities. Here, Latent Dirichlet Allocation (LDA) is applied to the user ratings as well as the article contents. Once users and articles are represented as latent factors, matrix factorization is applied to their latent representations to predict user preferences. [12] proposed a context-aware recommendation model, convolutional matrix factorization (ConvMF), which integrates a convolutional neural network (CNN) into PMF. The item representation is obtained from a CNN network trained directly within matrix factorization.
In most studies in recommender systems, Deep Neural Networks (DNNs) are used either to obtain better latent factor representations or to integrate auxiliary information into matrix factorization to alleviate the cold-start problem. In contrast to the wide range of research on combining matrix factorization and DNNs, there is relatively little work on employing DNNs to learn the interaction function directly from data. A very first attempt to build a traditional collaborative filtering setup with neural networks [4] simulated matrix factorization by replacing its inner product with a feedforward neural network; however, it did not succeed on benchmark datasets.
[9] took this approach one step further, arguing that the inner product cannot capture nonlinear interactions between users and items. They proposed a framework named NCF that replaces the inner product with a nonlinear interaction function learned by a feedforward neural network, and they reported promising results. However, interaction data alone is not sufficient for a challenging recommender system in most cases; auxiliary data is a key factor, especially for systems that introduce new users or items at any time. This paper explores the use of DNNs to extract meaningful information from both auxiliary and historical interaction data, then combines them to make better predictions than any single aspect or data source. Our proposed framework can be extended with not-yet-experimented auxiliary data and/or by redefining the interaction function on the current data in a flexible manner. The main contributions of this work are summarized below.

We devise a general framework for a hybrid recommender system based on DNNs that models latent features of users and items from both auxiliary and interaction data.

We demonstrate the effectiveness of our NHR approach on the collaboration of self-sufficient recommender models.

We verify that auxiliary information can significantly improve recommendation quality, especially in large-scale domains. Utilizing auxiliary information improves not only the success in detecting true interactions but also the ability to rank predictions correctly.

We show that our NHR approach is essential in domains that suffer from severe cold-starts and rating sparsity, since its contributions are strongest in such disadvantaged domains.
Recommendation problems generally suffer from the lack of actual feedback given by users, a.k.a. explicit feedback. Explicit feedback (via ratings and reviews) is a clear expression of user preferences on items, expressed through direct interactions between the system and the user. On the other hand, implicit feedback is tracked automatically by the system itself through inferences about the behavior of the user, such as watching videos, purchasing products and clicking items. Despite the plethora of research on explicit feedback, implicit feedback is the more realistic case for recommender systems in most situations, such as online advertising and online shopping. The reason implicit feedback is less popular is its challenging nature due to the absence of negative interactions. Since we test our framework on item prediction problems, we employ negative sampling, as discussed in Section 3.4, to overcome this problem.
2 Neural Hybrid Recommender
In order to build a general framework for both collaborative filtering and auxiliary information, we adopt feedforward neural networks. Neural networks can model the user–item interaction since they have been proven able to learn nonlinear relations, which is essential for recommending complex objects such as jobs and movies. As suggested in [3], we also utilize wide neural networks for memorization of feature interactions through a wide set of cross-product feature transformations, and deep neural networks for better generalization to unseen feature combinations through low-dimensional dense embeddings. Following NCF, we first build a Wide&Deep collaborative filtering approach by combining different neural networks over the same interaction data, then add auxiliary information via supplementary networks to address the cold-start problem. The names of the pure collaborative filtering methods remain as in [9]: GMF (Generalized Matrix Factorization), performing nonlinear matrix factorization, and MLP (Multi-Layer Perceptron), learning the high-order interaction function. The models trained on auxiliary information are simply named NHR-type, where type refers to the data type used for training. We first train multiple self-sufficient neural recommenders independently of each other, then build a framework as an ensemble of all of them. Even though there is no limitation on the construction of the models, the networks we use in our experiments can roughly be divided into two groups.
Both of the mentioned networks have embedding layers to transform users and items into vector representations. The obtained embedding vectors can be interpreted as the latent vectors of users and items. If we term p_u and q_i the user latent vector and the item latent vector respectively, one can easily define a mapping function as

φ(p_u, q_i) = p_u ⊙ q_i    (1)

where ⊙ denotes the element-wise product of the latent vectors. Then, the next step is to project this product vector to the output layer of the model:

ŷ_ui = a_out(h^T (p_u ⊙ q_i) + b)    (2)

where p_u ⊙ q_i is the output of the multiplication layer in Fig. 1 (left), and h, b, and a_out are the weight vector, bias, and activation function of the layer, respectively. Under the assumptions that the weight vector h is a uniform vector of 1, the bias b is zero, and the activation a_out is an identity function that fires the perceptron with the exact value of its input, this projection layer acts as traditional matrix factorization. To implement a neural network realization of matrix factorization, the weight vector and the bias are instead learned from interactions by the logarithmic loss function in Eq. 5, and in this way a nonlinear MF approach, a.k.a. GMF, is obtained. The sigmoid function σ(x) = 1 / (1 + e^{−x}) is used as a_out because it restricts each neuron to the (0, 1) range, which meets the expectation for item prediction.
The outputs of the embedding layers in the GMF and MLP models are already one-dimensional vectors because they are fed inputs of length 1 (ids only). However, the embedding layers of the deep neural recommender networks trained on auxiliary data (NHR) produce sequences of embeddings w.r.t. the sequence length. Since average-pooling is a well-known way of gathering the information in the sequence members into a particular form, for example obtaining sentence embeddings from word embeddings [20, 1], average-pooling is applied to the outputs of the embedding layers in these models. Since users and items are represented with several features and every feature has its own embedding space, a concatenation is applied after the average-pooling of embeddings to obtain one unique latent vector representation for each user–item pair.
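As a concrete illustration, the GMF scoring of Eqs. (1)–(2) can be sketched in a few lines of NumPy; the vectors and weights below are hypothetical toy values, not learned parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gmf_score(p_u, q_i, h, b=0.0):
    """GMF prediction: sigmoid over a weighted element-wise product."""
    phi = p_u * q_i                 # element-wise product of latent vectors (Eq. 1)
    return sigmoid(h @ phi + b)     # projection to the output layer (Eq. 2)

# With h an all-ones vector, b = 0 and the identity activation,
# GMF reduces to plain matrix factorization (the inner product).
p_u = np.array([0.5, -0.2, 0.1])
q_i = np.array([0.4,  0.3, 0.2])
h = np.ones(3)
mf_score = p_u @ q_i                        # traditional MF prediction
assert np.isclose(h @ (p_u * q_i), mf_score)
```

Learning h and b from the interaction data is what turns this projection layer into the nonlinear GMF variant.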
Once the latent vectors are obtained for user–item pairs, the following functions are used to generate the MLP and NHR models:

z_1 = φ_1(p_u, q_i) = [p_u; q_i],
φ_l(z_{l−1}) = a_l(W_l^T z_{l−1} + b_l),  l = 2, …, L,
ŷ_ui = σ(h^T φ_L(z_{L−1}))    (3)

where the a_l are ReLU activation functions, except the final one, which is a sigmoid; the W_l are the weight matrices and the b_l are bias vectors, as usual.
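A minimal sketch of the interaction function in Eq. (3), assuming hypothetical layer shapes (the actual layer sizes are tuned as described in Section 3.6):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_interaction(p_u, q_i, weights, biases, h):
    """MLP interaction function (Eq. 3): concatenate the latent vectors,
    pass them through ReLU hidden layers, then project with a sigmoid output."""
    z = np.concatenate([p_u, q_i])      # z1 = [p_u; q_i]
    for W, b in zip(weights, biases):   # hidden layers with ReLU activations
        z = relu(W @ z + b)
    return sigmoid(h @ z)               # final sigmoid output layer
```

For the NHR networks, `p_u` and `q_i` would be the average-pooled, concatenated feature embeddings rather than id embeddings.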
As reported in [6], the initialization of weights can contribute to the convergence and performance of deep learning models. Therefore, we first train all models without prior information until convergence, then use their parameters to initialize the relevant weights of the overall architecture. To combine the models, we simply concatenate the last layers of the networks just before the outputs. Since this layer determines the predictive capability of a model, it is generally called the predictive factors in the literature. We use the original weights of the last layers in a weighting process:
h = [α_1 h_1; α_2 h_2; …; α_M h_M]    (4)

where h_m denotes the weight vector of the m-th pretrained model and (α_1, α_2, …, α_M) is the set of hyperparameters determining the trade-off between the pretrained models. The final framework, which ensembles multiple self-sufficient neural recommender networks by this weighting process, is shown in Fig. 2.
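The weighting process of Eq. (4) can be sketched as follows; the trade-off values and toy vectors used in testing are illustrative placeholders for the tuned hyperparameters and pretrained weights:

```python
import numpy as np

def ensemble_score(predictive_factors, model_weights, alphas):
    """Fuse pretrained recommenders (Eq. 4): the last hidden layer of each
    model (its predictive factors) is concatenated, and each model's
    pretrained output weights are scaled by its trade-off hyperparameter."""
    h = np.concatenate([a * w for a, w in zip(alphas, model_weights)])
    phi = np.concatenate(predictive_factors)
    return 1.0 / (1.0 + np.exp(-(h @ phi)))    # shared sigmoid output
```

Setting one alpha to 1 and the rest to 0 recovers the corresponding individual model, which is why the pretrained weights can initialize the combined architecture directly.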
The parameters given in the layer definitions of all models are learned by the binary cross-entropy loss function given below:

L = − Σ_{(u,i) ∈ Y} log ŷ_ui − Σ_{(u,j) ∈ Y⁻} log(1 − ŷ_uj)    (5)

where Y denotes the set of observed interactions and Y⁻ denotes the set of negative instances. When the loss function is replaced with a weighted squared loss, the proposed framework can easily be applied to explicit datasets as well.
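A sketch of the loss in Eq. (5) over predicted scores, assuming the positive and negative scores have already been produced by the model:

```python
import numpy as np

def bce_loss(pos_scores, neg_scores):
    """Binary cross-entropy over observed interactions Y (label 1)
    and sampled negative instances Y- (label 0), as in Eq. 5."""
    pos = np.asarray(pos_scores)
    neg = np.asarray(neg_scores)
    return -(np.log(pos).sum() + np.log(1.0 - neg).sum())
```

Perfect predictions (1 for positives, 0 for negatives) yield zero loss; any uncertainty increases it.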
3 Experiments
3.1 Datasets
To conduct our experiments, we worked on two real-world problems: movie recommendation and job recommendation. For the movie recommendation task, we applied our approach to a benchmark movie rating dataset enriched with movie subtitles.
3.1.1 MovieLens & OPUS.
MovieLens [8] includes 5-star ratings of movies and some categorical properties of users and movies. It contains 1,000,209 ratings, 3,706 movies and 6,040 users in total. Each user has at least 20 ratings. The 5-star explicit ratings are converted to implicit feedback by treating a rating as an indicator of user–item interaction, so all ratings in the dataset are considered to be 1. The OPUS subtitles dataset [15] describes a collection of translated movie subtitles from http://www.opensubtitles.org/. It is composed of bitexts for many language pairs. The English subtitles are used to supply convenient contents for the movies; 2581 movies out of 3706 (69.64%) in the rating dataset have subtitles. The movie subtitles in the OPUS dataset are utilized for item representation, while the categorical properties of the user profiles in MovieLens are used for user representation.
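The explicit-to-implicit conversion described above can be sketched as follows (the user and item ids are hypothetical toy values):

```python
# Convert 5-star explicit ratings to implicit feedback: any observed
# rating, whatever its value, becomes a positive interaction (1).
ratings = [(1, 10, 4.0), (1, 12, 2.5), (2, 10, 5.0)]   # (user, item, stars)
implicit = [(user, item, 1) for (user, item, _stars) in ratings]
```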
Table 1: Statistics of the datasets.

Dataset    Type   Interactions  Items   Users   Sparsity
MovieLens  movie  1,000,209     3,706   6,040   95.53%
Kariyer    job    383,434       16,134  20,283  99.88%
3.1.2 Kariyer.
This dataset consists of the job application history of candidates over a one-week period, along with candidate profiles, job definitions, job requirements and company details. Each user has at least 20 applications. It consists of 383,434 applications by 20,283 candidates for 16,134 jobs in total. The application history of users is used as the interaction data in job recommendation, and the properties of jobs and candidates as the auxiliary data.
3.2 Handling Text Data
To make the text data suitable for feeding neural networks, we need to convert raw texts into numeric vectors. In the simplest approach, using a plain dictionary for this purpose would lead to extremely sparse representations due to the huge vocabulary size. Thus, we exploited a hash function that converts a raw text into a sequence of indexes in a fixed-size hashing space. Note that some words may be assigned to the same index by the hash function. The dimension of the hashing space relates to the collision rate of distinct words and to the dimension of the embedding layers. Weighing these pros and cons, we fixed the dimension of the hashing space in the experiments after evaluating its effect on overall performance and complexity.
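The hashing trick can be sketched as below; `md5` is used here only to make the mapping deterministic, and the hashing-space size of 1000 is an arbitrary illustration, not the value used in the experiments:

```python
import hashlib

def hash_tokens(text, hash_dim):
    """Hashing trick: map each token to an index in a fixed-size space.
    Distinct words may collide, which is the price of a bounded vocabulary."""
    tokens = text.lower().split()
    return [int(hashlib.md5(t.encode()).hexdigest(), 16) % hash_dim
            for t in tokens]

seq = hash_tokens("to be or not to be", 1000)
assert len(seq) == 6                          # one index per token
assert seq[0] == seq[4] and seq[1] == seq[5]  # same word, same index
```

A smaller `hash_dim` shrinks the embedding table but raises the collision rate, which is exactly the trade-off discussed above.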
Since the inputs to the neural networks have to be of the same size in all iterations, we examined the mean (μ) and the standard deviation (σ) of the sequence lengths of the text features. The feature-specific input lengths are then defined from these statistics for each text feature in the datasets.
3.3 Evaluation Process
In order to split the dataset into train and test sets, we preferred leave-one-out evaluation, which has been widely applied in many works [10, 9, 2, 16], especially where sparse datasets are concerned. The latest interaction of each user is held out to compose the test set, while the remaining interactions are used for training. The last interaction of each user in the train set is used for hyperparameter tuning.
Ranking every user–item pair in the test set is very time-consuming and not feasible in real time. Therefore, as in similar studies [13, 5, 9], we randomly sampled 100 items per user and ranked them by probability of interaction. To measure the quality of the ranking, we used the well-known evaluation metrics Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG). We applied both metrics to a truncated list of the top-10 ranked test items for each user. Since each user has exactly one interaction in the test set, HR@k simplifies in our experiments as follows:
HR@k = (1/|U|) Σ_{u ∈ U} 1(i_u ∈ R_u)    (6)

where i_u defines the interacted item and R_u the list of top-k recommended items for user u. In addition to HR@k, NDCG@k is also reinterpreted in our experiments because the ideal discounted cumulative gain (IDCG@k) is equal to 1 in our evaluation setup. Therefore, NDCG@k is redefined as:

NDCG@k = (1/|U|) Σ_{u ∈ U} Σ_{j=1}^{k} rel_j / log₂(j + 1)    (7)

where rel_j is 1 if the user interacted with the j-th item of the top-k list and 0 otherwise. The results are reported as the mean of the user scores.
HR gives a shallow notion of success by considering only whether the interacted item is in the top-10 list, whereas NDCG provides a better understanding by assigning higher scores to hits at higher ranks.
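Under the single-test-item setup, both metrics reduce to a few lines; the sketch below follows Eqs. (6)–(7) on a per-user basis (the per-user scores would then be averaged):

```python
import math

def hr_at_k(ranked_items, test_item, k=10):
    """HR@k (Eq. 6): 1 if the held-out item appears in the top-k list."""
    return 1.0 if test_item in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, test_item, k=10):
    """NDCG@k (Eq. 7): with a single relevant item, IDCG = 1, so the score
    is 1 / log2(position + 1) when the item is hit, and 0 otherwise."""
    for pos, item in enumerate(ranked_items[:k], start=1):
        if item == test_item:
            return 1.0 / math.log2(pos + 1)
    return 0.0
```

A hit at rank 1 scores a full 1.0 in both metrics, while a hit at rank 10 still counts as 1.0 for HR but only about 0.29 for NDCG.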
3.4 Negative Sampling
In most cases, implicit feedback refers to a positive inference of user interaction or interest. To handle the absence of negative feedback, many studies have either treated all unobserved cases as negative feedback or sampled negative instances from them. In this work, we apply the latter approach and generate a set of negative feedback by sampling four negative instances per positive instance. Unlike in the evaluation process, we randomly sample negative training instances anew just before each epoch starts. This allows our system to learn as much as possible from different instances and increases the utility of the dataset without hurting its feasibility.
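The per-epoch negative sampling can be sketched as follows (the 4:1 ratio matches the setting above; the item ids are hypothetical):

```python
import random

def sample_negatives(user_pos_items, all_items, ratio=4, rng=random):
    """Draw `ratio` unobserved items per positive interaction; resampled
    before every epoch so the model sees fresh negatives each time."""
    negatives = []
    for _ in range(ratio * len(user_pos_items)):
        candidate = rng.choice(all_items)
        while candidate in user_pos_items:   # reject observed interactions
            candidate = rng.choice(all_items)
        negatives.append(candidate)
    return negatives
```

Calling this at the start of each epoch yields a different negative set every time, which is what lets the model learn from many more instances than a fixed sample would allow.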
3.5 Baselines
We compared our proposed approach NHR to the following methods:

PopRank is a non-personalized, popularity-based recommendation method. Items are ranked by their popularity, determined by the number of interactions, and recommended to all users in the same order.

BPR [16] is a highly competitive pairwise ranking method that works well with implicit feedback. It optimizes the matrix factorization model with a pairwise ranking loss.

ALS [11] is also a matrix factorization algorithm for item recommendation. It runs in parallel and is effective for large-scale collaborative filtering problems that suffer from the sparseness of rating data.

GMF [9] is a neural network realization of matrix factorization. Besides being a part of NCF, it can be employed as a complete recommender system.

MLP [9] is also a part of NCF; it learns the user–item interaction function with neural networks. Like GMF, it is a standalone recommender system.

NCF [9] is a state-of-the-art neural-network-based collaborative filtering method that combines GMF and MLP. Although it achieves very promising results for item prediction, it is a pure collaborative filtering method that benefits only from interaction data and does not address cold-starts, a very common case in real-world recommendation tasks.
3.6 Parameter Setting
We implemented our proposed framework using PyTorch. All individual models were learned by optimizing the logarithmic loss of Eq. 5 because we tested them in an item prediction setup. To determine the hyperparameters of the methods, we conducted intensive tests on the validation data. For the individual models trained without any prior information, we set the model parameters with Xavier initialization, then optimize them with the Adam optimizer, which employs an adaptive learning rate for faster convergence. The learning rate is set to 0.001 and the momentum of the Adam optimizer to 0.9, which is the default setting.
We tested a number of different batch sizes and found that 128 is the best-performing setup for all models except the one trained on text data. Because the embedding size for the text data is quite large and hard to fit even in comparatively large memory, we adopt a batch size of 32 for it.
We evaluated predictive factors of 8, 16, 32 and 64. We employed three hidden layers for the interaction-specific networks, sized in a tower pattern determined by the number of predictive factors, with the embedding size following from this setup as a matter of course. For the networks trained on auxiliary data, we used two hidden layers and intuitively set separate embedding sizes for the movie subtitles, for the job titles and candidate past positions, and for the job qualifications, job explanations and candidate experiences. For equal treatment, we set the parameter of NCF that defines the trade-off between GMF and MLP by optimization, as we did for our NHR methods.
Table 2: HR@10 and NDCG@10 of the compared methods w.r.t. the number of predictive factors (pf) on MovieLens (ML) and Kariyer (Ka).

ds  pf  metric  PopRank  BPR  ALS  GMF  MLP  NCF  NHR-cat.  NHR-text  NHR-comb.  Imp.%
ML  8  HR  0.4512  0.5331  0.6076  0.6247  0.6522  0.6560      0.6718  2.41% 
NDCG  0.2546  0.3027  0.3488  0.3528  0.3789  0.3807      0.3943  3.57%  
16  HR  0.4512  0.5886  0.6545  0.6714  0.6626  0.6828      0.6946  1.73%  
NDCG  0.2546  0.3426  0.3886  0.3945  0.3890  0.4057      0.4126  1.7%  
32  HR  0.4512  0.6040  0.6826  0.6757  0.6728  0.6874      0.6979  1.53%  
NDCG  0.2546  0.3564  0.4150  0.3936  0.3986  0.4053      0.4147  2.32%  
64  HR  0.4512  0.6108  0.6912  0.6763  0.5190  0.6798      0.6964  2.44%  
NDCG  0.2546  0.3621  0.4290  0.4052  0.2857  0.4077      0.4176  2.43%  
Ka  8  HR  0.3231  0.7399  0.5137  0.8249  0.7448  0.8594  0.8821  0.8624  0.8834  2.79% 
NDCG  0.1875  0.5067  0.3237  0.5719  0.5592  0.6204  0.6368  0.6188  0.6354  2.64%  
16  HR  0.3231  0.7874  0.6166  0.8357  0.8021  0.8695  0.8890  0.8730  0.8917  2.55%  
NDCG  0.1875  0.5560  0.4034  0.6041  0.5564  0.6402  0.6571  0.6426  0.6579  2.76%  
32  HR  0.3231  0.7934  0.7013  0.8121  0.8100  0.8658  0.8851  0.8703  0.8875  2.51%  
NDCG  0.1875  0.5629  0.4740  0.5870  0.5471  0.6369  0.6537  0.6411  0.6562  3.03%  
64  HR  0.3231  0.7922  0.7627  0.7841  0.8205  0.8621  0.8800  0.8678  0.8841  2.55%  
NDCG  0.1875  0.5608  0.5394  0.5624  0.5519  0.6334  0.6505  0.6378  0.6536  3.19% 
3.7 Performance Results
In our NHR experiments, we group the auxiliary information sources into three categories: categorical, text, and a combination of the two. The Kariyer dataset includes many data types: free text, real values, binary, single-label and multi-label categorical features. In order to handle all of these types during learning, we first apply general preprocessing steps such as outlier removal and tokenization. We then normalize the real values and transform the binary and categorical features into one-hot and multi-hot representations. All of these features are considered categorical for simplicity. We also convert the raw text features to hash vectors, which we refer to as the text data source, as explained in Section 3.2. The networks trained on the categorical and text data sources are first incorporated into NCF alone (NHR-categorical and NHR-text, respectively), then together to embody the most extensive NHR model (NHR-combined). As for the MovieLens dataset, users are represented with categorical features whereas movies are represented with text features. This results in one auxiliary network (NHR-combined) that combines the categorical and text data sources at the same time; thus, we report one NHR experiment for the movie recommendation task.
Table 2 shows the recommendation performance of the compared methods with respect to the number of predictive factors. The results are given in HR@10 and NDCG@10. The BPR and ALS methods have the same latent factor size as the predictive factors in the neural network models. In doing so, we give all baselines except PopRank the same predictive capability, for a fair comparison. PopRank shows the weakest performance among the methods, as expected, since it is incapable of making personalized suggestions. Since 0.001-level improvements are already considered significant in similar tasks such as click-through rate (CTR) prediction [3, 19, 7, 17], one can easily say that NHR significantly outperforms the state-of-the-art matrix factorization methods ALS and BPR by a large margin in both metrics, and it is also consistently superior to the most competitive baseline, NCF. On MovieLens and Kariyer, NHR achieved relative improvements of 2.03% HR / 2.51% NDCG and 2.60% HR / 2.91% NDCG on average over its NCF counterparts, respectively. NHR gains more generalization capability by merging interaction and auxiliary data.
In addition to more accurate hits in the top-10 predictions, the results show that the NHR systems learn to rank items in the top-10 lists better, raising the test interaction above the other predictions, since the NDCG scores improve by larger steps. The NHR-combined results on job recommendation clearly show that adding new auxiliary data, even with the same learning function, can enhance the overall recommendation performance.
Even though the NHR-text system improves recommendation quality, it underperforms NHR-categorical because of its model complexity. Besides the inevitably large embedding layer, the Kariyer dataset is extremely sparse and the interaction data is in fact not enough to feed such a network. With more data, we expect a larger contribution from the text data.
Last but not least, the results are more promising for job recommendation. Since the Kariyer dataset suffers from severe sparsity and a high frequency of cold-starts, the auxiliary data and the cooperation of models can fill in this information shortage about user preferences.
4 Conclusion
In this work, we explored DNNs for hybrid recommender systems. We devised a general framework, NHR, that models user–item interactions by combining auxiliary and historical data. We showed that every variation of NHR outperforms state-of-the-art collaborative filtering methods as expected, and NHR also gives us the chance to alleviate the deficiencies of depending on single aspects or data sources. It does not require training the complete architecture from scratch. Instead, it allows self-sufficient recommender models to speak for themselves through a weighting process that learns the capabilities of its components.
In the next phase of the study, we would like to test our approach on explicit datasets and use pretrained vector space models such as document vectors for the text features, since learning the embedding layers directly affects model complexity and training time. Since average pooling loses the sequential nature of natural language texts, we would also like to improve our text models with more elaborate architectures such as LSTMs and CNNs to exploit the sequence information and the interrelation of words.
Acknowledgements
This study is part of the research project (Project No:5170032) supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK). The authors would like to thank Istanbul Technical University for their financial support under the project BAP40737.
References
 [1] (2016) Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. arXiv preprint arXiv:1608.04207. Cited by: §2.
 [2] (2017) A generic coordinate descent framework for learning from implicit feedback. In Proceedings of the 26th International Conference on World Wide Web, pp. 1341–1350. Cited by: §3.3.
 [3] (2016) Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pp. 7–10. Cited by: §2, §3.7.
 [4] (2015) Neural network matrix factorization. arXiv preprint arXiv:1511.06443. Cited by: §1.
 [5] (2015) A multi-view deep learning approach for cross-domain user modeling in recommendation systems. In Proceedings of the 24th International Conference on World Wide Web, pp. 278–288. Cited by: §3.3.

 [6] (2010) Why does unsupervised pretraining help deep learning?. Journal of Machine Learning Research 11 (Feb), pp. 625–660. Cited by: §2.
 [7] (2017) DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247. Cited by: §3.7.
 [8] (2016) The movielens datasets: history and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5 (4), pp. 19. Cited by: §3.1.1.
 [9] (2017) Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pp. 173–182. Cited by: §1, §2, 4th item, 5th item, 6th item, §3.3, §3.3.
 [10] (2016) Fast matrix factorization for online recommendation with implicit feedback. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 549–558. Cited by: §3.3.
 [11] (2008) Collaborative filtering for implicit feedback datasets. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on, pp. 263–272. Cited by: 3rd item.
 [12] (2016) Convolutional matrix factorization for document context-aware recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 233–240. Cited by: §1.
 [13] (2008) Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 426–434. Cited by: §3.3.
 [14] (2015) Deep collaborative filtering via marginalized denoising autoencoder. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 811–820. Cited by: §1.
 [15] (2016) OpenSubtitles2016: extracting large parallel corpora from movie and TV subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation. Cited by: §3.1.1.

 [16] (2009) BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 452–461. Cited by: 2nd item, §3.3.
 [17] (2018) AutoInt: automatic feature interaction learning via self-attentive neural networks. arXiv preprint arXiv:1810.11921. Cited by: §3.7.
 [18] (2011) Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 448–456. Cited by: §1.
 [19] (2017) Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17, pp. 12. Cited by: §3.7.
 [20] (2015) Towards universal paraphrastic sentence embeddings. arXiv preprint arXiv:1511.08198. Cited by: §2.