1. Introduction
In the era of information overload, recommender systems play a pivotal role in many user-oriented online services such as E-commerce, content-sharing sites, and news portals. An effective recommender system can not only facilitate the information seeking process of users, but also create customer loyalty and increase profit for the company. With such an important role in online information systems, recommendation has become an active topic of research and attracted increasing attention in the information retrieval and data mining communities (Wang et al., 2017; He et al., 2017; Zhang et al., 2016).
Among various recommendation strategies, collaborative filtering (CF) is now the dominant one and has been widely adopted in industry (Liu et al., 2017; Smith and Linden, 2017). By leveraging user-item interaction data to predict user preference, CF is mostly used in the candidate selection phase of a recommender system (Wang et al., 2018b), which is complemented by an integrated ranking engine that integrates various signals to rank the candidates selected by CF. Generally speaking, CF techniques can be divided into two types — user-based and item-based approaches. The matrix factorization (MF) model (He et al., 2016) is a representative user-based CF method (short for UCF), which represents a user with an ID and projects the ID into the same embedding space as items; the relevance score between a user-item pair is then estimated as the inner product of the user embedding and item embedding. In contrast, item-based CF (short for ICF) represents a user with her historically interacted items, using the similarity between the target item and the interacted items to estimate the user-item relevance (Smith and Linden, 2017; He et al., 2018).
1.1. Why Item-based Collaborative Filtering?
Despite the popularity of MF in recommendation research, ICF has several advantages over UCF. First, by representing a user with her consumed items, ICF encodes more signal in its input than UCF, which simply uses an ID to represent a user. This gives ICF more potential to improve both the accuracy (Christakopoulou and Karypis, 2016) and interpretability (Smith and Linden, 2017) of user preference modeling. For example, there is ample empirical evidence of the accuracy superiority of ICF over UCF methods for top-N recommendation (Christakopoulou and Karypis, 2016; Wu et al., 2016; Christakopoulou and Karypis, 2014); and ICF can interpret a recommended item via its high similarity to some items that the user has consumed before, which is more acceptable to users than the “similar users”-based explanation scheme (Zhang and Chen, 2018). Second, the composability of ICF in user preference modeling makes it easier to implement online personalization (He et al., 2018). For example, when a user has new purchases, instead of retraining model parameters to refresh the recommendation list, ICF can approximate the refreshed list by simply retrieving items that are similar to the newly purchased items. Such a strategy has successfully provided instant personalization in YouTube based on users' recent watches (cf. Section 6.2.3 Instant Recommendation of (Bayer et al., 2017)). By contrast, UCF methods like MF associate model parameters with a user ID, making it compulsory to update model parameters to refresh the recommendation list for a user (cf. the online-update strategy for MF (He et al., 2016; Rendle and Schmidt-Thieme, 2008)).
Early ICF approaches use statistical measures, such as the Pearson correlation and cosine similarity, to quantify the similarity between two items (Sarwar et al., 2001). However, such methods typically require extensive manual tuning of the similarity measure to make them perform well, and it is non-trivial to adapt a well-tuned method to a new dataset or a dataset of a new product domain. In recent years, data-driven methods have been developed to learn item similarity from data, among which two representative methods are the sparse linear method (SLIM) (Ning and Karypis, 2011) and the factored item similarity model (FISM) (Kabbur et al., 2013). In SLIM, the item-item similarity matrix is directly learned with additional constraints on sparsity and non-negativity; in FISM, the similarity between two items is factorized as the inner product of their latent vectors (aka. embeddings), which can be seen as assuming the item-item similarity matrix to be low-rank. Some recent developments for ICF include the neural attentive item similarity (NAIS) model (He et al., 2018), which extends FISM by using an attention network to discriminate which item-item similarities are more important for a prediction; the collaborative denoising autoencoder (CDAE) (Wu et al., 2016), which uses a nonlinear autoencoder architecture (Sedhain et al., 2015) to learn item similarity; and the global and local SLIM (GLSLIM) (Christakopoulou and Karypis, 2016), which uses different SLIM models for different user subsets.
1.2. Why Higher-Order Item Relations?
However, we argue that these previous efforts on ICF only model the second-order relations between pairs of items, more specifically, the relation between an item in the user history and the target item. Higher-order relations, such as multiple items that share certain properties and are likely to be consumed together, are not considered. Figure 1 illustrates some higher-order relations among movies, such as being directed by the same director, acted by the same actress, produced by the same producer, and so on. Such relations can be even more complicated when considering the possible overlap among multiple relations, e.g., a user subset likes to watch movies that share the same director and actress (e.g., Movie #12 and Movie #13 in Figure 1). Besides such explicit higher-order relations based on item attributes, some implicit relations may also exist. For example, a set of items may be frequently bought together by users because they are complementary in functionality (e.g., mouse, keyboard, and screen), or even for no interpretable reason. We believe that such higher-order item relations provide a valuable signal for estimating user preference, and thus the recommendation accuracy of ICF can be significantly improved if such higher-order relations are properly taken into account.
Notably, Christakopoulou and Karypis (Christakopoulou and Karypis, 2014) have verified the existence of higher-order item relations in several product domains and demonstrated their effectiveness in ICF. Nevertheless, we argue that their proposed higher-order sparse linear method (HOSLIM) is limited in that it integrates higher-order relations in a static and linear manner. Specifically, they first identify frequent itemsets from user-item interaction data, and then extend SLIM to learn an itemset-item similarity matrix, which is used to capture the higher-order item relations. One difficulty of such a two-step solution is that the identification of frequent itemsets requires a support threshold, which needs to be carefully tuned to avoid negative effects. If an itemset is useful but fails to be identified in the first step, the subsequent predictive model cannot capture its impact; meanwhile, if an itemset is useless but is identified in the first step, it will have an uncontrollable negative impact on the predictive model. As such, a unified ICF solution that can automatically encode the impact of higher-order item relations into user preference modeling and prediction is highly desired.
1.3. Our Proposal and Contributions
In this work, we aim to fill the research gap of developing ICF models that can effectively capture higher-order item relations. We leverage the recent success of neural recommender models (He et al., 2017) and develop a neural network method to achieve this target. Distinct from HOSLIM (Christakopoulou and Karypis, 2014), which detects higher-order item relations in a separate step, we integrate the learning of higher-order item relations into the predictive model that captures second-order item relations, but use different neural components to capture the two kinds of item relations. Specifically, in the low level of the neural network, we first model second-order item relations via a multiply operation on each pair of item embeddings (similar to the setting of FISM (Kabbur et al., 2013) and NAIS (He et al., 2018)); above this pairwise interaction layer, we then stack multiple layers to learn higher-order item relations in a nonlinear way. Owing to the strong function learning ability of multi-layer neural networks, this end-to-end solution is expected to capture the complicated impacts of higher-order item relations on user decision-making. Since this solution can be treated as a deep variant of ICF in the context of neural network modeling, we term it DeepICF. We conduct extensive experiments on two datasets from MovieLens and Pinterest, verifying the highly positive effect of higher-order item relation modeling in DeepICF. Moreover, we integrate the attention design of the recently proposed NAIS (He et al., 2018) to refine the modeling of second-order item relations (i.e., pairwise item similarities), which leads to further improvements.
The key contributions of this work are outlined as follows.

A generic neural network framework is proposed to model higher-order item relations for item-based CF. The key idea is simple: stack multiple nonlinear layers above the pairwise interaction modeling to learn higher-order item relations.

Two specific methods under the framework, which differ in the pairwise item relation modeling, are presented. One method (DeepICF) combines pairwise item interactions with the same weight, and the other (DeepICF+a) uses an attention mechanism to differentiate the importance of pairwise interactions.

Extensive experiments are performed on two real-world datasets to verify the effectiveness of our proposal. Code has been released to facilitate further developments on deep item-based CF methods: https://github.com/linzh92/DeepICF.
The rest of this paper is organized as follows. We first provide some preliminaries for ICF in Section 2. We then elaborate our proposed DeepICF methods in Section 3. Afterwards, we report experimental results in Section 4 and review related work in Section 5. Finally, we conclude the paper and highlight some future directions in Section 6.
2. Preliminaries
We first describe the general framework of item-based collaborative filtering. We then discuss the Higher-Order SLIM method, an existing solution for modeling higher-order item relations in ICF. Lastly, we recapitulate the FISM and NAIS methods, which are representation-learning-based ICF methods that form the basis of our DeepICF methods.
2.1. Framework of Item-based Collaborative Filtering
ICF predicts a user-item interaction by assuming that user $u$'s preference on item $i$ depends on the similarity of $i$ to all items that $u$ has interacted with before. In general, the predictive model of ICF can be abstracted as,

(1) $\hat{y}_{ui} = \sum_{j \in \mathcal{R}_u^+} s_{ij}\, r_{uj},$

where $\mathcal{R}_u^+$ is the item set that user $u$ has interacted with, $s_{ij}$ denotes the similarity between item $i$ and item $j$, and $r_{uj}$ is $u$'s observed preference on item $j$, which can be a real-valued rating score (explicit feedback) or a binary $1$ or $0$ (implicit feedback).
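To make Equation (1) concrete, the following is a minimal sketch of the generic ICF scorer in NumPy; the similarity matrix `sim`, the history list, and the feedback values are all hypothetical toy data, not parameters from the paper.

```python
import numpy as np

def icf_score(user_history, target_item, sim, ratings):
    """Generic ICF prediction (Equation (1)): sum of item-item
    similarities weighted by the user's observed preferences."""
    return sum(sim[target_item, j] * ratings[j] for j in user_history)

# Toy example: 3 items, hypothetical similarities, implicit feedback (r_uj = 1).
sim = np.array([[0.0, 0.8, 0.1],
                [0.8, 0.0, 0.3],
                [0.1, 0.3, 0.0]])
history = [1, 2]              # items the user interacted with
ratings = {1: 1.0, 2: 1.0}    # implicit feedback
score = icf_score(history, 0, sim, ratings)  # 0.8 + 0.1 = 0.9
```

Any concrete ICF method then reduces to a particular way of obtaining the entries of `sim`, as the following subsections show.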
We summarize the advantages of ICF over UCF in three aspects: accuracy, interpretability, and ease of online recommendation. Regarding accuracy, it is arguable that characterizing a user with her interacted items in ICF can capture the user's interest in a more explicit way. In contrast, in UCF, a static set of parameters describing a user (e.g., the user embedding in MF) has limited representation power in reflecting the dynamic and evolving user preference. Moreover, several prior efforts (Christakopoulou and Karypis, 2016; Wu et al., 2016; Christakopoulou and Karypis, 2014) provide empirical evidence of the accuracy superiority of ICF over UCF. Regarding interpretability, ICF can interpret why a recommendation is made via the explanation mechanism: because the item is similar to some item you liked before. Such an explanation is more concrete and more acceptable to users than the “because similar users also like it” explanation provided by UCF (Zhang and Chen, 2018). Regarding ease of online recommendation, the composability of ICF in user preference modeling — i.e., summing over the item similarities — makes it more suitable for online recommendation. Particularly, when a user has new purchases, UCF needs to update model parameters, e.g., user embeddings, to refresh the recommendation list, which is difficult to adopt for real-time personalization. In contrast, based on the offline item similarities, ICF can approximate the refreshed list by simply retrieving items that are similar to the newly purchased ones.
Clearly, the estimation of item similarity is crucial to the performance of ICF. A straightforward solution is to employ statistical measures, such as the Pearson correlation and cosine similarity, on item features. Recently, data-driven methods have been developed to learn the item similarity from data, which better tailor the similarity parameters to the specific dataset.
2.2. SLIM and Higher-Order SLIM
SLIM (Ning and Karypis, 2011) is among the first learning-based ICF methods that directly learn the item-item similarity matrix from historical interaction data. Specifically, it minimizes the reconstruction error between the original user-item interaction matrix and the reconstructed one derived from an ICF model. Two constraints are imposed on the item-item similarity matrix: non-negativity and sparsity, which ensure the meaningfulness of the learned similarities and enforce each item to be similar to only a few items. The objective function of SLIM is formulated as,
(2) $\min_{\mathbf{S}} \ \frac{1}{2}\left\| \mathbf{R} - \mathbf{R}\mathbf{S} \right\|_F^2 + \lambda \left\| \mathbf{S} \right\|_1 + \frac{\beta}{2}\left\| \mathbf{S} \right\|_F^2, \quad \text{s.t.} \ \mathbf{S} \ge 0, \ diag(\mathbf{S}) = \mathbf{0},$

where $M$ and $N$ denote the number of users and items, $\mathbf{R} \in \mathbb{R}^{M \times N}$ is the user-item interaction matrix, $\mathbf{S} \in \mathbb{R}^{N \times N}$ represents the item-item similarity matrix where each entry is $s_{ij}$, $\lambda$ is a hyperparameter to control the strength of the $L_1$ regularization that enforces the sparsity constraint, and $\beta$ is a hyperparameter to control the strength of the $L_2$ regularization that prevents overfitting. In SLIM, the item similarity matrix $\mathbf{S}$ is the parameter to learn — the constraint $\mathbf{S} \ge 0$ ensures each element in $\mathbf{S}$ (i.e., a similarity score) is non-negative, and the constraint $diag(\mathbf{S}) = \mathbf{0}$ forces the diagonal elements of $\mathbf{S}$ to be zero to eliminate the impact of the target item itself in estimating a prediction.
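As an illustration of the objective in Equation (2), the sketch below evaluates its value for a candidate similarity matrix; the matrices and the values of `lam` and `beta` are toy assumptions, and the non-negativity and zero-diagonal constraints are assumed to be enforced elsewhere (e.g., by a projection step in the optimizer).

```python
import numpy as np

def slim_objective(R, S, lam, beta):
    """Value of the SLIM objective: squared reconstruction error plus
    L1 (sparsity) and L2 (overfitting) regularization terms."""
    recon = 0.5 * np.linalg.norm(R - R @ S, "fro") ** 2
    l1 = lam * np.abs(S).sum()                       # sparsity-inducing term
    l2 = 0.5 * beta * np.linalg.norm(S, "fro") ** 2  # overfitting control
    return recon + l1 + l2

R = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
S = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.0]])
val = slim_objective(R, S, lam=0.1, beta=0.1)
```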
Despite its effectiveness, the expressiveness of SLIM can be limited by its modeling of pairwise item relations only, since it overlooks possible higher-order relations, such as multiple items belonging to the same group, sharing the same attributes, co-occurring frequently, and so on. To this end, Christakopoulou and Karypis (Christakopoulou and Karypis, 2014) proposed HOSLIM, which extends SLIM to capture higher-order item relations. In particular, they first apply a frequent itemset mining algorithm to identify itemsets that are frequently co-interacted by users; they then extend SLIM to jointly learn the item-item similarities and itemset-item similarities, which can capture higher-order item relations. Specifically, the predictive model of HOSLIM is as follows,
(3) $\hat{y}_{ui} = \mathbf{r}_u^T \mathbf{s}_i + \tilde{\mathbf{r}}_u^T \tilde{\mathbf{s}}_i,$

where $\mathbf{r}_u \in \mathbb{R}^N$ denotes the interaction vector of $u$ on items, and $\tilde{\mathbf{r}}_u \in \mathbb{R}^{N'}$ denotes the interaction vector of $u$ on itemsets ($N'$ itemsets in total), where each entry $\tilde{r}_{uq}$ denotes whether $u$ has interacted with all items in itemset $q$. Vectors $\mathbf{s}_i$ and $\tilde{\mathbf{s}}_i$ are model parameters to learn, where $\mathbf{s}_i \in \mathbb{R}^N$ denotes the similarity vector of $i$ on items, and $\tilde{\mathbf{s}}_i \in \mathbb{R}^{N'}$ denotes the similarity vector of $i$ on itemsets. The objective function of HOSLIM is similar to that of SLIM with additional constraints on the itemset similarity matrix:
(4) $\min_{\mathbf{S}, \tilde{\mathbf{S}}} \ \frac{1}{2}\left\| \mathbf{R} - \mathbf{R}\mathbf{S} - \tilde{\mathbf{R}}\tilde{\mathbf{S}} \right\|_F^2 + \lambda \left( \left\| \mathbf{S} \right\|_1 + \| \tilde{\mathbf{S}} \|_1 \right) + \frac{\beta}{2}\left( \left\| \mathbf{S} \right\|_F^2 + \| \tilde{\mathbf{S}} \|_F^2 \right),$

(5) $\text{s.t.} \ \mathbf{S} \ge 0, \ \tilde{\mathbf{S}} \ge 0, \ diag(\mathbf{S}) = \mathbf{0}, \ \tilde{s}_{qi} = 0 \ \ \forall q \in \mathcal{Q}_i,$

where $\mathbf{S}$ and $\tilde{\mathbf{S}}$ denote the item-item and itemset-item similarity matrices (of which each column vector is $\mathbf{s}_i$ and $\tilde{\mathbf{s}}_i$), respectively, and $\mathcal{Q}_i$ denotes the itemsets that contain item $i$. The definition of the constraints follows the same logic as SLIM, thus we omit the explanation here.
As mentioned in the introduction, the two-step solution of HOSLIM has several limitations. First, as we can see from Equation (3), the user-itemset interaction vector plays an important role in the predictive model; however, it is determined by the frequent itemset mining algorithm, which requires a support threshold that is non-trivial to tune across datasets (since the item frequency distributions of different datasets may vary a lot). Second, in HOSLIM, a higher-order item relation is defined as the similarity between an itemset and an item, which is aggregated in the same way as the second-order item similarities — i.e., linearly and statically — in the predictive model. By linear, we mean that the similarities between candidate itemsets and the target item are directly summed in the predictive model; by static, we mean that the importance of itemset-item similarities remains the same for different predictions (i.e., a uniform weight of 1). Such a simple manner of modeling higher-order item relations ignores the varying importance of itemsets for a prediction and the possible nonlinear relations among items, making it suboptimal for predicting user preference. Moreover, this reflects the difficulty of modeling higher-order relations in traditional linear recommendation models, revealing the necessity and possibility of addressing it with nonlinear, more expressive, and end-to-end trainable neural network models. This forms the major motivation of this work from a technical perspective.
2.3. FISM and NAIS Methods
FISM (Kabbur et al., 2013) represents another mainstream direction in learning-based ICF — instead of directly learning the whole item similarity matrix, which can be very space- and time-consuming, it applies a low-rank assumption on the item similarity matrix and learns the low-rank structure to reconstruct the matrix. The predictive model of FISM is formulated as,

(6) $\hat{y}_{ui} = b_u + b_i + \big( |\mathcal{R}_u^+ \setminus \{i\}| \big)^{-\alpha} \sum_{j \in \mathcal{R}_u^+ \setminus \{i\}} r_{uj}\, \mathbf{p}_i^T \mathbf{q}_j,$

where $\mathbf{p}_i \in \mathbb{R}^k$ and $\mathbf{q}_j \in \mathbb{R}^k$ are trainable parameters that define the low-rank structure and $k$ denotes the rank size; from the perspective of representation learning, $\mathbf{p}_i$ and $\mathbf{q}_j$ can be seen as the latent features (aka. embeddings) of target item $i$ and historical item $j$. As can be seen, the item similarity score $s_{ij}$ can be expressed as the inner product between $\mathbf{p}_i$ and $\mathbf{q}_j$. Hyperparameter $\alpha$ controls the normalization over users of different history lengths, e.g., $\alpha = 0$ means no normalization is used, $\alpha = 1$ means full normalization is used, and other intermediate values between 0 and 1 are also applicable. Following the zero-diagonal constraint in SLIM, the sum over the item set $\mathcal{R}_u^+ \setminus \{i\}$ is to exclude the influence of the target item $i$ in constructing $u$'s profile to predict $\hat{y}_{ui}$, which can avoid information leak during training. Moreover, since most recommender systems deal with implicit feedback where $r_{uj} = 1$ for all observed interactions, we can omit the coefficient $r_{uj}$ in Equation (6).
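A minimal sketch of the FISM predictor in Equation (6) for implicit feedback (so the coefficient $r_{uj}$ is omitted); the embedding matrices, biases, and $\alpha$ below are hypothetical toy values.

```python
import numpy as np

def fism_score(u_history, i, P, Q, b_user, b_item, alpha):
    """FISM prediction for implicit feedback: biases plus the
    normalized sum of inner products p_i^T q_j over the user's
    history, with the target item i excluded (zero-diagonal rule)."""
    hist = [j for j in u_history if j != i]   # exclude target item
    if not hist:
        return b_user + b_item[i]
    inner = sum(float(P[i] @ Q[j]) for j in hist)
    return b_user + b_item[i] + len(hist) ** (-alpha) * inner

rng = np.random.default_rng(0)
P = rng.normal(size=(4, 8))   # 4 items, embedding size 8
Q = rng.normal(size=(4, 8))
score = fism_score([1, 2, 0], i=0, P=P, Q=Q,
                   b_user=0.1, b_item=np.zeros(4), alpha=0.5)
```

Note how the target item `0` is dropped from its own history before summing, mirroring the information-leak discussion above.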
It is arguable that FISM is limited, since it models all second-order item relations with the same weight for all predictions. To address this limitation, NAIS was recently proposed (He et al., 2018), which applies a dynamic weighting strategy to second-order item relations. Specifically, NAIS considers that the historical items of a user should contribute differently to the prediction on the target item. An attention network is then employed to learn the varying weights of item-item relations based on the item embeddings. The predictive model of NAIS is formulated as,
(7) $\hat{y}_{ui} = \sum_{j \in \mathcal{R}_u^+ \setminus \{i\}} a_{ij}\, \mathbf{p}_i^T \mathbf{q}_j,$

where $a_{ij}$ denotes the attentive weight of similarity $\mathbf{p}_i^T \mathbf{q}_j$ in contributing to the final prediction. In NAIS, $a_{ij}$ is parameterized as a neural network's output with item embeddings $\mathbf{p}_i$ and $\mathbf{q}_j$ as input. Specifically, two neural attention networks are presented which differ in how they combine $\mathbf{p}_i$ and $\mathbf{q}_j$:

(8) $a_{ij} = \text{softmax}_\beta\big( f(\mathbf{p}_i, \mathbf{q}_j) \big), \quad f(\mathbf{p}_i, \mathbf{q}_j) = \begin{cases} \mathbf{h}^T \text{ReLU}\big( \mathbf{W} [\mathbf{p}_i; \mathbf{q}_j] + \mathbf{b} \big) & \text{(concatenation)}, \\ \mathbf{h}^T \text{ReLU}\big( \mathbf{W} (\mathbf{p}_i \odot \mathbf{q}_j) + \mathbf{b} \big) & \text{(product)}, \end{cases}$

where $\text{softmax}_\beta$ is a variant of the softmax function that takes the normalization over user history length into account. As such, the normalization term $\big( |\mathcal{R}_u^+ \setminus \{i\}| \big)^{-\alpha}$ in FISM can be omitted in NAIS. $\mathbf{W}$ and $\mathbf{b}$ are the weight matrix and bias vector of the hidden layer of the attention network, and $\mathbf{h}$ is the weight vector that projects the hidden layer into the scalar output.

While FISM and NAIS have provided strong performance for item recommendation, we argue that neither of them takes higher-order item relations into account. When certain higher-order item relations exist in the data, as demonstrated in (Christakopoulou and Karypis, 2014) on several real-world datasets, both methods cannot capture them and thus may provide suboptimal performance. In the next section, we present our neural network modeling approach that specifically accounts for higher-order item relations in user preference prediction.
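To illustrate, the sketch below computes NAIS-style attention weights with the product variant of Equation (8); the ReLU hidden activation and all parameter shapes are illustrative assumptions, not the exact configuration of the released NAIS code.

```python
import numpy as np

def nais_attention(p_i, Qh, W, b, h, beta):
    """Attention weights a_ij (product variant): a one-hidden-layer
    network scores each element-wise product p_i * q_j, then a
    beta-smoothed softmax normalizes the scores."""
    f = np.array([h @ np.maximum(W @ (p_i * q_j) + b, 0.0) for q_j in Qh])
    ef = np.exp(f)
    return ef / ef.sum() ** beta   # beta = 1 recovers the plain softmax

rng = np.random.default_rng(0)
p_i = rng.normal(size=8)
Qh = rng.normal(size=(4, 8))       # embeddings of 4 historical items
W, b, h = rng.normal(size=(8, 8)), np.zeros(8), rng.normal(size=8)
a = nais_attention(p_i, Qh, W, b, h, beta=1.0)   # sums to 1 when beta = 1
```

With `beta` below 1, the denominator is damped, so users with long histories are not over-penalized by the softmax normalization.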
3. Methods
This section elaborates our proposed methods. In Section 3.1, we first discuss the predictive model, i.e., given a user-item pair $(u, i)$, how to estimate the prediction value $\hat{y}_{ui}$. Specifically, we first present a general framework for higher-order item relation modeling with neural networks (cf. Figure 2), and then discuss two instantiations under the framework — DeepICF, which uses standard average pooling on second-order interactions, and DeepICF+a, which uses an adaptive pooling strategy with attention on second-order interactions (cf. Figure 4). In Section 3.2, we describe the learning procedure of the models. Lastly, in Section 3.4, we discuss the connections of our methods with existing models, shedding light on the rationale of our proposed methods analytically.
3.1. Model
Figure 2 illustrates our proposed framework to model higher-order item relations for ICF. The overall neural network architecture follows the design of the neural collaborative filtering (NCF) framework (He et al., 2017) with two major differences. First, in the input layer that represents a user (bottom left), distinct from NCF that applies one-hot encoding on the user's ID, we use multi-hot encoding on the user's interacted items. This naturally leads to a difference in the embedding layer — rather than using one vector to represent the user, we use a set of vectors, where each vector represents an interacted item of the user. Second, instead of designing holistic neural CF layers to model the interaction between user and item, we divide the interaction modeling into two components — 1) a pairwise interaction layer that models the interaction between each historical item and the target item, and 2) deep interaction layers that model higher-order interactions among all historical items and the target item. Next, we elaborate the architecture layer by layer.
Input and Embedding Layer. In the right channel that represents the target item $i$, one-hot encoding on the ID feature of $i$ is applied. The ID is then projected to an embedding vector $\mathbf{p}_i \in \mathbb{R}^k$ that describes the target item, where $k$ denotes the embedding size. In the left channel that represents the user $u$, multi-hot encoding on the ID features of $u$'s interacted items is applied. Then, for each historical item $j \in \mathcal{R}_u^+$, we project it to an embedding vector $\mathbf{q}_j \in \mathbb{R}^k$. As such, the output of the embedding layer is a set of vectors $\{\mathbf{q}_j \mid j \in \mathcal{R}_u^+\}$ that represents the user $u$ and a vector $\mathbf{p}_i$ that represents the target item $i$.
Note that another way to represent the target item is based on the users that have interacted with it (as used in the deep matrix factorization model (Xue et al., 2017)), which is arguably more informative but costly. Since this work focuses on ICF, which aims to exploit item relations, we leave this exploration as future work.
Besides IDs, the input of the embedding layer can easily be extended to incorporate side information, such as location, time, and item attributes. To be specific, each feature can be mapped to an ID via one-hot encoding. Thereafter, we feed them into the embedding layer to establish their embeddings, which are scaled by the feature value — 1 for discrete features (e.g., user gender and item attributes) and the real value for numerical features (e.g., click number). Since this work focuses on the pure collaborative filtering setting, we leave the incorporation of side information as future work.
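The input and embedding layers above amount to simple table lookups; a sketch with hypothetical embedding tables `Q` (historical items) and `P` (target items):

```python
import numpy as np

def embed(user_history, target_item, Q, P):
    """Embedding layer: the user's multi-hot history maps to the set of
    item embeddings {q_j}, the target item's one-hot ID maps to p_i."""
    return Q[np.array(user_history)], P[target_item]

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))   # 5 items, embedding size 4
P = rng.normal(size=(5, 4))
hist_vecs, p_i = embed([0, 2, 3], 1, Q, P)   # shapes (3, 4) and (4,)
```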
Pairwise Interaction Layer. Inspired by the effectiveness of FISM and NAIS, we adopt a similar way to explicitly model the interaction between each historical item and the target item. Specifically, we apply the element-wise product on their embedding vectors, obtaining a set of pairwise interaction vectors $\mathcal{V} = \{\mathbf{v}_{ij} = \mathbf{p}_i \odot \mathbf{q}_j \mid j \in \mathcal{R}_u^+ \setminus \{i\}\}$, which capture the second-order relations between the target item and $u$'s historically interacted items.
Theoretically speaking, other than the element-wise product, any binary function can be applied here to map $\mathbf{p}_i$ and $\mathbf{q}_j$ to one vector that encodes their interaction, for example, addition ($\mathbf{p}_i + \mathbf{q}_j$), subtraction ($\mathbf{p}_i - \mathbf{q}_j$), division ($\mathbf{p}_i \oslash \mathbf{q}_j$), and so on. Here we choose the element-wise product mainly because it generalizes the inner product to vector space (He and Chua, 2017) and can thus sufficiently capture the signal in the inner product. In FISM, the use of the inner product to capture second-order item interactions implies that the item similarity matrix has a low-rank structure, which leads to good estimation of item similarities. As such, the output vectors of this layer are supposed to encode the signal of pairwise item similarities.
Pooling Layer. Since the number of historical items of different users may vary, the output of the pairwise interaction layer will have different sizes. The pooling layer operates on the variable-size set of pairwise interaction vectors, aiming to produce a vector of fixed size to facilitate further processing. Here, we consider two choices — weighted average pooling and attention-based pooling — which lead to our two proposed methods, DeepICF and DeepICF+a.
The weighted average pooling used in DeepICF is defined as follows,

(9) $\mathbf{e} = \frac{1}{\big| \mathcal{R}_u^+ \setminus \{i\} \big|^{\alpha}} \sum_{j \in \mathcal{R}_u^+ \setminus \{i\}} \mathbf{p}_i \odot \mathbf{q}_j,$

where $\alpha$ is the normalization hyperparameter that controls the smoothing over user histories of different sizes. When $\alpha$ is set to 1, no smoothing is used and it becomes the standard average pooling; when $\alpha$ is set to 0, the operation downgrades to the standard sum pooling. Since the distribution of user activity levels may vary across datasets, there is no uniformly optimal setting for $\alpha$, and it should be tuned separately for each dataset. However, regardless of the value of $\alpha$, all historical items of a user contribute equally to the prediction on all target items, which is an unrealistic assumption, as argued in (He et al., 2018). Typically, only a few items that a user interacted with before will affect the user's decision on an item. For example, when a user decides whether to purchase a phone cover, the phones he purchased before should have a larger impact than cameras or clothing products. As such, when modeling the interaction between historical items and the target item, non-uniform weights should be applied to the historical items. Besides, we have tried to assign a contribution weight to each historical item of user $u$, but the performance of such a design is not significantly improved; as such, we do not further explore this extension.

The attention-based pooling used in DeepICF+a is designed to address the above-mentioned limitation of DeepICF. Inspired by the attention network design in ACF (Chen et al., 2017b) and NAIS (He et al., 2018), we define the attention-based pooling as follows,
(10) $\mathbf{e} = \sum_{j \in \mathcal{R}_u^+ \setminus \{i\}} a(\mathbf{v}_{ij})\, \mathbf{v}_{ij}, \quad \mathbf{v}_{ij} = \mathbf{p}_i \odot \mathbf{q}_j,$

where $a(\cdot)$ is the attention function that takes a vector $\mathbf{v}$ as input and outputs the importance of $\mathbf{v}$ in the weighted average pooling. Figure 3 illustrates the structure of the attention network. Specifically, we use a multi-layer perceptron with one hidden layer to parameterize the attention function:

(11) $a(\mathbf{v}_{ij}) = \text{softmax}_\beta(f_{ij}), \quad f_{ij} = \mathbf{h}^T \text{ReLU}\big( \mathbf{W} \mathbf{v}_{ij} + \mathbf{b} \big),$

where $\mathbf{W} \in \mathbb{R}^{d_a \times k}$ and $\mathbf{b} \in \mathbb{R}^{d_a}$ denote the weight matrix and bias vector of the attention network, respectively, and $d_a$ denotes the size of the hidden layer, also called the attention size. $\mathbf{h} \in \mathbb{R}^{d_a}$ denotes the weights of the output layer of the attention network. $\text{softmax}_\beta$ is a variant of the softmax function to normalize the attentive weights (He et al., 2018), defined as,

(12) $\text{softmax}_\beta(f_{ij}) = \frac{\exp(f_{ij})}{\Big[ \sum_{j' \in \mathcal{R}_u^+ \setminus \{i\}} \exp(f_{ij'}) \Big]^{\beta}},$

where $\beta$ is a hyperparameter to smooth the value of the denominator in the softmax. Note that tuning $\beta$ has a similar effect as tuning $\alpha$, since both hyperparameters can regulate the weights of second-order item interactions for users of different history lengths. In our experiments, we find that with a proper tuning of $\beta$ (in the range of 0 to 1), setting $\alpha$ to 0 leads to satisfactory performance. Thus the normalization term can be omitted in DeepICF+a.
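Putting Equations (10)–(12) together, the following sketch computes the attention-based pooling for one user-item pair; the parameter shapes and toy inputs are assumptions for illustration, not values from the released model.

```python
import numpy as np

def attentive_pooling(p_i, Qh, W, b, h, beta):
    """Attention-based pooling of DeepICF+a: score each pairwise
    interaction vector v_ij = p_i * q_j with a one-hidden-layer
    attention network, normalize the scores with a beta-smoothed
    softmax, and return the weighted sum of the v_ij."""
    V = p_i * Qh                                   # (n, k) pairwise vectors
    f = np.array([h @ np.maximum(W @ v + b, 0.0) for v in V])
    a = np.exp(f) / np.exp(f).sum() ** beta        # smoothed softmax weights
    return (a[:, None] * V).sum(axis=0)            # fixed-size vector e
```

With zero attention parameters the scores tie, so at `beta=1` the pooling degenerates to a plain average of the pairwise vectors, which makes the role of the attention network easy to check.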
Deep Interaction Layers. The output of the previous pooling layer is a vector of dimension $k$, which condenses the second-order interactions between the historical items and the target item. Let this vector be $\mathbf{e}$ (i.e., Equation (9) for DeepICF and Equation (10) for DeepICF+a). Next, we consider how to capture higher-order interactions among items on the basis of $\mathbf{e}$. Inspired by our recent development of neural factorization machines (NFM) (He and Chua, 2017), we propose to stack a multi-layer perceptron (MLP) above $\mathbf{e}$ to achieve the higher-order modeling. The rationale is quite similar — by treating the historical items and the target item as features input to NFM, the vector $\mathbf{e}$ plays the same role as the output vector of the bi-interaction layer in NFM (i.e., it encodes similar semantics of pairwise interactions between feature embeddings). Analogously, the MLP above $\mathbf{e}$ is capable of capturing higher-order interactions among feature embeddings. We refer interested readers to the NFM paper (He and Chua, 2017) and the Deep Crossing paper (Shan et al., 2016) for more analysis on how the use of an MLP can capture higher-order interactions among features.
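A minimal sketch of such an MLP stack over the pooled vector; the tower configuration (each hidden layer half the size of the previous one) and the random parameters are assumptions for illustration.

```python
import numpy as np

def deep_interaction(e, weights, biases):
    """Stack of fully connected ReLU layers over the pooled
    interaction vector e, producing the higher-order representation."""
    for W, b in zip(weights, biases):
        e = np.maximum(W @ e + b, 0.0)   # ReLU activation per layer
    return e

k = 8                                     # pooled vector dimension
rng = np.random.default_rng(1)
weights = [rng.normal(size=(4, 8)),       # tower structure: 8 -> 4 -> 2
           rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
out = deep_interaction(rng.normal(size=k), weights, biases)
```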
We give the formal definition of the deep interaction layers as follows,

(13) $\mathbf{e}_1 = \sigma_1(\mathbf{W}_1 \mathbf{e} + \mathbf{b}_1), \quad \mathbf{e}_2 = \sigma_2(\mathbf{W}_2 \mathbf{e}_1 + \mathbf{b}_2), \quad \ldots, \quad \mathbf{e}_L = \sigma_L(\mathbf{W}_L \mathbf{e}_{L-1} + \mathbf{b}_L),$

where $\mathbf{W}_l$, $\mathbf{b}_l$, $\sigma_l$, and $\mathbf{e}_l$ denote the weight matrix, bias vector, activation function, and output vector of the $l$-th hidden layer, respectively. We use the rectifier (ReLU) as the activation function, which is known to be more resistant to the saturation issue when the network becomes deep, and empirically shows good performance in our setting. The size of each hidden layer is subject to tuning, and we adopt the conventional choice of a tower structure. We will report the detailed settings and the hyperparameter tuning process in Section 4.1.

Prediction Layer. As the output of the deep interaction layers, the vector $\mathbf{e}_L$ encodes informative prediction signal aggregated from second-order to higher-order item interactions. Since a multi-layer nonlinear network is able to fit any continuous function in theory (Hornik et al., 1989), each dimension in $\mathbf{e}_L$ is supposed to encode item interactions of any order. We then project $\mathbf{e}_L$ to the prediction score with a simple linear regression model:

(14) $\hat{y}_{ui} = \mathbf{z}^T \mathbf{e}_L + b_u + b_i,$

where $\mathbf{z}$, $b_u$, and $b_i$ denote the weight vector, user bias, and item bias of the prediction layer, respectively. The two bias terms are to capture the variance in the popularity of different items and the activity of different users, which were found to have an impact when learning from implicit feedback (Kabbur et al., 2013). Each element in $\mathbf{z}$ measures the importance of the corresponding dimension in $\mathbf{e}_L$ for prediction. Here we make $\mathbf{z}$ a global parameter shared by all predictions. We note that a more fine-grained design is to make it item-aware or user-aware or both; however, that may also increase the model complexity and make the model more difficult to train. We leave this exploration as future work, since we find the current global setting leads to satisfactory performance.

3.2. Learning
Two mainstream methods for learning recommender models are to optimize pointwise (He et al., 2017; Bayer et al., 2017; Li et al., 2015) and pairwise (Rendle et al., 2009; Wang et al., 2017; Zhang et al., 2016) learning-to-rank objective functions. Focusing on implicit feedback, pointwise methods typically assign a predefined target value to observed user-item entries (i.e., positive examples) and sampled non-observed entries (i.e., negative examples), training model parameters to output values close to the target values for both positive and negative examples. By contrast, pairwise methods assume that observed entries should have higher prediction scores than non-observed ones, performing optimization on the margin between positive and negative examples. In our knowledge and experience, there is no permanent winner between the two learning types, and the performance depends largely on the predictive model and the dataset (see (Wu et al., 2016; Kabbur et al., 2013) for more empirical evidence).
In this work, we opt for the pointwise log loss, which has been widely used for optimizing neural recommender models recently and demonstrated good performance (He et al., 2017; Bai et al., 2017; Xue et al., 2017). It casts the learning as a binary classification task, minimizing the objective function as follows,
(15) $\mathcal{L} = -\sum_{(u,i)\in\mathcal{R}^{+}} \log \sigma(\hat{y}_{ui}) \;-\; \sum_{(u,j)\in\mathcal{R}^{-}} \log\big(1-\sigma(\hat{y}_{uj})\big) \;+\; \lambda \lVert\Theta\rVert^{2}$

where $\sigma(\cdot)$ is the sigmoid function that restricts the prediction to be in $(0,1)$. The set $\mathcal{R}^{+}$ denotes the positive examples, which are identical to the observed user-item entries, and $\mathcal{R}^{-}$ denotes the set of negative examples, which are sampled from the non-observed user-item entries. For each positive example $(u,i)$, we sample $\rho$ negative examples to pair with it, where $\rho$ is the negative sampling ratio. Consistent with previous findings on NCF models (He et al., 2017), we find that the negative sampling ratio plays an important role in the performance of our DeepICF methods; a default setting of $\rho = 4$ leads to good performance in most cases (empirical results are shown in Figure 10 in Section 4). The hyperparameter $\lambda$ controls the strength of regularization on the model parameters $\Theta$ to prevent overfitting. Due to the use of a fully connected MLP, the deep interaction layers of the DeepICF methods are prone to overfitting, so we mainly tune $\lambda$ for the weight matrices in the deep interaction layers.

Pretraining. Due to the nonlinearity of deep neural network models and the non-convexity of the learning problem, gradient descent can easily be trapped in poor local optima. As such, model initialization plays an important role in a model's generalization performance (Erhan et al., 2010). We empirically find that our models suffer from slow convergence and poor performance when all model parameters are initialized randomly. To address these optimization difficulties and fully explore the potential of the DeepICF models, we pretrain them with FISM. Specifically, we use the item embedding vectors learned by FISM to initialize the embedding layer of both DeepICF models. With such a meaningful initialization of the embedding layer, the convergence and performance of DeepICF are greatly improved, even when the other parameters are randomly initialized from a Gaussian distribution.
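As a concrete, deliberately simplified sketch of this learning setup, the following NumPy snippet computes the pointwise log loss of Equation (15) over positive and sampled negative prediction scores; the function and variable names are our own illustration, not the authors' released code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pointwise_log_loss(pos_scores, neg_scores, params=(), lam=0.0):
    """Binary cross-entropy of Eq. (15): push predictions for observed
    entries toward 1 and for sampled non-observed entries toward 0,
    plus an L2 regularization term on the model parameters."""
    loss = -np.sum(np.log(sigmoid(pos_scores)))        # observed user-item entries
    loss -= np.sum(np.log(1.0 - sigmoid(neg_scores)))  # sampled negative entries
    loss += lam * sum(np.sum(p ** 2) for p in params)  # regularization
    return float(loss)
```

With well-separated scores the loss approaches zero, e.g. `pointwise_log_loss(np.array([8.0]), np.array([-8.0]))` is close to 0.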
3.3. Time Complexity Analysis
In this subsection, we analyze the time complexity of DeepICF and DeepICF+a, which directly reflects their cost at testing time. First, the time complexity of evaluating a prediction with FISM (cf. Equation (6)) is $O(d\,|\mathcal{R}_u^{+}|)$, where $d$ denotes the embedding size and $|\mathcal{R}_u^{+}|$ denotes the number of historical items interacted by user $u$. Compared to FISM, the additional cost of making a prediction with DeepICF is caused by the hidden layers. For the $l$-th hidden layer, the multiplication between matrices and vectors is the main operation, which can be done in $O(d_{l-1} d_l)$, where $d_l$ denotes the size of the $l$-th hidden layer and $d_0 = d$. The prediction layer only involves the inner product of two vectors, for which the complexity is $O(d_L)$. As such, the overall time complexity of evaluating a DeepICF prediction is $O(d\,|\mathcal{R}_u^{+}| + \sum_{l=1}^{L} d_{l-1} d_l)$. As reported in (He et al., 2018), the time complexity of the NAIS model is $O(k d\,|\mathcal{R}_u^{+}|)$, where $k$ denotes the attention factor. For DeepICF+a, the additional cost over NAIS comes from the fully connected layers; the overall cost of evaluating a prediction with DeepICF+a is therefore $O(k d\,|\mathcal{R}_u^{+}| + \sum_{l=1}^{L} d_{l-1} d_l)$.
3.4. Connections with Other Models
It is worthwhile to point out that FISM can be interpreted as a special case of our proposed DeepICF model. We (i) remove the hidden layers (i.e., set the layer depth $L = 0$), so that the pooled vector $\mathbf{e}$ in Equation (13) is used directly, and (ii) project the vector $\mathbf{e}$ into the prediction layer as,

(16) $\hat{y}_{ui} = \mathbf{w}^{T}\mathbf{e} + b_u + b_i$

where $\mathbf{w}$ denotes the weight vector of the prediction layer. If we further set $\mathbf{w}$ as the all-one vector, DeepICF exactly recovers the FISM model. Clearly, the deep, nonlinear architecture of hidden layers enables DeepICF to investigate higher-order, nonlinear feature interactions, whereas the linear modeling limits the capacity of FISM.
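To make this reduction concrete, the following numeric check (with made-up random embeddings; names are ours) verifies that an all-one prediction weight over the normalized pooled element-wise products reproduces FISM's normalized sum of inner products:

```python
import numpy as np

rng = np.random.default_rng(42)
d = 8
p_i = rng.normal(size=d)            # target-item embedding
Q = rng.normal(size=(5, d))         # embeddings of 5 historical items
alpha = 0.5
coef = len(Q) ** -alpha             # history-size normalization

# FISM: normalized sum of inner products between target and history items
fism = coef * sum(p_i @ q_j for q_j in Q)

# DeepICF with zero hidden layers and an all-one prediction weight vector
e = coef * (Q * p_i).sum(axis=0)    # pooled element-wise products
deepicf = np.ones(d) @ e

assert np.allclose(fism, deepicf)   # the two scores coincide
```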
Analogously, NAIS is a special case of our proposed DeepICF+a. In particular, if we (i) set $\mathbf{e}$ as the output of the attention-based pooling in Equation (10) and (ii) feed the vector $\mathbf{e}$ into the prediction layer as,

(17) $\hat{y}_{ui} = \mathbf{w}^{T}\mathbf{e} + b_u + b_i$

then DeepICF+a recovers the NAIS model. Clearly, by taking advantage of nonlinear hidden layers, DeepICF+a not only identifies the importance of feature interactions but also models higher-order feature dependencies, which NAIS fails to capture.
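The analogous check for the attention-based case (again with made-up random values, and attention weights simply assumed to be given rather than produced by the attention network):

```python
import numpy as np

rng = np.random.default_rng(7)
d, n = 8, 5
p_i = rng.normal(size=d)          # target-item embedding
Q = rng.normal(size=(n, d))       # historical-item embeddings
a = rng.random(n)                 # attention weights a_ij (assumed given)

# NAIS-style score: attention-weighted sum of inner products
nais = sum(a[j] * (p_i @ Q[j]) for j in range(n))

# DeepICF+a with hidden layers removed and an all-one weight vector
e = (a[:, None] * (Q * p_i)).sum(axis=0)   # attention-based pooling
score = np.ones(d) @ e

assert np.allclose(nais, score)   # the two scores coincide
```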
4. Experiments
In this section, we conduct extensive experiments on two publicly accessible datasets to answer the following research questions, which aim to verify the effectiveness of our proposed methods:
RQ1: How do our proposed models (DeepICF and DeepICF+a) perform compared with state-of-the-art recommender models?
RQ2: How do the key hyperparameter settings influence the performance of our DeepICF models?
RQ3: Are deeper hidden layers useful for capturing higher-order, nonlinear interactions between items and enhancing the expressiveness of FISM?
Hereinafter, we first describe the experimental settings and then answer the aforementioned questions one by one.
4.1. Experimental Settings
Dataset Description. We evaluate our proposed methods on two real-world datasets. MovieLens is a movie rating dataset that has been used extensively to evaluate CF algorithms. In our experiments, we use the version with one million ratings, in which each user has at least 20 ratings. Pinterest is a dataset constructed for content-based image recommendation. The original Pinterest data is extremely sparse; following the MovieLens processing, we filter it so that each user retains at least 20 interactions, making it easier to evaluate CF algorithms. Both datasets are publicly accessible (MovieLens: https://grouplens.org/datasets/movielens/1m/; Pinterest: https://sites.google.com/site/xueatalphabeta/academicprojects). The detailed characteristics of the two datasets are summarized in Table 1.
Dataset  Interactions  Items  Users  Density
MovieLens  1,000,209  3,706  6,040  4.47%
Pinterest  1,500,809  9,916  55,187  0.27%
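As a quick sanity check of the Density column, density is the fraction of the user-item matrix that is observed (assuming the standard statistics of these datasets: 6,040 users and 3,706 items for MovieLens-1M, 55,187 users and 9,916 items for the Pinterest subset):

```python
# Density = #interactions / (#users * #items), reported as a percentage
datasets = {
    "MovieLens": (1_000_209, 6_040, 3_706),
    "Pinterest": (1_500_809, 55_187, 9_916),
}
for name, (interactions, users, items) in datasets.items():
    density = 100.0 * interactions / (users * items)
    print(f"{name}: {density:.2f}%")   # MovieLens: 4.47%, Pinterest: 0.27%
```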
Evaluation Protocols. We employ the widely used leave-one-out evaluation protocol (Rendle et al., 2009; He et al., 2017) to study item recommendation performance. We first sort the user-item interactions of each user by timestamp, then hold out each user's latest interaction as the test data and use the remaining interactions for training. Following (Koren, 2008; He et al., 2017), for each test item (positive instance) we randomly sample 99 items that the corresponding user has not interacted with (negative instances) and rank the test item among these 100 items. This alleviates the time-consuming ranking of all items for each user during evaluation. As evaluation metrics, we adopt the Hit Ratio at rank k (HR@k) (He et al., 2017) and the Normalized Discounted Cumulative Gain at rank k (NDCG@k) (He and Chua, 2017; Arampatzis and Kalamatianos, 2018; Cao et al., 2017; He and Liu, 2017), with k = 10 for both metrics. HR@10 intuitively measures whether the test item is present in the top-10 ranked list, while NDCG@10 measures ranking quality by assigning higher scores to hits at higher positions. We report the average score of each metric over all users; higher values indicate better recommendation performance.

Baselines. To evaluate the efficacy of our proposed models, we also study the performance of the following approaches:
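The two metrics above can be sketched as follows; this is our own illustrative implementation, assuming a single held-out relevant item per user, as in the leave-one-out protocol:

```python
import numpy as np

def hr_at_k(ranked_items, test_item, k=10):
    """Hit Ratio: 1 if the held-out item appears in the top-k list."""
    return int(test_item in list(ranked_items)[:k])

def ndcg_at_k(ranked_items, test_item, k=10):
    """NDCG with a single relevant item: 1/log2(rank+2) if hit, else 0."""
    topk = list(ranked_items)[:k]
    if test_item not in topk:
        return 0.0
    rank = topk.index(test_item)       # 0-based position of the hit
    return 1.0 / np.log2(rank + 2)
```

A hit at the very top gives NDCG 1.0, and the score decays logarithmically with the hit position, matching the intuition in the text.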
ItemPop. This non-personalized method ranks items by their popularity, measured by the number of interactions.
ItemKNN (Sarwar et al., 2001). Item-based k-nearest-neighbor (ItemKNN) is the standard item-based CF approach shown in Equation (1). In the experiments, we tested different numbers of nearest item neighbors and found that using all neighbors provides the best performance.
HOSLIM (Christakopoulou and Karypis, 2014). This model extends SLIM to capture higher-order item relations, learning two sparse aggregation coefficient matrices that capture item-item and itemset-item similarities.
YouTube Rec (Covington et al., 2016). A deep neural network architecture for recommending YouTube videos. It maps a sequence of video IDs to a sequence of embeddings, which are simply averaged and fed into a feedforward neural network of several fully connected Rectified Linear Unit (ReLU) layers.
BPR (Rendle et al., 2009). This approach optimizes the MF model with the pairwise Bayesian Personalized Ranking loss to learn from implicit feedback data. It is a commonly used baseline for item recommendation.
eALS (He et al., 2016). This is a state-of-the-art MF approach for item recommendation with a pointwise regression loss; it treats all non-observed interactions as negative instances and weights them by the corresponding item's popularity.
MLP (He et al., 2017). This method uses a multilayer perceptron instead of the simple inner product to learn the nonlinear interactions between users and items from data, optimized with a pointwise log loss. The results reported in later sections are obtained with an MLP of three hidden layers.
FISM (Kabbur et al., 2013). This is the state-of-the-art item-based CF method. We experimented with $\alpha$ in the range from 0 to 1 with a step size of 0.1, and found that setting $\alpha = 0$ brings about the best results.
The above approaches cover a wide range of recommendation methods: ItemKNN and FISM represent conventional item-based CF methods, against which we verify the effectiveness of our proposed deep models; BPR and eALS are two competitive user-based methods for implicit feedback; MLP is a recently proposed CF approach based on deep neural networks. In this paper, we primarily focus on single CF models; hence, we do not compare with NeuMF, an ensemble model that combines MF with MLP.
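For reference, the heuristic similarity used by the ItemKNN baseline can be sketched as follows; this is a minimal NumPy illustration with our own function names, not the exact implementation used in the experiments:

```python
import numpy as np

def cosine_item_sim(R):
    """Item-item cosine similarities from a binary user-item matrix R (users x items)."""
    norms = np.linalg.norm(R, axis=0) + 1e-12   # column (item) norms
    return (R.T @ R) / np.outer(norms, norms)

def itemknn_score(R, S, user, item):
    """Relevance of `item` for `user`: sum of similarities to the user's history."""
    history = np.flatnonzero(R[user])           # indices of interacted items
    return float(S[history, item].sum())
```

Using all neighbors, as the ItemKNN baseline does, corresponds to summing over the full history rather than the top-k most similar items.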
Parameter Settings. To avoid overfitting, we tuned the regularization coefficient $\lambda$ for each learning-based approach. For the embedding size $d$, we evaluated several values in our experiments. For a fair comparison, we trained FISM by optimizing the same binary cross-entropy objective with the Adagrad optimizer. For our DeepICF models, we initialized the item embeddings with FISM embeddings, which resulted in better performance and faster convergence, and randomly initialized the other model parameters with a Gaussian distribution of mean 0 and standard deviation 0.01. The learning rate and the negative sampling ratio $\rho$ were tuned on each dataset. The smoothing hyperparameter was set to the value that achieved the best results in (He et al., 2018). Unless otherwise stated, we used three hidden layers for the MLP structure. We implemented our DeepICF models with TensorFlow (https://github.com/AaronHeee/NeuralAttentiveItemSimilarityModel); the code will be released publicly upon acceptance.

4.2. Performance Comparison (RQ1)
We first compare our proposed models with the other item recommendation approaches. For a fair comparison, the embedding size is set to 16 for all embedding-based approaches (YouTube Rec, MLP, BPR, eALS, FISM, DeepICF, and DeepICF+a); in the next subsection, we vary the embedding size of these approaches to observe the corresponding performance trends. Table 2 reports the HR@10 and NDCG@10 of all compared methods.
First, our DeepICF and DeepICF+a provide the best performance (the highest HR and NDCG scores) on both datasets, significantly outperforming the state-of-the-art item-based method FISM. We attribute these improvements to the effective learning of higher-order item interactions with deep neural networks and to the attention mechanism that differentiates the importance of historical items in users' representations. Furthermore, we conducted one-sample t-tests to verify that all improvements are statistically significant with p-value < 0.05. Second, the learning-based methods produce more accurate recommendations than the heuristic-based methods ItemPop and ItemKNN. In particular, HOSLIM captures only linear higher-order relations by learning the similarity between an itemset and an item, which limits its performance; moreover, its itemsets are generated by a frequent-itemset mining algorithm whose support threshold is nontrivial to tune across datasets. Table 2 shows that our DeepICF and DeepICF+a significantly outperform HOSLIM on both datasets, which demonstrates the importance of modeling the nonlinear relationships among higher-order item interactions. FISM significantly exceeds its counterpart ItemKNN, with relative improvements of about 6.1% and 18.3% in terms of HR and NDCG on MovieLens. Given that the key difference between these two kinds of approaches lies in how item similarity is estimated, we conclude that it is important to optimize item similarities for the recommendation task. Lastly, there is no absolute winner between the user-based CF methods (BPR, eALS, and MLP) and the item-based method FISM. YouTube Rec performs roughly the same as DeepICF on MovieLens, weaker than DeepICF+a, but worse on Pinterest. One reason is that Pinterest has no time information, while YouTube Rec relies on the chronological order of browsing history. Another reason may lie in the manner of modeling item relations: both YouTube Rec and DeepICF represent a user by her historically interacted items, which enriches the input of representation learning, but DeepICF uses the designed "deep interaction layers" to capture higher-order, nonlinear feature interactions between any two historical items, whereas YouTube Rec only combines the features of historical items via mean pooling, which does not explicitly capture feature interactions. In particular, the user-based CF methods yield better performance than FISM on MovieLens, while FISM exceeds them on Pinterest.
This suggests that item-based CF models are more advantageous than user-based CF models on highly sparse datasets.

Dataset  MovieLens  Pinterest
Methods  HR@10  NDCG@10  p-value  HR@10  NDCG@10  p-value
ItemPop  0.4558  0.2556  9.0e3  0.2742  0.1410  6.8e4 
ItemKNN  0.6300  0.3341  1.3e2  0.7565  0.5207  3.9e4 
HOSLIM  0.6851  0.4238  9.6e6  0.8655  0.5551  1.5e3 
Youtube Rec  0.6874  0.4288  1.2e3  0.8634  0.5394  6.9e3 
MLP  0.6841  0.4103  1.6e3  0.8648  0.5385  3.7e2 
BPR  0.6674  0.3907  1.2e3  0.8628  0.5406  6.4e4 
eALS  0.6689  0.3977  2.8e3  0.8755  0.5449  5.8e3 
FISM  0.6685  0.3954  5.4e3  0.8763  0.5529  8.0e5 
DeepICF  0.6881  0.4113    0.8806  0.5631   
DeepICF+a  0.7084  0.4380    0.8835  0.5666   
Figure 5. Testing performance of FISM, DeepICF, and DeepICF+a at embedding size 16 in each epoch.
Figure 5 shows the state of DeepICF, DeepICF+a, and FISM at embedding size 16 on both datasets over the first 50 epochs, from which we can clearly observe the effectiveness of our proposed models. In particular, the pretrained DeepICF and DeepICF+a significantly exceed FISM already in the first epoch, and the results improve further as training proceeds. Upon convergence, our two DeepICF methods attain relative improvements of 4.8% and 6.6% over FISM in terms of NDCG on MovieLens and Pinterest, respectively. These promising results verify the key arguments of our work: higher-order relations between items can be better modeled with deep neural networks, and a user's interacted items should be weighted differently in contributing to her preference.
Table 3. Prediction scores (after sigmoid) of DeepICF+a and DeepICF on the sampled target items.

MovieLens
Target Users  268  1188
Target Items  836  1525  323  806  39  1356  1549  918  362  2222
DeepICF+a  0.51  0.52  0.56  0.80  0.70  0.64  0.52  0.70  0.62  0.67
DeepICF  0.19  0.20  0.37  0.61  0.53  0.31  0.37  0.48  0.42  0.57

Pinterest
Target Users  268  1188
Target Items  346  497  2441  3779  3782  6092  6769  6802  6803  6809
DeepICF+a  0.32  0.66  0.41  0.34  0.72  0.49  0.52  0.17  0.63  0.69
DeepICF  0.18  0.62  0.22  0.30  0.68  0.39  0.50  0.10  0.39  0.59
4.2.1. Explainability
To demonstrate the explainability of our enhanced model DeepICF+a, which introduces attention weights to differentiate the contributions of a user's historical items to the final prediction, we sampled two users from each of MovieLens and Pinterest; the sampled user-item pairs are positive examples, so their predicted scores should be high. For each user, we select five historically interacted items. Figure 6 visualizes the attention weights learned by DeepICF+a, where a row denotes a historically interacted item of a sampled user and a column denotes a target item. The two heat maps on the left present the attention scores over the five selected items of the two users sampled from MovieLens; the two heat maps on the right show the same for Pinterest.
Take user 1188 sampled from MovieLens and the corresponding target item 1549 as an example. We can clearly see that DeepICF weights all the historical items (items 43, 127, 1106, 495, and 769 in this case) of user 1188 equally, while DeepICF+a assigns different weights to the five historical items. More concretely, DeepICF+a assigns higher attention scores to items 495 and 43 and relatively lower scores to the other three. As shown in Table 3, the prediction score (after sigmoid) of DeepICF+a on item 1549 in MovieLens is 0.52, higher than the score of 0.37 predicted by DeepICF for the same target item. To explain this, we looked into these movies in MovieLens: movies 1549, 495, and 43 are all romance dramas, while movies 769 and 1106 are documentaries. Although movie 127 is also a drama, its attention weight is quite low, presumably because its main content concerns social enslavement rather than romance. It is reasonable that, when predicting a user's preference for a target item, her historical items of a similar category should have more impact on the prediction than irrelevant ones. DeepICF+a accordingly predicts a higher score, as expected. These results verify the usefulness of introducing the attention mechanism into our proposed DeepICF model.
4.2.2. Utility of Pretraining
Parameter initialization is important for deep learning models, as it affects their convergence and final performance (Erhan et al., 2010; He et al., 2017). Owing to the non-convexity of our DeepICF methods' objective functions, we compare both DeepICF methods with and without pretraining to verify the utility of pretrained item embeddings (i.e., using the item embeddings learned by FISM to initialize the corresponding item embeddings of the DeepICF models). For the versions without pretraining, we used the Adagrad optimizer with randomly initialized item embeddings. For DeepICF and DeepICF+a with pretraining, we first run FISM until convergence and then use its item embeddings to initialize the corresponding item embeddings of DeepICF and DeepICF+a. As Figure 7 shows, both DeepICF models initialized with FISM embeddings provide much better recommendation performance than the ones without pretraining. For instance, the relative improvements of DeepICF with pretraining over the one trained from random initialization at embedding size 16 are 1.3% and 0.8% in terms of HR on MovieLens and Pinterest, respectively. Furthermore, the DeepICF models initialized with FISM embeddings converge much faster than the ones without pretraining. These results demonstrate the efficacy of pretraining for both DeepICF models.
4.3. Sensitivity to Hyperparameters (RQ2)
In this study, we investigate the impact of different values of the normalization hyperparameter $\alpha$ and of different negative sampling ratios on the performance of both DeepICF models. In addition, since the embedding size is a critical hyperparameter for embedding-based models, we also examine its influence on performance trends.
4.3.1. Effect of Normalization Coefficient $\alpha$
Figure 8 shows the performance of both DeepICF methods with respect to the normalization hyperparameter $\alpha$. Keeping the other parameters constant, we conducted a full parameter study over different values of $\alpha$. The HR and NDCG of FISM decrease gradually as $\alpha$ increases in steps of 0.1. For DeepICF, the best result on MovieLens appears when $\alpha$ is in the range 0.4 to 0.5, and DeepICF exceeds FISM regardless of the setting of $\alpha$; on Pinterest, the best result is obtained when $\alpha$ is in the range 0.5 to 1, and DeepICF outperforms FISM once $\alpha$ reaches 0.2. The enhanced version, DeepICF+a, achieves its best performance on both datasets when $\alpha$ is set to 0 and outperforms DeepICF. We attribute these improvements to the introduction of the attention mechanism into DeepICF.
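For reference, $\alpha$ enters the model through the normalized pooling over the historical items' element-wise products with the target item (notation as assumed in Section 3):

```latex
\mathbf{e}_{ui} \;=\; \frac{1}{|\mathcal{R}_u^{+}|^{\alpha}}
\sum_{j \in \mathcal{R}_u^{+} \setminus \{i\}} \mathbf{p}_i \odot \mathbf{q}_j
```

With $\alpha = 0$ the history size is ignored (a pure summation), while $\alpha = 1$ fully averages over the history; intermediate values trade off between active and inactive users.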
4.3.2. Effect of Item Embedding Size
Figure 9 shows HR and NDCG with respect to the embedding size. The performance trends at embedding sizes 8, 32, and 64 are generally similar to those at size 16. Our proposed DeepICF outperforms all the other methods in most circumstances, except at embedding size 8, where MLP achieves better performance than DeepICF on MovieLens. We argue that on the relatively dense MovieLens dataset (compared to Pinterest), user-based nonlinear methods (MLP in this case) can learn stronger representations at small embedding sizes. Our enhanced model DeepICF+a offers the best performance and compensates for DeepICF's weakness at small embedding sizes on dense datasets.
4.3.3. Effect of Negative Instance Sampling
To illustrate the impact of negative instance sampling on DeepICF and DeepICF+a, Figure 10 reports the results of both DeepICF methods with different negative sampling ratios. Both DeepICF models outperform FISM even when sampling just one negative instance per positive instance, and sampling more negative instances further improves both DeepICF and the enhanced DeepICF+a. For both datasets, the optimal number of negative samples per positive instance for the DeepICF models is around 4, consistent with the findings of (He et al., 2017).
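The sampling procedure varied here can be sketched as follows; this is our own minimal illustration (uniform sampling over non-observed items, with `rho` as the negative sampling ratio), not the authors' exact implementation:

```python
import numpy as np

def sample_negatives(user_items, n_items, rho, rng=np.random.default_rng(0)):
    """Draw rho non-observed items per positive interaction of a user."""
    observed = set(user_items)
    negatives = []
    for _ in user_items:                 # one group of negatives per positive
        drawn = []
        while len(drawn) < rho:
            j = int(rng.integers(n_items))
            if j not in observed:        # reject items the user interacted with
                drawn.append(j)
        negatives.append(drawn)
    return negatives
```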
4.4. Depth of Hidden Layer in Network (RQ3)
Our proposed models capture higher-order, nonlinear relations between items through neural networks with deep hidden layers, which play a vital role in learning higher-order item interactions in a nonlinear way. As there is relatively little work on learning the complex interaction function between items with deep neural networks, it is interesting to see whether a deep architecture indeed helps model higher-order relations nonlinearly for quality prediction. In the final part of the experiments, we further investigate DeepICF with different numbers of hidden layers, omitting DeepICF+a here owing to space limitations.
The experimental results are provided in Table 4. DeepICF-4, for instance, refers to the DeepICF model with four hidden layers; the other notations follow the same convention. As we can see, stacking more nonlinear hidden layers is, to some extent, beneficial for DeepICF to better capture the higher-order relations between items and thus yields promising performance. This result is highly encouraging and indicates the effectiveness of a deep architecture for learning complex item relations. We credit this advance to the higher-order, nonlinear item relations captured by the stacked nonlinear layers. To verify this, we also tried stacking linear layers, replacing ReLU with the identity function as the activation of the hidden layers; the resulting performance was much worse than with ReLU. This evidences the necessity of learning higher-order item interactions with nonlinear functions. Owing to space limitations, we omit the results with the identity activation.
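A minimal sketch of the stacked nonlinear layers studied here (illustrative names and shapes of our own; replacing `np.maximum(0.0, ...)` with the identity yields the linear variant discussed above):

```python
import numpy as np

def deep_interaction_tower(e, weights, biases):
    """Forward pass through L fully connected ReLU layers over the
    pooled interaction vector e, as in the deep interaction layers."""
    h = e
    for W, b in zip(weights, biases):
        h = np.maximum(0.0, h @ W + b)   # ReLU nonlinearity per layer
    return h
```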
Embedding Size  DeepICF-1  DeepICF-2  DeepICF-3  DeepICF-4
  HR  NDCG  HR  NDCG  HR  NDCG  HR  NDCG
MovieLens  
8  0.6424  0.3707  0.6444  0.3740  0.6444  0.3741  0.6444  0.3743 
16  0.6833  0.4081  0.6854  0.4086  0.6881  0.4113  0.6884  0.4107 
32  0.7022  0.4236  0.7051  0.4278  0.7048  0.4276  0.7050  0.4320 
64  0.7116  0.4327  0.7131  0.4386  0.7156  0.4362  0.7124  0.4388 
Pinterest
8  0.8679  0.5379  0.8692  0.5415  0.8705  0.5454  0.8719  0.5458
16  0.8804  0.5547  0.8792  0.5607  0.8806  0.5631  0.8810  0.5608 
32  0.8844  0.5669  0.8852  0.5654  0.8857  0.5680  0.8844  0.5692 
64  0.8858  0.5708  0.8865  0.5691  0.8865  0.5720  0.8870  0.5742 
5. Related Work
The core of a personalized recommender system lies in collaborative filtering, namely, modeling users' preferences over items from their historical interactions. In this section, we briefly review the related literature on collaborative filtering from the following three aspects.
5.1. Userbased Collaborative Filtering Models
UCF has been extensively investigated in both academia and industry. The UCF task with explicit feedback (e.g., user ratings), which directly reflects users' preferences on items, is usually formulated as a rating prediction problem (Koren, 2008; Sarwar et al., 2001): the target is to minimize the overall error between the known ratings and the corresponding predicted scores. Among various UCF approaches, matrix factorization has been the most widely adopted model due to its simplicity and effectiveness. Biased MF was proposed to further enhance traditional MF for rating prediction. The works in (McAuley and Leskovec, 2013; Wang et al., 2017; Lian et al., 2018; Liao et al., 2018; Shi et al., 2017; Sun et al., 2017) introduced extra information such as review texts and social relations into MF to address the rating sparsity issue. Among the numerous MF-based approaches, SVD++ has proven to be the best single model for fitting user ratings; it factorizes the user-item rating matrix together with implicit feedback (Koren, 2008) and has been followed by many techniques for recommender systems (Rendle et al., 2009; He and McAuley, 2016; Wang et al., 2018a).
Since user-item interactions in many recommender systems are implicit feedback (e.g., views, clicks) rather than explicit ratings, many approaches have been proposed on the basis of implicit feedback (He and McAuley, 2016; Kabbur et al., 2013; He et al., 2016; Bayer et al., 2017; He et al., 2017; Polato and Aiolli, 2018). UCF with implicit feedback is usually treated as a top-N recommendation task (Li et al., 2017), which offers a short ranked list of items to each user. Technically, the main difference between rating prediction and top-N recommendation lies in the way of model optimization: the former usually optimizes a regression loss constructed only over the known ratings, while the latter must also account for the remaining data (a mixture of real negative feedback and missing data), which models for explicit feedback typically ignore. Recently, the work in (He et al., 2016) presented a well-designed MF-based method that applies a popularity-aware weighting strategy to model these remaining data, achieving state-of-the-art performance for top-N recommendation.
5.2. Itembased Collaborative Filtering Models
ICF has been used to build industrial online recommendation applications due to its excellent efficacy. The core of item-based CF is the estimation of item-item similarities. Early heuristic-based models (e.g., ItemKNN (Sarwar et al., 2001)) simply use statistical measures such as cosine similarity and the Pearson correlation coefficient to estimate similarities between items. However, such methods require extensive manual tuning to measure similarity well and hardly generalize to other datasets. To solve these issues, machine-learning-based approaches have emerged that construct an objective function to automatically learn the item-item similarity matrix. Among these learning-based methods, SLIM (Ning and Karypis, 2011) and FISM (Kabbur et al., 2013) are two representative models. Specifically, SLIM learns the item similarity matrix by optimizing a regression-based objective; however, it suffers from a high training cost and fails to capture transitive relations between items. The work in (Christakopoulou and Karypis, 2014) extends item similarities to higher orders, capturing higher-order item relations by integrating itemsets into SLIM. As for FISM, it factorizes the similarity between two items as the inner product of their low-dimensional vectors. While achieving state-of-the-art performance, FISM has two inherent limitations: it only models second-order item-item similarity relations via the inner product, ignoring the complex higher-order relationships between items; and it assumes that all historically interacted items in a user's profile contribute equally to modeling her preference on the target item.
The pioneering neural network work for ICF is the collaborative denoising autoencoder (CDAE) presented by Wu et al. (Wu et al., 2016). It is worth mentioning that CDAE can recover SVD++ when the activation function of its hidden layers is replaced with the identity function. CDAE is a neural modeling method for ICF; however, it still relies on the linear inner product to model user-item interactions, limiting its expressiveness and its capacity to capture nonlinear relations.

5.3. Deep Collaborative Filtering Models
More recent evidence suggests that integrating deep learning into recommender systems can significantly boost performance (Zhang et al., 2017; Beutel et al., 2018). Salakhutdinov et al. (Salakhutdinov et al., 2007) first proposed to exploit two-layer Restricted Boltzmann Machines to model users' ratings of items. Autoencoders and denoising autoencoders have also been applied to recommendation based on explicit feedback (Li et al., 2015; Sedhain et al., 2015). In addition, some recent works (Van den Oord et al., 2013; Zhang et al., 2016; Chen et al., 2017a) leverage deep neural networks to extract features from side information such as images and music, and then integrate these features with MF models. These advances, however, still belong to the family of shallow and linear CF models. To build the core of CF on deep neural networks, He et al. (He et al., 2017) initiated a general CF framework named NCF, which uses feedforward neural networks instead of the linear inner product to model user-item interactions. Based on the NCF framework, Bai et al. (Bai et al., 2017) integrate localized information (i.e., user and item neighborhood information) to enrich the representations. Covington et al. (Covington et al., 2016) proposed a deep neural network architecture for an industrial recommender system; it casts recommendation as extreme multiclass classification, where the prediction problem becomes classifying a user into a video in a specific context, and it can model newly uploaded content by feeding the age of the training example into the network.
More recently, He et al. (He and Chua, 2017) presented a neural network view of factorization machines, named NFM, which captures nonlinear and higher-order feature interactions through the proposed Bi-Interaction layer. Inspired by NCF (He et al., 2017) and NFM (He and Chua, 2017), our DeepICF and DeepICF+a take advantage of neural networks to automatically learn the complex interaction function between items. Specifically, the deep component of the DeepICF models applies a standard MLP structure above the item embedding vectors, similar to that of NCF (He et al., 2017). Exploring alternative DNN architectures is left as future work. In this work, we demonstrate that DNNs can be a promising choice for item-based CF in modeling the higher-order and nonlinear interactions between items.
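As a rough illustration of this design, the sketch below combines a FISM-style pooling of item-item interactions with an MLP on top. The dimensions, average pooling, and random weights are simplifying assumptions rather than the exact DeepICF architecture:

```python
import numpy as np

rng = np.random.default_rng(2)
k, n_hist = 8, 5  # embedding size and history length (illustrative)

# Embeddings of the target item and of the user's historically
# interacted items (randomly initialized here for illustration).
q_target = rng.normal(size=k)
P_hist = rng.normal(size=(n_hist, k))

# Pairwise interactions: element-wise product of the target item
# with each interacted item, followed by average pooling
# (a FISM-style aggregation of item-item interactions).
pairwise = P_hist * q_target   # (n_hist, k)
z = pairwise.mean(axis=0)      # pooled interaction vector, (k,)

# Deep component: an MLP on top of the pooled vector to capture
# higher-order, nonlinear relations among the items.
W1, b1 = rng.normal(scale=0.1, size=(k, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(16, 1)), np.zeros(1)

h = np.maximum(0.0, z @ W1 + b1)                    # ReLU hidden layer
score = (1.0 / (1.0 + np.exp(-(h @ W2 + b2))))[0]   # predicted preference
assert 0.0 < score < 1.0
```

Replacing the MLP with an identity mapping and removing the nonlinearity would collapse this back to a linear, FISM-like scoring function, which is precisely what the deep component is meant to go beyond.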
6. Conclusions and Future Work
In this work, we presented a new item-based CF solution based on deep neural networks, named DeepICF, for top-N item recommendation. Our key argument is that the latent structure of real-world data tends to be highly nonlinear and cannot be accurately approximated by linear models such as FISM. The proposed model not only overcomes the inherent limitations of FISM, but also effectively learns the higher-order relations among items from data in a nonlinear fashion through neural networks. We conducted a comprehensive set of experiments on two real-world datasets, and the experimental results demonstrate that DeepICF outperforms other state-of-the-art item-based approaches on the top-N item recommendation task.
In the future, we plan to improve DeepICF in three directions. First, this work models item relations based solely on implicit similarity; however, there exist many heterogeneous item relations that can be characterized by attributes (e.g., category, location) or other content (e.g., timestamp, co-occurrence). Hence, we plan to extract such relational knowledge between items and use it to improve DeepICF. Second, although DeepICF can offer explanations for a recommendation, such as "item A is recommended since you have consumed the similar item B", such similarity-based evidence may be too coarse-grained to increase users' trust. We would like to use side information or relational knowledge of users and items to provide feature-based explanations. Third, we will investigate sequential recommendation by modeling the evolution of user preferences towards items via reinforcement learning, or by tracking user tastes on different attributes of items via memory networks.
Acknowledgements.
This work is supported by the National Natural Science Foundation of China (No. 61772170, 61472115), the National Key Research and Development Program of China (No. 2017YFB0803301), and the Fundamental Research Funds for the Central Universities (No. JZ2017YYPY0234). This work is also supported by the National Research Foundation, Prime Minister's Office, Singapore under its IRC@Singapore Funding Initiative. The authors would like to thank the anonymous reviewers for their reviewing efforts and valuable comments.

References
 Arampatzis and Kalamatianos (2018) Avi Arampatzis and Georgios Kalamatianos. 2018. Suggesting Points-of-Interest via Content-Based, Collaborative, and Hybrid Fusion Methods in Mobile Devices. ACM Trans. Inf. Syst. 36, 3 (2018), 23:1–23:28.
 Bai et al. (2017) Ting Bai, Ji-Rong Wen, Jun Zhang, and Wayne Xin Zhao. 2017. A Neural Collaborative Filtering Model with Interaction-based Neighborhood. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 1979–1982.
 Bayer et al. (2017) Immanuel Bayer, Xiangnan He, Bhargav Kanagal, and Steffen Rendle. 2017. A generic coordinate descent framework for learning from implicit feedback. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1341–1350.
 Beutel et al. (2018) Alex Beutel, Paul Covington, Sagar Jain, Can Xu, Jia Li, Vince Gatto, and Ed H. Chi. 2018. Latent Cross: Making Use of Context in Recurrent Recommender Systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM 2018, Marina Del Rey, CA, USA, February 5–9, 2018. 46–54.
 Cao et al. (2017) Da Cao, Xiangnan He, Liqiang Nie, Xiaochi Wei, Xia Hu, Shunxiang Wu, and Tat-Seng Chua. 2017. Cross-Platform App Recommendation by Jointly Modeling Ratings and Texts. ACM Trans. Inf. Syst. 35, 4 (2017), 37:1–37:27.
 Chen et al. (2017b) Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017b. Attentive Collaborative Filtering: Multimedia Recommendation with Item- and Component-Level Attention. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, August 7–11, 2017. 335–344.
 Chen et al. (2017a) Xu Chen, Yongfeng Zhang, Qingyao Ai, Hongteng Xu, Junchi Yan, and Zheng Qin. 2017a. Personalized Key Frame Recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, August 7–11, 2017. 315–324.
 Christakopoulou and Karypis (2014) Evangelia Christakopoulou and George Karypis. 2014. HOSLIM: Higher-order sparse linear method for top-n recommender systems. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 38–49.
 Christakopoulou and Karypis (2016) Evangelia Christakopoulou and George Karypis. 2016. Local Item-Item Models For Top-N Recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 67–74.
 Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 191–198.
 Erhan et al. (2010) Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research 11, Feb (2010), 625–660.
 He and Liu (2017) Jiangning He and Hongyan Liu. 2017. Mining Exploratory Behavior to Improve Mobile App Recommendations. ACM Trans. Inf. Syst. 35, 4 (2017), 32:1–32:37.
 He and McAuley (2016) Ruining He and Julian McAuley. 2016. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12–17, 2016, Phoenix, Arizona, USA. 144–150.
 He and Chua (2017) Xiangnan He and Tat-Seng Chua. 2017. Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 355–364.
 He et al. (2018) Xiangnan He, Zhankui He, Jingkuan Song, Zhenguang Liu, Yu-Gang Jiang, and Tat-Seng Chua. 2018. NAIS: Neural Attentive Item Similarity Model for Recommendation. IEEE Transactions on Knowledge and Data Engineering (2018).
 He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 173–182.
 He et al. (2016) Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. 2016. Fast matrix factorization for online recommendation with implicit feedback. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 549–558.
 Hornik et al. (1989) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 5 (1989), 359–366.
 Kabbur et al. (2013) Santosh Kabbur, Xia Ning, and George Karypis. 2013. FISM: Factored item similarity models for top-n recommender systems. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 659–667.
 Koren (2008) Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 426–434.
 Li et al. (2015) Sheng Li, Jaya Kawale, and Yun Fu. 2015. Deep collaborative filtering via marginalized denoising autoencoder. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 811–820.
 Li et al. (2017) Xin Li, Mingming Jiang, Huiting Hong, and Lejian Liao. 2017. A Time-Aware Personalized Point-of-Interest Recommendation via High-Order Tensor Factorization. ACM Trans. Inf. Syst. 35, 4 (2017), 31:1–31:23.
 Lian et al. (2018) Defu Lian, Kai Zheng, Yong Ge, Longbing Cao, Enhong Chen, and Xing Xie. 2018. GeoMF++: Scalable Location Recommendation via Joint Geographical Modeling and Matrix Factorization. ACM Trans. Inf. Syst. 36, 3 (2018), 33:1–33:29.
 Liao et al. (2018) Yi Liao, Wai Lam, Lidong Bing, and Xin Shen. 2018. Joint Modeling of Participant Influence and Latent Topics for Recommendation in Event-based Social Networks. ACM Trans. Inf. Syst. 36, 3 (2018), 29:1–29:31.
 Liu et al. (2017) David C Liu, Stephanie Rogers, Raymond Shiau, Dmitry Kislyuk, Kevin C Ma, Zhigang Zhong, Jenny Liu, and Yushi Jing. 2017. Related Pins at Pinterest: The evolution of a real-world recommender system. In Proceedings of the 26th International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee, 583–592.
 McAuley and Leskovec (2013) Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM conference on Recommender systems. ACM, 165–172.
 Ning and Karypis (2011) Xia Ning and George Karypis. 2011. SLIM: Sparse linear methods for top-n recommender systems. In 11th IEEE International Conference on Data Mining, ICDM 2011, Vancouver, BC, Canada, December 11–14, 2011. IEEE, 497–506.
 Polato and Aiolli (2018) Mirko Polato and Fabio Aiolli. 2018. Boolean kernels for collaborative filtering in top-N item recommendation. Neurocomputing 286 (2018), 214–225.
 Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 452–461.
 Rendle and Schmidt-Thieme (2008) Steffen Rendle and Lars Schmidt-Thieme. 2008. Online-updating Regularized Kernel Matrix Factorization Models for Large-scale Recommender Systems. In Proceedings of the 2008 ACM Conference on Recommender Systems. ACM.
 Salakhutdinov et al. (2007) Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. 2007. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th international conference on Machine learning. ACM, 791–798.
 Sarwar et al. (2001) Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web. ACM, 285–295.
 Sedhain et al. (2015) Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. AutoRec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web. ACM, 111–112.
 Shan et al. (2016) Ying Shan, T. Ryan Hoens, Jian Jiao, Haijing Wang, Dong Yu, and JC Mao. 2016. Deep Crossing: Web-Scale Modeling Without Manually Crafted Combinatorial Features. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 255–262.
 Shi et al. (2017) Lei Shi, Wayne Xin Zhao, and Yi-Dong Shen. 2017. Local Representative-Based Matrix Factorization for Cold-Start Recommendation. ACM Trans. Inf. Syst. 36, 2 (2017), 22:1–22:28.
 Smith and Linden (2017) Brent Smith and Greg Linden. 2017. Two decades of recommender systems at Amazon.com. IEEE Internet Computing 21, 3 (2017), 12–18.
 Sun et al. (2017) Yu Sun, Nicholas Jing Yuan, Xing Xie, Kieran McDonald, and Rui Zhang. 2017. Collaborative Intent Prediction with Real-Time Contextual Data. ACM Trans. Inf. Syst. 35, 4 (2017), 30:1–30:33.
 Van den Oord et al. (2013) Aaron Van den Oord, Sander Dieleman, and Benjamin Schrauwen. 2013. Deep content-based music recommendation. In Advances in Neural Information Processing Systems. 2643–2651.
 Wang et al. (2018a) Xiang Wang, Xiangnan He, Fuli Feng, Liqiang Nie, and Tat-Seng Chua. 2018a. TEM: Tree-enhanced Embedding Model for Explainable Recommendation. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23–27, 2018. 1543–1552.
 Wang et al. (2017) Xiang Wang, Xiangnan He, Liqiang Nie, and Tat-Seng Chua. 2017. Item silk road: Recommending items from information domains to social users. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 185–194.
 Wang et al. (2018b) Zihan Wang, Ziheng Jiang, Zhaochun Ren, Jiliang Tang, and Dawei Yin. 2018b. A Path-constrained Framework for Discriminating Substitutable and Complementary Products in E-commerce. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 619–627.
 Wu et al. (2016) Yao Wu, Christopher DuBois, Alice X Zheng, and Martin Ester. 2016. Collaborative denoising auto-encoders for top-n recommender systems. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. ACM, 153–162.
 Xue et al. (2017) Hong-Jian Xue, Xin-Yu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. 2017. Deep matrix factorization models for recommender systems. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence. 3119–3125.
 Zhang et al. (2016) Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016. Collaborative knowledge base embedding for recommender systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 353–362.
 Zhang et al. (2017) Shuai Zhang, Lina Yao, and Aixin Sun. 2017. Deep Learning based Recommender System: A Survey and New Perspectives. CoRR abs/1707.07435 (2017).
 Zhang and Chen (2018) Yongfeng Zhang and Xu Chen. 2018. Explainable Recommendation: A Survey and New Perspectives. CoRR abs/1804.11192 (2018).