1. Introduction
Learning from relational data (e.g., the user-item interactions on Netflix) has benefited many real-world services and applications, such as rating prediction and item recommendation on online platforms. A significant line of research has shown that latent factor models, in particular factorization-based techniques, offer state-of-the-art results for learning tasks on relational data (Koren, 2008; Mnih and Salakhutdinov, 2008; Ma et al., 2008). In addition, as the main framework of the winning solution of the Netflix Prize, matrix factorization (MF) has demonstrated its power in industrial-grade applications (Koren et al., 2009), attracting considerable effort to generalize its predictive abilities.
One direction of these efforts extends the two-dimensional matrix of user-item interactions into a three-dimensional tensor that incorporates time information. Tensor factorization (TF) can then be employed to project users and items into a latent space that encodes time (Kolda and Bader, 2009; Bhargava et al., 2015). However, conventional TF assumes that consecutive time slots are independent, which makes it unsuited to predicting the next time slot. It is also unable to capture temporal patterns that are themselves time-evolving, such as (i) fast-changing item perception, for example, individuals’ impression of a movie may change after it wins a movie award, and (ii) the evolution of users’ preferences, i.e., a user’s tastes may change over time.
Recently, another attempt, recurrent recommender networks (RRN), integrated a recurrent neural network with factorization models to capture the sequential dependencies in users’ behavioral trajectories (Wu et al., 2017). However, RRN does so over a fixed-length window of each item’s historical ratings and ignores the time interval between consecutive ratings. Consequently, these fixed-length rating sequences may cover very different timeframes for different items: a popular movie may collect the required number of ratings within an hour, while a cult movie may need days to collect the same count. Additionally, both RRN and conventional TF models use a dot product to make rating predictions, and so cannot model nonlinear interactions between latent factors.

To address these challenges and limitations, we develop a Neural network based Tensor Factorization model (NTF). In general, NTF takes a three-way tensor (i.e., user-item-time) as input and learns the latent embeddings (commonly referred to as factors in TF) for each dimension of the tensor. Specifically, NTF integrates the long short-term memory (LSTM) network with tensor factorization. The LSTM module adaptively captures the dependencies among multidimensional interactions based on the learned representations for each time slot. Furthermore, instead of using a dot product between learned representations to make rating predictions, NTF concatenates the inherent factors and feeds them into a Multilayer Perceptron (MLP) architecture. As such, the learned representations encode the nonlinear interactions between different dimensions.
In addition to the aforementioned differences from RRN, NTF also differs in the input to the LSTM module. To predict the next rating, the input to RRN’s LSTM is the previous rating sequence with a fixed length, while the LSTM module of NTF takes the representation vectors from previous time slots. Furthermore, NTF differs from recent work on neural collaborative filtering (NCF) (He et al., 2017) and collaborative deep learning (CDL) (Wang et al., 2015), both of which infer users’ and items’ latent embeddings under a static scenario. Additionally, unlike CDL, NTF does not need auxiliary information or domain knowledge to determine the effectiveness of features. The advantage of the NTF model lies in its complete utilization of each dimension of the relational data.

To sum up, the main contributions of this paper include:

To the best of our knowledge, we present the first model, which we call NTF, to generalize tensor factorization with deep neural networks, empowering it to model time-evolving multidimensional data.

We incorporate the multilayer perceptron architecture in NTF to model the nonlinearities in relational data, eliminating the linear limitation of the dot product used in conventional tensor factorization.

We perform extensive experiments on rating prediction with the Netflix dataset and link prediction with the Github dataset, demonstrating significant improvements over state-of-the-art baselines, such as RRN and NCF.
2. Problem Formulation and Tensor Factorization (TF)
In this section, we present the notations and problem formulation. We also provide a quick review of the conventional Tensor Factorization (TF) model and discuss its limitations in modeling the temporal dynamics of relational data.
2.1. Problem Formulation
We consider a dynamic scenario wherein there exist evolving pairwise relations between multiple types of entities (e.g., user, item, and time), such as Netflix users’ ratings of various movies on different days, and Github users’ repository forks during a month. To model the relationships among entities over time, we use a tensor to represent their time-evolving interactions.
In this work, we focus on the three-way tensor with a temporal dimension. Formally, we construct a tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$, where the first-order tensors of the three dimensions have sizes $I$, $J$, and $K$, respectively. We denote an entry in the tensor as $x_{ijk}$, representing the interaction among the three dimensions indexed by $i$, $j$, and $k$, respectively. For example, in relational rating data, $x_{ijk}$ can represent: (i) the quantitative rating score of the $i$-th user on the $j$-th item in the $k$-th time slot, or (ii) the binary interaction (link) between the $i$-th and $j$-th nodes at time slot $k$.
Problem Formulation. Based on the above definitions, we use $\mathcal{X}$ to represent the interactions among the three dimensions that are observed from relational data. The objective of this work is to learn a predictive model that can effectively infer the unknown values in $\mathcal{X}$ from the observed ones.
2.2. Conventional Tensor Factorization
The key idea of tensor factorization is to learn connections among the observed values in a tensor in order to infer the missing ones. The most common mechanism of tensor factorization is CANDECOMP/PARAFAC (CP) (Carroll and Chang, 1970), which decomposes a tensor into multiple low-rank latent factor matrices, one per tensor dimension. For example, a movie rating dataset can be viewed as a tensor with three dimensions: user, movie, and time. The latent factor matrices in this case capture the latent factors of each dimension. The latent factors on the user dimension may reflect users’ genre or rating preferences; the factor matrix on the movie dimension may model the movie plot, starring information, and many other features; and the time-dimension factors could capture time-specific information such as holiday seasons or special events. With these three matrices, the ratings can be obtained by a simple dot product across the matrices. Following the convention in the Representation Learning (RL) literature, in this work we refer to inherent factor vectors as lower-dimensional embeddings or representation vectors interchangeably (Bengio et al., 2013).
Formally, TF factorizes a tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$ into three different matrices $U \in \mathbb{R}^{I \times L}$, $V \in \mathbb{R}^{J \times L}$, and $T \in \mathbb{R}^{K \times L}$, where $L$ is the number of latent factors (indexed by $l$). We define the factorization of tensor $\mathcal{X}$ as:
$$\mathcal{X} \approx \sum_{l=1}^{L} U_{:,l} \circ V_{:,l} \circ T_{:,l} \tag{1}$$
where $U_{:,l}$, $V_{:,l}$, and $T_{:,l}$ represent the $l$-th column of matrices $U$, $V$, and $T$, respectively, and $\circ$ denotes the vector outer product. Each entry $x_{ijk}$ can be computed by the inner product of three $L$-dimensional vectors as follows:
$$x_{ijk} \approx \langle U_i, V_j, T_k \rangle = \sum_{l=1}^{L} U_{il} V_{jl} T_{kl} \tag{2}$$
The objective of tensor factorization is to learn $U$, $V$, and $T$ using maximum likelihood estimation. We further define $U_i$, $V_j$, and $T_k$ to index the $i$-th, $j$-th, and $k$-th rows of $U$, $V$, and $T$, respectively. After the model learns $U$, $V$, and $T$ from the observed entries of $\mathcal{X}$, one can easily fill in the missing values in $\mathcal{X}$ using Eq. 2.

However, two significant limitations exist in the conventional tensor factorization model: (i) it fails to capture temporal dynamics because the interaction prediction depends only on the current time slot $k$; (ii) it is a linear model, which cannot deal with the complex nonlinear interactions that exist in real-world relational data. In the present work we aim to explicitly incorporate temporal dynamics into tensor factorization frameworks and to model the nonlinear interactions across different latent factors.
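The CP-style prediction described above can be illustrated with a minimal NumPy sketch. The array names `U`, `V`, `T`, the toy sizes, and the helper names `cp_predict` and `cp_reconstruct` are assumptions made for illustration only; the factor matrices here are random rather than learned.

```python
import numpy as np

def cp_predict(U, V, T, i, j, k):
    """Predict one entry as the sum over latent factors of U[i,l] * V[j,l] * T[k,l]."""
    return float(np.sum(U[i] * V[j] * T[k]))

def cp_reconstruct(U, V, T):
    """Rebuild the full tensor as a sum of rank-one outer products of factor columns."""
    return np.einsum("il,jl,kl->ijk", U, V, T)

rng = np.random.default_rng(0)
n_users, n_items, n_slots, n_factors = 4, 3, 5, 2   # toy sizes (hypothetical)
U = rng.normal(size=(n_users, n_factors))           # user factors
V = rng.normal(size=(n_items, n_factors))           # item factors
T = rng.normal(size=(n_slots, n_factors))           # time-slot factors

X_hat = cp_reconstruct(U, V, T)
single = cp_predict(U, V, T, 1, 2, 3)
```

Note that the prediction for time slot `k` reads only row `T[k]`, which is the first limitation mentioned above: a time slot with no observed entries has no informed factor row to predict from.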
3. Neural network based Tensor Factorization Framework
In this section, we present the Neural network based Tensor Factorization (NTF) framework, which is capable of learning implicit time-evolving interactions in relational data. We first introduce the general framework of NTF to explain the motivation of the model and then present the details of NTF in the following subsections.
3.1. General Framework
NTF is a multilayer representation learning model that pursues a full neural treatment of tensor factorization to explicitly model the time-evolving interactions between different dimensions. We show the model architecture in Figure 1 and present the pseudocode of NTF in Algorithm 1. The motivations behind the design of our model are as follows:

To address the data sparsity challenge, in our raw time embedding layer, we transform each element of the first-order tensor of the temporal dimension with one-hot encoding and then project it into the embedding space. In this way, latent vectors, rather than handcrafted features, represent the elements of the temporal dimension.

To capture the complex temporal dynamics, with the generated embedding vectors as input, we utilize an LSTM to encode the evolving interactions, addressing the issues of long-term dependencies and vanishing gradients in recurrent neural networks (Jozefowicz et al., 2015). Note that there are other variants of gated recurrent neural networks, such as the Gated Recurrent Unit (GRU) (Chung et al., 2015). LSTM and GRU models are architecturally similar and often provide similar results (Chung et al., 2014). This work chooses LSTM as the encoder for the temporal dimension of the tensor because it is slightly more general than the alternatives (Wu et al., 2017). As a general tensor factorization framework, NTF can also integrate other variants of recurrent neural networks.

To model the nonlinearity of multidimensional interactions, we use a Multilayer Perceptron (MLP) on top of the first two layers. Consider a tensor with user-item-time dimensions as an example. The projected temporal embedding vectors are fed into a multilayer neural architecture together with the embedding vectors of users and items. This enables NTF to incorporate the learned complex dependencies in the temporal dimension into the factorization process as constraints. By doing so, we can detect the implicit patterns of user-item-time interactions through each layer of the MLP and model the nonlinear combinations of latent factors to make better predictions. Finally, the latent vectors are mapped to quantitative values that represent future interactions across different dimensions.
3.2. Modeling Temporal Dynamics via LSTM
Recurrent Neural Networks (RNNs) have been widely used to address various challenges in time series analysis. However, two key limitations exist in the basic RNN. First, it is difficult to train because of vanishing and exploding gradients: the gradient might approach zero (vanish) or diverge (explode) as it is propagated back through time steps during training. Second, it is unable to model long-distance dependencies in sequence data. To obviate these problems, LSTM was introduced as a special kind of RNN that models long-term dependencies and addresses the vanishing gradient problem with a more complicated hidden unit. In particular, LSTM derives the vector representations of hidden states $c_t$ and $h_t$ for each time step $t$ as follows:
$$\begin{aligned}
i_t &= \sigma(W_i h_{t-1} + V_i e_t + b_i) \\
o_t &= \sigma(W_o h_{t-1} + V_o e_t + b_o) \\
f_t &= \sigma(W_f h_{t-1} + V_f e_t + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c h_{t-1} + V_c e_t + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned} \tag{3}$$
where $e_t$ is the input vector at time step $t$, $W_* \in \mathbb{R}^{d_s \times d_s}$ represents the transformation matrix from the previous state (i.e., $c_{t-1}$ and $h_{t-1}$) to the LSTM cell, and $V_* \in \mathbb{R}^{d_s \times d_x}$ are the transformation matrices from the input to the LSTM cell, where $d_x$ and $d_s$ denote the dimensions of input vectors and hidden states, respectively. Furthermore, $b_* \in \mathbb{R}^{d_s}$ is defined as a vector of bias terms. $\sigma(\cdot)$ and $\tanh(\cdot)$ represent the sigmoid and tanh functions, respectively. The operator $\odot$ denotes the element-wise product. In Eq. 3, $i_t$, $o_t$, and $f_t$ represent the input gate, output gate, and forget gate, respectively. For simplicity, we denote Eq. 3 as $[c_t, h_t] = \mathrm{LSTM}(e_t, c_{t-1}, h_{t-1})$ in the following subsections.
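As an illustration of the gate updates described above, the following is a minimal NumPy sketch of one step of a standard LSTM cell, unrolled over a short sequence. The dictionary layout of the parameters, the gate keys `"i"`, `"o"`, `"f"`, `"c"`, the random initialization, and the sizes (32-dimensional inputs and states, 5 time steps) are assumptions chosen for the example; this is not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, V, b):
    """One standard LSTM step.

    W[g]: (d_s, d_s) maps the previous hidden state, V[g]: (d_s, d_x) maps the
    input, b[g]: (d_s,) is the bias, for each gate key g in "iofc".
    """
    i_t = sigmoid(W["i"] @ h_prev + V["i"] @ x_t + b["i"])    # input gate
    o_t = sigmoid(W["o"] @ h_prev + V["o"] @ x_t + b["o"])    # output gate
    f_t = sigmoid(W["f"] @ h_prev + V["f"] @ x_t + b["f"])    # forget gate
    c_new = np.tanh(W["c"] @ h_prev + V["c"] @ x_t + b["c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_new                          # element-wise products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
d_x, d_s, steps = 32, 32, 5      # illustrative sizes
W = {g: rng.normal(scale=0.1, size=(d_s, d_s)) for g in "iofc"}
V = {g: rng.normal(scale=0.1, size=(d_s, d_x)) for g in "iofc"}
b = {g: np.zeros(d_s) for g in "iofc"}

inputs = rng.normal(size=(steps, d_x))   # stand-in for a sequence of input vectors
h, c = np.zeros(d_s), np.zeros(d_s)
for x_t in inputs:
    h, c = lstm_step(x_t, h, c, W, V, b)
```

The additive update of the cell state `c_t` is what lets gradients flow across many steps, which is the property the text relies on when it contrasts LSTM with the basic RNN.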
3.3. Fusion of LSTM and TF
In this subsection, we present how we fuse LSTM and TF under the NTF framework to model the time-evolving interactions across the three dimensions of the tensor. In relational data, dynamic characteristics are observed across different time slots. In this work, we consider temporal factors affecting the interactions over time based on a global trend, assuming that the interactions among the dimensions evolve smoothly. In particular, to capture this temporal smoothness, we further assume that the embedding vector of a time slot depends on the embedding vectors of previous time slots. In NTF, we predict the embedding vector of the current time slot from the embedding vectors of past time slots using LSTM.
To encode the evolving temporal hidden factors, our LSTM encoder generates embeddings without using handcrafted features (e.g., the day of the week). We formally define the hidden states $c_t$ and $h_t$ in encoding the contextual sequence as:
$$[c_t, h_t] = \mathrm{LSTM}(T_{k-s+t-1}, c_{t-1}, h_{t-1}), \quad t = 1, \dots, s \tag{4}$$
Using the last hidden state vector $h_s$, we can define the embedding vector $T_k$ through the Projection Layer as:
$$T_k = \phi(W_p h_s + b_p) \tag{5}$$
where $s$ is the number of time steps, $W_p$ is the projection matrix, $b_p$ is the projection bias, and $\phi(\cdot)$ denotes the activation functions that we define later.
Finally, the predicted value $\hat{x}_{ijk}$ could be derived by the dot product of $U_i$, $V_j$, and $T_k$, as in conventional tensor factorization. Given the increasing use of deep neural networks to handle complicated nonlinearity in image and text data (Collobert et al., 2011; Rigamonti et al., 2011), it is natural to explore the nonlinear interactions in relational data. To this end, we concatenate the embedding vectors $U_i$, $V_j$, and $T_k$ and feed them into a Multi-Layer Perceptron (MLP), which outputs $\hat{x}_{ijk}$. In this way, a neural network architecture replaces the dot product of tensor factorization (as introduced in Section 2) and captures nonlinear interactions among the latent factors from the previous embedding layer. Formally, we present the MLP as:
$$\begin{aligned}
z_1 &= \phi_1(W_1 [U_i; V_j; T_k] + b_1) \\
&\;\;\vdots \\
z_M &= \phi_M(W_M z_{M-1} + b_M) \\
\hat{x}_{ijk} &= \sigma(W_o z_M + b_o)
\end{aligned} \tag{6}$$
where $M$ represents the number of hidden layers, indexed by $m$, and $[\cdot\,;\cdot]$ represents the concatenation operation. For the $m$-th layer, $\phi_m$, $W_m$, and $b_m$ represent the activation function of the MLP layer, the weight matrix, and the bias vector, respectively. We further specify the output activation function as sigmoid (denoted as $\sigma$) to output the quantitative values representing multidimensional interactions. In the experiments, we investigate the effect of the number of layers in the MLP.
3.4. Learning Process
In this subsection, we first describe the learning process of our NTF framework. Then, we further utilize the advanced technique, i.e., batch normalization, to optimize the NTF.
3.4.1. The Objective Function.
As introduced in Section 2, our objective is to derive the value of $\hat{x}_{ijk}$, which denotes the interaction among the $i$-th, $j$-th, and $k$-th elements of the three dimensions of $\mathcal{X}$, respectively. We formally define our objective function in the factorization procedure as follows:
(7) 
where $\Omega$ denotes the set of observed interactions in tensor $\mathcal{X}$. NTF can be learned by minimizing the above loss function between the observed interaction data and the factorization representation. The above optimization problem can be efficiently solved using the popular optimizer Adaptive Moment Estimation (Adam). The reasons for choosing Adam are twofold: (i) it automatically tunes the learning rate during training, and (ii) it often converges faster than the stochastic gradient descent algorithm.
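To make the prediction and learning steps more concrete, here is a minimal NumPy sketch of an MLP that scores a (user, item, time) triple from concatenated embeddings, together with a loss evaluated only on observed entries. Several choices are assumptions for illustration and are not taken from the text: ReLU hidden activations, a squared-error loss, rescaling 1 to 5 ratings into [0, 1] to match the sigmoid output, and all names (`mlp_predict`, `squared_error`, `hidden`, `out`, `observed`). The parameters are random, so the sketch shows the forward computation only, with no training loop.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_predict(u, v, t, hidden, out):
    """Concatenate three embeddings, apply the hidden layers, then a sigmoid output."""
    z = np.concatenate([u, v, t])
    for W, b in hidden:                    # ReLU is an assumed hidden activation
        z = np.maximum(W @ z + b, 0.0)
    w_o, b_o = out
    return float(sigmoid(w_o @ z + b_o))

def squared_error(observed, U, V, T, hidden, out):
    """Sum of squared errors over the observed (i, j, k) entries only."""
    return sum(
        (x - mlp_predict(U[i], V[j], T[k], hidden, out)) ** 2
        for (i, j, k), x in observed.items()
    )

rng = np.random.default_rng(0)
d = 8                                      # toy embedding size
U, V, T = (rng.normal(size=(n, d)) for n in (4, 3, 5))
sizes = [3 * d, 16, 8]                     # input width, then two hidden layers
hidden = [(rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
out = (rng.normal(scale=0.1, size=sizes[-1]), 0.0)

# Two observed ratings (4 and 2 on a 1-5 scale), rescaled to [0, 1].
observed = {(0, 1, 2): (4 - 1) / 4, (3, 0, 4): (2 - 1) / 4}
loss = squared_error(observed, U, V, T, hidden, out)
pred = mlp_predict(U[0], V[1], T[2], hidden, out)
```

In practice the embeddings and MLP weights would be updated jointly by an optimizer such as Adam; restricting the sum to observed entries is what allows the model to be trained on a sparse tensor.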
3.4.2. Batch Normalization.
When training neural network models, performance can be degraded by covariate shift (Shimodaira, 2000). To tackle this challenge, Batch Normalization (BN) has been proposed to normalize the output of the previous layer before sending it to the next layer as input (Ioffe and Szegedy, 2015). In the NTF framework, we apply BN to reduce internal covariate shift by transforming the input to a zero-mean, unit-variance distribution within each mini-batch. Specifically, we apply BN to the LSTM to avoid slowing down the training process:
(8) 
where $\mathrm{BN}(\cdot)$ stands for the batch normalization operation. BN is also applied to the Projection Layer.
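The normalization step itself can be sketched as follows. This is a generic training-time batch normalization over a mini-batch, not the specific way BN is wired into the LSTM gates here; the parameter names `gamma` (scale), `beta` (shift), and `eps` (numerical stabilizer), as well as the batch shape, are assumptions for illustration.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch axis, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, (almost) unit variance
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=2.0, size=(256, 32))   # toy mini-batch
normalized = batch_norm(batch, gamma=np.ones(32), beta=np.zeros(32))
```

With `gamma` set to ones and `beta` to zeros, each feature column of `normalized` has a mean close to 0 and a variance close to 1; in a trained model both parameters are learned so that the layer can undo the normalization where that helps.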
4. Evaluation
We demonstrate the effectiveness of NTF with two real-world applications on dynamic relational data, i.e., regression (rating prediction) and classification (link prediction), corresponding to predicting quantitative (rating scores) and binary (existence of links) interactions, respectively. In particular, we aim to answer the following questions:

Q1: How does our NTF framework perform compared to state-of-the-art techniques on the rating prediction task?

Q2: How does our NTF framework perform on the link prediction task compared with the baselines?

Q3: Does NTF consistently outperform the baselines in terms of prediction accuracy across different time windows with different training and testing periods?

Q4: How do NTF variants with different combinations of key components in the joint framework perform?

Q5: How do different hyperparameter settings (e.g., embedding size and number of hidden layers) affect the performance of NTF?
4.1. Experimental Setup
4.1.1. Data
In our evaluation, we perform experiments on two types of dynamic relational datasets and corresponding tasks, namely: (i) rating prediction on Netflix movie rating data; (ii) link prediction on Github archive data.
Netflix Rating Data. This movie rating dataset, which was collected between Jan 2002 and Dec 2005, has been widely used in rating prediction evaluation (Xiong et al., 2010; Shih et al., 2016). In the Netflix dataset, users rate a movie on a scale from 1 (worst) to 5 (best), and each score is associated with a rating date denoting when the rating was reported. We generate the tensor $\mathcal{X}$ by associating each movie with the users who rated it in different months. In particular, if the $i$-th user rated the $j$-th movie in the $k$-th month (the time slots we used for evaluation correspond to calendar months), the element $x_{ijk}$ is in the interaction set $\Omega$.
Github Archive Data. This dataset was collected from Github to record the fork actions of users on repositories. Specifically, forking a repository allows users to freely experiment with changes without affecting the original project. Note that a user can fork any given repository only once. The collection covers Jan 2017 to Jun 2017, and the time information is provided. In this dataset, an edge $(i,j,k)$ is generated when the $i$-th user forks the $j$-th repository at time slot $k$ (the time slots we used for evaluation correspond to calendar weeks). We set the element $x_{ijk}$ in tensor $\mathcal{X}$ to 1 if the edge $(i,j,k)$ exists in the dataset and 0 otherwise.
Table 1. Statistics of the two datasets.
Dataset | # of Users | # of Items | # of Ratings | Time Span
Netflix | 68,079 | 2,328 | 12,326,319 | 36 months
Dataset | # of Users | # of Projects | # of Forks | Time Span
Github | 81,001 | 72,420 | 1,396,115 | 21 weeks
Table 1 summarizes the statistics of the above two datasets. To better understand the effectiveness of NTF in modeling temporal dynamics, we evaluate NTF on different time windows with different training and testing periods. Additionally, to evaluate the ability of NTF to model temporal dynamics with time-evolving interactions, we remove from the testing data any users and items that are not included in the training data. Table 2 summarizes the experimental settings for the varying time windows.
Table 2. Experimental settings for the varying time windows.
# of users | # of items | training size | validation size | testing size | training period | testing period
28,077 | 1,772 | 1,454,868 | 145,486 | 225,787 | Jan 2003 - Dec 2003 | Jan 2004
32,637 | 1,862 | 1,910,235 | 191,023 | 247,956 | Mar 2003 - Feb 2004 | Mar 2004
37,060 | 1,937 | 2,437,882 | 243,788 | 270,312 | May 2003 - Apr 2004 | May 2004
40,922 | 1,986 | 2,941,755 | 294,175 | 288,037 | Jul 2003 - Jun 2004 | Jul 2004
44,072 | 2,055 | 3,388,774 | 338,877 | 305,693 | Sep 2003 - Aug 2004 | Sep 2004
47,480 | 2,126 | 3,869,204 | 386,920 | 357,878 | Nov 2003 - Oct 2004 | Nov 2004
4.1.2. Baselines
Because rating prediction and link prediction are two different tasks with different representative baselines, we summarize the compared baselines for the two tasks separately. In addition, we compare NTF with matrix factorization methods rather than tensor factorization schemes for two main reasons: (1) it is difficult to apply tensor factorization with a temporal dimension to make predictions, because it ignores temporal dependencies between time slots; (2) no duplicate interactions (i.e., ratings and links) exist between users and items across different time slots in either the Netflix or the Github dataset.
Rating Prediction and Inference. For the rating prediction, we consider three types of baselines: representative matrix factorization for recommendation systems, neural network based collaborative filtering methods for recommendations or predictive analytics, and variants of Recurrent Neural Network models for time series prediction.

Probabilistic Matrix Factorization (PMF) (Mnih and Salakhutdinov, 2008): a probabilistic method for matrix factorization that assigns a D-dimensional latent feature vector (following a Gaussian distribution) to each user and item. The ratings are derived from the inner product of the corresponding latent features.

Bayesian Probabilistic Matrix Factorization (BPMF) (Salakhutdinov and Mnih, 2008): extending PMF, BPMF learns the latent feature vector of each user and item with a Markov chain Monte Carlo (MCMC) method, which helps address the overfitting issue.

Bayesian Probabilistic Tensor Factorization (BPTF) (Xiong et al., 2010): a Bayesian probabilistic tensor factorization method for modeling evolving relational data.

Temporal Deep Semantic Structured Model (TDSSM) (Song et al., 2016): this method is a temporal recommendation model which combines traditional feedforward networks (DSSM) with LSTM, to capture temporal dynamics of users’ interests.

Recurrent Recommendation Networks (RRN) (Wu et al., 2017): it aims to predict future interactions between users and items by specifying two embedding vectors (stationary and dynamic) for both user and item. The dynamic embedding vectors are inferred with LSTM model based on historical ratings.

Neural Collaborative Filtering (NCF) (He et al., 2017): it proposes a collaborative filtering framework based on a neural network architecture to model the interactions between users and items.
4.1.3. Evaluation Protocols
In our evaluation, we split the datasets into training, validation and test sets. We use the validation datasets to tune hyperparameters and test datasets to evaluate the final performance of all compared algorithms.

Rating Prediction. To evaluate the performance of all compared algorithms in predicting quantitative rating scores, we use Root Mean Square Error (RMSE) and Mean Absolute Error (MAE), which have been widely adopted in quantitative prediction tasks (He and Chua, 2017). Note that lower RMSE and MAE scores indicate better performance. The mathematical definitions of these metrics are as follows: $\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i,j,k} (x_{ijk} - \hat{x}_{ijk})^2}$, $\mathrm{MAE} = \frac{1}{N} \sum_{i,j,k} |x_{ijk} - \hat{x}_{ijk}|$, where $N$ denotes the number of observed elements in tensor $\mathcal{X}$, and $x_{ijk}$ and $\hat{x}_{ijk}$ represent the actual rating score and the estimated rating score, respectively.

Link Prediction. To validate the performance of each method in predicting the existence of links, Precision, Recall, F1-score and AUC are used as evaluation metrics (Scellato et al., 2011).
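The metrics named above can be computed as in the following minimal NumPy sketch, which uses the standard definitions of RMSE, MAE, precision, recall, and F1-score. The function names and the toy arrays are hypothetical; AUC is omitted for brevity.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error between actual and predicted scores."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mae(actual, predicted):
    """Mean absolute error between actual and predicted scores."""
    return float(np.mean(np.abs(actual - predicted)))

def precision_recall_f1(labels, preds):
    """Binary precision, recall, and F1 for 0/1 labels and 0/1 predictions."""
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy rating example: errors are 1, 0, -1, -2.
ratings = np.array([5.0, 3.0, 4.0, 1.0])
estimates = np.array([4.0, 3.0, 5.0, 3.0])
rating_rmse = rmse(ratings, estimates)
rating_mae = mae(ratings, estimates)

# Toy link example: two true positives, one false positive, one false negative.
links = np.array([1, 0, 1, 1, 0])
link_preds = np.array([1, 1, 1, 0, 0])
p, r, f1 = precision_recall_f1(links, link_preds)
```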
4.1.4. Reproducibility
We summarize the parameter settings of NTF and the experiments in Table 3. In addition, we vary each key parameter of NTF while fixing the others to examine parameter sensitivity. We implemented our framework in TensorFlow and chose Adam (Kingma and Ba, 2014) as the optimizer to learn the model parameters (footnote 1: code of our model and baselines will be publicly available upon publication). For all neural network baselines (i.e., NCF, RRN and TDSSM), we use the same parameters listed in Table 3.
Table 3. Parameter settings.
Parameter | Value | Parameter | Value
Hidden State Dimension | 32 | Embedding size | 32
# of Time Steps | 5 | # of Hidden Layers | 6
BN scale parameter | 0.99 | BN shift parameter | 0.001
Batch size | 256 | Learning rate | 0.001
Link Prediction. In addition to the above NCF, BPMF and PMF algorithms, which have been applied to the link prediction problem, we consider two other traditional link prediction baselines to compare against NTF in predicting future binary interactions (i.e., links) among users.

Preferential Attachment (PA) (Liben-Nowell and Kleinberg, 2007): PA assumes that new connections are more likely to form between well-connected nodes.

Adamic Adar (AA) (Adamic and Adar, 2003): AA smooths the common-neighbor method using neighbors’ node degrees.

PMF, BPMF, and NCF: as introduced above.

TDSSM and RRN: both methods require the historical sequences of items and users. However, these sequences cannot be generated from the Github dataset for two reasons: first, the unobserved interactions may be either negative or positive cases; second, even if we take all unobserved cases as negative, it is difficult to assign them to specific time slots.
For all embedding based and recurrent neural network based baselines, we use the same parameters as NTF which are listed in Table 3.
4.1.5. Variants of NTF
In addition to comparing NTF with existing approaches, we are also interested in discovering the best way to model nonlinear multidimensional interactions among different embeddings in the proposed NTF framework. Namely, we aim to answer the following two questions: (1) does the choice of activation function affect the performance of NTF? and (2) is the multilayer perceptron helpful for learning nonlinear interactions from multidimensional relational data? Hence, in the evaluation of the NTF framework, we consider four variants of NTF: a simplified version that does not use an MLP to explore the nonlinear interactions in relational data and instead uses a dot product to predict the interaction value, as the compared baselines do; and three full versions of NTF that use different activation functions.
4.2. Rating Prediction and Inference (Q1, Q3 and Q4)
We now compare NTF with the state-of-the-art techniques introduced above. To investigate the performance of all compared algorithms on different target time frames, we show the results from Jan 2004 to Nov 2004. The evaluation results are shown in Table 5. Furthermore, we analyze the effect of the training ratio (from 30% to 80%) on predictive performance, as shown in Table 6. Based on these evaluation results, we make the following four key observations.
(i) Rating Prediction: Training/test time period. Table 5 shows that NTF consistently achieves the best performance across different time frames. For example, NTF achieves average improvements of 0.075 and 0.076 (relatively 8.3% and 10.7%) over TDSSM in terms of RMSE and MAE, and 0.075 and 0.075 (relatively 7.4% and 7.8%) over RRN in terms of RMSE and MAE. The evaluation results across different time frames demonstrate the effectiveness of our NTF framework in modeling time-evolving interactions between multiple dimensions in a dynamic scenario. Furthermore, the data becomes denser as the time window slides, i.e., the density degree is 3.67% for Jan 2004, 3.86% for Mar 2004, 4.11% for May 2004, 4.34% for Jul 2004, 4.45% for Sep 2004, and 4.57% for Nov 2004. We observe that the performance gain of NTF over the other baselines becomes larger as the data becomes sparser, suggesting that NTF is capable of handling sparse relational data.
(ii) Rating Inference: Training/test ratio. Table 6 shows the prediction results when varying the percentage of data in the training set. In this experiment, we fix the validation data at 10% of the entire dataset. NTF obtains clear improvements with different sizes of training data, demonstrating that it is robust to the data sparsity issue. For example, the average relative improvements over TDSSM and RRN are (RMSE: 5.5%, MAE: 7.2%) and (RMSE: 3.1%, MAE: 3.3%), respectively. In addition, performance improves as the training size increases, which indicates the positive effect of training data size on rating prediction. Also, the performance gain of NTF over the other baselines becomes larger as the training size decreases (i.e., as the data becomes sparser), which validates the ability of the NTF model to predict interactions from a sparse tensor. Neural network based models (i.e., NCF and RRN) achieve better performance than conventional matrix factorization algorithms (i.e., PMF and BPMF) with less training data. This observation suggests that neural network based models are better suited to sparse relational data. An interesting observation is that BPTF achieves better performance than TDSSM and RRN (neural methods), which indicates the value of considering the interactions between user, item and time embeddings.
(iii) NTF’s variants. We notice that the performance of NTF is not sensitive to the choice of activation function. In addition, our results indicate the necessity of the multilayer perceptron component in NTF, which projects inherent factors into a prediction output by capturing the nonlinear relations between them.
(iv) Performance gain analysis. We observe that NTF improves over both collaborative filtering based algorithms (i.e., NCF, PMF, BPMF and BPTF) and recurrent neural network based schemes (i.e., RRN and TDSSM). First, this sheds light on the limitations of collaborative filtering based algorithms, which ignore the temporal dynamics among the multidimensional interactions in the rating data. Second, the large performance gap between NTF and recurrent neural network based schemes indicates the limitation of those approaches, which only model the sequential pattern of the tensor’s temporal dimension and fail to consider the dependencies between the implicit interactions across dimensions.
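The density degrees quoted in observation (i) can be cross-checked against Table 2. The sketch below assumes, as an inference rather than a definition stated in the text, that density is the total number of interactions (training, validation, and testing) divided by the number of user-item pairs; the dictionary `windows` and the function `density` are hypothetical names, and the figures are copied from Table 2.

```python
# (users, items, training size, validation size, testing size), keyed by testing period
windows = {
    "Jan 2004": (28077, 1772, 1454868, 145486, 225787),
    "Mar 2004": (32637, 1862, 1910235, 191023, 247956),
    "May 2004": (37060, 1937, 2437882, 243788, 270312),
    "Jul 2004": (40922, 1986, 2941755, 294175, 288037),
    "Sep 2004": (44072, 2055, 3388774, 338877, 305693),
    "Nov 2004": (47480, 2126, 3869204, 386920, 357878),
}

def density(users, items, *sizes):
    """Fraction of user-item pairs that have an interaction in any split."""
    return sum(sizes) / (users * items)

densities = {period: density(*row) for period, row in windows.items()}
for period, value in densities.items():
    print(f"{period}: {100 * value:.2f}%")
```

Under this assumed definition, the first and last windows come out at roughly 3.67% and 4.57%, in line with the values quoted in the text.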
4.3. Link Prediction (Q2, Q3 and Q4)
Table 7 lists the evaluation results of the link prediction task. In this evaluation, we evaluate all compared models using 90% and 10% of the Github archive data from the first twenty weeks as training and validation data, respectively, to predict links in the twenty-first week. To construct the testing set, we use the observed links in the twenty-first week as positive cases and randomly enumerate node pairs, choosing unobserved edges as negative cases. We use the same parameter settings as in the rating prediction task (listed in Table 3). Furthermore, because the link prediction problem often suffers from highly unbalanced data (i.e., only 0.02% of instances are observed in the Github dataset), we sample twice as many negative cases as positive cases to address this challenge, following Chawla et al. (2004).
Overall, the proposed NTF significantly outperforms the other baselines in F1-score and AUC. Specifically, the relative improvement of NTF over NCF is 10.2% and 6.8% in terms of F1-score and AUC, respectively. This link prediction task further demonstrates that the NTF framework works well by capturing time-evolving interactions between different dimensions in a nonlinear manner.
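The negative sampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions: negatives are drawn uniformly at random from (user, repository) pairs that are absent from the supplied positive set, and the function name `sample_negatives`, its parameters, and the toy data are hypothetical rather than taken from the authors' code.

```python
import random

def sample_negatives(positives, num_users, num_items, ratio=2, seed=0):
    """Draw `ratio` times as many unobserved (user, item) pairs as there are positives."""
    rng = random.Random(seed)
    observed = set(positives)
    target = ratio * len(observed)
    if target > num_users * num_items - len(observed):
        raise ValueError("not enough unobserved pairs to sample from")
    negatives = set()
    while len(negatives) < target:
        pair = (rng.randrange(num_users), rng.randrange(num_items))
        if pair not in observed:          # keep only unobserved edges
            negatives.add(pair)
    return sorted(negatives)

# Toy example: three observed forks among 10 users and 10 repositories.
positives = [(0, 1), (2, 3), (4, 0)]
negatives = sample_negatives(positives, num_users=10, num_items=10)
```

Rejection sampling of this kind is efficient when observed links are rare, which is the case here given that only 0.02% of instances are observed.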
4.4. Parameter Sensitivity (Q5)
The NTF model involves several parameters (i.e., embedding size in the embedding layer, # of hidden layers in the MLP, # of time steps and hidden state dimension in the LSTM). To investigate the robustness of the NTF framework, we examine how different choices of parameters affect the performance of NTF in both rating and link prediction tasks. Except for the parameter being tested, we set the other parameters to their default values (see Table 3).
Rating Prediction. Figure 2 and Figure 3 show the prediction results (measured by RMSE and MAE) as a function of one selected parameter when fixing the others. Note that we have two axes corresponding to RMSE (left, black) and MAE (right, blue) respectively, due to their different value ranges. From Figure 3, overall, we can observe that NTF is not particularly sensitive to these parameters, except for # of hidden layers, and can achieve high performance with cost-effective parameters, i.e., the smaller the parameters are, the more efficient the training process will be. Figure 3(b) and Figure 2(b) indicate that model performance becomes stable as long as the number of hidden layers is above 2. From Figure 3(a), we can observe that embedding size is positively correlated with the prediction accuracy, and we set it to 32 in our experiments to balance efficacy and computational cost. Additionally, we can observe the low impact of the other two parameters (i.e., sequence length and state size in the LSTM) on model performance, which suggests the robustness of NTF in modeling the temporal dynamics of multidimensional interactions.
Link Prediction. We also study the parameter sensitivity of NTF as measured by link prediction performance. Figure 4 shows the prediction accuracy (measured by F1-score and AUC) as a function of each of the four parameters when fixing the other three. Figure 4(c) and Figure 4(d) suggest that the sequence length and state size have little impact on prediction accuracy. The increase in link prediction performance levels off as the number of hidden layers reaches around 4. Additionally, from Figure 4(a) we can observe that performance increases with the embedding size, which is consistent with the observation from the parameter sensitivity evaluation in rating prediction. For NTF without hidden layers (final predictions are directly derived from the embedding layer), the performance is suboptimal. This observation verifies our argument that the dot product operation cannot handle the nonlinear interactions in tensor factorization and demonstrates the necessity of modeling the complex interaction dependencies with hidden layers.
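The protocol above (vary one parameter, hold the rest at their defaults) can be sketched as a one-at-a-time sweep. Everything here is illustrative: apart from the embedding size of 32, the default values and the candidate grids are placeholders rather than the values in Table 3, and `evaluate` is a hypothetical stand-in for training NTF and returning a validation metric.

```python
def one_at_a_time_configs(defaults, grid):
    """Yield (name, value, config) where only `name` deviates from defaults."""
    for name, values in grid.items():
        for value in values:
            config = dict(defaults)  # copy so the defaults stay untouched
            config[name] = value
            yield name, value, config

def sweep(defaults, grid, evaluate):
    """Collect one (value, metric) curve per parameter."""
    results = {}
    for name, value, config in one_at_a_time_configs(defaults, grid):
        results.setdefault(name, []).append((value, evaluate(config)))
    return results

# Embedding size 32 is from the text; the other defaults are placeholders.
defaults = {
    "embedding_size": 32,
    "num_hidden_layers": 3,   # placeholder, see Table 3
    "sequence_length": 5,     # placeholder, see Table 3
    "lstm_state_size": 32,    # placeholder, see Table 3
}

# Hypothetical candidate values for each parameter.
grid = {
    "embedding_size": [8, 16, 32, 64],
    "num_hidden_layers": [0, 1, 2, 3, 4],
    "sequence_length": [3, 5, 7],
    "lstm_state_size": [16, 32, 64],
}

def evaluate(config):
    """Stand-in scorer; a real run would train NTF and return e.g. RMSE."""
    return 1.0 / (1 + config["embedding_size"] + config["num_hidden_layers"])

results = sweep(defaults, grid, evaluate)
```

Each entry of `results` corresponds to one panel of a sensitivity figure: the swept values on the horizontal axis and the metric on the vertical axis.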
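The contrast between a dot-product style score and an MLP over concatenated factors can be made concrete with a small sketch. It is not the NTF implementation: the embeddings are plain lists, the time factor is a given vector rather than an LSTM output, the weights are random and untrained, and ReLU is an assumed activation. With no hidden layers the second scorer is only a linear map of the concatenated factors, which illustrates why such a variant cannot express nonlinear interactions.

```python
import random

def cp_score(u, v, t):
    """Multilinear (CP-style) score: sum over k of u[k] * v[k] * t[k]."""
    return sum(a * b * c for a, b, c in zip(u, v, t))

def init_mlp(input_dim, hidden_dims, seed=0):
    """Random, untrained weights for hidden layers plus one output unit."""
    rng = random.Random(seed)
    dims = [input_dim] + list(hidden_dims) + [1]
    layers = []
    for fan_in, fan_out in zip(dims[:-1], dims[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        biases = [0.0] * fan_out
        layers.append((weights, biases))
    return layers

def mlp_score(u, v, t, layers):
    """Concatenate the three factor vectors and pass them through the MLP."""
    x = list(u) + list(v) + list(t)
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, xi) for xi in x]  # ReLU on hidden layers only
    return x[0]

# Toy factors (hypothetical) for one user, one item and one time slot.
u, v, t = [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]

linear_only = init_mlp(input_dim=6, hidden_dims=[])        # no hidden layer
two_hidden = init_mlp(input_dim=6, hidden_dims=[8, 4])     # two hidden layers

score_cp = cp_score(u, v, t)
score_linear = mlp_score(u, v, t, linear_only)
score_mlp = mlp_score(u, v, t, two_hidden)
```

Adding entries to `hidden_dims` corresponds to increasing the number of hidden layers in the sensitivity study.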
5. Related Work
Deep Collaborative Filtering Models. Collaborative Filtering (CF) has been widely applied to various recommendation systems (He et al., 2017; Wu et al., 2016; Li and She, 2017; Hsieh et al., 2017; Chen et al., 2017). In particular, He et al. developed a neural network collaborative filtering framework by modeling latent features of users and items (He et al., 2017). Wu et al. studied the top-N recommendation problem and proposed an autoencoder based CF method (Wu et al., 2016). Furthermore, a collaborative variational autoencoder has been developed for recommendation systems to consider implicit relationships between items and users (Li and She, 2017). Hsieh et al. studied the connection between metric learning and collaborative filtering (Hsieh et al., 2017). However, these approaches are static models and fall short when it comes to dynamic scenarios. The proposed NTF addresses this problem by modeling the temporal evolution of latent factors in a collaborative filtering framework.
Deep Matrix Factorization. With the advent of deep learning techniques, significant effort has been made to develop neural network-based matrix factorization models (Sedhain et al., 2015; Sainath et al., 2013; Kim et al., 2016; Dziugaite and Roy, 2015). Sedhain et al. proposed an autoencoder framework for collaborative filtering (Sedhain et al., 2015). Sainath et al. proposed to apply low-rank factorization to deep neural network models to address the language modeling problem (Sainath et al., 2013). More recently, to address the sparsity problem in recommendation techniques, Kim et al. designed a model which integrates a convolutional neural network (CNN) into probabilistic matrix factorization (PMF) (Kim et al., 2016). Dziugaite and Roy suggested replacing the inner product in matrix factorization with a function that is learned from the data together with the latent feature vectors (Dziugaite and Roy, 2015). However, the limitation of the above approaches is that they only consider static data rather than dynamic data in which the temporal dimension needs to be explored. Our work furthers the investigation in this direction by developing the NTF framework to capture the time-evolving dynamics exhibited by relational data with multiple types of entity dependencies, which cannot be handled by previous models.
Applications of Matrix Factorization. There is a substantial body of work on the applications of matrix factorization. Existing recommendation techniques can be grouped into three categories: content-based algorithms (Cantador et al., 2010; Van den Oord et al., 2013), collaborative filtering based algorithms (Mnih and Salakhutdinov, 2008; Bu et al., 2016) and hybrid algorithms (He and McAuley, 2016; Wang et al., 2015). For example, several content-based recommendation models have been evaluated based on the profiles of users and items (Cantador et al., 2010). Mnih and Salakhutdinov presented a Probabilistic Matrix Factorization (PMF) model and demonstrated its effectiveness on movie rating data (Mnih and Salakhutdinov, 2008). Additionally, Wang et al. proposed a hierarchical Bayesian model which integrates content information and a collaborative filtering scheme by performing deep representation learning (Wang et al., 2015). Li et al. generalized the latent factor framework for social network analysis by modeling homophily. Our work is complementary to the above works in the sense that explicitly exploring temporal dynamics in relational data normally leads to better recommendation results.
6. Conclusion
We developed a novel and general Neural network based Tensor Factorization (NTF) model for dynamic relational data that addresses the critical challenge of evolving user-item relations. By modeling the time-evolving inherent factors and incorporating temporal smoothness constraints on those factors, NTF is capable of capturing both the time-varying interactions across dimensions and the nonlinear relations between them. Extensive experiments on two real-world datasets in rating prediction and link prediction tasks show that NTF significantly outperforms baseline methods.
Notwithstanding the interesting problem and promising results, some directions exist for future work. We will next incorporate rich heterogeneous auxiliary data to further improve the model. Another possible direction is adapting NTF to a time-sensitive model by analyzing the trade-off between accuracy and complexity.
References
 Adamic and Adar (2003) Lada A Adamic and Eytan Adar. 2003. Friends and neighbors on the web. Social networks 25, 3 (2003), 211–230.
 Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35, 8 (2013), 1798–1828.
 Bhargava et al. (2015) Preeti Bhargava, Thomas Phan, Jiayu Zhou, and Juhan Lee. 2015. Who, what, when, and where: Multi-dimensional collaborative recommendations using tensor factorization on sparse user-generated data. In WWW. ACM, 130–140.
 Bu et al. (2016) Jiajun Bu, Xin Shen, Bin Xu, Chun Chen, Xiaofei He, and Deng Cai. 2016. Improving collaborative recommendation via user-item subgroups. TKDE 28, 9 (2016), 2363–2375.
 Cantador et al. (2010) Iván Cantador, Alejandro Bellogín, and David Vallet. 2010. Content-based recommendation in social tagging systems. In Recsys. ACM, 237–240.
 Carroll and Chang (1970) J Douglas Carroll and Jih-Jie Chang. 1970. Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika 35, 3 (1970), 283–319.
 Chawla et al. (2004) Nitesh V Chawla, Nathalie Japkowicz, and Aleksander Kotcz. 2004. Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations (2004), 1–6.
 Chen et al. (2017) Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive collaborative filtering: Multimedia recommendation with item- and component-level attention. In SIGIR. ACM, 335–344.
 Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, et al. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
 Chung et al. (2015) Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2015. Gated feedback recurrent neural networks. In ICML. 2067–2075.
 Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. JMLR 12, Aug (2011), 2493–2537.
 Dziugaite and Roy (2015) Gintare Karolina Dziugaite and Daniel M Roy. 2015. Neural network matrix factorization. ICLR (2015).
 He and McAuley (2016) Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW. ACM, 507–517.
 He and Chua (2017) Xiangnan He and Tat-Seng Chua. 2017. Neural Factorization Machines for Sparse Predictive Analytics. In SIGIR. ACM.
 He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In WWW. ACM, 173–182.
 Hsieh et al. (2017) Cheng-Kang Hsieh, Longqi Yang, Yin Cui, Tsung-Yi Lin, Serge Belongie, and Deborah Estrin. 2017. Collaborative metric learning. In WWW. ACM, 193–201.
 Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML. 448–456.
 Jozefowicz et al. (2015) Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. 2015. An empirical exploration of recurrent network architectures. In ICML. 2342–2350.
 Kim et al. (2016) Donghyun Kim, Chanyoung Park, Jinoh Oh, Sungyoung Lee, and Hwanjo Yu. 2016. Convolutional matrix factorization for document context-aware recommendation. In Recsys. ACM, 233–240.
 Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
 Kolda and Bader (2009) Tamara G Kolda and Brett W Bader. 2009. Tensor decompositions and applications. SIAM review 51, 3 (2009), 455–500.
 Koren (2008) Yehuda Koren. 2008. Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model. In KDD ’08. ACM, New York, NY, USA, 426–434.
 Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (Aug. 2009), 30–37.
 Li and She (2017) Xiaopeng Li and James She. 2017. Collaborative Variational Autoencoder for Recommender Systems. In KDD. ACM, 305–314.
 Liben-Nowell and Kleinberg (2007) David Liben-Nowell and Jon Kleinberg. 2007. The link-prediction problem for social networks. Journal of the Association for Information Science and Technology 58, 7 (2007), 1019–1031.
 Ma et al. (2008) Hao Ma, Haixuan Yang, Michael R. Lyu, and Irwin King. 2008. SoRec: Social Recommendation Using Probabilistic Matrix Factorization. In CIKM. ACM, New York, NY, USA, 931–940.
 Mnih and Salakhutdinov (2008) Andriy Mnih and Ruslan R Salakhutdinov. 2008. Probabilistic matrix factorization. In NIPS. 1257–1264.
 Rigamonti et al. (2011) Roberto Rigamonti, Matthew A Brown, et al. 2011. Are sparse representations really relevant for image classification?. In CVPR. IEEE, 1545–1552.
 Sainath et al. (2013) Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. 2013. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In ICASSP. IEEE, 6655–6659.
 Salakhutdinov and Mnih (2008) Ruslan Salakhutdinov and Andriy Mnih. 2008. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In ICML. ACM, 880–887.
 Scellato et al. (2011) Salvatore Scellato, Anastasios Noulas, et al. 2011. Exploiting place features in link prediction on location-based social networks. In KDD. ACM, 1046–1054.
 Sedhain et al. (2015) Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. Autorec: Autoencoders meet collaborative filtering. In WWW. ACM, 111–112.
 Shih et al. (2016) Ting-Yi Shih, Ting-Chang Hou, Jian-De Jiang, Yen-Chieh Lien, Chia-Rui Lin, and Pu-Jen Cheng. 2016. Dynamically Integrating Item Exposure with Rating Prediction in Collaborative Filtering. In SIGIR. ACM, 813–816.
 Shimodaira (2000) Hidetoshi Shimodaira. 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference 90, 2 (2000), 227–244.
 Song et al. (2016) Yang Song, Ali Mamdouh Elkahky, and Xiaodong He. 2016. Multi-rate deep learning for temporal recommendation. In SIGIR. ACM, 909–912.
 Van den Oord et al. (2013) Aaron Van den Oord, Sander Dieleman, and Benjamin Schrauwen. 2013. Deep content-based music recommendation. In NIPS. 2643–2651.
 Wang et al. (2015) Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative deep learning for recommender systems. In KDD. ACM, 1235–1244.
 Wu et al. (2017) Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J Smola, and How Jing. 2017. Recurrent recommender networks. In WSDM. ACM, 495–503.
 Wu et al. (2016) Yao Wu, Christopher DuBois, Alice X Zheng, et al. 2016. Collaborative denoising auto-encoders for top-n recommender systems. In WSDM. ACM, 153–162.
 Xiong et al. (2010) Liang Xiong, Xi Chen, et al. 2010. Temporal collaborative filtering with bayesian probabilistic tensor factorization. In SDM. SIAM, 211–222.