1 Introduction
Recommender systems, which analyze users' preference patterns to suggest potential targets, are indispensable to content providers, electronic retailers, web search engines, etc. The key mathematical problem underlying recommender systems is matrix completion [Candès and Recht, 2009]. Given a set of users and a set of items, the recommendation algorithm aims to fill in the missing entries of the user-item rating matrix given the observed entries.
The classical way to solve this problem is Matrix Factorization (MF) [Koren et al., 2009], in which the rating scores are generated by functions over the latent factors, or embeddings, of users and items. Recent advancements in deep learning, especially Graph Convolutional Networks (GCN) [Defferrard et al., 2016; Bronstein et al., 2017; Kipf and Welling, 2017; Hamilton et al., 2017], have brought new ideas for tackling this essential artificial intelligence problem. GCN generalizes the definition of convolution from regular grids to irregular structures such as graphs. The GCN framework generates node representations with a localized parameter-sharing operator, known as a graph aggregator [Hamilton et al., 2017; Zhang et al., 2018]. A graph aggregator calculates a node's representation by transforming and aggregating the features of its local neighborhood. By stacking multiple graph aggregators and nonlinear functions, we build a deep neural network that can extract features across far reaches of a graph. Because the local neighborhood set can be viewed as the receptive field of a convolution kernel, this kind of neighborhood aggregation method is called graph convolution, which also has connections to spectral graph theory [Kipf and Welling, 2017].
Monti et al. [2017] proposed the first GCN-based method for recommender systems. In their approach, GCN was used to aggregate information from two auxiliary user-user and item-item graphs. The latent factors of users and items were updated after each aggregation step, and a combined objective function of GCN and MF was used to train the model. After that, Berg et al. [2017] proposed the Graph Convolutional Matrix Completion (GC-MC) model. GC-MC directly characterized the relationship between users and items as a bipartite interaction graph. Two multi-link graph convolution layers were used to aggregate user features and item features. The ratings were estimated by predicting the edge labels. Thanks to the power of GCN in learning high-quality user and item representations, GC-MC has achieved state-of-the-art performance on several public recommendation benchmarks.
While powerful, the GC-MC model has two significant limitations. First, to distinguish each node, the model uses one-hot vectors as node input. This makes the input dimensionality proportional to the total number of nodes, so the model does not scale to large graphs. Second, the model is unable to predict the ratings for new users or items that are not seen in the training phase, because unknown nodes cannot be represented as one-hot vectors. The task of predicting ratings for new users or items is also known as the cold start problem.
In this paper, we propose a new architecture, STAcked and Reconstructed Graph Convolutional Networks (STAR-GCN), to solve these problems. Unlike GC-MC, STAR-GCN directly learns low-dimensional user and item embeddings as the input to the network in an end-to-end fashion. To improve the learned embeddings and also generalize the model to predict embeddings of unseen nodes for the cold start problem, STAR-GCN masks part or all of the user and item embeddings and reconstructs these masked embeddings with a block of graph encoder-decoder in the training phase. This technique is inspired by the recent success of the 'masked language model' in learning language embeddings [Devlin et al., 2018]. Moreover, we build a stack of encoder-decoder blocks in conjunction with intermediate task-specific supervision to enhance the final performance. During implementation, we find that training GCN-based models for rating prediction faces a label leakage issue, which causes overfitting and significantly degrades the final performance. To avoid the leakage issue, we provide a sample-and-remove training strategy and empirically demonstrate its effectiveness.
We conduct experiments on two tasks: transductive rating prediction and inductive rating prediction. Transductive rating prediction is the setting generally used in traditional matrix completion tasks, i.e., all the testing users and items are observed in the training data. Inductive rating prediction is a newly introduced task to evaluate different models' ability on the cold start problem. We ask new users to rate a few items or require new items to be rated by a few users. These data are only used in the inference step to elicit initial information about new users/items, which is similar to the ask-to-rate technique [Nadimi-Shahraki and Bahadorpour, 2014] for cold start. Experiments show that STAR-GCN achieves state-of-the-art performance on four out of five real-world datasets in the transductive setting. In the inductive setting, STAR-GCN consistently and significantly outperforms the baselines.
Our main contributions are: (1) we propose a new architecture for recommender systems to learn latent factors of users and items in both transductive and inductive settings; (2) we are the first to explicitly pinpoint a training label leakage issue when implementing GCN-based models in rating prediction tasks, and we propose a training strategy to avoid this issue, leading to substantial performance improvement; (3) our STAR-GCN models achieve state-of-the-art performance on four out of five real-world recommendation datasets in the transductive setting and significantly outperform other models in the inductive setting.
2 Preliminary
We denote vectors with bold lowercase letters, matrices with bold uppercase letters, and sets with calligraphic letters. We omit the bias terms of linear transformations for brevity.
2.1 Rating Prediction Tasks
GCN-based models treat the recommendation environment as an undirected bipartite graph that contains two disjoint node sets, users $\mathcal{U}$ and items $\mathcal{V}$. Suppose there are $N_u$ users and $N_v$ items; an edge value $r_{ij}$ represents an observed rating from user $u_i$ to item $v_j$. The rating may take one of several ordinal levels, i.e., $r_{ij} \in \{1, \dots, R\}$. Each rating level indicates a link type in the bipartite graph. All the training rating pairs form a training graph, and they are also included in a testing graph. Examples are shown in Figure 1. The goal of rating prediction is to predict the ratings a user would give to other items, given a small subset of observed rating pairs. We focus on two types of rating prediction tasks: transductive rating prediction and inductive rating prediction. Figure 1 illustrates the difference between these two tasks.
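The bipartite view above, with one link type per rating level, can be sketched as follows. This is only an illustration: the function name `build_link_type_neighbors` and the toy rating triples are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def build_link_type_neighbors(ratings):
    """Group the edges of the bipartite graph by rating level (link type).

    ratings: iterable of (user, item, rating) triples.
    Returns two dicts keyed by (node, rating level):
      user_nbrs[(u, r)] -> set of items that user u rated with level r
      item_nbrs[(v, r)] -> set of users that rated item v with level r
    """
    user_nbrs = defaultdict(set)
    item_nbrs = defaultdict(set)
    for u, v, r in ratings:
        # The graph is undirected, so each edge is stored in both directions.
        user_nbrs[(u, r)].add(v)
        item_nbrs[(v, r)].add(u)
    return user_nbrs, item_nbrs


# Hypothetical toy data: three users, three items, rating levels 1..5.
toy_ratings = [(0, 0, 5), (0, 1, 3), (1, 1, 5), (2, 2, 1)]
user_nbrs, item_nbrs = build_link_type_neighbors(toy_ratings)
```

Keeping a separate neighbor set per rating level is what later allows an aggregator to apply a different transformation to each link type.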
Transductive Rating Prediction. Figure 1(a) shows a transductive rating prediction example, where all users and items appearing in the testing graph are observed in the training graph. Prior collaborative filtering methods, like Matrix Factorization (MF) [Koren et al., 2009], primarily concentrate on this task.
Inductive Rating Prediction. Figure 1(b) illustrates an example of the inductive rating prediction task, in which two new nodes are not seen at training time but appear in the testing rating pairs. Before making predictions, we access a few rated edges connected with these new nodes in the testing graph. Traditional collaborative filtering methods cannot solve this task without retraining the models. The standard way is to rely on content information to model the users' and items' preferences. The Collaborative Deep Learning (CDL) [Wang et al., 2015] model and DropoutNet [Volkovs et al., 2017] are two recent representative methods. The core idea of these two models is to use a deep neural network to learn effective features of the input node content.
Recent progress in deep learning on graphs, mainly the GCN models, can address the above two tasks by learning transductive and inductive node representations. STAR-GCN inherits the ability of GCN for both transductive and inductive learning. Compared with CDL and DropoutNet, STAR-GCN not only takes account of a node's content information but also utilizes the structural information to learn the embeddings of new nodes. Thus, STAR-GCN can solve the cold start problem even when content information is unavailable, which is infeasible for CDL and DropoutNet.
2.2 Graph Convolutional Matrix Completion
In this subsection, we briefly revisit GC-MC [Berg et al., 2017]. Our STAR-GCN employs a similar graph aggregator to encode structural information. GC-MC uses a multi-link graph convolutional encoder to produce node representations. Each link type is assigned a specific transformation. The messages from items to user $u_i$ are computed as

$$\mathbf{h}_{u_i, r} = \sum_{v_j \in \mathcal{N}_{u_i, r}} \frac{1}{c_{ij}} \mathbf{W}_r \mathbf{x}_{v_j}. \tag{1}$$

Here, $\mathbf{h}_{u_i, r}$ is the aggregated output of link type $r$, $\mathbf{x}_{v_j}$ is the initial vector of item $v_j$, and $D_{in}$ is the input dimension. $\mathbf{W}_r$ of size $D_h \times D_{in}$ is a link-specific weight matrix for rating level $r$, which transforms a vector of dimension $D_{in}$ to a hidden size $D_h$. $c_{ij}$ is a normalization constant computed from the sizes of the neighbor sets, with $\mathcal{N}_{u_i, r}$ denoting the set of neighbors of $u_i$ connected by edge value $r$. After computing the link-specific messages, we sum the messages over all $R$ link types and pass the output to a nonlinear function $\sigma(\cdot)$. Finally, we employ a fully connected layer with parameter $\mathbf{W}$ of size $D_{out} \times D_h$ and another nonlinear activation to produce the final node vector $\mathbf{h}_{u_i}$ for user $u_i$, i.e., $\mathbf{h}_{u_i} = \sigma\big(\mathbf{W}\,\sigma\big(\sum_{r=1}^{R} \mathbf{h}_{u_i, r}\big)\big)$. Messages from users to items are processed analogously with a separate set of parameters.
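As a rough illustration of the multi-link aggregation described above, the NumPy sketch below computes item-to-user messages with one weight matrix per rating level, sums them, and applies a fully connected layer. Several details are assumptions made for the sake of a runnable example rather than the authors' implementation: the symmetric normalization $c_{ij} = \sqrt{|\mathcal{N}_{u_i}||\mathcal{N}_{v_j}|}$ (one of the options used in GC-MC), the dense 0/1 rating tensor, and the function names.

```python
import numpy as np


def leaky_relu(x, slope=0.1):
    """LeakyReLU nonlinearity."""
    return np.where(x > 0, x, slope * x)


def multi_link_aggregate(ratings, X_item, W_r, W_out):
    """Item-to-user aggregation with one weight matrix per rating level.

    ratings : (R, num_users, num_items) 0/1 array; ratings[r, i, j] = 1
              if user i gave item j the r-th rating level.
    X_item  : (num_items, D_in) item input vectors.
    W_r     : (R, D_h, D_in) link-specific weight matrices.
    W_out   : (D_out, D_h) weight of the final fully connected layer.
    Returns a (num_users, D_out) array of user vectors.
    """
    adj = ratings.sum(axis=0)            # all observed edges
    deg_u = adj.sum(axis=1)              # number of neighbors per user
    deg_v = adj.sum(axis=0)              # number of neighbors per item
    # ASSUMPTION: symmetric normalization c_ij = sqrt(|N_i| * |N_j|).
    c = np.sqrt(np.outer(deg_u, deg_v))
    c[c == 0] = 1.0                      # isolated nodes: avoid division by zero
    h = np.zeros((ratings.shape[1], W_r.shape[1]))
    for r in range(ratings.shape[0]):
        # Normalized messages of link type r, summed over the neighbors.
        h += (ratings[r] / c) @ (X_item @ W_r[r].T)
    return leaky_relu(leaky_relu(h) @ W_out.T)
```

A user without any observed rating receives an all-zero vector in this sketch, since no messages reach it and the bias terms are omitted.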
In the next step, the GC-MC model takes the computed user vector $\mathbf{h}_{u_i}$ and item vector $\mathbf{h}_{v_j}$ as the input to predict the rating value $\hat{r}_{ij}$. See Berg et al. [2017] for more details. In the following sections, we do not distinguish between users and items and refer to both simply as nodes.
3 Our Models
The architecture of STAR-GCN is a multi-block graph encoder-decoder, shown in Figure 2(a). The multi-block architecture allows reassessment of initial estimates and features across the whole graph. In particular, each block contains two components: a graph encoder and a decoder. The graph encoder generates node representations by encoding semantic graph structures as well as input content features, and the decoder aims to recover the input node embeddings. For each block, we impose a task-specific loss after the graph encoder and a node reconstruction loss after the decoder.
STAR-GCN supports two different ways of combining consecutive blocks: stacking and recurrence. The main difference is whether parameters are shared among blocks. By stacking, we consecutively place multiple encoder-decoder blocks with separate sets of parameters. By recurrence, we unfold a single encoder-decoder block, so the same set of parameters is shared across all the blocks, which reduces the total memory usage. Besides, STAR-GCN is a general framework that can be simplified to several special cases, as shown in Figure 2(b) and 2(c). Empirical studies on the two combinations and the simplified models are reported in Table 3.
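The difference between stacking and recurrence can be illustrated with a toy sketch. The blocks below are placeholders (plain dense layers instead of graph encoders), and all names are hypothetical; the point is only to show how sharing one block across steps changes the number of distinct parameters while the forward pass stays the same.

```python
import numpy as np


def make_block(dim, rng):
    """One toy encoder-decoder block: a pair of square weight matrices."""
    return {"enc": rng.normal(size=(dim, dim)),
            "dec": rng.normal(size=(dim, dim))}


def build_blocks(num_blocks, dim, recurrent, seed=0):
    """Stacking creates separate parameters; recurrence reuses one block."""
    rng = np.random.default_rng(seed)
    if recurrent:
        shared = make_block(dim, rng)
        return [shared] * num_blocks      # the same object at every step
    return [make_block(dim, rng) for _ in range(num_blocks)]


def num_unique_params(blocks):
    """Count parameters, counting shared blocks only once."""
    unique = {id(b): b for b in blocks}
    return sum(w.size for b in unique.values() for w in b.values())


def forward(blocks, x):
    """Run x through all blocks; each block emits an encoder output."""
    outputs = []
    for b in blocks:
        h = np.tanh(x @ b["enc"])   # encoder output (would feed the rating loss)
        x = h @ b["dec"]            # reconstructed input, passed to the next block
        outputs.append(h)
    return outputs
```

With two blocks of dimension 4, the stacked variant holds 64 parameters while the recurrent variant holds 32, yet both produce one encoder output per block.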
3.1 Input Node Representations
To make the network scalable to large graphs, we use an embedding lookup table $\mathbf{E}$ to map each node to a low-dimensional vector $\mathbf{e}$ whose dimension is much smaller than the number of nodes. $\mathbf{E}$ is trained end-to-end along with the network. However, naively replacing the one-hot vectors with embeddings fails to tackle the cold start problem, because we cannot set the embeddings of nodes that are not seen in the training phase.
To generalize the embedding learning technique to new nodes while preserving high prediction accuracy, we mask some percentage of the input nodes at random and then reconstruct the clean node embeddings. Like Devlin et al. [2018], in each training batch, we mask a percentage $p_m$, say 20%, of all input nodes at random and then reconstruct these masked embeddings. For each masked node, one of the following is applied: (1) with probability $p_z$, we set the node embedding to zero; and (2) with the remaining probability, we keep the node embedding unchanged.
Training with the masked embedding mechanism has two advantages. First, it can learn embeddings for nodes that are not observed in the training phase. In a cold start scenario, we initialize the embeddings of new nodes to zero and gradually refine the estimated embeddings with multiple blocks of GCN encoder-decoders. For instance, the first block predicts the embedding of the new node by leveraging the neighborhood data (or node attributes, if available). Then the predicted embedding is fed to the second block to predict ratings and a refined embedding. The rating and embedding prediction losses are jointly optimized. Thus, STAR-GCN can solve the cold start issue by iteratively refining the embeddings and is fundamentally different from GC-MC. Second, the mechanism leads to improvement in the transductive setting. In the training phase, part of the node embeddings are masked and the network is asked to reconstruct them, which requires the network to encode the relationships between users and items effectively. Thus, the reconstruction loss acts as a multi-task regularizer that improves the performance of the primary rating prediction task.
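The masking step can be sketched as below. This is a minimal illustration under stated assumptions: `p_m` names the fraction of nodes selected for reconstruction and `p_z` the probability of zeroing a selected embedding; the function name and the use of a NumPy random generator are choices made for this example, not details from the paper.

```python
import numpy as np


def mask_embeddings(E, p_m, p_z, rng):
    """Select a fraction p_m of rows for reconstruction; zero each selected
    row with probability p_z and leave it unchanged otherwise.

    E   : (num_nodes, D) embedding matrix.
    rng : numpy.random.Generator.
    Returns the masked copy of E and the indices of the selected rows.
    """
    n = E.shape[0]
    num_selected = int(round(p_m * n))
    selected = rng.choice(n, size=num_selected, replace=False)
    # Each selected node is zeroed independently with probability p_z.
    zeroed = selected[rng.random(num_selected) < p_z]
    E_masked = E.copy()
    E_masked[zeroed] = 0.0
    return E_masked, selected
```

Setting `p_z = 1.0` zeroes every selected node, which mimics the cold start case where a new node enters the network with an all-zero embedding.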
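When external node features are available, they are first processed by a separate network and then concatenated with the node embeddings. The feature vector $\mathbf{f}$ is mapped to a fixed-size vector $\mathbf{f}'$ using a two-layer feedforward neural network in which both layers have output dimension $D_f$. The input node vector then becomes the concatenation $\mathbf{x} = [\mathbf{e}; \mathbf{f}']$ of the node embedding $\mathbf{e}$ and the processed features, and the input dimension is $D_{in} = D_e + D_f$, rather than $D_{in} = D_e$ (the embedding dimension) when content information is not available.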
3.2 Graph Encoder and Decoder
The graph encoder transforms an input node vector $\mathbf{x}$ into a fixed-size hidden state $\mathbf{h}$ by aggregating neighboring information of different rating levels, i.e., $\mathbf{h} = \mathrm{Enc}(\mathbf{x})$. We choose the encoder to be the multi-link GCN aggregator in GC-MC, which is formulated in Eq. (1). A decoder maps the structure-encoded node representation $\mathbf{h}$ to a $D_{in}$-dimensional reconstructed embedding vector $\hat{\mathbf{x}}$, i.e., $\hat{\mathbf{x}} = \mathrm{Dec}(\mathbf{h})$. We use a two-layer feedforward neural network as the decoder, in which both layers have output dimension $D_{in}$.
STAR-GCN is a general framework with a stack of GCN encoder-decoder blocks. Any variant of GCNs, e.g., GraphSAGE [Hamilton et al., 2017] and GAT [Veličković et al., 2018], can be used as an encoder or decoder in STAR-GCN. Our graph encoder-decoder differs from the graph auto-encoder model [Kipf and Welling, 2016] mainly in the role of the decoder. Our decoder recovers the initial input node vectors, while their decoder is a task-specific classifier that produces predictions. Another difference is that our graph aggregator considers different link types [Schlichtkrull et al., 2018], whereas their aggregator only models a single link type.
3.3 Loss
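Suppose there are $L$ blocks; the loss function is expressed as

$$\mathcal{L} = \sum_{l=1}^{L} \left( \mathcal{L}_t^{(l)} + \lambda^{(l)} \mathcal{L}_r^{(l)} \right), \tag{2}$$

where $\mathcal{L}_t^{(l)}$ is a supervised task-specific loss, i.e., the rating prediction loss, and $\mathcal{L}_r^{(l)}$ is a reconstruction loss from the $l$-th block. $\lambda^{(l)}$ is a constant weighting factor. In the following description, we omit the block superscript for brevity.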
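Suppose we have a batch of sampled edges $\mathcal{E}_b$ and two sets of masked nodes $\mathcal{U}_m$ and $\mathcal{V}_m$; the specific losses are

$$\mathcal{L}_t = \frac{1}{|\mathcal{E}_b|} \sum_{(u_i, v_j) \in \mathcal{E}_b} \left( r_{ij} - \hat{\mathbf{u}}_i^{\top} \hat{\mathbf{v}}_j \right)^2,$$

$$\mathcal{L}_r = \frac{1}{2|\mathcal{U}_m|} \sum_{u_i \in \mathcal{U}_m} \left\lVert \mathbf{x}_{u_i} - \hat{\mathbf{x}}_{u_i} \right\rVert^2 + \frac{1}{2|\mathcal{V}_m|} \sum_{v_j \in \mathcal{V}_m} \left\lVert \mathbf{x}_{v_j} - \hat{\mathbf{x}}_{v_j} \right\rVert^2,$$

where $\mathbf{x}$ denotes the clean (unmasked) input vector of a node and $\hat{\mathbf{x}}$ its reconstruction by the decoder. For the rating prediction loss, $\hat{\mathbf{u}}_i$ and $\hat{\mathbf{v}}_j$ are generated by linear transforms of the output of a graph encoder, i.e., $\hat{\mathbf{u}}_i = \mathbf{W}_u \mathbf{h}_{u_i}$ and $\hat{\mathbf{v}}_j = \mathbf{W}_v \mathbf{h}_{v_j}$, with user- or item-specific matrix parameters $\mathbf{W}_u$ and $\mathbf{W}_v$. We train the overall model end-to-end with backpropagation.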
During inference, an $L$-block STAR-GCN can produce $L$ predictions, one per block. Generally, we take the prediction from the last block as the final result.
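The way the per-block losses combine can be sketched as follows. The squared-error forms and their normalization are assumptions chosen for this illustration (they match the RMSE evaluation metric, but the exact constants may differ from the authors' implementation), and the function names are hypothetical.

```python
import numpy as np


def rating_loss(r, u_hat, v_hat):
    """Mean squared error between ratings and dot-product predictions.

    r            : (B,) observed ratings of the sampled edges.
    u_hat, v_hat : (B, D) projected user and item vectors of each edge.
    """
    pred = np.sum(u_hat * v_hat, axis=1)
    return np.mean((r - pred) ** 2)


def reconstruction_loss(x, x_hat):
    """Mean squared L2 distance between clean and reconstructed vectors."""
    return np.mean(np.sum((x - x_hat) ** 2, axis=1))


def total_loss(block_losses, lam):
    """Sum over blocks of (task loss + lam * reconstruction loss).

    block_losses: list of (task_loss, reconstruction_loss), one per block.
    """
    return sum(lt + lam * lr for lt, lr in block_losses)
```

Because every block contributes its own rating loss, each block receives direct supervision instead of relying only on gradients from the last block.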
3.4 Training by Avoiding a Leakage Issue
When training GCN-based models for rating prediction, we discover a training label leakage issue: the ground-truth training labels are involved in the model input, which significantly degrades the testing prediction performance. Specifically, suppose we have a training input $x$ and a label $y$; the model should be trained as $y = f(x)$. However, if training label leakage occurs, the model becomes $y = f(x, y)$, resulting in overfitting. The label leakage problem is specific to GCNs because of the neighborhood aggregation operator. The user-item rating values in the training set are used to construct the edges of the bipartite graph. Thus, when we utilize the neighboring data to update a node's representation, the rating values that we need to predict are included in the graph structure. This causes the label leakage issue. As in Figure 3(a), the edge value between a user and an item, which is the prediction target, is also taken as input to the graph aggregator.
To avoid the leakage issue, we provide a sample-and-remove training strategy. At each iteration, we sample a fixed-size batch of rating pairs and remove the sampled pairs (edges) from the training graph before we run the model on that batch. As in Figure 3(b), the sampled edges, drawn with bold links, are removed from the graph when we aggregate neighbors. After avoiding this leakage issue, the network shows a substantial boost in performance. The comparison with and without the leakage issue is given in Table 3.
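The sample-and-remove step can be sketched in a few lines. The data layout (a dict from (user, item) pairs to ratings) and the function name are assumptions made for this example.

```python
import random


def sample_and_remove(train_edges, batch_size, rng):
    """Sample a batch of rating edges and build the aggregation graph
    without them.

    train_edges : dict mapping (user, item) -> rating.
    rng         : random.Random instance.
    Returns (batch, graph): the sampled edges whose ratings are to be
    predicted, and the remaining edges that the aggregator may use.
    """
    batch_keys = rng.sample(sorted(train_edges), batch_size)
    batch = {k: train_edges[k] for k in batch_keys}
    # The labels being predicted are excluded from the aggregation graph.
    graph = {k: r for k, r in train_edges.items() if k not in batch}
    return batch, graph


# Hypothetical toy usage: five training edges, batch of two.
edges = {(0, 0): 5, (0, 1): 3, (1, 1): 4, (2, 0): 1, (2, 2): 2}
batch, graph = sample_and_remove(edges, 2, random.Random(0))
```

The batch and the aggregation graph are disjoint by construction, so no rating in the batch can reach the model through a neighbor message.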
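Table 1: Statistics of the datasets. '-' indicates that the corresponding features are not available.

| Dataset | User feature dim. | Item feature dim. | #U | #V | Rating levels | #R |
|---|---|---|---|---|---|---|
| Flixster | 3K | 3K | 2,341 | 2,956 | 0.5, 1, ..., 5 | 26,173 |
| Douban | 3K | - | 2,999 | 3,000 | 1, ..., 5 | 136,891 |
| ML-100K | 23 | 320 | 943 | 1,682 | 1, ..., 5 | 100K |
| ML-1M | 23 | 320 | 6,040 | 3,706 | 1, ..., 5 | 1M |
| ML-10M | - | 321 | 69,878 | 10,677 | 0.5, 1, ..., 5 | 10M |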
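Table 2: Test RMSE scores for transductive rating prediction. '-' indicates that no result is reported.

| Model | Flixster | Douban | ML-100K | ML-1M | ML-10M |
|---|---|---|---|---|---|
| BiasMF [Koren et al., 2009] | - | - | 0.917 | 0.845 | 0.803 |
| NNMF [Dziugaite and Roy, 2015] | - | - | 0.907 | 0.843 | - |
| I-AUTOREC [Sedhain et al., 2015] | - | - | - | 0.831 | 0.782 |
| GRALS [Rao et al., 2015] | 1.245 | 0.833 | 0.945 | - | - |
| CF-NADE [Zheng et al., 2016] | - | - | - | 0.829 | 0.771 |
| Factorized EAE [Hartford et al., 2018] | - | - | 0.910 | 0.860 | - |
| sRMGCNN [Monti et al., 2017] | 0.926 | 0.801 | 0.929 | - | - |
| GC-MC [Berg et al., 2017] | 0.917 | 0.734 | 0.910 | 0.832 | 0.777 |
| STAR-GCN | 0.879±0.0030 | 0.727±0.0006 | 0.895±0.0009 | 0.832±0.0016 | 0.770±0.0001 |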
4 Experiments
We conduct extensive experiments on five popular recommendation benchmarks for the transductive and inductive rating prediction tasks. The datasets are summarized in Table 1. Flixster and Douban are preprocessed and provided by Monti et al. [2017]. Their user and item features are the adjacency vectors of the respective user-user and item-item graphs. The MovieLens datasets [Harper and Konstan, 2016] (https://grouplens.org/datasets/movielens/) contain different scales of rating pairs, i.e., 100 thousand, 1 million, and 10 million. We denote them as ML-100K, ML-1M, and ML-10M. For user features, we take the age as a scalar, the gender as a binary numerical value, and the occupation as a one-hot encoded vector. For movie features, we concatenate the title name, release year, and one-hot encoded genres. We process title names by averaging the off-the-shelf 300-dimensional GloVe CommonCrawl word vectors [Pennington et al., 2014] of the words in the title.
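We adopt the commonly used Root Mean Squared Error (RMSE) metric to evaluate the prediction accuracy between the ground truth $r_{ij}$ and the predicted rating $\hat{r}_{ij}$,

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{(i, j)} \left( \hat{r}_{ij} - r_{ij} \right)^2}, \tag{3}$$

where the sum runs over the $N$ rating pairs in the test set.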
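For reference, the RMSE metric can be computed with a few lines of plain Python. The function name and the input validation are choices made for this sketch.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length rating sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

For example, predictions `[2, 2]` against ground-truth ratings `[1, 3]` are each off by one, so `rmse([1, 3], [2, 2])` evaluates to 1.0.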
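Table 3: Test RMSE scores of STAR-GCN variants for transductive rating prediction. '- rec.' denotes a model without the reconstruction module and '- rm.' denotes training without removing the sampled edges.

| Input | Model | Flixster | Douban | ML-100K | ML-1M | ML-10M |
|---|---|---|---|---|---|---|
| Emb. only | 1b2l (- rec., - rm.) | 0.920 | 0.731 | 0.921 | 0.841 | 0.782 |
| Emb. only | 1b2l (- rec.) | 0.893 | 0.728 | 0.899 | 0.835 | 0.778 |
| Emb. only | 1b2l | 0.891 | 0.728 | 0.901 | 0.834 | 0.771 |
| Emb. only | 2b1l (recurrent) | 0.883 | 0.727 | 0.895 | 0.833 | 0.773 |
| Emb. only | 2b1l | 0.879 | 0.727 | 0.898 | 0.832 | 0.771 |
| With fea. | 1b2l (- rec., - rm.) | 0.917 | 0.731 | 0.920 | 0.840 | 0.782 |
| With fea. | 1b2l (- rec.) | 0.889 | 0.727 | 0.899 | 0.835 | 0.778 |
| With fea. | 1b2l | 0.887 | 0.728 | 0.901 | 0.834 | 0.770 |
| With fea. | 2b1l (recurrent) | 0.879 | 0.728 | 0.896 | 0.833 | 0.772 |
| With fea. | 2b1l | 0.880 | 0.727 | 0.896 | 0.832 | 0.771 |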
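Table 4: Comparison of test RMSE scores for inductive rating prediction. 'Items 20%' and 'Users 20%' indicate which kind of node is held out, and the percentages 50%, 30%, and 10% give the fraction of ratings linked with new nodes that is observed in the testing phase. '- rec.' denotes that the model does not have the reconstruction module and '+ fea.' means that the model uses external node features. We train all models three times and report the mean scores and the standard deviation.

| Dataset | Model | Items 20%, 50% | Items 20%, 30% | Items 20%, 10% | Users 20%, 50% | Users 20%, 30% | Users 20%, 10% |
|---|---|---|---|---|---|---|---|
| Douban | DropoutNet | - | - | - | 0.797±0.002 | 0.797±0.003 | 0.797±0.001 |
| Douban | CDL | - | - | - | 0.781±0.006 | 0.781±0.001 | 0.781±0.001 |
| Douban | STAR-GCN (- rec.) | 0.734±0.001 | 0.746±0.001 | 0.777±0.002 | 0.731±0.000 | 0.738±0.000 | 0.753±0.001 |
| Douban | STAR-GCN (- rec., + fea.) | - | - | - | 0.731±0.002 | 0.737±0.000 | 0.753±0.001 |
| Douban | STAR-GCN | 0.725±0.001 | 0.734±0.001 | 0.764±0.000 | 0.725±0.001 | 0.731±0.001 | 0.747±0.001 |
| Douban | STAR-GCN (+ fea.) | - | - | - | 0.725±0.002 | 0.731±0.000 | 0.746±0.000 |
| ML-100K | DropoutNet | 1.223±0.065 | 1.167±0.031 | 1.144±0.024 | 1.015±0.002 | 1.022±0.006 | 1.023±0.003 |
| ML-100K | CDL | 1.083±0.009 | 1.082±0.007 | 1.082±0.007 | 1.011±0.005 | 1.013±0.006 | 1.015±0.004 |
| ML-100K | STAR-GCN (- rec.) | 0.932±0.001 | 0.943±0.001 | 0.976±0.003 | 0.919±0.002 | 0.933±0.001 | 0.949±0.001 |
| ML-100K | STAR-GCN (- rec., + fea.) | 0.928±0.002 | 0.941±0.002 | 0.977±0.004 | 0.916±0.005 | 0.931±0.004 | 0.951±0.005 |
| ML-100K | STAR-GCN | 0.919±0.001 | 0.926±0.000 | 0.954±0.001 | 0.907±0.004 | 0.917±0.005 | 0.937±0.005 |
| ML-100K | STAR-GCN (+ fea.) | 0.918±0.002 | 0.926±0.002 | 0.956±0.000 | 0.907±0.002 | 0.917±0.001 | 0.936±0.004 |
| ML-1M | DropoutNet | 1.169±0.120 | 1.134±0.034 | 1.256±0.128 | 1.002±0.001 | 1.005±0.005 | 1.003±0.001 |
| ML-1M | CDL | 1.068±0.009 | 1.069±0.009 | 1.068±0.009 | 0.974±0.000 | 0.975±0.000 | 0.974±0.000 |
| ML-1M | STAR-GCN (- rec.) | 0.862±0.001 | 0.872±0.004 | 0.903±0.004 | 0.859±0.002 | 0.868±0.001 | 0.891±0.001 |
| ML-1M | STAR-GCN (- rec., + fea.) | 0.861±0.002 | 0.867±0.002 | 0.910±0.006 | 0.859±0.001 | 0.869±0.001 | 0.893±0.001 |
| ML-1M | STAR-GCN | 0.844±0.000 | 0.850±0.000 | 0.876±0.004 | 0.848±0.001 | 0.858±0.001 | 0.882±0.000 |
| ML-1M | STAR-GCN (+ fea.) | 0.844±0.001 | 0.851±0.001 | 0.876±0.002 | 0.849±0.001 | 0.858±0.000 | 0.883±0.001 |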
4.1 Model Architecture and Implementation Details
We test the overall network design with different sets of hyperparameters and use the validation set to determine our final design. We regard Flixster, Douban, and ML-100K as small datasets and ML-1M and ML-10M as large datasets. After tuning, we apply two sets of hyperparameters, one for small datasets and the other for large datasets. In all models, we choose the nonlinear function $\sigma(\cdot)$ to be a LeakyReLU activation with a negative slope of 0.1. For the input vectors, we set the dimension of node embeddings to 32 for small datasets and 64 for large datasets. When incorporating features, we set the projection dimension to 8 for small datasets and 32 for large datasets. Regarding the masking mechanism, for transductive prediction, we randomly select a fixed percentage $p_m$ of the nodes for reconstruction and zero their embeddings with a fixed probability $p_z$. For inductive prediction, we uniformly choose the masked nodes and set all of their embeddings to zero, i.e., $p_z = 1$, to approximate the testing data distribution. For encoders, the hidden size is set to 250 and the dimension of the output layer is set to 75. We apply a dropout layer to the input of each GCN layer, with a dropout rate of 0.5 for small datasets and 0.3 for large datasets. For decoders, all the hidden sizes are fixed to the node input dimension, i.e., $D_{in}$. When predicting ratings, we set the projection size to 64.
We train the STAR-GCN models with the Adam optimizer [Kingma and Ba, 2015] and use the validation set to drive the learning rate decay schedule. The initial learning rate is 0.002 and is multiplied by a decay rate of 0.5, down to a minimum of 0.0005, each time the validation RMSE does not decrease within a window of 100 iterations; early stopping is triggered after 150 iterations without improvement. The gradient norm is clipped to be no larger than 1.0. The training batch size is fixed to 10K for small datasets, 100K for ML-1M, and 500K for ML-10M. We train each model three times with different random seeds, except for ML-10M, which we train twice, and report the average test RMSE scores along with the standard deviation.
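The validation-driven learning rate schedule described above can be sketched as a small state machine. The exact bookkeeping is an assumption: this sketch halves the learning rate after every 100 consecutive validation checks without improvement, never goes below 0.0005, and signals a stop after 150 checks without improvement. The function and key names are hypothetical.

```python
def new_schedule_state(initial_lr=0.002):
    """Initial state of the schedule."""
    return {"lr": initial_lr, "best": float("inf"),
            "since_best": 0, "since_decay": 0}


def lr_schedule_step(state, val_rmse, decay=0.5, min_lr=0.0005,
                     decay_patience=100, stop_patience=150):
    """Update the state after one validation check.

    Returns True when training should stop early.
    """
    if val_rmse < state["best"]:
        # Improvement: remember it and reset both counters.
        state["best"] = val_rmse
        state["since_best"] = 0
        state["since_decay"] = 0
        return False
    state["since_best"] += 1
    state["since_decay"] += 1
    if state["since_decay"] >= decay_patience:
        # No improvement within the window: decay, but not below min_lr.
        state["lr"] = max(state["lr"] * decay, min_lr)
        state["since_decay"] = 0
    return state["since_best"] >= stop_patience
```

Starting from 0.002, two decays already reach the floor of 0.0005, so under this reading the schedule has at most three distinct learning rates.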
4.2 Transductive Rating Prediction
All the datasets are used for transductive rating prediction evaluations, where the users and items are all observed in the training dataset. For fair comparison, we strictly follow the experimental setup of Berg et al. [2017]. For Douban and Flixster, we use the split provided by Monti et al. [2017], with 10% of the rating pairs as the testing set. For ML-100K, we use the first of the five provided data splits, with 20% for testing. For ML-1M and ML-10M, we randomly split the edges with 10% for testing. In the transductive setting, we perform a thorough comparison of STAR-GCN with multiple baselines and state-of-the-art models, listed in Table 2. We also conduct a comprehensive ablation analysis in Table 3.
Table 2 summarizes all results of the baselines and our STAR-GCN. The baseline scores are taken directly from Monti et al. [2017] and Berg et al. [2017]. The results of STAR-GCN are produced by different variant models with a total of two graph convolutional layers, labeled as either '1b2l' or '2b1l' in Table 3. '1b2l' denotes a model with one encoder-decoder block whose encoder is a two-layer GCN, as in Figure 2(b). In contrast, '2b1l' indicates a model with two encoder-decoder blocks, each of whose encoders is a one-layer GCN, as in Figure 2(a). In particular, '1b2l (- rec.)' indicates that the model only has an encoder without the reconstruction module, as in Figure 2(c). The RMSE scores reported for STAR-GCN in Table 2 are the best results of the different STAR-GCN models listed in Table 3. We note that the proposed STAR-GCN architecture achieves state-of-the-art results on four out of five datasets. We have the following findings from Table 3.
Effect of removing sampled training edges within mini-batches. Comparing the results of '1b2l (- rec., - rm.)' and '1b2l (- rec.)', we see a significant decrease in the testing RMSE scores after removing the sampled user-item pairs from the bipartite graph in each training batch. This demonstrates the effectiveness of our sample-and-remove training strategy in avoiding the training label leakage issue.
Effect of reconstructing masked nodes. Comparing the models labeled '- rec.' with the models without this label shows that, with the same total number of graph encoders, the models possessing the reconstruction module beat or match the models without reconstruction in most cases, which indicates that our reconstruction mechanism is beneficial to the final prediction performance.
Effect of the recurrent structure. Comparing the results of '2b1l (recurrent)' and '2b1l', we see that the recurrent structure can achieve competitive results with fewer parameters.
Effect of incorporating features. Comparing the results of the same models with and without features, we note that adding external node features does not always produce better performance.
4.3 Inductive Rating Prediction
We conduct inductive rating prediction experiments on three datasets: Douban, ML-100K, and ML-1M. For each dataset, we keep 20% of the user (or item) nodes as the testing nodes and remove them from the training graph. Then, we choose a fraction of the ratings linked with the testing nodes as the edges that are observed in the testing phase. Ratings that are not chosen in this step are kept as the testing data. The predictor never sees these 20% of nodes during training and needs to rely on the observed links (together with the node features, if available) to predict ratings. We conduct experiments on three different fractions, 50%, 30%, and 10%, which means that 50%, 30%, and 10% of the ratings linked with new nodes are available to the predictor in the testing phase. Intuitively, the more edges we access, the more information we have, and the better the performance will be. We implement two baseline models, CDL [Wang et al., 2015] and DropoutNet [Volkovs et al., 2017], and compare several variants of our STAR-GCN architecture. We use the '2b1l' versions of STAR-GCN in this task.
The testing RMSE scores are listed in Table 4. Performance tends to degrade when the predictor accesses fewer neighboring edges for the new users/items in the testing phase. Our STAR-GCN model produces significantly better results than the two baselines. Moreover, comparing the results of the models with and without reconstruction modules, we find that the reconstruction mechanism plays a crucial role in improving the final performance. An interesting observation is that incorporating content information for new nodes is not always beneficial to the final results. The reason may be that our reconstructed node embeddings already contain enough information for accurate predictions, which suggests that STAR-GCN can effectively address the cold start problem using structural information alone.
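The inductive data split can be sketched as below. This is an illustration under stated assumptions: the fraction of observed ratings is taken over all ratings linked with the held-out nodes (rather than per node), and the function name and triple layout are hypothetical.

```python
import random


def inductive_split(ratings, test_nodes, observed_fraction, rng, node_index=0):
    """Split ratings for the inductive setting.

    ratings           : list of (user, item, rating) triples.
    test_nodes        : set of held-out user ids (node_index=0) or
                        item ids (node_index=1).
    observed_fraction : share of the held-out nodes' ratings that is
                        revealed to the predictor at test time.
    rng               : random.Random instance.
    Returns (train, observed_at_test, test_targets).
    """
    train = [t for t in ratings if t[node_index] not in test_nodes]
    linked = [t for t in ratings if t[node_index] in test_nodes]
    rng.shuffle(linked)
    k = int(round(observed_fraction * len(linked)))
    # The first k linked ratings are revealed; the rest are to be predicted.
    return train, linked[:k], linked[k:]
```

The training list never contains a held-out node, so any prediction for such a node has to be built from the edges revealed at test time.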
5 Conclusion and Future Work
We introduce a new GCN-based architecture and apply it to transductive and inductive rating prediction. Our STAR-GCN achieves state-of-the-art results in both tasks. The architecture is generic and can be used in other applications, such as abnormal behavior detection, spatiotemporal forecasting [Shi and Yeung, 2018], thread popularity prediction [Chan and King, 2018], and so on [Gao et al., 2018]. Moreover, we discover a training label leakage issue when implementing GCN-based models for rating prediction tasks. This finding should serve as a caution for future research. In the future, we plan to extend STAR-GCN to handle heterogeneous graphs with diverse node types, to better model real-life scenarios, and to integrate ranking algorithms [Su et al., 2017] to solve other recommendation tasks.
6 Acknowledgement
The work described in this paper was partially supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (No. CUHK 14208815 of the General Research Fund) and Meitu (No. 7010445).
References
 [Berg et al., 2017] Rianne van den Berg, Thomas N Kipf, and Max Welling. Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263, 2017.
 [Bronstein et al., 2017] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
 [Candès and Recht, 2009] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.
 [Chan and King, 2018] Hou Pong Chan and Irwin King. Thread popularity prediction and tracking with a permutation-invariant model. In EMNLP, pages 3392–3401, 2018.
 [Defferrard et al., 2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, pages 3844–3852, 2016.
 [Devlin et al., 2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
 [Dziugaite and Roy, 2015] Gintare Karolina Dziugaite and Daniel M Roy. Neural network matrix factorization. arXiv preprint arXiv:1511.06443, 2015.
 [Gao et al., 2018] Yifan Gao, Jianan Wang, Lidong Bing, Irwin King, and Michael R Lyu. Difficulty controllable question generation for reading comprehension. arXiv preprint arXiv:1807.03586, 2018.
 [Hamilton et al., 2017] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, pages 1025–1035, 2017.
 [Harper and Konstan, 2016] F Maxwell Harper and Joseph A Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.
 [Hartford et al., 2018] Jason S. Hartford, Devon R. Graham, Kevin Leyton-Brown, and Siamak Ravanbakhsh. Deep models of interactions across sets. In ICML, pages 1914–1923, 2018.
 [Kingma and Ba, 2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [Kipf and Welling, 2016] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
 [Kipf and Welling, 2017] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
 [Koren et al., 2009] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 8(Sep):30–37, 2009.
 [Monti et al., 2017] Federico Monti, Michael Bronstein, and Xavier Bresson. Geometric matrix completion with recurrent multi-graph neural networks. In NIPS, pages 3697–3707, 2017.
 [Nadimi-Shahraki and Bahadorpour, 2014] Mohammad-Hossein Nadimi-Shahraki and Mozhde Bahadorpour. Cold-start problem in collaborative recommender systems: efficient methods based on ask-to-rate technique. Journal of Computing and Information Technology, 22(2):105–113, 2014.
 [Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
 [Rao et al., 2015] Nikhil Rao, Hsiang-Fu Yu, Pradeep K Ravikumar, and Inderjit S Dhillon. Collaborative filtering with graph information: Consistency and scalable methods. In NIPS, pages 2107–2115, 2015.
 [Schlichtkrull et al., 2018] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In ESWC, pages 593–607. Springer, 2018.
 [Sedhain et al., 2015] Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. AutoRec: Autoencoders meet collaborative filtering. In WWW, pages 111–112, 2015.
 [Shi and Yeung, 2018] Xingjian Shi and Dit-Yan Yeung. Machine learning for spatiotemporal sequence forecasting: A survey. arXiv preprint arXiv:1808.06865, 2018.
 [Su et al., 2017] Yuxin Su, Irwin King, and Michael R. Lyu. Learning to rank using localized geometric mean metrics. In SIGIR, pages 45–54, 2017.
 [Veličković et al., 2018] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.
 [Volkovs et al., 2017] Maksims Volkovs, Guangwei Yu, and Tomi Poutanen. DropoutNet: Addressing cold start in recommender systems. In NIPS, pages 4957–4966, 2017.
 [Wang et al., 2015] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. Collaborative deep learning for recommender systems. In KDD, pages 1235–1244. ACM, 2015.
 [Zhang et al., 2018] Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. GaAN: Gated attention networks for learning on large and spatiotemporal graphs. In UAI, pages 339–349, 2018.
 [Zheng et al., 2016] Yin Zheng, Bangsheng Tang, Wenkui Ding, and Hanning Zhou. A neural autoregressive approach to collaborative filtering. In ICML, pages 764–773, 2016.