Learn graph patterns for recommender systems based on a GNN.
Most modern successful recommender systems are based on matrix factorization techniques, i.e., learning a latent embedding for each user and each item from the given rating matrix and using the embeddings to complete the matrix. However, these learned latent embeddings are inherently transductive and are not designed to generalize to unseen users/items or new tasks. In this paper, we aim to learn an inductive model for recommender systems based on the local graph patterns around user-item pairs. The inductive model can generalize to unseen users/items, and potentially also transfer to other tasks. To learn such a model, we extract a local enclosing subgraph for each training (user, item) pair, and feed the subgraphs to a graph neural network (GNN) to train a rating prediction model. We show that our model achieves highly competitive performance with state-of-the-art transductive methods, and is more stable when the rating matrix is sparse. Furthermore, our transfer learning experiment validates that the learned model is transferable to new tasks.
Collaborative filtering (CF) techniques for recommender systems leverage collected ratings of items by users to make new recommendations. These collected ratings can be written as entries of an $m \times n$ rating matrix, where $m$ is the number of users and $n$ is the number of items. Modern CF-based recommender systems try to solve the matrix completion problem through matrix factorization techniques, which have achieved great success (Adomavicius and Tuzhilin, 2005; Schafer et al., 2007; Koren et al., 2009; Bobadilla et al., 2013).
However, matrix factorization is intrinsically transductive, meaning that the learned latent features (embeddings) for users/items do not generalize to new users/items. When new users/items enter the system or new ratings are made, a complete retraining is often required to obtain the new embeddings. This behavior makes matrix factorization unsuitable for applications that require timely recommendations in fast-evolving environments, such as news recommendation. Content-based recommender systems alleviate this problem by using user/item content features (Lops et al., 2011). However, these features are not always available and can be hard to extract. Therefore, in this paper, we explore inductive CF methods for recommender systems, where a model learned from the training users and items is directly applicable to unseen users and items without retraining.
So on which data can we train an inductive model for recommender systems? The answer is graphs. If for each existing rating we add an edge between the associated user and item, we can build a bipartite graph where an edge can only exist between a user and an item. Subsequently, predicting unknown ratings corresponds to predicting labeled links in this bipartite graph. This transforms the matrix completion problem into a link prediction problem (Lü and Zhou, 2011). One large category of link prediction methods is heuristic methods, which compute heuristic scores such as common neighbors (Liben-Nowell and Kleinberg, 2007) and the Katz index (Katz, 1953) based on local or global graph patterns. Heuristic methods use these predefined graph structure features for link prediction and are inductive, because these features are not restricted to certain links but are applicable to the entire graph.
Can we also find some heuristics for recommender systems? Intuitively, such heuristics should exist. For example, if a user $u_0$ likes an item $v_0$, we may expect to see very often that $v_0$ is also liked by some other user $u_1$ who shares a similar taste with $u_0$. By similar taste, we mean that $u_1$ and $u_0$ have both liked some other item $v_1$. In the bipartite graph, such a pattern is realized as a “like” path connecting $(u_0, v_1, u_1, v_0)$. If there are many such paths between $u_0$ and $v_0$, we may infer that $u_0$ is highly likely to like $v_0$. Thus, we may count the number of such paths as an indicator of how likely $u_0$ is to like $v_0$. In fact, many neighborhood-based recommender systems (Desrosiers and Karypis, 2011) rely on such heuristics. However, in this work, we do not use any predefined fixed heuristics, but learn heuristics from the existing bipartite graph.
Present work Inspired by (Zhang and Chen, 2017, 2018), we aim to learn graph structure features related to ratings automatically from local enclosing subgraphs around user-item links. An $h$-hop enclosing subgraph for a user-item pair $(u, v)$ is defined as the subgraph induced from the whole bipartite graph by the nodes $u$, $v$ and the neighbors of $u$ and $v$ within $h$ hops. Such local subgraphs contain rich structural information about link existence (Zhang and Chen, 2018). For example, the number of “like” paths between a user and an item described above can be computed from the pair’s 1-hop enclosing subgraph alone. By feeding these enclosing subgraphs to a graph neural network (GNN) (Scarselli et al., 2009; Defferrard et al., 2016; Kipf and Welling, 2016; Zhang et al., 2018), we train a graph regression model that maps a subgraph to the rating of its target link. Owing to its superior graph learning ability, a GNN can learn highly expressive graph structure features useful for rating prediction. Figure 1 illustrates the overall framework. Our model is inductive, as we can freely apply the trained model to the enclosing subgraphs of unseen links without retraining. We can even transfer the model to other similar tasks. We evaluate our model on benchmark datasets, and show that it is highly competitive with state-of-the-art transductive methods. Our model also shows good performance under transfer learning and sparse rating settings.
We present our inductive graph pattern learning (IGPL) framework for recommender systems in this section. Some related work is included in Appendix A. IGPL extracts a local enclosing subgraph around each user-item pair, and trains a GNN regression model on these enclosing subgraphs to predict the ratings. We use $G$ to denote the undirected bipartite graph constructed from the training rating matrix. In $G$, a node is either a user-type node (denoted by $u$) or an item-type node (denoted by $v$). Edges exist only between user-type and item-type nodes. An edge $(u, v)$ also has a type $r$, corresponding to the rating that $u$ gives to $v$. We use $\mathcal{R}$ to denote the set of all possible ratings, and $\mathcal{N}_r(u)$ to denote the set of $u$’s neighbors that connect to $u$ with edge type $r$.
The first part of the IGPL framework is enclosing subgraph extraction. For each training (user, item, rating) tuple, we extract from the bipartite graph $G$ an $h$-hop enclosing subgraph around the user-item pair. We feed these enclosing subgraphs to a GNN and regress on their ratings. Then, for each testing user-item pair, we again extract its $h$-hop enclosing subgraph, and use the trained GNN model to predict its rating. Algorithm 1 describes how we extract $h$-hop enclosing subgraphs.
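The extraction step can be sketched as a breadth-first expansion that alternates between the user side and the item side. This is a minimal illustration under stated assumptions and not a reproduction of Algorithm 1: the function names, the adjacency layout and the choice to record the hop at which each node is added are all hypothetical. The sketch also leaves out the target edge itself, since the text notes that the target link must be removed before training.

```python
from collections import defaultdict

def build_adjacency(triples):
    """Build user->items and item->users maps from (user, item, rating) tuples."""
    user_nbrs = defaultdict(dict)   # user -> {item: rating}
    item_nbrs = defaultdict(dict)   # item -> {user: rating}
    for u, v, r in triples:
        user_nbrs[u][v] = r
        item_nbrs[v][u] = r
    return user_nbrs, item_nbrs

def extract_enclosing_subgraph(user_nbrs, item_nbrs, u, v, h=1):
    """Return (users, items, edges) of the h-hop enclosing subgraph around (u, v).

    users / items map each node to the hop at which it was added (0 = target).
    edges lists (user, item, rating) among the selected nodes, without (u, v).
    """
    users, items = {u: 0}, {v: 0}
    user_fringe, item_fringe = {u}, {v}
    for hop in range(1, h + 1):
        # Neighbors of fringe items are users; neighbors of fringe users are items.
        new_users = {n for i in item_fringe for n in item_nbrs.get(i, {})
                     if n not in users}
        new_items = {n for x in user_fringe for n in user_nbrs.get(x, {})
                     if n not in items}
        users.update((n, hop) for n in new_users)
        items.update((n, hop) for n in new_items)
        user_fringe, item_fringe = new_users, new_items
    # Induced edges between selected users and items, dropping the target link.
    edges = [(a, b, r) for a in users for b, r in user_nbrs.get(a, {}).items()
             if b in items and (a, b) != (u, v)]
    return users, items, edges

triples = [("u0", "v0", 5), ("u0", "v1", 4), ("u1", "v0", 3),
           ("u1", "v1", 5), ("u1", "v2", 2), ("u2", "v2", 1)]
user_nbrs, item_nbrs = build_adjacency(triples)
users, items, edges = extract_enclosing_subgraph(user_nbrs, item_nbrs, "u0", "v0", h=1)
```

For a test pair the target edge is absent from the training graph anyway, so the same function can be used unchanged at prediction time.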
The second part of IGPL is node labeling. Before we feed enclosing subgraphs to the GNN, we apply a node labeling to each enclosing subgraph. A node labeling is a function that returns an integer label for every node in the subgraph. Its purpose is to use different labels to mark nodes’ different roles in a subgraph. For example, 1) we need to distinguish the target user and item nodes, between which the target rating is located, from the other nodes, and 2) we need to differentiate user-type nodes from item-type nodes. To achieve these goals, we propose the following node labeling scheme. We first give label 0 to the target user and label 1 to the target item. For the other nodes, we determine their labels according to the hop at which they are included in the subgraph in Algorithm 1. If a user-type node is included at the $i$-th hop ($i \geq 1$), we give it the label $2i$. If an item-type node is included at the $i$-th hop, we give it the label $2i+1$. Such a node labeling can sufficiently discriminate: 1) target nodes from “context” nodes, 2) users from items (users always have even labels), and 3) nodes at different distances from the target user/item. Note that this is not the only possible node labeling, but we empirically verified that it performs very well.
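The labeling scheme can be written down in a few lines. The sketch below assumes that a user added at hop $i$ receives label $2i$ and an item added at hop $i$ receives label $2i+1$, which is consistent with the stated constraints (target user 0, target item 1, users always even). The function names and the input format (dictionaries mapping each node to its hop) are hypothetical.

```python
def label_nodes(user_hops, item_hops):
    """Turn hop indices into integer labels: users get 2*i, items get 2*i + 1.

    user_hops / item_hops: dict mapping node -> hop at which it was added,
    with hop 0 reserved for the target user and target item.
    """
    user_labels = {n: 2 * hop for n, hop in user_hops.items()}
    item_labels = {n: 2 * hop + 1 for n, hop in item_hops.items()}
    return user_labels, item_labels

def one_hot(label, num_labels):
    """One-hot encode a label; an h-hop subgraph needs 2*h + 2 labels."""
    vec = [0] * num_labels
    vec[label] = 1
    return vec

user_labels, item_labels = label_nodes({"u0": 0, "u1": 1}, {"v0": 0, "v1": 1})
# 1-hop subgraph: labels range over 0..3, so each node gets a length-4 vector.
features = {n: one_hot(l, 4) for n, l in {**user_labels, **item_labels}.items()}
```

The one-hot vectors built this way are the only node features the GNN needs when no side information is used.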
The last part of IGPL is to train a graph neural network (GNN) model that predicts ratings from enclosing subgraphs. A GNN is typically composed of: 1) message passing layers, which aggregate neighboring nodes’ features to the center node to extract a feature vector for each node, and 2) a global pooling layer, which summarizes the node features into a graph representation. To handle different edge types, we adopt the relational graph convolutional operator (R-GCN) (Schlichtkrull et al., 2018) as our GNN’s message passing layer. The R-GCN layer has the following form:

$$\mathbf{x}'_i = \mathbf{W}_0 \mathbf{x}_i + \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_r(i)} \frac{1}{|\mathcal{N}_r(i)|} \mathbf{W}_r \mathbf{x}_j,$$

where $\mathbf{x}_i$ denotes node $i$’s input feature vector, $\mathbf{x}'_i$ denotes its output feature vector, $\mathcal{N}_r(i)$ is the set of neighbors connected to $i$ by edges of type $r \in \mathcal{R}$, and $\mathbf{W}_0$ and $\{\mathbf{W}_r \mid r \in \mathcal{R}\}$ are learnable parameter matrices. In an R-GCN layer, neighbors connected to $i$ with different edge types have different parameter matrices. Thus, the layer is able to learn from the rich graph patterns inside the edge types. We apply several R-GCN layers with tanh activations between layers. The node feature vectors from all layers are concatenated for each node as its final representation.
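To make the message passing step concrete, here is a minimal pure-Python sketch of a single relational graph convolution. It assumes mean normalization over the neighbors of each edge type, as in the R-GCN operator of Schlichtkrull et al. (2018); the helper names and data layout are hypothetical, and a real implementation would use a tensor library rather than nested lists.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rgcn_layer(x, edges, W0, W):
    """Apply one R-GCN layer.

    x:     dict node -> input feature vector
    edges: list of undirected (node_a, node_b, edge_type) tuples
    W0:    self-connection weight matrix
    W:     dict edge_type -> weight matrix for that edge type
    """
    # Group each node's neighbors by edge type.
    nbrs = {i: {} for i in x}
    for a, b, r in edges:
        nbrs[a].setdefault(r, []).append(b)
        nbrs[b].setdefault(r, []).append(a)

    out = {}
    for i, xi in x.items():
        h = matvec(W0, xi)                        # self term W_0 x_i
        for r, js in nbrs[i].items():
            for j in js:
                msg = matvec(W[r], x[j])          # W_r x_j
                # Average the messages over the neighbors of type r.
                h = [hk + mk / len(js) for hk, mk in zip(h, msg)]
        out[i] = h
    return out

x = {"u0": [1.0, 0.0], "u1": [0.0, 1.0], "v0": [1.0, 1.0]}
edges = [("u0", "v0", 5), ("u1", "v0", 5)]
identity = [[1.0, 0.0], [0.0, 1.0]]
out = rgcn_layer(x, edges, W0=identity, W={5: [[2.0, 0.0], [0.0, 2.0]]})
```

Stacking several such layers with `math.tanh` applied elementwise between them, and concatenating each node's outputs across layers, would give the per-node representation described in the text.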
To pool the node representations into a graph representation, we leverage the SortPooling layer from (Zhang et al., 2018). In SortPooling, node representations are sorted according to their continuous Weisfeiler-Lehman colors, represented by their last-layer features. Then, standard 1-D convolutional layers are applied to these sorted representations to learn the final graph representation from both individual nodes and the global topology contained in the node ordering. We empirically verified the superior performance of R-GCN and SortPooling over plain graph convolution (Kipf and Welling, 2016) and sum-pooling (Duvenaud et al., 2015).
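The sorting step of SortPooling can be illustrated as follows. This sketch assumes the behavior described by Zhang et al. (2018): nodes are sorted in descending order of their last feature channel, ties are broken by earlier channels, and the result is truncated or zero-padded to a fixed number of rows `k`. The function name and the value of `k` are hypothetical, and the subsequent 1-D convolutions are omitted.

```python
def sort_pool(node_features, k):
    """Sort node feature vectors and return exactly k rows.

    node_features: list of per-node feature vectors (lists of equal length).
    k: fixed number of rows to keep (assumed hyperparameter).
    """
    dim = len(node_features[0])
    # Descending by last channel, then by the preceding channels for ties.
    ordered = sorted(node_features, key=lambda f: tuple(reversed(f)), reverse=True)
    ordered = ordered[:k]                                   # truncate large graphs
    ordered += [[0.0] * dim for _ in range(k - len(ordered))]  # pad small graphs
    return ordered

feats = [[0.1, 0.5], [0.9, 0.2], [0.3, 0.5]]
pooled = sort_pool(feats, k=4)
# pooled == [[0.3, 0.5], [0.1, 0.5], [0.9, 0.2], [0.0, 0.0]]
```

Because every subgraph is mapped to the same number of rows in a consistent order, ordinary 1-D convolutions can then be applied to graphs of different sizes.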
After obtaining the final graph representation, we add a linear regression layer trained with a mean squared error (MSE) loss between the predictions and the ground truth ratings. There are two additional notes: 1) We use the one-hot encodings of node labels as the initial node features in our experiments. However, one could concatenate them with additional node information, such as content features of nodes. To illustrate the power of learning graph patterns for recommender systems, we do not use any side information in our method, but learn from subgraphs only. 2) Before feeding a training enclosing subgraph to the GNN, we need to remove the edge between the target user and item, so that the rating to be predicted does not leak into the input.
Following the setup of (Monti et al., 2017), we conduct experiments on four standard datasets: Flixster (Jamali and Ester, 2010), Douban (Ma et al., 2011), YahooMusic (Dror et al., 2011) and MovieLens (Miller et al., 2003). For MovieLens, we train and evaluate on the canonical u1.base/u1.test train/test split. For Flixster, Douban and YahooMusic we use the preprocessed subsets provided by (Monti et al., 2017). Dataset statistics are summarized in Table 1. We implemented the GNN in IGPL using PyTorch_Geometric (Fey and Lenssen, 2019). We tuned all hyperparameters based on validation performance. The final architecture uses 4 R-GCN layers with 32, 32, 32 and 1 hidden dimensions, respectively. Basis decomposition with 4 bases is used to reduce the number of parameters in the relation-specific weight matrices (Schlichtkrull et al., 2018). After SortPooling, we apply two 1-D convolutional layers with 16 and 32 output channels, respectively, following (Zhang et al., 2018). The final linear regression layer has 128 hidden units and a dropout rate of 0.5. We use 1-hop enclosing subgraphs for all datasets, and find them sufficiently good. We find that using subgraphs of 2 or more hops can slightly increase the performance but takes longer to train. We train our model using the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.001, and multiply the learning rate by 0.1 every 50 epochs. Our code is available at https://github.com/muhanzhang/IGPL.
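The step decay described above amounts to the following schedule. This is a small illustrative sketch, assuming epochs are counted from 0; the function name is hypothetical. In PyTorch, the same behavior is available through `torch.optim.lr_scheduler.StepLR` with `step_size=50` and `gamma=0.1`.

```python
def learning_rate(epoch, initial_lr=0.001, decay=0.1, step=50):
    """Learning rate at a given epoch under step decay (epochs counted from 0)."""
    return initial_lr * decay ** (epoch // step)

# The rate stays at 0.001 for epochs 0..49 and drops tenfold at epoch 50.
schedule = [learning_rate(e) for e in (0, 49, 50, 59)]
```

With this schedule, a 40-epoch run never decays the learning rate, while a 60-epoch run decays it once, for the last 10 epochs.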
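Table 1: Statistics of each dataset.

| Dataset | Users | Items | Ratings | Density | Rating types |
|---|---|---|---|---|---|
| Flixster | 3,000 | 3,000 | 26,173 | 0.0029 | 0.5, 1, 1.5, …, 5 |
| Douban | 3,000 | 3,000 | 136,891 | 0.0152 | 1, 2, 3, 4, 5 |
| YahooMusic | 3,000 | 3,000 | 5,335 | 0.0006 | 1, 2, 3, …, 100 |
| MovieLens | 943 | 1,682 | 100,000 | 0.0630 | 1, 2, 3, 4, 5 |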
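As a quick sanity check on the dataset statistics, the density figures can be recomputed as the number of ratings divided by the number of possible user-item pairs. The sketch assumes that the first three numeric columns of the table are users, items and ratings, and that the fourth is the density; the variable and function names are hypothetical.

```python
# (users, items, ratings) per dataset, taken from the table above.
datasets = {
    "Flixster":   (3000, 3000, 26173),
    "Douban":     (3000, 3000, 136891),
    "YahooMusic": (3000, 3000, 5335),
    "MovieLens":  (943, 1682, 100000),
}

def density(users, items, ratings):
    """Fraction of the users x items rating matrix that is observed."""
    return ratings / (users * items)

for name, (users, items, ratings) in datasets.items():
    print(f"{name}: {density(users, items, ratings):.4f}")
```

The printed values agree with the reported densities to four decimal places, which supports this reading of the columns.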
For Flixster, Douban and YahooMusic, we compare our IGPL with GRALS (Rao et al., 2015), sRGCNN (Monti et al., 2017), and GC-MC (Berg et al., 2017). Among them, GRALS is a graph-regularized matrix completion algorithm. GC-MC and sRGCNN are GNN-assisted matrix completion methods, where GNNs are used to learn better user/item latent features to reconstruct the rating matrix. Thus, they are still transductive models. In contrast, our IGPL uses a GNN to inductively learn graph patterns which are not associated with particular nodes/edges, but are generally applicable to any part of the graph. Note that all baselines here use side information such as user-user or item-item graphs, while IGPL does not use any side information. We train our model for 40 epochs with a batch size of 50. Table 2 shows the results. Our model achieves state-of-the-art results on these three datasets, outperforming all three transductive baselines.
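Table 2: RMSE test results on Flixster, Douban and YahooMusic.

| Model | Flixster | Douban | YahooMusic |
|---|---|---|---|
| GRALS (Rao et al., 2015) | 1.245 | 0.833 | 38.0 |
| sRGCNN (Monti et al., 2017) | 0.926 | 0.801 | 22.4 |
| GC-MC (Berg et al., 2017) | 0.917 | 0.734 | 20.5 |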
To verify the transferability of the learned model, we conduct a transfer learning experiment. We retrain a model on Flixster after rounding its rating types to 1, 2, …, 5 (the same as Douban), and then directly test this model on Douban (both Flixster and Douban are movie rating datasets). We get a test RMSE of 0.8365. Note that this result is obtained without using any Douban data for training, yet is already comparable with the baseline GRALS (0.833). This experiment shows that the model learned by IGPL is transferable to new tasks, a property that transductive models can hardly offer.
We further conduct experiments on MovieLens. We compare against baselines including the matrix completion variants MC (Candès and Recht, 2009), IMC (Jain and Dhillon, 2013), and GMC (Kalofolias et al., 2014), as well as GRALS, sRGCNN and GC-MC. User/item side information is used in the baselines where possible. For IGPL, we train our model for 60 epochs with a batch size of 50. Results are summarized in Table 3. As we can see, IGPL achieves excellent performance, outperforming a number of matrix completion baselines, though not GC-MC.
Table 3: RMSE test results on MovieLens.

| Model | MovieLens |
|---|---|
| MC (Candès and Recht, 2009) | 0.973 |
| IMC (Jain and Dhillon, 2013) | 1.653 |
| GMC (Kalofolias et al., 2014) | 0.996 |
| GRALS (Rao et al., 2015) | 0.945 |
| sRGCNN (Monti et al., 2017) | 0.929 |
| GC-MC (Berg et al., 2017) | 0.905 |
To gain insight into when inductive graph pattern learning is more suitable than traditional transductive methods, we compare IGPL with GC-MC on MovieLens under different sparsity levels of the rating matrix. We sort all the training ratings according to their timestamps, and sparsify the rating matrix by keeping only the first 20%, 40%, 60%, 80% and 100% of the ratings, in order to simulate different phases of a recommender system’s data collection. We train both models on the sparsified rating matrix, and evaluate on the original MovieLens test set. The results are shown in Figure 2. As we can see, IGPL performs consistently better than GC-MC when less than 80% of the training ratings are kept. This indicates that IGPL has more stable performance on sparse ratings, and that useful graph patterns can still be learned even when the rating matrix is very sparse. It also suggests that during the very initial phase of a recommender system, using graph patterns for recommendation might be a better choice than matrix factorization.
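The sparsification protocol can be sketched as follows. The tuple layout `(user, item, rating, timestamp)` and the function name are assumptions for illustration, and rounding the cut-off to the nearest integer is one possible choice that the text does not specify.

```python
def sparsify_by_time(ratings, keep_fraction):
    """Keep the earliest fraction of ratings, ordered by timestamp.

    ratings: list of (user, item, rating, timestamp) tuples (assumed layout).
    keep_fraction: value in (0, 1], e.g. 0.2 keeps the first 20% of ratings.
    """
    ordered = sorted(ratings, key=lambda t: t[3])
    n_keep = round(len(ordered) * keep_fraction)
    return ordered[:n_keep]

# Ten toy ratings with shuffled timestamps.
toy = [("u%d" % i, "v%d" % i, 3, ts)
       for i, ts in enumerate([50, 10, 90, 30, 70, 20, 100, 40, 80, 60])]
subsets = {f: sparsify_by_time(toy, f) for f in (0.2, 0.4, 0.6, 0.8, 1.0)}
```

Each subset is a prefix of the next one in time order, which mimics a system that has been collecting data for progressively longer.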
Finally, we visualize the 10 test enclosing subgraphs with the highest and lowest predicted ratings for Flixster in Figure 3. As expected, there are substantially different patterns between high-score and low-score subgraphs. For example, high-score subgraphs typically show both high user bias and high item bias, while low-score subgraphs only show low user bias and have fewer ratings of the target item. See Appendix B for more visualization results.
We propose a new paradigm, IGPL, for recommender systems. Instead of learning transductive latent features, IGPL learns graph patterns related to ratings inductively. IGPL not only shows highly competitive performance with traditional matrix completion baselines in standard settings, but also shows distinct advantages in transfer learning and sparse rating matrix settings. We believe IGPL will open a new direction for learning inductive recommender systems.
Proceedings of The 33rd International Conference on Machine Learning. 2702–2711.
Fast Graph Representation Learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
An End-to-End Deep Learning Architecture for Graph Classification. In AAAI. 4438–4445.
Graph neural networks Graph neural networks (GNNs) are a type of neural network for learning over graphs (Scarselli et al., 2009; Bruna et al., 2013; Duvenaud et al., 2015; Kipf and Welling, 2016; Niepert et al., 2016; Li et al., 2015; Dai et al., 2016; Hamilton et al., 2017; Zhang et al., 2018). GNNs iteratively pass messages between each node and its neighbors in order to extract local substructure features around nodes. Then, an aggregation operation such as summing is applied to all nodes to get a graph feature vector. GNNs are parametric models. The learnable parameters in the message passing layers equip GNNs with excellent graph representation learning abilities and flexibility for different kinds of graphs. GNNs have gained great popularity in recent years, achieving state-of-the-art performance on semi-supervised node classification (Kipf and Welling, 2016), network embedding (Hamilton et al., 2017), graph classification (Zhang et al., 2018), and other tasks. A GNN usually consists of 1) message passing layers that extract local substructure features around nodes, and 2) a global pooling layer which aggregates node features into a graph representation for graph-level tasks such as graph classification or regression. Please refer to (Wu et al., 2019) for an overview. Our work introduces a novel application of GNNs in the recommender system field.
Graph-based matrix completion The matrix completion problem has previously been studied from a graph point of view. Monti et al. (2017) develop a multi-graph CNN model to extract user and item latent features from their respective networks and use the latent features to predict the ratings. Berg et al. (2017) operate directly on user-item bipartite graphs to extract user and item latent features using a GNN. In (Chen et al., 2005; Zhou et al., 2007), traditional link prediction heuristics are adapted to bipartite graphs and show promising performance for recommender systems. Our work differs in that we do not use any predefined heuristics, but learn general graph structure features using a GNN. Another work similar to ours is (Li and Chen, 2013), where graph kernels are used to learn graph structure features. However, graph kernels require quadratic time and space complexity to compute and store the kernel matrices, and are thus unsuitable for modern recommender systems.
Graph pattern learning for link prediction Learning supervised heuristics (graph patterns) has been studied for link prediction in simple graphs. Zhang and Chen (2017) propose the Weisfeiler-Lehman Neural Machine (WLNM), which learns graph structure features using a fully-connected neural network on the subgraphs’ adjacency matrices. Later, they improve this work by replacing the fully-connected neural network with a GNN and achieve state-of-the-art link prediction results (Zhang and Chen, 2018). Our work generalizes this line of research to predicting labeled links in bipartite graphs.