Single-Layer Graph Convolutional Networks For Recommendation

06/07/2020 ∙ by Yue Xu, et al. ∙ Beihang University ∙ Taobao ∙ Tencent ∙ The Chinese University of Hong Kong

Graph Convolutional Networks (GCNs) and their variants have received significant attention and achieved state-of-the-art performance on various recommendation tasks. However, many existing GCN models tend to perform recursive aggregations among all related nodes, which incurs a severe computational burden. Moreover, they favor multi-layer architectures in conjunction with complicated modeling techniques. Though effective, the excessive number of model parameters largely hinders their application in real-world recommender systems. To this end, in this paper, we propose the single-layer GCN model, which achieves superior performance with remarkably lower complexity than existing models. Our main contribution is three-fold. First, we propose a principled similarity metric named distribution-aware similarity (DA similarity), which can guide the neighbor sampling process and explicitly evaluate the quality of the input graph. We also prove, through both theoretical analysis and empirical simulations, that DA similarity has a positive correlation with the final performance. Second, we propose a simplified GCN architecture which employs a single GCN layer to aggregate information from the neighbors filtered by DA similarity and then generates the node representations. Moreover, the aggregation step is a parameter-free operation, so it can be performed as a pre-processing step to further reduce the training and inference costs. Third, we conduct extensive experiments on four datasets. The results verify that the proposed model outperforms existing GCN models considerably in recommendation performance and yields up to a few orders of magnitude speedup in training.


1. Introduction

Recommender systems play a pivotal role in various online services, e.g., E-commerce, news feeds, and video-on-demand services. The aim of recommendation is to match user preferences with resource items (zhang2019deep). Traditional recommendation models, e.g., matrix factorization (koren2009matrix) and collaborative filtering (sarwar2001item), mainly model user preferences by performing statistical analysis on historical user-item interaction records. Nowadays, as various kinds of auxiliary data become increasingly available in online services, many recommendation models shift their focus to graph-based methods (zhao2017meta; ying2018graph; zhao2019intentgc; fan2019meirec; wu2019dual; wang2019kgat; wang2019knowledge; Wang2019graph), which have greater expressive power for modeling the manifold types of nodes and relationships in recommender systems.

Among others, Graph Convolutional Networks (GCNs), which generalize Convolutional Neural Networks (CNNs) to graph-structured data (kipf2016semi), have achieved impressive performance on various graph-based learning tasks (velivckovic2017gat; hamilton2017representation; hamilton2017inductive), including recommendation (zhao2019intentgc). The core idea behind GCNs is to iteratively aggregate information from locally nearby neighbors in a graph using neural networks (chen2018fastgcn). Specifically, each node at one GCN layer performs a graph convolution operation to aggregate information from its nearby neighbors at the previous layer. By stacking multiple GCN layers, information can be propagated across far reaches of a graph, which makes GCNs capable of learning from both content information and graph structure. As such, GCN-based models are widely adopted in recommendation tasks (ying2018graph; zhao2019intentgc; fan2019meirec; wu2019dual; wang2019kgat; wang2019knowledge; Wang2019graph) which require learning from relational datasets. However, although existing GCN-based recommendation models have set a new standard on many benchmark tasks, they suffer from two main pitfalls.

Recursive Neighborhood Aggregation. The recursive neighborhood aggregation among all nodes incurs a severe computational burden, which, however, may contribute little in recommendation tasks. Specifically, as pointed out in (li2018laplacian), the convolution in GCN models is in fact a special form of Laplacian smoothing, which mixes the features of a node and its nearby neighbors. The smoothing operation makes the features of nodes within the same cluster similar, thus greatly easing the classification/regression task. Therefore, it is critical for GCN models to ensure that similar nodes have been grouped into the same cluster before performing the aggregations. In homogeneous networks, it is highly likely for two similar nodes to form a direct edge in the graph, which is known as the homophily hypothesis (mcpherson2001homophily). In this case, by recursively aggregating features from multi-hop neighbors, GCNs are able to achieve impressive performance (kipf2016semi; hamilton2017inductive; huang2018asgcn).

However, in the context of recommendation in heterogeneous networks, recognizing similar nodes becomes difficult, since we need to measure the similarity between two users (or items) based on their indirect relationships. In particular, existing models usually measure the similarity between two users (or items) according to their historical interactions with other auxiliary nodes. For example, (fan2019meirec; Wang2019graph; wu2019dual; wang2019knowledge) consider two users to be similar if they clicked the same item or the same brand, which, however, can be easily dominated by popular items or brands; (zhao2019intentgc) measures the similarity of two users according to the number of their common auxiliary neighbors. However, in this case, the users who interacted with most of the auxiliary nodes would have a high similarity to all other users. Besides, the number of common neighbors is unlikely to scale linearly with the value of similarity. Additionally, none of these models defines an explicit and principled metric to quantitatively evaluate node similarity in heterogeneous networks. In fact, given such a similarity metric, we may not need to perform recursive aggregations with multiple GCN layers. Instead, we only need to select similar neighbors for each node beforehand, and then perform the aggregation only once with a single GCN layer.

Complicated Architecture. Many existing models suffer from considerable computational complexity due to the use of multi-layer architectures in conjunction with complicated modeling techniques. For example, the metapath-guided GCN models (hu2018leveraging; fan2019meirec) construct manifold metapaths to find similar neighbors for aggregation, which adds complexity to both information aggregation and data pre-processing. The attention-based GCN (GAT) models (velivckovic2017gat; wang2019kgat; wu2019dual) generalize the graph convolution with the attention mechanism, which, however, introduces an excessive number of additional model parameters. Besides, (wu2019dual) further introduces a contextual multi-arm bandit over GAT to weight the interactions of various social effects, which brings higher uncertainty to model tuning. Generally, to some extent, these models trade complexity for potential performance enhancement, which largely hinders their application in real-world recommender systems.

On the other hand, recent advances in simplified GCNs, such as SGC (wu2019sgc), indicate that it is feasible to remove certain components from existing architectures while still preserving comparable performance. This motivates us to rethink the essential components of building an expressive GCN model for recommendation. Moreover, exploring the existence of an efficient and effective GCN architecture is not only a must for application to current recommender platforms, but also paves the way for resource-constrained on-device (e.g., mobile phones and wearable devices) recommendation in the near future.

Our Work. In this paper, we consider the user-item recommendation problem and propose the single-layer GCN (SLGCN) model. The model has far lower complexity than existing GCN-based recommendation models, yet achieves superior performance. The main contributions are summarized as follows.

  • Principled Similarity Metric: we propose a principled similarity metric named distribution-aware similarity (DA similarity), which explicitly measures the similarity of a pair of nodes according to the distribution of their interactions with other auxiliary nodes. On this basis, we propose another quantitative metric named Mean Average Neighbor Similarity (MANS) to evaluate the quality of neighbor sampling results. Then, we prove from a theoretical standpoint that MANS has a positive correlation with the final recommendation performance. Experimental results verify our analysis and show that existing GCN models can also benefit from the proposed similarity metric to improve their performance, without changing their model architectures.

  • Simplified Learning Architecture: we propose a simplified GCN architecture which generates node representations with only a single GCN layer. In particular, the architecture performs propagation only once to aggregate information from the neighbors selected based on DA similarity. Moreover, the aggregation step is a parameter-free operation, so it can be done as a pre-processing step to further reduce the training costs. We also investigate the efficiency of different architectures for the prediction layer.

  • Extensive Verifications: we conduct extensive experiments on three benchmark datasets and one commercial dataset to verify the superiority of our proposed model. The results show that our proposed model can outperform existing GCN models considerably, and yield up to a few orders of magnitude speedup in training.

2. Related Work

2.1. GCN-based Recommendation

GCNs originated from a version of graph convolutions developed based on spectral graph theory (kipf2016semi) and have many variants across various fields, e.g., node classification (hamilton2017inductive; velivckovic2017gat; chen2018fastgcn), link prediction (zhang2018linkprediction; chami2019hgcn), and recommendation (zhao2017meta; ying2018graph; zhao2019intentgc). The user-item recommendation task aims at directly predicting users' preference over items. Related GCN models usually first generate user and item embeddings by utilizing both content information and graph structure, and then predict user-item interactions (zhao2019intentgc; fan2019meirec; hu2018leveraging; fan2019graph). While most models adopt multiple multi-layer perceptron (MLP) layers to construct the prediction layer, their architectures for obtaining node representations differ from each other. In particular, IntentGC (zhao2019intentgc) proposed the vector-wise convolution to avoid useless feature interactions during neighborhood feature propagation. MEIRec (fan2019meirec) leveraged LSTMs to capture the sequential correlation among different neighbors. KGAT (wang2019kgat) computed the hidden states of each node by attending over its neighbors. Dual Graph Attention Networks (wu2019dual) introduced a contextual multi-arm bandit to weight social influence on users' preferences for items. However, all these models are constructed with a stack of multiple nonlinear GCN layers, which requires fitting an excessive number of model parameters. On the other hand, the recently proposed simple graph convolution (SGC) (wu2019sgc) reveals that removing certain components (the nonlinear transformations, in their work) from GCNs has little effect on node classification performance. This encourages us to seek a compact but effective model architecture in the context of recommendation.

2.2. Similarity Measurement

Existing recommendation models have proposed various strategies to measure node similarities, which are then used to guide the neighbor sampling process. Among others, the most popular strategy is based on the first-order proximity. In particular, many models consider two users (or two items) to be similar if they have interacted with the same auxiliary node. The sampling probability can either depend on the interaction frequency (i.e., importance sampling) or not (i.e., random sampling). Examples include MEIRec (fan2019meirec), KGCN (Wang2019graph), Dual Graph Attention Networks (wu2019dual), and KGNN-LS (wang2019knowledge). The other choice is based on the second-order proximity, which measures the similarity of two nodes by comparing their neighborhood structures (goyal2018graph). For example, IntentGC (zhao2019intentgc) measured the similarity between two nodes by comparing the number of their common neighbors. Another group of works, such as PinSAGE (ying2018graph), leveraged random walks to measure the similarity. However, all these works only provide empirical explanations of their similarity measurements, without developing an explicit similarity metric or investigating the influence of similarity measurement (or neighbor sampling) on the final recommendation performance. Besides, recent works from other fields have also studied graph sampling methods (zeng2020graphsaint; li2018laplacian; chen2018fastgcn; huang2018asgcn). The work most related to ours is LINE (tang2015line), which proposed to measure node similarity by comparing their distributions. However, LINE defines the distribution from the perspective of generating network context, which is different from ours. Besides, its aim is to propose an optimization objective for network embedding, while we aim at GCN-based recommendation.

3. Problem Definition

In this paper, we consider the user-item recommendation task within a graph consisting of heterogeneous nodes and relationships. Specifically, the user-item recommendation task can be described as follows. We denote the user set as $\mathcal{U}$ with $|\mathcal{U}|$ the number of users, and the item set as $\mathcal{I}$ with $|\mathcal{I}|$ the number of items. Given a user node $u \in \mathcal{U}$ and an item node $i \in \mathcal{I}$, the aim of user-item recommendation is to predict the potential interaction (e.g., click, rate, and purchase) between user $u$ and item $i$. On the other hand, the heterogeneous graph can be modeled as a heterogeneous information network (HIN), which is defined as follows:

Definition 3.1 (Heterogeneous Information Network).

A HIN is defined as a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of nodes and $\mathcal{E}$ is the set of edges between the nodes in $\mathcal{V}$. Each node $v \in \mathcal{V}$ and each edge $e \in \mathcal{E}$ is associated with a node type mapping function $\phi: \mathcal{V} \rightarrow \mathcal{A}$ and an edge type mapping function $\psi: \mathcal{E} \rightarrow \mathcal{R}$, respectively. The number of types satisfies $|\mathcal{A}| > 1$ or $|\mathcal{R}| > 1$.

Moreover, we consider GCN models trained on a subgraph sampled from the entire graph. This is a practical setting in real-world recommender systems (zhao2019intentgc; fan2019meirec; ying2018graph), since training GCN models with the entire graph would incur excessive computational complexity. Specifically, we consider that each node in the graph only aggregates information from a subset of its neighbors. The sampled subgraph can be represented as $\mathcal{G}' = (\mathcal{V}, \mathcal{E}')$, where $\mathcal{E}'$ denotes the edges between each node and its sampled neighbors. Note that the subgraph still contains the entire set of nodes from the original graph, but only a subset of its edges (i.e., propagation paths among the nodes, with $\mathcal{E}' \subseteq \mathcal{E}$) due to neighbor sampling. In other words, the sampling process only reduces the information aggregated from the neighbors, without removing any node from the graph. In this case, it is critical to sample the most similar neighbors for each node in $\mathcal{G}'$ in order to guarantee reliable performance.

We aim to 1) propose a principled and interpretable similarity metric to guide the neighbor sampling process, and investigate the influence of neighbor sampling on the recommendation performance; and 2) propose an efficient and effective GCN architecture which achieves performance superior to existing models at much lower complexity.

4. Neighbor Sampling

4.1. Network Translation

Recommendation models mainly focus on modeling user nodes and item nodes. Therefore, it is a common routine to translate all relationships in the original graph into user-user and item-item relationships (zhao2019intentgc; fan2019meirec; ying2018graph; fan2019graph). In this way, the models avoid handling all the different types of nodes and relationships, thereby reducing model complexity. In the translated graph, two users (or items) are considered to have one connecting path if they have both interacted with the same auxiliary node. For example, two users are considered connected if they clicked the same item or purchased the same brand. The subgraph is constructed by letting each node in the translated graph sample its neighbors according to their inter-connecting paths.
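As a concrete illustration, the translation step can be implemented as a sparse matrix product over the user-item interaction matrix. The sketch below is our own minimal example under that assumption; the function name and data layout are illustrative, not taken from the paper's code.

```python
import numpy as np
from scipy.sparse import csr_matrix

def translate_to_user_user(interactions: csr_matrix) -> csr_matrix:
    """interactions: (num_users x num_items) matrix of interaction counts.

    Returns a (num_users x num_users) matrix whose (u, v) entry counts the
    weighted paths u -> item -> v, i.e., users are connected whenever they
    interacted with the same auxiliary item node.
    """
    user_user = interactions @ interactions.T  # path counts via shared items
    user_user.setdiag(0)                       # drop self-loops
    user_user.eliminate_zeros()
    return user_user

# Example: 3 users, 4 items.
R = csr_matrix(np.array([[2, 0, 1, 0],
                         [1, 0, 0, 3],
                         [0, 4, 0, 0]]))
print(translate_to_user_user(R).toarray())
```

The same construction with the transposed interaction matrix yields the item-item translated graph.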

Existing works proposed various similarity metrics to guide the sampling process. Among others, the first-order proximity and the second-order proximity are the most popular ones. Specifically, the first-order proximity measures the similarity between two nodes according to the weight of their connecting path (goyal2018graph). Taking user-click-item paths as an example, as shown in Figure 1(a), the target user finds its 2-hop neighbors by first traversing to his/her top clicked items (1-hop), and then traversing to each item's top clicking users (2-hop). The traversal probability can either depend on the path weight (i.e., importance sampling) or not (i.e., random sampling). However, this method can be easily influenced by popular nodes, whose paths usually have higher weights than the others. Alternatively, the second-order proximity measures the similarity between two nodes according to the proximity of their neighborhood structures (goyal2018graph). For example, as shown in Figure 1(b), the target user measures the similarity of each neighbor according to the number of their common item-clicks. However, in this case, the users who clicked most of the items would have a high similarity to all other users. Also, the number of common neighbors is unlikely to scale linearly with the value of similarity. Inspired by the above methods, we next propose a more principled similarity metric that takes both path weights and neighborhood structure into consideration.

(a) First-order Proximity
(b) Second-order Proximity
(c) DA Similarity
Figure 1. Examples of neighbor sampling with different similarity metrics. (a) Sampling according to the weights of direct edges. (b) Sampling according to the number of common item-clicks. (c) Sampling according to the distribution of item-clicks.

4.2. Distribution-Aware Similarity

We propose the DA similarity in the context of recommendation, which measures the similarity between two nodes according to their interaction distributions over other nodes.

For clarity, let us first consider user-click-item paths. We denote user $u$'s click probability on item $i$ as $p_u(i)$, and his/her click distribution over all items in $\mathcal{I}$ as $P_u$. Then, the similarity between user $u$ and user $v$ on item-click preference can be written as $s(u, v)$. This similarity can be formulated with various distance metrics defined on probability distributions. For example, with the Kullback-Leibler divergence (KL divergence), the distance between user $u$'s and user $v$'s preferences on item-clicks can be formulated as

$$d_{\mathrm{KL}}(u, v) = \sum_{i \in \mathcal{I}_u} p_u(i) \log \frac{p_u(i)}{p_v(i)}, \qquad (1)$$

where $\mathcal{I}_u$ refers to the set of user $u$'s clicked items. We define the similarity as the negative distance between $P_u$ and $P_v$, i.e.,

$$s(u, v) = -d_{\mathrm{KL}}(u, v). \qquad (2)$$

As such, a higher similarity means a smaller distance between the probability distributions. Since the similarity formulated by the KL divergence is asymmetric, we can also formulate a symmetric DA similarity with a norm function:

$$s(u, v) = -\left\| P_u - P_v \right\|, \qquad (3)$$

where $\|\cdot\|$ denotes a norm function, e.g., the L1-norm or the L2-norm.
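A minimal sketch of Eqs. (1)-(3) follows, assuming each user is represented by an empirical click distribution over items. The smoothing constant and function names are our own choices, not from the paper.

```python
import numpy as np

def click_distribution(counts: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Normalize raw click counts over all items into a probability vector."""
    p = counts.astype(float) + eps    # smooth so the KL divergence is well defined
    return p / p.sum()

def da_sim_kl(p_u: np.ndarray, p_v: np.ndarray) -> float:
    """Asymmetric DA similarity: negative KL divergence, Eqs. (1)-(2).
    The sum effectively runs over user u's clicked items, since terms with
    p_u(i) ~ 0 contribute (almost) nothing."""
    return -float(np.sum(p_u * np.log(p_u / p_v)))

def da_sim_norm(p_u: np.ndarray, p_v: np.ndarray, ord: int = 1) -> float:
    """Symmetric DA similarity: negative L1/L2 distance, Eq. (3)."""
    return -float(np.linalg.norm(p_u - p_v, ord=ord))

# Example: user b has a click profile similar to a's; user c does not.
p_a = click_distribution(np.array([5, 3, 0, 0]))
p_b = click_distribution(np.array([4, 4, 1, 0]))
p_c = click_distribution(np.array([0, 0, 2, 6]))
print(da_sim_kl(p_a, p_b), da_sim_kl(p_a, p_c))     # less negative for b
print(da_sim_norm(p_a, p_b), da_sim_norm(p_a, p_c))
```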

Heterogeneous relationships. We now extend the definition of DA similarity to heterogeneous relationships. Considering a graph with $|\mathcal{A}|$ types of nodes and $|\mathcal{R}|$ types of relationships, the definition of DA similarity can be given as follows.

Definition 4.1 (Distribution-Aware Similarity).

Given a graph with $|\mathcal{A}|$ types of nodes and $|\mathcal{R}|$ types of relationships, we define the set of probability distributions of node $u$'s interactions with other nodes under all relationships as

$$\mathcal{P}_u = \left\{ P_u^{a, r} : a \in \mathcal{A},\; r \in \mathcal{R} \right\},$$

where $P_u^{a, r}$ is a probability distribution denoting the probability of node $u$ interacting with other nodes of type $a$ under the relationship of type $r$. Then, the DA similarity between node $u$ and node $v$ can be written as

$$s(u, v) = -\sum_{a \in \mathcal{A}} \sum_{r \in \mathcal{R}} w_a \, w_r \, d\!\left(P_u^{a, r}, P_v^{a, r}\right), \qquad (4)$$

where $d(\cdot, \cdot)$ is a distance function while $w_a$ and $w_r$ are the importance weights assigned to the similarity of interactions with the auxiliary nodes of type $a$ under the relationship of type $r$, respectively.
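A sketch of the heterogeneous DA similarity in Eq. (4), assuming the per-(type, relationship) distributions are stored in dicts keyed by $(a, r)$. The weight names follow Definition 4.1; their values would be set by domain knowledge or learned (see Sec. 5.1), and the dict-based API is our own illustration.

```python
import numpy as np

def da_sim_hetero(P_u: dict, P_v: dict, w_a: dict, w_r: dict) -> float:
    """Negative weighted sum of per-(type, relationship) L1 distances, Eq. (4)."""
    dist = 0.0
    for (a, r), p_u in P_u.items():
        dist += w_a[a] * w_r[r] * np.abs(p_u - P_v[(a, r)]).sum()
    return -dist

# Example: interaction distributions over items (via clicks) and brands (via buys).
P_u = {("item", "click"): np.array([0.7, 0.3]), ("brand", "buy"): np.array([0.5, 0.5])}
P_v = {("item", "click"): np.array([0.6, 0.4]), ("brand", "buy"): np.array([0.1, 0.9])}
print(da_sim_hetero(P_u, P_v,
                    w_a={"item": 1.0, "brand": 0.5},
                    w_r={"click": 1.0, "buy": 1.0}))
```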

4.3. Neighbor Quality Measurement

The DA similarity provides an explicit metric to measure the quality of neighbors. As such, the subgraph can be constructed by letting each node in the graph sample its neighbors according to the evaluated DA similarities. For example, a node could 1) directly select the top-$K$ neighbors with the highest similarity, or 2) normalize all neighbors' similarities into a probability distribution and perform importance sampling; both options are sketched below.
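The following sketch illustrates both options, assuming a precomputed vector of DA similarities between a node and its candidate neighbors. The helpers (including the softmax normalization) are our own illustrative choices; the paper does not prescribe this API.

```python
import numpy as np

def top_k_neighbors(sims: np.ndarray, k: int) -> np.ndarray:
    """Option 1: directly keep the k neighbors with the highest DA similarity."""
    return np.argsort(sims)[::-1][:k]

def importance_sample_neighbors(sims: np.ndarray, k: int,
                                rng: np.random.Generator) -> np.ndarray:
    """Option 2: normalize similarities into a probability distribution
    (softmax, since DA similarities are negative distances) and sample k
    distinct neighbors from it."""
    probs = np.exp(sims - sims.max())
    probs /= probs.sum()
    return rng.choice(len(sims), size=min(k, len(sims)), replace=False, p=probs)

rng = np.random.default_rng(0)
sims = np.array([-0.1, -0.9, -0.3, -2.0, -0.2])  # DA similarities to 5 candidates
print(top_k_neighbors(sims, 3))                  # deterministic: [0 4 2]
print(importance_sample_neighbors(sims, 3, rng)) # stochastic
```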

Now we investigate the correlation between our similarity metric and the final prediction performance. To this end, we first 1) give another quantitative metric named Mean Average Neighbor Similarity (MANS) to evaluate the quality of the sampled neighbors, and then 2) reveal the positive correlation between MANS and the prediction performance through theoretical analysis.

First, we define the average neighbor similarity (ANS) as follows.

Definition 4.2 (Average Neighbor Similarity).

For a given node $u$, its average neighbor similarity (ANS) is defined as

$$\mathrm{ANS}(u) = \frac{1}{|\mathcal{N}(u)|} \sum_{v \in \mathcal{N}(u)} s(u, v), \qquad (5)$$

where $\mathcal{N}(u)$ is the set of sampled neighbors of node $u$ and $|\cdot|$ denotes the cardinality of a set.

ANS is the average DA similarity between a node and its sampled neighbors, and thus measures the quality of that node's sampled neighborhood. On this basis, the definition of the mean average neighbor similarity (MANS) of a sampled subgraph is given as follows.

Definition 4.3 (Mean Average Neighbor Similarity).

For a given sampled subgraph $\mathcal{G}'$, its mean average neighbor similarity (MANS) is defined as

$$\mathrm{MANS}(\mathcal{G}') = \frac{1}{|\mathcal{V}|} \sum_{u \in \mathcal{V}} \mathrm{ANS}(u), \qquad (6)$$

where $\mathcal{V}$ denotes the set of all the nodes in the subgraph.
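A minimal sketch of ANS (Eq. 5) and MANS (Eq. 6), assuming a dict mapping each node to its sampled neighbor list and any DA similarity function from Sec. 4.2 (e.g., da_sim_norm above). The names are ours, not the paper's.

```python
def ans(u, neighbors, sim) -> float:
    """Average DA similarity between node u and its sampled neighbors, Eq. (5)."""
    nbrs = neighbors[u]
    return sum(sim(u, v) for v in nbrs) / len(nbrs)

def mans(neighbors, sim) -> float:
    """Mean of ANS over all nodes of the sampled subgraph, Eq. (6)."""
    nodes = list(neighbors)
    return sum(ans(u, neighbors, sim) for u in nodes) / len(nodes)

# Example with a toy 4-node subgraph and a precomputed similarity table.
S = {(0, 1): -0.1, (0, 2): -0.4, (1, 0): -0.1,
     (2, 0): -0.4, (2, 3): -0.2, (3, 2): -0.2}
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
print(mans(neighbors, lambda u, v: S[(u, v)]))  # -0.2125
```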

MANS is the mean of all nodes' ANS values in the subgraph, such that a higher MANS indicates that the grouped nodes (each node and its neighbors) have a higher similarity. Considering that the philosophy behind GCNs is to smooth features over similar vertices and thus ease the classification task (li2018laplacian), it is highly likely that MANS has a positive correlation with the performance achieved by GCN models. Recall that in GCN models, the node embedding generated by the $l$-th layer can be generally written as

$$h_u^{(l)} = \sigma\!\left( W^{(l)} \cdot \mathrm{AGG}^{(l)}\!\left( \left\{ h_v^{(l-1)} : v \in \tilde{\mathcal{N}}(u) \right\} \right) \right), \qquad (7)$$

where $\mathrm{AGG}^{(l)}$ denotes the aggregation function at the $l$-th layer, $W^{(l)}$ refers to the linear transformation, $\sigma$ is a nonlinear activation function, and $\tilde{\mathcal{N}}(u)$ is the union of node $u$ and its neighbors. As pointed out in SGC (wu2019sgc), the nonlinear transformation between consecutive GCN layers can be redundant, since the main benefits of aggregation come from local averaging. Therefore, in order to highlight the influence of neighbor selection, we develop our theoretical analysis based on SGC (wu2019sgc). In this case, the update function in (7) can be simplified into

$$h_u^{(l)} = \frac{1}{|\tilde{\mathcal{N}}(u)|} \sum_{v \in \tilde{\mathcal{N}}(u)} h_v^{(l-1)}. \qquad (8)$$

Next, for clarity and ease of derivation, we analyze the aggregation process on user-click-item paths as an example. User modeling aims at generating an accurate user embedding to describe his/her preference on item-click events. Given any user $u$, we denote the estimated probability distribution of his/her item-click events as $\hat{P}_u^{(l)} = g\big(h_u^{(l)}\big)$, where $h_u^{(l)}$ is his/her embedding generated by the $l$-th layer and $g(\cdot)$ is an unbiased mapping function. In order to generate an accurate item-click prediction, the GCN model needs to minimize the distance between the true probability distribution $P_u$ and the estimated probability distribution $\hat{P}_u^{(l)}$. The distance can be measured by the KL divergence:

$$D_{\mathrm{KL}}\!\left( P_u \,\|\, \hat{P}_u^{(l)} \right) = \sum_{i} p_u(i) \log \frac{p_u(i)}{\hat{p}_u^{(l)}(i)}. \qquad (9)$$

Given a neighbor $v \in \tilde{\mathcal{N}}(u)$, we denote the estimated probability distribution of his/her item-click events at the $(l-1)$-th layer as $\hat{P}_v^{(l-1)}$. Without loss of generality, we assume that the probability distributions of different neighbors are independent of each other, such that from (8) we have $\hat{P}_u^{(l)} = \frac{1}{|\tilde{\mathcal{N}}(u)|} \sum_{v \in \tilde{\mathcal{N}}(u)} \hat{P}_v^{(l-1)}$. Therefore, the KL distance given in (9) satisfies

$$\begin{aligned} D_{\mathrm{KL}}\!\left( P_u \,\|\, \hat{P}_u^{(l)} \right) &= \sum_{i} p_u(i) \log p_u(i) - \sum_{i} p_u(i) \log \hat{p}_u^{(l)}(i) &\quad& (10\mathrm{a}) \\ &= \sum_{i} p_u(i) \log p_u(i) - \sum_{i} p_u(i) \log \left( \frac{1}{|\tilde{\mathcal{N}}(u)|} \sum_{v \in \tilde{\mathcal{N}}(u)} \hat{p}_v^{(l-1)}(i) \right) && (10\mathrm{b}) \\ &\leq \sum_{i} p_u(i) \log p_u(i) - \frac{1}{|\tilde{\mathcal{N}}(u)|} \sum_{v \in \tilde{\mathcal{N}}(u)} \sum_{i} p_u(i) \log \hat{p}_v^{(l-1)}(i) && (10\mathrm{c}) \\ &= \frac{1}{|\tilde{\mathcal{N}}(u)|} \sum_{v \in \tilde{\mathcal{N}}(u)} \sum_{i} p_u(i) \log \frac{p_u(i)}{\hat{p}_v^{(l-1)}(i)} && (10\mathrm{d}) \\ &= \frac{1}{|\tilde{\mathcal{N}}(u)|} \sum_{v \in \tilde{\mathcal{N}}(u)} D_{\mathrm{KL}}\!\left( P_u \,\|\, \hat{P}_v^{(l-1)} \right), && (10\mathrm{e}) \end{aligned}$$

where the inequality in (10c) is based on the Jensen inequality (boyd2004convex) applied to the concave $\log$ function. The result in (10) reveals that the distance between $P_u$ and $\hat{P}_u^{(l)}$ is upper bounded by the average distance between $P_u$ and all $\hat{P}_v^{(l-1)}$, i.e.,

$$D_{\mathrm{KL}}\!\left( P_u \,\|\, \hat{P}_u^{(l)} \right) \leq \frac{1}{|\tilde{\mathcal{N}}(u)|} \sum_{v \in \tilde{\mathcal{N}}(u)} D_{\mathrm{KL}}\!\left( P_u \,\|\, \hat{P}_v^{(l-1)} \right), \qquad (11)$$

where $\hat{P}_v^{(l-1)}$ is an approximation of $P_v$. Hence, one can minimize the upper bound of the estimation error by minimizing the average distance between $P_u$ and its neighbors' distributions, which means increasing the ANS value of node $u$. In other words, the probability of correctly predicting a user's item-click interactions (i.e., estimating $P_u$) can be increased by sampling his/her neighbors with higher DA similarity values. Note that the result in (10) also holds when formulating the distance with norm functions, which satisfies

$$\begin{aligned} \left\| P_u - \hat{P}_u^{(l)} \right\| &= \left\| P_u - \frac{1}{|\tilde{\mathcal{N}}(u)|} \sum_{v \in \tilde{\mathcal{N}}(u)} \hat{P}_v^{(l-1)} \right\| &\quad& (12\mathrm{a}) \\ &= \left\| \frac{1}{|\tilde{\mathcal{N}}(u)|} \sum_{v \in \tilde{\mathcal{N}}(u)} \left( P_u - \hat{P}_v^{(l-1)} \right) \right\| && (12\mathrm{b}) \\ &\leq \frac{1}{|\tilde{\mathcal{N}}(u)|} \sum_{v \in \tilde{\mathcal{N}}(u)} \left\| P_u - \hat{P}_v^{(l-1)} \right\| && (12\mathrm{c}) \\ &= \frac{1}{|\tilde{\mathcal{N}}(u)|} \sum_{v \in \tilde{\mathcal{N}}(u)} d\!\left( P_u, \hat{P}_v^{(l-1)} \right), && (12\mathrm{d}) \end{aligned}$$

where the inequality in (12c) comes from the triangle inequality.

The above analysis can be readily applied to other types of nodes and relationships in the heterogeneous graph. We therefore give the following proposition.

Proposition 1.

When learning from a sampled subgraph $\mathcal{G}'$, the performance of GCN-based recommendation models can be improved by increasing the MANS of the subgraph, i.e., by increasing $\mathrm{MANS}(\mathcal{G}')$.

5. Single-layer GCN

In this section, we propose an efficient single-layer GCN architecture to learn node representations. The architecture performs propagation only once, aggregating information from the neighbors selected based on DA similarity, and thus avoids the excessive computation caused by recursive aggregations. Moreover, the aggregation step in our architecture is a parameter-free operation which can be done in a pre-processing manner, significantly reducing the model complexity as well as the training and inference costs.

Figure 2. The architectures of traditional recursive GCNs vs. our proposed SLGCN. Top panel: the recursive GCN repeatedly performs propagation throughout its GCN layers. Bottom panel: SLGCN performs propagation only once among the neighbors filtered by the DA similarity metric.

5.1. Node Representation

User modeling and item modeling are symmetric in our proposed single-layer GCN architecture. Therefore, we mainly present user modeling for illustration in the following. Specifically, given a user $u$, we initialize his/her embedding vector with the raw features, i.e., $h_u = x_u$, where $x_u$ denotes the raw feature vector of user $u$. Then, we aggregate the features from the neighbors of user $u$ as the neighborhood feature, i.e.,

$$h_{\mathcal{N}(u)} = \mathrm{pool}\!\left( \left\{ x_v : v \in \mathcal{N}(u) \right\} \right), \qquad (13)$$

where $\mathcal{N}(u)$ is the set of neighbors sampled according to the DA similarity, $x_v$ denotes the raw feature vector of neighbor $v$, and $\mathrm{pool}(\cdot)$ is a pooling function, e.g., mean pooling. Afterwards, we generate the aggregated feature of user $u$ by concatenating its self-feature and the neighborhood feature, i.e.,

$$h_u^{\mathrm{agg}} = x_u \oplus h_{\mathcal{N}(u)}, \qquad (14)$$

where $\oplus$ denotes the concatenation of vectors. Then, we feed the aggregated feature vector into a single neural network layer to obtain the user representation:

$$e_u = \sigma\!\left( W h_u^{\mathrm{agg}} + b \right). \qquad (15)$$

Similarly, item modeling follows the same process, replacing the context with item-related neighbors and features.

Remark. It is noteworthy that both (13) and (14) are parameter-free operations since they do not require fitting any weights. As such, they are essentially equivalent to a feature pre-processing step. In this case, the user/item modeling reduces to (15), which is only a simple transformation based on a single-layer neural network.
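A minimal sketch of Eqs. (13)-(15) follows: parameter-free pre-aggregation runs once before training, and only the single transformation layer is trainable. The use of mean pooling, ReLU, and the layer sizes are illustrative assumptions.

```python
import numpy as np

def preaggregate(X: np.ndarray, neighbors: dict) -> np.ndarray:
    """Pre-processing step, Eqs. (13)-(14): for every node, concatenate its raw
    feature with the mean-pooled raw features of its DA-sampled neighbors.
    No trainable weights are involved, so this can run once, offline."""
    agg = np.stack([X[neighbors[u]].mean(axis=0) for u in range(len(X))])
    return np.concatenate([X, agg], axis=1)            # h_agg, shape (n, 2d)

def single_layer(H_agg: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Eq. (15): the only trainable part of node representation learning."""
    return np.maximum(0.0, H_agg @ W + b)              # ReLU as the activation

d, n = 8, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d)).astype(np.float32)         # raw node features
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2, 4], 4: [3]}
H_agg = preaggregate(X, neighbors)                     # done once, offline
W = rng.normal(scale=0.1, size=(2 * d, d)).astype(np.float32)
b = np.zeros(d, dtype=np.float32)
E = single_layer(H_agg, W, b)                          # node representations
print(E.shape)                                         # (5, 8)
```

Because preaggregate has no parameters, training only ever touches single_layer, which is what yields the cost reductions analyzed in Sec. 5.3.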

Heterogeneous Relationships. We now extend SLGCN to deal with more heterogeneous relationships, for which there are two methods. First, we could use the heterogeneous similarity metric in Definition 4.1 to select neighbors and follow the process from (13) to (15) to generate node representations. In this case, one needs to specify the hyperparameters $w_a$ and $w_r$ based on domain knowledge, which is encouraged when dealing with a familiar recommendation context. Alternatively, we could move the hyperparameters into the concatenation step, i.e., (14), to automatically determine the weights of different types of nodes and relationships during training. For example, when considering the node type $a$ with relationship type $r$, the concatenation step can be modified into

$$h_u^{\mathrm{agg}} = x_u \oplus \left( w_{a, r} \, h_{\mathcal{N}_{a, r}(u)} \right) \oplus \cdots, \qquad (16)$$

where $\mathcal{N}_{a, r}(u)$ denotes the set of similar neighbors filtered with the corresponding similarity metric and $w_{a, r}$ is the importance weight.

5.2. Prediction

In this paper, we model the user-item recommendation task as a binary classification problem, where a positive label refers to an observed user-item interaction, and a negative label otherwise. The prediction process can be formulated as

$$\hat{y}_{u, i} = f\!\left( e_u, e_i \right), \qquad (17)$$

where $e_i$ is the item representation and $f(\cdot)$ denotes a mapping function. The function can be constructed with a few MLP layers or with a dot-product function; we compare the performance of different choices in Sec. 6.5. We adopt the cross-entropy loss as our optimization objective, which can be given as

$$\mathcal{L} = -\sum_{(u, i) \in \mathcal{D}} \Big( y_{u, i} \log \hat{y}_{u, i} + \left(1 - y_{u, i}\right) \log \left(1 - \hat{y}_{u, i}\right) \Big), \qquad (18)$$

where $\mathcal{D}$ denotes the training dataset, $y_{u, i}$ is the real user-item recommendation label (equal to $0$ or $1$), and $\hat{y}_{u, i}$ is the predicted label.
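A sketch of the prediction layer (Eq. 17) and the cross-entropy objective (Eq. 18). For brevity we instantiate the mapping function $f$ as a sigmoid over a dot product, which is one of the choices compared in Sec. 6.5; the paper's standard variant uses an MLP instead.

```python
import numpy as np

def predict(e_u: np.ndarray, e_i: np.ndarray) -> float:
    """One choice of the mapping function f in Eq. (17): sigmoid(dot product)."""
    return 1.0 / (1.0 + np.exp(-float(e_u @ e_i)))

def cross_entropy(batch, eps: float = 1e-12) -> float:
    """Eq. (18): batch is a list of (e_u, e_i, y) triples with y in {0, 1}."""
    loss = 0.0
    for e_u, e_i, y in batch:
        y_hat = np.clip(predict(e_u, e_i), eps, 1.0 - eps)  # avoid log(0)
        loss -= y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat)
    return loss

rng = np.random.default_rng(0)
batch = [(rng.normal(size=8), rng.normal(size=8), y) for y in (1, 0, 1)]
print(cross_entropy(batch))
```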

5.3. Complexity Analysis

The time cost of SLGCN mainly comes from a) subgraph construction, b) representation learning, and c) model inference. For a), we can compute the similarities of all connected users and items offline and then sample the neighbors of each node to construct the subgraph. Specifically, computing the similarity of a given user-user pair $(u, v)$ can be done offline in time linear in $n_u + n_v$, where $n_u$ and $n_v$ denote the numbers of nonzero interactions from user $u$ to all items and from user $v$ to all items, respectively. Note that we only need to update the similarity matrix daily or weekly in practical recommender systems. For b), we denote the complexity of performing the pooling-based feature aggregation in (13) as $C_{\mathrm{agg}}$ and the complexity of the representation mapping with the MLP in (15) as $C_{\mathrm{mlp}}$. Without loss of generality, we assume that $C_{\mathrm{agg}}$ and $C_{\mathrm{mlp}}$ only differ by constant coefficients across different GCN models. We denote the number of total training epochs as $T$ and the number of total edges in the training set as $|\mathcal{E}|$. The recursive GCN models (e.g., PinSAGE, MEIRec, IntentGC) perform recursive aggregations per training step. Moreover, they need to use MLP functions to do feature mapping after each aggregation step at each layer; we denote the total number of MLPs within the multiple GCN layers as $K$. The complexity of recursive GCN models is thus $O\big(T |\mathcal{E}| K (C_{\mathrm{agg}} + C_{\mathrm{mlp}})\big)$. Comparatively, SLGCN performs the aggregation in (13) only once during the pre-processing step, and performs the feature mapping only once per training step. As such, the complexity of SLGCN is $O\big(|\mathcal{E}| C_{\mathrm{agg}} + T |\mathcal{E}| C_{\mathrm{mlp}}\big)$, which is much smaller than that of the recursive GCN models. For c), we denote the number of prediction attempts as $Q$. The inference complexity of recursive GCNs is $O\big(Q K (C_{\mathrm{agg}} + C_{\mathrm{mlp}})\big)$, while the inference complexity of SLGCN is only $O\big(Q \, C_{\mathrm{mlp}}\big)$, since the neighbor aggregations have been completed beforehand. Empirical comparisons of the time costs of SLGCN vs. other GCN models are presented in Sec. 6.3.

| Model | LastFM AUC | LastFM NDCG@10 | Ciao AUC | Ciao NDCG@10 | Epinions AUC | Epinions NDCG@10 | WeChat AUC | WeChat NDCG@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MEIRec | 0.8723* | 0.7167* | 0.7705 | 0.5534 | 0.8363 | 0.7277 | 0.8036 | 0.6571 |
| MEIRec++ | 0.8868 (+1.7%) | 0.7167 (+0.0%) | 0.8314 (+7.9%) | 0.6289 (+13.6%) | 0.8872 (+6.1%) | 0.7985 (+9.7%) | 0.9073 (+12.9%) | 0.7343 (+11.7%) |
| IntentGC | 0.8704 | 0.7157 | 0.8123* | 0.6419* | 0.8574* | 0.7720* | 0.8808* | 0.7026* |
| IntentGC++ | 0.8805 (+1.2%) | 0.6826 (-4.6%) | 0.8444 (+4.0%) | 0.6462 (+0.6%) | 0.8808 (+2.7%) | 0.7766 (+0.6%) | 0.9073 (+3.0%) | 0.7345 (+4.5%) |
| SLGCN-1ord | 0.9348 | 0.7856 | 0.8656 | 0.7199 | 0.9003 | 0.8067 | 0.8574 | 0.5550 |
| SLGCN-2ord | 0.9374 | 0.7871 | 0.8929 | 0.7628 | 0.9198 | 0.8125 | 0.9016 | 0.7411 |
| SLGCN-sim2 | 0.9528 | 0.8112 | 0.9282 | 0.7957 | 0.9403 | 0.8280 | 0.9104 | 0.7602 |
| Improvement | 9.2% | 13.2% | 14.3% | 24.0% | 9.6% | 7.3% | 3.4% | 8.2% |

Table 1. Performance comparison on the four datasets.

6. Experiments

We conduct extensive experiments on four datasets with the goal of answering four research questions:

Q1: Does our proposed SLGCN outperform the state-of-the-art GCN-based recommendation methods?

Q2: How efficient is the learning of SLGCN compared with other GCN-based architectures?

Q3: How does neighbor sampling affect the final performance?

Q4: What is the efficiency of different architectures for inference?

6.1. Experimental Setup

Datasets. We use the following four datasets in our experiments for music, movie, product, and information recommendation, respectively: (1) Last-FM (https://grouplens.org/datasets/hetrec-2011/) is a music listening dataset collected from the Last.fm online music system, where the tracks are viewed as items; (2) Ciao (https://www.cse.msu.edu/~tangjili/datasetcode/truststudy.htm) is a dataset crawled from the ciaoDVD website which describes user ratings of movies; (3) the Epinions dataset (same source as Ciao) records user ratings on different types of items (software, music, television shows, etc.); (4) the WeChat dataset contains users' clicks on different articles, recorded by the WeChat platform. The detailed statistics of the datasets are given in Table 2. Following (Wang2019graph), we convert the explicit ratings in the Last-FM, Ciao, and Epinions datasets into implicit labels, where each record is marked as 1 to indicate that the user has positive feedback and 0 otherwise, with the positive-rating threshold set similarly to (Wang2019graph). We use MetaPath2vec (dong2017metapath2vec) to produce the pre-trained embeddings of the different nodes in each dataset and feed them into the GCN models as the raw features.

| Dataset | #Users | #Items | #Interactions |
| --- | --- | --- | --- |
| LastFM | 1,892 | 17,632 | 86,769 |
| Ciao | 7,375 | 105,114 | 264,229 |
| Epinions | 22,164 | 296,277 | 857,165 |
| WeChat | 180,871 | 116,551 | 3,801,612 |

Table 2. Statistics of the datasets.

Evaluation Protocols.

We randomly split the entire set of user-item recommendation records of each dataset into a training set, a validation set, and a test set, containing 80%, 10%, and 10% of the full records, respectively. Two popular metrics are adopted to evaluate the recommendation accuracy: 1) the Area Under the receiver operating characteristic Curve (AUC) and 2) the Normalized Discounted Cumulative Gain (NDCG). Generally, higher metric values indicate better recommendation accuracy. To evaluate NDCG on top-K recommendation performance, we follow a setting similar to (hu2018leveraging; he2017neural). Specifically, for each positive item in the test set, we sample negative items from the set of items which have no interaction records with the target user. Then, we rank the list of positive and negative items together. The final NDCG of each dataset is computed by first averaging over all the test items of a user and then averaging over all the users in the test set. We report the average score at $K = 10$ (i.e., NDCG@10) in this paper.

Comparison Methods. We compare four different neighbor sampling methods: (1) random walk based sampling (ying2018graph), which simulates random walks starting from each node and computes the L1-normalized visit counts of the neighbors visited by the random walks; (2) first-order proximity based sampling (fan2019meirec; Wang2019graph; wu2019dual; wang2019knowledge), which examines neighborhood similarity based on the edge weights (e.g., number of clicks); (3) second-order proximity based sampling (zhao2019intentgc), which examines neighborhood similarity based on the number of common neighbors; and (4) our proposed DA similarity based sampling. We also compare the following model architectures for node representation: (I) MEIRec (fan2019meirec), a multi-layer GCN model which adopts metapath-guided aggregations to learn user/item representations and samples neighbors using metapath-based first-order proximity; (II) IntentGC (zhao2019intentgc), also a multi-layer GCN model, which learns user/item representations with a faster architecture named IntentNet that avoids unnecessary feature interactions to speed up training; (III) our proposed simplified architecture with only one GCN layer. Moreover, we extend MEIRec and IntentGC to learn with the DA similarity based sampling method, and refer to the extended models as MEIRec++ and IntentGC++, respectively.

Parameter Settings. The optimal parameter settings for all the comparison methods are obtained either by empirical study or from the settings suggested in the original papers. For all models, we fix the total number of sampled neighbors to the same value on all datasets. For SLGCN, we adopt Adam (kingma2014adam) as the optimizer, set the learning rate to 0.01, and choose the regularization coefficient empirically. We utilize a warm-up technique to accelerate the training of SLGCN: we start with a small initial batch size and switch to an extra-large batch size after a number of warm-up batches. Note that SLGCN is able to learn with an extra-large batch size due to its simplified propagation step. The size of the linear transformation matrix in (15) scales with the dimension of the raw features. The prediction function in (17) is a three-layer MLP. Code will be released later.

6.2. Performance Comparison (Q1)

Table 1 reports the performance on the four datasets w.r.t. AUC and NDCG. Overall, our proposed SLGCN consistently achieves the best performance on all four datasets w.r.t. all evaluation metrics. We summarize the major findings below.

First, the second-order proximity based models (i.e., IntentGC, SLGCN-2ord) achieve generally better performance than the first-order proximity based models (i.e., MEIRec, SLGCN-1ord), which indicates that comparing neighborhood structures to find similar neighbors is more reliable than directly comparing edge weights. Meanwhile, our proposed DA sampling method helps the existing GCN-based models (i.e., MEIRec++, IntentGC++) obtain a general performance enhancement, which verifies that the DA similarity captures neighbor similarity well.

Second, when fixing the sampling method, our proposed simplified architecture (i.e., SLGCN-1ord, SLGCN-2ord) still outperforms the corresponding multi-layer GCN architectures (i.e., MEIRec, IntentGC). The reason is two-fold. First, the simplified architecture still preserves the local averaging operation, which is the main reason why GCNs work well (wu2019sgc; li2018laplacian). Second, simplifying the multi-layer architecture into a single-layer architecture largely reduces the difficulty of parameter fitting, leading to a higher probability of converging to a better local optimum.

6.3. Learning Efficiency (Q2)

One main advantage of SLGCN is its low training complexity. We show the convergence rate and the validation accuracy of all compared models in Figure 3. In particular, MEIRec, IntentGC, and SLGCN sample neighbors according to the first-order proximity, the second-order proximity, and the DA similarity, respectively. For all compared methods, we employ the same mapping function in (17) to infer the prediction results and sample the neighbors beforehand, so as to present a clean comparison of the training costs. All experiments are conducted on a workstation with Intel(R) Xeon(R) CPUs and one NVIDIA GTX-1080 GPU, and the results are averaged over multiple runs. The results in Figure 3 show that SLGCN achieves superior performance with one to two orders of magnitude speedup in training on all four datasets.

6.4. Influence From Neighbor Sampling (Q3)

Figure 3. Convergence curves on the four datasets (the horizontal axis is in logarithmic scale). All methods run on the same GPU device.

Table 3 reports the performance of SLGCN under different sampling methods on the four datasets, to justify the effectiveness of our proposed DA similarity. The results are generated by fixing the model architecture (i.e., the node representation and prediction layers) while only varying the neighbor sampling method.

Overall, the random walk based sampling method yields the worst performance. In fact, the neighbors found by random walks may change significantly when varying the total number or the total length of the generated paths. One can stabilize the results by performing extensive random walks for each node, which, however, is computationally prohibitive on large graphs. Our analysis in Sec. 4.3 suggests that the DA similarity should outperform the second-order proximity, which in turn should outperform the first-order proximity. This can be verified by taking a deeper look at the changes of MANS in Table 3. In particular, we calculate the MANS of user nodes and item nodes separately. The results in Table 3 show that MANS has a generally positive correlation with the performance of GCN models. The exceptions are the MANS of user nodes on LastFM and the MANS of item nodes on WeChat, which indicates that we need to assign a lower importance weight to the user-click-item similarity on LastFM and to the item-click-user similarity on WeChat.

6.5. Inference Performance (Q4)

| Sampling | LastFM MANS(U, I) | LastFM AUC | LastFM NDCG | Ciao MANS(U, I) | Ciao AUC | Ciao NDCG | Epinions MANS(U, I) | Epinions AUC | Epinions NDCG | WeChat MANS(U, I) | WeChat AUC | WeChat NDCG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| rand | -0.147, -0.653 | 0.9403 | 0.8027 | -0.097, -0.454 | 0.8828 | 0.7690 | -0.090, -0.445 | 0.9062 | 0.8026 | -0.210, -0.502 | 0.8415 | 0.5496 |
| walk | -0.150, -0.649 | 0.9032 | 0.7698 | -0.104, -0.457 | 0.8599 | 0.7293 | -0.094, -0.446 | 0.8698 | 0.7766 | -0.215, -0.506 | 0.8851 | 0.6907 |
| 1ord | -0.155, -0.608 | 0.9348 | 0.7856 | -0.096, -0.426 | 0.8656 | 0.7199 | -0.088, -0.431 | 0.9003 | 0.8067 | -0.192, -0.504 | 0.8574 | 0.5550 |
| 2ord | -0.205, -0.586 | 0.9374 | 0.7871 | -0.094, -0.421 | 0.8929 | 0.7628 | -0.083, -0.411 | 0.9198 | 0.8125 | -0.172, -0.539 | 0.9016 | 0.7411 |
| sim2 | -0.082, -0.356 | 0.9528 | 0.8112 | -0.082, -0.221 | 0.9282 | 0.7957 | -0.067, -0.196 | 0.9403 | 0.8280 | -0.163, -0.528 | 0.9104 | 0.7602 |

Table 3. Influence From Neighbor Sampling.
Figure 4. Influence From Model Architecture.

The results in Table 1 and Figure 3 have already verified the superiority of using a single GCN layer for node representation. We now focus on comparing different inference architectures in SLGCN. Specifically, we compare the following variants: 1) standard SLGCN, which infers the results with a stack of multiple MLP layers; 2) linear SLGCN, which replaces $e_u$ and $e_i$ in (17) with the aggregated features $h_u^{\mathrm{agg}}$ and $h_i^{\mathrm{agg}}$, i.e., does not perform separate nonlinear transformations on the user embedding and the item embedding; 3) vanilla-cosine SLGCN, which computes the distance between $e_u$ and $e_i$ with a cosine function to infer the results; and 4) cosine SLGCN, which adds an additional nonlinear activation function outside the cosine function of vanilla-cosine SLGCN when inferring the results. Figure 4 reports the experimental results, where we refer to the above variants as STD, LIN, VCOS, and COS for short. The results show that the standard SLGCN achieves the best performance on all four datasets, while the linear SLGCN suffers an obvious performance degradation, which indicates that it is critical to perform nonlinear transformations on the user embedding and the item embedding separately before feeding them into the mapping function. Moreover, it is noteworthy that the cosine SLGCN achieves performance close to the standard SLGCN on the WeChat dataset, which indicates that it is promising to replace the MLP layers with a cosine function to deliver further complexity reduction when learning from large datasets.

7. Conclusion

In this paper, we introduced the SLGCN model, which achieves superior performance along with up to a few orders of magnitude speedup in training compared with existing models. We proved, through both theoretical analysis and empirical simulations, that the proposed DA similarity has a positive correlation with the final performance. Experimental results revealed that existing GCN models can also make use of the proposed DA similarity metric to improve their performance. Meanwhile, we proposed a simplified GCN architecture which employs a single GCN layer to first aggregate information from the neighbors filtered by DA similarity and then generate the node representations for inference. Extensive experiments verified the superiority of the proposed model in both recommendation performance and training speed. We hope our study can inspire more future research on building compact but expressive GCN models for recommendation.

References

  • [1] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
  • [2] Ines Chami, Zhitao Ying, Christopher Ré, and Jure Leskovec. Hyperbolic graph convolutional neural networks. In NIPS, pages 4869–4880, 2019.
  • [3] Jie Chen, Tengfei Ma, and Cao Xiao. FastGCN: fast learning with graph convolutional networks via importance sampling. In ICLR, 2018.
  • [4] Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. Metapath2vec: Scalable representation learning for heterogeneous networks. In KDD, pages 135–144, 2017.
  • [5] Shaohua Fan, Junxiong Zhu, Xiaotian Han, Chuan Shi, Linmei Hu, Biyu Ma, and Yongliang Li. Metapath-guided heterogeneous graph neural network for intent recommendation. In KDD, pages 2478–2486, 2019.
  • [6] Wenqi Fan, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. Graph neural networks for social recommendation. In WWW, pages 417–426, 2019.
  • [7] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.
  • [8] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, pages 1024–1034, 2017.
  • [9] William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, 2017.
  • [10] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In WWW, 2017.
  • [11] Binbin Hu, Chuan Shi, Wayne Xin Zhao, and Philip S Yu. Leveraging meta-path based context for top-N recommendation with a neural co-attention model. In KDD, pages 1531–1540, 2018.
  • [12] Wenbing Huang, Tong Zhang, Yu Rong, and Junzhou Huang. Adaptive sampling towards fast graph representation learning. In NIPS, pages 4558–4567, 2018.
  • [13] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [14] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
  • [15] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
  • [16] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, 2018.
  • [17] Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1):415–444, 2001.
  • [18] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In WWW, pages 285–295, 2001.
  • [19] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In WWW, pages 1067–1077, 2015.
  • [20] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.
  • [21] Hongwei Wang, Fuzheng Zhang, Mengdi Zhang, Jure Leskovec, Miao Zhao, Wenjie Li, and Zhongyuan Wang. Knowledge-aware graph neural networks with label smoothness regularization for recommender systems. In KDD, pages 968–977, 2019.
  • [22] Hongwei Wang, Miao Zhao, Xing Xie, Wenjie Li, and Minyi Guo. Knowledge graph convolutional networks for recommender systems. In WWW, page 3307–3313, New York, NY, USA, 2019.
  • [23] Xiang Wang, Xiangnan He, Yixin Cao, Meng Liu, and Tat-Seng Chua. KGAT: Knowledge graph attention network for recommendation. In KDD, pages 950–958, 2019.
  • [24] Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S Yu. Heterogeneous graph attention network. In WWW, pages 2022–2032, 2019.
  • [25] Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr, Christopher Fifty, Tao Yu, and Kilian Q Weinberger. Simplifying graph convolutional networks. In ICML, 2019.
  • [26] Qitian Wu, Hengrui Zhang, Xiaofeng Gao, Peng He, Paul Weng, Han Gao, and Guihai Chen. Dual graph attention networks for deep latent representation of multifaceted social effects in recommender systems. In WWW, page 2091–2102, New York, NY, USA, 2019.
  • [27] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In KDD, pages 974–983, 2018.
  • [28] Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. GraphSAINT: Graph sampling based inductive learning method. In ICLR, 2020.
  • [29] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In NIPS, pages 5165–5175, 2018.
  • [30] Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR), 52(1):1–38, 2019.
  • [31] Huan Zhao, Quanming Yao, Jianda Li, Yangqiu Song, and Dik Lun Lee. Meta-graph based recommendation fusion over heterogeneous information networks. In KDD, pages 635–644, 2017.
  • [32] Jun Zhao, Zhou Zhou, Ziyu Guan, Wei Zhao, Wei Ning, Guang Qiu, and Xiaofei He. IntentGC: a scalable graph convolution framework fusing heterogeneous information for recommendation. In KDD, pages 2347–2357, 2019.