To alleviate information overload on the web, recommender system has been widely deployed to perform personalized information filtering (PinSage; YoutubeRS). The core of recommender system is to predict whether a user will interact with an item, e.g., click, rate, purchase, among other forms of interactions. As such, collaborative filtering (CF), which focuses on exploiting the past user-item interactions to achieve the prediction, remains to be a fundamental task towards effective personalized recommendation (NCF; VACF; NGCF; CMN).
The most common paradigm for CF is to learn latent features (a.k.a. embedding) to represent a user and an item, and perform prediction based on the embedding vectors(NCF). Matrix factorization is an early such model, which directly projects the single ID of a user to her embedding (MF; BPRMF). Later on, several research find that augmenting user ID with the her interaction history as the input can improve the quality of embedding. For example, SVD++ (SVD++) demonstrates the benefits of user interaction history in predicting user numerical ratings, and Neural Attentive Item Similarity (NAIS) (NAIS) differentiates the importance of items in the interaction history and shows improvements in predicting item ranking. In view of user-item interaction graph, these improvements can be seen as coming from using the subgraph structure of a user — more specifically, her one-hop neighbors — to improve the embedding learning.
To deepen the use of subgraph structure with high-hop neighbors, Wang et al. (NGCF) recently proposes NGCF and achieves state-of-the-art performance for CF. It takes inspiration from the Graph Convolution Network (GCN) (GCN; GraphSAGE), following the same propagation rule to refine embeddings: feature transformation, neighborhood aggregation, and nonlinear activation. Although NGCF has shown promising results, we argue that its designs are rather heavy and burdensome — many operations are directly inherited from GCN without justification. As a result, they are not necessarily useful for the CF task. To be specific, GCN is originally proposed for node classification on attributed graph, where each node has rich attributes as input features; whereas in user-item interaction graph for CF, each node (user or item) is only described by a one-hot ID, which has no concrete semantics besides being an identifier. In such a case, given the ID embedding as the input, performing multiple layers of nonlinear feature transformation — which is the key to the success of modern neural networks (ResNet) — will bring no benefits, but negatively increases the difficulty for model training.
To validate our thoughts, we perform extensive ablation studies on NGCF. With rigorous controlled experiments (on the same data splits and evaluation protocol), we draw the conclusion that the two operations inherited from GCN — feature transformation and nonlinear activation — has no contribution on NGCF’s effectiveness. Even more surprising, removing them leads to significant accuracy improvements. This reflects the issues of adding operations that are useless for the target task in graph neural network, which not only brings no benefits, but rather degrades model effectiveness. Motivated by these empirical findings, we present a new model named LightGCN, including the most essential component of GCN — neighborhood aggregation — for collaborative filtering. Specifically, after associating each user (item) with an ID embedding, we propagate the embeddings on the user-item interaction graph to refine them. We then combine the embeddings learned at different propagation layers with a weighted sum to obtain the final embedding for prediction. The whole model is simple and elegant, which not only is easier to train, but also achieves better empirical performance than NGCF and other state-of-the-art methods like Mult-VAE (VACF).
To summarize, this work makes the following main contributions:
We empirically show that two common designs in GCN, feature transformation and nonlinear activation, have no positive effect on the effectiveness of collaborative filtering.
We propose LightGCN, which largely simplifies the model design by including only the most essential components in GCN for recommendation.
We empirically compare LightGCN with NGCF by following the same setting and demonstrate substantial improvements. In-depth analyses are provided towards the rationality of LightGCN from both technical and empirical perspectives.
We first introduce NGCF (NGCF), a representative and state-of-the-art GCN model for recommendation. We then perform ablation studies on NGCF to judge the usefulness of each operation in NGCF. The novel contribution of this section is to show that the two common designs in GCNs, feature transformation and nonlinear activation, have no positive effect on collaborative filtering.
2.1. NGCF Brief
In the initial step, each user and item is associated with an ID embedding. Let denote the ID embedding of user and denote the ID embedding of item . Then NGCF leverages the user-item interaction graph to propagate embeddings as:
where and respectively denote the refined embedding of user and item after layers propagation,
is the nonlinear activation function,denotes the set of items that are interacted by user , denotes the set of users that interact with item , and and are trainable weight matrix to perform feature transformation in each layer. By propagating layers, NGCF obtains embeddings to describe a user () and an item (). It then concatenates these embeddings to obtain the final user embedding and item embedding, using inner product to generate the prediction score.
NGCF largely follows the standard GCN (GCN), including the use of nonlinear activation function and feature transformation matrices and . However, we argue that the two operations are not as useful for collaborative filtering. In semi-supervised node classification, each node has rich semantic features as input, such as the title and abstract words of a paper. Thus performing multiple layers of nonlinear transformation is beneficial to feature learning. Nevertheless, in collaborative filtering, each node of user-item interaction graph only has an ID as input which has no concrete semantics. In this case, performing multiple nonlinear transformations will not contribute to learn better features; even worse, it may add the difficulties to train well. In the next subsection, we provide empirical evidence on this argument.
2.2. Empirical Explorations on NGCF
We conduct ablation studies on NGCF to explore the effect of nonlinear activation and feature transformation. We use the codes released by the authors of NGCF111https://github.com/xiangwang1223/neural_graph_collaborative_filtering, running experiments on the same data splits and evaluation protocol to keep the comparison as fair as possible. Since the core of GCN is to refine embeddings by propagation, we are more interested in the embedding quality under the same embedding size. Thus, we change the way of obtaining final embedding from concatenation (i.e., ) to sum (i.e., ). Note that this change has little effect on NGCF’s performance, but makes the following ablation studies more indicative of the embedding quality refined by GCN.
We implement three simplified variants of NGCF:
NGCF-f, which removes the feature transformation matrices and .
NGCF-n, which removes the non-linear activation function .
NGCF-fn, which removes both the feature transformation matrices and non-linear activation function.
For the three variants, we keep all hyper-parameters (e.g., learning rate, regularization coefficient, dropout ratio, etc.) same as the optimal settings of NGCF. We report the results of the 2-layer setting on the Gowalla and Amazon-Book datasets in Table 1, where the scores of NGCF are directly copied from the Table 3 of (NGCF). As can be seen, removing feature transformation (i.e., NGCF-f) leads to consistent improvements over NGCF on all three datasets. In contrast, removing nonlinear activation does not affect the accuracy that much. However, if we remove nonlinear activation on the basis of removing feature transformation (i.e., NGCF-fn), the performance is improved significantly. From these observations, we conclude the findings that:
(1) Adding feature transformation imposes negative effect on NGCF, since removing it in both models of NGCF and NGCF-n improves the performance significantly;
(2) Adding nonlinear activation affects slightly when feature transformation is included, but it imposes negative effect when feature transformation is disabled.
(3) As a whole, feature transformation and nonlinear activation impose rather negative effect on NGCF, since by removing them simultaneously, NGCF-fn demonstrates large improvements over NGCF (9.57% relative improvement on recall).
To gain more insights into the scores obtained in Table 1 and understand why NGCF deteriorates with the two operations, we plot the curves of model status recorded by training loss and testing recall in Figure 1. As can be seen, NGCF-fn achieves a much lower training loss than NGCF, NGCF-f, and NGCF-n along the whole training process. Aligning with the curves of testing recall, we find that such lower training loss successfully transfers to better recommendation accuracy. The comparison between NGCF and NGCF-f shows the similar trend, except that the improvement margin is smaller.
From these evidences, we can draw the conclusion that the deterioration of NGCF stems from the training difficulty, rather than overfitting. Theoretically speaking, NGCF has higher representation power than NGCF-f, since setting the weight matrix andI can fully recover the NGCF-f model. However, in practice, NGCF demonstrates higher training loss and worse generalization performance than NGCF-f. And the incorporation of nonlinear activation further aggravates the discrepancy between representation power and generalization performance. To round out this section, we claim that when designing model for recommendation, it is important to perform rigorous ablation studies to be clear about the impact of each operation. Otherwise, including less useful operations will complicate the model unnecessarily, increase the training difficulty, and even degrade model effectiveness.
The former section demonstrates that NGCF is a heavy and burdensome GCN model for collaborative filtering. Driven by these findings, we set the goal of developing a light yet effective model by including the most essential ingredients of GCN for recommendation. The advantages of being simple are several-fold — more interpretable, practically easy to train and maintain, technically easy to analyze the model behavior and revise it towards more effective directions, and so on.
In this section, we first present our designed Light Graph Convolution Network (LightGCN) model, as illustrated in Figure 2. We then provide an in-depth analysis of LightGCN to show the rationality behind its simple design. Lastly, we describe how to do model training for recommendation.
The basic idea of GCN is to learning representation for nodes by smoothing features over the graph (GCN; SGCN). To achieve this, it performs graph convolution iteratively, i.e., aggregating the features of neighbors as the new representation of a target node. Such neighborhood aggregation can be abstracted as:
The AGG is an aggregation function — the core of graph convolution — that considers the -th layer’s representation of the target node and its neighbor nodes. Many work have specified the AGG, such as the weighted sum aggregator in GCN (GCN) and GIN (GIN), mean aggregator and LSTM aggregator in GraphSAGE (GraphSAGE), etc. However, most of the work ties feature transformation or nonlinear activation with the AGG function. Although they perform well on node or graph classification tasks that have semantic input features, they could be burdensome for collaborative filtering (see preliminary results in Section 2.2).
3.1.1. Light Graph Convolution (LGC)
In LightGCN, we adopt the simple weighted sum aggregator and abandon the use of feature transformation and nonlinear activation. The graph convolution operation (a.k.a., propagation rule (NGCF)) in LightGCN is defined as:
The symmetric normalization term follows the design of standard GCN (GCN), which can avoid the scale of embeddings increasing with graph convolution operations; other choices can also be applied here, such as the norm, while empirically we find this symmetric normalization has good performance (see experiment results in Section 4.4.2).
It is worth noting that in LGC, we aggregate only the connected neighbors and do not integrate the target node itself (i.e., self-connection). This is different from most existing graph convolution operations (NGCF; GCN; GAT; GraphSAGE) that typically aggregate extended neighbors and need to handle the self-connection specially. The layer combination operation, to be introduced in the next subsection, essentially captures the same effect as self-connections. Thus, there is no need in LGC to include self-connections.
3.1.2. Layer Combination and Model Prediction
In LightGCN, the only trainable model parameters are the embeddings at the 0-th layer, i.e., for all users and for all items. When they are given, the embeddings at higher layers can be computed via LGC defined in Equation (3). After layers LGC, we further combine the embeddings obtained at each layer to form the final representation of a user (an item):
where denotes the importance of the -th layer embedding in constituting the final embedding. It can be treated as a hyper-parameter to be tuned manually, or as a model parameter (e.g., output of an attention network (ACF)) to be optimized automatically. In our experiments, we find that setting uniformly as leads to good performance in general. Thus we do not design special component to optimize , to avoid complicating LightGCN unnecessarily and to keep its simplicity. The reasons that we perform layer combination to get final representations are three-fold. (1) With the increasing of the number of layers, the embeddings will be over-smoothed (DeepInsights). Thus simply using the last layer is problematic. (2) The embeddings at different layers capture different semantics. E.g., the first layer enforces smoothness on users and items that have interactions, the second layer smooths users (items) that have overlap on interacted items (users), and higher-layers capture higher-order proximity (NGCF). Thus combining them will make the representation more comprehensive. (3) Combining embeddings at different layers with weighted sum captures the effect of graph convolution with self-connections, an important trick in GCNs (proof sees Section 3.2.1).
The model prediction is defined as the inner product of user and item final representations:
which is used as the ranking score for recommendation generation.
3.1.3. Matrix Form
We provide the matrix form of LightGCN to facilitate implementation and discussion with existing models. Let the user-item interaction matrix be where and denote the number of users and items, respectively, and each entry is 1 if has interacted with item otherwise 0. We then obtain the adjacency matrix of the user-item graph as
Let the -th layer embedding matrix be , where is the embedding size. Then we can obtain the matrix equivalent form of LGC as:
where D is a diagonal matrix, in which each entry denotes the number of nonzero entries in the -th row vector of the adjacency matrix A (also named as degree matrix). Lastly, we get the final embedding matrix used for model prediction as:
where is the symmetrically normalized matrix.
3.2. Model Analysis
We conduct model analysis to demonstrate the rationality behind the simple design of LightGCN. First we discuss the connection with the Simplified GCN (SGCN) (SGCN), which is a recent linear GCN model that integrates self-connection into graph convolution; this analysis shows that by doing layer combination, LightGCN subsumes the effect of self-connection thus there is no need for LightGCN to add self-connection in adjacency matrix. Then we discuss the relation with the Approximate Personalized Propagation of Neural Predictions (APPNP) (ICLR19-APPNP), which is recent GCN variant that addresses oversmoothing by inspiring from Personalized PageRank (haveliwala2002topic); this analysis shows the underlying equivalence between LightGCN and APPNP, thus our LightGCN enjoys the sames benefits in propagating long-range with controllable oversmoothing. Lastly we analyze the second-layer LGC to show how it smooths a user with her second-order neighbors, providing more insights into the working mechanism of LightGCN.
3.2.1. Relation with SGCN
In (SGCN), the authors argue the unnecessary complexity of GCN for node classfication and propose SGCN, which simplifies GCN by removing nonlinearities and collapsing the weight matrices to one weight matrix. The graph convolution in SGCN is defined as222The weight matrix in SGCN can be absorbed into the 0-th layer embedding parameters, thus it is omitted in the analysis.:
where is an identity matrix, which is added on A to include self-connections. In the following analysis, we omit the terms for simplicity, since they only re-scale embeddings. In SGCN, the embeddings obtained at the last layer are used for downstream prediction task, which can be expressed as:
The above derivation shows that, inserting self-connection into A and propagating embeddings on it, is essentially equivalent to a weighted sum of the embeddings propagated at each LGC layer.
3.2.2. Relation with APPNP
In a recent work (ICLR19-APPNP), the authors connect GCN with Personalized PageRank (haveliwala2002topic), inspiring from which they propose a GCN variant named APPNP that can propagate long range without the risk of oversmoothing. Inspired by the teleport design in Personalized PageRank, APPNP complements each propagation layer with the starting features (i.e., the 0-th layer embeddings), which can balance the need of preserving locality (i.e., staying close to the root node to alleviate oversmoothing) and leveraging the information from a large neighborhood. The propagation layer in APPNP is defined as:
is the teleport probability to control the retaining of starting features in the propagation, anddenotes the normalized adjacency matrix. In APPNP, the last layer is used for final prediction, i.e.,
Aligning with Equation (8), we can see that by setting accordingly, LightGCN can fully recover the prediction embedding used by APPNP. As such, LightGCN shares the strength of APPNP in combating oversmoothing — by setting the properly, we allow using a large for long-range modeling with controllable oversmoothing.
Another minor difference is that APPNP adds self-connection into the adjacency matrix. However, as we have shown before, this is redundant due to the weighted sum of different layers.
3.2.3. Second-Order Embedding Smoothness
Owing to the linearity and simplicity of LightGCN, we can draw more insights into how does it smooth embeddings. Here we analyze a 2-layer LightGCN to demonstrate its rationality. Taking the user side as an example, intuitively, the second layer smooths users that have overlap on the interacted items. More concretely, we have:
We can see that, if another user has co-interacted with the target user , the smoothness strength of on is measured by the coefficient (otherwise 0):
This coefficient is rather interpretable: the influence of a second-order neighbor on is determined by 1) the number of co-interacted items, the more the larger; 2) the popularity of the co-interacted items, the less popularity (i.e., more indicative of user personalized preference) the larger; and 3) the activity of , the less active the larger. Such interpretability well caters for the assumption of CF in measuring user similarity (CSE; Wang:2006) and evidences the reasonability of LightGCN. Due to the symmetric formulation of LightGCN, we can get similar analysis on the item side.
3.3. Model Training
The trainable parameters of LightGCN are only the embeddings of the 0-th layer, i.e., ; in other words, the model complexity is same as the standard matrix factorization (MF). We employ the Bayesian Personalized Ranking (BPR) loss (BPRMF), which is a pairwise loss that encourages the prediction of an observed entry to be higher than its unobserved counterparts:
where controls the regularization strength. We employ the Adam (Adam) optimizer and use it in a mini-batch manner. We are aware of other advanced negative sampling strategies which might improve the LightGCN training, such as the hard negative sampling (rendle2014improving) and adversarial sampling (Ding2019IJCAI). We leave this extension in the future since it is not the focus of this work.
Note that we do not introduce dropout mechanisms, which are commonly used in GCNs and NGCF. The reason is that we do not have feature transformation weight matrices in LightGCN, thus enforcing regularization on the embedding layer is sufficient to prevent overfitting. This showcases LightGCN’s advantages of being simple — it is easier to train and tune than NGCF which additionally requires to tune two dropout ratios (node dropout and message dropout) and normalize the embedding of each layer to unit length.
Moreover, it is technically viable to also learn the layer combination coefficients , or parameterize them with an attention network. However, we find that learning on training data does not lead improvement. This is probably because the training data does not contain sufficient signal to learn good that can generalize to unknown data. We have also tried to learn from validation data, as inspired by (lambdaOpt) that learns hyper-parameters on validation data. The performance is slightly improved (less than ). We leave the exploration of optimal settings of (e.g., personalizing it for different users and items) as future work.
We first describe experimental settings, and then conduct detailed comparison with NGCF (NGCF), the method that is most relevant with LightGCN but more complicated (Section 4.2). We next compare with other state-of-the-art methods in Section 4.3. To justify the designs in LightGCN and reveal the reasons of its effectiveness, we perform ablation studies and embedding analyses in Section 4.4. The hyper-parameter study is finally presented in Section 4.5.
|Dataset||User #||Item #||Interaction #||Density|
4.1. Experimental Settings
To reduce the experiment workload and keep the comparison fair, we closely follow the settings of the NGCF work (NGCF). We request the experimented datasets (including train/test splits) from the authors, for which the statistics are shown in Table 2
. The Gowalla and Amazon-Book are exactly the same as the NGCF paper used, so we directly use the results in the NGCF paper. The only exception is the Yelp2018 data, which is a revised version. According to the authors, the previous version did not filter out cold-start items in the testing set, and they shared us the revised version only. Thus we re-run NGCF on the Yelp2018 data. The evaluation metrics are recall@20 and ndcg@20 computed by the all-ranking protocol — all items that are not interacted by a user are the candidates.
*The scores of NGCF on Gowalla and Amazon-Book are directly copied from the Table 3 of (NGCF); the scores of NGCF on Yelp2018 are re-run by us.
4.1.1. Compared Methods
The main competing method is NGCF, which has shown to outperform several methods including GCN-based models GC-MC (GC-MC) and PinSage (PinSage), neural network-based models NeuMF (NCF) and CMN (CMN), and factorization-based models MF (BPRMF) and HOP-Rec (HOP-rec). As the comparison is done on the same datasets under the same evaluation protocol, we do not further compare with these methods. In addition to NGCF, we further compare with two relevant and competitive CF methods:
. This is an item-based CF method based on the variational autoencoder (VAE). It assumes the data is generated from a multinomial distribution and using variational inference for parameter estimation. We run the codes released by the authors333https://github.com/dawenl/vae_cf, tuning the dropout ratio in , and the in . The model architecture is the suggested one in the paper: .
GRMF (rao2015collaborative). This method smooths matrix factorization by adding the graph Laplacian regularizer. For fair comparison on item recommendation, we change the rating prediction loss to BPR loss. The objective function of GRMF is:
where is searched in the range of . Moreover, we compare with a variant that adds normalization to graph Laplacian: , which is termed as GRMF-norm. Other hyper-parameter settings are same as LightGCN. The two GRMF methods benchmark the performance of smoothing embeddings via Laplacian regularizer, while our LightGCN achieves embedding smoothing in the predictive model.
4.1.2. Hyper-parameter Settings
Same as NGCF, the embedding size is fixed to 64 for all models and the embedding parameters are initialized with the Xavier method (Xarvier). We optimize LightGCN with Adam (Adam) and use the default learning rate of 0.001 and default mini-batch size of 1024 (on Amazon-Book, we increase the mini-batch size to 2048 for speed). The regularization coefficient is searched in the range of , and in most cases the optimal value is . The layer combination coefficient is uniformly set to where is the number of layers. We test in the range of 1 to 4, and satisfactory performance can be achieved when
equals to 3. The early stopping and validation strategies are the same as NGCF. Typically, 1000 epochs are sufficient for LightGCN to converge. The implementation is based on TensorFlow, and we will release all codes and data upon acceptance.
4.2. Performance Comparison with NGCF
We perform detailed comparison with NGCF, recording the performance at different layers (1 to 4) in Table 4, which also shows the percentage of relative improvement on each metric. We further plot the training curves of training loss and testing recall in Figure 3 to reveal the advantages of LightGCN and to be clear of the training process. The main observations are as follows:
In all cases, LightGCN outperforms NGCF by a large margin. For example, on Gowalla the highest recall reported in the NGCF paper is 0.1560, while our LightGCN can reach 0.1817 under the 4-layer setting, which is higher. On average, the recall improvement on the three datasets is and the ndcg improvement is , which are rather significant.
Aligning Table 4 with Table 1 in Section 2, we can see that LightGCN performs better than NGCF-fn, the variant of NGCF that removes feature transformation and nonlinear activation. As NGCF-fn still contains more operations than LightGCN (e.g., self-connection, the interaction between user embedding and item embedding in graph convolution, and dropout), this suggests that these operations might also be useless for NGCF-fn.
Increasing the number of layers can improve the performance, but the benefits diminish. The general observation is that increasing the layer number from 0 (i.e., the matrix factorization model, results see (NGCF)) to 1 leads to the largest performance gain, and using a layer number of 3 leads to satisfactory performance in most cases. This observation is consistent with NGCF’s finding.
Along the training process, LightGCN consistently obtains lower training loss, which indicates that LightGCN fits the training data better than NGCF. Moreover, the lower training loss successfully transfers to better testing accuracy, indicating the strong generalization power of LightGCN. In contrast, the higher training loss and lower testing accuracy of NGCF reflect the practical difficulty to train such a heavy model it well. Note that in the figures we show the training process under the optimal hyper-parameter setting for both methods. Although increasing the learning rate of NGCF can decrease its training loss (even lower than that of LightGCN), the testing recall could not be improved, as lowering training loss in this way only finds trivial solution for NGCF.
4.3. Performance Comparison with State-of-the-Arts
Table 4 shows the performance comparison with competing methods. We show the best score we can obtain for each method. We can see that LightGCN consistently outperforms other methods on all three datasets, demonstrating its high effectiveness with simple yet reasonable designs. Note that LightGCN can be further improved by tuning the (see Figure 4 for an evidence), while here we only use a uniform setting of to avoid over-tuning it. Among the baselines, Mult-VAE exhibits the strongest performance, which is better than GRMF and NGCF. The performance of GRMF is on a par with NGCF, being better than MF, which admits the utility of enforcing embedding smoothness with Laplacian regularizer. By adding normalization into the Laplacian regularizer, GRMF-norm betters than GRMF on Gowalla, while brings no benefits on Yelp2018 and Amazon-Book.
4.4. Ablation and Effectiveness Analyses
We perform ablation studies on LightGCN by showing how layer combination and symmetric sqrt normalization affect its performance. To justify the rationality of LightGCN as analyzed in Section 3.2.3, we further investigate the effect of embedding smoothness — the key reason of LightGCN’s effectiveness.
4.4.1. Impact of Layer Combination
Figure 4 shows the results of LightGCN and its variant LightGCN-single that does not use layer combination (i.e., is used for final prediction for a -layer LightGCN). We omit the results on Yelp2018 due to space limitation, which show similar trend with Amazon-Book. We have three main observations:
Focusing on LightGCN-single, we find that its performance first improves and then drops when the layer number increases from 1 to 4. The peak point is on layer 2 in most cases, while after that it drops quickly to the worst point of layer 4. This indicates that smoothing a node’s embedding with its first-order and second-order neighbors is very useful for CF, but will suffer from over-smoothing issues when higher-order neighbors are used.
Focusing on LightGCN, we find that its performance gradually improves with the increasing of layers. Even using 4 layers, LightGCN’s performance is not degraded. This justifies the effectiveness of layer combination for addressing over-smoothing, as we have technically analyzed in Section 3.2.2 (relation with APPNP).
Comparing the two methods, we find that LightGCN consistently outperforms LightGCN-single on Gowalla, but not on Amazon-Book and Yelp2018 (where the 2-layer LightGCN-single performs the best). Regarding this phenomenon, two points need to be noted before we draw conclusion: 1) LightGCN-single is special case of LightGCN that sets to 1 and other to 0; 2) we do not tune the and simply set it as uniformly for LightGCN. As such, we can see the potential of further enhancing the performance of LightGCN by tuning .
Method notation: -L means only the left-side norm is used, -R means only the right-side norm is used, and - means the norm is used.
4.4.2. Impact of Symmetric Sqrt Normalization
In LightGCN, we employ symmetric sqrt normalization on each neighbor embedding when performing neighborhood aggregation (cf. Equation (3)). To study its rationality, we explore different choices here. We test the use of normalization only at the left side (i.e., the target node’s coefficient) and the right side (i.e., the neighbor node’s coefficient). We also test normalization, i.e., removing the square root. Note that if removing normalization, the training becomes numerically unstable and suffers from not-a-value (NAN) issues, so we do not show this setting. Table 5 shows the results of the 3-layer LightGCN. We have the following observations:
The best setting in general is using sqrt normalization at both sides (i.e., the current design of LightGCN). Removing either side will drop the performance largely.
The second best setting is using normalization at the left side only (i.e., LightGCN-
-L). This is equivalent to normalize the adjacency matrix as a stochastic matrix by the in-degree.
Normalizing symmetrically on two sides is helpful for the sqrt normalization, but will degrade the performance of normalization.
4.4.3. Analysis of Embedding Smoothness
As we have analyzed in Section 3.2.3, a 2-layer LightGCN smooths a user’s embedding based on the users that have overlap on her interacted items, and the smoothing strength between two users is measured in Equation (14). We speculate that such smoothing of embeddings is the key reason of LightGCN’s effectiveness. To verify this, we first define the smoothness of user embeddings as:
where the norm on embeddings is used to eliminate the impact of the embedding’s scale. Similarly we can obtained the definition for item embeddings. Table 6 shows the smoothness of two models, matrix factorization (i.e., using the for model prediction) and the 2-layer LightGCN-single (i.e., using the for prediction). Note that the 2-layer LightGCN-single outperforms MF in recommendation accuracy by a large margin. As can be seen, the smoothness loss of LightGCN-single is much lower than that of MF. This indicates that by conducting light graph convolution, the embeddings become smoother and more suitable for recommendation.
|Smoothness of User Embeddings|
|Smoothness of Item Embeddings|
4.5. Hyper-parameter Studies
When applying LightGCN to a new dataset, besides the standard hyper-parameter learning rate, the most important hyper-parameter to tune is the regularization coefficient . Here we investigate the performance change of LightGCN w.r.t. .
As shown in Figure 5, LightGCN is relatively insensitive to — even when sets to 0, LightGCN is better than NGCF, which additionally uses dropout to prevent overfitting444Note that Gowalla shows the same trend with Amazon-Book, so its curves are not shown to better highlight the trend of Yelp2018 and Amazon-Book.. This shows that LightGCN is less prone to overfitting — since the only trainable parameters in LightGCN are ID embeddings of the 0-th layer, the whole model is easy to train and to regularize. The optimal value for Yelp2018, Amazon-Book, and Gowalla is , , and , respectively. When is larger than , the performance drops quickly, which indicates that too strong regularization will negatively affect model normal training and is not encouraged.
5. Related Work
5.1. Collaborative Filtering
Collaborative Filtering (CF) is a prevalent technique in modern recommender systems (YoutubeRS; PinSage). One common paradigm of CF model is to parameterize users and items as embeddings, and learn the embedding parameters by reconstructing historical user-item interactions. For example, earlier CF models like matrix factorization (MF) (MF; BPRMF) project the ID of a user (or an item) into an embedding vector. The recent neural recommender models like NCF (NCF) and LRML (tay2018latent) use the same embedding component, while enhance the interaction modeling with neural networks.
Beyond merely using ID information, another type of CF methods considers historical items as the pre-existing features of a user, towards better user representations. For example, FISM (FISM) and SVD++ (SVD++) use the weighted average of the ID embeddings of historical items as the target user’s embedding. Recently, researchers realize that historical items have different contributions to shape personal interest. Towards this end, attention mechanisms are introduced to capture the varying contributions, such as ACF (ACF), NAIS (NAIS), and DeepICF (DeepICF), to automatically learn the importance of each historical item. When revisiting historical interactions as a user-item bipartite graph, the performance improvements can be attributed to the encoding of local neighborhood — one-hop neighbors — that improves the embedding learning.
5.2. Graph Methods for Recommendation
Another relevant research line is exploiting the user-item graph structure for recommendation. Prior efforts, such as ItemRank (ItemRank) and BiRank (BiRank), use the label propagation mechanism to directly propagate user preference scores over the graph, i.e., encouraging connected nodes to have similar labels. Recently emerged graph neural networks (GNNs) shine a light on modeling graph structure, especially high-hop neighbors, to guide the embedding learning (GCN; GraphSAGE). Early studies define graph convolution on the spectral domain, such as Laplacian eigen-decomposition (DBLP:journals/corr/BrunaZSL13) and Chebyshev polynomials (FirstGCN), which are computationally expensive. Later on, GraphSage (GraphSAGE) and GCN (GCN) re-define graph convolution in the spatial domain, i.e., aggregating the embeddings of neighbors to refine the target node’s embedding. Owing to its interpretability and efficiency, it quickly becomes a prevalent formulation of GNNs and is being widely used (DeepInf; Feng2019TOIS). Motivated by the strength of graph convolution, recent efforts like NGCF (NGCF), GC-MC (GC-MC), and PinSage (PinSage) adapt GCN to the user-item interaction graph, capturing CF signals in high-hop neighbors for recommendation.
It is worth mentioning that several recent efforts provide deep insights into GNNs (DeepInsights; ICLR19-APPNP; SGCN), which inspire us developing LightGCN. Particularly, Wu et al. (SGCN) argues the unnecessary complexity of GCN, developing a simplified GCN (SGCN) model by removing nonlinearities and collapsing multiple weight matrices into one. One main difference is that LightGCN and SGCN are developed for different tasks, thus the rationality of model simplification is different. Specifically, SGCN is for node classification, performing simplification for model interpretability and efficiency. In contrast, LightGCN is on collaborative filtering (CF), where each node has an ID feature only. Thus, we do simplification for a stronger reason: nonlinearity and weight matrices are useless for CF, and even hurt model training. For node classification accuracy, SGCN is on par with (sometimes weaker than) GCN. While for CF accuracy, LightGCN outperforms GCN by a large margin (over 15% improvement over NGCF).
6. Conclusion and Future Work
In this work, we argued the unnecessarily complicated design of GCNs for collaborative filtering, and performed empirical studies to justify this argument. We proposed LightGCN which consists of two essential components — light graph convolution and layer combination. In light graph convolution, we discard feature transformation and nonlinear activation — two standard operations in GCNs but inevitably increase the training difficulty. In layer combination, we construct a node’s final embedding as the weighted sum of its embeddings on all layers, which is proved to subsume the effect of self-connections and is helpful to control oversmoothing. We conduct experiments to demonstrate the strengths of LightGCN in being simple: easier to be trained, better generalization ability, and more effective.
We believe the insights of LightGCN are inspirational to future developments of recommender models. With the prevalence of linked graph data in real applications, graph-based models are becoming increasingly important in recommendation (GraphSAGE; AliKDD2018)
; by explicitly exploiting the relations among entities in the predictive model, they are advantageous to traditional supervised learning scheme like factorization machines(FM; NFM)
that model the relations implicitly. For example, a recent trend is to exploit auxiliary information such as item knowledge graph(KGAT; KGCN), social network (GCNSocial) and multimedia content (MMGCN) for recommendation, where GCNs have set up the new state-of-the-art. However, these models may also suffer from the similar issues of NGCF since the user-item interaction graph is also modeled by same neural operations that may be unnecessary. We plan to explore the idea of LightGCN in these models. Another future direction is to personalize the layer combination weights
, so as to enable adaptive-order smoothing for different users (e.g., sparse users may require more signal from higher-order neighbors while active users require less). Lastly, we will explore further the strengths of LightGCN’s simplicity, studying whether closed-form solution exists for particular forms of loss functions and streaming it for online industrial scenarios.