1. Introduction
To alleviate information overload on the web, recommender systems have been widely deployed to perform personalized information filtering (PinSage; YoutubeRS). The core of a recommender system is to predict whether a user will interact with an item, e.g., click, rate, or purchase it, among other forms of interaction. As such, collaborative filtering (CF), which exploits past user-item interactions to make this prediction, remains a fundamental task towards effective personalized recommendation (NCF; VACF; NGCF; CMN).
The most common paradigm for CF is to learn latent features (a.k.a. embeddings) to represent a user and an item, and to perform prediction based on the embedding vectors (NCF). Matrix factorization is an early such model, which directly projects the single ID of a user to her embedding (MF; BPRMF). Later on, several studies found that augmenting the user ID with her interaction history as the input can improve the quality of the embedding. For example, SVD++ (SVD++) demonstrates the benefits of user interaction history in predicting numerical ratings, and Neural Attentive Item Similarity (NAIS) (NAIS) differentiates the importance of items in the interaction history and shows improvements in predicting item ranking. In view of the user-item interaction graph, these improvements can be seen as coming from using the subgraph structure of a user (more specifically, her one-hop neighbors) to improve the embedding learning.
To deepen the use of subgraph structure with high-hop neighbors, Wang et al. (NGCF) recently proposed NGCF, which achieves state-of-the-art performance for CF. It takes inspiration from the Graph Convolution Network (GCN) (GCN; GraphSAGE), following the same propagation rule to refine embeddings: feature transformation, neighborhood aggregation, and nonlinear activation. Although NGCF has shown promising results, we argue that its designs are rather heavy and burdensome: many operations are directly inherited from GCN without justification. As a result, they are not necessarily useful for the CF task. To be specific, GCN was originally proposed for node classification on attributed graphs, where each node has rich attributes as input features; whereas in the user-item interaction graph for CF, each node (user or item) is only described by a one-hot ID, which has no concrete semantics besides being an identifier. In such a case, given the ID embedding as the input, performing multiple layers of nonlinear feature transformation, which is the key to the success of modern neural networks (ResNet), brings no benefit and instead makes the model harder to train.
To validate our thoughts, we perform extensive ablation studies on NGCF. With rigorous controlled experiments (on the same data splits and evaluation protocol), we draw the conclusion that the two operations inherited from GCN, feature transformation and nonlinear activation, contribute nothing to NGCF's effectiveness. Even more surprisingly, removing them leads to significant accuracy improvements. This reflects the problem of adding operations that are useless for the target task to a graph neural network: they not only bring no benefits, but actually degrade model effectiveness.
Motivated by these empirical findings, we present a new model named LightGCN, which includes only the most essential component of GCN, neighborhood aggregation, for collaborative filtering. Specifically, after associating each user (item) with an ID embedding, we propagate the embeddings on the user-item interaction graph to refine them. We then combine the embeddings learned at different propagation layers with a weighted sum to obtain the final embedding for prediction. The whole model is simple and elegant: it is not only easier to train, but also achieves better empirical performance than NGCF and other state-of-the-art methods like Mult-VAE (VACF).
To summarize, this work makes the following main contributions:

We empirically show that two common designs in GCN, feature transformation and nonlinear activation, have no positive effect on the effectiveness of collaborative filtering.

We propose LightGCN, which largely simplifies the model design by including only the most essential components in GCN for recommendation.

We empirically compare LightGCN with NGCF under the same settings and demonstrate substantial improvements. In-depth analyses of the rationality of LightGCN are provided from both technical and empirical perspectives.
2. Preliminaries
We first introduce NGCF (NGCF), a representative and state-of-the-art GCN model for recommendation. We then perform ablation studies on NGCF to judge the usefulness of each of its operations. The novel contribution of this section is to show that two common designs in GCNs, feature transformation and nonlinear activation, have no positive effect on collaborative filtering.
2.1. NGCF Brief
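In the initial step, each user and item is associated with an ID embedding. Let $\mathbf{e}_u^{(0)}$ denote the ID embedding of user $u$ and $\mathbf{e}_i^{(0)}$ denote the ID embedding of item $i$. Then NGCF leverages the user-item interaction graph to propagate embeddings as:
$$\begin{aligned}
\mathbf{e}_u^{(k+1)} &= \sigma\Big(\mathbf{W}_1 \mathbf{e}_u^{(k)} + \sum_{i \in \mathcal{N}_u} \frac{1}{\sqrt{|\mathcal{N}_u||\mathcal{N}_i|}} \big(\mathbf{W}_1 \mathbf{e}_i^{(k)} + \mathbf{W}_2 (\mathbf{e}_i^{(k)} \odot \mathbf{e}_u^{(k)})\big)\Big), \\
\mathbf{e}_i^{(k+1)} &= \sigma\Big(\mathbf{W}_1 \mathbf{e}_i^{(k)} + \sum_{u \in \mathcal{N}_i} \frac{1}{\sqrt{|\mathcal{N}_u||\mathcal{N}_i|}} \big(\mathbf{W}_1 \mathbf{e}_u^{(k)} + \mathbf{W}_2 (\mathbf{e}_u^{(k)} \odot \mathbf{e}_i^{(k)})\big)\Big),
\end{aligned} \tag{1}$$
where $\mathbf{e}_u^{(k)}$ and $\mathbf{e}_i^{(k)}$ respectively denote the refined embedding of user $u$ and item $i$ after $k$ layers of propagation, $\sigma$ is the nonlinear activation function, $\mathcal{N}_u$ denotes the set of items that user $u$ has interacted with, $\mathcal{N}_i$ denotes the set of users that have interacted with item $i$, and $\mathbf{W}_1$ and $\mathbf{W}_2$ are trainable weight matrices that perform feature transformation in each layer. By propagating $L$ layers, NGCF obtains $L+1$ embeddings to describe a user ($\mathbf{e}_u^{(0)}, \mathbf{e}_u^{(1)}, \ldots, \mathbf{e}_u^{(L)}$) and an item ($\mathbf{e}_i^{(0)}, \mathbf{e}_i^{(1)}, \ldots, \mathbf{e}_i^{(L)}$). It then concatenates these embeddings to obtain the final user embedding and item embedding, using the inner product to generate the prediction score.
NGCF largely follows the standard GCN (GCN), including the use of the nonlinear activation function $\sigma(\cdot)$ and the feature transformation matrices $\mathbf{W}_1$ and $\mathbf{W}_2$. However, we argue that these two operations are not as useful for collaborative filtering. In semi-supervised node classification, each node has rich semantic features as input, such as the title and abstract words of a paper. Thus performing multiple layers of nonlinear transformation is beneficial to feature learning. In collaborative filtering, by contrast, each node of the user-item interaction graph only has an ID as input, which has no concrete semantics. In this case, performing multiple nonlinear transformations will not help to learn better features; even worse, it may make the model harder to train well. In the next subsection, we provide empirical evidence for this argument.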
Table 1: Performance of NGCF and its three simplified variants.
Method   Gowalla recall  Gowalla ndcg  Amazon-Book recall  Amazon-Book ndcg
NGCF     0.1535          0.2238        0.0319              0.0622
NGCF-f   0.1682          0.2392        0.0355              0.0646
NGCF-n   0.1538          0.2243        0.0325              0.0616
NGCF-fn  0.1723          0.2414        0.0371              0.0669
2.2. Empirical Explorations on NGCF
We conduct ablation studies on NGCF to explore the effect of nonlinear activation and feature transformation. We use the code released by the authors of NGCF (https://github.com/xiangwang1223/neural_graph_collaborative_filtering), running experiments on the same data splits and evaluation protocol to keep the comparison as fair as possible. Since the core of GCN is to refine embeddings by propagation, we are more interested in the embedding quality under the same embedding size. Thus, we change the way of obtaining the final embedding from concatenation (i.e., $\mathbf{e}_u^* = \mathbf{e}_u^{(0)} \| \cdots \| \mathbf{e}_u^{(L)}$) to sum (i.e., $\mathbf{e}_u^* = \mathbf{e}_u^{(0)} + \cdots + \mathbf{e}_u^{(L)}$). Note that this change has little effect on NGCF's performance, but makes the following ablation studies more indicative of the embedding quality refined by GCN.
We implement three simplified variants of NGCF:

NGCF-f, which removes the feature transformation matrices $\mathbf{W}_1$ and $\mathbf{W}_2$.

NGCF-n, which removes the nonlinear activation function $\sigma$.

NGCF-fn, which removes both the feature transformation matrices and the nonlinear activation function.
For the three variants, we keep all hyperparameters (e.g., learning rate, regularization coefficient, dropout ratio) the same as the optimal settings of NGCF. We report the results of the 2-layer setting on the Gowalla and Amazon-Book datasets in Table 1, where the scores of NGCF are directly copied from Table 3 of (NGCF). As can be seen, removing feature transformation (i.e., NGCF-f) leads to consistent improvements over NGCF on both datasets. In contrast, removing nonlinear activation alone does not affect the accuracy much. However, if we remove nonlinear activation on top of removing feature transformation (i.e., NGCF-fn), the performance improves significantly. From these observations, we conclude that:
(1) Adding feature transformation has a negative effect on NGCF, since removing it from both NGCF and NGCF-n improves the performance significantly;
(2) Adding nonlinear activation has only a slight effect when feature transformation is included, but it has a negative effect when feature transformation is disabled;
(3) As a whole, feature transformation and nonlinear activation have a rather negative effect on NGCF, since by removing them simultaneously, NGCF-fn demonstrates large improvements over NGCF (9.57% relative improvement on recall).
To gain more insight into the scores in Table 1 and understand why NGCF deteriorates with the two operations, we plot the curves of model status recorded by training loss and testing recall in Figure 1. As can be seen, NGCF-fn achieves a much lower training loss than NGCF, NGCF-f, and NGCF-n throughout the training process. Aligning with the curves of testing recall, we find that this lower training loss successfully transfers to better recommendation accuracy. The comparison between NGCF and NGCF-f shows a similar trend, except that the improvement margin is smaller.
From this evidence, we can conclude that the deterioration of NGCF stems from training difficulty rather than overfitting. Theoretically speaking, NGCF has higher representation power than NGCF-f, since setting the weight matrices $\mathbf{W}_1$ and $\mathbf{W}_2$ to the identity matrix $\mathbf{I}$ fully recovers the NGCF-f model. However, in practice, NGCF demonstrates higher training loss and worse generalization performance than NGCF-f. The incorporation of nonlinear activation further aggravates the discrepancy between representation power and generalization performance. To round out this section, we claim that when designing a model for recommendation, it is important to perform rigorous ablation studies to be clear about the impact of each operation. Otherwise, including less useful operations will complicate the model unnecessarily, increase the training difficulty, and even degrade model effectiveness.
3. Method
The former section demonstrates that NGCF is a heavy and burdensome GCN model for collaborative filtering. Driven by these findings, we set the goal of developing a light yet effective model by including the most essential ingredients of GCN for recommendation. The advantages of being simple are severalfold — more interpretable, practically easy to train and maintain, technically easy to analyze the model behavior and revise it towards more effective directions, and so on.
In this section, we first present our designed Light Graph Convolution Network (LightGCN) model, as illustrated in Figure 2. We then provide an in-depth analysis of LightGCN to show the rationality behind its simple design. Lastly, we describe how to train the model for recommendation.
3.1. LightGCN
The basic idea of GCN is to learn representations for nodes by smoothing features over the graph (GCN; SGCN). To achieve this, it performs graph convolution iteratively, i.e., aggregating the features of neighbors as the new representation of a target node. Such neighborhood aggregation can be abstracted as:
$$\mathbf{e}_u^{(k+1)} = \mathrm{AGG}\big(\mathbf{e}_u^{(k)}, \{\mathbf{e}_i^{(k)} : i \in \mathcal{N}_u\}\big). \tag{2}$$
AGG is an aggregation function, the core of graph convolution, that considers the $k$-th layer's representation of the target node and its neighbor nodes. Many works have specified the AGG, such as the weighted sum aggregator in GCN (GCN) and GIN (GIN), and the mean aggregator and LSTM aggregator in GraphSAGE (GraphSAGE). However, most of these works tie feature transformation or nonlinear activation to the AGG function. Although they perform well on node or graph classification tasks that have semantic input features, they could be burdensome for collaborative filtering (see preliminary results in Section 2.2).
3.1.1. Light Graph Convolution (LGC)
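In LightGCN, we adopt the simple weighted sum aggregator and abandon the use of feature transformation and nonlinear activation. The graph convolution operation (a.k.a. propagation rule (NGCF)) in LightGCN is defined as:
$$\begin{aligned}
\mathbf{e}_u^{(k+1)} &= \sum_{i \in \mathcal{N}_u} \frac{1}{\sqrt{|\mathcal{N}_u|}\sqrt{|\mathcal{N}_i|}} \mathbf{e}_i^{(k)}, \\
\mathbf{e}_i^{(k+1)} &= \sum_{u \in \mathcal{N}_i} \frac{1}{\sqrt{|\mathcal{N}_i|}\sqrt{|\mathcal{N}_u|}} \mathbf{e}_u^{(k)}.
\end{aligned} \tag{3}$$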
The symmetric normalization term $\frac{1}{\sqrt{|\mathcal{N}_u|}\sqrt{|\mathcal{N}_i|}}$ follows the design of standard GCN (GCN), which prevents the scale of embeddings from growing with graph convolution operations. Other choices can also be applied here, such as the $L_1$ norm, but empirically we find that this symmetric normalization performs well (see experiment results in Section 4.4.2).
It is worth noting that in LGC, we aggregate only the connected neighbors and do not integrate the target node itself (i.e., self-connection). This is different from most existing graph convolution operations (NGCF; GCN; GAT; GraphSAGE), which typically aggregate extended neighbors and need to handle the self-connection specially. The layer combination operation, to be introduced in the next subsection, essentially captures the same effect as self-connections. Thus, there is no need to include self-connections in LGC.
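As a minimal sketch of the propagation rule in Equation (3), the following pure-Python function applies one LGC step to a toy bipartite graph. The function name `lgc_layer`, the dictionary-based data layout, and the toy IDs are illustrative assumptions, not part of the released implementation.

```python
import math


def lgc_layer(user_items, user_emb, item_emb):
    """Apply one Light Graph Convolution step (no weights, no activation).

    user_items: dict mapping a user ID to the list of items it interacted with.
    user_emb / item_emb: dicts mapping IDs to embedding vectors (lists of floats).
    """
    # Reverse adjacency: item -> users that interacted with it.
    item_users = {i: [] for i in item_emb}
    for u, items in user_items.items():
        for i in items:
            item_users[i].append(u)

    def norm(u, i):
        # Symmetric normalization 1 / (sqrt|N_u| * sqrt|N_i|).
        return 1.0 / math.sqrt(len(user_items[u]) * len(item_users[i]))

    dim = len(next(iter(user_emb.values())))
    # Each node aggregates only its neighbors; there is no self-connection.
    new_user = {
        u: [sum(norm(u, i) * item_emb[i][d] for i in items) for d in range(dim)]
        for u, items in user_items.items()
    }
    new_item = {
        i: [sum(norm(u, i) * user_emb[u][d] for u in users) for d in range(dim)]
        for i, users in item_users.items()
    }
    return new_user, new_item


# Toy graph (hypothetical): u1 interacted with i1 and i2, u2 with i2 only.
toy_user_items = {"u1": ["i1", "i2"], "u2": ["i2"]}
toy_user_emb = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
toy_item_emb = {"i1": [1.0, 1.0], "i2": [2.0, 0.0]}
next_user, next_item = lgc_layer(toy_user_items, toy_user_emb, toy_item_emb)
```

Because the step has no trainable weights, stacking it only requires calling the function repeatedly on its own output.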
3.1.2. Layer Combination and Model Prediction
In LightGCN, the only trainable model parameters are the embeddings at the 0-th layer, i.e., $\mathbf{e}_u^{(0)}$ for all users and $\mathbf{e}_i^{(0)}$ for all items. When they are given, the embeddings at higher layers can be computed via LGC defined in Equation (3). After $K$ layers of LGC, we further combine the embeddings obtained at each layer to form the final representation of a user (an item):
$$\mathbf{e}_u = \sum_{k=0}^{K} \alpha_k \mathbf{e}_u^{(k)}; \qquad \mathbf{e}_i = \sum_{k=0}^{K} \alpha_k \mathbf{e}_i^{(k)}, \tag{4}$$
where $\alpha_k \geq 0$ denotes the importance of the $k$-th layer embedding in constituting the final embedding. It can be treated as a hyperparameter to be tuned manually, or as a model parameter (e.g., the output of an attention network (ACF)) to be optimized automatically. In our experiments, we find that setting $\alpha_k$ uniformly to $1/(K+1)$ leads to good performance in general. Thus we do not design a special component to optimize $\alpha_k$, to avoid complicating LightGCN unnecessarily and to keep its simplicity.
The reasons that we perform layer combination to get the final representations are threefold. (1) As the number of layers increases, the embeddings become over-smoothed (DeepInsights). Thus simply using the last layer is problematic. (2) The embeddings at different layers capture different semantics. For example, the first layer enforces smoothness on users and items that have interactions, the second layer smooths users (items) that overlap on interacted items (users), and higher layers capture higher-order proximity (NGCF). Thus combining them makes the representation more comprehensive. (3) Combining embeddings at different layers with a weighted sum captures the effect of graph convolution with self-connections, an important trick in GCNs (see the proof in Section 3.2.1).
The model prediction is defined as the inner product of the user and item final representations:
$$\hat{y}_{ui} = \mathbf{e}_u^T \mathbf{e}_i, \tag{5}$$
which is used as the ranking score for recommendation generation.
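The layer combination and the inner-product prediction described above can be sketched in a few lines. This is an illustration only: the helper names `combine_layers` and `score` are hypothetical, and the uniform default weights are an assumption that mirrors the uniform setting discussed in the text.

```python
def combine_layers(layer_embs, alphas=None):
    """Weighted sum of one node's embeddings from layer 0 to layer K.

    layer_embs: list of K + 1 vectors (lists of floats), one per layer.
    alphas: optional list of K + 1 layer weights; uniform weights if omitted.
    """
    num_layers = len(layer_embs)
    if alphas is None:
        # Assumed default: every layer contributes equally.
        alphas = [1.0 / num_layers] * num_layers
    dim = len(layer_embs[0])
    return [sum(a * e[d] for a, e in zip(alphas, layer_embs)) for d in range(dim)]


def score(user_vec, item_vec):
    """Ranking score: inner product of the final user and item embeddings."""
    return sum(a * b for a, b in zip(user_vec, item_vec))


# Hypothetical per-layer embeddings of one user and one item (K = 2).
user_layers = [[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]]
item_layers = [[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]]
final_user = combine_layers(user_layers)
final_item = combine_layers(item_layers)
y_hat = score(final_user, final_item)
```

Passing weights that are 1 for the last layer and 0 elsewhere reduces the combination to using only the last-layer embedding.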
3.1.3. Matrix Form
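We provide the matrix form of LightGCN to facilitate implementation and discussion with existing models. Let the user-item interaction matrix be $\mathbf{R} \in \mathbb{R}^{M \times N}$, where $M$ and $N$ denote the number of users and items, respectively, and each entry $R_{ui}$ is 1 if user $u$ has interacted with item $i$ and 0 otherwise. We then obtain the adjacency matrix of the user-item graph as
$$\mathbf{A} = \begin{pmatrix} \mathbf{0} & \mathbf{R} \\ \mathbf{R}^T & \mathbf{0} \end{pmatrix}. \tag{6}$$
Let the 0-th layer embedding matrix be $\mathbf{E}^{(0)} \in \mathbb{R}^{(M+N) \times T}$, where $T$ is the embedding size. Then we can obtain the matrix equivalent form of LGC as:
$$\mathbf{E}^{(k+1)} = (\mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}) \mathbf{E}^{(k)}, \tag{7}$$
where $\mathbf{D}$ is a $(M+N) \times (M+N)$ diagonal matrix, in which each entry $D_{ii}$ denotes the number of nonzero entries in the $i$-th row vector of the adjacency matrix $\mathbf{A}$ (also called the degree matrix). Lastly, we get the final embedding matrix used for model prediction as:
$$\begin{aligned}
\mathbf{E} &= \alpha_0 \mathbf{E}^{(0)} + \alpha_1 \mathbf{E}^{(1)} + \alpha_2 \mathbf{E}^{(2)} + \cdots + \alpha_K \mathbf{E}^{(K)} \\
&= \alpha_0 \mathbf{E}^{(0)} + \alpha_1 \tilde{\mathbf{A}} \mathbf{E}^{(0)} + \alpha_2 \tilde{\mathbf{A}}^2 \mathbf{E}^{(0)} + \cdots + \alpha_K \tilde{\mathbf{A}}^K \mathbf{E}^{(0)},
\end{aligned} \tag{8}$$
where $\tilde{\mathbf{A}} = \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}$ is the symmetrically normalized matrix.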
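The matrix form can be sketched with plain nested lists, assuming a small dense interaction matrix. The function names below are hypothetical and the dense representation is chosen only for readability; a practical implementation would use sparse matrices.

```python
import math


def normalized_adjacency(R):
    """Build the bipartite adjacency from R and normalize it symmetrically."""
    M, N = len(R), len(R[0])
    size = M + N
    A = [[0.0] * size for _ in range(size)]
    for u in range(M):
        for i in range(N):
            if R[u][i]:
                A[u][M + i] = 1.0  # upper-right block: R
                A[M + i][u] = 1.0  # lower-left block: R transposed
    degree = [sum(row) for row in A]
    # Isolated nodes get a zero factor to avoid division by zero.
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    return [[inv_sqrt[r] * A[r][c] * inv_sqrt[c] for c in range(size)]
            for r in range(size)]


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lightgcn_embeddings(R, E0, K):
    """Uniformly weighted sum of E0 and its K propagated versions."""
    A_tilde = normalized_adjacency(R)
    alpha = 1.0 / (K + 1)  # assumed uniform layer weights
    E_k = E0
    final = [[alpha * v for v in row] for row in E0]
    for _ in range(K):
        E_k = matmul(A_tilde, E_k)
        final = [[f + alpha * v for f, v in zip(f_row, e_row)]
                 for f_row, e_row in zip(final, E_k)]
    return final


# Toy data (hypothetical): 2 users, 2 items; users first, then items, in E0.
R_toy = [[1, 1], [0, 1]]
E0_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
E_final = lightgcn_embeddings(R_toy, E0_toy, K=1)
```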
3.2. Model Analysis
We conduct model analysis to demonstrate the rationality behind the simple design of LightGCN. First we discuss the connection with the Simplified GCN (SGCN) (SGCN), a recent linear GCN model that integrates self-connection into graph convolution; this analysis shows that by doing layer combination, LightGCN subsumes the effect of self-connection, so there is no need for LightGCN to add self-connection to the adjacency matrix. Then we discuss the relation with the Approximate Personalized Propagation of Neural Predictions (APPNP) (ICLR19APPNP), a recent GCN variant that addresses over-smoothing by taking inspiration from Personalized PageRank (haveliwala2002topic); this analysis shows the underlying equivalence between LightGCN and APPNP, so LightGCN enjoys the same benefits in propagating over long ranges with controllable over-smoothing. Lastly we analyze the second-layer LGC to show how it smooths a user with her second-order neighbors, providing more insight into the working mechanism of LightGCN.
3.2.1. Relation with SGCN
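In (SGCN), the authors argue that GCN is unnecessarily complex for node classification and propose SGCN, which simplifies GCN by removing nonlinearities and collapsing the weight matrices into one weight matrix. The graph convolution in SGCN is defined as (the weight matrix in SGCN can be absorbed into the 0-th layer embedding parameters, so it is omitted in the analysis):
$$\mathbf{E}^{(k+1)} = (\mathbf{D} + \mathbf{I})^{-\frac{1}{2}} (\mathbf{A} + \mathbf{I}) (\mathbf{D} + \mathbf{I})^{-\frac{1}{2}} \mathbf{E}^{(k)}, \tag{9}$$
where $\mathbf{I} \in \mathbb{R}^{(M+N) \times (M+N)}$ is an identity matrix, which is added to $\mathbf{A}$ to include self-connections. In the following analysis, we omit the $(\mathbf{D} + \mathbf{I})^{-\frac{1}{2}}$ terms for simplicity, since they only rescale embeddings. In SGCN, the embeddings obtained at the last layer are used for the downstream prediction task, which can be expressed as:
$$\begin{aligned}
\mathbf{E}^{(K)} &= (\mathbf{A} + \mathbf{I}) \mathbf{E}^{(K-1)} = (\mathbf{A} + \mathbf{I})^K \mathbf{E}^{(0)} \\
&= \binom{K}{0} \mathbf{E}^{(0)} + \binom{K}{1} \mathbf{A} \mathbf{E}^{(0)} + \binom{K}{2} \mathbf{A}^2 \mathbf{E}^{(0)} + \cdots + \binom{K}{K} \mathbf{A}^K \mathbf{E}^{(0)}.
\end{aligned} \tag{10}$$
The above derivation shows that inserting self-connections into $\mathbf{A}$ and propagating embeddings on it is essentially equivalent to a weighted sum of the embeddings propagated at each LGC layer.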
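The binomial expansion behind this argument can be checked numerically. The sketch below, with hypothetical helper names and an arbitrary toy adjacency matrix, compares propagating on the graph with self-connections against the binomially weighted sum of propagation depths; normalization is omitted, as in the analysis above.

```python
from math import comb


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def identity(n):
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]


def matpow(A, k):
    result = identity(len(A))
    for _ in range(k):
        result = matmul(result, A)
    return result


def self_connection_power(A, K):
    """(A + I)^K: K propagation steps on the graph with self-connections."""
    n = len(A)
    A_plus_I = [[A[r][c] + (1 if r == c else 0) for c in range(n)] for r in range(n)]
    return matpow(A_plus_I, K)


def binomial_layer_sum(A, K):
    """sum over k of C(K, k) * A^k: a weighted sum over propagation depths."""
    n = len(A)
    total = [[0] * n for _ in range(n)]
    for k in range(K + 1):
        A_k = matpow(A, k)
        total = [[t + comb(K, k) * a for t, a in zip(t_row, a_row)]
                 for t_row, a_row in zip(total, A_k)]
    return total


# Toy symmetric adjacency (hypothetical): node 0 is linked to nodes 1 and 2.
A_toy = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
lhs = self_connection_power(A_toy, 3)
rhs = binomial_layer_sum(A_toy, 3)
```

The two results agree entry by entry because $\mathbf{A}$ and $\mathbf{I}$ commute, which is what allows the binomial theorem to be applied to matrices here.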
3.2.2. Relation with APPNP
In a recent work (ICLR19APPNP), the authors connect GCN with Personalized PageRank (haveliwala2002topic) and, taking inspiration from it, propose a GCN variant named APPNP that can propagate over long ranges without the risk of over-smoothing. Inspired by the teleport design in Personalized PageRank, APPNP complements each propagation layer with the starting features (i.e., the 0-th layer embeddings), which balances the need to preserve locality (i.e., staying close to the root node to alleviate over-smoothing) against leveraging information from a large neighborhood. The propagation layer in APPNP is defined as:
$$\mathbf{E}^{(k+1)} = \beta \mathbf{E}^{(0)} + (1 - \beta) \tilde{\mathbf{A}} \mathbf{E}^{(k)}, \tag{11}$$
where $\beta$ is the teleport probability that controls how much of the starting features is retained in the propagation, and $\tilde{\mathbf{A}}$ denotes the normalized adjacency matrix. In APPNP, the last layer is used for final prediction, i.e.,
$$\begin{aligned}
\mathbf{E}^{(K)} &= \beta \mathbf{E}^{(0)} + (1 - \beta) \tilde{\mathbf{A}} \mathbf{E}^{(K-1)} \\
&= \beta \mathbf{E}^{(0)} + \beta (1 - \beta) \tilde{\mathbf{A}} \mathbf{E}^{(0)} + (1 - \beta)^2 \tilde{\mathbf{A}}^2 \mathbf{E}^{(K-2)} \\
&= \beta \mathbf{E}^{(0)} + \beta (1 - \beta) \tilde{\mathbf{A}} \mathbf{E}^{(0)} + \beta (1 - \beta)^2 \tilde{\mathbf{A}}^2 \mathbf{E}^{(0)} + \cdots + (1 - \beta)^K \tilde{\mathbf{A}}^K \mathbf{E}^{(0)}.
\end{aligned} \tag{12}$$
Aligning with Equation (8), we can see that by setting $\alpha_k$ accordingly, LightGCN can fully recover the prediction embedding used by APPNP. As such, LightGCN shares the strength of APPNP in combating over-smoothing: by setting $\alpha_k$ properly, we can use a large $K$ for long-range modeling with controllable over-smoothing.
Another minor difference is that APPNP adds self-connections to the adjacency matrix. However, as we have shown before, this is redundant due to the weighted sum of different layers.
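To make the correspondence concrete, the sketch below derives the layer weights that reproduce a $K$-step APPNP recursion and checks them against the recursion itself. For brevity it uses a scalar stand-in `a` for the normalized adjacency matrix (which is an assumption made for illustration; the same algebra holds for matrices), and all function names are hypothetical.

```python
def appnp_layer_weights(beta, K):
    """Layer weights for which a weighted layer sum matches K-step APPNP.

    Layers 0..K-1 get beta * (1 - beta)^k; the last layer gets (1 - beta)^K.
    """
    weights = [beta * (1 - beta) ** k for k in range(K)]
    weights.append((1 - beta) ** K)
    return weights


def appnp_recursion(x0, a, beta, K):
    """Run x <- beta * x0 + (1 - beta) * a * x for K steps, starting at x0."""
    x = x0
    for _ in range(K):
        x = beta * x0 + (1 - beta) * a * x
    return x


def weighted_layer_sum(x0, a, weights):
    """sum over k of weights[k] * a^k * x0, the scalar analogue of layer combination."""
    return sum(w * (a ** k) * x0 for k, w in enumerate(weights))


beta, K = 0.1, 4          # illustrative values, not taken from the experiments
weights = appnp_layer_weights(beta, K)
recursive = appnp_recursion(2.0, 0.5, beta, K)
combined = weighted_layer_sum(2.0, 0.5, weights)
```

The weights sum to 1 for any teleport probability in $[0, 1]$, so a larger teleport probability simply shifts mass toward the lower layers.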
3.2.3. SecondOrder Embedding Smoothness
Owing to the linearity and simplicity of LightGCN, we can draw more insight into how it smooths embeddings. Here we analyze a 2-layer LightGCN to demonstrate its rationality. Taking the user side as an example, intuitively, the second layer smooths users that overlap on the interacted items. More concretely, we have:
$$\mathbf{e}_u^{(2)} = \sum_{i \in \mathcal{N}_u} \frac{1}{\sqrt{|\mathcal{N}_u|}\sqrt{|\mathcal{N}_i|}} \mathbf{e}_i^{(1)} = \sum_{i \in \mathcal{N}_u} \frac{1}{|\mathcal{N}_i|} \sum_{v \in \mathcal{N}_i} \frac{1}{\sqrt{|\mathcal{N}_u|}\sqrt{|\mathcal{N}_v|}} \mathbf{e}_v^{(0)}. \tag{13}$$
We can see that, if another user $v$ has co-interacted with the target user $u$, the smoothness strength of $v$ on $u$ is measured by the coefficient (otherwise 0):
$$c_{v \to u} = \frac{1}{\sqrt{|\mathcal{N}_u|}\sqrt{|\mathcal{N}_v|}} \sum_{i \in \mathcal{N}_u \cap \mathcal{N}_v} \frac{1}{|\mathcal{N}_i|}. \tag{14}$$
This coefficient is rather interpretable: the influence of a second-order neighbor $v$ on $u$ is determined by 1) the number of co-interacted items (the more, the larger); 2) the popularity of the co-interacted items (the less popular, i.e., the more indicative of the user's personalized preference, the larger); and 3) the activity of $v$ (the less active, the larger). Such interpretability agrees well with the assumption of CF in measuring user similarity (CSE; Wang:2006) and supports the reasonableness of LightGCN. Due to the symmetric formulation of LightGCN, a similar analysis holds on the item side.
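The second-order coefficient can be computed directly from interaction lists. The following sketch assumes each user's item list has no duplicates; the function name and the toy users are hypothetical.

```python
import math


def second_order_coefficient(user_items, u, v):
    """Smoothing strength of user v on user u after two LGC layers.

    user_items: dict mapping each user to a duplicate-free list of items.
    Returns 0.0 when the two users share no item.
    """
    # Item popularity |N_i|: number of users that interacted with each item.
    popularity = {}
    for items in user_items.values():
        for i in items:
            popularity[i] = popularity.get(i, 0) + 1

    shared = set(user_items[u]) & set(user_items[v])
    if not shared:
        return 0.0
    # Less popular shared items contribute more.
    overlap = sum(1.0 / popularity[i] for i in shared)
    # More active users (larger |N_u|, |N_v|) are damped.
    return overlap / math.sqrt(len(user_items[u]) * len(user_items[v]))


# Toy data (hypothetical): item "b" is shared by all three users.
toy = {"u": ["a", "b"], "v": ["b", "c"], "w": ["b"]}
c_v = second_order_coefficient(toy, "u", "v")
c_w = second_order_coefficient(toy, "u", "w")
```

In this toy case both neighbors share the same single item with the target user, but the less active user `w` receives the larger coefficient, matching point 3) above.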
3.3. Model Training
The trainable parameters of LightGCN are only the embeddings of the 0-th layer, i.e., $\Theta = \{\mathbf{E}^{(0)}\}$; in other words, the model complexity is the same as that of standard matrix factorization (MF). We employ the Bayesian Personalized Ranking (BPR) loss (BPRMF), a pairwise loss that encourages the prediction of an observed entry to be higher than those of its unobserved counterparts:
$$L_{BPR} = -\sum_{u=1}^{M} \sum_{i \in \mathcal{N}_u} \sum_{j \notin \mathcal{N}_u} \ln \sigma(\hat{y}_{ui} - \hat{y}_{uj}) + \lambda \|\mathbf{E}^{(0)}\|^2, \tag{15}$$
where $\lambda$ controls the $L_2$ regularization strength. We employ the Adam (Adam) optimizer and use it in a mini-batch manner. We are aware of other advanced negative sampling strategies that might improve LightGCN training, such as hard negative sampling (rendle2014improving) and adversarial sampling (Ding2019IJCAI). We leave this extension to future work since it is not the focus of this work.
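A minimal sketch of the BPR objective on a handful of (user, positive item, negative item) triples is given below. Evaluating the loss on sampled triples rather than on all unobserved items is an assumption made for illustration, as are the function name and the toy embeddings.

```python
import math


def bpr_loss(triples, user_emb, item_emb, reg):
    """BPR loss over (user, positive item, negative item) triples plus L2 penalty."""

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def neg_log_sigmoid(x):
        # Numerically stable -ln(sigmoid(x)) = ln(1 + exp(-x)).
        return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

    ranking = 0.0
    for u, pos, neg in triples:
        diff = dot(user_emb[u], item_emb[pos]) - dot(user_emb[u], item_emb[neg])
        ranking += neg_log_sigmoid(diff)

    # The L2 penalty covers all embeddings, the only trainable parameters.
    all_vectors = list(user_emb.values()) + list(item_emb.values())
    l2 = sum(x * x for vec in all_vectors for x in vec)
    return ranking + reg * l2


# Toy embeddings (hypothetical): the positive item aligns with the user.
toy_users = {"u": [1.0, 0.0]}
toy_items = {"i": [1.0, 0.0], "j": [0.0, 1.0]}
loss = bpr_loss([("u", "i", "j")], toy_users, toy_items, reg=0.1)
```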
Note that we do not introduce dropout mechanisms, which are commonly used in GCNs and NGCF. The reason is that LightGCN has no feature transformation weight matrices, so enforcing $L_2$ regularization on the embedding layer is sufficient to prevent overfitting. This showcases LightGCN's advantage of being simple: it is easier to train and tune than NGCF, which additionally requires tuning two dropout ratios (node dropout and message dropout) and normalizing the embedding of each layer to unit length.
Moreover, it is technically viable to also learn the layer combination coefficients $\{\alpha_k\}_{k=0}^{K}$, or to parameterize them with an attention network. However, we find that learning $\alpha$ on training data does not lead to improvement. This is probably because the training data does not contain sufficient signal to learn a good $\alpha$ that generalizes to unknown data. We have also tried to learn $\alpha$ from validation data, as inspired by (lambdaOpt), which learns hyperparameters on validation data. The performance is only slightly improved. We leave the exploration of optimal settings of $\alpha$ (e.g., personalizing it for different users and items) as future work.
4. Experiments
We first describe the experimental settings, and then conduct a detailed comparison with NGCF (NGCF), the method that is most relevant to LightGCN but more complicated (Section 4.2). We next compare with other state-of-the-art methods in Section 4.3. To justify the designs in LightGCN and reveal the reasons for its effectiveness, we perform ablation studies and embedding analyses in Section 4.4. The hyperparameter study is finally presented in Section 4.5.
Dataset  User #  Item #  Interaction #  Density 

Gowalla  
Yelp2018  
AmazonBook 
4.1. Experimental Settings
To reduce the experiment workload and keep the comparison fair, we closely follow the settings of the NGCF work (NGCF). We requested the experimented datasets (including train/test splits) from the authors; their statistics are shown in Table 2. The Gowalla and Amazon-Book datasets are exactly the same as those used in the NGCF paper, so we directly use the results reported there. The only exception is the Yelp2018 data, which is a revised version. According to the authors, the previous version did not filter out cold-start items in the testing set, and they shared only the revised version with us. Thus we rerun NGCF on the Yelp2018 data. The evaluation metrics are recall@20 and ndcg@20 computed by the all-ranking protocol: all items that a user has not interacted with are the candidates.
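For reference, the two metrics can be sketched as follows under the common binary-relevance definitions. These definitions, the function names, and the toy ranking are assumptions for illustration; the exact formulas used by the NGCF evaluation code may differ in detail.

```python
import math


def recall_at_k(ranked, relevant, k=20):
    """Fraction of a user's held-out items that appear in the top-k ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k=20):
    """Binary-relevance NDCG: DCG of the top-k list over the ideal DCG."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0


# Toy example (hypothetical): candidates are ranked by predicted score,
# after excluding the items the user interacted with in training.
ranked_items = ["a", "b", "c", "d"]
held_out = {"a", "c", "z"}
toy_recall = recall_at_k(ranked_items, held_out, k=3)
toy_ndcg = ndcg_at_k(ranked_items, held_out, k=3)
```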
Table 3: Performance comparison between NGCF and LightGCN at different layers (relative improvement over NGCF in parentheses).
Layer #   Method    Gowalla recall   Gowalla ndcg     Yelp2018 recall  Yelp2018 ndcg    Amazon-Book recall  Amazon-Book ndcg
1 Layer   NGCF      0.1511           0.2218           0.0542           0.1028           0.0315              0.0618
1 Layer   LightGCN  0.1726(+14.23%)  0.2455(+10.67%)  0.0633(+16.79%)  0.1148(+11.67%)  0.0385(+22.22%)     0.0698(+12.94%)
2 Layers  NGCF      0.1535           0.2238           0.0550           0.1025           0.0319              0.0622
2 Layers  LightGCN  0.1786(+16.35%)  0.2487(+11.12%)  0.0618(+12.36%)  0.1120(+9.27%)   0.0413(+29.48%)     0.0729(+17.20%)
3 Layers  NGCF      0.1547           0.2237           0.0549           0.1023           0.0344              0.0630
3 Layers  LightGCN  0.1809(+16.94%)  0.2513(+12.34%)  0.0648(+18.03%)  0.1163(+13.69%)  0.0415(+20.64%)     0.0740(+17.46%)
4 Layers  NGCF      0.1560           0.2240           0.0548           0.1020           0.0342              0.0636
4 Layers  LightGCN  0.1817(+16.47%)  0.2518(+12.41%)  0.0655(+19.53%)  0.1170(+14.71%)  0.0416(+21.68%)     0.0739(+16.19%)
*The scores of NGCF on Gowalla and Amazon-Book are directly copied from Table 3 of (NGCF); the scores of NGCF on Yelp2018 are rerun by us.
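The percentages in parentheses are relative improvements over NGCF. A small sketch of that arithmetic, with a hypothetical helper name, using the 1-layer Gowalla recall scores from the table:

```python
def relative_improvement(baseline, ours):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# 1-layer recall on Gowalla: NGCF 0.1511 vs. LightGCN 0.1726.
gain = relative_improvement(0.1511, 0.1726)
```

Rounded to two decimals, `gain` reproduces the +14.23% reported for that cell.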
4.1.1. Compared Methods
The main competing method is NGCF, which has been shown to outperform several methods, including the GCN-based models GC-MC (GCMC) and PinSage (PinSage), the neural network-based models NeuMF (NCF) and CMN (CMN), and the factorization-based models MF (BPRMF) and HOP-Rec (HOPrec). As the comparison is done on the same datasets under the same evaluation protocol, we do not further compare with these methods. In addition to NGCF, we further compare with two relevant and competitive CF methods:

Mult-VAE (VACF). This is an item-based CF method based on the variational autoencoder (VAE). It assumes the data is generated from a multinomial distribution and uses variational inference for parameter estimation. We run the code released by the authors (https://github.com/dawenl/vae_cf), tuning the dropout ratio and the $\beta$ coefficient, and use the model architecture suggested in the paper.

GRMF (rao2015collaborative). This method smooths matrix factorization by adding a graph Laplacian regularizer. For a fair comparison on item recommendation, we change the rating prediction loss to the BPR loss. The objective function of GRMF is:
$$L = -\sum_{u=1}^{M} \sum_{i \in \mathcal{N}_u} \sum_{j \notin \mathcal{N}_u} \ln \sigma(\mathbf{e}_u^T \mathbf{e}_i - \mathbf{e}_u^T \mathbf{e}_j) + \lambda_g \sum_{u=1}^{M} \sum_{i \in \mathcal{N}_u} \|\mathbf{e}_u - \mathbf{e}_i\|^2 + \lambda \|\mathbf{E}\|^2, \tag{16}$$
where $\lambda_g$ controls the strength of the graph Laplacian regularizer and is tuned by search. Moreover, we compare with a variant that adds normalization to the graph Laplacian, i.e., uses the regularizer $\lambda_g \big\| \frac{\mathbf{e}_u}{\sqrt{|\mathcal{N}_u|}} - \frac{\mathbf{e}_i}{\sqrt{|\mathcal{N}_i|}} \big\|^2$, which is termed GRMF-norm. Other hyperparameter settings are the same as for LightGCN. The two GRMF methods benchmark the performance of smoothing embeddings via a Laplacian regularizer, whereas LightGCN achieves embedding smoothing in the predictive model.
4.1.2. Hyperparameter Settings
As in NGCF, the embedding size is fixed to 64 for all models and the embedding parameters are initialized with the Xavier method (Xarvier). We optimize LightGCN with Adam (Adam) and use the default learning rate of 0.001 and the default mini-batch size of 1024 (on Amazon-Book, we increase the mini-batch size to 2048 for speed). The $L_2$ regularization coefficient $\lambda$ is tuned by search over a range of candidate values. The layer combination coefficient $\alpha_k$ is uniformly set to $\frac{1}{1+K}$, where $K$ is the number of layers. We test $K$ in the range of 1 to 4, and satisfactory performance can be achieved when $K$ equals 3. The early stopping and validation strategies are the same as in NGCF. Typically, 1000 epochs are sufficient for LightGCN to converge. The implementation is based on TensorFlow, and we will release all code and data upon acceptance.
4.2. Performance Comparison with NGCF
We perform a detailed comparison with NGCF, recording the performance at different layers (1 to 4) in Table 3, which also shows the percentage of relative improvement on each metric. We further plot the training curves of training loss and testing recall in Figure 3 to reveal the advantages of LightGCN and to make the training process clear. The main observations are as follows:

In all cases, LightGCN outperforms NGCF by a large margin. For example, on Gowalla the highest recall reported in the NGCF paper is 0.1560, while LightGCN reaches 0.1817 under the 4-layer setting, which is 16.47% higher. Averaged over the three datasets, the improvements in both recall and ndcg are substantial.

Aligning Table 3 with Table 1 in Section 2, we can see that LightGCN performs better than NGCF-fn, the variant of NGCF that removes feature transformation and nonlinear activation. As NGCF-fn still contains more operations than LightGCN (e.g., self-connection, the interaction between user embedding and item embedding in graph convolution, and dropout), this suggests that these operations might also be useless for NGCF-fn.

Increasing the number of layers can improve the performance, but the benefits diminish. The general observation is that increasing the layer number from 0 (i.e., the matrix factorization model; see (NGCF) for results) to 1 leads to the largest performance gain, and using a layer number of 3 leads to satisfactory performance in most cases. This observation is consistent with NGCF's finding.

Throughout the training process, LightGCN consistently obtains lower training loss, which indicates that LightGCN fits the training data better than NGCF. Moreover, the lower training loss successfully transfers to better testing accuracy, indicating the strong generalization power of LightGCN. In contrast, the higher training loss and lower testing accuracy of NGCF reflect the practical difficulty of training such a heavy model well. Note that in the figures we show the training process under the optimal hyperparameter setting for both methods. Although increasing the learning rate of NGCF can decrease its training loss (even below that of LightGCN), the testing recall does not improve, as lowering training loss in this way only finds a trivial solution for NGCF.
4.3. Performance Comparison with StateoftheArts
Table 4 shows the performance comparison with competing methods. We show the best score we can obtain for each method. We can see that LightGCN consistently outperforms other methods on all three datasets, demonstrating its high effectiveness with simple yet reasonable designs. Note that LightGCN can be further improved by tuning $\alpha_k$ (see Figure 4 for evidence), while here we only use a uniform setting of $\frac{1}{K+1}$ to avoid over-tuning it. Among the baselines, Mult-VAE exhibits the strongest performance, which is better than GRMF and NGCF. The performance of GRMF is on a par with NGCF and better than MF, which confirms the utility of enforcing embedding smoothness with a Laplacian regularizer. By adding normalization to the Laplacian regularizer, GRMF-norm performs better than GRMF on Gowalla, but brings no benefit on Yelp2018 and Amazon-Book.
Table 4: Overall performance comparison between LightGCN and competing methods.
Method     Gowalla recall  Gowalla ndcg  Yelp2018 recall  Yelp2018 ndcg  Amazon-Book recall  Amazon-Book ndcg
NGCF       0.1560          0.2240        0.0550           0.1028         0.0344              0.0636
Mult-VAE   0.1651          0.2245        0.0582           0.1026         0.0408              0.0710
GRMF       0.1472          0.2030        0.0570           0.1049         0.0351              0.0641
GRMF-norm  0.1544          0.2122        0.0559           0.1051         0.0352              0.0645
LightGCN   0.1817          0.2518        0.0655           0.1170         0.0416              0.0739
4.4. Ablation and Effectiveness Analyses
We perform ablation studies on LightGCN by showing how layer combination and symmetric sqrt normalization affect its performance. To justify the rationality of LightGCN as analyzed in Section 3.2.3, we further investigate the effect of embedding smoothness — the key reason of LightGCN’s effectiveness.
4.4.1. Impact of Layer Combination
Figure 4 shows the results of LightGCN and its variant LightGCN-single, which does not use layer combination (i.e., $\mathbf{E}^{(K)}$ is used for final prediction for a $K$-layer LightGCN). We omit the results on Yelp2018 due to space limitations; they show a similar trend to Amazon-Book. We have three main observations:

Focusing on LightGCNsingle, we find that its performance first improves and then drops when the layer number increases from 1 to 4. The peak point is on layer 2 in most cases, while after that it drops quickly to the worst point of layer 4. This indicates that smoothing a node’s embedding with its firstorder and secondorder neighbors is very useful for CF, but will suffer from oversmoothing issues when higherorder neighbors are used.

Focusing on LightGCN, we find that its performance gradually improves with the increasing of layers. Even using 4 layers, LightGCN’s performance is not degraded. This justifies the effectiveness of layer combination for addressing oversmoothing, as we have technically analyzed in Section 3.2.2 (relation with APPNP).

Comparing the two methods, we find that LightGCN consistently outperforms LightGCN-single on Gowalla, but not on Amazon-Book and Yelp2018 (where the 2-layer LightGCN-single performs the best). Regarding this phenomenon, two points need to be noted before we draw a conclusion: 1) LightGCN-single is a special case of LightGCN that sets $\alpha_K$ to 1 and the other $\alpha_k$ to 0; 2) we do not tune $\alpha_k$ and simply set it to $1/(K+1)$ uniformly for LightGCN. As such, we can see the potential of further enhancing the performance of LightGCN by tuning $\alpha_k$.
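The relation between LightGCN and LightGCN-single described above can be sketched in a few lines. This is only an illustration: the function name `combine_layers` and the toy embeddings are hypothetical, and the uniform weights are taken here as 1/(K+1) over layers 0..K.

```python
def combine_layers(layer_embeddings, alphas=None):
    """Weighted sum of one node's per-layer embeddings.

    layer_embeddings: list of K+1 vectors (layers 0..K), each a list of floats.
    alphas: list of K+1 weights; defaults to the uniform 1/(K+1).
    """
    num_layers = len(layer_embeddings)
    if alphas is None:
        alphas = [1.0 / num_layers] * num_layers
    if len(alphas) != num_layers:
        raise ValueError("need exactly one weight per layer")
    dim = len(layer_embeddings[0])
    return [
        sum(a * emb[d] for a, emb in zip(alphas, layer_embeddings))
        for d in range(dim)
    ]


# Hypothetical embeddings of one node at layers 0, 1, 2 (K = 2).
layers = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]

# Uniform layer combination: every layer contributes with weight 1/3.
uniform = combine_layers(layers)

# Single-layer variant: weight 1 on the last layer, 0 elsewhere.
single = combine_layers(layers, [0.0, 0.0, 1.0])
```

Because the single-layer variant is just one particular choice of weights, any search over the weights includes it as a candidate, which is why tuning them can only match or improve on the better of the two settings on a validation set.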
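Table 5: Performance of the 3-layer LightGCN with different normalization schemes.

| Method | Gowalla recall | Gowalla ndcg | Yelp2018 recall | Yelp2018 ndcg | Amazon-Book recall | Amazon-Book ndcg |
|---|---|---|---|---|---|---|
| LightGCN-$L_1$-L | 0.1700 | 0.2215 | 0.0633 | 0.1119 | 0.0423 | 0.0726 |
| LightGCN-$L_1$-R | 0.1577 | 0.2311 | 0.0586 | 0.1086 | 0.0331 | 0.0634 |
| LightGCN-$L_1$ | 0.1587 | 0.2169 | 0.0574 | 0.1053 | 0.0364 | 0.0653 |
| LightGCN-L | 0.1511 | 0.2125 | 0.0564 | 0.1051 | 0.0375 | 0.0689 |
| LightGCN-R | 0.1295 | 0.1884 | 0.0484 | 0.0872 | 0.0256 | 0.0538 |
| LightGCN | 0.1809 | 0.2513 | 0.0648 | 0.1163 | 0.0415 | 0.0740 |

Method notation: -L means only the left-side norm is used, -R means only the right-side norm is used, and -$L_1$ means the $L_1$ norm is used.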
4.4.2. Impact of Symmetric Sqrt Normalization
In LightGCN, we employ symmetric sqrt normalization on each neighbor embedding when performing neighborhood aggregation (cf. Equation (3)). To study its rationality, we explore different choices here. We test the use of normalization only at the left side (i.e., the target node’s coefficient) and only at the right side (i.e., the neighbor node’s coefficient). We also test $L_1$ normalization, i.e., removing the square root. Note that if normalization is removed entirely, training becomes numerically unstable and suffers from not-a-value (NAN) issues, so we do not show this setting. Table 5 shows the results of the 3-layer LightGCN. We have the following observations:

The best setting in general is using sqrt normalization at both sides (i.e., the current design of LightGCN). Removing either side degrades the performance substantially.

The second best setting is using $L_1$ normalization at the left side only (i.e., LightGCN-$L_1$-L). This is equivalent to normalizing the adjacency matrix as a stochastic matrix by the in-degree.

Normalizing symmetrically on two sides is helpful for the sqrt normalization, but degrades the performance of $L_1$ normalization.
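The normalization variants compared above differ only in the scalar weight attached to each neighbor during aggregation. The sketch below spells out those weights under the reading given in the text (left side is the target node's degree, right side is the neighbor's degree, and the $L_1$ form drops the square root). The function name `edge_coefficient`, the variant labels, and the example degrees are hypothetical.

```python
import math


def edge_coefficient(target_degree, neighbor_degree,
                     left=True, right=True, use_sqrt=True):
    """Weight applied to a neighbor's embedding during aggregation.

    left     -> normalize by the target node's degree
    right    -> normalize by the neighbor node's degree
    use_sqrt -> take the square root of each degree (False gives the L1 form)
    """
    scale = math.sqrt if use_sqrt else float
    coeff = 1.0
    if left:
        coeff /= scale(target_degree)
    if right:
        coeff /= scale(neighbor_degree)
    return coeff


# Hypothetical degrees: the target node has 4 neighbors, this neighbor has 9.
variants = {
    "LightGCN": edge_coefficient(4, 9),
    "LightGCN-L": edge_coefficient(4, 9, right=False),
    "LightGCN-R": edge_coefficient(4, 9, left=False),
    "LightGCN-L1": edge_coefficient(4, 9, use_sqrt=False),
    "LightGCN-L1-L": edge_coefficient(4, 9, right=False, use_sqrt=False),
    "LightGCN-L1-R": edge_coefficient(4, 9, left=False, use_sqrt=False),
}

# With left-side L1 normalization, the weights over a target node's
# neighbors sum to 1, i.e. the corresponding adjacency row is stochastic.
neighbor_degrees = [9, 2, 5, 1]  # degrees of the target's 4 neighbors
row_sum = sum(
    edge_coefficient(4, d, right=False, use_sqrt=False)
    for d in neighbor_degrees
)
```

Here `row_sum` equals 1 regardless of the neighbors' degrees, which is the stochastic-matrix property mentioned in the second observation.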
4.4.3. Analysis of Embedding Smoothness
As we have analyzed in Section 3.2.3, a 2-layer LightGCN smooths a user’s embedding based on the users that overlap with her on interacted items, and the smoothing strength between two users is measured in Equation (14). We speculate that such smoothing of embeddings is the key reason for LightGCN’s effectiveness. To verify this, we first define the smoothness of user embeddings as:
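$$S_U = \sum_{u=1}^{M}\sum_{v=1}^{M} c_{v\rightarrow u}\left(\frac{\mathbf{e}_u}{\|\mathbf{e}_u\|^2} - \frac{\mathbf{e}_v}{\|\mathbf{e}_v\|^2}\right)^2 \tag{17}$$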
where the $L_2$ norm on embeddings is used to eliminate the impact of the embedding’s scale. Similarly, we can obtain the definition for item embeddings. Table 6 shows the smoothness of two models: matrix factorization (i.e., using $\mathbf{E}^{(0)}$ for model prediction) and the 2-layer LightGCN-single (i.e., using $\mathbf{E}^{(2)}$ for prediction). Note that the 2-layer LightGCN-single outperforms MF in recommendation accuracy by a large margin. As can be seen, the smoothness loss of LightGCN-single is much lower than that of MF. This indicates that by conducting light graph convolution, the embeddings become smoother and more suitable for recommendation.
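Table 6: Smoothness loss of the embeddings learned by MF and LightGCN-single (lower means smoother).

| Embeddings | Method | Gowalla | Yelp2018 | Amazon-Book |
|---|---|---|---|---|
| User | MF | 15449.3 | 16258.2 | 38034.2 |
| User | LightGCN-single | 12872.7 | 10091.7 | 32191.1 |
| Item | MF | 12106.7 | 16632.1 | 28307.9 |
| Item | LightGCN-single | 5829.0 | 6459.8 | 16866.0 |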
4.5. Hyperparameter Studies
When applying LightGCN to a new dataset, besides the standard hyperparameter learning rate, the most important hyperparameter to tune is the $L_2$ regularization coefficient $\lambda$. Here we investigate the performance change of LightGCN w.r.t. $\lambda$.
As shown in Figure 5, LightGCN is relatively insensitive to $\lambda$: even when $\lambda$ is set to 0, LightGCN is better than NGCF, which additionally uses dropout to prevent overfitting. (Note that Gowalla shows the same trend as Amazon-Book, so its curves are not shown, to better highlight the trends on Yelp2018 and Amazon-Book.) This shows that LightGCN is less prone to overfitting: since the only trainable parameters in LightGCN are the ID embeddings of the 0-th layer, the whole model is easy to train and to regularize. The optimal value for Yelp2018, Amazon-Book, and Gowalla is , , and , respectively. When $\lambda$ is larger than , the performance drops quickly, which indicates that overly strong regularization harms normal model training and is not encouraged.
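To make the role of the regularization coefficient concrete, the sketch below adds an $L_2$ penalty on the trainable layer-0 embeddings to a pairwise ranking loss. It assumes a BPR-style objective for a single (user, positive item, negative item) triple. The function name `regularized_bpr_loss`, the scores, and the embeddings are hypothetical, and the coefficient values are placeholders rather than the tuned settings.

```python
import math


def regularized_bpr_loss(pos_score, neg_score, base_embeddings, lam):
    """Pairwise ranking loss for one triple plus an L2 penalty.

    pos_score / neg_score: predicted scores of the observed and the
        sampled unobserved item for the same user.
    base_embeddings: the layer-0 ID embeddings involved in this triple
        (the only trainable parameters), as lists of floats.
    lam: regularization coefficient.
    """
    margin = pos_score - neg_score
    # log(1 + exp(-margin)) equals -log(sigmoid(margin)).
    ranking = math.log1p(math.exp(-margin))
    penalty = sum(x * x for emb in base_embeddings for x in emb)
    return ranking + lam * penalty


# Hypothetical scores and layer-0 embeddings.
embs = [[1.0, 0.0], [0.0, 2.0]]  # squared norms: 1 + 4 = 5
unregularized = regularized_bpr_loss(2.0, 0.5, embs, lam=0.0)
regularized = regularized_bpr_loss(2.0, 0.5, embs, lam=0.1)
```

With `lam=0.0` only the ranking term remains. Increasing `lam` adds a cost proportional to the squared norm of the embeddings, which is the trade-off explored in the hyperparameter study.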
5. Related Work
5.1. Collaborative Filtering
Collaborative Filtering (CF) is a prevalent technique in modern recommender systems (YoutubeRS; PinSage). One common paradigm of CF models is to parameterize users and items as embeddings, and learn the embedding parameters by reconstructing historical user-item interactions. For example, earlier CF models like matrix factorization (MF) (MF; BPRMF) project the ID of a user (or an item) into an embedding vector. Recent neural recommender models like NCF (NCF) and LRML (tay2018latent) use the same embedding component, while enhancing the interaction modeling with neural networks.
Beyond merely using ID information, another type of CF method considers historical items as the pre-existing features of a user, towards better user representations. For example, FISM (FISM) and SVD++ (SVD++) use the weighted average of the ID embeddings of historical items as the target user’s embedding. More recently, researchers have realized that historical items contribute differently to shaping personal interest. To this end, attention mechanisms are introduced to capture the varying contributions, as in ACF (ACF), NAIS (NAIS), and DeepICF (DeepICF), which automatically learn the importance of each historical item. When historical interactions are viewed as a user-item bipartite graph, the performance improvements can be attributed to the encoding of the local neighborhood (one-hop neighbors), which improves the embedding learning.
5.2. Graph Methods for Recommendation
Another relevant research line is exploiting the user-item graph structure for recommendation. Prior efforts, such as ItemRank (ItemRank) and BiRank (BiRank), use the label propagation mechanism to directly propagate user preference scores over the graph, i.e., encouraging connected nodes to have similar labels. The recently emerged graph neural networks (GNNs) shed light on modeling graph structure, especially high-hop neighbors, to guide the embedding learning (GCN; GraphSAGE). Early studies define graph convolution on the spectral domain, such as Laplacian eigen-decomposition (DBLP:journals/corr/BrunaZSL13) and Chebyshev polynomials (FirstGCN), which are computationally expensive. Later on, GraphSAGE (GraphSAGE) and GCN (GCN) redefine graph convolution in the spatial domain, i.e., aggregating the embeddings of neighbors to refine the target node’s embedding. Owing to its interpretability and efficiency, this formulation has quickly become prevalent among GNNs and is widely used (DeepInf; Feng2019TOIS). Motivated by the strength of graph convolution, recent efforts like NGCF (NGCF), GCMC (GCMC), and PinSage (PinSage) adapt GCN to the user-item interaction graph, capturing CF signals in high-hop neighbors for recommendation.
It is worth mentioning that several recent efforts provide deep insights into GNNs (DeepInsights; ICLR19APPNP; SGCN), which inspired us to develop LightGCN. In particular, Wu et al. (SGCN) argue that GCN is unnecessarily complex, and develop a simplified GCN (SGCN) model by removing nonlinearities and collapsing multiple weight matrices into one. One main difference is that LightGCN and SGCN are developed for different tasks, so the rationale for model simplification differs. Specifically, SGCN is for node classification, and performs simplification for model interpretability and efficiency. In contrast, LightGCN targets collaborative filtering (CF), where each node has an ID feature only. Thus, we simplify for a stronger reason: nonlinearity and weight matrices are useless for CF, and even hurt model training. For node classification accuracy, SGCN is on par with (sometimes weaker than) GCN, whereas for CF accuracy, LightGCN outperforms GCN by a large margin (over 15% improvement over NGCF).
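The step of collapsing multiple weight matrices into one relies on a plain linear-algebra fact: once the nonlinearity between two layers is removed, applying two weight matrices in sequence is the same as applying their product once. The toy matrices and the helper `matmul` below are hypothetical and only illustrate this identity; they are not taken from SGCN or LightGCN.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


# Hypothetical node features (3 nodes, 2 dimensions) and two layer weights.
features = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
w1 = [[1.0, 0.0], [2.0, 1.0]]
w2 = [[0.0, 1.0], [1.0, 1.0]]

# Two stacked linear layers with no activation in between...
stacked = matmul(matmul(features, w1), w2)

# ...give the same result as one layer whose weight is the product w1 @ w2.
collapsed = matmul(features, matmul(w1, w2))
```

The two results coincide, so the stacked weights add parameters without adding expressiveness. When the input is only an ID embedding, which is itself a free parameter, even the single remaining matrix can be absorbed into the embedding table.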
6. Conclusion and Future Work
In this work, we argued that the design of GCNs is unnecessarily complicated for collaborative filtering, and performed empirical studies to justify this argument. We proposed LightGCN, which consists of two essential components: light graph convolution and layer combination. In light graph convolution, we discard feature transformation and nonlinear activation, two standard operations in GCNs that inevitably increase the training difficulty. In layer combination, we construct a node’s final embedding as the weighted sum of its embeddings on all layers, which is shown to subsume the effect of self-connections and helps to control over-smoothing. We conducted experiments to demonstrate the strengths of LightGCN in being simple: it is easier to train, generalizes better, and is more effective.
We believe the insights of LightGCN are inspirational to future developments of recommender models. With the prevalence of linked graph data in real applications, graph-based models are becoming increasingly important in recommendation (GraphSAGE; AliKDD2018); by explicitly exploiting the relations among entities in the predictive model, they are advantageous over traditional supervised learning schemes like factorization machines (FM; NFM) that model the relations implicitly. For example, a recent trend is to exploit auxiliary information such as item knowledge graphs (KGAT; KGCN), social networks (GCNSocial), and multimedia content (MMGCN) for recommendation, where GCNs have set the new state of the art. However, these models may suffer from issues similar to those of NGCF, since the user-item interaction graph is also modeled by the same neural operations, which may be unnecessary. We plan to explore the idea of LightGCN in these models. Another future direction is to personalize the layer combination weights, so as to enable adaptive-order smoothing for different users (e.g., sparse users may require more signal from higher-order neighbors while active users require less). Lastly, we will further explore the strengths of LightGCN’s simplicity, studying whether a closed-form solution exists for particular forms of loss functions and streaming it for online industrial scenarios.