1 Introduction
Graph neural networks (GNNs) are widely used to model real-world relational data, such as protein networks ying2018hierarchical, social networks velivckovic2018deep, and co-author networks kipf2016semi. One can also construct a similarity graph by linking data points that are close in feature space even when no explicit graph structure exists. Several successful GNN architectures have been proposed: ChebyshevNet defferrard2016convolutional, GCN kipf2016semi, SGC wu2019simplifying, GAT velivckovic2017graph, GraphSAGE hamilton2017inductive, and subsequent variants tailored to practical applications gilmer2017neural; liu2019hyperbolic; klicpera2018predict. Recently, researchers have started to explore the fundamentals of GNNs, such as their expressive power xu2018powerful; oono2019graph; loukas2019graph; dehmamy2019understanding, and to analyze their capacity and limitations. One frequently mentioned limitation is oversmoothing li2019deepgcns. In deep GCNs, oversmoothing means that after multi-layer graph convolution, the effect of Laplacian smoothing makes node representations more and more similar, eventually becoming indistinguishable. This issue was first mentioned in li2018deeper and has been widely discussed since then, for example in JKNet xu2018representation, DenseGCN li2019deepgcns, DropEdge rong2019dropedge, and PairNorm zhao2019pairnorm. However, these discussions mostly concern the powering effect $\hat{A}^{K}$ of the convolution operator (where $\hat{A}$ is the convolution operator and $K$ is the number of layers). This essentially corresponds to a GCN variant without activation functions (i.e., a simplified graph convolution, or SGC wu2019simplifying).
In this work, we instead argue that deep GCNs can learn anti-oversmoothing, while overfitting is the real cause of the performance drop. This paper focuses on graph node classification and starts from the perspective of graph-based optimization (minimizing $\mathcal{L}_0 + \mu\,\mathcal{L}_{reg}$) belkin2003laplacian; yang2016revisiting, where $\mathcal{L}_0$ is the (supervised) empirical loss and $\mathcal{L}_{reg}$ is an (unsupervised) graph regularizer that encodes smoothness over connected node pairs. We interpret a GCN as a two-step optimization problem: (i) forward propagation minimizes $\mathcal{L}_{reg}$ by viewing the weights $W$ as constants and the node features $H$ as parameters, and (ii) back propagation minimizes $\mathcal{L}_0$ by updating $W$. This work therefore derives a GCN as:
GCN — STEP1 (forward): each layer applies one gradient-descent step on $\mathcal{L}_{reg}$ with $W$ fixed; STEP2 (backward): $W$ is updated to minimize $\mathcal{L}_0$, where the features produced by STEP1 depend on $W$.
Similarly, an SGC is interpreted as:
SGC — STEP1: propagate $K$ times, $\bar{X} = \hat{A}^{K}X$, which performs $K$ gradient-descent steps on $\mathcal{L}_{reg}$ independently of $W$; STEP2: fit the single weight matrix $W$ by minimizing $\mathcal{L}_0$ on the fixed features $\bar{X}$.
From this formulation, we show that a deep SGC indeed suffers from oversmoothing, but deep GCNs will learn to prevent oversmoothing because (i) in a GCN, the features produced by STEP1 are conditioned on $W$, and (ii) minimizing $\mathcal{L}_0$ and minimizing $\mathcal{L}_{reg}$ are partly contradictory objectives. To prevent gradient vanishing/exploding, we add skip connections he2016deep to all deep architectures by default. An illustration of a deep GCN, a deep SGC, and a plain DNN trained directly on the features is shown in Fig. 1.
[Figure 1 (fig/GCNvsSGC.pdf): illustration of a deep GCN, a deep SGC, and a plain DNN.]
As mentioned above, training a deep GCN is a process of learning anti-oversmoothing, which is extremely slow in practice and sometimes does not converge. Based on our formulation, we further propose a mean-subtraction trick to accelerate the training of deep GCNs. Extensive experiments verify our theory and provide further insights into deep GCNs.
2 Background of Graph Transductive Learning
Graph representation learning aims at embedding the nodes into low-dimensional vectors while preserving both the graph topology and the node feature information. Given a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, let $\mathcal{V}$ be the set of $n$ nodes and let $\mathcal{C}$ be the set of possible classes. Assume that each node $v_i$ is associated with a class label $y_i \in \mathcal{C}$. The graph can be represented by an adjacency matrix $A \in \mathbb{R}^{n\times n}$ with $A_{ij} = 1$ when nodes $v_i$ and $v_j$ are connected and $A_{ij} = 0$ otherwise. The degree matrix $D$ is diagonal with $D_{ii} = \sum_j A_{ij}$. Let $X \in \mathbb{R}^{n\times d}$ denote the node feature matrix. Given a labelled set $\mathcal{V}_L \subset \mathcal{V}$, the goal of transductive learning on a graph is to predict labels for the remaining nodes $\mathcal{V} \setminus \mathcal{V}_L$. A well-studied family of solutions includes graph regularizers tenenbaum2000global; belkin2003laplacian; zhu2002learning; weston2012deep in the classification algorithm. Graph-convolution-based models kipf2016semi; defferrard2016convolutional; velivckovic2017graph; hamilton2017inductive are a powerful learning approach in this space.
2.1 Graph-based Regularization
There is a rather general class of embedding algorithms that include graph regularizers. They can be described as finding a mapping $f$ by minimizing the following two-part loss:

$\mathcal{L} \;=\; \mathcal{L}_0\big(f(X);\,\mathcal{V}_L\big) \;+\; \mu\,\mathcal{L}_{reg}\big(f(X);\,\mathcal{G}\big)$    (1)

where $H = f(X) \in \mathbb{R}^{n\times d'}$ is the low-dimensional representation of the nodes. The first term is the empirical risk on the labelled set $\mathcal{V}_L$. The second term is a graph regularizer over connected node pairs, which ensures that a trivial solution is not reached.
Measurements on graphs are usually invariant to node permutations. A canonical choice for the graph-based regularizer is the Dirichlet energy belkin2002laplacian,

$\mathcal{L}_{reg} \;=\; \mathrm{tr}\big(H^{\top}\tilde{L}H\big) \;=\; \tfrac{1}{2}\sum_{i,j} A_{ij}\,\Big\|\tfrac{h_i}{\sqrt{D_{ii}}} - \tfrac{h_j}{\sqrt{D_{jj}}}\Big\|^2$    (2)

where $\tilde{L} = I - D^{-1/2}AD^{-1/2}$ is the normalized Laplacian operator, which induces a semi-norm on $H$ penalizing changes between adjacent vertices. The same normalized formulation can be found in chen2019deep; ando2007learning; smola2003kernels; shaham2018spectralnet; belkin2003laplacian, and some related literature uses the unnormalized version zhu2002learning; von2007tutorial.
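For concreteness, the following NumPy sketch computes the Dirichlet energy $\mathrm{tr}(H^{\top}\tilde{L}H)$ on a toy graph. The function names are ours (not from any released code); the example only illustrates that a smooth signal ($h_i \propto \sqrt{D_{ii}}$) has (near-)zero energy while a random signal does not.

```python
import numpy as np

def normalized_laplacian(A):
    """L_tilde = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix A."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def dirichlet_energy(H, L):
    """Graph regularizer tr(H^T L_tilde H): small when connected nodes have similar features."""
    return np.trace(H.T @ L @ H)

# Toy 4-node path graph.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = normalized_laplacian(A)
H_random = np.random.randn(4, 2)
H_smooth = np.sqrt(A.sum(1))[:, None] * np.ones((4, 2))   # h_i proportional to sqrt(degree)
print(dirichlet_energy(H_random, L))   # noticeably positive
print(dirichlet_energy(H_smooth, L))   # ~0: the smoothest signal w.r.t. L_tilde
```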
2.2 Graph Convolutional Network
GCNs are derived from graph signal processing sandryhaila2013discrete; chen2015discrete; duvenaud2015convolutional. In the spectral domain, the operator $\tilde{L}$ is a real, symmetric, positive semidefinite matrix, and a graph convolution is parameterized by a learnable filter $g_\theta$ on its eigenvalue matrix. Kipf et al. kipf2016semi assumed the largest eigenvalue to be approximately $2$ (i.e., $\lambda_{max} \approx 2$) and simplified the filter with a truncated (first-order) Chebyshev expansion,

$g_\theta \star x \;\approx\; \theta\big(I + D^{-1/2}AD^{-1/2}\big)x$    (3)
A multi-layer graph convolutional network (GCN) follows the layer-wise propagation rule ($\sigma$ is an activation function, e.g., ReLU):

$H^{(l+1)} \;=\; \sigma\big(\hat{A}\,H^{(l)}W^{(l)}\big), \qquad \hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$    (4)

where $\tilde{A} = A + I$ with degree matrix $\tilde{D}$ is the renormalization trick, and $H^{(l)}$ and $W^{(l)}$ are the layer-wise feature and parameter matrices, respectively.
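A minimal PyTorch sketch of the propagation rule in Eq. (4) is shown below. The class and function names are ours; residual connections, dropout, and bias terms are deliberately omitted, so this is an illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def renormalized_adj(A):
    """Renormalization trick of Eq. (4): A_hat = D_tilde^{-1/2} (A + I) D_tilde^{-1/2}."""
    A_tilde = A + torch.eye(A.shape[0])
    d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

class GCNLayer(nn.Module):
    """One propagation step H^{l+1} = sigma(A_hat H^l W^l)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, A_hat, H):
        return torch.relu(A_hat @ self.W(H))

# Usage on a random symmetric adjacency matrix.
n = 5
A = (torch.rand(n, n) < 0.4).float()
A = torch.triu(A, diagonal=1); A = A + A.T          # symmetric, no self-loops
A_hat = renormalized_adj(A)
H = torch.randn(n, 8)
H = GCNLayer(8, 16)(A_hat, H)
H = GCNLayer(16, 7)(A_hat, H)                        # e.g., logits for 7 classes
```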
3 GCN as Two-Step Optimization
In this section, we re-interpret a GCN as a two-step optimization problem, where STEP1 minimizes $\mathcal{L}_{reg}$ by viewing the weights $W$ as constants and the features $H$ as parameters, while STEP2 minimizes $\mathcal{L}_0$ by updating $W$. Overall, the GCN architecture is interpreted as a layer-wise combination of an MLP architecture and a gradient-descent step for minimizing $\mathcal{L}_{reg}$. Meanwhile, the training of the parameters is entirely inherited from the MLP, which aims only at minimizing $\mathcal{L}_0$. Let us first discuss a gradient-descent algorithm for minimizing the trace form $\mathrm{tr}(X^{\top}\tilde{L}X)$.
3.1 Gradient Descent for Trace Optimization.
Problem Definition.
Given the Laplacian operator $\tilde{L}$, we consider minimizing the trace $\mathrm{tr}(X^{\top}\tilde{L}X)$ over the feature domain $X \in \mathbb{R}^{n\times d}$, where $d$ is the input dimension. To prevent the trivial solution $X = 0$, we impose an energy constraint on $X$, i.e., $\|X\|_F = c$. The trace optimization problem is:

$\min_{X}\ \mathrm{tr}\big(X^{\top}\tilde{L}X\big) \quad \text{s.t.}\ \|X\|_F = c$    (5)

where $\|\cdot\|_F$ denotes the Frobenius norm of $X$. To solve this, we equivalently transform the optimization problem into the Rayleigh-quotient form $R(X)$, which is

$R(X) \;=\; \frac{\mathrm{tr}\big(X^{\top}\tilde{L}X\big)}{\mathrm{tr}\big(X^{\top}X\big)}$    (6)

It is obvious that $R(X)$ is scaling invariant in $X$, i.e., $R(aX) = R(X)$ for any $a \neq 0$.
One-step Improvement.
Suppose the initial guess is $X^{(0)}$ with $\|X^{(0)}\|_F = c$. One step of trace optimization aims at finding a better guess $X^{(1)}$ that satisfies $\|X^{(1)}\|_F = c$ and $R(X^{(1)}) \le R(X^{(0)})$. Our strategy is to first view the problem as unconstrained optimization of $R(X)$ and update the guess to an intermediate point $X'$ by gradient descent. We then rescale $X'$ to reach the improved guess $X^{(1)}$, which meets the norm constraint.
Given the initial guess $X^{(0)}$, we move against the derivative of $R$ with learning rate $\eta$ and reach an intermediate solution $X'$ in the unconstrained space:

$X' \;=\; X^{(0)} - \eta\,\frac{\partial R(X)}{\partial X}\Big|_{X = X^{(0)}}$    (7)
$\phantom{X'} \;=\; X^{(0)} - \frac{2\eta}{\mathrm{tr}\big(X^{(0)\top}X^{(0)}\big)}\Big(\tilde{L}X^{(0)} - R\big(X^{(0)}\big)\,X^{(0)}\Big)$    (8)

Immediately, we get $R(X') \le R(X^{(0)})$ (details in Appendix B). Then, we rescale $X'$ by a constant $c' = c/\|X'\|_F$ to reach the improved guess $X^{(1)} = c'X'$, which naturally satisfies $R(X^{(1)}) = R(X') \le R(X^{(0)})$ and meets the norm constraint $\|X^{(1)}\|_F = c$. Therefore, the improved guess has the following form,

$X^{(1)} \;=\; c'\Big(X^{(0)} - \frac{2\eta}{\mathrm{tr}\big(X^{(0)\top}X^{(0)}\big)}\big(\tilde{L}X^{(0)} - R(X^{(0)})\,X^{(0)}\big)\Big)$    (9)
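The update above is simple to reproduce numerically. The sketch below (our own code, with the step-size variable eta_tilde standing for $2\eta/\mathrm{tr}(X^{\top}X)$) runs a few gradient-plus-rescale steps on a random graph and prints the Rayleigh quotient, which typically shrinks toward the small end of the spectrum of $\tilde{L}$.

```python
import numpy as np

def rayleigh(X, L):
    return np.trace(X.T @ L @ X) / np.trace(X.T @ X)

def one_step_trace_opt(X, L, eta_tilde):
    """One gradient step on R(X) (Eqs. 7-8) followed by rescaling (Eq. 9).
    eta_tilde stands for 2*eta / tr(X^T X); 1/(2 - R(X)) is the choice discussed in Sec. 3.3."""
    R = rayleigh(X, L)
    X_prime = X - eta_tilde * (L @ X - R * X)          # unconstrained gradient step
    c = np.linalg.norm(X) / np.linalg.norm(X_prime)    # rescale to restore ||X||_F
    return c * X_prime

n, d = 30, 4
A = np.triu((np.random.rand(n, n) < 0.15).astype(float), 1); A = A + A.T
deg = A.sum(1); deg[deg == 0] = 1.0
L = np.eye(n) - A / np.sqrt(deg)[:, None] / np.sqrt(deg)[None, :]
X = np.random.randn(n, d)
for t in range(5):
    X = one_step_trace_opt(X, L, eta_tilde=1.0 / (2.0 - rayleigh(X, L)))
    print(t, rayleigh(X, L))                            # typically non-increasing
```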
Note that if we run the trace-optimization step enough times, the solution eventually becomes proportional to the eigenvector associated with the smallest eigenvalue of $\tilde{L}$ (equivalently, the dominant eigenvector of the convolution operator), which causes oversmoothing.
3.2 Layer-wise Propagation and Optimization
We introduce the trace-optimization solution into the layer-wise propagation of an MLP. Given the node set $\mathcal{V}$, features $X$, and a labelled set $\mathcal{V}_L$, a label mapping $f: X \mapsto Y$ is usually a deep neural network, tailored to the application at hand. For example, $f$ could be a convolutional neural network (CNN) for image recognition or a recurrent neural network (RNN) for language processing. In this scenario, we first consider a simple multi-layer perceptron (MLP). The forward propagation rule of an MLP is given by

$H^{(l+1)} \;=\; \sigma\big(H^{(l)}W^{(l)}\big)$    (10)

where $W^{(l)}$ and $H^{(l)}$ are the layer-wise parameters and inputs, with $H^{(0)} = X$.
STEP1: minimizing $\mathcal{L}_{reg}$ in Forward Propagation.
Let us fix the parameters $W^{(l)}$ and consider $H^{(l)}$, i.e., the output of the $l$-th layer, as an initial guess of the trace optimization problem. We know from Sec. 3.1 that one step of gradient descent, Eqn. (9), finds an improved guess for minimizing $\mathcal{L}_{reg}$; with the learning rate specified in Sec. 3.3, this improved guess takes the form

$H'^{(l)} \;=\; c_1\big(I + D^{-1/2}AD^{-1/2}\big)H^{(l)}$    (11)

where $c_1$ is a constant scalar. We plug this new value into Eqn. (10) and immediately reach the same convolutional propagation rule as Kipf et al. kipf2016semi (before applying the renormalization trick),

$H^{(l+1)} \;=\; \sigma\Big(\big(I + D^{-1/2}AD^{-1/2}\big)H^{(l)}W^{(l)}\Big)$    (12)

where the constant scalar in Eqn. (11) is absorbed into the parameter matrix $W^{(l)}$. Therefore, GCN forward propagation essentially applies STEP1 layer-wise inside the forward propagation of an MLP, which can be written as a composition of such mappings on the initial features $X$.
STEP2: minimizing $\mathcal{L}_0$ in Back Propagation.
After forward propagation, the cross-entropy loss $\mathcal{L}_0$ over the labelled set $\mathcal{V}_L$ is calculated. In this procedure, we conversely view the features $H$ as constants and the weights $W$ as the parameters of the MLP, and run the standard back-propagation algorithm.
3.3 GCN: combining STEP1 and STEP2
In essence, STEP1 defines a combined architecture in which the layer-wise propagation is adjusted by an additional step of trace optimization. In STEP2, under that architecture, the optimal $W$ is learned and a low-dimensional representation is reached that accounts for $\mathcal{L}_0$ explicitly and $\mathcal{L}_{reg}$ implicitly, after standard loss back-propagation. We express the GCN as a two-step optimization,

STEP1 (forward, per layer): $H^{(l+1)} = \sigma\big(\hat{A}H^{(l)}W^{(l)}\big)$, one gradient step on $\mathcal{L}_{reg}$ over the features with $W$ fixed;
STEP2 (backward): update $W$ to minimize $\mathcal{L}_0$, with the STEP1 features treated as functions of $W$.    (13)

In this section, the learning rate is specially chosen as $\eta = \frac{\mathrm{tr}(X^{\top}X)}{2\,(2 - R(X))}$; it satisfies $\eta > 0$ since $\tilde{L}$ is semi-definite and $R(X)$ is smaller than the largest eigenvalue of $\tilde{L}$, which is smaller than $2$. In the experiment section, we reveal that $\eta$ is related to the weight of neighbor-information averaging. We further test different weights and provide more insights on graph convolution operators. In the following sections, we use $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ to denote the convolutional operator and $\hat{A}_{rw} = \tilde{D}^{-1}\tilde{A}$ for the random-walk form.
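As a quick numerical sanity check of the derivation in Secs. 3.2-3.3 (under our reconstruction of the step size), the sketch below verifies that one gradient step on $R(X)$ with step size $1/(2 - R(X))$ equals applying Kipf's pre-renormalization operator $(I + D^{-1/2}AD^{-1/2})$ up to a global scalar. All variable names are ours.

```python
import numpy as np

n = 20
A = np.triu((np.random.rand(n, n) < 0.2).astype(float), 1); A = A + A.T
deg = A.sum(1); deg[deg == 0] = 1.0
A_sym = A / np.sqrt(deg)[:, None] / np.sqrt(deg)[None, :]   # D^{-1/2} A D^{-1/2}
L = np.eye(n) - A_sym                                        # normalized Laplacian L_tilde

X = np.random.randn(n, 3)
R = np.trace(X.T @ L @ X) / np.trace(X.T @ X)
eta_tilde = 1.0 / (2.0 - R)                                  # the specially chosen step size
X_grad = X - eta_tilde * (L @ X - R * X)                     # STEP1 as a gradient step
X_conv = (np.eye(n) + A_sym) @ X                             # pre-renormalization GCN operator

print(np.allclose((2.0 - R) * X_grad, X_conv))               # True: equal up to a global scalar
```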
4 The Over-smoothing Issue
The recent successes in applying GNNs are largely limited to shallow architectures (e.g., 2-4 layers); model performance decreases when more intermediate layers are added. As summarized in zhao2019pairnorm, there are three possible contributing factors: (i) overfitting due to the increasing number of parameters (one weight matrix per layer); (ii) gradient vanishing/exploding; (iii) oversmoothing due to Laplacian smoothing. The first two are common to all deep neural networks, so oversmoothing is the focus of this work. We show that deep GCNs can learn anti-oversmoothing by nature, and that overfitting is the major cause of the performance drop.
In this section, we first recall the commonly discussed oversmoothing problem (in SGC). Based on the re-formulation in Sec. 3.3, we then show how deep GCNs have the ability to learn anti-oversmoothing. Further, we propose an easy but effective mean-subtraction trick to speed up anti-oversmoothing, which accelerates the convergence in training deep GCNs.
4.1 Over-smoothing in SGC.
Oversmoothing means that node representations become similar and finally indistinguishable after multi-layer graph convolution. Starting from random-walk theory, the analysis of oversmoothing is usually done on a connected, undirected, non-bipartite graph. The issue is discussed in li2018deeper; xu2018representation; li2019deepgcns; rong2019dropedge; zhao2019pairnorm, mainly for the $K$-layer linear SGC wu2019simplifying.
SGC was proposed in wu2019simplifying with the hypothesis that the non-linear activation is not critical and that the majority of the benefit arises from the local averaging $\hat{A}$. The authors remove the activation function and propose a linear "$K$-layer" model,

$\hat{Y} \;=\; \mathrm{softmax}\big(\hat{A}^{K}XW\big)$    (14)

where the layer-wise parameter matrices $W^{(l)}$ have collapsed into a single matrix $W$. This model explicitly disentangles STEP1 from STEP2. We similarly formulate the SGC model as a two-step optimization,

STEP1: $\bar{X} = \hat{A}^{K}X$, i.e., $K$ gradient steps on $\mathcal{L}_{reg}$ that are independent of $W$;
STEP2: fit $W$ by minimizing $\mathcal{L}_0$ on the fixed features $\bar{X}$.    (15)
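The following PyTorch sketch (our own, on synthetic placeholders rather than the paper's datasets) makes the disentanglement explicit: STEP1 is a parameter-free preprocessing pass, and STEP2 only ever sees the frozen features $\bar{X}$.

```python
import torch
import torch.nn as nn

def sgc_features(A_hat, X, K):
    """STEP1 of the SGC: K parameter-free propagation steps, X_bar = A_hat^K X."""
    for _ in range(K):
        X = A_hat @ X
    return X

# STEP2: an ordinary linear softmax classifier fitted on the fixed, pre-propagated features.
n, d, num_classes, K = 100, 16, 3, 4
A_hat = torch.eye(n)                       # placeholder: plug in the renormalized adjacency
X = torch.randn(n, d)
y = torch.randint(0, num_classes, (n,))
X_bar = sgc_features(A_hat, X, K)          # never changes during training

W = nn.Linear(d, num_classes)
opt = torch.optim.Adam(W.parameters(), lr=0.01)
for _ in range(100):
    loss = nn.functional.cross_entropy(W(X_bar), y)   # only the supervised loss L0
    opt.zero_grad(); loss.backward(); opt.step()
```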
Theorem 1.
Given any random signal $x \in \mathbb{R}^{n}$ and a symmetric matrix $M$ with non-negative eigenvalues, the following property holds almost everywhere on $x$: $\lim_{k\to\infty} \frac{M^{k}x}{\|M^{k}x\|} = \pm v_{1}$, where $v_{1}$ is the eigenvector associated with the largest eigenvalue of $M$.
For the two widely used convolution operators $\hat{A}$ and $\hat{A}_{rw}$, both have the same dominant eigenvalue $1$, with eigenvectors $\tilde{D}^{1/2}\mathbf{1}$ and $\mathbf{1}$, respectively. In the two-step optimization view, SGC essentially conducts the gradient-descent algorithm $K$ times in STEP1. According to Theorem 1 (see proofs in Appendix D), if $K$ goes to infinity, every output feature channel converges to a multiple of $\tilde{D}^{1/2}\mathbf{1}$ or $\mathbf{1}$, which means oversmoothing. In STEP2, the SGC model then seeks to minimize $\mathcal{L}_0$ on the basis of these oversmoothed features. The independence of STEP1 from STEP2 accounts for the performance drop in deep SGC.
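Theorem 1 is easy to observe numerically. The short sketch below (ours; it assumes the random graph is connected, which holds with high probability at this edge density) applies the random-walk operator repeatedly to a random signal and prints its cosine similarity to the constant vector, which approaches 1.

```python
import numpy as np

np.random.seed(0)
n = 50
A = np.triu((np.random.rand(n, n) < 0.15).astype(float), 1); A = A + A.T
A_tilde = A + np.eye(n)                               # self-loops (renormalization trick)
A_rw = A_tilde / A_tilde.sum(1, keepdims=True)        # random-walk operator D_tilde^{-1} A_tilde

x = np.random.randn(n)                                # a random feature channel
ones = np.ones(n) / np.sqrt(n)
for k in [1, 2, 4, 8, 16, 32, 64]:
    xk = np.linalg.matrix_power(A_rw, k) @ x
    xk = xk / np.linalg.norm(xk)
    print(k, abs(xk @ ones))   # cosine similarity to the constant vector approaches 1
```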
4.2 Anti-oversmoothing in GCN.
On the contrary, in a GCN the result of STEP1 depends on $W$, i.e., also on STEP2. In fact, right after STEP1 the node representations in a GCN are also smoothed to some extent. However, during STEP2, the GCN learns to update the layer-wise $W^{(l)}$ and make the node features separable so that $\mathcal{L}_0$ is minimized; in doing so, the effect of STEP1 (minimizing $\mathcal{L}_{reg}$, i.e., making node features less separable) is mitigated, and $\mathcal{L}_{reg}$ actually increases implicitly. In essence, the dependency of $H$ on $W$ enables GCNs to perform anti-oversmoothing during STEP2.
We demonstrate this on the Karate club dataset: the graph has 34 vertices in 4 classes (the same labeling as kipf2016semi; perozzi2014deepwalk) and 78 edges. A 32-layer GCN (deep enough for this demo dataset), with 16 hidden units per layer, is considered. The model is trained on an identity feature matrix with basic residual connections he2016deep. The training set consists of two labeled examples per class. After 1000 epochs, the model achieves high accuracy on the test samples. We present the feature-wise smoothing by layer in Fig. 2.a and the node-wise smoothing by layer in Fig. 2.b; the y-axis scores of these two figures are cosine similarities. We also calculate and present $\mathcal{L}_0$ as well as $\mathcal{L}_{reg}$ for each training epoch in Fig. 2.c. More details of the demo are given in Appendix F.
[Figure 2 (fig/karate.pdf): (a) feature-wise smoothing by layer, (b) node-wise smoothing by layer, (c) $\mathcal{L}_0$ and $\mathcal{L}_{reg}$ per training epoch on the Karate club graph.]
From the demonstration, we observe that without training (the blue curves in Fig. 2.a and Fig. 2.b, and the beginning of Fig. 2.c), the issue of oversmoothing does exist for deep GCNs, because forward propagation mixes features layer by layer. However, this issue is automatically addressed during training: with more training epochs, feature-wise and node-wise smoothing are gradually mitigated. The effect is even clearer in Fig. 2.c, where $\mathcal{L}_{reg}$ actually increases during training (a small $\mathcal{L}_{reg}$ indicates oversmoothing). The gradual increase in $\mathcal{L}_{reg}$ demonstrates that GCNs have the ability to learn anti-oversmoothing by nature. The next question is: what is the real cause of the performance drop in deep GCNs? Our answer is overfitting. We give empirical support in the experimental section.
4.3 Mean-subtraction: an Accelerator
Although deep GCNs can learn anti-oversmoothing naturally, another practical problem appears: the convergence of deep GCN training is extremely slow (and sometimes training does not converge). This issue has not been explored extensively in the literature. In this work, we present a mean-subtraction trick to accelerate the training of deep GCNs, which theoretically magnifies the effect of the Fiedler vector. PairNorm zhao2019pairnorm also includes a mean-subtraction step; however, there its purpose is to simplify the derivation. This section provides more insight and motivation for using mean-subtraction.
There are primarily two reasons to use mean-subtraction: (i) deep neural network classifiers are discriminators that draw boundaries between classes. The mean feature of the entire dataset (a DC signal) therefore does not help the classification, whereas the components away from the center (the AC signal) matter; (ii) layer-wise mean-subtraction eliminates the dominant eigen-component ($\tilde{D}^{1/2}\mathbf{1}$ or $\mathbf{1}$) and thereby magnifies the Fiedler vector (the eigenvector associated with the second smallest eigenvalue of $\tilde{L}$), which reveals important community information and graph conductance chen2017supervised. This helps to set an initial graph partition and speeds up model training (STEP2).
We start with one of the most popular convolution operators, the random-walk form $\hat{A}_{rw}$, whose dominant eigenvector is $\mathbf{1}$. Given any non-zero signal $x$, mean-subtraction gives

$\tilde{x} \;=\; x - \bar{x}\,\mathbf{1} \;=\; x - \frac{\mathbf{1}^{\top}x}{n}\,\mathbf{1}$    (16)

where $\bar{x} = \frac{1}{n}\sum_{i}x_{i}$. Eqn. (16) reveals that mean-subtraction removes the component aligned with the $\mathbf{1}$-space. This is exactly a step of numerical approximation of the Fiedler vector, which sets an initial graph partition (demonstration in Appendix F) and makes the features separable. For the other operator, the formulation can be adjusted by a degree-dependent factor (refer to Appendix E).
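As a minimal sketch (our own code, not the paper's implementation), layer-wise mean-subtraction can be dropped into the forward pass of a deep GCN as follows; residual connections and the output classifier are omitted.

```python
import torch

def mean_subtract(H):
    """Remove the graph-wide mean feature, i.e., the component along the all-ones vector."""
    return H - H.mean(dim=0, keepdim=True)

def deep_gcn_forward(A_hat, X, weights):
    """Forward pass with layer-wise mean-subtraction before every convolution."""
    H = X
    for W in weights:
        H = mean_subtract(H)               # suppresses the dominant (DC) eigen-component
        H = torch.relu(A_hat @ H @ W)
    return H

# Usage with random weights on a placeholder operator.
n, d = 10, 8
A_hat = torch.eye(n)
weights = [torch.randn(d, d) * 0.1 for _ in range(32)]
out = deep_gcn_forward(A_hat, torch.randn(n, d), weights)
```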
5 Experiments
In this section, we present experimental evidence to answer three questions: (i) Is oversmoothing really the issue in deep GCNs, and why? (ii) How can the training of deep GCNs be stabilized and accelerated? (iii) Does the learning rate matter, and what happens when we change it? We also provide further insights and draw practical conclusions for the usage of GCN models and their variants.
The experiments evaluate deep GCNs on semi-supervised node classification tasks. All the deep models (with more than 3 hidden layers) are implemented with basic skip connections kipf2016semi; he2016deep. Since skip connections (also called residual connections) are necessary in deep architectures, we do not consider models using them as new models. Three benchmark citation networks (Cora, Citeseer, Pubmed) are considered. We follow the experimental settings of yang2016revisiting and show the basic dataset statistics in Table 1. All experiments are run 20 times, mainly on a Linux server with 64GB memory, 32 CPUs and a single GTX-2080 GPU.
Table 1: Dataset statistics.

| Dataset | Nodes | Edges | Features | Classes | Label rate |
|---|---|---|---|---|---|
| Cora | 2,708 | 5,429 | 1,433 | 7 | 0.052 |
| Citeseer | 3,327 | 4,732 | 3,703 | 6 | 0.036 |
| Pubmed | 19,717 | 44,338 | 500 | 3 | 0.003 |
5.1 Overfitting in Deep GCNs
The performance of GCNs is known to decrease with increasing number of layers, for which, a common explanation is “oversmoothing”. In Sec. 4, we contradict this thesis and conjecture instead that overfitting is the major reason for the drop of performance in deep GCNs; we show that deep GCNs actually learn anti-oversmoothing. In this section, we provide evidence to support our conjecture.
Performance vs Depth.
We first evaluate vanilla GCN models (with residual connections) of increasing depth on Cora, Citeseer and Pubmed. The training and test accuracies are reported in Fig. 3.
From Fig. 3, we see immediately that test accuracy drops in the beginning (1-4 layers) and then remains stable even as the model depth increases, which means that further increasing the number of hidden layers does not hurt model performance; thus, oversmoothing is not the reason. From 2 to 3 or 3 to 4 layers, we notice a large rise in training accuracy and simultaneously a large drop in training loss and in test accuracy, consistently across the three datasets. This is more consistent with overfitting.
[Figure 3 (fig/threesets.pdf): training and test accuracy versus depth on Cora, Citeseer and Pubmed.]
Deep GCNs Learn Anti-oversmoothing.
Recall that in Sec. 4.2 we showed, from an optimization perspective, that the dependency of the features on $W$ allows the network to learn anti-oversmoothing. To verify this, we compare SGC and GCN on Cora and Pubmed at various depths. To be clear, SGC is a linear model; its depth refers to the number of graph convolutions $K$. Model performance in both training and test is shown in Fig. 4.
It is interesting that the accuracy of SGC decreases rapidly oono2019graph with more graph convolutions, in both training and test. This is a strong indicator of oversmoothing, because the node features converge to the same stationary point due to the effect of STEP1 (specified in Theorem 1). The GCN model performs worse than SGC soon after 2 layers because of overfitting, but it stabilizes at a high accuracy even as the model becomes very deep, which again verifies that GCNs naturally have the power of anti-oversmoothing.
[Figure 4 (fig/GCNvsSGC2.pdf): training and test accuracy of GCN versus SGC at various depths on Cora and Pubmed.]
5.2 Mean-Subtraction
To facilitate the training of deep GCNs, we proposed mean-subtraction in Sec. 4.3. In this section, we evaluate the efficacy of the mean-subtraction trick and compare it with vanilla GCN kipf2016semi, PairNorm zhao2019pairnorm and the widely used BatchNorm ioffe2015batch from the deep learning literature. The four models share the same settings, such as the number of layers (64), learning rate (0.01), and hidden units (16). Mean-subtraction subtracts the mean feature value before each convolution layer (PairNorm further re-scales the features by their variance); neither introduces additional parameters. BatchNorm adds extra parameters for each layer and learns to whiten the input of each layer. The experiment is conducted on Cora.
[Figure 5 (fig/mean.pdf): training and test curves for GCN, GCN+BatchNorm, GCN+mean-subtraction, and GCN+PairNorm on Cora.]
Fig. 5 reports the training and test curves for the four model variants. GCN and GCN+BatchNorm perform similarly, which means BatchNorm does not help substantially in training deep GCNs (at least on Cora). GCN+mean-subtraction and GCN+PairNorm give fast and stable training/test convergence. However, the training curve suggests that the PairNorm trick suffers considerably from overfitting, leading to a drop in test accuracy. In sum, mean-subtraction not only speeds up model convergence but also retains the same expressive power; it is an ideal trick for training deep GCNs.
5.3 Performance vs Learning Rate
In Sec. 3, the learning rate $\eta$ is specially chosen. However, a different learning rate leads to a different weight of neighbor-information aggregation (we show that the weight $\lambda(\eta)$ is a monotonically increasing function in Appendix C). There have also been efforts to aggregate neighbor information in different ways velivckovic2017graph; hamilton2017inductive; rong2019dropedge; chen2018fastgcn. In this section, we consider operators of the form $I + \lambda A$ with $\lambda \ge 0$ and exploit the resulting group of convolution operators through their normalized versions. A GCN with the normalized $I + \lambda A$ is named a $\lambda$-GCN. We evaluate this operator group on Cora and list the experimental results in Table 2.
Table 2: Accuracy (%) of $\lambda$-GCN on Cora.

| | | λ=0 | 0.1 | 0.2 | 0.5 | 1.0 | 2 | 5 | 10 | 20 | 50 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2-layer | training | 92.66 | 95.67 | 96.32 | 96.05 | 95.33 | 94.54 | 93.44 | 93.30 | 92.82 | 92.86 | 92.98 |
| | test | 52.66 | 71.62 | 75.64 | 78.26 | 79.72 | 79.06 | 78.34 | 78.14 | 77.93 | 77.67 | 77.58 |
| 32-layer | training | 95.02 | 99.49 | 99.58 | 99.35 | 98.69 | 98.10 | 98.84 | 98.83 | 98.81 | 98.76 | 98.83 |
| | test | 39.93 | 72.53 | 73.59 | 73.65 | 74.03 | 75.11 | 74.16 | 75.08 | 75.49 | 74.64 | 74.74 |
We conclude that when $\lambda$ is small (i.e., $\eta$ is small), meaning the gradient of $\mathcal{L}_{reg}$ contributes little to the end result, a $\lambda$-GCN behaves more like a plain DNN. As $\lambda$ increases, model performance first increases significantly. Once $\lambda$ exceeds some threshold, the accuracy saturates and remains high (or decreases slightly in shallow models, e.g., the 2-layer one) even as $\lambda$ increases substantially. For the widely used shallow GCNs, the common choice of weight $\lambda = 1$, which corresponds to the learning rate chosen in Sec. 3, is large enough to include the gradient-descent effect and small enough to avoid a drop in accuracy. Finding the best neighbor-averaging weight requires further inspection in future work.
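A small sketch of the operator family is given below. It assumes the family described above, i.e., a degree-normalized version of $I + \lambda A$; the exact normalization used in the experiments may differ slightly, and the function name is ours.

```python
import numpy as np

def lambda_operator(A, lam):
    """Normalized version of I + lam * A; lam = 1 recovers the renormalization trick A_hat,
    lam = 0 reduces to the identity (a plain MLP)."""
    A_lam = np.eye(A.shape[0]) + lam * A
    d_inv_sqrt = 1.0 / np.sqrt(A_lam.sum(1))
    return d_inv_sqrt[:, None] * A_lam * d_inv_sqrt[None, :]

# Larger lam shifts weight from the node itself to its neighbors.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
for lam in [0.0, 0.5, 1.0, 10.0]:
    print(lam)
    print(np.round(lambda_operator(A, lam), 2))
```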
6 Conclusion
We reformulate GCNs from an optimization perspective by plugging the gradient of a graph regularizer into a standard MLP. From this formulation, we revisit the commonly discussed "oversmoothing issue" in deep GCNs and provide a new understanding: deep GCNs have the power to learn anti-oversmoothing by nature, while overfitting is the real cause of the performance drop when the model goes deep. We further propose a cheap but effective mean-subtraction trick to accelerate the training of deep GCNs. Extensive experiments verify our theory and provide more practical insights.
References
- (1) Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in neural information processing systems, pages 4800–4810, 2018.
- (2) Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. arXiv preprint arXiv:1809.10341, 2018.
- (3) Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
- (4) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pages 3844–3852, 2016.
- (5) Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr, Christopher Fifty, Tao Yu, and Kilian Q Weinberger. Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153, 2019.
- (6) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
- (7) Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in neural information processing systems, pages 1024–1034, 2017.
- (8) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, pages 1263–1272. JMLR.org, 2017.
- (9) Qi Liu, Maximilian Nickel, and Douwe Kiela. Hyperbolic graph neural networks. In Advances in Neural Information Processing Systems, pages 8228–8239, 2019.
- (10) Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997, 2018.
- (11) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
- (12) Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. arXiv preprint cs.LG/1905.10947, 2019.
- (13) Andreas Loukas. What graph neural networks cannot learn: depth vs width. arXiv preprint arXiv:1907.03199, 2019.
- (14) Nima Dehmamy, Albert-László Barabási, and Rose Yu. Understanding the representation power of graph neural networks in learning graph topology. In Advances in Neural Information Processing Systems, pages 15387–15397, 2019.
- (15) Guohao Li, Matthias Muller, Ali Thabet, and Bernard Ghanem. Deepgcns: Can gcns go as deep as cnns? In Proceedings of the IEEE International Conference on Computer Vision, pages 9267–9276, 2019.
- (16) Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- (17) Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536, 2018.
- (18) Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. Dropedge: Towards deep graph convolutional networks on node classification. In International Conference on Learning Representations, 2019.
- (19) Lingxiao Zhao and Leman Akoglu. Pairnorm: Tackling oversmoothing in gnns. arXiv preprint arXiv:1909.12223, 2019.
- (20) Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003.
- (21) Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861, 2016.
- (22) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- (23) Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319–2323, 2000.
- (24) Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Carnegie Mellon University, 2002.
- (25) Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural networks: Tricks of the trade, pages 639–655. Springer, 2012.
- (26) Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in neural information processing systems, pages 585–591, 2002.
- (27) Yu Chen, Lingfei Wu, and Mohammed J Zaki. Deep iterative and adaptive learning for graph neural networks. arXiv preprint arXiv:1912.07832, 2019.
- (28) Rie K Ando and Tong Zhang. Learning on graph with laplacian regularization. In Advances in neural information processing systems, pages 25–32, 2007.
- (29) Alexander J Smola and Risi Kondor. Kernels and regularization on graphs. In Learning theory and kernel machines, pages 144–158. Springer, 2003.
- (30) Uri Shaham, Kelly Stanton, Henry Li, Boaz Nadler, Ronen Basri, and Yuval Kluger. Spectralnet: Spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587, 2018.
- (31) Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.
- (32) Aliaksei Sandryhaila and José MF Moura. Discrete signal processing on graphs. IEEE transactions on signal processing, 61(7):1644–1656, 2013.
- (33) Siheng Chen, Rohan Varma, Aliaksei Sandryhaila, and Jelena Kovačević. Discrete signal processing on graphs: Sampling theory. IEEE transactions on signal processing, 63(24):6510–6523, 2015.
- (34) David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
- (35) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710, 2014.
- (36) Zhengdao Chen, Xiang Li, and Joan Bruna. Supervised community detection with line graph neural networks. arXiv preprint arXiv:1705.08415, 2017.
- (37) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- (38) Jie Chen, Tengfei Ma, and Cao Xiao. Fastgcn: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247, 2018.
Appendix A $\mathcal{L}_{reg}$ and Spectral Clustering
Graph Regularizer $\mathcal{L}_{reg}$.
$\mathcal{L}_{reg}$ is commonly formulated as the Dirichlet energy, $\mathcal{L}_{reg} = \mathrm{tr}\big(H^{\top}\tilde{L}H\big)$, where $H = f(X)$ maps the input features $X$ to the low-dimensional representation $H$. To minimize $\mathcal{L}_{reg}$, this paper adds a constraint on the magnitude of $H$, i.e., $\|H\|_F = c$, which gives

$\min_{H}\ \mathrm{tr}\big(H^{\top}\tilde{L}H\big) \quad \text{s.t.}\ \|H\|_F = c$    (17)
Spectral Clustering.
Given a graph with binary adjacency matrix $A$, a partition of the node set $\mathcal{V}$ into $k$ sets is written as $\{\mathcal{A}_1, \dots, \mathcal{A}_k\}$ in graph theory. For normalized spectral clustering, the indicator vectors $h_j = (h_{1,j}, \dots, h_{n,j})^{\top}$ are written as

$h_{i,j} \;=\; \begin{cases} 1/\sqrt{\mathrm{vol}(\mathcal{A}_j)} & \text{if } v_i \in \mathcal{A}_j \\ 0 & \text{otherwise} \end{cases}$    (18)

where $h_{i,j}$ represents the affiliation of node $v_i$ with class set $\mathcal{A}_j$ and $\mathrm{vol}(\mathcal{A}_j) = \sum_{v_i \in \mathcal{A}_j} D_{ii}$ is the volume. $H$ is the matrix containing these indicator vectors as columns. Each row of $H$ has only one non-zero entry, implying $H^{\top}DH = I$. Let us revisit the normalized cut of a graph for a partition $\{\mathcal{A}_1, \dots, \mathcal{A}_k\}$,

$\mathrm{Ncut}(\mathcal{A}_1,\dots,\mathcal{A}_k) \;=\; \sum_{j=1}^{k}\frac{\mathrm{cut}(\mathcal{A}_j,\bar{\mathcal{A}}_j)}{\mathrm{vol}(\mathcal{A}_j)} \;=\; \mathrm{tr}\big(H^{\top}LH\big)$    (19)

where $L = D - A$ is the unnormalized Laplacian and $H$ satisfies $H^{\top}DH = I$. When the discreteness condition is relaxed and $H$ is substituted by $T = D^{1/2}H$, the normalized graph-cut problem (normalized spectral clustering) is relaxed into

$\min_{T \in \mathbb{R}^{n\times k}}\ \mathrm{tr}\big(T^{\top}\tilde{L}T\big) \quad \text{s.t.}\ T^{\top}T = I$    (20)

This is a standard trace minimization problem, solved by the matrix $T$ whose columns are the eigenvectors of $\tilde{L}$ associated with its $k$ smallest eigenvalues. Compared to Eqn. (17), Eqn. (20) has a stronger (orthonormality) constraint, which yields an optimal solution that is irrelevant to the input feature matrix $X$. In contrast, Eqn. (17) only constrains the magnitude of $H$, which balances this trade-off and gives a solution induced by both the eigenvectors of $\tilde{L}$ and the original features $X$.
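For illustration, the sketch below (our own helper, not part of the paper) computes the relaxed solution of Eq. (20) on a tiny two-community graph; the second column of the embedding (the Fiedler direction) already separates the communities.

```python
import numpy as np

def relaxed_ncut_embedding(A, k):
    """Solution of the relaxed problem (Eq. 20): the k eigenvectors of the
    normalized Laplacian associated with the smallest eigenvalues."""
    deg = A.sum(1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return eigvecs[:, :k]                  # rows can then be clustered, e.g., with k-means

# Two obvious communities: a pair of triangles joined by a single edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
T = relaxed_ncut_embedding(A, 2)
print(np.sign(T[:, 1]))   # the Fiedler direction separates {0, 1, 2} from {3, 4, 5}
```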
Appendix B Rayleigh Quotient
Rayleigh Quotient.
The Rayleigh quotient of a vector $x$ with respect to $\tilde{L}$ is the scalar

$R(x) \;=\; \frac{x^{\top}\tilde{L}x}{x^{\top}x}$    (21)

which is invariant to the scaling of $x$: for any $a \neq 0$, we have $R(ax) = R(x)$. When we view $R(x)$ as a function of the $n$-dimensional variable $x$, it has stationary points at $x = v_i$, where $v_i$ is an eigenvector of $\tilde{L}$. Assume $\tilde{L}v_i = \lambda_i v_i$; then the stationary value at $v_i$ is exactly the eigenvalue $\lambda_i$,

$R(v_i) \;=\; \frac{v_i^{\top}\tilde{L}v_i}{v_i^{\top}v_i} \;=\; \lambda_i$    (22)
When $x$ is not an eigenvector of $\tilde{L}$, the partial derivative of $R(x)$ with respect to the coordinate $x_k$ is calculated as

$\frac{\partial R(x)}{\partial x_k} \;=\; \frac{2}{x^{\top}x}\Big((\tilde{L}x)_k - R(x)\,x_k\Big)$    (23)

Thus, the derivative of $R(x)$ with respect to $x$ is collected as

$\nabla_x R(x) \;=\; \frac{2}{x^{\top}x}\big(\tilde{L}x - R(x)\,x\big)$    (24)
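The closed form in Eq. (24) is easy to verify numerically; the sketch below (our own, using a random symmetric PSD matrix as a stand-in for $\tilde{L}$) compares it against a central finite-difference estimate.

```python
import numpy as np

def R(x, L):
    return (x @ L @ x) / (x @ x)

def grad_R(x, L):
    """Closed form of Eq. (24): dR/dx = 2/(x^T x) * (L x - R(x) x)."""
    return 2.0 / (x @ x) * (L @ x - R(x, L) * x)

# Finite-difference check of Eq. (24).
np.random.seed(0)
n = 8
M = np.random.randn(n, n)
L = M @ M.T / n                                     # symmetric PSD stand-in for L_tilde
x = np.random.randn(n)
num = np.array([(R(x + 1e-6 * e, L) - R(x - 1e-6 * e, L)) / 2e-6 for e in np.eye(n)])
print(np.allclose(num, grad_R(x, L), atol=1e-5))    # True
```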
Minimizing R(x).
Suppose $\tilde{L}$ is the normalized Laplacian matrix. Let us first consider minimizing $R(x)$ without any constraints. Since $\tilde{L}$ is a symmetric real-valued matrix, it can be factorized by singular value decomposition (which coincides with the eigendecomposition for a symmetric positive semidefinite matrix),

$\tilde{L} \;=\; \sum_{i=1}^{r}\lambda_i\, v_i v_i^{\top}$    (25)

where $r$ is the rank of $\tilde{L}$ and $\lambda_1 \ge \dots \ge \lambda_r > 0$ are the eigenvalues. Any non-zero vector $x$ can be decomposed with respect to the eigenspace of $\tilde{L}$,

$x \;=\; \sum_{i=1}^{r}a_i v_i \;+\; x_{\perp}$    (26)

where the $a_i$ are coordinates and $x_{\perp}$ is the component orthogonal to the eigenspace spanned by $\{v_i\}$. Let us consider the component of $x$ within the eigenspace and discuss $x_{\perp}$ later. The Rayleigh quotient can then be calculated as

$R(x) \;=\; \frac{\sum_{i}\lambda_i a_i^{2}}{\sum_{i}a_i^{2}}$    (27)
Recall the derivative of $R(x)$ with respect to $x$ in Eqn. (24). Consider minimizing $R(x)$ by gradient descent with the learning rate $\eta = \frac{x^{\top}x}{2(2 - R(x))}$ (the same choice as in the main text, up to the scalar difference between the vector form $R(x)$ here and the trace form $R(X)$ there), so that

$\eta\,\nabla_x R(x) \;=\; \frac{1}{2 - R(x)}\big(\tilde{L}x - R(x)\,x\big)$    (28)

The initial $x$ is regarded as a starting point, and the next point $x'$ is given by gradient descent,

$x' \;=\; x - \eta\,\nabla_x R(x) \;=\; \frac{1}{2 - R(x)}\big(2I - \tilde{L}\big)x$    (29)

The new Rayleigh quotient value is

$R(x') \;=\; \frac{x'^{\top}\tilde{L}x'}{x'^{\top}x'} \;=\; \frac{x^{\top}(2I-\tilde{L})\,\tilde{L}\,(2I-\tilde{L})x}{x^{\top}(2I-\tilde{L})^{2}x}$    (30)

The eigen-properties of $2I - \tilde{L}$ can be derived from those of $\tilde{L}$: they share the same eigenvectors, and any eigenvalue $\lambda_i$ of $\tilde{L}$ adjusts to the eigenvalue $2 - \lambda_i$ of $2I - \tilde{L}$. Therefore, we can derive further,

$R(x') \;=\; \frac{\sum_{i}\lambda_i(2-\lambda_i)^{2}a_i^{2}}{\sum_{i}(2-\lambda_i)^{2}a_i^{2}}$    (31)
So far, to obtain the desired effect, a final check is needed: the Rayleigh quotient must indeed decrease after the gradient step. Since $0 \le \lambda_i \le 2$, the weight $(2-\lambda_i)^{2}$ is non-increasing in $\lambda_i$, which shifts mass toward the small eigenvalues and gives

$R(x') \;=\; \frac{\sum_{i}\lambda_i(2-\lambda_i)^{2}a_i^{2}}{\sum_{i}(2-\lambda_i)^{2}a_i^{2}} \;\le\; \frac{\sum_{i}\lambda_i a_i^{2}}{\sum_{i}a_i^{2}} \;=\; R(x)$    (32)

We also note the asymptotic property of $R$ under repeated gradient descent,

$\lim_{t\to\infty} R\big(x^{(t)}\big) \;=\; \lambda_{\min}$    (33)

where $x^{(t)}$ is the $t$-th point given by gradient descent and $\lambda_{\min}$ is the smallest eigenvalue of $\tilde{L}$ whose eigen-component is present in the initial $x$ (for almost every $x$, the smallest eigenvalue of $\tilde{L}$). This completes the proof of the well-definedness of gradient descent with this learning rate.
Remark 1.
In fact, as stated above, $R(x)$ is invariant to the scaling of $x$, so we can rescale $x$ to a fixed magnitude, i.e., enforce $\|x\| = c$ as a constraint during the gradient-descent iterations; all the properties and results above still hold.
Remark 2.
In the main text, instead of a vector $x$, we use a feature matrix $X$ and define the Rayleigh quotient as $R(X) = \frac{\mathrm{tr}(X^{\top}\tilde{L}X)}{\mathrm{tr}(X^{\top}X)}$. Different feature channels of $X$ can be viewed as independent vector signals, and the same gradient-descent analysis applies to each channel. Therefore, we obtain the detailed justification for the formulation in the main text, which takes the form

$X^{(1)} \;=\; c'\Big(X^{(0)} - \frac{2\eta}{\mathrm{tr}\big(X^{(0)\top}X^{(0)}\big)}\big(\tilde{L}X^{(0)} - R(X^{(0)})\,X^{(0)}\big)\Big)$    (34)
Appendix C Learning Rate and Neighbor Averaging Weight
We show the relation between the learning rate $\eta$ and the neighbor-averaging weight $\lambda$ in this section (the derivation follows the matrix notation of the main text). Writing $\tilde{\eta} = \frac{2\eta}{\mathrm{tr}(X^{\top}X)}$, one gradient step gives

$X' \;=\; X - \eta\,\nabla_X R(X) \;=\; \big(1 - \tilde{\eta}\,(1 - R(X))\big)\Big(I + \frac{\tilde{\eta}}{1 - \tilde{\eta}\,(1 - R(X))}\,D^{-1/2}AD^{-1/2}\Big)X$    (35)

Thus, up to a global scalar, the update is a convolution with $I + \lambda\,D^{-1/2}AD^{-1/2}$, where

$\lambda \;=\; \frac{\tilde{\eta}}{1 - \tilde{\eta}\,(1 - R(X))}$    (36)

According to this formulation, $\lambda$ is a monotonically increasing function of the normalized learning rate $\tilde{\eta}$ and is valid when $1 - \tilde{\eta}(1 - R(X)) > 0$. Therefore, when $\lambda = 1$ we recover $\tilde{\eta} = \frac{1}{2 - R(X)}$, the choice of the main text, and when $R(X) < 1$ (we know from Eqn. (33) that $R$ decreases toward the small end of the spectrum during optimization), the domain of the function is bounded, $\tilde{\eta} < \frac{1}{1 - R(X)}$.
Remark 3.
The choice of