2 Introduction
Data in various domains are naturally organized in graph structures. Social networks, citation networks, molecular structures, and protein-protein interaction networks can all be modeled using graphs. Developing powerful graph learning methods is important for many real-world applications, including recommendation systems Wu et al. (2019), link prediction Zhang and Chen (2018) and drug discovery Klambauer et al. (2017). Motivated by the success of deep neural networks, recent years have seen considerable interest in generalizing deep learning techniques to graph learning.
Compared with images and sequence data, graphs have a much more complex topological structure. The nodes in a graph can have very different numbers of neighbors, and there is no fixed node ordering. Early methods are primarily based on recurrent neural networks; they involve a process that iteratively propagates node features until the features reach a stable fixed point. In recent years, graph convolutional networks (GCNs) that leverage graph convolutions have become the dominant approach for graph learning. In GCNs, node features are updated by aggregating neighborhood features. Compared with recurrence-based methods, GCNs are much more efficient in learning on graph-structured data.
Although significant progress has been made in improving GCN capacity, overfitting is one of the major issues that affect performance Yang et al. (2020), resulting in models that do not generalize well to unseen samples. Overfitting can be caused by a large number of parameters or by issues with the learning algorithm such as vanishing or exploding gradients. An illustration of this phenomenon is shown in Figure 1: the training accuracies on MNIST and CIFAR10 Dwivedi et al. (2020) for superpixel graph classification are higher than the test accuracies, especially on the CIFAR10 dataset. Graph convolutions update a node's feature by aggregating features from the node's local neighborhood. Repeatedly applying graph convolutions can lead to the oversmoothing issue Li et al. (2018), i.e., node features converge to similar values. Oversmoothing is one of the limitations that restrict the performance of GCNs, and it is one of the major causes of overfitting. Due to oversmoothing, existing GCNs usually use only a few (e.g., 2 to 4) layers to model the relationship between inputs and outputs; further increasing the number of layers can reduce performance. We show in this work that by tackling oversmoothing, the overfitting issue can be alleviated, and hence the performance of GCNs is improved.
Regularization is an effective way to reduce overfitting. However, commonly used regularization methods such as L2 regularization and Dropout Srivastava et al. (2014) only slightly improve the generalization performance of graph networks. Most recent efforts have been devoted to improving model capacity, while few have focused on developing regularization techniques. Rong et al. Rong et al. (2020) recently proposed the DropEdge method for regularizing graph networks. DropEdge randomly removes a number of edges from the input graph at each training epoch, acting both as a data augmentation method and as a message passing reducer.
In this paper, we propose a stochastic regularization method to address the oversmoothing issue. In our method, we stochastically scale features and gradients (SSFG) during training. The factors used for stochastic scaling are sampled from a distribution transformed from the beta distribution. Scaling features breaks feature convergence, and therefore the oversmoothing issue is mitigated. Stochastically scaling features can be seen as applying varying perturbations to latent features, leading the graph network to learn on the vicinity of node features. We also stochastically scale gradients in backpropagation to prevent overfitting on the inputs. We show that applying stochastic scaling at the gradient level provides a further performance improvement.
Our SSFG regularization method can be seen as a variant of the Dropout method. Unlike Dropout, we preserve all neurons and stochastically drop or add a small portion of each feature in forward propagation. Our method can also be seen as a stochastic rectified linear unit (ReLU) function Nair and Hinton (2010) when applied together with ReLU: it generalizes the standard ReLU by using stochastic slopes in forward and backward propagation. We apply our SSFG regularization method to three graph convolutional networks, i.e., Graphsage Hamilton et al. (2017), graph attention networks (GATs) Veličković et al. (2018) and gated graph convnets (GatedGCNs) Bresson and Laurent (2017). We experiment on seven benchmark datasets for four graph-based tasks, i.e., graph classification, node classification, link prediction and graph regression, and show that our regularization method effectively improves the overall performance of the three baseline graph networks.
The contributions of this paper can be summarized as follows:
- We propose a stochastic regularization method for graph convolutional networks, in which we stochastically scale features and gradients during training. As far as we know, this is the first work on regularizing graph convolutional networks at both the feature level and the gradient level. Our method does not increase the number of model parameters, and we show that it is able to address both the overfitting issue and the underfitting issue.
- We experimentally evaluate our SSFG regularization method on three graph networks, i.e., Graphsage, GATs and GatedGCNs. Extensive experimental results demonstrate that our regularization method effectively improves the overall performance of the three baseline graph networks.
3 Related Work
3.1 Graph Convolutional Networks
Graph convolutional networks have become the dominant approach for learning on graph-structured data. Current studies on graph convolutional networks can be categorized into two approaches: the spectral-based approach and the spatial-based approach Wu et al. (2020). The spectral-based approach works with the spectral representation of graphs, and the spatial-based approach directly defines convolutions on graph nodes that are spatially close.
Bruna et al. Bruna et al. (2014) proposed the first spectral-based graph network, wherein convolutions are defined in the Fourier domain based on the eigendecomposition of the graph Laplacian. Defferrard et al. Defferrard et al. (2016) proposed Chebyshev spectral networks (ChebNets) to address the limitations of Bruna's work. ChebNets approximate the filters using a Chebyshev expansion of the graph Laplacian, resulting in spatially localized filters. Kipf and Welling Kipf and Welling (2017) further introduced an efficient layer-wise propagation rule based on the first-order approximation of spectral convolutions. In spectral-based graph networks, the learned filters depend on the graph structure; therefore, a model trained on a specific graph cannot be directly applied to graphs with a different structure.
Unlike the spectral-based approach, the spatial-based approach defines convolutions in the spatial domain and learns a node's feature by propagating information from the node's neighbors. To deal with variable-sized neighborhoods, several sampling-based methods have been proposed for efficient graph learning, including the nodewise sampling-based method Hamilton et al. (2017), the layerwise sampling-based method Chen et al. (2018) and the adaptive layerwise sampling-based method Huang et al. (2018). Velickovic et al. Veličković et al. (2018) proposed graph attention networks (GATs), in which the self-attention mechanism is used to compute attention weights for aggregation. Zhang et al. Zhang et al. (2018) further proposed gated attention networks that apply self-attention to the outputs of multiple attention heads to improve performance. Recently, Bresson et al. Bresson and Laurent (2017) proposed GatedGCNs, integrating edge gates, residual learning He et al. (2016) and batch normalization Ioffe and Szegedy (2015) into graph networks.
Figure 2: An illustration of the SSFG regularization method and its comparison to Dropout. In SSFG, an input feature vector is randomly scaled in forward propagation, and the gradient of the feature vector is also randomly scaled in backward propagation. The scaling factors are sampled from a distribution transformed from the Beta distribution. SSFG regularization is applied to each node feature individually.
3.2 Regularization Methods
Regularization methods have been commonly used to reduce overfitting in training neural networks. Conventional regularization methods include early stopping, i.e., terminating the training procedure when the performance on a validation set stops improving, Lasso regularization, weight decay and soft weight sharing Nowlan and Hinton (1992).
Srivastava et al. Srivastava et al. (2014) introduced Dropout as a stochastic regularization method for training neural networks. The key idea of Dropout is to randomly drop neurons from the neural network during training. Dropout can also be seen as perturbing the features output by a layer by setting randomly selected feature entries to zero. The idea behind Dropout has also been adopted in graph networks. For example, in GraphSAGE Hamilton et al. (2017), a fixed-size set of neighbors is sampled for each node in feature aggregation. This facilitates fast training and also helps improve generalization performance. This node sampling can be seen as using random subgraphs of the original graph during training. More recently, Rong et al. Rong et al. (2020) proposed the DropEdge method, which randomly drops edges in each training epoch. This method helps to address both the overfitting and oversmoothing issues.
4 Methodology
In this section, we first present the SSFG regularization method, showing its relationship to Dropout and ReLU. Then, we introduce the use of SSFG to regularize graph networks.
4.1 SSFG Regularization
In the Dropout method, neurons, as well as their connections, are randomly dropped out from the neural network during training. Applying Dropout to a neural network can be seen as training many subnetworks of the original network and using the ensemble of these subnetworks to make predictions at test time Srivastava et al. (2014). Although Dropout helps to improve the generalization performance, it does not directly address the oversmoothing issue.
We introduce the SSFG regularization method to address the oversmoothing issue. In SSFG, we stochastically scale node features and gradients in the training procedure. Specifically, we multiply each node feature by a factor sampled from a probability distribution in forward propagation, and in backward propagation the gradient of each node is also multiplied by a factor sampled from the probability distribution. We wish the expectation of node features to be unchanged. To this end, we adopt the following trick to define the scaling factor.
$$\lambda = s + \tfrac{1}{2}, \qquad s \sim P(\alpha), \qquad (1)$$

where $P(\alpha)$ is a probability distribution with mean equal to 0.5 (a Beta$(\alpha, \alpha)$ distribution in this work) and $\alpha$ is a hyperparameter. With the above trick, the scaling factor $\lambda$ falls in the interval $(0.5, 1.5)$ and has an expectation of 1. By directly scaling features, the oversmoothing issue is mitigated. We detail our SSFG regularization method in Algorithm 1. It is worth noting that SSFG does not introduce any trainable model parameters. SSFG regularization is applied only during training; at test time, we directly use the original node features for the target tasks.

A schematic illustration of the SSFG method and its comparison to Dropout is shown in Figure 2. Dropout operates at the neuron level. Compared to Dropout, our SSFG regularization in forward propagation can be seen as an improved version of Dropout that is applied at the feature level: when the scaling factor is less than 1, a proportion of the node feature is dropped, and when the scaling factor is greater than 1, a proportion of the node feature is added back to the node feature. Stochastically scaling gradients also introduces uncertainty into the optimization procedure, which further helps improve the overall performance. To the best of our knowledge, this is the first study on regularizing neural networks at the gradient level. We show through experiments that stochastically scaling gradients is complementary to stochastically scaling features in improving the overall performance.
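Algorithm 1 itself is not reproduced in this extraction, so the following is only a minimal PyTorch-style sketch of the SSFG operation, written under the reconstruction of Eq. (1) above (scaling factors obtained by shifting a Beta(α, α) sample by 0.5). The names `SSFGFunction` and `ssfg` are illustrative rather than taken from the reference implementation, and whether the forward and backward passes reuse the same sampled factor is a detail of Algorithm 1 that is not settled here (independent samples are used below).

```python
import torch
from torch.distributions import Beta


class SSFGFunction(torch.autograd.Function):
    """Sketch of SSFG: scale features in the forward pass and gradients in the
    backward pass by random factors with mean 1 (assumed form: Beta(a, a) + 0.5)."""

    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        # One factor per node (x is assumed to have shape [num_nodes, feat_dim]),
        # broadcast over the feature dimension.
        s = Beta(alpha, alpha).sample((x.size(0), 1)).to(x.device) + 0.5
        return x * s

    @staticmethod
    def backward(ctx, grad_output):
        # Independently sampled factor for the gradient (an assumption; the
        # original algorithm may instead reuse the forward factor).
        s = Beta(ctx.alpha, ctx.alpha).sample(
            (grad_output.size(0), 1)).to(grad_output.device) + 0.5
        return grad_output * s, None  # no gradient w.r.t. the hyperparameter


def ssfg(x, alpha=4.0, training=True):
    """Apply SSFG only during training; at test time features pass through unchanged."""
    if not training:
        return x
    return SSFGFunction.apply(x, alpha)
```

In a graph network, such a function would be applied to the node-feature matrix produced by a layer, as illustrated for the three baseline networks in Section 4.2.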
ReLU has been commonly used as the nonlinear activation function in graph networks. When used together with ReLU, SSFG can be seen as a stochastic ReLU function. An illustration of this interpretation is shown in Figure 3. By using stochastic slopes in propagating features and gradients, the network model can be made robust to different feature variations. This property makes our SSFG method not specific to graph networks; other networks such as convolutional networks could also use SSFG to improve their overall performance.

4.2 Regularizing Graph Networks with SSFG
Let a graph with $N$ nodes be denoted $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the node set and $\mathcal{E}$ the edge set, and let $X \in \mathbb{R}^{N \times d}$ be the feature matrix associated with the nodes. The adjacency matrix $A$ of $\mathcal{G}$ is an $N \times N$ matrix in which $A_{ij}$ equals 1 if node $i$ is connected to node $j$, and 0 otherwise. Usually we consider the nodes of $\mathcal{G}$ to be self-connected. Then, the structure of $\mathcal{G}$ can be represented as $\hat{A} = A + I$, where $I$ is the identity matrix.
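As a small worked example of this notation (a toy three-node graph, purely illustrative):

```python
import torch

# A toy undirected graph with N = 3 nodes and edges (0, 1) and (1, 2).
N = 3
A = torch.zeros(N, N)
for i, j in [(0, 1), (1, 2)]:
    A[i, j] = A[j, i] = 1.0

# Treat nodes as self-connected: the structure the layers operate on is A_hat = A + I.
A_hat = A + torch.eye(N)
```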
A typical graph convolutional network takes $X$ and the graph structure $\hat{A}$ as input and updates node features layer-wise as follows:

$$h^{(l+1)}_i = \sigma\Big(W^{(l)} \cdot \mathrm{AGG}\big(\{h^{(l)}_j : j \in \mathcal{N}(i) \cup \{i\}\}\big)\Big), \quad l = 0, 1, \dots, L-1, \qquad (2)$$

where $h^{(0)}_i$ is the input feature of node $i$ (the $i$-th row of $X$), $L$ is the number of graph convolutional layers, $\mathcal{N}(i)$ is the set of neighbors of node $i$, $\mathrm{AGG}$ is a neighborhood aggregation function, $W^{(l)}$ is a learnable weight matrix and $\sigma$ is a nonlinear activation function. Our SSFG regularization is a general method that can be applied to a wide variety of graph networks. In this work, we evaluate SSFG on three types of graph networks, i.e., Graphsage, GATs and GatedGCNs, to demonstrate its effectiveness.
Graphsage In a Graphsage layer, a fixed-size set of neighbors is randomly sampled, and the features of these sampled neighbors are aggregated using an aggregator function, such as the mean operator or an LSTM Hochreiter and Schmidhuber (1997), to update a node's feature as follows:

$$h^{(l)}_{\mathcal{N}(i)} = \mathrm{AGGREGATE}\big(\{h^{(l)}_j : j \in \mathcal{N}(i)\}\big), \qquad h^{(l+1)}_i = \sigma\big(W^{(l)} \big[h^{(l)}_i \,\Vert\, h^{(l)}_{\mathcal{N}(i)}\big]\big), \qquad (3)$$

where $W^{(l)}$ is the weight matrix of the shared linear transformation, $\Vert$ denotes concatenation, and $\sigma$ is a nonlinear activation function. SSFG can be applied to the input node features of the Graphsage layer or after the nonlinear activation function.
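As a sketch of the second placement (SSFG after the nonlinearity), using DGL's `SAGEConv`; the mean aggregator, the layer sizes and the `ssfg` helper sketched in Section 4.1 are illustrative assumptions rather than the exact experimental configuration:

```python
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import SAGEConv


class SAGELayerWithSSFG(nn.Module):
    """One Graphsage layer with SSFG applied after the activation (illustrative)."""

    def __init__(self, in_feats, out_feats, alpha=4.0):
        super().__init__()
        self.conv = SAGEConv(in_feats, out_feats, aggregator_type="mean")
        self.alpha = alpha

    def forward(self, g, h):
        h = F.relu(self.conv(g, h))
        # SSFG is active only in training mode; see the sketch in Section 4.1.
        return ssfg(h, alpha=self.alpha, training=self.training)
```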
GATs In GATs, each neighbor of a node is assigned an attention weight computed by the self-attention mechanism in feature aggregation as follows:

$$h^{(l+1)}_i = \sigma\Big(\sum_{j \in \mathcal{N}(i)} a^{(l)}_{ij}\, W^{(l)} h^{(l)}_j\Big), \qquad a^{(l)}_{ij} = \mathrm{attn}\big(h^{(l)}_i, h^{(l)}_j\big), \qquad (4)$$

where attn is the attention function that computes the normalized attention weight $a^{(l)}_{ij}$ between node $i$ and its neighbor $j$. GATs usually employ multiple attention heads to improve the overall performance. By default, we apply our SSFG regularization to the output of each attention head, as sketched below.
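A sketch of this default placement using DGL's `GATConv`, which returns per-head outputs of shape `(N, num_heads, out_feats)`; the head count is illustrative and `ssfg` is the helper sketched in Section 4.1:

```python
import torch
import torch.nn as nn
from dgl.nn import GATConv


class GATLayerWithSSFG(nn.Module):
    """One GAT layer with SSFG applied to the output of each attention head."""

    def __init__(self, in_feats, out_feats, num_heads=8, alpha=4.0):
        super().__init__()
        self.conv = GATConv(in_feats, out_feats, num_heads)
        self.alpha = alpha

    def forward(self, g, h):
        heads = self.conv(g, h)  # shape: (N, num_heads, out_feats)
        regularized = [
            ssfg(heads[:, k], alpha=self.alpha, training=self.training)
            for k in range(heads.shape[1])
        ]
        # Concatenate the regularized head outputs, as is common in GATs.
        return torch.cat(regularized, dim=-1)
```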
GatedGCNs GatedGCNs use the edge gating mechanism Marcheggiani and Titov (2017)
and residual connection in aggregating features from a node’s local neighborhood as follows:
$$h^{(l+1)}_i = h^{(l)}_i + \mathrm{ReLU}\Big(\mathrm{BN}\Big(U^{(l)} h^{(l)}_i + \sum_{j \in \mathcal{N}(i)} e^{(l)}_{ij} \odot V^{(l)} h^{(l)}_j\Big)\Big),$$
$$e^{(l)}_{ij} = \frac{\mathrm{sigmoid}\big(\hat{e}^{(l)}_{ij}\big)}{\sum_{j' \in \mathcal{N}(i)} \mathrm{sigmoid}\big(\hat{e}^{(l)}_{ij'}\big) + \varepsilon}, \qquad \hat{e}^{(l+1)}_{ij} = \hat{e}^{(l)}_{ij} + \mathrm{ReLU}\Big(\mathrm{BN}\big(P^{(l)} h^{(l)}_i + Q^{(l)} h^{(l)}_j + R^{(l)} \hat{e}^{(l)}_{ij}\big)\Big), \qquad (5)$$

where $U^{(l)}, V^{(l)}, P^{(l)}, Q^{(l)}, R^{(l)}$ are weight matrices of linear transformations, $\mathrm{BN}$ denotes batch normalization, $\varepsilon$ is a small constant, $e^{(l)}_{ij}$ are the edge gates, and $\odot$ is the Hadamard product. GatedGCNs explicitly maintain edge features at each layer. By default, we apply our SSFG regularization to both the node features and the edge features output by a GatedGCN layer, as sketched below.
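GatedGCN is not a built-in DGL module, so the sketch below assumes a hypothetical `gated_gcn_layer` implementing Eq. (5) that returns updated node and edge features; the wrapper only illustrates where SSFG is applied, again reusing the `ssfg` helper sketched in Section 4.1:

```python
import torch.nn as nn


class GatedGCNLayerWithSSFG(nn.Module):
    """Applies SSFG to both node and edge features output by a GatedGCN layer.

    `gated_gcn_layer` is assumed to be a module implementing Eq. (5) and
    returning (h, e); it is a placeholder, not a DGL built-in module.
    """

    def __init__(self, gated_gcn_layer, alpha=4.0):
        super().__init__()
        self.layer = gated_gcn_layer
        self.alpha = alpha

    def forward(self, g, h, e):
        h, e = self.layer(g, h, e)
        h = ssfg(h, alpha=self.alpha, training=self.training)
        e = ssfg(e, alpha=self.alpha, training=self.training)
        return h, e
```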
5 Experiments
Dataset | Graphs | Nodes/graph | Training | Val. | Test |
---|---|---|---|---|---|
PATTERN | 14K | 44-188 | 10,000 | 2000 | 2000 |
CLUSTER | 12K | 41-190 | 10,000 | 1000 | 1000 |
MNIST | 70K | 40-75 | 55,000 | 5000 | 10,000 |
CIFAR10 | 60K | 85-150 | 45,000 | 5000 | 10,000 |
TSP | 12K | 50-500 | 10,000 | 1000 | 1000 |
COLLAB | 1 | 235,868 | - | - | - |
ZINC | 12K | 9-37 | 10,000 | 1000 | 1000 |
Method | #Layers | PATTERN Test (Acc.) | PATTERN Train (Acc.)
---|---|---|---
Graphsage w/o SSFG | 4 | 50.516±0.001 | 50.473±0.014
Graphsage + SSFG (α=+∞) | 4 | 50.516±0.001 | 50.473±0.014
Graphsage w/o SSFG | 16 | 50.492±0.001 | 50.487±0.005
Graphsage + SSFG (α=+∞) | 16 | 50.492±0.001 | 50.487±0.005
GAT w/o SSFG | 4 | 75.824±1.823 | 77.883±1.632
GAT + SSFG (α=8.0) | 4 | 77.290±0.469 | 77.938±0.528
GAT w/o SSFG | 16 | 78.271±0.186 | 90.212±0.476
GAT + SSFG (α=8.0) | 16 | 81.461±0.123 | 82.724±0.385
Gatedgcn w/o SSFG | 4 | 84.480±0.122 | 84.474±0.155
Gatedgcn + SSFG (α=3.0) | 4 | 85.205±0.264 | 85.283±0.347
Gatedgcn + SSFG (α=4.0) | 4 | 85.016±0.181 | 84.923±0.202
Gatedgcn + SSFG (α=5.0) | 4 | 85.334±0.175 | 85.316±0.192
Gatedgcn + SSFG (α=6.0) | 4 | 85.102±0.161 | 85.066±0.155
Gatedgcn w/o SSFG | 16 | 85.568±0.088 | 86.007±0.123
Gatedgcn + SSFG (α=3.0) | 16 | 85.723±0.069 | 85.625±0.072
Gatedgcn + SSFG (α=4.0) | 16 | 85.717±0.020 | 85.606±0.012
Gatedgcn + SSFG (α=5.0) | 16 | 85.651±0.054 | 85.595±0.048
Method | #Layers | CLUSTER Test (Acc.) | CLUSTER Train (Acc.)
---|---|---|---
Graphsage w/o SSFG | 4 | 50.454±0.145 | 54.374±0.203
Graphsage + SSFG (α=5.0) | 4 | 50.562±0.070 | 53.014±0.025
Graphsage w/o SSFG | 16 | 63.844±0.110 | 86.710±0.167
Graphsage + SSFG (α=7.0) | 16 | 66.851±0.066 | 79.220±0.023
GAT w/o SSFG | 4 | 57.732±0.323 | 58.331±0.342
GAT + SSFG (α=7.0) | 4 | 59.888±0.044 | 59.656±0.025
GAT w/o SSFG | 16 | 70.587±0.447 | 76.074±1.362
GAT + SSFG (α=4.0) | 16 | 73.689±0.088 | 79.476±0.302
Gatedgcn w/o SSFG | 4 | 60.404±0.419 | 61.618±0.536
Gatedgcn + SSFG (α=6.0) | 4 | 61.028±0.302 | 62.415±0.311
Gatedgcn + SSFG (α=7.0) | 4 | 61.222±0.267 | 62.844±0.352
Gatedgcn + SSFG (α=8.0) | 4 | 61.498±0.267 | 63.310±0.343
Gatedgcn + SSFG (α=9.0) | 4 | 61.375±0.047 | 63.049±0.134
Gatedgcn w/o SSFG | 16 | 73.840±0.326 | 87.880±0.908
Gatedgcn + SSFG (α=4.0) | 16 | 75.671±0.084 | 83.769±0.035
Gatedgcn + SSFG (α=5.0) | 16 | 75.960±0.020 | 83.623±0.652
Gatedgcn + SSFG (α=6.0) | 16 | 75.601±0.078 | 84.516±0.299
Method | MNIST Test (Acc.) | MNIST Train (Acc.)
---|---|---
Graphsage vanilla | 97.312±0.097 | 100.00±0.000
Graphsage + SSFG (α=5.0) | 97.943±0.147 | 99.996±0.002
GAT vanilla | 95.535±0.205 | 99.994±0.008
GAT + SSFG (α=2.0) | 97.938±0.075 | 99.996±0.002
Gatedgcn vanilla | 97.340±0.143 | 100.00±0.000
Gatedgcn + SSFG (α=1.0) | 97.848±0.106 | 99.889±0.035
Gatedgcn + SSFG (α=1.5) | 97.730±0.116 | 99.975±0.004
Gatedgcn + SSFG (α=2.0) | 97.985±0.032 | 99.996±0.001
Gatedgcn + SSFG (α=2.5) | 97.703±0.054 | 99.996±0.001
Method | CIFAR10 Test (Acc.) | CIFAR10 Train (Acc.)
---|---|---
Graphsage vanilla | 65.767±0.308 | 99.719±0.062
Graphsage + SSFG (α=4.0) | 68.803±0.471 | 89.845±0.166
GAT vanilla | 64.223±0.455 | 89.114±0.499
GAT + SSFG (α=4.0) | 66.065±0.171 | 84.383±0.986
Gatedgcn vanilla | 67.312±0.311 | 94.553±1.018
Gatedgcn + SSFG (α=1.0) | 71.585±0.361 | 83.878±1.146
Gatedgcn + SSFG (α=1.5) | 71.938±0.190 | 87.473±0.593
Gatedgcn + SSFG (α=2.0) | 71.383±0.427 | 87.745±0.973
Gatedgcn + SSFG (α=2.5) | 70.913±0.306 | 88.645±0.750
Method | TSP Test (F1) | TSP Train (F1)
---|---|---
Graphsage w/o SSFG | 0.665±0.003 | 0.669±0.003
Graphsage + SSFG (α=5.0) | 0.714±0.003 | 0.717±0.003
GAT w/o SSFG | 0.671±0.002 | 0.673±0.002
GAT + SSFG (α=300) | 0.682±0.000 | 0.684±0.0001
Gatedgcn w/o SSFG | 0.791±0.003 | 0.793±0.003
Gatedgcn + SSFG (α=4.0) | 0.802±0.001 | 0.804±0.001
Gatedgcn + SSFG (α=5.0) | 0.806±0.001 | 0.807±0.001
Gatedgcn + SSFG (α=6.0) | 0.805±0.001 | 0.808±0.001
Gatedgcn + SSFG (α=7.0) | 0.805±0.001 | 0.807±0.001
Method | COLLAB Test (Hits@50) | COLLAB Train (Hits@50)
---|---|---
Graphsage w/o SSFG | 51.618±0.690 | 99.949±0.052
Graphsage + SSFG (α=4.0) | 53.146±0.230 | 98.280±1.300
GAT w/o SSFG | 51.501±0.962 | 97.851±1.114
GAT + SSFG (α=4.0) | 53.616±0.400 | 97.700±0.132
GAT + SSFG (α=5.0) | 53.908±0.253 | 97.835±0.182
GAT + SSFG (α=6.0) | 54.715±0.069 | 97.929±0.189
GAT + SSFG (α=7.0) | 54.252±0.092 | 98.084±0.340
Gatedgcn w/o SSFG | 52.635±1.168 | 96.103±1.876
Gatedgcn + SSFG (α=5.0) | 53.055±0.671 | 92.535±0.989
Method | ZINC Test (MAE) | ZINC Train (MAE)
---|---|---
Graphsage vanilla | 0.468±0.003 | 0.251±0.004
Graphsage + SSFG (α=10) | 0.441±0.006 | 0.191±0.005
GAT vanilla | 0.475±0.007 | 0.317±0.006
GAT + SSFG (α=20) | 0.466±0.001 | 0.329±0.010
Gatedgcn vanilla | 0.435±0.011 | 0.287±0.014
Gatedgcn + SSFG (α=4) | 0.415±0.007 | 0.316±0.008
Gatedgcn + SSFG (α=5) | 0.413±0.006 | 0.315±0.008
Gatedgcn + SSFG (α=6) | 0.404±0.0003 | 0.296±0.005
Gatedgcn + SSFG (α=7) | 0.398±0.001 | 0.286±0.007
5.1 Experimental Setup
Our experiments are conducted on seven recently released benchmark datasets, i.e., PATTERN, CLUSTER, MNIST, CIFAR10, TSP, COLLAB and ZINC Dwivedi et al. (2020). These datasets are used for four graph-based tasks: node classification (PATTERN, CLUSTER), graph classification (MNIST, CIFAR10), link prediction (TSP, COLLAB) and graph regression (ZINC). The statistics of the seven datasets are given in Table 1.
We closely follow the experimental setup used in the work of Dwivedi et al. Dwivedi et al. (2020). The Adam method Kingma and Ba (2014) is used to train all the models. The learning rate is initialized following this setup and reduced by a factor of 2 if the loss has not improved for a fixed number of epochs (10 or 20); the training procedure is terminated when the learning rate falls below a preset minimum value. For the node classification task, we experiment with different numbers of layers (4 and 16). For the remaining tasks, the number of layers is set to 4. For each evaluation, we run the experiment 4 times using different random seeds and report the mean and standard deviation of the 4 results. Our method is implemented using PyTorch Paszke et al. (2017) and the DGL library Wang et al. (2019).

Evaluation Metrics Following Dwivedi et al. (2020), the following evaluation metrics are used for different tasks:
- Accuracy. Weighted average node classification accuracy is used for the node classification task (PATTERN and CLUSTER), and classification accuracy is used for the graph classification task (MNIST and CIFAR10).
- F1 score is used for evaluation on the TSP dataset due to the high class imbalance, i.e., only the edges in the TSP tour are labeled as positive.
- Hits@K Hu et al. (2020) is used for the COLLAB dataset, aiming to measure a model's ability to predict future collaboration relationships. This metric ranks each true collaboration against 100,000 randomly sampled negative collaborations and counts the ratio of positive edges that are ranked at the K-th place or above (a minimal sketch of this computation is given after this list).
- MAE (mean absolute error) is used to evaluate graph regression performance on ZINC.
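As an illustration of the Hits@K computation described in the third item above (the function below follows the textual description; the function name and tensor layout are illustrative, and the official evaluator of Hu et al. (2020) should be used for reported numbers):

```python
import torch


def hits_at_k(pos_scores, neg_scores, k=50):
    """Fraction of positive edges scored above the k-th highest negative score.

    pos_scores: predicted scores of the true (positive) edges.
    neg_scores: predicted scores of the randomly sampled negative edges
                (e.g., 100,000 negatives for COLLAB).
    """
    # Score that a positive edge must exceed to rank within the top k negatives.
    kth_best_negative = torch.topk(neg_scores, k).values[-1]
    return (pos_scores > kth_best_negative).float().mean().item()
```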
5.2 Experimental Results
5.2.1 Quantitative Results
Table 2 reports the quantitative results of node classification on PATTERN and CLUSTER. It can be seen that applying SSFG regularization effectively improves the test accuracies, except for Graphsage on the PATTERN dataset. Applying SSFG regularization to GATs with 16 layers yields 3.190% and 3.102% performance improvements on PATTERN and CLUSTER, respectively. These improvements are higher than those obtained for GATs with 4 layers. For GatedGCNs, applying SSFG regularization yields larger performance improvements on CLUSTER than on PATTERN.
The graph classification results on MNIST and CIFAR10 are shown in Table 3. We see that applying our SSFG regularization method helps improve the performance of the three baseline graph networks on the two datasets. While vanilla Graphsage and GatedGCN perform well compared to GAT on MNIST, the three baseline graph networks achieve comparable performance when SSFG regularization is applied. On CIFAR10, applying our SSFG method to GatedGCN improves the accuracy from 67.312% to 71.938%, yielding a 4.626% performance gain, which is higher than the gains obtained by applying SSFG to Graphsage and GAT.
The link prediction results are given in Table 4. Once again, applying SSFG improves the performance of the three baseline graph networks. On TSP, the use of SSFG regularization in Graphsage, GAT and GatedGCN yields 0.049, 0.011 and 0.015 performance gains, respectively. On the COLLAB dataset, applying SSFG to Graphsage and GAT yields 1.528 and 3.214 performance improvements, respectively. For GatedGCN, SSFG only slightly improves the prediction performance.
Table 5 reports the experimental results on ZINC, demonstrating the effectiveness of our SSFG regularization in improving graph regression performance. Applying the SSFG method improves the overall performance of the three baseline graph networks. For Graphsage, GAT and GatedGCN, the use of SSFG reduces the mean absolute error by 0.027, 0.009 and 0.037, respectively.
We have shown that our SSFG method helps improve the performance of the three baseline graph networks on different graph-based tasks. For most experiments, the performance on test data improves while that on training data decreases; these improvements are obtained by reducing overfitting. It is worth noting that for some experiments, e.g., those on TSP and those of GatedGCN with 4 layers on PATTERN and CLUSTER, the performance on the test data and that on the training data improve simultaneously. This shows that our SSFG method also helps to address the underfitting issue.
In addition, we observe that for most experiments our SSFG regularization method results in small standard deviations. For example, the use of SSFG regularization consistently results in smaller standard deviations on CLUSTER, COLLAB and ZINC compared to those obtained without SSFG regularization. This suggests that our SSFG method also stabilizes training.
For the graph network that achieves the highest performance improvement on each task, we also show the impact of the value of α used for sampling scaling factors on performance in the quantitative results. We see that the value of α has different impacts on different tasks. Even for the same task, a graph network with a different number of layers may require a different value of α to obtain the best performance. This suggests that the value of α needs to be tuned for the best task performance.
Figure 4 shows the training and test accuracy curves with respect to the training epoch. We see that our SSFG method can help address both the overfitting issue and the underfitting issue. Compared with the work of Dwivedi et al. Dwivedi et al. (2020), we use a larger patience value for the optimizer when learning on some datasets; therefore, it can take more epochs to finish the training procedure. We also observe that on some datasets, for example MNIST, training with SSFG regularization takes a comparable number of epochs to training without it. As aforementioned, our method can be seen as a stochastic ReLU function, and the results show that it outperforms the standard ReLU in graph representation learning. It is also worth noting that our method does not increase the number of learnable parameters.
5.2.2 Broader Impact
We have shown that our SSFG regularization method is effective in improving graph representation learning performance. Our SSFG method helps to address both the overfitting issue and the underfitting issue. When used together with ReLU, the SSFG method can be seen as a stochastic ReLU function. This interpretation means our SSFG method is not specific to graph networks. Overfitting and underfitting are also issues with neural networks for other tasks such as image recognition and natural language processing. Our SSFG method could replace the ReLU function for these tasks to improve the overall performance, especially when the training data are limited.
6 Conclusions
In this paper, we presented a stochastic regularization method for graph networks. In our method, we stochastically scale features and gradients by a factor sampled from a probability distribution in the training procedure. Our method can help address the oversmoothing issue caused by repeatedly applying graph convolutional layers. We showed that applying stochastic scaling at the feature level is complementary to that at the gradient level in improving the overall performance. When used together with ReLU, our method can also be seen as a stochastic ReLU function. We experimentally validated our SSFG regularization method on seven benchmark datasets for different graph-based tasks. The experimental results demonstrated that our method can help address both the overfitting issue and the underfitting issue.
References
- [1] (2017) Residual gated graph convnets. arXiv preprint arXiv:1711.07553. Cited by: §2, §3.1.
- [2] (2014) Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR2014), CBLS, April 2014, (English (US)). Cited by: §3.1.
- [3] (2018) Fastgcn: fast learning with graph convolutional networks via importance sampling. International Conference on Learning Representations. Cited by: §3.1.
- [4] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pp. 3844–3852. Cited by: §3.1.
- [5] (2020) Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982. Cited by: Figure 1, §2, §5.1, §5.1, §5.1, §5.2.1.
- [6] (2017) Inductive representation learning on large graphs. In Advances in neural information processing systems, pp. 1024–1034. Cited by: §2, §3.1, §3.2.
- [7] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.1.
- [8] (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §4.2.
- [9] (2020) Open graph benchmark: datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687. Cited by: 3rd item.
- [10] (2018) Adaptive sampling towards fast graph representation learning. In Advances in neural information processing systems, pp. 4558–4567. Cited by: §3.1.
- [11] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §3.1.
- [12] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.1.
- [13] (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR2017), (English (US)). Cited by: §3.1.
- [14] (2017) Self-normalizing neural networks. In Advances in neural information processing systems, pp. 971–980. Cited by: §2.
- [15] (2018) Deeper insights into graph convolutional networks for semi-supervised learning. AAAI Conference on Artificial Intelligence. Cited by: §2.
- [16] (2017) Encoding sentences with graph convolutional networks for semantic role labeling. arXiv preprint arXiv:1703.04826. Cited by: §4.2.
- [17] (2010) Rectified linear units improve restricted boltzmann machines. In ICML, Cited by: §2.
- [18] (1992) Simplifying neural networks by soft weight-sharing. Neural computation 4 (4), pp. 473–493. Cited by: §3.2.
- [19] (2017) Automatic differentiation in pytorch. Cited by: §5.1.
- [20] (2020) Dropedge: towards deep graph convolutional networks on node classification. In International Conference on Learning Representations, Cited by: §2, §3.2.
- [21] (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §2, §3.2, §4.1.
- [22] (2018) Graph Attention Networks. International Conference on Learning Representations. External Links: Link Cited by: §2, §3.1.
- [23] (2019) Deep graph library: towards efficient and scalable deep learning on graphs. arXiv preprint arXiv:1909.01315. Cited by: §5.1.
- [24] (2019) Session-based recommendation with graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 346–353. Cited by: §2.
- [25] (2020) A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §3.1.
- [26] (2020) Revisiting "over-smoothing" in deep gcns. arXiv preprint arXiv:2003.13663. Cited by: §2.
- [27] (2018) Gaan: gated attention networks for learning on large and spatiotemporal graphs. UAI 2018. Cited by: §3.1.
- [28] (2018) Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, pp. 5165–5175. Cited by: §2.