SSFG: Stochastically Scaling Features and Gradients for Regularizing Graph Convolution Networks

02/20/2021 ∙ by Haimin Zhang, et al. ∙ University of Technology Sydney

Graph convolutional networks have been successfully applied in various graph-based tasks. In a typical graph convolutional layer, node features are computed by aggregating neighborhood information. Repeatedly applying graph convolutions can cause the oversmoothing issue, i.e., node features converge to similar values. This is one of the major reasons that cause overfitting in graph learning, resulting in the model fitting well to training data while not generalizing well on test data. In this paper, we present a stochastic regularization method to address this issue. In our method, we stochastically scale features and gradients (SSFG) by a factor sampled from a probability distribution in the training procedure. We show that applying stochastic scaling at the feature level is complementary to that at the gradient level in improving the overall performance. When used together with ReLU, our method can be seen as a stochastic ReLU. We experimentally validate our SSFG regularization method on seven benchmark datasets for different graph-based tasks. Extensive experimental results demonstrate that our method effectively improves the overall performance of the baseline graph networks.


2 Introduction

Data are organized in graph structures in various domains. Social networks, citation networks, molecular structures and protein-protein interactions can all be modeled using graphs. Developing powerful graph learning methods is therefore important for many real-world applications, including recommendation systems Wu et al. (2019), link prediction Zhang and Chen (2018) and drug discovery Klambauer et al. (2017). Motivated by the success of deep neural networks, recent years have seen considerable interest in generalizing deep learning techniques to graph learning.

Compared with images and sequence data, graphs have a much more complex topological structure. The nodes in a graph can have very different numbers of neighbours, and there is no fixed node ordering for a graph. Early graph learning methods are primarily based on recurrent neural networks. These methods involve a process that iteratively propagates node features until the features reach a stable fixed point. In recent years, graph convolutional networks (GCNs) that leverage graph convolutions have become the dominant approach for graph learning. In GCNs, node features are updated by aggregating neighbourhood features. Compared with recurrent-based methods, GCNs are much more efficient in learning on graph-structured data.

Although significant progress has been made in improving GCN capacity, overfitting remains one of the major issues that affect performance Yang et al. (2020), resulting in models that fit the training data well but do not generalize well on unseen samples. Overfitting can be caused by a large number of parameters or by issues with the learning algorithm such as gradient vanishing/exploding. An illustration of this phenomenon is shown in Figure 1: the training accuracies on MNIST and CIFAR10 Dwivedi et al. (2020) for superpixel graph classification are higher than the test accuracies, especially on the CIFAR10 dataset. Graph convolutions update a node feature by aggregating features from the node's local neighborhood. Repeatedly applying graph convolutions can lead to the oversmoothing issue Li et al. (2018), i.e., node features converge to similar values. The oversmoothing issue is one of the limitations that restrict the performance of GCNs and is one of the major causes of overfitting. Due to oversmoothing, existing GCNs usually use only a few (e.g., 2 to 4) layers to model the relationship between inputs and outputs; further increasing the number of layers can lead to reduced performance. We show in this work that by tackling oversmoothing, the overfitting issue can be alleviated and the performance of GCNs improved.

Figure 1: An illustration of the overfitting issue with graph networks for superpixel graph classification on CIFAR10 and MNIST Dwivedi et al. (2020). The results are obtained using GatedGCN with 4 layers.

Regularization is an effective way to reduce overfitting. However, commonly used regularization methods such as L2 regularization and Dropout Srivastava et al. (2014) only slightly improve the generalization performance of graph networks. Most recent efforts have been devoted to improving model capacity, while few have been devoted to developing regularization techniques. Rong et al. (2020) recently proposed the DropEdge method for regularizing graph networks. DropEdge randomly removes a number of edges from the input graph at each training epoch, acting both as a data augmentation method and as a message passing reducer.

In this paper, we propose a stochastic regularization method to address the oversmoothing issue. In our method, we stochastically scale features and gradients (SSFG) in the training procedure. The factors used for stochastic scaling are sampled from a distribution transformed from the beta distribution. Scaling features breaks feature convergence, and therefore the oversmoothing issue is mitigated. Stochastically scaling features can be seen as applying varying perturbations to latent features, leading the graph network to learn on the vicinity of node features. We also stochastically scale gradients in backpropagation to prevent overfitting on the inputs. We show that applying stochastic scaling at the gradient level provides further performance improvement.

Our SSFG regularization method can be seen as a variant of the Dropout method. Unlike Dropout, we preserve all neurons and stochastically drop or add back a small portion of each feature in forward propagation. Our method can also be seen as a stochastic rectified linear unit (ReLU) Nair and Hinton (2010) when applied together with ReLU: it generalizes the standard ReLU by using stochastic slopes in forward and backward propagation.

We apply our SSFG regularization method to three graph convolutional networks, i.e., Graphsage Hamilton et al. (2017), graph attention networks (GATs) Veličković et al. (2018) and gated graph convnets (GatedGCNs) Bresson and Laurent (2017). We experiment on seven benchmark datasets for four graph-based tasks, i.e., graph classification, node classification, link prediction and graph regression. We show that our regularization method is effective in improving the overall performance of the three baseline graph networks.

The contributions of this paper can be summarized as follows:

  • We propose a stochastic regularization method for graph convolutional networks. In our method, we stochastically scale features and gradients in the training procedure. As far as we know, this is the first research on regularizing graph convolutional networks at both the feature level and the gradient level. Our method does not increase the number of model parameters, and we show that it is able to address both the overfitting issue and the underfitting issue.

  • We experimentally evaluate our SSFG regularization method on three graph networks, i.e., Graphsage, GATs and GatedGCNs. Extensive experimental results demonstrate that our regularization method effectively improves the overall performance of the three baseline graph networks.

3 Related Work

3.1 Graph Convolutional Networks

Graph convolutional networks have become the dominant approach for learning on graph-structured data. Current studies on graph convolutional networks can be categorized into two approaches: the spectral-based approach and the spatial-based approach Wu et al. (2020). The spectral-based approach works with the spectral representation of graphs, and the spatial-based approach directly defines convolutions on graph nodes that are spatially close.

Bruna et al. (2014) proposed the first spectral-based graph network, wherein convolutions are defined in the Fourier domain based on the eigen-decomposition of the graph Laplacian. Defferrard et al. (2016) proposed Chebyshev spectral networks (ChebNets) to address the limitations of Bruna's work. ChebNets approximate the filters using a Chebyshev expansion of the graph Laplacian, resulting in spatially localized filters. Kipf and Welling (2017) further introduced an efficient layer-wise propagation rule based on the first-order approximation of spectral convolutions. In spectral-based graph networks, the learned filters depend on the graph structure, and therefore a model trained on a specific graph cannot be directly applied to graphs with a different structure.

Unlike the spectral-based approach, the spatial-based approach defines convolutions in the spatial domain and learns a node's feature by propagating information from the node's neighbors. To deal with variable-sized neighborhoods, several sampling-based methods have been proposed for efficient graph learning. These methods include the nodewise sampling-based method Hamilton et al. (2017), the layerwise sampling-based method Chen et al. (2018) and the adaptive layerwise sampling-based method Huang et al. (2018). Veličković et al. (2018) proposed graph attention networks (GATs), in which the self-attention mechanism is used to compute attention weights for aggregation. Zhang et al. (2018) further proposed gated attention networks, which apply self-attention to the outputs of multiple attention heads to improve performance. Recently, Bresson and Laurent (2017) proposed GatedGCNs, integrating edge gates, residual learning He et al. (2016) and batch normalization Ioffe and Szegedy (2015) into graph networks.

Figure 2: An illustration of the SSFG regularization method and its comparison to Dropout. In SSFG, an input feature vector is randomly scaled in forward propagation, and the gradient of the feature vector is also randomly scaled in backward propagation. The scaling factors are sampled from a distribution transformed from the beta distribution. SSFG regularization is applied to each node feature.

3.2 Regularization Methods

Regularization methods have been commonly used to reduce overfitting in training neural networks. Conventional regularization methods include early stopping, i.e., terminating the training procedure when the performance on a validation set stops improving, Lasso regularization, weight decay and soft weight sharing Nowlan and Hinton (1992).

Srivastava et al. (2014) introduced Dropout as a stochastic regularization method for training neural networks. The key idea of Dropout is to randomly drop neurons from the neural network during training. Dropout can also be seen as perturbing the feature output by a layer by setting randomly selected feature points to zero. The idea behind Dropout has also been adopted in graph networks. For example, in GraphSAGE Hamilton et al. (2017), a fixed-size set of neighbors is sampled for each node in feature aggregation. This facilitates fast training and also helps improve generalization performance. This node sampling can be seen as using random subgraphs of the original graph in the training procedure. More recently, Rong et al. (2020) proposed the DropEdge method, which randomly drops edges in each training epoch. This method helps to address both the overfitting and oversmoothing issues.

4 Methodology

In this section, we first present the SSFG regularization method, showing its relationship to Dropout and ReLU. Then, we introduce the use of SSFG to regularize graph networks.

Figure 3: Comparison of our SSFG method and ReLU. When used together with ReLU, SSFG can be seen as a stochastic ReLU, using random slopes in forward and backward propagation.

4.1 SSFG Regularization

In the Dropout method, neurons, as well as their connections, are randomly dropped out from the neural network during training. Applying Dropout to a neural network can be seen as training many subnetworks of the original network and using the ensemble of these subnetworks to make predictions at test time Srivastava et al. (2014). Although Dropout helps to improve the generalization performance, it does not directly address the oversmoothing issue.

We introduce the SSFG regularization method to address the oversmoothing issue. In SSFG, we stochastically scale node features and gradients in the training procedure. Specifically, we multiply each node feature by a factor sampled from a probability distribution in forward propagation, and in backward propagation the gradient of each node is also multiplied by a factor sampled from the probability distribution. We wish the expectation of node features to be unchanged. To this end, we adopt the following trick to define the scaling factor:

(1)

where the distribution used for sampling has a mean equal to 0.5, and α is a hyperparameter of the distribution. With this trick, the scaling factor falls within a bounded interval. By directly scaling features, the oversmoothing issue is mitigated. We detail our SSFG regularization method in Algorithm 1. It is worth noting that the SSFG method does not increase the number of trainable model parameters. SSFG regularization is only applied in the training procedure; at test time, we directly use the original node features for the target tasks.

Input: node feature h; hyperparameter α used for sampling.

function ForwardPropagation(h)
    sample a scaling factor λ from the transformed distribution in Eq. (1)
    if training then
        h ← λ · h
    end if
    return h
end function

function BackwardPropagation(g)        ▷ g is the gradient with respect to h
    sample a scaling factor λ′ from the transformed distribution in Eq. (1)
    if training then
        g ← λ′ · g
    end if
    return g
end function

Algorithm 1: The SSFG regularization method.
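For concreteness, the following PyTorch sketch implements the scheme above as a custom autograd function. Since Eq. (1) is not reproduced here, the exact transform is an assumption: we sample s from Beta(α, α) (mean 0.5) and use λ = 2s as the scaling factor, which keeps the expected feature value unchanged. The ssfg helper name, the per-node granularity of the factors and the independently sampled backward factor are illustrative choices rather than the paper's exact implementation.

```python
import torch
from torch.distributions import Beta


class SSFGFunction(torch.autograd.Function):
    """Stochastically scales features (forward) and gradients (backward)."""

    @staticmethod
    def forward(ctx, h, alpha):
        ctx.alpha = alpha
        # One scaling factor per node feature vector (row of h);
        # lambda = 2 * s with s ~ Beta(alpha, alpha) has mean 1 (assumption).
        s = Beta(alpha, alpha).sample((h.size(0), 1)).to(h.device)
        return h * (2.0 * s)

    @staticmethod
    def backward(ctx, grad_output):
        # The usual gradient is rescaled by an independently sampled factor.
        s = Beta(ctx.alpha, ctx.alpha).sample((grad_output.size(0), 1)).to(grad_output.device)
        return grad_output * (2.0 * s), None


def ssfg(h, alpha=4.0, training=True):
    """Apply SSFG during training only; identity mapping at test time."""
    if not training:
        return h
    return SSFGFunction.apply(h, alpha)
```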

A schematic illustration of the SSFG method and its comparison to Dropout is shown in Figure 2. Dropout operates at the neuron level. Compared to Dropout, our SSFG regularization in forward propagation can be seen as an improved version of Dropout applied at the feature level: when the scaling factor is less than 1, a proportion of the node feature is dropped out, and when the scaling factor is greater than 1, a proportion of the node feature is added back to the node feature. Stochastically scaling gradients also introduces uncertainty into the optimization procedure, which further helps improve the overall performance. To the best of our knowledge, this is the first study on regularizing neural networks at the gradient level. We show through experiments that stochastically scaling gradients is complementary to stochastically scaling features in improving the overall performance.

ReLU has been commonly used as the nonlinear activation function in graph networks. When used together with ReLU, SSFG can be seen as a stochastic ReLU function; an illustration of this interpretation is shown in Figure 3. By using stochastic slopes when propagating features and gradients, the network model becomes robust to different feature variations. This property means our SSFG method is not specific to graph networks; other networks, such as convolutional networks, could also use SSFG to improve their overall performance.
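As a quick check of this interpretation (the symbols λ and λ′ below, for the forward and backward scaling factors, are our notation): for a positive factor λ, scaling the ReLU output is the same as a ReLU whose positive-side slope is λ, and in backward propagation the incoming gradient g is passed through with an independently sampled slope λ′ on the positive side.

```latex
\lambda \cdot \mathrm{ReLU}(x) \;=\; \lambda \max(0, x) \;=\;
\begin{cases}
  \lambda x, & x > 0,\\
  0,         & x \le 0,
\end{cases}
\qquad\qquad
g \;\mapsto\; \lambda' \,\mathbb{1}[x > 0]\, g .
```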

4.2 Regularizing Graph Networks with SSFG

Let a graph with N nodes be denoted G = (V, E), where V is the node set and E the edge set, and let X be the feature matrix associated with the nodes. The adjacency matrix A of G is an N × N matrix in which A_{ij} equals 1 if node i is connected to node j, and 0 otherwise. Usually we consider the nodes of G to be self-connected. Then, the structure of G can be represented as Â = A + I_N, where I_N is the identity matrix.

A typical graph convolutional network takes X and the graph structure Â as input and updates node features layer-wise as follows:

    h_i^{(l+1)} = f^{(l)}\big( h_i^{(l)}, \{ h_j^{(l)} : j \in \mathcal{N}(i) \} \big), \quad l = 0, \ldots, L-1,    (2)

where L is the number of graph convolutional layers and N(i) is the set of neighbors of node i. Our SSFG regularization is a general method that can be applied to a wide variety of graph networks. In this work, we evaluate SSFG on three types of graph networks, i.e., Graphsage, GATs and GatedGCNs, to demonstrate its effectiveness.

Graphsage In a Graphsage layer, a fixed-size set of neighbors is randomly sampled, and the features of these sampled neighbors are aggregated using an aggregator function, such as the mean operator or an LSTM Hochreiter and Schmidhuber (1997), to update a node's feature as follows:

    h_i^{(l+1)} = \sigma\big( W^{(l)} \cdot \big[\, h_i^{(l)} \,\|\, \mathrm{AGG}\big( \{ h_j^{(l)} : j \in \mathcal{N}(i) \} \big) \,\big] \big),    (3)

where W^{(l)} is the weight matrix of the shared linear transformation, ‖ denotes concatenation, and σ is a nonlinear activation function. SSFG can be applied to the input node features of the Graphsage layer or after the nonlinear activation function.
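As an illustration of this placement, here is a sketch of a Graphsage-style layer with SSFG applied after the activation. It assumes the ssfg helper sketched earlier and uses DGL's built-in SAGEConv as a stand-in for the benchmark implementation, which may differ in details such as normalization or residual connections.

```python
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import SAGEConv


class SageLayerWithSSFG(nn.Module):
    def __init__(self, in_feats, out_feats, alpha=4.0):
        super().__init__()
        self.conv = SAGEConv(in_feats, out_feats, aggregator_type="mean")
        self.alpha = alpha

    def forward(self, g, h):
        h = self.conv(g, h)   # neighborhood aggregation
        h = F.relu(h)         # nonlinear activation
        # SSFG applied after the activation; it could equally be applied
        # to the layer's input features, as noted above.
        return ssfg(h, self.alpha, self.training)
```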

GATs In GATs, each neighbor of a node is assigned an attention weight, computed by the self-attention mechanism, for feature aggregation as follows:

    h_i^{(l+1)} = \sigma\Big( \sum_{j \in \mathcal{N}(i) \cup \{i\}} a_{ij}^{(l)} \, W^{(l)} h_j^{(l)} \Big), \quad a_{ij}^{(l)} = \mathrm{softmax}_j\big( \mathrm{attn}\big( h_i^{(l)}, h_j^{(l)} \big) \big),    (4)

where attn is the attention function and a_{ij}^{(l)} is the attention weight assigned to neighbor j. GATs usually employ multiple attention heads to improve the overall performance. By default, we apply our SSFG regularization to the output of each attention head.
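A similar sketch for the per-head placement, again assuming the ssfg helper above and DGL's GATConv, whose output has shape [num_nodes, num_heads, out_feats]; averaging the heads at the end is one common choice and not necessarily the configuration used in the paper.

```python
import torch.nn as nn
from dgl.nn import GATConv


class GATLayerWithSSFG(nn.Module):
    def __init__(self, in_feats, out_feats, num_heads=8, alpha=4.0):
        super().__init__()
        self.conv = GATConv(in_feats, out_feats, num_heads)
        self.alpha = alpha

    def forward(self, g, h):
        out = self.conv(g, h)                    # shape: (N, heads, d)
        n, k, d = out.shape
        # Treat every (node, head) output as a separate feature vector
        # so that each head's output receives its own scaling factor.
        out = ssfg(out.reshape(n * k, d), self.alpha, self.training)
        return out.reshape(n, k, d).mean(dim=1)  # combine the heads
```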

GatedGCNs GatedGCNs use the edge gating mechanism Marcheggiani and Titov (2017) and residual connections when aggregating features from a node's local neighborhood as follows:

    h_i^{(l+1)} = h_i^{(l)} + \mathrm{ReLU}\Big( \mathrm{BN}\Big( U^{(l)} h_i^{(l)} + \sum_{j \in \mathcal{N}(i)} e_{ij}^{(l)} \odot V^{(l)} h_j^{(l)} \Big) \Big),    (5)

where U^{(l)} and V^{(l)} are weight matrices of linear transformations, e_{ij}^{(l)} are the edge gates, and ⊙ is the Hadamard product. GatedGCNs explicitly maintain edge features at each layer. By default, we apply our SSFG regularization to both the node features and the edge features output by a GatedGCN layer.

5 Experiments

Dataset Graphs Nodes/graph Training Val. Test
PATTERN 14K 44-188 10,000 2000 2000
CLUSTER 12K 41-190 10,000 1000 1000
MNIST 70K 40-75 55,000 5000 10,000
CIFAR10 60K 85-150 45,000 5000 10,000
TSP 12K 50-500 10,000 1000 1000
COLLAB 1 235,868 - - -
ZINC 12K 9-37 10,000 1000 1000
Table 1: Statistics of the seven benchmark datasets used in our experiments. On COLLAB, edges that represent collaborations up to 2017 are used for training, and edges that represent collaborations in 2018 and 2019 are used for validation and testing, respectively.
Method (PATTERN) #Layers Test (Acc.) Train (Acc.)
Graphsage w/o SSFG 4 50.516±0.001 50.473±0.014
Graphsage + SSFG (α=+∞) 4 50.516±0.001 50.473±0.014
Graphsage w/o SSFG 16 50.492±0.001 50.487±0.005
Graphsage + SSFG (α=+∞) 16 50.492±0.001 50.487±0.005
GAT w/o SSFG 4 75.824±1.823 77.883±1.632
GAT + SSFG (α=8.0) 4 77.290±0.469 77.938±0.528
GAT w/o SSFG 16 78.271±0.186 90.212±0.476
GAT + SSFG (α=8.0) 16 81.461±0.123 82.724±0.385
Gatedgcn w/o SSFG 4 84.480±0.122 84.474±0.155
Gatedgcn + SSFG (α=3.0) 4 85.205±0.264 85.283±0.347
Gatedgcn + SSFG (α=4.0) 4 85.016±0.181 84.923±0.202
Gatedgcn + SSFG (α=5.0) 4 85.334±0.175 85.316±0.192
Gatedgcn + SSFG (α=6.0) 4 85.102±0.161 85.066±0.155
Gatedgcn w/o SSFG 16 85.568±0.088 86.007±0.123
Gatedgcn + SSFG (α=3.0) 16 85.723±0.069 85.625±0.072
Gatedgcn + SSFG (α=4.0) 16 85.717±0.020 85.606±0.012
Gatedgcn + SSFG (α=5.0) 16 85.651±0.054 85.595±0.048
Method (CLUSTER) #Layers Test (Acc.) Train (Acc.)
Graphsage w/o SSFG 4 50.454±0.145 54.374±0.203
Graphsage + SSFG (α=5.0) 4 50.562±0.070 53.014±0.025
Graphsage w/o SSFG 16 63.844±0.110 86.710±0.167
Graphsage + SSFG (α=7.0) 16 66.851±0.066 79.220±0.023
GAT w/o SSFG 4 57.732±0.323 58.331±0.342
GAT + SSFG (α=7.0) 4 59.888±0.044 59.656±0.025
GAT w/o SSFG 16 70.587±0.447 76.074±1.362
GAT + SSFG (α=4.0) 16 73.689±0.088 79.476±0.302
Gatedgcn w/o SSFG 4 60.404±0.419 61.618±0.536
Gatedgcn + SSFG (α=6.0) 4 61.028±0.302 62.415±0.311
Gatedgcn + SSFG (α=7.0) 4 61.222±0.267 62.844±0.352
Gatedgcn + SSFG (α=8.0) 4 61.498±0.267 63.310±0.343
Gatedgcn + SSFG (α=9.0) 4 61.375±0.047 63.049±0.134
Gatedgcn w/o SSFG 16 73.840±0.326 87.880±0.908
Gatedgcn + SSFG (α=4.0) 16 75.671±0.084 83.769±0.035
Gatedgcn + SSFG (α=5.0) 16 75.960±0.020 83.623±0.652
Gatedgcn + SSFG (α=6.0) 16 75.601±0.078 84.516±0.299
Table 2: Node classification results on PATTERN and CLUSTER. We experiment using 4 and 16 layers in the graph networks.
Method (MNIST) Test (Acc.) Train (Acc.)
Graphsage vanilla 97.312±0.097 100.00±0.000
Graphsage + SSFG (α=5.0) 97.943±0.147 99.996±0.002
GAT vanilla 95.535±0.205 99.994±0.008
GAT + SSFG (α=2.0) 97.938±0.075 99.996±0.002
Gatedgcn vanilla 97.340±0.143 100.00±0.000
Gatedgcn + SSFG (α=1.0) 97.848±0.106 99.889±0.035
Gatedgcn + SSFG (α=1.5) 97.730±0.116 99.975±0.004
Gatedgcn + SSFG (α=2.0) 97.985±0.032 99.996±0.001
Gatedgcn + SSFG (α=2.5) 97.703±0.054 99.996±0.001
Method (CIFAR10) Test (Acc.) Train (Acc.)
Graphsage vanilla 65.767±0.308 99.719±0.062
Graphsage + SSFG (α=4.0) 68.803±0.471 89.845±0.166
GAT vanilla 64.223±0.455 89.114±0.499
GAT + SSFG (α=4.0) 66.065±0.171 84.383±0.986
Gatedgcn vanilla 67.312±0.311 94.553±1.018
Gatedgcn + SSFG (α=1.0) 71.585±0.361 83.878±1.146
Gatedgcn + SSFG (α=1.5) 71.938±0.190 87.473±0.593
Gatedgcn + SSFG (α=2.0) 71.383±0.427 87.745±0.973
Gatedgcn + SSFG (α=2.5) 70.913±0.306 88.645±0.750
Table 3: Graph classification results on MNIST and CIFAR10. The number of layers is set to 4.
Method (TSP) Test (F1) Train (F1)
Graphsage w/o SSFG 0.665±0.003 0.669±0.003
Graphsage + SSFG (α=5.0) 0.714±0.003 0.717±0.003
GAT w/o SSFG 0.671±0.002 0.673±0.002
GAT + SSFG (α=300) 0.682±0.000 0.684±0.0001
Gatedgcn w/o SSFG 0.791±0.003 0.793±0.003
Gatedgcn + SSFG (α=4.0) 0.802±0.001 0.804±0.001
Gatedgcn + SSFG (α=5.0) 0.806±0.001 0.807±0.001
Gatedgcn + SSFG (α=6.0) 0.805±0.001 0.808±0.001
Gatedgcn + SSFG (α=7.0) 0.805±0.001 0.807±0.001
Method (COLLAB) Test (Hits@50) Train (Hits@50)
Graphsage w/o SSFG 51.618±0.690 99.949±0.052
Graphsage + SSFG (α=4.0) 53.146±0.230 98.280±1.300
GAT w/o SSFG 51.501±0.962 97.851±1.114
GAT + SSFG (α=4.0) 53.616±0.400 97.700±0.132
GAT + SSFG (α=5.0) 53.908±0.253 97.835±0.182
GAT + SSFG (α=6.0) 54.715±0.069 97.929±0.189
GAT + SSFG (α=7.0) 54.252±0.092 98.084±0.340
Gatedgcn w/o SSFG 52.635±1.168 96.103±1.876
Gatedgcn + SSFG (α=5.0) 53.055±0.671 92.535±0.989
Table 4: Link prediction results on TSP and COLLAB. The number of layers is set to 4.
Method (ZINC) Test (MAE) Train (MAE)
Graphsage vanilla 0.468±0.003 0.251±0.004
Graphsage + SSFG (α=10) 0.441±0.006 0.191±0.005
GAT vanilla 0.475±0.007 0.317±0.006
GAT + SSFG (α=20) 0.466±0.001 0.329±0.010
Gatedgcn vanilla 0.435±0.011 0.287±0.014
Gatedgcn + SSFG (α=4) 0.415±0.007 0.316±0.008
Gatedgcn + SSFG (α=5) 0.413±0.006 0.315±0.008
Gatedgcn + SSFG (α=6) 0.404±0.0003 0.296±0.005
Gatedgcn + SSFG (α=7) 0.398±0.001 0.286±0.007
Table 5: Graph regression results on ZINC. The number of layers is set to 4.

5.1 Experimental Setup

Our experiments are conducted on seven recently released benchmark datasets, i.e., PATTERN, CLUSTER, MNIST, CIFAR10, TSP, COLLAB and ZINC Dwivedi et al. (2020). These datasets are used for four graph-based tasks: node classification (PATTERN, CLUSTER), graph classification (MNIST, CIFAR10), link prediction (TSP, COLLAB) and graph regression (ZINC). The statistics of the seven datasets are given in Table 1.

We closely follow the experimental setup used in the work of Dwivedi et al. (2020). The Adam method Kingma and Ba (2014) is used to train all the models. The learning rate is initialized following Dwivedi et al. (2020) and reduced by a factor of 2 if the loss has not improved for a number of epochs (10 or 20); the training procedure is terminated when the learning rate drops below a small threshold. For the node classification task, we experiment using different numbers of layers (4 and 16). For the remaining tasks, the number of layers is set to 4. For each evaluation, we run the experiment 4 times using different random seeds and report the mean and standard deviation of the 4 results. Our method is implemented using PyTorch Paszke et al. (2017) and the DGL library Wang et al. (2019).
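A rough sketch of this optimization schedule using PyTorch's Adam optimizer and ReduceLROnPlateau scheduler is shown below; the initial learning rate, the stopping threshold and the compute_loss/val_metric callables are placeholders rather than the values and code used in the paper.

```python
import torch


def train(model, train_loader, compute_loss, val_metric,
          init_lr=1e-3, min_lr=1e-5, patience=10, max_epochs=1000):
    """Adam with a halve-on-plateau schedule; stop when the LR gets too small."""
    optimizer = torch.optim.Adam(model.parameters(), lr=init_lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=patience)
    for epoch in range(max_epochs):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            loss = compute_loss(model, batch)   # task-specific loss (user-supplied)
            loss.backward()
            optimizer.step()
        scheduler.step(val_metric(model))       # e.g. validation loss
        # Terminate once the learning rate has been reduced below the threshold.
        if optimizer.param_groups[0]["lr"] < min_lr:
            break
```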

Evaluation Metrics. Following Dwivedi et al. (2020), the following evaluation metrics are used for the different tasks.

  • Accuracy. Weighted average node classification accuracy is used for the node classification task (PATTERN and CLUSTER), and classification accuracy is used for the graph classification task (MNIST and CIFAR10).

  • F1 score is used for evaluation on the TSP dataset, due to high class imbalance, i.e., only the edges in the TSP tour are labeled as positive.

  • Hits@K Hu et al. (2020) is used for the COLLAB dataset, aiming to measure a model's ability to predict future collaboration relationships. This metric ranks each true collaboration against 100,000 randomly sampled negative collaborations and counts the ratio of positive edges that are ranked at K-th place or above (a minimal computation sketch is given after this list).

  • MAE (mean absolute error) is used to evaluate graph regression performance on ZINC.
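The sketch below illustrates the Hits@K computation described above: a positive edge counts as a hit if it would be ranked at K-th place or above among the negative scores. It follows the common definition in Hu et al. (2020); tie handling and other details of the benchmark evaluator may differ.

```python
import torch


def hits_at_k(pos_scores, neg_scores, k=50):
    """Fraction of positive edges ranked within the top k against the negatives."""
    neg_sorted, _ = torch.sort(neg_scores, descending=True)
    if neg_sorted.numel() < k:
        # Fewer than k negatives: every positive is trivially within the top k.
        return 1.0
    kth_best_neg = neg_sorted[k - 1]
    # A positive is a hit if it scores above the k-th highest negative.
    return (pos_scores > kth_best_neg).float().mean().item()
```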

Figure 4: Training and test accuracy/F1/Hits@50/MAE curves with respect to the training epoch, showing that our SSFG regularization is able to address both the overfitting issue and the underfitting issue.

5.2 Experimental Results

5.2.1 Quantitative Results

Table 2 reports the quantitative results of node classification on PATTERN and CLUSTER. It can be seen that applying SSFG regularization effectively improves the test accuracies except for Graphsage on the PATTERN dataset. Applying SSFG regularization to GATs with 16 layers yields 3.190% and 3.102% performance improvements on PATTERN and CLUSTER, respectively. These improvements are higher than those obtained for GATs with 4 layers. For GatedGCNs, applying SSFG regularization yields larger performance improvements on CLUSTER than on PATTERN.

The graph classification results on MNIST and CIFAR10 are shown in Table 3. We see that applying our SSFG regularization method helps improve the performance of the three baseline graph networks on the two datasets. While vanilla Graphsage and GatedGCN perform well compared to GAT on MNIST, the three baseline graph networks with SSFG regularization achieve comparable performance. On CIFAR10, applying our SSFG method to GatedGCN improves the accuracy from 67.312% to 71.938%, yielding a 4.626% performance gain, which is higher than the gains obtained by applying SSFG to Graphsage and GAT.

The link prediction results are given in Table 4. Once again, applying SSFG achieves improved performance for the three baseline graph networks. On TSP, the use of SSFG regularization in Graphsage, GAT and GatedGCN yields 0.049, 0.011 and 0.015 performance gains, respectively. On the COLLAB dataset, applying SSFG to Graphsage and GAT yields 1.528 and 3.214 performance improvements, respectively. For GatedGCN, SSFG only slightly improves the prediction performance.

Table 5 reports the experimental results on ZINC, demonstrating the effectiveness of our SSFG regularization in improving the graph regression performance. Applying the SSFG method improves the overall performance of the three baseline graph networks. For Graphsage, GAT and GatedGCN, the use of SSFG reduces the mean absolute error by 0.027, 0.009 and 0.037, respectively.

We have shown that our SSFG method helps improve the performance of the three baseline graph networks on different graph-based tasks. In most experiments, the performance on test data improves while that on training data decreases; these improvements are obtained by reducing overfitting. It is worth noting that in some experiments, e.g., the experiments on TSP and the experiments with 4-layer GatedGCNs on PATTERN and CLUSTER, the performance on the test data and that on the training data improve simultaneously. This shows that our SSFG method also helps to address the underfitting issue.

Besides, we observe that in most experiments our SSFG regularization method results in smaller standard deviations. For example, the use of SSFG regularization consistently results in smaller standard deviations on CLUSTER, COLLAB and ZINC compared to those obtained without SSFG regularization. This suggests that our SSFG method can stabilize the learning algorithm.

For the graph network that achieves the highest performance improvement in each task, we also show the impact of the value of α used for sampling scaling factors in the quantitative results. We see that the value of α has different impacts for different tasks. Even for the same task, a graph network with a different number of layers may require a different value of α to obtain the best performance. This suggests that the value of α needs to be tuned for the best task performance.

Figure 4 shows the training and test accuracy curves with respect to the training epoch. We see that our SSFG method can help address both the overfitting issue and the underfitting issue. Compared with the work of Dwivedi et al. (2020), we use a larger patience value for the optimizer when learning on some datasets; therefore, it can take more epochs to finish the training procedure. We also observe that on some datasets, for example MNIST, the training procedure takes a comparable number of epochs to that without SSFG regularization. As aforementioned, our method can be seen as a stochastic ReLU function, and the results show that it outperforms the standard ReLU in graph representation learning. It is also worth noting that our method does not increase the number of learnable parameters.

5.2.2 Broader Impact

We have shown that our SSFG regularization method is effective in improving graph representation learning performance. Our SSFG method helps to address both the overfitting issue and the underfitting issue. When used together with ReLU, the SSFG method can be seen as a stochastic ReLU function. This interpretation means our SSFG method is not specific to graph networks. Overfitting and underfitting are also issues for neural networks in other tasks such as image recognition and natural language processing. Our SSFG method could replace the ReLU function in these tasks to improve the overall performance, especially when training data are limited.

6 Conclusions

In this paper, we presented a stochastic regularization method for graph networks. In our method, we stochastically scale features and gradients by a factor sampled from a probability distribution in the training procedure. Our method can help address the oversmoothing issue caused by repeatedly applying graph convolutional layers. We showed that applying stochastic scaling at the feature level is complementary to that at the gradient level in improving the overall performance. When used together with ReLU, our method can also be seen as a stochastic ReLU function. We experimentally validated our SSFG regularization method on seven benchmark datasets for different graph-based tasks. The experimental results demonstrated that our method can help address both the overfitting issue and the underfitting issue.

References

  • [1] X. Bresson and T. Laurent (2017) Residual gated graph convnets. arXiv preprint arXiv:1711.07553. Cited by: §2, §3.1.
  • [2] J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun (2014) Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR2014), CBLS, April 2014, (English (US)). Cited by: §3.1.
  • [3] J. Chen, T. Ma, and C. Xiao (2018) Fastgcn: fast learning with graph convolutional networks via importance sampling. International Conference on Learning Representations. Cited by: §3.1.
  • [4] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pp. 3844–3852. Cited by: §3.1.
  • [5] V. P. Dwivedi, C. K. Joshi, T. Laurent, Y. Bengio, and X. Bresson (2020) Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982. Cited by: Figure 1, §2, §5.1, §5.1, §5.1, §5.2.1.
  • [6] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in neural information processing systems, pp. 1024–1034. Cited by: §2, §3.1, §3.2.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3.1.
  • [8] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §4.2.
  • [9] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec (2020) Open graph benchmark: datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687. Cited by: 3rd item.
  • [10] W. Huang, T. Zhang, Y. Rong, and J. Huang (2018) Adaptive sampling towards fast graph representation learning. In Advances in neural information processing systems, pp. 4558–4567. Cited by: §3.1.
  • [11] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §3.1.
  • [12] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.1.
  • [13] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR2017), (English (US)). Cited by: §3.1.
  • [14] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter (2017) Self-normalizing neural networks. In Advances in neural information processing systems, pp. 971–980. Cited by: §2.
  • [15] Q. Li, Z. Han, and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. AAAI Conference on Artificial Intelligence. Cited by: §2.
  • [16] D. Marcheggiani and I. Titov (2017) Encoding sentences with graph convolutional networks for semantic role labeling. arXiv preprint arXiv:1703.04826. Cited by: §4.2.
  • [17] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted Boltzmann machines. In ICML. Cited by: §2.
  • [18] S. J. Nowlan and G. E. Hinton (1992) Simplifying neural networks by soft weight-sharing. Neural computation 4 (4), pp. 473–493. Cited by: §3.2.
  • [19] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §5.1.
  • [20] Y. Rong, W. Huang, T. Xu, and J. Huang (2020) Dropedge: towards deep graph convolutional networks on node classification. In International Conference on Learning Representations. Cited by: §2, §3.2.
  • [21] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §2, §3.2, §4.1.
  • [22] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph Attention Networks. International Conference on Learning Representations. External Links: Link Cited by: §2, §3.1.
  • [23] M. Wang, L. Yu, D. Zheng, Q. Gan, Y. Gai, Z. Ye, M. Li, J. Zhou, Q. Huang, C. Ma, et al. (2019) Deep graph library: towards efficient and scalable deep learning on graphs. arXiv preprint arXiv:1909.01315. Cited by: §5.1.
  • [24] S. Wu, Y. Tang, Y. Zhu, L. Wang, X. Xie, and T. Tan (2019) Session-based recommendation with graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 346–353. Cited by: §2.
  • [25] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip (2020) A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §3.1.
  • [26] C. Yang, R. Wang, S. Yao, S. Liu, and T. Abdelzaher (2020) Revisiting "over-smoothing" in deep GCNs. arXiv preprint arXiv:2003.13663. Cited by: §2.
  • [27] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D. Yeung (2018) Gaan: gated attention networks for learning on large and spatiotemporal graphs. UAI 2018. Cited by: §3.1.
  • [28] M. Zhang and Y. Chen (2018) Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, pp. 5165–5175. Cited by: §2.