Recently, research on analyzing graphs with machine learning has received increasing attention, mainly focusing on node classification (Kipf and Welling, 2016), link prediction (Zhu et al., 2016) and clustering tasks (Fortunato, 2010). Graph convolutions, as the generalization of traditional convolutions from the Euclidean domain to the non-Euclidean domain, have been leveraged to design Graph Neural Networks that deal with a wide range of graph-based machine learning tasks.
Graph Convolutional Networks (GCNs) (Kipf and Welling, 2016) generalize convolutional neural networks (CNNs) to graph-structured data from the perspective of spectral theory, building on prior works (Bruna et al., 2013; Defferrard et al., 2016). It has been demonstrated that GCN and its variants (Hamilton et al., 2017; Velickovic et al., 2017; Dai et al., 2018; Chen and Zhu, 2017) significantly outperform traditional multi-layer perceptron (MLP) models and prior graph embedding approaches (Tang et al., 2015; Perozzi et al., 2014; Grover and Leskovec, 2016).
However, GCNs still have several deficiencies, and in this paper we propose to apply Virtual Adversarial Training (VAT) to GCNs to address them. In particular, we first highlight the importance of VAT for GCNs from the following aspects, which form the motivation of our approach.
Lacking the Leverage of Unlabeled Data for GCNs. The optimization of GCNs is solely based on the labeled nodes. Concretely speaking, GCNs directly distribute gradient information over the entire labeled set of nodes from the supervised loss. Due to the lack of loss on unlabeled data, the parameters that are not involved in the receptive field may not be updated (Chen and Zhu, 2017), resulting in the inefficiency of information propagation of GCNs.
The Smoothness of GCNs. Bruna et al. (2013) firstly define the spectral convolutional operation on graphs and point out that adding the smoothness constraint on the spectrum of the filters improves classification results, since the filters are enforced to have better spatial localization. Defferrard et al. (2016) utilize Chebyshev polynomials to approximate the spectral convolutions and also state that spectral convolutions rely on the smoothness in Fourier domain. Since GCNs are established on spectral theory mentioned above and are equivalent to Symmetrical Laplacian Smoothing (Li et al., 2018), the performance of GCN actually heavily depends on the effect of its smoothness.
Effect of Regularization in Semi-Supervised Learning. Regularization plays a crucial role in semi-supervised learning, including graph-based learning tasks. On the one hand, by introducing regularization, a model can make full use of unlabeled data, thus enhancing its performance in semi-supervised learning. On the other hand, regularization can also be regarded as prior knowledge that smooths the posterior output. For the GCN model, a good regularization can not only leverage the unlabeled data to refine its optimization, but also benefit the smoothness of GCNs, resulting in an improved generalization performance.
Virtual Adversarial Regularization on GCNs. Virtual Adversarial Training (VAT) (Miyato et al., 2018) performs adversarial training without label information to impose local smoothness on the classifier, which is especially beneficial to semi-supervised learning. In particular, VAT smooths the model anisotropically in the direction in which the model is the most sensitive, i.e., the adversarial direction, to improve the generalization performance of the model. In addition, the existence of robustness issues in GCNs has been explored in recent works (Zügner and Günnemann, 2018; Zügner et al., 2018), which makes VAT applicable to graph-based learning tasks.
Due to the fact that VAT has been successfully applied on semi-supervised image classification (Miyato et al., 2018; Yu et al., 2018) and text classification (Miyato et al., 2016), a natural question could be asked: Can we utilize the efficacy of VAT to improve the performance of GCNs in semi-supervised node classification?
Following this motivation, we formally introduce VAT regularization on top of the original supervised loss of GCNs in the semi-supervised node classification task. Concretely, we first provide a detailed theoretical analysis of GCNs, focusing on the first-order approximation of local spectral convolutions and the resulting Symmetric Laplacian Smoothing (Li et al., 2018), to demystify how GCNs work in semi-supervised learning. Then, based on the motivation described above, we describe how to apply VAT to GCNs by additionally imposing a virtual adversarial loss on the basic loss of GCNs, resulting in the GCNVAT algorithm framework. Next, because node features are typically sparse, in our implementation we add virtual adversarial perturbations to sparse and dense features, respectively, and obtain the GCNSVAT and GCNDVAT algorithms. Finally, in the experimental part, we demonstrate the effectiveness of the two approaches under different training sizes and refine a theoretical analysis of the sensitivity to the hyper-parameters of VAT, which facilitates applying our approaches in real applications involving graph-based machine learning tasks. In summary, the contributions of the paper are listed below:
To the best of our knowledge, we are the first to focus on applying better regularization on original GCN to refine its generalization performance.
We are the first to successfully transfer the efficacy of Virtual Adversarial Training (VAT) to the semi-supervised node classification on graphs and point out the difference compared with image and text classification setting.
We refine the sensitivity analysis of hyper-parameters in GCNSVAT and GCNDVAT algorithms, facilitating the deployment of our methods in real scenarios.
2 GCNs with Virtual Adversarial Training
In this section, we will elaborate how the GCNs work in semi-supervised learning and how to utilize the virtual adversarial training to improve the local smoothness of GCNs.
2.1 Semi-Supervised Classification with GCNs
Graph Convolutional Networks (GCNs) are derived from the first-order approximation of localized spectral filters (Kipf and Welling, 2016) and are ultimately equivalent to Symmetric Laplacian Smoothing (Li et al., 2018). First, we denote a graph by $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the vertex set and $\mathcal{E}$ is the edge set. $X$ and $A$ are the feature matrix and adjacency matrix of the graph, respectively, and $D = \mathrm{diag}(d_1, \ldots, d_n)$ denotes the degree matrix of $A$, where $d_i = \sum_j A_{ij}$ is the degree of vertex $i$.
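GCNs are based on graph spectral theory. For efficient computation, Defferrard et al. (2016) approximate the spectral filter with Chebyshev polynomials up to order $K$:

$$g_{\theta'}(\Lambda) \approx \sum_{k=0}^{K} \theta'_k \, T_k(\tilde{\Lambda}),$$

where $\Lambda$ is the eigenvalue matrix of the normalized graph Laplacian, $\tilde{\Lambda} = \frac{2}{\lambda_{\max}} \Lambda - I$ is its rescaled version, $T_k$ are the Chebyshev polynomials and $\theta'$ is a vector of Chebyshev coefficients. Further, Kipf and Welling (2016) simplified this model by limiting $K = 1$ and approximating $\lambda_{\max}$ by 2. Then the first-order approximation of the spectral graph convolution is defined as:

$$g_{\theta} \star x \approx \theta \left( I + D^{-1/2} A D^{-1/2} \right) x,$$

where $\theta$ is the only Chebyshev coefficient left. Through the renormalization trick, the final form of the two-layer graph convolutional network (Kipf and Welling, 2016) is:

$$Z = \mathrm{softmax}\left( \hat{A} \, \mathrm{ReLU}\left( \hat{A} X W^{(0)} \right) W^{(1)} \right),$$

where $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ and $\tilde{A} = A + I$. $\tilde{D}$ is the degree matrix of $\tilde{A}$. $Z$ is the obtained embedding matrix of the nodes, $W^{(0)}$ is the input-to-hidden weight matrix and $W^{(1)}$ is the hidden-to-output weight matrix.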
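The two-layer propagation rule of Kipf and Welling (2016) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the toy path graph, the random weights and the helper names `normalize_adjacency` and `gcn_forward` are assumptions made for the example, not part of the original implementation.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a dense adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])      # add self-loops
    d_tilde = A_tilde.sum(axis=1)         # degrees of A + I, always >= 1
    d_inv_sqrt = 1.0 / np.sqrt(d_tilde)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def softmax(logits):
    """Row-wise softmax with a max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def gcn_forward(A_hat, X, W0, W1):
    """Two-layer GCN: softmax(A_hat ReLU(A_hat X W0) W1)."""
    H = np.maximum(A_hat @ X @ W0, 0.0)   # hidden layer with ReLU
    return softmax(A_hat @ H @ W1)

# Toy path graph 0 - 1 - 2 with random features and weights (illustrative only).
rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = rng.normal(size=(3, 4))       # 3 nodes, 4 input features
W0 = rng.normal(size=(4, 16))     # input-to-hidden weights, 16 hidden units
W1 = rng.normal(size=(16, 2))     # hidden-to-output weights, 2 classes
A_hat = normalize_adjacency(A)
Z = gcn_forward(A_hat, X, W0, W1)
```

Each row of `Z` is a class distribution for one node. In practice the adjacency and feature matrices would be stored in a sparse format rather than as dense arrays.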
Symmetric Laplacian Smoothing
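Li et al. (2018) point out that the reason why GCNs work lies in the Symmetric Laplacian Smoothing performed by this type of spectral convolution. We simplify it as follows:

$$\hat{y}_i = \sum_{j} \frac{\tilde{a}_{ij}}{\sqrt{\tilde{d}_i \tilde{d}_j}} \, x_j,$$

where $\hat{y}_i$ is the first-layer embedding of node $i$ from the features $x_j$, $\tilde{a}_{ij}$ are the entries of $\tilde{A} = A + I$ and $\tilde{d}_i$ is the degree of node $i$ in $\tilde{A}$. The corresponding matrix formulation is as follows:

$$\hat{Y} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X,$$

where $\hat{Y}$ is the one-layer embedding matrix of the feature matrix $X$.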
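As a small sanity check, the node-wise and matrix forms of the smoothing step can be compared numerically. The sketch below assumes the standard symmetric form from Li et al. (2018), in which each node sums the features of its neighbors (self-loop included) weighted by the inverse square root of the product of the two degrees. The three-node graph and feature values are made up for illustration.

```python
import numpy as np

# Toy star graph: node 0 is connected to nodes 1 and 2 (illustrative values).
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
X = np.array([[1., 0.],
              [0., 1.],
              [2., 2.]])

A_tilde = A + np.eye(3)           # adjacency with self-loops
d_tilde = A_tilde.sum(axis=1)     # degrees in A + I

# Node-wise form: each node aggregates degree-weighted neighbor features.
Y_nodewise = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        weight = A_tilde[i, j] / np.sqrt(d_tilde[i] * d_tilde[j])
        Y_nodewise[i] += weight * X[j]

# Matrix form: D~^{-1/2} A~ D~^{-1/2} X.
D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
Y_matrix = D_inv_sqrt @ A_tilde @ D_inv_sqrt @ X
```

Both computations give the same smoothed features, which is why one GCN layer can be read as one step of symmetric Laplacian smoothing followed by a linear map and a nonlinearity.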
Finally, the loss function is defined as the cross-entropy error over all labeled nodes:

$$L_0 = -\sum_{l \in \mathcal{Y}_L} \sum_{f=1}^{F} Y_{lf} \ln Z_{lf},$$

where $\mathcal{Y}_L$ is the set of node indices that have labels, $F$ is the number of classes, $Y$ is the one-hot label matrix and $Z$ is the softmax output of the GCN. In fact, the performance of GCNs heavily depends on the efficiency of this Laplacian smoothing convolution, as demonstrated in (Li et al., 2018; Kipf and Welling, 2016). Therefore, designing a good regularization to refine the smoothness of GCNs plays a crucial role in improving their performance.
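The supervised loss only reads the rows of the output that belong to labeled nodes. A minimal sketch follows; the function name `masked_cross_entropy`, the small constant inside the logarithm and the toy predictions are all assumptions made for the example.

```python
import numpy as np

def masked_cross_entropy(Z, Y_onehot, labeled_idx):
    """Cross entropy summed over labeled nodes only; other rows are ignored."""
    Z_l = Z[labeled_idx]
    Y_l = Y_onehot[labeled_idx]
    return -np.sum(Y_l * np.log(Z_l + 1e-12))   # small constant guards log(0)

# Toy predictions for three nodes and two classes (illustrative values).
Z = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])
Y = np.array([[1., 0.],
              [0., 1.],
              [0., 0.]])          # node 2 is unlabeled, its row is never read
labeled_idx = np.array([0, 1])
loss = masked_cross_entropy(Z, Y, labeled_idx)

# Changing the prediction of the unlabeled node leaves the loss untouched.
Z_changed = Z.copy()
Z_changed[2] = [0.99, 0.01]
loss_changed = masked_cross_entropy(Z_changed, Y, labeled_idx)
```

Because `loss` and `loss_changed` coincide, the unlabeled node contributes no direct gradient signal, which is the gap that a regularizer defined on unlabeled nodes is meant to fill.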
2.2 Virtual Adversarial Training in GCNs
Virtual Adversarial Training (VAT) (Miyato et al., 2018) is a regularization method that trains the output distribution to be isotropically smooth around each input data point by selectively smoothing the model in its most anisotropic direction, namely adversarial direction. In this section, we apply VAT on GCNs to improve the local smoothness of GCNs.
Firstly, both VAT and GCNs mainly focus on semi-supervised setting, in which two assumptions should be implicitly met (Yu et al., 2018):
Manifold Assumption. The observed data $x$ presented in a high-dimensional space $\mathbb{R}^d$ is, with high probability, concentrated in the vicinity of some underlying manifold $\mathcal{M}$ of much lower dimension.
Smoothness Assumption. If two points $x_1, x_2 \in \mathcal{M}$ are close in manifold distance, then the conditional probabilities $p(y \mid x_1)$ and $p(y \mid x_2)$ should be similar. In other words, the true classifier, or the true conditional distribution $p(y \mid x)$, varies smoothly along the underlying manifold $\mathcal{M}$.
In the node classification task, GCNs, which involve a graph embedding process, also implicitly conform to these assumptions. There is an underlying manifold in the process of graph embedding, and the conditional distribution of the embedding vectors is expected to vary smoothly along it. In this way, we can utilize VAT to smooth the embedding of nodes in the adversarial direction to improve the generalization of GCNs.
Difference of VAT on Graph and Image, Text.
Traditional VAT (Miyato et al., 2018) was proposed for image classification, while VAT for text classification (Miyato et al., 2016) is applied to the word embedding vector of each word. For VAT on graphs, we simply apply VAT to the node features for ease of implementation. Another obvious difference is that, unlike in image and text classification, the nodes in the node classification task are not independent of each other. The classification result of each node depends not only on its own features but also on the features of its neighbors, resulting in a Propagation Effect of the perturbations on the features of each node. We use $\mathcal{D}_l$ and $\mathcal{D}_{ul}$ to denote the datasets of labeled and unlabeled nodes, respectively, and $X_{-i}$ to represent the features of all nodes excluding those of the current node $i$.
Adversarial Training in GCNs
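Here we formally define adversarial training in GCNs, where adversarial perturbations are added solely to the features of labeled nodes:

$$L_{\mathrm{adv}}(x_l, W) = D\left[ q(y \mid x_l, X_{-l}), \; p(y \mid x_l + r_{\mathrm{adv}}, X_{-l}, W) \right], \qquad r_{\mathrm{adv}} = \arg\max_{r; \, \|r\| \le \epsilon} D\left[ q(y \mid x_l, X_{-l}), \; p(y \mid x_l + r, X_{-l}, W) \right],$$

where $D[q, p]$ measures the divergence between two distributions $q$ and $p$. $q$ is the true distribution of output labels, usually a one-hot vector, and $p$ denotes the distribution predicted by the GCN with parameters $W$. $x_l$ represents the feature of the current labeled node, $X_{-l}$ the features of the remaining nodes, and $r$ the adversarial perturbation on the feature $x_l$. When the true distribution is denoted by a one-hot vector $h(y)$, the perturbation $r_{\mathrm{adv}}$ in the $L_2$ norm can be linearly approximated:

$$r_{\mathrm{adv}} \approx \epsilon \frac{g}{\|g\|_2}, \qquad g = \nabla_{x_l} D\left[ h(y), \; p(y \mid x_l, X_{-l}, W) \right].$$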
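For a one-hot target, the linearized adversarial perturbation is the gradient of the loss with respect to the node's own features, rescaled to length $\epsilon$. The sketch below illustrates this on a one-layer linear GCN, chosen only because its gradient has a closed form; the simplified model, the toy graph and all helper names are assumptions rather than the setup used in the experiments.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def predict(A_hat, X, W):
    """One-layer linear GCN (a simplification used only for this sketch)."""
    return softmax(A_hat @ X @ W)

def adversarial_perturbation(A_hat, X, W, node, label, eps):
    """eps * g / ||g||_2, with g the cross-entropy gradient w.r.t. x_node."""
    P = predict(A_hat, X, W)
    h = np.zeros(W.shape[1])
    h[label] = 1.0
    # d CE / d logits[node] = P[node] - h; the chain rule through
    # logits[node] = sum_j A_hat[node, j] x_j W gives the gradient w.r.t. x_node.
    g = A_hat[node, node] * (P[node] - h) @ W.T
    return eps * g / (np.linalg.norm(g) + 1e-12)

rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
A_hat = normalize_adjacency(A)
X = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))
node, label, eps = 0, 1, 0.1      # hypothetical labeled node and budget

r_adv = adversarial_perturbation(A_hat, X, W, node, label, eps)
X_adv = X.copy()
X_adv[node] += r_adv              # only the labeled node's features move
loss_clean = -np.log(predict(A_hat, X, W)[node, label])
loss_adv = -np.log(predict(A_hat, X_adv, W)[node, label])
```

Since the cross entropy of this linear model is convex in the node features, a step along the gradient direction increases the loss of the perturbed node, so `loss_adv` exceeds `loss_clean`.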
Virtual Adversarial Loss
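In order to utilize the unlabeled data, we would need to evaluate the true conditional probability $q(y \mid x)$, which is unknown for unlabeled nodes. Therefore, we use the current estimate $p(y \mid x, \hat{W})$ in place of $q(y \mid x)$.
Then the virtual adversarial regularization is constructed from the inner maximization of the loss:

$$\mathrm{LDS}(x, W) = D\left[ p(y \mid x, \hat{W}), \; p(y \mid x + r_{\mathrm{vadv}}, W) \right], \qquad r_{\mathrm{vadv}} = \arg\max_{r; \, \|r\|_2 \le \epsilon} D\left[ p(y \mid x, \hat{W}), \; p(y \mid x + r, \hat{W}) \right].$$

The final regularization term we propose in this study is the average of $\mathrm{LDS}(x, W)$ over all input nodes:

$$R_{\mathrm{vadv}}(\mathcal{D}_l, \mathcal{D}_{ul}, W) = \frac{1}{N_l + N_{ul}} \sum_{x \in \mathcal{D}_l \cup \mathcal{D}_{ul}} \mathrm{LDS}(x, W),$$

where $N_l$ and $N_{ul}$ are the numbers of labeled and unlabeled nodes, respectively.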
Virtual Adversarial Training
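The full objective function is thus given by:

$$L = L_0 + \alpha \, R_{\mathrm{vadv}}(\mathcal{D}_l, \mathcal{D}_{ul}, W),$$

where $L_0$ is the supervised loss constructed from the labeled nodes in GCNs, $\alpha$ denotes the regularization coefficient, and the VAT regularization $R_{\mathrm{vadv}}$ is computed from both labeled and unlabeled nodes.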
2.3 Fast Approximation of VAT in GCNs
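The key of VAT in GCNs is the approximation of $r_{\mathrm{vadv}}$, where

$$r_{\mathrm{vadv}} = \arg\max_{r; \, \|r\|_2 \le \epsilon} D(r, x, W), \qquad D(r, x, W) := D\left[ p(y \mid x, \hat{W}), \; p(y \mid x + r, W) \right].$$

Just like the situation in traditional VAT, the evaluation of $r_{\mathrm{vadv}}$ in GCNs with VAT cannot be performed with the linear approximation, since the divergence attains its minimum at $r = 0$:

$$\nabla_r D(r, x, W) \big|_{r = 0} = 0.$$

Therefore, a second-order approximation is needed:

$$D(r, x, W) \approx \frac{1}{2} r^{\top} H(x, W) \, r,$$

where $H(x, W) := \nabla \nabla_r D(r, x, W) \big|_{r = 0}$. Then the evaluation of $r_{\mathrm{vadv}}$ can be approximated by:

$$r_{\mathrm{vadv}} \approx \arg\max_{r} \left\{ r^{\top} H(x, W) \, r ; \; \|r\|_2 \le \epsilon \right\} = \epsilon \, u(x, W),$$

where $u(x, W)$ is the first dominant eigenvector of $H(x, W)$ with magnitude 1.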
Power Iteration Approximation
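Following VAT, we also apply the power iteration approximation for the first dominant eigenvector $u$:

$$d \leftarrow \overline{H d},$$

where $\overline{v} = v / \|v\|_2$ denotes normalization to unit length, and $d$ is initialized as a randomly sampled unit vector that finally converges to $u$.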
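Power iteration itself is easy to check on a small symmetric matrix. The sketch below is a generic illustration: the matrix `H`, the iteration count and the function name are assumptions chosen for the example, and the result is compared against NumPy's eigendecomposition.

```python
import numpy as np

def power_iteration(H, num_iters=100, seed=0):
    """Approximate the dominant eigenvector of a symmetric PSD matrix H."""
    rng = np.random.default_rng(seed)
    d = rng.normal(size=H.shape[0])
    d /= np.linalg.norm(d)             # random unit vector as starting point
    for _ in range(num_iters):
        Hd = H @ d
        d = Hd / np.linalg.norm(Hd)    # d <- normalized H d
    return d

# Small symmetric positive-definite matrix standing in for the Hessian.
H = np.array([[4., 1., 0.],
              [1., 3., 0.],
              [0., 0., 1.]])
u = power_iteration(H)

# Reference: eigh returns eigenvalues in ascending order,
# so the last column is the dominant eigenvector.
eigvals, eigvecs = np.linalg.eigh(H)
u_ref = eigvecs[:, -1]
alignment = abs(u @ u_ref)
```

The value of `alignment` approaches 1 as the iterate aligns with the dominant eigenvector (up to sign). Convergence is geometric in the ratio of the two largest eigenvalues.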
Finite Difference Approximation
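We also employ the finite difference approximation for $H d$:

$$H d \approx \frac{\nabla_r D(r, x, W) \big|_{r = \xi d} - \nabla_r D(r, x, W) \big|_{r = 0}}{\xi} = \frac{\nabla_r D(r, x, W) \big|_{r = \xi d}}{\xi},$$

with a small $\xi \neq 0$, where the second equality uses the fact that the gradient vanishes at $r = 0$. After the two approximations, $d$ is evaluated by:

$$d \leftarrow \overline{\nabla_r D(r, x, W) \big|_{r = \xi d}}.$$

As demonstrated in traditional VAT (Miyato et al., 2018), a single power iteration is sufficient to achieve good performance of VAT in GCNs. Thus, the final approximation of $r_{\mathrm{vadv}}$ is:

$$r_{\mathrm{vadv}} \approx \epsilon \frac{g}{\|g\|_2}, \qquad g = \nabla_r D\left[ p(y \mid x, \hat{W}), \; p(y \mid x + r, \hat{W}) \right] \Big|_{r = \xi d}.$$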
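The one-step approximation can be sketched as follows: sample a random unit direction per node, evaluate the gradient of the KL divergence at a tiny step $\xi d$, and rescale it to length $\epsilon$. The example again uses a one-layer linear GCN so that the gradient is available in closed form. That simplified model, the per-node normalization, the value of `xi` and the helper names are assumptions of this sketch, not details confirmed by the text.

```python
import numpy as np

def normalize_adjacency(A):
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def kl_sum(P, Q):
    """Sum over nodes of KL(P_i || Q_i)."""
    return np.sum(P * (np.log(P) - np.log(Q)))

def unit_rows(V):
    """Rescale every row of V to unit L2 norm."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return V / np.maximum(norms, 1e-30)

def virtual_adversarial_perturbation(A_hat, X, W, eps, xi=1e-6, seed=0):
    """One power-iteration step with a finite-difference gradient at r = xi * d."""
    rng = np.random.default_rng(seed)
    P = softmax(A_hat @ X @ W)                 # current estimate, held constant
    d = unit_rows(rng.normal(size=X.shape))    # random unit direction per node
    Q = softmax(A_hat @ (X + xi * d) @ W)
    # Closed-form gradient of sum_i KL(P_i || Q_i) w.r.t. the perturbation
    # for logits = A_hat (X + R) W: A_hat^T (Q - P) W^T.
    g = A_hat.T @ (Q - P) @ W.T
    return eps * unit_rows(g)

rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
A_hat = normalize_adjacency(A)
X = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))
eps = 0.5                                      # hypothetical perturbation budget

r_vadv = virtual_adversarial_perturbation(A_hat, X, W, eps)
P = softmax(A_hat @ X @ W)
lds = kl_sum(P, softmax(A_hat @ (X + r_vadv) @ W))
```

No labels appear anywhere in this computation, so the same routine applies to labeled and unlabeled nodes alike. The factor `A_hat.T` in the gradient shows how a perturbation on one node also reaches the predictions of its neighbors.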
In this section, we will elaborate our Graph Convolutional Networks with Virtual Adversarial Training (GCNVAT) Algorithm. Algorithm 1 summarizes the procedures of the computation of mini-batch SGD for GCNs with VAT algorithm.
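Algorithm 1: Mini-batch SGD for GCNs with VAT (GCNVAT)
Input: feature matrix $X$, adjacency matrix $A$, graph convolutional network with parameters $W$
Output: graph embedding $Z$
1. Sample a mini-batch of nodes from the labeled and unlabeled sets.
2. Generate a random unit vector $d$ for each node in the mini-batch using an i.i.d. Gaussian distribution.
3. Compute $r_{\mathrm{vadv}}$ by taking the gradient of the divergence $D$ with respect to $r$ at $r = \xi d$ and rescaling it to norm $\epsilon$.
4. Compute the average virtual adversarial loss over the mini-batch and the supervised loss over the labeled nodes.
5. Update $W$ with the gradient of the full objective $L_0 + \alpha R_{\mathrm{vadv}}$.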
Our GCNVAT algorithm framework is economical in computation, since the derivative of the full objective function can be computed with at most three sets of back propagation in total. First, by initializing the random unit vector $d$ in the mini-batch and computing the gradient of the divergence between the predicted distribution of the GCN and that under the initial perturbation, we can evaluate the fast approximation of $r_{\mathrm{vadv}}$, which involves the first set of back propagation. Second, after the computation of $r_{\mathrm{vadv}}$, we compute the average virtual adversarial loss in the mini-batch and optimize this loss with the perturbation held fixed, which involves the second set of back propagation. Finally, the third set of back propagation is related to the original supervised loss based on the labeled nodes in GCNs. In summary, with these three sets of back propagation, GCNs with VAT impose a local adversarial regularization on the original supervised loss of GCNs by smoothing the posterior distribution of the model in the most adversarial direction, thereby improving the generalization of the original GCNs.
GCNSVAT and GCNDVAT
In real scenarios, node features are usually sparse, especially for large graphs, and are stored and processed as sparse tensors. For this reason, in the implementation of our GCNVAT algorithm framework, we customize two similar GCNVAT methods for different properties of node features. For GCN Sparse VAT (GCNSVAT), we only apply virtual adversarial perturbations to the specific sparse elements in the feature of each node, which may save much computation time, especially for high-dimensional feature vectors. For GCN Dense VAT (GCNDVAT), we perturb every element of the feature by transforming the sparse feature matrix into a dense one.
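The difference between the two variants can be illustrated with a mask over the stored entries. The sketch below is one possible reading of the description above: it assumes that the sparse variant keeps only the gradient entries where the feature matrix is nonzero and then rescales each row to norm $\epsilon$. The function names and toy values are hypothetical.

```python
import numpy as np

def dense_perturbation(g, eps):
    """Dense variant: every feature entry may be perturbed."""
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    return eps * g / (norms + 1e-12)

def sparse_perturbation(g, X, eps):
    """Sparse variant (assumed masking): perturb only the nonzero entries of X."""
    g_masked = g * (X != 0)            # zero out directions outside the support
    norms = np.linalg.norm(g_masked, axis=1, keepdims=True)
    return eps * g_masked / (norms + 1e-12)

# Toy sparse bag-of-words features and a made-up gradient direction.
X = np.array([[1., 0., 0., 1.],
              [0., 1., 0., 0.]])
g = np.array([[0.3, -0.4, 0.5, 0.1],
              [0.2, 0.6, -0.1, 0.0]])

r_dense = dense_perturbation(g, eps=1.0)
r_sparse = sparse_perturbation(g, X, eps=1.0)
```

With the same budget, the sparse perturbation concentrates its whole norm on a few coordinates, while the dense one spreads it over all coordinates. The sparse variant also keeps `X + r_sparse` as sparse as `X`, which is where the computational saving comes from.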
In the experimental part, we conduct extensive experiments to demonstrate the effectiveness of our GCNSVAT and GCNDVAT algorithms. Firstly, we test the performance of both algorithms under different label rates compared with the original GCN. Then we make another comparison under the standard semi-supervised setting with other state-of-the-art approaches. Finally, a sensitivity analysis of hyper-parameters is provided for broad deployment of our method in real applications.
For the graph datasets, we select three commonly used citation networks: CiteSeer, Cora and PubMed (Sen et al., 2008). Dataset statistics are summarized in Table 1. For all methods based on GCNs, we use the same hyper-parameters as in (Kipf and Welling, 2016): a learning rate of 0.01, a dropout rate of 0.5, 2 convolutional layers and 16 hidden units, without a validation set for fair comparison. As for the VAT hyper-parameters, we fix the regularization coefficient $\alpha$ and only change the perturbation magnitude $\epsilon$ to control the regularization effect under different training sizes, which is further discussed later in the sensitivity analysis part. All results are the mean accuracy of 10 runs to reduce stochastic effects.
4.1 Effect under Different Training Sizes
To verify the consistent effectiveness of our two methods on the improvement of generalization performance, we compare GCNSVAT and GCNDVAT algorithms with original GCN method (Kipf and Welling, 2016) under different training sizes across the three datasets and the results can be observed in Figure 1.
As illustrated in Figure 1, GCNSVAT (the red line) and GCNDVAT (the blue line) consistently outperform the original GCN (the black line) under all tested label rates. It is important to note that, as the label rate increases, the regularization effect imposed by VAT on GCNs diminishes in both approaches, since the improvement from regularization based on unlabeled data decreases. In other words, the advantage of GCNs with Virtual Adversarial Training is especially significant when the training set is small. Fortunately, in real scenarios it is common to observe graphs with only a small number of labeled nodes, so our algorithms are especially practical in these applications.
Choice of GCNSVAT and GCNDVAT. GCNDVAT performs consistently better than GCNSVAT, even though GCNDVAT requires extra computation related to perturbations in the entire feature space. We argue that the reason is that continuous perturbations over all features strengthen the effect of VAT more than discrete perturbations restricted to the sparse features. However, in scenarios where the graphs are large and their features are sparse, GCNSVAT is more appropriate from the perspective of computational cost.
More specifically, we list the detailed performance of GCNSVAT and GCNDVAT compared with the original GCN under different label rates in Tables 2, 3 and 4, respectively. We report the mean accuracy of 10 runs. The results in the tables provide further evidence for the effectiveness of our two methods.
4.2 Effect on Standard Semi-Supervised Learning
Apart from the experiments under different training sizes, we also test the performance of the GCNSVAT and GCNDVAT algorithms in the standard semi-supervised setting with the standard label rates listed in Table 1. In particular, we compare our methods with other state-of-the-art methods on the node classification task under the standard label rate, with the baseline results taken from (Kipf and Welling, 2016).
From Table 5, our GCNDVAT algorithm exhibits state-of-the-art performance, although the improvement is less pronounced than with small training sizes, while our GCNSVAT algorithm achieves similar performance. Through these experiments in semi-supervised learning, we demonstrate that VAT improves the generalization performance of GCNs by additionally providing an adversarial regularization, both in the semi-supervised setting with few labeled nodes and in the standard semi-supervised setting.
4.3 Sensitivity Analysis of Hyper-parameters
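One of the notable advantages of VAT in GCNs is that there are just two scalar-valued hyper-parameters: (1) the perturbation magnitude $\epsilon$ that constrains the norm of the adversarial perturbation and (2) the regularization coefficient $\alpha$ that controls the balance between the supervised loss $L_0$ and the virtual adversarial loss $R_{\mathrm{vadv}}$. We refine the analysis in the original VAT (Miyato et al., 2018) and theoretically demonstrate that the total loss is more sensitive to $\epsilon$ than to $\alpha$ in the regularization control of GCNs with VAT.
Consider the second-order approximation of the virtual adversarial regularization:

$$\max_{r} \left\{ D(r, x, W) ; \; \|r\|_2 \le \epsilon \right\} \approx \max_{r} \left\{ \frac{1}{2} r^{\top} H(x, W) \, r ; \; \|r\|_2 \le \epsilon \right\} = \frac{1}{2} \epsilon^2 \lambda_1(x, W),$$

where $\lambda_1(x, W)$ is the dominant eigenvalue of the Hessian matrix $H(x, W)$ of $D(r, x, W)$. Substituting this into the objective function, we obtain

$$L = L_0 + \alpha R_{\mathrm{vadv}} \approx L_0 + \frac{1}{2} \alpha \epsilon^2 \cdot \frac{1}{N_l + N_{ul}} \sum_{x \in \mathcal{D}_l \cup \mathcal{D}_{ul}} \lambda_1(x, W).$$

Thus, the strength of regularization is approximately proportional to $\alpha$ and $\epsilon^2$. Since the regularization term is more sensitive to changes in $\epsilon$, in our experiments we tune only the perturbation magnitude $\epsilon$ to control the regularization, keeping $\alpha$ fixed for both methods.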
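The different scaling in the two hyper-parameters can be checked numerically on a small quadratic form. The sketch below is purely illustrative: the $2 \times 2$ matrix stands in for the Hessian, and the values of $\alpha$ and $\epsilon$ are arbitrary choices for the example rather than the settings used in the experiments.

```python
import numpy as np

def regularizer_strength(H, alpha, eps):
    """alpha * max_{||r|| <= eps} 0.5 r^T H r = 0.5 * alpha * eps^2 * lambda_1(H)."""
    lam1 = np.linalg.eigvalsh(H)[-1]   # eigenvalues come in ascending order
    return 0.5 * alpha * eps ** 2 * lam1

# Small symmetric positive-definite matrix standing in for the Hessian.
H = np.array([[2.0, 0.5],
              [0.5, 1.0]])

base = regularizer_strength(H, alpha=1.0, eps=1.0)
double_alpha = regularizer_strength(H, alpha=2.0, eps=1.0)
double_eps = regularizer_strength(H, alpha=1.0, eps=2.0)

# Brute-force check of the closed form: scan the circle of radius eps = 2.
theta = np.linspace(0.0, 2.0 * np.pi, 10001)
R = 2.0 * np.stack([np.cos(theta), np.sin(theta)], axis=1)
brute_force = 0.5 * np.max(np.einsum("ni,ij,nj->n", R, H, R))
```

Doubling $\alpha$ doubles the regularization strength, whereas doubling $\epsilon$ quadruples it, and `brute_force` agrees with `double_eps` up to the grid resolution.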
Further, we present the relation between the selected optimal $\epsilon$ and the label rate. It is natural to expect that GCNs with VAT under a lower label rate require stronger VAT regularization, and hence a larger optimal $\epsilon$. We empirically verify this in Figure 2.
From Figure 2, it is easy to observe that as the label rate increases, the optimal $\epsilon$ shows a descending trend for both GCNSVAT and GCNDVAT across the three datasets. This meets our expectation, since stronger VAT regularization is needed for GCNs under lower label rates to obtain the best generalization. In addition, the optimal $\epsilon$ in GCNSVAT under the same label rate tends to be higher than that in GCNDVAT, especially when the label rate is lower. The reason is that GCNSVAT only applies perturbations to the specific nonzero elements of the sparse feature of each node, thus requiring larger perturbations on those features to achieve a regularization effect similar to that of GCNDVAT.
5 Discussions and Conclusion
GCNs with Virtual Adversarial Training build on adversarial training in GCNs, which in our paper is restricted to adversarial perturbations on the features of nodes. However, there may exist a better form of adversarial training in GCNs that additionally considers changes to the edges to which the output is sensitive. Incorporating such a form of Virtual Adversarial Training into graphs may therefore further improve the generalization of GCNs. Besides, how to combine VAT with other forms of Graph Neural Networks, especially in the inductive setting, is also worth exploring in the future.
In our paper, we impose VAT regularization on the original supervised loss of GCN to enhance its generalization in semi-supervised learning, resulting in GCNSVAT and GCNDVAT, whose perturbations are added in sparse and dense features, respectively. Particularly, we apply VAT on GCNs in a theoretical way by additionally imposing virtual adversarial loss on the basic supervised loss of GCNs. Then we empirically demonstrate the improvement caused by the VAT regularization under different training sizes across three datasets. Our endeavour validates that smoothing anisotropic direction on the posterior distribution of GCNs suffices to improve the Symmetric Laplacian Smoothing of original GCN model.
- Bruna et al.  Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
- Chen and Zhu  Jianfei Chen and Jun Zhu. Stochastic training of graph convolutional networks. arXiv preprint arXiv:1710.10568, 2017.
- Dai et al.  Hanjun Dai, Zornitsa Kozareva, Bo Dai, Alex Smola, and Le Song. Learning steady-states of iterative algorithms over graphs. In International Conference on Machine Learning, pages 1114–1122, 2018.
- Defferrard et al.  Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
- Fortunato  Santo Fortunato. Community detection in graphs. Physics reports, 486(3-5):75–174, 2010.
- Grover and Leskovec  Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.
- Hamilton et al.  Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
- Kipf and Welling  Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
- Li et al.  Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. arXiv preprint arXiv:1801.07606, 2018.
- Miyato et al.  Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725, 2016.
- Miyato et al.  Takeru Miyato, Shin-ichi Maeda, Shin Ishii, and Masanori Koyama. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 2018.
- Perozzi et al.  Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM, 2014.
- Sen et al.  Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93, 2008.
- Tang et al.  Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. International World Wide Web Conferences Steering Committee, 2015.
- Velickovic et al.  Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 1(2), 2017.
- Yu et al.  Bing Yu, Jingfeng Wu, and Zhanxing Zhu. Tangent-normal adversarial regularization for semi-supervised learning. arXiv preprint arXiv:1808.06088, 2018.
- Zhu et al.  Jun Zhu, Jiaming Song, and Bei Chen. Max-margin nonparametric latent feature models for link prediction. arXiv preprint arXiv:1602.07428, 2016.
- Zügner and Günnemann  Daniel Zügner and Stephan Günnemann. Adversarial attacks on graph neural networks via meta learning. 2018.
- Zügner et al.  Daniel Zügner, Amir Akbarnejad, and Stephan Günnemann. Adversarial attacks on neural networks for graph data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2847–2856. ACM, 2018.