GAE-AT
On Generalization of Graph Autoencoders with Adversarial Training[ECML2021]
Adversarial training is an approach for increasing a model's resilience against adversarial perturbations. Such approaches have been demonstrated to result in models with feature representations that generalize better. However, little work has been done on adversarial training of models on graph data. In this paper, we ask: does adversarial training improve the generalization of graph representations? We formulate L2 and L∞ versions of adversarial training in two powerful node embedding methods: graph autoencoder (GAE) and variational graph autoencoder (VGAE). We conduct extensive experiments on three main applications of GAE and VGAE, i.e. link prediction, node clustering and graph anomaly detection, and demonstrate that both L2 and L∞ adversarial training boost the generalization of GAE and VGAE.
Networks are ubiquitous in real-world applications; they contain relationships between entities as well as attributes of entities. Modeling such data is challenging due to its non-Euclidean nature. Recently, graph embedding, which converts graph data into a low dimensional feature space, has emerged as a popular method for modeling graph data. For example, DeepWalk [14], node2vec [7] and LINE [23] learn graph embeddings by extracting patterns from the graph. Graph Convolutional Networks (GCNs) [9] learn graph embeddings by repeated multiplication of the normalized adjacency matrix and the feature matrix. In particular, the graph autoencoder (GAE) [10, 24, 27] and the variational graph autoencoder (VGAE) [10] have been shown to be powerful unsupervised node embedding methods. They have been applied to many machine learning tasks, e.g. node clustering [16, 24, 20], link prediction [19, 10] and graph anomaly detection [13, 4].
Adversarial training is an approach for increasing a model's resilience against adversarial perturbations by including adversarial examples in the training set [11]. Several recent studies demonstrate that adversarial training improves feature representations, leading to better performance on downstream tasks [26, 18]. However, little work in this direction has been done for GAE and VGAE. Besides, real-world graphs are usually highly noisy and incomplete, which may lead to sub-optimal results for standard trained models [32]. Therefore, we seek answers to the following two questions:
Does adversarial training improve generalization, i.e. the performance in applications of node embeddings learned by GAE and VGAE?
Which factors influence this improvement?
In order to answer the first question above, we first formulate L2 and L∞ adversarial training for GAE and VGAE. Then, we select three main tasks of VGAE and GAE, namely link prediction, node clustering and graph anomaly detection, for evaluating the generalization performance brought by adversarial training. Besides, we empirically explore which factors affect the generalization performance brought by adversarial training.
Contributions: To the best of our knowledge, we are the first to explore generalization for GAE and VGAE using adversarial training. We formulate L2 and L∞ adversarial training, and empirically demonstrate that both L2 and L∞ adversarial training boost the generalization by a large margin for the node embeddings learned by GAE and VGAE. An additional interesting finding is that the generalization performance of the proposed adversarial training is more sensitive to attributes perturbation than to adjacency matrix perturbation, and is not sensitive to the degree of nodes.
Adversarial training has been extensively studied on images. Whether adversarial training can help generalization has been an important question. Tsipras et al. [25] illustrate with a simple designed task that adversarial robustness could conflict with a model's generalization. However, Stutz et al. [21] demonstrate that adversarial training with on-manifold adversarial examples helps generalization. Besides, Salman et al. [18] and Utrera et al. [26] show that the latent features learned by adversarial training are improved and boost the performance of their downstream tasks.
Recently, a few works have brought adversarial training to graph data. Deng, Dong and Zhu [3] and Sun et al. [22] propose virtual graph adversarial training to promote the smoothness of the model. Feng et al. [5] propose graph adversarial training by inducing dynamical regularization. Dai et al. [2] formulate an interpretable adversarial training for DeepWalk. Jin and Zhang [8] introduce latent adversarial training for GCN, which trains the GCN on the adversarially perturbed output of the first layer. Besides, several studies explore adversarial training based on adversarially perturbed edges for graph data [31, 1, 28]. Among these works, some pay attention to achieving model robustness while ignoring the effect on generalization [31, 8, 1, 28, 2], and the others only utilize perturbations on nodal attributes without exploring the effect of perturbations on edges [3, 22, 5]. The difference between these works and ours is two-fold: (1) We extend both L2 and L∞ adversarial training to graph models, while the previous studies explore only a single norm constraint. (2) We focus on the generalization effect brought by adversarial training for unsupervised deep learning graph models, i.e. GAE and VGAE, while most of the previous studies focus on adversarial robustness for supervised/semi-supervised models.
We first summarize some notations and definitions used in this paper. Following the commonly used notations, we use bold uppercase characters for matrices, e.g. $\mathbf{X}$, bold lowercase characters for vectors, e.g. $\mathbf{b}$, and normal lowercase characters for scalars, e.g. $c$. The $i$-th row of a matrix $\mathbf{A}$ is denoted by $\mathbf{A}_{i,:}$ and the $(i,j)$-th element of matrix $\mathbf{A}$ is denoted as $\mathbf{A}_{i,j}$.
We consider an attributed network $G = \{V, E, \mathbf{X}\}$ with $|V| = n$ nodes, $|E| = m$ edges and node attribute matrix $\mathbf{X}$. $\mathbf{A}$ is the binary adjacency matrix of $G$.
Graph autoencoders are a kind of unsupervised learning model on graph-structured data [10], which aim at learning low dimensional representations for each node by reconstructing the inputs. They have been demonstrated to achieve competitive results in multiple tasks, e.g. link prediction [16, 10, 17], node clustering [16, 24, 20] and graph anomaly detection [4, 13]. Generally, a graph autoencoder consists of a graph convolutional network as the encoder and an inner product as the decoder [10]. Formally, it can be expressed as follows:
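$$\mathbf{Z} = f_\theta(\mathbf{A}, \mathbf{X}) \tag{1}$$
$$\hat{\mathbf{A}} = \sigma(\mathbf{Z}\mathbf{Z}^{\top}) \tag{2}$$
where $\sigma$ is the sigmoid function, $f_\theta$ is a graph convolutional network with parameters $\theta$, $\mathbf{Z}$ is the learned low dimensional representations and $\hat{\mathbf{A}}$ is the reconstructed adjacency matrix.
During the training phase, the parameters $\theta$ are updated by minimizing the reconstruction loss. Usually, the reconstruction loss is expressed as the cross-entropy loss between $\mathbf{A}$ and $\hat{\mathbf{A}}$ [10]:
$$\mathcal{L}_{ae} = -\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left[\mathbf{A}_{i,j}\log\hat{\mathbf{A}}_{i,j} + (1-\mathbf{A}_{i,j})\log\big(1-\hat{\mathbf{A}}_{i,j}\big)\right] \tag{3}$$
where $n$ is the number of nodes.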
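As an illustration of Eq. 1 to 3, the following is a minimal pure-Python sketch of a GAE forward pass and its reconstruction loss. It is not the authors' implementation: the two-layer GCN with symmetric normalization follows the usual GCN formulation, and the toy graph, the weight matrices `w0` and `w1`, and all function names are assumptions made for this example.

```python
import math

def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 commonly used by GCN layers."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def gcn_encoder(adj, feats, w0, w1):
    """Two-layer GCN encoder: Z = A_norm * ReLU(A_norm * X * W0) * W1 (Eq. 1)."""
    a_norm = normalize_adjacency(adj)
    hidden = [[max(0.0, v) for v in row] for row in matmul(a_norm, matmul(feats, w0))]
    return matmul(a_norm, matmul(hidden, w1))

def inner_product_decoder(z):
    """Reconstructed adjacency: sigmoid(Z Z^T) (Eq. 2)."""
    z_t = [list(col) for col in zip(*z)]
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in matmul(z, z_t)]

def reconstruction_loss(adj, adj_rec, eps=1e-12):
    """Mean binary cross-entropy between A and its reconstruction (Eq. 3)."""
    n = len(adj)
    total = 0.0
    for i in range(n):
        for j in range(n):
            p = min(max(adj_rec[i][j], eps), 1.0 - eps)  # clamp for numerical safety
            total -= adj[i][j] * math.log(p) + (1 - adj[i][j]) * math.log(1.0 - p)
    return total / (n * n)

# Toy path graph with three nodes and two attributes per node (illustrative values).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w0 = [[0.5, -0.2], [0.1, 0.3]]
w1 = [[0.4], [-0.6]]

z = gcn_encoder(adj, feats, w0, w1)
adj_rec = inner_product_decoder(z)
loss = reconstruction_loss(adj, adj_rec)
```

In an actual experiment the weights would be learned by gradient descent on `loss`; the sketch only shows how the three equations fit together.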
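Kipf and Welling [10] introduced the variational graph autoencoder (VGAE), which is a probabilistic model. VGAE consists of an inference model and a generative model. In their approach, the inference model, corresponding to the encoder of VGAE, is expressed as follows:
$$q(\mathbf{Z}\mid\mathbf{X},\mathbf{A}) = \prod_{i=1}^{n} q(\mathbf{z}_i\mid\mathbf{X},\mathbf{A}), \quad \text{with } q(\mathbf{z}_i\mid\mathbf{X},\mathbf{A}) = \mathcal{N}\big(\mathbf{z}_i\mid\boldsymbol{\mu}_i, \mathrm{diag}(\boldsymbol{\sigma}_i^2)\big) \tag{4}$$
where $\boldsymbol{\mu}_i$ and $\boldsymbol{\sigma}_i$ are each learned by a graph neural network. That is, $\boldsymbol{\mu} = f_{\mu}(\mathbf{X},\mathbf{A})$ and $\log\boldsymbol{\sigma} = f_{\sigma}(\mathbf{X},\mathbf{A})$, where $\boldsymbol{\mu}$ is the matrix of stacked vectors $\boldsymbol{\mu}_i$; likewise, $\boldsymbol{\sigma}$ is the matrix of stacked vectors $\boldsymbol{\sigma}_i$.
The generative model, corresponding to the decoder of the autoencoder, is designed as an inner product between latent variables $\mathbf{Z}$, which is formally expressed as follows:
$$p(\mathbf{A}\mid\mathbf{Z}) = \prod_{i=1}^{n}\prod_{j=1}^{n} p(\mathbf{A}_{i,j}\mid\mathbf{z}_i,\mathbf{z}_j), \quad \text{with } p(\mathbf{A}_{i,j}=1\mid\mathbf{z}_i,\mathbf{z}_j) = \sigma(\mathbf{z}_i^{\top}\mathbf{z}_j) \tag{5}$$
During the training phase, the parameters are updated by minimizing the negative variational lower bound $\mathcal{L}_{vae}$:
$$\mathcal{L}_{vae} = -\mathbb{E}_{q(\mathbf{Z}\mid\mathbf{X},\mathbf{A})}\big[\log p(\mathbf{A}\mid\mathbf{Z})\big] + \mathrm{KL}\big[q(\mathbf{Z}\mid\mathbf{X},\mathbf{A})\,\|\,p(\mathbf{Z})\big] \tag{6}$$
where a Gaussian prior $p(\mathbf{Z}) = \prod_{i}\mathcal{N}(\mathbf{z}_i\mid\mathbf{0},\mathbf{I})$ is adopted for $\mathbf{Z}$.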
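To make the two ingredients of the VGAE objective concrete, the sketch below shows the reparameterized sampling of the latent variables and the closed-form KL divergence between a diagonal Gaussian and the standard Gaussian prior. It is a hedged illustration only: the function names and the use of log standard deviations as encoder outputs are assumptions, not details confirmed by the text above.

```python
import math
import random

def reparameterize(mu, log_sigma, rng):
    """Sample z_i = mu_i + sigma_i * eps with eps ~ N(0, I), one row per node."""
    return [[m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu_i, ls_i)]
            for mu_i, ls_i in zip(mu, log_sigma)]

def kl_to_standard_normal(mu, log_sigma):
    """Closed-form KL[N(mu, diag(sigma^2)) || N(0, I)], summed over nodes and dimensions."""
    total = 0.0
    for mu_i, ls_i in zip(mu, log_sigma):
        for m, s in zip(mu_i, ls_i):
            # 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2) for a single dimension
            total += 0.5 * (math.exp(2.0 * s) + m * m - 1.0 - 2.0 * s)
    return total

rng = random.Random(0)
mu = [[0.0, 1.0], [0.5, -0.5]]          # illustrative encoder means for two nodes
log_sigma = [[0.0, 0.0], [-1.0, 0.2]]   # illustrative log standard deviations
z = reparameterize(mu, log_sigma, rng)
kl = kl_to_standard_normal(mu, log_sigma)
```

The sampled `z` would then be passed to the inner product decoder, and `kl` added to the reconstruction term.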
By now, multiple variants of adversarial training have been proposed, and most of them are built on supervised learning and Euclidean data, e.g. FGSM adversarial training [6], PGD adversarial training [11], Trades [33] and MART [29]. Here we introduce Trades, which will be extended to the GAE and VGAE settings in Section 4. Trades [33] separates the loss function into two terms: (1) a cross-entropy loss for achieving natural accuracy; (2) a Kullback-Leibler (KL) divergence for achieving adversarial robustness. Formally, given inputs $(\mathbf{X}, \mathbf{Y})$, it can be expressed as follows [33]:
$$\min_{\theta}\;\mathbb{E}_{(\mathbf{X},\mathbf{Y})}\Big[\mathrm{CE}\big(f_\theta(\mathbf{X}), \mathbf{Y}\big) + \lambda\cdot\mathrm{KL}\big(P(\mathbf{Y}\mid\mathbf{X}')\,\|\,P(\mathbf{Y}\mid\mathbf{X})\big)\Big] \tag{7}$$
where $f_\theta$ is a supervised model, $\mathbf{X}'$ denotes the adversarial examples that maximize the KL divergence and $P(\mathbf{Y}\mid\mathbf{X})$ is the output probability after softmax. $\lambda$ is a tunable hyperparameter that controls the strength of the KL regularization term.
In this section, we formulate L2 and L∞ adversarial training for GAE and VGAE respectively.
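Considering that (1) the inputs of GAE contain the adjacency matrix $\mathbf{A}$ and the attributes $\mathbf{X}$, and (2) the latent representation $\mathbf{Z}$ is expected to be invariant to input perturbations, we reformulate the loss function in Eq. 3 as follows:
$$\min_{\theta}\;\mathcal{L}_{ae} + \lambda\cdot\mathcal{R} \tag{8}$$
$$\mathcal{R} = \mathrm{KL}\big(P(\mathbf{Z}\mid\mathbf{A}',\mathbf{X}')\,\|\,P(\mathbf{Z}\mid\mathbf{A},\mathbf{X})\big) \tag{9}$$
where $\mathcal{L}_{ae}$ is the reconstruction loss of Eq. 3, $\lambda$ controls the strength of the regularization term $\mathcal{R}$, $\mathbf{A}'$ is the adversarially perturbed adjacency matrix and $\mathbf{X}'$ is the adversarially perturbed attributes. Here the important question is how to generate the perturbed adjacency matrix and attributes in Eq. 9.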
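The structure of this objective, a reconstruction term plus a weighted penalty on the change of the latent representation under perturbation, can be sketched as follows. This is an assumption-laden illustration rather than the authors' code: the text does not say how a deterministic embedding is turned into a distribution, so the sketch assumes a row-wise softmax over embedding dimensions, and the names `embedding_kl`, `adversarial_objective` and `lam` are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one embedding row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def embedding_kl(z_adv, z_clean):
    """Mean over nodes of KL(softmax(z_adv_i) || softmax(z_clean_i)) (assumed form)."""
    rows = [kl_divergence(softmax(a), softmax(c)) for a, c in zip(z_adv, z_clean)]
    return sum(rows) / len(rows)

def adversarial_objective(recon_loss, z_adv, z_clean, lam):
    """Reconstruction loss plus lam times the latent-invariance penalty."""
    return recon_loss + lam * embedding_kl(z_adv, z_clean)

z_clean = [[0.2, -0.1], [0.4, 0.3]]   # embeddings of the clean graph (illustrative)
z_adv = [[0.5, -0.4], [0.1, 0.6]]     # embeddings of the perturbed graph (illustrative)
objective = adversarial_objective(0.7, z_adv, z_clean, lam=4.0)
```

When the perturbed and clean embeddings coincide, the penalty vanishes and the objective reduces to the standard reconstruction loss.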
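Attributes Perturbation $\mathbf{X}'$. We generate the perturbed $\mathbf{X}'$ by projected gradient descent (PGD) [11]. We denote the total number of steps as $T$.
For $\mathbf{X}'$ bounded by the L2 norm ball, the perturbed data in the $t$-th step, $\mathbf{X}^{(t)}$, is expressed as follows:
$$\mathbf{X}^{(t)} = \Pi_{\mathcal{B}(\mathbf{X},\epsilon)}\left(\mathbf{X}^{(t-1)} + \alpha\cdot\frac{\mathbf{g}}{\|\mathbf{g}\|_2}\right) \tag{10}$$
$$\mathbf{g} = \nabla_{\mathbf{X}^{(t-1)}}\,\mathrm{KL}\big(P(\mathbf{Z}\mid\mathbf{A},\mathbf{X}^{(t-1)})\,\|\,P(\mathbf{Z}\mid\mathbf{A},\mathbf{X})\big) \tag{11}$$
where $\Pi$ is the projection operator, $\mathcal{B}(\mathbf{X},\epsilon)$ is the L2 norm ball of radius $\epsilon$ around the nodal attributes $\mathbf{X}$, $\alpha$ is the step size and $\mathbf{X}' = \mathbf{X}^{(T)}$.
For $\mathbf{X}'$ bounded by the L∞ norm ball, the perturbed data in the $t$-th step, $\mathbf{X}^{(t)}$, is expressed as follows:
$$\mathbf{X}^{(t)} = \Pi_{\mathcal{B}_\infty(\mathbf{X},\epsilon)}\left(\mathbf{X}^{(t-1)} + \alpha\cdot\mathrm{sgn}(\mathbf{g})\right) \tag{12}$$
$$\mathbf{g} = \nabla_{\mathbf{X}^{(t-1)}}\,\mathrm{KL}\big(P(\mathbf{Z}\mid\mathbf{A},\mathbf{X}^{(t-1)})\,\|\,P(\mathbf{Z}\mid\mathbf{A},\mathbf{X})\big) \tag{13}$$
where $\mathcal{B}_\infty(\mathbf{X},\epsilon)$ is the L∞ norm ball of radius $\epsilon$ around the nodal attributes $\mathbf{X}$ and $\mathrm{sgn}(\cdot)$ is the sign function.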
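The two PGD variants differ only in how the gradient is turned into a step and how the result is projected back onto the norm ball. The following minimal sketch shows both on a flat vector. It assumes a user-supplied gradient function `grad_fn` (here a toy quadratic, not the GAE regularizer), and the names `pgd`, `step` and `steps` are chosen for this example.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def sign(v):
    return (v > 0) - (v < 0)

def project_l2(x_adv, x, eps):
    """Project x_adv onto the L2 ball of radius eps centred at x."""
    delta = [a - b for a, b in zip(x_adv, x)]
    norm = l2_norm(delta)
    if norm <= eps:
        return list(x_adv)
    return [b + d * eps / norm for b, d in zip(x, delta)]

def project_linf(x_adv, x, eps):
    """Project x_adv onto the L-infinity ball of radius eps centred at x (clipping)."""
    return [min(max(a, b - eps), b + eps) for a, b in zip(x_adv, x)]

def pgd(x, grad_fn, eps, step, steps, norm):
    """Run `steps` ascent steps on the objective whose gradient is grad_fn."""
    x_adv = list(x)
    for _ in range(steps):
        g = grad_fn(x_adv)
        if norm == "l2":
            scale = l2_norm(g) or 1.0  # avoid division by zero for a vanishing gradient
            moved = [a + step * gi / scale for a, gi in zip(x_adv, g)]
            x_adv = project_l2(moved, x, eps)
        elif norm == "linf":
            moved = [a + step * sign(gi) for a, gi in zip(x_adv, g)]
            x_adv = project_linf(moved, x, eps)
        else:
            raise ValueError("norm must be 'l2' or 'linf'")
    return x_adv

# Toy objective ||x' - target||^2 whose gradient is 2 (x' - target).
target = [-1.0, -1.0]
grad_fn = lambda v: [2.0 * (a - t) for a, t in zip(v, target)]
x = [0.0, 0.0]
x_l2 = pgd(x, grad_fn, eps=0.3, step=0.5, steps=3, norm="l2")
x_linf = pgd(x, grad_fn, eps=0.3, step=0.2, steps=3, norm="linf")
```

In both cases the perturbation ends on the boundary of its ball, since the toy gradient keeps pointing away from the starting point.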
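Adjacency Matrix Perturbation $\mathbf{A}'$. Adjacency matrix perturbation can take two forms: (1) perturbing node connections, i.e. adding or dropping edges; (2) perturbing the strength of information flow between nodes, i.e. the strength of correlation between nodes. Here we choose to perturb the strength of information flow between nodes and leave the perturbation of node connections for future work. Specifically, we add a weight to each edge and change these weights in order to perturb the strength of information flow. Formally, given the adjacency matrix $\mathbf{A}$, the weighted adjacency matrix is expressed as $\mathbf{A}\odot\mathbf{M}$, where the elements of $\mathbf{M}$ are continuous and initialized to the same values as $\mathbf{A}$, and $\odot$ denotes the element-wise product. Formally, $\mathbf{A}'$ is expressed as follows:
$$\mathbf{A}' = \mathbf{A}\odot\mathbf{M}' \tag{14}$$
$$\mathbf{M}' \in \mathcal{B}(\mathbf{M},\epsilon) \tag{15}$$
For $\mathbf{M}'$ bounded by the L2 norm ball, the perturbed data in the $t$-th step, $\mathbf{M}^{(t)}$, is expressed as follows:
$$\mathbf{M}^{(t)} = \Pi_{\mathcal{B}(\mathbf{M},\epsilon)}\left(\mathbf{M}^{(t-1)} + \alpha\cdot\frac{\mathbf{g}}{\|\mathbf{g}\|_2}\right) \tag{16}$$
$$\mathbf{g} = \nabla_{\mathbf{M}^{(t-1)}}\,\mathrm{KL}\big(P(\mathbf{Z}\mid\mathbf{A}\odot\mathbf{M}^{(t-1)},\mathbf{X})\,\|\,P(\mathbf{Z}\mid\mathbf{A},\mathbf{X})\big) \tag{17}$$
$$\mathbf{A}' = \mathbf{A}\odot\mathbf{M}^{(T)} \tag{18}$$
For $\mathbf{M}'$ bounded by the L∞ norm ball, the perturbed data in the $t$-th step, $\mathbf{M}^{(t)}$, is expressed as follows:
$$\mathbf{M}^{(t)} = \Pi_{\mathcal{B}_\infty(\mathbf{M},\epsilon)}\left(\mathbf{M}^{(t-1)} + \alpha\cdot\mathrm{sgn}(\mathbf{g})\right) \tag{19}$$
$$\mathbf{g} = \nabla_{\mathbf{M}^{(t-1)}}\,\mathrm{KL}\big(P(\mathbf{Z}\mid\mathbf{A}\odot\mathbf{M}^{(t-1)},\mathbf{X})\,\|\,P(\mathbf{Z}\mid\mathbf{A},\mathbf{X})\big) \tag{20}$$
$$\mathbf{A}' = \mathbf{A}\odot\mathbf{M}^{(T)} \tag{21}$$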
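A property of the edge-weight formulation is that the element-wise product with the binary adjacency matrix keeps every non-edge at zero, so the perturbation changes how strongly existing edges pass information but never adds or drops an edge. The sketch below shows a single L∞ step on the weights under that reading. The gradient matrix `grad` is supplied by hand here, and the function names are assumptions made for this example.

```python
def sign(v):
    return (v > 0) - (v < 0)

def hadamard(a, m):
    """Element-wise product of two matrices given as lists of rows."""
    return [[x * y for x, y in zip(ra, rm)] for ra, rm in zip(a, m)]

def perturb_edge_weights_linf(adj, grad, eps, step):
    """One signed-gradient step on the weights M (initialised to A), clipped to the
    L-infinity ball of radius eps around M, followed by A' = A (element-wise) M'."""
    mask = [[float(v) for v in row] for row in adj]
    new_mask = [[min(max(m + step * sign(g), m - eps), m + eps)
                 for m, g in zip(rm, rg)]
                for rm, rg in zip(mask, grad)]
    return hadamard(adj, new_mask)

adj = [[0, 1], [1, 0]]
grad = [[5.0, -2.0], [3.0, 1.0]]   # illustrative gradient of the regularizer w.r.t. M
adj_adv = perturb_edge_weights_linf(adj, grad, eps=0.1, step=0.5)
```

Here the two existing edges end up with weights 0.9 and 1.1, while the diagonal non-edges remain exactly zero despite having non-zero gradients.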
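Similarly to GAE, we reformulate the loss function for training VGAE (Eq. 6) as follows:
$$\min_{\theta}\;\mathcal{L}_{vae} + \lambda\cdot\mathcal{R} \tag{22}$$
$$\mathcal{R} = \mathrm{KL}\big(P(\mathbf{Z}\mid\mathbf{A}',\mathbf{X}')\,\|\,P(\mathbf{Z}\mid\mathbf{A},\mathbf{X})\big) \tag{23}$$
We generate $\mathbf{A}'$ and $\mathbf{X}'$ in exactly the same way as for GAE, with the corresponding VGAE quantities substituted in Eq. 10-21.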
For convenience, we abbreviate L2 and L∞ adversarial training as AT-L2 and AT-Linf respectively in the following tables and figures, where L2 / L∞ denotes that both the attributes perturbation and the adjacency matrix perturbation are bounded by the L2 / L∞ norm ball.
In practice, we train models by alternately adding adjacency matrix perturbation and attributes perturbation. (Footnote 1: We find that optimizing models by alternately adding these two perturbations is better than adding them together; see Appendix.)
In this section, we present the results of the performance evaluation of L2 and L∞ adversarial training under three main applications of GAE and VGAE: link prediction, node clustering, and graph anomaly detection. Then we conduct parameter analysis experiments to explore which factors influence the performance.
Datasets. We used six real-world datasets: Cora, Citeseer and PubMed for link prediction and node clustering tasks, and BlogCatalog, ACM and Flickr for the graph anomaly detection task. The detailed descriptions of the six datasets are shown in Table 1.
Model Architecture. All our experiments are based on the GAE/VGAE model where the encoder/inference model consists of a two-layer GCN by default.
DataSets | Cora | Citeseer | PubMed | BlogCatalog | ACM | Flickr |
---|---|---|---|---|---|---|
#Nodes | 2708 | 3327 | 19717 | 5196 | 16484 | 7575 |
#Links | 5429 | 4732 | 44338 | 171743 | 71980 | 239738 |
#Features | 1433 | 3703 | 500 | 8189 | 8337 | 12074 |
Metrics. Following [10], we use the area under the receiver operating characteristic curve (AUC) and average precision (AP) as the evaluation metrics. We repeat each experiment 30 times, randomly splitting the datasets into 85%, 5% and 10% for training, validation and test sets respectively. We report the mean and standard deviation on the test sets.
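For readers unfamiliar with the two link prediction metrics, the sketch below computes them from scratch on a handful of scored candidate edges. It is only meant to show what the metrics measure; the function names and the toy labels and scores are assumptions, and library implementations would normally be used in practice.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """AP as the mean of precision@k over the ranks k at which a positive appears."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Four candidate edges: label 1 for held-out true edges, 0 for sampled non-edges.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]   # e.g. sigmoid of the embedding inner products
auc = roc_auc(labels, scores)             # 0.75
ap = average_precision(labels, scores)    # (1/1 + 2/3) / 2
```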
Parameter Settings. We train models on the Cora and Citeseer datasets for 600 epochs, and on PubMed for 800 epochs. All models are optimized with the Adam optimizer and a learning rate of 0.01. For attributes perturbation, $\epsilon$ is set to 3e-1 and 1e-3 on Citeseer and Cora, and 1 and 5e-3 on PubMed for L2 and L∞ adversarial training respectively. For adjacency matrix perturbation, $\epsilon$ is set to 1e-3 and 1e-1 on Citeseer and Cora, and 1e-3 and 3e-1 on PubMed for L2 and L∞ adversarial training respectively.
For standard training of GAE and VGAE, we run the official PyTorch Geometric code (https://github.com/rusty1s/pytorch_geometric/blob/master/examples/autoencoder.py) for 600 epochs on the Citeseer and Cora datasets and 1000 epochs on the PubMed dataset (since PubMed is a large graph, we use more epochs in order to avoid underfitting). Other parameters are set the same as in [10].
Methods | Cora AUC (in%) | Cora AP (in%) | Citeseer AUC (in%) | Citeseer AP (in%) | PubMed AUC (in%) | PubMed AP (in%) |
---|---|---|---|---|---|---|
GAE | | | | | | |
AT-L2-GAE | | | | | | |
AT-Linf-GAE | | | | | | |
VGAE | | | | | | |
AT-L2-VGAE | | | | | | |
AT-Linf-VGAE | | | | | | |
Metrics. We use accuracy (ACC), normalized mutual information (NMI), precision, F-score (F1) and adjusted Rand index (ARI) as our evaluation metrics. We repeat each experiment 10 times. For each experiment, the datasets are randomly split into training sets (85% of edges), validation sets (5% of edges) and test sets (10% of edges). We report the mean and standard deviation on the test sets.
Likewise, for standard GAE and VGAE, we run the official PyTorch Geometric code for 400 epochs on the Citeseer and Cora datasets and 800 epochs on the PubMed dataset.
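As an example of how one of these clustering metrics is computed, the sketch below implements NMI from label counts. The normalization by the arithmetic mean of the two entropies is an assumption of this example (other normalizations exist, and the text does not state which one is used), as are the function names.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two label assignments of the same nodes."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * math.log(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in joint.items())

def nmi(a, b):
    """Mutual information normalized by the mean of the two entropies."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both assignments are a single cluster
    return mutual_information(a, b) / ((h_a + h_b) / 2.0)

truth = [0, 0, 1, 1]
same_up_to_renaming = [1, 1, 0, 0]
unrelated = [0, 1, 0, 1]
score_same = nmi(truth, same_up_to_renaming)
score_unrelated = nmi(truth, unrelated)
```

NMI is invariant to how clusters are named, which is why the relabeled assignment scores the same as a perfect match.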
Experimental Results. The results are shown in Table 3, Table 4 and Table 5. It can be seen that both L2 and L∞ adversarially trained models consistently outperform the standard trained models on all metrics. In particular, on the Cora and Citeseer datasets, both L2 and L∞ adversarial training improve the performance by a large margin on all metrics, i.e. at least +5.4% for GAE and +6.7% for VGAE on the Cora dataset (Table 3), and at least +5.8% for GAE and +5.6% for VGAE on the Citeseer dataset (Table 4).
Methods | Acc (in%) | NMI (in%) | F1 (in%) | Precision (in%) | ARI (in%) |
---|---|---|---|---|---|
GAE | |||||
AT-L2-GAE | |||||
AT-Linf-GAE | |||||
VGAE | |||||
AT-L2-VGAE | |||||
AT-Linf-VGAE |
Methods | Acc (in%) | NMI (in%) | F1 (in%) | Precision (in%) | ARI (in%) |
---|---|---|---|---|---|
GAE | |||||
AT-L2-GAE | |||||
AT-Linf-GAE | |||||
VGAE | |||||
AT-L2-VGAE | |||||
AT-Linf-VGAE |
Methods | Acc (in%) | NMI (in%) | F1 (in%) | Precision (in%) | ARI (in%) |
---|---|---|---|---|---|
GAE | |||||
AT-L2-GAE | |||||
AT-Linf-GAE | |||||
VGAE | |||||
AT-L2-VGAE | |||||
AT-Linf-VGAE |
We exactly follow [4] to conduct experiments for graph anomaly detection. In [4], the authors take the reconstruction errors of attributes and links as the anomaly scores. Specifically, nodes with larger scores are more likely to be considered anomalies.
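A reconstruction-error score of this kind can be sketched as a weighted sum of a per-node structure error and a per-node attribute error. The exact form used in [4] is not reproduced here; the Euclidean row errors, the weight name `alpha` and the choice of which term it multiplies are assumptions of this example.

```python
import math

def row_error(row, row_rec):
    """Euclidean distance between an original row and its reconstruction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row, row_rec)))

def anomaly_scores(adj, adj_rec, feats, feats_rec, alpha):
    """Per-node score: (1 - alpha) * structure error + alpha * attribute error."""
    return [(1.0 - alpha) * row_error(a, a_rec) + alpha * row_error(x, x_rec)
            for a, a_rec, x, x_rec in zip(adj, adj_rec, feats, feats_rec)]

adj = [[0, 1], [1, 0]]
adj_rec = [[0.0, 1.0], [0.0, 0.0]]      # node 1's link is reconstructed badly
feats = [[1.0, 0.0], [0.0, 1.0]]
feats_rec = [[1.0, 0.0], [0.0, 0.0]]    # node 1's attributes are reconstructed badly
scores = anomaly_scores(adj, adj_rec, feats, feats_rec, alpha=0.8)
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
```

Nodes at the front of `ranking` are the ones the model reconstructs worst, and hence the most likely anomalies under this scoring.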
Model Architecture. Different from link prediction and node clustering, the model architecture in graph anomaly detection contains not only a structure reconstruction decoder, i.e. link reconstruction, but also an attribute reconstruction decoder. We adopt the same model architecture as in the official code of [4], where the encoder consists of two GCN layers, the structure reconstruction decoder consists of a GCN layer and an inner product layer, and the attribute reconstruction decoder consists of two GCN layers.
Metrics. Following [4, 13], we use the area under the receiver operating characteristic curve (ROC-AUC) as the evaluation metric.
Parameter Settings. The weight in the anomaly scores balances the structure reconstruction errors and the attribute reconstruction errors. We train the GAE model on the Flickr, BlogCatalog and ACM datasets for 300 epochs. For adjacency matrix perturbation, we set $\epsilon$ to 3e-1 and 5e-5 on both the BlogCatalog and ACM datasets, and 1e-3 and 1e-6 on the Flickr dataset for L2 and L∞ adversarial training respectively. For attributes perturbation, we set $\epsilon$ to 1e-3 on BlogCatalog for both L2 and L∞ adversarial training, 1e-3 and 1e-2 on ACM for L2 and L∞ adversarial training respectively, and 5e-1 and 3e-1 on Flickr for L2 and L∞ adversarial training respectively. We set the number of steps $T$ to 1.
Anomaly Generation. Following [4], we inject two kinds of anomalies by perturbing structure and nodal attributes respectively:
Structure anomalies. We randomly select $m$ nodes from the network and then make those nodes fully connected; all the nodes forming the clique are then labeled as anomalies. $q$ cliques are generated in this way, so in total there are $m \times q$ structural anomalies.
Attribute anomalies. We first randomly select nodes as the attribute perturbation candidates. For each selected node $i$, we randomly select another $k$ nodes from the network and calculate the Euclidean distance between $i$ and all the $k$ nodes. Then the node with the largest distance is selected as $j$, and the attributes of node $i$ are changed to the attributes of $j$.
In these experiments, the clique and candidate settings for BlogCatalog, Flickr and ACM are the same as in [4, 13].
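The two injection procedures can be sketched as follows. This is an illustration under stated assumptions, not the code of [4]: cliques are drawn without overlap, the graph is a dense list-of-lists adjacency matrix, and the names `clique_size`, `num_cliques` and `k` are chosen for this example.

```python
import math
import random

def inject_structure_anomalies(adj, clique_size, num_cliques, rng):
    """Fully connect `num_cliques` random groups of `clique_size` nodes (in place)
    and return the groups; every node in a group is a structural anomaly."""
    chosen = rng.sample(range(len(adj)), clique_size * num_cliques)
    cliques = [chosen[c * clique_size:(c + 1) * clique_size] for c in range(num_cliques)]
    for clique in cliques:
        for u in clique:
            for v in clique:
                if u != v:
                    adj[u][v] = 1
    return cliques

def inject_attribute_anomalies(feats, candidates, k, rng):
    """For each candidate node, copy the attributes of the farthest of k random nodes."""
    n = len(feats)
    for i in candidates:
        others = rng.sample([j for j in range(n) if j != i], k)
        farthest = max(others, key=lambda j: math.dist(feats[i], feats[j]))
        feats[i] = list(feats[farthest])

rng = random.Random(0)
n = 12
adj = [[0] * n for _ in range(n)]
cliques = inject_structure_anomalies(adj, clique_size=3, num_cliques=2, rng=rng)

feats = [[float(i)] for i in range(n)]
inject_attribute_anomalies(feats, candidates=[0], k=n - 1, rng=rng)
```

With `k = n - 1` every other node is compared, so node 0 receives the attributes of node 11, the farthest node in this toy feature space.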
Experimental Results. From Table 6, it can be seen that both L2 and L∞ adversarial training boost the performance in detecting anomalous nodes. Since adversarial training tends to learn feature representations that are less sensitive to perturbations in the inputs, we conjecture that the adversarially trained node embeddings are less influenced by the anomalous nodes, which helps graph anomaly detection. A similar claim is also made in the image domain [15], where adversarial training of autoencoders is shown to be beneficial to novelty detection.
Methods | Flickr | BlogCatalog | ACM |
---|---|---|---|
GAE | |||
AT-L2-GAE | |||
AT-Linf-GAE |
In this section, we explore the impact of three hyper-parameters on the performance of GAE and VGAE with adversarial training, i.e. the $\epsilon$, $T$ and $\lambda$ used in generating $\mathbf{A}'$ and $\mathbf{X}'$. These three hyper-parameters are commonly considered to control the strength of regularization for adversarial robustness [33]. Besides, we explore the relationship between the improvements achieved by L2/L∞ adversarial training and node degree.
The experiments are conducted on link prediction and node clustering tasks on the Cora dataset. We fix $\epsilon$ to 5e-1 and 1e-3 on adjacency matrix perturbation for L2 and L∞ adversarial training respectively when varying $\epsilon$ on attributes perturbation. We fix $\epsilon$ to 1e-3 and 3e-1 on attributes perturbation for L2 and L∞ adversarial training respectively when varying $\epsilon$ on adjacency matrix perturbation.
The results are shown in Fig. 9. From Fig. 9, we can see that the performance is less sensitive to adjacency matrix perturbation and more sensitive to attributes perturbation. Besides, it can be seen that the performance first increases and then decreases when increasing $\epsilon$ for attributes perturbation. We conjecture that this is because too large a perturbation on attributes may destroy useful information in the attributes. Therefore, it is necessary to carefully adapt the perturbation magnitude $\epsilon$ when applying adversarial training to improve the generalization of a model.
The experiments are conducted on link prediction and node clustering tasks on the Cora dataset. For L2 adversarial training, we set $\epsilon$ to 1e-3 and 5e-1 for adjacency matrix perturbation and attributes perturbation respectively. For L∞ adversarial training, we set $\epsilon$ to 1e-1 and 1e-3 for adjacency matrix perturbation and attributes perturbation respectively.
Results are shown in Fig. 14. From Fig. 14, we can see that there is a slight drop on both link prediction and node clustering tasks when increasing $T$ from 2 to 4, which implies that a large $T$ is not helpful for improving the generalization of node embeddings learned by GAE and VGAE. We suggest that one step is a good choice for generating adjacency matrix perturbation and attributes perturbation in both L2 and L∞ adversarial training.
The experiments are conducted on link prediction and node clustering tasks on the Cora dataset. Likewise, for L2 adversarial training, $\epsilon$ is set to 1e-3 and 5e-1 for adjacency matrix perturbation and attributes perturbation respectively. For L∞ adversarial training, $\epsilon$ is set to 1e-1 and 1e-3 for adjacency matrix perturbation and attributes perturbation respectively. The number of steps $T$ is set to 1.
Results are shown in Fig. 17. From Fig. 17, it can be seen that there is a significant increasing trend with the increase of $\lambda$, which indicates the effectiveness of both L2 and L∞ adversarial training in improving the generalization of GAE and VGAE. Besides, we also notice that a too large $\lambda$ is not necessary and may have a negative effect on the generalization of GAE and VGAE.
In this section, we explore whether the performance of adversarially trained GAE/VGAE is sensitive to the degree of nodes. To conduct these experiments, we first learn node embeddings from the Cora and Citeseer datasets by GAE/VGAE with adversarial training and standard training respectively. The hyper-parameters are set the same as in the node clustering task. Then we build a linear classifier on the learned node embeddings. Their accuracy can be found in the Appendix. The accuracy with respect to the degree distribution is shown in Fig. 22.
From Fig. 22, it can be seen that for most degree groups, both L2 and L∞ adversarially trained models outperform standard trained models, which indicates that both L2 and L∞ adversarial training improve the generalization of GAE and VGAE for nodes of different degrees. However, we also notice that adversarial training does not achieve a significant improvement on the [9,N] group. We conjecture that this is because the embeddings of nodes with very large degrees already generalize well.
In this paper, we formulated L2 and L∞ adversarial training for GAE and VGAE, and studied their impact on the generalization performance. We conducted experiments on link prediction, node clustering and graph anomaly detection tasks. The results show that both L2 and L∞ adversarially trained GAE and VGAE outperform GAE and VGAE with standard training. This indicates that L2 and L∞ adversarial training improve the generalization of GAE and VGAE. Besides, we showed that the generalization performance achieved by L2 and L∞ adversarial training is more sensitive to attributes perturbation than to adjacency matrix perturbation, and is not sensitive to node degree. In addition, the parameter analysis suggests that a too large $\lambda$, $\epsilon$ or $T$ would have a negative effect on the generalization performance.
Shi, H., Fan, H., Kwok, J.T.: Effective decoding in graph auto-encoder using triadic closure. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 34, pp. 906–913 (2020)
Stutz, D., Hein, M., Schiele, B.: Disentangling adversarial robustness and generalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6976–6987 (2019)
Xia, R., Pan, Y., Du, L., Yin, J.: Robust multi-view spectral clustering via low-rank and sparse decomposition. In: Proceedings of the AAAI conference on artificial intelligence. vol. 28 (2014)