Graph Sanitation with Application to Node Classification

The past decades have witnessed the prosperity of graph mining, with a multitude of sophisticated models and algorithms designed for various mining tasks, such as ranking, classification, clustering and anomaly detection. Generally speaking, the vast majority of the existing works aim to answer the following question: given a graph, what is the best way to mine it? In this paper, we introduce the graph sanitation problem to answer an orthogonal question: given a mining task and an initial graph, what is the best way to improve the initially provided graph? By learning a better graph as part of the input of the mining model, it is expected to benefit graph mining in a variety of settings, ranging from denoising and imputation to defense. We formulate the graph sanitation problem as a bilevel optimization problem, and further instantiate it by semi-supervised node classification, together with an effective solver named GaSoliNe. Extensive experimental results demonstrate that the proposed method is (1) broadly applicable with respect to different graph neural network models and flexible graph modification strategies, and (2) effective in improving the node classification accuracy on both the original and contaminated graphs in various perturbation scenarios. In particular, it brings up to 25% performance improvement over the existing robust graph neural network methods.


1. Introduction

Graph mining has become the cornerstone of a wealth of real-world applications, such as social media mining (zafarani2014social), brain connectivity analysis (shi2015brainquest), computational epidemiology (keeling2005networks) and financial fraud detection (zhang2017hidden). The vast majority of existing works essentially aim to answer the following question: given a graph, what is the best model and/or algorithm to mine it? To name a few, Pagerank (page1999pagerank) and its variants (tong2006fast) measure the node importance and node proximity, respectively, based on multiple weighted paths; spectral clustering (shi2000normalized) minimizes the inter-cluster connectivity and maximizes the intra-cluster connectivity to partition nodes into different groups; graph neural networks (GNNs) (kipf2016semi; velivckovic2018graph; wu2019simplifying; klicpera2018predict) learn representations of nodes by aggregating information from the neighborhood. All these works, and many more, require a given graph, including its topology and/or the associated attribute information, as part of the input of the corresponding mining model.

Despite tremendous success, some fundamental questions largely remain open. For example, where does the input graph come from in the first place? To what extent does the quality of the given graph impact the effectiveness of the corresponding graph mining model? In response, we introduce the graph sanitation problem, which aims to improve the initially provided graph for a given graph mining model, so as to maximally boost its performance. The rationale is as follows. In many existing graph mining works, the initially provided graph is typically constructed manually based on some heuristics. The graph construction is often treated as a pre-processing step, without consideration of the specific mining task. What is more, the initially constructed graph could be subject to various forms of contamination, such as missing information, noise and even adversarial attacks. This suggests that there might be under-explored space for improving the mining performance by learning a ‘better’ graph as the input of the corresponding mining model.

There are a few lines of existing works for modifying graphs. For example, network imputation (liben2007link; huisman2009imputation) and knowledge graph completion (bordes2013translating; wang2014knowledge) problems focus on restoring missing links in a partially observed graph; graph connectivity optimization (chen2018network) and computational immunization (chen2015node) problems aim to manipulate the graph connectivity in a desired way by changing the underlying topology; robust graph neural networks (GNNs) (entezari2020all; wu2019adversarial; DBLP:conf/kdd/Jin0LTWT20) utilize empirical properties of a benign graph to remove or assign lower weights to the poisoned graph elements (e.g., contaminated edges).

The graph sanitation problem introduced in this paper is related to, but subtly different from, these existing works in the following sense. Most, if not all, of these existing works for modifying graphs assume the initially provided graph is impaired or perturbed in a specific way, e.g., due to missing links, noise or adversarial attacks. Some existing works further impose certain assumptions on the specific graph modification algorithms, such as the low-rank assumption behind many network imputation methods, or the types of attacks and/or the empirical properties of the benign graph (e.g., topology sparsity, feature smoothness) behind some robust GNNs. In contrast, the proposed graph sanitation problem does not make any such assumption, and instead pursues a different design principle. That is, we aim to let the performance of the downstream data mining task, measured on a validation set, dictate how we should optimally modify the initially provided graph. This is crucial, as it not only ensures that the modified graph will directly and maximally improve the mining performance, but also makes the approach applicable to a variety of graph mining tasks.

Formally, we formulate the graph sanitation problem as a generic bilevel optimization problem, where the lower-level optimization problem corresponds to the specific mining task and the upper-level optimization problem encodes the supervision to modify the provided graph and maximally improve the mining performance. Based on that, we instantiate such a bilevel optimization problem by semi-supervised node classification with graph convolutional networks, where the lower-level objective function represents the cross-entropy classification loss over the training data and the upper-level objective function represents the loss over validation data, using the mining model trained from the lower-level optimization problem. We propose an effective solver (GaSoliNe) which adopts an efficient approximation of the hyper-gradient to guide the modification of the given graph. We carefully design the hyper-gradient aggregation mechanism to avoid potential bias from a specific dataset split by aggregating the hyper-gradients from different folds of data. GaSoliNe is versatile and comes with multiple variants, such as discretized vs. continuous modification, deleting vs. adding edges, and modifying graph topology vs. features. Comprehensive experiments demonstrate that (1) GaSoliNe is broadly applicable to benefit different downstream node classifiers together with flexible choices of variants and modification strategies, and (2) GaSoliNe can significantly boost downstream classifiers on both the original and contaminated graphs in various perturbation scenarios and can work hand-in-hand with existing robust GNN methods. For instance, in Table 1, the proposed GaSoliNe significantly boosts GAT (velivckovic2018graph), APPNP-SVD (entezari2020all), and RGCN (zhu2019robust).

| Data | With GaSoliNe? | GAT | APPNP-SVD | RGCN |
|---|---|---|---|---|
| Cora | N | 48.83±0.18 | 60.12±0.55 | 50.57±0.77 |
| Cora | Y | 63.68±0.59 | 79.73±0.61 | 62.56±0.55 |
| Citeseer | N | 62.40±0.69 | 50.60±0.63 | 55.53±1.43 |
| Citeseer | Y | 69.68±0.21 | 76.46±0.63 | 66.09±0.77 |
| Polblogs | N | 48.21±6.64 | 77.34±3.25 | 50.76±0.86 |
| Polblogs | Y | 70.76±0.59 | 89.21±0.72 | 67.73±0.33 |

Table 1. Performance (Mean±Std Accuracy) boosting of existing defense methods on heavily poisoned graphs (edges perturbed by metattack (DBLP:conf/iclr/ZugnerG19)) by the proposed GaSoliNe.

In summary, our main contributions in this paper are as follows:

  • Problem Definition. We introduce a novel graph sanitation problem, and formulate it as a bilevel optimization problem. The proposed graph sanitation problem can be potentially applied to a variety of graph mining models as long as they are differentiable w.r.t. the input graph.

  • Algorithmic Instantiation. We instantiate the graph sanitation problem by semi-supervised node classification with graph convolutional networks. We further propose an effective solver named GaSoliNe with versatile variants.

  • Empirical Evaluations. We perform extensive empirical studies on real-world datasets to demonstrate the effectiveness and the applicability of the proposed GaSoliNe algorithms.

The rest of this paper is organized as follows. Section 2 formally defines the graph sanitation problem with various instantiations over different mining tasks. Section 3 studies the graph sanitation problem with semi-supervised node classification. In Section 4 we design a set of experiments to verify the effectiveness and the applicability of GaSoliNe. After reviewing the related literature in Section 5 we conclude this paper in Section 6.

2. Graph Sanitation Problem

A - Notations. Table 2 summarizes the main symbols and notations used throughout the paper. We use bold uppercase letters for matrices (e.g., $\mathbf{A}$), bold lowercase letters for vectors (e.g., $\mathbf{u}$), lowercase letters for scalars (e.g., $c$), and calligraphic letters for sets (e.g., $\mathcal{T}$). We use $\mathbf{A}[i,j]$ to represent the entry of matrix $\mathbf{A}$ at the $i$-th row and the $j$-th column, $\mathbf{A}[i,:]$ to represent the $i$-th row of matrix $\mathbf{A}$, and $\mathbf{A}[:,j]$ to represent the $j$-th column of matrix $\mathbf{A}$. Similarly, $\mathbf{u}[i]$ denotes the $i$-th entry of vector $\mathbf{u}$. We use prime to denote the transpose of matrices and vectors (e.g., $\mathbf{A}'$ is the transpose of $\mathbf{A}$). For the variables of the modified graphs, we set a tilde over the corresponding variables of the original graphs (e.g., $\tilde{\mathbf{A}}$).

We represent an attributed graph as $G = \{\mathbf{A}, \mathbf{X}\}$, where $\mathbf{A} \in \mathbb{R}^{n \times n}$ is the adjacency matrix and $\mathbf{X} \in \mathbb{R}^{n \times d}$ is the feature matrix composed of the $d$-dimensional feature vectors of the $n$ nodes. For supervised graph mining models, we first divide the node set into two disjoint subsets: the labeled node set $\mathcal{Z}$ and the test set $\mathcal{W}$, and then divide the labeled node set $\mathcal{Z}$ into two disjoint subsets: the training set $\mathcal{T}$ and the validation set $\mathcal{V}$. We use $y$ and $\hat{y}$ with appropriate indexing to denote the ground truth supervision and the prediction result, respectively. Taking the classification task as an example, $y_{ij} = 1$ if node $i$ belongs to class $j$ and $y_{ij} = 0$ otherwise; $\hat{y}_{ij}$ is the predicted probability that node $i$ belongs to class $j$. Furthermore, we use $\mathcal{Y}_{\text{train}}$ and $\mathcal{Y}_{\text{valid}}$ to denote the supervision information of all the nodes in the training set $\mathcal{T}$ and the validation set $\mathcal{V}$, respectively.

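| Symbols | Definitions |
|---|---|
| $G = \{\mathbf{A}, \mathbf{X}\}$ | the initial graph |
| $\mathbf{A}$ | the adjacency matrix of $G$ |
| $\mathbf{X}$ | the attribute/feature matrix of $G$ |
| $n$ | the number of nodes |
| $d$ | the dimension of node feature vector |
| $c$ | the number of classes |
| $\mathcal{T}$ | the training set |
| $\mathcal{V}$ | the validation set |
| $\mathcal{Y}_{\text{train}}$ | the ground truth (e.g., class labels) of the training set |
| $\mathcal{Y}_{\text{valid}}$ | the ground truth (e.g., class labels) of the validation set |
| $\theta$ | the solution of the mining model |
| $\tilde{G}$ | the modified graph |
| $B$ | budget of modification |

Table 2. Symbols and Notations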

B - Graph Mining Models. Many graph mining models can be formulated from the optimization perspective (kang2020inform; kang2019n2n). For these models, the goal is to find an optimal solution $\theta^{*}$ so that a task-specific loss $\mathcal{L}(G, \theta, \mathcal{T}, \mathcal{Y}_{\text{train}})$ is minimized. Here, $\mathcal{T}$ and $\mathcal{Y}_{\text{train}}$ are the training set and the associated ground truth (e.g., class labels for the classification task), which would be absent for the unsupervised graph mining tasks (e.g., clustering, ranking). We give three concrete examples next.

| Mining tasks | Pagerank (page1999pagerank) | Spectral clustering (shi2000normalized) | Semi-supervised node classification |
|---|---|---|---|
| Lower-level loss | Eq. (1) | Eq. (2) | Eq. (3) |
| Training set | none | none | training set |
| Training supervision | none | none | labels of training set |
| Validation set | positive node set; negative node set | ‘must-link’ set; ‘cannot-link’ set | validation set |
| Validation supervision | none | none | labels of validation set |
| Upper-level loss | Eq. (5) | Eq. (6) | Eq. (7) |
| Remarks | normalized adjacency matrix; damping factor; preference vector; width parameter | Laplacian matrix; degree matrix; link constraints matrix | number of classes; predicted probability of node to class; binary ground truth of node to class |

Table 3. Instantiations of graph sanitation problem over various mining tasks

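Example #1: Pagerank (page1999pagerank) is a fundamental graph ranking model. When the adjacency matrix of the underlying graph is normalized in a symmetric way, the Pagerank vector can be obtained as:

$$\min_{\mathbf{r}} \; \alpha\, \mathbf{r}'(\mathbf{I} - \bar{\mathbf{A}})\mathbf{r} + (1 - \alpha)\, \|\mathbf{r} - \mathbf{e}\|^{2} \tag{1}$$

where $\bar{\mathbf{A}}$ is the symmetrically normalized adjacency matrix; $\alpha$ is the damping factor; $\mathbf{e}$ is the preference vector; and $\mathbf{r}$ is the ranking vector which serves as the solution of the ranking model (i.e., $\theta = \mathbf{r}$).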
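As an illustration of this optimization view, the sketch below assumes the common quadratic form of the Pagerank objective, whose minimizer satisfies the fixed point r = damping · Ā r + (1 − damping) · e, with Ā the symmetrically normalized adjacency matrix and e the preference vector. The function names, the toy graph and the default damping value are illustrative assumptions, not taken from the paper; plain Python lists are used to keep the example dependency-free.

```python
import math

def normalize_symmetric(adj):
    """Return D^{-1/2} A D^{-1/2} for a square adjacency matrix (list of lists)."""
    n = len(adj)
    inv_sqrt_deg = []
    for row in adj:
        deg = sum(row)
        inv_sqrt_deg.append(1.0 / math.sqrt(deg) if deg > 0 else 0.0)
    return [[inv_sqrt_deg[i] * adj[i][j] * inv_sqrt_deg[j] for j in range(n)]
            for i in range(n)]

def pagerank(adj, pref, damping=0.85, num_iters=200):
    """Iterate r <- damping * A_norm r + (1 - damping) * pref (assumed fixed point)."""
    a_norm = normalize_symmetric(adj)
    n = len(adj)
    r = list(pref)
    for _ in range(num_iters):
        r = [damping * sum(a_norm[i][j] * r[j] for j in range(n))
             + (1.0 - damping) * pref[i] for i in range(n)]
    return r

# Toy path graph 0 - 1 - 2 with a uniform preference vector.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
pref = [1.0 / 3] * 3
scores = pagerank(adj, pref)
```

On this toy graph the middle node receives the highest score, and the two end nodes receive equal scores by symmetry.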

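Example #2: spectral clustering (shi2000normalized) is a classic graph clustering model which aims to minimize the normalized cut between clusters:

$$\min_{\mathbf{u}} \; \mathbf{u}'\mathbf{L}\mathbf{u} \quad \text{s.t.} \quad \mathbf{u}'\mathbf{D}\mathbf{u} = 1, \;\; \mathbf{u}'\mathbf{D}\mathbf{1} = 0 \tag{2}$$

where $\mathbf{L} = \mathbf{D} - \mathbf{A}$ is the Laplacian matrix of adjacency matrix $\mathbf{A}$, and $\mathbf{D}$ is the diagonal degree matrix where $\mathbf{D}[i,i] = \sum_{j} \mathbf{A}[i,j]$; the model solution is the cluster indicator vector $\mathbf{u}$ (i.e., $\theta = \mathbf{u}$).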

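Example #3: node classification aims to construct a classification model based on the graph topology $\mathbf{A}$ and feature $\mathbf{X}$. A typical loss for node classification is cross-entropy (CE) over the training set:

$$\min_{\theta} \; \mathcal{L}_{\text{CE}}(G, \theta, \mathcal{T}, \mathcal{Y}_{\text{train}}) = -\sum_{i \in \mathcal{T}} \sum_{j=1}^{c} y_{ij} \ln \hat{y}_{ij} \tag{3}$$

where $y_{ij}$ is the ground truth which indicates if node $i$ belongs to class $j$, $\mathcal{T}$ is the training set, and $\hat{y}_{ij}$ is the predicted probability that node $i$ belongs to class $j$ by a classifier parameterized by $\theta$. For example, the classifier can be a graph convolutional network whose trained model parameters form the solution $\theta$.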
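The cross-entropy loss described above can be sketched in a few lines. With one-hot ground truth, the inner sum over classes reduces to the negative log-probability of the true class, so the helper below takes integer class labels. The names `cross_entropy`, `probs`, `labels` and `node_set` are illustrative assumptions; the predicted probabilities would normally come from a trained classifier.

```python
import math

def cross_entropy(probs, labels, node_set):
    """Sum of -ln(probs[i][labels[i]]) over the nodes in node_set."""
    return -sum(math.log(probs[i][labels[i]]) for i in node_set)

# Toy predictions for three nodes and two classes (each row sums to 1).
probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
labels = [0, 1, 1]

train_loss = cross_entropy(probs, labels, node_set=[0, 1])
valid_loss = cross_entropy(probs, labels, node_set=[2])
```

The same function can be evaluated on any subset of labeled nodes, which is why it takes the node set as an explicit argument.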

Remarks. Both the standard Pagerank and spectral clustering are unsupervised models and therefore the training set $\mathcal{T}$ and its supervision $\mathcal{Y}_{\text{train}}$ are absent in the corresponding loss functions (i.e., Eq. (1) and (2), respectively). Nonetheless, both Pagerank and spectral clustering have been generalized to further incorporate some forms of supervision, as we will show next.

C - Graph Sanitation: Formulation and Instantiations. In the proposed graph sanitation problem, given an initial graph $G$ and a graph mining model $\mathcal{L}(G, \theta, \mathcal{T}, \mathcal{Y}_{\text{train}})$, we aim to learn a modified graph $\tilde{G}$ to boost the performance of the corresponding mining model. The basic idea is to let the mining performance on a validation set guide the modification process. Formally, the graph sanitation problem is defined as follows.

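Problem 1 (Graph Sanitation Problem).

Given: (1) a graph represented as $G = \{\mathbf{A}, \mathbf{X}\}$, (2) a graph mining task represented as $\mathcal{L}(G, \theta, \mathcal{T}, \mathcal{Y}_{\text{train}})$, (3) a validation set $\mathcal{V}$ and its supervision $\mathcal{Y}_{\text{valid}}$, and (4) the sanitation budget $B$;

Find: A modified graph $\tilde{G}$ to boost the performance of the input graph mining model.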

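We further formulate Problem 1 as a bilevel optimization problem as follows:

$$\tilde{G} = \arg\min_{G} \; \mathcal{L}_{\text{upper}}(G, \theta^{*}, \mathcal{V}, \mathcal{Y}_{\text{valid}}) \quad \text{s.t.} \quad \theta^{*} = \arg\min_{\theta} \mathcal{L}_{\text{lower}}(G, \theta, \mathcal{T}, \mathcal{Y}_{\text{train}}), \;\; D(\tilde{G}, G) \le B \tag{4}$$

where the lower-level optimization is to train the model $\theta^{*}$ based on the training set $\mathcal{T}$; the upper-level optimization aims to optimize the performance of the trained model $\theta^{*}$ on the validation set $\mathcal{V}$, and there is no overlap between $\mathcal{T}$ and $\mathcal{V}$; the distance function $D$ measures the distance between two graphs. Notice that the loss function at the upper level, $\mathcal{L}_{\text{upper}}$, might be different from the one at the lower level, $\mathcal{L}_{\text{lower}}$. For example, $\mathcal{L}_{\text{lower}}$ for both Pagerank (Eq. (1)) and spectral clustering (Eq. (2)) does not involve any supervision. However, $\mathcal{L}_{\text{upper}}$ for both models is designed to measure the performance on a validation set with supervision and therefore should be different from $\mathcal{L}_{\text{lower}}$. We elaborate on this next.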

The proposed bilevel optimization problem in Eq. (4) is quite general. In principle, it is applicable to any graph model with differentiable $\mathcal{L}_{\text{upper}}$ and $\mathcal{L}_{\text{lower}}$. We give its instantiations with the three aforementioned mining tasks and summarize them in Table 3.

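Instantiation #1: supervised Pagerank. The original Pagerank (page1999pagerank) has been generalized to encode pair-wise ranking preference (backstrom2011supervised; li2016quint). For graph sanitation with supervised Pagerank, the training set together with its supervision is absent, and the lower-level loss is given in Eq. (1). The validation set $\mathcal{V}$ consists of a positive node set $\mathcal{P}$ and a negative node set $\mathcal{N}$. The supervision of the upper-level problem is that Pagerank scores of nodes from $\mathcal{P}$ should be higher than those of nodes from $\mathcal{N}$, i.e., $\mathbf{r}[x] > \mathbf{r}[y], \forall x \in \mathcal{P}, \forall y \in \mathcal{N}$. Several choices for the upper-level loss exist. For example, we can use the Wilcoxon-Mann-Whitney loss (yan2003optimizing):

$$\min_{\mathbf{A}} \; \sum_{x \in \mathcal{P},\, y \in \mathcal{N}} \frac{1}{1 + \exp\big((\mathbf{r}^{*}[x] - \mathbf{r}^{*}[y]) / w\big)} \quad \text{s.t.} \quad \mathbf{r}^{*} \text{ minimizes Eq. (1)}, \;\; D(\tilde{\mathbf{A}}, \mathbf{A}) \le B \tag{5}$$

where $w$ is the width parameter. It is worth mentioning that Eq. (5) only modifies the graph topology $\mathbf{A}$. Although the loss in Eq. (5) does not contain the variable $\mathbf{A}$ explicitly, $\mathbf{r}^{*}$ is determined by $\mathbf{A}$ through the lower-level problem.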
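A minimal sketch of a Wilcoxon-Mann-Whitney style ranking loss follows, assuming the usual sigmoid surrogate in which each (positive, negative) pair contributes 1 / (1 + exp((score_pos − score_neg) / width)). The function name, the toy scores and the default width are assumptions made for illustration only.

```python
import math

def wmw_loss(scores, positives, negatives, width=0.1):
    """Sum over (x, y) pairs of 1 / (1 + exp((scores[x] - scores[y]) / width))."""
    total = 0.0
    for x in positives:
        for y in negatives:
            total += 1.0 / (1.0 + math.exp((scores[x] - scores[y]) / width))
    return total

# Toy ranking scores for four nodes.
scores = [0.40, 0.35, 0.15, 0.10]

# Positives ranked above negatives: small loss.
good = wmw_loss(scores, positives=[0, 1], negatives=[2, 3])
# Same scores with the roles swapped: large loss.
bad = wmw_loss(scores, positives=[2, 3], negatives=[0, 1])
```

A smaller width makes the surrogate closer to a hard count of mis-ordered pairs, at the cost of flatter gradients away from the decision boundary.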

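Instantiation #2: supervised spectral clustering. A typical way to encode supervision in spectral clustering is via ‘must-link’ and ‘cannot-link’ constraints (wagstaff2000clustering; wang2010flexible). For graph sanitation with supervised spectral clustering, the training set together with its supervision is absent, and the lower-level loss is given in Eq. (2). The validation set $\mathcal{V}$ contains a ‘must-link’ set $\mathcal{M}$ and a ‘cannot-link’ set $\mathcal{C}$. For the upper-level loss, the idea is to encourage nodes from the must-link set to be grouped in the same cluster and, at the same time, push nodes from the cannot-link set into different clusters. To be specific, $\mathcal{L}_{\text{upper}}$ can be instantiated as follows.

$$\min_{\mathbf{A}} \; -\mathbf{u}^{*\prime} \mathbf{Q}\, \mathbf{u}^{*} \quad \text{s.t.} \quad \mathbf{u}^{*} \text{ minimizes Eq. (2)}, \;\; D(\tilde{\mathbf{A}}, \mathbf{A}) \le B \tag{6}$$

where $\mathbf{Q}$ encodes the ‘must-link’ and ‘cannot-link’ constraints, that is, $\mathbf{Q}[i,j] = 1$ if $(i,j) \in \mathcal{M}$, $\mathbf{Q}[i,j] = -1$ if $(i,j) \in \mathcal{C}$, and $\mathbf{Q}[i,j] = 0$ otherwise. This instantiation only modifies the graph topology $\mathbf{A}$.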
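To make the constraint encoding concrete, the sketch below builds a link constraints matrix with +1 for must-link pairs, −1 for cannot-link pairs and 0 elsewhere, and evaluates the quadratic form −u'Qu on a cluster indicator vector u. Treating the loss as −u'Qu and storing each pair symmetrically are assumptions made for this example; the function names are hypothetical.

```python
def build_constraint_matrix(n, must_link, cannot_link):
    """Return an n x n matrix with +1 (must-link), -1 (cannot-link), 0 (otherwise)."""
    q = [[0.0] * n for _ in range(n)]
    for i, j in must_link:
        q[i][j] = q[j][i] = 1.0    # assumption: constraints are symmetric
    for i, j in cannot_link:
        q[i][j] = q[j][i] = -1.0
    return q

def constraint_loss(u, q):
    """Return -u' Q u; lower means the constraints are better satisfied."""
    n = len(u)
    return -sum(u[i] * q[i][j] * u[j] for i in range(n) for j in range(n))

q = build_constraint_matrix(4, must_link=[(0, 1)], cannot_link=[(1, 2)])

# Nodes 0 and 1 share a cluster, node 2 is in the other one: both constraints hold.
agree = constraint_loss([1.0, 1.0, -1.0, -1.0], q)
# Nodes 0 and 1 are split, nodes 1 and 2 share a cluster: both constraints are broken.
violate = constraint_loss([1.0, -1.0, -1.0, 1.0], q)
```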

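Instantiation #3: semi-supervised node classification. For graph sanitation with semi-supervised node classification, its lower-level optimization problem is given in Eq. (3). We use the cross-entropy loss over the validation set as the upper-level problem:

$$\min_{G} \; \mathcal{L}_{\text{CE}}(G, \theta^{*}, \mathcal{V}, \mathcal{Y}_{\text{valid}}) = -\sum_{i \in \mathcal{V}} \sum_{j=1}^{c} y_{ij} \ln \hat{y}_{ij} \quad \text{s.t.} \quad \theta^{*} = \arg\min_{\theta} \mathcal{L}_{\text{CE}}(G, \theta, \mathcal{T}, \mathcal{Y}_{\text{train}}), \;\; D(\tilde{G}, G) \le B \tag{7}$$

As mentioned before, there should be no overlap between the training set $\mathcal{T}$ and the validation set $\mathcal{V}$.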

Remarks. If the initially given graph is poisoned by adversarial attackers (zugner2018adversarial; DBLP:conf/iclr/ZugnerG19), the graph sanitation problem with semi-supervised node classification can also be used as an adversarial defense strategy. However, it differs from the existing robust GNNs (DBLP:conf/kdd/Jin0LTWT20; entezari2020all; wu2019adversarial) in an important way: it assumes neither that the given graph is poisoned nor any specific way by which it is poisoned. Therefore, the graph sanitation problem in this scenario could boost the performance under a wide range of attacking scenarios (e.g., non-poisoned graphs, lightly-poisoned graphs, and heavily-poisoned graphs) and has the potential to work hand-in-hand with existing robust GNN models. In the next section, we propose an effective algorithm to solve the graph sanitation problem with semi-supervised node classification.

3. Proposed Algorithms: GaSoliNe

In this section, we focus on the graph sanitation problem in the context of semi-supervised node classification and propose an effective solver named GaSoliNe. The general workflow of GaSoliNe is as follows. First, we solve the lower-level problem (Eq. (3)) and obtain a solution $\theta^{*}$ together with its corresponding updating trajectory. Then we compute the hyper-gradient of the upper-level loss function (Eq. (7)) w.r.t. the graph $G$ and use a set of hyper-gradient-guided modifications to solve the upper-level optimization problem. Recall that we need a classifier, parameterized by $\theta$, to produce the predicted labels; we refer to this classifier as the backbone classifier. Finally, we test whether another classifier can obtain a better performance on the test set over the modified graph; this classifier is named the downstream classifier. In the following subsections, we introduce our proposed solution GaSoliNe in three parts: (A) hyper-gradient computation, (B) hyper-gradient aggregation, and (C) hyper-gradient-guided modification.

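A - Hyper-Gradient Computation. Eq. (4) and its corresponding instantiation Eq. (7) fall into the family of bilevel optimization problems, where the lower-level problem is to optimize $\theta$ via minimizing the loss over the training set $\mathcal{T}$ given $G$, and the upper-level problem is to optimize $G$ via minimizing the loss over the validation set $\mathcal{V}$. We perform gradient descent w.r.t. the upper-level problem and view the lower-level problem as a dynamic system:

$$\theta^{t+1} = \Theta^{t+1}(G, \theta^{t}, \mathcal{T}, \mathcal{Y}_{\text{train}}), \quad t = 0, \ldots, T-1 \tag{8}$$

where $\theta^{0}$ is the initialization of $\theta$ and $\Theta^{t+1}$ is the updating formula, which can be instantiated as an optimizer over the lower-level objective function on the training set $\mathcal{T}$. In order to get the hyper-gradient of the upper-level problem, we assume that the dynamic system converges in $T$ iterations (i.e., $\theta^{*} = \theta^{T}$). We can then unroll the iterative solution of the lower-level problem and obtain the hyper-gradient by the chain rule as follows (baydin2017automatic). For brevity, we abbreviate the cross-entropy loss over the validation set as $\mathcal{L}_{\mathcal{V}}(\theta)$.

$$\nabla_{G} \mathcal{L}_{\mathcal{V}} = \frac{\partial \mathcal{L}_{\mathcal{V}}(\theta^{T})}{\partial G} + \sum_{t=1}^{T} \mathbf{R}_{t}\, \mathbf{J}_{t+1} \cdots \mathbf{J}_{T}\, \frac{\partial \mathcal{L}_{\mathcal{V}}(\theta^{T})}{\partial \theta^{T}} \tag{9}$$

where $\mathbf{J}_{t} = \partial \theta^{t} / \partial \theta^{t-1}$, $\mathbf{R}_{t} = \partial \theta^{t} / \partial G$, and the product $\mathbf{J}_{t+1} \cdots \mathbf{J}_{T}$ is the identity for $t = T$.

For our setting, the downstream classifier is required to be trained till convergence over the modified graph. Hence, $T$ is set to a relatively high value (e.g., 200 in our experiments) to ensure the hyper-gradient from the upper-level problem is computed over a converged classifier. In order to balance the effectiveness and the efficiency, we adopt the truncated hyper-gradient (shaban2019truncated) w.r.t. $G$ and rewrite the second part of Eq. (9) as

$$\sum_{t=P}^{T} \mathbf{R}_{t}\, \mathbf{J}_{t+1} \cdots \mathbf{J}_{T}\, \frac{\partial \mathcal{L}_{\mathcal{V}}(\theta^{T})}{\partial \theta^{T}}$$

where $P$ denotes the truncating iteration from which the hyper-gradient across iterations (i.e., $t = P, \ldots, T$) of the lower-level problem is counted. In order to achieve a faster estimation of the hyper-gradient, we further adopt a first-order approximation (nichol2018first; DBLP:conf/iclr/ZugnerG19) and the hyper-gradient can be computed as:

$$\nabla_{G} \mathcal{L}_{\mathcal{V}} = \sum_{t=P}^{T} \nabla_{G} \mathcal{L}_{\mathcal{V}}(\theta^{t}) \tag{10}$$

where the updating trajectory of $\theta^{t}$ is the same as in Eq. (8). If the initially-provided graph is undirected, then $\mathbf{A} = \mathbf{A}'$. Hence, when we compute the hyper-gradient w.r.t. the undirected graph topology $\mathbf{A}$, we need to calibrate the partial derivative into the derivative (kang2019n2n) and update the hyper-gradient as follows:

$$\nabla_{\mathbf{A}} \mathcal{L}_{\mathcal{V}} \leftarrow \nabla_{\mathbf{A}} \mathcal{L}_{\mathcal{V}} + (\nabla_{\mathbf{A}} \mathcal{L}_{\mathcal{V}})' - \operatorname{diag}(\nabla_{\mathbf{A}} \mathcal{L}_{\mathcal{V}}) \tag{11}$$

For the hyper-gradient w.r.t. the feature $\mathbf{X}$ and directed graph topology ($\mathbf{A} \neq \mathbf{A}'$), the above calibration process is not needed.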
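The calibration for undirected graphs can be illustrated with a short sketch. It assumes the calibrated matrix is the partial-derivative matrix plus its transpose minus its diagonal, which reflects that entries [i, j] and [j, i] of a symmetric adjacency matrix are the same variable. The function name and the toy values are hypothetical.

```python
def calibrate_undirected(grad):
    """Return grad + grad' - diag(grad) for a square matrix (list of lists)."""
    n = len(grad)
    return [[grad[i][j] if i == j else grad[i][j] + grad[j][i] for j in range(n)]
            for i in range(n)]

# Toy partial derivatives w.r.t. each entry of a 3 x 3 adjacency matrix.
partial = [[0.5, 0.2, -0.1],
           [0.4, 0.0, 0.3],
           [0.1, -0.3, 1.0]]

full = calibrate_undirected(partial)
```

The result is symmetric by construction, and diagonal entries are left unchanged because they correspond to a single variable each.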

B - Hyper-Gradient Aggregation. To ensure the quality of graph sanitation without introducing bias from a specific dataset split, we adopt a $K$-fold training/validation split with similar settings as cross-validation (chen2020trading). Specifically, during the training period, we split all the labeled nodes $\mathcal{Z}$ into $K$ folds and alternately select one of them as $\mathcal{V}$ (with labels $\mathcal{Y}_{\text{valid}}$) and the others as $\mathcal{T}$ (with labels $\mathcal{Y}_{\text{train}}$). In total, there are $K$ sets of training/validation splits. With the $k$-th dataset split, by Eq. (10), we obtain the hyper-gradient $\nabla^{k}_{G}$. We sum the hyper-gradients from the $K$ sets of training/validation splits into the aggregated hyper-gradient: $\nabla_{G} = \sum_{k=1}^{K} \nabla^{k}_{G}$.
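The aggregation procedure can be sketched as follows. The hyper-gradient computation itself is passed in as a callable, because it depends on the backbone classifier; `dummy_hyper_gradient` below is a hypothetical stand-in that only makes the example runnable. The shuffling seed and the round-robin fold assignment are also assumptions, not details from the paper.

```python
import random

def k_fold_splits(labeled_nodes, num_folds, seed=0):
    """Yield (train_nodes, valid_nodes) pairs; each fold is the validation set once."""
    nodes = list(labeled_nodes)
    random.Random(seed).shuffle(nodes)
    folds = [nodes[k::num_folds] for k in range(num_folds)]
    for k in range(num_folds):
        valid = folds[k]
        train = [v for m, fold in enumerate(folds) if m != k for v in fold]
        yield train, valid

def aggregate_hyper_gradients(labeled_nodes, num_folds, hyper_gradient_fn):
    """Sum the per-split hyper-gradient matrices returned by hyper_gradient_fn."""
    total = None
    for train, valid in k_fold_splits(labeled_nodes, num_folds):
        grad = hyper_gradient_fn(train, valid)
        if total is None:
            total = [row[:] for row in grad]
        else:
            total = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(total, grad)]
    return total

# Hypothetical stand-in: a real implementation would train the backbone on
# `train` and differentiate the validation loss on `valid` w.r.t. the graph.
def dummy_hyper_gradient(train, valid):
    return [[float(len(valid)), 0.0], [0.0, float(len(train))]]

splits = list(k_fold_splits(range(10), num_folds=5))
agg = aggregate_hyper_gradients(range(10), 5, dummy_hyper_gradient)
```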

C - Hyper-Gradient-Guided Modification. To modify the graph based on the aggregated hyper-gradient $\nabla_{G}$, we provide two variants: discretized modification and continuous modification. The discretized modification works with binary inputs such as adjacency matrices of unweighted graphs and binary feature matrices. The continuous modification is suitable for both continuous and binary inputs. For clarity of explanation, we replace $G$ with the adjacency matrix $\mathbf{A}$ as an example of topology modification. It is straightforward to generalize this to feature modification with the feature matrix $\mathbf{X}$.

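The discretized modification follows the guide of a hyper-gradient-based score matrix:

$$\mathbf{S} = (-\nabla_{\mathbf{A}}) \circ (\mathbf{1} - 2\mathbf{A}) \tag{12}$$

where $\mathbf{1}$ is an all-ones matrix and $\circ$ denotes the element-wise product. This score matrix is composed of ‘preference’ (i.e., $-\nabla_{\mathbf{A}}$) and ‘modifiability’ (i.e., $\mathbf{1} - 2\mathbf{A}$). Only entries with both high ‘preference’ and ‘modifiability’ are assigned high scores. For example, a large positive $(-\nabla_{\mathbf{A}})[i,j]$ indicates a strong preference for adding an edge between the $i$-th node and the $j$-th node based on the hyper-gradient, and if there was no edge between the $i$-th node and the $j$-th node (i.e., $\mathbf{A}[i,j] = 0$), the signs of $(-\nabla_{\mathbf{A}})[i,j]$ and $(\mathbf{1} - 2\mathbf{A})[i,j]$ are the same, which results in a large $\mathbf{S}[i,j]$. Then, the corresponding entries in $\mathbf{A}$ are flipped (0 to 1, or 1 to 0) based on the indices of the top-$B$ entries in $\mathbf{S}$.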

Notice that if users prefer GaSoliNe to perform only ‘deleting’ or only ‘adding’ modifications instead of ‘flipping’, one additional operation on the hyper-gradient matrix (i.e., the ‘preference’ matrix $-\nabla_{\mathbf{A}}$) is sufficient: setting its positive or its negative entries to zero, respectively.
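The discretized modification can be sketched as below, under the assumption that the score of entry (i, j) is the ‘preference’ −grad[i][j] multiplied by the ‘modifiability’ 1 − 2·adj[i][j], that the graph is binary and undirected, and that the hyper-gradient has already been calibrated to be symmetric. Skipping entries with non-positive scores and the `mode` argument are design choices of this example rather than details from the paper.

```python
def discretized_step(adj, grad, budget, mode="flip"):
    """Flip the `budget` highest-scoring entries of a binary, undirected adjacency matrix.

    Score = (-grad) * (1 - 2 * adj), elementwise. mode is "flip", "add" or "delete".
    """
    n = len(adj)
    candidates = []
    for i in range(n):
        for j in range(i + 1, n):           # upper triangle only: undirected graph
            preference = -grad[i][j]
            if mode == "add" and preference < 0:
                preference = 0.0            # drop any preference for deleting
            if mode == "delete" and preference > 0:
                preference = 0.0            # drop any preference for adding
            score = preference * (1 - 2 * adj[i][j])
            if score > 0:                   # assumption: never flip against the gradient
                candidates.append((score, i, j))
    candidates.sort(reverse=True)
    new_adj = [row[:] for row in adj]
    for _, i, j in candidates[:budget]:
        new_adj[i][j] = new_adj[j][i] = 1 - adj[i][j]
    return new_adj

adj = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
grad = [[0.0, 0.5, -0.9], [0.5, 0.0, -0.1], [-0.9, -0.1, 0.0]]

flipped = discretized_step(adj, grad, budget=2)               # add (0,2), delete (0,1)
only_delete = discretized_step(adj, grad, budget=2, mode="delete")
only_add = discretized_step(adj, grad, budget=1, mode="add")
```

In this toy example the strongest signal is to add the edge between nodes 0 and 2, followed by deleting the existing edge between nodes 0 and 1.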

The continuous modification utilizes the hyper-gradient directly, which is represented as:

$$\mathbf{A} \leftarrow \mathbf{A} - \eta\, \nabla_{\mathbf{A}} \tag{13}$$

We compute the learning rate $\eta$ based on the ratio of the modification budget to the sum of absolute values of the hyper-gradient matrix. For the adjacency matrix modified by continuous modification, we project its entries onto a non-negative range to ensure the non-negative weights required by the downstream classifiers in their normalization process. In implementation, for both modification methods, we set the budget in every iteration as $b$ and update the graph in multiple steps until we run out of the total budget $B$, so as to balance the effectiveness and the efficiency. Algorithm 1 summarizes the detailed modification procedure, which can be used to improve various elements of the graph (e.g., the adjacency matrix $\mathbf{A}$ and the feature matrix $\mathbf{X}$). In addition, in our experiments, the budget for topology ($B_{\text{topo}}$) and the budget for feature ($B_{\text{fea}}$) are set and counted separately, since the costs of modifying different elements of a graph might not be comparable with each other.
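A minimal sketch of one continuous modification step follows. It assumes the learning rate equals the step budget divided by the sum of absolute hyper-gradient values, so that the total absolute change before projection equals the budget. The projection interval is not specified here, so the clipping range [0, upper] with `upper=1.0` is purely an assumption of this example, as are the function and parameter names.

```python
def continuous_step(adj, grad, budget, upper=1.0):
    """One gradient step on a weighted adjacency matrix, clipped to [0, upper].

    The learning rate is budget / sum(|grad|); `upper` is an assumed bound.
    """
    total = sum(abs(g) for row in grad for g in row)
    if total == 0:
        return [row[:] for row in adj]      # zero gradient: nothing to change
    lr = budget / total
    return [[min(upper, max(0.0, a - lr * g)) for a, g in zip(a_row, g_row)]
            for a_row, g_row in zip(adj, grad)]

adj = [[0.0, 1.0], [1.0, 0.0]]
grad = [[0.0, 2.0], [2.0, 0.0]]

# Learning rate 1.0 / 4.0 = 0.25, so each off-diagonal weight drops by 0.5.
new_adj = continuous_step(adj, grad, budget=1.0)

# A larger step would push weights below zero; the projection clips them at 0.
clipped = continuous_step([[0.0, 0.2], [0.2, 0.0]], [[0.0, 1.0], [1.0, 0.0]], budget=2.0)
```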

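Input: the original graph $G$, the set of labeled nodes $\mathcal{Z}$ with their labels, modification budget $B$, modification budget $b$ in every modification step, number of training/validation split folds $K$, truncating iteration $P$ and converging iteration $T$;
Output: the modified graph $\tilde{G}$;
initialization: split the labeled nodes $\mathcal{Z}$ and their corresponding labels into $K$ folds; initialize the modified graph $\tilde{G} \leftarrow G$; initialize the cumulative budget $\delta \leftarrow 0$;
while $\delta < B$ do
      for $k = 1$ to $K$ do
            set $\mathcal{V}$ and $\mathcal{Y}_{\text{valid}}$ to the $k$-th fold, set $\mathcal{T}$ and $\mathcal{Y}_{\text{train}}$ to the remaining folds, and initialize $\theta^{0}$;
            for $t = 0$ to $T - 1$ do
                  update $\theta^{t}$ to $\theta^{t+1}$ by Eq. (8);
                  if $t + 1 \ge P$ then
                        compute $\nabla_{G} \mathcal{L}_{\mathcal{V}}(\theta^{t+1})$ given $\theta^{t+1}$ and add it to $\nabla^{k}_{G}$ (Eq. (10));
                  end if
            end for
      end for
      calibrate hyper-gradients by Eq. (11) (if needed);
      sum hyper-gradients into $\nabla_{G} = \sum_{k=1}^{K} \nabla^{k}_{G}$;
      set negative or positive entries of the hyper-gradient matrix as zeros (if needed);
      update $\tilde{G}$ based on the guide of the score matrix by Eq. (12) (discretized modification) or by Eq. (13) (continuous modification) with budget $b$; $\delta \leftarrow \delta + b$;
end while
Algorithm 1 GaSoliNe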

4. Experiments

In this section, we perform empirical evaluations. All the experiments are designed to answer the following research questions:

  • How applicable is the proposed GaSoliNe with respect to different backbone/downstream classifiers, as well as different modification strategies?

  • How effective is the proposed GaSoliNe for the initial graph under various forms of perturbation? To what extent does the proposed GaSoliNe help strengthen the existing robust graph neural network methods?

4.1. Experiment Setups

We evaluate the proposed GaSoliNe on the Cora, Citeseer, and Polblogs datasets (kipf2016semi; zugner2018adversarial; DBLP:conf/iclr/ZugnerG19), whose detailed statistics are attached in the Appendix (Table 6). Since the Polblogs dataset does not contain node features, we use an identity matrix as the node feature matrix. All three datasets are undirected and unweighted graphs, and we conduct experiments on the largest connected component of every dataset.

In order to set fair modification budgets across different datasets, the modification budget on the adjacency matrix, $B_{\text{topo}}$, and the modification budget on the feature matrix, $B_{\text{fea}}$, are computed from the size of each dataset: $B_{\text{topo}}$ is based on the total number of non-zero entries in the adjacency matrix, and $B_{\text{fea}}$ is based on the total number of features of all the nodes. The same budget settings are used throughout all the experiments. Detailed hyper-parameter settings and the modification budget analysis are attached in Appendix A and Appendix B, respectively.

We use accuracy as the evaluation metric and repeat every set of experiments multiple times to report the mean ± std value.

Figure 1. Visualization results of the original Citeseer graph (a) and the modified Citeseer graphs by different variants of GaSoliNe with backbone classifiers as GCN (b), SGC (c), and APPNP (d), respectively. Best viewed in color.

4.2. Applicability of GaSoliNe

In this subsection, we conduct an in-depth study of the properties of the graphs modified by GaSoliNe. The proposed GaSoliNe trains a backbone classifier in the lower-level problem and uses the trained backbone classifier to modify the initially-provided graph and improve the performance of the downstream classifier on the test nodes. In addition, GaSoliNe is capable of modifying both the graph topology (i.e., $\mathbf{A}$) and feature (i.e., $\mathbf{X}$) in both the discretized and continuous fashion. To validate and verify that, we select three classic GNN-based node classifiers, GCN (kipf2016semi), SGC (wu2019simplifying), and APPNP (klicpera2018predict), to serve as the backbone classifiers and the downstream classifiers. The detailed experiment procedure is as follows. First, we modify the given graph using the proposed GaSoliNe algorithm with 4 modification strategies (i.e., modifying topology or node feature with discretized or continuous modification). Each variant is implemented with 3 backbone classifiers, so that in total there are 12 sets of GaSoliNe settings. Second, with the modified graphs, we test the performance of the downstream classifiers and report the result (mean±std Acc) under each setting. For the results reported in this subsection, the initially provided graph is Citeseer (kipf2016semi). Experimental results are reported in Table 4, where ‘DT’ stands for ‘discretized topology modification’, ‘CT’ stands for ‘continuous topology modification’, ‘DF’ stands for ‘discretized feature modification’, and ‘CF’ stands for ‘continuous feature modification’. The second row of Table 4 shows the results on the initially-provided graph and the other rows (excluding the first row) denote the results on modified graphs with different settings. Statistical significance of each improvement over the results on the initially-provided graph is assessed with a p-value < 0.01. We have the following observations. First, the proposed GaSoliNe is able to improve the accuracy of the downstream classifier over the initially-provided graph, for every combination of the modification strategy (discretized vs. continuous) and the modification target (topology vs. feature), and in the vast majority of cases, the improvement is statistically significant. Second, the graphs modified by GaSoliNe can benefit different downstream classifiers with different backbone classifiers, which demonstrates the transferability and broad applicability of the proposed GaSoliNe.

We further provide visualizations of graphs before and after modification. We present the visualizations of the initial Citeseer graph and the modified Citeseer graphs from three GaSoliNe discretized topology modification variants with backbone classifiers as GCN (kipf2016semi), SGC (wu2019simplifying), and APPNP (klicpera2018predict), respectively. The detailed visualization procedure is as follows. We utilize the training set (and the corresponding labels) of the given initial/modified graphs to train a GCN (kipf2016semi) and use the hidden representation of the trained GCN to encode every node into a high-dimensional vector. Then, we adopt the t-SNE (van2014accelerating) method to map the high-dimensional embeddings into two-dimensional ones for visualization. Figure 1 shows the visualization results of node embeddings of the original Citeseer graph and the modified Citeseer graphs with different variants of GaSoliNe. Clearly, the node embeddings from the modified graphs are more discriminative than the embeddings from the original graph. This further demonstrates that the proposed GaSoliNe can incorporate various backbone classifiers and improve the graph quality to benefit downstream classifiers.

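| Variant | Backbone | GCN | SGC | APPNP |
|---|---|---|---|---|
| None | None | 72.23±0.45 | 72.81±0.18 | 71.80±0.37 |
| DT | GCN | 74.71±0.26 | 74.77±0.11 | 75.36±0.24 |
| DT | SGC | 74.70±0.42 | 75.24±0.15 | 75.64±0.29 |
| DT | APPNP | 74.61±0.25 | 74.60±0.12 | 75.42±0.40 |
| DF | GCN | 72.36±0.30 | 72.73±0.15 | 72.80±0.38 |
| DF | SGC | 73.31±0.45 | 73.44±0.17 | 73.60±0.42 |
| DF | APPNP | 72.56±0.34 | 72.91±0.13 | 73.56±0.43 |
| CT | GCN | 73.10±0.39 | 73.55±0.11 | 74.75±0.20 |
| CT | SGC | 72.96±0.33 | 73.48±0.15 | 74.40±0.29 |
| CT | APPNP | 72.78±0.53 | 73.35±0.11 | 74.37±0.28 |
| CF | GCN | 72.70±0.44 | 73.55±0.09 | 73.84±0.25 |
| CF | SGC | 72.85±0.37 | 73.64±0.37 | 73.75±0.38 |
| CF | APPNP | 73.04±0.33 | 73.63±0.15 | 73.90±0.30 |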
Table 4. Effectiveness of GaSoliNe under multiple variants (Mean±Std Accuracy). The first and second columns denote the modification strategies and backbone classifiers adopted by GaSoliNe, respectively. The remaining columns show the performance of various downstream classifiers. Statistical significance of each improvement is assessed against the results on the original graph (values in the second row and the corresponding columns) with a p-value < 0.01.

4.3. Effectiveness of GaSoliNe

(a) metattack
(b) Nettack
(c) random attack
Figure 2. Performance of models under (a) metattack, (b) Nettack, and (c) random attack. Best viewed in color.

As we point out in Sec. 1, the defects of the initially-provided graph could be due to various reasons. In this subsection, we evaluate the effectiveness of the proposed GaSoliNe by (A) comparing it with baseline methods on various poisoned/noisy graphs, (B) integrating it with existing robust GNN methods, and (C) a case study of the response of GaSoliNe towards poisoned/noisy graphs. The attack methods we use to poison benign graphs are as follows:

  • Random Attack: The attacker randomly flips entries of benign adjacency matrices. The budget of attacker is set as .

  • Targeted Attack: The attacker aims to lower the performance of classifiers on a group of ‘target’ nodes. We adopt Nettack (zugner2018adversarial) to conduct targeted attack towards the graph topology only. The budget of attacker for Nettack  is set as .

  • Global Attack: The attacker poisons the overall performance of node classifiers. We adopt metattack (DBLP:conf/iclr/ZugnerG19) to attack topology of benign graphs. The budget of attacker of metattack is set in the same form as random attack but with a different perturbation rate.
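The random attack in the list above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `random_flip_attack`, the edge-set representation, and the `budget` argument (the number of node pairs to flip) are assumptions, and the budget value used in the experiments is not reproduced here. Flipping is applied to unordered node pairs so that the graph stays undirected.

```python
import random


def random_flip_attack(num_nodes, edges, budget, seed=0):
    """Flip `budget` distinct node pairs of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs; each undirected edge is stored once.
    Flipping a pair removes the edge if it exists and adds it otherwise,
    i.e. it flips both (u, v) and (v, u) in the adjacency matrix.
    """
    rng = random.Random(seed)
    # Canonical form (min, max) so that (u, v) and (v, u) are the same pair.
    edge_set = {(min(u, v), max(u, v)) for u, v in edges}
    max_pairs = num_nodes * (num_nodes - 1) // 2
    if budget > max_pairs:
        raise ValueError("budget exceeds the number of node pairs")
    flipped = set()
    while len(flipped) < budget:
        u, v = rng.sample(range(num_nodes), 2)  # two distinct nodes, no self-loops
        pair = (min(u, v), max(u, v))
        if pair in flipped:
            continue  # never flip the same pair twice
        flipped.add(pair)
        # Symmetric difference toggles membership of `pair`.
        edge_set ^= {pair}
    return edge_set, flipped


benign = [(0, 1), (1, 2), (2, 3)]
poisoned, flipped = random_flip_attack(num_nodes=5, edges=benign, budget=2)
```

Returning the set of flipped pairs alongside the poisoned graph makes it easy to label the malicious modifications afterwards.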

A - Comparison with baseline methods. We compare GaSoliNe with the following baseline methods: APPNP (klicpera2018predict), GAT (velivckovic2018graph), APPNP-Jaccard (wu2019adversarial), APPNP-SVD (entezari2020all), RGCN (zhu2019robust). Detailed descriptions of the baseline methods can be found in Appendix A.2. Recall that we feed all the graph modification-based methods (APPNP-Jaccard, APPNP-SVD, GaSoliNe) with exactly the same downstream classifier (APPNP) for a fair comparison.

We set three variants of GaSoliNe to compare with the above baselines. To be specific, we refer to (1) GaSoliNe with discretized modification on topology as GaSoliNe-DT, (2) GaSoliNe with continuous modification on feature as GaSoliNe-CF, and (3) GaSoliNe with discretized modification on topology and continuous modification on feature as GaSoliNe-DTCF. All these GaSoliNe variants use APPNP (klicpera2018predict) as both the backbone classifier and the downstream classifier. We attack the Cora (kipf2016semi) dataset over a range of budgets (i.e., the perturbation rate for metattack and random attack, and the number of perturbations per node for Nettack) and report the accuracy (mean±std) in Figure 2. From the experiment results we observe that: (1) as the attack budget increases, the performance of all methods drops, which is consistent with our intuition; (2) variants of GaSoliNe consistently outperform the baselines under various adversarial/noisy scenarios; and (3) the proposed GaSoliNe even improves over the original, benign graphs (i.e., zero perturbation rate and zero perturbations/node).

An interesting question is, if the initially-provided graph is heavily poisoned/noisy, to what extent is the proposed GaSoliNe still effective? To answer this question, we study the performance of GaSoliNe and the baseline methods on graphs heavily poisoned by random attack, metattack, and Nettack. The detailed experiment results are presented in Table 5. In most cases, GaSoliNe obtains competitive or even better performance compared with the baseline methods. On the Polblogs graph, GaSoliNe does not perform as well as on the other two datasets. This is because the Polblogs graph does not have node features. Therefore, in the heavily-poisoned case, GaSoliNe can only bring limited improvement without the supervision from node features.

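| Attack | Data | APPNP | GAT | APPNP-Jaccard | APPNP-SVD | RGCN | GaSoliNe-DT | GaSoliNe-CF | GaSoliNe-DTCF |
|---|---|---|---|---|---|---|---|---|---|
| metattack | Cora | 46.97±0.74 | 48.83±0.18 | 65.54±0.92 | 60.12±0.55 | 50.57±0.77 | 67.28±0.74 | 56.96±0.86 | 68.75±0.86 |
| metattack | Citeseer | 49.41±2.23 | 62.40±0.69 | 57.84±0.90 | 50.60±0.63 | 55.53±1.43 | 63.54±1.48 | 58.41±1.45 | 62.16±0.98 |
| metattack | Polblogs | 58.42±3.56 | 48.21±6.64 | N/A | 77.34±3.25 | 50.76±0.86 | 65.04±0.65 | 55.04±4.07 | 64.66±1.39 |
| Nettack | Cora | 60.72±1.23 | 54.22±2.25 | 63.31±1.18 | 53.37±2.41 | 56.51±1.14 | 64.46±2.24 | 63.86±2.41 | 66.14±1.90 |
| Nettack | Citeseer | 68.25±6.77 | 61.90±4.35 | 71.90±1.43 | 54.60±5.13 | 56.35±1.46 | 71.59±3.85 | 69.37±4.82 | 74.29±1.56 |
| Nettack | Polblogs | 90.48±1.03 | 91.11±0.65 | N/A | 92.56±2.17 | 93.13±0.23 | 92.26±1.62 | 90.28±0.66 | 92.37±1.70 |
| random attack | Cora | 74.33±0.42 | 58.05±1.01 | 74.98±0.26 | 72.51±0.40 | 68.93±0.43 | 77.12±0.27 | 78.29±0.52 | 77.82±0.22 |
| random attack | Citeseer | 69.76±0.61 | 60.77±1.59 | 69.45±0.44 | 66.41±0.40 | 65.66±0.20 | 73.77±0.21 | 72.27±0.42 | 73.35±0.53 |
| random attack | Polblogs | 74.74±2.78 | 84.48±1.02 | N/A | 84.04±1.72 | 81.73±0.91 | 73.40±4.06 | 77.13±1.59 | 77.57±2.93 |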
Table 5. Comparison with baselines on heavily poisoned datasets (Mean±Std Accuracy). Some results are not applicable since APPNP-Jaccard requires node features, which are absent from the Polblogs graph.

B - Integrating with graph defense strategies. GaSoliNe does not make any assumption about the property of the defects of the initially-provided graph. We further evaluate whether GaSoliNe can help boost the performance of both model-based and data-based defense strategies under the heavily-poisoned settings. We integrate GaSoliNe with a data-based defense baseline APPNP-SVD (entezari2020all), a model-based defense baseline RGCN (zhu2019robust), and another strong baseline GAT (velivckovic2018graph), since they have shown competitive performance in Table 5 and Figure 2. The detailed procedure is as follows. For the model-based methods (i.e., GAT and RGCN), GaSoliNe first modifies the graph, and then the baselines are run on the modified graphs to report the final results. For the data-based method (i.e., APPNP-SVD), we first use the baseline to preprocess the graphs, then modify the graphs again with GaSoliNe, and finally run the downstream classifier (APPNP) on the twice-modified graphs. In this task, we use GaSoliNe-DTCF to integrate with the various defense methods. In order to heavily poison the graphs, we attack the benign graphs with metattack (DBLP:conf/iclr/ZugnerG19). We report the results in Table 1 and observe that, after integrating with GaSoliNe, the performance of all the defense methods further improves significantly with a p-value < 0.01.

C - Case study about the behaviour of GaSoliNe. Here, we further study the potential reasons behind the success of GaSoliNe. To this end, we conduct a case study whose core idea is to label malicious modifications (from adversaries) and test whether GaSoliNe is able to detect them. Specifically, we use different kinds of attackers (i.e., metattack (DBLP:conf/iclr/ZugnerG19), Nettack (zugner2018adversarial), and random attack) to modify the structure of a benign graph into a poisoned graph. Then, we use the score matrix from Eq. (12) to assign a score to every entry of the poisoned adjacency matrix. As we mentioned in Section 3, the higher the score an entry obtains, the more likely GaSoliNe is to modify it. We compute the average score of three groups of entries of the poisoned adjacency matrix: the poisoned entries after adding/deleting perturbations from adversaries, the benign existing edges without perturbation, and the benign non-existing edges without perturbation. Note that both the benign graphs and the poisoned graphs are unweighted, and we define the following auxiliary matrices: a difference matrix whose nonzero entries indicate poisoned entries; a benign edge indicator matrix, defined using element-wise multiplication, whose nonzero entries indicate the benign existing edges without perturbation; and a benign non-existing edge indicator matrix whose nonzero entries indicate the benign non-existing edges without perturbation. Based on that, we have the following three statistics:

which denote the average score obtained by poisoned entries, benign existing edges, and benign non-existing edges.
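One way to compute these three statistics is sketched below. The names `benign`, `poisoned`, and `scores` and the dense list-of-lists representation are assumptions made for illustration. The grouping follows the definitions in the text: an entry is poisoned if it differs between the two adjacency matrices, a benign existing edge if it is 1 in both, and a benign non-existing edge if it is 0 in both. Skipping the diagonal is an additional assumption (no self-loops).

```python
def group_average_scores(benign, poisoned, scores):
    """Average score of poisoned, benign-existing, and benign-non-existing entries.

    benign, poisoned: square 0/1 adjacency matrices (lists of lists).
    scores: matrix of the same shape with one score per entry.
    Diagonal entries are skipped (assumption: no self-loops).
    """
    sums = {"poisoned": 0.0, "existing": 0.0, "non_existing": 0.0}
    counts = {"poisoned": 0, "existing": 0, "non_existing": 0}
    n = len(benign)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if benign[i][j] != poisoned[i][j]:
                group = "poisoned"        # entry changed by the attacker
            elif benign[i][j] == 1:
                group = "existing"        # edge present before and after
            else:
                group = "non_existing"    # edge absent before and after
            sums[group] += scores[i][j]
            counts[group] += 1
    # Groups with no entries are reported as None rather than dividing by zero.
    return {g: (sums[g] / counts[g] if counts[g] else None) for g in sums}


benign = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
poisoned = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]   # the attacker added edge (0, 2)
scores = [[0.0, 0.1, 0.9], [0.1, 0.0, 0.2], [0.9, 0.2, 0.0]]
averages = group_average_scores(benign, poisoned, scores)
```

In this toy example the poisoned entries receive the highest average score, which is the pattern the case study looks for.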

Detailed results are presented in Figure 3. We observe that GaSoliNe  tends to modify poisoned entries more (with higher scores) than to modify benign unperturbed entries in the adjacency matrix of poisoned graphs, which is consistent with our expectation and enables the algorithm to partially recover the benign graphs and to boost the performance of downstream classifiers.

(a) metattack
(b) Nettack
(c) random attack
Figure 3. Score of various entries under metattack (a), Nettack  (b), and random attack (c). Best viewed in color.

5. Related Work

A - Graph Modification. The vast majority of the existing work on graph modification assumes the initially provided graph is impaired or perturbed in a specific way. Network imputation problems focus on restoring missing links in a partially observed graph. For example, Liben-Nowell and Kleinberg (liben2007link) carefully studied the performance of a set of node topology proximity measures for predicting interactions on the network; Huisman (huisman2009imputation) proposed several procedures to handle missing data in the framework of exponential random graph data. Besides conventional network imputation tasks, knowledge graph completion aims to predict missing links between entities. The representative works include the translation-based methods TransE (bordes2013translating) and TransH (wang2014knowledge), and the factorization-based method ComplEx (trouillon2016complex). For network connectivity analysis, Chen et al. (chen2015node; chen2018network) manipulated the graph connectivity by changing the underlying topology. Another relevant line is adversarial defense, which addresses the fragility of graph mining models. Wu et al. (wu2019adversarial) took a close look at adversarial behaviours and proposed that deleting edges connecting two dissimilar nodes is effective in defending against adversarial attacks; Entezari et al. (entezari2020all) carefully studied the property of Nettack (zugner2018adversarial) and proposed that low-rank approximation is effective in retaining the performance of downstream GCNs on the poisoned graphs; Jin et al. (DBLP:conf/kdd/Jin0LTWT20) incorporated the modification of the graph into the training of downstream classifiers and used topology sparsity and feature smoothness to guide the optimization. In addition, works like supervised PageRank (backstrom2011supervised; li2016quint) and constrained spectral clustering (wang2010flexible) encoded extra supervision to guide the modification of graphs.

B - Bilevel Optimization. Bilevel optimization is a powerful mathematical tool with broad applications in machine learning and data mining, and various tasks can be modelled as bilevel optimization problems. For instance, Finn et al. (finn2017model) studied the problem of learning a model-agnostic initialization by solving a bilevel optimization problem with a first-order approximation of the hyper-gradient; Li et al. (li2016data) proposed a bilevel optimization-based poisoning attack method for factorization-based systems; Chen et al. (chen2020trading) designed a data debugging framework to identify the overly-personalized ratings and improve the performance of collaborative filtering. For solutions, the forward and reverse gradients (franceschi2017forward) are important tools for computing the hyper-gradient; truncated back-propagation (shaban2019truncated) can provide an effective approximation of the hyper-gradient with a theoretical guarantee. In addition, Colson et al. (colson2007overview) provided a detailed review of bilevel optimization problems.
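For readers less familiar with the setting, a generic bilevel problem and its hyper-gradient can be written as follows. The notation (upper-level variable $\theta$, lower-level variable $w$, and losses $\mathcal{L}_{\mathrm{upper}}$ and $\mathcal{L}_{\mathrm{lower}}$) is generic and chosen for illustration only; it is not the notation of the GaSoliNe formulation.

```latex
\begin{aligned}
&\min_{\theta}\; \mathcal{L}_{\mathrm{upper}}\bigl(w^{*}(\theta), \theta\bigr)
\quad \text{s.t.} \quad
w^{*}(\theta) \in \arg\min_{w}\; \mathcal{L}_{\mathrm{lower}}(w, \theta), \\
&\frac{d \mathcal{L}_{\mathrm{upper}}}{d \theta}
= \frac{\partial \mathcal{L}_{\mathrm{upper}}}{\partial \theta}
+ \Bigl(\frac{\partial w^{*}(\theta)}{\partial \theta}\Bigr)^{\top}
\frac{\partial \mathcal{L}_{\mathrm{upper}}}{\partial w}\Big|_{w = w^{*}(\theta)} .
\end{aligned}
```

In practice $w^{*}(\theta)$ is usually approximated by a finite number of gradient steps on the lower-level loss. Forward- and reverse-mode differentiation propagate derivatives through all of these steps, whereas truncated back-propagation differentiates through only the last few.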

6. Conclusion

In this paper, we introduce the graph sanitation problem, which aims to improve an initially-provided graph for a given graph mining model. We formulate the graph sanitation problem as a bilevel optimization problem and show that it can be instantiated by a variety of graph mining models such as supervised PageRank, supervised clustering and node classification. We further propose an effective solver named GaSoliNe for the graph sanitation problem with semi-supervised node classification. GaSoliNe adopts an efficient approximation of the hyper-gradient to guide the modification of the initially-provided graph. GaSoliNe is versatile and comes with multiple variants. The extensive experimental evaluations demonstrate the broad applicability and effectiveness of the proposed GaSoliNe.


Appendix A Reproducibility

We will release the source code upon the publication of the paper. Here, we present detailed experiment settings with dataset resources and all the model hyper-parameter settings.

A.1. Datasets

We conduct experiments on three public graph datasets: Cora, Citeseer, and Polblogs. The detailed statistics of the datasets are in Table 6.

| Data | Nodes | Edges | Classes | Features |
|---|---|---|---|---|
| Cora | 2,485 | 5,069 | 7 | 1,433 |
| Citeseer | 2,110 | 3,668 | 6 | 3,703 |
| Polblogs | 1,222 | 16,714 | 2 | N/A |
Table 6. Statistics of Datasets

A.2. Baseline Methods

The detailed descriptions of the baseline methods that our proposed GaSoliNe is compared with are as follows:

  • APPNP (klicpera2018predict): A personalized Pagerank-based neural prediction model with an improved propagation scheme which leverages information from a large, adjustable neighborhood.

  • GAT (velivckovic2018graph): Graph neural nets with masked self-attention layers to specify different weights to different nodes in the neighborhood. It has the potential to assign lower weights to malicious edges added by the adversary and is often selected as a strong baseline against poisoning attacks.

  • APPNP-Jaccard (wu2019adversarial): A preprocessing-based method which deletes an edge if the node feature similarity between its two endpoints is lower than a given threshold. We run APPNP (klicpera2018predict) on the preprocessed graph for a fair comparison since (1) APPNP enjoys stronger generalization ability than GCN (kipf2016semi) according to (klicpera2018predict), and (2) this ensures that all the graph modification-based methods (APPNP-Jaccard, APPNP-SVD (entezari2020all), and GaSoliNe) have exactly the same downstream classifier (APPNP).

  • APPNP-SVD (entezari2020all): Inspired by the observation that Nettack (zugner2018adversarial) tends to increase the rank of the adjacency matrix, this method uses a low-rank approximation of the adjacency matrix to counteract the adversarial attack. We run APPNP (klicpera2018predict) on the preprocessed graph for a fair comparison.

  • RGCN (zhu2019robust): This method replaces the vector-formed node embedding by a Gaussian distribution-formed embedding and equips a variance-based attention mechanism to absorb the effect of adversarial changes.
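To make the preprocessing-based idea behind APPNP-Jaccard concrete, the following is a minimal sketch of similarity-based edge pruning. It is not the baseline's actual code: the function names, the representation of binary node features as sets of active feature indices, and the threshold value are assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def prune_dissimilar_edges(edges, features, threshold):
    """Keep only edges whose endpoints have Jaccard similarity >= threshold.

    edges: iterable of (u, v) pairs.
    features: dict mapping node -> set of indices of its nonzero binary features.
    """
    return [(u, v) for u, v in edges
            if jaccard(features[u], features[v]) >= threshold]


features = {0: {1, 2, 3}, 1: {2, 3, 4}, 2: {7, 8}}
edges = [(0, 1), (1, 2)]
kept = prune_dissimilar_edges(edges, features, threshold=0.1)
```

Here the edge between nodes 1 and 2 is dropped because the two nodes share no features, while the edge between nodes 0 and 1 is kept. This kind of preprocessing needs node features, which is why it does not apply to a featureless graph such as Polblogs.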

A.3. Hyper-Parameter Settings

We summarize the hyper-parameter settings of models implemented in our experiments including baseline methods, backbone and downstream classifiers of GaSoliNe:

  • GCN (kipf2016semi): We follow the settings of the publicly available implementation of GCN to implement a -layered GCN with hidden units, , and regularization parameter .

  • SGC (wu2019simplifying): We implement a -layered SGC with hidden units and regularization parameter .

  • APPNP (klicpera2018predict): We follow the recommended hyper-parameter settings of APPNP (klicpera2018predict) to set the and the number of power iterations as . The number of hidden units is and regularization parameter .

  • GAT (velivckovic2018graph): We use the publicly available implementation of GAT and adopt the same hyper-parameter settings as the authors did.

  • APPNP-Jaccard (wu2019adversarial): We search the edge removing threshold of Jaccard node similarity from and implement APPNP (klicpera2018predict) on the preprocessed graph to report the best results from the above settings.

  • APPNP-SVD (entezari2020all): We search the reduced rank of SVD from to get low-rank approximation of given graphs and then implement APPNP (klicpera2018predict) on the preprocessed graph to report the best results from the above settings.

  • RGCN (zhu2019robust): We use the publicly available implementation of RGCN and adopt the same hyper-parameter settings as the authors did.

The detailed settings of GaSoliNe  are as follows: (1) for all the modification strategies (discretized vs. continuous and topology vs. feature), the modification budget towards topology and the modification budget towards feature have been introduced in the Section 4.1. Specifically, the and . We modify the graph in steps so the budget in every modification step is (i.e., and ). For discretized modification, we adopt the ‘flipping’ strategy instead of ‘adding’ or ‘deleting’ over three datasets. (2) the settings of backbone classifiers and downstream classifiers (GCN (kipf2016semi), SGC (wu2019simplifying), APPNP (klicpera2018predict)) used in our experiments follow the aforementioned settings. (3) the number of iterations for the optimization of lower-level problem is set as and the truncating iteration is set as . The number of folds is set as .
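The step-wise discretized ‘flipping’ strategy can be illustrated as follows. This is only a sketch under stated assumptions: `flip_top_k`, `sanitize_in_steps`, and `score_fn` are hypothetical names, the scores are taken as given (in GaSoliNe they come from the hyper-gradient-based score matrix), and the total budget is assumed to be split evenly over the modification steps.

```python
def flip_top_k(edge_set, scores, k):
    """Flip the k node pairs with the highest scores.

    edge_set: set of (u, v) pairs with u < v (undirected, unweighted graph).
    scores: dict mapping (u, v) pairs with u < v to a modification score.
    """
    top_pairs = sorted(scores, key=scores.get, reverse=True)[:k]
    # Symmetric difference removes existing edges and adds missing ones.
    return edge_set ^ set(top_pairs)


def sanitize_in_steps(edge_set, score_fn, total_budget, num_steps):
    """Spend total_budget flips over num_steps rounds, rescoring every round."""
    per_step = total_budget // num_steps
    for _ in range(num_steps):
        edge_set = flip_top_k(edge_set, score_fn(edge_set), per_step)
    return edge_set


def toy_score_fn(edge_set):
    # Hypothetical stand-in for a learned score: favours deleting edge (0, 3)
    # and adding edge (1, 2), but only while those changes are still pending.
    return {
        (0, 3): 1.0 if (0, 3) in edge_set else 0.0,
        (1, 2): 0.5 if (1, 2) not in edge_set else 0.0,
        (0, 1): 0.1,
    }


graph = {(0, 1), (0, 3)}
result = sanitize_in_steps(graph, toy_score_fn, total_budget=2, num_steps=2)
```

Rescoring after every step lets later modifications take earlier ones into account, which is the motivation for spreading the budget over several steps instead of spending it at once.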

For the attacking methods, the attacking budgets have been introduced in Section 4.3; here we present their detailed implementation. (1) We follow the publicly-available implementation of metattack (DBLP:conf/iclr/ZugnerG19) and adopt the ‘Meta-Self’ variant to attack the provided graphs; (2) we follow (DBLP:conf/kdd/Jin0LTWT20) to select nodes with degree larger than as the target nodes and run Nettack with the publicly-available implementation; (3) we implement random attack by symmetrically flipping entries of the adjacency matrix of the provided graphs.

Appendix B Effect of Modification Budget

In this section we study the relationships between the budget of GaSoliNe  and the corresponding performance of the downstream classifier. Here, we instantiate two variants of GaSoliNe: discretized modification towards topology (GaSoliNe-DT) and continuous modification towards feature (GaSoliNe-CF). The provided graph is Cora (kipf2016semi) which is heavily-poisoned by metattack (DBLP:conf/iclr/ZugnerG19) with . Both the backbone classifier and the downstream classifier of GaSoliNe  are the APPNP (klicpera2018predict) models with the aforementioned settings. From Figure 4 we observe that with the increase of the modification budget ( and ), GaSoliNe enjoys great potential to further improve the performance of the downstream classifiers. At the same time, ‘economic’ choices are strong enough to benefit downstream classifiers so we set as and as throughout our experiment settings.

(a) GaSoliNe-DT
(b) GaSoliNe-CF
Figure 4. Performance of the downstream classifier vs. the modification budget of GaSoliNe-DT (a) and GaSoliNe-CF (b)