1. Introduction
Graph mining has become a cornerstone of a wealth of real-world applications, such as social media mining (zafarani2014social), brain connectivity analysis (shi2015brainquest), computational epidemiology (keeling2005networks), and financial fraud detection (zhang2017hidden). The vast majority of existing works essentially aim to answer the following question: given a graph, what is the best model and/or algorithm to mine it? To name a few, PageRank (page1999pagerank) and its variants (tong2006fast) measure node importance and node proximity, respectively, based on multiple weighted paths; spectral clustering (shi2000normalized) minimizes inter-cluster connectivity and maximizes intra-cluster connectivity to partition nodes into different groups; graph neural networks (GNNs) (kipf2016semi; velivckovic2018graph; wu2019simplifying; klicpera2018predict) learn representations of nodes by aggregating information from the neighborhood. All these works and many more require a given graph, including its topology and/or the associated attribute information, as part of the input of the corresponding mining model. Despite tremendous success, some fundamental questions largely remain open, e.g., where does the input graph come from in the first place? To what extent does the quality of the given graph impact the effectiveness of the corresponding graph mining model? In response, we introduce the graph sanitation problem, which aims to improve the initially-provided graph for a given graph mining model, so as to maximally boost its performance. The rationale is as follows. In many existing graph mining works, the initially-provided graph is typically constructed manually based on some heuristics. The graph construction is often treated as a preprocessing step, without considering the specific mining task. What is more, the initially-constructed graph could be subject to various forms of contamination, such as missing information, noise, and even adversarial attacks. This suggests that there might be under-explored space for improving the mining performance by learning a 'better' graph as the input of the corresponding mining model.
There are a few lines of existing work on modifying graphs. For example, network imputation (liben2007link; huisman2009imputation) and knowledge graph completion (bordes2013translating; wang2014knowledge) focus on restoring missing links in a partially observed graph; graph connectivity optimization (chen2018network) and computational immunization (chen2015node) aim to manipulate the graph connectivity in a desired way by changing the underlying topology; robust graph neural networks (GNNs) (entezari2020all; wu2019adversarial; DBLP:conf/kdd/Jin0LTWT20) utilize empirical properties of a benign graph to remove or assign lower weights to poisoned graph elements (e.g., contaminated edges). The graph sanitation problem introduced in this paper is related to, but bears subtle differences from, these existing works in the following sense. Most, if not all, of the existing works on modifying graphs assume that the initially-provided graph is impaired or perturbed in a specific way, e.g., due to missing links, noise, or adversarial attacks. Some existing works further impose certain assumptions on the specific graph modification algorithms, such as the low-rank assumption behind many network imputation methods, or the types of attacks and/or the empirical properties of the benign graph (e.g., topology sparsity, feature smoothness) behind some robust GNNs. In contrast, the proposed graph sanitation problem does not make any such assumption, and instead pursues a different design principle. That is, we let the performance of the downstream data mining task, measured on a validation set, dictate how we should optimally modify the initially-provided graph. This is crucial, as it not only ensures that the modified graph will directly and maximally improve the mining performance, but also lends itself to be applicable to a variety of graph mining tasks.
Formally, we formulate the graph sanitation problem as a generic bilevel optimization problem, where the lower-level optimization problem corresponds to the specific mining task and the upper-level optimization problem encodes the supervision to modify the provided graph and maximally improve the mining performance. Based on that, we instantiate such a bilevel optimization problem with semi-supervised node classification by graph convolutional networks, where the lower-level objective function represents the cross-entropy classification loss over the training data and the upper-level objective function represents the loss over the validation data, using the mining model trained from the lower-level optimization problem. We propose an effective solver (GaSoliNe) which adopts an efficient approximation of the hypergradient to guide the modification of the given graph. We carefully design the hypergradient aggregation mechanism to avoid potential bias from a specific dataset split by aggregating the hypergradients from different folds of data. GaSoliNe is versatile and equipped with multiple variants, such as discretized vs. continuous modification, deleting vs. adding edges, and modifying graph topology vs. feature. Comprehensive experiments demonstrate that (1) GaSoliNe is broadly applicable to benefit different downstream node classifiers together with flexible choices of variants and modification strategies, and (2) GaSoliNe can significantly boost downstream classifiers on both original and contaminated graphs in various perturbation scenarios, and can work hand-in-hand with existing robust GNN methods. For instance, in Table 1, the proposed GaSoliNe significantly boosts GAT (velivckovic2018graph), APPNP-SVD (entezari2020all), and RGCN (zhu2019robust).

Data | With GaSoliNe? | GAT | APPNP-SVD | RGCN
Cora | N | 48.83±0.18 | 60.12±0.55 | 50.57±0.77
Cora | Y | 63.68±0.59 | 79.73±0.61 | 62.56±0.55
Citeseer | N | 62.40±0.69 | 50.60±0.63 | 55.53±1.43
Citeseer | Y | 69.68±0.21 | 76.46±0.63 | 66.09±0.77
Polblogs | N | 48.21±6.64 | 77.34±3.25 | 50.76±0.86
Polblogs | Y | 70.76±0.59 | 89.21±0.72 | 67.73±0.33
In summary, our main contributions in this paper are as follows:

Problem Definition. We introduce a novel graph sanitation problem, and formulate it as a bilevel optimization problem. The proposed graph sanitation problem can be potentially applied to a variety of graph mining models as long as they are differentiable w.r.t. the input graph.

Algorithmic Instantiation. We instantiate the graph sanitation problem with semi-supervised node classification by graph convolutional networks. We further propose an effective solver named GaSoliNe with versatile variants.

Empirical Evaluations. We perform extensive empirical studies on real-world datasets to demonstrate the effectiveness and applicability of the proposed GaSoliNe algorithms.
The rest of this paper is organized as follows. Section 2 formally defines the graph sanitation problem with various instantiations over different mining tasks. Section 3 studies the graph sanitation problem with semi-supervised node classification. In Section 4, we design a set of experiments to verify the effectiveness and applicability of GaSoliNe. After reviewing the related literature in Section 5, we conclude this paper in Section 6.
2. Graph Sanitation Problem
A  Notations. Table 2 summarizes the main symbols and notations used throughout the paper. We use bold uppercase letters for matrices (e.g., A), bold lowercase letters for vectors (e.g., x), lowercase letters for scalars (e.g., n), and calligraphic letters for sets (e.g., V). We use A(i, j) to represent the entry of matrix A at the i-th row and the j-th column, A(i, :) to represent the i-th row of matrix A, and A(:, j) to represent the j-th column of matrix A. Similarly, x(i) denotes the i-th entry of vector x. We use prime to denote the transpose of matrices and vectors (e.g., A′ is the transpose of A). For the variables of the modified graphs, we set a tilde over the corresponding variables of the original graphs (e.g., Ã). We represent an attributed graph as G = (A, X), where A is the n × n adjacency matrix and X is the n × d feature matrix composed of the d-dimensional feature vectors of the n nodes. For supervised graph mining models, we first divide the node set into two disjoint subsets, a labeled node set and a test set, and then divide the labeled node set into two disjoint subsets: the training set V_train and the validation set V_valid. We use Y and Ŷ, with appropriate indexing, to denote the ground truth supervision and the prediction result, respectively. Taking the classification task as an example, Y(i, c) = 1 if node i belongs to class c and Y(i, c) = 0 otherwise; Ŷ(i, c) is the predicted probability that node i belongs to class c. Furthermore, we use Y_train and Y_valid to denote the supervision information of all the nodes in the training set V_train and the validation set V_valid, respectively.

Symbols | Definitions
G = (A, X) | the initial graph
A | the adjacency matrix of G
X | the attribute/feature matrix of G
n | the number of nodes
d | the dimension of the node feature vector
C | the number of classes
V_train | the training set
V_valid | the validation set
Y_train | the ground truth (e.g., class labels) of the training set
Y_valid | the ground truth (e.g., class labels) of the validation set
θ | the solution of the mining model
G̃ = (Ã, X̃) | the modified graph
k | budget of modification
B  Graph Mining Models. Many graph mining models can be formulated from the optimization perspective (kang2020inform; kang2019n2n). For these models, the goal is to find an optimal solution θ so that a task-specific loss L(G, θ, V_train, Y_train) is minimized. Here, V_train and Y_train are the training set and the associated ground truth (e.g., class labels for the classification task), which are absent for unsupervised graph mining tasks (e.g., clustering, ranking). We give three concrete examples next.
Mining Tasks | PageRank (page1999pagerank) | Spectral clustering (shi2000normalized) | Semi-supervised node classification
θ | ranking vector r | cluster indicator u | model parameters θ
V_train | none | none | training set
Y_train | none | none | labels of training set
lower-level loss L | Eq. (1) | Eq. (2) | Eq. (3)
V_valid | positive set V+ and negative set V− | must-link set M and cannot-link set C | validation set
Y_valid | none | none | labels of validation set
upper-level loss ℓ | Eq. (5) | Eq. (6) | Eq. (7)
Remarks | modifies topology only | modifies topology only | modifies topology and/or feature
Example #1: PageRank (page1999pagerank) is a fundamental graph ranking model. When the adjacency matrix of the underlying graph is normalized in a symmetric way, the PageRank vector can be obtained as:

r = (1 − c)(I − cÂ)^{−1} e        (1)

where Â = D^{−1/2} A D^{−1/2} is the symmetrically normalized adjacency matrix; c is the damping factor; e is the preference vector; and r is the ranking vector which serves as the solution of the ranking model (i.e., θ = r).
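As a concrete illustration, the fixed point underlying Eq. (1) can be reached by simple power iteration. The sketch below assumes a dense NumPy adjacency matrix; the default damping factor and the uniform preference vector are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def pagerank(A, c=0.85, e=None, iters=100):
    """Power iteration for PageRank with a symmetrically normalized adjacency.

    A minimal sketch: iterates r <- c * A_hat @ r + (1 - c) * e, whose fixed
    point is the closed form of Eq. (1).
    """
    n = A.shape[0]
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))     # guard isolated nodes
    A_hat = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # D^{-1/2} A D^{-1/2}
    if e is None:
        e = np.full(n, 1.0 / n)                          # uniform preference
    r = e.copy()
    for _ in range(iters):
        r = c * A_hat @ r + (1 - c) * e
    return r
```

On a star graph, the hub receives the highest score, matching the intuition that PageRank rewards well-connected nodes.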
Example #2: spectral clustering (shi2000normalized) is a classic graph clustering model which aims to minimize the normalized cut between clusters:

min_u u′ L u,  s.t.  u′ D u = 1        (2)

where L = D − A is the Laplacian matrix of the adjacency matrix A and D is the diagonal degree matrix with D(i, i) = Σ_j A(i, j); the model solution is the cluster indicator vector u (i.e., θ = u).
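For a two-way partition, the relaxed solution of Eq. (2) is the eigenvector of the normalized Laplacian associated with the second-smallest eigenvalue. A minimal sketch (the sign-based rounding of the indicator is one common choice, not necessarily the paper's):

```python
import numpy as np

def spectral_bipartition(A):
    """Two-way spectral clustering via the normalized Laplacian of Eq. (2)."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.diag(d) - A                       # unnormalized Laplacian L = D - A
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt      # D^{-1/2} L D^{-1/2}
    vals, vecs = np.linalg.eigh(L_sym)       # ascending eigenvalues
    u = vecs[:, 1]                           # relaxed cluster indicator
    return (u > 0).astype(int)               # round by sign into two clusters
```

On two triangles joined by a single bridge edge, the sign of the indicator separates the two triangles.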
Example #3: node classification aims to construct a classification model based on the graph topology A and feature X. A typical loss for node classification is the cross-entropy (CE) over the training set:

L = − Σ_{i ∈ V_train} Σ_{c=1}^{C} Y(i, c) ln Ŷ(i, c)        (3)

where Y(i, c) is the ground truth which indicates whether node i belongs to class c, V_train is the training set, and Ŷ(i, c) is the predicted probability that node i belongs to class c, output by a classifier parameterized by θ. For example, the classifier can be a graph convolutional network whose trained model parameters form the solution θ.
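Once the predicted probabilities are available, the loss of Eq. (3) is a masked sum over the training nodes. A minimal sketch with a hypothetical index list `train_idx` standing in for V_train:

```python
import numpy as np

def train_ce_loss(Y, Y_hat, train_idx):
    """Cross-entropy of Eq. (3), restricted to the training set.

    Y is the one-hot ground truth, Y_hat the predicted probabilities; the
    clip guards against log(0).
    """
    probs = np.clip(Y_hat[train_idx], 1e-12, 1.0)
    return -np.sum(Y[train_idx] * np.log(probs))
```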
Remarks. Both the standard PageRank and spectral clustering are unsupervised models, and therefore the training set V_train and its supervision Y_train are absent in the corresponding loss functions (i.e., Eq. (1) and Eq. (2), respectively). Nonetheless, both PageRank and spectral clustering have been generalized to further incorporate some forms of supervision, as we will show next.

C  Graph Sanitation: Formulation and Instantiations. In the proposed graph sanitation problem, given an initial graph G and a graph mining model L(G, θ, V_train, Y_train), we aim to learn a modified graph G̃ to boost the performance of the corresponding mining model. The basic idea is to let the mining performance on a validation set guide the modification process. Formally, the graph sanitation problem is defined as follows.
Problem 1 (Graph Sanitation Problem).

 Given:

(1) a graph represented as G = (A, X), (2) a graph mining task represented as the loss L(G, θ, V_train, Y_train), (3) a validation set V_valid and its supervision Y_valid, and (4) the sanitation budget k;

 Find:

A modified graph G̃ that boosts the performance of the input graph mining model.
We further formulate Problem 1 as a bilevel optimization problem as follows:

min_{G̃} ℓ(G̃, θ*, V_valid, Y_valid),  s.t.  θ* = argmin_θ L(G̃, θ, V_train, Y_train),  d(G, G̃) ≤ k        (4)

where the lower-level optimization trains the model θ* based on the training set V_train; the upper-level optimization aims to optimize the performance of the trained model on the validation set V_valid, and there is no overlap between V_train and V_valid; the distance function d(·, ·) measures the distance between the two graphs and is bounded by the budget k. Notice that the loss function ℓ at the upper level might be different from the one at the lower level, L. For example, for both PageRank (Eq. (1)) and spectral clustering (Eq. (2)), L does not involve any supervision. However, for both models, ℓ is designed to measure the performance on a validation set with supervision and therefore should be different from L. We elaborate on this next.

The proposed bilevel optimization problem in Eq. (4) is quite general. In principle, it is applicable to any graph mining model with differentiable L and ℓ. We give its instantiations with the three aforementioned mining tasks and summarize them in Table 3.
Instantiation #1: supervised PageRank. The original PageRank (page1999pagerank) has been generalized to encode pairwise ranking preferences (backstrom2011supervised; li2016quint). For graph sanitation with supervised PageRank, the training set together with its supervision is absent, and the lower-level loss is given in Eq. (1). The validation set consists of a positive node set V+ and a negative node set V−. The supervision of the upper-level problem is that the PageRank scores of nodes from V+ should be higher than those from V−, i.e., r(i) > r(j) for i ∈ V+ and j ∈ V−. Several choices for the upper-level loss exist. For example, we can use the Wilcoxon-Mann-Whitney loss (yan2003optimizing):

ℓ = Σ_{i ∈ V+} Σ_{j ∈ V−} 1 / (1 + exp((r(i) − r(j))/b))        (5)

where b is the width parameter. It is worth mentioning that this instantiation only modifies the graph topology Ã. Although Eq. (5) does not contain the variable Ã, r is determined by Ã through the lower-level problem.
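The Wilcoxon-Mann-Whitney surrogate can be evaluated directly from a ranking vector. A minimal sketch, assuming the logistic form of Eq. (5) with width parameter b (the exact parameterization in (yan2003optimizing) may differ slightly):

```python
import numpy as np

def wmw_loss(r, pos_idx, neg_idx, b=1.0):
    """WMW surrogate for the preference r[i] > r[j], i in V+, j in V-.

    Each pair contributes 1 / (1 + exp((r[i] - r[j]) / b)): close to 0 when
    the preference is strongly satisfied, close to 1 when it is violated.
    """
    loss = 0.0
    for i in pos_idx:
        for j in neg_idx:
            loss += 1.0 / (1.0 + np.exp((r[i] - r[j]) / b))
    return loss
```

Swapping the positive and negative sets on a correctly ordered ranking makes the loss jump, which is the signal the upper-level problem exploits.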
Instantiation #2: supervised spectral clustering. A typical way to encode supervision in spectral clustering is via 'must-link' and 'cannot-link' constraints (wagstaff2000clustering; wang2010flexible). For graph sanitation with supervised spectral clustering, the training set together with its supervision is absent, and the lower-level loss is given in Eq. (2). The validation set contains a 'must-link' set M and a 'cannot-link' set C. For the upper-level loss, the idea is to encourage nodes from the must-link set to be grouped in the same cluster and, meanwhile, push nodes from the cannot-link set into different clusters. To be specific, ℓ can be instantiated as follows:

ℓ = Σ_{i,j} W(i, j) (u(i) − u(j))²        (6)

where W encodes the 'must-link' and 'cannot-link' constraints, that is, W(i, j) = 1 if (i, j) ∈ M, W(i, j) = −1 if (i, j) ∈ C, and W(i, j) = 0 otherwise. This instantiation only modifies the graph topology Ã.
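The constraint loss rewards small indicator gaps on must-link pairs and large gaps on cannot-link pairs. A minimal sketch of one common form of Eq. (6) (pair lists stand in for the weight matrix W):

```python
import numpy as np

def link_constraint_loss(u, must, cannot):
    """Upper-level loss of Eq. (6): W = +1 on must-link pairs pulls their
    indicator values together; W = -1 on cannot-link pairs pushes them apart."""
    loss = 0.0
    for i, j in must:
        loss += (u[i] - u[j]) ** 2   # W(i, j) = +1
    for i, j in cannot:
        loss -= (u[i] - u[j]) ** 2   # W(i, j) = -1
    return loss
```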
Instantiation #3: semi-supervised node classification. For graph sanitation with semi-supervised node classification, the lower-level optimization problem is given in Eq. (3). We use the cross-entropy loss over the validation set as the upper-level problem:

ℓ = − Σ_{i ∈ V_valid} Σ_{c=1}^{C} Y(i, c) ln Ŷ(i, c)        (7)

As mentioned before, there should be no overlap between the training set V_train and the validation set V_valid.
Remarks. If the initially-given graph is poisoned by adversarial attackers (zugner2018adversarial; DBLP:conf/iclr/ZugnerG19), the graph sanitation problem with semi-supervised node classification can also be used as an adversarial defense strategy. However, it bears an important difference from the existing robust GNNs (DBLP:conf/kdd/Jin0LTWT20; entezari2020all; wu2019adversarial) in that it assumes neither that the given graph is poisoned nor any specific way in which it is poisoned. Therefore, the graph sanitation problem in this scenario could boost the performance under a wide range of attacking scenarios (e.g., non-poisoned graphs, lightly-poisoned graphs, and heavily-poisoned graphs) and has the potential to work hand-in-hand with existing robust GNN models. In the next section, we propose an effective algorithm to solve the graph sanitation problem with semi-supervised node classification.
3. Proposed Algorithms: GaSoliNe
In this section, we focus on the graph sanitation problem in the context of semi-supervised node classification and propose an effective solver named GaSoliNe. The general workflow of GaSoliNe is as follows. First, we solve the lower-level problem (Eq. (3)) and obtain a solution θ* together with its corresponding updating trajectory. Then we compute the hypergradient of the upper-level loss function (Eq. (7)) w.r.t. the graph and use a set of hypergradient-guided modifications to solve the upper-level optimization problem. Recall that we need a classifier, parameterized by θ, to produce the predicted labels Ŷ; we refer to this classifier as the backbone classifier. Finally, we test whether another classifier can obtain better performance over the modified graph on the test set; this classifier is called the downstream classifier. In the following subsections, we introduce the proposed GaSoliNe in three parts: (A) hypergradient computation, (B) hypergradient aggregation, and (C) hypergradient-guided modification.
A  HyperGradient Computation. Eq. (4) and its instantiation with Eq. (7) fall into the family of bilevel optimization problems, where the lower-level problem is to optimize θ via minimizing the loss over the training set given G̃, and the upper-level problem is to optimize G̃ via minimizing the loss over V_valid. We perform gradient descent w.r.t. the upper-level problem and view the lower-level problem as a dynamic system:

θ_t = Φ(θ_{t−1}, G̃),  t = 1, …, T        (8)

where θ_0 is the initialization of θ and Φ(·) is the updating formula, which can be instantiated as an optimizer over the lower-level objective function on the training set V_train. Hence, in order to get the hypergradient of the upper-level problem w.r.t. G̃, we assume that the dynamic system converges in T iterations (i.e., θ* = θ_T). We can then unroll the iterative solution of the lower-level problem and obtain the hypergradient ∇_{G̃} ℓ by the chain rule (baydin2017automatic). For brevity, we abbreviate the cross-entropy loss over the validation set as ℓ:

∇_{G̃} ℓ = ∂ℓ(θ_T)/∂G̃ + Σ_{t=1}^{T} B_t A_{t+1} ⋯ A_T · ∂ℓ(θ_T)/∂θ_T        (9)

where A_t = ∂Φ(θ_{t−1}, G̃)/∂θ_{t−1} and B_t = ∂Φ(θ_{t−1}, G̃)/∂G̃.
For our setting, the downstream classifier is required to be trained until convergence over the modified graph. Hence, T is set to a relatively large value (e.g., 200 in our experiments) to ensure that the hypergradient from the upper-level problem is computed over a converged classifier. In order to balance effectiveness and efficiency, we adopt the truncated hypergradient (shaban2019truncated) w.r.t. G̃ and rewrite the second part of Eq. (9) as Σ_{t=τ}^{T} B_t A_{t+1} ⋯ A_T · ∂ℓ(θ_T)/∂θ_T, where τ denotes the truncating iteration from which the hypergradient across iterations of the lower-level problem is counted. In order to achieve a faster estimation of the hypergradient, we further adopt a first-order approximation (nichol2018first; DBLP:conf/iclr/ZugnerG19), and the hypergradient can be computed as:

∇_{G̃} ℓ ≈ ∂ℓ(θ_T, G̃)/∂G̃        (10)

where the updating trajectory of θ is the same as Eq. (8). If the initially-provided graph is undirected, then Ã(i, j) = Ã(j, i). Hence, when we compute the hypergradient w.r.t. the undirected graph topology Ã, we need to calibrate the partial derivative into the derivative (kang2019n2n) and update the hypergradient as follows:

∇_{Ã} ℓ ← ∂ℓ/∂Ã + (∂ℓ/∂Ã)′ − diag(∂ℓ/∂Ã)        (11)

For the hypergradient w.r.t. the feature matrix X̃ and directed graph topology (where Ã(i, j) ≠ Ã(j, i) in general), the above calibration is not needed.
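To make Eq. (10) concrete, the toy below trains a one-layer linear "GNN" f(A, X; θ) = A X θ by gradient descent (the dynamic system of Eq. (8)), then estimates the first-order hypergradient w.r.t. A by finite differences of the validation loss while holding the trained parameters fixed. This is an illustrative sketch with made-up model and hyperparameters, not the paper's implementation, which differentiates analytically through the forward pass.

```python
import numpy as np

def first_order_hypergradient(A, X, y_tr, y_va, tr, va, T=200, lr=0.1, eps=1e-5):
    """First-order approximation of Eq. (10) via central finite differences."""
    def train(A):
        theta = np.zeros(X.shape[1])
        for _ in range(T):                          # lower-level updates, Eq. (8)
            pred = (A @ X @ theta)[tr]
            grad = (A @ X)[tr].T @ (pred - y_tr) / len(tr)
            theta -= lr * grad
        return theta

    theta_T = train(A)                              # converged solution theta_T

    def val_loss(A):                                # theta_T held fixed: the
        pred = (A @ X @ theta_T)[va]                # first-order term of Eq. (10)
        return 0.5 * np.mean((pred - y_va) ** 2)

    H = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            Ap, Am = A.copy(), A.copy()
            Ap[i, j] += eps
            Am[i, j] -= eps
            H[i, j] = (val_loss(Ap) - val_loss(Am)) / (2 * eps)
    return H
```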
B  HyperGradient Aggregation. To ensure the quality of graph sanitation without introducing bias from a specific dataset split, we adopt a multi-fold training/validation split with settings similar to cross-validation (chen2020trading). Specifically, during the training period, we split all the labeled nodes into folds and alternately select one of them as V_valid (with labels Y_valid) and the remaining ones as V_train (with labels Y_train). In total, there are as many sets of training/validation splits as there are folds. With the i-th dataset split, by Eq. (10), we obtain the hypergradient ∇_i. We then sum the hypergradients from all the training/validation splits to form the aggregated hypergradient ∇ = Σ_i ∇_i.
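The split-and-sum procedure can be sketched as follows; `hypergrad_fn` is a hypothetical callback that computes one hypergradient matrix (e.g., via Eq. (10)) for a given train/validation split.

```python
import numpy as np

def aggregate_hypergradients(labeled_idx, n_folds, hypergrad_fn):
    """Sum the hypergradients over n_folds train/validation splits (Sec. 3B).

    Each fold serves once as the validation set; the rest form the training
    set. This helper only handles the splitting and the summation.
    """
    folds = np.array_split(np.asarray(labeled_idx), n_folds)
    total = None
    for f in range(n_folds):
        val_idx = folds[f]
        train_idx = np.concatenate([folds[g] for g in range(n_folds) if g != f])
        H = hypergrad_fn(train_idx, val_idx)
        total = H if total is None else total + H
    return total
```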
C  HyperGradient-Guided Modification. With regard to modifying the graph based on the aggregated hypergradient ∇, we provide two variants: discretized modification and continuous modification. The discretized modification works with binary inputs such as adjacency matrices of unweighted graphs and binary feature matrices. The continuous modification is suitable for both continuous and binary inputs. For clarity of explanation, we use the adjacency matrix Ã as the running example for topology modification; it is straightforward to generalize to feature modification with the feature matrix X̃.
The discretized modification follows the guidance of a hypergradient-based score matrix:

S = ∇ ⊙ (1 − 2Ã)        (12)

where 1 is an all-1s matrix and ⊙ denotes element-wise multiplication. This score matrix is composed of 'preference' (i.e., ∇) and 'modifiability' (i.e., 1 − 2Ã). Only entries with both high 'preference' and high 'modifiability' are assigned high scores. For example, a large positive ∇(i, j) indicates a strong preference for adding an edge between the i-th node and the j-th node based on the hypergradient, and if there is no edge between them (i.e., Ã(i, j) = 0), the signs of ∇(i, j) and 1 − 2Ã(i, j) are the same, which results in a large S(i, j). Then, the corresponding entries in Ã are modified by flipping them (0 ↔ 1) based on the indices of the top entries in S.

Notice that if users prefer GaSoliNe to perform 'deleting' or 'adding' modifications instead of 'flipping', an additional operation on the hypergradient matrix (i.e., the 'preference' matrix) is sufficient, which sets the positive or the negative entries of ∇ to zeros.
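The scoring-and-flipping step, including the 'add'/'delete' restrictions, can be sketched as below. This is a minimal dense-matrix reading of Eq. (12); the `mode` argument and the exact tie-breaking are our assumptions for illustration.

```python
import numpy as np

def discretized_modify(A, H, budget, mode="flip"):
    """Discretized topology modification guided by the score matrix of Eq. (12).

    H is the aggregated hypergradient ('preference'); (1 - 2A) is the
    'modifiability'. The top-`budget` positively scored entries are flipped.
    """
    if mode == "add":
        H = np.maximum(H, 0)      # zero out negatives: only edge additions score
    elif mode == "delete":
        H = np.minimum(H, 0)      # zero out positives: only edge deletions score
    S = H * (np.ones_like(A) - 2 * A)            # Eq. (12)
    top = np.argsort(S, axis=None)[::-1][:budget]
    A_new = A.copy()
    for idx in top:
        i, j = np.unravel_index(idx, A.shape)
        if S[i, j] > 0:                          # only act on high-score entries
            A_new[i, j] = 1 - A_new[i, j]        # flip 0 <-> 1
    return A_new
```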
The continuous modification utilizes the hypergradient directly:

Ã ← Ã − η∇        (13)

We compute the learning rate η based on the ratio of the modification budget to the sum of the absolute values of the hypergradient matrix. For the modified adjacency matrix obtained by continuous modification, we project the adjacency matrix onto the non-negative orthant to ensure the non-negative weights required by the downstream classifier in its normalization process. In implementation, for both modification methods, we set a smaller budget in every iteration and update the graph in multiple steps until we run out of the total budget, so as to balance effectiveness and efficiency. Algorithm 1 summarizes the detailed modification procedure, which can be used to improve various elements of the graph (e.g., the adjacency matrix Ã and the feature matrix X̃). In addition, in our experiments, the budget for topology (k_A) and the budget for feature (k_X) are set and counted separately, since the costs of modification on different elements of a graph might not be comparable with each other.
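A one-step sketch of Eq. (13) with the budget-derived learning rate and the non-negativity projection; the descent sign and the clip at zero are our reading of the text, not a verified detail of the paper's implementation.

```python
import numpy as np

def continuous_modify(A, H, budget):
    """Continuous modification (Eq. (13)) with a budget-scaled step size.

    The learning rate is the ratio of the modification budget to the total
    absolute hypergradient, so the pre-projection change sums to `budget`;
    the clip then projects onto non-negative edge weights.
    """
    lr = budget / max(np.abs(H).sum(), 1e-12)
    A_new = A - lr * H
    return np.clip(A_new, 0.0, None)
```

In practice (per the text), a smaller per-step budget is used and this update is applied repeatedly until the total budget is exhausted.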
4. Experiments
In this section, we perform empirical evaluations. All the experiments are designed to answer the following research questions:

How applicable is the proposed GaSoliNe with respect to different backbone/downstream classifiers, as well as different modification strategies?

How effective is the proposed GaSoliNe when the initial graph is under various forms of perturbation? To what extent does the proposed GaSoliNe help strengthen existing robust graph neural network methods?
4.1. Experiment Setups
We evaluate the proposed GaSoliNe on the Cora, Citeseer, and Polblogs datasets (kipf2016semi; zugner2018adversarial; DBLP:conf/iclr/ZugnerG19), whose detailed statistics are attached in the Appendix (Table 6). Since the Polblogs dataset does not contain node features, we use an identity matrix as the node feature matrix. All three datasets are undirected and unweighted graphs, and we conduct experiments on the largest connected component of every dataset.

In order to set fair modification budgets across different datasets, the modification budget k_A on the adjacency matrix and k_X on the feature matrix are set in proportion to the total number of nonzero entries in the adjacency matrix and to the total number of features of all the nodes, respectively; the two proportionality ratios are fixed throughout all the experiments. Detailed hyperparameter settings and the modification budget analysis are attached in Appendix A and Appendix B, respectively.
We use accuracy as the evaluation metric and repeat every set of experiments multiple times to report the mean ± std values.

4.2. Applicability of GaSoliNe
In this subsection, we conduct an in-depth study of the properties of the graphs modified by GaSoliNe. The proposed GaSoliNe trains a backbone classifier in the lower-level problem and uses the trained backbone classifier to modify the initially-provided graph, in order to improve the performance of the downstream classifier on the test nodes. In addition, GaSoliNe is capable of modifying both the graph topology (i.e., Ã) and the feature (i.e., X̃), in both a discretized and a continuous fashion. To validate and verify this, we select three classic GNN-based node classifiers, namely GCN (kipf2016semi), SGC (wu2019simplifying), and APPNP (klicpera2018predict), to serve as the backbone classifiers and the downstream classifiers. The detailed experiment procedure is as follows. First, we modify the given graph using the proposed GaSoliNe algorithm with the 4 modification strategies (i.e., modifying topology or node feature with discretized or continuous modification). Each variant is implemented with the 3 backbone classifiers, so that in total there are 12 sets of GaSoliNe settings. Second, with the modified graphs, we test the performance of the downstream classifiers and report the result (mean±std accuracy) under each setting. For the results reported in this subsection, the initially-provided graph is Citeseer (kipf2016semi). Experimental results are reported in Table 4, where 'DT' stands for 'discretized topology modification', 'CT' for 'continuous topology modification', 'DF' for 'discretized feature modification', and 'CF' for 'continuous feature modification'. The second row of Table 4 shows the results on the initially-provided graph, and the other rows (excluding the first row) denote the results on the modified graphs under different settings. Results whose improvement over the initially-provided graph is statistically significant (at a small p-value) are marked in the table; the remaining results are marked as not significant. We have the following observations.

First, the proposed GaSoliNe is able to improve the accuracy of the downstream classifier over the initially-provided graph for every combination of the modification strategy (discretized vs. continuous) and the modification target (topology vs. feature), and in the vast majority of cases, the improvement is statistically significant. Second, the graphs modified by GaSoliNe can benefit different downstream classifiers with different backbone classifiers, which demonstrates the great transferability and broad applicability of the proposed GaSoliNe.
We further provide visualizations of the graphs before and after modification. We present the visualizations of the initial Citeseer graph and the Citeseer graphs modified by the three GaSoliNe discretized topology modification variants with GCN (kipf2016semi), SGC (wu2019simplifying), and APPNP (klicpera2018predict) as the backbone classifiers, respectively. The detailed visualization procedure is that we utilize the training set V_train (and the corresponding labels Y_train) of the given initial/modified graphs to train a GCN (kipf2016semi), and use the hidden representations of the trained GCN to encode every node into a high-dimensional vector. Then, we adopt the t-SNE (van2014accelerating) method to map the high-dimensional embeddings into two-dimensional ones for visualization. Figure 1 shows the visualization results of the node embeddings of the original Citeseer graph and the modified Citeseer graphs with different variants of GaSoliNe. Clearly, the node embeddings from the modified graphs are more discriminative than the embeddings from the original graph. This further demonstrates that the proposed GaSoliNe can incorporate various backbone classifiers and improve the graph quality to benefit downstream classifiers.

Variant | Backbone | GCN | SGC | APPNP
None | None | 72.23±0.45 | 72.81±0.18 | 71.80±0.37
DT | GCN | 74.71±0.26 | 74.77±0.11 | 75.36±0.24
DT | SGC | 74.70±0.42 | 75.24±0.15 | 75.64±0.29
DT | APPNP | 74.61±0.25 | 74.60±0.12 | 75.42±0.40
DF | GCN | 72.36±0.30 | 72.73±0.15 | 72.80±0.38
DF | SGC | 73.31±0.45 | 73.44±0.17 | 73.60±0.42
DF | APPNP | 72.56±0.34 | 72.91±0.13 | 73.56±0.43
CT | GCN | 73.10±0.39 | 73.55±0.11 | 74.75±0.20
CT | SGC | 72.96±0.33 | 73.48±0.15 | 74.40±0.29
CT | APPNP | 72.78±0.53 | 73.35±0.11 | 74.37±0.28
CF | GCN | 72.70±0.44 | 73.55±0.09 | 73.84±0.25
CF | SGC | 72.85±0.37 | 73.64±0.37 | 73.75±0.38
CF | APPNP | 73.04±0.33 | 73.63±0.15 | 73.90±0.30
4.3. Effectiveness of GaSoliNe
As we point out in Sec. 1, the defects of the initially-provided graph could be due to various reasons. In this subsection, we evaluate the effectiveness of the proposed GaSoliNe by (A) comparison with baseline methods on various poisoned/noisy graphs, (B) integration with existing robust GNN methods, and (C) a case study on the response of GaSoliNe to poisoned/noisy graphs. The attack methods we use to poison the benign graphs are as follows:

Random Attack: The attacker randomly flips entries of the benign adjacency matrices. The budget of the attacker is set as a perturbation rate over the edges.

Targeted Attack: The attacker aims to lower the performance of classifiers on a group of 'target' nodes. We adopt Nettack (zugner2018adversarial) to conduct targeted attacks towards the graph topology only. The budget of the attacker for Nettack is set as a number of perturbations per target node.

Global Attack: The attacker degrades the overall performance of node classifiers. We adopt metattack (DBLP:conf/iclr/ZugnerG19) to attack the topology of the benign graphs. The budget of the attacker for metattack is set in the same form as for the random attack but with a different perturbation rate.
A  Comparison with baseline methods. We compare GaSoliNe with the following baseline methods: APPNP (klicpera2018predict), GAT (velivckovic2018graph), APPNP-Jaccard (wu2019adversarial), APPNP-SVD (entezari2020all), and RGCN (zhu2019robust). Detailed descriptions of the baseline methods can be found in Appendix A.2. Recall that we feed all the graph modification-based methods (APPNP-Jaccard, APPNP-SVD, GaSoliNe) into exactly the same downstream classifier (APPNP) for a fair comparison.

We set up three variants of GaSoliNe to compare with the above baselines. To be specific, we refer to (1) GaSoliNe with discretized modification on topology as GaSoliNe-DT, (2) GaSoliNe with continuous modification on feature as GaSoliNe-CF, and (3) GaSoliNe with discretized modification on topology and continuous modification on feature as GaSoliNe-DTCF. All these GaSoliNe variants use APPNP (klicpera2018predict) as both the backbone classifier and the downstream classifier. We test a range of perturbation rates of metattack and the random attack, and a range of perturbations/node of Nettack, to attack the Cora (kipf2016semi) dataset, and report the accuracy (mean±std) in Figure 2. From the experimental results, we observe that: (1) with the increase of the attack budget, the performance of all methods drops, which is consistent with our intuition; (2) the variants of GaSoliNe consistently outperform the baselines under various adversarial/noisy scenarios; and (3) the proposed GaSoliNe even improves over the original, benign graphs (i.e., a perturbation rate of 0 and 0 perturbations/node).

An interesting question is, if the initially-provided graph is heavily poisoned/noisy, to what extent is the proposed GaSoliNe still effective? To answer this question, we study the performance of GaSoliNe and the other baseline methods on heavily-poisoned graphs (the heaviest perturbation settings of the random attack, metattack, and Nettack). The detailed experimental results are presented in Table 5. In most cases, GaSoliNe obtains competitive or even better performance against the baseline methods. On the Polblogs graph, GaSoliNe does not perform as well as on the other two datasets. This is because the Polblogs graph does not have node features; therefore, in the heavily-poisoned case, GaSoliNe can only bring limited improvement without the supervision from node features.
Attack  Data  APPNP  GAT  APPNPJaccard  APPNPSVD  RGCN  GaSoliNeDT  GaSoliNeCF  GaSoliNeDTCF
metattack  Cora  46.97±0.74  48.83±0.18  65.54±0.92  60.12±0.55  50.57±0.77  67.28±0.74  56.96±0.86  68.75±0.86
metattack  Citeseer  49.41±2.23  62.40±0.69  57.84±0.90  50.60±0.63  55.53±1.43  63.54±1.48  58.41±1.45  62.16±0.98
metattack  Polblogs  58.42±3.56  48.21±6.64  N/A  77.34±3.25  50.76±0.86  65.04±0.65  55.04±4.07  64.66±1.39
Nettack  Cora  60.72±1.23  54.22±2.25  63.31±1.18  53.37±2.41  56.51±1.14  64.46±2.24  63.86±2.41  66.14±1.90
Nettack  Citeseer  68.25±6.77  61.90±4.35  71.90±1.43  54.60±5.13  56.35±1.46  71.59±3.85  69.37±4.82  74.29±1.56
Nettack  Polblogs  90.48±1.03  91.11±0.65  N/A  92.56±2.17  93.13±0.23  92.26±1.62  90.28±0.66  92.37±1.70
random attack  Cora  74.33±0.42  58.05±1.01  74.98±0.26  72.51±0.40  68.93±0.43  77.12±0.27  78.29±0.52  77.82±0.22
random attack  Citeseer  69.76±0.61  60.77±1.59  69.45±0.44  66.41±0.40  65.66±0.20  73.77±0.21  72.27±0.42  73.35±0.53
random attack  Polblogs  74.74±2.78  84.48±1.02  N/A  84.04±1.72  81.73±0.91  73.40±4.06  77.13±1.59  77.57±2.93
B  Incorporating with graph defense strategies. GaSoliNe does not make any assumption about the nature of the defects in the initially-provided graph. We further evaluate whether GaSoliNe can help boost the performance of both model-based and data-based defense strategies under the heavily-poisoned settings. We use a data-based defense baseline, APPNPSVD (entezari2020all), a model-based defense baseline, RGCN (zhu2019robust), and another strong baseline, GAT (velivckovic2018graph), to integrate with GaSoliNe, since they have shown competitive performance in Table 5 and Figure 2. The detailed procedure is as follows: for the model-based methods (i.e., GAT and RGCN), GaSoliNe modifies the graph first, and then the baselines are run on the modified graphs to report the final results. For the data-based method (i.e., APPNPSVD), we first run the baseline to preprocess the graphs, then modify the graphs again with GaSoliNe, and finally run the downstream classifier (APPNP) on the twice-modified graphs. In this task, specifically, we use GaSoliNeDTCF to integrate with the various defense methods. In order to heavily poison the graphs, we use metattack (DBLP:conf/iclr/ZugnerG19) with perturbation rate to attack the benign graphs. We report the results in Table 1 and observe that, after integrating with GaSoliNe, the performance of all the defense methods further improves significantly with p-value < 0.01.
C  Case study about the behaviour of GaSoliNe. Here, we further study the potential reasons behind the success of GaSoliNe. To this end, we conduct a case study whose core idea is to label malicious modifications (from adversaries) and test whether GaSoliNe is able to detect them. The specific procedure is that we utilize different kinds of attackers (i.e., metattack (DBLP:conf/iclr/ZugnerG19), Nettack (zugner2018adversarial), and random attack) to modify the graph structure of a benign graph (with adjacency matrix ) into a poisoned graph (with adjacency matrix ). Then, we utilize the score matrix from Eq. (12) to assign a score to every entry of the poisoned adjacency matrix . As mentioned in Section 3, the higher the score an entry obtains, the more likely GaSoliNe is to modify it. We compute the average score of three groups of entries from : the poisoned entries after adding/deleting perturbations from adversaries, the benign existing edges without perturbation, and the benign non-existing edges without perturbation. Note that both the benign graphs and the poisoned graphs are unweighted, and we define the following auxiliary matrices. is a difference matrix whose entries with value indicate poisoned entries. is a benign edge indicator matrix whose entries with value indicate the benign existing edges without perturbation. denotes element-wise multiplication. is a benign non-existing edge indicator matrix whose entries with value indicate the benign non-existing edges without perturbation. Based on these, we have the following three statistics:
which denote the average scores obtained by the poisoned entries, the benign existing edges, and the benign non-existing edges, respectively.
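As an illustration of the three group-averaged statistics above, the following numpy sketch computes them on a toy 4-node graph. The adjacency matrices and the score matrix S (a stand-in for the score matrix of Eq. (12)) are invented for illustration only.

```python
import numpy as np

# Toy benign and poisoned unweighted adjacency matrices (symmetric, zero diagonal).
A_benign = np.array([[0, 1, 0, 0],
                     [1, 0, 1, 0],
                     [0, 1, 0, 1],
                     [0, 0, 1, 0]], dtype=float)
A_poison = A_benign.copy()
A_poison[0, 1] = A_poison[1, 0] = 0.0   # adversary deletes edge (0, 1)
A_poison[0, 3] = A_poison[3, 0] = 1.0   # adversary adds edge (0, 3)

# Hypothetical score matrix standing in for Eq. (12): higher = more likely modified.
S = np.array([[0.0, 0.8, 0.1, 0.9],
              [0.8, 0.0, 0.2, 0.1],
              [0.1, 0.2, 0.0, 0.3],
              [0.9, 0.1, 0.3, 0.0]])

D = (A_benign != A_poison).astype(float)      # indicator of poisoned entries
E = A_benign * (1.0 - D)                      # unperturbed benign existing edges
off_diag = 1.0 - np.eye(A_benign.shape[0])
N = (1.0 - A_benign) * (1.0 - D) * off_diag   # unperturbed benign non-existing edges

def avg_score(mask):
    # Average score over the entries selected by the 0/1 mask.
    return (S * mask).sum() / mask.sum()

s_poison = avg_score(D)   # average score of poisoned entries
s_exist = avg_score(E)    # average score of benign existing edges
s_non = avg_score(N)      # average score of benign non-existing edges
```

In this toy example the poisoned entries receive the highest average score, mirroring the behaviour the case study tests for.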
Detailed results are presented in Figure 3. We observe that GaSoliNe tends to modify poisoned entries more (with higher scores) than to modify benign unperturbed entries in the adjacency matrix of poisoned graphs, which is consistent with our expectation and enables the algorithm to partially recover the benign graphs and to boost the performance of downstream classifiers.
5. Related Work
A  Graph Modification. The vast majority of existing work on graph modification assumes the initially-provided graph is impaired or perturbed in a specific way. Network imputation problems focus on restoring missing links in a partially observed graph. For example, Liben-Nowell and Kleinberg (liben2007link) carefully studied the performance of a set of node topology proximity measures for predicting interactions on the network; Huisman (huisman2009imputation) proposed several procedures to handle missing data in the framework of exponential random graph models. Beyond conventional network imputation tasks, knowledge graph completion aims to predict missing links between entities. Representative works include the translation-based methods TransE (bordes2013translating) and TransH (wang2014knowledge), and the factorization-based method ComplEx (trouillon2016complex). For network connectivity analysis, Chen et al. (chen2015node; chen2018network) manipulated the graph connectivity by changing the underlying topology. Another relevant line is adversarial defense, which addresses the fragility of graph mining models. Wu et al. (wu2019adversarial) took a close look at adversarial behaviours and proposed that deleting edges connecting two dissimilar nodes is effective in defending against adversarial attacks; Entezari et al. (entezari2020all) carefully studied the properties of Nettack (zugner2018adversarial) and proposed that low-rank approximation is effective in retaining the performance of downstream GCNs on poisoned graphs; Jin et al. (DBLP:conf/kdd/Jin0LTWT20) modeled the modification of the graph jointly with the training of downstream classifiers and utilized topology sparsity and feature smoothness to guide the optimization. In addition, works such as supervised Pagerank (backstrom2011supervised; li2016quint) and constrained spectral clustering (wang2010flexible) encoded extra supervision to guide the modification of graphs.
B  Bilevel Optimization
Bilevel optimization is a powerful mathematical tool with broad applications in machine learning and data mining, and various tasks can be modelled as bilevel optimization problems. For instance, Finn et al. (finn2017model) studied the model-agnostic learning-to-initialize problem by solving a bilevel optimization problem with a first-order approximation of the hypergradient; Li et al. (li2016data) proposed a bilevel optimization-based poisoning attack method for factorization-based systems; Chen et al. (chen2020trading) designed a data debugging framework to identify overly-personalized ratings and improve the performance of collaborative filtering. As for solutions, the forward and reverse gradients (franceschi2017forward) are important tools for computing the hypergradient; truncated backpropagation (shaban2019truncated) provides an effective approximation of the hypergradient with a theoretical guarantee. In addition, Colson et al. (colson2007overview) provide a detailed review of bilevel optimization problems.
6. Conclusion
In this paper, we introduce the graph sanitation problem, which aims to improve an initially-provided graph for a given graph mining model. We formulate the graph sanitation problem as a bilevel optimization problem and show that it can be instantiated by a variety of graph mining models, such as supervised Pagerank, supervised clustering, and node classification. We further propose an effective solver named GaSoliNe for the graph sanitation problem with semi-supervised node classification. GaSoliNe adopts an efficient approximation of the hypergradient to guide the modification of the initially-provided graph. GaSoliNe is versatile and equipped with multiple variants. Extensive experimental evaluations demonstrate the broad applicability and effectiveness of the proposed GaSoliNe.
References
Appendix A Reproducibility
We will release the source code upon publication of the paper. Here, we present the detailed experiment settings, including dataset resources and all model hyper-parameter settings.
A.1. Datasets
We conduct experiments on three public graph datasets: Cora (https://github.com/tkipf/gcn), Citeseer (from the same repository), and Polblogs (https://github.com/ChandlerBang/ProGNN). The detailed statistics of the datasets are in Table 6.
Data  Nodes  Edges  Classes  Features 
Cora  2,485  5,069  7  1,433 
Citeseer  2,110  3,668  6  3,703 
Polblogs  1,222  16,714  2  N/A 
A.2. Baseline Methods
The detailed descriptions of the baseline methods that the proposed GaSoliNe is compared with are as follows:

APPNP (klicpera2018predict): A personalized Pagerank-based neural prediction model with an improved propagation scheme that leverages information from a large, adjustable neighborhood.

GAT (velivckovic2018graph): Graph neural networks with masked self-attention layers that assign different weights to different nodes in the neighborhood. GAT has the potential to assign lower weights to malicious edges added by the adversary and is often selected as a strong baseline against poisoning attacks.

APPNPJaccard (wu2019adversarial): A preprocessing-based method which deletes an edge if the node-feature similarity between its two endpoints is lower than a given threshold. We implement APPNP (klicpera2018predict) on the preprocessed graph for a fair comparison, since (1) APPNP enjoys stronger generalization ability than GCN (kipf2016semi) according to (klicpera2018predict), and (2) we ensure that all the graph modification-based methods (APPNPJaccard, APPNPSVD (entezari2020all), and GaSoliNe) have the exact same downstream classifier (APPNP).

APPNPSVD (entezari2020all): Inspired by the observation that Nettack (zugner2018adversarial) tends to increase the rank of the adjacency matrix, this method utilizes a low-rank approximation of the adjacency matrix as a counteraction against the adversarial attack. We implement APPNP (klicpera2018predict) on the preprocessed graph for a fair comparison.

RGCN (zhu2019robust): This method replaces the vector-formed node embeddings with Gaussian-distribution-formed embeddings and equips a variance-based attention mechanism to absorb the effects of adversarial changes.
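The edge-deletion heuristic behind APPNPJaccard above can be sketched as follows, assuming binary node features. The function name, threshold, and toy graph are ours for illustration, not the original implementation.

```python
import numpy as np

def jaccard_filter(A, X, threshold=0.05):
    """Drop edges whose endpoint features have Jaccard similarity below threshold.

    A: adjacency matrix (n x n, 0/1), X: binary node-feature matrix (n x d).
    """
    A = A.copy()
    rows, cols = np.nonzero(np.triu(A, k=1))  # visit each undirected edge once
    for i, j in zip(rows, cols):
        inter = np.minimum(X[i], X[j]).sum()  # |features in common|
        union = np.maximum(X[i], X[j]).sum()  # |features in either|
        sim = inter / union if union > 0 else 0.0
        if sim < threshold:
            A[i, j] = A[j, i] = 0.0  # remove the suspicious edge
    return A

# Triangle graph where node 2 shares no features with nodes 0 and 1
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)
A_clean = jaccard_filter(A, X, threshold=0.05)
```

Only the edge between the two feature-similar nodes survives the filter in this toy case.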
A.3. Hyper-Parameter Settings
We summarize the hyper-parameter settings of the models implemented in our experiments, including the baseline methods and the backbone and downstream classifiers of GaSoliNe:

GCN (kipf2016semi): We follow the settings of the publicly-available implementation of GCN (https://github.com/tkipf/gcn) to implement a layered GCN with hidden units, , and regularization parameter .

SGC (wu2019simplifying): We implement a layered SGC with hidden units and regularization parameter .

APPNP (klicpera2018predict): We follow the recommended hyperparameter settings of APPNP (klicpera2018predict) to set the and the number of power iterations as . The number of hidden units is and regularization parameter .

GAT (velivckovic2018graph): We use the publicly-available implementation of GAT (https://github.com/PetarV/GAT) and adopt the same hyper-parameter settings as the authors did.

APPNPJaccard (wu2019adversarial): We search the edge-removing threshold of Jaccard node similarity from and implement APPNP (klicpera2018predict) on the preprocessed graph, reporting the best results among the above settings.

APPNPSVD (entezari2020all): We search the reduced rank of the SVD from to obtain low-rank approximations of the given graphs and then implement APPNP (klicpera2018predict) on the preprocessed graph, reporting the best results among the above settings.

RGCN (zhu2019robust): We use the publicly-available implementation of RGCN (https://github.com/ZWZHANG/RobustGCN) and adopt the same hyper-parameter settings as the authors did.
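The low-rank preprocessing searched over for APPNPSVD above can be sketched with numpy's SVD. The function and the rank-1 toy matrix are illustrative, not the baseline's exact code.

```python
import numpy as np

def low_rank_approx(A, k):
    """Rank-k approximation of an adjacency matrix via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the k largest singular triplets
    return (U[:, :k] * s[:k]) @ Vt[:k]

# A rank-1 "graph" is recovered exactly at k = 1
u = np.array([1.0, 2.0, 3.0])
A = np.outer(u, u)
A1 = low_rank_approx(A, 1)
```

For a real adjacency matrix, the search over the reduced rank trades off denoising (small k) against fidelity to the observed topology (large k).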
The detailed settings of GaSoliNe are as follows: (1) For all the modification strategies (discretized vs. continuous and topology vs. feature), the modification budget towards topology and the modification budget towards feature have been introduced in Section 4.1. Specifically, the and . We modify the graph in steps, so the budget in every modification step is (i.e., and ). For discretized modification, we adopt the 'flipping' strategy instead of 'adding' or 'deleting' over all three datasets. (2) The settings of the backbone classifiers and downstream classifiers (GCN (kipf2016semi), SGC (wu2019simplifying), APPNP (klicpera2018predict)) used in our experiments follow the aforementioned settings. (3) The number of iterations for the optimization of the lower-level problem is set as and the truncating iteration is set as . The number of folds is set as .
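The truncated, unrolled optimization of the lower-level problem can be illustrated on a one-dimensional toy bilevel problem: run K inner gradient steps, differentiate through the unrolled steps to obtain a hypergradient, and sanity-check it against a finite difference. This is a didactic sketch with made-up quadratic objectives and constants, not the GaSoliNe solver itself.

```python
def inner_unrolled(lam, w0=0.0, eta=0.2, K=20):
    """Approximate w*(lam) = argmin_w (w - lam)^2 with K gradient steps,
    tracking dw/dlam alongside (forward-mode / unrolled differentiation)."""
    w, dw = w0, 0.0
    for _ in range(K):
        w = w - eta * 2.0 * (w - lam)       # inner gradient step
        dw = dw - eta * 2.0 * (dw - 1.0)    # derivative of that step w.r.t. lam
    return w, dw

def hypergradient(lam, target=3.0):
    """d/dlam of the outer objective F(lam) = (w_K(lam) - target)^2."""
    w, dw = inner_unrolled(lam)
    return 2.0 * (w - target) * dw

def outer(lam, target=3.0):
    w, _ = inner_unrolled(lam)
    return (w - target) ** 2

# Sanity check against a central finite difference
lam, eps = 1.0, 1e-5
fd = (outer(lam + eps) - outer(lam - eps)) / (2 * eps)
hg = hypergradient(lam)
```

Truncating the loop at fewer than K steps (as in truncated backpropagation) trades hypergradient accuracy for memory and time.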
For the attacking methods, the attacking budgets have been introduced in Section 4.3; here we present their detailed implementations. (1) We follow the publicly-available implementation of metattack (https://github.com/danielzuegner/gnnmetaattack) (DBLP:conf/iclr/ZugnerG19) and adopt the 'Meta-Self' variant to attack the provided graphs; (2) we follow (DBLP:conf/kdd/Jin0LTWT20) to select nodes with degree larger than as the target nodes and implement Nettack with the publicly-available implementation (https://github.com/danielzuegner/nettack); (3) we implement the random attack by symmetrically flipping entries of the adjacency matrix of the provided graphs.
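The symmetric-flipping random attack can be sketched as follows; the function name and parameters are ours, and the sketch simply flips randomly chosen off-diagonal entry pairs of an unweighted adjacency matrix.

```python
import numpy as np

def random_attack(A, n_flips, seed=0):
    """Symmetrically flip n_flips randomly chosen off-diagonal entry pairs
    of a 0/1 adjacency matrix (adds or deletes the corresponding edges)."""
    rng = np.random.default_rng(seed)
    A = A.copy()
    n = A.shape[0]
    iu, ju = np.triu_indices(n, k=1)                      # each node pair once
    picks = rng.choice(len(iu), size=n_flips, replace=False)
    for p in picks:
        i, j = iu[p], ju[p]
        A[i, j] = A[j, i] = 1.0 - A[i, j]                 # flip the pair symmetrically
    return A

# Flipping 3 pairs of an empty 5-node graph adds exactly 3 undirected edges
A0 = np.zeros((5, 5))
A_atk = random_attack(A0, n_flips=3)
```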
Appendix B Effect of Modification Budget
In this section, we study the relationship between the modification budget of GaSoliNe and the corresponding performance of the downstream classifier. Here, we instantiate two variants of GaSoliNe: discretized modification towards topology (GaSoliNeDT) and continuous modification towards feature (GaSoliNeCF). The provided graph is Cora (kipf2016semi), heavily poisoned by metattack (DBLP:conf/iclr/ZugnerG19) with . Both the backbone classifier and the downstream classifier of GaSoliNe are APPNP (klicpera2018predict) models with the aforementioned settings. From Figure 4 we observe that as the modification budget ( and ) increases, GaSoliNe enjoys great potential to further improve the performance of the downstream classifiers. At the same time, 'economic' choices are strong enough to benefit the downstream classifiers, so we set as and as throughout our experiments.