As the number of machine learning models deployed in the real world grows, questions regarding their robustness become increasingly important. In particular, it is critical to assess their vulnerability to adversarial attacks – deliberate perturbations of the data designed to achieve a specific (malicious) goal. Graph-based models suffer from poor adversarial robustness (Dai et al., 2018; Zügner et al., 2018), yet in domains where they are often deployed (e.g. the Web) (Ying et al., 2018), adversaries are pervasive and attacks have a low cost (Castillo and Davison, 2010; Hooi et al., 2016). Even in scenarios where adversaries are not present, such analysis is important since it allows us to reason about the behavior of our models in the worst case (i.e. treating nature as an adversary).
Here we focus on semi-supervised node classification – given a single large (attributed) graph and the class labels of a few nodes, the goal is to predict the labels of the remaining unlabelled nodes. Graph Neural Networks (GNNs) have emerged as the de facto way to tackle this task, significantly improving performance over the previous state-of-the-art. They are used for various high-impact applications across many domains such as protein interface prediction (Fout et al., 2017), classification of scientific papers (Kipf and Welling, 2017), fraud detection (Wang et al., 2019a), and breast cancer classification (Rhee et al., 2018). Therefore, it is crucial to assess their sensitivity to adversaries and ensure they behave as expected.
However, despite their popularity there is scarcely any work on certifying or improving the robustness of GNNs. As shown in Zügner et al. (2018) node classification with GNNs is not robust and can even be attacked on multiple fronts – slight perturbations of either the node features or the graph structure can lead to wrong predictions. Moreover, since we are dealing with non i.i.d. data by taking the graph structure into account, robustifying GNNs is more difficult compared to traditional models – perturbing only a few edges affects the predictions for all nodes. What can we do to fortify GNNs and make sure they produce reliable predictions in the presence of adversarial perturbations?
We propose the first method for provable robustness regarding perturbations of the graph structure. Our approach is applicable to a general family of models where the predictions are a linear function of (personalized) PageRank. This family includes GNNs (Klicpera et al., 2019) and other graph-based models such as label/feature propagation (Buchnik and Cohen, 2018; Zhou and Burges, 2007). Specifically, we provide:
1. Certificates: Given a trained model and a general set of admissible graph perturbations we can efficiently verify whether a node is certifiably robust – there exists no perturbation that can change its prediction. We also provide non-robustness certificates via adversarial examples.
2. Robust training: We investigate robust training schemes based on our certificates and show that they improve both robustness and clean accuracy.
Our theoretical findings are empirically demonstrated and the code is provided for reproducibility (code, data, and further supplementary material are available at https://www.kdd.in.tum.de/graph-cert). Interestingly, in contrast to existing works on provable robustness (Hein and Andriushchenko, 2017; Wong and Kolter, 2018; Zügner and Günnemann, 2019b) that derive bounds (by relaxing the problem), we can efficiently compute exact certificates for some threat models.
2 Related work
Neural networks (Szegedy et al., 2014; Goodfellow et al., 2015), and recently graph neural networks (Dai et al., 2018; Zügner et al., 2018; Zügner and Günnemann, 2019a) and node embeddings (Bojchevski and Günnemann, 2019) were shown to be highly sensitive to small adversarial perturbations. There exist many (heuristic) approaches aimed at robustifying these models; however, they have only limited usefulness since there is always a new attack able to break them, leading to a cat-and-mouse game between attackers and defenders. A more promising line of research studies certifiable robustness (Hein and Andriushchenko, 2017; Raghunathan et al., 2018; Wong and Kolter, 2018). Certificates provide guarantees that no perturbation regarding a specific threat model will change the prediction of an instance. So far, there has been almost no work on certifying graph-based models.
Different heuristics have been explored in the literature to improve robustness of graph-based models: (virtual) adversarial training (Chen et al., 2019; Feng et al., 2019; Sun et al., 2019; Xu et al., ), trainable edge weights (Wu et al., 2019), graph encoder refining and adversarial contrastive learning (Wang et al., 2019b), smoothing distillation (Chen et al., 2019), decoupling structure from attributes (Miller et al., 2018), measuring logit discrepancy (Zhang et al., 2019a), allocating reliable queries (Zhou et al., 2019), representing nodes as Gaussian distributions (Zhu et al., 2019), and Bayesian graph neural networks (Zhang et al., 2019b). Other robustness aspects of graph-based models (e.g. noise or anomalies) have also been investigated (Bojchevski and Günnemann, 2018a; Bojchevski et al., 2017; Hoang et al., 2019). However, none of these works provide provable guarantees or certificates.
Zügner and Günnemann (2019b) is the only work that proposes robustness certificates for graph neural networks (GNNs). However, their approach can handle perturbations only to the node attributes. Our approach is completely orthogonal to theirs since we consider adversarial perturbations to the graph structure instead. Furthermore, our certificates are also valid for other semi-supervised learning approaches such as label/feature propagation. Nonetheless, there is a critical need for both types of certificates given that GNNs are shown to be vulnerable to attacks on both the attributes and the structure. As future work, we aim to consider perturbations of the node features and the graph jointly.
3 Background and preliminaries
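Let $G = (\mathcal{V}, \mathcal{E})$ be an attributed graph with $N = |\mathcal{V}|$ nodes and edge set $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. We denote with $A \in \{0,1\}^{N \times N}$ the adjacency matrix and with $X \in \mathbb{R}^{N \times D}$ the matrix of $D$-dimensional node features for each node. Given a subset $\mathcal{V}_L \subseteq \mathcal{V}$ of labelled nodes the goal of semi-supervised node classification is to predict for each node $v \in \mathcal{V}$ one class in $\mathcal{C} = \{1, \dots, K\}$. We focus on deriving (exact) robustness certificates for graph neural networks via optimizing personalized PageRank. We also show (Appendix 8.1) how to apply our approach for label/feature propagation (Buchnik and Cohen, 2018).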
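Topic-sensitive PageRank. The topic-sensitive PageRank vector $\pi_{G,\alpha}(z)$ for a graph $G$ and a probability distribution over nodes $z$ is defined as $\pi_{G,\alpha}(z)^T = (1-\alpha)\, z^T (I_N - \alpha D^{-1} A)^{-1}$. (In practice we do not invert the matrix, but rather we solve the associated sparse linear system of equations.) Here $D$ is a diagonal matrix of node out-degrees with $D_{ii} = \sum_j A_{ij}$. Intuitively, $\pi(z)_u$ represents the probability of a random walker on the graph to land at node $u$ when it follows edges at random with probability $\alpha$ and teleports back to the node $v$ with probability $(1-\alpha) z_v$. Thus, we have $\pi(z)_u \geq 0$ and $\sum_u \pi(z)_u = 1$. For $z = e_v$, the $v$-th canonical basis vector, we get the personalized PageRank vector for node $v$. We drop the index on $G$, $\alpha$ and $z$ in $\pi_{G,\alpha}(z)$ when they are clear from the context.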
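As a minimal sketch of the definition above, the following pure-Python function approximates the topic-sensitive PageRank vector by fixed-point (power) iteration rather than by the sparse linear solve mentioned in the text. The function name `topic_sensitive_pagerank`, the adjacency-list input and the iteration count are assumptions made for illustration, and every node is assumed to have at least one outgoing edge.

```python
def topic_sensitive_pagerank(out_nbrs, z, alpha=0.85, iters=300):
    """Approximate pi(z) by iterating pi <- (1 - alpha) * z + alpha * P^T pi.

    out_nbrs[u] lists the out-neighbours of node u (assumed non-empty),
    z is the teleport distribution, alpha the probability of following an edge.
    """
    n = len(z)
    pi = list(z)
    for _ in range(iters):
        # Teleport mass first, then add the mass that flows along edges.
        nxt = [(1.0 - alpha) * z[v] for v in range(n)]
        for u in range(n):
            share = alpha * pi[u] / len(out_nbrs[u])
            for v in out_nbrs[u]:
                nxt[v] += share
        pi = nxt
    return pi


# Two nodes pointing at each other; z = e_0 gives the personalized PageRank of node 0.
ppr = topic_sensitive_pagerank([[1], [0]], z=[1.0, 0.0], alpha=0.5)
```

For this two-node cycle the fixed point can be checked by hand: node 0 receives mass $1/(1+\alpha)$ and node 1 receives $\alpha/(1+\alpha)$.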
Graph neural networks. As an instance of graph neural network (GNN) methods we consider an adaptation of the recently proposed PPNP approach (Klicpera et al., 2019) since it shows superior performance on the semi-supervised node classification task (Fey and Lenssen, 2019). Unlike message-passing GNNs, PPNP decouples the feature transformation from the propagation. We have:
$$Y = \mathrm{softmax}(\Pi^{sym} H), \qquad H_{v,:} = f_\theta(X_{v,:}), \qquad \Pi^{sym} = (1-\alpha)(I_N - \alpha D^{-1/2} A D^{-1/2})^{-1} \tag{1}$$
where $I_N$ is the identity, $\Pi^{sym} \in \mathbb{R}^{N \times N}$ is a symmetric propagation matrix, $H \in \mathbb{R}^{N \times K}$ collects the individual per-node logits, and $Y \in \mathbb{R}^{N \times K}$ collects the final predictions after propagation. A neural network $f_\theta$ outputs the logits $H_{v,:}$ by processing the features $X_{v,:}$ of every node $v$ independently. Multiplying them with $\Pi^{sym}$ we obtain the diffused logits $H^{diff} := \Pi^{sym} H$ which implicitly incorporate the graph structure and avoid the expensive multi-hop message-passing procedure.
To make PPNP more amenable to theoretical analysis we replace $\Pi^{sym}$ with the personalized PageRank matrix $\Pi = (1-\alpha)(I_N - \alpha D^{-1} A)^{-1}$ which has a similar spectrum. Here each row $\Pi_{v,:} = \pi(e_v)^T$ equals the personalized PageRank vector of node $v$. This model, which we denote as π-PPNP, has similar prediction performance to PPNP. We can see that the diffused logit after propagation for class $c$ of node $v$ is a linear function of its personalized PageRank score: $H^{diff}_{v,c} = \pi(e_v)^T H_{:,c}$, i.e. a weighted combination of the logits of all nodes for class $c$. Similarly, the margin $m_{c_1,c_2}(v) = H^{diff}_{v,c_1} - H^{diff}_{v,c_2} = \pi(e_v)^T (H_{:,c_1} - H_{:,c_2})$, defined as the difference in logits for node $v$ for two given classes $c_1$ and $c_2$, is also linear in $\pi(e_v)$. If $\min_{c} m_{y_v,c}(v) < 0$, where $y_v$ is the ground-truth label for $v$, the node is misclassified since the prediction equals $\arg\max_c H^{diff}_{v,c}$.
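To illustrate the linearity noted above, the sketch below computes the diffused logits and the margin of one node from its personalized PageRank vector and the per-node logits. The helper names and the toy numbers are hypothetical; the PageRank vector is simply given rather than computed.

```python
def diffused_logits(pi_v, H):
    """Diffused logits of one node: a PageRank-weighted sum of all nodes' logits."""
    num_classes = len(H[0])
    return [sum(pi_v[u] * H[u][c] for u in range(len(H))) for c in range(num_classes)]


def margin(pi_v, H, c1, c2):
    """Margin between classes c1 and c2, linear in the PageRank vector pi_v."""
    return sum(pi_v[u] * (H[u][c1] - H[u][c2]) for u in range(len(H)))


# Toy example: three nodes, two classes (all numbers are made up).
pi_v = [0.5, 0.25, 0.25]
H = [[2.0, 0.0], [0.0, 1.0], [1.0, 3.0]]
logits = diffused_logits(pi_v, H)
m01 = margin(pi_v, H, 0, 1)
```

The margin equals the difference of the two diffused logits, so a positive value means class 0 is preferred over class 1 for this node.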
4 Robustness certificates
4.1 Threat model, fragile edges, global and local budget
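We investigate the scenario in which a subset of edges in a directed graph are "fragile", i.e. an attacker has control over them, or in general we are not certain whether these edges are present in the graph. Formally, we are given a set of fixed edges $\mathcal{E}_f \subseteq \mathcal{E}$ that cannot be modified (assumed to be reliable), and a set of fragile edges $\mathcal{F} \subseteq (\mathcal{V} \times \mathcal{V}) \setminus \mathcal{E}_f$. For each fragile edge $(i,j) \in \mathcal{F}$ the attacker can decide whether to include it in the graph or exclude it from the graph, i.e. set $A_{ij}$ to $1$ or $0$ respectively. For any subset of included edges $\mathcal{F}_+ \subseteq \mathcal{F}$ we can form the perturbed graph $\tilde{G} = (\mathcal{V}, \tilde{\mathcal{E}} := \mathcal{E}_f \cup \mathcal{F}_+)$. An excluded fragile edge $(i,j) \in \mathcal{F} \setminus \mathcal{F}_+$ is a non-edge in $\tilde{G}$. This formulation is general, since we can set $\mathcal{E}_f$ and $\mathcal{F}$ arbitrarily. For example, for our certificate scenario given an existing clean graph $G = (\mathcal{V}, \mathcal{E})$ we can set $\mathcal{E}_f = \mathcal{E}$ and $\mathcal{F} \subseteq (\mathcal{V} \times \mathcal{V}) \setminus \mathcal{E}$, which implies the attacker can only add new edges to obtain perturbed graphs $\tilde{G}$. Or we can set $\mathcal{E}_f = \emptyset$ and $\mathcal{F} = \mathcal{E}$ so that the attacker can only remove edges, and so on. There is an exponential number, $2^{|\mathcal{F}|}$, of valid configurations leading to different perturbed graphs, which highlights that certificates are challenging for graph perturbations.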
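In reality, perturbing an edge is likely to incur some cost for the attacker. To capture this we introduce a global budget $B$. The constraint $|\tilde{\mathcal{E}} \setminus \mathcal{E}| + |\mathcal{E} \setminus \tilde{\mathcal{E}}| \leq B$ implies that the attacker can make at most $B$ perturbations. The first term equals the number of newly added edges, and the second the number of removed existing edges. Here, including an edge that already exists does not count towards the budget. This is only a design choice that depends on the application, and our method works in general. Furthermore, perturbing many edges for a single node might not be desirable, thus we also allow limiting the number of perturbations locally. Let $\mathcal{E}^i = \{(i,j) \in \mathcal{E}\}$ be the set of edges that share the same source node $i$ (and analogously $\tilde{\mathcal{E}}^i$ and $\mathcal{F}^i$). Then, the constraint $|\tilde{\mathcal{E}}^i \setminus \mathcal{E}^i| + |\mathcal{E}^i \setminus \tilde{\mathcal{E}}^i| \leq b_i$ enforces a local budget $b_i$ for the node $i$. By setting $b_i = |\mathcal{F}^i|$ and $B = |\mathcal{F}|$ we can model an unconstrained attacker. Letting $\mathcal{P}(\mathcal{F})$ be the power set of $\mathcal{F}$, we define the set of admissible perturbed graphs:
$$\mathcal{Q}_{\mathcal{F}} = \big\{ (\mathcal{V}, \tilde{\mathcal{E}} := \mathcal{E}_f \cup \mathcal{F}_+) \;\big|\; \mathcal{F}_+ \in \mathcal{P}(\mathcal{F}),\; |\tilde{\mathcal{E}} \setminus \mathcal{E}| + |\mathcal{E} \setminus \tilde{\mathcal{E}}| \leq B,\; |\tilde{\mathcal{E}}^i \setminus \mathcal{E}^i| + |\mathcal{E}^i \setminus \tilde{\mathcal{E}}^i| \leq b_i \;\; \forall i \big\} \tag{2}$$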
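The admissible set can be made concrete with a small brute-force enumeration. The sketch below is only meant for tiny graphs, since it visits every subset of fragile edges; the function names, the edge-set representation (Python sets of `(source, target)` tuples) and the toy graph are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def is_admissible(clean, fixed, included, global_budget, local_budget):
    """Check the global and local budgets for one choice of included fragile edges."""
    perturbed = fixed | included
    changed = (perturbed - clean) | (clean - perturbed)   # added plus removed edges
    if len(changed) > global_budget:
        return False
    per_source = Counter(i for i, _ in changed)
    return all(count <= local_budget[i] for i, count in per_source.items())


def admissible_graphs(clean, fixed, fragile, global_budget, local_budget):
    """Enumerate every admissible perturbed edge set (exponential in |fragile|)."""
    ordered = sorted(fragile)
    for k in range(len(ordered) + 1):
        for subset in combinations(ordered, k):
            if is_admissible(clean, fixed, set(subset), global_budget, local_budget):
                yield fixed | set(subset)


clean = {(0, 1), (1, 2), (2, 0), (0, 2)}
fixed = {(0, 1), (1, 2), (2, 0)}
fragile = {(0, 2), (1, 0), (2, 1)}   # (0, 2) exists and can be removed; the others can be added
graphs = list(admissible_graphs(clean, fixed, fragile, 1, {0: 1, 1: 1, 2: 1}))
```

With a global budget of one, only the clean graph and the three single-edge changes remain admissible out of the eight possible configurations.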
4.2 Robustness certificates
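Problem 1. Given a graph $G$, a set of fixed edges $\mathcal{E}_f$ and fragile edges $\mathcal{F}$, a global budget $B$ and local budgets $b_i$, a target node $t$, and a model with logits $H$. Let $y_t$ denote the class of node $t$ (predicted or ground-truth). The worst-case margin between class $y_t$ and class $c$ under any admissible perturbation $\tilde{G}$ from the set of admissible perturbed graphs $\mathcal{Q}_{\mathcal{F}}$ is:
$$m^*_{y_t,c}(t) = \min_{\tilde{G} \in \mathcal{Q}_{\mathcal{F}}} m_{y_t,c}(t) = \min_{\tilde{G} \in \mathcal{Q}_{\mathcal{F}}} \pi_{\tilde{G}}(e_t)^T (H_{:,y_t} - H_{:,c}) \tag{3}$$
If $m^*_{y_t,*}(t) = \min_{c \neq y_t} m^*_{y_t,c}(t) > 0$, node $t$ is certifiably robust w.r.t. the logits $H$ and the set $\mathcal{Q}_{\mathcal{F}}$.
Our goal is to verify whether no admissible $\tilde{G} \in \mathcal{Q}_{\mathcal{F}}$ can change the prediction for a target node $t$. From Problem 1 we see that if the worst margin over all classes $m^*_{y_t,*}(t)$ is positive, then $m^*_{y_t,c}(t) > 0$ for all $c \neq y_t$, which implies that there exists no adversarial example within $\mathcal{Q}_{\mathcal{F}}$ that leads to a change in the prediction to some other class $c$, that is, the logit for the given class $y_t$ is always largest.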
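The definition of the worst-case margin can be checked directly on toy graphs by exhaustive search. The sketch below is not the paper's certificate: it is a brute-force reference under simplifying assumptions (only a global budget, power iteration for PageRank, every node keeps at least one outgoing edge), with hypothetical function names and made-up logits.

```python
from itertools import combinations


def ppr(edges, n, t, alpha=0.85, iters=300):
    """Personalized PageRank of node t by power iteration."""
    out_nbrs = [[j for (i, j) in edges if i == u] for u in range(n)]
    pi = [1.0 if u == t else 0.0 for u in range(n)]
    for _ in range(iters):
        nxt = [(1.0 - alpha) if u == t else 0.0 for u in range(n)]
        for u in range(n):
            for v in out_nbrs[u]:
                nxt[v] += alpha * pi[u] / len(out_nbrs[u])
        pi = nxt
    return pi


def worst_case_margin(clean, fragile, budget, H, t, y, c, alpha=0.85):
    """Minimise the margin over all graphs with at most `budget` flipped fragile edges."""
    n = len(H)
    best = float("inf")
    for k in range(budget + 1):
        for flips in combinations(sorted(fragile), k):
            edges = clean ^ set(flips)   # flip: remove if present, add if absent
            pi = ppr(edges, n, t, alpha)
            best = min(best, sum(pi[u] * (H[u][y] - H[u][c]) for u in range(n)))
    return best


def is_certifiably_robust(clean, fragile, budget, H, t, y, alpha=0.85):
    """Robust iff the worst-case margin against every other class is positive."""
    return all(worst_case_margin(clean, fragile, budget, H, t, y, c, alpha) > 0
               for c in range(len(H[0])) if c != y)


clean = {(0, 1), (1, 2), (2, 0)}
fragile = {(0, 2), (1, 0), (2, 1)}          # edges the attacker may add
H = [[2.0, 0.0], [1.0, 0.5], [0.0, 3.0]]    # made-up logits for three nodes, two classes
robust_b0 = is_certifiably_robust(clean, fragile, 0, H, t=0, y=0)
robust_b1 = is_certifiably_robust(clean, fragile, 1, H, t=0, y=0)
```

In this toy setting node 0 keeps its prediction on the clean graph, while a single added edge towards node 2 is enough to flip the sign of the margin. The exhaustive loop is exactly what becomes infeasible for realistic fragile sets.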
Challenges and core idea. From a cursory look at Eq. 3 it appears that finding the minimum is intractable. After all, our domain is discrete and we are optimizing over exponentially many configurations. Moreover, the margin is a function of the personalized PageRank, which has a non-trivial dependency on the perturbed graph. But there is hope: for a fixed $H$, the margin $m_{y_t,c}(t)$ is a linear function of $\pi$. Thus, Problem 1 reduces to optimizing a linear function of personalized PageRank over a specific constraint set. This is the core idea of our approach. As we will show, if we consider only local budget constraints the exact certificate can be efficiently computed. This is in contrast to most certificates for neural networks that rely on different relaxations to make the problem tractable. Including the global budget constraint, however, makes the problem hard. For this case, we derive an efficiently computable lower bound on the worst-case margin. Thus, if the lower bound is positive we can still guarantee that our classifier is robust w.r.t. the set of admissible perturbations.
4.3 Optimizing topic-sensitive PageRank with global and local constraints
We are interested in optimizing a linear function of the topic-sensitive PageRank vector of a graph by modifying its structure. That is, we want to configure a set of fragile edges into included/excluded to obtain a perturbed graph maximizing the objective. Formally, we study the general problem:
Problem 2. Given a graph $G$, a set of admissible perturbations $\mathcal{Q}_{\mathcal{F}}$ as in Problem 1, and any fixed $z$ and $r$, solve the following optimization problem: $\max_{\tilde{G} \in \mathcal{Q}_{\mathcal{F}}} r^T \pi_{\tilde{G}}(z)$.
Setting $r = -(H_{:,y_t} - H_{:,c})$ and $z = e_t$, we see that Problem 1 is a special case of Problem 2. We can think of $r$ as a reward/cost vector, i.e. $r_v$ is the reward that a random walker obtains when visiting node $v$. The objective value $r^T \pi(z)$ is proportional to the overall reward obtained during an infinite random walk with teleportation since $\pi(z)_v$ exactly equals the frequency of visits to $v$.
Variations and special cases of this problem have been previously studied (Avrachenkov and Litvak, 2006; Csáji et al., 2010, 2014; de Kerchove et al., 2008; Fercoq et al., 2013; Hollanders et al., 2011; Olsen, 2010). Notably, Fercoq et al. (2013) cast the problem as an average cost infinite horizon Markov decision process (MDP), also called an ergodic control problem, where each node corresponds to a state and the actions correspond to choosing a subset of included fragile edges, i.e. we have $2^{|\mathcal{F}^i|}$ actions at each state $i$ (see also Fig. 1(a)). They show that despite the exponential number of actions, the problem can be solved in polynomial time, and they derive a value iteration algorithm with different local constraints. They enforce that the final perturbed graph has at most a given total number of edges per node, while we enforce that at most $b_i$ edges per node are perturbed (see Sec. 4.1).
Our approach for local budget only. Inspired by the MDP idea we derive a policy iteration (PI) algorithm which also runs in polynomial time (Hollanders et al., 2011). Intuitively, every policy corresponds to a perturbed graph in , and each iteration improves the policy. The PI algorithm allows us to: incorporate our local constraints easily, take advantage of efficient solvers for sparse systems of linear equations (line 3 in Alg. 1), and implement the policy improvement step in parallel (lines 4-6 in Alg. 1). It can easily handle very large sets of fragile edges and it scales to large graphs.
We provide the proof in Sec. 8.3 in the appendix. The main idea of Alg. 1 is as follows: starting from a random policy, in each iteration we first compute the mean reward before teleportation for the current policy (line 3), and then greedily select the top $b_i$ edges that improve the policy (lines 4-6). This algorithm is guaranteed to converge to the optimal policy, and thus to the optimal configuration of fragile edges.
Certificate for local budget only. Proposition 1 implies that for local constraints only, the optimal solution does not depend on the teleport vector $z$. Regardless of the node $t$ (i.e. which $z = e_t$ in Eq. 3), the optimal edges to perturb are the same if the admissible set $\mathcal{Q}_{\mathcal{F}}$ and the reward $r$ are the same. This means that for a fixed $H$ we only need to run the algorithm $K \times K$ times to obtain the certificates for all $N$ nodes: for each pair of classes $c_1, c_2$ we have a different reward vector $r = -(H_{:,c_1} - H_{:,c_2})$, and we can recover the exact worst-case margins $m^*_{c_1,c_2}(\cdot)$ for all $N$ nodes by just computing $\Pi$ on the resulting $K \times K$ many perturbed graphs $\tilde{G}$. Now, $m^*_{y_t,*}(t) > 0$ implies certifiable robustness, while $m^*_{y_t,*}(t) < 0$ implies certifiable non-robustness due to the exactness of our certificate, i.e. we have found an adversarial example for node $t$.
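The policy iteration idea for the local-budget case can be sketched in a few lines. This is an illustrative reading of the description above, not the authors' Alg. 1 verbatim: the linear system is replaced by a fixed-point iteration, and all names (`solve_values`, `policy_iteration`), the edge-set representation and the toy instance are assumptions. Each node is assumed to keep at least one fixed outgoing edge.

```python
def solve_values(edges, r, alpha, iters=500):
    """Fixed-point iteration for x = r + alpha * P x on the graph given by `edges`."""
    n = len(r)
    out_nbrs = [[j for (i, j) in edges if i == v] for v in range(n)]
    x = list(r)
    for _ in range(iters):
        x = [r[v] + alpha * sum(x[j] for j in out_nbrs[v]) / len(out_nbrs[v])
             for v in range(n)]
    return x


def policy_iteration(n, clean, fragile, local_budget, r, alpha=0.85, max_rounds=100):
    """Greedy policy improvement: per node, flip the top-b fragile edges with positive gain."""
    flipped = set()
    for _ in range(max_rounds):
        x = solve_values(clean ^ flipped, r, alpha)      # evaluate the current policy
        new_flipped = set()
        for v in range(n):
            mean_v = (x[v] - r[v]) / alpha               # mean value of v's current out-neighbours
            gains = []
            for (i, j) in fragile:
                if i != v:
                    continue
                sign = -1.0 if (i, j) in clean else 1.0  # removing vs. adding the edge
                gain = sign * (x[j] - mean_v)
                if gain > 1e-12:
                    gains.append((gain, (i, j)))
            gains.sort(reverse=True)
            new_flipped.update(edge for _, edge in gains[:local_budget[v]])
        if new_flipped == flipped:                       # policy is stable: stop
            break
        flipped = new_flipped
    return clean ^ flipped


clean = {(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)}
fragile = {(0, 2), (0, 3), (1, 3), (1, 0), (2, 0), (3, 1)}
r = [1.0, -0.5, 2.0, 0.3]                                # made-up reward vector
best_edges = policy_iteration(4, clean, fragile, [1, 1, 1, 1], r)
x_best = solve_values(best_edges, r, 0.85)
```

Since $r^T \pi(z) = (1-\alpha)\, z^T x$ with $x = (I_N - \alpha D^{-1} A)^{-1} r$, maximising $x$ entry-wise maximises the objective for every teleport vector at once, which matches the observation that the optimal edges do not depend on $z$.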
Our approach for both local and global budget. Algorithm 1 cannot handle a global budget constraint, and in general solving Problem 2 with a global budget is NP-hard. More specifically, it generalizes the Link Building problem (Olsen, 2010) – find the set of optimal edges that point to a given node such that its PageRank score is maximized – which is W[1]-hard and for which there exists no fully-polynomial time approximation scheme (FPTAS). It follows that Problem 2 is also W[1]-hard and allows no FPTAS. We provide the proof and more details in Sec. 8.5 in the appendix. Therefore, we develop an alternative approach that consists of three steps and is outlined in the lower part of Fig. 1: (a) We propose an alternative unconstrained MDP based on an auxiliary graph which reduces the action set from exponential to binary by adding only $|\mathcal{F}|$ auxiliary nodes; (b) We reformulate the problem as a non-convex Quadratically Constrained Linear Program (QCLP) to be able to handle the global budget; (c) We utilize the Reformulation Linearization Technique (RLT) to construct a convex relaxation of the QCLP, enabling us to efficiently compute a lower bound on the worst-case margin.
(a) Auxiliary graph. Given an input graph we add one auxiliary node for each fragile edge . We define a total cost infinite horizon MDP on this auxiliary graph (Fig. 1(b)) that solves Problem 2 without constraints. The MDP is defined by the 4-tuple , where is the state space (preexisting and auxiliary nodes), and is the set of admissible actions in state . Given action , is the probability to go to state from state and the instantaneous reward. Each preexisting node has a single action , reward , and uniform transitions , discounted by for the fixed edges , where is the degree. For each auxiliary node we allow two actions . For action "off" node goes back to node with probability and obtains reward : . For action "on" node goes only to node with probability (the model is substochastic) and obtains reward: . We introduce fewer aux. nodes compared to previous work (Csáji et al., 2010; Fercoq, 2012).
(b) Global and local budgets QCLP. Based on this unconstrained MDP, we can derive a corresponding linear program (LP) solving the same problem (Puterman, 1994). Since the MDP on the auxiliary graph has (at most) binary action sets, the LP has only constraints and variables. This is in strong contrast to the LP corresponding to the previous average cost MDP (Fercoq et al., 2013) operating directly on the original graph that has an exponential number of constraints and variables. Lastly, we enrich the LP for the aux. graph MDP with additional constraints enforcing the local and global budgets. The constraints for the local budget are linear; however, the global budget requires quadratic constraints, resulting in a quadratically constrained linear program (QCLP) that exactly solves Problem 2.
Solving the following QCLP (with decision variables ) is equivalent to solving Problem 2 with local and global constraints, i.e. the value of the objective function is the same in the optimal solution. We can recover from via . Here is the number of "off" fragile edges (the ones where ) in the optimal solution.
Key idea and insights. Eqs. 4b and 4c correspond to the LP of the unconstrained MDP. Intuitively, the variable maps to the PageRank score of node , and from the variables we can recover the optimal policy: if the variable (respectively ) is non-zero then in the optimal policy the fragile edge is turned off (respectively on). Since there exists a deterministic optimal policy, only one of them is non-zero but never both. Eq. 4d corresponds to the local budget. Remarkably, despite the variables not being integral, since they share the factor from Eq. 4c we can exactly count the number of edges that are turned off or on using only linear constraints. Eqs. 4e and 4f enforce the global budget. From Eq. 4e we have that whenever is nonzero it follows that and since that is the only configuration that satisfies the constraints (similarly for ). Intuitively, this effectively makes the variables "counters" and we can utilize them in Eq. 4f to enforce the total number of perturbed edges to not exceed . See detailed proof in Sec. 8.3.
(c) Efficient Reformulation Linearization Technique (RLT). The quadratic constraints in our QCLP make the problem non-convex and difficult to solve. We relax the problem using the Reformulation Linearization Technique (RLT) (Sherali and Tuncbilek, 1995) which gives us an upper bound on the objective. The alternative SDP-relaxation (Vandenberghe and Boyd, 1996) based on semidefinite programming is not suitable for our problem since the constraints are trivially satisfied (see Appendix 8.4 for details). While in general, the RLT introduces many new variables (replacing each product term with a variable ) along with multiple new linear inequality constraints, it turns out that in our case the solution is highly compact:
Proof provided in Sec. 8.3 in the appendix. By replacing Eqs. 4e and 4f with Eq. 5 in Proposition 2, we obtain a linear program which can be efficiently solved. Remarkably, we only have as decision variables since we were able to eliminate all other variables. The solution is an upper bound on the solution for Problem 2 and a lower bound on the solution for Problem 1. The final relaxed QCLP can also be interpreted as a constrained MDP with a single additional constraint (Eq. 5) which admits a possibly randomized optimal policy with at most one randomized state (Altman, 1999).
Certificate for local and global budget. To solve the relaxed QCLP and compute the final certificate we need to provide the upper bounds for the constraint in Eq. 5. Since the quality of the RLT relaxation depends on the tightness of these upper bounds, we have to carefully select them. We provide here one solution (see Sec. 8.6 in the appendix for a faster to compute, but less tight, alternative): Given an instance of Problem 2, we can set the reward to and invoke Algorithm 1, which is highly efficient, using the same fragile set and the same local budget. Since this explicitly maximizes , the objective value of the problem is guaranteed to give a valid upper bound . Invoking this procedure for every node, leads to the required upper bounds.
Now, to compute the certificate with local and global budget for a target node , we solve the relaxed problem for all , leading to objective function values (minus due to the change from min to max). Thus, is a lower bound on the worst-case margin . If the lower bound is positive then node is guaranteed to be certifiably robust – there exists no adversarial attack (among all graphs in ) that can change the prediction for node .
For our policy iteration approach, if $m^*_{y_t,*}(t) < 0$ we are guaranteed to have found an adversarial example since the certificate is exact, i.e. we also have a non-robustness certificate. For the relaxed QCLP, however, a negative lower bound does not necessarily give us an adversarial example. Instead, we can perturb the graph with the optimal configuration of fragile edges for the relaxed problem, and inspect whether the predictions change. See Fig. 1 for an overview of both approaches.
5 Robust training for graph neural networks
In Sec. 4 we introduced two methods to efficiently compute certificates given a trained π-PPNP model. We now show that these can naturally be used to go one step further – to improve the robustness of the model. The main idea is to utilize the worst-case margin during training to encourage the model to learn more robust weights. Optimizing some robust loss $\mathcal{L}_\theta$ with respect to the model parameters $\theta$ (e.g. for π-PPNP, $\theta$ are the neural network parameters) that depends on the worst-case margin $m^*_{y_t,*}(t)$ is generally hard since it involves an inner optimization problem, namely finding the worst-case margin. This prevents us from easily taking the gradient of $m^*_{y_t,*}(t)$ (and, thus, of $\mathcal{L}_\theta$) w.r.t. the parameters $\theta$. Previous approaches tackle this challenge by using the dual (Wong and Kolter, 2018).
Inspecting our problem, however, we see that we can directly compute the gradient. Since $m^*_{y_t,c}(t)$ (respectively the corresponding lower bound) is a linear function of $H$ and $\pi$, and furthermore the admissible set over which we are optimizing is compact, it follows from Danskin's theorem (Danskin, 1967) that we can simply compute the gradient of the loss at the optimal point. We have $\partial m^*_{y_t,c}(t) / \partial H_{i,y_t} = \pi^*(e_t)_i$ and $\partial m^*_{y_t,c}(t) / \partial H_{i,c} = -\pi^*(e_t)_i$, i.e. the gradient equals the optimal ($*$) PageRank scores computed in our certification approaches.
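The gradient argument can be illustrated on a finite candidate set. In the sketch below the admissible graphs are represented only through a hypothetical list of their PageRank vectors; the minimiser is assumed to be unique, and the function name and numbers are made up for illustration.

```python
def worst_margin_and_grad(candidates, H, y, c):
    """Minimise pi^T (H[:, y] - H[:, c]) over a finite set of PageRank vectors.

    Returns the minimum and its gradient w.r.t. H, evaluated at the minimiser
    as Danskin's theorem suggests (assuming the minimiser is unique).
    """
    def margin(pi):
        return sum(pi[u] * (H[u][y] - H[u][c]) for u in range(len(H)))

    pi_star = min(candidates, key=margin)
    grad = [[0.0] * len(H[0]) for _ in H]
    for u, p in enumerate(pi_star):
        grad[u][y] = p     # d m* / d H[u][y] = pi*_u
        grad[u][c] = -p    # d m* / d H[u][c] = -pi*_u
    return margin(pi_star), grad


# Hypothetical PageRank vectors of three admissible graphs and made-up logits.
candidates = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6], [0.6, 0.1, 0.3]]
H = [[2.0, 0.0], [0.0, 1.0], [1.0, 3.0]]
m_star, grad = worst_margin_and_grad(candidates, H, y=0, c=1)
```

A finite-difference check on any single logit reproduces the corresponding entry of `grad`, as long as the perturbation is small enough not to change which candidate attains the minimum.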
Robust training. To improve robustness Wong and Kolter (2018) proposed to optimize the robust cross-entropy loss: $\mathcal{L}_{RCE} = \sum_{v \in \mathcal{V}_L} \mathcal{L}_{CE}(y^*_v, -m^*_v)$, where $\mathcal{L}_{CE}$ is the standard cross-entropy loss operating on the logits, and $m^*_v$ is a vector such that at index $c$ we have $m^*_{y^*_v,c}(v)$. Previous work has shown that if the model is overconfident there is a potential issue when using $\mathcal{L}_{RCE}$ since it encourages high certainty under the worst-case perturbations (Zügner and Günnemann, 2019a). Therefore, we also study the alternative robust hinge loss. Since the attacker wants to minimize the worst-case margin (or its lower bound), a straightforward idea is to try to maximize it during training. To achieve this we add a hinge loss penalty term to the standard cross-entropy loss. Specifically: $\mathcal{L}_{CEM} = \sum_{v \in \mathcal{V}_L} \big[ \mathcal{L}_{CE}(y^*_v, H^{diff}_{v,:}) + \sum_{c \neq y^*_v} \max(0, M - m^*_{y^*_v,c}(v)) \big]$. The second term for a single node $v$ is positive if $m^*_{y^*_v,c}(v) < M$ and zero otherwise – the node $v$ is certifiably robust with a margin of at least $M$. Effectively, if all training nodes are robust, the second term becomes zero, thus reducing to the standard cross-entropy loss with robustness guarantees. Note again that we can easily compute the gradient of these losses w.r.t. the (neural network) parameters $\theta$.
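The two losses described above can be written down for a single node as follows. This is a minimal sketch in plain Python without automatic differentiation; the function names are hypothetical, the worst-case margins are assumed to be given (e.g. from a certificate), and the entry of the margin vector at the true label is treated as zero.

```python
import math


def cross_entropy(logits, label):
    """Standard cross-entropy of a softmax over `logits` (log-sum-exp stabilised)."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_norm - logits[label]


def robust_cross_entropy(worst_margins, label):
    """Cross-entropy evaluated on the negated worst-case margins of one node."""
    neg = [0.0 if c == label else -m for c, m in enumerate(worst_margins)]
    return cross_entropy(neg, label)


def hinge_penalty(worst_margins, label, M=1.0):
    """Sum of max(0, M - margin) over all classes other than the label."""
    return sum(max(0.0, M - m) for c, m in enumerate(worst_margins) if c != label)


def robust_hinge_loss(diffused_logits, worst_margins, label, M=1.0):
    """Clean cross-entropy plus the hinge penalty on the worst-case margins."""
    return cross_entropy(diffused_logits, label) + hinge_penalty(worst_margins, label, M)


# One node with label 0 and made-up worst-case margins against classes 1 and 2.
penalty = hinge_penalty([0.0, 0.2, 1.5], label=0, M=1.0)
```

Only the margin against class 1 falls below the threshold here, so it alone contributes to the penalty; once every margin exceeds $M$ the hinge term vanishes and the loss coincides with the clean cross-entropy.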
6 Experimental results
Setup. We focus on evaluating the robustness of π-PPNP without robust training and label/feature propagation using our two certification methods. We also verify that robust training improves the robustness of π-PPNP while maintaining high predictive accuracy. We demonstrate our claims on two publicly available datasets: Cora-ML (Bojchevski and Günnemann, 2018b; McCallum et al., 2000) and Citeseer (Sen et al., 2008) with further experiments on Pubmed (Sen et al., 2008) in the appendix. Following Klicpera et al. (2019) we configure our π-PPNP model with one hidden layer and choose a latent dimensionality of 64. We select 20 nodes per class for the training/validation set and we use the rest for the test set. Unless otherwise specified we set . See Sec. 8.2 in the appendix for further experiments and Sec. 8.7 for more details about the experimental setup. Note that we do not need to compare to any previously introduced adversarial attacks on graphs (Dai et al., 2018; Zügner and Günnemann, 2019a; Zügner et al., 2018), since by the definition of a certificate, for a certifiably robust node w.r.t. a given admissible set there exists no successful attack within that set.
We construct several different configurations of fixed and fragile edges to gain a better understanding of the robustness of the methods to different kinds of adversarial perturbations. Namely, "both" refers to the scenario where $\mathcal{F} = (\mathcal{V} \times \mathcal{V}) \setminus \mathcal{E}_f$, i.e. the attacker is allowed to add or remove any edge in the graph, while "remove" refers to the scenario where $\mathcal{F} = \mathcal{E} \setminus \mathcal{E}_f$ for a given graph $G = (\mathcal{V}, \mathcal{E})$, i.e. the attacker can only remove existing edges. In addition, for all scenarios we specify the fixed set as $\mathcal{E}_f = \mathcal{E}_{mst}$, where $(i,j) \in \mathcal{E}_{mst}$ if $(i,j)$ belongs to the minimum spanning tree (MST) on the graph $G$. (Fixing the MST edges ensures that every node is reachable by every other node for any policy. This is only to simplify our earlier exposition regarding the MDPs and can be relaxed to e.g. reachable at the optimal policy.)
Robustness certificates: Local budget only. We investigate the robustness of different graphs and semi-supervised node classification methods when the attacker has only local budget constraints. We set the local budget relative to the degree of node in the original graph, . Note that leads to a more restrictive budget compared to . Such relative budget is justified since higher degree nodes tend to be more robust in general (Zügner and Günnemann, 2019b; Zügner et al., 2018). We then apply our policy iteration algorithm to compute the (exact) worst-case margin for each node.
In Fig. 2(a) we see that the number of certifiably robust nodes when the attacker can only remove edges is significantly higher compared to when they can also add edges, which is consistent with previous work on adversarial attacks (Zügner et al., 2018). As expected, the share of robust nodes decreases with higher budget, and π-PPNP is significantly more robust than label propagation since besides the graph it also takes advantage of the node attributes. Feature propagation has similar performance ($F_1$ score) but it is less robust. Note that since our certificate is exact, the remaining nodes are certifiably non-robust! In Sec. 8.2 in the appendix we also investigate certifiable accuracy – the ratio of nodes that are both certifiably robust and at the same time have a correct prediction. We find that the certifiable accuracy is relatively close to the clean accuracy, and it decreases gracefully as we increase the budget.
Analyzing influence on robustness. In Fig. 2(b) we see that decreasing the teleport probability is an effective strategy to significantly increase the robustness with no significant loss in accuracy (at most for any , not shown). Thus, provides a useful trade-off between robustness and the size of the effective neighborhood: higher implies higher PageRank scores (i.e. higher influence) for the neighbors. In general we recommend to set the value as low as the accuracy allows. In Fig. 2(c) we investigate what contributes to certain nodes being more robust than others. We see that neighborhood purity – the share of nodes with the same class in a respective node’s two-hop neighborhood – plays an important role. High purity leads to high worst-case margin, which translates to certifiable robustness.
Robustness certificates: Local and global budget. We demonstrate our second approach based on the relaxed QCLP problem by analyzing the robustness w.r.t. increasing global budget. We use a relative local budget with for different values of , and set , i.e. the attacker can only remove edges. We see in Fig. 3(a) that by enforcing a global budget we can significantly restrict the success of the attacker compared to having only a local budget. Similar trends hold for different local budgets: the global constraint increases the number of robust nodes, validating our approach.
Efficiency. Fig. 3(b) demonstrates the efficiency of our approach: even for fragile sets as large as , Algorithm 1 finds the optimal solution in just a few iterations. Since each iteration is itself efficient by utilizing sparse matrix operations, the overall wall clock runtime (shown as text annotation) is on the order of a few seconds. In Sec. 8.2 in the appendix, we further investigate the runtime as we increase the number of nodes in the graph, as well as the runtime of our relaxed QCLP.
Robust training. While not being our core focus, we investigate whether robust training improves the certifiable robustness of GNNs. We set the fragile set and vary the local budget. The vertical line in Fig. 3(c) indicates the local budget used to train the robust models with losses and . We see that both of our approaches are able to improve the percentage of certifiably robust nodes, with the largest improvement (around increase) for the budget we trained on (). Furthermore, the scores on the test split for Citeseer are as follows: for , for , and for , i.e. besides improving the ratio of certified nodes, robust training also improves the clean predictive accuracy of the model. has a higher certifiable robustness, but has a higher score. There is room for improvement in how we approach the robust training: e.g. similar to Zügner and Günnemann (2019b) we can optimize over the worst-case margin for the unlabeled in addition to the labeled nodes. We leave this as a future research direction.
7 Conclusion
We derive the first (non-)robustness certificate for graph neural networks regarding perturbations of the graph structure, and the first certificate overall for label/feature propagation. Our certificates are flexible w.r.t. the threat model, can handle both local (per node) and global budgets, and can be efficiently computed. We also propose a robust training procedure that increases the number of certifiably robust nodes while improving the predictive accuracy. As future work, we aim to consider perturbations and robustification of the node features and the graph structure jointly.
This research was supported by the German Research Foundation, Emmy Noether grant GU 1409/2-1, and the German Federal Ministry of Education and Research (BMBF), grant no. 01IS18036B. The authors of this work take full responsibility for its content.
- Constrained Markov decision processes. Vol. 7, CRC Press. Cited by: §4.3.
- The effect of new links on google pagerank. Stochastic Models 22 (2). Cited by: §4.3.
- Bayesian robust attributed graph clustering: joint learning of partial anomalies and group structure. In AAAI Conference on Artificial Intelligence, Cited by: §2.
- Deep gaussian embedding of graphs: unsupervised inductive learning via ranking. In International Conference on Learning Representations, ICLR, Cited by: §6.
- Adversarial attacks on node embeddings via graph poisoning. In International Conference on Machine Learning, ICML, Cited by: §2.
- Robust spectral clustering for noisy data: modeling sparse corruptions improves latent embeddings. In International Conference on Knowledge Discovery and Data Mining, KDD, pp. 737–746. Cited by: §2.
- Bootstrapped graph diffusions: exposing the power of nonlinearity. In Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS, Cited by: §1, §3, §8.1.
- Parameterized complexity of cardinality constrained optimization problems. The Computer Journal 51 (1), pp. 102–121. Cited by: §8.5.
- Adversarial web search. Foundations and Trends in Information Retrieval 4 (5). Cited by: §1.
- Can adversarial network attack be defended?. arXiv preprint arXiv:1903.05994. Cited by: §2.
- PageRank optimization in polynomial time by stochastic shortest path reformulation. In ALT, 21st International Conference, Cited by: §4.3, §4.3.
- PageRank optimization by edge selection. Discrete Applied Mathematics 169. Cited by: §4.3.
- Adversarial attack on graph structured data. In International Conference on Machine Learning, ICML, Cited by: §1, §2, §6.
- The theory of max-min and its application to weapons allocation problems. Cited by: §5.
- Maximizing pagerank via outlinks. Linear Algebra and its Applications 429 (5-6). Cited by: §4.3.
- Graph adversarial training: dynamically regularizing based on graph structure. arXiv preprint arXiv:1902.08226. Cited by: §2.
- Ergodic control and polyhedral approaches to pagerank optimization. IEEE Trans. Automat. Contr. 58 (1). Cited by: §4.3, §4.3, §8.3.
- Optimization of perron eigenvectors and applications: from web ranking to chronotherapeutics. Ph.D. Thesis, Ecole Polytechnique X. Cited by: §4.3.
- Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428. Cited by: §3.
- Protein interface prediction using graph convolutional networks. In Neural Information Processing Systems, NIPS, Cited by: §1.
- Explaining and harnessing adversarial examples. In International Conference on Learning Representations, ICLR, Cited by: §2.
- Topic-sensitive pagerank. In Eleventh International World Wide Web Conference, WWW, Cited by: §3.
- Formal guarantees on the robustness of a classifier against adversarial manipulation. In Neural Information Processing Systems, NIPS, Cited by: §1, §2.
- Learning graph neural networks with noisy labels. arXiv preprint arXiv:1905.01591. Cited by: §2.
- Policy iteration is well suited to optimize pagerank. arXiv preprint arXiv:1108.3779. Cited by: §4.3, §4.3, §8.3.
- BIRDNEST: bayesian inference for ratings-fraud detection. In SIAM International Conference on Data Mining, Cited by: §1.
- Scaling personalized web search. In Twelfth International World Wide Web Conference, WWW, Cited by: §3.
- Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, ICLR, Cited by: §1.
- Predict then propagate: graph neural networks meet personalized pagerank. In International Conference on Learning Representations, ICLR, Cited by: §1, §3, §6.
- Automating the construction of internet portals with machine learning. Inf. Retr. 3 (2). Cited by: §6.
- Improving robustness to attacks against vertex classification. Cited by: §2.
- A constant-factor approximation algorithm for the link building problem. In International Conference on Combinatorial Optimization and Applications, pp. 87–96. Cited by: §8.5.
- Maximizing pagerank with new backlinks. In International Conference on Algorithms and Complexity, pp. 37–48. Cited by: §4.3, §4.3, §8.5, §8.5, Problem 3.
- Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, Inc.. Cited by: §4.3, §8.3.
- Semidefinite relaxations for certifying robustness adversarial examples. In Neural Information Processing Systems, NeurIPS, Cited by: §2.
- Hybrid approach of relation network and localized graph convolutional filtering for breast cancer subtype classification. In International Joint Conference on Artificial Intelligence, IJCAI, Cited by: §1.
- Collective classification in network data. AI Magazine 29 (3). Cited by: §6, §8.2.
- A reformulation-convexification approach for solving nonconvex quadratic programming problems. Journal of Global Optimization 7 (1). Cited by: §4.3.
- Generalized optimization framework for graph-based semi-supervised learning. In SIAM International Conference on Data Mining, Cited by: §8.1.
- Virtual adversarial training on graph convolutional networks in node classification. arXiv preprint arXiv:1902.11045. Cited by: §2.
- Intriguing properties of neural networks. In International Conference on Learning Representations, ICLR, Cited by: §2.
- Robust graph neural network against poisoning attacks via transfer learning. arXiv preprint arXiv:1908.07558. Cited by: §2.
- Semidefinite programming. SIAM review 38 (1). Cited by: §4.3, §8.4.
- FdGars: fraudster detection via graph convolutional networks in online app review system. In Companion of The 2019 World Wide Web Conference, WWW, Cited by: §1.
- Adversarial defense framework for graph neural network. arXiv preprint arXiv:1905.03679. Cited by: §2.
- Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, ICML, Cited by: §1, §2, §5, §5.
- Simplifying graph convolutional networks. In International Conference on Machine Learning, ICML, Cited by: §8.1.
- Adversarial examples for graph data: deep insights into attack and defense. In International Joint Conference on Artificial Intelligence, IJCAI, pp. 4816–4823. Cited by: §2.
- Topology attack and defense for graph neural networks: an optimization perspective. In International Joint Conference on Artificial Intelligence, IJCAI, Cited by: §2.
- Graph convolutional neural networks for web-scale recommender systems. In International Conference on Knowledge Discovery & Data Mining, KDD, Cited by: §1.
- Comparing and detecting adversarial attacks for graph deep learning. In Proc. Representation Learning on Graphs and Manifolds Workshop, Int. Conf. Learning Representations, New Orleans, LA, USA, Cited by: §2.
- Bayesian graph convolutional neural networks for semi-supervised classification. In AAAI Conference on Artificial Intelligence, Cited by: §2.
- Learning with local and global consistency. In Neural Information Processing Systems, NIPS, Cited by: §8.1.
- Spectral clustering and transductive learning with multiple views. In International Conference on Machine Learning, ICML, Cited by: §1, §8.1.
- Learning from labeled and unlabeled data on a directed graph. In International Conference on Machine Learning, ICML, Cited by: §8.1.
- Adversarial robustness of similarity-based link prediction. International Conference on Data Mining, ICDM. Cited by: §2.
- Robust graph convolutional networks against adversarial attacks. In International Conference on Knowledge Discovery & Data Mining, KDD, Cited by: §2.
- Adversarial attacks on neural networks for graph data. In International Conference on Knowledge Discovery & Data Mining, KDD, Cited by: §1, §1, §2, §6, §6, §6.
- Adversarial attacks on graph neural networks via meta learning. In International Conference on Learning Representations, ICLR, Cited by: §2, §5, §6.
- Certifiable robustness and robust training for graph convolutional networks. In International Conference on Knowledge Discovery & Data Mining, KDD, Cited by: §1, §2, §6, §6.
8.1 Certificates for Label Propagation and Feature Propagation
Label propagation is a classic method for semi-supervised node classification, and many variants have been proposed over the years (Zhou et al., 2003, 2005; Zhou and Burges, 2007). The general idea is to find a classification function such that the training nodes are predicted correctly and the predicted labels change smoothly over the graph. We can express this formally via the following optimization problem (Sokol et al., 2012):
where, is the node degree, is a regularization parameter trading off smoothness and predicting the labeled nodes correctly, is a hyper-parameter, and is a matrix where the rows are one-hot vectors for the training nodes and zero vectors otherwise (i.e. if and otherwise). The resulting matrix is the learned classification function, i.e. the value gives us the (unnormalized) probability that node belongs to a class , and we can make predictions by taking the argmax. The problem can be solved in closed form (even though in practice one would use power iteration) and the solution is: for . We can see that setting , i.e. the standard Laplacian variant (Zhou and Burges, 2007) we obtain:
From Eq. 7 we have that Label Propagation is very similar to our -PPNP: instead of diffusing logits which come from a neural network, it propagates the one-hot vectors of the labeled nodes. From here onwards we apply our proposed method without any modifications by simply providing a different matrix in Problem 1.
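The connection between label propagation and personalized PageRank can be illustrated with a small sketch. It is not the paper's implementation: it assumes a dense adjacency matrix, the row-normalized transition matrix D^-1 A, and the convention that `alpha` weights the propagation step while `1 - alpha` is the teleport probability. The names `ppr_matrix` and `label_propagation` are hypothetical.

```python
import numpy as np

def ppr_matrix(A, alpha):
    """Dense personalized PageRank matrix (1 - alpha) * (I - alpha * D^-1 A)^-1.

    Assumption: alpha weights the propagation step, 1 - alpha is the teleport
    probability. Every node needs at least one outgoing edge.
    """
    P = A / A.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    n = A.shape[0]
    return (1.0 - alpha) * np.linalg.inv(np.eye(n) - alpha * P)

def label_propagation(A, train_idx, train_labels, n_classes, alpha=0.85):
    """Diffuse the one-hot labels of the training nodes and take the argmax."""
    n = A.shape[0]
    Y = np.zeros((n, n_classes))
    Y[train_idx, train_labels] = 1.0      # zero rows for unlabeled nodes
    F = ppr_matrix(A, alpha) @ Y          # unnormalized class scores
    return F.argmax(axis=1), F

# Toy example: path graph 0-1-2-3 with the two end points labeled.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
pred, F = label_propagation(A, train_idx=[0, 3], train_labels=[0, 1], n_classes=2)
```

Under these assumptions the only difference to a PPNP-style model is the matrix being diffused: one-hot training labels here, neural network logits there.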
We can also certify the feature propagation (FP) approach, of which there are several variants: e.g. the normalized Laplacian FP (Buchnik and Cohen, 2018), or a recently proposed equivalent model termed simple graph convolution (SGC) (Wu et al., 2019). Feature propagation is carried out in two steps: (i) the node features are diffused to incorporate the graph structure, and (ii) a simple logistic regression model is trained using the diffused features and a subset of labelled nodes. Now, let be the weights corresponding to a trained logistic regression model. The predictions for all nodes are calculated as with . Thus, again by simply providing a different matrix in Problem 1 we can certify feature propagation.
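The reason a different input matrix suffices can be checked numerically: applying a linear classifier to diffused features gives the same logits as diffusing the per-node logits. The sketch below is illustrative only. It assumes a row-normalized personalized PageRank diffusion (the normalized Laplacian variant would use a symmetric normalization instead), and the graph, features `X` and weights `W` are random placeholders rather than trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 5, 3, 2                      # nodes, feature dimension, classes
alpha = 0.85                           # assumed propagation weight

A = np.ones((n, n)) - np.eye(n)        # placeholder graph (complete graph)
P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
Pi = (1 - alpha) * np.linalg.inv(np.eye(n) - alpha * P)

X = rng.normal(size=(n, d))            # placeholder node features
W = rng.normal(size=(d, c))            # stand-in for trained logistic regression weights

# Feature propagation: diffuse the features first, then classify.
logits_fp = (Pi @ X) @ W

# Certificate view: compute per-node logits H = X W first, then diffuse them.
H = X @ W
logits_certificate = Pi @ H
```

Both orderings agree by associativity of matrix multiplication, so `H` can play the role of the matrix that is diffused.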
8.2 Further experiments
In Fig. 4(a) we show the percent of certifiably robust nodes for different local budgets on the Pubmed graph () (Sen et al., 2008), demonstrating that our method scales to large graphs. Similar to before (Fig. 2(a)), the models are more robust to attackers that can only remove edges. In Fig. 4(b) we analyze the robustness of Citeseer w.r.t. increasing global budget. The global budget constraints are again able to successfully restrict the attacker. The global budget makes a larger difference when the attacker has a less restrictive local budget (). In Fig. 4(c) we show that the robust training increases the percent of certifiably robust nodes. Comparing to Fig. 3(c) we conclude that training with a less strict local budget ( as opposed to ) makes the model more robust overall while the predictive performance ( score) is the same in both cases.
We also investigate certifiable accuracy. The ratio of nodes that are both certifiably robust and at the same time have a correct prediction is a lower bound on the overall worst-case classification accuracy, since the worst-case perturbation can be different for each node. We plot this ratio in Fig. 5(a) for Citeseer and see that the certifiable accuracy is relatively close to the clean accuracy when the budget is restrictive, and it decreases gracefully as we increase the budget.
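The certifiable accuracy defined above is straightforward to compute once per-node certificates are available. The following is a minimal sketch with made-up arrays; the function name `certifiable_accuracy` and the boolean `certified` mask are illustrative assumptions, not part of the paper's code.

```python
import numpy as np

def certifiable_accuracy(pred, labels, certified):
    """Fraction of nodes that are correctly classified AND certifiably robust."""
    pred = np.asarray(pred)
    labels = np.asarray(labels)
    certified = np.asarray(certified).astype(bool)
    return float(np.mean((pred == labels) & certified))

# Made-up example with four nodes.
pred = [0, 1, 1, 2]
labels = [0, 1, 2, 2]
certified = [True, False, True, True]

clean_acc = float(np.mean(np.asarray(pred) == np.asarray(labels)))
cert_acc = certifiable_accuracy(pred, labels, certified)
```

By construction the certifiable accuracy can never exceed the clean accuracy, since it only counts a subset of the correctly classified nodes.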
To show how the runtime scales with the number of nodes and edges, we randomly generate SBM graphs of increasing size, and we set all edges in the generated graphs as fragile (). In Fig. 5(b) we see the mean runtime across five runs for local budget (VI algorithm). Even for graphs with more than 10K nodes the certificate runs in a few seconds. Similarly, Fig. 5(c) shows the runtime for global budget (RLT relaxation). We see that the runtime scales linearly with the number of edges. Furthermore, the overall runtime can be easily reduced by: (i) stopping early whenever the worst-case margin becomes negative, (ii) using Gurobi’s distributed optimization capabilities to reduce solve times, and (iii) having a single shared preprocessing step for all nodes.
Proof. Proposition 1.
Problem 2 can be formulated as an average cost infinite horizon Markov decision problem, where at each node we decide which subset of edges are active, i.e. where is the power set of and the reward depends only on the starting state but not on the action and the ending state . From the average cost infinite horizon optimality criterion as shown by Fercoq et al. (2013) we have:
is a random variable denoting the state of the system at the discrete time, and is a deterministic control strategy determining a sequence of actions and is a function of the history . For this problem there exists a stationary (feedback) strategy that does not depend on the history such that for all . Eq. 8 follows from the ergodic theorem for Markov chains. Here the reward is more general and can be set depending on the edge. Letting and plugging it in Eq. 8 we get that the optimality criterion equals since the transition matrix is row-stochastic. As shown by Hollanders et al. (2011), policy iteration is well suited to optimize PageRank, and our Algorithm 1 corresponds to policy iteration for the above MDP. For a fixed teleport probability (which is our case) policy iteration always converges in fewer iterations than value iteration (Puterman, 1994) and does so in weakly polynomial time that depends on the number of fragile edges (Hollanders et al., 2011).
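To make the MDP view concrete, here is a simplified policy iteration sketch. It is not Algorithm 1: it ignores local and global budgets, uses dense linear solves, and assumes every node keeps at least one fixed outgoing edge. The state is the current node, the action at a node is the subset of its fragile out-edges that are switched on, the reward `r` depends only on the current node, and the discount `alpha` is assumed to equal one minus the teleport probability. All function names are hypothetical.

```python
import itertools
import numpy as np

def value(n, edges, r, alpha):
    """Solve x = r + alpha * P x, with P the row-stochastic matrix of `edges`."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = 1.0
    P = A / A.sum(axis=1, keepdims=True)
    return np.linalg.solve(np.eye(n) - alpha * P, r)

def policy_iteration(n, fixed, fragile, r, alpha, max_iter=100):
    """Choose which fragile edges to switch on so that x is maximal (no budgets)."""
    active = set()                                   # start with all fragile edges off
    for _ in range(max_iter):
        x = value(n, fixed | active, r, alpha)       # policy evaluation
        improved = set()
        for i in range(n):                           # greedy policy improvement
            nbrs = [j for (s, j) in fixed if s == i]
            total, count = sum(x[j] for j in nbrs), len(nbrs)
            cands = sorted((j for (s, j) in fragile if s == i), key=lambda j: -x[j])
            for j in cands:
                if x[j] <= total / count + 1e-12:    # edge would not raise the mean
                    break
                total, count = total + x[j], count + 1
                improved.add((i, j))
        if improved == active:                       # policy is stable
            return active, x
        active = improved
    return active, value(n, fixed | active, r, alpha)

def brute_force(n, fixed, fragile, r, alpha):
    """Element-wise best value over all subsets of fragile edges (tiny graphs only)."""
    best = np.full(n, -np.inf)
    frag = sorted(fragile)
    for k in range(len(frag) + 1):
        for subset in itertools.combinations(frag, k):
            best = np.maximum(best, value(n, fixed | set(subset), r, alpha))
    return best

# Toy instance: a directed 4-cycle of fixed edges plus four fragile edges.
fixed = {(0, 1), (1, 2), (2, 3), (3, 0)}
fragile = {(0, 2), (1, 3), (2, 0), (3, 1)}
r = np.array([1.0, -0.5, 2.0, 0.0])
active, x = policy_iteration(4, fixed, fragile, r, alpha=0.85)
```

Under these assumptions, `(1 - alpha) * x[t]` is the `r`-weighted personalized PageRank of node `t` on the perturbed graph, and the brute-force helper is included only to check the result on tiny instances.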
Proof. Proposition 2.
Eqs. 4b and 4c correspond to the LP of the unconstrained MDP on the auxiliary graph. Intuitively, the variable maps to the PageRank score of node , and from the variables we can recover the optimal policy: if the variable (respectively ) is non-zero then in the optimal policy the fragile edge is turned off (respectively on). Since there exists a deterministic optimal policy, only one of them is non-zero but never both. Eq. 4d corresponds to the local budget. Remarkably, despite the variables not being integral, since they share the factor from Eq. 4c we can exactly count the number of edges that are turned off or on using only linear constraints. Eqs. 4e and 4f enforce the global budget. From Eq. 4e we have that whenever is nonzero it follows that and since that is the only configuration that satisfies the constraints (similarly for ). Intuitively, this effectively makes the variables "counters" and thus, we can utilize them in Eq. 4f to enforce the total number of perturbed edges to not exceed .
We also have to show that solving the MDP on the auxiliary graph solves the same problem as the MDP on the original graph. Recall that whenever we traverse any edge from node we obtain reward . On the other hand, whenever we traverse an edge from the auxiliary node corresponding to a fragile edge to the node (action "off") we get negative reward , and the transition probability is . Intuitively, traversing back and forth between node and node does not change the overall reward obtained (since and cancel out). That is, we have the same reward as in the original graph with the edge excluded. Similarly, when we traverse the edge from auxiliary node to the node (action "on") we obtain reward, i.e. no additional reward is gained and the transition happens with probability . Therefore, the overall reward is the same as if the fragile edge were present in the original graph.
More formally, for any arbitrary policy for the unconstrained MDP on the auxiliary graph, let be the current number of "off" fragile edges for node and let be the current set of "on" fragile edges. From Eqs. 4b and 4c we have:
where we can see that is the personalized PageRank for node for a perturbed original graph corresponding to the current policy, i.e. the graph where all for all are turned "on". Plugging in Eq. 9b into the objective from Eq. 4a we have
which exactly corresponds to the objective of Problem 2. Since the above analysis holds for any policy it also holds for the optimal policy, and therefore solving the unconstrained MDP on the auxiliary graph is equivalent to solving the unconstrained MDP on the original graph.
Combining everything together we have that solving the QCLP is equivalent to solving Problem 2.
Proof. Proposition 3.
Using the reformulation-linearization technique (RLT) we relax the quadratic constraints in Eq. 4e. In general, from RLT it follows that we add the following four linear constraints for each pairwise quadratic constraint
where are lower and upper bounds for .
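For reference, the standard RLT (McCormick) inequalities for a single product of two bounded variables are given below. The symbols x, y, w and the bound names are generic placeholders rather than the paper's notation, and the ordering of the four constraints may differ from that of Eq. 10.

```latex
% Assumed setting: w replaces the product x y, with
% \underline{x} \le x \le \overline{x} and \underline{y} \le y \le \overline{y}.
\begin{align*}
  w &\ge \underline{x}\, y + \underline{y}\, x - \underline{x}\,\underline{y}, &
  w &\ge \overline{x}\, y + \overline{y}\, x - \overline{x}\,\overline{y}, \\
  w &\le \overline{x}\, y + \underline{y}\, x - \overline{x}\,\underline{y}, &
  w &\le \underline{x}\, y + \overline{y}\, x - \underline{x}\,\overline{y}.
\end{align*}
```

Each inequality follows from multiplying two nonnegative bound factors, for example $(x - \underline{x})(y - \underline{y}) \ge 0$, and replacing the product $xy$ by $w$.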
From Eq. 4e we see that our quadratic terms always equal to (), and we have the following upper , and , and lower bounds . Plugging these upper/lower bounds into Eq. 10 for our quadratic terms and we see that the constraints arising from Eqs. 10a, 10b and 10c are always trivially fulfilled. Thus we are left with the constraints arising from Eq. 10d which for our problem are: