1 Introduction
Machine learning algorithms are used in many application domains, including biology, computer vision, and natural language processing. The relevant models are often trained on third-party datasets or on internal or customized subsets of publicly available user data. For example, many computer vision models are trained on images from Flickr users Thomee et al. (2016); Guo et al. (2020),
while many natural language processing systems (e.g., for sentiment analysis) and recommender systems rely heavily on repositories such as IMDB
Maas et al. (2011). Furthermore, numerous ML classifiers in computational biology are trained on data from the UK Biobank
Sudlow et al. (2015), which represents a collection of genetic and medical records of roughly half a million participants Ginart et al. (2019). With recent demands for increased data privacy, the above-referenced and many other data repositories are facing increasing demands for data removal. Certain laws are already in place guaranteeing the right to certified data removal, including the European Union's General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and the Canadian Consumer Privacy Protection Act (CPPA) Sekhari et al. (2021). Removing user data from a dataset is insufficient to guarantee the desired level of privacy, since models trained on the original data may still contain information about its patterns and features. This consideration gave rise to a new research direction in machine learning, referred to as machine unlearning Cao and Yang (2015), in which the goal is to guarantee that information about the user data is also removed from the trained model. Naively, one can retrain the model from scratch to meet the privacy demand, yet retraining comes at a high computational cost and is thus impractical when accommodating frequent removal requests. To avoid complete retraining, various methods for machine unlearning have been proposed, including exact approaches Ginart et al. (2019); Bourtoule et al. (2021) as well as approximate methods Guo et al. (2020); Sekhari et al. (2021).
At the same time, graph-centered machine learning has received significant interest from the learning community due to the ubiquity of graph-structured data. Usually, the data contains two sources of information: node features and graph topology. Graph Neural Networks (GNNs) leverage both types of information simultaneously and achieve state-of-the-art performance in numerous real-world applications, including Google Maps Derrow-Pinion et al. (2021), various recommender systems Ying et al. (2018), self-driving cars Gao et al. (2020), and bioinformatics Zhang et al. (2021b). Clearly, user data is involved in training the underlying GNNs and may therefore be subject to removal. However, it is still unclear how to perform unlearning of GNNs.
We take the first step towards solving the approximate unlearning problem by performing a non-trivial theoretical analysis of simplified GNN architectures. Inspired by the certified removal procedure for unstructured data Guo et al. (2020), we propose the first known approach for certified graph unlearning. Our main contributions are as follows. First, we introduce three types of data removal requests for graph unlearning: node feature unlearning, edge unlearning, and node unlearning (see Figure 1). Second, we derive theoretical guarantees for certified graph unlearning mechanisms for all three removal cases on SGCs Wu et al. (2019) and their GPR generalizations. In particular, we analyze regularized graph models trained with differentiable convex loss functions. The analysis is challenging since propagation on graphs "mixes" node features. Our analysis reveals that the degree of the unlearned node plays an important role in the unlearning process, while the number of propagation steps may or may not be important, depending on the unlearning scenario. To the best of our knowledge, the theoretical guarantees established in this work are the first provable certified removal studies for graphs. Furthermore, the proposed analysis encompasses both node classification and node regression problems. Third, our empirical investigation on frequently used datasets for GNN learning shows that our method offers an excellent performance-complexity trade-off. For example, when unlearning nodes on the Cora dataset, our method achieves a substantial speedup with only a minor drop in test accuracy compared to complete retraining. We also test our model on datasets for which removal requests are more likely to arise, including the Amazon co-purchase networks. Due to space limitations, all proofs and some detailed discussions are relegated to the Appendix.
2 Related Works
Machine unlearning and certified data removal. Cao and Yang Cao and Yang (2015) introduced the concept of machine unlearning and proposed distributed learners for exact unlearning. Bourtoule et al. Bourtoule et al. (2021) introduced sharding-based methods for unlearning, while Ginart et al. Ginart et al. (2019) described unlearning approaches for $k$-means clustering. These works focused on exact unlearning: the unlearned model is required to perform identically to a completely retrained model. As an alternative, Guo et al. Guo et al. (2020) introduced a probabilistic definition of unlearning motivated by differential privacy Dwork (2011). Sekhari et al. Sekhari et al. (2021) studied the generalization performance of machine unlearning methods. These probabilistic approaches naturally allow for "approximate" unlearning. None of these works addressed the machine unlearning problem on graphs. To the best of our knowledge, the only work in this direction is the preprint Chen et al. (2021). However, the strategy proposed therein uses sharding, which only works for exact unlearning and is hence completely different from our approximate approach. Also, the approach in Chen et al. (2021) relies on partitioning the graph using community detection methods. It therefore implicitly assumes that the graph is homophilic, which is not warranted in practice Chien et al. (2021); Lim et al. (2021). In contrast, our method works for arbitrary graphs and allows for approximate unlearning while ensuring excellent trade-offs between performance and complexity.
Differential privacy (DP) and DP-GNNs. Machine unlearning, especially the approximate version described in Guo et al. (2020), is closely related to differential privacy Dwork (2011). In fact, differential privacy is a sufficient condition for machine unlearning. If a model is differentially private, then an adversary cannot distinguish whether the model was trained on the original dataset or on a dataset in which one data point is removed. Hence, even without model updating, a DP model automatically unlearns the removed data point (see also the explanation in Ginart et al. (2019); Sekhari et al. (2021) and Figure 2). Although DP is a sufficient condition for unlearning, it is not a necessary one. Also, most DP models suffer from a significant degradation in performance even when the privacy constraint is loose Chaudhuri et al. (2011); Abadi et al. (2016). Machine unlearning can therefore be viewed as a means to trade off performance against computational cost, with complete retraining and DP at two different ends of the spectrum Guo et al. (2020). Several recent works proposed DP-GNNs Daigavane et al. (2021); Olatunji et al. (2021); Wu et al. (2021); Sajadmanesh et al. (2022); however, even for unlearning one single node or edge, these methods require a high "privacy budget" to learn with sufficient accuracy.
Graph neural networks. While GNNs are successfully used for many graph-related problems, accompanying theoretical analyses are usually difficult due to the combination of nonlinear feature transformations and graph propagation. Recently, several simplified GNN models were proposed that further the theoretical understanding of GNN performance and scalability. SGCs Wu et al. (2019) simplify GCNs Kipf and Welling (2017) via linearization (i.e., through the removal of all nonlinearities); although SGCs in general underperform compared to state-of-the-art GNNs, they still offer competitive performance on many datasets. The analysis of SGCs elucidated the relationship between low-pass graph filtering and GCNs, which reveals both advantages and potential limitations of GNNs. The GPR generalization of SGC is closely related to many important models that resolve different issues inherent to GNNs. For example, GPR-GNN Chien et al. (2021) addresses the problem of universal learning on homophilic and heterophilic graph datasets and the issue of oversmoothing. SIGN-based graph models Frasca et al. (2020) and S²GC Zhu and Koniusz (2020) allow for arbitrary-sized mini-batch training, which improves scalability and leads to further performance improvements Sun et al. (2021); Zhang et al. (2021a); Chien et al. (2022) of methods on the Open Graph Benchmark leaderboard Hu et al. (2020). Hence, developing certified graph unlearning approaches for SGC and generalizations thereof is not only of theoretical interest, but also of practical importance.
3 Preliminaries
Notation. We reserve bold-font capital letters such as $\mathbf{M}$ for matrices and bold-font lowercase letters such as $\mathbf{v}$ for vectors. We use $\mathbf{e}_i$ to denote the $i$-th standard basis vector, so that $\mathbf{e}_i^T\mathbf{M}$ and $\mathbf{M}\mathbf{e}_i$ represent the $i$-th row and column vector of $\mathbf{M}$, respectively. The absolute value is applied componentwise to both matrices and vectors. We also use the symbol $\mathbf{1}$ for the all-one vector and $\mathbf{I}$ for the identity matrix. Furthermore, we let $G = (V, E)$ stand for an undirected graph with node set $V$ of size $n$ and edge set $E$. The symbols $\mathbf{A}$ and $\mathbf{D}$ are used to denote the corresponding adjacency and node degree matrix, respectively. The feature matrix is denoted by $\mathbf{X} \in \mathbb{R}^{n \times F}$ and the features have dimension $F$; for binary classification, the labels are summarized in $\mathbf{Y}$, while the non-binary case is discussed in Section 5. The relevant norms are $\|\cdot\|$, the $\ell_2$ norm, and $\|\cdot\|_F$, the Frobenius norm. Note that we use $\|\cdot\|$ for both row and column vectors to simplify the notation. The matrices $\mathbf{A}$ and $\mathbf{D}$ should not be confused with the symbols for an algorithm $A$ and a dataset $\mathcal{D}$.

Certified removal. Let $A$ be a (randomized) learning algorithm that trains on $\mathcal{D}$, the set of data points before removal, and outputs a model $h \in \mathcal{H}$, where $\mathcal{H}$ represents a chosen space of models. The removal of a subset of points from $\mathcal{D}$ results in $\mathcal{D}'$. For instance, let $\mathcal{D} = (\mathbf{X}, \mathbf{Y})$. Suppose we want to remove a data point from $\mathcal{D}$, resulting in $\mathcal{D}' = (\mathbf{X}', \mathbf{Y}')$. Here, $\mathbf{X}', \mathbf{Y}'$ are equal to $\mathbf{X}, \mathbf{Y}$, respectively, except that the row corresponding to the removed data point is deleted. Given $\epsilon, \delta > 0$, an unlearning algorithm $M$ applied to $A(\mathcal{D})$ is said to guarantee an $(\epsilon, \delta)$-certified removal for $A$, where $\mathcal{X}$ denotes the space of possible datasets, if for all $\mathcal{T} \subseteq \mathcal{H}$ and $\mathcal{D} \subseteq \mathcal{X}$:

$$P\big(M(A(\mathcal{D}), \mathcal{D}, \mathcal{D}') \in \mathcal{T}\big) \le e^{\epsilon}\, P\big(A(\mathcal{D}') \in \mathcal{T}\big) + \delta, \qquad P\big(A(\mathcal{D}') \in \mathcal{T}\big) \le e^{\epsilon}\, P\big(M(A(\mathcal{D}), \mathcal{D}, \mathcal{D}') \in \mathcal{T}\big) + \delta. \tag{1}$$
This definition is related to DP Dwork (2011), except that we are allowed to update the model based on the removed point (see Figure 2). An $(\epsilon, \delta)$-certified removal method guarantees that the updated model is "approximately" the same as the model obtained by retraining from scratch. Thus, any information about the removed data is "approximately" eliminated from the model. Ideally, we would like to design $M$ such that it satisfies (1) and has complexity that is significantly smaller than that of complete retraining.
4 Certified Graph Unlearning
Unlike standard machine unlearning, certified graph unlearning operates on datasets that contain not only node features but also the graph topology, and therefore requires different data removal procedures. We focus on node classification, for which the training dataset equals $\mathcal{D} = (\mathbf{X}, \mathbf{Y}, G)$. Here, $\mathbf{Y}$ is identical to the full label matrix on rows indexed by points of the training set, while the remaining rows are all zeros. An unlearning method $M$ achieves certified graph unlearning with algorithm $A$ if (1) is satisfied for $\mathcal{D}$ and $\mathcal{D}'$, which differ based on the type of graph unlearning: node feature unlearning, edge unlearning, and node unlearning.
4.1 Unlearning SGC
SGC is a simplification of GCN obtained by removing all nonlinearities from the latter model. This leads to the update rule $\hat{\mathbf{Y}} = \operatorname{softmax}(\mathbf{P}^K \mathbf{X} \mathbf{W})$, where $\mathbf{W}$ denotes the matrix of learnable weights, $K$ equals the number of propagation steps, and $\mathbf{P}$ denotes the one-step propagation matrix. The standard choice of the propagation matrix is the symmetric normalized adjacency matrix with self-loops, $\tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}$, where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ and $\tilde{\mathbf{D}}$ equals the degree matrix with respect to $\tilde{\mathbf{A}}$. We will work with the asymmetric normalized version of $\tilde{\mathbf{A}}$, namely $\mathbf{P} = \tilde{\mathbf{D}}^{-1}\tilde{\mathbf{A}}$. This choice is made purely for analytical purposes, and our empirical results confirm that this normalization ensures competitive performance of our unlearning methods.
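To make the propagation concrete, the following is a minimal numpy sketch of the asymmetric (row-)normalization and the $K$-step SGC embedding. The function name and the toy triangle graph are our own illustration, not from the paper.

```python
import numpy as np

def sgc_embedding(A, X, K):
    """Compute the SGC embedding P^K X with the asymmetric
    normalization P = D~^{-1} (A + I)."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                      # add self-loops
    deg = A_tilde.sum(axis=1)                    # degrees w.r.t. A + I
    P = A_tilde / deg[:, None]                   # row-stochastic propagation matrix
    Z = X.copy()
    for _ in range(K):                           # K propagation steps
        Z = P @ Z
    return Z

# toy example: a triangle graph with 2-dimensional node features
A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
X = np.random.default_rng(0).normal(size=(3, 2))
Z = sgc_embedding(A, X, K=2)
```

Because each row of $\mathbf{P}$ sums to one, every embedded row is a convex combination of feature rows; for the fully connected triangle above, every row of `Z` collapses to the feature mean.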
The resulting node embedding $\mathbf{P}^K \mathbf{X}$ is used for node classification by choosing an appropriate loss (e.g., the logistic loss) and minimizing the regularized empirical risk. For binary classification, $\mathbf{W}$ can be replaced by a vector $\mathbf{w}$; the loss equals $L(\mathbf{w}; \mathcal{D}) = \sum_i \ell(\mathbf{e}_i^T \mathbf{P}^K \mathbf{X} \mathbf{w}, y_i) + \frac{\lambda}{2}\|\mathbf{w}\|^2$, where $\ell$ is a convex loss function that is differentiable everywhere and the sum runs over the training set. We also write $\mathbf{w}^\star = \operatorname{argmin}_{\mathbf{w}} L(\mathbf{w}; \mathcal{D})$, where the optimizer is unique whenever $\lambda > 0$.
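As an illustration, the regularized risk above can be minimized with a few Newton steps when $\ell$ is the logistic loss; the objective is strongly convex for $\lambda > 0$, so the minimizer is unique. This is a sketch under our own naming conventions (`train_logistic`, `lam`), not the paper's implementation (the paper uses LBFGS).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_logistic(Z, y, lam, iters=50):
    """Minimize sum_i log(1 + exp(-y_i * z_i^T w)) + (lam/2) ||w||^2
    with Newton steps. Z holds the propagated features P^K X restricted
    to the training rows; y has entries in {-1, +1}."""
    m, d = Z.shape
    w = np.zeros(d)
    for _ in range(iters):
        margins = y * (Z @ w)
        s = sigmoid(-margins)                    # derivative factor of the logistic loss
        grad = -(Z.T @ (y * s)) + lam * w
        curv = s * (1.0 - s)                     # per-sample second derivative
        H = Z.T @ (Z * curv[:, None]) + lam * np.eye(d)
        w = w - np.linalg.solve(H, grad)
    return w

rng = np.random.default_rng(0)
Z = rng.normal(size=(40, 3))
y = np.where(Z @ np.array([1.0, -2.0, 0.5]) > 0, 1.0, -1.0)
w_hat = train_logistic(Z, y, lam=1.0)
```

At the returned `w_hat`, the gradient of the regularized loss is numerically zero, matching the uniqueness claim.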
Motivated by the unlearning approach from Guo et al. (2020) pertaining to unstructured data, we design an unlearning mechanism for graphs that changes the trained model $\mathbf{w}^\star$ to $\mathbf{w}^-$, which represents an approximation of the unique optimizer of $L(\cdot; \mathcal{D}')$. Define $\Delta = \nabla L(\mathbf{w}^\star; \mathcal{D}')$ and denote the Hessian of $L(\cdot; \mathcal{D}')$ at $\mathbf{w}^\star$ by $\mathbf{H}_{\mathbf{w}^\star}$. Our unlearning mechanism operates according to $\mathbf{w}^- = \mathbf{w}^\star - \mathbf{H}_{\mathbf{w}^\star}^{-1} \Delta$. The definition of $\mathbf{w}^-$ matches the one in Guo et al. (2020) when no graph information is available. Note that if $\nabla L(\mathbf{w}^-; \mathcal{D}') = 0$, then $\mathbf{w}^-$ is the unique optimizer of $L(\cdot; \mathcal{D}')$. If $\nabla L(\mathbf{w}^-; \mathcal{D}') \neq 0$, then information about the removed data point remains. One can show that the gradient residual norm $\|\nabla L(\mathbf{w}^-; \mathcal{D}')\|$ determines the error of $\mathbf{w}^-$ when used to approximate the true minimizer of $L(\cdot; \mathcal{D}')$ Guo et al. (2020). Hence, upper bounds on this norm can be used to establish certified removal/unlearning guarantees. More precisely, assume that we have $\|\nabla L(\mathbf{w}^-; \mathcal{D}')\| \le \epsilon'$ for some $\epsilon' > 0$. Furthermore, consider training with the noisy loss $L_{\mathbf{b}}(\mathbf{w}; \mathcal{D}) = L(\mathbf{w}; \mathcal{D}) + \mathbf{b}^T \mathbf{w}$, where $\mathbf{b}$ is drawn randomly according to some distribution. Then one can leverage the following result.
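The one-step update above is easiest to see on a least-squares instance, where the Hessian is constant in $\mathbf{w}$ and the update therefore lands exactly at the retrained optimum (the logistic case is analogous but approximate). This is a sketch with our own function names, assuming the regularizer $\frac{\lambda}{2}\|\mathbf{w}\|^2$ defined earlier.

```python
import numpy as np

def ridge_fit(Z, y, lam):
    """Exact minimizer of (1/2)||Z w - y||^2 + (lam/2)||w||^2."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

def unlearn_newton(w_star, Z_new, y_new, lam):
    """One-step unlearning update w^- = w* - H^{-1} grad L(w*; D').
    Z_new, y_new are the embedding and labels after removal."""
    d = Z_new.shape[1]
    H = Z_new.T @ Z_new + lam * np.eye(d)                       # Hessian of L(.; D')
    grad = Z_new.T @ (Z_new @ w_star - y_new) + lam * w_star    # gradient at w*
    return w_star - np.linalg.solve(H, grad)

rng = np.random.default_rng(1)
Z = rng.normal(size=(30, 4))
y = rng.normal(size=30)
w_star = ridge_fit(Z, y, lam=0.1)
# unlearn the last training point
w_minus = unlearn_newton(w_star, Z[:-1], y[:-1], lam=0.1)
```

For a quadratic loss, `w_minus` coincides with retraining from scratch on the reduced data, which is why least-squares regression (Section 5) needs no loss perturbation.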
Theorem 4.1 (Theorem 3 from Guo et al. (2020)).
Let $A$ be the learning algorithm that returns the unique optimum of the noisy loss $L_{\mathbf{b}}(\mathbf{w}; \mathcal{D})$. Suppose that $\|\nabla L(\mathbf{w}^-; \mathcal{D}')\| \le \epsilon'$ for some computable bound $\epsilon' > 0$, independent of $\mathbf{b}$ and achieved by $M$. If $\mathbf{b} \sim \mathcal{N}(0, (c\,\epsilon'/\epsilon)^2)^d$ with $c = \sqrt{2\log(1.5/\delta)}$, then $M$ satisfies (1) with parameters $(\epsilon, \delta)$ for algorithm $A$ applied to $\mathcal{D}'$.
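The theorem's noise calibration can be computed directly. The sketch below assumes the constant $c = \sqrt{2\log(1.5/\delta)}$ from Guo et al. (2020); the function name is ours.

```python
import numpy as np

def noise_std(eps, delta, eps_prime):
    """Standard deviation of the objective-perturbation noise b
    suggested by Theorem 4.1: b ~ N(0, (c * eps'/eps)^2 I) with
    c = sqrt(2 * log(1.5/delta))."""
    c = np.sqrt(2.0 * np.log(1.5 / delta))
    return c * eps_prime / eps
```

A looser privacy target (larger $\epsilon$) or a tighter gradient-residual bound (smaller $\epsilon'$) both permit less noise, and hence better utility.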
Hence, if we are able to prove that $\|\nabla L(\mathbf{w}^-; \mathcal{D}')\|$ is appropriately bounded in our graph setting as well, then $M$ will ensure certified graph unlearning. Our main technical contribution is to establish such bounds for all three types of graph unlearning. For the analysis, we need the loss function to satisfy the following properties.
Assumption 4.2.
For any dataset $\mathcal{D}$, model $\mathbf{w}$, and data point: (1) $\|\nabla \ell\|$ is bounded; (2) $\ell'$ is Lipschitz; (3) $\ell'$ is bounded; (4) $\ell''$ is Lipschitz; (5) $\ell''$ is bounded.
Assumptions (1)-(3) are also needed for unstructured unlearning of linear classifiers Guo et al. (2020). To account for graph-structured data, we require the additional assumptions (4)-(5) to establish worst-case bounds. These additional assumptions may be avoided when working with data-dependent bounds (Section 5).
In all subsequent derivations, without loss of generality, we assume that the training set comprises the first $m$ nodes of the graph, where $m \le n$. Also, we assume that the unlearned data point corresponds to a fixed node in the training set for node feature and node unlearning; for edge unlearning, we wish to unlearn a fixed edge of the graph. Generalizations to multiple unlearning requests are discussed in Section 5.
4.2 Node feature unlearning for SGC
We start with the simplest type of unlearning, node feature unlearning, for SGCs. In this case, we remove the node feature and label of one node from $\mathcal{D}$, resulting in $\mathcal{D}'$. The matrices $\mathbf{X}', \mathbf{Y}'$ are identical to $\mathbf{X}, \mathbf{Y}$, respectively, except that the row corresponding to the unlearned node is set to zero. Note that in this case, the graph structure remains unchanged.
Theorem 4.3.
Suppose that Assumption 4.2 holds. Then, for the node feature unlearning scenario and the model settings described above, we have
(2) 
A similar conclusion holds when we wish to unlearn the node features of a node that is not in the training set; in this case, we just replace the relevant term by the degree of the corresponding node. This result shows that the norm bound is large if the unlearned node has a large degree, since a large-degree node affects the values of many rows of $\mathbf{P}^K \mathbf{X}$. Our result also demonstrates that the norm bound is independent of $n$, due to the fact that $\mathbf{P}$ is right stochastic. We next provide a sketch of the proof to illustrate the analytical challenges of graph unlearning compared to those of unstructured data unlearning Guo et al. (2020).
Although the graph topology does not change under node feature unlearning, all rows of $\mathbf{P}^K \mathbf{X}$ may potentially change due to graph information propagation. Thus, the original analysis from Guo et al. (2020), which corresponds to the special case $K = 0$, cannot be applied directly. There are two particular challenges. The first is to ensure that the norm of each row of $\mathbf{P}^K \mathbf{X}$ is bounded. We state the following lemma in support of this claim.
Lemma 4.4.
Assume that $\|\mathbf{e}_j^T \mathbf{X}\| \le 1$ for all $j \in [n]$. Then, $\|\mathbf{e}_i^T \mathbf{P}^K \mathbf{X}\| \le 1$ for all $i \in [n]$ and all $K \ge 0$.
In the proof, it is critical to choose $\mathbf{P} = \tilde{\mathbf{D}}^{-1}\tilde{\mathbf{A}}$, since all other choices of degree normalization lead to worse bounds (see Appendix 7.7). The second and more difficult challenge is to bound $\|\Delta\|$. By the definition of $\Delta$, we have
(3) 
When $K = 0$, the third term in the expression above equals zero, in accordance with Guo et al. (2020). Due to graph propagation, we have to further bound the norm of the third term, which is highly non-trivial since the upper bound is not allowed to grow with $n$ or $K$. We first focus on one of the terms in the sum. Using Assumption 4.2, one can bound this term (we suppress the dependency on the constants for simplicity). The key analytical novelty is to exploit the sparsity of $\mathbf{X} - \mathbf{X}'$: it is an all-zero matrix except for the row corresponding to the unlearned node. Thus, we obtain a bound using the Cauchy-Schwarz inequality, (3) in Assumption 4.2, and the fact that $\mathbf{P}^K$ is a (componentwise) nonnegative matrix; summing over the training set then leads to an upper bound that does not grow with $m$. Next, observe that since $\mathbf{P}^T$ is a left stochastic matrix, $\mathbf{P}^T \mathbf{v}$ is a probability vector whenever $\mathbf{v}$ is one. The relevant starting vector is clearly a probability vector, and hence so is its image under $(\mathbf{P}^T)^K$. Since all diagonal entries of $\tilde{\mathbf{D}}^{-1}$ are nonnegative and bounded (given the self-loops for all nodes), the resulting quadratic form is bounded for any probability vector. The final bound depends on the degree of the unlearned node and does not increase with $n$ or $K$. Although node feature unlearning is the simplest case of graph unlearning, our proof sketch illustrates the difficulties associated with bounding the third term of $\|\Delta\|$. Similar but more complicated approaches are needed for the analysis of edge unlearning and node unlearning.

4.3 Edge unlearning for SGC
We next describe the bounds for edge unlearning and highlight the technical issues arising in the analysis of this setting. Here, we remove one edge from $G$, resulting in $\mathcal{D}'$. The matrix $\mathbf{A}'$ is identical to $\mathbf{A}$ except that the entries corresponding to the removed edge are set to zero. Furthermore, $\mathbf{D}'$ is the degree matrix corresponding to $\mathbf{A}'$. Note that the node features and labels remain unchanged.
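The dataset update for edge unlearning can be sketched as follows (function names are our own illustration). A useful structural fact, echoed below in the discussion of Lemma 4.7, is that with row normalization the updated propagation matrix differs from the original one only in the two rows of the edge's end nodes.

```python
import numpy as np

def remove_edge(A, i, j):
    """Adjacency matrix after unlearning edge (i, j) of an undirected graph.
    Node features and labels stay unchanged; only A, the degree matrix,
    and hence the propagation matrix are updated."""
    A_new = A.copy()
    A_new[i, j] = 0.0
    A_new[j, i] = 0.0
    return A_new

def row_normalized(A):
    """Asymmetric normalization P = D~^{-1}(A + I)."""
    A_tilde = A + np.eye(A.shape[0])
    return A_tilde / A_tilde.sum(axis=1, keepdims=True)

A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 1.],
              [1., 1., 0., 0.],
              [0., 1., 0., 0.]])
P_before = row_normalized(A)
P_after = row_normalized(remove_edge(A, 0, 1))
```

Only rows 0 and 1 of `P_after` change; every other row keeps its degree and neighbor entries, which is what makes the perturbation analysis tractable.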
Theorem 4.5.
Suppose that Assumption 4.2 holds. Then, under the edge unlearning scenario and the model settings described above, we have
(4) 
Similar to the node feature unlearning case, Theorem 4.5 still holds when neither of the two end nodes of the removed edge belongs to the training set. Since the updated propagation matrix is still right stochastic, Lemma 4.4 still applies. Thus, we only need to describe how to bound $\|\Delta\|$. Following an approach similar to the previously described one, we obtain a decomposition of $\Delta$. We also need the following technical lemmas.
Lemma 4.6.
For both edge and node unlearning, the following norm bounds hold.
Lemma 4.7.
For edge unlearning, the following bound holds.
Combining the two lemmas and performing some algebraic manipulation, we arrive at the desired result. It is not hard to see that the difference between the updated and original propagation matrices has only two nonzero rows, which correspond to the end nodes of the unlearned edge. One can again construct a left stochastic matrix and a right stochastic matrix that lead to the result of Lemma 4.7.
4.4 Node unlearning for SGC
We now discuss the most difficult case, node unlearning. Here, one node is entirely removed from $\mathcal{D}$, including its node features, label, and edges. This results in $\mathcal{D}'$. The matrices $\mathbf{X}', \mathbf{Y}'$ are defined as for node feature unlearning. The matrix $\mathbf{A}'$ is obtained by replacing the row and column of $\mathbf{A}$ corresponding to the unlearned node by all-zeros (similar changes are introduced in $\mathbf{D}'$). For simplicity, we retain the dimensions of all matrices, as this assumption does not affect the propagation results.
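A minimal sketch of the node unlearning dataset update follows; as in the text, the matrix dimensions are kept fixed and the removed node simply becomes isolated with zeroed features and label (the function name is our own).

```python
import numpy as np

def remove_node(A, X, y, v):
    """Unlearn node v entirely: delete its incident edges (row and
    column v of A) and zero out its feature and label rows."""
    A_new = A.copy()
    A_new[v, :] = 0.0
    A_new[:, v] = 0.0
    X_new = X.copy()
    X_new[v, :] = 0.0
    y_new = y.copy()
    y_new[v] = 0.0
    return A_new, X_new, y_new

A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 1.],
              [1., 1., 0., 0.],
              [0., 1., 0., 0.]])
X = np.arange(8.0).reshape(4, 2)
y = np.array([1.0, -1.0, 1.0, -1.0])
A2, X2, y2 = remove_node(A, X, y, v=1)
```

Unlike edge unlearning, this update changes the degrees of every neighbor of `v`, which is why the degree of the unlearned node appears in the bound of Theorem 4.8.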
Theorem 4.8.
Suppose that Assumption 4.2 holds. Then, for the node unlearning scenario and the model settings described above, we have
(5) 
Again, the main challenge is to bound $\|\Delta\|$. First, we observe that the contribution of the removed node to $L(\cdot; \mathcal{D}')$ vanishes: node $v$ is removed from the graph in $\mathcal{D}'$, and thus its node features do not affect the loss. Similarly to the proof of Lemma 4.6, we first derive a bound for each term of the decomposition. To proceed, we need the following two lemmas.
Lemma 4.9.
For the node unlearning scenario and the model settings described above,
Lemma 4.10.
For the node unlearning scenario and the model settings described above,
These two lemmas give rise to the degree-dependent term in the bound of Theorem 4.8, and the rest of the analysis is similar to that of the previous cases. Lemma 4.10 is rather technical and relies on the following proposition, which exploits the structure of the updated propagation matrix.
Proposition 4.11.
For the node unlearning scenario and the model settings described above,
4.5 Certified graph unlearning in GPR-based models
Our analysis can be extended to Generalized PageRank (GPR)-based models Li et al. (2019). The GPR of a node feature or node embedding matrix $\mathbf{f}$ is defined as $\sum_{k=0}^{K} \theta_k \mathbf{P}^k \mathbf{f}$. The learnable weights $\theta_k$ are called GPR weights, and different choices for the weights lead to different propagation rules Jeh and Widom (2003); Chung (2007). GPR-type propagations include the SGC and APPNP rules as special cases Gasteiger et al. (2019). If we use linearly transformed features $\mathbf{f} = \mathbf{X}\mathbf{W}$ for some weight matrix $\mathbf{W}$, the GPR rule can be rewritten as a linear model over the concatenation of the propagated features $\mathbf{P}^k \mathbf{X}$ for $k$ from $0$ up to $K$, with a learnable weight matrix that combines $\theta$ and $\mathbf{W}$. This represents a linearization of GPR-GNNs Chien et al. (2021) and SIGNs Frasca et al. (2020), simple yet useful models for learning on graphs. For simplicity, we only describe the results for node feature unlearning.

Theorem 4.12.
Suppose that Assumption 4.2 holds and consider the node feature unlearning case. Then, for the GPR-based model described above, we have
(6) 
Note that the resulting bound is the same as the bound in Theorem 4.3. This is a consequence of the normalization factor used in the GPR-based model. Hence, given the same noise level, GPR-based models are more sensitive when trained on the noisy loss $L_{\mathbf{b}}$. Whether the generally better performance of GPR-based models can compensate for this drawback depends on the actual datasets considered.
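The feature concatenation underlying the linearized GPR/SIGN model can be sketched as follows (the function name is ours; a linear classifier trained on the concatenated features plays the role of the combined weight matrix).

```python
import numpy as np

def gpr_features(A, X, K):
    """Concatenate the propagated features [X, PX, ..., P^K X];
    a linear model on top recovers a linearized GPR-GNN / SIGN."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)
    P = A_tilde / A_tilde.sum(axis=1, keepdims=True)   # row-stochastic propagation
    feats, Z = [X], X
    for _ in range(K):
        Z = P @ Z
        feats.append(Z)
    return np.hstack(feats)

A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
X = np.random.default_rng(0).normal(size=(3, 2))
F = gpr_features(A, X, K=2)   # shape (3, (K+1)*F) = (3, 6)
```

Setting the weights on all but the $K$-th block to zero recovers plain SGC, which is why the unlearning analysis carries over.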
5 Empirical Aspects of Certified Graph Unlearning
Logistic and least-squares regression on graphs. For binary logistic regression, the loss equals $\ell(\mathbf{w}^T \mathbf{x}, y) = -\log \sigma(y\, \mathbf{w}^T \mathbf{x})$, where $\sigma$ denotes the sigmoid function. As shown in Guo et al. (2020), assumptions (1)-(3) in Assumption 4.2 are satisfied for this loss. We only need to show that (4) and (5) of Assumption 4.2 hold as well; by standard analysis, we show that the loss satisfies (4) and (5) with explicit constants. For multiclass logistic regression, one can adopt the "one-versus-all-other-classes" strategy, which leads to the same result. For least-squares regression, since the Hessian is independent of $\mathbf{w}$, our approach offers certified graph unlearning even without loss perturbations. See Appendix 7.2 for the complete discussion and derivation.

Sequential unlearning. In practice, multiple users may request unlearning. Hence, it is desirable to have a model that supports sequential unlearning of all types of data points. One can leverage the same proof as in Guo et al. (2020) (induction coupled with the triangle inequality) to show that the resulting gradient residual norm bound equals $q\epsilon'$ at the $q$-th unlearning request, where $\epsilon'$ is the bound for a single instance of certified graph unlearning.
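The linear accumulation of the residual bound suggests a simple bookkeeping scheme: track the accumulated gradient residual norm across requests and retrain once it exceeds the "privacy budget" implied by the chosen noise level, a strategy discussed further below. This sketch assumes the noise rule of Theorem 4.1 with $c = \sqrt{2\log(1.5/\delta)}$; the function names are our own.

```python
import numpy as np

def privacy_budget(sigma, eps, delta):
    """Gradient-residual budget obtained by inverting the noise rule of
    Theorem 4.1: sigma = c * budget / eps with c = sqrt(2 log(1.5/delta)),
    hence budget = sigma * eps / c."""
    c = np.sqrt(2.0 * np.log(1.5 / delta))
    return sigma * eps / c

def process_requests(residual_norms, sigma, eps, delta):
    """Accumulate gradient residual norms over sequential unlearning
    requests and record the request indices at which a full retraining
    is triggered (budget exhausted, accumulator reset)."""
    budget = privacy_budget(sigma, eps, delta)
    total, retrain_at = 0.0, []
    for t, r in enumerate(residual_norms):
        total += r
        if total > budget:
            retrain_at.append(t)   # retrain from scratch at request t
            total = 0.0
    return retrain_at
```

Retraining only on budget exhaustion, rather than after every request, is what yields the complexity savings reported in Section 6.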
Data-dependent bounds. The gradient residual norm bounds derived for the different types of certified graph unlearning contain worst-case constant factors and may be loose in practice. Following Guo et al. (2020), we therefore also examine data-dependent bounds.
Corollary 5.1 (Application of Corollary 1 in Guo et al. (2020)).
For all three graph unlearning scenarios, the gradient residual norm admits a data-dependent bound that can be computed at unlearning time.
Hence, there are two ways to accomplish certified graph unlearning. If we do not allow any retraining, we have to leverage the worst-case bounds in Section 4, calibrated to the expected number of unlearning requests. Importantly, we then also need to constrain the degree of the nodes to be unlearned (i.e., we cannot allow unlearning of hub nodes), for both node feature and node unlearning. Otherwise, we can select the noise standard deviation and the parameters $(\epsilon, \delta)$, and compute the corresponding "privacy budget". Once the accumulated gradient residual norm exceeds this budget, we retrain the model from scratch. Note that this still greatly reduces the time complexity compared to retraining the model for every unlearning request (see Section 6).

6 Experiments
We test our certified graph unlearning methods by empirically verifying our theorems and via comparisons with baseline methods on benchmark datasets.
Settings. We test our methods on benchmark datasets for graph learning, including Cora, Citeseer, and Pubmed Sen et al. (2008); Yang et al. (2016); Fey and Lenssen (2019), the large-scale dataset ogbn-arxiv Hu et al. (2020), and the Amazon co-purchase networks Computers and Photo McAuley et al. (2015); Shchur et al. (2018). We either use the public splits or random splits based on rules similar to the public splits, and focus on node classification. Following Guo et al. (2020), we use LBFGS as the optimizer for all methods due to its high efficiency on strongly convex problems. Unless specified otherwise, we fix the hyperparameters across all experiments and average the results over independent trials with random initializations. Our baseline methods include complete retraining with graph information after each unlearning request (SGC Retraining), complete retraining without graph information after each unlearning request (No Graph Retraining), and Algorithm 2 in Guo et al. (2020). Additional details can be found in Appendix 7.15.
Bounds on the gradient residual norm. The first row of Figure 3 compares the worst-case bounds computed in Section 4 and the data-dependent bounds computed from Corollary 5.1 with the true value of the gradient residual norm (True Norm). For simplicity, we set $\mathbf{b} = \mathbf{0}$ during training. We observe that the worst-case bounds are looser than the data-dependent bounds, and the data-dependent bounds are themselves loose compared to the actual gradient residual norm.
Dependency on node degrees. While an upper bound does not necessarily capture the dependency on each quantity correctly, we show in Figure 4 (a) and (b) that Theorems 4.8 and 4.12 indeed do so. Here, each point corresponds to unlearning one node. We test all nodes in the training set. Our results show that unlearning a large-degree node is more expensive in terms of the privacy budget (i.e., it induces a larger gradient residual norm). The node degree dependency is unclear if one merely examines Corollary 5.1. For other datasets, refer to Appendix 7.15.
Performance of certified graph unlearning methods. The performance of our proposed certified graph unlearning methods, including the time complexity of unlearning and the test accuracy after unlearning, is shown in Figures 3, 4 and 5. The results show that: (1) leveraging graph information is necessary when designing unlearning methods for node classification tasks; (2) our method supports unlearning a large proportion of data points with a small loss in test accuracy; (3) our method is substantially faster than completely retraining the model after each unlearning request; and (4) our method performs robustly regardless of the scale of the datasets. For more results, see Appendix 7.15.
Trade-off among privacy, performance, and time complexity. As indicated in Theorem 4.1, there is a trade-off among privacy, performance, and time complexity. Compared to exact unlearning (i.e., SGC Retraining), allowing approximate unlearning yields a speedup in time with competitive performance. We further examine this trade-off by fixing the noise level and the number of unlearning requests, so that the trade-off is controlled by $(\epsilon, \delta)$. The results for Cora and Citeseer are shown in Figure 5 (c) and (d), respectively. The test accuracy increases when we relax the constraints on $(\epsilon, \delta)$, which agrees with our intuition. Remarkably, we can still obtain performance competitive with SGC Retraining even when we require $\epsilon$ to be small. In contrast, one needs a much larger privacy budget to unlearn even one node or edge by leveraging state-of-the-art DP-GNNs Sajadmanesh et al. (2022); Daigavane et al. (2021) for reasonable performance, although the tested datasets differ. This shows the benefit of our certified graph unlearning method as opposed to both retraining from scratch and DP-GNNs. Unfortunately, the code for these DP-GNNs is not publicly available, which prevents us from testing them on our datasets in a unified manner.

This work was supported by the NSF grant 1956384.
References
 Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pp. 308–318. Cited by: §2.
 Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pp. 141–159. Cited by: §1, §2.
 Towards making systems forget with machine unlearning. In 2015 IEEE Symposium on Security and Privacy, pp. 463–480. Cited by: §1, §2.
 Differentially private empirical risk minimization.. Journal of Machine Learning Research 12 (3). Cited by: §2.
 Graph unlearning. arXiv preprint arXiv:2103.14991. Cited by: §2.

Node feature extraction by selfsupervised multiscale neighborhood prediction
. In International Conference on Learning Representations, External Links: Link Cited by: §2, §7.1.  Adaptive universal generalized pagerank graph neural network. In International Conference on Learning Representations, External Links: Link Cited by: §2, §2, §4.5.
 The heat kernel as the pagerank of a graph. Proceedings of the National Academy of Sciences 104 (50), pp. 19735–19740. Cited by: §4.5.
 Nodelevel differentially private graph neural networks. arXiv preprint arXiv:2111.15521. Cited by: §2, §6.
 Eta prediction with graph neural networks in google maps. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 3767–3776. Cited by: §1.
 Differential privacy. encyclopedia of cryptography and security. Springer Berlin. Cited by: §2, §2, §3.

Fast graph representation learning with pytorch geometric
. arXiv preprint arXiv:1903.02428. Cited by: §6, §7.15.  Sign: scalable inception graph neural networks. arXiv preprint arXiv:2004.11198. Cited by: §2, §4.5.
 Stability of graph scattering transforms. Advances in Neural Information Processing Systems 32. Cited by: §7.1.

Vectornet: encoding hd maps and agent dynamics from vectorized representation.
In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
, pp. 11525–11533. Cited by: §1.  Predict then propagate: graph neural networks meet personalized pagerank. In International Conference on Learning Representations (ICLR), Cited by: §4.5.
 Making ai forget you: data deletion in machine learning. Advances in Neural Information Processing Systems 32. Cited by: §1, §1, §2, §2.
 Certified data removal from machine learning models. In International Conference on Machine Learning, pp. 3832–3842. Cited by: Figure 1, §1, §1, §1, §2, §2, Figure 2, §4.1, §4.1, §4.2, §4.2, §4.2, Theorem 4.1, Corollary 5.1, §5, §5, §5, Figure 4, §6, §7.1, §7.1, §7.15, §7.2, §7.2, §7.2, §7.3, §7.3, §7.3, §7.3.
 Open graph benchmark: datasets for machine learning on graphs. Advances in neural information processing systems 33, pp. 22118–22133. Cited by: §2, §6, §7.15.
 Scaling personalized web search. In Proceedings of the 12th international conference on World Wide Web, pp. 271–279. Cited by: §4.5.
 Residual correlation in graph neural network regression. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 588–598. Cited by: §7.2.
 Semisupervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), Cited by: §2.
 Optimizing generalized pagerank methods for seedexpansion community detection. Advances in Neural Information Processing Systems 32. Cited by: §4.5.
 New benchmarks for learning on non-homophilous graphs. arXiv preprint arXiv:2104.01404. Cited by: §2.
 CopulaGNN: towards integrating representational and correlational roles of graphs in graph neural networks. In International Conference on Learning Representations. Cited by: §7.2.
 Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 142–150. Cited by: §1.
 Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 43–52. Cited by: §6.
 Releasing graph neural networks with differential privacy guarantees. arXiv preprint arXiv:2109.08907. Cited by: §2.
 Spatio-temporal graph scattering transform. In International Conference on Learning Representations. Cited by: §7.1.
 GAP: differentially private graph neural networks with aggregation perturbation. arXiv preprint arXiv:2203.00949. Cited by: §2, §6.
 Remember what you want to forget: algorithms for machine unlearning. Advances in Neural Information Processing Systems 34. Cited by: §1, §1, §2, §2.
 Collective classification in network data. AI magazine 29 (3), pp. 93–93. Cited by: §6.
 Pitfalls of graph neural network evaluation. Relational Representation Learning Workshop, NeurIPS 2018. Cited by: §6.
 UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS medicine 12 (3), pp. e1001779. Cited by: §1.
 Scalable and adaptive graph neural networks with self-label-enhanced training. arXiv preprint arXiv:2104.09376. Cited by: §2.
 YFCC100M: the new data in multimedia research. Communications of the ACM 59 (2), pp. 64–73. Cited by: §1.
 LinkTeller: recovering private edges from graph neural networks via influence analysis. arXiv preprint arXiv:2108.06504. Cited by: §2.
 Simplifying graph convolutional networks. In International Conference on Machine Learning, pp. 6861–6871. Cited by: §1, §2.

Revisiting semi-supervised learning with graph embeddings. In International Conference on Machine Learning, pp. 40–48. Cited by: §6.
Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 974–983. Cited by: §1.
Graph attention multi-layer perceptron. arXiv preprint arXiv:2108.10097. Cited by: §2.
 Graph neural networks and their current applications in bioinformatics. Frontiers in Genetics 12. Cited by: §1.
 Simple spectral graph convolution. In International Conference on Learning Representations. Cited by: §2.
7 Appendix
7.1 Limitations and future research directions
Batch unlearning. In practice, we are likely to require not only sequential unlearning but also batch unlearning: a number of users may request that their data be unlearned within a certain (short) time frame. The approach in Guo et al. (2020) ensures certified removal even in this scenario. Our approach can also be generalized to batch unlearning, but this extension is deferred to future work.
Nonlinear models. Akin to what was described in Guo et al. (2020), we can leverage pretrained (nonlinear) feature extractors or special graph feature transforms to further improve the performance of the overall model. For example, Chien et al. (2022) proposed a node feature extraction method termed GIANT-XRT that greatly improves the performance of simple downstream models such as MLPs and SGC. If a public dataset is never subjected to unlearning, one can pretrain GIANT-XRT on that dataset and use it for subsequent certified graph unlearning. If such a public dataset is unavailable, the node feature extractor has to be made differentially private (DP). In this case, we can either design a DP version of GIANT-XRT or leverage the DPGNN model described in Section 2. By applying Theorem 5 of Guo et al. (2020), the overall model can be shown to guarantee certified graph unlearning, where the parameters $\epsilon$ and $\delta$ now also depend on the DP guarantees of the node feature extractor. There is also another line of work on Graph Scattering Transforms (GSTs) Gama et al. (2019); Pan et al. (2021) that can serve as feature extractors for graph information. Since a GST is a predefined mathematical transform and hence does not require training, it can be easily combined with our approach. The rigorous analysis is deferred to future work.
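To make the combination with a training-free transform concrete, consider the following minimal sketch (assuming numpy; the helper `propagate_features` implements an SGC-style fixed propagation and is an illustrative stand-in, not the GST or GIANT-XRT pipelines). Because the transform has no learned parameters, only the downstream linear model must support certified unlearning:

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetrically normalized adjacency with self-loops (SGC-style)."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    return A_hat / np.sqrt(np.outer(d, d))  # entry (i,j): A_hat[i,j]/sqrt(d_i*d_j)

def propagate_features(A, X, K=2):
    """Training-free graph feature transform: X -> P^K X.
    No parameters are learned here, so no unlearning is needed for this step."""
    P = normalized_adjacency(A)
    Z = X.copy()
    for _ in range(K):
        Z = P @ Z
    return Z

# Toy graph: 4 nodes on a path, with random node features.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 3))
Z = propagate_features(A, X, K=2)  # feed Z to a linear model subject to unlearning
```

Since the normalized propagation matrix has spectral norm at most 1, the propagated features never blow up relative to the raw features.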
One current limitation of our work is that the newly proposed proof techniques do not apply to general graph neural networks with nonlinear activation functions. Nevertheless, our work is a first step towards developing certified graph unlearning approaches for general GNNs.
7.2 Additional discussions
Details on Assumption 4.2. Assumptions (2), (4) and (5) in our model and that of Guo et al. (2020) require Lipschitz conditions with respect to the first argument of the loss $\ell$, but not the second. We also implicitly assume that the second argument (corresponding to labels) does not affect the norm of gradients or Hessians. One example that meets these constraints is the logistic loss: if $\ell(\mathbf{w}^\top \mathbf{x}, y) = -\log(\sigma(y\,\mathbf{w}^\top \mathbf{x}))$ with labels $y \in \{-1, 1\}$ and sigmoid $\sigma$, then all required assumptions hold.
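The label-independence of the gradient norm is easy to confirm numerically; a minimal sketch (assuming numpy, with hypothetical toy data) checks that the gradient norm of the logistic loss is bounded by $\|\mathbf{x}\|$ for either label value:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad(w, x, y):
    """Gradient of l(w^T x, y) = -log(sigmoid(y * w^T x)) with respect to w."""
    return (sigmoid(y * (w @ x)) - 1.0) * y * x

rng = np.random.default_rng(0)
w, x = rng.normal(size=5), rng.normal(size=5)

# The bound ||grad|| <= ||x|| holds regardless of the label y.
norms = [np.linalg.norm(logistic_grad(w, x, y)) for y in (-1.0, 1.0)]
```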
Least-squares and logistic regression on graphs. Paralleling once again the results of Guo et al. (2020), it is clear that our certified graph unlearning mechanism can be used in conjunction with least-squares and logistic regression. For example, node classification can be performed using a logistic loss. The node regression problem described in Ma et al. (2020); Jia and Benson (2020) is related to least-squares regression. In particular, least-squares regression uses the loss $\ell(\mathbf{w}^\top \mathbf{x}, y) = (\mathbf{w}^\top \mathbf{x} - y)^2$. Note that its Hessian is of the form $2\,\mathbf{x}\mathbf{x}^\top$, which does not depend on $y$. Thus, based on the same arguments presented in Guo et al. (2020), our proposed unlearning method offers certified graph unlearning even without loss perturbations.
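For the least-squares case, the label-independent Hessian means the objective is quadratic, so a single Newton-style update from the old model lands exactly on the retrained minimizer. The following numpy sketch illustrates this under a ridge-regularized objective (the helper `fit`, the data, and the regularization constant are all illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

def fit(X, y, lam=1e-3):
    """Exact minimizer of 0.5*||Xw - y||^2 + 0.5*lam*||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 4)), rng.normal(size=20)
lam = 1e-3

w_full = fit(X, y, lam)            # model trained on all data
Xr, yr = X[:-1], y[:-1]            # drop the last training point

# One Newton step on the reduced objective, starting from the old model.
H = Xr.T @ Xr + lam * np.eye(4)                 # Hessian: independent of labels
g = Xr.T @ (Xr @ w_full - yr) + lam * w_full    # gradient at the old model
w_unlearned = w_full - np.linalg.solve(H, g)
```

Because the objective is quadratic, `w_unlearned` coincides with full retraining on the reduced data, which is why no loss perturbation is needed in this special case.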
For binary logistic regression, the loss equals $\ell(\mathbf{w}^\top \mathbf{x}, y) = -\log(\sigma(y\,\mathbf{w}^\top \mathbf{x}))$, where $\sigma(z) = 1/(1+e^{-z})$ denotes the sigmoid function and $y \in \{-1, 1\}$. As shown in Guo et al. (2020), assumptions (1)-(3) in 4.2 are satisfied for this loss. We only need to show that (4) and (5) of 4.2 hold as well. Observe that the derivative of $\ell$ with respect to its first argument equals $\ell'(z, y) = y\,(\sigma(yz) - 1)$. Since the sigmoid function is restricted to lie in $(0, 1)$, $|\ell'|$ is bounded by $1$, which means that our loss satisfies (5) in 4.2 with constant $1$. Based on the Mean Value Theorem, one can show that $\sigma$ is $\frac{1}{4}$-Lipschitz. Using some simple algebra, one can also prove that $|\ell''(z, y)| = \sigma(yz)(1 - \sigma(yz)) \le \frac{1}{4}$. Thus our loss satisfies assumption (4) in 4.2 as well, with constant $\frac{1}{4}$. For multiclass logistic regression, one can adapt the "one-versus-all-other-classes" strategy, which leads to the same result.
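The two derivative bounds above can be confirmed numerically; a minimal sketch (assuming numpy, scanning a grid of inputs for both label values):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-30.0, 30.0, 100001)  # includes z = 0, where l'' peaks
max_abs_lp, max_lpp = 0.0, 0.0
for y in (-1.0, 1.0):
    lp = y * (sigmoid(y * z) - 1.0)                 # l'(z, y)
    lpp = sigmoid(y * z) * (1.0 - sigmoid(y * z))   # l''(z, y)
    max_abs_lp = max(max_abs_lp, np.abs(lp).max())  # should stay <= 1
    max_lpp = max(max_lpp, lpp.max())               # should peak at 1/4
```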
7.3 Proof of Theorem 4.3
Theorem.
Under the node feature unlearning scenario, and . Suppose Assumption 4.2 holds. For and , we have
(7) 
Proof.
Our proof is a nontrivial generalization and extension of the proof in Guo et al. (2020). For completeness, we outline every step of the proof and emphasize the novel approaches used to accommodate our certified graph unlearning scenario.
Let . By Taylor's theorem, there exists some $\eta \in [0, 1]$ such that
(8) 
In (a), we wrote , corresponding to the Hessian at . Equality (b) is due to our choice of and the fact that is the minimizer of . We point out that our choice of is more general than that of Guo et al. (2020): since unlearning one node may affect the entire node embedding , a generalization of is crucial. When (i.e., when no graph topology is included), one recovers the result of Guo et al. (2020) as a special case of our model. In the latter part of the proof, we will see how the graph setting makes the analysis considerably more intricate.
By the Cauchy-Schwarz inequality, we have
(9) 
Below, we bound the two norms on the right-hand side separately. We start with the term . Note that
(10) 
Here, (a) follows from the Cauchy-Schwarz inequality and the Lipschitz condition on in Assumption 4.2. Unlike the analysis in Guo et al. (2020), we are faced with the problem of bounding the term . In Guo et al. (2020) (where ), a simple bound equals , which may be obtained via (3) in Assumption 4.2. However, in our case, the graph propagation requires a more careful examination of this norm, and a simple application of the Cauchy-Schwarz inequality does not suffice, as it would lead to a term involving the operator norm of the feature matrix. In the simple worst case (i.e., when all rows of are identical), this operator norm grows with the number of nodes, which leads to a meaningless bound.
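For completeness, a short calculation (under the illustrative assumption of $n$ identical unit-norm rows $\mathbf{x}$, so that $\mathbf{X} = \mathbf{1}_n \mathbf{x}^\top$) shows why the operator-norm route fails:

```latex
\|\mathbf{X}\|_{\mathrm{op}}
= \max_{\|\mathbf{v}\| = 1} \|\mathbf{X}\mathbf{v}\|
= \max_{\|\mathbf{v}\| = 1} \|\mathbf{1}_n\| \, |\mathbf{x}^\top \mathbf{v}|
= \sqrt{n}\,\|\mathbf{x}\| = \sqrt{n},
```

so any bound routed through $\|\mathbf{X}\|_{\mathrm{op}}$ degrades as $\sqrt{n}$ even though every individual row norm remains equal to $1$.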
By leveraging Lemma 4.4, we can further upper bound (10) according to
(11) 
where (a) follows from Lemma 4.4.
As a result, we arrive at a bound for of the form
(12) 
Next, we bound . Since is strongly convex, we have . For the norm , we have
(13) 
The third term does not appear in Guo et al. (2020), since when , and are identical except for a single row. In the certified graph unlearning scenario, removing even one node feature can change every row of the node embedding matrix. This creates new analytical challenges.
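Concretely, writing $\mathbf{P}$ for a generic propagation matrix (notation assumed here for illustration), if $\mathbf{X}$ and $\mathbf{X}'$ differ only in row $i$, then

```latex
\mathbf{P}\mathbf{X} - \mathbf{P}\mathbf{X}'
= \mathbf{P}(\mathbf{X} - \mathbf{X}')
= \mathbf{P}\,\mathbf{e}_i (\mathbf{x}_i - \mathbf{x}_i')^\top ,
```

whose $j$-th row equals $P_{ji}(\mathbf{x}_i - \mathbf{x}_i')^\top$ and is hence nonzero for every node $j$ with $P_{ji} \neq 0$. In a dense or well-connected graph this is every row, which is exactly why the single-row argument of Guo et al. (2020) no longer applies.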