1 Introduction
Many text classification tasks naturally occur in the form of graphs where nodes represent text documents and edges are task-specific, such as articles citing each other or health records belonging to the same patient. When learning node representations and predicting their categories, models benefit from exploiting information from the neighborhood of each node, as shown in graph neural networks, and graph convolutional networks (GCNs) in particular (Kipf.Welling.2017.ICLR), making them superior to other models (Xu.et.al.2019.ICLR; DeCao.et.al.2019.NAACL).

While GCNs are powerful for a variety of NLP problems, like other neural models they are prone to privacy attacks. Adversaries with extensive background knowledge and computational power might reveal sensitive information about the training data from the model, such as reconstructing information about the original classes of a model (hitaj2017deep) or even auditing membership of an individual's data in a model (song2019auditing). In order to preserve privacy for graph NLP data, models have to protect both the textual nodes and the graph structure, as both sources carry potentially sensitive information.
Privacy-preserving techniques, such as differential privacy (DP) (Dwork.Roth.2013), prevent information leaks by adding 'just enough' noise during model training while attaining acceptable performance. Recent approaches to DP in neural models attempt to balance this trade-off between noise and utility, with differentially private stochastic gradient descent (SGD-DP) (Abadi.et.al.2016.SIGSAC) being a prominent example. However, SGD-DP comes with design choices specific to i.i.d. data, such as batches and 'lots' (see §4.2), and its suitability for graph neural networks remains an open and non-trivial question.

In this work, we ask what privacy guarantees and performance can be provided by differentially private stochastic gradient descent and its variants for GCNs. First, we are interested in how models' accuracies differ under varying privacy 'budgets'. Second, and more importantly, we want to understand to what extent the training data size affects private and non-private performance, and whether simply adding more data would be a remedy for the expected performance drop of DP models. We tackle these questions by adapting SGD-DP (Abadi.et.al.2016.SIGSAC) to GCNs as well as proposing a differentially private version of Adam (kingma2017adam), Adam-DP. We hypothesize that Adam's advantages, i.e. fewer training epochs, would lead to a better privacy/utility trade-off as opposed to SGD-DP.
We conduct experiments on five datasets in two languages (English and Slovak) covering a variety of NLP tasks, including research article classification in citation networks, Reddit post classification, and user interest classification in social networks, where the latter inherently carry potentially sensitive information calling for privacy-preserving models. Our main contributions are twofold. First, we show that DP training can be applied to the case of GCNs despite the challenges of non-i.i.d. data. Second, we show that more sophisticated text representations can mitigate the performance drop due to DP noise, resulting in a relative performance of 90% of the non-private variant, while keeping strict privacy. To the best of our knowledge, this is the first study that brings differentially private gradient-based training to graph neural networks.
2 Theoretical background in DP
As DP does not belong to the mainstream methods in NLP, we briefly outline its principles here and present the basic terminology from the NLP perspective. Foundations can be found in (Dwork.Roth.2013; Desfontaines.Pejo.2020).
The main idea of DP is that if we query a database of $n$ individuals, the result of the query will be almost indistinguishable from the result of querying a database of $n - 1$ individuals, thus protecting each single individual's privacy to a certain degree. The difference of results obtained from querying any two databases that differ in one individual has a probabilistic interpretation.
A dataset $X = (x_1, \dots, x_n)$ consists of $n$ documents, where each document $x_i$ is associated with an individual whose privacy we want to preserve. (Footnote 1: A document can be any arbitrary natural language text, such as a letter, medical record, tweet, personal plain-text passwords, or a paper review.) Let $X'$ differ from $X$ by one document, so either $|X'| = n - 1$, or $X'$ equals $X$ with the $i$-th document replaced. $X$ and $X'$ are called neighboring datasets.
Let $f$ be a function applied to a dataset $X$; for example, a function returning the average document length or the number of documents in the dataset. This function is also called a query, which is not to be confused with queries in NLP, such as search queries. (Footnote 2: In general, the query output is multi-dimensional; here we keep it scalar for the sake of simplicity.)
In DP, this query function $f$ is a continuous random variable associated with a probability density $p$. Once the function is applied to the dataset, the result is a single draw from this probability distribution. This process is also known as a randomized algorithm. For example, a randomized algorithm for the average document length can be a Laplace density $f(X) \sim \mathrm{Lap}(\mu, b)$, where $\mu$ is the true average document length and $b$ is the scale (the 'noisiness' parameter). By applying this query to $X$, we obtain $y$, a single draw from this distribution.

Now we can formalize the backbone idea of DP. Having two neighboring datasets $X$, $X'$, the privacy loss $\mathcal{L}$ is defined as
$$\mathcal{L} = \ln \frac{p(f(X) = y)}{p(f(X') = y)} \qquad (1)$$
DP bounds this privacy loss by design. Given $\varepsilon$ (the privacy budget hyperparameter), all values of $y$, and all neighboring datasets $X$ and $X'$, we must ensure that
$$\left| \ln \frac{p(f(X) = y)}{p(f(X') = y)} \right| \le \varepsilon \qquad (2)$$
In other words, the allowed privacy loss of any two neighboring datasets is upper-bounded by $\varepsilon$, also denoted as $(\varepsilon)$-DP. (Footnote 3: $(\varepsilon)$-DP is a simplification of the more general $(\varepsilon, \delta)$-DP, where $\delta$ is a negligible constant allowing relaxation of the privacy bounds (Dwork.Roth.2013, p. 18).) The privacy budget $\varepsilon$ controls the amount of preserved privacy. If $\varepsilon = 0$, the query outputs of any two datasets become indistinguishable, which guarantees almost perfect privacy but provides very little utility. Similarly, higher $\varepsilon$ values provide less privacy but better utility. Finding the sweet spot is thus the main challenge in determining the privacy budget for a particular application (Lee.Clifton.2011.ISC; Hsu.et.al.2014.CSFS). An important feature of DP is that once we obtain the result of the query $y$, any further computations with $y$ cannot weaken the privacy guaranteed by $\varepsilon$ and $\delta$.
The desired behavior of the randomized algorithm is therefore adding as little noise as possible to maximize utility while keeping the privacy guarantees given by Eq. 2. The amount of noise is determined for each particular setup by the sensitivity $\Delta f$ of the query $f$, such that for any neighboring datasets $X, X'$ we have
$$|f(X) - f(X')| \le \Delta f \qquad (3)$$
The sensitivity corresponds to the 'worst case' range of a particular query $f$, i.e., what is the maximum impact of changing one individual. The larger the sensitivity, the more noise must be added to fulfill the privacy requirements of $\varepsilon$ (Eq. 2). For example, in order to be $(\varepsilon)$-DP, the Laplace mechanism must add noise with scale $b = \Delta f / \varepsilon$ (Dwork.Roth.2013, p. 32). As the query sensitivity directly influences the required amount of noise, it is desirable to design queries with low sensitivity.
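As an illustration, the Laplace mechanism can be sketched in a few lines. This is a minimal sketch, not our training setup: the corpus, the 100-token length cap, and the resulting sensitivity of $100/n$ are hypothetical choices made for the example.

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release a scalar query result with (epsilon)-DP by adding
    Laplace noise of scale b = sensitivity / epsilon."""
    b = sensitivity / epsilon
    u = rng.random() - 0.5                      # uniform in [-0.5, 0.5)
    # Inverse-CDF sampling of Lap(0, b)
    noise = -b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_value + noise

# Hypothetical query: average document length over a 4-document corpus,
# with lengths clipped to at most 100 tokens, so replacing one document
# changes the average by at most 100 / n = 25 (the sensitivity).
doc_lengths = [42, 87, 13, 58]
true_avg = sum(doc_lengths) / len(doc_lengths)          # 50.0
private_avg = laplace_mechanism(true_avg, sensitivity=25.0, epsilon=1.0,
                                rng=random.Random(0))
```

Each call releases one noisy draw; repeated queries on the same data would consume additional privacy budget, as discussed next.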
The mechanisms described so far consider a scenario where we apply the query only once. To ensure DP with multiple queries (Footnote 4: Queries might be different, for example querying the average document length first and then querying the number of documents in the dataset.) on the same datasets, proportionally more noise has to be added.
3 Related work
A wide range of NLP tasks have utilized graph neural networks, specifically graph convolutional networks (GCNs), including text summarization (xu2020discourse), machine translation (marcheggiani2018exploiting), and semantic role labeling (zheng2020srlgrn). Recent end-to-end approaches combine pre-trained transformer models with GNNs to learn graph representations for syntactic trees (sachan2020syntax). Rahimi.et.al.2018.ACL demonstrated the strength of GCNs on predicting the geolocation of Twitter users, where nodes are represented by users' tweets and edges by social connections, i.e. mentions of other Twitter users. Their approach shows that a user's neighborhood delivers extra information that improves the model's performance. However, if we want to protect user-level privacy, the overall social graph has to be taken into account.

Several recent works in the NLP area deal with privacy using arbitrary definitions. Li.et.al.2018.ACLShort propose an adversarial-based approach to learning latent text representations for sentiment analysis and POS tagging. Although their privacy-preserving model performs on par with non-private models, they admit the lack of formal privacy guarantees. Similarly, Coavoux.et.al.2018.EMNLP train an adversarial model to predict private information on sentiment analysis and topic classification. The adversary's model performance served as a proxy for privacy strength but comes with no formal privacy guarantees. Similar potential privacy weaknesses can be found in a recent work by Abdalla.et.al.2020.JAMIA, who replaced personal health information with semantically similar words while keeping acceptable accuracy on downstream classification tasks.

Abadi.et.al.2016.SIGSAC pioneered the connection of DP and deep learning by bounding the query sensitivity using gradient clipping, as well as formally proving the overall privacy bounds by introducing the 'moments accountant' mechanism (see §4.3). While originally tested on image recognition, they inspired subsequent work in language modeling using LSTMs (McMahan.et.al.2018.ICLR).

General DP over graphs still poses substantial challenges preventing its practical use (Zhu.et.al.2017.book, Sec. 4.4). Two very recent approaches to local DP, that is, adding noise to each example before passing it to graph model training, transform the latent representation of the input into a binary vector, leading to reduced query sensitivity (Sajadmanesh.GaticaPerez.2020.arXiv; Lyu.et.al.2020.SIGIR).

4 Model
4.1 GCN Overview
We employ the Graph Convolutional Network (GCN) architecture (Kipf.Welling.2017.ICLR) for enabling DP in the domain of graph-based NLP. GCN is a common and comparatively simple GNN variant, which allows us to focus primarily on the DP analysis and results and permits a clear comparison of the DP and non-DP models.
Let $G = (V, E)$ model our graph data, where each node $v \in V$ contains a feature vector of dimensionality $d$. GCN aims to learn a node representation by integrating information from each node's neighborhood. The features of each neighboring node of $v$ pass through a 'message passing function' (usually a transformation by a weight matrix $W$) and are then aggregated and combined with the current state of the node $h_v^{(l)}$ to form the next state $h_v^{(l+1)}$. Edges are represented using an adjacency matrix $A \in \mathbb{R}^{N \times N}$. $A$ is then multiplied by the matrix $H^{(l)} \in \mathbb{R}^{N \times h}$, $h$ being the hidden dimension, as well as the weight matrix $W^{(l)}$ responsible for message passing. Additional tweaks by Kipf.Welling.2017.ICLR include adding the identity matrix to $A$ to include self-loops in the computation, $\hat{A} = A + I$, as well as normalizing the matrix $\hat{A}$ by the degree matrix $\hat{D}$, specifically using a symmetric normalization $\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$. This results in the following equation for calculating the next state of the GCN for a given layer $l$, passing through a non-linearity function $\sigma$:

$$H^{(l+1)} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \qquad (4)$$
The final layer states for each node are then used for node-level classification, given output labels.
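The propagation rule of Eq. 4 can be sketched directly with matrix operations. This is a minimal NumPy sketch with ReLU as the non-linearity (the one used in our implementation, see Appendix A); the toy graph and identity weights are illustrative only.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gcn_layer(A, H, W):
    """One GCN propagation step (Eq. 4):
    H' = relu(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))    # symmetric normalization
    return relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# Toy graph: 3 nodes, a single edge (0, 1), node 2 isolated;
# 2-dimensional one-hot-like features and an identity weight matrix.
A = np.array([[0., 1., 0.],
              [1., 0., 0.],
              [0., 0., 0.]])
H = np.eye(3, 2)
W = np.eye(2)
H1 = gcn_layer(A, H, W)   # nodes 0 and 1 now mix each other's features
```

Stacking two such layers, as in our model, lets information travel across two-hop neighborhoods before the final classification layer.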
4.2 SGD-DP and Adam-DP
SGD-DP (Abadi.et.al.2016.SIGSAC) modifies the standard stochastic gradient descent algorithm to be differentially private. The DP 'query' is the gradient computation at time step $t$: $g_t(x_i) = \nabla_{\theta_t} \mathcal{L}(\theta_t, x_i)$, for each $x_i$ in the training set. To ensure DP, the output of this query is distorted by random noise proportional to the sensitivity of the query, which is the range of values the gradient can take. As the gradient range is unconstrained, possibly leading to extremely large noise, Abadi.et.al.2016.SIGSAC clip the gradient vector by its $\ell_2$ norm, replacing each vector $g$ with $\bar{g} = g / \max\left(1, \frac{\lVert g \rVert_2}{C}\right)$, $C$ being the clipping threshold. This clipped gradient is altered by a draw from a Gaussian: $\mathcal{N}(0, \sigma^2 C^2 \mathbf{I})$.

Instead of running this process on individual examples, Abadi.et.al.2016.SIGSAC actually break up the training set into 'lots' of size $L$, a concept slightly separate from that of 'batches'. Whereas the gradient computation is performed in batches, SGD-DP groups several batches together into lots for the DP calculation itself, which consists of adding noise, taking the average over a lot, and performing the descent $\theta_{t+1} = \theta_t - \eta_t \tilde{g}_t$.
Incorporating this concept, we obtain the overall core mechanism of SGD-DP:

$$\theta_{t+1} = \theta_t - \eta_t \cdot \frac{1}{L} \left( \sum_i \bar{g}_t(x_i) + \mathcal{N}(0, \sigma^2 C^2 \mathbf{I}) \right) \qquad (5)$$
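One SGD-DP update can be sketched as follows. This is a minimal sketch operating on precomputed per-example gradients; function and variable names are our own, not from a DP library.

```python
import numpy as np

def sgd_dp_step(theta, per_example_grads, lr, C, sigma, rng):
    """One SGD-DP update over a lot (Eq. 5): clip each per-example gradient
    to L2 norm at most C, add Gaussian noise N(0, sigma^2 C^2 I) to the sum,
    average over the lot, and take a descent step."""
    clipped = [g / max(1.0, np.linalg.norm(g) / C) for g in per_example_grads]
    noisy = np.sum(clipped, axis=0) + rng.normal(0.0, sigma * C, size=theta.shape)
    return theta - lr * noisy / len(per_example_grads)

# Toy check with sigma = 0 (no noise): a gradient of norm 5 is clipped to norm 1.
theta0 = np.zeros(2)
grads = [np.array([3.0, 4.0])]
theta1 = sgd_dp_step(theta0, grads, lr=1.0, C=1.0, sigma=0.0,
                     rng=np.random.default_rng(0))
```

Note that clipping bounds the sensitivity of the gradient query by $C$, which is what makes the Gaussian noise calibration of Eq. 5 valid.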
In this paper, we also develop a DP version of Adam (kingma2017adam), a widely used default optimizer in NLP (ruder2016overview). As Adam shares the core principle of gradient computation with SGD, to make it differentially private we add noise to the gradient following Eq. 5, prior to Adam's moment estimates and parameter update.
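A sketch of the resulting Adam-DP update is below. It assumes the incoming gradient has already been clipped and noised exactly as in Eq. 5, so the privacy analysis is unchanged; the standard Adam defaults shown are assumptions, not tuned values.

```python
import numpy as np

def adam_dp_step(theta, m, v, t, noisy_grad, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam-DP: run Adam's moment estimates on the privatized gradient
    (already clipped and noised per Eq. 5), then update the parameters."""
    m = b1 * m + (1 - b1) * noisy_grad          # first-moment estimate
    v = b2 * v + (1 - b2) * noisy_grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)                   # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# First step on a privatized gradient of 2.0: the update magnitude is ~lr.
theta, m, v = adam_dp_step(np.array([0.0]), np.array([0.0]), np.array([0.0]),
                           t=1, noisy_grad=np.array([2.0]))
```

Because the noise is injected before the moment estimates, Adam's adaptive rescaling operates on the noisy gradient just as vanilla Adam operates on the clean one.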
Despite their conceptual simplicity, both SGD-DP and Adam-DP have to determine the amount of noise required to guarantee privacy. Abadi.et.al.2016.SIGSAC proposed the moments accountant for this purpose, which we present in detail below.
4.3 Moments accountant in detail
SGD-DP introduces two features, namely (1) a reverse computation of the privacy budget, and (2) tighter bounds on the composition of multiple queries. First, a common DP methodology is to predetermine the privacy budget $(\varepsilon, \delta)$ and add random noise according to these parameters. In contrast, SGD-DP does the opposite: Given a predefined amount of noise $\sigma$ (a hyperparameter of the algorithm), the privacy budget $(\varepsilon, \delta)$ is computed retrospectively. Second, generally in DP, with multiple executions of a 'query' (i.e. a single gradient computation in SGD), we can simply sum up the $\varepsilon$ values associated with each query. (Footnote 5: Such that for $k$ queries, each with privacy budget $\varepsilon$, the overall algorithm is $(k\varepsilon)$-DP.) However, this naive composition leads to a very large privacy budget, as it assumes that each query used up the maximum given privacy budget.
The simplest bound on a non-negative continuous random variable $Z$, the Markov inequality, takes into account the expectation $\mathbb{E}[Z]$, such that for $a > 0$:

$$\Pr[Z \ge a] \le \frac{\mathbb{E}[Z]}{a} \qquad (6)$$
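A quick Monte-Carlo sanity check of Eq. 6, using an exponential random variable (our choice for illustration) whose true tail probability $e^{-3} \approx 0.05$ sits well below the Markov bound of $1/3$:

```python
import random

rng = random.Random(0)
# A non-negative random variable: Z ~ Exponential(mean 1), so E[Z] = 1.
samples = [rng.expovariate(1.0) for _ in range(100_000)]
a = 3.0
empirical_tail = sum(z >= a for z in samples) / len(samples)   # ~ e^{-3}
markov_bound = (sum(samples) / len(samples)) / a               # ~ 1/3
```

The bound is loose here; the Chernoff refinement below tightens it by applying Markov to $\exp(\lambda Z)$ instead of $Z$.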
Using the Chernoff bound, a variant of the Markov inequality, on the privacy loss treated as a random variable (Eq. 1), we obtain the following formulation by multiplying by $\lambda$, exponentiating, and applying Eq. 6:

$$\Pr[\mathcal{L} \ge \varepsilon] \le \frac{\mathbb{E}[\exp(\lambda \mathcal{L})]}{\exp(\lambda \varepsilon)} \qquad (7)$$
where $\mathbb{E}[\exp(\lambda \mathcal{L})]$ is also known as the moment-generating function.
The overall privacy loss $\mathcal{L}$ is composed of a sequence of consecutive randomized algorithms $\mathcal{L}_1, \dots, \mathcal{L}_k$ (see §2). Since all $\mathcal{L}_i$ are independent, the numerator in Eq. 7 becomes a product of all $\mathbb{E}[\exp(\lambda \mathcal{L}_i)]$. Converting to log form and simplifying, we obtain

$$\ln \Pr[\mathcal{L} \ge \varepsilon] \le \sum_i \ln \mathbb{E}[\exp(\lambda \mathcal{L}_i)] - \lambda \varepsilon \qquad (8)$$
Note the moment-generating function inside the logarithmic expression. Since the above bound is valid for any moment $\lambda$ of the privacy loss random variable, we can go through several moments and find the one that gives us the lowest bound.
Since the left-hand side of Eq. 8 is by definition the $\ln \delta$ value, the overall mechanism is $(\varepsilon, \delta)$-DP for $\delta = \exp\left(\sum_i \ln \mathbb{E}[\exp(\lambda \mathcal{L}_i)] - \lambda \varepsilon\right)$. The corresponding $\varepsilon$ value can be found by rearranging Eq. 8:

$$\varepsilon = \frac{1}{\lambda}\left(\sum_i \ln \mathbb{E}[\exp(\lambda \mathcal{L}_i)] + \ln \frac{1}{\delta}\right) \qquad (9)$$
The overall SGD-DP algorithm, given the right noise scale $\sigma$ and a clipping threshold $C$, is thus shown to be differentially private using this accounting method, with $q = L/N$ representing the ratio between the lot size $L$ and dataset size $N$, and $T$ being the total number of training steps. See (Abadi.et.al.2016.SIGSAC) for further details.
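The tail-bound search of Eq. 9 can be sketched numerically. This is a simplified sketch for the plain Gaussian mechanism with sensitivity 1, for which the log moment is $\alpha(\lambda) = \lambda(\lambda+1)/(2\sigma^2)$; it deliberately ignores the subsampling amplification that the full moments accountant exploits, so it overestimates $\varepsilon$ for $q < 1$.

```python
import math

def alpha_gaussian(lmbda, sigma):
    """Log moment bound for one Gaussian-mechanism query with noise
    multiplier sigma (sensitivity 1): lambda * (lambda + 1) / (2 sigma^2)."""
    return lmbda * (lmbda + 1) / (2.0 * sigma ** 2)

def epsilon_for(sigma, steps, delta, max_lambda=64):
    """Tail bound of Eq. 9 with moment composition over `steps` queries:
    eps = min over lambda of (steps * alpha(lambda) + ln(1/delta)) / lambda."""
    return min(
        (steps * alpha_gaussian(l, sigma) + math.log(1.0 / delta)) / l
        for l in range(1, max_lambda + 1)
    )
```

For example, `epsilon_for(sigma, steps, delta)` shrinks as the noise multiplier grows and accumulates as training runs longer, mirroring the noise/privacy trade-off discussed above.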
5 Experiments
5.1 Datasets
We are interested in a text classification use-case where documents are connected via undirected edges, forming a graph. While structurally limiting, this definition covers a whole range of applications. We perform experiments on five single-label multi-class classification tasks. The Cora, CiteSeer, and PubMed datasets (Yang.et.al.2016.ICML; Sen.et.al.2008.AIMag; McCallum.et.al.2000.IR; Giles.et.al.1998.DL) are widely used citation networks of research papers, where citing paper $j$ from paper $i$ creates an edge $(i, j)$. The task is to predict the category of the particular paper.
The Reddit dataset (Hamilton.et.al.2017.NeurIPS) treats the 'original post' as a graph node and connects two posts by an edge if any user commented on both. Given the large size of this dataset (230k nodes; all posts from Sept. 2014), which causes severe computational challenges, we sub-sampled 10% of the posts (only a few days of Sept. 2014). The gold label corresponds to one of the top Reddit communities to which the post belongs.
Unlike the previous English datasets, the Pokec dataset (takac2012data; snapnets) contains an anonymized social network in Slovak. Nodes represent users and edges their friendship relations. User-level information contains many attributes in natural language (e.g., 'music', 'perfect evening'). We set up the following binary task: Given the textual attributes, predict whether a user prefers dogs or cats. (Footnote 6: Perozzi.Skiena.2015.WWW used the Pokec data for user profiling, namely age prediction for ad targeting. We find such a use case unethical. In contrast, our classification task is harmless, yet serves the demonstration purposes of text classification on social network data well.) Pokec's personal information, including friendship connections, shows the importance of privacy-preserving methods to protect this potentially sensitive information. For the preparation details see Appendix B.
The four English datasets, adapted from previous work, are only available in their encoded form. For the citation networks, each document is represented by a bag-of-words encoding. The Reddit dataset combines GloVe vectors (penningtonetal2014glove) averaged over the post and its comments. Only the Pokec dataset is available as raw texts, so we opted for multilingual BERT (devlin2018bert) and averaged all contextualized word embeddings over each user's textual attributes. (Footnote 7: Sentence-BERT (reimers2019sentence) resulted in lower performance. Users fill in the attributes such that the text resembles a list of keywords rather than actual discourse.) The variety of languages, sizes, and input encodings allows us to compare non-private and private GCNs under different conditions. Table 1 summarizes data sizes and numbers of classes.
Table 1: Dataset statistics.

Dataset   Classes  Test size  Training size
CiteSeer  6        1,000      1,827
Cora      7        1,000      1,208
PubMed    3        1,000      18,217
Pokec     2        2,000      16,000
Reddit    41       5,643      15,252
5.2 Experiment setup
Experiment A
Vanilla GCN on full datasets: The aim is to train the GCN with access to the largest training data possible, but without any privacy mechanism.
Experiment B
Learning curves on the vanilla GCN: Evaluating the influence of reduced training data on performance, without privacy, allowing for a comparison with the DP settings below.
Experiment C
GCN with DP: We evaluate performance on the full datasets while varying the privacy budget $\varepsilon$.
Experiment D
GCN with DP: Varying both data size and privacy budget $\varepsilon$. This allows us to see the effects on performance of both adding noise and reducing training data.
5.2.1 Implementation details
As the privacy parameter $\delta$ is typically kept 'cryptographically small' (Dwork.Roth.2013) and, unlike the main privacy budget $\varepsilon$, has a limited impact on accuracy (Abadi.et.al.2016.SIGSAC, Fig. 4), we fixed its value to a small constant for all experiments. The clipping threshold is set to $C = 1$. We validated our PyTorch implementation by fully reproducing the MNIST results from Abadi.et.al.2016.SIGSAC. We perform all experiments five times with different random seeds and report the mean and standard deviation. Early stopping is determined using the validation set. See Appendix A for details on other hyperparameters.
6 Results and discussion
Table 2: Non-private (non-DP) and differentially private (DP) results (F1) for each dataset and privacy budget ε.

Dataset    Rnd.  Maj.  SGD   Adam    ε    SGD-DP  Adam-DP
CiteSeer   0.17  0.18  0.77  0.79    2    0.36    0.14
                                     5    0.36    0.14
                                     10   0.36    0.15
                                     137  0.36    0.24
Cora       0.15  0.32  0.77  0.88    2    0.39    0.13
                                     5    0.39    0.13
                                     10   0.39    0.14
                                     137  0.40    0.28
PubMed     0.31  0.40  0.49  0.79    2    0.38    0.36
                                     5    0.38    0.36
                                     10   0.38    0.36
                                     137  0.38    0.41
Pokec      0.50  0.50  0.83  0.83    2    0.75    0.51
                                     5    0.75    0.56
                                     10   0.75    0.62
                                     137  0.75    0.76
Reddit     0.03  0.15  0.68  0.88    2    0.46    0.02
                                     5    0.46    0.03
                                     10   0.46    0.05
                                     137  0.46    0.25
Experiment A
Table 2 shows the results on the left-hand side under 'Non-DP'. When trained with SGD, both the Cora and CiteSeer datasets achieve fairly good results at 0.77 F1 score each, both being relatively small graphs. Much lower are the PubMed results at 0.49, possibly because this dataset consists of a much larger graph. Reddit shows higher performance at 0.68, which could partly be due to its input representations being GloVe embeddings, as opposed to binary-valued word vectors. Finally, Pokec shows the best result at 0.83, possibly because of more expressive representations (BERT) and a simpler task (binary classification).
In comparison, and in line with previous research (ruder2016overview), Adam outperforms SGD in all cases, with Pokec showing the smallest gap (0.826 and 0.832 for SGD and Adam, respectively).
Experiment B
Starting with the SGD results in Figure 1, we notice three main patterns:

1. Clear improvement as training data increases (e.g. CiteSeer, with 0.70 F1 score at 10% vs. 0.77 at 100%).

2. The exact opposite pattern, with PubMed dropping from 0.57 at 10% to 0.49 at 100%, with a similar pattern for Pokec.

3. Early saturation of results for Reddit and Cora (at 20–30% for Reddit with approximately 0.69 F1 score, 50% for Cora at a score of 0.77), where results do not increase beyond a certain point.
Regarding points (2) and (3), we speculate that with a larger training size, a vanilla GCN has a harder time learning the more complex input representations. In particular, for PubMed and Pokec, increasing the number of training nodes only partially increases the graph degree, so the model fails to learn expressive node representations when limited information from the node's neighborhood is available. By contrast, the Reddit graph degree grows much faster, which advantages GCNs.
Comparing each of these patterns for Adam, we see that for (1), datasets also improve (CiteSeer); (2) shows a very similar decrease for Pokec, but a mostly constant score throughout for PubMed (at ~0.80); while (3) shows continued improvement where SGD saturated for Cora and Reddit, suggesting that Adam allows breaking through the learning bottleneck.
Experiment C
The results of Experiment C can be seen in Table 2, comparing different DP noise values as well as results with and without DP. We note four main patterns in this experiment:

1. Interestingly, SGD-DP results stay the same, regardless of the noise value added.

2. Adam-DP results are far worse than SGD-DP, but increase with lesser privacy (less noise).

3. SGD-DP results almost always outperform the baselines (except for PubMed).

4. We see bigger drops in performance in the DP setting for datasets with simpler input representations.
SGD-DP vs. Adam-DP
Points (1) and (2) are unexpected results. One explanation for the former could be that gradients in SGD are already quite noisy, which may even help the model generalize, so the additional DP noise does not pose much difficulty beyond a certain drop in performance. Regarding Adam-DP, we see that results are far worse and do increase with lesser privacy (e.g., 0.51 F1 with ε = 2 vs. 0.76 F1 with ε = 137 for Pokec). Several reasons can account for this, one being that Adam has more required hyperparameters, which could be sensitive with respect to the DP setting.
Differences in input features
For points (3) and (4), we see varying degrees of performance drops, depending on the dataset. Datasets with simpler input features can have results drop by more than half in comparison to the non-DP implementation, although they still outperform a majority baseline (e.g. 0.77 non-DP vs. 0.36 DP vs. 0.18 Maj. for CiteSeer). An exception to this is PubMed, whose DP results are slightly below the majority baseline (0.38 vs. 0.40). The drop in results from non-DP to DP is not as sharp (0.49 to 0.38), most probably explained by the fact that the non-DP model was not able to achieve good performance in the first place.
Reddit shows a smaller drop from non-DP to DP and significantly outperforms the majority baseline (0.68 and 0.46 vs. 0.15, respectively). Finally, the best-performing SGD-DP model was on Pokec, with a relatively small drop from the non-DP to the DP result (0.83 and 0.75 F1 score, respectively). Hence, CiteSeer, Cora and PubMed, all using one-hot textual representations, show fairly low results under DP. Slightly better is Reddit (GloVe), while Pokec is by far the best (BERT).
Experiment D
Finally, Figure 2 shows the DP results both for varying $\varepsilon$ and for different training subsamples (25%, 50%, 75% and the full 100%). Overall, some parallels and contrasts can be drawn with the learning curves from Experiment B.
Datasets which behave similarly for the two experiments are CiteSeer and Cora, where the former improves with more training data and the latter saturates at a certain point. PubMed, Reddit and Pokec show a contrasting pattern, with both PubMed and Reddit staying about the same for all subsamples, apart from the 100% setting, with a slight drop for PubMed and slight increase for Reddit. In experiment B, both had more gradual learning curves, with a slow decline for PubMed and a quick plateau for Reddit. Similarly, Pokec here shows the best results with the full data, in contrast to the gradual decline in the nonprivate setting.
We can see that the patterns for learning curves are not the same in the DP and nonDP setting. While increasing training data may help to some extent, it does not act as a solution to the general drop in performance caused by adding DP noise.
Summary
The main observations of these experiments can be summarized as follows:

1. The network learns useful representations in the SGD-DP setting, outperforming the majority baselines.

2. SGD-DP is fairly robust to noise for these datasets and settings, even for strict privacy budgets.

3. While superior in the non-private setting, Adam does not perform well in its DP variant.

4. More complex representations are better suited for the DP setting, showing a smaller performance drop.

5. Patterns for decreasing training size and increasing noise are not the same; thus, increasing training data does not necessarily mitigate the negative performance effects of DP.
We provide an additional error analysis in Appendix C, where we show that failed predictions on Reddit and CiteSeer are caused by 'hard cases', i.e. examples and classes that are consistently misclassified regardless of training data size or privacy budget. Moreover, Appendix D describes results on the MNIST dataset with varying lot sizes, showing how this hyperparameter affects model results.
7 Limitations and open questions
Issues of applying SGD-DP to GCNs
Splitting graph datasets consisting of one large graph into smaller mini-batches is not trivial. Special methods have been developed to deal with such cases, such as sampling and aggregation (Hamilton.et.al.2017.NeurIPS), as well as precomputing graph representations (rossi2020sign). Such techniques would be necessary for adapting 'batches' and 'lots' from SGD-DP directly, but this comes with theoretical limitations. Namely, nodes in a graph are not necessarily i.i.d., being by definition related to each other, so there would be potential privacy leakage when performing computations on separate mini-batches of a graph. Further investigation into altering the SGD-DP algorithm and incorporating potential graph mini-batching methods is thus left for future work.
The benefits of our approach of applying SGD-DP and Adam-DP to the GCN case directly are that (1) it is practical, simply adding DP as a wrapper on top of the original model, and (2) it retains the original graph structure, thus not losing important information present in the original dataset and avoiding potential privacy leakage. The downside, however, is that the added noise has to be quite large in order to obtain reasonable $\varepsilon$ values. As we have shown in our experiments, this method is indeed feasible in practice, given enough representational power in the input.
Hyperparameters
We use the same hyperparameters for both DP and non-DP settings to enable a fair comparison. In actual deployment, the DP version should have its own hyperparameters optimized, as optimal settings may vary due to the added noise. However, further tuning on the training data comes at an extra price, as it consumes the privacy budget.
Is our model 'bulletproof' DP?
While the SGD-DP algorithm does guarantee differential privacy by design, the 'devil is in the details'. Abadi.et.al.2016.SIGSAC assume in their implementation that $q < 1$, where $q = L/N$ ($L$ being the lot size, $N$ the size of the input dataset). In our case, due to the nature of large one-graph datasets, $q = 1$, since the lot size is equal to the size of the dataset. This detail is not, however, mentioned in (Abadi.et.al.2016.SIGSAC) directly, but rather in the comments of the original SGD-DP code. (Footnote 8: As of 2020, there is only a fork of the original code available at https://tinyurl.com/y2mwmbm9) Whether this minor implementation detail influences the overall privacy budget computation through the moments accountant remains an open theoretical question.
8 Conclusion
We have explored differentially private training for GCNs, showing the nature of the privacy/utility trade-off. While there is an expected drop in results for the SGD-DP models, they generally perform far better than the baselines, reaching up to 90% of their non-private variants in one setup. In fact, more complexity in the input representations seems to mitigate the negative performance effects of applying DP noise. By adapting global DP to a challenging class of deep learning networks, we are thus a step closer to flexible and effective privacy-preserving NLP.
Acknowledgments
This research work has been funded by the German Federal Ministry of Education and Research and the Hessian Ministry of Higher Education, Research, Science and the Arts within their joint support of the National Research Center for Applied Cybersecurity ATHENE. Calculations were conducted on the Lichtenberg high performance computer of the TU Darmstadt.
References
Appendix A Hyperparameter Configuration
Our GCN model consists of 2 layers with ReLU non-linearity, a hidden size of 32, and dropout of 50%, trained with a learning rate of 0.01. We found that early stopping works better for the non-DP implementations, where we used a patience of 20 epochs. We did not use early stopping for the DP configuration, which shows better results without it. For all SGD runs we used a maximum of 2,000 epochs, while for Adam we used 500.
Due to the smaller number of epochs for Adam, it is possible to achieve the same $\varepsilon$ value with less noise. Table 3 shows the mapping from the noise values used for each optimizer to the corresponding $\varepsilon$.
Table 3: Mapping from the noise values used for each optimizer to the corresponding ε.

ε       Noise (SGD)  Noise (Adam)
136.51  4            2
9.75    26           13
4.91    48           24
2.00    112          56
Appendix B Pokec Dataset Preprocessing
In order to prepare the binary classification task for the Pokec dataset, the original graph consisting of 1,632,803 nodes and 30,622,564 edges is subsampled to only include users that filled out the ‘pets’ column and had either cats or dogs as their preference, discarding entries with multiple preferences. For each pet type, users were reordered based on percent completion of their profiles, such that users with most of the information were retained.
For each of the two classes, the top 10,000 users are taken, with the final graph consisting of 20,000 nodes and 32,782 edges. The data was split into 80% training, 10% validation and 10% test partitions.
The textual representations themselves were prepared with 'bert-multilingual-cased' from Huggingface transformers (Footnote 9: https://github.com/huggingface/transformers), converting each attribute of user input in Slovak to BERT embeddings with the provided tokenizer for the same model. Embeddings are taken from the last hidden layer of the model, with dimension size 768. The average over all tokens is taken for a given column of user information, with 49 out of the 59 original columns retained. The remaining 10 are left out due to containing less relevant information for textual analysis, such as a user's last login time. To further simplify input representations for the model, the average is taken over all columns for a user, resulting in a final vector representation of dimension 768 for each node in the graph.
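The two-level averaging described above can be sketched as follows. Random arrays stand in for real BERT last-hidden-layer outputs, and the three attribute columns with their token counts are invented for the example (the actual setup uses 49 columns per user).

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 768  # BERT hidden size

# Stand-ins for BERT outputs: one (num_tokens, DIM) array per retained
# profile attribute; the random vectors are placeholders for real
# contextualized token embeddings.
columns = [rng.normal(size=(n_tokens, DIM)) for n_tokens in (5, 12, 3)]

# Step 1: average over tokens within each attribute column.
column_vectors = [tokens.mean(axis=0) for tokens in columns]
# Step 2: average over all columns to obtain one node feature vector per user.
user_vector = np.mean(column_vectors, axis=0)
```

The resulting 768-dimensional vector is what each Pokec node feeds into the GCN.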
Appendix C Are ‘hard’ examples consistent between private and non-private models?
To look further into the nature of errors for experiments B and C, we evaluate the ‘hard cases’: the examples that the non-private model trained on the maximum data size misclassifies (the results of experiment A). For experiment B, we take the errors for every setting of the experiment (10% training data, 20%, and so forth) and calculate the intersection of those errors with the ‘hard cases’ of the baseline implementation. This intersection is then normalized by the original number of hard cases to obtain a percentage. The results for experiment B are shown in Figure 3. We perform the same procedure for experiment C with different noise values, as seen in Figure 4. This shows how the nature of errors differs across these settings: whether they stay constant or become more random as we decrease the training size or increase the DP noise.
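The overlap computation itself is straightforward; a sketch, assuming the errors of each run are recorded as collections of example IDs:

```python
def hard_case_overlap(baseline_errors, setting_errors):
    """Fraction of the baseline's 'hard cases' that a given setting also
    misclassifies, i.e. |hard ∩ errors| / |hard|."""
    hard = set(baseline_errors)
    return len(hard & set(setting_errors)) / len(hard)
```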
Regarding the errors for experiment C, we can see a strong contrast between datasets such as Reddit and PubMed. For the latter, the more noise we add as ε decreases, the more random the errors become. In the case of Reddit, however, even as we add more noise, the model still fails on the same hard cases. This means that there are difficult aspects of the data that remain constant throughout; for instance, out of all the different classes, some may be particularly difficult for the model.
Although the raw data for Reddit does not include the original class names and input texts, we can still examine the classes numerically and identify the most difficult ones in the confusion matrix. In the baseline non-DP model, we notice that many classes are consistently predicted incorrectly. For example, class 10 is predicted to be class 39 93% of the time. Class 18 is never predicted correctly, but is predicted to be class 9 95% of the time. Class 21 is predicted as class 16 83% of the time, and so forth. The model therefore mixes up many of these classes with considerable confidence.
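Such confusion pairs can be read off a confusion matrix programmatically; a sketch, assuming rows are true classes and columns are predicted classes:

```python
import numpy as np

def top_confusions(conf, k=3):
    """Most frequent (true -> predicted) mix-ups from a confusion matrix,
    with each row normalized to per-class prediction rates."""
    rates = conf / conf.sum(axis=1, keepdims=True)
    np.fill_diagonal(rates, 0.0)  # ignore correct predictions
    flat = np.argsort(rates, axis=None)[::-1][:k]
    pairs = [divmod(i, conf.shape[1]) for i in flat]
    return [(t, p, rates[t, p]) for t, p in pairs]
```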
Comparing this with the confusion matrix of the differentially private implementation at an ε value of 2, we see that it incorrectly predicts these same classes as well, but its predictions are more spread out. Whereas the non-private model is very certain in its incorrect predictions, mistaking one class for another, the private model is less certain and predicts a variety of incorrect classes for the target class.
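One way to quantify this ‘spread’ is the entropy of each class's incorrect-prediction distribution (a sketch, not part of the original analysis): low entropy means confident, systematic confusion, as in the non-private model, while high entropy means errors scattered over many classes, as in the private one.

```python
import numpy as np

def error_entropy(conf):
    """Shannon entropy (bits) of each true class's distribution over
    *incorrect* predicted classes; rows with no errors get entropy 0."""
    errs = conf.astype(float)
    np.fill_diagonal(errs, 0.0)
    totals = errs.sum(axis=1, keepdims=True)
    p = np.divide(errs, totals, out=np.zeros_like(errs), where=totals > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        logs = np.where(p > 0, np.log2(p), 0.0)
    return -(p * logs).sum(axis=1)
```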
For the analysis of the hard cases of experiment B in Figure 3, we see some of the same patterns as above, for instance between PubMed and Reddit. Even as the training size is decreased, the model trained on Reddit still makes the same types of errors throughout. In contrast, as the training size is decreased for PubMed, the model makes increasingly random errors. The main difference between the hard cases of the two experiments is that, apart from Reddit, for all other datasets the errors become more random as we decrease the training size: for example, Cora drops from 85% of hard cases at 90% training data to 74% at 10% training data. In experiment C they stay about the same; for instance, Cora retains just over 70% of the hard cases for all noise values. Overall, while there are some parallels between the hard cases of experiments B and C for individual datasets such as Reddit and PubMed, the general trend of increasingly distinct errors with less training data seen for most datasets in experiment B does not hold in experiment C, where the overlap stays mostly constant across noise values. The errors introduced by DP noise and by reduced training data are thus not always of the same nature, meaning that simply increasing the training size may not mitigate the effects of DP noise.
Appendix D MNIST Baselines
Lot Size   Noise   ε       F1     Std.
600        4       1.26    0.90   0.02
6,000      4       4.24    0.84   0.01
60,000     4       15.13   0.45   0.04
60,000     50      0.98    0.39   0.15
60,000     100     0.50    0.10   0.01
Table 4 shows results on the MNIST dataset with different lot sizes and noise values, keeping lot and batch sizes the same. We use a simple feedforward neural network with a hidden size of 512, dropout of 50%, the SGD optimizer, and a maximum of 2,000 epochs with early stopping at a patience of 20; other hyperparameters, such as the learning rate, are the same as above. We note that the configuration in the first row, with a lot size of 600 and noise of 4, is the same as described by Abadi.et.al.2016.SIGSAC in their application of the moments accountant, reaching the same ε value of 1.2586.

We can see some important patterns in these results that relate to our main results from the GCN experiments. Keeping the noise constant at 4, as we increase the lot size, not only does the ε value increase, but we also see a dramatic drop in F1 score, especially for a lot size of 60,000, the full training set. If we instead increase the noise while keeping the lot size of 60,000, we can lower the ε value below 1, but the F1 score continues to drop dramatically, going down to 0.1010 with a noise value of 100.
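For reference, a single SGD-DP update as described by Abadi.et.al.2016.SIGSAC combines per-example gradient clipping with Gaussian noise scaled to the lot; a minimal NumPy sketch (the privacy accounting itself is handled separately by the moments accountant):

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.01, clip=1.0, noise=4.0, rng=None):
    """One SGD-DP update: clip each per-example gradient to L2 norm `clip`,
    sum over the lot, add Gaussian noise with std `noise * clip`, average,
    then take a plain gradient step."""
    rng = rng or np.random.default_rng(0)
    lot = len(per_example_grads)
    clipped = [g / max(1.0, np.linalg.norm(g) / clip) for g in per_example_grads]
    noisy_mean = (np.sum(clipped, axis=0)
                  + rng.normal(0.0, noise * clip, size=params.shape)) / lot
    return params - lr * noisy_mean
```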
As mentioned in Section 7, dividing a large graph into mini-batches is not trivial. Naively partitioning the graph into subsections risks losing edge information. For a differentially private framework, there is the additional problem that nodes in a graph are not necessarily i.i.d., meaning that such mini-batching methods carry potential privacy leakage.
Hence, the MNIST results help justify the results of our GCN with DP experiments. Despite using the whole graph, i.e. a lot size corresponding to the full training set, we still almost always beat the random and majority baselines, in some cases not being far from the non-private versions, as for Pokec. This also suggests that the lack of mini-batching could be mitigated by increasing the representational power of the input, such as with the multilingual BERT embeddings for Pokec.