Many text classification tasks naturally occur in the form of graphs where nodes represent text documents and edges are task specific, such as articles citing each other or health records belonging to the same patient. When learning node representations and predicting their categories, models benefit from exploiting information from the neighborhood of each node, as shown by graph neural networks, and graph convolutional networks (GCNs) in particular (Kipf.Welling.2017.ICLR), making them superior to other models (Xu.et.al.2019.ICLR; DeCao.et.al.2019.NAACL).
While GCNs are powerful for a variety of NLP problems, like other neural models they are prone to privacy attacks. Adversaries with extensive background knowledge and computational power might reveal sensitive information about the training data from the model, such as reconstructing information about the original classes of a model (hitaj2017deep) or even auditing membership of an individual’s data in a model (song2019auditing). In order to preserve privacy for graph NLP data, models have to protect both the textual nodes and the graph structure, as both sources carry potentially sensitive information.
Privacy-preserving techniques, such as differential privacy (DP) (Dwork.Roth.2013), prevent information leaks by adding ‘just enough’ noise while training a model, while still attaining acceptable performance. Recent approaches to DP in neural models attempt to balance this trade-off between noise and utility, with differentially private stochastic gradient descent (SGD-DP) (Abadi.et.al.2016.SIGSAC) being a prominent example. However, SGD-DP comes with design choices specific to i.i.d. data, such as batches and ‘lots’ (see §4.2), and its suitability for graph neural networks remains an open and non-trivial question.
In this work, we ask what privacy guarantees and performance can be provided by differentially private stochastic gradient descent and its variants for GCNs. First, we are interested in how models’ accuracies differ under varying privacy ‘budgets’. Second, and more importantly, we want to understand to what extent the training data size affects private and non-private performance and whether simply adding more data would be a remedy for the expected performance drop of DP models. We tackle these questions by adapting SGD-DP (Abadi.et.al.2016.SIGSAC) to GCNs as well as proposing a differentially private version of Adam (kingma2017adam), Adam-DP. We hypothesize that Adam’s advantages, i.e. fewer training epochs, would lead to a better privacy/utility trade-off as opposed to SGD-DP.
We conduct experiments on five datasets in two languages (English and Slovak) covering a variety of NLP tasks, including research article classification in citation networks, Reddit post classification, and user interest classification in social networks, where the latter ones inherently carry potentially sensitive information calling for privacy-preserving models. Our main contributions are twofold. First, we show that DP training can be applied to the case of GCNs despite the challenges of non-i.i.d. data. Second, we show that more sophisticated text representations can mitigate the performance drop due to DP noise, resulting in a relative performance of 90% of the non-private variant, while keeping strict privacy (). To the best of our knowledge, this is the first study that brings differentially private gradient-based training to graph neural networks.
2 Theoretical background in DP
As DP does not belong to the mainstream methods in NLP, we briefly outline its principles and present the basic terminology from the NLP perspective. Foundations can be found in (Dwork.Roth.2013; Desfontaines.Pejo.2020).
The main idea of DP is that if we query a database of $N$ individuals, the result of the query will be almost indistinguishable from the result of querying a database of $N-1$ individuals, thus preserving each single individual’s privacy to a certain degree. The difference of results obtained from querying any two databases that differ in one individual has a probabilistic interpretation.
Dataset $D$ consists of $|D|$ documents where each document is associated with an individual whose privacy we want to preserve. (A document can be any arbitrary natural language text, such as a letter, medical record, tweet, personal plain text passwords, or a paper review.) Let $D'$ differ from $D$ by one document, so either $|D'| = |D| \pm 1$, or $|D'| = |D|$ with the $i$-th document replaced. $D$ and $D'$ are called neighboring datasets.
Let $A : D \mapsto z \in \mathbb{R}$ be a function applied to a dataset $D$; for example, a function returning the average document length or the number of documents in the dataset. This function is also called a query, which is not to be confused with queries in NLP, such as search queries. (In general, the query output is multidimensional, $z \in \mathbb{R}^k$; here we keep it scalar for the sake of simplicity.) In DP, the query is randomized: its output is a continuous random variable with probability density $p(A(D) = z)$. Once the function is applied to the dataset $D$, the result is a single draw from this probability distribution. This process is also known as a randomized algorithm. For example, a randomized algorithm for the average document length can be a Laplace density such that $p(A(D) = z) = \frac{1}{2b}\exp\left(-\frac{|z - \mu|}{b}\right)$, where $\mu$ is the true average document length and $b$ is the scale (the ‘noisiness’ parameter). By applying this query to $D$, we obtain $y \in \mathbb{R}$, a single draw from this distribution.
Now we can formalize the backbone idea of DP. Having two neighboring datasets $D$, $D'$, privacy loss is defined as
$$\ln \frac{p(A(D) = z)}{p(A(D') = z)} \tag{1}$$
DP bounds this privacy loss by design. Given $\varepsilon \geq 0$ (the privacy budget hyper-parameter), all values of $z$, and all neighboring datasets $D$ and $D'$, we must ensure that
$$\left| \ln \frac{p(A(D) = z)}{p(A(D') = z)} \right| \leq \varepsilon \tag{2}$$
In other words, the allowed privacy loss of any two neighboring datasets is upper-bounded by $\varepsilon$, also denoted as $(\varepsilon, 0)$-DP. ($(\varepsilon, 0)$-DP is a simplification of the more general $(\varepsilon, \delta)$-DP, where $\delta$ is a negligible constant allowing relaxation of the privacy bounds (Dwork.Roth.2013, p. 18).) The privacy budget $\varepsilon$ controls the amount of preserved privacy. If $\varepsilon \to 0$, the query outputs of any two datasets become indistinguishable, which guarantees almost perfect privacy but provides very little utility. Similarly, higher $\varepsilon$ values provide less privacy but better utility. Finding the sweet spot is thus the main challenge in determining the privacy budget for a particular application (Lee.Clifton.2011.ISC; Hsu.et.al.2014.CSFS). An important feature of $(\varepsilon, \delta)$-DP is that once we obtain the result of the query $y$, any further computations with $y$ cannot weaken the privacy guaranteed by $\varepsilon$ and $\delta$.
The desired behavior of the randomized algorithm is therefore adding as little noise as possible to maximize utility while keeping the privacy guarantees given by Eq. 2. The amount of noise is determined for each particular setup by the sensitivity of the query $\Delta A$, such that for any neighboring datasets $D$, $D'$ we have
$$\Delta A = \max_{D, D'} \left| A(D) - A(D') \right| \tag{3}$$
The sensitivity corresponds to the ‘worst case’ range of a particular query $A$, i.e., the maximum impact of changing one individual. The larger the sensitivity, the more noise must be added to fulfill the privacy requirements of $\varepsilon$ (Eq. 2). For example, in order to be $(\varepsilon, 0)$-DP, the Laplace mechanism must add noise with scale $b = \Delta A / \varepsilon$ (Dwork.Roth.2013, p. 32). As the query sensitivity directly influences the required amount of noise, it is desirable to design queries with low sensitivity.
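To make the link between sensitivity and noise scale concrete, the following short derivation sketches why a Laplace scale of $b = \Delta A / \varepsilon$ satisfies the bound of Eq. 2. It assumes the Laplace density introduced above and writes $\mu$ and $\mu'$ for the true query values on two neighboring datasets $D$ and $D'$ (the symbol $\mu'$ is introduced here for illustration only).

```latex
\begin{align*}
\ln\frac{p(A(D)=z)}{p(A(D')=z)}
  &= \ln\frac{\frac{1}{2b}\exp\!\left(-\frac{|z-\mu|}{b}\right)}
             {\frac{1}{2b}\exp\!\left(-\frac{|z-\mu'|}{b}\right)}
   = \frac{|z-\mu'| - |z-\mu|}{b} \\
  &\le \frac{|\mu-\mu'|}{b}
   \le \frac{\Delta A}{b}
   = \varepsilon
  \qquad\text{for } b = \frac{\Delta A}{\varepsilon}.
\end{align*}
```

The first inequality is the reverse triangle inequality, the second is the definition of the sensitivity. Swapping the roles of $D$ and $D'$ gives the same bound for the negated loss, so the absolute value is bounded by $\varepsilon$ as well.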
The mechanisms described so far consider a scenario in which we apply the query only once. To ensure $(\varepsilon, \delta)$-DP with multiple queries on the same datasets, proportionally more noise has to be added. (Queries might be different, for example querying the average document length first and then querying the number of documents in the dataset.)
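As a minimal sketch of these ideas, the following Python snippet answers two Laplace-noised queries on a toy document collection and splits the privacy budget evenly between them (naive composition). The function names, the clipping bound `max_len`, and the add-or-remove-one-document notion of neighboring datasets are assumptions made for illustration; they are not taken from the experiments in this paper.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw from Laplace(0, scale) as the scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_average_length(docs, epsilon, max_len, rng):
    """Noisy average document length (in whitespace tokens).

    Two queries are issued, each with half of the budget:
      * total token count, lengths clipped to max_len  -> sensitivity max_len
      * number of documents                            -> sensitivity 1
    Dividing the two noisy answers is post-processing and costs no extra budget.
    """
    eps_each = epsilon / 2.0  # naive composition: the budgets of both queries add up
    lengths = [min(len(d.split()), max_len) for d in docs]
    noisy_total = sum(lengths) + laplace_noise(max_len / eps_each, rng)
    noisy_count = len(docs) + laplace_noise(1.0 / eps_each, rng)
    return noisy_total / max(noisy_count, 1.0)  # guard against tiny or negative counts


if __name__ == "__main__":
    rng = random.Random(42)
    toy_docs = ["a short letter", "a slightly longer medical record", "tweet"]
    for eps in (0.1, 1.0, 10.0):
        print(eps, private_average_length(toy_docs, eps, max_len=20, rng=rng))
```

Clipping the document length is what keeps the sensitivity of the total-count query finite; without it, a single very long document could change the answer arbitrarily.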
3 Related work
A wide range of NLP tasks have been utilizing graph neural networks, specifically graph convolutional networks (GCNs), including text summarization (xu2020discourse), machine translation (marcheggiani2018exploiting) and semantic role labeling (zheng2020srlgrn). Recent end-to-end approaches combine pre-trained transformer models with GNNs to learn graph representations for syntactic trees (sachan2020syntax). Rahimi.et.al.2018.ACL demonstrated the strength of GCNs on predicting the geo-location of Twitter users, where nodes are represented by users’ tweets and edges by social connections, i.e. mentions of other Twitter users. Their approach shows that a user’s neighborhood delivers extra information improving the model’s performance. However, if we want to protect user-level privacy, the overall social graph has to be taken into account.
Several recent works in the NLP area deal with privacy using arbitrary definitions. Li.et.al.2018.ACLShort propose an adversarial-based approach to learning latent text representations for sentiment analysis and POS tagging. Although their privacy-preserving model performs on par with non-private models, they admit the lack of formal privacy guarantees. Similarly, Coavoux.et.al.2018.EMNLP train an adversarial model to predict private information on sentiment analysis and topic classification. The adversary’s model performance served as a proxy for privacy strength but, despite its strengths, comes with no formal privacy guarantees. Similar potential privacy weaknesses can be found in a recent work by Abdalla.et.al.2020.JAMIA, who replaced personal health information by semantically similar words while keeping acceptable accuracy of downstream classification tasks.
Abadi.et.al.2016.SIGSAC pioneered the connection of DP and deep learning by bounding the query sensitivity using gradient clipping as well as formally proving the overall privacy bounds by introducing the ‘moments accountant’ mechanism (see §4.3). While originally tested on image recognition, they inspired subsequent work in language modeling using LSTM (McMahan.et.al.2018.ICLR).
General DP over graphs still poses substantial challenges preventing its practical use (Zhu.et.al.2017.book, Sec. 4.4). Two very recent approaches to local DP, that is, adding noise to each example before passing it to graph model training, transform the latent representation of the input into a binary vector, leading to reduced query sensitivity (Sajadmanesh.Gatica-Perez.2020.arXiv; Lyu.et.al.2020.SIGIR).
4.1 GCN Overview
We employ the Graph Convolutional Network (GCN) architecture (Kipf.Welling.2017.ICLR) for enabling DP in the domain of graph-based NLP. GCN is a common and simpler alternative to more complex types of GNNs, which allows us to focus primarily on DP analysis and results and enables a clear comparison of the DP and non-DP models.
Let $G = (V, E)$ model our graph data where each node $v_i \in V$ contains a feature vector of dimensionality $d$. GCN aims to learn a node representation by integrating information from each node’s neighborhood. The features of each neighboring node of $v_i$ pass through a ‘message passing function’ (usually a transformation by a weight matrix $W$) and are then aggregated and combined with the current state of the node $h_i^{l}$ to form the next state $h_i^{l+1}$. Edges are represented using an adjacency matrix $A \in \mathbb{R}^{n \times n}$, $n$ being the number of nodes. $A$ is then multiplied by the matrix $H \in \mathbb{R}^{n \times f}$, $f$ being the hidden dimension, as well as the weight matrix $W$ responsible for message passing. Additional tweaks by Kipf.Welling.2017.ICLR include adding the identity matrix to $A$ to include self-loops in the computation, $\tilde{A} = A + I$, as well as normalizing the matrix $\tilde{A}$ by the degree matrix $\tilde{D}$, specifically using a symmetric normalization $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$. This results in the following equation for calculating the next state of the GCN for a given layer $l$, passing through a non-linearity function $\sigma$:
$$H^{(l+1)} = \sigma\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right) \tag{4}$$
The final layer states for each node are then used for node-level classification, given output labels.
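The propagation rule of Kipf.Welling.2017.ICLR described above can be sketched in a few lines of dependency-free Python. This is a toy illustration on dense lists of lists, assuming an unweighted, undirected graph and ReLU as the non-linearity; the helper names (`matmul`, `normalize_adjacency`, `gcn_layer`) are hypothetical and do not correspond to the implementation used in our experiments.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a square 0/1 adjacency matrix."""
    n = len(adj)
    # Add self-loops: A~ = A + I
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Node degrees of A~ (always >= 1 because of the self-loop)
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj_norm, h, w):
    """One propagation step: H' = ReLU(A_norm @ H @ W)."""
    z = matmul(matmul(adj_norm, h), w)
    return [[max(0.0, value) for value in row] for row in z]


if __name__ == "__main__":
    # Path graph with three nodes: 0 - 1 - 2
    adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # n = 3 nodes, d = 2 features
    weights = [[1.0, -1.0], [0.5, 2.0]]               # d = 2 inputs -> f = 2 hidden units
    print(gcn_layer(normalize_adjacency(adj), features, weights))
```

Stacking two such layers and applying a softmax to the second layer's output would give a classifier of the shape described in Appendix A, although a practical implementation would use sparse matrices.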
4.2 SGD-DP and Adam-DP
SGD-DP (Abadi.et.al.2016.SIGSAC) modifies the standard stochastic gradient descent algorithm to be differentially private. The DP ‘query’ is the gradient computation at time step $t$: $g_t(x_i) \leftarrow \nabla_{\theta_t} \mathcal{L}(\theta_t, x_i)$, for each $i$ in the training set. To ensure DP, the output of this query is distorted by random noise proportional to the sensitivity of the query, which is the range of values that the gradient can take. As the gradient range is unconstrained, possibly leading to extremely large noise, Abadi.et.al.2016.SIGSAC clip the gradient vector by its $\ell_2$ norm, replacing each vector $g$ with $\bar{g} = g / \max\left(1, \|g\|_2 / C\right)$, $C$ being the clipping threshold. This clipped gradient is altered by a draw from a Gaussian: $\bar{g}_t(x_i) + \mathcal{N}(0, \sigma^2 C^2 I)$.
Instead of running this process on individual examples, Abadi.et.al.2016.SIGSAC actually break up the training set into ‘lots’ of size $L$, a concept slightly different from that of ‘batches’. Whereas the gradient computation is performed in batches, SGD-DP groups several batches together into lots for the DP calculation itself, which consists of adding noise, taking the average over a lot and performing the descent $\theta_{t+1} \leftarrow \theta_t - \eta_t \tilde{g}_t$.
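Incorporating this concept, we obtain the overall core mechanism of SGD-DP:
$$\tilde{g}_t \leftarrow \frac{1}{L} \left( \sum_{i \in L} \frac{g_t(x_i)}{\max\left(1, \frac{\|g_t(x_i)\|_2}{C}\right)} + \mathcal{N}(0, \sigma^2 C^2 I) \right) \tag{5}$$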
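The clip, sum, add-noise, average and descend steps of SGD-DP can be sketched as follows. This is a minimal illustration in plain Python under the assumption that per-example gradients are already available as lists of floats; the function names and the toy values are hypothetical, and the snippet says nothing about how the resulting privacy budget is accounted for.

```python
import math
import random


def clip(grad, c):
    """Rescale grad so that its L2 norm is at most the clipping threshold c."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = max(1.0, norm / c)
    return [g / factor for g in grad]


def noisy_lot_gradient(per_example_grads, c, sigma, rng):
    """Clip every per-example gradient, sum them, add N(0, sigma^2 c^2 I), average."""
    lot_size = len(per_example_grads)
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        for k, g in enumerate(clip(grad, c)):
            total[k] += g
    # One Gaussian draw per coordinate, added once per lot (not once per example)
    return [(t + rng.gauss(0.0, sigma * c)) / lot_size for t in total]


def sgd_dp_step(theta, per_example_grads, lr, c, sigma, rng):
    """One descent step using the noisy lot gradient."""
    g_tilde = noisy_lot_gradient(per_example_grads, c, sigma, rng)
    return [t - lr * g for t, g in zip(theta, g_tilde)]


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [1.0, 1.0]
    grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-example gradients
    print(sgd_dp_step(theta, grads, lr=0.1, c=1.0, sigma=2.0, rng=rng))
```

Because the noise standard deviation is `sigma * c`, lowering the clipping threshold reduces the absolute noise but also truncates more of each gradient.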
In this paper, we also develop a DP version of Adam (kingma2017adam), a widely-used default optimizer in NLP (ruder2016overview). As Adam shares the core principle of gradient computation with SGD, to make it differentially private we add noise to the gradient following Eq. 5 (prior to Adam’s moment estimates and parameter update).
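A sketch of this idea is given below: the gradient is clipped and noised first, and only then fed into the usual Adam moment estimates and bias-corrected update. The function names are hypothetical, the default values for `beta1`, `beta2` and `tiny` are the commonly used Adam defaults rather than values reported in this paper, and `tiny` denotes Adam's numerical-stability constant (renamed to avoid confusion with the privacy budget).

```python
import math
import random


def noisy_gradient(per_example_grads, c, sigma, rng):
    """Clip each gradient to L2 norm c, sum, add N(0, sigma^2 c^2 I), average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        factor = max(1.0, norm / c)
        for k, g in enumerate(grad):
            total[k] += g / factor
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma * c)) / n for t in total]


def adam_dp_step(theta, per_example_grads, m, v, t, rng,
                 lr=0.01, c=1.0, sigma=1.0, beta1=0.9, beta2=0.999, tiny=1e-8):
    """One Adam update on a clipped and noised gradient; t is the 1-based step count."""
    g = noisy_gradient(per_example_grads, c, sigma, rng)  # the only place noise enters
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]        # first moment
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]   # second moment
    m_hat = [mi / (1 - beta1 ** t) for mi in m]                        # bias correction
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [th - lr * mh / (math.sqrt(vh) + tiny)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v


if __name__ == "__main__":
    rng = random.Random(0)
    theta, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
    for step in range(1, 4):
        theta, m, v = adam_dp_step(theta, [[3.0, 4.0], [0.3, 0.4]], m, v, step, rng)
    print(theta)
```

Since the moment estimates and the parameter update only ever see the noised gradient, they act as post-processing of the DP query output.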
Despite their conceptual simplicity, both SGD-DP and Adam-DP have to determine the amount of noise needed to guarantee privacy. Abadi.et.al.2016.SIGSAC proposed the moments accountant, which we present in detail in §4.3.
4.3 Moments accountant in detail
SGD-DP introduces two features, namely (1) a reverse computation of the privacy budget, and (2) tighter bounds on the composition of multiple queries. First, a common DP methodology is to pre-determine the privacy budget ($\varepsilon, \delta$) and add random noise according to these parameters. In contrast, SGD-DP does the opposite: Given a pre-defined amount of noise (a hyper-parameter of the algorithm), the privacy budget ($\varepsilon, \delta$) is computed retrospectively. Second, generally in DP, with multiple executions of a ‘query’ (i.e. a single gradient computation in SGD), we can simply sum up the $\varepsilon, \delta$ values associated with each query. (Such that for $k$ queries with privacy budget $(\varepsilon, \delta)$, the overall algorithm is $(k\varepsilon, k\delta)$-DP.) However, this naive composition leads to a very large privacy budget as it assumes that each query used up the maximum given privacy budget.
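The simplest bound on a continuous random variable $Z$, the Markov inequality, takes into account the expectation $\mathbb{E}[Z]$, such that for $\varepsilon \in \mathbb{R}^{+}$:
$$\Pr[Z \geq \varepsilon] \leq \frac{\mathbb{E}[Z]}{\varepsilon} \tag{6}$$
Treating the privacy loss $Z$ as a random variable, a tighter (Chernoff-style) bound follows from applying Eq. 6 to $\exp(\lambda Z)$ for any $\lambda > 0$:
$$\Pr[\exp(\lambda Z) \geq \exp(\lambda \varepsilon)] \leq \frac{\mathbb{E}[\exp(\lambda Z)]}{\exp(\lambda \varepsilon)} \tag{7}$$
where $\mathbb{E}[\exp(\lambda Z)]$ is also known as the moment-generating function.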
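The overall privacy loss $Z$ is composed of the privacy losses $Z_1, \dots, Z_k$ of a sequence of consecutive randomized algorithms (see §2). Since all $Z_i$ are independent, the numerator in Eq. 7 becomes a product of all $\mathbb{E}[\exp(\lambda Z_i)]$. Converting to log form and simplifying, we obtain
$$\Pr[Z \geq \varepsilon] \leq \exp\left( \sum_{i} \ln \mathbb{E}[\exp(\lambda Z_i)] - \lambda \varepsilon \right) \tag{8}$$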
Note the moment generating function inside the logarithmic expression. Since the above bound is valid for any moment of the privacy loss random variable, we can go through several moments and find the one that gives us the lowest bound.
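The search over moments can be illustrated with a small sketch for the special case in which every step uses the whole dataset as one lot, so that each step is a plain Gaussian mechanism without subsampling. For that case the log moment has the well-known closed form $\lambda(\lambda+1)/(2\sigma^2)$ for unit sensitivity; this closed form, the function names, and the search range for $\lambda$ are assumptions made for this illustration and are not taken from Abadi.et.al.2016.SIGSAC, whose accountant also covers sampled lots.

```python
import math


def log_moment_gaussian(lam, sigma):
    """Log moment of the Gaussian mechanism with unit sensitivity, no subsampling."""
    return lam * (lam + 1) / (2.0 * sigma ** 2)


def epsilon_from_moments(sigma, steps, delta, max_lambda=32):
    """Smallest epsilon over lambda = 1..max_lambda for `steps` composed queries.

    Setting the right-hand side of the tail bound to delta,
        delta = exp(steps * log_moment - lambda * epsilon),
    and solving for epsilon gives one candidate per lambda; we keep the best one.
    """
    best = math.inf
    for lam in range(1, max_lambda + 1):
        total = steps * log_moment_gaussian(lam, sigma)  # log moments add up
        eps = (total + math.log(1.0 / delta)) / lam
        best = min(best, eps)
    return best


if __name__ == "__main__":
    for sigma in (25.0, 50.0, 100.0):
        print(sigma, epsilon_from_moments(sigma, steps=100, delta=1e-5))
```

The loop mirrors the statement above: every $\lambda$ yields a valid bound, so the accountant is free to report the tightest one.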
The overall SGD-DP algorithm, given the right noise scale $\sigma$ and a clipping threshold $C$, is thus shown to be $(O(q\varepsilon\sqrt{T}), \delta)$-differentially private using this accounting method, with $q$ representing the ratio $L/N$ between the lot size $L$ and dataset size $N$, and $T$ being the total number of training steps. See (Abadi.et.al.2016.SIGSAC) for further details.
We are interested in a text classification use-case where documents are connected via undirected edges, forming a graph. While structurally limiting, this definition covers a whole range of applications. We perform experiments on five single-label multi-class classification tasks. The Cora, Citeseer, and PubMed datasets (Yang.et.al.2016.ICML; Sen.et.al.2008.AIMag; McCallum.et.al.2000.IR; Giles.et.al.1998.DL) are widely used citation networks of research papers where citing a paper $i$ from paper $j$ creates an edge between $i$ and $j$. The task is to predict the category of the particular paper.
The Reddit dataset (Hamilton.et.al.2017.NeurIPS) treats the ‘original post’ as a graph node and connects two posts by an edge if any user commented on both posts. Given the large size of this dataset (230k nodes; all posts from Sept. 2014), which causes severe computational challenges, we sub-sampled 10% of posts (only a few days of Sept. 2014). The gold label corresponds to one of the top Reddit communities to which the post belongs.
Unlike the previous English datasets, the Pokec dataset (takac2012data; snapnets) contains an anonymized social network in Slovak. Nodes represent users and edges their friendship relations. User-level information contains many attributes in natural language (e.g., ‘music’, ‘perfect evening’). We set up the following binary task: Given the textual attributes, predict whether a user prefers dogs or cats. (Perozzi.Skiena.2015.WWW used the Pokec data for user profiling, namely age prediction for ad targeting. We find such a use case unethical. In contrast, our classification task is harmless, yet serves well the demonstration purposes of text classification of social network data.) Pokec’s personal information, including friendship connections, shows the importance of privacy-preserving methods to protect this potentially sensitive information. For the preparation details see Appendix B.
The four English datasets adapted from the previous work are only available in their encoded form. For the citation networks, each document is represented by a bag-of-words encoding. The Reddit dataset combines GloVe vectors (pennington-etal-2014-glove) averaged over the post and its comments. Only the Pokec dataset is available as raw texts, so we opted for multilingual BERT (devlin2018bert) and averaged all contextualized word embeddings over each user’s textual attributes. (Sentence-BERT (reimers2019sentence) resulted in lower performance. Users fill in the attributes such that the text resembles a list of keywords rather than actual discourse.) The variety of languages, sizes, and different input encodings allows us to compare non-private and private GCNs under different conditions. Table 1 summarizes data sizes and number of classes.
Table 1: Data sizes and number of classes (columns: Dataset, Classes, Test size, Training size).
5.2 Experiment setup
Experiment A. Vanilla GCN on full datasets: The aim is to train the GCN with access to the largest training data possible, but without any privacy mechanism.
Experiment B. Learning curves on the vanilla GCN: We evaluate the influence of less training data on performance, without privacy, allowing for a comparison of results with the DP settings below.
Experiment C. GCN with DP: We evaluate performance when varying the amount of privacy budget with the full datasets.
Experiment D. GCN with DP: We vary both data size and the amount of privacy budget. This allows us to see the effects on performance of both adding noise and reducing training data.
5.2.1 Implementation details
As the privacy parameter $\delta$ is typically kept ‘cryptographically small’ (Dwork.Roth.2013) and, unlike the main privacy budget $\varepsilon$, has a limited impact on accuracy (Abadi.et.al.2016.SIGSAC, Fig. 4), we kept its value fixed (Abadi.et.al.2016.SIGSAC). We perform all experiments five times with different random seeds and report the mean and standard deviation. Early stopping is determined using the validation set. See Appendix A for details on other hyperparameters.
6 Results and discussion
Table 2: Non-DP scores (left) and DP scores (right).
Table 2 shows the results of Experiment A on the left-hand side under ‘Non-DP’. When trained with SGD, both Cora and CiteSeer datasets achieve fairly good results at 0.77 F1 score each, both having relatively small graphs. Much lower are the PubMed results at 0.49, possibly due to the dataset consisting of a much larger graph. Reddit shows higher performance at 0.68, which could in part be due to its input representations as GloVe embeddings, as opposed to binary-valued word vectors. Finally, Pokec shows the best result at 0.83, possibly because of more expressive representations (BERT) and a simpler task (binary classification).
In comparison, in line with previous research (ruder2016overview), Adam outperforms SGD in all cases, with Pokec showing the smallest gap (0.826 and 0.832 for SGD and Adam, respectively).
Starting with the SGD results in Figure 1, we can notice three main patterns.
(1) Clear improvement as training data increases (e.g. CiteSeer, with 0.70 F1 score at 10% vs. 0.77 at 100%).
(2) The exact opposite pattern, with PubMed dropping from 0.57 at 10% to 0.49 at 100%, with a similar pattern for Pokec.
(3) Early saturation of results for Reddit and Cora (at 20-30% for Reddit with approximately 0.69 F1 score, 50% for Cora at a score of 0.77), where results do not increase beyond a certain point.
Regarding points (2) and (3) above, we speculate that, with a larger training size, a vanilla GCN has a harder time learning the more complex input representations. In particular, for PubMed and Pokec, the increasing number of training nodes only partially increases the graph degree, so the model fails to learn expressive node representations when limited information from the node’s neighborhood is available. By contrast, the Reddit graph degree grows much faster, which benefits GCNs.
Comparing each of these patterns for Adam, we see that for (1), datasets also improve (CiteSeer); (2) shows a very similar decrease in results for Pokec, but a mostly constant score throughout for PubMed (at ~0.80); while (3) shows continued improvement where SGD saturated for Cora and Reddit, suggesting that Adam allows the model to break through the learning bottleneck.
The results of Experiment C can be seen in Table 2 for a comparison of different DP noise values, as well as a comparison of the results with and without DP. We note four main patterns in this experiment:
(1) Interestingly, SGD-DP results stay the same, regardless of the noise value added.
(2) Adam-DP results are far worse than SGD-DP, but improve with weaker privacy (less noise).
(3) SGD-DP results almost always outperform the baselines (except for PubMed).
(4) We see bigger drops in performance in the DP setting for datasets with simpler input representations.
SGD-DP vs. Adam-DP
Points (1) and (2) are both contrary to expectations. One explanation for the former could be that gradients in SGD are already quite noisy, which may even help the model generalize, so the additional DP noise does not pose much difficulty beyond a certain drop in performance. Regarding Adam-DP, we see that results are far worse and do increase with lesser privacy (e.g., 0.51 F1 at a smaller $\varepsilon$ vs. 0.76 F1 at a larger $\varepsilon$ for Pokec). Several reasons can account for this, one being that Adam has more hyperparameters, which could be sensitive to the DP setting.
Differences in input features
For points (3) and (4) above, we see varying degrees of performance drops, depending on the dataset. Datasets with simpler input features can have results drop by more than half in comparison to the non-DP implementation, although they still outperform a majority baseline (e.g. non-DP DP Maj. for CiteSeer). An exception to this is PubMed, which has DP results slightly below a majority baseline (). The drop in results from non-DP to DP is not as sharp (), most probably because the non-DP model was not able to achieve good performance.
Reddit shows a smaller drop from non-DP to DP and significantly outperforms the majority baseline (, respectively). Finally, the best-performing SGD-DP model was Pokec, with a relatively small drop from the non-DP to DP result ( F1 score, respectively). Hence, CiteSeer, Cora and PubMed, all using one-hot textual representations, show fairly low results for DP at . Slightly better is Reddit (GloVe), while Pokec is by far the best (BERT).
Finally, Figure 2 shows the DP results both for varying $\varepsilon$ and with different training sub-samples (25%, 50%, 75% and the full 100%). Overall, some parallels and contrasts can be drawn with the learning curves from Experiment B.
Datasets which behave similarly for the two experiments are CiteSeer and Cora, where the former improves with more training data and the latter saturates at a certain point. PubMed, Reddit and Pokec show a contrasting pattern, with both PubMed and Reddit staying about the same for all sub-samples, apart from the 100% setting, with a slight drop for PubMed and slight increase for Reddit. In experiment B, both had more gradual learning curves, with a slow decline for PubMed and a quick plateau for Reddit. Similarly, Pokec here shows the best results with the full data, in contrast to the gradual decline in the non-private setting.
We can see that the patterns for learning curves are not the same in the DP and non-DP setting. While increasing training data may help to some extent, it does not act as a solution to the general drop in performance caused by adding DP noise.
The main observations of these experiments can be summarized as follows:
The network is learning useful representations in the SGD-DP setting, outperforming the majority baselines.
SGD-DP is fairly robust to noise for these datasets and settings even for privacy at .
While Adam is superior in the non-private setting, Adam-DP does not perform very well.
More complex representations are better for the DP setting, showing a smaller performance drop.
Patterns for decreasing training size and increasing noise are not the same, thus increasing training data does not necessarily mitigate negative performance effects of DP.
We provide an additional error analysis in Appendix C, where we show that failed predictions in Reddit and CiteSeer are caused by ‘hard cases’, i.e. examples and classes that are consistently misclassified regardless of training data size or privacy budget. Moreover, Appendix D describes results on the MNIST dataset with varying lot sizes, showing how this hyperparameter affects model results.
7 Limitations and open questions
Issues of applying SGD-DP to GCNs
Splitting graph datasets consisting of one large graph into smaller mini-batches is not trivial. Special methods have been developed to specifically deal with such cases, such as sampling and aggregation (Hamilton.et.al.2017.NeurIPS), as well as pre-computing graph representations (rossi2020sign). Such techniques would be necessary for adapting ‘batches’ and ‘lots’ from SGD-DP directly, but they come with theoretical limitations. Namely, nodes in a graph are not necessarily i.i.d., being by definition related to each other, so there would be potential privacy leakage when performing computations on separate mini-batches of a graph. Further investigation into altering the SGD-DP algorithm and incorporating potential graph mini-batching methods is thus left for future work.
The benefits of applying SGD-DP and Adam-DP to the GCN case directly are that (1) the approach is practical, as it simply acts as a wrapper on top of the original model, and (2) it retains the original graph structure, thus not losing important information present in the original dataset and avoiding potential privacy leakage. The downside, however, is that the noise added has to be quite large in order to obtain reasonable $\varepsilon$ values. As we have shown in our experiments, this method is indeed feasible in practice, given enough representational power in the input.
We use the same hyperparameters for both DP and non-DP settings to enable a fair comparison. In actual deployment, the DP version should have its own hyperparameters optimized, as optimal settings may vary due to the added noise. However, further tuning on the training data comes at an extra cost, as it consumes the privacy budget.
Is our model ‘bullet-proof’ $(\varepsilon, \delta)$-DP?
While the SGD-DP algorithm does guarantee differential privacy by design, the ‘devil is in the details’. Abadi.et.al.2016.SIGSAC impose in their implementation a condition involving $q$, where $q = L/N$ ($L$ being the lot size, $N$ the size of the input dataset). In our case, due to the nature of large one-graph datasets, $q = 1$, since the lot size is equal to the size of the dataset. This detail is not, however, mentioned in (Abadi.et.al.2016.SIGSAC) directly, but rather in the comments of the original SGD-DP code. (As of 2020, there is only a fork of the original code available at https://tinyurl.com/y2mwmbm9.) Whether this minor implementation detail influences the overall privacy budget computation through the moments accountant remains an open theoretical question.
We have explored differentially-private training for GCNs, showing the nature of the privacy-utility trade-off. While there is an expected drop in results for the SGD-DP models, they generally perform far better than the baselines, reaching up to 90% of their non-private variants in one setup. In fact, more complexity in the input representations seems to mitigate the negative performance effects of applying DP noise. By adapting global DP to a challenging class of deep learning networks, we are thus a step closer to flexible and effective privacy-preserving NLP.
This research work has been funded by the German Federal Ministry of Education and Research and the Hessian Ministry of Higher Education, Research, Science and the Arts within their joint support of the National Research Center for Applied Cybersecurity ATHENE. Calculations were conducted on the Lichtenberg high performance computer of the TU Darmstadt.
Appendix A Hyperparameter Configuration
Our GCN model consists of 2 layers, with ReLU non-linearity, a hidden size of 32 and dropout of 50%, trained with a learning rate of 0.01. We found that early stopping the model works better for the non-DP implementations, where we used a patience of 20 epochs. We did not use early stopping for the DP configuration, which shows better results without it. For all SGD runs we used a maximum of 2000 epochs, while for Adam we used 500.
Due to the smaller number of epochs for Adam, it is possible to add less noise to achieve a lower $\varepsilon$ value. Table 3 shows the mapping from noise values used for each optimizer to the corresponding $\varepsilon$.
Appendix B Pokec Dataset Pre-processing
In order to prepare the binary classification task for the Pokec dataset, the original graph consisting of 1,632,803 nodes and 30,622,564 edges is sub-sampled to only include users that filled out the ‘pets’ column and had either cats or dogs as their preference, discarding entries with multiple preferences. For each pet type, users were reordered based on percent completion of their profiles, such that users with most of the information were retained.
For each of the two classes, the top 10,000 users are taken, with the final graph consisting of 20,000 nodes and 32,782 edges. The data was split into 80% training, 10% validation and 10% test partitions.
The textual representations themselves were prepared with ‘bert-multilingual-cased’ from Huggingface transformers (https://github.com/huggingface/transformers), converting each attribute of user input in Slovak to BERT embeddings with the provided tokenizer for the same model. Embeddings are taken from the last hidden layer of the model, with dimension size 768. The average over all tokens is taken for a given column of user information, with 49 out of the 59 original columns retained. The remaining 10 are left out due to containing less relevant information for textual analysis, such as a user’s last login time. To further simplify input representations for the model, the average is taken over all columns for a user, resulting in a final vector representation of dimension 768 for each node in the graph.
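The two-stage averaging described above (tokens within a column first, then columns within a user) can be sketched as follows. The sketch assumes that token embeddings have already been extracted and uses toy 3-dimensional vectors instead of the 768-dimensional BERT states; the function names and the choice to skip empty columns are assumptions made for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equally sized vectors."""
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]


def user_representation(columns):
    """Average token embeddings per column, then average the column vectors.

    `columns` maps a profile column name to the list of token embeddings
    obtained for the text a user entered in that column.
    """
    # Assumption: columns a user left empty are skipped rather than zero-filled
    column_vectors = [mean_vector(tokens) for tokens in columns.values() if tokens]
    return mean_vector(column_vectors)


if __name__ == "__main__":
    toy_user = {
        "music": [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]],  # two tokens
        "perfect evening": [[0.0, 2.0, 4.0]],         # one token
    }
    print(user_representation(toy_user))
```

Averaging per column first gives every retained column the same weight in the final node vector, regardless of how many tokens a user typed into it.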
Appendix C Are ‘hard’ examples consistent between private and non-private models?
To look further into the nature of errors for experiments B and C, we evaluate the ‘hard cases’. These are cases that the model has an incorrect prediction for with the maximum data size and non-private implementation (results of experiment A). For experiment B, we take the errors for every setting of the experiment (10% training data, 20%, and so forth) and calculate the intersection of those errors with that of the ‘hard cases’ from the baseline implementation. This intersection is then normalized by the original number of hard cases to obtain a percentage value. The results for experiment B can be seen in Figure 3. We perform the same procedure for experiment C with different noise values, as seen in Figure 4. This provides a look into how the nature of errors differs among these different settings, whether they stay constant or become more random as we decrease the training size or increase DP noise.
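The overlap measure described above amounts to a normalized set intersection. A minimal sketch follows, assuming that errors are identified by test-example indices; the function name is hypothetical.

```python
def hard_case_overlap(baseline_errors, setting_errors):
    """Percentage of baseline 'hard cases' that are also errors in another setting.

    baseline_errors: indices misclassified by the non-private model trained on
                     the full data (Experiment A).
    setting_errors:  indices misclassified in the setting under comparison
                     (a training sub-sample or a DP noise value).
    """
    hard = set(baseline_errors)
    if not hard:
        return 0.0  # no hard cases, nothing to overlap with
    return 100.0 * len(hard & set(setting_errors)) / len(hard)


if __name__ == "__main__":
    baseline = [1, 2, 3, 4]
    with_less_data = [2, 3, 9]
    print(hard_case_overlap(baseline, with_less_data))
```

A value close to 100% means that a setting fails on the same examples as the baseline, whereas lower values indicate that the errors have shifted to other examples.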
Regarding the errors for experiment C, we can see a strong contrast between datasets such as Reddit and PubMed. For the latter, the more noise we add as $\varepsilon$ decreases, the more random the errors become. In the case of Reddit, however, we see that even if we add more noise, it still fails on the same hard cases. This means that there are hard aspects of the data that remain constant throughout. For instance, out of all the different classes, some may be particularly difficult for the model.
Although the raw data for Reddit does not have references to the original class names and input texts, we can still examine these classes numerically and see in the confusion matrix which ones are the most difficult. In the baseline non-DP model, we notice that many classes are consistently predicted incorrectly. For example, class 10 is predicted to be class 39 93% of the time. Class 18 is never predicted correctly, and 95% of the time it is predicted to be class 9. Class 21 is predicted as class 16 83% of the time, and so forth. This model therefore mixes up many of these classes with considerable confidence.
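Statements of the form ‘class X is predicted as class Y in Z% of cases’ correspond to row-normalized entries of the confusion matrix. A minimal sketch of how such values can be read off is given below. The function name `top_confusions` and the toy labels are hypothetical and do not reproduce the Reddit results.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence, Tuple


def top_confusions(gold: Sequence[int],
                   predicted: Sequence[int]) -> Dict[int, Tuple[int, float]]:
    """For each gold class with errors, return the most frequent wrong
    prediction and the fraction of that class's examples it accounts for."""
    rows = defaultdict(Counter)  # rows[gold_class][predicted_class] = count
    for g, p in zip(gold, predicted):
        rows[g][p] += 1

    result = {}
    for g, counts in rows.items():
        wrong = {p: n for p, n in counts.items() if p != g}
        if not wrong:
            continue  # this class is always predicted correctly
        p, n = max(wrong.items(), key=lambda kv: kv[1])
        result[g] = (p, n / sum(counts.values()))
    return result


# Toy labels for illustration only.
gold = [0, 0, 0, 0, 1, 1, 2, 2]
predicted = [3, 3, 3, 0, 2, 2, 2, 2]
confusions = top_confusions(gold, predicted)
```

A single dominant off-diagonal entry in a row indicates a confident, systematic confusion, whereas several small off-diagonal entries indicate errors that are spread over many classes.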
Comparing this with the confusion matrix for the differentially private implementation at an ε value of 2, we can see that these same classes are also predicted incorrectly, but the predictions are more spread out. Whereas the non-private model seems to be very certain in its incorrect prediction, mistaking one class for another, the private model is less certain and predicts a variety of incorrect classes for the target class.
For the analysis of the hard cases of experiment B in Figure 3, we can see some of the same patterns as above, for instance between PubMed and Reddit. Even if the training size is decreased, the model trained on Reddit still makes the same types of errors throughout. In contrast, as training size is decreased for PubMed, the model makes more and more random errors. The main difference between the hard cases of the two experiments is that in experiment B the errors become more random with decreasing training size for all datasets apart from Reddit. For example, Cora goes down from 85% of hard cases at 90% training data to 74% at 10% training data. In experiment C, the overlap stays about the same; for instance, Cora retains just over 70% of the hard cases for all noise values. Overall, while we see some parallels between the hard cases for experiments B and C with respect to patterns of individual datasets such as Reddit and PubMed, the general trend of increasingly distinct errors with less training data, seen for the majority of datasets in experiment B, does not appear in experiment C, where the overlap stays mostly constant across noise values. The errors caused by DP noise and those caused by less training data are thus not always of the same nature, meaning that simply increasing training size may not necessarily mitigate the effects of DP noise.
Appendix D MNIST Baselines
We report results on the MNIST dataset with different lot sizes and noise values, keeping lot and batch sizes the same. We use a simple feed-forward neural network with a hidden size of 512, dropout of 50%, SGD optimizer, and a maximum of 2000 epochs with early stopping of patience 20, with other hyperparameters such as learning rate being the same as above. We note that the configuration in the first row with lot size of 600 and noise 4 is the same as described by Abadi.et.al.2016.SIGSAC in their application of the moments accountant, reaching the same ε value of 1.2586.
We can see some important patterns in these results that relate to our main results from the GCN experiments. Maintaining a constant noise of 4, as we increase the lot size, not only does the ε value increase, but we also see a dramatic drop in F1 score, especially for a lot size of 60,000, which is the full training set. If we increase the noise while maintaining the lot size of 60,000, we are able to lower the ε value below 1, but the F1 score continues to drop dramatically, going down to 0.1010 with a noise value of 100.
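For reference, the roles of the lot size and the noise value in a single update can be sketched as follows, in the style of the SGD-DP update of Abadi.et.al.2016.SIGSAC: per-example gradients are clipped, summed over the lot, perturbed with Gaussian noise, and divided by the lot size. This is an illustrative sketch in plain Python, not the implementation used in our experiments; the names `clip`, `noisy_lot_gradient`, `max_norm`, and `noise_multiplier` are hypothetical.

```python
import math
import random
from typing import List


def clip(grad: List[float], max_norm: float) -> List[float]:
    """Scale a gradient down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_lot_gradient(per_example_grads: List[List[float]],
                       max_norm: float,
                       noise_multiplier: float,
                       rng: random.Random) -> List[float]:
    """Clip, sum over the lot, add Gaussian noise, divide by the lot size."""
    lot_size = len(per_example_grads)
    total = [0.0] * len(per_example_grads[0])
    for grad in per_example_grads:
        for i, g in enumerate(clip(grad, max_norm)):
            total[i] += g
    # Noise standard deviation is noise_multiplier * max_norm per coordinate.
    std = noise_multiplier * max_norm
    return [(t + rng.gauss(0.0, std)) / lot_size for t in total]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.0, 1.0]]  # toy per-example gradients
# With zero noise the result is the mean of the clipped gradients.
clean = noisy_lot_gradient(grads, max_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = noisy_lot_gradient(grads, max_norm=1.0, noise_multiplier=4.0, rng=rng)
```

With a lot size of 60,000 on MNIST, every training example takes part in every update, so the ratio of lot size to training set size is 1.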
As mentioned in Section 7, the problem of dividing a large graph into mini-batches is not a trivial one. There is a potential loss of edge information if we were to naively divide the graph into sub-sections. For a differentially private framework there is also the large problem that nodes in a graph are not necessarily i.i.d., meaning that there would be potential privacy leakage with such mini-batching methods.
Hence, the MNIST results put the results of our GCN with DP experiments into perspective. Despite using the whole graph, meaning a lot size corresponding to the full training set, we still almost always beat the random and majority baselines, and in some cases, such as for Pokec, we are not too far from the non-private versions. This also suggests that the lack of mini-batching could be mitigated by increasing the representational power of the input, such as with the multilingual BERT embeddings for Pokec.