Secure Network Release with Link Privacy

05/01/2020 ∙ by Carl Yang, et al. ∙ University of Illinois at Urbana-Champaign ∙ University of Illinois at Chicago

Many data mining and analytical tasks rely on the abstraction of networks (graphs) to summarize relational structures among individuals (nodes). Since relational data are often sensitive, we seek effective approaches to release utility-preserved yet privacy-protected structured data. In this paper, we leverage the differential privacy (DP) framework to formulate and enforce rigorous privacy constraints on deep graph generation models, with a focus on edge-DP to guarantee individual link privacy. In particular, we enforce edge-DP by injecting Gaussian noise into the gradients of a link prediction based graph generation model, and ensure data utility by improving structure learning with structure-oriented graph comparison. Extensive experiments on two real-world network datasets show that our proposed DPGGen model is able to generate networks with effectively preserved global structure and rigorously protected individual link privacy.


1 Introduction

Nowadays, open network data play a pivotal role in data mining and data analytics tang2008arnetminer ; sen2008collective ; blum2013learning ; snapnets . By releasing and sharing structured relational data with research facilities and enterprise partners, data companies are harvesting the enormous potential value of their data, which benefits decision making in various aspects, including social, financial, and environmental, through collectively improved ads, recommendation, retention and so on yang2017bridging ; yang2018know ; sigurbjornsson2008flickr ; kuhn2009compensation . However, network data usually encode sensitive information not only about individuals but also about their interactions, which makes direct release and exploitation rather unsafe. More importantly, even with careful anonymization, individual privacy is still at stake under collective attack models facilitated by the underlying network structure zhang2019enabling ; cai2018collective . Can we find a way to securely release network data without drastic sanitization that essentially renders the released data useless?

To address the tension between the need to release utilizable data and the concern for data owners' privacy, quite a few models have been proposed recently, focusing on grid-based data like images, texts and gene sequences frigerio2019differentially ; papernot2018scalable ; triastcyn2018generating ; narayanan2008robust ; mohammed2011differentially ; xie2018differentially ; chen2018differentially ; digvijay2018private ; balog2018differentially ; lecuyer2018certified ; zhang2018differentially . However, none of the existing models can be directly applied to the network (graph) setting. While a secure generative model on grid-based data clearly aims to preserve high-level semantics (e.g., class distributions) and protect detailed training data (e.g., exact images or sentences), it remains unclear what should be preserved and what should be protected for network data, due to its modeling of complex interactive objects.

(a) Anonymized original net.
(b) DPGGen generated net.
Figure 1: A toy pair of anonymized and generated networks.

Motivating scenario.

In Figure 1, a bank aims to encourage public studies on the community structures of its customers. It does so by firstly anonymizing all users in the network and then sharing the anonymized network (i.e., network (a) in Figure 1) with the public. However, an attacker interested in knowing the financial interactions (e.g., money transfer) between particular customers can easily get access to another public social network and locate a group of users that likely overlap with the customers in network (a) (e.g., by leveraging public user attributes). Simple graph properties like node degree distribution and triangle count can then be used to identify specific users with high accuracy (e.g., customer A as the only node with degree 5 and within 1 triangle, and customer B as the only node with degree 2 and within 1 triangle). Thus, the attacker confidently knows the identities of A and B and the fact that they have financial interactions, which seriously harms customers' privacy and poses potential crises.
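The fingerprint attack described above can be sketched in a few lines of Python. This is an illustrative toy only: the adjacency structure and the target signature are made up, not taken from the paper's datasets.

```python
# Hypothetical sketch of the re-identification attack described above:
# matching an "anonymized" node by its (degree, triangle count) signature.

def degree(adj, v):
    return len(adj[v])

def triangles(adj, v):
    """Number of triangles through node v (pairs of neighbors that are linked)."""
    nbrs = list(adj[v])
    return sum(1 for i in range(len(nbrs)) for j in range(i + 1, len(nbrs))
               if nbrs[j] in adj[nbrs[i]])

def reidentify(adj, target_signature):
    """Return all nodes whose (degree, triangle) signature matches the target."""
    return [v for v in adj if (degree(adj, v), triangles(adj, v)) == target_signature]

# Toy anonymized network: node 0 is the only node with degree 3 and 1 triangle,
# so its "anonymous" label offers no protection against this signature.
adj = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1},
    3: {0, 4},
    4: {3},
}
matches = reidentify(adj, (3, 1))  # unique match: node 0 is re-identified
```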

In this work, we formulate the goal of secure network release as preserving global network structure while protecting individual link privacy. Continuing with the toy example, the solution we propose is to train a graph neural network model on the original network and release the generated networks (e.g., (b) in Figure 1). Towards the utility of generated networks, we require them to be similar to the original networks from a global perspective, which can be measured by various global graph properties (e.g., network (b) has a very similar degree distribution and the same triangle count as (a)). In this way, we expect many downstream data mining and analytical tasks to produce similar results on them as on the original networks. As for privacy protection, we require that the information in the generated networks cannot confidently reveal the existence or absence of any individual links in the original networks (e.g., the attacker may still identify customers A and B in network (b), but their individual link structure has changed).

However, there are two unique challenges in learning such structure-preserved and privacy-protected graph generation models, which have not been explored by existing literature so far.

Challenge 1: Rigorous protection of individual link privacy.

The rich relational structures in graph data often allow attackers to recover private information through various ways of collective inference zhang2014privacy ; narayanan2009anonymizing ; backstrom2007wherefore . Moreover, graph structure can always be converted to numerical features such as spectral embedding, after which most attacks on grid-based data like model inversion fredrikson2015model and membership inference shokri2017membership can be directly applied for link identification. How can we design an effective mechanism with rigorous privacy protection on links in networks against various attacks?

Challenge 2: Effective preservation of global network structure.

In order to capture global network structure, the model has to constantly compare the structures of the input and currently generated graphs during training. However, unlike images and other grid-based data, graphs have flexible structures, and thus lack efficient universal representations dong2019network . How can we allow a network generation model to effectively learn from the structural difference between two graphs, without conducting very time-costly operations like isomorphism tests all the time?

Present work.

In this work, for the first time, we draw attention to the secure release of network data with deep generative models. Technically, towards the aforementioned two challenges, we develop Differentially Private Graph Generative Nets (DPGGen), which imposes DP training over a link prediction based network generation model for rigorous individual link privacy protection, and further ensures structure-oriented graph comparison for effective global network structure preservation. In particular, we first formulate and enforce edge-DP via Gaussian gradient distortion by injecting designed noise into the sensitive modules during model training. Then we leverage graph convolutional networks kipf2016semi through a variational generative adversarial network architecture gu2018dialogwae ; larsen2016autoencoding to enable structure-oriented network generation.

To evaluate the effectiveness of DPGGen, we conduct extensive experiments on two real-world network datasets. On one hand, we evaluate the utility of generated networks by computing a suite of commonly used graph properties to compare the global structure of generated networks with the original ones. On the other hand, we validate the privacy of individual links by evaluating links predicted from the generated networks against the original networks. Consistent experimental results show that DPGGen is able to effectively generate networks that are similar to the original ones regarding global network structure, while remaining useless for individual link prediction.

2 Related Work

Differential Privacy (DP).

With graph structured data, two types of privacy constraints can be applied, i.e., node-DP Kasiviswanathan13NodeDP and edge-DP Blocki12EdgeDP , which define two neighboring graphs to differ by at most one node or edge, respectively. In this work, we aim at the secure release of network data, and in particular we focus on edge privacy, because it is essential for the protection of object interactions, which are unique to network data in comparison with other types of data. Several existing works have studied the protection of edge-DP. For example, Sala11ShareGraphDP generates graphs based on statistical representations extracted from the original graphs and blurred by designed noise, whereas Wang13DPDegreeGraphGeneration enforces the parameters of dK-graph models to be private. However, based on shallow graph generation models, they do not flexibly capture the global network structure needed to support various unknown downstream analytical tasks zhang2019enabling ; wasserman2010statistical .

Recent advances in deep learning have led to the rapid development of DP-oriented learning schemes. For example, Abadi:2016:DLD:2976749.2978318 refines the analysis of privacy costs, providing a tighter estimation of the overall privacy loss by tracking detailed information of the stochastic gradient descent process. DP learning has also been widely adapted to generative models frigerio2019differentially ; papernot2018scalable ; triastcyn2018generating ; narayanan2008robust ; mohammed2011differentially ; xie2018differentially ; chen2018differentially ; digvijay2018private ; balog2018differentially ; lecuyer2018certified ; zhang2018differentially . For example, frigerio2019differentially ; chen2018differentially ; digvijay2018private ; zhang2018differentially share the same spirit by enforcing DP on the discriminators, and thus inductively on the generators, in a generative adversarial network (GAN) scheme. However, none of them can be directly applied to graph data due to the lack of consideration of structure generation.

Graph Generation (GGen).

GGen has been studied for decades and widely used to synthesize network data used towards the development of various collective analysis and mining models evans2009line ; hallac2017network . Earlier works mainly use probabilistic models to generate graphs with certain properties erds1960evolution ; watts1998collective ; barabasi1999emergence ; newman2001clustering , which are manually designed based on sheer observations and prior assumptions.

Thanks to the surge of deep learning, many advanced GGen models have been developed recently, which leverage different powerful neural networks in a learn-to-generate manner kipf2016variational ; bojchevski2018netgan ; you2018graphrnn ; simonovsky2018graphvae ; li2018learning ; you2018graph ; jin2018junction ; grover2018graphite ; de2018molgan ; zou2018encoding ; ma2018constrained . For example, NetGAN bojchevski2018netgan converts graphs into biased random walks, learns the generation of walks with GAN, and assembles the generated walks into graphs; GraphRNN you2018graphrnn regards the generation of graphs as node-and-edge addition sequences, and models it with a heuristic breadth-first-search scheme and hierarchical RNN. These neural network based models can often generate graphs with much richer properties and flexible structures learned from real-world graphs.

To the best of our knowledge, no existing work on deep GGen has looked into the potential privacy threats posed during the learning and release of these powerful models. In fact, such concerns are rather urgent in the network setting, where sensitive information can often be more easily compromised in a collective manner dai2018adversarial ; backstrom2007wherefore ; zhang2014privacy and privacy leakage can easily further propagate narayanan2009anonymizing ; zugner2018adversarial .

3 DPGGen

In this work, we propose DPGGen for the secure release of generated networks, whose global graph structures are similar to the original sensitive networks, but the individual links (edges) between objects (nodes) are safely protected.

To provide robust privacy guarantees towards various graph attacks, we propose to leverage the well-studied technique of differential privacy (DP) dwork2014algorithmic by enforcing the edge-DP defined as follows.

Definition 1 (Edge Differential Privacy Blocki12EdgeDP )

A randomized mechanism $\mathcal{M}$ satisfies $(\epsilon, \delta)$-edge-DP if for any two neighboring graphs $G, G'$, which differ by at most one edge, $\Pr[\mathcal{M}(G) \in S] \leq e^{\epsilon} \Pr[\mathcal{M}(G') \in S] + \delta$, where $S \subseteq \mathrm{Range}(\mathcal{M})$.
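As a minimal numeric illustration of this definition (not DPGGen's mechanism), consider randomized response on a single potential edge: report the true bit with probability $1-p$ and flip it with probability $p$. This classical mechanism satisfies pure $\epsilon$-edge-DP with $\epsilon = \ln((1-p)/p)$ and $\delta = 0$, which can be checked exhaustively:

```python
import math

# Illustrative only: randomized response on one potential edge.
p = 0.25
eps = math.log((1 - p) / p)  # = ln(3) for p = 0.25

def report_dist(b):
    """Distribution over the reported bit given the true edge bit b."""
    return {b: 1 - p, 1 - b: p}

# Check the Definition 1 inequality for every outcome set S = {0} or {1},
# for the neighboring graphs with (b=1) and without (b=0) the edge:
for outcome in (0, 1):
    pr_g = report_dist(1)[outcome]    # graph with the edge
    pr_g2 = report_dist(0)[outcome]   # neighbor without it
    assert pr_g <= math.exp(eps) * pr_g2 + 1e-12
    assert pr_g2 <= math.exp(eps) * pr_g + 1e-12
```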

Our key insight is, a graph generation model satisfying the above edge-DP should learn to generate similar graphs if trained with two neighboring graphs that differ by at most one edge; as a consequence, information in the generated graph does not confidently reveal the existence or absence of any one particular edge in the original graph, thus protecting individual link privacy.

To ensure DP on individual links, we exploit the existing link reconstruction based graph generation model GraphVAE kipf2016variational , and design a training algorithm to dynamically distort the gradients of its sensitive model parameters by injecting proper amounts of Gaussian noise based on the framework of DPSGD Abadi:2016:DLD:2976749.2978318 . Moreover, to improve the capturing of global graph structures, we replace the direct BCE loss on graph adjacency matrices in GraphVAE with a structure-oriented graph discriminator based on GCN kipf2016semi and the framework of VAEGAN gu2018dialogwae .

Backbone GraphVAE.

Recent research on graph models has largely focused on GCN kipf2016semi , which has been shown promising in calculating universal graph representations maron2019invariant ; xu2018powerful ; chen2019equivalence ; keriven2019universal . In this work, we harness the power of GCN under the consideration of edge-DP by adapting the link reconstruction based graph variational autoencoder (GraphVAE) kipf2016variational as our backbone graph generation model.

Particularly, we are given a graph $G = (V, E)$, where $V$ is the set of $N$ nodes and $E$ is the set of edges, which can be further modeled by a binary adjacency matrix $A \in \{0,1\}^{N \times N}$. As a common practice hamilton2017inductive , we set the node features $X$ simply as the one-hot node identity matrix. The autoencoder architecture of GraphVAE consists of a GCN-based graph encoder to guide the learning of a fully connected feedforward neural network (FNN) based adjacency matrix decoder, which can be trained to directly reconstruct graphs with similar links as in the input graphs. A stochastic latent variable $Z$ is further introduced as the latent representation of $G$ as

$q(Z \mid X, A) = \prod_{i=1}^{N} q(z_i \mid X, A), \quad q(z_i \mid X, A) = \mathcal{N}\big(z_i \mid \mu_i, \operatorname{diag}(\sigma_i^2)\big), \qquad$ (1)

where $\mu = \mathrm{GCN}_{\mu}(X, A)$ is the matrix of mean vectors $\mu_i$, and $\log \sigma = \mathrm{GCN}_{\sigma}(X, A)$ is the matrix of standard deviation vectors $\sigma_i$. $\mathrm{GCN}(X, A) = \tilde{A}\, \mathrm{ReLU}(\tilde{A} X W_0)\, W_1$ is a two-layer GCN model. $\mathrm{GCN}_{\mu}$ and $\mathrm{GCN}_{\sigma}$ share the first-layer parameters $W_0$. $\tilde{A} = D^{-1/2} A D^{-1/2}$ is the symmetrically normalized adjacency matrix of $G$, with degree matrix $D$. $\mathrm{GCN}_{\mu}$ and $\mathrm{GCN}_{\sigma}$ form the encoder network.
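The two-layer GCN encoder of Eq. 1 can be sketched in a few lines of numpy. Layer sizes, initialization, and the toy adjacency matrix below are illustrative stand-ins, not the paper's settings.

```python
import numpy as np

def normalize_adj(A):
    """Symmetrically normalized adjacency: D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, d ** -0.5, 0.0)
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_encoder(A, X, W0, W_mu, W_sigma):
    """Shared first layer W0; separate second-layer heads for mu and log(sigma)."""
    A_tilde = normalize_adj(A)
    H = np.maximum(A_tilde @ X @ W0, 0.0)   # ReLU(A_tilde X W0), shared
    mu = A_tilde @ H @ W_mu                 # matrix of mean vectors
    log_sigma = A_tilde @ H @ W_sigma       # matrix of log std-dev vectors
    return mu, log_sigma

rng = np.random.default_rng(0)
N, d = 4, 2
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.eye(N)                               # one-hot node identities
mu, log_sigma = gcn_encoder(A, X, rng.normal(size=(N, 8)),
                            rng.normal(size=(8, d)), rng.normal(size=(8, d)))
# Reparameterization trick: Z = mu + sigma * eps, eps ~ N(0, I)
Z = mu + np.exp(log_sigma) * rng.normal(size=(N, d))
```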

To generate a graph $\hat{G}$, a reconstructed adjacency matrix $\hat{A}$ is computed from $Z$ by a decoder network as

$\hat{A} = \mathrm{sigmoid}\big(f(Z)\, f(Z)^{\top}\big), \qquad$ (2)

where $Z \sim q(Z \mid X, A)$, and $f$ is a two-layer FNN appended to $Z$ before the logistic sigmoid function. It aims to generate individual links to be compared with those in the input graph.

The whole model is trained through standard variational inference by optimizing the following variational lower bound

$\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \mathcal{L}_{\mathrm{prior}}, \qquad$ (3)

where $\mathcal{L}_{\mathrm{rec}}$ is implemented as the sum of an element-wise binary cross entropy (BCE) loss between the adjacency matrices of the input and generated graphs, and $\mathcal{L}_{\mathrm{prior}} = \mathrm{KL}\big[q(Z \mid X, A)\,\|\,p(Z)\big]$ is a prior loss based on the Kullback-Leibler divergence towards the Gaussian prior $p(Z) = \prod_i \mathcal{N}(z_i \mid 0, I)$.
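A minimal numpy sketch of the decoder and this lower bound, assuming the decoder scores links via a sigmoid over inner products of FNN-transformed latents (as in link-reconstruction VGAE variants); the single linear map standing in for the two-layer FNN and all shapes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode(Z, W):
    """Link reconstruction: A_hat = sigmoid(f(Z) f(Z)^T)."""
    F = Z @ W                                  # stand-in for the two-layer FNN
    return sigmoid(F @ F.T)

def elbo_loss(A, A_hat, mu, log_sigma):
    """Sum of element-wise BCE plus KL( N(mu, sigma^2) || N(0, I) )."""
    tiny = 1e-9
    bce = -np.sum(A * np.log(A_hat + tiny) + (1 - A) * np.log(1 - A_hat + tiny))
    kl = 0.5 * np.sum(np.exp(2 * log_sigma) + mu ** 2 - 1 - 2 * log_sigma)
    return bce + kl

rng = np.random.default_rng(1)
N, d = 4, 2
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
mu = rng.normal(size=(N, d))
log_sigma = rng.normal(size=(N, d)) * 0.1
Z = mu + np.exp(log_sigma) * rng.normal(size=(N, d))
A_hat = decode(Z, rng.normal(size=(d, d)))
loss = elbo_loss(A, A_hat, mu, log_sigma)
```

Note that $\hat{A}$ is symmetric by construction, since $f(Z)f(Z)^{\top}$ is symmetric and the sigmoid is applied element-wise.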

Enforcing DP.

The probabilistic nature of $Z$ allows the model to be generative, meaning that after training the model with an input graph $G$, we can detach and disregard the encoder, and then freely generate an unlimited amount of graphs with similar links to $G$, by solely drawing random samples of $Z$ from the prior distribution and computing $\hat{A}$ with the learned decoder network w.r.t. Eq. 2. However, as shown in kurakin2016adversarial ; gondim2018adversarial , powerful neural network models like VAE can easily overfit training data, so directly releasing a trained GraphVAE model poses potential privacy threats, as links in its generated graphs may be highly indicative of links in the training graphs.

In this work, we care about rigorously protecting the privacy of individual links in the training data, i.e., ensuring edge-DP. Particularly, in Definition 1, the inequality guarantees that the distinguishability of any one edge in the graph will be restricted to a privacy leak level proportional to $e^{\epsilon}$, while $\delta$ relaxes the guarantee for outlier nodes existing in the graph. The two parameters together quantify the absolute amount of private information that can possibly be leaked by the mechanism $\mathcal{M}$, i.e., a graph generation model.

According to Eq. 2, GraphVAE essentially takes a graph $G$ as input and generates a new graph $\hat{G}$ with the same size as output by reconstructing links among the same set of nodes $V$. Therefore, if we regard GraphVAE as the mechanism $\mathcal{M}$, as long as its model parameters are properly randomized, the framework satisfies edge-DP. Particularly, any two input graphs $G$ and $G'$ differing by at most one edge in principle lead to similar generated graphs $\hat{G}$, so information in $\hat{G}$ does not confidently reveal the existence or absence of any particular link in $G$ or $G'$. To exploit the well-structured graph generation framework of GraphVAE, we leverage the technique of the Gaussian mechanism to enforce edge-DP on it.

Theorem 1 (Gaussian Mechanism dwork2014algorithmic )

If the $\ell_2$-norm sensitivity of a deterministic function $f$ is $S$, we have:

$\mathcal{M}(x) \triangleq f(x) + \mathcal{N}(0, S^2 \sigma^2 I), \qquad$ (4)

where $\mathcal{N}(0, S^2 \sigma^2 I)$ is a random variable obeying the Gaussian distribution with mean 0 and standard deviation $S\sigma$. The randomized mechanism $\mathcal{M}$ is $(\epsilon, \delta)$-differentially private if $\sigma \geq \sqrt{2 \ln(1.25/\delta)}/\epsilon$ and $\epsilon < 1$.
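A short sketch of this calibration, assuming the classical Dwork-Roth bound stated in Theorem 1; the example query (an edge count with sensitivity 1 under edge-DP) and its value are illustrative.

```python
import math
import numpy as np

def gaussian_sigma(S, eps, delta):
    """Noise std-dev calibrated to l2-sensitivity S for (eps, delta)-DP, eps < 1."""
    assert 0 < eps < 1
    return S * math.sqrt(2 * math.log(1.25 / delta)) / eps

def gaussian_mechanism(value, S, eps, delta, rng):
    """Release value + N(0, (S*sigma)^2) per Eq. (4)."""
    sigma = gaussian_sigma(S, eps, delta)
    return value + rng.normal(0.0, sigma, size=np.shape(value))

# Example: releasing the edge count of a graph. Neighboring graphs differ by
# at most one edge, so the l2-sensitivity of the count is 1.
rng = np.random.default_rng(0)
noisy_edges = gaussian_mechanism(258.0, S=1.0, eps=0.5, delta=1e-5, rng=rng)
```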

In our setting, $x$ is the original training graph. Then Eq. 4 tells us that a link reconstruction based graph generation model can be randomized to ensure $(\epsilon, \delta)$-edge-DP with properly parameterized Gaussian noise. Therefore, we leverage Theorem 1 by perturbing the gradient optimization of GraphVAE. In particular, we follow frigerio2019differentially to inject a designed Gaussian noise into the gradients of our decoder network clipped by a hyper-parameter $C$ as follows

$\tilde{g} = \frac{1}{B} \Big( \sum_i \frac{g(x_i)}{\max\big(1, \|g(x_i)\|_2 / C\big)} + \mathcal{N}(0, \sigma^2 C^2 I) \Big), \qquad$ (5)

where $g(x_i)$ is the original gradient of the decoder network on node $x_i$, $B$ is the batch size, $C$ is the clipping hyper-parameter required to bound the influence of each individual node, and $\sigma$ is the noise scale hyper-parameter. The idea behind this method is called DPSGD Abadi:2016:DLD:2976749.2978318 . According to Theorem 1 and the analysis in Abadi:2016:DLD:2976749.2978318 ; frigerio2019differentially , a model trained with such distorted gradients is guaranteed to be $(\epsilon, \delta)$-DP.
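The per-node clipping and noise injection of Eq. 5 can be sketched as below; the gradient values are random stand-ins and the hyper-parameter values are only examples.

```python
import numpy as np

def dp_gradient(per_node_grads, C, sigma, rng):
    """DP-SGD step of Eq. (5): clip each per-node gradient to l2 norm C,
    sum, add Gaussian noise with std sigma * C, then average."""
    clipped = [g / max(1.0, np.linalg.norm(g) / C) for g in per_node_grads]
    noise = rng.normal(0.0, sigma * C, size=clipped[0].shape)
    return (np.sum(clipped, axis=0) + noise) / len(per_node_grads)

rng = np.random.default_rng(0)
# Per-node gradients of very different magnitudes (random stand-ins):
grads = [rng.normal(size=5) * s for s in (0.1, 1.0, 10.0)]
g_tilde = dp_gradient(grads, C=1.0, sigma=5.0, rng=rng)

# After clipping, every per-node contribution has l2 norm at most C,
# which is what bounds the sensitivity of the summed gradient:
for g in grads:
    assert np.linalg.norm(g / max(1.0, np.linalg.norm(g) / 1.0)) <= 1.0 + 1e-9
```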

Since GraphVAE is trained in iterations, to guarantee $(\epsilon, \delta)$-DP in the whole training process, we leverage the moments accountant mechanism proposed in Abadi:2016:DLD:2976749.2978318 . Particularly, according to the composability property of the moments accountant, we can accurately bound the total privacy loss of GraphVAE by setting the degree of perturbation (noise scale) at each training iteration as

$\sigma = \frac{c\, q \sqrt{T \log(1/\delta)}}{\epsilon}, \qquad$ (6)

where $T$ is the number of training iterations, $q$ is the sampling ratio, and $c$ is a constant. We term this model DPGVae.
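A small sketch of this calibration, assuming the moments-accountant bound of Eq. 6 with its constant set to 1 for illustration (see Abadi et al. for the actual constant); it also inverts the formula to find how many iterations a fixed budget permits, mirroring the training-stop rule used later in the experiments.

```python
import math

def noise_scale(q, T, eps, delta, c=1.0):
    """Per-iteration noise scale of Eq. (6) for T iterations at sampling ratio q."""
    return c * q * math.sqrt(T * math.log(1.0 / delta)) / eps

def max_iterations(q, sigma, eps, delta, c=1.0):
    """Largest T before budget eps is depleted at a fixed noise scale sigma."""
    return int((sigma * eps / (c * q)) ** 2 / math.log(1.0 / delta))

# With a fixed noise scale (sigma = 5) and sampling ratio (q = 0.01), a larger
# privacy budget eps permits more training iterations before depletion:
assert max_iterations(0.01, 5.0, eps=10.0, delta=1e-5) > \
       max_iterations(0.01, 5.0, eps=1.0, delta=1e-5)
```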

In the generation stage, we can disregard the encoder and only use the decoder to generate an unlimited amount of graphs from vectors $Z$ randomly sampled from the prior distribution $p(Z)$. Since the normal Gaussian distribution is privacy irrelevant, it can be regarded as $(0, 0)$-DP. By the composability property of DP dwork2014algorithmic , graphs generated by DPGVae then satisfy $(\epsilon, \delta)$-DP. In particular, according to Eq. 2, since the decoder network of GraphVAE is essentially generating links, the system is $(\epsilon, \delta)$-edge-DP, the release of which in principle does not disclose sensitive information regarding individual links in the original sensitive networks.

Note that, although the encoder network also directly touches sensitive data, according to Eq. 1, the gradients are already mixed with randomness of samples from the Gaussian prior before reaching the decoder network, so we do not need to add noise to it. Through this design, we can improve training of the decoder network with limited privacy gradient budget, with minimum interruptions to the encoder network, while guaranteeing the whole generation process to be edge-DP.

Improving structure learning.

Besides individual link privacy, we also aim to preserve global network structure so as to ensure the utility of released data. As discussed above, the original GraphVAE computes the reconstruction loss between input and generated graphs based on the element-wise BCE between their adjacency matrices. Such a computation is specified on each individual link, rather than on the structure of the graph as a whole. To improve the learning of global graph structure, we leverage GCN again, which has been shown universally powerful in capturing graph-level structures maron2019invariant ; xu2018powerful ; chen2019equivalence ; keriven2019universal . In particular, we borrow the framework of VAEGAN from recent research gu2018dialogwae ; larsen2016autoencoding ; yang2019conditional , and compute a structure-oriented generative adversarial network (GAN) loss as

$\mathcal{L}_{\mathrm{GAN}} = \min_{\mathrm{Dec}} \max_{D}\; \mathbb{E}_{A}\big[\log D(A)\big] + \mathbb{E}_{\hat{A}}\big[\log\big(1 - D(\hat{A})\big)\big], \quad D(A) = \mathrm{FNN}_D\Big(\textstyle\sum_{i \in V} \mathrm{GCN}_D(X, A)_i\Big), \qquad$ (7)

where $\mathrm{GCN}_D$ and $\mathrm{FNN}_D$ are GCN and FNN networks similar to those defined before, except that in the end the node-level representations are element-wise summed up as the graph-level representation, which resembles the recently proposed GIN model for graph-level representation learning xu2018powerful . In the VAEGAN framework, the decoder also serves as the generator, while $D$ is the discriminator. The intuition behind this novel technique is that the GCN encodings of $A$ and $\hat{A}$ capture the graph structures of $G$ and $\hat{G}$, so a reconstruction loss computed on these encodings captures the intrinsic structural difference between $G$ and $\hat{G}$ instead of the simple sum of the differences over their individual links. Note that the effectiveness of our structure-oriented discriminator is critical not only because it can directly enforce better structure learning of the link-based generator through the minimax game in Eq. 7, but also because it can learn to relax the penalty on certain individual links, through flexible and diverse configurations of the whole graph as long as the global structures remain similar, which exactly fulfills our goals of secure network release. The benefits of such diversity enabled by the VAEGAN have also been discussed in cases like image generation gu2018dialogwae ; larsen2016autoencoding .
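The key property of this sum readout is that the graph-level score is invariant to node ordering, so the discriminator judges structure rather than any particular labeling. A minimal numpy sketch (one GCN layer, random stand-in weights, illustrative toy graph) demonstrating the invariance:

```python
import numpy as np

def normalize_adj(A):
    d = A.sum(axis=1)
    s = np.where(d > 0, d ** -0.5, 0.0)
    return A * s[:, None] * s[None, :]

def discriminator(A, X, W_gcn, w_out):
    """GCN layer, sum readout over nodes (GIN-style), then a sigmoid score."""
    H = np.maximum(normalize_adj(A) @ X @ W_gcn, 0.0)
    h_graph = H.sum(axis=0)                         # graph-level representation
    return 1.0 / (1.0 + np.exp(-h_graph @ w_out))   # probability graph is "real"

rng = np.random.default_rng(0)
N = 4
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.eye(N)
W, w = rng.normal(size=(N, 8)), rng.normal(size=8)

perm = np.array([2, 0, 3, 1])                       # relabel the nodes
score = discriminator(A, X, W, w)
score_perm = discriminator(A[perm][:, perm], X[perm], W, w)
# score == score_perm: the sum readout is permutation invariant.
```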

Following gu2018dialogwae , the encoder, the decoder (generator), and the discriminator are each trained w.r.t. their own combinations of the prior, reconstruction, and GAN losses above, balanced by loss weighing hyper-parameters. To enforce DP constraints and complete our proposed DPGGen framework, Eq. 5 is applied to distort the gradients of the discriminator and guarantee the generator to be $(\epsilon, \delta)$-edge-DP, which can be used to securely generate networks with the other parts disregarded after training. The overall framework of DPGGen is shown in Figure 2 and more training details are put in the Appendix.

Figure 2: Neural architecture of DPGGen (best viewed in color): Our novel graph generation model consists of a GCN-based encoder, an FNN-based decoder (generator), and a GCN+FNN-based discriminator. Data, modules, and gradients marked red are sensitive, but their flows are blocked by the green operations (i.e., sampling, gradient clipping, and noise injection), resulting in DP modules and data, thus eventually protecting individual link privacy.

4 Experimental Evaluations

We conduct two sets of experiments to evaluate the effectiveness of DPGGen in preserving global network structure and protecting individual link privacy. All code and data will be made public upon the acceptance of this work.

Experimental settings.

To provide side-by-side comparison between the original networks and generated networks, we use two standard datasets of real-world networks, i.e., DBLP and IMDB. DBLP includes 72 networks of author nodes and co-author links, where the average numbers of nodes and links are 177.2 and 258; IMDB includes 1500 networks of actor/actress nodes and co-star links, with average node and link numbers 13 and 65.9. The DBLP networks are constructed based on research communities, whereas the IMDB networks are constructed based on movie genres.

To show that DPGGen effectively captures global network structure, we compare it against DPGVae under different privacy budgets (controlled by $\epsilon$ in Eq. 6), as well as the original GraphVAE model kipf2016variational , regarding a suite of graph statistics commonly used to evaluate the performance of graph generation models, especially from a global perspective bojchevski2018netgan ; you2018graphrnn ; yang2019conditional .111Statistics we use include LCC (size of largest connected component), TC (triangle count), CPL (characteristic path length), GINI (gini index), and REDE (relative edge distribution entropy). Specifically, we train all compared models from scratch to convergence $n$ times, where $n$ is the number of networks in the datasets. Each time, the trained models are used to generate one network, which is then compared with the original network regarding the suite of graph statistics. Then we average the absolute differences between the generated networks and the original networks, which ensures that the positive and negative differences do not cancel out.
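Two of the statistics in this suite can be sketched directly from a binary adjacency matrix; the toy graph below is illustrative, not the evaluation code used in the paper.

```python
import numpy as np

def lcc_size(A):
    """LCC: size of the largest connected component, via DFS on the adjacency."""
    N = len(A)
    seen, best = set(), 0
    for s in range(N):
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(u for u in range(N) if A[v][u] and u not in comp)
        seen |= comp
        best = max(best, len(comp))
    return best

def triangle_count(A):
    """TC: trace(A^3) counts each triangle 6 times (3 vertices x 2 directions)."""
    A = np.asarray(A, dtype=float)
    return int(round(np.trace(A @ A @ A) / 6))

# Toy graph: one triangle (0-1-2) plus an isolated edge (3-4).
A = [[0, 1, 1, 0, 0],
     [1, 0, 1, 0, 0],
     [1, 1, 0, 0, 0],
     [0, 0, 0, 0, 1],
     [0, 0, 0, 1, 0]]
```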

To facilitate better understanding towards how the graph statistics reflect the global network structure captured by the models, we also provide results of two recent state-of-the-art network generation methods, i.e., NetGAN bojchevski2018netgan and GraphRNN you2018graphrnn , with default parameter settings and no DP constraints at all. In this experiment, we expect to see the more effective structure-preserving models generate networks that are more similar to the original ones regarding various graph statistics, thus maintaining high network data utility. Besides similarity on graph statistics, we further evaluate the utilities of generated graphs against the original ones on the downstream task of graph classification, of which the details and results are put in the Appendix due to space limitation.

To show that DPGGen effectively guarantees individual link privacy, we train all models for another $n$ times on each dataset. Different from the previous setting where the complete networks are used, we randomly sample 80% of the links from the original networks to train the models. After generating the full networks from the trained models, we use degree distribution to align the nodes in the generated networks with those in the original networks. Then we evaluate the standard AUC metric on the task of individual link prediction222https://github.com/graph-star-team/graph_star by comparing links predicted in the generated networks and links hidden during training in the original networks. In this experiment, we expect to see the more effective privacy-protecting models generate networks that are less useful when used to predict individual links in the original networks, thus rigorously guaranteeing network data privacy.
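The AUC criterion used here can be sketched in pure Python; the scores below are made-up stand-ins contrasting a leaky model (held-out links fully recoverable, AUC = 1) against a well-protected one (AUC near the 0.5 chance level).

```python
def auc(scores_pos, scores_neg):
    """Probability that a random held-out link outscores a random non-link
    (ties count half), i.e., the standard rank-based AUC."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical link scores read off a generated network, evaluated against
# the original network's held-out links (positives) vs. non-links (negatives):
useful_pos, useful_neg = [0.9, 0.8, 0.7], [0.2, 0.3, 0.1]      # leaky model
private_pos, private_neg = [0.5, 0.4, 0.6], [0.5, 0.6, 0.4]    # DP-protected

assert auc(useful_pos, useful_neg) == 1.0          # links fully recoverable
assert abs(auc(private_pos, private_neg) - 0.5) < 0.2  # near chance level
```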

For GraphVAE and our models, we use two-layer GCNs for both $\mathrm{GCN}_{\mu}$ and $\mathrm{GCN}_{\sigma}$ of the encoder network, where the first layer is shared, and we use two-layer FNNs for the decoder (generator) network. For DPGGen, we use another two-layer GCN of the same sizes for $\mathrm{GCN}_D$ and a three-layer FNN for $\mathrm{FNN}_D$. For DP-related hyper-parameters, we follow existing works dwork2014algorithmic ; Abadi:2016:DLD:2976749.2978318 ; ShokriS15PPDL to fix $\delta$, set the noise scale to 5, and set the sampling ratio $q$ to 0.01 (which determines the batch size as $qN$ with $N$ as the graph size). Then we vary $\epsilon$ from 0.1 to 10 to see how much graph-level utility is preserved under different privacy budgets. According to Eq. 6, we terminate the training of DPGGen at $T$ iterations when $\epsilon$ is depleted. Other than the essential parameters in Eq. 6, we empirically choose the clipping parameter $C$, the decay ratio, and the learning rate, and set the loss weighing parameters both to 0.1. We do not observe the model to be very sensitive to the setting of these non-essential parameters.

All experiments are done on a server with four GeForce GTX 1080 GPUs and a 12-core 2.2GHz CPU. The training time of DP-enforced models is often slightly shorter due to early stops when the privacy budget runs out (e.g., a typical training run of GraphVAE, DPGVae, and DPGGen takes 60, 42, and 53 seconds on average on DBLP, respectively). After training, the generation times of the three models are roughly the same (e.g., 0.02 second on average on DBLP). As a direct comparison, the state-of-the-art deep network generation models NetGAN and GraphRNN take longer under the same settings, especially for generation (e.g., 89 and 4.5 seconds for NetGAN to train and generate on DBLP, and 75 and 2.4 seconds for GraphRNN). Note that, although efficiency is not our major concern in this work, short runtimes (especially for generation) are favorable for efficient data sharing.

DBLP Networks IMDB Networks
Models LCC TC CPL GINI REDE LCC TC CPL GINI REDE
Original 107.5 59.90 3.6943 0.3248 0.9385 13.001 305.9 1.2275 0.1222 0.9894
GraphVAE(no DP) 7.51 66.93 0.1330 0.0213 0.0084 0.0145 25.83 0.0121 0.0030 0.0016
NetGAN(no DP) 9.66 39.87 0.1943 0.0105 0.0022 0.0083 27.54 0.0192 0.0042 0.0011
GraphRNN(no DP) 10.27 57.43 0.2043 0.0415 0.0052 0.0594 27.26 0.0214 0.0155 0.0094
DPGVae(ε=10) 21.96 175.29 0.2471 0.0339 0.0153 0.0147 43.63 0.0367 0.0036 0.0030
DPGVae(ε=1) 23.80 187.20 0.3059 0.0343 0.0156 0.0253 43.73 0.0373 0.0038 0.0031
DPGVae(ε=0.1) 26.07 215.13 0.3342 0.0344 0.0158 0.0320 44.12 0.0392 0.0042 0.0032
DPGGen(ε=10) 10.61 64.75 0.2035 0.0224 0.0093 0.0040 22.89 0.0164 0.0010 0.0017
DPGGen(ε=1) 12.38 70.97 0.2643 0.0353 0.0117 0.0053 23.81 0.0168 0.0029 0.0023
DPGGen(ε=0.1) 24.62 77.41 0.2713 0.0485 0.0191 0.0113 24.91 0.0168 0.0029 0.0025
Table 1: Performance evaluation over compared models regarding a suite of important graph structural statistics. The Original rows include the values of original networks, while the rest are the average absolute difference between generated networks by different models and the original networks. Therefore, smaller values indicate better capturing of global network structure and thus better global data utility. Bold font is used for values ranked top-3.

Performances.

In Table 1, our strictly DP-constrained models consistently yield highly competitive and even better results compared with the strongest DP-free baselines regarding the global network structural similarity between generated and original networks on both datasets. As we gradually increase the privacy budget $\epsilon$, our two models (especially DPGGen) clearly perform better. The performance gaps are more significant in the harder conditions, i.e., on DBLP and regarding the more community sensitive metrics like LCC and REDE. Such results clearly demonstrate the advantages of DPGGen in capturing global network structure and the effectiveness of our privacy constraints.

Looking deeper into the numbers, we observe that DPGGen consistently achieves significantly better performance than DPGVae under the same privacy budgets on both datasets (all score differences pass t-tests with p-value < 0.01), which corroborates our novel model designs. Moreover, the suite of statistics measures global network structure from different perspectives. As can be inferred from TC, CPL and GINI, the IMDB networks are in general smaller, tighter and likely more structurally complex than the DBLP networks, which favors link generation models (e.g., GraphVAE) over sequence generation models (e.g., NetGAN and GraphRNN), especially on the more structure-sensitive measures such as TC and CPL. Consequently, our DPGVae and DPGGen models also perform better on the IMDB networks, indicating their advantage in modeling complex link structures.
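The evaluation protocol behind Table 1 can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code: the statistic set (triangle count, maximum degree, average degree) and all function names are assumptions for illustration; the paper's full suite includes further measures such as LCC, CPL, GINI and REDE.

```python
# Compare an original and a generated network on a few global structural
# statistics, reporting the per-statistic absolute difference (as in Table 1).
# Graphs are represented as plain adjacency sets {node: set(neighbors)}.

def graph_stats(adj):
    """Global structural statistics of an undirected graph."""
    # Triangle count: each triangle (u < v < w) is counted exactly once.
    tc = sum(
        1
        for u in adj for v in adj[u] if u < v
        for w in adj[v] if v < w and w in adj[u]
    )
    degrees = [len(nbrs) for nbrs in adj.values()]
    return {"TC": tc, "MD": max(degrees), "AD": sum(degrees) / len(degrees)}

def abs_diff(orig, gen):
    """Per-statistic absolute difference between two networks."""
    s_o, s_g = graph_stats(orig), graph_stats(gen)
    return {k: abs(s_o[k] - s_g[k]) for k in s_o}

def complete(n):
    return {i: {j for j in range(n) if j != i} for i in range(n)}

def cycle(n):
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

# Toy usage: a complete graph vs. a cycle on 5 nodes.
print(abs_diff(complete(5), cycle(5)))  # {'TC': 10, 'MD': 2, 'AD': 2.0}
```

Averaging such differences over many generated samples gives the aggregate scores reported in the table.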

Figure 3: Accuracy of links predicted based on networks generated by DPGGen with varying hyper-parameters, evaluated on the original networks (panels: (a) DBLP, (b) IMDB). Lower AUC means the information in the generated networks is less useful for revealing the true existence or absence of links in the original networks, and thus indicates better individual data privacy.

As shown in Figure 3, for both datasets, links predicted from the networks generated by DPGGen are much less accurate than those predicted from the original networks (26%-35% and 15%-20% AUC drops on DBLP and IMDB, respectively), as well as than those predicted from the networks generated by all baselines. This means that even if attackers somehow identify nodes in the generated (released) networks, they cannot leverage the information there to accurately infer the existence or absence of links between particular pairs of nodes in the original networks. This directly corroborates our claim that DPGGen is effective in protecting individual link privacy.
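The link-privacy evaluation above can be sketched as follows. This is a minimal illustration under assumed details (not the paper's code): an attacker scores candidate node pairs on the released network, here with a simple common-neighbor heuristic standing in for a real link predictor, and AUC measures how well those scores separate true edges from non-edges of the original network.

```python
# AUC-based link privacy attack evaluation: score pairs on the released
# graph, compare against the original graph's ground-truth edges.

def common_neighbor_score(adj, u, v):
    """Attacker's link score for pair (u, v) on the released adjacency sets."""
    return len(adj.get(u, set()) & adj.get(v, set()))

def auc(pos_scores, neg_scores):
    """Probability that a random true edge outscores a random non-edge."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# Usage with hand-made attack scores: AUC near 0.5 means the released
# network carries no usable link information; values near 1 mean links leak.
print(auc([0.9, 0.8, 0.4], [0.5, 0.2]))  # 5/6 ≈ 0.833: strong leakage
```

In Figure 3, a well-protected release pushes this AUC toward 0.5, i.e., the attacker does no better than guessing.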

To conduct more detailed inspections, we vary two of the major hyper-parameters, i.e., the privacy budget ε and the sampling ratio. Consistent with the results in Table 1, larger privacy budgets lead to more privacy leakage, which allows attackers to infer individual links in the original networks with higher accuracy. While some DP-constrained deep learning models are observed to be sensitive to the sampling ratio during training [1, 45], the privacy protection utility of DPGGen remains robust when the sampling ratio is varied over large ranges in practice.
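The gradient perturbation mechanism that underlies this kind of DP training (per-example gradient clipping followed by Gaussian noise, in the style of DP-SGD [1]) can be sketched as follows. This is a generic illustration, not the authors' implementation; the parameter names (C for the clipping norm, sigma for the noise multiplier) are conventional but assumed here.

```python
# One DP-SGD-style step: clip each per-example gradient to L2 norm C,
# sum, add Gaussian noise scaled by sigma * C, and average over the lot.
import math
import random

def clip(grad, C):
    """Rescale grad so its L2 norm is at most C."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, C / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(per_example_grads, C, sigma, rng):
    """Noisy aggregated gradient for one lot (sampled subset) of examples."""
    n = len(per_example_grads)
    clipped = [clip(g, C) for g in per_example_grads]
    summed = [sum(gs) for gs in zip(*clipped)]
    # Gaussian noise with standard deviation sigma * C per coordinate.
    return [(s + rng.gauss(0.0, sigma * C)) / n for s in summed]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.1, -0.2], [10.0, 0.0]]
noisy = dp_sgd_step(grads, C=1.0, sigma=1.0, rng=rng)
print(noisy)  # a 2-dimensional noisy gradient estimate
```

The sampling ratio studied above controls how each lot of examples is drawn per step; together with sigma and the number of steps, it determines the overall privacy budget ε via the moments accountant [1].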

5 Conclusion

Due to the recent development of deep graph generation models, synthetic networks are often generated and released without concern about possible privacy leakage from the original networks used for model training. In this work, for the first time, we address the task of secure network release and formulate its goals as preserving global network structure while protecting individual link privacy. Subsequently, we adopt the well-studied DP framework and develop DPGGen, which protects individual link privacy by enforcing edge-DP over the link prediction based graph generation model of GraphVAE, while preserving global network structure through adversarial learning with a structure-oriented graph discriminator. Comprehensive experiments show that DPGGen is advantageous in generating networks that are globally similar to the original ones (thus effectively maintaining network data utility) and at the same time useless for predicting individual links in the original networks (thus rigorously protecting network data privacy).

References

  • [1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In SIGSAC, 2016.
  • [2] Lars Backstrom, Cynthia Dwork, and Jon Kleinberg. Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In WWW, 2007.
  • [3] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
  • [4] Jeremiah Blocki, Avrim Blum, Anupam Datta, and Or Sheffet. The johnson-lindenstrauss transform itself preserves differential privacy. FOCS, 2012.
  • [5] Avrim Blum, Katrina Ligett, and Aaron Roth. A learning theory approach to non-interactive database privacy. JACM, 2013.
  • [6] Aleksandar Bojchevski, Oleksandr Shchur, Daniel Zügner, and Stephan Günnemann. Netgan: Generating graphs via random walks. In ICML, 2018.
  • [7] Digvijay Boob, Rachel Cummings, Dhamma Kimpara, Uthaipon (Tao) Tantipongpipat, Chris Waites, and Kyle Zimmerman. Private synthetic data generation via gans. arXiv preprint arXiv:1803.03148, 2018.
  • [8] Z. Cai, Z. He, X. Guan, and Y. Li. Collective data-sanitization for preventing sensitive information inference attacks in social networks. TDSC, 2018.
  • [9] Qingrong Chen, Chong Xiang, Minhui Xue, Bo Li, Nikita Borisov, Dali Kaarfar, and Haojin Zhu. Differentially private data generative models. arXiv preprint arXiv:1812.02274, 2018.
  • [10] Zhengdao Chen, Soledad Villar, Lei Chen, and Joan Bruna. On the equivalence between graph isomorphism testing and function approximation with gnns. In NIPS, 2019.
  • [11] Quanyu Dai, Qiang Li, Jian Tang, and Dan Wang. Adversarial network embedding. In AAAI, 2018.
  • [12] Nicola De Cao and Thomas Kipf. Molgan: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973, 2018.
  • [13] Kun Dong, Austin R Benson, and David Bindel. Network density of states. KDD, 2019.
  • [14] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Theoretical Computer Science, 9(3–4):211–407, 2014.
  • [15] Matej Balog, Ilya Tolstikhin, and Bernhard Schölkopf. Differentially private database release via kernel mean embeddings. In ICML, 2018.
  • [16] Pál Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hungar. Acad. Sci, 5:17–61, 1960.
  • [17] TS Evans and Renaud Lambiotte. Line graphs, link partitions, and overlapping communities. Physical Review E, 80(1):016105, 2009.
  • [18] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In SIGSAC, 2015.
  • [19] Lorenzo Frigerio, Anderson Santana de Oliveira, Laurent Gomez, and Patrick Duverger. Differentially private generative adversarial networks for time series, continuous, and discrete open data. In IFIP SEC, 2019.
  • [20] George Gondim-Ribeiro, Pedro Tabacof, and Eduardo Valle. Adversarial attacks on variational autoencoders. arXiv preprint arXiv:1806.04646, 2018.
  • [21] Aditya Grover, Aaron Zweig, and Stefano Ermon. Graphite: Iterative generative modeling of graphs. arXiv preprint arXiv:1803.10459, 2017.
  • [22] Xiaodong Gu, Kyunghyun Cho, Jungwoo Ha, and Sunghun Kim. Dialogwae: Multimodal response generation with conditional wasserstein auto-encoder. In ICLR, 2019.
  • [23] David Hallac, Youngsuk Park, Stephen Boyd, and Jure Leskovec. Network inference via the time-varying graphical lasso. In KDD, 2017.
  • [24] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, 2017.
  • [25] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In ICML, 2018.
  • [26] Shiva Prasad Kasiviswanathan, Kobbi Nissim, Sofya Raskhodnikova, and Adam D. Smith. Analyzing graphs with node differential privacy. In TCC, 2013.
  • [27] Nicolas Keriven and Gabriel Peyré. Universal invariant and equivariant graph neural networks. In NIPS, 2019.
  • [28] Thomas N Kipf and Max Welling. Variational graph auto-encoders. In NIPS Workshop on Bayesian Deep Learning, 2016.
  • [29] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
  • [30] Kristine M Kuhn. Compensation as a signal of organizational culture: the effects of advertising individual or collective incentives. IJHRM, 20(7):1634–1648, 2009.
  • [31] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. ICLR, 2017.
  • [32] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.
  • [33] Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. arXiv preprint arXiv:1802.03471, 2018.
  • [34] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
  • [35] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning deep generative models of graphs. In ICML, 2018.
  • [36] Tengfei Ma, Jie Chen, and Cao Xiao. Constrained generation of semantically valid graphs via regularizing variational autoencoders. In NIPS, 2018.
  • [37] Haggai Maron, Heli Ben-Hamu, Nadav Shamir, and Yaron Lipman. Invariant and equivariant graph networks. In ICLR, 2019.
  • [38] Noman Mohammed, Rui Chen, Benjamin C. M. Fung, and Philip S. Yu. Differentially private data release for data mining. In KDD, 2011.
  • [39] Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large datasets (how to break anonymity of the netflix prize dataset). SP, 2008.
  • [40] Arvind Narayanan and Vitaly Shmatikov. De-anonymizing social networks. SP, 2009.
  • [41] Mark EJ Newman. Clustering and preferential attachment in growing networks. Physical Review E, 64(2):025102, 2001.
  • [42] Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Úlfar Erlingsson. Scalable private learning with pate. In ICLR, 2018.
  • [43] Alessandra Sala, Xiaohan Zhao, Christo Wilson, Haitao Zheng, and Ben Y. Zhao. Sharing graphs using differentially private graph models. In SIGCOMM, 2011.
  • [44] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI mag., 2008.
  • [45] Reza Shokri and Vitaly Shmatikov. Privacy-preserving deep learning. In SIGSAC, 2015.
  • [46] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In SP, 2017.
  • [47] Börkur Sigurbjörnsson and Roelof Van Zwol. Flickr tag recommendation based on collective knowledge. In WWW, 2008.
  • [48] Martin Simonovsky and Nikos Komodakis. Graphvae: Towards generation of small graphs using variational autoencoders. In ICANN, 2018.
  • [49] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. Arnetminer: extraction and mining of academic social networks. In KDD, 2008.
  • [50] Aleksei Triastcyn and Boi Faltings. Generating artificial data for private deep learning. AAAI, 2018.
  • [51] Yue Wang and Xintao Wu. Preserving differential privacy in degree-correlation based graph generation. TDP, 6(2):127–145, 2013.
  • [52] Larry Wasserman and Shuheng Zhou. A statistical framework for differential privacy. JASA, 105(489):375–389, 2010.
  • [53] Duncan J Watts and Steven H Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440, 1998.
  • [54] Liyang Xie, Kaixiang Lin, Shu Wang, Fei Wang, and Jiayu Zhou. Differentially private generative adversarial network. arXiv preprint arXiv:1802.06739, 2018.
  • [55] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In ICLR, 2019.
  • [56] Carl Yang, Lanxiao Bai, Chao Zhang, Quan Yuan, and Jiawei Han. Bridging collaborative filtering and semi-supervised learning: a neural approach for poi recommendation. In KDD, 2017.
  • [57] Carl Yang, Xiaolin Shi, Luo Jie, and Jiawei Han. I know you’ll be back: Interpretable new user clustering and churn prediction on a mobile social application. In KDD, 2018.
  • [58] Carl Yang, Peiye Zhuang, Wenhan Shi, Alan Luu, and Pan Li. Conditional structure generation through graph variational generative adversarial nets. In NIPS, 2019.
  • [59] Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay Pande, and Jure Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. In NIPS, 2018.
  • [60] Jiaxuan You, Rex Ying, Xiang Ren, William L Hamilton, and Jure Leskovec. Graphrnn: Generating realistic graphs with deep auto-regressive models. In ICML, 2018.
  • [61] Aston Zhang, Xing Xie, Kevin Chen-Chuan Chang, Carl A Gunter, Jiawei Han, and XiaoFeng Wang. Privacy risk in anonymized heterogeneous information networks. In EDBT, 2014.
  • [62] Xinyang Zhang, Shouling Ji, and Ting Wang. Differentially private releasing via deep generative model (technical report). arXiv preprint arXiv:1801.01594, 2018.
  • [63] Yanjun Zhang, Xin Zhao, Xue Li, Mingyang Zhong, Caitlin Curtis, and Chen Chen. Enabling privacy-preserving sharing of genomic data for gwass in decentralized networks. In WSDM, 2019.
  • [64] Dongmian Zou and Gilad Lerman. Encoding robust representation for graph generation. arXiv preprint arXiv:1809.10851, 2018.
  • [65] Daniel Zügner, Amir Akbarnejad, and Stephan Günnemann. Adversarial attacks on neural networks for graph data. In KDD, 2018.