The last decade has witnessed the rapid development of online social networks (OSNs). Leading companies in the business have attracted a large number of users. For instance, Facebook has more than 2 billion monthly active users, and over 400 million users use Instagram every day. This has resulted in an unprecedented scale of social graph data available. Beyond OSNs, a number of human activities, such as mobility traces and email communication, can also be modeled as graphs. Both industry and academia could benefit from large-scale graph data: the former can use graph data to construct appealing commercial products, e.g., recommendation systems, while the latter can use graph data to gain a deeper understanding of many fundamental societal questions, such as people’s communication patterns, information propagation and epidemiology. Due to these potential benefits, there exists a strong demand for OSN operators to share their social graph data.
As graph data can reveal very sensitive information, such as identity and social relations, it is crucial to preserve a high degree of anonymity in the graph, and thus to sanitize the original graph data before releasing them. The most straightforward approach is to replace each user’s name/ID with a randomly generated number. However, Backstrom et al. have demonstrated that this approach fails to protect users from being identified. Based on these findings, researchers have developed more sophisticated anonymization mechanisms, such as [10, 11, 12, 13]. In general, these mechanisms modify the original edge set of the graph, e.g., by adding fake edges between users, such that the resulting anonymized graph satisfies certain predefined privacy criteria. For instance, Liu and Terzi adopt the notion of k-anonymity and construct a k-degree anonymous graph in order to prevent re-identification via users’ degrees. In another example, Sala et al. propose to add noise to representative statistical properties of the graph, thereby perturbing the edge set, to provide a certain level of differential privacy.
However, when modifying the original edge set, these mechanisms do not take into account key characteristics of the underlying graph, such as the higher structural proximity between friends than between strangers in the social graph. By exploiting this vulnerability, we can detect implausible fake edges created between users with low structural proximity, recover part of the original graph structure and eventually jeopardize users’ privacy.
Contributions. In this paper, we identify a fundamental weakness of existing graph anonymization mechanisms and study to what extent this allows for the reconstruction of the original graph. In order to best illustrate the wide applicability of our approach, we concentrate on two of the most widely known anonymization mechanisms, which follow the notions of k-anonymity and differential privacy, respectively. We demonstrate that fake edges created by these anonymization mechanisms can be easily detected due to the low structural proximity between the nodes they connect, and that this vulnerability jeopardizes the original anonymization mechanisms’ privacy guarantees. Then, we develop enhanced graph anonymization mechanisms to generate plausible edges that preserve the initial privacy criteria and provide as much utility as the original anonymization schemes, or even more.
Edge plausibility. In order to evaluate whether a given edge in an anonymized graph is plausible, we measure the structural proximity between the two users it connects. In the context of link prediction, structural proximity is normally measured by human-designed metrics. However, these metrics only capture partial information of the proximity. Instead, we rely on a state-of-the-art graph embedding [16, 17] method to map users in the anonymized graph into a continuous vector space, where each user’s vector comprehensively reflects her structural properties in the graph. Then, for each edge in the anonymized graph, we define its plausibility as the similarity between the vectors of the two users it connects, and postulate that lower similarity implies lower edge plausibility.
We illustrate the effectiveness of our plausibility metric in differentiating fake edges from original ones, first without fixing a specific decision threshold. Therefore, we adopt the ROC curve that reports the true-positive and false-positive rates for a whole range of thresholds, and its related AUC (area under the curve) as our evaluation metrics. The experimental results on three real-life social network datasets demonstrate that our approach achieves excellent performance (corresponding to AUC values greater than 0.95) for both anonymization mechanisms in most cases. The ROC curve also shows that our edge plausibility measure significantly outperforms traditional structural proximity metrics.
Given the empirical Gaussian distributions of original and fake edges’ plausibility values, we fit the edges’ plausibility into a Gaussian mixture model (GMM), and rely on the maximum a posteriori probabilities (MAP) resulting from our GMM to concretely decide whether an edge is fake. Evaluation results show that our approach achieves strong performance, with both precision and recall above 0.8 in multiple cases.
Privacy damage. The two anonymization mechanisms we study follow different threat models and privacy definitions. In order to precisely quantify the concrete privacy impact of our graph recovery, we propose privacy loss measures tailored to each mechanism. As the first anonymization mechanism relies on the assumption that the adversary is aware of her victims’ degrees in a social graph, we define the corresponding privacy loss as the closeness of the users’ degrees between the original, anonymized and recovered graphs. For the differential privacy-based anonymization mechanism, we measure the magnitude and entropy of noise added to the statistical measurements of the graph. Experimental results show that the privacy provided by both mechanisms significantly decreases, which concretely demonstrates the extent of the threat on existing graph anonymization mechanisms.
Enhancing graph anonymization. We take the first step towards enhancing the two considered anonymization mechanisms with respect to the weakness we discovered. The main idea is that, when adding fake edges, we perform statistical sampling to select potential fake edges that follow a similar distribution as the edge plausibility in the original graph. Experimental results show that our enhanced anonymization mechanisms decrease the performance of our graph recovery by up to 35%, and more importantly preserve better graph utility compared to existing anonymization mechanisms.
Note that we concentrate on fake added edges (and not on deleted edges) in this paper, since most of the graph anonymization mechanisms, including the two we study [10, 11], mainly add edges to the original social graph to preserve as much graph utility as possible.
In summary, we make the following contributions:
We discover a fundamental weakness of existing graph anonymization mechanisms, and propose an edge plausibility metric to exploit this weakness in order to recover the original graph from the anonymized graph. Extensive experiments on three real-life social network datasets demonstrate the effectiveness of our approach.
We propose metrics to evaluate the privacy loss caused by our graph recovery, which demonstrate the privacy threat in existing graph anonymization mechanisms.
We propose solutions to enhance existing graph anonymization mechanisms, with respect to the weakness we discovered. Our enhanced anonymization mechanisms decrease the performance of our graph recovery and preserve better graph utility.
In this section, we first introduce the notation used throughout the paper, then describe the two anonymization mechanisms we concentrate on, and finally present the threat model.
A social graph is defined as an undirected graph $G = (U, E)$. The set $U$ contains all users (nodes) and a single user is denoted by $u$. All the edges in $G$ are represented by the set $E$. An anonymization mechanism, denoted by $A$, is a map which transforms $G$ to an anonymized graph $G_A = (U, E_A)$ following the privacy criteria of $A$. By this definition, we only consider graph anonymization mechanisms that do not add new nodes but only modify edges. This is in line with most of the previous works in this field [10, 18, 11, 12, 13]. We further use $\kappa(u)$ to represent $u$’s friends in $G$, i.e., $\kappa(u) = \{u' \mid \{u, u'\} \in E\}$. Accordingly, $\kappa_A(u)$ represents $u$’s friends in $G_A$.
II-B Graph Anonymization Mechanisms
Next, we briefly introduce the two graph anonymization mechanisms, namely k-DA and SalaDP, that we concentrate on in this paper. For more details, we refer the interested readers to the original papers. Note that, to fully understand these two mechanisms, we have also inspected the source code of SecGraph, a state-of-the-art software system for evaluating graph anonymization which includes an implementation of both k-DA and SalaDP.
k-DA. The k-DA mechanism follows the notion of k-anonymity in database privacy. The mechanism assumes that the adversary has prior knowledge of its target users’ degrees in a social graph, i.e., numbers of friends, and uses this knowledge to identify the targets from the graph. To mitigate this privacy risk, k-DA modifies the original social graph such that, in the resulting anonymized graph, each user shares the same degree with at least $k - 1$ other users.
k-DA takes two steps: First, it utilizes dynamic programming to construct a k-anonymous degree sequence. Second, the mechanism adds edges to the original graph in order to realize the k-anonymous degree sequence. (In its relaxed version, k-DA also deletes a small fraction of edges, but its major operation is still adding edges.) By calculating the differences between the original degree sequence and the k-anonymous degree sequence, k-DA maintains a list that stores the number of edges needed for each user, namely the user’s residual degree. When adding an edge for a certain user, k-DA picks the new adjacent user with the highest residual degree.
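The anonymity criterion and the residual degrees described above can be illustrated with a minimal sketch. The function names and the toy degree sequences are hypothetical and only serve as an example; the dynamic programming step that produces the anonymous sequence is not reproduced here.

```python
from collections import Counter


def is_k_degree_anonymous(degrees, k):
    """Return True if every degree value is shared by at least k users."""
    counts = Counter(degrees)
    return all(count >= k for count in counts.values())


def residual_degrees(original, anonymized):
    """Number of edges each user still needs to reach her target degree."""
    return [target - current for current, target in zip(original, anonymized)]


# Hypothetical toy example: six users, target sequence is 3-degree anonymous.
original = [4, 4, 3, 2, 2, 1]
target = [4, 4, 4, 2, 2, 2]
residual = residual_degrees(original, target)
```

In this toy case, the third and the sixth user each have a residual degree of 1, so a single added edge between them realizes the target sequence.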
SalaDP. SalaDP is one of the first and most widely known mechanisms to apply differential privacy in the field of graph anonymization. The statistical metric that SalaDP concentrates on is the dK-2 series. The dK-2 series of a graph $G$ counts, for each pair of node degrees $i$ and $j$, the number of edges in $G$ that connect nodes of these degrees. In the literature, the dK-2 series is also referred to as the joint degree distribution, and we will provide a formal definition in Section VI.
SalaDP also takes a two-step approach to anonymize a social graph. First, the mechanism adds Laplace noise to each element in the original dK-2 series, and obtains a differentially private dK-2 series. Then, it adds (and deletes a tiny fraction of) edges to guarantee that the resulting anonymized graph follows the new dK-2 series. The authors of SalaDP do not state explicitly how edges should be added to the original graph. By checking the source code of SecGraph, we find that SalaDP adds fake edges among users in a random manner (Line 252 of SalaDP.java in src/anonymize/ of SecGraph). Similar to k-DA, SalaDP’s major operation on modifying a social graph is also edge addition. In Section IV, we will provide statistics on the proportion of edges added and deleted on the original social graph for SalaDP as well as for k-DA.
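The first step can be sketched as follows: count the edges per degree pair, then perturb each count with Laplace noise. This is only an illustration under stated assumptions. The `sensitivity` argument is a placeholder (the text does not specify how the noise is calibrated), and the function names are hypothetical rather than taken from SecGraph.

```python
import random
from collections import Counter


def dk2_series(edges):
    """Count edges per (degree, degree) pair; each pair is stored sorted."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    series = Counter()
    for u, v in edges:
        series[tuple(sorted((degree[u], degree[v])))] += 1
    return series


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb(series, epsilon, sensitivity, rng):
    """Add Laplace noise of scale sensitivity / epsilon to every count.

    `sensitivity` is an assumed input, not a value given in the text.
    """
    scale = sensitivity / epsilon
    return {pair: count + laplace(scale, rng) for pair, count in series.items()}


# Toy path graph a - b - c - d.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
series = dk2_series(edges)
noisy = perturb(series, epsilon=10.0, sensitivity=1.0, rng=random.Random(0))
```

A smaller `epsilon` gives a larger noise scale, which matches the statement that smaller values provide stronger privacy.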
From the above description, we can see that neither of the anonymization mechanisms takes into account users’ structural proximity when adding fake edges between them. The main hypothesis we investigate in this paper is that we can effectively separate the fake edges added by such mechanisms from the original edges by using a suitable measure of edge plausibility that encodes the structural properties of connected users. We introduce our edge plausibility metric in Section III.
II-C Threat Model
The adversary’s goal is to detect fake edges in $G_A$ and partially recover the original graph $G$ to eventually apply privacy attacks on the recovered graph. In this paper, our main focus lies on the reconstruction of the original graph.
To perform the graph recovery, we assume that the adversary only has access to the anonymized graph $G_A$ and that she knows the anonymization mechanism $A$ that was used. In particular, this means that the adversary does not know any information about the original graph $G$, such as $G$’s graph structure, or any statistical background knowledge related to this graph. Figure 1 depicts a schematic overview of the attack which takes as input only the anonymized graph. Besides the adversary, an OSN operator can also apply our graph recovery attack to check whether there are any potential flaws in $G_A$ before releasing it.
III Edge Plausibility
To verify our hypothesis that an edge is fake if the users it connects are structurally distant, we first need to quantify two users’ structural proximity in a social graph. Previous work on link/friendship prediction in social networks provides numerous proximity metrics, such as embeddedness (number of common friends), Jaccard index and Adamic-Adar score. However, these metrics are manually designed and only capture partial information of structural proximity.
The recent advancement of graph embedding [16, 17], also known as graph representation learning, provides us with an alternative approach. In this context, users in a social network are embedded into a continuous vector space, such that each user’s vector preserves her neighborhood information. If two users share similar neighborhoods in a social network, their vectors will be closer to each other than those with very different neighborhoods. In this sense, a user’s vector is able to reflect her structural property in the network. Moreover, graph embedding follows a general optimization objective which is not related to any downstream prediction task which, in our case, is fake edge detection. Among other advantages, this method does not need any prior knowledge, or training data, on whether an edge is fake. This complies with our assumptions on the adversary’s knowledge and allows for a larger scope of application. In the end, for an edge in the anonymized graph, we can define its two users’ structural proximity as the similarity of their vectors, and use this similarity as the edge’s plausibility.
In this section, we will first introduce the procedure of graph embedding and then present our edge plausibility metric.
III-A Graph Embedding
The goal of graph embedding is to learn a map $f$ from users in the anonymized graph $G_A$ to a continuous vector space:
$$f: U \to \mathbb{R}^d.$$
Here, $d$, as a hyperparameter, represents the dimension of each user’s vector. The state-of-the-art optimization framework for learning $f$ is inspired by Skip-gram [20, 21], an advanced natural language processing model on word embedding (word2vec). Formally, graph embedding can be represented as the following objective function:
$$\arg\max_{f} \prod_{u \in U} \prod_{u' \in \omega(u)} P(u' \mid f(u)).$$
Here, the conditional probability $P(u' \mid f(u))$ is modeled with a softmax function, i.e.,
$$P(u' \mid f(u)) = \frac{\exp(f(u') \cdot f(u))}{\sum_{v \in U} \exp(f(v) \cdot f(u))},$$
where $f(u') \cdot f(u)$ is the dot product of the two vectors, and $\omega(u)$ is a set that represents $u$’s neighborhood in $G_A$. To define $\omega(u)$, one approach would be to include those users that are within a certain number of steps from $u$ in $G_A$, i.e., a breadth-first search. However, previous work has demonstrated that this approach is neither efficient nor effective in the context of graph representation learning. Instead, we follow prior work on graph embedding and use random walks to define the neighborhood of each user. Concretely, we start a random walk from each user in $G_A$ for a fixed number of times $t$, referred to as the walk times. Each random walk takes $l$ steps, referred to as the walk length. For each user $u$, her transition probability to the next user $u'$ in the random walk, denoted by $P(u \to u')$, is uniformly distributed among all her friends, i.e.,
$$P(u \to u') = \begin{cases} \dfrac{1}{|\kappa_A(u)|} & \text{if } \{u, u'\} \in E_A, \\ 0 & \text{otherwise.} \end{cases}$$
The above procedure eventually results in a set of truncated random walk traces. Given these traces, each user’s neighborhood includes the users that appear before and after her in all random walk traces (following prior work, we select 10 users before and after the considered user in the random walk trace to be part of the considered user’s neighborhood). Similar to the vector dimension ($d$), walk length and walk times ($l$ and $t$) are also hyperparameters. We will choose their values through cross-validation.
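The random-walk procedure can be sketched in a few lines. This is a minimal illustration, assuming the graph is given as an adjacency dictionary of friend lists; the function names are hypothetical, and the neighborhood is stored as a plain set of users seen within the window, which is a simplification of how a Skip-gram implementation consumes the traces.

```python
import random


def random_walks(adj, walk_length, walk_times, rng):
    """Start `walk_times` uniform random walks of `walk_length` steps per user."""
    walks = []
    for _ in range(walk_times):
        for start in adj:
            walk = [start]
            for _ in range(walk_length):
                friends = adj[walk[-1]]
                if not friends:  # isolated user: the walk cannot continue
                    break
                walk.append(rng.choice(friends))  # uniform over friends
            walks.append(walk)
    return walks


def neighborhoods(walks, window):
    """Collect, per user, the users within `window` positions in any walk."""
    hood = {}
    for walk in walks:
        for i, user in enumerate(walk):
            context = walk[max(0, i - window):i] + walk[i + 1:i + 1 + window]
            hood.setdefault(user, set()).update(context)
    return hood


# Toy triangle graph; parameter values are arbitrary.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
walks = random_walks(adj, walk_length=5, walk_times=2, rng=random.Random(0))
hood = neighborhoods(walks, window=2)
```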
The objective function above implies that, if two users share similar neighborhoods in $G_A$, then their learned vectors will be closer than those with different neighborhoods. This results in each user’s vector being able to preserve her neighborhood and to eventually reflect her structural property in $G_A$. To optimize it, we rely on stochastic gradient descent (SGD). However, the normalization term $\sum_{v \in U} \exp(f(v) \cdot f(u))$ requires summation over all users in $U$ during each iteration of SGD, which is computationally expensive. Therefore, in order to speed up the learning process, we apply negative sampling.
III-B Quantifying Edge Plausibility
Given the learned vectors from graph embedding, we define an edge’s plausibility as the cosine similarity between its two users’ vectors. Formally, for an edge $\{u, u'\} \in E_A$, its plausibility is defined as
$$s_A(u, u') = \frac{f(u) \cdot f(u')}{\|f(u)\|_2 \, \|f(u')\|_2},$$
where $\|f(u)\|_2$ is the $L_2$-norm of $f(u)$. Consequently, the greater the (cosine) similarity between two users’ vectors is, the more plausible the edge that connects them is. Note that, as $f(u) \in \mathbb{R}^d$, the range of $s_A(u, u')$ is [-1, 1].
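As a small worked example, the plausibility of an edge can be computed directly from two embedding vectors. The sketch below assumes the vectors are plain lists of floats and are non-zero; the function name is hypothetical.

```python
import math


def plausibility(vec_u, vec_v):
    """Cosine similarity between two (non-zero) embedding vectors."""
    dot = sum(a * b for a, b in zip(vec_u, vec_v))
    norm_u = math.sqrt(sum(a * a for a in vec_u))
    norm_v = math.sqrt(sum(b * b for b in vec_v))
    return dot / (norm_u * norm_v)


same_direction = plausibility([1.0, 2.0], [2.0, 4.0])
orthogonal = plausibility([1.0, 0.0], [0.0, 3.0])
opposite = plausibility([1.0, 0.0], [-2.0, 0.0])
```

The three toy pairs cover the two ends and the midpoint of the range: parallel vectors give the maximum, orthogonal vectors give zero, and opposite vectors give the minimum.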
In this section, we evaluate the effectiveness of our edge plausibility metric on differentiating fake edges from original ones without fixing the threshold for deciding whether the edge is fake a priori. This notably enables us to compare our plausibility metric with previous nodes’ structural similarity metrics. We will present how one can optimize the decision rule given the actual data distribution in Section V.
We first describe the experimental setup. Then, we present the general evaluation results. In the end, we study the sensitivity of the hyperparameters involved in graph embedding for defining edge plausibility.
IV-A Experimental Setup
Table I: Basic statistics of the three datasets.
| | Enron | Facebook1 | Facebook2 |
| Number of users | 36,692 | 63,731 | 4,039 |
| Number of edges | 183,831 | 817,090 | 88,234 |
Dataset. We utilize three datasets for conducting experiments. The first dataset, referred to as Enron, is a network of email communications in the Enron corporation, while the second and the third ones contain data collected from Facebook, which we refer to as Facebook1 and Facebook2, respectively. Note that Enron and Facebook1 are the two datasets used in SecGraph. Table I presents some basic statistics of the three datasets. As we can see, these datasets have different sizes (number of users and edges), which allows us to evaluate our approach comprehensively.
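Other structural proximity and distance metrics. To demonstrate the effectiveness of our plausibility metric, which is essentially a structural proximity metric, we also experiment with three classical structural proximity metrics in social networks, namely embeddedness, Jaccard index and Adamic-Adar score. For an edge $\{u, u'\} \in E_A$, these metrics’ formal definitions are as follows:
$$\text{Embeddedness: } |\kappa_A(u) \cap \kappa_A(u')|, \qquad \text{Jaccard index: } \frac{|\kappa_A(u) \cap \kappa_A(u')|}{|\kappa_A(u) \cup \kappa_A(u')|}, \qquad \text{Adamic-Adar score: } \sum_{v \in \kappa_A(u) \cap \kappa_A(u')} \frac{1}{\log |\kappa_A(v)|}.$$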
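Recall that cosine similarity is adopted for measuring edge plausibility. We also experiment with two other vector similarity (distance) metrics, namely Euclidean distance and Bray-Curtis distance. They are formally defined as follows:
$$\text{Euclidean: } \|f(u) - f(u')\|_2, \qquad \text{Bray-Curtis: } \frac{\sum_{i=1}^{d} |f(u)_i - f(u')_i|}{\sum_{i=1}^{d} |f(u)_i + f(u')_i|}.$$
Here, $f(u)_i$ is the $i$-th element of vector $f(u)$.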
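The baseline metrics above are simple enough to state as code. The sketch below uses the standard textbook definitions, assuming the graph is an adjacency dictionary of friend sets and vectors are lists of floats; all names are hypothetical.

```python
import math


def embeddedness(adj, u, v):
    """Number of common friends."""
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    """Common friends divided by the union of both friend sets."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj, u, v):
    """Common friends weighted by the inverse log of their degree."""
    # A common friend of two distinct users has degree >= 2, so log > 0.
    return sum(1.0 / math.log(len(adj[w])) for w in adj[u] & adj[v])


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def bray_curtis(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(abs(a + b) for a, b in zip(x, y))


# Toy graph: a and c share the friends b and d.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"a", "c"}}
```

Note that the first three metrics only look at the direct friends of the two users, which is the sense in which they capture partial information of structural proximity.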
Evaluation metrics. In this section, we use the ROC curve and the resulting area under the curve, namely AUC, as measures to evaluate the effectiveness of our plausibility metric. The ROC curve is a 2D plot with the X-axis representing the false-positive rate and the Y-axis representing the true-positive rate (recall). Different points on the curve correspond to different plausibility thresholds. Therefore, the ROC curve does not rely on any optimized threshold for making predictions. It instead shows the trade-off between true positives and false positives. A higher ROC curve indicates stronger prediction performance. Moreover, there exists a conventional standard for interpreting the AUC value resulting from the ROC curve: AUC $= 0.5$ is equivalent to random guessing and AUC $\geq 0.9$ implies excellent prediction.
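To make the threshold sweep concrete, the following sketch builds the ROC points from plausibility values and integrates them with the trapezoidal rule. It assumes that an edge is flagged as fake when its plausibility is at or below the threshold, and that fake edges are the positive class; both choices and all names are illustrative assumptions.

```python
def roc_points(plausibility, is_fake):
    """Return (FPR, TPR) pairs, one per distinct plausibility threshold."""
    n_fake = sum(is_fake)
    n_orig = len(is_fake) - n_fake
    points = [(0.0, 0.0)]
    for threshold in sorted(set(plausibility)):
        # Edges with plausibility <= threshold are predicted to be fake.
        tp = sum(1 for s, y in zip(plausibility, is_fake) if y and s <= threshold)
        fp = sum(1 for s, y in zip(plausibility, is_fake) if not y and s <= threshold)
        points.append((fp / n_orig, tp / n_fake))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data: 1 marks a fake edge, 0 an original edge.
separable = auc(roc_points([0.1, 0.2, 0.6, 0.7], [1, 1, 0, 0]))
overlapping = auc(roc_points([0.1, 0.6, 0.2, 0.7], [1, 1, 0, 0]))
```

In the second toy case, three of the four (fake, original) pairs are ranked correctly, which is exactly the fraction the AUC reports.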
Parameters in anonymization mechanisms. We rely on SecGraph to perform k-DA and SalaDP on the three datasets. Both anonymization mechanisms involve a privacy parameter. For k-DA, we need to choose the value $k$, i.e., the minimal number of users sharing a certain degree for all possible degrees in $G_A$. Greater $k$ implies stronger privacy protection. In our experiment, we choose $k$ to be 50, 75 and 100, respectively, to explore different levels of privacy protection. For SalaDP, the privacy parameter is $\epsilon$, which controls the noise added to the dK-2 series: the smaller $\epsilon$ is, the higher its privacy provision is. Following the choices of previous work, we experiment with three different $\epsilon$ values: 10, 50 and 100.
As stated before, we concentrate on detecting fake added edges because the principal operation of both k-DA and SalaDP when generating the anonymized graph is adding fake edges to the original graph. By running the two anonymization mechanisms on our datasets, we discover that this is indeed the case. From Table III, we observe that both k-DA and SalaDP only delete a small proportion of the edges in most cases. Especially for SalaDP, the largest deletion rate is only 4.9% on the Facebook2 dataset when $\epsilon$ is set to 10. On the other hand, k-DA deletes relatively more edges. However, even in the worst case, which occurs on the Facebook2 dataset, only 26.5% of edges are deleted. Meanwhile, a much larger proportion of edges are added, from up to 63.1% for k-DA to up to 218.8% for SalaDP. This demonstrates that just identifying fake added edges can already help recover most of the original graph structure.
Hyperparameter setting. There are mainly three hyperparameters in the graph embedding phase: walk length ($l$), walk times ($t$) and vector dimension ($d$). For both k-DA and SalaDP, we use the same values of $l$ and $t$. Meanwhile, we set $d$ separately for k-DA and for SalaDP. These choices are selected through cross-validation, which we will discuss in Section IV-C. For reproducibility purposes, we will make our source code publicly available upon request.
IV-B Prediction Results
Table II presents the AUC values of using our edge plausibility to differentiate fake edges from original ones (the Cosine column). In most of the cases, we achieve excellent performance with AUC values above 0.95. In particular, for the SalaDP-anonymized Facebook2 dataset, the AUC value reaches 0.982. The only case in which our edge plausibility does not achieve excellent performance is when applying SalaDP on the Enron dataset, where the AUC values are between 0.83 and 0.86. However, we emphasize that, for a classical classification task, such an AUC is already considered good.
Regarding the privacy parameters, we observe that the performance of our approach is not significantly influenced by them. For instance, when applying k-DA to anonymize the Enron dataset, if $k$ is set to 50, the AUC value is 0.96, while for a larger $k$ (i.e., the privacy protection increases), the AUC value is still 0.94. Similarly, when applying SalaDP with a larger $\epsilon$ on the Facebook2 dataset, our AUC is 0.98. Decreasing $\epsilon$ to 10 only reduces the AUC by 0.01. This demonstrates that our plausibility metric is quite robust to different privacy parameter values.
The AUC values for other vector similarity (distance) metrics are presented in Table II as well. Cosine similarity outperforms Euclidean distance and Bray-Curtis distance for k-DA-anonymized graphs, even though the performance gain is quite small. On the other hand, for SalaDP-anonymized graphs, we can observe that cosine similarity outperforms Euclidean distance by around 10%, while Bray-Curtis distance is still very close to cosine similarity. This shows that cosine similarity (as well as Bray-Curtis distance) is a suitable metric for measuring edge plausibility.
Figures 2 and 3 present the ROC curves of our plausibility metric (with cosine similarity) as well as three other structural proximity metrics for the k-DA-anonymized and SalaDP-anonymized Facebook1 dataset. We observe that our plausibility metric significantly outperforms these traditional metrics. For instance, for a false-positive rate of 0.1, we achieve a 0.93 true-positive rate on the k-DA-anonymized Facebook1 dataset while embeddedness only achieves a 0.56 true-positive rate. For a false-positive rate of 0.01, we reach a true-positive rate of 0.59 while the result for embeddedness is only 0.17. It also appears that embeddedness outperforms the other two metrics for k-DA while Jaccard index is rather effective for SalaDP. The ROC curves for Enron and Facebook2 are depicted in the appendix.
In general, these results demonstrate the effectiveness of our edge plausibility metric in differentiating fake edges from original edges in anonymized graphs.
IV-C Hyperparameter Sensitivity
There are mainly three hyperparameters, $l$, $t$ and $d$, involved in our edge plausibility metric. We study their sensitivity by relying on the AUC values. Here, $l$ and $t$ are directly related to the size of the random walk traces, which essentially relates to the amount of data used for learning embedding vectors. For both anonymization mechanisms, we observe that increasing $l$ and $t$ increases the AUC values. However, the increase is smaller when both of these values are above 60. Therefore, we fix $l$ and $t$ accordingly. The corresponding plots are depicted in the appendix.
Meanwhile, we observe interesting results for the vector dimension $d$: different anonymization mechanisms have different optimal choices for $d$ (Figures 4 and 5). It appears that, when detecting fake edges on k-DA-anonymized graphs, a single value of $d$ is a suitable choice for all datasets. On the other hand, for SalaDP, a different value of $d$ is able to achieve a stronger prediction. We confirm that the vector dimension is indeed a subtle parameter, as was observed in other data domains, such as biomedical data. In conclusion, our default hyperparameter settings are suitable for our prediction task.
V Graph Recovery
In Section IV, we utilize ROC curves and AUCs to evaluate the general effectiveness of our edge plausibility metric. In this section, we rely on a probabilistic model, namely the Gaussian mixture model (GMM), and a maximum a posteriori (MAP) estimator to decide whether an edge is fake given the plausibility data distribution. We first describe the GMM and MAP estimation in our context, and then present the evaluation results.
Figure 6 depicts the histograms of the plausibility of both fake and original edges for the k-DA-anonymized Facebook1 dataset. Interestingly, both of them follow Gaussian distributions, with different means and standard deviations. The plausibility of the original edges is centered around 0.6, while the plausibility of the fake edges is centered around 0.05. Also, the standard deviation for the plausibility of the fake edges is relatively larger than that of the original edges. Similarly, we observe different Gaussian distributions for the plausibility of fake and original edges on the SalaDP-anonymized Facebook1 dataset (Figure 7) as well as on the Enron and Facebook2 datasets under both anonymization mechanisms (see the appendix).
Given that a general population (plausibility of all edges) consists of a mixture of two subpopulations (plausibility of fake and original edges) with each one following a Gaussian distribution, we can fit the general population with a Gaussian mixture model (GMM). With the fitted GMM, we can obtain each edge’s posterior probability of being fake or original given its plausibility. If the former is higher than the latter, then we predict the edge to be fake, and vice versa (MAP estimation). This means, GMM and MAP estimation provide us with an optimal approach for determining whether an edge is fake given the observed data. Moreover, fitting GMM does not require any prior knowledge on which edges are fake and which are original, meaning that the process is unsupervised.
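Gaussian mixture model. To formally define our GMM, we first introduce two random variables, namely $B$ and $S$. $B$ represents whether an edge is original ($B = 0$) or fake ($B = 1$), while $S$ represents the plausibility of an edge. The probability density function of our GMM is formally defined as:
$$p(S = s_A(u, u')) = \sum_{i \in \{0, 1\}} \pi_i \, \mathcal{N}(s_A(u, u') \mid \mu_i, \sigma_i).$$
The GMM is parameterized by 6 parameters: $\pi_0$, $\mu_0$, $\sigma_0$, $\pi_1$, $\mu_1$ and $\sigma_1$. Here, $\pi_0$ ($\pi_1$) is the prior probability of an edge being original (fake), i.e., $P(B = 0)$ ($P(B = 1)$). Meanwhile, the other 4 parameters are related to the two Gaussian distributions for the plausibility of original and fake edges, respectively. $\mathcal{N}(s_A(u, u') \mid \mu_i, \sigma_i)$ for $i \in \{0, 1\}$ is the density function of a Gaussian distribution:
$$\mathcal{N}(s_A(u, u') \mid \mu_i, \sigma_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(s_A(u, u') - \mu_i)^2}{2\sigma_i^2}\right).$$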
To learn the 6 parameters of the GMM, we adopt the expectation maximization (EM) algorithm, which consists of two steps: the expectation (E) step and the maximization (M) step. The E-step calculates, for each edge in $E_A$, its posterior probability of being fake or original given its plausibility value. The M-step updates all the 6 parameters based on the probabilities calculated from the E-step following maximum likelihood estimation. The learning process iterates over the two steps until convergence. Here, convergence means that the log-likelihoods of two consecutive iterations differ by less than a given threshold (we set it to 0.001 in our experiments). In addition, the initial values of the 6 parameters are set randomly before the learning process starts. Next, we describe the two steps of EM in detail.
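E-step. At each iteration, for each edge $\{u, u'\} \in E_A$ and each $i \in \{0, 1\}$, we evaluate the following posterior probability:
$$P(B = i \mid s_A(u, u')) = \frac{\pi_i \, \mathcal{N}(s_A(u, u') \mid \mu_i, \sigma_i)}{\sum_{j \in \{0, 1\}} \pi_j \, \mathcal{N}(s_A(u, u') \mid \mu_j, \sigma_j)}.$$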
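M-step. After obtaining $P(B = i \mid s_A(u, u'))$ for all edges in $E_A$ from the E-step, we apply maximum likelihood estimation to update all the 6 parameters. Concretely, for $i \in \{0, 1\}$, we perform the following computation:
$$\mu_i = \frac{\sum_{\{u, u'\} \in E_A} P(B = i \mid s_A(u, u')) \, s_A(u, u')}{\sum_{\{u, u'\} \in E_A} P(B = i \mid s_A(u, u'))} \tag{6}$$
$$\sigma_i = \sqrt{\frac{\sum_{\{u, u'\} \in E_A} P(B = i \mid s_A(u, u')) \, (s_A(u, u') - \mu_i)^2}{\sum_{\{u, u'\} \in E_A} P(B = i \mid s_A(u, u'))}} \tag{7}$$
$$\pi_i = \frac{\sum_{\{u, u'\} \in E_A} P(B = i \mid s_A(u, u'))}{|E_A|} \tag{8}$$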
From Formulas 6 and 7, we can see that, when updating $\mu_0$ and $\sigma_0$ ($\mu_1$ and $\sigma_1$), each edge contributes its plausibility proportionally to the edge’s posterior probability of being original (fake). Meanwhile, the process for updating the prior probability, i.e., Formula 8, is a summation over all edges’ posterior probabilities normalized by the total number of edges in $G_A$, i.e., $|E_A|$.
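Fake edge detection. After the GMM has been learned, we compute, for each edge $\{u, u'\} \in E_A$, its posterior probabilities of being original and fake:
$$P(B = 0 \mid s_A(u, u')) \quad \text{and} \quad P(B = 1 \mid s_A(u, u')),$$
and pick the one that is maximum (MAP estimate): if $P(B = 1 \mid s_A(u, u')) > P(B = 0 \mid s_A(u, u'))$, we predict $\{u, u'\}$ to be fake, and vice versa. In the end, we delete all the predicted fake edges, and obtain the recovered graph $G_R$ from $G_A$.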
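The EM procedure and the MAP decision can be sketched for one-dimensional plausibility values as follows. This is an illustration under stated assumptions, not the authors’ implementation: the initialization is deterministic here for reproducibility (the text uses random initial values), the component with the lower mean is assumed to be the fake one, and the synthetic data only mimics the means reported for Figure 6.

```python
import math
import random


def gaussian(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def fit_gmm(values, tol=1e-3, max_iter=500):
    """Fit a two-component 1-D GMM with EM.

    Returns [(pi, mu, sigma), (pi, mu, sigma)], sorted so that index 0 has
    the higher mean (assumed to be the original edges).
    """
    # Assumption: deterministic start instead of random initial values.
    params = [(0.5, max(values), 0.2), (0.5, min(values), 0.2)]
    prev_loglik = None
    for _ in range(max_iter):
        # E-step: posterior of each component for every value.
        posts, loglik = [], 0.0
        for x in values:
            weighted = [p * gaussian(x, m, s) for p, m, s in params]
            total = sum(weighted)
            loglik += math.log(total)
            posts.append([w / total for w in weighted])
        # Stop when consecutive log-likelihoods differ by less than tol.
        if prev_loglik is not None and abs(loglik - prev_loglik) < tol:
            break
        prev_loglik = loglik
        # M-step: posterior-weighted maximum likelihood updates.
        params = []
        for i in range(2):
            mass = sum(p[i] for p in posts)
            mu = sum(p[i] * x for p, x in zip(posts, values)) / mass
            var = sum(p[i] * (x - mu) ** 2 for p, x in zip(posts, values)) / mass
            params.append((mass / len(values), mu, max(math.sqrt(var), 1e-6)))
    return sorted(params, key=lambda c: c[1], reverse=True)


def is_fake(x, params):
    """MAP decision: fake if the fake component has the larger posterior."""
    (pi0, mu0, s0), (pi1, mu1, s1) = params
    return pi1 * gaussian(x, mu1, s1) > pi0 * gaussian(x, mu0, s0)


# Synthetic plausibility values (hypothetical data, not from the paper).
rng = random.Random(0)
values = [rng.gauss(0.6, 0.1) for _ in range(300)]
values += [rng.gauss(0.05, 0.15) for _ in range(200)]
params = fit_gmm(values)
```

Because the posterior denominators are identical for both components, comparing the weighted densities is equivalent to comparing the posteriors. No labels are used at any point, which reflects the unsupervised nature of the approach.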
We train GMMs for all the datasets under both anonymization mechanisms. As we now make a concrete prediction on whether an edge is fake, we adopt two classical binary classification metrics, i.e., precision and recall, for evaluation. To further demonstrate the effectiveness of our approach, we build a baseline model that consists in randomly sampling the same number of edges as our MAP estimator predicts to be fake and in classifying them as fake.
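The two evaluation metrics and the random baseline can be written down in a few lines. The sketch assumes edges are represented as hashable tuples and that the ground-truth set of fake edges is known for evaluation; the names and the toy edges are hypothetical.

```python
import random


def precision_recall(predicted_fake, true_fake):
    """Precision and recall of a set of edges predicted to be fake."""
    tp = len(predicted_fake & true_fake)
    precision = tp / len(predicted_fake) if predicted_fake else 0.0
    recall = tp / len(true_fake) if true_fake else 0.0
    return precision, recall


def random_baseline(edges, n_predicted, rng):
    """Label `n_predicted` uniformly sampled edges as fake."""
    return set(rng.sample(sorted(edges), n_predicted))


# Toy example with five edges, four of which are fake.
edges = {("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d")}
true_fake = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")}
predicted = {("a", "b"), ("a", "c"), ("b", "c")}
precision, recall = precision_recall(predicted, true_fake)
baseline = random_baseline(edges, len(predicted), random.Random(0))
```

Sampling the same number of edges as the MAP estimator predicts keeps the baseline comparable in terms of how many edges it removes.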
Table IV presents the prediction results. We first observe that, in most of the cases, our approach achieves a strong prediction. For instance, for the SalaDP-anonymized Facebook1 dataset, the precision reaches 0.948 and the recall 0.827. Another interesting observation is that, when the privacy level increases, i.e., higher $k$ for k-DA and lower $\epsilon$ for SalaDP, our prediction precision increases. The main reason is that higher privacy criteria normally lead to more added fake edges (see Table III). On the other hand, we do not observe a similar trend for recall. In most of the cases, our approach outperforms the random baseline significantly. For instance, our precision is 0.546 in one setting of the k-DA-anonymized Facebook1 dataset, while the precision for the baseline is only 0.042. We also note that, in our worst prediction, i.e., the SalaDP-anonymized Enron dataset, our results are still better than the baseline, but the performance gain is rather small.
VI Privacy Loss Quantification
As shown in Section V, we can achieve high performance in detecting fake edges, meaning that the recovered graph $G_R$ is more similar to the original graph $G$ than $G_A$ is. As fake edges help satisfy certain privacy guarantees, we expect that, by inferring $G_R$ from $G_A$, these guarantees will be violated. In this section, we first define two metrics tailored to each anonymization mechanism for quantifying the privacy loss resulting from our graph recovery. Then, we present the corresponding evaluation results.
VI-A Privacy Loss Metric
k-DA. k-DA assumes that the adversary only has knowledge of her targets’ degrees, and uses this knowledge to re-identify them. This implies that, if the users’ degrees in the recovered graph are more similar to those in the original graph than the degrees in the anonymized graph are, then the adversary is more likely to achieve her goal. Therefore, we propose to compute users’ degree difference between the anonymized graph and the original graph, as well as between the recovered graph and the original graph, to measure the privacy loss caused by our graph recovery for k-DA. Formally, we define users’ degree difference between the anonymized graph and the original graph as the mean value of all users’ degree differences between these two graphs, and define users’ degree difference between the recovered graph and the original graph similarly.
Note that our algorithm also deletes some of the users’ original edges when recovering the graph (i.e., the false positives). Therefore, if the adversary relies on the users’ exact degrees (as assumed in k-DA) to de-anonymize them, she might fail. However, a sophisticated adversary can apply extra heuristics, such as tolerating some degree differences, to find her targets. In this case, a degree difference that is smaller for the recovered graph than for the anonymized graph can still give the adversary a better chance to achieve her goal. Therefore, we believe our metric is appropriate for measuring the privacy loss resulting from our graph recovery for k-DA.
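The degree difference metric can be sketched as follows. The sketch assumes that a user's degree difference is the absolute difference of her degrees in the two graphs and that graphs are given as lists of undirected edges; both the representation and the function names are hypothetical.

```python
from collections import Counter

def degrees(edges, users):
    """Degree of every user in an undirected edge list (0 if isolated)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return {u: deg[u] for u in users}

def mean_degree_difference(edges_a, edges_b, users):
    """Mean over all users of the absolute degree difference between two graphs."""
    deg_a, deg_b = degrees(edges_a, users), degrees(edges_b, users)
    return sum(abs(deg_a[u] - deg_b[u]) for u in users) / len(users)
```

Calling `mean_degree_difference` once with the anonymized graph and once with the recovered graph, each against the original graph, yields the two values that are compared.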
SalaDP. To quantify the privacy loss for the SalaDP mechanism, we consider the noise added to the dK-2 series of the original graph. Formally, the dK-2 series of the original graph is a set in which each element corresponds to a pair of degrees and represents the number of edges in the original graph that connect nodes of these two degrees.
Accordingly, each element has a corresponding number in the anonymized graph and in the recovered graph. We then consider, for each element, the noise added to it when transforming the original graph into the anonymized graph, and the (lower) noise resulting from our graph recovery. Since SalaDP is a statistical mechanism, we sample 100 anonymized graphs by applying SalaDP to the original graph 100 times and produce 100 noise samples for each element of the dK-2 series.
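To make the dK-2 series concrete, the sketch below counts, for an undirected edge list, how many edges connect nodes of each pair of degrees, and computes the per-element noise between two graphs. Treating degree pairs as unordered and taking the union of the elements of both series are assumptions of this example, as are the function names.

```python
from collections import Counter

def dk2_series(edges):
    """Map each (unordered) degree pair to the number of edges joining such nodes."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    series = Counter()
    for u, v in edges:
        series[tuple(sorted((deg[u], deg[v])))] += 1
    return series

def dk2_noise(series_original, series_other):
    """Per-element difference between another graph's series and the original's."""
    keys = set(series_original) | set(series_other)
    return {k: series_other[k] - series_original[k] for k in keys}
```

For a path on three nodes, both edges join a degree-1 node to a degree-2 node, so the series has the single element (1, 2) with value 2.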
We define two metrics for quantifying the privacy loss. In the first metric, we compare the average noise added to the dK-2 series of the original graph by the anonymized graph with that added by the recovered graph. Concretely, for each element of the dK-2 series, we first calculate the average absolute noise added to it over the 100 SalaDP graph samples described above. Then, we compute the overall noise introduced by graph anonymization by aggregating these per-element averages over all elements of the dK-2 series.
We analogously compute the average added noise after our graph recovery.
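A minimal sketch of the first metric follows. It assumes that the noise samples are stored as a dictionary from dK-2 series elements to lists of sampled noise values, and that the per-element averages are aggregated by taking their mean; the aggregation step and the names are assumptions of this example.

```python
def average_absolute_noise(noise_samples):
    """Average absolute noise per dK-2 element over all sampled graphs."""
    return {k: sum(abs(x) for x in xs) / len(xs) for k, xs in noise_samples.items()}

def overall_average_noise(noise_samples):
    """Mean of the per-element average absolute noise (assumed aggregation)."""
    per_element = average_absolute_noise(noise_samples)
    return sum(per_element.values()) / len(per_element)
```

The same two functions apply to the noise samples obtained after graph recovery, so the two overall values can be compared directly.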
For the second metric, we consider the uncertainty introduced by the added noise. Several works in the literature explore the connection between differential privacy and the uncertainty of the output produced by a differentially private mechanism. In general, higher uncertainty implies more privacy. We measure the uncertainty of the noise added by SalaDP by estimating its empirical entropy. To this end, for each element of the dK-2 series, we calculate the Shannon entropy over the frequencies of the values among its 100 noise samples described above. In the end, to obtain the overall empirical entropy for the anonymization, we average over all dK-2 series elements.
We compute the overall empirical entropy resulting from our graph recovery similarly.
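The entropy-based metric can be sketched as follows. The sketch assumes base-2 logarithms and the same dictionary layout for the noise samples as above (dK-2 element to list of sampled noise values); both are assumptions of this example rather than details stated in the text.

```python
import math
from collections import Counter

def empirical_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of the samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def overall_entropy(noise_samples):
    """Average empirical entropy over all dK-2 series elements."""
    return sum(empirical_entropy(xs) for xs in noise_samples.values()) / len(noise_samples)
```

An element whose noise samples are all identical contributes zero entropy, while an element whose samples are spread over many distinct values contributes more.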
k-DA. Table V presents the results of the users’ degree differences. We observe that, in all cases, the degree difference for the recovered graph is smaller than that for the anonymized graph. This indicates that the adversary has a better chance to identify her targets from the recovered graph than from the anonymized graph, and demonstrates that our attack clearly decreases the privacy provided by k-DA. It also appears that our graph recovery benefits the adversary least on Facebook1, i.e., the degree difference for the recovered graph is closer to that for the anonymized graph. This is essentially because the original Facebook1 dataset already preserves a high degree of k-degree anonymity, as we can see from the small fraction of edges added by k-DA (Table III).
SalaDP. Table VI presents the average noise added to the dK-2 series of the original graph by the anonymized and recovered graphs. In all cases, the average noise for the recovered graph is smaller than that for the anonymized graph, showing that our recovery mechanism reduces the average noise privacy metric for SalaDP. Further, we can observe that the relative reduction of the average noise with our recovery attack in general decreases with increasing ε: the added noise is already much smaller (for larger ε) and cannot be reduced much more.
Table VII presents the average entropy of the noise added to the dK-2 series of the original graph after applying SalaDP and after the recovery with our approach. Note that, while one would expect higher entropy for smaller values of ε, this does not hold true in practice because the SalaDP mechanism is not necessarily optimal with respect to the added uncertainty. Still, across all values of ε and across all datasets, we can observe a reduction of the empirical entropy, and therefore a reduction of this privacy metric. The relative reduction, however, varies between the values of ε and, as for the average noise above, between the datasets.
For now, it seems unclear how these various factors impact the relative reduction of empirical entropy. Analyzing the impact of these parameters on the relative reduction of empirical entropy (and average noise in the case of network structure) could provide further insights into the recoverability of anonymized graphs. Such work is, however, orthogonal to the work presented in this paper and provides an interesting direction for future work.
Note that, while our recovery mechanism does indeed reduce both privacy metrics (the average noise and the uncertainty of the added noise), it cannot violate the differential privacy of SalaDP, since differential privacy is known to be closed under post-processing. We can instead offer two potential explanations. First, the noising mechanism used in SalaDP might not be optimal with regard to entropy, thus allowing for its reduction without breaking differential privacy. Second, the generation of the anonymized graphs from the noised dK-2 series generated by SalaDP adds additional entropy that we are able to reduce with our plausibility metric (Section III).
VII Enhancing Graph Anonymization Mechanisms
We have so far demonstrated a fundamental vulnerability of the current graph anonymization mechanisms due to overlooking the structural plausibility of fake edges when creating them. In this section, we take the first step towards mitigating this weakness by generating more plausible edges. We start by presenting our solutions for enhancing the two graph anonymization mechanisms, then evaluate the performance of fake edge detection as well as the graph utility with the enhanced mechanisms.
As discussed in Section II, both k-DA and SalaDP create fake edges without considering their plausibility. This is essentially what makes our fake edge detection possible. To improve the anonymization mechanisms, we should intuitively add fake edges that are more similar to the edges in the original graph with respect to edge plausibility.
Figure 8 depicts the edge plausibility distribution for the original Facebook1 dataset. (We map all users in the original graph into vectors and compute the plausibility of all its edges following the same procedure as for the anonymized graph, described in Section III.) The distributions for Enron and Facebook2 are in the appendix. Note that the vector dimension used for k-DA differs from the one used for SalaDP; these choices follow the hyperparameter sensitivity study in Section IV. We observe that both empirical edge plausibility distributions follow a Gaussian distribution. If we are able to modify the current graph anonymization mechanisms such that the plausibility of the added fake edges is more likely to come from the same Gaussian distribution, then it should be harder to discover these fake edges, i.e., our fake edge detection’s performance will decrease.
The general procedure for our enhanced graph anonymization mechanisms works as follows. We first apply maximum likelihood estimation to learn the Gaussian distribution of edge plausibility in the original graph. Then, we conduct the same process as in k-DA and SalaDP: we loop through all the users and, in each iteration, if a user needs fake edges, we construct a candidate set for her that includes all the potential users who could share a fake edge with her. In contrast to the original approaches of k-DA and SalaDP for choosing users out of the candidate set, we compute the plausibility between the user and each candidate (the plausibility is computed over the users’ learned vectors). Then, for each plausibility value, we calculate its density under the previously learned Gaussian distribution, and treat the density as the weight of the corresponding candidate. Next, we perform a weighted sampling to choose the required number of users out of the candidate set and add edges between these users and the user in question. In the end, we obtain our new anonymized graph under the enhanced mechanisms. As our solutions do not involve any modifications of the privacy parameters of k-DA and SalaDP, they do not change the privacy guarantees provided by the original algorithms. We will make the source code for the aforementioned enhanced versions of k-DA and SalaDP publicly available.
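The selection step of this procedure can be sketched as follows. This is an illustration under stated assumptions, not the code to be released: the Gaussian is fitted by its maximum likelihood estimates (sample mean and population variance), candidates are drawn without replacement, and all function and parameter names are hypothetical.

```python
import math
import random

def fit_gaussian(values):
    """Maximum likelihood estimates (mean, std) of a 1-D Gaussian."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def gaussian_density(x, mean, std):
    """Density of a one-dimensional Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def sample_partners(candidates, plausibility, mean, std, n_edges, rng):
    """Pick up to n_edges candidates, weighted by the density of their plausibility."""
    pool = list(candidates)
    weights = [gaussian_density(plausibility[c], mean, std) for c in pool]
    chosen = []
    while pool and len(chosen) < n_edges and sum(weights) > 0:
        i = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(i))  # without replacement: an edge is added once
        weights.pop(i)
    return chosen
```

Here `plausibility` maps each candidate to the plausibility of the edge between her and the user currently being processed, and `rng` is a `random.Random` instance so that runs are reproducible.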
Note that, as presented in Section II, for a given user, SalaDP chooses users from her candidate set at random, while k-DA picks the users with the highest residual degrees. However, k-DA takes this approach only to construct the anonymized graph efficiently; it is not related to any privacy guarantee. Through experiments, we find that our enhanced k-DA can also build the anonymized graph in a similar time.
Fake edge detection. After obtaining the enhanced anonymized graph, we perform the same process as described in Section III to compute the plausibility of all its edges. Then, we calculate the AUC values when using plausibility to differentiate between fake and original edges in this graph. The results for both enhanced anonymization mechanisms are presented in Table VIII.
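For illustration, the AUC of plausibility as a detector of fake edges can be computed with the rank-based formulation sketched below. It assumes that a lower plausibility indicates a fake edge and counts ties as one half; the function name is hypothetical, and the quadratic loop is only meant for small examples.

```python
def auc_fake_vs_original(fake_scores, original_scores):
    """Probability that a random fake edge scores lower than a random original edge."""
    wins = 0.0
    for f in fake_scores:
        for o in original_scores:
            if f < o:
                wins += 1.0
            elif f == o:
                wins += 0.5
    return wins / (len(fake_scores) * len(original_scores))
```

A value of 1.0 means that plausibility separates the two classes perfectly, while 0.5 corresponds to random guessing.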
First of all, the AUC values drop in all cases compared to the results in Table II. In particular, for the k-DA-anonymized Facebook1 dataset (), AUC drops by 35%, to 0.63. This can also be observed from the histograms in Figure 9: by plausibility, fake edges are hidden quite well among the original edges. On the other hand, the performance drops for SalaDP-anonymized datasets are smaller, but still quite significant. Moreover, given that the plausibility histograms (Figure 10) show that the two Gaussian distributions largely overlap, the Gaussian mixture model approach described in Section V will not help much with fake edge detection. (The corresponding histograms for Enron and SalaDP datasets are depicted in the appendix.) Our experiments with the GMM approach only achieved around 27% precision for SalaDP () with Facebook1, which represents a decrease of almost 50% for around the same recall. When applying our enhanced k-DA mechanism to the Facebook2 dataset, the experimental results also drop, but less than for Facebook1. This may be due to the dataset’s small size (4,039 users) and the large k value, which leads to a large number of fake edges compared to the original number.
It is worth noting that all the edges added by our enhanced anonymization mechanisms still have relatively lower plausibility than the original edges. Given that our weighted sampling follows the plausibility distribution of the original graph, this implies that not many potential fake edges are normal with respect to plausibility. In consequence, it is rather hard to make fake edges totally indistinguishable from original edges.
Graph utility. The main motivation for OSNs or any data holder to share their graph data is to allow third parties to conduct research or build appealing applications. Therefore, a graph anonymization mechanism needs to take into account graph utility, i.e., how well the anonymized graph preserves the structural properties of the original graph. To show that our enhancing technique is an improvement over the current graph anonymization mechanisms, we also evaluate the utility of the enhanced anonymized graph.
[Table IX: graph utility results, with columns for degree distribution, eigencentrality, and triangle count.]
For the sake of conciseness, we focus on three graph properties: degree distribution, eigencentrality, and triangle count. The degree distribution is a general structural property of graphs, and is essentially a list with each element representing the proportion of users with a certain degree. Eigencentrality is a classical measure to evaluate the influence/importance of each user in a graph. It assigns a centrality score to each user based on the eigenvector of the graph’s adjacency matrix. Triangle count summarizes the number of triangles each user belongs to in a graph, thus reflecting the connectivity of the graph. We compute the three properties for the original graph, the anonymized graph, and the enhanced anonymized graph, and calculate the cosine similarity between the properties of the original graph and those of each of the two anonymized graphs. Higher similarity implies that the two graphs are more similar with respect to the property, and thus that higher utility is preserved.
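As an example of this utility comparison, the sketch below computes the degree distribution of a graph as a fixed-length list and the cosine similarity between two such lists. Padding both distributions to a common maximum degree and returning 0.0 for an all-zero vector are choices made for this example; the function names are hypothetical.

```python
import math
from collections import Counter

def degree_distribution(edges, users, max_degree):
    """Proportion of users with degree 0, 1, ..., max_degree."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    histogram = Counter(deg[u] for u in users)
    return [histogram[d] / len(users) for d in range(max_degree + 1)]

def cosine_similarity(a, b):
    """Cosine similarity between two equally long numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

The same `cosine_similarity` function applies to per-user eigencentrality or triangle-count vectors, provided the users are listed in the same order for both graphs.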
Table IX presents the evaluation results. First of all, we obtain a strong similarity between the enhanced anonymized graph and the original graph for all graph properties, i.e., the enhanced anonymized graph preserves high utility. For instance, the cosine similarity for triangle count is above 0.96 in most of the cases for our enhanced SalaDP mechanism. On the other hand, the lowest cosine similarity (degree distribution) is still approaching 0.7 when applying enhanced k-DA () to anonymize the Facebook2 dataset. More importantly, we observe that, in most cases, the enhanced anonymized graph preserves more graph utility than the graph produced by the original mechanism. For instance, the eigencentrality’s cosine similarity between the original graph and the enhanced anonymized graph is 0.985, while the similarity between the original graph and the anonymized graph is only 0.836 for the k-DA-anonymized Facebook1 dataset (). This is because the fake edges added by our enhanced mechanisms are more structurally similar to the original edges, and thus preserve more utility.
In conclusion, our enhanced graph anonymization mechanisms can keep the same privacy properties as the original mechanisms, make the anonymized graph less vulnerable to graph recovery, and preserve better graph utility.
VIII Related Work
The rapid development of online social networks has raised serious concerns about their users’ privacy. Researchers have studied this topic from various perspectives, such as information inference [29, 30], scam detection [31, 32], user identity linkage [33, 34], and social graph anonymization. This paper falls into the domain of graph anonymization.
One class of graph anonymization mechanisms follows the concept of k-anonymity in database privacy. Liu and Terzi [10] propose the first mechanism in this direction, i.e., k-DA. Meanwhile, Zhou and Pei [18] propose k-neighborhood anonymity, where each user in the anonymized graph shares the same neighborhood, i.e., the sub-social network among her friends, with at least k-1 other users. The authors adopt minimum BFS coding to represent each user’s neighborhood, then rely on a greedy match to realize k-neighborhood anonymity. Other k-anonymity based mechanisms include [35, 36].
Another class of graph anonymization mechanisms is inspired by differential privacy. Besides SalaDP, multiple solutions have been proposed [37, 38, 13]. For instance, one of these solutions is a 2K-graph generation model that achieves differential privacy, where noise is added based on smooth sensitivity. Xiao et al. [13] encode users’ connection probabilities with a hierarchical random graph model, and perform Markov chain Monte Carlo to sample a possible graph structure from the model while enforcing differential privacy.
In contrast to the above two classes, Mittal et al. [12] propose a random walk based mechanism. We also experiment with this mechanism, and discover that with only a 4-step walk, 92% of the original edges in the Facebook1 dataset are replaced with fake ones, thus substantially degrading the graph’s original utility. Therefore, it becomes nearly impossible to recover the original graph, and, by carrying out our attack against this mechanism, we achieve a low AUC of around 0.7. Besides the above, other graph anonymization techniques include [39, 40].
Note that, due to space constraints, we only consider the two most widely known anonymization mechanisms, k-DA and SalaDP, to study the possibility of recovering the original graph from the anonymized graph. In the future, we plan to apply our approach to more anonymization mechanisms.
Besides anonymization, graph de-anonymization has been extensively studied as well. Backstrom et al. [9] are among the first to de-anonymize users in a naively anonymized social graph. Narayanan and Shmatikov [41] propose a general framework where they assume that an attacker has an auxiliary graph and tries to map users in the auxiliary graph to the anonymized graph for de-anonymization. Narayanan and Shmatikov’s approach relies on an initial seed mapping between the auxiliary graph and the anonymized graph, and a self-reinforcing algorithm to match users. Inspired by [41], multiple de-anonymization attacks have been proposed, such as [3, 42, 43, 44, 45]. Evaluating if, and to what extent, our graph recovery attack can further help improve de-anonymization attacks is an interesting line of future research. It could further demonstrate the concrete privacy impact of the approach proposed in this work.
In this paper, we identify a fundamental vulnerability of existing graph anonymization mechanisms that do not take into account key structural characteristics of social graphs when generating fake edges. We propose an edge plausibility metric based on graph embedding to exploit this weakness: our extensive experiments show that, using this metric, we are able to recover the original graph to a large degree from graphs anonymized by the k-DA and SalaDP mechanisms. Our graph recovery also results in significant privacy damage to the original anonymization mechanisms, which we quantify using privacy metrics suited to the respective anonymization mechanisms. To mitigate this weakness, we propose enhancements for k-DA and SalaDP that take into account the plausibility of potential fake edges before adding them to the graph. Our evaluation shows that these enhanced mechanisms significantly reduce the performance of our graph recovery and, at the same time, provide better graph utility.
In addition to the future directions we already discussed throughout the paper, there are two other directions we plan to pursue. First, we concentrate only on identifying fake added edges in this paper. The detection of deleted edges is another interesting direction to pursue. One solution would be to rely on link prediction methods. However, as social networks typically exhibit power law node degree distributions, the search space for potential deleted edges is very large. Second, we measure an edge’s plausibility solely based on its two users’ structural proximity. It is unclear if, and which, additional edge properties might also contribute to its plausibility. Taking into account these additional properties may further increase the performance of our graph recovery.
- [1] https://newsroom.fb.com/company-info/, 2017.
- [2] https://instagram-press.com/our-story/, 2017.
- [3] M. Srivatsa and M. Hicks, “Deanonymizing Mobility Traces: Using Social Network as a Side-channel,” in Proceedings of the 2012 ACM Conference on Computer and Communications Security (CCS). ACM, 2012, pp. 628–637.
- [4] B. Klimt and Y. Yang, “Introducing the Enron Corpus.”
- [5] H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King, “Recommender Systems with Social Regularization,” in Proceedings of the 4th ACM International Conference on Web Search and Data Mining (WSDM). ACM, 2011, pp. 287–296.
- [6] A. Bruns and S. Stieglitz, “Quantitative Approaches to Comparing Communication Patterns on Twitter,” Journal of Technology in Human Services, vol. 30, no. 3-4, pp. 160–185, 2012.
- [7] D. Kempe, J. Kleinberg, and É. Tardos, “Maximizing the Spread of Influence through a Social Network,” in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM, 2003, pp. 137–146.
- [8] L. F. Berkman, I. Kawachi, and M. M. Glymour, Social Epidemiology. Oxford University Press, 2014.
- [9] L. Backstrom, C. Dwork, and J. Kleinberg, “Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography,” in Proceedings of the 16th International Conference on World Wide Web (WWW). ACM, 2007, pp. 181–190.
- [10] K. Liu and E. Terzi, “Towards Identity Anonymization on Graphs,” in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD). ACM, 2008, pp. 93–106.
- [11] A. Sala, X. Zhao, C. Wilson, H. Zheng, and B. Y. Zhao, “Sharing Graphs using Differentially Private Graph Models,” in Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference (IMC). ACM, 2011, pp. 81–98.
- [12] P. Mittal, C. Papamanthou, and D. Song, “Preserving Link Privacy in Social Network Based Systems,” in Proceedings of the 20th Network and Distributed System Security Symposium (NDSS), 2013.
- [13] Q. Xiao, R. Chen, and K.-L. Tan, “Differentially Private Network Data Release via Structural Inference,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM, 2014, pp. 911–920.
- [14] S. Ji, P. Mittal, and R. Beyah, “Graph Data Anonymization, De-Anonymization Attacks, and De-Anonymizability Quantification: A Survey,” IEEE Communications Surveys & Tutorials, vol. 19, no. 2, pp. 1305–1326, 2016.
- [15] D. Liben-Nowell and J. Kleinberg, “The Link-Prediction Problem for Social Networks,” Journal of the American Society for Information Science and Technology, vol. 58, no. 7, pp. 1019–1031, 2007.
- [16] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online Learning of Social Representations,” in Proceedings of the 20th ACM Conference on Knowledge Discovery and Data Mining (KDD). ACM, 2014, pp. 701–710.
- [17] A. Grover and J. Leskovec, “node2vec: Scalable Feature Learning for Networks,” in Proceedings of the 22nd ACM Conference on Knowledge Discovery and Data Mining (KDD). ACM, 2016, pp. 855–864.
- [18] B. Zhou and J. Pei, “Preserving Privacy in Social Networks Against Neighborhood Attacks,” in Proceedings of the IEEE 24th International Conference on Data Engineering (ICDE). IEEE, 2008, pp. 506–515.
- [19] S. Ji, W. Li, P. Mittal, X. Hu, and R. Beyah, “SecGraph: A Uniform and Open-source Evaluation System for Graph Data Anonymization and De-anonymization,” in Proceedings of the 24th USENIX Security Symposium (SEC). USENIX, 2015, pp. 303–318.
- [20] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” in Proceedings of the 1st International Conference on Learning Representations (ICLR), 2013.
- [21] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed Representations of Words and Phrases and their Compositionality,” in Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS). NIPS, 2013, pp. 3111–3119.
- [22] Stanford Large Network Dataset Collection. https://snap.stanford.edu/data/.
- [23] B. Viswanath, A. Mislove, M. Cha, and K. P. Gummadi, “On the Evolution of User Interaction in Facebook,” in Proceedings of the 2nd ACM SIGCOMM Workshop on Social Networks (WOSN). ACM, 2009, pp. 37–42.
- [24] http://gim.unmc.edu/dxtests/roc3.htm.
- [25] M. Backes, P. Berrang, A. Hecksteden, M. Humbert, A. Keller, and T. Meyer, “Privacy in Epigenetics: Temporal Linkability of MicroRNA Expression Profiles,” in Proceedings of the 25th USENIX Security Symposium (Security). USENIX, 2016, pp. 1223–1240.
- [26] A. McGregor, I. Mironov, T. Pitassi, O. Reingold, K. Talwar, and S. Vadhan, “The Limits of Two-Party Differential Privacy,” in Proceedings of the IEEE 51st Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2010, pp. 81–90.
- [27] C. Dwork and A. Roth, “The Algorithmic Foundations of Differential Privacy,” Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3-4, pp. 211–407, 2014.
- [28] J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of Massive Datasets. Cambridge University Press, 2014.
- [29] N. Z. Gong and B. Liu, “You Are Who You Know and How You Behave: Attribute Inference Attacks via Users’ Social Friends and Behaviors,” in Proceedings of the 25th USENIX Security Symposium (SEC). USENIX, 2016, pp. 979–995.
- [30] J. Jia, B. Wang, L. Zhang, and N. Z. Gong, “AttriInfer: Inferring User Attributes in Online Social Networks Using Markov Random Fields,” in Proceedings of the 26th International Conference on World Wide Web (WWW). ACM, 2017, pp. 1561–1569.
- [31] G. Stringhini, C. Kruegel, and G. Vigna, “Detecting Spammers on Social Networks,” in Proceedings of the 26th Annual Computer Security Applications Conference (ACSAC). ACM, 2010, pp. 1–9.
- [32] M. Egele, G. Stringhini, C. Kruegel, and G. Vigna, “Compa: Detecting Compromised Accounts on Social Networks,” in Proceedings of the 20th Network and Distributed System Security Symposium (NDSS), 2013.
- [33] G. Venkatadri, O. Goga, C. Zhong, B. Viswanath, K. P. Gummadi, and N. Sastry, “Strengthening weak identities through inter-domain trust transfer,” in Proceedings of the 25th International Conference on World Wide Web (WWW). ACM, 2016, pp. 1249–1259.
- [34] M. Backes, P. Berrang, O. Goga, K. P. Gummadi, and P. Manoharan, “On Profile Linkability despite Anonymity in Social Media Systems,” in Proceedings of the 2016 ACM on Workshop on Privacy in the Electronic Society (WPES). ACM, 2016, pp. 25–35.
- [35] L. Zou, L. Chen, and M. T. Özsu, “K-Automorphism: A General Framework for Privacy Preserving Network Publication,” Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 946–957, 2009.
- [36] J. Cheng, A. W. chee Fu, and J. Liu, “K-Isomorphism: Privacy Preserving Network Publication against Structural Attacks,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD). ACM, 2010, pp. 459–470.
- [37] D. Proserpio, S. Goldberg, and F. McSherry, “A Workflow for Differentially-private Graph Synthesis,” in Proceedings of the 2012 ACM workshop on Workshop on Online Social Networks (WOSN). ACM, 2012, pp. 13–18.
- [38] Y. Wang and X. Wu, “Preserving Differential Privacy in Degree-correlation based Graph Generation,” Transactions on Data Privacy, vol. 6, no. 2, p. 127, 2013.
- [39] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis, “Resisting Structural Re-identification in Anonymized Social Networks,” Proceedings of the VLDB Endowment, vol. 1, no. 1, pp. 102–114, 2008.
- [40] S. Bhagat, G. Cormode, B. Krishnamurthy, and D. Srivastava, “Class-based Graph Anonymization for Social Network Data,” Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 766–777, 2009.
- [41] A. Narayanan and V. Shmatikov, “De-anonymizing Social Networks,” in Proceedings of the 30th IEEE Symposium on Security and Privacy (S&P). IEEE, 2009, pp. 173–187.
- [42] P. Pedarsani, D. R. Figueiredo, and M. Grossglauser, “A Bayesian Method for Matching Two Similar Graphs without Seeds,” in Proceedings of 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2013, pp. 1598–1607.
- [43] S. Ji, W. Li, M. Srivatsa, and R. Beyah, “Structural Data De-anonymization: Quantification, Practice, and Implications,” in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS). ACM, 2014, pp. 1040–1053.
- [44] S. Nilizadeh, A. Kapadia, and Y.-Y. Ahn, “Community-enhanced De-anonymization of Online Social Networks,” in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS). ACM, 2014, pp. 537–548.
- [45] K. Sharad and G. Danezis, “An Automated Social Graph De-anonymization Technique,” in Proceedings of the 13th Workshop on Privacy in the Electronic Society (WPES). ACM, 2014, pp. 47–58.