Publishing Community-Preserving Attributed Social Graphs with a Differential Privacy Guarantee

08/31/2019
by   Xihui Chen, et al.
University of Luxembourg

We present a novel method for publishing differentially private synthetic attributed graphs. Unlike preceding approaches, our method is able to preserve the community structure of the original graph without sacrificing the ability to capture global structural properties. Our proposal relies on C-AGM, a new community-preserving generative model for attributed graphs. We equip C-AGM with efficient methods for attributed graph sampling and parameter estimation. For the latter, we introduce differentially private computation methods, which allow us to release community-preserving synthetic attributed social graphs with a strong formal privacy guarantee. Through comprehensive experiments, we show that our new model outperforms its most relevant counterparts in synthesising differentially private attributed social graphs that preserve the community structure of the original graph, as well as degree sequences and clustering coefficients.

1 Introduction

The use of online social networks (OSNs) has grown steadily in recent years, and is expected to continue growing in the future. Billions of people share many aspects of their lives on OSNs and use these systems to interact with each other on a regular basis. The ubiquity of OSNs has turned them into one of the most important sources of data for the analysis of social phenomena. Such analyses have led to significant findings used in a wide range of applications, from efficient epidemic disease control [22, 5] to information diffusion [44, 13].

Despite the undeniable social benefits that can be obtained from social network analysis, access to such data by third parties such as researchers and companies should understandably be limited due to the sensitivity of the information stored in OSNs, e.g. personal relationships, political preferences and religious affiliations. In addition, growing public awareness of privacy and the entry into force of strong privacy regulations such as GDPR [1] reinforce the reluctance of OSN owners to release their data. Therefore, it is of critical importance to provide mechanisms for privacy-preserving data publication to encourage OSN owners to release data for analysis.

Social graphs are a natural representation of social networks, with nodes corresponding to participants and edges to connections between participants. In view of the privacy discussion, social network owners should only release sanitised samples of the underlying social graphs. However, it has been shown that even social graphs containing only structural information remain vulnerable to privacy attacks leveraging knowledge from public sources [29], deploying sybil accounts [2, 24], etc. In order to prevent such attacks, a large number of graph anonymisation methods have been devised. Initially, the proposed methods focused on editing the original graph via vertex/edge additions and deletions until obtaining a graph satisfying some privacy property. A critical limitation of graph editing methods is their reliance on assumptions about the adversary's knowledge, which determine the information that needs to be anonymised and thus the manner in which privacy is enforced. To avoid this type of assumption, an increasingly popular trend is that of using semantic privacy notions, which place formal privacy guarantees on the data processing algorithms rather than the dataset. Among semantic privacy notions, differential privacy [8] has become the de facto standard due to its strong privacy guarantees.

According to the type of published data, we can divide differentially private mechanisms for social graphs into two classes. The methods in the first category directly release specific statistics of the underlying social graph, e.g. the degree sequence [16, 9] or the number of specific subgraphs (triangles, stars, etc.) [43]. The second family of methods focuses on publishing synthetic social graphs as a replacement of real social networks in a two-step process [27, 35, 36, 39]. In the first step, differentially private methods are used to compute the parameters of a generative graph model that accurately captures the original graph properties. Then, in the second step, this model is sampled for synthetic graphs, profiting from the fact that the result of post-processing the output of differentially private algorithms remains differentially private [17].

Differential privacy requires one to define a privacy budget in advance, which determines the amount of perturbation that will be applied to the outputs of algorithms. In consequence, the methods in the first family need to either limit in advance the number of queries that will be answered or deliver increasingly lower quality answers. On the contrary, the methods in the second family can devote the entire privacy budget to the model parameter estimation, without further degradation of the privacy of the sampled graphs. For this reason, in this paper we focus on the second type of methods.

For analysts, the utility of synthetic graphs is determined by the ability of the graph models to capture relevant properties of the original graph. To satisfy this need, several graph models have been proposed to accurately capture global structural properties such as degree distributions and clustering coefficients, as well as heterogeneous attributes of the users such as gender, education or marital status. A common limitation of the aforementioned approaches is their inability to represent an important type of information: the community structure. Informally, a community is a set of users who are substantially more interrelated among themselves than to other users of the network. This interrelation may, e.g., stem from the explicit existence of relations between the users. An example of such a community is a group of Gmail users who frequently e-mail each other, as represented by the occurrence of a large number of edges connecting the user nodes from the group. Alternatively, interrelations may stem from the co-occurrence of relevant features, such as users working at the same company or alumni from the same university. The emergence of communities has been documented to be an inherent property of social networks [34, 41]. For analysts, the availability of synthetic attributed graphs that preserve the community structure of the original graph represents an opportunity to improve existing applications. For example, they may be able to improve online shopping recommendations based on the common purchases of users belonging to the same community. Current models and methods are insufficient for enabling such an analysis, as they either lack information about the community structure or they lack vertex features.

In this paper, we address the problem discussed in the previous paragraph by introducing a new generative attributed graph model, C-AGM (short for Community-Preserving Attributed Graph Model), which, in addition to global structural properties, is also capable of preserving the community structure of the original graph. C-AGM is based on the attributed graph model AGM [11], and improves on it by incorporating the capability of preserving the number and sizes of the communities of the original graph, as well as the densities of intra- and inter-community connections (that is, connections between nodes belonging to the same community or to different communities, respectively). C-AGM also preserves a number of statistics describing the correlations between the feature vectors that describe the users and the existence of connections between pairs of users, as well as their community co-affiliation. We equip C-AGM with efficient parameter estimation and graph sampling methods, and provide differentially private variants of the former, which allow us to release synthetic attributed social graphs with a strong privacy guarantee and increased utility with respect to preceding approaches.

Summary of contributions:

  • We propose a new generative attributed graph model, C-AGM, which captures a number of properties of the community structure, as discussed in the previous paragraph, along with global structural properties.

  • We present efficient methods for learning an instance of our model from an input graph and sampling community-preserving synthetic attributed graphs from this instance. We show, via a number of experiments on real-world social networks, that the community structures of synthetic graphs sampled from our model are more similar to those of the original graphs than those of the graphs sampled from previously existing models. Additionally, we show that this behaviour is obtained without sacrificing the ability to preserve global structural features.

  • We devise differentially private methods for computing the parameters of the new model. We demonstrate that our methods are practical in terms of efficiency and accuracy. To support the latter claim, we empirically show that differentially private synthetic attributed graphs generated by our model suffer a reasonably low degradation with respect to their counterparts, in terms of their ability to capture the community structure and structural features of the original graphs.

2 Related Work

Private graph synthesis. The key to synthesising social graphs is the model, which determines both the information embedded in the published graphs and the properties preserved. Mir et al. [27] used the Kronecker graph generative model [20] to generate differentially private graphs. As the Kronecker model cannot accurately capture structural properties, Sala et al. [35] proposed an alternative approach which makes use of the $dK$-graph model. Wang et al. [36] further improved the work of Sala et al. by considering global sensitivity instead of local sensitivity (refer to Section 3 for the definition of sensitivity). Xiao et al. [39] introduced the HRG-graph model [7] and found that it can further reduce the amount of added noise and thus increase the accuracy.

The approaches described so far work on unlabelled graphs. Pfeiffer et al. [11] introduced a new model called AGM, which attaches binary attributes to nodes and captures the correlations between shared attributes and the existence of connections. Jorgensen et al. [12] adopted this model and proposed differentially private methods to accurately estimate the model parameters. They also designed a new graph generation algorithm based on the TCL model [10], which enables the model to sample attributed graphs preserving the clustering coefficient. As discussed previously, C-AGM, the model introduced in this paper, is comparable to this model in preserving global structural properties of the original graphs, but it outperforms it by also capturing the community structure.

Private statistics publishing. Degree sequences and degree correlations are two types of statistics frequently studied in the literature. The general trend in publishing these statistics under differential privacy consists in adding noise to the original sequences and then post-processing the perturbed sequences to enforce or restore certain properties, such as graphicality [16], vertex order in terms of degrees [9], etc. Subgraph count queries, e.g. the number of triangles or $k$-stars, have also received considerable attention. Among the approaches to accurately compute such queries, we have the so-called ladder functions [43] and smooth sensitivity [15, 37].

Community-preserving graph generation models. A number of existing random graph models claim to capture community structure, e.g., BTER [18], ILFR [34], SBM [38] and its variants (e.g., DCSBM [14] and DCPPM [31]). BTER generates community-preserving social graphs given expected node degrees and, for every degree value $d$, the average of the clustering coefficients of the nodes of degree $d$. The model assumes that every community is a set of nodes with degree $d$. In contrast, C-AGM makes no assumptions on the community partition received. Finally, ILFR and the variants of SBM preserve edge densities at the community level but, unlike our new model, they do not preserve the clustering coefficients of the original graph.

3 Preliminaries

3.1 Notation

An attributed graph is represented as a triple $G=(V,E,X)$, where $V=\{v_1,\ldots,v_n\}$ is the set of nodes, $E\subseteq V\times V$ is the set of edges, and $X$ is a binary matrix called the attribute matrix. The $i$-th row of $X$ is the attribute vector of $v_i$, which is individually denoted by $x_i$. Every column of $X$ represents a binary feature, which is set to $1$ (true) or $0$ (false) for each user. For example, if the $j$-th column represents the attribute “owning a car”, $X_{ij}=1$ means that the user represented by $v_i$ owns a car. Non-binary real-life attributes are assumed to be binarised. For example, a binarisation of the integer-valued attribute “age” consists of four binary attributes, each of which indicates whether the age falls within one of four consecutive ranges. The order of the columns of $X$ is fixed, but arbitrary, and has no impact on the results described hereafter. Throughout the paper, we deal with undirected graphs. That is, if $(v_i,v_j)\in E$, then $(v_j,v_i)\in E$. Additionally, we use $A$ to denote the adjacency matrix of the graph.
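As an illustration of the binarisation described above, the following minimal Python sketch builds the rows of a binary attribute matrix from two raw attributes. The age thresholds in `AGE_BINS`, the field names `age` and `owns_car`, and the helper names are all hypothetical choices made for this example; they are not taken from the paper.

```python
# Hypothetical age ranges, chosen only for this illustration.
AGE_BINS = [(0, 17), (18, 34), (35, 59), (60, 200)]


def binarise_age(age):
    """Return one 0/1 entry per age range; exactly one entry is 1."""
    return [1 if low <= age <= high else 0 for low, high in AGE_BINS]


def attribute_matrix(users):
    """Build a binary attribute matrix with one row per user.

    Each user is a dict with an integer 'age' and a boolean 'owns_car'.
    The column order (four age ranges, then car ownership) is arbitrary
    but fixed.
    """
    return [binarise_age(u["age"]) + [int(u["owns_car"])] for u in users]


users = [
    {"age": 25, "owns_car": True},
    {"age": 61, "owns_car": False},
]
X = attribute_matrix(users)
```

Each row of `X` plays the role of one attribute vector, and the last column corresponds to the “owning a car” example.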

We use $\mathcal{C}=\{C_0,C_1,\ldots,C_p\}$, with $C_i\subseteq V$ for every $i\in\{0,1,\ldots,p\}$, to represent a community partition of the attributed graph. As the term suggests, in this paper we assume that $\bigcup_{i=0}^{p}C_i=V$ and that $C_i\cap C_j=\emptyset$ for $i\neq j$. The community $C_0$ has a special interpretation. Since some community detection algorithms assign no community to some vertices, we will use $C_0$ as a “discard” community of unassigned vertices. We do so to avoid having a potentially large number of singleton communities, for which no meaningful co-affiliation statistics can be computed. We use $\vartheta_{\mathcal{C}}(v)$ to denote the community to which the node $v$ belongs in the community partition $\mathcal{C}$. We will use $\vartheta(v)$ for short in cases where the partition is clear from the context.

3.2 Differential Privacy

Differential privacy [8] is a well-studied statistical notion of privacy. The intuition behind it is to randomise the output of an algorithm in such a way that the presence of any individual element in the input dataset has a negligible impact on the probability of observing any particular output. In other words, a mechanism is $\varepsilon$-differentially private if for any pair of neighbouring datasets, i.e. datasets that only differ by one element, the probabilities of obtaining any output are measurably similar. The amount of similarity is determined by the parameter $\varepsilon$, which is commonly called the privacy budget. In what follows, we will use the notation $\mathcal{D}$ for the set of possible datasets, $\mathcal{O}$ for the set of possible outputs, and $D,D'\in\mathcal{D}$ for a pair of neighbouring datasets.

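Definition 1 ($\varepsilon$-differential privacy [8]).

A randomised mechanism $\mathcal{M}:\mathcal{D}\rightarrow\mathcal{O}$ satisfies $\varepsilon$-differential privacy if for every pair of neighbouring datasets $D,D'\in\mathcal{D}$, and for every $O\subseteq\mathcal{O}$, we have

$$\Pr[\mathcal{M}(D)\in O]\leq e^{\varepsilon}\cdot\Pr[\mathcal{M}(D')\in O].$$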

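A number of differentially private mechanisms have been proposed. For queries of the form $q:\mathcal{D}\rightarrow\mathbb{R}^m$, the most widely used mechanism to enforce differential privacy is the so-called Laplace mechanism, which consists in obtaining the (non-private) output of $q$ and adding to every component a carefully chosen amount of random noise, which is drawn from the Laplace distribution

$$\mathrm{Lap}(y)=\frac{\varepsilon}{2\Delta_q}\exp\left(-\frac{\varepsilon\,|y|}{\Delta_q}\right),$$

where $y$ is a real-valued variable indicating the noise to be added, and $\Delta_q$ is a property of the original function called global sensitivity. This property is defined as the largest difference between the outputs of $q$ for any pair of neighbouring datasets, that is

$$\Delta_q=\max_{D,D'}\|q(D)-q(D')\|_1,$$

where $\|\cdot\|_1$ is the $L_1$ norm. For categorical (non-numerical) queries of the form $q:\mathcal{D}\rightarrow\mathcal{A}$, where $\mathcal{A}$ is a finite set of categories, the so-called exponential mechanism [26] is the most commonly used. In this case, for each value $a\in\mathcal{A}$, a score is assigned by a function $u$ (usually called scoring function) quantifying the value’s utility, denoted by $u(D,a)$. The global sensitivity of $u$ is

$$\Delta_u=\max_{a\in\mathcal{A}}\;\max_{D,D'}|u(D,a)-u(D',a)|,$$

and the randomised output $a$ is drawn with probability

$$\frac{\exp\left(\frac{\varepsilon\,u(D,a)}{2\Delta_u}\right)}{\sum_{a'\in\mathcal{A}}\exp\left(\frac{\varepsilon\,u(D,a')}{2\Delta_u}\right)}.$$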

Differentially private methods are composable [25]. That is, given a set of algorithms $\{\mathcal{M}_1,\ldots,\mathcal{M}_k\}$ such that $\mathcal{M}_i$ ($1\leq i\leq k$) satisfies $\varepsilon_i$-differential privacy, if the algorithms are applied sequentially and the results combined by a deterministic method, then the final result satisfies $\left(\sum_{i=1}^{k}\varepsilon_i\right)$-differential privacy. If the algorithms are applied independently on disjoint subsets of the input, then $\left(\max_{1\leq i\leq k}\varepsilon_i\right)$-differential privacy is satisfied. Moreover, post-processing on the output of an $\varepsilon$-differentially private algorithm also satisfies $\varepsilon$-differential privacy if the post-processing is deterministic or randomised with a source of randomness independent from the noise added to the original algorithm [17]. These properties allow us to divide a complex computation, such as the set of model parameters in our case, into a sequence of sub-tasks for which differentially private methods exist or can be more easily developed.
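The mechanisms and the sequential composition rule above can be sketched in a few lines of standard-library Python. This is only a minimal illustration, not the paper's implementation: the function names, the toy counts and scores, the unit sensitivities and the even budget split are assumptions made for the example.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(values, sensitivity, epsilon, rng):
    """Add Laplace(sensitivity / epsilon) noise to every component."""
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]


def exponential_probabilities(scores, sensitivity, epsilon):
    """Selection probabilities of the exponential mechanism.

    `scores` maps each category to its utility score.
    """
    top = max(scores.values())  # shifting by the maximum avoids overflow
    weights = {
        a: math.exp(epsilon * (s - top) / (2.0 * sensitivity))
        for a, s in scores.items()
    }
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def exponential_mechanism(scores, sensitivity, epsilon, rng):
    """Draw one category according to the exponential mechanism."""
    probs = exponential_probabilities(scores, sensitivity, epsilon)
    categories = list(probs)
    return rng.choices(categories, weights=[probs[a] for a in categories], k=1)[0]


rng = random.Random(7)
total_budget = 1.0
# Sequential composition: the two sub-budgets add up to the total budget.
eps_counts = 0.5 * total_budget
eps_choice = 0.5 * total_budget

noisy_counts = laplace_mechanism([120.0, 45.0], 1.0, eps_counts, rng)
probs = exponential_probabilities({"a": 3.0, "b": 1.0}, 1.0, eps_choice)
choice = exponential_mechanism({"a": 3.0, "b": 1.0}, 1.0, eps_choice, rng)
```

With these toy scores, category `a` is favoured over `b` by a factor of $e^{0.5}$, which follows directly from the selection probability of the exponential mechanism.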

In addition to the global sensitivity, a dataset-dependent notion, called local sensitivity [33], has been enunciated. The local sensitivity of query $q$ on a dataset $D$ is defined as

$$LS_q(D)=\max_{D'}\|q(D)-q(D')\|_1,$$

that is, the maximum difference between the output of $q$ on $D$ and those on its neighbouring datasets $D'$. It is simple to see that $\Delta_q=\max_{D\in\mathcal{D}}LS_q(D)$.
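To make the difference between local and global sensitivity concrete, the sketch below computes by brute force the local sensitivity of the triangle-count query on a small graph. It assumes, purely for illustration, that two graphs are neighbours when they differ in exactly one edge; the neighbouring notion actually used in the paper is discussed in Section 5. The helper names and the toy graph are hypothetical.

```python
from itertools import combinations


def count_triangles(nodes, edges):
    """Count triangles in an undirected graph given as a set of frozenset edges."""
    return sum(
        1
        for a, b, c in combinations(nodes, 3)
        if frozenset((a, b)) in edges
        and frozenset((b, c)) in edges
        and frozenset((a, c)) in edges
    )


def local_sensitivity(nodes, edges):
    """Largest change of the triangle count over all one-edge neighbours."""
    base = count_triangles(nodes, edges)
    worst = 0
    for pair in combinations(nodes, 2):
        neighbour = edges ^ {frozenset(pair)}  # toggle a single edge
        worst = max(worst, abs(count_triangles(nodes, neighbour) - base))
    return worst


nodes = [0, 1, 2, 3, 4]
edges = {frozenset(e) for e in [(0, 1), (0, 2), (1, 2), (2, 3)]}
ls = local_sensitivity(nodes, edges)
# Under this neighbouring notion, one edge can take part in at most
# len(nodes) - 2 triangles, which bounds the global sensitivity.
global_bound = len(nodes) - 2
```

On this graph the local sensitivity is 1, whereas the worst case over all graphs with five nodes is 3, so noise calibrated to the global sensitivity would be three times larger.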

4 The C-AGM Model

In this section we give the formal definition of C-AGM. We introduce the methods for sampling synthetic graphs from the model, and describe the methods for learning the model parameters from an attributed graph.

4.1 Overview

Algorithm 1 summarises the process by which C-AGM is used for publishing synthetic attributed graphs. As discussed in [9, 17, 16], synthetic graph generation is done as a post-processing step of the differentially private computation, so the synthetic graphs are also differentially private.

1 Split privacy budget;
2 Obtain differentially private community partition;
3 Differentially-privately estimate C-AGM parameters;
4 for $i\in\{1,\ldots,t\}$ do
5       Sample the attribute matrix $X'$ from C-AGM;
6       Sample the edge set $E'$ from C-AGM;
7       $G'_i\leftarrow(V,E',X')$;
8 end for
Algorithm 1 Given $G$, obtain $t$ differentially private attributed synthetic graphs.

The manner in which the privacy budget is split among the different computations (step 1) is discussed in Section 5. For the differentially private community partition (step 2), we introduce in this paper an extension of the algorithm ModDivisive [32]. The purpose of this extension is to incorporate information from node attributes into the objective function optimised by ModDivisive. We discuss the community partition method in detail in Section 5.1. A thorough description of the parameters of C-AGM is given in Section 4.2, and parameter estimation is discussed in Section 4.4.

Once the model parameters have been estimated, we can sample any number of synthetic attributed graphs from the model, as described in steps 4 to 8 of Algorithm 1. The differentially private parameter estimation methods introduced in this paper use the notion of neighbouring attributed graphs [12], which is discussed in detail in the preamble of Section 5. Under this notion, the existence of relations (edges) and personal characteristics of the network users (feature vectors) are treated as sensitive, but vertex identities are not. Thus, the synthetic graphs generated by Algorithm 1 have the same vertex set as the original graph, whereas the attribute matrix and the edge set are sampled from the model (step 7). For every new synthetic attributed graph, we first sample the attribute matrix, and then this matrix is used, in combination with an edge generation model (Section 4.3.1), to generate the edge set of the synthetic graph. There are two reasons for dividing this process into two steps. The first one is to make the sampling process efficient. The second reason is to profit from the two-step process to enforce the intuition that users with similar features are more likely to be connected in the social network. The attributed graph sampling procedure is discussed in detail in Section 4.3.

4.2 Model Parameters

As we discussed in Section 1, given an attributed graph $G$ and a community partition $\mathcal{C}$ of $G$, the purpose of C-AGM is to capture a number of properties of $\mathcal{C}$ that are overlooked by previously defined models, without sacrificing the ability to capture global structural properties such as degree distributions and clustering coefficients. To that end, C-AGM models the following properties of the community partition:

  1. the number and sizes of communities;

  2. the number of intra-community edges in every community;

  3. the number of inter-community edges;

  4. the distributions of attribute vectors in every community;

  5. the distributions of the so-called attribute-edge correlations [12], for the set of inter-community edges and for the set of intra-community edges in every community.

Graphs generated by C-AGM will have the same number of vertices as the original graph, as well as the same number of communities. Moreover, every community will have the same cardinality as in the original graph, and the same number of intra-community edges. The number of inter-community edges of the generated graph will also be the same as that of the original graph. Notice that the model preserves the total number, but not necessarily the pairwise numbers of inter-community edges for every pair of communities.
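Properties 1 to 3 are plain counts, which the following minimal sketch computes for a toy graph. The function name, the dictionary-based representation of the community partition and the toy data are assumptions made for this illustration.

```python
from collections import Counter


def community_statistics(edges, community_of):
    """Return community sizes, intra-community edge counts per community,
    and the total number of inter-community edges."""
    sizes = Counter(community_of.values())
    intra = {c: 0 for c in sizes}
    inter = 0
    for u, v in edges:
        if community_of[u] == community_of[v]:
            intra[community_of[u]] += 1
        else:
            inter += 1
    return dict(sizes), intra, inter


community_of = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B"}
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (2, 3)]
sizes, intra_edges, inter_edges = community_statistics(edges, community_of)
```

Note that only the total number of inter-community edges is recorded, mirroring the remark above that pairwise counts between communities are not preserved.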

Attribute-edge correlations were defined in [12] as heuristic values for characterising the relation between the feature vectors labelling a pair of vertices and the likelihood that these vertices are connected. They encode the intuition that, for example, co-workers who attended the same university and live near each other are more likely to be friends than persons with fewer features in common, whereas friends are more likely to support the same sports teams or go to the same bars than unrelated persons. In [12], attribute-edge correlations are considered to behave uniformly over the entire graph. Here, we introduce the rationale that they behave differently within different communities, as well as across communities.

A key element in the representation of attribute-edge correlations is the notion of aggregator functions. An aggregator function $\Gamma$ maps a pair of attribute vectors $(x,x')$ of dimensionality $k$ into a value in a discrete range $\mathcal{Y}$, which is used as a descriptor, also called aggregated feature, of the pair $(x,x')$. For example, $\mathcal{Y}$ can contain a set of similarity levels for pairs of feature vectors, such as {low, medium, high}, and $\Gamma$ can map a pair of vectors whose cosine similarity lies in the lowest interval to low, a pair of vectors whose cosine similarity lies in the highest interval to high, etc. Attribute-edge correlations, along with the community-wise distributions of attribute vectors, are useful for analysts, as they allow them to characterise the members of a community in terms of frequently shared features, hypothesise explanations for the emergence of a community, etc.
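The aggregator function from the example can be sketched as follows, together with the empirical distribution of aggregated features over intra-community edges of each community and over inter-community edges. The cut-off values `1/3` and `2/3`, the function names and the use of the key `None` for inter-community edges are assumptions made for this illustration, not values taken from the paper.

```python
import math
from collections import Counter


def cosine_similarity(x, y):
    """Cosine similarity of two binary vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm > 0 else 0.0


def aggregate(x, y, low_cut=1 / 3, high_cut=2 / 3):
    """Map a pair of attribute vectors to a similarity level.

    The two cut-offs are hypothetical.
    """
    s = cosine_similarity(x, y)
    if s < low_cut:
        return "low"
    if s < high_cut:
        return "medium"
    return "high"


def edge_correlations(edges, X, community_of):
    """Empirical distribution of aggregated features, per community for
    intra-community edges and under the key None for inter-community edges."""
    counts = {}
    for u, v in edges:
        key = community_of[u] if community_of[u] == community_of[v] else None
        counts.setdefault(key, Counter())[aggregate(X[u], X[v])] += 1
    return {
        key: {y: n / sum(c.values()) for y, n in c.items()}
        for key, c in counts.items()
    }


X = {0: [1, 1, 0], 1: [1, 1, 0], 2: [0, 1, 1], 3: [0, 0, 1]}
community_of = {0: "A", 1: "A", 2: "A", 3: "B"}
edges = [(0, 1), (1, 2), (2, 3)]
correlations = edge_correlations(edges, X, community_of)
```

In this toy graph, half of the intra-community edges of community `A` join highly similar vertices, while the single inter-community edge falls in the `high` level.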

Formally, a C-AGM model is defined as a quintuple $\mathcal{M}=(V,\mathcal{C},\Theta_M,\Theta_X,\Theta_F)$, where:

  • $V$ is a set of vertices.

  • $\mathcal{C}$ is a community partition of $V$.

  • $\Theta_M$ is an instance of an edge set generative model that preserves properties 1 to 3 of the community partition $\mathcal{C}$, as well as degree distributions and clustering coefficients. The model introduced in this paper is called CPGM, and is described in detail in Section 4.3.1.

  • $\Theta_X$ is an instance of an attribute vector generative model, which aims to preserve property 4. The model defines, for every community $C\in\mathcal{C}$ and every attribute vector $x$, the probability that a vertex in $C$ is labelled with $x$. The model introduced in this paper is described in detail in Section 4.4.2.

  • $\Theta_F$ is an instance of a generative model for attribute-edge correlations, which aims to preserve property 5. This model defines:

    • The discrete range $\mathcal{Y}$ and an aggregator function $\Gamma$.

    • The probability that the attribute vectors labelling the end-vertices of an intra-community edge of $C$ have the aggregated feature $y$, for every community $C\in\mathcal{C}$ and every value $y\in\mathcal{Y}$.

    • The probability that the attribute vectors labelling the end-vertices of an inter-community edge have the aggregated feature $y$, for every value $y\in\mathcal{Y}$.

    The instantiations that we propose for these three components are described in detail in Section 4.4.3.

4.3 Sampling Attributed Graphs from an Instance of C-AGM

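Given a C-AGM model $\mathcal{M}=(V,\mathcal{C},\Theta_M,\Theta_X,\Theta_F)$, an attributed graph $G=(V,E,X)$ is sampled from $\mathcal{M}$ with probability $P(E,X\mid\mathcal{M})$ which, for the sake of tractability, is approximated as

$$P(E,X\mid\mathcal{M})\approx P(X\mid\mathcal{C},\Theta_X)\cdot P(E\mid X,\mathcal{C},\Theta_M,\Theta_F).$$

That is, we first sample from $\Theta_X$ the attribute vectors labelling each vertex and then use them in sampling the edge set. Again, to keep the sampling process tractable, we introduce an additional independence assumption, according to which

$$P(X\mid\mathcal{C},\Theta_X)=\prod_{v_i\in V}P(x_i\mid\vartheta(v_i),\Theta_X),$$

where $\vartheta(v_i)$ denotes the community of $v_i$. The computation of the probabilities of the form $P(x_i\mid\vartheta(v_i),\Theta_X)$ will be discussed in Section 4.4.2. Introducing the assumption that edges are sampled independently from each other, the probability of generating $E$ given $X$, $\mathcal{C}$, $\Theta_M$ and $\Theta_F$ is

$$P(E\mid X,\mathcal{C},\Theta_M,\Theta_F)=\prod_{(v_i,v_j)\in E}P\big((v_i,v_j)\in E\mid x_i,x_j,\mathcal{C},\Theta_M,\Theta_F\big).$$

As it is inefficient to sample edges directly from this distribution, we adapt the sampling method introduced in [11] to account for the computation of community-wise separated counts. Thus, edges are drawn from the distribution

$$P\big((v_i,v_j)\in E\mid x_i,x_j,\mathcal{C},\Theta_M,\Theta_F\big)=P_{\Theta_M}(v_i,v_j)\cdot A(v_i,v_j),$$

where $P_{\Theta_M}(v_i,v_j)$ is the probability that $(v_i,v_j)$ is drawn from the edge generation model $\Theta_M$, given $\mathcal{C}$, as a candidate edge; while $A(v_i,v_j)$ is the probability that it is accepted by $\Theta_F$, given $x_i$ and $x_j$. We split the computation of $A(v_i,v_j)$ into two cases: an intra-community acceptance probability $A_C(y)$, for every $C\in\mathcal{C}$ and every $v_i,v_j$ such that $\vartheta(v_i)=\vartheta(v_j)=C$; and an inter-community acceptance probability $A_{\mathrm{inter}}(y)$, for every $v_i,v_j$ such that $\vartheta(v_i)\neq\vartheta(v_j)$. Formally, we have

$$A(v_i,v_j)=\begin{cases}A_C(y) & \text{if }\vartheta(v_i)=\vartheta(v_j)=C,\\ A_{\mathrm{inter}}(y) & \text{otherwise,}\end{cases}$$

where $y=\Gamma(x_i,x_j)$.

The computation of $P_{\Theta_M}(v_i,v_j)$ will be discussed in Section 4.3.1, whereas that of $A_C(y)$ and $A_{\mathrm{inter}}(y)$, for every $C\in\mathcal{C}$ and every $y\in\mathcal{Y}$, will be discussed in Section 4.4.3.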

Algorithm 2 describes the procedure to sample an attributed graph from C-AGM. The method first generates the attribute vectors (line 1). Then, it pre-computes the acceptance probabilities (lines 2 to 11). In line 3, the call to SampleEdgeSet consists in the sequential execution of Algs. 3 and 4, which will be described in detail in Section 4.3.1. Finally, the loop in lines 12 to 20 repeatedly draws candidate edges from the edge generation model and adds to the graph those that are accepted according to the pre-computed probabilities (lines 17 and 18). The method stops when the required number of edges is added.

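1 Sample the attribute vectors from the attribute vector model;
2–11 Pre-compute the acceptance probabilities: line 3 calls SampleEdgeSet (Algs. 3 and 4), and the nested loops in lines 4 to 10 carry out the computation;
12–21 while fewer edges than required have been added do: draw a candidate edge from the edge generation model (lines 13 to 15) and, if it is accepted according to the pre-computed probabilities (line 16), add it to the graph (lines 17 and 18);
22 return the sampled attributed graph;
Algorithm 2 Sample an attributed graph from C-AGM.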
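The final loop of the sampling procedure, which draws candidate edges and keeps those that pass the acceptance test, can be sketched as below. This is a simplified illustration under stated assumptions rather than the paper's Algorithm 2: the proposal function, the acceptance probabilities and the aggregated-feature function are hypothetical stand-ins passed as arguments, and the community-dependent case split is folded into the keys of `accept_prob`.

```python
import random


def sample_edges(propose, accept_prob, feature_of, num_edges, rng):
    """Draw candidate edges until `num_edges` distinct edges are accepted.

    propose(rng)      returns a candidate pair (u, v)
    feature_of(u, v)  returns the key used to look up the acceptance probability
    accept_prob       maps each key to a probability in (0, 1]
    """
    edges = set()
    while len(edges) < num_edges:
        u, v = propose(rng)
        edge = frozenset((u, v))
        if u == v or edge in edges:
            continue  # skip self-loops and duplicates
        if rng.random() < accept_prob[feature_of(u, v)]:
            edges.add(edge)
    return edges


# Hypothetical toy setting: six vertices split into two groups by parity.
nodes = list(range(6))
group = {v: v % 2 for v in nodes}


def propose(rng):
    return tuple(rng.sample(nodes, 2))


def feature_of(u, v):
    return "same" if group[u] == group[v] else "different"


accept_prob = {"same": 1.0, "different": 0.2}
sampled = sample_edges(propose, accept_prob, feature_of, 5, random.Random(3))
```

The loop terminates as long as enough distinct pairs have a positive acceptance probability, which is the case in this toy setting.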

4.3.1 Edge generation model

As we discussed in Section 4.2, the component $\Theta_M$ of C-AGM is an edge generation model which preserves several properties of the community partition of the original graph (properties 1 to 3 listed in Section 4.2), in addition to the degree distribution and clustering coefficients. We call this model CPGM, and describe it in what follows.

The model takes as input the set of vertices, as well as the expected number of neighbours of every vertex $v$ within its community (that is, its intra-community degree, denoted by $d_{\mathrm{intra}}(v)$) and the expected number of neighbours outside its community (that is, the inter-community degree, denoted by $d_{\mathrm{inter}}(v)$). These values are used to enforce the expected densities within every community and between communities. Additionally, adapting to our setting a heuristic introduced in [12], the model also requires the number of triangles having all vertices in one community (which we call intra-community triangles and denote by $n_{\triangle}^{\mathrm{intra}}$), as well as the number of triangles spanning more than one community (inter-community triangles, denoted by $n_{\triangle}^{\mathrm{inter}}$). As shown empirically in [12], synthetic graphs that preserve the number of triangles of the original graph are more likely to approximate the clustering coefficient of the original graph. We adopt this intuition as well, but unlike [12], we separate the counts of intra- and inter-community triangles. As we will discuss in Section 5, $n_{\triangle}^{\mathrm{intra}}$ and $n_{\triangle}^{\mathrm{inter}}$ can be efficiently and accurately computed under differential privacy.

According to our model, the edge sampling process consists of two steps. The first step generates a graph that preserves the intra- and inter-community degrees, but not the number of intra- and inter-community triangles. Then, the second step iteratively edits the edge set generated in the first step until the numbers of intra- and inter-community triangles are enforced.

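At the first step, we follow the idea of the CL model [6]. For every pair of vertices $u$ and $v$ belonging to the same community $C$, the intra-community edge $(u,v)$ is added with probability $\frac{d_{\mathrm{intra}}(u)\,d_{\mathrm{intra}}(v)}{2m_C}$, where $d_{\mathrm{intra}}(\cdot)$ is the intra-community degree and $m_C$ is the original number of intra-community edges in $C$. That is, intra-community edges are added with a probability proportional to the product of the intra-community degrees of the linked vertices. If $u$ and $v$ belong to different communities, then the inter-community edge $(u,v)$ is added with probability $\frac{d_{\mathrm{inter}}(u)\,d_{\mathrm{inter}}(v)}{2m_{\mathrm{inter}}}$, where $d_{\mathrm{inter}}(\cdot)$ is the inter-community degree and $m_{\mathrm{inter}}$ is the total number of inter-community edges in the original graph. Algorithm 3 describes the first step of the generation process.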
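A common way to realise this kind of degree-proportional edge generation is to draw both end-vertices with probability proportional to their expected degrees until the required number of edges is reached. The sketch below applies this idea community by community for intra-community edges. It is an illustrative approximation and not necessarily the procedure of Algorithm 3; the function names and the toy inputs are hypothetical.

```python
import random


def chung_lu_edges(degrees, num_edges, rng):
    """Sample `num_edges` distinct edges; both end-vertices are drawn with
    probability proportional to their expected degree."""
    nodes = list(degrees)
    weights = [degrees[v] for v in nodes]
    positive = sum(1 for w in weights if w > 0)
    if num_edges > positive * (positive - 1) // 2:
        raise ValueError("more edges requested than distinct pairs available")
    edges = set()
    while len(edges) < num_edges:
        u, v = rng.choices(nodes, weights=weights, k=2)
        if u != v:  # the set discards repeated edges automatically
            edges.add(frozenset((u, v)))
    return edges


def initial_intra_edges(intra_degree, communities, intra_edge_count, rng):
    """Generate the intra-community edges of every community separately."""
    edges = set()
    for name, members in communities.items():
        degrees = {v: intra_degree[v] for v in members}
        edges |= chung_lu_edges(degrees, intra_edge_count[name], rng)
    return edges


communities = {"A": [0, 1, 2], "B": [3, 4]}
intra_degree = {0: 2, 1: 2, 2: 2, 3: 1, 4: 1}
intra_edge_count = {"A": 3, "B": 1}
intra = initial_intra_edges(intra_degree, communities, intra_edge_count, random.Random(1))
```

Inter-community edges could be generated analogously from the inter-community degrees, rejecting pairs that fall in the same community.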

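initialise the edge set as empty;
for every community do
       while the community has fewer intra-community edges than in the original graph do
              sample a pair of vertices of the community;
              if the sampled pair is not already connected then
                     add the intra-community edge to the edge set;
              end if
       end while
end for
while the edge set has fewer inter-community edges than the original graph do
       sample a pair of vertices from different communities;
       if the sampled pair is not already connected then
              add the inter-community edge to the edge set;
       end if
end while
Algorithm 3 GenInitialEdgeSet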
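count the intra-community triangles of the current edge set;
while this count is smaller than that of the original graph do
       uniformly sample a community;
       sample a vertex $v$ from this community with the probability assigned to it by the model;
       uniformly sample $u$ from the intra-community neighbours of $v$;
       uniformly sample $w$ from the intra-community neighbours of $u$;
       if $(v,w)$ is not in the edge set then
              replace the oldest intra-community edge with $(v,w)$;
              if the number of intra-community triangles increases then
                     make the exchange permanent;
              else
                     undo the exchange and make the oldest edge the youngest;
              end if
       end if
end while
count the inter-community triangles of the current edge set;
while this count is smaller than that of the original graph do
       sample a vertex $v$ with the probability assigned to it by the model;
       uniformly sample $u$ from the intra-community neighbours of $v$;
       uniformly sample $w$ from the inter-community neighbours of $v$;
       replace the oldest inter-community edge with $(u,w)$;
       if the number of triangles increases then
              make the exchange permanent;
       else
              undo the exchange;
       end if
end while
Algorithm 4 GetFinalEdgeSet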

At the second step, we use the intuition that the clustering behaviour in social networks stems from the higher likelihood of users with common friends to connect [10], thus creating triangles. Algorithm 4 enforces the numbers of intra- and inter-community triangles of the original graph on the graph synthesised by Algorithm 3. In Algorithm 4, we denote by $N_{\mathrm{intra}}(v)$ the set of neighbours of $v$ in its community, that is, the neighbours $u$ of $v$ such that $u$ and $v$ belong to the same community. Likewise, we denote by $N_{\mathrm{inter}}(v)$ the set of neighbours of $v$ in different communities, that is, the neighbours $u$ of $v$ such that $u$ and $v$ belong to different communities. In Algorithm 4, the number of intra-community triangles is enforced first because adding or removing an intra-community edge may change the number of inter-community triangles as well, whereas inter-community triangles can be created without modifying the number of intra-community triangles. At every iteration, we sample a new edge. If replacing the oldest intra-community edge (in terms of the order of creation by Algorithm 3) with the newly sampled edge causes the number of intra-community triangles to increase, we make the edge exchange permanent. Otherwise, we do not add the newly sampled edge and set the oldest edge to be the youngest, keeping it in the graph. The iteration stops when the number of intra-community triangles is greater than or equal to that of the original graph. Then, we proceed to enforce the number of inter-community triangles by adding inter-community edges. In this case the idea is to find open “wedges” composed of one intra-community edge $(v,u)$ and one inter-community edge $(v,w)$ such that the edge $(u,w)$ has not been added to the graph. This ensures that newly added edges will not affect the number of intra-community triangles. Let $e$ be the oldest inter-community edge. If the graph obtained by removing $e$ and adding $(u,w)$ contains more triangles than the current version of the synthetic graph, then $(u,w)$ is added and $e$ is removed. The iteration stops when the number of inter-community triangles is greater than or equal to that of the original graph.
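The edge exchange rule described above (replace the oldest edge with a newly sampled one only if this increases the number of triangles, otherwise re-queue the oldest edge as the youngest) can be sketched as follows. For brevity, this illustration counts all triangles and does not distinguish intra- from inter-community edges; the function names and the toy graph are hypothetical.

```python
from collections import deque


def common_neighbours(adj, u, v):
    """Number of vertices adjacent to both u and v."""
    return len(adj[u] & adj[v])


def try_exchange(adj, age_queue, new_edge):
    """Replace the oldest edge by `new_edge` if this raises the triangle count.

    `adj` maps vertices to neighbour sets and `age_queue` holds the edges from
    oldest (left) to youngest (right). Returns True if the exchange is kept.
    """
    u, v = new_edge
    if u == v or v in adj[u]:
        return False
    a, b = age_queue.popleft()  # oldest edge
    lost = common_neighbours(adj, a, b)  # triangles destroyed by its removal
    adj[a].discard(b)
    adj[b].discard(a)
    gained = common_neighbours(adj, u, v)  # triangles closed by the new edge
    if gained > lost:
        adj[u].add(v)
        adj[v].add(u)
        age_queue.append((u, v))
        return True
    # Undo the removal; the oldest edge becomes the youngest.
    adj[a].add(b)
    adj[b].add(a)
    age_queue.append((a, b))
    return False


adj = {0: {1}, 1: {0, 2}, 2: {1}, 3: {4}, 4: {3}}
age_queue = deque([(3, 4), (0, 1), (1, 2)])
first = try_exchange(adj, age_queue, (0, 2))   # closes the triangle 0-1-2
second = try_exchange(adj, age_queue, (3, 4))  # would destroy that triangle
```

The first call removes the edge between 3 and 4 and closes a triangle, so the exchange is kept; the second call is rejected, and the edge it considered for removal is moved to the young end of the queue.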

Due to the removal of initially generated edges, the synthetic graph may become disconnected. In this case, we apply an edge-swapping post-processing step to reconnect every small connected component to the main component (the connected component with the most nodes). If the post-processing reduces the number of triangles, we run Algorithm 4 again. The alternation between the post-processing and Algorithm 4 is not guaranteed to yield a graph having exactly the required number of triangles, so we stop the iteration when the total number of triangles in the synthetic graph is within a tolerance window with respect to the original one.

4.4 Parameter Estimation for C-AGM

We now discuss the methods for estimating the parameters of a C-AGM model from a given attributed graph and a community partition of its vertices.

4.4.1 Estimating the structural parameters

The estimation of the structural parameters reduces to computing the community-wise counters that they rely on: the intra- and inter-community degrees of every vertex, the number of intra-community triangles for each community and the number of inter-community triangles. As we mentioned in Section 4.3.1, degrees and triangle counts will be used to preserve global structural properties of the generated graphs such as degree distribution and clustering coefficients. They can be efficiently computed in the original graph both exactly and under differential privacy.
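As an illustration of the degree counters, the following Python sketch computes the exact (non-private) intra- and inter-community degrees of every vertex. The function name `community_degrees` and the dictionary-based graph representation are assumptions made for this example only.

```python
def community_degrees(adj, community):
    """Return two dicts: intra-community and inter-community degree per vertex.

    adj maps each vertex to the set of its neighbours (undirected graph);
    community maps each vertex to its community label.
    """
    intra, inter = {}, {}
    for v, neighbours in adj.items():
        same = sum(1 for u in neighbours if community[u] == community[v])
        intra[v] = same
        # Every remaining neighbour lies in a different community.
        inter[v] = len(neighbours) - same
    return intra, inter
```

For each vertex the two values add up to its ordinary degree, so the degree sequence of the original graph can be recovered from them.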

4.4.2 Estimating the attribute vector distribution

In order to keep the estimation procedure tractable, we introduce the assumption that attributes are independent. This assumption simplifies the estimation and handles the sparsity of the attribute vectors when the number of attributes is large. As seen in [11, 12], not having such an assumption severely limits the number of features that can be practically handled. Furthermore, as we will see in Section 5, in addition to tractability, this assumption will also allow us to limit the amount of noise added by the differentially private computation.

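We will denote by $x[i]$ the value of the $i$-th component of an attribute vector $x$. Likewise, we will denote by $x_v[i]$ the value of the $i$-th component of the vector labelling vertex $v$. Under the independence assumption, we estimate the probability that a node of a community $c$ is labelled with an attribute vector $x$ as the product of the per-attribute relative frequencies in $c$:

$$\Pr(x \mid c) = \prod_{i=1}^{k} \frac{|\{v \in c : x_v[i] = x[i]\}|}{|c|},$$

where $k$ is the number of columns of the attribute matrix (ergo the length of all attribute vectors).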
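The independence assumption turns the estimate into a product of per-attribute relative frequencies within a community. The Python sketch below illustrates this reading; the function name `attribute_vector_probability` and the data layout (a list of community members and a dictionary of attribute tuples) are hypothetical, and the exact estimator used by the authors may differ, for instance in how empty counts are handled.

```python
def attribute_vector_probability(x, members, labels):
    """Estimate the probability of attribute vector x in one community.

    x is a tuple of attribute values, members is the list of vertices of the
    community, and labels maps each vertex to its attribute tuple.
    Attributes are treated as independent, so the estimate is the product of
    the fraction of members matching x on each component.
    """
    n = len(members)
    prob = 1.0
    for i, value in enumerate(x):
        matches = sum(1 for v in members if labels[v][i] == value)
        prob *= matches / n
    return prob
```

Only one counter per attribute value and community is needed, rather than one counter per full attribute vector, which is what keeps the estimation tractable for many attributes.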

4.4.3 Estimating the distribution of aggregated features

As we discussed in Section 4.2, the distribution of aggregated features is defined in terms of an aggregator function for pairs of attribute vectors. Our aggregator function is based on the widely used cosine similarity, that is, the cosine of the angle between two vectors. Since the range of an aggregator function needs to be discrete, we split the range of the cosine similarity into a set of intervals, determined by a granularity parameter, and map every pair of attribute vectors to the interval that contains their similarity. Finally, the probability of the attribute vectors of a pair of connected vertices being described by a given aggregated feature is estimated from the frequency of that feature among the edges of the original graph.

Compared to the approach introduced in [11, 12], our method uses a coarser granularity for aggregated features. Thanks to that, it avoids computing a large number of distinct values, which is not only inefficient but also results in an excessive amount of noise being injected when applying differential privacy.
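A discretised cosine aggregator of this kind can be sketched in Python as follows. Several details are assumptions made for illustration, not taken from the paper: attribute values are non-negative (so similarities lie in [0, 1]), the intervals have equal width, the parameter `bins` gives their number, and a zero vector is assigned similarity 0. The function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(x, y):
    """Cosine of the angle between x and y; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


def aggregate(x, y, bins):
    """Map a pair of attribute vectors to an interval index in 0..bins-1."""
    s = min(max(cosine_similarity(x, y), 0.0), 1.0)
    # The last interval is closed on the right so that s == 1 stays in range.
    return min(int(s * bins), bins - 1)


def edge_feature_distribution(edges, labels, bins):
    """Relative frequency of each aggregated feature among the given edges."""
    counts = Counter(aggregate(labels[u], labels[v], bins) for u, v in edges)
    return {feature: n / len(edges) for feature, n in counts.items()}
```

With this design the number of values to estimate equals the number of intervals, regardless of how many attributes the vectors have.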

5 Differentially Private C-AGM

In this section, we describe in detail our mechanisms for obtaining differentially private instances of the C-AGM model, as well as the necessary adaptations of the sampling methods when the model has been computed under differential privacy. As we discussed in Section 3, the difference between different instantiations of differential privacy for graphs lies in the definition of the pairs of graphs that are considered to be neighbouring datasets. Here, we adopt the following definition from [12].

Definition 2 (Neighbouring attributed graphs [12]).

A pair of attributed graphs $G$ and $G'$ are neighbouring, denoted $G \sim G'$, if and only if they differ in the presence of exactly one edge or the attribute vector of exactly one node.

Definition 2 entails that the existence of relations, that is the occurrence of edges, and the attributes describing every particular user, are treated as sensitive. On the contrary, vertex identifiers are treated as non-private. These criteria are in line with the current privacy policies of most social networking sites, where the fact that a profile exists is public information, but users can keep their personal information and friends list private or hidden from the general public. With Definition 2 in mind, we describe in what follows the differentially private computation of every parameter of C-AGM.
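The neighbouring relation of Definition 2 can be expressed as a small predicate. The Python sketch below assumes that both graphs share the same vertex set, that edges are stored as sets of `frozenset` pairs and that attribute vectors are stored as tuples; the function name `are_neighbouring` is hypothetical.

```python
def are_neighbouring(edges_a, labels_a, edges_b, labels_b):
    """Check Definition 2 for two attributed graphs on the same vertex set.

    edges_* are sets of frozenset({u, v}) pairs; labels_* map each vertex to
    its attribute tuple. The graphs are neighbouring when they differ in
    exactly one edge or in exactly one attribute vector, but not both.
    """
    edge_diff = len(edges_a ^ edges_b)  # size of the symmetric difference
    label_diff = sum(1 for v in labels_a if labels_a[v] != labels_b[v])
    return (edge_diff == 1 and label_diff == 0) or (
        edge_diff == 0 and label_diff == 1
    )
```

Note that changing a whole attribute vector counts as a single difference, however many of its components change, which is what the sensitivity analyses in this section have to account for.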

5.1 Obtaining the Community Partition

Our differentially private community partition method extends the algorithm ModDivisive [32], in such a way that it takes node attributes into account. In its original formulation, ModDivisive searches for a community partition that maximises modularity, a structural parameter encoding the intuition that a user tends to be more connected to users in the same community than to users in other communities [30]. The modularity of a community partition $\mathcal{C}$ is defined as

$$Q(\mathcal{C}) = \sum_{c \in \mathcal{C}} \left( \frac{l_c}{m} - \left( \frac{d_c}{2m} \right)^2 \right),$$

where $m$ is the number of edges of the graph, $l_c$ is the number of edges between the nodes in $c$ and $d_c$ is the sum of degrees of the nodes in $c$. ModDivisive uses the exponential mechanism, considering the set of possible partitions as the categorical co-domain, and using modularity as the scoring function.
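For concreteness, the following Python sketch computes modularity under the standard definition, in which each community contributes its fraction of internal edges minus the squared fraction of the total degree that it holds. The function name `modularity` and the dictionary-based graph representation are assumptions of this example, and the graph is assumed to have at least one edge.

```python
def modularity(adj, community):
    """Modularity of a partition of an undirected graph with at least one edge.

    adj maps each vertex to the set of its neighbours; community maps each
    vertex to its community label.
    """
    m = sum(len(neighbours) for neighbours in adj.values()) / 2
    internal_edges, degree_sum = {}, {}
    for v, neighbours in adj.items():
        c = community[v]
        degree_sum[c] = degree_sum.get(c, 0) + len(neighbours)
        # An internal edge is seen from both endpoints, hence the division by 2.
        same = sum(1 for u in neighbours if community[u] == c)
        internal_edges[c] = internal_edges.get(c, 0) + same / 2
    return sum(
        internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )
```

Putting all vertices in a single community yields a modularity of zero, which is why the score rewards partitions whose communities are denser inside than expected from their degrees.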

In order to integrate node features into ModDivisive, we introduce a new objective function that combines the original modularity with an attribute-based quality criterion. The new objective function is defined in terms of two quantities: the modularity of the original graph and the modularity of an auxiliary graph, which is obtained from the original as follows. First, we take the vertex set of the original graph. Then, we compute all pairwise similarities between the feature vectors associated with these vertices. Similarities are computed using the cosine measure (as done in Section 4.4.3 for computing aggregated attributes, but without applying the discretisation). Finally, we add to the auxiliary graph the edges corresponding to the most similar pairs of attributed nodes.
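The construction of the auxiliary graph can be sketched in Python as follows. The parameter `num_edges`, which fixes how many of the most similar pairs are kept, is a hypothetical name for this example, as is the function name `auxiliary_graph`. Ties between equally similar pairs are broken by vertex order, which is a choice of this sketch rather than of the paper.

```python
import math
from itertools import combinations


def auxiliary_graph(labels, num_edges):
    """Connect the num_edges most similar vertex pairs by cosine similarity.

    labels maps each vertex to its attribute tuple. Returns an adjacency
    dict on the same vertex set as the original graph.
    """

    def cosine(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
        return dot / norms if norms else 0.0

    # Python's sort is stable, so equally similar pairs keep vertex order.
    pairs = sorted(
        combinations(sorted(labels), 2),
        key=lambda pair: cosine(labels[pair[0]], labels[pair[1]]),
        reverse=True,
    )
    adj = {v: set() for v in labels}
    for u, v in pairs[:num_edges]:
        adj[u].add(v)
        adj[v].add(u)
    return adj
```

Because all pairwise similarities are computed, the cost grows quadratically with the number of vertices, which may call for an approximate nearest-neighbour search on large graphs.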

It is proven in [32] that the global sensitivity of is upper bounded by , where is the minimum number of edges of all potential graphs to publish. In the worst case, , considering that the original graph is an arbitrary non-empty graph. However, this is not the case for real-life social graphs, so introducing more realistic assumptions about the value of allows us to use smaller values of and thus reduce the amount of noise added in differentially privately computing . Throughout this paper, we assume , which leads to . As we will see in Section 6, all datasets used in our experiments comply with this assumption. In what follows, we apply an analogous reasoning for bounding .

Proposition 1.

Every graph of order satisfies

Proof.

Let be two neighbouring attributed graphs and let and be the auxiliary graphs obtained from and , respectively. If the difference between and consists only in one edge, then , so in what follows we will consider that and differ in one attribute vector. Let be the (sole) vertex such that . In the worst case, we have that, for every , and (or vice versa). It was shown in [32] that the modularities of two graphs differing in one edge differ in up to , where is the minimum number of edges. Then, in the worst case we have , where is the order of and , and is the minimum number of edges in auxiliary graphs. As we discussed in Sect. 5.1, , so . The proof is thus completed. ∎

Combining the result in [32] with that of Proposition 1, we conclude that for every satisfying the aforementioned assumptions, and use this value as an upper bound for .

5.2 Attribute Vector Distribution

As discussed in Section 4.4, given a community partition, in order to obtain the differentially private estimation of the attribute vector distribution, we need to compute the probability distribution of each attribute for every community, i.e., the probability that the $i$-th attribute takes the value $j$ in a community $c$, for each attribute index $i$, each attribute value $j$ and each community $c$. Computing this probability reduces to counting the nodes of $c$ whose $i$-th attribute has value $j$. In order to obtain a differentially private version of the sequence of these counts, we add noise to each of its elements. The noise is calibrated to the privacy budget reserved for this computation and to the global sensitivity of the sequence, which is established in the next result.
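The noise addition step can be illustrated with a short Python sketch. It assumes the Laplace mechanism, a standard choice for count queries, with scale equal to the global sensitivity divided by the privacy budget; the noise distribution is an assumption of this example, and the function names are hypothetical. Laplace noise is drawn as the difference of two independent exponential samples.

```python
import random


def laplace_noise(scale, rng):
    """Sample from a zero-mean Laplace distribution with the given scale."""
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_counts(counts, epsilon, sensitivity, rng=None):
    """Perturb every count with Laplace noise of scale sensitivity / epsilon."""
    if rng is None:
        rng = random.Random()
    scale = sensitivity / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]
```

The noisy counts may be negative or non-integer, so they typically need to be clamped and renormalised before being turned into probabilities. Such post-processing does not weaken the privacy guarantee.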

Proposition 2.

The global sensitivity of is .

Proof.

Let be two neighbouring attributed graphs, let be a community and let and be the instances of in and , respectively. If the difference between and consists only in one edge, then , so in what follows we will consider that and differ in one attribute vector. Let be the (sole) vertex such that . If , then . On the contrary, if , for every component such that , we have that