1 Introduction
The use of online social networks (OSNs) has grown steadily in recent years, and is expected to continue growing in the future. Billions of people share many aspects of their lives on OSNs and use these systems to interact with each other on a regular basis. The ubiquity of OSNs has turned them into one of the most important sources of data for the analysis of social phenomena. Such analyses have led to significant findings used in a wide range of applications, from efficient epidemic disease control [22, 5] to information diffusion [44, 13].
Despite the undeniable social benefits that can be obtained from social network analysis, access to such data by third parties such as researchers and companies should understandably be limited due to the sensitivity of the information stored in OSNs, e.g. personal relationships, political preferences and religious affiliations. In addition, the increase of public awareness about privacy and the entry into effect of strong privacy regulations such as GDPR [1] strengthen the reluctance of OSN owners to release their data. Therefore, it is of critical importance to provide mechanisms for privacy-preserving data publication to encourage OSN owners to release data for analysis.
Social graphs are a natural representation of social networks, with nodes corresponding to participants and edges to connections between participants. In view of the privacy discussion, social network owners should only release sanitised samples of the underlying social graphs. However, it has been shown that even social graphs containing only structural information remain vulnerable to privacy attacks leveraging knowledge from public sources [29], deploying sybil accounts [2, 24], etc. In order to prevent such attacks, a large number of graph anonymisation methods have been devised. Initially, the proposed methods focused on editing the original graph via vertex/edge additions and deletions until obtaining a graph satisfying some privacy property. A critical limitation of graph editing methods is their reliance on assumptions about the adversary knowledge, which determine the information that needs to be anonymised and thus the manner in which privacy is enforced. To avoid this type of assumption, an increasingly popular trend is that of using semantic privacy notions, which place formal privacy guarantees on the data processing algorithms rather than the dataset. Among semantic privacy notions, differential privacy [8] has become the de facto standard due to its strong privacy guarantees.
Depending on the type of published data, we can divide differentially private mechanisms for social graphs into two classes. The methods in the first category directly release specific statistics of the underlying social graph, e.g. the degree sequence [16, 9] or the number of specific subgraphs (triangles, stars, etc.) [43]. The second family of methods focuses on publishing synthetic social graphs as a replacement of real social networks in a two-step process [27, 35, 36, 39]. In the first step, differentially private methods are used to compute the parameters of a generative graph model that accurately captures the original graph properties. Then, in the second step, this model is sampled to obtain synthetic graphs, profiting from the fact that the result of post-processing the output of differentially private algorithms remains differentially private [17].
Differential privacy requires one to define a privacy budget in advance, which determines the amount of perturbation that will be applied to the outputs of algorithms. In consequence, the methods in the first family need to either limit in advance the number of queries that will be answered or deliver answers of increasingly lower quality. In contrast, the methods in the second family can devote the entire privacy budget to the model parameter estimation, without further degradation of the privacy of the sampled graphs. For this reason, in this paper we focus on the second type of methods.
For analysts, the utility of synthetic graphs is determined by the ability of the graph models to capture relevant properties of the original graph. To satisfy this need, several graph models have been proposed to accurately capture global structural properties such as degree distributions and clustering coefficients, as well as heterogeneous attributes of the users such as gender, education or marital status. A common limitation of the aforementioned approaches is their inability to represent an important type of information: the community structure. Informally, a community is a set of users who are substantially more interrelated among themselves than to other users of the network. This interrelation may, e.g., stem from the explicit existence of relations between the users. An example of such a community is a group of Gmail users who frequently email each other, as represented by the occurrence of a large number of edges connecting the user nodes from the group. Alternatively, interrelations may stem from the co-occurrence of relevant features, such as users working at the same company or alumni from the same university. The emergence of communities has been documented to be an inherent property of social networks [34, 41]. For analysts, the availability of synthetic attributed graphs that preserve the community structure of the original graph represents an opportunity to improve existing applications. For example, they may be able to improve online shopping recommendations based on the common purchases of users belonging to the same community. Current models and methods are insufficient for enabling such an analysis, as they either lack information about the community structure or they lack vertex features.
In this paper, we address the problem discussed in the previous paragraph by introducing a new generative attributed graph model, CAGM (short for Community-Preserving Attributed Graph Model), which, in addition to global structural properties, is also capable of preserving the community structure of the original graph. CAGM is based on the attributed graph model AGM [11], and improves on it by incorporating the capability of preserving the number and sizes of the communities of the original graph, as well as the densities of intra- and inter-community connections (that is, connections between nodes belonging to the same community or to different communities, respectively). CAGM also preserves a number of statistics describing the correlations between the feature vectors that describe the users and the existence of connections between pairs of users, as well as their community co-affiliation. We equip CAGM with efficient parameter estimation and graph sampling methods, and provide differentially private variants of the former, which allow us to release synthetic attributed social graphs with a strong privacy guarantee and increased utility with respect to preceding approaches.

Summary of contributions:

We propose a new generative attributed graph model, CAGM, which captures a number of properties of the community structure, as discussed in the previous paragraph, along with global structural properties.

We present efficient methods for learning an instance of our model from an input graph and sampling community-preserving synthetic attributed graphs from this instance. We show, via a number of experiments on real-world social networks, that the community structures of synthetic graphs sampled from our model are more similar to those of the original graphs than those of the graphs sampled from previously existing models. Additionally, we show that this behaviour is obtained without sacrificing the ability to preserve global structural features.

We devise differentially private methods for computing the parameters of the new model. We demonstrate that our methods are practical in terms of efficiency and accuracy. To support the latter claim, we empirically show that differentially private synthetic attributed graphs generated by our model suffer a reasonably low degradation with respect to their counterparts, in terms of their ability to capture the community structure and structural features of the original graphs.
2 Related Work
Private graph synthesis. The key to synthesising social graphs is the model, which determines both the information embedded in the published graphs and the properties preserved. Mir et al. [27] used the Kronecker graph generative model [20] to generate differentially private graphs. As the Kronecker model cannot accurately capture structural properties, Sala et al. [35] proposed an alternative approach which makes use of the dK-graph model. Wang et al. [36] further improved the work of Sala et al. by considering global sensitivity instead of local sensitivity (refer to Section 3 for the definition of sensitivity). Xiao et al. [39] introduced a model based on hierarchical random graphs (HRG) [7] and found that it can further reduce the amount of added noise and thus increase the accuracy.
The approaches described so far work on unlabelled graphs. Pfeiffer et al. [11] introduced a new model called AGM, which attaches binary attributes to nodes and captures the correlations between shared attributes and the existence of connections. Jorgensen et al. [12] adopted this model and proposed differentially private methods to accurately estimate the model parameters. They also designed a new graph generation algorithm based on the TCL model [10], which enables the model to sample attributed graphs preserving the clustering coefficient. As discussed previously, CAGM, the model introduced in this paper, is comparable to this model in preserving global structural properties of the original graphs, but it outperforms it by also capturing the community structure.
Private statistics publishing. Degree sequences and degree correlations are two types of statistics frequently studied in the literature. The general trend in publishing these statistics under differential privacy consists in adding noise to the original sequences and then post-processing the perturbed sequences to enforce or restore certain properties, such as graphicality [16], vertex order in terms of degrees [9], etc. Subgraph count queries, e.g. the number of triangles or stars, have also received considerable attention. Among the approaches to accurately compute such queries, we have the so-called ladder functions [43] and smooth sensitivity [15, 37].
Community-preserving graph generation models. A number of existing random graph models claim to capture community structure, e.g. BTER [18], ILFR [34], SBM [38] and its variants (e.g. DC-SBM [14] and DC-PPM [31]). BTER generates community-preserving social graphs given the expected node degrees and, for every degree value $d$, the average of the clustering coefficients of the nodes of degree $d$. The model assumes that every community is a set of nodes with degree $d$. In contrast, CAGM makes no assumptions on the community partition received. Finally, ILFR and the variants of SBM preserve edge densities at the community level but, unlike our new model, they do not preserve the clustering coefficients of the original graph.
3 Preliminaries
3.1 Notation
An attributed graph is represented as a triple $G = (V, E, X)$, where $V$ is the set of nodes, $E$ is the set of edges, and $X$ is a binary matrix called the attribute matrix. The $i$-th row of $X$ is the attribute vector of the node $v_i$, which is individually denoted by $X_{v_i}$. Every column of $X$ represents a binary feature, which is set to $1$ (true), or $0$ (false), for each user. For example, if the $j$-th column represents the attribute "owning a car", $X_{v_i,j} = 1$ means that the user represented by $v_i$ owns a car. Non-binary real-life attributes are assumed to be binarised. For example, a binarisation of the integer-valued attribute "age" is a set of indicators of disjoint intervals, e.g. "age $\leq a_1$", "$a_1 <$ age $\leq a_2$", "$a_2 <$ age $\leq a_3$", "age $> a_3$". The order of the columns of $X$ is fixed, but arbitrary, and has no impact on the results described hereafter. Throughout the paper, we deal with undirected graphs. That is, if $(v_1, v_2) \in E$, then $(v_2, v_1) \in E$. Additionally, we use $A$ to denote the adjacency matrix of the graph.
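Such a binarisation of a non-binary attribute can be sketched as follows; the age interval boundaries are illustrative placeholders, not values from the paper.

```python
# Sketch: binarising the integer-valued attribute "age" into interval
# indicator columns of the attribute matrix X. The boundaries below
# are illustrative, not the ones used in the paper.
def binarise_age(age, boundaries=(17, 40, 65)):
    """Return a tuple of 0/1 indicators, one per disjoint age interval."""
    lo = float("-inf")
    row = []
    for hi in boundaries:
        row.append(1 if lo < age <= hi else 0)
        lo = hi
    row.append(1 if age > boundaries[-1] else 0)  # open-ended last interval
    return tuple(row)
```

By construction, exactly one indicator is set per user, so each original attribute contributes a block of mutually exclusive binary columns.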
We use $\varphi = \{C_0, C_1, \ldots, C_k\}$, with $C_i \subseteq V$ for every $i \in \{0, 1, \ldots, k\}$, to represent a community partition of the attributed graph. As the term suggests, in this paper we assume that $C_i \cap C_j = \emptyset$, with $i \neq j$, and $\bigcup_{i=0}^{k} C_i = V$. The community $C_0$ has a special interpretation. Since some community detection algorithms assign no community to some vertices, we will use $C_0$ as a "discard" community of unassigned vertices. We do so to avoid having a potentially large number of singleton communities, for which no meaningful co-affiliation statistics can be computed. We use $C_\varphi(v)$ to denote the community to which the node $v$ belongs in the community partition $\varphi$. We will use $C(v)$ for short in cases where the partition is clear from the context.
3.2 Differential Privacy
Differential privacy [8] is a well-studied statistical notion of privacy. The intuition behind it is to randomise the output of an algorithm in such a way that the presence of any individual element in the input dataset has a negligible impact on the probability of observing any particular output. In other words, a mechanism is differentially private if for any pair of neighbouring datasets, i.e. datasets that only differ by one element, the probabilities of obtaining any output are measurably similar. The amount of similarity is determined by the parameter $\varepsilon$, which is commonly called the privacy budget. In what follows, we will use the notation $\mathcal{D}$ for the set of possible datasets, $\mathcal{O}$ for the set of possible outputs, and $D \sim D'$ for a pair of neighbouring datasets.

Definition 1 ($\varepsilon$-differential privacy [8]). A randomised mechanism $\mathcal{M}\colon \mathcal{D} \to \mathcal{O}$ satisfies $\varepsilon$-differential privacy if for every pair of neighbouring datasets $D \sim D'$, and for every $O \subseteq \mathcal{O}$, we have
$$\Pr[\mathcal{M}(D) \in O] \leq e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in O].$$
A number of differentially private mechanisms have been proposed. For queries of the form $f\colon \mathcal{D} \to \mathbb{R}^d$, the most widely used mechanism to enforce differential privacy is the so-called Laplace mechanism, which consists in obtaining the (non-private) output of $f$ and adding to every component a carefully chosen amount of random noise, which is drawn from the Laplace distribution
$$\Pr[x \mid \sigma] = \frac{1}{2\sigma} e^{-|x|/\sigma}, \quad \text{with } \sigma = \frac{\Delta f}{\varepsilon},$$
where $x$ is a real-valued variable indicating the noise to be added, and $\Delta f$ is a property of the original function called global sensitivity. This property is defined as the largest difference between the outputs of $f$ for any pair of neighbouring datasets, that is
$$\Delta f = \max_{D \sim D'} \lVert f(D) - f(D') \rVert_1,$$
where $\lVert \cdot \rVert_1$ is the $\ell_1$ norm. For categorical (non-numerical) queries of the form $f\colon \mathcal{D} \to \mathcal{R}$, where $\mathcal{R}$ is a finite set of categories, the so-called exponential mechanism [26] is the most commonly used. In this case, for each value $r \in \mathcal{R}$, a score is assigned by a function $q$ (usually called scoring function) quantifying the value's utility, denoted by $q(D, r)$. The global sensitivity of $q$ is
$$\Delta q = \max_{r \in \mathcal{R}} \max_{D \sim D'} \lvert q(D, r) - q(D', r) \rvert,$$
and the randomised output $r \in \mathcal{R}$ is drawn with probability
$$\Pr[r] = \frac{\exp\left(\frac{\varepsilon\, q(D, r)}{2 \Delta q}\right)}{\sum_{r' \in \mathcal{R}} \exp\left(\frac{\varepsilon\, q(D, r')}{2 \Delta q}\right)}.$$
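Both mechanisms can be sketched in a few lines; this is generic differential-privacy code following the standard definitions, not an implementation from the paper.

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Laplace mechanism: add noise with scale sensitivity/epsilon,
    drawn via inverse transform sampling."""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5                      # uniform in (-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    return true_value - scale * sign * math.log(1.0 - 2.0 * abs(u))

def exponential_mechanism(candidates, score, epsilon, sensitivity, rng=random):
    """Exponential mechanism: draw r with probability proportional to
    exp(epsilon * score(r) / (2 * sensitivity))."""
    weights = [math.exp(epsilon * score(r) / (2.0 * sensitivity))
               for r in candidates]
    x = rng.random() * sum(weights)
    for r, w in zip(candidates, weights):
        x -= w
        if x <= 0.0:
            return r
    return candidates[-1]
```

Note that a large privacy budget concentrates the Laplace noise near zero and makes the exponential mechanism behave almost like an argmax over the scores.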
Differentially private methods are composable [25]. That is, given a set of algorithms $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_t$ such that $\mathcal{A}_i$ ($1 \leq i \leq t$) satisfies $\varepsilon_i$-differential privacy, if the algorithms are applied sequentially and the results combined by a deterministic method, then the final result satisfies $\left(\sum_{i=1}^{t} \varepsilon_i\right)$-differential privacy. If the algorithms are applied independently on disjoint subsets of the input, then $\left(\max_{i} \varepsilon_i\right)$-differential privacy is satisfied. Moreover, post-processing on the output of an $\varepsilon$-differentially private algorithm also satisfies $\varepsilon$-differential privacy if the post-processing is deterministic or randomised with a source of randomness independent from the noise added to the original algorithm [17]. These properties allow us to divide a complex computation, such as the set of model parameters in our case, into a sequence of subtasks for which differentially private methods exist or can be more easily developed.
In addition to the global sensitivity, a dataset-dependent notion, called local sensitivity [33], has been enunciated. The local sensitivity of a query $f$ on a dataset $D$ is defined as
$$LS_f(D) = \max_{D' \,:\, D' \sim D} \lVert f(D) - f(D') \rVert_1,$$
that is, the maximum difference between the output of $f$ on $D$ and those on its neighbouring datasets. It is simple to see that $LS_f(D) \leq \Delta f$.
4 The CAGM Model
In this section we give the formal definition of CAGM. We introduce the methods for sampling synthetic graphs from the model, and describe the methods for learning the model parameters from an attributed graph.
4.1 Overview
Algorithm 1 summarises the process by which CAGM is used for publishing synthetic attributed graphs. As discussed in [9, 17, 16], synthetic graph generation is done as a post-processing step of the differentially private computation, so the synthetic graphs are also differentially private.
The manner in which the privacy budget is split among the different computations (step 1) is discussed in Section 5. For the differentially private community partition (step 2), we introduce in this paper an extension of the algorithm ModDivisive [32]. The purpose of this extension is to incorporate information from node attributes into the objective function optimised by ModDivisive. We discuss the community partition method in detail in Section 5.1. A thorough description of the parameters of CAGM is given in Section 4.2, and parameter estimation is discussed in Section 4.4.
Once the model parameters have been estimated, we can sample any number of synthetic attributed graphs from the model, as described in steps 4 to 8 of Algorithm 1. The differentially private parameter estimation methods introduced in this paper use the notion of neighbouring attributed graphs [12], which is discussed in detail in the preamble of Section 5. Under this notion, the existence of relations (edges) and personal characteristics of the network users (feature vectors) are treated as sensitive, but vertex identities are not. Thus, the synthetic graphs generated by Algorithm 1 have the same vertex set as the original graph, whereas the attribute matrix and the edge set are sampled from the model (step 7). For every new synthetic attributed graph, we first sample the attribute matrix, and then this matrix is used, in combination with an edge generation model (Section 4.3.1), to generate the edge set of the synthetic graph. There are two reasons for dividing this process into two steps. The first one is to make the sampling process efficient. The second reason is to profit from the two-step process to enforce the intuition that users with similar features are more likely to be connected in the social network. The attributed graph sampling procedure is discussed in detail in Section 4.3.
4.2 Model Parameters
As we discussed in Section 1, given an attributed graph $G$ and a community partition $\varphi$ of $G$, the purpose of CAGM is to capture a number of properties of $G$ that are overlooked by previously defined models, without sacrificing the ability to capture global structural properties such as degree distributions and clustering coefficients. To that end, CAGM models the following properties of the community partition:

the number and sizes of communities;

the number of intra-community edges in every community;

the number of inter-community edges;

the distributions of attribute vectors in every community;

the distributions of the so-called attribute-edge correlations [12], for the set of inter-community edges and for the set of intra-community edges in every community.
Graphs generated by CAGM will have the same number of vertices as the original graph, as well as the same number of communities. Moreover, every community will have the same cardinality as in the original graph, and the same number of intra-community edges. The number of inter-community edges of the generated graph will also be the same as that of the original graph. Notice that the model preserves the total number, but not necessarily the pairwise numbers of inter-community edges for every pair of communities.
Attribute-edge correlations were defined in [12] as heuristic values for characterising the relation between the feature vectors labelling a pair of vertices and the likelihood that these vertices are connected. They encode the intuition that, for example, co-workers who attended the same university and live near each other are more likely to be friends than persons with fewer features in common, whereas friends are more likely to support the same sports teams or go to the same bars than unrelated persons. In [12], attribute-edge correlations are considered to behave uniformly over the entire graph. Here, we introduce the rationale that they behave differently within different communities, as well as across communities.

A key element in the representation of attribute-edge correlations is the notion of aggregator functions. An aggregator function $f$ maps a pair of attribute vectors of dimensionality $m$ into a value in a discrete range $R$, which is used as a descriptor, also called aggregated feature, of the pair. For example, $R$ can contain a set of similarity levels for pairs of feature vectors, such as {low, medium, high}, and $f$ can map a pair of vectors whose cosine similarity falls in the lowest interval to low, a pair of vectors whose cosine similarity falls in the highest interval to high, etc. Attribute-edge correlations, along with the community-wise distributions of attribute vectors, are useful for analysts, as they allow them to characterise the members of a community in terms of frequently shared features, hypothesise explanations for the emergence of a community, etc.

Formally, a CAGM model is defined as a quintuple $\Theta = (V, \varphi, \Theta_E, \Theta_X, \Theta_F)$, where:

$V$ is a set of vertices.

$\varphi$ is a community partition of $V$.

$\Theta_E$ is an instance of an edge set generative model that preserves properties 1 to 3 of the community partition $\varphi$, as well as degree distributions and clustering coefficients. The model introduced in this paper is called CPGM, and is described in detail in Section 4.3.1.

$\Theta_X$ is an instance of an attribute vector generative model, which aims to preserve property 4. The model defines, for every community $C \in \varphi$ and every attribute vector $x$, the probability that a vertex in $C$ is labelled with $x$. The model introduced in this paper is described in detail in Section 4.4.2.

$\Theta_F$ is an instance of a generative model for attribute-edge correlations, which aims to preserve property 5. This model defines:

The discrete range $R$ and an aggregator function $f$.

The probability $\Pr\left[f(X_u, X_v) = r \mid (u,v) \in E,\ C(u) = C(v) = C\right]$ for every community $C$ and every value $r \in R$.

The probability $\Pr\left[f(X_u, X_v) = r \mid (u,v) \in E,\ C(u) \neq C(v)\right]$ for every value $r \in R$.
The instantiations that we propose for these three components are described in detail in Section 4.4.3.

4.3 Sampling Attributed Graphs from an Instance of CAGM
Given a CAGM model $\Theta$, with $\Theta = (V, \varphi, \Theta_E, \Theta_X, \Theta_F)$, an attributed graph $G = (V, E, X)$ is sampled from $\Theta$ with probability $\Pr[E, X \mid \Theta]$ which, for the sake of tractability, is approximated as
$$\Pr[E, X \mid \Theta] \approx \Pr[X \mid \Theta] \cdot \Pr[E \mid X, \Theta].$$
That is, we first sample from $\Theta_X$ the attribute vectors labelling each vertex and then use them in sampling the edge set. Again, to keep the sampling process tractable, we introduce an additional independence assumption, according to which
$$\Pr[X \mid \Theta] = \prod_{v \in V} \Pr[X_v \mid C(v)].$$
The computation of the probabilities of the form $\Pr[X_v \mid C(v)]$ will be discussed in Section 4.4.2. Introducing the assumption that edges are sampled independently from each other, the probability of generating $E$ given $X$, $\varphi$, $\Theta_E$, and $\Theta_F$ is
$$\Pr[E \mid X, \Theta] = \prod_{(u,v) \in E} \Pr[(u,v) \mid X, \varphi, \Theta_E, \Theta_F].$$
As it is inefficient to sample edges directly from this distribution, we adapt the sampling method introduced in [11] to account for the computation of community-wise separated counts. Thus, edges are drawn from the distribution
$$\Pr[(u,v) \mid X, \varphi, \Theta_E, \Theta_F] \propto \Pr[(u,v) \mid \Theta_E] \cdot A(u,v),$$
where $\Pr[(u,v) \mid \Theta_E]$ is the probability that $(u,v)$ is drawn from the edge generation model $\Theta_E$, given $\varphi$, as a candidate edge; while $A(u,v)$ is the probability that it is accepted by $\Theta_F$, given $X$. We split the computation of $A(u,v)$ into two cases: $A_C(u,v)$, for every community $C$ and every $(u,v)$ such that $C(u) = C(v) = C$; and $A_{inter}(u,v)$, for every $(u,v)$ such that $C(u) \neq C(v)$. Formally, we have
$$A(u,v) = \begin{cases} A_{C}(u,v) & \text{if } C(u) = C(v) = C, \\ A_{inter}(u,v) & \text{if } C(u) \neq C(v), \end{cases}$$
where
$$A_{C}(u,v) = \Pr\left[f(X_u, X_v) \mid (u,v) \in E,\ C(u) = C(v) = C\right], \qquad A_{inter}(u,v) = \Pr\left[f(X_u, X_v) \mid (u,v) \in E,\ C(u) \neq C(v)\right].$$
The computation of $\Pr[(u,v) \mid \Theta_E]$ will be discussed in Section 4.3.1, whereas that of $A_{C}(u,v)$, for every community $C$, and $A_{inter}(u,v)$ will be discussed in Section 4.4.3.
Algorithm 2 describes the procedure to sample an attributed graph from CAGM. The method first generates the attribute vectors (line 1). Then, it precomputes the acceptance probabilities (lines 2 to 11). In line 3, the call to SampleEdgeSet consists in the sequential execution of Algs. 3 and 4, which will be described in detail in Section 4.3.1. Finally, the loop in lines 12 to 20 repeatedly draws candidate edges from the edge generation model and adds to the graph those that are accepted according to the precomputed probabilities (lines 17 and 18). The method stops when the required number of edges is added.
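The candidate-draw-and-accept loop described for Algorithm 2 can be sketched as follows; `draw_candidate` and `accept_prob` are hypothetical callables standing in for the edge generation model and the precomputed acceptance probabilities.

```python
import random

def sample_edges(num_edges, draw_candidate, accept_prob, rng=random):
    """Rejection-sampling sketch: repeatedly draw candidate edges from the
    edge generation model and accept each with its precomputed acceptance
    probability, stopping once enough distinct edges have been collected."""
    edges = set()
    while len(edges) < num_edges:
        e = draw_candidate()
        if e not in edges and rng.random() < accept_prob(e):
            edges.add(e)
    return edges
```

The loop terminates as soon as the required number of edges is reached, mirroring the stopping condition of the algorithm described above.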
4.3.1 Edge generation model
As we discussed in Section 4.2, the $\Theta_E$ component of CAGM is an edge generation model which preserves several properties of the community partition of the original graph (properties 1 to 3 listed in Section 4.2), in addition to the degree distribution and clustering coefficients. We call this model CPGM, and describe it in what follows.
The model takes as input the set of vertices, as well as the expected number of neighbours of every vertex $v$ within its community (that is, its intra-community degree, denoted by $d^{intra}_v$) and the expected number of neighbours outside its community (that is, the inter-community degree, denoted by $d^{inter}_v$). These values are used to enforce the expected densities within every community and between communities. Additionally, adapting to our setting a heuristic introduced in [12], the model also requires the number of triangles having all vertices in one community (which we call intra-community triangles and denote by $t^{intra}$), as well as the number of triangles spanning more than one community (inter-community triangles, denoted by $t^{inter}$). As shown empirically in [12], synthetic graphs that preserve the number of triangles of the original graph are more likely to approximate the clustering coefficient of the original graph. We adopt this intuition as well, but unlike [12], we separate the counts of intra- and inter-community triangles. As we will discuss in Section 5, $t^{intra}$ and $t^{inter}$ can be efficiently and accurately computed under differential privacy.
According to our model, the edge sampling process consists of two steps. The first step generates a graph that preserves the intra- and inter-community degrees, but not the number of intra- and inter-community triangles. Then, the second step iteratively edits the generated edge set until $t^{intra}$ and $t^{inter}$ are enforced.
At the first step, we follow the idea of the CL model [6]. For every pair of vertices $u$ and $v$ satisfying $C(u) = C(v) = C$, the intra-community edge $(u,v)$ is added with probability $\frac{d^{intra}_u \cdot d^{intra}_v}{2 m_C}$, where $m_C$ is the original number of intra-community edges in $C$. That is, intra-community edges are added with a probability proportional to the product of the intra-community degrees of the linked vertices. If $C(u) \neq C(v)$, then the inter-community edge $(u,v)$ is added with probability $\frac{d^{inter}_u \cdot d^{inter}_v}{2 m^{inter}}$, where $m^{inter}$ is the total number of inter-community edges in the original graph. Algorithm 3 describes the first step of the generation process.
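The intra-community case of this CL-style first phase can be sketched as follows; the function and variable names are hypothetical, and the normalisation by twice the number of intra-community edges is an assumption consistent with the Chung-Lu convention.

```python
import random

def sample_intra_edges(community, d_intra, m_intra, rng=random):
    """First-phase sketch: add each intra-community edge (u, v) independently
    with probability d_intra[u] * d_intra[v] / (2 * m_intra), capped at 1.
    `community` is a list of vertices, `d_intra` maps each vertex to its
    intra-community degree, `m_intra` is the community's edge count."""
    edges = set()
    nodes = sorted(community)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            p = min(1.0, d_intra[u] * d_intra[v] / (2.0 * m_intra))
            if rng.random() < p:
                edges.add((u, v))
    return edges
```

The inter-community case is analogous, iterating over pairs from different communities and normalising by the total number of inter-community edges.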
At the second step, we use the intuition that the clustering behaviour in social networks stems from the higher likelihood of users with common friends to connect [10], thus creating triangles. Algorithm 4 enforces the values of $t^{intra}$ and $t^{inter}$ of the original graph on the graph synthesised by Algorithm 3. In Algorithm 4, we denote by $N^{intra}(v)$ the set of neighbours of $v$ in its community, that is $N^{intra}(v) = \{u \mid (u,v) \in E,\ C(u) = C(v)\}$. Likewise, we denote by $N^{inter}(v)$ the set of neighbours of $v$ in different communities, that is $N^{inter}(v) = \{u \mid (u,v) \in E,\ C(u) \neq C(v)\}$. In Algorithm 4, $t^{intra}$ is enforced first because adding or removing an intra-community edge may change the number of inter-community triangles as well, whereas inter-community triangles can be created without modifying the number of intra-community triangles. At every iteration, we sample a new edge. If replacing the oldest intra-community edge (in terms of the order of creation by Algorithm 3) with the newly sampled edge causes the number of intra-community triangles to increase, we make the edge exchange permanent. Otherwise, we do not add the newly sampled edge and set the oldest edge to be the youngest, keeping it in the graph. The iteration stops when the number of intra-community triangles is greater than or equal to that of the original graph. Then, we proceed to enforce the number of inter-community triangles by adding inter-community edges. In this case the idea is to find open "wedges" composed of one intra-community edge $(u,v)$ and one inter-community edge $(v,w)$ such that the edge $(u,w)$ has not been added to the graph. This ensures that newly added edges will not affect the number of intra-community triangles. Let $(u',v')$ be the oldest inter-community edge. If the graph obtained by removing $(u',v')$ and adding $(u,w)$ contains more triangles than the current version of the synthetic graph, then $(u,w)$ is added and $(u',v')$ is removed. The iteration stops when the number of inter-community triangles is greater than or equal to that of the original graph.
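Separating triangle counts by type, as this enforcement step requires, reduces to a shared-neighbour count; the helper below is illustrative, not the paper's algorithm.

```python
def count_triangles_by_type(edges, comm):
    """Count intra-community triangles (all three vertices in one community)
    and inter-community triangles (spanning communities). `comm` maps each
    vertex to its community id; edges are undirected, listed once each."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    intra = inter = 0
    for u, v in edges:
        for w in adj[u] & adj[v]:       # common neighbours close a triangle
            if u < w and v < w:         # count each triangle exactly once
                if comm[u] == comm[v] == comm[w]:
                    intra += 1
                else:
                    inter += 1
    return intra, inter
```

Counting a triangle only at its maximum vertex avoids the triple-counting that a naive per-edge scan would produce.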
Due to the removal of initially generated edges, the synthetic graph may become disconnected. In this case, we apply an edge-swapping post-processing step to reconnect every small connected component to the main component (the connected component with the most nodes). If the post-processing reduces the number of triangles, we run Algorithm 4 again. The alternation between the post-processing and Algorithm 4 is not guaranteed to yield a graph having exactly the required number of triangles, so we stop the iteration when the total number of triangles in the synthetic graph is within a tolerance window with respect to the original one.
4.4 Parameter Estimation for CAGM
We now discuss the methods for estimating the parameters of a CAGM model from a given attributed graph $G = (V, E, X)$ with a community partition $\varphi$.
4.4.1 Estimating $\Theta_E$
The estimation of $\Theta_E$ reduces to computing the community-wise counters that it relies on: intra- and inter-community degrees of every vertex, the number of intra-community triangles for each community and the number of inter-community triangles. As we mentioned in Section 4.3.1, degrees and triangle counts will be used to preserve global structural properties of the generated graphs such as degree distribution and clustering coefficients. They can be efficiently computed in the original graph both exactly and under differential privacy.
4.4.2 Estimating $\Theta_X$
In order to keep the estimation procedure tractable, we introduce the assumption that attributes are independent. This assumption simplifies the estimation and handles the sparsity of the attribute vectors when the number of attributes is large. As seen in [11, 12], not having such an assumption severely limits the number of features that can be practically handled. Furthermore, as we will see in Section 5, in addition to tractability, this assumption will also allow us to limit the amount of noise added by the differentially private computation.
We will denote by $x_j$ the value of the $j$-th component of the attribute vector $x$. Likewise, we will denote by $X_{v,j}$ the value of the $j$-th component of the vector labelling vertex $v$. We estimate the probability that a node $v$ in community $C$ is labelled with an attribute vector $x$ by the following formula:
$$\Pr[X_v = x \mid C(v) = C] = \prod_{j=1}^{m} p_{C,j}^{\,x_j} \left(1 - p_{C,j}\right)^{1 - x_j},$$
where $m$ is the number of columns of $X$ (ergo the cardinality of all attribute vectors) and $p_{C,j}$ is the empirical frequency of value $1$ in the $j$-th column among the members of $C$.
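Under the independence assumption, the probability of a full attribute vector is a product of per-attribute Bernoulli terms; a minimal sketch, where `p_c` holds hypothetical community-wise frequencies of value 1:

```python
def attr_vector_probability(x, p_c):
    """Probability of attribute vector x in a community, under the
    attribute-independence assumption: a product of Bernoulli terms.
    p_c[j] is the community-wise empirical frequency of value 1 in column j."""
    prob = 1.0
    for xj, pj in zip(x, p_c):
        prob *= pj if xj == 1 else (1.0 - pj)
    return prob
```

The estimation thus needs only one frequency per attribute and community, rather than one probability per possible attribute vector.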
4.4.3 Estimating $\Theta_F$
As we discussed in Section 4.2, for defining $\Theta_F$ it is necessary to define an aggregator function for pairs of attribute vectors. Our aggregator function is based on the widely used cosine similarity, that is, the cosine of the angle between two vectors. Since the range of aggregator functions needs to be discrete, we split the range of the cosine similarity into a set of intervals of width $\rho$, determined by a parameter $\rho$ satisfying $0 < \rho \leq 1$. Let $sim(x, y)$ denote the similarity between vectors $x$ and $y$. Our aggregator function is defined as $f(x, y) = \left\lceil \frac{sim(x, y)}{\rho} \right\rceil$. Note that, according to this definition, $R = \{0, 1, \ldots, \lceil 1/\rho \rceil\}$. Finally, the probability of the attribute vectors of a pair of connected vertices being described by an aggregated feature $r$ is computed as the relative frequency of $r$ over the corresponding edge set $E'$, that is
$$\Pr\left[f(X_u, X_v) = r \mid (u,v) \in E'\right] = \frac{\left|\{(u,v) \in E' \,:\, f(X_u, X_v) = r\}\right|}{|E'|},$$
where $E'$ is either the set of intra-community edges of a given community or the set of inter-community edges.
Compared to the approach introduced in [11, 12], our method uses a coarser granularity for aggregated features. Thanks to that, it avoids the need to compute a number of values exponential in the number of attributes, which is not only inefficient, but also results in an excessive amount of noise injected when applying differential privacy.
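A cosine-similarity aggregator of this kind can be sketched as follows; the interval width `rho = 0.25` is an illustrative assumption, not the paper's chosen discretisation.

```python
import math

def cosine_sim(x, y):
    """Cosine similarity between two binary attribute vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def aggregate(x, y, rho=0.25):
    """Map a pair of vectors to a discrete aggregated feature by bucketing
    their cosine similarity into intervals of width rho."""
    return math.ceil(cosine_sim(x, y) / rho)
```

With `rho = 0.25` the range of aggregated features contains only five values, regardless of the number of attributes.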
5 Differentially Private CAGM
In this section, we describe in detail our mechanisms for obtaining differentially private instances of the CAGM model, as well as the necessary adaptations of the sampling methods when the model has been computed under differential privacy. As we discussed in Section 3, the difference between instantiations of differential privacy for graphs lies in the definition of the pairs of graphs that are considered to be neighbouring datasets. Here, we adopt the following definition from [12].
Definition 2 (Neighbouring attributed graphs [12]).
A pair of attributed graphs $G = (V, E, A)$ and $G' = (V, E', A')$ are neighbouring, denoted $G \sim G'$, if and only if they differ in the presence of exactly one edge or in the attribute vector of exactly one node. That is, either $A = A'$ and $|E \,\triangle\, E'| = 1$, or $E = E'$ and there exists exactly one $v \in V$ such that $A_v \neq A'_v$.
Definition 2 entails that the existence of relations, that is the occurrence of edges, and the attributes describing each particular user are treated as sensitive. By contrast, vertex identifiers are treated as non-private. These criteria are in line with the current privacy policies of most social networking sites, where the fact that a profile exists is public information, but users can keep their personal information and friends list private or hidden from the general public. With Definition 2 in mind, we describe in what follows the differentially private computation of every parameter of CAGM.
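The neighbouring condition of Definition 2 can be checked mechanically for two attributed graphs over the same vertex set. The sketch below (names and data layout are ours) makes the "exactly one edge or exactly one attribute vector, but not both" condition explicit:

```python
def are_neighbouring(edges1, attrs1, edges2, attrs2):
    """Definition 2 check for two attributed graphs on the same vertices.

    edges1, edges2: sets of edges, each edge a tuple (u, v)
    attrs1, attrs2: dicts vertex -> attribute vector (tuple)
    Returns True iff the graphs differ in exactly one edge or in the
    attribute vector of exactly one node.
    """
    edge_diff = len(set(edges1) ^ set(edges2))          # symmetric difference
    attr_diff = sum(1 for v in attrs1 if attrs1[v] != attrs2[v])
    return (edge_diff == 1 and attr_diff == 0) or \
           (edge_diff == 0 and attr_diff == 1)
```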
5.1 Obtaining the Community Partition
Our differentially private community partition method extends the ModDivisive algorithm [32] in such a way that it takes node attributes into account. In its original formulation, ModDivisive searches for a community partition that maximises modularity, a structural parameter encoding the intuition that a user tends to be more connected to users in the same community than to users in other communities [30]. The modularity of a partition $\mathcal{P}$ is defined as

$Q(\mathcal{P}) = \sum_{C \in \mathcal{P}} \left( \frac{e_C}{|E|} - \left( \frac{d_C}{2|E|} \right)^2 \right)$

where $e_C$ is the number of edges between the nodes in $C$ and $d_C$ is the sum of the degrees of the nodes in $C$. ModDivisive uses the exponential mechanism, considering the set of possible partitions as the categorical codomain, and using modularity as the scoring function.
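As a concrete illustration, modularity can be computed directly from an edge list and a community assignment. This is a minimal sketch (function and variable names are ours):

```python
from collections import defaultdict

def modularity(edges, communities):
    """Newman modularity: sum over communities C of
    e_C/|E| - (d_C / (2|E|))^2, where e_C is the number of
    intra-community edges and d_C the total degree of C."""
    m = len(edges)
    e_c = defaultdict(int)  # intra-community edge counts
    d_c = defaultdict(int)  # degree sums per community
    for u, v in edges:
        d_c[communities[u]] += 1
        d_c[communities[v]] += 1
        if communities[u] == communities[v]:
            e_c[communities[u]] += 1
    return sum(e_c[c] / m - (d_c[c] / (2 * m)) ** 2 for c in d_c)
```

For example, two disjoint triangles placed in separate communities give $Q = 2 \cdot (3/6 - (6/12)^2) = 0.5$.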
In order to integrate node features into ModDivisive, we introduce a new objective function that combines the original modularity with an attribute-based quality criterion. The new objective function is defined as

$Q^*(\mathcal{P}) = \lambda\, Q(\mathcal{P}) + (1 - \lambda)\, Q'(\mathcal{P})$

where $\lambda \in [0, 1]$, $Q(\mathcal{P})$ is the modularity of the original graph and $Q'(\mathcal{P})$ is the modularity of an auxiliary graph obtained from the original one as follows. First, we take the vertex set of the original graph. Then, we compute all pairwise similarities between the associated feature vectors. Similarities are computed using the cosine measure (as done in Section 4.4.3 for computing aggregated attributes, but without applying the discretisation). Finally, we add to the auxiliary graph the edges corresponding to the most similar attributed node pairs.
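The auxiliary-graph construction can be sketched as follows. The parameter `k` (how many most-similar pairs become edges) is left abstract here, since the exact count is not specified in this excerpt; function names are ours:

```python
import itertools
import math

def auxiliary_edges(attrs, k):
    """Build the edge set of the auxiliary graph: the k vertex pairs
    whose attribute vectors have the highest cosine similarity.

    attrs: dict vertex -> attribute vector (tuple of numbers)
    k:     number of edges to add (hypothetical parameter)
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    pairs = itertools.combinations(sorted(attrs), 2)
    ranked = sorted(pairs, key=lambda p: cos(attrs[p[0]], attrs[p[1]]),
                    reverse=True)
    return ranked[:k]
```

The combined objective is then simply a convex combination of the modularity of the original graph and the modularity of this auxiliary graph under the same partition.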
It is proven in [32] that the global sensitivity of $Q$ is upper bounded by $\frac{3}{m_{\min}}$, where $m_{\min}$ is the minimum number of edges over all potential graphs to publish. In the worst case, $m_{\min} = 1$, considering that the original graph is an arbitrary non-empty graph. However, this is not the case for real-life social graphs, so introducing more realistic assumptions about the value of $m_{\min}$ allows us to use smaller values of $\Delta Q$ and thus reduce the amount of noise added in differentially privately computing $Q$. Throughout this paper, we assume that the number of edges satisfies $m \geq n$, which leads to $\Delta Q \leq \frac{3}{n}$. As we will see in Section 6, all datasets used in our experiments comply with this assumption. In what follows, we apply an analogous reasoning for bounding $\Delta Q'$.
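ModDivisive's use of the exponential mechanism with a bounded-sensitivity score can be illustrated generically. The sketch below is a minimal, self-contained version of the exponential mechanism (candidate enumeration and names are our assumptions, not the paper's implementation; in practice the sensitivity bound discussed above would be passed in):

```python
import math
import random

def exponential_mechanism(candidates, score, eps, sensitivity, rng=random):
    """Select one candidate with probability proportional to
    exp(eps * score(c) / (2 * sensitivity))."""
    weights = [math.exp(eps * score(c) / (2 * sensitivity))
               for c in candidates]
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for c, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return c
    return candidates[-1]  # guard against floating-point rounding
```

A smaller sensitivity sharpens the exponential weighting, so higher-scoring partitions are selected with correspondingly higher probability at the same privacy budget.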
Proposition 1.
Every graph of order $n$ satisfies $\Delta Q' \leq \frac{3(n-1)}{n}$, where $Q'$ denotes the modularity of its auxiliary graph.
Proof.
Let $G \sim G'$ be two neighbouring attributed graphs and let $H$ and $H'$ be the auxiliary graphs obtained from $G$ and $G'$, respectively. If the difference between $G$ and $G'$ consists only in one edge, then $H = H'$, so in what follows we will consider that $G$ and $G'$ differ in one attribute vector. Let $v$ be the (sole) vertex such that $A_v \neq A'_v$. In the worst case, we have that, for every $u \neq v$, the edge $(u, v)$ occurs in $H$ and does not occur in $H'$ (or vice versa). It was shown in [32] that the modularities of two graphs differing in one edge differ in up to $\frac{3}{m_{\min}}$, where $m_{\min}$ is the minimum number of edges. Then, in the worst case we have $\Delta Q' \leq (n-1) \cdot \frac{3}{m'_{\min}}$, where $n$ is the order of $H$ and $H'$, and $m'_{\min}$ is the minimum number of edges of the auxiliary graphs. As we discussed in Sect. 5.1, $m'_{\min} \geq n$, so $\Delta Q' \leq \frac{3(n-1)}{n}$. The proof is thus completed. ∎
5.2 Attribute Vector Distribution
As discussed in Section 4.4, given a community partition $\mathcal{P}$, in order to obtain the differentially private estimation of the attribute vector distribution, we need to compute the probability distribution of each attribute for every community, i.e. $\Pr[A_v^i = 1 \mid v \in C]$, for each $i \in \{1, \ldots, m\}$ (where $m$ is the number of attributes) and every $C \in \mathcal{P}$. Computing this probability reduces to computing the number of nodes in $C$ whose $i$-th attribute has value $1$, which we denote by $n_C^i$. Let $S_C$ be the sequence $(n_C^1, n_C^2, \ldots, n_C^m)$. In order to obtain the differentially private sequence $\tilde{S}_C$, we add to each element of $S_C$ noise sampled from $\mathrm{Lap}\!\left(\frac{\Delta S_C}{\varepsilon_1}\right)$, where $\varepsilon_1$ is the privacy budget reserved for this computation and $\Delta S_C$ is the global sensitivity of $S_C$, as shown in the next result.

Proposition 2.
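The Laplace mechanism applied to a sequence of counts can be sketched as follows. The sampler draws Laplace noise as the difference of two exponential draws, a standard construction; names and the `rng` parameter are ours:

```python
import random

def private_counts(counts, eps, sensitivity, rng=None):
    """Return the sequence of counts with i.i.d. Laplace(sensitivity/eps)
    noise added to each element (a sketch of the Laplace mechanism)."""
    rng = rng or random.Random()
    b = sensitivity / eps  # scale of the Laplace distribution
    # Laplace(0, b) sampled as the difference of two Exponential(1/b) draws
    return [c + rng.expovariate(1 / b) - rng.expovariate(1 / b)
            for c in counts]
```

Note that the noisy counts are real-valued and may be negative; a post-processing step (rounding and clamping to the community size) can be applied without affecting the privacy guarantee.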
The global sensitivity of the sequence defined above is upper bounded by $m$, the number of attributes.
Proof.
Let $G \sim G'$ be two neighbouring attributed graphs, let $C$ be a community and let $S_C$ and $S'_C$ be the instances of the sequence of attribute counts of $C$ in $G$ and $G'$, respectively. If the difference between $G$ and $G'$ consists only in one edge, then $S_C = S'_C$, so in what follows we will consider that $G$ and $G'$ differ in one attribute vector. Let $v$ be the (sole) vertex such that $A_v \neq A'_v$. If $v \notin C$, then $S_C = S'_C$. On the contrary, if $v \in C$, for every component $i$ such that $A_v^i \neq {A'}_v^i$, we have that $|n_C^i - {n'}_C^i| = 1$.