Since graph neural networks (GNNs) can directly model the structural information of a network topology, they have attracted significant interest recently, from both research and application perspectives [wu2020comprehensive, zhou2020graph]. However, primarily due to business competition and regulatory restrictions, a wealth of sensitive graph-structured data held by different clients cannot be shared, which hampers many practical applications, such as fraud detection across banks [kurshan2020graph] and social network recommendation across platforms [wu2021fedgnn].
Although various privacy-preserving machine learning models have been successfully applied to data types such as images [hsu2020federated], text [ge2020fedner] and tables [wu2021fedgnn], few works have concentrated on the domain of graph machine learning. For decentralized graph-structured data, both nodes and edges are isolated, rendering most privacy-preserving learning methods designed for conventional datasets infeasible.
In this work, we restrict attention to the problem of designing a privacy-preserving GNN for the node classification task that keeps performance intact in the setting of a horizontally partitioned graph dataset, meaning that the attributes of nodes and edges are aligned. As illustrated in Fig. 1, we consider the scenario where several data holders, each storing a private subgraph, connect to one semi-honest (a.k.a. honest-but-curious) server. Each local subgraph contains sensitive information about nodes, edges, attributes and labels. The semi-honest assumption means the server follows the protocol honestly but attempts to infer as much information as possible from the received messages. In view of the fact that one node may interact with the same entity on several platforms, unlike previous work, we consider a more general scenario where overlapping nodes and edges exist among subgraphs.
To address the decentralized graph learning issue under privacy constraints, motivated by the ideas of split learning [vepakomma2018split] and horizontal federated learning [aono2017privacy], we propose a Server-Aided Privacy-preserving GNN (SAPGNN), in which each GNN layer is divided into two sub-models: the local model performs all computation involving private data to generate local node embeddings, whereas the global model computes global embeddings by aggregating all local embeddings. In this way, the isolated neighborhoods can be utilized collaboratively, and the receptive field can be enlarged by stacking multiple layers. Most importantly, when employing a pooling aggregator with a proper update function, SAPGNN generates node representations identical to those learned over the combined graph.
The main contributions of this paper are summarized as follows:
We present a novel SAPGNN framework for training a privacy-preserving GNN in the horizontally partitioned data setup. To the best of our knowledge, it is the first GNN learning paradigm capable of generating the same node embeddings as its centralized counterpart.
We analyse the privacy and overhead of the proposed SAPGNN. A secure pooling mechanism, instead of a naive global pooling aggregator, is proposed to further protect privacy against a semi-honest server.
Experimental results on three datasets demonstrate that the accuracy and macro-F1 of SAPGNN surpass those of models learned over isolated data, and are comparable to the state-of-the-art approach, especially in the setting of an I.I.D. label distribution.
This paper is organized as follows: Sections 2 and 3 introduce recent works on privacy-preserving GNN learning paradigms, notations and preliminaries; in Section 4 we describe and discuss the proposed SAPGNN framework in detail; Section 5 presents the experiments; and Section 6 concludes with a discussion and outlook.
2 Related Works
To tackle the privacy-preserving node classification problem over decentralized graph data, some methods have recently been investigated to train a global GNN collaboratively on various split types of dataset.
First, two learning paradigms named PPGNN [zhou2020privacy] and ASFGNN [zheng2021asfgnn] were proposed based on split learning for vertically and horizontally split datasets, respectively. Both of them alleviate isolation by first training local GNN models over private graphs and then learning global embeddings at a semi-trusted third party. As the graph topology is still exploited only locally, model performance may be substantially reduced when the dataset is heavily decentralized. More recently, LPGNN [sina2020practical] was developed to reduce communication overhead under the assumption that the server has access to the global graph topology but not the private node attributes. Despite its potential, this precondition is not always acceptable, since releasing the topology to the server may lead to a privacy disclosure risk. We show a comparison of these methods in Table I.
| Method | Nodes | Topology | Attributes |
|--------|-------|----------|------------|
| PPGNN  | aligned | not limited | not limited |
| LPGNN  | not limited | shared to server | aligned |
| SAPGNN | not limited | not limited | aligned |
From the application perspective, [wu2021fedgnn] proposes a GNN-based privacy-preserving recommendation framework for decentralized learning from user-item graphs. [hefedgraphnn] presents an open-source federated learning system and gives important insights into federated GNN training over non-I.I.D. molecular datasets.
The nice property of our proposed SAPGNN is that it generates the same node embeddings as the centralized GNN without having access to the raw data stored at other data holders. Unlike previous works, it can achieve the same accuracy on isolated datasets as a model learned over the combined data. In addition, it relaxes the constraints on the partition manners of both nodes and edges.
For clarity, we summarize all the notations used in this paper in Table II.
| Symbol | Description |
|--------|-------------|
| $\mathcal{G}_k$ | local graph of data holder $k$ |
| $\mathcal{V}_k$ | nodes of data holder $k$ |
| $\mathcal{E}_k$ | edges of data holder $k$ |
| $K$ | total number of data holders |
| $\mathcal{K}$ | set of data holders |
| $L$ | total number of layers |
| $\mathcal{L}_k$ | local loss at data holder $k$ |
| $\mathcal{L}$ | total loss of all data holders |
| $\mathcal{N}_k(v)$ | neighbourhood of node $v$ at data holder $k$ |
| $h_v^{(l)}$ | input embedding of node $v$ at the $l$-th layer |
| $g_v^{(l)}$ | global embedding of node $v$ at the $l$-th layer |
| $m_{u,v}^{(l),k}$ | message of the edge connecting nodes $u$ and $v$ at data holder $k$ |
| $a_v^{(l),k}$ | local aggregation of node $v$ at data holder $k$ |
| $z_v^{(l),k}$ | local embedding of node $v$ at data holder $k$ |
| $a_v^{(l)}$ | global aggregation of node $v$ at the server |
| $\phi^{(l)}$ | message construction function at layer $l$ |
| $\psi_{\mathrm{loc}}^{(l)}$ | local vertex update function at layer $l$ |
| $\psi_{\mathrm{glob}}^{(l)}$ | global vertex update function at layer $l$ |
| $[n]$ | set of nonnegative integers not greater than $n$ |
| $\langle\cdot\rangle^{A}$ | encryption using additive sharing |
| $\langle\cdot\rangle^{B}$ | encryption using boolean sharing |
| $\nabla W_k$ | gradient of local weights at data holder $k$ |
| $S_w$ | data size of the local model weights |
| $d$ | length of a node embedding |
| $s$ | data size of the value of each weight |
| $N$ | number of nodes from all local graphs |
| $W^{(l)}$ | weights of the linear transformation matrix |
| $\alpha$ | label distribution ratio |
3.1 Graph representation learning
Let $\mathcal{G}=(\mathcal{V},\mathcal{E})$ denote a graph with vertex set $\mathcal{V}$ and edge set $\mathcal{E}$. Most existing $L$-layer stacked GNN models can be viewed as special cases of the message passing architecture [gilmer2017neural]. Specifically, at the $l$-th layer, the message passing on node $v$ and its neighborhood set $\mathcal{N}(v)$ can be composed of three steps:
where the message construction function in (1) is defined on each edge connected to $v$; the message is constructed by combining the edge feature with the features of its incident nodes $u$ and $v$. The message aggregation function in (2) aggregates the incoming finite unordered message set. It is usually designed as a permutation-invariant set function to guarantee invariance/equivariance over isomorphic graphs; popular choices include mean [hamilton2017inductive], pooling [li2019deepgcns], sum [xu2018powerful] and attention [velickovic2018graph]. The vertex update function in (3) updates the node feature according to the node's own feature and the aggregated message. Lastly, the node representations are fed into loss functions for specific downstream tasks, e.g., node or graph classification [kurshan2020graph, xu2018powerful], link prediction [wu2021fedgnn], etc.
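The three steps above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: `W_msg` and `W_upd` are illustrative weight matrices, a ReLU of a linear map stands in for the message function, and element-wise max is used as the permutation-invariant aggregator.

```python
import numpy as np

def message_passing_layer(h, edges, W_msg, W_upd):
    """One generic message-passing layer: (1) message construction per edge,
    (2) permutation-invariant aggregation (element-wise max here), and
    (3) vertex update from the node's own feature plus the aggregate."""
    n, _ = h.shape
    agg = np.full((n, W_msg.shape[1]), -np.inf)   # -inf is the identity of max
    for u, v in edges:                            # step 1: message on each edge
        m = np.maximum(h[u] @ W_msg, 0.0)         # ReLU(h_u W) as a simple phi
        agg[v] = np.maximum(agg[v], m)            # step 2: max aggregation
    agg[np.isinf(agg)] = 0.0                      # nodes with no incoming edge
    return np.maximum(np.concatenate([h, agg], axis=1) @ W_upd, 0.0)  # step 3
```

Because the aggregator is a set function, the output is independent of the order in which edges are processed.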
3.2 Split learning
Unlike federated learning [yang2019federated], where each client trains an entire replica of the model, the keynote of split learning is to split the execution of a model on a per-layer basis between clients and an aiding server [vepakomma2018split, gupta2018distributed]. In principle, each data holder first finishes the computation involving private data up to a cut layer; the outputs are then sent to another entity for subsequent computation. After the forward propagation, the gradients are computed from the loss function and back-propagated. Throughout training and inference, data privacy is guaranteed by the fact that raw data only participates in local computation and is never accessed by others. Both theoretical analysis [singh2019detailed] and practical application [gao2020end] compare the efficiency and effectiveness of federated learning and split learning, and show the potential of both methods for designing private decentralized learning procedures. For more details and advances, we refer the reader to [kairouz2019advances] and the website https://splitlearning.github.io/.
3.3 Secret sharing
Our proposed model employs n-out-of-n secret sharing schemes to recover private values from secret shares [shamir1979share, demmler2015aby]. In particular, when a client wants to share an $\ell$-bit value $x$ among $n$ parties, it first generates shares $x_1,\dots,x_{n-1}$ uniformly at random, sends one to each party, and generates the last share $x_n$ satisfying $x_n = (x - \sum_{i=1}^{n-1} x_i) \bmod 2^{\ell}$ for additive sharing and $x_n = x \oplus x_1 \oplus \cdots \oplus x_{n-1}$ for boolean sharing, respectively. Accordingly, $x$ can be reconstructed at any entity by gathering all shared values. Secret sharing has become a popular basis of advanced secure multi-party computation frameworks [patra2020aby2, byali2020flash] and has been applied to many privacy-preserving machine learning algorithms, such as secure aggregation [bonawitz2017practical], embedding generation [zhou2020privacy], and secure computation [mohassel2020practical]. For clarity, we denote additive sharing by $\langle\cdot\rangle^{A}$ and boolean sharing by $\langle\cdot\rangle^{B}$ in the following.
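The two sharing schemes can be sketched as follows; this is a toy illustration over $\ell$-bit integers using Python's `secrets` module, not a production implementation.

```python
import secrets

def additive_share(x, n, bits=32):
    """Split x into n shares whose sum equals x modulo 2^bits."""
    mod = 1 << bits
    shares = [secrets.randbelow(mod) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % mod)   # last share completes the sum
    return shares

def boolean_share(x, n, bits=32):
    """Split x into n shares whose XOR equals x."""
    shares = [secrets.randbits(bits) for _ in range(n - 1)]
    last = x
    for s in shares:
        last ^= s
    shares.append(last)                      # XOR of all shares recovers x
    return shares

def reconstruct_additive(shares, bits=32):
    return sum(shares) % (1 << bits)

def reconstruct_boolean(shares):
    out = 0
    for s in shares:
        out ^= s
    return out
```

Any $n-1$ shares are uniformly random and reveal nothing about $x$; only the full set of $n$ shares reconstructs it.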
4 The Proposed SAPGNN Framework
In this section, we describe the proposed SAPGNN framework, which keeps accuracy intact compared to the counterpart learned over the combined graph. The learning paradigm consists of parameter initialization, forward propagation, back propagation and local parameter fusion. Finally, we discuss the additional overhead and data privacy in the presence of semi-honest adversaries.
4.1 Parameter initialization
First of all, the participating data holders and server build pair-wise secure channels for all subsequent communication to ensure data integrity. Recall that all nodes from the local graphs share the same feature domain. Inspired by horizontal federated learning [aono2017privacy], the local models at all data holders are initialized with the same weights to keep the model behavior identical. This can easily be implemented by sharing the same initialization approach and random seed. Additionally, the shared parameters also include: (1) training hyperparameters, shared among data holders and server, and (2) a hashed node index list, shared only with the server. The hashed index list is used to index and distinguish nodes from all local graphs while hiding the raw index information from the server. The server, for its part, randomly initializes the global model weights used to generate global embeddings.
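The shared-seed initialization and the hashed node index list can be sketched as below. The salted SHA-256 hash is an assumption of this sketch: the salt is taken to be shared among data holders but withheld from the server, so the server can match identical nodes across holders without learning raw node identifiers.

```python
import hashlib
import numpy as np

def init_local_weights(seed, shape):
    """Every data holder calls this with the same seed and shape, so all
    local models start from identical weights."""
    return np.random.default_rng(seed).normal(scale=0.1, size=shape)

def hashed_index(node_id, salt):
    """Deterministic salted hash: the same node yields the same digest at
    every holder, while the raw identifier stays hidden from the server."""
    return hashlib.sha256((salt + str(node_id)).encode()).hexdigest()
```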
4.2 Forward Propagation
As illustrated in Fig. 2, in order to protect data privacy (i.e., node attributes, edge information and node labels) while exploiting all isolated graph information, we design a modified message passing architecture in the manner of layer-wise split learning. To be specific, the forward pass at each layer is divided into two steps: first, each data holder individually computes local embeddings from its private data; then, the semi-honest server collects the non-private local embeddings to compute global embeddings. In the end, the output of the last layer is fed into the label prediction and loss computation functions.
4.2.1 Private local embedding computation
In line with the message passing architecture, each data holder first constructs local messages as
where $\mathcal{N}_k(v)$ denotes the neighbor node set in the local graph of data holder $k$, and $\theta^{(l)}$ denotes the parameters of the message construction function.
The next step is local message aggregation. Suppose the aggregation were conducted over the combined graph of all data holders. Since the same edge may simultaneously appear at several data holders, the same node would be counted multiple times when sum [xu2018powerful], mean [hamilton2017inductive] or degree-based [kipf2017semi] aggregators are employed. Fortunately, the max/min pooling aggregator naturally tolerates such duplicates, so we build the decentralized learning paradigm on the pooling aggregator. Taking max pooling as an example, each data holder aggregates messages over its local neighbors by
After local aggregation, each data holder calculates the local node embeddings from the node feature and the aggregated neighbor feature via the local vertex update function,
where $-\boldsymbol{\infty}$ denotes the vector whose elements are all negative infinity (the identity element of max, used for nodes without local neighbors). The local embeddings hide the raw information of the local graph, and hence can be sent to the server for further global computation.
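The local step at one data holder can be sketched as follows, assuming a concatenation-style update with a nonnegative weight matrix so that the update stays monotone in the aggregated part; all names are illustrative.

```python
import numpy as np

def local_embedding(h, local_edges, W):
    """Max-aggregate messages over the holder's own neighbours, then apply a
    monotone update. -inf is the identity of max and marks nodes that have
    no neighbour in this local subgraph."""
    n, d = h.shape
    agg = np.full((n, d), -np.inf)
    for u, v in local_edges:
        agg[v] = np.maximum(agg[v], h[u])
    agg[np.isinf(agg)] = 0.0
    # nonnegative weights on the aggregated part keep the output
    # monotonically increasing in each element of agg
    return np.concatenate([h, agg], axis=1) @ np.abs(W)
```

Monotonicity of the update in the aggregated part is what later allows the server-side max over local embeddings to match the centralized computation.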
4.2.2 Global embedding computation
This step consists of global aggregation and vertex update. Concretely, the server first aggregates the local node embeddings from all data holders with the same pooling function as in (5) by
After that, the server transforms the aggregated embedding to compute the global node representation of layer $l$ as
To cover the design space of GNNs [you2020design], a combination of linear transformation, batch normalization, activation and dropout can be incorporated into the vertex update function to enhance model capacity.
Note that the result of the pooling aggregation in (5) and (7) only depends on the element-wise maximum. In order to follow the behavior of the centralized GNN layer, the global aggregated result (i.e., the left side of (9)) should be identical to the aggregation over all neighbors in the combined graph (i.e., the right side of (9)), which can be formulated as
To satisfy the equation above, the constraints on the local update function are given as follows:
Proposition (Constraints on the local update function). When the aggregation function is the element-wise max, each element of the output of the local update function should monotonically increase with each element of its aggregated input; e.g., the update can be chosen from concatenation, element-wise multiplication, or a monotone multilayer perceptron.
Proof. Denote the result of element-wise max aggregation over the local neighbor information at data holder $k$ by $a^{k}$, and over the entire neighbor information by $a$, so that $a = \max_{k} a^{k}$ element-wise. Omitting the layer index $l$ and node index $v$, the above equation reduces to $\max_{k}\psi(h, a^{k}) = \psi(h, \max_{k} a^{k})$, where $\psi$ denotes the local update function. By the property of the max function, equation (12) holds if and only if each element of the output of $\psi$ monotonically increases with each element of $a$. ∎
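The proposition can be checked numerically: for an element-wise strictly increasing update (softplus here, as an arbitrary stand-in for the paper's update function), the max of the locally updated aggregates coincides with the update applied to the combined max.

```python
import numpy as np

# Element-wise monotone update: softplus is strictly increasing.
f = lambda a: np.log1p(np.exp(a))

rng = np.random.default_rng(0)
local_aggs = [rng.normal(size=6) for _ in range(3)]   # per-holder max-aggregates

split = np.max([f(a) for a in local_aggs], axis=0)    # server-side max of updated locals
central = f(np.max(local_aggs, axis=0))               # update of the combined max
assert np.allclose(split, central)                    # the two computations agree
```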
When the global embeddings of all nodes at layer $l$ have been obtained by (8), the server distributes them to each data holder according to the node list of its local graph for the forward propagation of the next layer. This process is conducted iteratively until the last layer $L$.
4.2.3 Private local loss computation
When the global node embeddings of the last layer have been computed, each data holder predicts labels based on the embeddings by
and the local loss at data holder $k$ over the local training node labels can then be computed by
where $n_k$ is the number of labels at data holder $k$ and $\ell(\cdot)$ is the loss function, such as cross-entropy for a classification task or mean squared error for a regression task.
To summarize, the forward propagation is given in Algorithm 1. Once the forward propagation is finished, the model weights can be updated by the back propagation procedure outlined in what follows.
4.3 Back Propagation
Recall that the local part (i.e., the computation of local embeddings and losses at the data holders) and the global part (i.e., the computation of global embeddings at the server) of each layer are spatially separated. According to the chain rule, the entire model can be updated iteratively by communicating intermediate gradients between data holders and server. Herein, the gradients of the local model weights are computed individually and then secretly aggregated for the update. In the following, we describe the computation and communication of the back propagation procedure in detail.
4.3.1 Individual back propagation of the prediction layer
As the bridge between the final node embedding and the model output, the weights of the prediction function at data holder $k$ can first be learned by gradient descent, minimizing the local loss individually. After that, the data holder computes the gradient of the loss with respect to the input of the prediction function and sends it to the server for subsequent back propagation.
4.3.2 Back propagation of each SAPGNN layer
By the chain rule, the gradient of the entire loss with respect to the output of the last layer can be computed as
while for the $l$-th layer ($l < L$), based on the derivative of the max function, the gradient of the loss with respect to the input embedding can be decomposed as
where the first factor denotes the gradient of the loss with respect to the output global embedding of layer $l$, the second denotes the gradient of the global embedding with respect to the result of the global aggregation, and the third denotes the gradient of the global aggregation with respect to the input of the global model; these can all be computed at the server side. The last factor denotes the gradient of the data holder output with respect to the input embedding, which can be obtained at each data holder individually. Therefore, according to (17), the gradients can be back-propagated layer by layer recursively. At each layer, the propagation is first carried out globally at the server side and then locally, in parallel, at each data holder side.
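In the backward pass through the global max-pooling step, the server effectively routes the gradient of each embedding element to the data holder whose local embedding attained the maximum. A sketch follows; breaking ties in favor of the first holder is an implementation choice of this sketch, not prescribed by the paper.

```python
import numpy as np

def route_max_gradient(local_embs, grad_global):
    """Backward pass of element-wise max pooling over data holders: the
    gradient w.r.t. each holder's local embedding is nonzero only where that
    holder attained the maximum (ties go to the lowest holder index)."""
    stacked = np.stack(local_embs)            # shape (K, n, d)
    winner = np.argmax(stacked, axis=0)       # which holder won each element
    return [grad_global * (winner == k) for k in range(stacked.shape[0])]
```

Since exactly one holder wins each element, the routed gradients sum back to the server-side gradient.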
4.3.3 Global back propagation at server side.
The server first obtains the gradient of the loss with respect to the global embedding by summing the gradients received from all data holders, and then computes the derivatives with respect to the global model weights and the local embedding of every data holder:
respectively. The result of (19) is sent to the corresponding data holder for the subsequent local back propagation.
4.3.4 Local back propagation at data holder side.
The gradient of the loss with respect to the set of local weights at data holder $k$ can be expressed as
According to (17), each data holder also needs to calculate the gradient of its output local embedding with respect to the input node embedding and send it to the server.
4.4 Weights update
As described above, the model weights of SAPGNN are spatially divided into two categories: the global sub-model weights held by the server and the local sub-model weights held by the data holders.
4.4.1 Update of global model weight.
When the corresponding gradients have been obtained by (18), the global weights can be directly updated through gradient descent.
4.4.2 Update of local model weight.
To keep the isolated local weights identical across all data holders during training, the corresponding local gradients should be federally aggregated at all data holders, using, for instance, secure aggregation [bonawitz2017practical] or homomorphic encryption [aono2017privacy]. Taking secure aggregation as an example, each data holder first secretly shares its local gradient with the other data holders and then sums up the shares it receives.
After that, each data holder reconstructs the entire aggregated gradient for the update by gathering the partial sums from the others.
Note that during this procedure, each data holder only accesses the secret shares and the reconstructed aggregated gradient, whereas the server learns nothing about the local gradients.
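Gradient aggregation via additive sharing can be sketched as follows; the fixed-point encoding (16 fractional bits over a 32-bit ring) is an assumption of this sketch, needed to share real-valued gradients over the integers.

```python
import numpy as np

MOD = 1 << 32
SCALE = 1 << 16   # fixed-point scale for encoding float gradients

def encode(g):
    return np.round(g * SCALE).astype(np.int64) % MOD

def decode(v):
    v = v % MOD
    v = np.where(v >= MOD // 2, v - MOD, v)   # map back to the signed range
    return v / SCALE

def share(v, n, rng):
    """Additive n-out-of-n shares of an integer vector v modulo 2^32."""
    shares = [rng.integers(0, MOD, size=v.shape) for _ in range(n - 1)]
    shares.append((v - sum(shares)) % MOD)
    return shares

def secure_sum(grads, rng):
    """Each holder shares its encoded gradient; each holder sums the shares
    it received; the partial sums reconstruct the total gradient."""
    n = len(grads)
    all_shares = [share(encode(g), n, rng) for g in grads]
    partial = [sum(all_shares[i][j] for i in range(n)) % MOD for j in range(n)]
    return decode(sum(partial) % MOD)
```

No single holder's gradient is recoverable from fewer than all $n$ shares, yet the reconstructed sum matches the plain aggregate up to fixed-point rounding.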
4.5 Discussion of security and overhead
4.5.1 Data privacy
In our proposed learning paradigm, data privacy can be guaranteed by the following reasons:
All computations involving the aforementioned private data (node attributes, edge information, labels and local model gradients) are carried out by the data holders locally. From the perspective of the semi-honest server, only the hashed node lists of the local graphs, the local embeddings computed in (6) and the global model are observable. Therefore, SAPGNN is secure against a semi-honest server.
The only sensitive messages observed by data holders are the secret shares of the gradients of the local model weights. Since the gradients are split by an n-out-of-n secret sharing algorithm, the raw data can be reconstructed if and only if one gathers all the shared parts. This prevents semi-honest adversaries among the other data holders.
The TLS/SSL protocol ensures the security and data integrity of the pair-wise network communication [aono2017privacy].
4.5.2 Extra communication overhead
N-out-of-n secret sharing leads to quadratic growth of the communication overhead with respect to the number of data holders: aggregating the gradients of the local model for one update costs on the order of $K^2 S_w s$, where $s$ denotes the data size of each weight value. In addition, as explained in the forward process, the local embeddings and the global embeddings are transmitted between server and data holders at each layer; with $d$ the length of a node embedding and $N$ the number of nodes from all local graphs, this overhead is on the order of $L N d$ per pass. Therefore, although a small number of layers is sufficient for training a competitive GNN [chen2020simple] and impedes over-smoothing, the communication overhead becomes a bottleneck that limits efficiency and scalability, since $S_w$ and $N$ can be extremely large in the case of a heavy model with millions of parameters, or in Internet of Things scenarios with massive numbers of devices [gao2020end]. Potential solutions include conducting mini-batch training instead of full-batch training, or utilizing model and communication compression techniques [rothchild2020fetchsgd]. We leave these optimizations as future work.
4.5.3 Secure global pooling aggregation
Note that when conducting the global aggregation, only the element-wise maximum values over all local embeddings in (7) are required during the forward step, while the corresponding data holder indexes (i.e., those appearing in (17)) are needed at the backward step. To further improve privacy, this raw information can be encrypted using private comparison approaches built on secure maximum computation protocols, which have been widely utilized in machine learning applications such as k-means [jaschke2018unsupervised, mohassel2020practical]. Specifically, for each element of the local embeddings, the problem is to output the secret share of the index vector indicating the maximum value among the $K$ candidates. This functionality has been deeply investigated in recent works such as [mohassel2020practical], and can be efficiently implemented with less-than garbled circuits and instances of oblivious transfer extension. Utilizing the secure global pooling aggregation creates more obstacles for the semi-honest server to learn private information from the data holders.
In this section we present experimental results for the proposed SAPGNN. We first describe the datasets, experimental setup and compared methods. After that, we run experiments to show the superiority of SAPGNN under a near-I.I.D. label distribution setting.
5.1 Datasets and experimental setup
We test SAPGNN on three publicly available citation datasets used for node classification in previous works [zhou2020privacy, zheng2021asfgnn], i.e., Cora, Citeseer and Pubmed. In these datasets, each node represents a document, and edges denote citation links. Each node has a bag-of-words feature vector and a label indicating its category. We follow the default node masks of the DGL framework [wang2019dgl] for the training, validation and test node sets. The main characteristics of each dataset are given in Table III. All experiments are evaluated on a Windows desktop with a 3.2 GHz 6-core Intel Core i7-8700 CPU and 16 GB of RAM.
5.2 Compared methods
We compare SAPGNN against two methods:
The first is separate training (SP), i.e., each data holder trains GNN individually over their own subgraph. It cannot utilize information from others and thus can be treated as a baseline method.
The second is PPGNN [zhou2020privacy], which first conducts separate training and then predicts over the combined node embeddings. Note that the training, validation and test node sets for PPGNN need to be privately aligned among data holders before the experiments, since PPGNN requires each node to exist at all local graphs.
For all methods, we use a two-layer GNN constructed by the following formulation:
The ReLU activation function and dropout are applied to the output of each layer except the last one. All models are trained for a maximum of 300 epochs using cross-entropy loss with the Adam optimizer and a learning rate of 0.01. We performed a grid search with early stopping to find the best hidden size for each method, and the accuracy and macro-F1 are evaluated on the test set over 40 consecutive runs.
5.3 Results with uniformly split edges
We first compare the three decentralized learning methods under the I.I.D. edge setting, where the original edge set is divided uniformly into the subgraphs of the data holders; the results are reported in Table IV. First, we observe that the metrics of SAPGNN stay identical as the number of data holders varies, and equal the results obtained by the centralized counterpart (i.e., SP when the number of data holders is 1). The reason is straightforward: the global node representations learned by SAPGNN are the same as those learned over the combined graph. Second, SAPGNN consistently outperforms SP, and the gap widens as the number of data holders grows, since SP only accesses local information. Compared to PPGNN, SAPGNN is competitive on the Cora and Citeseer datasets, but slightly worse on Pubmed. In the following, we mainly compare SAPGNN and PPGNN in the case of non-I.I.D. label distributions and drop the SP method for conciseness.
5.4 Results with varying label distribution
Existing works have demonstrated that the performance of decentralized learning methods decreases as the label distribution becomes more non-I.I.D. [gao2020end, zhao2018federated]. To examine this, we first divide the nodes among the data holders according to their labels, and then $\alpha\%$ of the nodes of each data holder are split uniformly to the other data holders. Only the edges connecting nodes at the same data holder are retained. Thus, a larger label distribution ratio $\alpha$ implies a more similar label distribution among data holders, and increasing the number of data holders leads to more removed edges. Taking two data holders with the Cora dataset as an example, Fig. 3 shows the percentage of nodes at each data holder for each class, where $\alpha=0$ implies that the labels among data holders are completely disjoint, i.e., the subgraph at data holder 1 includes 1097 nodes of the first four classes, while data holder 2 only has 543 nodes with labels of the last three classes. In the case of $\alpha=100$, each subgraph contains about half of the nodes of each class (821 nodes at data holder 1 and 819 nodes at data holder 2). Note that the original PPGNN can only generate embeddings for nodes overlapping at all data holders. For a fair comparison, instead of directly removing nodes, we remove all edges connected to such nodes at each local subgraph, so that no messages pass from or to adjacent neighbors.
Fig. 4 and Fig. 5 respectively show the node classification accuracy and F1 score as the number of data holders varies from 2 to 4 under different values of $\alpha$. We observe that the label distribution has an important influence on the metrics. Specifically, when $\alpha$ is small, PPGNN has a comfortable lead over SAPGNN. This is because PPGNN generates node embeddings locally and thus can balance the contributions from different data holders. When the classes of nodes are totally different among data holders, training a shared or federated model has no benefit over learning individually on a relatively simple classification task [zhao2018federated]. On the other hand, SAPGNN achieves comparable performance at intermediate values of $\alpha$, and outperforms PPGNN for large $\alpha$ on all datasets, which means SAPGNN is more effective at learning from adjacent information when the data holders have a near-I.I.D. label distribution. Finally, by comparing the performance of SAPGNN with the same $\alpha$ over various numbers of data holders, we find that removing inter-class edges may reduce the learning performance on Citeseer, while having relatively little influence on Cora and Pubmed.
In this paper, we proposed a server-aided privacy-preserving GNN framework for horizontally partitioned graph-structured datasets. It is able to generate the same node embeddings as the centralized GNN without revealing raw data. Therefore, proven properties of the centralized model (e.g., convergence and generalization) transfer to the proposed SAPGNN. For stronger privacy, we further gave a secure global pooling aggregation mechanism capable of hiding the raw local embeddings from semi-honest adversaries. We showed successful cases of SAPGNN on the node classification task, especially when the labels of the isolated datasets tend to be identically distributed, but it behaves worse than existing methods under highly skewed non-I.I.D. label distributions. This observation can guide the choice of a suitable decentralized learning paradigm according to the distribution of the graph data.
In the future, we would like to extend the proposed learning framework to more general GNN architectures and more partition types of graph datasets. More importantly, communication efficiency deserves attention in order to unleash the full potential of SAPGNN and other decentralized GNN learning approaches in real-world applications.