Towards Representation Identical Privacy-Preserving Graph Neural Network via Split Learning

07/13/2021 · Chuanqiang Shan, et al.

In recent years, the rapid rise in the number of studies on graph neural networks (GNNs) has pushed them from theoretical research to real-world application. Despite the encouraging performance achieved by GNNs, little attention has been paid to privacy-preserving training and inference over distributed graph data in the related literature. Due to the particularity of graph structure, it is challenging to extend existing private learning frameworks to GNNs. Motivated by the idea of split learning, we propose a Server Aided Privacy-preserving GNN (SAPGNN) for node-level tasks in the horizontally partitioned cross-silo scenario. It offers a natural extension of centralized GNNs to isolated graphs with max/min pooling aggregation, while guaranteeing that all the private data involved in computation stays with the local data holders. To further enhance data privacy, a secure pooling aggregation mechanism is proposed. Theoretical and experimental results show that the proposed model achieves the same accuracy as its counterpart learned over the combined data.


1 Introduction

Since graph neural networks (GNNs) can directly model the structural information of network topology, they have attracted significant interest recently, from both research and application perspectives [wu2020comprehensive, zhou2020graph]. However, primarily due to business competition and regulatory restrictions, a wealth of sensitive graph-structured data held by different clients cannot be shared, which hinders many practical applications, such as fraud detection across banks [kurshan2020graph] and social network recommendation across platforms [wu2021fedgnn].

Although various privacy-preserving machine learning models have been successfully applied to data types such as images [hsu2020federated], text [ge2020fedner] and tables [wu2021fedgnn], few works have concentrated on the domain of graph machine learning. For decentralized graph-structured data, both nodes and edges are isolated, rendering most privacy-preserving learning methods designed for conventional datasets infeasible.

In this work, we restrict attention to the problem of designing a privacy-preserving GNN for the node classification task that keeps performance intact in the setting of a horizontally partitioned graph dataset, which means the attributes of nodes and edges are aligned. As illustrated in Fig. 1, we consider the scenario where several data holders, each storing a private subgraph, connect to one semi-honest (a.k.a. honest-but-curious) server. Each local subgraph contains sensitive information about nodes, edges, attributes and labels. The semi-honest server assumption means that the server follows the protocol honestly but attempts to infer as much information as possible from the messages it receives. In view of the fact that one node may interact with the same entity on several platforms, unlike previous work, we consider a more general scenario where overlapping nodes and edges exist among subgraphs.

To address the decentralized graph learning issue under privacy constraints, motivated by the ideas of split learning [vepakomma2018split] and horizontal federated learning [aono2017privacy], we propose a Server Aided Privacy-preserving GNN (SAPGNN), in which each GNN layer is divided into two sub-models: the local model carries out all computation that involves private data and generates local node embeddings, whereas the global model calculates global embeddings by aggregating all local embeddings. In this way, the isolated neighborhoods can be utilized collaboratively, and the receptive field can be enlarged by stacking multiple layers. Most importantly, when employing a pooling aggregator with a proper update function, SAPGNN generates node representations identical to those learned over the combined graph.

Fig. 1: The proposed SAPGNN on horizontally partitioned data. The isolated data holders have the same feature domain (e.g., {f1,f2,f3}) and edge type, but differ in nodes, edges and labels.

The main contributions of this paper are summarized as follows:

  • We present a novel SAPGNN framework for training privacy-preserving GNNs in the horizontally partitioned data setup. To the best of our knowledge, it is the first GNN learning paradigm capable of generating the same node embeddings as its centralized counterpart.

  • We analyse the privacy and overhead of the proposed SAPGNN. A secure pooling mechanism is proposed in place of the naive global pooling aggregator to further protect privacy against a semi-honest server.

  • Experimental results on three datasets demonstrate that the accuracy and macro-F1 of SAPGNN surpass those of models learned over isolated data and are comparable to the state-of-the-art approach, especially in the setting of an IID label distribution.

This paper is organized as follows: Sections 2 and 3 introduce recent work on privacy-preserving GNN learning paradigms, notation and preliminaries; Section 4 describes and discusses the proposed SAPGNN framework in detail; Section 5 presents the experiments; and Section 6 concludes with discussion and outlook.

2 Related Works

To tackle the privacy-preserving node classification problem over decentralized graph data, several methods have recently been investigated for training a global GNN collaboratively over various types of data partition.

First, two learning paradigms named PPGNN [zhou2020privacy] and ASFGNN [zheng2021asfgnn] were proposed based on split learning, for vertically and horizontally split datasets respectively. Both alleviate isolation by first training local GNN models over private graphs and then learning global embeddings at an assisting semi-trusted third party. As the graph topology is only exploited locally, the model performance may be substantially reduced when the dataset is heavily decentralized. More recently, LPGNN [sina2020practical] was developed to reduce communication overhead under the assumption that the server has access to the global graph topology but not to the private node attributes. Despite its potential, this precondition is not always acceptable, since releasing the topology to the server may lead to privacy disclosure. We compare these methods in Table I.

Model Nodes Edges Features
PPGNN aligned not limited not limited
ASFGNN different different aligned
LPGNN not limited shared to server aligned
SAPGNN not limited not limited aligned
TABLE I: The comparison of data partition manners

From the perspective of application, [wu2021fedgnn] proposes a GNN-based privacy-preserving recommendation framework for decentralized learning from user-item graphs, and [hefedgraphnn] presents an open-source federated learning system and gives important insights into federated GNN training over non-IID molecular datasets.

The nice property of our proposed SAPGNN is that it generates the same node embeddings as the centralized GNN without having access to the raw data stored at other data holders. Unlike previous works, it achieves the same accuracy on isolated datasets as a model learned over the combined data. In addition, it relaxes the constraints on the partition manners of both nodes and edges.

3 Preliminaries

For clarity, we summarize all the notations used in this paper in Table II.

Not.            Descriptions
G_i             local graph of data holder i
V_i             nodes of data holder i
E_i             edges of data holder i
P               total number of data holders
[P]             set of data holders {1, …, P}
K               total number of layers
L_i             local loss at data holder i
L               total loss over all data holders
N_i(v)          neighbourhood of node v at data holder i
h_v^{(k-1)}, h_v^{(k)}   input and global embedding of node v at the k-th layer
m_{vu}^{(k)}    message of the edge connecting nodes v and u at a data holder
a_{v,i}^{(k)}   local aggregation of node v at data holder i
z_{v,i}^{(k)}   local embedding of node v at data holder i
a_v^{(k)}       global aggregation of node v at the server
φ^{(k)}         message construction function at layer k
ψ_loc^{(k)}     local vertex update function at layer k
ψ_glob^{(k)}    global vertex update function at layer k
⊕               XOR operator
Z_{2^ℓ}         set of nonnegative integers smaller than 2^ℓ
⟨x⟩^A           sharing of x using additive secret sharing
⟨x⟩^B           sharing of x using Boolean secret sharing
W               model weights
g_i             gradient of the local model weights at data holder i
S_W             data size of the local model weights
d               length of a node embedding
s               data size of the value of each weight
N               number of nodes from all local graphs
W^{(k)}         weights of the linear transformation matrix at layer k
α               label distribution ratio
TABLE II: Notations and descriptions.

3.1 Graph representation learning

Let G = (V, E) define a graph with vertex set V and edge set E. Most existing K-layer stacked GNN models can be viewed as special cases of the message passing architecture [gilmer2017neural]. Specifically, at the k-th layer, the message passing on node v and its neighborhood set N(v) can be composed of three steps:

m_{vu}^{(k)} = φ^{(k)}( h_v^{(k-1)}, h_u^{(k-1)}, e_{vu} )    (1)
a_v^{(k)} = ρ^{(k)}( { m_{vu}^{(k)} : u ∈ N(v) } )    (2)
h_v^{(k)} = ψ^{(k)}( h_v^{(k-1)}, a_v^{(k)} )    (3)

where φ^{(k)} in (1) is a message construction function defined on each edge connected to v. The message m_{vu}^{(k)} is constructed by combining the edge feature e_{vu} with the features of its incident nodes v and u. The message aggregation function ρ^{(k)} in (2) calculates a_v^{(k)} by aggregating the incoming finite unordered message set. The function ρ^{(k)} is usually designed as a permutation-invariant set function to guarantee invariance/equivariance to isomorphic graphs; popular choices include mean [hamilton2017inductive], pooling [li2019deepgcns], sum [xu2018powerful] and attention [velickovic2018graph]. The vertex update function ψ^{(k)} in (3) updates the node feature according to the node's own feature h_v^{(k-1)} and the aggregated message a_v^{(k)}. Lastly, the node representations are applied to loss functions for specific downstream tasks, e.g., node or graph classification [kurshan2020graph, xu2018powerful], link prediction [wu2021fedgnn], etc.
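To make steps (1)-(3) concrete, the following Python sketch implements one message-passing layer with a max-pooling aggregator over an edge-list graph; the function names (message_passing_layer, phi, psi) and the dictionary-based graph representation are our own illustrative choices, not an API from the paper.

```python
import numpy as np

def message_passing_layer(h, edges, edge_feat, phi, psi):
    """One generic message-passing layer with element-wise max aggregation.

    h:         dict {node_id: feature vector} from the previous layer
    edges:     list of (v, u) pairs, meaning u is a neighbor of v
    edge_feat: dict {(v, u): edge feature vector}
    phi, psi:  message construction and vertex update functions
    """
    # Step (1): construct a message on every edge incident to v.
    messages = {}
    for (v, u) in edges:
        messages.setdefault(v, []).append(phi(h[v], h[u], edge_feat[(v, u)]))

    # Steps (2)+(3): aggregate messages permutation-invariantly, then update v.
    h_new = {}
    for v, h_v in h.items():
        if v in messages:
            a_v = np.max(np.stack(messages[v]), axis=0)  # max-pooling aggregator
        else:
            a_v = np.full_like(h_v, -np.inf)             # node without neighbors
        h_new[v] = psi(h_v, a_v)
    return h_new
```

Any permutation-invariant reduction (mean, sum, attention) could replace np.max in step (2); the max variant is the one SAPGNN builds on.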

3.2 Split learning

Unlike federated learning [yang2019federated], where each client trains an entire replica of the model, the key idea of split learning is to split the execution of a model on a per-layer basis between the clients and an aiding server [vepakomma2018split, gupta2018distributed]. In principle, each data holder first finishes the computation that involves private data up to a cut layer, and the outputs are then sent to another entity for the subsequent computation. After the forward propagation, the gradients are computed based on the loss function and back propagated. Throughout the training or inference process, data privacy is guaranteed by the fact that raw data only participates in local computation and is never accessed by others. Both theoretical analyses [singh2019detailed] and practical applications [gao2020end] have compared the efficiency and effectiveness of federated learning and split learning, and show the potential of both methods for designing private decentralized learning procedures. For more details and advances, we refer to the reference [kairouz2019advances] and the website https://splitlearning.github.io/.
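The following PyTorch sketch illustrates the split-learning idea with one cut layer between a data holder and the server; the tensor shapes and hyperparameters are arbitrary, and for brevity the loss is computed at the server here, whereas SAPGNN later keeps label prediction and loss at the data holder.

```python
import torch
import torch.nn as nn

# A minimal two-party sketch: the data holder owns the lower layers and the
# raw features; the server owns the upper layers. Only the cut-layer
# activation and its gradient cross the boundary.
client_net = nn.Sequential(nn.Linear(16, 32), nn.ReLU())   # up to the cut layer
server_net = nn.Sequential(nn.Linear(32, 4))               # remaining layers
opt_c = torch.optim.SGD(client_net.parameters(), lr=0.01)
opt_s = torch.optim.SGD(server_net.parameters(), lr=0.01)

x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))       # private data stays here

# Forward: only the cut-layer activation is sent to the server.
act = client_net(x)
act_sent = act.detach().requires_grad_(True)
loss = nn.functional.cross_entropy(server_net(act_sent), y)

# Backward: the server returns the gradient w.r.t. the received activation,
# and the data holder continues back-propagation locally.
opt_s.zero_grad(); loss.backward(); opt_s.step()
opt_c.zero_grad(); act.backward(act_sent.grad); opt_c.step()
```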

3.3 Secret sharing

Our proposed model employs n-out-of-n secret sharing schemes to protect private values by splitting them into secret shares [shamir1979share, demmler2015aby]. In particular, when a client wants to share an ℓ-bit value x with n parties, it first generates n − 1 shares uniformly at random and sends one to each other party, and it keeps the last share, which satisfies x = ⟨x⟩_1 + … + ⟨x⟩_n mod 2^ℓ for additive sharing and x = ⟨x⟩_1 ⊕ … ⊕ ⟨x⟩_n for Boolean sharing, respectively. Accordingly, x can be reconstructed at any entity by gathering all shares. Secret sharing has become a popular basis of advanced secure multi-party computation frameworks [patra2020aby2, byali2020flash] and has been applied to many privacy-preserving machine learning algorithms, such as secure aggregation [bonawitz2017practical], embedding generation [zhou2020privacy], and secure computation [mohassel2020practical]. For clarity, we denote additive sharing by ⟨·⟩^A and Boolean sharing by ⟨·⟩^B in the following.
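A minimal sketch of n-out-of-n sharing and reconstruction, assuming a 32-bit ring for the additive variant; the helper names are ours.

```python
import secrets

MOD = 2 ** 32  # assumed 32-bit ring (ℓ = 32)

def additive_share(x, n):
    """Split x into n shares that sum to x modulo 2^32."""
    shares = [secrets.randbelow(MOD) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % MOD)
    return shares

def boolean_share(x, n):
    """Split x into n shares whose XOR equals x."""
    shares = [secrets.randbelow(MOD) for _ in range(n - 1)]
    last = x
    for s in shares:
        last ^= s
    shares.append(last)
    return shares

def reconstruct_additive(shares):
    return sum(shares) % MOD

def reconstruct_boolean(shares):
    out = 0
    for s in shares:
        out ^= s
    return out

assert reconstruct_additive(additive_share(123456, 3)) == 123456
assert reconstruct_boolean(boolean_share(123456, 3)) == 123456
```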

4 The Proposed SAPGNN Framework

In this section, we describe the proposed SAPGNN framework that has the ability to keep accuracy intact compared to the counterpart learned over the combined graph. The learning paradigm consists of parameter initialization, forward propagation, back propagation and local parameter fusion. At last, we give a discussion about additional overhead and data privacy in the presence of semi-honest adversaries.

4.1 Parameter initialization

First of all, the participating data holders and the server build pair-wise secure channels for all subsequent communication to ensure data integrity. Recall that all the nodes from the local graphs share the same feature domain. Inspired by horizontal federated learning [aono2017privacy], the local models at all data holders are initialized with the same weights to keep the model behavior identical. This can easily be implemented by sharing the same initialization approach and random seed. Additionally, the shared parameters also include: (1) training hyperparameters, which are shared among data holders and server, and (2) a hashed node index list, which is shared only with the server. The hashed index list is used to index and distinguish nodes from all local graphs while hiding the raw index information from the server. As for the server, it randomly initializes the global model weights used to generate global embeddings.
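A small sketch of this initialization step, assuming a jointly agreed seed and a salt known only to the data holders; the salt and the function names are our own illustrative additions.

```python
import hashlib
import numpy as np

SHARED_SEED = 42          # agreed among data holders beforehand
SHARED_SALT = b"sapgnn"   # agreed among data holders, unknown to the server

def init_local_weights(in_dim, out_dim, seed=SHARED_SEED):
    """Every holder derives identical local weights from the shared seed."""
    rng = np.random.default_rng(seed)       # same seed -> same weights everywhere
    return rng.normal(0.0, 0.1, size=(in_dim, out_dim))

def hashed_node_index(raw_node_ids):
    """Only salted hashes of node identifiers are reported to the server."""
    return [hashlib.sha256(SHARED_SALT + str(i).encode()).hexdigest()
            for i in raw_node_ids]
```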

4.2 Forward Propagation

Fig. 2: Forward propagation of SAPGNN. At each layer, local embeddings are first computed by the message passing architecture over the local graph at each data holder. The server then obtains global embeddings via global message aggregation and vertex update. Finally, label prediction is conducted individually at each data holder.

As illustrated in Fig. 2, in order to protect data privacy (i.e., node attributes, edge information and node labels) while exploiting all isolated graph information, we design a modified message passing architecture in the manner of layer-wise split learning. To be specific, the forward pass at each layer is divided into two steps: local embeddings are first calculated individually at each data holder using its private data; then, the semi-honest server collects the non-private local embeddings to compute global embeddings. In the end, the output of the last layer is fed into the label prediction and loss computation functions.

4.2.1 Private local embedding computation

In line with the message passing architecture, each data holder i first constructs the local messages as

m_{vu}^{(k)} = φ^{(k)}( h_v^{(k-1)}, h_u^{(k-1)}, e_{vu}; W_φ^{(k)} ),  u ∈ N_i(v),    (4)

where N_i(v) denotes the neighbor set of node v in the local graph of data holder i, and W_φ^{(k)} denotes the parameters of the function φ^{(k)}.

The next step is local message aggregation. Suppose the aggregation were conducted over the combined graph of all data holders. Since the same edge may appear at several data holders simultaneously, the same node would be counted multiple times when sum [xu2018powerful], mean [hamilton2017inductive] or degree-based [kipf2017semi] aggregators are employed. Fortunately, the max/min pooling aggregator tackles this problem naturally, therefore we build the decentralized learning paradigm on the pooling aggregator. Taking max pooling as an example, each data holder i aggregates the messages over its local neighbors by

a_{v,i}^{(k)} = max_{u ∈ N_i(v)} m_{vu}^{(k)},    (5)

where the maximum is taken element-wise.

After local aggregation, each data holder calculates the local node embeddings from the node feature and the aggregated neighbor feature via the local vertex update function

z_{v,i}^{(k)} = ψ_loc^{(k)}( h_v^{(k-1)}, a_{v,i}^{(k)} ),    (6)

where a_{v,i}^{(k)} is set to the vector whose elements are all infinitesimals (i.e., −∞) when N_i(v) is empty. The local embeddings z_{v,i}^{(k)} hide the raw information of the local graph, hence they can be sent to the server for further global computation.
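A sketch of the local side of one SAPGNN layer, under simplifying assumptions of ours: φ is a single ReLU-activated linear map with weights W_phi, and ψ_loc is plain concatenation (one of the choices allowed by the Proposition in Section 4.2.2).

```python
import numpy as np

NEG_INF = -1e9  # stands in for the "all infinitesimals" vector in (6)

def local_embedding(h_prev, local_edges, W_phi):
    """Local side of one SAPGNN layer (a sketch; W_phi and the concatenation
    update are assumptions, not the paper's exact functions).

    h_prev:      {node: np.array} global embeddings from the previous layer
    local_edges: list of (v, u) pairs in this holder's private subgraph
    """
    msgs = {}
    for v, u in local_edges:
        # (4): message construction on local edges only
        msgs.setdefault(v, []).append(np.maximum(h_prev[u] @ W_phi, 0.0))
    z = {}
    for v, h_v in h_prev.items():
        # (5): element-wise max over the *local* neighborhood
        a_v = (np.max(np.stack(msgs[v]), axis=0) if v in msgs
               else np.full(W_phi.shape[1], NEG_INF))
        # (6): local vertex update -- concatenation is monotone in a_v,
        # which is what the Proposition in Sec. 4.2.2 requires
        z[v] = np.concatenate([h_v, a_v])
    return z  # only these embeddings leave the data holder
```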

4.2.2 Global embedding computation

This step consists of global aggregation and vertex update. Concretely, the server first aggregates the local node embeddings from all data holders with the same pooling function as in (5) by

a_v^{(k)} = max_{i ∈ [P]} z_{v,i}^{(k)}.    (7)

After that, the server transforms the aggregated embedding to compute the global node representation of layer k as

h_v^{(k)} = ψ_glob^{(k)}( a_v^{(k)} ).    (8)

To cover the diverse design space of GNNs [you2020design], a combination of linear transformation, batch normalization, activation and dropout can be incorporated into the vertex update function to enhance model capacity.

Note that the result of the pooling aggregation in (5) and (7) depends only on the element-wise maximum. In order to follow the same behavior as a centralized GNN layer, the globally aggregated result (i.e., the left side of (9)) should be identical to the result of aggregating over all neighbors of the combined graph (i.e., the right side of (9)), which can be formulated as

max_{i ∈ [P]} ψ_loc^{(k)}( h_v^{(k-1)}, max_{u ∈ N_i(v)} m_{vu}^{(k)} ) = ψ_loc^{(k)}( h_v^{(k-1)}, max_{u ∈ N(v)} m_{vu}^{(k)} ).    (9)

To satisfy the equation above, the constraint on the local update function is given as follows:

Proposition (Constraint on the local update function). When the aggregation function is the element-wise max, each element of the output of the local update function ψ_loc^{(k)} should monotonically increase with each increasing element of a_{v,i}^{(k)}; e.g., ψ_loc^{(k)} can be chosen from the concatenation [ h_v^{(k-1)} ∥ a_{v,i}^{(k)} ], the element-wise multiplication h_v^{(k-1)} ⊙ a_{v,i}^{(k)}, and a monotone MLP(·), where ∥ denotes concatenation, ⊙ denotes element-wise multiplication, and MLP denotes a multilayer perceptron.

Proof.

Denote the result of element-wise max aggregation over the local neighbor information as

a_{v,i}^{(k)} = max_{u ∈ N_i(v)} m_{vu}^{(k)},    (10)

and over the entire neighbor information as

a_v^{(k)} = max_{u ∈ N(v)} m_{vu}^{(k)} = max_{i ∈ [P]} a_{v,i}^{(k)},    (11)

respectively. Incorporating (10) and (11), equation (9) can be simplified as

max_{i ∈ [P]} ψ_loc^{(k)}( h_v^{(k-1)}, a_{v,i}^{(k)} ) = ψ_loc^{(k)}( h_v^{(k-1)}, max_{i ∈ [P]} a_{v,i}^{(k)} ).    (12)

Omitting the layer index k and the node index v, and denoting f(x) = ψ_loc(h_v, x) with x_i = a_{v,i}, the above equation reduces to

max_{i ∈ [P]} f(x_i) = f( max_{i ∈ [P]} x_i ).    (13)

Obviously, by the property of the max function, equation (12) holds if and only if each element of the output of f increases monotonically with each increasing element of the vector x, i.e., each element of the output of ψ_loc should monotonically increase with each increasing element of a_{v,i}. ∎

When the global embeddings of all nodes at layer k have been obtained by (8), the server distributes them to each data holder according to the node list of its local graph for the forward propagation of the next layer. This process is conducted iteratively until the last layer K.
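A matching sketch of the server side, again under our simplifying assumptions (ψ_glob is a ReLU-activated linear map with weights W_glob). Because the local update in the previous sketch is a plain concatenation, taking the element-wise max over data holders here coincides with aggregating over the combined neighborhood, which is exactly the consistency condition (9).

```python
import numpy as np

def global_embedding(local_embs, W_glob):
    """Server side of one SAPGNN layer (a sketch; ReLU(a @ W_glob) stands in
    for the configurable global vertex update in (8)).

    local_embs: list of {node: np.array} dictionaries, one per data holder,
                containing the local embeddings z from (6)
    """
    h_new = {}
    nodes = set().union(*[d.keys() for d in local_embs])
    for v in nodes:
        # (7): element-wise max over the data holders that know node v
        stacked = np.stack([d[v] for d in local_embs if v in d])
        a_v = np.max(stacked, axis=0)
        # (8): global vertex update (linear transformation + activation here)
        h_new[v] = np.maximum(a_v @ W_glob, 0.0)
    return h_new  # distributed back to the holders for the next layer
```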

4.2.3 Private local loss computation

When the global node embeddings of the last layer K have been computed, each data holder i predicts labels based on these embeddings by

ŷ_v = f_pred( h_v^{(K)}; W_pred,i ),    (14)

and the local loss at data holder i over the local training node labels can then be computed by

L_i = (1 / |V_i^L|) Σ_{v ∈ V_i^L} ℓ( ŷ_v, y_v ),    (15)

where |V_i^L| is the number of labeled training nodes at data holder i and ℓ is the loss function, such as cross-entropy for a classification task or mean squared error for a regression task.
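As a small illustration of (14)-(15), the following sketch computes a linear softmax prediction and the average cross-entropy at one data holder; treating f_pred as a single linear layer is our assumption.

```python
import numpy as np

def local_loss(h_final, labels, W_pred):
    """Label prediction (14) and cross-entropy loss (15) at one data holder.

    h_final: {node: np.array} global embeddings of the last layer
    labels:  {node: int} private training labels of this holder
    """
    total = 0.0
    for v, y in labels.items():
        logits = h_final[v] @ W_pred                      # (14)
        p = np.exp(logits - logits.max())
        p /= p.sum()
        total += -np.log(p[y] + 1e-12)                    # cross-entropy
    return total / len(labels)                            # (15)
```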

To summarize, the forward propagation algorithm is given in Algorithm 1. When the forward propagation is finished, model weights can be updated by the back propagation procedure outlined in what follows.

Input: local graphs G_i and node features at data holder i for i ∈ [P]; the set of all nodes V; the labeled node sets V_i^L; the number of layers K;
Output: label predictions ŷ_v and loss L_i at each data holder i;

1:  for k = 1 to K do
2:     for i ∈ [P] in parallel do
3:        Data holder i: calculates the local node embeddings z_{v,i}^{(k)} by (4)-(6) and sends them to the server.
4:     end for
5:     Server: combines the local embeddings to calculate the global embeddings h_v^{(k)} by (7) and (8), then distributes them back based on the node lists V_i.
6:  end for
7:  Data holder i: private label prediction and loss computation by (14) and (15).
Algorithm 1 Forward propagation of SAPGNN learning algorithm

4.3 Back Propagation

Recall that the local part (i.e., the local embedding and loss related computation at the data holders) and the global part (i.e., the global embedding related computation at the server side) of each layer are spatially isolated. According to the chain rule of differentiation, the entire model can be updated iteratively by communicating intermediate gradients between the data holders and the server. Herein, the gradients of the local model weights are computed individually and then secretly aggregated for the update. In the following, we describe the computation and communication of the back propagation procedure in detail.

4.3.1 Individual back propagation of predict layer

As the bridge between the final node embeddings and the model output, the weights W_pred,i of the predict function at data holder i can first be learned by gradient descent, minimizing the local loss individually. After that, the data holder computes the gradient of the loss with respect to the input of the predict function, i.e., ∂L_i/∂h_v^{(K)}, and sends it to the server for the subsequent back propagation.

4.3.2 Back propagation of each SAPGNN layer

Since the total loss is the sum of the local losses, the gradient of the entire loss with respect to the output of the last layer can be computed as

∂L/∂h_v^{(K)} = Σ_{i ∈ [P]} ∂L_i/∂h_v^{(K)},    (16)

while for the k-th layer (k < K), based on the derivative of the max function, the gradient of the loss with respect to the input embedding h_v^{(k-1)} can be decomposed as

∂L/∂h_v^{(k-1)} = (∂L/∂h_v^{(k)}) · (∂h_v^{(k)}/∂a_v^{(k)}) · Σ_{i ∈ [P]} (∂a_v^{(k)}/∂z_{v,i}^{(k)}) · (∂z_{v,i}^{(k)}/∂h_v^{(k-1)}),    (17)

where the part ∂L/∂h_v^{(k)} denotes the gradient of the loss with respect to the output global embedding of layer k, ∂h_v^{(k)}/∂a_v^{(k)} denotes the gradient of the global embedding with respect to the result of the global aggregation a_v^{(k)}, and ∂a_v^{(k)}/∂z_{v,i}^{(k)} denotes the gradient of a_v^{(k)} with respect to the input of the global model, z_{v,i}^{(k)}. Obviously, these parts can be computed at the server side. The part ∂z_{v,i}^{(k)}/∂h_v^{(k-1)} denotes the gradient of the data holder's output with respect to the input embedding h_v^{(k-1)}, and it can be obtained at each data holder individually. Therefore, according to (17), the gradients can be back propagated layer by layer recursively. At each layer, the propagation is first carried out globally at the server side and then locally and in parallel at each data holder side.
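The factor ∂a_v^{(k)}/∂z_{v,i}^{(k)} in (17) is particularly simple: the derivative of an element-wise max is an indicator, so each embedding element routes its gradient only to the data holder whose local embedding attained the maximum. A small sketch, assuming for brevity that every holder submitted an embedding for the node:

```python
import numpy as np

def backprop_through_global_max(grad_a, local_embs, node):
    """Route the gradient of the global aggregation back to the data holders.

    grad_a:     gradient of the loss w.r.t. the global aggregation a_v
    local_embs: list of {node: np.array}, one dict per data holder
    """
    stacked = np.stack([d[node] for d in local_embs])   # shape (P, d)
    winners = np.argmax(stacked, axis=0)                # winning holder per element
    grads = np.zeros_like(stacked)
    grads[winners, np.arange(stacked.shape[1])] = grad_a
    return [g for g in grads]                           # one gradient per holder
```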

4.3.3 Global back propagation at server side.

The server first obtains ∂L/∂h_v^{(k)} by summing the gradients received from all data holders, and then computes the derivatives with respect to the global model weights and the local embeddings for every i ∈ [P]:

∂L/∂W_glob^{(k)} = Σ_{v ∈ V} (∂L/∂h_v^{(k)}) · (∂h_v^{(k)}/∂W_glob^{(k)}),    (18)
∂L/∂z_{v,i}^{(k)} = (∂L/∂h_v^{(k)}) · (∂h_v^{(k)}/∂a_v^{(k)}) · (∂a_v^{(k)}/∂z_{v,i}^{(k)}),    (19)

respectively. The result of (19) is sent to the corresponding data holder for the sequential local back propagation.

4.3.4 Local back propagation at data holder side.

The gradient of the loss with respect to the local weights W_loc^{(k)} at data holder i can be expressed as

g_i^{(k)} = ∂L/∂W_loc^{(k)} = Σ_{v ∈ V_i} (∂L/∂z_{v,i}^{(k)}) · (∂z_{v,i}^{(k)}/∂W_loc^{(k)}).    (20)

According to the part ∂z_{v,i}^{(k)}/∂h_v^{(k-1)} of (17), each data holder also needs to calculate the gradient of its output local embedding with respect to the input node embedding, combine it with the received result of (19) to obtain ∂L/∂h_v^{(k-1)}, and send the latter to the server.

1:   Step 1: Back propagation of the predict layer
2:  for i ∈ [P] in parallel do
3:     Data holder i: computes ∂L_i/∂W_pred,i, computes ∂L_i/∂h_v^{(K)} and sends it to the server.
4:  end for
5:   Step 2: Back propagation of each SAPGNN layer
6:  for k = K to 1 do
7:     Server: computes ∂L/∂h_v^{(k)} by (16) if k = K or by (17) if k < K, computes ∂h_v^{(k)}/∂a_v^{(k)} and ∂a_v^{(k)}/∂z_{v,i}^{(k)}, computes the gradient of the global model weights by (18), and sends the result of (19) to each data holder.
8:     for i ∈ [P] in parallel do
9:        Data holder i: computes the gradient of the local model weights by (20), computes ∂L/∂h_v^{(k-1)} and sends it to the server.
10:     end for
11:  end for
12:   Step 3: Weight update
13:  for data holder i ∈ [P] in parallel do
14:     locally generates the shares ⟨g_i⟩ and distributes ⟨g_i⟩_j to data holder j.
15:     computes the partial sum of the received shares by (21) and sends it to the other data holders.
16:     reconstructs the aggregated gradient by (22) and updates the local weights via gradient descent.
17:  end for
18:  Server: updates the global weights via gradient descent.
Algorithm 2 Back propagation and weights update of SAPGNN framework

4.4 Weights update

As described above, the model weights of SAPGNN are spatially divided into two categories: global submodel weights held by server and local submodel weights held by data holders.

4.4.1 Update of global model weight.

When the corresponding gradients have been obtained by (18), the global weights can be directly updated through gradient descent.

4.4.2 Update of local model weight.

To keep the isolated local weights of all data holders identical during training, the corresponding local gradients should be aggregated across all data holders, e.g., via secure aggregation [bonawitz2017practical] or homomorphic encryption [aono2017privacy]. Taking secure aggregation as an example, let g = Σ_{i ∈ [P]} g_i denote the aggregated gradient of the local weights. Each data holder i first secretly shares its local gradient g_i with the other data holders and then sums up the shares it receives by

s_j = Σ_{i ∈ [P]} ⟨g_i⟩_j^A.    (21)

After that, each data holder reconstructs the entire gradient for the update by gathering the aggregated results from the others:

g = Σ_{j ∈ [P]} s_j = Σ_{i ∈ [P]} g_i.    (22)

Note that during this procedure, each data holder only accesses secret shares and the reconstructed aggregated gradient, whereas the server learns nothing about the local gradients.
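A toy run of this aggregation with three data holders, assuming (purely for illustration) that each local gradient has already been encoded as a single integer in a 32-bit ring; real gradients would be vectors in fixed-point encoding.

```python
import secrets

MOD = 2 ** 32

def share(x, n):
    """Additively share one integer-encoded gradient value into n shares."""
    parts = [secrets.randbelow(MOD) for _ in range(n - 1)]
    parts.append((x - sum(parts)) % MOD)
    return parts

local_grads = [7, 11, 5]          # one (integer-encoded) local gradient per holder
n = len(local_grads)

# Step 1: every holder i shares g_i; shares[i][j] goes to holder j.
shares = [share(g, n) for g in local_grads]

# Step 2 (eq. 21): holder j sums the shares it received and broadcasts the sum.
partial_sums = [sum(shares[i][j] for i in range(n)) % MOD for j in range(n)]

# Step 3 (eq. 22): every holder reconstructs the aggregated gradient.
aggregated = sum(partial_sums) % MOD
assert aggregated == sum(local_grads) % MOD  # equals g_1 + g_2 + g_3
```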

4.5 Discussion of security and overhead

4.5.1 Data privacy

In our proposed learning paradigm, data privacy can be guaranteed by the following reasons:

  • All computations that involve the aforementioned private data (including node attributes, edge information, labels and local model gradients) are carried out locally by the data holders. From the perspective of the semi-honest server, only the hashed node lists of the local graphs, the local embeddings computed in (6) and the global model are observable. Therefore, our SAPGNN is secure against a semi-honest server.

  • The only sensitive messages observed by the data holders are the secret shares of the gradients of the local model weights. Since the gradients are split by an n-out-of-n secret sharing algorithm, the raw data can be reconstructed if and only if one gathers all of the shares. This prevents semi-honest adversaries among the other data holders.

  • TLS/SSL protocol ensures security and data integrity of pair-wise network communication [aono2017privacy].

4.5.2 Extra communication overhead

N-out-of-n secret sharing leads to quadratic growth of the communication overhead with respect to the number of data holders: the overhead of aggregating the gradients of the local model for one update grows with P^2, with the size S_W of the local model weights and with the data size s of each weight value. In addition, as explained in the forward process, the local embeddings and the global embeddings are transmitted between the server and the data holders at every layer; letting d denote the length of a node embedding and N the number of nodes from all local graphs, this traffic grows with K, P, N, d and s. Therefore, although a small number of layers is sufficient for training a competitive GNN [chen2020simple] and also impedes over-smoothing, the communication overhead can become a bottleneck that limits efficiency and scalability, since these quantities can be extremely large for a heavy model with millions of parameters or in an Internet-of-Things scenario with massive numbers of devices [gao2020end]. Potential solutions include conducting mini-batch training instead of full-batch training, or utilizing model and communication compression technology [rothchild2020fetchsgd]. We leave these optimizations as future work.
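A back-of-the-envelope estimator of these two traffic components, written under our own simplifying assumptions (4-byte values, dense embeddings sent in both directions at every layer, every node replicated at every data holder); the constants and function names are illustrative, not the paper's exact cost model.

```python
def embedding_traffic_bytes(K, P, N, d, s=4):
    """Embeddings sent up and gradients/embeddings sent down at every layer."""
    return 2 * K * P * N * d * s

def gradient_aggregation_bytes(P, num_local_weights, s=4):
    """Pairwise n-out-of-n secret shares exchanged among P data holders."""
    return P * (P - 1) * num_local_weights * s

print(embedding_traffic_bytes(K=2, P=4, N=19717, d=64))           # ~80 MB
print(gradient_aggregation_bytes(P=4, num_local_weights=250_000)) # ~12 MB
```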

4.5.3 Secure global pooling aggregation

Note that when conducting the global aggregation, only the element-wise maximum values over all local embeddings in (7) are required during the forward step, while the corresponding data holder indexes (i.e., the factor ∂a_v^{(k)}/∂z_{v,i}^{(k)} in (17)) are needed during the backward step. To further improve privacy, this raw information can be encrypted by private comparison approaches that exploit secure maximum computation protocols, which have been widely utilized in machine learning applications such as k-means [jaschke2018unsupervised, mohassel2020practical]. Specifically, for each element of the local embeddings, the problem is to output secret shares of the index vector that indicates which of the P submitted values attains the maximum. This functionality has been investigated in depth in recent works such as [mohassel2020practical], and can be efficiently implemented by employing less-than garbled circuits and instances of oblivious transfer extension. Utilizing the secure global pooling aggregation creates more obstacles for the semi-honest server to learn private information from the data holders.

5 Evaluation

In this section we present the experimental results for the proposed SAPGNN. We first describe the datasets, the experimental setup and the compared methods. After that, we present experiments that highlight the superiority of SAPGNN under a near-IID label distribution setting.

Dataset Cora Citeseer Pubmed
Nodes 2708 3327 19717
Edges 5278 4552 44324
Features 1433 3703 500
Train 140 120 60
Val 500 500 500
Test 1000 1000 1000
Classes 7 6 3
TABLE III: Main characteristics of each dataset

5.1 Datasets and experimental setup

We test SAPGNN on three publicly available citation datasets used for node classification in previous works [zhou2020privacy, zheng2021asfgnn], i.e., Cora, Citeseer and Pubmed. In these datasets, each node represents a document, while edges denote citation links. Each node has a bag-of-words feature vector and a label indicating its category. We follow the same node masks as the default setting of the DGL framework [wang2019dgl] for the training, validation, and test node sets. The main characteristics of each dataset are given in Table III. All experiments are evaluated on a Windows desktop with a 3.2 GHz 6-core Intel Core i7-8700 CPU and 16 GB of RAM.

5.2 Compared methods

We compare SAPGNN against two methods:

  • The first is separate training (SP), i.e., each data holder trains a GNN individually over its own subgraph. It cannot utilize information from others and is thus treated as a baseline method.

  • The second is PPGNN [zhou2020privacy], which first conducts separate training and then predicts over the combined node embeddings. Note that the training, validation, and test node sets for PPGNN need to be privately aligned among the data holders before the experiments, since PPGNN requires each node to exist at all local graphs.

Fig. 3: The percentage of nodes of each class on the non-IID Cora dataset with two data holders when (a) α = 0%, (b) α = 25% and (c) α = 50%, where α = 0% means each subgraph includes nodes with different classes, while α = 50% means each subgraph contains about half of the nodes of each class.

For all methods, we use a two-layer GNN constructed by the following formulation:

(23)

A ReLU activation function and dropout are applied to the output of each layer except the last one. All the considered models are trained for a maximum of 300 epochs using the cross-entropy loss with the Adam optimizer and a learning rate of 0.01. We performed a grid search with early stopping to find the best hidden size for each method, and the accuracy and macro-F1 are evaluated on the test set over 40 consecutive runs.

Number of data holders          1                2                3                4
Dataset    Model        Acc    F1        Acc    F1        Acc    F1        Acc    F1
Cora       SP           78.5   77.4      75.2   74.3      72.7   71.7      70.6   69.5
                        (0.54) (0.55)    (0.79) (0.77)    (0.85) (1.00)    (0.87) (0.91)
           PPGNN        -      -         77.5   76.5      77.0   75.9      76.4   75.1
                        -      -         (1.36) (1.31)    (1.10) (1.14)    (1.36) (1.36)
           SAPGNN       78.5   77.4      78.5   77.4      78.5   77.4      78.5   77.4
                        (0.54) (0.55)    (0.54) (0.55)    (0.54) (0.55)    (0.54) (0.55)
Citeseer   SP           69.8   66.6      68.0   64.8      65.2   61.7      63.2   59.0
                        (0.59) (0.62)    (1.81) (1.76)    (0.99) (1.16)    (2.12) (2.38)
           PPGNN        -      -         67.1   63.3      66.3   62.8      64.9   61.5
                        -      -         (1.72) (2.45)    (2.07) (1.81)    (2.69) (2.58)
           SAPGNN       69.8   66.6      69.8   66.6      69.8   66.6      69.8   66.6
                        (0.59) (0.62)    (0.59) (0.62)    (0.59) (0.62)    (0.59) (0.62)
Pubmed     SP           78.3   77.7      75.9   75.3      73.9   73.4      72.2   71.7
                        (0.51) (0.49)    (1.08) (1.12)    (1.09) (1.08)    (1.01) (1.01)
           PPGNN        -      -         78.9   78.4      79.0   78.7      79.2   79.0
                        -      -         (0.88) (0.84)    (0.58) (0.57)    (0.61) (0.56)
           SAPGNN       78.3   77.7      78.3   77.7      78.3   77.7      78.3   77.7
                        (0.51) (0.49)    (0.51) (0.49)    (0.51) (0.49)    (0.51) (0.49)
TABLE IV: Comparison of accuracy and macro-F1 (standard deviation in parentheses) over a varying number of data holders from 1 to 4. The edges are divided uniformly among all data holders; PPGNN results are reported for 2-4 data holders only.

5.3 Results with uniformly split edges

We first compare the three decentralized learning methods under the IID edge setting, where the original edge set is divided uniformly among the subgraphs of the data holders; the results are reported in Table IV. First, we observe that the metrics of SAPGNN remain identical for varying numbers of data holders and equal the results obtained by the centralized counterpart (i.e., SP when the number of data holders is 1). The reason is straightforward: the global node representations learned by SAPGNN are the same as those learned over the combined graph. Second, SAPGNN consistently outperforms SP, and the gap widens as the number of data holders grows, since SP only accesses local information. Compared to PPGNN, SAPGNN is competitive on the Cora and Citeseer datasets, but slightly worse in the case of Pubmed. In the following, we mainly compare SAPGNN and PPGNN under non-IID label distributions and drop the SP method for conciseness.

5.4 Results with varying label distribution

Existing works have demonstrated that the performance of decentralized learning methods decreases as the label distribution becomes more non-IID [gao2020end, zhao2018federated]. To examine this, we first divide the nodes among the data holders according to their labels, and then α% of the nodes from each data holder are reassigned uniformly to the other data holders. Only the edges connecting nodes at the same data holder are retained. Thus, raising the label distribution ratio α from 0% to 50% implies a more similar label distribution among data holders, and increasing the number of data holders leads to more removed edges. Taking two data holders on the Cora dataset as an example, Fig. 3 shows the percentage of nodes at each data holder for every class, where α = 0% implies that the labels among data holders are completely different, i.e., the subgraph at data holder 1 includes 1097 nodes of the first four classes, while data holder 2 only has 543 nodes with labels of the last three classes. In the case of α = 50%, each subgraph contains about half of the nodes of each class (821 nodes at data holder 1 and 819 nodes at data holder 2). Note that the original PPGNN can only generate embeddings for nodes that overlap at all data holders. For a fair comparison, instead of directly removing nodes, we remove all edges connected to these nodes at each local subgraph, so that no messages are passed from or to adjacent neighbors.
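The split procedure can be sketched as follows; this is our reconstruction of the described protocol, with alpha given as a fraction (e.g., 0.25 for α = 25%) and a round-robin assignment of label groups to holders assumed for simplicity.

```python
import random

def label_skew_split(nodes_by_label, edges, num_holders, alpha):
    """Sketch of the non-IID split used in Sec. 5.4: nodes are first
    partitioned by label, then a fraction alpha of each holder's nodes is
    reassigned uniformly to the other holders, and only edges whose
    endpoints end up at the same holder are kept."""
    holder_of = {}
    for idx, lab in enumerate(sorted(nodes_by_label)):    # partition by label
        for v in nodes_by_label[lab]:
            holder_of[v] = idx % num_holders
    for v in list(holder_of):                             # move alpha of the nodes
        if random.random() < alpha:
            others = [i for i in range(num_holders) if i != holder_of[v]]
            holder_of[v] = random.choice(others)
    local_edges = [[] for _ in range(num_holders)]        # keep intra-holder edges
    for u, v in edges:
        if holder_of[u] == holder_of[v]:
            local_edges[holder_of[u]].append((u, v))
    return holder_of, local_edges
```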

Fig. 4: Node classification accuracy of SAPGNN and PPGNN, where the number of data holders ranges from 2 to 4 and α ranges from 0% to 50%.
Fig. 5: Node classification F1 of SAPGNN and PPGNN, where the number of data holders ranges from 2 to 4 and α ranges from 0% to 50%.

Fig. 4 and Fig. 5 respectively show the node classification accuracy and F1 score when the number of data holders varies from 2 to 4 and α varies from 0% to 50%. We observe that the label distribution has an important influence on both metrics. Specifically, when α = 0%, the performance of PPGNN has a comfortable lead over SAPGNN. This is because PPGNN generates node embeddings locally and can thus balance the contributions from different data holders; when the classes of nodes are totally different among data holders, training a shared or federated model has no benefit over learning individually on a relatively simple classification task [zhao2018federated]. On the other hand, SAPGNN has comparable performance when α = 25% and outperforms PPGNN when α = 50% on all datasets, which means SAPGNN is more effective at learning from adjacent information in the scenario where all data holders tend to have a near-IID label distribution. Lastly, by comparing the performance of SAPGNN for the same α over various numbers of data holders, we find that removing inter-class edges may reduce the learning performance on Citeseer, while it has relatively little influence on Cora and Pubmed.

6 Conclusion

In this paper, we proposed a server aided privacy-preserving GNN framework for horizontally partitioned graph-structured datasets. It is able to generate the same node embeddings as the centralized GNN without revealing raw data. Therefore, proven properties of the centralized model (e.g., convergence and generalization) also transfer to the proposed SAPGNN. To address further privacy concerns, we additionally gave a secure global pooling aggregation mechanism that is capable of hiding the raw local embeddings from semi-honest adversaries. We showed successful cases of SAPGNN on the node classification task, especially when the labels of the isolated datasets tend to have identical distributions, whereas it behaves worse than existing methods under highly skewed non-IID label distributions. This observation can serve as guidance for choosing a suitable decentralized learning paradigm according to the distribution of the graph data.

In the future, we would like to transfer the proposed learning framework to more general GNN architectures and to more partition types of graph datasets. More importantly, attention should be paid to enhancing communication efficiency in order to unleash the full potential of SAPGNN and other decentralized GNN learning approaches in real-world applications.

References