Visual analysis of multi-party graphs plays an important role in helping us understand real-world complex data [48, 49, 6], such as ego-network analysis in social media [52, 56], disease diagnosis in healthcare, and anomaly detection in public security [7, 55]. Various features or models extracted from multi-party graphs can be integrated to support a comprehensive understanding of the entire graph data and thus enable more thorough investigations. For instance, by combining knowledge graphs of patients and diseases from multiple hospitals, doctors can gain a deeper understanding of diseases and develop better treatment plans.
One main bottleneck in exploiting multi-party graphs is data accessibility. Early studies on graph visual analysis assume that the graph data is freely accessible. Currently, however, more and more graph data are distributed (e.g., on servers in different organizations). To analyze such data, we need to combine multi-party graphs and examine them as a whole. For privacy and security reasons, access to raw data on distributed clients may be prohibited. This leads to two challenges for visual analysis of multi-party graphs. The first is the federation of multi-party graphs. Because raw data should be kept locally, creating a joint representation of the data in all clients must resort to privacy-preserving feature extraction techniques. This situation is exacerbated by the fact that the features of different graphs may differ. A unified joint representation that can characterize the essential features of each graph is needed. The second challenge is that federated analysis based on joint representations is difficult. Designing a generalized framework for federating various complex analysis tasks remains a major challenge.
Visual analysis of multi-party graphs based on joint representations requires new approaches. Conventional graph analysis, which is performed in a centralized mode or uses only the limited graph data that is accessible, cannot be applied to multi-party graph analysis. Existing solutions for distributed graph analysis [14, 28] mainly focus on partitioning data and analysis tasks to improve performance, and cannot support multi-party graph analysis either.
One way to support privacy-preserving decentralized graph analysis is to build a decentralized federation of both data features and the analysis. We propose GraphFederator, a novel federation approach that constructs joint representations of multi-party graphs, and supports privacy-preserving visual analysis of graphs. Inspired by federated learning, we reformulate the analysis of multi-party graphs into a decentralization process. The new federation framework consists of a shared module that is responsible for joint modeling and analysis, and a set of local modules that run on respective graph data. Specifically, we propose a federated representation model that is iteratively learned from the encrypted characteristics of multi-party graphs in local modules. We design multiple visualization views for joint visualization, exploration, and analysis of multi-party graphs. The contributions of this paper include:
a federated graph representation model to represent and extract distinctive features from multi-party graphs; and
a federated visual analysis approach to support privacy-preserving graph analysis.
The rest of this paper is organized as follows. Related work is discussed in Section 2. Section 3 introduces the problem formulation. Section 4 explains our design goals and the overview of our approach. Our federated graph representation model is introduced in Section 5. Section 6 presents the visual interface. The evaluation is discussed in Section 7. Discussions and the conclusion are given in Section 9 and Section 10, respectively.
2 Related Work
2.1 Visual Analysis of Graph Data
A graph analysis task is usually defined as the analysis of entities or associated properties. Here, entities denote nodes, links, paths, and networks, while the properties include structures and derived features. Graph analysis tasks can be classified into four groups: topology-based tasks, attribute-based tasks, browsing tasks, and overview tasks. Complex tasks can be decomposed into a set of basic tasks. Alternatively, tasks can be represented as a combination of two fundamental tasks: analyzing topology for given attributes, and analyzing attributes for a given topological structure. These two tasks are supported with respect to topological structures, including nodes, edges, clusters, node neighbors, paths, and substructures.
A recent study proposes a multi-level typology to facilitate specific task classifications. Likewise, 29 group-level graph visualization tasks are classified into four groups: group-only tasks, group-node tasks, group-link tasks, and group-network tasks. From the viewpoint of graph-based sensemaking, four categories of graph visualization tasks are introduced: visualization and exploration; global, local, and hybrid views; subgraph mining and interaction. Without loss of generality, this paper follows their notations and designs specific tasks. A typical graph analysis system named Network Repository provides users with the ability to explore, visualize, and compare data along many different dimensions interactively and in real time. By combining global network statistics, local node-level network statistics, and features, users can easily discover key insights into the data.
2.2 Distributed Analysis and Federated Learning
Machine learning benefits from the ability to train increasingly sophisticated models with the unprecedented growth of data collection. To overcome the high computational cost of analyzing large-scale data, parallel or distributed computing has become popular. Similarly, decentralized machine learning approaches can be reformulated from centralized versions. There are two basic building blocks of distributed learning algorithms: matrix multiplication and data shuffling. Much effort has been devoted to improving communication efficiency. For instance, minimizing the number of communication rounds works well for cases where data is unevenly distributed over an extremely large number of nodes. A new framework has been proposed to manage asynchronous data communications between clients and servers with flexible consistency, elastic scalability, and fault tolerance.
To protect private data during communication, a Homomorphic Encryption (HE) scheme is designed to preserve the structures of the original message space. HE can be leveraged for privacy-preserving training or prediction with linear regression, linear classifiers, decision trees, matrix factorization, and neural networks. To address the problem of non-linear activation functions, an HE-based neural network scheme is proposed with an interactive protocol between the data owner and the model owner. The transformation calculated by the data owner is exchanged in encrypted form with the result from the model.
As a pioneering work on privacy-preserving computing, secure multi-party computation (MPC)  guarantees that clients can only get the final cumulative model weight. Recently, federated learning emerged as a new privacy-protection scheme to construct a global machine learning model with distributed multiple clients [32, 54, 15, 16, 29, 1]. To improve the training efficiency while protecting the privacy of multiple parties, local training data is kept from the central server. It is trained in a decentralized manner on multiple remote clients without transferring raw data. By integrating parameters from clients, the server can compose a global model. For large-scale graph data, distributed processing is needed. A distributed implementation of the Dijkstra algorithm  can handle various graph problems like the depth first search in an undirected graph.  is a distributed engine that supports the layout and computation of graphs with billions of edges and outperforms other state-of-the-art graph engines by using a series of graph-based optimizations.
To the best of our knowledge, research on using federated learning for graph analysis is rare. A distributed learning algorithm on graphs generalizes the previous work on federated learning and provides a fully decentralized framework in which the localized data of individual nodes are kept from one another. The entire learning process over nodes does not use a central server and hence is a peer-to-peer method. A distributed graph neural network is constructed by following the scheme of federated learning. The algorithm uses a similarity matrix to capture the high-distance structure of nodes precisely in the graph neural network.
3 Problem Formulation
For reasons of data privacy, directly analyzing graphs held by multiple clients is difficult. Our solution is to extract joint representations from multi-party graphs and use them for analysis. However, existing federated learning frameworks are inapplicable to multi-party graphs, because the data characteristics and analysis tasks of multi-party graphs are quite diverse. Designing a model that extracts joint representations while preserving privacy is a nontrivial issue.
Given $K$ parties, we define $G_k = (V_k, E_k)$ as a graph in the $k$-th party, where $V_k$ and $E_k$ represent the node set and edge set, respectively. Each graph owns the identical attributes $A = \{a_1, \dots, a_m\}$, where $m$ denotes the dimension of attributes. Each node has a public ID. Nodes from different graphs may have the same public ID. Our goal is to construct a federated graph representation model that generates the joint representations $R$ of nodes. Sensitive information should be kept locally: attributes of each node, such as age, year, and salary; links among nodes; and personal privacy information, like name, e-mail, and address.
4 Approach Overview
4.1 Design Goals
We worked closely with five domain experts. Two of them are professors whose research focuses on federated learning and privacy analysis, respectively. The other three are one professor and two Ph.D. students, whose research interests are all related to graph representation learning. We also consulted two experts from an online game publisher and provider, both of whom have experience in federated learning and graph analysis research. Through discussions with these experts, we identified the following design goals:
(G) Global representation learning model for multi-party graph data. Considering that the data distribution can differ across parties, a global criterion is needed to ensure that all computations are secure and generate satisfactory results.
(R) Representation learning for each party's graph. The main goal of multi-party graph representation learning is to derive the localized graph representation of each party. In particular, localized representations should be:
(R1) Privacy-preserving: Parties are not allowed to share or transmit their raw data, which should be kept locally to avoid the leakage of sensitive information.
(R2) High-quality: Local representations learned by our decentralized scheme should achieve performance comparable to that of representations learned in a centralized manner.
(R3) Information-diverse: Learned local representations should be comprehensive and offer insights into graphs from various aspects. Such representations should be extracted on the basis of rich and diverse characteristics of a graph, such as the structure information, the side information (node attributes), and graph embedding results.
(C) Customization support. Analysis tasks vary across users, and thus a flexible scheme is needed. Users should be allowed to customize analysis methods and control the relevant parameters of different steps according to their preferences.
4.2 Approach Overview
Our general approach is shown in Figure 1. The central component of our approach is a server that runs a global federated graph representation model (FGRM). The server communicates with individual clients, each of which owns its local graph data and runs its local FGRM. The server also communicates with the user interface to provide data for the various views.
FGRM is designed to extract multiple graph representations (G). It runs in a server-client mode, and contains two components: graph representation and federated computing.
The graph representation of each local graph is computed based on three components (R2, R3): the embedding component, the structure component, and the attribute component. The federated computing is used to federate graph representations from multi-party graphs.
The federated computing (R1, C) includes three parts: federated initialization, client-side update, and server-side update. In the federated initialization, the server distributes FGRM, predefined encryption schemes, and related rules to multiple remote clients. The client-side module updates FGRM with the local graph and sends the extracted graph representations to the server. The server-side module collects and handles these representations. Next, the server sends the representations back to each client. The update process is iterated until the specified number of rounds is reached. Finally, the server obtains the federated graph representations of the multi-party graphs.
We design and implement a visual interface to visualize different federated graph representations. Users configure and build FGRM by custom schemes in the server view for extracting federated representations from multi-party graphs (C). The embedding view, the attribute view, and the structure view display the federated embedding representation, the federated attribute representation, and the federated structure representation, respectively.
5 Federated Graph Representation Model
5.1 Graph Representation
FGRM supports the construction of three different types of graph representations: graph embedding, node attribute, and structure information. These representations depict graph information from different aspects, making FGRM suitable for various visual analysis tasks. For simplicity, the federated graph representations are denoted as $R = \{R^{e}, R^{a}, R^{s}\}$. Here, the federated embedding representation is denoted as $R^{e}$. The federated attribute representation is denoted as $R^{a}$, and $m$ denotes the dimension of attributes. The federated structure representation is denoted as $R^{s}$.
5.1.1 The Embedding Component
The embedding component is used to construct a graph embedding representation of a graph. This representation converts the node set into low dimensional vectors in a canonical space.
Input: The input includes a graph with its node attributes, an embedding model, and the corresponding parameters.
Extraction: For a given graph, the embeddings are generated by a graph embedding model. Some models use only the topology to extract representations, while others also require node features. For embedding learning models that require both topology and features as input, the final representation is extracted directly by the model from the graph topology and the node features. For models that require only topology as input, basic embedding vectors are first extracted by the model from the graph topology. Then, the features of nodes are extracted as follows. First, one-hot vectors are extracted to represent categorical attributes and text attributes. Then, normalized vectors are extracted to represent the numerical attributes. The feature vector of a node is the concatenation of these two vectors. Thereafter, feature vectors are reduced to the same number of dimensions as the basic embedding vectors. The embedding representation is the concatenation of the reduced feature vector and the basic embedding vector, yielding a high-dimensional vector for each node.
Output: A set of vectors of nodes is generated.
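The extraction steps for topology-only models can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, the attribute values are made up, and because the text does not name the dimensionality reduction method, a fixed random projection is used purely as a stand-in.

```python
import random

def one_hot(value, categories):
    """Return a one-hot list for `value` over the ordered `categories`."""
    return [1.0 if value == c else 0.0 for c in categories]

def min_max(value, lo, hi):
    """Scale a numeric value into [0, 1]; constant attributes map to 0."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)

def random_projection(vec, out_dim, seed=0):
    """Reduce `vec` to `out_dim` dimensions with a fixed Gaussian matrix.
    Stand-in only: the reduction method is an assumption of this sketch."""
    rng = random.Random(seed)
    matrix = [[rng.gauss(0.0, 1.0) for _ in vec] for _ in range(out_dim)]
    return [sum(w * x for w, x in zip(row, vec)) / out_dim ** 0.5
            for row in matrix]

def node_representation(node, basic_embedding, categories, ranges):
    """Concatenate reduced attribute features with a topology-only embedding."""
    features = []
    for name, cats in categories.items():      # categorical -> one-hot
        features += one_hot(node[name], cats)
    for name, (lo, hi) in ranges.items():      # numerical -> normalized
        features.append(min_max(node[name], lo, hi))
    reduced = random_projection(features, len(basic_embedding))
    return reduced + list(basic_embedding)

# Hypothetical player node and a 4-dimensional topology-only embedding.
categories = {"role_class": ["mage", "warrior", "archer"]}
ranges = {"role_level": (1, 100)}
node = {"role_class": "warrior", "role_level": 34}
vec = node_representation(node, [0.1, -0.3, 0.7, 0.2], categories, ranges)
print(len(vec))  # 8
```

The reduced feature part and the basic embedding part each contribute four dimensions here, so every node ends up with a vector of twice the basic embedding size.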
5.1.2 The Structure Component
The structure component is used to construct the connection relationships of a graph. The raw connection relationships are sensitive. Therefore, the structures are reconstructed from the embeddings of the graph.
Input: The input includes a graph with its attributes, an embedding method, and the corresponding parameters.
Extraction: The embedding can be extracted by the embedding component or by specified embedding models. Then, the distance matrix is calculated from the node embeddings. The edges among nodes are extracted from the distance matrix by the user-specified reconstruction method. The reconstruction is controlled by predefined parameters to avoid privacy disclosure. Here, different embedding methods can be used to support different analysis tasks. Some methods are strong in link prediction and graph reconstruction, while others are good at node clustering. This component and the embedding component can use different embedding methods for specified analysis tasks.
Output: Reconstructed graph structures are output.
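As an illustration of this component, the sketch below computes a Euclidean distance matrix from node embeddings and keeps the closest node pairs as reconstructed edges. The nearest-pair rule is only one possible reconstruction method, and the `max_edges` and `max_distance` parameters are hypothetical examples of the "predefined parameters" that bound how much structure is revealed.

```python
import math
from itertools import combinations

def distance_matrix(embeddings):
    """Pairwise Euclidean distances between node embedding vectors."""
    n = len(embeddings)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = math.dist(embeddings[i], embeddings[j])
        dist[i][j] = dist[j][i] = d
    return dist

def reconstruct_edges(embeddings, max_edges, max_distance):
    """Keep at most `max_edges` closest pairs within `max_distance`."""
    dist = distance_matrix(embeddings)
    n = len(embeddings)
    pairs = sorted((dist[i][j], i, j) for i, j in combinations(range(n), 2))
    close = [(i, j) for d, i, j in pairs if d <= max_distance]
    return close[:max_edges]

# Four toy nodes forming two tight groups in a 2-D embedding space.
emb = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2]]
edges = reconstruct_edges(emb, max_edges=2, max_distance=1.0)
print(edges)  # [(0, 1), (2, 3)]
```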
5.1.3 The Attribute Component
The attribute component is used to construct the attribute distribution of a graph. Attribute distributions are extracted to avoid privacy disclosure.
Input: The input includes a graph with its attributes, the bin size of the attribute distribution, and the filter conditions.
For numeric attributes, the bin size of the distribution is specified by users.
For categorical attributes, the bin size of the distribution is the dimensions of attributes.
Those data types that are less meaningful for counting (e.g., name) are not used.
Moreover, the topology attributes can also be extracted. These attributes are important to the analysis of the nodes of a graph, such as identifying the social influence in the social media network.
To avoid privacy exposure, the extraction only constructs topology distributions rather than topology values.
The distribution of the topology attribute of one node characterizes the node.
The supported topology metrics include:
(1) Degree, (2) Betweenness, (3) Eigenvector, (4) PageRank, (5) Clustering Coefficient, and (6) Average Nearest Neighbors Degree (KNN).
Extraction: This component selects nodes according to the filter conditions and counts the attribute distribution of these nodes (e.g., extracting the attribute distribution of players whose age is between 10 and 30).
Output: The attribute distribution of each node is extracted.
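The filter-then-count behavior described above can be sketched as a binned histogram over the selected nodes. The function name, the equal-width binning, and the sample records are assumptions made for illustration; only the age filter mirrors the example in the text.

```python
def attribute_histogram(nodes, attribute, bins, lo, hi, keep=lambda n: True):
    """Count a numeric attribute into `bins` equal-width bins over [lo, hi],
    using only the nodes that satisfy the filter predicate `keep`."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for node in nodes:
        if not keep(node):
            continue
        idx = int((node[attribute] - lo) / width)
        counts[max(0, min(idx, bins - 1))] += 1  # clamp to valid bins
    return counts

# Hypothetical player records; only the distribution leaves the client.
players = [
    {"age": 12, "role_level": 5},
    {"age": 25, "role_level": 40},
    {"age": 28, "role_level": 45},
    {"age": 41, "role_level": 90},
]
hist = attribute_histogram(
    players, "role_level", bins=4, lo=0, hi=100,
    keep=lambda p: 10 <= p["age"] <= 30,
)
print(hist)  # [1, 2, 0, 0]
```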
Note that other graph representation components can also be designed for specified graph visual analysis tasks.
5.2 Federated Computation
Federated computation is used to federate graphs and generate joint representations. The process contains three steps: federated initialization, server-side update, and client-side update.
5.2.1 Federated Initialization
Federated initialization sets the configuration of the federation model: encryption schemes, rules for computing feature vectors, and the distribution of the model.
Encryption: We provide different encryption schemes to protect the transmission of data, although the transmitted data contains no sensitive information. Because data owners may still have concerns about data privacy, our model employs encrypted federated average and encrypted model training. Specifically, our model is implemented with the federated functions of TensorFlow (e.g., federated_mean and federated_max). TensorFlow uses homomorphic encryption to secure the transmission process.
Attributes of nodes can be used to uniquely identify an individual. A traditional way of protecting privacy is to transfer only the attribute distributions of nodes. Unfortunately, the identity can be re-identified by exploiting the side information or schematic meaning of data [41, 51]. Thus, FGRM employs multiple attribute distribution protection models: (1) syntactic anonymization models: $k$-anonymity and $l$-diversity; and (2) differential privacy models: the Laplace mechanism and the exponential mechanism.
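To make the differential privacy option concrete, here is a minimal sketch of the Laplace mechanism applied to a histogram of attribute counts. It assumes that neighboring datasets differ by adding or removing one node, so one bin changes by at most 1 and the sensitivity is 1; the function names and the privacy budget `epsilon` are illustrative and not taken from FGRM.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def privatize_histogram(counts, epsilon, rng=None):
    """Add Laplace noise with scale sensitivity / epsilon to each bin count.
    Sensitivity is assumed to be 1 (one node affects exactly one bin)."""
    rng = rng or random.Random()
    scale = 1.0 / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]

# A smaller epsilon gives stronger privacy and noisier counts.
noisy = privatize_histogram([120, 45, 8], epsilon=0.5, rng=random.Random(7))
print([round(v, 1) for v in noisy])
```

Noisy counts can be negative or fractional; whether to clip or round them before visualization is a design choice left open here.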
Computing feature vectors: The server counts the fields of categorical attributes from all clients, and sets the corresponding one-hot vector for each field. Then, the server counts the maximum and minimum values of each numerical attribute and formulates the normalization standard of each attribute. Finally, each client calculates the feature vectors of each node based on these computing rules.
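One way to read this step is that each client reports only summary information (the set of category values and the per-attribute minimum and maximum), and the server merges these into global encoding rules. The sketch below follows that reading; the summary format and function names are assumptions, not details given in the text.

```python
def client_summary(nodes, categorical, numerical):
    """What a client reports: category values and min/max per attribute."""
    return {
        "categories": {a: {n[a] for n in nodes} for a in categorical},
        "ranges": {a: (min(n[a] for n in nodes), max(n[a] for n in nodes))
                   for a in numerical},
    }

def build_rules(summaries):
    """Merge client summaries into global one-hot orderings and ranges."""
    categories, ranges = {}, {}
    for summary in summaries:
        for attr, values in summary["categories"].items():
            categories.setdefault(attr, set()).update(values)
        for attr, (lo, hi) in summary["ranges"].items():
            old_lo, old_hi = ranges.get(attr, (lo, hi))
            ranges[attr] = (min(lo, old_lo), max(hi, old_hi))
    # Sorting fixes one shared one-hot position per category value.
    return {a: sorted(v) for a, v in categories.items()}, ranges

# Two hypothetical clients with different local value ranges.
client_a = [{"role_class": "mage", "role_level": 10},
            {"role_class": "warrior", "role_level": 55}]
client_b = [{"role_class": "archer", "role_level": 3},
            {"role_class": "mage", "role_level": 80}]
summaries = [client_summary(c, ["role_class"], ["role_level"])
             for c in (client_a, client_b)]
one_hot_order, norm_ranges = build_rules(summaries)
print(one_hot_order)  # {'role_class': ['archer', 'mage', 'warrior']}
print(norm_ranges)    # {'role_level': (3, 80)}
```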
The distribution of FGRM: The server distributes our FGRM and relevant settings to each client. The weights of the model (the hidden layer) are the embedding results of the graph. The weights compose an $n \times d$ matrix, where the row number $n$ is the number of distinct nodes across all graphs, and the column number $d$ is the dimension of the vector of a node. $d$ can be configured by users. A row represents the embedding representation of a node. Models in both clients and the server have the same weights. However, the nodes of local graphs are different, and so are the node counts of the graphs in clients. To find the row corresponding to each node, the server unifies the row index of each node across all clients: the model collects the number of graph nodes from all clients and the server, and then sets the index of the matrix row corresponding to each node.
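The row-index unification can be sketched as building one shared mapping from public node IDs to matrix rows, so that a node appearing in several clients always refers to the same row. The ID strings, the sorted ordering, and the uniform random initialization below are illustrative assumptions.

```python
import random

def build_row_index(client_node_ids):
    """Map every distinct public node ID to one row of the weight matrix."""
    all_ids = sorted(set().union(*client_node_ids))
    return {public_id: row for row, public_id in enumerate(all_ids)}

def init_weights(num_rows, dim, seed=0):
    """Randomly initialize a num_rows x dim weight (embedding) matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
            for _ in range(num_rows)]

# Two hypothetical clients that share the node "u3".
clients = [["u1", "u3"], ["u2", "u3", "u4"]]
index = build_row_index(clients)
weights = init_weights(len(index), dim=8)
rows_client0 = [index[p] for p in clients[0]]
print(index)         # {'u1': 0, 'u2': 1, 'u3': 2, 'u4': 3}
print(rows_client0)  # [0, 2]
```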
5.2.2 Client-side Update
Client-side update runs on each client. For the embedding component and the structure component, the server distributes a graph representation model and the initial weights of the model to each client in the federated initialization. Each client executes the model with the initial weights. Then, in each round, each client learns the graph representation from the learning model with its local graph, and calculates the gradients that encode the differences between the weights before and after local training. Thereafter, each client sends the gradients to the server. This process does not transmit the raw data, but only transfers the gradients of the learning model.
Each client calculates its attribute distribution with a user-specified attribute distribution protection model. Then, the distribution is encrypted and transmitted to the server by means of TensorFlow. Note that secure aggregation protocols and homomorphic encryption algorithms are supported by TensorFlow.
5.2.3 Server-side Update
The server randomly generates the weights of the embedding learning model and distributes the model with weights to each client. In each round, the server collects the gradients of models from clients to perform federated averaging: it computes the weighted average of the gradients according to the node number of each client. Next, new weights are computed from the current weights and the averaged gradients, and are sent back to each client. The server executes this weight update process iteratively until the specified number of rounds is reached. The resulting weights of the model, which form the federated embedding representation, are learned from the graphs in clients. This process does not transmit raw data.
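A single server-side round can be sketched as a node-count-weighted average of client gradients followed by a weight update. The text does not specify the exact update rule, so the plain gradient step with a learning rate `lr` below is an assumption, and encryption is omitted entirely.

```python
def weighted_average(gradients, node_counts):
    """Average client gradient matrices, weighted by client node counts."""
    total = sum(node_counts)
    rows, cols = len(gradients[0]), len(gradients[0][0])
    avg = [[0.0] * cols for _ in range(rows)]
    for grad, count in zip(gradients, node_counts):
        for r in range(rows):
            for c in range(cols):
                avg[r][c] += grad[r][c] * count / total
    return avg

def server_update(weights, gradients, node_counts, lr=1.0):
    """One round: aggregate gradients, then step the shared weights.
    The subtraction rule and `lr` are assumptions of this sketch."""
    avg = weighted_average(gradients, node_counts)
    return [[w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(weights, avg)]

# One shared row; client B holds three times as many nodes as client A.
weights = [[1.0, 1.0]]
grad_a, grad_b = [[0.5, 0.0]], [[0.0, 1.0]]
new_weights = server_update(weights, [grad_a, grad_b], node_counts=[1, 3])
print(new_weights)  # [[0.875, 0.25]]
```

Repeating `server_update` for the configured number of rounds, with fresh client gradients each time, gives the iterative process described above.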
The federated structure representation is generated from the federated embedding representation with link prediction and graph reconstruction algorithms. For instance, the distance matrix of nodes is calculated, and edges can be generated by selecting the nearest node pairs. The federated attribute representation is calculated by means of TensorFlow.
6 Visual Interface
The teaser figure shows the interface of our system. It consists of five views: a server view (A) that provides FGRM configuration (A1), data selection (A2), and model monitoring (A3); a client view (B) that shows the information and the process state of clients; an embedding view (C) that shows federated embedding representations; a structure view (D) that shows federated structure representations; and an attribute view (E) that visualizes federated attribute representations.
6.1 The Server View
Within the server view (A), users can configure graph learning models, multiple parameters, and encryption schemes (A1), and monitor the running process of FGRM (A3). The loss and the accuracy of FGRM are shown in a line chart (A3). Users can choose the federated graph representation at any time in the training process and visualize it in other views (F1). Through these visualizations, users can explore the representation being trained, verify its accuracy, and evaluate the models and parameters.
6.2 The Client View
In the client view (B), users can observe the general information, the running state, and the training process of each client. Users can select a client to examine its process state and visualize its federated graph representations in other views.
6.3 The Embedding View
The embedding view (C) contains multiple visualization components associated with federated graph embedding representations. The following views are supported: the projection view (C1), the cluster view (C2), and the anomaly view (C3).
Projection view: This view shows the distance of nodes of multi-party graphs in high-dimensional embedding space. Each point in the view represents a node (C1). The node color encodes clustering information or anomaly information. Interactive actions such as panning, zooming, and lasso-based selection are supported. Users can select nodes to study their attribute distributions in the attribute view and their reconstructed structures in the structure view.
Cluster view: This view uses a table to visualize the clustering result of nodes from multi-party graphs according to federated embedding representations (C2). Users can select nodes from one of the clusters to visualize their attribute distributions and reconstructed structures in other views (D2) (E2).
Anomaly view: This view lists the anomaly detection results of multi-party graphs based on federated embedding representations (C3). When users select some nodes, their attribute distributions and reconstructed structures are shown in other views.
Control panel: With the control panel (in the top right corner of each view), users can configure parameters or methods. In the projection view, different projection methods, including MDS and t-SNE, can be chosen. In the cluster view, different node clustering algorithms and corresponding parameters, including k-Means and DBSCAN, can be set. In the anomaly view, One-Class SVM or IsolationForest can be used for anomaly detection of nodes.
6.4 The Structure View
The structure view (D) depicts the reconstructed structures of nodes selected in other views. To keep structure exploration clear and avoid heavy overlapping, this view hides nodes without any edge. Different layout methods can be chosen in the view.
6.5 The Attribute View
In the attribute view (E), multiple histograms are used to visualize the attribute distribution of multi-party graphs. Users can interact with the bins of histograms (F3) to specify attribute distributions in the corresponding attribute interval. The y-axis in each histogram can be either linear or logarithmic.
In implementing GraphFederator, the front-end client was developed with the React framework and D3.js. Our in-house graph visualization engine is employed, with rich user interactions and flexible customizations. TensorFlow is used to compute the federated average of the attribute component. PyTorch is used to execute the graph embedding learning model.
Datasets. DBLP Dataset: This is a paper citation graph dataset. Each node represents a paper, which has 5 attributes, and an edge represents the citation relationship between two papers. Papers and their citation graphs in four areas are extracted to form a graph: AI, System, Theory, and Interdisciplinary. The total numbers of papers in the four areas are 64,232, 62,020, 14,430, and 62,708, respectively.
NetEase-Game-Player Dataset (NEGP): This is a game player transaction graph dataset provided by NetEase Co. (http://game.163.com). It was collected from five servers of a massively multiplayer online role-playing game. In each graph, a node represents a player, and an edge represents a transaction between two players. Each player has 36 attributes related to the player's role information and account status, such as role_level, role_class, and create_date.
|Dataset|Graph|Nodes|Edges|Attributes|
|NEGP|Game server 1|60,578|649,378|36|
|NEGP|Game server 2|45,687|378,350|36|
|NEGP|Game server 3|40,657|363,080|36|
|NEGP|Game server 4|61,351|579,937|36|
|NEGP|Game server 5|72,043|587,594|36|
We conducted several experiments to evaluate our approach. Experiments were conducted on a PC with a single Intel(R) Xeon(R) Gold 5218 CPU (base frequency of 2.3 GHz, 16 cores), 128 GB of memory, and a single RTX 2080 Ti GPU.
7.1.1 Federated Embedding Representation
Because the federated scheme may yield lower-quality embedding representations, we tested the performance and computing cost of our approach and its centralized counterpart on two datasets, DBLP and NEGP.
Configurations. We ran three configurations to evaluate the performance of different models: embedding learning in a single client (ELSC), centralized embedding learning (CEL), and our federated embedding component (FEC).
Data processing. DBLP Dataset: The citation graphs of the years 2014-2018 were used in our experiments with CEL and FEC, each located in a separate client. The abstract of each paper was encoded as a fixed-length vector based on word frequency counts. For ELSC, the embedding generated from the graph of the year 2016, which is the largest graph among them, was used as the performance baseline. The sub-field information was employed as the ground truth to evaluate embedding representations under the three configurations.
NEGP Dataset: The game player transaction graphs from five servers were evaluated with CEL and FEC. The graph from game server 1 was used to derive ELSC. Players were classified into two categories: high and low levels of in-game consumption. The consumption information was used as the ground truth to evaluate embedding representations with three configurations.
Models. We tested two embedding models: DeepWalk and GAT, one of the most popular neural architectures for graphs, which captures both the structure and the features of each node. The random walk step of DeepWalk was first applied to generate samples, with the number of walks set to 80 and the walk length set to 40. DeepWalk learned node representations by using the skip-gram model. We set the dimension of the representation and the length of the sliding window to 128 and 10, respectively. The GAT model was configured as follows: the number of layers was 3; the intermediate dimension of representations was 256; and three attention heads were used. To train GAT, we employed the stochastic gradient descent (SGD) algorithm. The learning rate and the coefficient of L2 regularization were set to 0.001 and 0.0001, respectively. DeepWalk and GAT on all clients were updated for 300 rounds.
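The sampling stage of DeepWalk with these settings can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes that 80 walks are started from every node, that the walk length of 40 counts nodes, and that the window of 10 extends on both sides of a center node. The skip-gram training itself is not shown.

```python
import random

def random_walks(adjacency, num_walks=80, walk_length=40, seed=0):
    """Generate uniform random walks, `num_walks` starting at every node."""
    rng = random.Random(seed)
    nodes = list(adjacency)
    walks = []
    for _ in range(num_walks):
        rng.shuffle(nodes)  # visit start nodes in a fresh order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length and adjacency[walk[-1]]:
                walk.append(rng.choice(adjacency[walk[-1]]))
            walks.append(walk)
    return walks

def skipgram_pairs(walk, window=10):
    """List (center, context) training pairs inside the sliding window."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs

# Toy triangle graph given as adjacency lists.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
walks = random_walks(graph)
pairs = skipgram_pairs(walks[0])
print(len(walks), len(walks[0]))  # 240 40
```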
We tested the three configurations on the DBLP and NEGP datasets. Because GAT has high computational complexity and memory consumption, CEL was not run with GAT. All results were obtained by 5-fold cross-validation.
Results. The accuracy and time consumption are shown in Table 2. The accuracy of FEC is similar to that of CEL, and both are better than that of ELSC. FEC takes more time than ELSC but less than CEL. Line charts of the loss of our approach are shown in Figure 2. t-SNE projections of the test-set embeddings extracted by our approach are shown in Figure 3. The embedding already performs adequately in the first half of the training process, so users can terminate training early by observing the visualization of the results in real time.
7.1.2 Federated Attribute Representation
We tested the computing cost of the federated average in the attribute component. The attribute protection model itself takes very little time. The federated average supported by TensorFlow was used to encrypt the attribute distributions of all clients, and the time consumed by this encryption cannot be ignored.
Settings. We measured the computing cost with respect to three variables: the number of clients (1-30), the number of attributes (1-30), and the number of nodes in each client (1-50,000). We varied one variable while fixing the other two. Each evaluation was repeated five times.
Results. The influence of each variable on the computing cost is reported in Figure 4. The computing cost of extracting federated attribute representations grows linearly with the number of clients and the number of attributes (Figure 4 (A) (B)) but depends little on the number of nodes in each client (Figure 4 (C)).
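To make this scaling behaviour concrete, the sketch below aggregates per-client attribute histograms by a node-count-weighted average. It is an illustration under stated assumptions, not the TensorFlow-based implementation used in the paper: encryption is omitted, and the names `local_histogram` and `federated_histogram`, the bin count, and the value range are hypothetical. Under these assumptions each client sends one fixed-size histogram per attribute, so the server-side work grows with the number of clients and attributes but not with the number of nodes per client.

```python
def local_histogram(values, bins=10, lo=0.0, hi=1.0):
    """Client side: normalised histogram of one attribute over the local nodes."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp v == hi into the last bin
        counts[max(idx, 0)] += 1                    # clamp values below lo into bin 0
    total = len(values)
    return [c / total for c in counts] if total else [0.0] * bins

def federated_histogram(client_hists, client_sizes):
    """Server side: average the histograms, weighted by each client's node count."""
    total = sum(client_sizes)
    bins = len(client_hists[0])
    merged = [0.0] * bins
    for hist, size in zip(client_hists, client_sizes):
        for b in range(bins):
            merged[b] += hist[b] * size / total
    return merged

# Two toy clients holding one attribute scaled to [0, 1].
client_a = [0.05, 0.15, 0.15, 0.95]
client_b = [0.05, 0.05]
hists = [local_histogram(client_a, bins=2), local_histogram(client_b, bins=2)]
merged = federated_histogram(hists, [len(client_a), len(client_b)])
```

With node-count weights, the merged histogram equals the histogram of the pooled values, although the server never sees the individual values.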
7.1.3 Federated Structure Representation
We evaluated the performance of the federated structure representation with two link prediction metrics: the Area Under Curve (AUC) score and Precision. DeepWalk was selected because it precisely captures the linkages among nodes from the node sequences generated by random walks. Supervised models such as GAT seek to minimize the differences among intra-class nodes, so node embeddings of the same category may become concentrated in the embedding space, which is problematic for linkage reconstruction.
Setting. We used the representations learned from the entire dataset to test the performance. We selected 10,000,000 edge pairs to compute the AUC score and set the cutoff of Precision to 1,000.
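The two metrics can be sketched as below, following definitions that are common in the link prediction literature; the paper does not spell out its exact formulas, so this is an assumption. AUC is estimated by repeatedly comparing the score of a randomly chosen existing edge with that of a randomly chosen non-existent edge, and Precision is the fraction of true edges among the top-ranked pairs. The function names and the convention that a higher score means a more likely edge (for example, a negated embedding distance) are hypothetical.

```python
import random

def sampled_auc(pos_scores, neg_scores, n_samples=10000, seed=0):
    """AUC estimated from random (existing edge, non-edge) score comparisons."""
    rng = random.Random(seed)
    wins = ties = 0
    for _ in range(n_samples):
        p = rng.choice(pos_scores)
        q = rng.choice(neg_scores)
        if p > q:
            wins += 1
        elif p == q:
            ties += 1
    return (wins + 0.5 * ties) / n_samples

def precision_at(scored_pairs, true_edges, top_l):
    """Fraction of the top_l highest-scored node pairs that are real edges."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)[:top_l]
    hits = sum(1 for pair, _ in ranked if pair in true_edges)
    return hits / top_l

# Toy example: three real edges and one non-edge, each with a similarity score.
true_edges = {(0, 1), (1, 2), (2, 3)}
scored = [((0, 1), 0.9), ((1, 2), 0.8), ((0, 2), 0.3), ((2, 3), 0.7)]
pos = [s for pair, s in scored if pair in true_edges]
neg = [s for pair, s in scored if pair not in true_edges]
auc = sampled_auc(pos, neg)
prec_top2 = precision_at(scored, true_edges, 2)
prec_top4 = precision_at(scored, true_edges, 4)
```

In the setting above, the number of sampled comparisons would correspond to the 10,000,000 edge pairs and `top_l` to the value 1,000.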
Results. Given and , the distance matrix of nodes is calculated, and the time complexity is . Edges can be generated by calculating the nearest node pairs. Its time complexity is by using the heap sorting technique. The total time complexity . Results are shown in Table 3. The AUC score is high on both datasets. Precision is not satisfactory on the NEGP dataset, probably because the dataset is highly complex.
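The reconstruction step described above (compute pairwise distances, then keep the nearest node pairs with a heap) can be sketched as follows. This is a brute-force illustration, not the authors' implementation: the embedding format, the function name `reconstruct_edges`, and the use of Euclidean distance are assumptions. With n nodes (a symbol introduced here only for the sketch), the scan evaluates n(n-1)/2 distances, and keeping the best pairs in a bounded heap adds a factor logarithmic in the number of edges kept.

```python
import heapq
import math
from itertools import combinations

def reconstruct_edges(embeddings, num_edges):
    """Return the num_edges node pairs with the smallest embedding distance.

    embeddings: dict mapping node -> list of floats (hypothetical format).
    """
    def dist(u, v):
        return math.dist(embeddings[u], embeddings[v])

    pairs = combinations(sorted(embeddings), 2)
    # heapq.nsmallest selects the closest pairs without sorting every pair.
    return heapq.nsmallest(num_edges, pairs, key=lambda p: dist(*p))

# Toy embeddings: nodes 0 and 1 are close, and so are nodes 2 and 3.
toy_emb = {0: [0.0, 0.0], 1: [0.0, 1.0], 2: [5.0, 5.0], 3: [5.0, 6.0]}
toy_edges = reconstruct_edges(toy_emb, 2)
```

The number of edges to keep acts as a reconstruction parameter: a smaller value keeps only the most confident links.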
7.2 Case Study
7.2.1 Case 1: NetEase-Game-Player Dataset
We invited an expert to use GraphFederator to analyze the game data. He works at NetEase and is skilled at game data analysis. We introduced our system and showed how it works. He then used GraphFederator to explore the NEGP dataset freely. He was interested in verifying the validity of the learned model and in finding anomalous trades with GraphFederator. He took graphs from four different game servers as the input data (clients) of FGRM in the server view (Figure 1 (A2)). Then, he configured FGRM to build federated graph representations (Figure 1 (A1)). The general information of the graph from each client is shown in the client view (Figure 1 (B)). The running status and the progress of FGRM are shown in the monitoring view (Figure 1 (A3)). The monitoring information indicated that the model ran well.
With the representations of all components obtained after 300 rounds of training, the expert saw clear clusters of the embedding in the projection view (Figure 1 (C1)). The structure view showed clear and diverse structures. He concluded that FGRM performed well. He then studied the distributions of the main attributes of players (Figure 6 (a1-5)) and found several of them interesting. Most Role_total_score values (players’ scores) are low, while a small number of players have scores in the highest interval. This indicates that a small number of players are far ahead of the vast majority in game progress. The distributions of Role_equip_score (players’ equipment score) and Degree (the number of trades of a player) both lie in a relatively low range (Figure 6 (a2) (a5)), implying that most players have low equipment scores and few trading records. These distributions helped the expert better understand the upgrading speed of the game, the consumption preferences of the players, and the preferences of players in different activities.
He selected 20,526 detected anomalous players (Figure 1 (F2)) in the anomaly view to study anomalies in in-game trading behavior. He found that many anomalous players had no trading record in the structure view (Figure 1 (D1)) and that many players’ accounts were banned according to the attribute view (Figure 1 (E1)). These players clearly behaved differently from the other players (Figure 6 (a1)). Note that the banned status is one of the most important attributes for game analysis because it indicates that a player may have used an illegal plugin or behaved in other ways that violate game fairness policies [44, 45]. He selected banned players from the corresponding histogram (Figure 1 (F3)), and the structure view showed trades among 1,122 players who were both banned and detected as anomalies (Figure 1 (D2)). There are 462 trades among 465 players. This indicates that some banned players knew each other and may belong to a studio that controls the accounts of many plugin players. In addition, these players may accumulate their virtual wealth in the game onto a specific player’s account. He selected three structures (Figure 1 (D2)) because, based on his domain knowledge, he believed that they were typical anomalous trading patterns. A short and intensive trading chain is a distinctive characteristic of anomalous trades.
The Degree distribution showed that some banned players made many trades (Figure 1 (E2)). Interestingly, he found that the kjjf (cumulative score of an activity) values of these players were concentrated in the lowest interval (Figure 1 (E2)), in contrast to those of all players (Figure 6 (a4)). He gave two possible explanations. The first is that these players were banned early and had no chance to accumulate scores. The second is that some plugin players only seek to make money or level up, and thus made no contribution to kjjf, which is an indicator of entertainment activities.
The three structures with 66 players and 168 trades (Figure 5 (C)) were also presented to the expert. Of these trades, 113 are anomalous. The expert was surprised that our system achieved such a high detection rate of anomalous trades. For comparison, the raw graph data of each game server was used separately to learn embeddings and detect anomalous trades. We found that the trades among these players are very intensive (Figure 5 (D)-(G)). We recorded the structures obtained at each analysis step of this case study and presented them to the expert. Using FGRM to detect anomalous trades from the four game servers yields a detection rate of 50.90% (Figure 5 (A)). However, when the data of a single game server is used, the detection rate is much lower (Figure 5 (D)-(G)). This indicates that federated graph representations extracted from multi-party graphs can capture more underlying insights.
In this case study, the expert achieved a detection rate of 67.26% (113 of 168 trades; Figure 5 (C)). GraphFederator enabled him to improve the detection rate of anomalous trades and to discover anomalous trading patterns.
7.2.2 Case 2: DBLP Dataset
We invited a professor to use GraphFederator to analyze the DBLP dataset. His research area is social network analysis. He wanted to analyze the clustering result of the paper citation data.
He chose the data from 2014 to 2018 and configured the embedding component with DeepWalk. He selected the training results of the last round and observed the federated graph representations in the other views. He varied the number of clusters k of the k-Means algorithm from 3 to 6 and selected 5 (Figure 7 (B)). Then, he selected different clusters. By observing the attribute view (Figure 7 (C)), he found differences in three attributes: Ref_num (the number of references), N_citation (the number of citations), and Average_neighbor_degree, the distributions of which are shown in Figure 7 (C1). Clusters (a) and (d) have distributions similar to those of the entire set of papers. He inferred that most papers belong to these two clusters, which explains the similarity. In contrast, the distributions of the three clusters (b), (c), and (e) lie in the lower interval. This indicates that the papers in these clusters are rarely cited, and he inferred that they were published recently. The distributions of clusters (c) and (e) lie in the lowest interval, and these two clusters contain fewer papers than the other clusters. The Ref_num and N_citation distributions indicate that the papers in these two clusters have few citations and references. The Average_neighbor_degree distribution implies that the papers they cite and the papers citing them also have few citations and references. He concluded that the papers in these two clusters attract little attention.
We used the raw data of each year from 2014 to 2016 to learn embeddings (Figure 7 (A2)-(A4)). Figure 7 (A1) shows the projection of the federated embedding representations of the years 2014-2016. The color encodes the ground truth labels. We found that papers with the green label are separated in 2014 and 2015 but clustered together in 2016. Papers in the red circles (Figure 7 (A1)-(A4)) are clustered under both kinds of representations. They are regarded as one cluster even though they have different labels. This indicates that federated graph representations extracted from multi-party graphs can capture more features and information and help find clusters with unique features.
7.3 Expert Reviews
We interviewed the expert involved in our first case study (Section 7.2.1). The expert thought that the findings could help improve strategies for anomalous player detection. He also believed that the clustering result of players from different clients might inspire more strategies for studying players of different clusters with unique features. The professor in the second case study (Section 7.2.2) suggested that our system should support the extraction and visualization of keywords or abstracts of papers. He hoped that analysts would be able to control privacy standards and analyze more data or dimensions; otherwise, some interesting insights would be missed.
To evaluate the effectiveness of our approach, we also conducted one-on-one interviews with five additional domain experts from NetEase Co. They are all skilled at game data analysis, graph analysis, and federated learning. In a live, hands-on demonstration of approximately 10 minutes, we showed them the case studies. We discussed the feasibility of FGRM and the findings, and solicited their feedback. They all confirmed that our approach could help them analyze features and information of players without touching raw data, and that GraphFederator empowers them to explore multiple aspects and features of graphs. They liked our intuitive user interface for visual analysis of large-scale graphs. One expert commented on our system: “…The system can help me validate training results of the federated model by the visual interface and help accomplish various analysis tasks of multiple graphs.” Another expert said: “…When I want to analyze graphs which are distributed in multiple clients, the accessibility of graphs limits me. FGRM can solve these problems and give wonderful visualizations of features from graphs.”
Privacy. In the computing of federated representations, the server computes the weighted average of the gradients of all client models and sends the updated weights to each client. By jointly averaging gradients, the model can be trained on multi-party graphs without exchanging raw data. The federated average algorithm computes the attribute representations from each client graph with the encryption provided by TensorFlow. The encryption prevents the transmitted data from being intercepted. Although the transmitted data contains no sensitive information, privacy-related data may still be inferred from non-sensitive information. Therefore, to prevent the identification of individuals from attribute distributions, our model employs multiple strategies such as syntactic anonymization models and differential privacy models. Structure representations are reconstructed from embedding representations, and the reconstruction parameters can be used to adjust accuracy. In fact, a small difference between the raw structures and the reconstructed structures can protect privacy. In the fields of secure multi-party computation and homomorphic encryption, there are various strategies for handling different privacy and security issues. Our approach is fully compatible with them and supports interactive configurations.
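The gradient-averaging scheme described in this paragraph can be illustrated with a toy example. The sketch below is not the FGRM implementation: it replaces the graph embedding model with a one-variable linear regression, omits encryption, and uses hypothetical names (`local_gradient`, `federated_round`) and a hypothetical learning rate. It only shows the data flow in which each client sends a gradient, and the server returns averaged, updated weights, so raw samples never leave a client.

```python
def local_gradient(weights, samples):
    """Client side: mean squared-error gradient of y ~ w0 + w1 * x on local data."""
    g0 = g1 = 0.0
    for x, y in samples:
        err = weights[0] + weights[1] * x - y
        g0 += 2 * err
        g1 += 2 * err * x
    n = len(samples)
    return [g0 / n, g1 / n]

def federated_round(weights, clients, lr=0.1):
    """Server side: average client gradients weighted by data size, then update."""
    total = sum(len(samples) for samples in clients)
    avg = [0.0, 0.0]
    for samples in clients:
        grad = local_gradient(weights, samples)  # only this gradient leaves the client
        share = len(samples) / total
        avg = [a + share * g for a, g in zip(avg, grad)]
    return [w - lr * a for w, a in zip(weights, avg)]

# Two toy clients whose private samples both follow y = 1 + 2x.
clients = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]]
weights = [0.0, 0.0]
for _ in range(300):  # 300 rounds, the number used in the experiments
    weights = federated_round(weights, clients)
```

Because the gradients are weighted by local data size, each round is equivalent to one gradient step on the pooled data, which is why the federated result can approach the centralized one.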
Expansibility. Our approach can accomplish various graph visual analysis tasks. Three components are employed for constructing different types of graph representations: embedding representation, attribute representation, and structure representation. Users can freely configure the strategies, methods, and parameters of each component. As shown in the case studies, users accomplished different analysis tasks for multi-party graphs, including anomaly detection, clustering, and comparison. Experts rated our approach highly for discovering and identifying patterns. Our approach also supports designing new components for constructing distinctive graph representations. Various visualization styles for different graph representations can also be employed to fulfill a variety of complex tasks.
Scalability. Our approach constructs federated graph representations from multi-party graphs with reasonable scalability. FGRM is compatible with different models and encryption strategies for different tasks and requirements. We conducted multiple experiments to measure the performance of the federated graph representation (Section 7.1.1). The efficiency of FGRM in terms of data size depends on the selected model. For the embedding component, GAT can only handle moderate-sized data because it operates on matrices, whereas DeepWalk can support large data because it uses the skip-gram technique. The other two components also support extracting representations from large-scale graphs. Our in-house visualization engine is capable of visualizing large-scale graphs with rich user interactions.
Performance. Our model extracted high-quality federated graph representations from multi-party graphs. Federated representations improve the anomaly detection and clustering results compared with the representation extracted from a single graph. With its rich interactions, GraphFederator helps experts accelerate the process of detecting anomalous trades and comparing clusters of papers. Three conclusions can be drawn from the experiments.
Compared with the centralized counterpart, our approach can generate results with a similar quality, and achieve a better running performance.
The graph can be reconstructed well from federated structure representations in terms of AUC score and Precision, and the reconstructed structures retain differences from the raw structures, which helps prevent privacy leaks.
The federated attribute representation can be constructed with relatively low computing costs.
This paper presents GraphFederator, a federation approach that constructs joint representations of multi-party graphs and supports privacy-preserving visual analysis of multi-party graphs. In the future, we plan to explore various encryption strategies. We also plan to extend our approach to other graph data. Currently, we assume that multi-party graphs have identical attributes. We will improve FGRM to support heterogeneous graphs.
-  E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov. How to backdoor federated learning. Computing Research Repository, abs/1807.00459, 2018.
-  K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of ACM Conference on Computer and Communications Security, pp. 1175–1191, 2017.
-  M. Brehmer and T. Munzner. A multi-level typology of abstract visualization tasks. IEEE Transactions on Visualization and Computer Graphics, 19(12):2376–2385, 2013.
-  H. Cai, V. W. Zheng, and K. C.-C. Chang. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering, 30(9):1616–1637, 2018.
-  K. Canini, T. Chandra, E. Ie, J. McFadden, K. Goldman, M. Gunter, J. Harmsen, K. LeFevre, D. Lepikhin, T. L. Llinares, et al. Sibyl: A system for large scale supervised machine learning. Technical Talk, 1:113, 2012.
-  N. Cao, Y.-R. Lin, L. Li, and H. Tong. g-Miner: Interactive visual group mining on multivariate graphs. In Proceedings of ACM Conference on Human Factors in Computing Systems, pp. 279–288, 2015.
-  N. Cao, C. Shi, W. S. Lin, J. Lu, Y. Lin, and C. Lin. Targetvue: Visual analysis of anomalous user behaviors in online communication systems. IEEE Transactions on Visualization and Computer Graphics, 22(1):280–289, 2015.
-  K. M. Chandy and J. Misra. Distributed computation on graphs: Shortest path algorithms. Communications of the ACM, 25(11):833–837, 1982.
-  Y. Chen, X. S. Zhou, and T. S. Huang. One-class SVM for learning in image retrieval. In Proceedings of International Conference on Image Processing, pp. 34–37, 2001.
-  M. A. Cox and T. F. Cox. Handbook of data visualization, pp. 315–347. Springer, 2008.
-  C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of Theory of Cryptography Conference, pp. 265–284, 2006.
-  M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of International Conference on Knowledge Discovery and Data Mining, pp. 226–231, 1996.
-  O. Goldreich. Secure multi-party computation. Manuscript. Preliminary version, 78, 1998.
-  S. Hong, S. Depner, T. Manhardt, J. Van Der Lugt, M. Verstraaten, and H. Chafi. PGX.D: A fast distributed graph processing engine. In Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12, 2015.
-  J. Konečnỳ, H. B. McMahan, D. Ramage, and P. Richtárik. Federated optimization: Distributed machine learning for on-device intelligence. Computing Research Repository, abs/1610.02527, 2016.
-  J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. Computing Research Repository, abs/1610.05492, 2016.
-  K. Krishna and M. N. Murty. Genetic K-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, 29(3):433–439, 1999.
-  A. Lalitha, O. C. Kilinc, T. Javidi, and F. Koushanfar. Peer-to-peer federated learning on graphs. Computing Research Repository, abs/1901.11173, 2019.
-  B. Lee, C. Plaisant, C. S. Parr, J.-D. Fekete, and N. Henry. Task taxonomy for graph visualization. In Proceedings of AVI Workshop on BEyond Time and Errors: Novel Evaluation Methods for Information Visualization, pp. 1–5, 2006.
-  K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory, 64(3):1514–1529, 2017.
-  M. Li, L. Zhou, Z. Yang, A. Li, F. Xia, D. G. Andersen, and A. Smola. Parameter server for distributed machine learning. In Proceedings of Big Learning NIPS Workshop, p. 2, 2013.
-  N. Li, T. Li, and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In Proceedings of International Conference on Data Engineering, pp. 106–115, 2007.
-  F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation forest. In Proceedings of IEEE International Conference on Data Mining, pp. 413–422, 2008.
-  J. Liu, E. Bier, A. Wilson, J. A. Guerra-Gomez, T. Honda, K. Sricharan, L. Gilpin, and D. Davies. Graph analysis for detecting fraud, waste, and abuse in healthcare data. AI Magazine, 37(2):33–46, 2016.
-  L. Lü and T. Zhou. Link prediction in complex networks: A survey. Physica A: statistical mechanics and its applications, 390(6):1150–1170, 2011.
-  L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(11):2579–2605, 2008.
-  A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data, 1(1):3, 2007.
-  R. R. McCune, T. Weninger, and G. Madey. Thinking like a vertex: A survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Computing Surveys, 48(2):1–39, 2015.
-  H. B. McMahan, E. Moore, D. Ramage, and B. A. y Arcas. Federated learning of deep networks using model averaging. Computing Research Repository, abs/1602.05629, 2016.
-  F. McSherry and K. Talwar. Mechanism design via differential privacy. In Proceedings of Symposium on Foundations of Computer Science, pp. 94–103, 2007.
-  G. Mei, Z. Guo, S. Liu, and L. Pan. SGNN: A graph neural network based federated learning approach by hiding structure. In Proceedings of IEEE International Conference on Big Data, pp. 2560–2568, 2019.
-  V. Mugunthan, A. Peraire-Bueno, and L. Kagal. PrivacyFL: A simulator for privacy-preserving and secure federated learning. Computing Research Repository, abs/2002.08423, 2020.
-  C. Nobre, M. D. Meyer, M. Streit, and A. Lex. The state of the art in visualizing multivariate networks. In Proceedings of Computer Graphics Forum, pp. 807–832, 2019.
-  C. Orlandi, A. Piva, and M. Barni. Oblivious neural network computing via homomorphic encryption. EURASIP Journal on Information Security, 2007:1–11, 2007.
-  B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining, pp. 701–710, 2014.
-  R. Pienta, J. Abello, M. Kahng, and D. H. Chau. Scalable graph exploration and visualization: Sensemaking challenges and opportunities. In Proceedings of International Conference on Big Data and Smart Computing, pp. 271–278, 2015.
-  J. Pretorius, H. C. Purchase, and J. T. Stasko. Tasks for multivariate network analysis. In Proceedings of Multivariate Network Visualization, pp. 77–95, 2013.
-  R. L. Rivest, L. Adleman, M. L. Dertouzos, et al. On data banks and privacy homomorphisms. Foundations of secure computation, 4(11):169–180, 1978.
-  R. Rossi and N. Ahmed. The network data repository with interactive graph analytics and visualization. In Proceedings of AAAI Conference on Artificial Intelligence, 2015.
-  B. Saket, P. Simonetto, and S. Kobourov. Group-level graph visualization taxonomy. In Proceedings of Eurographics Conference on Visualization, 2014.
-  L. Sweeney. Simple demographics often identify people uniquely. Health, 671:1–34, 2000.
-  L. Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557–570, 2002.
-  J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. ArnetMiner: extraction and mining of academic social networks. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining, pp. 990–998, 2008.
-  J. Tao, J. Lin, S. Zhang, S. Zhao, R. Wu, C. Fan, and P. Cui. MVAN: multi-view attention networks for real money trading detection in online games. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining, pp. 2536–2546, 2019.
-  J. Tao, J. Xu, L. Gong, Y. Li, C. Fan, and Z. Zhao. NGUARD: A game bot detection framework for netease mmorpgs. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining, pp. 811–820, 2018.
-  P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. In Proceedings of International Conference on Learning Representations, 2018.
-  J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer. A survey on distributed machine learning. ACM Computing Surveys, 53(2):33, 2020.
-  T. von Landesberger, A. Kuijper, T. Schreck, J. Kohlhammer, J. J. van Wijk, J. Fekete, and D. W. Fellner. Visual analysis of large graphs: State-of-the-art and future research challenges. In Proceedings of Computer Graphics Forum, pp. 1719–1749, 2011.
-  H. Wang, Y. Lu, S. T. Shutters, M. Steptoe, F. Wang, S. Landis, and R. Maciejewski. A visual analytics framework for spatiotemporal trade network analysis. IEEE Transactions on Visualization and Computer Graphics, 25(1):331–341, 2018.
-  X. Wang, W. Chen, J. Chou, C. Bryan, H. Guan, W. Chen, R. Pan, and K. Ma. Graphprotector: A visual interface for employing and assessing multiple privacy preserving graph algorithms. IEEE Transactions on Visualization and Computer Graphics, 25(1):193–203, 2019. doi: 10.1109/TVCG.2018.2865021
-  X. Wang, J.-K. Chou, W. Chen, H. Guan, W. Chen, T. Lao, and K.-L. Ma. A utility-aware visual approach for anonymizing multi-attribute tabular data. IEEE Transactions on Visualization and Computer Graphics, 24(1):351–360, 2017.
-  Y. Wu, N. Cao, D. Gotz, Y. Tan, and D. A. Keim. A survey on visual analytics of social media data. IEEE Transactions on Multimedia, 18(11):2135–2148, 2016.
-  W. Xiao, J. Xue, Y. Miao, Z. Li, C. Chen, M. Wu, W. Li, and L. Zhou. Tux: Distributed graph computation for machine learning. In Proceedings of USENIX Symposium on Networked Systems Design and Implementation, pp. 669–682, 2017.
-  Q. Yang, Y. Liu, T. Chen, and Y. Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology, 10(2):1–19, 2019.
-  T. Zhang, X. Wang, Z. Li, F. Guo, Y. Ma, and W. Chen. A survey of network anomaly visualization. Science China Information Sciences, 60(12):121101:1–121101:17, 2017.
-  J. Zhao, M. Glueck, F. Chevalier, Y. Wu, and A. Khan. Egocentric analysis of dynamic networks with egolines. In Proceedings of ACM Conference on Human Factors in Computing Systems, pp. 5003–5014, 2016.