Adversarial Representation Learning on Large-Scale Bipartite Graphs

06/27/2019 ∙ by Chaoyang He, et al. ∙ University of Southern California ∙ The University of Texas at Arlington

Graph representation learning on large-scale bipartite graphs is central to a variety of applications, ranging from social network analysis to recommendation system development. Existing methods exhibit two key drawbacks: (1) they cannot characterize the inconsistency of node features within the bipartite-specific structure; (2) they do not scale to large bipartite graphs. To this end, we propose ABCGraph, a scalable model for unsupervised learning on large-scale bipartite graphs. At its heart, ABCGraph uses the proposed Bipartite Graph Convolutional Network (BGCN) as the encoder and adversarial learning as the training loss to learn representations from nodes in two different domains and from bipartite structures, in an unsupervised manner. Moreover, we devise a cascaded architecture that captures multi-hop relationships in the bipartite structure and improves scalability as well. Extensive experiments on multiple datasets of varying scales verify the effectiveness of ABCGraph compared to state-of-the-art methods. Experiments on a real-world large-scale bipartite graph system further demonstrate the scalability of ABCGraph through fast training speed and low memory cost.

1 Introduction

A bipartite graph is a distinctive type of graph that captures the relationships between two types of entities and naturally models many applications. For example, e-commerce systems (Linden et al., 2003) need to capture the preferences between customers and products, and search engines need to recognize the matching between queries and webpages. Bipartite graph representation learning is therefore important.

One way to learn feature representations is to optimize them jointly with the downstream prediction task. However, because such networks contain millions of nodes, this supervised procedure requires tremendous labeling effort or is simply infeasible. An alternative is to learn feature representations independently of the downstream prediction tasks, in an unsupervised way. This not only removes the heavy labeling burden, but also makes the optimization computationally efficient and free of any dependence on subsequent prediction tasks.

However, current techniques fail to perform scalable unsupervised feature learning on bipartite graphs. The unique characteristic of a bipartite graph is that the features of the two entity types follow different distributions with different dimensions. Traditional graph embedding approaches such as DeepWalk (Perozzi et al., 2014) and Node2Vec (Grover and Leskovec, 2016) cannot incorporate these varying feature dimensions into learning. Heterogeneous extensions of these classic approaches (Dong et al., 2017) perform the random walk according to the types of the entities. However, running random walks on large-scale graphs, especially with multiple entity types, is time consuming and restricted by the width and depth of the graph exploration. Besides, a universal random walk may not preserve the power-law degree distribution of the network. Graph convolutional networks (GCNs) offer a promising alternative for graph embedding, using convolution operators to learn over graph structures (Kipf and Welling, 2017; Schlichtkrull et al., 2017), and can perform better than Node2Vec. However, GCNs can only be applied to homogeneous graphs in which every node's features follow the same distribution and dimensions. Because of the inconsistency between the different entity types in a bipartite graph, a common practice for generalizing GCNs to bipartite graphs is to apply the convolution operation to one group by aggregating its two-hop connected neighbors, which lie in the same group. This poses several drawbacks. First, this one-group convolution means the node embedding vectors cannot incorporate node features from the other group, a significant implicit representation ignored by current GCN-based embedding methods (Kipf and Welling, 2016; Hamilton et al., 2017). Moreover, in the large-scale setting, sampling methods (such as GraphSAGE (Hamilton et al., 2017), AS-GCN (Huang et al., 2018), and FastGCN (Chen et al., 2018)) are necessary to deal with the scalability issue (uncontrollable neighborhood expansion across layers). But when applied to bipartite graphs, they only consider multi-layer propagation within one group, without tailoring for the bipartite-specific structure. Since the indirect two-hop convolution operation largely increases the number of edges, training speed and memory cost remain unsatisfactory.

Figure 1: Top: the architecture of the ABCGraph model. Bottom left: adaptive sampling in GCNs. Bottom center: layer-wise sampling in GCNs. Bottom right: our cascaded architecture.

Our work aims to learn graph representations with the following goals: (1) unsupervised learning without labels; (2) incorporating heterogeneous edge/node features; (3) learning explicit and implicit structural relations; (4) improving scalability for large-scale bipartite graphs. As shown in Figure 1, we propose ABCGraph (Adversarial, Bipartite GCN, Cascaded), a scalable model for unsupervised learning on large-scale bipartite graphs. ABCGraph first extends GCNs to bipartite graphs through BGCN, a GCN-based embedding model we propose. We then design two types of decoder models to align the encoded representations with the node features in the two different domains in an unsupervised setting. Moreover, on top of this adversarial learning framework, a cascaded architecture is introduced, which captures multi-hop relationships in the bipartite structure and improves scalability as well. Finally, a large-scale bipartite graph system is built to evaluate the scalability of the ABCGraph model.

We evaluate the performance of our ABCGraph algorithm against four unsupervised graph embedding baselines: Node2Vec (Grover and Leskovec, 2016), standard GCN (Kipf and Welling, 2017), VGAE (Kipf and Welling, 2016), and GraphSAGE (Hamilton et al., 2017). We test on a large-scale social network dataset with around one million edges and on three bipartite graph datasets synthesized from citation networks (Cora, Citeseer, and PubMed). Using these benchmarks, we show that our approach effectively generates representations and outperforms the relevant baselines by a significant margin. Moreover, an experiment on a real-world large-scale bipartite graph system shows that our model remarkably speeds up the training process with low memory cost.

2 Related Work

Our work is related to neural graph embedding methods. The pioneering works DeepWalk (Perozzi et al., 2014) and Node2Vec (Grover and Leskovec, 2016) extend the idea of Skip-gram (Mikolov et al., 2013) to model homogeneous networks. However, they may not effectively preserve both explicit and implicit relations of the graph. Specifically, LINE (Tang et al., 2015) learns two separate embeddings for first-order and second-order relations; GraRep (Cao et al., 2015) and AROPE (Zhang et al., 2018) further extend this approach to capture higher-order proximities. Besides capturing high-order proximities, several proposals incorporate side information into vertex embedding learning, such as vertex labels (Li et al., 2017; Huang et al., 2017), community information (Chen et al., 2016), textual content (Wang et al., 2017), user profiles (Liao et al., 2018), and location information (Xie et al., 2016). However, these methods may be suboptimal for learning vertex representations on a bipartite network because they ignore vertex type information. Metapath2vec++ (Dong et al., 2017), HNE (Chang et al., 2015), and EOE (Xu et al., 2017) are representative vertex embedding methods for heterogeneous networks. Driven by the huge success of convolutional networks in the computer vision domain, a number of methods have applied the same notion of convolution to graph data; these are called graph convolutional networks (GCNs) (Wu et al., 2019; Kipf and Welling, 2017; Bruna et al., 2013; Henaff et al., 2015; Hamilton et al., 2017; Duvenaud et al.). In general, these methods perform a convolution by aggregating the neighbor nodes' information, so each node can learn its relationship to the other nodes in the graph. Based on this property of GCNs, variational graph auto-encoders (VGAE) have been introduced to solve the network embedding problem, which aims to represent network vertices in a low-dimensional vector space (Cao et al.; Wang et al., 2016; Pan et al., 2018; Yu et al., 2018). However, the weakness of GCN-based methods is that they assume all nodes share the same attributes. Nassar (2018) combined GCNs with bipartite graphs by clustering nodes to generate a bipartite graph, which efficiently accelerates and scales GCN computations, but their goal is not representation learning on bipartite graph data.

3 Proposed model: ABCGraph

The main objective of the ABCGraph algorithm is to learn representations in a bipartite graph whose two entity types have different feature distributions and dimensions. We first describe the general idea of bipartite graph embedding and our extension of GCNs to bipartite graphs (BGCN). We then propose two different decoder models to align the BGCN-encoded representations with the features themselves. Finally, we introduce a cascaded architecture based on the BGCN-decoder, which incorporates structural information from nodes further away without facing the neighborhood expansion problem across layers.

3.1 Bipartite Graph Embedding

Definition 1 (Bipartite Graph). Let $G = (U, V, E)$ be a bipartite network, where $U$ and $V$ denote the sets of the two types of vertices, respectively, and $E \subseteq U \times V$ defines the inter-set edges. Let $u_i$ and $v_j$ denote the $i$-th and $j$-th vertex in $U$ and $V$, respectively, where $i = 1, \ldots, |U|$ and $j = 1, \ldots, |V|$. The incidence matrix is $B \in \{0,1\}^{|U| \times |V|}$ for set $U$ and $B^{\top}$ for set $V$, with $B_{ij} = 1$ if $(u_i, v_j) \in E$.

The features of the two sets of nodes can be formulated as $X_U \in \mathbb{R}^{|U| \times d_U}$ and $X_V \in \mathbb{R}^{|V| \times d_V}$, respectively, where $X_U$ is a feature matrix whose row $x_{u_i}$ represents the feature vector of node $u_i$, and likewise row $x_{v_j}$ of $X_V$ for node $v_j$.

In bipartite graph embedding, the purpose is to learn embedding representations for the two sets of nodes separately. The transformation can be written as

$$Z = f(X, A), \qquad (1)$$

where $f(\cdot)$ is a linear or non-linear embedding function and $A$ is the adjacency matrix. Traditionally, in unsupervised learning, connectivity is used as the loss signal for learning the embedding function; e.g., random-walk methods (Perozzi et al., 2014) embed node latent representations close together when two nodes appear frequently in the same random walks. The problem is that this type of objective does not explicitly include any feature information. Alternatively, we use the node features themselves to learn the embedding function. Specifically, given the latent representations, our goal is to minimize the distance to the features themselves:

$$\min_{f,\,W} \, \lVert X - W Z \rVert^2, \qquad (2)$$

where $W$ is a linear mapping between latent representations and features. In most cases, however, the features are high-dimensional and cannot be fully represented by a linear transformation. To generalize this, we propose two neural-network-based non-linear mapping models that can efficiently capture both the features and the embedded latent representations.

3.2 Bipartite Graph Convolutional Network (BGCN)

As the embedding function, recent approaches learn over graph structures using convolution operators. However, so far GCNs can only be applied to homogeneous graphs in which all nodes share the same feature dimensions. We therefore extend GCNs to bipartite graphs with heterogeneous feature dimensions. Formally, we can partition the adjacency matrix into blocks:

$$A = \begin{bmatrix} 0 & B \\ B^{\top} & 0 \end{bmatrix}. \qquad (3)$$

Substituting this into the GCN aggregation function yields

$$H_U^{(l+1)} = \sigma\!\left(D_U^{-1} B\, H_V^{(l)} W^{(l)}\right), \qquad H_V^{(l+1)} = \sigma\!\left(D_V^{-1} B^{\top} H_U^{(l)} W^{(l)}\right), \qquad (4)$$

where $\sigma(\cdot)$ is a non-linear activation, $D_U$ and $D_V$ are the degree matrices for sets $U$ and $V$, respectively, and $H_U^{(l)}$ and $H_V^{(l)}$ are the node latent representations at layer $l$. The embedding function thus first aggregates the one-hop neighbor features for each node and then applies a non-linear transformation to the aggregated features. Note that the aggregation is performed only over the neighbor nodes, excluding the node itself, because of the distinct feature dimensions. For this reason, after one convolutional operation the encoded representations contain only one-hop neighbors, which are nodes from the opposite set of the bipartite graph. In order to also carry feature information from the node itself, we examine two neural-network-based decoder architectures.
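To make the aggregation in Eq. (4) concrete, the following is a minimal PyTorch sketch of a single BGCN step under the above notation; the class and variable names (BGCNLayer, H_opposite) are illustrative rather than the authors' released implementation, and TanH is used as the non-linearity following Section 5.

import torch
import torch.nn as nn

class BGCNLayer(nn.Module):
    """One BGCN step: mean-aggregate one-hop neighbors from the opposite set,
    then apply a linear transform and a TanH non-linearity."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.act = nn.Tanh()

    def forward(self, B, H_opposite):
        # B: sparse |U| x |V| incidence matrix of the target set (COO format);
        # H_opposite: representations of the opposite set, shape |V| x in_dim.
        deg = torch.sparse.sum(B, dim=1).to_dense().clamp(min=1).unsqueeze(1)
        agg = torch.sparse.mm(B, H_opposite) / deg  # mean over one-hop neighbors (D^-1 B H)
        return self.act(self.linear(agg))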

3.3 BGCN with multi-layer perceptron (MLP) Decoder

Our first decoder is a multi-layer perceptron (MLP), in which we simply align the output of the embedding function with each node's own features through two fully-connected neural network layers. In Section 5 we show that even this simple architecture yields a large improvement on the downstream classification task.

Formally, we denote the output of BGCN for set $U$ as $Z_U$; the unsupervised loss function is then

$$\mathcal{L}_{\mathrm{MLP}} = \lVert \mathrm{MLP}(Z_U) - X_U \rVert^2, \qquad (5)$$

where $X_U$ is the matrix of original node features in set $U$. This mean-squared-error loss forces the neighborhood-encoded representations to be similar to the feature representations. The output of the MLP is the nodes' latent representation and can be used for subsequent machine learning tasks.
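As a concrete illustration of Eq. (5), here is a minimal PyTorch sketch of the MLP decoder, following the description in the appendix (two dense layers with a ReLU and dropout in between, TanH output); the class name MLPDecoder and the dimensions are illustrative assumptions rather than the authors' code.

import torch.nn as nn
import torch.nn.functional as F

class MLPDecoder(nn.Module):
    """Two fully-connected layers mapping BGCN outputs back to the node's own
    feature space; trained with an MSE loss against the raw features (Eq. 5)."""
    def __init__(self, emb_dim, hidden_dim, feat_dim, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, feat_dim),
            nn.Tanh(),  # outputs aligned to [-1, 1], matching the input feature range
        )

    def forward(self, z):
        return self.net(z)

# usage: loss = F.mse_loss(decoder(z_u), x_u)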

3.4 BGCN with Adversarial Decoder

The features and the encoded representations of nodes can also be regarded as two different distributions. To align these two distributions, we present our adversarial approach. Let $Z$ be the encoded representations generated by BGCN and $X$ be the feature representations. A discriminator is trained to discriminate between elements drawn from $Z$ and from $X$. Conversely, BGCN is trained to prevent the discriminator from making accurate predictions. This adversarial process can thus be viewed as a two-player game, in which the discriminator aims to maximize its ability to identify the feature representations, and BGCN aims to prevent it from doing so by generating encoded representations that are as similar as possible to the features. This approach is an extension of Goodfellow et al. (2014) to the graph embedding domain.

Discriminator objective. We denote the parameters of the discriminator by $\theta_D$ and write $P_{\theta_D}(\mathrm{feat} = 1 \mid x)$ for the probability that an input vector $x$ comes from the feature representations. The discriminator loss function is:

$$\mathcal{L}_D(\theta_D) = -\frac{1}{|X|} \sum_{x \in X} \log P_{\theta_D}(\mathrm{feat} = 1 \mid x) \;-\; \frac{1}{|Z|} \sum_{z \in Z} \log P_{\theta_D}(\mathrm{feat} = 0 \mid z).$$

BGCN objective. In the generative setting, BGCN is trained so that the discriminator is unable to distinguish the feature representations:

$$\mathcal{L}_G = -\frac{1}{|Z|} \sum_{z \in Z} \log P_{\theta_D}(\mathrm{feat} = 1 \mid z).$$

After training, the latent representations are generated from the output of BGCN and can be used for other tasks. The detailed training procedure is given in Section 5.
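The following PyTorch sketch shows one adversarial training step in the spirit of the two objectives above. It is an illustrative assumption rather than the authors' code: in particular, it assumes the raw features are first projected into the embedding dimension (feat_proj) so that the discriminator sees inputs of equal size, a detail the paper leaves implicit; the discriminator uses two dense layers with leaky ReLU as described in the appendix.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Outputs a logit for 'input is a (projected) raw feature' vs. 'BGCN encoding'."""
    def __init__(self, dim, hidden=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.LeakyReLU(0.2), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x)

def adversarial_step(bgcn, feat_proj, disc, opt_g, opt_d, B, x_opposite, x_self):
    # Hypothetical single step; the full pipeline batches nodes and alternates sets.
    z = bgcn(B, x_opposite)                    # encoded representations ("fake" samples)
    real = feat_proj(x_self)                   # projected raw features ("real" samples)
    # 1) Discriminator update: learn to tell features from encodings.
    d_real, d_fake = disc(real.detach()), disc(z.detach())
    loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # 2) Generator (BGCN) update: make encodings indistinguishable from features.
    d_fake = disc(bgcn(B, x_opposite))
    loss_g = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()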

3.5 ABCGraph Architecture and Algorithm

Although the previous BGCN-decoder architecture is able to incorporate both graph structure and features into node representations, it employs only one-hop connections. In real bipartite graphs, node connectivity in the two sets is usually imbalanced; e.g., in our Tencent social network dataset, users join on average only two groups while groups contain around ten users, which means that considering only users' direct connections to groups is not enough. Multi-hop connections carry more structural information. To integrate this into our model, we design an advanced cascaded learning architecture that sweeps the input between sets $U$ and $V$ up to depth $K$. At each depth, one more hop of connectivity is embedded into the representations. As a result, no sampling step is needed for large-scale graphs, while multi-hop structural information is still captured.

Input: bipartite graph $G = (U, V, E)$ processed in mini-batches; input features $X_U$, $X_V$; non-linearity $\sigma$
Output: node representations $z_u$ for all $u \in U$
$H_V \leftarrow X_V$;
for $k = 1, \ldots, K$ do
       for $e = 1, \ldots,$ epochs do
             for each mini-batch $\mathcal{B}_U$ of input nodes in $U$ do
                   $\mathcal{B}_V \leftarrow \mathrm{QUERY}(\mathcal{B}_U)$;   // neighbors of the batch in the opposite set
                   $Z_U \leftarrow \sigma(\mathrm{MEAN}(\{H_V[v] : v \in \mathcal{B}_V\}) \, W)$;   // BGCN step
                   update BGCN and decoder by aligning $Z_U$ with $X_U$;
             end for
       end for
       ALTERNATE $U$ and $V$ (and their representations) as BGCN input
end for
return $Z_U$;
Algorithm 1 ABCGraph algorithm

Algorithm 1 describes the embedding generation process of ABCGraph, where the entire graph and the features of all nodes are given as input. We describe in the appendix how we partition the large-scale graph into mini-batches that fit into the model. Each step of the outermost loop proceeds as follows, where $k$ is the current step (depth) and $H$ are the current hidden representations. First, for one mini-batch of nodes in $U$, a QUERY operation selects all the neighboring nodes in set $V$ connected to this mini-batch. This query operation greatly reduces the graph size, which makes training on the entire graph possible. Then each node aggregates its one-hop neighbor representations into a single vector with a MEAN operator, and this vector is fed through a fully-connected layer with non-linear activation $\sigma$. We refer to this as the BGCN step, and its output is the encoded representation $Z_U$. After the BGCN step, ABCGraph feeds the encoded representations and the feature representations into the decoder for alignment; the decoder can be either the MLP decoder or the adversarial decoder. Finally, after several epochs of training, the embedding representations of set $U$ can be extracted. Note that this incorporates only one-hop information into the final embedding. To capture information from further reaches of the graph for set $U$, we increase the depth: the cascaded architecture sweeps between the two sets and uses the encoded representations of the previous depth as input. The experimental results in Section 5 show that as the depth increases, the embedding performance improves significantly.
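To summarize the control flow of Algorithm 1, here is a minimal Python sketch of the cascaded sweep. It is an illustrative outline under the assumptions above: the callbacks make_encoder and train_side stand in for building a BGCN layer plus decoder and training it for a few epochs, and are not part of the authors' released code.

def abcgraph_cascade(B, X_u, X_v, make_encoder, train_side, depth_K=3):
    """Cascaded sweep: alternate which side is encoded, feeding the previous depth's
    encodings as the opposite side's input (the ALTERNATE operation)."""
    H_u, H_v = X_u, X_v
    encode_u = True
    for _ in range(depth_K):
        if encode_u:
            enc = make_encoder(H_v.shape[1])          # BGCN + decoder for side U
            H_u = train_side(enc, B, H_v, X_u)        # one-hop aggregation from V, aligned with X_u
        else:
            enc = make_encoder(H_u.shape[1])          # BGCN + decoder for side V
            H_v = train_side(enc, B.t(), H_u, X_v)    # ALTERNATE: encode V from U
        encode_u = not encode_u
    return H_u                                        # final embeddings for set U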

4 Towards Large-scale Bipartite Graphs

ABCGraph uses the cascaded adversarial learning architecture described in the previous section because of its scalability advantages. As shown in Figure 1, node-wise and layer-wise sampling both require sampling at each layer of the model; our cascaded architecture, in contrast, requires only one-hop aggregation, which removes the layer-by-layer sampling process and further alleviates the scalability issue in the bipartite setting. We summarize the main advantages below.

Training Speed-Up. Our model does not need the random-walk preprocessing used by GraphSAGE and Node2Vec, which saves training time. Moreover, for the same batch size, we have fewer iterations and a shorter per-iteration time. Taking a bipartite graph with 600,000 nodes as an example: when GraphSAGE is applied, the bipartite graph has to be transformed into a single-set graph with 3,600,000 edges, so with a mini-batch size of 600 the number of iterations per epoch is as high as 60,000. Our bipartite training system needs only 1,100 iterations per epoch, an overall acceleration of more than 50x. Finally, in the ABCGraph design, the cascaded architecture uses only one GCN layer, which avoids uncontrollable neighborhood expansion across layers, as shown in the bottom right of Figure 1.

Faster convergence to higher accuracy. The ALTERNATE operation in Figure 1 simplifies the cascaded architecture, which requires fewer iterations and has better convergence properties. Surprisingly, we found that our model requires only 3 epochs on a very large dataset to achieve the best results, as discussed in Sections 5.2 and 5.3.

Memory and Computing Optimization. Taking the BGCN step from V to U and the adversarial alignment in Figure 1 as an example, the feature matrix of V is cached in memory, while the adjacency matrix of U is split into mini-batches and fed into the model. The complete feature matrix connected to U does not need to be sent to the model; instead, we only need to find the neighbors connected to the U nodes through the in-memory QUERY operation. In the BGCN stage, the operation is done by sparse matrix multiplication. The advantage is that the memory overhead is greatly reduced.
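As an illustration of the in-memory QUERY described above, the following sketch selects, for a mini-batch of U nodes, the V nodes they are connected to from a sparse incidence matrix; the function name query_neighbors is hypothetical.

import torch

def query_neighbors(B_coo, batch_u):
    """Given a sparse |U| x |V| incidence matrix in COO format and a list of U-node
    indices, return the indices of the V nodes connected to the batch."""
    rows, cols = B_coo.coalesce().indices()
    mask = torch.isin(rows, torch.as_tensor(batch_u))
    return torch.unique(cols[mask])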

5 Experiments

We design our experiments with the goals of (i) providing a rigorous comparison of graph representation performance between our ABCGraph model and the baselines; (ii) verifying the effectiveness of the cascaded architecture; and (iii) evaluating scalability on a large-scale dataset.

Dataset    Type               #Nodes U   #Nodes V   #Edges    #Features U   #Features V   #Classes U   #Classes V
Tencent    Social network     619,030    90,044     991,734   8             16            2            N/A
Cora       Citation network   734        877        1,802     1,433         1,000         7            6
Citeseer   Citation network   613        510        1,000     3,703         3,000         6            6
PubMed     Medical citation   13,424     3,435      18,782    400           500           3            3
Table 1: Dataset statistics

Dataset statistics are summarized in Table 1, and dataset distributions are shown in the appendix. These four datasets cover graphs from small and medium to large scale, and from long-tailed degree distributions (Tencent, Cora, Citeseer) to a balanced degree distribution (PubMed), which provides a comprehensive basis for evaluating model effectiveness. The Tencent social network is a real-world large-scale bipartite graph, with nodes and edges at the scale of hundreds of thousands and roughly one million, respectively. Each node in group U represents an Internet user in the social network, while each node in group V represents a social community that users in group U belong to. We generate bipartite graphs from the open-source citation network datasets Cora, Citeseer, and PubMed (Sen et al., 2008). Documents and the citation links between them are treated as nodes and undirected edges, respectively.

We evaluate the performance of our ABCGraph algorithm against four unsupervised graph embedding baselines: the random-walk-based (Perozzi et al., 2014; Dong et al., 2017) Node2Vec, GCN, GraphSAGE, and the variational-inference-based (Kingma and Welling, 2013) variational graph auto-encoder (VGAE). GCN (Kipf and Welling, 2017) learns feature representations in a convolutional way. Due to the inconsistency of node feature dimensions between the two sets, directly applying GCN is impractical. To make the comparison fair, we reconstruct a network that contains nodes from only one set, connected by their two-hop relations through the opposite set.

5.1 Results

We evaluate our node representations on the classification task using logistic regression. We use 80 percent of the nodes in the entire graph as the training set and the rest for testing. For binary classification on the Tencent dataset, we report F1 scores. For the other multi-class tasks, we report both micro- and macro-averaged F1 scores. We also evaluate the performance of using only the raw features to determine how much the embeddings help.

Methods             Tencent F1   Citeseer Micro-F1 / Macro-F1   Cora Micro-F1 / Macro-F1   PubMed Micro-F1 / Macro-F1
Raw features        0.497        0.707 / 0.621                  0.789 / 0.758              0.838 / 0.843
Node2Vec            0.610        0.724 / 0.627                  0.810 / 0.780              0.834 / 0.839
GCN                 0.529        0.715 / 0.627                  0.782 / 0.763              0.838 / 0.843
VGAE                -            0.732 / 0.645                  0.782 / 0.754              0.823 / 0.828
GraphSAGE           0.580        0.748 / 0.665                  0.823 / 0.801              0.838 / 0.843
ABCGraph-MLP        0.583        0.764 / 0.676                  0.810 / 0.784              0.847 / 0.843
ABCGraph-Adv        0.639        0.772 / 0.703                  0.864 / 0.838              0.862 / 0.865
% gain over feat.   29%          9% / 13%                       10% / 11%                  3% / 3%
Table 2: Results for unsupervised graph embedding performance evaluation on four datasets.

Table 2 shows the effectiveness of ABCGraph for large-scale bipartite graph representation learning. ABCGraph-Adv (the adversarial decoder introduced in Section 3.4) surpasses the other methods on both large and small datasets. Although ABCGraph-MLP does not work as well as ABCGraph-Adv, it still outperforms the baselines. On the large-scale Tencent dataset, ABCGraph gains a larger percentage over the raw features than on the other datasets (e.g., a 29% increase). On the PubMed dataset, due to its balanced degree distribution, every model improves only a small fraction over the raw-feature representation, but ABCGraph still obtains the best result among them. This verifies that ABCGraph handles both long-tailed and balanced datasets well.

To provide a fair comparison, we searched a large range of hyper-parameters for each dataset and report the best result among all settings. For each baseline model, we follow the open-source implementation from the authors' original papers. We implement the ABCGraph model with the PyTorch library. Unlike the standard GCN, which uses ReLU as the non-linearity between layers, we choose TanH as the non-linearity in BGCN to output the embedding vectors. For the ABCGraph model, as discussed in Section 3, we use mini-batches to reduce the memory and computation cost on large-scale datasets. We found the optimal batch size for all four datasets to be near 500, and only around 3 epochs are required on each dataset to converge quickly to the best result. The appendix contains further implementation details and hyper-parameters.

All of our experiments run on a GPU server with 8 Tesla V100 cards. The scalable bipartite computing system is built on a CPU cluster with 23 physical machines. More information about the infrastructure is summarized in the appendix.

5.2 Evaluation of the Cascaded Architecture

We also evaluate the effectiveness of the proposed cascaded architecture on the four datasets. Table 3 shows that depth 3 (ABCGraph-K3) indeed performs better than depth 1 (ABCGraph-K1). Both the depth-1 and depth-3 curves in Figures 2(a) and 2(b) show the loss convergence for nodes in U, while depth 2 corresponds to the nodes in V in between. The consistency between the depth-1 and depth-3 curves shows that our cascaded architecture allows training to alternate between the two sides of the bipartite graph. This alternating training can be viewed as a multi-layered deep learning process, which improves the model's capacity and also resolves the inconsistency problem described in Section 3.5. This is confirmed by the accuracy comparison in Table 3.

Model          Tencent   Citeseer   Cora    PubMed
ABCGraph-K1    0.537     0.736      0.799   0.838
ABCGraph-K3    0.639     0.772      0.864   0.862
Table 3: Evaluation of the cascaded architecture on the node classification task (Micro-F1 score)
(a) Discriminator loss
(b) Generator (BGCN) loss
Figure 2: Adversarial training loss on Pubmed.

5.3 Training Time and Memory Cost Comparison On Large-Scale Bipartite Graphs

(a) Training Time
(b) Memory Cost Comparison in Running Time
Figure 3: Training Comparison on Tencent Large-Scale Graph Data Set

As shown in Figure 3, ABCGraph's training speed is very attractive: its training time is only a small fraction of the GraphSAGE model's on GPU hardware, and it achieves a similar training time on CPU hardware. Compared with the high-performance Node2Vec C++ library, the ABCGraph model's training time is comparable. In terms of memory cost on the large-scale Tencent dataset, ABCGraph uses only 1/5 of the memory of the GraphSAGE model. The appendix covers more details of the scalability evaluation.

6 Conclusion

In this paper, we focus on unsupervised learning on large-scale bipartite graphs. We employ the BGCN model to address the node-feature inconsistency issue in bipartite graphs and design the cascaded architecture to capture multi-hop relationships in the local bipartite structure while improving scalability. Furthermore, we propose two unsupervised decoders, the MLP decoder and the adversarial decoder, to learn node embeddings in an unsupervised way. We evaluate the learned node representations on the node classification task. Extensive experiments confirm the effectiveness and efficiency of our models.

7 Acknowledgement

The authors would like to thank Oracle Cloud for providing GPUs and CPUs to run our experiments. Thanks to Zhenyu Yang and Zijian Hu for helping with data preprocessing. The authors also thank Tencent for providing support in feature engineering, dataset, and computational infrastructure.

References

8 Appendix

This supplementary material provides the source code for reproducibility, more details of the datasets, the hyper-parameter settings, additional experimental results, the infrastructure, and more details of the large-scale bipartite graph system used in our experiments.

Appendix A Source Code for Reproducibility

Source code. For reproducibility of the experimental results, we store this paper's original source code at https://tinyurl.com/ABCGraph. Since we may refactor our code for further research, we maintain the original version of the code at this URL. We also provide the data used in this paper for running the experiments. Besides the ABCGraph model, we provide the baseline code used in our experiments; each model's code is organized in an independent directory. To help reproduce our results efficiently, the readme.md file in the root directory contains a table of training scripts covering the different datasets and models.

Appendix B More Experimental Evaluations

b.1 Datasets

Figure 4: Node degree distributions on the four datasets.

Data Distribution Visualization. The distributions of our datasets are shown in Figure 4. To test the robustness of our model, the datasets cover two kinds of distributions. One is the long-tail distribution (Tencent, Citeseer, Cora), meaning the data is imbalanced and some nodes have many more neighbors than others. The other dataset (PubMed) has an even distribution, in which all nodes share similar neighborhood connectivity. Our experimental results show that the ABCGraph model performs better than the other baselines on both types of datasets, which demonstrates that our model can capture graph structure information in various types of graphs.

Data Preprocessing (Citation Networks). For the citation network datasets, we process Cora, CiteSeer, and PubMed similarly, treating the original graph as undirected. First, we divide the paper documents of each class into two equally sized subsets. Then, we combine the first halves of all classes into group U and the second halves into group V. We remove some of the features of the papers in group V to introduce heterogeneity between U and V. Lastly, we keep only edges that connect a paper in group U to a paper in group V and remove all other edges to make the graph bipartite; all isolated nodes are also removed. A sketch of this procedure is shown below.
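The following Python sketch illustrates the splitting and edge-filtering steps just described. It is an assumption-laden reconstruction, not the authors' preprocessing script: the function name to_bipartite is hypothetical, and the feature-removal step for group V is omitted.

import networkx as nx

def to_bipartite(G, labels):
    """Split each class in half to form groups U and V, keep only U-V edges,
    and drop isolated nodes (feature removal for V is handled separately)."""
    U, V = set(), set()
    for c in set(labels.values()):
        nodes = sorted(n for n in G if labels[n] == c)
        half = len(nodes) // 2
        U.update(nodes[:half])
        V.update(nodes[half:])
    H = nx.Graph()
    H.add_nodes_from(G)
    # keep only edges that cross between the two groups
    H.add_edges_from((u, v) for u, v in G.edges()
                     if (u in U and v in V) or (u in V and v in U))
    H.remove_nodes_from(list(nx.isolates(H)))   # drop isolated nodes
    return H, U & set(H.nodes), V & set(H.nodes)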

Data Preprocessing (Social Networks). The Tencent dataset is already a bipartite graph, with one set representing users and the other set representing the social communities that users have joined. For preprocessing, we keep the same format as the citation-network datasets to simplify data loading.

Insight from the dataset. In the Tencent large-scale dataset, the average degree of nodes in the user set is 1.6, while it is 11.0 in the group set. This sparse connectivity on the user side motivated us to design the cascaded architecture in ABCGraph, since performing aggregation on only one side would lose significant structural information from multi-hop connections. For example, the average number of one-hop edges per user is only 1.6, but each user reaches far more nodes through two-hop connections on average. The cascaded architecture targets this problem by alternating the input hidden representations between the two sets. The experimental results confirm the effectiveness of this cascaded architecture.

b.2 Model Implementation Details

Baseline                              Code link
Node2Vec (high-performance version)   https://github.com/snap-stanford/snap
VGAE                                  https://github.com/tkipf/gae
GraphSAGE                             https://github.com/williamleif/GraphSAGE
GCN                                   https://github.com/williamleif/GraphSAGE
Table 4: Reference for baseline code

Logistic regression. To evaluate the embedding quality of our model's output, we use logistic regression to predict the nodes' labels. We use the SGDClassifier with logistic loss from the scikit-learn Python package, splitting the nodes into 80 percent for training and 20 percent for testing.
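For illustration, here is a minimal sketch of this downstream probe, assuming the learned embeddings Z and labels y are available as arrays; the function name evaluate_embeddings is hypothetical, and older scikit-learn versions use loss='log' instead of 'log_loss'.

from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def evaluate_embeddings(Z, y, seed=0):
    """Logistic-regression probe on learned embeddings with an 80/20 split,
    reporting micro- and macro-averaged F1 as in Table 2."""
    Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y, train_size=0.8, random_state=seed)
    clf = SGDClassifier(loss="log_loss", max_iter=1000).fit(Z_tr, y_tr)
    pred = clf.predict(Z_te)
    return f1_score(y_te, pred, average="micro"), f1_score(y_te, pred, average="macro")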

Model Implementation. We use the baseline code published by the authors of the original papers, summarized in Table 4, following the parameter settings in the original papers and fine-tuning on our bipartite datasets. Node2Vec is a high-performance (C++) version, so its running time is comparable to ours. Since none of the baselines are designed for heterogeneous bipartite graphs, to make a fair comparison with our models we first transform the bipartite graph into a simple graph: we multiply the incidence matrix by its transpose to extract all two-hop connections. Because the graph is bipartite, the two-hop connections of one set contain only nodes from that same set. Through this simple transformation, the graph becomes a homogeneous single-set graph on which all the baselines can run.
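The two-hop projection used for the baselines can be expressed in a few lines; the following SciPy sketch (with the illustrative name two_hop_projection) is one way to perform the transformation described above.

import scipy.sparse as sp

def two_hop_projection(B):
    """B is the sparse |U| x |V| incidence matrix; B @ B.T connects U-nodes that
    share at least one V-neighbor, giving a homogeneous graph over U."""
    A_uu = (B @ B.T).tocsr()
    A_uu.setdiag(0)            # remove self-loops introduced by the product
    A_uu.eliminate_zeros()
    A_uu.data[:] = 1           # binarize: keep connectivity only
    return A_uu

# usage: A_uu = two_hop_projection(sp.csr_matrix(incidence))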

Hyper-parameters. We tune our model on every dataset and select the configuration with the best F1 score. Table 5 lists the final ABCGraph hyper-parameters for the different datasets.

Model          Hyper-parameter             Tencent   Citeseer   Cora     PubMed
ABCGraph-Adv   batch size                  600       400        400      700
               epochs                      2         4          2        3
               learning rate               0.0004    0.0004     0.0004   0.0004
               weight decay                0.0005    0.001      0.001    0.0005
               dropout                     0.4       0.35       0.35     0.35
               encoder output dimensions   16        16         24       24
ABCGraph-MLP   batch size                  500       64         128      128
               epochs                      3         3          5        3
               learning rate               0.0003    0.001      0.001    0.0001
               weight decay                0.001     0.0005     0.0008   0.005
               dropout                     0.4       0.2        0.2      0.2
               encoder output dimensions   24        48         48       48
               decoder hidden dimensions   16        16         16       16
Table 5: Hyper-parameters for ABCGraph on the four datasets

For the number of epochs, we first searched a wide range and found that a small number of epochs achieves better performance; this also explains why our model requires less training time. The ABCGraph-MLP model contains two dense layers with a rectified-linear activation and a dropout layer in between. The output of the decoder is mapped into the range [-1, 1] using the hyperbolic tangent, matching the distribution of the input features. For the ABCGraph-Adv model, the discriminator also contains two dense layers, but with leaky rectified-linear activations, which avoids the sparse-gradient problem.

Hyperparameters ABCGraph-Adv ABCGraph-MLP
batch size
epochs
learning rate
weight decay
dropout
encoder output dimensions
decoder hidden dimensions
Table 6: Hyperparameters validation range

b.3 More Experimental Results on Large-Scale Dataset

(a) Discriminator loss
(b) Generator (BGCN) loss
Figure 5: ABCGraph-Adv training loss on Pubmed.

Training loss. Here we show the training loss of both decoder types of our model over iterations on the PubMed dataset; PubMed is a medium-size dataset and therefore illustrates the behavior well. Figure 5 shows the loss curves for the adversarial decoder. The figure shows that the discriminator and the generator (BGCN in our model) compete with each other and finally reach a balance, with both losses becoming steady after around 20 epochs. The detailed parameter settings can be found in Table 6.

We also show the training of the ABCGraph-MLP model in Figure 6. As described previously, although this is a simpler alignment model than the adversarial decoder, it still outperforms the other baselines because it exploits the bipartite structure in our setting. The curve shows its training process.

Figure 6: ABCGraph-MLP training loss on Pubmed.

Appendix C More Details of our Large-Scale Bipartite Graph System

Figure 7: Training System Pipeline

In our source code, to keep the focus on understanding the model, we provide code that evaluates the performance of our model on a single machine. In this single-machine setting, the data loader loads the datasets used in this paper, and the model then runs the in-memory QUERY operation.

Our model can also run well in distributed execution across multiple machines. Figure 7 shows the training system pipeline. Since the ABCGraph model is economical in hardware resource consumption, we can run many trainers in parallel, which benefits quick hyper-parameter searching. In step 1 of this pipeline, we load one mini-batch of one side of the bipartite graph (group U in Figure 7) into memory, and then load its related opposite-side data (group V in Figure 7), namely the feature and adjacency matrices, from shared storage via the QUERY operation (step 2). After this, BGCN performs node-wise sparse matrix multiplication as the convolutional operation on the mini-batch cached in memory, and the adversarial forward and backward propagation is performed on the same cached mini-batch data. After several epochs, we can infer the embedding vectors using BGCN; the embedding vector is the hidden representation of one group (group U in Figure 7). In step 4, the ALTERNATE operation is performed to change the input for the BGCN-decoder model. This is the implementation of our cascaded architecture in a real-world parallel computing system. The entire training process consists of many iterations of this loop from step 1 to step 4.

c.1 System Architecture

Figure 8: Large-Scale Graph System Architecture

Our large-scale graph system contains five layers, as shown in Figure 8. We describe them from a top-down perspective. The top application layer hosts many real-world tasks, such as classification and link prediction. The second layer is the model layer, which supports various GCN- and random-walk-based models, including our ABCGraph and other state-of-the-art methods. The three underlying layers support the two layers above. We find that four operators, namely SAMPLE, AGGREGATE, ALTERNATE, and COMBINE, are shared by all GCN models, so we place these operators in an independent layer. The goal of the lowest two layers is to fulfill the fast data-access requirements of the higher-level operators and algorithms (e.g., the neighborhood sampling module can query the features of connected nodes and generate a sparse adjacency matrix). In summary, our system design balances reusability, flexibility, scalability, and compatibility for different model properties and real-world task requirements.

Appendix D Infrastructure

d.1 GPU Server

GPU Cards: 8 Nvidia Tesla V100

CPU: Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz

Memory: 756 GB

Operating System: Red Hat 4.8.5-16.0.3, gcc version 4.8.5 20150623

PyTorch: 1.1.0

Tensorflow: 1.13.1

d.2 HPC (High Performance Computing)

Our large-scale bipartite system is currently deployed in a cluster environment consisting of 23 physical machines, each with 48 cores and 320 GB of memory; the CPUs are Intel(R) Xeon(R) Platinum 8167M @ 2.00GHz. Running one process per CPU core, this system can run up to 1,152 workers for parallel training of each baseline model.