I Introduction
In recent years, graph neural network (GNN), which apply deep neural network to learning representation in graph data, drew lots of attention in the machine learning and data mining community. Plenty of literatures demonstrate the effectiveness of applying GNN in solving
link prediction problem [43], which means to predict whether two nodes in a graph are likely to have a link. As graph data exists ubiquitously, the link prediction problem has many applications, such as recommendation, fraud detection, knowledge graph completion, etc.
However, most of GNNbased link prediction models are performed within graphs with millions of nodes and tens of millions of edges, in a single machine. There is few literature which focuses on applying GNNbased link prediction models to industralscale graphs with hundreds of millions of nodes as well as billions of edges, which exist in lots of scenarios among industrial community. Take Ant Financial ^{1}^{1}1https://www.antfin.com/ as an example, there are hundreds of millions customers who pay with Alipay ^{2}^{2}2https://www.alipay.com/ everyday. Billions of transactions are made each month, which form a verylargescale transaction graph, containing hundreds of millions of nodes (i.e., customers and merchants) and billions of edges (i.e., transactions).
Many applications, such as personalized merchant recommendation, fraudulent transaction detection, etc., can be formulated as the link prediction problems. Take personalized merchant recommendation as an example, transaction between users and merchants can naturally form a usermerchant bipartite graph (i.e., users/merchants as nodes and transactions as edges). Hence, personalized merchant recommendation can be formulated as a link prediction problem, which aims to predict whether a certain user would have transactions with a certain merchant in the future, based on the usermerchant transaction graph.
In this paper, we focus on how to build a scalable and distributed GNNbased link prediction model which can be applied to the industrialscale graphs. This topic is challenged from three aspects.
First, scalability. The proposal need to handle industrialscale graph data, which may contain hundreds of millions of nodes and billions of edges, as well as the abundant attribute information of nodes and edges. In the other hand, to our best knowledge, there is few literature which mentions about link prediction on industrial settings.
Second, distributed runtime sampling. To handle industrialscale graph data, the proposal should perform distributed model training. As most conventional GNNbased linkprediction models apply various sampling techniques to generate negative samples [2, 21], our proposal should implement effective runtime sampling in the distributed environment.
Last, effectiveness of inference. Different from node prediction (i.e., classification or regression of nodes), the inference procedure of link prediction usually need to perform in much more larger datasets, since the upper bound of the number of links in a graph with nodes is , while for nodes. Therefore, the effectiveness of inference becomes a challenge when facing an industrialscale graph.
In this paper, we present a GNNbased Distributed framework for SemiSupervised Link Prediction problem, named DSSLP, which is able to perform model training and inference effectively in an industrialscale graph with billions of nodes and edges. First, by avoiding training in the whole graph, we proposed the hops neighborhood training setting. As the representation of a certain node is learned by aggregating the information of its hops neighborhood, our proposal perform model training in a distributed, minibatch setting, while each batch contains the subgraph formed by hops neighborhoods of all nodes in the batch. Second, we implement batched runtime sampling, which can perform uniform sampling and dynamic sampling, to generate negative samples. The proposed framework also contains a highperformance inference modules. It reduce the upper bound of computation complexity of node representation inference from to by model splitting technique.
The rest of this paper is organizing as follow:
Ii Related Works
Link prediction is a longstanding problem that attracts attention from both physical and computer science communities [8, 9, 10, 11, 12, 13]. Various of techniques, such as statistic models, matrix factorization, graph neural networks, and so on, have been applied to this problem. In this section, we will address the most related works that motivate us a lot.
One kind of popular approaches is to evaluate node similarity based on structural information of the graph. The basic idea behind is that the more commonality two nodes share in terms of topology, the more likely a link will exist between them. As a result, early heuristic methods are designed based on different statistics about network structure like degree [14, 18], common neighbors [15, 16, 17], connected path [19]
. Although works well in practice, the performance of these heuristic methods may vary a lot as they usually have strong assumptions on when links exist
[2, 25]. Furthermore, researchers also develop algorithms, called network embedding, to encode each node of a graph into a lowdimensional vector, and use a certain distance to measure the existence probability of links between node pairs. A variety of network embedding techniques, such as matrix factorization
[20, 27] , random walk based methods [11, 12, 22] are engaged into this filed and lead to a superior performance in a number of applications [23, 24]. However, evaluating node similarity only based on structural information may be limited, especially for nodes only sparsely connected [21], or for graphs with abundant attribute information. Therefore, they meet their bottleneck when applied to realworld problems [21].It has been a trend that the link prediction problem refers to not only the structural information, but also node/edge attributes, as the attribute information may also play an important role in deciding whether a node will interact with others in some realworld graphs [26]. [21] proposes a model linearly combining structural information and attribute information, while [2, 44] directly encodes structural and attribute information into a lowdimensional feature vector based on graph neural networks [3, 4, 5, 6]. [28] and [29] state that combining the structural information and attribute information in a graph helps improve performance. The main drawback of those methods is scalability, which prevents them from being widely applied to realworld problems.
With the widely use in the industrial applications, these link prediction methods are challenged for their scalability and efficiency. Several works aims to alleviate this problem for network embedding methods. [46] incorporate several modifications to traditional multirelation embedding systems and present PBG, which is able to handle graphs containing over 100 million nodes and 2 billion edges. However, few of work is proposed to leverage both structural and attribute information for link prediction atop the industrialscale graphs with billions of nodes and edges.
Iii Preliminaries
In this section, we will define some notations which will be used in the following sections, as well as describe some algorithmic details of a link prediction model.
Iiia Notations
Generally, a graph can be defined as with associated adjacency matrix , where is the set of nodes and is the set of edges , while means that there is an edge between node and node . Moreover, many realworld graphs contain node attribute information, which results in the concept of attributed graph (or network)[45]. A realvalued matrix is defined to represent dimensional node attributes[1]. In attributed graphs, a subgraph not only contains the structural information, but also contains the attributed information.
Here we will introduce a concept named hops neighborhood. The hops neighborhood of node (denoted as ) in an attributed graph is defined as a subgraph induced from by a set of nodes , where denotes the length of shortest path from node to node .
IiiB Node Embedding
Node embedding (or called graph embedding) methods aim to encode nodes into lowdimensional vectors, which is able to preserve some kinds of graph properties. One kind of popular methods to learn a deep encoder is to update parameters by aggregating information from the local neighborhood of a node in an iterative fashion [3, 4, 5, 6]. Hence, the embedding of node in th hidden layer is denoted as:
(1) 
where
is an aggregator function (i.e., mean, sum and max pooling operator, and so on) defined on the set of
th hidden layer embeddings of its 1hop neighborhood, denotes the trainable parameters of th hidden layer, and . By stacking the aggregating operator times, each node embedding is able to aggregate information from its hops neighborhood.IiiC GNNbased Semisupervised Link Prediction Model
Given a snapshot of a graph, the link prediction problem aims to accurately predict the links (edges) that will be added to the graph in the future. It naturally follows a semisupervised setting, in which existing links in the graph are treated as positive examples, while the negative examples can be generated by different sampling strategies.
Formally, from an encoderdecoder perspective [1], the existence score of a link between node and node is defined as:
(2) 
where is a function that maps node set to dimensional embedding space , while is a pairwise decoder that maps pairs of node embeddings to a realvalued edge score measurement.
Usually, the encoder and decoder can be learned by optimizing an emprical reconstruction loss:
(3) 
where is a userspecified pairwise proximity measure. Specially, represents that there is a link between node and , while indicates no link exists. Meanwhile,
represents a userspecified loss function (e.g.
distance) measuring the discrepancy between estimated value
and the ground truth , and is a dataset for model training.In the GNNbased link prediction model, a deep encoder with layers (i.e., Equation 1) is applied to generate node embeddings. The decoder
here can be dot product or euclidean distance of two node embeddings, or a classifier (i.e., logistic regression and multilayer perceptron) defined on the concatenation of two node embeddings. That is
(4) 
is used to measure the existence score of an edge between nodes and . The larger the score is, the more likely that an edge will appear between node and . To that end, we penalize a marginbased ranking loss [31, 32] to guide the learning process:
(5) 
where represents a positive edge while is a negative example sampled from a certain distribution . In our model, positive and negative edges are sampled according to a set of sampling strategies, which will be detailed in the following section.
is a margin hyperparameter to separate positive and negative edges. The loss function in (
5) ensures that negative edges get lower scores than positive edges by at least a margin . In addition, we also provide a classification loss as an alternative.By minimizing the loss function introduced above via minibatch stochastic gradient descent (SGD), DSSLP jointly learning node embedding
and link prediction score . Therefore, DSSLP can be not only used in link prediction task, but also transferred to other tasks like node classification task with few extra labeled data.Iv Framework
Iva Overview
DSSLP is designed to solve industrialscale link prediction problems. As illustrated in Fig. 1, DSSLP provides a complete solution for link prediction applications: (1) Data preprocessing. Generate hops neighborhood for each node and store them into a shared file system (Maxcompute ^{3}^{3}3https://www.alibabacloud.com/product/maxcompute); (2) Training. Train the link prediction model described above in a distributed and asynchronous manner based on the parametersever architecture [30]; (3) Inference. Compute link prediction score for industrialscale graph data with carefully designed modelsplit strategy.
DSSLP receive a special format of hops neighborhood as input rather than the full graph, which helps reduce the time and memory cost. During training procedure, every worker pulls parameters from a parameter sever and update them independently. For each worker, we present the work flow in Fig. 2, which mainly consists of three parts: (1) Encode node embeddings based on subgraphs with a certain of embedding representation algorithms like GCN [3], GAT [5], Geniepath [6]; (2) Sample positive and negative links according to observed links within a batch; (3) Optimize a certain loss to guide the learning process. After that, we can apply a welltrained model to predicting the existence of links in a graph. Specially, for industrialscale graph data, we split the welltrained model into two parts to avoid some repeated computing and accelerate the inference procedure.
When applied to industrialscale link prediction applications, a framework should not only care about performance, but also hold the properties of high quality in scalability and efficiency. In the following, we will detail our framework and explain how this framework is possible to efficiently and accurately predict links between arbitrary two nodes on huge graphs.
IvB Operating on hops neighborhood
The first challenge for a link prediction model is how to operate on huge graphs with billions of nodes and edges. The cost of memory and time will arrive at an unacceptable level when loading a full huge graph into memory, as a typical graph that contains one billion nodes with each associated with 64dimensional float features would require 250G memory to just store these features, which is beyond the capacity of typical commodity machines.
Rethinking the total work flow of the link prediction model, we find that, although embedding representation is based on structure and attribute information of a graph, it’s unnecessary to load the full graph into memory at once. As stated in section IIIB, embedding a node is to aggregate information from its local neighborhood and a hops embedding of node can be accurately calculated from its hops neighborhood [2]. Therefore, we format our inputs based on hops neighborhood instead of the full huge graph.
Our DSSLP framework takes pairs of nodes together with their hops neighbor as input. As shown in Fig. 2, the input of our framework is encoded into an “edge” format, which mainly consists of four components: (1) Id of node ; (2) Id of ; (3) , hops neighborhood of node ; (4) , hops neighborhood of node . Ids of and are used to label the root nodes of subgraphs and the node pair forms a link to be predicted, while contains information of nodes and edges of the hops neighborhood rooted in node .
For a certain node , the hops neighborhood mainly consists of two parts: nodes associated with node features and edges together with their features in hops neighborhood of , which is constructed by the following steps:

Insert the root node into the node list together with its features.

For each node in , insert edges directly connected to them into the edge list

Update with nodes associated with edges in .

Repeat the second and third steps for times.
By alternately update the node list and the edge list , we finally generate the hops neighborhood for node .
To prepare the input of our model, we first extract hops neighborhood of and respectively, and then construct input with the “edge” format stated above. Since the hops neighborhood is information complete for computing the hops embedding of its root node, this kind of input provide sufficient structural and attribute information for our model to predict the existence between nodes and .
With input in such format, our DSSLP model is born with the ability of running parallelly and distributedly, as the model is able to learn embeddings of root nodes directly from a piece or pieces of data records without the need of referring to the full graph. Furthermore, our model is memory friendly, as the scale of input is decreasing from a full graph to a hops subgraph. For example, given a graph with 0.6 billion nodes and 10.4 billion edges, the average node degree is about 17, and then a hop subgraph only contains about nodes, which is far less than the amount of nodes in the full graph.
IvC Batched Runtime Sampling
The second challenge is how to efficiently sampling negative edges to guide the learning process. A straightforward way is sampling negative links in an offline manner: randomly select a certain number of observed edges as positive links together with about the same number of unobserved edges as negative links [2, 21]. However, when applied into industrialscale applications, offline sampling may only cover a small part of the whole possible negative links and leads to an “imbalance” problem [21]. What’s worse, the imbalance data may break the basic assumption that training and test set should obey the same distribution. To solve this problem, we propose a batched runtime sampling method, which is flexible for different scale of graphs and is able to approximate the distribution of positive and negative edges of the whole observed graph .
The runtime sampling method operates on a batch of records during the training procedure. Given a batch of records, a pair of root nodes in the th record is sampled from the whole observed edge set , which naturally forms a positive link. Meanwhile, a pair of nodes is sampled as a negative link if and are root nodes within the batch and the edge can not be observed in the edge set (all edges within the batch). As all nodes directly connected to the root node are collected in ’s hops neighborhood, sampling negative links according to is equal to sampling according to the whole edge set .
A set of strategies are developed to make our runtime sampling method flexible and powerful:
Shuffling
One key point for batched runtime sampling is that negative examples should be sampled according to the whole dataset rather than only within a batch. While candidates of negative links are combinations of nodes from the whole graph, batched sampling makes candidates of negative samples be limited to the combinations of root nodes within a training batch, which may not reflect the real distribution of the whole graph.
A simple but efficient shuffling strategy is used to solve this problem when we design our batched runtime sampling method. The basic idea behind is that we should provide enough combinations of nodes as candidates of negative samples to sampling methods, which, is equal to provide enough combinations of data records. To that end, we perform a runtime shuffling strategy: First, we pick records in order from the training dataset and put them into a prefetch buffer; Then, we randomly pick records one by one to fill a training batch. Note that, once a record was pulled from the buffer, a new record will be picked from train dataset and replace the former record in the buffer.
With such shuffling strategy, a training batch is filled with randomly selected records, and a certain data record will meet any other records with enough training steps. Therefore, the candidates of negative links generated by batched runtime sampling approximate those sampled over the whole training dataset.
Uniform sampling
We provide a basic uniform negative sampling method in our sampling suite. The uniform negative sampling method is designed based on the basic assumption that every node plays an equal role in the graph. Therefore, the uniform negative sampling method treats every node equally without any bias, and samples about the same number of negative neighbors from each node to form negative links.
The uniform negative sampling method takes a batch of records as input and output a set of negative links according to root nodes and observed edges within the training batch. Given a root node from a certain record, the uniform sampling method works as follows:

Randomly select a root node from the training batch.

Pick the link as a negative link if it can not be observed in the edge set of the batch, and drop otherwise.

Repeat first two steps until getting negative links or reaching predefined times.
It’s worth noting that, even without the shuffling strategy, the uniform negative sampling method is also able to provide a diversity of negative examples for a certain node, as negative links are randomly constructed. The uniform negative sampling method is just a basic solution for runtime sampling and it can not approximate the real distribution in some realworld problems, as the importance of a node may vary due to different roles it played in the realworld applications. It is a good choice to apply this basic sampling method to applications where we don’t have any prior knowledge, and even statistic informations like indegree, outdegree, are not available or beyond belief.
Dynamic sampling
Moreover, we also propose a dynamic sampling method based on the degree of each node to approximate the distribution of the whole observed graph . It is a common sense that the larger the degree of a node is, the more likely we will observe a link connected to it. Therefore, we form an inverse relationship between node degree and the number of negative links associated with the node :
(6) 
where is the number of negative edges to be sampled, represent the degree of node , and is a ratio hyperparameter to adjust the value of . In case that is extremely large or small, we add two extra parameters , to (6):
(7) 
where is the max negative number to be sampled for a node. is a threshold to adjust the real sampling number. (7) ensures that the number of negative links for a certain node should be in the range of .
Different from the uniform sampling method described above, our dynamic sampling method adjusts the number of negative edges associated with a certain node dynamically according to (7). It worth noting that, from a statistic perspective, node degree is equal to the frequency of positive links connected to the node, which naturally reflects the real distribution of observed links. Therefore, examples sampled by our dynamic sampling method will approximately obey the same distribution of links in the observed graph , as the sampling procedure takes the node degree into account.
IvD Offline Inference
The third challenge is that, as the total number of links to be predicted may reach a quite large scale, as many as billions or even trillions, it may take tens or hundreds of hours to perform the inference over an industrialscale dataset, which is usually beyond tolerance for industry applications. To solve this problem, we propose a modelsplit strategy to accelerate the inference procedure.
The basic idea of the modelsplit strategy is to reduce the computational complexity of the inference procedure. Given a graph with nodes, the upper bound of the number of links is , while for nodes, which means that we should perform the forward propagation procedure of our model as many as times. However, there are many unnecessary computations, as the same node may appear in different edges and its embedding is computed repeatedly in different input records, which leads to a high time cost. Based on this observation, we split the inference procedure into two parts: node embedding representation procedure and prediction procedure. As shown in Fig. 3, The modelsplit strategy mainly consists of four steps:

Split parameters of well trained model into two parts: parameters related to embedding representation procedure and predictionrelated parameters.

Compute node embeddings with parameters related to embedding representation procedure.

Pack node embeddings into the edge format with reference to node ids from edge candidates.

Compute prediction scores with predictionrelated parameters.
Note that if we use ranking loss during training procedure, there’s no predictionrelated parameters, as we directly use the result of dot product as the prediction score. If we use the alternative classification loss, the prediction score is computed by a neural network and all the four steps should work in the inference procedure.
Compared with the original inference procedure, the model split strategy reduce the upper bound of the computational complexity in embedding representation procedure from to . As embedding representation is the most time consuming part of our model, the modelsplit strategy is efficient to speed up the inference procedure.
V Experimets
In this section, we present the experimental results of our approaches on some public datasets together with some realworld applications at Alipay.
Va Datasets
The datasets we used mainly consist of two parts: public datasets and some realworld industrialscale data at Alipay.
Public Datasets

USAir [33]: a network of US air transportation system with 332 nodes and 2126 edges. Each edge represents an airline from source city to destination.

NS [34]: a coauthorship network of researchers in network science field with 1589 nodes and 2742 edges.

PB [35]: a network of the US political blogs with 1222 nodes and 16714 edges.

Yeast [36]: a network of molecular interactions in cells of Yeast with 2375 nodes and 11693 edges.

C.ele [37]: a neural network of C. elegans with 297 nodes and 2,148 edges.

Power [37]: an electrical power grid of the western US with 4941 nodes and 6594 edges.

Router [38]: a routerlevel Internet collected by Rocketfuel Project with 5022 nodes and 6253 edges.
Details about those dataset are presented in Table I. means the average node degree of a dataset and implies the sparsity of related network.
Indices  Vertices  Edges  node_degree 

USAir  332  2126  12.81 
NS  1589  2742  3.45 
PB  1222  16714  27.36 
Yeast  2375  11693  9.85 
C.ele  297  2148  14.46 
Power  4941  6594  2.67 
Router  5022  6258  2.49 
Realworld Dataset At Alipay
We apply our approaches to realworld problems at Alipay, the world’s largest mobile payment platform. Datasets we used can be concluded as follows:

Trade Network. Trade network describes transactions between different users (i.e., customers and merchants). In the trade network, each user is treated as a node and we add an edge to the network if a deal happens between two users. We preprocess the data by deleting isolated nodes which are useless in propagating information through the topology [6]. The preprocess data contains about 376 million nodes and 4.48 billion edges. Moreover, a set of attributes is used to describe users and edges. Each node is associated with 3001dimensional features while each edge contains 2dimensional features. A few nodes in trade network are labeled to identify a special role in transaction.

Friendship Network. Friendship network describes the friendship between different individuals on the Internet. After preprocessing the data as stated above, the friendship network contains about 638 million nodes each with 446dimensional features and 10.4 billion edges with 28dimensional features. As time goes on, two individuals may build a friendship and the dataset records the start time of a friendship with a timestamp, which is used to separate the whole dataset into train and test sets.
Indices  Trade Network  Friendship Network 

#Nodes  
#Edges  
#Node feature  3001  446 
#Edge feature  2  28 
Task type(test)  node classification  link prediction 
label type(test)  node label  edge label 
#Train set(link)  
#Test set (link)  
#Train set (Node)  
#Test set (Node) 
To prove the effectiveness of our approaches, which jointly learns link prediction and node embedding, we preform a node classification task on Trade Network and a link prediction task on Friendship Network.
For the node classification task on Trade Network, we first train a link prediction model and then feed node embeddings learned by the model to downstream node classification task with provided node labels. We present details about train and test set in Table II. About training records are used to train the link prediction model while labeled node are used to train the classifier. It’s worth noting that, extra node labels are only used to train the classifier and play no role in node embedding representation procedure.
On the other hand, for the link prediction task on Friendship Network, we first use a timestamp to split the graph into two part: links built before the timestamp as observed links while links after the timestamp as the positive test set. As for the negative links in test set, we randomly sampled unobserved links according to the graph. The total training set contains about records, while the test set contains records with about equal number of positive and negative links. It’s worth noting that all the training records are built based on observed links, and we use the runtime negative sampling method to generate negative samples during training time. Therefore, there is no need to prepare negative samples for training set.
Algorithm  USAir  NS  PB  Yeast  C.ele  Power  Router  Mean 

CN  0.940  0.938  0.919  0.891  0.848  0.590  0.561  0.812 
Jaccard  0.903  0.938  0.873  0.890  0.792  0.590  0.561  0.792 
AA  0.950  0.938  0.922  0.891  0.864  0.590  0.561  0.817 
RA  0.956  0.938  0.923  0.892  0.868  0.590  0.561  0.818 
N2V  0.914  0.915  0.858  0.937  0.841  0.762  0.655  0.840 
WLMN  0.961  0.981  0.939  0.951  0.854  0.874  0.915  0.925 
SEAL  0.966  0.989  0.947  0.979  0.903  0.876  0.964  0.946 
Ours  0.982  0.992  0.923  0.974  0.885  0.921  0.970  0.950 

VB Experimental Settings
We evaluate the performance of our model mainly on two tasks: the link prediction task and the node classification task. Experimental Settings are designed as follows:

Link prediction task. We first compare our model with some popular methods in academic, including : Common Neighbors (CA) [15]
, Jaccard index (Jaccard)
[39], AdamicAdar (AA) [16], resource allocation (RA) [17], node2vec [12], WeisfeilerLehman Neural Machine (WLNM) [27], SEAL [2]. The first four methods are heuristic methods while the last three represent latest learnbased methods. We report results of our methods together with those methods on seven public dataset introduced in section VA. Furthermore, we conduct a set of experiments on Friendship dataset with node2vec [12] as the baseline, which, to our best knowledge, is a public available method that is able handle such large scale graph. 
Node classification task. As our model jointly learn link prediction and node embedding, we apply the learned embedding to node classification task on Trade dataset. Also, we use one of the most famous unsupervised embedding representation method node2vec[12]
as the baseline. Gradient Boosting Decision Tree
[40] (GBDT) is adopted as the classifier which is widely used in industry.
We report the AUC measure on all those experiments, because it is not influenced by the distribution of the classes compared with 01 accuracy. Moreover, we use Geniepath [6] as the default embedding representation method and embedding size of our model is set to 64 in all of those experiments. When performing the link prediction task, innerproduct is used as the metric function of our model.
VC Results
Following the evaluation protocol stated above, we details results on public datasets and readworld industrialscale datasets in terms of performance in this section.
Link prediction task
We report results in terms of link prediction task in Table III and Table IV. In table III, we compare our model with some popular methods in academic on seven public dataset. Underline is used to indicate the top2 results and Bold represents the best results. Meanwhile, we illustrate results on realworld industrialscale dataset in Table IV.
As shown in Table III, our methods consistently outperforms stated heuristic methods like Common Neighbors (CA) [15], Jaccard index (Jaccard) [39], AdamicAdar (AA) [16], resource allocation (RA) [17] along all seven public dataset. On Router dataset, our method even achieve an AUC of 0.970 while those heuristic methods only get 0.561. And the mean AUC across the seven dataset of our method is 0.950 while the best mean AUC of those four heuristic methods is 0.818. In all, our method achieves a significant improvement in terms of AUC compared with those heuristic methods.
Our methods also show advantages over some latest learningbased methods such as node2vec (N2V) [12], WLMN [27], and SEAL [2]. Compared with the state of the art method, SEAL, our model win four out of seven dataset in terms of AUC and the performance on the left three dataset is also about the same level with that of SEAL. Results shown in Table III demonstrate that our model achieves competitive performance compared with those latest methods in academic on small public datasets.
Moreover, results on largescale dataset also demonstrate the effectiveness of our model. Note that, when applied to largescale dataset, prior methods may fail to work due to the high cost of memory. As a result, we mainly take node2vec as the baseline, which, to our best knowledge, is a public available method to handle such large graph and also widely used in industrial applications. As presented in Table IV, our model outperforms node2vec by about (0.842 VS. 0.738). As node2vec only embed the structure information without referring to attribute information, we also show the result based on node2vec embedding with raw features for fairness. Our model also achieves significant improvement compared with node2vec + raw features(0.842 VS. 0.769).
Node classification task
Results in terms of node classification task on Trade dataset are presented in Table IV. As our model learns node embeddings from such large scale dataset without referring to any node labels, we also use node2vec as the baseline.
The node embedding of our method is more distinctive compared with that of node2vec as shown in Table IV. As the node labels in trade dataset represent a special role in transactions, which may mainly be influenced by attributes of a node, embeddings with structural information only may not works well. Therefore, node2vec achieves only 0.569 in terms of AUC on Trade dataset, which is far less than that of our model. Furthermore, our model outperforms node2vec significantly (0.757 VS. 0.658), even node2vec is combined with raw features.
Name  Trade (Node)  Friendship (Link) 

raw feature  0.639  0.702 
node2vec  0.569  0.738 
node2vec + raw feature  0.658  0.769 
Ours  0.757  0.842 
VD Analysis
In this section, we try to analysis and explain the effect of different strategies used in our model by varying the variable we will analysis while fix other variables which is described as follows:
effectiveness
We analysis the influence of shuffling strategy and sampling strategy to show the effectiveness of those strategies. Results of all four combinations across those two strategies are illustrated in Fig. 4
, in which “Uniform” and “Dynamic” refer to model with uniform sampling and dynamic sampling respectively, while suffix “without shuffling” indicates model trained without shuffling strategy. The horizontal axis means the training epoches while vertical axis means losses.
As illustrated in Fig. 4, both shuffling strategy and dynamic strategy are proved to be effective to help minimize the loss to a better place. By fixing the sampling strategy, we can conclude the effect of the shuffling strategy: when enabling the shuffling strategy, both loss curves of models with uniform sampling and dynamic sampling decrease to a lower place compared with models without shuffling strategy. On the other hand, by fixing shuffling shuffling strategy, models with dynamic sampling strategy achieve better result than models with uniform sampling. Moreover, model with dynamic negative sampling and shuffling strategies achieve best result across the four models.
The result described above may be due to the reason that, by shuffling, a certain record is able to meet any other records across the whole training set, which is equal to train the model on the whole dataset rather than within a minbatch. Moreover, by adopting dynamic negative sampling, the distribution of positive and negative links in training set approximates that over the whole observed graph, which helps construct more reasonable number of positive and negative samples.
Scalability
For link prediction tasks on industrialscale dataset, scalability is an import dimension that we should take into account. To prove the scalability of our framework, we design a set of link prediction experiments by varying worker numbers during training process.
As presented in Fig. 5 and Fig. 6, we find that, with more worker numbers, the training process will speed up and converge to about the same level or just slightly worse than model with less workers. It’s reasonable that with more workers, the performance decrease slightly as we adopt asynchronous optimization strategy during training process. We think it’s acceptable when applied our model to realworld tasks with large scale data by trading off time cost and performance.
Offline inference efficiency
By adopting modelsplit strategy, the offline inference process is quite efficient compared with original inference process with Tensorflow
[41]. It’s takes about 1520 hours for inferring the whole Friendship dataset with original inference process, while with modelsplit strategy, it only takes about 5 hours. It’s worth noting that, the more dense the graph is, the better efficiency we will get with modelsplit strategy.Vi Conclusion
In this work, we propose a scalable framework for link prediction at Alipay. Based on subgraphs, our model is potential to jointly learn link prediction and node embedding on graphs with billions of nodes and edges without using any extra label information on nodes or edges. Benefit from the shuffling and sampling strategies, our model is able to approximate distributions of positive and negative links over the whole training dataset. Moreover, the proposed modelsplit strategy helps accelerate the speed of inference process significantly. Experiments show that our approach works consistently well on various realworld problems. In future, we are interested in applying our framework on heterogeneous networks based on metapath negative sampling methods.
Vii Acknowledgment
We would like to thank our colleagues from Alipay, MYbank and the infrastructure team of Ant Financial, who have offered great help in the devolopment and applicaiton of DSSLP. Furthermore, we like to thank Yang Shuang and Yuan Qi for their unwavering support of this project.
References
 [1] Hamilton W L, Ying R, Leskovec J. Representation learning on graphs: Methods and applications[J]. arXiv preprint arXiv:1709.05584, 2017.
 [2] Zhang M, Chen Y. Link prediction based on graph neural networks[C]//Advances in Neural Information Processing Systems. 2018: 51655175.
 [3] Kipf T N, Welling M. Semisupervised classification with graph convolutional networks[J]. arXiv preprint arXiv:1609.02907, 2016.
 [4] Hamilton W, Ying Z, Leskovec J. Inductive representation learning on large graphs[C]//Advances in Neural Information Processing Systems. 2017: 10241034.
 [5] Veličković P, Cucurull G, Casanova A, et al. Graph attention networks[J]. arXiv preprint arXiv:1710.10903, 2017.

[6]
Liu Z, Chen C, Li L, et al. Geniepath: Graph neural networks with adaptive receptive paths[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 44244431.
 [7] Kipf T N, Welling M. Variational graph autoencoders[J]. arXiv preprint arXiv:1611.07308, 2016.
 [8] Leicht E A, Holme P, Newman M E J. Vertex similarity in networks[J]. Physical Review E, 2006, 73(2): 026120.
 [9] Sun D, Zhou T, Liu J G, et al. Information filtering based on transferring similarity[J]. Physical Review E, 2009, 80(1): 017101.
 [10] Clauset A, Moore C, Newman M E J. Hierarchical structure and the prediction of missing links in networks[J]. Nature, 2008, 453(7191): 98.
 [11] Pirotte A, Renders J M, Saerens M. Randomwalk computation of similarities between nodes of a graph with application to collaborative recommendation[J]. IEEE Transactions on Knowledge and Data Engineering, 2007 (3): 355369.
 [12] Grover A, Leskovec J. node2vec: Scalable feature learning for networks[C]//Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2016: 855864.
 [13] Liben‐Nowell D, Kleinberg J. The link‐prediction problem for social networks[J]. Journal of the American society for information science and technology, 2007, 58(7): 10191031.
 [14] Liu W, Lü L. Link prediction based on local random walk[J]. EPL (Europhysics Letters), 2010, 89(5): 58007.
 [15] Newman M E J. Clustering and preferential attachment in growing networks[J]. Physical review E, 2001, 64(2): 025102.
 [16] Adamic L A, Adar E. Friends and neighbors on the web[J]. Social networks, 2003, 25(3): 211230.
 [17] Zhou T, Lü L, Zhang Y C. Predicting missing links via local information[J]. The European Physical Journal B, 2009, 71(4): 623630.
 [18] Zhang G Q, Wang D, Li G J. Enhancing the transmission efficiency by edge deletion in scalefree networks[J]. Physical Review E, 2007, 76(1): 017101.
 [19] Katz L. A new status index derived from sociometric analysis[J]. Psychometrika, 1953, 18(1): 3943.
 [20] Miller K, Jordan M I, Griffiths T L. Nonparametric latent feature models for link prediction[C]//Advances in neural information processing systems. 2009: 12761284.
 [21] Menon A K, Elkan C. Link prediction via matrix factorization[C]//Joint european conference on machine learning and knowledge discovery in databases. Springer, Berlin, Heidelberg, 2011: 437452.
 [22] Perozzi B, AlRfou R, Skiena S. Deepwalk: Online learning of social representations[C]//Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014: 701710.
 [23] Cai H, Zheng V W, Chang K C C. A comprehensive survey of graph embedding: Problems, techniques, and applications[J]. IEEE Transactions on Knowledge and Data Engineering, 2018, 30(9): 16161637.
 [24] Goyal P, Ferrara E. Graph embedding techniques, applications, and performance: A survey[J]. KnowledgeBased Systems, 2018, 151: 7894.
 [25] Kovács I A, Luck K, Spirohn K, et al. Networkbased prediction of protein interactions[J]. Nature communications, 2019, 10(1): 1240.
 [26] Barbieri N, Bonchi F, Manco G. Who to follow and why: link prediction with explanations[C]//Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014: 12661275.
 [27] Zhang M, Chen Y. WeisfeilerLehman neural machine for link prediction[C]//Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017: 575583.
 [28] Zhao H, Du L, Buntine W. Leveraging node attributes for incomplete relational data[C]//Proceedings of the 34th International Conference on Machine LearningVolume 70. JMLR. org, 2017: 40724081.
 [29] Nickel M, Jiang X, Tresp V. Reducing the rank in relational factorization models by including observable patterns[C]//Advances in Neural Information Processing Systems. 2014: 11791187.
 [30] Li M, Zhou L, Yang Z, et al. Parameter server for distributed machine learning[C]//Big Learning NIPS Workshop. 2013, 6: 2.

[31]
Cheng D, Gong Y, Zhou S, et al. Person reidentification by multichannel partsbased cnn with improved triplet loss function[C]//Proceedings of the iEEE conference on computer vision and pattern recognition. 2016: 13351344.
 [32] Chen W, Chen X, Zhang J, et al. Beyond triplet loss: a deep quadruplet network for person reidentification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 403412.
 [33] Vladimir Batagelj and Andrej Mrvar. http://vlado.fmf.unilj.si/pub/networks/data/, 2006.

[34]
Newman M E J. Finding community structure in networks using the eigenvectors of matrices[J]. Physical review E, 2006, 74(3): 036104.
 [35] Ackland R. Mapping the US political blogosphere: Are conservative bloggers more prominent?[C]//BlogTalk Downunder 2005 Conference, Sydney. BlogTalk Downunder 2005 Conference, Sydney, 2005.
 [36] Von Mering C, Krause R, Snel B, et al. Comparative assessment of largescale data sets of protein–protein interactions[J]. Nature, 2002, 417(6887): 399.
 [37] Watts D J, Strogatz S H. Collective dynamics of ‘smallworld’networks[J]. nature, 1998, 393(6684): 440.
 [38] Spring N, Mahajan R, Wetherall D. Measuring ISP topologies with Rocketfuel[C]//ACM SIGCOMM Computer Communication Review. ACM, 2002, 32(4): 133145.
 [39] Chowdhury G G. Introduction to modern information retrieval[M]. Facet publishing, 2010.

[40]
Chen T, Guestrin C. Xgboost: A scalable tree boosting system[C]//Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. ACM, 2016: 785794.
 [41] Abadi M, Agarwal A, Barham P, et al. Tensorflow: Largescale machine learning on heterogeneous distributed systems[J]. arXiv preprint arXiv:1603.04467, 2016.
 [42] Koller D, Friedman N. Probabilistic graphical models: principles and techniques[M]. MIT press, 2009.
 [43] Liben‐Nowell D, Kleinberg J. The link‐prediction problem for social networks[J]. Journal of the American society for information science and technology, 2007, 58(7): 10191031.
 [44] Kipf T N, Welling M. Variational graph autoencoders[J]. arXiv preprint arXiv:1611.07308, 2016.
 [45] Zhou Y, Cheng H, Yu J X. Graph clustering based on structural/attribute similarities[J]. Proceedings of the VLDB Endowment, 2009, 2(1): 718729.

[46]
Lerer A, Wu L, Shen J, Lacroix T, Wehrstedt L, Bose A, Peysakhovich A. PyTorchBigGraph: A Largescale Graph Embedding System[C]// SysML. 2019
Comments
There are no comments yet.