1 Introduction
Link prediction is a fundamental task in graph data analytics [LibenNowellK07]. Many real-world applications can benefit from link prediction, such as recommendation and knowledge graph completion [bhagavatula2018content, kazemi2018simple, xu2019link]. Graph embedding, which represents the nodes of a graph by low-dimensional vectors, has proven to be an effective approach for link prediction on attributed graphs (e.g., [kipf2017semi, bojchevski2018deep]).
Generally, there are two types of link prediction, i.e., transductive and inductive, as shown in Figure 1. Most existing (attributed) graph embedding approaches focus on transductive link prediction, where both nodes are already in the given graph and can be seen during the training process, such as GCN [kipf2017semi], GAE [kipf2016variational], and SEAL [zhang2018link]. However, many real-world applications need inductive link prediction, which requires embeddings to be quickly generated for new nodes with only attribute information (e.g., a new user in a recommender system). SDNE [wang2016structural] and GraphSage [hamilton2017inductive] can compute embeddings for new nodes, but they require the edges of those nodes. G2G [bojchevski2018deep] can perform inductive link prediction for unseen nodes without local structures. However, it cannot distinguish nodes with similar attributes well, because the model does not adequately capture the structure information in the node representations.
In this work, we propose a novel graph embedding model called Dual-Encoder graph embedding with ALignment (Deal) for inductive link prediction of new nodes with only attribute information. We aim to learn the connections between the nodes’ attributes and the graph structure through this model. The model embeds the graph nodes into the vector space, and it can compute an embedding vector for a new query node from its attributes alone; this vector is then compared to another node’s embedding for link prediction.
As shown in Figure 2, Deal has three components: an attribute-oriented encoder, a structure-oriented encoder, and an alignment mechanism. The attribute-oriented encoder maps the nodes’ attributes to embeddings in the vector space. Given two linked attributed nodes, the similarity of their embedding vectors computed by this encoder is high. It is also the encoder used to compute the embedding vectors of new nodes from their attributes. The structure-oriented encoder computes node embeddings that preserve the structure information. Given two linked nodes, the similarity of their embedding vectors computed by the structure-oriented encoder is high. The alignment mechanism aligns the two types of embeddings to build the connections between the attributes and the graph structure. The two encoders are updated throughout the training process so that the embeddings they produce are aligned. In addition, we use a novel ranking-motivated loss that can be regularized by hyperparameters, which is not considered in existing ranking-loss-based graph embedding models. Although Deal focuses on inductive link prediction with only attribute information, it can also perform transductive link prediction using both structural and attribute information. The main contributions of our approach are summarized as follows:

We design a model Deal for inductive link prediction of the new nodes with only attribute information given.

The proposed alignment mechanism builds the connections between attributes and graph structure, and improves the representation ability of node embeddings.

The experimental results on several real-world datasets show that our proposed model consistently outperforms the state-of-the-art models on both inductive (with at least 6% AP score improvement) and transductive link prediction.
2 Related Works
The early studies on link prediction usually make strong assumptions on when links may exist, and they are mainly based on different similarity measures, such as the number of common neighbours, the Jaccard coefficient, and Adamic-Adar, to measure the node proximity [LibenNowellK07, zhou2009predicting, ZhaoAH16]. They assume that the probability that a link exists between two nodes increases with their similarity. However, such assumptions may not hold in some real-world networks such as protein-protein interaction (PPI) networks, because two proteins sharing many common neighbours are actually less likely to interact [kovacs2019network].
Recently, researchers have shown an increasing interest in solving the link prediction problem via graph embedding methods [zhang2018link]. Graph embedding has been used widely [gao2018deep, gao2019progan, qiutemporal, 8970926], and it aims to map nodes to low-dimensional vectors. Some graph embedding methods only capture the structural information of the graph, such as the random-walk-based approaches DeepWalk [perozzi2014deepwalk] and node2vec [grover2016node2vec], which adapt the Skip-Gram model and treat each generated path of nodes as a sequence of words, and SDNE [wang2016structural], which learns node embeddings that preserve both local and global structures. They do not leverage the attribute information and thus cannot exploit node attributes for link prediction in attributed graphs.
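The neighbourhood-based similarity measures mentioned at the start of this section can be made concrete with a short sketch. The toy graph and the function name `neighbour_scores` are illustrative assumptions and are not taken from the paper; the formulas are the standard definitions of the three heuristics.

```python
import math

def neighbour_scores(adj, u, v):
    """Return (common-neighbour count, Jaccard coefficient, Adamic-Adar score)."""
    nu, nv = adj[u], adj[v]
    common = nu & nv
    union = nu | nv
    jaccard = len(common) / len(union) if union else 0.0
    # Adamic-Adar down-weights common neighbours that themselves have many neighbours.
    adamic_adar = sum(1.0 / math.log(len(adj[w])) for w in common if len(adj[w]) > 1)
    return len(common), jaccard, adamic_adar

# Toy undirected graph as an adjacency dictionary (illustrative data only).
toy_adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}

cn, jc, aa = neighbour_scores(toy_adj, "a", "b")
print(cn, round(jc, 3), round(aa, 3))
```

All three scores grow with the overlap of the two neighbourhoods, which is exactly the assumption that the PPI example calls into question.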
For link prediction in attributed graphs, most existing studies focus on transductive link prediction, where both nodes are already in the graph. For example, GCN [kipf2017semi] requires that the full graph Laplacian is known during training; GAE [kipf2016variational] learns the node embeddings with a GCN encoder and an inner product decoder. SEAL [zhang2018link] is another GCN-based framework that solves the link prediction problem using local subgraphs. To the best of our knowledge, only G2G [bojchevski2018deep] is able to perform inductive link prediction with only attribute information. It relies on a deep encoder that embeds each node as a Gaussian distribution. However, the output of its encoder is primarily based on the nodes’ attributes, and thus it cannot distinguish nodes with similar attributes well. G2G embeds these nodes close together in the vector space, because their structure information is not fully utilized.
In our model Deal, the alignment mechanism aligns the attribute embedding and the structure embedding, so that the learned node representations capture both attribute and structure information. As a result, Deal can achieve better link prediction performance.
3 Problem Statement
We study the problem on an attributed graph that consists of a node set, an edge set, and an attribute feature matrix, where each node has an attribute feature vector whose dimension is the number of attributes in the graph.
Given a graph and a pair of nodes, the link prediction task aims to predict whether a link exists between the two nodes. In transductive link prediction, both nodes are already in the graph, and thus they can be seen during the training process. In this work, we focus on inductive link prediction, where either or both of the nodes are not seen during the training process. During prediction, only the attribute information of the new nodes is available, i.e., their local structures are unknown.
We propose to utilize the node embedding approach, which embeds the graph nodes into the vector space, to solve this problem. When performing prediction for a new node, the model computes the node’s embedding from its attribute feature vector and then compares this embedding with another node’s embedding to predict their relationship.
4 Our Model Deal
4.1 Overview
Figure 2 illustrates our model Deal, which consists of three components: two node embedding encoders and one alignment mechanism. We aim to predict the links for new nodes with only attribute information. This requires us to build the connections between the attributes and the links between nodes, which means that we need to compute an embedding vector for a given set of attributes. Our attribute-oriented encoder performs this task. It maps the nodes’ attributes to embeddings in the vector space. Given two linked nodes with their attributes, the similarity of their embedding vectors computed by this encoder is high.
However, when the nodes’ attributes are uninformative or too similar, such a single encoder is not sufficient to output embeddings that distinguish nodes with similar attributes, because they are all embedded close together in the vector space. To remedy this issue, we use a second, structure-oriented encoder to compute node embeddings that preserve the graph structure information. As a result, given two linked nodes, the similarity of their embedding vectors computed by the structure-oriented encoder is high. Next, we propose an alignment mechanism to align the two types of embeddings. During the training process, the two encoders are continually updated in order to produce node embeddings that are aligned in the vector space. Finally, the connections between the attributes and the links are captured by the two encoders, yielding better embedding vectors for link prediction.
4.2 Attribute-oriented Encoder
The attribute-oriented encoder takes the attribute vector of a node as the input and outputs a node embedding.
There are many neural network choices for learning this encoder, and in this paper we choose the multilayer perceptron (MLP) with a nonlinear activation layer [Goodfellowetal2016] as follows:
(1)
with trainable parameters and the exponential linear unit (ELU) [clevert2015fast] as the nonlinear activation. Note that the number of layers of this encoder varies across datasets. We do not use GCN in the attribute-oriented encoder, because we observe that aggregating too much information from a node’s neighbours may affect the representation ability of the node’s attributes for the link prediction task in attributed graphs. As shown in Section 5.5, we tried GCN-like encoders to obtain node embeddings from attributes, and the performance is not as good as with the MLP. In addition, without aggregating the information from neighbours, the training process can be sped up significantly.
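As an illustration of an MLP encoder with ELU activation, the following is a minimal NumPy sketch of the forward pass. The two-layer depth, the layer sizes, the initialization, and all names (`init_mlp`, `attribute_encoder`) are assumptions made for the example; the paper states only that the number of layers varies with the dataset.

```python
import numpy as np

def elu(x, alpha=1.0):
    """Exponential linear unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def init_mlp(n_attributes, hidden_size, embed_size, rng):
    """Assumed two-layer parameterization (weights and biases)."""
    return {
        "W1": rng.normal(0.0, 0.1, (n_attributes, hidden_size)),
        "b1": np.zeros(hidden_size),
        "W2": rng.normal(0.0, 0.1, (hidden_size, embed_size)),
        "b2": np.zeros(embed_size),
    }

def attribute_encoder(X, params):
    """Map attribute vectors (one per row of X) to node embeddings."""
    hidden = elu(X @ params["W1"] + params["b1"])
    return hidden @ params["W2"] + params["b2"]

rng = np.random.default_rng(0)
params = init_mlp(n_attributes=5, hidden_size=8, embed_size=4, rng=rng)
X_new = rng.random((3, 5))          # attributes of three unseen nodes
Z_new = attribute_encoder(X_new, params)
print(Z_new.shape)
```

Because the encoder reads only a node’s own attributes, it can embed a node that has no known edges, which is what the inductive setting requires.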
4.3 Structure-oriented Encoder
The structure-oriented encoder aims to generate node embeddings that preserve the structural information of the graph without considering the attributes. We expect the vectors of two linked nodes computed by this encoder to have high similarity. To achieve this, we use the one-hot encoding of the nodes (which can be regarded as the nodes’ identifiers [you2019position]) as the input of the encoder. The encoder can thus be seen as a function that maps each node to its node embedding vector. From the perspective of Occam’s razor, we adopt a linear model as the encoder:
(2)
where weight normalization [salimans2016weight] is employed to reparameterize the encoder’s parameter. In addition, weight normalization is able to accelerate the convergence of stochastic gradient descent optimization. By minimizing the ranking-motivated loss, which will be presented in detail in Section 4.4, the final embedding is able to capture the structural information.
Note that the structure-oriented encoder in Deal can be replaced by another node embedding method that focuses on learning graph structure from a corresponding input, such as GCN with the adjacency matrix as the input. However, we find that using GCN as this encoder decreases the link prediction performance, as shown in Tables 2 and 3 in Section 5.
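A linear encoder on one-hot inputs reduces to an embedding lookup, which the following sketch makes explicit. It applies weight normalization per output unit (each weight row is rescaled to a learned length), which is the standard form of that reparameterization; whether Deal normalizes at this granularity is an assumption, as are the names `weight_norm` and `structure_encoder`.

```python
import numpy as np

def weight_norm(V, g):
    """Reparameterize each row v_k of V as g_k * v_k / ||v_k||."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return g[:, None] * V / norms

def structure_encoder(node_ids, V, g):
    """Linear map applied to one-hot node identifiers."""
    W = weight_norm(V, g)            # shape (embed_size, n_nodes)
    # Multiplying W by a one-hot vector selects one column, so index directly.
    return W[:, node_ids].T          # shape (len(node_ids), embed_size)

rng = np.random.default_rng(0)
n_nodes, embed_size = 6, 4
V = rng.normal(size=(embed_size, n_nodes))   # direction parameters
g = np.ones(embed_size)                      # length parameters
Z = structure_encoder(np.array([0, 3]), V, g)
print(Z.shape)
```

The lookup form also shows why this encoder cannot serve unseen nodes: a new node has no column in the weight matrix.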
4.4 Alignment Mechanism and Model Training
We propose an alignment mechanism to align the embeddings generated by the two types of encoders to learn the connections between the node attributes and the graph structure. During the model training, the two encoders are continually updated so that they produce embeddings that are aligned in the vector space. We first introduce the loss functions we use, then describe the proposed alignment mechanism, and finally present the training algorithm for Deal.
Loss Function.
Learning graph embedding via ranking is based on the ranking-motivated loss, which has proved effective in many studies [bojchevski2018deep, bhagavatula2018content]. In this paper, we propose a novel mini-batch learning method with a personalized ranking-motivated loss to learn node embeddings with comprehensive representation performance.
Optimizing a ranking-motivated loss can help to capture the relationships between each pair of training samples. Contrastive loss [hadsell2006dimensionality] is one kind of ranking-motivated loss, and it was originally proposed to solve the dimensionality reduction problem. Given pairs of samples, the loss is as follows:
(3)
where the loss is built from a distance function between the two samples of a pair, a binary label that indicates whether the two samples are deemed similar, the hinge function, and a margin. The learning objective of Eq. 3 is to map similar input samples to nearby points in the output vector space and to map dissimilar samples to distant points.
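For readers who want the contrastive loss spelled out, here is a minimal sketch of its standard form, written up to constant factors. The Euclidean distance, the label convention (1 for similar pairs, 0 for dissimilar pairs), and the margin value are assumptions of this example.

```python
import numpy as np

def contrastive_loss(a, b, y, margin=1.0):
    """Sum over pairs of y * d^2 + (1 - y) * max(0, margin - d)^2."""
    d = np.linalg.norm(a - b, axis=1)                    # Euclidean distance per pair
    pull = y * d ** 2                                    # similar pairs: shrink distance
    push = (1 - y) * np.maximum(0.0, margin - d) ** 2    # dissimilar pairs inside the margin
    return float(np.sum(pull + push))

a = np.zeros((3, 2))
b = np.array([[0.3, 0.4], [0.3, 0.4], [3.0, 4.0]])
y = np.array([1, 0, 0])   # first pair similar, the other two dissimilar
loss = contrastive_loss(a, b, y, margin=1.0)
print(loss)
```

The third pair contributes nothing because its distance already exceeds the margin. This illustrates the fixed-margin behaviour: every dissimilar pair is judged against the same threshold, regardless of how far apart the two samples should be.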
Our problem has a similar learning objective: to map linked nodes (positive samples) in the graph to points that are close in the output vector space, and to map unlinked nodes (negative samples) to points that are far away from each other. However, directly adapting contrastive loss to link prediction is problematic. The negative pairwise samples have different distances in the graph, so using a fixed margin for all the negative samples in Eq. 3 is not appropriate, yet it is difficult to set proper margins for negative samples with different distances. Moreover, neither the existing work [bojchevski2018deep, bhagavatula2018content] nor Eq. 3 considers regularization in the loss function, which is important and can further improve the prediction performance. Motivated by the above observations, we propose the following loss to be optimized for a given mini-batch of node pair samples, obtained by sampling pairs of nodes from the graph:
(4) 
where the loss uses a function that measures the similarity between the embeddings of the two nodes in a pair, a link relation label that indicates whether the two nodes are connected, a weight function, and two loss functions derived from the same function with different hyperparameters. Specifically, there are many choices for the similarity function, such as the dot product and cosine similarity. A high similarity score indicates that the nodes are similar, and vice versa. We find that cosine similarity gives good results in our model.
We use the two derived loss functions because regularization is not considered in Eq. 3. Inspired by the observation that the logistic loss can be seen as a “soft” version of the hinge loss with an infinite margin [lecun2006tutorial] and is differentiable, we adopt the generalized logistic loss function as follows:
(5) 
where the two parameters of the function are loss margin parameters that can tune the regularization [masnadi2015view].
The weight function measures the importance of negative samples with different distances. Specifically, we define it as follows:
(6) 
where the weight is controlled by a hyperparameter and depends on the shortest path distance between the node pair. Pairs in which one node cannot reach the other are treated as a special case. This weight helps the model pay more attention to close negative neighbours during the training process.
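The ingredients described above (a cosine similarity score, a soft logistic-style loss with margin parameters, and a distance-dependent weight on negative pairs) can be combined as in the sketch below. The exact forms of Eqs. 4 to 6 are not reproduced here: the softplus-style loss in `soft_margin`, the weight `exp(beta / distance)` in `negative_weight`, the mean reduction, and every parameter name and default value are assumptions chosen only to match the qualitative description.

```python
import numpy as np

def cosine_similarity(A, B):
    """Row-wise cosine similarity between two sets of embeddings."""
    num = np.sum(A * B, axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return num / den

def soft_margin(x, gamma, b):
    """Assumed generalized logistic form: log(1 + exp(-gamma * x + b)) / gamma."""
    return np.logaddexp(0.0, -gamma * x + b) / gamma

def negative_weight(sp_dist, beta):
    """Assumed weight that decays with shortest-path distance; inf maps to 1."""
    return np.exp(beta / sp_dist)

def ranking_loss(Zp, Zq, labels, sp_dist, gamma=2.0, b_pos=0.0, b_neg=0.0, beta=1.0):
    s = cosine_similarity(Zp, Zq)
    pos = labels * soft_margin(s, gamma, b_pos)     # linked pairs: reward high similarity
    neg = (1 - labels) * negative_weight(sp_dist, beta) * soft_margin(-s, gamma, b_neg)
    return float(np.mean(pos + neg))

Zp = np.array([[1.0, 0.0], [1.0, 0.0]])
Zq = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([1, 0])          # first pair linked, second pair unlinked
sp_dist = np.array([1.0, 3.0])     # shortest-path distances in the graph
print(ranking_loss(Zp, Zq, labels, sp_dist))
```

With this choice, an unlinked pair two hops apart is weighted more heavily than one four hops apart, so nearby non-neighbours dominate the negative term.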
Alignment Mechanism.
By minimizing the corresponding losses of the two encoders, we can learn the structure-oriented node embedding and the attribute-oriented node embedding, respectively. However, if we learn them separately, the two types of embeddings are isolated and cannot represent the connections between attributes and graph structure well. We propose to align the two types of embeddings during the training process and to learn the two encoders simultaneously. We design two alignment methods:
1. Tight Alignment aims to maximize the similarity between the attribute-oriented embedding and the structure-oriented embedding of each node. Mathematically, the objective of the tight alignment is to minimize
(7)
However, the tight method is sometimes too strict when aligning the two types of embeddings.
2. Loose Alignment aims to maximize the similarity between the two types of embeddings of two linked nodes, and it adopts the loss function in Eq. 4. Mathematically, the objective of the loose alignment is to minimize
(8)
Putting everything together, the final objective of our model is as follows:
(9)
where a hyperparameter vector sets the weights of the different losses.
Training algorithm and prediction.
Algorithm 1 summarizes the training process of the proposed model.
To predict whether there is a link between two nodes, we can calculate a score from their embeddings as follows:
(10)
where another hyperparameter vector gives each similarity score a different weight. In inductive link prediction, the embedding of a new node is computed by the attribute-oriented encoder, and the weight of the score that relies on structure-oriented embeddings is set to zero. Our model can also perform transductive link prediction by setting this weight to a nonzero value.
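The weighted scoring step can be sketched as follows. The sketch uses only two terms, one similarity between attribute-oriented embeddings and one between structure-oriented embeddings; the score in Eq. 10 may contain further terms, and the function name `link_score` and the weight layout are assumptions of this example.

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def link_score(za_p, za_q, zs_p=None, zs_q=None, lam=(1.0, 0.0)):
    """Weighted sum of similarity scores; lam = (attribute weight, structure weight)."""
    lam_attr, lam_struct = lam
    score = lam_attr * cos(za_p, za_q)
    # The structure term is only usable when both nodes were seen in training.
    if lam_struct != 0.0 and zs_p is not None and zs_q is not None:
        score += lam_struct * cos(zs_p, zs_q)
    return score

za_p, za_q = np.array([1.0, 0.0]), np.array([1.0, 0.0])   # attribute-oriented embeddings
zs_p, zs_q = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # structure-oriented embeddings

inductive = link_score(za_p, za_q)                               # structure weight is zero
transductive = link_score(za_p, za_q, zs_p, zs_q, lam=(0.5, 0.5))
print(inductive, transductive)
```

Setting the structure weight to zero leaves a score that depends on attributes alone, so it can be evaluated for a node that has no structure-oriented embedding.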
5 Experiments
5.1 Datasets
Table 1: Statistics of the datasets.
Dataset                             Nodes    Edges     Attributes
CS ([shchur2018pitfalls])           18,333   81,894    6,805
PPI ([Zitnik2017])                  1,767    16,159    50
Cora ([mccallum2000automating])     2,708    5,278     1,433
CiteSeer ([sen2008collective])      3,327    4,552     3,703
PubMed ([namata2012query])          19,717   44,324    500
Computers ([mcauley2015image])      13,752   245,861   767
Photo ([mcauley2015image])          7,650    119,081   745
For link prediction tasks, we evaluate our proposed model and baselines on four types of real-world datasets, i.e., the co-authorship graph (CS), the protein-protein interaction graph (PPI), co-purchase graphs (Computers and Photo), and citation network datasets (Cora, CiteSeer and PubMed). Details of these datasets are summarised in Table 1.
Table 2: Results of inductive link prediction (AUC and AP).
Method    Cora           CiteSeer       CS             PubMed         Computers      Photo
          AUC    AP      AUC    AP      AUC    AP      AUC    AP      AUC    AP      AUC    AP
MLP       0.826  0.674   0.897  0.789   0.921  0.810   0.842  0.705   0.866  0.692   0.901  0.753
Cite.     0.839  0.712   0.914  0.824   0.939  0.862   0.912  0.809   0.898  0.762   0.926  0.808
G2G       0.845  0.739   0.922  0.842   0.948  0.889   0.910  0.798   0.853  0.684   0.862  0.704
GCNDeal   0.855  0.766   0.912  0.862   0.969  0.943   0.961  0.924   0.943  0.888   0.959  0.907
Deal      0.864  0.804   0.937  0.907   0.977  0.959   0.966  0.931   0.953  0.899   0.965  0.922
5.2 Baseline Methods
We compare our model Deal with MLP and several state-of-the-art graph embedding methods, including SEAL [zhang2018link], G2G [bojchevski2018deep] and GAE [kipf2016variational]. The original GAE takes GCN as the encoder. We also consider other GAE variants, which replace the GCN encoder with GIN [xu2018how], GAT [velickovic2018graph] and SAGE [hamilton2017inductive], respectively. The GAE variants are denoted by the names of their encoder models.
Moreover, Deal variants can use different graph embedding models as structure-oriented or attribute-oriented encoders. Deal denotes the model with the proposed encoders presented in Section 4. A Deal variant is denoted by the names of the models used as its structure-oriented and attribute-oriented encoders. We select three representative variants. All the Deal variants aim to minimize Eq. 9. To ensure fairness, we set all models to have a similar number of parameters and train them for the same number of epochs.
5.3 Experimental Setup
We evaluate the proposed model Deal and baseline models under both inductive and transductive learning settings.
Inductive link prediction.
For the inductive case, the nodes in the test set are unseen during the training process. Similar to the dataset split setting of [bojchevski2018deep], we randomly hide 10% of the nodes and use the edges between them as the test set. The remaining nodes and edges are used for training and validation.
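The node-hiding split described above can be sketched as follows. The treatment of edges that connect a hidden node to a visible node is not specified in the text; this sketch drops them from both sets, which is an assumption, as are the function name `inductive_split` and the toy ring graph.

```python
import numpy as np

def inductive_split(n_nodes, edges, hide_fraction=0.1, seed=0):
    """Hide a fraction of nodes; test on edges whose endpoints are both hidden."""
    rng = np.random.default_rng(seed)
    n_hidden = max(1, int(round(hide_fraction * n_nodes)))
    hidden = set(rng.choice(n_nodes, size=n_hidden, replace=False).tolist())
    test_edges = [(u, v) for u, v in edges if u in hidden and v in hidden]
    # Assumption: edges touching exactly one hidden node are discarded.
    train_edges = [(u, v) for u, v in edges if u not in hidden and v not in hidden]
    return hidden, train_edges, test_edges

# Toy ring graph with 20 nodes (illustrative data only).
ring = [(i, (i + 1) % 20) for i in range(20)]
hidden, train_edges, test_edges = inductive_split(20, ring, hide_fraction=0.1, seed=0)
print(len(hidden), len(train_edges), len(test_edges))
```

Keeping the hidden nodes entirely out of the training edges is what makes the evaluation inductive: at test time a model sees only their attributes.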
Transductive link prediction.
For the transductive case, all the nodes in the graph can be seen during training. Similar to the dataset split setting of [you2019position], we randomly sample 10%/10% of the edges and an equal number of non-edges as the validation/test sets. The remaining non-edges and 80% of the edges are used as the training set.
We report the test set performance at the point where the model achieves its best performance on the validation set. For the experimental results, we report the mean area under the ROC curve (AUC) and the average precision (AP) scores over ten trials with different random seeds and train/validation splits. In all the experiments, the default embedding size is 64. In each training mini-batch, linked node pairs account for 40% of the samples. We tune the hyperparameters of the baseline models and our proposed Deal with grid search on the validation set.
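For reference, the two reported metrics can be computed from link scores as in the sketch below. It is a plain NumPy illustration and not the evaluation code used in the paper; the AUC uses the pairwise-ranking definition, and the AP computation assumes the scores contain no ties.

```python
import numpy as np

def roc_auc(labels, scores):
    """Probability that a random positive pair outscores a random negative pair."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg)))

def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive pair."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(np.sum(precision_at_k * hits) / np.sum(hits))

labels = [1, 0, 1, 0]            # 1 = edge, 0 = non-edge
scores = [0.9, 0.8, 0.7, 0.1]    # predicted link scores
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(auc, round(ap, 4))
```

In this toy case three of the four positive-negative pairs are ranked correctly, giving an AUC of 0.75, and the two positives sit at ranks 1 and 3, giving an AP of (1 + 2/3) / 2.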
5.4 Results of Inductive Link Prediction
As GCN-based models cannot aggregate neighbours’ information in the inductive link prediction scenario, we compare our proposed model with MLP and G2G, and the experimental results are shown in Table 2. Deal significantly outperforms MLP and G2G across all datasets. On the Computers dataset, for instance, Deal improves the AUC and AP scores by 6.12% and 17.98%, respectively, relative to the strongest baseline in Table 2. Comparing GCNDeal with Deal shows that using a GCN layer as the structure-oriented encoder does not improve the link prediction performance. We also observe that G2G performs worse when the graphs have a small number of feature dimensions compared with the number of nodes, such as Computers and Photo. The reason is that the node embedding encoder of G2G takes only the feature matrix as input. As an extreme example, when the features of all the nodes are similar, it is difficult for G2G to distinguish different nodes.
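The percentages quoted for the Computers dataset appear to be relative gains over the best baseline row of Table 2 (the row labelled Cite.), as the following small check suggests. The helper name `relative_improvement` is an assumption; the four numbers are read directly from Table 2.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# Computers dataset, Table 2: Deal versus the strongest baseline row.
auc_gain = relative_improvement(0.953, 0.898)
ap_gain = relative_improvement(0.899, 0.762)
print(round(auc_gain, 2), round(ap_gain, 2))
```

Measured against MLP or G2G on the same dataset, the relative gains would be larger, since both score below the Cite. row.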
Table 3: Results of transductive link prediction (AUC and AP).
Method    Cora             CiteSeer         CS               PubMed           PPI
          AUC     AP       AUC     AP       AUC     AP       AUC     AP       AUC     AP
GAT       0.8684  0.8866   0.8423  0.8662   0.9465  0.9473   0.9193  0.9202   0.8092  0.8136
GCN       0.8670  0.8755   0.8466  0.8620   0.9452  0.9421   0.9287  0.9272   0.8384  0.8364
GIN       0.8666  0.8762   0.8405  0.8617   0.9432  0.9407   0.9262  0.9254   0.8086  0.8086
SAGE      0.8739  0.8881   0.8498  0.8721   0.9485  0.9504   0.9254  0.9270   0.8112  0.8131
Cite.     0.9145  0.9143   0.9385  0.9417   0.9501  0.9517   0.9435  0.9378   0.6047  0.5981
SEAL      0.8269  0.7959   0.8064  0.7769   0.9146  0.8856   0.9235  0.9239   0.8825  0.8749
PGNN      0.8225  0.8427   0.8065  0.8436   0.8779  0.8811   0.8145  0.8647   0.7303  0.6716
G2G       0.9282  0.9336   0.9413  0.9421   0.9636  0.9640   0.9432  0.9364   0.5896  0.5333
GCNDeal   0.9163  0.9047   0.9221  0.9219   0.9796  0.9801   0.9456  0.9498   0.8673  0.8709
DealGCN   0.9002  0.9075   0.8496  0.8747   0.9646  0.9663   0.9299  0.9348   0.8861  0.8868
DealGAT   0.8985  0.9091   0.8463  0.8752   0.9640  0.9666   0.9311  0.9355   0.8711  0.8762
Deal      0.9455  0.9501   0.9519  0.9591   0.9827  0.9841   0.9593  0.9611   0.8894  0.8973
5.5 Results of Transductive Link Prediction
The experimental results of transductive link prediction are summarized in Table 3. The results show that our proposed model Deal achieves the best performance. Among the baselines, G2G performs well on the citation networks and the co-authorship graph, which have informative node attributes. It is worth noting that the number of node attributes in PPI is at most 10% of that in the other datasets. On the PPI graph, where each node contains limited attribute information, G2G has the worst performance, while SEAL achieves outstanding performance.
Interestingly, the remaining GAE baseline models achieve comparable performance on all the datasets, although they have different methods of aggregating neighbours’ information. Moreover, the attention mechanism of GAT does not outperform other GNN layers in this scenario. The reason may be that the GAE variants are insensitive to the different information aggregation methods on the link prediction problem. Compared to the baselines, Deal is more robust and shows stronger generalization ability on different types of datasets. In addition, the Deal framework enables GAEs to achieve better performance.
5.6 Comparison of Alignment Methods
To compare the alignment methods of Section 4.4, we conduct both inductive and transductive link prediction experiments on three representative datasets, i.e., Cora, CS, and PubMed. The experimental results (Table 4) show that, on these three datasets, both alignment methods are effective, and the loose alignment method slightly outperforms the tight alignment method, especially for the inductive link prediction task. The reason is that the loose alignment method places fewer restrictions on the node embedding alignment. The loose method also gives the two types of node embeddings the flexibility they need, which can be adjusted through its hyperparameters.
Table 4: Comparison of the alignment methods (I: inductive, T: transductive).
Setting   Cora           CS             PubMed
          AUC    AP      AUC    AP      AUC    AP
I         0.845  0.774   0.972  0.951   0.951  0.905
I         0.865  0.803   0.976  0.956   0.966  0.931
T         0.939  0.942   0.976  0.978   0.955  0.958
T         0.946  0.950   0.983  0.984   0.954  0.961
Table 5: Parameter analysis.
Cora           CS             PPI
AUC    AP      AUC    AP      AUC    AP
0.932  0.933   0.952  0.950   0.813  0.820
0.934  0.941   0.962  0.960   0.848  0.851
0.939  0.943   0.973  0.978   0.869  0.875
0.946  0.950   0.983  0.984   0.889  0.897
5.7 Parameter Analysis
Here we conduct experiments to analyse two key parameters in Deal: the loss margin parameter of Eq. 5 and the weight hyperparameter of Eq. 6. Table 5 indicates that the performance can be improved by tuning both of them, and that the parameter of Eq. 5 plays a more important role than that of Eq. 6. The reason is that varying the parameter of Eq. 5 regularizes the loss (Eq. 4). It is interesting to note that different parameter values can also change the similarity of node pairs in the embedding space, as shown in Figure 3. Figure 3 also indicates that the same node pair tends to have a higher similarity score in the structure-oriented embedding space than in the attribute-oriented embedding space. The reason is that the node embeddings in the structure-oriented space are separate, while there are correlations between those in the attribute-oriented space, especially for nodes that have certain attributes in common.
6 Conclusions
In this work, we propose a novel model Deal to address the inductive link prediction problem on attributed graphs, where the local structure of the new node is unknown. Different from the typical GCNs that aggregate information from neighbours, Deal learns comprehensive node representations via two encoders and an alignment mechanism. We have experimentally shown that our proposed model Deal consistently outperforms state-of-the-art methods. In the future, we will develop more efficient training algorithms, so that our model can process large-scale datasets.
Acknowledgments
Xin Cao is supported by ARC DE190100663. Xike Xie is supported by NSFC (No. 61772492), Jiangsu NSF (No. BK20171240) and the CAS Pioneer Hundred Talents Program. Sibo Wang is supported by Hong Kong RGC ECS Grant (No. 24203419), CUHK Direct Grant (No. 4055114), and NSFC (No. U1936205).