Network representation learning (NRL) is a powerful technique for learning a representation of a graph. Such a representation lends itself to a wide variety of network analysis tasks, such as link prediction, node clustering, node classification, and recommendation.
In most studies, the learning is done in a context-free fashion. That is, the representation of a node characterizes just a single aspect of the node, for instance, its local or global neighborhood. Recently, a complementary line of research has questioned the sufficiency of single representations and considered a context-sensitive approach. Given a node, this approach projects it to different points in a space depending on the other contexts it is coupled with. A context node can be sampled from a neighborhood (Tu et al., 2017; Zhang et al., 2018), a random walk (Ying et al., 2018), and so on. In this study we sample from a node’s neighborhood (nodes connected by an edge). Thus, in the learning process of our approach a source node’s representation changes depending on the target (context) node it is accompanied by. Studies have shown that context-sensitive approaches significantly outperform previous context-free SOTA methods on the link prediction task. A related notion (Peters et al., 2018; Devlin et al., 2018) in NLP has significantly improved the SOTA across several NLP tasks.
In this paper we propose Gap (Graph neighborhood attentive pooling), which is inspired by attentive pooling networks (apn) (dos Santos et al., 2016), originally proposed for solving the problem of pair ranking in NLP. For instance, given a question $q$ and a set of answers $\mathcal{A} = \{a_1, \ldots, a_z\}$, an apn can be trained to rank the answers in $\mathcal{A}$ with respect to $q$ by using a two-way attention mechanism. apn is based on the prevalent deep learning formula for SOTA NLP, that is, embed, encode, attend, predict (Honnibal, 2018). Given a question-answer pair $(q, a)$, the apn model first projects the embeddings of the pair using two separate encoders, each of which can be a cnn or lstm. The projection helps to capture n-gram context information and/or long-term dependencies in the input sequence. Next, a soft-alignment matrix is obtained by a mutual-attention mechanism that transforms these projections using a parameter matrix. Attention vectors are then computed through column-wise and row-wise pooling operations on the alignment matrix. Finally, the weighted sum of each of the above projections by its respective attention vector is computed to obtain the representations of the question and answer. Each candidate answer $a_i \in \mathcal{A}$ is then ranked according to its similarity with the question $q$, computed using the representations of $q$ and $a_i$.
Recently, apn has been applied to context-sensitive NRL by Tu et al. (2017), where the inputs are textual information attached to the pair of nodes incident to each edge in a graph. Such an approach, however, has the added overhead of encoding textual information.
Though we adopt apn in Gap, we capitalize on the graph neighborhood of nodes to avoid the need for textual documents without compromising the quality of the learned representations. Our hypothesis is that one can learn high-quality context-sensitive node representations just by mutually attending to the graph neighborhood of a node and its context node. To achieve this, we naturally assume that the order of nodes in the graph neighborhood of a node is arbitrary. Moreover, we exploit this assumption to simplify the apn model by removing the expensive encode phase.
Akin to textual features in apn, Gap simply uses the graph neighborhood of nodes. That is, for every node in the graph we define a graph neighborhood function to build a fixed-size neighborhood sequence, which specifies the input of Gap. In the apn model, the encoder phase is usually required to capture high-level features such as n-grams, and long-term and order dependencies in textual inputs. As we have no textual features, and due to our assumption that there is no ordering in the graph neighborhood of nodes, we can effectively strip off the encoder. The encoder is the expensive part of apn as it involves an rnn or cnn, and hence Gap can be trained faster than apn.
This simple yet empirically fruitful modification of the apn model enables Gap to achieve SOTA performance on link prediction and node clustering tasks using three real-world datasets. Furthermore, we have empirically shown that Gap is more than 2 times faster than an apn-like NRL algorithm based on text input. In addition, the simplification in Gap does not introduce new hyper-parameters other than the usual ones, such as the learning rate and sequence length in apn.
2 apn Architecture
For the sake of being self-contained, we briefly describe the original apn architecture here. We are given a pair of natural language texts $(x, y)$ as input, where $x = (w_1, \ldots, w_{n_x})$ and $y = (w_1, \ldots, w_{n_y})$ are sequences of words of variable lengths, each word $w_i$ is a sample from a vocabulary $V$, $n_x, n_y > 0$, and $n_x$ and $n_y$ could be different. The apn’s forward execution is shown in Fig. 1(A), and in the following we describe each component.
First, embedding matrices of $x$ and $y$ are constructed through a lookup operation on an embedding matrix $W \in \mathbb{R}^{d \times |V|}$ of words, where $d$ is the embedding dimension. That is, for $x$ and $y$, respectively, embedding matrices $X \in \mathbb{R}^{d \times n_x}$ and $Y \in \mathbb{R}^{d \times n_y}$ are constructed by concatenating the embeddings of each word in $x$ and $y$ (the Embed box in Fig. 1(A)).
Each embedding matrix is then projected using a cnn or bi-LSTM encoder to capture inherent high-level features (the Encode box in Fig. 1(A)). More formally, the embedded texts $X$ and $Y$ are projected as $X_{enc} = f(X, \Theta)$ and $Y_{enc} = f(Y, \Theta)$, where $f$ is the encoder (cnn or bi-LSTM), $\Theta$ is the set of parameters of the encoder, and $X_{enc} \in \mathbb{R}^{c \times n_x}$ and $Y_{enc} \in \mathbb{R}^{c \times n_y}$, where $c$ is the number of filters or hidden features of the cnn or bi-LSTM, respectively.
In the third step, a parameter matrix $P \in \mathbb{R}^{c \times c}$ is introduced so as to learn a similarity or soft-alignment matrix $A \in \mathbb{R}^{n_x \times n_y}$ between the sequence projections $X_{enc}$ and $Y_{enc}$ as:
$$A = \tanh\left(X_{enc}^{\top} P \, Y_{enc}\right)$$
Then unnormalized attention weight vectors $x' \in \mathbb{R}^{n_x}$ and $y' \in \mathbb{R}^{n_y}$ are obtained through column-wise and row-wise max-pooling operations on $A$, respectively, as $x'_i = \max(A_{i,:})$ and $y'_j = \max(A_{:,j})$, where $1 \le i \le n_x$, $1 \le j \le n_y$, and $A_{i,:}$ and $A_{:,j}$ are the $i$-th row and $j$-th column of $A$, respectively. Next, the attention vectors are normalized using softmax, $\sigma_x = \mathrm{softmax}(x')$ and $\sigma_y = \mathrm{softmax}(y')$. Finally, the normalized attention vectors are used to compute the final representations as $r_x = X_{enc}\,\sigma_x$ and $r_y = Y_{enc}\,\sigma_y$.
In the last step, the representations $r_x$ and $r_y$ are used for ranking, depending on the task at hand. For instance, in a question-and-answer setting, each candidate answer’s representation $r_y$ is ranked based on its similarity score with the question’s representation $r_x$.
3 Gap
Gap adopts the apn model for learning the representations of node pairs in a graph $G = (N, E)$ with a set of nodes $N$ and edges $E$. $G$ can be directed or undirected, and weighted or unweighted. Without loss of generality we assume that $G$ is an unweighted directed graph.
We define a neighborhood function $\mathcal{N}_G : N \to 2^{N}$, which maps each node $u \in N$ to a set of nodes $N_u \subseteq N$. A simple way of materializing $N_u$ is to consider the first-order neighbors of $u$, that is, $N_u = \{v : (u, v) \in E \lor (v, u) \in E\}$. An important assumption that we make about $N_u$ is that the ordering of the nodes in $N_u$ is not important. Gap capitalizes on this assumption to simplify the apn model and achieve SOTA performance. Even though one can explore more sophisticated neighborhood functions, in this study we simply consider the first-order neighborhood.
Our goal is to learn node representations using the simplified apn based on the node neighborhood function $\mathcal{N}_G$. Hence, akin to the input text pairs in apn, we consider a pair of neighborhood sequences $N_s = (u_1, \ldots, u_m)$ and $N_t = (v_1, \ldots, v_n)$ associated with a pair of nodes $s$ and $t$, with $u_i \in N$ and $v_j \in N$. Without loss of generality we consider $m = n$. Recall that we assume the order of nodes in $N_s$ and $N_t$ is arbitrary.
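The neighborhood function and the fixed-size neighborhood sequences described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes that edges are treated as undirected when collecting first-order neighbors, that node ids start at 1 so that id 0 can serve as padding, and that neighborhoods larger than the sequence length are randomly subsampled. The helper names `first_order_neighbors` and `neighborhood_sequence` are hypothetical.

```python
import random

PAD = 0  # assumption: id 0 is reserved for padding; real node ids start at 1


def first_order_neighbors(edges):
    """Map each node to the set of nodes it shares an edge with (either direction)."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs


def neighborhood_sequence(node, nbrs, length, rng):
    """Return a fixed-length neighborhood sequence for `node`.

    Larger neighborhoods are subsampled (an assumption of this sketch);
    smaller ones are zero-padded. The order of the entries carries no meaning.
    """
    neighbors = sorted(nbrs.get(node, ()))
    if len(neighbors) > length:
        neighbors = rng.sample(neighbors, length)
    return neighbors + [PAD] * (length - len(neighbors))


edges = [(1, 2), (1, 3), (1, 4), (2, 3), (4, 5)]
nbrs = first_order_neighbors(edges)
rng = random.Random(0)
seq_1 = neighborhood_sequence(1, nbrs, 2, rng)  # node 1 has 3 neighbors, 2 are kept
seq_5 = neighborhood_sequence(5, nbrs, 3, rng)  # node 5 has 1 neighbor, padded to 3
```

Because the order within a sequence is arbitrary, any subsampling or shuffling strategy yields an equally valid input under the stated assumption.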
Given a source node $s$, we seek to learn multiple context-sensitive embeddings of $s$ with respect to a target node $t$ it is paired with. In principle one can learn using all pairs of nodes; however, that is not scalable, and hence we restrict learning to pairs $(s, t) \in E$.
Gap’s forward execution model is shown in Fig. 1(B), and learning starts by embedding the neighborhood sequences $N_s$ and $N_t$ of $s$ and $t$, respectively, as $S \in \mathbb{R}^{d \times m}$ and $T \in \mathbb{R}^{d \times n}$. Since there is no order dependency between the nodes in $N_s$ or $N_t$, besides being a neighbor of the respective node, we leave out the cnn or bi-LSTM based projections of $S$ and $T$ that could capture the dependencies. No encoder!
Thus, the next step of Gap is mutually attending on the embeddings, $S$ and $T$, of the graph neighborhoods of the node pair (the Attend box of Fig. 1(B)). That is, we employ the trainable parameter matrix $P \in \mathbb{R}^{d \times d}$ and compute the soft-alignment matrix, $A$, between the neighbors of $s$ and $t$:
$$A = \tanh\left(S^{\top} P \, T\right) \qquad (1)$$
Here $A \in \mathbb{R}^{m \times n}$ is a soft-alignment matrix between every pair of nodes $(u, v) \in N_s \times N_t$. Therefore, for each axis of $A$, we proceed by pooling the maximum alignment score for each node to obtain the unnormalized attention vectors $\tilde{s}_i = \max(A_{i,:})$ and $\tilde{t}_j = \max(A_{:,j})$. As a result of the pooling operations, each neighbor $u_i \in N_s$ of the source node $s$ selects the neighbor $v_j \in N_t$ of the target node $t$ with the maximum alignment or similarity score. A similar selection is done for each neighbor of $t$. This enables the source and target neighborhood sequences of the pair to influence each other in order to learn context-sensitive representations of $s$ and $t$. The normalized attention vectors are then obtained by applying softmax as $\sigma_s = \mathrm{softmax}(\tilde{s})$ and $\sigma_t = \mathrm{softmax}(\tilde{t})$. Ultimately, we compute the context-sensitive representations $r_s$ and $r_t$ of the source and target nodes $s$ and $t$, respectively, as $r_s = S\,\sigma_s$ and $r_t = T\,\sigma_t$.
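To make the encoder-free Attend step concrete, the following is a minimal pure-Python sketch of the forward pass described above: a soft-alignment matrix through a parameter matrix and tanh, max-pooling along each axis, softmax normalization, and attention-weighted sums. It is an illustration under assumptions rather than the authors' code: neighbor embeddings are stored one per row (the transpose of a column layout), the embedding table and parameter matrix are random, and `attentive_pool` is a hypothetical helper name.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pool(S, T, P):
    """S: m x d and T: n x d (one row per neighbor embedding), P: d x d.

    Returns the context-sensitive source and target representations.
    """
    # Soft-alignment matrix (m x n): entry (i, j) = tanh(S[i] . P . T[j]).
    SP = matmul(S, P)
    A = [[math.tanh(sum(a * b for a, b in zip(sp, t))) for t in T] for sp in SP]
    # Max-pool each row (source neighbors) and each column (target neighbors).
    att_s = softmax([max(row) for row in A])
    att_t = softmax([max(col) for col in zip(*A)])
    # Attention-weighted sums of the neighbor embeddings.
    d = len(P)
    r_s = [sum(w * row[k] for w, row in zip(att_s, S)) for k in range(d)]
    r_t = [sum(w * row[k] for w, row in zip(att_t, T)) for k in range(d)]
    return r_s, r_t


rng = random.Random(0)
d, num_nodes = 4, 6
E = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(num_nodes)]  # toy embedding table
P = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d)]
N_s, N_t = [1, 2, 3], [3, 4, 5]          # toy neighborhood sequences (node ids)
S = [E[u] for u in N_s]                  # Embed: simple lookup, no encoder
T = [E[v] for v in N_t]
r_s, r_t = attentive_pool(S, T, P)
score = sum(a * b for a, b in zip(r_s, r_t))  # Predict: dot product
```

Since the pooled attention weights follow the neighbors when they are permuted, the resulting representations do not depend on the order of the input sequences, which is consistent with the assumption that neighborhood order is arbitrary.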
The objective of Gap is to maximize the likelihood of the graph (edges) by predicting edge similarities using the dot product of the source and target representations as $r_s \cdot r_t$ (the Predict box of Fig. 1(B)). Hence, we employ a hard-margin loss given in Eq. 2:
$$\mathcal{L}(s, t) = \max\left(0,\; 1 - r_s \cdot r_t + r_s \cdot r_{t^-}\right) \qquad (2)$$
where $r_{t^-}$ is the representation of a negative target node $t^-$, that is, $(s, t^-) \notin E$. The goal is to learn, in an unsupervised fashion, context-sensitive embeddings of nodes that enable us to rank the positive edges $(s, t) \in E$ higher than the negative pairs $(s, t^-)$.
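As an illustration of the margin-based objective, the sketch below computes a hinge loss from the dot-product score of a positive pair and that of a negative pair. The margin of 1 and the function name `margin_loss` are assumptions made for this example; the sketch operates on precomputed scores and takes no position on how the negative pair's representations are obtained.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def margin_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss: zero once the positive pair outscores the negative by `margin`."""
    return max(0.0, margin - pos_score + neg_score)


r_s = [0.5, 1.0]
r_t = [1.0, 1.5]       # representation of a true neighbor of the source
r_t_neg = [0.2, -0.4]  # representation of a sampled non-neighbor

pos = dot(r_s, r_t)      # 2.0
neg = dot(r_s, r_t_neg)  # -0.3
loss = margin_loss(pos, neg)
```

The loss only produces a gradient for pairs that violate the margin, so well-separated positive and negative pairs stop influencing training.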
Finally, a word on the computational complexity of Gap: it is proportional to the number of edges, as we consider each edge as an input pair.
4 Experimental Evaluation
In this section we provide an empirical evaluation of Gap. To this end, experiments are carried out using the following datasets, and a basic summary is given in Table 1.
Email (Leskovec et al., 2007): an email communication network between the largest European research institutes. A node represents a person and an edge $(u, v)$ denotes that person $u$ has sent an email to person $v$.
The first two datasets have features (documents) associated with nodes: abstracts of papers for Cora and user posts for Zhihu. Some of the baselines, discussed below, require textual information, and hence they consume these features. The Email dataset has a ground-truth community assignment for nodes based on a person’s affiliation to one of the 42 departments.
We compare our method against the following 11 popular and SOTA baselines grouped as:
Structure based Context-Sensitive method: splitter Epasto and Perozzi (2019)
We now report the results of experiments on two tasks, link prediction and node clustering. All experiments are performed on a machine with a 24-core CPU and 125GB RAM running Ubuntu 18.04.
4.1 Link Prediction
Link prediction is an important task that graph embedding algorithms are applied to. In particular, context-sensitive embedding techniques have proved to be well suited for this task. Similar to existing studies, we perform this experiment using a fraction of the edges as a training set. We hold out the remaining fraction of the edges from the training phase and only reveal them during the test phase; results are reported using this held-out set. All hyper-parameter tuning is performed by taking a small fraction of the training set as a validation set.
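The hold-out protocol can be sketched as a simple random split of the edge list. This is an illustrative sketch, not the evaluation code used in the experiments; the function name `split_edges`, the fixed seed, and the toy edge list are assumptions.

```python
import random


def split_edges(edges, train_fraction, seed=0):
    """Shuffle the edges and split them into training and held-out test sets."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


edges = [(u, u + 1) for u in range(20)]      # toy edge list with 20 edges
train_edges, test_edges = split_edges(edges, 0.55)
```

A validation set for hyper-parameter tuning can be carved out of `train_edges` by calling the same function again on the training portion.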
In line with existing techniques (Tu et al., 2017; Zhang et al., 2018), the percentage of training edges ranges from 15% to 95% in steps of 10%. The hyper-parameters of all algorithms are tuned using random search. For some of the baselines, our results are consistent with what is reported in previous studies, and hence for Cora and Zhihu we simply report these results.
Except for the “unavoidable” hyper-parameters (e.g., learning rate, regularization/dropout rate) that are common to all the algorithms, our model has just one hyper-parameter, which is the neighborhood sequence length (#Neighborhood). For nodes with a smaller neighborhood size we use zero padding. As we shall verify later, Gap is not significantly affected by the choice of this parameter.
The quality of the prediction task is measured using the AUC score. AUC indicates the probability that a randomly selected held-out edge $(u, v) \in E$ will have a higher similarity score than a randomly selected non-existent edge $(u', v') \notin E$. Similarity between a pair of nodes is computed as the dot product of their representations. For all the algorithms the representation size $d$ is 200, and Gap’s configuration is shown in Table 2.
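As a worked illustration of the AUC measure, the sketch below computes it directly from its probabilistic definition: the fraction of (positive, negative) score pairs in which the held-out edge scores higher than the non-edge, with ties counted as one half. The scores are made-up numbers and `auc` is a hypothetical helper; this quadratic-time version is meant for clarity, not efficiency.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


pos = [0.9, 0.7, 0.4]  # dot-product scores of held-out edges (illustrative values)
neg = [0.5, 0.3, 0.1]  # dot-product scores of sampled non-edges (illustrative values)
result = auc(pos, neg)  # 8 of the 9 pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking, and 1.0 means every held-out edge is scored above every non-edge.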
Table 2: Configuration of Gap (columns: Dataset, #Neighborhood, Dropout, Learning rate, Representation size).
The results of the empirical evaluations on the Cora, Zhihu, and Email datasets are reported in Tables 3, 4, and 5. Gap outperforms the SOTA baselines in all cases for Zhihu and Email, and in almost all cases for Cora. One can see that as we increase the percentage of training edges, performance increases for all the algorithms. As indicated by the “Gain” row, Gap achieves up to 9% improvement over SOTA context-sensitive techniques. Notably, the gain is pronounced for smaller percentages of training edges. This shows that Gap is suitable both when many links are missing and when most of the links are present.
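Table 3: AUC scores for link prediction on Cora (rows: algorithms; columns: % of training edges).
Table 4: AUC scores for link prediction on Zhihu (rows: algorithms; columns: % of training edges).
Table 5: AUC scores for link prediction on Email (rows: algorithms; columns: % of training edges).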
4.2 Node Clustering
Nodes in a network have the tendency to form cohesive structures based on some kind of shared aspects. These structures are usually referred to as groups, clusters, or communities, and identifying them is an important task in network analysis. In this section we use the Email dataset, which has 42 ground-truth communities. Since this dataset has only structural information, we have excluded the baselines that require textual information.
Since each node belongs to exactly one cluster, we employ the Spectral Clustering algorithm to identify clusters. The representations of nodes learned by each algorithm are the input features of the clustering algorithm. In this experiment the percentage of training edges varies from 25% to 95% in steps of 20%; for the rest, we use the same configuration as in the above experiment.
Given the ground-truth community assignment $y$ of nodes and the predicted community assignment $\hat{y}$, the agreement between $y$ and $\hat{y}$ is usually measured using mutual information $I(\hat{y}; y)$. However, $I$ is not bounded, which makes it difficult to compare methods; hence we use two other variants of $I$ (Vinh et al., 2010). These are the normalized mutual information (NMI), which simply normalizes $I$, and the adjusted mutual information (AMI), which adjusts or normalizes $I$ for random chance.
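To illustrate why normalization makes mutual information comparable across methods, the sketch below computes mutual information and one common NMI variant from two label lists. The normalization by the arithmetic mean of the two entropies is an assumption of this example (Vinh et al., 2010 discuss several normalizations), and AMI is omitted because it additionally requires the expected mutual information under a random-permutation model. The helper names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(labels_true, labels_pred):
    """Mutual information (in nats) between two label assignments."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    mi = 0.0
    for (a, b), n_ab in joint.items():
        mi += (n_ab / n) * math.log(n * n_ab / (count_true[a] * count_pred[b]))
    return mi


def nmi(labels_true, labels_pred):
    """Mutual information normalized by the mean entropy, giving a value in [0, 1]."""
    h_true, h_pred = entropy(labels_true), entropy(labels_pred)
    if h_true == 0.0 or h_pred == 0.0:
        return 1.0 if h_true == h_pred else 0.0
    return mutual_information(labels_true, labels_pred) / ((h_true + h_pred) / 2)


same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])  # identical partitions
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])          # unrelated partitions
```

The first call shows that NMI is invariant to how clusters are named, which matters because a clustering algorithm assigns arbitrary cluster ids.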
Table 6: NMI and AMI scores for node clustering on Email (rows: algorithms; columns: % of training edges).
The results of this experiment are reported in Table 6, and Gap significantly outperforms all the baselines by up to 20% with respect to AMI score. Consistent with our previous experiment, Gap performs well at both extremes of the percentage of training edges. Similar improvements are achieved for NMI score.
4.3 Parameter Sensitivity Analysis
Here we first show the sensitivity of the main hyper-parameter of Gap, which is the size of the neighborhood, #Neighborhood. Figures 2 and 3(A) show the effects of this parameter on both the link prediction and node clustering tasks. In both cases we notice that Gap is not significantly affected by changes in its value. We show the effect across different percentages of training edges for link prediction, and for a fixed percentage (55%) of training edges for node clustering. Regardless of the percentage of training edges, we only see a small change in AUC (Fig. 2), and in NMI and AMI (Fig. 3(A)), across different values of #Neighborhood.
Next, we analyze the run time of training Gap. Our goal is to show the benefit of removing the encoder of apn, and we do this by comparing Gap against cane, which employs the exact apn architecture. For this experiment we include two randomly generated graphs, using the Erdős–Rényi (ERG) and Barabási–Albert (BAG) models. ERG has 200K edges, and BAG has 1.5M edges. Figure 3(B) clearly shows that Gap is at least 2 times faster than cane on all the graphs.
4.4 Ablation Study
Table 7: Results of the ablation experiment (rows: baselines; columns: two groups of training ratios).
Here we give a brief ablation experiment to strengthen the case for Gap. Concretely, we seek to compare Gap with different baselines to further motivate our choice of (i) the way we model nodes using their neighborhood, (ii) the assumption that order in this neighborhood is not important, and (iii) the choice of the apn algorithm without the cnn or bi-LSTM encoder. To this end, we introduce the following baselines; apart from the indicated difference, everything is the same as in Gap.
First, we assume that order is important in the neighborhood of nodes, based on the neighbors’ similarity with the current node. We use a topological feature, common neighbors, to induce order. This baseline is referred to as GapCn and uses the exact apn model to capture “order”.
Second, we use the same input as in Gap, but each node’s neighborhood is now randomly permuted and fed multiple times (concretely, 5 times), and the exact apn model is employed; this baseline is referred to as GapApn.
In Table 7 we report the results of the ablation experiment, which sheds some light on the assumptions and design choices of Gap. For the reported results, GapCn and GapApn use a cnn encoder. In both cases they quickly over-fit the data, and we found that we had to employ aggressive regularization using a dropout rate of 0.95. In addition, for GapCn we observed that as we increase the kernel size beyond 1 (up to 5) the results keep getting worse; hence, what we report is the best result, which is obtained using a kernel size of 1. For example, with a kernel size of 3 and training ratios of 25, 55, 75, and 95 percent, the AUC scores dropped to 54, 55, 56, and 58 percent, respectively, on the Cora dataset, and to 66, 69, 78, and 77 percent on the Email dataset. We conjecture that this is due to the model’s attempt to enforce high-level neighborhood patterns (e.g., a combination of arbitrary neighbors) that do not intrinsically govern the underlying edge formation phenomena. Rather, what is important is to effectively pay attention to the presence of individual neighbors in both the source and target neighborhoods, regardless of their order. Moreover, training this model is at least two times slower than training Gap, as also illustrated in Section 4.3.
In the case of GapApn, though the variations in AUC are marginal with respect to the change in the kernel size, the training time of this model increased by almost an order of magnitude. Finally, by comparing the results of GapMlp and Gap, we see that the mutual attention mechanism (the Attend component) plays an important role.
5 Related Work
NRL is usually carried out by exploring the structure of the graph and meta data, such as node attributes, attached to the graph (Perozzi et al., 2014; Grover and Leskovec, 2016; Tang et al., 2015; Perozzi et al., 2016; Wang et al., 2016; Yang et al., 2015; Pan et al., 2016; Sheikh et al., 2019; Kefato et al., 2017; Sheikh et al., 2019). Random walks are widely used to explore local/global neighborhood structures, which are then fed into a learning algorithm. The learning is carried out in an unsupervised manner by maximizing the likelihood of observing the neighbor nodes and/or attributes of a center node.
Recently graph convolutional networks have also been proposed for semi-supervised network analysis tasks (Kipf and Welling, 2017; Hamilton et al., 2017; Wu et al., 2019; Velickovic et al., 2017; Abu-El-Haija et al., 2019). These algorithms work by way of aggregating neighborhood features, with a down-stream objective based on partial labels of nodes, for example. All these methods are essentially different from our approach because they are context-free.
Context-sensitive learning is another paradigm of NRL that challenges the sufficiency of a single representation of a node for applications such as link prediction, product recommendation, and ranking. While some of these methods (Tu et al., 2017; Zhang et al., 2018) rely on textual information, others have shown that a similar goal can be achieved using just the structure of the graph (Epasto and Perozzi, 2019). However, the latter require an extra step of persona decomposition that is based on microscopic-level community detection algorithms to identify multiple contexts of a node. Unlike the first group of approaches, our algorithm does not require extra textual information, and unlike the second, it does not require any sort of community detection algorithm.
6 Conclusion
In this study we present a novel context-sensitive graph embedding algorithm called Gap. It consumes node neighborhoods as input features, which are constructed based on an important assumption that their ordering is arbitrary. To learn representations of nodes, Gap employs attentive pooling networks (apn). By exploiting the above assumption, it makes an important simplification of apn and gains more than a 2X speed-up over another SOTA method, which employs the exact apn. Furthermore, Gap consistently outperforms all the baselines and achieves up to 9% and 20% improvement over the best performing ones on the link prediction and node clustering tasks, respectively. In the future we will investigate how node attributes can be incorporated and provide a theoretical framework on the relation between the neighborhood sampling and topological properties.
References
- Abu-El-Haija et al. (2017). Watch your step: learning graph embeddings through attention. CoRR abs/1710.09599.
- Abu-El-Haija et al. (2019). MixHop: higher-order graph convolutional architectures via sparsified neighborhood mixing. CoRR abs/1905.00067.
- Devlin et al. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
- dos Santos et al. (2016). Attentive pooling networks. CoRR abs/1602.03609.
- Epasto and Perozzi (2019). Is a single embedding enough? learning node representations that capture multiple social contexts. CoRR abs/1905.02138.
- Grover and Leskovec (2016). Node2Vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, New York, NY, USA, pp. 855–864.
- Hamilton et al. (2017). Inductive representation learning on large graphs. CoRR abs/1706.02216.
- Honnibal (2018). Embed, encode, attend, predict: the new deep learning formula for state-of-the-art nlp models.
- Kefato et al. (2017). Mineral: multi-modal network representation learning. In Proc. of the 3rd International Conference on Machine Learning, Optimization and Big Data, MOD’17.
- Kipf and Welling (2017). Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).
- Leskovec et al. (2007). Graph evolution: densification and shrinking diameters. ACM Trans. Knowl. Discov. Data 1 (1).
- Pan et al. (2016). Tri-party deep network representation. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI’16, pp. 1895–1901.
- Perozzi et al. (2014). DeepWalk: online learning of social representations. CoRR abs/1403.6652.
- Perozzi et al. (2016). Walklets: multiscale graph embeddings for interpretable network classification. CoRR abs/1605.02115.
- Peters et al. (2018). Deep contextualized word representations. CoRR abs/1802.05365.
- Sheikh et al. (2019). Gat2vec: representation learning for attributed graphs. In Journal of Computing.
- Sheikh et al. (2019). A simple approach to attributed graph embedding via enhanced autoencoder. In Proceedings of the Eighth International Conference on Complex Networks and Their Applications (COMPLEX NETWORKS 2019), Studies in Computational Intelligence, Vol. 881, pp. 797–809.
- Sun et al. (2016). A general framework for content-enhanced network representation learning. CoRR abs/1610.02906.
- Tang et al. (2015). LINE: large-scale information network embedding. CoRR abs/1503.03578.
- Tu et al. (2017). CANE: context-aware network embedding for relation modeling. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1722–1731.
- Velickovic et al. (2017). Graph attention networks. ArXiv abs/1710.10903.
- Vinh et al. (2010). Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11, pp. 2837–2854.
- Wang et al. (2016). Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, New York, NY, USA, pp. 1225–1234.
- Wu et al. (2019). Simplifying graph convolutional networks. CoRR abs/1902.07153.
- Yang et al. (2015). Network representation learning with rich text information. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pp. 2111–2117.
- Ying et al. (2018). Graph convolutional neural networks for web-scale recommender systems. CoRR abs/1806.01973.
- Zhang et al. (2018). Diffusion maps for textual network embedding. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, USA, pp. 7598–7608.