Graph Neighborhood Attentive Pooling

Network representation learning (NRL) is a powerful technique for learning low-dimensional vector representations of high-dimensional and sparse graphs. Most studies explore the structure and metadata associated with the graph using random walks and employ unsupervised or semi-supervised learning schemes. Learning in these methods is context-free, because only a single representation per node is learned. Recently, studies have questioned the sufficiency of a single representation and proposed a context-sensitive approach that proved to be highly effective in applications such as link prediction and ranking. However, most of these methods rely either on additional textual features that require RNNs or CNNs to capture high-level features, or on a community detection algorithm to identify multiple contexts of a node. In this study, without requiring additional features or a community detection algorithm, we propose a novel context-sensitive algorithm called GAP that learns to attend to different parts of a node's neighborhood using attentive pooling networks. We show the efficacy of GAP using three real-world datasets on link prediction and node clustering tasks and compare it against 11 popular and state-of-the-art (SOTA) baselines. GAP consistently outperforms them and achieves up to 9% and 20% gain on link prediction and clustering tasks, respectively.



1 Introduction

NRL is a powerful technique for learning a representation of a graph. Such a representation gracefully lends itself to a wide variety of network analysis tasks, such as link prediction, node clustering, node classification, recommendation, and so forth.

In most studies, the learning is done in a context-free fashion. That is, the representation of a node characterizes just a single aspect of the node, for instance, its local or global neighborhood. Recently, a complementary line of research has questioned the sufficiency of single representations and considered a context-sensitive approach. Given a node, this approach projects it to different points in a space depending on the other contexts it is coupled with. A context node can be sampled from a neighborhood (Tu et al., 2017; Zhang et al., 2018), a random walk (Ying et al., 2018), and so on. In this study we sample from a node's neighborhood (nodes connected by an edge). Thus, in the learning process of our approach, a source node's representation changes depending on the target (context) node it is accompanied by. Studies have shown that context-sensitive approaches significantly outperform previous context-free SOTA methods on the link prediction task. A related notion (Peters et al., 2018; Devlin et al., 2018) in NLP has significantly improved the SOTA across several NLP tasks.

In this paper we propose Gap (Graph neighborhood attentive pooling), which is inspired by attentive pooling networks (apn) (dos Santos et al., 2016), originally proposed for solving the problem of pair ranking in NLP. For instance, given a question $q$ and a set of answers $\mathcal{A} = \{a_1, \ldots, a_z\}$, an apn can be trained to rank the answers in $\mathcal{A}$ with respect to $q$ by using a two-way attention mechanism. The apn model is based on the prevalent deep learning formula for SOTA NLP, that is, embed, encode, attend, predict (Honnibal, 2018). Given a question-answer pair $(q, a_i)$, the apn model first projects the embedding of the pair using two separate encoders, where each encoder can be a cnn or lstm. The projection helps to capture n-gram context information and/or long-term dependencies in the input sequence. Next, a soft-alignment matrix is obtained by a mutual-attention mechanism that transforms these projections using a parameter matrix. Attention vectors are then computed through column-wise and row-wise pooling operations on the alignment matrix. Finally, the weighted sum of each of the above projections by its respective attention vector is computed to obtain the representations of the question and answer. Each candidate answer $a_i$ is then ranked according to its similarity with the question $q$, computed using the representations of $q$ and $a_i$.

Recently, apn has been applied to context-sensitive NRL by Tu et al. (2017), where the inputs are the textual documents attached to the pair of nodes incident to each edge in a graph. Such an approach, however, carries the added overhead of encoding textual information.

Though we adopt apn in Gap, we capitalize on the graph neighborhood of nodes to avoid the need for textual documents without compromising the quality of the learned representations. Our hypothesis is that one can learn high-quality context-sensitive node representations just by mutually attending to the graph neighborhood of a node and its context node. To achieve this, we naturally assume that the order of nodes in the graph neighborhood of a node is arbitrary. Moreover, we exploit this assumption to simplify the apn model by removing the expensive encode phase.

Akin to textual features in apn, Gap simply uses the graph neighborhood of nodes. That is, for every node in the graph we define a graph neighborhood function to build a fixed-size neighborhood sequence, which specifies the input of Gap. In the apn model, the encoder phase is usually required to capture high-level features such as n-grams, and long-term and order dependencies in textual inputs. As we have no textual features, and due to our assumption that there is no ordering in the graph neighborhood of nodes, we can effectively strip off the encoder. The encoder is the expensive part of apn as it involves an rnn or cnn, and hence Gap can be trained faster than apn.

This simple yet empirically fruitful modification of the apn model enables Gap to achieve SOTA performance on link prediction and node clustering tasks using three real-world datasets. Furthermore, we show empirically that Gap is more than 2 times faster than an apn-like NRL algorithm based on text input. In addition, the simplification in Gap does not introduce new hyper-parameters other than the usual ones, such as the learning rate and sequence length in apn.

2 apn Architecture

For the sake of being self-contained, here we briefly describe the original apn architecture. We are given a pair of natural language texts $(x, y)$ as input, where $x = (w_1, \ldots, w_X)$ and $y = (w_1, \ldots, w_Y)$ are sequences of words of variable lengths, each word $w_i$ is a sample from a vocabulary $V$, $X = |x|$, $Y = |y|$, and $X$ and $Y$ could be different. The apn's forward execution is shown in Fig. 1(A), and in the following we describe each component.


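First, embedding matrices of $x$ and $y$ are constructed through a lookup operation on an embedding matrix $E \in \mathbb{R}^{d \times |V|}$ of words, where $d$ is the embedding dimension. That is, for $x$ and $y$, respectively, embedding matrices $\mathbf{X} \in \mathbb{R}^{d \times X}$ and $\mathbf{Y} \in \mathbb{R}^{d \times Y}$ are constructed by concatenating the embeddings of each word in $x$ and $y$; this is the Embed box in Fig. 1(A).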


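Each embedding matrix is then projected using a cnn or bi-LSTM encoder to capture inherent high-level features; this is the Encode box in Fig. 1(A). More formally, the embedded texts $\mathbf{X}$ and $\mathbf{Y}$ are projected as $\mathbf{X}_{enc} = f(\mathbf{X}, \Theta)$ and $\mathbf{Y}_{enc} = f(\mathbf{Y}, \Theta)$, where $f$ is the encoder (cnn or bi-LSTM), $\Theta$ is the set of parameters of the encoder, and $\mathbf{X}_{enc} \in \mathbb{R}^{c \times X}$ and $\mathbf{Y}_{enc} \in \mathbb{R}^{c \times Y}$, where $c$ is the number of filters or hidden features of the cnn and bi-LSTM, respectively.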


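In the third step, a parameter matrix $P \in \mathbb{R}^{c \times c}$ is introduced so as to learn a similarity or soft-alignment matrix $A \in \mathbb{R}^{X \times Y}$ between the sequence projections $\mathbf{X}_{enc}$ and $\mathbf{Y}_{enc}$ as:

$$A = \tanh\left(\mathbf{X}_{enc}^{\top} \, P \, \mathbf{Y}_{enc}\right)$$

Then unnormalized attention weight vectors $g_x \in \mathbb{R}^{X}$ and $g_y \in \mathbb{R}^{Y}$ are obtained through column-wise and row-wise max-pooling operations on $A$, respectively, as $g_{x,i} = \max(A_{i,:})$ and $g_{y,j} = \max(A_{:,j})$, where $0 \le i < X$, $0 \le j < Y$, and $A_{i,:}$ and $A_{:,j}$ are the $i$-th row and $j$-th column of $A$, respectively. Next, the attention vectors are normalized using softmax, $\sigma_x = \mathrm{softmax}(g_x)$ and $\sigma_y = \mathrm{softmax}(g_y)$. Finally, the normalized attention vectors are used to compute the final representations as $r_x = \mathbf{X}_{enc} \, \sigma_x$ and $r_y = \mathbf{Y}_{enc} \, \sigma_y$.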


In the last step, the representations $r_x$ and $r_y$ are used for ranking, depending on the task at hand. For instance, in a question and answer setting, each candidate answer's representation $r_y$ is ranked based on its similarity score with the question's representation $r_x$.

Figure 1: The forward execution of the apn (A) and Gap (B) models for question answering and context-sensitive node representation, respectively.

3 Gap

Gap adopts the apn model for learning the representations of node pairs in a graph $G = (V, E)$ with a set of nodes $V$ and edges $E$. $G$ can be a directed or undirected and weighted or unweighted graph. Without loss of generality we assume that $G$ is an unweighted directed graph.

We define a neighborhood function $N : V \to 2^{V}$, which maps each node $u \in V$ to a set of nodes $N_u \subseteq V$. A simple way of materializing $N_u$ is to consider the first-order neighbors of $u$, that is, $N_u = [v : (u, v) \in E \lor (v, u) \in E]$. An important assumption that we make about $N_u$ is that the ordering of the nodes in $N_u$ is not important. Gap capitalizes on this assumption to simplify the apn model and achieve SOTA performance. Even though one can explore more sophisticated neighborhood functions, in this study we simply consider the first-order neighborhood.
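As a rough illustration of such a neighborhood function, the sketch below builds fixed-length first-order neighborhood sequences from an edge list. The helper name `build_neighborhoods`, the reserved padding id `0`, collecting neighbors in both edge directions, and truncating neighborhoods longer than `seq_len` are all assumptions made for this example; the text only states that first-order neighbors are used and that shorter neighborhoods are zero-padded.

```python
from collections import defaultdict

PAD = 0  # assumption: id 0 is reserved for padding, real node ids start at 1


def build_neighborhoods(edges, seq_len):
    """Map each node to a fixed-length list of its first-order neighbors.

    Neighbors are collected in both edge directions. Short lists are
    zero-padded; long lists are truncated (truncation is an assumption).
    """
    neighbors = defaultdict(list)
    for u, v in edges:
        if v not in neighbors[u]:
            neighbors[u].append(v)
        if u not in neighbors[v]:
            neighbors[v].append(u)
    fixed = {}
    for node, nbrs in neighbors.items():
        nbrs = nbrs[:seq_len]
        fixed[node] = nbrs + [PAD] * (seq_len - len(nbrs))
    return fixed


edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
seqs = build_neighborhoods(edges, seq_len=3)
print(seqs[3])  # [1, 2, 4]
print(seqs[4])  # [3, 0, 0]
```

Because the order of a neighborhood is assumed to be arbitrary, the insertion order used here carries no meaning; any permutation of the non-padding entries would be an equally valid input.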

Our goal is to learn node representations using the simplified apn based on the node neighborhood function $N$. Hence, akin to the input text pairs in apn, we consider a pair of neighborhood sequences $N_s = (u_1, \ldots, u_{|N_s|})$ and $N_t = (v_1, \ldots, v_{|N_t|})$ associated with a pair of nodes $s$ and $t$, with lengths $|N_s|$ and $|N_t|$. Without loss of generality we consider $|N_s| = |N_t|$. Recall that we assume the order of nodes in $N_s$ and $N_t$ is arbitrary.

Given a source node $s$, we seek to learn multiple context-sensitive embeddings of $s$ with respect to a target node $t$ it is paired with. In principle one can learn using all pairs of nodes; however, that is not scalable, and hence we restrict learning to pairs in $E$.

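Gap’s forward execution model is shown in Fig. 1(B), and learning starts by embedding the neighborhood sequences $N_s$ and $N_t$, respectively, as $\mathbf{S} \in \mathbb{R}^{d \times |N_s|}$ and $\mathbf{T} \in \mathbb{R}^{d \times |N_t|}$. Since there is no order dependency between the nodes in $N_s$ or $N_t$, besides being a neighbor of the respective node, we leave out the cnn or bi-LSTM based projections of $\mathbf{S}$ and $\mathbf{T}$ that could capture such dependencies. In other words, Gap has no encoder.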

Thus, the next step of Gap is mutually attending on the embeddings, $\mathbf{S}$ and $\mathbf{T}$, of the graph neighborhoods of the node pair; this is the Attend box of Fig. 1(B). That is, we employ the trainable parameter matrix $P \in \mathbb{R}^{d \times d}$ and compute the soft-alignment matrix, $A \in \mathbb{R}^{|N_s| \times |N_t|}$, between the neighbors of the source node $s$ and the target node $t$:

$$A = \tanh\left(\mathbf{S}^{\top} \, P \, \mathbf{T}\right) \tag{1}$$

Here $A$ is a soft-alignment matrix between every pair of nodes $(u, v) \in N_s \times N_t$. Therefore, for each axis of $A$, we proceed by pooling the maximum alignment score for each node to obtain the unnormalized attention vectors $g_{s,i} = \max(A_{i,:})$ and $g_{t,j} = \max(A_{:,j})$. As a result of the pooling operations, each neighbor $u_i \in N_s$ of the source node $s$ selects a neighbor $v_j \in N_t$ of the target node $t$ with the maximum alignment or similarity score. A similar selection is done for each $v_j \in N_t$. This enables the source and target neighborhood sequences of the pair to influence each other in order to learn a context-sensitive representation of $s$ and $t$. The normalized attention vectors are then obtained by applying softmax as $\sigma_s = \mathrm{softmax}(g_s)$ and $\sigma_t = \mathrm{softmax}(g_t)$. Ultimately, we compute the context-sensitive representations $r_s$ and $r_t$ of the source and target nodes $s$ and $t$, respectively, as $r_s = \mathbf{S} \, \sigma_s$ and $r_t = \mathbf{T} \, \sigma_t$.
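The embed and attend steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the names `gap_forward`, `E` (an embedding table whose row 0 is a zero padding row) and `P` are assumptions, and the `tanh` nonlinearity follows the attentive pooling formulation of dos Santos et al. (2016). Padded positions are not masked here, which is a simplification.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()


def gap_forward(E, P, ns, nt):
    """Mutual attention over two neighborhood sequences, with no encoder.

    E:  (num_nodes + 1, d) embedding table, row 0 is the padding row.
    P:  (d, d) trainable attention parameter matrix.
    ns, nt: neighbor-id sequences of the source and target node.
    """
    S = E[ns].T                      # (d, |ns|) embedded source neighborhood
    T = E[nt].T                      # (d, |nt|) embedded target neighborhood
    A = np.tanh(S.T @ P @ T)         # (|ns|, |nt|) soft-alignment matrix
    att_s = softmax(A.max(axis=1))   # best alignment per source neighbor
    att_t = softmax(A.max(axis=0))   # best alignment per target neighbor
    return S @ att_s, T @ att_t, A   # attention-weighted sums, each of size d


rng = np.random.default_rng(0)
d, num_nodes = 8, 5
E = rng.normal(scale=0.1, size=(num_nodes + 1, d))
E[0] = 0.0
P = rng.normal(scale=0.1, size=(d, d))

r_s, r_t, A = gap_forward(E, P, ns=[1, 2, 4], nt=[3, 5, 0])
print(r_s.shape, r_t.shape, A.shape)  # (8,) (8,) (3, 3)

# Shuffling the source neighborhood leaves both representations unchanged.
r_s2, r_t2, _ = gap_forward(E, P, ns=[4, 1, 2], nt=[3, 5, 0])
print(np.allclose(r_s, r_s2) and np.allclose(r_t, r_t2))  # True
```

The second call illustrates why dropping the encoder is compatible with the arbitrary-order assumption: max-pooling followed by a weighted sum is invariant to permutations of the neighborhood.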


The objective of Gap is to maximize the likelihood of the graph (edges) by predicting edge similarities using the dot product of the source and target representations, $r_s \cdot r_t$; this is the Predict box of Fig. 1(B). Hence, we employ a hard-margin loss given in Eq. 2:

$$\mathcal{L}(s, t) = \max\left(0, \; 1 - r_s \cdot r_t + r_s \cdot r_{t^-}\right) \tag{2}$$

where $r_{t^-}$ is the representation of a negative target node $t^-$, that is, $(s, t^-) \notin E$. The goal is to learn, in an unsupervised fashion, a context-sensitive embedding of nodes that enables us to rank the positive edges $(s, t) \in E$ higher than the negative pairs $(s, t^-)$.
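A margin-based ranking loss of this kind, together with one simple way of drawing a negative target, might look as follows. The function names, the margin of 1, and uniform sampling of negatives are assumptions of this sketch; the text only requires that the negative target is not linked from the source node.

```python
import random

import numpy as np


def margin_loss(r_s, r_t, r_t_neg, margin=1.0):
    """Hinge loss: the positive pair should score at least `margin`
    higher than the negative pair under dot-product similarity."""
    return max(0.0, margin - float(r_s @ r_t) + float(r_s @ r_t_neg))


def sample_negative_target(s, nodes, edge_set, rng):
    """Draw t_neg uniformly (an assumption) such that (s, t_neg) is not an edge.

    Loops forever if `s` is linked to every other node.
    """
    while True:
        t_neg = rng.choice(nodes)
        if t_neg != s and (s, t_neg) not in edge_set:
            return t_neg


r_s = np.array([1.0, 0.0])
print(margin_loss(r_s, np.array([2.0, 0.0]), np.array([0.5, 0.0])))   # 0.0
print(margin_loss(r_s, np.array([0.25, 0.0]), np.array([0.5, 0.0])))  # 1.25

edge_set = {(1, 2), (1, 3)}
print(sample_negative_target(1, [1, 2, 3, 4], edge_set, random.Random(0)))  # 4
```

In the first call the positive pair already beats the negative pair by more than the margin, so the loss is zero; in the second the negative pair scores higher and the loss is positive.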

Finally, a word on the computational complexity of Gap: it is proportional to the number of edges, as we consider each edge as an input pair.

4 Experimental Evaluation

| Dataset | #Nodes | #Edges | Features |
|---|---|---|---|
| Cora | 2277 | 5214 | Paper abstract |
| Zhihu | 10000 | 43894 | User post |
| Email | 1005 | 25571 | NA |

Table 1: Summary of datasets. The Features column is relevant to some of the baselines, not to Gap.

In this section we provide an empirical evaluation of Gap. To this end, experiments are carried out using the following datasets, and a basic summary is given in Table 1.

  1. Cora (Tu et al., 2017; Zhang et al., 2018) is a citation network dataset, where a node represents a paper and an edge $(u, v) \in E$ represents that paper $u$ has cited paper $v$.

  2. Zhihu (Tu et al., 2017; Zhang et al., 2018) is the biggest social network for Q&A and is based in China. Nodes are the users and the edges are follower relations between the users.

  3. Email (Leskovec et al., 2007) is an email communication network within a large European research institution. A node represents a person and an edge $(u, v) \in E$ denotes that person $u$ has sent an email to $v$.

The first two datasets have features (documents) associated with nodes: paper abstracts for Cora and user posts for Zhihu. Some of the baselines, discussed below, require textual information, and hence they consume these features. The Email dataset has ground-truth community assignments for nodes, based on a person’s affiliation to one of the 42 departments.

We compare our method against the following 11 popular and SOTA baselines, grouped as:

  • Structure based methods: DeepWalk (Perozzi et al., 2014), Node2Vec (Grover and Leskovec, 2016), WalkLets (Perozzi et al., 2016), AttentiveWalk (Abu-El-Haija et al., 2017), Line (Tang et al., 2015)

  • Structure and attribute based methods: TriDnr (Pan et al., 2016), tadw (Yang et al., 2015), cene (Sun et al., 2016)

  • Structure and content based context-sensitive methods: cane (Tu et al., 2017), dmte (Zhang et al., 2018)

  • Structure based context-sensitive method: splitter (Epasto and Perozzi, 2019)

Now we report the experimental results carried out in two tasks, which are link prediction and node clustering. All experiments are performed using a 24-Core CPU and 125GB RAM Ubuntu 18.04 machine.

4.1 Link Prediction

Link prediction is an important task that graph embedding algorithms are applied to. Context-sensitive embedding techniques in particular have proved to be well suited for this task. Similar to existing studies, we perform this experiment using a fraction of the edges as a training set. We hold out the remaining edges from the training phase and reveal them only during the test phase; results are reported on this held-out set. All hyper-parameter tuning is performed by taking a small fraction of the training set as a validation set.


In line with existing techniques (Tu et al., 2017; Zhang et al., 2018), the percentage of training edges ranges from 15% to 95% in steps of 10%. The hyper-parameters of all algorithms are tuned using random search. For some of the baselines, our results are consistent with what is reported in previous studies, and hence for Cora and Zhihu we simply report these results.
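The evaluation protocol above amounts to a random split of the edge list at each training percentage. A minimal sketch is given below; the helper name `split_edges`, the fixed seed, and the toy edge list are assumptions for illustration, and the further split of the training set into a validation part is omitted.

```python
import random


def split_edges(edges, train_fraction, seed=0):
    """Shuffle the edges and split them into a training and a held-out test set."""
    edges = list(edges)
    random.Random(seed).shuffle(edges)
    cut = int(round(train_fraction * len(edges)))
    return edges[:cut], edges[cut:]


edges = [(i, i + 1) for i in range(20)]  # toy path graph with 20 edges
for pct in range(15, 100, 10):           # 15%, 25%, ..., 95%
    train, test = split_edges(edges, pct / 100)
    print(pct, len(train), len(test))
```

Only the training edges would be visible when neighborhoods are built and the model is trained; the held-out edges serve as positive examples at test time.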

Except for the “unavoidable” hyper-parameters (e.g., learning rate, regularization/dropout rate) that are common to all the algorithms, our model has just one hyper-parameter, the neighborhood sequence length (#Neighborhood); for nodes with a smaller neighborhood we use zero padding. As we shall verify later, Gap is not significantly affected by the choice of this parameter.

The quality of the prediction task is measured using the AUC score. AUC indicates the probability that a randomly selected edge $(u, w) \in E$ will have a higher similarity score than a randomly selected pair $(u, z) \notin E$. Similarity between a pair of nodes is computed as the dot product of their representations. For all the algorithms the representation size $d$ is 200, and Gap’s configuration is shown in Table 2.
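Read as a probability, the AUC can be computed directly by comparing every held-out edge score against every non-edge score. The sketch below does exactly that; the function name and the convention of counting ties as one half are assumptions, and the scores stand in for dot products of learned representations.

```python
import numpy as np


def auc_from_scores(pos_scores, neg_scores):
    """Fraction of (edge, non-edge) pairs in which the edge scores higher.

    Ties count as half a win (a common convention, assumed here).
    """
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    wins = (pos > neg).sum()
    ties = (pos == neg).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)


# Two true edges and two non-edges: three of the four comparisons are won.
print(auc_from_scores([0.9, 0.4], [0.5, 0.1]))  # 0.75
```

This pairwise form is quadratic in the number of scored pairs, so for large test sets a rank-based computation would be preferable.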

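| Dataset | #Neighborhood (source and target lengths, $|N_s|$ and $|N_t|$) | Dropout | Learning rate | Representation size |
|---|---|---|---|---|
| Cora | 100 | 0.5 | 0.0001 | 200 |
| Zhihu | 250 | 0.65 | 0.0001 | 200 |
| Email | 100 | 0.8 | 0.0001 | 200 |

Table 2: Configuration of Gap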


The results of the empirical evaluations on the Cora, Zhihu, and Email datasets are reported in Tables 3, 4, and 5. Gap outperforms the SOTA baselines in all cases for Zhihu and Email, and in almost all cases for Cora. One can see that as we increase the percentage of training edges, performance increases for all the algorithms. As indicated by the “Gain” row, Gap achieves up to 9% improvement over SOTA context-sensitive techniques. Notably, the gain is more pronounced when a smaller percentage of edges is used for training. This shows that Gap is suitable both when many links are missing and when most of the links are present.

| Algorithm / % of training edges | 15% | 25% | 35% | 45% | 55% | 65% | 75% | 85% | 95% |
|---|---|---|---|---|---|---|---|---|---|
| DeepWalk | 56.0 | 63.0 | 70.2 | 75.5 | 80.1 | 85.2 | 85.3 | 87.8 | 90.3 |
| Line | 55.0 | 58.6 | 66.4 | 73.0 | 77.6 | 82.8 | 85.6 | 88.4 | 89.3 |
| Node2Vec | 55.9 | 62.4 | 66.1 | 75.0 | 78.7 | 81.6 | 85.9 | 87.3 | 88.2 |
| WalkLets | 69.8 | 77.3 | 82.8 | 85.0 | 86.6 | 90.4 | 90.9 | 92.0 | 93.3 |
| AttentiveWalk | 64.2 | 76.7 | 81.0 | 83.0 | 87.1 | 88.2 | 91.4 | 92.4 | 93.0 |
| tadw | 86.6 | 88.2 | 90.2 | 90.8 | 90.0 | 93.0 | 91.0 | 93.0 | 92.7 |
| TriDnr | 85.9 | 88.6 | 90.5 | 91.2 | 91.3 | 92.4 | 93.0 | 93.6 | 93.7 |
| cene | 72.1 | 86.5 | 84.6 | 88.1 | 89.4 | 89.2 | 93.9 | 95.0 | 95.9 |
| cane | 86.8 | 91.5 | 92.2 | 93.9 | 94.6 | 94.9 | 95.6 | 96.6 | 97.7 |
| dmte | 91.3 | 93.1 | 93.7 | 95.0 | 96.0 | 97.1 | 97.4 | 98.2 | 98.8 |
| splitter | 65.4 | 69.4 | 73.7 | 77.3 | 80.1 | 81.5 | 83.9 | 85.7 | 87.2 |
| Gap | 95.8 | 96.4 | 97.1 | 97.6 | 97.6 | 97.6 | 97.8 | 98.0 | 98.2 |
| Gain % | 4.5% | 3.6% | 3.4% | 2.6% | 1.6% | 0.5% | 0.4% | – | – |

Table 3: AUC score for link prediction on the Cora dataset

| Algorithm / % of training edges | 15% | 25% | 35% | 45% | 55% | 65% | 75% | 85% | 95% |
|---|---|---|---|---|---|---|---|---|---|
| DeepWalk | 56.6 | 58.1 | 60.1 | 60.0 | 61.8 | 61.9 | 63.3 | 63.7 | 67.8 |
| Line | 52.3 | 55.9 | 59.9 | 60.9 | 64.3 | 66.0 | 67.7 | 69.3 | 71.1 |
| Node2Vec | 54.2 | 57.1 | 57.3 | 58.3 | 58.7 | 62.5 | 66.2 | 67.6 | 68.5 |
| WalkLets | 50.7 | 51.7 | 52.6 | 54.2 | 55.5 | 57.0 | 57.9 | 58.2 | 58.1 |
| AttentiveWalk | 69.4 | 68.0 | 74.0 | 75.9 | 76.4 | 74.5 | 74.7 | 71.7 | 66.8 |
| tadw | 52.3 | 54.2 | 55.6 | 57.3 | 60.8 | 62.4 | 65.2 | 63.8 | 69.0 |
| TriDnr | 53.8 | 55.7 | 57.9 | 59.5 | 63.0 | 64.2 | 66.0 | 67.5 | 70.3 |
| cene | 56.2 | 57.4 | 60.3 | 63.0 | 66.3 | 66.0 | 70.2 | 69.8 | 73.8 |
| cane | 56.8 | 59.3 | 62.9 | 64.5 | 68.9 | 70.4 | 71.4 | 73.6 | 75.4 |
| dmte | 58.4 | 63.2 | 67.5 | 71.6 | 74.0 | 76.7 | 78.7 | 80.3 | 82.2 |
| splitter | 59.8 | 61.5 | 61.8 | 62.1 | 62.1 | 62.4 | 61.0 | 60.7 | 58.6 |
| Gap | 72.6 | 77.9 | 81.2 | 80.8 | 81.4 | 81.8 | 82.0 | 83.7 | 86.3 |
| Gain % | 3.2% | 9.9% | 7.2% | 5.1% | 5.0% | 5.1% | 3.3% | 3.4% | 4.1% |

Table 4: AUC score for link prediction on the Zhihu dataset

| Algorithm / % of training edges | 15% | 25% | 35% | 45% | 55% | 65% | 75% | 85% | 95% |
|---|---|---|---|---|---|---|---|---|---|
| DeepWalk | 69.2 | 71.4 | 74.1 | 74.7 | 76.6 | 76.1 | 78.7 | 75.7 | 79.0 |
| Line | 65.6 | 71.5 | 73.8 | 76.0 | 76.7 | 77.8 | 78.5 | 77.9 | 78.8 |
| Node2Vec | 66.4 | 68.6 | 71.2 | 71.7 | 72.7 | 74.0 | 74.5 | 74.4 | 76.1 |
| WalkLets | 70.3 | 73.2 | 75.2 | 78.7 | 78.2 | 78.1 | 78.9 | 80.0 | 78.5 |
| AttentiveWalk | 68.8 | 72.5 | 73.5 | 75.2 | 74.1 | 74.9 | 73.0 | 70.3 | 68.6 |
| splitter | 69.2 | 70.4 | 69.1 | 69.2 | 70.6 | 72.8 | 73.3 | 74.8 | 75.2 |
| Gap | 77.6 | 81.6 | 81.9 | 83.3 | 83.1 | 84.1 | 84.5 | 84.8 | 84.8 |
| Gain % | 7.3% | 8.4% | 6.7% | 4.6% | 4.9% | 6.0% | 5.6% | 4.8% | 5.8% |

Table 5: AUC score for link prediction on the Email dataset

4.2 Node Clustering

Nodes in a network have a tendency to form cohesive structures based on some kind of shared aspects. These structures are usually referred to as groups, clusters, or communities, and identifying them is an important task in network analysis. In this section we use the Email dataset, which has 42 ground-truth communities. Since this dataset has only structural information, we have excluded the baselines that require textual information.


Since each node belongs to exactly one cluster, we employ the Spectral Clustering algorithm to identify clusters. The node representations learned by each algorithm are the input features of the clustering algorithm. In this experiment the percentage of training edges takes the values 25%, 55%, 75%, and 95%; for the rest we use the same configuration as in the above experiment.

Given the ground-truth community assignment $y$ of nodes and the predicted community assignments $\hat{y}$, the agreement between $y$ and $\hat{y}$ is usually measured using mutual information $I(y, \hat{y})$. However, $I$ is not bounded, which makes it difficult to compare methods; hence we use two other variants of $I$ (Vinh et al., 2010). These are the normalized mutual information (NMI), which simply normalizes $I$, and the adjusted mutual information (AMI), which adjusts or normalizes $I$ for random chance.
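To make the normalization concrete, the sketch below computes mutual information and one NMI variant from two label vectors. Normalizing by the arithmetic mean of the two entropies is an assumption of this example (Vinh et al., 2010, discuss several normalizations), and AMI, which additionally subtracts the mutual information expected under random labelings, is not implemented here.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())


def mutual_information(y_true, y_pred):
    """Mutual information (in nats) between two label assignments."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = y_true.size
    mi = 0.0
    for a in np.unique(y_true):
        for b in np.unique(y_pred):
            n_ab = np.sum((y_true == a) & (y_pred == b))
            if n_ab == 0:
                continue  # empty cells contribute nothing
            n_a, n_b = np.sum(y_true == a), np.sum(y_pred == b)
            mi += (n_ab / n) * np.log(n * n_ab / (n_a * n_b))
    return float(mi)


def nmi(y_true, y_pred):
    """MI divided by the mean entropy (one of several possible normalizations)."""
    denom = 0.5 * (entropy(y_true) + entropy(y_pred))
    return mutual_information(y_true, y_pred) / denom if denom > 0 else 1.0


truth = [0, 0, 1, 1]
print(round(nmi(truth, [1, 1, 0, 0]), 6))  # 1.0, same partition up to renaming
print(round(nmi(truth, [0, 1, 0, 1]), 6))  # 0.0, independent partitions
```

The first example shows that the score depends only on the grouping of nodes and not on the cluster ids, which is why it can be applied to the output of a clustering algorithm without any label matching.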

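| Algorithm / % of training edges | 25% NMI | 25% AMI | 55% NMI | 55% AMI | 75% NMI | 75% AMI | 95% NMI | 95% AMI |
|---|---|---|---|---|---|---|---|---|
| DeepWalk | 41.3 | 28.6 | 53.6 | 44.8 | 50.6 | 42.4 | 57.6 | 49.9 |
| Line | 44.0 | 30.3 | 49.9 | 38.2 | 53.3 | 42.6 | 56.3 | 46.5 |
| Node2Vec | 46.6 | 35.3 | 45.9 | 35.3 | 47.8 | 38.5 | 53.8 | 45.5 |
| WalkLets | 47.5 | 39.9 | 55.3 | 47.4 | 54.0 | 45.4 | 50.1 | 41.6 |
| AttentiveWalk | 42.9 | 30.0 | 45.7 | 36.5 | 44.3 | 35.7 | 47.4 | 38.5 |
| splitter | 38.9 | 23.8 | 43.2 | 30.3 | 45.2 | 33.6 | 48.4 | 37.6 |
| Gap | 67.8 | 58.8 | 64.7 | 55.7 | 65.6 | 57.6 | 65.4 | 58.7 |
| Gain % | 20.3% | – | 9.4% | – | 11.0% | – | 7.8% | – |

Table 6: NMI and AMI scores for the node clustering experiment on the Email dataset. The Gain is with respect to the NMI only.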


The results of this experiment are reported in Table 6. Gap significantly outperforms all the baselines, by up to 20% with respect to the NMI score. Consistent with our previous experiment, Gap performs well at both extremes of the percentage of training edges. Similar improvements are achieved for the AMI score.

4.3 Parameter Sensitivity Analysis

Figure 2: Sensitivity of Gap to the size of the source and target nodes’ neighborhoods ($|N_s|$, $|N_t|$) on the link prediction task

Here we first show the sensitivity of Gap to its main hyper-parameter, the size of the neighborhood (#Neighborhood). Figures 2 and 3(a) show the effects of this parameter on the link prediction and node clustering tasks. In both cases we notice that Gap is not significantly affected by changes in its value. For link prediction we show the effect across different percentages of training edges, and for node clustering at a fixed percentage (55%). Regardless of the percentage of training edges, we only see a small change in AUC (Fig. 2), and in NMI and AMI (Fig. 3(a)), across different values of #Neighborhood.

Next, we analyze the run time of training Gap. Our goal is to show the benefit of removing the encoder of apn, and we do this by comparing Gap against cane, which employs the exact apn architecture. For this experiment we include two randomly generated graphs, using the Erdős–Rényi (ERG) and Barabási–Albert (BAG) models. ERG has 200K edges, and BAG has 1.5M edges. Figure 3(b) clearly shows that Gap is at least 2 times faster than cane on all the graphs.
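For readers who want to reproduce a synthetic input of this kind, a random graph with a prescribed number of edges can be sampled with the standard library alone. The sketch below uses the fixed-edge-count variant of the Erdős–Rényi model; the function name, the directed and self-loop-free choice, and the node count in the example are assumptions, since the text reports only the number of edges.

```python
import random


def erdos_renyi_edges(num_nodes, num_edges, seed=0):
    """Sample `num_edges` distinct directed edges without self-loops,
    uniformly at random (the G(n, m) variant of the Erdős–Rényi model)."""
    if num_edges > num_nodes * (num_nodes - 1):
        raise ValueError("more edges requested than the graph can hold")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < num_edges:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if u != v:
            edges.add((u, v))
    return sorted(edges)


# Hypothetical size: 1,000 nodes and 5,000 edges.
toy_graph = erdos_renyi_edges(num_nodes=1000, num_edges=5000)
print(len(toy_graph))  # 5000
```

Rejection sampling of this kind is efficient only while the graph is sparse relative to the number of possible edges, which is the regime of interest for a run-time comparison.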

Figure 3: (a) Sensitivity of Gap to the size of node’s neighborhood on the node clustering task and (b) the run time comparison of Gap and cane using two real and two synthetic datasets

4.4 Ablation Study

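| Baseline / training ratio | Cora 25% | Cora 55% | Cora 75% | Cora 95% | Email 25% | Email 55% | Email 75% | Email 95% |
|---|---|---|---|---|---|---|---|---|
| GapCn | 60 | 60 | 61 | 61 | 74 | 74 | 78 | 79 |
| GapApn | 59 | 60 | 60 | 65 | 74 | 77 | 78 | 78 |
| GapMlp | 56 | 63 | 66 | 73 | 72 | 77 | 78 | 77 |
| Gap | 96 | 97 | 97 | 98 | 81 | 83 | 84 | 84 |

Table 7: AUC results for the variants of Gap using the Cora and Email datasets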

Here we give a brief ablation experiment to strengthen the case for Gap. Concretely, we compare Gap with different baselines to further motivate (i) the way we model nodes using their neighborhood, (ii) the assumption that order in this neighborhood is not important, and (iii) the choice of the apn algorithm without the cnn or bi-LSTM encoder. To this end, we introduce the following baselines; apart from the indicated difference, everything is the same as in Gap.

  1. First, we assume that order is important in the neighborhood of a node, and we order neighbors by their similarity to the current node. We use a topological feature, the number of common neighbors, to induce the order. This baseline is referred to as GapCn and uses the exact apn model to capture “order”.

  2. Second, we use the same input as in Gap, but each node’s neighborhood is now randomly permuted and fed multiple times (concretely, 5 times), and the exact apn model is employed. This baseline is referred to as GapApn.

  3. Finally, we replace Gap’s Attend component with a standard feed-forward neural network that consumes the same input (the embedding matrices $\mathbf{S}$ and $\mathbf{T}$ of the source and target neighborhoods) and has the same learning objective, specified in Eq. 2. This baseline is referred to as GapMlp.
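The ordering used by the first baseline can be illustrated with a small sketch that sorts a node's neighbors by the number of neighbors they share with it. The function name, the descending order, and breaking ties by node id are assumptions made for this example; the text only states that common neighbors induce the order.

```python
def order_by_common_neighbors(node, adjacency):
    """Sort the neighbors of `node` by how many neighbors they share with it,
    most shared first; ties are broken by node id."""
    own = adjacency[node]
    return sorted(own, key=lambda v: (-len(own & adjacency[v]), v))


# Toy undirected graph given as adjacency sets.
adjacency = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3},
}
print(order_by_common_neighbors(1, adjacency))  # [3, 2, 4]
```

Node 3 shares two neighbors with node 1 (nodes 2 and 4), while nodes 2 and 4 share one each, so node 3 comes first in the induced sequence.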

In Table 7 we report the results of the ablation experiment, which shed some light on the assumptions and design choices of Gap. For the reported results, GapCn and GapApn use a cnn encoder. In both cases they quickly overfit the data, and we found that we had to employ aggressive regularization using a dropout rate of 0.95. In addition, for GapCn we observed that as we increase the kernel size beyond 1 (up to 5) the results keep getting worse; hence, what we report is the best result, obtained with a kernel size of 1. For example, with a kernel size of 3 and training ratios of 25, 55, 75, and 95 percent, the AUC scores dropped to 54, 55, 56, and 58 percent, respectively, on the Cora dataset, and to 66, 69, 78, and 77 percent on the Email dataset. We conjecture that this is due to the model’s attempt to enforce high-level neighborhood patterns (e.g., a combination of arbitrary neighbors) that do not intrinsically govern the underlying edge formation phenomena. Rather, what is important is to effectively pay attention to the presence of individual neighbors in both the source and target neighborhoods, $N_s$ and $N_t$, regardless of their order. Training this model is also at least two times slower than training Gap, as illustrated in Section 4.3.

In the case of GapApn, though the variations in AUC are marginal with respect to changes in the kernel size, the training time of this model increased by almost an order of magnitude. Finally, comparing the results of GapMlp and Gap shows that the mutual attention mechanism (Attend component) plays an important role.

5 Related Work

NRL is usually carried out by exploring the structure of the graph and metadata, such as node attributes, attached to the graph (Perozzi et al., 2014; Grover and Leskovec, 2016; Tang et al., 2015; Perozzi et al., 2016; Wang et al., 2016; Yang et al., 2015; Pan et al., 2016; Sheikh et al., 2019; Kefato et al., 2017; Sheikh et al., 2019). Random walks are widely used to explore local/global neighborhood structures, which are then fed into a learning algorithm. The learning is carried out in an unsupervised manner by maximizing the likelihood of observing the neighbor nodes and/or attributes of a center node.

Recently graph convolutional networks have also been proposed for semi-supervised network analysis tasks (Kipf and Welling, 2017; Hamilton et al., 2017; Wu et al., 2019; Velickovic et al., 2017; Abu-El-Haija et al., 2019). These algorithms work by way of aggregating neighborhood features, with a down-stream objective based on partial labels of nodes, for example. All these methods are essentially different from our approach because they are context-free.

Context-sensitive learning is another paradigm for NRL that challenges the sufficiency of a single representation of a node for applications such as link prediction, product recommendation, and ranking. While some of these methods (Tu et al., 2017; Zhang et al., 2018) rely on textual information, others have shown that a similar goal can be achieved using just the structure of the graph (Epasto and Perozzi, 2019). However, the latter require an extra step of persona decomposition that is based on microscopic-level community detection algorithms to identify multiple contexts of a node. Unlike the former, our algorithm does not require extra textual information, and unlike the latter, it does not require any sort of community detection algorithm.

6 Conclusion

In this study we present a novel context-sensitive graph embedding algorithm called Gap. It consumes node neighborhoods as input features, which are constructed based on the important assumption that their ordering is arbitrary. To learn representations of nodes, Gap employs attentive pooling networks (apn). By exploiting the above assumption, it makes an important simplification of apn and gains more than a 2X speed-up over another SOTA method, which employs the exact apn. Furthermore, Gap consistently outperforms all the baselines and achieves up to 9% and 20% improvement over the best performing ones on the link prediction and node clustering tasks, respectively. In the future we will investigate how node attributes can be incorporated and provide a theoretical framework on the relation between neighborhood sampling and topological properties.


References

  • S. Abu-El-Haija, B. Perozzi, R. Al-Rfou, and A. Alemi (2017) Watch your step: learning graph embeddings through attention. CoRR abs/1710.09599.
  • S. Abu-El-Haija, B. Perozzi, A. Kapoor, H. Harutyunyan, N. Alipourfard, K. Lerman, G. V. Steeg, and A. Galstyan (2019) MixHop: higher-order graph convolutional architectures via sparsified neighborhood mixing. CoRR abs/1905.00067.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
  • C. N. dos Santos, M. Tan, B. Xiang, and B. Zhou (2016) Attentive pooling networks. CoRR abs/1602.03609.
  • A. Epasto and B. Perozzi (2019) Is a single embedding enough? learning node representations that capture multiple social contexts. CoRR abs/1905.02138.
  • A. Grover and J. Leskovec (2016) Node2Vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, New York, NY, USA, pp. 855–864. ISBN 978-1-4503-4232-2.
  • W. L. Hamilton, R. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. CoRR abs/1706.02216.
  • M. Honnibal (2018) Embed, encode, attend, predict: the new deep learning formula for state-of-the-art nlp models.
  • Z. T. Kefato, N. Sheikh, and A. Montresor (2017) Mineral: multi-modal network representation learning. In Proc. of the 3rd International Conference on Machine Learning, Optimization and Big Data, MOD’17.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).
  • J. Leskovec, J. Kleinberg, and C. Faloutsos (2007) Graph evolution: densification and shrinking diameters. ACM Trans. Knowl. Discov. Data 1 (1). ISSN 1556-4681.
  • S. Pan, J. Wu, X. Zhu, C. Zhang, and Y. Wang (2016) Tri-party deep network representation. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI’16, pp. 1895–1901. ISBN 978-1-57735-770-4.
  • B. Perozzi, R. Al-Rfou, and S. Skiena (2014) DeepWalk: online learning of social representations. CoRR abs/1403.6652.
  • B. Perozzi, V. Kulkarni, and S. Skiena (2016) Walklets: multiscale graph embeddings for interpretable network classification. CoRR abs/1605.02115.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. CoRR abs/1802.05365.
  • N. Sheikh, Z. T. Kefato, and A. Montresor (2019) Gat2vec: representation learning for attributed graphs. In Journal of Computing.
  • N. Sheikh, Z. T. Kefato, and A. Montresor (2019) A simple approach to attributed graph embedding via enhanced autoencoder. In Proceedings of the Eighth International Conference on Complex Networks and Their Applications (COMPLEX NETWORKS 2019), Studies in Computational Intelligence, Vol. 881, pp. 797–809.
  • X. Sun, J. Guo, X. Ding, and T. Liu (2016) A general framework for content-enhanced network representation learning. CoRR abs/1610.02906.
  • J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei (2015) LINE: large-scale information network embedding. CoRR abs/1503.03578.
  • C. Tu, H. Liu, Z. Liu, and M. Sun (2017) CANE: context-aware network embedding for relation modeling. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1722–1731.
  • P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2017) Graph attention networks. ArXiv abs/1710.10903.
  • N. X. Vinh, J. Epps, and J. Bailey (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11, pp. 2837–2854.
  • D. Wang, P. Cui, and W. Zhu (2016) Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, New York, NY, USA, pp. 1225–1234. ISBN 978-1-4503-4232-2.
  • F. Wu, T. Zhang, A. H. S. Jr., C. Fifty, T. Yu, and K. Q. Weinberger (2019) Simplifying graph convolutional networks. CoRR abs/1902.07153.
  • C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Y. Chang (2015) Network representation learning with rich text information. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pp. 2111–2117. ISBN 978-1-57735-738-4.
  • R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec (2018) Graph convolutional neural networks for web-scale recommender systems. CoRR abs/1806.01973.
  • X. Zhang, Y. Li, D. Shen, and L. Carin (2018) Diffusion maps for textual network embedding. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, USA, pp. 7598–7608.