Adaptive Edge Features Guided Graph Attention Networks

09/07/2018 · by Liyu Gong, et al. · University of Kentucky

Edge features contain important information about graphs. However, current state-of-the-art neural network models designed for graph learning do not consider incorporating edge features, especially multi-dimensional edge features. In this paper, we propose an attention mechanism which combines both node features and edge features. Guided by the edge features, the attention mechanism on a pair of graph nodes will not only depend on node contents, but also adjust automatically with respect to the properties of the edge connecting these two nodes. Moreover, the edge features are adjusted by the attention function and fed to the next layer, which means our edge features are adaptive across network layers. As a result, our proposed adaptive edge features guided graph attention model can consolidate a rich source of graph information that current state-of-the-art graph learning methods cannot. We apply our proposed model to graph node classification, and experimental results on three citation network datasets and a biological network dataset show that our method outperforms the current state-of-the-art methods, testifying to the discriminative capability of edge features and the effectiveness of our adaptive edge features guided attention model. An additional ablation study further shows that both the edge features and the adaptiveness components contribute to our model.


Introduction

Neural networks, especially deep neural networks, have become one of the most successful machine learning techniques in recent years. In many important problems, they achieve state-of-the-art performance, e.g., convolutional neural networks (CNN) [Lecun et al.1998] in image recognition, recurrent neural networks (RNN) [Elman1990] and Long Short Term Memory (LSTM) [Hochreiter and Schmidhuber1997] in natural language processing, etc. In the real world, many problems can be better and more naturally modeled with graphs rather than conventional tables, grid-type images or time sequences. Generally, a graph contains nodes and edges, where nodes represent entities in the real world, and edges represent interactions or relationships between entities. For example, a social network naturally models users as nodes and friendship relationships as edges. For each node, there is often an associated feature vector describing it, e.g. a user's profile in a social network. Similarly, each edge is also often associated with features depicting relationship strengths or other properties. Due to their complex structures, a challenge in machine learning on graphs is to find effective ways to incorporate different sources of information contained in graphs into machine learning models such as neural networks.

(a) GAT
(b) EGAT
Figure 1: Schematic illustration of the proposed EGAT network. Compared with the GAT network in (a), EGAT in (b) improves on it in three respects. Firstly, the adjacency matrix in GAT is a binary matrix that merely indicates the neighborhood of each node; on the contrary, EGAT replaces it with real-valued edge features which may exploit the weights of edges. Secondly, the adjacency matrix $A$ in GAT is an $N \times N$ matrix ($N$ being the number of nodes), so each edge is associated with only one binary value; whereas the input to EGAT is an $N \times N \times P$ tensor, which means each edge is usually associated with a multi-dimensional feature vector. Lastly, in GAT the original adjacency matrix $A$ is fed to every layer; in contrast, the edge features in EGAT are adjusted by each layer before being fed to the next layer.

Recently, several models of neural networks have been developed for graph learning, which obtain better performance than traditional techniques. Inspired by the graph Fourier transform, Defferrard et al. [Defferrard, Bresson, and Vandergheynst2016] propose a graph convolution operator as an analogue to standard convolutions used in CNN. Just as the convolution operation in the image spatial domain is equivalent to multiplication in the frequency domain, convolution operators defined by polynomials of the graph Laplacian are equivalent to filtering in the graph spectral domain. Particularly, by applying Chebyshev polynomials to the graph Laplacian, spatially localized filtering is obtained. Kipf et al. [Kipf and Welling2017] approximate the polynomials using a renormalized first-order adjacency matrix to obtain comparable results on graph node classification. These graph convolutional networks (GCNs) [Defferrard, Bresson, and Vandergheynst2016][Kipf and Welling2017] combine graph node features and graph topological structure information to make predictions. Velickovic et al. [Velickovic et al.2018] adopt an attention mechanism for graph learning, and propose the graph attention network (GAT). Unlike GCNs, which use a fixed or learnable polynomial of the Laplacian or adjacency matrix to aggregate (filter) node information, GAT aggregates node information by using an attention mechanism on graph neighborhoods. The essential difference between GAT and GCNs is stark: in GCNs the weights for aggregating (filtering) neighbor nodes are defined by the graph structure, which is independent of node contents; in contrast, the weights in GAT are a function of node contents, thanks to the attention mechanism. Results on graph node classification show that the adaptiveness of GAT makes it more effective at fusing information from node features and graph topological structures.
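For reference, the layer-wise propagation rule of the GCN of [Kipf and Welling2017] takes the standard form $H^{(l+1)} = \sigma\big(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\big)$, with $\tilde{A} = A + I_N$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, where $H^{(l)}$ is the node feature matrix at layer $l$ and $W^{(l)}$ is a trainable weight matrix; the aggregation weights are fixed by the (renormalized) adjacency matrix and do not depend on node contents, which makes the contrast with attention-based aggregation concrete.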

One major problem in current graph neural network models such as GAT and GCNs is that edge features are not incorporated. In GAT, graph topological information is injected into the model by forcing the attention coefficient between two nodes to zero if they are not connected. Therefore, the edge information used in GAT is only the indication of whether there is an edge or not, i.e. connectivity. However, graph edges are often in possession of rich information such as strengths, types, etc. Instead of being a binary indicator variable, edge features could be continuous, e.g. strengths, or multi-dimensional. Properly addressing this problem is likely to benefit many graph learning problems. Another problem of GAT and GCNs is that each GAT or GCN layer filters node features based on the original input adjacency matrix. The original adjacency matrix is likely to be noisy and not optimal, which restricts the effectiveness of the filtering operation.

In this paper, we propose a model of adaptive edge features guided graph attention networks (EGAT) for graph learning. In EGAT, the attention function depends on both node features and edge features. Its value will be adaptive to not only the feature vectors of a pair of nodes, but also the possibly multi-dimensional feature vector of the edge between this pair of nodes. We conduct experiments on several commonly used data sets including three citation datasets. Simply using the citation direction as additional edge features, we observed significant improvement compared with current state-of-the-art methods on all these citation datasets. We also conduct experiments on a weighted biological dataset. The results confirm that edge features are important for graph learning, and our proposed EGAT model effectively incorporates edge features.

As a summary, the novelties of our proposed EGAT model include the following:

  • New multi-dimensional edge features guided graph attention mechanism. We generalize the attention mechanism in GAT to incorporate multi-dimensional real-valued edge features. The new edge features guided attention mechanism eliminates the limitation of GAT which can handle only binary edge indicators.

  • Attention based edge adaptiveness across neural network layers. Based on the edge features guided graph attention mechanism, we design a new graph network architecture which can not only filter node features but also adjust edge features across layers. Therefore, the edges are adaptive to both the local contents and the global layers when passing through the layers of the neural network.

  • Multi-dimensional edge features for directed edges. We propose a method to encode edge directions as multi-dimensional edge features. Therefore, our EGAT can effectively learn from directed graph data.

Related works

The challenge of graph learning is the complex non-Euclidean structure of graph data. To address this issue, traditional machine learning approaches extract graph statistics (e.g. degrees) [Bhagat, Cormode, and Muthukrishnan2011], kernel functions [Vishwanathan et al.2010][Shervashidze et al.2011] or other hand-crafted features which measure local neighborhood structures. Those methods are limited in their flexibility in that designing sensible hand-crafted features is time consuming and extensive experiments are needed to generalize to different tasks or settings. Instead of extracting structural information using hand-engineered statistics, graph representation learning attempts to embed graphs or graph nodes in a low-dimensional vector space using a data-driven approach. One kind of embedding approach is based on matrix factorization, e.g. Laplacian Eigenmaps (LE) [Belkin and Niyogi2001], the Graph Factorization (GF) algorithm [Ahmed et al.2013], GraRep [Cao, Lu, and Xu2015], and HOPE [Ou et al.2016]. Another class of approaches focuses on employing a flexible, stochastic measure of node similarity based on random walks, e.g. DeepWalk [Perozzi, Al-Rfou, and Skiena2014], node2vec, LINE [Tang et al.2015], HARP [Chen et al.2018], etc. There are several limitations in matrix-factorization-based and random-walk-based graph learning approaches. First, the embedding function which maps to the low-dimensional vector space is either linear or otherwise simple, so that complex patterns cannot be captured. Second, they do not incorporate node features. Finally, they are inherently transductive because the whole graph structure is required in the training phase.

Recently these limitations in graph learning have been addressed by adopting new advances in deep learning. Deep learning with neural networks can represent complex mapping functions and be efficiently optimized by gradient-descent methods. To embed graph nodes in a Euclidean space, deep autoencoders have been adopted to extract connectivity patterns from the node similarity matrix or adjacency matrix, e.g. Deep Neural Graph Representations (DNGR) [Cao, Lu, and Xu2016] and Structural Deep Network Embeddings (SDNE) [Wang, Cui, and Zhu2016]. Although autoencoder-based approaches are able to capture more complex patterns than matrix-factorization-based and random-walk-based methods, they are still unable to leverage node features.

With the success of CNN in image recognition, there has recently been a growing interest in adapting convolutions to graph learning. In [Bruna et al.2013], the convolution operation is defined in the Fourier domain, that is, the eigenspace of the graph Laplacian. The method is afflicted by two major problems: first, the eigendecomposition is computationally intensive; second, filtering in the Fourier domain may result in non-spatially-localized effects. In [Henaff, Bruna, and LeCun2015], a parameterization of the Fourier filter with smooth coefficients is introduced to make the filter spatially localized. [Defferrard, Bresson, and Vandergheynst2016] proposes to approximate the filters by using a Chebyshev expansion of the graph Laplacian, which produces spatially localized filters and also avoids computing the eigenvectors of the Laplacian.

Attention mechanisms have been widely employed in many sequence-based tasks [Bahdanau, Cho, and Bengio2014][Zhou et al.2017][Kim et al.2018]. Compared with convolution operators, attention mechanisms enjoy two benefits: firstly, they are able to aggregate any variable-sized neighborhood or sequence; furthermore, the weights for aggregation are functions of the contents of the neighborhood or sequence, so they are adaptive to the contents. [Velickovic et al.2018] applies an attention mechanism to graph learning and proposes the graph attention network (GAT), which obtains current state-of-the-art performance on several graph node classification problems.

Edge feature guided graph attention networks

Notations

Given a graph with $N$ nodes, let $\mathbf{x}_i$ be the $F$-dimensional feature vector of the $i$th node. Let $\mathbf{e}_{ij}$ be the $P$-dimensional feature vector of the edge connecting the $i$th and $j$th nodes. Specifically, we denote the $p$th channel of the edge feature in $\mathbf{e}_{ij}$ as $e_{ijp}$. Without loss of generality, we set $\mathbf{e}_{ij} = \mathbf{0}$ to mean that there is no edge between the $i$th and $j$th nodes. Let $\mathcal{N}_i$ be the set of neighboring nodes of node $i$.

Let $X$ be an $N \times F$ matrix representation of the node features of the whole graph. Similarly, let $E$ be an $N \times N \times P$ tensor representing the edge features of the graph.
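As a concrete illustration of this data layout, consider the following minimal NumPy sketch (the array names and toy sizes are illustrative only):

```python
import numpy as np

N, F, P = 5, 16, 2          # toy sizes: nodes, node-feature dimension, edge-feature channels

X = np.random.rand(N, F)    # node feature matrix: one F-dimensional row per node
E = np.zeros((N, N, P))     # edge feature tensor; E[i, j] == 0 means "no edge between i and j"

E[0, 1] = [1.0, 0.3]        # a P-dimensional feature vector for the edge (0, 1)
E[1, 2] = [1.0, 0.7]

# neighborhood of node i: all j with a non-zero edge feature vector
neighbors = lambda i: np.nonzero(E[i].any(axis=-1))[0]
print(neighbors(0))         # -> [1]
```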

Architecture overview

Our proposed network has a feedforward architecture. The inputs are $X^0 = X$ and $E^0 = E$. After passing through the first EGAT layer, $X^0$ is filtered to produce a new node feature matrix $X^1$; in the meantime, the edge features are adjusted to $E^1$, which preserves the dimensionality of $E^0$. The adapted $E^1$ is fed as edge features to the next layer. This procedure is repeated for every subsequent layer. Within each hidden layer, non-linear activations can be applied to the filtered node features. The node features $X^l$ output by layer $l$ can be considered as an embedding of the graph nodes in an $F^l$-dimensional space. For a node classification problem, a softmax operator is applied to the output of the last layer, and the weights of the network are trained with supervision from the ground truth labels. Figure 1 gives a schematic illustration of the EGAT network with a comparison to the GAT network.
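This feedforward flow can be summarized by the following minimal Python sketch, where `layer` stands for any callable implementing the EGAT layer described in the next subsection (names are illustrative, not from our code):

```python
def egat_forward(X, E, layers):
    """Pass node features X and edge features E through a stack of EGAT layers.

    Each layer returns filtered node features and adapted edge features;
    the adapted edge features are fed to the next layer, unlike GAT/GCN,
    which reuse the original adjacency matrix at every layer.
    """
    for layer in layers:
        X, E = layer(X, E)   # X: (N, F_l) node embedding, E: (N, N, P) adapted edge features
    return X, E              # the final X feeds a softmax head for node classification
```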

Edge features guided attention

For a single EGAT layer, let $X$ and $\hat{X}$ be the input and output node feature matrices with sizes $N \times F$ and $N \times \hat{F}$, respectively. Each row represents the feature vector of a specific node. Let $E$ and $\hat{E}$ be the input and output edge feature tensors, respectively, both with size $N \times N \times P$. Let $\mathbf{x}_i$ and $\hat{\mathbf{x}}_i$ be the input and output feature vectors of the $i$th node, respectively. With our new attention mechanism, $\hat{\mathbf{x}}_i$ will be aggregated from the feature vectors of the neighboring nodes of the $i$th node, i.e. $\{\mathbf{x}_j : j \in \mathcal{N}_i\}$, by simultaneously incorporating the corresponding edge features. The aggregation operation is defined as follows:

$\hat{\mathbf{x}}_i = \sigma\Big[ \big\Vert_{p=1}^{P} \sum_{j \in \mathcal{N}_i} \alpha_{ijp}\, g(\mathbf{x}_j) \Big]$   (1)

where $\sigma$ is a non-linear activation and $\Vert$ denotes concatenation over the $P$ edge feature channels. $g$ is a transformation which maps the node features from the input space to the output space. Usually, a linear mapping is used:

$g(\mathbf{x}_j) = W \mathbf{x}_j$   (2)

where $W$ is an $\hat{F} \times F$ matrix.

In Eq. (1), $\alpha_{ijp}$ is the so-called attention coefficient, which is a function of $\mathbf{x}_i$, $\mathbf{x}_j$ and $e_{ijp}$, the $p$th feature channel of the edge connecting the two nodes. In previous attention mechanisms, the attention coefficient depends on the two points $\mathbf{x}_i$ and $\mathbf{x}_j$ only. Here we let the attention operation be guided by the features of the edge connecting the two nodes, so $\alpha_{ijp}$ depends on the edge features as well. For multi-dimensional edge features, we consider them as multi-channel signals, and each channel guides a separate attention operation. The results from different channels are combined by the concatenation operation. For a specific channel $p$ of the edge features, our attention function is chosen to be the following:

$\alpha_{ijp} = \dfrac{f(\mathbf{x}_i, \mathbf{x}_j)\, e_{ijp}}{N_{ip}}$   (3)
$N_{ip} = \sum_{k \in \mathcal{N}_i} f(\mathbf{x}_i, \mathbf{x}_k)\, e_{ikp}$   (4)

where $N_{ip}$ is a normalization term for the $i$th node at the $p$th edge feature channel, and $f(\cdot, \cdot)$ could be any ordinary attention function. In this paper, we use a linear function as the attention function for simplicity:

$f(\mathbf{x}_i, \mathbf{x}_j) = \exp\big( L\big( \mathbf{a}^T [\, g(\mathbf{x}_i) \,\Vert\, g(\mathbf{x}_j) \,] \big) \big)$   (5)

where $L$ is the LeakyReLU activation function; $g$ is the same mapping as in (2); $\Vert$ is the concatenation operation.

The attention coefficients will be used as the new edge features for the next layer, i.e. $\hat{E}_{ijp} = \alpha_{ijp}$, where $\hat{E}_{ijp}$ is the entry of $\hat{E}$ at position $(i, j, p)$.
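To make Eqs. (1)-(5) concrete, the following dense NumPy sketch implements one forward pass of such a layer under the notation above (the function and variable names are illustrative; a practical implementation would use sparse operations and trainable parameters):

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def egat_layer(X, E, W, a):
    """One EGAT layer (dense sketch of Eqs. (1)-(5)).

    X: (N, F) input node features      W: (F_out, F) linear map g
    E: (N, N, P) input edge features   a: (2 * F_out,) attention vector
    Returns filtered node features (N, P * F_out) and adapted edge features (N, N, P).
    """
    N, _, P = E.shape
    G = X @ W.T                                        # g(x_j) = W x_j for all nodes, (N, F_out)

    # raw pairwise attention f(x_i, x_j) = exp(LeakyReLU(a^T [g(x_i) || g(x_j)]))
    pair = np.concatenate([np.repeat(G[:, None, :], N, axis=1),
                           np.repeat(G[None, :, :], N, axis=0)], axis=-1)   # (N, N, 2*F_out)
    f = np.exp(leaky_relu(pair @ a))                   # (N, N)

    # edge-feature-guided coefficients, one set per channel (Eqs. (3)-(4))
    alpha = f[:, :, None] * E                          # multiply by e_{ijp}; zero where no edge
    alpha = alpha / (alpha.sum(axis=1, keepdims=True) + 1e-10)   # normalize over neighbors

    # aggregate per channel, then concatenate the channels (Eq. (1)); ELU as the non-linearity
    out = np.concatenate([alpha[:, :, p] @ G for p in range(P)], axis=-1)
    X_new = np.where(out > 0, out, np.exp(out) - 1)    # ELU

    return X_new, alpha                                # alpha doubles as the adapted edge features
```

A toy call might be `egat_layer(X, E, W=np.random.rand(8, X.shape[1]), a=np.random.rand(16))`, giving 8 output features per edge channel.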

Other feasible choices of attention function

Our attention function in (3)(4) is designed to have a factorized form for ease of computation and use. The factor term $f(\mathbf{x}_i, \mathbf{x}_j)$ could be replaced with any feasible attention function. A list of possible choices has been summarized in [Wang et al.2017]. For example, an alternative to (5) is an embedded Gaussian:

$f(\mathbf{x}_i, \mathbf{x}_j) = e^{\theta(\mathbf{x}_i)^T \phi(\mathbf{x}_j)}$   (6)
$\theta(\mathbf{x}_i) = W_\theta\, \mathbf{x}_i$   (7)
$\phi(\mathbf{x}_j) = W_\phi\, \mathbf{x}_j$   (8)

where $\theta$ and $\phi$ are two different linear embeddings. Essentially, the embedded Gaussian is a bilinear function of $\mathbf{x}_i$ and $\mathbf{x}_j$. This can be clearly seen if we let $\tilde{W} = W_\theta^T W_\phi$; then $f(\mathbf{x}_i, \mathbf{x}_j) = e^{\mathbf{x}_i^T \tilde{W} \mathbf{x}_j}$. A bilinear function is more flexible than the linear function used in (5), but it also requires more trainable parameters, which might be more difficult to train if the dataset is small. Which one to use should be decided according to the specific application and task.
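Under the same assumptions as the sketch above, this variant only changes the pairwise function $f$; a drop-in replacement could look as follows (names are illustrative):

```python
import numpy as np

def f_embedded_gaussian(X, W_theta, W_phi):
    """Bilinear alternative to Eq. (5): f(x_i, x_j) = exp(theta(x_i)^T phi(x_j))."""
    theta = X @ W_theta.T          # (N, D) linear embedding of x_i
    phi = X @ W_phi.T              # (N, D) linear embedding of x_j
    return np.exp(theta @ phi.T)   # (N, N) pairwise attention scores
```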

Another way of designing $f$ is to use different functions for different edge feature channels. For example, by using different parameters $\mathbf{a}_p$ for different edge channels, we can get

$f_p(\mathbf{x}_i, \mathbf{x}_j) = \exp\big( L\big( \mathbf{a}_p^T [\, g(\mathbf{x}_i) \,\Vert\, g(\mathbf{x}_j) \,] \big) \big)$   (9)

Compared with the bilinear function in (6), the per-channel linear function in (9) does not increase the number of trainable parameters by much. We tested (9) on the Citeseer dataset and observed a slight accuracy improvement.

Edge features for directed graph

In the real world, many graphs are directed, i.e. each edge has a direction associated with it. Oftentimes, the edge direction contains important information about the graph. For example, in a citation network, machine learning papers sometimes cite mathematics papers or other theoretical papers, yet mathematics papers may seldom cite machine learning papers. In many previous works, including GCNs and GAT, edge directions are not considered; in their experiments, directed graphs such as citation networks are treated as undirected graphs. In this paper, we argue that discarding edge directions loses important information. Instead, we view edge directions as a kind of edge feature, and encode each directed edge channel into three channels: a forward channel, a backward channel, and an undirected channel representing the union of the two.

Therefore, each directed channel is augmented to three channels. Note that the three channels define three types of neighborhoods: forward, backward and undirected. As a result, EGAT will aggregate node information from these three different types of neighborhoods, which together contain the direction information. Taking the citation network as an instance, EGAT will apply the attention mechanism on the papers that a specific paper cites, the papers that cite this paper, and the union of the former two. With this kind of edge features, different patterns in different types of neighborhoods can be effectively captured.
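A minimal NumPy sketch of this augmentation reads as follows (the exact encoding may differ in detail; here the undirected channel is taken as the element-wise maximum of the forward and backward channels):

```python
import numpy as np

def encode_directions(E):
    """Expand each directed edge channel into forward / backward / undirected channels.

    E: (N, N, P) directed edge features.  Returns an (N, N, 3 * P) tensor whose
    channels define the three neighborhood types per original channel.
    """
    forward = E                                  # edges leaving node i (papers this paper cites)
    backward = np.transpose(E, (1, 0, 2))        # edges entering node i (papers citing this paper)
    undirected = np.maximum(forward, backward)   # union of the two neighborhoods
    return np.concatenate([forward, backward, undirected], axis=-1)
```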

Node features for graphs without node attributes

Although the network architecture illustrated in Figure 1 requires a node feature matrix $X$ as input, it can also handle graphs without node attributes. For such graph data, graph statistics such as node degrees can be extracted and used as node features. An alternative choice is to assign each node a one-hot indicator vector as its node features, i.e. $X = I$, the $N \times N$ identity matrix. Either way, the graph topological structure will be embedded into the node representations as hidden layer outputs of the neural network. In our experiments, we use one-hot indicator vectors as node features for graphs without node attributes, but the other ways mentioned above can also be used.
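Both options admit one-line constructions; a minimal sketch (names illustrative):

```python
import numpy as np

def one_hot_node_features(num_nodes):
    """Identity matrix as node features: row i is the indicator vector of node i."""
    return np.eye(num_nodes)

def degree_node_features(E):
    """Alternative: node degrees (counts of non-zero edge feature vectors) as a 1-D feature."""
    return E.any(axis=-1).sum(axis=1, keepdims=True).astype(float)
```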

Input: node features $X$, edge features $E$, neighborhoods $\{\mathcal{N}_i\}$, and node labels of the training set
while training epoch is less than the maximum number of epochs do
       for $l$ = 1 to $L$ do
              Calculate $E^l$ from $X^{l-1}$ and $E^{l-1}$ by Eq. (3)
              Calculate $X^l$ from $X^{l-1}$ and $E^{l-1}$ by Eq. (1)
       end for
       Back-propagate the loss gradient computed from $X^L$ and the training labels, and update the weights
       if the early stopping condition is satisfied then
              Terminate
       end if
end while
Algorithm 1: EGAT training algorithm for graph node classification

Graph node classification

A multi-layer EGAT network can be used to solve graph node classification problems. For a $c$-class classification problem, we just need to let the weight matrix of the last layer have $c$ output dimensions, replace the concatenation operation with an averaging operation, and apply a softmax activation. Then an appropriate classification loss function, such as cross entropy, can be used to train the EGAT parameters. In a typical graph node classification problem, only a part of the nodes are labeled, so the loss needs to be defined on the predicted values of these labeled nodes. The training process is illustrated in Algorithm 1.
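For illustration, the semi-supervised loss over labeled nodes can be sketched in a few lines of NumPy (names are illustrative, not from our implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def masked_cross_entropy(logits, labels, train_mask):
    """Cross-entropy computed over labeled (training) nodes only.

    logits: (N, c) last-layer outputs (per-channel results averaged, not concatenated)
    labels: (N,) integer class labels; train_mask: (N,) boolean mask of labeled nodes
    """
    probs = softmax(logits)
    nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-10)
    return nll[train_mask].mean()
```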

Relation to GAT and GCN

One major contribution of EGAT is that it can handle multi-dimensional real-valued edge features, while in GAT and GCN only a binary edge indicator is considered for each edge. Therefore, EGAT is able to exploit richer information about the input graph.

Another major difference between EGAT and GAT/GCNs is that the edge features are adapted by each layer. In GAT and GCNs, the original adjacency matrix is fed to every layer of the neural network. In EGAT, the original edge features are fed to the first layer only; the edge features adapted by one layer are then used as the input edge features of the next layer.

The attention function in (5) is a generalization of the attention function in GAT, which can be expressed as

$\alpha_{ij} = \dfrac{f(\mathbf{x}_i, \mathbf{x}_j)}{\sum_{k \in \mathcal{N}_i} f(\mathbf{x}_i, \mathbf{x}_k)}$   (10)
$f(\mathbf{x}_i, \mathbf{x}_j) = \exp\big( L\big( \mathbf{a}^T [\, g(\mathbf{x}_i) \,\Vert\, g(\mathbf{x}_j) \,] \big) \big)$   (11)

Compared with (11), our attention function is more flexible in two respects. Firstly, we multiply the edge features with the output of $f$, which utilizes possibly continuous edge features rather than a binary indicator. Secondly, we combine the attention coefficients guided by different channels by concatenation. Concatenation has been used in DenseNet architectures [Huang et al.2017] to combine image features at different scales, which proves to be more effective than the addition used by the skip connections in ResNet [He et al.2016] in many applications. It is natural to adopt concatenation to combine different channels of edge signals. If there is only one channel of binary edge features, indicating merely whether two nodes are connected without direction, then our EGAT model reduces to GAT. In this sense, GAT may be regarded as a special case of EGAT.

Experimental results

Implementation details

Following the experimental settings of [Kipf and Welling2017][Velickovic et al.2018], we use two layers of EGAT in all of our experiments for fair comparison. We test EGAT on two citation networks (directed graphs) with node attributes, one citation network (directed graph) without node attributes, and a biological network (weighted graph), and compare EGAT with GAT. Throughout the experiments, we use the Adam optimizer [Kingma and Ba2015]. An early stopping strategy is adopted for the three citation networks; i.e. we stop training if the validation loss does not decrease for a fixed number of consecutive epochs. For the biological network, we increase the stopping patience. The same output dimension of node features is used for all hidden layers. Furthermore, we apply dropout [Srivastava et al.2014] to both the input features and the normalized attention coefficients. $L_2$ regularization with weight decay is applied to the weights of the model (i.e. $W$ and $\mathbf{a}$). Moreover, the exponential linear unit (ELU) [Clevert, Unterthiner, and Hochreiter2016] is employed as the nonlinear activation for hidden layers.

The algorithms are implemented in Python within the TensorFlow framework [Abadi et al.2016]. Because the edge and node features in our experimental datasets are highly sparse, we further utilize the sparse tensor functionality of TensorFlow to reduce the memory requirement and computational complexity. Thanks to the sparse implementation, all the datasets can be efficiently handled by an Nvidia Tesla K40 graphics card with 12 gigabytes of graphics memory.
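For reference, one way to store a single sparse edge-feature channel and use it for aggregation is sketched below in TensorFlow-2 style (our original code is not reproduced here; this is only an illustrative usage of the sparse functionality):

```python
import numpy as np
import tensorflow as tf

# one edge-feature channel stored sparsely: only non-zero entries are kept
indices = np.array([[0, 1], [1, 2], [2, 0]], dtype=np.int64)   # (i, j) pairs with an edge
values = np.array([1.0, 0.3, 0.7], dtype=np.float32)           # corresponding e_{ijp}
E_p = tf.sparse.SparseTensor(indices, values, dense_shape=[3, 3])

G = tf.random.normal([3, 8])                         # transformed node features g(x_j)
aggregated = tf.sparse.sparse_dense_matmul(E_p, G)   # weighted neighborhood aggregation
```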

Citation networks

Datasets

To test the effectiveness of our proposed model, we apply it to the network node classification problem. Three datasets are tested: Citeseer [Sen et al.2008], Pubmed [Namata et al.2012] and Konect-cora [Subelj and Bajec2013]. Some basic statistics about these datasets are listed in Table 1. The graph of the Citeseer dataset, which has 3,327 nodes, is small compared with the Pubmed and Konect-cora datasets, which have 19,717 and 23,166 nodes, respectively.

Citeseer Pubmed Konect-cora
# Nodes 3327 19717 23166
# Edges 4732 44338 91500
# Node Features 1433 3703 0
# Classes 6 3 10
Table 1: Summary of citation network datasets

All three datasets are directed graphs, where edge directions represent the directions of citations. Two of the datasets, Citeseer and Pubmed, contain node attributes corresponding to bag-of-words text features. However, nodes in the Konect-cora dataset do not have any attributes, which makes it suitable for testing the effectiveness of our EGAT at embedding the topological structure of graphs.

The Citeseer and Pubmed datasets are also used in [Yang, Cohen, and Salakhudinov2016][Kipf and Welling2017][Velickovic et al.2018]. However, they all use a pre-processed version which discards the edge directions. Since our EGAT model requires the edge directions to construct edge features, we use the original versions from [Sen et al.2008] and [Namata et al.2012]. For each of the three datasets, we split the nodes into subsets for training, validation and testing.

Performance

Method Accuracy
GAT
EGAT-UD (Ours)
EGAT-NA (Ours)
EGAT (Ours)
(a) Citeseer
Method Accuracy
GAT
EGAT-UD (Ours)
EGAT-NA (Ours)
EGAT (Ours)
(b) Pubmed
Method Accuracy
GAT
EGAT-UD (Ours)
EGAT-NA (Ours)
EGAT (Ours)
(c) Konect-cora
Table 2: Performances on citation networks.
(a) Citeseer
(b) Pubmed
(c) Konect-cora
Figure 2: Training and validation losses of EGAT and GAT on the citation networks. In (b) and (c), EGAT converges quickly.

For Citeseer and Pubmed, we use the node attributes as node features. Node features of every node are normalized so that they sum to one. For Konect-cora, we use one-hot indicators as described previously as node features. For all the three datasets, we use the edge features described previously.

To separately test the effectiveness of the edge features and of the layer-wise edge adaptiveness, we implemented three versions of EGAT: undirected-edges EGAT (EGAT-UD), non-adaptive EGAT (EGAT-NA) and the full EGAT. EGAT-UD discards the edge directions and uses the symmetric adjacency matrix as the edge feature matrix, similarly to GAT. Since the citation datasets are unweighted, the first layer of EGAT-UD reduces to GAT. However, for the following layer, EGAT-UD uses the adapted edge features, which are real-valued. Therefore, EGAT-UD can be considered as an adaptive version of GAT for undirected binary-edge graphs. EGAT-NA is the same as EGAT except that the edge adaptiveness functionality is removed.

We run each algorithm multiple times and report the mean and standard deviation of the accuracies in Table 2. According to the results, our EGAT outperforms GAT across the datasets, with improved mean accuracies on Citeseer, Pubmed and Konect-cora. The performance gains testify that edge directions are important for the task of citation network node classification, and that our EGAT is capable of effectively capturing the patterns related to edge directions. On the Konect-cora dataset, which does not contain any node attributes, the improvement is more significant than on the other two datasets. One possible reason is that edge information becomes crucial given no node attributes. Another possible reason is that the Konect-cora graph is denser, i.e. the average node degree is higher than in the other two graphs, so more information can be extracted from the edges. Compared with GAT, EGAT-UD also shows consistent improvement on all three datasets. Similar trends can be seen between EGAT-NA and EGAT; again, we observe more significant improvement on Konect-cora than on the other two datasets.

To further assess the effectiveness of EGAT and compare it with GAT, we record the training and validation losses from one run of each of the two algorithms on the three citation network datasets. The losses are plotted in Figure 2. From the plots we can clearly see that EGAT attains lower training and validation losses. On the Konect-cora and Pubmed datasets, EGAT converges very quickly. On the Citeseer dataset, it takes longer to converge, possibly due to the small size of the dataset.

Biological network

Dataset

One advantage of our EGAT model is that it utilizes real-valued edge weights as features rather than simple binary indicators. To test the effectiveness of edge weights, we conduct experiments on a biological network, Bio1kb-Hic, which is a chromatin interaction network collected by chromosome conformation capture (3C) based technologies at 1kb resolution. The nodes in the network are genetic regions, and the edges are the strengths of interactions between two genetic regions. The nodes are labeled with categories which indicate the genetic functions of the genetic regions. There are no attributes associated with the nodes. We split the nodes into subsets for training, validation and testing.

Note that the genetic interactions measured by 3C-based technologies are very noisy. Therefore, the Bio1kb-Hic dataset is highly dense and noisy. To test the noise-sensitivity of the algorithms, we generate a denoised version of the dataset by setting edge weights below a threshold to zero. In the denoised version, the number of edges is substantially reduced, so the graph is much less dense.
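This denoising step amounts to a simple thresholding of the weight matrix; a one-line sketch (the actual threshold value used in our experiments is not reproduced here, so `threshold` is a placeholder):

```python
import numpy as np

def denoise(W, threshold):
    """Zero out weak interactions: edge weights below `threshold` are set to 0."""
    return np.where(W < threshold, 0.0, W)
```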

Performance

We use one-hot indicators, as described previously, as node features. We test four algorithms: GAT as a baseline, unweighted-edges EGAT (EGAT-UW), non-adaptive EGAT (EGAT-NA) and the full EGAT. EGAT-NA is defined in the same way as in the citation network experiments. EGAT-UW is similar to EGAT-UD except that it discards the edge weights instead of the edge directions, since we are dealing with an undirected weighted graph in this experiment. Similar to EGAT-UD in the previous experiments, the first hidden layer of EGAT-UW is the same as GAT, but subsequent layers utilize attention-adapted real-valued edge features.

We run each algorithm multiple times on both the raw and denoised networks; the mean and standard deviation of the accuracies are reported in Table 3. Note that we increase the early stopping patience because of the noisy data.

Method Raw Denoised
GAT
EGAT-UW (Ours)
EGAT-NA (Ours)
EGAT (Ours)
Table 3: Performance summary on the Bio1kb-Hic dataset.

The results show that our EGAT outperforms GAT significantly on both the raw and the denoised network, which confirms that edge weights are important for node classification and that they are effectively incorporated in EGAT. Moreover, EGAT is highly noise-resilient. The detailed results also clearly show that the edge adaptiveness contributes to the EGAT model. On the raw network, EGAT is much more stable than EGAT-NA in terms of a significantly reduced variance. On the denoised network, EGAT-UW obtains a higher accuracy than GAT, and EGAT obtains a higher accuracy than EGAT-NA, both having significantly smaller variances than GAT and EGAT-UW.

Conclusions and future directions

In this paper, we address existing problems in the current state-of-the-art graph neural network models. Specifically, we generalize the current graph attention mechanism in order to incorporate multi-dimensional real-valued edge features. Then, based on the proposed new attention mechanism, we improve the current graph neural network architectures by adjusting edge features across neural network layers. Finally, we propose a method to design multi-dimensional edge features for directed edges so that our model is able to effectively handle directed graph data. Extensive experiments are conducted on several graph datasets, including two directed citation networks with node attributes, a directed citation network without any node attributes, and a weighted undirected biological network. Experimental results show that our EGAT model consistently and significantly outperforms GAT, the current state-of-the-art model, on all four datasets.

While we test EGAT on the graph node classification problem only, we believe the model is general and suitable for other graph problems, such as whole-graph or subgraph embedding and classification. A future research line is to apply EGAT to other interesting problems.

References

  • [Abadi et al.2016] Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; and Zheng, X. 2016. TensorFlow: A System for Large-scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, 265–283. Berkeley, CA, USA: USENIX Association.
  • [Ahmed et al.2013] Ahmed, A.; Shervashidze, N.; Narayanamurthy, S.; Josifovski, V.; and Smola, A. J. 2013. Distributed Large-scale Natural Graph Factorization. In Proceedings of the 22Nd International Conference on World Wide Web, WWW ’13, 37–48. New York, NY, USA: ACM.
  • [Bahdanau, Cho, and Bengio2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473 [cs, stat].
  • [Belkin and Niyogi2001] Belkin, M., and Niyogi, P. 2001. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering. In Advances in Neural Information Processing Systems,  7.
  • [Bhagat, Cormode, and Muthukrishnan2011] Bhagat, S.; Cormode, G.; and Muthukrishnan, S. 2011. Node Classification in Social Networks. arXiv:1101.3291 [physics] 115–148.
  • [Bruna et al.2013] Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2013. Spectral Networks and Locally Connected Networks on Graphs. arXiv:1312.6203 [cs].
  • [Cao, Lu, and Xu2015] Cao, S.; Lu, W.; and Xu, Q. 2015. GraRep: Learning Graph Representations with Global Structural Information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM ’15, 891–900. New York, NY, USA: ACM.
  • [Cao, Lu, and Xu2016] Cao, S.; Lu, W.; and Xu, Q. 2016. Deep Neural Networks for Learning Graph Representations. In AAAI Conference on Artificial Intelligence.
  • [Chen et al.2018] Chen, H.; Perozzi, B.; Hu, Y.; and Skiena, S. 2018. HARP: Hierarchical Representation Learning for Networks. In AAAI Conference on Artificial Intelligence.
  • [Clevert, Unterthiner, and Hochreiter2016] Clevert, D.-A.; Unterthiner, T.; and Hochreiter, S. 2016. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). In International Conference on Learning Representations.
  • [Defferrard, Bresson, and Vandergheynst2016] Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Lee, D. D.; Sugiyama, M.; Luxburg, U. V.; Guyon, I.; and Garnett, R., eds., Advances in Neural Information Processing Systems, 3844–3852. Curran Associates, Inc.
  • [Elman1990] Elman, J. L. 1990. Finding Structure in Time. Cognitive Science 14(2):179–211.
  • [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.
  • [Henaff, Bruna, and LeCun2015] Henaff, M.; Bruna, J.; and LeCun, Y. 2015. Deep Convolutional Networks on Graph-Structured Data. arXiv:1506.05163 [cs].
  • [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
  • [Huang et al.2017] Huang, G.; Liu, Z.; Maaten, L. v. d.; and Weinberger, K. Q. 2017. Densely Connected Convolutional Networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2261–2269.
  • [Kim et al.2018] Kim, S.; Hong, J.-H.; Kang, I.; and Kwak, N. 2018. Semantic Sentence Matching with Densely-connected Recurrent and Co-attentive Information. arXiv:1805.11360 [cs].
  • [Kingma and Ba2015] Kingma, D. P., and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations.
  • [Kipf and Welling2017] Kipf, T. N., and Welling, M. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations.
  • [Lecun et al.1998] Lecun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
  • [Namata et al.2012] Namata, G.; London, B.; Getoor, L.; and Huang, B. 2012. Query-driven Active Surveying for Collective Classification. In Workshop on Mining and Learning with Graphs.
  • [Ou et al.2016] Ou, M.; Cui, P.; Pei, J.; Zhang, Z.; and Zhu, W. 2016. Asymmetric Transitivity Preserving Graph Embedding. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, 1105–1114. New York, NY, USA: ACM.
  • [Perozzi, Al-Rfou, and Skiena2014] Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. DeepWalk: Online Learning of Social Representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, 701–710. New York, NY, USA: ACM.
  • [Sen et al.2008] Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Gallagher, B.; and Eliassi-Rad, T. 2008. Collective Classification in Network Data. AI Magazine 29(3):93–106.
  • [Shervashidze et al.2011] Shervashidze, N.; Schweitzer, P.; Leeuwen, E. J. v.; Mehlhorn, K.; and Borgwardt, K. M. 2011. Weisfeiler-Lehman Graph Kernels. Journal of Machine Learning Research 12(Sep):2539–2561.
  • [Srivastava et al.2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 15(1):1929–1958.
  • [Subelj and Bajec2013] Subelj, L., and Bajec, M. 2013. Model of Complex Networks based on Citation Dynamics. In Proceedings of the WWW Workshop on Large Scale Network Analysis, 527–530.
  • [Tang et al.2015] Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; and Mei, Q. 2015. LINE: Large-scale Information Network Embedding. In Proceedings of the 24th International Conference on World Wide Web, WWW ’15, 1067–1077.
  • [Velickovic et al.2018] Velickovic, P.; Cucurull, G.; Casanova, A.; and Romero, A. 2018. Graph Attention Networks. In International Conference on Learning Representations.
  • [Vishwanathan et al.2010] Vishwanathan, S. V. N.; Schraudolph, N. N.; Kondor, R.; and Borgwardt, K. M. 2010. Graph Kernels. Journal of Machine Learning Research 11(Apr):1201–1242.
  • [Wang et al.2017] Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2017. Non-local Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition.
  • [Wang, Cui, and Zhu2016] Wang, D.; Cui, P.; and Zhu, W. 2016. Structural Deep Network Embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, 1225–1234. New York, NY, USA: ACM.
  • [Yang, Cohen, and Salakhudinov2016] Yang, Z.; Cohen, W.; and Salakhudinov, R. 2016. Revisiting Semi-Supervised Learning with Graph Embeddings. In International Conference on Machine Learning, 40–48.
  • [Zhou et al.2017] Zhou, G.; Song, C.; Zhu, X.; Ma, X.; Yan, Y.; Dai, X.; Zhu, H.; Jin, J.; Li, H.; and Gai, K. 2017. Deep Interest Network for Click-Through Rate Prediction. arXiv:1706.06978 [cs, stat].