Sparse Graph Attention Networks

12/02/2019 ∙ by Yang Ye, et al. ∙ Georgia State University

Graph Neural Networks (GNNs) have proved to be an effective representation learning framework for graph-structured data, and have achieved state-of-the-art performance on many practical tasks, such as node classification, link prediction and graph classification. Among the variants of GNNs, Graph Attention Networks (GATs) learn to assign dense attention coefficients over all neighbors of a node for feature aggregation, and improve the performance of many graph learning tasks. However, real-world graphs are often very large and noisy, and GATs are prone to overfitting if not regularized properly. In this paper, we propose Sparse Graph Attention Networks (SGATs) that learn sparse attention coefficients under an L_0-norm regularization, and the learned sparse attentions are then used for all GNN layers, resulting in an edge-sparsified graph. By doing so, we can identify noisy/insignificant edges, and thus focus computation on the more important portions of a graph. Extensive experiments on synthetic and real-world graph learning benchmarks demonstrate the superior performance of SGATs. In particular, SGATs can remove about 50%-80% of the edges from large graphs, such as PPI and Reddit, while retaining similar classification accuracies. Furthermore, the removed edges can be interpreted intuitively and quantitatively. To the best of our knowledge, this is the first graph learning algorithm that sparsifies graphs for the purpose of identifying important relationships between nodes and for robust training.


1 Introduction

Graph-structured data is ubiquitous in many real-world systems, such as social networks [21], biological networks [30], and citation networks [18]. Graphs can capture interactions (i.e., edges) between individual units (i.e., nodes) and encode data from irregular or non-Euclidean domains to facilitate representation learning and data analysis. Many tasks, from link prediction [23] and graph classification [4] to node classification [28], can be naturally performed on graphs, where effective node embeddings that preserve both node information and graph structure are required. To learn from graph-structured data, typically an encoder function is needed to project high-dimensional node features into a low-dimensional embedding space such that “semantically” similar nodes are close to each other in the low-dimensional Euclidean space (e.g., by dot product) [8].

Recently, various Graph Neural Networks (GNNs) have been proposed to learn such embedding functions. Traditional node embedding methods, such as matrix factorization [3, 15] and random walks [17, 6], rely only on the adjacency matrix (i.e., graph structure) to encode node similarity. Trained in an unsupervised way, these methods employ dot products or co-occurrences on short random walks over graphs to measure the similarity between a pair of nodes. Similar to word embeddings [13, 16], the learned node embeddings from these methods are simple look-up tables. Other approaches exploit both graph structure and node features in a semi-supervised training procedure for node embeddings [11, 7]. These methods can be classified into two categories based on how they manipulate the adjacency matrix: (1) spectral graph convolution networks, and (2) neighbor aggregation or message passing algorithms. Spectral graph convolution networks transform graphs to the Fourier domain, effectively converting convolutions over the whole graph into element-wise multiplications in the spectral domain. However, once the graph structure changes, the learned embedding functions have to be retrained or fine-tuned. On the other hand, neighbor aggregation algorithms treat each node separately and learn the feature representation of each node by aggregating (e.g., weighted-summing) over its neighbors’ features. Under the assumption that connected nodes should share similar feature representations, these message passing algorithms leverage local feature aggregation to preserve the locality of each node, which generalizes classical convolution on images to irregular graph-structured data. Both categories of GNN algorithms can stack $L$ layers on top of each other and aggregate features from $L$-hop neighbors.

Among all the GNN algorithms, the neighbor aggregation algorithms have proved to be more effective and flexible. In particular, Graph Attention Networks (GATs) [24] use an attention mechanism to calculate edge weights at each layer based on node features, and attend adaptively over all neighbors of a node for representation learning. To increase the expressiveness of the model, GATs further employ multi-head attention to calculate multiple sets of attention coefficients for aggregation. Although multi-head attention improves prediction accuracies, our analysis of the learned coefficients shows that the multiple heads usually learn very similar (sometimes almost identical) coefficient distributions. This indicates that there might be significant redundancy in the GAT modeling. In addition, GATs cannot assign a unique attention score to each edge because multiple attention coefficients are generated (from multiple heads) for an edge per layer, and the same edge at different layers might receive different attention coefficients. For example, for a 2-layer GAT with 8-head attention, each edge receives 16 different attention coefficients. The redundancy in the GAT modeling not only adds significant overhead to computation and memory usage but also increases the risk of overfitting. To mitigate these issues, we propose to simplify the architecture of GATs such that only a single attention coefficient is assigned to each edge across all GNN layers. To further reduce the redundancy among edges, we incorporate a sparsity constraint into the attention mechanism of GATs. Specifically, we optimize the model under an $L_0$-norm regularization to encourage the model to use as few edges as possible. As we only employ one attention coefficient for each edge across all GNN layers, what we learn is an edge-sparsified graph with redundant edges removed. As a result, our Sparse Graph Attention Networks (SGATs), as shown in Figure 1, outperform the original GATs in two aspects: (1) SGATs can identify noisy/insignificant edges of a graph such that a sparsified graph structure can be discovered while preserving a similar representation capability; and (2) SGATs simplify the architecture of GATs, which reduces the risk of overfitting while achieving similar or sometimes even higher accuracies than the original GATs.

2 Background and Related Work

In this section, we first introduce our notation and then review prior work related to neighbor aggregation methods on graphs. Let $\mathcal{G}=(\mathcal{V},\mathcal{E})$ denote a graph with a set of nodes $\mathcal{V}$, connected by a set of edges $\mathcal{E}$. Node features are organized in a compact matrix $X\in\mathbb{R}^{N\times F}$, with each row representing the feature vector of one node. Let $A\in\{0,1\}^{N\times N}$ denote the adjacency matrix that describes the graph structure of $\mathcal{G}$: $A_{ij}=1$ if there is an edge from node $i$ to node $j$, and 0 otherwise. By adding a self-loop to each node, we use $\tilde{A}=A+I$ to denote the adjacency matrix of the augmented graph, where $I$ is an identity matrix.

For a semi-supervised node classification task, given a set of labeled nodes $\{(i, y_i)\,|\,i\in\mathcal{V}_L\}$, where $y_i$ is the label of node $i$ and $\mathcal{V}_L\subset\mathcal{V}$, we learn a function $f(X, A; W)$, parameterized by $W$, that takes node features $X$ and graph structure $A$ as inputs and yields a node embedding matrix $H$ for all nodes in $\mathcal{V}$; subsequently, $H$ is fed to a classifier to predict the class label of each unlabeled node. To learn the model parameters $W$, we typically minimize an empirical risk over all labeled nodes:

$$ \mathcal{R}(W)=\frac{1}{|\mathcal{V}_L|}\sum_{i\in\mathcal{V}_L}\ell\big(f(X,A;W)_i,\; y_i\big), \qquad (1) $$

where $f(X,A;W)_i$ denotes the output of $f$ for node $i$ and $\ell(\cdot,\cdot)$ is a loss function, such as the cross-entropy loss, that measures the compatibility between model predictions and class labels. There are many different GNN algorithms that can solve Eq. 1; the main difference among these algorithms is how they define the encoder function $f$.
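For concreteness, the following is a minimal PyTorch sketch of the empirical risk in Eq. 1, assuming the encoder produces per-node logits and a boolean mask marks the labeled nodes; the function and argument names are illustrative, not the authors' code.

```python
import torch.nn.functional as F

def semi_supervised_risk(logits, labels, labeled_mask):
    """Cross-entropy averaged over the labeled nodes only, as in Eq. 1.

    logits:       [N, C] encoder outputs for all N nodes
    labels:       [N]    integer class labels (values on unlabeled nodes are ignored)
    labeled_mask: [N]    boolean mask selecting the labeled nodes V_L
    """
    return F.cross_entropy(logits[labeled_mask], labels[labeled_mask])
```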

2.1 Neighbor Aggregation Methods

The most effective and flexible graph learning algorithms so far follow a neighbor aggregation mechanism. The basic idea is to learn a parameter-sharing aggregator that takes as inputs the feature vector of node $i$ and its neighbors’ feature vectors, and outputs a new feature vector for node $i$. Essentially, the aggregator function aggregates lower-level features of a node and its neighbors and generates higher-level feature representations. The popular Graph Convolution Networks (GCNs) [11] fall into the category of neighbor aggregation. For a 2-layer GCN, its encoder function can be expressed as:

$$ f(X, A; W) = \mathrm{softmax}\!\left(\hat{A}\,\mathrm{ReLU}\!\left(\hat{A} X W^{(0)}\right) W^{(1)}\right), \qquad (2) $$

where $\hat{A}=\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$, $\tilde{D}$ is the degree matrix of $\tilde{A}$, and the $W^{(l)}$s are the learnable parameters of GCNs. Apparently, GCNs define the aggregation coefficients as the symmetrically normalized adjacency matrix $\hat{A}$, and these coefficients are shared across all GCN layers. More specifically, the aggregator of GCNs can be expressed as

$$ h_i^{(l+1)} = \sigma\!\Big(\sum_{j\in\tilde{\mathcal{N}}(i)} \hat{A}_{ij}\, h_j^{(l)} W^{(l)}\Big), \qquad (3) $$

where $h_i^{(l)}$ is the hidden representation of node $i$ at layer $l$, $h_i^{(0)}=x_i$, and $\tilde{\mathcal{N}}(i)$ denotes the set of all the neighbors of node $i$, including itself.
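As a reference point for the aggregators discussed below, here is a minimal PyTorch sketch of the GCN-style aggregation of Eqs. 2-3 over an edge list with self-loops. The function name, argument shapes, and the use of an edge list rather than a sparse matrix are illustrative choices, not the authors' implementation.

```python
import torch

def gcn_aggregate(h, edge_index, w, activation=torch.relu):
    """One GCN-style aggregation step (Eq. 3) over an edge list with self-loops.

    h:          [N, d_in] node features at layer l
    edge_index: [2, E] long tensor of (src, dst) pairs, self-loops included
    w:          [d_in, d_out] layer weight matrix
    """
    src, dst = edge_index
    ones = h.new_ones(src.size(0))
    deg = h.new_zeros(h.size(0)).index_add_(0, dst, ones)   # >= 1 thanks to self-loops
    coeff = (deg[src] * deg[dst]).rsqrt()                   # symmetric normalization A_hat_ij
    msg = coeff.unsqueeze(-1) * h[src]
    out = torch.zeros_like(h).index_add_(0, dst, msg)       # sum over each node's neighborhood
    return activation(out @ w)
```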

Since a fixed adjacency matrix $\hat{A}$ is used for feature aggregation, GCNs can only be used for transductive learning tasks, and if the graph structure changes, the whole GCN model needs to be retrained or fine-tuned. To support inductive learning, GraphSage [7] proposes to learn parameterized aggregators (e.g., mean, max-pooling or LSTM aggregators) that can be used for feature aggregation on unseen nodes or graphs. To support large-scale graph learning tasks, GraphSage uniformly samples a fixed number of neighbors per node and performs computation on a sub-sampled graph at each iteration. Although this reduces computational cost and memory usage significantly, its accuracy suffers from uniform sampling and partial neighbor aggregation.

2.2 Graph Attention Networks

Recently, attention networks have achieved state-of-the-art results in many computer vision and natural language processing tasks, such as image captioning [27] and machine translation [1]. By attending over a sequence of inputs, an attention mechanism can decide which parts of the inputs to look at in order to gather the most useful information. Extending the attention mechanism to graph-structured data, Graph Attention Networks (GATs) [24] utilize an attention-based aggregator to generate attention coefficients over all neighbors of a node for feature aggregation. In particular, the aggregator function of GATs is similar to that of GCNs:

$$ h_i^{(l+1)} = \sigma\!\Big(\sum_{j\in\tilde{\mathcal{N}}(i)} \alpha_{ij}^{(l)}\, h_j^{(l)} W^{(l)}\Big), \qquad (4) $$

except that (1) $\alpha_{ij}^{(l)}$ is the attention coefficient of edge $(i,j)$ at layer $l$, assigned by an attention function rather than by the predefined $\hat{A}$, and (2) different layers utilize different attention functions, while GCNs share the predefined $\hat{A}$ across all layers.
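For comparison with the sparse attention introduced later, the sketch below implements a single GAT head in the spirit of Eq. 4, with the usual LeakyReLU scoring and per-neighborhood softmax of [24]. It is a generic GAT head under assumed shapes, not the authors' code.

```python
import torch
import torch.nn.functional as F

def gat_head(h, edge_index, w, a):
    """One GAT head (Eq. 4): layer-specific attention over each node's neighborhood.

    h: [N, d_in] node features, w: [d_in, d_out], a: [2 * d_out] attention vector.
    Returns (alpha, out): per-edge coefficients and aggregated node features.
    """
    src, dst = edge_index
    z = h @ w                                                    # projected features
    e = F.leaky_relu((torch.cat([z[src], z[dst]], dim=-1) * a).sum(-1), 0.2)
    e = e - e.max()                                              # numerical stability
    num = e.exp()
    den = z.new_zeros(h.size(0)).index_add_(0, dst, num)
    alpha = num / den[dst]                                       # softmax over incoming edges
    out = torch.zeros_like(z).index_add_(0, dst, alpha.unsqueeze(-1) * z[src])
    return alpha, out
```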

To increase the capacity of the attention mechanism, GATs further exploit multi-head attention for feature aggregation: each head works independently to aggregate information, and the outputs of all heads are then concatenated to form the feature representation for the next layer. In principle, a learned attention coefficient can be viewed as an importance score of an edge. However, since each edge receives multiple attention coefficients at a layer and the same edge at a different layer has a different set of attention coefficients, GATs cannot assign a unique importance score to quantify the significance of an edge. Built on top of GATs, our SGATs introduce a sparse attention mechanism via an $L_0$-norm regularization for feature aggregation. Furthermore, we only assign one attention coefficient (or importance score) to each edge across all layers. As a result, we can identify important edges of a graph and remove redundant ones for the purpose of computational efficiency and robust training.

3 Sparse Graph Attention Networks

The key idea of our Sparse Graph Attention Networks (SGATs) is that we attach a binary gate to each edge to determine whether that edge will be used for neighbor aggregation. We optimize the SGAT model under an $L_0$-norm regularized loss function such that we use as few edges as possible while achieving similar or better classification accuracies. We first introduce our sparse attention mechanism, and then describe how the binary gates can be optimized via stochastic binary optimization.

3.1 Formulation

To identify important edges of a graph and remove noisy/insignificant ones, we attach a binary gate $z_{ij}\in\{0,1\}$ to each edge $(i,j)$ such that $z_{ij}$ controls whether edge $(i,j)$ will be used for neighbor aggregation. (Note that edges $(i,j)$ and $(j,i)$ are treated as two different edges and therefore have their own binary gates $z_{ij}$ and $z_{ji}$, respectively.) This corresponds to attaching a set of binary masks $Z$ to the adjacency matrix $\tilde{A}$:

$$ \tilde{A}_Z = \tilde{A} \odot Z, \qquad Z = \{ z_{ij} \in \{0,1\} \mid (i,j) \in \mathcal{E} \}, \qquad (5) $$

where $|\mathcal{E}|$ is the number of edges in graph $\mathcal{G}$. Since we want to use as few edges as possible for semi-supervised node classification, we train the model parameters $W$ and the binary masks $Z$ by minimizing the following $L_0$-norm regularized empirical risk:

$$ \mathcal{R}(W, Z) = \frac{1}{|\mathcal{V}_L|}\sum_{i\in\mathcal{V}_L}\ell\big(f(X, \tilde{A}\odot Z; W)_i,\; y_i\big) + \lambda\,\|Z\|_0, \qquad \|Z\|_0=\sum_{(i,j)\in\mathcal{E}}\mathbb{1}[z_{ij}\neq 0], \qquad (6) $$

where $\|Z\|_0$ denotes the $L_0$-norm of the binary masks $Z$, i.e., the number of non-zero elements in $Z$ (edge sparsity), $\mathbb{1}[c]$ is an indicator function that is 1 if the condition $c$ is satisfied and 0 otherwise, and

$\lambda$ is a regularization hyperparameter that balances between data loss and edge sparsity. For the encoder function $f(X, \tilde{A}\odot Z; W)$, we define the following attention-based aggregation function:

$$ h_i^{(l+1)} = \sigma\!\Big(\sum_{j\in\tilde{\mathcal{N}}(i)} \alpha_{ij}\, h_j^{(l)} W^{(l)}\Big), \qquad (7) $$

where $\alpha_{ij}$ is the attention coefficient assigned to edge $(i,j)$ across all layers. This is in stark contrast to GATs, in which a layer-dependent attention coefficient $\alpha_{ij}^{(l)}$ is assigned to each edge $(i,j)$ at layer $l$.

To compute the attention coefficients, we simply calculate them by a row-wise normalization of the masked adjacency matrix $\tilde{A}\odot Z$, i.e.,

$$ \alpha_{ij} = \frac{z_{ij}\,\tilde{A}_{ij}}{\sum_{k\in\tilde{\mathcal{N}}(i)}\tilde{A}_{ik}}. \qquad (8) $$

As the center node by default is important to itself, we set $z_{ii}$ to 1 so that it can preserve its own information. Compared to GATs, we don’t use softmax to normalize the attention coefficients since by definition $\tilde{A}_{ij}/\sum_{k}\tilde{A}_{ik}\le 1$ and usually $z_{ij}\in[0,1]$, such that their product $\alpha_{ij}\in[0,1]$.

Similar to GATs, we can also use multi-head attention to increase the capacity of our model. We thus formulate a multi-head SGAT layer as:

$$ h_i^{(l+1)} = \big\Vert_{k=1}^{K}\, \sigma\!\Big(\sum_{j\in\tilde{\mathcal{N}}(i)} \alpha_{ij}\, h_j^{(l)} W_k^{(l)}\Big), \qquad (9) $$

where $K$ is the number of heads, $\Vert$ represents concatenation, $\alpha_{ij}$ is the attention coefficient computed by Eq. 8, and $W_k^{(l)}$ is the weight matrix of head $k$ at layer $l$. Note that only one set of attention coefficients is calculated for each edge $(i,j)$, and it is shared among all heads and all layers. With multi-head attention, the final returned output, $h_i^{(l+1)}$, will consist of $KF'$ features (rather than $F'$) for each node.
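A minimal PyTorch sketch of the multi-head SGAT layer in Eq. 9 is given below, assuming the per-edge coefficients `alpha` (already gate-masked and normalized as in Eqs. 7-8) are computed once and reused by every head and every layer; the names and shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def sgat_layer(h, edge_index, alpha, w_heads, activation=F.elu):
    """One multi-head SGAT layer (Eq. 9): a single per-edge coefficient shared by all heads.

    h:       [N, d_in] node features
    alpha:   [E] edge coefficients, gate-masked and normalized as in Eqs. 7-8
    w_heads: list of K weight matrices, each of shape [d_in, d_out]
    Returns the concatenation of the K head outputs, shape [N, K * d_out].
    """
    src, dst = edge_index
    outs = []
    for w in w_heads:
        z = h @ w
        agg = torch.zeros_like(z).index_add_(0, dst, alpha.unsqueeze(-1) * z[src])
        outs.append(activation(agg))
    return torch.cat(outs, dim=-1)
```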

Why can we use one set of coefficients for multi-head attention? This is based on our observation that all GAT heads tend to learn attention coefficients with similar distributions, and thus there might be significant redundancy in the GAT modeling. In addition, using one set of attention coefficients isn’t rare at all, as GCNs use a shared $\hat{A}$ across all layers and are very competitive with GATs in terms of classification accuracy. While GCNs use one set of predefined aggregation coefficients, SGATs learn the coefficients from a sparse attention mechanism. We believe it is the learned attention coefficients, rather than the multiple sets of attention coefficients, that lead to the improved performance of GATs over GCNs, and the benefit of multiple sets of attention coefficients might be very limited and could be undermined by the risk of overfitting due to increased complexity. Therefore, the benefits of using one set of attention coefficients over the original multiple sets are at least twofold: (1) one set of coefficients is computationally cheaper than multiple sets of coefficients (by roughly a factor of the number of heads times the number of layers); and (2) one set of coefficients can be interpreted as edge importance scores, such that they can be used to identify important edges and remove noisy/insignificant edges for computational efficiency and robust training.

3.2 Model Optimization

Stochastic Variational Optimization To optimize Eq. 6, we need to compute its gradient w.r.t. the binary masks $Z$. However, since $Z$ is a set of binary variables, neither the first term nor the second term is differentiable. Hence, we resort to approximation algorithms to solve this binary optimization problem. Specifically, we approximate Eq. 6 via an inequality from stochastic variational optimization [2]: given any function $\mathcal{F}(z)$ and any distribution $q(z)$, the following inequality holds:

$$ \min_{z}\,\mathcal{F}(z) \;\le\; \mathbb{E}_{z\sim q(z)}\big[\mathcal{F}(z)\big], \qquad (10) $$

i.e., the minimum of a function is upper bounded by its expectation.

Since $z_{ij}$ is a binary random variable, we assume $z_{ij}$ follows a Bernoulli distribution with parameter $\pi_{ij}\in[0,1]$, i.e., $z_{ij}\sim\mathrm{Bern}(\pi_{ij})$. Thus, we can upper bound Eq. 6 by its expectation:

$$ \hat{\mathcal{R}}(W, \pi) = \mathbb{E}_{Z\sim\mathrm{Bern}(\pi)}\!\left[\frac{1}{|\mathcal{V}_L|}\sum_{i\in\mathcal{V}_L}\ell\big(f(X, \tilde{A}\odot Z; W)_i,\; y_i\big)\right] + \lambda\!\!\sum_{(i,j)\in\mathcal{E}}\!\pi_{ij}. \qquad (11) $$

Now the second term of Eq. 11 is differentiable w.r.t. the new model parameters $\pi$. However, the first term is still problematic since the expectation over a large number of binary random variables is intractable, and thus its gradient does not allow for an efficient computation.

The Hard Concrete Gradient Estimator

We therefore need a further approximation to estimate the gradient of the first term of Eq. 11 w.r.t. $\pi$. Fortunately, this is a well-studied problem in machine learning and statistics, with many gradient estimators existing for this discrete latent variable model, such as REINFORCE [26], Gumbel-Softmax [9], REBAR [22], RELAX [5] and the hard concrete estimator [12]. We choose the hard concrete estimator due to its superior performance in our experiments and its relatively straightforward implementation. Specifically, the hard concrete estimator employs a reparameterization trick to approximate the original optimization problem of Eq. 11 by a close surrogate function:

$$ \hat{\mathcal{R}}(W, \log\alpha) = \mathbb{E}_{u\sim\mathcal{U}(0,1)}\!\left[\frac{1}{|\mathcal{V}_L|}\sum_{i\in\mathcal{V}_L}\ell\big(f(X, \tilde{A}\odot Z(u); W)_i,\; y_i\big)\right] + \lambda\!\!\sum_{(i,j)\in\mathcal{E}}\!\mathrm{sigmoid}\!\Big(\log\alpha_{ij}-\beta\log\frac{-\gamma}{\zeta}\Big), \qquad (12) $$

with

$$ z_{ij} = \min\!\big(1,\, \max\big(0,\; s_{ij}(\zeta-\gamma)+\gamma\big)\big), \qquad s_{ij} = \mathrm{sigmoid}\!\left(\frac{\log u - \log(1-u) + \log\alpha_{ij}}{\beta}\right), $$

where $u\sim\mathcal{U}(0,1)$ is a uniform random variable in the range of $(0,1)$, $\mathrm{sigmoid}(\cdot)$ is the sigmoid function, and $\gamma<0$ and $\zeta>1$ are the typical parameter values of the hard concrete distribution (e.g., $\gamma=-0.1$ and $\zeta=1.1$ as in [12]). For more details on the hard concrete gradient estimator, we refer the readers to [12].

During training, we optimize $\log\alpha_{ij}$ for each edge $(i,j)$. At the test phase, we generate a deterministic mask $\hat{z}_{ij}$ by employing the following equation:

$$ \hat{z}_{ij} = \min\!\big(1,\, \max\big(0,\; \mathrm{sigmoid}(\log\alpha_{ij})(\zeta-\gamma)+\gamma\big)\big), \qquad (13) $$

which is the expectation of $z_{ij}$ under the hard concrete distribution $q(z_{ij}\mid\log\alpha_{ij})$. Due to the hard concrete approximation, $\hat{z}_{ij}$ is now a continuous value in the range of $[0, 1]$. Ideally, the majority of the elements of $\hat{Z}$ will be zeros, and thus many edges can be removed from the graph.
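The hard concrete gate itself is standard [12]; the sketch below shows the training-time reparameterized sample, the deterministic test-time gate of Eq. 13, and a differentiable surrogate for the expected $L_0$ penalty. The constants γ = -0.1, ζ = 1.1 and β = 2/3 are the typical values from [12]; whether SGAT uses exactly these values is an assumption.

```python
import torch

GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0   # typical hard concrete constants from [12]

def sample_gate(log_alpha):
    """Training-time gate: reparameterized hard concrete sample per edge."""
    u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
    s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / BETA)
    return (s * (ZETA - GAMMA) + GAMMA).clamp(0.0, 1.0)

def test_gate(log_alpha):
    """Test-time gate: the deterministic value used to sparsify the graph (Eq. 13)."""
    return (torch.sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA).clamp(0.0, 1.0)

def expected_l0(log_alpha):
    """Differentiable surrogate for the L_0 penalty: sum of P(gate != 0) over edges."""
    return torch.sigmoid(log_alpha - BETA * torch.log(torch.tensor(-GAMMA / ZETA))).sum()
```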

Inductive Model of $Z$

The learning of binary masks $Z$ discussed above is transductive, by which we can learn a binary mask $z_{ij}$ for each edge $(i,j)$ in the training graph $\mathcal{G}$. However, this approach cannot generate new masks for edges that are not in the training graph. A more desirable approach is an inductive one that can generate new masks for new edges. This inductive model of $Z$ can be implemented as a generator, which takes the feature vectors of a pair of nodes as input and produces a binary mask as output. We model this generator simply as

$$ \log\alpha_{ij} = g_{\theta}\big(W_0^{(0)} x_i,\; W_0^{(0)} x_j\big), \qquad (14) $$

where $\theta$ is the parameter of the generator and $W_0^{(0)}$ is the weight matrix of head 0 at layer 0. To integrate this generator into an end-to-end training pipeline, we define the generator to output $\log\alpha_{ij}$. Upon receiving $\log\alpha_{ij}$ from the mask generator, we can sample a mask $z_{ij}$ from the hard concrete distribution $q(z_{ij}\mid\log\alpha_{ij})$. The set of sampled masks $Z$ is then used to generate an edge-sparsified graph for the downstream applications. The full pipeline of SGATs is shown in Figure 1. In our experiments, we use this inductive SGAT pipeline for semi-supervised node classification.
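The exact form of the generator in Eq. 14 is not fully specified by the text above, so the sketch below assumes a GAT-style scorer: project the two endpoint features with a shared weight matrix (standing in for $W_0^{(0)}$), concatenate, and map to a scalar log α_ij that feeds the hard concrete sampler. All class and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class EdgeGateGenerator(nn.Module):
    """Hypothetical sketch of the inductive gate generator: node-feature pairs -> log_alpha."""

    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, hid_dim, bias=False)   # stands in for W_0^(0)
        self.score = nn.Linear(2 * hid_dim, 1, bias=False)   # generator parameters theta

    def forward(self, x, edge_index):
        src, dst = edge_index
        z = self.proj(x)
        log_alpha = self.score(torch.cat([z[src], z[dst]], dim=-1)).squeeze(-1)
        return log_alpha   # fed to the hard concrete sampler to draw per-edge gates
```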

Dataset            Task          Nodes    Edges        Features  Classes  Average neighbor size
Cora               transductive  2,708    13,264       1,433     7        2.0
Citeseer           transductive  3,327    12,431       3,703     6        1.4
Pubmed             transductive  19,717   108,365      500       3        2.3
Amazon Computers   transductive  13,381   505,474      767       10       18.4
Amazon Photo       transductive  7,487    245,812      745       8        15.9
PPI                inductive     56,944   818,716      50        121      6.7
Reddit             inductive     232,965  114,848,857  602       41       246.0

Table 1: Summary of the datasets used in the experiments.

4 Evaluation

To demonstrate SGAT’s ability to identify important edges for feature aggregation, we conduct a series of experiments on synthetic and real-world semi-supervised node classification benchmarks, including transductive learning tasks and inductive learning tasks. We compare our SGATs with the state-of-the-art GNN algorithms: GCNs [11], GraphSage [7] and GATs [24]. For a fair comparison, our experiments closely follow the configurations of the competing algorithms. We plan to release our code to the public to facilitate research in this area.

4.1 Datasets

We evaluate our algorithm on seven established semi-supervised node classification benchmarks, whose statistics are summarized in Table 1.

Transductive learning tasks Three citation network datasets: Cora, Citeseer and Pubmed [18], and two co-purchase graph datasets: Amazon Computers and Amazon Photo [19], are used to evaluate the performance of our algorithm in the transductive learning setting, where test graphs are included in the training graphs for feature aggregation, which facilitates the learning of feature representations of test nodes for classification. The citation networks have low degrees (e.g., only 1-2 edges per node), while the co-purchase datasets have higher degrees (e.g., 15-18 edges per node), so we can demonstrate SGAT’s performance on both sparse and dense graphs. For the citation networks, nodes represent documents, edges denote citation relationships between two documents, and node features are the bag-of-words representations of document contents; the goal is to classify documents into different categories. For the co-purchase datasets, nodes represent products, edges indicate that two products are frequently purchased together, and node features are the bag-of-words representations of product reviews; similarly, the goal is to classify products into different product categories. Our experiments closely follow the transductive learning setups of [28, 19]. For all these datasets, 20 nodes per class are used for training, 500 nodes for validation, and 1000 nodes for testing.

Inductive learning tasks Two large-scale graph datasets: PPI [30] and Reddit [7] are used to evaluate the performance of SGATs in the inductive learning setting, where test graphs are excluded from the training graphs during model training and the feature representations of test nodes have to be generated by the trained aggregators for classification. Our inductive experiments closely follow the settings of GraphSage [7]. The protein-protein interaction (PPI) dataset consists of graphs corresponding to different human tissues. Positional gene sets, motif gene sets and immunological signatures are extracted as node features, and 121 gene ontology categories are used as class labels. There are in total 24 subgraphs in the PPI dataset, with each subgraph containing 3k nodes and 100k edges on average. Among the 24 subgraphs, 20 are used for training, 2 for validation and the remaining 2 for testing. For the Reddit dataset, each node represents a Reddit post and two nodes are connected when the same user comments on both posts. The node features are made up of the embedding of the post title, the average embedding of all the post’s comments, the post’s score and the number of comments made on the post. There are 41 different communities in the Reddit dataset corresponding to 41 categories; the task is to predict which community a post belongs to. This is a large-scale graph learning benchmark that contains over 100 million edges and about 250 edges per node, and therefore a high edge redundancy is expected.

4.2 Models and Experimental Setup

Models

A 2-layer SGAT with 2-head attention at each layer is used for feature aggregation, followed by a softmax classifier for node classification. We use ReLU [14] as the activation function and optimize the models with the Adam optimizer [10]. We compare SGATs with GCNs and GATs in terms of node classification accuracy. Since SGATs produce edge-sparsified graphs, we also report the edge redundancy of the sparsified graph, i.e., the percentage of edges removed from the original graph. We implemented our SGATs with the DGL library [25].

Hyperparameters

We tune the performance of SGATs based on the hyperparameters of GATs since SGATs are built on top of GATs. For a fair comparison, we also run 1-head and 2-head GAT models with the same architecture as SGATs to illustrate the impact of sparse attention vs. standard dense attention. To prevent the models from overfitting on the small datasets, $L_2$ regularization and dropout [20] are used. Dropout is applied to the inputs of all layers and to the attention coefficients. For the large-scale datasets, such as PPI and Reddit, we don’t use $L_2$ regularization or dropout as the models have enough data for training.

Figure 2: The evolution of the graph of Zachary’s Karate Club at different training epochs. SGAT can remove 46% of the edges from the graph while retaining almost the same accuracy of 96.88%. Nodes 0 and 33 are the labeled nodes, and the colors show the ground-truth labels. The video can be found at https://youtu.be/3Jhr26lXRl8.

4.3 Experiments on Synthetic Dataset

To illustrate the idea of SGAT, we first demonstrate it on a synthetic dataset, Zachary’s Karate Club [29], which is a social network of a karate club of 34 members, with links between pairs of members representing who interacted outside the club. The club later split into two groups due to a conflict between the instructor and the administrator. The goal is to predict the group that each member of the club joined after the split. This is a semi-supervised node classification problem in the sense that only two nodes, the instructor (node 0) and the administrator (node 33), are labeled, and we need to predict the labels of all the other nodes.

We train a 2-layer SGAT with 2-head attention at each layer on this dataset. Figure 2 illustrates the evolution of the graph at different training epochs, along with the corresponding classification accuracies and the number of edges kept in the graph. As the training proceeds, some insignificant edges are removed and the graph becomes sparser; at the end of training, SGAT has removed about 46% of the edges while retaining an accuracy of 96.88% (i.e., only one node is misclassified), which is the same accuracy achieved by GCNs and other competing algorithms that utilize the full graph for prediction. In addition, the removed edges have an intuitive explanation. For example, the edge from node 16 to node 6 is removed while the reverse edge is kept. Apparently, this is because node 6 has 4 neighbors while node 16 has only 2, so removing one edge between them doesn’t affect node 6 too much, while it may be catastrophic to node 16. Similarly, the edges between nodes 27 and 28 and node 2 are removed. This might be because node 2 has an edge to node 0 and no edge to node 33, and therefore node 2 is more likely to join node 0’s group, so the edges to nodes 27 and 28 are not important or might be due to noise.

                 Cora    Citeseer  Pubmed  Amazon Computers  Amazon Photo  PPI     Reddit
GCN              81.5%   70.3%     79.0%   81.5%             91.2%         50.9%   94.38%
GAT              84.0%   72.5%     79.0%   -                 -             97.3%   OOM
GAT-1head        83.3%   66.8%     77.1%   81.3%             89.7%         85.6%   92.6%
GAT-2head        83.8%   67.5%     77.4%   82.4%             90.4%         97.1%   93.5%
GraphSage        -       -         -       -                 -             61.2%   95.4%
SGAT-1head       83.1%   67.4%     77.2%   81.1%             89.5%         86.0%   94.9%
SGAT-2head       84.2%   68.2%     77.6%   81.8%             89.9%         96.6%   95.2%
Edge Redundancy  2.0%    1.2%      2.2%    63.6%             42.3%         49.3%   80.8%

  • From our experiments.

Table 2: Classification accuracies on seven semi-supervised node classification benchmarks. Results of GCNs on PPI and Reddit are trained in a transductive way. The results annotated with • are from our experiments, and the rest of the results are from the corresponding papers. OOM indicates “out of memory”.

4.4 Experiments on Seven Benchmarks

We evaluate the performance of SGATs on seven semi-supervised node classification benchmarks, with the results summarized in Table 2. For a fair comparison, we run each experiment 20 times with different random weight initializations and report the mean accuracies.

Comparing SGAT with GCN, we note that SGAT outperforms GCN on the PPI dataset significantly while being similar on the other six benchmarks. Comparing SGAT with GraphSage, SGAT again outperforms GraphSage on PPI by a significant margin. Comparing SGAT with GAT, we note that they achieve very competitive accuracies on all benchmarks except Reddit, where the original GAT runs out of memory while SGAT trains successfully due to its simplified architecture and about 80% edge reduction. Another advantage of SGAT over GAT is the regularization effect of the $L_0$-norm on the edges. To demonstrate this, we test two GAT variants: GAT-1head and GAT-2head, which have similar architectures to SGAT-1head and SGAT-2head but with different attention mechanisms (i.e., standard dense attention vs. sparse attention). As we can see, on the Reddit dataset, the sparse attention-based SGATs outperform GATs by 2-3% while sparsifying the graph by about 80%.

Overall, SGAT is very competitive with GCN, GraphSage and GAT in terms of classification accuracy, while being able to remove different percentages of edges from small and large graphs. More specifically, on the three small citation networks, SGATs learn that the majority of the edges are critical to maintaining competitive accuracies as the original graphs are already very sparse, and therefore SGATs remove only 1-2% of the edges. On the other hand, on the larger or denser graphs, SGATs identify significant amounts of redundancy in the edges (e.g., 40-80%), and removing them incurs no or only minor accuracy losses.

Figure 3: The evolution of (top) classification accuracy and (bottom) number of nonzero attention coefficients as a function of training epochs on test subgraphs of the PPI dataset.
Figure 4: The evolution of the number of edges used for neighbor aggregation as a function of training epochs (top) on the Cora training graph and (bottom) on the PPI training graphs.
Figure 5: The evolution of classification accuracies on the PPI test dataset when different percentages of edges are removed from the graph. Three different strategies of selecting edges for removal are considered.

4.5 Edge Redundancy Analysis

Lastly, we analyze the edge redundancy identified by SGATs. Figures 3 and 4 illustrate the evolution of the number of edges used by SGATs on Cora and PPI during training and testing. As we can see, SGAT slowly removes 2% of the edges from Cora over the training epochs, while it removes 49.3% of the edges from PPI, indicating a significant edge redundancy in the PPI benchmark.

To demonstrate SGAT’s accuracy in identifying important edges of a graph, Figure 5 shows the evolution of classification accuracies on the PPI test dataset when different percentages of edges are removed from the graph. We compare three different strategies for selecting edges to remove: (1) the top-k% of edges ranked by their learned importance scores $\hat{z}_{ij}$ in descending order, (2) the bottom-k% of edges under the same ranking, and (3) a uniformly random k%. As we can see, SGAT identifies important edges accurately, as removing them from the graph incurs a dramatic accuracy loss compared to random edge removal or bottom-k% edge removal.
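The three removal strategies can be reproduced with a short helper like the one below, which drops a fraction of edges ranked by the learned gate values; the function name, the `strategy` argument and its semantics are assumptions made for illustration.

```python
import torch

def remove_edges(edge_index, gate, frac, strategy="top"):
    """Drop a fraction of edges, as in the Figure 5 ablation.

    edge_index: [2, E] edge list, gate: [E] learned gate values z_hat
    strategy:   "top" (largest gates first), "bottom" (smallest first), or "random"
    """
    E = gate.numel()
    k = int(frac * E)
    if strategy == "top":
        drop = gate.argsort(descending=True)[:k]
    elif strategy == "bottom":
        drop = gate.argsort()[:k]
    else:
        drop = torch.randperm(E)[:k]
    keep = torch.ones(E, dtype=torch.bool)
    keep[drop] = False
    return edge_index[:, keep]
```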

5 Conclusion

We propose Sparse Graph Attention Networks (SGATs) that incorporate a sparse attention mechanism into graph attention networks (GATs) via an $L_0$-norm regularization on the number of edges of a graph. To assign a single attention coefficient to each edge, SGATs further simplify the architecture of GATs by sharing one set of attention coefficients across all heads and all layers. This reduces the risk of overfitting and leads to a more efficient graph learning algorithm that achieves very competitive accuracies while being able to identify important edges and sparsify large graphs dramatically (e.g., by 50-80%). To the best of our knowledge, this is the first graph learning algorithm that sparsifies graphs for the purpose of robust training.

As for future extensions, we plan to investigate more efficient gradient estimators for stochastic binary optimization to improve the effectiveness of SGATs. We also plan to extend the framework to learn graph structure from data directly when graph structure isn’t available in the first place.

References

  • [1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), 2015.
  • [2] Thomas Bird, Julius Kunze, and David Barber. Stochastic variational optimization. arXiv preprint arXiv:1809.04855, 2018.
  • [3] Shaosheng Cao, Wei Lu, and Qiongkai Xu. Grarep: Learning graph representations with global structural information. In ACM International on Conference on Information and Knowledge Management (CIKM), 2015.
  • [4] David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gomez-Bombarelli, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems (NIPS), 2015.
  • [5] Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, and David Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. In International Conference on Learning Representations (ICLR), 2018.
  • [6] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Knowledge Discovery and Data mining (KDD), 2016.
  • [7] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NIPS), 2017.
  • [8] William L. Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, 40(3):52–74, 2017.
  • [9] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations (ICLR), 2017.
  • [10] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • [11] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
  • [12] Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through L_0 regularization. In International Conference on Learning Representations (ICLR), 2018.
  • [13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. 2013.
  • [14] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML), 2010.
  • [15] Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. Asymmetric transitivity preserving graph embedding. In Knowledge Discovery and Data mining (KDD), 2016.
  • [16] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
  • [17] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Knowledge Discovery and Data mining (KDD), 2014.
  • [18] Prithviraj Sen, Galileo Mark Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–106, 2008.
  • [19] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Gunnemann. Pitfalls of graph neural network evaluation. In NeurIPS Workshop on Relational Representation Learning, 2018.
  • [20] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
  • [21] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In International World Wide Web Conference (WWW), 2015.
  • [22] George Tucker, Andriy Mnih, Chris J. Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems (NIPS), 2017.
  • [23] Rianne Van den Berg, Thomas N Kipf, and Max Welling. Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263, 2017.
  • [24] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.
  • [25] Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, Ziyue Huang, Qipeng Guo, Hao Zhang, Haibin Lin, Junbo Zhao, Jinyang Li, Alexander Smola, and Zheng Zhang. Deep graph library: Towards efficient and scalable deep learning on graphs. In ICLR 2019 Workshop on Representation Learning on Graphs and Manifolds, 2019.
  • [26] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, May 1992.
  • [27] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML), 2015.
  • [28] Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In International Conference on Machine Learning (ICML), 2016.
  • [29] Wayne W. Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33(4):452–473, 1977.
  • [30] Marinka Zitnik and Jure Leskovec. Predicting multicellular function through multi-layer tissue networks. Bioinformatics, 33(14):190–198, 2017.