
Path-Augmented Graph Transformer Network

by   Benson Chen, et al.

Much of the recent work on learning molecular representations has been based on Graph Convolution Networks (GCN). These models rely on local aggregation operations and can therefore miss higher-order graph properties. To remedy this, we propose Path-Augmented Graph Transformer Networks (PAGTN) that are explicitly built on longer-range dependencies in graph-structured data. Specifically, we use path features in molecular graphs to create global attention layers. We compare our PAGTN model against the GCN model and show that our model consistently outperforms GCNs on molecular property prediction datasets including quantum chemistry (QM7, QM8, QM9), physical chemistry (ESOL, Lipophilicity) and biochemistry (BACE, BBBP).


1 Introduction

Graph Convolution Networks (GCN) have successfully been applied to molecular graph datasets (Duvenaud et al., 2015; Kearnes et al., 2016; Niepert et al., 2016; Jin et al., 2017). These “message-passing” algorithms exploit the feature locality of graphs through the usage of convolution operations (Gilmer et al., 2017). However, the convolution operator aggregates only local information, so long-range dependencies are naturally difficult for these models to learn. In molecular graphs, many informative structures are characterized by the paths between nodes. We propose the Path-Augmented Graph Transformer Network (PAGTN) model that utilizes these path features in global attention layers, resulting in a richer, more expressive model. Specifically, our model learns a better representation of the graph in the following ways:

Long-range dependencies In GCNs, long-range dependencies take many convolution layers to learn, because feature aggregation happens only within the immediate neighborhood of each node. For large enough graphs, GCNs may fail to capture these long-range dependencies entirely. Our PAGTN model can more easily capture these dependencies because every node attends to all other nodes in the graph.

Substructures In graph problems, it is imperative for a model to pick up the important substructures in the graph. GCN models necessitate several layers to propagate information and learn these substructures. The advantage of our model is that this interaction can be learned within a single layer.
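To make the long-range argument concrete, the following minimal sketch counts which nodes can influence a given node after a number of rounds of neighbor aggregation. The function name `receptive_field` and the six-atom chain are hypothetical illustrations, not part of the original model.

```python
def receptive_field(adj, source, num_layers):
    """Nodes whose features can reach `source` after `num_layers` rounds of
    neighbor aggregation (each round extends the reach by one hop)."""
    reached = {source}
    for _ in range(num_layers):
        reached |= {nbr for node in reached for nbr in adj[node]}
    return reached

# Hypothetical chain of six atoms: 0-1-2-3-4-5
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
for k in (1, 3, 5):
    print(k, sorted(receptive_field(chain, 0, k)))
```

In this toy chain, atom 5 only enters the receptive field of atom 0 after five rounds, whereas a single global attention layer connects the two directly.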

We test our PAGTN model against the GCN model on 7 benchmark molecular property prediction tasks spanning quantum chemistry (QM7, QM8, QM9), physical chemistry (ESOL, Lipophilicity) and biochemistry (BACE, BBBP) (Wu et al., 2018). Each dataset focuses on a different property of the molecule, making the composition of these datasets highly variable. Nevertheless, our model consistently shows improved performance over the GCN baseline, demonstrating that it can learn more powerful representations.

2 Related Works

Transformer architectures have triumphed over traditional recurrent and convolution models in many natural language tasks such as machine translation (Vaswani et al., 2017). While recurrent and convolution models often incorporate a single attention layer at the top (Luong et al., 2015), it has been shown that using only these globally-connected self-attention layers learns a much more powerful model.

Attention models on graphs have been explored in previous works. Primarily, the Graph Attention Network (Veličković et al., 2017) and its variants (Gong & Cheng, 2018; Zhang et al., 2018; Monti et al., 2018) aggregate information within local neighborhoods by using attention. We emphasize that our model focuses on the global connectivity of the nodes. Moreover, our model does not use any complex attention mechanism across layers, but rather provides a simple framework using the path features that works well empirically. Another proposed model, Graph Transformer (Li et al., 2019), uses global attention layers, but that model does not extend to graphs in which edge and path features are important.

3 Model

In this section, we first briefly overview the Transformer model. Then, we will go over our contributions, describing our variant of the Transformer model that uses path features to learn expressive representations of graphs.

3.1 Transformer

The Transformer model (Vaswani et al., 2017), in contrast to traditional recurrent or convolution architectures, consists of fully-connected attention layers. These models use multi-head self-attention, which confers more flexibility for the attention module. The attention layers are connected by position-wise feed-forward layers, with residual links and layer normalization present at each layer.

The transformer model itself has no direct notion of relative position, so it uses positional encodings in the form of sinusoidal functions. However, this form of positional encoding is not possible in graphs, because there is no longer a natural sequential ordering of the nodes. We introduce path features, which represent how two nodes are connected. These path features influence the attention module in the network, so that the node embeddings are globally aware. We first explain how we construct these path features, then how they are incorporated into the attention framework.

3.2 Path Features

We compute the path features between each node pair by taking the shortest path between them. Due to cycles on graphs, these shortest paths may not be unique. For molecular graphs, these cycles arise due to ring substructures on the graphs. Because the edge features are consistent within a single ring or cycle, multiple paths are almost always equivalent feature-wise; therefore, this approach is sensible for our model.

For efficiency, we truncate the path features to node pairs that lie within a fixed truncation distance of each other. We assume that as the distance between two nodes increases, the connectivity between them matters less, so this constraint acts as a natural regularizer on the model. While each node attends to all other nodes in the graph, it has rich edge features only for a local neighborhood.

The path features between two nodes are a concatenation of the following three components:

Edge features: constructed by concatenating the individual bond features along the shortest path between the two nodes. The features of each bond include its bond type, conjugacy and ring membership (whether or not that bond is in a ring). The edge features are simply the concatenation of these per-bond features in path order. Note that if the path is longer than the truncation distance, we zero out these features, and if it is shorter, we pad the feature vector with zeros.

Distance: a one-hot feature of the distance between the two nodes, truncated at the truncation distance.

Ring Membership: a one-hot feature denoting whether the two nodes are in the same ring. For molecular graphs, we find that it is also helpful to include one-hot features for specific rings such as five- and six-membered aromatic rings. Note that this is distinct from the bond ring membership feature, which indicates whether a particular bond is part of a ring.
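The three components above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper names (`shortest_path`, `path_features`), the two-dimensional toy bond features, the extra distance slot for pairs beyond the truncation distance, and the two-slot ring encoding are all hypothetical choices. The pair is assumed to consist of two distinct nodes.

```python
from collections import deque

def shortest_path(adj, src, dst):
    """Return one shortest path from src to dst as a list of nodes (BFS), or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nbr in adj[node]:
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    return None

def path_features(adj, bond_feats, rings, i, j, max_dist):
    """Concatenate edge, distance and ring features for two distinct nodes i, j."""
    bond_dim = len(next(iter(bond_feats.values())))
    path = shortest_path(adj, i, j)
    # Unreachable pairs are treated like pairs beyond the truncation distance.
    n_bonds = len(path) - 1 if path is not None else max_dist + 1
    # Edge features: per-bond features in path order, zero-padded to max_dist
    # bonds; left as all zeros when the path is longer than max_dist.
    edge = []
    if n_bonds <= max_dist:
        for a, b in zip(path, path[1:]):
            edge.extend(bond_feats[frozenset((a, b))])
    edge.extend([0.0] * (max_dist * bond_dim - len(edge)))
    # Distance: one-hot over 1..max_dist plus one slot for "further than max_dist"
    # (the extra slot is an assumption of this sketch).
    dist = [0.0] * (max_dist + 1)
    dist[min(n_bonds, max_dist + 1) - 1] = 1.0
    # Ring membership: [same ring, not same ring].
    same_ring = any(i in ring and j in ring for ring in rings)
    ring = [1.0, 0.0] if same_ring else [0.0, 1.0]
    return edge + dist + ring

# Toy graph: a three-membered ring (0, 1, 2) with a two-atom tail 2-3-4.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
ring_bond, chain_bond = [1.0, 1.0], [1.0, 0.0]  # hypothetical [is_single, in_ring]
bond_feats = {frozenset(e): ring_bond for e in [(0, 1), (0, 2), (1, 2)]}
bond_feats.update({frozenset(e): chain_bond for e in [(2, 3), (3, 4)]})
rings = [{0, 1, 2}]
print(path_features(adj, bond_feats, rings, 0, 3, max_dist=2))
```

Because breadth-first search returns only one of possibly several shortest paths, the sketch relies on the same observation as the text: within a ring, alternative shortest paths usually carry equivalent bond features.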

Figure 1: Illustration of graph propagation properties for GCN (left) and our PAGTN model (right). For the GCN, the source attention node (green) only attends to its immediate neighbors (blue). In the PAGTN, the source attention node (green) has connectivity information in the form of path features for its local neighborhood (blue), but also attends to all other nodes (yellow).

A comparison of the information propagation properties of the network layers is illustrated in Figure 1. In regular GCNs, each layer affects only the direct neighborhood of a node, so learning from the whole graph can require many layers of computation. In our PAGTN model, every node is globally connected, which makes learning complex dependencies easier.

3.3 Additive Self-Attention

Although transformer models normally use scaled dot-product attention, we found in our experiments that an additive form of attention was easier to train and resulted in better performance. One way we deviate from standard self-attention modules is that we exclude the source node when computing attention for that node. The residual links at each layer keep the learned embedding representative of the original input node.

Define $x \in \mathbb{R}^{n \times d_x}$ as a matrix of the input node features, where $n$ is the number of nodes and $d_x$ is the number of node features. Similarly, let $p \in \mathbb{R}^{n \times n \times d_p}$ be a tensor of the input pairwise path features, where $d_p$ is the number of path features.

At each layer, we update the node features by computing a weighted average using learned attention weights. Let $h^k \in \mathbb{R}^{n \times d_h}$ represent the node features at layer $k$, where $d_h$ is the number of model features. Note that the elements of $h^0$ are the linearly transformed input features ($h^0_i = W x_i$ for a learned matrix $W$). We compute $s_{ij}$, the attention score of node pair $(i, j)$, as:


The attention probabilities $a_{ij}$ are calculated as a softmax over the attention scores. As mentioned earlier, we exclude the source node itself when computing the attention probabilities.


Using the attention probabilities, we can compute a weighted average over the node features. Since we note the importance of path features in graphs, we define the output features to be a function of both node and path features. Here, $\sigma$ is some non-linear function (we use ReLU for our experiments).
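The layer described in this section can be sketched with NumPy as below. This is one plausible instantiation under stated assumptions rather than the authors' exact update rule: the score is taken to be a small additive network over the concatenation of the two node embeddings and their path features, the LeakyReLU inside the score, the weight names `W1`, `w2`, `Wo`, and the placement of the residual link inside the ReLU are all assumptions of this sketch. It requires at least two nodes.

```python
import numpy as np

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def pagtn_layer(h, p, W1, w2, Wo):
    """One single-head additive attention update (assumed form, see text above).

    h: (n, d_h) node features, p: (n, n, d_p) path features,
    W1: (2*d_h + d_p, d_a), w2: (d_a,), Wo: (d_h + d_p, d_h).
    """
    n, d_h = h.shape
    hi = np.broadcast_to(h[:, None, :], (n, n, d_h))  # source node i
    hj = np.broadcast_to(h[None, :, :], (n, n, d_h))  # attended node j
    # Additive score s_ij computed from the concatenation [h_i ; h_j ; p_ij].
    scores = leaky_relu(np.concatenate([hi, hj, p], axis=-1) @ W1) @ w2  # (n, n)
    # Exclude the source node itself, then softmax over the remaining nodes.
    scores = np.where(np.eye(n, dtype=bool), -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    # Attention-weighted average over a transform of [h_j ; p_ij], combined
    # with a residual link to h_i and passed through a ReLU.
    messages = np.concatenate([hj, p], axis=-1) @ Wo  # (n, n, d_h)
    h_new = np.maximum(0.0, h + np.einsum("ij,ijd->id", attn, messages))
    return h_new, attn

rng = np.random.default_rng(0)
n, d_h, d_p, d_a = 5, 8, 4, 16
h = rng.normal(size=(n, d_h))
p = rng.normal(size=(n, n, d_p))
W1 = rng.normal(size=(2 * d_h + d_p, d_a))
w2 = rng.normal(size=d_a)
Wo = rng.normal(size=(d_h + d_p, d_h))
h_new, attn = pagtn_layer(h, p, W1, w2, Wo)
print(h_new.shape, attn.sum(axis=1))  # each attention row sums to 1
```

Masking the diagonal before the softmax is what implements the exclusion of the source node; the residual term then reintroduces the node's own embedding.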


As introduced in (Vaswani et al., 2017), multi-head attention can often benefit the model by allowing it to attend more easily to different aspects of the input data. If we split the attention into multiple heads, we can define the update rule for each node embedding as a function of the embeddings associated with the individual heads:


Here, the embeddings of the individual heads are combined with the concatenation operator. Empirically, we find that using multi-head attention helps on some tasks, but not on all tasks.

3.4 Molecule Embedding

Since we are interested in property prediction tasks for the molecule as a whole, we compute a molecule embedding by aggregating the individual node embeddings. Here, we add a residual link to the input features, x, of the network.


We choose the sum operator to aggregate the feature embeddings, which has higher expressive power than other classic operators (Xu et al., 2018). The target property is predicted using a 1-layer MLP with the molecule embedding as input.
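A minimal NumPy sketch of this readout is given below. The exact way the residual link enters the sum is an assumption of the sketch (a learned linear map `W_res` of the inputs added to the final node embeddings), as are the function names and the reading of the 1-layer MLP as a single linear layer.

```python
import numpy as np

def molecule_embedding(h_final, x, W_res):
    """Sum-pool the final node embeddings after adding a linear residual of x."""
    return (h_final + x @ W_res).sum(axis=0)

def predict(h_final, x, W_res, w_out, b_out):
    """Single linear layer on top of the molecule embedding (assumed form)."""
    return molecule_embedding(h_final, x, W_res) @ w_out + b_out

# Sum pooling separates inputs that mean pooling cannot tell apart.
one_atom = np.array([[1.0, 2.0]])
two_atoms = np.array([[1.0, 2.0], [1.0, 2.0]])
print(one_atom.mean(axis=0), two_atoms.mean(axis=0))  # identical under mean
print(one_atom.sum(axis=0), two_atoms.sum(axis=0))    # different under sum
```

The toy comparison at the end illustrates the expressiveness argument: two molecules made of identical atoms but of different size collapse to the same vector under mean pooling, while the sum keeps them distinct.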

4 Experiments

4.1 Experimental Setup

We test our model on 7 benchmark property prediction tasks, including quantum mechanics (QM7, QM8, QM9), physical chemistry (ESOL, Lipophilicity) and biochemistry (BACE, BBBP) (Wu et al., 2018).

We split each dataset into 10 different folds of 80:10:10 (train:validation:test) splits, and record the average performance over the folds using the appropriate measure for each dataset. Since these datasets feature markedly different properties, we tune the hyperparameters of the model for individual datasets.
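One way to produce such splits is sketched below, assuming that the 10 folds are independent random 80:10:10 partitions of the same dataset. The function name `random_splits` and the seeding scheme are hypothetical.

```python
import random

def random_splits(n_items, n_folds=10, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return n_folds independent (train, valid, test) index splits."""
    rng = random.Random(seed)
    n_train = int(fractions[0] * n_items)
    n_valid = int(fractions[1] * n_items)
    folds = []
    for _ in range(n_folds):
        idx = list(range(n_items))
        rng.shuffle(idx)
        train = idx[:n_train]
        valid = idx[n_train:n_train + n_valid]
        test = idx[n_train + n_valid:]  # remainder goes to the test set
        folds.append((train, valid, test))
    return folds

folds = random_splits(1128)  # 1,128 is the size of the ESOL dataset
print(len(folds), [len(part) for part in folds[0]])
```

The reported score for a dataset would then be the mean (and spread) of the test metric over the folds.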

4.2 Baselines

We compare our transformer to several baselines.

MolNet MoleculeNet (Wu et al., 2018) tested many graph-based deep learning methods as well as more conventional methods on these property prediction datasets. We use their top performing model for each dataset.

GCN This is a traditional graph convolution model, and here we use a model similar to that of (Jin et al., 2017). We find that this model achieves very competitive results compared to MolNet (which itself uses many different graph-based convolution models), and is therefore a fair baseline. GCN models can have a self-attention layer at the top, but we find empirically that this often hurts performance, so we do not include this attention layer in our baseline.

PAGTN (Local) We include a variant of our PAGTN model, which does not attend to nodes for which there are no path features. That is, the model masks out nodes that are further than the truncation distance from the source attention node. We include this baseline to show that global attention does indeed improve performance.

Our proposed model is dubbed the PAGTN (Global), which attends globally to all nodes.
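The difference between the local and global variants comes down to the attention mask. The sketch below illustrates this with NumPy; the function name `masked_attention` and the uniform toy scores are hypothetical, and it assumes every node has at least one other node within the truncation distance.

```python
import numpy as np

def masked_attention(scores, dist, max_dist=None):
    """Softmax over attention scores, always excluding the source node.

    With max_dist=None every other node is attended to (global variant);
    otherwise nodes further than max_dist are masked out (local variant).
    """
    n = scores.shape[0]
    blocked = np.eye(n, dtype=bool)
    if max_dist is not None:
        blocked |= dist > max_dist
    masked = np.where(blocked, -np.inf, scores)
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)

# Toy chain of four nodes, so the shortest-path distance is |i - j|.
idx = np.arange(4)
dist = np.abs(idx[:, None] - idx[None, :])
scores = np.zeros((4, 4))
print(masked_attention(scores, dist))              # global: spread over 3 nodes
print(masked_attention(scores, dist, max_dist=1))  # local: direct neighbors only
```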

| Data set | # Data | Metric | MolNet | GCN | PAGTN (Local) | PAGTN (Global) |
|---|---|---|---|---|---|---|
| QM7 | 6,830 | MAE | see footnote | 52.4 ± 2.8 | 48.9 ± 3.4 | **47.8** ± 3.0 |
| QM8 | 21,786 | MAE | .0143 | .0105 ± .0003 | .0108 ± .0003 | **.0102** ± .0003 |
| QM9 | 133,885 | MAE | 2.35 | 2.20 ± .03 | 2.10 ± .04 | **2.07** ± .05 |
| ESOL | 1,128 | RMSE | .580 | .587 ± .05 | .592 ± .06 | **.554** ± .06 |
| Lipophilicity | 4,200 | RMSE | .655 | .578 ± .05 | .592 ± .05 | **.572** ± .04 |
| BACE | 1,513 | AUC | .867 | .878 ± .02 | .876 ± .02 | **.880** ± .01 |
| BBBP | 2,039 | AUC | .729 | .907 ± .03 | .898 ± .04 | **.913** ± .03 |

Table 1: Results comparing our PAGTN model to various baselines. The metrics used were MAE for the quantum mechanics datasets (QM7, QM8, QM9), RMSE for the physical chemistry datasets (ESOL, Lipophilicity), and AUC for the biochemistry datasets (BACE, BBBP). The bold numbers represent the model with the best performance.

Footnote: MolNet uses a stratified sampling of the data for QM7, whereas we use random sampling for this work.

4.3 Property Prediction

The results of the property prediction tasks are shown in Table 1. We first see that the results of the GCN model are comparable to those of MolNet (Wu et al., 2018). Compared to the GCN model, our PAGTN model achieves superior performance on all 7 of these property prediction tasks, illustrating the broad representational power of the model. Furthermore, comparing against the local PAGTN model, we see that attending globally rather than restricting attention to the local neighborhood improves performance on every dataset. This shows that global attention does indeed help the model.

4.4 Ring Membership

To help elucidate why the PAGTN formulation is better than that of GCN, we turn to a synthetic task. We note that certain properties such as ring membership can prove difficult for regular graph convolution networks. To test this observation, and to demonstrate the effectiveness of our PAGTN model, we create a synthetic dataset by choosing a subset of 5,769 molecules from the property prediction datasets that have at least 2 rings. For each molecule, we randomly choose 5 pairs of atoms that are in the same ring, and 5 pairs of atoms that are in different rings. For atoms in fused ring systems, we count two atoms as being in the same ring only if they are in the smallest possible ring system.
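The pair sampling for this synthetic task can be sketched as follows. The sketch assumes the smallest rings of each molecule are already available as sets of atom indices, and that "different rings" means both atoms are ring atoms that share no ring; the function name `sample_ring_pairs` and the fused two-ring example are hypothetical.

```python
import itertools
import random

def sample_ring_pairs(rings, n_pairs=5, seed=0):
    """Sample atom pairs that share a ring and pairs that do not.

    rings: list of sets of atom indices, one set per smallest ring.
    Returns (same_ring_pairs, different_ring_pairs).
    """
    rng = random.Random(seed)
    ring_atoms = sorted(set().union(*rings))
    same, different = [], []
    for a, b in itertools.combinations(ring_atoms, 2):
        if any(a in ring and b in ring for ring in rings):
            same.append((a, b))
        else:
            different.append((a, b))
    same_sample = rng.sample(same, min(n_pairs, len(same)))
    diff_sample = rng.sample(different, min(n_pairs, len(different)))
    return same_sample, diff_sample

# Two fused six-membered rings sharing atoms 4 and 5.
fused = [{0, 1, 2, 3, 4, 5}, {4, 5, 6, 7, 8, 9}]
same_pairs, diff_pairs = sample_ring_pairs(fused)
print(same_pairs)
print(diff_pairs)
```

In the fused example, the shared atoms 4 and 5 count as being in the same ring as every other atom, so different-ring pairs can only combine atoms 0 to 3 with atoms 6 to 9.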

| Model | Accuracy | AUC |
|---|---|---|
| GCN | 91.6 | 96.5 |
| PAGTN (Global) | 97.8 | 99.8 |

Table 2: Results comparing the GCN and the PAGTN (Global) models on a synthetic ring membership prediction task, which is to test whether or not two nodes are in the same ring on the graph. GCN cannot always predict this property well, while the PAGTN can easily incorporate these features into the model.

From Table 2, we see that the GCN fails to perfectly predict ring membership. This is not surprising, as the convolution operation has to learn to disambiguate the features of nodes in the same ring from those of nodes in different rings. These subtle but important graph features are imperative for models to fully capture the representation of the graph. Our PAGTN naturally solves this issue, since we can incorporate these features as a part of the network, whereas it is much more difficult to incorporate them in the local convolution model. Note that the PAGTN still does not solve the problem perfectly. This is because in highly symmetrical graphs multiple nodes are equivalent, which leads to ambiguous ring membership, as seen in Figure 2.

Figure 2: The two green-circled atoms are completely symmetric, so their output feature embeddings are equivalent. Since the ring membership prediction is made by aggregating pairwise node features, it is impossible to tell whether any other atom is in the same or different ring from these two atoms.

5 Conclusion

In this paper, we introduced the PAGTN model, which exploits the connectivity structure of the data in its global attention mechanisms. Through the path features that we engineer into the model's attention layers, our model captures the complex structures of graphs better than GCNs do. On 7 different chemical property prediction tasks, we have shown that our PAGTN model can outperform traditional GCNs, and we hope that these global-attention models that incorporate path features will be used more frequently in works on molecular graphs moving forward.