
Equiformer: Equivariant Graph Attention Transformer for 3D Atomistic Graphs

06/23/2022
by   Yi-Lun Liao, et al.
MIT

3D-related inductive biases like translational invariance and rotational equivariance are indispensable to graph neural networks operating on 3D atomistic graphs such as molecules. Inspired by the success of Transformers in various domains, we study how to incorporate these inductive biases into Transformers. In this paper, we present Equiformer, a graph neural network leveraging the strength of Transformer architectures and incorporating SE(3)/E(3)-equivariant features based on irreducible representations (irreps). Irreps features encode equivariant information in channel dimensions without complicating graph structures. The simplicity enables us to directly incorporate them by replacing original operations with equivariant counterparts. Moreover, to better adapt Transformers to 3D graphs, we propose a novel equivariant graph attention, which considers both content and geometric information such as relative position contained in irreps features. To improve expressivity of the attention, we replace dot product attention with multi-layer perceptron attention and include non-linear message passing. We benchmark Equiformer on two quantum properties prediction datasets, QM9 and OC20. For QM9, among models trained with the same data partition, Equiformer achieves best results on 11 out of 12 regression tasks. For OC20, under the setting of training with IS2RE data and optionally IS2RS data, Equiformer improves upon state-of-the-art models. Code reproducing all main results will be available soon.


1 Introduction

Machine learned models can accelerate the prediction of quantum properties of atomistic systems like molecules by learning approximations of ab initio calculations neural_message_passing_quantum_chemistry; deep_potential_molecular_dynamics; push_limit_of_md_100m; dimenet_pp; nequip; oc20; deep_potential_molecular_dynamics_simulation; se3_wavefunction; gemnet_xl; quantum_scaling; allergo. In particular, graph neural networks (GNNs) have gained increasing popularity due to their performance. By modeling atomistic systems as graphs, GNNs naturally treat the set-like nature of collections of atoms, encode the interaction between atoms in node features, and update the features by passing messages between nodes. One factor contributing to the success of neural networks is the ability to incorporate inductive biases that exploit the symmetry of data. Take convolutional neural networks (CNNs) for 2D images as an example: patterns in images should be recognized regardless of their positions, which motivates the inductive bias of translational equivariance. As for atomistic graphs, where each atom has its coordinate in 3D Euclidean space, we consider inductive biases related to the 3D Euclidean group $E(3)$, which include equivariance to 3D translation, 3D rotation, and inversion. Concretely, some properties like the energy of an atomistic system should be constant regardless of how we shift the system; others like forces should be rotated accordingly if we rotate the system. To incorporate these inductive biases, equivariant and invariant neural networks have been proposed. The former leverage geometric tensors like vectors for equivariant node features tfn; 3dsteerable; kondor2018clebsch; se3_transformer; nequip; segnn; allergo, and the latter augment graphs with invariant information such as distances and angles extracted from 3D graphs schnet; dimenet; dimenet_pp; spherenet; spinconv; gemnet.

A parallel line of research focuses on applying Transformer networks transformer to other domains like computer vision detr; vit; deit and graphs generalization_transformer_graphs; spectral_attention; graphormer; graphormer_3d and has demonstrated widespread success. However, as Transformers were developed for sequence data bert; wav2vec2; gpt3, it is crucial to incorporate domain-related inductive biases. For example, Vision Transformer vit shows that a pure Transformer applied to image classification cannot generalize well and achieves worse results than CNNs when trained on only ImageNet imagenet, since it lacks inductive biases like translational invariance. Note that ImageNet contains over 1.28M images, already larger than many quantum properties prediction datasets qm9_2; qm9_1; oc20. This highlights the necessity of including correct inductive biases when applying Transformers to the domain of 3D atomistic graphs.

In this work, we present Equiformer, an equivariant graph neural network utilizing SE(3)/E(3)-equivariant features built from irreducible representations (irreps) and equivariant attention mechanisms to combine 3D-related inductive biases with the strength of Transformers. Irreps features encode equivariant information in channel dimensions without complicating graph structures. The simplicity enables us to directly incorporate them into Transformers by replacing original operations with equivariant counterparts and introducing an additional equivariant operation called the tensor product. Moreover, we propose a novel equivariant graph attention, which considers both content and geometric information such as relative position contained in irreps features. Equivariant graph attention improves upon typical attention in Transformers by replacing dot product attention with theoretically stronger multi-layer perceptron attention and including non-linear message passing. With these innovations, Equiformer demonstrates the possibility of generalizing Transformer-like networks to 3D atomistic graphs and achieves competitive results on two quantum properties prediction datasets, QM9 qm9_2; qm9_1 and OC20 oc20. For QM9, compared to models trained with the same data partition, Equiformer achieves the best results on 11 out of 12 regression tasks. For OC20, under the setting of training with IS2RE data and optionally IS2RS data, Equiformer improves upon state-of-the-art models.

2 Background

2.1 Graph Neural Networks

Formally, a graph $G = (V, E)$ consists of nodes $i \in V$ and edges $e_{ij} \in E$ representing relationships between nodes $i$ and $j$. Graphs can represent atomistic systems like molecules, where nodes are atoms and edges can represent bonds or atom pairs within a certain distance; we refer to these as atomistic graphs. Graph neural networks (GNNs) extract meaningful representations from graphs through message-passing layers. Given feature $x_i^{(t)}$ on node $i$ and edge feature $e_{ij}$ at the $t$-th layer, a layer first aggregates messages from neighbors and then updates the features on each node as follows:

$$m_i^{(t)} = \sum_{j \in \mathcal{N}(i)} f_m^{(t)}\big(x_i^{(t)}, x_j^{(t)}, e_{ij}\big), \qquad x_i^{(t+1)} = f_u^{(t)}\big(x_i^{(t)}, m_i^{(t)}\big) \tag{1}$$

where $\mathcal{N}(i)$ denotes the set of neighbors of node $i$ and $f_m^{(t)}$ and $f_u^{(t)}$ are learnable functions. For atomistic systems in 3D spaces, each node $i$ is additionally associated with its position $\vec{r}_i$, and edge features typically become functions of the relative position $\vec{r}_{ij}$.
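
To make Eq. 1 concrete, here is a minimal, self-contained PyTorch sketch of one message-passing layer on a toy graph; the MLPs standing in for $f_m$ and $f_u$, the feature sizes, and the edge list are all illustrative rather than the Equiformer implementation.

```python
import torch
import torch.nn as nn

num_nodes, feat_dim, edge_dim = 4, 8, 4
x = torch.randn(num_nodes, feat_dim)             # node features x_i^(t)
edge_index = torch.tensor([[0, 1, 2, 3],         # source nodes j
                           [1, 2, 3, 0]])        # target nodes i
e = torch.randn(edge_index.shape[1], edge_dim)   # edge features e_ij

f_m = nn.Sequential(nn.Linear(2 * feat_dim + edge_dim, feat_dim), nn.SiLU())  # message fn
f_u = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.SiLU())             # update fn

src, dst = edge_index
messages = f_m(torch.cat([x[dst], x[src], e], dim=-1))   # one message per edge (j -> i)
agg = torch.zeros_like(x).index_add_(0, dst, messages)   # sum messages over neighbors of i
x_next = f_u(torch.cat([x, agg], dim=-1))                # updated node features x_i^(t+1)
print(x_next.shape)                                      # torch.Size([4, 8])
```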

2.2 Equivariance

Atomistic systems are often described using coordinate systems. For 3D Euclidean space, we can freely choose coordinate systems and change between them via the symmetries of 3D space: 3D translation, rotation, and inversion ($\vec{r} \rightarrow -\vec{r}$). These transformations form the Euclidean group $E(3)$; translation and rotation form $SE(3)$, rotation alone forms $SO(3)$, and rotation and inversion form $O(3)$. The laws of physics are invariant to the choice of coordinate systems, and therefore properties of atomistic systems are equivariant, e.g., when we rotate our coordinate system, quantities like energy remain the same while others like forces rotate accordingly. Formally, a function $f$ mapping between vector spaces $X$ and $Y$ is equivariant to a group of transformations $G$ if for any input $x \in X$ and group element $g \in G$ we have $f(D_X(g)x) = D_Y(g)f(x)$, where $D_X(g)$ and $D_Y(g)$ are the transformation matrices parametrized by $g$ in $X$ and $Y$.

Incorporating equivariance into neural networks as inductive biases is crucial as this enables generalizing to unseen data in a predictable manner. For example, 2D convolution is equivariant to the group of 2D translation: when input features are shifted by an amount parametrized by $g$, output features are shifted by the same amount. For 3D atomistic graphs, we consider the group $E(3)$. Features $x_i$ and $e_{ij}$ and learnable functions $f_m$ and $f_u$ should be $E(3)$-equivariant to geometric transformations acting on positions $\vec{r}_i$. In this work, following previous works tfn; kondor2018clebsch; 3dsteerable implemented in e3nn e3nn, we achieve SE(3)/E(3)-equivariance by using equivariant features based on vector spaces of irreducible representations and equivariant operations like the tensor product for learnable functions.
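
As a toy illustration of this definition (not taken from the paper), the map $f(\vec{r}) = \lVert\vec{r}\rVert^2\,\vec{r}$ is equivariant to rotations, and the property $f(R\vec{r}) = R\,f(\vec{r})$ can be checked numerically:

```python
import torch

def f(r):
    # ||r||^2 * r: a simple rotation-equivariant map from R^3 to R^3
    return (r * r).sum(-1, keepdim=True) * r

# random rotation matrix: orthogonalize a random matrix and force det = +1
Q, _ = torch.linalg.qr(torch.randn(3, 3))
R = Q * torch.sign(torch.linalg.det(Q))

r = torch.randn(3)
lhs = f(R @ r)   # transform the input, then apply f
rhs = R @ f(r)   # apply f, then transform the output
print(torch.allclose(lhs, rhs, atol=1e-5))  # True
```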

2.3 Irreducible Representations

A group representation Dresselhaus2007; zee defines the transformation matrices of group elements that act on a vector space $X$. For the 3D Euclidean group $E(3)$, two examples of vector spaces with different transformation matrices are scalars and Euclidean vectors in $\mathbb{R}^3$, i.e., vectors change with rotation while scalars do not. To address translation symmetry, we simply operate on relative positions. Below we focus our discussion on $O(3)$. The transformation matrices of rotation and inversion are separable and commute, and we first discuss the irreducible representations of $SO(3)$.

Any group representation of $SO(3)$ on a given vector space can be decomposed into a concatenation of provably smallest transformation matrices called irreducible representations (irreps). Specifically, for a group element $g \in SO(3)$, there are $(2L+1)$-by-$(2L+1)$ irreps matrices called Wigner-D matrices $D^{(L)}(g)$ acting on $(2L+1)$-dimensional vector spaces, where the degree $L$ is a non-negative integer. $L$ can be interpreted as an angular frequency and determines how quickly vectors change when rotating coordinate systems. Wigner-D matrices of different $L$ act on independent vector spaces. Vectors transformed by $D^{(L)}(g)$ are type-$L$ vectors, with scalars and Euclidean vectors being type-$0$ and type-$1$ vectors. It is common to index the elements of type-$L$ vectors with an index $m$ called order, where $-L \le m \le L$.

The group of inversion has only two elements, identity and inversion, and two irreps, even ($e$) and odd ($o$). Vectors transformed by the even irrep do not change sign under inversion while those transformed by the odd irrep do. We create irreps of $O(3)$ by simply multiplying those of $SO(3)$ and those of inversion, and we introduce a parity $p$ to type-$L$ vectors to denote how they transform under inversion. Therefore, type-$L$ vectors in $SO(3)$ are extended to type-$(L, p)$ vectors in $O(3)$, where $p$ is $e$ or $o$. In the following, we use type-$L$ vectors for ease of discussion, but the discussion generalizes to type-$(L, p)$ vectors unless otherwise stated.

Irreps Features.

We concatenate multiple type-$L$ vectors to form $SE(3)$-equivariant irreps features. Concretely, an irreps feature $f$ contains $C_L$ type-$L$ vectors, where $0 \le L \le L_{max}$ and $C_L$ is the number of channels for type-$L$ vectors. We index irreps features by channel $c$, degree $L$, and order $m$ and denote the elements as $f^{(L)}_{c, m}$. Different channels of type-$L$ vectors are parametrized by different weights but are transformed with the same Wigner-D matrix $D^{(L)}$. Regular scalar features correspond to including only type-$0$ vectors. This generalizes to $O(3)$ by including inversion and extending type-$L$ vectors to type-$(L, p)$ vectors.

Spherical Harmonics.

Euclidean vectors $\vec{r}$ in $\mathbb{R}^3$ can be projected into type-$L$ vectors by using spherical harmonics (SH) $Y^{(L)}$: $f^{(L)} = Y^{(L)}\big(\frac{\vec{r}}{\lVert\vec{r}\rVert}\big)$. SH are $SO(3)$-equivariant, with $Y^{(L)}(R\vec{r}) = D^{(L)}(R)\,Y^{(L)}(\vec{r})$ for a rotation $R$. SH of relative positions generate the first set of irreps features. Equivariant information propagates to other irreps features through equivariant operations like the tensor product.
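
The following short check uses e3nn (listed among our dependencies in Sec. 5) to project a Euclidean vector onto spherical harmonics up to $L_{max} = 2$ and verify $Y(R\vec{r}) = D(R)\,Y(\vec{r})$; the chosen $L_{max}$ is illustrative, and the exact basis conventions are those of e3nn.

```python
import torch
from e3nn import o3

irreps_sh = o3.Irreps.spherical_harmonics(lmax=2)    # 1x0e + 1x1o + 1x2e
r = torch.randn(3)                                   # a Euclidean vector
R = o3.rand_matrix()                                 # random 3D rotation matrix
D = irreps_sh.D_from_matrix(R)                       # block-diagonal Wigner-D matrices

y_rotate_input = o3.spherical_harmonics(irreps_sh, R @ r, normalize=True)
y_rotate_output = D @ o3.spherical_harmonics(irreps_sh, r, normalize=True)
print(torch.allclose(y_rotate_input, y_rotate_output, atol=1e-5))  # True
```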

2.4 Tensor Product

We use tensor products to interact different type-$L$ vectors; we first discuss the tensor product for $SO(3)$. The tensor product, denoted as $\otimes$, uses Clebsch-Gordan coefficients to combine a type-$L_1$ vector $f^{(L_1)}$ and a type-$L_2$ vector $g^{(L_2)}$ and produces a type-$L_3$ vector as follows:

$$\big(f^{(L_1)} \otimes g^{(L_2)}\big)^{(L_3)}_{m_3} = \sum_{m_1 = -L_1}^{L_1} \sum_{m_2 = -L_2}^{L_2} C^{(L_3, m_3)}_{(L_1, m_1)(L_2, m_2)} \, f^{(L_1)}_{m_1} \, g^{(L_2)}_{m_2} \tag{2}$$

where $m$ denotes order and $f^{(L)}_{m}$ refers to the $m$-th element of $f^{(L)}$. The Clebsch-Gordan coefficients $C^{(L_3, m_3)}_{(L_1, m_1)(L_2, m_2)}$ are non-zero only when $|L_1 - L_2| \le L_3 \le L_1 + L_2$ and thus restrict output vectors to be of certain types. For efficiency, we discard vectors with $L_3 > L_{max}$, where $L_{max}$ is a hyper-parameter, to prevent vectors of increasingly higher dimensions. The tensor product is an equivariant operation, with $\big(D^{(L_1)}(g)f^{(L_1)}\big) \otimes \big(D^{(L_2)}(g)g^{(L_2)}\big) = D^{(L_3)}(g)\big(f^{(L_1)} \otimes g^{(L_2)}\big)^{(L_3)}$ for $g \in SO(3)$.

We call each distinct non-trivial combination of $(L_1, L_2, L_3)$ a path. Each path is independently equivariant, and we can assign one learnable weight to each path in tensor products, which is similar to typical linear layers. We can generalize Eq. 2 to irreps features and include multiple channels of vectors of different types by iterating over all paths associated with channels of vectors; weights are then indexed by the channels and degrees of the input and output vectors involved in each path. We use $\otimes_w$ to represent a tensor product with weights $w$. Weights can be conditioned on quantities like relative distances. For example, in GNNs, we use tensor products of irreps features and SH embeddings of relative positions to transform features into messages and use learnable functions transforming relative distances $\lVert \vec{r}_{ij} \rVert$ into the weights of tensor products schnet; nequip. Please refer to Sec. A.4 in the appendix for a discussion on including inversion in tensor products.
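
A hedged sketch of the pattern described above, using e3nn's FullyConnectedTensorProduct with per-edge weights produced by a small radial MLP; the irreps, layer sizes, and the 16-dimensional distance embedding are illustrative choices, not the Equiformer hyper-parameters.

```python
import torch
import torch.nn as nn
from e3nn import o3

irreps_node = o3.Irreps("8x0e + 4x1o")           # node irreps features
irreps_sh = o3.Irreps.spherical_harmonics(2)     # SH embedding of relative positions
irreps_out = o3.Irreps("8x0e + 4x1o")

# tensor product whose path weights are supplied externally, one set per edge
tp = o3.FullyConnectedTensorProduct(irreps_node, irreps_sh, irreps_out, shared_weights=False)
radial_mlp = nn.Sequential(nn.Linear(16, 32), nn.SiLU(), nn.Linear(32, tp.weight_numel))

num_edges = 10
x_src = irreps_node.randn(num_edges, -1)         # features on source nodes
rel_pos = torch.randn(num_edges, 3)              # relative positions r_ij
sh = o3.spherical_harmonics(irreps_sh, rel_pos, normalize=True, normalization="component")
dist_embed = torch.randn(num_edges, 16)          # stands in for a radial basis of ||r_ij||
weights = radial_mlp(dist_embed)                 # per-edge tensor-product weights
messages = tp(x_src, sh, weights)                # shape (num_edges, irreps_out.dim)
print(messages.shape)
```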

2.5 -Equivariant Neural Networks

$E(3)$-equivariant neural networks (ENNs) are built such that every operation is equivariant to Euclidean symmetry, satisfying that for $g \in E(3)$, inputs $x_1, x_2$ and scalar weights $w_1, w_2$, $D_Y(g)\big(w_1 f(x_1) + w_2 f(x_2)\big) = w_1 f\big(D_X(g)x_1\big) + w_2 f\big(D_X(g)x_2\big)$ tfn; kondor2018clebsch; 3dsteerable. In addition to the tensor product, equivariant operations also include equivariant activation functions and normalizations as in typical neural networks, and we discuss those used in Equiformer in Sec. 3.1. Because an ENN is a composition of equivariant operations, it is also equivariant; transformations acting on input spaces propagate to the output spaces, e.g., when we rotate an input atomistic graph, predicted vectors are rotated accordingly.

3 Equiformer

Figure 1: Architecture of Equiformer. We embed input 3D graphs with atom and edge-degree embeddings and process them with Transformer blocks, consisting of equivariant graph attention and feed forward networks. In this figure, "⊗" denotes multiplication, "⊕" denotes addition, and "DTP" stands for depth-wise tensor product. "Σ" within a circle denotes summation over all neighbors. Gray cells indicate intermediate irreps features.

We incorporate SE(3)/E(3)-equivariant irreps features into Transformers transformer and use equivariant operations. To better adapt Transformers to 3D graph structures, we propose equivariant graph attention. The overall architecture of Equiformer is illustrated in Fig. 1.

3.1 Equivariant Operations for Irreps Features

Here we discuss equivariant operations used in Equiformer that serve as building blocks for equivariant graph attention and other modules. They include the equivariant version of the original operations in Transformers and the depth-wise tensor product, as illustrated in Fig. 2 in appendix.

Linear.

Linear layers are generalized to irreps features by transforming different type-$L$ vectors separately. Specifically, we apply separate linear operations to each group of type-$L$ vectors. We remove bias terms for non-scalar features with $L > 0$, as biases do not depend on inputs and therefore including biases for type-$L$ vectors with $L > 0$ can break equivariance.
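
For reference, e3nn provides such an equivariant linear layer as o3.Linear; it mixes channels within each type-$L$ separately and, by construction, attaches biases only to scalar outputs (the irreps below are illustrative).

```python
from e3nn import o3

irreps_in = o3.Irreps("16x0e + 8x1o + 4x2e")
lin = o3.Linear(irreps_in, "32x0e + 8x1o + 4x2e")  # separate weights per vector type
x = irreps_in.randn(5, -1)
print(lin(x).shape)  # torch.Size([5, 76]); 76 = 32*1 + 8*3 + 4*5
```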

Layer Normalization.

Transformers adopt layer normalization (LN) layer_norm to stabilize training. Given input $x \in \mathbb{R}^{N \times C}$, with $N$ being the number of nodes and $C$ the number of channels, LN calculates a linear transformation of the normalized input as follows:

$$\mathrm{LN}(x) = \frac{x - \mu}{\sigma} \circ \gamma + \beta \tag{3}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the input along the channel dimension $C$, $\gamma$ and $\beta$ are learnable parameters, and $\circ$ denotes element-wise product. By viewing the standard deviation as the root mean square value (RMS) of the L2-norms of type-$L$ vectors, LN can be generalized to irreps features. Specifically, given input $f^{(L)} \in \mathbb{R}^{N \times C \times (2L+1)}$ of type-$L$ vectors, the output is:

$$\mathrm{LN}\big(f^{(L)}\big) = \frac{f^{(L)}}{\mathrm{RMS}_C\big(\lVert f^{(L)} \rVert_2\big)} \circ \gamma^{(L)} \tag{4}$$

where $\lVert \cdot \rVert_2$ calculates the L2-norm of each type-$L$ vector in $f^{(L)}$, and $\mathrm{RMS}_C$ calculates the RMS of the L2-norms with the mean taken along the channel dimension $C$. We remove the mean and biases for type-$L$ vectors with $L > 0$, following the treatment in linear layers.
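
A plain-PyTorch sketch of Eq. 4 for a single group of type-$L$ vectors; the tensor layout and the placeholder scale are illustrative, not the exact Equiformer implementation.

```python
import torch

def equivariant_layer_norm(x, eps=1e-5):
    # x: (num_nodes, channels, 2L+1), one group of type-L vectors with L > 0
    norms = x.norm(dim=-1, keepdim=True)                          # L2-norm of each type-L vector
    rms = norms.pow(2).mean(dim=1, keepdim=True).add(eps).sqrt()  # RMS of norms over channels
    gamma = torch.ones(1, x.shape[1], 1)                          # stands in for learnable gamma^(L)
    return x / rms * gamma

x = torch.randn(4, 8, 3)   # 8 channels of type-1 vectors on 4 nodes
print(equivariant_layer_norm(x).shape)
# rotating x rotates the output identically, since only rotation-invariant norms are used
```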

Gate.

We adopt the gate activation 3dsteerable as the equivariant activation function. Typical activation functions are applied to type-$0$ vectors. For vectors of higher $L$, we multiply them with non-linearly transformed type-$0$ vectors to maintain equivariance. Specifically, given an input containing $C_L$ type-$L$ vectors for each $0 < L \le L_{max}$ and $C_0 + \sum_{L > 0} C_L$ type-$0$ vectors, we apply the SiLU activation silu; swish to the first $C_0$ type-$0$ vectors and the sigmoid function to the other $\sum_{L > 0} C_L$ type-$0$ vectors to obtain non-linear weights. Then, we multiply each type-$L$ vector with its corresponding non-linear weight. After the gate activation, the number of channels for type-$0$ vectors is reduced to $C_0$.
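
The gate activation is available in e3nn as nn.Gate; a small usage sketch with illustrative channel counts: 16 scalars go through SiLU, 8 extra scalars become sigmoid gates for 8 type-1 vectors, and the gated output keeps only the 16 scalars plus the 8 type-1 vectors.

```python
import torch
from e3nn import o3
from e3nn.nn import Gate

gate = Gate("16x0e", [torch.nn.functional.silu],   # scalars passed through SiLU
            "8x0e", [torch.sigmoid],               # gating scalars passed through sigmoid
            "8x1o")                                # type-1 vectors multiplied by the gates
x = o3.Irreps("16x0e + 8x0e + 8x1o").randn(4, -1)
out = gate(x)
print(gate.irreps_out, out.shape)  # 16x0e+8x1o, torch.Size([4, 40])
```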

Depth-wise Tensor Product.

The tensor product defines the interaction between vectors of different degrees. To improve its efficiency, we use the depth-wise tensor product (DTP), which restricts one type-$L'$ vector in the output irreps feature to depend on only one type-$L$ vector in the input irreps feature, where $L'$ can be equal to or different from $L$. This is similar to depth-wise convolution mobilenet, where one output channel depends on only one input channel. Weights in the DTP can be input-independent or conditioned on relative distances, and the DTP between two tensors $x$ and $y$ with weights $w$ is denoted as $x \otimes^{DTP}_{w} y$.
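
One way to realize a depth-wise tensor product in e3nn is o3.TensorProduct with "uvu" instructions, so that each output channel depends on a single input channel; this is a sketch with illustrative irreps, and the actual Equiformer layer may differ in details.

```python
import torch
from e3nn import o3

irreps_in = o3.Irreps("8x0e + 4x1o")
irreps_sh = o3.Irreps.spherical_harmonics(2)
lmax = 2

# one 'uvu' path per (input irrep, SH irrep, allowed output irrep);
# 'uvu' means output channel u depends only on input channel u (depth-wise)
out_list, instructions = [], []
for i, (mul, ir_in) in enumerate(irreps_in):
    for j, (_, ir_sh) in enumerate(irreps_sh):
        for ir_out in ir_in * ir_sh:              # selection rule |L1 - L2| <= L3 <= L1 + L2
            if ir_out.l <= lmax:
                instructions.append((i, j, len(out_list), "uvu", True))
                out_list.append((mul, ir_out))
irreps_out = o3.Irreps(out_list)

dtp = o3.TensorProduct(irreps_in, irreps_sh, irreps_out, instructions, shared_weights=False)
x = irreps_in.randn(10, -1)
sh = irreps_sh.randn(10, -1)
w = torch.randn(10, dtp.weight_numel)             # e.g. produced by a radial function
print(dtp(x, sh, w).shape)                        # (10, irreps_out.dim)
```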

3.2 Equivariant Graph Attention

Self-attention transformer; gat; se3_transformer; transformer_in_vision; graphormer; gatv2 transforms features sent from one spatial location to another with input-dependent weights. We use the notation from Transformers transformer and message passing networks neural_message_passing_quantum_chemistry; gns; egnn; segnn and define the message $m_{ij}$ sent from node $j$ to node $i$ as follows:

$$m_{ij} = a_{ij} \cdot v_{ij} \tag{5}$$

where the attention weights $a_{ij}$ depend on features on node $i$ and its neighbors and the values $v_{ij}$ are transformed with input-independent weights. In Transformers and Graph Attention Networks (GAT) gat; gatv2, $v_{ij}$ depends only on node $j$. In message passing networks neural_message_passing_quantum_chemistry; gns; egnn; segnn, $m_{ij}$ depends on features on nodes $i$ and $j$ with constant $a_{ij}$. The proposed equivariant graph attention adopts tensor products to incorporate content and geometric information and utilizes multi-layer perceptron attention for $a_{ij}$ and non-linear message passing for $v_{ij}$, as illustrated in Fig. 1(b).

Incorporating Content and Geometric Information.

Given features $x_i$ and $x_j$ on target node $i$ and source node $j$, we combine the two features with two linear layers to obtain the initial message $x_{ij}$. $x_{ij}$ is then passed to a DTP layer and a linear layer to consider geometric information like relative position contained in the different type-$L$ vectors of irreps features:

$$x_{ij} = \mathrm{Linear}(x_i) + \mathrm{Linear}(x_j), \qquad f_{ij} = \mathrm{Linear}\Big(x_{ij} \otimes^{DTP}_{w_{ij}} \mathrm{SH}\big(\vec{r}_{ij}\big)\Big) \tag{6}$$

$f_{ij}$ is the tensor product of $x_{ij}$ and the spherical harmonics embedding $\mathrm{SH}(\vec{r}_{ij})$ of $\vec{r}_{ij}$, with weights $w_{ij}$ parametrized by $\lVert \vec{r}_{ij} \rVert$. $f_{ij}$ considers both semantic and geometric features on source and target nodes in a linear manner and is used to derive attention weights and non-linear messages.

Multi-Layer Perceptron Attention.

Attention weights capture how each node interacts with neighboring nodes. Attention weights $a_{ij}$ are invariant to geometric transformation se3_transformer, and therefore we only use the type-$0$ vectors (scalars) of the message $f_{ij}$, denoted as $f^{(0)}_{ij}$, for attention. Note that $f^{(0)}_{ij}$ still encodes directional information, as these scalars are generated by tensor products of type-$L$ vectors with $L > 0$. Inspired by GATv2 gatv2, we adopt multi-layer perceptron attention (MLPA) instead of the dot product attention (DPA) used in Transformers transformer; transformer_in_vision. In contrast to dot products, MLPs are universal approximators mlp_universal_approximator; approximation_mlp; approximation_sigmoid and can theoretically capture any attention patterns gatv2. Similar to GAT gat; gatv2, given $f^{(0)}_{ij}$, we use one leaky ReLU layer and one linear layer to compute $a_{ij}$:

$$z_{ij} = a^{\top}\,\mathrm{LeakyReLU}\big(f^{(0)}_{ij}\big), \qquad a_{ij} = \mathrm{softmax}_j(z_{ij}) = \frac{\exp(z_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(z_{ik})} \tag{7}$$

where $a$ is a learnable vector of the same dimension as $f^{(0)}_{ij}$ and $z_{ij}$ is a single scalar. The output of attention is the sum of values $v_{ij}$ multiplied by the corresponding $a_{ij}$ over all neighboring nodes $j \in \mathcal{N}(i)$, where $v_{ij}$ can be obtained by linear or non-linear transformations of $f_{ij}$ as discussed below.
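
A sketch of Eq. 7 on the scalar part of the messages, using torch_geometric's softmax to normalize over each target node's neighbors; the dimensions and random inputs are illustrative.

```python
import torch
from torch_geometric.utils import softmax

num_edges, num_nodes, d_scalar, d_value = 12, 5, 16, 32
f0_ij = torch.randn(num_edges, d_scalar)          # type-0 (scalar) part of message f_ij
v_ij = torch.randn(num_edges, d_value)            # values (linear or non-linear messages)
dst = torch.randint(0, num_nodes, (num_edges,))   # target node i of each edge

a_vec = torch.randn(d_scalar)                     # stands in for the learnable vector a in Eq. 7
z_ij = torch.nn.functional.leaky_relu(f0_ij) @ a_vec          # one scalar per edge
alpha = softmax(z_ij, dst, num_nodes=num_nodes)               # normalize over neighbors of i
out = torch.zeros(num_nodes, d_value).index_add_(0, dst, alpha.unsqueeze(-1) * v_ij)
print(out.shape)  # torch.Size([5, 32])
```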

Non-Linear Message Passing.

Values $v_{ij}$ are features sent from one node to another, transformed with input-independent weights. We first split $f_{ij}$ into $f^{(L)}_{ij}$ and $f^{(0)}_{ij}$, where the former consists of type-$L$ vectors of all degrees and the latter consists of scalars only. Then, we perform a non-linear transformation on $f^{(L)}_{ij}$ to obtain the non-linear message:

$$v_{ij} = \mathrm{Linear}\Big(\mathrm{Gate}\big(f^{(L)}_{ij}\big) \otimes^{DTP}_{w} \mathrm{SH}\big(\vec{r}_{ij}\big)\Big) \tag{8}$$

We apply the gate activation to $f^{(L)}_{ij}$ to obtain gated features. We then use one DTP and a linear layer to enable interaction between non-linear type-$L$ vectors of different degrees, similar to how we transform $x_{ij}$ into $f_{ij}$. The weights $w$ here are input-independent. Alternatively, we can use $f_{ij}$ directly as $v_{ij}$ for linear messages.

Multi-Head Attention.

Following Transformers transformer, we can perform $h$ parallel equivariant graph attention functions given $f_{ij}$. The $h$ different outputs are concatenated and projected with a linear layer, resulting in the final output, as illustrated in Fig. 1(b). Note that parallelizing attention functions and concatenating outputs can be implemented with "Reshape" operations.

3.3 Overall Architecture

For completeness, we discuss other modules in Equiformer here.

Embedding.

This module consists of atom embedding and edge-degree embedding. For the former, we use a linear layer to transform the one-hot encoding of atom species. For the latter, as depicted in the right branch of Fig. 1(c), we first transform a constant one vector into messages encoding local geometry with two linear layers and one intermediate DTP layer, and then use sum aggregation to encode degree information gin; graphormer_3d. The DTP layer has the same form as that in Eq. 6. We scale the aggregated features by dividing by the square root of the average degree in training sets so that the standard deviation of aggregated features is close to 1. The two embeddings are summed to produce the final embeddings of input 3D graphs.
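
A short sketch of the sum aggregation and degree-aware rescaling described above (the average degree and feature sizes are illustrative):

```python
import torch

num_nodes, num_edges, dim = 5, 20, 16
edge_messages = torch.randn(num_edges, dim)          # messages encoding local geometry
dst = torch.randint(0, num_nodes, (num_edges,))      # target node of each edge
avg_degree = 4.0                                     # average node degree of the training set

agg = torch.zeros(num_nodes, dim).index_add_(0, dst, edge_messages)
edge_degree_embedding = agg / avg_degree ** 0.5      # keeps the std of aggregated features near 1
print(edge_degree_embedding.shape)
```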

Radial Basis and Radial Function.

Relative distances $\lVert \vec{r}_{ij} \rVert$ parametrize the weights in some DTP layers. To reflect subtle changes in relative distances, we represent them with a Gaussian radial basis with learnable mean and standard deviation schnet; spinconv; gemnet; graphormer_3d or a radial Bessel basis dimenet; dimenet_pp. We transform the radial basis with a learnable radial function to generate weights for those DTP layers schnet; se3_transformer; nequip. The function consists of two MLPs with layer normalization layer_norm and SiLU silu; swish and a final linear layer.
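
A sketch of a Gaussian radial basis with learnable means and widths followed by a radial MLP that outputs tensor-product weights; the basis size, hidden widths, cutoff, and weight count are illustrative, not the Equiformer hyper-parameters.

```python
import torch
import torch.nn as nn

class GaussianRadialBasis(nn.Module):
    def __init__(self, num_basis=32, cutoff=5.0):
        super().__init__()
        self.mean = nn.Parameter(torch.linspace(0.0, cutoff, num_basis))
        self.std = nn.Parameter(torch.full((num_basis,), cutoff / num_basis))

    def forward(self, dist):                        # dist: (num_edges,)
        diff = dist.unsqueeze(-1) - self.mean       # (num_edges, num_basis)
        return torch.exp(-0.5 * (diff / self.std) ** 2)

weight_numel = 64                                    # would come from the DTP layer in practice
basis = GaussianRadialBasis()
radial_fn = nn.Sequential(
    nn.Linear(32, 64), nn.LayerNorm(64), nn.SiLU(),
    nn.Linear(64, 64), nn.LayerNorm(64), nn.SiLU(),
    nn.Linear(64, weight_numel),
)
dist = torch.rand(10) * 5.0                          # relative distances ||r_ij||
dtp_weights = radial_fn(basis(dist))                 # per-edge weights for a DTP layer
print(dtp_weights.shape)                             # torch.Size([10, 64])
```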

Feed Forward Network.

Similar to Transformers, we use two equivariant linear layers and an intermediate gate activation for the feed forward networks in Equiformer.

Output Head.

The last feed forward network transforms features on each node into a scalar. We perform sum aggregation over all nodes to predict scalar quantities like energy. Similar to edge-degree embedding, we divide the aggregated scalars by the square root of the average number of atoms.

4 Related Works

Here, we focus on equivariant neural networks and discuss other works in Sec. B in appendix.

Equivariant GNNs.

Equivariant neural networks tfn; kondor2018clebsch; 3dsteerable; se3_transformer; l1net; geometric_prediction; nequip; gvp; painn; egnn; se3_wavefunction; segnn; torchmd_net; eqgat; allergo operate on geometric tensors like type-$L$ vectors to achieve equivariance. The central idea is to use functions of geometry built from spherical harmonics and irreps features to achieve 3D rotational and translational equivariance, as proposed in Tensor Field Network (TFN) tfn, which generalizes 2D counterparts harmonis_network; group_equivariant_convolution_networks; spherical_cnn to 3D Euclidean space. Previous works differ in the equivariant operations used in their networks. TFN tfn and NequIP nequip mainly use graph convolution with linear messages, with the latter utilizing extra equivariant gate activations 3dsteerable. SE(3)-Transformer se3_transformer adopts an equivariant version of dot product (DP) attention transformer; transformer_in_vision, and subsequent works on equivariant Transformers torchmd_net; eqgat follow the practice of DP attention and use more specialized architectures considering only type-$0$ and type-$1$ vectors. SEGNN segnn utilizes non-linear message passing networks neural_message_passing_quantum_chemistry; gns for irreps features. The proposed Equiformer incorporates the advantages of these previous works by combining MLP attention with non-linear messages.

5 Experiment

Our implementation is based on PyTorch pytorch (Modified BSD license), PyG pytorch_geometric (MIT license), e3nn e3nn (MIT license), timm timm (Apache-2.0 license), and ocp oc20 (MIT license).

5.1 QM9

Task α Δε ε_HOMO ε_LUMO μ C_v G H R² U U₀ ZPVE
Methods Units bohr³ meV meV meV D cal/mol K meV meV bohr² meV meV meV
NMP neural_message_passing_quantum_chemistry .092 69 43 38 .030 .040 19 17 .180 20 20 1.50
SchNet schnet .235 63 41 34 .033 .033 14 14 .073 19 14 1.70
Cormorant cormorant .085 61 34 38 .038 .026 20 21 .961 21 22 2.03
LieConv lieconv .084 49 30 25 .032 .038 22 24 .800 19 19 2.28
DimeNet++ dimenet_pp .044 33 25 20 .030 .023 8 7 .331 6 6 1.21
TFN tfn .223 58 40 38 .064 .101 - - - - - -
SE(3)-Transformer se3_transformer .142 53 35 33 .051 .054 - - - - - -
EGNN egnn .071 48 29 25 .029 .031 12 12 .106 12 11 1.55
SphereNet spherenet .046 32 23 18 .026 .021 8 6 .292 7 6 1.12
SEGNN segnn .060 42 24 21 .023 .031 15 16 .660 13 15 1.62
EQGAT eqgat .063 44 26 22 .014 .027 12 13 .257 13 13 1.50
Equiformer .056 33 17 16 .014 .025 10 10 .227 11 10 1.32
Table 1: MAE results on QM9 testing set. denotes using different training, validation, testing data partitions as mentioned in SEGNN segnn. denotes results from SE(3)-Transformer se3_transformer.

Dataset.

The QM9 qm9_2; qm9_1 dataset (CC BY-NC SA 4.0 license) consists of 134k small molecules, and the goal is to predict their quantum properties such as energy. We follow the data partition used by Cormorant cormorant, which has 100k, 18k, and 13k molecules in the training, validation, and testing sets. We minimize the mean absolute error (MAE) between predictions and normalized ground truth and report MAE on the testing set.

Setting.

Please refer to Sec. D in appendix for details on architecture and hyper-parameters.

Result.

We mainly compare with methods trained with the same data partition and summarize the results in Table 1. Equiformer achieves the best results on 11 out of 12 tasks among models trained with the same data partition. The comparison to SEGNN segnn, which uses irreps features as Equiformer does, demonstrates the effectiveness of combining non-linear messages with MLP attention. Additionally, Equiformer achieves better results on most tasks when compared to other equivariant Transformers se3_transformer; eqgat, which suggests a better adaptation of Transformers to 3D graphs. Besides, the different data partition noted in Table 1 has more molecules in the training set and fewer in the testing set, which can benefit some tasks that are more dependent on data partitions.

Energy MAE (eV) EwT (%)
Methods ID OOD Ads OOD Cat OOD Both Average ID OOD Ads OOD Cat OOD Both Average
SchNet schnet 0.6465 0.7074 0.6475 0.6626 0.6660 2.96 2.22 3.03 2.38 2.65
DimeNet++ dimenet_pp 0.5636 0.7127 0.5612 0.6492 0.6217 4.25 2.48 4.40 2.56 3.42
GemNet-T gemnet 0.5561 0.7342 0.5659 0.6964 0.6382 4.51 2.24 4.37 2.38 3.38
SphereNet spherenet 0.5632 0.6682 0.5590 0.6190 0.6024 4.56 2.70 4.59 2.70 3.64
(S)EGNN segnn 0.5497 0.6851 0.5519 0.6102 0.5992 4.99 2.50 4.71 2.88 3.77
SEGNN segnn 0.5310 0.6432 0.5341 0.5777 0.5715 5.32 2.80 4.89 3.09 4.03
Equiformer 0.5088 0.6271 0.5051 0.5545 0.5489 4.88 2.93 4.92 2.98 3.93
Table 2: Results on OC20 IS2RE validation set. denotes results reported by SphereNet spherenet.
Energy MAE (eV) EwT (%)
Methods ID OOD Ads OOD Cat OOD Both Average ID OOD Ads OOD Cat OOD Both Average
CGCNN cgcnn 0.6149 0.9155 0.6219 0.8511 0.7509 3.40 1.93 3.10 2.00 2.61
SchNet schnet 0.6387 0.7342 0.6616 0.7037 0.6846 2.96 2.33 2.94 2.21 2.61
DimeNet++ dimenet_pp 0.5621 0.7252 0.5756 0.6613 0.6311 4.25 2.07 4.10 2.41 3.21
SpinConv spinconv 0.5583 0.7230 0.5687 0.6738 0.6310 4.08 2.26 3.82 2.33 3.12
SphereNet spherenet 0.5625 0.7033 0.5708 0.6378 0.6186 4.47 2.29 4.09 2.41 3.32
SEGNN segnn 0.5327 0.6921 0.5369 0.6790 0.6101 5.37 2.46 4.91 2.63 3.84
Equiformer 0.5037 0.6881 0.5213 0.6301 0.5858 5.14 2.41 4.67 2.69 3.73
Table 3: Results on OC20 IS2RE testing set.
Energy MAE (eV) EwT (%)
Methods ID OOD Ads OOD Cat OOD Both Average ID OOD Ads OOD Cat OOD Both Average
GNS noisy_nodes 0.54 0.65 0.55 0.59 0.5825 - - - - -
Noisy Nodes noisy_nodes 0.47 0.51 0.48 0.46 0.4800 - - - - -
Graphormer graphormer_3d 0.4329 0.5850 0.4441 0.5299 0.4980 - - - - -
Equiformer 0.4222 0.5420 0.4231 0.4754 0.4657 7.23 3.77 7.13 4.10 5.56
+ Noisy Nodes 0.4156 0.4976 0.4165 0.4344 0.4410 7.47 4.64 7.19 4.84 6.04
Table 4: Results on OC20 IS2RE validation set when IS2RS node-level auxiliary task is adopted during training.

“GNS” denotes the 50-layer GNS trained without Noisy Nodes data augmentation, and “Noisy Nodes” denotes the 100-layer GNS trained with Noisy Nodes. “Equiformer + Noisy Nodes” uses data augmentation of interpolating between initial structure and relaxed structure and adding Gaussian noise as described by Noisy Nodes noisy_nodes.

Energy MAE (eV) EwT (%)
Methods ID OOD Ads OOD Cat OOD Both Average ID OOD Ads OOD Cat OOD Both Average
GNS + Noisy Nodes noisy_nodes 0.4219 0.5678 0.4366 0.4651 0.4728 9.12 4.25 8.01 4.64 6.5
Graphormer graphormer_3d 0.3976 0.5719 0.4166 0.5029 0.4722 8.97 3.45 8.18 3.79 6.1
Equiformer + Noisy Nodes 0.4171 0.5479 0.4248 0.4741 0.4660 7.71 3.70 7.15 4.07 5.66

Table 5: Results on OC20 IS2RE testing set when IS2RS node-level auxiliary task is adopted during training. denotes using ensemble of models trained with both IS2RE training and validation sets. In contrast, we use the same single Equiformer model in Table 4, which is trained with only the training set, for evaluation on the testing set.

5.2 OC20

Dataset.

The Open Catalyst 2020 (OC20) dataset oc20 (Creative Commons Attribution 4.0 License) consists of larger atomic systems, each composed of a small molecule called the adsorbate placed on a large slab called the catalyst. The average number of atoms in a system is more than 70, and there are over 50 atom species. The goal is to understand the interaction between adsorbates and catalysts through relaxation. An adsorbate is first placed on top of a catalyst to form the initial structure (IS). The positions of atoms are updated with forces calculated by density functional theory until the system is stable and becomes the relaxed structure (RS). The energy of the RS, or relaxed energy (RE), is correlated with catalyst activity and is therefore a metric for understanding their interaction. We focus on the task of initial structure to relaxed energy (IS2RE), which predicts the relaxed energy (RE) given an initial structure (IS). There are 460k, 100k, and 100k structures in the training, validation, and testing sets, respectively. Performance is measured in MAE and energy within threshold (EwT), the percentage of predicted energies within 0.02 eV of the ground truth energy. The validation and testing sets contain four sub-splits: in-distribution adsorbates and catalysts (ID), out-of-distribution adsorbates (OOD-Ads), out-of-distribution catalysts (OOD-Cat), and out-of-distribution adsorbates and catalysts (OOD-Both).

Setting.

We consider two training settings based on whether a node-level auxiliary task noisy_nodes is adopted. In the first setting, we minimize MAE between predicted energy and ground truth energy without any node-level auxiliary task. In the second setting, we incorporate the task of initial structure to relaxed structure (IS2RS) as a node-level auxiliary task noisy_nodes. In addition to predicting energy, we predict node-wise vectors indicating how each atom moves from initial structure to relaxed structure. Please refer to Sec. E in appendix for details on Equiformer architecture and hyper-parameters.

IS2RE Result without Node-Level Auxiliary Task.

We summarize the results under the first setting in Table 2 and Table 3. Compared with state-of-the-art models like SEGNN segnn and SphereNet spherenet, Equiformer consistently achieves the lowest MAE on all four sub-splits of the validation and testing sets. Note that energy within threshold (EwT) measures only the percentage of predictions close enough to the ground truth, and therefore an improvement in average error (MAE) does not necessarily translate into an improvement in the error distribution (EwT). A similar phenomenon can be observed in Table 3, where on the “OOD Both” sub-split SphereNet spherenet achieves lower MAE yet lower EwT than SEGNN segnn. We also note that the models in Tables 2 and 3 are trained by minimizing MAE; comparing MAE on the validation and testing sets therefore mitigates the discrepancy between training objectives and evaluation metrics. Moreover, the OC20 leaderboard ranks the relative performance of models mainly according to MAE.

IS2RE Result with IS2RS Node-Level Auxiliary Task.

We report the results on validation and testing sets under the second setting in Table 4 and Table 5. As of May 20, 2022, Equiformer achieves the best results on IS2RE task when only IS2RE and IS2RS data are used. We note that the proposed Equiformer in Table 5 achieves competitive results even with much less computation. Specifically, training “Equiformer + Noisy Nodes” takes about GPU-days when A6000 GPUs are used. The training time of “GNS + Noisy Nodes” noisy_nodes is TPU-days. “Graphormer” graphormer_3d uses ensemble of models and requires GPU-days to train all models when A100 GPUs are used. The comparison to GNS demonstrates the improvement from invariant message passing networks to equivariant Transformers. Without any data augmentation, Equiformer still achieves competitive results to GNS trained with Noisy Nodes noisy_nodes. Compared to Graphormer graphormer_3d, Equiformer demonstrates the effectiveness of equivariant features and the proposed equivariant graph attention. Note that Equiformer, with Transformer blocks, is relatively shallow as GNS trained with Noisy Nodes has blocks and Graphormer has Transformer blocks and that deeper networks can typically obtain better results when IS2RS auxiliary task is adopted noisy_nodes.

5.3 Ablation Study

| Index | Non-linear message passing | MLP attention | Dot product attention | α (bohr³) | Δε (meV) | ε_HOMO (meV) | ε_LUMO (meV) | μ (D) | C_v (cal/mol K) |
| 1 | ✓ | ✓ |  | .056 | 33 | 17 | 16 | .014 | .025 |
| 2 |  | ✓ |  | .061 | 34 | 18 | 17 | .015 | .025 |
| 3 |  |  | ✓ | .060 | 34 | 18 | 18 | .015 | .026 |
Table 6: Ablation study results on QM9.
Energy MAE (eV)
| Index | Non-linear message passing | MLP attention | Dot product attention | ID | OOD Ads | OOD Cat | OOD Both | Average |
| 1 | ✓ | ✓ |  | 0.5088 | 0.6271 | 0.5051 | 0.5545 | 0.5489 |
| 2 |  | ✓ |  | 0.5168 | 0.6308 | 0.5088 | 0.5657 | 0.5555 |
| 3 |  |  | ✓ | 0.5386 | 0.6382 | 0.5297 | 0.5692 | 0.5689 |

EwT (%)
| Index | Non-linear message passing | MLP attention | Dot product attention | ID | OOD Ads | OOD Cat | OOD Both | Average |
| 1 | ✓ | ✓ |  | 4.88 | 2.93 | 4.92 | 2.98 | 3.93 |
| 2 |  | ✓ |  | 4.59 | 2.82 | 4.79 | 3.02 | 3.81 |
| 3 |  |  | ✓ | 4.37 | 2.60 | 4.36 | 2.86 | 3.55 |
Table 7: Ablation study results on OC20 IS2RE validation set.

We conduct ablation studies on the improvements brought by MLP attention and non-linear messages in the proposed equivariant graph attention. We modify dot product (DP) attention transformer; se3_transformer so that it only differs from MLP attention in how attention weights are generated from $f_{ij}$. Please refer to Sec. C.3 in the appendix for details on DP attention. For experiments on QM9 and OC20, unless otherwise stated, we follow the hyper-parameters used in previous experiments.

Result on QM9.

The comparison is summarized in Table 6. Non-linear messages improve upon linear messages when MLP attention is used. Similar to what is reported by GATv2 gatv2, the improvement of replacing DP attention with MLP attention is not very significant. We conjecture that DP attention with linear operations is expressive enough to capture common attention patterns, as the numbers of neighboring nodes and atom species are much smaller than those in OC20. However, MLP attention is faster, as it directly generates scalar features and attention weights from $f_{ij}$ instead of producing additional key and query irreps features for attention weights.

Result on OC20.

We consider the setting of training without IS2RS auxiliary task and use a smaller learning rate for DP attention as this improves the performance. We summarize the comparison in Table 7. Non-linear messages consistently improve upon linear messages. In contrast to the results on QM9, MLP attention achieves better performance than DP attention. We surmise this is because OC20 contains larger atomistic graphs with more diverse atom species and therefore requires more expressive attention mechanisms.

6 Conclusion and Broader Impact

In this work, we propose Equiformer, a graph neural network (GNN) combining the strengths of Transformers and equivariant features based on irreducible representations (irreps). With irreps features, we build upon existing generic GNNs and Transformer networks transformer; vit; graphormer; parp; vit_search by incorporating equivariant operations like tensor products. We further propose equivariant graph attention, which incorporates multi-layer perceptron attention and non-linear messages. Experiments on QM9 and OC20 demonstrate both the effectiveness of Equiformer and the advantage of equivariant graph attention over typical dot product attention.

The broader impact lies in two aspects. First, Equiformer demonstrates the possibility of adapting Transformers to domains such as physics and chemistry, where data can be represented as 3D atomistic graphs. Second, Equiformer achieves more accurate approximations of quantum properties calculation. We believe there is much more to be gained by harnessing these abilities for productive investigation of molecules and materials relevant to applications such as energy, electronics, and pharmaceuticals oc20, than to be lost by applying these methods for adversarial purposes like creating hazardous chemicals. Additionally, there are still substantial hurdles to go from the identification of a useful or harmful molecule to its large-scale deployment.

Acknowledgement

We thank Simon Batzner, Albert Musaelian, Mario Geiger, Johannes Brandstetter, and Rob Hesselink for helpful discussions including help with the OC20 dataset. We also thank the e3nn e3nn developers and community for the library and detailed documentation. We acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center supercloud for providing high performance computing and consultation resources that have contributed to the research results reported within this paper.


Appendix

Appendix A Additional Mathematical Background

In this section, we provide additional mathematical background on group equivariance helpful for the discussion of the proposed method. Other works tfn; 3dsteerable; kondor2018clebsch; cormorant; se3_transformer; segnn also provide similar background. We encourage interested readers to see these works zee; Dresselhaus2007 for more in-depth and pedagogical presentations.

A.1 Group Theory

Definition of Groups.

A group is an algebraic structure that consists of a set $G$ and a binary operator $\circ$ and is typically denoted as $(G, \circ)$. Groups satisfy the following four axioms:

  1. Closure: $g_1 \circ g_2 \in G$ for all $g_1, g_2 \in G$.

  2. Identity: There exists an identity element $e \in G$ such that $e \circ g = g \circ e = g$ for all $g \in G$.

  3. Inverse: For each $g \in G$, there exists an inverse element $g^{-1} \in G$ such that $g \circ g^{-1} = g^{-1} \circ g = e$.

  4. Associativity: $(g_1 \circ g_2) \circ g_3 = g_1 \circ (g_2 \circ g_3)$ for all $g_1, g_2, g_3 \in G$.

In this work, we focus on 3D rotation, translation and inversion. Relevant groups include:

  1. The Euclidean group in three dimensions $E(3)$: 3D rotation, translation and inversion.

  2. The special Euclidean group in three dimensions $SE(3)$: 3D rotation and translation.

  3. The orthogonal group in three dimensions $O(3)$: 3D rotation and inversion.

  4. The special orthogonal group in three dimensions $SO(3)$: 3D rotation.

Group Representations.

The actions of groups define transformations. Formally, a transformation acting on a vector space $X$, parametrized by a group element $g$, is an injective function $T_g: X \rightarrow X$. A powerful result of group representation theory is that these transformations can be expressed as matrices which act on vector spaces via matrix multiplication. These matrices are called group representations. Formally, a group representation is a mapping $D: G \rightarrow GL(N)$ between a group $G$ and the set of invertible $N \times N$ matrices. The group representation $D(g)$ maps an $N$-dimensional vector space onto itself and satisfies $D(g_1)D(g_2) = D(g_1 \circ g_2)$ for all $g_1, g_2 \in G$.

How a group is represented depends on the vector space it acts on. If there exists a change of basis in the form of an $N \times N$ matrix $S$ such that $S^{-1}D(g)S = D'(g)$ for all $g \in G$, then we say the two group representations $D$ and $D'$ are equivalent. If $D(g)$ is block diagonal, which means that $D(g)$ acts on independent subspaces of the vector space, the representation is reducible. A particular class of representations that are convenient for composable functions are irreducible representations or "irreps", which cannot be further reduced. We can express any group representation of $SO(3)$ as a direct sum (concatenation) of irreps zee; Dresselhaus2007; e3nn:

$$D(g) = S^{-1}\Big(\bigoplus_{i} D^{(L_i)}(g)\Big)S = S^{-1}\begin{pmatrix} D^{(L_1)}(g) & & \\ & D^{(L_2)}(g) & \\ & & \ddots \end{pmatrix}S \tag{9}$$

where the $D^{(L_i)}$ are Wigner-D matrices of degree $L_i$, as mentioned in Sec. 2.3.

A.2 Equivariance

Definition of Equivariance and Invariance.

Equivariance is a property of a function $f: X \rightarrow Y$ mapping between vector spaces $X$ and $Y$. Given a group $G$ and group representations $D_X$ and $D_Y$ in the input and output spaces $X$ and $Y$, $f$ is equivariant to $G$ if $f(D_X(g)x) = D_Y(g)f(x)$ for all $g \in G$ and $x \in X$. Invariance corresponds to the case where $D_Y(g)$ is the identity for all $g \in G$.

Equivariance in Neural Networks.

Group equivariant neural networks are guaranteed to make equivariant predictions on data transformed by a group. Additionally, they have been found to be data-efficient and to generalize better than non-symmetry-aware and invariant methods nequip; quantum_scaling; neural_scale_of_chemical_models. For 3D atomistic graphs, we consider equivariance to the Euclidean group $E(3)$, which consists of 3D rotation, translation and inversion. For translation, we operate on relative positions, and therefore our networks are invariant to 3D translation. We achieve equivariance to rotation and inversion by representing our input data, intermediate features and outputs in vector spaces of irreps and acting on them with only equivariant operations.

A.3 Equivariant Features Based on Vector Spaces of Irreducible Representations

Irreps Features.

As discussed in Sec. 2.3 in the main text, we use type-$L$ vectors for $SE(3)$-equivariant irreps features (in SEGNN segnn, they are also referred to as steerable features; we use the term "irreps features" to remain consistent with the e3nn e3nn library) and type-$(L, p)$ vectors for $E(3)$/$O(3)$-equivariant irreps features. The parity $p$ denotes whether vectors change sign under inversion and can be either $e$ (even) or $o$ (odd). Vectors with $p = o$ change sign under inversion while those with $p = e$ do not. Scalar features correspond to type-$0$ vectors in the case of $SE(3)$-equivariance and to type-$(0, e)$ vectors in the case of $O(3)$-equivariance, whereas type-$(0, o)$ vectors correspond to pseudo-scalars. Euclidean vectors in $\mathbb{R}^3$ correspond to type-$1$ vectors and type-$(1, o)$ vectors, whereas type-$(1, e)$ vectors correspond to pseudo-vectors. Note that type-$(L, e)$ vectors and type-$(L, o)$ vectors are considered vectors of different types in equivariant linear layers and layer normalizations.

Spherical Harmonics.

Euclidean vectors $\vec{r}$ in $\mathbb{R}^3$ can be projected into type-$L$ vectors by using spherical harmonics $Y^{(L)}$: $f^{(L)} = Y^{(L)}\big(\frac{\vec{r}}{\lVert\vec{r}\rVert}\big)$ finding_symmetry_breaking_order. This is equivalent to the Fourier transform of the angular degrees of freedom $\frac{\vec{r}}{\lVert\vec{r}\rVert}$, which can be optionally weighted by $\lVert\vec{r}\rVert$. In the case of $SE(3)$-equivariance, $Y^{(L)}$ transforms in the same manner as type-$L$ vectors. For $O(3)$-equivariance, $Y^{(L)}$ behaves as type-$(L, p)$ vectors, where $p = e$ if $L$ is even and $p = o$ if $L$ is odd.

Vectors of Higher and Other Parities.

Although previously we have restricted concrete examples of vector spaces of irreps to commonly encountered scalars (type-$0$ vectors) and Euclidean vectors (type-$1$ vectors), vectors of higher $L$ and other parities are equally physical. For example, the moment of inertia (how an object rotates under torque) transforms as a $3 \times 3$ symmetric matrix, which has symmetric-traceless components behaving as type-$2$ vectors. Elasticity (how an object deforms under loading) transforms as a rank-4 symmetric tensor, which includes components acting as type-$4$ vectors.

A.4 Tensor Product

Tensor Product for $O(3)$.

We use tensor products to interact different type-$(L, p)$ vectors. We extend our discussion in Sec. 2.4 in the main text to include inversion and type-$(L, p)$ vectors. The tensor product, denoted as $\otimes$, uses Clebsch-Gordan coefficients to combine a type-$(L_1, p_1)$ vector $f^{(L_1, p_1)}$ and a type-$(L_2, p_2)$ vector $g^{(L_2, p_2)}$ and produces a type-$(L_3, p_3)$ vector as follows:

$$\big(f^{(L_1, p_1)} \otimes g^{(L_2, p_2)}\big)^{(L_3, p_3)}_{m_3} = \sum_{m_1 = -L_1}^{L_1} \sum_{m_2 = -L_2}^{L_2} C^{(L_3, m_3)}_{(L_1, m_1)(L_2, m_2)} \, f^{(L_1, p_1)}_{m_1} \, g^{(L_2, p_2)}_{m_2} \tag{10}$$
$$p_3 = p_1 \times p_2 \tag{11}$$

The only difference of tensor products for $O(3)$ as described in Eq. 10 from those for $SO(3)$ described in Eq. 2 is that we additionally keep track of the output parity as in Eq. 11, using the multiplication rules $e \times e = e$, $o \times o = e$, and $e \times o = o \times e = o$. For example, the tensor product of a type-$(1, o)$ vector and a type-$(1, o)$ vector can result in one type-$(0, e)$ vector, one type-$(1, e)$ vector, and one type-$(2, e)$ vector.
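
These selection and parity rules are what e3nn implements for products of irreps; a quick illustrative check:

```python
from e3nn import o3

# |L1 - L2| <= L3 <= L1 + L2 and p3 = p1 * p2
print(list(o3.Irrep("1o") * o3.Irrep("1o")))   # [0e, 1e, 2e]
print(list(o3.Irrep("1e") * o3.Irrep("1o")))   # [0o, 1o, 2o]
```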

Clebsch-Gordan Coefficients.

The Clebsch-Gordan coefficients for $SO(3)$ can be computed from integrals over the basis functions of a given irreducible representation, e.g., the real spherical harmonics, as shown below, and are tabulated to avoid unnecessary computation.

$$C^{(L_3, m_3)}_{(L_1, m_1)(L_2, m_2)} \propto \int_{S^2} Y^{(L_3)}_{m_3}(\hat{r}) \, Y^{(L_1)}_{m_1}(\hat{r}) \, Y^{(L_2)}_{m_2}(\hat{r}) \, d\hat{r} \tag{12}$$

For many combinations of $L_1$, $L_2$, and $L_3$, the Clebsch-Gordan coefficients are zero. This gives rise to the following selection rule for non-trivial coefficients: $|L_1 - L_2| \le L_3 \le L_1 + L_2$.

Examples of Tensor Products.

Tensor products generally define the interaction between different type-$(L, p)$ vectors in a symmetry-preserving manner and include common operations as special cases:

  1. Scalar-scalar multiplication: scalar $(0, e)$ $\otimes$ scalar $(0, e)$ $\rightarrow$ scalar $(0, e)$.

  2. Scalar-vector multiplication: scalar $(0, e)$ $\otimes$ vector $(1, o)$ $\rightarrow$ vector $(1, o)$.

  3. Vector dot product: vector $(1, o)$ $\otimes$ vector $(1, o)$ $\rightarrow$ scalar $(0, e)$.

  4. Vector cross product: vector $(1, o)$ $\otimes$ vector $(1, o)$ $\rightarrow$ pseudo-vector $(1, e)$.

Appendix B Related Works

B.1 Graph Neural Networks for 3D Atomistic Graphs

Graph neural networks (GNNs) are well adapted to perform property prediction of atomic systems because they can handle discrete and topological structures. There are two main ways to represent atomistic graphs atom3d, which are chemical bond graphs, sometimes denoted as 2D graphs, and 3D spatial graphs. Chemical bond graphs use edges to represent covalent bonds without considering 3D geometry. Due to their similarity to graph structures in other applications, generic GNNs graphsage; neural_message_passing_quantum_chemistry; gcn; gin; gat; gatv2 can be directly applied to predict their properties qm9_2; qm9_1; deepchem; ogb; ogblsc. On the other hand, 3D spatial graphs consider positions of atoms in 3D spaces and therefore 3D geometry. Although 3D graphs can faithfully represent atomistic systems, one challenge of moving from chemical bond graphs to 3D spatial graphs is to remain invariant or equivariant to geometric transformations acting on atom positions. Therefore, invariant neural networks and equivariant neural networks have been proposed for 3D atomistic graphs, with the former leveraging invariant information like distances and angles and the latter operating on geometric tensors like type-$L$ vectors.

B.2 Invariant GNNs

Previous works schnet; cgcnn; physnet; dimenet; dimenet_pp; orbnet; spherenet; spinconv; gemnet extract invariant information from 3D atomistic graphs and operate on the resulting invariant graphs. They mainly differ in leveraging different geometric information such as distances, bond angles (3 atom features) or dihedral angles (4 atom features). SchNet schnet uses relative distances and proposes continuous-filter convolutional layers to learn local interaction between atom pairs. DimeNet series dimenet; dimenet_pp incorporate bond angles by using triplet representations of atoms. SphereNet spherenet and GemNet gemnet; gemnet_oc further extend to consider dihedral angles for better performance. In order to consider directional information contained in angles, they rely on triplet or quadruplet representations of atoms. In addition to being memory-intensive gemnet_xl, they also change graph structures by introducing higher-order interaction terms line_graph, which would require non-trivial modifications to generic GNNs in order to apply them to 3D graphs. In contrast, the proposed Equiformer uses equivariant irreps features to consider directional information without complicating graph structures and therefore can directly inherit the design of generic GNNs.

B.3 Attention and Transformer

Graph Attention.

Graph attention networks (GAT) gat; gatv2 use multi-layer perceptrons (MLP) to calculate attention weights in a similar manner to message passing networks. Subsequent works using graph attention mechanisms follow either GAT-like MLP attention relational_graph_attention_networks; graph_attention_with_self_supervision or Transformer-like dot product attention gaan; hard_graph_attention; masked_label_prediction; generalization_transformer_graphs; graph_attention_with_self_supervision; spectral_attention. In particular, Kim et al. graph_attention_with_self_supervision compare these two types of attention mechanisms empirically under a self-supervised setting. Brody et al. gatv2 analyze their theoretical differences and compare their performance in general settings.

Graph Transformer.

A different line of research focuses on adapting standard Transformer networks to graph problems generalization_transformer_graphs; grove; spectral_attention; graphormer; graphormer_3d. They adopt dot product attention in Transformers transformer and propose different approaches to incorporate graph-related inductive biases into their networks. GROVE grove includes additional message passing layers or graph convolutional layers to incorporate local graph structures when calculating attention weights. SAN spectral_attention proposes to learn position embeddings of nodes with full Laplacian spectrum. Graphormer graphormer proposes to encode degree information in centrality embeddings and encode distances and edge features in attention biases. The proposed Equiformer belongs to one of these attempts to generalize standard Transformers to graphs and is dedicated to 3D graphs. To incorporate 3D-related inductive biases, we adopt an equivariant version of Transformers with irreps features and propose novel equivariant graph attention.

Appendix C Details of Architecture

Figure 2: Equivariant operations used in Equiformer. (a) Each gray line between input and output irreps features contains one learnable weight. Note that the number of output channels can be different from that of input channels. (b) "RMS" denotes the root mean square value (RMS) along the channel dimension. For simplicity, in this figure, we omit the multiplication by the learnable scale parameters. (c) Gate layers are equivariant activation functions where non-linearly transformed scalars are used to gate non-scalar irreps features. (d) The left two irreps features correspond to the two input irreps features, and the rightmost one is the output irreps feature. The two gray lines connecting two vectors in the input irreps features and one vector in the output irreps feature form a path and contain one learnable weight. We only show $SE(3)$-equivariant operations in this figure and note that they can be directly generalized to $E(3)$-equivariant features.
Figure 3: An alternative visualization of the depth-wise tensor product. We follow the visualization of tensor products in e3nn e3nn and separate paths into three parts based on the types of output vectors.
Figure 4: Architecture of equivariant dot product attention without non-linear message passing. In this figure, "⊗" denotes multiplication, "⊕" denotes addition, and "DTP" stands for depth-wise tensor product. "Σ" within a circle denotes summation over all neighbors. Gray cells indicate intermediate irreps features. We highlight the difference of dot product attention from multi-layer perceptron attention in red. Note that key and value are irreps features, and therefore $f_{ij}$ in dot product attention typically has more channels than in multi-layer perceptron attention.

C.1 Equivariant Operations Used in Equiformer

We illustrate the equivariant operations used in Equiformer in Fig. 2 and provide an alternative visualization of depth-wise tensor products in Fig. 3.

C.2 Equiformer Architecture

For simplicity and because most works we compare with do not include equivariance to inversion, we adopt $SE(3)$-equivariant irreps features in Equiformer for the experiments in the main text and note that $E(3)$-equivariant irreps features can be easily incorporated into Equiformer.

We define architectural hyper-parameters like the number of channels in some layers in Equiformer, which are used to specify the detailed architectures in Sec. D and Sec. E.

We use an embedding dimension to define the dimension of most irreps features. Specifically, all irreps features in Fig. 1 have this dimension unless otherwise stated. Besides, we use a spherical harmonics embedding dimension for the spherical harmonics embeddings of relative positions in all depth-wise tensor products.

For equivariant graph attention in Fig. 1(b), the first two linear layers have the same output dimension. The output dimensions of the depth-wise tensor products (DTP) are determined by those of the input irreps features. Equivariant graph attention consists of parallel attention functions, and the value vector in each attention function has its own dimension; we refer to the number of parallel functions and this dimension as the number of heads and the head dimension, respectively. By default, we set the number of channels in the scalar feature $f^{(0)}_{ij}$ to be the same as the number of channels of type-$0$ vectors in the value vector. When non-linear messages are adopted, we set the dimension of the output irreps features in the gate activation to the dimension of the value vector. Therefore, we can use two hyper-parameters, the number of heads and the head dimension, to specify the detailed architecture of equivariant graph attention.

As for feed forward networks (FFNs), we denote the dimension of the output irreps features in the gate activation as the FFN hidden dimension. The FFN in the last Transformer block has a separate output feature dimension, and we set the hidden dimension of that last FFN, which is followed by the output head, to this output feature dimension as well. Thus, two hyper-parameters, the FFN hidden dimension and the output feature dimension, are used to specify the architectures of FFNs and the output dimension after the Transformer blocks.

Irreps features contain channels of vectors with degrees up to . We denote type- vectors as and type- vectors as and use brackets to represent concatenations of vectors. For example, the dimension of irreps features containing type- vectors and type- vectors can be represented as .

C.3 Dot Product Attention

We illustrate the dot product attention without non-linear message passing used in the ablation study in Fig. 4. The architecture is adapted from SE(3)-Transformer se3_transformer. The difference from multi-layer perceptron attention lies in how we obtain attention weights from $f_{ij}$. We split $f_{ij}$ into two irreps features, key $k_{ij}$ and value $v_{ij}$, and obtain the query $q_i$ with a linear layer. Then, we perform scaled dot product transformer between $q_i$ and $k_{ij}$ to obtain attention weights.

Appendix D Details of Experiments on QM9

D.1 Additional Comparison between SE(3) and E(3) Equivariance

We train two versions of Equiformer, one with $SE(3)$-equivariant features denoted as "Equiformer" and the other with $E(3)$-equivariant features denoted as "$E(3)$-Equiformer", and we compare them in Table 8. Including equivariance to inversion further improves the performance on the QM9 dataset.

As for Table 1, we compare “Equiformer” with other works since most of them do not include equivariance to inversion.

Task α Δε ε_HOMO ε_LUMO μ C_v
Methods Units bohr³ meV meV meV D cal/mol K
Equiformer .056 33 17 16 .014 .025
E(3)-Equiformer .054 32 16 16 .013 .024
Table 8: Ablation study of SE(3)/E(3) equivariance on the QM9 testing set. "Equiformer" operates on SE(3)-equivariant features while "E(3)-Equiformer" uses E(3)-equivariant features. Including inversion further improves mean absolute errors.

D.2 Training Details

We normalize the ground truth by subtracting the mean and dividing by the standard deviation. For the tasks of $U_0$, $U$, $H$, and $G$, where single-atom reference values are available, we subtract those reference values from the ground truth before normalizing.

We train Equiformer with 6 blocks, following SEGNN segnn. We choose the Gaussian radial basis schnet; spinconv; gemnet; graphormer_3d for the first six tasks in Table 1 and the radial Bessel basis dimenet; dimenet_pp for the others. Table 9 summarizes the hyper-parameters for the QM9 dataset. Further details will be provided in the future. The detailed description of architectural hyper-parameters can be found in Sec. C.2.

We use one A6000 GPU with 48GB to train each model and summarize the computational cost of training for one epoch as follows. The per-epoch training times of E(3)-Equiformer, Equiformer, Equiformer with linear messages, and Equiformer with linear messages and dot product attention (the latter two indicated by their indices in Table 6) are all on the order of minutes.

Hyper-parameters Value or description
Optimizer AdamW
Learning rate scheduling Cosine learning rate with linear warmup
Warmup epochs
Maximum learning rate
Batch size
Number of epochs
Weight decay
Cutoff radius (Å)
Number of radial bases for Gaussian radial basis, for radial Bessel basis
Hidden sizes of radial functions
Number of hidden layers in radial functions
Equiformer
Number of Transformer blocks
Embedding dimension
Spherical harmonics embedding dimension
Number of attention heads
Attention head dimension
Hidden dimension in feed forward networks
Output feature dimension
E(3)-Equiformer
Number of Transformer blocks
Embedding dimension
Spherical harmonics embedding dimension
Number of attention heads
Attention head dimension
Hidden dimension in feed forward networks
Output feature dimension
Table 9: Hyper-parameters for the QM9 dataset. We denote $C$ channels of type-$L$ vectors as $(C, L)$ and use brackets to represent concatenations of vectors.
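Since the exact values were tuned per dataset, the following is only a schematic of the cosine learning rate schedule with linear warmup listed in Table 9 (and likewise in Table 11); the argument names are placeholders rather than the settings in the tables.

```python
import math

def lr_at_epoch(epoch, max_lr, warmup_epochs, total_epochs, min_lr=0.0):
    """Cosine learning-rate schedule with linear warmup, evaluated per epoch."""
    if epoch < warmup_epochs:
        # Linear warmup from 0 to max_lr over the warmup epochs.
        return max_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from max_lr to min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```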

Appendix E Details of Experiments on OC20

E.1 Additional Comparison between SE(3) and E(3) Equivariance

We train two versions of Equiformer, one with SE(3)-equivariant features denoted as “Equiformer” and the other with E(3)-equivariant features denoted as “E(3)-Equiformer”, and we compare them in Table 10. Including inversion improves the MAE results on the ID and OOD Cat sub-splits but degrades the performance on the other sub-splits. Overall, using E(3)-equivariant features results in slightly inferior performance. We surmise the reasons are as follows. First, inversion might not be the key bottleneck. Second, including inversion breaks type-1 vectors into two parts, type-1 vectors with even parity and those with odd parity. They are regarded as different types in equivariant linear layers and layer normalizations, and therefore the directional information captured in these two types of vectors can only be exchanged in depth-wise tensor products. Third, we mainly tune hyper-parameters for Equiformer with SE(3)-equivariant features, and it is possible that using E(3)-equivariant features would favor different hyper-parameters.
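The splitting of type-1 vectors under inversion can be made concrete with e3nn-style parity labels; this is a notational illustration of the point above, not code from our model.

```python
from e3nn import o3

# SE(3)-equivariant features: inversion is ignored, so all type-1 channels
# can share a single label.
irreps_se3 = o3.Irreps("64x1e")

# E(3)-equivariant features: type-1 channels carry a parity, splitting into
# odd (1o) and even (1e) vectors. Equivariant linear layers and layer
# normalizations treat 1o and 1e as different types, so information can only
# be exchanged between them inside depth-wise tensor products.
irreps_e3 = o3.Irreps("32x1o + 32x1e")
```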

For Tables 2, 3, 4, and 5, we compare “Equiformer” with other works since most of them do not include equivariance to inversion.

                  Energy MAE (eV)                                     EwT (%)
Methods           ID       OOD Ads   OOD Cat   OOD Both   Average     ID     OOD Ads   OOD Cat   OOD Both   Average
Equiformer        0.5088   0.6271    0.5051    0.5545     0.5489      4.88   2.93      4.92      2.98       3.93
E(3)-Equiformer   0.5035   0.6385    0.5034    0.5658     0.5528      5.10   2.98      5.10      3.02       4.05
Table 10: Ablation study of SE(3)/E(3) equivariance on the OC20 IS2RE validation set. “Equiformer” operates on SE(3)-equivariant features while “E(3)-Equiformer” uses E(3)-equivariant features.

E.2 Training Details

IS2RE without Node-Level Auxiliary Task.

We use hyper-parameters similar to those for the QM9 dataset and summarize them in Table 11. The detailed description of architectural hyper-parameters can be found in Sec. C.2.

IS2RE with IS2RS Node-Level Auxiliary Task.

We increase the number of Transformer blocks, as deeper networks can benefit more from the IS2RS node-level auxiliary task noisy_nodes. We follow the same hyper-parameters as in Table 11 except that we increase the maximum learning rate and adjust one other hyper-parameter. Inspired by Graphormer graphormer_3d, we add an extra equivariant graph attention module after the last layer normalization to predict relaxed structures and use a linearly decayed weight for the loss associated with IS2RS. For Noisy Nodes noisy_nodes data augmentation, we first interpolate between the initial structure and the relaxed structure and then add Gaussian noise, as described by Noisy Nodes noisy_nodes. When Noisy Nodes data augmentation is used, we increase the number of training epochs. Further details will be provided in the future.
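A rough sketch of the Noisy Nodes-style input construction and of a linearly decayed auxiliary loss weight is given below; the function and variable names are illustrative, and the interpolation and noise details follow the cited work only at a high level.

```python
import torch

def noisy_nodes_positions(pos_initial, pos_relaxed, noise_std):
    """Interpolate between initial and relaxed structures, then add Gaussian noise.

    pos_initial, pos_relaxed: [num_atoms, 3] atomic positions of the same structure.
    """
    alpha = torch.rand(1, device=pos_initial.device)          # random interpolation ratio
    pos = pos_initial + alpha * (pos_relaxed - pos_initial)   # interpolated structure
    return pos + noise_std * torch.randn_like(pos)            # additive Gaussian noise

def is2rs_loss_weight(step, total_steps, start_weight, end_weight):
    """Linearly decay the weight of the IS2RS (position) loss over training."""
    progress = min(step / total_steps, 1.0)
    return start_weight + progress * (end_weight - start_weight)
```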

We use two A6000 GPUs, each with 48GB, to train models when IS2RS is not included during training; training Equiformer, E(3)-Equiformer, Equiformer with linear messages, and Equiformer with linear messages and dot product attention (the latter two indicated by their indices in Table 7) each takes a matter of hours. We use four A6000 GPUs to train Equiformer models when the IS2RS node-level auxiliary task is adopted during training; training Equiformer without Noisy Nodes noisy_nodes data augmentation takes on the order of days, and training with Noisy Nodes takes longer owing to the increased number of epochs. We note that the proposed Equiformer in Table 5 achieves competitive results even with much less computation. Specifically, training “Equiformer + Noisy Nodes” on A6000 GPUs requires substantially fewer GPU-days than the TPU-days reported for “GNS + Noisy Nodes” noisy_nodes or the A100 GPU-days required by “Graphormer” graphormer_3d, which trains an ensemble of models.

Hyper-parameters Value or description
Optimizer AdamW
Learning rate scheduling Cosine learning rate with linear warmup
Warmup epochs
Maximum learning rate
Batch size
Number of epochs
Weight decay
Cutoff radius (Å)
Number of radial bases
Hidden size of radial function
Number of hidden layers in radial function
Equiformer
Number of Transformer blocks
Embedding dimension
Spherical harmonics embedding dimension
Number of attention heads
Attention head dimension
Hidden dimension in feed forward networks
Output feature dimension
E(3)-Equiformer
Number of Transformer blocks
Embedding dimension
Spherical harmonics embedding dimension
Number of attention heads
Attention head dimension
Hidden dimension in feed forward networks
Output feature dimension
Table 11: Hyper-parameters for the OC20 dataset under the setting of training without the IS2RS auxiliary task. We denote $C$ channels of type-$L$ vectors as $(C, L)$ and use brackets to represent concatenations of vectors.

E.3 Error Distributions

We plot the error distributions of different Equiformer models on different sub-splits of the OC20 IS2RE validation set in Fig. 5. For each curve, we sort the absolute errors in ascending order for better visualization, and we make a few observations. First, for each sub-split, there are always easy examples, for which all models achieve very low errors, and hard examples, for which all models have high errors. Second, the performance gains brought by different models are non-uniform across sub-splits. For example, using MLP attention and non-linear messages improves the errors on the ID sub-split but is not as helpful on the OOD Ads sub-split. Third, when the IS2RS node-level auxiliary task is not included during training, using stronger models mainly improves errors that are beyond the threshold of 0.02 eV, which is used to calculate the metric of energy within threshold (EwT). For instance, on the OOD Both sub-split, using non-linear messages, which corresponds to the red and purple curves, improves the absolute errors for a wide range of intermediate examples. However, the improvement in MAE does not translate to an improvement in EwT, as those errors remain higher than the threshold of 0.02 eV. This explains why using non-linear messages in Table 7 improves MAE but results in almost the same EwT.
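To make the relation between the two metrics concrete, below is a minimal sketch of computing MAE and EwT from predicted and ground-truth energies; the 0.02 eV threshold matches the one discussed above, and the function name is illustrative.

```python
import torch

def mae_and_ewt(pred_energy, true_energy, threshold=0.02):
    """Mean absolute error and energy-within-threshold (EwT) for IS2RE predictions.

    pred_energy, true_energy: 1D tensors of relaxed energies in eV.
    EwT is the percentage of examples with absolute error below the threshold,
    so lowering an error from, e.g., 0.10 eV to 0.05 eV improves MAE but not EwT.
    """
    abs_err = (pred_energy - true_energy).abs()
    mae = abs_err.mean().item()
    ewt = 100.0 * (abs_err < threshold).float().mean().item()
    return mae, ewt

# Sorting errors in ascending order, as done for the curves in Fig. 5:
# sorted_err, _ = (pred_energy - true_energy).abs().sort()
```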

Figure 5: Error distributions of different Equiformer models on the (a) ID, (b) OOD Ads, (c) OOD Cat, and (d) OOD Both sub-splits of the OC20 IS2RE validation set.