1 Introduction
Machine-learned models can accelerate the prediction of quantum properties of atomistic systems like molecules by learning approximations of ab initio calculations neural_message_passing_quantum_chemistry; deep_potential_molecular_dynamics; push_limit_of_md_100m; dimenet_pp; nequip; oc20; deep_potential_molecular_dynamics_simulation; se3_wavefunction; gemnet_xl; quantum_scaling; allergo. In particular, graph neural networks (GNNs) have gained increasing popularity due to their performance. By modeling atomistic systems as graphs, GNNs naturally handle the set-like nature of collections of atoms, encode the interactions between atoms in node features, and update the features by passing messages between nodes. One factor contributing to the success of neural networks is the ability to incorporate inductive biases that exploit the symmetry of data. Take convolutional neural networks (CNNs) for 2D images as an example: patterns in images should be recognized regardless of their positions, which motivates the inductive bias of translational equivariance. For atomistic graphs, where each atom has a coordinate in 3D Euclidean space, we consider inductive biases related to the 3D Euclidean group E(3), which include equivariance to 3D translation, 3D rotation, and inversion. Concretely, some properties like the energy of an atomistic system should remain constant regardless of how we shift the system; others like forces should rotate accordingly if we rotate the system. To incorporate these inductive biases, equivariant and invariant neural networks have been proposed. The former leverage geometric tensors like vectors for equivariant node features tfn; 3dsteerable; kondor2018clebsch; se3_transformer; nequip; segnn; allergo, and the latter augment graphs with invariant information such as distances and angles extracted from 3D graphs schnet; dimenet; dimenet_pp; spherenet; spinconv; gemnet.

A parallel line of research focuses on applying Transformer networks transformer to other domains like computer vision detr; vit; deit and graphs generalization_transformer_graphs; spectral_attention; graphormer; graphormer_3d and has demonstrated widespread success. However, as Transformers were developed for sequence data bert; wav2vec2; gpt3, it is crucial to incorporate domain-related inductive biases when applying them elsewhere. For example, Vision Transformer vit shows that a pure Transformer applied to image classification generalizes poorly and achieves worse results than CNNs when trained on only ImageNet imagenet, since it lacks inductive biases like translational invariance. Note that ImageNet contains over 1.28M images, which is already larger than many quantum property prediction datasets qm9_2; qm9_1; oc20. This highlights the necessity of including the correct inductive biases when applying Transformers to the domain of 3D atomistic graphs.

In this work, we present Equiformer, an equivariant graph neural network utilizing SE(3)/E(3)-equivariant features built from irreducible representations (irreps) and equivariant attention mechanisms to combine 3D-related inductive biases with the strength of Transformers. Irreps features encode equivariant information in channel dimensions without complicating graph structures. This simplicity enables us to directly incorporate them into Transformers by replacing the original operations with their equivariant counterparts and introducing an additional equivariant operation, the tensor product. Moreover, we propose a novel equivariant graph attention, which considers both content and geometric information such as relative positions contained in irreps features. Equivariant graph attention improves upon typical attention in Transformers by replacing dot product attention with theoretically stronger multi-layer perceptron attention and by including non-linear message passing. With these innovations, Equiformer demonstrates the possibility of generalizing Transformer-like networks to 3D atomistic graphs and achieves competitive results on two quantum property prediction datasets, QM9 qm9_2; qm9_1 and OC20 oc20. On QM9, compared to models trained with the same data partition, Equiformer achieves the best results on 11 out of 12 regression tasks. On OC20, under the setting of training with IS2RE data and optionally IS2RS data, Equiformer improves upon state-of-the-art models.
2 Background
2.1 Graph Neural Networks
Formally, a graph G = (V, E) consists of nodes V and edges E representing relationships between nodes. Graphs can represent atomistic systems like molecules, where nodes are atoms and edges can represent bonds or atom pairs within a certain distance; we refer to these as atomistic graphs. Graph neural networks (GNNs) extract meaningful representations from graphs through message-passing layers. Given the feature x_i^{(t)} on node i and the edge feature e_{ij} at the t-th layer, a message-passing layer first aggregates messages m_i^{(t)} from neighbors and then updates the feature on each node as follows:

m_i^{(t)} = \sum_{j \in \mathcal{N}(i)} f_m^{(t)}\big(x_i^{(t)}, x_j^{(t)}, e_{ij}\big), \qquad x_i^{(t+1)} = f_u^{(t)}\big(x_i^{(t)}, m_i^{(t)}\big)    (1)

where \mathcal{N}(i) denotes the set of neighbors of node i, and f_m^{(t)} and f_u^{(t)} are learnable functions. For atomistic systems in 3D space, each node i is additionally associated with its position \vec{r}_i, and edge features typically become functions of the relative position \vec{r}_{ij} = \vec{r}_j - \vec{r}_i.
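To make Eq. 1 concrete, the following is a minimal PyTorch sketch of one message-passing layer; the module name, MLP shapes, and the choice of SiLU are illustrative assumptions, not the exact implementation used in this work.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One message-passing layer: aggregate messages from neighbors, then update nodes."""
    def __init__(self, node_dim: int, edge_dim: int):
        super().__init__()
        self.f_m = nn.Sequential(nn.Linear(2 * node_dim + edge_dim, node_dim), nn.SiLU())
        self.f_u = nn.Sequential(nn.Linear(2 * node_dim, node_dim), nn.SiLU())

    def forward(self, x, edge_index, edge_attr):
        src, dst = edge_index  # each edge (j -> i) stored as (src, dst)
        # message m_ij = f_m(x_i, x_j, e_ij)
        msg = self.f_m(torch.cat([x[dst], x[src], edge_attr], dim=-1))
        # sum aggregation over the neighbors of each node i
        agg = torch.zeros_like(x).index_add_(0, dst, msg)
        # node update x_i <- f_u(x_i, m_i)
        return self.f_u(torch.cat([x, agg], dim=-1))
```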
2.2 Equivariance
Atomistic systems are often described using coordinate systems. In 3D Euclidean space, we can freely choose coordinate systems and change between them via the symmetries of 3D space: 3D translation, rotation, and inversion (\vec{r} \to -\vec{r}). The transformations of 3D translation, rotation, and inversion form the Euclidean group E(3); rotations and translations form SE(3), rotations alone form SO(3), and rotations together with inversion form O(3). The laws of physics are invariant to the choice of coordinate systems, and therefore properties of atomistic systems are equivariant: e.g., when we rotate our coordinate system, quantities like energy remain the same while others like forces rotate accordingly. Formally, a function f mapping between vector spaces X and Y is equivariant to a group of transformations G if for any input x ∈ X, output f(x) ∈ Y, and group element g ∈ G, we have f(D_X(g)x) = D_Y(g)f(x), where D_X(g) and D_Y(g) are transformation matrices parametrized by g in X and Y.
Incorporating equivariance into neural networks as an inductive bias is crucial, as this enables generalizing to unseen data in a predictable manner. For example, 2D convolution is equivariant to the group of 2D translations: when input features are shifted by an amount parametrized by g, output features are shifted by the same amount. For 3D atomistic graphs, we consider the group E(3). Features x_i and m_i and learnable functions f_m and f_u should be equivariant to geometric transformations acting on positions \vec{r}_i. In this work, following previous works tfn; kondor2018clebsch; 3dsteerable implemented in e3nn e3nn, we achieve SE(3)/E(3)-equivariance by using equivariant features based on vector spaces of irreducible representations and equivariant operations like the tensor product for learnable functions.
2.3 Irreducible Representations
A group representation Dresselhaus2007; zee defines the transformation matrices of group elements that act on a vector space X. For the 3D Euclidean group E(3), two examples of vector spaces with different transformation matrices are scalars and Euclidean vectors in R^3: vectors change with rotation while scalars do not. To address translation symmetry, we simply operate on relative positions. Below we focus our discussion on rotation and inversion. The transformation matrices of rotation and inversion are separable and commute, and we first discuss irreducible representations of SO(3).
Any group representation of SO(3) on a given vector space can be decomposed into a concatenation of provably smallest transformation matrices called irreducible representations (irreps). Specifically, for group element g, there are (2l+1)-by-(2l+1) irreps matrices D^{(l)}(g) called Wigner-D matrices acting on (2l+1)-dimensional vector spaces, where the degree l is a non-negative integer. l can be interpreted as an angular frequency and determines how quickly vectors change when rotating coordinate systems. Wigner-D matrices of different l act on independent vector spaces. Vectors transformed by D^{(l)} are type-l vectors, with scalars and Euclidean vectors being type-0 and type-1 vectors. It is common to index the elements of type-l vectors with an index m called order, where −l ≤ m ≤ l.
The group of inversion only has two elements, identity and inversion, and two irreps, even (e) and odd (o). Vectors transformed by the even irrep do not change sign under inversion while those transformed by the odd irrep do. We create irreps of O(3) by simply multiplying those of SO(3) and the group of inversion, and we introduce a parity p to type-l vectors to denote how they transform under inversion. Therefore, type-l vectors in SE(3) are extended to type-(l, p) vectors in E(3), where p is e or o. In the following, we use type-l vectors for ease of discussion, but we can generalize to type-(l, p) vectors unless otherwise stated.

Irreps Features.
We concatenate multiple type-l vectors to form equivariant irreps features. Concretely, an irreps feature contains C_l type-l vectors, where 0 ≤ l ≤ L_max and C_l is the number of channels for type-l vectors. We index irreps features by channel c, degree l, and order m and denote them as x^{(l)}_{c,m}. Different channels of type-l vectors are parametrized by different weights but are transformed with the same Wigner-D matrix D^{(l)}. Regular scalar features correspond to irreps features including only type-0 vectors. This formulation generalizes from SE(3) to E(3) by including inversion and extending type-l vectors to type-(l, p) vectors.
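As a concrete illustration, the e3nn library (on which our implementation builds) stores an irreps feature as a flat tensor whose transformation law is the block-diagonal Wigner-D matrix; the specific irreps below are an arbitrary example.

```python
import torch
from e3nn import o3

irreps = o3.Irreps("16x0e + 8x1o")  # 16 channels of type-0 and 8 channels of type-1 vectors
x = irreps.randn(10, -1)            # 10 nodes; feature dimension = 16*1 + 8*3 = 40
R = o3.rand_matrix()                # a random 3D rotation matrix
D = irreps.D_from_matrix(R)         # block-diagonal Wigner-D matrix of shape (40, 40)
x_rotated = x @ D.T                 # how the irreps feature transforms under the rotation
```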
Spherical Harmonics.
Euclidean vectors \vec{r} in R^3 can be projected onto type-l vectors x^{(l)} by using spherical harmonics (SH) Y^{(l)}: x^{(l)} = Y^{(l)}(\vec{r}/\|\vec{r}\|). SH are equivariant, satisfying Y^{(l)}(R(g)\hat{r}) = D^{(l)}(g)Y^{(l)}(\hat{r}) for rotations. SH of relative positions generate the first set of irreps features. Equivariant information propagates to other irreps features through equivariant operations like the tensor product.
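In e3nn, this projection is a one-liner; the degrees and normalization below are illustrative choices.

```python
import torch
from e3nn import o3

r = torch.randn(32, 3)  # relative positions r_ij
# project onto type-0, type-1, and type-2 vectors; output shape (32, 1 + 3 + 5)
sh = o3.spherical_harmonics([0, 1, 2], r, normalize=True, normalization="component")
```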
2.4 Tensor Product
We use tensor products to interact different type-l vectors, and we first discuss the tensor product for SE(3). The tensor product, denoted as ⊗, uses Clebsch-Gordan coefficients to combine a type-l₁ vector u^{(l_1)} and a type-l₂ vector v^{(l_2)} and produce a type-l₃ vector as follows:

\big(u^{(l_1)} \otimes v^{(l_2)}\big)^{(l_3)}_{m_3} = \sum_{m_1=-l_1}^{l_1} \sum_{m_2=-l_2}^{l_2} C^{(l_3, m_3)}_{(l_1, m_1)(l_2, m_2)} \, u^{(l_1)}_{m_1} v^{(l_2)}_{m_2}    (2)

where m denotes order and u^{(l_1)}_{m_1} refers to the m₁-th element of u^{(l_1)}. Clebsch-Gordan coefficients C^{(l_3, m_3)}_{(l_1, m_1)(l_2, m_2)} are non-zero only when |l_1 − l_2| ≤ l_3 ≤ l_1 + l_2 and thus restrict output vectors to be of certain types. For efficiency, we discard vectors with l_3 > L_max, where L_max is a hyperparameter, to prevent vectors of increasingly higher dimensions. The tensor product is an equivariant operation: applying a rotation to both inputs and then taking their tensor product is equivalent to taking the tensor product first and rotating the output.
We call each distinct non-trivial combination of (l_1, l_2, l_3) a path. Each path is independently equivariant, and we can assign one learnable weight to each path in a tensor product, similar to typical linear layers. We can generalize Eq. 2 to irreps features and include multiple channels of vectors of different types by iterating over all paths associated with channels of vectors; in this way, weights are indexed by (c_1, l_1, c_2, l_2, l_3), where c_1 and c_2 index the channels of type-l₁ and type-l₂ vectors in the input irreps features. We use ⊗_w to represent a tensor product with weights w. Weights can be conditioned on quantities like relative distances. For example, in GNNs, we use tensor products of irreps features and SH embeddings of relative positions to transform features into messages and use learnable functions transforming \|\vec{r}_{ij}\| into the weights of tensor products schnet; nequip. Please refer to Sec. A.4 in the appendix for a discussion of including inversion in tensor products.
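A weighted tensor product and a numerical equivariance check can be written with e3nn as follows; the irreps choices are illustrative, and `FullyConnectedTensorProduct` assigns one learnable weight per path as described above.

```python
import torch
from e3nn import o3

irreps_in = o3.Irreps("8x0e + 4x1o")
irreps_sh = o3.Irreps.spherical_harmonics(lmax=2)  # 1x0e + 1x1o + 1x2e
irreps_out = o3.Irreps("8x0e + 4x1o + 4x2e")       # discard anything above L_max = 2
tp = o3.FullyConnectedTensorProduct(irreps_in, irreps_sh, irreps_out)

x, y = irreps_in.randn(1, -1), irreps_sh.randn(1, -1)
R = o3.rand_matrix()
# rotating inputs and then applying the tensor product
out1 = tp(x @ irreps_in.D_from_matrix(R).T, y @ irreps_sh.D_from_matrix(R).T)
# applying the tensor product and then rotating the output
out2 = tp(x, y) @ irreps_out.D_from_matrix(R).T
assert torch.allclose(out1, out2, atol=1e-5)  # equivariance holds numerically
```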
2.5 Equivariant Neural Networks
SE(3)/E(3)-equivariant neural networks (ENNs) are built such that every operation is equivariant to Euclidean symmetry, satisfying f(D_X(g)x, w) = D_Y(g)f(x, w) for group element g, input x ∈ X, and scalar weights w tfn; kondor2018clebsch; 3dsteerable. In addition to the tensor product, equivariant operations also include equivariant activation functions and normalizations as in typical neural networks, and we discuss those used in Equiformer in Sec. 3.1. Because an ENN is a composition of equivariant operations, it is also equivariant; transformations acting on input spaces propagate to the output spaces, e.g., when we rotate an input atomistic graph, predicted vectors will be rotated.

3 Equiformer
We incorporate SE(3)/E(3)-equivariant irreps features into Transformers transformer and use equivariant operations. To better adapt Transformers to 3D graph structures, we propose equivariant graph attention. The overall architecture of Equiformer is illustrated in Fig. 1.
3.1 Equivariant Operations for Irreps Features
Here we discuss the equivariant operations used in Equiformer that serve as building blocks for equivariant graph attention and other modules. They include equivariant versions of the original operations in Transformers and the depthwise tensor product, as illustrated in Fig. 2 in the appendix.
Linear.
Linear layers are generalized to irreps features by transforming different type-l vectors separately. Specifically, we apply separate linear operations to each group of type-l vectors. We remove bias terms for non-scalar features with l > 0: biases do not depend on inputs, and therefore including biases for type-l vectors with l > 0 would break equivariance.
Layer Normalization.
Transformers adopt layer normalization (LN) layer_norm to stabilize training. Given input x ∈ R^{N×C}, with N being the number of nodes and C the number of channels, LN calculates a linear transformation of the normalized input as follows:

\text{LN}(x) = \frac{x - \mu}{\sigma} \circ \gamma + \beta    (3)

where μ and σ are the mean and standard deviation of the input x along the channel dimension C, γ and β are learnable parameters, and ∘ denotes the elementwise product. By viewing the standard deviation as the root mean square value (RMS) of the L2-norms of type-l vectors, LN can be generalized to irreps features. Specifically, given input x^{(l)} of type-l vectors, the output is:

\text{LN}\big(x^{(l)}\big) = \frac{x^{(l)}}{\text{RMS}_c\big(\|x^{(l)}\|_2\big)} \circ \gamma^{(l)}    (4)

where \|x^{(l)}\|_2 calculates the L2-norm of each type-l vector in x^{(l)}, and RMS_c calculates the RMS of the L2-norms, with the mean taken along the channel dimension c. We remove the mean and biases for type-l vectors with l > 0, following linear layers.
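Below is a minimal sketch of this generalized layer normalization for a single group of C type-l vectors stored as a tensor of shape (N, C, 2l+1); the function name and epsilon are assumptions.

```python
import torch

def equivariant_layer_norm(x, gamma, eps=1e-5):
    """x: (N, C, 2l+1) type-l vectors; gamma: (C,) learnable per-channel scale."""
    # squared L2-norm of each type-l vector
    sq_norm = x.pow(2).sum(dim=-1, keepdim=True)    # (N, C, 1)
    # RMS of the L2-norms, mean taken along the channel dimension
    rms = sq_norm.mean(dim=1, keepdim=True).sqrt()  # (N, 1, 1)
    # normalize and rescale; no mean subtraction or bias for l > 0
    return x / (rms + eps) * gamma.view(1, -1, 1)
```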
Gate.
We adopt the gate activation 3dsteerable as the equivariant activation function. Typical activation functions are applied to type-0 vectors. For vectors of higher l, we multiply them with non-linearly transformed type-0 vectors to maintain equivariance. Specifically, given an input containing C_l type-l vectors for each 0 < l ≤ L_max together with C_0 + \sum_{0<l\le L_{max}} C_l type-0 vectors, we apply the SiLU activation silu; swish to the first C_0 type-0 vectors and the sigmoid function to the remaining \sum_{0<l\le L_{max}} C_l type-0 vectors to obtain non-linear weights. Then, we multiply each type-l vector with its corresponding non-linear weight. After the gate activation, the number of channels of type-0 vectors is reduced to C_0.
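The gate activation is available in e3nn; in the example below, 16 scalars pass through SiLU, and 8 additional gate scalars pass through a sigmoid and multiply the 8 type-1 vectors. The channel counts are illustrative.

```python
import torch
from e3nn import o3
from e3nn.nn import Gate

gate = Gate(
    "16x0e", [torch.nn.functional.silu],  # regular scalars and their activation
    "8x0e", [torch.sigmoid],              # gate scalars producing non-linear weights
    "8x1o",                               # higher-degree vectors to be gated
)
x = o3.Irreps("16x0e + 8x0e + 8x1o").randn(10, -1)
out = gate(x)  # shape (10, 16 + 8*3); channels of type-0 vectors reduced to 16
```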
Depthwise Tensor Product.
The tensor product defines interactions between vectors of different degrees l. To improve its efficiency, we use the depthwise tensor product (DTP), which restricts one type-l vector in the output irreps feature to depend only on one type-l′ vector in the input irreps feature, where l′ can be equal to or different from l. This is similar to depthwise convolution mobilenet, where one output channel depends on only one input channel. Weights in the DTP can be input-independent or conditioned on relative distances, and the DTP between two tensors x and y with weights w is denoted as DTP_w(x, y).
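A sketch of a depthwise tensor product built with e3nn's `TensorProduct` and "uvu" instructions, following the construction used by NequIP-style models; external weights allow conditioning on relative distances. The irreps are illustrative.

```python
import torch
from e3nn import o3

irreps_in = o3.Irreps("8x0e + 8x1o")
irreps_sh = o3.Irreps.spherical_harmonics(lmax=1)

# one "uvu" path per valid (input, SH) combination: each output channel
# depends on exactly one input channel, as in depthwise convolution
irreps_out, instructions = [], []
for i, (mul, ir_in) in enumerate(irreps_in):
    for j, (_, ir_sh) in enumerate(irreps_sh):
        for ir_out in ir_in * ir_sh:  # Clebsch-Gordan selection rule
            instructions.append((i, j, len(irreps_out), "uvu", True))
            irreps_out.append((mul, ir_out))

dtp = o3.TensorProduct(irreps_in, irreps_sh, o3.Irreps(irreps_out), instructions,
                       internal_weights=False, shared_weights=False)
x = irreps_in.randn(10, -1)            # features on 10 edges
sh = irreps_sh.randn(10, -1)           # SH embeddings of relative positions
w = torch.randn(10, dtp.weight_numel)  # weights, e.g., produced by a radial MLP
out = dtp(x, sh, w)
```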
3.2 Equivariant Graph Attention
Self-attention transformer; gat; se3_transformer; transformer_in_vision; graphormer; gatv2 transforms features sent from one spatial location to another with input-dependent weights. We use the notation from Transformers transformer and message passing networks neural_message_passing_quantum_chemistry; gns; egnn; segnn and define the message m_{ij} sent from node j to node i as follows:

m_{ij} = a_{ij} \cdot v_{ij}    (5)

where the attention weight a_{ij} depends on features on node i and its neighbors, and the value v_{ij} is transformed with input-independent weights. In Transformers and Graph Attention Networks (GAT) gat; gatv2, v_{ij} depends only on node j. In message passing networks neural_message_passing_quantum_chemistry; gns; egnn; segnn, m_{ij} depends on features on nodes i and j with constant a_{ij}. The proposed equivariant graph attention adopts tensor products to incorporate content and geometric information and utilizes multi-layer perceptron attention for a_{ij} and non-linear message passing for v_{ij}, as illustrated in Fig. 1(b).
Incorporating Content and Geometric Information.
Given features x_i and x_j on target node i and source node j, we combine the two features with two linear layers to obtain the initial message x_{ij}. x_{ij} is passed to a DTP layer and a linear layer to consider geometric information like relative position contained in different type-l vectors in irreps features:

x_{ij} = \text{Linear}_{\text{dst}}(x_i) + \text{Linear}_{\text{src}}(x_j), \qquad f_{ij} = \text{Linear}\big(\text{DTP}_{w_{ij}}\big(x_{ij}, \text{SH}(\vec{r}_{ij})\big)\big)    (6)

DTP_{w_{ij}}(x_{ij}, SH(\vec{r}_{ij})) is the tensor product of x_{ij} and the spherical harmonics embedding SH(\vec{r}_{ij}) of the relative position, with weights w_{ij} parametrized by \|\vec{r}_{ij}\|. The message f_{ij} considers both semantic and geometric features on source and target nodes in a linear manner and is used to derive attention weights and non-linear messages.
MultiLayer Perceptron Attention.
Attention weights capture how each node interacts with neighboring nodes. Attention weights a_{ij} are invariant to geometric transformations se3_transformer, and therefore we only use the type-0 vectors (scalars) of the message f_{ij}, denoted as f_{ij}^{(0)}, for attention. Note that f_{ij}^{(0)} still encodes directional information, as it is generated by tensor products involving type-l vectors with l > 0. Inspired by GATv2 gatv2, we adopt multi-layer perceptron attention (MLPA) instead of the dot product attention (DPA) used in Transformers transformer; transformer_in_vision. In contrast to dot products, MLPs are universal approximators mlp_universal_approximator; approximation_mlp; approximation_sigmoid and can theoretically capture any attention pattern gatv2. Similar to GAT gat; gatv2, given f_{ij}^{(0)}, we use one leaky ReLU layer and one linear layer for a_{ij}:

z_{ij} = a^{\top} \, \text{LeakyReLU}\big(f_{ij}^{(0)}\big), \qquad a_{ij} = \text{softmax}_j(z_{ij}) = \frac{\exp(z_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(z_{ik})}    (7)

where a is a learnable vector of the same dimension as f_{ij}^{(0)} and z_{ij} is a single scalar. The output of attention is the sum of values multiplied by the corresponding attention weights over all neighboring nodes, \sum_{j \in \mathcal{N}(i)} a_{ij} v_{ij}, where v_{ij} can be obtained by linear or non-linear transformations of f_{ij}, as discussed below.
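A sketch of Eq. 7 using PyG's segment-wise softmax; `f0` denotes the scalar part f_ij^{(0)} of the messages, and the negative slope is an assumption.

```python
import torch
import torch.nn.functional as F
from torch_geometric.utils import softmax

def mlp_attention_weights(f0, dst, a, num_nodes):
    """f0: (E, C) scalar features of messages; dst: (E,) target node per edge;
    a: (C,) learnable vector. Returns per-edge attention weights of shape (E,)."""
    z = F.leaky_relu(f0, negative_slope=0.2) @ a  # one leaky ReLU layer + one linear layer
    return softmax(z, dst, num_nodes=num_nodes)   # normalize over each node's neighbors
```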
Non-Linear Message Passing.
Values are features sent from one node to another, transformed with input-independent weights. We first split f_{ij} into f_{ij}^{(L)} and f_{ij}^{(0)}, where the former consists of type-l vectors with l > 0 and the latter consists of scalars only. Then, we perform a non-linear transformation on f_{ij} to obtain the non-linear message:

v_{ij} = \text{Linear}\big(\text{DTP}\big(\text{Gate}(f_{ij}), \text{SH}(\vec{r}_{ij})\big)\big)    (8)

We apply the gate activation to f_{ij} and then use one DTP and a linear layer to enable interactions between non-linear type-l vectors, similar to how we transform x_{ij} into f_{ij}; the weights here are input-independent. We can also use f_{ij} directly as v_{ij} for linear messages.
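A sketch of the non-linear message with e3nn modules; a fully connected tensor product stands in for the depthwise variant used in Equiformer, and all irreps and channel counts are illustrative.

```python
import torch
from e3nn import o3
from e3nn.nn import Gate

irreps_msg = o3.Irreps("16x0e + 8x0e + 8x1o")  # scalars + gate scalars + vectors
gate = Gate("16x0e", [torch.nn.functional.silu], "8x0e", [torch.sigmoid], "8x1o")
irreps_sh = o3.Irreps.spherical_harmonics(lmax=1)
tp = o3.FullyConnectedTensorProduct(gate.irreps_out, irreps_sh, "16x0e + 8x1o")
lin = o3.Linear(tp.irreps_out, o3.Irreps("16x0e + 8x1o"))

f = irreps_msg.randn(10, -1)  # messages f_ij on 10 edges
sh = o3.spherical_harmonics(irreps_sh, torch.randn(10, 3), normalize=True)
v = lin(tp(gate(f), sh))      # non-linear values v_ij (input-independent weights)
```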
Multi-Head Attention.
Following Transformers transformer, we perform several parallel equivariant graph attention functions given f_{ij}. The different outputs are concatenated and projected with a linear layer, resulting in the final output, as illustrated in Fig. 1(b). Note that parallelizing attention functions and concatenating their outputs can be implemented with "Reshape" operations.
3.3 Overall Architecture
For completeness, we discuss other modules in Equiformer here.
Embedding.
This module consists of atom embedding and edge-degree embedding. For the former, we use a linear layer to transform one-hot encodings of atom species. For the latter, as depicted in the right branch of Fig. 1(c), we first transform a constant one vector into messages encoding local geometry with two linear layers and one intermediate DTP layer and then use sum aggregation to encode degree information gin; graphormer_3d. The DTP layer has the same form as that in Eq. 6. We scale the aggregated features by dividing by the square root of the average degree in the training set so that the standard deviation of the aggregated features is close to 1. The two embeddings are summed to produce the final embedding of an input 3D graph.

Radial Basis and Radial Function.
The relative distance \|\vec{r}_{ij}\| parametrizes the weights in some DTP layers. To reflect subtle changes in \|\vec{r}_{ij}\|, we represent distances with a Gaussian radial basis with learnable mean and standard deviation schnet; spinconv; gemnet; graphormer_3d or a radial Bessel basis dimenet; dimenet_pp. We transform the radial basis with a learnable radial function to generate weights for those DTP layers schnet; se3_transformer; nequip. The function consists of two MLPs, each with layer normalization layer_norm and SiLU silu; swish, followed by a final linear layer.
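A sketch of the Gaussian radial basis followed by the radial function described above; the layer sizes and the cutoff are illustrative assumptions, and `weight_numel` would come from the DTP layer being parametrized.

```python
import torch
import torch.nn as nn

class RadialFunction(nn.Module):
    """Gaussian radial basis with learnable means/stds, then two MLP layers
    with layer normalization and SiLU, and a final linear layer."""
    def __init__(self, num_basis=128, hidden=64, weight_numel=256, cutoff=5.0):
        super().__init__()
        self.mean = nn.Parameter(torch.linspace(0.0, cutoff, num_basis))
        self.std = nn.Parameter(torch.full((num_basis,), cutoff / num_basis))
        self.mlp = nn.Sequential(
            nn.Linear(num_basis, hidden), nn.LayerNorm(hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.SiLU(),
            nn.Linear(hidden, weight_numel),
        )

    def forward(self, dist):  # dist: (E,) relative distances
        basis = torch.exp(-((dist[:, None] - self.mean) / self.std) ** 2)
        return self.mlp(basis)  # (E, weight_numel) weights for a DTP layer
```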
Feed Forward Network.
Similar to Transformers, we use two equivariant linear layers and an intermediate gate activation for the feed forward networks in Equiformer.
Output Head.
The last feed forward network transforms features on each node into a scalar. We perform sum aggregation over all nodes to predict scalar quantities like energy. Similar to the edge-degree embedding, we divide the aggregated scalars by the square root of the average number of atoms in the training set.
4 Related Works
Here, we focus on equivariant neural networks and discuss other works in Sec. B in the appendix.
Equivariant GNNs.
Equivariant neural networks tfn; kondor2018clebsch; 3dsteerable; se3_transformer; l1net; geometric_prediction; nequip; gvp; painn; egnn; se3_wavefunction; segnn; torchmd_net; eqgat; allergo operate on geometric tensors like type-l vectors to achieve equivariance. The central idea is to use functions of geometry built from spherical harmonics and irreps features to achieve 3D rotational and translational equivariance, as proposed in Tensor Field Network (TFN) tfn, which generalizes 2D counterparts harmonis_network; group_equivariant_convolution_networks; spherical_cnn to 3D Euclidean space. Previous works differ in the equivariant operations used in their networks. TFN tfn and NequIP nequip mainly use graph convolution with linear messages, with the latter utilizing extra equivariant gate activations 3dsteerable. SE(3)-Transformer se3_transformer adopts an equivariant version of dot product (DP) attention transformer; transformer_in_vision, and subsequent works on equivariant Transformers torchmd_net; eqgat follow the practice of DP attention and use more specialized architectures considering only type-0 and type-1 vectors. SEGNN segnn utilizes non-linear message passing networks neural_message_passing_quantum_chemistry; gns for irreps features. The proposed Equiformer incorporates the advantages of these previous works by combining MLP attention with non-linear messages.
5 Experiment
Our implementation is based on PyTorch pytorch (Modified BSD license), PyG pytorch_geometric (MIT license), e3nn e3nn (MIT license), timm timm (Apache-2.0 license), and ocp oc20 (MIT license).

5.1 QM9
Table 1. MAE results on the QM9 testing set. Lower is better.

Task  α  Δε  ε_HOMO  ε_LUMO  μ  C_ν  G  H  ⟨R²⟩  U  U₀  ZPVE
Units  bohr³  meV  meV  meV  D  cal/mol·K  meV  meV  bohr²  meV  meV  meV
NMP neural_message_passing_quantum_chemistry  .092  69  43  38  .030  .040  19  17  .180  20  20  1.50
SchNet schnet  .235  63  41  34  .033  .033  14  14  .073  19  14  1.70
Cormorant cormorant  .085  61  34  38  .038  .026  20  21  .961  21  22  2.03
LieConv lieconv  .084  49  30  25  .032  .038  22  24  .800  19  19  2.28
DimeNet++ dimenet_pp  .044  33  25  20  .030  .023  8  7  .331  6  6  1.21
TFN tfn  .223  58  40  38  .064  .101  –  –  –  –  –  –
SE(3)-Transformer se3_transformer  .142  53  35  33  .051  .054  –  –  –  –  –  –
EGNN egnn  .071  48  29  25  .029  .031  12  12  .106  12  11  1.55
SphereNet spherenet  .046  32  23  18  .026  .021  8  6  .292  7  6  1.12
SEGNN segnn  .060  42  24  21  .023  .031  15  16  .660  13  15  1.62
EQGAT eqgat  .063  44  26  22  .014  .027  12  13  .257  13  13  1.50
Equiformer  .056  33  17  16  .014  .025  10  10  .227  11  10  1.32
Dataset.
The QM9 qm9_2; qm9_1 dataset (CC BY-NC-SA 4.0 license) consists of 134k small molecules, and the goal is to predict their quantum properties such as energy. We follow the data partition used by Cormorant cormorant, which has 100k, 18k, and 13k molecules in the training, validation, and testing sets, respectively. We minimize the mean absolute error (MAE) between predictions and normalized ground truth and report MAE on the testing set.
Setting.
Please refer to Sec. D in the appendix for details on the architecture and hyperparameters.
Result.
We mainly compare with methods trained with the same data partition and summarize the results in Table 1. Equiformer achieves the best results on 11 out of 12 tasks among these models. The comparison to SEGNN segnn, which uses irreps features as Equiformer does, demonstrates the effectiveness of combining non-linear messages with MLP attention. Additionally, Equiformer achieves better results on most tasks when compared to other equivariant Transformers se3_transformer; eqgat, which suggests a better adaptation of Transformers to 3D graphs. Besides, the different data partition used by some other works has more molecules in the training set and fewer in the testing set, which can benefit tasks that are more sensitive to data partitions.
Table 2. Results on the OC20 IS2RE validation set when trained without the IS2RS auxiliary task. The first five result columns report energy MAE (eV, lower is better); the last five report EwT (%, higher is better).

Methods  ID  OOD Ads  OOD Cat  OOD Both  Average  ID  OOD Ads  OOD Cat  OOD Both  Average
SchNet schnet  0.6465  0.7074  0.6475  0.6626  0.6660  2.96  2.22  3.03  2.38  2.65 
DimeNet++ dimenet_pp  0.5636  0.7127  0.5612  0.6492  0.6217  4.25  2.48  4.40  2.56  3.42 
GemNetT gemnet  0.5561  0.7342  0.5659  0.6964  0.6382  4.51  2.24  4.37  2.38  3.38 
SphereNet spherenet  0.5632  0.6682  0.5590  0.6190  0.6024  4.56  2.70  4.59  2.70  3.64 
(S)EGNN segnn  0.5497  0.6851  0.5519  0.6102  0.5992  4.99  2.50  4.71  2.88  3.77 
SEGNN segnn  0.5310  0.6432  0.5341  0.5777  0.5715  5.32  2.80  4.89  3.09  4.03 
Equiformer  0.5088  0.6271  0.5051  0.5545  0.5489  4.88  2.93  4.92  2.98  3.93 
Table 3. Results on the OC20 IS2RE testing set when trained without the IS2RS auxiliary task. The first five result columns report energy MAE (eV, lower is better); the last five report EwT (%, higher is better).

Methods  ID  OOD Ads  OOD Cat  OOD Both  Average  ID  OOD Ads  OOD Cat  OOD Both  Average
CGCNN cgcnn  0.6149  0.9155  0.6219  0.8511  0.7509  3.40  1.93  3.10  2.00  2.61 
SchNet schnet  0.6387  0.7342  0.6616  0.7037  0.6846  2.96  2.33  2.94  2.21  2.61 
DimeNet++ dimenet_pp  0.5621  0.7252  0.5756  0.6613  0.6311  4.25  2.07  4.10  2.41  3.21 
SpinConv spinconv  0.5583  0.7230  0.5687  0.6738  0.6310  4.08  2.26  3.82  2.33  3.12 
SphereNet spherenet  0.5625  0.7033  0.5708  0.6378  0.6186  4.47  2.29  4.09  2.41  3.32 
SEGNN segnn  0.5327  0.6921  0.5369  0.6790  0.6101  5.37  2.46  4.91  2.63  3.84 
Equiformer  0.5037  0.6881  0.5213  0.6301  0.5858  5.14  2.41  4.67  2.69  3.73 
Table 4. Results on the OC20 IS2RE validation set when the IS2RS node-level auxiliary task is included during training. The first five result columns report energy MAE (eV, lower is better); the last five report EwT (%, higher is better).

Methods  ID  OOD Ads  OOD Cat  OOD Both  Average  ID  OOD Ads  OOD Cat  OOD Both  Average
GNS noisy_nodes  0.54  0.65  0.55  0.59  0.5825  –  –  –  –  –
Noisy Nodes noisy_nodes  0.47  0.51  0.48  0.46  0.4800  –  –  –  –  –
Graphormer graphormer_3d  0.4329  0.5850  0.4441  0.5299  0.4980  –  –  –  –  –
Equiformer  0.4222  0.5420  0.4231  0.4754  0.4657  7.23  3.77  7.13  4.10  5.56 
+ Noisy Nodes  0.4156  0.4976  0.4165  0.4344  0.4410  7.47  4.64  7.19  4.84  6.04 
“GNS” denotes the 50-layer GNS trained without Noisy Nodes data augmentation, and “Noisy Nodes” denotes the 100-layer GNS trained with Noisy Nodes. “Equiformer + Noisy Nodes” uses the data augmentation of interpolating between initial and relaxed structures and adding Gaussian noise, as described by Noisy Nodes noisy_nodes.

5.2 OC20
Dataset.
The Open Catalyst 2020 (OC20) dataset oc20 (Creative Commons Attribution 4.0 License) consists of larger atomistic systems, each composed of a small molecule called the adsorbate placed on a large slab called the catalyst. The average number of atoms in a system is more than 70, and there are over 50 atom species. The goal is to understand the interaction between adsorbates and catalysts through relaxation. An adsorbate is first placed on top of a catalyst to form an initial structure (IS). The positions of atoms are updated with forces calculated by density functional theory until the system is stable and becomes a relaxed structure (RS). The energy of the RS, or relaxed energy (RE), is correlated with catalyst activity and is therefore a metric for understanding their interaction. We focus on the task of initial structure to relaxed energy (IS2RE), which predicts the relaxed energy (RE) given an initial structure (IS). There are 460k, 100k, and 100k structures in the training, validation, and testing sets, respectively. Performance is measured by MAE and energy within threshold (EwT), the percentage of predictions whose energy is within 0.02 eV of the ground truth energy. The validation and testing sets each contain four subsplits: in-distribution adsorbates and catalysts (ID), out-of-distribution adsorbates (OOD Ads), out-of-distribution catalysts (OOD Cat), and out-of-distribution adsorbates and catalysts (OOD Both).
Setting.
We consider two training settings based on whether a node-level auxiliary task noisy_nodes is adopted. In the first setting, we minimize MAE between predicted energy and ground truth energy without any node-level auxiliary task. In the second setting, we incorporate the task of initial structure to relaxed structure (IS2RS) as a node-level auxiliary task noisy_nodes. In addition to predicting energy, we predict node-wise vectors indicating how each atom moves from the initial structure to the relaxed structure. Please refer to Sec. E in the appendix for details on the Equiformer architecture and hyperparameters.
IS2RE Results without Node-Level Auxiliary Task.
We summarize the results under the first setting in Table 2 and Table 3. Compared with state-of-the-art models like SEGNN segnn and SphereNet spherenet, Equiformer consistently achieves the lowest MAE on all four subsplits of the validation and testing sets. Note that energy within threshold (EwT) considers only the percentage of predictions close enough to the ground truth and thus depends on the distribution of errors; an improvement in average error (MAE) therefore does not necessarily translate into an improvement in EwT. Similar phenomena can be observed in Table 3, where on the “OOD Both” subsplit, SphereNet spherenet achieves lower MAE yet also lower EwT than SEGNN segnn. We also note that the models in Tables 2 and 3 are trained by minimizing MAE; comparing MAE on the validation and testing sets therefore mitigates the discrepancy between training objectives and evaluation metrics, and the OC20 leaderboard ranks the relative performance of models mainly according to MAE.
IS2RE Results with IS2RS Node-Level Auxiliary Task.
We report the results on the validation and testing sets under the second setting in Table 4 and Table 5. As of May 20, 2022, Equiformer achieves the best results on the IS2RE task when only IS2RE and IS2RS data are used. We note that the proposed Equiformer in Table 5 achieves competitive results with much less computation. Specifically, training “Equiformer + Noisy Nodes” takes about … GPU-days on A6000 GPUs, the training time of “GNS + Noisy Nodes” noisy_nodes is … TPU-days, and “Graphormer” graphormer_3d uses an ensemble of models that requires … GPU-days to train on A100 GPUs. The comparison to GNS demonstrates the improvement from invariant message passing networks to equivariant Transformers. Without any data augmentation, Equiformer still achieves results competitive with GNS trained with Noisy Nodes noisy_nodes. Compared to Graphormer graphormer_3d, Equiformer demonstrates the effectiveness of equivariant features and the proposed equivariant graph attention. Note that Equiformer is relatively shallow, as GNS trained with Noisy Nodes has 100 blocks and Graphormer uses more Transformer blocks than Equiformer, and deeper networks can typically obtain better results when the IS2RS auxiliary task is adopted noisy_nodes.
5.3 Ablation Study
Table 6. Ablation results on QM9 (MAE on the first six tasks of Table 1).

Index  Attention  Messages  α (bohr³)  Δε (meV)  ε_HOMO (meV)  ε_LUMO (meV)  μ (D)  C_ν (cal/mol·K)
1  MLP  non-linear  .056  33  17  16  .014  .025
2  MLP  linear  .061  34  18  17  .015  .025
3  dot product  linear  .060  34  18  18  .015  .026

Table 7. Ablation results on the OC20 IS2RE validation set.

Energy MAE (eV):
Index  Attention  Messages  ID  OOD Ads  OOD Cat  OOD Both  Average
1  MLP  non-linear  0.5088  0.6271  0.5051  0.5545  0.5489
2  MLP  linear  0.5168  0.6308  0.5088  0.5657  0.5555
3  dot product  linear  0.5386  0.6382  0.5297  0.5692  0.5689

EwT (%):
Index  Attention  Messages  ID  OOD Ads  OOD Cat  OOD Both  Average
1  MLP  non-linear  4.88  2.93  4.92  2.98  3.93
2  MLP  linear  4.59  2.82  4.79  3.02  3.81
3  dot product  linear  4.37  2.60  4.36  2.86  3.55
We conduct ablation studies on the improvements brought by MLP attention and non-linear messages in the proposed equivariant graph attention. We modify dot product (DP) attention transformer; se3_transformer so that it differs from MLP attention only in how attention weights are generated from the message. Please refer to Sec. C.3 in the appendix for details on DP attention. For experiments on QM9 and OC20, unless otherwise stated, we follow the hyperparameters used in the previous experiments.
Result on QM9.
The comparison is summarized in Table 6. Non-linear messages improve upon linear messages when MLP attention is used. Similar to what is reported by GATv2 gatv2, the improvement from replacing DP attention with MLP attention is not very significant. We conjecture that DP attention with linear operations is expressive enough to capture common attention patterns here, as the numbers of neighboring nodes and atom species are much smaller than those in OC20. However, MLP attention is faster, as it directly generates scalar features and attention weights from f_{ij} instead of producing additional key and query irreps features for attention weights.
Result on OC20.
We consider the setting of training without the IS2RS auxiliary task and use a smaller learning rate for DP attention, as this improves its performance. We summarize the comparison in Table 7. Non-linear messages consistently improve upon linear messages. In contrast to the results on QM9, MLP attention achieves better performance than DP attention. We surmise this is because OC20 contains larger atomistic graphs with more diverse atom species and therefore requires more expressive attention mechanisms.
6 Conclusion and Broader Impact
In this work, we propose Equiformer, a graph neural network (GNN) combining the strengths of Transformers and equivariant features based on irreducible representations (irreps). With irreps features, we build upon existing generic GNNs and Transformer networks transformer; vit; graphormer; parp; vit_search by incorporating equivariant operations like tensor products. We further propose equivariant graph attention, which incorporates multi-layer perceptron attention and non-linear messages. Experiments on QM9 and OC20 demonstrate both the effectiveness of Equiformer and the advantage of equivariant graph attention over typical dot product attention.
The broader impact lies in two aspects. First, Equiformer demonstrates the possibility of adapting Transformers to domains such as physics and chemistry, where data can be represented as 3D atomistic graphs. Second, Equiformer achieves more accurate approximations of quantum property calculations. We believe there is much more to be gained by harnessing these abilities for the productive investigation of molecules and materials relevant to applications such as energy, electronics, and pharmaceuticals oc20 than to be lost by applying these methods for adversarial purposes like creating hazardous chemicals. Additionally, there are still substantial hurdles between the identification of a useful or harmful molecule and its large-scale deployment.
Acknowledgement
We thank Simon Batzner, Albert Musaelian, Mario Geiger, Johannes Brandstetter, and Rob Hesselink for helpful discussions including help with the OC20 dataset. We also thank the e3nn e3nn developers and community for the library and detailed documentation. We acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center supercloud for providing high performance computing and consultation resources that have contributed to the research results reported within this paper.
References
Appendix
Appendix A Additional Mathematical Background
In this section, we provide additional mathematical background on group equivariance helpful for the discussion of the proposed method. Other works tfn; 3dsteerable; kondor2018clebsch; cormorant; se3_transformer; segnn provide similar background. We encourage interested readers to see these works zee; Dresselhaus2007 for more in-depth and pedagogical presentations.
A.1 Group Theory
Definition of Groups.
A group is an algebraic structure that consists of a set G and a binary operator ∘ and is typically denoted as (G, ∘). Groups satisfy the following four axioms:

- Closure: a ∘ b ∈ G for all a, b ∈ G.
- Identity: there exists an identity element e ∈ G such that e ∘ a = a ∘ e = a for all a ∈ G.
- Inverse: for each a ∈ G, there exists an inverse element a⁻¹ ∈ G such that a ∘ a⁻¹ = a⁻¹ ∘ a = e.
- Associativity: (a ∘ b) ∘ c = a ∘ (b ∘ c) for all a, b, c ∈ G.
In this work, we focus on 3D rotation, translation, and inversion. Relevant groups include:

- The Euclidean group in three dimensions E(3): 3D rotation, translation, and inversion.
- The special Euclidean group in three dimensions SE(3): 3D rotation and translation.
- The orthogonal group in three dimensions O(3): 3D rotation and inversion.
- The special orthogonal group in three dimensions SO(3): 3D rotation.
Group Representations.
The actions of groups define transformations. Formally, a transformation acting on a vector space X parametrized by group element g is an injective function T_g: X → X. A powerful result of group representation theory is that these transformations can be expressed as matrices which act on vector spaces via matrix multiplication; these matrices are called group representations. Formally, a group representation D_X is a mapping between a group G and a set of invertible N×N matrices, where N is the dimension of X. The group representation maps the vector space X onto itself and satisfies D_X(g)D_X(h) = D_X(gh) for all g, h ∈ G.
How a group is represented depends on the vector space it acts on. If there exists a change of basis in the form of an N×N matrix Q such that Q^{-1}D_X(g)Q = D_{X'}(g) for all g ∈ G, then we say the two group representations are equivalent. If D_{X'}(g) is block-diagonal, which means that D_X(g) acts on independent subspaces of the vector space, the representation is reducible. A particular class of representations that are convenient for composable functions are irreducible representations, or “irreps”, which cannot be further reduced. We can express any group representation of SO(3) as a direct sum (concatenation) of irreps zee; Dresselhaus2007; e3nn:

D_X(g) = Q^{-1} \left( D^{(l_1)}(g) \oplus D^{(l_2)}(g) \oplus \cdots \right) Q    (9)

where the D^{(l_i)} are Wigner-D matrices with degree l_i, as mentioned in Sec. 2.3.
A.2 Equivariance
Definition of Equivariance and Invariance.
Equivariance is a property of a function f: X → Y mapping between vector spaces X and Y. Given a group G and group representations D_X and D_Y in the input and output spaces X and Y, f is equivariant to G if f(D_X(g)x) = D_Y(g)f(x) for all g ∈ G and x ∈ X. Invariance corresponds to the case where D_Y(g) is the identity for all g ∈ G.
Equivariance in Neural Networks.
Group equivariant neural networks are guaranteed to make equivariant predictions on data transformed by a group. Additionally, they have been found to be data-efficient and to generalize better than non-symmetry-aware and invariant methods nequip; quantum_scaling; neural_scale_of_chemical_models. For 3D atomistic graphs, we consider equivariance to the Euclidean group E(3), which consists of 3D rotation, translation, and inversion. For translation, we operate on relative positions, and therefore our networks are invariant to 3D translation. We achieve equivariance to rotation and inversion by representing our input data, intermediate features, and outputs in vector spaces of irreps and acting on them with only equivariant operations.
A.3 Equivariant Features Based on Vector Spaces of Irreducible Representations
Irreps Features.
As discussed in Sec. 2.3 in the main text, we use type-l vectors for SE(3)-equivariant irreps features (in SEGNN segnn, these are also referred to as steerable features; we use the term “irreps features” to remain consistent with the e3nn e3nn library) and type-(l, p) vectors for E(3)-equivariant irreps features. The parity p denotes whether vectors change sign under inversion and can be either e (even) or o (odd). Vectors with p = o change sign under inversion while those with p = e do not. Scalar features correspond to type-0 vectors in the case of SE(3) equivariance and to type-(0, e) vectors in the case of E(3) equivariance, whereas type-(0, o) vectors correspond to pseudoscalars. Euclidean vectors in R³ correspond to type-1 vectors and type-(1, o) vectors, whereas type-(1, e) vectors correspond to pseudovectors. Note that type-(l, e) vectors and type-(l, o) vectors are considered vectors of different types in equivariant linear layers and layer normalizations.
Spherical Harmonics.
Euclidean vectors \vec{r} in R³ can be projected onto type-l vectors by using spherical harmonics Y^{(l)}: x^{(l)} = Y^{(l)}(\vec{r}/\|\vec{r}\|) finding_symmetry_breaking_order. This is equivalent to the Fourier transform of the angular degrees of freedom of \vec{r}/\|\vec{r}\|, which can be optionally weighted by a function of \|\vec{r}\|. In the case of SE(3) equivariance, x^{(l)} transforms in the same manner as type-l vectors. For E(3) equivariance, x^{(l)} behaves as a type-(l, p) vector, where p = e if l is even and p = o if l is odd.

Vectors of Higher Degrees and Other Parities.
Although previously we have restricted concrete examples of vector spaces of irreps to the commonly encountered scalars (type-0 vectors) and Euclidean vectors (type-1 vectors), vectors of higher degree l and other parities are equally physical. For example, the moment of inertia (how an object rotates under torque) transforms as a 3×3 symmetric matrix, which has symmetric-traceless components behaving as type-2 vectors. Elasticity (how an object deforms under loading) transforms as a rank-4 symmetric tensor, which includes components acting as type-0, type-2, and type-4 vectors.

A.4 Tensor Product
Tensor Product for E(3).
We use tensor products to interact different type-(l, p) vectors. We extend our discussion in Sec. 2.4 in the main text to include inversion and type-(l, p) vectors. The tensor product, denoted as ⊗, uses Clebsch-Gordan coefficients to combine a type-(l₁, p₁) vector u^{(l_1, p_1)} and a type-(l₂, p₂) vector v^{(l_2, p_2)} and produce a type-(l₃, p₃) vector as follows:

\big(u^{(l_1, p_1)} \otimes v^{(l_2, p_2)}\big)^{(l_3, p_3)}_{m_3} = \sum_{m_1=-l_1}^{l_1} \sum_{m_2=-l_2}^{l_2} C^{(l_3, m_3)}_{(l_1, m_1)(l_2, m_2)} \, u^{(l_1, p_1)}_{m_1} v^{(l_2, p_2)}_{m_2}    (10)

p_3 = p_1 \times p_2    (11)

The only difference of tensor products for E(3) as described in Eq. 10 from those for SE(3) described in Eq. 2 is that we additionally keep track of the output parity as in Eq. 11, using the multiplication rules e × e = e, e × o = o, and o × o = e. For example, the tensor product of a type-(1, o) vector and a type-(1, o) vector can result in one type-(0, e) vector, one type-(1, e) vector, and one type-(2, e) vector.
Clebsch-Gordan Coefficients.
The Clebsch-Gordan coefficients for SO(3) are computed from integrals over the basis functions of a given irreducible representation, e.g., the real spherical harmonics, as shown below, and are tabulated to avoid unnecessary computation:

C^{(l_3, m_3)}_{(l_1, m_1)(l_2, m_2)} \propto \int_{S^2} Y^{(l_1)}_{m_1}(\hat{r}) \, Y^{(l_2)}_{m_2}(\hat{r}) \, Y^{(l_3)}_{m_3}(\hat{r}) \, d\hat{r}    (12)

For many combinations of l₁, l₂, and l₃, the Clebsch-Gordan coefficients are zero. This gives rise to the following selection rule for non-trivial coefficients: |l₁ − l₂| ≤ l₃ ≤ l₁ + l₂.
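The tabulated coefficients are exposed in e3nn as Wigner 3j symbols, which coincide with the Clebsch-Gordan coefficients up to normalization:

```python
from e3nn import o3

# non-trivial because |1 - 1| <= 2 <= 1 + 1; shape (2*1+1, 2*1+1, 2*2+1) = (3, 3, 5)
C = o3.wigner_3j(1, 1, 2)
```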
Examples of Tensor Products.
Tensor products generally define the interaction between different type-(l, p) vectors in a symmetry-preserving manner and encompass common operations, listed below and illustrated in the code sketch after this list:

- Scalar-scalar multiplication: type-(0, e) ⊗ type-(0, e) → type-(0, e).
- Scalar-vector multiplication: type-(0, e) ⊗ type-(1, o) → type-(1, o).
- Vector dot product: type-(1, o) ⊗ type-(1, o) → type-(0, e).
- Vector cross product: type-(1, o) ⊗ type-(1, o) → type-(1, e) (pseudovector).
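These special cases can be reproduced with e3nn's elementwise tensor product; the outputs match the familiar dot and cross products up to e3nn's normalization and basis conventions.

```python
import torch
from e3nn import o3

u, v = torch.randn(1, 3), torch.randn(1, 3)
dot = o3.ElementwiseTensorProduct("1x1o", "1x1o", ["0e"])    # 1o x 1o -> 0e
cross = o3.ElementwiseTensorProduct("1x1o", "1x1o", ["1e"])  # 1o x 1o -> 1e
s = dot(u, v)    # proportional to (u * v).sum(-1)
w = cross(u, v)  # proportional to the cross product, expressed in e3nn's basis
```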
Appendix B Related Works
B.1 Graph Neural Networks for 3D Atomistic Graphs
Graph neural networks (GNNs) are well suited to property prediction for atomistic systems because they can handle discrete and topological structures. There are two main ways to represent atomistic graphs atom3d: chemical bond graphs, sometimes denoted as 2D graphs, and 3D spatial graphs. Chemical bond graphs use edges to represent covalent bonds without considering 3D geometry. Due to their similarity to graph structures in other applications, generic GNNs graphsage; neural_message_passing_quantum_chemistry; gcn; gin; gat; gatv2 can be directly applied to predict their properties qm9_2; qm9_1; deepchem; ogb; ogblsc. On the other hand, 3D spatial graphs consider the positions of atoms in 3D space and therefore 3D geometry. Although 3D graphs can faithfully represent atomistic systems, one challenge of moving from chemical bond graphs to 3D spatial graphs is remaining invariant or equivariant to geometric transformations acting on atom positions. Therefore, invariant neural networks and equivariant neural networks have been proposed for 3D atomistic graphs, with the former leveraging invariant information like distances and angles and the latter operating on geometric tensors like type-l vectors.
B.2 Invariant GNNs
Previous works schnet; cgcnn; physnet; dimenet; dimenet_pp; orbnet; spherenet; spinconv; gemnet extract invariant information from 3D atomistic graphs and operate on the resulting invariant graphs. They mainly differ in leveraging different geometric information such as distances, bond angles (involving 3 atoms), or dihedral angles (involving 4 atoms). SchNet schnet uses relative distances and proposes continuous-filter convolutional layers to learn local interactions between atom pairs. The DimeNet series dimenet; dimenet_pp incorporates bond angles by using triplet representations of atoms. SphereNet spherenet and GemNet gemnet; gemnet_oc further consider dihedral angles for better performance. In order to consider the directional information contained in angles, these methods rely on triplet or quadruplet representations of atoms. In addition to being memory-intensive gemnet_xl, they also change graph structures by introducing higher-order interaction terms line_graph, which would require non-trivial modifications to generic GNNs in order to apply them to 3D graphs. In contrast, the proposed Equiformer uses equivariant irreps features to consider directional information without complicating graph structures and can therefore directly inherit the design of generic GNNs.
B.3 Attention and Transformer
Graph Attention.
Graph attention networks (GAT) gat; gatv2 use multi-layer perceptrons (MLPs) to calculate attention weights in a manner similar to message passing networks. Subsequent works using graph attention mechanisms follow either GAT-like MLP attention relational_graph_attention_networks; graph_attention_with_self_supervision or Transformer-like dot product attention gaan; hard_graph_attention; masked_label_prediction; generalization_transformer_graphs; graph_attention_with_self_supervision; spectral_attention. In particular, Kim et al. graph_attention_with_self_supervision compare these two types of attention mechanisms empirically under a self-supervised setting. Brody et al. gatv2 analyze their theoretical differences and compare their performance in general settings.
Graph Transformer.
A different line of research focuses on adapting standard Transformer networks to graph problems generalization_transformer_graphs; grove; spectral_attention; graphormer; graphormer_3d. These works adopt the dot product attention of Transformers transformer and propose different approaches to incorporate graph-related inductive biases into their networks. GROVE grove includes additional message passing or graph convolutional layers to incorporate local graph structures when calculating attention weights. SAN spectral_attention proposes to learn position embeddings of nodes with the full Laplacian spectrum. Graphormer graphormer proposes to encode degree information in centrality embeddings and to encode distances and edge features in attention biases. The proposed Equiformer belongs to this line of attempts to generalize standard Transformers to graphs and is dedicated to 3D graphs. To incorporate 3D-related inductive biases, we adopt an equivariant version of Transformers with irreps features and propose a novel equivariant graph attention.
Appendix C Details of Architecture
C.1 Equivariant Operations Used in Equiformer
C.2 Equiformer Architecture
For simplicity, and because most works we compare with do not include equivariance to inversion, we adopt SE(3)-equivariant irreps features in Equiformer for the experiments in the main text, and we note that E(3)-equivariant irreps features can be easily incorporated into Equiformer.
We define architectural hyperparameters like the number of channels in some layers in Equiformer, which are used to specify the detailed architectures in Sec. D and Sec. E.

We use an embedding dimension to define the dimension of most irreps features; specifically, all irreps features in Fig. 1 have this dimension unless otherwise stated. Besides, a spherical harmonics embedding dimension specifies the dimension of the spherical harmonics embeddings of relative positions in all depthwise tensor products.

For equivariant graph attention in Fig. 1(b), the first two linear layers have the same output dimension. The output dimensions of depthwise tensor products (DTP) are determined by those of their input irreps features. Equivariant graph attention consists of several parallel attention functions, and the value vector in each attention function has a fixed dimension; we refer to the number of parallel attention functions and this dimension as the number of heads and the head dimension, respectively. By default, we set the number of channels in the scalar feature f_{ij}^{(0)} to be the same as the number of channels of type-0 vectors in the value v_{ij}. When non-linear messages are adopted, we set the dimension of the output irreps features in the gate activation to be that of the value. Therefore, the two hyperparameters, the number of heads and the head dimension, specify the detailed architecture of equivariant graph attention.

As for feed forward networks (FFNs), we denote the dimension of the output irreps features in the gate activation as the FFN hidden dimension. The FFN in the last Transformer block has a separate output feature dimension, and we set the hidden dimension of the last FFN, which is followed by the output head, to this value as well. Thus, two hyperparameters, the FFN hidden dimension and the output feature dimension, specify the architectures of FFNs and the output dimension after Transformer blocks.

Irreps features contain channels of vectors with degrees up to a maximum degree L_max. We use brackets to represent concatenations of vectors; for example, an irreps feature containing C₀ type-0 vectors and C₁ type-1 vectors is denoted as [C₀ × type-0, C₁ × type-1].
C.3 Dot Product Attention
We illustrate the dot product attention without non-linear message passing used in the ablation study in Fig. 4. The architecture is adapted from SE(3)-Transformer se3_transformer. The difference from multi-layer perceptron attention lies in how we obtain attention weights from f_{ij}. We split f_{ij} into two irreps features, key k_{ij} and value v_{ij}, and obtain the query q_i with a linear layer. Then, we perform a scaled dot product transformer between q_i and k_{ij} to obtain attention weights.
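A sketch of this variant for comparison with the MLP attention sketch above; the inner product of two equivariant irreps features is invariant, so the logit is a valid attention weight, and the scaling constant is an assumption.

```python
import torch
from torch_geometric.utils import softmax

def dp_attention_weights(q, k, dst, num_nodes):
    """q, k: (E, D) per-edge query/key irreps features (queries gathered from
    target nodes); dst: (E,) target node per edge. Returns weights of shape (E,)."""
    z = (q * k).sum(dim=-1) / q.shape[-1] ** 0.5  # scaled dot product, invariant
    return softmax(z, dst, num_nodes=num_nodes)   # normalize over each node's neighbors
```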
Appendix D Details of Experiments on QM9
D.1 Additional Comparison between SE(3) and E(3) Equivariance
We train two versions of Equiformer, one with SE(3)-equivariant features, denoted as “Equiformer (SE(3))”, and the other with E(3)-equivariant features, denoted as “Equiformer (E(3))”, and we compare them in Table 8. Including equivariance to inversion further improves the performance on the QM9 dataset.
As for Table 1, we compare “Equiformer (SE(3))” with other works, since most of them do not include equivariance to inversion.
Table 8. Comparison between SE(3)- and E(3)-equivariant features on QM9 (MAE).

Task  α  Δε  ε_HOMO  ε_LUMO  μ  C_ν
Units  bohr³  meV  meV  meV  D  cal/mol·K
Equiformer (SE(3))  .056  33  17  16  .014  .025
Equiformer (E(3))  .054  32  16  16  .013  .024
D.2 Training Details
We normalize the ground truth by subtracting the mean and dividing by the standard deviation. For the tasks of U₀, U, H, and G, where single-atom reference values are available, we subtract those reference values from the ground truth before normalizing.
We train Equiformer with 6 blocks, following SEGNN segnn. We choose the Gaussian radial basis schnet; spinconv; gemnet; graphormer_3d for the first six tasks in Table 1 and the radial Bessel basis dimenet; dimenet_pp for the others. Table 9 summarizes the hyperparameters for the QM9 dataset. Further details will be provided in the future. A detailed description of the architectural hyperparameters can be found in Sec. C.2.
We use one A6000 GPU with 48GB memory to train each model and summarize the computational cost of training for one epoch as follows. Training Equiformer (E(3)) for one epoch takes about … minutes. The time of training Equiformer (SE(3)), Equiformer with linear messages (indicated by index 2 in Table 6), and Equiformer with linear messages and dot product attention (indicated by index 3 in Table 6) for one epoch is … minutes, … minutes, and … minutes, respectively.

Table 9. Hyperparameters for the QM9 dataset.

Hyperparameters  Value or description

Optimizer  AdamW 
Learning rate scheduling  Cosine learning rate with linear warmup 
Warmup epochs  
Maximum learning rate  
Batch size  
Number of epochs  
Weight decay  
Cutoff radius (Å)  
Number of radial bases  … for the Gaussian radial basis, … for the radial Bessel basis
Hidden sizes of radial functions  
Number of hidden layers in radial functions  
Equiformer  
Number of Transformer blocks  
Embedding dimension  
Spherical harmonics embedding dimension  
Number of attention heads  
Attention head dimension  
Hidden dimension in feed forward networks  
Output feature dimension  
Equiformer  
Number of Transformer blocks  
Embedding dimension  
Spherical harmonics embedding dimension  
Number of attention heads  
Attention head dimension  
Hidden dimension in feed forward networks  
Output feature dimension 
Appendix E Details of Experiments on OC20
E.1 Additional Comparison between SE(3) and E(3) Equivariance
We train two versions of Equiformer, one with SE(3)-equivariant features, denoted as “Equiformer (SE(3))”, and the other with E(3)-equivariant features, denoted as “Equiformer (E(3))”, and we compare them in Table 10. Including inversion improves the MAE results on the ID and OOD Cat subsplits but degrades the performance on the other subsplits. Overall, using E(3)-equivariant features results in slightly inferior performance. We surmise the reasons are as follows. First, inversion might not be the key bottleneck. Second, including inversion breaks type-l vectors into two parts, type-(l, e) and type-(l, o) vectors. They are regarded as different types in equivariant linear layers and layer normalizations, and therefore the directional information captured in these two types of vectors can only be exchanged in depthwise tensor products. Third, we mainly tune hyperparameters for Equiformer with SE(3)-equivariant features, and it is possible that using E(3)-equivariant features would favor different hyperparameters.
For Tables 2, 3, 4, and 5, we compare “Equiformer (SE(3))” with other works, since most of them do not include equivariance to inversion.
Table 10. Comparison between SE(3)- and E(3)-equivariant features on the OC20 IS2RE validation set. The first five result columns report energy MAE (eV, lower is better); the last five report EwT (%, higher is better).

Methods  ID  OOD Ads  OOD Cat  OOD Both  Average  ID  OOD Ads  OOD Cat  OOD Both  Average
Equiformer (SE(3))  0.5088  0.6271  0.5051  0.5545  0.5489  4.88  2.93  4.92  2.98  3.93
Equiformer (E(3))  0.5035  0.6385  0.5034  0.5658  0.5528  5.10  2.98  5.10  3.02  4.05
E.2 Training Details
IS2RE without Node-Level Auxiliary Task.
IS2RE with IS2RS Node-Level Auxiliary Task.
We increase the number of Transformer blocks to …, as deeper networks can benefit more from the IS2RS node-level auxiliary task noisy_nodes. We follow the same hyperparameters as in Table 11, except that we increase the maximum learning rate to … and set … to …. Inspired by Graphormer graphormer_3d, we add an extra equivariant graph attention module after the last layer normalization to predict relaxed structures and use a linearly decayed weight for the loss associated with IS2RS, which starts at … and decays to …. For Noisy Nodes noisy_nodes data augmentation, we first interpolate between the initial structure and the relaxed structure and then add Gaussian noise, as described by Noisy Nodes noisy_nodes. When Noisy Nodes data augmentation is used, we increase the number of epochs to …. Further details will be provided in the future.
We use two A6000 GPUs, each with 48GB memory, to train models when IS2RS is not included during training. Training Equiformer (SE(3)) and Equiformer (E(3)) takes about … and … hours, respectively. Training Equiformer with linear messages (indicated by index 2 in Table 7) and Equiformer with linear messages and dot product attention (indicated by index 3 in Table 7) takes … hours and … hours, respectively. We use four A6000 GPUs to train Equiformer models when the IS2RS node-level auxiliary task is adopted during training. Training Equiformer without Noisy Nodes noisy_nodes data augmentation takes about … days, and training with Noisy Nodes takes … days. We note that the proposed Equiformer in Table 5 achieves competitive results even with much less computation. Specifically, training “Equiformer + Noisy Nodes” takes about … GPU-days on A6000 GPUs. The training time of “GNS + Noisy Nodes” noisy_nodes is … TPU-days. “Graphormer” graphormer_3d uses an ensemble of models and requires … GPU-days to train all models on A100 GPUs.
Table 11. Hyperparameters for the OC20 dataset.

Hyperparameters  Value or description

Optimizer  AdamW 
Learning rate scheduling  Cosine learning rate with linear warmup 
Warmup epochs  
Maximum learning rate  
Batch size  
Number of epochs  
Weight decay  
Cutoff radius (Å)  
Number of radial basis  
Hidden size of radial function  
Number of hidden layers in radial function  
Equiformer  
Number of Transformer blocks  
Embedding dimension  
Spherical harmonics embedding dimension  
Number of attention heads  
Attention head dimension  
Hidden dimension in feed forward networks  
Output feature dimension  
Equiformer  
Number of Transformer blocks  
Embedding dimension  
Spherical harmonics embedding dimension  
Number of attention heads  
Attention head dimension  
Hidden dimension in feed forward networks  
Output feature dimension 
E.3 Error Distributions
We plot the error distributions of different Equiformer models on different subsplits of the OC20 IS2RE validation set in Fig. 5. For each curve, we sort the absolute errors in ascending order for better visualization and make a few observations. First, for each subsplit, there are always easy examples, for which all models achieve significantly low errors, and hard examples, for which all models have high errors. Second, the performance gains brought by different models are non-uniform across subsplits. For example, using MLP attention and non-linear messages improves the errors on the ID subsplit but is not as helpful on the OOD Ads subsplit. Third, when the IS2RS node-level auxiliary task is not included during training, using stronger models mainly improves errors that are beyond the threshold of 0.02 eV, which is used to calculate the metric of energy within threshold (EwT). For instance, on the OOD Both subsplit, using non-linear messages, which corresponds to the red and purple curves, improves the absolute errors for the …th through …th examples. However, the improvement in MAE does not translate into improvement in EwT, as the errors are still higher than the threshold of 0.02 eV. This explains why using non-linear messages in Table 7 improves MAE from … to … but results in almost the same EwT.