## 1 Introduction

Predicting molecular properties is of central importance to applications such as drug discovery or protein design. In silico (computational) methods with fast and precise predictions can significantly accelerate the overall process of finding better molecular candidates. Learning on 3D environments of molecular structures is a rapidly growing area of machine learning with promising applications but also domain-specific challenges. While Deep Learning (DL) has largely replaced hand-crafted features, many advances are crucially determined by the inductive biases of deep neural networks. Neural models should maintain an efficient and accurate representation of structures with up to thousands of atoms and correctly reason about their 3D geometry independent of orientation and position. A powerful method to restrict a neural network to the functions of interest, such as a molecular property, is to exploit the symmetry of the data by enforcing equivariance with respect to transformations from a certain symmetry group (battaglia2018relational; bronstein2021geometric).

Graph Neural Networks (GNNs) have been applied to a wide range of molecular structures, such as the prediction of quantum chemistry properties of small molecules (schuett2017schnet; gilmer2017neural) but also macromolecular structures like proteins (fout; ingraham), owing to the natural representation of structures as graphs, with atoms as nodes and edges drawn based on bonding or spatial proximity. These networks generally encode the 3D geometry in terms of rotationally invariant representations, such as pairwise distances when modelling local interactions, which leads to a loss of directional information; the addition of angular information to the network architecture has been shown to be beneficial in representing the local geometry (klicpera2020directional).

Neural models that preserve equivariance when working on point clouds in 3D space have been proposed (thomas2018tensor; anderson2019cormorant; fuchs2020se3transformers; batzner2021e3equivariant) and can be described as Tensorfield Networks. These physics-inspired models leverage higher-order tensor representations and require additional calculations to construct the basis for the transformations of their learnable kernels, which can be expensive to compute.

While these models enable the interaction between representations of different order (often referred to as type-$l$ representations), many data types are restricted to scalar values (type-0, e.g., temperature or energy) and 3D vectors (type-1, e.g., velocity or forces). Another way to use more information from the (limited) data at hand and build data-efficient models on point clouds^1 through equivariant functions is to operate directly on Cartesian coordinates (satorras2021en; schutt2021equivariant; jing2021learning; jing2021equivariant) and explicitly define the (equivariant) transformations, which is conceptually simpler and does not require the basis calculations of Tensorfield Networks.

^1 An example of preserving more information is to consider relative positions between points in 3D space, which maintain orientation information, as opposed to considering only the (scalar, invariant) distances between points.

In this work, we introduce Equivariant Graph Attention Networks (EQGAT) operating on point clouds, which scale to larger systems with hundreds of atoms, such as proteins or protein-ligand complexes, while also achieving state-of-the-art results on the prediction of quantum mechanical properties of small molecules. Our model architecture implements a novel attention mechanism which is invariant to global rotations and translations of the inputs and includes spatial as well as content-related information, serving as a powerful edge embedding when propagating information in the Message Passing Neural Network (MPNN) framework (gilmer2017neural).

## 2 Methods and Related Work

#### Preliminaries.

In this work we consider vector spaces for representing a point-cloud graph in a feature space. The point cloud has vertex set $V = \{1, \dots, N\}$ and positional set $X = \{x_1, \dots, x_N\} \subset \mathbb{R}^3$, where $x_i$ denotes the Cartesian coordinate of node $i$.
Let $\mathbf{X} \in \mathbb{R}^{N \times 3}$ denote the coordinate matrix of the point cloud with an arbitrary ordering along the first axis. A graph adjacency matrix $A \in \{0,1\}^{N \times N}$ can be constructed by defining a distance cutoff $c$ and setting $A_{ij} = 1$ if $d_{ij} \leq c$ and $A_{ij} = 0$ otherwise.

The (symmetric) Euclidean distance matrix $D \in \mathbb{R}^{N \times N}$ can be obtained using the canonical inner product:

$$d_{ij} = \|x_i - x_j\|_2 = \sqrt{\langle x_i - x_j,\ x_i - x_j \rangle} \qquad (1)$$
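As a concrete illustration, the distance matrix of Eq. (1) and the cutoff adjacency can be sketched in a few lines of NumPy (variable names are ours, not the paper's):

```python
import numpy as np

def distance_and_adjacency(X: np.ndarray, cutoff: float):
    """Pairwise Euclidean distance matrix (Eq. (1)) and cutoff-based adjacency."""
    diff = X[:, None, :] - X[None, :, :]              # (N, N, 3) relative positions
    D = np.linalg.norm(diff, axis=-1)                 # (N, N) symmetric distance matrix
    A = (D <= cutoff) & ~np.eye(len(X), dtype=bool)   # cutoff graph, no self-loops
    return D, A.astype(int)

# Three collinear points; only the first two lie within the cutoff.
X = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [5.0, 0.0, 0.0]])
D, A = distance_and_adjacency(X, cutoff=2.0)
```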

We aim to develop a Graph Neural Network (GNN) model that transforms feature embeddings invariantly or equivariantly with respect to arbitrary rotations in 3D space. The advantage of including equivariant features in the model as a structural prior lies in the fact that some task-related predicted quantities, such as forces in Molecular Dynamics or the dipole moment in the multipole expansion of the electron density (schutt2021equivariant), are geometric quantities and transform as order-1 tensors when a rotation is applied to the analyzed rigid body. Even though some properties of interest might in fact not be geometric tensors, i.e., are scalar representations, preserving the information of the geometry within the network architecture has been shown to be beneficial for their prediction (miller2020relevance; schutt2021equivariant).

Formally, we aim to model scalar-valued features and vector-valued features separately in our architecture, under the constraint that scalar (type-0) features transform invariantly under rotations and vector (type-1) features transform equivariantly under rotations.

### 2.1 Invariance and Equivariance

Let $V = \mathbb{R}^{F} \times \mathbb{R}^{3 \times F'}$ be the considered vector space. Our feature representation can be expressed as a tuple of scalar and vector components $(s, v) \in V$, where $s \in \mathbb{R}^{F}$ is the invariant and $v \in \mathbb{R}^{3 \times F'}$ is the equivariant representation that transforms accordingly when a group action is applied. The symmetry considered in this paper is the group of orthogonal $3 \times 3$ matrices with determinant $1$, i.e.,

$$\mathrm{SO}(3) = \left\{ R \in \mathbb{R}^{3 \times 3} : R^\top R = I,\ \det(R) = 1 \right\} \qquad (2)$$

Let $\rho$ be a set of transformations on $V$ for the group $\mathrm{SO}(3)$. The action of $R \in \mathrm{SO}(3)$ on the vector space $V$ is defined as:

$$\rho(R)(s, v) = (s,\ Rv) \qquad (3)$$

The group action in Eq. (3) reveals that the scalar component $s$ is not affected by the rotation, as it is not a geometric quantity and is therefore invariant^2. The vector feature $v$, however, transforms as an order-1 tensor and is therefore rotated.

^2 To be precise, the group action applied to the scalar embedding is the trivial representation, i.e., the identity map given by the unit matrix of the scalar embedding's dimension.
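This behaviour can be checked numerically. The following sketch (our own, not from the paper) draws a random rotation and applies the action of Eq. (3) to a scalar-vector feature tuple:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random rotation matrix via QR decomposition, fixing det(R) = +1.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q * np.sign(np.linalg.det(Q))

s = rng.normal(size=8)        # type-0 (scalar) features, 8 channels
v = rng.normal(size=(3, 8))   # type-1 (vector) features, one 3-vector per channel

# Group action rho(R)(s, v) = (s, Rv): scalars untouched, vectors rotated.
s_rot, v_rot = s, R @ v
```

Note that the rotation changes the vector components but preserves their norms, which is exactly why norms are invariant features.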

We say a function $f: V \to V'$ is equivariant to $\mathrm{SO}(3)$ if there exists an equivalent transformation $\rho'$ on its output space $V'$ such that the following holds:

$$f\big(\rho(R)\,x\big) = \rho'(R)\,f(x) \quad \forall\, R \in \mathrm{SO}(3),\ x \in V \qquad (4)$$

We note that a composition of equivariant functions satisfying Eq. (4) is again equivariant. In our work, $f$ is a message passing (MP) layer (gilmer2017neural) that updates the node feature representations $h_i^t = (s_i^t, v_i^t)$ of the point cloud in an iterative manner such that

$$h_i^{t+1} = f\big(h_i^{t},\ \{h_j^{t}\}_{j \in \mathcal{N}(i)}\big) \qquad (5)$$

Combining the group action in Eq. (3) with the equivariance property of the function in Eq. (4) results in

$$f\big(\rho(R)\,h\big) = \rho(R)\,f(h) \qquad (6)$$

which shows that the group action commutes with the function $f$.
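As a toy illustration of this commutation property (our own example, not the paper's layer), consider a function built from vector norms and channel-wise scalar-vector scaling; rotating before or after applying the function gives the same result:

```python
import numpy as np

def f(s, v):
    """A simple equivariant map: invariant scalar update, equivariant vector update."""
    s_out = s + np.linalg.norm(v, axis=0)  # norms are rotation-invariant
    v_out = v * s                          # channel-wise scaling keeps vectors type-1
    return s_out, v_out

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q * np.sign(np.linalg.det(Q))

s = rng.normal(size=4)
v = rng.normal(size=(3, 4))

s1, v1 = f(s, R @ v)   # rotate first, then apply f
s2, v2 = f(s, v)       # apply f first, then rotate the vector output
```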

#### Tensor / Outer Product.

The tensor product^3 between two vectors $u \in \mathbb{R}^m$ and $w \in \mathbb{R}^n$ is computed as

$$u \otimes w = u w^\top, \qquad (u \otimes w)_{ij} = u_i w_j \qquad (7)$$

and returns an $m \times n$ matrix given the two vectors. The tensor/outer product will be a useful operation to construct equivariant features by combining type-0 and type-1 representations.

^3 Often also referred to as outer or Kronecker product.
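A small numerical sketch (ours) shows why the outer product is useful here: combining a type-1 vector with invariant type-0 channels yields a feature block whose columns all rotate consistently with the input vector:

```python
import numpy as np

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q * np.sign(np.linalg.det(Q))

u = rng.normal(size=3)   # type-1: a direction in 3D space
s = rng.normal(size=5)   # type-0: invariant scalar channels

V = np.outer(u, s)           # Eq. (7): (3, 5) equivariant feature block
V_rot = np.outer(R @ u, s)   # same outer product after rotating the type-1 input
```

Since `outer(R u, s) = R (u s^T)`, rotating the input is equivalent to rotating every column of the output, i.e., the block transforms as a stack of type-1 features.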

### 2.2 Message Passing Neural Networks (MPNNs)

MPNNs (gilmer2017neural) construct complex representations of vertices within their local neighbourhood through an iterative exchange of messages followed by updates of vertex features. Since MPNNs utilize trainable layers shared among nodes, permutation equivariance is preserved. As mentioned in Section 2, edges between vertices of the point cloud are specified by the relative position of vertices within a local neighbourhood through a distance cutoff $c$, i.e., $\mathcal{N}(i) = \{\, j \in V : d_{ij} \leq c \,\}$.
The (Euclidean) distance function in Eq. (1) is an invariant function, as the canonical inner product between two vectors does not change after the group action, i.e., $\langle Rx, Ry \rangle = x^\top R^\top R\, y = \langle x, y \rangle$. Hence the inner product, or a composition of it such as the norm, is a natural way to obtain invariant features from equivariant features. The usage of the tensor/outer product and the inner product to obtain equivariant and invariant features, respectively, will be explained in the next section.

Standard MPNNs implement a message and an update function for the feature representations as

$$m_i^{t+1} = \sum_{j \in \mathcal{N}(i)} M_t\big(h_i^t,\ h_j^t,\ e_{ij}\big) \qquad (8)$$

$$h_i^{t+1} = U_t\big(h_i^t,\ m_i^{t+1}\big) \qquad (9)$$

In our paper, we aim to implement the message ($M_t$) and update ($U_t$) functions to be equivariant, i.e., to satisfy the property in Eq. (4) of Section 2.1.

### 2.3 Related Work

Neural networks that specifically achieve E(3) or SE(3) equivariance have been proposed in Tensorfield Networks (TFN) (thomas2018tensor) and its variants, the covariant Cormorant (anderson2019cormorant), NequIP (batzner2021e3equivariant), and the SE(3)-Transformer (fuchs2020se3transformers). With TFNs, equivariance is achieved through the usage of equivariant function spaces such as Spherical Harmonics combined with Clebsch-Gordan coefficients, while others resort to lifting the spatial space to higher-dimensional spaces such as Lie group spaces (finzi2020generalizing). Since no restriction on the order of tensors is imposed in these methods, sufficient expressive power of the models is guaranteed, but at the cost of excessive computation with increased time and memory. To circumvent the expensive computational cost, another line of research proposes to directly implement equivariant operations in the original (Cartesian) space, providing an efficient approach to preserve equivariance without performing complex space transformations, as introduced in the E(3)-GNN (satorras2021en), GVP (jing2021learning; jing2021equivariant), and PaiNN (schutt2021equivariant) architectures.

Our proposed model implements equivariant operations in the original Cartesian space and includes a continuous filter through the self-attention coefficients, which serve as spatial- and content-based edge embeddings in the message propagation, as opposed to the PaiNN model, where the filter solely depends on the distance. The E(3)-GNN architecture does not learn higher-dimensional type-1 vector features, but only updates given type-1 features through a weighted linear combination, where the (learnable) scalar weights are obtained from invariant embeddings.
The GVP model, which was initially designed to work on macromolecular structures, includes a complex message function of concatenated node and edge features composed with a series of GVP blocks that enables information exchange between type-0 and type-1 features, with the potential disadvantage of discontinuities through non-smooth components for distances close to the cutoff.

A concurrent work by (thoelke2022equivariant) proposed the Equivariant Transformer (ET), which is similar to ours and obtains strong results on the QM9 (qm9), MD17 (md17), and ANI-1 (ani) datasets. Within their model, to initialize node embeddings, they implement an additional neighbourhood embedding for each target node, which can be thought of as a continuous-filter convolution as proposed in the SchNet architecture (schuett2017schnet). This neighbourhood embedding is then combined with a self-embedding, generally resembling a graph convolutional layer, which our model does not incorporate. Our model differs through a distinct message function, where we implement a feature-attention embedding that aims to filter scalar channels from neighbouring nodes. Furthermore, our message function incorporates an intermediate embedding that is content- as well as spatially dependent to modulate the vector channels, while their filter solely depends on spatial information via interatomic distances. As we benchmark our model on large molecular structures, we assign different channel sizes for scalar and vector features, respectively, to be able to train deep models on learning problems that include proteins.

## 3 Equivariant Graph Attention Networks

The traditional Transformer model as introduced by Vaswani et al. (vaswani2017attention) has revolutionized the field of Natural Language Processing (NLP) (devlin2019bert; dai2019transformerxl) and image analysis (ramachandran2019standalone; zhao2020exploring; dosovitskiy2021image). The self-attention module at the core of Transformers, initially designed to operate on sequences of tokens in NLP, consists in essence of two components: input-dependent attention weights between any two elements of the set, and an embedding for each set element, called the value. Since graphs are naturally represented as sets with underlying structure through the (sparse) connectivity between vertices, the implementation of self-attention following the MPNN paradigm was introduced in Graph Attention Networks (GATs) (velickovic2018graph). Transformer-like GNNs (that work on a fully connected 2D graph) were recently introduced in (kreuzer2021rethinking; ying2021transformers) and have found success in several graph-learning benchmarks due to their incorporation of (learnable) relative positional encodings into the attention function.

Our proposed Equivariant Graph Attention Network (EQGAT) operates in 3D space and implements the message passing for each target node on its local neighbourhood as defined in Section 2, to avoid the quadratic complexity of vanilla self-attention, where one target node would interact with all other nodes in the point cloud. We emphasize that the restriction to local neighbourhoods manifests as a powerful inductive bias and, in a biochemistry context, coincides with the assumption that a large part of energy variations can be attributed to local interactions, although the influence and importance of non-local effects in machine-learned force fields has recently been analyzed in (spookynet).

In the following Subsections, we will introduce the function components of our EQGAT network displayed in Figure 1a and highlight its invariance and equivariance properties.

### 3.1 Input Embedding

We initially embed atoms of small molecules or proteins based on their nuclear charge $Z_i$. As nuclear charges are discrete and bounded, we utilize a trainable embedding look-up table, commonly used in NLP, to map the $i$th atom's charge to a feature representation

$$s_i^0 = \mathrm{Embedding}(Z_i) \in \mathbb{R}^{F} \qquad (10)$$

which provides a starting (invariant) scalar representation of the node prior to the message passing.

As in most cases no directional information for atoms is available, we initialize the vector features as a zero tensor $v_i^0 = \mathbf{0} \in \mathbb{R}^{3 \times F'}$.

### 3.2 Distance Encoding

Interatomic distances $d_{ij}$ are embedded using the Bessel radial basis functions (RBF) as introduced by klicpera2020directional into a representation that serves as distance encoding for the attention mechanism. The Bessel RBF reads

$$\mathrm{RBF}_n(d_{ij}) = \sqrt{\tfrac{2}{c}}\, \frac{\sin(n \pi d_{ij} / c)}{d_{ij}}, \quad n = 1, \dots, N_{\mathrm{RBF}},$$

where $c$ is the distance cutoff. In similar fashion to the continuous-filter convolution in the SchNet architecture (schuett2017schnet), the deterministic radial basis encoding is further transformed using a trainable linear layer $W_e$ to obtain an edge embedding.

As the final edge embedding ought to transition smoothly to $0$ to avoid discontinuities when $d_{ij} \to c$, we apply a cosine cutoff function as introduced in (behler), which leads to the final edge embedding:

$$e_{ij} = \tfrac{1}{2}\left(\cos\!\left(\tfrac{\pi d_{ij}}{c}\right) + 1\right) \cdot W_e\, \mathrm{RBF}(d_{ij}) \qquad (11)$$

The edge embedding in Equation (11) can be interpreted as a convolutional filter that solely depends on an invariant representation, i.e., the distance between nodes $i$ and $j$.
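A minimal sketch of this encoding follows, with the trainable linear layer replaced by a fixed random matrix (an illustrative assumption, not the trained model; names are ours):

```python
import numpy as np

def bessel_rbf(d, cutoff, n_rbf=20):
    """Bessel radial basis expansion of a scalar distance d."""
    n = np.arange(1, n_rbf + 1)
    return np.sqrt(2.0 / cutoff) * np.sin(n * np.pi * d / cutoff) / d

def cosine_cutoff(d, cutoff):
    """Smoothly decays from 1 at d = 0 to 0 at d = cutoff."""
    return np.where(d <= cutoff, 0.5 * (np.cos(np.pi * d / cutoff) + 1.0), 0.0)

rng = np.random.default_rng(3)
W_e = rng.normal(size=(64, 20))   # stand-in for the trainable linear layer

d_ij, cutoff = 3.7, 5.0
e_ij = cosine_cutoff(d_ij, cutoff) * (W_e @ bessel_rbf(d_ij, cutoff))  # Eq. (11)
```

The cutoff factor guarantees that the edge embedding vanishes continuously as an edge leaves the neighbourhood, avoiding jumps in the learned filter.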

### 3.3 Feature Attention

Our attention mechanism depends on invariant scalar representations and includes the distance encoding of Eq. (11) to incorporate spatial connectivity. A novelty of our approach is that the attention coefficient between two vertices $i$ and $j$ is obtained per feature channel instead of for the entire embedding. Formally, given the invariant scalar embeddings $s_i, s_j \in \mathbb{R}^{F}$, we use two linear layers to obtain query and key embeddings as

$$q_i = W_Q\, s_i, \qquad k_j = W_K\, s_j \qquad (12)$$

We proceed to compute a content- and spatially-dependent anisotropic embedding through the elementwise product as

$$o_{ij} = q_i \odot k_j \odot e_{ij} \qquad (13)$$

The embedding vector $o_{ij}$ contains both semantic information, through the query and key representations, and spatial information, via the relative positional encoding expressed through the edge embedding.

Next, we utilize another weight matrix $W_a$ to obtain an embedding vector that provides the feature-attention and the filter components for the equivariant features,

$$a_{ij} = W_a\, o_{ij} \qquad (14)$$

The feature attention is calculated using the tensor $a_{ij}$ as follows:

$$\alpha_{ij} = \sigma(a_{ij}) \qquad (15)$$

where $\sigma(x) = 1 / (1 + e^{-x})$ is the sigmoid activation function, applied componentwise for each channel.
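The query-key-edge construction of Eqs. (12) to (15) can be sketched as follows (random matrices stand in for the trainable layers; all names are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(4)
F = 32
W_Q, W_K, W_a = (rng.normal(size=(F, F)) / np.sqrt(F) for _ in range(3))

s_i, s_j = rng.normal(size=F), rng.normal(size=F)   # invariant node embeddings
e_ij = rng.normal(size=F)                           # edge embedding from Eq. (11)

o_ij = (W_Q @ s_i) * (W_K @ s_j) * e_ij  # Eq. (13): elementwise query-key-edge product
alpha_ij = sigmoid(W_a @ o_ij)           # Eq. (15): one attention weight per channel
```

In contrast to softmax attention over a neighbourhood, each channel here receives its own gate in the unit interval, so individual feature channels of a neighbour can be passed or suppressed independently.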

### 3.4 Value Transforms

Another constituent of the self-attention mechanism is the pointwise transform of neighbouring (source) nodes, which is multiplied with the output of the feature-attention mechanism described in Eq. (14). Value tensors for scalar and vector features are computed using weight matrices $W_{vs}$ and $W_{vv}$:

$$v_{s,j} = W_{vs}\, s_j, \qquad v_{v,j} = v_j\, W_{vv}^\top \qquad (16)$$

where the product on $v_j \in \mathbb{R}^{3 \times F'}$ is applied along the feature axis. By applying the linear transformation in Eq. (16) along the feature axis only, the equivariance property in Eq. (3) is maintained. The rationale behind the additional output channels for the scalar embedding is associated with the construction of equivariant features through the tensor product with invariant features, as will be explained later.

We split the value tensor of the scalar part into three tensors $v_{s,j} = \big(v_{s,j}^{(1)},\ v_{s,j}^{(2)},\ v_{s,j}^{(3)}\big)$.

#### Self-Attention as Continuous Filter Convolution.

The invariant feature-attention coefficients in Eq. (15) depend on content-related as well as spatially-based information and act as a filter when elementwise multiplied with the pointwise transforms of the scalar value tensors to obtain the scalar message embedding:

$$m_{ij}^{(1)} = \alpha_{ij} \odot v_{s,j}^{(1)} \qquad (17)$$

Furthermore, we compute two additional elementwise products to obtain filtered message tensors:

$$m_{ij}^{(2)} = \alpha_{ij} \odot v_{s,j}^{(2)}, \qquad m_{ij}^{(3)} = \alpha_{ij} \odot v_{s,j}^{(3)} \qquad (18)$$

We highlight that the message embeddings in Equations (17) and (18) are calculated using invariant representations through the initial query-key computation in Eq. (12), and only the channels of the $\alpha$-coefficients are bounded in the unit interval to obtain "modulated" scalar representations when multiplied with the value tensor.

Our model differs from the SE(3)-Transformer proposed by fuchs2020se3transformers as we do not rely on Spherical Harmonics calculations and Clebsch-Gordan decompositions to build equivariant functions; instead, we explicitly design functions that are equivariant and operate on 3D Cartesian coordinates for faster and more memory-efficient calculations, as described in the next paragraph. Additionally, our proposed model aims to decouple the information flow between scalar and vector representations within the attention mechanism by using only scalar/invariant representations in Equation (14) to modulate scalar and vector embeddings, while the (later applied) update function enables an information flow from vector representations to scalar embeddings.

An important fact is that the attention coefficients which serve as a filter are required to be SO(3) invariant. This requirement originates from the fact that the attention coefficient is multiplied with a type-1 feature^4, which itself transforms when a group action (in our case a rotation) is applied.
Should one want to include type-1 features in the attention calculation, as proposed by (fuchs2020se3transformers) through additional type-1 query and key representations, a contraction along the spatial axis, such as an inner product between those pairs, is required to obtain SO(3) invariance. We tried such a construction in our initial development of the attention function, but found worse validation performance and additional computational complexity.

^4 When considering the vector features.

#### Building Equivariant Features.

In case no initial vector features such as velocities or forces are available, equivariant representations in the point cloud can be constructed as a function of relative positions $x_{ij} = x_j - x_i$. Relative position vectors in 3D space can have unbounded norms, so a common practice is to normalize them to unit norm, $\hat{x}_{ij} = x_{ij} / \lVert x_{ij} \rVert$, which describes points on the 2-dimensional unit sphere $S^2$.

Equivariant interactions between nodes $i$ and $j$ are modelled through the tensor product (cf. Eq. (7)) of the equivariant representation $\hat{x}_{ij}$ with an invariant representation.

Formally, equivariant features are obtained as

$$\hat{v}_{ij} = \hat{x}_{ij} \otimes m_{ij}^{(2)} \qquad (19)$$

and (hidden) equivariant type-1 embeddings are filtered using the (elementwise) scalar multiplication

$$\tilde{v}_{ij} = \big(\mathbf{1} \otimes m_{ij}^{(3)}\big) \odot v_{v,j} + w_{ij} \qquad (20)$$

where $\mathbf{1} \in \mathbb{R}^3$ denotes the vector with $1$'s in all components and $w_{ij}$ is obtained through the cross product as

$$w_{ij} = v_{v,i} \times v_{v,j}.$$

The rationale for including the cross product of the two transformed vector features is to enable interaction between type-1 features and is inspired by TFNs (thomas2018tensor), which model the tensor product between two type-1 vectors resulting in a rank-2 tensor. Such a rank-2 tensor includes the antisymmetric elements present in the cross product. We combine the two equivariant representations into a final aggregated equivariant message embedding

$$\tilde{v}_i = \sum_{j \in \mathcal{N}(i)} \hat{v}_{ij} + \tilde{v}_{ij} \qquad (21)$$

The aggregated invariant message embedding is obtained in the same fashion, using the filtered messages (cf. Eq. (17)) between target node $i$ and its neighbours:

$$\tilde{s}_i = \sum_{j \in \mathcal{N}(i)} m_{ij}^{(1)} \qquad (22)$$

We follow our vector-space terminology and write $\tilde{h}_i = (\tilde{s}_i, \tilde{v}_i)$ as the final aggregated message embedding (cf. Eq. (8)), which by design also satisfies the equivariance property of Section 2.1. We prove this claim in Appendix A.

At this point, it is worth mentioning that the tensor product applied in Equations (19) and (20) reduces to a scalar multiplication in case the (invariant) embeddings are one-dimensional, as in the E(3)-GNN architecture (satorras2021en). Furthermore, notice that by this construction, higher-dimensional type-1 embeddings can be constructed in $\mathbb{R}^{3 \times F'}$ instead of $\mathbb{R}^3$.
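The spirit of this construction can be sketched as follows (a simplified version of Eq. (19) that omits the cross-product term and the exact channel bookkeeping; names are ours). Rotating the whole point cloud rotates the resulting type-1 message block column by column:

```python
import numpy as np

rng = np.random.default_rng(5)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q * np.sign(np.linalg.det(Q))

x_i, x_j = rng.normal(size=3), rng.normal(size=3)
m_inv = rng.normal(size=16)   # invariant filtered message channels (cf. Eq. (18))

def equivariant_message(x_i, x_j, m_inv):
    rel = x_j - x_i
    rel_hat = rel / np.linalg.norm(rel)   # unit relative position on S^2
    return np.outer(rel_hat, m_inv)       # (3, 16) type-1 message, in the spirit of Eq. (19)

V = equivariant_message(x_i, x_j, m_inv)
V_rot = equivariant_message(R @ x_i, R @ x_j, m_inv)  # rotate the whole point cloud
```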

### 3.5 Update

We update the hidden embedding of each target node $i$ by adding the aggregated message $\tilde{h}_i = (\tilde{s}_i, \tilde{v}_i)$ to the previous hidden state $h_i = (s_i, v_i)$:

$$h_i \leftarrow h_i + \tilde{h}_i \qquad (23)$$

The state in Equation (23) includes the complex transformations performed on the embedding via the attention mechanism, while no self-interaction transformation on $h_i$ has been applied. We propose to first update the $i$th hidden state in a residual-connection manner and then apply a pointwise update layer, similar to the PaiNN architecture (schutt2021equivariant), with the use of gated equivariant non-linearities (weiler2018learning) to combine invariant and equivariant information in the representation and update its state (cf. Eq. (9)), as illustrated in Figure 1b.

## 4 Experiments and Results

Tasks | α [bohr³] | Δε [meV] | ε_HOMO [meV] | ε_LUMO [meV] | μ [D] | C_ν [cal/mol K]
---|---|---|---|---|---|---
NMP | .092 | 69 | 43 | 38 | .030 | .040
SchNet | .235 | 63 | 41 | 34 | .033 | .033
Cormorant | .085 | 61 | 34 | 38 | .038 | .026
LieConv | .084 | 49 | 30 | 25 | .032 | .038
DimeNet++ | .044 | 33 | 25 | 20 | .030 | .023
SE(3) Tr. | .142 | 53 | 35 | 33 | .051 | .054
E(3)-GNN | .071 | 48 | 29 | 25 | .029 | .031
ET | .010 | 38 | 21 | 18 | .002 | .026
PaiNN | .045 | 46 | 28 | 20 | .012 | .024
EQGAT | .063 | 44 | 26 | 22 | .014 | .027
EQGAT | .006 | 36 | 9 | 17 | .008 | .019

Models displayed with * use different (random) train/validation and test splits. For our EQGAT, we report MAEs averaged over 3 runs.

We test the effectiveness of our proposed EQGAT model on four publicly available molecular benchmark datasets, which pose significant challenges for the development of efficient and accurate prediction models in small-molecule drug discovery as well as protein design on different data scales.

### 4.1 QM9

The QM9 dataset (qm9) is a chemical property regression benchmark and consists of roughly 134k small molecules with up to 29 atoms per molecule. Molecules are represented as point clouds, with each atom having a 3D position and a five-dimensional one-hot encoding that describes the atom type (H, C, N, O, F), plus additional features that can be derived from the 2D topological graph, such as bond types. In our experiments, we only use the atom positions and atom types/charges as input features for our EQGAT model, as described in Section 3.1. To compare our method with the literature, we import the training, validation, and test splits from (anderson2019cormorant), which consist of 100k, 18k, and 13k compounds, respectively. We adopt the hyperparameters from (satorras2021en) and implement a multi-layer EQGAT encoder with separate scalar and vector channels as well as radial basis functions to encode interatomic distances. For the downstream networks, we only utilize scalar embeddings from the last hidden layer. For a detailed description of the entire network architecture and its trainable parameter count, we refer to Section B of the Appendix.

We optimized and report the Mean Absolute Error (MAE) between predictions and ground truths on 6 targets, comparing to NMP (gilmer2017neural), SchNet (schuett2017schnet), Cormorant (anderson2019cormorant), LieConv (finzi2020generalizing), DimeNet++ (klicpera2020directional), SE(3)-Transformer (fuchs2020se3transformers), E(3)-GNN (satorras2021en), ET (thoelke2022equivariant), and PaiNN (schutt2021equivariant).

Tasks | PSR | PSR | RSR | RSR | LBA
---|---|---|---|---|---
Metric | Global | Mean | Global | Mean | RMSE
CNN | 0.431 | 0.789 | 0.264 | 0.372 | 1.415
GNN | 0.515 | 0.755 | 0.234 | 0.512 | 1.570
GVP-GNN | 0.511 | 0.845 | 0.211 | 0.330 | 1.594
E(3)-GNN | 0.466 | 0.789 | - | - | 1.558
PaiNN | 0.485 | 0.808 | - | - | 1.548
EQGAT | 0.576 | 0.849 | 0.322 | 0.365 | 1.489

We report the results for the baseline models from (townshend2021atom3d). For the E(3)-GNN, PaiNN, and our EQGAT model, we report metrics averaged over 3 runs.

Results of our proposed model are reported in Table 1. Our EQGAT architecture obtains very strong performance on the 6 predicted targets compared to recent state-of-the-art models while, similar to the E(3)-GNN, ET, and PaiNN models, implementing equivariant functions in the original Cartesian space without the need for higher-order tensor representations. We believe that implementing output-specific layers for certain targets, such as the magnitude of the dipole moment or the electronic spatial extent (not listed in Table 1), as proposed by (schutt2021equivariant), using the GNN encoder's scalar and vector features in combination with the global geometry via Cartesian coordinates, might lead to better performance on those targets. As we adopt the same overall architecture for all 6 targets, we did not implement output-specific (decoder) layers for each target. In comparison to the concurrent ET architecture (thoelke2022equivariant), our model obtains similar results on the 6 targets while being parameter-lighter.

We refer the reader to Appendix B for the architectural details and the performance of our models on all QM9 targets, including energy predictions.

### 4.2 Atom3d

The ATOM3D benchmark (townshend2021atom3d) is a collection of eight tasks and datasets for learning on atomic-level 3D molecular structures of different kinds, i.e., proteins, RNAs, small molecules, and complexes. Since proteins perform specific biological functions essential for all living organisms and hence play a key role when investigating the most fundamental questions in the life sciences, we focus our experiments on learning problems often encountered in structural biology, with different difficulties due to data scarcity and varying structural sizes.
We use the provided training, validation, and test splits from ATOM3D and refer the interested reader to the original work (townshend2021atom3d) for more details.
For all benchmarks, we compare against the baseline CNN and GNN models provided by the authors of ATOM3D, the GVP-GNN reported in (jing2021equivariant), as well as the PaiNN (schutt2021equivariant) and E(3)-GNN (satorras2021en) architectures using our own implementations.

All of our implemented models utilize a 5-layer GNN encoder with invariant (scalar) hidden channels. Our EQGAT model additionally applies a smaller number of vector channels, while the PaiNN architecture ties the number of vector channels to the number of scalar channels and the E(3)-GNN does not include learnable vector features but only updates the positional coordinates. We refer the reader to Appendix C for a detailed description of the implementation of the networks on these tasks.

The Protein and RNA Structure Ranking tasks (PSR/RSR) in ATOM3D are both regression tasks with the objective of predicting a quality score, in terms of the Global Distance Test (GDT_TS) or the Root-Mean-Square Deviation (RMSD), for generated protein and RNA models w.r.t. their experimentally determined ground-truth structures. Reliably ranking a biopolymer structure requires a model to accurately learn the atomic environments such that discrepancies between a ground-truth state and its corrupted version can be distinguished. We evaluated our model on the biopolymer ranking and obtained the best results on the current benchmark, reported in Table 2 in terms of Spearman rank correlation. Our proposed model performs particularly well on the PSR task, outperforming the GVP-GNN (jing2021equivariant). We noticed that the RSR benchmark was particularly difficult to validate, as only a few dozen experimentally determined RNA structures exist to date, and the structural models generated in the ATOM3D framework are labeled with the RMSD to their native structure, which is known to be sensitive to outlier regions, for example caused by inadequate modelling of loop regions (gl_ds); the GDT_TS metric might be a better-suited target for ranking generated RNA structures, as in the PSR benchmark. In our experiments, we tried to obtain comparable results for the PaiNN and E(3)-GNN architectures on the biopolymer benchmarks but found it difficult to make our implementations train well on these two benchmarks, even after careful comparison with the authors' original source code.

Another challenging and important task for drug discovery projects is estimating the binding strength (affinity) of a candidate drug's atomistic interaction with a target protein. We use the ligand binding affinity (LBA) dataset and find that, among the GNN architectures, our proposed model obtains the best results, while also being computationally cheap and fast to train. The best-performing model in the LBA task is a 3D CNN which works on a joint protein-ligand representation in voxel space and enforces equivariance through data augmentation. The inferior performance of all equivariant GNNs might be caused by the need for larger filters to better capture locality, where 3D CNNs have an advantage when using voxel representations. Furthermore, as all GNN models jointly represent ligand and protein as one graph by connecting vertices through a distance cutoff, we believe that such a union leads to a loss of information for distinguishing ligand atoms from protein atoms. A promising direction to investigate is to incorporate separate ligand and protein GNN encoders and merge the two embeddings prior to the binding-affinity prediction, similar to Graph Matching Networks (li2019graph).

#### Computational Efficiency.
We assess the computational efficiency of the proposed Equivariant Graph Attention Network and compare it against our implemented PaiNN and E(3)-GNN architectures on the PSR, RSR, and LBA benchmarks from ATOM3D. As these datasets consist of graphs with up to hundreds of atoms, computationally and memory-efficient models are preferred such that batches of graphs can be stored in GPU memory and trained quickly.
We measure the inference time of a random batch comprising 10 macromolecular structures on an NVIDIA V100 GPU and observe the fastest execution time for our proposed model on the LBA dataset, ahead of the PaiNN and E(3)-GNN implementations.

Most notably on the PSR and RSR datasets, which consist of large biopolymer structures, the time difference is more pronounced in favour of EQGAT over PaiNN and E(3)-GNN on the PSR dataset, while inference on the RNA benchmark takes the most time for all three architectures.

The inferior execution time of the PaiNN architecture is most likely attributable to the fact that the number of scalar and vector channels is set to the same value. As the number of scalar channels is set to for all ATOM3D models, the PaiNN architecture implements vector features of size , which becomes very memory-intensive for large graphs and requires a higher number of FLOPs in its convolutional layers. Despite the inferior performance of the PaiNN architecture on the ATOM3D benchmark, it is worth highlighting that the PaiNN model was initially designed to predict potential energy surfaces of small molecules with high fidelity, and its architectural components most likely require further adaptation to be usable on larger complexes.

## 5 Conclusion

In this work, we introduce a novel attention-based graph neural network for the prediction of molecular properties of systems of varying size. Our proposed equivariant Graph Attention Network makes use of rotationally equivariant features in its intermediate layers to faithfully represent the geometry of the data, while being computationally efficient, as all equivariant functions are directly implemented in the original Cartesian space. We demonstrate the potential of EQGAT by successfully applying it to a wide range of molecular tasks, such as the prediction of quantum mechanical properties of small molecules, but also to learning problems that involve large systems such as proteins or protein-ligand complexes.

## Code and Data Availability

The code and data will be made available upon official publication.

## References

## Encoder Details and Hyperparameters

All EQGAT models in this paper were trained on a single NVIDIA Tesla V100 GPU.

Parameter | QM9 | LBA | PSR | RSR
---|---|---|---|---
Learning rate (lr.) | | | |
Maximum epochs | | | |
Early stopping patience | | | |
Lr. patience | | | |
Lr. decay factor | | | |
Batch size | | | |
Num. layers | | | |
Num. RBFs | | | |
Cutoff [Å] | | | |
Scalar channels | | | |
Vector channels | | | |
Num. parameters | M | k | k | k

Our model implements a layer normalization tailored to scalar and vector features, as proposed by (jing2021learning), which is applied at the beginning of every EQGAT convolutional layer.
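A minimal numpy sketch of this normalization, assuming the variant of (jing2021learning) in which scalar channels receive a standard LayerNorm and vector channels are jointly rescaled by the root-mean-square of their L2 norms (the exact channel grouping may differ from our implementation):

```python
import numpy as np

def scalar_vector_layernorm(s, V, eps=1e-8):
    """Layer normalization for mixed scalar/vector node features.
    s: (F,) scalar features; V: (F, 3) vector features of one node.
    Scalars are normalized to zero mean and unit variance; vectors are
    only rescaled by a rotation-invariant factor (the RMS of their
    norms), so the operation stays rotation-equivariant: no per-channel
    shift or mixing is applied to the vector components."""
    s_out = (s - s.mean()) / np.sqrt(s.var() + eps)
    norms_sq = np.sum(V * V, axis=-1)        # (F,) squared vector norms
    rms = np.sqrt(norms_sq.mean() + eps)     # rotation-invariant scale
    return s_out, V / rms
```

Because the vector branch divides by a rotation-invariant scalar, rotating the input vectors and then normalizing gives the same result as normalizing first and rotating afterwards.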

## Appendix A Proof Equivariance

We prove the rotation equivariance of the two tensors in Equations (19) and (20) by analyzing the properties of the tensor product . To recap, the computation proceeds as follows

as well as

where is obtained via the cross-product between two vector features.

Since the geometric tensors considered in this work are of type- only, the group action, represented as a rotation matrix , acts only on as well as , where we exclude indices for brevity and readability.
If the point cloud is rotated, as defined in Eq. (3), the (relative) positions as well as the vector features change to

(24) |

while the cross product between two vector features is equivariant to rotation, resulting in the property

The message tensor is an invariant embedding and is not affected by any rotation; its group representation acting on is defined as the identity map.
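The cross-product property used above can be verified numerically; for any proper rotation R in SO(3), (Ra) × (Rb) = R(a × b), so the cross product of two type-1 (vector) features is itself rotation-equivariant. Here `R` is an arbitrary rotation about the z-axis:

```python
import numpy as np

# Numerical check of cross-product equivariance under rotation.
rng = np.random.default_rng(0)
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])  # rotation about z
a, b = rng.normal(size=3), rng.normal(size=3)

# (R a) x (R b) equals R (a x b) for proper rotations (det R = +1).
assert np.allclose(np.cross(R @ a, R @ b), R @ np.cross(a, b))
```

Note that this holds only for proper rotations; under reflections (det R = −1) the cross product picks up a sign, which is why it transforms as a pseudovector.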

With the definition of the tensor product in Eq. (7), we can conclude that

(25) |

which proves rotation equivariance for the first equation.

When a rotation is applied, the second equation changes to the following expression:

(26) | ||||

(27) |

where we use the fact that the product between and is applied elementwise, allowing factorization of the rotation matrix .

As the summation of with is an equivariant function, we conclude that is equivariant.

## Appendix B QM9

Task | α (bohr³) | Δε (meV) | ε_HOMO (meV) | ε_LUMO (meV) | μ (D) | C_v (cal/mol K) | G (meV) | H (meV) | ⟨R²⟩ (bohr²) | U (meV) | U₀ (meV) | ZPVE (meV)
---|---|---|---|---|---|---|---|---|---|---|---|---
NMP | .092 | 69 | 43 | 38 | .030 | .040 | 19 | 17 | .180 | 20 | 20 | 1.50
SchNet | .235 | 63 | 41 | 34 | .033 | .033 | 14 | 14 | .073 | 19 | 14 | 1.70
Cormorant | .085 | 61 | 34 | 38 | .038 | .026 | 20 | 21 | .961 | 21 | 22 | 2.03
LieConv | .084 | 49 | 30 | 25 | .032 | .038 | 22 | 24 | .800 | 19 | 19 | 2.28
DimeNet++ | .044 | 33 | 25 | 20 | .030 | .023 | 8 | 7 | .331 | 6 | 6 | 1.21
SE(3) Tr. | .142 | 53 | 35 | 33 | .051 | .054 | - | - | - | - | - | -
E(n)-GNN | .071 | 48 | 29 | 25 | .029 | .031 | 12 | 12 | .106 | 12 | 11 | 1.55
ET | .010 | 38 | 21 | 18 | .002 | .026 | 7.64 | 6.48 | .015 | 6.30 | 6.24 | 2.12
PaiNN | .045 | 46 | 28 | 20 | .012 | .024 | 7.35 | 5.98 | .066 | 5.83 | 5.85 | 1.28
EQGAT | .063 | 44 | 26 | 22 | .014 | .027 | 12 | 13 | .257 | 13 | 13 | 1.50
EQGAT | .006 | 36 | 9 | 17 | .008 | .019 | 24 | 3 | .071 | 13 | 2.60 | 3.00

For the QM9 benchmark (qm9), we used the pre-processed dataset partitions from (anderson2019cormorant; satorras2021en). Following (satorras2021en), we normalized all properties by subtracting the mean and dividing by the Mean Absolute Deviation. For all targets, we retrieve only the scalar embedding from the last GNN layer and pool all scalar node embeddings through summation to obtain the graph embedding

(28) |

The graph embedding is subsequently used to predict the target property using a 2-layer MLP as follows

.
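The normalization and readout described above can be sketched as follows; the hidden nonlinearity (ReLU) and the weight names `w1`, `b1`, `w2`, `b2` are hypothetical placeholders, as the paper does not specify the MLP internals here:

```python
import numpy as np

def mad_normalize(y):
    """QM9 target normalization used above: subtract the mean and divide
    by the Mean Absolute Deviation, following (satorras2021en)."""
    mean = y.mean()
    mad = np.abs(y - mean).mean()
    return (y - mean) / mad, mean, mad

def readout(node_scalars, w1, b1, w2, b2):
    """Sum-pool the last-layer scalar node embeddings (N, F) into a
    graph embedding, then map it to the target with a 2-layer MLP.
    The weights and the ReLU activation are assumptions."""
    g = node_scalars.sum(axis=0)       # (F,) graph embedding (sum-pooling)
    h = np.maximum(w1 @ g + b1, 0.0)   # hidden layer (assumed ReLU)
    return w2 @ h + b2                 # scalar property prediction
```

Predictions are made in normalized space and mapped back to physical units via `pred * mad + mean`.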

We further note that our proposed EQGAT produces results on par with the best-performing methods on the non-energy targets, but performs inferior to the state of the art on the energy targets (G, H, U, U₀). We hypothesize that these targets could benefit from more involved, problem-tailored output layers and additional optimization strategies, such as those used by the methods we compare against.

As comparing our architecture to recent methods is not straightforward due to the usage of varying data splits, we additionally ran experiments on random splits of the QM9 dataset, created with the same sizes as in the initial setting of (anderson2019cormorant).

## Appendix C Atom3d

We use the provided data splits from the ATOM3D benchmark (townshend2021atom3d) and their open-source Python package to process the training, validation, and test splits. The LBA dataset consists on average of atoms, while the PSR and RSR datasets contain larger graphs with an average size of and , respectively. Statistics are taken from (jing2021equivariant).

For the LBA task, we extract the scalar and vector embeddings from the GNN encoder's last layer and implement a 2-layer MLP node decoder using Gated Equivariant layers, as proposed in (schutt2021equivariant), that outputs scalar and vector features of dimensionality 1, i.e., and , for all nodes. For the final prediction we calculate

(29) |

assuming the center of mass to be .

The decoder network for the PSR and RSR tasks is implemented using a single Gated Equivariant layer on the GNN encoder's last-layer output, returning , which is then mean-pooled; only the output scalar feature is used as input for a 2-layer MLP as follows
.

As described for the E(n)-GNN model (satorras2021en) trained on the QM9 dataset, we implement and train the invariant E(n)-GNN model without the positional coordinate update equation on all three ATOM3D tasks. To enable stable training of the E(n)-GNN model, we further transform the raw distances between nodes and using a reciprocal function, as opposed to the authors' initial implementation, which uses the squared Euclidean distance.
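A sketch of the two distance transforms; the exact reciprocal form `1 / (1 + d)` is an assumption, as the text only states that a reciprocal function replaced the squared Euclidean distance:

```python
def edge_distance_feature(d, mode="reciprocal"):
    """Edge feature from an interatomic distance d >= 0.  The original
    E(n)-GNN implementation uses the squared Euclidean distance, which
    grows without bound for distant node pairs; a reciprocal transform
    stays in (0, 1], which helped stabilize training on large graphs."""
    if mode == "squared":
        return d ** 2               # original E(n)-GNN edge feature
    return 1.0 / (1.0 + d)          # hypothetical reciprocal form (assumption)
```

Bounded edge features keep message magnitudes comparable across small and large neighbourhoods, which matters for the macromolecular ATOM3D graphs.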

We noticed that including layer normalization between consecutive convolutional layers of the E(n)-GNN and PaiNN architectures enabled stable training on the PSR and LBA benchmark datasets. This is mostly related to the fact that macromolecules consist of larger neighbourhoods when a radial cutoff is applied. Nonetheless, we still perform sum-aggregation for the message embedding of target node , such that the magnitude of neighboring embeddings is recognized by the network.
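The effect described here, sum-aggregated messages whose magnitude grows with neighbourhood size under a radial cutoff, can be illustrated with a small numpy sketch (messages are simplified to one per source node rather than one per edge):

```python
import numpy as np

def radius_neighbours_and_sum(xyz, messages, cutoff):
    """For each node i, gather all nodes j != i within `cutoff` and
    sum-aggregate their messages.  Because the aggregation is a sum
    rather than a mean, the aggregate's magnitude scales with the
    neighbourhood size -- large macromolecular neighbourhoods therefore
    benefit from layer normalization between layers.
    xyz: (N, 3) positions; messages: (N, F) per-source-node messages."""
    dist = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)
    adj = (dist < cutoff) & ~np.eye(len(xyz), dtype=bool)  # (N, N) mask
    degree = adj.sum(axis=1)                               # neighbourhood sizes
    aggregated = adj @ messages                            # sum over neighbours
    return degree, aggregated
```

Replacing `adj @ messages` with a mean would hide the neighbourhood size from the network, which is exactly the information the sum-aggregation is meant to preserve.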