1 Introduction
Graph Neural Networks (GNNs) (Gori et al., 2005; Zhou et al., 2020) have emerged as the standard architecture of choice for modeling atomic systems, with a wide range of applications from protein structure prediction to catalyst discovery and drug design (Schütt et al., 2017b; Gilmer et al., 2017; Jørgensen et al., 2018; Zitnick et al., 2020; Schütt et al., 2017a; Xie and Grossman, 2018)
. These models operate on graphstructured inputs, where nodes of the graph represent atoms, and edges represent bonds or atomic neighbors. Despite their widespread success and the availability of large molecular datasets, training massive GNNs (with up to billions of parameters) is an important but underexplored area. Success of similarly large models in computer vision, natural language processing, and speech recognition
(Shoeybi et al., 2020; Huang et al., 2019; Brown et al., 2020; zhai2021scaling) suggests that scaling up GNNs could yield significant performance gains.Most previous approaches to scaling GNNs have focused on scaling small models (with up to a few million parameters) to massive graphs. These methods generally assume that we are working with a large, fixed graph, leading to the development of methods like neighborhood sampling (jangda2021accelerating; zheng2021distdgl) (Jia2020ImprovingTA; Ma2019NeuGraphPD; tripathy2020reducing). These methods do not apply to atomic simulation datasets that contain millions of smaller graphs where it is necessary to consider the entire graph for prediction. Our focus is on the complementary problem of scaling to very large models for a dataset of many moderatelysized graphs ( nodes, edges).
Another limitation of existing methods is that they focus on scaling simple GNN architectures such as Graph Convolutional Networks (GCNs) that only represent lowerorder interactions i.e., representations for nodes and edges, that are then updated by passing messages between neighboring nodes. In practice, the most successful GNNs used for atomic systems also model higherorder interactions between atoms, such as the interactions between triplets or quadruplets of atoms (Klicpera et al., 2020a, 2021; Liu et al., 2022). These interactions are necessary to capture the geometry of the underlying system, critical in making accurate predictions. Scaling such GNN architectures is challenging because even moderatelysized graphs can contain a large number of higherorder interactions. For example, a single graph with nodes could contain several million triplets of atoms. In this paper, we introduce Graph Parallelism, an approach to scale up such GNNs with higherorder interactions to billions of parameters, by splitting up the input graph across multiple GPUs.
We benchmark our approach by scaling up two recent GNN architectures – DimeNet++ (Klicpera et al., 2020a) and GemNetT (Klicpera et al., 2021) – on the Open Catalyst (OC20) dataset (Chanussot* et al., 2021). The OC20 dataset, aimed at discovering new catalyst materials for renewable energy storage, consists of training examples spanning a wide range of adsorbates and catalyst materials. A GNN that can accurately predict peratom forces and system energies on OC20 has the potential to replace the computationally expensive quantum chemistry calculations based on Density Functional Theory (DFT) that are currently the bottleneck in computational catalysis. Our largescale, graphparallelized models lead to relative improvements of 1) for predicting forces on the force MAE metric (S2EF task), and 2) on the AFbT metric for predicting relaxed structures (IS2RS task), establishing new stateoftheart results on this dataset.
2 Graph Parallelism
2.1 Extended Graph Nets
Battaglia et al. (2018) introduced a framework called Graph Network (GN) that provides a general abstraction for many popular Graph Neural Networks (GNNs) operating on edge and node representations of graphs. We build on their work and define the Extended Graph Network (EGN) framework to include GNNs that also operate on higher order terms like triplets or quadruplets of nodes.
In the Graph Network (GN) framework, a graph is defined as a tuple , where represents global attributes about the entire graph; is the set of all nodes, with representing the attributes of node ; and is the set of all edges where represents the attributes for the edge from node to . A GNN then contains a series of GN blocks that iteratively operate on the input graph, updating the various representations.
In our Extended Graph Network (EGN) framework, a graph is defined as a tuple , where and are defined as in the Graph Network, and is the set of higherorder interaction terms involving edges indexed by .
As a concrete example, consider an atomic system represented as a graph in this framework with the nodes representing the atoms and the edges representing atomic neighbors. The node attributes and edge attributes could represent the atom’s atomic numbers and distances between atoms respectively. The higher order interactions could represent triplets of atoms, i.e., pairs of neighboring edges with representing the bond angle, which is the angle between edges that share a common node. Finally, the global attribute can represent the energy of the system. For clarity of exposition, we will limit our discussion to triplets in the rest of the paper, but higher order interactions can be handled in a similar manner. We denote these triplets as .
In the EGN framework, GNNs then contain a series of EGN blocks that iteratively update the graph. Each EGN block consists of several update and aggregation functions that are applied to transform an input graph into an output graph . Each EGN block starts by updating the highest order interactions, which are then aggregated before updating the next highest order interaction. For instance, the triplet representations are first updated (using function) and then aggregated () at each edge. These aggregated representations are then used to update the edge representation (). Next, the edges going into a node are aggregated and used to update the node representations, and so on. This is illustrated in figure 0(a).
Many GNNs can be cast in the EGN framework using appropriate update and aggregation functions.
2.2 Graph Parallelism for Extended Graph Nets
Training large EGNs can be challenging even on moderately sized graphs because of the large memory footprint required in storing and updating the representation for each triplet, edge, and node. In many applications, the number of edges is one or two orders of magnitude larger than the number of nodes, while the number of triplets is one or two orders of magnitude larger than the number of edges. Thus, storing and updating the triplet representations is often the bottleneck in terms of GPU memory and compute. Many recent methods such as (Klicpera et al., 2020a, 2021) overcome this problem by using a very lowdimensional representation for the triplets. However, we found this to significantly reduce model capacity leading to underfitting for some applications with large training datasets. It is necessary to overcome these memory limitations for better generalization.
One way to avoid the memory limits is to distribute the computation across multiple GPUs. For a graph neural network, a natural choice to distribute the computation is by splitting the graph. The update functions in an EGN are easy to apply in parallel since they are applied independently for each triplet, edge, or node. It is substantially more challenging to apply the aggregations in parallel. To simplify parallel aggregation, we restrict ourselves to aggregation functions that are commutative and associative. This is not limiting in practice since most popular GNN architectures use sum or mean aggregations which can be implemented in this manner.
We will now describe the implementation of the distributed EGN block. Suppose we have access to processing units that we wish to split the graph computation over. Each unit is responsible for computing the updates to a subset of triplets, edges and nodes, that we denote by and , respectively. At the beginning of computation, we split the graph so that the processing unit contains its subset of triplets in memory, along with the entire set of edges and nodes . During the forward pass, we first update each set of triplets in parallel to obtain , followed by a local triplet aggregation. Next, an allreduce operation is performed on these locally aggregated triplet representations to obtain globally aggregated triplet representations. These globally aggregated representations are then used to update edges in parallel, and the process is repeated for nodes and ultimately for the global graph level representation. Figure 0(b) shows this.
In this framework, the highest order interaction attributes are never communicated across processors. Therefore, the communication time is bound by the number of lower order node interactions. In our example, the triplet representations are not communicated, while the edge and node representations are communicated once per EGN block, making the total communication cost equal to , where and are the dimensions of node and edge representations. This makes it possible to work with a large number of triplets and large triplet representations. In section 3, we show concrete instantiations of this framework for two contemporary GNNs.
3 Graph Parallelism for Atomic Simulations
In this section, we present two concrete examples of using GNNs for the problem of predicting the energy and the forces for an atomic system, modeled as a graph whose nodes represent the atoms and whose edges represent the atoms’ neighbors. The GNN takes such a graph as input and predicts the energy of the entire system as well as a 3D force vector on each atom. Two paradigms have been proposed in the literature, that we call
energycentric and forcecentricapproaches. These approaches differ in how they estimate forces and whether they are energy conserving, which is an important physical property. We begin by describing the components shared by both approaches, followed by one recent model in each paradigm.
3.1 Inputs and Outputs
The inputs to the network are 3D positions and atomic number for each atom . The outputs are the peratom forces and the energy of the entire structure. The distance between atoms and is denoted as . If edges and exist, then defines a triplet in the graph and we denote the angle between the edges as .
The input graph is constructed with each atom (target) as a node and the edges representing the atom’s neighbors where contains all atoms (source) within a distance
, which is treated as a hyperparameter. Each edge has a corresponding message
that passes information from source atom to target atom .3.2 Estimating Forces and Energy
As previously stated, there are two paradigms for estimating the energy and forces for an atomic system: energycentric and forcecentric. In energycentric models, the model first computes the energy by applying a forward pass of the GNN: , where , and
represent the atomic positions and atomic numbers respectively. The forces are then computed as the negative gradient of the energy with respect to atomic positions by using backpropagation:
.In forcecentric models, the energy and forces are both computed directly during the forward propagation: where represents the matrix of all atomic forces. Forcecentric models tend to be more efficient in terms of computation time and memory usage compared to the energycentric models. However, energycentric models guarantee that the forces are energy conserving which is an important physical property satisfied by atomic systems. In this work, we demonstrate the benefits of scaling up GNNs in both paradigms.
3.3 EnergyCentric model: DimeNet++
DimeNet++ (Klicpera et al., 2020a) is a recently proposed energycentric model for atomic systems. In this model, the edges are represented by a feature representation
, which are iteratively updated using both directional information (via bond angles) as well as the interatomic distances. The edges are initially represented using a radial basis function (RBF) representation
of their lengths . The triplets are represented using a spherical basis function (SBF) expansion of the distances as well as bond angles . In each block, the messages are updated as:(1) 
where the interaction function corresponds to the function, the summation corresponds to the function, and the update function to the
function respectively. The interaction function consists of a Hadamard product of the embeddings, followed by a multilayer perceptron. To speed up these computations, the embeddings are projected down to a smaller dimension before computing interactions, and later projected back up.
The updated messages are then fed as input to an output block that sums them up per atom to obtain a peratom representation: ( function). These are then transformed by another MLP to obtain the node representation ( function).
Thus, DimeNet++ is a case of EGN, and we closely follow the recipe from Sec. 2.2 to parallelize it.
3.4 ForceCentric model: GemNet
GemNet (Klicpera et al., 2021) extends DimeNet++ in a number of ways, including the addition of quadruplets as well as a forcecentric version. Here, we focus only on the forcecentric GemNetT model since it was recently shown to obtain stateoftheart results on the OC20 dataset^{1}^{1}1opencatalystproject.org/leaderboard.html. GemNetT largely follows the structure of the DimeNet++ model, but includes some modifications.
First, GemNetT uses a bilinear layer instead of the Hadamard product for the interaction function. This is made efficient by optimally choosing the order of operations in the bilinear function to minimize computation. Second, GemNetT maintains an explicit embedding for each atom that is first updated by aggregating the directional embeddings involving that atom, similar to DimeNet++. Next, the updated atom embedding is used to update each of the edge embeddings. This creates a second edge update function, , that is run after the node embeddings are updated. Third, GemNet makes use of symmetric message passing, that is the messages and that are on the same edge, but in different directions, are coupled. In a parallel implementation, this step requires an additional allreduce step since messages and could be on different processors.
Thus, GemNetT largely follows the EGN framework, with a few minor deviations from the standard formulation. The distributed EGN implementation described in section 2.2 can be used for the GemNetT as well, but with additional communication steps to account for the second edge update function and symmetric message passing.
4 Experiments
In this section, we present the results of our scaling experiments on the Open Catalyst 2020 (OC20) dataset (Chanussot* et al., 2021). The OC20 dataset contains over 130 million atomic structures used to train models for predicting forces and energies during structure relaxations. We report results for three tasks: 1) Structure to Energy and Forces (S2EF) that involves predicting energy and forces for a given structure; 2) Initial Structure to Relaxed Energy (IS2RE) that involves predicting the relaxed energy for a given initial structure; and 3) Initial Structure to Relaxed Structure (IS2RS) which involves performing a structure relaxation using the predicted forces. DimeNet++ and GemNetT are the current stateoftheart energycentric and forcecentric models respectively.
4.1 Experimental Setup
DimeNet++XL. Our DimeNet++ model consists of interaction blocks, with a hidden dimension of , an output block dimension of , and intermediate triplet dimension of . This model has about parameters, which is over larger than the DimeNet++large model used in (Chanussot* et al., 2021). We call this model DimeNet++XL. The model was trained with the AdamW optimizer (Kingma and Ba, 2015; Loshchilov and Hutter, 2019) starting with an initial learning rate of , that was multiplied by whenever the validation error plateaus. The model was trained with an effective batch size of 128 on 256 Volta 32GB GPUs with a combination of data parallel and graph parallel training: each graph was split over 4 GPUs with data parallel training across groups of 4 GPUs.
GemNetXL. Our GemNet model consists of interaction blocks, with an edge embedding size of , triplet embedding size of and embedding dimension of the bilinear layer of . We found that it was beneficial to use a small atom embedding size of , much smaller than the previous SOTA GemNet model^{2}^{2}2discuss.opencatalystproject.org/t/newgemnetdtcoderesultsmodelweights/102. This model has roughly 300M parameters, which is about larger than the previous SOTA model. However, since we reduced the atom embedding dimension and increased the edge and triplet dimensions, the total amount of compute and memory usage is significantly larger. We call this model GemNetXL. We followed the same training procedure as with DimeNet++, except for a starting learning rate of .
4.2 Structure to Energy and Forces (S2EF)
The Structure to Energy and Forces task takes an atomic structure as input and predicts the energy of the entire structure and peratom forces. The S2EF task has four metrics: the energy and force Mean Absolute Error (MAE), the Force Cosine Similarity, and the Energy and Forces within a Threshold (EFwT). EFwT indicates the percentage of energy and force predictions below a preset threshold.
Table 1 compares the top models on the Open Catalyst Project leaderboard with the GemNetXL model. GemNetXL obtains a roughly lower force MAE and an lower energy MAE relative to the previous stateoftheart. Further, GemNetXL improves the EFwT metric more than relative to the previous best, although the value is still very small. These results indicate that model scaling is beneficial for the S2EF task.
4.3 Initial Structure to Relaxed Structure (IS2RS)
The Initial Structure to Relaxed Structure (IS2RS) task involves taking an initial atomic structure and predicting the atomic structure that minimizes the energy. This is performed by iteratively predicting the forces on each atom and then using these forces to update the atomic positions. This process is repeated until convergence or 200 iterations. There are three metrics for this task: Average Distance within Threshold (ADwT), that measures the fraction of final atomic positions within a distance threshold of the ground truth; Forces below Threshold (FbT), which measures whether a true energy minimum was found (i.e., forces are smaller than a preset threshold); and, the Average Forces below Threshold (AFbt), which averages the FbT over several thresholds.
Model  #Params  Training  IS2RS Test  

Dataset  AFbT  ADwT  FbT  
SpinConv (Shuaibi et al., 2021)  8.9M  S2EFAll  16.67%  53.62%  0.05% 
DimeNet++ (Klicpera et al., 2020a)  1.8M  S2EF 20M + MD  17.15%  47.72%  0.15% 
DimeNet++large (Klicpera et al., 2020a)  10.8M  S2EFAll  21.82%  51.68%  0.40% 
GemNetT (Klicpera et al., 2021)  31M  S2EFAll  27.60%  58.68%  0.70% 
DimeNet++XL  240M  S2EF 20M + MD  33.44%  59.21%  1.25% 
GemNetXL  300M  S2EFAll  30.82%  62.65%  0.90% 
Model  Approach  IS2RE Test  

Energy MAE (EV)  EwT  
SpinConv (Shuaibi et al., 2021)  Relaxation  0.4343  7.90% 
GemNetT (Klicpera et al., 2021)  Relaxation  0.3997  9.86% 
Noisy Nodes (Godwin et al., 2022)  Direct  0.4728  6.50% 
3DGraphormer (Ying et al., 2021)  Direct  0.4722  6.10% 
GemNetXL  Relaxation  0.3712  11.13% 
GemNetXLFT  Direct  0.4623  5.60% 
Table 2 shows the results on the IS2RS task, comparing our models with the top few models on the Open Catalyst leaderboard. Both the DimeNet++XL and GemNetXL models outperform all existing models. The DimeNet++XL model obtains a relative improvement of 53% on the AFbT metric, and more than triples the FbT metric compared to the DimeNetlarge model. The GemNetXL model obtains similar improvements compared to the smaller GemNetT model. These results underscore the importance of model scaling for this task.
4.4 Initial Structure to Relaxed Energy (IS2RE)
The Initial Structure to Relaxed Energy (IS2RE) task takes an initial atomic structure and attempts to predict the energy of the structure after it has been relaxed. Two approaches can be taken to address this problem, the direct and the relaxation approaches (Chanussot* et al., 2021). In the direct approach, we treat this task as a regression problem, and train a model to directly estimate the relaxed energy for a given atomic structure. The relaxation approach first estimates the relaxed structure (IS2RS task) after which the energy is estimated using the relaxed structure as input. Relaxation approaches typically outperform the direct approaches, though they are generally two orders of magnitude slower during inference time due to the need to estimate the relaxed structure.
Table 3 compares our GemNetXL model to the top three models from the Open Catalyst Project leaderboard. There are two metrics for the IS2RE task: energy Mean Absolute Error (MAE) and the Energy within Threshold (EwT) which measures the percentage of time the predicted energy is within a threshold of the true energy. Table 3 shows that the GemNetXL model obtains a roughly 8% lower energy MAE and a 12% higher EwT compared to the previous best, which is the smaller GemNetT model, demonstrating the benefits of scaling up IS2RE models.
Since direct IS2RE models are very fast at inference time, there are applications where they are more useful than relaxation based approaches. It is possible to convert a trained S2EF model into a direct IS2RE model by finetuning it on the IS2RE training data. We finetune our GemNetXL model in this manner for 5 epochs, starting with an initial learning rate of
that is exponentially decayed by multiplying with at the end of each epoch. The resuting model – GemNetXLFT – obtains a relative improvement of on energy MAE compared to 3DGraphormer (Ying et al., 2021), the current stateoftheart direct approach (Table 3).4.5 Scaling Analysis
Weak scaling studies the effect on the throughput when computation is scaled proportional to the number of processors. We study weak scaling efficiency in terms of scaling the model size proportional to the number of GPUs. For these experiments, we use 4 different GemNetT models, ranging from 120M parameters to nearly 1B parameters (see Table 4).
We train increasingly larger models on multiple GPUs that do not fit on a single GPU. Figure 0(c) shows that we are able to obtain a scaling efficiency of roughly with 8 GPUs for our largest model with nearly a billion parameters. This shows that our graph parallel training can be scaled up to billion parameter GemNetT models while obtaining reasonably good scaling performance.
Figure 0(c) further shows the scaling efficiency for each model combining graph parallelism with 32way data parallelism. Pure data parallel training with 32 GPUs obtains a scaling efficiency of for the smallest model, showing the effect of network communication and load imbalance between the GPUs. Combining graph and data parallel training with 256 GPUs only reduces the scaling efficiency to for the largest model compared to the 1 GPU case, suggesting the graph parallelism is promising for training extremely large models on hundreds of GPUs.
Finally, figure 0(d) shows the raw performance of running these models on V100 GPUs in terms of TeraFLOPs per second as a function of the number of GPUs. On a single GPU, the 120M parameter GemNetT sustains 32 TeraFLOPs or roughly of the theoretical peak FLOPS for a single GPU, and is thus a strong baseline. With 256 GPUs, the largest model sustains PetaFLOPs.
Figure 0(d) compares graph and pipeline parallelism (Huang et al., 2019) showing that graph parallelism outperforms pipeline parallelism for the models that we consider. Since each graph in the training data contains different numbers of nodes, edges and triplets, load balancing across GPUs is difficult for pipeline parallelism. Graph parallelism is able to overcome this problem since the nodes, edges and triplets of a given batch are always distributed evenly across the GPUs, helping it outperform pipeline parallelism. It is possible, however, that pipeline parallelism might outperform graph parallelism for very deep GNNs since interGPU communication overhead for pipeline parallelism is independent of the number of blocks. Figure 0(d) also shows results with graph and pipeline parallelism combined, indicating that these methods are complementary to each other.
5 Related Work
GNNs for simluating atomic systems
Many GNN based approaches have been proposed for the task of estimating atomic properties such as (Schütt et al., 2017b; Gilmer et al., 2017; Jørgensen et al., 2018; Schütt et al., 2017a, 2018; Xie and Grossman, 2018; Qiao et al., 2020; Klicpera et al., 2020b), where the atoms are represented by nodes and neighboring atoms are connected by edges. An early approach for force estimation was the SchNet model Schütt et al. (2017a), which computed forces using only the distance between atoms without the use of angular information. SchNet proposed the use of differentiable edge filters which enabled constructing energyconserving models by estimating forces as the gradient of the energy. Subsequent work (Klicpera et al., 2020a, b, 2021; Liu et al., 2022) has extended on the SchNet model by adding bond angles and dihedral angles, which has resulted in improved performance. These models make use of higher order interactions among nodes which make them highly compute and memory intensive. An alternate approach for estimating forces is to directly regress the forces as an output of the network. While this approach does not enforce energy conservation or rotational equivariance, recent work (Hu et al., 2021) has shown that such models can still accurately predict forces.
Distributed GNN Training. Research on distributed GNN training has focused on the regime of training small models on a single, large graph, for instance by sampling local neighborhoods around nodes to create minibatches (jangda2021accelerating; zhang2020agl; zhu2019aligraph). While these approaches can scale to very large graphs, they are not suitable for the task of modeling atomic systems where it is important to consider the entire graph for predicting the energy and forces.
An alternate line of work, that is more similar to ours, keeps the entire graph in memory by efficiently partitioning the graph among multiple nodes (Jia2020ImprovingTA; Ma2019NeuGraphPD; tripathy2020reducing). These methods still operate in a single graph regime that is partitioned ahead of time. Thus the focus of that work is on finding efficient partitions of the graph, which is not applicable to our problem since we operate on millions of graphs. Further, these works do not train very large GNNs, or GNNs that operate on higherorder interactions (e.g. triplets), that are important for atomic systems.
Model Parallelism. methods focus on training large models that do not fit entirely on one GPU (even with a batch size of 1). GPipe (Huang et al., 2019) splits different sequences of layers into different processors, and splits each training minibatch into microbatches to avoid idle time per GPU. MegatronLM (Shoeybi et al., 2020) splits the model breadthwise, where Transformer layer weights are partitioned across multiple GPUs to distribute computation. We see model and graph parallelism as complementary approaches that can be combined to train even larger models.
6 Discussion
We presented Graph Parallelism, a new method for training large GNNs for modeling atomic simulations, where modeling higherorder interactions between atoms (triplets / quadruplets) is critical. We showed that training larger DimeNet++ and GemNetT models can yield significant improvements on the OC20 dataset. Although we demonstrated graph parallelism for just two GNNs, it is broadly applicable to a host of messagepassing GNNs, including equivariant networks, that can be cast in the GN / EGN framework (Sec. 2.1) by appropriately picking update and aggregate functions.
Further, it is possible to combine graph parallelism with model parallel methods such as GPipe (Huang et al., 2019) to train even larger models, which could yield further improvements, as briefly explored in Sec 4.5. For forcecentric GNNs, it should be possible to use graph parallel for ‘breadthwise’ scaling i.e., to split higherorder computations (e.g. triplets) across GPUs, and GPipe for ‘depthwise’ scaling i.e., to scale to larger number of messagepassing blocks, sequentially split across GPUs. For energycentric GNNs e.g., DimeNet++, this combination is less obvious since these models require an additional backward pass through the network to compute forces as gradients of energy with respect to atomic positions. Energycentric GNNs are common for atomic systems because they enforce the physical constraint of energy conservation. As we demonstrate with our DimeNet++XL experiments, graph parallelism is applicable to energycentric GNNs.
We see scaling to large model sizes as a necessary (but not sufficient) step to effectively model atomic simulations from large, diverse datasets of adsorbates and catalysts. Further progress may require marrying large scale with ways to better capture 3D geometry and physical priors.
The carbon emissions from training large deep learning models are nontrivial and the work we have presented here is no exception
(strubell_arxiv19; schwartz_arxiv19). We estimate that training our GemNetXL model with Tesla v100 32 GB GPUs on cloud resources in the US ranges from 3490  8052 kg of CO eq. (lacoste2019quantifying). In the worst case, the emissions are roughly equivalent to 16 round trip flights from Los Angeles to New York. Assuming that the training time is fixed, emissions largely depend on the carbon intensity of the energy generation in a given region and the percentage of emissions offset with other investments by the resource provider. When choosing compute resources we recommend evaluating the stated carbon offset commitments and if possible, consider running experiments in regions that have more sustainable energy generation. The compute resources we utilized for this paper were committed to be offset.References
 Relational inductive biases, deep learning, and graph networks. External Links: 1806.01261 Cited by: §2.1.
 Language models are fewshot learners. In NeurIPS, Cited by: §1.
 Open Catalyst 2020 (OC20) Dataset and Community Challenges. ACS Catalysis. Cited by: §1, §4.1, §4.4, Table 2, §4.
 Neural message passing for quantum chemistry. In ICML, Cited by: §1, §5.
 Simple GNN Regularisation for 3D Molecular Property Prediction and Beyond. In ICLR, Cited by: Table 3.
 A new model for learning in graph domains. In IJCNN, Cited by: §1.
 ForceNet: A Graph Neural Network for LargeScale Quantum Calculations. arXiv preprint arXiv:2103.01436. Cited by: Table 1, §5.
 GPipe: efficient training of giant neural networks using pipeline parallelism. In NeurIPS, Cited by: §1, §4.5, §5, §6.
 Neural message passing with edge updates for predicting properties of molecules and materials. arXiv preprint arXiv:1806.03146. Cited by: §1, §5.
 Adam: a method for stochastic optimization. In ICLR, Cited by: §4.1.
 GemNet: Universal Directional Graph Neural Networks for Molecules. In NeurIPS, Cited by: §1, §1, §2.2, §3.4, Table 1, Table 2, Table 3, §5.
 Fast and uncertaintyaware directional message passing for nonequilibrium molecules. In NeurIPSW, Cited by: §1, §1, §2.2, §3.3, Table 1, Table 2, §5.
 Directional message passing for molecular graphs. In ICLR, Cited by: §5.
 Spherical message passing for 3d molecular graphs. In ICLR, Cited by: §1, §5.
 Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. Cited by: §4.1.
 OrbNet: deep learning for quantum chemistry using symmetryadapted atomicorbital features. arXiv preprint arXiv:2007.08026. Cited by: §5.

Schnet: a continuousfilter convolutional neural network for modeling quantum interactions
. In NeurIPS, Cited by: §1, §5. 
Quantumchemical insights from deep tensor neural networks
. Nature communications 8, pp. 13890. Cited by: §1, §5.  SchNet–a deep learning architecture for molecules and materials. The Journal of Chemical Physics 148 (24), pp. 241722. Cited by: §5.
 Megatronlm: training multibillion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053. Cited by: §1, §5.
 Rotation Invariant Graph Neural Networks using Spin Convolutions. arXiv preprint arXiv:2106.09575. Cited by: Table 1, Table 2, Table 3.
 Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Physical review letters 120 (14), pp. 145301. Cited by: §1, §5.
 Do Transformers Really Perform Badly for Graph Representation?. In NeurIPS, Cited by: §4.4, Table 3.
 Graph neural networks: a review of methods and applications. AI Open 1, pp. 57–81. Cited by: §1.

An introduction to electrocatalyst design using machine learning for renewable energy storage
. arXiv preprint arXiv:2010.09435. Cited by: §1.
Appendix A Appendix
a.1 Additional results
Tables 5, 6 and 7 show results from each of the four test sets for the S2EF, IS2RE and IS2RS tasks respectively. DimeNet++XL and GemNetXL achieve the best results for each test set along each metric.
Model  S2EF Test  

Energy MAE (EV)  Force Cos  Force MAE (EV/A)  EFwT  
ID  
GemNetT  0.2257  0.637  0.02099  2.4% 
SpinConv  0.2612  0.5479  0.02689  0.82% 
ForceNetLarge  2.0674  0.533  0.02782  0.02% 
DimeNet++large  29.3333  0.5633  0.02807  0% 
GemNetXL  0.2120  0.6759  0.0181  3.30% 
OOD Ads  
GemNetT  0.2099  0.6242  0.02186  1.15% 
SpinConv  0.2752  0.5345  0.02769  0.38% 
ForceNetLarge  2.4188  0.5212  0.02834  0.01% 
DimeNet++large  30.0338  0.5503  0.02896  0% 
GemNetXL  0.1980  0.6642  0.0186  1.62% 
OOD Cat  
GemNetT  0.3403  0.5813  0.02445  0.93% 
SpinConv  0.3501  0.5187  0.02849  0.46% 
ForceNetLarge  2.0203  0.4936  0.03089  0.01% 
DimeNet++Large  30.0437  0.5109  0.0312  0% 
GemNetXL  0.3083  0.6306  0.0206  1.72% 
OOD Both  
GemNetT  0.3937  0.6221  0.02956  0.3% 
SpinConv  0.4585  0.5554  0.03556  0.14% 
ForcenetLarge  2.5447  0.5302  0.03754  0% 
DimeNet++Large  36.7529  0.5517  0.03705  0% 
GemNetXL  0.362  0.6704  0.0245  0.61% 
Model  Approach  IS2RE Test  

Energy MAE (EV)  EwT  
ID  
GemNetT  Relaxation  0.3901  12.37% 
SpinConv  Relaxation  0.4207  9.4% 
Noisy Nodes  Direct  0.4776  5.71% 
GemNetXL  Relaxation  0.3764  13.25% 
GemNetXLFT  Direct  0.4194  7.52% 
OOD Ads  
GemNetT  Relaxation  0.3907  9.11% 
SpinConv  Relaxation  0.4381  7.47% 
Noisy Nodes  Direct  0.5646  3.49% 
GemNetXL  Relaxation  0.3677  10.00% 
GemNetXLFT  Direct  0.5258  3.95% 
OOD Cat  
GemNetT  Relaxation  0.4339  10.09% 
SpinConv  Relaxation  0.4579  8.16% 
Noisy Nodes  Direct  0.4932  5.02% 
GemNetXL  Relaxation  0.4022  11.61% 
GemNetXLFT  Direct  0.4373  6.76% 
OOD Both  
GemNetT  Relaxation  0.3843  7.87% 
SpinConv  Relaxation  0.4203  6.56% 
Noisy Nodes  Direct  0.5042  3.82% 
GemNetXL  Relaxation  0.3383  9.65% 
GemNetXLFT  Direct  0.4665  4.19% 
Model  Training  IS2RS Test  

Dataset  AFbT  ADwT  FbT  
ID  
GemNetT  S2EFALL  33.75%  59.18%  2.0% 
DimeNet++large  S2EFALL  25.65%  52.45%  1.0% 
SpinConv  S2EFALL  21.10%  53.68%  0.2% 
DimeNet++  S2EF 20M + MD  21.08%  48.6%  0.2% 
DimeNet++XL  S2EF 20M + MD  40.00%  59.90%  2.4% 
GemNetXL  S2EFALL  34.61  62.73%  2.4% 
OOD Ads  
GemNetT  S2EFALL  26.84%  54.59%  0.2% 
DimeNet++large  S2EFALL  20.73%  48.47%  0.4% 
SpinConv  S2EFALL  15.70%  48.87%  0.0% 
DimeNet++  S2EF 20M + MD  17.05%  42.98%  0.0% 
DimeNet++XL  S2EF 20M + MD  36.01%  55.68%  1.6% 
GemNetXL  S2EFALL  30.32%  58.57%  0.6% 
OOD Cat  
GemNetT  S2EFALL  24.69%  58.71%  0.4% 
DimeNet++large  S2EFALL  20.24%  50.99%  0% 
SpinConv  S2EFALL  15.86%  53.92%  0% 
DimeNet++  S2EF 20M + MD  16.43%  48.19%  0% 
DimeNet++XL  S2EF 20M + MD  29.62%  58.43%  0.6% 
GemNetXL  S2EFALL  29.33%  62.60%  0.2% 
OOD Both  
GemNetT  S2EFALL  25.11%  62.23%  0.2% 
DimeNet++large  S2EFALL  20.67%  54.82%  0.2% 
SpinConv  S2EFALL  14.01%  58.03%  0% 
DimeNet++  S2EF 20M + MD  14.02%  51.09%  0.4% 
DimeNet++XL  S2EF 20M + MD  28.14%  62.85%  0.4% 
GemNetXL  S2EFALL  29.02%  66.72%  0.4% 