Large-scale graph representation learning with very deep GNNs and self-supervision

by Ravichandra Addanki, et al.

Effectively and efficiently deploying graph neural networks (GNNs) at scale remains one of the most challenging aspects of graph representation learning. Many powerful solutions have only ever been validated on comparatively small datasets, often with counter-intuitive outcomes – a barrier which has been broken by the Open Graph Benchmark Large-Scale Challenge (OGB-LSC). We entered the OGB-LSC with two large-scale GNNs: a deep transductive node classifier powered by bootstrapping, and a very deep (up to 50-layer) inductive graph regressor regularised by denoising objectives. Our models achieved an award-level (top-3) performance on both the MAG240M and PCQM4M benchmarks. In doing so, we demonstrate evidence of scalable self-supervised graph representation learning, and utility of very deep GNNs – both very important open issues. Our code is publicly available at:



1 Introduction

Effective high-dimensional representation learning necessitates properly exploiting the geometry of data (bronstein2021geometric); otherwise, it is a cursed estimation problem. Indeed, early success stories of deep learning relied on imposing strong geometric assumptions, primarily that the data lives on a grid domain, either spatial or temporal. In these two respective settings, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) (hochreiter1997long) have traditionally dominated.

While both CNNs and RNNs are demonstrably powerful models, with many applications of high interest, it can be recognised that most data coming from nature cannot be natively represented on a grid. Recent years are marked with a gradual shift of attention towards models that admit a more generic class of geometric structures (masci2015geodesic; velivckovic2017graph; cohen2018spherical; battaglia2018relational; de2020gauge; satorras2021n).

In many ways, the most generic and versatile of these models are graph neural networks (GNNs). This is due to the fact that most discrete-domain inputs can be observed as instances of a graph structure. The corresponding area of graph representation learning (hamilton2020graph) has already seen immense success across industrial and scientific disciplines. GNNs have successfully been applied for drug screening (stokes2020deep), modelling the dynamics of glass (bapst2020unveiling), web-scale social network recommendations (ying2018graph) and chip design (mirhoseini2020chip).

While the above results are certainly impressive, they likely only scratch the surface of what is possible with well-tuned GNN models. Many problems of real-world interest require graph representation learning at scale: either in terms of the number of graphs to process, or their sizes (in terms of numbers of nodes and edges). Perhaps the clearest motivation for this comes from the Transformer family of models (vaswani2017attention). Transformers operate a self-attention mechanism over a complete graph, and can hence be observed as a specific instance of GNNs (joshi2020transformers). At very large scales of natural language data, Transformers have demonstrated significant returns with the increase in capacity, as exemplified by models such as GPT-3 (brown2020language). Transformers enjoy favourable scalability properties at the expense of their functional complexity: each node’s features are updated with weighted sums of neighbouring node features. In contrast, GNNs that rely on message passing (gilmer2017neural), that is, passing vector signals across edges that are conditioned on both the sender and receiver nodes, are an empirically stronger class of models, especially on tasks requiring complex reasoning (velivckovic2019neural) or simulations (sanchez2020learning; pfaff2020learning).

One reason why generic message-passing GNNs have not been scaled up as widely as Transformers is the lack of appropriate datasets. Only recently has the field advanced from simple transductive benchmarks of only a few thousand nodes (sen2008collective; shchur2018pitfalls; morris2020tudataset) towards larger-scale real-world and synthetic benchmarks (dwivedi2020benchmarking; hu2020open), but important issues still remain. For example, on many of these tasks, randomly-initialised GNNs (velivckovic2018deep), shallow GNNs (wu2019simplifying) or simple label propagation-inspired GNNs (huang2020combining) can perform near the state-of-the-art level with only a fraction of the parameters. When most bleeding-edge expressive methods are unable to improve on the above, this can often lead to controversial discussion in the community. One common example is: do we even need deep, expressive GNNs?

Breakthroughs in deep learning research have typically been spearheaded by impactful large-scale competitions. For image recognition, the most famous example is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (russakovsky2015imagenet). In fact, the very “deep learning revolution” has partly been kickstarted by the success of the AlexNet CNN model of krizhevsky2012imagenet at the ILSVRC 2012, firmly establishing deep CNNs as the workhorse of image recognition for the forthcoming decade.

Accordingly, we have entered the recently proposed Open Graph Benchmark Large-Scale Challenge (OGB-LSC) (hu2021ogb). OGB-LSC provides graph representation learning tasks at an unprecedented scale: millions of nodes, billions of edges, and/or millions of graphs. Further, the tasks have been designed with immediate practical relevance in mind, and it has been verified that expressive GNNs are likely to be necessary for strong performance. Here we detail our two submitted models (for the MAG240M and PCQM4M tasks, respectively), and our empirical observations while developing them. Namely, we find that the datasets’ immense scale provides a great platform for demonstrating clear outperformance of very deep GNNs (godwin2021very), as well as self-supervised GNN setups such as bootstrapping (thakoor2021bootstrapped). In doing so, we have provided meaningful evidence towards a positive resolution to the above discussion: deep and expressive GNNs are, indeed, necessary at the right level of task scale and/or complexity. Our final models have achieved award-level (top-3) ranking on both MAG240M and PCQM4M.

2 Dataset description

The MAG240M-LSC dataset is a transductive node classification dataset, based on the Microsoft Academic Graph (MAG) (wang2020microsoft). It is a heterogeneous graph containing paper, author and institution nodes, with edges representing the relations between them: paper-cites-paper, author-writes-paper, author-affiliated-with-institution. All paper nodes are endowed with 768-dimensional input features, corresponding to the RoBERTa sentence embedding (liu2019roberta; reimers2019sentence) of their title and abstract. MAG240M is currently the largest-scale publicly available node classification dataset by a wide margin, at 240 million nodes and 1.8 billion edges. The aim is to classify the 1.4 million arXiv papers into their corresponding topics, according to a temporal split: papers published up to 2018 are used for training, with validation and test sets including papers from 2019 and 2020, respectively.

The PCQM4M-LSC dataset is an inductive graph regression dataset based on the PubChemQC project (nakata2017pubchemqc). It consists of 4 million small molecules (described by their SMILES strings). The aim is to accelerate quantum-chemical computations: specifically, to predict the HOMO-LUMO gap of each molecule. The HOMO-LUMO gap is one of the most important quantum-chemical properties, since it is related to the molecules’ reactivity, photoexcitation, and charge transport. The ground-truth labels for every molecule were obtained through expensive DFT (density functional theory) calculations, which may take several hours per molecule. It is believed that machine learning models, such as GNNs over the molecular graph, may obtain useful approximations to the DFT at only a fraction of the computational cost, if provided with sufficient training data (gilmer2017neural). The molecules are split with an 80:10:10 ratio into training, validation and test sets, based on their PubChem ID.

3 GNN Architectures

For both of the tasks above, we rely on a common encode-process-decode blueprint (hamrick2018relational). This implies that our input features are encoded into a latent space using node-, edge- and graph-wise encoder functions, and latent features are decoded to node-, edge- and graph- level predictions using appropriate decoder functions. The bulk of the computational processing is powered by a processor network, which performs multiple graph neural network layers over the encoded latents.

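To formalise this, assume that our input graph, $G = (V, E)$, has node features $x_u \in \mathbb{R}^n$, edge features $x_{uv} \in \mathbb{R}^m$ and graph-level features $x_G \in \mathbb{R}^l$, for nodes $u \in V$ and edges $(u, v) \in E$. Our encoder functions $f_n$, $f_e$ and $f_g$ then transform these inputs into the latent space:

$$h_u^{(0)} = f_n(x_u), \qquad h_{uv}^{(0)} = f_e(x_{uv}), \qquad h_G^{(0)} = f_g(x_G) \tag{1}$$

Our processor network then transforms these latents over several rounds of message passing:

$$H^{(t+1)} = P_{t+1}\left(H^{(t)}\right) \tag{2}$$

where $H^{(t)} = \left(\{h_u^{(t)}\}_{u \in V}, \{h_{uv}^{(t)}\}_{(u, v) \in E}, h_G^{(t)}\right)$ contains all of the latents at a particular processing step $t \geq 0$.

The processor network $P$ is iterated for $T$ steps, recovering final latents $H^{(T)}$. These can then be decoded into node-, edge-, and graph-level predictions (as required), using analogous decoder functions $g_n$, $g_e$ and $g_g$:

$$y_u = g_n\left(h_u^{(T)}\right), \qquad y_{uv} = g_e\left(h_{uv}^{(T)}\right), \qquad y_G = g_g\left(h_G^{(T)}\right) \tag{3}$$

We will detail the specific design of $f$, $P$ and $g$ in the following sections. Generally, $f$ and $g$ are simple MLPs, whereas we use highly expressive GNNs for $P$ in order to maximise the advantage of the large-scale datasets. Specifically, we use message passing neural networks (MPNNs) (gilmer2017neural) and graph networks (GNs) (battaglia2018relational). All of our models have been implemented using the jraph library (jraph2020github).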
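As an illustration of the encode-process-decode blueprint described above, the following is a minimal NumPy sketch rather than the jraph implementation used for the submitted models. The dictionary layout, the linear encoders and decoders, and the `toy_step` processor are all assumptions made purely for illustration.

```python
import numpy as np


def encode_process_decode(graph, encoders, processor_steps, decoders):
    """Encode the inputs, run the processor steps in order, then decode."""
    # Encode: independent encoders for node, edge and graph-level inputs.
    latents = {key: encoders[key](graph[key]) for key in ("nodes", "edges", "globals")}
    # Process: each step maps all latents at step t to the latents at step t + 1.
    for step in processor_steps:
        latents = step(latents, graph["senders"], graph["receivers"])
    # Decode: only the prediction levels that the task requires.
    return {key: decode(latents[key]) for key, decode in decoders.items()}


def toy_step(latents, senders, receivers):
    """Placeholder processor step: add the summed sender latents to each receiver."""
    nodes = latents["nodes"]
    incoming = np.zeros_like(nodes)
    np.add.at(incoming, receivers, nodes[senders])
    return {**latents, "nodes": nodes + incoming}


rng = np.random.default_rng(0)
k = 8  # latent width, chosen arbitrarily for this example
w_n, w_e, w_g, w_out = (rng.normal(size=s) for s in [(5, k), (2, k), (3, k), (k, 1)])
graph = {
    "nodes": rng.normal(size=(4, 5)),
    "edges": rng.normal(size=(3, 2)),
    "globals": rng.normal(size=(1, 3)),
    "senders": np.array([0, 1, 2]),
    "receivers": np.array([1, 2, 3]),
}
encoders = {"nodes": lambda x: x @ w_n, "edges": lambda x: x @ w_e, "globals": lambda x: x @ w_g}
decoders = {"nodes": lambda h: h @ w_out}
predictions = encode_process_decode(graph, encoders, [toy_step] * 3, decoders)
```

Passing the processor as a list of steps keeps the sketch agnostic to whether parameters are shared across layers.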

4 MAG240M-LSC


Running graph neural networks over datasets that are even a fraction of MAG240M’s scale is already prone to multiple scalability issues, which necessitated either aggressive subsampling (hamilton2017inductive; chen2018fastgcn; zeng2019graphsaint; zou2019layer), graph partitioning (liao2018graph; chiang2019cluster) or less expressive GNN architectures (rossi2020sign; bojchevski2020scaling; yu2020scalable).

As we would like to leverage expressive GNNs, and be able to pass messages across any partitions, we opted for the subsampling approach. Accordingly, we subsample moderately-sized patches around the nodes we wish to compute latents for, execute our GNN model over them, and use the latents in the central nodes to train or evaluate the model.

We adapt the standard GraphSAGE subsampling algorithm of hamilton2017inductive, but make several modifications to it in order to optimise it for the specific features of MAG240M. Namely:

  • We perform separate subsampling procedures across edge types. For example, an author node will separately sample a pre-specified number of papers written by that author and a pre-specified number of institutions that author is affiliated with.

  • GraphSAGE mandates sampling an exact number of neighbours for every node, and uses sampling with replacement to achieve this even when the neighbourhood size is variable. We find this to be wasteful for smaller neighbourhoods, and hence use our pre-specified neighbour counts only as an upper bound. Denoting this upper bound as $K$, and node $u$’s original neighbourhood as $N_u$ (both defined on a per-edge-type basis, according to the previous bullet point), we proceed as follows:

    • For nodes that have fewer neighbours of a particular type than the upper bound ($|N_u| \leq K$), we simply take the entire neighbourhood, without any subsampling;

    • For nodes that have a moderate amount of neighbours () we subsample neighbours without replacement, hence we do not wastefully duplicate nodes when the memory costs are reasonable.

    • For all other nodes (), we resort to the usual GraphSAGE strategy, and sample neighbours with replacement, which doesn’t require an additional row-copy of the adjacency matrix.

  • GraphSAGE directs the edges in the patch from the subsampled neighbours to the node which sampled them, and runs its GNN for exactly as many steps as the sampling depth. We instead modify the message passing update rule to scalably make the edges bidirectional, which naturally allows us to run deeper GNNs over the patch. The exact way in which we performed this is detailed in the model architecture.
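The three neighbourhood-size regimes above can be sketched as follows. This is a hedged illustration and not the authors' sampler: the function names, the per-edge-type dictionaries and the `moderate_factor` parameter (the cut-off separating "moderate" from "large" neighbourhoods) are hypothetical.

```python
import numpy as np


def sample_neighbours(neighbours, max_count, moderate_factor, rng):
    """Sample at most `max_count` neighbours, choosing the regime by neighbourhood size."""
    neighbours = np.asarray(neighbours)
    if len(neighbours) <= max_count:
        # Small neighbourhood: keep everything, no subsampling.
        return neighbours.copy()
    if len(neighbours) <= moderate_factor * max_count:
        # Moderate neighbourhood: sample without replacement, so nothing is duplicated.
        return rng.choice(neighbours, size=max_count, replace=False)
    # Large neighbourhood: GraphSAGE-style sampling with replacement.
    return rng.choice(neighbours, size=max_count, replace=True)


def sample_per_edge_type(neighbours_by_type, max_count_by_type, moderate_factor, rng):
    """Run a separate sampling procedure for every edge type of one node."""
    return {
        edge_type: sample_neighbours(nbrs, max_count_by_type[edge_type], moderate_factor, rng)
        for edge_type, nbrs in neighbours_by_type.items()
    }


rng = np.random.default_rng(0)
author_neighbours = {"writes": np.arange(12), "affiliated_with": np.array([100, 101])}
caps = {"writes": 5, "affiliated_with": 3}  # illustrative caps, not the paper's values
sampled = sample_per_edge_type(author_neighbours, caps, moderate_factor=4, rng=rng)
heavy = sample_neighbours(np.arange(1000), max_count=5, moderate_factor=4, rng=rng)
```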

Taking all of the above into account, our model’s subsampling strategy proceeds, starting from paper nodes as central nodes, up to a depth of two (sufficient for institution nodes to become included). We did not observe significant benefits from sampling deeper patches. Instead, we sample significantly larger patches than the original GraphSAGE paper, to exploit the wide context available for many nodes:

  • Depth-0: Contains the chosen central paper node.

  • Depth-1: We sample up to citing papers, cited papers, and up to authors for this paper.

  • Depth-2: We sample according to the following strategy, for all paper and author nodes sampled at depth-1:

    • Papers: Identical strategy as for depth-1 papers: up to cited, citing, authors.

    • Authors: We sample up to written papers, and up to affiliations for this author.

Overall, this inflates our maximal patch size to nearly nodes, which makes our patches of a comparable size to traditional full-graph datasets (sen2008collective; shchur2018pitfalls). Coupled with the fact that MAG240M has hundreds of millions of papers to sample these patches from, our setting enables transductive node classification at previously unexplored scale. We have found that such large patches were indeed necessary for our model’s performance.

One final important remark for MAG240M subsampling concerns the existence of duplicated paper nodes—i.e. nodes with exactly the same RoBERTa embeddings. This likely corresponds to identical papers submitted to different venues (e.g. conference, journal, arXiv). For the purposes of enriching our subsampled patches, we have combined the adjacency matrix rows and columns to “fuse” all versions of duplicated papers together.

Input preprocessing

As just described, we seek to support execution of expressive GNNs on large quantities of large-scale subsampled patches. This places further stress on the model from a computational and storage perspective. Accordingly, we found it very useful to further compress the input nodes’ RoBERTa features. Our qualitative analysis demonstrated that their 129-dimensional PCA projections already account for 90% of their variance. Hence we leverage these PCA vectors as the actual input paper node features.
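A minimal sketch of choosing the PCA dimensionality from a variance target is shown below, using a full SVD on small synthetic data. The function name and the synthetic features are assumptions; at the scale of MAG240M a full SVD would not be practical, and an incremental or randomised variant would presumably be needed.

```python
import numpy as np


def pca_compress(features, variance_target=0.90):
    """Project onto the fewest principal components reaching `variance_target`."""
    centred = features - features.mean(axis=0)
    # Squared singular values of the centred data are proportional to explained variance.
    _, singular_values, components = np.linalg.svd(centred, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    num_components = int(np.searchsorted(np.cumsum(explained), variance_target)) + 1
    num_components = min(num_components, len(singular_values))
    return centred @ components[:num_components].T, num_components


rng = np.random.default_rng(0)
# Synthetic stand-in for sentence embeddings: rank-3 signal plus small noise.
signal = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 16))
features = signal + 0.01 * rng.normal(size=(200, 16))
compressed, num_components = pca_compress(features)
```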

Further, only the paper nodes are actually provided with any features. We adopt the identical strategy from the baseline LSC scripts provided by hu2021ogb to featurise the authors and institutions. Namely, for authors, we use the average PCA features across all papers they wrote. For institutions, we use the average features across all the authors affiliated with them. We found this to be a simple and effective strategy that performed empirically better than using structural features. This is contrary to the findings of yu2020scalable, probably because we use a more expressive GNN.
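The averaging scheme for featureless node types can be sketched as a scatter-mean over an edge list. The helper name and the tiny edge lists below are illustrative assumptions; the baseline scripts may organise the computation differently.

```python
import numpy as np


def mean_over_edges(source_features, source_index, target_index, num_targets):
    """Average the features of source nodes into the target nodes they connect to."""
    sums = np.zeros((num_targets, source_features.shape[1]))
    counts = np.zeros(num_targets)
    np.add.at(sums, target_index, source_features[source_index])
    np.add.at(counts, target_index, 1.0)
    # Targets without any edge keep an all-zero feature vector.
    return sums / np.maximum(counts, 1.0)[:, None]


paper_features = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]])
# author-writes-paper edges: author 0 wrote papers 0 and 1, author 1 wrote paper 2.
writes_author, writes_paper = np.array([0, 0, 1]), np.array([0, 1, 2])
author_features = mean_over_edges(paper_features, writes_paper, writes_author, num_targets=2)
# author-affiliated-with-institution edges: both authors belong to institution 0.
affil_author, affil_institution = np.array([0, 1]), np.array([0, 0])
institution_features = mean_over_edges(author_features, affil_author, affil_institution, num_targets=1)
```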

Besides the PCA-based features, our input node features also contain the one-hot representation of the node’s type (paper/author/institution), the node’s depth in the sampled patch (0/1/2), and a bitwise representation of the papers’ publication year (zeroed out for other nodes). Lastly, and according to an increasing body of research that argues for the utility of labels in transductive node classification tasks (zhu2002learning; stretcu2019graph; huang2020combining), we use the arXiv paper labels as features (wang2021bag) (zeroed out for other nodes). We make sure that the validation labels are not observed at training time, and that the central node’s own label is not provided. It is possible to sample the central node at depth 2, and we make sure to mask out its label if this happens.

We also endow the patches’ edges with a simple edge type feature, $x_{uv}$. It is a 7-bit binary feature, where the first three bits indicate the one-hot type of the sampling node (paper, author or institution) and the next four bits indicate the one-hot type of the sampled neighbour (cited paper, citing paper, author or institution). We found running a standard GNN over these edge-type features more performant than running a heterogeneous GNN. This is once again contrary to existing baseline results (hu2021ogb), and likely because of the expressivity of our processor GNN.
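The 7-bit layout described above can be written down directly. The type names and the function below are hypothetical labels chosen for this sketch; only the bit ordering is taken from the text.

```python
import numpy as np

SAMPLING_NODE_TYPES = ("paper", "author", "institution")
NEIGHBOUR_TYPES = ("cited_paper", "citing_paper", "author", "institution")


def edge_type_feature(sampling_type, neighbour_type):
    """Concatenate a 3-bit one-hot sampler type with a 4-bit one-hot neighbour type."""
    feature = np.zeros(7, dtype=np.int8)
    feature[SAMPLING_NODE_TYPES.index(sampling_type)] = 1
    feature[3 + NEIGHBOUR_TYPES.index(neighbour_type)] = 1
    return feature


paper_to_citing = edge_type_feature("paper", "citing_paper")
author_to_institution = edge_type_feature("author", "institution")
```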

Model architecture

For the GNN architecture we have used on MAG240M, our encoders and decoders are both two-layer MLPs, with a hidden size of 512 features. The node and edge encoders’ output layers compute 256 features, and we retain this dimensionality for $h_u^{(t)}$ and $h_{uv}^{(t)}$ across all steps $t$.

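Our processor network is a deep message-passing neural network (MPNN) (gilmer2017neural). It computes message vectors, $m_{uv}^{(t+1)}$, to be sent across the edge $(u, v)$, and then aggregates them in the receiver nodes as follows:

$$m_{uv}^{(t+1)} = \psi_{t+1}\left(h_u^{(t)}, h_v^{(t)}, h_{uv}^{(0)}\right) \tag{4}$$

$$h_u^{(t+1)} = \phi_{t+1}\left(h_u^{(t)}, \sum_{v : (v, u) \in E} m_{vu}^{(t+1)}, \sum_{v : (u, v) \in E} m_{uv}^{(t+1)}\right) \tag{5}$$

Taken together, Equations 4 and 5 fully specify the operations of the processor network in Equation 2. The message function $\psi_{t+1}$ and the update function $\phi_{t+1}$ are both two-layer MLPs, with identical hidden and output sizes to the encoder network. We note two specific aspects of the chosen MPNN: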

  • We did not find it useful to use global latents or update edge latents (Equation 4 uses $h_{uv}^{(0)}$ at all times and does not include $h_G^{(t)}$). This is likely due to the fact that the prediction is strongly centred at the central node, and that the edge features and types do not encode additional information.

  • Note the third input in Equation 5, which is not usually included in MPNN formulations. In addition to pooling all incoming messages, we also pool all outgoing messages a node sends, and concatenate that onto the input to the sender node’s update function. This allowed us to simulate bidirectional edges without introducing additional scalability issues, allowing us to prototype MPNNs whose depth exceeded the subsampling depth.

The process is repeated for $T$ message passing layers, after which $h_u^{(T)}$ for the central node $u$ is sent to the decoder network for predictions.
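The layer described above, including the additional pooling of outgoing messages, can be sketched in NumPy as follows. This is an illustrative reading of the text rather than the submitted implementation: sum aggregation, the single-layer `tanh` stand-ins for the MLPs, and all names are assumptions.

```python
import numpy as np


def mpnn_layer(h_nodes, h_edges0, senders, receivers, message_fn, update_fn):
    """One message passing layer that pools both incoming and outgoing messages."""
    # Messages depend on sender latents, receiver latents and the encoded edge features.
    messages = message_fn(
        np.concatenate([h_nodes[senders], h_nodes[receivers], h_edges0], axis=-1)
    )
    incoming = np.zeros((h_nodes.shape[0], messages.shape[1]))
    outgoing = np.zeros_like(incoming)
    np.add.at(incoming, receivers, messages)  # usual aggregation at the receivers
    np.add.at(outgoing, senders, messages)    # extra aggregation at the senders
    return update_fn(np.concatenate([h_nodes, incoming, outgoing], axis=-1))


rng = np.random.default_rng(0)
k = 4
w_msg = rng.normal(scale=0.3, size=(3 * k, k))
w_upd = rng.normal(scale=0.3, size=(3 * k, k))
message_fn = lambda x: np.tanh(x @ w_msg)
update_fn = lambda x: np.tanh(x @ w_upd)

# Directed chain 0 -> 1 -> 2: node 0 has no incoming edges at all.
senders, receivers = np.array([0, 1]), np.array([1, 2])
h_nodes = rng.normal(size=(3, k))
h_edges0 = rng.normal(size=(2, k))
out = mpnn_layer(h_nodes, h_edges0, senders, receivers, message_fn, update_fn)

perturbed = h_nodes.copy()
perturbed[1] += 1.0  # change only node 1
out_perturbed = mpnn_layer(perturbed, h_edges0, senders, receivers, message_fn, update_fn)
```

In this example node 0 only sends messages, yet its updated latent still changes when node 1 is perturbed, because the message it sends is conditioned on the receiver and is pooled back into the sender's update. That is the sense in which the edges behave bidirectionally without being duplicated.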

Bootstrapping objective

The non-arXiv papers within MAG240M are unlabelled and hence, under a standard node classification training regime, would contribute only implicitly to the learning algorithm (as neighbours of labelled papers). Early work on self-supervised graph representation learning (velivckovic2018deep) had already shown this could be a wasteful approach, even on small-scale transductive benchmarks. Appropriately using the unlabelled nodes can provide the model with a wealth of information about the feature and network structure, which cannot be easily recovered from supervision alone. On a dataset like MAG240M, which contains 120 times more unlabelled papers than labelled ones, we have been able to observe significant gains from deploying such methods.

Specifically, we leverage bootstrapped graph latents (BGRL) (thakoor2021bootstrapped), a recently-proposed scalable method for self-supervised learning on graphs. Rather than contrasting several node representations across multiple views, BGRL bootstraps the GNN to make a node’s embeddings be predictive of its embeddings from another view, under a target GNN. The target network’s parameters are always set to an exponential moving average (EMA) of the GNN parameters. Formally, let $\bar{f}$ and $\bar{P}$ be the target versions of the encoder and processor networks (periodically updated to the EMA of $f$ and $P$’s parameters), and $G_1$ and $G_2$ be two views of an input patch (in terms of features, adjacency structure or both). Then, BGRL performs the following computations:

$$H_1^{(T)} = P(f(G_1)), \qquad \bar{H}_2^{(T)} = \bar{P}(\bar{f}(G_2))$$

where $P(f(\cdot))$ is short-hand for applying Equation 1, followed by repeatedly applying Equations 4 and 5 for $T$ steps. The BGRL loss is then optimised to make the central node embedding from the first view, $h_{1,u}^{(T)}$, predictive of its counterpart under the target network, $\bar{h}_{2,u}^{(T)}$. This is done by projecting $h_{1,u}^{(T)}$ to another representation using a projector network, $p$, as follows:

$$z_u = p\left(h_{1,u}^{(T)}\right)$$

Here, $p$ is a two-layer MLP with identical hidden and output size as our encoder MLPs. We then optimise the cosine similarity between the projector output and the target embedding,

$$\frac{z_u^\top \bar{h}_{2,u}^{(T)}}{\lVert z_u \rVert \, \lVert \bar{h}_{2,u}^{(T)} \rVert},$$

using stochastic gradient ascent. Once training is completed, the projector network is discarded.
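The two ingredients of the bootstrapping objective, a cosine similarity between projected online embeddings and target embeddings, and an EMA update of the target parameters, can be sketched as below. This NumPy version only evaluates the objective; in an autodiff framework the target embedding would additionally be excluded from gradient computation. Function names and the parameter dictionary layout are assumptions.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Row-wise cosine similarity between two batches of vectors."""
    a_unit = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_unit = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return np.sum(a_unit * b_unit, axis=-1)


def bgrl_objective(online_embedding, target_embedding, projector):
    """Mean cosine similarity to be maximised; the target is treated as a constant."""
    return float(np.mean(cosine_similarity(projector(online_embedding), target_embedding)))


def ema_update(target_params, online_params, decay):
    """Move every target parameter towards its online counterpart."""
    return {
        name: decay * target_params[name] + (1.0 - decay) * online_params[name]
        for name in target_params
    }


embeddings = np.array([[1.0, 2.0], [0.5, -1.0]])
identity_projector = lambda h: h
aligned = bgrl_objective(embeddings, embeddings, identity_projector)
opposed = bgrl_objective(embeddings, -embeddings, identity_projector)
new_target = ema_update({"w": np.zeros(2)}, {"w": np.ones(2)}, decay=0.9)
```

Because the objective is computed independently for each row, it fits a setting where only the central node of every patch contributes a term.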

This approach, inspired by BYOL (grill2020bootstrap), eliminates the need for crafting negative samples, reduces the storage requirements of the model, and its pointwise loss aligns nicely with our patch-wise learning setting, as we can focus on performing the bootstrapping objective on each central node separately. All of this made BGRL a natural choice in our setting, and we have found that we can easily apply it at scale.

Previously, BGRL has been applied on moderately-sized graphs with less expressive GNNs, showing modest returns. Conversely, we find the benefits of BGRL were truly demonstrated with stronger GNNs on the large-scale setting of MAG240M. Not only does BGRL monotonically improve when increasing proportions of unlabelled-to-labelled nodes during training, it consistently outperformed relevant self-supervised GNNs such as GRACE (zhu2020deep).

Ultimately, our submitted model is trained with an auxiliary BGRL objective, with each batch containing a ratio of unlabelled to labelled node patches. Just as in the BGRL paper, we obtain the two input patch views by applying dropout (srivastava2014dropout) on the input features (with ) and DropEdge (rong2019dropedge) (with ), independently on each view. The target network (, ) parameters are updated with EMA decay rate .

Training and regularisation

We train our GNN to minimise the cross-entropy for predicting the correct topic over the labelled central nodes in the training patches, added together with the BGRL objective for the unlabelled central nodes. We use the AdamW SGD optimiser (loshchilov2017decoupled)

with hyperparameters

, and weight decay rate of . We use a cosine learning rate schedule with base learning rate and warm-up steps, decayed over training iterations. Optimisation is performed over dynamically-batched data: we fill up each training minibatch with sampled patches until any of the following limits are exceeded: nodes, edges, or patches.
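Dynamic batching as described above can be sketched with a simple greedy generator. The budgets and the per-patch dictionary fields are hypothetical, and this variant closes a batch just before a budget would be exceeded, which is one possible reading of the filling rule.

```python
def dynamic_batches(patches, max_nodes, max_edges, max_patches):
    """Greedily group patches, closing a batch before any budget would be exceeded."""
    batch, num_nodes, num_edges = [], 0, 0
    for patch in patches:
        over_budget = (
            num_nodes + patch["num_nodes"] > max_nodes
            or num_edges + patch["num_edges"] > max_edges
            or len(batch) + 1 > max_patches
        )
        if batch and over_budget:
            yield batch
            batch, num_nodes, num_edges = [], 0, 0
        # A single patch larger than a budget still forms its own batch.
        batch.append(patch)
        num_nodes += patch["num_nodes"]
        num_edges += patch["num_edges"]
    if batch:
        yield batch


patches = [{"num_nodes": 3, "num_edges": 2} for _ in range(4)]
node_limited = [len(b) for b in dynamic_batches(patches, max_nodes=7, max_edges=100, max_patches=10)]
count_limited = [len(b) for b in dynamic_batches(patches, max_nodes=100, max_edges=100, max_patches=3)]
```

Fixed node and edge budgets are convenient when batches are padded to static shapes for compilation, which is a common reason to batch this way.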

To regularise our model, we perform early stopping on the accuracy over the validation dataset, and apply feature dropout (with ) and DropEdge (rong2019dropedge) (with ) at every message passing layer of the GNN. We further apply layer normalisation (ba2016layer) to intermediate outputs of all of our MLP modules.


At evaluation time, we take advantage of the transductive and subsampled learning setup to enhance our predictions even further. First, we make sure that the model has access to all validation labels as inputs at test time, as this knowledge may be highly indicative. Further, we make sure that any “fused” copies of duplicated nodes also provide that same label as input. As our predictions are potentially conditioned on the specific topology of the subsampled patch, for each test node we average our predictions over 50 subsampled patches, an ensembling trick which consistently improved our validation performance. Lastly, given that we already use EMA as part of BGRL’s target network, for our evaluation predictions we use the EMA parameters, as they are typically slightly more stable.

5 PCQM4M-LSC

Input preprocessing

For featurising our molecules within PCQM4M, we initially follow the baseline scripts provided by hu2021ogb to convert SMILES strings into molecular graphs. Therein, every node is represented by a 9-dimensional feature vector, $x_u$, including properties such as atomic number and chirality. Further, every edge is endowed with 3-dimensional features, $x_{uv}$, including bond types and stereochemistry. Mirroring prior work with GNNs for quantum-chemical computations (gilmer2017neural), we found it beneficial to maintain graph-level features (in the form of a “master node”), which we initialise at .

As will soon become apparent, our experiments on the PCQM4M benchmark leveraged GNNs that are substantially deeper than most previously studied GNNs for quantum-chemical tasks, or otherwise. While there is implicit expectation to compute useful “cheap” chemical features from the SMILES string, such as molecular fingerprints, partial charges, etc., our experiments clearly demonstrated that most of them do not meaningfully impact performance of our GNNs. This indicates that very deep GNNs are likely implicitly able to compute such features without additional guidance.

The exception to this have been conformer features, corresponding to approximated three-dimensional coordinates of every atom. These are very expensive to obtain accurately. However, using RDKit (landrum2013rdkit), we have been able to obtain conformer estimates that allowed us to attain slightly improved performance with a (slightly) shallower GNN. Specifically, we use the experimental torsion knowledge distance geometry (ETKDGv3) algorithm (wang2020improving) to recover conformers that satisfy essential geometric constraints, without violating our time limits.

Once conformers are obtained, we do not use their raw coordinates as features—these have many equivalent formulations that depend on the algorithm’s initialisation. Instead, we encode their displacements (a 3-dimensional vector recording distances along each axis) and their distances (scalar norm of the displacement) as additional edge features concatenated with . Note that RDKit’s algorithm is not powerful enough to extract conformers for every molecule within PCQM4M; for about of the dataset, the returned conformers will be NaN.
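Given per-atom coordinates, the displacement and distance edge features can be computed as in the sketch below. The receiver-minus-sender sign convention, the function names and the NaN check are assumptions made for illustration; conformer generation itself is not shown.

```python
import numpy as np


def conformer_edge_features(positions, senders, receivers):
    """Per-edge displacement (3 values) and Euclidean distance (1 value)."""
    displacement = positions[receivers] - positions[senders]
    distance = np.linalg.norm(displacement, axis=-1, keepdims=True)
    return np.concatenate([displacement, distance], axis=-1)


def has_valid_conformer(positions):
    """False when conformer generation failed and returned NaN coordinates."""
    return bool(np.all(np.isfinite(positions)))


positions = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]])
senders, receivers = np.array([0]), np.array([1])
features = conformer_edge_features(positions, senders, receivers)
shifted = conformer_edge_features(positions + 10.0, senders, receivers)
```

Both features are unchanged when the whole molecule is translated, and the distance is additionally unchanged under rotation; the displacement vector is not.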

Lastly, we also attempted to use more computationally intensive forms of conformer generation—including energy optimisation using the universal force field (UFF) (rappe1992uff) and the Merck molecular force field (MMFF) (halgren1996merck). In both cases, we did not observe significant returns compared to using rudimentary conformers.

Model architecture

For the GNN architecture we have used on PCQM4M, our encoders and decoders are both three-layer MLPs, computing 512 features in every hidden layer. The node, edge and graph-level encoders’ output layers compute 512 features, and we retain this dimensionality for $h_u^{(t)}$, $h_{uv}^{(t)}$ and $h_G^{(t)}$ across all steps $t$.

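For our processor network, we use a very deep Graph Network (GN) (battaglia2018relational). Each GN block computes updated node, edge and graph latents, performing aggregations across them whenever appropriate. Fully expanded out, the computations of one GN block can be represented as follows:

$$h_{uv}^{(t+1)} = \psi_{t+1}\left(h_u^{(t)}, h_v^{(t)}, h_{uv}^{(t)}, h_G^{(t)}\right) \tag{9}$$

$$h_u^{(t+1)} = \phi_{t+1}\left(h_u^{(t)}, \sum_{v : (v, u) \in E} h_{vu}^{(t+1)}, h_G^{(t)}\right) \tag{10}$$

$$h_G^{(t+1)} = \rho_{t+1}\left(\sum_{u \in V} h_u^{(t+1)}, \sum_{(u, v) \in E} h_{uv}^{(t+1)}, h_G^{(t)}\right) \tag{11}$$

Taken together, Equations 9 to 11 fully specify the operations of the processor network in Equation 2. The edge update function $\psi_{t+1}$, node update function $\phi_{t+1}$ and graph update function $\rho_{t+1}$ are all three-layer MLPs, with identical hidden and output sizes to the encoder network.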
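A single GN block in the style of battaglia2018relational can be sketched as follows. Sum aggregation, the single-layer `tanh` stand-ins for the three-layer MLPs, and the reuse of one set of weights across blocks are simplifying assumptions of this sketch, not details confirmed by the text.

```python
import numpy as np


def gn_block(h_nodes, h_edges, h_global, senders, receivers, edge_fn, node_fn, global_fn):
    """Update edge, node and graph-level latents, in that order."""
    num_nodes, num_edges = h_nodes.shape[0], h_edges.shape[0]
    # Edge update: sender, receiver, current edge latent and global latent.
    global_per_edge = np.repeat(h_global[None, :], num_edges, axis=0)
    new_edges = edge_fn(
        np.concatenate([h_nodes[senders], h_nodes[receivers], h_edges, global_per_edge], axis=-1)
    )
    # Node update: current node latent, summed incoming updated edges, global latent.
    incoming = np.zeros((num_nodes, new_edges.shape[1]))
    np.add.at(incoming, receivers, new_edges)
    global_per_node = np.repeat(h_global[None, :], num_nodes, axis=0)
    new_nodes = node_fn(np.concatenate([h_nodes, incoming, global_per_node], axis=-1))
    # Graph update: pooled updated nodes, pooled updated edges, current global latent.
    new_global = global_fn(np.concatenate([new_nodes.sum(axis=0), new_edges.sum(axis=0), h_global]))
    return new_nodes, new_edges, new_global


rng = np.random.default_rng(0)
k = 4
w_edge = rng.normal(scale=0.3, size=(4 * k, k))
w_node = rng.normal(scale=0.3, size=(3 * k, k))
w_global = rng.normal(scale=0.3, size=(3 * k, k))
edge_fn = lambda x: np.tanh(x @ w_edge)
node_fn = lambda x: np.tanh(x @ w_node)
global_fn = lambda x: np.tanh(x @ w_global)

senders, receivers = np.array([0, 1, 2]), np.array([1, 2, 0])
nodes, edges, graph_latent = rng.normal(size=(3, k)), rng.normal(size=(3, k)), np.zeros(k)
for _ in range(3):  # a deeper processor simply stacks more blocks
    nodes, edges, graph_latent = gn_block(
        nodes, edges, graph_latent, senders, receivers, edge_fn, node_fn, global_fn
    )
```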

The process is repeated for $T$ message passing layers, after which the computed latents $h_u^{(T)}$, $h_{uv}^{(T)}$ and $h_G^{(T)}$ are sent to the decoder network for relevant predictions. Specifically, the global latent vector $h_G^{(T)}$ is used to predict the molecule’s HOMO-LUMO gap. Our work thus constitutes a successful application of very deep GNNs, providing evidence of the positive utility of such models. We note that, while most prior works on GNN modelling seldom use more than eight steps of message passing (brockschmidt2020gnn), we observe monotonic improvements from deeper GNNs on this task, all the way to 32 layers, at which point the validation performance plateaus.

Non-conformer model

Recalling our prior discussion about conformer features occasionally not being trivially computable, we also trained a GN which does not exploit conformer-based features. While we observe largely the same trends, we find that non-conformer models tend to allow for even deeper and wider GNNs before plateauing. Namely, our optimised non-conformer GNN computes 1,024-dimensional hidden features in every MLP, and iterates Equations 9 to 11 for 50 message passing steps. Such a model performed marginally worse than the conformer GNN overall, while significantly improving the mean absolute error (MAE) on the of validation molecules without conformers.

Denoising objective

Our very deep GNNs have, in the first instance, been enabled by careful regularisation. By far, the most impactful method for our GNN regressor on PCQM4M has been Noisy Nodes (godwin2021very), and our results largely echo the findings therein.

The main observation of Noisy Nodes is that very deep GNNs can be strongly regularised by appropriate denoising objectives. Noisy Nodes perturbs the input node or edge features in a pre-specified way, then requires the decoder to reconstruct the un-perturbed information from the GNN’s latent representations.

In the case of the flat input features, we have deployed a Noisy Nodes objective on both atom types and bond types: randomly replacing each atom and each bond type with a uniformly sampled one, with probability . The model then performs node/edge classification based on the final latents (e.g., , for the conformer GNN), to reconstruct the initial types. Requiring the model to correctly infer and rectify such noise is implicitly imbuing it with knowledge of chemical constraints, such as valence, and is a strong empirical regulariser. Note that, in this discrete-feature setting, Noisy Nodes can be seen as a more general case of the BERT-like objectives from hu2019strategies. The main difference is that Noisy Nodes takes a more active role in requiring denoising—as opposed to unmasking, where it is known upfront which nodes have been noised, and the effects of noising are always predictable.
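The discrete corruption step can be sketched as below: each categorical value is independently replaced, with some probability, by a category drawn uniformly at random, and the clean values become the reconstruction targets at every position. The function name, the number of categories and the noise probability used here are placeholders, not the paper's settings.

```python
import numpy as np


def corrupt_categories(values, num_categories, noise_prob, rng):
    """Replace each entry with a uniformly sampled category with probability `noise_prob`."""
    values = np.asarray(values)
    replace = rng.random(values.shape) < noise_prob
    random_values = rng.integers(0, num_categories, size=values.shape)
    # The uniform draw may coincide with the original value, so the model
    # cannot assume that a resampled position always looks different.
    return np.where(replace, random_values, values), replace


rng = np.random.default_rng(0)
atom_types = np.array([6, 6, 8, 7, 6, 1, 1])  # toy atomic numbers
noisy_types, was_resampled = corrupt_categories(atom_types, num_categories=10, noise_prob=0.3, rng=rng)
targets = atom_types  # classification targets are the clean types at all positions

untouched, _ = corrupt_categories(atom_types, num_categories=10, noise_prob=0.0, rng=rng)
all_resampled, _ = corrupt_categories(atom_types, num_categories=10, noise_prob=1.0, rng=rng)
```

Unlike a masking scheme, the corrupted positions are not flagged in the model input, so the network has to detect as well as repair the noise.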

When conformers or displacements are available, a richer class of denoising objectives may be imposed on the GNN. Namely, it is possible to perturb the individual nodes’ coordinates slightly, and then require the network to reconstruct the original displacement and/or distances—this time using edge regression on the output latents of the processor GNN. The Noisy Nodes manuscript had shown that, under such perturbations, it is possible to achieve state-of-the-art results on quantum chemical calculations without requiring an explicitly equivariant architecture—only a very deep traditional GNN. Our preliminary results indicate a similar trend on the PCQM4M dataset.

Training and regularisation

We train our GNN to minimise the mean absolute error (MAE) for predicting the DFT-simulated HOMO-LUMO gap based on the decoded global latent vectors. This objective is combined with any auxiliary tasks imposed by noisy nodes (e.g. cross-entropy on reconstructing atom and bond types, MAE on regressing denoised displacements). We use the Adam SGD optimiser (kingma2014adam) with hyperparameters , . We use a cosine learning rate schedule with initial learning rate and warm-up steps, peaking at , and decaying over training iterations. We optimise over dynamically-batched data: we fill each training minibatch until exceeding any of the following limits: atoms, bonds, or molecules.
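A warm-up followed by cosine decay can be sketched as a plain function of the step count. Linear warm-up, the final learning rate of zero and every numeric value below are assumptions for illustration only.

```python
import math


def warmup_cosine_lr(step, init_lr, peak_lr, warmup_steps, total_steps, end_lr=0.0):
    """Linear warm-up from `init_lr` to `peak_lr`, then cosine decay to `end_lr`."""
    if step < warmup_steps:
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return end_lr + 0.5 * (peak_lr - end_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder values, not the paper's hyperparameters.
schedule = dict(init_lr=1e-5, peak_lr=1e-4, warmup_steps=100, total_steps=1000)
lr_start = warmup_cosine_lr(0, **schedule)
lr_peak = warmup_cosine_lr(100, **schedule)
lr_mid = warmup_cosine_lr(550, **schedule)
lr_end = warmup_cosine_lr(1000, **schedule)
```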

To regularise our model, we perform early stopping on the validation MAE, and apply feature dropout (srivastava2014dropout) (with ) and DropEdge (rong2019dropedge) (with ) at every message passing layer.


At evaluation time, we exploit several known facts about the HOMO-LUMO gap, and our conformer generation procedure, to achieve “free” reductions in MAE.

Firstly, it is known that the HOMO-LUMO gap cannot be negative, and that it is possible for our model to make (very rare) vastly inflated predictions on validation data if it encounters an out-of-distribution molecule. We ameliorate both of these issues by clipping the network’s predictions in the range.

Secondly, as discussed, for a very small fraction ( of molecules), RDKit was unable to compute conformers. We found that it was useful to fall back to the 50-layer non-conformer GNN in these cases, rather than assuming a default value. The observed reductions in MAE were significant across those specific validation molecules only.
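The two post-processing rules above (falling back to the non-conformer model, and clipping) can be combined into one small function. This is a sketch under assumptions: the bounds `lower` and `upper` are placeholders for the clipping range, and applying the clip to the fallback prediction as well is our own choice for illustration.

```python
def final_prediction(conformer_pred, fallback_pred, lower, upper):
    """Choose the conformer model's output when it exists, then clip.

    `conformer_pred` is None for molecules without a computed conformer;
    in that case the non-conformer model's prediction is used instead.
    """
    raw = fallback_pred if conformer_pred is None else conformer_pred
    return min(max(raw, lower), upper)

# Hypothetical bounds, chosen only to make the example concrete.
pred = final_prediction(conformer_pred=None, fallback_pred=3.0, lower=0.0, upper=20.0)
```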

Finally, we consistently track the exponential moving average (EMA) of our model’s parameters (with decay rate ), and use it for evaluation. EMA parameters are generally known to be more stable than their online counterparts, an observation that held in our case as well.
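The EMA update itself is a one-line rule applied after every optimiser step. The sketch below uses plain Python dictionaries of scalars in place of real parameter arrays, and the decay value in the example is an arbitrary placeholder.

```python
def ema_update(ema_params, online_params, decay):
    """Return decay * ema + (1 - decay) * online, parameter by parameter."""
    return {
        name: decay * ema_params[name] + (1.0 - decay) * online_params[name]
        for name in ema_params
    }

ema = {"w": 1.0, "b": 0.0}
online = {"w": 0.0, "b": 2.0}      # parameters after one optimiser step
ema = ema_update(ema, online, decay=0.9)
# `ema`, not `online`, would then be used to produce evaluation predictions.
```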

6 Ensembling and training on validation

Once we established the top single-model architectures for both our MAG240M and PCQM4M entries, we found it very important to perform two post-processing steps: (a) re-train on the validation set, (b) ensemble various models together.

Re-training on validation data offers a great additional wealth of learning signal, even just by the sheer volume of data available in the OGB-LSC. But aside from this, the way in which the data was split offers even further motivation. On MAG240M, for example, the temporal split implies that validation papers (from 2019) are most relevant to classifying test papers (from 2020)—simply put, because they both correspond to the latest trends in scholarship.

However, training on the full validation set comes with a potentially harmful drawback: no held-out dataset would remain to early-stop on. In a setting where overfitting can easily occur, we found the risk to vastly outweigh the rewards. Instead, we decided to randomly partition the validation data into k equally-sized folds, and perform a cross-validation-style setup: we train k different models, each one observing the training set and k-1 validation folds as its training data, validating and early stopping on the held-out fold. Each model holds out a different fold, allowing us to get an overall validation estimate over the entire dataset by combining their respective predictions.
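The fold construction can be sketched as follows. The helper names, the seed and the number of folds used in the example are illustrative assumptions; fold sizes differ by at most one element when the validation set does not divide evenly.

```python
import random

def validation_folds(validation_ids, k, seed):
    """Shuffle the validation ids and deal them into k near-equal folds."""
    ids = list(validation_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]

def fold_splits(train_ids, folds):
    """Yield (fit_ids, held_out_ids), holding out one fold at a time."""
    for held_out in range(len(folds)):
        fit_ids = list(train_ids)
        for j, fold in enumerate(folds):
            if j != held_out:
                fit_ids.extend(fold)
        yield fit_ids, folds[held_out]

folds = validation_folds(range(20), k=5, seed=0)
splits = list(fold_splits(train_ids=[100, 101], folds=folds))
```

Each model is fitted on `fit_ids` and early-stopped on its own `held_out_ids`; concatenating the held-out predictions across models covers the whole validation set exactly once.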

While this approach may not correspond to the intended dataset splits, we have verified that the scores on individual held-out folds match the patterns observed on models that did not observe any validation data. This gave us further reassurance that no unintended strong overfitting had happened as a result.

Another useful outcome of our k-fold approach is that it allowed us a very natural way to perform ensembling as well: simply aggregating all of the models’ predictions would give us a certain mixture of experts, as each of the models had been trained on a slightly modified training set. Our final ensembled models employ exactly this strategy, with the inclusion of two seeds per fold. This brings our overall number of ensembled models to 20, and these ensembles correspond to our final submissions on both MAG240M and PCQM4M.
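The text does not specify how predictions are aggregated; a simple assumption is to average them. The sketch below averages scalar predictions for the regression task and class probabilities for the classification task. Both function names are hypothetical.

```python
def ensemble_regression(predictions):
    """predictions[m][i]: model m's scalar prediction for molecule i."""
    n_models = len(predictions)
    return [sum(column) / n_models for column in zip(*predictions)]

def ensemble_classification(probabilities):
    """probabilities[m][i][c]: model m's probability of class c for node i."""
    n_models = len(probabilities)
    labels = []
    for per_node in zip(*probabilities):
        # Average each class probability over the models, then take the argmax.
        mean = [sum(per_class) / n_models for per_class in zip(*per_node)]
        labels.append(max(range(len(mean)), key=mean.__getitem__))
    return labels

gaps = ensemble_regression([[1.0, 2.0], [3.0, 4.0]])
classes = ensemble_classification([
    [[0.9, 0.1], [0.2, 0.8]],   # model 1
    [[0.3, 0.7], [0.1, 0.9]],   # model 2
])
```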

7 Experimental evaluation

In this section we provide experimental evidence to substantiate the various claims we have made about the key modifications in our model, hoping to advise future research on large scale graph representation learning. To eliminate any possible confounding effects of ensembling, all results reported in this section will be on a single model, evaluated on the provided validation data. We report average performance and standard deviation over three seeds.


We will follow the plots in Figures 1 and 2, which seek to uncover various contributing factors to our model’s ultimate performance. We proceed one claim at a time.

Making networks deeper than the patch diameter can help. We find that making the edges in every subsampled patch bidirectional allowed for doubling the message passing steps (to four) with a significant validation accuracy improvement, in spite of the fact that the MPNN was now deeper than the patch diameter. See Figure 1 (left).

Ensembling over multiple subsamples helps. We find that averaging our network’s prediction over several randomly subsampled patches at evaluation time consistently improved performance. See Figure 1 (middle-left).

Using training labels as features helps. On transductive tasks, we confirm that using the training node label as an additional feature provides a substantial boost to validation performance, if done carefully. See Figure 1 (middle-right).
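One plausible way to do this carefully is to hide the label of the node currently being classified, so that the target cannot leak into its own input. The sketch below illustrates that idea only; it is an assumption about what "carefully" involves, not a description of the authors' exact procedure, and all names are hypothetical.

```python
def label_features(patch_nodes, centre, train_labels, num_classes):
    """One-hot label features for a patch, with the centre node's label hidden.

    `train_labels` maps node id -> class for labelled training nodes only;
    unlabelled nodes and the centre node receive an all-zero vector.
    """
    features = []
    for node in patch_nodes:
        one_hot = [0.0] * num_classes
        label = train_labels.get(node)
        if label is not None and node != centre:
            one_hot[label] = 1.0
        features.append(one_hot)
    return features

feats = label_features(patch_nodes=[1, 2, 3], centre=1,
                       train_labels={1: 0, 2: 1}, num_classes=2)
```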

Larger patches help. Providing the model with a larger context (by subsampling more neighbours) proved significantly helpful to our downstream performance. See Figure 1 (right).

Self-supervised objectives help—especially BGRL. We first validate that combining a traditional cross-entropy loss with a self-supervised loss is beneficial to final performance observed. Further, we show that BGRL (thakoor2021bootstrapped) can significantly outperform GRACE (zhu2020deep) in the large-scale regime. See Figure 2 (left).

Self-supervised learning on unlabelled nodes helps. One of the major promises of self-supervised learning is allowing access to a vast quantity of unlabelled nodes, which now can be used as targets. We recover consistent, monotonic gains from incorporating increasing amounts of unlabelled nodes within our training routine. See Figure 2 (middle).

Self-supervised learning allows for more robust models. Finally, the regularising effect of self-supervised learning means that we can train our models for longer without suffering any overfitting effects. See Figure 2 (right).

Figure 1: Ablation studies on our final MAG240M-LSC model, covering aspects of the model depth (left), validation-time sample ensembling (middle-left), using labels as features (middle-right), and subsampling strategy (right).
Figure 2: Ablation studies on self-supervised learning within our MAG240M-LSC entry, showing the influence of various self-supervised objectives (left), using unlabelled nodes as targets (middle) and running for longer (right). Please note, the left-most plot covers training over 50,000 steps, while the other two cover training over 500,000 steps.


We follow Figure 3, which investigates key design aspects in our PCQM4M-LSC models.

Using conformer-based features helps. Utilising features based on RDKit conformers, in the manner described before, proved beneficial to final performance. Note that the gains over our 50-layer non-conformer model are irrelevant, given that the non-conformer model is only applied over molecules where conformers cannot be computed. See Figure 3 (top-left and bottom-left).

Deeper models help. We demonstrate consistent, monotonic gains for larger numbers of message passing steps, at least up to 32 layers—and in the case of the non-conformer model, up to 50 layers. See Figure 3 (top-middle-left and bottom-middle-left).

Noisy Nodes help. Lastly, we show that the regulariser proposed in Noisy Nodes (godwin2021very) proved very effective for this quantum-chemical task as well. It was the key behind the monotonic improvements of our models with depth. Note, for example, that removing Noisy Nodes from our best performing model makes its performance comparable with that of models at most half as deep. See Figure 3 (top-middle-right and bottom-middle-right).

Wider message functions help. Towards the end of the contest, we noted that performance gains are possible when favouring wider message functions (in terms of the hidden size of their MLP layers) over a larger latent size for the GNN. We subsequently noticed that such a regime (256 latent dimensions, 1,024-dimensional hidden layers) consistently improved our non-conformer model as well. See Figure 3 (top-right and bottom-right).

Figure 3: Ablation studies on our final PCQM4M-LSC models, covering aspects of conformer usage (left), message passing steps (middle-left), Noisy Nodes regularisation (middle-right) and latent/hidden dimensions (right). Results shown both for our conformer (top) and non-conformer model (bottom). * indicates our final model’s chosen hyperparameter.

8 Results and Discussion

Our final ensembled models achieved a validation accuracy of 77.10% on MAG240M, and a validation MAE of 0.110 on PCQM4M. Translated to the LSC test sets, we recover 75.19% test accuracy on MAG240M and 0.1205 test MAE on PCQM4M. We incur a minimal amount of distribution shift, which is a testament to our principled ensembling and post-processing strategies, in spite of using labels as inputs for MAG240M or training on validation for both tasks.

Our entries have been designated as awardees (ranked in the top-3) on both MAG240M and PCQM4M, solidifying the impact that very deep expressive graph neural networks can have on large-scale datasets of industrial and scientific relevance. Further, we demonstrate how several recently proposed auxiliary objectives for GNN training, such as BGRL (thakoor2021bootstrapped) and Noisy Nodes (godwin2021very), can both be highly impactful at the right dataset scales. We hope that our work serves towards resolving several open disputes in the community, such as the utility of very deep GNNs, and the influence of self-supervision in this setting.

In many ways, the OGB has been to graph representation learning what ImageNet has been to computer vision. We hope that OGB-LSC is only the first in a series of events designed to drive research on GNN architectures forward, and sincerely thank the OGB team for all their hard work and effort in making a contest of this scale possible and accessible.