1 Introduction
Effective high-dimensional representation learning necessitates properly exploiting the geometry of data (bronstein2021geometric)—otherwise, it is a cursed estimation problem. Indeed, early success stories of deep learning relied on imposing strong geometric assumptions, primarily that the data lives on a grid domain, either spatial or temporal. In these two respective settings, convolutional neural networks (CNNs) (lecun1998gradient) and recurrent neural networks (RNNs) (hochreiter1997long) have traditionally dominated. While both CNNs and RNNs are demonstrably powerful models, with many applications of high interest, it can be recognised that most data coming from nature cannot be natively represented on a grid. Recent years are marked by a gradual shift of attention towards models that admit a more generic class of geometric structures (masci2015geodesic; velivckovic2017graph; cohen2018spherical; battaglia2018relational; de2020gauge; satorras2021n).
In many ways, the most generic and versatile of these models are graph neural networks (GNNs). This is due to the fact that most discrete-domain inputs can be observed as instances of a graph structure. The corresponding area of graph representation learning (hamilton2020graph) has already seen immense success across industrial and scientific disciplines. GNNs have successfully been applied for drug screening (stokes2020deep), modelling the dynamics of glass (bapst2020unveiling), web-scale social network recommendations (ying2018graph) and chip design (mirhoseini2020chip).
While the above results are certainly impressive, they likely only scratch the surface of what is possible with well-tuned GNN models. Many problems of real-world interest require graph representation learning at scale: either in terms of the number of graphs to process, or their sizes (in terms of numbers of nodes and edges). Perhaps the clearest motivation for this comes from the Transformer family of models (vaswani2017attention). Transformers operate a self-attention mechanism over a complete graph, and can hence be observed as a specific instance of GNNs (joshi2020transformers). At very large scales of natural language data, Transformers have demonstrated significant returns with the increase in capacity, as exemplified by models such as GPT-3 (brown2020language). Transformers enjoy favourable scalability properties at the expense of their functional complexity: each node's features are updated with weighted sums of neighbouring node features. In contrast, GNNs that rely on message passing (gilmer2017neural)—passing vector signals across edges that are conditioned on both the sender and receiver nodes—are an empirically stronger class of models, especially on tasks requiring complex reasoning (velivckovic2019neural) or simulations (sanchez2020learning; pfaff2020learning).

One reason why generic message-passing GNNs have not been scaled up as widely as Transformers is the lack of appropriate datasets. Only recently has the field advanced from simple transductive benchmarks of only a few thousand nodes (sen2008collective; shchur2018pitfalls; morris2020tudataset) towards larger-scale real-world and synthetic benchmarks (dwivedi2020benchmarking; hu2020open), but important issues still remain. For example, on many of these tasks, randomly-initialised GNNs (velivckovic2018deep), shallow GNNs (wu2019simplifying) or simple label-propagation-inspired GNNs (huang2020combining) can perform at a near state-of-the-art level with only a fraction of the parameters. When most bleeding-edge expressive methods are unable to improve on the above, this often leads to controversial discussion in the community. One common example is: do we even need deep, expressive GNNs?
Breakthroughs in deep learning research have typically been spearheaded by impactful large-scale competitions. For image recognition, the most famous example is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (russakovsky2015imagenet). In fact, the very "deep learning revolution" was partly kickstarted by the success of the AlexNet CNN model of krizhevsky2012imagenet at the ILSVRC 2012, firmly establishing deep CNNs as the workhorse of image recognition for the forthcoming decade.

Accordingly, we have entered the recently proposed Open Graph Benchmark Large-Scale Challenge (OGB-LSC) (hu2021ogb). OGB-LSC provides graph representation learning tasks at a previously unprecedented scale—millions of nodes, billions of edges, and/or millions of graphs. Further, the tasks have been designed with immediate practical relevance in mind, and it has been verified that expressive GNNs are likely to be necessary for strong performance. Here we detail our two submitted models (for the MAG240M and PCQM4M tasks, respectively), and our empirical observations while developing them. Namely, we find that the datasets' immense scale provides a great platform for demonstrating clear outperformance of very deep GNNs (godwin2021very), as well as self-supervised GNN setups such as bootstrapping (thakoor2021bootstrapped). In doing so, we have provided meaningful evidence towards a positive resolution to the above discussion: deep and expressive GNNs are, indeed, necessary at the right level of task scale and/or complexity. Our final models have achieved award-level (top-3) ranking on both MAG240M and PCQM4M.
2 Dataset description
The MAG240M-LSC dataset is a transductive node classification dataset, based on the Microsoft Academic Graph (MAG) (wang2020microsoft). It is a heterogeneous graph containing paper, author and institution nodes, with edges representing the relations between them: paper-cites-paper, author-writes-paper, author-affiliated-with-institution. All paper nodes are endowed with 768-dimensional input features, corresponding to the RoBERTa sentence embedding (liu2019roberta; reimers2019sentence) of their title and abstract. MAG240M is currently the largest-scale publicly available node classification dataset by a wide margin, at 240 million nodes and 1.8 billion edges. The aim is to classify the 1.4 million arXiv papers into their corresponding topics, according to a temporal split: papers published up to 2018 are used for training, with validation and test sets including papers from 2019 and 2020, respectively.
The PCQM4M-LSC dataset is an inductive graph regression dataset based on the PubChemQC project (nakata2017pubchemqc). It consists of 4 million small molecules (described by their SMILES strings). The aim is to accelerate quantum-chemical computations: specifically, to predict the HOMO-LUMO gap of each molecule. The HOMO-LUMO gap is one of the most important quantum-chemical properties, since it is related to the molecules' reactivity, photoexcitation, and charge transport. The ground-truth labels for every molecule were obtained through expensive DFT (density functional theory) calculations, which may take several hours per molecule. It is believed that machine learning models, such as GNNs over the molecular graph, may obtain useful approximations to DFT at only a fraction of the computational cost, if provided with sufficient training data (gilmer2017neural). The molecules are split with an 80:10:10 ratio into training, validation and test sets, based on their PubChem ID.

3 GNN Architectures
For both of the tasks above, we rely on a common encode-process-decode blueprint (hamrick2018relational). This implies that our input features are encoded into a latent space using node-, edge- and graph-wise encoder functions, and latent features are decoded to node-, edge- and graph-level predictions using appropriate decoder functions. The bulk of the computational processing is powered by a processor network, which performs multiple graph neural network layers over the encoded latents.
To formalise this, assume that our input graph, $G = (V, E)$, has node features $x_u$, edge features $x_{uv}$ and graph-level features $x_g$, for nodes $u \in V$ and edges $(u, v) \in E$. Our encoder functions $f_n$, $f_e$ and $f_g$ then transform these inputs into the latent space:

$h_u^{(0)} = f_n(x_u), \qquad h_{uv}^{(0)} = f_e(x_{uv}), \qquad h_g^{(0)} = f_g(x_g)$  (1)

Our processor network then transforms these latents over several rounds of message passing:

$H^{(t+1)} = P\left(H^{(t)}\right)$  (2)

where $H^{(t)} = \left(\{h_u^{(t)}\}_{u \in V}, \{h_{uv}^{(t)}\}_{(u,v) \in E}, h_g^{(t)}\right)$ contains all of the latents at a particular processing step $t$.

The processor network is iterated for $T$ steps, recovering final latents $H^{(T)}$. These can then be decoded into node-, edge-, and graph-level predictions (as required), using analogous decoder functions $g_n$, $g_e$ and $g_g$:

$y_u = g_n\left(h_u^{(T)}\right), \qquad y_{uv} = g_e\left(h_{uv}^{(T)}\right), \qquad y_g = g_g\left(h_g^{(T)}\right)$  (3)
We will detail the specific design of $f$, $P$ and $g$ in the following sections. Generally, $f$ and $g$ are simple MLPs, whereas we use highly expressive GNNs for $P$ in order to maximise the advantage of the large-scale datasets. Specifically, we use message passing neural networks (MPNNs) (gilmer2017neural) and graph networks (GNs) (battaglia2018relational). All of our models have been implemented using the jraph library (jraph2020github).
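As an illustration, the encode-process-decode blueprint can be sketched as follows (a minimal NumPy stand-in: the linear maps, widths and step count are placeholders, and our actual models are implemented with jraph in JAX):

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(d_in, d_out):
    # A random linear map standing in for a learned MLP.
    W = rng.normal(scale=0.1, size=(d_in, d_out))
    return lambda x: x @ W

# Encoders f_n, f_e, f_g map raw features into a shared latent width.
f_node, f_edge, f_graph = linear(8, 16), linear(4, 16), linear(2, 16)
# Decoder g_n maps final node latents to per-node predictions.
g_node = linear(16, 3)

def processor_step(h_nodes, h_edges, h_graph):
    # Placeholder for one round of message passing (identity here).
    return h_nodes, h_edges, h_graph

def encode_process_decode(x_nodes, x_edges, x_graph, steps=4):
    h_n, h_e, h_g = f_node(x_nodes), f_edge(x_edges), f_graph(x_graph)
    for _ in range(steps):
        h_n, h_e, h_g = processor_step(h_n, h_e, h_g)
    return g_node(h_n)

preds = encode_process_decode(
    rng.normal(size=(5, 8)),   # 5 nodes, 8 input features each
    rng.normal(size=(7, 4)),   # 7 edges, 4 input features each
    rng.normal(size=(1, 2)))   # graph-level features
```

The real processor replaces the identity step with the MPNN or GN layers described below.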
4 MAG240M-LSC
Subsampling
Running graph neural networks over datasets that are even a fraction of MAG240M’s scale is already prone to multiple scalability issues, which necessitated either aggressive subsampling (hamilton2017inductive; chen2018fastgcn; zeng2019graphsaint; zou2019layer), graph partitioning (liao2018graph; chiang2019cluster) or less expressive GNN architectures (rossi2020sign; bojchevski2020scaling; yu2020scalable).
As we would like to leverage expressive GNNs, and be able to pass messages across any partitions, we opted for the subsampling approach. Accordingly, we subsample moderately-sized patches around the nodes we wish to compute latents for, execute our GNN model over them, and use the latents of the central nodes to train or evaluate the model.
We adapt the standard GraphSAGE subsampling algorithm of hamilton2017inductive, but make several modifications to it in order to optimise it for the specific features of MAG240M. Namely:

We perform separate subsampling procedures across edge types. For example, an author node will separately sample a pre-specified number of papers written by that author and a pre-specified number of institutions that author is affiliated with.

GraphSAGE mandates sampling an exact number of neighbours for every node, and uses sampling with replacement to achieve this even when the neighbourhood size is variable. We find this to be wasteful for smaller neighbourhoods, and hence use our pre-specified neighbour counts only as an upper bound. Denoting this upper bound as $K$, and node $u$'s original neighbourhood as $\mathcal{N}_u$, we proceed as follows (note that, per the previous bullet point, $K$ and $\mathcal{N}_u$ are defined on a per-edge-type basis):

For nodes that have fewer neighbours of a particular type than the upper bound ($|\mathcal{N}_u| \leq K$), we simply take the entire neighbourhood, without any subsampling;

For nodes that have a moderate number of neighbours ($|\mathcal{N}_u| > K$, up to a pre-specified cap), we subsample $K$ neighbours without replacement, hence we do not wastefully duplicate nodes when the memory costs are reasonable;

For all other nodes (with very large neighbourhoods), we resort to the usual GraphSAGE strategy, and sample $K$ neighbours with replacement, which does not require an additional row-copy of the adjacency matrix.


GraphSAGE directs the edges in the patch from the subsampled neighbours to the node which sampled them, and runs its GNN for exactly as many steps as the sampling depth. We instead modify the message passing update rule to scalably make the edges bidirectional, which naturally allows us to run deeper GNNs over the patch. The exact way in which we performed this is detailed in the model architecture.
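The capped, three-regime neighbour sampling described above can be sketched as follows (`k` and `moderate_cap` are hypothetical placeholder values; the real pipeline applies this rule separately per edge type):

```python
import random

def sample_neighbours(neighbourhood, k, moderate_cap, rng=None):
    """Sample up to k neighbours, treating k as an upper bound.

    Three regimes, mirroring the strategy in the text: small
    neighbourhoods are kept whole, moderate ones are sampled without
    replacement, and very large ones fall back to standard GraphSAGE
    sampling with replacement.
    """
    rng = rng or random.Random(0)
    n = len(neighbourhood)
    if n <= k:
        return list(neighbourhood)            # no subsampling needed
    if n <= moderate_cap:
        return rng.sample(neighbourhood, k)   # without replacement
    return [rng.choice(neighbourhood) for _ in range(k)]  # with replacement

small = sample_neighbours(list(range(3)), k=5, moderate_cap=100)
mid = sample_neighbours(list(range(50)), k=5, moderate_cap=100)
big = sample_neighbours(list(range(10_000)), k=5, moderate_cap=100)
```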
Taking all of the above into account, our model's subsampling strategy proceeds from paper nodes as central nodes, up to a depth of two (sufficient for institution nodes to be included). We did not observe significant benefits from sampling deeper patches. Instead, we sample significantly larger patches than the original GraphSAGE paper, to exploit the wide context available for many nodes:

Depth 0 contains the chosen central paper node.

At depth 1, we sample up to a pre-specified number of citing papers, cited papers, and authors for this paper.

At depth 2, we sample according to the following strategy, for all paper and author nodes sampled at depth 1:

For papers, an identical strategy as at depth 1: up to a pre-specified number of cited papers, citing papers, and authors;

For authors, we sample up to a pre-specified number of written papers and affiliated institutions.

Overall, this substantially inflates our maximal patch size, making our patches comparable in size to traditional full-graph datasets (sen2008collective; shchur2018pitfalls). Coupled with the fact that MAG240M has hundreds of millions of papers to sample these patches from, our setting enables transductive node classification at previously unexplored scale. We found that such large patches were indeed necessary for our model's performance.
One final important remark for MAG240M subsampling concerns the existence of duplicated paper nodes—i.e. nodes with exactly the same RoBERTa embeddings. This likely corresponds to identical papers submitted to different venues (e.g. conference, journal, arXiv). For the purposes of enriching our subsampled patches, we have combined the adjacency matrix rows and columns to “fuse” all versions of duplicated papers together.
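One possible way to implement the fusing step, sketched over an edge list rather than the full adjacency matrices (the embedding-to-key hashing is our own simplification):

```python
def fuse_duplicates(embeddings, edges):
    """Map every node with an identical embedding to one canonical id,
    then rewrite the edge list accordingly, dropping any self-loops
    that fusing may create."""
    canonical = {}   # embedding key -> canonical node id
    remap = {}       # original node id -> canonical node id
    for node, emb in enumerate(embeddings):
        key = tuple(emb)
        canonical.setdefault(key, node)
        remap[node] = canonical[key]
    fused_edges = {(remap[u], remap[v]) for u, v in edges if remap[u] != remap[v]}
    return remap, sorted(fused_edges)

# Nodes 0 and 2 share an embedding, so node 2's edges fold into node 0.
remap, fused = fuse_duplicates(
    embeddings=[(0.1, 0.2), (0.3, 0.4), (0.1, 0.2)],
    edges=[(0, 1), (2, 1), (2, 0)])
```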
Input preprocessing
As just described, we seek to support execution of expressive GNNs on large quantities of large-scale subsampled patches. This places further stress on the model from a computational and storage perspective. Accordingly, we found it very useful to further compress the input nodes' RoBERTa features. Our qualitative analysis demonstrated that their 129-dimensional PCA projections already account for 90% of their variance. Hence we leverage these PCA vectors as the actual input paper node features.
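The compression step can be reproduced with a standard SVD-based PCA (a sketch on synthetic data; only the 90%-variance threshold comes from the text above):

```python
import numpy as np

def pca_compress(X, variance_kept=0.9):
    """Project rows of X onto the fewest principal components that
    retain the requested fraction of total variance."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    # First index where cumulative explained variance reaches the target.
    k = int(np.searchsorted(np.cumsum(explained), variance_kept)) + 1
    return Xc @ Vt[:k].T, k

rng = np.random.default_rng(0)
# Synthetic low-rank stand-in for the 768-dimensional RoBERTa embeddings.
X = rng.normal(size=(1000, 64)) @ rng.normal(size=(64, 768))
Z, k = pca_compress(X, variance_kept=0.9)
```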
Further, only the paper nodes are actually provided with any features. We adopt the identical strategy from the baseline LSC scripts provided by hu2021ogb to featurise the authors and institutions. Namely, for authors, we use the average PCA features across all papers they wrote. For institutions, we use the average features across all the authors affiliated with them. We found this to be a simple and effective strategy that performed empirically better than using structural features. This is contrary to the findings of yu2020scalable, probably because we use a more expressive GNN.
Besides the PCA-based features, our input node features also contain a one-hot representation of the node's type (paper/author/institution), the node's depth in the sampled patch (0/1/2), and a bitwise representation of the paper's publication year (zeroed out for other node types). Lastly, in line with an increasing body of research that argues for the utility of labels in transductive node classification tasks (zhu2002learning; stretcu2019graph; huang2020combining), we use the arXiv paper labels as features (wang2021bag) (zeroed out for other nodes). We make sure that the validation labels are not observed at training time, and that the central node's own label is not provided. It is possible to re-sample the central node at depth 2, and we make sure to mask out its label if this happens.
We also endow the patches' edges with a simple edge-type feature: a 7-bit binary feature, where the first three bits indicate the one-hot type of the sampling node (paper, author or institution) and the next four bits indicate the one-hot type of the sampled neighbour (cited paper, citing paper, author or institution). We found running a standard GNN over these edge-type features more performant than running a heterogeneous GNN—once again contrary to existing baseline results (hu2021ogb), and likely because of the expressivity of our processor GNN.
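A sketch of constructing this 7-bit feature (the ordering of bit positions within each one-hot group is our own illustrative choice):

```python
SAMPLER_TYPES = ["paper", "author", "institution"]  # first 3 bits
NEIGHBOUR_TYPES = ["cited_paper", "citing_paper", "author", "institution"]  # next 4 bits

def edge_type_feature(sampler, neighbour):
    """Build the 7-bit edge-type feature: a one-hot of the sampling
    node's type concatenated with a one-hot of the sampled neighbour's
    relation type."""
    bits = [0] * 7
    bits[SAMPLER_TYPES.index(sampler)] = 1
    bits[3 + NEIGHBOUR_TYPES.index(neighbour)] = 1
    return bits

feat = edge_type_feature("paper", "citing_paper")
```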
Model architecture
For the GNN architecture we have used on MAG240M, our encoders and decoders are both two-layer MLPs, with a hidden size of 512 features. The node and edge encoders' output layers compute 256 features, and we retain this dimensionality for $h_u^{(t)}$ and $h_{uv}^{(t)}$ across all steps $t$.
Our processor network is a deep message-passing neural network (MPNN) (gilmer2017neural). It computes message vectors, $m_{uv}$, to be sent across the edge $(u, v)$, and then aggregates them in the receiver nodes as follows:

$m_{uv}^{(t+1)} = \psi_t\left(h_u^{(t)}, h_v^{(t)}, h_{uv}^{(0)}\right)$  (4)

$h_u^{(t+1)} = \phi_t\left(h_u^{(t)}, \sum_{(v,u) \in E} m_{vu}^{(t+1)}, \sum_{(u,v) \in E} m_{uv}^{(t+1)}\right)$  (5)

Taken together, Equations 4–5 fully specify the operations of the network in Equation 2. The message function $\psi_t$ and the update function $\phi_t$ are both two-layer MLPs, with identical hidden and output sizes to the encoder network. We note two specific aspects of the chosen MPNN:

We did not find it useful to use global latents or to update edge latents (Equation 4 uses $h_{uv}^{(0)}$ at all times and does not include $h_g$). This is likely due to the fact that the prediction is strongly centred at the central node, and that the edge features and types do not encode additional information.

Note the third input in Equation 5, which is not usually included in MPNN formulations. In addition to pooling all incoming messages, we also pool all outgoing messages a node sends, and concatenate that onto the input to the sender node's update function. This let us simulate bidirectional edges without introducing additional scalability issues, and allowed us to prototype MPNNs whose depth exceeds the subsampling depth.
The process is repeated for $T$ message-passing layers, after which the latent $h_c^{(T)}$ for the central node $c$ is sent to the decoder network for predictions.
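One message-passing step with the additional outgoing-message pooling can be sketched as follows (random linear maps stand in for the learned two-layer MLPs, and all shapes are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # latent width (placeholder)
W_msg = rng.normal(scale=0.1, size=(3 * D, D))  # stands in for the MLP psi
W_upd = rng.normal(scale=0.1, size=(3 * D, D))  # stands in for the MLP phi

def mpnn_step(h, h_edge0, senders, receivers):
    """One message-passing step where each node aggregates BOTH its
    incoming and its outgoing messages, simulating bidirectional edges
    without duplicating the edge list."""
    msgs = np.concatenate([h[senders], h[receivers], h_edge0], axis=1) @ W_msg
    incoming = np.zeros_like(h)
    outgoing = np.zeros_like(h)
    np.add.at(incoming, receivers, msgs)  # sum of messages arriving at each node
    np.add.at(outgoing, senders, msgs)    # sum of messages each node sent out
    return np.concatenate([h, incoming, outgoing], axis=1) @ W_upd

h = rng.normal(size=(4, D))        # 4 nodes
h_edge0 = rng.normal(size=(3, D))  # 3 directed edges
senders, receivers = np.array([0, 1, 2]), np.array([1, 2, 3])
h_next = mpnn_step(h, h_edge0, senders, receivers)
```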
Bootstrapping objective
The non-arXiv papers within MAG240M are unlabelled and hence, under a standard node classification training regime, would contribute only implicitly to the learning algorithm (as neighbours of labelled papers). Early work on self-supervised graph representation learning (velivckovic2018deep) had already shown this could be a wasteful approach, even on small-scale transductive benchmarks. Appropriately using the unlabelled nodes can provide the model with a wealth of information about the feature and network structure, which cannot be easily recovered from supervision alone. On a dataset like MAG240M—which contains roughly 120 million unlabelled papers against only 1.4 million labelled ones—we observed significant gains from deploying such methods.
Specifically, we leverage bootstrapped graph latents (BGRL) (thakoor2021bootstrapped), a recently-proposed scalable method for self-supervised learning on graphs. Rather than contrasting several node representations across multiple views, BGRL bootstraps the GNN to make a node's embeddings predictive of its embeddings from another view, under a target GNN. The target network's parameters are always set to an exponential moving average (EMA) of the GNN parameters. Formally, let $\tilde{f}$ and $\tilde{P}$ be the target versions of the encoder and processor networks (periodically updated to the EMA of $f$ and $P$'s parameters), and let $(\tilde{X}_1, \tilde{A}_1)$ and $(\tilde{X}_2, \tilde{A}_2)$ be two views of an input patch (in terms of features, adjacency structure or both). Then, BGRL performs the following computations:

$H_1^{(T)} = (P \circ f)\left(\tilde{X}_1, \tilde{A}_1\right), \qquad \tilde{H}_2^{(T)} = (\tilde{P} \circ \tilde{f})\left(\tilde{X}_2, \tilde{A}_2\right)$  (6)

where $(P \circ f)$ is shorthand for applying Equation 1, followed by repeatedly applying Equations 4–5 for $T$ steps. The BGRL loss is then optimised to make the central node embedding $h_{1,c}^{(T)}$ predictive of its counterpart, $\tilde{h}_{2,c}^{(T)}$. This is done by projecting $h_{1,c}^{(T)}$ to another representation using a projector network, $q$, as follows:

$z_{1,c} = q\left(h_{1,c}^{(T)}\right)$  (7)

where $q$ is a two-layer MLP with identical hidden and output sizes to our encoder MLPs. We then optimise the cosine similarity between the projector output and $\tilde{h}_{2,c}^{(T)}$:

$\mathcal{L}_{\mathrm{BGRL}} = \frac{z_{1,c}^\top \tilde{h}_{2,c}^{(T)}}{\left\|z_{1,c}\right\| \left\|\tilde{h}_{2,c}^{(T)}\right\|}$  (8)
using stochastic gradient ascent. Once training is completed, the projector network is discarded.
This approach, inspired by BYOL (grill2020bootstrap), eliminates the need for crafting negative samples, reduces the storage requirements of the model, and its pointwise loss aligns nicely with our patch-wise learning setting, as we can apply the bootstrapping objective to each central node separately. All of this made BGRL a natural choice in our setting, and we found that we could easily apply it at scale.
Previously, BGRL had been applied on moderately-sized graphs with less expressive GNNs, showing modest returns. Conversely, we find that the benefits of BGRL are truly demonstrated with stronger GNNs in the large-scale setting of MAG240M. Not only did BGRL monotonically improve as we increased the proportion of unlabelled-to-labelled nodes during training; it also consistently outperformed relevant self-supervised objectives such as GRACE (zhu2020deep).
Ultimately, our submitted model is trained with an auxiliary BGRL objective, with each batch containing a fixed ratio of unlabelled to labelled node patches. Just as in the BGRL paper, we obtain the two input patch views by applying dropout (srivastava2014dropout) on the input features and DropEdge (rong2019dropedge), independently on each view. The target network ($\tilde{f}$, $\tilde{P}$) parameters are updated with a fixed EMA decay rate.
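The core of the objective reduces to a cosine similarity between a projected online embedding and the target embedding, with target parameters tracked by EMA (a schematic sketch; in the real setup the two embeddings come from differently-augmented patch views):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def bgrl_loss(z_online, h_target):
    # BGRL maximises cosine similarity; as a loss to minimise, negate it.
    return -cosine_similarity(z_online, h_target)

def ema_update(target_params, online_params, decay=0.99):
    """Move every target parameter towards its online counterpart."""
    return {k: decay * target_params[k] + (1 - decay) * online_params[k]
            for k in target_params}

rng = np.random.default_rng(0)
z, h = rng.normal(size=16), rng.normal(size=16)
loss = bgrl_loss(z, h)
target = ema_update({"w": np.zeros(4)}, {"w": np.ones(4)}, decay=0.9)
```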
Training and regularisation
We train our GNN to minimise the cross-entropy for predicting the correct topic over the labelled central nodes in the training patches, added together with the BGRL objective for the unlabelled central nodes. We use the AdamW SGD optimiser (loshchilov2017decoupled) with weight decay, and a cosine learning rate schedule with linear warmup, decayed over the course of training. Optimisation is performed over dynamically-batched data: we fill up each training minibatch with sampled patches until any of pre-specified limits on the numbers of nodes, edges, or patches would be exceeded.

To regularise our model, we perform early stopping on the accuracy over the validation dataset, and apply feature dropout and DropEdge (rong2019dropedge) at every message-passing layer of the GNN. We further apply layer normalisation (ba2016layer) to intermediate outputs of all of our MLP modules.
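The dynamic batching rule can be sketched as follows (the budget values here are hypothetical, since the actual limits are elided above):

```python
def dynamic_batches(patches, max_nodes, max_edges, max_patches):
    """Greedily fill each minibatch with patches until adding one more
    would exceed any of the node/edge/patch budgets."""
    batch, n_nodes, n_edges = [], 0, 0
    for num_nodes, num_edges in patches:
        over_budget = (batch and (n_nodes + num_nodes > max_nodes
                                  or n_edges + num_edges > max_edges
                                  or len(batch) + 1 > max_patches))
        if over_budget:
            yield batch
            batch, n_nodes, n_edges = [], 0, 0
        batch.append((num_nodes, num_edges))
        n_nodes += num_nodes
        n_edges += num_edges
    if batch:
        yield batch

# Patches described as (num_nodes, num_edges) pairs.
batches = list(dynamic_batches(
    [(100, 300), (400, 900), (250, 600), (50, 80)],
    max_nodes=512, max_edges=2048, max_patches=8))
```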
Evaluation
At evaluation time, we take advantage of the transductive and subsampled learning setup to enhance our predictions even further. First, we make sure that the model has access to all validation labels as inputs at test time, as this knowledge may be highly indicative. Further, we make sure that any "fused" copies of duplicated nodes also provide that same label as input. As our predictions are potentially conditioned on the specific topology of the subsampled patch, for each test node we average our predictions over 50 subsampled patches—an ensembling trick which consistently improved our validation performance. Lastly, given that we already use EMA as part of BGRL's target network, we use the EMA parameters for our evaluation predictions, as they are typically slightly more stable.
5 PCQM4M-LSC
Input preprocessing
For featurising our molecules within PCQM4M, we initially follow the baseline scripts provided by hu2021ogb to convert SMILES strings into molecular graphs. Therein, every node is represented by a 9-dimensional feature vector, $x_u$, including properties such as atomic number and chirality. Further, every edge is endowed with 3-dimensional features, $x_{uv}$, including bond type and stereochemistry. Mirroring prior work with GNNs for quantum-chemical computations (gilmer2017neural), we found it beneficial to maintain graph-level features $x_g$ (in the form of a "master node"), which we initialise to a fixed constant vector.
As will soon become apparent, our experiments on the PCQM4M benchmark leveraged GNNs that are substantially deeper than most previously studied GNNs, for quantum-chemical tasks or otherwise. While there is an implicit expectation that computing useful "cheap" chemical features from the SMILES string, such as molecular fingerprints or partial charges, should help, our experiments clearly demonstrated that most of them do not meaningfully impact the performance of our GNNs. This indicates that very deep GNNs are likely implicitly able to compute such features without additional guidance.
The exception to this has been conformer features, corresponding to approximate three-dimensional coordinates of every atom. These are very expensive to obtain accurately. However, using RDKit (landrum2013rdkit), we were able to obtain conformer estimates that allowed us to attain slightly improved performance with a (slightly) shallower GNN. Specifically, we use the experimental torsion-knowledge distance geometry (ETKDGv3) algorithm (wang2020improving) to recover conformers that satisfy essential geometric constraints, without violating our time limits.
Once conformers are obtained, we do not use their raw coordinates as features—these have many equivalent formulations that depend on the algorithm's initialisation. Instead, we encode their displacements (a 3-dimensional vector recording distances along each axis) and their distances (the scalar norm of the displacement) as additional edge features concatenated with $x_{uv}$. Note that RDKit's algorithm is not powerful enough to extract conformers for every molecule within PCQM4M; for a small fraction of the dataset, the returned conformers will be NaN.
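Given conformer coordinates, the displacement and distance edge features can be computed as follows (a sketch with synthetic coordinates):

```python
import numpy as np

def conformer_edge_features(coords, senders, receivers):
    """For every directed edge, return the 3-d displacement between its
    endpoints plus the scalar distance (its norm) as a 4-feature block."""
    displacement = coords[receivers] - coords[senders]              # (E, 3)
    distance = np.linalg.norm(displacement, axis=1, keepdims=True)  # (E, 1)
    return np.concatenate([displacement, distance], axis=1)         # (E, 4)

coords = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [1.0, 2.0, 2.0]])
feats = conformer_edge_features(coords, senders=np.array([0, 1]),
                                receivers=np.array([1, 2]))
```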
Lastly, we also attempted to use more computationally intensive forms of conformer generation—including energy optimisation using the universal force field (UFF) (rappe1992uff) and the Merck molecular force field (MMFF) (halgren1996merck). In both cases, we did not observe significant returns compared to using rudimentary conformers.
Model architecture
For the GNN architecture we have used on PCQM4M, our encoders and decoders are both three-layer MLPs, computing 512 features in every hidden layer. The node, edge and graph-level encoders' output layers compute 512 features, and we retain this dimensionality for $h_u^{(t)}$, $h_{uv}^{(t)}$ and $h_g^{(t)}$ across all steps $t$.
For our processor network, we use a very deep Graph Network (GN) (battaglia2018relational). Each GN block computes updated node, edge and graph latents, performing aggregations across them whenever appropriate. Fully expanded out, the computations of one GN block can be represented as follows:

$h_{uv}^{(t+1)} = \psi_t\left(h_u^{(t)}, h_v^{(t)}, h_{uv}^{(t)}, h_g^{(t)}\right)$  (9)

$h_u^{(t+1)} = \phi_t\left(h_u^{(t)}, \sum_{(v,u) \in E} h_{vu}^{(t+1)}, h_g^{(t)}\right)$  (10)

$h_g^{(t+1)} = \rho_t\left(\sum_{u \in V} h_u^{(t+1)}, \sum_{(u,v) \in E} h_{uv}^{(t+1)}, h_g^{(t)}\right)$  (11)
Taken together, Equations 9–11 fully specify the operations of the network in Equation 2. The edge update function $\psi_t$, node update function $\phi_t$ and graph update function $\rho_t$ are all three-layer MLPs, with identical hidden and output sizes to the encoder network.
The process is repeated for $T$ message-passing layers, after which the computed latents $h_u^{(T)}$, $h_{uv}^{(T)}$ and $h_g^{(T)}$ are sent to the decoder network for the relevant predictions. Specifically, the global latent vector $h_g^{(T)}$ is used to predict the molecule's HOMO-LUMO gap. Our work thus constitutes a successful application of very deep GNNs, providing evidence for the utility of such models. We note that, while most prior work on GNN modelling seldom uses more than eight steps of message passing (brockschmidt2020gnn), we observe monotonic improvements from deeper GNNs on this task, all the way to 32 layers, at which point the validation performance plateaus.
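One GN block (Equations 9–11) can be sketched as follows, again with random linear maps standing in for the three learned MLPs:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # latent width (placeholder)
W_e = rng.normal(scale=0.1, size=(4 * D, D))  # edge update, stands in for psi
W_n = rng.normal(scale=0.1, size=(3 * D, D))  # node update, stands in for phi
W_g = rng.normal(scale=0.1, size=(3 * D, D))  # graph update, stands in for rho

def gn_block(h_n, h_e, h_g, senders, receivers):
    """Edge, node, and graph updates of one Graph Network block."""
    g_tiled_e = np.broadcast_to(h_g, (len(h_e), D))
    h_e = np.concatenate([h_n[senders], h_n[receivers], h_e, g_tiled_e], 1) @ W_e
    agg_e = np.zeros_like(h_n)
    np.add.at(agg_e, receivers, h_e)  # sum of incoming edge latents per node
    g_tiled_n = np.broadcast_to(h_g, (len(h_n), D))
    h_n = np.concatenate([h_n, agg_e, g_tiled_n], 1) @ W_n
    h_g = np.concatenate([h_n.sum(0, keepdims=True),
                          h_e.sum(0, keepdims=True), h_g], 1) @ W_g
    return h_n, h_e, h_g

h_n, h_e, h_g = rng.normal(size=(4, D)), rng.normal(size=(3, D)), rng.normal(size=(1, D))
h_n2, h_e2, h_g2 = gn_block(h_n, h_e, h_g, np.array([0, 1, 2]), np.array([1, 2, 3]))
```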
Non-conformer model
Recalling our prior discussion of conformer features occasionally not being trivially computable, we also trained a GN which does not exploit conformer-based features. While we observe largely the same trends, we find that such models tend to allow for even deeper and wider GNNs before plateauing. Namely, our optimised non-conformer GNN computes 1,024-dimensional hidden features in every MLP, and iterates Equations 9–11 for 50 message-passing steps. This model performed marginally worse than the conformer GNN overall, while significantly improving the mean absolute error (MAE) on the subset of validation molecules without conformers.
Denoising objective
Our very deep GNNs have, in the first instance, been enabled by careful regularisation. By far, the most impactful method for our GNN regressor on PCQM4M has been Noisy Nodes (godwin2021very), and our results largely echo the findings therein.
The main observation of Noisy Nodes is that very deep GNNs can be strongly regularised by appropriate denoising objectives. Noisy Nodes perturbs the input node or edge features in a pre-specified way, then requires the decoder to reconstruct the unperturbed information from the GNN's latent representations.
In the case of the flat input features, we have deployed a Noisy Nodes objective on both atom types and bond types: randomly replacing each atom and each bond type with a uniformly sampled one, with a fixed probability. The model then performs node/edge classification based on the final latents (e.g., $h_u^{(T)}$ and $h_{uv}^{(T)}$ for the conformer GNN) to reconstruct the initial types. Requiring the model to correctly infer and rectify such noise implicitly imbues it with knowledge of chemical constraints, such as valence, and is a strong empirical regulariser. Note that, in this discrete-feature setting, Noisy Nodes can be seen as a more general case of the BERT-like objectives from hu2019strategies. The main difference is that Noisy Nodes takes a more active role in requiring denoising—as opposed to unmasking, where it is known upfront which nodes have been noised, and the effects of noising are always predictable.
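The discrete Noisy Nodes corruption can be sketched as follows (the corruption probability is a placeholder, as the true value is elided above):

```python
import random

def corrupt_types(types, vocab_size, p, rng=None):
    """With probability p, replace each categorical type with one drawn
    uniformly from the vocabulary; the model is then trained to recover
    the clean types from its final latents."""
    rng = rng or random.Random(0)
    noisy = [rng.randrange(vocab_size) if rng.random() < p else t
             for t in types]
    targets = list(types)  # reconstruction targets are the clean types
    return noisy, targets

atom_types = [6, 6, 8, 7, 6]  # e.g. carbon/oxygen/nitrogen atomic numbers
noisy, targets = corrupt_types(atom_types, vocab_size=119, p=0.3)
```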
When conformers or displacements are available, a richer class of denoising objectives may be imposed on the GNN. Namely, it is possible to perturb the individual nodes’ coordinates slightly, and then require the network to reconstruct the original displacement and/or distances—this time using edge regression on the output latents of the processor GNN. The Noisy Nodes manuscript had shown that, under such perturbations, it is possible to achieve stateoftheart results on quantum chemical calculations without requiring an explicitly equivariant architecture—only a very deep traditional GNN. Our preliminary results indicate a similar trend on the PCQM4M dataset.
Training and regularisation
We train our GNN to minimise the mean absolute error (MAE) for predicting the DFT-simulated HOMO-LUMO gap from the decoded global latent vector. This objective is combined with any auxiliary tasks imposed by Noisy Nodes (e.g. cross-entropy for reconstructing atom and bond types, MAE for regressing denoised displacements). We use the Adam SGD optimiser (kingma2014adam) with a cosine learning rate schedule: linear warmup to a peak learning rate, followed by decay over the course of training. We optimise over dynamically-batched data: we fill each training minibatch until exceeding any of pre-specified limits on the numbers of atoms, bonds, or molecules.
To regularise our model, we perform early stopping on the validation MAE, and apply feature dropout (srivastava2014dropout) and DropEdge (rong2019dropedge) at every message-passing layer.
Evaluation
At evaluation time, we exploit several known facts about the HOMO-LUMO gap, and about our conformer generation procedure, to achieve "free" reductions in MAE.
Firstly, it is known that the HOMO-LUMO gap cannot be negative, and it is possible for our model to make (very rare) vastly inflated predictions on validation data if it encounters an out-of-distribution molecule. We ameliorate both of these issues by clipping the network's predictions to a fixed plausible range of gap values.
Secondly, as discussed, RDKit was unable to compute conformers for a very small fraction of molecules. We found it useful to fall back to the 50-layer non-conformer GNN in these cases, rather than assuming a default value. The observed reductions in MAE were significant across those specific validation molecules.
Finally, we consistently track the exponential moving average (EMA) of our model's parameters (with a fixed decay rate), and use it for evaluation. EMA parameters are generally known to be more stable than their online counterparts, an observation that held in our case as well.
6 Ensembling and training on validation
Once we had established the top single-model architectures for both our MAG240M and PCQM4M entries, we found it very important to perform two post-processing steps: (a) re-training on the validation set, and (b) ensembling various models together.
Re-training on validation data offers a great additional wealth of learning signal, even just by the sheer volume of data available in OGB-LSC. But aside from this, the way in which the data was split offers even further motivation. On MAG240M, for example, the temporal split implies that validation papers (from 2019) are the most relevant for classifying test papers (from 2020)—simply put, because they both correspond to the latest trends in scholarship.
However, training on the full validation set comes with a potentially harmful drawback: no held-out dataset would remain to early-stop on. In a setting where overfitting can easily occur, we found the risk to vastly outweigh the rewards. Instead, we decided to randomly partition the validation data into ten equally-sized folds, and perform a cross-validation-style setup: we train ten different models, each one observing the training set and nine of the validation folds as its training data, validating and early-stopping on the held-out fold. Each model holds out a different fold, allowing us to obtain an overall validation estimate over the entire dataset by combining their respective predictions.
While this approach may not correspond to the intended dataset splits, we verified that the scores on individual held-out folds match the patterns observed in models that did not observe any validation data. This gave us further reassurance that no unintended strong overfitting had occurred as a result.
Another useful outcome of our fold-based approach is that it gave us a very natural way to perform ensembling as well: simply aggregating all of the models' predictions yields a mixture of experts, as each model had been trained on a slightly different training set. Our final ensembled models employ exactly this strategy, with the inclusion of two seeds per fold. This brings our overall number of ensembled models to 20, and these ensembles correspond to our final submissions on both MAG240M and PCQM4M.
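The resulting ensemble is then a uniform average over all fold/seed models, e.g.:

```python
import numpy as np

def ensemble_predictions(per_model_probs):
    """Uniformly average class probabilities over all fold/seed models:
    a simple mixture of experts, since each model was trained on a
    slightly different training set."""
    return np.mean(np.stack(per_model_probs, axis=0), axis=0)
```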
7 Experimental evaluation
In this section we provide experimental evidence to substantiate the various claims we have made about the key modifications in our model, hoping to inform future research on large-scale graph representation learning. To eliminate any possible confounding effects of ensembling, all results reported in this section are for a single model, evaluated on the provided validation data. We report average performance and standard deviation over three seeds.
MAG240M-LSC
We will follow the plots in Figures 1–2, which seek to uncover various contributing factors to our model’s ultimate performance. We proceed one claim at a time.
Making networks deeper than the patch diameter can help. We find that making the edges in every subsampled patch bidirectional allowed for doubling the message passing steps (to four) with a significant validation accuracy improvement, in spite of the fact that the MPNN was now deeper than the patch diameter. See Figure 1 (left).
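Making the patch edges bidirectional amounts to appending the reverse of every edge, sketched here over sender/receiver index arrays:

```python
import numpy as np

def make_bidirectional(senders, receivers):
    """Append the reverse of every edge, so messages can also flow
    against the original edge direction. This is what lets an MPNN
    deeper than the patch diameter keep propagating useful information."""
    return (np.concatenate([senders, receivers]),
            np.concatenate([receivers, senders]))
```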
Ensembling over multiple subsamples helps. We find that averaging our network's prediction over several randomly subsampled patches at evaluation time consistently improved performance. See Figure 1 (middle-left).
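Evaluation-time averaging over patches can be sketched as follows; the sampler and model interfaces are illustrative, as is the number of samples.

```python
import numpy as np

def predict_over_subsamples(model, node, sample_patch, num_samples=10):
    """Average the model's predicted class probabilities over several
    randomly subsampled patches around the same target node."""
    probs = [model(sample_patch(node)) for _ in range(num_samples)]
    return np.mean(np.stack(probs, axis=0), axis=0)
```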
Using training labels as features helps. On transductive tasks, we confirm that using the training node label as an additional feature provides a substantial boost to validation performance, if done carefully. See Figure 1 (middle-right).
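The "careful" part is never letting a node see its own label. One way to sketch such masked label features (the keep probability below is an illustrative choice, not ours):

```python
import numpy as np

def masked_label_features(labels, num_classes, is_train, target_idx,
                          p_keep=0.5, rng=None):
    """One-hot training-label features. Non-training nodes contribute
    zeros; training-node labels are randomly dropped with probability
    1 - p_keep, and the target node's own label is always hidden so the
    model cannot simply copy it."""
    rng = rng or np.random.default_rng(0)
    feats = np.zeros((len(labels), num_classes))
    keep = is_train & (rng.random(len(labels)) < p_keep)
    keep[target_idx] = False  # never reveal the label we are predicting
    feats[np.flatnonzero(keep), labels[keep]] = 1.0
    return feats
```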
Larger patches help. Providing the model with a larger context (by subsampling more neighbours) proved significantly helpful to our downstream performance. See Figure 1 (right).
Self-supervised objectives help, especially BGRL. We first validate that combining a traditional cross-entropy loss with a self-supervised loss is beneficial to final performance. Further, we show that BGRL (thakoor2021bootstrapped) can significantly outperform GRACE (zhu2020deep) in the large-scale regime. See Figure 2 (left).
Self-supervised learning on unlabelled nodes helps. One of the major promises of self-supervised learning is access to a vast quantity of unlabelled nodes, which can now be used as targets. We recover consistent, monotonic gains from incorporating increasing amounts of unlabelled nodes within our training routine. See Figure 2 (middle).
Self-supervised learning allows for more robust models. Finally, the regularising effect of self-supervised learning means that we can train our models for longer without suffering any overfitting effects. See Figure 2 (right).
PCQM4M-LSC
We follow Figure 3, which investigates key design aspects in our PCQM4M-LSC models.
Using conformer-based features helps. Utilising features based on RDKit conformers, in the manner described before, proved beneficial to final performance. Note that the comparison against our 50-layer non-conformer model is not a direct one, given that the non-conformer model is only applied to molecules for which conformers cannot be computed. See Figure 3 (top-left and bottom-left).
Deeper models help. We demonstrate consistent, monotonic gains for larger numbers of message passing steps, at least up to 32 layers (and, in the case of the non-conformer model, up to 50 layers). See Figure 3 (top-middle-left and bottom-middle-left).
Noisy Nodes help. Lastly, we show that the regulariser proposed in Noisy Nodes (godwin2021very) proved very effective for this quantum-chemical task as well. It was the key behind the monotonic improvements of our models with depth. Note, for example, that removing Noisy Nodes from our best-performing model makes its performance comparable with models that are at most half as deep. See Figure 3 (top-middle-right and bottom-middle-right).
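As a rough sketch of the idea (not the exact formulation of godwin2021very): corrupt the input node features with noise and add an auxiliary per-node denoising target on top of the main task loss. The noise scale and interfaces below are illustrative.

```python
import numpy as np

def noisy_nodes_aux_loss(model, node_feats, noise_scale=0.02, rng=None):
    """Noisy Nodes-style auxiliary loss (a sketch): perturb the inputs
    with Gaussian noise and ask the model's per-node outputs to
    reconstruct the clean features; this regularises deep message
    passing when added to the primary objective."""
    rng = rng or np.random.default_rng(0)
    noise = noise_scale * rng.standard_normal(node_feats.shape)
    denoised = model(node_feats + noise)  # assumed per-node output head
    return float(np.mean((denoised - node_feats) ** 2))
```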
Wider message functions help. Towards the end of the contest, we noted that performance gains are possible when favouring wider message functions (in terms of the hidden size of their MLP layers), as opposed to increasing the latent size of the GNN. We subsequently noticed that such a regime (256 latent dimensions, 1,024-dimensional hidden layers) consistently improved our non-conformer model as well. See Figure 3 (top-right and bottom-right).
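The regime described above can be sketched as follows; the weight initialisation and exact MLP shape are illustrative placeholders.

```python
import numpy as np

def init_message_mlp(latent=256, hidden=1024, rng=None):
    """Message-function MLP whose hidden layer (1,024) is wider than the
    GNN latent size (256); it acts on concatenated sender/receiver latents."""
    rng = rng or np.random.default_rng(0)
    w1 = rng.standard_normal((2 * latent, hidden)) * 0.02
    w2 = rng.standard_normal((hidden, latent)) * 0.02
    return w1, w2

def message(h_sender, h_receiver, w1, w2):
    """Compute one edge message: a ReLU MLP over the concatenated latents."""
    x = np.concatenate([h_sender, h_receiver], axis=-1)
    return np.maximum(x @ w1, 0.0) @ w2
```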
8 Results and Discussion
Our final ensembled models achieved a validation accuracy of 77.10% on MAG240M, and a validation MAE of 0.110 on PCQM4M. On the LSC test sets, we recover 75.19% test accuracy on MAG240M and 0.1205 test MAE on PCQM4M. We incur a minimal amount of distribution shift, which is a testament to our principled ensembling and post-processing strategies, in spite of using labels as inputs on MAG240M and training on validation data for both tasks.
Our entries were designated as awardees (ranked in the top 3) on both MAG240M and PCQM4M, solidifying the impact that very deep, expressive graph neural networks can have on large-scale datasets of industrial and scientific relevance. Further, we demonstrate how several recently proposed auxiliary objectives for GNN training, such as BGRL (thakoor2021bootstrapped) and Noisy Nodes (godwin2021very), can be highly impactful at the right dataset scales. We hope that our work helps resolve several open disputes in the community, such as the utility of very deep GNNs and the influence of self-supervision in this setting.
In many ways, the OGB has been to graph representation learning what ImageNet has been to computer vision. We hope that OGBLSC is only the first in a series of events designed to drive research on GNN architectures forward, and sincerely thank the OGB team for all their hard work and effort in making a contest of this scale possible and accessible.