Log In Sign Up

Taking a Respite from Representation Learning for Molecular Property Prediction

by   Jianyuan Deng, et al.

Artificial intelligence (AI) has been widely applied in drug discovery with a major task as molecular property prediction. Despite the boom of AI techniques in molecular representation learning, some key aspects underlying molecular property prediction haven't been carefully examined yet. In this study, we conducted a systematic comparison on three representative models, random forest, MolBERT and GROVER, which utilize three major molecular representations, extended-connectivity fingerprints, SMILES strings and molecular graphs, respectively. Notably, MolBERT and GROVER, are pretrained on large-scale unlabelled molecule corpuses in a self-supervised manner. In addition to the commonly used MoleculeNet benchmark datasets, we also assembled a suite of opioids-related datasets for downstream prediction evaluation. We first conducted dataset profiling on label distribution and structural analyses; we also examined the activity cliffs issue in the opioids-related datasets. Then, we trained 4,320 predictive models and evaluated the usefulness of the learned representations. Furthermore, we explored into the model evaluation by studying the effect of statistical tests, evaluation metrics and task settings. Finally, we dissected the chemical space generalization into inter-scaffold and intra-scaffold generalization and measured prediction performance to evaluate model generalizbility under both settings. By taking this respite, we reflected on the key aspects underlying molecular property prediction, the awareness of which can, hopefully, bring better AI techniques in this field.


page 1

page 2

page 5

page 6

page 11

page 12

page 14

page 15


KPGT: Knowledge-Guided Pre-training of Graph Transformer for Molecular Property Prediction

Designing accurate deep learning models for molecular property predictio...

ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction

GNNs and chemical fingerprints are the predominant approaches to represe...

Graph-based Molecular Representation Learning

Molecular representation learning (MRL) is a key step to build the conne...

Chemistry42: An AI-based platform for de novo molecular design

Chemistry42 is a software platform for de novo small molecule design tha...

Substructure-Atom Cross Attention for Molecular Representation Learning

Designing a neural network architecture for molecular representation is ...

Tyger: Task-Type-Generic Active Learning for Molecular Property Prediction

How to accurately predict the properties of molecules is an essential pr...

Towards IID representation learning and its application on biomedical data

Due to the heterogeneity of real-world data, the widely accepted indepen...

1 Introduction

Drug discovery is an expensive process in both time and cost with a daunting attrition rate. As revealed by a recent study [wouters2020estimated], the average cost of developing a new drug was around 1 billion dollars and has been ever increasing [simoens2021r]. In the past decade, the practice of drug discovery has been undergoing radical transformations in light of the advancements in artificial intelligence (AI) [chen2018rise, vamathevan2019applications, deng2022artificial], which, at its core, is molecular representation learning. Molecules are usually represented by: 1) chemical descriptors, which reflect the existence of certain fragments, such as hashed fingerprints, Extended-Connectivity FingerPrints (ECFP), 2) linear notations, such as Simplified Molecular Input Line Entry System (SMILES) strings and 3) molecular graphs [david2020molecular]

. With the advent of the deep-learning era, various neural networks have been proposed for molecular representation learning, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs) and graph neural networks (GNNs), among others

[deng2022artificial]. One major task for AI in drug discovery is molecular property prediction, which seeks to learn a function that maps a structure to a property value. In the literature, deep representation learning has been widely reported to show great potential in molecular property prediction over fixed molecular representations [mayr2018large, yang2019analyzing]. More recently, to address the lack of labeled data in drug discovery, self-supervised learning has been proposed to leverage large-scale unlabeled corpus for both SMILES strings [honda2019smiles, chithrananda2020chemberta, fabian2020molecular] and molecular graphs [hu2019strategies, rong2020grover, wang2021molclr, wang2022improving], which has enabled state-of-the-art (SOTA) performance based on the MoleculeNet benchmark datasets [wu2018moleculenet].

Figure 1: Key Aspects underlying Molecular Property Prediction. There are four aspects underlying molecular property prediction: model, dataset, model evaluation and chemical space generalization. Usually in the literature, the focus is more on the model, which aims at developing novel learning paradigms or model architectures on specific molecular representation formats. However, other aspects, which are pertaining to 1) what the model is built upon, 2) how the model is evaluated and 3) eventually what the model is capable of, should also be considered. For the dataset, its quality (such as duplicates, contradictories and uncertainty in labels), the chemical space coverage (w.r.t. both structures and labels) and the splitting method should be carefully checked before building a model applicable to a certain property prediction task. For the model evaluation, the use of statistical test, evaluation metric and the task setting also affect the observed prediction performance and may even alter the conclusion. For the chemical space generalization, there is a necessity to clarify what type of generalization the model is capable of and whether the activity-cliffs issue is also addressed.

Nevertheless, AI-driven drug discovery is not without its critiques despite its current prosperity. Usually, when a new technique is developed for molecular property prediction, improved metrics by experimenting on the MoleculeNet benchmark datasets [wu2018moleculenet] are used to substantiate the argument that the model outperforms previous ones. Although novel techniques are indeed proposed along with impressive metrics, most often they do not suffice to solve the molecular property prediction need in real-world drug discovery settings. In fact, the existing practice of representation learning for molecular property prediction can be dangerous yet quite rampant [robinson2020validating]. The detailed issues are elaborated as follows.

First, there is a heavy reliance on the benchmark datasets, which may be of little relevance to real-world drug discovery [walters2021critical]. Moreover, despite the wide adoption of the benchmark datasets, the split folds can vary in the literature, which entail unfair performance comparison [shen2021out]. Very often, achieving the SOTA performance becomes the major pursuit of new model development, which also leads to little focus on the statistical rigor and the model applicability [robinson2020validating]. For instance, when reporting prediction performance for a newly developed model, most papers just used the mean value averaged over 3-fold [mayr2018large, xiong2019pushing, rong2020grover] or 10-fold [hu2019strategies, fabian2020molecular, na2020machine, shen2021out] splits. The seeds for dataset splitting may or may not be explicitly provided. Sometimes, it may just come from some arbitrary split with a few individual runs. The inherent variability of dataset splitting is simply ignored. The caveat is that, without rigorous analysis, the seemingly improved metric values may just be some statistical noise [robinson2020validating]. For the model applicability, besides the limited relevance of the heavily used benchmark datasets, there is also a lack of practical relevance for the recommended evaluation metrics. One example is AUROC, which, as opined by Robinson et al.[robinson2020validating], cannot well capture the true positive rate, a more relevant metric in virtual screening. To address these existing issues, we took a respite to rethink the key aspects underlying molecular property prediction. We conducted a systematic evaluation on representative models in molecular property prediction by training 4,320 models, with a focus on: 1) dataset profiling, including label distribution and structural analysis; 2) model evaluation, on the effect of statistical test, metrics and task setting; 3) chemical space generalization, w.r.t. inter-scaffold and intra-scaffold generalization.

The outline of the paper is as follows. We first discussed the preliminaries for molecular property prediction, including molecular representation formats, model architectures and learning paradigms [deng2022artificial]. We then discussed in detail the rationale of this study and proposed our experiment schemes. Secondly, we elaborated on the methods, including datasets collection, evaluation metrics, model training and statistical analyses. To further evaluate the usefulness of the molecular representation learning models, we also assembled a suite of opioids-related datasets from ChEMBL [mendez2019chembl]. Thirdly, we presented the results, where we checked on three aspects: 1) dataset profiling, 2) model evaluation and 3) chemical space generalization. Finally, we discussed our thoughts on how to advance the representation learning for molecular property prediction. During this respite, we summarized the key aspects underlying molecular property prediction, which should be considered in future model development (Figure 1). Similar to the quote from Bender et al [bender2020artificial, bender2021artificial] “a method cannot save an unsuitable representation which cannot remedy irrelevant data for an ill-thought-through question.”, our central thesis is that "a model cannot save an unqualified dataset which cannot remedy an improper evaluation for an ambiguous chemical space generalization claim".

2 Preliminaries

2.1 Molecular representation formats

Over the years, various formats have been used to represent small molecules [david2020molecular, deng2022artificial]. Arguably, the simplest format are the 1D descriptors which represent a molecule based on its formula, such as atom counts, types and molecular weight. Moreover, molecules can also be represented using 2D fingerprints, such as structural keys and path-based or circular fingerprints [gao20202d]

. These fingerprints can be either 1) bit vectors, which are binary vectors with each dimension tracking the presence or absence of a particular substructure or 2) count vectors, which track the frequency of each substructure appearance. One of the most widely used molecular fingerprints is the ECFP bit-vector based on the Morgan algorithm, firstly proposed to solve the molecular isomorphism issue, which is to identify if two molecules, with different atom numberings, are the same

[rogers2010extended]. There are three sequential stages during ECFP generation: 1) an initial assignment stage when each atom has an integer identifier assigned to it, 2) an iterative updating stage when each atom identifier is updated to reflect the identifiers of each atom’s neighbors, including identification of whether it is a structural duplicate of other features, 3) a duplicate identifier removal stage when multiple occurrences of the same feature are reduced to single representative in the final feature list (the occurrence count may be retained if a set of counts are required instead of a standard binary fingerprint). ECFP has been the de facto standard circular fingerprint and is still valuable in drug discovery [gao20202d]. The vector size of ECFP is usually 1024 or 2048. The radius size of ECFP can either be 2 or 3, corresponding to ECFP4 and ECFP6, respectively, which are the common variants of ECFP in the literature. For instance, Yang et al.[yang2019analyzing] used ECFP4 while Mayr et al.[mayr2018large], Robinson et al.[robinson2020validating] and Skinnider et al.[skinnider2021deep] used ECFP6.

Intuitively, small molecules can also be represented as graphs, with atoms as nodes and bonds as edges. Formally, a graph is defined as , where and represent nodes (atoms) and edges (bonds), respectively. The attributes of atoms can be represented by a node feature matrix and each node can be represented by an initial vector and a hidden vector . Similarly, the attributes of bonds can also be represented by a feature matrix. In addition, an adjacency matrix is used to represent the pairwise connections between nodes. For every two nodes and , if there is a bond connecting nodes and ;

otherwise. Usually, the edge feature matrix and the adjacency matrix can be combined to form an adjacency tensor. Table 

1 summarizes the commonly used node and edge features.

Type Feature Notes
Node Atom type Element type
Node Formal charge Assigned charges
Node Implicit Hs Number of bonded hydrogens
Node Chirality R or S configuration
Node Hybridization Orbital hybridization
Node Aromaticity Aromatic atom or not
Edge Bond type Single, double, triple, aromatic
Edge Conjugated Conjugated or not
Edge Stereoisomers cis or trans, (E or Z), none, any
Table 1: Commonly Used Node and Edge Features

Although the graph representation carries more structural information, one drawback is that graphs are often storage- and memory-demanding [david2020molecular]. Alternatively, a more computationally efficient representation of molecules is the SMILES string [weininger1988smiles], where atoms are represented by the atomic symbols and bonds are represented by symbols like "-", "=", "#" and ":", corresponding to single, double, triple and aromatic bonds, respectively. Notably, single bonds and aromatic bonds are usually omitted. Moreover, parentheses are used to denote the branches in a molecule. For the cyclic structure, a single or aromatic bond is first broken down in the ring and the bonds are then numbered in any order with the ring-opening bonds by a digit following the atomic symbol at each ring. Notably, one molecule can have multiple SMILES-string representations [david2020molecular]. Thus, the canonicalized SMILES strings [weininger1989smiles] are more often used in the literature. To be machine-readable, SMILES strings are usually converted into one-hot vectors.

2.2 Model architectures

So far, various model architectures have been proposed for molecular property prediction, such as RNNs, GNNs and transformers [deng2022artificial]. Originally designed for handling sequential data (e.g., text and audio), RNNs can be naturally used to model molecules formatted as the SMILES strings, such as SMILES2Vec [goh2017smiles2vec] and SmilesLSTM [mayr2018large]. GNNs, on the other hand, can be applied on the molecular graphs, such as graph convolutional networks (GCN) [kipf2016semi], graph attention network (GAT) [velivckovic2017graph], message passing neural networks (MPNN) [gilmer2017neural], directed MPNN (D-MPNN) [yang2019analyzing], graph isomorphism networks (GIN) [xu2018powerful, hu2019strategies]. Recently, to address the lack of annotated data in drug discovery, self-supervised learning has been proposed for pretraining on large-scale unlabeled molecule corpuses before downstream finetuning [deng2022artificial]. In this study, we mainly utilized two self-supervised models: MolBERT [fabian2020molecular] and GROVER [rong2020grover] which take SMILES strings and molecular graphs as input, respectively. Random forest (RF) on ECFP4 is used as the baseline.

2.2.1 Random forest

Random Forest (RF) is an ensemble of decision tree predictors, which can be applied for classification and regression tasks

[breiman2001random]. RF has been widely adopted in drug discovery prior to the "deep-learning" era [vamathevan2019applications]. By assembling a certain number of de-correlated decision trees, RF establishes a strong baseline for deep learning models in many tasks [jiang2021could]. Therefore, we chose to use RF as the baseline for evaluating advanced molecular representation learning models (i.e., MolBERT and GROVER). Note that RF can be applied to both descriptor-based representation [cano2017automatic] and fingerprint-based representation [yang2019analyzing]. Here, we focused on the circular fingerprint, ECFP4.

2.2.2 MolBERT

The SMILES strings can be viewed as a "chemical" language. Language models, therefore, have been widely applied in molecular representation learning for molecular property prediction, molecule generation and retro-synthesis prediction

[deng2022artificial]. Related model architectures include RNNs and transformers [deng2022artificial]

. Recently, inspired by Bidirectional Encoder Representation from Transformers (BERT) in natural language processing, Fabian

et al.[fabian2020molecular] exploited the architecture of BERT for molecular property prediction. Using transformers as the building block, MolBERT is pretrained on a corpus of c.a. 1.6M SMILES strings [brown2019guacamol], which improves prediction performance on 6 benchmark datasets in both classification (BACE, BBBP, HIV) and regression (ESOL, FreeSolv, Lipop) settings [wu2018moleculenet].

Figure 2: Abstracted Model Architectures Modified from MolBERT [fabian2020molecular] and GROVER [rong2020grover].

The abstracted architecture of MolBERT is depicted in Figure 2 (a). An input SMILES string is firstly tokenized and embedded into a sequence of -dimensional vectors. Unlike RNNs, which handle the input sequentially, a positional embedding layer is added to the input tokens to capture the sequential information. A stack of BERT encoder layers is then added on top of the embedding layers to learn the latent representations of the input sequence. During pretraining, different pretext self-supervised tasks, such as masked language modeling, are designed to utilize the output embeddings after the pooler layer. During finetuning, new task heads can be appended by attaching a single linear layer to the pooled output for downstream property prediction. The learned weights of the backbone model during pretraining are fixed, which provide a better model initialization and can also reduce training burden during finetuning, especially when the model is large.

MolBERT is pretrained on a fixed vocabulary of 42 tokens and a maximum sequence length of 128 characters. To support the arbitrary length of input SMILES strings at inference time (for instance, we adopted the maximum sequence length of 400 during finetuning), relative positional encoding [shaw2018self] is used. Following the original BERT model, MolBERT uses the BERT-Base architecture with an output embedding size of 768, 12 BERT encoder layers, 12 attention heads and a hidden size of 3,072, resulting in c.a. 85M parameters. In this study, we used the pretrained MolBERT provided by Fabian et al [fabian2020molecular].

Figure 3: Experiment Scheme on the Benchmark Datasets.

2.2.3 Grover

In addition to the SMILES strings, molecules can also be intuitively abstracted as graphs. GNNs, therefore, have been widely applied in molecular representations learning [deng2022artificial]. The core operation in GNNs is message passing, a.k.a. neighborhood aggregation [gilmer2017neural]. During message passing, a node’s hidden state is iteratively updated by aggregating the hidden states of its neighboring nodes and edges, which involves multiple hops. After each iteration, the message vectors can be integrated using certain aggregate

functions, such as sum, mean, max pooling or graph attention

[wieder2020compact]. The aggregate

function is essentially a trainable layer together with some activation functions, which is shared by different hops within an iteration. When the message passing is completed, the hidden states of the last hop from the last iteration are used as the nodes’ embeddings, followed by a

readout function to obtain the graph-level embedding. Two tasks are commonly used to pre-train the GNNs [hu2019strategies]: 1) node-level self-supervised training, such as atom type prediction and 2) graph-level supervised training, such as molecular label prediction. However, supervised pretraining may cause "negative transfer", where downstream performance can be deteriorated therein.

Recently, Rong et al.[rong2020grover] proposed GROVER with delicately-designed, self-supervised pretraining tasks at the node-, edge- and graph-level, respectively. GROVER is pretrained on c.a. 10M unlabeled molecules, which is then evaluated on 11 benchmark datasets (classification setting: BACE, BBBP, ClinTox, SIDER, Tox21, ToxCast; regression setting: ESOL, FreeSolv, Lipop, QM7, QM8) and achieves SOTA performance in all datasets. The abstracted model architecture of GROVER is depicted in Figure 2 (b). The input node and edge embeddings are first learned via message passing [gilmer2017neural], which are then passed to the node-view GTransformer and edge-view GTransformer to output the node and edge embeddings from the two views, respectively. With some readout function, the final embeddings can be used for node-level, edge-level or graph-level prediction tasks. For downstream tasks, following the practice in Chemprop [yang2019analyzing], GROVER extracts 200 global molecular features using RDKit [Landrum2016RDKit2016_09_4], which are concatenated with the learned embedding vector, i.e., output of the readout function, and then passes through a linear layer, i.e., a task head, for molecular property prediction.

Notably, GROVER has two configurations: GROVER and GROVER, corresponding to c.a. 48M and c.a. 100M model parameters, respectively. With a corpus of c.a. 10M unlabeled molecules for pretraining, GROVER demands very intensive computational resources. As stated in the paper [rong2020grover], GROVER costs 2.5 days, and GROVER costs 4 days on 250 NVIDIA V100 GPUs for pretraining. Due to the large number of experiments to be conducted in this study, we only used the pretrained GROVER. To examine the actual power of GROVER and the effect of the additional RDKit features, we distinguished GROVER (without RDKit features) and GROVER_RDKit, which are also denoted as "GROVER 1" and "GROVER 2" later in the main text.

2.3 Opioids with reduced overdose effects

To further evaluate the molecular representation learning models, we collected a suite of opioids-related datasets besides the MoleculeNet benchmark datasets [wu2018moleculenet]. Opioid overdose is a leading cause of injury-related death in the United States [centers2020drug] and there is an increasing interest in developing opioid analgesics with reduced overdose effects. As pointed out by a large-scale observational study [deng2021large], reduced overdose effects can potentially be addressed from the pharmacokinetic (PK) perspective and pharmacodynamic (PD) perspective, which correspond to 1) decreasing the overdose events by avoiding excessive amount of opioids in the action site and 2) alleviating the overdose outcome by avoiding off-target effects, respectively. The PK targets include multi-drug resistance protein 1 (MDR1), cytochrome P450 (CYP) 2D6, and CYP3A4 whereas the PD targets include opioid receptor (MOR), opioid receptor (DOR) and opioid receptor (KOR). Details on the opioids-related datasets collection are in Section 3.1.

Figure 4: Experiment Scheme on the Opioids-related Datasets.

2.4 Study rationale and experiment design

2.4.1 How useful are the learned representations?

The first question that our study aims to answer is: how useful are the learned representations for molecular property prediction? Previously, although deep neural networks have been reported to outperform traditional machine learning models, such as RF and support vector machine (SVM) on ECFP in a large-scale bioactivity prediction study

[mayr2018large], Robinson et al.[robinson2020validating] have re-analyzed the results and found that SVM is still competitive with the deep learning models. Indeed, whether the learned representations can supersede fixed representations, such as ECFP, still needs further interrogation. To this end, we selected two representative models for molecular property prediction after an extensive literature review [deng2022artificial]. In total, we investigated 4 models, namely, RF, MolBERT, GROVER and GROVER_RDKit (see Section 2.2). The experiment scheme on the benchmark datasets is in Figure 3. We also evaluated the models using the opioids-related datasets (Figure 4).

2.4.2 Are the models properly evaluated?

As suggested by MoleculeNet [wu2018moleculenet]

, each benchmark dataset comes with a recommended evaluation metric, a common practice widely adopted by subsequent studies. For example, for the classification datasets, the area under the receiver operating characteristic curve (AUROC) is always used; whereas for the regression datasets, the most common metric is the root mean square error (RMSE). However, these recommended metrics can be problematic. As opined by Robinson

et al.[robinson2020validating], AUROC for the classification task may be of little relevance in real-world drug discovery applications, such as virtual screening. In the case of imbalanced datasets, which is often the case in reality where only a small portion of test molecules are actives, AUROC can be biased [saito2015precision]. In fact, AUROC can be seen as the expected true positive rate averaged over all classification thresholds (false positive rates). Thus, if two ROC curves cross, even if one curve has higher AUROC, it may perform much worse (lower true positive rate) under certain thresholds of interest. One alternative is the area under the precision-recall curve (AUPRC) [saito2015precision, robinson2020validating], which only focuses on the minor class, often the actives.

In this study, we further argue that the evaluation metric should be contingent on the question of interest during drug discovery. For instance, one popular sub-task involved in virtual screening is target fishing, which is to identify all possible targets that a molecule can bind to. According to Hu et al.[hu2013likelihood], an active PubChem compound can interact with c.a. 2.5 targets and consequently, off-target effects can be pervasive and may lead to undesired adverse drug reactions. Thus, one research area is to identify potential targets for a molecule during early stage [wale2009target]. In this scenario, the question is not just about predicting whether a molecule can bind to a specific target. Instead, we would care more about questions like: 1) given a set of predicted drug targets , what is the fraction of correct predictions among the predicted positives, i.e., ; 2) given a set of predicted drug targets , what is the fraction of correct predictions among the annotated positives, i.e., . Even in the single-target virtual screening scenario, we may care more about the , which is the positive predictive value (PPV), inasmuch it is imperative to ensure that there is a sufficient amount of true positives out of the predicted positives. On the contrary, if the goal is to exclude molecules inactive against certain targets which are related to adverse reactions, the negative predictive value (NPV) is of more interest. Thus, for each experiment, we evaluated a set of metrics (Figure 3 & Figure 4). More details on the evaluation metrics are in Section 3.2.

In addition to the evaluation metrics, another crucial but often missing part in previous studies is the statistical test, despite that the benchmark datasets are small-sized [robinson2020validating, walters2021critical, patrick2021comparing]. Most often, when a new model is developed, some arbitrary split or 3/10-fold splits are applied to calculate the mean of some metric for rudimentary comparison. The reality is, however, due to lack of rigorous statistical tests, such result is insufficient to justify whether there is a true improvement. Besides, the task setting may also affect the evaluation. Usually, after collecting the pIC50 values, a cutoff value such as 5 or 6 is used to dissect the collected molecules into actives and inactives. Nevertheless, how the classification with an arbitrary cutoff value affects the final prediction compared to directly regressing the pIC50 values hasn’t been unexplored yet. To study the influence of task settings, we conducted experiments under both classification and regression settings for the opioids-related datasets (Figure 4).

2.4.3 What does chemical space generalization mean?

In the representation learning for molecular property prediction, the ultimate goal is to build models that can generalize from seen molecules to unseen ones. To mimic chronological split in the real-world setting, MoleculeNet [wu2018moleculenet] proposed scaffold split as a proxy which ensures that molecules in the test set are equipped with unseen scaffolds during training, posing a more challenging prediction task. In the literature, most papers have adopted the scaffold-split practice and claimed chemical space generalization capability when evaluation metrics are seen improved. The assumption, here, is that chemical space generalization is approximated as generalization between different scaffolds, which further assumes each scaffold is associated with certain property values, for instance, similar activity. Nevertheless, one scaffold may not necessarily map to a narrow range of property values. In this case, the use of scaffold split does not suffice to guarantee chemical space generalization. Moreover, it may lead to ambiguity.

Formally, the chemical space is defined as the set of all possible organic molecules, more specifically, the biologically relevant molecules [dobson2004chemical]. In the chemical space, usually there are some structural constellations, which is populated with certain property values and can be identified using scaffold-based analysis [naveja2019finding]. Since the constellations have diverse scaffolds, two molecules with different scaffolds can have disparate properties, which is termed as the "scaffold cliff" [naveja2019finding]. Using the widely-adopted scaffold split, we argue that it actually addresses the "scaffold cliff", which is essentially doing the inter-scaffold generalization. Meanwhile, another challenge in drug discovery is the "activity cliffs" (see Section 4.1.2), where a tiny structural change causes a drastic activity change between a pair of molecules with similar structures, usually with the same scaffold [stumpfe2019evolving]. On the contrary to inter-scaffold generalization, it is the intra-scaffold generalization needed in the case of the activity cliffs. Unfortunately, while activity cliff is prevalent and has been discussed in computational and medicinal chemistry for nearly three decades [stumpfe2019evolving], it has not been emphasized in most molecular property prediction papers. In this study, we adopted both scaffold split and random split to check inter-scaffold generalization capability (Figures 3 &  4). Furthermore, to check intra-scaffold generalization, we filtered out molecules with scaffolds observed with the activity-cliffs issue, denoted as the AC molecules (see Section 4.1.2), and then calculated prediction performance on the AC and non-AC molecules (see Section 4.3.2), respectively.

3 Methods

3.1 Datasets assembly

3.1.1 Benchmark datasets

In 2018, a suite of benchmark datasets from MoleculeNet for molecular property prediction was proposed [wu2018moleculenet], which have been widely used to develop new molecular representation learning models. Among them, we selected three classification datasets (BACE, BBBP, HIV) and three regression datasets (ESOL, FreeSolv, Lipop), which are used in MolBERT [fabian2020molecular] and GROVER [rong2020grover] (except for HIV) as well as a recent comparison study Jiang et al [jiang2021could]. Note that these datasets are for single-task purpose and were downloaded from MolMapNet [shen2021out], which has a more updated version. Each dataset and its task type, number of molecules, maximum string length and number of scaffolds are summarized in Table 2

. Since MolBERT needs to pad the input SMILES strings to the maximum length, we only kept the molecules with length no longer than 400. As shown in Table 

2, all selected benchmark datasets have maximum length less than 400 except for HIV, where c.a. 0.01% molecules were removed.

Dataset Task #Mol. Max. Len. #Scaff.
BACE CLS 1,513 198 737
BBBP CLS 2,039 400 1,101
HIV CLS 41,127 580 19,085
ESOL REG 1,128 98 268
FreeSolv REG 642 82 62
Lipop REG 4,200 267 2,443
Table 2: Summary of the Benchmark Datasets.

As for dataset splitting, there are several options, such as random split, scaffold split, stratified split and time split [wu2018moleculenet]. Each method has its own purpose. For example, time split makes the model trained on older data points and tested on newer molecules, which mimics the real-world scenario where models are built on existing data points and are used to predict properties of newly synthesized molecules [wu2018moleculenet]. The most widely adopted method in the literature is scaffold split, which addresses the inter-scaffold generalization (see Section 2.4.3). However, the actual split folds can still vary across studies. For instance, for the regression datasets, MolBERT [fabian2020molecular] used the random split folds provided in MolMapNet [shen2021out] while GROVER adopted scaffold split. For the classification datasets, both MolBERT and GROVER adopted scaffold split, although the seeds are not provided and the split folds may not be identical.

In this study, we adopted both scaffold and random split, following a 80:10:10 ratio for training/validation/test sets (Figures 34). Additionally, to ensure sufficient statistical rigor, we repeated the dataset split procedure 30 times with 30 different seeds () using GROVER’s implementation for dataset split, which were saved and kept the same for all experiments to ensure fair comparison. All the used benchmark datasets and split folds are provided in our Github repository.

3.1.2 Opioids-related datasets

To check the usefulness of the representation learning models for molecular property prediction, we also assembled a suite of opioids-related datasets (see Section 2.3). Specifically, we collected the binding affinity for the following pharmacological components [deng2020informatics, deng2021large]: MDR1 (ChEMBL ID: 4302), CYP2D6 (ChEMBL ID: 289), CYP3A4 (ChEMBL ID: 340), MOR (ChEMBL ID: 233), DOR (ChEMBL ID: 236) and KOR (ChEMBL ID: 237). The binding affinity data is retrieved from ChEMBL27 [mendez2019chembl] using in vitro potency measures, namely, IC50, EC50, Ki and Kd. Assay type is set as "Binding", standard relationship is set as "=" with the standard unit as "nM" and the organism as "Homo Sapiens". The raw binding affinity data is further converted into the negative log 10 scale, which is denoted as pIC50. Lastly, we cleaned the datasets by removing contradictory entries and duplicates.

Notably, IC50/EC50/Ki/Kd are often heteroscedastic

[mervin2021probabilistic]. Consequently, measurement errors are not equally distributed throughout the range of activity and, therefore, regression of the raw pIC50 values may not be favorable. Thus, one common practice is to convert direct regression into a binary classification task. For the active vs. inactive threshold, 1 M (pIC50 at 6) is usually used as the default cutoff [lagunin2018comparison]. In this study, we also adopted pIC50 at 6 as the cutoff (Figure 4).

Dataset Task #Mol. Max. Len. #Scaff.
MDR1 CLS-REG 1,438 252 602
CYP2D6 CLS-REG 2,293 217 1,330
CYP3A4 CLS-REG 3,671 244 2,022
MOR CLS-REG 3,553 373 1,623
DOR CLS-REG 3,223 373 1,531
KOR CLS-REG 3,326 373 1,660
Table 3: Summary of the Opioids-related Datasets.

The opioids-related datasets, together with the task type, number of molecules, maximum SMILES string length and number of scaffolds, are summarized in Table 3. Since each dataset has a maximum length less than 400, the collected molecules are 100% retained. For the opioids-related datasets, we also applied both scaffold and random split (Figure 4), each of which was repeated 30 times with 30 different seeds () in GROVER’s implementation for dataset split. The split folds were also saved and kept the same in all subsequent experiments. The opioids-related datasets and the split folds are also provided in our Github repository.

3.2 Evaluation metrics

In Section 2.4.2, we discussed why the recommended metric may not be proper for model evaluation. Details on the set of basic metrics used in this study are illustrated as follows. Note there other enrichment metrics in early discovery, such as Boltzmann-Enhanced Discrimination of ROC (BEDROC) [fabian2020molecular].

3.2.1 Classification metrics

In a binary classification task setting, a probability for each molecule as the positive (or active) class is usually assigned. When the estimated probability is greater than a threshold (a value between 0 and 1), the molecule is classified as positive (or active), otherwise negative (or inactive). In total, there are four possible outcomes: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). Based on the TP and FP rates across different probability thresholds, the receiver operating characteristic curve can be plotted with the area under the ROC curve as AUROC. Similarly, based on precision and recall, the precision-recall curve can be plotted and AUPRC is calculated likewise. AUROC usually ranges from 0.5 and 1, with 0.5 for random classification and 1 for perfect classification; in the case of a classifier worse than random guessing, AUROC can be lower than 0.5. Compared to accuracy, AUROC is more robust in the case of imbalanced datasets

[robinson2020validating]. Nonetheless, by assuming equal importance of positive and negative classes, AUROC may suffer from imbalanced datasets, especially when the minor class is more of interest [robinson2020validating]. Thus, AUPRC is proposed as an alternative when datasets are imbalanced [saito2015precision], with the baseline value as the proportion of the minor class.


Despite the usefulness of AUROC and AUPRC, these "collective" metrics may not be directly pertinent to virtual screening [robinson2020validating], a common application for molecular property prediction [deng2022artificial]. In fact, the primary goal of early drug discovery is to rank molecules based on the predicted activity so as to avoid the intractable number of false positives or false negatives to be filtered out experimentally [shoichet2004virtual]. Given a set of predicted actives or inactives, depending on the screening goal, we argue that positive predictive value (PPV; Equation 1) and negative predictive value (NPV; Equation 2) are more relevant to virtual screening and drug design, as discussed in Section 2.4.2. Notably, unlike AUROC and AUPRC which are averaged across different probability thresholds, a decision threshold is determined first before deriving the TP, FN, TN and FP, based on which PPV and NPV are calculated. When the datasets are balanced, the decision threshold for positive vs. negative is set as 0.5. However, for imbalanced datasets, the threshold needs to be adjusted. Here, we used Youden’s statistic [schisterman2008youden], the distance between the ROC curve and the random chance line, to adjust threshold which maximizes the statistic.

3.2.2 Regression metrics

For the regression task, RMSE (Equation 3) and MAE (Equation 4

) are the recommended metrics, which indicate how apart the predicted values are from the labels: a lower value stands for a better model fit. MAE is a natural measure of average error whereas RMSE gives more weight to large errors and is more sensitive to the outliers. In addition, two other metrics can also measure the regression performance

[wu2018moleculenet, yang2019analyzing, cortes2019kekulescope], namely, Pearson_R and R2, which are scale independent.


Pearson_R is an intuitive measure of the linear correlation between the predicted values and labels [lu2021neural]

, which is the ratio between the covariance of two variables and the product of their standard deviations (Equation 

5), ranging from -1 to 1. An absolute value of 1 indicates a perfect linear relationship between the predicted values and labels. Notably, some studies used Pearson_R [cortes2019kekulescope] while other papers used the square of Pearson_R [wu2018moleculenet, shen2021out]

, i.e., Pearson_R2, which always ranges from 0 to 1. On the other hand, R2, also known as the coefficient of determination, is not based on correlation. Instead, R2 calculates the proportion of the variance in the predicted values that can be explained by the labels (Equation 

6). Usually, R2 ranges from 0 to 1. A higher R2 corresponds to a better model fit. The best scenario is that the predicted values exactly match the observed values and R2 would be 1. On the contrary, the baseline case is that the model always predicts and R2 would be 0. Worse still, when models have poorer predictions than the baseline, R2 can even be negative. Presumably due to naming similarity, R2 and Pearson_R2 can sometimes be messed with each other. In this study, we included both Pearson_R and R2, which are calculated with the scipy package and the scikit-learn package, respectively.

Statistical Test Alias Parametric Normality Equal Variance Equal Size
Paired test Dependent test
Unpaired test Independent or Welch’s test
Wilcoxon signed-rank test -
Wilcoxon rank-sum test Mann-Whitney test
Table 4: Summary of the Statistical Tests.

3.3 Model training

For RF, we followed Chemprop [yang2019analyzing]

and set the number of trees as 500. For the hyperparameters in MolBERT and GROVER, we adopted the optimal hyperparameters reported in the original papers

[fabian2020molecular, rong2020grover]

. All experiments with the neural networks were run on a single NVIDIA V100 GPU. For the HIV dataset, MolBERT takes around 3 hours to complete a 100-epochs training in each split fold when the batch size is 32. Since GROVER takes even more time, we set the batch size as 256 when applying GROVER on HIV, which still takes around 5 hours to complete a 100-epoch training. To ensure fair comparison, we saved the raw predictions, based on which prediction performance were calculated using same evaluation codes, which are also provided in our Github repository.

3.4 Statistical analyses

To examine if the prediction performance is significantly improved by the representation learning models, we conducted statistical analyses on the prediction performance. Usually, two major categories of analyses can be applied: parametric and non-parametric tests [sedgwick2015comparison] (Table 4). Among them, the parametric tests examine whether two groups have equal means, which can be further categorized into paired test and unpaired test. For the paired

test, the null hypothesis is that the means of two populations are equal, underlying by the equal-variance assumption. When two samples have unequal variances and/or unequal sample sizes, the unpaired or independent

test, a.k.a. Welch’s test, should be used. Notably, the paired and independent tests are parametric tests with the normality assumption. When the sample size is large, the tests can still be robust to moderate violations of the normality assumption. Nonetheless, if the normal assumption is violated and the sample size is small, non-parametric tests should be used instead, namely, Wilcoxon signed-rank test and Wilcoxon rank-sum test. Wilcoxon signed-rank test uses the signed rank to compare the medians of two populations, which is a non-parametric version of the paired test. Wilcoxon rank-sum test, a.k.a. Mann-Whitney

test, also compares the medians but is robust to the violations of homoscedasticity. Neither Wilcoxon signed-rank test nor Wilcoxon rank-sum test requires the normality distribution. However, when the data are normally distributed, the non-parametric tests may lead to less statistical power, which corresponds to a higher probability of making the type II error (false negative)

[sedgwick2015comparison, robinson2020validating].

Since we repeated the dataset split independently for 30 times, we adopted the parametric tests for performance comparison based on the central limit theorem. The statistical significance is defined as the two-sided

value less than 0.05.

Figure 5: Datasets Profiling for the Benchmark and Opioids-related Datasets. (a). Label distribution for the selected MoleculeNet benchmark datasets. (b). Label distribution for the opioids-related datasets. (c). Activity-cliffs showcase on a series of molecules with the KOR Top 15 scaffold (Sup. Figure 10).

4 Results

4.1 On the datasets

4.1.1 Dataset profiling

To have a clear grasp on the datasets, we first presented the label profiling for the benchmark datasets and the opioids-related datasets (Figure 5). As shown in Figure 5

(a), BACE is balanced with a positive rate of 45.7%, whereas BBBP is imbalanced towards the positives (76.5%) and HIV has much less positive instances (3.5%). The labels of ESOL, FreeSolv and Lipop all exhibit left-skewed distribution, especially for FreeSolv. On the other hand, the pIC50 distribution for the opioids-related datasets is right-skewed as shown in Figure 

5 (b), presumably because most screened molecules exhibited low activity to the targets. To construct the opioids-related datasets in the classification setting, we applied a cutoff at 6 on the raw pIC50 values to convert them into binary values, namely active and inactive, abiding by the rule that pIC50 less than 6 inactive otherwise active. As shown in Figure 5 (b), the resultant datasets are all imbalanced. For MOR, DOR and KOR, the positive rates are 29.7%, 23.3% and 27.8%, respectively. For MDR1, CYP2D6, CYP3A4, the positive rates are even lowered to 9.1%, 1.4% and 2.2%.

To quantify the difference of label distributions, we calculated the Kolmogorov statistic [massey1951kolmogorov] among the training, validation and test sets (Sup. Figure 1 (a)). When using scaffold split, the statistic is more dispersed with a higher median than that when using random split, suggesting that scaffold split leads to larger gaps in the label distributions in addition to separating molecules by the scaffolds, which, to some extent, manifests that molecules with same scaffolds tend to have similar properties. Random split, on the other hand, results in a more compact distribution of the statistic with a lower median, indicating that training and test sets are more likely to have molecules with close labels. To quantify the structural similarity among the training, validation and test sets, we also calculated the Tanimoto similarity [todeschini2012similarity], shown in Sup. Figure 1 (b). Likewise, Tanimoto similarity exhibits more compact distribution under random split, the median of which are also higher, showing that training and test molecules are more structurally similar compared to those under scaffold split.

We also calculated the percentage of top fragments, i.e., heterocycles and functional groups, for both benchmark and opioids-related datasets, which are summarized in Sup. Figures 2 & 3. The top heterocycles vary across different datasets, which also manifest their unique pharmacological properties. For instance, in Sup. Figure 3 (a), the top heterocycle is piperidine for MOR, DOR and KOR, which is commonly seen in opioid analgesics [smith2022analgesic]. For the functional groups, all datasets share top functional groups such as benzenes and amines, which are common components to facilitate the interaction with drug targets, typically proteins with abundant amino acid residues, via forming hydrogen bonds or - stacking interactions [bissantz2010medicinal]. Other structural traits, such as the molecular weight (MW), number of rotatable bonds and the number of rings, are summarized in Sup. Figure 4.

Figure 6: Comparison of Molecular Property Prediction Performance Using Benchmark Datasets. (a). Bar plot (meanconfidence interval over 30 folds) and violin plot for the prediction performance using recommended metrics (AUROC for classification; RMSE for regression) under random and scaffold split. (b). Point plot (mean confidence interval over 30 folds) for the prediction performance under scaffold split. (c). Statistical significance using paired test in prediction performance comparison between different models under scaffold split.

4.1.2 Activity cliffs

For molecular property prediction, the essential goal is to achieve chemical space generalization, which, ideally, encompasses both inter-scaffold generalization and intra-scaffold generalization (see Section 2.4.3). Scaffold split, which ensures no scaffold overlap among training, validation and test sets, addresses the inter-scaffold challenge and has been widely adopted in the literature. However, the intra-scaffold generalization, especially in the case of activity cliffs, has not been well addressed yet. Thus, we proposed to evaluate the intra-scaffold generalizability of the representation learning models using the opioids-related datasets.

For each dataset, we visualized the top 30 scaffolds along with its pIC50 distribution (Sup. Figures 5-10). To showcase the activity cliff where analogs exhibit drastic difference in potency, we used the KOR Top 15 scaffold (Sup. Figure 10) for illustration. In Figure 5 (c), the replacement of the two hydrogen atoms with the chlorine atoms at the phenyl ring from molecule #9 to molecule #1 results in a drastic activity increase by 7 orders of magnitude, which, presumably, is because the chlorine atoms help the ligand better occupy the hydrophobic space in the binding pocket, an important contributor for binding. When comparing molecule #1 to molecule #5, the introduction of the hydroxyl group at the pyrrolidine ring increases the potency by 4 orders of magnitude, which also indicates that a potential H-bond interaction with the receptor is crucial for binding. Meanwhile, although shortening the acetyl group to the aldehyde group poses a minor reduction in activity when contrasting molecule #5 to molecule #6, longer side chains (molecule #7 & #8) can undermine activity, which suggests limited space around the binding site.

Dataset #AC. Scaff. (%) #Mol. AC. Scaff. (%)
MDR1 62 (10.2) 594 (41.3)
CYP2D6 124 (9.3) 710 (31.0)
CYP3A4 146 (7.2) 926 (25.2)
MOR 213 (13.1) 1,627 (46.1)
DOR 178 (11.6) 1,342 (41.6)
KOR 218 (13.1) 1,502 (45.2)
Table 5: Summary of Activity Cliffs in the Opioids Datasets.

These molecules show that major activity change can happen even with minor structural changes. More formally, the activity cliff is defined as IC50 values spanning at least two orders of magnitude within the same scaffold [hu2013advancing, stumpfe2019evolving]. Notably, there are also studies which adopted one order of magnitude [stumpfe2019evolving]. In this study, the scaffolds observed with activity cliffs are termed as the AC scaffolds and the molecules with the AC scaffolds are denoted as the AC molecules. The number (percentage) of AC scaffolds and AC molecules are summarized in Table 5. Notably, among MDR1, MOR, DOR and KOR, nearly half molecules are equipped with the AC scaffolds, although the percentages of AC scaffolds are around 10% in all opioids-related datasets.

4.2 On the model evaluation

In this section, we focus on the prediction under scaffold split.

4.2.1 Does the learned representation surpass ECFP?

We first compared molecular property prediction performance of different models, namely, RF, MolBERT, GROVER and GROVER_RDKit, using the recommended evaluation metrics [wu2018moleculenet]. For ECFP, we followed Chemprop [yang2019analyzing] and set radius as 2 and number of bits as 2048.

As shown in Figure 6 (a), RF achieves the highest AUROC in BACE and HIV (), where MolBERT also outperforms GROVER (). As for BBBP, all models achieve comparable AUROC, except for a slight but significant improvement of GROVER_RDKit over GROVER. In fact, GROVER_RDKit also outperforms GROVER in BACE and HIV (). Thus, when evaluated by the benchmark classification datasets, RF on ECFP can be more useful than the representation learning models on both SMILES strings and molecular graphs. On the other hand, when using the benchmark regression datasets, GROVER_RDKit beats RF, MolBERT and GROVER, as indicated by the lowest RMSE (). Notably, without concatenating the RDKit features, the regression performance can be harmed, as shown by the significantly increased RMSE of GROVER over GROVER_RDKit. However, even without the concatenated features, GROVER still outperforms RF and MolBERT in ESOL and FreeSolv, which shows the prediction power of GNNs. One exception, though, is in Lipop, where GROVER has the highest RMSE (). In fact, ESOL and FreeSolv have less data points and shorter SMILES lengths with simpler structures compared to other benchmark datasets (Table 2 & Sup. Figure 2). One viable hypothesis is, therefore, that GNNs can fit better when dataset size is small and molecular structures are relatively simple with shorter sequences.

Figure 7: Molecular Property Prediction Performance using Opioids-related Datasets in Classification Setting. (a). Bar plot (mean confidence interval over 30 folds) and violin plot for the prediction performance using recommended metrics (AUROC for classification) under random and scaffold split. (b). Point plot (mean confidence interval over 30 folds) for the prediction performance under scaffold split. (c). Statistical significance using paired test in prediction performance comparison between different models under scaffold split.

To further examine the effectiveness of the learned representations, we also conducted evaluation using the opioids-related datasets under the classification setting, as shown in Figure 7 (a). For MDR1, RF and GROVER_RDKit achieve best performance under scaffold split whereas GROVER shows the lowest mean AUROC, although not statistically different from MolBERT. For CYP2D6, which is highly imbalanced with a positive rate of 1.4% at cutoff 6, all models exhibit very limited performance with mean AUROC around 0.5, or even lower than 0.5 when using GROVER and GROVER_RDKit. For CYP3A4, RF and GROVER_RDKit achieve similarly high performance under scaffold split () whereas GROVER exhibits lowest AUROC among all models (). Notably, these PK-related datasets are all imbalanced, which could be why the AUROC values exhibit large variability across the 30 split folds, especially for CYP2D6. On the contrary, for the PD-related datasets, i.e., MOR, DOR and KOR, the metric variability is smaller but RF still outperforms MolBERT, GROVER and GROVER_RDKit (), which, together with the observation on the PK-related datasets, suggest that ECFP can still be very useful for molecular property prediction.

Moreover, we observed that the RDKit-derived descriptors, originally proposed to concatenate with the learned representations in Chemprop [yang2019analyzing], have a non-negligible role in the GROVER’s prediction performance. For instance, compared to GROVER, GROVER_RDKit achieves significantly higher AUROC in BACE, BBBP and HIV and exhibits significantly lower RMSE in ESOL, FreeSolv and Lipop in Figure 6 (a). In the opioids-related datasets, compared to GROVER, GROVER_RDKit achieves significantly higher AUROC in MDR1, CYP3A4 and DOR (Figure 7). Thus, the concatenation of the RDKit descriptors to the learned representations may be misleading when evaluating the real power of the representation learning models.

However, with the concatenation of RDKit descriptors, the variability of AUROC associated with GROVER can be reduced, as indicated by the violin plots in Figure 6 (a) and Figure 7 (a). For example, different models not only generate varying mean AUROC values but also exhibit different extents of variation in AUROC. Generally, RF on ECFP has the lowest extent of dispersion, whereas GROVER on molecular graphs generates AUROC varied greatly across split folds, indicating high prediction uncertainty. Again, ECFP corresponds to not only better but also more stable metric values, which can be, therefore, applied more reliably.

4.2.2 Are the statistical tests necessary?

To demonstrate the necessity of statistical tests, we conducted a simple analysis using the benchmark datasets, which tries to answer: under scaffold split, the widely-adopted practice, how many 1-fold or 3-fold split(s) out of the 30 split folds are there for a certain model to be concluded as the best based on the average metric value alone? Note that GROVER_RDKit is removed from this analysis since concatenation of RDKit descriptors can significantly bias the comparison of the representation learning models (see Section 4.2.1).

Model random forest molbert grover
BACE 28 2 0
BBBP 12 9 9
HIV 26 4 0
ESOL 0 0 30
FreeSolv 2 0 28
Lipop 1 27 2
Table 6: Number of Single Fold where a Model Achieves the Best Performance under Scaffold Split.

For RF, MolBERT and GROVER (denoted as GROVER), we first calculated the number of single test fold where a model achieves best performance using the recommended metrics (Table 6). For the classification datasets, RF dominates AUROC in BACE and HIV across 28 and 26 folds, respectively, whereas GROVER does not excel in any test folds, which is consistent with the finding in Section 4.2.1. Still, there are a few folds where MolBERT achieves the highest AUROC in BACE and HIV, which means there is a chance to wrongly conclude MolBERT as best-performing, despite that RF has the highest AUROC over 30 folds (see Section 4.2.1). In BBBP, where RF, MolBERT and GROVER exhibit comparable AUROC, there are about 10 split folds for each model to achieve the highest AUROC. This can be seen as a manifestation of statistical noise, which makes it prone to an erroneous conclusion if only tested on a random fold. To emulate the common 3-fold practice, we also calculated the number of 3-fold combinations out of the 30 folds, where a specific model has the best performance based on the algorithmic mean of the recommended metrics. Table 7 shows that there are quite some 3-fold combinations where one model can be mistaken as best performing using the recommended metrics. For the analysis results using other metrics, readers can refer to Sup. Tables 1 & 2.

Model random forest molbert grover
BACE 3,924 136 0
BBBP 2,105 1,013 942
HIV 3,711 349 0.0
ESOL 0 0 4,060
FreeSolv 0 0 4,060
Lipop 49 3,945 66
Table 7: Number of 3-Fold Combinations where a Model Achieves the Best Performance under Scaffold Split.

Thus, without statistical tests, chances are that wrong conclusions can be drawn with regard to whether a new technique truly advances molecular property prediction. Moreover, since the benchmark datasets are publicly accessible, one caveat is that the test folds might be customized to cater to a specific model. Moreover, the type of statistical test for comparison also matters. As mentioned in Section 3.4, we adopted the parametric paired test to compare different models in a pairwise manner. Meanwhile, we also calculated the statistical significance using different tests for the benchmark datasets. As shown in Sup. Figure 11 (a) & (c), there is little difference when using the parametric paired test vs. the non-parametric Wilcoxon test. However, if independent test is used, statistical significance varies in BACE (AUROC, NPV), BBBP (PPV), FreeSolv (RMSE, MAE) and Lipop (RMSE, R2), by contrasting Sup. Figure 11 (a) & (b). Therefore, when evaluating molecular property prediction, it is imperative to conduct statistical tests and also necessary to report the test type.

Figure 8: Evaluation of Molecular Property Prediction Using Opioids-related Datasets in Regression Setting. (a). Bar plot (mean confidence interval over 30 folds) and violin plot for the prediction performance using recommended metrics (RMSE for regression) under random and scaffold split. (b). Point plot (mean confidence interval over 30 folds) for the prediction performance under scaffold split. (c). Statistical significance using paired test in prediction performance comparison between different models under scaffold split.

4.2.3 Which evaluation metric is appropriate?

In MoleculeNet [wu2018moleculenet], each benchmark dataset comes with a recommended evaluation metric. However, in real-world drug discovery, the recommended metrics may not be pertinent (see Section 2.4.2). In this section, we compared the model performance using a set of evaluation metrics in addition to the recommended ones. For the classification datasets, we calculated AUROC, AUPRC, PPV and NPV (see Section 3.2.1); for the regression datasets, we calculated RMSE, MAE, R2 and Pearson_R (see Section 3.2.2). As shown in Figure 6 (b) & (c), when using the recommended AUROC, RF achieves similar performance with MolBERT and GROVER_RDKit in BBBP (). However, if the evaluation metric is NPV, RF significantly outperforms all the other three models. Another example is that, when evaluated by RMSE, MAE or R2, RF and MolBERT achieve similar performance in ESOL (); but when evaluated by Pearson_R, MolBERT has a significantly higher metric value than RF. Besides, although GROVER and GROVER_RDKit have similar Pearson_R values in FreeSolv (), GROVER_RDKit significantly outperforms GROVER with R2 as the evaluation metric. Thus, different evaluation metrics may lead to disparate conclusions and caution is needed, especially for metrics with similar naming, such as R2 and Pearson_R.

As for which metrics are appropriate, we observed that for the opioids-related datasets in Figure 7 (b), AUROC is generally over 0.80 (except for CYP2D6), whereas most AUPRC drops below 0.75. For MolBERT, GROVER and GROVER_RDKit, AUPRC drops to c.a. 0.25 in CYP3A4 (positive rate: 2.2%) and in CYP2D6 (positive rate: 1.4%), AUPRC is nearly zero. Thus, AUROC is over-optimistic and AUPRC can be a more proper metric for imbalanced datasets. Furthermore, despite the high AUROC (c.a. 0.90), AUPRC (c.a. 1.0) and PPV (c.a. 0.90) in BBBP, the NPV drops to c.a. 0.75 for RF and c.a. 0.65 for the other three models in Figure 6 (b). In this case, although the collective metrics, namely AUROC and AUPRC, are nearly perfect, the NPV can still be limited. This can be an issue if the goal of virtual screening is to identify hits impermeable through the blood-brain barrier, since only around 65% are truly not permeable among the predicted negatives, . On the contrary, although the best AUROC in HIV achieves c.a. 0.80 as shown in Figure 6 (b), the best PPV is just c.a. 0.25. Also shown in Figure 7 (b), PPV is limited for the opioids-related datasets. For instance, the best PPV is c.a. 0.7 in MDR1, whereas in MOR, DOR and KOR, PPV is around 0.6, let alone the extremely imbalanced CYP2D6 and CYP3A4, where PPV drops to nearly zero. Consequently, if the goal of virtual screening is to identify hits active for these targets, there can be a large proportion of false positives among the predicted actives. In short, precision metrics, such as PPV and NPV, can be more suitable for evaluating molecular property prediction under the classification setting, which is further contingent upon the emphasis on positives or negatives, i.e., the specific goal of virtual screening. Nevertheless, the decision threshold determination could be an issue. As mentioned in Section 3.2.1, we used Youden’s statistic [schisterman2008youden] to adjust the threshold. Meanwhile, we also calculated PPV and NPV using the default threshold for the opioids-related datasets. As shown in Sup. Figure 12, using the adjusted threshold would decrease PPV but increase NPV, which is especially prominent in RF. For MolBERT, GROVER and GROVER_RDKit, however, using the adjusted threshold may increase PPV, such as in CYP3A4, DOR and KOR, presumably due to the innate ROC curve shapes for each model-dataset combination (Sup. Figure 13). In fact, when adjusting the threshold, there is a trade-off between false positives and false negatives. Using Youden’s statistic, we assumed equal importance of both, which is debatable in real-world virtual screening. Further study is needed for the determination of optimal decision threshold.

Figure 9: Prediction Performance Comparison on Chemical Space Generalization. (a). Point plot (mean confidence interval over 30 folds) for the prediction performance using recommended metrics (AUROC for classification; RMSE for regression) under random and scaffold split. (b). Bar plot (mean confidence interval over 30 folds) for the prediction performance among AC and non-AC molecules using recommended metrics (AUROC for classification; RMSE for regression). (c). Statistical significance using unpaired test in prediction performance comparison between random split and scaffold split. (d). Statistical significance using unpaired test in prediction performance comparison between AC and non-AC molecules.

4.2.4 How does the task setting matter?

To study how task setting affects model evaluation, we conducted experiments to directly regress the pIC50 values in the opioids related datasets and the results are shown in Figure 8.

We first compared model performance using the recommended metric, RMSE. Despite the superior performance of GROVER_RDKit in two benchmark regression datasets, namely ESOL and FreeSolv (see Section 4.2.1), RF achieves the lowest RMSE in all raw opioids-related datasets compared to the representation learning models (), except in MDR1 where GROVER_RDKit shows comparable RMSE with RF. Even using other evaluation metrics, namely, MAE, R2 and Pearson_R, RF still achieves the best performance (), except in MDR1 using Pearson_R. More specifically, MAE can be lowered to c.a. 1.5 by RF in all opioids-related datasets whereas for GROVER, MAE can be as high as c.a. 2.5 in MDR1. These observations, again, supports the effectiveness of ECFP for molecular property prediction.

Next, we compared the prediction performance under the classification and regression settings. As shown in Figure 7 (b), under the classification setting, all models achieve limited performance in CYP2D6, with particularly abysmal performance in PPV. However, under the regression setting, RMSE and MAE in CYP2D6 can be lowered to c.a. 1.5, which suggests that regression might be more suitable in predicting CYP2D6’s activity, although the pIC50 labels can be noisy [mervin2021probabilistic]. On the contrary, in MDR1, MOR, DOR and KOR, where the prediction performance is promising indicated by high AUROC values under the classification setting, RMSE and MAE are worsen when compared to the CYP2D6. One potential cause for the disparate performance between the classification and regression settings can be the arbitrary activity cutoff value. Since each dataset has a unique label distribution (Figure 5), the arbitrary cutoff value at 6 may cause prediction difficulty to different extents. For instance, similar molecular structures with close pIC50 values around 6 could be coerced to actives vs. inactives, which forms a challenging task and may act as a major source of misclassification.

4.3 On the chemical space generalization

In this section, we checked how the predictive models perform in terms of both inter-scaffold and intra-scaffold generalization, using the opioids-related datasets for illustration.

4.3.1 Inter-scaffolds generalization

As illustrated in Figure 4, we conducted experiments using both random and scaffold split for the opioids-related datasets. The prediction performance under scaffold split have been discussed in Section 4.2. Since there is no scaffold overlap among training, validation and test sets scaffold split, we compared the prediction performance between scaffold split and random split (Sup. Figure 14) so as to evaluate how the models perform during the inter-scaffold generalization. To calculate the statistical significance, we adopted the unpaired test to compare with the recommended metrics, which are summarized in Figure  9 (a) & (c). Compared to random split, prediction performance indicated by AUROC and RMSE, significantly deteriorates under scaffold split in nearly all datasets using RF, manifesting the inter-scaffold generalization challenge. In certain datasets, such as MOR, MolBERT and GROVER_RDKit, show similar RMSE in both split settings but still, in other datasets such as KOR, inter-scaffold generalization can be limited. Moreover, the dispersion of the evaluation metric is also higher under scaffold split, as shown in Figure 7 (a) and Figure 8 (a), indicating higher uncertainty in prediction during inter-scaffold generalization.

4.3.2 Intra-scaffold generalization

Here, we compared the prediction performance for the test molecules with and without the AC scaffolds under both scaffold and random split, shown in Figure  9 (b) & (d). For the AC molecules, the prediction performance is generally worse compared to those of the non-AC molecules, as indicated by AUROC and RMSE. The inferior performance for the AC molecules reflects the limited intra-scaffold generalization, especially in the case of activity cliff. Clearly, there is a gap for intra-scaffold generalization. Nonetheless, for models exhibiting limited performance in Figure 7 (a), such as GROVER which exhibits low AUROC in DOR, the gap for intra-scaffold generalization may not be readily observed. For instance, there is a significant difference in prediction performance between AC and non-AC molecules in DOR when using RF, MolBERT and GROVER_RDKit, which, nonetheless, turns insignificant when using GROVER. Besides, the performance differences between AC and non-AC molecules are generally more prominent under scaffold split, as indicated by the green grids in Figure  9 (d). In other words, random split can alleviate the intra-scaffold generalization in the case of activity cliffs, presumably due to that some AC scaffolds have been seen during training, which enables more accurate prediction at inference time. Again, this observation suggests the importance of scaffolds in molecular property prediction.

In fact, as pointed out by Robinson et al.[robinson2020validating]

, active molecules with different scaffolds can interact with the target with very different mechanisms. Thus, expecting a model to generalize by learning from unseen scaffolds can be somewhat unrealistic. Here, by interrogating the intra-scaffold generalization, we further substantiate this point with the activity-cliff issue, which has two-fold meanings: 1). exposing the predictive model to a set of diverse scaffolds during training phase can be conducive for inference, even holding potential for activity cliff, although further study is needed; 2). when applying the trained molecular property prediction models, for example, in a deep reinforcement learning framework for drug design

[deng2020towards], if the generated molecule has a novel scaffold, which is further related to molecules with drastic activity changes, its predicted property for the generated molecules should be treated with lower confidence.

Figure 10: The Raw Predictions (a) and Learned Embeddings (b) for the KOR Activity-Cliffs Showcase Molecules from Figure 5 (c).

Furthermore, we also examined the learned representations for the AC showcase molecules in Figure 5 (c), taken from one split fold under scaffold split as an illustrative example. As shown in Figure 10

(a), the raw pIC50 predictions are abysmal, especially for MolBERT and GROVER, where the average pIC50 values seem to be simply imputed as the predictions for the AC molecules, indistinguishable from each other. Unsurprisingly, the visualized embeddings, shown in Figure 

10 (b), are also mixed despite the disparate pIC50 values, which is a manifestation of the structure-property mismatch [na2020machine].


In this study, we take a respite from representation learning for molecular property prediction. Two self-supervised representation learning models based on SMILES strings and molecular graphs, MolBERT [fabian2020molecular] and GROVER [rong2020grover], are mainly investigated. Compared to supervised learning, self-supervised learning does not require heavy human annotations [jing2020self], which can be particularly expensive in drug discovery [wouters2020estimated]. As shown in the work by Hu et al.[hu2019strategies], self-supervised pre-training helps avoid the negative transfer associated with supervised pre-training. Self-supervised learning can be roughly categorized into: generative, contrastive and generative-contrastive (adversarial) [liu2020self]. Pretraining tasks, such as masked language modeling in MolBERT and contextual property prediction in GROVER, are leaning towards the generative type.

Recently, the contrastive type of self-supervised pretraining has also been applied in molecular property prediction. For example, MolCLR [wang2021molclr] proposes three augmentation strategies, namely, atom masking, bond deletion and subgraph removal, on molecular graphs to pretrain GCN and GIN, respectively. For molecular graph pairs augmented from the same molecule, they are denoted as positive otherwise negative and the normalized temperature-scaled cross-entropy loss (NT-Xent) loss is applied, which maximizes the similarity between positive pairs and minimizes the similarity between negative pairs. More recently, iMolCLR [wang2022improving]

has also been proposed, where structural similarities between negative pairs and fragment-level contrasting substructures decomposed from molecules are integrated into the NT-Xent loss function. Notably, one pretraining task in MolBERT

[fabian2020molecular] is the SMILES equivalence prediction, where given an input of two SMILES strings (the second SMILES is either randomly sampled from the pretraining corpus or a synonymous permutation of the first SMILES), the model predicts whether they represent the same molecule. Based on the ablation study, however, the SMILES equivalence task slightly but consistently decreases downstream performance. Besides, MolBERT [fabian2020molecular] and GROVER [rong2020grover] also utilized RDKit [Landrum2016RDKit2016_09_4] to calculate molecular descriptors values or extract graph-level motifs as domain-relevant labels for pretraining. As indicated by the ablation study in MolBERT, molecular descriptors values prediction has the highest impact on downstream performance. Also shown in this study, the 200 RDKit global molecular features play a crucial role in GROVER. We also observed that ECFP4 significantly outperforms the learned representations in many cases, which coincides with previous studies [lane2020bioactivity, jiang2021could]. One potential direction would therefore be coming up with more suitable ways of exploiting molecular descriptors or fixed fingerprints.

Nonetheless, despite the advancement in AI techniques, whether AI can benefit real-world drug discovery is not without its concerns [bender2020artificial, bender2021artificial]. Guidelines for evaluating molecules generated by AI techniques have been suggested by Walters et al.[walters2020assessing]. Likewise, evaluation of molecular property prediction should also be standardized. Recently, Bender et al.[bender2022evaluation] proposed a set of evaluation guidelines for machine learning tools, covering appropriate comparison methods and evaluation metrics, among others. In our study, we addressed three aspects of molecular property prediction on datasets profiling, model evaluation and chemical space generalization (Figure 1). For the datasets, each of them has unique label distribution and molecular structures, which poses different degrees of prediction difficulty. The molecular structures are dissected into scaffolds and structural traits, including fragments (functional groups and heterocycles) and other structural traits, such as MW and number of rings. Furthermore, under different dataset split schemes and with different random seeds, the structural and label divergence among the training, validation and test sets also vary, which acts as a source for the performance variance. For the model evaluation, we compared the two self-supervised representation learning models, MolBERT and GROVER, with RF on the baseline fixed molecular features, ECFP4. With rigorous statistical analyses, ECFP4 is demonstrated to lead in the performance in many cases, which suggests there is still room for representation learning in molecular property prediction. Besides, we also found that different models exhibit different extents of dispersion in the evaluation metrics, which might reflect the inherent prediction uncertainty underlying each model. We envision that the metric variability should also be incorporated for model evaluation rather than just comparing the means. In a nutshell, there should be sufficient split folds and statistical analysis. Moreover, the use of evaluation metrics may also affect the conclusions drawn, which should be chosen based upon the question of interest. Arguably, the precision metrics should be reported together with the collective metrics such as AUROC, which is more pertinent to virtual screening. As a last component in the model evaluation, we also experimented on how the task setting affects the prediction performance for the opioids-related datasets. The disparate performance between the classification setting and regression setting suggests that applying one arbitrary activity cutoff value can be another source for prediction uncertainty due to potentially distinct structural traits and label distribution around the cutoff. For the chemical space generalization, we dissect it into inter-scaffold generalization and intra-scaffold generalization. The inferior prediction performance under scaffold split, compared to random split, suggests that better AI techniques are needed to improve w.r.t. inter-scaffold generalization. Similarly, the inferior performance for the AC molecules (see Section 2.4.3) suggests more efforts are needed for intra-scaffold generalization, especially in the case of activity cliffs. A more standard protocol on routine activity-cliffs evaluation is needed. One reason for the neglect of this real-world drug discovery challenge could be partially due to the heavy reliance on the MoleculeNet benchmark datasets.

Indeed, the widely-used benchmark datasets may be of little relevance to real-world drug discovery [walters2021critical]. Besides, some benchmark datasets can pose unreasonable prediction tasks [bender2021artificial]. For example, SIDER [wu2018moleculenet] is a benchmark dataset for 1,427 marketed drugs and their side effects in 27 system organ classes. In addition to the molecular structures, there are many other factors underlying the side effects in humans, such as food-drug interactions [deng2017review], drug-drug interactions [deng2020informatics], among others [bender2021artificial]. Thus, it is unrealistic to expect a model to directly predict side effects from the chemical structures. Another example is ClinTox [wu2018moleculenet], which has a classification task for FDA approval status in addition to the clinical trial toxicity. Again, these two tasks can only be partially attributed to the chemical structures. Thus, to examine the usefulness of the advanced representation learning models, we also assembled a suite of opioids-related datasets from ChEMBL. As shown in Figure 5, the MOR, DOR and KOR datasets related to the PD aspect of opioid overdose are quite balanced. On the contrary, the CYP2D6 and CYP3A4 datasets related to the PK aspect of opioid overdose are extremely skewed to the left with an active rate less than 10% under the cutoff 6. Consequently, the PPV for these two metabolic enzymes are very limited. Indeed, the datasets in some domains are still lacking [bender2021artificial]. Besides, these collected datasets from public databases can be very noisy. For example, we found quite some duplicates and contradictory records in the opioids-related datasets, which were removed from further analysis. Sometimes, even the established benchmark datasets need an extra “washing” step to ensure the data quality [jiang2021could].

Last but not least, there are still some limitations in this study. Firstly, the sources underlying molecular property prediction include dataset split uncertainty, experimental data uncertainty, and model training uncertainty [mervin2021probabilistic]. Our experimental scheme of repeating datasets split 30 times with different random seeds only partially addresses the uncertainty. Moreover, there could also be variations introduced during model training, such as random weight initialization and random mini-batch shuffling [fort2019deep]. As a result, the ensembling technique has been proposed to alleviate the uncertainty related to model training and improve prediction accuracy [yang2019analyzing], which was not evaluated in this study due to heavy computation burden. Another underlying assumption, yet often neglected, is that the collected datasets are usually regarded as the gold standard without any experimental errors, which, however, may not hold true. Experimental uncertainty needs to be taken into consideration as well [mervin2021probabilistic]. Secondly, the explainability of the molecular property prediction models is not covered in this study, which is related to the explainable AI, aiming to make the predictions more understandable by domain experts [jimenez2020drug].

In conclusion, we expect that by taking this respite, increased awareness of the key aspects underlying molecular property prediction can be raised, which will eventually bring even better AI techniques in this field.



The experiments in this study are conducted using the computational resources provided by the AI institute in Stony Brook University. The authors also want to acknowledge the explicit codes and pretrained models from MolBERT and GROVER.

Author contributions statement

J.D. and Z.Y. conceived and designed the study. J.D. collected the datasets, conducted the experiments and analyzed the results. J.D. wrote the first draft of the manuscript. H. W. curated the chemistry-related contents. I. O., D.S. and F.W. supervised the project. F. W. provided funding support. All authors proofread the manuscript and made critical revisions.

Additional information

Accession codes The code, data and performance results will be available in the Github repository: Competing interests The authors have no competing interest.