Log In Sign Up

GuacaMol: Benchmarking Models for De Novo Molecular Design

by   Nathan Brown, et al.

De novo design seeks to generate molecules with required property profiles by virtual design-make-test cycles. With the emergence of deep learning and neural generative models in many application areas, models for molecular design based on neural networks appeared recently and show promising results. However, the new models have not been profiled on consistent tasks, and comparative studies to well-established algorithms have only seldom been performed. To standardize the assessment of both classical and neural models for de novo molecular design, we propose an evaluation framework, GuacaMol, based on a suite of standardized benchmarks. The benchmark tasks encompass measuring the fidelity of the models to reproduce the property distribution of the training sets, the ability to generate novel molecules, the exploration and exploitation of chemical space, and a variety of single and multi-objective optimization tasks. The benchmarking framework is available as an open-source Python package.


page 1

page 2

page 3

page 4


Tartarus: A Benchmarking Platform for Realistic And Practical Inverse Molecular Design

The efficient exploration of chemical space to design molecules with int...

JANUS: Parallel Tempered Genetic Algorithm Guided by Deep Neural Networks for Inverse Molecular Design

Inverse molecular design, i.e., designing molecules with specific target...

ChemTS: An Efficient Python Library for de novo Molecular Generation

Automatic design of organic materials requires black-box optimization in...

DeepOBS: A Deep Learning Optimizer Benchmark Suite

Because the choice and tuning of the optimizer affects the speed, and ul...

Guiding Deep Molecular Optimization with Genetic Exploration

De novo molecular design attempts to search over the chemical space for ...

Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization

Molecular optimization is a fundamental goal in the chemical sciences an...

1 Introduction

De novo molecular design is a computational technique to generate novel compounds with desirable property profiles from scratch.1 It complements virtual screening, where large virtual compound libraries are pre-generated, stored, and then subsequently ranked on demand. Virtual screening has the advantage of being reasonably fast and well understood. It is particularly useful when the virtual compounds are readily available, for example in in-house screening collections, or for commercially available compounds.2 Datasets sized on the order of can be routinely screened under current computational constraints.3

However, this is only a tiny fraction of chemical space, which has an estimated size anywhere between

and possible structures.3, 4 This number might be even larger considering new modalities such as PROTACs.5

Efficient approaches to query larger combinatorial spaces via similarity to given structures have been reported.6 However, it is not straightforward to perform more complex, multi-objective queries without enumerating substantial parts of the space.

In contrast, in de novo design, a relatively small number of molecules is explicitly generated, and chemical space is explored via search or optimization procedures. Thus, by focusing on the most relevant areas of chemical space for the current multi-objective query, it can in principle explore the full chemical space, given any ranking or scoring function.

In addition to established discrete models,7 models for de novo design based on deep neural networks have been proposed in recent years, and have shown promising results.8, 9 Unfortunately, validation methods for these generative models have not been consistent. Comparative studies to established de novo design models have not yet been performed. Furthermore, property optimization studies often focused on easily optimizable properties, such as drug-likeness or partition coefficients. This makes it difficult to understand the strengths and weaknesses of the different models, to assess which models should be used in practice, and how they can be improved and extended further.

In other fields of machine learning, for example computer vision and natural language processing, standardized benchmarks have triggered rapid progress.

10, 11 Similarly, we believe that the field of de novo molecular design can benefit from the introduction of standardized benchmarks. They allow for a straightforward survey and comparison of existing models, and provide insight into the strengths and weaknesses of models, which is valuable information to improve current approaches.

In this work, we introduce GuacaMol, a framework for benchmarking models for de novo molecular design. We define a suite of benchmarks and implement it as a Python package designed to allow researchers to assess their models easily. We also provide implementations and results for a series of baseline models.

2 Models for De Novo Molecular Design

De novo design approaches require three components: 1) molecule generation, 2) a way to score the molecules, and 3) a way to optimize or search for better molecules with respect to the scoring function.7

For scoring, any function that maps molecules to a real valued score can be used, for example quantitative structure-activity relationships (QSAR) and quantitative structure-property relationships (QSPR) models, structure-based models for affinity prediction, heuristics for the calculation of physicochemical properties, and combinations thereof for multi-objective tasks.

At the center of molecule generation and optimization strategies lies the choice of the molecular representation, which determines the potential range of exploration of chemical space. Molecules can be constructed atom-by-atom or by combination of fragments.7

They can be grown in the context of a binding pocket, which allows for structure-based scoring, or be constructed and scored using entirely ligand-based methods. Choosing a very general representation of molecules for construction, such as atom-by-atom and bond-by-bond building, allows for potentially exploring all of chemical space. However, without any constraints or scoring defining what sensible molecules look like, such representations can lead to molecules which are very difficult to synthesize or potentially unstable, particularly in combination with powerful global optimizers such as genetic algorithms.

7 To alleviate this problem, different approaches were introduced to generate molecules using readily available building blocks and robust reaction schemes (e.g. amide couplings), or fragments derived from known molecules.12, 7, 13 This, however, may drastically limit the potentially explorable space. This compromise between realistic molecules and large explorable molecular space is a major dilemma of de novo design.

A seemingly obvious solution is to add a contribution penalizing unrealistic molecules to the scoring function. Unfortunately, it is not trivial to encode what medicinal chemists would consider as molecules that are acceptable to take forward to synthesis.14, 15 Instead of defining what good molecules looks like as rules or in the form of a scoring function, deep learning approaches represent an attractive alternative, because they are potentially able to learn how reasonable molecules look like from data.16, 17

Machine learning has been used in chemoinformatics for at least 50 years to score molecules (QSAR applications), that is to predict properties given a structure.18 The inverse QSAR problem of predicting structures given target properties has received less attention.19 However, recent advances in algorithms and hardware have made machine learning models which directly output molecular structures tractable.16

Two early papers on generative models for molecules employed the simplified molecular-input line-entry system (SMILES) representation, which allows for the representation of molecular graphs as a string, with parentheses and numbers indicating branching points and ring closures.20 Gómez-Bombarelli et al. proposed to use variational auto-encoders (VAEs), which encode the SMILES representation of a molecule into a latent, continuous representation in real space, and back.8

Optimization can then be performed by gradient descent or Bayesian optimization in real space. Segler et al. introduced the notion of transfer and reinforcement learning for molecule generation using recurrent neural networks (RNNs) on SMILES, where a neural network pretrained on a general corpus of molecules is either fine-tuned on a small number of known actives, or coupled to an external scoring function to drive the generation of molecules with desired properties, e.g. activity against biological targets.

9 Both papers pointed out the then lack of models for direct graph generation, which is considerably harder than generating sequences. Generative models based on neural networks have since been further explored, with work on VAEs, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 RNNs, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44 generative adversarial networks (GANs), 40, 41, 45, 46

adversarial autoencoders (AAEs),

47, 48, 49, 50 and graph convolutional networks, 46, 51, 43 using SMILES strings or graphs to represent molecules.

Also, the first prospective studies using neural generative models for de novo

design have been published. Merk et al. used RNNs on SMILES and transfer learning to generate compounds

1-3 for nuclear receptors, which after synthesis and testing turned out to be micro- to nanomolar ligands (see Figure 1).52, 53

Figure 1: Compounds 1-3, generated by a SMILES LSTM model, were made and tested, and showed micro- to nanomolar activity against the RXR and PPAR receptors.53, 52

An important result about the space pretrained neural models can explore was recently published by Arus-Pous et al. They studied whether SMILES RNNs trained on a small subset (0.1%) of GDB13, an enumerated molecule set, which covers all potentially stable molecules up to 13 heavy atoms, could recover almost the full dataset after sampling a sufficient number of compounds. They showed that the SMILES RNN was able to cover (and thus explore) a large portion of the complete chemical space defined by GDB13, and even generate molecules which were incorrectly omitted in the original GDB13.54

3 Assessing De Novo Design Techniques

To profile models for de novo molecular design, we differentiate between their two main use cases:

  • Given a training set of molecules, a model generates new molecules following the same chemical distribution.

  • A model generates the best possible molecules to satisfy a predefined goal.

The collection of benchmarks we propose below assesses both facets defined here. In the following we will refer to these two categories as distribution-learning benchmarks and goal-directed benchmarks, respectively.

The two benchmark categories are evaluated independently to afford models as much flexibility as possible without penalty, since there is no one-to-one correspondence between distribution-learning and goal-directed tasks. For instance, some models are able to generate optimized molecules without learning the chemical distribution first. Also, a model able to reproduce a chemical distribution may employ different strategies to deliver optimized molecules.

3.1 Distribution-learning benchmarks

Models for de novo drug design often learn to reproduce the distribution of a training set, or use this training set to derive molecular fragments before generating targeted molecules. This allows some model architectures to learn the syntax of molecules in the selected molecular representation, and often accelerates the goal-directed tasks.

The distribution-learning benchmarks assess how well models learn to generate molecules similar to a training set. We consider five benchmarks for distribution learning:

3.1.1 Validity

The validity benchmark assesses whether the generated molecules are actually valid, i.e. whether they correspond to a (at least theoretically) realistic molecule. For example, molecules with an incorrect SMILES syntax, or with invalid valence, are penalized.

3.1.2 Uniqueness

Given the high dimension of chemical space and the huge number of molecules potentially relevant in medicine, generative models should be able to generate a vast number of different molecules. The uniqueness benchmark assesses whether models are able to generate unique molecules — i.e., if a model generates the same molecule multiple times, it will be penalized.

3.1.3 Novelty

Since the ChEMBL training set represents only a tiny subset of drug-like chemical space, a model for de novo molecular design with a good coverage of chemical space will rarely generate molecules present in the training set. The novelty benchmark penalizes models when they generate molecules already present in the training set. Therefore, models overfitting the training set will obtain low scores on this task.

3.1.4 Fréchet ChemNet Distance

Preuer et al. introduced the Fréchet ChemNet Distance (FCD) as a measure of how close distributions of generated generated are to the distribution of molecules in the training set.55

The FCD is determined from the hidden representation of molecules in a neural network called ChemNet trained for predicting biological activities, similarly to the Fréchet Inception Distance sometimes applied to generative models for images. More precisely, the means and covariances of the activations of the penultimate layer of ChemNet are calculated for the reference set and for the set of generated molecules. The FCD is then calculated as the Fréchet distance for both pairs of means and covariances. Similar molecule distributions are characterized by low FCD values.

3.1.5 KL divergence

The KL (Kullback-Leibler) divergence



measures how well a probability distribution

approximates another distribution . For the KL divergence benchmark, the probability distributions of a variety of physicochemical descriptors for the training set and a set of generated molecules are compared, and the corresponding KL divergences are calculated. Models able to capture the distributions of molecules in the training set will lead to small KL divergence values.

It has been noted that a lack of diversity of the generated molecules is an issue for a few models for de novo design, in particular GANs.57, 55 Other model classes do not suffer from that problem.9 The KL divergence benchmark captures diversity to some extent, by requiring the generated molecules to be as diverse as the training set with respect to the considered property distributions.

3.2 Goal-directed benchmarks

The goal-directed optimization of molecules relies on a formalism in which molecules can be scored individually. The molecule score reflects how well a molecule fulfills the required property profile. The goal is to find molecules which maximize the scoring function.

Concretely, the models are asked to generate a given number of molecules with high scores for a given function. The models have access to the scoring function and can iteratively improve their best molecule guesses, without knowing explicitly what the scoring function calculates.

Here, by using robust and simple, but relevant scoring functions for molecular design, we disentangle the problem of molecule optimization from the problem of choosing good scoring functions, which has been highlighted many times in the context of de novo design and virtual screening, and will not be the focus of this article.

The optimization tasks can be summarized in different categories:

3.2.1 Similarity

Similarity is one of the core concepts of chemoinformatics.58 It serves multiple purposes and is an interesting objective for optimization. First, it is a surrogate for machine learning models, since it mimics an interpretable nearest neighbor model. However, it has the strong advantage over more complex machine learning (ML) algorithms that deficiencies in the ML models, stemming from training on small datasets or activity cliffs, cannot be as easily exploited by the generative models. Second, it is directly related to virtual screening: a similarity objective can be interpreted as a form of inverse virtual screening, where molecules similar to a given target compound are generated on the fly instead of looking them up in a large database. In the similarity benchmarks, models aim to generate molecules similar to a target that was removed from the training set. Models perform well for the similarity benchmarks if they are able to generate many molecules that are closely related to a given target molecule. Alternatively, the concept of similarity can be applied to exclude molecules that are too similar to other molecules.

3.2.2 Rediscovery

Rediscovery benchmarks are closely related to the similarity benchmarks described above. The major difference is that the rediscovery task explicitly aims to rediscover the target molecule, not to generate many molecules similar to it.

3.2.3 Isomers

For the isomer benchmarks, the task is to generate molecules that correspond to a target molecular formula (for example CHNO). The isomers for a given molecular formula can in principle be enumerated, but except for small molecules this number will in general be very large. The aim of the isomer task is not to provide a new method for generating isomers. Instead, the isomer benchmarks assess the flexibility of the model to generate molecules following a simple pattern that is a priori unknown.

3.2.4 Physicochemical properties

In the physicochemical property benchmarks, the molecule scores are derived from molecular properties. Since the true properties (i.e. their experimental values) are in general not known, these benchmarks rely on values provided by computational models. Instead of pure maximization of properties, for these benchmarks we target property ranges. This corresponds more closely to typical use cases in drug design.

3.2.5 Median molecules

In the median molecules benchmark, ECFC4 similarity towards both Camphor and Menthol has to be maximized; it is designed to be a conflicting task.59 Besides measuring the obtained top score, it is instructive to study if the models also explore the chemical space between the two structures.

3.2.6 Quantitative estimate of drug-likeness (QED)

The “quantitative estimate of drug-likeness” (QED)60 is an empirical measure of drug-likeness, similar to Lipinski’s rule of 5.61 Even though optimization of drug-likeness alone is not a particularly useful objective in drug discovery, it has been used has been used in several publications as a target for de novo molecular design (see, for instance, Refs. 41, 8, 28, 43, 51).

3.2.7 Multi-objective

In general, in applications of molecular design several conditions must be met at the same time in order to obtain adequate molecules. For this reason, molecular design is often a question of optimizing of the right property profile, i.e. multiple objectives are targeted instead of a single objective. The multi-objective benchmarks encompass aspects from different domains, such as:

  • Structural features. Examples: molecular weight, number of aromatic rings, number of rotatable bonds;

  • Physicochemical properties. Examples: TPSA, logP;

  • Similarity or dissimilarity to other molecules;

  • Presence and absence of functional groups or atom types.

4 Methods

Our framework for benchmarking models for de novo design, GuacaMol, is available as an open-source Python package and can be downloaded from the Internet.62 GuacaMol provides user-friendly interfaces that allow researchers to couple their models to the benchmarking framework with minimal effort.

In this section, we provide the details for the generation of the datasets and the evaluation of the benchmarks. In particular, we list the five distribution-learning and the twenty goal-directed benchmarks that we propose as the first version of a suite of benchmarks for GuacaMol.

For all chemoinformatics-related operations, such as handling or canonicalization of SMILES strings, physicochemical property (logP, TPSA) or similarity calculations, GuacaMol relies on the RDKit package.63

4.1 Dataset

Several of the molecule generation approaches we studied require a training dataset. ChEMBL 24 was used for this purpose.64 The advantage of ChEMBL is that it contains only molecules that have been synthesized and tested against biological targets. Other datasets, such as ZINC,2 contain, at least to some degree, virtual molecules that are likely to be synthesizable, but have not been made yet. Furthermore, ZINC is biased towards smaller and more readily synthesizable molecules, indicated by a larger proportion of molecules containing amide bonds compared to ChEMBL. The QM9 set,65 which is a subset of GDB9,4 is a completely enumerated dataset, which has shown to be of value in many applications. However, since it contains mostly compounds which have not been made yet, including many molecules with complex annulated ring systems, it is not very well suited to learn representations of drug-like, synthesizable molecules.

To generate the final dataset for the benchmarks, ChEMBL is postprocessed by

  1. removal of salts;

  2. charge neutralization;

  3. removal of molecules with SMILES strings longer than 100 characters;

  4. removal of molecules containing any element other than H, B, C, N, O, F, Si, P, S, Cl, Se, Br, and I;

  5. removal of molecules with a larger ECFP4 similarity than 0.323 compared to a holdout set consisting of 10 marketed drugs (celecoxib, aripiprazole, cobimetinib, osimertinib, troglitazone, ranolazine, thiothixene, albuterol, fexofenadine, mestranol). This allows us to define similarity benchmarks for targets that are not part of the training set.

The postprocessed ChEMBL dataset can be downloaded from the Internet.66 Additionally, a docker container with version-controlled dependencies to allow for reproducible dataset creation is provided in the guacamol repository.62

4.2 Implementation details: Distribution learning benchmarks


The validity score is calculated as the ratio of valid molecules out of 10’000 generated molecules. Molecules are considered to be valid if their SMILES representation can be successfully parsed by RDKit.


To calculate the uniqueness score, molecules are sampled from the generative model until 10’000 valid molecules have been obtained. Duplicate molecules are identified by identical canonical SMILES strings. The score is obtained as the number of different canonical SMILES strings divided by 10’000.


The novelty score is calculated by generating molecules until 10’000 different canonical SMILES strings are obtained, and computing the ratio of molecules not present in the ChEMBL dataset.

Fréchet ChemNet Distance (FCD)

To generate the FCD score, a random subset of 10’000 molecules is selected from the ChEMBL dataset, and the generative model is sampled until 10’000 valid molecules are obtained. The FCD between both sets of molecules is calculated with the FCD package available on GitHub,67 and the final score is given by

KL divergence

For this task, the following physicochemical descriptors are calculated with RDKit for both the sampled and the reference set: BertzCT, MolLogP, MolWt, TPSA, NumHAcceptors, NumHDonors, NumRotatableBonds, NumAliphaticRings, and NumAromaticRings

. Furthermore, we calculate the distribution of maximum nearest neighbour similarities on ECFP4 fingerprints for both sets. Then, the distribution of these descriptors is computed via kernel density estimation (using the

scipy package) for continuous descriptors, or as a histogram for discrete descriptors. The KL divergence is then computed for each descriptor , and is aggregated to a final score via


where is the number of descriptors (in our case ).

4.3 Implementation details: Goal-directed benchmarks

In the goal-directed benchmarks discussed in this paper, raw molecular properties rarely correspond to the molecule scores used for optimization. Instead, they are postprocessed by modifier functions that give more flexibility in how the final molecule score is computed, and furthermore restrict scores to the interval [0, 1].

For the benchmarks discussed below, we apply different types of modifier functions:

  • Gaussian: The Gaussian modifier allows to target a specific value of some property, while giving high scores when the underlying value is close to the target. It can be adjusted with the mean

    and standard deviation

    of the underlying Gaussian function.

  • MinGaussian: The min Gaussian modifier corresponds to the right half of a Gaussian function. Values smaller than are given full score, and values larger than decrease continuously to zero.

  • MaxGaussian: The max Gaussian modifier corresponds to the left half of a Gaussian function. Values larger than are given full score, and values smaller than decrease continuously to zero.

  • Thresholded: With the thresholded modifier, full score is attributed to values above a given threshold . Values smaller than decrease linearly to zero.

The effect of these four modifier functions is illustrated in Figure 2.

Figure 2: Modifier functions used in this study.

For many benchmarks the final molecule score corresponds to an average of multiple contributions. For instance, the molecule score for the median molecules benchmark is an average of two similarity scores (see below). In all the benchmarks presented in this work, the score contributions are weighted uniformly.

The final benchmark score is calculated as a weighted average of the molecule scores. For most of the benchmarks discussed below, the molecules with the best scores are given a larger weight. This choice reflects the compromise that models should be able to deliver a few molecules with top scores, but that they should also be able to generate many molecules with satisfactory scores. For all the goal-directed benchmarks, we calculate one or several average score for given numbers of top molecules, and then set the benchmark score to be the mean of these average scores. For instance, many benchmarks consider the combination of top-1, top-10 and top-100 scores, in which the benchmark score is given by



is a 100-dimensional vector of molecule scores

, , sorted in decreasing order (i.e., for ).

All the goal-directed benchmarks are listed in Table 1. The following points give information complementary to Table 1 about our choice and implementation of the benchmarks:

  • The column “Scoring” refers to the numbers of top molecules to consider in the score calculation.

  • CH has 159 possible isomers when ignoring stereochemistry. Therefore, for this benchmark the final score is obtained from the top-159 molecule scores.

  • “CNS MPO” is a benchmark on its own, but is also present as a contribution for the cobimetinib benchmark.

  • “sim(, )” refers to a scoring function calculating the similarity to a target molecule . The score corresponds to the Tanimoto similarity when considering the fingerprint and is computed with RDKit.

  • “isomer()” refers to a scoring function for the isomer benchmarks and is detailed later in this section.

  • The ranolazine MPO benchmark uses a start population, comprising one single molecule (ranolazine).

  • The molecules for the pure similarity and rediscovery benchmarks are present in the holdout set considered during dataset generation.

  • TPSA, logP and QED are calculated using the models contained in RDKit.

width=1 Benchmark name Scoring Scoring function(s) Modifier CH 159 isomer(CH) None CHNO 100 isomer(CHNO) None CHNOPFCl 100 isomer(CHNOPFCl) None Cobimetinib MPO 1, 10, 100 sim(cobimetinib, FCFC4) Thresholded(0.7) sim(cobimetinib, ECFC6) MinGaussian(0.75, 2) Number of rotatable bonds MinGaussian(3, 2) Number of aromatic rings MaxGaussian(3, 2) CNS MPO None Osimertinib MPO 1, 10, 100 sim(osimertinib, FCFC4) Thresholded(0.8) sim(osimertinib, ECFC6) MinGaussian(0.85, 2) TPSA MaxGaussian(100, 2) logP MinGaussian(1, 2) Fexofenadine MPO 1, 10, 100 sim(fexofenadine, AP) Thresholded(0.8) TPSA MaxGaussian(90, 2) logP MinGaussian(4, 2) Physchem MPO 1, 10, 100 Bertz MaxGaussian(1500, 2) Molecular weight MinGaussian(400, 2) Number of aromatic rings MinGaussian(3, 2) Number of fluorine atoms Gaussian(6, 2) Ranolazine MPO 1, 10, 100 sim(ranolazine, AP) Thresholded(0.7) logP MaxGaussian(7, 2) Number of aromatic rings MinGaussian(1, 2) Number of fluorine atoms Gaussian(1, 2) Celecoxib rediscovery 1 sim(celecoxib, ECFC4) None Troglitazone rediscovery 1 sim(troglitazone, ECFC4) None Thiothixene rediscovery 1 sim(thiothixene, ECFC4) None Aripiprazole similarity 1, 10, 100 sim(aripiprazole, FCFC4) Thresholded(0.75) Albuterol similarity 1, 10, 100 sim(albuterol, FCFC4) Thresholded(0.75) Mestranol similarity 1, 10, 100 sim(mestranol, AP) Thresholded(0.75) Median molecules 1, 10, 100 sim(camphor, ECFC4) None sim(menthol, ECFC4) None logP (target: -1) 1, 10, 100 logP Gaussian(-1, 2) logP (target: 8) 1, 10, 100 logP Gaussian(8, 2) TPSA (target: 150) 1, 10, 100 TPSA Gaussian(150, 2) CNS MPO 1, 10, 100 TPSA MinGaussian(90, 2) TPSA MaxGaussian(40, 2) Number of H bond donors MinGaussian(0, 2) logP MinGaussian(6.35584, 2) Molecular weight MinGaussian(360, 2) QED 1, 10, 100 QED None

Table 1: List of goal-directed benchmarks considered in this work. The molecule score given by each benchmark is a linear combination of one or several (equally weighted) contributions, with their respective score modifiers.

4.3.1 Isomer scoring function

For the isomer benchmarks, the molecule score is calculated as an average of the following contributions:

  • For each element type present in the target molecular formula: the number of atoms of this element type, modified by a Gaussian modifier. The Gaussian modifier has a mean corresponding to the target number of atoms of this element type, and a standard deviation of 1.0.

  • The total number of atoms in the molecule, modified by a Gaussian modifier. The Gaussian modifier has a mean corresponding to the total number of atoms of the target molecular formula, and a standard deviation of 2.0.

For instance, for the target formula CH, the score of a molecule is given by:


where and refer to the number of carbon and hydrogen atoms present in , respectively, and is the total number of atoms in .

4.3.2 Additional criteria

Furthermore, for each goal directed benchmark, the duration of the benchmark, the number of calls to the scoring function, and the internal similarity of the top 100 molecules are captured in the results, to allow for more fine-grained analyses.

4.4 Baselines

In addition to the introduction of a benchmark suite for generative chemistry, this paper provides its evaluation over several baselines. The goal is to present a fair comparison of some recent methods that seem to achieve state of the art results in the field. These have been selected to represent a variety of methods, ranging from deep learning generative models to the more traditional genetic algorithms and Monte Carlo Tree Search (MCTS). Internal molecule representation was also taken into account, so that methods using SMILES strings were represented alongside others using a graph representation of atoms and bonds. For completeness and to provide a lower bound on the benchmark scores, two dummy baselines, performing random sampling and best of dataset-picking, are also provided.

Our implementation of the baseline models detailed in this work are available on GitHub.68

4.4.1 Random sampler

This baseline simply samples the requested number of molecules from the dataset at random. It provides a lower bound for the goal directed benchmarks as no optimization is performed to obtain the returned molecules, as well as an upper bound for two of the distribution learning benchmarks, as the molecules returned are taken directly from the original distribution.

4.4.2 Best in dataset

The goal of de novo molecular design is to explore unknown parts of chemical space, generating new compounds with better properties than the ones already known. This baseline simply scores the entire dataset with the provided scoring function and returns the highest scoring molecules. This effectively provides a lower bound for the goal directed benchmarks as any good de novo method should be able to generate better molecules than the ones provided during training or as a starting population. Distribution-learning benchmarks are not applicable to this baseline.

4.4.3 Smiles Ga

Genetic algorithms (GA) for de novo molecular design are well established technique.7, 59 Yoshikawa et al. proposed a method that evolves string molecular representations using mutations exploiting the SMILES context-free grammar 69.

For each goal-directed benchmark the 300 highest scoring molecules in the dataset are selected as the initial population. Each molecule is represented by 300 genes. During each epoch an offspring of 600 new molecules are generated by randomly mutating the population molecules. After deduplication and scoring these new molecules are merged with the current population and a new generation is selected by keeping the top scoring molecules overall. This process is repeated 1000 times or until progress has stopped for 5 consecutive epochs. Distribution-learning benchmarks do not apply to this baseline.

4.4.4 Graph GA

The second genetic algorithm baseline follows the implementation of Jensen 70 in which molecule evolution happens at the graph level.

For each goal-directed benchmark the 100 highest scoring molecules in the dataset are selected as the initial population. During each epoch a mating pool of 200 molecules is sampled with replacement from the population, using scores as weights. This pool may contain many repeated specimens if their score is high. A new population of 100 is then generated by iteratively choosing two molecules at random from the mating pool and applying a crossover operation. With probability of 0.5 a mutation is also applied to the offspring molecule. This process is repeated 1000 times or until progress has stopped for 5 consecutive epochs. Distribution-learning benchmarks do not apply to this baseline.

4.4.5 Graph MCTS

The MCTS molecule generation procedure follows the implementation by Jensen.70 The statistics used during sampling are computed on the GuacaMol dataset.

For this baseline no initial population is selected for the goal-directed benchmarks. Following the author’s parameters, each new molecule is generated by running 40 simulations, starting from a CC base molecule, at each step 25 children are considered and the roll-out stops when reaching 60 atoms. The best scoring molecule found during the roll-out is returned. A population of 100 molecules is generated at each epoch. This process is repeated 1000 times or until progress has stopped for 5 consecutive epochs.

For the distribution-learning benchmark the generation starts from a CC base molecule and a new molecule is generated with the same parameters as for the goal oriented, the only difference being that no scoring function is provided, so the first molecule to reach terminal state is returned instead.

4.4.6 Smiles Lstm

The SMILES LSTM is a simple, yet powerful baseline model, consisting of a long-short term memory (LSTM)

71 neural network which predicts the next character character of partial SMILES strings.9 Combined with a simple hill-climb algorithm for optimization, it has recently been shown to perform as well as more sophisticated reinforcement learning algorithms such as proximal policy optimization (PPO) or advantage actor critic (A2C).35

The model used was an LSTM with 3 layers of hidden size of 1024. For the goal-directed benchmarks 20 iterations of hill-climbing were performed, at each step the model generated 8192 molecules and the top scoring 1024 were used to fine-tune the model parameters. For the distribution-learning benchmarks no fine tuning was done, the model simply generated the requested number of molecules.

5 Results and Discussion

5.1 Distribution-learning benchmarks

Table 2 lists the results for the distribution-learning benchmarks. The random sampler model is a useful baseline for comparison because it shows what scores can be expected for good models on the KL divergence and Fréchet benchmarks. As expected, its novelty score is zero since all the molecules it generates are present in the dataset. The SMILES LSTM model sometimes produces invalid molecules. However, they are diverse, as attested by the uniqueness and novelty benchmarks, and closely resemble molecules from ChEMBL. The Graph MCTS model is characterized by a high degree of valid, unique, and novel molecules. Nevertheless, it is not able to reproduce the property distributions of the underlying training set as well as the SMILES LSTM model.

Benchmark Random Sampler SMILES LSTM Graph MCTS
Validity 1.000 0.946 1.000
Uniqueness 0.997 1.000 1.000
Novelty 0.000 0.967 0.994
KL divergence 0.998 0.987 0.540
Frechet Distance 0.925 0.887 0.014
Table 2: Results of the baseline models for the distribution-learning benchmarks

5.2 Goal-directed benchmarks

Table 3 lists the results for the goal-directed benchmarks. The model “Best of Dataset” is a useful baseline, since it actually corresponds to a strategy based on virtual screening. It illustrates what minimal scores the models must obtain to have an advantage over pure virtual screening. The scores for this baseline are relatively high for all the benchmarks, and are even maximal for the physicochemical property-based logP, TPSA, CNS MPO and QED tasks. This observation indicates that such simple tasks are not adequate for the assessment of models for de novo design, although they have been frequently selected to illustrate the efficacy of new models.

The Graph GA model obtains the best results for most benchmarks. The other GA model, based on SMILES strings, is distinctly better than the “Best of Dataset” baseline. However, especially for the similarity tasks, it scores lower than the Graph GA. The SMILES LSTM model nearly performs as well as the Graph GA model, and outperforms it on two benchmarks. The Graph MCTS model performs worse than the baseline selecting the best candidates from ChEMBL.

width=1 Benchmark Best of Dataset SMILES LSTM SMILES GA Graph GA Graph MCTS CH 0.746 0.928 0.825 0.971 0.533 CHNO 0.976 1.000 0.979 1.000 0.895 CHNOPFCl 0.814 0.940 0.928 0.990 0.770 Cobimetinib MPO 0.921 0.962 0.946 0.947 0.878 Osimertinib MPO 0.873 0.929 0.914 0.939 0.844 Fexofenadine MPO 0.849 0.997 0.923 1.000 0.780 Physchem MPO 0.827 0.993 1.000 1.000 0.712 Ranolazine MPO 0.836 0.924 0.929 0.954 0.816 Celecoxib rediscovery 0.505 1.000 0.607 1.000 0.378 Troglitazone rediscovery 0.419 1.000 0.558 1.000 0.312 Thiothixene rediscovery 0.456 1.000 0.495 1.000 0.308 Aripiprazole similarity 0.907 1.000 0.953 1.000 0.556 Albuterol similarity 0.719 1.000 0.990 1.000 0.789 Mestranol similarity 0.629 1.000 0.961 1.000 0.406 Median molecules 0.429 0.504 0.474 0.485 0.234 logP (target: -1.0) 1.000 1.000 1.000 1.000 0.980 logP (target: 8.0) 1.000 1.000 1.000 1.000 0.979 TPSA (target: 150.0) 1.000 1.000 1.000 1.000 1.000 CNS MPO 1.000 1.000 1.000 1.000 1.000 QED 0.948 0.948 0.948 0.948 0.944 Total 15.855 19.126 17.431 19.234 14.112

Table 3: Results of the baseline models for the goal-directed benchmarks

6 Conclusions and Outlook

Recently, generative models for de novo molecular design based on neural networks have been introduced, and have shown promising results. However, there has been a lack of consistency in the evaluation of such models. The present work addressed this problem by introducing GuacaMol, a framework to quantitatively benchmark models for de novo design. It aims to provide a standardized way of assessing new models, and to improve comparability of the models.

Our framework defines two evaluation dimensions. First, it assesses models for their ability to learn from a dataset of molecules and to generate novel molecules with similar properties. Second, it evaluates the faculty of generative models to deliver molecules with specific properties, which we encode as molecule scores.

We provided a suite of benchmarks compatible with the evaluation framework. It encompasses a wide range of tasks, designed to assess the flexibility of generative models. The proposed suite comprises 5 benchmarks for the ability of generative models to learn distributions of molecules, and 20 optimization benchmarks for the generation of molecules with specific properties.

We evaluated a series of diverse baseline models. In terms of optimization performance, the best model is a genetic algorithm based on a graph representation of molecules. The second best model is a recurrent neural network model that considers SMILES representations of molecules. Both models achieve similar scores, which indicates that deep learning models can reach the performance of gold-standard discrete algorithms for de novo molecular design.

The evaluation of the proposed benchmark suite for the baseline models pointed out that some benchmark tasks can be too easily solved by most of the models. This indicates the necessity of harder tasks for benchmarking models for de novo molecular design in the future.

An important aspect that will need further focus is the quality of the generated molecules, which is difficult to measure objectively. Early results indicate that generative models pre-trained on large datasets suffer less than genetic algorithms from the problem of generating unreasonable molecules. Furthermore, depending on the requirements, time constraints or sample efficiency of different models can become important, and will require further attention.

We are confident that the flexible open source framework presented herein will enable and inspire the community to come up with even more challenging benchmarks for de novo design, so that researchers can rigorously assess their models, and computational chemists can confidently harness models with well-understood strengths and limitations.

7 Acknowledgments

The authors thank Mohamed Ahmed, Matthew Sellwood, Joshua Meyers, and Amir Saffari for helpful discussions.


  • Schneider 2013 Schneider, G. De novo molecular design; John Wiley & Sons, 2013.
  • Irwin and Shoichet 2005 Irwin, J. J.; Shoichet, B. K. J. Chem. Inf. Model. 2005, 45, 177–182.
  • Walters 2018 Walters, W. P. J. Med. Chem. 2018,
  • Ruddigkeit et al. 2012 Ruddigkeit, L.; Van Deursen, R.; Blum, L. C.; Reymond, J.-L. J. Chem. Inf. Model. 2012, 52, 2864–2875.
  • Valeur et al. 2017 Valeur, E.; Guéret, S. M.; Adihou, H.; Gopalakrishnan, R.; Lemurell, M.; Waldmann, H.; Grossmann, T. N.; Plowright, A. T. Angew. Chem. Int. Ed. 2017, 56, 10294–10323.
  • Lauck and Rarey 2013 Lauck, F.; Rarey, M. In De novo Molecular Design; Schneider, G., Ed.; Wiley Online Library, 2013; pp 325–347.
  • Hartenfeller and Schneider 2011 Hartenfeller, M.; Schneider, G. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2011, 1, 742–759.
  • Gómez-Bombarelli et al. 2018 Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; Aspuru-Guzik, A. ACS Cent. Sci. 2018, 4, 268–276.
  • Segler et al. 2018 Segler, M. H. S.; Kogej, T.; Tyrchan, C.; Waller, M. P. ACS Cent. Sci. 2018, 4, 120–131.
  • Deng et al. 2009

    Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition. 2009; pp 248–255.

  • Mayr et al. 2016 Mayr, A.; Klambauer, G.; Unterthiner, T.; Hochreiter, S. Front. Environ. Sci. 2016, 3, 80.
  • Chevillard and Kolb 2015 Chevillard, F.; Kolb, P. J. Chem. Inf. Model. 2015, 55, 1824–1835.
  • Degen et al. 2008 Degen, J.; Wegscheid-Gerlach, C.; Zaliani, A.; Rarey, M. ChemMedChem 2008, 3, 1503–1507.
  • Lajiness et al. 2004 Lajiness, M. S.; Maggiora, G. M.; Shanmugasundaram, V. J. Med. Chem. 2004, 47, 4891–4896.
  • Segler et al. 2018 Segler, M. H.; Preuss, M.; Waller, M. P. Nature 2018, 555, 604.
  • Schmidhuber 2015 Schmidhuber, J. Neural networks 2015, 61, 85–117.
  • Chen et al. 2018 Chen, H.; Engkvist, O.; Wang, Y.; Olivecrona, M.; Blaschke, T. Drug Discovery Today 2018,
  • Hansch and Fujita 1964 Hansch, C.; Fujita, T. J. Am. Chem. Soc. 1964, 86, 1616–1626.
  • Baskin et al. 1989 Baskin, I. I.; Gordeeva, E. V.; Devdariani, R. O.; Zefirov, N. S.; Paliulin, V. A.; Stankevich, M. I. Dokl. Akad. Nauk SSSR 1989, 307, 613–617.
  • Weininger 1988 Weininger, D. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.
  • Janz et al. 2018 Janz, D.; Westhuizen, J. v. d.; Paige, B.; Kusner, M.; Hernández-Lobato, J. M. Learning a Generative Model for Validity in Complex Discrete Structures. ICLR 2018 Conference. 2018.
  • Blaschke et al. 2018 Blaschke, T.; Olivecrona, M.; Engkvist, O.; Bajorath, J.; Chen, H. Mol. Inf. 2018, 37, 1700123.
  • Kusner et al. 2017 Kusner, M. J.; Paige, B.; Hernández-Lobato, J. M. 2017, arXiv:1703.01925.
  • Dai et al. 2018 Dai, H.; Tian, Y.; Dai, B.; Skiena, S.; Song, L. 2018, arXiv:1802.08786.
  • Harel and Radinsky 2018 Harel, S.; Radinsky, K. Mol. Pharmaceutics 2018,
  • Simonovsky and Komodakis 2018 Simonovsky, M.; Komodakis, N. 2018, arXiv:1802.03480.
  • Samanta et al. 2018 Samanta, B.; De, A.; Ganguly, N.; Gomez-Rodriguez, M. 2018, arXiv:1802.05283.
  • Liu et al. 2018 Liu, Q.; Allamanis, M.; Brockschmidt, M.; Gaunt, A. L. 2018, arXiv:1805.09076.
  • Jin et al. 2018 Jin, W.; Barzilay, R.; Jaakkola, T. 2018, arXiv:1802.04364.
  • Kajino 2018 Kajino, H. 2018, arXiv:1809.02745.
  • Lim et al. 2018 Lim, J.; Ryu, S.; Kim, J. W.; Kim, W. Y. J. Cheminf. 2018, 10, 31.
  • Yuan et al. 2017 Yuan, W.; Jiang, D.; Nambiar, D. K.; Liew, L. P.; Hay, M. P.; Bloomstein, J.; Lu, P.; Turner, B.; Le, Q.-T.; Tibshirani, R.; Khatri, P.; Moloney, M. G.; Koong, A. C. J. Chem. Inf. Model. 2017, 57, 875–882.
  • Bjerrum and Threlfall 2017 Bjerrum, E. J.; Threlfall, R. 2017, arXiv:1705.04612.
  • Ertl et al. 2017 Ertl, P.; Lewis, R.; Martin, E.; Polyakov, V. 2017, arXiv:1712.07449.
  • Neil et al. 2018 Neil, D.; Segler, M.; Guasch, L.; Ahmed, M.; Plumbley, D.; Sellwood, M.; Brown, N. Exploring Deep Recurrent Models with Reinforcement Learning for Molecule Design. ICLR 2018 Conference. 2018.
  • Olivecrona et al. 2017 Olivecrona, M.; Blaschke, T.; Engkvist, O.; Chen, H. J. Cheminf. 2017, 9, 48.
  • 37 Gupta, A.; Müller, A. T.; Huisman, B. J. H.; Fuchs, J. A.; Schneider, P.; Schneider, G. Mol. Inf. 37, 1700111.
  • Jaques et al. 2016 Jaques, N.; Gu, S.; Bahdanau, D.; Hernández-Lobato, J. M.; Turner, R. E.; Eck, D. 2016, arXiv:1611.02796.
  • Popova et al. 2018 Popova, M.; Isayev, O.; Tropsha, A. Sci. Adv. 2018, 4, eaap7885.
  • Guimaraes et al. 2017 Guimaraes, G. L.; Sanchez-Lengeling, B.; Outeiral, C.; Farias, P. L. C.; Aspuru-Guzik, A. 2017, arXiv:1705.10843.
  • Sanchez-Lengeling et al. 2017 Sanchez-Lengeling, B.; Outeiral, C.; Guimaraes, G. L.; Aspuru-Guzik, A. 2017, ChemRxiv:5309668.
  • Li et al. 2018 Li, Y.; Vinyals, O.; Dyer, C.; Pascanu, R.; Battaglia, P. 2018, arXiv:1803.03324.
  • Li et al. 2018 Li, Y.; Zhang, L.; Liu, Z. J. Cheminf. 2018, 10, 33.
  • Li et al. 2018 Li, Y.; Zhou, X.; Liu, Z.; Zhang, L. J. Chin. Pharm. Sci. 2018, 27, 451–459.
  • Putin et al. 2018 Putin, E.; Asadulaev, A.; Ivanenkov, Y.; Aladinskiy, V.; Sanchez-Lengeling, B.; Aspuru-Guzik, A.; Zhavoronkov, A. J. Chem. Inf. Model. 2018, 58, 1194–1204.
  • De Cao and Kipf 2018 De Cao, N.; Kipf, T. 2018, arXiv:1805.11973.
  • Kadurin et al. 2016 Kadurin, A.; Aliper, A.; Kazennov, A.; Mamoshina, P.; Vanhaelen, Q.; Khrabrov, K.; Zhavoronkov, A. Oncotarget 2016, 8, 10883–10890.
  • Kadurin et al. 2017 Kadurin, A.; Nikolenko, S.; Khrabrov, K.; Aliper, A.; Zhavoronkov, A. Mol. Pharmaceutics 2017, 14, 3098–3104.
  • Putin et al. 2018 Putin, E.; Asadulaev, A.; Vanhaelen, Q.; Ivanenkov, Y.; Aladinskaya, A. V.; Aliper, A.; Zhavoronkov, A. Mol. Pharmaceutics 2018, 15, 4386–4397.
  • Polykovskiy et al. 2018 Polykovskiy, D.; Zhebrak, A.; Vetrov, D.; Ivanenkov, Y.; Aladinskiy, V.; Mamoshina, P.; Bozdaganyan, M.; Aliper, A.; Zhavoronkov, A.; Kadurin, A. Mol. Pharmaceutics 2018, 15, 4398–4405.
  • You et al. 2018 You, J.; Liu, B.; Ying, R.; Pande, V.; Leskovec, J. 2018, arXiv:1806.02473.
  • Merk et al. 2018 Merk, D.; Grisoni, F.; Friedrich, L.; Schneider, G. Communications Chemistry 2018, 1, 68.
  • Merk et al. 2018 Merk, D.; Friedrich, L.; Grisoni, F.; Schneider, G. Mol. Inf. 2018, 37, 1700153.
  • Arús-Pous et al. 2018 Arús-Pous, J.; Blaschke, T.; Reymond, J.-L.; Chen, H.; Engkvist, O. 2018, ChemRxiv:7172849.
  • Preuer et al. 2018 Preuer, K.; Renz, P.; Unterthiner, T.; Hochreiter, S.; Klambauer, G. J. Chem. Inf. Model. 2018, 58, 1736–1741.
  • Kullback and Leibler 1951 Kullback, S.; Leibler, R. A. Ann. Math. Stat. 1951, 22, 79–86.
  • Benhenda 2017 Benhenda, M. 2017, arXiv:1708.08227.
  • Gasteiger and Engel 2006 Gasteiger, J.; Engel, T. Chemoinformatics: a textbook; John Wiley & Sons, 2006.
  • Brown et al. 2004 Brown, N.; McKay, B.; Gilardoni, F.; Gasteiger, J. J. Chem. Inf. Comput. Sci. 2004, 44, 1079–1087.
  • Bickerton et al. 2012 Bickerton, G. R.; Paolini, G. V.; Besnard, J.; Muresan, S.; Hopkins, A. L. Nat. Chem. 2012, 4, 90–98.
  • Lipinski et al. 1997 Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J. Adv. Drug Deliv. Rev. 1997, 23, 3–25.
  • 62 GuacaMol Package., Accessed: November 19, 2018.
  • 63 Landrum, G. RDKit: Open-source cheminformatics.
  • Mendez et al. 2018 Mendez, D. et al. Nucleic Acids Res. 2018,
  • Ramakrishnan et al. 2014 Ramakrishnan, R.; Dral, P. O.; Rupp, M.; Von Lilienfeld, O. A. Sci. Data 2014, 1, 140022.
  • 66 Post-processed ChEMBL datasets., Accessed: November 20, 2018.
  • 67 FCD Package., Accessed: November 15, 2018.
  • 68 GuacaMol baselines repository., Accessed: November 19, 2018.
  • Yoshikawa et al. 2018 Yoshikawa, N.; Terayama, K.; Honma, T.; Oono, K.; Tsuda, K. 2018, arXiv:1804.02134.
  • Jensen 2018 Jensen, J. H. 2018, ChemRxiv:7240751.
  • Hochreiter and Schmidhuber 1997 Hochreiter, S.; Schmidhuber, J. Neural computation 1997, 9, 1735–1780.