Introduction
Deep learning has recently enabled generative models for molecules (Segler et al., 2017). SMILES strings (Weininger, 1988) offer a convenient representation of molecules for generative methods. Recurrent neural networks (RNNs) are therefore a natural basis for these methods, as they excel at generating text. RNNs have been combined with variational autoencoders (VAEs) (Gómez-Bombarelli et al., 2016; Segler et al., 2017), reinforcement learning (RL) (Jaques et al., 2017; Olivecrona et al., 2017) and generative adversarial networks (GANs) (Guimaraes et al., 2017) to yield generative models for molecules. More recently, graph-generating methods have also been suggested (Li et al., 2018; Simonovsky and Komodakis, 2018), which can be used to generate graph representations of molecules.
Up to now, there is no established evaluation criterion that is consistently used across all publications to assess generative models for molecules. The most basic and commonly reported metric is the fraction of valid SMILES, which is a prerequisite for a generative model. However, this metric can easily be maximized by generating simple molecules, such as “CC” and “CCC”, with a rule-based system. Other metrics focus on numerical or visual comparison of molecular properties like solubility (Olivecrona et al., 2017), number of rotatable bonds (Olivecrona et al., 2017), number of aromatic rings (Olivecrona et al., 2017), synthetic accessibility (Guimaraes et al., 2017; Jaques et al., 2017; Yang et al., 2017) and druglikeness (Guimaraes et al., 2017; Jaques et al., 2017). Additional metrics aim at assessing the diversity of the generated molecules, for example by calculating the Tanimoto similarity/distance to the training set (Guimaraes et al., 2017; Segler et al., 2017) or within the generated molecules (Benhenda, 2017). The diversity of recently developed methods for molecule generation shows the high interest in this new research field, but the diversity of evaluation methods inhibits focused and comparative research. Therefore, it is necessary to find a common basis for evaluating generative models for molecules. We suggest such a metric and show that previously used metrics have specific flaws and fail to detect certain biases in generative models.
A proficient metric should capture the validity, chemical and biological meaningfulness, and diversity of the generated molecules. We therefore aim at creating a metric which is capable of unifying these requirements into one score. To this end we adopt a strategy that has been established to compare generative models for images, the Fréchet Inception Distance (FID) (Heusel et al., 2017). This distance measure uses a representation obtained from the Inception network (Salimans et al., 2016) to represent the input objects of the network. For these image representations, the Fréchet distance is used as a distance measure. Analogously, we use the hidden representation from a neural network, called “ChemNet” (see details below), which was trained to predict biological activities, as the representation of molecules. This representation contains both chemical and biological information about the molecules, since the input layer can be seen as purely chemical information and the output layer as purely biological information. The layers in between contain both chemical and biological information about the molecule, which is the representation that we desire. By taking into account the distribution of these representations within a set of molecules, we also capture the diversity within the set. A purely chemical representation could also be used in combination with the Fréchet distance. This representation could be a set of fingerprints, the principal components calculated on the fingerprints, or the latent space of an autoencoder, to name just a few. However, we hypothesize that using just the chemical information is not sufficient to judge generative models for drug design. We examine this assumption by comparing the FCD to a fingerprint-based Fréchet distance, which we call the Fréchet Fingerprint Distance (FFD).
Fréchet ChemNet Distance.
We introduce the Fréchet ChemNet Distance (FCD) to calculate the distance between the distribution $p_w(\cdot)$ of real-world molecules and the distribution $p(\cdot)$ of molecules from a generative model. To obtain a numerical representation of each molecule, we use the activations of the penultimate layer of “ChemNet” (see next paragraph). We then calculate the first two moments (mean, covariance) of these activations for each of the two distributions. Since the Gaussian is the maximum entropy distribution for given mean and covariance, we assume the hidden representations to follow a multidimensional Gaussian. The two distributions ($p(\cdot)$, $p_w(\cdot)$) are compared using the Fréchet distance (Fréchet, 1957), which is also known as the Wasserstein-2 distance (Wasserstein, 1969). We call the Fréchet distance $d(\cdot,\cdot)$ between the Gaussian $p_w(\cdot)$ with mean and covariance $(m_w, C_w)$ obtained from the real-world samples and the Gaussian $p(\cdot)$ with mean and covariance $(m, C)$ obtained from a generative model the “Fréchet ChemNet Distance” (FCD), which is given by Dowson and Landau (1982):

$$d^2\big((m, C), (m_w, C_w)\big) = \lVert m - m_w \rVert_2^2 + \mathrm{Tr}\big(C + C_w - 2\,(C\,C_w)^{1/2}\big). \tag{1}$$

Throughout this paper, the FCD is reported as $d^2(\cdot,\cdot)$, analogously to Heusel et al. (2017).
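As an illustration, the closed-form squared Fréchet distance between two Gaussians, $\lVert m_1 - m_2 \rVert_2^2 + \mathrm{Tr}(C_1 + C_2 - 2\,(C_1 C_2)^{1/2})$, can be sketched in a few lines of NumPy. This is a minimal sketch and not the authors' implementation: the function names are hypothetical, random vectors stand in for ChemNet activations, and the trace of the matrix square root is obtained from the eigenvalues of $C_1 C_2$ rather than from an explicit matrix square root.

```python
import numpy as np


def gaussian_stats(activations):
    """Mean vector and covariance matrix of row-wise activation vectors."""
    acts = np.asarray(activations, dtype=np.float64)
    return acts.mean(axis=0), np.cov(acts, rowvar=False)


def frechet_distance(mu1, cov1, mu2, cov2):
    """Squared Frechet distance between N(mu1, cov1) and N(mu2, cov2)."""
    diff = mu1 - mu2
    # Tr((C1 C2)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of C1 C2, which are real and non-negative for
    # positive semi-definite C1 and C2 (up to numerical error).
    eigvals = np.linalg.eigvals(cov1 @ cov2).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * trace_sqrt)


def distance_from_activations(acts_a, acts_b):
    """Fit a Gaussian to each activation set and compare the two fits."""
    mu_a, cov_a = gaussian_stats(acts_a)
    mu_b, cov_b = gaussian_stats(acts_b)
    return frechet_distance(mu_a, cov_a, mu_b, cov_b)


# Placeholder data: random vectors stand in for ChemNet activations.
rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 8))
shifted = real + 1.0  # same covariance, mean shifted by 1 in each of 8 dimensions

d_same = distance_from_activations(real, real)
d_shifted = distance_from_activations(real, shifted)
```

In this toy setting the distance of a set to itself is zero up to rounding, and shifting every one of the 8 dimensions by 1 leaves the covariance unchanged, so only the mean term contributes and the distance is approximately 8.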
The FCD is based on the activations of the penultimate layer of ChemNet. The model (Mayr et al., 2018) was trained to predict bioactivities of about 6 000 assays available in three major drug discovery databases (ChEMBL (Bento et al., 2013), ZINC (Irwin et al., 2012), PubChem (Wang et al., 2016)). We used long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) recurrent neural networks based on the one-hot encoded SMILES representation of chemical molecules. The full architecture of this model consists of two 1D-convolutional layers with SELU activations (Klambauer et al., 2017), followed by a max-pooling layer, two stacked LSTM layers, and a fully connected output layer. Hyperparameter selection and training were done on two thirds of the available data and the last third was used for testing. The network was optimized for mean predictive performance. The hidden representation of the second LSTM layer after processing the full input sequence is used for calculating the FCD.
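The one-hot encoding of a SMILES string that serves as network input can be illustrated with a small sketch. The vocabulary, the character-level tokenization and the padding length below are assumptions made for illustration only; the text does not specify the ones used for ChemNet.

```python
import numpy as np

# Hypothetical character vocabulary (assumption; not the ChemNet vocabulary).
VOCAB = ["C", "N", "O", "c", "n", "o", "(", ")", "=", "#", "1", "2"]
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(VOCAB)}


def one_hot_smiles(smiles, max_len=20):
    """Encode a SMILES string as a (max_len, vocabulary size) one-hot matrix.

    Positions beyond the end of the string stay all-zero (padding).
    Characters outside VOCAB raise a KeyError in this simplified sketch.
    """
    encoded = np.zeros((max_len, len(VOCAB)), dtype=np.float32)
    for pos, ch in enumerate(smiles[:max_len]):
        encoded[pos, CHAR_TO_INDEX[ch]] = 1.0
    return encoded


# Acetic acid: 7 characters, so exactly 7 rows contain a single 1.
x = one_hot_smiles("CC(=O)O")
```

Each row of the resulting matrix corresponds to one position of the string, which is the sequence format that convolutional and recurrent layers consume.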
To estimate the mean and covariance of the real and generated samples, sufficiently large data sets should be chosen. We therefore determined the sample size necessary for a reliable estimate. We chose 200 000 randomly drawn real molecules to represent the reference molecules. Additionally, we drew 5, 50, 500, 5 000, 50 000 and 300 000 real molecules to represent different data set sizes of generated samples. If the data set of “generated” samples is sufficiently large, then the mean and covariance can be accurately estimated and the FCD is close to 0. This experiment was repeated 5 times to show that for a large enough sample the variance of the FCD becomes negligible. For 5, 50, 500, 5 000, 50 000 and 300 000 samples, the mean FCD values ± one standard deviation were 76.46 ± 5.03, 31.86 ± 0.75, 4.41 ± 0.03, 0.42 ± 0.01, 0.05 ± 0.00 and 0.02 ± 0.00, respectively. Therefore, a sample size of 5 000 is already sufficient to obtain a mean FCD close to 0 with negligible variance, meaning that the mean and covariance can be accurately estimated.
Detecting flaws in generative models.
We determine the capability of the FCD to detect if a generative model has produced diverse molecules which possess chemical and biological properties similar to already known molecules. We compare the FCD to four commonly reported metrics: mean logP (Wildman and Crippen, 1999), mean druglikeness (Bickerton et al., 2012), mean Synthetic Accessibility (SA) score (Ertl and Schuffenhauer, 2009) and the internal diversity score with Tanimoto distance (Benhenda, 2017) of the generated samples. Additionally, to evaluate if the activations of ChemNet are a suitable representation for molecules we also compare the FCD to the Fréchet Fingerprint Distance (FFD) which is calculated analogously to the FCD. However, in contrast to the FCD, the FFD is based on 2048 bit ECFP_4 fingerprints. A generative model performs well if it produces samples in a similar logP and druglikeness range as the training data and possesses comparable internal diversity.
All compared metrics were calculated with RDKit. The internal diversity score is calculated using RDKit Morgan fingerprints with radius 2 (equivalent to ECFP_4 (Rogers and Hahn, 2010)). For the comparison of metrics, we selected a subset of real molecules from the three databases that were used neither for training ChemNet nor for estimating the real sample statistics (mean and covariance) needed for calculating the FCD.
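To make the internal diversity score concrete, the following sketch computes the mean pairwise Tanimoto distance on binary fingerprints with NumPy. It assumes the fingerprints have already been computed (for example as Morgan bit vectors) and are passed as a 0/1 matrix. Averaging over distinct pairs only, and treating two all-zero fingerprints as identical, are assumptions of this sketch and may differ from the definition used by Benhenda (2017).

```python
import numpy as np


def tanimoto_similarity_matrix(fingerprints):
    """Pairwise Tanimoto similarity for a (n_molecules, n_bits) 0/1 matrix."""
    fps = np.asarray(fingerprints, dtype=np.float64)
    intersection = fps @ fps.T                      # bits set in both
    counts = fps.sum(axis=1)                        # bits set per molecule
    union = counts[:, None] + counts[None, :] - intersection
    with np.errstate(divide="ignore", invalid="ignore"):
        # Two all-zero fingerprints are treated as identical (assumption).
        return np.where(union > 0, intersection / union, 1.0)


def internal_diversity(fingerprints):
    """One minus the mean Tanimoto similarity over all distinct pairs."""
    sim = tanimoto_similarity_matrix(fingerprints)
    n = sim.shape[0]
    off_diagonal = sim[~np.eye(n, dtype=bool)]
    return float(1.0 - off_diagonal.mean())


# Toy example: two identical fingerprints and one disjoint fingerprint.
toy = np.array([[1, 1, 0, 0],
                [1, 1, 0, 0],
                [0, 0, 1, 1]])
diversity = internal_diversity(toy)
```

In the toy example one of the three pairs is identical and the other two share no bits, so the mean similarity is 1/3 and the diversity is 2/3. A collapsed set of identical molecules gives a diversity of 0.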
For the real subsample of the combined databases we calculated each metric, i.e., logP, druglikeness, SA score, internal diversity, FFD and FCD, to illustrate the baseline value for real samples. (Calculating the full distance matrix required for the internal diversity is too time- and memory-intensive; we therefore averaged the results from 5 subsets of 5 000 randomly drawn samples.) In the next step, to simulate generative models with particular flaws, we created “disturbed data sets” of molecules a) with low druglikeness (below the 5th percentile), b) with high logP values (above the 95th percentile), c) with low SA scores (below the 5th percentile), d) with high Tanimoto similarity and e) stemming from the same target class. These simulate generative models producing molecules that a) have a low druglikeness, b) have a high logP, c) are difficult to synthesize, d) have low diversity and e) are active for a specific target. We aim to show that the FCD can detect all four biases, thereby combining the benefits of other metrics into a single one, and that it is able to identify distribution differences in biologically meaningful subsets.

Bias towards low druglikeness: A biologically based assessment can be performed in terms of the average druglikeness of the generated compounds. Druglikeness (Bickerton et al., 2012) is the geometric mean of several desired molecular properties such as solubility, permeability and metabolic stability. Therefore, it is often used to determine if generated molecules are close to real samples. Molecules with a low druglikeness are commonly not desired since they possess low bioavailability. We simulated models generating molecules with a druglikeness lower than the 5th percentile of the druglikeness values of real molecules. We randomly selected 5 000 molecules with a low druglikeness for this simulation. This was repeated 5 times to create 5 different generative models producing molecules with a low druglikeness.
Bias towards high logP: Similar chemical properties such as the average logP value of the generated molecules are another way to judge whether created molecules are reasonable or not. Therefore, we simulated a generative model that has a bias towards generating molecules with a high logP. For this purpose, we selected 5 000 molecules that have a logP value higher than the 95th percentile of the logP values of real molecules. Although these molecules are valid, they are located on the edge of the logP distribution, which should be detectable by an appropriate metric. This procedure was repeated 5 times to simulate 5 different generative models with a bias towards molecules with a high logP.

Low synthetic accessibility: Furthermore, it is beneficial if a generative model produces molecules which are indeed synthetically feasible. Therefore, we simulated generative models which have a bias towards molecules with a low synthetic accessibility (SA) score. We randomly drew 5 000 molecules with an SA score lower than the 5th percentile of the SA scores of real molecules. This procedure was repeated 5 times to simulate 5 different generative models with a bias towards low SA.

Mode collapse: Another desirable property of generative models is to create a wide variety of different samples. However, generative models might suffer from mode collapse (Metz et al., 2017; Unterthiner et al., 2018), such that they produce only molecules with a low diversity. Although there also exist models which do not suffer from mode collapse (Popova et al., 2017), an appropriate metric should be able to assess the internal diversity of the generated molecules. We simulated generative models which suffer from mode collapse. For this purpose, we used single linkage clustering with a Tanimoto similarity cutoff of 0.65. We have chosen a large cluster of which we randomly selected 5 000 molecules. This procedure was repeated 5 times to simulate 5 different generative models suffering from mode collapse.

Kinase inhibitors: Conditional generative models are used to produce molecules active for a specific target. We assessed if the metrics can catch the bias towards a certain target family. For this experiment we selected a large-scale activity assay from PubChem (AID 720504). We randomly selected 5 000 active molecules for polo-like kinase 1 polo-box domain (PLK1-PBD).
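The percentile-based disturbed sets described above (low druglikeness, high logP, low SA score) all follow the same pattern: compute a percentile threshold on a property of the real molecules, keep the molecules beyond it, and draw 5 000 of them at random, repeated 5 times. A minimal sketch of that selection step is given below. The function name is hypothetical and random numbers stand in for the real property values, which would come from RDKit in the actual experiments.

```python
import numpy as np


def sample_tail(values, percentile, tail, n_samples, rng):
    """Draw indices of molecules whose property lies beyond a percentile.

    tail="low" keeps values below the threshold, tail="high" keeps values
    above it. Sampling is without replacement.
    """
    values = np.asarray(values)
    threshold = np.percentile(values, percentile)
    if tail == "low":
        candidates = np.flatnonzero(values < threshold)
    elif tail == "high":
        candidates = np.flatnonzero(values > threshold)
    else:
        raise ValueError("tail must be 'low' or 'high'")
    return rng.choice(candidates, size=n_samples, replace=False)


# Placeholder property values (assumption: stand-in for e.g. logP values).
rng = np.random.default_rng(0)
property_values = rng.normal(size=200_000)

# Five simulated "generative models" biased towards high property values.
disturbed_sets = [
    sample_tail(property_values, 95, "high", 5_000, rng) for _ in range(5)
]
```

Each entry of `disturbed_sets` holds the indices of one simulated biased model, so the corresponding molecules can be looked up and scored with every metric.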
We examined the average changes of the metrics when generative models suffer from a) a bias towards low druglikeness, b) a bias towards high logP molecules, c) a bias towards low SA scores and d) mode collapse. Furthermore, we assessed the changes of the metrics for e) molecules stemming from the same target class (see Fig. 1). In a), druglikeness, the FFD and the FCD are clearly able to detect the disturbed data set. In b), the bias towards high logP values is reflected in druglikeness, logP, the FFD and the FCD. In c), the disturbed set is recognized by the SA score, the FFD and the FCD, and to a minor extent by druglikeness and internal diversity. The mode collapse of d) is clearly revealed by internal diversity, the FFD and the FCD. In e), the active compounds for PLK1 are only uncovered by the FFD and the FCD. Both the FFD and the FCD show in all four cases a clear difference in the mean of the disturbed and undisturbed data sets. The difference is especially high for both metrics in the case in which we simulated a generative model that suffers from mode collapse. Although the FFD is able to detect all the biased sets, the FCD makes more distinct differentiations. Especially in experiment e), where more biologically relevant information is necessary, the FCD shows superior behaviour.
FCD of recent generative models.
We calculated the FCD for publicly available SMILES strings of recently developed generative models (Benhenda, 2017; Olivecrona et al., 2017; Segler et al., 2017). Segler et al. (2017) generated more than 450 000 SMILES strings with an LSTM network used for next-character prediction. Benhenda (2017) aimed at producing SMILES strings with ORGAN (Guimaraes et al., 2017) and RL that are active against the dopamine receptor D2 (DRD2). 32 000 molecules were generated after 40 and 60 training iterations for RL, and after 30 and 60 training iterations for ORGAN. Olivecrona et al. (2017) trained two RL agents to produce molecules active for the DRD2 receptor. The canonical and the reduced agent were trained on the complete ChEMBL and a subset from which molecules similar to Celecoxib were removed, respectively. Each agent produced 128 001 molecules. Furthermore, we examined a simple rule-based approach, which randomly draws C, N and O atoms and concatenates them to obtain SMILES with random lengths between 1 and 50. The sample statistic was calculated with 200 000 randomly selected real molecules that were not used for training ChemNet. For each model, we randomly drew 10 000 samples of the generated SMILES 10 times and determined the FCD score.
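The rule-based approach described above is simple enough to sketch directly. The version below draws a length between 1 and 50 and concatenates randomly chosen C, N and O atoms. The function names, the uniform sampling of length and atoms, and the seed are assumptions, and the sketch does not check the resulting strings for chemical validity.

```python
import random

ATOMS = ("C", "N", "O")


def random_chain_smiles(rng, min_len=1, max_len=50):
    """Concatenate randomly drawn C, N and O atoms into a linear SMILES."""
    length = rng.randint(min_len, max_len)  # both bounds are inclusive
    return "".join(rng.choice(ATOMS) for _ in range(length))


def generate_rule_based(n_molecules, seed=0):
    """Generate n_molecules random chains with a reproducible seed."""
    rng = random.Random(seed)
    return [random_chain_smiles(rng) for _ in range(n_molecules)]


# One draw of 10 000 samples, matching the sample size used per FCD estimate.
samples = generate_rule_based(10_000)
```

Such strings consist only of single-bonded heavy-atom chains, which is why a generator of this kind can score well on simple validity checks while being far from the distribution of real molecules.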
We calculated our new FCD performance metric for the following methods:

Segler: The next-character LSTM approach by Segler et al. (2017).

Olivecrona canon agent: An RNN-based RL approach to generate active molecules for DRD2 (Olivecrona et al., 2017).

Olivecrona reduced agent: An RNN-based RL approach to generate active molecules for DRD2 using a reduced training set (Olivecrona et al., 2017).

RL 40/60 iterations: An LSTM-based RL approach to generate active molecules for DRD2 trained for 40/60 iterations (Benhenda, 2017).

ORGAN 30/60 iterations: A GAN using an LSTM-based generator and a CNN-based discriminator trained for 30/60 iterations (Benhenda, 2017).

baseline: A generative model producing just methane.
Fig. 2 shows the results for the different methods. Please note that this is not a comparison of methods but should demonstrate that the ranking by the FCD matches our intuition. The lowest FCD is obtained by randomly drawn real molecules. Since this value is close to zero, the randomly drawn subsamples sufficiently represent the underlying distribution of real molecules. The method “Segler” (Segler et al., 2017) achieved the second-lowest FCD, indicating that the distribution of these generated molecules is closer to the real molecule distribution than distributions produced by other methods. This matches our intuition, because the other methods (“Olivecrona”, “RL”, and “ORGAN”) were optimized to generate molecules that are active for DRD2 and are therefore not designed to approximate the distribution of the complete set of real molecules. This optimization procedure is clearly captured by the FCD metric: for all these methods the FCD is notably higher (see Fig. 2). Furthermore, ORGAN after 60 iterations and RL after 60 iterations produce molecules that are more distant from real molecules than ORGAN after 30 iterations and RL after 40 iterations, respectively. Intuitively, more training iterations lead to more DRD2-specific molecules and therefore to molecules more distant from the complete real molecule distribution and to lower diversity (Benhenda, 2017). Additionally, the FCD captures that the canonical and the reduced agents both learn a similar chemical space, as concluded by Olivecrona et al. (2017). The rule-based system has the highest FCD, which can be considered an easily achievable baseline. Overall, the ranking of the methods by their FCD matches our intuition and previous findings. In this comparison, randomly drawn molecules from the combined data set were used to determine the sample statistic underlying the FCD; therefore, methods which are optimized to capture the distribution of active molecules for DRD2 have a higher FCD.
By using a different distribution to calculate the sample statistic, the FCD could also be used to evaluate targeted molecule generation.
Conclusions.
In previous studies, the assessment of generated molecules was based on specific properties such as logP, druglikeness or SA score. However, looking at all these properties individually makes the comparison of generative models difficult. We introduce the FCD, a novel metric for generative models for drug design. The FCD is based on a multitask network and therefore incorporates a wide variety of important chemical and biological features into a single metric. The FCD was able to detect four potential flaws of generative models, as we have demonstrated in our experiments. Furthermore, we show that the FCD can also catch a biological bias (active PLK1 kinase inhibitors). Our proposed approach is not restricted to generative models that produce SMILES strings, but can readily be used for graph-generating methods by converting the produced molecules into a SMILES format. Within our experiments, we also compare the FCD to a fingerprint-based Fréchet distance. This comparison clearly illustrates that incorporating biological information further improves the metric and makes the differences between the real and biased sets more distinct. Overall, we show that the FCD is a comprehensive, simple and powerful metric for the evaluation of generative models in drug discovery.
Availability
Implementations to calculate the FCD and to reproduce the experiments of this work are available at: github.com/bioinf-jku/FCD
Acknowledgments
We thank Marwin Segler, Marcus Olivecrona and Mostapha Benhenda for providing their generated molecules. Furthermore, we thank Marwin Segler for helpful discussion. KP, TU, and GK were funded by the Institute of Bioinformatics, Johannes Kepler University Linz, Austria.
This work was supported by Merck Group (research agreement 05/2016), Zalando (research agreement 01/2016) and by LIT (LIT20173YOU003).
References
 Benhenda (2017) Benhenda, M. (2017). ChemGAN challenge for drug discovery: can AI reproduce natural chemical diversity? arXiv preprint arXiv:1708.08227.
 Bento et al. (2013) Bento, A. P., Gaulton, A., Hersey, A., Bellis, L. J., Chambers, J., Davies, M., Krüger, F. A., Light, Y., Mak, L., McGlinchey, S., Nowotka, M., Papadatos, G., Santos, R., and Overington, J. P. (2013). The ChEMBL bioactivity database: an update. Nucleic Acids Research, 42(D1):D1083–D1090.
 Bickerton et al. (2012) Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S., and Hopkins, A. L. (2012). Quantifying the chemical beauty of drugs. Nature Chemistry, 4(2):90–98.
 Dowson and Landau (1982) Dowson, D. C. and Landau, B. V. (1982). The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12:450–455.
 Ertl and Schuffenhauer (2009) Ertl, P. and Schuffenhauer, A. (2009). Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. Journal of Cheminformatics, 1(1):8.
 Fréchet (1957) Fréchet, M. (1957). Sur la distance de deux lois de probabilité. C. R. Acad. Sci. Paris, 244:689–692.
 Gómez-Bombarelli et al. (2016) Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., and Aspuru-Guzik, A. (2016). Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science.
 Guimaraes et al. (2017) Guimaraes, G. L., Sanchez-Lengeling, B., Outeiral, C., Cunha Farias, P. L., and Aspuru-Guzik, A. (2017). Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models. arXiv preprint arXiv:1705.10843.
 Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). GANs trained by a two timescale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6629–6640.
 Hochreiter and Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
 Irwin et al. (2012) Irwin, J. J., Sterling, T., Mysinger, M. M., Bolstad, E. S., and Coleman, R. G. (2012). ZINC: A free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7):1757–1768.
 Jaques et al. (2017) Jaques, N., Gu, S., Bahdanau, D., Hernández-Lobato, J. M., Turner, R. E., and Eck, D. (2017). Sequence tutor: Conservative fine-tuning of sequence generation models with KL-control. arXiv preprint arXiv:1705.10843.
 Klambauer et al. (2017) Klambauer, G., Unterthiner, T., Mayr, A., and Hochreiter, S. (2017). Self-normalizing neural networks. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 971–980. Curran Associates, Inc.
 Li et al. (2018) Li, Y., Vinyals, O., Dyer, C., Pascanu, R., and Battaglia, P. (2018). Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324.
 Mayr et al. (2018) Mayr, A., Klambauer, G., Unterthiner, T., Steijaert, M., Wegner, J., Ceulemans, H., Clevert, D.-A., and Hochreiter, S. (2018). Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Under review.
 Metz et al. (2017) Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. (2017). Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163.
 Olivecrona et al. (2017) Olivecrona, M., Blaschke, T., Engkvist, O., and Chen, H. (2017). Molecular de-novo design through deep reinforcement learning. Journal of Cheminformatics, 9(1):48.
 Popova et al. (2017) Popova, M., Isayev, O., and Tropsha, A. (2017). Deep reinforcement learning for de-novo drug design. CoRR, abs/1711.10907.
 Rogers and Hahn (2010) Rogers, D. and Hahn, M. (2010). Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754.
 Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. (2016). Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242.
 Segler et al. (2017) Segler, M. H., Kogej, T., Tyrchan, C., and Waller, M. P. (2017). Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Science.
 Simonovsky and Komodakis (2018) Simonovsky, M. and Komodakis, N. (2018). Graphvae: Towards generation of small graphs using variational autoencoders. arXiv preprint arXiv:1802.03480.
 Unterthiner et al. (2018) Unterthiner, T., Nessler, B., Klambauer, G., Heusel, M., Ramsauer, H., and Hochreiter, S. (2018). Coulomb GANs: Provably optimal nash equilibria via potential fields. International Conference of Learning Representations (ICLR).
 Wang et al. (2016) Wang, Y., Bryant, S. H., Cheng, T., Wang, J., Gindulyte, A., Shoemaker, B. A., Thiessen, P. A., He, S., and Zhang, J. (2016). PubChem BioAssay: 2017 update. Nucleic Acids Research, 45(D1):D955–D963.
 Wasserstein (1969) Wasserstein, L. N. (1969). Markov processes over denumerable products of spaces describing large systems of automata. Probl. Inform. Transmission, 5:47–52.
 Weininger (1988) Weininger, D. (1988). SMILES, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of Chemical Information and Modeling, 28(1):31–36.
 Wildman and Crippen (1999) Wildman, S. A. and Crippen, G. M. (1999). Prediction of physicochemical parameters by atomic contributions. Journal of Chemical Information and Computer Sciences, 39(5):868–873.
 Yang et al. (2017) Yang, X., Zhang, J., Yoshizoe, K., Terayama, K., and Tsuda, K. (2017). ChemTS: an efficient python library for de novo molecular generation. Science and Technology of Advanced Materials, 18(1):972–976.