MoFlow: An Invertible Flow Model for Generating Molecular Graphs

06/17/2020 ∙ by Chengxi Zang, et al. ∙ cornell university 0

Generating molecular graphs with desired chemical properties driven by deep graph generative models provides a very promising way to accelerate drug discovery process. Such graph generative models usually consist of two steps: learning latent representations and generation of molecular graphs. However, to generate novel and chemically-valid molecular graphs from latent representations is very challenging because of the chemical constraints and combinatorial complexity of molecular graphs. In this paper, we propose MoFlow, a flow-based graph generative model to learn invertible mappings between molecular graphs and their latent representations. To generate molecular graphs, our MoFlow first generates bonds (edges) through a Glow based model, then generates atoms (nodes) given bonds by a novel graph conditional flow, and finally assembles them into a chemically valid molecular graph with a posthoc validity correction. Our MoFlow has merits including exact and tractable likelihood training, efficient one-pass embedding and generation, chemical validity guarantees, 100% reconstruction of training data, and good generalization ability. We validate our model by four tasks: molecular graph generation and reconstruction, visualization of the continuous latent space, property optimization, and constrained property optimization. Our MoFlow achieves state-of-the-art performance, which implies its potential efficiency and effectiveness to explore large chemical space for drug discovery.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 8

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

Drug discovery aims at finding candidate molecules with desired chemical properties for clinical trials, which is a long (10-20 years) and costly ($0.5-$2.6 billion) process with a high failure rate (Paul et al., 2010; Avorn, 2015). Recently, deep graph generative models have demonstrated their big potential to accelerate the drug discovery process by exploring large chemical space in a data-driven manner (Jin et al., 2018; Zhavoronkov et al., 2019). These models usually first learn a continuous latent space by encoding111In this paper, we use inference, embedding or encoding interchangeably to refer to the transformation from molecular graphs to the learned latent space, and we use decoding or generation for the reverse transformation. the training molecular graphs and then generate novel and optimized ones through decoding from the learned latent space guided by targeted properties (Gómez-Bombarelli et al., 2018; Jin et al., 2018). However, it is still very challenging to generate novel and chemically-valid molecular graphs with desired properties since: a) the scale of the chemical space of drug-like compounds is (Mullard, 2017) but the scale of possibly generated molecular graphs by existing methods are much smaller, and b) generating molecular graphs that have both multi-type nodes and edges and follow bond-valence constraints is a hard combinatorial task.

Prior works leverage different deep generative frameworks for generating molecular SMILES codes (Weininger et al., 1989)

or molecular graphs, including variational autoencoder (VAE)-based models

(Kusner et al., 2017; Dai et al., 2018; Simonovsky and Komodakis, 2018; Ma et al., 2018; Liu et al., 2018; Bresson and Laurent, 2019; Jin et al., 2018), generative adversarial networks (GAN)-based models (De Cao and Kipf, 2018; You et al., 2018), and autoregressive (AR)-based models (Popova et al., 2019; You et al., 2018). In this paper, we explore a different deep generative framework, namely the normalizing flow (Dinh et al., 2014; Madhawa et al., 2019; Kingma and Dhariwal, 2018) to generate molecular graphs. Compared with above three frameworks, the flow-based models are the only one which can memorize and exactly reconstruct all the input data, and at the same time have the potential to generate more novel, unique and valid molecules, which implies its potential capability of deeper exploration of the huge chemical space. To our best knowledge, there have been three flow-based models proposed for molecular graph generation. The GraphAF (Shi et al., 2020) model is an autoregressive flow-based model that achieves state-of-the-art performance in molecular graph generation. GraphAF generates molecules in a sequential manner by adding each new atom or bond followed by a validity check. GraphNVP (Madhawa et al., 2019) and GRF (Honda et al., 2019) are proposed for molecular graph generation in a one-shot manner. However, they cannot guarantee chemical validity and thus show poor performance in generating valid and novel molecules.

In this paper, we propose a novel deep graph generative model named MoFlow to generate molecular graphs. Our MoFlow is the first of its kind which not only generates molecular graphs efficiently by invertible mapping at one shot, but also has a chemical validity guarantee. More specifically, to capture the combinatorial atom-and-bond structures of molecular graphs, we propose a variant of the Glow model (Kingma and Dhariwal, 2018) to generate bonds (multi-type edges, e.g., single, double and triple bonds), a novel graph conditional flow to generate atoms (multi-type nodes, e.g. C, N etc.) given bonds by leveraging graph convolutions, and finally assemble atoms and bonds into a valid molecular graph which follows bond-valence constraints. We illustrate our modelling framework in Figure 1

. Our MoFlow is trained by exact and tractable likelihood estimation, and one-pass inference and generation can be efficiently utilized for molecular graph optimization.

We validate our MoFlow through a wide range of experiments from molecular graph generation, reconstruction, visualization to optimization. As baselines, we compare the state-of-the-art VAE-based model (Jin et al., 2018), autoregressive-based models (You et al., 2018; Popova et al., 2019), and all three flow-based models (Madhawa et al., 2019; Honda et al., 2019; Shi et al., 2020). As for memorizing input data, MoFlow achieves reconstruction rate. As for exploring the unknown chemical space, MoFlow outperforms above models by generating more novel, unique and valid molecules (as demonstrated by the N.U.V. scores in Table 2 and 3). MoFlow generates chemically-valid molecules when sampling from prior distributions. Furthermore, if without validity correction, MoFlow still generates much more valid molecules than existing models (validity-without-check scores in Table 2 and 3). For example, the state-of-the-art autoregressive-flow-based model GraphAF (Shi et al., 2020) achieves and validity-without-check scores for two datasets while MoFlow achieves and respectively, thanks to its capability of capturing the chemical structures in a holistic way. As for chemical property optimization, MoFlow can find much more novel molecules with top drug-likeness scores than existing models (Table 4 and Figure 5). As for constrained property optimization, MoFlow finds novel and optimized molecules with the best similarity scores and second best property improvement (Table 5).

It is worthwhile to highlight our contributions as follows:

  • Novel MoFlow model: our MoFlow is one of the first flow-based graph generative models which not only generates molecular graphs at one shot by invertible mapping but also has a validity guarantee. To capture the combinatorial atom-and-bond structures of molecular graphs, we propose a variant of Glow model for bonds (edges) and a novel graph conditional flow for atoms (nodes) given bonds, and then assemble them into valid molecular graphs.

  • State-of-the-art performance: our MoFlow achieves many state-of-the-art results w.r.t. molecular graph generation, reconstruction, optimization, etc., and at the same time our one-shot inference and generation are very efficient, which implies its potentials in deep exploration of huge chemical space for drug discovery.

The outline of this paper is: survey (Sec. 2), proposed method (Sec. 3 and 4), experiments (Sec. 5), and conclusions (Sec. 6

). In order to promote reproducibility, our codes and datasets are open-sourced at

https://github.com/calvin-zcx/moflow.

2. Related Work

Molecular Generation. Different deep generative frameworks are proposed for generating molecular SMILES or molecular graphs. Among the variational autoencoder (VAE)-based models (Kusner et al., 2017; Dai et al., 2018; Simonovsky and Komodakis, 2018; Ma et al., 2018; Liu et al., 2018; Bresson and Laurent, 2019; Jin et al., 2018), the JT-VAE (Jin et al., 2018) generates valid tree-structured molecules by first generating a tree-structured scaffold of chemical substructures and then assembling substructures according to the generated scaffold. The MolGAN (De Cao and Kipf, 2018) is a generative adversarial networks (GAN)-based model but shows very limited performance in generating valid and unique molecules. The autoregressive-based models generate molecules in a sequential manner with validity check at each generation step. For example, the MolecularRNN (Popova et al., 2019) sequentially generates each character of SMILES and the GCPN (You et al., 2018) sequentially generates each atom/bond in a molecular graphs. In this paper, we explore a different deep generative framework, namely the normalizing flow models (Dinh et al., 2014; Madhawa et al., 2019; Kingma and Dhariwal, 2018), for molecular graph generation, which have the potential to memorize and reconstruct all the training data and generalize to generating more valid, novel and unique molecules.

Flow-based Models. The (normalizing) flow-based models try to learn mappings between complex distributions and simple prior distributions through invertible neural networks and such a framework has good merits of exact and tractable likelihood estimation for training, efficient one-pass inference and sampling, invertible mapping and thus reconstructing all the training data etc. Examples include NICE(Dinh et al., 2014), RealNVP(Dinh et al., 2016), Glow(Kingma and Dhariwal, 2018) and GNF (Liu et al., 2019) which show promising results in generating images or even graphs (Liu et al., 2019). See latest reviews in (Papamakarios et al., 2019; Kobyzev et al., 2019) and more technical details in Section 3.

To our best knowledge, there are three flow-based models for molecular graph generation. The GraphAF (Shi et al., 2020) is an autoregressive flow-based model which achieves state-of-the-art performance in molecular graph generation. The GraphAF generates molecular graphs in a sequential manner with validity check when adding any new atom or bond. The GraphNVP (Madhawa et al., 2019) and GRF (Honda et al., 2019) are proposed for molecular graph generation in a one-shot manner. However, they have no guarantee for chemical validity and thus show very limited performance in generating valid and novel molecular graphs. Our MoFlow is the first of its kind which not only generates molecular graphs efficiently by invertible mapping at one shot but also has a validity guarantee. In order to capture the atom-and-bond composition of molecules, we propose a variant of Glow(Kingma and Dhariwal, 2018) model for bonds and a novel graph conditional flow for atoms given bonds, and then combining them with a post-hoc validity correction. Our MoFlow achieves many state-of-the-art results thanks to capturing the chemical structures in a holistic way, and our one-shot inference and generation are more efficient than sequential models.

3. Model Preliminary

The flow framework. The flow-based models aim to learn a sequence of invertible transformations

between complex high-dimensional data

and in a latent space with the same number of dimensions where the latent distribution is easy to model (e.g., strong independence assumptions hold in such a latent space). The potentially complex data in the original space can be modelled by the change of variable formula where and:

(1)

To sample is achieved by sampling and then to transform by the reverse mapping of .

Let ,   where () are invertible mappings, , and follows a standard isotropic Gaussian with independent dimensions. Then we get the log-likelihood of by the change of variable formula as follows:

(2)

where

is the probability of the

dimension of and is an invertible deep neural network to be learnt. Thus, the exact-likelihood-based training is tractable.

Invertible affine coupling layers. However, how to design a.) an invertible function with b.) expressive structures and c.) efficient computation of the Jacobian determinant are nontrivial. The NICE(Dinh et al., 2014) and RealNVP (Dinh et al., 2016) design an affine coupling transformation :

(3)

by splitting into two partitions . Thus, a.) the invertibility is guaranteed by:

(4)

b.) the expressive power depends on arbitrary neural structures of the Scale function and the Transformation function in the affine transformation of , and c.) the Jacobian determiant can be computed efficiently by .

Splitting Dimensions. The flow-based models, e.g., RealNVP (Dinh et al., 2016) and Glow (Kingma and Dhariwal, 2018), adopt squeeze operation which compresses the spatial dimension into to make more channels and then split channels into two halves for the coupling layer. A deep flow model at a specific layer transforms unchanged dimensions in the previous layer to keep all the dimensions transformed. In order to learn an optimal partition of , Glow (Kingma and Dhariwal, 2018) model introduces an invertible convolution with learnable convolution kernel which is initialized as a random rotation matrix. After the transformation , a fixed partition over the channel is used for the affine coupling layers.

Numerical stability by actnorm. In order to ensure the numerical stability of the flow-based models, actnorm layer is introduced in Glow (Kingma and Dhariwal, 2018) which normalizes dimensions in each channel over a batch by an affine transformation with learnable scale and bias. The scale and the bias are initialized as the mean and the inverse of the standard variation of the dimensions in each channel over the batch.

4. Proposed MoFlow Model

In this section, we first define the problem and then introduce our Molecular Flow (MoFlow) model in detail. We show the outline of our MoFlow in Figure 1 as a roadmap for this section.

Figure 1. The outline of our MoFlow. A molecular graph (e.g. Metformin) is represented by a feature matrix

for atoms and adjacency tensors

for bonds. Inference: the graph conditional flow (GCF) for atoms (Sec. 4.2) transforms given

into conditional latent vector

, and the Glow for bonds (Sec. 4.3) transform into latent vector

. The latent space follows a spherical Gaussian distribution. Generation: the generation process is the reverse transformations of previous operations, followed by a validity correction (Sec. 

4.4) procedure which ensures the chemical validity. We summarize MoFlow in Sec. 4.5. Regression and optimization: the mapping between latent space and molecular properties are used for molecular graph optimization and property prediction (Sec. 5.3, Sec. 5.4).

4.1. Problem Definition: Learning a Probability Model of Molecular Graphs

Let denote the set of olecules which is the Cartesian product of the tom set with at most atoms belonging to atom types and the ond set with bond types. A molecule is a pair of an atom matrix and a bond tensor

. We use one-hot encoding for the empirical molecule data where

represents the atom has atom type , and represents a type bond between atom and atom . Thus, a molecule can be viewed as an undirected graph with multi-type nodes and multi-type edges.

Our primary goal is to learn a molecule generative model which is the probability of sampling any molecule from . In order to capture the combinatorial atom-and-bond structures of molecular graphs, we decompose the into two parts:

(5)

where is the distribution of molecular graphs, is the distribution of bonds (edges) in analogy to modelling multi-channel images , and is the conditional distribution of atoms (nodes) given the bonds by leveraging graph convolution operations. The and are learnable modelling parameters. In contrast with VAE or GAN based frameworks, we can learn the parameters by exact maximum likelihood estimation (MLE) framework by maximizing:

(6)

Our model thus consists of two parts, namely a graph conditional flow for atoms to learn the atom matrix conditional on the bond tensors and a flow for bonds to learn bond tensors. We further learn a mapping between the learned latent vectors and molecular properties to regress the graph-based molecular properties, and to guide the generation of optimized molecular graphs.

4.2. Graph Conditional Flow for Atoms

Given a bond tensor , our goal of the atom flow is to generate the right atom-type matrix to assemble valid molecules . We first define -conditional flow and graph conditional flow to transform given into conditional latent variable which follows isotropic Gaussian . We can get the conditional probability of atom features given the bond graphs by a conditional version of the change of variable formula.

4.2.1. -Conditional Flow and Graph Conditional Flow

Definition 4.0 ().

-conditional flow: A -conditional flow
is an invertible and dimension-kept mapping and there exists reverse transformation where .

The condition keeps fixed during the transformation. Under the independent assumption of and , the Jacobian of is:

(7)

the determiant of this Jacobian is , and thus the conditional version of the change of variable formula in the form of log-likelihood is:

(8)
Definition 4.0 ().

Graph conditional flow: A graph conditional flow is a -conditional flow where is the adjacency tenor for edges with types and is the feature matrix of the corresponding nodes.

Figure 2. Graph conditional flow for the atom matrix. We show the details of one invertible graph coupling layer and a multiscale structure consists of a cascade of layers of such graph coupling layer. The graphnorm is computed only once.

4.2.2. Graph coupling layer

We construct aforementioned invertible mapping and by the scheme of the affine coupling layer. Different from traditional affine coupling layer, our coupling transformation relies on graph convolution (Sun et al., 2019) and thus we name such a coupling transformation as a graph coupling layer.

For each graph coupling layer, we split input into two parts along the row dimension, and we get the output as follows:

(9)

where is the element-wise product. We deign the scale function and the transformation function in each graph coupling layer by incorporating graph convolution structures. The bond tensor keeps a fixed value during transforming the atom matrix . We also apply the masked convolution idea in (Dinh et al., 2016) to the graph convolution in the graph coupling layer. Here, we adopt Relational Graph Convolutional Networks (R-GCN) (Schlichtkrull et al., 2018) to build graph convolution layer graphconv as follows:

(10)

where is the normalized adjacency matrix at channel , is the sum of the in-degree over all the channels for each node, and is a binary mask to select a partition from A. Because the bond graph is fixed during graph coupling layer and thus the graph normalization, denoted as graphnorm, is computed only once.

We use multiple stacked graphconv-¿BatchNorm1d-¿ReLu layers with a multi-layer perceptron (MLP) output layer to build the graph scale function

and the graph transformation function . What’s more, instead of using exponential function for the as discussed in Sec. 3

, we adopt Sigmoid function for the sake of the numerical stability of cascading multiple flow layers. The reverse mapping of the graph coupling layer

is:

(11)

The logarithm of the Jacobian determiant of each graph coupling layer can be efficiently computed by:

(12)

where iterates each element. In principle, we can use arbitrary complex graph convolution structures for and since the computing of above Jacobian determinant of does not involve in computing the Jacobian of or .

4.2.3. Actnorm for 2-dimensional matrix

For the sake of numerical stability, we design a variant of invertible actnorm layer (Kingma and Dhariwal, 2018) for the 2-dimensional atom matrix, denoted as actnorm2D (activation normalization for 2D matrix), to normalize each row, namely the feature dimension for each node, over a batch of 2-dimensional atom matrices. Given the mean

and the standard deviation

for each row dimension, the normalized input follows where is a small constant, the reverse transformation is , and the logarithmic Jacobian determiant is:

(13)

4.2.4. Deep architectures

We summarize our deep graph conditional flow in Figure 2. We stack multiple graph coupling layers to form graph conditional flow. We alternate different partition of in each layer to transform the unchanged part of the previous layer.

4.3. Glow for Bonds

The bond flow aims to learn an invertible mapping where the transformed latent variable follows isotropic Gaussian. According to the change of variable formula, we can get the logarithmic probability of bonds by and generating bond tensor by reversing the mapping where . We can use arbitrary flow model for the bond tensor and we build our bond flow based on a variant of Glow (Kingma and Dhariwal, 2018) framework.

We also follow the scheme of affine coupling layer to build invertible mappings. For each affine coupling layer, We split input into two parts along the channel dimension, and we get the output as follows:

(14)

And thus the reverse mapping is:

(15)

Instead of using exponential function as scale function, we use the Sigmoid function with range to ensure the numerical stability when stacking many layers. We find that exponential scale function leads to a large reconstruction error when the number of affine coupling layers increases. The scale function and the transformation function in each affine coupling layer can have arbitrary structures. We use multiple conv2d-¿BatchNorm2d-¿ReLu layers to build them. The logarithm of the Jacobian determiant of each affine coupling is

(16)
Figure 3. A variant of Glow for bonds’ adjacency tensors.

In order to learn optimal partition and ensure model’s stability and learning rate, we also use the invertible convolution layer and actnorm layer adopted in the Glow. In order to get more channels for masking and transformation, we squeeze the spatial size of from to by a factor and apply the affine coupling transformation to the squeezed data. The reverse unsqueeze operation is adopted to the output. We summarize our bond flow in Figure 3.

4.4. Validity Correction

Molecules must follow the valency constraints for each atom, but assembling a molecule from generated bond tensor and atom matrix may lead to chemically invalid ones. Here we define the valency constraint for the atom as:

(17)

where is the one-hot bond tensor over order of chemical bonds (single, double, triple) and represents the formal charge. Different from existing valency constraints defined in (You et al., 2018; Popova et al., 2019), we consider the effect of formal charge which may introduce extra bonds for the charged atoms. For example, ammonium [NH4] may have 4 bonds for N instead of 3. Similarly, S and O may have 3 bonds instead of 2. Here we only consider for N, S and O and make for other atoms.

In contrast with the existing reject-sampling-based validity check adopted in the autoregressive models

(You et al., 2018; Popova et al., 2019), we introduce a new post-hoc validity correction procedure after generating a molecule at once: 1) check the valency constraints of ; 2) if all the atoms of follows valecny constraints, we return the largest connected component of the molecule and end the procedure; 3) if there exists an invalid atom , namely , we sort the bonds of by their order and delete order for the bond with the largest order; 4) go to step 1). Our validity correction procedure tries to make a minimum change to the existing molecule and to keep the largest connected component as large as possible.

4.5. Inference and Generation

We summarize the inference (encoding) and generation (decoding) of molecular graphs by our MoFlow in Algorithm 1 and Algorithm 2 respectively. We visualize the overall framework in Figure 1. As shown in the algorithms, our MoFlow have merits of exact likelihood estimation/training, one-pass inference, invertible and one-pass generation, and chemical validity guarantee.

Input: : graph conditional flow for atoms, : glow for bonds, : atom matrix, : bond tensor, : isotropic Gaussian distributions.
Output: :latent representation for atom , : logarithmic likelihood of molecule .
Return: ,
Algorithm 1 Exact Likelihood Inference (Encoding) of Molecular Graphs by MoFlow 
Input: : graph conditional flow for atoms, : glow for bonds, :latent representation of molecule or sampling from a prior Gaussian, validity-correction: validity correction rules.
Output: : a molecule
Return:
Algorithm 2 Molecular Graph Generation (Decoding) by the Reverse Transformation of MoFlow 

5. Experiments

Following previous works (Jin et al., 2018; Shi et al., 2020), we validate our MoFlow by answering following questions:

  • Molecular graph generation and reconstruction (Sec. 5.1): Can our MoFlow memorize and reconstruct all the training molecule datasets? Can our MoFlow generalize to generate novel, unique and valid molecules as many as possible?

  • Visualizing continuous latent space (Sec. 5.2): Can our MoFlow embed molecular graphs into continuous latent space with reasonable chemical similarity?

  • Property optimization (Sec. 5.3): Can our MoFlow generate novel molecular graphs with optimized properties?

  • Constrained property optimization (Sec. 5.4): Can our MoFlow generate novel molecular graphs with the optimized properties and at the same time keep the chemical similarity as much as possible?

Baselines. We compare our MoFlow with: a) the state-of-the-art VAE-based method JT-VAE (Jin et al., 2018) which captures the chemical validity by encoding and decoding a tree-structured scaffold of molecular graphs; b) the state-of-the-art autoregressive models GCPN (You et al., 2018) and MolecularRNN (MRNN)(Popova et al., 2019)

with reinforcement learning for property optimization, which generate molecules in a sequential manner; c) flow-based methods GraphNVP

(Madhawa et al., 2019) and GRF (Honda et al., 2019) which generate molecules at one shot and the state-of-the-art autoregressive-flow-based model GraphAF (Shi et al., 2020) which generates molecules in a sequential way.

Datasets. We use two datasets QM9 (Ramakrishnan et al., 2014) and ZINC250K (Irwin et al., 2012) for our experiments and summarize them in Table 1. The QM9 contains molecules with maximum atoms in different types, and the ZINC250K has drug-like molecules with maximum atoms in different types. The molecules are kekulized by the chemical software RDKit (Landrum and others, 2006) and the hydrogen atoms are removed. There are three types of edges, namely single, double, and triple bonds, for all molecules. Following the pre-processing procedure in (Madhawa et al., 2019)

, we encode each atom and bond by one-hot encoding, pad the molecules which have less than the maximum number of atoms with an virtual atom, augment the adjacency tensor of each molecule by a virtual edge channel representing no bonds between atoms, and dequantize

(Madhawa et al., 2019; Dinh et al., 2016) the discrete one-hot-encoded data by adding uniform random noise for each dimension, leading to atom matrix and bond tensor for QM9, and and for ZINC250k.

#Mol. Graphs Max. #Nodes #Node Types #Edge Types
QM9 133,885 9 4+1 3+1
ZINC250K 249,455 38 9+1 3+1
Table 1. Statistics of the datasets.

MoFlow Setup. To be comparable with one-shot-flow baseline GraphNVP (Madhawa et al., 2019), for the ZINC250K, we adopt coupling layers and graph coupling layers for the bonds’ Glow and the atoms’ graph conditional flow respectively. We use two convolution layers with hidden dimensions in each coupling layer. For each graph coupling layer, we set one relational graph convolution layer with

dimensions followed by a two-layer multilayer perceptron with

hidden dimensions. As for the QM9, we adopt coupling layers and graph coupling layers for the bonds’ Glow and the atoms’ graph conditional flow respectively. There are two 3*3 convolution layers with hidden dimensions in each coupling layer, and one graph convolution layer with dimensions followed by a two-layer multilayer perceptron with hidden dimensions in each graph coupling layer. As for the optimization experiments, we further train a regression model to map the latent embeddings to different property scalars (discussed in Sec. 5.3 and 5.4) by a multi-layer perceptron with 18-dim linear layer -¿ ReLu -¿ 1-dim linear layer structures. For each dataset, we use the same trained model for all the following experiments.

Empirical Running Time.

Following above setup, we implemented our MoFlow by Pytorch-1.3.1 and trained it by Adam optimizer

(Kingma and Ba, 2014) with learning rate , batch size , and epochs for both datasets on GeForce RTX 2080 Ti GPU and 16 CPU cores. Our MoFlow finished -epoch training within hours ( minutes/epoch) for ZINC250K and hours ( minutes/epoch) for QM9. Thanks to efficient one-pass inference/embedding, our MoFlow takes negligible minutes to learn an additional regression layer trained in epochs for optimization experiments on ZINC250K. In comparison, as for the ZINC250K dataset, GraphNVP (Madhawa et al., 2019) costs hours ( minutes/epoch) by our Pytorch implementation for training on ZINC250K with the same configurations, and the estimated total running time of GraphAF (Shi et al., 2020) is hours ( minutes/epoch) which consists of the reported hours for a generation model trained by epochs and estimated hours for another optimization model trained by epochs. The reported running time of JT-VAE (Jin et al., 2018) is roughly hours in (You et al., 2018).

5.1. Generation and Reconstruction

% Validity % Validity w/o check % Uniqueness % Novelty % N.U.V. % Reconstruct
GraphNVP (Madhawa et al., 2019) n/a
GRF (Honda et al., 2019) n/a
GraphAF (Shi et al., 2020)
MoFlow
Table 2. Generation and reconstruction performance on QM9 dataset.
% Validity % Validity w/o check % Uniqueness % Novelty % N.U.V. % Reconstruct
JT-VAE (Jin et al., 2018) n/a
GCPN (You et al., 2018) n/a
MRNN (Popova et al., 2019) n/a
GraphNVP (Madhawa et al., 2019) n/a
GRF (Honda et al., 2019) n/a
GraphAF (Shi et al., 2020)
MoFlow
Table 3. Generation and reconstruction performance on ZINC250K dataset.

Setup. In this task, we evaluate our MoFlow ’s capability of generating novel, unique and valid molecular graphs, and if our MoFlow can reconstruct input molecular graphs from their latent representations. We adopted the widely-used metrics, including: Validity which is the percentage of chemically valid molecules in all the generated molecules, Uniqueness which is the percentage of unique valid molecules in all the generated molecules, Novelty which is the percentage of generated valid molecules which are not in the training dataset, and Reconstruction rate which is the percentage of molecules in the input dataset which can be reconstructed from their latent representations. Besides, because the novelty score also accounts for the potentially duplicated novel molecules, we propose a new metric N.U.V. which is the percentage of Novel, Unique, and Valid molecules in all the generated molecules. We also compare the validity of ablation models if not using validity check or validity correction, denoted as Validity w/o check in (Shi et al., 2020).

The prior distribution of latent space follows a spherical multivariate Gaussian distribution where is the learned standard deviation and the hyper-parameter is the temperature for the reduced-temperature generative model (Parmar et al., 2018; Kingma and Dhariwal, 2018; Madhawa et al., 2019). We use in the generation for both QM9 and ZINC250K datasets, and for the ablation study without validity correction. To be comparable with the state-of-the-art baseline GraphAF(Shi et al., 2020), we generate molecules, i.e., sampling latent vectors from the prior and then decode them by the reverse transformation of our MoFlow. We report the the mean and standard deviation of results over 5 runs. As for the reconstruction, we encode all the molecules from the training dataset into latent vectors by the encoding transformation of our MoFlow and then reconstruct input molecules from these latent vectors by the reverse transformation of MoFlow.

Results. Table 2 and Table 3 show that our MoFlow outperfoms the state-of-the-art models on all the six metrics for both QM9 and ZINC250k datasets. Thanks to the invertible characteristic of the flow-based models, our MoFlow builds an one-to-one mapping from the input molecule to its corresponding latent vector , enabling reconstruction rate as shown in Table 2 and Table 3. In contrast, the VAE-based method JT-VAE and the autoregressive-based method GCPN and MRNN can’t reconstruct all the input molecules. Compared with the one-shot flow-based model GraphNVP and GRF, by incorporating validity correction mechanism, our MoFlow achieves validity, leading to significant improvements of the validity score and N.U.V. score for both datasets. Specifically, the N.U.V. score of MoFlow are and times as large as the N.U.V. scores of GraphNVP and GRF respectively in Table 2. Even without validity correction, our MoFlow still outperforms the validity scores of GraphNVP and GRF by a large margin. Compared with the autoregressive flow-based model GraphAF, we find our MoFlow outperforms GraphAF by additional and with respect to N.U.V scores for QM9 and ZINC respectively, indicating that our MoFlow generates more novel, unique and valid molecules. Indeed, MoFlow achieves better uniqueness score and novelty score compared with GraphAF for both datasets. What’s more, our MoFlow without validity correction still outperforms GraphAF without the validity check by a large margin w.r.t. the validity score (validity w/o check in Table 2 and Table 3) for both datasets, implying the superiority of capturing the molecular structures in a holistic way by our MoFlow over autoregressive ones in a sequential way.

In conclusion, our MoFlow not only memorizes and reconstructs all the training molecular graphs, but also generates more novel, unique and valid molecular graphs than existing models, indicating that our MoFlow learns a strict superset of the training data and explores the unknown chemical space better.

5.2. Visualizing Continuous Latent Space

Figure 4.

Visualization of learned latent space by our MoFlow. Top: Visualization of the grid neighbors of a seed molecule in the center, which serves as the baseline for measuring similarity. Bottom: Interpolation between two seed molecular graphs and the left one is the baseline molecule for measuring similarity. Seed molecules are highlighted in red boxs and they are randomly selected from ZINC250K.

Setup. We examine the learned latent space of our MoFlow , denoted as , by visualizing the decoded molecular graphs from a neighborhood of a latent vector in the latent space. Similar to (Kusner et al., 2017; Jin et al., 2018), we encode a seed molecule into and then grid search two random orthogonal directions with unit vector and based on , then we get new latent vector by where and are the searching steps. Different from VAE-based models, our MoFlow gets decoded molecules efficiently by the one-pass inverse transformation . In contrast, the VAE-based models such as JT-VAE need to decode each latent vectors times and autoregressive-based models like GCPN, MRNN and GraphAF need to generate a molecule sequentially. Further more, we measure the chemical similarity between each neighboring molecule and the centering molecule. We choose Tanimoto index (Bajusz et al., 2015) as the chemical similarity metrics and indicate their similarity values by a heatmap. We further visualize a linear interpolation between two molecules to show their changing trajectory similar to the interpolation case between images (Kingma and Dhariwal, 2018).

Results. We show the visualization of latent space in Figure 4. We find the latent space is very smooth and the interpolations between two latent points only change a molecule graph a little bit. Quantitatively, we find the chemical similarity between molecules majorly correspond to their Euclidean distance between their latent vectors, implying that our MoFlow embeds similar molecular graph structures into similar latent embeddings. Searching in such a continuous latent space learnt by our MoFlow is the basis for molecular property optimization and constraint optimization as discussed in the following sections.

5.3. Property Optimization

Setup. The property optimization task aims at generating novel molecules with the best Quantitative Estimate of Druglikeness (QED) scores (Bickerton et al., 2012) which measures the drug-likeness of generated molecules. Following the previous works (You et al., 2018; Popova et al., 2019), we report the best property scores of novel molecules discovered by each method.

We use the pre-trained MoFlow, denoted as , in the generation experiment to encode a molecule and get the molecular embedding , and further train a multilayer perceptron to regress the embedding of the molecules to their property values . We then search the best molecules by the gradient ascend method, namely where the is the length of the search step. We conduct above gradient ascend method by steps. We decode the new embedding in the latent space to the discovered molecule by reverse mapping . The molecule is novel if doesn’t exist in the training dataset.

Method 1st 2nd 3rd 4th
ZINC (Dataset) 0.948 0.948 0.948 0.948
JT-VAE 0.925 0.911 0.910 -
GCPN 0.948 0.947 0.946 -
MRNN 0.948 0.948 0.947 -
GraphAF 0.948 0.948 0.947 0.946
MoFlow 0.948 0.948 0.948 0.948
Table 4. Discovered novel molecules with the best QED scores. Our MoFlow finds more molecules with the best QED scores. More results in Figure 5.
Figure 5. Illustration of discovered novel molecules with the best druglikeness QED scores.

Results. We report the discovered novel molecules sorted by their QED scores in Table 4. We find previous methods can only find very few molecules with the best QED score (). In contrast, our MoFlow finds much more novel molecules which have the best QED values than all the baselines. We show more molecular structures with top QED values in Figure 5.

5.4. Constrained Property Optimization

Setup. The constrained property optimization aims at finding a new molecule with the largest similarity score and the largest improvement of a targeted property value given a molecule . Following the similar experimental setup of (Jin et al., 2018; You et al., 2018), we choose Tanimoto similarity of Morgan fingerprint (Rogers and Hahn, 2010) as the similarity metrics, the penalized logP (plogp) as the target property, and from the molecules with the lowest plogp scores in the training dataset of ZINC250K. We use similar gradient ascend method as discussed in the previous subsetion to search for optimized molecules. An optimization succeeds if we find a novel molecule which is different from and and within steps where is the smallest similarity threshold to screen the optimized molecules.

Results. Results are summarized in Table 5. We find that our MoFlow finds the most similar new molecules at the same time achieves very good plogp improvement. Compared with the state-of-the-art VAE model JT-VAE, our MoFlow achieves much higher similarity score and property improvement, implying that our model is good at interpolation and learning continuous molecular embedding. Compared with the state-of-the-art reinforcement learning based method GCPN and GraphAF which is good at generating molecules step-by-step with targeted property rewards, our model MoFlow achieves the best similarity scores and the second best property improvements. We illustrate one optimization example in Figure 6 with very similar structures but a large improvement w.r.t the penalized logP.

JT-VAE GCPN
Improvement Similarity Success Improvement Similarity Success
0.0
0.2
0.4
0.6
GraphAF MoFlow
Improvement Similarity Success Improvement Similarity Success
0.0
0.2
0.4
0.6
Table 5. Constrained optimization on Penalized-logP
Figure 6. An illustration of the constrained optimization of a molecule leading to an improvement of w.r.t the penalized logP and with Tanimoto similarity . The modified part is highlighted.

6. Conclusion

In this paper, we propose a novel deep graph generative model MoFlow for molecular graph generation. Our MoFlow is one of the first flow-based models which not only generates molecular graphs at one-shot by invertible mappings but also has a validity guarantee. Our MoFlow consists of a variant of Glow model for bonds, a novel graph conditional flow for atoms given bonds, and then combining them with post-hoc validity corrections. Our MoFlow achieves state-of-the-art performance on molecular generation, reconstruction and optimization. For future work, we try to combine the advantages of both sequential generative models and one-shot generative models to generate chemically feasible molecular graphs. Codes and datasets are open-sourced at https://github.com/calvin-zcx/moflow.

Acknowledgement

This work is supported by NSF IIS 1716432, 1750326, ONR N00014-18-1-2585, Amazon Web Service (AWS) Machine Learning for Research Award and Google Faculty Research Award.

References

  • J. Avorn (2015) The $2.6 billion pill—methodologic and policy considerations. New England Journal of Medicine 372 (20), pp. 1877–1879. Cited by: §1.
  • D. Bajusz, A. Rácz, and K. Héberger (2015) Why is tanimoto index an appropriate choice for fingerprint-based similarity calculations?. Journal of cheminformatics 7 (1), pp. 20. Cited by: §5.2.
  • G. R. Bickerton, G. V. Paolini, J. Besnard, S. Muresan, and A. L. Hopkins (2012) Quantifying the chemical beauty of drugs. Nature chemistry 4 (2), pp. 90. Cited by: §5.3.
  • X. Bresson and T. Laurent (2019) A two-step graph convolutional decoder for molecule generation. arXiv preprint arXiv:1906.03412. Cited by: §1, §2.
  • H. Dai, Y. Tian, B. Dai, S. Skiena, and L. Song (2018) Syntax-directed variational autoencoder for structured data. arXiv preprint arXiv:1802.08786. Cited by: §1, §2.
  • N. De Cao and T. Kipf (2018) MolGAN: an implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973. Cited by: §1, §2.
  • L. Dinh, D. Krueger, and Y. Bengio (2014) Nice: non-linear independent components estimation. arXiv preprint arXiv:1410.8516. Cited by: §1, §2, §2, §3.
  • L. Dinh, J. Sohl-Dickstein, and S. Bengio (2016) Density estimation using real nvp. arXiv preprint arXiv:1605.08803. Cited by: §2, §3, §3, §4.2.2, §5.
  • R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik (2018) Automatic chemical design using a data-driven continuous representation of molecules. ACS central science 4 (2), pp. 268–276. Cited by: §1.
  • S. Honda, H. Akita, K. Ishiguro, T. Nakanishi, and K. Oono (2019) Graph residual flow for molecular graph generation. arXiv preprint arXiv:1909.13521. Cited by: §1, §1, §2, Table 2, Table 3, §5.
  • J. J. Irwin, T. Sterling, M. M. Mysinger, E. S. Bolstad, and R. G. Coleman (2012) ZINC: a free tool to discover chemistry for biology. Journal of chemical information and modeling 52 (7), pp. 1757–1768. Cited by: §5.
  • W. Jin, R. Barzilay, and T. Jaakkola (2018) Junction tree variational autoencoder for molecular graph generation. arXiv preprint arXiv:1802.04364. Cited by: §1, §1, §1, §2, §5.2, §5.4, Table 3, §5, §5, §5.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.
  • D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224. Cited by: §1, §1, §2, §2, §2, §3, §3, §4.2.3, §4.3, §5.1, §5.2.
  • I. Kobyzev, S. Prince, and M. A. Brubaker (2019) Normalizing flows: introduction and ideas. arXiv preprint arXiv:1908.09257. Cited by: §2.
  • M. J. Kusner, B. Paige, and J. M. Hernández-Lobato (2017) Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1945–1954. Cited by: §1, §2, §5.2.
  • G. Landrum et al. (2006) RDKit: open-source cheminformatics. Cited by: §5.
  • J. Liu, A. Kumar, J. Ba, J. Kiros, and K. Swersky (2019) Graph normalizing flows. In Advances in Neural Information Processing Systems, pp. 13556–13566. Cited by: §2.
  • Q. Liu, M. Allamanis, M. Brockschmidt, and A. Gaunt (2018) Constrained graph variational autoencoders for molecule design. In Advances in Neural Information Processing Systems, pp. 7795–7804. Cited by: §1, §2.
  • T. Ma, J. Chen, and C. Xiao (2018) Constrained generation of semantically valid graphs via regularizing variational autoencoders. In Advances in Neural Information Processing Systems, pp. 7113–7124. Cited by: §1, §2.
  • K. Madhawa, K. Ishiguro, K. Nakago, and M. Abe (2019) GraphNVP: an invertible flow model for generating molecular graphs. arXiv preprint arXiv:1905.11600. Cited by: §1, §1, §2, §2, §5.1, Table 2, Table 3, §5, §5, §5, §5.
  • A. Mullard (2017) The drug-maker’s guide to the galaxy. Nature News 549 (7673), pp. 445. Cited by: §1.
  • G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan (2019) Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762. Cited by: §2.
  • N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, A. Ku, and D. Tran (2018) Image transformer. arXiv preprint arXiv:1802.05751. Cited by: §5.1.
  • S. M. Paul, D. S. Mytelka, C. T. Dunwiddie, C. C. Persinger, B. H. Munos, S. R. Lindborg, and A. L. Schacht (2010) How to improve r&d productivity: the pharmaceutical industry’s grand challenge. Nature reviews Drug discovery 9 (3), pp. 203. Cited by: §1.
  • M. Popova, M. Shvets, J. Oliva, and O. Isayev (2019) MolecularRNN: generating realistic molecular graphs with optimized properties. arXiv preprint arXiv:1905.13372. Cited by: §1, §1, §2, §4.4, §4.4, §5.3, Table 3, §5.
  • R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. Von Lilienfeld (2014) Quantum chemistry structures and properties of 134 kilo molecules. Scientific data 1, pp. 140022. Cited by: §5.
  • D. Rogers and M. Hahn (2010) Extended-connectivity fingerprints. Journal of chemical information and modeling 50 (5), pp. 742–754. Cited by: §5.4.
  • M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607. Cited by: §4.2.2.
  • C. Shi, M. Xu, Zhu,Zhaocheng, Zhang,Weinan, M. Zhang, and Jian. Tang (2020) GraphAF: a flow-based autoregressive model for molecular graph generation. ICLR 2020, Addis Ababa, Ethiopia, Apr.26-Apr. 30, 2020. Cited by: §1, §1, §2, §5.1, §5.1, Table 2, Table 3, §5, §5, §5.
  • M. Simonovsky and N. Komodakis (2018) Graphvae: towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, pp. 412–422. Cited by: §1, §2.
  • M. Sun, S. Zhao, C. Gilvary, O. Elemento, J. Zhou, and F. Wang (2019) Graph convolutional networks for computational drug development and discovery. Briefings in bioinformatics. Cited by: §4.2.2.
  • D. Weininger, A. Weininger, and J. L. Weininger (1989) SMILES. 2. algorithm for generation of unique smiles notation. Journal of chemical information and computer sciences 29 (2), pp. 97–101. Cited by: §1.
  • J. You, B. Liu, Z. Ying, V. Pande, and J. Leskovec (2018) Graph convolutional policy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems, pp. 6410–6421. Cited by: §1, §1, §2, §4.4, §4.4, §5.3, §5.4, Table 3, §5, §5.
  • A. Zhavoronkov, Y. A. Ivanenkov, A. Aliper, M. S. Veselov, V. A. Aladinskiy, A. V. Aladinskaya, V. A. Terentiev, D. A. Polykovskiy, M. D. Kuznetsov, et al. (2019) Deep learning enables rapid identification of potent ddr1 kinase inhibitors. Nature biotechnology 37 (9), pp. 1038–1040. Cited by: §1.