1 Introduction
The goal of drug discovery is to design molecules with desirable chemical properties. The task is challenging since the chemical space is vast and often difficult to navigate. One of the prevailing approaches, known as matched molecular pair analysis (MMPA) (Griffen et al., 2011; Dossetter et al., 2013), learns rules for generating “molecular paraphrases” that are likely to improve target chemical properties. The setup is analogous to machine translation: MMPA takes as input molecular pairs $(X, Y)$, where $Y$ is a paraphrase of $X$ with better chemical properties. However, current MMPA methods distill the matched pairs into graph transformation rules rather than treating the task as a general translation problem over graphs based on parallel data.
In this paper, we formulate molecular optimization as graph-to-graph translation. Given a corpus of molecular pairs, our goal is to learn to translate input molecular graphs into better graphs. The proposed translation task involves many challenges. While several methods are available to encode graphs (Duvenaud et al., 2015; Li et al., 2015; Lei et al., 2017), generating graphs as output is more challenging without resorting to a domain-specific graph linearization. In addition, the target molecular paraphrases are diverse since multiple strategies can be applied to improve a molecule. Therefore, our goal is to learn multimodal output distributions over graphs.
To this end, we propose the junction tree encoder-decoder, a refined graph-to-graph neural architecture that decodes molecular graphs with neural attention. To capture diverse outputs, we introduce stochastic latent codes into the decoding process and guide these codes to capture meaningful molecular variations. The basic learning problem can be cast as a variational autoencoder. We constrain the posterior inference over the latent codes to depend only on the target molecule $Y$ rather than on both the source molecule $X$ and the target $Y$. The motivation is that the additional hints required for translation should depend only on the target type. Further, to avoid invalid translations, we propose a novel adversarial training method to align the distribution of graphs generated from the model using randomly selected latent codes with the observed distribution of valid targets. Specifically, we perform adversarial regularization on the level of the hidden states created as part of the graph generation.
We evaluate our model on three molecular optimization tasks, with target properties ranging from drug likeness to biological activity. As baselines, we utilize state-of-the-art graph generation methods (Jin et al., 2018; You et al., 2018a) and MMPA (Dalke et al., 2018). We demonstrate that our model excels in discovering molecules with desired properties, yielding 8.5% to 55.7% relative gain over the best baseline across different tasks. Meanwhile, our model can translate a given molecule into a diverse set of compounds, demonstrating the diversity of learned output distributions.
2 Related Work
Molecule Generation/Optimization Prior work on molecular optimization approached the graph translation task through generative modeling (Gómez-Bombarelli et al., 2016; Segler et al., 2017; Kusner et al., 2017; Dai et al., 2018; Jin et al., 2018; Samanta et al., 2018; Li et al., 2018a) and reinforcement learning (Guimaraes et al., 2017; Olivecrona et al., 2017; Popova et al., 2018; You et al., 2018a). Earlier approaches represented molecules as SMILES strings (Weininger, 1988), while more recent methods represented them as graphs. Most of these methods coupled a molecule generator with a property predictor and solved the optimization problem through Bayesian optimization or reinforcement learning. In contrast, our model is trained to translate a molecular graph into a better graph through supervised learning.
Our approach is closely related to matched molecular pair analysis (MMPA) (Griffen et al., 2011; Dossetter et al., 2013) in de novo drug design, where the matched pairs are hard-coded into graph transformation rules. MMPA’s main drawback is that a large number of rules (e.g. millions) have to be realized to cover all the complex transformation patterns. In contrast, our approach uses neural networks to learn such transformations and does not require the rules to be explicitly realized.
Graph Neural Networks Our work is related to graph encoders and decoders. Previous work on graph encoders includes convolutional (Scarselli et al., 2009; Bruna et al., 2013; Henaff et al., 2015; Duvenaud et al., 2015; Niepert et al., 2016; Defferrard et al., 2016; Kondor et al., 2018) and recurrent architectures (Li et al., 2015; Dai et al., 2016; Lei et al., 2017). Graph encoders have been applied to social network analysis (Kipf & Welling, 2016; Hamilton et al., 2017) and chemistry (Kearnes et al., 2016; Gilmer et al., 2017; Schütt et al., 2017; Jin et al., 2017). Recently proposed graph decoders (Simonovsky & Komodakis, 2018; Li et al., 2018b; Jin et al., 2018; You et al., 2018b; Liu et al., 2018) focus on learning generative models of graphs. While our model builds on Jin et al. (2018) to generate graphs, we contribute new techniques to learn multimodal graph-to-graph mappings.
Image/Text Style Translation Our work is closely related to image-to-image translation (Isola et al., 2017), which was later extended by Zhu et al. (2017) to learn multimodal mappings. Our adversarial training technique is inspired by recent text style transfer methods (Shen et al., 2017; Zhao et al., 2018) that adversarially regularize the continuous representation of discrete structures to enable end-to-end training. Our technical contribution is a novel adversarial regularization over graphs that constrains their scaffold structures in a continuous manner.
3 Junction Tree Encoder-Decoder
Our translation model extends the junction tree variational autoencoder (Jin et al., 2018) to an encoder-decoder architecture for learning graph-to-graph mappings. Following their work, we interpret each molecule as having been built from subgraphs (clusters of atoms) chosen from a vocabulary of valid chemical substructures. The clusters form a junction tree representing the scaffold structure of molecules (Figure 1), which is an important factor in drug design. Molecules are decoded hierarchically by first generating the junction trees and then combining the nodes of the tree into a molecule. This coarse-to-fine approach allows us to easily enforce the chemical validity of generated graphs, and provides an enriched representation that encodes molecules at different scales.
In terms of model architecture, the encoder is a graph message passing network that embeds both nodes in the tree and graph into continuous vectors. The decoder consists of a tree-structured decoder for predicting junction trees, and a graph decoder that learns to combine clusters in the predicted junction tree into a molecule. Our key departures from Jin et al. (2018) include a unified encoder architecture for trees and graphs, along with an attention mechanism in the tree decoding process.
3.1 Tree and Graph Encoder
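Viewing trees as graphs, we encode both junction trees and graphs using graph message passing networks. Specifically, a graph is defined as $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the vertex set and $\mathcal{E}$ the edge set. Each node $v$ has a feature vector $\mathbf{f}_v$. For atoms, it includes the atom type, valence, and other atomic properties. For clusters in the junction tree, $\mathbf{f}_v$ is a one-hot vector indicating its cluster label. Similarly, each edge $(u, v) \in \mathcal{E}$ has a feature vector $\mathbf{f}_{uv}$. Let $N(v)$ be the set of neighbor nodes of $v$. There are two hidden vectors $\boldsymbol{\nu}_{uv}$ and $\boldsymbol{\nu}_{vu}$ for each edge $(u, v)$, representing the message from $u$ to $v$ and vice versa. These messages are updated iteratively via a neural network $g_1(\cdot)$:
$$\boldsymbol{\nu}_{uv}^{(t)} = g_1\Big(\mathbf{f}_u,\ \mathbf{f}_{uv},\ \sum_{w \in N(u) \setminus v} \boldsymbol{\nu}_{wu}^{(t-1)}\Big) \tag{1}$$
where $\boldsymbol{\nu}_{uv}^{(t)}$ is the message computed in the $t$-th iteration, initialized with $\boldsymbol{\nu}_{uv}^{(0)} = \mathbf{0}$. In each iteration, all messages are updated asynchronously, as there is no natural order among the nodes. This is different from the tree encoding algorithm in Jin et al. (2018), where a root node was specified and an artificial order was imposed on the message updates. Removing this artifact is necessary because the learned embeddings would otherwise be biased by the artificial order.
After $T$ steps of iteration, we aggregate messages via another neural network $g_2(\cdot)$ to derive the latent vector of each vertex, which captures its local graphical structure:
$$\mathbf{x}_u = g_2\Big(\mathbf{f}_u,\ \sum_{v \in N(u)} \boldsymbol{\nu}_{vu}^{(T)}\Big) \tag{2}$$
Applying the above message passing network to junction tree $\mathcal{T}$ and to graph $G$ yields two sets of vectors $\{\mathbf{x}_1^{\mathcal{T}}, \ldots, \mathbf{x}_n^{\mathcal{T}}\}$ and $\{\mathbf{x}_1^{G}, \ldots, \mathbf{x}_n^{G}\}$, which we call source tree vectors and graph vectors. $\mathbf{x}_i^{\mathcal{T}}$ is the embedding of tree node $i$, and $\mathbf{x}_j^{G}$ is the embedding of graph node $j$.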
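As an illustration of the encoder described in this subsection, the sketch below runs a fixed number of message passing rounds over directed edges and then aggregates inbound messages into node embeddings. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the update and aggregation networks are taken to be single ReLU layers, and the function name `encode_graph` and the weight names `W_node`, `W_edge`, `W_msg`, `U_node`, `U_msg` are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def encode_graph(node_feats, edge_feats, params, num_iters=3):
    """Return one embedding per node after num_iters rounds of message passing.

    node_feats: dict node -> feature vector
    edge_feats: dict (u, v) -> feature vector, with both directions present
    params:     dict of weight matrices (hypothetical names, see lead-in)
    """
    directed = list(edge_feats)
    hidden = params["W_msg"].shape[0]
    msgs = {e: np.zeros(hidden) for e in directed}  # messages start at zero

    for _ in range(num_iters):
        updated = {}
        for (u, v) in directed:
            # Sum messages flowing into u from every neighbor except v.
            inbound = sum((msgs[(w, x)] for (w, x) in directed if x == u and w != v),
                          np.zeros(hidden))
            updated[(u, v)] = relu(params["W_node"] @ node_feats[u]
                                   + params["W_edge"] @ edge_feats[(u, v)]
                                   + params["W_msg"] @ inbound)
        msgs = updated  # every message is replaced together; no node ordering is imposed

    embeddings = {}
    for u in node_feats:
        # Aggregate all messages arriving at u into its final embedding.
        inbound = sum((msgs[(w, x)] for (w, x) in directed if x == u), np.zeros(hidden))
        embeddings[u] = relu(params["U_node"] @ node_feats[u] + params["U_msg"] @ inbound)
    return embeddings
```

Because no root or ordering is used, two nodes that play symmetric roles in the graph receive identical embeddings, which is the property the unified encoder relies on.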
3.2 Junction Tree Decoder
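We generate a junction tree $\mathcal{T} = (\mathcal{V}, \mathcal{E})$ with a tree recurrent neural network with an attention mechanism. The tree is constructed in a top-down fashion by expanding the tree one node at a time. Formally, let $\tilde{\mathcal{E}} = \{(i_1, j_1), \ldots, (i_m, j_m)\}$ be the edges traversed in a depth-first traversal over tree $\mathcal{T}$, where $m = 2|\mathcal{E}|$ as each edge is traversed in both directions. Let $\tilde{\mathcal{E}}_t$ be the first $t$ edges in $\tilde{\mathcal{E}}$. At the $t$-th decoding step, the model visits node $i_t$ and receives message vectors $\mathbf{h}_{ij}$ from its neighbors. The message $\mathbf{h}_{i_t, j_t}$ is updated through a tree Gated Recurrent Unit (Jin et al., 2018):
$$\mathbf{h}_{i_t, j_t} = \mathrm{GRU}\big(\mathbf{f}_{i_t},\ \{\mathbf{h}_{k, i_t}\}_{(k, i_t) \in \tilde{\mathcal{E}}_t,\ k \neq j_t}\big) \tag{3}$$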
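The depth-first edge sequence that drives the decoding steps can be made concrete with a short sketch. The function name `dfs_edge_sequence` and the child-list tree representation are assumptions made for this illustration only.

```python
def dfs_edge_sequence(children, root):
    """List the directed edges visited by a depth-first traversal.

    children: dict mapping a node to the list of its child nodes.
    Each tree edge appears twice: once when expanding to the child
    and once when backtracking to the parent.
    """
    edges = []

    def visit(node):
        for child in children.get(node, []):
            edges.append((node, child))   # expand a new node
            visit(child)
            edges.append((child, node))   # backtrack to the parent

    visit(root)
    return edges
```

For a tree with four nodes and three edges, the sequence has six entries, matching the statement that each edge is traversed in both directions.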
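Topological Prediction When the model visits node $i_t$, it first computes a predictive hidden state $\mathbf{h}_t$ by combining node features $\mathbf{f}_{i_t}$ and inward messages $\{\mathbf{h}_{k, i_t}\}$ via a one-hidden-layer network. The model then makes a binary prediction on whether to expand a new node or backtrack to the parent of $i_t$. This probability is computed by aggregating the source encodings $\{\mathbf{x}_*^{\mathcal{T}}\}$ and $\{\mathbf{x}_*^{G}\}$ through an attention layer, followed by a feed-forward network ($\tau(\cdot)$ stands for ReLU and $\sigma(\cdot)$ for sigmoid):
$$\mathbf{h}_t = \tau\Big(\mathbf{W}_1^d \mathbf{f}_{i_t} + \mathbf{W}_2^d \sum_{(k, i_t) \in \tilde{\mathcal{E}}_t} \mathbf{h}_{k, i_t}\Big) \tag{4}$$
$$\mathbf{c}_t^d = \mathrm{attention}\big(\mathbf{h}_t, \{\mathbf{x}_*^{\mathcal{T}}\}, \{\mathbf{x}_*^{G}\}; \mathbf{U}_{att}^d\big) \tag{5}$$
$$p_t = \sigma\big(\mathbf{u}^d \cdot \tau(\mathbf{W}_3^d \mathbf{h}_t + \mathbf{W}_4^d \mathbf{c}_t^d)\big) \tag{6}$$
Here we use $\mathrm{attention}(\cdot\,; \mathbf{U}_{att}^d)$ to mean the attention mechanism with parameters $\mathbf{U}_{att}^d$. It computes two sets of attention scores $\{\alpha_{*,t}\}$ and $\{\hat{\alpha}_{*,t}\}$ (normalized by softmax) over source tree and graph vectors respectively. The output is a concatenation of tree and graph attention vectors:
$$\mathbf{c}_t^d = \Big[\sum_i \alpha_{i,t}\, \mathbf{x}_i^{\mathcal{T}},\ \sum_i \hat{\alpha}_{i,t}\, \mathbf{x}_i^{G}\Big] \tag{7}$$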
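The two-headed attention described above (one softmax over source tree vectors, one over source graph vectors, outputs concatenated) can be sketched as follows. The bilinear scoring form and the names `attend`, `A_tree`, and `A_graph` are assumptions for illustration; the text only states that the scores are softmax-normalized.

```python
import numpy as np

def softmax(scores):
    shifted = np.exp(scores - scores.max())   # subtract max for numerical stability
    return shifted / shifted.sum()

def attend(query, tree_vecs, graph_vecs, A_tree, A_graph):
    """Concatenate attention summaries of source tree and graph vectors.

    query:      decoder hidden state, shape (d_h,)
    tree_vecs:  source tree vectors, shape (n_tree, d)
    graph_vecs: source graph vectors, shape (n_graph, d)
    A_tree, A_graph: assumed bilinear score matrices, shape (d, d_h)
    """
    tree_weights = softmax(tree_vecs @ (A_tree @ query))
    graph_weights = softmax(graph_vecs @ (A_graph @ query))
    tree_context = tree_weights @ tree_vecs      # weighted sum of tree vectors
    graph_context = graph_weights @ graph_vecs   # weighted sum of graph vectors
    return np.concatenate([tree_context, graph_context])
```

With zero score matrices the weights are uniform, so each half of the output reduces to the mean of the corresponding source vectors.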
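Label Prediction If node $j_t$ is a new child to be generated from parent $i_t$, we predict its label by
$$\mathbf{c}_t^l = \mathrm{attention}\big(\mathbf{h}_{i_t, j_t}, \{\mathbf{x}_*^{\mathcal{T}}\}, \{\mathbf{x}_*^{G}\}; \mathbf{U}_{att}^l\big) \tag{8}$$
$$\mathbf{q}_t = \mathrm{softmax}\big(\mathbf{U}^l\, \tau(\mathbf{W}_1^l \mathbf{h}_{i_t, j_t} + \mathbf{W}_2^l \mathbf{c}_t^l)\big) \tag{9}$$
where $\mathbf{q}_t$ is a distribution over the label vocabulary and $\mathbf{U}_{att}^l$ is another set of attention parameters.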
3.3 Graph Decoder
The second step in the decoding process is to construct a molecular graph $G$ from a predicted junction tree $\hat{\mathcal{T}}$. This step is not deterministic since multiple molecules could correspond to the same junction tree. For instance, the junction tree in Figure 2 can be assembled into three different molecules. The underlying degree of freedom pertains to how neighboring clusters are attached to each other. Let $\mathcal{G}_i$ be the set of possible candidate attachments at tree node $i$. Each graph $G_i \in \mathcal{G}_i$ is a particular realization of how cluster $C_i$ is attached to its neighboring clusters $\{C_j, j \in N_{\hat{\mathcal{T}}}(i)\}$. The goal of the graph decoder is to predict the correct attachment between the clusters.
To this end, we design the following scoring function $f(\cdot)$ for ranking candidate attachments within the set $\mathcal{G}_i$. We first apply a graph message passing network over graph $G_i$ to compute atom representations $\{\boldsymbol{\mu}_v^{G_i}\}$. Then we derive a vector representation of $G_i$ through sum-pooling: $\mathbf{m}_{G_i} = \sum_v \boldsymbol{\mu}_v^{G_i}$. Finally, we score candidate $G_i$ by computing dot products between $\mathbf{m}_{G_i}$ and the encoded source graph vectors: $f(G_i) = \sum_u \mathbf{m}_{G_i} \cdot \mathbf{x}_u^{G}$, where $u$ ranges over the nodes of the source graph.
The graph decoder is trained to maximize the log-likelihood of ground truth subgraphs at all tree nodes (Eq. (10)). During training, we apply teacher forcing by feeding the graph decoder with the ground truth junction tree as input. During testing, we assemble the graph one neighborhood at a time, following the order in which the junction tree was decoded.
$$\mathcal{L}_g(G) = \sum_i \Big[ f(G_i) - \log \sum_{G_i' \in \mathcal{G}_i} \exp\big(f(G_i')\big) \Big] \tag{10}$$
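The candidate scoring and the per-node log-likelihood term can be sketched in a few lines. This is an illustrative sketch under assumptions: atom representations of each candidate are given as arrays (in the model they come from a message passing network), the source graph vectors are summed before the dot product, and the function names are hypothetical.

```python
import numpy as np

def score_candidates(candidate_atom_vecs, source_graph_vecs):
    """Score each candidate attachment against the source graph encoding.

    candidate_atom_vecs: list of arrays, one (n_atoms, d) array per candidate
    source_graph_vecs:   array of shape (n_source_atoms, d)
    """
    source_summary = source_graph_vecs.sum(axis=0)
    # Sum-pool each candidate's atoms, then take a dot product with the source summary.
    return np.array([atoms.sum(axis=0) @ source_summary for atoms in candidate_atom_vecs])

def attachment_log_likelihood(scores, true_index):
    """Log-probability of the ground truth candidate under a softmax over scores."""
    shifted = scores - scores.max()   # stabilize the log-sum-exp
    return shifted[true_index] - np.log(np.exp(shifted).sum())
```

Summing `attachment_log_likelihood` over all tree nodes gives the quantity that the graph decoder maximizes during training.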
4 Multimodal GraphtoGraph Translation
Our goal is to learn a multimodal mapping between two molecule domains, such as molecules with low and high solubility, or molecules that are potent and impotent. During training, we are given a dataset of paired molecules $\{(X, Y)\} \subset \mathcal{X} \times \mathcal{Y}$ sampled from their joint distribution $P(X, Y)$, where $\mathcal{X}, \mathcal{Y}$ are the source and target domains. It is important to note that this joint distribution is a many-to-many mapping. For instance, there exist many ways to modify molecule $X$ to increase its solubility. Given a new molecule $X$, the model should be able to generate a diverse set of outputs.
To this end, we propose to augment the basic encoder-decoder model with low-dimensional latent vectors $\mathbf{z}$ to explicitly encode the multimodal aspect of the output distribution. The mapping to be learned now becomes $F: (X, \mathbf{z}) \rightarrow Y$, with latent code $\mathbf{z}$ drawn from a prior distribution $P(\mathbf{z})$, which is a standard Gaussian $\mathcal{N}(\mathbf{0}, \mathbf{I})$. There are two challenges in learning this mapping. First, as shown in the image domain (Zhu et al., 2017), the latent codes are often ignored by the model unless we explicitly enforce the latent codes to encode meaningful variations. Second, the model should be properly regularized so that it does not produce invalid translations. That is, the translated molecule $F(X, \mathbf{z})$ should always belong to the target domain $\mathcal{Y}$ given latent code $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. In this section, we propose two techniques to address these issues.
4.1 Variational Junction Tree EncoderDecoder (VJTNN)
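First, to encode meaningful variations, we derive latent code $\mathbf{z}$ from the embedding of ground truth molecule $Y$. The decoder is trained to reconstruct $Y$ when taking as input both its vector encoding $\mathbf{z}_Y$ and source molecule $X$. For efficient sampling, the latent code distribution is regularized to be close to the prior distribution, similar to a variational autoencoder. We also restrict $\mathbf{z}$ to be a low-dimensional vector to prevent the model from ignoring input $X$ and degenerating to an autoencoder.
Specifically, we first embed molecules $X$ and $Y$ into their tree and graph vectors $\{\mathbf{x}_*^{\mathcal{T}}\}, \{\mathbf{x}_*^{G}\}$ and $\{\mathbf{y}_*^{\mathcal{T}}\}, \{\mathbf{y}_*^{G}\}$, using the same encoder with shared parameters (Sec 3.1). Then we perform average pooling over the tree and graph vectors of $Y$ to obtain $\bar{\mathbf{y}}^{\mathcal{T}}$ and $\bar{\mathbf{y}}^{G}$. We sample $\mathbf{z}_{\mathcal{T}}$ and $\mathbf{z}_{G}$ from the posterior via the reparameterization trick (Kingma & Welling, 2013), where the mean and log variance are computed from $\bar{\mathbf{y}}^{\mathcal{T}}$ and $\bar{\mathbf{y}}^{G}$ with two separate affine layers. Finally, we combine the latent codes $\mathbf{z}_{\mathcal{T}}$ and $\mathbf{z}_{G}$ with source tree and graph vectors:
$$\tilde{\mathbf{x}}_i^{\mathcal{T}} = \tau\big(\mathbf{W}_1^e \mathbf{x}_i^{\mathcal{T}} + \mathbf{W}_2^e \mathbf{z}_{\mathcal{T}}\big), \qquad \tilde{\mathbf{x}}_i^{G} = \tau\big(\mathbf{W}_3^e \mathbf{x}_i^{G} + \mathbf{W}_4^e \mathbf{z}_{G}\big) \tag{11}$$
where $\tilde{\mathbf{x}}_*^{\mathcal{T}}$ and $\tilde{\mathbf{x}}_*^{G}$ are “perturbed” tree and graph vectors of molecule $X$. The perturbed inputs are then fed into the decoder to synthesize the target molecule $Y$. The training objective follows a conditional variational autoencoder, including a reconstruction loss and a KL regularization term:
$$\mathcal{L}_{\mathrm{VAE}}(X, Y) = -\mathbb{E}_{\mathbf{z} \sim Q}\big[\log P(Y \mid \mathbf{z}, X)\big] + \lambda_{\mathrm{KL}}\, \mathcal{D}_{\mathrm{KL}}\big[Q(\mathbf{z} \mid Y)\ \|\ P(\mathbf{z})\big] \tag{12}$$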
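The sampling step and the KL regularizer used in this objective follow the standard diagonal-Gaussian formulas, sketched below. The function names are hypothetical; the mean and log variance would come from the affine layers described above.

```python
import numpy as np

def sample_latent(mean, log_var, rng):
    """Reparameterization trick: z = mean + std * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mean, log_var):
    """Closed-form KL divergence between N(mean, diag(exp(log_var))) and N(0, I)."""
    return 0.5 * np.sum(np.exp(log_var) + mean ** 2 - 1.0 - log_var)
```

Writing the sample as a deterministic function of the mean, log variance, and external noise is what lets gradients of the reconstruction loss reach the encoder.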
4.2 Adversarial Scaffold Regularization
Second, to avoid invalid translations, we force molecules decoded from latent codes to follow the distribution of the target domain through adversarial training (Goodfellow et al., 2014). The adversarial game involves two components. The discriminator tries to distinguish real molecules in the target domain from fake molecules generated by the model. The generator (i.e. our encoderdecoder) tries to generate molecules indistinguishable from the molecules in the target domain.
The main challenge is how to integrate adversarial training into our decoder, as the discrete decisions in tree and graph decoding hinder gradient propagation. While it is possible to estimate the gradient using REINFORCE (Williams, 1992), training with such methods could be unstable due to the high variance of sampled gradients. Moreover, those methods require the model to assemble multiple junction tree samples into graphs, thus invoking candidate attachment enumeration multiple times and significantly slowing down the training process.
To overcome these issues, we instead apply adversarial regularization over continuous representations of decoded molecular structures, derived from the hidden states in the decoder (Shen et al., 2017; Zhao et al., 2018). That is, we replace the input of the discriminator with continuous embeddings of discrete outputs. For efficient training, we only enforce the adversarial regularization in the tree decoding step. (While it is desirable to apply the same idea to graph decoding, the subgraph enumeration requires the tree decoder to make “hard” decisions over node labels, which inevitably blocks gradient propagation. We leave this issue for future work.) As a result, the adversary only matches the scaffold structure between translated molecules and true samples. While it is an approximation, we found this approach still yields a significant improvement when combined with VJTNN, as chemical properties are largely determined by their scaffold structures.
It remains to be specified how the continuous representation is computed, and how we can decode junction trees such that the gradient can be backpropagated through the entire tree decoding process. The decoder first predicts the label distribution $\mathbf{q}_{root}$ of the root of tree $\hat{\mathcal{T}}$. Starting from the root, we incrementally expand the tree, guided by topological predictions, and compute the hidden messages between nodes in the partial tree. At timestep $t$, the model decides to either expand a new node $j_t$ or backtrack to the parent of node $i_t$. We denote this binary decision as $d(i_t, j_t)$, which is obtained by thresholding the topological score $p_t$ in Eq. (6). For the true samples $Y$, the hidden messages are computed by Eq. (3) with teacher forcing, namely replacing the label and topological predictions with their ground truth values. For the translated samples from source molecules $X$, we replace the one-hot encoding $\mathbf{f}_{i_t}$ with its softmax distribution $\mathbf{q}_{i_t}$ over cluster labels in Eq. (3) and (4). Moreover, we multiply message $\mathbf{h}_{i_t, j_t}$ with the binary gate $d(i_t, j_t)$, to account for the fact that the messages should depend on the topological layout of the tree:
$$\mathbf{h}_{i_t, j_t} = d(i_t, j_t) \cdot \mathrm{GRU}\big(\mathbf{q}_{i_t},\ \{\mathbf{h}_{k, i_t}\}_{(k, i_t) \in \tilde{\mathcal{E}}_t,\ k \neq j_t}\big) \tag{13}$$
As $d(i_t, j_t)$ is computed by a non-differentiable threshold function, we approximate its gradient with a straight-through estimator (Bengio et al., 2013; Courbariaux et al., 2016). Specifically, we replace the threshold function with a differentiable hard sigmoid function during backpropagation, while using the threshold function in the forward pass. This technique has been successfully applied to training neural networks with dynamic computational graphs (Chung et al., 2016).
Finally, after the tree $\hat{\mathcal{T}}$ is completely decoded, we derive its continuous representation $\mathbf{h}_{\hat{\mathcal{T}}}$ by concatenating the root label distribution $\mathbf{q}_{root}$ and the sum of its inward messages:
$$\mathbf{h}_{\hat{\mathcal{T}}} = \Big[\mathbf{q}_{root},\ \sum_{k \in N(root)} \mathbf{h}_{k, root}\Big] \tag{14}$$
We implement the discriminator $D(\cdot)$ as a multi-layer feedforward network, and train the adversary using Wasserstein GAN with gradient penalty (Arjovsky et al., 2017; Gulrajani et al., 2017). The whole algorithm is described in Algorithm 1.
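The straight-through treatment of the binary expand-or-backtrack gate can be illustrated with explicit forward and backward functions. This is a sketch under assumptions: the gate is applied to a pre-activation score thresholded at zero, and the hard sigmoid is taken to be `clip((s + 1) / 2, 0, 1)`; the paper does not fix these details in this section, and the function names are hypothetical.

```python
import numpy as np

def gate_forward(score):
    """Forward pass: hard threshold, producing a 0/1 gate."""
    return (score > 0).astype(float)

def gate_backward(score, grad_output):
    """Backward pass: pretend the gate was a hard sigmoid clip((s + 1) / 2, 0, 1).

    Its derivative is 0.5 inside (-1, 1) and 0 outside, so gradients only
    flow through scores that are close to the decision boundary.
    """
    slope = np.where(np.abs(score) < 1.0, 0.5, 0.0)
    return grad_output * slope
```

In an autodiff framework the same effect is usually obtained by adding the hard gate's residual as a constant that is excluded from differentiation, but the explicit pair above shows what each pass computes.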
5 Experiments
Data Our graph-to-graph translation models are evaluated on three molecular optimization tasks. Following standard practice in MMPA, we construct training sets by sampling molecular pairs $(X, Y)$ with significant property improvement and molecular similarity $\mathrm{sim}(X, Y) \geq \delta$. The similarity constraint is also enforced at evaluation time to exclude arbitrary mappings that completely ignore the input $X$. We measure the molecular similarity by computing Tanimoto similarity over Morgan fingerprints (Rogers & Hahn, 2010). Next we describe how these tasks are constructed.


Penalized logP We first evaluate our methods on the constrained optimization task proposed by Jin et al. (2018). The goal is to improve the penalized logP score of molecules under the similarity constraint. Following their setup, we experiment with two similarity constraints ( and ), and we extracted 120K and 53K translation pairs respectively from the ZINC dataset (Sterling & Irwin, 2015; Jin et al., 2018) for training. We use their validation and test sets for evaluation.

Drug likeness (QED) Our second task is to improve drug likeness of compounds. Specifically, the model needs to translate molecules with QED scores (Bickerton et al., 2012) within the range into the higher range . This task is challenging as the target range contains only the top 6.6% of molecules in the ZINC dataset. We extracted a training set of 88K molecule pairs with similarity constraint . The validation and test sets contain 1000 molecules each.

Dopamine Receptor (DRD2) The third task is to improve a molecule’s biological activity against a biological target named the dopamine type 2 receptor (DRD2). We use a trained model from Olivecrona et al. (2017) to assess the probability that a compound is active. We ask the model to translate molecules with predicted probability into active compounds with . The active compounds represent only 1.9% of the dataset. With similarity constraint , we derived a training set of 34K molecular pairs from ZINC and the dataset collected by Olivecrona et al. (2017). The validation and test sets contain 1000 molecules each.
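The pair-selection rule shared by the three tasks (a similarity floor plus a required property gain) can be sketched as below. Fingerprints are represented as sets of on-bit indices to keep the example dependency-free; in practice they would be Morgan fingerprints from a cheminformatics toolkit. The function names and the threshold parameters `min_similarity` and `min_gain` are hypothetical, and no particular threshold values are implied.

```python
def tanimoto_similarity(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0   # convention assumed here for two empty fingerprints
    return len(a & b) / len(a | b)

def keep_pair(fp_source, fp_target, prop_source, prop_target, min_similarity, min_gain):
    """Keep a (source, target) pair only if it is similar enough and improves the property."""
    similar = tanimoto_similarity(fp_source, fp_target) >= min_similarity
    improved = (prop_target - prop_source) >= min_gain
    return similar and improved
```

Applying the same similarity check at evaluation time rules out translations that ignore the source molecule.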
Baselines We compare our approaches (VJTNN and VJTNN+GAN) with the following baselines:


MMPA: We use the implementation of Dalke et al. (2018) to perform MMPA. Molecular transformation rules are extracted from ZINC and the dataset of Olivecrona et al. (2017) for the corresponding tasks. During testing, we translate a molecule multiple times using different matching transformation rules that have the highest average property improvements in the database (Appendix B).

Junction Tree VAE: The junction tree VAE (Jin et al., 2018) is a state-of-the-art generative model over molecules that applies gradient ascent over the learned latent space to generate molecules with improved properties. Our encoder-decoder architecture is closely related to their autoencoder model.

VSeq2Seq: This baseline is a variational sequence-to-sequence translation model that uses SMILES strings to represent molecules and has been successfully applied to other molecule generation tasks (Gómez-Bombarelli et al., 2016). Specifically, we augment the architecture of Bahdanau et al. (2014) with stochastic latent codes learned in a similar way to our VAE model.

GCPN: GCPN (You et al., 2018a) is a reinforcement learning based model that modifies a molecule by iteratively adding or deleting atoms and bonds. They also adopt adversarial training to enforce naturalness of the generated molecules.
Model Configuration For our models, we set the latent code dimension , and KL regularization weight . We found that with a larger latent code dimension (), the model degenerates to an autoencoder that reconstructs target molecule from the latent code and completely ignores the source molecule . In addition, our model performance is greatly degraded when . Due to limited space, we defer other hyperparameter settings to the appendix.
5.1 Results
Method  

Improvement  Diversity  Improvement  Diversity  
MMPA  0.366  0.432  
JTVAE      
GCPN      
VSeq2Seq  0.122  0.294  
VJTNN  0.310  0.497  
VJTNN+GAN  0.311  0.522 
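| Method | QED Success | QED Diversity | QED Novelty | DRD2 Success | DRD2 Diversity | DRD2 Novelty |
|---|---|---|---|---|---|---|
| MMPA | 20.8% | 0.225 | 100% | 35.6% | 0.210 | 99.8% |
| JT-VAE | 8.8% | | | 3.4% | | |
| GCPN | 9.4% | 0.216 | 100% | 4.4% | 0.152 | 100% |
| VSeq2Seq | 38.0% | 0.199 | 78.8% | 52.1% | 0.027 | 20.1% |
| VJTNN | 54.2% | 0.387 | 99.3% | 78.5% | 0.206 | 79.0% |
| VJTNN+GAN | 56.9% | 0.377 | 99.7% | 81.0% | 0.208 | 79.1% |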
We quantitatively analyze the translation accuracy, diversity, and novelty of different methods.
Translation Accuracy We measure the translation accuracy as follows. On the penalized logP task, we follow the same evaluation protocol as GCPN. That is, for each source molecule, we decode times with different latent codes , and report the molecule having the highest property improvement under the similarity constraint. We set so that it is comparable with baselines. On the QED and DRD2 datasets, we report the success rate of the learned translations. We define a translation as successful if one of the 50 translation candidates satisfies the similarity constraint and its property score falls in the target range (QED and DRD2 ).
Tables 1 and 2 show the performance of all models across the three datasets. Our best model significantly outperforms all baselines, yielding 8.5% to 55.7% relative gain over the best baseline across different tasks. We also found that VJTNN+GAN improves over VJTNN mainly on the QED and DRD2 tasks, whose target domains are explicitly constrained by property ranges. In these settings it is beneficial to constrain the model output distribution by adversarial training. We present more ablation studies in the appendix.
Diversity We define the diversity of a set of molecules as the average pairwise Tanimoto distance between them, where Tanimoto distance . For each source molecule, we translate it 50 times (each with different latent codes), and compute the diversity over the set of validly translated molecules.^{2}^{2}2To isolate the translation accuracy from the diversity measure, we exclude the failure cases from diversity calculation, namely excluding molecules that have no valid translation. Otherwise models with lower success rates will always have lower diversity. As we require valid translated molecules to be similar to a given compound, the diversity score is upperbounded by the maximum allowed distance (e.g. for the QED and DRD2 datasets, the maximum diversity score is likely to be around 0.6). As shown in Tables 1 and 2, both of our methods have significantly higher diversity scores than GCPN and VSeq2Seq across all datasets, and outperform MMPA on two out of four test cases. Figure 4 shows some examples of diverse translation over the QED and DRD2 tasks.
Novelty Lastly, we report how often our model discovers new molecules in the target domain that are unseen during training. This is an important metric as the ultimate goal of drug discovery is to design new molecules. Let $\mathcal{M}$ be the set of molecules generated by the model and $\mathcal{S}$ be the molecules given during training. We define novelty as $1 - |\mathcal{M} \cap \mathcal{S}| / |\mathcal{M}|$. On the QED and DRD2 datasets, our models discover new compounds much more frequently than VSeq2Seq, but less often than MMPA and GCPN. Nonetheless, the latter two methods have much lower translation success rates.
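The diversity and novelty metrics can be sketched as follows. Two assumptions are made for this illustration: Tanimoto distance is taken to be one minus Tanimoto similarity, and novelty is taken to be the fraction of generated molecules that do not appear in the training set. Fingerprints are sets of on-bit indices, molecules are identified by canonical strings, and all function names are hypothetical.

```python
from itertools import combinations

def tanimoto_distance(fp_a, fp_b):
    """One minus Tanimoto similarity (assumed definition of the distance)."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def diversity(fingerprints):
    """Average pairwise Tanimoto distance over a set of translated molecules."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0   # fewer than two molecules: no pair to compare
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)

def novelty(generated, training):
    """Fraction of distinct generated molecules that are absent from the training set."""
    generated = set(generated)
    if not generated:
        return 0.0
    return 1.0 - len(generated & set(training)) / len(generated)
```

Diversity is computed per source molecule over its valid translations, while novelty is computed over everything the model generates.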
6 Conclusion
In conclusion, we have evaluated various graph-to-graph translation models for molecular optimization. By combining the variational junction tree encoder-decoder with adversarial training, we can generate better and more diverse molecules than the baselines.
References
 Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
 Bickerton et al. (2012) G Richard Bickerton, Gaia V Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L Hopkins. Quantifying the chemical beauty of drugs. Nature chemistry, 4(2):90, 2012.
 Bruna et al. (2013) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
 Chung et al. (2016) Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016.
 Courbariaux et al. (2016) Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
 Dai et al. (2016) Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pp. 2702–2711, 2016.
 Dai et al. (2018) Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, and Le Song. Syntax-directed variational autoencoder for structured data. arXiv preprint arXiv:1802.08786, 2018.
 Dalke et al. (2018) Andrew Dalke, Jérôme Hert, and Christian Kramer. mmpdb: An open-source matched molecular pair platform for large multi-property data sets. Journal of chemical information and modeling, 2018.
 Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852, 2016.
 Dossetter et al. (2013) Alexander G Dossetter, Edward J Griffen, and Andrew G Leach. Matched molecular pair analysis in drug discovery. Drug Discovery Today, 18(15-16):724–731, 2013.
 Duvenaud et al. (2015) David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232, 2015.
 Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
 Gómez-Bombarelli et al. (2016) Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 2016. doi: 10.1021/acscentsci.7b00572.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
 Griffen et al. (2011) Ed Griffen, Andrew G Leach, Graeme R Robb, and Daniel J Warner. Matched molecular pairs as a medicinal chemistry tool: miniperspective. Journal of medicinal chemistry, 54(22):7739–7750, 2011.
 Guimaraes et al. (2017) Gabriel Lima Guimaraes, Benjamin Sanchez-Lengeling, Pedro Luis Cunha Farias, and Alán Aspuru-Guzik. Objective-reinforced generative adversarial networks (organ) for sequence generation models. arXiv preprint arXiv:1705.10843, 2017.
 Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017.
 Hamilton et al. (2017) William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216, 2017.
 Henaff et al. (2015) Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.
 Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
 Jin et al. (2017) Wengong Jin, Connor Coley, Regina Barzilay, and Tommi Jaakkola. Predicting organic reaction outcomes with Weisfeiler-Lehman network. In Advances in Neural Information Processing Systems, pp. 2604–2613, 2017.
 Jin et al. (2018) Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. arXiv preprint arXiv:1802.04364, 2018.
 Kearnes et al. (2016) Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design, 30(8):595–608, 2016.
 Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kipf & Welling (2016) Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 Kondor et al. (2018) Risi Kondor, Hy Truong Son, Horace Pan, Brandon Anderson, and Shubhendu Trivedi. Covariant compositional networks for learning graphs. arXiv preprint arXiv:1801.02144, 2018.
 Kusner et al. (2017) Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. arXiv preprint arXiv:1703.01925, 2017.
 Landrum (2006) Greg Landrum. RDKit: Open-source cheminformatics. Online: http://www.rdkit.org, 2006.
 Lei et al. (2017) Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Deriving neural architectures from sequence and graph kernels. arXiv preprint arXiv:1705.09037, 2017.
 Li et al. (2018a) Yibo Li, Liangren Zhang, and Zhenming Liu. Multi-objective de novo drug design with conditional graph generative model. arXiv preprint arXiv:1801.07299, 2018a.
 Li et al. (2015) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
 Li et al. (2018b) Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, 2018b.
 Liu et al. (2018) Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander L Gaunt. Constrained graph variational autoencoders for molecule design. arXiv preprint arXiv:1805.09076, 2018.
 Niepert et al. (2016) Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pp. 2014–2023, 2016.
 Olivecrona et al. (2017) Marcus Olivecrona, Thomas Blaschke, Ola Engkvist, and Hongming Chen. Molecular de-novo design through deep reinforcement learning. Journal of cheminformatics, 9(1):48, 2017.
 Popova et al. (2018) Mariya Popova, Olexandr Isayev, and Alexander Tropsha. Deep reinforcement learning for de novo drug design. Science advances, 4(7):eaap7885, 2018.
 Rogers & Hahn (2010) David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of chemical information and modeling, 50(5):742–754, 2010.
 Samanta et al. (2018) Bidisha Samanta, Abir De, Niloy Ganguly, and Manuel Gomez-Rodriguez. Designing random graph models using variational autoencoders with applications to chemical design. arXiv preprint arXiv:1802.05283, 2018.
 Scarselli et al. (2009) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
 Schütt et al. (2017) Kristof Schütt, Pieter-Jan Kindermans, Huziel Enoc Sauceda Felix, Stefan Chmiela, Alexandre Tkatchenko, and Klaus-Robert Müller. Schnet: A continuous-filter convolutional neural network for modeling quantum interactions. In Advances in Neural Information Processing Systems, pp. 992–1002, 2017.
 Segler et al. (2017) Marwin HS Segler, Thierry Kogej, Christian Tyrchan, and Mark P Waller. Generating focussed molecule libraries for drug discovery with recurrent neural networks. arXiv preprint arXiv:1701.01329, 2017.
 Shen et al. (2017) Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pp. 6830–6841, 2017.
 Simonovsky & Komodakis (2018) Martin Simonovsky and Nikos Komodakis. Graphvae: Towards generation of small graphs using variational autoencoders. arXiv preprint arXiv:1802.03480, 2018.
 Sterling & Irwin (2015) Teague Sterling and John J Irwin. Zinc 15–ligand discovery for everyone. J. Chem. Inf. Model, 55(11):2324–2337, 2015.
 Weininger (1988) David Weininger. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences, 28(1):31–36, 1988.
 Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
 You et al. (2018a) Jiaxuan You, Bowen Liu, Rex Ying, Vijay Pande, and Jure Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. arXiv preprint arXiv:1806.02473, 2018a.
 You et al. (2018b) Jiaxuan You, Rex Ying, Xiang Ren, William L Hamilton, and Jure Leskovec. Graphrnn: A deep generative model for graphs. arXiv preprint arXiv:1802.08773, 2018b.
 Zhao et al. (2018) Junbo Jake Zhao, Yoon Kim, Kelly Zhang, Alexander M Rush, and Yann LeCun. Adversarially regularized autoencoders. arXiv preprint arXiv:1706.04223, 2018.
 Zhu et al. (2017) Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pp. 465–476, 2017.
Appendix A Model Architecture
Tree and Graph Encoder. For the graph encoder, the two update functions are each parameterized as a one-layer neural network with ReLU activation:
(15)  
(16) 
For the tree encoder, since it updates the messages over more iterations, we parameterize the message function as a tree GRU for learning stability (edge features are omitted because they are always zero). For the other function, we keep the same parameterization as in the graph encoder, with a different set of parameters.
(17) 
Tree Gated Recurrent Unit. The tree GRU function for computing the messages in Eq. (3) is defined as follows (Jin et al., 2018):
(18)  
(19)  
(20)  
(21)  
(22) 
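The tree GRU update can be sketched in plain Python. This is a minimal illustration, assuming the commonly used form of the tree GRU from Jin et al. (2018): sum the incoming messages, compute an update gate from that sum, compute one reset gate per incoming message, and interpolate between the summed messages and a tanh candidate. The function name `tree_gru_message` and the parameter keys (`Wz`, `Uz`, `bz`, `Wr`, `Ur`, `br`, `W`, `U`) are hypothetical and chosen only for this example.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def vadd(*vectors):
    """Element-wise sum of one or more equal-length vectors."""
    return [sum(vals) for vals in zip(*vectors)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def tree_gru_message(x_i, incoming, p):
    """Message from node i to node j (assumed tree GRU form).

    x_i      -- feature vector of node i
    incoming -- messages into i from all neighbors except j
    p        -- dict of weights and biases (hypothetical key names)
    """
    zero = [0.0] * len(p["bz"])
    # Sum of incoming messages (stays zero for a leaf node).
    s = vadd(zero, *incoming)
    # Update gate computed from the node feature and the summed messages.
    z = sigmoid(vadd(matvec(p["Wz"], x_i), matvec(p["Uz"], s), p["bz"]))
    # One reset gate per incoming message, applied before summation.
    gated = zero
    for m in incoming:
        r = sigmoid(vadd(matvec(p["Wr"], x_i), matvec(p["Ur"], m), p["br"]))
        gated = vadd(gated, [ri * mi for ri, mi in zip(r, m)])
    # Candidate message.
    cand = [math.tanh(a)
            for a in vadd(matvec(p["W"], x_i), matvec(p["U"], gated))]
    # Interpolate between the summed messages and the candidate.
    return [(1.0 - zi) * si + zi * ci for zi, si, ci in zip(z, s, cand)]
```

In a full encoder this function would be called once per directed edge and per iteration, with `incoming` gathered from the previous iteration's messages.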
Tree Decoder Attention. The attention mechanism is implemented as a bilinear function between the decoder state and the source tree and graph vectors, normalized by the softmax function:
(23) 
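As an illustration of bilinear attention with softmax normalization, the following sketch scores each source vector against the decoder state through a weight matrix and returns the attention weights and the resulting context vector. The names `bilinear_attention`, `h`, `A`, and `X` are assumptions made for this example. The sketch handles a single set of source vectors, so under this reading it would be applied once to the tree vectors and once to the graph vectors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bilinear_attention(h, A, X):
    """Bilinear attention of decoder state h over source vectors X.

    h -- decoder state, length d_h
    A -- bilinear weight matrix, d_h rows by d_x columns (hypothetical name)
    X -- list of source vectors, each of length d_x
    """
    # Project the decoder state through the bilinear matrix: h^T A.
    hA = [sum(h[i] * A[i][j] for i in range(len(h)))
          for j in range(len(A[0]))]
    # Score each source vector: (h^T A) x.
    scores = [sum(a * x for a, x in zip(hA, vec)) for vec in X]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the source vectors.
    context = [sum(w * vec[j] for w, vec in zip(weights, X))
               for j in range(len(X[0]))]
    return weights, context
```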
Graph Decoder. We use the same graph neural architecture as Jin et al. (2018) for scoring candidate attachments. Each candidate is the graph that results from a particular merging of a cluster in the tree with its neighbors. The main challenge of attachment scoring is local isomorphism. Suppose two neighbors carry the same cluster label. Exchanging their positions then leads to isomorphic graphs. However, the two cliques are not actually exchangeable if the subtrees under them differ (illustrations can be found in Jin et al. (2018)). Therefore, we need to incorporate information about those subtrees when scoring the attachments.
To this end, we define an index for each atom that marks its position in the junction tree. The index is also used to retrieve the messages that summarize the corresponding subtree, which are obtained by running the tree encoding algorithm. The tree messages are incorporated into the graph message passing network to avoid local isomorphism:
(24)  
(25) 
The final representation of graph is , where
(26) 
Adversarial Scaffold Regularization. Algorithm 2 describes the tree decoding algorithm for adversarial training. It replaces the ground-truth input with predicted label distributions , enabling gradient propagation from the discriminator.
Appendix B Experimental Details
Training Details. We elaborate on the hyperparameters used in our experiments. For our models, the hidden state dimension is 300 and latent code dimension . The tree encoder runs message passing for 6 iterations, and the graph encoder runs for 3 iterations. The entire model has 3.9M parameters. For VSeq2Seq, the encoder is a one-layer bidirectional LSTM and the decoder is a one-layer unidirectional LSTM. The attention scores are computed as in Bahdanau et al. (2014). We set the hidden state dimension of the recurrent encoder and decoder to 300, so that the VSeq2Seq model has a comparable 3.8M parameters. For VSeq2Seq, we found that the model performance stays similar under different KL regularization weights ( and ).
All models are trained with the Adam optimizer for 20 epochs with an initial learning rate of 0.001. We anneal the learning rate by a factor of 0.9 after every epoch. For adversarial training, our discriminator is a three-layer feed-forward network with hidden layer dimension 300. The activation function is LeakyReLU. The discriminator is trained for iterations with gradient penalty weight .
Property Calculation. The penalized logP is calculated using the implementation of You et al. (2018a), which uses RDKit (Landrum, 2006) to compute clogP and synthetic accessibility scores. The QED scores are also computed using RDKit’s built-in functionality. The DRD2 activity prediction model is downloaded from https://github.com/MarcusOlivecrona/REINVENT/blob/master/data/clf.pkl.
MMPA Procedure. We used the open-source toolkit mmpdb (Dalke et al., 2018) to perform matched molecular pair (MMP) analysis (https://github.com/rdkit/mmpdb). On the logP and QED tasks, we constructed a database of transformation rules extracted from the ZINC dataset (with test set molecules excluded). On the DRD2 task, the database is constructed from both ZINC and the dataset from Olivecrona et al. (2017). During testing, each molecule is translated 50 times with different matching rules. When there are more than 50 matching rules, we choose the 50 with the highest average property improvement. This statistic is calculated during database construction.
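The rule selection step at test time can be sketched as follows. This is only an illustration of ranking by average property improvement and keeping at most 50 rules. It does not use the mmpdb API, and the function name `select_rules` and the `(rule_id, avg_improvement)` tuple layout are assumptions made for this example.

```python
def select_rules(matching_rules, max_rules=50):
    """Keep at most `max_rules` rules, preferring the highest average improvement.

    matching_rules -- list of (rule_id, avg_improvement) tuples, where
                      avg_improvement is a statistic precomputed when the
                      rule database was built (hypothetical layout)
    """
    # Sort by average property improvement, best first.
    ranked = sorted(matching_rules, key=lambda rule: rule[1], reverse=True)
    # If there are fewer than max_rules matches, all of them are kept.
    return ranked[:max_rules]

# Example: 60 matching rules with increasing average improvement.
rules = [("rule_%d" % i, float(i)) for i in range(60)]
chosen = select_rules(rules)
```

Each chosen rule would then be applied to the source molecule to produce one translation candidate.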
Training Set Curation. The training set of the penalized logP task is curated from the ZINC dataset of 250K molecules (Jin et al., 2018). A molecular pair is selected into the training set if the Tanimoto similarity and the property improvement is greater than 0.5 (if ) and 2.5 (if ). On the QED and DRD2 tasks, we select training molecular pairs if and both and fall into the source and target property range. Figure 5 shows histograms of the QED and DRD2 score distributions. On the QED dataset, the percentile of the source domain is 37.9–65.5%, and the target domain is 0–6.6%. On the DRD2 dataset, the percentile of the target domain is 0–1.9%.
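The pair selection for the penalized logP task can be illustrated with a small sketch. It assumes that fingerprints are represented as sets of active bit indices and that a pair qualifies when its similarity is at least a threshold and its property improvement exceeds a minimum. Both thresholds are passed in as parameters rather than fixed. The names `tanimoto` and `select_pairs` and the `(name, fingerprint_bits, property_value)` layout are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def select_pairs(molecules, sim_threshold, min_improvement):
    """Return (source, target) name pairs that satisfy both constraints.

    molecules -- list of (name, fingerprint_bits, property_value) tuples
    The ">=" on similarity is an assumption; the improvement test is strict.
    """
    pairs = []
    for name_x, fp_x, prop_x in molecules:
        for name_y, fp_y, prop_y in molecules:
            if name_x == name_y:
                continue
            similar = tanimoto(fp_x, fp_y) >= sim_threshold
            improved = (prop_y - prop_x) > min_improvement
            if similar and improved:
                pairs.append((name_x, name_y))
    return pairs

# Toy example with made-up fingerprints and property values.
mols = [("a", {1, 2, 3}, 0.0), ("b", {1, 2, 3, 4}, 1.0), ("c", {7, 8}, 5.0)]
pairs = select_pairs(mols, sim_threshold=0.6, min_improvement=0.5)
```

The quadratic loop is only for clarity. On a dataset of 250K molecules, a real pipeline would need a similarity search index or pre-filtering.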
Ablation Study. We investigate the translation accuracy of our models against the number of translation candidates . As shown in Figure 6, the model performance greatly increases when more translation candidates are included. We also notice that on the QED and DRD2 datasets, the performance difference between VJTNN and VJTNN+GAN is much larger at a small number of candidates than at larger numbers. This indicates that the model with adversarial training generates valid translations much more frequently.
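The dependence of accuracy on the number of candidates can be made concrete with a short sketch. It assumes that a source molecule counts as successfully translated when at least one of its first k candidates satisfies the task's constraints. The function name `success_rate_at_k` and the boolean-flag input format are hypothetical.

```python
def success_rate_at_k(candidate_flags, k):
    """Fraction of source molecules with a valid candidate among the first k.

    candidate_flags -- one list per source molecule; each entry is True if
                       that candidate meets the task's constraints
    """
    if not candidate_flags:
        return 0.0
    hits = sum(1 for flags in candidate_flags if any(flags[:k]))
    return hits / len(candidate_flags)

# Toy example: three source molecules, two candidates each.
flags = [[False, True], [False, False], [True, False]]
curve = [success_rate_at_k(flags, k) for k in (1, 2)]
```

Under this definition the success rate can only stay level or rise as k grows, which is consistent with the trend described above.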