Learning Multimodal Graph-to-Graph Translation for Molecular Optimization

12/03/2018, by Wengong Jin et al. (MIT)

We view molecular optimization as a graph-to-graph translation problem. The goal is to learn to map from one molecular graph to another with better properties based on an available corpus of paired molecules. Since molecules can be optimized in different ways, there are multiple viable translations for each input graph. A key challenge is therefore to model diverse translation outputs. Our primary contributions include a junction tree encoder-decoder for learning diverse graph translations along with a novel adversarial training method for aligning distributions of molecules. Diverse output distributions in our model are explicitly realized by low-dimensional latent vectors that modulate the translation process. We evaluate our model on multiple molecular optimization tasks and show that our model outperforms previous state-of-the-art baselines by a significant margin.


1 Introduction

The goal of drug discovery is to design molecules with desirable chemical properties. The task is challenging since the chemical space is vast and often difficult to navigate. One of the prevailing approaches, known as matched molecular pair analysis (MMPA) (Griffen et al., 2011; Dossetter et al., 2013), learns rules for generating "molecular paraphrases" that are likely to improve target chemical properties. The setup is analogous to machine translation: MMPA takes as input molecular pairs $\{(X, Y)\}$, where $Y$ is a paraphrase of $X$ with better chemical properties. However, current MMPA methods distill the matched pairs into graph transformation rules rather than treating the problem as a general translation task over graphs based on parallel data.

In this paper, we formulate molecular optimization as graph-to-graph translation. Given a corpus of molecular pairs, our goal is to learn to translate input molecular graphs into better graphs. The proposed translation task involves many challenges. While several methods are available to encode graphs (Duvenaud et al., 2015; Li et al., 2015; Lei et al., 2017), generating graphs as output is more challenging without resorting to a domain-specific graph linearization. In addition, the target molecular paraphrases are diverse since multiple strategies can be applied to improve a molecule. Therefore, our goal is to learn multimodal output distributions over graphs.

To this end, we propose a junction tree encoder-decoder, a refined graph-to-graph neural architecture that decodes molecular graphs with neural attention. To capture diverse outputs, we introduce stochastic latent codes into the decoding process and guide these codes to capture meaningful molecular variations. The basic learning problem can be cast as a variational autoencoder. We constrain the posterior inference over the latent codes to depend only on the target molecule $Y$ rather than on both $X$ and $Y$. The motivation is that the additional hints required for translation should depend only on the target type. Further, to avoid invalid translations, we propose a novel adversarial training method to align the distribution of graphs generated from the model using randomly selected latent codes with the observed distribution of valid targets. Specifically, we perform adversarial regularization on the level of the hidden states created as part of the graph generation.

We evaluate our model on three molecular optimization tasks, with target properties ranging from drug likeness to biological activity. As baselines, we utilize state-of-the-art graph generation methods (Jin et al., 2018; You et al., 2018a) and MMPA (Dalke et al., 2018). We demonstrate that our model excels in discovering molecules with desired properties, yielding 8.5% to 55.7% relative gain over the best baseline across different tasks. Meanwhile, our model can translate a given molecule into a diverse set of compounds, demonstrating the diversity of learned output distributions.

2 Related Work

Molecule Generation/Optimization Prior work on molecular optimization approached the graph translation task through generative modeling (Gómez-Bombarelli et al., 2016; Segler et al., 2017; Kusner et al., 2017; Dai et al., 2018; Jin et al., 2018; Samanta et al., 2018; Li et al., 2018a) and reinforcement learning (Guimaraes et al., 2017; Olivecrona et al., 2017; Popova et al., 2018; You et al., 2018a). Earlier approaches represented molecules as SMILES strings (Weininger, 1988), while more recent methods represent them as graphs. Most of these methods coupled a molecule generator with a property predictor and solved the optimization problem through Bayesian optimization or reinforcement learning. In contrast, our model is trained to translate a molecular graph into a better graph through supervised learning.

Our approach is closely related to matched molecular pair analysis (MMPA) (Griffen et al., 2011; Dossetter et al., 2013) in de novo drug design, where the matched pairs are hard-coded into graph transformation rules. MMPA's main drawback is that large numbers of rules (e.g., millions) have to be realized to cover all the complex transformation patterns. In contrast, our approach uses neural networks to learn such transformations, which does not require the rules to be explicitly realized.

Graph Neural Networks Our work is related to graph encoders and decoders. Previous work on graph encoders includes convolutional (Scarselli et al., 2009; Bruna et al., 2013; Henaff et al., 2015; Duvenaud et al., 2015; Niepert et al., 2016; Defferrard et al., 2016; Kondor et al., 2018) and recurrent architectures (Li et al., 2015; Dai et al., 2016; Lei et al., 2017). Graph encoders have been applied to social network analysis (Kipf & Welling, 2016; Hamilton et al., 2017) and chemistry (Kearnes et al., 2016; Gilmer et al., 2017; Schütt et al., 2017; Jin et al., 2017). Recently proposed graph decoders (Simonovsky & Komodakis, 2018; Li et al., 2018b; Jin et al., 2018; You et al., 2018b; Liu et al., 2018) focus on learning generative models of graphs. While our model builds on Jin et al. (2018) to generate graphs, we contribute new techniques to learn multimodal graph-to-graph mappings.

Image/Text Style Translation Our work is closely related to image-to-image translation (Isola et al., 2017), which was later extended by Zhu et al. (2017) to learn multimodal mappings. Our adversarial training technique is inspired by recent text style transfer methods (Shen et al., 2017; Zhao et al., 2018) that adversarially regularize the continuous representation of discrete structures to enable end-to-end training. Our technical contribution is a novel adversarial regularization over graphs that constrains their scaffold structures in a continuous manner.

3 Junction Tree Encoder-Decoder

Our translation model extends the junction tree variational autoencoder (Jin et al., 2018) to an encoder-decoder architecture for learning graph-to-graph mappings. Following their work, we interpret each molecule as having been built from subgraphs (clusters of atoms) chosen from a vocabulary of valid chemical substructures. The clusters form a junction tree representing the scaffold structure of molecules (Figure 1), which is an important factor in drug design. Molecules are decoded hierarchically by first generating the junction trees and then combining the nodes of the tree into a molecule. This coarse-to-fine approach allows us to easily enforce the chemical validity of generated graphs, and provides an enriched representation that encodes molecules at different scales.

In terms of model architecture, the encoder is a graph message passing network that embeds both nodes in the tree and graph into continuous vectors. The decoder consists of a tree-structured decoder for predicting junction trees, and a graph decoder that learns to combine clusters in the predicted junction tree into a molecule. Our key departures from Jin et al. (2018) include a unified encoder architecture for trees and graphs, along with an attention mechanism in the tree decoding process.

Figure 1: Illustration of our encoder-decoder model. Molecules are represented by their graph structures and junction trees encoding the scaffold of molecules. Nodes in the junction tree (which we call clusters) are valid chemical substructures such as rings and bonds. During decoding, the model first generates its junction tree and then combines clusters in the predicted tree into a molecule.

3.1 Tree and Graph Encoder

Viewing trees as graphs, we encode both junction trees and graphs using graph message passing networks. Specifically, a graph is defined as $G = (V, E)$, where $V$ is the vertex set and $E$ the edge set. Each node $v$ has a feature vector $\mathbf{f}_v$. For atoms, it includes the atom type, valence, and other atomic properties. For clusters in the junction tree, $\mathbf{f}_v$ is a one-hot vector indicating its cluster label. Similarly, each edge $(u, v) \in E$ has a feature vector $\mathbf{f}_{uv}$. Let $N(v)$ be the set of neighbor nodes of $v$. There are two hidden vectors $\boldsymbol{\nu}_{uv}$ and $\boldsymbol{\nu}_{vu}$ for each edge, representing the message from $u$ to $v$ and vice versa. These messages are updated iteratively via a neural network $g_1(\cdot)$:

$\boldsymbol{\nu}_{uv}^{(t)} = g_1\big(\mathbf{f}_u, \mathbf{f}_{uv}, \{\boldsymbol{\nu}_{wu}^{(t-1)}\}_{w \in N(u) \setminus v}\big)$    (1)

where $\boldsymbol{\nu}_{uv}^{(t)}$ is the message computed in the $t$-th iteration, initialized with $\boldsymbol{\nu}_{uv}^{(0)} = \mathbf{0}$. In each iteration, all messages are updated asynchronously, as there is no natural order among the nodes. This is different from the tree encoding algorithm in Jin et al. (2018), where a root node was specified and an artificial order was imposed on the message updates. Removing this artifact is necessary as the learned embeddings will be biased by the artificial order.

After $T$ steps of iteration, we aggregate the messages via another neural network $g_2(\cdot)$ to derive the latent vector of each vertex, which captures its local graphical structure:

$\mathbf{x}_u = g_2\big(\mathbf{f}_u, \{\boldsymbol{\nu}_{wu}^{(T)}\}_{w \in N(u)}\big)$    (2)

Applying the above message passing network to the junction tree $T_X$ and the graph $G_X$ of a source molecule $X$ yields two sets of vectors $\{\mathbf{x}_1^T, \dots, \mathbf{x}_n^T\}$ and $\{\mathbf{x}_1^G, \dots, \mathbf{x}_m^G\}$, which we call source tree vectors and source graph vectors. Here $\mathbf{x}_i^T$ is the embedding of tree node $i$, and $\mathbf{x}_j^G$ is the embedding of graph node $j$.
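
For concreteness, the following PyTorch sketch implements the message passing of Eqs. (1)-(2) under the simple one-layer parameterization of Appendix A; the tensor layout, parameter names, and edge bookkeeping are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class MPNEncoder(nn.Module):
    """Sketch of the encoder of Eqs. (1)-(2): loopy message passing over directed
    edges with a one-layer ReLU parameterization (Appendix A). Illustrative only."""

    def __init__(self, node_dim, edge_dim, hidden_dim, n_iters=3):
        super().__init__()
        self.n_iters = n_iters
        self.W_node = nn.Linear(node_dim, hidden_dim, bias=False)
        self.W_edge = nn.Linear(edge_dim, hidden_dim, bias=False)
        self.W_msg = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.U_out = nn.Linear(node_dim + hidden_dim, hidden_dim)

    def forward(self, f_nodes, f_edges, edge_index):
        # f_nodes: [n, node_dim], f_edges: [m, edge_dim],
        # edge_index: [m, 2] LongTensor of directed edges (u -> v); both directions present.
        n, m = f_nodes.size(0), edge_index.size(0)
        src, dst = edge_index[:, 0], edge_index[:, 1]
        hidden_dim = self.W_msg.out_features
        msg = torch.zeros(m, hidden_dim)                      # nu_{uv}^{(0)} = 0

        # For edge e = (u -> v), rev[e] is the index of the reverse edge (v -> u).
        pos = {(int(u), int(v)): e for e, (u, v) in enumerate(edge_index.tolist())}
        rev = torch.tensor([pos[(int(v), int(u))] for u, v in edge_index.tolist()])

        for _ in range(self.n_iters):                         # Eq. (1): all messages updated in parallel
            into_node = torch.zeros(n, hidden_dim).index_add_(0, dst, msg)  # sum_w nu_{wu}
            agg = into_node[src] - msg[rev]                   # exclude the reverse message nu_{vu}
            msg = torch.relu(self.W_node(f_nodes[src]) + self.W_edge(f_edges) + self.W_msg(agg))

        into_node = torch.zeros(n, hidden_dim).index_add_(0, dst, msg)
        return torch.relu(self.U_out(torch.cat([f_nodes, into_node], dim=-1)))  # Eq. (2): x_u
```

The same module can be applied to both the junction tree and the molecular graph, with cluster labels or atom features as node inputs.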

3.2 Junction Tree Decoder

We generate a junction tree $T_Y = (V, E)$ with a tree recurrent neural network augmented with an attention mechanism. The tree is constructed in a top-down fashion by expanding the tree one node at a time. Formally, let $\tilde{\mathcal{E}} = \{(i_1, j_1), \dots, (i_m, j_m)\}$ be the edges traversed in a depth-first traversal over tree $T_Y$, where $m = 2|E|$ as each edge is traversed in both directions. Let $\tilde{\mathcal{E}}_t$ be the first $t$ edges in $\tilde{\mathcal{E}}$. At the $t$-th decoding step, the model visits node $i_t$ and receives message vectors $\mathbf{h}_{ij}$ from its neighbors. The message $\mathbf{h}_{i_t, j_t}$ is updated through a tree Gated Recurrent Unit (Jin et al., 2018):

$\mathbf{h}_{i_t, j_t} = \mathrm{GRU}\big(\mathbf{f}_{i_t}, \{\mathbf{h}_{k, i_t}\}_{(k, i_t) \in \tilde{\mathcal{E}}_t,\, k \neq j_t}\big)$    (3)

Topological Prediction When the model visits node $i_t$, it first computes a predictive hidden state $\mathbf{h}_t$ by combining the node features $\mathbf{f}_{i_t}$ and the inward messages $\{\mathbf{h}_{k, i_t}\}$ via a one-hidden-layer network. The model then makes a binary prediction on whether to expand a new child node or to backtrack to the parent of $i_t$. This probability is computed by aggregating the source encodings $\{\mathbf{x}_i^T\}$ and $\{\mathbf{x}_j^G\}$ through an attention layer, followed by a feed-forward network ($\tau(\cdot)$ stands for ReLU and $\sigma(\cdot)$ for sigmoid):

$\mathbf{h}_t = \tau\big(\mathbf{W}_1^d \mathbf{f}_{i_t} + \mathbf{W}_2^d \textstyle\sum_{(k, i_t) \in \tilde{\mathcal{E}}_t} \mathbf{h}_{k, i_t}\big)$    (4)
$\mathbf{c}_t^d = \mathrm{attention}\big(\mathbf{h}_t, \{\mathbf{x}_i^T\}, \{\mathbf{x}_j^G\};\ \mathbf{U}_{att}^d\big)$    (5)
$p_t = \sigma\big(\mathbf{u}^d \cdot \tau(\mathbf{W}_3^d \mathbf{h}_t + \mathbf{W}_4^d \mathbf{c}_t^d)\big)$    (6)

Here we use $\mathrm{attention}(\cdot\,;\ \mathbf{U}_{att})$ to denote the attention mechanism with parameters $\mathbf{U}_{att}$. It computes two sets of attention scores (normalized by softmax) over the source tree and graph vectors, respectively. The output is a concatenation of the tree and graph attention vectors:

Figure 2: Multiple ways to assemble neighboring clusters in the junction tree.
$\mathrm{attention}\big(\mathbf{h}_t, \{\mathbf{x}_i^T\}, \{\mathbf{x}_j^G\};\ \mathbf{U}_{att}\big) = \big[\textstyle\sum_i \alpha_{t,i}^T \mathbf{x}_i^T\ ;\ \sum_j \alpha_{t,j}^G \mathbf{x}_j^G\big]$    (7)

Label Prediction If node $j_t$ is a new child to be generated from parent $i_t$, we predict its label by

$\mathbf{c}_t^l = \mathrm{attention}\big(\mathbf{h}_{i_t, j_t}, \{\mathbf{x}_i^T\}, \{\mathbf{x}_j^G\};\ \mathbf{U}_{att}^l\big)$    (8)
$\mathbf{q}_t = \mathrm{softmax}\big(\mathbf{U}^l\, \tau(\mathbf{W}_1^l \mathbf{h}_{i_t, j_t} + \mathbf{W}_2^l \mathbf{c}_t^l)\big)$    (9)

where $\mathbf{q}_t$ is a distribution over the label vocabulary and $\mathbf{U}_{att}^l$ is another set of attention parameters.
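
The sketch below illustrates the attention and prediction heads of Eqs. (4)-(9). For brevity it shares one attention module between the topological and label heads, whereas the model uses separate parameter sets; all layer names and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TreeDecoderAttention(nn.Module):
    """Sketch of the decoder heads: bilinear attention over source tree/graph
    vectors, a sigmoid topological score, and a softmax label distribution."""

    def __init__(self, hidden_dim, vocab_size):
        super().__init__()
        # One shared attention for brevity; the paper uses separate parameters for Eqs. (5) and (8).
        self.att_tree = nn.Bilinear(hidden_dim, hidden_dim, 1)
        self.att_graph = nn.Bilinear(hidden_dim, hidden_dim, 1)
        self.topo_out = nn.Sequential(nn.Linear(3 * hidden_dim, hidden_dim),
                                      nn.ReLU(), nn.Linear(hidden_dim, 1))
        self.label_out = nn.Sequential(nn.Linear(3 * hidden_dim, hidden_dim),
                                       nn.ReLU(), nn.Linear(hidden_dim, vocab_size))

    def attend(self, h, x_tree, x_graph):
        # Softmax-normalized scores over source tree and graph vectors (Eq. 7).
        a_t = torch.softmax(self.att_tree(h.expand_as(x_tree), x_tree).squeeze(-1), dim=0)
        a_g = torch.softmax(self.att_graph(h.expand_as(x_graph), x_graph).squeeze(-1), dim=0)
        c_tree = (a_t.unsqueeze(-1) * x_tree).sum(dim=0)
        c_graph = (a_g.unsqueeze(-1) * x_graph).sum(dim=0)
        return torch.cat([c_tree, c_graph], dim=-1)

    def topological(self, h_t, x_tree, x_graph):
        c = self.attend(h_t, x_tree, x_graph)                          # Eq. (5)
        return torch.sigmoid(self.topo_out(torch.cat([h_t, c], -1)))   # Eq. (6): expand vs. backtrack

    def label(self, h_edge, x_tree, x_graph):
        c = self.attend(h_edge, x_tree, x_graph)                       # Eq. (8)
        return torch.softmax(self.label_out(torch.cat([h_edge, c], -1)), dim=-1)  # Eq. (9)
```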

3.3 Graph Decoder

The second step in the decoding process is to construct a molecular graph $G$ from a predicted junction tree $\hat{T} = (\hat{V}, \hat{E})$. This step is not deterministic, since multiple molecules could correspond to the same junction tree. For instance, the junction tree in Figure 2 can be assembled into three different molecules. The underlying degree of freedom pertains to how neighboring clusters are attached to each other. Let $\mathcal{G}_i$ be the set of possible candidate attachments at tree node $i$. Each graph $G_i \in \mathcal{G}_i$ is a particular realization of how cluster $C_i$ is attached to its neighboring clusters $\{C_j,\ j \in N_{\hat{T}}(i)\}$. The goal of the graph decoder is to predict the correct attachment between the clusters.

To this end, we design the following scoring function $f(\cdot)$ for ranking candidate attachments within the set $\mathcal{G}_i$. We first apply a graph message passing network over a candidate graph $G_i$ to compute atom representations $\{\boldsymbol{\mu}_v\}_{v \in G_i}$. Then we derive a vector representation of $G_i$ through sum-pooling: $\mathbf{m}_{G_i} = \sum_{v \in G_i} \boldsymbol{\mu}_v$. Finally, we score candidate $G_i$ by computing dot products between $\mathbf{m}_{G_i}$ and the encoded source graph vectors: $f(G_i) = \sum_j \mathbf{m}_{G_i} \cdot \mathbf{x}_j^G$.

The graph decoder is trained to maximize the log-likelihood of ground truth subgraphs at all tree nodes (Eq. (10)). During training, we apply teacher forcing by feeding the graph decoder the ground truth junction tree as input. During testing, we assemble the graph one neighborhood at a time, following the order in which the junction tree was decoded.

$\mathcal{L}_g = \sum_i \Big[ f(G_i) - \log \textstyle\sum_{G_i' \in \mathcal{G}_i} \exp\big(f(G_i')\big) \Big]$    (10)
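
A minimal sketch of this attachment scoring and the corresponding training loss (Eq. (10)), assuming the candidate atom vectors have already been computed by the graph message passing network:

```python
import torch

def score_candidates(cand_atom_vecs, source_graph_vecs):
    """Sketch of the attachment scoring in Sec 3.3: each candidate graph G_k is
    represented by sum-pooling its atom vectors, then scored against the encoded
    source graph vectors by dot products. Shapes are illustrative assumptions.

    cand_atom_vecs: list of [n_atoms_k, d] tensors, one per candidate attachment
    source_graph_vecs: [m, d] tensor of source graph vectors x^G
    returns: [K] tensor of scores f(G_k)
    """
    scores = []
    for atom_vecs in cand_atom_vecs:
        m_g = atom_vecs.sum(dim=0)                                  # sum-pooling -> m_{G_k}
        scores.append(torch.matmul(source_graph_vecs, m_g).sum())   # sum_j m_{G_k} . x_j^G
    return torch.stack(scores)

def attachment_loss(scores, true_idx):
    """Eq. (10): maximize the log-likelihood of the ground-truth attachment,
    i.e. cross-entropy over the candidate set."""
    return -torch.log_softmax(scores, dim=0)[true_idx]
```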

4 Multimodal Graph-to-Graph Translation

Our goal is to learn a multimodal mapping between two molecule domains, such as molecules with low and high solubility, or molecules that are inactive and active against a biological target. During training, we are given a dataset of paired molecules $\{(X, Y)\}$ sampled from their joint distribution $p(\mathcal{X}, \mathcal{Y})$, where $\mathcal{X}$ and $\mathcal{Y}$ are the source and target domains. It is important to note that this joint distribution is a many-to-many mapping. For instance, there exist many ways to modify molecule $X$ to increase its solubility. Given a new molecule $X$, the model should be able to generate a diverse set of outputs.

To this end, we propose to augment the basic encoder-decoder model with low-dimensional latent vectors to explicitly encode the multimodal aspect of the output distribution. The mapping to be learned now becomes $F: (X, \mathbf{z}) \rightarrow Y$, with the latent code $\mathbf{z}$ drawn from a prior distribution $P(\mathbf{z})$, which is a standard Gaussian $\mathcal{N}(0, \mathbf{I})$. There are two challenges in learning this mapping. First, as shown in the image domain (Zhu et al., 2017), the latent codes are often ignored by the model unless we explicitly enforce them to encode meaningful variations. Second, the model should be properly regularized so that it does not produce invalid translations. That is, the translated molecule $Y$ should always belong to the target domain $\mathcal{Y}$ given any latent code $\mathbf{z}$. In this section, we propose two techniques to address these issues.

4.1 Variational Junction Tree Encoder-Decoder (VJTNN)

First, to encode meaningful variations, we derive the latent code $\mathbf{z}$ from the embedding of the ground truth molecule $Y$. The decoder is trained to reconstruct $Y$ when taking as input both its vector encoding $\mathbf{z}$ and the source molecule $X$. For efficient sampling, the latent code distribution is regularized to be close to the prior distribution, similar to a variational autoencoder. We also restrict $\mathbf{z}$ to be a low-dimensional vector to prevent the model from ignoring the input $X$ and degenerating to an autoencoder.

Specifically, we first embed molecules $X$ and $Y$ into their tree and graph vectors $\{\mathbf{x}_i^T\}, \{\mathbf{x}_j^G\}, \{\mathbf{y}_i^T\}, \{\mathbf{y}_j^G\}$, using the same encoder with shared parameters (Sec. 3.1). Then we perform average pooling over the tree and graph vectors of $Y$: $\mathbf{y}^T = \frac{1}{|T_Y|}\sum_i \mathbf{y}_i^T$ and $\mathbf{y}^G = \frac{1}{|G_Y|}\sum_j \mathbf{y}_j^G$. We sample latent codes $\mathbf{z}^T$ and $\mathbf{z}^G$ from the posterior $Q(\mathbf{z} \mid X, Y)$ via the reparameterization trick (Kingma & Welling, 2013), where the mean and log variance are computed from $\mathbf{y}^T$ and $\mathbf{y}^G$ with two separate affine layers. Finally, we combine the latent codes $\mathbf{z}^T$ and $\mathbf{z}^G$ with the source tree and graph vectors:

$\tilde{\mathbf{x}}_i^T = \big[\mathbf{x}_i^T\ ;\ \mathbf{z}^T\big], \qquad \tilde{\mathbf{x}}_j^G = \big[\mathbf{x}_j^G\ ;\ \mathbf{z}^G\big]$    (11)

where $\tilde{\mathbf{x}}_i^T$ and $\tilde{\mathbf{x}}_j^G$ are the "perturbed" tree and graph vectors of the source molecule $X$. The perturbed inputs are then fed into the decoder to synthesize the target molecule $Y$. The training objective follows a conditional variational autoencoder, including a reconstruction loss and a KL regularization term:

$\mathcal{L}_{vae}(X, Y) = -\mathbb{E}_{\mathbf{z} \sim Q}\big[\log P(Y \mid \mathbf{z}, X)\big] + \lambda_{KL}\, \mathcal{D}_{KL}\big[Q(\mathbf{z} \mid X, Y)\ \|\ P(\mathbf{z})\big]$    (12)
Figure 3: Multimodal graph-to-graph learning. Our model combines the strength of both variational JTNN and adversarial scaffold regularization.
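
The following sketch shows how one such latent code is obtained and attached to the source vectors (Eqs. (11)-(12)); the model uses two codes, one for tree vectors and one for graph vectors, each produced by a module like this. Dimensions and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LatentCode(nn.Module):
    """Sketch of Sec 4.1: pool the target tree/graph vectors of Y, sample a
    low-dimensional code z with the reparameterization trick, and concatenate it
    to every source vector of X (Eq. 11). Illustrative only."""

    def __init__(self, hidden_dim, z_dim):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, z_dim)
        self.logvar = nn.Linear(hidden_dim, z_dim)

    def forward(self, y_vecs, x_vecs):
        pooled = y_vecs.mean(dim=0)                                  # average pooling over Y's vectors
        mu, logvar = self.mu(pooled), self.logvar(pooled)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)      # reparameterization trick
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        x_tilde = torch.cat([x_vecs, z.expand(x_vecs.size(0), -1)], dim=-1)  # Eq. (11)
        return x_tilde, kl   # the decoder consumes x_tilde; kl enters Eq. (12)
```

At test time, $\mathbf{z}$ is drawn from the prior $\mathcal{N}(0, \mathbf{I})$ instead of the posterior.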

4.2 Adversarial Scaffold Regularization

Second, to avoid invalid translations, we force molecules decoded from latent codes to follow the distribution of the target domain through adversarial training (Goodfellow et al., 2014). The adversarial game involves two components. The discriminator tries to distinguish real molecules in the target domain from fake molecules generated by the model. The generator (i.e. our encoder-decoder) tries to generate molecules indistinguishable from the molecules in the target domain.

The main challenge is how to integrate adversarial training into our decoder, as the discrete decisions in tree and graph decoding hinder gradient propagation. While it is possible to estimate the gradient using REINFORCE

(Williams, 1992), training with these methods could be unstable due to the high variance of sampled gradients. Moreover, those methods require the model to assemble multiple junction tree samples into graphs, thus invoking candidate attachment enumeration multiple times and significantly slowing down the training process.

To overcome these issues, we instead apply adversarial regularization over continuous representations of the decoded molecular structures, derived from the hidden states in the decoder (Shen et al., 2017; Zhao et al., 2018). That is, we replace the input of the discriminator with continuous embeddings of discrete outputs. For efficient training, we only enforce the adversarial regularization in the tree decoding step. (While it is desirable to apply the same idea to graph decoding, the subgraph enumeration requires the tree decoder to make "hard" decisions over node labels, inevitably blocking gradient propagation. We leave this issue for future work.) As a result, the adversary only matches the scaffold structure between translated molecules and true samples. While this is an approximation, we found that it still yields a significant improvement when combined with VJTNN, as chemical properties are largely determined by scaffold structures.

It remains to be specified how the continuous representation is computed, and how we can decode junction trees such that the gradient can be back-propagated through the entire tree decoding process. The decoder first predicts the label distribution of the root of tree $T_Y$. Starting from the root, we incrementally expand the tree, guided by topological predictions, and compute the hidden messages between nodes in the partial tree. At timestep $t$, the model decides to either expand a new node or backtrack to the parent of node $i_t$. We denote this binary decision as $d_t \in \{0, 1\}$, which is determined by the topological score $p_t$ in Eq. (6). For the true samples $Y$, the hidden messages are computed by Eq. (3) with teacher forcing, namely replacing the label and topological predictions with their ground truth values. For the translated samples $\hat{Y}$ from source molecules $X$, we replace the one-hot encoding $\mathbf{f}_{i_t}$ with its softmax distribution $\mathbf{q}_{i_t}$ over cluster labels in Eq. (3) and (4). Moreover, we multiply the message with the binary gate $d_t$, to account for the fact that the messages should depend on the topological layout of the tree:

$\mathbf{h}_{i_t, j_t} = d_t \cdot \mathrm{GRU}\big(\mathbf{q}_{i_t}, \{\mathbf{h}_{k, i_t}\}_{(k, i_t) \in \tilde{\mathcal{E}}_t,\, k \neq j_t}\big)$    (13)

As $d_t$ is computed by a non-differentiable threshold function, we approximate its gradient with a straight-through estimator (Bengio et al., 2013; Courbariaux et al., 2016). Specifically, we replace the threshold function with a differentiable hard sigmoid function during back-propagation, while using the threshold function in the forward pass. This technique has been successfully applied to training neural networks with dynamic computational graphs (Chung et al., 2016).
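
A generic PyTorch implementation of this straight-through gate is sketched below: the forward pass thresholds the topological score $p_t$, while the backward pass uses the derivative of a hard sigmoid (identity on (0, 1), zero elsewhere). It is a sketch of the technique, not the authors' code.

```python
import torch

class StraightThroughGate(torch.autograd.Function):
    """Straight-through estimator for the binary gate d_t (Sec 4.2)."""

    @staticmethod
    def forward(ctx, p):
        ctx.save_for_backward(p)
        return (p > 0.5).float()            # d_t = 1[p_t > 0.5]

    @staticmethod
    def backward(ctx, grad_output):
        (p,) = ctx.saved_tensors
        # Derivative of the hard sigmoid clamp(x, 0, 1): 1 inside (0, 1), 0 outside.
        # Since p_t is a sigmoid output, the gradient effectively passes straight through.
        surrogate = ((p > 0.0) & (p < 1.0)).float()
        return grad_output * surrogate

# usage: d_t = StraightThroughGate.apply(p_t); h = d_t * gru_message
```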

Finally, after the tree is completely decoded, we derive its continuous representation by concatenating the root label distribution and the sum of its inward messages:

$\mathbf{h}_{T_Y} = \big[\mathbf{q}_{root}\ ;\ \textstyle\sum_{k \in N(root)} \mathbf{h}_{k, root}\big]$    (14)

We implement the discriminator as a multi-layer feedforward network, and train the adversary using Wasserstein GAN with gradient penalty (Arjovsky et al., 2017; Gulrajani et al., 2017). The whole algorithm is described in Algorithm 1.

1:  for each training iteration do
2:     (1) Train variational JTNN (Sec. 4.1)
3:     Sample a batch of pairs $(X, Y) \sim p(\mathcal{X}, \mathcal{Y})$ and compute the latent code $\mathbf{z}$ for each molecule $Y$.
4:     Train the encoder/decoder by minimizing $\mathcal{L}_{vae}(X, Y)$.
5:     (2) Train the discriminator for $n_D$ iterations
6:     for $k = 1$ to $n_D$ do
7:        Sample batches $\{X\} \sim p(\mathcal{X})$ and $\{Y\} \sim p(\mathcal{Y})$.
8:        Let $T_Y$ be the junction tree of molecule $Y$. For each $Y$, compute its continuous representation $\mathbf{h}_{T_Y}$ by unrolling the decoder with teacher forcing.
9:        Encode each molecule $X$ and sample latent codes $\mathbf{z} \sim P(\mathbf{z})$.
10:        For each $X$, unroll the decoder by feeding the predicted labels and tree topologies to construct the translated junction tree $T_{\hat{Y}}$, and compute its continuous representation $\mathbf{h}_{T_{\hat{Y}}}$.
11:        Update the discriminator $D(\cdot)$ by minimizing $D(\mathbf{h}_{T_{\hat{Y}}}) - D(\mathbf{h}_{T_Y})$ (averaged over the batch) along with the gradient penalty.
12:     end for
13:     (3) Train the encoder/decoder adversarially
14:     Sample batches $\{X\} \sim p(\mathcal{X})$ and $\{Y\} \sim p(\mathcal{Y})$.
15:     Repeat lines 8-10.
16:     Update the encoder/decoder by minimizing $-D(\mathbf{h}_{T_{\hat{Y}}})$ (averaged over the batch).
17:  end for
Algorithm 1 Training VJTNN with Adversarial Scaffold Regularization
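
For reference, one critic update of Algorithm 1 (line 11) with the gradient penalty of Gulrajani et al. (2017) could look as follows; the interpolation scheme and the penalty weight of 10 are the standard WGAN-GP defaults and an assumption here, not values restated from the paper.

```python
import torch

def discriminator_step(D, h_real, h_fake, optimizer, gp_weight=10.0):
    """Sketch of one WGAN-GP critic update over the continuous tree
    representations h_{T_Y} (real) and h_{T_Yhat} (fake)."""
    h_real, h_fake = h_real.detach(), h_fake.detach()   # only D is updated in this step
    eps = torch.rand(h_real.size(0), 1)
    interp = (eps * h_real + (1 - eps) * h_fake).requires_grad_(True)
    grad = torch.autograd.grad(D(interp).sum(), interp, create_graph=True)[0]
    gp = ((grad.norm(2, dim=1) - 1) ** 2).mean()        # gradient penalty

    loss = D(h_fake).mean() - D(h_real).mean() + gp_weight * gp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```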

5 Experiments

Data Our graph-to-graph translation models are evaluated on three molecular optimization tasks. Following standard practice in MMPA, we construct training sets by sampling molecular pairs $(X, Y)$ with significant property improvement and molecular similarity $\mathrm{sim}(X, Y) \geq \delta$. The similarity constraint is also enforced at evaluation time to exclude arbitrary mappings that completely ignore the input $X$. We measure molecular similarity by computing Tanimoto similarity over Morgan fingerprints (Rogers & Hahn, 2010); a code sketch of this similarity computation follows the task descriptions below. Next we describe how these tasks are constructed.


  • Penalized logP We first evaluate our methods on the constrained optimization task proposed by Jin et al. (2018). The goal is to improve the penalized logP score of molecules under a similarity constraint. Following their setup, we experiment with two similarity constraints, and we extracted 120K and 53K translation pairs respectively from the ZINC dataset (Sterling & Irwin, 2015; Jin et al., 2018) for training. We use their validation and test sets for evaluation.

  • Drug likeness (QED) Our second task is to improve the drug likeness of compounds. Specifically, the model needs to translate molecules with QED scores (Bickerton et al., 2012) in the range [0.7, 0.8] into the higher range [0.9, 1.0]. This task is challenging as the target range contains only the top 6.6% of molecules in the ZINC dataset. We extracted a training set of 88K molecule pairs under the similarity constraint. The validation and test sets contain 1000 molecules each.

  • Dopamine Receptor (DRD2) The third task is to improve a molecule's biological activity against a biological target, the dopamine type 2 receptor (DRD2). We use a trained model from Olivecrona et al. (2017) to assess the probability that a compound is active. We ask the model to translate molecules with a low predicted probability of activity ($p < 0.05$) into active compounds with $p \geq 0.5$. The active compounds represent only 1.9% of the dataset. Under the similarity constraint, we derived a training set of 34K molecular pairs from ZINC and the dataset collected by Olivecrona et al. (2017). The validation and test sets contain 1000 molecules each.
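
As referenced above, the following RDKit sketch shows how a candidate training pair can be filtered by Tanimoto similarity over Morgan fingerprints and by property improvement; the fingerprint radius and bit size are common defaults, not settings reported in the paper.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_x, smiles_y, radius=2, n_bits=2048):
    """Tanimoto similarity over Morgan fingerprints, used for the pair
    similarity constraint sim(X, Y) >= delta."""
    fp_x = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_x), radius, nBits=n_bits)
    fp_y = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_y), radius, nBits=n_bits)
    return DataStructs.TanimotoSimilarity(fp_x, fp_y)

def keep_pair(smiles_x, smiles_y, prop, delta, min_improvement):
    """Keep (X, Y) as a training pair if it is similar enough and Y improves the property."""
    return (tanimoto(smiles_x, smiles_y) >= delta and
            prop(smiles_y) - prop(smiles_x) >= min_improvement)
```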

Baselines We compare our approaches (VJTNN and VJTNN+GAN) with the following baselines:


  • MMPA: We utilized the implementation of Dalke et al. (2018) to perform MMPA. Molecular transformation rules are extracted from ZINC and the dataset of Olivecrona et al. (2017) for the corresponding tasks. During testing, we translate a molecule multiple times using the different matching transformation rules that have the highest average property improvements in the database (Appendix B).

  • Junction Tree VAE: Jin et al. (2018) is a state-of-the-art generative model over molecules that applies gradient ascent over the learned latent space to generate molecules with improved properties. Our encoder-decoder architecture is closely related to their autoencoder model.

  • VSeq2Seq: Our second baseline is a variational sequence-to-sequence translation model that uses SMILES strings to represent molecules and has been successfully applied to other molecule generation tasks (Gómez-Bombarelli et al., 2016). Specifically, we augment the architecture of Bahdanau et al. (2014) with stochastic latent codes learned in a similar way to our VAE model.

  • GCPN: GCPN (You et al., 2018a) is a reinforcement learning based model that modifies a molecule by iteratively adding or deleting atoms and bonds. They also adopt adversarial training to enforce naturalness of the generated molecules.

Model Configuration For our models, we set the latent code dimension to a small value and fix the KL regularization weight $\lambda_{KL}$. We found that with a much larger latent code dimension, the model degenerates to an autoencoder that reconstructs the target molecule $Y$ from the latent code and completely ignores the source molecule $X$. Model performance is also greatly degraded when the KL regularization weight is poorly chosen. Due to limited space, we defer other hyper-parameter settings to the appendix.

5.1 Results

Method       Improvement   Diversity  |  Improvement   Diversity
MMPA                       0.366      |                0.432
JT-VAE                     -          |                -
GCPN                       -          |                -
VSeq2Seq                   0.122      |                0.294
VJTNN                      0.310      |                0.497
VJTNN+GAN                  0.311      |                0.522
Table 1: Translation performance on the penalized logP task; the left and right column groups correspond to the two similarity constraints. GCPN results are from You et al. (2018a).
Method       |        QED                   |        DRD2
             | Success  Diversity  Novelty  | Success  Diversity  Novelty
MMPA         | 20.8%    0.225      100%     | 35.6%    0.210      99.8%
JT-VAE       | 8.8%     -          -        | 3.4%     -          -
GCPN         | 9.4%     0.216      100%     | 4.4%     0.152      100%
VSeq2Seq     | 38.0%    0.199      78.8%    | 52.1%    0.027      20.1%
VJTNN        | 54.2%    0.387      99.3%    | 78.5%    0.206      79.0%
VJTNN+GAN    | 56.9%    0.377      99.7%    | 81.0%    0.208      79.1%
Table 2: Translation performance on the QED and DRD2 tasks. JT-VAE and GCPN results are computed by running the open-source implementations of Jin et al. (2018) and You et al. (2018a).

We quantitatively analyze the translation accuracy, diversity, and novelty of different methods.

Translation Accuracy We measure translation accuracy as follows. On the penalized logP task, we follow the same evaluation protocol as GCPN: for each source molecule, we decode $K$ times with different latent codes $\mathbf{z}$, and report the molecule having the highest property improvement under the similarity constraint. $K$ is set so that the evaluation is comparable with the baselines. On the QED and DRD2 datasets, we report the success rate of the learned translations. We define a translation as successful if one of the 50 translation candidates satisfies the similarity constraint and its property score falls in the target range (QED in [0.9, 1.0] and DRD2 with $p \geq 0.5$).

Tables 1 and 2 show the performance of all models across the three datasets. Our best model significantly outperforms all baselines, yielding 8.5% to 55.7% relative gain over the best baseline across different tasks. We also found that VJTNN+GAN has the most improvement on the QED and DRD2 tasks because their target domains are explicitly constrained by property ranges. Thus it is beneficial to constrain the model output distribution by adversarial training. We present more ablation studies in the appendix.

Diversity We define the diversity of a set of molecules as the average pairwise Tanimoto distance between them, where the Tanimoto distance is $\mathrm{dist}(X, Y) = 1 - \mathrm{sim}(X, Y)$. For each source molecule, we translate it 50 times (each with different latent codes), and compute the diversity over the set of validly translated molecules. (To isolate translation accuracy from the diversity measure, we exclude the failure cases from the diversity calculation, namely molecules that have no valid translation; otherwise models with lower success rates would always have lower diversity.) As we require valid translated molecules to be similar to a given compound, the diversity score is upper-bounded by the maximum allowed distance (e.g., for the QED and DRD2 datasets, the maximum diversity score is likely to be around 0.6). As shown in Tables 1 and 2, both of our methods have significantly higher diversity scores than GCPN and VSeq2Seq across all datasets, and outperform MMPA on two out of four test cases. Figure 4 shows examples of diverse translations on the QED and DRD2 tasks.
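
A sketch of this diversity computation for the translations of one source molecule, again using Morgan fingerprints with assumed default settings:

```python
from itertools import combinations
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def diversity(smiles_list, radius=2, n_bits=2048):
    """Average pairwise Tanimoto distance (1 - similarity) over the validly
    translated molecules of one source compound. Assumes valid SMILES."""
    fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), radius, nBits=n_bits)
           for s in smiles_list]
    if len(fps) < 2:
        return 0.0
    dists = [1.0 - DataStructs.TanimotoSimilarity(a, b) for a, b in combinations(fps, 2)]
    return sum(dists) / len(dists)
```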

Novelty Lastly, we report how often our model discovers new molecules in the target domain that are unseen during training. This is an important metric, as the ultimate goal of drug discovery is to design new molecules. Let $\mathcal{M}$ be the set of molecules generated by the model and $\mathcal{M}_{train}$ be the set of molecules given during training. We define novelty as the fraction of generated molecules not seen during training, $1 - |\mathcal{M} \cap \mathcal{M}_{train}| / |\mathcal{M}|$. On the QED and DRD2 datasets, our models discover new compounds much more frequently than VSeq2Seq, but less often than MMPA and GCPN. Nonetheless, these methods have a much lower translation success rate.
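
A sketch of the novelty computation, comparing canonical SMILES of the generated target-domain molecules against the training set:

```python
from rdkit import Chem

def novelty(generated_smiles, training_smiles):
    """Fraction of generated target-domain molecules not seen during training,
    compared via canonical SMILES. Assumes valid SMILES inputs."""
    train = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in training_smiles}
    gen = [Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in generated_smiles]
    return sum(1 for s in gen if s not in train) / max(len(gen), 1)
```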

Figure 4: Examples of diverse translations learned by VJTNN+GAN on the QED and DRD2 datasets.

6 Conclusion

In conclusion, we have evaluated various graph-to-graph translation models for molecular optimization. By combining the variational junction tree encoder-decoder with adversarial training, we can generate better and more diverse molecules than the baselines.

References

Appendix A Model Architecture

Tree and Graph Encoder For the graph encoder, functions $g_1(\cdot)$ and $g_2(\cdot)$ are parameterized as one-layer neural networks ($\tau(\cdot)$ represents the ReLU function):

$\boldsymbol{\nu}_{uv}^{(t)} = \tau\big(\mathbf{W}_1^g \mathbf{f}_u + \mathbf{W}_2^g \mathbf{f}_{uv} + \mathbf{W}_3^g \textstyle\sum_{w \in N(u) \setminus v} \boldsymbol{\nu}_{wu}^{(t-1)}\big)$    (15)
$\mathbf{x}_u = \tau\big(\mathbf{U}_1^g \mathbf{f}_u + \textstyle\sum_{w \in N(u)} \mathbf{U}_2^g \boldsymbol{\nu}_{wu}^{(T)}\big)$    (16)

For the tree encoder, since it updates the messages for more iterations, we parameterize function $g_1(\cdot)$ as a tree GRU function for learning stability (edge features $\mathbf{f}_{uv}$ are omitted because they are always zero). We keep the same parameterization for $g_2(\cdot)$, with a different set of parameters:

$\boldsymbol{\nu}_{uv}^{(t)} = \mathrm{GRU}\big(\mathbf{f}_u, \{\boldsymbol{\nu}_{wu}^{(t-1)}\}_{w \in N(u) \setminus v}\big)$    (17)

Tree Gated Recurrent Unit The tree GRU function $\mathrm{GRU}(\mathbf{f}_i, \{\mathbf{h}_{ki}\})$ for computing the message $\mathbf{h}_{ij}$ in Eq. (3) is defined as follows (Jin et al., 2018):

$\mathbf{s}_{ij} = \textstyle\sum_{k \in N(i) \setminus j} \mathbf{h}_{ki}$    (18)
$\mathbf{z}_{ij} = \sigma\big(\mathbf{W}^z \mathbf{f}_i + \mathbf{U}^z \mathbf{s}_{ij} + \mathbf{b}^z\big)$    (19)
$\mathbf{r}_{ki} = \sigma\big(\mathbf{W}^r \mathbf{f}_i + \mathbf{U}^r \mathbf{h}_{ki} + \mathbf{b}^r\big)$    (20)
$\tilde{\mathbf{h}}_{ij} = \tanh\big(\mathbf{W} \mathbf{f}_i + \mathbf{U} \textstyle\sum_{k \in N(i) \setminus j} \mathbf{r}_{ki} \odot \mathbf{h}_{ki} + \mathbf{b}\big)$    (21)
$\mathbf{h}_{ij} = (1 - \mathbf{z}_{ij}) \odot \mathbf{s}_{ij} + \mathbf{z}_{ij} \odot \tilde{\mathbf{h}}_{ij}$    (22)
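
The tree GRU above can be written as a small standalone module; the sketch below merges some weight matrices into single linear layers, which is an equivalent reparameterization, and the names are illustrative.

```python
import torch
import torch.nn as nn

class TreeGRU(nn.Module):
    """Sketch of the tree GRU of Eqs. (18)-(22): a GRU-style update that
    aggregates an arbitrary set of incoming edge messages."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.W_z = nn.Linear(input_dim + hidden_dim, hidden_dim)
        self.W_r = nn.Linear(input_dim, hidden_dim)
        self.U_r = nn.Linear(hidden_dim, hidden_dim)
        self.W_h = nn.Linear(input_dim + hidden_dim, hidden_dim)

    def forward(self, x_i, incoming):
        # x_i: [input_dim] node features; incoming: [k, hidden_dim] messages h_{ki}
        s = incoming.sum(dim=0)                                          # Eq. (18)
        z = torch.sigmoid(self.W_z(torch.cat([x_i, s], dim=-1)))         # Eq. (19)
        r = torch.sigmoid(self.W_r(x_i) + self.U_r(incoming))            # Eq. (20): one gate per message
        h_tilde = torch.tanh(self.W_h(torch.cat([x_i, (r * incoming).sum(dim=0)], -1)))  # Eq. (21)
        return (1 - z) * s + z * h_tilde                                 # Eq. (22)
```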

Tree Decoder Attention The attention mechanism is implemented as a bilinear function between the decoder state $\mathbf{h}_t$ and the source tree and graph vectors, normalized by the softmax function:

$\alpha_{t,i}^T = \mathrm{softmax}_i\big(\mathbf{h}_t^\top \mathbf{A}^T \mathbf{x}_i^T\big), \qquad \alpha_{t,j}^G = \mathrm{softmax}_j\big(\mathbf{h}_t^\top \mathbf{A}^G \mathbf{x}_j^G\big)$    (23)

Graph Decoder We use the same graph neural architecture as Jin et al. (2018) for scoring candidate attachments. Let $G_i$ be the graph resulting from a particular merging of cluster $C_i$ in the tree with its neighbors $C_j$, $j \in N_{\hat{T}}(i)$, and let $u, v$ denote atoms in the graph $G_i$. The main challenge of attachment scoring is local isomorphism: suppose there are two neighbors $C_j$ and $C_k$ with the same cluster label. Since they share the same cluster label, exchanging the positions of $C_j$ and $C_k$ will lead to isomorphic graphs. However, these two cliques are actually not exchangeable if the subtrees under $j$ and $k$ are different (illustrations can be found in Jin et al. (2018)). Therefore, we need to incorporate information about those subtrees when scoring the attachments.

To this end, we define the index $\alpha_v = i$ if atom $v$ belongs to cluster $C_i$ and $\alpha_v = j$ if $v$ belongs only to a neighbor cluster $C_j$. The index $\alpha_v$ is used to mark the position of the atoms in the junction tree, and to retrieve the message $\hat{\mathbf{m}}_{ij}$ summarizing the subtree under $i$ along the edge $(i, j)$, obtained by running the tree encoding algorithm. The tree messages are augmented into the graph message passing network to avoid local isomorphism:

$\boldsymbol{\mu}_{uv}^{(t)} = \tau\big(\mathbf{W}_1^a \mathbf{f}_u + \mathbf{W}_2^a \mathbf{f}_{uv} + \mathbf{W}_3^a \tilde{\boldsymbol{\mu}}_{uv}^{(t-1)}\big)$    (24)
$\tilde{\boldsymbol{\mu}}_{uv}^{(t-1)} = \begin{cases} \sum_{w \in N(u) \setminus v} \boldsymbol{\mu}_{wu}^{(t-1)} & \alpha_u = \alpha_v \\ \hat{\mathbf{m}}_{\alpha_u, \alpha_v} + \sum_{w \in N(u) \setminus v} \boldsymbol{\mu}_{wu}^{(t-1)} & \alpha_u \neq \alpha_v \end{cases}$    (25)

The final representation of graph $G_i$ is $\mathbf{m}_{G_i} = \sum_{v \in G_i} \boldsymbol{\mu}_v$, where

$\boldsymbol{\mu}_v = \tau\big(\mathbf{U}_1^a \mathbf{f}_v + \textstyle\sum_{u \in N(v)} \mathbf{U}_2^a \boldsymbol{\mu}_{uv}^{(T)}\big)$    (26)

Adversarial Scaffold Regularization Algorithm 2 describes the tree decoding algorithm used for adversarial training. It replaces the ground truth node features $\mathbf{f}_{i_t}$ with the predicted label distributions $\mathbf{q}_{i_t}$, enabling gradient propagation from the discriminator.

0:  Input: Source tree and graph vectors $\{\tilde{\mathbf{x}}_i^T\}, \{\tilde{\mathbf{x}}_j^G\}$
1:  Initialize: Tree $\hat{T} \leftarrow \emptyset$; global counter $t \leftarrow 0$
2:  function DecodeTree($i$)
3:     repeat
4:        $t \leftarrow t + 1$
5:        Predict the topology score $p_t$ with Eq. (6), replacing the one-hot encoding with the predicted label distribution $\mathbf{q}_i$.
6:        if $p_t > 0.5$ then
7:           Create a child node $j$ and add it to tree $\hat{T}$.
8:           Predict the node label distribution $\mathbf{q}_j$ with Eq. (9).
9:           Compute the message $\mathbf{h}_{i,j}$ with Eq. (13).
10:           DecodeTree($j$)
11:        end if
12:     until $p_t \leq 0.5$
13:     Let $k$ be the parent node of $i$. Compute the message $\mathbf{h}_{i,k}$ with Eq. (13).
14:  end function
Algorithm 2 Soft Tree Decoding for Adversarial Regularization

Appendix B Experimental Details

Training Details We elaborate on the hyper-parameters used in our experiments. For our models, the hidden state dimension is 300 and the latent code dimension is much smaller. The tree encoder runs message passing for 6 iterations, and the graph encoder runs for 3 iterations. The entire model has 3.9M parameters. For VSeq2Seq, the encoder is a one-layer bidirectional LSTM and the decoder is a one-layer unidirectional LSTM. The attention scores are computed in the same way as in Bahdanau et al. (2014). We set the hidden state dimension of the recurrent encoder and decoder to 300 so that the VSeq2Seq model has a comparable 3.8M parameters. For VSeq2Seq, we found that the model performance stays similar under different KL regularization weights.

All models are trained with the Adam optimizer for 20 epochs with learning rate 0.001. We anneal the learning rate by 0.9 for every epoch. For adversarial training, our discriminator is a three-layer feed-forward network with hidden layer dimension 300 and LeakyReLU activations. The discriminator is trained for $n_D$ iterations per encoder/decoder update, with gradient penalty weight $\lambda$.
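
A sketch of this training setup in PyTorch, assuming a standard model object; the discriminator input dimension depends on the size of the tree representation $\mathbf{h}_{T_Y}$ and is left as a parameter.

```python
import torch
import torch.nn as nn

def build_training_components(model, disc_input_dim, hidden_dim=300):
    """Adam (lr 0.001) with a 0.9 per-epoch decay, plus a three-layer LeakyReLU
    feed-forward discriminator, as described above. Illustrative sketch."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)  # call scheduler.step() each epoch
    discriminator = nn.Sequential(
        nn.Linear(disc_input_dim, hidden_dim), nn.LeakyReLU(),
        nn.Linear(hidden_dim, hidden_dim), nn.LeakyReLU(),
        nn.Linear(hidden_dim, 1),
    )
    return optimizer, scheduler, discriminator
```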

Property Calculation The penalized logP is calculated using You et al. (2018a)’s implementation, which utilizes RDKit (Landrum, 2006) to compute clogP and synthetic accessibility scores. The QED scores are also computed using RDKit’s built-in functionality. The DRD2 activity prediction model is downloaded from https://github.com/MarcusOlivecrona/REINVENT/blob/master/data/clf.pkl.
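
The property scores can be reproduced approximately with RDKit as sketched below; the penalized logP shown here omits the long-cycle penalty and the training-set standardization used in You et al. (2018a)'s implementation, and the sascorer import path is a common convention rather than something specified in the paper.

```python
import os
import sys
from rdkit import Chem, RDConfig
from rdkit.Chem import Crippen, QED

# sascorer ships in RDKit's contrib directory; this import path is an assumption.
sys.path.append(os.path.join(RDConfig.RDContribDir, 'SA_Score'))
import sascorer

def qed_score(smiles):
    """QED via RDKit's built-in implementation."""
    return QED.qed(Chem.MolFromSmiles(smiles))

def penalized_logp_sketch(smiles):
    """Rough sketch of penalized logP: clogP minus synthetic accessibility.
    The full implementation additionally subtracts a long-cycle penalty and
    standardizes each term over the training set; those details are omitted."""
    mol = Chem.MolFromSmiles(smiles)
    return Crippen.MolLogP(mol) - sascorer.calculateScore(mol)
```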

MMPA Procedure We utilized the open-source toolkit mmpdb (Dalke et al., 2018) to perform matched molecular pair (MMP) analysis (https://github.com/rdkit/mmpdb). On the logP and QED tasks, we constructed a database of transformation rules extracted from the ZINC dataset (with test set molecules excluded). On the DRD2 task, the database is constructed from both ZINC and the dataset from Olivecrona et al. (2017). During testing, each molecule is translated 50 times with different matching rules. When there are more than 50 matching rules, we choose those with the highest average property improvement. This statistic is calculated during database construction.

Training Set Curation The training set for the penalized logP task is curated from the ZINC dataset of 250K molecules (Jin et al., 2018). A molecular pair $(X, Y)$ is selected into the training set if it satisfies the Tanimoto similarity constraint and the property improvement is greater than 0.5 under the stricter similarity constraint or 2.5 under the looser one. On the QED and DRD2 tasks, we select training molecular pairs if the similarity constraint is satisfied and $X$ and $Y$ fall into the source and target property ranges, respectively. Figure 5 shows the histograms of the QED and DRD2 score distributions. On the QED dataset, the percentile range of the source domain is 37.9-65.5%, and that of the target domain is 0-6.6%. On the DRD2 dataset, the percentile range of the target domain is 0-1.9%.

Ablation Study We investigate the translation accuracy of our models as a function of the number of translation candidates $K$. As shown in Figure 6, model performance increases greatly when more translation candidates are included. We also notice that on the QED and DRD2 datasets, the performance difference between VJTNN and VJTNN+GAN is much larger when $K$ is small, compared to larger values of $K$. This shows that the model with adversarial training generates valid translations much more frequently.

Figure 5: QED and DRD2 score histograms. The x-axis of the DRD2 plot is shown on a logarithmic scale.
Figure 6: Ablation study of translation accuracy against the number of translation attempts $K$.