Code for "Discrete Object Generation with Reversible Inductive Construction"
The success of generative modeling in continuous domains has led to a surge of interest in generating discrete data such as molecules, source code, and graphs. However, construction histories for these discrete objects are typically not unique and so generative models must reason about intractably large spaces in order to learn. Additionally, structured discrete domains are often characterized by strict constraints on what constitutes a valid object and generative models must respect these requirements in order to produce useful novel samples. Here, we present a generative model for discrete objects employing a Markov chain where transitions are restricted to a set of local operations that preserve validity. Building off of generative interpretations of denoising autoencoders, the Markov chain alternates between producing 1) a sequence of corrupted objects that are valid but not from the data distribution, and 2) a learned reconstruction distribution that attempts to fix the corruptions while also preserving validity. This approach constrains the generative model to only produce valid objects, requires the learner to only discover local modifications to the objects, and avoids marginalization over an unknown and potentially large space of construction histories. We evaluate the proposed approach on two highly structured discrete domains, molecules and Laman graphs, and find that it compares favorably to alternative methods at capturing distributional statistics for a host of semantically relevant metrics.
Many applied domains of optimization and design would benefit from accurate generative modeling of structured discrete objects. For example, a generative model of molecular structures may aid drug or material discovery by enabling an inexpensive search for stable molecules with desired properties. Similarly, in computer-aided design (CAD), generative models may allow an engineer to sample new parts or conditionally complete partially-specified geometry. Indeed, recent work has aimed to extend the success of learned generative models in continuous domains, such as images and audio, to discrete data including graphs (You et al., 2018; Li et al., 2018), molecules (Gómez-Bombarelli et al., 2018; Kusner et al., 2017), and program source code (Yin and Neubig, 2017; Murali et al., 2018).
However, discrete domains present particular challenges to generative modeling. Discrete data structures often exhibit non-unique representations, e.g., up to $n!$ equivalent adjacency matrix representations for a graph with $n$ nodes. Models that perform additive construction, incrementally building a graph from scratch (You et al., 2018; Li et al., 2018), are flexible but face the prospect of reasoning over an intractable number of possible construction paths. For example, You et al. (2018) leverage a breadth-first-search (BFS) to reduce the number of possible construction sequences, while Simonovsky and Komodakis (2018) avoid additive construction and instead directly decode an adjacency matrix from a latent space, at the cost of requiring approximate graph matching to compute reconstruction error.
In addition, discrete domains are often accompanied by prespecified hard constraints denoting what constitutes a valid object. For example, molecular structures represented as SMILES strings (Weininger, 1988) must follow strict syntactic and semantic rules in order to be decoded to a real compound. Recent work has aimed to improve the validity of generated samples by leveraging the SMILES grammar (Kusner et al., 2017; Dai et al., 2018) or encouraging validity via reinforcement learning (Janz et al., 2018). Operating directly on chemical graphs, Jin et al. (2018) leverage chemical substructures encountered during training to build valid molecular graphs and De Cao and Kipf (2018) encourage validity for small molecules via adversarial training. In other graph-structured domains, strict topological constraints may be encountered. For example, Laman graphs (Laman, 1970), a class of geometric constraint graphs, require the relative number of nodes and edges in each subgraph to meet certain conditions in order to represent well-constrained geometry.
In this work we take the broad view that graphs provide a universal abstraction for reasoning about structure and constraints on discrete spaces. This is not a new take on discrete spaces: graph-based representations such as factor graphs (Kschischang et al., 2001), error-correcting codes (Gallager, 1962), constraint graphs (Montanari, 1974), and conditional random fields (Lafferty et al., 2001) are all examples of ways that hard and soft constraints are regularly imposed on structured prediction tasks, satisfiability problems, and sets of random variables.
We propose to model discrete objects by constructing a Markov chain where each possible state corresponds to a valid object. Learned transitions are restricted to a set of local inductive moves, defined as minimal insert and delete operations that maintain validity. Leveraging the generative model interpretation of denoising autoencoders (Bengio et al., 2013), the chain employed here alternatingly samples from two conditional distributions: a fixed distribution over corrupting sequences and a learned distribution over reconstruction sequences. The equilibrium distribution of the chain serves as the generative model, approximating the target data-generating distribution.
This simple framework allows the learned component, the reconstruction model, to be treated as a standard supervised learning problem for multi-class classification. Each reconstruction step is parameterized as a categorical distribution over adjacent objects, those that are one inductive move away from the input object. Given a local corrupter, the target reconstruction distribution is also local, containing fewer modes and potentially being easier to learn than the full data-generating distribution (Bengio et al., 2013). In addition, various hard constraints, such as validity rules or requiring the inclusion of a specific substructure, are incorporated naturally.
One limitation of the proposed approach is its expensive sampling procedure, requiring Gibbs sampling at deployment time, whereas the time complexity of samplers based on additive construction is typically linear in the object size. Nevertheless, in many areas of engineering and design, it is the downstream tasks following initial proposals that serve as the time bottleneck. For example, in drug design, wet lab experiments and controlled clinical trials are far more time intensive than empirically adequate mixing for the proposed method's Markov chain. In addition, as an implicit generative model, the proposed approach is not equipped to explicitly provide access to predictive probabilities. We compare statistics for a host of semantically meaningful features from sets of generated samples with the corresponding empirical distributions in order to evaluate the model's generative capabilities.
We test the proposed approach on two complex discrete domains: molecules and Laman graphs (Laman, 1970), a class of geometric constraint graphs applied in CAD, robotics, and polymer physics. Quantitative evaluation indicates that the proposed method can effectively model highly structured discrete distributions while adhering to strict validity constraints.
Let $p(x)$ be an unknown probability mass function over a discrete domain, $D$, from which we have observed data. We assume there are constraints on what constitutes a valid object, where $V \subseteq D$ is the subset of valid objects in $D$, and $p(x) = 0$ for all $x \notin V$. For example, in the case of molecular graphs, an invalid object may violate atom-specific valence rules. Our goal is to learn a generative model $p_\theta(x)$, approximating $p(x)$, with support restricted to the valid subset.
We formulate our approach, generative reversible inductive construction (GenRIC), as the equilibrium distribution of a Markov chain that only visits valid objects, without a need for inefficient rejection sampling. The chain's transitions are restricted to legal inductive moves. Here, an inductive move is a local insert or delete operation that, when executed on a valid object, results in another valid object. The Markov kernel then needs to be learned such that its equilibrium distribution approximates $p(x)$ over the valid subspace.
The desired Markov kernel is formulated as successive sampling between two conditional distributions, one fixed and one learned, a setup originally proposed to extract the generative model implicit in denoising autoencoders (Bengio et al., 2013). A single transition of the Markov chain involves first sampling from a fixed corrupting distribution $c(\tilde{x} \mid x)$ and then sampling from a learned reconstruction distribution $p_\theta(x \mid \tilde{x})$. While the corrupter is free to damage $x$, validity constraints are built into both conditional distributions. The joint data-generating distribution over original and corrupted samples is defined as $p(x, \tilde{x}) = c(\tilde{x} \mid x)\, p(x)$, which is also uniquely defined by the corrupting distribution and the target reconstruction distribution, $p(x \mid \tilde{x})$. We use supervised learning to train a reconstruction distribution model $p_\theta(x \mid \tilde{x})$ to approximate $p(x \mid \tilde{x})$. Together, the corruption and learned reconstruction distributions define a Gibbs sampling procedure that asymptotically samples from the marginal $p_\theta(x)$, approximating the data marginal $p(x)$.
Given a reasonable set of conditions on the support of these two conditionals and the consistency of the employed learning algorithm, the learned joint distribution can be shown to be asymptotically consistent over the Markov chain, converging to the true data-generating distribution in the limit of infinite training data and modeling capacity (Bengio et al., 2013). However, in the more realistic case of estimation with finite training data and capacity, a valid concern arises regarding the effect of an imperfect reconstruction model on the chain's equilibrium distribution. To this end, Alain et al. (2016) adapt a result from perturbation theory (Schweitzer, 1968) for finite state Markov chains to show that as the learned transition matrix becomes arbitrarily close to the target transition matrix, the equilibrium distribution also becomes arbitrarily close to the target joint distribution. For the discrete domains of interest here, we can enforce a finite state space by simply setting a maximum object size.
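Let $c(s \mid x)$ be a fixed conditional distribution over a sequence of corrupting operations $s = [s_1, s_2, \ldots, s_k]$, where $k$ is a random variable representing the total number of steps and each $s_i \in \mathrm{Ind}(\tilde{x}_i)$, where $\mathrm{Ind}(\tilde{x}_i)$ is a set of legal inductive moves for a given $\tilde{x}_i$. The probability of arriving at corrupted sample $\tilde{x}$ from $x$ is

$$c(\tilde{x} \mid x) = \sum_{s \in S(x, \tilde{x})} c(\tilde{x}, s \mid x),$$

where $S(x, \tilde{x})$ denotes the set of all corrupting sequences from $x$ to $\tilde{x}$. Thus, the joint data-generating distribution is

$$p(x, s, \tilde{x}) = c(\tilde{x}, s \mid x)\, p(x),$$

where $c(\tilde{x}, s \mid x) = 0$ if $s \notin S(x, \tilde{x})$.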
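Given a corrupted sample, we aim to train a reconstruction distribution model $p_\theta(x \mid \tilde{x})$ to maximize the expected conditional probability of recovering the original, uncorrupted sample. Thus, we wish to find the parameters $\theta^*$ that minimize the expected KL divergence between the true $p(s, x \mid \tilde{x})$ and learned $p_\theta(s, x \mid \tilde{x})$,

$$\theta^* = \arg\min_\theta \, \mathbb{E}_{p(\tilde{x})}\left[ D_{\mathrm{KL}}\big( p(s, x \mid \tilde{x}) \,\|\, p_\theta(s, x \mid \tilde{x}) \big) \right],$$

which amounts to maximum likelihood estimation of $p_\theta(s, x \mid \tilde{x})$ and likewise $p_\theta(x \mid \tilde{x})$. The above is an expectation over the joint data-generating distribution, $p(x, s, \tilde{x})$, which we can sample from by drawing a data sample and then conditionally drawing a corruption sequence:

$$x \sim p(x), \qquad (\tilde{x}, s) \sim c(\tilde{x}, s \mid x).$$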
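The link between the KL objective and maximum likelihood can be made explicit with a short expansion. The notation below ($\tilde{x}$ for the corrupted sample, $s$ for the corruption sequence, $p_\theta$ for the learned model) is assumed for this sketch.

```latex
\begin{align*}
\mathbb{E}_{p(\tilde{x})}\left[ D_{\mathrm{KL}}\big( p(s, x \mid \tilde{x}) \,\|\, p_\theta(s, x \mid \tilde{x}) \big) \right]
  &= \mathbb{E}_{p(x, s, \tilde{x})}\left[ \log p(s, x \mid \tilde{x}) \right]
   - \mathbb{E}_{p(x, s, \tilde{x})}\left[ \log p_\theta(s, x \mid \tilde{x}) \right].
\end{align*}
% The first term does not depend on \theta, so
\begin{align*}
\arg\min_\theta \, \mathbb{E}_{p(\tilde{x})}\left[ D_{\mathrm{KL}}(\cdot \,\|\, \cdot) \right]
  &= \arg\max_\theta \, \mathbb{E}_{p(x, s, \tilde{x})}\left[ \log p_\theta(s, x \mid \tilde{x}) \right].
\end{align*}
```

In practice the remaining expectation would be estimated by Monte Carlo: draw a data sample, corrupt it, and score the learned model on the pair.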
In general, we are afforded flexibility when selecting a corruption distribution, given certain conditions for ergodicity are met. We implement a simple fixed distribution over corrupting sequences approximately following these steps: 1) Sample a number of moves $k$ from a geometric distribution. 2) For each move, sample a move type from {Insert, Delete}. 3) Sample from among the legal operations available for the given move type. We make minor adjustments to the weighting of available operations for specific domains. See Appendix F for full details.
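The three steps above can be sketched on a toy domain. Everything in this block is an assumption made for illustration: the "objects" are small sets of vocabulary items rather than molecules or graphs, the helper names (`legal_moves`, `apply_move`, `corrupt`) are hypothetical, the geometric distribution is taken to start at one move, and the fallback to the other move type when none is legal is one possible choice rather than the procedure of Appendix F.

```python
import random

# Toy stand-in domain (assumption): an object is a frozenset of vocabulary
# items, and it is "valid" when it holds between 1 and MAX_SIZE items.
VOCAB = ("a", "b", "c", "d", "e")
MAX_SIZE = 4


def legal_moves(obj, move_type):
    """Return the legal operations of one type; each keeps the object valid."""
    if move_type == "insert":
        if len(obj) >= MAX_SIZE:
            return []
        return [("insert", v) for v in VOCAB if v not in obj]
    if len(obj) <= 1:
        return []
    return [("delete", v) for v in sorted(obj)]


def apply_move(obj, move):
    """Apply a single insert or delete move and return the new object."""
    kind, item = move
    return obj | {item} if kind == "insert" else obj - {item}


def sample_num_moves(rng, expected_steps):
    """Geometric sample on {1, 2, ...} with mean `expected_steps`."""
    p = 1.0 / expected_steps
    k = 1
    while rng.random() > p:
        k += 1
    return k


def corrupt(obj, rng, expected_steps=5.0):
    """Sample a corruption sequence; return (corrupted object, list of moves)."""
    moves = []
    for _ in range(sample_num_moves(rng, expected_steps)):
        move_type = rng.choice(["insert", "delete"])
        options = legal_moves(obj, move_type)
        if not options:  # assumption: fall back to the other move type
            move_type = "delete" if move_type == "insert" else "insert"
            options = legal_moves(obj, move_type)
        move = rng.choice(options)
        obj = apply_move(obj, move)
        moves.append(move)
    return obj, moves
```

Because every move is drawn from the legal set of the current object, each intermediate object stays valid by construction, which is the property the corrupter needs.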
The geometric distribution over corruption sequence length ensures exponentially decreasing support with edit distance, and likewise the support of the target reconstruction distribution is local to the conditioned corrupted object. The globally non-zero (yet exponentially decreasing) support of both the corruption and reconstruction distributions trivially satisfies the conditions required in Corollary A2 from Alain et al. (2016) for the chain defined by the corresponding Gibbs sampler to be ergodic. Alternatively, one could employ conditional distributions with truncated support after some edit distance and still satisfy ergodicity conditions via the stronger Corollary A3 from Alain et al. (2016).
The reported experiments use a geometric distribution with five expected steps for the corruption sequence length. In general, we observe that shorter corruption lengths lead to better samples, though we did not seek to specially optimize this hyperparameter for generation quality. See Appendix A for some results with other step lengths.
A sequence of corrupting operations $s = [s_1, s_2, \ldots, s_k]$ corresponds to a sequence of visited corrupted objects $[\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_k]$ after execution on an initial sample $x$. We enforce the corrupter to be Markov such that its distribution over the next corruption operation to perform depends only on the current object. Likewise, the target reconstruction distribution is then also Markov, and we factorize the learned reconstruction sequence model as the product of memoryless transitions culminating with a stop token:

$$p_\theta(s_{\mathrm{rev}} \mid \tilde{x}) = p_\theta(\mathrm{stop} \mid x) \prod_{i=1}^{k} p_\theta(s_{\mathrm{rev}_i} \mid \tilde{x}_{k-i+1}),$$

where $s_{\mathrm{rev}} = [s_{\mathrm{rev}_1}, s_{\mathrm{rev}_2}, \ldots, s_{\mathrm{rev}_k}]$, the reverse of the corrupting operation sequence. If a stop token is sampled from the model, reconstruction ceases and the next corruption sequence begins. For the molecule model, an additional "revisit" stop criterion is also used: the reconstruction ceases when a molecule is revisited (see Section D.1 for details).
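A minimal sketch of the resulting sampling loop, alternating a fixed corrupter with step-by-step reconstruction until a stop token, is given below. It is a toy under stated assumptions: objects are small sets, the data distribution is a single hypothetical object `TARGET`, and `reconstruction_step` is a hand-written stand-in for the learned model (the real model is a trained, stochastic network). All function names are illustrative.

```python
import random

VOCAB = ("a", "b", "c", "d")
TARGET = frozenset({"a", "b"})  # toy "data distribution": a single object


def legal_moves(obj):
    """All single insert/delete moves that keep 1 <= len(obj) <= len(VOCAB)."""
    inserts = [("insert", v) for v in VOCAB if v not in obj]
    deletes = [("delete", v) for v in sorted(obj)] if len(obj) > 1 else []
    return inserts + deletes


def apply_move(obj, move):
    kind, item = move
    return obj | {item} if kind == "insert" else obj - {item}


def corrupt(obj, rng, stop_prob=0.2):
    """Fixed corrupter: random legal moves, geometric number of steps."""
    while True:
        obj = apply_move(obj, rng.choice(legal_moves(obj)))
        if rng.random() < stop_prob:
            return obj


def reconstruction_step(obj, rng):
    """Stand-in for the learned model: a legal move, or the string 'stop'."""
    helpful = [m for m in legal_moves(obj)
               if len(apply_move(obj, m) ^ TARGET) < len(obj ^ TARGET)]
    if not helpful:
        return "stop"
    return rng.choice(helpful)


def reconstruct(obj, rng, max_steps=100):
    """Apply memoryless reconstruction steps until the stop token is sampled."""
    for _ in range(max_steps):
        move = reconstruction_step(obj, rng)
        if move == "stop":
            break
        obj = apply_move(obj, move)
    return obj


def run_chain(x0, num_transitions, rng):
    """Alternate corruption and reconstruction; return the visited samples."""
    samples, x = [], x0
    for _ in range(num_transitions):
        x = reconstruct(corrupt(x, rng), rng)
        samples.append(x)
    return samples
```

Each reconstruction step conditions only on the current object, mirroring the memoryless factorization; with a learned model the returned samples would vary rather than collapse onto one object as in this toy.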
For each individual step, the reconstruction model outputs a large conditional categorical distribution over $\mathrm{Ind}(\tilde{x})$, the set of legal modification operations that can be performed on an input $\tilde{x}$. We describe the general architecture employed and include domain-specific details in Sections 3 and 4.
Any operation in $\mathrm{Ind}(\tilde{x})$ may be defined in a general sense by a location on the object where the operation is performed and a vocabulary element describing which vocabulary item (if any) is involved (Fig. 1). The prespecified vocabulary consists of domain-specific substructures, a subset of which may be legally inserted or deleted from a given object. The model induces a distribution over all legal operations (which may be described as a subset of the Cartesian product of the locations and vocabulary elements) by computing location embeddings for an object and comparing those to learned embeddings for each vocabulary element.
For the graph-structured domains explored here, location embeddings are generated using a neural network structure similar to Duvenaud et al. (2015); Gilmer et al. (2017). In parallel, the set of vocabulary elements is also given a learned embedding vector. The unnormalized log-probability for a given modification is then obtained by computing the dot product of the embedding of the location where the modification is performed and the embedding of the vocabulary element involved. For most objects from the molecule and Laman graph domains, this defines a distribution over a discrete set of operations with cardinality in the tens of thousands.
We note that although our model induces a distribution over a large discrete set, it does not do so through a traditional fully-connected softmax layer. Indeed, the action space of the model is heavily factorized, ensuring that the computation is efficient. The factorization is present at two levels: the actions are separated into broad categories (e.g., insert at atom, insert at bond, delete, for molecules), that do not interact except through the normalization. Additionally, actions are further factorized through a location component and vocabulary component, that only interact through a dot product, further simplifying the model.
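The location-times-vocabulary scoring can be illustrated with a small, dependency-free sketch. The function name, the dictionary-based inputs, and the toy embeddings are assumptions for illustration; in the actual model the location embeddings come from message passing and the vocabulary embeddings are learned.

```python
import math


def operation_distribution(location_embs, vocab_embs, legal_pairs):
    """Softmax over legal (location, vocab) pairs scored by dot products.

    location_embs: dict mapping location id -> list of floats
    vocab_embs:    dict mapping vocabulary id -> list of floats
    legal_pairs:   iterable of (location id, vocabulary id) that are legal
    """
    logits = {}
    for loc, voc in legal_pairs:
        logits[(loc, voc)] = sum(
            a * b for a, b in zip(location_embs[loc], vocab_embs[voc])
        )
    max_logit = max(logits.values())  # subtract the max for stability
    weights = {op: math.exp(v - max_logit) for op, v in logits.items()}
    total = sum(weights.values())
    return {op: w / total for op, w in weights.items()}


# Toy usage with made-up two-dimensional embeddings.
location_embs = {"atom0": [1.0, 0.0], "bond0": [0.0, 1.0]}
vocab_embs = {"ring": [2.0, 0.0], "chain": [0.0, 0.0]}
legal = [("atom0", "ring"), ("atom0", "chain"), ("bond0", "ring")]
probs = operation_distribution(location_embs, vocab_embs, legal)
```

Only legal pairs receive a logit, so illegal operations have zero probability without the model needing to learn the constraints, and the pairs interact only through the shared normalization.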
Molecular structures can be defined by graphs where nodes represent individual atoms and edges represent bonds. In order for such graphs to be considered valid molecular structures by standard chemical informatics toolkits (e.g., RDKit (Landrum et al., 2006)), certain constraints must be satisfied. For example, aromatic bonds can only exist within aromatic rings, and an atom can only engage in as many bonds as permitted by its valence. By restricting the corruption and reconstruction operations to a set of modifications that respect these rules, we ensure that the resulting Markov chain will only visit valid molecular graphs.
When altering one valid molecular graph into another, we restrict the set of possible modifications to the insertion and deletion of valid substructures. The vocabulary of substructures consists of non-ring bonds, simple rings, and bridged compounds (simple rings with more than two shared atoms) present in training data. This is the same type of vocabulary proposed in Jin et al. (2018). The legal insertion and deletion operations are set as follows:
Insertion For each atom and bond of a molecular graph, we determine the subset of the vocabulary that would be chemically compatible for attachment. Then, for each compatible vocabulary substructure, the possible assemblies of it with the atom or bond of interest are enumerated (keeping its already-connected neighbors fixed). For example, when inserting a ring from the vocabulary via one of its bonds, there is often more than one valid bond to select from. Here, we only specify the 2D configuration of the molecular graph and do not account for stereochemistry.
Deletion We define the leaves of a molecule to be those substructures that can be removed while the rest of the molecular graph remains connected. Here, the set of leaves consists of either non-ring bonds, rings, or bridged compounds whose neighbors have a non-zero atom intersection. The set of possible deletions is fully specified by the set of leaf substructures. To perform a deletion, a leaf is selected and the atoms whose bonds are fully contained within the leaf node substructure are removed from the graph.
These two minimal operations provide enough support for the resulting Markov chain to be ergodic within the set of all valid molecular graphs constructible via the extracted vocabulary. As Jin et al. (2018) find, although an arbitrary molecule may not be reachable, empirically the finite vocabulary provides broad coverage over organic molecules. Further details on the location and vocabulary representations for each possible operation are given in the appendix.
While predictive probabilities are not available from the implicit generative model, we can perform posterior predictive checks on various semantically relevant metrics to compare our model's learned distribution to the data distribution. Here, we leverage three commonly used quantities when assessing drug molecules: the quantitative estimate of drug-likeness (QED) score (between 0 and 1) (Bickerton et al., 2012), the synthetic accessibility (SA) score (between 1 and 10) (Ertl and Schuffenhauer, 2009), and the log octanol-water partition coefficient (logP) (Comer and Tam, 2007). For QED, a higher value indicates a molecule is more likely to be drug-like, while for SA, a lower value indicates a molecule is more likely to be easily synthesizable. logP measures the hydrophobicity of a molecule, with a higher value indicating more hydrophobic. Together these metrics take into account a wide array of molecular features (ring count, charge, etc.), allowing for an aggregate comparison of distributional statistics.
Our goal is not to optimize these statistics but to evaluate the quality of our generative model by comparing the distribution that our model implies over these quantities to those in the original data. A good generative model would have novel molecules but those molecules would have similar aggregate statistics to real compounds. In Fig. 3, we display Gaussian kernel density estimates (KDE) of the above metrics for generated sets of molecules from six baseline methods, in addition to our own (see appendix for example chains). A normalized histogram of the ZINC training distribution is shown for visual comparison. For each method, we obtain 20K samples either by running pre-trained models (Jin et al., 2018; Gómez-Bombarelli et al., 2018; Kusner et al., 2017) or by accessing pre-sampled sets (Liu et al., 2018; Simonovsky and Komodakis, 2018; Li et al., 2018). Only novel molecules (those not appearing in the ZINC training set) are included in the metric computation, to avoid rewarding memorization of the training data. In addition, Table 1 displays bootstrapped Kolmogorov–Smirnov (KS) distances between the samples for each method and the ZINC training set.
Our method is capable of generating novel molecules that have statistics closely matched to the empirical QED and logP distributions. The SA distribution seems to be more challenging, although we still report lower mean KS distance than some recent methods. Because we allow the corrupter to uniformly select from the vocabulary, even if a particular vocabulary element occurs very rarely in training data, it can sometimes introduce molecules without an accessible synthetic route that the reconstructor does not immediately recover from. One could alter the corrupter and have it favor commonly appearing vocabulary items to mitigate this. We also note that our approach lends itself to Markov chain transitions reflecting known (or learned) chemical reactions.
In addition to the above metrics, we report a validity score (the percentage of samples that are chemically valid) for each method in Table 1. A sample is considered to be valid if it can be successfully parsed by RDKit (Landrum et al., 2006). The validity scores displayed are the self-reported values from each method. Our method, like Jin et al. (2018); Liu et al. (2018), enforces valid molecular samples, and the model does not have to learn these constraints.
We might also inquire how the reconstructed samples of the Markov chain compare to the corrupted samples. See Fig. 7 in the supplementary material for a comparison. On average, we observe corrupted samples that are less druglike and less synthesizable than their reconstructed counterparts. In particular, the output reconstructed molecule has a 21% higher QED relative to the input corrupted molecule on average.
| Source | QED KS | SA KS | logP KS | % valid |
|---|---|---|---|---|
| ChemVAE (Gómez-Bombarelli et al., 2018) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.7 |
| GrammarVAE (Kusner et al., 2017) | 0.94 (0.00) | 0.95 (0.00) | 0.95 (0.00) | 7.2 |
| GraphVAE (Simonovsky and Komodakis, 2018) | 0.52 (0.00) | 0.23 (0.00) | 0.54 (0.00) | 13.5 |
| DeepGAR (Li et al., 2018) | 0.20 (0.00) | 0.15 (0.00) | 0.062 (0.002) | 89.2 |
| JT-VAE (Jin et al., 2018) | 0.090 (0.003) | 0.21 (0.00) | 0.20 (0.00) | 100 |
| CG-VAE (Liu et al., 2018) | 0.27 (0.00) | 0.56 (0.00) | 0.064 (0.002) | 100 |
| GenRIC | 0.045 (0.003) | 0.28 (0.00) | 0.057 (0.002) | 100 |
Table 1: Molecular property distributional statistics. For each source, 20K molecules are sampled and compared to the ZINC dataset. For QED, SA, and logP, we compute the two-sample Kolmogorov–Smirnov statistic (and its bootstrapped standard error) compared to the ZINC dataset. (Lower is better for the KS statistic.) Self-reported validity percentages are also shown (the value for Gómez-Bombarelli et al. (2018) is obtained from Kusner et al. (2017)).
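For reference, a two-sample KS statistic with a bootstrapped standard error can be computed as in the sketch below. This is an illustrative, standard-library implementation; the resampling scheme (resampling only the generated set, with a fixed number of resamples) is an assumption and not necessarily the procedure used for Table 1.

```python
import bisect
import random


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for v in a + b:
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def bootstrap_ks(generated, reference, num_resamples=100, seed=0):
    """Mean and standard error of the KS statistic over bootstrap resamples."""
    rng = random.Random(seed)
    stats = []
    for _ in range(num_resamples):
        resample = [rng.choice(generated) for _ in generated]
        stats.append(ks_statistic(resample, reference))
    mean = sum(stats) / len(stats)
    var = sum((s - mean) ** 2 for s in stats) / (len(stats) - 1)
    return mean, var ** 0.5
```

A statistic of 0 means the two empirical distributions coincide, while 1 means their supports are fully separated, which is how the values near 1.00 in the table should be read.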
Geometric constraint graphs are widely employed in CAD, molecular modeling, and robotics. They consist of nodes that represent geometric primitives (e.g., points, lines) and edges that represent geometric constraints between primitives (e.g., specifying perpendicularity between two lines). To allow for easy editing and change propagation, best practices in parametric CAD encourage keeping a part well-constrained at all stages of design (Bettig and Hoffmann, 2011). A useful generative model over CAD models should ideally be restricted to sampling well-constrained geometry.
Laman graphs describe two-dimensional geometry where the primitives have two degrees of freedom and the edges restrict one degree of freedom (e.g., a system of rods and joints) (Laman, 1970). Representing minimally rigid systems, Laman graphs have the property that if any single edge is removed, the system becomes under-constrained. For a graph with $n$ nodes to be a valid Laman graph, the following two simple conditions are necessary and sufficient: 1) the graph must have exactly $2n - 3$ edges, and 2) each node-induced subgraph of $k$ nodes can have no more than $2k - 3$ edges. Together, these conditions ensure that all structural degrees of freedom are removed (given that the constraints are all independent), leaving one rotational and two translational degrees of freedom. In 3D, although the corresponding Laman conditions are no longer sufficient, they remain necessary for well-constrained geometry.
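The two Laman conditions can be checked directly on small graphs. The brute-force sketch below enumerates every node subset, so it is exponential in the number of nodes and meant only as an illustration of the definition; it assumes a simple graph (no self-loops) with nodes labeled `0..num_nodes-1`, and the function name is hypothetical.

```python
from itertools import combinations


def is_laman(num_nodes, edges):
    """Brute-force check of the two Laman conditions on a simple graph."""
    edge_set = {frozenset(e) for e in edges}
    # Condition 1: exactly 2n - 3 edges.
    if num_nodes < 2 or len(edge_set) != 2 * num_nodes - 3:
        return False
    # Condition 2: every k-node induced subgraph has at most 2k - 3 edges.
    for k in range(2, num_nodes):
        for subset in combinations(range(num_nodes), k):
            nodes = set(subset)
            induced = sum(1 for e in edge_set if e <= nodes)
            if induced > 2 * k - 3:
                return False
    return True
```

For example, a triangle satisfies both conditions, whereas a four-node cycle has one edge too few and the complete graph on four nodes has one too many.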
Henneberg (1911) describes two types of node-insertion operations, known as Henneberg moves, that can be used to inductively construct any Laman graph (Table 2). We make these moves and their inverses (the delete versions) available to both the corrupter and reconstruction model. While moves #1 and #2 can always be reversed for any nodes of degree 2 and 3 respectively, a check has to be performed to determine where the missing edge can be inserted for reverse move #2 (Haas et al., 2005). Here, we use the Laman satisfaction check described in (Jacobs and Hendrickson, 1997) to determine the set of legal neighbors. At the rigidity transition, it runs in only .
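The two insertion moves can be written compactly on an edge-set representation. This is a minimal sketch with hypothetical function names, assuming nodes are labeled `0..num_nodes-1` and edges are stored as a set of two-element frozensets; the legality check needed for reverse move #2 is not included.

```python
def henneberg_1(num_nodes, edges, u, v):
    """Move #1: add a new node joined to two distinct existing nodes."""
    assert u != v and max(u, v) < num_nodes
    new = num_nodes
    return num_nodes + 1, edges | {frozenset((new, u)), frozenset((new, v))}


def henneberg_2(num_nodes, edges, u, v, w):
    """Move #2: remove edge (u, v), then join a new node to u, v, and w."""
    assert frozenset((u, v)) in edges and w not in (u, v) and w < num_nodes
    new = num_nodes
    kept = edges - {frozenset((u, v))}
    return num_nodes + 1, kept | {frozenset((new, x)) for x in (u, v, w)}


# Build a 4-node Laman graph from a single edge.
n, e = 2, {frozenset((0, 1))}
n, e = henneberg_1(n, e, 0, 1)     # triangle: 3 nodes, 3 edges
n, e = henneberg_2(n, e, 0, 1, 2)  # 4 nodes, 5 edges
```

Move #1 adds one node and two edges, and move #2 adds one node and a net two edges, so both preserve the edge count of $2n - 3$.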
For Laman graphs, we generate synthetic graphs randomly via Algorithm 7 from Moussaoui (2016), originally proposed for evaluating geometric constraint solvers embedded within CAD programs. This synthetic generator allows us to approximately control a produced graph's degree of decomposability (DoD), a metric which indicates to what extent a Laman graph is composed of well-constrained subgraphs. Such subsystems are encountered in various applications, e.g., they correspond to individual components in a CAD model or rigid substructures in a protein. The degree of decomposability is defined as $\mathrm{DoD} = g/n$, where $g$ is the number of well-constrained, node-induced subgraphs and $n$ is the total number of nodes. We generate 100K graphs each for a low and high decomposability setting (see Section E.1 for full details).
Table 2 displays statistics for Laman graphs generated by our model as well as by two baseline methods all trained on the low decomposability dataset (we observe similar results in the high decomposability setting). For each method, 20K graphs are sampled. The validity metric is defined the same as for molecules (Section 3.3). In addition, bootstrapped KS distance between the sampled graphs and training data for DoD distribution is shown for each method.
While it is unsurprising that the simple Erdős–Rényi model (Erdös and Rényi, 1959) fails to meet validity requirements (% valid), we see that the recently proposed GraphRNN (You et al., 2018) fails to do much better. While deep graph generative models have proven to be very effective at reproducing a host of graph statistics, Laman graphs represent a particularly strict topological constraint, imposing necessary conditions on every subgraph. Today’s flexible graph generative models, while effective at matching local statistics, are ill-equipped to handle this kind of global constraint. By leveraging domain-specific inductive moves, the proposed method does not have to learn what a valid Laman graph is, and instead learns to match the distributional DoD statistics within the set of valid graphs.
In this work we have proposed a new method for modeling distributions of discrete objects, which consists of training a model to undo a series of local corrupting operations. The key to this method is to build both the corruption and reconstruction steps with support for reversible inductive moves that preserve possibly-complicated validity constraints. Experimental evaluation demonstrates that this simple approach can effectively capture relevant distributional statistics over complex and highly structured domains, including molecules and Laman graphs, while always producing valid structures. One weakness of this approach, however, is that the inductive moves must be identified and specified for each new domain; one direction of future work is to learn these moves from data. In the case of molecules, restricting the Markov chain’s transitions to learned chemical reactions could improve the synthesizability of generated samples. Future work can also explore enforcing additional hard constraints besides structural validity. For example, if a particular substructure should be included (such as in completing a partially specified object), chain transitions can be masked to respect this.
We would like to thank Wengong Jin, Michael Galvin, Dieterich Lawson, and members of the Princeton Laboratory for Intelligent Probabilistic Systems for valuable discussion and feedback. This work was funded by the Alfred P. Sloan Foundation and NSF IIS-1421780. AS was supported by the Department of Defense through the National Defense Science and Engineering Graduate Fellowship (NDSEG) Program.
Fig. 3 displays distributions for molecular samples from models trained with varying geometric distributions for the corruption sequence length. As the sequence length increases (and the corruptions become less local), the models produce worse samples. A short average corruption sequence length of one step seems to lead to a better-matched SA distribution, albeit with slower observed mixing for the Markov chain.
We describe the representation assigned to each inductive operation. As described in Section 3, each modification is associated with a location (molecule dependent) and an operation type (molecule independent).
Stop: a global operation, naturally associated with the entire molecule. The location embedding is produced by embedding the entire molecule.
Delete atom leaf: a deletion operation where the deletion target is a single atom. The vocabulary is unique, and the location is associated with a single atom.
Delete ring leaf: a deletion operation where the deletion target is a ring or bridged compound. The vocabulary is unique, and the location is associated with a ring.
Insert via atom fusion: an insertion operation where the insertion is performed by attaching at an existing atom. The vocabulary is given by all atoms in each molecule of the vocabulary, and the location is associated with a single atom.
Insert via bond fusion: an insertion operation where the insertion is performed by attaching at an existing bond. The vocabulary is given by all bonds belonging to rings in each molecule of the vocabulary, and the location is associated with a single bond in a ring.
Embeddings for locations are computed in the following fashion.
We follow a message passing architecture similar to Duvenaud et al. (2015); Gilmer et al. (2017), which produces a message for each bond and for each atom.
The atom messages are transformed and pooled to produce the molecule embedding (used for stop).
Messages for each leaf atom are also transformed to produce embeddings for delete leaf atom actions.
Messages for each bond in a leaf ring are transformed and pooled to produce embeddings for delete leaf ring.
Messages for atoms and bonds are transformed to produce embeddings for insert via atom fusion and insert via bond fusion.
In this section we give a brief description of the choices of parameters in training. We refer the reader to the source code (https://github.com/PrincetonLIPS/reversible-inductive-construction) for a full description of the model architecture and parameters.
Molecules are converted into graphs in a manner identical to the representation used by Jin et al. (2018). The message-passing model runs five steps of message passing. An embedding for the molecule is produced by transforming atom-level messages through a two-layer fully connected network, and aggregating the result through an average-pooling and a max-pooling operation (concatenated). For each task-relevant location, an embedding is produced by transforming and pooling the relevant messages, concatenating those with the molecule representation, and transforming with a two-layer fully-connected network. All messages and hidden layers have size 384.
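The concatenated average and max pooling described above can be illustrated with a minimal sketch. It uses plain Python lists rather than the tensors of the actual model, and the function name `mean_max_pool` and the toy messages are hypothetical, introduced only for illustration.

```python
from typing import List


def mean_max_pool(messages: List[List[float]]) -> List[float]:
    """Concatenate the average-pooled and max-pooled message vectors."""
    if not messages:
        raise ValueError("need at least one message")
    n = len(messages)
    dim = len(messages[0])
    avg = [sum(m[j] for m in messages) / n for j in range(dim)]
    mx = [max(m[j] for m in messages) for j in range(dim)]
    return avg + mx  # output has twice the input dimension


# Three atoms with 2-dimensional (already transformed) messages.
atom_messages = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
molecule_embedding = mean_max_pool(atom_messages)
```

With messages of size 384, such a pooled vector would have size 768 before any further transformation.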
We train each model for 50 epochs, with the Adamax optimizer and a base learning rate of at batch size . The base learning rate is scaled linearly with the batch size. We also apply a learning rate schedule, dividing the learning rate by 10 after epochs , and . Additionally, we apply learning rate warm-up by linearly scaling the learning rate from 0 to its base value during the first five epochs. The training is performed with a batch size of 1024, although we did not see any difference with smaller batch sizes (we did notice some issues with larger batches).
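The shape of this schedule (linear batch-size scaling, linear warm-up over five epochs, and step decay by a factor of 10) can be sketched as below. The base learning rate, reference batch size, and decay epochs used here are hypothetical placeholders and are not the values used in our experiments; the function name `learning_rate` is likewise illustrative.

```python
def learning_rate(epoch, base_lr, batch_size, ref_batch_size, milestones,
                  warmup_epochs=5):
    """Learning rate at a (possibly fractional) epoch."""
    # Linear scaling of the base rate with the batch size.
    scaled = base_lr * batch_size / ref_batch_size
    if epoch < warmup_epochs:
        # Linear warm-up from 0 to the scaled base value.
        return scaled * epoch / warmup_epochs
    # One division by 10 for every milestone epoch already reached.
    drops = sum(1 for m in milestones if epoch >= m)
    return scaled / (10 ** drops)


# Hypothetical placeholder values, NOT the settings used in the paper.
BASE_LR, REF_BATCH, MILESTONES = 1e-3, 256, (20, 40)
schedule = [learning_rate(e, BASE_LR, 1024, REF_BATCH, MILESTONES)
            for e in range(50)]
```

In practice the warm-up would typically be evaluated per iteration by passing a fractional epoch.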
Laman graphs are encoded for the message-passing model with a single node degree feature. That feature is encoded with a Fourier encoding of the node degree. The message-passing model runs five steps of message passing. An embedding for the graph is produced by transforming the node messages with a two-layer fully-connected network, and aggregating the result using average and max pooling. Location-specific embeddings are produced in the same fashion as in the molecule model.
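One common form of Fourier encoding for an integer feature is sketched below. The exact encoding is given in the source code; the number of frequencies, the period `max_degree`, and the function name here are assumptions made purely for illustration.

```python
import math


def fourier_encode_degree(degree, num_frequencies=4, max_degree=16):
    """Encode a node degree as sine/cosine pairs at several frequencies.

    num_frequencies and max_degree are hypothetical choices.
    """
    features = []
    for k in range(1, num_frequencies + 1):
        angle = 2.0 * math.pi * k * degree / max_degree
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features


encoded = fourier_encode_degree(3)
```

Each degree is mapped to a vector of length `2 * num_frequencies` with entries in [-1, 1].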
We train each model for 30 epochs, with the same optimizer settings as in the molecule model. We use a batch size of 256.
All models are trained using an NVIDIA Titan X Pascal (12 GB) graphics card.
Our proposed models require a Markov sampling step. We describe the details below.
For both the molecule and Laman models, we sample from the chain defined by the trained reconstructor by starting from a random object in the training dataset. The chain then alternately samples sequences from the corrupter and the reconstructor. In both cases, the results reported in the main text use a corrupter that performs an average of 5 moves (with a geometric distribution).
As we sample from a Markov chain, we do not gather i.i.d. samples. In fact, sometimes the reconstructor returns to the same molecule on adjacent transitions due to perfect reconstruction. The results reported here use every sample from the Markov chain without thinning.
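The alternation between corrupter and reconstructor can be sketched on a toy state space as follows. Integers stand in for molecules or graphs, and `corrupt_step` and `reconstruct` are hypothetical stand-ins for the corrupter and the trained reconstructor. The sketch also assumes that the geometric distribution is supported on {1, 2, ...}, which is one possible convention.

```python
import random


def sample_geometric(mean, rng):
    """Number of moves (>= 1) with the given mean; assumes support {1, 2, ...}."""
    p = 1.0 / mean
    k = 1
    while rng.random() >= p:
        k += 1
    return k


def run_chain(start, corrupt_step, reconstruct, num_transitions, mean_moves, rng):
    """Alternate corruption and reconstruction, keeping every sample."""
    samples = []
    x = start
    for _ in range(num_transitions):
        x_tilde = x
        for _ in range(sample_geometric(mean_moves, rng)):
            x_tilde = corrupt_step(x_tilde, rng)
        x = reconstruct(x_tilde, rng)
        samples.append(x)  # no thinning: repeated states are kept
    return samples


# Toy stand-ins: corrupt by +/-1, "reconstruct" to the nearest multiple of 5.
rng = random.Random(0)
samples = run_chain(
    start=10,
    corrupt_step=lambda x, r: x + r.choice((-1, 1)),
    reconstruct=lambda x, r: 5 * round(x / 5),
    num_transitions=20,
    mean_moves=5,
    rng=rng,
)
```

Because no thinning is applied, consecutive entries of `samples` may coincide, mirroring the perfect reconstructions mentioned above.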
Although validity is maintained through the inductive moves, for both the molecule and Laman models, we in fact encode an action space slightly larger than the true set of valid inductive moves (to make the space more regular). When such an invalid action is sampled by the reconstructor, it is ignored, and another sample is taken. In some very rare instances, the reconstructor repeatedly samples invalid actions, in which case the entire transition (including the corruption) is resampled.
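A minimal sketch of this rejection step is given below. The retry limit `max_tries` is a hypothetical parameter (the threshold is not specified here), and `propose` and `is_valid` stand in for the reconstructor's action sampler and the validity check.

```python
import random


def sample_valid_action(propose, is_valid, rng, max_tries=10):
    """Resample until a valid action is drawn, or give up after max_tries.

    Returns None on failure, signalling that the caller should resample
    the entire transition (including the corruption).
    """
    for _ in range(max_tries):
        action = propose(rng)
        if is_valid(action):
            return action
    return None


# Toy action space {0, 1, 2, 3} in which action 3 is invalid.
rng = random.Random(0)
action = sample_valid_action(lambda r: r.randrange(4), lambda a: a != 3, rng)
always_invalid = sample_valid_action(lambda r: 3, lambda a: a != 3, rng)
```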
For both the molecule and Laman models, a minimal size is set (one leaf for molecules, and three nodes for Laman), to prevent the chain from deleting the entire object (which would cause problems in terms of the representation). For molecules, we also set a maximum size (in terms of the number of atoms), at 25 atoms, although we found values between 25 and 35 to have little effect on the results.
In the molecule setting, we make use of an additional stop criterion which is necessary as our model exhibits high precision and does not have access to any recurrent state which would enable it to increase the probability of stopping as the length of the reconstruction sequence increases.
At each reconstruction step, we keep a history of all the molecules visited by the reconstructor so far, and stop the reconstruction process when the output of the reconstruction model already exists in its history.
We interpret this “revisit” as the model implicitly indicating that the obtained molecule is realistic enough (on average) that it is willing to return to it, despite not indicating stop itself due to the high precision of the model.
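The revisit criterion can be sketched as follows, assuming that objects are hashable (for molecules, one could for instance key the history on a canonical string representation) and that the initial corrupted object counts as visited. The `step` function, the `STOP` sentinel, and the `max_steps` safety cap are hypothetical.

```python
STOP = object()  # sentinel returned by `step` when the model chooses to stop


def reconstruct_with_revisit_stop(x_tilde, step, max_steps=100):
    """Apply reconstruction steps until an explicit stop or a revisit."""
    history = {x_tilde}
    x = x_tilde
    for _ in range(max_steps):
        nxt = step(x)
        if nxt is STOP:
            return x    # explicit stop action
        if nxt in history:
            return nxt  # revisit: the model returned to an earlier object
        history.add(nxt)
        x = nxt
    return x


# Toy reconstructor that cycles a -> b -> c -> b and never emits STOP.
transitions = {"a": "b", "b": "c", "c": "b"}
result = reconstruct_with_revisit_stop("a", lambda x: transitions[x])
```

In the toy example the reconstruction terminates at `"b"`, the first object that is visited twice.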
As we did not find high-quality real-world datasets for Laman graphs, we considered synthetic datasets generated by inductively sampling Henneberg moves at random. More explicitly, each graph in the dataset is generated using Algorithm 7 of Moussaoui (2016), reproduced here as Algorithm 1, where the size and the probability of selecting Henneberg type I moves are sampled randomly. For all datasets, we sample the size from a normal distribution with mean 30 and standard deviation 5. The distribution of the type I probability determines the distribution of the degree of decomposability of the graphs in the dataset. We choose the following distributions of this probability for each dataset: for low decomposability and for high decomposability.
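For intuition, random generation by Henneberg moves can be sketched as below. This is a simplified illustration and not a reproduction of Algorithm 1: the function name, the rounding of the sampled size, and the fixed value of the type I probability are assumptions.

```python
import random


def random_laman_graph(n, p_type1, rng):
    """Grow a Laman graph on n >= 2 nodes by random Henneberg moves.

    Edges are stored as (small, large) node pairs.
    """
    edges = {(0, 1)}
    for new in range(2, n):
        if new == 2 or rng.random() < p_type1:
            # Type I: connect the new node to two distinct existing nodes.
            u, v = rng.sample(range(new), 2)
            edges.update({(u, new), (v, new)})
        else:
            # Type II: remove an edge (u, v), then connect the new node
            # to u, v and a third existing node w.
            u, v = rng.choice(sorted(edges))
            w = rng.choice([x for x in range(new) if x not in (u, v)])
            edges.remove((u, v))
            edges.update({(u, new), (v, new), (w, new)})
    return edges


rng = random.Random(0)
# Size drawn from a normal distribution with mean 30 and standard deviation 5;
# rounding and the lower bound of 3 are assumptions of this sketch.
n = max(3, round(rng.gauss(30, 5)))
graph = random_laman_graph(n, p_type1=0.5, rng=rng)
```

Every move adds one node and a net two edges, so the resulting graph always has 2n - 3 edges, as required of a Laman graph.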
We use a single fixed corrupter for all molecule models. To corrupt a molecule, we sample a number of corruption steps from a geometric distribution with the given mean, and iteratively apply Algorithm 2 to the molecule. We made no attempt to optimize the corrupter to produce better samples from the generative model or ease the learning process.
We use a single corrupter for all our Laman models. For a single corruption sub-step, this corrupter first chooses among the four action types (Henneberg type I and type II, and their inverses) uniformly at random, and then uniformly samples among the valid actions of the chosen type. As with the molecule corrupter, the number of sub-steps is sampled from a geometric distribution with given mean.
In Fig. 7, we display QED and SA score distributions for the reconstructed molecules and the corrupted molecules visited during Gibbs sampling, as well as molecules generated by solely running the corrupter (with no reconstruction). The corruption-only samples severely diverge from the data distribution.
Below, we display three example chains for the molecular model.