1 Introduction
Accurate prediction of likely 3D geometries of flexible molecules is a long standing goal of computational chemistry, with broad implications for drug design, biopolymer research, and QSAR analysis. However, this is a very difficult problem due to the exponential growth of possible stable physical structures, or conformers, as a function of the size of a molecule. Levinthal’s infamous paradox notes that a medium sized protein polypeptide chain exposes around possible torsion angle combinations, indicating brute force to be an intractable search method for all but the smallest molecules (Levinthal, 1969). While the conformational space of a molecule’s rotatable bonds is continuous with an infinite number of possible conformations, there are a finite number of stable, low energy conformers that lie in a local minimum on the energy surface Moss (1996). Research in pharmaceuticals and biopolymer material design can be accelerated by developing efficient methods for low energy conformer search of large molecules.
Take the example of lignin, a chemically complex branched biopolymer that has great potential as a renewable biofuel (Zimmerman et al., 2020; Ragauskas et al., 2014). The challenge in taking advantage of lignin is its structural complexity that makes it hard to selectively break down into useful chemical components (Sun et al., 2018). Effective strategies to make use of lignin require deep understanding of its chemical reaction pathways, which in turn require accurate sampling of conformational behavior (Mar and Kulik, 2017; Beste and Buchanan, 2013). Molecular dynamics (MD) simulations (though expensive) are the usual method for sampling complex molecules such as lignin (Zhang et al., 2016; Schulz et al., 2009), but this is a brute force approach that is applied due to a lack of other good options.
Conformer generation and rigid rotor model. The goal of conformer generation is to build a representative set of conformers to "cover" the likely conformational space of a molecule, and sample its energy landscape well Ebejer et al. (2012). To that end, many methods have been employed (Hawkins, 2017; Ebejer et al., 2012) to generate diverse sets of low energy conformers. Two notable cheminformatics methods are RDKit’s ExperimentalTorsion Distance Geometry with Basic Knowledge (ETKDG) (Riniker and Landrum, 2015) and OpenBabel’s Confab systematic search algorithm (O’Boyle et al., 2011). ETKDG generates a distance bounds matrix to specify minimum and maximum distances each atomic pair in a molecule can take, and stochastically samples conformations that fit these bounds. On the other hand, Confab is a systematic search process, utilizing the rigid rotor approximation
of fixing constant bond angles and bond lengths. With bond angles and lengths frozen, the only degrees of freedom for molecular geometry are the
torsion angles of rotatable bonds, which Confab discretizes into buckets and then sequentially cycles through all combinations. It has been previously demonstrated that the exhaustive Confab search performs similarly to RDKit for molecules with small rotatable bond number (), but noticeably better for large, flexible () molecules (Ebejer et al., 2012) if the compute time is available. Thorough systematic search is intractable at very high () due to the combinatorial explosion of possible torsion angle combinations, whereas distance geometry fails entirely.Differences from protein folding. Protein folding is a wellstudied subproblem of conformer generation, where there is most often only one target conformer of a single, linear chain of amino acids. Protein folding is aided by vast biological datasets including structural homologies and genetic multiple sequence alignments (MSAs). In addition, the structural motifs for most finite sequences of amino acids are well known, greatly simplifying the folding problem. The general conformer generation problem is a far broader challenge where the goal is not to generate one target structure but rather a set of representative conformers. Additionally, there is insufficient database coverage for other complex polymers that are structurally different from proteins since they are not as immensely studied (Hawkins, 2017)
. For these reasons, deep learning techniques such as Alphafold
Senior et al. (2020) developed for de novo protein generation are not directly applicable.Main Contributions. First, we argue that posing conformer search as a reinforcement learning problem has several benefits over alternative formulations including generative models. Second, we present TorsionNet, a conformer search technique based on Reinforcement Learning (RL). We make careful design choices in the use of MPNNs Gilmer et al. (2017) with LSTMs Hochreiter and Schmidhuber (1997) to generate independent torsion sampling distributions for all torsions at every timestep. Further, we construct a nonstationary reward function to model the task as a dynamic search process that conditions over histories. Third, we employ curriculum learning, a learning strategy that trains a model on easy tasks and then gradually increases the task difficulty. In conformer search, we have a natural indication of task difficulty, namely the number of rotatable bonds. Fourth, we demonstrate that TorsionNet outperforms chemoinformatic methods in an environment of small and medium sized alkanes by up to 4x, and outclasses them by at least four orders of magnitude on a large lignin polymer. TorsionNet also performs competitively with the far more compute intensive SelfGuided MD (SGMD) on the lignin environment. We also demonstrate that TorsionNet has learned to detect important conformational regions. Curriculum learning is increasingly used in RL but we have little theoretical understanding for why it works Narvekar et al. (2020). Our final contribution is showing, via simple simple theoretical arguments, why curriculum learning might be able to reduce the sample complexity of simple exploration strategies in RL under suitable assumptions about task relatedness.
Related work. Recently there has been significant work using deep learning models for de novo drug target generation You et al. (2018), property prediction Gilmer et al. (2017), and conformer search Gebauer et al. (2018); Mansimov et al. (2019). Some supervised approaches like Mansimov et al. (2019) require a target dataset of empirically measured molecule shapes, utilizing scarce data generated by expensive Xray crystallography. Simm and HernándezLobato (2019) utilize dense structural data of a limited class of small molecules generated from a computationally expensive MD simulation. To our knowledge, no previous works exist that attempt to find conformer sets of medium to large sized molecules. You et al. (2018) and Wang et al. (2018)
utilize reinforcement learning on graph neural networks, but neither utilize recurrent units for memory nor action distributions constructed from subsets of node embeddings. Curriculum learning has been proposed as a way to handle nonconvex optimization problems arising in deep learning
Bengio et al. (2009); Ruder (2016); Weinshall et al. (2018). There is empirical work showing that the RL training process benefits from a curriculum by starting with nonsparse reward signals, which mitigates the difficulties of exploration Narvekar et al. (2016); Florensa et al. (2017); Akkaya et al. (2019).2 Conformer Generation as a Reinforcement Learning Problem
We pose conformer search as an RL problem, which introduces several benefits over the generative models that individually place atoms in 3D space, or produce distance constraints. First and foremost, the latter models do not solve the problem of finding a representative set of diverse, accessible conformers since all conformations are generated in parallel without regard for repeats. Moreover, they require access to expensive empirical crystallographic or simulated MD data. Learning from physics alone is a longstanding goal in structure prediction challenges to reduce the need for expensive empirical data. To this end, we utilize a classical molecular force field approximation called MMFF Halgren and Nachbar (1996) that can cheaply calculate the potential energy of conformational states and run gradient descentbased energy minimizations. Conformations that have undergone relaxation become conformers that lie stably at the bottom of a local potential well. RLbased conformer search is able to learn the conformational potential energy surface via the process depicted in Figure 1. RL is naturally adapted to the paradigm of sequential generation with the only training data being scalar energy evaluations as reward. Deep generative models Simm and HernándezLobato (2019) show reasonable performance for constructing geometries of molecules very similar to the training distribution, but their exploration ability is fundamentally limited by the ability to access expensive training sets.
We model the conformer generation problem as a contextual MDP (Hallak et al., 2015; Modi and Tewari, 2020) with a nonstationary reward function, all possible molecular graph structures as the context space , the trajectory of searched conformers as the state space , the torsional space of a given molecule as the action space and horizon . This method can be seen as a deep augmentation of the Confab systematic search algorithm; instead of sequentially cycling through torsion combinations, we sample intelligently. As our goal is to find a set of good conformations, we use a nonstationary reward function, which encourages the agent to search for conformations that have not been seen during its history. Notably, our model learns from energy function and interconformer distance evaluations alone. We use a Message Passing Neural Network (Gilmer et al., 2017) as a feature extractor for the input graph structure to handle the exponentially large context space. We solve this large state and action space problem with the Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017). Finally, to improve the generalization ability of our training method, we apply a curriculum learning strategy (Bengio et al., 2009), in which we train our model within a family of molecules in an imposed order. Next, we formally describe the problem setup.
2.1 Environment
Context space. Our context is the molecular graph structure, which is processed by a customized graph neural network, called TorsionNet. TorsionNet aggregates the structural information of a molecule efficiently for our RL problem. We will discuss TorsionNet in detail in the next subsection.
Conformer space and state space. The conformer space of a given molecule with independent torsions, or freely rotatable bonds, is defined by the torsional space . Since we optimize a nonstationary reward function, the agent requires knowledge of the entire sequence of searched conformers in order to avoid duplication. We compress the partially observed environment into an MDP by considering every sequence of searched conformers to be a unique state. This gives rise to the formalism and .
Action space. Our action space is the torsional space. Generating a conformer at each timestep can be modelled as simultaneously outputting torsion angles for each rotatable bond. We discretize the action space by breaking down each torsion angle into discrete angle buckets, i.e. . Each torsion angle is sampled independently of all the other torsions.
Transition dynamics. At each timestep, our model generates unminimized conformation . Conformation then undergoes a first order optimization, using a molecular force field. We state that the minimizer is a mapping , which accepts input of output conformer and generates new minimized conformer for the next model step, as in . Distinct generated conformations may minimize to the same or similar conformer.
Gibbs Score. To measure performance, we introduce a novel metric called the Gibbs Score, which has not directly been utilized in the conformer generation literature to date. Conformers of a molecule exist in nature as an interconverting equilibrium, with relative frequencies determined by a Gibbs distribution over energies. Therefore, the Gibbs score intends to measure the quality of a set of conformers with respect to a given force field function rather than distance to empirically measured conformations. It is designed as a representativeness measure of a finite conformation output set to the Gibbs partition function. For any , we define Gibbs measure as
where is the corresponding energy of the conformation , the Boltzmann constant, the thermodynamic temperature, and and are normalizing scores and energies, respectively, for molecule gathered from a classical generation method as needed.
The Gibbs measure relates the energy of a conformer to its thermal accessibility at a specific temperature. The Gibbs score of a set is the sum of Gibbs measures for each unique conformer: . With the Gibbs score, the total quality of the conformer set is evaluated, while guaranteeing a level of interconformer diversity with a distance measure that is described in the next paragraph. It can thereby be used to directly compare the quality of different output sets. Large values of this metric correspond to good coverage of the lowenergy regions of the conformational space of a molecule. To our knowledge, this metric is the first one to attempt to examine both conformational diversity and quality at once.
Horizons and rewards. We train the model using a fixed episodic length , which is chosen on a per environment basis based on number of torsions of the target molecule(s). We design the reward function to encourage a search for conformers with low energy and low similarity to minimized conformations seen during the current trajectory. We first describe the stationary reward function, which is the Gibbs measure after MMFF optimization:
To prune overly similar conformers, we create a nonstationary reward. For a threshold , distance metric , and the current sequence of conformers, we define:
2.2 TorsionNet
The TorsionNet model consists of a graph network for node embeddings, a memory unit, and fully connected action layers. TorsionNet takes as input a global memory state and the graph of the current molecule state postminimization, with which it outputs actions for each individual torsion.
Node Embeddings. To extract node embeddings, we utilize a Graph Neural Network variant, namely the edgenetwork MPNN of Gilmer et al. (2017); Fey and Lenssen (2019). Node embedding generates an
dimensional embedding vector
for each of the nodes of a molecule graph by the following iteration:where is the initial embedding that encodes location and atom type information, represents the set of all nodes connected to node , represents the edge features between node and ,
is a Gated Recurrent Unit (GRU) and
is a trained neural net, modelled by a Multiple Layer Perception (MLP).Pooling & Memory Unit. After all message passing steps, we have output node embeddings for each atom in a molecule. The settoset graph pooling operator Gilmer et al. (2017); Vinyals et al. (2016) takes an input all the embeddings and creates a graph representation . We use to denote the graph representation at time step . Up to time , we have a sequence of representations . An LSTM is then applied to incorporate histories and generate the global representation, which we denote as .
Torsion Action Outputs. As previously noted, the action space is the torsional space, with each torsion angle chosen independently. The model receives a list of valid torsions for the given molecule for . A torsion is defined by an ordinal succession of four connected atoms as such with each representing an atom. Flexible ring torsions are defined differently, but are outside of the scope of this paper. For each torsion angle , we use a trained neural network , which takes input of the four embeddings and the representation to generate a distribution over 6 buckets: And finally, torsion angles are sampled independently and are concatenated to produce the final output action at time : , for
Proximal Policy Optimization (PPO). We train our model with PPO, a policy gradient method with proximal trust regions adapted from TRPO (Trust Region Policy Optimization) (Schulman et al., 2015). PPO has been shown to have theoretical guarantee and good empirical performance in a variety of problems (Schulman et al., 2017; Liang et al., 2018; Zhu et al., 2018). We combine PPO with an entropybased exploration strategy, which maximizes the cumulative rewards by executing :
Doubling Curricula. Empirically, we find that training directly on a large molecule is sampling inefficient and hard to generalize. We utilize a doubling curriculum strategy to aid generalization and sample efficiency. Let be the set of target molecules from some molecule class. Let be the first elements in the set.
Our doubling curriculum trains on set , by randomly sampling a molecule from as the context on round . The end of a round is marked by the achievement of desired performance. The design of doubling curriculum is to balance learning and forgetting as we always have a probability to sample molecules in the earlier rounds (see Algorithm 1 in the appendix).
3 Evaluation
In this section, we outline our experimental setup and results. Further details such as the contents of the graph data structure and hyperparameters are presented in Appendix
C. To demonstrate the effectiveness of sequential conformer search, we compare performance first to the stateoftheart conformer generation algorithm RDKit on a family of small molecules, and secondly to molecular dynamics methods on the largescale biopolymer lignin. All test molecules are shown in Appendix C, along with normalizing constants.3.1 Environment Setup
All conformer search environments are set up using the OpenAI Gym framework (Brockman et al., 2016) and use RDKit for the detection and rotation of independent torsion angles. Our training framework is based on Shangtong (2018). For these experiments, we utilize the classical force field MMFF94, both for energy function evaluation and minimization. The minimization process uses an LBFGS optimizer, as implemented by RDKit. and are required for per molecule reward normalization, and are collected by benchmarking on one run of a classical conformer generation method. For the nonstationary reward function described in Section 2.1, we use the distance metric known as the Torsion Fingerprint Deviation SchulzGasch et al. (2012) to compare newly generated conformers to previously seen ones. We set the distance threshold to for both experiments. To benchmark on nonsequential generation methods, we sort output conformers by increasing energy and apply the Gibbs score function.
3.2 Branched Alkane Environment
We created a script to randomly generate 10,000 molecular graphs of branched alkanes via a simple iterative process of adding carbon atoms to a molecular graph. 4332 alkanes containing are chosen for the train set. The curriculum order is given by increasing first and number of atoms second as tiebreaker. The validation environment consists of a single 10 torsion alkane unseen at train time. All molecules use sampling horizon . A input data structure of a branched alkane consists of only one type of atom embedded in 3D space, single bonds, and a list of torsions, lending this environment simplicity and clarity for proof of concept. Hydrogen atoms are included in the energy modelling but left implicit in the graph passed to the model. We collect a normalizing and for each molecule using the ETKDG algorithm, with being the smallest conformer energy encountered, and being the Gibbs score of the output set with . Starting conformers for alkane environments are sampled from RDKit.
Results. Table 1 shows very good performance on two separate test molecules, which are 11 and 22 torsion branched alkane examples. Not only does TorsionNet outperform RDKit by 64% in the small molecule regime, but also it generalizes to molecules well outside the training distribution and beats RDKit by 289% on the 22torsion alkane. TorsionNet’s runtime is comparable to Confab’s.
Method  11 torsion alkane  22 torsion alkane  

Gibbs Score  Wall Time (s)  Gibbs Score  Wall Time (s)  
RDKit  1.05 0.18  14.52 0.03  1.22 0.43  68.72 0.08 
Confab  0.14 0.02  8.87 0.03  26.04 0.12  
TorsionNet  1.72 0.27  13.35 0.14  4.75 3.00  35.23 0.06 
Method comparison of both score and speed on two branched alkane benchmark molecules. All methods sample exactly 200 conformers. Standard errors produced over 10 model runs.
3.3 Lignin Environment
We adapted a method to generate instances of the biopolymer family of lignins (Orella et al., 2019)
. Lignin polymers were generated with the simplest consisting of two monomer unit [2lignin] and the most complex corresponding to eight units [8lignin]. With each additional monomer, the number of possible structural formulas grows exponentially. The training set consists only of 12 lignin polymers up to 7 units large, ordered by number of monomers for the curriculum. The validation and test molecules are each unique 8lignins. The Gibbs score reward for the lignin environment features high variance across several orders of magnitude, even at very high temperatures (
), which is not ideal for RL. To stabilize training, we utilize the log Gibbs Score as reward, which is simply the natural log of the underlying reward function as such: . This reward function is a numerically stable, monotonically increasing function of the Gibbs score. Initial conformers for lignin environments are sampled from OpenBabel.3.4 Performance on Lignin Conformer Generation
We compare the lignin conformers generated from TorsionNet with those generated from MD. The test lignin molecule has 56 torsion angles and is comprised of 8 bonded monomeric units. RDKit’s ETKDG method failed to produce conformers for this test molecule. Since exploration in conventional MD can be slowed down and hindered by high energy barriers between configurations, enhanced sampling methods such as SGMD (Wu and Brooks, 2003; Brooks et al., 2009) that speed up these slow conformational changes are used instead. SGMD is used as a more exhaustive benchmark for TorsionNet performance. Structures from the 50 ns MD simulation were selected at regular intervals and energetically minimized with MMFF94. These conformers were further pruned in terms of pairwise TFD and relative energy cutoffs to eliminate redundant and highenergy conformers.
TorsionNet is comparable to SGMD in terms of conformer impact toward Gibbs Score (Table 2). Conventional MD is left out from the results, as it only produced 5 conformers that are within pruning cutoffs, mainly due to low diversity according to TFD. This means that exploration was indeed hampered by high energy barriers preventing the trajectory from traversing low energy regions of conformational space. SGMD showed better ability to overcome energy barriers and was able to produce a high number of conformers and the highest Gibbs score. Although TorsionNet sampled 10x fewer conformers than SGMD, the Gibbs score is still close between ensembles, which shows that TorsionNet generated lowenergy unique conformers more frequently than SGMD. TorsionNet is therefore highly efficient at conformer sampling and captured most of the SGMD Gibbs score at a thousandth of the compute time.
Method  No. of sampled conformers  CPU Time (h)  Gibbs Score 

Enhanced MD (SGMD)^{1}  10000  277.59  1.00 
Confab  1000  0.24 0.01  
TorsionNet  1000  0.35 0.01  0.60 0.16 

Enhanced MD run only once due to computational expense.
Figure 2 shows the correlated motion of lignin’s torsion angles in SGMD and TorsionNet and gives insight toward the preferred motion of the molecule. The highest contributions in SGMD are mostly localized and found along the diagonal, middle, lower right sections of the matrix. These sections correspond to strong relationships of proximate torsion angles, which SGMD identifies as the main regions that induce systematic conformational changes in the lignin molecule. With TorsionNet, we can see high correlations in similar sections, especially the middle and lower right parts of the matrix. This means that TorsionNet preferred to manipulate torsions in regions that SGMD also deemed to be conformationally significant. This promising result demonstrates that TorsionNet detects the important torsion relationships in novel test molecules.
4 On the Benefit of Curriculum Learning
Previous work (Bengio et al., 2009; Weinshall et al., 2018) explains the benefits of curriculum learning in terms of nonconvex optimization, while many RL papers point out that curriculum learning eases the difficulties of exploration (Narvekar et al., 2016; Florensa et al., 2017). Here we show that a good curriculum allows simple exploration strategies to achieve nearoptimal sample complexity under a task relatedness assumption involving a joint policy class over all tasks.
Joint function class. We are given a finite set of episodic and deterministic MDPs . Suppose each has a unique optimal policy . Let denote a joint policy and we use if all the policies are the optimal policies. For any subscription set , let .
We assume that is from a joint policy space . The relatedness of the MDPs is characterized by some structure on the joint policy space. Our learning process is similar to the wellknown process of eliminating hypotheses from a hypothesis class as in version space algorithms. For any set , once we decide that have policies , the eliminated hypothesis space is denoted by . Finally, for any joint space , we use the subscript to denote the th marginal space of , i.e. .
Curriculum learning. We define a curriculum to be a permutation of . A CL process can be seen as searching a sequence of spaces: , where for is the first elements of the sequence and
is a sequence of estimated policies. To be specific, on round
, our CL algorithm learns MDP by randomly sampling policies from marginal space until all the policies in the space are evaluated and the best policy in the space is found, which is denoted by . On rounds , space is randomly sampled and the best policy is .Theorem 1.
With probability at least , the above procedure guarantees that for all and it takes steps to end.
The proof of Theorem 1 is in Appendix A. In some cases (e.g. combination lock problem (Koenig and Simmons, 1993)), we can show that matches the lower bound of sample complexity of any algorithm. We further verify the benefits of curriculum learning strategy in two concrete case studies, combination lock problem (discussed in Appendix B) and our conformer generation problem.
4.1 Conformer generation
Problem setup. We simplify the conformer generation problem by finding the best conformers (instead of a set) of molecules, where it becomes a set of bandit problems, as our stationary reward function and transition dynamic only depend on actions. We consider a family of molecules, called TBranched Alkanes (see Appendix C) satisfying that the th molecule has independent torsion angles and is a subgraph of molecule for all .
Joint policy space. The policy space is essentially the action space , where . Let be the optimal action of bandit . We make Assumption 2 for the conditional marginal policy spaces of general molecule families.
Assumption 2.
For any , where for is the Hamming distance. Note that in our TBranched Alkanes, .
Sample complexity. Applying Theorem 1, each marginal space is and the total sample complexity following the curriculum is upper bounded by with high probability and learning each molecule separately may require up to , which is essentially larger than the first upper bound when . When remains 0, the upper bound reduces to .
Effects of direct parametertransfer. While it is shown above that a purely random exploration within marginal spaces can significantly reduce the sample complexity, the marginal spaces are unknown in most cases as is an unknown parameter. Instead, we use a direct parametertransfer and entropy based exploration. We train TorsionNet on 10 molecules of TBranched Alkanes sequentially and evaluate the performances on all the molecules at the end of each stage. As shown in Figure 3, the performance on the hardest task increases linearly as the curriculum proceeds.
5 Conclusion
Posing conformer search as an RL problem, we introduced the TorsionNet architecture and its related training platform and environment. We find that TorsionNet reliably outperforms the best freely available conformer sampling methods, sometimes by many orders of magnitude. We also investigate the results of an enhanced molecular dynamics simulation and find that TorsionNet has faithfully reproduced most of the conformational space seen via the more intensive sampling method. These results demonstrate the promise of TorsionNet in conformer generation of largescale high rbn molecules. Such methods open up the avenue to efficient conformer generation on any large molecules without conformational databanks to learn from, and to solve downstream tasks such as mechanistic analysis of reaction pathways. Furthermore, the curriculumbased RL approach to combinatorial problems of increasing complexity is a powerful framework that can extend to many domains, such as circuit design with a growing number of circuit elements, or robotic control bodies with increasing levels of joint detail.
References
 Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113. Cited by: §1.

Curriculum learning.
In
Proceedings of the 26th annual international conference on machine learning
, pp. 41–48. Cited by: §1, §2, §4.  Computational investigation of the pyrolysis product selectivity for hydroxy phenethyl phenyl ether and phenethyl phenyl ether: analysis of substituent effects and reactant conformer selection. The Journal of Physical Chemistry A 117 (15), pp. 3235–3242. Note: PMID: 23514452 External Links: Document, Link, https://doi.org/10.1021/jp4015004 Cited by: §1.
 Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §3.1.
 CHARMM: the biomolecular simulation program. Journal of computational chemistry 30 (10), pp. 1545–1614. Cited by: §3.4.
 Freely available conformer generation methods: how good are they?. Journal of chemical information and modeling 52 (5), pp. 1146–1158. Cited by: §1.

Fast graph representation learning with PyTorch Geometric
. In ICLR Workshop on Representation Learning on Graphs and Manifolds, Cited by: §2.2.  Reverse curriculum generation for reinforcement learning. In Conference on Robot Learning, pp. 482–495. Cited by: §1, §4.
 Generating equilibrium molecules with deep neural networks. arXiv preprint arXiv:1810.11347. Cited by: §1.
 Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 1263–1272. Cited by: §1, §1, §2.2, §2.2, §2.
 Merck molecular force field. iv. conformational energies and geometries for mmff94. Journal of computational chemistry 17 (56), pp. 587–615. Cited by: §2.

Contextual markov decision processes
. arXiv preprint arXiv:1502.02259. Cited by: §2.  Conformation generation: the state of the art. Journal of Chemical Information and Modeling 57 (8), pp. 1747–1756. Note: PMID: 28682617 External Links: Document, Link, https://doi.org/10.1021/acs.jcim.7b00221 Cited by: §1, §1.
 Long shortterm memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1.
 Complexity analysis of realtime reinforcement learning. In AAAI, pp. 99–107. Cited by: Appendix B, §4.
 How to fold graciously. Mossbauer spectroscopy in biological systems 67, pp. 22–24. Cited by: §1.
 Adversarial deep reinforcement learning in portfolio management. arXiv preprint arXiv:1808.09940. Cited by: §2.2.
 Molecular geometry prediction using a deep generative graph neural network. Scientific Reports 9 (1), pp. 1–13. Cited by: §1.
 Depolymerization pathways for branching lignin spirodienone units revealed with ab initio steered molecular dynamics. The Journal of Physical Chemistry A 121 (2), pp. 532–543. Note: PMID: 28005362 External Links: Document, Link, https://doi.org/10.1021/acs.jpca.6b11414 Cited by: §1.

Noregret exploration in contextual reinforcement learning.
In
Proceedings of the 36th Annual Conference on Uncertainty in Artificial Intelligence
, Cited by: §2.  Basic terminology of stereochemistry (IUPAC recommendations 1996). Pure and Applied Chemistry 68 (12), pp. 2193–2222. External Links: Document, Link Cited by: §1.
 Curriculum learning for reinforcement learning domains: a framework and survey. arXiv preprint arXiv:2003.04960. Cited by: §1.
 Source task creation for curriculum learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 566–574. Cited by: §1, §4.
 Confab  systematic generation of diverse lowenergy conformers. Journal of Cheminformatics 3 (1), pp. 8. External Links: ISSN 17582946, Document, Link Cited by: §1.
 Ligninkmc: a toolkit for simulating lignin biosynthesis. ACS Sustainable Chemistry & Engineering 7 (22), pp. 18313–18322. External Links: Document, Link, https://doi.org/10.1021/acssuschemeng.9b03534 Cited by: §3.3.
 Lignin valorization: improving lignin processing in the biorefinery. Science 344 (6185). External Links: Document, ISSN 00368075, Link, https://science.sciencemag.org/content/344/6185/1246843.full.pdf Cited by: §1.
 Better informed distance geometry: using what we know to improve conformation generation. Journal of Chemical Information and Modeling 55 (12), pp. 2562–2574. Cited by: §1.
 An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747. Cited by: §1.
 Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §2.2.
 Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §2.2, §2.
 Scaling of multimillionatom biological molecular dynamics simulation on a petascale supercomputer. Journal of Chemical Theory and Computation 5 (10), pp. 2798–2808. Note: PMID: 26631792 External Links: Document, Link, https://doi.org/10.1021/ct900292r Cited by: §1.
 TFD: torsion fingerprints as a new measure to compare small molecule conformations. Journal of chemical information and modeling 52 (6), pp. 1499–1512. Cited by: §3.1.
 Improved protein structure prediction using potentials from deep learning. Nature, pp. 1–5. Cited by: §1.
 Modularized implementation of deep rl algorithms in pytorch. GitHub. Note: https://github.com/ShangtongZhang/DeepRL Cited by: §3.1.
 A generative model for molecular distance geometry. arXiv preprint arXiv:1909.11459. Cited by: §1, §2.
 Principal component analysis on a torus: theory and application to protein dynamics. The Journal of Chemical Physics 147 (24), pp. 244101. External Links: Document, Link, https://doi.org/10.1063/1.4998259 Cited by: Figure 2.
 Bright side of lignin depolymerization: toward new platform chemicals. Chemical Reviews 118 (2), pp. 614–678. Note: PMID: 29337543 External Links: Document, Link, https://doi.org/10.1021/acs.chemrev.7b00588 Cited by: §1.
 Order matters: sequence to sequence for sets. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 24, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §2.2.
 NerveNet: learning structured policy with graph neural networks. In International Conference on Learning Representations, External Links: Link Cited by: §1.

Curriculum learning by transfer learning: theory and experiments with deep networks
. In International Conference on Machine Learning, pp. 5238–5246. Cited by: §1, §4.  Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems, pp. 3021–3029. Cited by: Appendix B.
 Selfguided langevin dynamics simulation method. Chemical Physics Letters 381 (3), pp. 512–518. Cited by: §3.4.
 Graph convolutional policy network for goaldirected molecular graph generation. In Advances in neural information processing systems, pp. 6410–6421. Cited by: §1.
 Initial mechanisms for an overall behavior of lignin pyrolysis through largescale reaxff molecular dynamics simulations. Energy & Fuels 30 (4), pp. 3140–3150. External Links: Document, Link, https://doi.org/10.1021/acs.energyfuels.6b00247 Cited by: §1.

Reinforcement and imitation learning for diverse visuomotor skills
. In Proceedings of Robotics: Science and Systems, Pittsburgh, Pennsylvania. External Links: Document Cited by: §2.2.  Designing for a green chemistry future. Science 367 (6476), pp. 397–400. External Links: Document, ISSN 00368075, Link, https://science.sciencemag.org/content/367/6476/397.full.pdf Cited by: §1.
Appendix A Proof of Theorem 1
Recall the Theorem 1.
The optimal policy is guaranteed in for all . With probability at least , the algorithm takes at most steps to end. A curriculumfree algorithm that learns tasks separately requires samples at least .
For the first argument, we use induction. On round , assuming
(1) 
we have . Then for , equation (1) also holds. As , the argument follows by induction. For the second part, we it is essentially a Coupon Collector’s problem.
Lemma 3 (Coupon Collector’s problem).
It takes rounds of random sampling to see all distinct options with a probability at least .
Proof. Consider a general sampling problem: for any finite set with . For any , whose sampling probability is , with a probability at least , it requires at most
Since for all , we have
Searching the whole space with each new element being found with probability at round , it requires at most
with a probability at most .
By Lemma 3, with a probability , search the marginal policy space requires at most times policy evaluation. As the horizon for task is , the total number of samples to search the whole joint space is
Appendix B Combination lock
Problem setup. We consider the combination lock problem (Koenig and Simmons, 1993). As shown in Figure 4, the set of MDPs share the same action space . The th task has the state space , the episode length . The agent receives 0 reward on all but the last state in the th task. There are two actions, one for staying on the current state and the other one for moving forward, i.e. .
Joint policy space. We assume the same optimal actions on the common states shared by different tasks. Formally, for , and .
Sample complexity. By (Wen and Van Roy, 2013), the total number of steps needed to learn is at least . The lower bound can only be achieved by carefully designed exploration strategy, which accounts for the underlying function class. Applying Theorem 1, a purely random exploration strategy following curricula has an upper bound of with probability at least , which matches the lower bound. Solving directly using random exploration requires samples.
Experiment setup. To match the experiment setup in our conformer generation problem, we conduct the combination lock experiment on a harder environment, MiniGrid. MiniGrid is a minimalistic gridworld environment for OpenAI Gym with an image input. The environment is shown in Figure 6. In our experiments, we train an PPO on MiniGrid of size 25, with target grid changing according to the sequence . The model setting and hyperparameters are the same in Torchrl. Whenever the model converges on the current task, we test the average regret over 100 samples on all the tasks from 3 to 17. The results are shown in Figure 5. As we can see, we observe a similar pattern as shown in Figure 3.
Appendix C Algorithm Details and Experimental Parameters
c.1 Curriculum Algorithm
The specifics of our implementation is included with the code.
c.2 Features and Hyperparameters
Feature  Feature Type  Description  Dimensionality 

Atom type  Node  [C, O] (onehot)  2 
Position  Node  3D Cartesian coordinates (float)  3 
Bond type  Edge  [Single, Double, Triple, Aromatic] (onehot)  4 
Conjugated  Edge  Bond belongs to a conjugated system (boolean)  1 
Ringed  Edge  Bond is in a closed ring (boolean)  1 
Position of atoms are given by Cartesian coordinates. These are taken directly from the RDKit conformer object, then normalized in two ways. Firstly atoms are centered on the origin. Then, rotation is normalized such that eigenvectors align with coordinate axes.
Molecules  (kcal/mol)  

11torsion alkane  18.0451260322537  3.34544474520153  500 
22torsion alkane  14.882782943326  1.2363186365185  500 
8lignin  525.8597422  16.1548792743065  2000 
and are utilized for Gibbs evaluation. Normalizers for alkane train and test molecules are sampled from RDKit ETKDG with default settings, and for the lignin test environment via exhaustive SGMD sampling. The lignin train molecules have normalizers collected via OpenBabel sampling. We include the constants for test molecules here, but all remaining constants for train molecules are included in code repository in Appendix D.
Hyperparameter  Value 

Message Passing Steps  6 
SettoSet Passes  6 
Node Embedding Dimension  128 
LSTM Hidden State Dimension  256 
Full hyperparameter setup described in code repo (Appendix D).
c.3 Test Molecule Depiction
Full smiles string is given for each molecule in code repo (Appendix D).
Appendix D Code
Github link will be provided in a future version.
Comments
There are no comments yet.