An End-to-End Framework for Molecular Conformation Generation via Bilevel Programming

by Minkai Xu, et al.

Predicting molecular conformations (i.e., 3D structures) from molecular graphs is a fundamental problem in many applications. Most existing approaches divide the task into two steps: first predicting the distances between atoms, and then generating a 3D structure by solving a distance geometry problem. However, the distances predicted by such two-stage approaches may fail to consistently preserve the geometry of local atomic neighborhoods, making the generated structures unsatisfactory. In this paper, we propose an end-to-end solution for molecular conformation prediction called ConfVAE, based on the conditional variational autoencoder framework. Specifically, the molecular graph is first encoded into a latent space, and then the 3D structures are generated by solving a principled bilevel optimization program. Extensive experiments on several benchmark data sets demonstrate the effectiveness of our proposed approach over existing state-of-the-art approaches.


1 Introduction

Recently we have witnessed much success of deep learning for molecule modeling in a variety of applications ranging from molecule property prediction (Gilmer et al., 2017) and molecule generation (You et al., 2018; Shi et al., 2020b) to retrosynthesis planning (Shi et al., 2020a). In these applications, molecules are generally represented as graphs with atoms as nodes and covalent chemical bonds as edges. Although this is empirically effective, in reality molecules are better represented as 3D structures (also known as conformations), where each atom is characterized by 3D Cartesian coordinates. Such 3D structures are also more intrinsic and informative, determining many chemical and biological properties such as chemical sensing and therapeutic interactions with proteins.

However, determining the 3D structures from experiments is challenging and costly. Effectively predicting valid and low-energy conformations has therefore been an important and active topic in computational chemistry. Traditional computational approaches are typically based on Markov chain Monte Carlo (MCMC) or molecular dynamics (MD) (De Vivo et al., 2016) to propose conformations, combined with simulations that assign energies through cheap empirical potentials or expensive quantum chemical calculations (Ballard et al., 2015). Recently, there has been growing interest in developing machine learning approaches (Mansimov et al., 2019; Simm and Hernández-Lobato, 2020; Xu et al., 2021) that model the conditional distribution of stable conformations given the molecular graph by training on a collection of molecules with available stable conformations. Specifically, two recent works (Simm and Hernández-Lobato, 2020; Xu et al., 2021) propose to first predict the distances between atoms and then generate molecular conformations from the predicted distances by solving a distance geometry problem (Liberti et al., 2014). Such approaches effectively take into account the rotation and translation invariance of molecular conformations and have hence achieved very promising performance.

However, there is still a significant limitation of these two-stage approaches, which predict the distances and conformations separately: the predicted distances might not properly preserve the 3D geometry of local atomic neighborhoods. Some invalid combinations of distances can be assigned high likelihood by the distance prediction model, and the errors in these distances can then be significantly exaggerated by the distance geometry program of the second stage, yielding unrealistic outlier 3D structures. This is not surprising, since the distance prediction model is trained by maximizing the factorized likelihood of distances, while our end goal is to predict valid and stable conformations. We propose to address this issue with an end-to-end solution that directly predicts the conformation given the molecular graph. Indeed, in the related problem of predicting 3D structures of proteins (a.k.a. protein structure prediction) from amino-acid sequences, the recent huge success of the AlphaFold2 algorithm demonstrates the importance and effectiveness of an end-to-end solution compared to the previous AlphaFold algorithm (though exact details of the AlphaFold2 algorithm are still lacking) (Senior et al., 2020a; Jumper et al., 2020).

In this paper, we propose such an end-to-end solution called ConfVAE for molecular conformation generation, based on bilevel programming. To model the rotational and translational invariance of conformations, we still take the pairwise distances among atoms as intermediate variables. However, instead of learning to predict distances by minimizing errors in the space of distance, we formulate the whole problem as bilevel programming (Franceschi et al., 2018), with the distance prediction problem and the distance geometry problem for conformation generation being simultaneously optimized. The whole framework is built on the conditional variational autoencoder (VAE) framework (Kingma and Welling, 2013), in which the molecular graph is first encoded into the VAE latent space, and the conformations are generated based on the latent variable and molecular graph. During training, we iteratively sample a set of distances from the distance prediction model, generate the 3D structures by minimizing an inner objective (defined by the distance geometry problem), and then update the distance prediction model by optimizing the outer objective, i.e., the likelihood directly defined on the conformations.

To the best of our knowledge, ConfVAE is the first method for molecular conformation generation which can be trained in an end-to-end fashion and at the same time keep the property of rotational and translational invariance. Extensive experiments demonstrate the superior performance of the proposed method over existing state-of-the-art approaches on several widely used benchmarks including conformation generation and distance distribution modeling. We also verify that the end-to-end objective is of vital importance for generating realistic and meaningful conformations.

2 Background

2.1 Problem Definition

Notations. Following existing work (Simm and Hernández-Lobato, 2020; Xu et al., 2021), each molecule is represented as an attributed atom-bond graph G = ⟨V, E⟩, where V is the set of vertices representing atoms and E is the set of edges representing inter-atomic bonds. Each node v ∈ V carries atomic features such as the element type. Each edge e_uv ∈ E describes the chemical bond connecting u and v, and is labeled with its bond type. Since the distances of the bonds in the molecular graph are not sufficient to determine a unique conformation (e.g., due to so-called internal rotations around bond axes), we adopt the common pre-processing methodology of existing works (Simm and Hernández-Lobato, 2020; Xu et al., 2021) and expand the graph with auxiliary edges, which enforce multi-hop distance constraints that eliminate some ambiguities in the 3D conformation, as elaborated in Appendix A.
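As a concrete illustration of this pre-processing step, the sketch below (a simplified stand-in for the procedure in Appendix A; the function and variable names are ours) expands a covalent bond list with auxiliary edges between atoms two and three hops apart:

```python
import collections

def expand_with_auxiliary_edges(n_atoms, bonds, max_hops=3):
    # Build an adjacency list from the covalent bonds.
    adj = {i: set() for i in range(n_atoms)}
    for i, j in bonds:
        adj[i].add(j)
        adj[j].add(i)
    # BFS from every atom to find hop counts, then keep pairs within
    # max_hops; hop-2 edges constrain bond angles, hop-3 edges torsions.
    edges = {}
    for src in range(n_atoms):
        hops = {src: 0}
        queue = collections.deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in hops:
                    hops[v] = hops[u] + 1
                    queue.append(v)
        for dst, h in hops.items():
            if src < dst and 1 <= h <= max_hops:
                edges[(src, dst)] = h
    return edges  # (i, j) -> hop count; hop 1 marks an original bond
```

On a 4-atom chain 0-1-2-3 this adds the angle edges (0, 2), (1, 3) and the torsion edge (0, 3) on top of the three bonds.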

For the geometry R, each atom v ∈ V is represented by a 3D coordinate vector r_v ∈ ℝ³, and the full set of positions is represented by the matrix R ∈ ℝ^{|V|×3}. Let d_uv = ‖r_u − r_v‖₂ denote the Euclidean distance between the u-th and v-th atoms; then all the distances between connected nodes can be summarized as a vector d ∈ ℝ^{|E|}.

Problem Definition. Molecular conformation generation is a conditional generation task: the goal is to model the conditional distribution of molecular conformations given the graph G, i.e., p(R | G).

2.2 Bilevel Optimization

Bilevel programs are defined as optimization problems where a set of variables involved in the (outer) objective function are obtained by solving another (inner) optimization problem (Colson et al., 2007). Formally, given the outer objective function f and the inner objective g, with corresponding outer and inner variables θ and w, a bilevel program can be formulated by:

    min_θ f(w_θ, θ)   s.t.   w_θ ∈ argmin_w g(w, θ)    (1)
Bilevel programs have shown effectiveness in a variety of situations such as hyperparameter optimization, adversarial and multi-task learning, as well as meta-learning (Maclaurin et al., 2015; Bengio, 2000; Bennett et al., 2006; Flamary et al., 2014; Muñoz-González et al., 2017; Franceschi et al., 2018).

Typically, solving equation 1 is intractable since the solution set of the inner problem may not be available in closed form (Bengio, 2000). A common approach is to replace the exact minimizer of the inner objective with an approximate solution, obtained through iterative optimization dynamics Φ such as stochastic gradient descent (SGD) (Domke, 2012; Maclaurin et al., 2015; Franceschi et al., 2017). Starting from the initial parameter w_0, we can get the approximate solution w_T by running T iterations of the inner optimization dynamics, i.e., w_1 = Φ(w_0, θ), w_2 = Φ(w_1, θ), and so on. In the general case where θ and w are real-valued and the objectives and optimization dynamics are smooth, the gradient of the objective f w.r.t. θ, named the hypergradient ∇_θ f(w_T, θ), can be computed by:

    ∇_θ f(w_T, θ) = ∂_θ f(w_T, θ) + ( ∂_{w_T} f(w_T, θ) ) ( d_θ w_T )    (2)

where ∂ denotes the partial derivative computing the Jacobian on immediate variables, while d denotes a total derivative taking into account the recursive calls to Φ. The above gradient can be efficiently calculated by unrolling the optimization dynamics with back-propagation, i.e., reverse-mode automatic differentiation (Griewank and Walther, 2008), where we repeatedly substitute d_θ w_t = ∂_θ Φ(w_{t−1}, θ) + ( ∂_{w_{t−1}} Φ(w_{t−1}, θ) ) ( d_θ w_{t−1} ) and apply the chain rule.
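To make the unrolling concrete, here is a minimal, self-contained sketch (our illustration, not the paper's code) on a one-dimensional toy problem, where the derivative tracked through the unrolled dynamics can be checked against a closed form:

```python
def hypergradient_unrolled(theta, w0, y, eta=0.1, T=50):
    """Inner: minimize g(w, theta) = 0.5*(w - theta)^2 by T GD steps.
    Outer: f(w_T) = 0.5*(w_T - y)^2. Returns (hypergradient, w_T)."""
    w, dw_dtheta = w0, 0.0
    for _ in range(T):
        w = w - eta * (w - theta)  # w_t = Phi(w_{t-1}, theta)
        # Chain rule through Phi: d w_t / d theta
        #   = (1 - eta) * d w_{t-1} / d theta + eta
        dw_dtheta = (1 - eta) * dw_dtheta + eta
    return (w - y) * dw_dtheta, w
```

Because this toy inner dynamics is linear, the tracked derivative matches the closed form w_T = θ + (1 − η)^T (w_0 − θ) exactly; in practice the same bookkeeping is done automatically by reverse-mode autodiff.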

3 Implicit Distance Geometry

In this section we elaborate on the proposed end-to-end framework. We first present a high-level description of our bilevel formulation in Sec. 3.1. Then we present the model schematic and training objectives in Sec. 3.2. Finally we show how to learn the model via hypergradient descent in Sec. 3.3 and how to draw samples in Sec. 3.4.

3.1 Overview

Since a molecule can have multiple stable conformations, we model the distribution of conformations R conditioned on the molecular graph G, i.e., p(R | G), with a conditional variational autoencoder (CVAE) (Kingma and Welling, 2013), in which a latent variable z is introduced to model the uncertainty in conformation generation. The CVAE model includes a prior distribution p_ψ(z | G) over the latent variable and a decoder p_θ(R | z, G) to capture the conditional distribution of R given z and G. During training, we also involve an additional inference model (encoder) q_φ(z | R, G). The encoder and decoder are jointly trained to maximize the evidence lower bound (ELBO) of the data log-likelihood:

    L_ELBO = E_{q_φ(z|R,G)} [ log p_θ(R | z, G) ] − KL( q_φ(z | R, G) ‖ p_ψ(z | G) )    (3)
The ELBO can be interpreted as the sum of the negative reconstruction error (the first term) and a latent-space prior regularizer (the second term). In practice, q_φ(z | R, G) and p_ψ(z | G) are both modeled as diagonal Gaussians N(μ_φ, σ_φ²I) and N(μ_ψ, σ_ψ²I), whose means and standard deviations are predicted by graph neural networks. To efficiently optimize the ELBO during training, sampling from q_φ(z | R, G) is done by reparametrizing z as z = μ_φ + σ_φ ⊙ ε, where ε ∼ N(0, I).
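The two Gaussian ingredients of this objective can be sketched in a few lines of numpy (our illustration; the function names are ours): the reparameterized sample, and the closed-form KL term between diagonal Gaussians that serves as the prior regularizer.

```python
import numpy as np

def reparameterize(mu, log_sigma, rng):
    # z = mu + sigma * eps with eps ~ N(0, I): the sample becomes a
    # deterministic, differentiable function of (mu, sigma).
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

def kl_diag_gaussians(mu_q, log_s_q, mu_p, log_s_p):
    # Closed-form KL(q || p) for diagonal Gaussians, summed over dims.
    var_q, var_p = np.exp(2 * log_s_q), np.exp(2 * log_s_p)
    return float(np.sum(log_s_p - log_s_q
                        + (var_q + (mu_q - mu_p) ** 2) / (2 * var_p) - 0.5))
```

For example, KL between N(0, 1) and N(1, 1) is 0.5 per dimension, and the KL of a distribution with itself is zero.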

With similar encoder and prior models, the key differences among methods lie in the architecture and learning procedure of the decoder (generator) model p_θ(R | z, G), i.e., how to parameterize the decoder and train it with respect to the reconstruction loss. Let D_θ(z, G) denote the decoder function taking the latent variable z and the graph G and producing a distance vector; we now elaborate how we formulate the optimization problem of the decoder as a bilevel program:

Inner objective: Directly generating conformations as Cartesian coordinates depends heavily on the arbitrary rotation and translation. Therefore, previous effective approaches (Simm and Hernández-Lobato, 2020; Xu et al., 2021) instead let the decoder generate inter-atomic distances d = D_θ(z, G). The distances are taken as intermediate variables for generating conformations and are invariant to rotation and translation. To generate a conformation R, one first generates the set of distances d, and then post-processes d to obtain the 3D positions by solving a distance geometry optimization problem:

    R* = argmin_R g(R, d),   g(R, d) = ∑_{e_uv ∈ E} ( ‖r_u − r_v‖₂ − d_uv )²    (4)

which we take as the inner loop objective.
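A minimal numpy sketch of this inner problem (assuming the standard stress-style objective above; the code and names are ours, not the paper's implementation) solves for coordinates by plain gradient descent:

```python
import numpy as np

def solve_distance_geometry(edges, d, n_atoms, eta=0.05, T=2000, seed=0):
    """Recover 3D coordinates R from target edge lengths d by gradient
    descent on g(R, d) = sum_(i,j) (||r_i - r_j|| - d_ij)^2."""
    rng = np.random.default_rng(seed)
    R = rng.normal(scale=0.5, size=(n_atoms, 3))  # random initialization
    for _ in range(T):
        grad = np.zeros_like(R)
        for (i, j), dij in zip(edges, d):
            diff = R[i] - R[j]
            dist = np.linalg.norm(diff) + 1e-12
            g = 2.0 * (dist - dij) * diff / dist  # d/dR of one term
            grad[i] += g
            grad[j] -= g
        R -= eta * grad
    return R
```

For a 3-4-5 triangle the realized edge lengths converge to the targets; crucially, every update step is differentiable, which is what later lets gradients flow from the conformation back to the distances.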

Outer objective: Ultimately, we are interested in directly minimizing the generalization error on 3D structures, making the generated conformation consistent with the ground truth up to rotation and translation. The post-alignment Root-Mean-Square Deviation (RMSD) is a widely used metric for this purpose. To calculate it, an aligned conformation R̂′ = A(R, R̂) is first obtained by an alignment function A, which rotates and translates the reference conformation R̂ to have the smallest distance to the generated one R according to the RMSD metric:

    RMSD(R, R̂′) = ( (1/n) ∑_{i=1}^{n} ‖r_i − r̂′_i‖² )^{1/2}    (5)

where n is the number of atoms. The reconstruction objective can then be written as:

    L_recon(R, R̂) = ∑_{i=1}^{n} ‖r_i − r̂′_i‖²,   with R̂′ = A(R, R̂)    (6)

which is the outer loop objective used to compute the reconstruction loss and maximize the log-likelihood.
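The alignment function A is typically implemented with the Kabsch algorithm; below is a numpy sketch (our illustration) that aligns one point set onto another and returns the post-alignment RMSD:

```python
import numpy as np

def aligned_rmsd(R_ref, R_gen):
    """Rotate and translate R_ref onto R_gen (Kabsch algorithm) and
    return the post-alignment RMSD. Both inputs are (n, 3) arrays."""
    P = R_ref - R_ref.mean(axis=0)      # center the reference
    Q = R_gen - R_gen.mean(axis=0)      # center the generated points
    U, _, Vt = np.linalg.svd(P.T @ Q)   # SVD of the cross-covariance
    sign = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, sign])       # guard against reflections
    Rot = Vt.T @ D @ U.T                # optimal proper rotation
    P_aligned = P @ Rot.T
    return float(np.sqrt(np.mean(np.sum((P_aligned - Q) ** 2, axis=1))))
```

Applying an arbitrary rotation and translation to a point set leaves this metric at zero, which is exactly the invariance the outer objective is designed to respect.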

Figure 1: The overall framework of ConfVAE. At training time, given the graph G and conformation R̂, we: 1) compute the distributions of the prior p_ψ(z | G) and the encoder q_φ(z | R̂, G), and calculate the prior regularizer L_prior (the KL term in equation 3); 2) sample z from q_φ by reparameterization and feed it into the decoder (generator) to generate inter-atomic distances d, after which we can compute an auxiliary objective L_aux against the true distances derived from R̂; 3) run the inner loop (distance geometry) to recover the 3D structure R from d, and compute the reconstruction RMSD loss L_recon. The model is trained end-to-end by optimizing the sum of the three objective components L_recon, L_aux and L_prior.

Bilevel program: Now we can consider equation 4 and equation 6 as the inner and outer objectives of a bilevel programming problem. In this formulation, the outer objective aims to model the true conditional distribution p(R | G), while the inner objective solves for the conformation given a set of predicted distances. Taking the expectation over the latent variable z, the resulting bilevel program for the reconstruction term in equation 3 can be written as:

    min_{θ,φ} E_{q_φ(z|R̂,G)} [ L_recon(R(z), R̂) ]    (7)
    s.t.  R(z) = argmin_R g(R, D_θ(z, G))    (8)

The derived bilevel problem is still challenging because: 1) the solution of conformation structure in the inner problem is not available in closed form; 2) computing this expectation exactly over the continuous latent space is intractable. Thus, in practice we compute an empirical estimation of the output with a variational inference model and the reparametrization trick. We elaborate on how we address these issues in the following parts.

3.2 Generative Model

Figure 2: Schematic illustration of the forward and backward computational graph through the inner loop (distance geometry optimization). We repeatedly update the geometry R_t with the gradient of the inner objective during the forward computation, and accumulate hypergradients to update the parameters of the encoder and decoder during the backward computation.

We now have the tools needed to define our conditional generative model of molecular conformations. The cornerstone of all modules (encoder, prior and decoder) is the message-passing neural network (MPNN) (Gilmer et al., 2017), a variant of graph neural networks that achieves state-of-the-art performance in representation learning for molecules (Scarselli et al., 2008; Bruna et al., 2013; Duvenaud et al., 2015; Kipf and Welling, 2016; Kearnes et al., 2016; Schütt et al., 2017). An MPNN operates directly on the graph representation and is invariant to graph isomorphism. In each convolutional (message-passing) layer, atomic embeddings are updated by aggregating information from neighboring nodes.

For the encoder q_φ(z | R, G) and the prior p_ψ(z | G), we use the same MPNN architecture as Mansimov et al. (2019) and Simm and Hernández-Lobato (2020). Since bilevel optimization has a relatively high memory cost, we use a continuous normalizing flow (CNF) based on ordinary differential equations (ODEs) (Chen et al., 2018) for the decoder, which has constant memory cost. We describe the details of our decoder model below.

Decoder Architecture. As illustrated in Sec. 3.1, our decoder is composed of two cascaded levels: a distance prediction model D_θ that decodes z back into a set of distances d, and a differentiable distance geometry procedure that recovers the geometry R from the distances d. The model D_θ is implemented as a conditional extension of the CNF, which transforms noise variables d(t₀) (also the initial distances of the CNF ODE trajectory) sampled from a Gaussian prior into the final distances d = d(t₁). The transformation is conditioned on the latent variable z as well as the graph G:

    d = d(t₀) + ∫_{t₀}^{t₁} f_θ( d(t), t; z, G ) dt    (9)

where f_θ is an MPNN that defines the continuous-time dynamics of the flow conditioned on z and G. Note that, given the true distances d̂, the noise d(t₀) can also be easily computed by reversing the continuous dynamics: d(t₀) = d̂ + ∫_{t₁}^{t₀} f_θ( d(t), t; z, G ) dt. Thus the exact conditional log-likelihood of the distances given z and G can be computed by:

    log p_θ( d̂ | z, G ) = log p( d(t₀) ) − ∫_{t₀}^{t₁} Tr( ∂f_θ / ∂d(t) ) dt    (10)
An ODE solver can then be applied to estimate the gradients on the parameters for optimization. In practice, the negative distance log-likelihood L_aux = −log p_θ(d̂ | z, G) can be taken as an auxiliary objective defined on distances to supervise the training. In summary, the training objective can be interpreted as the sum of three parts:

    L = L_recon + α·L_aux + β·L_prior    (11)

where α and β are hyperparameters that reweight each component, and L_prior denotes the KL regularizer in equation 3. The overall framework is illustrated in Fig. 1.

3.3 End-to-end Learning via Hypergradient Descent

We now discuss how to optimize the bilevel problem defined by equation 8 and equation 7 through a practical algorithm. The inner problem in equation 8 is a classic distance geometry problem: inferring 3D coordinates from pairwise distances (Anand and Huang, 2018; Simm and Hernández-Lobato, 2020; Xu et al., 2021). Others have used a semi-definite program (SDP) to infer protein structure from nuclear magnetic resonance data (Alipanahi et al., 2013), or an Alternating Direction Method of Multipliers (ADMM) algorithm to fold proteins into 3D Cartesian coordinates (Anand and Huang, 2018). In this initial work we choose gradient descent (GD), with tractable learning dynamics Φ, to approximately solve for the geometry:

    R_t = Φ( R_{t−1}, d ) = R_{t−1} − η ∇_R g( R_{t−1}, d )    (12)

where η is the learning rate and d is the distance set generated by the distance prediction model. Under appropriate assumptions and for a sufficient number of updates T, GD converges to a proper geometry R_T that depends on the predicted pairwise distances (Bottou, 2010).

Now we consider how to calculate the hypergradient from the outer-loop reconstruction objective (equation 7) to train the model. Let R_T denote the conformation generated by approximately solving the distance geometry problem with T steps of gradient descent. We can write the hypergradient as:

    ∇_θ L_recon( R_T, R̂ ) = ( ∂L_recon / ∂R_T ) ( dR_T / dd ) ( ∂d / ∂θ )    (13)

where the gradient dR_T / dd can be computed by fully unrolling the dynamics of the inner loop from R_T back to R_0. Specifically, in the forward computation, the successive geometries R_1, …, R_T produced by the optimization dynamics are cached. In the backward call, the cached geometries are used to compute gradients in a series of vector-Jacobian products (VJPs). During the reverse computation, the gradient starting from R_T is propagated to the intermediate geometries R_t through Φ:

    dL_recon / dR_{t−1} = ( dL_recon / dR_t ) ( ∂Φ(R_{t−1}, d) / ∂R_{t−1} ) = ( dL_recon / dR_t ) ( I − η ∇²_R g(R_{t−1}, d) )    (14)
where ∇²_R g denotes the Hessian of the inner objective w.r.t. R. With the iteratively computed derivatives dL_recon / dR_t, the adjoints on the distances d can be computed in the form of VJPs and further backpropagated to the parameters of the encoder and decoder. Formally, dL_recon / dd is computed by:

    dL_recon / dd = ∑_{t=1}^{T} ( dL_recon / dR_t ) ( ∂Φ(R_{t−1}, d) / ∂d )    (15)

where dL_recon / dR_t can be substituted by equation 14. The computation can be done efficiently with reverse-mode automatic differentiation software such as PyTorch (Paszke et al., 2019). A schematic illustration of the forward and backward computational graph through distance geometry is presented in Fig. 2. We provide a detailed algorithm of the training procedure in Appendix B.
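The derivative chain above can be checked numerically. The sketch below (ours; it re-implements a toy gradient-descent solver and uses an alignment-free edge-length loss in place of the full RMSD for brevity) differentiates the outer loss through the entire unrolled inner loop by central finite differences, which is the same quantity reverse-mode autodiff computes analytically:

```python
import numpy as np

EDGES = [(0, 1), (1, 2), (0, 2)]     # a toy 3-atom "molecule"
D_TRUE = np.array([3.0, 4.0, 5.0])   # ground-truth edge lengths

def inner_gd(d, T=800, eta=0.05, seed=1):
    # Phi unrolled: T gradient-descent steps on g(R, d), fixed init.
    rng = np.random.default_rng(seed)
    R = rng.normal(scale=0.5, size=(3, 3))
    for _ in range(T):
        grad = np.zeros_like(R)
        for (i, j), dij in zip(EDGES, d):
            diff = R[i] - R[j]
            dist = np.linalg.norm(diff) + 1e-12
            g = 2.0 * (dist - dij) * diff / dist
            grad[i] += g
            grad[j] -= g
        R -= eta * grad
    return R

def outer_loss(d):
    # Surrogate outer objective: squared error of the realized edge
    # lengths of the recovered geometry against the ground truth.
    R = inner_gd(d)
    lengths = np.array([np.linalg.norm(R[i] - R[j]) for i, j in EDGES])
    return float(np.sum((lengths - D_TRUE) ** 2))

def hypergradient_fd(d, eps=1e-4):
    # dL/dd through the *entire* unrolled solver, by central differences.
    g = np.zeros_like(d)
    for k in range(len(d)):
        dp, dm = d.copy(), d.copy()
        dp[k] += eps
        dm[k] -= eps
        g[k] = (outer_loss(dp) - outer_loss(dm)) / (2 * eps)
    return g
```

A step along the negative hypergradient moves the predicted distances toward values whose recovered geometry better matches the ground truth; this is precisely the training signal that two-stage methods never see.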

3.4 Sampling

Given the graph G, to generate a conformation R we first draw the latent variable z from the prior distribution p_ψ(z | G). We then sample the random initial distances d(t₀) from a Gaussian distribution and pass them through the invertible neural ODE, conditioned on z and G, to obtain the distance set d. Finally, we produce the conformation R by solving the distance geometry optimization problem defined in equation 4.

Dataset           GEOM-QM9                          GEOM-Drugs
                  COV (%)         MAT (Å)           COV (%)         MAT (Å)
Method            Mean   Median   Mean    Median    Mean   Median   Mean    Median

8.52 5.62 0.7810 0.7811 0.00 0.00 2.5225 2.4680

55.09 56.47 0.4649 0.4298 7.76 0.00 1.9840 2.0108

69.60 70.64 0.3915 0.3986 49.92 41.07 1.2698 1.3064

75.57 80.76 0.3873 0.3850 51.24 46.36 1.2487 1.2609

77.98 82.82 0.3778 0.3770 52.59 56.41 1.2330 1.2270

79.94 87.20 0.3238 0.3195 65.43 70.00 1.0962 1.0877

63.10 60.95 0.3939 0.4297 83.08 95.21 0.9829 0.9177

GraphDG + FF
70.67 70.82 0.4168 0.3609 84.68 93.94 0.9129 0.9090

73.52 72.75 0.3131 0.3251 92.28 98.15 0.7740 0.7338

ConfVAE- + FF
77.95 79.14 0.2851 0.2817 91.48 99.21 0.7743 0.7436
ConfVAE + FF 81.46 83.80 0.2702 0.2709 91.88 100.00 0.7634 0.7312

  • For COV, the threshold δ is set as 0.5 Å for QM9 and 1.25 Å for Drugs, following Xu et al. (2021).

Table 1: Comparison of different methods on the conformation generation task. Top rows: deep generative models for molecular conformation generation. Bottom rows: different methods with an additional rule-based force field to further optimize the generated structures. We report the COV and MAT scores, where Mean and Median are calculated over the different molecular graphs in the GEOM test sets. In practice, the size of the generated set is twice that of the reference set, following Xu et al. (2021).

4 Experiments

4.1 Experiment Setup

Evaluation Tasks. Following previous work on conformation generation (Mansimov et al., 2019; Simm and Hernández-Lobato, 2020; Xu et al., 2021), we conduct extensive experiments comparing our method with state-of-the-art baseline models on several standard tasks. Conformation generation was formulated by Xu et al. (2021) and concentrates on a model's capacity to generate realistic and diverse molecular conformations. Distance distribution modeling was first proposed by Simm and Hernández-Lobato (2020) and evaluates whether a method can model the underlying distribution of distances.

Baselines. We compare our proposed model with the following state-of-the-art conformation generation methods. CVGAE (Mansimov et al., 2019) is a conditional VAE-based model that applies several graph neural network layers to learn atom representations from the molecular graph and then directly predicts the 3D coordinates. GraphDG (Simm and Hernández-Lobato, 2020) also employs the conditional VAE framework; instead of directly generating conformations in 3D coordinates, it learns the distribution over distances, which are then converted into conformations with a distance geometry algorithm. CGCF (Xu et al., 2021), another two-stage method, uses continuous normalizing flows to predict the atomic pairwise distances. Following the baselines, we also compare our model with RDKit (Riniker and Landrum, 2015), a classical distance geometry approach built upon an extensive collection of edge-length calculations from computational chemistry.

Featurization and Implementation. The MPNNs used for the encoder, prior and decoder are all implemented as Graph Isomorphism Networks (Xu et al., 2018; Hu et al., 2019). For the input features of the graph representation, we derive only the atom and bond types from the molecular graphs. As a default setup, all MPNNs share the same number of layers and hidden embedding dimension. We train ConfVAE on a single Tesla V100 GPU until convergence, with Adam (Kingma and Ba, 2014) as the optimizer.

4.2 Conformation Generation

Figure 3: Visualization of generated conformations from state-of-the-art baselines, our method and the reference set, where four molecular graphs are randomly taken from the test set of GEOM-Drugs. C, O, H, S and Cl are colored gray, red, white, yellow and green respectively.

Datasets. Following Xu et al. (2021), we use the recently proposed GEOM-QM9 and GEOM-Drugs (Axelrod and Gomez-Bombarelli, 2020) datasets for the conformation generation task. The Geometric Ensemble Of Molecules (GEOM) dataset contains millions of high-quality stable conformations, making it well suited for this task. The GEOM-Drugs dataset consists of generally medium-sized organic compounds, with an average of 44.2 atoms per molecule. We follow the setting of Xu et al. (2021) and randomly take 50,000 conformation-molecule pairs as the training set and another 9,161 conformations (covering 100 molecular graphs) as the test split. By contrast, GEOM-QM9 is a much smaller dataset limited to small molecules with up to 9 heavy atoms. Similarly, we randomly draw 50,000 conformation-molecule pairs for the training set and another 17,813 conformations covering 150 molecular graphs for the test set.

Evaluation metrics. In this task we want the generated samples to be of both high quality and high diversity. We follow previous work (Hawkins, 2017; Mansimov et al., 2019; Xu et al., 2021) and calculate the RMSD over the heavy atoms between generated and reference samples. Given the generated conformation R and the reference R̂, we take the same alignment function A defined in equation 5 to obtain the aligned conformation R̂′ = A(R, R̂), and then calculate RMSD(R, R̂′) as in equation 5, where n is now the number of heavy atoms. Built upon the RMSD metric, Xu et al. (2021) defined the Coverage (COV) and Matching (MAT) scores to measure diversity and quality respectively. COV counts the fraction of conformations in the reference set that are covered by at least one conformation in the generated set:

    COV( S_g, S_r ) = (1 / |S_r|) · | { R̂ ∈ S_r : RMSD(R, R̂) < δ for some R ∈ S_g } |

where S_g and S_r denote the sets of generated and reference conformations respectively. Typically, a higher COV score indicates better diversity, i.e., better coverage of the complex true distribution.

While COV is able to detect mode collapse, it gives no guarantee on the quality of generated samples. Thus, the MAT score is defined as a complementary metric that concentrates on quality (Xu et al., 2021):

    MAT( S_g, S_r ) = (1 / |S_r|) ∑_{R̂ ∈ S_r} min_{R ∈ S_g} RMSD(R, R̂)

Generally, more realistic generated samples lead to a lower MAT score.
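Both scores are straightforward to compute given an RMSD routine; a small sketch (ours), parameterized by the distance function so the toy usage below can use scalars in place of conformations:

```python
import numpy as np

def cov_mat_scores(gen, ref, rmsd_fn, delta):
    """COV: fraction of reference conformations within delta of some
    generated one. MAT: mean over references of the minimum RMSD."""
    min_rmsds = [min(rmsd_fn(r, g) for g in gen) for r in ref]
    cov = float(np.mean([m < delta for m in min_rmsds]))
    mat = float(np.mean(min_rmsds))
    return cov, mat
```

With references {0, 1, 2}, generated samples {0.1, 2.2}, absolute difference as the "RMSD" and δ = 0.5, two of three references are covered (COV = 2/3) and the mean minimum distance is 0.4 (MAT).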

Results. We compute the COV and MAT metrics on both GEOM-QM9 and GEOM-Drugs for all baselines and summarize the results in Tab. 1; several representative examples are visualized in Fig. 3. Our ConfVAE outperforms all existing strong baselines by a clear margin (top rows). By incorporating an end-to-end training objective via bilevel optimization, we consistently achieve better results on all four metrics. By contrast, the current state-of-the-art models GraphDG and CGCF perform much worse due to their two-stage generation process, where the extra error introduced by the distance geometry step cannot be accounted for during training. CVGAE shares the same training and testing objective as our method, but still shows inferior performance since it fails to keep the vital translation- and rotation-invariance property.

Similar to previous work (Mansimov et al., 2019; Xu et al., 2021), we also further test all models by incorporating a rule-based empirical force field (Halgren, 1996b) and compare the performance with the classic RDKit toolkit. Specifically, we first generate conformations with the generative models as initial structures, and then use the force field to further optimize them. These additional results are reported in Tab. 1 (bottom rows). As shown in the table, ConfVAE still achieves the best results among all generative models. More importantly, our method outperforms RDKit on all but one of the evaluations and achieves competitive results on the remaining one, making it practically useful for real-world applications.

Ablation Study. So far we have demonstrated the superior performance of the proposed method. However, because we adopt a slightly different architecture, it remains unclear where the effectiveness comes from. In this part, we conduct an ablation study by removing the bilevel component defined in equation 7 during training, i.e., removing L_recon and learning the model with only the distance and prior objectives. We denote this variant as ConfVAE-, and summarize the additional results in Tab. 1.

As shown in the table, removing the bilevel component hurts performance. These results verify that learning only from distances introduces extra bias into the generated conformations, and that our end-to-end method of learning directly on the 3D structures helps to overcome this issue. Another observation is that, as a combination of flow-based and VAE-based models, ConfVAE- still achieves significantly better results than the flow-based CGCF and the VAE-based GraphDG, with exactly the same training and sampling process. This indicates that incorporating both the global latent variable z and the local latent distance variables benefits the generated conformations, helping to capture both the global and local geometric structure and atomic interactions.

Single Pair All

Mean Median Mean Median Mean Median

3.4513 3.1602 3.8452 3.6287 4.0866 3.7519

4.1789 4.1762 4.9184 5.1856 5.9747 5.9928

0.7645 0.2346 0.8920 0.3287 1.1949 0.5485

0.4490 0.1786 0.5509 0.2734 0.8703 0.4447

0.2551 0.1352 0.2719 0.1742 0.2968 0.2132

0.1809 0.1153 0.1946 0.1455 0.2113 0.2047

Table 2: Comparison of different models on the distance distribution modeling task. We compare the marginal distributions p(d_uv | G) (Single), pairwise distributions p(d_uv, d_st | G) (Pair) and the joint distribution p(d | G) (All) over edges connecting C and O atoms. We report the Mean and Median of the MMD metric. Molecular graphs are taken from the test set of ISO17.

4.3 Distance Distribution Modeling

Dataset. For the distances modeling task, we follow Simm and Hernández-Lobato (2020); Xu et al. (2021) and use the ISO17 dataset (Simm and Hernández-Lobato, 2020). ISO17 is constructed from the snapshots of ab initio molecular dynamics simulations, where the coordinates are not just equilibrium conformations but are samples that reflect the underlying density around equilibrium states. We follow previous work to split ISO17 into a training set with 167 molecules and a test set with the other 30 molecules.

Evaluation metrics. To obtain a distribution over distances from a distribution over conformations, we sample a set of conformations and then calculate the corresponding inter-atomic distances between C and O atoms (H atoms are usually ignored). Let p(d_uv | G) denote the conditional distribution of the distance on edge e_uv given a molecular graph G. To evaluate the distance distributions, we use the maximum mean discrepancy (MMD) (Gretton et al., 2012) to the ground-truth distributions. More specifically, we evaluate against the ground truth the MMD of the marginal distribution p(d_uv | G) of each individual edge, of pairs of distances p(d_uv, d_st | G), and of the joint distribution p(d | G). For this benchmark, the size of the generated sample set is the same as that of the reference set.
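The MMD between two empirical samples of distances under a Gaussian kernel can be sketched as follows (our illustration; the kernel bandwidth here is an arbitrary choice, not the value used in the benchmark):

```python
import numpy as np

def mmd_squared(x, y, sigma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between 1-D
    samples x and y under a Gaussian kernel with bandwidth sigma."""
    def gram(a, b):
        return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * sigma ** 2))
    return float(gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean())
```

The estimate is zero for identical samples and grows as the two distributions separate, which is why lower values in Tab. 2 indicate a better match to the ground-truth distance distribution.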

Results. The results of MMDs are summarized in Tab. 2. The statistics show that the generated distance distribution of ConfVAE is significantly closer to the ground-truth distribution compared with the baseline models. These results demonstrate that our method can not only generate realistic conformations, but also model the density around equilibrium states. By contrast, though RDKit shows competitive performance for conformation generation, it seems to struggle with the distribution modeling benchmark. This is because RDKit is only designed to find the equilibrium states by using the empirical force field (Halgren, 1996a), and thus it lacks the capacity to capture the underlying distribution. The further ablation study between ConfVAE and ConfVAE- also verifies the effectiveness of the bilevel optimization components.

5 Related Work

In recent years, deep learning has made significant progress on 3D structure generation. Several works use neural networks as energy prediction models, which serve as faster alternatives to quantum mechanics-based energy calculations (Schütt et al., 2017; Smith et al., 2017) for molecular dynamics simulation or molecule optimization (Wang et al., 2020). However, though accelerated by neural networks, these approaches remain time-consuming due to the lengthy sampling process. Recently, Gebauer et al. (2019) and Hoffmann and Noé (2019) proposed methods to generate new 3D molecules with deep generative models, while Simm et al. (2020b) and Simm et al. (2020a) employ reinforcement learning to search the vast geometric space. However, none of these methods is designed to generate conformations from the molecular graph structure, making them orthogonal to our framework.

Many other works (Lemke and Peter, 2019; AlQuraishi, 2019; Ingraham et al., 2019; Noé et al., 2019) also learn to directly predict 3D structures, but focus on the protein folding problem. In particular, Senior et al. (2020b) and Jumper et al. (2020) significantly advanced this field with an end-to-end attention-based model called AlphaFold. However, proteins are amino-acid sequences with low chemical diversity and much larger scale, for which abundant structural data exist, while general molecules are highly structured graphs with a variety of cycles and a much broader chemical composition, making it unclear whether these methods transfer to the general conformation generation task.

6 Conclusion

In this paper, we propose ConfVAE, an end-to-end framework for molecular conformation generation via bilevel programming. Thanks to end-to-end training based on bilevel programming, our generative model overcomes significant errors of previous two-stage models while preserving rotational and translational invariance. Experimental results demonstrate the superior performance of our method over state-of-the-art baselines on several standard benchmarks. Future work includes combining our bilevel optimization framework with other kinds of generative models, and extending our method to other challenging structures such as proteins.


This project is supported by the Natural Sciences and Engineering Research Council (NSERC) Discovery Grant, the Canada CIFAR AI Chair Program, collaboration grants between Microsoft Research and Mila, Samsung Electronics Co., Ltd., Amazon Faculty Research Award, Tencent AI Lab Rhino-Bird Gift Fund and an NRC Collaborative R&D Project (AI4D-CORE-06). This project was also partially funded by IVADO Fundamental Research Project grant PRF-2019-3583139727.


  • B. Alipanahi, N. Krislock, A. Ghodsi, H. Wolkowicz, L. Donaldson, and M. Li (2013) Determining protein structures from noesy distance constraints by semidefinite programming. Journal of Computational Biology 20 (4), pp. 296–310. Cited by: §3.3.
  • M. AlQuraishi (2019) End-to-end differentiable learning of protein structure. Cell systems 8 (4), pp. 292–301. Cited by: §5.
  • N. Anand and P. Huang (2018) Generative modeling for protein structures. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 7505–7516. Cited by: §3.3.
  • S. Axelrod and R. Gomez-Bombarelli (2020) GEOM: energy-annotated molecular conformations for property prediction and molecular generation. arXiv preprint arXiv:2006.05531. Cited by: §4.2.
  • A. J. Ballard, S. Martiniani, J. D. Stevenson, S. Somani, and D. J. Wales (2015) Exploiting the potential energy landscape to sample free energy. Wiley Interdisciplinary Reviews: Computational Molecular Science 5 (3), pp. 273–289. Cited by: §1.
  • Y. Bengio (2000) Gradient-based optimization of hyperparameters. Neural computation 12 (8), pp. 1889–1900. Cited by: §2.2, §2.2.
  • K. P. Bennett, J. Hu, X. Ji, G. Kunapuli, and J. Pang (2006) Model selection via bilevel optimization. In The 2006 IEEE International Joint Conference on Neural Network Proceedings, pp. 1922–1929. Cited by: §2.2.
  • L. Bottou (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pp. 177–186. Cited by: §3.3.
  • J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203. Cited by: §3.2.
  • R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud (2018) Neural ordinary differential equations. In Advances in neural information processing systems, pp. 6571–6583. Cited by: §3.2.
  • B. Colson, P. Marcotte, and G. Savard (2007) An overview of bilevel optimization. Annals of operations research 153 (1), pp. 235–256. Cited by: §2.2.
  • G. M. Crippen, T. F. Havel, et al. (1988) Distance geometry and molecular conformation. Vol. 74, Research Studies Press Taunton. Cited by: Appendix A.
  • M. De Vivo, M. Masetti, G. Bottegoni, and A. Cavalli (2016) Role of molecular dynamics and related methods in drug discovery. Journal of medicinal chemistry 59 (9), pp. 4035–4061. Cited by: §1.
  • J. Domke (2012) Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pp. 318–326. Cited by: §2.2.
  • D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. arXiv preprint arXiv:1509.09292. Cited by: §3.2.
  • R. Flamary, A. Rakotomamonjy, and G. Gasso (2014) Learning constrained task similarities in graph-regularized multi-task learning. Regularization, Optimization, Kernels, and Support Vector Machines, pp. 103. Cited by: §2.2.
  • L. Franceschi, M. Donini, P. Frasconi, and M. Pontil (2017) Forward and reverse gradient-based hyperparameter optimization. arXiv preprint arXiv:1703.01785. Cited by: §2.2.
  • L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil (2018) Bilevel programming for hyperparameter optimization and meta-learning. arXiv preprint arXiv:1806.04910. Cited by: §1, §2.2.
  • N. Gebauer, M. Gastegger, and K. Schütt (2019) Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules. In Advances in Neural Information Processing Systems, pp. 7566–7578. Cited by: §5.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212. Cited by: §1, §3.2.
  • A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola (2012) A kernel two-sample test. The Journal of Machine Learning Research 13 (1), pp. 723–773. Cited by: §4.3.
  • A. Griewank and A. Walther (2008) Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM. Cited by: §2.2.
  • T. A. Halgren (1996a) Merck molecular force field. i. basis, form, scope, parameterization, and performance of mmff94. Journal of computational chemistry 17 (5-6), pp. 490–519. Cited by: §4.3.
  • T. A. Halgren (1996b) Merck molecular force field. v. extension of mmff94 using experimental data, additional computational data, and empirical rules. Journal of Computational Chemistry 17 (5-6), pp. 616–641. Cited by: §4.2.
  • P. C. Hawkins (2017) Conformation generation: the state of the art. Journal of Chemical Information and Modeling 57 (8), pp. 1747–1756. Cited by: §4.2.
  • M. Hoffmann and F. Noé (2019) Generating valid euclidean distance matrices. arXiv preprint arXiv:1910.03131. Cited by: §5.
  • W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and J. Leskovec (2019) Strategies for pre-training graph neural networks. arXiv preprint arXiv:1905.12265. Cited by: §4.1.
  • J. Ingraham, A. J. Riesselman, C. Sander, and D. S. Marks (2019) Learning protein structure with a differentiable simulator.. In International Conference on Learning Representations, Cited by: §5.
  • J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, K. Tunyasuvunakool, O. Ronneberger, R. Bates, A. Zidek, A. Bridgland, et al. (2020) High accuracy protein structure prediction using deep learning. Fourteenth Critical Assessment of Techniques for Protein Structure Prediction (Abstract Book) 22, pp. 24. Cited by: §1, §5.
  • S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley (2016) Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design 30 (8), pp. 595–608. Cited by: §3.2.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §3.1, §4.1.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §3.2.
  • T. Lemke and C. Peter (2019) Encodermap: dimensionality reduction and generation of molecule conformations. Journal of chemical theory and computation 15 (2), pp. 1209–1215. Cited by: §5.
  • L. Liberti, C. Lavor, N. Maculan, and A. Mucherino (2014) Euclidean distance geometry and applications. SIAM review 56 (1), pp. 3–69. Cited by: §1.
  • D. Maclaurin, D. Duvenaud, and R. Adams (2015) Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122. Cited by: §2.2, §2.2.
  • E. Mansimov, O. Mahmood, S. Kang, and K. Cho (2019) Molecular geometry prediction using a deep generative graph neural network. arXiv preprint arXiv:1904.00314. Cited by: §1, §3.2, §4.1, §4.1, §4.2, §4.2.
  • L. Muñoz-González, B. Biggio, A. Demontis, A. Paudice, V. Wongrassamee, E. C. Lupu, and F. Roli (2017) Towards poisoning of deep learning algorithms with back-gradient optimization. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 27–38. Cited by: §2.2.
  • F. Noé, S. Olsson, J. Köhler, and H. Wu (2019) Boltzmann generators: sampling equilibrium states of many-body systems with deep learning. Science 365 (6457), pp. eaaw1147. Cited by: §5.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703. Cited by: §3.3.
  • S. Riniker and G. A. Landrum (2015) Better informed distance geometry: using what we know to improve conformation generation. Journal of chemical information and modeling 55 (12), pp. 2562–2574. Cited by: §4.1.
  • F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. IEEE transactions on neural networks 20 (1), pp. 61–80. Cited by: §3.2.
  • K. Schütt, P. Kindermans, H. E. S. Felix, S. Chmiela, A. Tkatchenko, and K. Müller (2017) SchNet: a continuous-filter convolutional neural network for modeling quantum interactions. In Advances in Neural Information Processing Systems, pp. 991–1001. Cited by: §3.2, §5.
  • A. W. Senior, R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, T. Green, C. Qin, A. Žídek, A. W. Nelson, A. Bridgland, et al. (2020a) Improved protein structure prediction using potentials from deep learning. Nature 577 (7792), pp. 706–710. Cited by: §1.
  • A. W. Senior, R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, T. Green, C. Qin, A. Žídek, A. W. Nelson, A. Bridgland, et al. (2020b) Improved protein structure prediction using potentials from deep learning. Nature 577 (7792), pp. 706–710. Cited by: §5.
  • C. Shi, M. Xu, H. Guo, M. Zhang, and J. Tang (2020a) A graph to graphs framework for retrosynthesis prediction. In International Conference on Machine Learning, pp. 8818–8827. Cited by: §1.
  • C. Shi, M. Xu, Z. Zhu, W. Zhang, M. Zhang, and J. Tang (2020b) GraphAF: a flow-based autoregressive model for molecular graph generation. arXiv preprint arXiv:2001.09382. Cited by: §1.
  • G. N. Simm and J. M. Hernández-Lobato (2020) A generative model for molecular distance geometry. In International Conference on Machine Learning, Cited by: Appendix A, §C.1, §C.1, §C.1, §1, §2.1, §3.1, §3.2, §3.3, §4.1, §4.1, §4.3.
  • G. N. Simm, R. Pinsler, G. Csányi, and J. M. Hernández-Lobato (2020a) Symmetry-aware actor-critic for 3d molecular design. arXiv preprint arXiv:2011.12747. Cited by: §5.
  • G. Simm, R. Pinsler, and J. M. Hernández-Lobato (2020b) Reinforcement learning for molecular design guided by quantum mechanics. In International Conference on Machine Learning, pp. 8959–8969. Cited by: §5.
  • D. G. Smith, L. A. Burns, A. C. Simmonett, R. M. Parrish, M. C. Schieber, R. Galvelis, P. Kraus, H. Kruse, R. Di Remigio, A. Alenaizan, et al. (2020) PSI4 1.4: open-source software for high-throughput quantum chemistry. The Journal of Chemical Physics 152 (18), pp. 184108. Cited by: §C.1.
  • J. S. Smith, O. Isayev, and A. E. Roitberg (2017) ANI-1: an extensible neural network potential with dft accuracy at force field computational cost. Chemical science 8 (4), pp. 3192–3203. Cited by: §5.
  • W. Wang, T. Yang, W. H. Harris, R. Gomez-Bombarelli, and R. Gómez-Bombarelli (2020) Active learning and neural network potentials accelerate molecular screening of ether-based solvate ionic liquids. Chemical Communications 56 (63), pp. 8920. External Links: ISSN 1359-7345, Link Cited by: §5.
  • K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §4.1.
  • M. Xu, S. Luo, Y. Bengio, J. Peng, and J. Tang (2021) Learning neural generative dynamics for molecular conformation generation. In International Conference on Learning Representations, External Links: Link Cited by: Appendix A, §1, §2.1, item *, §3.1, §3.3, Table 1, §4.1, §4.1, §4.2, §4.2, §4.2, §4.2, §4.3.
  • J. You, B. Liu, Z. Ying, V. Pande, and J. Leskovec (2018) Graph convolutional policy network for goal-directed molecular graph generation. In Advances in neural information processing systems, pp. 6410–6421. Cited by: §1.

Appendix A Data Preprocessing

Inspired by classical molecular distance geometry (Crippen et al., 1988), our framework also generates conformations by taking the inter-atomic distances as intermediate variables, which yields invariance to rotation and translation. In practice, the chemical bonds in the molecular graph are not sufficient to determine a conformation, and thus we follow existing works (Simm and Hernández-Lobato, 2020; Xu et al., 2021) and first expand the graphs with auxiliary edges. Specifically, atoms that are 2 or 3 hops away are connected with virtual bonds, labeled differently from the real bonds in the original molecule. These extra bonds reduce the degrees of freedom of the 3D coordinates and help characterize the graph uniquely: edges between 2-hop neighbors fix the angles between atoms, and those between 3-hop neighbors fix the dihedral angles.
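The graph expansion described above can be sketched with a breadth-first search over the bond graph; the representation below (an edge dictionary keyed by hop count) is a simplification of the paper's labeled graphs.

```python
from collections import deque
from itertools import combinations

def expand_graph(n_atoms, bonds):
    """Given real bonds (pairs of atom indices), add virtual bonds between
    atoms 2 or 3 hops apart on the bond graph. Returns {(u, v): hop}.
    Illustrative sketch; the paper's edge labeling scheme may differ."""
    adj = {i: set() for i in range(n_atoms)}
    for u, v in bonds:
        adj[u].add(v)
        adj[v].add(u)

    def hop_dist(src):
        # Breadth-first search, truncated at 3 hops.
        dist, queue = {src: 0}, deque([src])
        while queue:
            cur = queue.popleft()
            if dist[cur] == 3:
                continue
            for nxt in adj[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        return dist

    edges = {}
    for u in range(n_atoms):
        for v, h in hop_dist(u).items():
            if u < v and h in (1, 2, 3):
                edges[(u, v)] = h  # hop=1: real bond; 2: angle; 3: dihedral
    return edges

# A 4-atom chain 0-1-2-3 gains virtual bonds (0,2), (1,3) and (0,3).
edges = expand_graph(4, [(0, 1), (1, 2), (2, 3)])
```

On the chain example, the two 2-hop edges pin the two bond angles and the single 3-hop edge pins the one dihedral angle, matching the geometric role described above.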

Appendix B Training Algorithm

Input: objective reweighting coefficients and ; the inner loop optimization iterations and learning rate ; alignment function ; data samples .
Initialize: prior , decoder and its dynamics defined as , encoder

  while  have not converged do
      {Calculate from }
     Initialize , sample
     for  do
         {Inner loop}
     end for
  end while
  return , ,
Algorithm 1 Training Algorithm of ConfVAE.
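To make the inner loop of Algorithm 1 concrete, the following numpy sketch runs plain gradient descent on coordinates to fit a set of target distances. The actual method unrolls such a solver and backpropagates through it end-to-end; the step count, learning rate, and initialization here are illustrative assumptions.

```python
import numpy as np

def inner_solve(d_target, edges, n_atoms, steps=500, lr=0.05, seed=0):
    """Inner loop of the bilevel program: gradient descent on coordinates R
    minimizing the distance-geometry loss sum_(u,v) (||R_u - R_v|| - d_uv)^2.
    Plain numpy sketch with hand-derived gradients."""
    rng = np.random.default_rng(seed)
    r = rng.normal(size=(n_atoms, 3))  # random initial coordinates
    for _ in range(steps):
        grad = np.zeros_like(r)
        for (u, v), d in zip(edges, d_target):
            diff = r[u] - r[v]
            norm = np.linalg.norm(diff) + 1e-8
            g = 2.0 * (norm - d) * diff / norm  # d loss / d r_u
            grad[u] += g
            grad[v] -= g
        r -= lr * grad
    return r

# Recover an equilateral triangle from its three pairwise distances.
edges = [(0, 1), (1, 2), (0, 2)]
targets = [1.0, 1.0, 1.0]
coords = inner_solve(targets, edges, n_atoms=3)
err = max(abs(np.linalg.norm(coords[u] - coords[v]) - d)
          for (u, v), d in zip(edges, targets))
```

Because every step of this loop is differentiable in the target distances, replacing numpy with an autodiff framework lets the outer loop push gradients from the reconstructed coordinates back into the distance predictor, which is the essence of the bilevel formulation.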

Appendix C Additional Comparisons

C.1 Property Prediction

This task was first proposed by Simm and Hernández-Lobato (2020); it estimates the expected molecular properties of molecular graphs from a set of generated conformations. It further demonstrates the effectiveness and quality of generated samples, and is important for many real-world applications such as drug and material design.

Dataset. Following Simm and Hernández-Lobato (2020), we also employ the ISO17 dataset. More details about the dataset can be found in Sec. 4.3.

Evaluation metrics. For comparison, we calculate the ensemble properties of each molecular graph by averaging over a set of generated conformations. Specifically, we calculate the total electronic energy, the energies of the highest occupied and lowest unoccupied molecular orbitals (HOMO and LUMO), and the dipole moment, using the quantum chemical calculation package Psi4 (Smith et al., 2020). In practice, we generate samples from different methods to estimate each property, and report the median error of the averaged properties to measure prediction accuracy. Similar to Simm and Hernández-Lobato (2020), we exclude CVGAE from this analysis due to its poor sample quality.
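The ensemble-property evaluation can be sketched as follows, with hypothetical per-conformation property values standing in for actual Psi4 outputs.

```python
import numpy as np

def median_ensemble_error(props_gen, props_ref):
    """props_*: dict mapping molecule id -> list of per-conformation property
    values (e.g. total energies from a quantum chemistry package).
    Ensemble property = mean over conformations; score = median |error|
    across molecules."""
    errors = [abs(np.mean(props_gen[m]) - np.mean(props_ref[m]))
              for m in props_ref]
    return float(np.median(errors))

# Hypothetical per-conformation energies (kJ/mol) for three molecules.
ref = {"m1": [10.0, 10.4], "m2": [20.0, 19.6], "m3": [5.0, 5.2]}
gen = {"m1": [10.1, 10.5], "m2": [21.0, 20.6], "m3": [5.0, 5.0]}
err = median_ensemble_error(gen, ref)
```

The median (rather than the mean) of the per-molecule errors makes the score robust to the occasional conformation set whose quantum-chemical properties are wildly off.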

Results. The results are shown in Tab. 3. ConfVAE outperforms all other generative models and shows competitive results compared with RDKit. Closer inspection indicates that CGCF struggles with this task because its generated conformations suffer from extremely high variance. By contrast, our proposed method achieves the best performance thanks to the high quality of its generated samples.

E (kJ/mol)    HOMO (eV)    LUMO (eV)    Dipole (debye)
42.7          0.08         0.15         0.29
58.0          0.10         0.09         0.33
208.2         0.80         1.11         0.46
40.2          0.10         0.08         0.29

Table 3: Median difference in averaged properties between ground-truth and generated conformations from different methods.

C.2 More Results of Coverage Score

Figure 4: Coverage score curves with different thresholds on the GEOM-QM9 (left two) and GEOM-Drugs (right two) datasets. The first and third plots evaluate the generated conformations from different generative models, while the other two show conformations further optimized with the empirical force field.
Figure 5: Marginal distributions of ground-truth and generated conformations from generative models. We study the edges between C and O atoms, and omit the H atoms for clarity. In each subplot, the annotation denotes the atoms connected by the corresponding chemical bond.

In this section, we provide more results on the Coverage (COV) score under different thresholds. Details of the COV score can be found in Sec. 4.2. Results are shown in Fig. 4. As shown in the figure, ConfVAE consistently achieves better performance than previous state-of-the-art models, which demonstrates that our method is capable of generating more realistic samples.
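Given a precomputed RMSD matrix between reference and generated conformations, the COV score can be sketched as below, assuming the usual convention that a reference conformation counts as covered when some generated conformation lies within the threshold.

```python
import numpy as np

def coverage(rmsd, delta):
    """rmsd[i][j]: RMSD between reference conformation i and generated
    conformation j. A reference conformation is covered when some generated
    one lies within threshold delta; COV is the covered fraction."""
    rmsd = np.asarray(rmsd, dtype=float)
    return float((rmsd.min(axis=1) <= delta).mean())

# 3 reference conformations vs. 2 generated ones (hypothetical RMSDs).
rmsd = [[0.2, 0.9],
        [1.1, 0.4],
        [1.5, 2.0]]
cov = coverage(rmsd, delta=0.5)  # 2 of 3 references covered
```

Sweeping delta over a range of values and plotting the resulting COV scores produces curves like those in Fig. 4.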

C.3 Visualization of Distributions

In Fig. 5, we investigate the accuracy of generated conformations by visualizing the marginal distributions of all pairwise distances between C and O atoms of a molecular graph in the ISO17 test set. As shown in the figure, though primarily designed for learning 3D structures via an end-to-end framework, our method still estimates the distance distributions much better than the state-of-the-art model for molecular geometry modeling. As a representative pairwise property between atoms, the inter-atomic distances demonstrate the capacity of our model to capture inter-atomic interactions.
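The per-conformation step behind these marginal distributions can be sketched as follows; pooling the returned values over many sampled conformations yields the histograms of Fig. 5. The function name and toy data are illustrative.

```python
import numpy as np

def heavy_atom_distances(coords, atom_types, keep=("C", "O")):
    """Distances between all pairs of C/O atoms in one conformation,
    skipping H atoms by element type. Sketch of the extraction step behind
    the marginal distance distributions."""
    coords = np.asarray(coords, dtype=float)
    out = []
    for i in range(len(atom_types)):
        for j in range(i + 1, len(atom_types)):
            if atom_types[i] in keep and atom_types[j] in keep:
                out.append(float(np.linalg.norm(coords[i] - coords[j])))
    return out

# Toy conformation: one C-O pair 1.0 apart plus one H atom (ignored).
dists = heavy_atom_distances([[0.0, 0.0, 0.0],
                              [1.0, 0.0, 0.0],
                              [0.0, 1.0, 0.0]],
                             ["C", "O", "H"])
```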