
Structure-based Drug Design with Equivariant Diffusion Models

Structure-based drug design (SBDD) aims to design small-molecule ligands that bind with high affinity and specificity to pre-determined protein targets. Traditional SBDD pipelines start with large-scale docking of compound libraries from public databases, thus limiting the exploration of chemical space to existing, previously studied regions. Recent machine learning methods have approached this problem with atom-by-atom generation, which is computationally expensive. In this paper, we formulate SBDD as a 3D-conditional generation problem and present DiffSBDD, an E(3)-equivariant 3D-conditional diffusion model that generates novel ligands conditioned on protein pockets. Furthermore, we curate a new dataset of experimentally determined binding complex data from Binding MOAD to provide a realistic binding scenario that complements the synthetic CrossDocked dataset. Comprehensive in silico experiments demonstrate the efficiency of DiffSBDD in generating novel and diverse drug-like ligands that engage protein pockets with high binding energies as predicted by in silico docking.



1 Introduction

The rational design of molecular compounds to act as drugs remains an outstanding challenge in biopharmaceutical research. To support such efforts, structure-based drug design (SBDD) aims to generate small-molecule ligands that bind to a specific 3D protein structure with high affinity and specificity (Anderson, 2003). However, SBDD remains very challenging and has important limitations. A traditional SBDD campaign starts with the identification and validation of a target of interest and its subsequent structural characterisation using experimental structure determination methods. The first step in this process is the identification of the binding pocket: a cavity in which ligands may bind the target to elicit the desired therapeutic effect. This can be achieved via experimental means or a plethora of computational approaches (Pérot et al., 2010). Once a binding site is identified, the goal is to discover lead compounds that exhibit the desired biological activity. Importantly, to transition from leads to promising candidates, the compounds need to be evaluated against other drug development constraints that are also hard to predict (toxicity, absorption, etc.).

Traditionally, SBDD is handled either by high-throughput experimental screening (Blundell, 1996) or by virtual screening (Lyne, 2002; Shoichet, 2004) of large chemical databases. Not only is this expensive and time-consuming, but it also limits the exploration of chemical space to the historical knowledge of previously studied molecules, with a further emphasis usually placed on commercial availability (Irwin and Shoichet, 2005). Moreover, the optimization of initial lead molecules is often a biased process, with heavy reliance on human intuition (Ferreira et al., 2015).

Recent advances in geometric deep learning, especially in modeling geometric structures of biomolecules (Bronstein et al., 2021; Atz et al., 2021), provide a promising direction for structure-based drug design (Gaudelet et al., 2021). Even though utilizing deep learning as surrogate docking models has achieved remarkable progress (Lu et al., 2022; Stärk et al., 2022), deep learning-based design of ligands that bind to target proteins is still an open problem. Early attempts represented molecules as atomic density maps and used variational auto-encoders to generate new atomic density maps corresponding to novel molecules (Ragoza et al., 2022). However, it is nontrivial to map atomic density maps back to molecules, necessitating a subsequent atom-fitting stage. Follow-up work addressed this limitation by representing molecules as 3D graphs with atomic coordinates and types, which circumvents this post-processing step. Li et al. (2021) proposed an autoregressive generative model to sample ligands given the protein pocket as a conditioning constraint. Peng et al. (2022) improved this method by using an E(3)-equivariant graph neural network which respects rotation and translation symmetries in 3D space. Similarly, Drotár et al. (2021) and Liu et al. (2022) used autoregressive models to generate atoms sequentially and incorporate angles during the generation process. Li et al. (2021) formulated the generation process as a reinforcement learning problem and connected the generator with Monte Carlo Tree Search for protein pocket-conditioned ligand generation. However, the main premise of sequential generation methods may not hold in real scenarios, since atoms have no natural generation order and, as a result, the global context of the generated ligands may be lost. In addition, sequential generation adds computational complexity that makes model inference inefficient (Luo et al., 2021; Peng et al., 2022).

An alternative is a one-shot generation strategy that samples the atomic coordinates and types of all the atoms at once (Du et al., 2022b). In this work, we develop an equivariant diffusion model for structure-based drug design (DiffSBDD) which, to the best of our knowledge, is the first of its kind. Specifically, we formulate SBDD as a 3D-conditioned generation problem where we aim to generate diverse ligands with high binding affinity for specific protein targets. We propose an E(3)-equivariant 3D-conditional diffusion model that respects translation, rotation, reflection, and permutation equivariance. We introduce two strategies, protein-conditioned generation and ligand-inpainting generation, for producing new ligands conditioned on protein pockets. Protein-conditioned generation treats the protein as a fixed context, while ligand-inpainting models the joint distribution of the protein-ligand complex and inpaints new ligands at inference time. We further curate an experimentally determined binding dataset derived from Binding MOAD (Hu et al., 2005), which supplements the commonly used synthetic CrossDocked dataset (Francoeur et al., 2020) to validate our model performance under realistic binding scenarios. The experimental results demonstrate that DiffSBDD is capable of generating novel, diverse and drug-like ligands with predicted high binding affinities to given protein pockets. The code is available at

Figure 1: DiffSBDD in the protein-conditioned scenario. We first simulate the forward diffusion process $q$ to gain a trajectory of progressively noised samples over $T$ timesteps. We then train a model $p_\theta$ to reverse or denoise this process that is conditional on the target structure. Once trained, we are able to sample new drug candidates from a Gaussian distribution $\mathcal{N}(0, I)$. Both atom features and coordinates are diffused throughout the process. Ligands ($L$) are represented as fully-connected graphs during the diffusion process (edges not shown for clarity) and covalent bonds are added to the resultant point cloud at the end of generation. The protein ($P$) is represented as a graph but is shown as a surface here for clarity.

2 Background

Denoising Diffusion Probabilistic Models

Denoising diffusion probabilistic models (DDPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020) are a class of generative models inspired by non-equilibrium thermodynamics. Briefly, they define a Markovian chain of random diffusion steps by slowly adding noise to sample data and then learning the reverse of this process (typically via a neural network) to reconstruct data samples from noise.

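In this work, we closely follow the framework developed by Hoogeboom et al. (2022). In our setting, data samples are atomic point clouds $z_\text{data} = [x, h]$ with 3D geometric coordinates $x \in \mathbb{R}^{N \times 3}$ and categorical features $h \in \mathbb{R}^{N \times d}$, where $N$ is the number of atoms. A fixed noise process

$$q(z_t \mid z_\text{data}) = \mathcal{N}(z_t \mid \alpha_t z_\text{data}, \sigma_t^2 I) \tag{1}$$

adds noise to the data and produces a latent noised representation $z_t$ for $t = 0, \dots, T$. $\alpha_t$ controls the signal-to-noise ratio $\mathrm{SNR}(t) = \alpha_t^2 / \sigma_t^2$ and follows either a learned or pre-defined schedule from $\alpha_0 \approx 1$ to $\alpha_T \approx 0$ (Kingma et al., 2021). We also choose a variance-preserving noising process (Song et al., 2020) with $\alpha_t = \sqrt{1 - \sigma_t^2}$.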
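As a minimal sketch of this noising process, the following Python snippet draws a noised sample from a toy point cloud using a variance-preserving schedule. The cosine-shaped schedule, the function names (`vp_schedule`, `snr`, `noise_sample`) and the two-atom input are illustrative assumptions, not details taken from this work.

```python
import math
import random


def vp_schedule(t, T):
    """Illustrative variance-preserving schedule (an assumption, not the paper's).

    alpha_t falls from ~1 at t = 0 to ~0 at t = T, and sigma_t^2 = 1 - alpha_t^2.
    """
    alpha = math.cos(0.5 * math.pi * t / T) * (1 - 2e-4) + 1e-4
    sigma = math.sqrt(1.0 - alpha ** 2)
    return alpha, sigma


def snr(t, T):
    """Signal-to-noise ratio alpha_t^2 / sigma_t^2."""
    alpha, sigma = vp_schedule(t, T)
    return alpha ** 2 / sigma ** 2


def noise_sample(z_data, t, T, rng):
    """Draw z_t ~ N(alpha_t * z_data, sigma_t^2 I) and return it with the noise used."""
    alpha, sigma = vp_schedule(t, T)
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in z_data]
    z_t = [[alpha * v + sigma * e for v, e in zip(row, eps_row)]
           for row, eps_row in zip(z_data, eps)]
    return z_t, eps


rng = random.Random(0)
# Two atoms, each row holding xyz coordinates followed by a one-hot atom type.
z_data = [[0.0, 1.2, -0.3, 1.0, 0.0], [1.1, 0.4, 0.8, 0.0, 1.0]]
z_t, eps = noise_sample(z_data, t=250, T=500, rng=rng)
```

Coordinates and one-hot features are noised in the same way here; any separate scaling of the two feature groups is left out of the sketch.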

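Since the noising process is Markovian, we can write the denoising transition from time step $t$ to $s < t$ in closed form as

$$q(z_s \mid z_\text{data}, z_t) = \mathcal{N}\left(z_s \,\Big|\, \frac{\alpha_{t|s} \sigma_s^2}{\sigma_t^2} z_t + \frac{\alpha_s \sigma_{t|s}^2}{\sigma_t^2} z_\text{data},\; \frac{\sigma_{t|s}^2 \sigma_s^2}{\sigma_t^2} I\right) \tag{2}$$

with $\alpha_{t|s} = \alpha_t / \alpha_s$ and $\sigma_{t|s}^2 = \sigma_t^2 - \alpha_{t|s}^2 \sigma_s^2$ following the notation of Hoogeboom et al. (2022). This true denoising process depends on the data sample $z_\text{data}$, which is not available when using the model for generating new samples. Instead, a neural network $\phi_\theta$ is used to approximate the sample $\hat{z}_\text{data}$. More specifically, we can reparameterize Equation (1) as $z_t = \alpha_t z_\text{data} + \sigma_t \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ and directly predict the Gaussian noise $\hat{\epsilon}_\theta = \phi_\theta(z_t, t)$. Thus, $\hat{z}_\text{data}$ is simply given as $\hat{z}_\text{data} = \frac{1}{\alpha_t} z_t - \frac{\sigma_t}{\alpha_t} \hat{\epsilon}_\theta$.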
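The closed-form denoising transition and the noise reparameterization can be sketched in a few lines of Python. The snippet computes the coefficients of the Gaussian posterior in the standard variational-diffusion form and recovers a data estimate from a noise prediction. The function names and the numeric noise levels are illustrative assumptions.

```python
import math


def posterior_params(alpha_s, sigma_s, alpha_t, sigma_t):
    """Coefficients of q(z_s | z_data, z_t) for s < t.

    Returns the weight on z_t, the weight on z_data and the standard deviation.
    """
    alpha_ts = alpha_t / alpha_s
    sigma2_ts = sigma_t ** 2 - alpha_ts ** 2 * sigma_s ** 2
    coef_zt = alpha_ts * sigma_s ** 2 / sigma_t ** 2
    coef_data = alpha_s * sigma2_ts / sigma_t ** 2
    std = math.sqrt(sigma2_ts) * sigma_s / sigma_t
    return coef_zt, coef_data, std


def predict_data(z_t, eps_hat, alpha_t, sigma_t):
    """Invert z_t = alpha_t * z_data + sigma_t * eps using a noise estimate eps_hat."""
    return [(z - sigma_t * e) / alpha_t for z, e in zip(z_t, eps_hat)]


# Toy check with made-up variance-preserving noise levels at steps s < t.
alpha_s, alpha_t = 0.9, 0.6
sigma_s, sigma_t = math.sqrt(1 - alpha_s ** 2), math.sqrt(1 - alpha_t ** 2)
coef_zt, coef_data, std = posterior_params(alpha_s, sigma_s, alpha_t, sigma_t)
```

A useful sanity check is that `coef_zt * alpha_t + coef_data` equals `alpha_s`, so that the posterior mean averaged over `z_t` matches the mean of the forward process at step `s`.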

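To maximise the likelihood of our training data, we could directly optimize the variational lower bound (VLB) (Kingma et al., 2021; Hoogeboom et al., 2022)

$$-\log p(z_\text{data}) \le \underbrace{D_\text{KL}\big(q(z_T \mid z_\text{data}) \,\|\, p(z_T)\big)}_{\text{prior loss } \mathcal{L}_\text{prior}} + \underbrace{\mathbb{E}_{q(z_0 \mid z_\text{data})}\big[-\log p(z_\text{data} \mid z_0)\big]}_{\text{reconstruction loss } \mathcal{L}_0} + \underbrace{\sum_{t=1}^{T} \mathcal{L}_t}_{\text{diffusion loss}}.$$

The prior loss should always be close to zero and can be computed exactly in closed form while the reconstruction loss must be estimated as described in Hoogeboom et al. (2022). In practice, however, we do not directly optimize the VLB but instead minimize the simplified training objective (Ho et al., 2020; Kingma et al., 2021)

$$\mathcal{L}_\text{train} = \frac{1}{2} \lVert \epsilon - \phi_\theta(z_t, t) \rVert^2.$$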
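To make the simplified objective concrete, the sketch below noises one toy sample and scores a noise prediction with half the squared error. The `zero_predictor` stand-in for the network, the fixed noise level and all function names are hypothetical; a trained EGNN would take the predictor's place.

```python
import random


def simplified_loss(eps, eps_hat):
    """0.5 * ||eps - eps_hat||^2, summed over all atoms and feature dimensions."""
    return 0.5 * sum((e - p) ** 2
                     for row, row_hat in zip(eps, eps_hat)
                     for e, p in zip(row, row_hat))


def training_step(z_data, predict_noise, alpha_t, sigma_t, rng):
    """Noise one sample at a given noise level and score the noise prediction."""
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in z_data]
    z_t = [[alpha_t * v + sigma_t * e for v, e in zip(row, eps_row)]
           for row, eps_row in zip(z_data, eps)]
    return simplified_loss(eps, predict_noise(z_t))


def zero_predictor(z_t):
    """Hypothetical stand-in for the network: always predicts zero noise."""
    return [[0.0 for _ in row] for row in z_t]


rng = random.Random(0)
z_data = [[0.0, 1.2, -0.3], [1.1, 0.4, 0.8]]
loss = training_step(z_data, zero_predictor, alpha_t=0.6, sigma_t=0.8, rng=rng)
```

In an actual training loop the time step would be sampled at random for every example, and the loss would be averaged over a batch before the gradient step.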


E(3)-equivariant Graph Neural Networks

A function $f: X \to Y$ is said to be equivariant w.r.t. the group $G$ if $f(g.x) = g.f(x)$, where $g.$ denotes the action of the group element $g \in G$ on $X$ and $Y$ (Serre et al., 1977). Graph Neural Networks (GNNs) are learnable functions that process graph-structured data in a permutation-equivariant way, making them particularly useful for molecular systems where nodes do not have an intrinsic order. Permutation equivariance means that $\mathrm{GNN}(\Pi X) = \Pi\, \mathrm{GNN}(X)$ where $\Pi$ is an $n \times n$ permutation matrix acting on the node feature matrix. Since the nodes of the molecular graph represent the 3D coordinates of atoms, we are interested in additional equivariance w.r.t. the Euclidean group $E(3)$ or rigid transformations. An $E(3)$-equivariant GNN (EGNN) satisfies $\mathrm{EGNN}(\Pi X A + b) = \Pi\, \mathrm{EGNN}(X) A + b$ for an orthogonal $3 \times 3$ matrix $A$ with $A^\top A = I$ and some translation vector $b$ added row-wise.

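In our case, since the nodes have both geometric atomic coordinates $x$ as well as atomic type features $h$, we can use a simple implementation of EGNN proposed by Satorras et al. (2021), in which the updates for features $h_i$ and coordinates $x_i$ of node $i$ at layer $l$ are computed as follows:

$$m_{ij} = \phi_e(h_i^l, h_j^l, d_{ij}^2, a_{ij}) \tag{7}$$

$$h_i^{l+1} = \phi_h\Big(h_i^l, \sum_{j \neq i} m_{ij}\Big) \tag{8}$$

$$x_i^{l+1} = x_i^l + \sum_{j \neq i} \frac{x_i^l - x_j^l}{d_{ij} + 1}\, \phi_x(h_i^l, h_j^l, d_{ij}^2, a_{ij}) \tag{9}$$

where $\phi_e$, $\phi_h$, and $\phi_x$ are learnable Multi-layer Perceptrons (MLPs) and $d_{ij} = \lVert x_i^l - x_j^l \rVert$ and $a_{ij}$ are the relative distances and edge features between nodes $i$ and $j$ respectively.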
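The equivariance of the coordinate update in Equation (9) can be checked numerically. The sketch below is a toy version under several assumptions: node features are single scalars, edge features are omitted, the relative difference is normalised by $d_{ij} + 1$, and a fixed `tanh` expression stands in for the learnable MLP $\phi_x$. All function names are illustrative.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def phi_x(h_i, h_j, d2):
    """Stand-in for a learnable MLP; it only sees rotation-invariant inputs."""
    return math.tanh(h_i * h_j - 0.1 * d2)


def coordinate_update(x, h):
    """x_i <- x_i + sum_{j != i} (x_i - x_j) / (d_ij + 1) * phi_x(h_i, h_j, d_ij^2)."""
    new_x = []
    for i, xi in enumerate(x):
        shift = [0.0, 0.0, 0.0]
        for j, xj in enumerate(x):
            if i == j:
                continue
            d2 = sq_dist(xi, xj)
            w = phi_x(h[i], h[j], d2) / (math.sqrt(d2) + 1.0)
            shift = [s + w * (p - q) for s, p, q in zip(shift, xi, xj)]
        new_x.append([p + s for p, s in zip(xi, shift)])
    return new_x


def transform(x, theta, b):
    """Rotate every point about the z-axis by theta, then translate by b."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c * p[0] - s * p[1] + b[0], s * p[0] + c * p[1] + b[1], p[2] + b[2]]
            for p in x]


x = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.2], [0.3, 1.1, -0.7]]
h = [0.5, -1.0, 2.0]
theta, b = 0.7, [1.0, -2.0, 0.5]

# Transform-then-update should equal update-then-transform.
lhs = coordinate_update(transform(x, theta, b), h)
rhs = transform(coordinate_update(x, h), theta, b)
max_err = max(abs(p - q) for r1, r2 in zip(lhs, rhs) for p, q in zip(r1, r2))
```

The check passes because the update only combines relative position vectors, which rotate with the input, with weights that depend on invariant quantities (features and distances).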

3 Equivariant Diffusion Models for SBDD

We utilize an equivariant DDPM to generate molecules and binding conformations jointly with respect to a specific protein target. We represent protein and ligand point clouds as fully-connected graphs that are further processed by EGNNs (Satorras et al., 2021). We consider two distinct approaches to 3D pocket conditioning: (1) a conditional DDPM that receives a fixed pocket representation as context in each denoising step, and (2) a model that approximates the joint distribution of ligand-pocket pairs combined with inpainting at inference time.

Figure 2: Comparison between the conditional generation and inpainting approaches. The conditional model learns to denoise molecules $z^{(L)}$ in the fixed context of protein pockets $z^{(P)}$. In the inpainting scenario, the model first learns to approximate the joint distribution of ligand and pocket nodes $[z^{(L)}, z^{(P)}]$. For sampling, context is provided by combining the latent representation of the ligand with a forward diffused representation of the pocket in each denoising step.

3.1 Pocket-conditioned small molecule generation

In the conditional molecule generation setup, we provide fixed three-dimensional context in each step of the denoising process. To this end, we supplement the ligand node point cloud $z_t^{(L)}$, denoted by superscript $L$, with protein pocket nodes $z_\text{data}^{(P)}$, denoted by superscript $P$, that remain unchanged throughout the reverse diffusion process (Figure 2).

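We parameterize the noise predictor $\hat{\epsilon}_\theta(z_t^{(L)}, z_\text{data}^{(P)}, t)$ with an EGNN (Satorras et al., 2021; Hoogeboom et al., 2022). To process ligand and pocket nodes with a single GNN, atom types and residue types are first embedded in a joint node embedding space by separate learnable MLPs. We employ the same message-passing scheme outlined in Equations (7)-(9); however, following Igashov et al. (2022), we replace the coordinate update step with the following:

$$x_i^{l+1} = x_i^l + \sum_{j \neq i} \frac{x_i^l - x_j^l}{d_{ij} + 1}\, \phi_x(h_i^l, h_j^l, d_{ij}^2, a_{ij}) \cdot \mathbb{1}_\text{mol}(i) \tag{10}$$

where the indicator $\mathbb{1}_\text{mol}(i)$ is one for ligand nodes and zero for pocket nodes, to ensure the three-dimensional protein context remains fixed throughout the EGNN layers.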
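The effect of the modified coordinate update can be illustrated with a short sketch: whatever shift a layer proposes, it is applied to ligand nodes only. The function name, the boolean mask representation and the toy coordinates are assumptions made for illustration.

```python
def masked_coordinate_update(x, shift, is_ligand):
    """Apply the proposed shift to ligand nodes only; pocket nodes stay in place."""
    return [
        [p + (s if lig else 0.0) for p, s in zip(xi, si)]
        for xi, si, lig in zip(x, shift, is_ligand)
    ]


# Two ligand atoms followed by one pocket node.
x = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [4.0, 1.0, 0.0]]
shift = [[0.1, 0.0, 0.0], [0.0, 0.2, 0.0], [0.5, 0.5, 0.5]]
is_ligand = [True, True, False]
new_x = masked_coordinate_update(x, shift, is_ligand)
```

Node features of pocket nodes may still be updated by message passing in such a scheme; only their coordinates are frozen.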


In the probabilistic setting with 3D-conditioning, we would like to ensure $E(3)$-equivariance in the sense described below. We transpose the node feature matrices hereafter so that the matrix multiplication resembles application of a group action. We also ignore node type features, which transform invariantly, for simpler notation.

  • Evaluating the likelihood of a molecule $x^{(L)} \in \mathbb{R}^{3 \times N_L}$ given the three-dimensional representation of a protein pocket $x^{(P)} \in \mathbb{R}^{3 \times N_P}$ should not depend on global $E(3)$-transformations of the system, i.e. $p(R x^{(L)} + t \mid R x^{(P)} + t) = p(x^{(L)} \mid x^{(P)})$ for orthogonal $R \in \mathbb{R}^{3 \times 3}$ with $R^\top R = I$ and $t \in \mathbb{R}^3$ added column-wise.

  • At the same time, it should be possible to generate samples $x^{(L)} \sim p(x^{(L)} \mid x^{(P)})$ from this conditional probability distribution so that equivalently transformed ligands $R x^{(L)} + t$ are sampled with the same probability if the input pocket is rotated and translated and we sample from $p(R x^{(L)} + t \mid R x^{(P)} + t)$.


Equivariance to the orthogonal group $O(3)$ (comprising rotations and reflections) is achieved because we model both prior and transition probabilities with isotropic Gaussians where the mean vector transforms equivariantly w.r.t. rotations of the context (see Hoogeboom et al. (2022) and Appendix C). Ensuring translation equivariance, however, is not as easy because the transition probabilities are not inherently translation-equivariant. To circumvent this issue, we follow previous works (Köhler et al., 2020; Xu et al., 2022; Hoogeboom et al., 2022) by limiting the whole sampling process to a linear subspace where the center of mass (CoM) of the system is zero. In practice, this is achieved by subtracting the center of mass of the system before performing likelihood computations or denoising steps.
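The subspace restriction amounts to a mean subtraction, as the following sketch shows. It assumes that the center of mass is the unweighted mean of all node coordinates; the function names are illustrative.

```python
import random


def center_of_mass(x):
    """Unweighted mean position of a list of 3D points."""
    n = len(x)
    return [sum(p[k] for p in x) / n for k in range(3)]


def remove_center_of_mass(x):
    """Translate the point cloud so that its center of mass is at the origin."""
    com = center_of_mass(x)
    return [[p[k] - com[k] for k in range(3)] for p in x]


def sample_com_free_noise(n_atoms, rng):
    """Gaussian noise projected onto the zero-CoM subspace by subtracting its mean."""
    eps = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(n_atoms)]
    return remove_center_of_mass(eps)


rng = random.Random(0)
x = [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
shifted = [[p[0] + 5.0, p[1] - 2.0, p[2] + 0.5] for p in x]

centered = remove_center_of_mass(x)
centered_shifted = remove_center_of_mass(shifted)
noise = sample_com_free_noise(4, rng)
```

Because a translated input maps to the same centered point cloud, any computation performed after this step cannot depend on the global position of the system.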

Note that the 3D-conditional model can achieve equivariance without this “subspace-trick”. The coordinates of pocket nodes provide a reference frame for all samples that can be used to translate them to a unique location (e.g. such that the pocket is centered at the origin: $\sum_i x_i^{(P)} = 0$). By doing this for all training data, translation equivariance becomes irrelevant and the CoM-free subspace approach obsolete. To evaluate the likelihood of translated samples at inference time, we can first subtract the pocket’s center of mass from the whole system and compute the likelihood after this mapping. Similarly, for sampling molecules we can first generate a ligand in a CoM-free version of the pocket and move the whole system back to the original location of the pocket nodes to restore translation equivariance. As long as the mean of our Gaussian noise distribution depends equivariantly on the pocket node coordinates $x^{(P)}$, $O(3)$-equivariance is satisfied as well (Appendix C). Since this change did not seem to affect the performance of the conditional model in our experiments, we decided to keep sampling in the linear subspace to ensure that the implementation is as similar as possible to the joint model, for which the subspace approach is necessary.

3.2 Joint distribution with inpainting

As an extension to the conditional approach described above, we also present a ligand-inpainting approach. Originally introduced as a technique for completing masked parts of images (Song et al., 2020; Lugmayr et al., 2022), inpainting has been adopted in other domains, including biomolecular structures (Wang et al., 2022). Here, we extend this idea to three-dimensional point cloud data.

We first train an unconditional DDPM to approximate the joint distribution of ligand and pocket nodes $p(z_\text{data}^{(L)}, z_\text{data}^{(P)})$. (We use notations $z_t$ and $[z_t^{(L)}, z_t^{(P)}]$ interchangeably to describe the combined system of ligand and pocket nodes.) This allows us to sample new pairs without additional context. To condition on a target protein pocket, we then need to inject context into the sampling process by modifying the probabilistic transition steps. The combined latent representation $z_{t-1}$ of protein pocket and ligand at diffusion step $t-1$ is assembled from a forward noised version of the pocket that is combined with ligand nodes predicted by the DDPM based on the previous latent representation at step $t$:

$$z_{t-1}^{(P)} \sim q(z_{t-1} \mid z_\text{data}^{(P)}) \tag{11}$$

$$z_{t-1}^{(L)} \sim p_\theta(z_{t-1} \mid z_t) \tag{12}$$

$$z_{t-1} = [z_{t-1}^{(L)}, z_{t-1}^{(P)}] \tag{13}$$

In this manner, we traverse the Markov chain in reverse order from $t = T$ to $t = 0$, replacing the predicted pocket nodes with their forward noised counterparts in each step. Equation (12) conditions the generative process on the given protein pocket. Thanks to the noise schedule, which decreases the variance of the noising process to almost zero at $t = 0$ (Equation (1)), the final sample is guaranteed to contain an unperturbed representation of the protein pocket.
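The replacement-based sampling loop can be sketched as follows. This is a simplified illustration under stated assumptions: `dummy_reverse_step` is a hypothetical placeholder for sampling from the trained joint model, the cosine-shaped schedule is made up, and the center-of-mass handling of the full method is omitted.

```python
import math
import random


def schedule(t, T):
    """Illustrative variance-preserving schedule (an assumption)."""
    alpha = math.cos(0.5 * math.pi * t / T) * (1 - 2e-4) + 1e-4
    return alpha, math.sqrt(1.0 - alpha ** 2)


def noise_pocket(pocket, t, T, rng):
    """Forward-noise the known pocket to step t."""
    alpha, sigma = schedule(t, T)
    return [[alpha * v + sigma * rng.gauss(0.0, 1.0) for v in row] for row in pocket]


def dummy_reverse_step(z_t, t, T, rng):
    """Placeholder for one reverse step of a trained joint model on all nodes."""
    return [[0.9 * v + 0.01 * rng.gauss(0.0, 1.0) for v in row] for row in z_t]


def inpaint(pocket, n_ligand, T, rng, reverse_step=dummy_reverse_step):
    """Sample ligand nodes while overwriting pocket nodes with noised known values."""
    dim = len(pocket[0])
    ligand = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_ligand)]
    z = ligand + noise_pocket(pocket, T, T, rng)
    for t in range(T, 0, -1):
        pred = reverse_step(z, t, T, rng)            # joint prediction for all nodes
        ligand = pred[:n_ligand]                     # keep the predicted ligand nodes
        known = noise_pocket(pocket, t - 1, T, rng)  # replace predicted pocket nodes
        z = ligand + known
    return z[:n_ligand], z[n_ligand:]


pocket = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.3]]
ligand_out, pocket_out = inpaint(pocket, n_ligand=3, T=50, rng=random.Random(0))
```

Since the noise level at the last step is close to zero, the returned pocket nodes are close to the known pocket, which mirrors the argument made above.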

Since the model is trained to approximate the unconditional joint distribution of ligand-pocket pairs, the training procedure is identical to the unconditional molecule generation procedure developed by Hoogeboom et al. (2022), aside from the fully-connected neural networks that embed protein and ligand node features in a common space as described in Section 3.1. The conditioning on known protein pockets is entirely delegated to the sampling algorithm, which means this approach is not limited to ligand-inpainting but, in principle, allows us to mask and replace arbitrary parts of the ligand-pocket system without retraining.


Similar desiderata as in the conditional case apply to the joint probability model, where we desire $E(3)$-invariance that can be obtained from invariant priors via equivariant flows (Köhler et al., 2020). The main complications compared to the previous approach are the missing reference frame and the impossibility of defining a valid translation-invariant prior noise distribution, as such a distribution cannot integrate to one. Consequently, it is necessary to restrict the probabilistic model to a CoM-free subspace as described in previous works (Köhler et al., 2020; Xu et al., 2022; Hoogeboom et al., 2022). While the reverse diffusion process is defined for a CoM-free system, substituting the predicted pocket node coordinates with a new diffused version of the known pocket as described in Equations (11)-(13) can lead to a non-zero CoM. To prevent this, we translate the known pocket representation so that its center of mass coincides with that of the predicted pocket representation before creating the new combined representation $z_{t-1}$.
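The center-of-mass alignment described here can be sketched as a single translation. The snippet assumes an unweighted center of mass, and the function names and toy coordinates are illustrative.

```python
def center_of_mass(x):
    """Unweighted mean position of a list of 3D points."""
    n = len(x)
    return [sum(p[k] for p in x) / n for k in range(3)]


def align_center_of_mass(known, predicted):
    """Translate `known` so that its center of mass matches that of `predicted`."""
    ck, cp = center_of_mass(known), center_of_mass(predicted)
    shift = [p - k for p, k in zip(cp, ck)]
    return [[v + s for v, s in zip(row, shift)] for row in known]


ligand = [[-1.0, 0.5, 0.0], [-2.0, -0.5, 1.0]]
predicted_pocket = [[1.0, 0.0, -1.0], [2.0, 0.0, 0.0]]
noised_known_pocket = [[3.2, 1.1, 0.4], [4.1, 0.7, 1.8]]

aligned = align_center_of_mass(noised_known_pocket, predicted_pocket)
com_before = center_of_mass(ligand + predicted_pocket)
com_after = center_of_mass(ligand + aligned)
```

Because the substituted pocket has the same center of mass and the same number of nodes as the predicted one, the center of mass of the combined system is unchanged by the substitution.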

4 Experiments

4.1 Datasets


CrossDocked

We use the CrossDocked dataset (Francoeur et al., 2020) and follow the same filtering and splitting strategies as in previous work (Luo et al., 2021; Peng et al., 2022). This results in 100,000 high-quality protein-ligand pairs for the training set and 100 proteins for the test set. The split is done by 30% sequence identity using MMseqs2 (Steinegger and Söding, 2017).

Binding MOAD

We also evaluate our method on experimentally determined protein-ligand complexes found in Binding MOAD (Hu et al., 2005) which are filtered and split based on the proteins’ enzyme commission number as described in Appendix B. This results in 40,354 protein-ligand pairs for training and 130 pairs for testing.

4.2 Evaluation

Columns: Vina Score (kcal/mol, ↓); QED (↑); SA (↑); Lipinski (↑); Diversity (↑); Time (s, ↓)
Rows: Test set; 3D-SBDD (AR) (Luo et al., 2021); Pocket2Mol (Peng et al., 2022); DiffSBDD-cond (Cα); DiffSBDD-inpaint (Cα); DiffSBDD-cond

Table 1: Evaluation of generated molecules for targets from the CrossDocked test set. For marked baselines, we re-evaluate the generated ligands provided by the authors. The inference times are taken from their papers.

For every experiment, we evaluated all combinations of all-atom and Cα-level graphs with conditional and inpainting-based approaches respectively (with the exception of the all-atom inpainting approach due to computational limitations). Full details of model architecture and hyperparameters are given in Appendix A. We sampled 100 valid molecules for each target pocket with ground truth ligand sizes and removed all atoms that are not bonded to the largest connected fragment. Due to occasional processing issues, the actual number of available molecules is slightly lower on average (see Appendix D.1).

We employ widely used metrics to assess the quality of our generated molecules (Peng et al., 2022; Li et al., 2021): (1) Vina Score is a physics-based estimation of binding affinity between small molecules and their target pocket; (2) QED is a simple quantitative estimation of drug-likeness combining several desirable molecular properties; (3) SA (synthetic accessibility) is a measure estimating the difficulty of synthesis; (4) Lipinski measures how many rules in the Lipinski rule of five (Lipinski et al., 2012), a loose rule of thumb to assess the drug-likeness of molecules, are satisfied; (5) Diversity is computed as the average pairwise dissimilarity (1 - Tanimoto similarity) between all generated molecules for each pocket; (6) Inference Time is the average time to sample 100 molecules for one pocket across all targets. All docking scores and chemical properties are calculated with QuickVina2 (Alhossary et al., 2015) and RDKit (Landrum et al., 2016).
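The diversity metric can be illustrated with a small, dependency-free sketch. Fingerprints are represented here as sets of "on" bit indices, which is an assumption for illustration; in practice they would come from a cheminformatics toolkit, and the fingerprint type is not specified in this section.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def diversity(fingerprints):
    """Average pairwise dissimilarity (1 - Tanimoto) over all pairs of molecules."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)


# Three toy fingerprints: the first two share half of their combined bits.
fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
score = diversity(fps)
```

For the toy input, the pairwise dissimilarities are 0.5, 1 and 1, so the score is their mean, 2.5 / 3.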

4.3 Baselines

We compare with two recent deep learning methods for structure-based drug design. 3D-SBDD (Luo et al., 2021) and Pocket2Mol (Peng et al., 2022) are auto-regressive schemes relying on graph representations of the protein pocket and previously placed atoms to predict probabilities based on which new atoms are added. 3D-SBDD uses heuristics to infer bonds from generated atomic point clouds while Pocket2Mol directly predicts them during the sequential generation process.

4.4 Results


Figure 3: DiffSBDD models trained on CrossDocked and evaluated against an aminotransferase (top, PDB: 2jjg) and a hydrolase (bottom, PDB: 3kc1). Conditional and inpainting approaches are compared (using all-atom and Cα-level protein representations respectively) and three high affinity molecules from each model are presented. ‘Sim’ is the Tanimoto similarity between the generated and reference ligand.

CrossDocked

Overall, the experimental results in Table 1 suggest that DiffSBDD can generate diverse small-molecule compounds with predicted high binding affinity, matching state-of-the-art performance. We do not see significant differences between the conditional model and the inpainting approach. The diversity score is arguably the most interesting, as it suggests our model is able to sample larger regions of chemical space than previous methods while maintaining high binding performance, one of the most important requirements in early-stage, structure-based lead discovery. Specifically, DiffSBDD aims to generate ligands that bind to protein pockets and learns the probability density of ligands interacting with protein pockets. While it does not optimize for other molecular properties, such as QED and Lipinski, it generates molecules similar to the test set distributions. Only SA scores are significantly lower on average. However, this may reflect that our models are capable of exploring larger regions of chemical space, given that SA primarily scores against the historical knowledge of previously synthesised molecules (Ertl and Schuffenhauer, 2009). Generally, presenting the full atomic context to the model constrains the space of outputs considerably, leading to higher Vina scores but lower diversity compared to the Cα-only models. The all-atom model consistently beats Cα-based models on a per target basis (Appendix Figure 9).

A representative selection of molecules for two targets (2jjg and 3kc1) is presented in Figure 3. This set is curated to be representative of our high scoring molecules, with both realistic and non-realistic motifs shown. It is noteworthy that the second molecule generated for 3kc1 has a similar tricyclic motif in the same pocket location as the reference ligand, which was designed by traditional SBDD methods to maximise the hydrophobic interactions via shape complementarity of the ring system (Tsukada et al., 2010). However, a number of irregularities are present in even the highest scoring generated molecules. For example, the high number of triangles in the molecules targeting 2jjg (from Inpainting-Cα) and the large rings for 3kc1 would prove difficult to synthesise. Random selections of generated molecules made by all methods evaluated are presented in Figure 7.

All docking scores reported in Table 1 are within one standard deviation of each other, which makes it difficult to discriminate between the best models. To verify successful pocket-conditioning, we therefore discuss the agreement of generated molecular conformations with poses after docking in Appendix D.4. This experiment showcases the success of our method in modeling protein-drug interactions at the atomic level and clearly highlights the benefits of the all-atom pocket representation.

Binding MOAD

Results for the Binding MOAD dataset with experimentally determined binding complex data are reported in Table 2. 100 valid ligands have been generated for each of the 130 test pockets, resulting in 13,000 molecules in total (the QuickVina score could not be computed for 49 molecules from DiffSBDD-cond). DiffSBDD generates highly diverse molecules but on average docking scores are lower than corresponding reference ligands from this dataset.

Generated molecules for a representative target are shown in Figure 4. The target (PDB: 6c0b) is a human receptor which is involved in microbial infection (Chen et al., 2018) and possibly tumor suppression (Ding et al., 2016). The reference molecule, a long fatty acid (see Figure 4) that aids receptor binding (Chen et al., 2018), has too high a number of rotatable bonds and too low a number of hydrogen bond donors/acceptors to be considered a suitable drug (QED of 0.36). Our model, however, generates drug-like (QED between 0.63 and 0.85) and suitably sized molecules by adding aromatic rings connected by a small number of rotatable bonds, which allows the molecules to adopt a complementary binding geometry and is entropically favourable (by reducing the degrees of freedom), a classic technique in medicinal chemistry (Ritchie and Macdonald, 2009). A random selection of generated molecules is presented in Figure 8.

Columns: Vina Score (kcal/mol, ↓); QED (↑); SA (↑); Lipinski (↑); Diversity (↑); Time (s, ↓)
Rows: Test set; DiffSBDD-cond (Cα); DiffSBDD-inpaint (Cα)

Table 2: Evaluation of generated molecules for target pockets from the Binding MOAD test set.

Figure 4: DiffSBDD models trained on Binding MOAD evaluated against a human receptor protein (PDB: 6c0b). Conditional and inpainting approaches are compared (Cα for both) and the three highest affinity molecules from each model are presented. Further details of the molecules shown here are explained in Appendix D.1.

5 Related Work

Diffusion Models for Molecules

Inspired by non-equilibrium thermodynamics, diffusion models have been proposed to learn data distributions by modeling a denoising (reverse diffusion) process and have achieved remarkable success in a variety of tasks such as image synthesis, audio synthesis and point cloud generation (Kingma et al., 2021; Kong et al., 2021; Luo and Hu, 2021). Recently, efforts have been made to utilize diffusion models for molecule design (Du et al., 2022b). Specifically, Hoogeboom et al. (2022) propose a diffusion model with an equivariant network that operates both on continuous atomic coordinates and categorical atom types to generate new molecules in 3D space. Torsional Diffusion (Jing et al., 2022) focuses on a conditional setting where molecular conformations (atomic coordinates) are generated from molecular graphs (atom types and bonds). Similarly, 3D diffusion models have been applied to generative design of larger biomolecular structures, such as antibodies (Luo et al., 2022) and other proteins (Anand and Achim, 2022; Trippe et al., 2022).

Structure-based Drug Design

Structure-based Drug Design (SBDD) (Blundell, 1996; Ferreira et al., 2015; Anderson, 2003) relies on the knowledge of the 3D structure of the biological target obtained either through experimental methods or high-confidence predictions using homology modelling (Kelley et al., 2015). Candidate molecules are then designed to bind with high affinity and specificity to the target using interactive software (Kalyaanamoorthy and Chen, 2011) and often human intuition (Ferreira et al., 2015). Recent advances in deep generative models have brought a new wave of research that models the conditional distribution of ligands given biological targets and thus enables de novo structure-based drug design. Most recent work treats this task as a sequential generation problem and designs a variety of generative methods, including autoregressive models and reinforcement learning, to generate ligands inside protein pockets atom by atom (Drotár et al., 2021; Luo et al., 2021; Li et al., 2021; Peng et al., 2022).

6 Conclusion

In this work, we propose DiffSBDD, an E(3)-equivariant 3D-conditional diffusion model for structure-based drug design. We demonstrate the effectiveness and efficiency of DiffSBDD in generating novel and diverse ligands with predicted high affinity for given protein pockets on both a synthetic benchmark and a new dataset of experimentally determined protein-ligand complexes. We demonstrate that an inpainting-based approach can achieve results competitive with direct conditioning on a wide range of molecular metrics. Extending this more versatile strategy to an all-atom pocket representation therefore holds promise for solving a variety of other structure-based drug design tasks, such as lead optimization, linker design, and binding site design, without retraining.


Acknowledgements

We thank Xingang Peng and Shitong Luo for providing us with generated molecules of the Pocket2Mol and 3D-SBDD methods. We thank Hannes Stärk and Joshua Southern for valuable feedback and insightful discussions. This work was supported by the European Research Council (starting grant no. 716058), the Swiss National Science Foundation (grant no. 310030_188744), and Microsoft Research AI4Science. Charles Harris is supported by the Cambridge Centre for AI in Medicine Studentship which is in turn funded by AstraZeneca and GSK. Michael Bronstein is supported in part by ERC Consolidator grant no. 724228 (LEMAN).


  • A. Alhossary, S. D. Handoko, Y. Mu, and C. Kwoh (2015) Fast, accurate, and reliable molecular docking with quickvina 2. Bioinformatics 31 (13), pp. 2214–2216. Cited by: §4.2.
  • N. Anand and T. Achim (2022) Protein structure and sequence generation with equivariant denoising diffusion probabilistic models. arXiv preprint arXiv:2205.15019. Cited by: §5.
  • A. C. Anderson (2003) The process of structure-based drug design. Chemistry & biology 10 (9), pp. 787–797. Cited by: §1, §5.
  • K. Atz, F. Grisoni, and G. Schneider (2021) Geometric deep learning on molecular representations. Nature Machine Intelligence 3 (12), pp. 1023–1032. Cited by: §E.1, §1.
  • S. Batzner, A. Musaelian, L. Sun, M. Geiger, J. P. Mailoa, M. Kornbluth, N. Molinari, T. E. Smidt, and B. Kozinsky (2022) E (3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nature communications 13 (1), pp. 1–11. Cited by: §E.1.
  • T. L. Blundell (1996) Structure-based drug design.. Nature 384 (6604 Suppl), pp. 23–26. Cited by: §1, §5.
  • M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković (2021) Geometric deep learning: grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478. Cited by: §E.1, §1.
  • P. Chen, L. Tao, T. Wang, J. Zhang, A. He, K. Lam, Z. Liu, X. He, K. Perry, M. Dong, et al. (2018) Structural basis for recognition of frizzled proteins by clostridium difficile toxin b. Science 360 (6389), pp. 664–669. Cited by: §4.4.
  • L. Ding, X. Huang, F. Zheng, J. Xie, L. She, Y. Feng, B. Su, D. Zheng, and Y. Lu (2016) FZD2 inhibits the cell growth and migration of salivary adenoid cystic carcinomas. Oncology Reports 35 (2), pp. 1006–1012. Cited by: §4.4.
  • P. Drotár, A. R. Jamasb, B. Day, C. Cangea, and P. Liò (2021) Structure-aware generation of drug-like molecules. arXiv preprint arXiv:2111.04107. Cited by: §1, §5.
  • W. Du, H. Zhang, Y. Du, Q. Meng, W. Chen, N. Zheng, B. Shao, and T. Liu (2022a) SE (3) equivariant graph neural networks with complete local frames. In International Conference on Machine Learning, pp. 5583–5608. Cited by: §E.1.
  • Y. Du, T. Fu, J. Sun, and S. Liu (2022b) MolGenSurvey: a systematic survey in machine learning models for molecule design. arXiv preprint arXiv:2203.14500. Cited by: §E.1, §1, §5.
  • Y. Du, X. Liu, N. Shah, S. Liu, J. Zhang, and B. Zhou (2022c) ChemSpacE: interpretable and interactive chemical space exploration. Cited by: §E.1.
  • D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. Advances in neural information processing systems 28. Cited by: §E.1.
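  • [15] E. Hoogeboom, V. G. Satorras, C. Vignac, and M. Welling (2022) Equivariant diffusion for molecule generation in 3d. In International Conference on Machine Learning. Cited by: Appendix A, Appendix A, §2, §2, §2, §2, §3.1, §3.1, §3.2, §3.2, §5.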
  • P. Ertl and A. Schuffenhauer (2009) Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. Journal of cheminformatics 1 (1), pp. 1–11. Cited by: §4.4.
  • L. G. Ferreira, R. N. Dos Santos, G. Oliva, and A. D. Andricopulo (2015) Molecular docking and structure-based drug design strategies. Molecules 20 (7), pp. 13384–13421. Cited by: §1, §5.
  • P. G. Francoeur, T. Masuda, J. Sunseri, A. Jia, R. B. Iovanisci, I. Snyder, and D. R. Koes (2020) Three-dimensional convolutional neural networks and a cross-docked data set for structure-based drug design. Journal of Chemical Information and Modeling 60 (9), pp. 4200–4215. Cited by: §1, §4.1.
  • T. Gaudelet, B. Day, A. R. Jamasb, J. Soman, C. Regep, G. Liu, J. B. R. Hayter, R. Vickers, C. Roberts, J. Tang, D. Roblin, T. L. Blundell, M. M. Bronstein, and J. P. Taylor-King (2021) Utilizing graph machine learning within drug discovery and development. Briefings in Bioinformatics 22 (6). Cited by: §1.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In International conference on machine learning, pp. 1263–1272. Cited by: §E.1.
  • J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, pp. 6840–6851. Cited by: §2, §2.
  • L. Holdijk, Y. Du, F. Hooft, P. Jaini, B. Ensing, and M. Welling (2022) Path integral stochastic optimal control for sampling transition paths. arXiv preprint arXiv:2207.02149. Cited by: §E.1.
  • L. Hu, M. L. Benson, R. D. Smith, M. G. Lerner, and H. A. Carlson (2005) Binding moad (mother of all databases). Proteins: Structure, Function, and Bioinformatics 60 (3), pp. 333–340. Cited by: Appendix B, §1, §4.1.
  • I. Igashov, H. Stärk, C. Vignac, V. G. Satorras, P. Frossard, M. Welling, M. Bronstein, and B. Correia (2022) Equivariant 3d-conditional diffusion models for molecular linker design. arXiv preprint arXiv:2210.05274. Cited by: §3.1.
  • J. J. Irwin and B. K. Shoichet (2005) ZINC- a free database of commercially available compounds for virtual screening. Journal of chemical information and modeling 45 (1), pp. 177–182. Cited by: §1.
  • B. Jing, G. Corso, J. Chang, R. Barzilay, and T. Jaakkola (2022) Torsional diffusion for molecular conformer generation. arXiv preprint arXiv:2206.01729. Cited by: §5.
  • S. Kalyaanamoorthy and Y. P. Chen (2011) Structure-based drug design to augment hit discovery. Drug discovery today 16 (17-18), pp. 831–839. Cited by: §5.
  • L. A. Kelley, S. Mezulis, C. M. Yates, M. N. Wass, and M. J. Sternberg (2015) The phyre2 web portal for protein modeling, prediction and analysis. Nature protocols 10 (6), pp. 845–858. Cited by: §5.
  • D. Kingma, T. Salimans, B. Poole, and J. Ho (2021) Variational diffusion models. Advances in neural information processing systems 34, pp. 21696–21707. Cited by: §2, §2, §2, §5.
  • J. Klicpera, J. Groß, and S. Günnemann (2020) Directional message passing for molecular graphs. arXiv preprint arXiv:2003.03123. Cited by: §E.1.
  • J. Köhler, L. Klein, and F. Noé (2020) Equivariant flows: exact likelihood generative learning for symmetric densities. In International conference on machine learning, pp. 5361–5370. Cited by: §3.1, §3.2.
  • Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro (2021) DiffWave: a versatile diffusion model for audio synthesis. In International Conference on Learning Representations, Cited by: §5.
  • G. Landrum et al. (2016) RDKit: open-source cheminformatics software. Cited by: §4.2.
  • K. Lapchevskyi, B. Miller, M. Geiger, and T. Smidt (2020) Euclidean neural networks (e3nn) v1. 0. Technical report Lawrence Berkeley National Lab.(LBNL), Berkeley, CA (United States). Cited by: §E.1.
  • Y. Li, J. Pei, and L. Lai (2021) Structure-based de novo drug design using 3d deep generative models. Chemical science 12 (41), pp. 13664–13675. Cited by: §1, §4.2, §5.
  • C. A. Lipinski, F. Lombardo, B. W. Dominy, and P. J. Feeney (2012) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced drug delivery reviews 64, pp. 4–17. Cited by: §4.2.
  • M. Liu, Y. Luo, K. Uchino, K. Maruhashi, and S. Ji (2022) Generating 3d molecules for target protein binding. arXiv preprint arXiv:2204.09410. Cited by: §1.
  • W. Lu, Q. Wu, J. Zhang, J. Rao, C. Li, and S. Zheng (2022) TANKBind: trigonometry-aware neural networks for drug-protein binding structure prediction. bioRxiv. Cited by: §1.
  • A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. Van Gool (2022) Repaint: inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11461–11471. Cited by: §3.2.
  • S. Luo, J. Guan, J. Ma, and J. Peng (2021) A 3d generative model for structure-based drug design. Advances in Neural Information Processing Systems 34, pp. 6229–6239. Cited by: Appendix A, Table 5, §1, §4.1, §4.3, Table 1, §5.
  • S. Luo and W. Hu (2021) Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2837–2845. Cited by: §5.
  • S. Luo, Y. Su, X. Peng, S. Wang, J. Peng, and J. Ma (2022) Antigen-specific antibody design and optimization with diffusion-based generative models. bioRxiv. Cited by: §5.
  • P. D. Lyne (2002) Structure-based virtual screening: an overview. Drug discovery today 7 (20), pp. 1047–1055. Cited by: §1.
  • A. Q. Nichol and P. Dhariwal (2021) Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162–8171. Cited by: Appendix A.
  • N. M. O’Boyle, M. Banck, C. A. James, C. Morley, T. Vandermeersch, and G. R. Hutchison (2011) Open babel: an open chemical toolbox. Journal of cheminformatics 3 (1), pp. 1–14. Cited by: Appendix A.
  • X. Peng, S. Luo, J. Guan, Q. Xie, J. Peng, and J. Ma (2022) Pocket2Mol: efficient molecular sampling based on 3d protein pockets. arXiv preprint arXiv:2205.07249. Cited by: Table 5, §1, §4.1, §4.2, §4.3, Table 1, §5.
  • S. Pérot, O. Sperandio, M. A. Miteva, A. Camproux, and B. O. Villoutreix (2010) Druggable pockets and binding site centric chemical space: a paradigm shift in drug discovery. Drug discovery today 15 (15-16), pp. 656–667. Cited by: §1.
  • M. Ragoza, T. Masuda, and D. R. Koes (2022) Generating 3d molecules conditional on receptor binding sites with deep generative models. Chemical science 13 (9), pp. 2701–2713. Cited by: §1.
  • T. J. Ritchie and S. J. Macdonald (2009) The impact of aromatic ring count on compound developability–are too many aromatic rings a liability in drug design?. Drug discovery today 14 (21-22), pp. 1011–1020. Cited by: §4.4.
  • V. G. Satorras, E. Hoogeboom, and M. Welling (2021) E (n) equivariant graph neural networks. In International conference on machine learning, pp. 9323–9332. Cited by: §E.1, §2, §3.1, §3.
  • K. T. Schütt, H. E. Sauceda, P. Kindermans, A. Tkatchenko, and K. Müller (2018) Schnet–a deep learning architecture for molecules and materials. The Journal of Chemical Physics 148 (24), pp. 241722. Cited by: §E.1.
  • J. Serre et al. (1977) Linear representations of finite groups. Vol. 42, Springer. Cited by: §2.
  • B. K. Shoichet (2004) Virtual screening of chemical libraries. Nature 432 (7019), pp. 862–865. Cited by: §1.
  • J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli (2015) Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. Cited by: §2.
  • Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole (2020) Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456. Cited by: §2, §3.2.
  • H. Stärk, O. Ganea, L. Pattanaik, R. Barzilay, and T. Jaakkola (2022) Equibind: geometric deep learning for drug binding structure prediction. In International Conference on Machine Learning, pp. 20503–20521. Cited by: §E.1, §1.
  • M. Steinegger and J. Söding (2017) MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology 35 (11), pp. 1026–1028. Cited by: §4.1.
  • B. L. Trippe, J. Yim, D. Tischer, T. Broderick, D. Baker, R. Barzilay, and T. Jaakkola (2022) Diffusion probabilistic modeling of protein backbones in 3d for the motif-scaffolding problem. arXiv preprint arXiv:2206.04119. Cited by: §5.
  • T. Tsukada, M. Takahashi, T. Takemoto, O. Kanno, T. Yamane, S. Kawamura, and T. Nishi (2010) Structure-based drug design of tricyclic 8h-indeno [1, 2-d][1, 3] thiazoles as potent fbpase inhibitors. Bioorganic & medicinal chemistry letters 20 (3), pp. 1004–1007. Cited by: §4.4.
  • J. Wang, S. Lisanza, D. Juergens, D. Tischer, J. L. Watson, K. M. Castro, R. Ragotte, A. Saragovi, L. F. Milles, M. Baek, et al. (2022) Scaffolding protein functional sites using deep learning. Science 377 (6604), pp. 387–394. Cited by: §3.2.
  • S. A. Wildman and G. M. Crippen (1999) Prediction of physicochemical parameters by atomic contributions. Journal of chemical information and computer sciences 39 (5), pp. 868–873. Cited by: §D.3.
  • M. Xu, L. Yu, Y. Song, C. Shi, S. Ermon, and J. Tang (2022) Geodiff: a geometric diffusion model for molecular conformation generation. arXiv preprint arXiv:2203.02923. Cited by: §3.1, §3.2.

Appendix A Implementation Details

Molecule size

As part of a sample’s overall likelihood, we compute the empirical joint distribution $p(N_L, N_P)$ of the numbers of ligand nodes $N_L$ and pocket nodes $N_P$ observed in the training set and smooth it with a Gaussian filter. In the conditional generation scenario, we derive the distribution $p(N_L \mid N_P)$ and use it for likelihood computations.

For sampling, we can either fix molecule sizes manually or sample the number of ligand nodes from the same distribution given the number of nodes in the target pocket:

$$N_L \sim p(N_L \mid N_P).$$
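The size-sampling step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the filter width `sigma`, and the toy training sizes are assumptions made for illustration, and the Gaussian filter is written as a dense smoothing matrix to keep the example short.

```python
import numpy as np


def smoothing_matrix(n, sigma):
    """Dense Gaussian smoothing operator acting on a length-n histogram axis."""
    idx = np.arange(n)
    return np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / sigma) ** 2)


def smoothed_joint_distribution(ligand_sizes, pocket_sizes, sigma=1.0):
    """Smoothed empirical joint distribution; rows index N_L, columns index N_P."""
    counts = np.zeros((max(ligand_sizes) + 1, max(pocket_sizes) + 1))
    for n_l, n_p in zip(ligand_sizes, pocket_sizes):
        counts[n_l, n_p] += 1.0
    # Smooth along the ligand axis (left factor) and the pocket axis (right factor).
    smooth = (
        smoothing_matrix(counts.shape[0], sigma)
        @ counts
        @ smoothing_matrix(counts.shape[1], sigma)
    )
    return smooth / smooth.sum()


def conditional_size_distribution(joint, n_pocket_nodes):
    """p(N_L | N_P): the renormalized column of the joint distribution."""
    column = joint[:, n_pocket_nodes]
    return column / column.sum()


def sample_ligand_size(joint, n_pocket_nodes, rng):
    """Draw a ligand size for a pocket with the given number of nodes."""
    probs = conditional_size_distribution(joint, n_pocket_nodes)
    return int(rng.choice(len(probs), p=probs))


# Toy training statistics (hypothetical values).
train_ligand_sizes = [18, 20, 22, 25, 25, 30]
train_pocket_sizes = [40, 42, 42, 50, 52, 60]

rng = np.random.default_rng(0)
joint = smoothed_joint_distribution(train_ligand_sizes, train_pocket_sizes, sigma=1.0)
probs = conditional_size_distribution(joint, n_pocket_nodes=42)
n_ligand = sample_ligand_size(joint, n_pocket_nodes=42, rng=rng)
```

Smoothing spreads probability mass to ligand sizes that never co-occurred with a given pocket size in the training set, so sampling remains possible for pocket sizes close to, but not exactly matching, those seen during training.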



All molecules are expressed as graphs. For the $C_\alpha$-only model, the node features for the protein are set as the one-hot encoding of the amino acid type. The full-atom model uses the same one-hot encoding of atom types for ligand and protein nodes. In this case, we refrain from adding a categorical feature to distinguish protein from ligand atoms and instead continue to embed the node features with two separate MLPs.

Noise schedule

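We use the pre-defined polynomial noise schedule introduced in [15]:

$$\tilde{\alpha}_t = 1 - \left(\frac{t}{T}\right)^2, \quad t = 0, \dots, T.$$

Following Nichol and Dhariwal (2021) and [15], values of $\tilde{\alpha}_{t|s}^2 = (\tilde{\alpha}_t / \tilde{\alpha}_s)^2$ are clipped between 0.001 and 1 for numerical stability near $t = T$, and $\tilde{\alpha}_t$ is recomputed as

$$\tilde{\alpha}_t = \prod_{\tau=0}^{t} \tilde{\alpha}_{\tau|\tau-1}.$$

A tiny offset $\epsilon$ is used to avoid numerical problems at $t = 0$, defining the final noise schedule:

$$\alpha_t^2 = (1 - 2\epsilon) \, \tilde{\alpha}_t^2 + \epsilon.$$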
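A clipped polynomial schedule of this kind can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: the function name, the exponent `power=2`, and the offset value `1e-5` are assumptions, while the clipping bounds of 0.001 and 1 come from the text.

```python
import numpy as np


def polynomial_noise_schedule(num_steps, power=2.0, clip_min=0.001, offset=1e-5):
    """Return alpha_t^2 for t = 0, ..., num_steps."""
    t = np.arange(num_steps + 1)
    alphas2 = (1.0 - (t / num_steps) ** power) ** 2
    # Per-step ratios alpha_t^2 / alpha_{t-1}^2, taking the value before t = 0 as 1.
    ratios = alphas2 / np.concatenate([[1.0], alphas2[:-1]])
    # Clipping bounds how quickly the signal may vanish in one step (relevant near t = T).
    ratios = np.clip(ratios, clip_min, 1.0)
    # Recompute the schedule as the cumulative product of the clipped ratios.
    alphas2 = np.cumprod(ratios)
    # The offset keeps alpha_t^2 strictly inside (0, 1), including at t = 0.
    return (1.0 - 2.0 * offset) * alphas2 + offset


alphas2 = polynomial_noise_schedule(num_steps=1000)
sigmas2 = 1.0 - alphas2  # variance-preserving noise level (assumed convention)
```

The schedule decreases monotonically from just below 1 at `t = 0` to a small positive value at `t = T`, so neither `alphas2` nor `sigmas2` is ever exactly zero.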


Feature scaling

We scale the node type features by a factor of 0.25 relative to the coordinates, which was empirically found to improve model performance in previous work [15].


Hyperparameters for all presented models are summarized in Table 3. Training takes about (BindingMOAD) and (CrossDocked) per 100 epochs on a single NVIDIA V100 GPU in the $C_\alpha$ scenario and (CrossDocked) per 100 epochs on a single NVIDIA A100 GPU with all-atom pocket representation.

Dataset: CrossDocked, CrossDocked, CrossDocked, Binding MOAD, Binding MOAD
Model: Cond, Cond ($C_\alpha$), Inpaint ($C_\alpha$), Cond ($C_\alpha$), Inpaint ($C_\alpha$)
No. layers: 6, 6, 6, 6, 6
Joint embedding dim.: 32, 32, 32, 32, 32
Hidden dim.: 256, 256, 256, 256, 256
Learning rate:
Weight decay:
Diffusion steps: 1000, 1000, 1000, 1000, 1000
Edges: fully connected, fully connected, fully connected, fully connected
Epochs: 1000, 1000, 1000, 800, 800

Table 3: DiffSBDD hyperparameters.


For postprocessing of generated molecules, we use a similar procedure as in (Luo et al., 2021). Given a list of atom types and coordinates, bonds are first added using OpenBabel (O’Boyle et al., 2011). We then use RDKit to sanitise molecules, filter for the largest molecular fragment and finally remove steric clashes with 200 steps of force-field relaxation.

Appendix B Binding MOAD Dataset

We curate a dataset of experimentally determined complexed protein-ligand structures from Binding MOAD (Hu et al., 2005). We keep pockets with valid (as defined in ) and moderately ‘drug-like’ ligands with QED score . We further discard small molecules that contain atom types as well as binding pockets with non-standard amino acids. We define a binding pocket as the set of residues that have any atom within of any ligand atom. Ligand redundancy is reduced by randomly sampling at most 50 molecules with the same chemical component identifier (3-letter code). After removing corrupted entries that could not be processed, training pairs and 130 testing pairs remain. A validation set of size 246 is used to monitor estimated log-likelihoods during training. The split ensures that different sets do not contain proteins from the same Enzyme Commission (EC) number main class.
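The pocket definition above amounts to a distance filter over protein atoms. The following sketch shows one way to express it; the function name, the flat array layout, and the cutoff used in the toy example are assumptions, and the cutoff is not the value used for the dataset.

```python
import numpy as np


def pocket_residue_ids(protein_coords, residue_ids, ligand_coords, cutoff):
    """IDs of residues that have any atom within `cutoff` of any ligand atom."""
    protein_coords = np.asarray(protein_coords, dtype=float)
    ligand_coords = np.asarray(ligand_coords, dtype=float)
    # Pairwise distances, shape (n_protein_atoms, n_ligand_atoms).
    diff = protein_coords[:, None, :] - ligand_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    close_atoms = (dist < cutoff).any(axis=1)
    # A residue belongs to the pocket as soon as one of its atoms is close.
    return sorted(set(np.asarray(residue_ids)[close_atoms].tolist()))


# Toy example: five protein atoms from three residues and a single ligand atom.
protein_coords = [[0, 0, 0], [1.5, 0, 0], [10, 0, 0], [4, 0, 0], [20, 0, 0]]
residue_ids = [1, 1, 2, 3, 3]
ligand_coords = [[3, 0, 0]]

pocket = pocket_residue_ids(protein_coords, residue_ids, ligand_coords, cutoff=2.0)
```

Here residues 1 and 3 are selected because each has one atom closer than the cutoff to the ligand, even though other atoms of the same residues are farther away.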

Appendix C Proofs

In the following proofs we do not consider categorical node features as only the positions are subject to equivariance constraints. Furthermore, we do not distinguish between the zeroth latent representation and data domain representations for ease of notation, and simply drop the subscripts.

C.1 O(3)-equivariance of the prior probability

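The isotropic Gaussian prior $p(z_T^{(L)} \mid z^{(P)}) = \mathcal{N}\big(\mu(z^{(P)}), \sigma^2 I\big)$ is equivariant to rotations and reflections represented by an orthogonal matrix $R \in \mathbb{R}^{3 \times 3}$ as long as $\mu(R z^{(P)}) = R \mu(z^{(P)})$ because:

$$
\begin{aligned}
p(R z_T^{(L)} \mid R z^{(P)}) &= \frac{1}{Z} \exp\Big(-\frac{1}{2\sigma^2} \big\| R z_T^{(L)} - \mu(R z^{(P)}) \big\|^2\Big) \\
&= \frac{1}{Z} \exp\Big(-\frac{1}{2\sigma^2} \big\| R \big(z_T^{(L)} - \mu(z^{(P)})\big) \big\|^2\Big) \\
&= \frac{1}{Z} \exp\Big(-\frac{1}{2\sigma^2} \big\| z_T^{(L)} - \mu(z^{(P)}) \big\|^2\Big) \\
&= p(z_T^{(L)} \mid z^{(P)}).
\end{aligned}
$$

Here $Z$ is a normalization constant that depends only on $\sigma$ and the dimensionality, and we used $\|Rx\|^2 = \|x\|^2$ for orthogonal $R$.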
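The invariance of the Gaussian density under a joint orthogonal transformation of ligand and pocket can also be checked numerically. The sketch below is illustrative only: the pocket centroid stands in for an equivariant mean function, and all names and the value of `sigma` are assumptions.

```python
import numpy as np


def isotropic_gaussian_logpdf(z, mean, sigma):
    """Log-density of N(mean, sigma^2 I) at z; z has shape (n_atoms, 3)."""
    d = z.size
    sq_dist = np.sum((z - mean) ** 2)
    return -0.5 * sq_dist / sigma**2 - 0.5 * d * np.log(2.0 * np.pi * sigma**2)


def random_orthogonal_matrix(rng):
    """Random 3x3 orthogonal matrix (rotation or reflection) from a QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    return q


def pocket_mean(pocket):
    """Toy equivariant mean: the pocket centroid, broadcast to every ligand atom."""
    return pocket.mean(axis=0, keepdims=True)


rng = np.random.default_rng(0)
pocket = rng.normal(size=(12, 3))
ligand = rng.normal(size=(5, 3))
R = random_orthogonal_matrix(rng)

log_p = isotropic_gaussian_logpdf(ligand, pocket_mean(pocket), sigma=1.0)
# Coordinates are stored as rows, so applying R to every atom is `coords @ R.T`.
log_p_transformed = isotropic_gaussian_logpdf(
    ligand @ R.T, pocket_mean(pocket @ R.T), sigma=1.0
)
```

Both log-densities agree up to floating-point error, because the squared distance between the ligand and the mean is unchanged when the same orthogonal matrix is applied to both.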

C.2 O(3)-equivariance of the transition probabilities

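The denoising transition probabilities from time step $t$ to $s < t$ are defined as isotropic normal distributions:

$$p_\theta(z_s^{(L)} \mid z_t^{(L)}, z^{(P)}) = \mathcal{N}\big(z_s^{(L)} \mid \hat{\mu}_{t \to s}(z_t^{(L)}, z^{(P)}), \sigma_{t \to s}^2 I\big).$$

Therefore, $p_\theta(z_s^{(L)} \mid z_t^{(L)}, z^{(P)})$ is O(3)-equivariant by a similar argument to Section C.1 if $\hat{\mu}_{t \to s}$ is computed equivariantly from the three-dimensional context.

Recalling the definition of $\hat{\mu}_{t \to s} = \frac{\alpha_{t|s} \sigma_s^2}{\sigma_t^2} z_t^{(L)} + \frac{\alpha_s \sigma_{t|s}^2}{\sigma_t^2} \hat{x}^{(L)}$, we can prove its equivariance as follows:

$$
\begin{aligned}
\hat{\mu}_{t \to s}(R z_t^{(L)}, R z^{(P)}) &= \frac{\alpha_{t|s} \sigma_s^2}{\sigma_t^2} R z_t^{(L)} + \frac{\alpha_s \sigma_{t|s}^2}{\sigma_t^2} \hat{x}^{(L)}(R z_t^{(L)}, R z^{(P)}) \\
&= \frac{\alpha_{t|s} \sigma_s^2}{\sigma_t^2} R z_t^{(L)} + \frac{\alpha_s \sigma_{t|s}^2}{\sigma_t^2} R \hat{x}^{(L)}(z_t^{(L)}, z^{(P)}) \\
&= R \hat{\mu}_{t \to s}(z_t^{(L)}, z^{(P)}),
\end{aligned}
$$

where $\hat{x}^{(L)}$ defined as $\hat{x}^{(L)} = \frac{1}{\alpha_t} z_t^{(L)} - \frac{\sigma_t}{\alpha_t} \hat{\epsilon}_\theta(z_t^{(L)}, z^{(P)}, t)$ is equivariant because:

$$
\begin{aligned}
\hat{x}^{(L)}(R z_t^{(L)}, R z^{(P)}) &= \frac{1}{\alpha_t} R z_t^{(L)} - \frac{\sigma_t}{\alpha_t} \hat{\epsilon}_\theta(R z_t^{(L)}, R z^{(P)}, t) \\
&= \frac{1}{\alpha_t} R z_t^{(L)} - \frac{\sigma_t}{\alpha_t} R \hat{\epsilon}_\theta(z_t^{(L)}, z^{(P)}, t) \\
&= R \hat{x}^{(L)}(z_t^{(L)}, z^{(P)}).
\end{aligned}
$$

The second equality uses the equivariance of the noise-predicting network $\hat{\epsilon}_\theta$.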

C.3 O(3)-equivariance of the learned likelihood

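Let $R \in \mathbb{R}^{3 \times 3}$ be an orthogonal matrix representing an element from the general orthogonal group $O(3)$. We obtain the marginal probability density of the Markovian denoising process as follows

$$p_\theta(z_0^{(L)} \mid z^{(P)}) = \int p(z_T^{(L)} \mid z^{(P)}) \prod_{t=1}^{T} p_\theta(z_{t-1}^{(L)} \mid z_t^{(L)}, z^{(P)}) \, \mathrm{d}z_{1:T}^{(L)}$$

and the sample’s likelihood is O(3)-equivariant:

$$
\begin{aligned}
p_\theta(R z_0^{(L)} \mid R z^{(P)}) &= \int p(z_T^{(L)} \mid R z^{(P)}) \, p_\theta(R z_0^{(L)} \mid z_1^{(L)}, R z^{(P)}) \prod_{t=2}^{T} p_\theta(z_{t-1}^{(L)} \mid z_t^{(L)}, R z^{(P)}) \, \mathrm{d}z_{1:T}^{(L)} \\
&= \int p(R y_T \mid R z^{(P)}) \, p_\theta(R z_0^{(L)} \mid R y_1, R z^{(P)}) \prod_{t=2}^{T} p_\theta(R y_{t-1} \mid R y_t, R z^{(P)}) \, \mathrm{d}y_{1:T} \\
&= \int p(y_T \mid z^{(P)}) \, p_\theta(z_0^{(L)} \mid y_1, z^{(P)}) \prod_{t=2}^{T} p_\theta(y_{t-1} \mid y_t, z^{(P)}) \, \mathrm{d}y_{1:T} \\
&= p_\theta(z_0^{(L)} \mid z^{(P)}).
\end{aligned}
$$

The second line substitutes $z_t^{(L)} = R y_t$, which leaves the volume element unchanged because $|\det R| = 1$, and the third line applies the equivariance of the prior (Section C.1) and of the transition probabilities (Section C.2).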

Appendix D Extended results

D.1 Additional Experimental Details

The numbers of available molecules differ slightly between different methods due to computational issues or missing molecules in the available baseline sets. More precisely, on average , , and molecules have been evaluated per pocket for DiffSBDD-cond, DiffSBDD-inpaint ($C_\alpha$), and DiffSBDD-cond ($C_\alpha$), respectively. For Pocket2Mol, molecules are available per pocket. The set of 3D-SBDD molecules does not contain generated ligands for two test pockets. For the remaining 98 pockets, molecules are available on average.

All figures show molecules generated with the starting number of nodes set equal to the number of nodes in the reference ligand, with the exception of Figure 4, which employs the sampling strategy outlined in Appendix A.

D.2 Additional Molecular Metrics

In addition to the molecular properties discussed in Section 4, we assess the models’ ability to produce novel and valid molecules using four simple metrics: validity, connectivity, uniqueness, and novelty. Validity measures the proportion of generated molecules that pass basic tests by RDKit, which mostly ensure correct valencies. Connectivity is the proportion of valid molecules that do not contain any disconnected fragments. We convert every valid and connected molecule from a graph into a canonical SMILES string representation, count the number of unique occurrences in the set of generated molecules, and compare those to the training set SMILES to compute uniqueness and novelty, respectively.
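The four metrics can be sketched as simple ratios over a list of generated molecules. This is a hedged illustration, not the evaluation code used for Table 4: the record layout (`valid`, `n_fragments`, `smiles`), the function name, and the choice of denominators for uniqueness and novelty are assumptions, and the SMILES strings are taken to be canonical already (in practice RDKit would produce them).

```python
def basic_molecular_metrics(generated, training_smiles):
    """Compute validity, connectivity, uniqueness, and novelty.

    `generated` is a list of dicts with keys 'valid' (bool),
    'n_fragments' (int), and 'smiles' (canonical SMILES string).
    """
    valid = [m for m in generated if m["valid"]]
    connected = [m for m in valid if m["n_fragments"] == 1]
    unique = {m["smiles"] for m in connected}
    novel = unique - set(training_smiles)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "validity": ratio(len(valid), len(generated)),
        "connectivity": ratio(len(connected), len(valid)),
        "uniqueness": ratio(len(unique), len(connected)),
        "novelty": ratio(len(novel), len(unique)),
    }


# Toy example with five generated molecules (hypothetical records).
generated = [
    {"valid": True, "n_fragments": 1, "smiles": "CCO"},
    {"valid": True, "n_fragments": 1, "smiles": "CCO"},
    {"valid": True, "n_fragments": 1, "smiles": "c1ccccc1"},
    {"valid": True, "n_fragments": 2, "smiles": "CC.O"},
    {"valid": False, "n_fragments": 1, "smiles": ""},
]
metrics = basic_molecular_metrics(generated, training_smiles={"CCO"})
```

In this toy case four of five molecules are valid, three of the four valid ones are connected, two distinct SMILES remain, and one of them is absent from the training set.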

Table 4 shows that only a small fraction of all generated molecules is invalid and must be discarded for downstream processing. The DiffSBDD models trained on CrossDocked with $C_\alpha$ pocket representation generate fragmented molecules about 50% of the time. Since we can simply select and process the largest fragments in these cases, low connectivity does not necessarily affect the efficiency of the generative process. Moreover, all models produce diverse sets of molecules unseen in the training set.

Model Validity Connectivity Uniqueness Novelty
CrossDocked Training data 100% 100%
DiffSBDD-cond ($C_\alpha$) 97.75% 48.02% 96.95% 100%
DiffSBDD-inpaint ($C_\alpha$) 91.62% 51.38% 98.64% 100%
DiffSBDD-cond 93.23% 83.46% 97.46% 100%
Binding MOAD Training data 96.38% 100%
DiffSBDD-cond ($C_\alpha$) 94.02% 66.46% 99.55% 99.81%
DiffSBDD-inpaint ($C_\alpha$) 94.98% 70.21% 99.75% 99.80%
Table 4: Basic molecular metrics for generated small molecules given a $C_\alpha$ and full-atom representation of the protein pocket.

D.3 Octanol-water partition coefficient

The octanol-water partition coefficient ($\log P$) is a measure of lipophilicity and is commonly reported for potential drug candidates (Wildman and Crippen, 1999). We summarize this property for our generated molecules in Table 5.

CrossDocked Binding MOAD
Test set
3D-SBDD (AR) (Luo et al., 2021)
Pocket2Mol (Peng et al., 2022)
DiffSBDD-cond ($C_\alpha$)
DiffSBDD-inpaint ($C_\alpha$)
Table 5: LogP values of generated molecules.

D.4 Agreement of generated and docked conformations

Figure 5: RMSD between original and docked conformations for the CrossDocked dataset. Left (A), DiffSBDD-cond, sample size . Middle (B), DiffSBDD-cond ($C_\alpha$), sample size . Right (C), DiffSBDD-inpaint ($C_\alpha$), sample size .
Figure 6: RMSD between original and docked conformations for the Binding MOAD dataset. Left (A), DiffSBDD-cond ($C_\alpha$), sample size . Right (B), DiffSBDD-inpaint ($C_\alpha$), sample size .

Here we discuss an alternative way of using QuickVina for assessing the quality of the conditional generation procedure besides its in silico docking score. We compare the generated raw conformations (before force-field relaxation) to the best scoring QuickVina docking pose and plot the distribution of resulting RMSD values in Figures 5 and 6. As a baseline, the procedure is repeated for RDKit conformers of the same molecules with identical center of mass. For a large percentage of molecules generated by the all-atom CrossDocked model, QuickVina agrees with the predicted bound conformations, leaving them almost unchanged (RMSD below ). This demonstrates successful conditioning on the given protein pockets.

For the $C_\alpha$-only models, results are less convincing. They produce poses that only slightly improve upon conformers lacking pocket context. This is likely caused by atomic clashes with the proteins’ side chains that QuickVina needs to resolve. Notably, however, there is a clear enrichment of molecules with less than RMSD for both conditional models (Binding MOAD and CrossDocked), showing the advantage over unconditional conformer generation.
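The RMSD comparison described in this section can be sketched as follows. The sketch assumes that the generated and docked poses list the same atoms in the same order and share the pocket's coordinate frame, so no superposition is applied; the function name and the toy coordinates are illustrative, and a real evaluation would also need to account for symmetry-equivalent atom orderings.

```python
import numpy as np


def rmsd(coords_a, coords_b):
    """RMSD between two conformations with identical atom ordering, without alignment."""
    coords_a = np.asarray(coords_a, dtype=float)
    coords_b = np.asarray(coords_b, dtype=float)
    if coords_a.shape != coords_b.shape:
        raise ValueError("conformations must have the same shape")
    squared_displacements = np.sum((coords_a - coords_b) ** 2, axis=1)
    return float(np.sqrt(np.mean(squared_displacements)))


# Toy example: the "docked" pose is the generated pose shifted by 2 units along z.
generated_pose = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
docked_pose = generated_pose + np.array([0.0, 0.0, 2.0])

pose_rmsd = rmsd(generated_pose, docked_pose)
```

Skipping the alignment step is deliberate in this setting: a pose that is merely translated or rotated inside the pocket should count as a disagreement with the docking result.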

D.5 Random generated molecules

Randomly selected molecules generated with our method and 3 baseline methods (LiGAN, 3D-SBDD and Pocket2Mol) when trained with CrossDocked are presented in Figure 7. Randomly selected molecules generated by our method when trained with Binding MOAD are shown in Figure 8.

Figure 7: Generated molecules for 10 randomly chosen targets in the CrossDocked test set. For each target, 3 randomly selected generated molecules from 4 models are shown.
Figure 8: Generated molecules for 10 randomly chosen targets in the Binding MOAD test set. For each target-model pair, 5 randomly selected generated molecules are shown. $C_\alpha$-level protein representations were used for both models.

D.6 Distribution of docking scores by target

We present an extensive evaluation of the docking scores for our generated molecules in Figure 9. We evaluate all models trained with a given dataset first against all targets (Figure 9A+C) and then against 10 randomly chosen targets (Figure 9B+D). We note that the all-atom model trained using CrossDocked data outperforms all other methods. Unsurprisingly, model performance is highly target dependent, likely varying with properties like pocket geometry, size, charge, and hydrophobicity, which would affect the propensity of generating high-affinity molecules.

Figure 9: Docking scores of generated molecules for various methods trained on the CrossDocked (A-B) and Binding MOAD (C-D) datasets. (A) Violin plot of docking scores for all 3 methods trained using CrossDocked. (B) Same as (A) but for 10 randomly chosen targets sorted by mean score. (C) Violin plot of docking scores for both methods trained using Binding MOAD. (D) Same as (C) but for 10 randomly chosen targets sorted by mean score.

Appendix E More Related Work

E.1 Geometric Deep Learning for Drug Discovery

Geometric deep learning refers to incorporating geometric priors into neural architecture design so that models respect symmetries and invariances, which reduces sample complexity and eliminates the need for data augmentation (Bronstein et al., 2021). It has become prevalent in a variety of drug discovery tasks, from virtual screening to de novo drug design, because symmetry is ubiquitous in the representation of drugs. One line of work introduces graph and geometry priors and designs message passing neural networks and equivariant neural networks that are permutation-equivariant and translation-, rotation-, and reflection-equivariant, respectively (Duvenaud et al., 2015; Gilmer et al., 2017; Satorras et al., 2021; Lapchevskyi et al., 2020; Du et al., 2022a). These architectures have been widely used to represent biomolecules from small molecules to proteins (Atz et al., 2021) and to solve downstream tasks such as molecular property prediction (Schütt et al., 2018; Klicpera et al., 2020), binding pose prediction (Stärk et al., 2022) or molecular dynamics (Batzner et al., 2022; Holdijk et al., 2022). Another line of work focuses on generative design of new molecules (Du et al., 2022b, c). Specifically, these methods formulate molecule design as a graph or geometry generation problem and follow one of two strategies: one-shot generation, which generates graphs (atom and bond features) in a single step, and sequential generation, which generates them in a sequence of steps.