1 Introduction
The rational design of molecular compounds to act as drugs remains an outstanding challenge in biopharmaceutical research. Towards supporting such efforts, structure-based drug design (SBDD) aims to generate small-molecule ligands that bind to a specific 3D protein structure with high affinity and specificity (Anderson, 2003). However, SBDD remains very challenging and has important limitations. A traditional SBDD campaign starts with the identification and validation of a target of interest and its subsequent structural characterisation using experimental structure determination methods. The first step in this process is the identification of the binding pocket, a cavity in which ligands may bind the target to elicit the desired therapeutic effect. This can be achieved via experimental means or a plethora of computational approaches (Pérot et al., 2010). Once a binding site is identified, the goal is to discover lead compounds that exhibit the desired biological activity. Importantly, to transition from leads to promising candidates, the compounds need to be evaluated against other drug development constraints that are also hard to predict (toxicity, absorption, etc.).
Traditionally, SBDD is handled either by high-throughput experimental (Blundell, 1996) or virtual screening (Lyne, 2002; Shoichet, 2004) of large chemical databases. Not only is this expensive and time-consuming, but it also limits the exploration of chemical space to the historical knowledge of previously studied molecules, with a further emphasis usually placed on commercial availability (Irwin and Shoichet, 2005). Moreover, the optimization of initial lead molecules is often a biased process, with heavy reliance on human intuition (Ferreira et al., 2015).
Recent advances in geometric deep learning, especially in modeling geometric structures of biomolecules (Bronstein et al., 2021; Atz et al., 2021), provide a promising direction for structure-based drug design (Gaudelet et al., 2021). Even though utilizing deep learning as surrogate docking models has achieved remarkable progress (Lu et al., 2022; Stärk et al., 2022), deep learning-based design of ligands that bind to target proteins is still an open problem. Early attempts have been made to represent molecules as atomic density maps, and variational autoencoders were utilized to generate new atomic density maps corresponding to novel molecules (Ragoza et al., 2022). However, it is non-trivial to map atomic density maps back to molecules, necessitating a subsequent atom-fitting stage. Follow-up work addressed this limitation by representing molecules as 3D graphs with atomic coordinates and types, which circumvents the unnecessary post-processing steps. Li et al. (2021) proposed an autoregressive generative model to sample ligands given the protein pocket as a conditioning constraint. Peng et al. (2022) improved this method by using an E(3)-equivariant graph neural network which respects rotation and translation symmetries in 3D space. Similarly, Drotár et al. (2021) and Liu et al. (2022) used autoregressive models to generate atoms sequentially and incorporate angles during the generation process. Li et al. (2021) formulated the generation process as a reinforcement learning problem and connected the generator with Monte Carlo Tree Search for protein pocket-conditioned ligand generation. However, the main premise of sequential generation methods may not hold in real scenarios, since there is no ordering of the generation process and, as a result, the global context of the generated ligands may be lost. In addition, sequential methods incur additional computational cost that makes model inference inefficient (Luo et al., 2021; Peng et al., 2022). An alternative is a one-shot generation strategy that samples the atomic coordinates and types of all the atoms at once (Du et al., 2022b).

In this work, we develop an equivariant diffusion model for structure-based drug design (DiffSBDD) which, to the best of our knowledge, is the first of its kind. Specifically, we formulate SBDD as a 3D-conditioned generation problem where we aim to generate diverse ligands with high binding affinity for specific protein targets. We propose an equivariant 3D-conditional diffusion model that respects translation, rotation, reflection, and permutation equivariance. We introduce two strategies, protein-conditioned generation and ligand-inpainting generation, producing new ligands conditioned on protein pockets. Specifically, protein-conditioned generation considers the protein as a fixed context, while ligand-inpainting models the joint distribution of the protein-ligand complex, and new ligands are inpainted during inference time. We further curate an experimentally determined binding dataset derived from Binding MOAD (Hu et al., 2005), which supplements the commonly used synthetic CrossDocked dataset (Francoeur et al., 2020) to validate our model performance under realistic binding scenarios. The experimental results demonstrate that DiffSBDD is capable of generating novel, diverse and drug-like ligands with predicted high binding affinities to given protein pockets. The code is available at https://github.com/arneschneuing/DiffSBDD.

2 Background
Denoising Diffusion Probabilistic Models
Denoising diffusion probabilistic models (DDPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020) are a class of generative models inspired by non-equilibrium thermodynamics. Briefly, they define a Markov chain of random diffusion steps by slowly adding noise to sample data and then learning the reverse of this process (typically via a neural network) to reconstruct data samples from noise.
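In this work, we closely follow the framework developed by [15]. In our setting, data samples are atomic point clouds $z_{\text{data}} = [x, h]$ with 3D geometric coordinates $x \in \mathbb{R}^{N \times 3}$ and categorical features $h \in \mathbb{R}^{N \times d}$, where $N$ is the number of atoms. A fixed noise process

$$q(z_t \mid z_{\text{data}}) = \mathcal{N}(z_t \mid \alpha_t z_{\text{data}}, \sigma_t^2 I) \tag{1}$$

adds noise to the data and produces a latent noised representation $z_t$ for $t = 0, \dots, T$. $\alpha_t$ controls the signal-to-noise ratio $\mathrm{SNR}(t) = \alpha_t^2 / \sigma_t^2$ and follows either a learned or predefined schedule from $\alpha_0 \approx 1$ to $\alpha_T \approx 0$ (Kingma et al., 2021). We also choose a variance-preserving noising process (Song et al., 2020) with $\alpha_t = \sqrt{1 - \sigma_t^2}$.

Since the noising process is Markovian, we can write the denoising transition from time step $t$ to $s < t$ in closed form as

$$q(z_s \mid z_{\text{data}}, z_t) = \mathcal{N}\left(z_s \,\Big|\, \frac{\alpha_{t|s}\,\sigma_s^2}{\sigma_t^2}\, z_t + \frac{\alpha_s\,\sigma_{t|s}^2}{\sigma_t^2}\, z_{\text{data}},\ \frac{\sigma_{t|s}^2\,\sigma_s^2}{\sigma_t^2}\, I\right) \tag{2}$$

with $\alpha_{t|s} = \alpha_t / \alpha_s$ and $\sigma_{t|s}^2 = \sigma_t^2 - \alpha_{t|s}^2 \sigma_s^2$ following the notation of [15]. This true denoising process depends on the data sample $z_{\text{data}}$, which is not available when using the model for generating new samples. Instead, a neural network $\phi_\theta$ is used to approximate the sample $\hat{z}_{\text{data}}$. More specifically, we can reparameterize Equation (1) as $z_t = \alpha_t z_{\text{data}} + \sigma_t \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ and directly predict the Gaussian noise $\hat{\epsilon}_\theta = \phi_\theta(z_t, t)$. Thus, $\hat{z}_{\text{data}}$ is simply given as $\hat{z}_{\text{data}} = \frac{1}{\alpha_t} z_t - \frac{\sigma_t}{\alpha_t} \hat{\epsilon}_\theta$.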
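The quantities in Equations (1) and (2) can be made concrete with a short NumPy sketch. The schedule `make_schedule` below is a hypothetical polynomial variance-preserving schedule chosen only for illustration; the function names and the choice of $T$ are likewise assumptions and not taken from the released code.

```python
import numpy as np

def make_schedule(T, eps=1e-5):
    """Hypothetical variance-preserving schedule with alpha_t^2 + sigma_t^2 = 1."""
    t = np.linspace(0.0, 1.0, T + 1)
    alpha_sq = (1.0 - 2.0 * eps) * (1.0 - t**2) + eps  # alpha_0 ~ 1, alpha_T ~ 0
    return np.sqrt(alpha_sq), np.sqrt(1.0 - alpha_sq)

def noise_sample(z_data, alpha_t, sigma_t, rng):
    """Draw z_t ~ q(z_t | z_data) via z_t = alpha_t * z_data + sigma_t * eps."""
    eps = rng.standard_normal(z_data.shape)
    return alpha_t * z_data + sigma_t * eps, eps

def estimate_data(z_t, eps_hat, alpha_t, sigma_t):
    """Invert the reparameterization given a (predicted) noise vector."""
    return z_t / alpha_t - (sigma_t / alpha_t) * eps_hat

def posterior_params(z_t, z_data, alpha, sigma, s, t):
    """Mean and scalar variance of q(z_s | z_data, z_t) for s < t."""
    alpha_ts = alpha[t] / alpha[s]
    sigma_ts_sq = sigma[t] ** 2 - alpha_ts**2 * sigma[s] ** 2
    mean = (alpha_ts * sigma[s] ** 2 / sigma[t] ** 2) * z_t \
        + (alpha[s] * sigma_ts_sq / sigma[t] ** 2) * z_data
    var = sigma_ts_sq * sigma[s] ** 2 / sigma[t] ** 2
    return mean, var

rng = np.random.default_rng(0)
alpha, sigma = make_schedule(T=500)
z_data = rng.standard_normal((5, 3))  # toy point cloud with 5 nodes
t, s = 300, 299
z_t, eps = noise_sample(z_data, alpha[t], sigma[t], rng)
z_hat = estimate_data(z_t, eps, alpha[t], sigma[t])  # uses the true noise
mean, var = posterior_params(z_t, z_data, alpha, sigma, s, t)
```

With the true noise in place of a network prediction, `estimate_data` recovers `z_data` exactly, which is why predicting $\epsilon$ is equivalent to predicting the data sample.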
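To maximise the likelihood of our training data, we could directly optimize the variational lower bound (VLB) (Kingma et al., 2021; [15])

$$\begin{aligned}
-\log p(z_{\text{data}}) \le{}& \underbrace{D_{\mathrm{KL}}\big(q(z_T \mid z_{\text{data}}) \,\|\, p(z_T)\big)}_{\text{prior loss } \mathcal{L}_{\text{prior}}} && (3)\\
&+ \underbrace{\mathbb{E}_{q(z_0 \mid z_{\text{data}})}\big[-\log p(z_{\text{data}} \mid z_0)\big]}_{\text{reconstruction loss } \mathcal{L}_0} && (4)\\
&+ \sum_{t=1}^{T} \underbrace{D_{\mathrm{KL}}\big(q(z_s \mid z_{\text{data}}, z_t) \,\|\, p_\theta(z_s \mid z_t)\big)}_{\text{diffusion loss } \mathcal{L}_t} && (5)
\end{aligned}$$

with $s = t - 1$. The prior loss should always be close to zero and can be computed exactly in closed form, while the reconstruction loss must be estimated as described in [15]. In practice, however, we do not directly optimize the VLB but instead minimize the simplified training objective (Ho et al., 2020; Kingma et al., 2021)

$$\mathcal{L}_{\text{train}} = \frac{1}{2}\, \big\| \epsilon - \phi_\theta(z_t, t) \big\|^2 \tag{6}$$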
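As an illustration of the simplified objective, the following sketch evaluates a single-sample noise-prediction loss for an arbitrary callable `phi`. The schedule, the uniform sampling of the time step, and the two stand-in predictors (`oracle`, which has access to the clean data, and `zero_net`) are assumptions made for this example only.

```python
import numpy as np

def make_schedule(T, eps=1e-5):
    """Hypothetical variance-preserving schedule (alpha_t^2 + sigma_t^2 = 1)."""
    t = np.linspace(0.0, 1.0, T + 1)
    alpha_sq = (1.0 - 2.0 * eps) * (1.0 - t**2) + eps
    return np.sqrt(alpha_sq), np.sqrt(1.0 - alpha_sq)

def simplified_loss(phi, z_data, alpha, sigma, rng):
    """One Monte Carlo sample of 0.5 * ||eps - phi(z_t, t)||^2."""
    t = int(rng.integers(0, len(alpha)))      # uniformly sampled time step
    eps = rng.standard_normal(z_data.shape)   # target noise
    z_t = alpha[t] * z_data + sigma[t] * eps  # noised sample
    return 0.5 * np.sum((eps - phi(z_t, t)) ** 2)

rng = np.random.default_rng(0)
alpha, sigma = make_schedule(T=100)
z_data = rng.standard_normal((4, 3))

def oracle(z_t, t):
    # Stand-in predictor that knows z_data and therefore recovers eps exactly.
    return (z_t - alpha[t] * z_data) / sigma[t]

def zero_net(z_t, t):
    # Stand-in predictor that always outputs zero noise.
    return np.zeros_like(z_t)

loss_oracle = simplified_loss(oracle, z_data, alpha, sigma, rng)
loss_zero = simplified_loss(zero_net, z_data, alpha, sigma, rng)
```

A trained network would sit between these two extremes; in practice the loss is averaged over mini-batches and minimized by gradient descent.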
E(3)-equivariant Graph Neural Networks
A function $f: \mathcal{X} \to \mathcal{Y}$ is said to be equivariant w.r.t. the group $G$ if $f(g.x) = g.f(x)$, where $g.$ denotes the action of the group element $g \in G$ on $\mathcal{X}$ and $\mathcal{Y}$ (Serre, 1977). Graph Neural Networks (GNNs) are learnable functions that process graph-structured data in a permutation-equivariant way, making them particularly useful for molecular systems where nodes do not have an intrinsic order. Permutation equivariance means that $\mathrm{GNN}(\Pi X) = \Pi\, \mathrm{GNN}(X)$, where $\Pi$ is an $N \times N$ permutation matrix acting on the node feature matrix. Since the nodes of the molecular graph represent the 3D coordinates of atoms, we are interested in additional equivariance w.r.t. the Euclidean group $E(3)$ of rigid transformations. An equivariant GNN (EGNN) satisfies $\mathrm{EGNN}(XA + b) = \mathrm{EGNN}(X)A + b$ for an orthogonal $3 \times 3$ matrix $A$ with $A^\top A = I$ and some translation vector $b$ added row-wise.

In our case, since the nodes have both geometric atomic coordinates $x$ as well as atomic type features $h$, we can use a simple implementation of EGNN proposed by Satorras et al. (2021), in which the updates for features $h$ and coordinates $x$ of node $i$ at layer $l$ are computed as follows:

$$m_{ij} = \phi_e(h_i^l, h_j^l, d_{ij}^2, a_{ij}) \tag{7}$$

$$h_i^{l+1} = \phi_h\Big(h_i^l, \sum_{j \neq i} m_{ij}\Big) \tag{8}$$

$$x_i^{l+1} = x_i^l + \sum_{j \neq i} \frac{x_i^l - x_j^l}{d_{ij} + 1}\, \phi_x(h_i^l, h_j^l, d_{ij}^2, a_{ij}) \tag{9}$$

where $\phi_e$, $\phi_h$, and $\phi_x$ are learnable Multi-layer Perceptrons (MLPs) and $d_{ij}$ and $a_{ij}$ are the relative distances and edge features between nodes $i$ and $j$ respectively.

3 Equivariant Diffusion Models for SBDD
We utilize an equivariant DDPM to generate molecules and binding conformations jointly with respect to a specific protein target. We represent protein and ligand point clouds as fully-connected graphs that are further processed by EGNNs (Satorras et al., 2021). We consider two distinct approaches to 3D pocket conditioning: (1) a conditional DDPM that receives a fixed pocket representation as context in each denoising step, and (2) a model that approximates the joint distribution of ligand-pocket pairs combined with inpainting at inference time.
3.1 Pocket-conditioned small molecule generation
In the conditional molecule generation setup, we provide fixed three-dimensional context in each step of the denoising process. To this end, we supplement the ligand node point cloud $z_t^{(L)}$, denoted by superscript $L$, with protein pocket nodes $z_{\text{data}}^{(P)}$, denoted by superscript $P$, that remain unchanged throughout the reverse diffusion process (Figure 2).
We parameterize the noise predictor $\hat{\epsilon}_\theta(z_t^{(L)}, z_{\text{data}}^{(P)}, t)$ with an EGNN (Satorras et al., 2021; [15]). To process ligand and pocket nodes with a single GNN, atom types and residue types are first embedded in a joint node embedding space by separate learnable MLPs. We employ the same message-passing scheme outlined in Equations (7)-(9); however, following Igashov et al. (2022), we replace the coordinate update step with the following:

$$x_i^{l+1} = x_i^l + \mathbb{1}_{L}(i) \sum_{j \neq i} \frac{x_i^l - x_j^l}{d_{ij} + 1}\, \phi_x(h_i^l, h_j^l, d_{ij}^2, a_{ij}) \tag{10}$$

where the indicator $\mathbb{1}_{L}(i)$ is one for ligand nodes and zero for pocket nodes, to ensure the three-dimensional protein context remains fixed throughout the EGNN layers.
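A minimal NumPy sketch of one such layer may help to see why the masked coordinate update keeps the pocket fixed while preserving equivariance. It is an illustration under simplifying assumptions (randomly initialised two-layer MLPs, no edge features $a_{ij}$, hypothetical names such as `egnn_layer` and `is_ligand`), not the architecture used in the experiments.

```python
import numpy as np

def init_mlp(rng, d_in, d_hidden, d_out):
    """Randomly initialised two-layer MLP parameters (untrained, illustration only)."""
    return (0.1 * rng.standard_normal((d_in, d_hidden)), np.zeros(d_hidden),
            0.1 * rng.standard_normal((d_hidden, d_out)), np.zeros(d_out))

def mlp(params, x):
    W1, b1, W2, b2 = params
    return np.tanh(x @ W1 + b1) @ W2 + b2

def egnn_layer(x, h, is_ligand, params):
    """x: (N, 3) coordinates, h: (N, d) features, is_ligand: (N,) boolean mask."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]              # x_i - x_j, shape (N, N, 3)
    d2 = np.sum(diff**2, axis=-1, keepdims=True)      # squared distances
    d = np.sqrt(d2)
    h_i = np.repeat(h[:, None, :], n, axis=1)
    h_j = np.repeat(h[None, :, :], n, axis=0)
    pair = np.concatenate([h_i, h_j, d2], axis=-1)    # invariant edge inputs
    off_diag = 1.0 - np.eye(n)[:, :, None]            # exclude j == i
    m = mlp(params["e"], pair) * off_diag             # messages m_ij
    h_new = mlp(params["h"], np.concatenate([h, m.sum(axis=1)], axis=-1))
    w = mlp(params["x"], pair) * off_diag             # scalar weight per edge
    shift = np.sum(diff / (d + 1.0) * w, axis=1)      # equivariant displacement
    x_new = x + shift * is_ligand[:, None]            # pocket nodes stay fixed
    return x_new, h_new

rng = np.random.default_rng(0)
n_nodes, d_feat, d_msg = 7, 4, 8
params = {
    "e": init_mlp(rng, 2 * d_feat + 1, 16, d_msg),
    "h": init_mlp(rng, d_feat + d_msg, 16, d_feat),
    "x": init_mlp(rng, 2 * d_feat + 1, 16, 1),
}
x = rng.standard_normal((n_nodes, 3))
h = rng.standard_normal((n_nodes, d_feat))
is_ligand = np.array([True, True, True, False, False, False, False])

# Apply a random orthogonal transformation and translation to the whole system.
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))
b = rng.standard_normal(3)
x1, h1 = egnn_layer(x, h, is_ligand, params)
x2, h2 = egnn_layer(x @ R.T + b, h, is_ligand, params)
```

Because the messages depend only on distances and invariant features, and displacements are linear combinations of coordinate differences, transforming the input transforms the output coordinates in the same way and leaves the features unchanged.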
Equivariance
In the probabilistic setting with 3D-conditioning, we would like to ensure equivariance in the following sense. (Footnote 1: We transpose the node feature matrices hereafter so that the matrix multiplication resembles application of a group action. We also ignore node type features, which transform invariantly, for simpler notation.)

Evaluating the likelihood of a molecule $x^{(L)} \in \mathbb{R}^{3 \times N_L}$ given the three-dimensional representation of a protein pocket $x^{(P)} \in \mathbb{R}^{3 \times N_P}$ should not depend on global transformations of the system, i.e. $p(Rx^{(L)} + t \mid Rx^{(P)} + t) = p(x^{(L)} \mid x^{(P)})$ for orthogonal $R \in \mathbb{R}^{3 \times 3}$ with $R^\top R = I$ and $t \in \mathbb{R}^3$ added column-wise.

At the same time, it should be possible to generate samples $x^{(L)} \sim p(x^{(L)} \mid x^{(P)})$ from this conditional probability distribution so that equivalently transformed ligands $Rx^{(L)} + t$ are sampled with the same probability if the input pocket is rotated and translated and we sample from $p(Rx^{(L)} + t \mid Rx^{(P)} + t)$.
Equivariance to the orthogonal group (comprising rotations and reflections) is achieved because we model both prior and transition probabilities with isotropic Gaussians where the mean vector transforms equivariantly w.r.t. rotations of the context (see [15] and Appendix C). Ensuring translation equivariance, however, is not as easy because the transition probabilities are not inherently translation-equivariant. In order to circumvent this issue, we follow previous works (Köhler et al., 2020; Xu et al., 2022; [15]) by limiting the whole sampling process to a linear subspace where the center of mass (CoM) of the system is zero. In practice, this is achieved by subtracting the center of mass of the system before performing likelihood computations or denoising steps.
Note that the 3D-conditional model can achieve equivariance without this "subspace trick". The coordinates of pocket nodes provide a reference frame for all samples that can be used to translate them to a unique location (e.g. such that the pocket is centered at the origin: $\tilde{x}_i = x_i - \frac{1}{N_P} \sum_{j} x_j^{(P)}$). By doing this for all training data, translation equivariance becomes irrelevant and the CoM-free subspace approach obsolete. To evaluate the likelihood of translated samples at inference time, we can first subtract the pocket's center of mass from the whole system and compute the likelihood after this mapping. Similarly, for sampling molecules we can first generate a ligand in a CoM-free version of the pocket and move the whole system back to the original location of the pocket nodes to restore translation equivariance. As long as the mean of our Gaussian noise distribution depends equivariantly on the pocket node coordinates $x^{(P)}$, equivariance is satisfied as well (Appendix C). Since this change did not seem to affect the performance of the conditional model in our experiments, we decided to keep sampling in the linear subspace to ensure that the implementation is as similar as possible to the joint model, for which the subspace approach is necessary.
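Both options discussed above amount to simple centering operations, sketched below. The center of mass is taken here as the unweighted mean over nodes, which is an assumption of this example, as are the function names.

```python
import numpy as np

def remove_com(x):
    """Project coordinates (N, 3) onto the subspace with zero center of mass."""
    return x - x.mean(axis=0, keepdims=True)

def sample_com_free_noise(n_nodes, rng):
    """Gaussian noise restricted to the CoM-free subspace."""
    return remove_com(rng.standard_normal((n_nodes, 3)))

def center_on_pocket(x_lig, x_pocket):
    """Alternative reference frame: translate so the pocket CoM is at the origin."""
    com = x_pocket.mean(axis=0, keepdims=True)
    return x_lig - com, x_pocket - com

rng = np.random.default_rng(0)
noise = sample_com_free_noise(10, rng)

x_lig = rng.standard_normal((4, 3))
x_pocket = rng.standard_normal((6, 3))
b = np.array([5.0, -2.0, 3.0])  # arbitrary global translation
lig_a, poc_a = center_on_pocket(x_lig, x_pocket)
lig_b, poc_b = center_on_pocket(x_lig + b, x_pocket + b)
```

After centering on the pocket, a globally translated copy of the system maps to exactly the same coordinates, so any downstream likelihood is unaffected by the translation.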
3.2 Joint distribution with inpainting
As an extension to the conditional approach described above, we also present a ligand-inpainting approach. Originally introduced as a technique for completing masked parts of images (Song et al., 2020; Lugmayr et al., 2022), inpainting has been adopted in other domains, including biomolecular structures (Wang et al., 2022). Here, we extend this idea to three-dimensional point cloud data.
We first train an unconditional DDPM to approximate the joint distribution of ligand and pocket nodes $p(z_{\text{data}}^{(L)}, z_{\text{data}}^{(P)})$. (Footnote 2: We use the notations $z_t$ and $[z_t^{(L)}, z_t^{(P)}]$ interchangeably to describe the combined system of ligand and pocket nodes.) This allows us to sample new pairs without additional context. To condition on a target protein pocket, we then need to inject context into the sampling process by modifying the probabilistic transition steps. The combined latent representation $z_{t-1}$ of protein pocket and ligand at diffusion step $t-1$ is assembled from a forward noised version of the pocket that is combined with ligand nodes predicted by the DDPM based on the previous latent representation at step $t$:

$$z_{t-1}^{(P)} \sim q(z_{t-1} \mid z_{\text{data}}^{(P)}) \tag{11}$$

$$z_{t-1}^{(L)} \sim p_\theta(z_{t-1} \mid z_t) \tag{12}$$

$$z_{t-1} = [z_{t-1}^{(L)}, z_{t-1}^{(P)}] \tag{13}$$
In this manner, we traverse the Markov chain in reverse order from $t = T$ to $t = 0$, replacing the predicted pocket nodes with their forward noised counterparts in each step. Equation (12) conditions the generative process on the given protein pocket. Thanks to the noise schedule, which decreases the variance of the noising process to almost zero at $t = 0$ (Equation (1)), the final sample is guaranteed to contain an unperturbed representation of the protein pocket.

Since the model is trained to approximate the unconditional joint distribution of ligand-pocket pairs, the training procedure is identical to the unconditional molecule generation procedure developed by [15] aside from the fully-connected neural networks that embed protein and ligand node features in a common space as described in Section 3.1. The conditioning on known protein pockets is entirely delegated to the sampling algorithm, which means this approach is not limited to ligand-inpainting but, in principle, allows us to mask and replace arbitrary parts of the ligand-pocket system without retraining.
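The sampling procedure of Equations (11)-(13) can be sketched as a short loop. Here `denoise_step` is a hypothetical callable standing in for one draw from the learned reverse transition, the schedule is an illustrative variance-preserving choice, and center-of-mass handling is omitted for brevity.

```python
import numpy as np

def make_schedule(T, eps=1e-5):
    """Hypothetical variance-preserving schedule (alpha_t^2 + sigma_t^2 = 1)."""
    t = np.linspace(0.0, 1.0, T + 1)
    alpha_sq = (1.0 - 2.0 * eps) * (1.0 - t**2) + eps
    return np.sqrt(alpha_sq), np.sqrt(1.0 - alpha_sq)

def inpaint(denoise_step, z_pocket_data, n_lig, alpha, sigma, rng):
    """Reverse diffusion in which pocket nodes are re-noised from the known pocket."""
    T = len(alpha) - 1
    n_pocket, dim = z_pocket_data.shape
    z = rng.standard_normal((n_lig + n_pocket, dim))     # z_T ~ N(0, I)
    for t in range(T, 0, -1):
        z_pred = denoise_step(z, t, rng)                 # joint prediction for step t-1
        z_lig = z_pred[:n_lig]                           # keep predicted ligand nodes
        eps = rng.standard_normal(z_pocket_data.shape)
        z_pocket = alpha[t - 1] * z_pocket_data + sigma[t - 1] * eps  # forward-noised pocket
        z = np.concatenate([z_lig, z_pocket], axis=0)    # combined representation
    return z[:n_lig], z[n_lig:]

def dummy_denoise_step(z, t, rng):
    # Placeholder for the trained model: shrinks the sample and adds a little noise.
    return 0.9 * z + 0.1 * rng.standard_normal(z.shape)

rng = np.random.default_rng(0)
alpha, sigma = make_schedule(T=50)
z_pocket_data = rng.standard_normal((6, 4))  # 6 pocket nodes, 3 coordinates + 1 feature
n_lig = 3
z_lig, z_pocket = inpaint(dummy_denoise_step, z_pocket_data, n_lig, alpha, sigma, rng)
```

Because the noise level at the final step is close to zero, the returned pocket nodes are nearly identical to the known pocket, while the ligand nodes are whatever the reverse process produced in that context.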
Equivariance
Similar desiderata as in the conditional case apply to the joint probability model, where we desire invariance that can be obtained from invariant priors via equivariant flows (Köhler et al., 2020). The main complications compared to the previous approach are the missing reference frame and the impossibility of defining a valid translation-invariant prior noise distribution, as such a distribution cannot integrate to one. Consequently, it is necessary to restrict the probabilistic model to a CoM-free subspace as described in previous works (Köhler et al., 2020; Xu et al., 2022; [15]). While the reverse diffusion process is defined for a CoM-free system, substituting the predicted pocket node coordinates with a new diffused version of the known pocket as described in Equations (11)-(13) can lead to non-zero CoM. To prevent this, we translate the known pocket representation so that its center of mass coincides with the predicted representation: $\tilde{x}_{\text{data}}^{(P)} = x_{\text{data}}^{(P)} + \frac{1}{N_P} \sum_i \hat{x}_{t-1,i}^{(P)} - \frac{1}{N_P} \sum_i x_{\text{data},i}^{(P)}$ before creating the new combined representation $z_{t-1}$ with $\tilde{z}_{\text{data}}^{(P)} = [\tilde{x}_{\text{data}}^{(P)}, h_{\text{data}}^{(P)}]$.
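The translation step can be illustrated with a few lines of NumPy. The names are hypothetical, the center of mass is taken as an unweighted mean, and the subsequent forward-noising of the aligned pocket is deliberately left out of this sketch.

```python
import numpy as np

def align_pocket_com(x_pocket_data, x_pocket_pred):
    """Translate the known pocket so its CoM matches that of the predicted pocket."""
    shift = x_pocket_pred.mean(axis=0) - x_pocket_data.mean(axis=0)
    return x_pocket_data + shift

rng = np.random.default_rng(0)
n_lig, n_pocket = 4, 6

# A CoM-free predicted system, split into ligand and pocket nodes.
x_pred = rng.standard_normal((n_lig + n_pocket, 3))
x_pred -= x_pred.mean(axis=0, keepdims=True)
x_lig_pred, x_pocket_pred = x_pred[:n_lig], x_pred[n_lig:]

# The known pocket lives at some arbitrary location in space.
x_pocket_data = rng.standard_normal((n_pocket, 3)) + 5.0
x_pocket_aligned = align_pocket_com(x_pocket_data, x_pocket_pred)

# Swapping in the aligned pocket leaves the combined system CoM-free.
x_new = np.concatenate([x_lig_pred, x_pocket_aligned], axis=0)
```

The shift changes only the location of the known pocket, not its internal geometry, so the conditioning information is preserved.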
4 Experiments
4.1 Datasets
CrossDocked
We use the CrossDocked dataset (Francoeur et al., 2020) and follow the same filtering and splitting strategies as in previous work (Luo et al., 2021; Peng et al., 2022). This results in 100,000 high-quality protein-ligand pairs for the training set and 100 proteins for the test set. The split is done by 30% sequence identity using MMseqs2 (Steinegger and Söding, 2017).
Binding MOAD
We also evaluate our method on experimentally determined protein-ligand complexes found in Binding MOAD (Hu et al., 2005), which are filtered and split based on the proteins' enzyme commission number as described in Appendix B. This results in 40,354 protein-ligand pairs for training and 130 pairs for testing.
4.2 Evaluation
For every experiment, we evaluated all combinations of all-atom and $C_\alpha$-level graphs with conditional and inpainting-based approaches respectively (with the exception of the all-atom inpainting approach due to computational limitations). Full details of model architecture and hyperparameters are given in Appendix A. We sampled 100 valid molecules for each target pocket with ground truth ligand sizes and removed all atoms that are not bonded to the largest connected fragment. (Footnote 3: Due to occasional processing issues, the actual number of available molecules is slightly lower on average; see Appendix D.1.)

We employ widely-used metrics to assess the quality of our generated molecules (Peng et al., 2022; Li et al., 2021): (1) Vina Score is a physics-based estimation of binding affinity between small molecules and their target pocket; (2) QED is a simple quantitative estimation of drug-likeness combining several desirable molecular properties; (3) SA (synthetic accessibility) is a measure estimating the difficulty of synthesis; (4) Lipinski measures how many rules in the Lipinski rule of five (Lipinski et al., 2012), which is a loose rule of thumb to assess the drug-likeness of molecules, are satisfied; (5) Diversity is computed as the average pairwise dissimilarity (1 − Tanimoto similarity) between all generated molecules for each pocket; (6) Inference Time is the average time to sample 100 molecules for one pocket across all targets. All docking scores and chemical properties are calculated with QuickVina2 (Alhossary et al., 2015) and RDKit (Landrum et al., 2016).
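To make the diversity metric concrete, the sketch below computes the average pairwise Tanimoto dissimilarity in pure Python. Representing each fingerprint as a set of on-bit indices is an assumption made for this example; in the experiments the fingerprints and similarities are computed with RDKit.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def diversity(fingerprints):
    """Average pairwise dissimilarity (1 - Tanimoto) over all molecule pairs."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)

# Toy fingerprints for three hypothetical molecules generated for one pocket.
fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
score = diversity(fps)
```

In this toy case the first two fingerprints share two of four distinct bits (dissimilarity 0.5) and the third shares none with either (dissimilarity 1.0), so the diversity is (0.5 + 1 + 1) / 3.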
4.3 Baselines
We compare with two recent deep learning methods for structure-based drug design. 3D-SBDD (Luo et al., 2021) and Pocket2Mol (Peng et al., 2022) are autoregressive schemes relying on graph representations of the protein pocket and previously placed atoms to predict probabilities based on which new atoms are added. 3D-SBDD uses heuristics to infer bonds from generated atomic point clouds while Pocket2Mol directly predicts them during the sequential generation process.

4.4 Results
CrossDocked
Overall, the experimental results in Table 1 suggest that DiffSBDD can generate diverse small-molecule compounds with predicted high binding affinity, matching state-of-the-art performance. We do not see significant differences between the conditional model and the inpainting approach. The diversity score is arguably the most interesting, as this suggests our model is able to sample greater amounts of chemical space when compared to previous methods, while maintaining high binding performance, one of the most important requirements in early-stage, structure-based lead discovery. Specifically, DiffSBDD aims to generate ligands that bind to protein pockets and learn the probability density of ligands interacting with protein pockets. While it does not optimize for other molecular properties, such as QED and Lipinski, it generates molecules similar to the test set distributions. Only SA scores are significantly lower on average. However, this reflects that our models are capable of exploring larger amounts of chemical space, given that SA primarily scores against the historical knowledge of previously synthesised molecules (Ertl and Schuffenhauer, 2009). Generally, presenting the full atomic context to the model constrains the space of outputs considerably, leading to higher Vina scores but lower diversity compared to the $C_\alpha$-only models. The all-atom model consistently beats $C_\alpha$-based models on a per-target basis (Appendix Figure 9).
A representative selection of molecules for two targets (2jjg and 3kc1) is presented in Figure 3. This set is curated to be representative of our high-scoring molecules, with both realistic and non-realistic motifs shown. It is noteworthy that the second molecule generated for 3kc1 has a similar tricyclic motif in the same pocket location as the reference ligand, which was designed by traditional SBDD methods to maximise the hydrophobic interactions via shape complementarity of the ring system (Tsukada et al., 2010). However, a number of irregularities are present in even the highest-scoring generated molecules. For example, the high number of triangles in the molecules targeting 2jjg (from Inpainting) and the large rings for 3kc1 would prove difficult to synthesise. Random selections of generated molecules made by all methods evaluated are presented in Figure 7.
All docking scores reported in Table 1 are within one standard deviation of each other, which poses challenges for the discrimination of the best models. To verify successful pocket-conditioning, we therefore discuss the agreement of generated molecular conformations with poses after docking in Appendix D.4. This experiment showcases the success of our method in modeling protein-drug interactions at the atomic level and clearly highlights the benefits of the all-atom pocket representation.

Binding MOAD
Results for the Binding MOAD dataset with experimentally determined binding complex data are reported in Table 2. 100 valid ligands have been generated for each of the 130 test pockets, resulting in 13,000 molecules in total. (Footnote 4: The QuickVina score could not be computed for 49 molecules from DiffSBDD-cond.) DiffSBDD generates highly diverse molecules, but on average docking scores are lower than those of the corresponding reference ligands from this dataset.
Generated molecules for a representative target are shown in Figure 4. The target (PDB: 6c0b) is a human receptor which is involved in microbial infection (Chen et al., 2018) and possibly tumor suppression (Ding et al., 2016). The reference molecule, a long fatty acid (see Figure 4) that aids receptor binding (Chen et al., 2018), has too many rotatable bonds and too few hydrogen bond donors/acceptors to be considered a suitable drug (QED of 0.36). Our model, however, generates drug-like (QED between 0.63 and 0.85) and suitably sized molecules by adding aromatic rings connected by a small number of rotatable bonds, which allows the molecules to adopt a complementary binding geometry and is entropically favourable (by reducing the degrees of freedom), a classic technique in medicinal chemistry (Ritchie and Macdonald, 2009). A random selection of generated molecules is presented in Figure 8.

5 Related Work
Diffusion Models for Molecules
Inspired by non-equilibrium thermodynamics, diffusion models have been proposed to learn data distributions by modeling a denoising (reverse diffusion) process and have achieved remarkable success in a variety of tasks such as image and audio synthesis and point cloud generation (Kingma et al., 2021; Kong et al., 2021; Luo and Hu, 2021). Recently, efforts have been made to utilize diffusion models for molecule design (Du et al., 2022b). Specifically, [15] propose a diffusion model with an equivariant network that operates both on continuous atomic coordinates and categorical atom types to generate new molecules in 3D space. Torsional Diffusion (Jing et al., 2022) focuses on a conditional setting where molecular conformations (atomic coordinates) are generated from molecular graphs (atom types and bonds). Similarly, 3D diffusion models have been applied to generative design of larger biomolecular structures, such as antibodies (Luo et al., 2022) and other proteins (Anand and Achim, 2022; Trippe et al., 2022).
Structure-based Drug Design
Structure-based Drug Design (SBDD) (Blundell, 1996; Ferreira et al., 2015; Anderson, 2003) relies on the knowledge of the 3D structure of the biological target obtained either through experimental methods or high-confidence predictions using homology modelling (Kelley et al., 2015). Candidate molecules are then designed to bind with high affinity and specificity to the target using interactive software (Kalyaanamoorthy and Chen, 2011) and often human intuition (Ferreira et al., 2015). Recent advances in deep generative models have brought a new wave of research that models the conditional distribution of ligands given biological targets and thus enables de novo structure-based drug design. Most recent work treats this task as a sequential generation problem and designs a variety of generative methods, including autoregressive models and reinforcement learning, to generate ligands inside protein pockets atom by atom (Drotár et al., 2021; Luo et al., 2021; Li et al., 2021; Peng et al., 2022).
6 Conclusion
In this work, we propose DiffSBDD, an equivariant 3D-conditional diffusion model for structure-based drug design. We demonstrate the effectiveness and efficiency of DiffSBDD in generating novel and diverse ligands with predicted high affinity for given protein pockets on both a synthetic benchmark and a new dataset of experimentally determined protein-ligand complexes. We demonstrate that an inpainting-based approach can achieve results competitive with direct conditioning on a wide range of molecular metrics. Extending this more versatile strategy to an all-atom pocket representation therefore holds promise to solve a variety of other structure-based drug design tasks, such as lead optimization, linker design, and binding site design, without retraining.
Acknowledgments
We thank Xingang Peng and Shitong Luo for providing us with generated molecules of the Pocket2Mol and 3D-SBDD methods. We thank Hannes Stärk and Joshua Southern for valuable feedback and insightful discussions. This work was supported by the European Research Council (starting grant no. 716058), the Swiss National Science Foundation (grant no. 310030_188744), and Microsoft Research AI4Science. Charles Harris is supported by the Cambridge Centre for AI in Medicine Studentship which is in turn funded by AstraZeneca and GSK. Michael Bronstein is supported in part by ERC Consolidator grant no. 724228 (LEMAN).
References
 Fast, accurate, and reliable molecular docking with quickvina 2. Bioinformatics 31 (13), pp. 2214–2216. Cited by: §4.2.
 Protein structure and sequence generation with equivariant denoising diffusion probabilistic models. arXiv preprint arXiv:2205.15019. Cited by: §5.
 The process of structurebased drug design. Chemistry & biology 10 (9), pp. 787–797. Cited by: §1, §5.
 Geometric deep learning on molecular representations. Nature Machine Intelligence 3 (12), pp. 1023–1032. Cited by: §E.1, §1.
 E (3)equivariant graph neural networks for dataefficient and accurate interatomic potentials. Nature communications 13 (1), pp. 1–11. Cited by: §E.1.
 Structurebased drug design.. Nature 384 (6604 Suppl), pp. 23–26. Cited by: §1, §5.
 Geometric deep learning: grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478. Cited by: §E.1, §1.
 Structural basis for recognition of frizzled proteins by clostridium difficile toxin b. Science 360 (6389), pp. 664–669. Cited by: §4.4.
 FZD2 inhibits the cell growth and migration of salivary adenoid cystic carcinomas. Oncology Reports 35 (2), pp. 1006–1012. Cited by: §4.4.
 Structureaware generation of druglike molecules. arXiv preprint arXiv:2111.04107. Cited by: §1, §5.
 SE (3) equivariant graph neural networks with complete local frames. In International Conference on Machine Learning, pp. 5583–5608. Cited by: §E.1.
 MolGenSurvey: a systematic survey in machine learning models for molecule design. arXiv preprint arXiv:2203.14500. Cited by: §E.1, §1, §5.
 ChemSpacE: interpretable and interactive chemical space exploration. Cited by: §E.1.
 Convolutional networks on graphs for learning molecular fingerprints. Advances in neural information processing systems 28. Cited by: §E.1.
 [15] Equivariant diffusion for molecule generation in 3d. Cited by: Appendix A, Appendix A, §2, §2, §2, §2, §3.1, §3.1, §3.2, §3.2, §5.
 Estimation of synthetic accessibility score of druglike molecules based on molecular complexity and fragment contributions. Journal of cheminformatics 1 (1), pp. 1–11. Cited by: §4.4.
 Molecular docking and structurebased drug design strategies. Molecules 20 (7), pp. 13384–13421. Cited by: §1, §5.

Threedimensional convolutional neural networks and a crossdocked data set for structurebased drug design
. Journal of Chemical Information and Modeling 60 (9), pp. 4200–4215. Cited by: §1, §4.1.  Utilizing graph machine learning within drug discovery and development. Briefings in Bioinformatics 22 (6). External Links: Document, Link Cited by: §1.
 Neural message passing for quantum chemistry. In International conference on machine learning, pp. 1263–1272. Cited by: §E.1.
 Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, pp. 6840–6851. Cited by: §2, §2.
 Path integral stochastic optimal control for sampling transition paths. arXiv preprint arXiv:2207.02149. Cited by: §E.1.
 Binding moad (mother of all databases). Proteins: Structure, Function, and Bioinformatics 60 (3), pp. 333–340. Cited by: Appendix B, §1, §4.1.
 Equivariant 3dconditional diffusion models for molecular linker design. arXiv preprint arXiv:2210.05274. Cited by: §3.1.
 ZINC a free database of commercially available compounds for virtual screening. Journal of chemical information and modeling 45 (1), pp. 177–182. Cited by: §1.
 Torsional diffusion for molecular conformer generation. arXiv preprint arXiv:2206.01729. Cited by: §5.
 Structurebased drug design to augment hit discovery. Drug discovery today 16 (1718), pp. 831–839. Cited by: §5.
 The phyre2 web portal for protein modeling, prediction and analysis. Nature protocols 10 (6), pp. 845–858. Cited by: §5.
 Variational diffusion models. Advances in neural information processing systems 34, pp. 21696–21707. Cited by: §2, §2, §2, §5.
 Directional message passing for molecular graphs. arXiv preprint arXiv:2003.03123. Cited by: §E.1.
 Equivariant flows: exact likelihood generative learning for symmetric densities. In International conference on machine learning, pp. 5361–5370. Cited by: §3.1, §3.2.
 DiffWave: a versatile diffusion model for audio synthesis. In International Conference on Learning Representations, Cited by: §5.

Rdkit: opensource cheminformatics software
. Cited by: §4.2.  Euclidean neural networks (e3nn) v1. 0. Technical report Lawrence Berkeley National Lab.(LBNL), Berkeley, CA (United States). Cited by: §E.1.
 Structurebased de novo drug design using 3d deep generative models. Chemical science 12 (41), pp. 13664–13675. Cited by: §1, §4.2, §5.
 Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced drug delivery reviews 64, pp. 4–17. Cited by: §4.2.
 Generating 3d molecules for target protein binding. arXiv preprint arXiv:2204.09410. Cited by: §1.
 TANKBind: trigonometryaware neural networks for drugprotein binding structure prediction. bioRxiv. Cited by: §1.

Repaint: inpainting using denoising diffusion probabilistic models.
In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
, pp. 11461–11471. Cited by: §3.2.  A 3d generative model for structurebased drug design. Advances in Neural Information Processing Systems 34, pp. 6229–6239. Cited by: Appendix A, Table 5, §1, §4.1, §4.3, Table 1, §5.
 Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2837–2845. Cited by: §5.
 Antigen-specific antibody design and optimization with diffusion-based generative models. bioRxiv. Cited by: §5.
 Structure-based virtual screening: an overview. Drug discovery today 7 (20), pp. 1047–1055. Cited by: §1.
 Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162–8171. Cited by: Appendix A.
 Open babel: an open chemical toolbox. Journal of cheminformatics 3 (1), pp. 1–14. Cited by: Appendix A.
 Pocket2Mol: efficient molecular sampling based on 3d protein pockets. arXiv preprint arXiv:2205.07249. Cited by: Table 5, §1, §4.1, §4.2, §4.3, Table 1, §5.
 Druggable pockets and binding site centric chemical space: a paradigm shift in drug discovery. Drug discovery today 15 (15–16), pp. 656–667. Cited by: §1.
 Generating 3d molecules conditional on receptor binding sites with deep generative models. Chemical science 13 (9), pp. 2701–2713. Cited by: §1.
 The impact of aromatic ring count on compound developability–are too many aromatic rings a liability in drug design?. Drug discovery today 14 (21–22), pp. 1011–1020. Cited by: §4.4.
 E(n) equivariant graph neural networks. In International conference on machine learning, pp. 9323–9332. Cited by: §E.1, §2, §3.1, §3.
 Schnet–a deep learning architecture for molecules and materials. The Journal of Chemical Physics 148 (24), pp. 241722. Cited by: §E.1.
 Linear representations of finite groups. Vol. 42, Springer. Cited by: §2.
 Virtual screening of chemical libraries. Nature 432 (7019), pp. 862–865. Cited by: §1.

 Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. Cited by: §2.
 Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456. Cited by: §2, §3.2.
 Equibind: geometric deep learning for drug binding structure prediction. In International Conference on Machine Learning, pp. 20503–20521. Cited by: §E.1, §1.
 MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology 35 (11), pp. 1026–1028. External Links: Document, Link Cited by: §4.1.
 Diffusion probabilistic modeling of protein backbones in 3d for the motif-scaffolding problem. arXiv preprint arXiv:2206.04119. Cited by: §5.
 Structure-based drug design of tricyclic 8H-indeno[1,2-d][1,3]thiazoles as potent FBPase inhibitors. Bioorganic & medicinal chemistry letters 20 (3), pp. 1004–1007. Cited by: §4.4.
 Scaffolding protein functional sites using deep learning. Science 377 (6604), pp. 387–394. Cited by: §3.2.
 Prediction of physicochemical parameters by atomic contributions. Journal of chemical information and computer sciences 39 (5), pp. 868–873. Cited by: §D.3.
 Geodiff: a geometric diffusion model for molecular conformation generation. arXiv preprint arXiv:2203.02923. Cited by: §3.1, §3.2.
Appendix A Implementation Details
Molecule size
As part of a sample’s overall likelihood, we compute the empirical joint distribution $p(N_L, N_P)$ of the numbers of ligand and pocket nodes observed in the training set and smooth it with a Gaussian filter (). In the conditional generation scenario, we derive the conditional distribution $p(N_L \mid N_P)$ and use it for likelihood computations.
For sampling, we can either fix molecule sizes manually or sample the number of ligand nodes from the same distribution given the number of nodes in the target pocket:
$$N_L \sim p(N_L \mid N_P) \qquad (14)$$
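The size-sampling step can be sketched as follows. This is a minimal, dependency-free illustration rather than the authors' implementation: the function names, the kernel radius, and the default filter width `sigma=1.0` are assumptions made for the example, since the filter width is not recoverable from the text above.

```python
import math
import random
from collections import Counter


def gaussian_smooth(hist, sigma=1.0, radius=3):
    """Smooth a sparse 2D histogram {(n_ligand, n_pocket): count}.

    sigma and radius are illustrative assumptions, not values from the paper.
    """
    offsets = range(-radius, radius + 1)
    kernel = {
        (di, dj): math.exp(-(di * di + dj * dj) / (2.0 * sigma ** 2))
        for di in offsets
        for dj in offsets
    }
    norm = sum(kernel.values())
    smoothed = Counter()
    for (n_l, n_p), count in hist.items():
        for (di, dj), weight in kernel.items():
            # Node counts must stay positive.
            if n_l + di > 0 and n_p + dj > 0:
                smoothed[(n_l + di, n_p + dj)] += count * weight / norm
    return smoothed


def conditional_sizes(smoothed, n_pocket):
    """Return ligand sizes and p(n_ligand | n_pocket) for one pocket size."""
    column = {n_l: w for (n_l, n_p), w in smoothed.items() if n_p == n_pocket}
    total = sum(column.values())
    if total == 0:
        raise ValueError(f"no training mass for pocket size {n_pocket}")
    sizes = sorted(column)
    return sizes, [column[n] / total for n in sizes]


def sample_ligand_size(smoothed, n_pocket, rng=random):
    """Draw the number of ligand nodes given the number of pocket nodes."""
    sizes, probs = conditional_sizes(smoothed, n_pocket)
    return rng.choices(sizes, weights=probs, k=1)[0]


# Toy training statistics: (ligand nodes, pocket nodes) per complex.
training_pairs = [(20, 100), (22, 100), (30, 150)]
smoothed_hist = gaussian_smooth(Counter(training_pairs))
sampled_size = sample_ligand_size(smoothed_hist, n_pocket=100)
```

Conditioning is done by renormalising one column of the smoothed joint histogram, so the same table serves both the likelihood term and sampling.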
Preprocessing
All molecules are expressed as graphs. For the $C_\alpha$-only model, the node features for the protein are set as the one-hot encoding of the amino acid type. The full-atom model uses the same one-hot encoding of atom types for ligand and protein nodes. We refrain from adding a categorical feature for distinguishing between protein and ligand atoms in this case and continue using two separate MLPs for embedding the node features instead.
Noise schedule
We use the predefined polynomial noise schedule introduced in (15):
(15) 
Following (A. Q. Nichol and P. Dhariwal (2021); 15), values of are clipped between 0.001 and 1 for numerical stability near , and is recomputed as
(16) 
A tiny offset is used to avoid numerical problems at defining the final noise schedule:
(17) 
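Because the schedule equations themselves are not legible above, the following is only a sketch of the three described steps (polynomial schedule, clipping of per-step ratios to [0.001, 1], small offset). The functional form $1 - (t/T)^{p}$, the exponent `power`, the choice to clip squared ratios, and the offset value `1e-5` are all assumptions for illustration and may differ from the schedule actually used.

```python
def polynomial_noise_schedule(num_steps, power=2.0, clip_min=0.001, offset=1e-5):
    """Return squared signal coefficients alpha_t^2 for t = 0..num_steps.

    power and offset are hypothetical defaults; clip_min follows the text.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")

    # Step 1: unclipped polynomial schedule, stored as squared values.
    raw = [(1.0 - (t / num_steps) ** power) ** 2 for t in range(num_steps + 1)]

    # Step 2: clip the per-step ratios and rebuild the schedule as their
    # cumulative product, which avoids a collapse to zero near t = T.
    clipped = []
    running = 1.0
    previous = 1.0
    for value in raw:
        ratio = min(max(value / previous, clip_min), 1.0)
        running *= ratio
        clipped.append(running)
        previous = value

    # Step 3: a tiny offset keeps the schedule strictly inside (0, 1).
    return [(1.0 - 2.0 * offset) * a + offset for a in clipped]


schedule = polynomial_noise_schedule(num_steps=10)
```

Rebuilding the schedule from clipped ratios keeps it monotonically non-increasing while bounding how much signal can be removed in a single step.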
Feature scaling
We scale the node type features by a factor of 0.25 relative to the coordinates, which was empirically found to improve model performance in previous work (15).
Hyperparameters
Postprocessing
For postprocessing of generated molecules, we follow a procedure similar to that of Luo et al. (2021). Given a list of atom types and coordinates, bonds are first added using OpenBabel (O’Boyle et al., 2011). We then use RDKit to sanitise molecules, filter for the largest molecular fragment and finally remove steric clashes with 200 steps of force-field relaxation.
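To illustrate only the "largest molecular fragment" step, the sketch below finds the largest connected component of a bond graph with a union-find structure. It is a stand-in that avoids any cheminformatics dependency; the function name and input format are hypothetical, and the actual pipeline relies on OpenBabel and RDKit as described above.

```python
def largest_fragment(num_atoms, bonds):
    """Return the atom indices of the largest connected fragment.

    bonds is a list of (i, j) index pairs, e.g. as inferred by a
    bond-perception tool; this helper is illustrative only.
    """
    if num_atoms == 0:
        return []
    parent = list(range(num_atoms))

    def find(i):
        # Path-halving lookup of the component representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    fragments = {}
    for atom in range(num_atoms):
        fragments.setdefault(find(atom), []).append(atom)
    return max(fragments.values(), key=len)


# Six atoms: a three-atom chain, a two-atom fragment, and a lone atom.
kept_atoms = largest_fragment(6, [(0, 1), (1, 2), (3, 4)])
```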
Appendix B Binding MOAD Dataset
We curate a dataset of experimentally determined complexed protein-ligand structures from Binding MOAD (Hu et al., 2005). We keep pockets with valid (as defined in http://www.bindingmoad.org/) and moderately ‘drug-like’ ligands with QED score . We further discard small molecules that contain atom types as well as binding pockets with non-standard amino acids. We define binding pockets as the set of residues that have any atom within of any ligand atom. Ligand redundancy is reduced by randomly sampling at most 50 molecules with the same chemical component identifier (3-letter code). After removing corrupted entries that could not be processed, training pairs and 130 testing pairs remain. A validation set of size 246 is used to monitor estimated log-likelihoods during training. The split is made to ensure different sets do not contain proteins from the same Enzyme Commission Number (EC Number) main class.
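The pocket definition (residues with any atom within a distance cutoff of any ligand atom) can be sketched as below. The data layout and function name are assumptions, and the cutoff is left as a required argument because its value is not legible in the text above.

```python
import math


def pocket_residues(residue_atoms, ligand_coords, cutoff):
    """Select residues that have any atom within `cutoff` of any ligand atom.

    residue_atoms: dict mapping a residue id to a list of (x, y, z) tuples.
    ligand_coords: list of (x, y, z) tuples.
    cutoff: distance threshold in the same units as the coordinates.
    """
    selected = []
    for residue_id, atoms in residue_atoms.items():
        if any(
            math.dist(atom, ligand_atom) <= cutoff
            for atom in atoms
            for ligand_atom in ligand_coords
        ):
            selected.append(residue_id)
    return selected


# Toy example with made-up coordinates and an arbitrary cutoff.
toy_residues = {"A:10": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], "A:11": [(20.0, 0.0, 0.0)]}
toy_ligand = [(3.0, 0.0, 0.0)]
toy_pocket = pocket_residues(toy_residues, toy_ligand, cutoff=5.0)
```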
Appendix C Proofs
In the following proofs we do not consider categorical node features as only the positions are subject to equivariance constraints. Furthermore, we do not distinguish between the zeroth latent representation and data domain representations for ease of notation, and simply drop the subscripts.
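C.1 Equivariance of the prior probability
The isotropic Gaussian prior is equivariant to rotations and reflections represented by an orthogonal matrix $\mathbf{R}$ as long as its mean is computed equivariantly from the context, because:
Here we used $\lVert \mathbf{R}\mathbf{x} \rVert^2 = \lVert \mathbf{x} \rVert^2$ for orthogonal $\mathbf{R}$.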
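The chain of equalities behind this statement can be sketched as follows. The notation is an assumption made for this sketch and is not taken from the original: $\mathbf{z}$ denotes the ligand coordinates, $\mathbf{c}$ the pocket context, $\boldsymbol{\mu}(\mathbf{c})$ the prior mean with $\boldsymbol{\mu}(\mathbf{R}\mathbf{c}) = \mathbf{R}\boldsymbol{\mu}(\mathbf{c})$, $\sigma$ the prior standard deviation, and $d$ the number of coordinate dimensions.

```latex
\begin{align*}
p(\mathbf{R}\mathbf{z} \mid \mathbf{R}\mathbf{c})
  &= \frac{1}{(2\pi\sigma^2)^{d/2}}
     \exp\!\left(-\frac{1}{2\sigma^2}
     \lVert \mathbf{R}\mathbf{z} - \boldsymbol{\mu}(\mathbf{R}\mathbf{c}) \rVert^2\right) \\
  &= \frac{1}{(2\pi\sigma^2)^{d/2}}
     \exp\!\left(-\frac{1}{2\sigma^2}
     \lVert \mathbf{R}\,(\mathbf{z} - \boldsymbol{\mu}(\mathbf{c})) \rVert^2\right) \\
  &= \frac{1}{(2\pi\sigma^2)^{d/2}}
     \exp\!\left(-\frac{1}{2\sigma^2}
     \lVert \mathbf{z} - \boldsymbol{\mu}(\mathbf{c}) \rVert^2\right)
   = p(\mathbf{z} \mid \mathbf{c})
\end{align*}
```

The second line uses the equivariance of the mean and the third uses the norm identity for orthogonal matrices.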
C.2 Equivariance of the transition probabilities
The denoising transition probabilities from time step to
are defined as isotropic normal distributions:
(18) 
Therefore, is equivariant by a similar argument to Section C.1 if is computed equivariantly from the threedimensional context.
Recalling the definition of , we can prove its equivariance as follows:
where defined as is equivariant because:
C.3 Equivariance of the learned likelihood
Let $\mathbf{R}$ be an orthogonal matrix representing an element of the general orthogonal group $O(3)$. We obtain the marginal probability density of the Markovian denoising process as follows
and the sample’s likelihood is equivariant:
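A sketch of this argument, in notation assumed for illustration only ($\mathbf{z}_t$ for the latent ligand state at step $t$, $\mathbf{c}$ for the pocket context, $T$ for the number of steps), combines the equivariant prior and transitions from Sections C.1 and C.2 with a change of variables $\mathbf{z}_t = \mathbf{R}\mathbf{z}'_t$ for $t = 1, \dots, T$, whose Jacobian has absolute determinant $\lvert \det \mathbf{R} \rvert = 1$. We write $\mathbf{z}'_0 := \mathbf{z}_0$.

```latex
\begin{align*}
p(\mathbf{z}_0 \mid \mathbf{c})
  &= \int p(\mathbf{z}_T \mid \mathbf{c})
     \prod_{t=1}^{T} p(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{c})
     \,\mathrm{d}\mathbf{z}_{1:T} \\
p(\mathbf{R}\mathbf{z}_0 \mid \mathbf{R}\mathbf{c})
  &= \int p(\mathbf{R}\mathbf{z}'_T \mid \mathbf{R}\mathbf{c})
     \prod_{t=1}^{T} p(\mathbf{R}\mathbf{z}'_{t-1} \mid \mathbf{R}\mathbf{z}'_t, \mathbf{R}\mathbf{c})
     \,\lvert \det \mathbf{R} \rvert^{T}\,\mathrm{d}\mathbf{z}'_{1:T} \\
  &= \int p(\mathbf{z}'_T \mid \mathbf{c})
     \prod_{t=1}^{T} p(\mathbf{z}'_{t-1} \mid \mathbf{z}'_t, \mathbf{c})
     \,\mathrm{d}\mathbf{z}'_{1:T}
   = p(\mathbf{z}_0 \mid \mathbf{c})
\end{align*}
```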
Appendix D Extended results
D.1 Additional Experimental Details
The number of available molecules differs slightly between methods due to computational issues or missing molecules in the available baseline sets. More precisely, on average , , and molecules have been evaluated per pocket for DiffSBDD-cond, DiffSBDD-inpaint ($C_\alpha$), and DiffSBDD-cond ($C_\alpha$), respectively. For Pocket2Mol, molecules are available per pocket. The set of 3D-SBDD molecules does not contain generated ligands for two test pockets. For the remaining 98 pockets, molecules are available on average.
D.2 Additional Molecular Metrics
In addition to the molecular properties discussed in Section 4, we assess the models’ ability to produce novel and valid molecules using four simple metrics: validity, connectivity, uniqueness, and novelty. Validity measures the proportion of generated molecules that pass basic tests by RDKit, mostly ensuring correct valencies. Connectivity is the proportion of valid molecules that do not contain any disconnected fragments. We convert every valid and connected molecule from a graph into a canonical SMILES string representation, count the number of unique occurrences in the set of generated molecules, and compare those to the training set SMILES to compute uniqueness and novelty, respectively.
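The four metrics can be sketched as below, assuming validity checking and canonical SMILES conversion have already been done by a cheminformatics toolkit. The input convention (`None` for an invalid molecule), the function name, and the exact denominators for uniqueness and novelty are assumptions of this sketch; the text above does not pin them down.

```python
def molecule_metrics(generated, training_smiles):
    """Compute validity, connectivity, uniqueness, and novelty.

    generated: canonical SMILES strings, with None marking invalid molecules.
    training_smiles: iterable of canonical SMILES from the training set.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    valid = [smiles for smiles in generated if smiles is not None]
    # In SMILES, "." separates disconnected fragments.
    connected = [smiles for smiles in valid if "." not in smiles]
    unique = set(connected)
    novel = unique - set(training_smiles)

    return {
        "validity": ratio(len(valid), len(generated)),
        "connectivity": ratio(len(connected), len(valid)),
        "uniqueness": ratio(len(unique), len(connected)),
        "novelty": ratio(len(novel), len(unique)),
    }


toy_metrics = molecule_metrics(
    ["CCO", "CCO", "c1ccccc1", "CC.O", None],
    training_smiles={"CCO"},
)
```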
Table 4 shows that only a small fraction of all generated molecules is invalid and must be discarded for downstream processing. The DiffSBDD models trained on CrossDocked with $C_\alpha$ pocket representation generate fragmented molecules about 50% of the time. Since we can simply select and process the largest fragments in these cases, low connectivity does not necessarily affect the efficiency of the generative process. Moreover, all models produce diverse sets of molecules unseen in the training set.
| Model | Validity | Connectivity | Uniqueness | Novelty |
|---|---|---|---|---|
| CrossDocked Training data | 100% | 100% | – | – |
| DiffSBDD-cond ($C_\alpha$) | 97.75% | 48.02% | 96.95% | 100% |
| DiffSBDD-inpaint ($C_\alpha$) | 91.62% | 51.38% | 98.64% | 100% |
| DiffSBDD-cond | 93.23% | 83.46% | 97.46% | 100% |
| Binding MOAD Training data | 96.38% | 100% | – | – |
| DiffSBDD-cond ($C_\alpha$) | 94.02% | 66.46% | 99.55% | 99.81% |
| DiffSBDD-inpaint ($C_\alpha$) | 94.98% | 70.21% | 99.75% | 99.80% |
D.3 Octanol-water partition coefficient
The octanol-water partition coefficient ($\log P$) is a measure of lipophilicity and is commonly reported for potential drug candidates (Wildman and Crippen, 1999). We summarize this property for our generated molecules in Table 5.
D.4 Agreement of generated and docked conformations
Here we discuss an alternative way of using QuickVina for assessing the quality of the conditional generation procedure besides its in silico docking score. We compare the generated raw conformations (before force-field relaxation) to the best-scoring QuickVina docking pose and plot the distribution of resulting RMSD values in Figures 5 and 6. As a baseline, the procedure is repeated for RDKit conformers of the same molecules with identical center of mass. For a large percentage of molecules generated by the all-atom CrossDocked model, QuickVina agrees with the predicted bound conformations, leaving them almost unchanged (RMSD below ). This demonstrates successful conditioning on the given protein pockets.
For the $C_\alpha$-only models, results are less convincing. They produce poses that only slightly improve upon conformers lacking pocket context. This is likely caused by atomic clashes with the proteins’ side chains that QuickVina needs to resolve. Notably, however, there is a clear enrichment of molecules with less than RMSD for both conditional models (Binding MOAD and CrossDocked), showing the advantage over unconditional conformer generation.
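The comparison between a generated pose and a docked pose reduces to an RMSD over matched atoms. The sketch below assumes both poses list the same atoms in the same order and are expressed in the pocket's coordinate frame, so no superposition is applied; it also ignores molecular symmetry, which a production evaluation would likely need to handle. The function name is hypothetical.

```python
import math


def pose_rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two poses of the same molecule.

    Both inputs are lists of (x, y, z) tuples with identical atom ordering.
    No alignment is performed because poses share the pocket frame.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("poses must be non-empty and have the same length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Toy poses: one atom unchanged, one displaced by 2 units along z.
generated_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
docked_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
toy_rmsd = pose_rmsd(generated_pose, docked_pose)
```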
D.5 Random generated molecules
D.6 Distribution of docking scores by target
We present an extensive evaluation of the docking scores for our generated molecules in Figure 9. We evaluate all models trained with a given dataset first against all targets (Figure 9A+C) and then against 10 randomly chosen targets (Figure 9B+D). We note that the all-atom model trained using CrossDocked data outperforms all other methods. Unsurprisingly, model performance is highly target-dependent, likely varying with properties such as pocket geometry, size, charge, and hydrophobicity, which affect the propensity to generate high-affinity molecules.
Appendix E More Related Work
E.1 Geometric Deep Learning for Drug Discovery
Geometric deep learning refers to incorporating geometric priors into neural architecture design so that models respect symmetry and invariance, which reduces sample complexity and eliminates the need for data augmentation (Bronstein et al., 2021). It has become prevalent in a variety of drug discovery tasks, from virtual screening to de novo drug design, as symmetry is widespread in the representation of drugs. One line of work introduces graph and geometry priors and designs message passing neural networks and equivariant neural networks that are permutation-equivariant and translation-, rotation-, and reflection-equivariant, respectively (Duvenaud et al., 2015; Gilmer et al., 2017; Satorras et al., 2021; Lapchevskyi et al., 2020; Du et al., 2022a). These architectures have been widely used to represent biomolecules, from small molecules to proteins (Atz et al., 2021), and to solve downstream tasks such as molecular property prediction (Schütt et al., 2018; Klicpera et al., 2020), binding pose prediction (Stärk et al., 2022) or molecular dynamics (Batzner et al., 2022; Holdijk et al., 2022). Another line of work focuses on generative design of new molecules (Du et al., 2022b, c). Specifically, these methods formulate molecule design as a graph or geometry generation problem and follow one of two strategies: one-shot generation, which generates graphs (atom and bond features) in one step, and sequential generation, which generates them in a sequence of steps.