Decoupled coordinates for machine learning-based molecular fragment linking

by   Markus Fleck, et al.

Recent developments in machine-learning based molecular fragment linking have demonstrated the importance of informing the generation process with structural information specifying the relative orientation of the fragments to be linked. However, such structural information has not yet been provided in the form of a complete relative coordinate system. Mathematical details for a decoupled set of bond lengths, bond angles and torsion angles are elaborated and the coordinate system is demonstrated to be complete. Significant impact on the quality of the generated linkers is demonstrated numerically. The amount of reliable information within the different types of degrees of freedom is investigated. Ablation studies and an information-theoretical analysis are performed. The presented benefits suggest the application of a complete and decoupled relative coordinate system as a standard good practice in linker design.



There are no comments yet.


page 1

page 2

page 3

page 4


Rotation Invariant Graph Neural Networks using Spin Convolutions

Progress towards the energy breakthroughs needed to combat climate chang...

Deep Spatial Learning with Molecular Vibration

Machine learning over-fitting caused by data scarcity greatly limits the...

Geometric Perspectives on Fundamental Solutions in the Linearized Satellite Relative Motion Problem

Understanding natural relative motion trajectories is critical to enable...

Learning Gradient Fields for Molecular Conformation Generation

We study a fundamental problem in computational chemistry known as molec...

Structural Design Recommendations in the Early Design Phase using Machine Learning

Structural engineering knowledge can be of significant importance to the...

Analysis of Atomistic Representations Using Weighted Skip-Connections

In this work, we extend the SchNet architecture by using weighted skip c...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Computational drug design remains a challenging problem, foremost due to the vast size of the drug-like chemical space [1, 2, 3, 4]. A subset of the discipline of molecular generation is constituted by the task of fragment linking. Given a pair of structures, the goal of the design process is their connection via an appropriate linker. Chemical properties like toxicity and solubility are of general concern in order for a compound to turn out administrable as a drug. Furthermore, in order to maintain high activity and specificity, it is of major importance that the designed linker maintains the binding mode of the underlying fragments [5, 6]. Therefore, an effective linker design process should incorporate structural information in the form of the relative orientation between the molecules to be joined. Recently, DeLinker [7], a machine learning framework operating on molecular graphs, was proposed to incorporate such structural information. While the results of the method arguably constituted a breakthrough in machine learning-based fragment linking, unfortunately, the method employs a problematic set of coordinates.

The benefits of a maximally decoupled coordinate system are well known in computational chemistry and biophysics. For instance, they have turned out especially crucial in the calculation of configurational entropy [8, 9, 10, 11, 12, 13, 14, 15] from molecular mechanics simulations. Here, in Cartesian coordinates, the adequate sampling of the soft constraints present in rigid chemical groups (e. g. a methyl group) quickly becomes computationally prohibitive with increasing molecular size. Recently, similar insights have led to a mathematically rigorous separation of the placeholder atom from the physical partition function in alchemical free energy simulations [16]. Coordinate systems based on bonds, bond angles and torsions feature decoupling naturally by accommodating the molecular topology made up from rigid bonds. Here, we apply such a bond-angle-torsion (BAT) coordinate system in order to specify the relative orientation of the molecular fragments to be linked. As opposed to the coordinates applied in the investigated previous work [7], the coordinate system proposed in this article is mathematically well-defined, complete and geometrically decoupled. Here, geometrically decoupled denotes the absence of mathematical coordinate constraints [16], which constitutes a necessary criterium for the dynamical decoupling mentioned above in the context of configurational entropy.

2 Methods

In this article, we propose a beneficial choice of a relative coordinate system for linking pairs of molecular fragments by specifying their relative orientation in 3D space. In order to benchmark our proposed coordinate system, the details of which will be given below, we follow DeLinker [7], a recent pioneering machine learning framework for linking molecular fragments.

2.1 The framework

Here, a brief summary of the DeLinker [7] framework is given. For further details, the interested reader is referred to said publication [7] as well as references therein (in particular [17, 18, 19]).

On the top level, DeLinker [7]

is embedded in a Variational Autoencoder (VAE)


framework; Linker generation is performed by seeding the latent variables of the decoder of the VAE. First, the graph representation of the fragment pair is encoded by a standard gated graph neural network


. Then, a set of atoms is initialized in order to serve as expansion nodes for linking a pair of fragments. The maximum linker length is given by the chosen size of this set of expansion nodes. For each atom, a multidimensional hidden state is drawn from standard normal distributions. The atom type of the respective node is derived via sampling from a learned mapping from the hidden state to a Boltzmann distribution (i.e. softmax or, termed more precisely, softargmax).

Successively, the fragments are linked by attaching atoms from the set of expansion nodes in an iterative manner. Following the breadth-first paradigm, a first-in-first-out queue is initialized with two exit atoms (Figure 1

), one per fragment. The focused node, i. e. the first node in the queue, forms covalent bonds to candidate atoms, which are themselves added to the queue upon bond formation. The selection process of the focused node is terminated upon forming a bond to a special stop node. Then, the focused node is removed from the queue and the next atom in turn initiates bond formation. Nodes become closed once they are focused, which means they are no longer considered as candidate nodes for bond formation throughout the entire generation process. The bond selection procedure is accomplished by computing feature vectors between the focused atom and all candidate atoms. These feature vectors comprise both atom types, their hidden states as well as their graph distance. Furthermore, the bond formation process takes as an input the average of the hidden states across all nodes during node initialization (i. e. iteration zero) as well as at the current iteration. The current iteration number is supplied in addition. Importantly, the feature vector is augmented with coordinates specifying the relative orientation between the fragments, the details of which will be given below. The actual chosen bond and its type (single, double or triple) are sampled from Boltzmann distributions, which are based on learned mappings from the feature vectors and are masked with valency constraints


After each bond formation, the hidden states of all nodes are updated with respect to the newly formed graph topology. This update is performed via a standard gated graph neural network [19], taking as input the initial states of all nodes. Starting from the states of the nodes at iteration zero, instead of the current state, prevents the network from learning assembly pathways [17].

The generative process ends when the first-in-first-out queue is empty. Note that the described procedure does not prevent the generation of unlinked fragments, which, however, constitutes the only mechanism leading to invalid molecules as an outcome.

Fragment linking constitutes a multimodal problem. Two fragments can be linked in many different ways. Inspired by Jin et al. [18], a low-dimensional latent vector

is introduced in order for the model to accommodate this one-to-many mapping. To avoid difficulties known from Computer Vision

[21], during training, is derived from the encoding of the ground-truth linked fragments. In this manner, the model is encouraged to pay attention to the latent vector. Furthermore, is regularized via the KL-divergence to the standard normal distribution.

2.2 Relative coordinates

In this section, we derive our proposed relative coordinate system for linking molecular fragments. The following derivation follows the bond-angle-torsion [8, 22, 23, 24, 25, 26, 27] coordinate formalism, which closely relates to the z-matrix representation [28, 27].

Referring to Figure 1, let us introduce the following nomenclature. The atoms of the fragments to which the linker is covalently bound are referred to as exit atoms; E and E for the two fragments, respectively. The linker atoms L and L are attached to E and E via a rotable bond.

Figure 1: Bond-angle-torsion (BAT) coordinates. For specifying the relative orientation of two molecular fragments, we first define the exit atom E of the first fragment as the atom which is connected to the initial linker atom L. E and L are defined analogously for the second fragment. Two of the BAT coordinates are constituted by the exit vectors and , bridging the exit atoms to the initial linker atoms. A third BAT coordinate is defined by the length of the pseudo-bond . The angles between and as well as between and , termed and respectively, yield another two BAT coordinates. In order to obtain six BAT coordinates, a well-known number for specifying relative molecular orientations, we add the dihedral angle of the atom chain E-L-L-E.

Naturally for the problem of linking molecular fragments, we are not concerned with the global position and orientation of the fragment pair as a whole. Rather, we are interested in the position and orientation of the fragments relative to each other. Thus, without loss of generality, exit atom 1 (E

) is positioned at the origin of the Cartesian coordinate system, linker atom 1 (L

) on the x-axis and linker atom 2 (L) in the x-y plane. Then, these anchored Cartesian coordinates [22, 23] are given as follows:

In order to achieve ease of notation, we constitute the following definitions.

Then, our implemented BAT coordinates are given as:

Matrix multiplications and outer vector products are denoted by and , respectively. The back-transformation from BAT (Equation LABEL:eq:bat2) to anchored Cartesian coordinates [22, 23] (Equation LABEL:eq:int_cart), demonstrating the completeness of our proposed BAT coordinate system, is given in the Appendix. In this context, note that we do not explicitly feed the model the lengths of the exit vectors, i. e. and . As physical bond lengths, their values hardly vary in comparison to the pseudo-bond length , which dominates the geometry. Thus, they barely carry any useful information for the model.

One topic is left to be discussed. In contrast to bond angles, i. e. and in the present case, dihedral angles are periodic. This means , but . Periodicity poses a challenge for machine learning algorithms, since the implied distance metric does not correctly reflect the underlying physical geometry.

Figure 2: Treating periodicity. a) shows an illustration of the problematic distance metric. Both the top and bottom angles measure 10 degrees geometrically. For the bottom angle, this is correctly reflected in its numerical value, i. e. . For the top angle, however, the numerical value deviates from the underlying geometry due to the periodic discontinuity at . Here, we have . This discontinuity of the mapping from geometric angles to their numerical values poses a challenge for machine learning algorithms. b) The periodic discontinuity can be avoided by considering the dihedral an angle in the complex plane. Supplying both the real and imaginary part, i. e. and , respectively, dihedral angles are presented to the model in continuous form and without loss of information.

As done commonly (see e. g. [29, 30, 31]), we feed the model the sine and cosine of , instead of the plain dihedral angle value. One way to motivate this treatment states as follows. There is a periodic discontinuity [see Figure 2(a)] at in the mapping of the underlying geometry to its numerical dihedral angle value. A solution is inspired by the complex unit circle [Figure 2(b)]: One maps the dihedral to . The machine learning model is fed the real and imaginary part of this transformation, i. e. both and . These angular functions, as a representation of the periodic dihedral angle , are both continuous as well as faithful with respect to the underlying geometry. Additionally, they naturally map to the interval , which constitutes another desirable property. Note that the transformation was chosen over just in order to match the complex unit circle with the canonical definition of dihedral angles [Figure 2(b)]. However, as the order of the input features is irrelevant for the machine-learning algorithm, this choice of swapping sine with cosine is arbitrary.

2.3 Data preparation

Following DeLinker [7], we are working with two datasets, a selected [32] set of 250 000 ZINC [33] compounds as well as CASF-2016 (i. e., the PDBbind core set) [34]. A major difference between these two sets is the origin of the 3D structural information. While for CASF, the structures stem from 285 high-quality crystal structures of ligands bound to proteins, the ZINC structures are generated in silico using RDKit [35], as further detailed below.

For both data sets, preparation is performed in the same manner and sketched as follows. The ligands are split into three parts, i. e. two fragments and the corresponding linker. Cuts were performed on acyclic single bonds outside of functional groups [36]. Triplet splits for which either of the three components is unrealistically small, or the linker is inflated with respect to the fragments, were removed. The remaining triplet splits were filtered further by molecular graph (2D) properties, in particular synthetic accessibility (SA) [37], ring aromaticity as well as pan-assay interference compounds (PAINS) [38]. This procedure yielded 420 000 triplet splits for the ZINC data set and 309 triplet splits for CASF.

Training is performed purely on the ZINC data set. 400 compounds each were randomly selected from the whole ZINC set and held back as a validation and test set. The CASF data is used solely as a test set for evaluation.

As outlined above, unlike for CASF, where experimental 3D structures are available, the ZINC 3D coordinates need to be generated. This is accomplished using RDKit [35] via employing the Merck molecular force field (MMFF) [39, 40]. The DeLinker [7] source code ships without the 3D structures of the training set (obviously due to storage space reasons). For calculating our proposed relative coordinates of the training set, we reproduced the 3D structures. Referring to Figure 1, DeLinker [7] uses the distance from E to E as well as the angle between the exit vectors (i. e. and ) as relative coordinates. In order to ensure exact reproduction, we recalculated these relative coordinates alongside the proposed BAT coordinates and compared them to the data shipped with DeLinker [7]. Our comparison demonstrated an exact match. For further details, the interested reader is referred to the DeLinker publication [7].

2.4 Training

The model is trained under a VAE framework over 10 epochs, exclusively on the ZINC training set, using the Adam optimizer with a learning rate of

and a minibatch size of 16. The encoding of the nodes hidden states as well as the latent vector encoding the ground-truth molecule are regularized via KL loss to follow standard normal distributions. A two-fold cross-entropy reconstruction loss is applied, one part measuring the error in the prediction of the atom types, the second part judging the sequence of bond-formation steps in order to reconstruct the ground truth molecule. Further details can be found in the DeLinker publication [7] as well as in Liu et al. [17].

2.5 Evaluation

For each of the fragment pairs in the triplet cuts of the test sets, i. e. 400 in the case of ZINC and 309 for CASF, 250 linkers are generated. This amounts to 100 000 and 77 250 linked molecules generated, respectively. We apply the standard metrics for molecular generation tasks, i. e. fractional validity, uniqueness and novelty. Validity is assessed by RDKit [35] being able to parse the generated SMILES [41] strings as connected molecules. Uniqueness is calculated by the cardinality of the (unique) set of generated molecules divided by the total number generated. Novelty describes the fraction of generated molecules which are not contained in the training set.

Furthermore, we assess the generated molecules via the 2D filtering metrics of subsection 2.3. SA [37] is a measure for the difficulty of physically synthesizing the molecule in the laboratory. Ring aromaticity amounts to the properties of a molecule constituting a drug. PAINS [38] assesses the reactivity of a compound by performing a knowledge-based analysis on its substructures.

Arguably the most significant estimator of the impact of our coordinate system is the models capability to recover the ground-truth linker from the original triplet cut. This statement is motivated by the fact that the structural information provided to the generation process is derived from the respective ground-truth linker. In this manner, the model is conditioned to reproduce the ground-truth linker more frequently.

2.6 Information-theoretical analysis

Input features, referring to the coordinates in the present case, should be uncorrelated, i. e. decoupled, in order to assure efficient learning for the model. Therefore, the mutual information between the input features should be low [42]. In order to investigate said decoupling, we calculate pairwise mutual information values, given as [8, 9, 10, 13, 14, 15]



The formulas given in Equation LABEL:entropy

refer to discretized differential entropy values. This means the underlying continuous probability densities are approximated by sampling to discrete histogram bins. The indices x and y designate an arbitrary coordinate, i. e. a bond, an angle or a dihedral in the present case.

denotes the probability for the coordinate to assume a value in bin . If samples for coordinate are taken, and of them fall into bin , then (and analogous for ). Furthermore, if of the samples taken fall into bin for the -coordinate and bin for the -coordinate in the according 2D histogram. and are the widths of the equally spaced bins. is the Jacobian of coordinate in the middle of bin . Using the definition of the marginals

we can write Equation 4 as

The last equality exhibits interesting features to be discussed. First, the Jacobian determinants have canceled out, meaning the type of degree of freedom has become irrelevant. Furthermore, the bin sizes have vanished. This means that the mutual information calculated from discretized differential entropy values resembles discrete information in Shannon’s [43] sense. Note, however, that the mutual information still is dependent on bin sizes; the probabilities , and are affected by this choice. For example, let us denote the limit where the binning becomes binary, meaning the bin sizes are chosen so small that there is either no or only one data point in each bin. Assuming that both the x- as well as the y-marginal take on such a binary form, we can write for data points:

Furthermore, note that we write the equations without the Boltzmann or Gas constant. Spatial entropies, as in the equations LABEL:entropy, do not bear physically meaningful units. The reason is the fact that the semi-classical entropy integral cannot be split in a manner that either the spatial or the momentum entropy bear physically meaningful units [10]. Upon taking entropy differences, however, the problematic terms cancel out [10]. This means that the mutual information in equations 4, 2.6 and 2.6 can indeed be multiplied with the Boltzmann or Gas constant to yield physical entropy values. Stated without such a physical constant, the units obtained here are natural units of information (nats, similar to bits, however, using the natural logarithm).

3 Results and discussion

As mentioned in section 2.5, arguably the most indicative metric for the quality of the provided coordinates is the rate of recovery of the ground-truth linker, as it quantifies the models capability to follow the supplied information. This metric improves across all test sets, as shown in Table 1, with the most drastic enhancement for ZINC. The coordinate system carries information about the ground-truth linker. With higher quality information supplied, the model is expected to recover the original linker more frequently. This demonstrates that the model takes advantage of the augmented information by following the provided target geometry. Since the generation process is more directed, the uniqueness metrics drops, which serves as another indicator for the quality of the information content in our proposed coordinate system.

no info distance only DeLinker abBAT
ZINC 74.5 78.3 79.0 88.3
CASF N/A N/A 53.7 56.3
ZINC 5+ atoms N/A N/A 67.0 80.8


CASF 5+ atoms N/A N/A 29.8 34.0


ZINC 36.2 37.6 39.5 39.6
CASF N/A N/A 51.0 53.3
ZINC 5+ atoms N/A N/A 49.4 47.9
CASF 5+ atoms N/A N/A 68.7 69.1


ZINC 97.0 98.6 98.4 98.2
CASF N/A N/A 95.5 94.5
ZINC 5+ atoms N/A N/A 98.1 98.3
CASF 5+ atoms N/A N/A 94.7 93.1


ZINC 51.2 47.3 44.2 37.6
CASF N/A N/A 51.9 47.1
ZINC 5+ atoms N/A N/A 61.0 52.7
CASF 5+ atoms N/A N/A 72.9 66.3

pass all 2D filters

ZINC 89.9 90.2 89.8 90.5
CASF N/A N/A 81.4 80.4
ZINC 5+ atoms N/A N/A 84.1 85.5
CASF 5+ atoms N/A N/A 71.7 70.3

pass ring filter

ZINC 95.2 94.5 94.8 95.5
CASF N/A N/A N/A 92.5
ZINC 5+ atoms N/A N/A N/A 92.2
CASF 5+ atoms N/A N/A N/A 87.2

pass SA filter

ZINC 95.1 95.5 95.3 95.5
CASF N/A N/A N/A 85.1
ZINC 5+ atoms N/A N/A N/A 93.4
CASF 5+ atoms N/A N/A N/A 77.6

pass PAINS filter

ZINC 97.8 98.4 97.9 98.2
CASF N/A N/A N/A 98.0
ZINC 5+ atoms N/A N/A N/A 97.7
CASF 5+ atoms N/A N/A N/A 97.7
Table 1: Graph metrics. 2D quality criteria are compared for molecules generated using BAT coordinates versus the values from the DeLinker [7] publication. Values which are not given in the DeLinker [7] publication are marked via an ”N/A” entry. For the ZINC dataset, the DeLinker [7] ablation study values are included. The two 5+ test sets at the bottom constitute subsets of target linkers with at least 5 atoms in length. Best in category (row) values have been printed in bold font.

Considering the ablation study of DeLinker [7] on the ZINC test set, the bulk of the geometric information obviously is contained in the distance coordinate. Compared to providing no relative coordinates at all, the distance takes the recovery rate from 74.5 to 78.3 percent. Adding the angle coordinate, i. e., considering the complete DeLinker [7] coordinate set, yields an additional gain of comparatively low 0.7 percent. Figure 3 provides insight on this outcome.

Figure 3: Robustness of DeLinker coordinates.

The plots compare two random seeds used for generating conformers for the test set. The distance as well as the angle of the DeLinker coordinate set are plotted. The distances demonstrate a somewhat adequate robustness between random seeds, with Pearson correlation of 0.86 between seeds. For the angles, we observe a considerable fraction of outliers, decreasing the correlation to 0.47 only. The robustness to altering random seeds is proposed as a proxy for the quality of the provided structural information.

When using a different random seed for RDKit to generate conformers, the DeLinker [7] distances are preserved rather robustly with a Pearson correlation of 0.86. The angles, on the other hand, show significant deviations. Their Pearson correlation to the angles of the conformers generated with the previous random seed is given as a relatively low value of 0.47. This indicates that the reliability of the angular information is low compared to the distance information, which attributes to only minor gains when feeding them to the model.

In order to further investigate the information content in the relative coordinates proposed here, analogue ablation studies were performed for the BAT coordinates. Table 2 lists the results.

no info distance only distance and angles abBAT


recovered 74.5 84.5 87.0 88.3
novel 36.2 40.6 38.6 39.6
valid 97.0 97.5 98.2 98.2
unique 51.2 44.8 37.8 37.6
pass all 2D filters 89.9 89.8 90.5 90.5
pass ring filter 95.2 94.5 95.6 95.5
pass SA filter 95.1 95.1 94.9 95.5
pass PAINS filter 97.8 98.0 98.2 98.2
Table 2: BAT ablation study. The same metrics as in Table 1 are investigated and the impact of the different types of coordinates is dissected. As was the case for the DeLinker [7] coordinates, the 2D filter and the valid criteria are generally high. Referring to the recovered metric, similar to Table 1, the bulk of our information is carried in the distance coordinate, albeit the improvement appears significantly more distinct. Furthermore, the unique metric continues to behave opposite to the recovery metric, supporting the hypothesis of the coordinate information providing valuable guidance for the model.

Similar to the DeLinker [7] coordinate system, the bulk of the BAT information is carried in the distance coordinate. However, the BAT distance ( in Figure 1) takes the recovered metric from 74.5 to 84.5 (Table 2), as opposed to only 78.3 (Table 1) for the DeLinker [7] distance . Given the similarity of the two distances, this discrepancy appears rather drastic at a first glance. The reason is argued to lie in the fact that the DeLinker [7] distance fails to decouple from the angular and dihedral coordinate systems. For the 6 BAT coordinates in Figure 1, one can vary each of them while keeping the others constant. However, any such variation will change the DeLinker [7] distance . Furthermore, varying the BAT angles or the BAT dihedral will change the DeLinker [7] angle. When comparing the distances in Figure 3 to those in Figure 4, the effect of this coupling becomes evident.

Figure 4: BAT coordinate quality. Analogous plot to Figure 3 for the proposed BAT coordinates. The robustness to varying random seeds for the distances appears excellent with a Pearson . Compared to Figure 3, both our angles demonstrate enhanced reliability. Referring to the dihedrals, the estimated information content is considerably low.

The DeLinker [7] distance shows a Pearson correlation of 0.86 between coordinates of conformers generated with different random seeds. The decoupled BAT distance, on the other hand, demonstrates excellent robustness to the choice of random seed with a Pearson correlation of 0.99. This explains the arguably drastic improvement of the recovered metric by 10 percent using the BAT distance only.

Feeding the model the BAT angles additionally, we gain another 2.5 percent, as opposed to 0.7 for the DeLinker [7] angle. Comparing the correlation plots, BAT provides two angles with a Pearson correlation of 0.67 and 0.72 versus one angle with 0.47 for DeLinker [7]; the BAT angles are decoupled from the distance and dihedral system.

Inspecting Figure 4, the dihedral angle information is of low quality. Interestingly, even given this low quality information, the model improves the recovered metric by 1.3 percent. Note that this value exceeds the angular improvement of the DeLinker [7] coordinates (0.7) almost by a factor of two. The reason, as above, arguably lies in the decoupling from the distance and angular system; while the information is certainly non-reliable on its own, it is indeed orthogonal, i. e. non-redundant, with respect to the other coordinates. Altogether, the BAT coordinate system takes the recovery metric to 88.3 percent, which is an improvement of 9.3 percent over the DeLinker [7] coordinate system.

Last but not least, we have performed an information-theoretical analysis comparing the two input coordinate sets. Given the fact that the DeLinker [7] coordinates are given as a distance and angle pair, we compare the mutual information using the training set between the Delinker [7] coordinates with the mutual information between the BAT distance and either BAT angle. Figure 5 shows that, with few exceptions at large bin sizes where values have not converged, the BAT coordinates exhibit lower mutual information, which demonstrates an increased amount of decoupling. Given these insights on an information-theoretic level, we suggest the presented graphs as strong evidence for the decoupling of the coordinate system as a major factor for the improvements.

Figure 5: Mutual information comparison. The graphs compare the mutual information between the DeLinker [7] distance and angle as well as between the BAT distance and bond angles. The training set of ground-truth molecules is analyzed. Since, as discussed in the Methods section, the mutual information is dependent on the bin size, the plots show the mutual information as a function of the granularity in terms of the number of bins used. Different regimes of the number of bins are shown. Note that the mutual information curves for both our angles lie on top of each other. With the exception of rare events in the regime of low number of bins (top graph), our coordinates exhibit lower mutual information, i. e. enhanced decoupling.

4 Conclusion

Inspired by recent pioneering work, we have demonstrated a considerable beneficial effect for the application of a well-behaved coordinate system on machine learning-based molecular linker generation. The enhancements of our proposed relative coordinate system, which roots in the bond-angle-torsion (BAT) coordinate formalism, were established by comparing to said pioneering work. In this context, we roughly halved (a reduction of 44.3 percent) the number of test set examples for which the ground-truth linker could not be recovered. A comparative analysis on various indicative aspects was performed. First, commonly used metrics for generative models as well as molecular graph evaluation were presented. Furthermore, ablation studies on the included coordinate types were compared in order to identify the coordinates which provide the most valuable information to the model. The reasons for the performance of the different coordinates were elaborated by performing a robustness analysis with respect to the conformer generation process given different random seeds. By means of information-theoretical calculations, the amount of decoupling in the two input coordinate sets was compared. The results support the hypothesis that the decoupled nature of our presented coordinate system plays a major role for the improved performance of the generative process.

The application of structural information for future state-of-the-art models in molecular fragment linking arguably suggests itself. Given the considerable improvements demonstrated, we propose the presented coordinate system as a standard technique for linking molecular fragments.

5 Acknowledgements

We thank Anton A. Polyansky for a critical reading of the manuscript.

6 Appendix

6.1 Transformation from BAT to anchored Cartesian coordinates

In order to demonstrate the completeness of the proposed BAT coordinate system (Equation LABEL:eq:bat2), the back-transformation to anchored Cartesian coordinates [22, 23] (Equation LABEL:eq:int_cart) is given as follows.


Here, represents for the case . In order to obtain from , we rotate around an axis parallel to by an angle . Following previous work [27], this coordinate transformation is applied by using Rodrigues’ rotation formula. Note that Equation LABEL:eq:backtrafo is never explicitly calculated. It is given here for the sole purpose of demonstrating the equivalence of the BAT and internal Cartesian coordinate systems.


  • [1] Pavel G. Polishchuk, Timur I. Madzhidov, and Alexandre Varnek. Estimation of the size of drug-like chemical space based on GDB-17 data. J. Comput. Aided Mol. Des., 27(8):675–679, 2013.
  • [2] Jean-Louis Reymond, Ruud van Deursen, Lorenz C. Blum, and Lars Ruddigkeit. Chemical space as a source for new drugs. MedChemComm, 1(1):30, 2010.
  • [3] Xuanyi Li, Yinqiu Xu, Hequan Yao, and Kejiang Lin.

    Chemical space exploration based on recurrent neural networks: applications in discovering kinase inhibitors.

    Journal of Cheminformatics, 12(1), 2020.
  • [4] Asher Mullard. The drug-maker's guide to the galaxy. Nature, 549(7673):445–447, 2017.
  • [5] Osamu Ichihara, John Barker, Richard J. Law, and Mark Whittaker. Compound design by fragment-linking. Molecular Informatics, 30(4):298–306, 2011.
  • [6] Rachelle J. Bienstock. Computational Methods for Fragment-Based Ligand Design: Growing and Linking, pages 119–135. Springer New York, New York, NY, 2015.
  • [7] Fergus Imrie, Anthony R. Bradley, Mihaela van der Schaar, and Charlotte M. Deane. Deep generative models for 3d linker design. Journal of Chemical Information and Modeling, 60(4):1983–1995, 2020.
  • [8] Benjamin J. Killian, Joslyn Yundenfreund Kravitz, and Michael K. Gilson. Extraction of configurational entropy from molecular simulations via an expansion approximation. The Journal of Chemical Physics, 127(2):024107, 2007.
  • [9] Benjamin J. Killian, Joslyn Yudenfreund Kravitz, Sandeep Somani, Paramita Dasgupta, Yuan-Ping Pang, and Michael K. Gilson. Configurational Entropy in Protein–Peptide Binding: Computational Study of Tsg101 Ubiquitin E2 Variant Domain with an HIV-Derived PTAP Nonapeptide. J. Mol. Biol., 389(2):315–335, 2009.
  • [10] Vladimir Hnizdo and Michael K. Gilson. Thermodynamic and differential entropy under a change of variables. Entropy, 12(3):578–590, 2010.
  • [11] Riccardo Baron, Wilfred F. van Gunsteren, and Philippe H. Hünenberger. Estimating the configurational entropy from molecular dynamics simulations: anharmonicity and correlation corrections to the quasi-harmonic approximation. Trends Phys. Chem., 11:87–122, 2006.
  • [12] Bracken M. King, Nathaniel W. Silver, and Bruce Tidor. Efficient calculation of molecular configurational entropies using an information theoretic approximation. The Journal of Physical Chemistry B, 116(9):2891–2904, 2012.
  • [13] Markus Fleck, Anton A. Polyansky, and Bojan Zagrovic. PARENT: A Parallel Software Suite for the Calculation of Configurational Entropy in Biomolecular Systems. J. Chem. Theory Comput., 12(4):2055–2065, 2016.
  • [14] Markus Fleck and Bojan Zagrovic. Configurational entropy components and their contribution to biomolecular complex formation. Journal of Chemical Theory and Computation, 15(6):3844–3853, 2019.
  • [15] Jorge Numata and Ernst-Walter Knapp. Balanced and Bias-Corrected Computation of Conformational Entropy Differences for Molecular Trajectories. J. Chem. Theory Comput., 8(4):1235–1245, 2012.
  • [16] Markus Fleck, Marcus Wieder, and Stefan Boresch. Dummy atoms in alchemical free energy calculations. Journal of Chemical Theory and Computation, 17(7):4403–4419, 2021.
  • [17] Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander Gaunt. Constrained graph variational autoencoders for molecule design. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  • [18] Wengong Jin, Kevin Yang, Regina Barzilay, and Tommi Jaakkola. Learning multimodal graph-to-graph translation for molecule optimization. In International Conference on Learning Representations, 2019.
  • [19] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. Gated graph sequence neural networks. In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
  • [20] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Yoshua Bengio and Yann LeCun, editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
  • [21] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A. Efros, Oliver Wang, and Eli Shechtman.

    Toward multimodal image-to-image translation.

    In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 465–476, 2017.
  • [22] Michael J. Potter and Michael K. Gilson. Coordinate Systems and the Calculation of Molecular Properties. J. Phys. Chem. A, 106(3):563–566, 2002.
  • [23] Chia-En Chang, Michael J. Potter, and Michael K. Gilson. Calculation of Molecular Configuration Integrals. The Journal of Physical Chemistry B, 107(4):1048–1055, 2003.
  • [24] Dudley R. Herschbach, Harold S. Johnston, and Donald Rapp. Molecular Partition Functions in Terms of Local Properties. The Journal of Chemical Physics, 31(6):1652–1661, 1959.
  • [25] Kenneth S. Pitzer. Energy Levels and Thermodynamic Functions for Molecules with Internal Rotation: II. Unsymmetrical Tops Attached to a Rigid Frame. The Journal of Chemical Physics, 14(4):239–243, 1946.
  • [26] Nobuhiro Gō and Harold A. Scheraga. On the Use of Classical Statistical Mechanics in the Treatment of Polymer Chain Conformation. Macromolecules, 9(4):535–542, 1976.
  • [27] Jerod Parsons, J. Bradley Holmes, J. Maurice Rojas, Jerry Tsai, and Charlie E. M. Strauss. Practical conversion from torsion space to Cartesian space for in silico protein synthesis. J. Comput. Chem., 26(10):1063–1068, 2005.
  • [28] M. S. Gordon and J. A. Pople. Approximate Self‐Consistent Molecular‐Orbital Theory. VI. INDO Calculated Equilibrium Geometries. The Journal of Chemical Physics, 49(10):4643–4650, 1968.
  • [29] Rhys Heffernan, Kuldip Paliwal, James Lyons, Abdollah Dehzangi, Alok Sharma, Jihua Wang, Abdul Sattar, Yuedong Yang, and Yaoqi Zhou.

    Improving prediction of secondary structure, local backbone angles and solvent accessible surface area of proteins by iterative deep learning.

    Scientific Reports, 5(1):11476, 2015.
  • [30] Haiou Li, Jie Hou, Badri Adhikari, Qiang Lyu, and Jianlin Cheng. Deep learning methods for protein torsion angle prediction. BMC Bioinform., 18(1):417:1–417:13, 2017.
  • [31] Jianzhao Gao, Yuedong Yang, and Yaoqi Zhou. Grid-based prediction of torsion angle probabilities of protein backbone and its application to discrimination of protein intrinsic disorder regions and selection of model structures. BMC Bioinform., 19(1):29:1–29:8, 2018.
  • [32] Rafael Gómez-Bombarelli, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018.
  • [33] Teague Sterling and John J. Irwin. ZINC 15 - ligand discovery for everyone. J. Chem. Inf. Model., 55(11):2324–2337, 2015.
  • [34] Minyi Su, Qifan Yang, Yu Du, Guoqin Feng, Zhihai Liu, Yan Li, and Renxiao Wang. Comparative assessment of scoring functions: The CASF-2016 update. J. Chem. Inf. Model., 59(2):895–913, 2019.
  • [35] G. Landrum.

    RDKit: open-source cheminformatics.
  • [36] Jameed Hussain and Ceara Rea. Computationally efficient algorithm to identify matched molecular pairs (mmps) in large data sets. J. Chem. Inf. Model., 50(3):339–348, 2010.
  • [37] Peter Ertl and Ansgar Schuffenhauer. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminformatics, 1:8, 2009.
  • [38] Jonathan B. Baell and Georgina A. Holloway. New substructure filters for removal of pan assay interference compounds (pains) from screening libraries and for their exclusion in bioassays. Journal of Medicinal Chemistry, 53(7):2719–2740, 2010.
  • [39] Thomas A. Halgren. Merck molecular force field. i. basis, form, scope, parameterization, and performance of mmff94. Journal of Computational Chemistry, 17(5-6):490–519, 1996.
  • [40] Thomas A. Halgren. Mmff vi. mmff94s option for energy minimization studies. Journal of Computational Chemistry, 20(7):720–729, 1999.
  • [41] David Weininger. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.
  • [42] Gavin Brown.

    A new perspective for information theoretic feature selection.

    In David van Dyk and Max Welling, editors,

    Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics

    , volume 5 of Proceedings of Machine Learning Research, pages 49–56, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 2009. PMLR.
  • [43] Claude E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, 27(3):379–423, 1948.