High-Throughput Virtual Screening of Small Molecule Inhibitors for SARS-CoV-2 Protein Targets with Deep Fusion Models

04/09/2021 ∙ by Garrett A. Stevenson, et al. ∙ Lawrence Livermore National Laboratory

Structure-based Deep Fusion models were recently shown to outperform several physics- and machine learning-based protein-ligand binding affinity prediction methods. As part of a multi-institutional COVID-19 pandemic response, over 500 million small molecules were computationally screened against four protein structures from the novel coronavirus (SARS-CoV-2), which causes COVID-19. Three enhancements to Deep Fusion were made in order to evaluate more than 5 billion docked poses on SARS-CoV-2 protein targets. First, the Deep Fusion concept was refined by formulating the architecture as one, coherently backpropagated model (Coherent Fusion) to improve binding-affinity prediction accuracy. Secondly, the model was trained using a distributed, genetic hyper-parameter optimization. Finally, a scalable, high-throughput screening capability was developed to maximize the number of ligands evaluated and expedite the path to experimental evaluation. In this work, we present both the methods developed for machine learning-based high-throughput screening and results from using our computational pipeline to find SARS-CoV-2 inhibitors.


1. Introduction

The COVID-19 disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is responsible for the most recent, severe pandemic in modern human history (Lau et al., 2021). At the onset of the COVID-19 pandemic, a worldwide effort began to identify and provide target proteins for vaccine and drug development to neutralize the virus. Two distinct proteins were rapidly solved: the trimeric spike protein (spike), which binds to human ACE2 to enter human cells (Belouzard et al., 2012), and the main protease (Mpro), which plays a pivotal role in viral gene expression and replication (Ullrich and Nitsche, 2020). In response to the pandemic, we participated in a large-scale, multi-institutional effort to virtually screen, experimentally test, and optimize therapeutic leads targeting the spike and Mpro SARS-CoV-2 protein targets. Two different binding sites from the spike protein (denoted spike1, spike2) and two different conformations of the Mpro active site (denoted protease1, protease2) were used in the high-throughput screening calculations.

Experimental tests of drug candidates are expensive and serve as a fundamental bottleneck in the drug discovery process. As a result, computational chemistry is used extensively to accelerate discovery by screening drugs and nominating the strongest candidates for experimental validation (Wong, 2021). High Performance Computing (HPC) plays a critical role in virtual screening (Zhang et al., 2013, 2014; Jorgensen, 2004; Kitchen et al., 2004; Cheng et al., 2012). Not only does HPC accelerate computationally expensive calculations, but it provides the scalability necessary to screen large numbers of candidate molecules. This is crucial, as the chemical space of potential molecules has been estimated to be on the order of 10^60 (Reymond et al., 2010; Drew et al., 2012). Accurately estimating protein-ligand binding affinities is an important step in drug discovery. However, even computationally expensive, biophysics-based scoring methods find predicting binding free energy a difficult task (Wong, 2021; Jones et al., 2021; Zhu et al., 2020). Deep learning methods represent an alternative, rapid approach to binding affinity prediction, one which alleviates the dependence on hand-curated features that may not capture the mechanism of binding (Ballester and Mitchell, 2010; Ain et al., 2015).

The two leading deep learning approaches to structure-based binding affinity prediction are 3-dimensional Convolutional Neural Networks (3D-CNNs) (Ragoza et al., 2017; Jiménez et al., 2018; Zhang et al., 2019) and Spatial Graph Convolutional Neural Networks (SG-CNNs) (Feinberg et al., 2018; Zhou et al., 2020; Lim et al., 2019). Fundamentally, 3D-CNN models exploit a voxelized representation of atoms in a 3D grid, which portrays protein-ligand compounds in a Euclidean space for inference. On the other hand, SG-CNN approaches leverage a graph representation of the protein-ligand complex, allowing multiple “edge types” to be encoded in the representation (e.g., distinct distance thresholds corresponding to covalent or non-covalent interactions) to sub-select groups of atoms and evaluate their pairwise interactions.

These methods are significantly different in the way they represent compounds and in their mechanisms of inference. However, they seek to achieve the same goal: accurate prediction of binding free energy on novel compounds. This observation led to the hypothesis that the 3D-CNN and SG-CNN likely have complementary strengths, which could be exploited by fusing the latent spaces of each model’s learned features. This approach to “Fusion” modeling was explored and shown to achieve superior generalization performance in predicting protein-ligand binding on X-ray crystallographic structures and virtually-docked poses of protein-ligand compounds in hold-out test sets (Jones et al., 2021). In this work, we detail improvements to fusion modeling, describe its utility in high-throughput screening, and evaluate its application to the SARS-CoV-2 drug discovery problem. A key innovation reported here is the ability to train a fusion model “coherently”. While Coherent Fusion comes with the increased computational burden of training a more complex model, the increase in complexity is mitigated by using HPC to perform parallel, distributed training.

2. Deep Fusion

2.1. Fusion Modeling in Computational Chemistry

Machine learning methods, specifically deep learning approaches to protein-ligand binding affinity prediction, represent a promising new development in drug discovery (Zhu et al., 2020; Ballester and Mitchell, 2010; Ain et al., 2015; Jiménez et al., 2018; Zhang et al., 2019; Feinberg et al., 2018; Zhou et al., 2020; Gomes et al., 2017; Stepniewska-Dziubinska et al., 2017). At a high level, the deep-learned models being proposed for binding affinity prediction are single-pass, feed-forward systems. This fundamental model formulation yields a computational advantage, in that the models predict binding affinity in one quick pass over their input. The simplicity and speed of deep learning prediction relative to biophysics-based computations make these models especially attractive in the context of massive virtual drug screens.

The concept of fusion is a recent development in deep learning, initially applied to computer vision problems (Roitberg et al., 2019; Yang et al., 2018; Wagner et al., 2016; Li et al., 2017). In fusion, models are combined to achieve improved performance by integrating different modes of data or modeling approaches. This concept was recently applied to the computational chemistry space in the form of Fusion models for Atomic and molecular STructures (FAST) (Jones et al., 2021), where fusion of the two leading deep learning models (3D-CNN, SG-CNN) was shown to improve binding affinity prediction. Specifically, the Late Fusion and Mid-level Fusion models were shown to be approximately equivalent or superior to individual SG-CNNs, individual 3D-CNNs, other state-of-the-art deep learning models (Jiménez et al., 2018; Stepniewska-Dziubinska et al., 2017), and physics-based approaches, including both Autodock Vina (Trott and Olson, 2010) and Molecular Mechanics - Generalized Born / Surface Area (MM/GBSA), which has been shown to improve ligand pose ranking for certain target proteins (Wong, 2021; Hou et al., 2011). The Late Fusion and Mid-level Fusion FAST models were specifically described and made publicly available. In the context of this work, we retrain and optimize the models on data from PDBbind-2019 (Wang et al., 2005).

$pK = -\log_{10}(K)$   (1)

The Late Fusion approach is simple, but its superior performance compared to individual SG-CNNs and 3D-CNNs shows the potential of fusion modeling. In Late Fusion, SG-CNN and 3D-CNN models are separately trained to predict absolute binding affinity (Equation 1), where absolute binding affinity is defined as the negative logarithm of a binding constant K. Binding affinity data in this study are measured as an inhibitory constant (Ki), dissociation constant (Kd), or inhibitory activity (IC50), and these measurements are treated as equivalent labels in our calculations. Late Fusion simply takes the unweighted arithmetic mean of the individually-predicted binding affinity values from the SG-CNN and 3D-CNN models.
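
For concreteness, a minimal sketch of Late Fusion as just described follows: the unweighted mean of the two models’ pK predictions. Here `sg_cnn` and `cnn_3d` are hypothetical handles for the two trained regressors.

```python
# Minimal sketch of Late Fusion (Equation 1's pK scale). `sg_cnn` and `cnn_3d`
# are hypothetical stand-ins for the two trained regressors.
import torch

@torch.no_grad()
def late_fusion_predict(sg_cnn, cnn_3d, graph_input, voxel_input):
    """Unweighted arithmetic mean of the two affinity predictions."""
    pk_sg = sg_cnn(graph_input)    # SG-CNN prediction, pK units
    pk_3d = cnn_3d(voxel_input)    # 3D-CNN prediction, pK units
    return 0.5 * (pk_sg + pk_3d)   # Late Fusion = simple average

# Toy usage with constant stand-in "models": result is tensor([7.0]).
pk = late_fusion_predict(lambda g: torch.tensor([7.2]),
                         lambda v: torch.tensor([6.8]), None, None)
```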

The Mid-level Fusion model is also explicitly described in (Jones et al., 2021). The model is defined by extracting latent-space feature vectors from an N-layer SG-CNN and an M-layer 3D-CNN. Each extracted feature vector is then separately processed through a model-specific dense layer, concatenated with the originally extracted vectors, and passed through two fusion dense layers before outputting a final prediction. In contrast to Late Fusion, Mid-level Fusion can be described as a non-linear combination of the respective models, and it has been shown to outperform Late Fusion in some cases.
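
As an illustration, a hedged PyTorch sketch of such a fusion head is shown below; the layer widths and activation are readability assumptions, not the optimized values reported later in this paper.

```python
# Sketch of a Mid-level Fusion head: each latent vector passes through a
# model-specific dense layer, is concatenated with the original vectors, then
# flows through two fusion dense layers. Dimensions here are illustrative.
import torch
import torch.nn as nn

class MidLevelFusion(nn.Module):
    def __init__(self, sg_dim=128, cnn_dim=128, hidden=64):
        super().__init__()
        self.sg_dense = nn.Linear(sg_dim, hidden)    # model-specific layer
        self.cnn_dense = nn.Linear(cnn_dim, hidden)  # model-specific layer
        fused = sg_dim + cnn_dim + 2 * hidden        # concat of all four vectors
        self.fusion = nn.Sequential(                 # two fusion dense layers
            nn.Linear(fused, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                    # final pK prediction
        )

    def forward(self, z_sg, z_3d):
        h = torch.cat([z_sg, z_3d,
                       torch.relu(self.sg_dense(z_sg)),
                       torch.relu(self.cnn_dense(z_3d))], dim=-1)
        return self.fusion(h)

pk = MidLevelFusion()(torch.randn(4, 128), torch.randn(4, 128))  # batch of 4
```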

2.2. Advancing Fusion

Faced with the imperative task of virtually screening molecules for inhibition of SARS-CoV-2 activity, we sought to apply and improve the fusion deep learning approach. We first updated the previously trained models with data from the latest version of PDBbind (2019) (Wang et al., 2005), which provides approximately 4,000 more compounds than the 2016 version used to train the original model and, therefore, offers the potential for improved model generalization.

Our second step aimed to improve the 3D-CNN and SG-CNN models by drawing on recent literature on hyper-parameter optimization. Deep learning models are highly sensitive to the human definition of their hyper-parameters – model architecture, loss function, and optimization algorithm, among others (Jaderberg et al., 2017). Automated hyper-parameter optimization was first addressed with parallel searches (grid or random), followed by sequential optimization methods such as Bayesian optimization (Srinivas et al., 2009; Bergstra et al., 2011; Snoek et al., 2012; Hutter et al., 2011). Hyper-parameter optimization algorithms have improved over time in their ability to scale (Snoek et al., 2015). Evolutionary Algorithms (EAs) have been shown to further improve the optimization process (Jaderberg et al., 2017; Castillo et al., 2000; Husken et al., 2000; Li et al., 2019). Recently, a leading population-based EA, Population Based Training (Jaderberg et al., 2017), was improved upon by formulating hyper-parameter optimization as a Gaussian Process (GP) bandit optimization of a time-varying function, yielding Population-Based Bandits (PB2) (Parker-Holder et al., 2020).

Fusion modeling in particular is marked by significant exposure to hyper-parameters. Both the 3D-CNN and SG-CNN models have their own hyper-parameters, which enable them to learn the binding affinity prediction problem optimally in isolation. Additionally, the fusion layers require another set of hyper-parameters to find an optimal non-linear combination of the two models. With this in mind, we saw a significant opportunity to improve fusion by utilizing the PB2 automated optimization algorithm (Parker-Holder et al., 2020) on Lassen, one of the most powerful high-performance computers in the world (LLNL, 2021b). The libraries used to define the model and optimization architectures are PyTorch (Paszke et al., 2019), PyTorch Geometric (Fey and Lenssen, 2019), and Ray/Ray[Tune] (Moritz et al., 2018).

Finally, a new formulation of fusion, the Coherent Fusion model, was developed as a potential improvement over the previous Late and Mid-level Fusion models. In both existing fusion approaches, the 3D-CNN and SG-CNN are individually optimized to minimize the mean squared error (MSE) between their predictions of binding free energy and ground-truth experimental values. The existing Mid-level approach then combines the independently optimized models to form a stronger predictor by learning the latent-space strengths of each model. However, in both the Late and Mid-level Fusion models, the 3D-CNN and SG-CNN weights are unaltered and remain in the state that was optimal for isolated prediction.

Given the Late and Mid-level Fusion models’ superior performance compared to their isolated components, we hypothesized that fusion might be further improved by coherently backpropagating gradients through both the fusion layers and the separate models. In doing so, the Coherent Fusion model fine-tunes both the 3D-CNN and SG-CNN heads to cooperatively exploit their strengths in a joint optimization. The drawbacks of the Coherent Fusion approach are an increased hyper-parameter search space and an increased number of trainable parameters. To address this, we developed a parallel, distributed hyper-parameter optimization training architecture, which accelerates training using state-of-the-art hyper-parameter optimization in a distributed fashion.
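
To make the distinction concrete, the following is a minimal, self-contained sketch of coherent backpropagation, with tiny linear stand-ins for the two model heads and the fusion layers (all names and dimensions are illustrative assumptions, not the paper’s architecture):

```python
# Sketch of coherent backpropagation: one optimizer updates the 3D-CNN head,
# the SG-CNN head, AND the fusion layers against a single MSE loss. In
# Late/Mid-level Fusion the two heads would be frozen instead.
import torch
import torch.nn as nn

# Hypothetical stand-ins: tiny sub-models over pre-extracted feature vectors.
sg_cnn = nn.Linear(16, 8)        # stands in for the SG-CNN head
cnn_3d = nn.Linear(32, 8)        # stands in for the 3D-CNN head
fusion_head = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1))

def fused_forward(graph_x, voxel_x):
    return fusion_head(torch.cat([sg_cnn(graph_x), cnn_3d(voxel_x)], dim=-1))

params = [*sg_cnn.parameters(), *cnn_3d.parameters(), *fusion_head.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-4)   # learning rate is an assumption
loss_fn = nn.MSELoss()

# Toy batch: random features and pK labels, just to exercise the loop.
graph_x, voxel_x, pk_true = torch.randn(4, 16), torch.randn(4, 32), torch.randn(4, 1)

loss = loss_fn(fused_forward(graph_x, voxel_x), pk_true)
optimizer.zero_grad()
loss.backward()      # gradients flow through the fusion layers and both heads
optimizer.step()     # all three components are updated coherently
```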

3. Optimization and Evaluation

3.1. Data

PDBbind-2019 is a curated subset of the larger Protein Data Bank (PDB) (Wang et al., 2004), which is widely used to tune biophysics- and machine learning-based methods (Wong, 2021; Jones et al., 2021; Jiménez et al., 2018; Feinberg et al., 2018; Lim et al., 2019). The PDBbind data set is comprised of crystal structures arranged into two groups (general and refined) based on size (protein-ligand compounds containing a ligand with molecular weight above 1000 Daltons (Da) are excluded from the refined set), data quality (compounds with a measured IC50 but no Ki or Kd measurements are excluded from the refined set), and resolution of the crystal structure (structures with resolution worse than 2.5 Angstroms are excluded). From the refined set, a third, core set is extracted using a clustering protocol based on protein-sequence similarity. The core set is compiled to represent a valid test for scoring methods by creating a high-quality subset of compounds sufficiently different from the general and refined sets. As such, we use the core set as a primary means for evaluating the Fusion methods considered and comparing against published literature.

In this study, we employ the quintile sub-sampling method from (Jones et al., 2021) to formulate training and validation sets from the PDBbind-2019 general and refined groupings. The sub-sampling is done independently on the general and refined sets, and 10% of the examples from each are withdrawn to form the validation set. Quintile sub-sampling guarantees that both the training and validation sets represent the full range of binding affinity values across PDBbind, whereas simple random sampling risks training and validating models on different sub-spaces of affinity values (Ellingson et al., 2020). The outcome is a training set of 15,631 complexes and a validation set of 1,731 complexes, with the 290 PDBbind core set complexes held out for evaluation. Details of pre-processing and feature extraction for the PDBbind data can be found in (Jones et al., 2021); here the same tools (Pettersen et al., 2004; Maier et al., 2015; Jakalian et al., 2000; O’Boyle et al., 2011) and sequence of operations are used.
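
A quintile-stratified split can be sketched as below; this is an illustrative implementation, not the authors’ released script, and the synthetic pK values in the usage example are placeholders.

```python
# Illustrative sketch of quintile sub-sampling: complexes are binned into
# affinity quintiles and 10% is drawn from each bin, so both train and
# validation sets span the full range of binding affinities.
import numpy as np

def quintile_split(pk_values, frac=0.10, seed=0):
    """Return (train_idx, val_idx) stratified by affinity quintile."""
    rng = np.random.default_rng(seed)
    pk = np.asarray(pk_values)
    edges = np.quantile(pk, [0.2, 0.4, 0.6, 0.8])   # quintile boundaries
    bins = np.digitize(pk, edges)                   # quintile id (0-4) per complex
    val_idx = []
    for q in range(5):
        members = np.flatnonzero(bins == q)
        n_val = max(1, int(round(frac * members.size)))
        val_idx.extend(rng.choice(members, size=n_val, replace=False))
    val_idx = np.sort(np.asarray(val_idx))
    train_idx = np.setdiff1d(np.arange(pk.size), val_idx)
    return train_idx, val_idx

# Example on synthetic pK labels (placeholder values, not PDBbind data).
pk = np.random.default_rng(1).uniform(2.0, 11.5, size=17_000)
train_idx, val_idx = quintile_split(pk)
```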

As a means of additional evaluation, we supplement the core set of 290 crystal structures from PDBbind with virtually-docked representations of the complexes. In practice, docking pose data are used for large-scale virtual screening but are noisy and error prone, since the correct ligand pose is not known until a co-complex is crystallized experimentally. Therefore, a scoring function’s performance in the docking space is critical in gauging its robustness to noise and its pragmatic utility.

We leverage the ConveyorLC toolchain (Zhang et al., 2013, 2014, 2017) to produce all docking complex data, as it is used in our high-throughput virtual screening pipeline. ConveyorLC generates docking poses using the Vina scoring function (Trott and Olson, 2010), then re-scores up to the 10 best docking poses using MM/GBSA on a subset of the larger virtual screen; only a subset is re-scored because MM/GBSA is orders of magnitude more computationally expensive than docking. This sequence of down-selection to limit the search space, accompanied by increasingly complex analyses, is frequently used in drug discovery pipelines, and even molecular dynamics (MD) simulations may be run before finalizing candidates for physical experimentation. The opportunity for machine learning models like Deep Fusion is to replace or supplement a more costly stage of a drug discovery pipeline with either improved accuracy or speed.

| Hyper-parameter | 3D-CNN | SG-CNN | Fusion |
| Optimizer | Adam (Kingma and Ba, 2014) | Adam (Kingma and Ba, 2014) | Adam (Kingma and Ba, 2014), AdamW (Loshchilov and Hutter, 2017), RMSprop (Graves, 2013), Adadelta (Duchi et al., 2011) |
| Activation function | ReLU | ReLU | ReLU, LReLU (Xu et al., 2015), SELU (Klambauer et al., 2017) |
| Batch size | 8, 12, 24 | 4, 8, 12, 16 | 1, 2, 4, 5, 8, 12, 16, 24, 28, 34, 38, 48, 56 |
| Learning rate | - | - | - |
| Model-Specific Fusion Layers | N/A | N/A | T/F |
| Epochs | 0-150 | 0-350 | 0-500 |
| Pre-trained | F | F | T/F |
| Batch norm. | T/F | F | T/F |
| Dropout 1 (early) | 0.25 | 0 | 0-0.50 |
| Dropout 2 (mid) | 0.125 | 0 | 0-0.25 |
| Dropout 3 (late) | 0 | 0 | 0-0.125 |
| # of Fusion Layers | N/A | N/A | 3, 4, 5 |
| # of Dense Nodes | 40, 64, 88, 104, 128 | N/A | 8, 24, 40, 64, 88, 104, 128 |
| Residual Option 1 | T/F | N/A | T/F |
| Residual Option 2 | T/F | N/A | T/F |
| # of Conv. Filters 1 | 32, 64, 96 | N/A | 32, 64, 96 |
| # of Conv. Filters 2 | 64, 96, 128 | N/A | 64, 96, 128 |
| Non-covalent / Covalent K | N/A | 2, 3, 4, 5, 6, 7, 8 | 2, 3, 4, 5, 6, 7, 8 |
| Non-covalent / Covalent Neighbor Threshold | N/A | 1.2Å-5.9Å | 1.2Å-5.9Å |
| Non-covalent / Covalent Gather Width | N/A | 8, 24, 40, 64, 88, 104, 128 | 8, 24, 40, 64, 88, 104, 128 |
Table 1. Hyper-parameters for each model and their range of values considered by the PB2 optimization
Figure 1. Fusion model architecture with voxelized and spatial graph inputs of COVID-19 Mpro (PDB: 6LU7, denoted protease1, green) in complex with an N3 inhibitor (red) [51]. Components of the 3D-CNN (orange), SG-CNN (blue), and Fusion layers (yellow) which were given as options to the hyper-parameter optimization are shown with dashed lines/borders.


3.2. Training Architecture

Our approach to training the various individual and fusion models was executed iteratively. Lassen uses the IBM Spectrum LSF job scheduler (IBM, 2021), which necessitates pausing, rescheduling, and resuming training jobs after a maximum run-time. As the hyper-parameter optimization began to converge, the range of hyper-parameter values was adjusted, when possible, to ensure the lower and upper bounds of the search space were not limiting factors in model performance. The full scope of hyper-parameters and ranges evaluated for each model is provided in Table 1, where the ranges are binary (T/F), a list of options, uniformly sampled continuous variables, or not applicable (N/A).

We used the Ray/Ray[Tune] (Moritz et al., 2018) Python library extensively as the foundation for running trials of individual hyper-parameter combinations within the context of a PB2 optimization. Together, Ray and PyTorch provide the ability to accelerate the training process by distributing individual trials across multiple nodes/GPUs. Each of Lassen’s 792 GPU nodes is made up of 44 3.45 GHz Power9 CPU cores, 4 NVIDIA Volta V100 GPUs each with 16GB of memory, and 256 GB of main memory. Depending on the complexity of each model, we distributed individual hyper-parameter configurations between 1 and 12 ranks (1 rank = 1 GPU, 10 CPU cores, 64 GB memory) to expedite the training process. Each rank also utilized 24 data workers running in parallel to pre-load future batches. The combination of distributed model training and parallel data loading was central to the feasibility of an experiment with this size/scope.

The PB2 hyper-parameter optimization was initialized with a quantile fraction of 50%, a time scale measured in epochs, a perturbation interval of 100 epochs, and an objective function of minimum validation-set MSE loss (Jaderberg et al., 2017; Parker-Holder et al., 2020). The procedure begins with a population of initial, randomly sampled hyper-parameter hypotheses. As each trial reaches the perturbation interval, PB2 examines the model’s performance and determines whether it is above or below the quantile fraction. The best-performing trials (above the quantile) continue, while the under-performing trials clone a top-performing configuration (exploiting) and modify it using a parallel GP-bandit optimization (exploring). The training process produces both an optimal model and important information about how the hyper-parameters considered affect performance.
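
A hedged sketch of this setup with Ray Tune’s PB2 scheduler follows; `train_fusion`, its config keys, and the bounds are illustrative stand-ins rather than the exact search space of Table 1.

```python
# Sketch of a PB2 search with Ray Tune, mirroring the settings above:
# objective = minimum validation MSE, perturbation interval of 100 epochs.
import ray
from ray import tune
from ray.tune.schedulers.pb2 import PB2  # requires the GPy dependency

def train_fusion(config):
    # Stand-in for one Coherent Fusion training loop: each "epoch" decays a
    # fake validation MSE so the sketch is runnable end to end.
    val_mse = 2.0
    for epoch in range(500):
        val_mse *= 0.995 + 0.01 * config["dropout_early"]  # toy dynamics
        tune.report(val_mse=val_mse)  # lets PB2 compare trials each epoch

pb2 = PB2(
    time_attr="training_iteration",   # time scale measured in epochs
    metric="val_mse",
    mode="min",                       # objective: minimum validation MSE
    perturbation_interval=100,        # exploit/explore every 100 epochs
    quantile_fraction=0.5,            # top/bottom split of the population
    hyperparam_bounds={"lr": [1e-5, 1e-2], "dropout_early": [0.0, 0.5]},
)

ray.init()  # on Lassen this would attach to a multi-node Ray cluster
tune.run(
    train_fusion,
    scheduler=pb2,
    num_samples=90,                   # population size (e.g., 90 SG-CNN trials)
    config={"lr": tune.uniform(1e-5, 1e-2),
            "dropout_early": tune.uniform(0.0, 0.5)},
    resources_per_trial={"cpu": 10, "gpu": 1},  # 1 rank = 1 GPU, 10 cores
)
```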

3.3. Model Architectures

We draw heavily from the original FAST network architectures in (Jones et al., 2021), which contains detailed descriptions of the pre-processing, feature extraction, voxel grid sizing, and atom propagation, all of which were unaltered. The following focuses on updates to the models and the final optimized hyper-parameter configurations for each component. For brevity, we list only the final optimized hyper-parameter values; the advantage of PB2 is its ability to learn a schedule of hyper-parameters that converges to an end-state (Parker-Holder et al., 2020).

3.3.1. Individual Models

| Hyper-parameter | Value |
| Epochs | 213 |
| Batch size | 16 |
| Learning rate | - |
| Non-covalent K | 3 |
| Covalent K | 6 |
| Non-covalent Neighbor Threshold | 5.22Å |
| Covalent Neighbor Threshold | 2.24Å |
| Non-covalent Gather Width | 128 |
| Covalent Gather Width | 24 |
Table 2. Optimized final hyper-parameters for the SG-CNN
| Hyper-parameter | Value |
| Epochs | 75 |
| Batch size | 12 |
| Learning rate | - |
| Batch normalization | F |
| # of Dense Nodes | 128 |
| # of Conv. Filters 1 | 32 |
| # of Conv. Filters 2 | 64 |
| Residual Option 1 | F |
| Residual Option 2 | T |
Table 3. Optimized final hyper-parameters for the 3D-CNN

The SG-CNN in this work is structurally unaltered from (Jones et al., 2021), which uses the PotentialNet (Feinberg et al., 2018) architecture based on Gated Graph Sequence Neural Networks (Li et al., 2015). The only notable difference is that the sizes of the dense layers were set according to the Non-covalent Gather Width, sequentially reduced by a factor of 1.5 and then 2. A population of 90 SG-CNN trials produced the final model and the hyper-parameter configuration given in Table 2.

The 3D-CNN model is slightly modified from the architecture in (Jones et al., 2021). The model has dropout above the first two dense layers and 2 additional convolutional layers; the filter sizes begin at 5x5x5 and reduce to 3x3x3; the residual options shown in Figure 1 were fed to the hyper-parameter optimization; and, similar to the SG-CNN, the second dense layer size was determined by the optimization and then sequentially reduced by a factor of 2. Again, a population of 90 trials was used. The final hyper-parameter values are given in Table 3, where the optimization converged to using the second residual connection shown in Figure 1 and 32 and 64 filters for the 5x5x5 and 3x3x3 convolutional layers, respectively. With this larger 3D-CNN architecture (deeper than in (Jones et al., 2021)), we found it beneficial to augment the input matrices for the training set by randomly flipping the input data in X, Y, and Z, each with a 10% probability of occurring.
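
This augmentation amounts to the sketch below (an assumption about the exact implementation; channels-first layout and grid dimensions are illustrative):

```python
# Sketch of the voxel-grid augmentation described above: each 3D input is
# independently mirrored along X, Y, and Z with 10% probability per axis.
# Assumes channels-first arrays of shape (C, X, Y, Z).
from typing import Optional
import numpy as np

def random_flip_voxels(grid: np.ndarray, p: float = 0.1,
                       rng: Optional[np.random.Generator] = None) -> np.ndarray:
    """Randomly mirror a (C, X, Y, Z) voxel grid along each spatial axis."""
    rng = rng or np.random.default_rng()
    for axis in (1, 2, 3):              # spatial axes; axis 0 holds channels
        if rng.random() < p:
            grid = np.flip(grid, axis=axis)
    return np.ascontiguousarray(grid)   # undo np.flip's negative strides

# Example: augment one 19-channel 48^3 grid (dimensions are illustrative).
voxels = np.zeros((19, 48, 48, 48), dtype=np.float32)
augmented = random_flip_voxels(voxels)
```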

| Hyper-parameter | Value |
| Epochs | 64 |
| Batch size | 1 |
| Learning rate | - |
| Batch normalization | F |
| Optimizer | Adam (Kingma and Ba, 2014) |
| Activation function | SELU (Klambauer et al., 2017) |
| Residual Fusion Layers | T |
| Dropout Rate 1 (early) | 0.251 |
| Dropout Rate 2 (mid) | 0.125 |
| Dropout Rate 3 (late) | 0 |
| # of Fusion Layers | 5 |
Table 4. Optimized final hyper-parameters for the Mid-level Fusion model

3.3.2. Late / Mid-level Fusion

The Late Fusion method was implemented as in (Jones et al., 2021). The optimization, on the other hand, led the Mid-level Fusion model to a modified structure. For Mid-level Fusion, every optional layer (dashed lines) in the yellow Fusion block of Figure 1 was turned on. Table 4 gives the final hyper-parameters for Mid-level Fusion, which are the output of a population of 180 individual trials. The other minor differences are that a SELU (Klambauer et al., 2017) activation was selected over the previous Leaky-ReLU activation (Xu et al., 2015), the final batch size is 1, and light dropout is used instead of none.

| Hyper-parameter | Value |
| Pre-trained | T |
| Epochs | 18 |
| Batch size | 48 |
| Learning rate | - |
| Batch normalization | F |
| Optimizer | Adam (Kingma and Ba, 2014) |
| Activation function | SELU (Klambauer et al., 2017) |
| Residual Fusion Layers | F |
| Dropout Rate 1 (early) | 0.386 |
| Dropout Rate 2 (mid) | 0.247 |
| Dropout Rate 3 (late) | 0.055 |
| # of Fusion Layers | 4 |
Table 5. Optimized final hyper-parameters for the Coherent Fusion model

3.3.3. Coherent Fusion

In developing the Coherent Fusion model, it was unclear whether the same 3D-CNN and SG-CNN hyper-parameter configurations found to be optimal in isolation would also be ideal for collaborative prediction. As such, we gave the optimization the option to either load the models individually trained for prediction or re-define their structure and train each head from scratch. Using the pre-trained models led to a significant improvement in validation loss. Table 5 therefore gives the final hyper-parameters for the best-performing Coherent Fusion model, which loads the weights from the SG-CNN in Table 2 and the 3D-CNN described in Table 3.

The Coherent Fusion experiment optimized a population of 270 individual trials to produce a best performer. Interestingly, the Coherent Fusion model converged to exclude the model-specific dense layer options that the Mid-level Fusion model uses (Figure 1) and used a simpler architecture overall (4 fusion layers). Additionally, Coherent Fusion used a larger batch size of 48 and significantly stronger dropout. Across the board, the Coherent Fusion model preferred a simpler Fusion architecture with significantly stronger regularization. Our intuition for this phenomenon is that, because the Coherent Fusion model adjusts a larger set of learned parameters, it can afford a simpler architecture, faster convergence, and heavier regularization than the Mid-level Fusion model, which serves as preliminary evidence of a stronger predictor.

3.4. Evaluation Results

| Model | RMSE | MAE | R² | Pearson R | Spearman R |
| Pafnucy (Stepniewska-Dziubinska et al., 2017) | 1.42 | 1.13 | - | 0.78 | - |
| Mid-level Fusion | 1.38 | 1.10 | 0.596 | 0.778 | 0.757 |
| Late Fusion | 1.33 | 1.07 | 0.623 | 0.813 | 0.805 |
| Coherent Fusion | 1.30 | 1.05 | 0.640 | 0.807 | 0.802 |
| KDeep (Jiménez et al., 2018) | 1.27 | - | - | 0.82 | 0.82 |
Table 6. Performance of Fusion models on the PDBbind core set crystal structures

In sum, over 60,000 Lassen GPU hours were used to optimize the various models. Not all model iterations and intervals were run across the same number of nodes, but at its peak our training architecture ran across 66 Lassen nodes capable of over 7,300 TFLOPS, using 2,904 CPU cores and 264 GPUs to train in parallel.

In Table 6, the Coherent Fusion model is shown to outperform the Late and Mid-level Fusion methods on the PDBbind core set of 290 compounds. While the difference between the Late and Coherent Fusion methods is only 0.03 RMSE, the genetic optimization of Coherent Fusion produced several nearly identical models which consistently performed better than the Late Fusion approach in all evaluated metrics. Importantly, the Coherent Fusion model converged, through an automated process, to a model structure that exceeds the performance of the hand-crafted fusion architecture used in Mid-level Fusion. Additionally, we provide a comparison to two other deep learning approaches (KDeep (Jiménez et al., 2018) and Pafnucy (Stepniewska-Dziubinska et al., 2017)) to view the Coherent Fusion model’s performance in a wider scope.

While the PDBbind core set is a standard benchmark for machine learning methods (Jones et al., 2021; Jiménez et al., 2018; Stepniewska-Dziubinska et al., 2017), high-throughput virtual screening overwhelmingly relies on docked poses of compounds for drug discovery. With that in mind, we leveraged ConveyorLC (Zhang et al., 2014) to evaluate the Coherent Fusion model against physics-based scoring functions in the docking space. Evaluating the noisier docking data also provides insight into whether the machine learning model was over-trained and how robust it is to noise when evaluating more realistic data.

Figure 2. Binary classification of 128 docked complexes from the PDBbind core set, where the positive, “stronger” binder class represents 57 compounds with experimental binding affinity (pK) of 8 or greater, and the negative, “weaker” class consists of 71 compounds with experimental pK of 6 or less.


197 compounds from the PDBbind core set were successfully evaluated by ConveyorLC with the physics-based Autodock Vina docking algorithm (Trott and Olson, 2010) and MM/GBSA methods for comparison with Coherent Fusion. Each of the 197 compounds was then filtered by RMSD to the PDBbind crystal structure, checking that a correct, sufficiently similar pose was found. Using the binding affinity values from PDBbind as ground truth for the docking poses, Vina achieved a Pearson correlation coefficient of .579, MM/GBSA scored .591, and Coherent Fusion reached .745. To further examine the performance difference between the three methods, binding affinity prediction can be cast as a binary classification problem (Jones et al., 2021). Figure 2 shows the results on a subset of the 197 core set compounds. Positive and negative classes were created from 57 stronger binders and 71 weaker binders, respectively. Because the set of strong vs. weak docked poses is small (128 total), we elected to compare the different methods using Precision-Recall curves and F1-scores, which give a much more direct picture of how each model is performing than a ROC curve provides. With this extended analysis, we considered the Coherent Fusion model sufficiently validated and ready for use in high-throughput screening for SARS-CoV-2 inhibitors.
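
For reference, the classification metrics above can be computed as in the following scikit-learn sketch; the label and score arrays are placeholders, not the paper’s data.

```python
# Sketch of the binary-classification evaluation: predicted affinities are
# scored against stronger/weaker binder labels with a precision-recall curve
# and an F1-score. Arrays below are toy placeholders.
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score

y_true = np.array([1, 1, 0, 0, 1, 0])               # 1 = stronger binder
scores = np.array([8.4, 7.9, 5.2, 6.1, 8.8, 5.9])   # predicted pK per compound

precision, recall, thresholds = precision_recall_curve(y_true, scores)
best_f1 = max(f1_score(y_true, scores >= t) for t in thresholds)
print(f"best F1 over thresholds: {best_f1:.3f}")
```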

4. High-Throughput Virtual Screening

Our SARS-CoV-2 effort screened over 500 million compounds against each of the 4 distinct Mpro and spike targets, drawing compounds from four publicly available virtual compound libraries (Lau et al., 2021). The ZINC database (Sterling and Irwin, 2015) was used to create a set of “world-approved 2018” drugs from a list of FDA-approved and “world-not-FDA” approved drugs. An additional 1.5 million unique compounds were selected from ChEMBL (Gaulton et al., 2012) and 18 million compounds were drawn from eMolecules (eMolecules, 2021). The remaining compounds came from Enamine’s list of drug-like virtual compounds estimated to be synthetically feasible (Enamine, 2021). In sum, over 5 billion docking poses were generated and evaluated.

4.1. Physics-based Screening Pipeline

As with our comparison between Coherent Fusion and physics-based binding affinity scoring functions, we leveraged the existing ConveyorLC (Zhang et al., 2014) tool chain to search for candidate inhibitors of the two binding sites from spike and the active sites of two different SARS-CoV-2 protease crystal structures. ConveyorLC is mainly made up of four parallelized programs, each designed to handle a specific task in the molecular docking and re-scoring process: CDT1Receptor performs protein preparation, CDT2Ligand handles ligand preparation, CDT3Docking performs the molecular docking, and CDT4mmgbsa handles MM/GBSA re-scoring. More details on the exact execution, pre-processing, and parameters used in the docking simulations can be found in previous studies (Lau et al., 2021; Zhang et al., 2014).

The Vina scoring used in CDT3Docking operates at approximately one minute per compound per CPU core. On a single Lassen node with 40 CPU cores (each core has 4 hardware threads), using 8 Monte Carlo simulations per compound, Vina is able to generate 10 docking poses per second. In contrast, a single-point MM/GBSA score takes 10 minutes per docking pose per CPU core. Because of its computational cost, MM/GBSA is often used as a re-scoring function to refine an already filtered set of compounds (Wong, 2021). Even on a full Lassen node, MM/GBSA is only capable of re-scoring 0.067 poses per second.

Figure 3. Structure of a single Fusion evaluation job (top). A job begins with 2 million poses to evaluate, divides the poses per node, then each node assigns poses to its ranks and evaluates. The predictions are aggregated and written across the ranks in parallel. Individual ranks (bottom) take their assigned poses, begin loading batches into main memory and feeding them to the GPU for inference. Finally, pose identifiers and predictions are collected for file writing.


4.2. Scalable Fusion Predictions

In order to screen millions of compounds against SARS-CoV-2, we developed a scalable architecture around the Coherent Fusion model for rapid evaluation (Figure 3). The Coherent Fusion model occupies 1.5 GB of GPU memory, which fits on each 16 GB NVIDIA Volta V100 GPU. The remainder of the GPU memory is used to simultaneously load 56 individual docked poses into a batch alongside each model. The 4 model instances on each node were each given 12 parallel data loaders to accelerate inference. Each model in every job is assigned a subset of compounds to evaluate, and its data loaders complete all file reading and pre-processing operations to prepare batches of data in an individual node’s 256 GB of main memory, which are subsequently loaded onto the GPU. After evaluating a batch, the screening code offloads the compound, target, and pose identifiers along with the model’s predicted binding affinity. Once a job completes evaluation, the identifiers and predictions are gathered across MPI ranks and distributed across the individual ranks to be written in parallel to HDF5 files.
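
A condensed sketch of one rank’s evaluation loop follows, using standard Horovod and PyTorch APIs; the model and pose dataset are toy stand-ins for the Coherent Fusion model and the docked-pose reader, and the HDF5 output is omitted.

```python
# Sketch of one rank in a 16-rank Horovod evaluation job: the rank pins a GPU,
# scores its shard of poses, then results are gathered across all ranks.
import horovod.torch as hvd
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

hvd.init()
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())       # pin one GPU per rank
device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Linear(64, 1).to(device).eval()  # stand-in Fusion model
poses = TensorDataset(torch.arange(2_000),        # toy pose identifiers
                      torch.randn(2_000, 64))     # toy pose features
sampler = DistributedSampler(poses, num_replicas=hvd.size(), rank=hvd.rank())
loader = DataLoader(poses, batch_size=56, sampler=sampler,
                    num_workers=12, pin_memory=True)  # 12 parallel loaders

ids, preds = [], []
with torch.no_grad():
    for pose_ids, feats in loader:
        preds.append(model(feats.to(device)).cpu())   # predicted affinities
        ids.append(pose_ids)

# Gather identifiers and predictions across all ranks; each rank then writes
# its assigned slice to HDF5 (file output omitted in this sketch).
all_ids = hvd.allgather(torch.cat(ids))
all_preds = hvd.allgather(torch.cat(preds))
```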

In the context of Lassen’s LSF job scheduler (IBM, 2021), we formulated Fusion evaluation jobs as many individual 4-node processes, each assigned to evaluate an independent set of 2 million poses, or approximately 200,000 compounds. This format was also a response to our encountering a wide range of errors (bad metadata, node failure, broken pipe errors, etc.), which led to our pipeline being tailored for fault tolerance. With this architecture, when a job fails it has minimal impact on overall throughput (another job takes its place), the reason for failure is easier to pinpoint (log files are smaller and easier to parse), and only a small set of compounds is affected or needs to be rescheduled.

| Metric | Single Job | Peak |
| Avg. Startup | 20 min. | - |
| Avg. Evaluation | 280 min. | - |
| Avg. File Output | 6.5 min. | - |
| Poses per sec. | 108 | 13,594 |
| Poses per hour | 338,800 | 48,600,000 |
| Compounds per hour | 33,880 | 4,860,000 |
Table 7. Throughput for Fusion prediction in a single job (2 million poses) and at peak performance (125 parallel jobs)

To create each 4-node job, we relied heavily on Horovod (Sergeev and Balso, 2018), which is based on MPI concepts and uses MPI for cross-node communication. With 4 GPUs on each node, each job is a 16-rank distributed process. Each rank runs a Python script and is given a specific GPU, CPU cores, and memory allocation to execute the evaluation. At the beginning of a job, we simply divide the set of compounds assigned to the job by the number of ranks and assign each rank the subset matching its index. When evaluation completes, the ranks use allgather to compile the results and subsequently write out HDF5 files. File output was identified early on as a bottleneck to the evaluation process, which was mitigated by assigning each rank compounds to be written to the same files and directories. The output file format was designed to mirror the output format of ConveyorLC’s CDT3Docking process (Zhang et al., 2014) for interpretation with existing tools and further evaluation of pharmacokinetic and safety properties (Lau et al., 2021).

Table 7 includes a breakdown of how time is spent in an individual job. The initial 20 minutes are consumed by loading HPC modules and an Anaconda (Inc., 2021) environment, initializing the Horovod ranks, loading an instance of the Fusion model onto each GPU, and pre-loading the initial batches of data for evaluation. The bulk of a job is, as expected, spent in the evaluation period, where batches are loaded, evaluated, and predictions stored in parallel across all ranks. Finally, once the ranks gather their results, the file-writing process begins and completes about 6.5 minutes later, yielding an average total run-time of approximately 5.1 hours.

With jobs designed for scalability, we regularly ran more than 10 at a time during the SARS-CoV-2 screening effort. At set times, however, the majority of Lassen nodes were made available to accelerate evaluation; the peak was an allotment of 500 nodes for Fusion screening. The impact of a large number of nodes is clearly seen in Table 7, where throughput increased more than 100 times. Ultimately, during several hours of evaluation at scale, the Coherent Fusion model used more than 14,010 TFLOPS of Lassen’s compute power to screen nearly 5 million compounds per hour.

The throughput advantage of using Fusion models is clear and relies heavily on parallel CPU workers keeping each assigned GPU utilized. Compared with Vina and MM/GBSA, the Fusion model screening code provides a 2.7x and 403x speed increase, respectively.

5. SARS-CoV-2 Results

The high-throughput virtual drug screening pipeline described in Section 4 produced several computational results for each compound screened. The Fusion model’s binding-affinity prediction was one of three energy calculations (Vina, MM/GBSA, Fusion) used as components of a hand-tailored cost function designed to identify which compounds to purchase for experimental evaluation and which were less likely to be successful. Full details of the ranking and reasoning may be found in (Lau et al., 2021), and computational predictions are made available at https://url-excluded-during-review (LLNL, 2021a). The output of the virtual screening fed directly into an experimental process to physically interrogate candidate molecules.

5.1. Experimental Validation

Experimental testing of the candidate binders which were screened and purchased to target Mpro used a fluorescence resonance energy transfer (FRET) based activity assay or an SDS-PAGE gel protein cleavage assay. After assay optimization, additional screens were run in order to down-select compounds. For example, compounds from the ZINC database (Sterling and Irwin, 2015) were down-selected to an additional testing of 19 compounds, which yielded 4 candidates inhibiting the activity of Mpro at 100 micromolar (µM) concentrations. The four identified compounds are candesartan cilexetil, FAD disodium, tigecycline, and tetracycline (Lau et al., 2021).

On the other hand, compounds predicted to inhibit the SARS-CoV-2 spike protein were screened by both a pseudo-typed virus assay and a biolayer interferometry (BLI) competitive assay. Here the candidate compounds were evaluated for their ability to inhibit ACE2-spike binding; in parallel, the spike-binding candidates were screened using a cell-based infection assay at 10 µM. Further details of the experimental design, assays, results, and discussion can be found in (Lau et al., 2021).

Figure 4. Coherent Fusion predicted binding affinity vs. experimental percentage of inhibition at 100 µM for 130 compounds against Mpro protease1 (blue) and 81 compounds against Mpro protease2 (orange). The spike assays were evaluated at 10 µM and include 151 compounds against spike1 (green) and 113 compounds against spike2 (red). Compounds which exhibited less than 1% inhibition (no experimental binding activity) are excluded.


5.2. Connecting Predictions and Experimental Results

Given the ground-truth experimental values for physically tested compounds, we can retrospectively evaluate the accuracy of each computational approach that generated a prediction. Obvious questions to ask include: “Which method was most correlated with the experimental results?”, “Are the most accurate scoring functions the same for all four Mpro and spike targets?”, and “Which of the scoring methods is most accurate for the strongest experimental inhibitors?”.

Each experimentally prosecuted compound can be traced back to its virtually docked poses for either the two Mpro or the two spike binding targets. This means each scoring method may have predicted binding affinity values for up to 40 poses per compound (a maximum of 10 poses per binding site). While in the computational domain we have predicted binding affinity values, the output from the Mpro and spike assays is a percentage of inhibition normalized between 0 and 100%. These values are produced at a given concentration of the candidate drug (100 µM for the Mpro targets and 10 µM for the spike targets), which gives context to the overall strength of a binder.

For each scoring method (Vina, MM/GBSA, Coherent Fusion), the results per compound were aggregated and represented by the strongest prediction for each compound and binding site combination (maximum for Coherent Fusion, minimum for Vina and MM/GBSA) across distinct target binding sites. For example, if a compound was tested experimentally against Mpro, we took the strongest prediction from each method on both of the two virtual Mpro structures/conformations and used those values as that scoring method’s prediction. The same procedure was applied to the experimental results, binding sites, and virtually docked poses of spike. Because of the computational cost of MM/GBSA, we instead use MM/GBSA values predicted by the ATOM Modeling PipeLine (AMPL), which have been shown to be highly correlated with actual MM/GBSA calculations and were trained to predict MM/GBSA scores on each specific target (McLoughlin, 2021).
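
The aggregation can be sketched as a simple groupby over a pose-level table (pandas); the schema and values below are placeholders, not the screening data.

```python
# Sketch of the per-compound aggregation: Fusion predicts pK (higher =
# stronger binding), while Vina and MM/GBSA predict energies (lower =
# stronger), so the strongest pose per compound is a max or min accordingly.
import pandas as pd

poses = pd.DataFrame({                      # toy pose-level predictions
    "compound": ["A", "A", "A", "A", "B", "B"],
    "method":   ["Fusion", "Fusion", "Vina", "Vina", "Fusion", "Vina"],
    "score":    [6.1, 6.8, -7.2, -8.0, 5.4, -6.3],
})

agg_dir = {"Fusion": "max", "Vina": "min", "MM/GBSA": "min"}

def strongest(group: pd.DataFrame) -> float:
    how = agg_dir[group["method"].iloc[0]]
    return group["score"].max() if how == "max" else group["score"].min()

per_compound = (poses.groupby(["compound", "method"])
                     .apply(strongest)
                     .unstack("method"))    # one row per compound
print(per_compound)
```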

5.3. Analyzing Computational Predictions

Following this aggregation, the output is a single prediction per compound tied to a single percent inhibition. Each method can then be viewed as a scatter plot comparing predicted vs. actual experimental results, as in Figure 4. We computed Pearson and Spearman correlation coefficients for each method across all experimentally tested Mpro and spike compounds. However, most experimentally tested compounds are negatives (1% inhibition or less), which gives correlation coefficients near 0 for each of the three evaluated methods (table excluded for brevity). To focus our analysis on the relative strengths and weaknesses of the different scoring methods rather than the difficulty of the overall binding affinity prediction problem, we then computed correlation coefficients for each method on the subset of compounds for which any experimental binding (>1% inhibition) was observed.

| Method | Target/Site | Pearson R | Spearman R |
| Vina | Mpro/protease1 | 0.03 | -0.08 |
| AMPL MM/GBSA | Mpro/protease1 | 0.08 | 0.01 |
| Coherent Fusion | Mpro/protease1 | -0.06 | -0.04 |
| Vina | Mpro/protease2 | -0.08 | -0.14 |
| AMPL MM/GBSA | Mpro/protease2 | -0.05 | -0.07 |
| Coherent Fusion | Mpro/protease2 | 0.04 | 0.04 |
| Vina | spike/spike1 | -0.02 | 0.06 |
| AMPL MM/GBSA | spike/spike1 | 0.15 | 0.22 |
| Coherent Fusion | spike/spike1 | 0.22 | 0.30 |
| Vina | spike/spike2 | 0.13 | 0.27 |
| AMPL MM/GBSA | spike/spike2 | -0.02 | -0.05 |
| Coherent Fusion | spike/spike2 | -0.02 | -0.01 |
Table 8. Correlation of predicted binding and percent inhibition on compounds with >1% inhibition

Table 8 shows the correlations for each scoring method as described, where the absolute values of the Vina and MM/GBSA scores are used since their predictions are negative values. AMPL MM/GBSA gives the best correlation for the protease1 target, Coherent Fusion for the protease2 and spike1 targets, and Vina scores the spike2 binding site best. However, across the board, it is clear that even when limiting the analysis to inhibiting compounds, the correlations for each method remain low, and the interpretation of near-zero correlation coefficients is unavailing.

While removing all the non-inhibitors gives some glimpse into which methods are most correlated with the SARS-CoV-2 binders, it also removes the context of those predictions. That is, the overall prediction strength of each method is somewhat obscured, as the range of each method’s prediction values is limited to its minimum and maximum predictions in the smaller set of SARS-CoV-2 binders.

With this in mind, we sought to answer the question of which scoring methods were most accurate for the strongest experimental inhibitors by including non-binding compounds and casting the prediction problem as a binary classification of compounds with >33% inhibition (positive class) and compounds with <33% inhibition (negative class). A threshold of 33% was chosen in order to avoid the severe class imbalances caused by higher thresholds. This problem formulation is similar to that in Figure 2, where we set a threshold to separate stronger binders from weaker binders. This approach is applicable in practice, as large-scale screens eventually down-select to a small set of candidates for final analysis, purchase, and further experimental testing.

Figure 5 displays Precision/Recall curves and F1-scores using a threshold of 33% to separate the positive and negative binding examples for each target. The y-axis of each P/R curve is limited to 0.35 to better observe the different model behaviors. Although the curves for each target appear different from the P/R plots in Figure 2, they provide additional information about how each model performs in practice on a noisier experimental screen where the cutoff separating active from inactive molecules is less clear.

Figure 5. Precision/Recall curves and F1-scores by SARS-CoV-2 protein target at 33% experimental inhibition. Mpro protease1 (far left) shows results for 30 positive and 311 negative binders. Mpro protease2 (middle left) includes 20 positive and 196 negative binders. The spike1 site (middle right) includes 32 positive and 209 negative binders. Finally, the spike2 site (far right) includes 26 positive and 218 negative binders. The black horizontal dashed line indicates the performance of a random classifier.


While the P/R curves in Figure 5 present nominally low F1-scores and precisions, two important observations arise. First, for each of the four binding sites, we computed Cohen’s Kappa statistic (κ) to compare each model with a random classifier (Cohen, 1960; Landis and Koch, 1977). Equation 2 gives the formula for computing κ:

$\kappa = \frac{p_o - p_e}{1 - p_e}$   (2)

where $p_o$ is the observed accuracy and $p_e$ is the expected accuracy. A random classifier achieves κ = 0 by guessing according to the frequency of the positive and negative classes. Every model on each target was measured to achieve κ > 0 (with the exception of Vina on spike1), indicating the models are consistently better than random across the targets. In the context of P/R curves, a random classifier’s performance is expressed as a horizontal line (Figure 5) with constant expected precision equal to the proportion of positive examples in the data set.
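
As a worked example of Equation 2, the short function below computes κ from a binary confusion matrix; the counts are illustrative, not taken from the paper’s screens.

```python
# Worked example of Equation 2: kappa compares the observed accuracy p_o with
# the accuracy p_e a class-frequency guesser would achieve. Counts are toy.
def cohens_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n                            # observed accuracy
    p_pos = ((tp + fp) / n) * ((tp + fn) / n)      # chance agreement, positives
    p_neg = ((fn + tn) / n) * ((fp + tn) / n)      # chance agreement, negatives
    p_e = p_pos + p_neg                            # expected accuracy
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa(tp=10, fp=20, fn=16, tn=198))   # ~0.27, i.e., kappa > 0
```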

In practice, the three binding affinity prediction models served as input to a hand-crafted cost function (Lau et al., 2021) to nominate compounds for physical experimentation. This process produced 9 distinct compounds which inhibited 100% of Mpro activity and 108 total compounds with >33% inhibition. Using the 33% inhibition threshold, 108 of 1,042 (10.4%) experimentally tested compounds inhibit Mpro activity, indicating the models have significant predictive power: while 500+ million candidates were evaluated, only a fraction (2.1%) of the best candidates were tested experimentally, yet the models yielded a 10.4% hit rate.

For the Mpro protease1 binding site, AMPL MM/GBSA is the best predictor of experimental binding. Additionally, between the two Mpro binding sites, protease1 and protease2, the AMPL MM/GBSA predictions on protease1 conjointly achieve a higher F1-score and Pearson correlation coefficient than all other methods. In fact, for each of the four targets, not only do the best F1-scores at the 33% threshold match the best Pearson/Spearman correlations at the 1% threshold, but the relative differences across target binding sites are also in agreement.

The Coherent Fusion model reaches the maximal F1-scores and correlation coefficients for the Mpro protease2 and spike1 binding sites. Across all model/binding-site combinations, the Coherent Fusion model’s performance on spike1 is the top performer, followed closely by AMPL MM/GBSA on the same target site. This is unexpected, as the protease targets are much larger and nominally thought to provide better opportunity for drug-like compounds to bind.

However, target-specific strengths are not unexpected (Wong, 2021), especially where the Mpro sites are large protein pockets and the spike targets are much smaller. Interestingly, the maximal >1% correlations and >33% F1-scores across all targets favor the spike proteins. The differing concentrations between the Mpro and spike experiments are important to note, as Mpro binders were evaluated at 100 µM, a higher drug concentration, which therefore allows weaker binders to exhibit higher observed percentages of inhibition.

6. Conclusion and Future Work

Deep Fusion was improved by coherent backpropagation and distributed, genetic hyper-parameter optimization. The optimization automatically produced a version of the Coherent Fusion model which was shown to exceed the performance of hand-crafted Fusion model variants and alternative machine learning methods on crystal structures from the PDBbind core set benchmark. In evaluating noisier docked poses of the same test set, the Coherent Fusion model also displayed an improvement in correlation and F1-score relative to the physics-based Vina and MM/GBSA methods.

We utilized the Coherent Fusion model to screen over 500 million compounds across four SARS-CoV-2 binding sites. This was achieved by designing a high-throughput distributed architecture for fault-tolerant evaluation at scale. Using parallel data loaders and Lassen’s 4 GPUs per node, the Coherent Fusion model is currently capable of screening approximately 30 poses per second per node. The Coherent Fusion model was used as input to a weighted cost function across binding affinity and other scoring models in order to sub-select compounds for experimentation.

Among the nearly 1,000 compounds tested experimentally, several were found to inhibit the activity of Mpro and spike. In analyzing which binding affinity methods were most correlated with the experimental results, we found the optimal performer to vary by target. However, two of the four targeted binding sites were best predicted by the Coherent Fusion model, both in terms of correlation with the experimental results on >1%-inhibiting compounds and binary classification at the >33% inhibition threshold. Although the overall predictive power of all of the scoring functions is still limited, any opportunity for computational enrichment of strong binders is needed to alleviate the otherwise prohibitive time and cost of physical experimental screens.

In future work, we aim to use our baseline Coherent Fusion model to fine-tune and predict for specific protein target types and binding sites. We believe introducing target specificity to the models, and thereby reducing the scope of the binding affinity prediction problem, will increase the value of relative differences in the models’ binding affinity predictions.

Our high-throughput architecture can also be accelerated, as the GPUs on each Lassen node were observed to be under-utilized. However, increasing the number of parallel data loaders was observed to decrease the overall stability of each individual evaluation job, indicating some refactoring is necessary.

Acknowledgements.
The authors gratefully acknowledge Professor Xinquan Wang (Tsinghua University) for providing early access to his crystal structure of the ACE2-RBD complex. The authors gratefully acknowledge extensive computer time provided by Livermore Computing. Part of this research was supported by the DOE Office of Science through the National Virtual Biotechnology Laboratory, a consortium of DOE national laboratories focused on response to COVID-19, with funding provided by the Coronavirus CARES Act. The authors thank Lawrence Livermore National Laboratory for funding Laboratory Directed Research and Development projects 20-ERD-065 and 20-ERD-062. Part of this research was also supported by the American Heart Association under CRADA TC02274-4 and the National Nuclear Security Administration through the Accelerating Therapeutics for Opportunities in Medicine (ATOM) Consortium under CRADA TC02349. This work was funded in part by DTRA under award HDTRA1036045. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. All work performed at Lawrence Livermore National Laboratory is performed under the auspices of the U.S. Department of Energy under Contract DE-AC52-07NA27344. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.

References

  • Q. U. Ain, A. Aleksandrova, F. D. Roessler, and P. J. Ballester (2015) Machine-learning scoring functions to improve structure-based binding affinity prediction and virtual screening. Wiley Interdisciplinary Reviews: Computational Molecular Science 5 (6), pp. 405–424. Cited by: §1, §2.1.
  • P. Ballester and J. Mitchell (2010) A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinformatics (Oxford, England) 26, pp. 1169–75. External Links: Document Cited by: §1, §2.1.
  • S. Belouzard, J. K. Millet, B. N. Licitra, and G. R. Whittaker (2012) Mechanisms of coronavirus cell entry mediated by the viral spike protein. Viruses 4 (6), pp. 1011–1033. Cited by: §1.
  • J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl (2011) Algorithms for hyper-parameter optimization. In 25th annual conference on neural information processing systems (NIPS 2011), Vol. 24. Cited by: §2.2.
  • P. Castillo, J. Carpio, J. Merelo, A. Prieto, V. Rivas, and G. Romero (2000) Evolving multilayer perceptrons. Neural Processing Letters 12 (2), pp. 115–128. Cited by: §2.2.
  • T. Cheng, Q. Li, Z. Zhou, Y. Wang, and S. H. Bryant (2012) Structure-based virtual screening for drug discovery: a problem-centric review. The AAPS journal 14 (1), pp. 133–141. Cited by: §1.
  • J. Cohen (1960) A coefficient of agreement for nominal scales. Educational and psychological measurement 20 (1), pp. 37–46. Cited by: §5.3.
  • K. L. Drew, H. Baiman, P. Khwaounjoo, B. Yu, and J. Reynisson (2012) Size estimation of chemical space: how big is it?. Journal of Pharmacy and Pharmacology 64 (4), pp. 490–495. Cited by: §1.
  • J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization.. Journal of machine learning research 12 (7). Cited by: Table 1.
  • S. R. Ellingson, B. Davis, and J. Allen (2020) Machine learning and ligand binding predictions: a review of data, methods, and obstacles. Biochimica et Biophysica Acta (BBA)-General Subjects 1864 (6), pp. 129545. Cited by: §3.1.
  • eMolecules (2021) EMolecules. External Links: Link Cited by: §4.
  • Enamine (2021) Enamine. External Links: Link Cited by: §4.
  • E. N. Feinberg, D. Sur, Z. Wu, B. E. Husic, H. Mai, Y. Li, S. Sun, J. Yang, B. Ramsundar, and V. S. Pande (2018) PotentialNet for molecular property prediction. ACS central science 4 (11), pp. 1520–1530. Cited by: §1, §2.1, §3.1, §3.3.1.
  • M. Fey and J. E. Lenssen (2019) Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428. Cited by: §2.2.
  • A. Gaulton, L. J. Bellis, A. P. Bento, J. Chambers, M. Davies, A. Hersey, Y. Light, S. McGlinchey, D. Michalovich, B. Al-Lazikani, et al. (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic acids research 40 (D1), pp. D1100–D1107. Cited by: §4.
  • J. Gomes, B. Ramsundar, E. N. Feinberg, and V. S. Pande (2017) Atomic convolutional networks for predicting protein-ligand binding affinity. arXiv preprint arXiv:1703.10603. Cited by: §2.1.
  • A. Graves (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: Table 1.
  • T. Hou, J. Wang, Y. Li, and W. Wang (2011) Assessing the performance of the molecular mechanics/Poisson–Boltzmann surface area and molecular mechanics/generalized Born surface area methods. II. The accuracy of ranking poses generated from docking. Journal of Computational Chemistry 32 (5), pp. 866–877. Cited by: §2.1.
  • M. Husken, J. E. Gayko, and B. Sendhoff (2000) Optimization for problem classes: neural networks that learn to learn. In 2000 IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks (Cat. No. 00), pp. 98–109. Cited by: §2.2.
  • F. Hutter, H. H. Hoos, and K. Leyton-Brown (2011) Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pp. 507–523. Cited by: §2.2.
  • IBM (2021) IBM Spectrum LSF V10.1 documentation. External Links: Link Cited by: §3.2, §4.2.
  • Anaconda, Inc. (2021) Anaconda software distribution. External Links: Link Cited by: §4.2.
  • M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al. (2017) Population based training of neural networks. arXiv preprint arXiv:1711.09846. Cited by: §2.2, §3.2.
  • A. Jakalian, B. L. Bush, D. B. Jack, and C. I. Bayly (2000) Fast, efficient generation of high-quality atomic charges. AM1-BCC model: I. Method. Journal of Computational Chemistry 21 (2), pp. 132–146. Cited by: §3.1.
  • J. Jiménez, M. Skalic, G. Martinez-Rosell, and G. De Fabritiis (2018) KDEEP: protein–ligand absolute binding affinity prediction via 3D-convolutional neural networks. Journal of Chemical Information and Modeling 58 (2), pp. 287–296. Cited by: §1, §2.1, §2.1, §3.1, §3.4, §3.4, Table 6.
  • D. Jones, H. Kim, X. Zhang, A. Zemla, G. Stevenson, W. F. D. Bennett, D. Kirshner, S. E. Wong, F. C. Lightstone, and J. E. Allen (2021) Improved protein–ligand binding affinity prediction with structure-based deep fusion inference. Journal of Chemical Information and Modeling (article ASAP). Note: PMID: 33754707 External Links: Document, Link, https://doi.org/10.1021/acs.jcim.0c01306 Cited by: §1, §1, §2.1, §2.1, §3.1, §3.1, §3.3.1, §3.3.1, §3.3.2, §3.3, §3.4, §3.4.
  • W. L. Jorgensen (2004) The many roles of computation in drug discovery. Science 303 (5665), pp. 1813–1818. Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Table 1, Table 4, Table 5.
  • D. B. Kitchen, H. Decornez, J. R. Furr, and J. Bajorath (2004) Docking and scoring in virtual screening for drug discovery: methods and applications. Nature Reviews Drug Discovery 3 (11), pp. 935–949. Cited by: §1.
  • G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter (2017) Self-normalizing neural networks. arXiv preprint arXiv:1706.02515. Cited by: §3.3.2, Table 1, Table 4, Table 5.
  • J. R. Landis and G. G. Koch (1977) The measurement of observer agreement for categorical data. Biometrics 33 (1), pp. 159–174. Cited by: §5.3.
  • E. Y. Lau, O. A. Negrete, B. J. Bennion, F. Bourguet, M. Franco, B. Harmon, S. He, D. Jones, H. Kim, D. Kirshner, V. Lao, J. Lo, K. McLoughlin, R. Mosesso, D. K. Murugesh, E. A. Saada, B. Segelke, G. A. Stevenson, M. W. Torres, D. Weilhammer, Y. Yang, A. Zemla, X. Zhang, F. Zhu, J. E. Allen, and F. C. Lightstone (2021) Discovery of small-molecule inhibitors of SARS-CoV-2 proteins using a computational and experimental pipeline. Frontiers in Molecular Biosciences (in review). Cited by: §1, §4.1, §4.2, §4, §5.1, §5.1, §5.3, §5.
  • A. Li, O. Spyra, S. Perel, V. Dalibard, M. Jaderberg, C. Gu, D. Budden, T. Harley, and P. Gupta (2019) A generalized framework for population based training. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1791–1799. Cited by: §2.2.
  • H. Li, J. Sun, Z. Xu, and L. Chen (2017) Multimodal 2D+3D facial expression recognition with deep fusion convolutional neural network. IEEE Transactions on Multimedia 19 (12), pp. 2816–2831. Cited by: §2.1.
  • Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2015) Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §3.3.1.
  • J. Lim, S. Ryu, K. Park, Y. J. Choe, J. Ham, and W. Y. Kim (2019) Predicting drug–target interaction using a novel graph neural network with 3D structure-embedded graph representation. Journal of Chemical Information and Modeling 59 (9), pp. 3981–3988. Cited by: §1, §3.1.
  • LLNL (2021a) Lawrence Livermore National Laboratory COVID-19 therapeutic design data portal. External Links: Link, Document Cited by: §5.
  • LLNL (2021b) LLNL’s Lassen supercomputer leaps to No. 10 on TOP500 list, Sierra remains No. 2. Note: https://www.llnl.gov/news/llnl’s-lassen-supercomputer-leaps-no-10-top500-list-sierra-remains-no-2. Accessed: 2020-03-06. Cited by: §2.2.
  • I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. Cited by: Table 1.
  • J. A. Maier, C. Martinez, K. Kasavajhala, L. Wickstrom, K. E. Hauser, and C. Simmerling (2015) ff14SB: improving the accuracy of protein side chain and backbone parameters from ff99SB. Journal of Chemical Theory and Computation 11 (8), pp. 3696–3713. Cited by: §3.1.
  • K. McLoughlin (2021) ATOM Modeling PipeLine (AMPL) predicted MM/GBSA values. Note: Personal Communication Cited by: §5.2.
  • P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, et al. (2018) Ray: a distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 561–577. Cited by: §2.2, §3.2.
  • N. M. O’Boyle, M. Banck, C. A. James, C. Morley, T. Vandermeersch, and G. R. Hutchison (2011) Open Babel: an open chemical toolbox. Journal of Cheminformatics 3 (1), pp. 1–14. Cited by: §3.1.
  • J. Parker-Holder, V. Nguyen, and S. J. Roberts (2020) Provably efficient online hyperparameter optimization with population-based bandits. Advances in Neural Information Processing Systems 33. Cited by: §2.2, §2.2, §3.2, §3.3.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703. Cited by: §2.2.
  • E. F. Pettersen, T. D. Goddard, C. C. Huang, G. S. Couch, D. M. Greenblatt, E. C. Meng, and T. E. Ferrin (2004) UCSF Chimera—a visualization system for exploratory research and analysis. Journal of Computational Chemistry 25 (13), pp. 1605–1612. Cited by: §3.1.
  • M. Ragoza, J. Hochuli, E. Idrobo, J. Sunseri, and D. R. Koes (2017) Protein–ligand scoring with convolutional neural networks. Journal of Chemical Information and Modeling 57 (4), pp. 942–957. Cited by: §1.
  • J. Reymond, R. Van Deursen, L. C. Blum, and L. Ruddigkeit (2010) Chemical space as a source for new drugs. MedChemComm 1 (1), pp. 30–38. Cited by: §1.
  • A. Roitberg, T. Pollert, M. Haurilet, M. Martin, and R. Stiefelhagen (2019) Analysis of deep fusion strategies for multi-modal gesture recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Cited by: §2.1.
  • A. Sergeev and M. D. Balso (2018) Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799. Cited by: §4.2.
  • J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical Bayesian optimization of machine learning algorithms. arXiv preprint arXiv:1206.2944. Cited by: §2.2.
  • J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. Adams (2015) Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, pp. 2171–2180. Cited by: §2.2.
  • N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger (2009) Gaussian process optimization in the bandit setting: no regret and experimental design. arXiv preprint arXiv:0912.3995. Cited by: §2.2.
  • M. M. Stepniewska-Dziubinska, P. Zielenkiewicz, and P. Siedlecki (2017) Pafnucy: a deep neural network for structure-based drug discovery. arXiv preprint arXiv:1712.07042. Cited by: §2.1, §2.1, §3.4, §3.4, Table 6.
  • T. Sterling and J. J. Irwin (2015) ZINC 15 – ligand discovery for everyone. Journal of Chemical Information and Modeling 55 (11), pp. 2324–2337. Note: PMID: 26479676 External Links: Document, Link, https://doi.org/10.1021/acs.jcim.5b00559 Cited by: §4, §5.1.
  • O. Trott and A. J. Olson (2010) AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of Computational Chemistry 31 (2), pp. 455–461. Cited by: §2.1, §3.1, §3.4.
  • S. Ullrich and C. Nitsche (2020) The SARS-CoV-2 main protease as drug target. Bioorganic & Medicinal Chemistry Letters, pp. 127377. Cited by: §1.
  • J. Wagner, V. Fischer, M. Herman, and S. Behnke (2016) Multispectral pedestrian detection using deep fusion convolutional neural networks. In ESANN, Vol. 587, pp. 509–514. Cited by: §2.1.
  • R. Wang, X. Fang, Y. Lu, and S. Wang (2004) The PDBbind database: collection of binding affinities for protein–ligand complexes with known three-dimensional structures. Journal of Medicinal Chemistry 47 (12), pp. 2977–2980. Note: PMID: 15163179 External Links: Document, Link, https://doi.org/10.1021/jm030580l Cited by: §3.1.
  • R. Wang, X. Fang, Y. Lu, C. Yang, and S. Wang (2005) The PDBbind database: methodologies and updates. Journal of Medicinal Chemistry 48 (12), pp. 4111–4119. Cited by: §2.1, §2.2.
  • S. Wong (2021) Large scale evaluation of MM-GB/SA rescoring on the PDBbind 2019 refined dataset. Note: Personal Communication Cited by: §1, §2.1, §3.1, §4.1, §5.3.
  • B. Xu, N. Wang, T. Chen, and M. Li (2015) Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853. Cited by: §3.3.2, Table 1.
  • F. Yang, J. Yang, Z. Jin, and H. Wang (2018) A fusion model for road detection based on deep learning and fully connected CRF. In 2018 13th Annual Conference on System of Systems Engineering (SoSE), pp. 29–36. Cited by: §2.1.
  • H. Zhang, L. Liao, K. M. Saravanan, P. Yin, and Y. Wei (2019) DeepBindRG: a deep learning based method for estimating effective protein–ligand affinity. PeerJ 7, pp. e7362. Cited by: §1, §2.1.
  • X. Zhang, H. Perez-Sanchez, and F. C. Lightstone (2017) A comprehensive docking and MM/GBSA rescoring study of ligand recognition upon binding antithrombin. Current Topics in Medicinal Chemistry 17 (14), pp. 1631–1639. Cited by: §3.1.
  • X. Zhang, S. E. Wong, and F. C. Lightstone (2013) Message passing interface and multithreading hybrid for parallel molecular docking of large databases on petascale high performance computing machines. Journal of Computational Chemistry 34 (11), pp. 915–927. Cited by: §1, §3.1.
  • X. Zhang, S. E. Wong, and F. C. Lightstone (2014) Toward fully automated high performance computing drug discovery: a massively parallel virtual screening pipeline for docking and molecular mechanics/generalized Born surface area rescoring to improve enrichment. Journal of Chemical Information and Modeling 54 (1), pp. 324–337. Cited by: §1, §3.1, §3.4, §4.1, §4.2.
  • J. Zhou, S. Li, L. Huang, H. Xiong, F. Wang, T. Xu, H. Xiong, and D. Dou (2020) Distance-aware molecule graph attention network for drug-target binding affinity prediction. arXiv preprint arXiv:2012.09624. Cited by: §1, §2.1.
  • F. Zhu, X. Zhang, J. E. Allen, D. Jones, and F. C. Lightstone (2020) Binding affinity prediction by pairwise function based on neural network. Journal of Chemical Information and Modeling 60 (6), pp. 2766–2772. Note: PMID: 32338892 External Links: Document, Link, https://doi.org/10.1021/acs.jcim.0c00026 Cited by: §1, §2.1.