The race to identify promising repurposing drug candidates against COVID-19 calls for improvements in the underlying property prediction methodology. The accuracy of many existing techniques depends heavily on access to reasonably large, uniform training data. Such high-throughput, on-target screening data is not yet publicly available for COVID-19. Indeed, only 48 drugs with measured in-vitro SARS-CoV-2 activity have been shared with the research community Jeon et al. (2020). This limited data scenario is not unique to the current pandemic but is likely to recur with each evolving or new viral challenge. The ability to make accurate predictions based on all the available data, however limited, is also helpful in guiding later high-throughput targeted experimental effort.
We can supplement scarce on-target data with other related data sources, either related screens pertaining to COVID-19 or screens involving related viruses. For instance, we can use additional data pertaining to molecular fragment screens that measure binding to SARS-CoV-2 main protease, obtained via crystallography screening Source (2020). On average, these fragments consist of only 14 atoms, comprising roughly 37% of full drug size molecules. Another source of data is SARS-CoV-1 screens. Since SARS-CoV-1 and SARS-CoV-2 proteases are similar (more than 79% sequence identity) Zhou et al. (2020), drugs screened against SARS-CoV-1 can be expected to be relevant for SARS-CoV-2 predictions. These two examples highlight the challenges for property prediction tools: much of the available training data comes from either different chemical space (molecular fragments) or different viral species (SARS-CoV-1).
The key technological challenge is to be able to estimate models that can extrapolate beyond their training data, e.g., to different chemical spaces. The ability to extrapolate implies a notion of invariance (being impervious) to the differences between the available training data and where predictions are sought. A recently proposed approach known as invariant risk minimization (IRM) Arjovsky et al. (2019) seeks to find predictors that are simultaneously optimal across different such scenarios (called environments). Indeed, the differences in chemical spaces can be thought of as "nuisance variation" that the predictor should be explicitly forced to ignore. One possible way to automatically define this type of environment variability for molecules is scaffolds Bemis and Murcko (1996). But the setting is challenging since scaffolds are combinatorial descriptors (substructures) and can potentially uniquely identify each compound in the training data. Useful environments for estimation should enjoy some statistical support.
In this paper we propose a novel variant of invariant risk minimization specifically tailored to rich, combinatorially defined environments typical in molecular contexts. Indeed, unlike in standard IRM, we introduce two dynamic (in contrast to many static) environments. These are defined over the same set of training examples, but differ in terms of their associated latent representations. The difference between them arises from continually adjusted perturbations that manipulate the latent representations of compounds towards more “generic” versions with the help of a scaffold classifier. The idea is to explicitly highlight to the property predictor that operates on these latent representations what the nuisance variability is that it should not rely on.
Our method is evaluated on existing SARS-CoV-2 screening data Source (2020); Jeon et al. (2020). The training utilizes three sources of data: SARS-CoV-2 screened molecules, SARS-CoV-2 fragments and SARS-CoV-1 screening data described above. We compare against multiple transfer learning techniques such as domain adversarial training (Ganin et al., 2016) and conditional domain adversarial network (Long et al., 2018). On two SARS-CoV-2 datasets, the proposed approach outperforms the best performing baseline with 8-16% relative AUROC improvement. Finally, we apply our model to the Broad Drug Repurposing Hub Corsello et al. (2017) and report the top 20 predictions for further investigation.
2 Domain Extrapolation
Training data in many emerging applications is necessarily limited, fragmented, or otherwise heterogeneous. It is therefore important to ensure that model predictions derived from such data generalize substantially beyond where the training samples lie. In other words, the trained model should have the ability to extrapolate. For instance, in computational chemistry, it is desirable for property prediction models to perform well in time-split scenarios where the evaluation concerns compounds that were created after those in the training set. Another way to simulate evaluation on future compounds is through a scaffold split Bemis and Murcko (1996). A scaffold split between training and test introduces some structural separation between the chemical spaces of the two sets of compounds, hence evaluating the model’s ability to extrapolate to a new domain.
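The scaffold split described above can be sketched as follows. This is a minimal illustration, assuming the scaffold of each compound has already been computed elsewhere (for example with a cheminformatics toolkit). The function name `scaffold_split`, the toy scaffold strings, and the largest-groups-first assignment rule are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2):
    """Split compound indices so that no scaffold appears in both sets.

    `scaffolds` holds one precomputed scaffold identifier per compound
    (hypothetical strings in this sketch).
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Assumption: fill the training set with the largest scaffold groups
    # first, so the test set is made of rarer scaffolds.
    ordered = sorted(groups.values(), key=len, reverse=True)
    train_budget = len(scaffolds) - round(len(scaffolds) * test_fraction)

    train, test = [], []
    for members in ordered:
        if len(train) + len(members) <= train_budget:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Toy scaffold labels (hypothetical), one per compound.
toy_scaffolds = ["benzene", "benzene", "benzene", "pyridine", "pyridine",
                 "indole", "furan", "benzene", "indole", "thiophene"]
train_idx, test_idx = scaffold_split(toy_scaffolds, test_fraction=0.3)
```

Because whole scaffold groups are assigned to one side, every test compound has a scaffold that was never seen during training, which is what makes the split a proxy for extrapolation.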
One way to ensure domain extrapolation is to enforce an appropriate invariance criterion during training. We envision here that the compounds can be divided into a potentially large number of domains or “environments” $E$, for example, based on their scaffold. The goal is then to learn a parametric mapping $\phi$ of compounds $X$ to their latent representations $Z = \phi(X)$ in a manner that satisfies the chosen invariance criterion. A number of such strategies relevant to extrapolation have been proposed. They can be roughly divided into the following three categories:
Domain adversarial training Ganin et al. (2016) requires the latent representation $Z = \phi(X)$ to have the same distribution across different domains $E$. If we denote by $P_e(X)$ the conditional distribution of compounds in environment $e$, then we want $P_e(\phi(X)) = P(\phi(X))$ for all $e$. With some abuse of notation, we can write this condition as $Z \perp E$. A single predictor is learned based on $Z = \phi(X)$, i.e., all the domains share the same predictor. As a result, the predicted label distribution $P(Y)$ will also be the same across the domains. This can be problematic when the training and test domains have very different label distributions Zhao et al. (2019). The independence condition itself can be challenging to satisfy when the chemical spaces overlap across the environments.
Conditional domain adaptation Long et al. (2018) relaxes the requirement that the label distributions must agree across the environments. The key idea is to condition the invariance criterion on the label. In other words, we require that $P_e(\phi(X) \mid y) = P(\phi(X) \mid y)$ for all $e$ and $y$, i.e., we aim to satisfy the independence statement $Z \perp E \mid Y$. The formulation allows the label distribution to vary between domains since $E$ and $Y$ can depend on each other. The constraint remains, however, too restrictive about the latent representation. To illustrate this, consider a simple case where the environments share the same chemical space and differ only in terms of proportions of different types of compounds in them. These type proportions play roles analogous to label proportions in domain adversarial training. Hence, the only way to achieve this independence would be if the proportions were the same across environments. To state the example differently, a functional mapping $Z = \phi(X)$ cannot fractionally assign probability mass placed on a compound $x$ onto different latent space locations $z$; it all has to be mapped to a single location. To reduce the impact of the strict condition, we would have to introduce a stochastic mapping $P(Z \mid X)$ in place of the simpler functional mapping $\phi(X)$, further complicating the approach.
Invariant risk minimization (IRM) Arjovsky et al. (2019) seeks a different notion of invariance, focusing less on aligning distributions of latent representations, and instead shifting the emphasis to how those representations can be consistently used for predictions. The IRM principle requires that the predictor $f$ operating on $Z = \phi(X)$ is simultaneously optimal across different environments or domains. For example, this holds if our representation explicates only features that are (causally) necessary for the correct prediction. How $\phi(X)$ is distributed across the environments is then immaterial. The associated conditional independence criterion is $Y \perp E \mid Z$. In other words, knowing the environment $E$ shouldn’t provide any additional information about $Y$ beyond the features $Z = \phi(X)$. The distribution of labels can differ across the environments.
While the IRM principle provides a natural framework for domain extrapolation, it needs to be extended in several ways for our setting. The main limitation of the original framework is that the environments themselves are fixed and pre-defined. Their role in the principle is to illustrate “nuisance” variation, i.e., variability that the predictor should learn not to rely on. In order to enforce the associated independence criterion, we need a fair number of examples within each such environment. The approach therefore becomes unsuitable when the natural environments such as scaffolds are combinatorially defined or otherwise have high cardinality. Indeed, we might have only a single molecule per environment in our training set, making the independence criterion vacuous (the environment $E$ would uniquely specify the compound $X$, thus also its representation $Z$ and label $Y$). A straightforward remedy for the high cardinality environments would be to introduce a coarser definition, and enforce the principle at this coarse level instead. Since environments represent constraints on the predictor, their role in estimation is adversarial. What is then the appropriate trade-off between such a coarser definition (relaxation of constraints) and our ability to predict? We side-step having to answer this question, and instead propose to dynamically map the large number of environments to just two. These two environments are designed to nevertheless highlight the nuisance variation the predictor should avoid but do so in a tractable manner.
3 IRM with adaptive environments
Our goal is to adaptively highlight to the predictor the type of variability that it ought not to rely on. We do this by replacing high cardinality environments such as those based on scaffold with just two new environments. These two new environments are unusual in the sense that they share the exact same set of examples. Indeed, they only differ in terms of the representation that the predictor operates on. The first environment simply corresponds to the representation we are trying to learn, i.e., $z = \phi(x)$, where the lowercase letters refer to specific instances rather than random variables. The second environment is defined in terms of a modified representation $\phi(x) + \delta(x)$ that is a perturbed version of $\phi(x)$ and constructed with the help of the environment or scaffold classifier. The goal of $\delta(x)$ is to explicate directions in the latent representation that the predictor should avoid paying attention to. While traditional IRM environments divide examples into environments, often exclusively, we instead exercise different latent representations over the same set of examples.
More formally, our two environments correspond to a choice of perturbation $\delta$ used to derive the latent representation from $\phi(x)$, i.e., $z = \phi(x) + \delta$. The associated target labels $y$ are clearly the same regardless of which perturbation (none or $\delta(x)$) was chosen. The key part of our approach pertains to how $\delta(x)$ is defined. To this end, let $g$ be a parametric environment classifier that we will instantiate in detail later. The associated classification loss is $\ell(e, g(\phi(x)))$, where $e$ is the correct original environment label (here a scaffold) for $x$. The scaffold classifier $g$ is evolved together with the feature mapping $\phi$ and the associated predictor $f$. We define the non-zero perturbation $\delta(x)$ in terms of the gradient:

$$\delta(x) = \alpha \, \nabla_z \, \ell\big(e, g(z)\big) \Big|_{z = \phi(x)} \tag{1}$$

where $\alpha$ is a step size parameter. The goal of this perturbation is to turn $z = \phi(x)$ into its “generic” version, which contains less information about the environment (e.g., scaffold). Note that if we were to perform adversarial domain alignment, $\delta(x)$ would represent a reverse gradient update to modify $\phi(x)$. We do not do that; instead, we use the perturbation to highlight directions of variability that the predictor should avoid within an overall IRM formulation. The degree to which $\phi$ is adjusted in response to $\delta$ arises from the IRM principle, not from a direct alignment objective.
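The gradient-based perturbation can be illustrated with a small sketch. It assumes a linear-softmax environment classifier with a cross-entropy loss, so that the gradient with respect to the latent vector has a closed form; the paper's classifier is a neural network, and the weights `W`, the latent vector `z`, and the step size below are hypothetical toy values.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def env_loss(weights, z, env):
    """Cross-entropy of a linear-softmax environment classifier g(z)."""
    logits = [sum(w * v for w, v in zip(row, z)) for row in weights]
    return -math.log(softmax(logits)[env])


def perturbation(weights, z, env, alpha):
    """delta = alpha * (gradient of the environment loss w.r.t. z), at z."""
    logits = [sum(w * v for w, v in zip(row, z)) for row in weights]
    probs = softmax(logits)
    # For softmax cross-entropy, d loss / d logits = probs - one_hot(env).
    residual = [p - (1.0 if k == env else 0.0) for k, p in enumerate(probs)]
    grad = [sum(residual[k] * weights[k][d] for k in range(len(weights)))
            for d in range(len(z))]
    return [alpha * component for component in grad]


# Toy classifier over three environments and a 2-d latent vector.
W = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
z = [1.0, 0.2]
delta = perturbation(W, z, env=0, alpha=0.1)
z_generic = [a + b for a, b in zip(z, delta)]

loss_before = env_loss(W, z, 0)
loss_after = env_loss(W, z_generic, 0)
```

Moving along the gradient increases the environment classification loss, so the perturbed vector `z_generic` is harder to attribute to its original environment than `z`.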
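We begin by building the overall training objective, which is then optimized in batches as described in Algorithm 1. Let $(x_i, y_i)$ be a pair consisting of a training example and the associated label to predict. Each $x_i$ also has an environment label/features given by $e_i = E(x_i)$ (the original mapping of examples to environments is assumed given and fixed, defined by $E$). The environment classifier $g$ is trained to minimize

$$\mathcal{L}_e(g \circ \phi) = \sum_i \ell\big(e_i, g(\phi(x_i))\big) \tag{2}$$

As we will explain later on, the environment classifier remains “unaware” of how the perturbation $\delta$ is derived on the basis of its predictions. The loss of the predictor $f$, now operating on $\phi_\delta(x) = \phi(x) + \delta$, where $\delta \in \{0, \delta(x)\}$, is defined as

$$\mathcal{L}(f \circ \phi_\delta) = \sum_i \ell\big(y_i, f(\phi_\delta(x_i))\big) \tag{3}$$

The specific form of the loss depends on the prediction task. In accordance with the IRM principle, we enforce that the predictor $f$ operating on $\phi_\delta(x)$ remains optimal whether its input is $\phi(x)$ or the perturbed version $\phi(x) + \delta(x)$. In other words, we require that

$$\mathcal{L}(f \circ \phi_\delta) \le \min_{f_\delta} \mathcal{L}(f_\delta \circ \phi_\delta) \quad \text{for } \delta \in \{0, \delta(x)\} \tag{4}$$

where $f_\delta$ is a predictor in the same parametric family as $f$ but trained separately with the knowledge of $\delta$ (perturbed or not). By relaxing the constraints via Lagrange multipliers, we express the overall training objective as

$$\min_{f, g, \phi} \; \max_{f_0, f_{\delta(x)}} \; \mathcal{L}(f \circ \phi) + \lambda_e \mathcal{L}_e(g \circ \phi) + \lambda \sum_{\delta \in \{0, \delta(x)\}} \Big[ \mathcal{L}(f \circ \phi_\delta) - \mathcal{L}(f_\delta \circ \phi_\delta) \Big] \tag{5}$$

This minimax objective is minimized with respect to $f$, $g$, and $\phi$, and maximized with respect to $f_0$, $f_{\delta(x)}$. A few remarks are necessary concerning this objective: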
Even though $\delta$ is defined on the basis of $\phi$ and the environment classifier $g$, we view it as a functionally independent player. The goal of $\delta$ is to enforce optimality of $f$ and therefore it plays an adversarial role relative to $f$. Similarly to GAN objectives where the discriminator has a separate objective function, different from the generator, we separate out $\delta$ as another player in an overall game theoretic objective. Specifically, $\delta$ takes input from $\phi$ and $g$ but does not inform them in return in back-propagation. (Incorporating this higher order dependence would not improve the empirical results.)
$\phi$ in our objective is adjusted to also help the auxiliary environment classifier. This is contrary to domain alignment where the goal would be to take out any dependence on the environment. The benefit in our formulation is two-fold. First, the environment classification term $\mathcal{L}_e(g \circ \phi)$ grounds $\phi$ also based on the auxiliary objective, helping it to retain useful information about each example $x$. Second, the term grounds and stabilizes the definition of $\delta$ as the gradient of the environment predictor since $g$ no longer approaches a random predictor. The gradient would be weak if $\phi(x)$ contained no information about the environment, as in domain alignment. Thus $\delta$ remains well-defined as a direction throughout the optimization.
The training procedure is shown in Algorithm 1.
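To make the structure of the relaxed objective concrete, the sketch below combines per-batch scalar losses into a single value. It reflects one plausible reading of the Lagrangian relaxation described above: a prediction loss, a weighted environment classification loss, and a weighted sum of optimality gaps between the shared predictor and the environment-specific predictors. The function name, argument names, and multiplier names are hypothetical, and the exact weighting used by the authors may differ.

```python
def relaxed_objective(loss_shared_clean, loss_shared_perturbed,
                      loss_specific_clean, loss_specific_perturbed,
                      loss_env, lam_env, lam_irm):
    """Combine scalar losses into a relaxed IRM-style objective (sketch).

    loss_shared_*   : loss of the shared predictor on each representation
    loss_specific_* : loss of the predictor trained only for that representation
    loss_env        : environment (scaffold) classification loss
    """
    # Each gap measures how far the shared predictor is from optimal in one
    # of the two environments (unperturbed and perturbed).
    gap_clean = loss_shared_clean - loss_specific_clean
    gap_perturbed = loss_shared_perturbed - loss_specific_perturbed
    return (loss_shared_clean
            + lam_env * loss_env
            + lam_irm * (gap_clean + gap_perturbed))


# Toy values (hypothetical): the shared predictor is optimal in both environments.
no_gap = relaxed_objective(0.5, 0.6, 0.5, 0.6, loss_env=0.3,
                           lam_env=0.1, lam_irm=1.0)
# Toy values (hypothetical): a specialised predictor does better on the perturbed input.
with_gap = relaxed_objective(0.5, 0.9, 0.5, 0.6, loss_env=0.3,
                             lam_env=0.1, lam_irm=1.0)
```

In this reading, the shared predictor, feature extractor and environment classifier would be updated to decrease the value, while the environment-specific predictors would be updated to increase it by lowering their own losses.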
3.1 Adapting the framework to molecule property prediction
In molecule property prediction, the training data is a collection of pairs $\{(x_i, y_i)\}$, where $x_i$ is a molecular graph and $y_i$ is its activity score, typically binary (active/inactive). The feature extractor $\phi$ is a graph convolutional network (GCN) which translates a molecular graph into a continuous vector through directed message passing operations Yang et al. (2019). The predictor $f$ is a feed-forward network that takes $\phi(x)$ or $\phi(x) + \delta(x)$ as input and yields predicted activity $\hat{y}$.
The original environment of each compound $x_i$ is defined as its Murcko scaffold $s_i$ Bemis and Murcko (1996), which is a subgraph of $x_i$. Since a scaffold is a combinatorial object with a large vocabulary of possible values, we define and train the environment classifier in a contrastive fashion Oord et al. (2018). Specifically, for a given molecule $x_i$ with scaffold $s_i$, we randomly sample $n$ other molecules and take their associated scaffolds $\{s_k\}$ as negative examples, forming the contrastive set $C$. The environment classifier $g$ makes use of a feed-forward network $h$ that maps each compound or scaffold (subgraph) representation to a feature vector. The probability that $x_i$ is mapped to its correct scaffold $s_i$ is then defined as

$$P(s_i \mid x_i, C) = \frac{\exp\big(\cos(h(\phi(x_i)), h(\phi(s_i)))\big)}{\sum_{s_k \in C \cup \{s_i\}} \exp\big(\cos(h(\phi(x_i)), h(\phi(s_k)))\big)} \tag{6}$$

where $\cos(\cdot, \cdot)$ stands for cosine similarity. In practice, we use the molecules within the same batch as negative examples.
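The contrastive scaffold probability can be sketched in a few lines. The example assumes that feature vectors for the molecule and for each scaffold are already available (in the paper they come from a neural network), and that the correct scaffold competes against the negatives inside a softmax over cosine similarities. The function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def scaffold_probability(mol_vec, true_scaffold_vec, negative_scaffold_vecs):
    """Softmax over cosine similarities; index 0 is the correct scaffold.

    Assumption: the denominator includes the correct scaffold together
    with the sampled negatives.
    """
    candidates = [true_scaffold_vec] + list(negative_scaffold_vecs)
    scores = [math.exp(cosine(mol_vec, c)) for c in candidates]
    return scores[0] / sum(scores)


# Toy feature vectors (hypothetical).
molecule = [1.0, 0.0]
own_scaffold = [2.0, 0.0]                    # cosine similarity 1
other_scaffolds = [[0.0, 1.0], [-1.0, 0.0]]  # cosine similarities 0 and -1
p_correct = scaffold_probability(molecule, own_scaffold, other_scaffolds)
```

Because cosine similarity is bounded in [-1, 1], the probability cannot saturate at 0 or 1. Implementations of contrastive losses often add a temperature parameter for this reason, but none is mentioned in the text, so none is used here.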
Our experiments consist of two settings. To compare our method with existing transfer learning techniques, we first evaluate our method on a standard unsupervised transfer setup. All the models are trained on SARS-CoV-1 data and tested on SARS-CoV-2 compounds. Next, in order to identify drug candidates for SARS-CoV-2, we extend our method by incorporating labeled SARS-CoV-2 data to maximize prediction accuracy and perform virtual screening over the Broad Drug Repurposing Hub Corsello et al. (2017).
Our training data consist of three screens related to SARS-CoV. All the data can be found at https://github.com/yangkevin2/coronavirus_data.
SARS-CoV-2 Mpro inhibition: 881 fragments screened for SARS-CoV-2 main protease (Mpro), collected by the Diamond Light Source group Source (2020). The dataset contains 78 hits.
SARS-CoV-2 antiviral activity: 48 FDA-approved drugs screened for antiviral activity against SARS-CoV-2 in vitro Jeon et al. (2020), including reference drugs such as Remdesivir, Lopinavir and Chloroquine. The dataset contains 27 hits.
SARS-CoV-1 3CLpro inhibition: Over 290K molecules screened for activity against SARS-CoV-1 3C-like protease (3CLpro) in PubChem AID1706 assay. There are 405 active compounds.
We compare the proposed approach with the following baselines:
Direct transfer: We train a GCN on SARS-CoV-1 data and directly test it on SARS-CoV-2 data.
Domain adversarial training (DANN) Ganin et al. (2016): Since the distribution of molecules is different between the SARS-CoV-1 and SARS-CoV-2 datasets, we use domain adversarial training to facilitate transfer. Specifically, we augment our GCN with an additional domain classifier to force the distribution of the latent representation $\phi(X)$ to be the same across the training (SARS-CoV-1) and test (SARS-CoV-2) sets.
Conditional adversarial domain adaptation (CDAN) Long et al. (2018): This method conditions the domain classifier on the predicted labels $\hat{y}$. In particular, we adopt their multilinear conditioning strategy: the input to the domain classifier becomes a vector outer-product $\phi(x) \otimes \hat{y}$, which has the same dimension as $\phi(x)$ for binary classification tasks.
Scaffold adversarial training (SANN): This is an extension of DANN where the domain classifier is replaced with our scaffold classifier $g$ in Eq.(6). SANN seeks to learn a scaffold-invariant representation through the following minimax game ($\mathcal{L}_e$ is the scaffold classification loss):

$$\min_{f, \phi} \; \max_{g} \; \mathcal{L}(f \circ \phi) - \lambda \mathcal{L}_e(g \circ \phi)$$
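Invariant risk minimization (IRM): The original IRM Arjovsky et al. (2019) requires the predictor to be constant, which does not work well in our setting. Therefore, we adopt an adversarial formulation for IRM proposed in Chang et al. (2020), allowing us to use powerful neural predictors:

$$\min_{f, \phi} \; \max_{\{f_e\}} \; \mathcal{L}(f \circ \phi) + \lambda \sum_{e} \Big[ \mathcal{L}^{e}(f \circ \phi) - \mathcal{L}^{e}(f_e \circ \phi) \Big]$$

Here $\mathcal{L}^{e}$ denotes the prediction loss restricted to environment $e$, and each of the environments $e$ consists of molecules with the same scaffold. Since the number of environments is large, we impose parameter sharing among the competing predictors $f_e$. Specifically, the input of $f_e$ is a concatenation of $\phi(x)$ and a one-hot encoding of $e$.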
For our model, we set and perturbation learning rate , which worked well across all experiments. All methods are trained with Adam using its default configuration. Our GCN implementation is based on chemprop Yang et al. (2019).
For unsupervised transfer, we use the default chemprop hyper-parameter setting. For all methods, the GCN has three layers with hidden layer dimension 300. The predictor is a two-layer MLP.
For supervised transfer, we perform hyper-parameter optimization to identify the best architecture for the multitask GCN. The GCN has two layers with hidden layer dimension 2000. The predictor is a three-layer MLP. The dropout rate is . For fair comparison, all the methods use the same architecture in this setting.
4.1 Unsupervised transfer
Setup Our model is a single-task binary classification model which predicts the SARS-CoV-1 3CLpro inhibition. After training, the model is tested on SARS-CoV-2 Mpro and antiviral data. Each model is evaluated under five independent runs and we report the average AUROC score.
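The evaluation protocol (AUROC averaged over independent runs) can be sketched as below. AUROC is computed directly from its pairwise definition, which is adequate for small datasets such as the 48-drug antiviral screen. The function names, labels and scores are hypothetical toy values, not results from the paper.

```python
def auroc(labels, scores):
    """Probability that a random active outranks a random inactive.

    Ties between an active and an inactive count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def mean_auroc(labels, scores_per_run):
    """Average AUROC over several independently trained models."""
    values = [auroc(labels, scores) for scores in scores_per_run]
    return sum(values) / len(values)


# Toy labels and predictions from two runs (hypothetical).
toy_labels = [1, 0, 1, 0]
toy_runs = [[0.9, 0.1, 0.8, 0.3],
            [0.6, 0.7, 0.8, 0.2]]
average = mean_auroc(toy_labels, toy_runs)
```

The pairwise form costs quadratic time in the number of compounds; for larger test sets a rank-based implementation from a standard library would be the usual choice.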
Results Our results are shown in Table 1. The proposed method significantly outperformed all the baselines, especially on the Mpro inhibition prediction dataset (0.756 versus 0.653 AUROC).
Indeed, the improvement of our model comes from two sources: the additional auxiliary task and the IRM principle. To show the individual contribution of each component, we conduct an ablation study of our method without the IRM principle. The loss function in this case is the scaffold classification loss plus the property prediction loss. The performance of this method is shown at the end of Table 1 (“without IRM”). The auxiliary scaffold classifier alone yields a substantial improvement, but is still inferior to our full model trained with the IRM principle.
|Method|Mpro inhibition AUC|Antiviral activity AUC|
|DANN Ganin et al. (2016)| | |
|CDAN Long et al. (2018)| | |
|IRM Arjovsky et al. (2019)| | |
|- without IRM| | |
4.2 Drug repurposing for SARS-CoV-2
Setup We extend all the methods to multitask binary classification models that predict three different properties for each new compound: 1) probability of inhibiting the SARS-CoV-2 Mpro; 2) antiviral activity against SARS-CoV-2; 3) probability of inhibiting SARS-CoV-1 3CLpro.
Each model is evaluated under 5-fold cross validation with the same splits. In each fold, the training set contains the SARS-CoV data and 60% of the SARS-CoV-2 data (Mpro + antiviral), and the test set contains the remaining 40% of the SARS-CoV-2 compounds. We report the mean and standard deviation of the AUROC score evaluated on the five folds.
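A minimal sketch of this protocol follows. It assumes the five folds are independent random 60%/40% partitions of the SARS-CoV-2 compounds generated from a fixed seed, so that every method sees identical splits; the text does not say how the folds were drawn, and the function names and example scores are hypothetical.

```python
import random
import statistics


def make_splits(n_compounds, n_folds=5, train_fraction=0.6, seed=0):
    """Return (train_indices, test_indices) pairs, reproducible via `seed`."""
    rng = random.Random(seed)
    n_train = round(n_compounds * train_fraction)
    splits = []
    for _ in range(n_folds):
        order = list(range(n_compounds))
        rng.shuffle(order)
        splits.append((sorted(order[:n_train]), sorted(order[n_train:])))
    return splits


def summarize(fold_scores):
    """Mean and sample standard deviation of per-fold scores."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)


# 48 compounds, matching the size of the antiviral screen.
splits = make_splits(n_compounds=48)

# Hypothetical per-fold scores, only to show the summary step.
mean_score, std_score = summarize([0.5, 0.7])
```

Fixing the seed is what guarantees "the same splits" across methods: calling `make_splits` again with the same arguments returns identical index lists.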
Results Our results are shown in Table 2. The proposed method significantly outperformed the two baselines, especially on the antiviral activity prediction dataset (0.89 versus 0.82 AUROC). As an ablation study, we also trained a GCN on only SARS-CoV-2 data (the first row in Table 2). Indeed, the multitask GCN trained with additional SARS-CoV-1 data performs better (0.807 vs 0.740 AUROC on antiviral prediction), indicating that the two viruses are closely related.
|Method|CoV-1|CoV-2|Mpro inhibition AUC|Antiviral activity AUC|
|IRM Arjovsky et al. (2019)| | | | |
The best model is then used to predict the SARS-CoV-2 Mpro inhibition and antiviral activity of compounds in the Broad Drug Repurposing Hub. In order to utilize the maximal amount of labeled data, the model is re-trained under 10-fold cross validation with a 90%/10% split (instead of 60%/40%). The resulting 10 models are combined into an ensemble to predict properties for new compounds. We report the top 20 predicted molecules for Mpro inhibition and antiviral activity in Tables 3 and 4.
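The ensemble-and-rank step can be sketched as follows. The text does not specify how the 10 models are combined, so simple averaging of predicted probabilities is assumed here; the function name, compound names and scores are hypothetical.

```python
def ensemble_top_k(predictions_per_model, names, k=20):
    """Average per-model scores and return the k highest-ranked compounds.

    predictions_per_model[m][i] is model m's predicted probability for
    compound i (assumption: models are combined by a plain mean).
    """
    n_models = len(predictions_per_model)
    averaged = [sum(model[i] for model in predictions_per_model) / n_models
                for i in range(len(names))]
    ranked = sorted(zip(names, averaged), key=lambda pair: pair[1],
                    reverse=True)
    return ranked[:k]


# Toy library of three compounds scored by two models (hypothetical).
toy_names = ["compound_a", "compound_b", "compound_c"]
toy_predictions = [[0.2, 0.9, 0.5],
                   [0.4, 0.7, 0.7]]
top_two = ensemble_top_k(toy_predictions, toy_names, k=2)
```

With `k=20` and one score list per cross-validation model, the same function would produce a top-20 ranking of the kind reported for the repurposing library.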
In this paper, we investigate existing domain extrapolation paradigms and their limitations. To allow the method to extrapolate across combinatorially many environments, we propose a new method which complements invariant risk minimization with adaptive environments. The method is evaluated on molecule property prediction tasks and shows significant improvements over strong baselines.
- Arjovsky et al. (2019). Invariant risk minimization. arXiv preprint arXiv:1907.02893.
- Bemis and Murcko (1996). The properties of known drugs. 1. Molecular frameworks. Journal of Medicinal Chemistry 39 (15), pp. 2887–2893.
- Chang et al. (2020). Invariant rationalization. arXiv preprint arXiv:2003.09772.
- Corsello et al. (2017). The drug repurposing hub: a next-generation drug library and information resource. Nature Medicine 23 (4), pp. 405–408.
- Ganin et al. (2016). Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030.
- Jeon et al. (2020). Identification of antiviral drug candidates against SARS-CoV-2 from FDA-approved drugs. bioRxiv.
- Long et al. (2018). Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pp. 1640–1650.
- Oord et al. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- Source (2020) (Diamond Light Source). SARS-CoV-2 main protease structure and XChem fragment screen. Note: www.diamond.ac.uk/covid-19/for-scientists/Main-protease-structure-and-XChem
- Yang et al. (2019). Analyzing learned molecular representations for property prediction. Journal of Chemical Information and Modeling 59 (8), pp. 3370–3388.
- Zhao et al. (2019). On learning invariant representation for domain adaptation. arXiv preprint arXiv:1901.09453.
- Zhou et al. (2020). A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579 (7798), pp. 270–273.