Property-aware Adaptive Relation Networks for Molecular Property Prediction

07/16/2021, by Yaqing Wang et al., Baidu, Inc.

Molecular property prediction plays a fundamental role in drug discovery, where the goal is to identify candidate molecules with target properties. However, molecular property prediction is essentially a few-shot problem, which makes it hard to train regular machine learning models. In this paper, we propose a property-aware adaptive relation network (PAR) for the few-shot molecular property prediction problem. In comparison to existing works, we leverage the fact that both the relevant substructures and the relationships among molecules differ across molecular properties. Our PAR is compatible with existing graph-based molecular encoders, and is further equipped with the ability to obtain property-aware molecular embeddings and model the molecular relation graph adaptively. The resultant relation graph also facilitates effective label propagation within each task. Extensive experiments on benchmark molecular property prediction datasets show that our method consistently outperforms state-of-the-art methods and can obtain property-aware molecular embeddings and model molecular relation graphs properly.


1 Introduction

Drug discovery is an important biomedical task which aims to find new potential medical compounds with desired properties, such as better absorption, distribution, metabolism, and excretion (ADME), low toxicity, and active pharmacological activity [Rohrer and Baumann2009, Abbasi et al.2019, Altae-Tran et al.2017]. On average, drug discovery takes more than 2 billion dollars and at least 10 years, while the clinical success rate is around 10% [Paul et al.2010, Leelananda and Lindert2016, Zhavoronkov et al.2019]. To speed up this process, quantitative structure property/activity relationship (QSPR/QSAR) modeling uses machine learning methods to establish the connection between molecular structures and particular properties [Dahl et al.2014]. It usually consists of two components: a molecular encoder, which encodes the molecular structure as a fixed-length representation, and a predictor, which estimates the activity of a certain property based on that representation. The resulting predictive models can be leveraged in virtual screening to discover potential molecules more efficiently [Guo et al.2021]. However, molecular property prediction is essentially a few-shot problem, which makes it hard to solve. Only a small number of candidate molecules pass virtual screening to be evaluated in the lead optimization stage of drug discovery [Rong et al.2020]. After a series of wet-lab experiments, most candidates still fail to become potential drugs due to lacking desired properties [Dahl et al.2014]. Together, these factors result in a limited amount of labeled data [Nguyen et al.2020].

Recently, few-shot learning (FSL) methods [Altae-Tran et al.2017, Guo et al.2021, Wang et al.2020] have emerged for molecular property prediction. They aim to learn a predictor from a set of property prediction tasks and generalize to new properties given only a few labeled molecules. As molecules can be naturally represented as graphs, graph-based molecular representation learning methods use graph neural networks (GNNs) [Kipf and Welling2016, Hamilton et al.2017] to obtain graph-level representations as molecular embeddings. Specifically, the pioneering IterRefLSTM [Altae-Tran et al.2017] adopts a GNN as the molecular encoder and modifies a classic FSL method [Vinyals et al.2016] proposed for image classification to handle molecular property prediction tasks. The recent Meta-MGNN [Guo et al.2021] leverages a GNN pretrained on large-scale self-supervised tasks as the molecular encoder and introduces auxiliary tasks, such as bond reconstruction and atom type prediction, to be jointly optimized with the molecular property prediction tasks. Finally, DTCR [Abbasi et al.2019] is particularly designed for few-shot transfer learning across datasets via adversarial learning.

However, the aforementioned methods neglect two key facts in molecular property prediction. The first fact is that different molecular properties are attributed to different molecular substructures, as found by previous QSPR studies [Ajmani et al.2009], yet IterRefLSTM and Meta-MGNN use graph-based molecular encoders that encode molecules regardless of the target property, whose relevant substructures can differ dramatically. The second fact is that the relationships among molecules also vary with the target property. This can be commonly observed in benchmark molecular property prediction datasets, as shown in Figure 1. However, existing works fail to model the relation graph among molecules.

Molecule labels (SR-):
ID     SMILES                                     HSE  MMP
Mol-1  c1ccc2sc(SNC3CCCCC3)nc2c1                  1    1
Mol-2  Cc1cccc(/N=N/c2ccc(N(C)C)cc2)c1            0    1
Mol-3  C=C(C)[C@H]1CN[C@H](C(=O)O)[C@H]1CC(=O)O   0    0
Mol-4  O=C(c1ccccc1)C1CCC1                        1    0
Figure 1: Illustrative examples of relation graphs for the same molecules in two tasks of Tox21. Red (blue) edges mean the connected molecules are both active (inactive) on the target property.

To handle these problems, we propose a property-aware adaptive relation network (PAR), which is compatible with existing graph-based molecular encoders and is further equipped with the ability to obtain property-aware molecular embeddings and model the molecular relation graph adaptively. Specifically, our contributions can be summarized as follows:


  • We propose a property-aware embedding function which co-adapts each molecular embedding with class prototypes and further projects it to a substructure-aware space w.r.t. the target property.

  • We propose an adaptive relation graph learning module to jointly estimate molecular relation graph and refine molecular embeddings w.r.t. the target property, such that the limited labels can be efficiently propagated between similar molecules.

  • We propose a new training strategy: we only fine-tune the property-aware embedding function and the final classifier while keeping the other parts of PAR (the graph-based molecular encoder and the adaptive relation graph learning module) fixed within each task. We show this is particularly helpful for separately capturing the generic knowledge shared across different tasks and the knowledge specific to each property.

  • We conduct extensive empirical studies on real molecular property prediction datasets. Results show that PAR consistently outperforms state-of-the-art methods. Further model analysis shows that PAR can obtain property-aware molecular embeddings and model the molecular relation graph properly.

Notation.

We denote vectors by lowercase boldface, matrices by uppercase boldface, and sets by uppercase calligraphic font. For a vector $\mathbf{x}$, $[\mathbf{x}]_i$ denotes its $i$th element. For a matrix $\mathbf{X}$, $[\mathbf{X}]_{i:}$ denotes its $i$th row and $[\mathbf{X}]_{ij}$ denotes its $(i,j)$th entry. The superscript $^\top$ denotes the transpose operation.

2 Review: Graph Neural Networks (GNNs)

A graph neural network (GNN) can learn expressive node/graph representations from the topological structure and associated features of a graph via neighborhood aggregation [Kipf and Welling2016, Gilmer et al.2017, Hu et al.2020]. Consider a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ with node feature $\mathbf{x}_v$ for each node $v\in\mathcal{V}$ and edge feature $\mathbf{e}_{vu}$ for each edge $(v,u)\in\mathcal{E}$. At the $l$-th layer, the GNN updates the embedding $\mathbf{h}_v^{(l)}$ of node $v$ as:

$\mathbf{h}_v^{(l)} = \mathrm{UPDATE}^{(l)}\big(\mathbf{h}_v^{(l-1)},\ \mathrm{AGGREGATE}^{(l)}(\{(\mathbf{h}_v^{(l-1)}, \mathbf{h}_u^{(l-1)}, \mathbf{e}_{vu}): u\in\mathcal{N}(v)\})\big),$

where $\mathcal{N}(v)$ is the set of neighbors of $v$. Existing GNNs differ in the design of the aggregation function $\mathrm{AGGREGATE}^{(l)}$ and the update function $\mathrm{UPDATE}^{(l)}$. After $L$ iterations of aggregation, the graph-level representation is obtained as $\mathbf{g} = \mathrm{READOUT}(\{\mathbf{h}_v^{(L)}: v\in\mathcal{V}\})$, where $\mathrm{READOUT}$ is a function that aggregates all node embeddings into a graph embedding, such as summation [Xu et al.2018].
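To make the scheme concrete, below is a minimal PyTorch sketch of one such layer with sum-aggregation and linear message/update functions; the names (SimpleMPLayer, readout) and this particular instantiation are illustrative, not a specific model from the literature.

```python
# A minimal message-passing layer: AGGREGATE neighbor messages by sum,
# then UPDATE each node embedding with a learned transformation.
import torch
import torch.nn as nn

class SimpleMPLayer(nn.Module):
    def __init__(self, node_dim, edge_dim):
        super().__init__()
        self.msg = nn.Linear(2 * node_dim + edge_dim, node_dim)  # message function
        self.upd = nn.Linear(2 * node_dim, node_dim)             # update function

    def forward(self, h, edge_index, edge_attr):
        # h: [num_nodes, node_dim]; edge_index: [2, num_edges]; edge_attr: [num_edges, edge_dim]
        src, dst = edge_index
        m = torch.relu(self.msg(torch.cat([h[dst], h[src], edge_attr], dim=-1)))
        agg = torch.zeros_like(h).index_add_(0, dst, m)           # sum-AGGREGATE per node
        return torch.relu(self.upd(torch.cat([h, agg], dim=-1)))  # UPDATE

def readout(h):
    # Graph-level READOUT by summation over all node embeddings (one graph).
    return h.sum(dim=0)
```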

Our paper is related to GNNs in two aspects: we obtain molecular representations via a graph-based molecular encoder, and we capture an adaptive relation graph among molecules via graph structure learning.

2.1 Graph-based Molecular Representation Learning

Representing molecules properly as fixed-length vectors is vital to the success of downstream biomedical applications [Gawehn et al.2016]. Recently, graph-based molecular representation learning methods have become popular and obtain state-of-the-art performance. A molecule is represented as an undirected graph, where each node represents an atom with feature $\mathbf{x}_v$ and each edge represents a bond between two atoms with feature $\mathbf{e}_{vu}$. Graph-based molecular representation learning methods use GNNs to obtain the graph-level representation as the molecular embedding. Examples include graph convolutional networks (GCN) [Duvenaud et al.2015], graph attention networks (GAT) [Veličković et al.2017], message passing neural networks (MPNN) [Gilmer et al.2017], the Graph Isomorphism Network (GIN) [Xu et al.2018], Pretrained GNN (Pre-GNN) [Hu et al.2019] and GROVER [Rong et al.2020].

In FSL methods for molecular property prediction, IterRefLSTM [Altae-Tran et al.2017] uses a GCN while Meta-MGNN [Guo et al.2021] uses the pretrained Pre-GNN. These graph-based molecular encoders alone cannot discover the molecular substructures corresponding to the target property. Although there exist GNNs which handle subgraphs [Monti et al.2018, Alsentzer et al.2020, Fu et al.2020], they require predefined subgraphs, while discovering and enumerating molecular substructures is extremely hard even for domain experts [Ajmani et al.2009, Yu et al.2013]. In this paper, we first obtain molecular embeddings using graph-based molecular encoders, then learn to extract relevant substructure embeddings w.r.t. the target property on top of these generic molecular embeddings, which is more effective and improves performance.

2.2 Graph Structure Learning

As the provided graphs may not be optimal, a number of graph structure learning methods aim to jointly learn the graph structure and node embeddings [Zhu et al.2021, Chen et al.2020]. In general, they iterate over two procedures: estimating an adjacency matrix which encodes the graph structure (i.e., refining each node's neighborhood) from the current node embeddings, and applying a GNN on this learned graph to obtain new node embeddings.

In FSL, there exist works [Garcia and Bruna2018, Liu et al.2018, Kim et al.2019, Yang et al.2020, Rodríguez et al.2020] which learn to construct a fully connected relation graph among images in an $N$-way-$K$-shot few-shot image classification task. Their methods cannot work for 2-way-$K$-shot property prediction tasks, where choosing a wrong neighbor from the other class will heavily deteriorate the quality of the molecular embeddings. Although we share the same spirit of learning a relation graph, we introduce several regularizations to encourage our adaptive property-aware relation graph learning module to select correct neighbors effectively.

3 Proposed Method

In this section, we present the details of PAR, whose overall architecture is shown in Figure 2. Considering the few-shot molecular property prediction problem, we first obtain property-aware molecular embeddings via a specially designed embedding function, and then propagate the limited labels on the adaptive molecular relation graph whose structure is jointly optimized with the molecular embeddings. To optimize PAR, we introduce a new training strategy that separately models generic and property-aware knowledge. Finally, we show that PAR can be easily extended to handle few-shot transfer learning across datasets.

Figure 2: The architecture of the proposed PAR framework. We illustrate using a 2-way-$K$-shot task from Tox21. PAR is optimized over different molecular property prediction tasks. Within each task $\mathcal{T}_t$, the modules with dotted lines are fine-tuned on the support set and those with solid lines are fixed.

3.1 Problem Definition

As defined in [Altae-Tran et al.2017, Guo et al.2021], a few-shot molecular property prediction task $\mathcal{T}_t$ is formulated as a 2-way-$K$-shot classification task with a support set $\mathcal{S}_t$, which provides $K$ labeled molecules per class with $K$ small, and a query set $\mathcal{Q}_t$. Each task corresponds to an experimental assay testing whether each molecule is active ($y=1$) or inactive ($y=0$) on a target property.

3.2 Property-aware Molecular Embedding

Our molecular encoder consists of (i) a graph-based molecular encoder, which is trained on large-scale tasks to capture generic information, and (ii) a property-aware embedding function, which adapts the generic molecular embeddings to be property-aware.

Recent advances in graph-based molecular encoders, such as pretrained molecular encoders [Hu et al.2019, Rong et al.2020], make it possible to encode generic knowledge into molecular embeddings by learning across tasks. Thus, we first obtain a generic molecular embedding $\mathbf{g}_i$ of each molecule $\mathbf{x}_i$ using an existing graph-based molecular encoder introduced in Section 2, such as Pre-GNN [Hu et al.2019]. The parameters of this graph-based molecular encoder are optimized across large-scale tasks.

However, existing graph-based molecular encoders cannot capture property-aware substructures, as discussed above. When learning across tasks, a molecule will be evaluated for multiple properties, which leads to a one-to-many relationship between a molecule and properties. Thus, we are motivated to implicitly capture substructures in the embedding space w.r.t. the target property of each task $\mathcal{T}_t$. Let $\mathbf{c}_k$ denote the class prototype for class $k$, computed as the average of the generic embeddings $\mathbf{g}_j$ of the molecules in $\mathcal{S}_t$ whose label $y_j = k$. We model the context for molecule $i$ as the matrix $\mathbf{P}_i$ whose rows are $\mathbf{g}_i$, $\mathbf{c}_0$ and $\mathbf{c}_1$. We then transform $\mathbf{g}_i$ into the property-aware embedding $\mathbf{p}_i$ by:

$\mathbf{p}_i = \mathrm{MLP}\big([\mathrm{Attention}(\mathbf{P}_i)]_{1:}\big),   (1)$

where $[\cdot]_{1:}$ extracts the first row vector, which corresponds to $\mathbf{g}_i$. The MLP denotes a multilayer perceptron, which is used to find a lower-dimensional space encoding the substructures that are more relevant to the target property of $\mathcal{T}_t$. $\mathrm{Attention}(\cdot)$ is computed using scaled dot-product self-attention [Vaswani et al.2017], such that each $\mathbf{g}_i$ can be compared with the class prototypes dimension-wise. The contextualized $\mathbf{p}_i$ is property-aware and thus more predictive of the target property.
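The following is a minimal PyTorch sketch of this embedding step, assuming the notation above; the module name and the use of nn.MultiheadAttention (whose core is scaled dot-product attention) are our illustrative choices, not the authors' code.

```python
# Sketch of Eq. (1): stack each generic embedding with the two class
# prototypes, apply self-attention, keep the molecule's own row, project.
import torch
import torch.nn as nn

class PropertyAwareEmbedding(nn.Module):
    def __init__(self, emb_dim, out_dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(emb_dim, num_heads=1, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(emb_dim, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, g, support_g, support_y):
        # g: [n, emb_dim] embeddings to adapt; support_g/support_y: labeled support set.
        c0 = support_g[support_y == 0].mean(dim=0)   # prototype of the inactive class
        c1 = support_g[support_y == 1].mean(dim=0)   # prototype of the active class
        ctx = torch.stack([g, c0.expand_as(g), c1.expand_as(g)], dim=1)  # [n, 3, emb_dim]
        out, _ = self.attn(ctx, ctx, ctx)            # scaled dot-product self-attention
        return self.mlp(out[:, 0])                   # row corresponding to the molecule
```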

3.3 Adaptive Relation Graph Among Molecules

Apart from relevant substructures, the relationships among molecules also vary across properties. As shown in Figure 1, two molecules that share one property can differ from each other on another property [Rohrer and Baumann2009, Kuhn et al.2016, Richard et al.2016]. In this section, we introduce an adaptive relation graph learning module to capture and leverage this property-aware relation graph among molecules, such that the limited labels can be efficiently propagated between similar molecules.

Let $\mathbf{A}$ denote the adjacency matrix encoding the relation graph among the molecules in $\mathcal{S}_t$ plus a single molecule in $\mathcal{Q}_t$, where $[\mathbf{A}]_{ij} > 0$ if nodes $i$ and $j$ are connected. Ideally, the similarity between the property-aware molecular embeddings of two molecules reveals their relationship under the current property prediction task. Hence we initially set $\mathbf{b}_i^{(0)} = \mathbf{p}_i$.

At the $l$th iteration, we first calculate the similarity between molecules $i$ and $j$ using the current molecular embeddings:

$[\mathbf{S}^{(l)}]_{ij} = \mathrm{MLP}\big([\mathbf{b}_i^{(l-1)}; \mathbf{b}_j^{(l-1)}]\big).   (2)$

The resultant $\mathbf{S}^{(l)}$ is a dense matrix, which encodes a fully connected graph.

However, a new molecule only has $K$ real neighbors of the same class in $\mathcal{S}_t$ in a 2-way-$K$-shot task. Choosing a wrong neighbor from the other class will heavily deteriorate the quality of the molecular embeddings, especially when only one shot is provided per class. To avoid the interference of wrong neighbors, we further sparsify the graph into a $K$-nearest-neighbor ($K$NN) graph, where $K$ is set to exactly the number of labeled molecules per class in $\mathcal{S}_t$. The indices of the top-$K$ largest $[\mathbf{S}^{(l)}]_{ij}$ for molecule $i$ are recorded in $\mathcal{N}_K(i)$. Then, we set

$[\mathbf{A}^{(l)}]_{ij} = [\mathbf{S}^{(l)}]_{ij}$ if $j \in \mathcal{N}_K(i)$, and $[\mathbf{A}^{(l)}]_{ij} = 0$ otherwise.   (3)

With the new $\mathbf{A}^{(l)}$, we update the molecular embeddings by a GNN layer as

$\mathbf{b}_i^{(l)} = \sigma\big(\mathbf{W}^{(l)} \sum_{j} [\mathbf{A}^{(l)}]_{ij}\, \mathbf{b}_j^{(l-1)}\big),   (4)$

where $\mathbf{W}^{(l)}$ is a learnable weight matrix and $\sigma$ is the LeakyReLU activation function. A full adaptive relation graph learning module consists of $L'$ such GNN layers. We take $\mathbf{b}_i = \mathbf{b}_i^{(L')}$ as the final molecular embedding of molecule $i$, and $\mathbf{A}^{(L')}$ represents the optimized relation graph.
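A minimal sketch of one such iteration is given below; the concatenation-based pairwise MLP in Eq. (2) and the plain weighted-sum update in Eq. (4) are assumptions made for illustration, not necessarily the authors' exact parameterization.

```python
# Sketch of Eqs. (2)-(4): dense pairwise similarity, kNN sparsification,
# and one GNN layer over the learned relation graph.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationGraphLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.sim_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                     nn.Linear(dim, 1))
        self.w = nn.Linear(dim, dim, bias=False)

    def forward(self, b, k):
        # b: [n, dim] current molecular embeddings; k: neighbors kept per node.
        n = b.size(0)
        pairs = torch.cat([b.unsqueeze(1).expand(n, n, -1),
                           b.unsqueeze(0).expand(n, n, -1)], dim=-1)
        s = self.sim_mlp(pairs).squeeze(-1)                   # Eq. (2): dense [n, n]
        topk = s.topk(k, dim=1).indices                       # top-k neighbors per row
        a = torch.zeros_like(s).scatter_(1, topk, s.gather(1, topk))  # Eq. (3)
        return F.leaky_relu(self.w(a @ b)), a                 # Eq. (4)
```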

Finally, we obtain the class prediction $\hat{\mathbf{y}}_i$ w.r.t. active/inactive, with the $k$th element calculated as

$[\hat{\mathbf{y}}_i]_k = \exp(\mathbf{w}_k^\top \mathbf{b}_i) \,/\, \sum_{k'\in\{0,1\}} \exp(\mathbf{w}_{k'}^\top \mathbf{b}_i),   (5)$

where $\mathbf{w}_k$ denotes the classifier parameter for class $k$.
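Hypothetically, the pieces sketched above compose as follows for a single task; here encoder, adapt, layers and classifier stand for the graph-based molecular encoder, the property-aware embedding function, the $L'$ relation-graph layers and the linear classifier, respectively.

```python
# Illustrative glue code for one 2-way-K-shot task: encode, adapt,
# refine over the learned relation graph, then classify (Eq. (5)).
import torch
import torch.nn.functional as F

def predict(encoder, adapt, layers, classifier, graphs, support_idx, support_y, k):
    g = torch.stack([encoder(gr) for gr in graphs])    # generic embeddings g_i
    b = adapt(g, g[support_idx], support_y)            # property-aware p_i (Eq. (1))
    a = None
    for layer in layers:                               # L' relation-graph iterations
        b, a = layer(b, k)                             # Eqs. (2)-(4)
    return F.softmax(classifier(b), dim=-1), a         # predictions and final graph
```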

3.4 Training and Inference

For simplicity, we denote PAR as $f_\Theta$, where $\Theta = \{\theta_g, \theta_e, \theta_r, \theta_c\}$ collects all learnable parameters: $\theta_g$, the parameters of the graph-based molecular encoder; $\theta_e$, the parameters of the property-aware molecular embedding function; $\theta_r$, the parameters of the adaptive relation graph learning module; and $\theta_c$, the parameters of the classifier.

In each task $\mathcal{T}_t$, the loss evaluated on the support set $\mathcal{S}_t$ takes the form:

$\mathcal{L}(\mathcal{S}_t) = \sum_{(\mathbf{x}_i, y_i)\in\mathcal{S}_t} \big( -\mathbf{y}_i^\top \log \hat{\mathbf{y}}_i + \lambda\, \|[\mathbf{A}^{(L')}]_{i:} - [\mathbf{R}]_{i:}\|_2^2 \big),   (6)$

where $\mathbf{y}_i$ is a one-hot vector with all 0s but a single 1 at the index of the ground-truth class $y_i$, $[\mathbf{A}^{(L')}]_{i:}$ denotes the $i$th row of the optimized relation graph's adjacency matrix, $\mathbf{R}$ records the ground-truth label consistency with $[\mathbf{R}]_{ij} = 1$ if $y_i = y_j$ and 0 otherwise, and $\lambda$ balances the two terms. The first term is the cross-entropy classification loss, and the second term is a specially designed neighbor alignment loss which penalizes wrong neighbors in the relation graph.
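A sketch of this loss under the reconstruction above is given below; the squared-distance form of the alignment term mirrors Eq. (6) as written here and is an assumption rather than the authors' exact formulation.

```python
# Sketch of Eq. (6): cross-entropy plus a neighbor alignment penalty that
# compares the learned adjacency with the label-consistency matrix R.
import torch
import torch.nn.functional as F

def par_loss(logits, labels, adj, lam=0.1):
    # logits: [m, 2]; labels: [m] in {0, 1}; adj: [m, m] optimized relation graph.
    ce = F.cross_entropy(logits, labels)                       # classification term
    r = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()   # [R]_ij = 1 iff y_i = y_j
    align = ((adj - r) ** 2).sum(dim=1).mean()                 # neighbor alignment term
    return ce + lam * align
```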

We adopt a gradient-based meta-learning strategy [Finn et al.2017] to train PAR: we learn from a set of meta-training tasks a good initialization $\Theta$ that can be easily adapted by taking a few gradient descent steps; within each task $\mathcal{T}_t$, we then keep $\theta_g$ and $\theta_r$ fixed while fine-tuning $\theta_e$ and $\theta_c$ on $\mathcal{S}_t$ by a few gradient descent steps. For example, $\theta_c'$ is obtained as

$\theta_c' = \theta_c - \alpha \nabla_{\theta_c} \mathcal{L}(\mathcal{S}_t),   (7)$

with learning rate $\alpha$. By learning this way, we encourage our model to separately capture the generic knowledge shared across different tasks and the knowledge specific to each property.

The loss $\mathcal{L}(\mathcal{Q}_t)$ is then calculated in the same form as (6), but on the query set $\mathcal{Q}_t$ using the fine-tuned parameters $\Theta_t' = \{\theta_g, \theta_e', \theta_r, \theta_c'\}$. $\Theta$ is then updated as

$\Theta \leftarrow \Theta - \beta \nabla_{\Theta} \sum_{\mathcal{T}_t} \mathcal{L}(\mathcal{Q}_t; \Theta_t'),   (8)$

which is also optimized by gradient descent with meta learning rate $\beta$ [Finn et al.2017]. The complete algorithm of PAR is shown in Algorithm 1. Lines 6-7 correspond to property-aware embedding, which encodes substructures w.r.t. the target property (see Section 3.2). Lines 8-12 correspond to adaptive relation graph learning, which facilitates effective label propagation among similar molecules (see Section 3.3).

1:  initialize $\theta_g$ randomly or adopt parameters of a pretrained model; initialize $\theta_e, \theta_r, \theta_c$ randomly;
2:  while not done do
3:     sample a batch of tasks $\{\mathcal{T}_t\}$;
4:     for all $\mathcal{T}_t$ do
5:        sample support set $\mathcal{S}_t$ and query set $\mathcal{Q}_t$ from $\mathcal{T}_t$;
6:        obtain generic molecular embedding $\mathbf{g}_i$ for each molecule by the graph-based molecular encoder;
7:        adapt molecular embeddings to be property-aware by (1);
8:        set $\mathbf{b}_i^{(0)} = \mathbf{p}_i$;
9:        for $l = 1, \dots, L'$ do
10:           estimate the adjacency matrix $\mathbf{A}^{(l)}$ of the relation graph among molecules using the current molecular embeddings by (2) and (3);
11:           refine the molecular embeddings on the updated relation graph by (4);
12:        end for
13:        obtain class predictions using (5);
14:        evaluate training loss $\mathcal{L}(\mathcal{S}_t)$ on $\mathcal{S}_t$ by (6);
15:        fine-tune $\theta_e$ and $\theta_c$ by gradient descent (e.g., by (7));
16:        evaluate testing loss $\mathcal{L}(\mathcal{Q}_t)$ on $\mathcal{Q}_t$;
17:     end for
18:     update $\Theta$ by (8);
19:  end while
Algorithm 1 Meta-training procedure for PAR.
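A minimal sketch of one meta-update of Algorithm 1 follows, using a first-order approximation of (7)-(8) for brevity; model.adapt, model.classifier and model.loss(...) are assumed interfaces, not the authors' code.

```python
# One meta-training step: per task, fine-tune only theta_e and theta_c on the
# support set, then accumulate (first-order) meta-gradients from the query loss.
import copy
import torch

def meta_train_step(model, tasks, meta_opt, inner_lr=0.05, inner_steps=1):
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for support, query in tasks:
        task_model = copy.deepcopy(model)                  # per-task copy of Theta
        inner = list(task_model.adapt.parameters()) + \
                list(task_model.classifier.parameters())   # theta_e and theta_c only
        inner_opt = torch.optim.SGD(inner, lr=inner_lr)
        for _ in range(inner_steps):                       # Eq. (7) on the support set
            inner_opt.zero_grad()
            task_model.loss(*support).backward()
            inner_opt.step()
        task_model.zero_grad()
        task_model.loss(*query).backward()                 # query loss at adapted params
        for g, p in zip(meta_grads, task_model.parameters()):
            if p.grad is not None:
                g.add_(p.grad)                             # accumulate first-order grads
    meta_opt.zero_grad()
    for g, p in zip(meta_grads, model.parameters()):
        p.grad = g / len(tasks)                            # Eq. (8): meta update
    meta_opt.step()
```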

For inference, the generalization ability of PAR is evaluated on the query set of each new task, which tests a new property, in the meta-testing stage. As in training, $\theta_g$ and $\theta_r$ are kept fixed while $\theta_e$ and $\theta_c$ are fine-tuned on $\mathcal{S}_t$.

3.5 Transfer Learning Across Datasets

Further, we extend PAR to handle the setting where tasks in meta-training and meta-testing come from different datasets, which requires higher generalization ability and has been considered in [Abbasi et al.2019, Altae-Tran et al.2017, Cai et al.2020]. We show that PAR can be easily adapted to conduct few-shot transfer learning across datasets by modifying Algorithm 1.

Following [Abbasi et al.2019], we use molecules from the target-domain task distribution to influence the meta-training stage. After line 18 of Algorithm 1, we sample a batch of target-domain tasks, each with a labeled support set and an unlabeled query set whose labels cannot be exposed during learning [Abbasi et al.2019]. We then repeat lines 7-16 for all of these tasks. To evaluate the performance on an unlabeled query set $\mathcal{Q}_t$, we propose a new loss defined as

$\mathcal{L}_{ent}(\mathcal{Q}_t) = -\sum_{\mathbf{x}_i\in\mathcal{Q}_t} \hat{\mathbf{y}}_i^\top \log \hat{\mathbf{y}}_i.$

This entropy-based loss encourages the model to make “confident” (low-entropy) predictions [Grandvalet and Bengio2004]. As $\hat{\mathbf{y}}_i$ is obtained by aggregating labels from its neighbors in the relation graph, minimizing $\mathcal{L}_{ent}(\mathcal{Q}_t)$ also forces PAR to model the relation graph more accurately. PAR is optimized w.r.t. the accumulation of $\mathcal{L}(\mathcal{Q}_t)$ and $\mathcal{L}_{ent}(\mathcal{Q}_t)$ calculated over all sampled tasks.
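A sketch of this loss is below, following the standard entropy-minimization form of [Grandvalet and Bengio2004]:

```python
# Entropy of the predicted class distributions for unlabeled query molecules;
# minimizing it pushes the model toward confident predictions.
import torch

def entropy_loss(probs, eps=1e-8):
    # probs: [m, 2] predicted distributions for the unlabeled query set.
    return -(probs * (probs + eps).log()).sum(dim=1).mean()
```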

4 Experiments

We perform experiments on widely used benchmark few-shot molecular property prediction datasets (Table 1), whose details are in Appendix A.

Dataset Tox21 SIDER MUV ToxCast
# Compounds 8014 1427 93127 8615
# Tasks 12 27 17 617
# Meta-Training Tasks 9 21 12 450
# Meta-Testing Tasks 3 6 5 167
Table 1: Summary of datasets used.

4.1 Experimental Settings

Method Tox21 SIDER MUV ToxCast
10-shot 1-shot 10-shot 1-shot 10-shot 1-shot 10-shot 1-shot
Siamese 80.40±0.35 65.00±1.58 71.10±4.32 51.43±3.31 59.96±5.13 50.00±0.17 - -
ProtoNet 74.98±0.32 65.58±1.72 64.54±0.89 57.50±2.34 65.88±4.11 58.31±3.18 63.70±1.26 56.36±1.54
MAML 80.21±0.24 75.74±0.48 70.43±0.76 67.81±1.12 63.90±2.28 60.51±3.12 66.79±0.85 65.97±5.04
TPN 76.05±0.24 60.16±1.18 67.84±0.95 62.90±1.38 65.22±5.82 50.00±0.51 62.74±1.45 50.01±0.05
EGNN 81.21±0.16 79.44±0.22 72.87±0.73 70.79±0.95 65.20±2.08 62.18±1.76 63.65±1.57 61.02±1.94
IterRefLSTM 81.10±0.17 80.97±0.10 69.63±0.31 71.73±0.14 49.56±5.12 48.54±3.12 - -
PAR (ours) 82.06±0.12 80.46±0.13 74.68±0.31 71.87±0.48 66.48±2.12 64.12±1.18 69.72±1.63 67.28±2.90
Pre-GNN 82.14±0.08 81.68±0.09 73.96±0.08 73.24±0.12 67.14±1.58 64.51±1.45 73.68±0.74 72.90±0.84
Meta-MGNN 82.97±0.10 82.13±0.13 75.43±0.21 73.36±0.32 68.99±1.84 65.54±2.13 - -
Pre-PAR (ours) 84.93±0.11 83.01±0.09 78.08±0.16 74.46±0.29 69.96±1.37 66.94±1.12 75.12±0.84 73.63±1.00
Table 2: ROC-AUC scores of FSL on molecular property prediction datasets. Best results are in bold, and second best ones are underlined.
Method Tox21 → SIDER SIDER → Tox21 ToxCast → Tox21 ToxCast → SIDER
10-shot 1-shot 10-shot 1-shot 10-shot 1-shot 10-shot 1-shot
ProtoNet 56.71±4.89 53.80±3.52 67.07±6.38 58.73±5.24 69.12±3.76 65.13±2.23 57.12±0.61 55.12±1.10
MAML 56.84±2.34 54.68±2.46 65.20±4.77 63.53±1.56 72.98±3.12 67.72±6.64 56.58±1.05 55.94±1.86
TPN 57.50±3.97 50.31±1.43 67.80±5.52 50.02±0.87 68.35±1.81 55.07±1.03 55.25±0.85 50.00±0.06
EGNN 57.82±2.39 55.96±2.45 68.40±1.25 65.50±1.20 72.56±2.76 65.60±3.18 55.22±0.65 53.48±1.14
PAR (ours) 58.44±1.51 57.03±2.08 69.08±1.34 66.57±1.02 73.63±2.55 70.72±4.53 58.98±0.89 56.63±1.22
DTCR 63.00±2.00 60.00±2.00 74.00±2.00 69.00±2.00 74.00±5.00 71.00±3.00 63.00±1.00 58.00±5.00
PAR-TL (ours) 63.12±1.13 60.12±1.38 75.23±2.44 70.12±2.73 75.12±2.12 73.43±3.71 63.22±0.91 59.08±1.06
Pre-GNN 61.12±0.82 58.29±1.78 73.77±1.52 65.62±1.77 76.08±3.76 75.53±4.35 59.30±0.71 57.15±1.12
Meta-MGNN 61.99±1.43 58.89±2.38 74.26±2.18 67.27±1.37 - - - -
Pre-PAR (ours) 62.20±1.32 58.77±2.43 74.40±1.97 69.48±1.66 77.75±1.54 75.71±1.28 60.83±0.66 58.62±0.81
Pre-PAR-TL (ours) 65.19±0.97 61.49±2.08 78.05±1.33 71.32±1.09 80.65±1.43 76.58±1.54 64.22±0.84 61.02±1.34
Table 3: ROC-AUC scores of few-shot transfer learning across datasets. Best results are in bold, and second best ones are underlined.
Baselines.

In the paper, we compare our PAR (Algorithm 1) with two types of baselines: (i) FSL methods with a graph-based encoder learned from scratch, including Siamese [Koch et al.2015], ProtoNet [Snell et al.2017], MAML [Finn et al.2017], TPN [Liu et al.2018], EGNN [Kim et al.2019], and IterRefLSTM [Altae-Tran et al.2017]; and (ii) methods which leverage a pretrained graph-based molecular encoder, including Pre-GNN [Hu et al.2019], Meta-MGNN [Guo et al.2021], and Pre-PAR, which is our PAR equipped with Pre-GNN. We use the results of IterRefLSTM reported in [Altae-Tran et al.2017] as its code is not available. For the other methods, we implement them using the public codes of the respective authors. More implementation details are in Appendix B.

Generic graph-based molecular representation.

Following [Hu et al.2019, Guo et al.2021], we use RDKit [Landrum2013] to build molecular graphs from raw SMILES strings and to extract atom features (atom number and chirality tag) and bond features (bond type and bond direction). For all methods re-implemented by us, we use GIN [Xu et al.2018] as the graph-based molecular encoder to extract molecular embeddings. Pre-GNN, Meta-MGNN and Pre-PAR further use the pretrained GIN provided by the authors of [Hu et al.2019].

Evaluation Metrics.

Following [Hu et al.2019, Guo et al.2021], we evaluate the binary classification performance by ROC-AUC scores calculated on the query set of each meta-testing task. We run experiments ten times with different random seeds, and report the mean and standard deviation of ROC-AUC computed across all meta-testing tasks.
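For reference, a minimal sketch of this evaluation protocol, assuming scikit-learn's roc_auc_score:

```python
# Compute ROC-AUC per meta-testing task, then aggregate across tasks.
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(task_results):
    # task_results: list of (y_true, y_score) pairs, one per meta-testing task.
    aucs = [roc_auc_score(y_true, y_score) for y_true, y_score in task_results]
    return np.mean(aucs), np.std(aucs)
```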

4.2 Experimental Results.

FSL for Molecular Prediction.

Table 2 shows the results. Results of Siamese, IterRefLSTM and Meta-MGNN are missing on ToxCast: the first two methods lack public code and have not been evaluated on ToxCast before, while Meta-MGNN runs out of memory. As can be seen, Pre-PAR consistently obtains the best performance, and PAR performs best among methods that do not use pretrained GNNs. The previous state-of-the-art IterRefLSTM and Meta-MGNN obtain only slightly better performance than Pre-GNN, which is pretrained on large-scale self-supervised tasks. We also observe that FSL methods which learn relation graphs (i.e., TPN and EGNN) obtain better performance than the classic ProtoNet and MAML.

Few-shot Transfer Learning across Datasets

Further, we evaluate the extension of PAR (denoted PAR-TL) described in Section 3.5 for few-shot transfer learning across different datasets. Following [Altae-Tran et al.2017, Abbasi et al.2019], we consider transfer learning (i) between Tox21 and SIDER, which contain distinct tasks; (ii) from ToxCast to Tox21, which both evaluate toxicity; and (iii) from ToxCast to SIDER, which differ largely [Abbasi et al.2019]. In addition to the above baselines, we compare with the state-of-the-art DTCR [Abbasi et al.2019]. As the authors did not release code, we use their reported results. Siamese and IterRefLSTM are not compared as they have not been evaluated under this setting [Altae-Tran et al.2017]. Following [Abbasi et al.2019], we compute ROC-AUC scores on the query sets of all tasks in the target dataset.

Table 3 presents the results. PAR-TL and DTCR outperform the others; PAR-TL can be trained easily, while DTCR requires adversarial learning to align the source and target domains. As shown, directly applying FSL methods cannot obtain satisfactory results, which is also observed in [Altae-Tran et al.2017]. We also observe that all methods obtain higher ROC-AUC scores on ToxCast → Tox21 than on ToxCast → SIDER, which shows that transferring from a similar source dataset is more helpful.

4.3 Model Analysis for PAR

We further compare Pre-PAR and PAR with the following variants: (1) w/o the property-aware embedding; (2) w/o the context in equation (1); (3) w/o adaptive relation graph learning; (4) w/o reducing to the $K$NN graph; (5) w/o the neighbor alignment loss in equation (6); and (6) fine-tuning all parameters in line 15 of Algorithm 1. Figure 3 shows the results obtained for 10-shot, while the results for 1-shot are in Appendix C.1. As shown, property-aware embedding and adaptive relation graph learning are vital to the success of PAR. PAR and Pre-PAR outperform their variants, which validates the effectiveness of our model design, while Pre-PAR, which uses the pretrained GIN, benefits from better generic molecular embeddings as a starting point.

(a) Pre-PAR
(b) PAR
Figure 3: Ablation study for 2-way-10-shot tasks from Tox21.

We also evaluate the performance of PAR using various graph-based molecular encoders (Appendix C.2) and conduct a case study (Appendix C.3) to evaluate whether PAR can obtain property-aware molecular embeddings and relation graphs for tasks with overlapping molecules but different target properties.

5 Conclusion

We propose a property-aware adaptive relation network (PAR) for the few-shot molecular property prediction problem. PAR consists of three components: a graph-based molecular encoder, which encodes the topological structure, atom features, and bond features of the molecular graph into a molecular embedding; a property-aware embedding function, which obtains property-aware embeddings encoding the context information of each task; and an adaptive relation graph learning module, which constructs a relation graph to effectively propagate information among similar molecules. Empirical results consistently show that PAR outperforms state-of-the-art methods under both the standard few-shot learning setting and the transfer learning across datasets setting. We leave interpreting the substructures learned by PAR as future work.

References

  • [Abbasi et al.2019] Karim Abbasi, Antti Poso, Jahanbakhsh Ghasemi, Massoud Amanlou, and Ali Masoudi-Nejad. Deep transferable compound representation across domains and tasks for low data drug discovery. Journal of Chemical Information and Modeling, 59(11):4528–4539, 2019.
  • [Ajmani et al.2009] Subhash Ajmani, Kamalakar Jadhav, and Sudhir A Kulkarni. Group-based QSAR (G-QSAR): Mitigating interpretation challenges in QSAR. QSAR & Combinatorial Science, 28(1):36–51, 2009.
  • [Alsentzer et al.2020] Emily Alsentzer, Samuel Finlayson, Michelle Li, and Marinka Zitnik. Subgraph neural networks. In Advances in Neural Information Processing Systems, volume 33, pages 8017–8029, 2020.
  • [Altae-Tran et al.2017] Han Altae-Tran, Bharath Ramsundar, Aneesh S Pappu, and Vijay Pande. Low data drug discovery with one-shot learning. ACS Central Science, 3(4):283–293, 2017.
  • [Cai et al.2020] Chenjing Cai, Shiwei Wang, Youjun Xu, Weilin Zhang, Ke Tang, Qi Ouyang, Luhua Lai, and Jianfeng Pei. Transfer learning for drug discovery. Journal of Medicinal Chemistry, 63(16):8683–8694, 2020.
  • [Chen et al.2020] Yu Chen, Lingfei Wu, and Mohammed Zaki. Iterative deep graph learning for graph neural networks: Better and robust node embeddings. In Advances in Neural Information Processing Systems, pages 19314–19326, 2020.
  • [Dahl et al.2014] George E Dahl, Navdeep Jaitly, and Ruslan Salakhutdinov. Multi-task neural networks for QSAR predictions. arXiv preprint arXiv:1406.1231, 2014.
  • [Duvenaud et al.2015] David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, 2015.
  • [Fey and Lenssen2019] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428, 2019.
  • [Finn et al.2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.
  • [for Advancing Translational Sciences2017] National Center for Advancing Translational Sciences. Tox21 challenge. http://tripod.nih.gov/tox21/challenge/, 2017. Accessed: 2016-11-06.
  • [Fu et al.2020] Xinyu Fu, Jiani Zhang, Ziqiao Meng, and Irwin King. MAGNN: Metapath aggregated graph neural network for heterogeneous graph embedding. In The Web Conference, pages 2331–2341, 2020.
  • [Garcia and Bruna2018] Victor Garcia and Joan Bruna. Few-shot learning with graph neural networks. In International Conference on Learning Representations, 2018.
  • [Gawehn et al.2016] Erik Gawehn, Jan A Hiss, and Gisbert Schneider. Deep learning in drug discovery. Molecular Informatics, 35(1):3–14, 2016.
  • [Gilmer et al.2017] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272, 2017.
  • [Grandvalet and Bengio2004] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances on Neural Information Processing Systems, pages 529–536, 2004.
  • [Guo et al.2021] Zhichun Guo, Chuxu Zhang, Wenhao Yu, John Herr, Olaf Wiest, Meng Jiang, and Nitesh V Chawla. Few-shot graph learning for molecular property prediction. In The Web Conference, 2021.
  • [Hamilton et al.2017] William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1025–1035, 2017.
  • [Hu et al.2019] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Strategies for pre-training graph neural networks. In International Conference on Learning Representations, 2019.
  • [Hu et al.2020] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. In Advances in Neural Information Processing Systems, volume 33, pages 22118–22133, 2020.
  • [Kim et al.2019] Jongmin Kim, Taesup Kim, Sungwoong Kim, and Chang D Yoo. Edge-labeling graph neural network for few-shot learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11–20, 2019.
  • [Kingma and Ba2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [Kipf and Welling2016] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2016.
  • [Koch et al.2015] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2. Lille, 2015.
  • [Kuhn et al.2016] Michael Kuhn, Ivica Letunic, Lars Juhl Jensen, and Peer Bork. The SIDER database of drugs and side effects. Nucleic Acids Research, 44(D1):D1075–D1079, 2016.
  • [Landrum2013] Greg Landrum. RDKit: A software suite for cheminformatics, computational chemistry, and predictive modeling, 2013.
  • [Leelananda and Lindert2016] Sumudu P Leelananda and Steffen Lindert. Computational methods in drug discovery. Beilstein journal of organic chemistry, 12(1):2694–2718, 2016.
  • [Liu et al.2018] Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, Eunho Yang, Sung Ju Hwang, and Yi Yang. Learning to propagate labels: Transductive propagation network for few-shot learning. In International Conference on Learning Representations, 2018.
  • [Monti et al.2018] Federico Monti, Karl Otness, and Michael M Bronstein. MotifNet: A motif-based graph convolutional network for directed graphs. In IEEE Data Science Workshop, pages 225–228. IEEE, 2018.
  • [Nguyen et al.2020] Cuong Q Nguyen, Constantine Kreatsoulas, and Kim M Branson. Meta-learning GNN initializations for low-resource molecular property prediction. arXiv preprint arXiv:2003.05996, 2020.
  • [Paszke et al.2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
  • [Paul et al.2010] Steven M Paul, Daniel S Mytelka, Christopher T Dunwiddie, Charles C Persinger, Bernard H Munos, Stacy R Lindborg, and Aaron L Schacht. How to improve R&D productivity: The pharmaceutical industry’s grand challenge. Nature Reviews Drug Discovery, 9(3):203–214, 2010.
  • [Richard et al.2016] Ann M Richard, Richard S Judson, Keith A Houck, Christopher M Grulke, Patra Volarath, Inthirany Thillainadarajah, Chihae Yang, James Rathman, Matthew T Martin, John F Wambaugh, et al. ToxCast chemical landscape: Paving the road to 21st century toxicology. Chemical Research in Toxicology, 29(8):1225–1251, 2016.
  • [Rodríguez et al.2020] Pau Rodríguez, Issam Laradji, Alexandre Drouin, and Alexandre Lacoste. Embedding propagation: Smoother manifold for few-shot classification. In European Conference on Computer Vision, pages 121–138. Springer, 2020.
  • [Rohrer and Baumann2009] Sebastian G Rohrer and Knut Baumann. Maximum unbiased validation (MUV) data sets for virtual screening based on PubChem bioactivity data. Journal of Chemical Information and Modeling, 49(2):169–184, 2009.
  • [Rong et al.2020] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information Processing Systems, 33:12559–12571, 2020.
  • [Snell et al.2017] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4080–4090, 2017.
  • [Vaswani et al.2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
  • [Veličković et al.2017] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
  • [Vinyals et al.2016] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3637–3645, 2016.
  • [Wang et al.2020] Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys, 53(3):1–34, 2020.
  • [Xu et al.2018] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2018.
  • [Yang et al.2020] Ling Yang, Liangliang Li, Zilun Zhang, Xinyu Zhou, Erjin Zhou, and Yu Liu. DPGN: Distribution propagation graph network for few-shot learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13390–13399, 2020.
  • [Yu et al.2013] Wenying Yu, Hui Xiao, Jiayuh Lin, and Chenglong Li. Discovery of novel STAT3 small molecule inhibitors via in silico site-directed fragment-based drug design. Journal of Medicinal Chemistry, 56(11):4402–4412, 2013.
  • [Zhavoronkov et al.2019] Alex Zhavoronkov, Yan A Ivanenkov, Alex Aliper, Mark S Veselov, Vladimir A Aladinskiy, Anastasiya V Aladinskaya, Victor A Terentiev, Daniil A Polykovskiy, Maksim D Kuznetsov, Arip Asadulaev, et al. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nature Biotechnology, 37(9):1038–1040, 2019.
  • [Zhu et al.2021] Yanqiao Zhu, Weizhi Xu, Jinghao Zhang, Qiang Liu, Shu Wu, and Liang Wang. Deep graph structure learning for robust representations: A survey. arXiv preprint arXiv:2103.03036, 2021.

Appendix A Details of Datasets

We perform experiments on widely used benchmark few-shot molecular property prediction datasets, all downloaded from http://moleculenet.ai/: (i) Tox21 [for Advancing Translational Sciences2017] contains assays, each measuring the toxicity of compounds on a biological target; (ii) SIDER [Kuhn et al.2016] records the side effects of compounds used in marketed medicines, where the original 5868 side-effect categories are grouped into 27 categories as in [Altae-Tran et al.2017, Guo et al.2021]; (iii) MUV [Rohrer and Baumann2009] is designed to validate virtual screening, where active molecules are chosen to be structurally distinct from one another; and (iv) ToxCast [Richard et al.2016] is a collection of compounds with toxicity labels obtained via high-throughput screening. Tox21, SIDER and MUV have public task splits provided by [Altae-Tran et al.2017], which we adopt. For ToxCast, we randomly select 450 tasks for meta-training and use the rest for meta-testing.

Appendix B Implementation Details

All experiments are conducted on a PC with 32GB memory, Intel-i8 CPU and a 32GB NVIDIA Tesla V100 GPU.

b.1 Baselines

In the paper, we compare our PAR (Algorithm 1) with two types of baselines: (i) FSL methods with a graph-based encoder learned from scratch, including Siamese [Koch et al.2015], which learns dual convolutional neural networks to identify whether the input molecule pairs are from the same class; ProtoNet222https://github.com/jakesnell/prototypical-networks [Snell et al.2017], which assigns each query molecule the label of its nearest class prototype; MAML333We use MAML implemented in the learn2learn library at https://github.com/learnables/learn2learn. [Finn et al.2017], which adapts the meta-learned parameters to new tasks via gradient descent; TPN444https://github.com/csyanbin/TPN-pytorch [Liu et al.2018], which conducts label propagation on a relation graph with rescaled edge weights under the transductive setting; EGNN555https://github.com/khy0809/fewshot-egnn [Kim et al.2019], which learns to predict the edge labels of the relation graph; and IterRefLSTM [Altae-Tran et al.2017], which adapts Matching Networks [Vinyals et al.2016] to handle molecular property prediction tasks; and (ii) methods which leverage a pretrained graph-based molecular encoder, including Pre-GNN666http://snap.stanford.edu/gnn-pretrain [Hu et al.2019], which pretrains a graph isomorphism network (GIN) [Xu et al.2018] using graph-level and node-level self-supervised tasks and is fine-tuned on support sets; Meta-MGNN777https://github.com/zhichunguo/Meta-MGNN [Guo et al.2021], which uses Pre-GNN as the molecular encoder and optimizes the molecular property prediction task jointly with self-supervised bond reconstruction and atom type prediction tasks; and Pre-PAR, which is our PAR equipped with Pre-GNN. GROVER [Rong et al.2020] is not compared as it uses a different set of atom and bond features. We use the results of Siamese and IterRefLSTM reported in [Altae-Tran et al.2017], as their codes are not available. For the other methods, we implement them using the public codes of the respective authors. We find hyperparameters via grid search on the validation set for all methods.

Generic graph-based molecular representation.

For methods re-implemented by us, we use GIN as the graph-based molecular encoder to extract molecular embeddings in all methods (including ours). Following [Guo et al.2021, Hu et al.2019], we use the GIN888https://github.com/snap-stanford/pretrain-gnns/ provided by the authors of [Hu et al.2019]: it consists of 5 GNN layers with 300-dimensional hidden units, uses average pooling as the READOUT function, and sets the dropout rate to 0.5. Pre-GNN, Meta-MGNN and Pre-PAR further use the pretrained GIN, which is also provided by the authors of [Hu et al.2019].

b.2 PAR

In PAR, the MLPs used in (1) and (2) both consist of two fully connected layers with hidden size 128. We iterate between relation graph estimation and molecular embedding refinement twice. We implement PAR in PyTorch [Paszke et al.2019] with the PyTorch Geometric library [Fey and Lenssen2019]. We train the model for a maximum of 2000 epochs, using Adam [Kingma and Ba2014] with a learning rate of 0.001 for meta-training and a learning rate of 0.05 for fine-tuning the property-aware molecular embedding function and the classifier within each task. We early stop training if the validation loss does not decrease for ten consecutive epochs. The dropout rate is 0.1 everywhere except in the graph-based molecular encoder. We summarize the hyperparameters and their search ranges in Table 4.

Hyperparameter Range Selected
learning rate for fine-tuning within each task 0.01–0.5 0.05
number of update steps for fine-tuning 1–5 1
learning rate for meta-learning 0.001 0.001
number of layers in adaptive relation graph learning module 1–3 2
number of layers for MLPs in (1) and (2) 1–3 2
hidden dimension for MLPs in (1) and (2) 100–300 128
dropout rate 0.0–0.5 0.1
hidden dimension for classifier in (5) 100–200 128
Table 4: Hyperparameters used by PAR.

Appendix C More Experimental Results

c.1 Ablation Study.

Figure 4 presents the results of comparing PAR (and Pre-PAR) with the six variants on 2-way-1-shot tasks from Tox21. The observation is consistent: PAR and Pre-PAR outperform their variants.

(a) Pre-PAR
(b) PAR
Figure 4: Ablation study for 2-way-1-shot tasks from Tox21.

Further, we pay special attention to the design of the adaptive relation graph among molecules. Correspondingly, we compare PAR with variant (4), which does not reduce the relation graph to a $K$NN graph, and variant (5), which removes the neighbor alignment loss in equation (6). The correct neighbor ratio is calculated as the ratio of neighbors with the same label among the top-$K$ nearest neighbors. We report the average value over all molecules in each query set of the meta-testing tasks. Figure 5 plots the results obtained on a 2-way-10-shot meta-testing task of Tox21. As can be seen, PAR improves both the correct neighbor ratio and the overall ROC-AUC scores during learning, and the consistency between the neighbor alignment loss and the ROC-AUC scores further validates the efficacy of the additional neighbor alignment loss in (6).

(a) ROC-AUC scores.
(b) Correct neighbor ratio.
Figure 5: Further study of adaptive relation graph learning on a 2-way-10-shot task of Tox21.
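For reference, a minimal sketch of how the correct neighbor ratio can be computed from a learned adjacency matrix; the function name and interface are illustrative.

```python
# Fraction of each molecule's top-k neighbors in the learned relation graph
# that share its ground-truth label, averaged over molecules.
import torch

def correct_neighbor_ratio(adj, labels, k):
    # adj: [n, n] learned adjacency; labels: [n] ground-truth labels.
    topk = adj.topk(k, dim=1).indices           # top-k neighbors per molecule
    same = labels[topk] == labels.unsqueeze(1)  # [n, k] label agreement
    return same.float().mean().item()
```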

c.2 Using Other Graph-based Molecular Encoders

In the main experiments, we use GIN and its pretrained version. However, as introduced in Section 3.2, our PAR is compatible with any existing graph-based molecular encoder introduced in Section 2. Here, we consider the following popular choices as the encoder to output the generic molecular embeddings: GIN999GIN, GAT, GCN and GraphSAGE and their pretrained versions are obtained from https://github.com/snap-stanford/pretrain-gnns/, whose details are in Appendix A of [Hu et al.2019]. [Xu et al.2018], GCN [Duvenaud et al.2015], GraphSAGE [Hamilton et al.2017] and GAT [Veličković et al.2017], each either learned from scratch or pretrained. We compare the proposed PAR with directly fine-tuning the encoder on support sets (denoted GNN).

Figure 6 shows the results. As can be seen, GIN is consistently better than the other encoders, and PAR consistently outperforms the fine-tuned GNN across all encoders. This validates the effectiveness of the property-aware molecular embedding function and the adaptive relation graph learning module. We further notice that using pretrained encoders improves performance except for GAT, which is also observed in [Hu et al.2019].

(a) 10-shot
(b) 1-shot
Figure 6: ROC-AUC scores of FSL on Tox21 using different graph-based molecular encoders.

c.3 Case Study

Finally, we validate whether PAR can obtain different property-aware molecular embeddings and relation graphs for tasks that contain overlapping molecules but evaluate different properties. To examine this under a controlled setting, we sample a fixed group of 10 molecules from Tox21 (Table 5) which coexist in three meta-testing tasks (i.e., the 10th, 11th and 12th tasks). Provided with the meta-learned parameters, we take these 10 molecules as the support set to fine-tune the property-aware embedding function and classifier while keeping the other parameters fixed in each task. As the support set is now fixed, the ratio of active to inactive molecules among the 10 molecules may not be 1:1 in the three tasks; thus the resultant tasks may not be 2-way-$K$-shot.

Molecule Label
ID SMILES SR-HSE SR-MMP SR-p53
mol_1 Cc1cccc(/N=N/c2ccc(N(C)C)cc2)c1 0 1 0
mol_2 O=C(c1ccccc1)C1CCC1 1 0 0
mol_3 C=C(C)[C@H]1CN[C@H](C(=O)O)[C@H]1CC(=O)O 0 0 1
mol_4 c1ccc2sc(SNC3CCCCC3)nc2c1 1 1 0
mol_5 C=CCSSCC=C 0 0 1
mol_6 CC(C)(C)c1cccc(C(C)(C)C)c1O 0 1 0
mol_7 C[C@@H]1CC2(OC3C[C@@]4(C)C5=CC[C@H]6C(C)(C)C(O[C@@H]7OC[C@@H](O)[C@H](O)[C@H]7O)CC[C@@]67C[C@@]57CC[C@]4(C)C31)OC(O)C1(C)OC21 0 1 0
mol_8 O=C(CCCCCCC(=O)Nc1ccccc1)NO 0 0 1
mol_9 CC/C=C\\C/C=C\\C/C=C\\CCCCCCCC(=O)O 1 0 0
mol_10 Cl[Si](Cl)(c1ccccc1)c1ccccc1 0 1 0
Table 5: The ten molecules sampled from the Tox21 dataset, which coexist in the three meta-testing tasks (the 10th task for SR-HSE, the 11th task for SR-MMP, and the 12th task for SR-p53).

c.3.1 Visualization of the Learned Relation Graphs

As described in Section 3.3, PAR returns $\mathbf{A}^{(L')}$ as the adjacency matrix encoding the optimized relation graph among molecules. Each entry records the pairwise similarity of the 10 molecules and a random query (which is then dropped). As the number of active and inactive molecules may not be equal in the support set, we no longer reduce the adjacency matrices to $K$NN graphs. Figure 7 plots the optimized adjacency matrices obtained on all three tasks, and Figure 8 further plots the relation graphs encoded in these adjacency matrices. The observations are consistent: PAR obtains different adjacency matrices for different property prediction tasks, and the learned adjacency matrices are visually similar to the ones computed using ground-truth labels.

(a) PAR
(b) Expert-annotated
Figure 7: Comparison between the adjacency matrices returned by PAR (left) and those computed using ground-truth labels (right) for the ten molecules in Table 5 on the 10th task (first row), 11th task (second row), and 12th task (third row). For the ground-truth version, we set $[\mathbf{A}]_{ij} = 1$ if molecules $i$ and $j$ have the same label and 0 otherwise.
(a) the 10th task
(b) the 11th task
(c) the 12th task
Figure 8: Relation graphs returned by PAR, which are encoded in the adjacency matrices in Figure 7. A red (blue) node means the molecule is active (inactive) on the current task. A red (blue) line means the connected molecules are both active (inactive). A gray line means the connected nodes are not from the same class. Thicker lines mean higher similarity.
(a) generic embeddings
(b) property-aware embeddings
(c) final embeddings
Figure 9: t-SNE visualization of molecular embeddings for the ten molecules in Table 5 on the 10th task (first row), 11th task (second row), and 12th task (third row).

c.3.2 Visualization of the Learned Molecular Embeddings

We also present t-SNE visualizations of the generic molecular embeddings obtained by the graph-based molecular encoder, the molecular embeddings obtained by the property-aware embedding function, and the final molecular embeddings returned by PAR for these 10 molecules. For the same molecule, the generic embedding is identical across the 10th, 11th and 12th tasks, while the other two embeddings are property-aware. Figure 9 shows the results. As shown, PAR indeed captures property-aware information when encoding the same molecules for different molecular property prediction tasks. From the first column to the third column of Figure 9, the molecular embeddings gradually move closer to their class prototypes on the 10th and 11th tasks; the 12th task is harder to evaluate.