1. Introduction
Predicting molecular properties, such as energy, is a fundamental problem in chemistry, biology, and materials science, and it has motivated a large body of research and applications. For example, drug discovery (ekins2019exploiting) can be accelerated if the properties of molecules can be predicted accurately and quickly, which helps develop targeted medicines for epidemics such as H1N1 flu, SARS, and COVID-19.
In chemistry, density functional theory (DFT) is a commonly used computational method for molecular property prediction, and it has been studied since the 1970s (becke2014perspective). It offers accurate and explainable solutions for molecules, grounded in a complete theory (kohn1965self). In practice, however, it suffers from high computational cost, because it must solve many linear equations iteratively. For example, experiments show that it takes about an hour to calculate the properties of a molecule with only 20 atoms (gilmer2017neural). Such low efficiency limits the use of DFT for screening large sets of molecules.
Recently, researchers have attempted to use cost-effective machine learning methods for molecular property prediction (hansen2015machine). Along this line, the most representative methods are graph neural networks (GNNs), including MPNN (gilmer2017neural), SchNet (schutt2017schnet), and MGCN (lu2019molecular), which have shown superior performance. Generally, they treat a molecule as a graph in which the nodes denote atoms and the edges represent the interactions between atoms. They use several neural layers to project each node into a latent space as a low-dimensional learnable embedding vector and pass interaction messages along the edges iteratively. Finally, the node messages are aggregated to represent the molecule for property prediction.
Although GNNs have achieved great success, they are usually data-hungry: they require a large amount of labeled data (i.e., molecules whose properties are known) for training (gilmer2017neural). However, labeled molecules make up an extremely small portion of the whole chemical space, since labels can only be obtained through expensive experiments or DFT calculations, which restricts GNN-based development. As shown in the top left part of Figure 1, the chemical space still contains many valid molecules whose properties remain unknown but whose structures carry useful information. Effectively leveraging these unlabeled molecules could improve prediction performance. Therefore, in this paper, we explore semi-supervised learning (SSL) that takes full advantage of both labeled and unlabeled molecules for property prediction.
However, this is highly challenging because of the following domain-specific characteristics. First, learning molecular graph representations is non-trivial because it involves both node-level and graph-level information. Unlike traditional applications such as social networks, which usually involve a single graph with a large number of nodes, chemical space contains a large number of graphs. Although some existing semi-supervised learning methods, such as Ladder Networks (rasmus2015semi), have performed well in domains such as images and text, they cannot be directly used for molecular graph learning. Second, it is difficult to handle the imbalance between labeled and unlabeled molecules in chemical space, since labeled molecules generally make up an extremely small portion. Directly applying previous SSL methods leads to a loss conflict: the large number of unlabeled molecules pulls training toward structural representation and away from our main goal of property prediction. Third, performance might still be unsatisfactory because labels are limited, so new molecules must be labeled to improve the model. To make labeling efficient, we need a mechanism for finding the most informative molecules to label.
To address these challenges, we design a novel framework called Active Semi-supervised Graph Neural Network (ASGN) for molecular property prediction, which takes advantage of both labeled and unlabeled molecules. ASGN uses a teacher-student framework consisting of two models that work alternately. Specifically, in the teacher model, we propose a novel semi-supervised learning method that learns a general representation by jointly exploring molecular features at both the local and the global scale. The local scale captures the essential components of molecules, i.e., atoms and bonds, while the global scale learns an encoding of the whole molecular graph with respect to the chemical space. Then, to deal with the loss conflict between unsupervised structure representation and property prediction, we introduce the student model, which is fine-tuned on the property prediction task using only the small set of labeled molecules. In this way, the student model can focus on prediction, achieving lower error than the teacher model and converging much faster. It also alleviates overfitting compared with training from scratch on the labeled dataset alone. Moreover, to improve labeling efficiency, we propose a novel strategy based on active learning to select new informative molecules. That is, ASGN uses the embeddings produced by the teacher model to select a diversified subset of molecules in the chemical space and adds them to the labeled dataset; the two models are then fine-tuned repeatedly until the label budget or the desired accuracy is reached. We conduct extensive experiments on real-world datasets, and the results demonstrate the effectiveness of ASGN. To the best of our knowledge, this is the first attempt to actively incorporate both unlabeled and labeled molecules for property prediction in a semi-supervised manner.
2. Related Work
In this section, we summarize the related work in the following three categories.
Molecular Property Prediction. Predicting the properties of molecules is a fundamental task with applications in many areas such as chemistry and biology (becke2007quantum; oglic2017active). According to quantum physics, the states of a molecule are characterized by the Schrödinger equation (thouless2014quantum). The first class of methods, such as Density Functional Theory (DFT) (becke2014perspective), consists of simulation-based methods that are directly derived from or approximate the Schrödinger equation. However, DFT methods are time-consuming because they solve large systems of linear equations, and the complexity of DFT is where is the number of atoms.
Another class of molecular property prediction methods is data-driven (hansen2015machine; ying2018graph; gilmer2017neural; do2019graph). Researchers first attempted to use traditional machine learning methods, representing a molecule with empirical descriptors or handcrafted features and feeding them to linear or logistic regression (hansen2015machine; ying2018graph). However, these methods cannot achieve desirable accuracy because of the limited effectiveness of handcrafted features and limited model capacity (gilmer2017neural).
Inspired by the remarkable development of graph neural networks in various domains (gilmer2017neural; wang2019mcne; DBLP:conf/iclr/PeiWCLY20; wang2018united), researchers have noticed their potential for molecular property prediction. Generally, by treating the molecule as a graph, several graph neural networks have been applied (hamilton2017inductive; ma2019graph; wang2019mcne) as an architecture that can directly deal with non-Euclidean data like graphs. Variants of graph neural networks such as MPNN (gilmer2017neural) and SchNet (schutt2017schnet) can be applied to molecular property prediction, where nodes represent atoms and edges are weighted by the distances between atoms. The node embeddings are then propagated and updated using the embeddings of their neighbors, a process called message passing. Finally, the graph embedding is pooled from the node embeddings for property prediction.
Semi-supervised Representation Learning. Semi-supervised learning is a popular framework for improving model performance by incorporating unlabeled data into training (zhu2005semi). The main idea is to use the unlabeled data to learn a general and robust representation. On the one hand, methods like the ladder network (rasmus2015semi) jointly learn representations for unlabeled data (via generation) and labeled data (kingma2013auto). On the other hand, a recently popular approach uses self-supervised methods that force the network to be consistent under handcrafted transformations such as image inpainting (pathak2016context), rotation (gidaris2018unsupervised), and contrastive loss (He). Usually, these methods use a pseudo-labeling mechanism that assigns each unlabeled sample a pseudo label and trains the neural network to predict it. The pretrained models can then be used for downstream tasks like classification or regression. For example, gidaris2018unsupervised uses the rotation angle of an image as a pseudo label. These pseudo labels are often obtained from transformations of the data that do not change its semantic features. Deep Clustering (caron2018deep) shows that the convolutional neural network itself can be viewed as a strong prior for processing image data. Accordingly, it designs a self-supervised method based on learning the clustering results of the features produced by the neural network.
Active Learning. Active learning is a popular framework for alleviating data deficiency, and it has been applied in many tasks (gal2017deep; yang2014active; DBLP:conf/aaai/WuLZPLC20; DBLP:journals/tois/HuangLCWXCMH20). An active learning framework starts with a small set of labeled data and a large set of unlabeled data. In every iteration, it uses a model to select a batch of unlabeled data to be labeled, which supplements the limited labeled data and improves performance. Generally, representative methods design the selection strategy from two perspectives, i.e., uncertainty and diversity (gal2017deep; Sener2017). Specifically, uncertainty-based methods define the model uncertainty for a new unlabeled sample using some statistical property (e.g., variance) and then select the samples with the highest value (gal2017deep; ting2018optimal). In comparison, diversity-based methods aim to choose a small subset that is most representative of the whole dataset (Sener2017).
As pointed out in (ash2019deep), the data selected by the uncertainty strategy are almost identical in batch mode settings, so it might not be suitable for large datasets such as ours. In this paper, we propose a novel diversity-based active learning strategy for informative molecule selection, in which the semi-supervised embeddings are used to calculate the distance between molecules.
3. Definitions and Notations
In this section, we give formal definitions of the terminology and problems used in this paper. Following previous work (gilmer2017neural; schutt2017schnet), we treat each molecule in chemical space as a graph and define a molecular graph as follows:
Definition 3.1 (Molecular Graph).
A molecule is denoted as a weighted graph , where the vertex set , we use to represent the feature vector of the node (atom) indicating its type such as carbon or nitrogen. is the total number of atoms. is the set of edges connecting two atoms (nodes) and . Specifically, in a certain molecule, the coordinates of each atom can be represented as . Therefore, we further denote the edge between two atom nodes as weighted by their coordinate distance .
Then we give the formal definition of chemical space.
Definition 3.2 (Chemical Space).
Generally, the whole chemical space consists of a set of molecules, which can be denoted as: . In practice, only a subset of molecules in the space have been examined to obtain several of their properties (e.g., energy) by typical DFT calculation. Therefore, we divide the chemical space into two subsets , . Specifically, represents the subset of molecules whose properties have been examined, where denotes the real-valued property vector of molecule . Comparatively, represents the subset of molecules whose properties remain unknown. Without loss of generality, we call the subsets and the “labeled set” and the “unlabeled set”, respectively.
With the above definitions, our problem can be formalized as follows: we want to find a model, using the limited labels , that precisely predicts the properties of molecules.
4. ASGN: Active Semi-supervised Graph Neural Network
In this section, we first present the overall framework of ASGN and then describe its components in detail.
4.1. Framework
In this paper, we propose a novel Active Semi-supervised Graph Neural Network (ASGN) for molecular property prediction by incorporating both labeled and unlabeled molecules in chemical space. The general framework is illustrated in Figure 2.
We use a teacher model and a student model that work iteratively. Each of them is a graph neural network. In the teacher model, we use semi-supervised learning to obtain a general representation of molecular graphs: we jointly train the embeddings for unsupervised representation learning and property prediction. Then, in the student model, we handle the loss conflict by fine-tuning the parameters transferred from the teacher model for property prediction. After that, we use the student model to assign pseudo labels to the unlabeled dataset. As feedback, the teacher model can learn the student's knowledge from these pseudo labels. In addition, to improve labeling efficiency, we use active learning to select new representative unlabeled molecules for labeling. Specifically, the key idea is to use the embeddings output by the teacher model to find the subset that is most diversified within the whole unlabeled set. We then assign ground truth labels to these molecules, for example by DFT calculation, add them to the labeled set, and repeat the iteration, fine-tuning the two models until the label budget or the desired accuracy is reached.
In the following, we will first describe technical details of our teacher model and student model.
4.2. Semi-supervised Teacher Model
In the teacher model, we use semi-supervised learning. We first introduce the network backbone and then the losses for representation learning. Specifically, a property loss on labeled molecules and two unsupervised losses (at the graph level and the node level) on all molecules are designed to guide it.
4.2.1. Message Passing Graph Neural Network
The task of the teacher model is to learn a general representation of molecular graphs from both the labeled set and the unlabeled set. We first introduce a message passing graph neural network (MPGNN) as the backbone, which transforms a molecular graph into a representation vector. The network consists of message passing layers. At th layer, it first embeds each node in a graph to a high dimensional space as their embeddings using . Then each node embedding is updated by aggregating the embeddings of its neighbors along the weighted edges, a step called message passing:
(1) 
where is the activation function, is a learnable weight matrix, and is the aggregation function, such as sum, mean, or max (ma2019graph). Here we choose sum as the aggregation type, which directly adds the messages from a node's neighbors, as suggested in (Xu2018). is a vector called the message function, determined by the node embeddings and edge weights, that passes from node to . As the interactions decay with the growth of the distance between two atoms, we use a Gaussian radial basis (schutt2017schnet) to embed the edge information that reflects the interaction strength between nodes:
(2)
for where is a set of predefined filter centers. Denser centers give higher resolution and can capture small differences between bond lengths.
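To make the edge embedding concrete, the following minimal Python sketch expands a single interatomic distance over a grid of Gaussian filter centers. The function name `gaussian_rbf`, the width parameter `gamma`, its value, and the example grid are illustrative assumptions; only the idea of Gaussian filters placed at predefined centers comes from the text.

```python
import math

def gaussian_rbf(distance, centers, gamma=10.0):
    """Expand one distance into Gaussian radial basis features.

    `gamma` (filter sharpness) is an assumed hyperparameter.
    """
    return [math.exp(-gamma * (distance - mu) ** 2) for mu in centers]

# Hypothetical grid: 301 evenly spaced centers between 0 and 3.
centers = [0.01 * k for k in range(301)]
features = gaussian_rbf(0.15, centers)

# The strongest response comes from the center closest to the distance.
closest = max(range(len(centers)), key=features.__getitem__)
```

A denser grid of centers yields more features per edge, so two slightly different bond lengths activate different filters more distinctly.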
After layers of message passing and aggregation, we aggregate all node embeddings to get the whole graph embedding:
(3) 
In this paper, we use a simple pooling method that directly averages or sums all node embeddings. Finally, a multilayer perceptron is used to get the property .
Traditionally, MPGNN is trained in a supervised manner where all the labels are given, and the mean squared error (MSE) between predictions and labels (i.e., the labeled properties in ) is used to guide the optimization of the model parameters:
(4) 
However, in practice, a training set with a small number of labels easily results in an overfitted model. Additionally, end-to-end training that only learns a high-level representation guided by the property label is less effective for structural representation. To overcome these challenges, we propose a semi-supervised representation learning method that considers both local-level and global-level unsupervised information to enhance the expressive power of the model for both labeled and unlabeled molecular graphs.
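The backbone described in this subsection can be sketched in plain Python. Because the exact form of Eq. (1) is not reproduced here, the update rule below (neighbor embeddings scaled element-wise by an edge filter, summed, added to the node's own embedding, then passed through a weight matrix and a ReLU) is an assumed, simplified variant rather than the authors' exact layer. All function names are hypothetical.

```python
def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def message_passing_layer(h, edges, weight):
    """One round of sum-aggregated message passing.

    h:      list of node embeddings (lists of floats)
    edges:  list of (src, dst, edge_filter); edge_filter has the embedding size
    weight: square weight matrix (assumed shape: dim x dim)
    """
    dim = len(h[0])
    agg = [[0.0] * dim for _ in h]
    for src, dst, edge_filter in edges:
        # Assumed message function: neighbor embedding gated by the edge filter.
        for d in range(dim):
            agg[dst][d] += h[src][d] * edge_filter[d]
    updated = []
    for node, msg in zip(h, agg):
        pre = matvec(weight, [a + b for a, b in zip(node, msg)])
        updated.append([max(0.0, x) for x in pre])  # ReLU activation
    return updated

def mean_readout(h):
    """Average all node embeddings into one graph embedding."""
    return [sum(col) / len(h) for col in zip(*h)]

def mse(predictions, targets):
    """Mean squared error between predicted and labeled properties."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Toy two-atom molecule with one bidirectional edge.
h0 = [[1.0, 0.0], [0.0, 1.0]]
edges = [(0, 1, [0.5, 0.5]), (1, 0, [0.5, 0.5])]
identity = [[1.0, 0.0], [0.0, 1.0]]
h1 = message_passing_layer(h0, edges, identity)
graph_embedding = mean_readout(h1)
```

In a real model the weight matrix would be learned and several such layers would be stacked before the readout and the final multilayer perceptron.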
4.2.2. Node Level Representation Learning
In node-level representation learning, we capture domain knowledge from the geometric information of a molecular graph. The main idea is to use the node embeddings to reconstruct the node types and the topology (distances between nodes). Specifically, we first randomly sample some nodes and edges from the graph, as shown in Figure 2, then pass the representations of these nodes to an MLP and use its outputs to reconstruct the node types and distances between nodes . Mathematically, we minimize the following cross-entropy loss:
(5)  
where the first term is the loss for node type reconstruction and the second term is the loss for edge weight reconstruction. For both terms, we optimize the expectation over the samples. is the number of atom types. We turn the continuous edge weights into a discrete classification problem by dividing the continuous distance into several discrete bins, and is the total number of bins. It means that only if is the nearest to the weight of edge . is a multilayer perceptron.
Practically, we randomly sample some nodes and edges to reconstruct their attributes and optimize the expectation over the samples. We found such random sampling to be significantly more efficient without sacrificing much performance. We sample () edges from the graph along with the nodes to reconstruct their features. Moreover, we notice that a fully connected graph is a redundant representation of a molecule, because a molecule contains only degrees of freedom since the coordinates of each atom can be decided by numbers as . Therefore, sampling edges with size is an efficient trade-off between performance and algorithmic complexity. By optimizing the reconstruction loss (Eq. (5)), we obtain node embeddings that contain the topology and features of molecular graphs.
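The discretization step used for edge reconstruction can be illustrated with a short Python sketch: a continuous distance is mapped to the index of its nearest bin center, and the classifier output is scored with cross-entropy against that index. The bin centers, logits, and function names below are hypothetical; the sketch shows the mechanism only, not the authors' implementation.

```python
import math

def distance_to_bin(distance, bin_centers):
    """Return the index of the bin center nearest to the distance."""
    return min(range(len(bin_centers)), key=lambda k: abs(distance - bin_centers[k]))

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Cross-entropy between predicted logits and a one-hot target."""
    return -math.log(softmax(logits)[target_index])

# Hypothetical bins and a hypothetical classifier output for one sampled edge.
bin_centers = [0.5, 1.0, 1.5, 2.0]
target = distance_to_bin(1.2, bin_centers)
loss = cross_entropy([0.1, 2.0, 0.3, -1.0], target)
```

The same `cross_entropy` helper applies to node type reconstruction, with the atom type index as the target.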
4.2.3. Graph Level Representation Learning
Although node embeddings that can reconstruct the topology of molecules effectively represent molecular structure, a recent study (Hu) shows that combining them with graph-level representation learning is beneficial for downstream tasks like property prediction. To learn a graph-level representation, the key insight is to use the mutual relations between molecules within the chemical space, i.e., similar molecules have roughly similar properties. Inspired by this intuition, we propose a method based on learning to cluster to enhance the graph-level representation. First, we calculate the graph-level embedding with the network. Then we use an implicit clustering-based method to assign each molecule a cluster id which contains clusters generated by the implicit clustering process. After that, we optimize the model with a penalty loss function. The process is repeated until at least a local minimum is reached.
Next, we introduce the details of graph-level representation learning. We denote as the cluster id in the rest of this section. First, we pass the graph-level embedding into a multilayer perceptron and predict the probability distribution . We assume there exists a posterior distribution of the cluster id. We optimize the cross-entropy loss between and as follows:
(6)
However, we easily get a trivial solution if no constraint is applied on . The key is to confine these cluster ids to a predefined prior distribution, as in (bojanowski2017unsupervised; asano2019self). We choose a uniform distribution with fixed supports, which means that the whole dataset is divided into roughly equal-sized subsets. Practically, we use a hard labeling technique that constrains to be a discrete label by applying the hardmax function. Then we explicitly write the optimization objective as:
(7)
We alternately optimize the predictive distribution, by performing gradient descent on the network parameters, and the posterior distribution, by the following method, which can be viewed as an implicit clustering approach. We first rewrite Eq. (7) as:
(8) 
where denotes the Frobenius dot-product between two matrices, , , and denotes the joint distribution of and . This is a typical optimal transport problem; we add an entropy regularization term and use the Sinkhorn-Knopp algorithm (cuturi2013sinkhorn) for faster convergence:
(9)
In fact, this process can be viewed as a type of clustering (cuturi2014fast), so we call this loss the clustering loss for self-supervision.
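As a rough illustration of how Sinkhorn-Knopp iterations turn predicted cluster probabilities into balanced cluster assignments, consider the following plain-Python sketch. It follows the common recipe of sharpening the predictions with an exponent and then alternately rescaling columns and rows. The exponent `lam`, the iteration count, the probability floor, and the function name are assumptions made for this sketch, and the exact regularized objective of Eq. (9) is not reproduced.

```python
def sinkhorn_assign(probs, lam=25.0, n_iters=50):
    """Balanced hard cluster assignment via Sinkhorn-Knopp scaling.

    probs[i][k] is the predicted probability of cluster k for sample i.
    `lam` and `n_iters` are assumed values for illustration.
    """
    n, k = len(probs), len(probs[0])
    # Sharpen predictions; the floor avoids exact zeros.
    q = [[max(p, 1e-12) ** lam for p in row] for row in probs]
    for _ in range(n_iters):
        # Rescale columns so every cluster holds total mass 1/k.
        col = [sum(q[i][c] for i in range(n)) for c in range(k)]
        q = [[q[i][c] / (col[c] * k) for c in range(k)] for i in range(n)]
        # Rescale rows so every sample holds total mass 1/n.
        q = [[v / (sum(row) * n) for v in row] for row in q]
    # Hard labels: the cluster with the largest transported mass.
    return [max(range(k), key=row.__getitem__) for row in q]

# Toy predictions for four molecules and two clusters.
probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
cluster_ids = sinkhorn_assign(probs)
```

The column rescaling is what enforces the uniform prior over clusters: a cluster that attracts too many samples has its mass scaled down, which pushes borderline samples toward other clusters.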
4.3. Supervised Student Model
In practice, directly optimizing Eq. (10) of the teacher model yields unsatisfactory results for property prediction. The teacher model is heavily loaded, since it must learn several tasks simultaneously. Because these optimization targets conflict, we observe that each target performs worse than when it is optimized separately. It is also inefficient, because if then little attention will be paid to the optimization of in an epoch, even though property prediction is what we care about most. As a result, the property prediction loss is much higher than that of a model that only needs to learn this task. To alleviate this problem, we introduce a student model. We use the teacher model to learn a general representation by jointly optimizing the objectives above. When the teacher's learning process ends, we transfer the teacher's weights to the student model and fine-tune the student model only on the labeled dataset to learn the target properties, using the same loss as Eq. (4), as shown in Figure 2:
(11)
After fine-tuning, we use the student model to run inference on the whole unlabeled dataset and assign each unlabeled molecule a pseudo label, i.e., the student's prediction of its properties. The unlabeled dataset is then where denotes the parameters of the student model. In the next iteration, the teacher model also learns these pseudo labels, so Eq. (10) becomes:
(12) 
This can be viewed as the teacher learning from the student's knowledge as feedback, inspired by the idea of knowledge distillation (hinton2015distilling). In summary, we handle the loss conflict by using two models with different targets. The teacher model learns a general representation, while the student model aims to learn accurate prediction of molecular graph properties. The pretraining of the teacher provides a warm start for the student model.
4.4. Active Learning for Data Selection
We have incorporated the information in both labeled and unlabeled molecules. However, because few labels are available, the accuracy might still be unsatisfactory, so we need to label new data to improve performance. Therefore, in each iteration we use the embeddings output by the teacher model to select a subset of molecules, whose properties (ground truth labels) are then computed (e.g., by DFT). We add these selected molecules to the labeled set and fine-tune the two models iteratively. The key strategy of active learning is to find a small batch of the most diversified molecules in the chemical space for labeling. A well-studied way to measure diversity is to sample from a determinantal point process (DPP), as (kulesza2011k) suggests. However, this subset selection is NP-hard, so we use a greedy approximation, the k-center method. Denoting the unlabeled dataset by , and the labeled dataset by , we use a myopic method: in each iteration we choose a subset of data that maximizes the distance between the labeled set and the unlabeled set. Concretely, for every within the th batch, we choose the data point that satisfies the following condition:
(13) 
where is the distance between two molecules. We use norm on the representations produced by the teacher model. Since the teacher model learns a general representation, it is natural to assume that the distance between the representations of two molecules reflects how different they are. Moreover, because the features are extracted automatically, we do not need to rely on handcrafted distances like graph edit distance, which might not suit our problem. Additionally, since the teacher model is trained in a semi-supervised manner, it only needs to be fine-tuned when new labeled data is added, which accelerates training.
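The greedy selection rule can be sketched as a farthest-point procedure: repeatedly pick the unlabeled embedding whose nearest already-covered point (labeled or previously picked) is farthest away. The sketch below assumes Euclidean distance on the teacher embeddings and uses hypothetical function names; it illustrates the greedy center-based strategy in general, not the authors' exact code.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_center_greedy(labeled, unlabeled, batch_size):
    """Return indices of unlabeled points chosen by greedy farthest-point selection."""
    # Distance from each unlabeled point to its nearest labeled point.
    min_dist = [min(euclidean(u, l) for l in labeled) for u in unlabeled]
    selected = []
    for _ in range(min(batch_size, len(unlabeled))):
        idx = max(range(len(unlabeled)), key=min_dist.__getitem__)
        selected.append(idx)
        # The chosen point becomes a center: refresh nearest-center distances.
        for j, u in enumerate(unlabeled):
            min_dist[j] = min(min_dist[j], euclidean(u, unlabeled[idx]))
        min_dist[idx] = float("-inf")  # never pick the same point twice
    return selected

# Toy one-dimensional layout embedded in 2D.
labeled = [[0.0, 0.0]]
unlabeled = [[1.0, 0.0], [5.0, 0.0], [5.1, 0.0], [-4.0, 0.0]]
batch = k_center_greedy(labeled, unlabeled, 2)
```

In the toy example, once the point at 5.1 is picked, its close neighbor at 5.0 is considered covered, so the second pick goes to the point at -4.0. This is the diversity effect the strategy is designed to produce.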
4.5. Method Summary and Discussion
In this subsection, we briefly summarize the framework in Algorithm 1. We are given an unlabeled set and a labeled set. In each iteration, we use the k-center active learning strategy to get a new batch of data for labeling and add it to the labeled set (Line 4). Next, we transfer the teacher's weights to the student network (Line 5) and fine-tune the student network (Line 6). Then we use the student model to assign pseudo labels of the property to the rest of the unlabeled dataset (Line 7). After that, we continue to fine-tune the teacher model jointly on the three tasks (Line 8). Finally, the trained student model is used to predict the properties of molecules.
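The iteration described above can be written as a control-flow skeleton. The sketch below is not Algorithm 1 itself: every callable (`train_teacher`, `select_batch`, `oracle`, `finetune_student`, `predict`) is a hypothetical placeholder supplied by the caller, and the initial teacher training before the first selection is an assumption, made because the selection step needs teacher embeddings.

```python
def active_teacher_student_loop(labeled, unlabeled, n_rounds, *, train_teacher,
                                select_batch, oracle, finetune_student, predict):
    """Control-flow sketch only; all callables are user-supplied placeholders."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    # Assumed initialization: train the teacher once before the first selection.
    teacher = train_teacher(None, labeled, unlabeled, None)
    student = None
    for _ in range(n_rounds):
        if not unlabeled:
            break
        # Active selection on teacher embeddings, then ground-truth labeling.
        picked = select_batch(teacher, labeled, unlabeled)
        labeled += [(mol, oracle(mol)) for mol in picked]
        unlabeled = [mol for mol in unlabeled if mol not in picked]
        # Weight transfer and fine-tuning on labeled data only.
        student = finetune_student(teacher, labeled)
        # Pseudo labels for the remaining unlabeled molecules.
        pseudo_labels = [predict(student, mol) for mol in unlabeled]
        # Teacher fine-tuning on all tasks, including the pseudo labels.
        teacher = train_teacher(teacher, labeled, unlabeled, pseudo_labels)
    return student, labeled, unlabeled
```

Passing the models and the labeling oracle in as callables keeps the sketch independent of any particular network library or DFT package.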
To summarize, we propose a novel approach to predicting the properties of molecules using graph neural networks. First, we use a multi-level representation learning method to obtain general embeddings for molecular graphs. The node embeddings store the essential components of molecular graphs, and they can be composed into meaningful graph-level embeddings with respect to the whole data distribution. Subsequently, a teacher-student framework is used to combine semi-supervised learning and active learning effectively to deal with label insufficiency. Compared with vanilla semi-supervised learning methods (Sener2017), the separation of the two models alleviates the loss conflict. Compared with naive active learning methods that retrain the model from scratch whenever a new batch of data points is selected, the weights transferred from the teacher provide a warm start for the student, which avoids overfitting on the small labeled dataset and accelerates training. In addition, the two models communicate through weight transfer and pseudo-label feedback, so that each improves the other.
5. Experiments
| Properties |  |  |  |  |  | HOMO | LUMO | gap | ZPVE |  |  |  |
| Unit | eV | eV | eV | eV | Cal/MolK | eV | eV | eV | eV | Bohr | Debye | Bohr |
| Supervised | 0.3204 | 0.2934 | 0.2948 | 0.2722 | 0.2368 | 0.1632 | 0.1686 | 0.2475 | 0.0007 | 10.05 | 0.3201 | 0.5792 |
| MeanTeachers | 0.3717 | 0.2730 | 0.2535 | 0.2150 | 0.2036 | 0.1605 | 0.1686 | 0.2394 | 0.00054 | 5.22 | 0.3488 | 0.5792 |
| InfoGraph | 0.1410 | 0.1702 | 0.1592 | 0.1552 | 0.1965 | 0.1605 | 0.1659 | 0.2421 | 0.00036 | 4.92 | 0.3168 | 0.5444 |
| ASGN (Ours) | 0.0562 | 0.0594 | 0.0560 | 0.0583 | 0.0984 | 0.1190 | 0.1061 | 0.2012 | 0.00017 | 1.38 | 0.1947 | 0.2818 |
| Property | HOMO | LUMO |
| Unit | Hartree | Hartree |
| Supervised | 0.080 | 0.078 |
| MeanTeachers | 0.078 | 0.075 |
| InfoGraph | 0.077 | 0.076 |
| ASGN (Ours) | 0.059 | 0.057 |
In this section, we conduct extensive experiments to show the effectiveness of ASGN on two popular molecular datasets. The code is publicly available at https://github.com/HaoZhongkai/AS_Molecule.
5.1. Datasets

QM9 (http://quantummachine.org/datasets/): The QM9 dataset (ramakrishnan2014quantum) is a well-known benchmark dataset that contains the equilibrium coordinates of 130,000 molecules along with their quantum mechanical properties. Coordinates and properties for all molecules are calculated using DFT methods. Molecules in QM9 contain no more than 9 heavy atoms (atoms heavier than hydrogen). We use 10,000 molecules for testing and 10,000 for validation.

OPV (https://cscdata.nrel.gov/#/datasets/ad5d2c9aaf0a4d72b9431e433d5750d6): OPV (st2019message) is a dataset of roughly 100,000 medium-sized molecules, each containing 20 to 30 heavy atoms. Again, the properties and equilibrium coordinates of these molecules are obtained through DFT. We use 5,000 molecules for testing and 5,000 for validation.
5.2. Experiments Setup
We evaluate our method under two experimental settings. We first describe the implementation details and parameters of ASGN. All experiments are run on one Tesla V100 GPU and 16 Intel CPUs.
Graph Neural Network Hyperparameters. For the network backbone, we use 4 message passing layers and an embedding dimension of 96 in Eq. (1). We use the Adam optimizer with a learning rate of 1e-3. We use filters from 0 to 3nm with an interval of 0.01nm in Eq. (2).
Semi-Supervised Learning Hyperparameters. The teacher model has an additional linear classifier after the graph neural network. We divide the distance of the edge into bins in Eq. (5) for reconstruction. We use in Eq. (6). The regularization constant is set to 25 in Eq. (9). We train (fine-tune) the teacher model for 20 epochs in each iteration. We train the student network until the loss has not decreased for about 20 epochs.
Active Learning Hyperparameters. In each iteration, we select 1,000 new unlabeled molecules in Eq. (13) to be labeled and add them to the training dataset.
5.3. Effectiveness Experiment
To demonstrate that our method achieves lower error with limited labeled data, we first conduct an effectiveness experiment. In this setting, we have a fixed label budget, which is the maximum number of labels. Given this budget, we compare the final mean absolute error (MAE) (schutt2017schnet) on the test dataset after training. We use a label budget of 5,000 for both QM9 and OPV (about 5%). Apart from these 5,000 labeled molecules, no other labels are available. We compare our method with the baselines listed below.
5.3.1. Baselines
For the effectiveness experiment, we mainly compare our method with several semi-supervised learning baselines. To ensure fairness, all baselines use the same network backbone (i.e., MPGNN). The baselines are selected from two perspectives: traditional semi-supervised learning and semi-supervised learning for graph data.

Supervised: We train the network backbone in a fully supervised manner on the small labeled dataset only.

MeanTeachers (tarvainen2017mean): This semi-supervised learning method uses consistency regularization and takes a moving average of the model's weights as the teacher.

InfoGraph (sun2019infograph): This is the state-of-the-art method for semi-supervised or unsupervised learning on graphs. It maximizes the mutual information between the graph-level representations and the substructures of the graphs.
5.3.2. Results
First, we found that our method is significantly better than the baseline methods on all properties. We achieved an error reduction of more than 50% on several properties such as , , and compared with the state-of-the-art method. This shows that our semi-supervised learning method is effective and that incorporating unlabeled data helps the prediction of molecular properties.
Second, the semi-supervised reconstruction captures domain knowledge about molecules and achieves better results than the supervised model (i.e., MPGNN) and MeanTeachers. The global representation learning at the graph level is also beneficial for molecular property prediction, and its performance is better than that of InfoGraph.
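The size of the improvement can be checked directly from the reported numbers. The short Python sketch below defines the MAE metric and computes the relative error reduction of ASGN over InfoGraph for each column of the QM9 results table, using the values exactly as they appear in the table rows. The helper names are hypothetical.

```python
def mean_absolute_error(predictions, targets):
    """MAE between predicted and reference property values."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)

def relative_reduction(baseline, ours):
    """Fraction by which `ours` lowers the error relative to `baseline`."""
    return (baseline - ours) / baseline

# MAE rows copied from the QM9 results table, in column order.
infograph = [0.1410, 0.1702, 0.1592, 0.1552, 0.1965, 0.1605,
             0.1659, 0.2421, 0.00036, 4.92, 0.3168, 0.5444]
asgn = [0.0562, 0.0594, 0.0560, 0.0583, 0.0984, 0.1190,
        0.1061, 0.2012, 0.00017, 1.38, 0.1947, 0.2818]

reductions = [relative_reduction(b, o) for b, o in zip(infograph, asgn)]
n_above_half = sum(r > 0.5 for r in reductions)
```

With these values, the reduction is positive in every column and exceeds 50% in six of the twelve columns, which is consistent with the claim of a reduction of more than 50% on several properties.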
5.4. Efficiency Experiment
To demonstrate that ASGN is label efficient, we conduct an efficiency experiment. We start with 5,000 labeled molecules and place the rest in the unlabeled set. Then, in each iteration, the model selects molecules from the unlabeled set and we add them to the labeled set. During this process, we measure the label rate versus Mean Absolute Error (MAE) curve to show how many labels are saved for a fixed error. For a fixed error, the fewer labeled data a model uses, the better it is.
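Reading label savings off such a curve can be sketched as follows. The helper `labels_needed` and both curves are hypothetical and serve only to illustrate the comparison; the numbers are not results from the paper.

```python
def labels_needed(curve, target_mae):
    """Return the smallest label count whose MAE is <= target_mae, or None.

    `curve` is a list of (num_labels, mae) pairs, one per active-learning round.
    """
    reached = [n for n, mae in curve if mae <= target_mae]
    return min(reached) if reached else None


# Hypothetical curves, for illustration only.
curve_a = [(5000, 0.20), (7500, 0.15), (10000, 0.11), (12500, 0.09)]
curve_b = [(5000, 0.22), (7500, 0.19), (10000, 0.16), (12500, 0.13), (15000, 0.10)]

target = 0.12
n_a = labels_needed(curve_a, target)
n_b = labels_needed(curve_b, target)
print(n_a, n_b)  # 10000 15000
```

For the same target error, the method whose curve crosses the target at a smaller label count is the more label-efficient one.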
5.4.1. Baselines
The baselines are selected from active learning methods and applied to the backbone of ASGN (i.e., MPGNN). We omit methods that cannot be applied to our setting. For ASGN, we use a batch size of 2,500 newly labeled molecules per iteration in Eq. (13). The computational cost of the QBC method on the OPV dataset is unaffordable, so we omit it there.

Random: Data points are chosen randomly from the unlabeled dataset in each iteration. The model is reinitialized whenever a new batch of labeled data is selected. This method is equivalent to passive learning.

Query By Committee (QBC) (seung1992query): We jointly train a group of models, called a committee, that are initialized with the same method but different parameters. In each iteration we choose the batch of data points on which the committee members disagree most. We use a committee of 8 models; training 8 models at the same time is time consuming.

Deep Bayesian Active Learning (BALD) (gal2017deep): This is an uncertainty-based method. We approximate the uncertainty by performing Monte Carlo dropout (srivastava2014dropout) on layers of the network.

Vanilla k-center (Sener2017): The representation learned by semi-supervised learning methods benefits the selection of new data points. We therefore also compare our method with the plain (vanilla) k-center active learning strategy.
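Two of the selection criteria above can be sketched in a few lines. This assumes molecule embeddings are plain coordinate tuples compared by Euclidean distance, and that committee disagreement is measured by prediction variance; the function names and toy data are illustrative assumptions rather than the implementation used in the experiments.

```python
import math


def k_center_greedy(points, labeled_idx, batch_size):
    """Greedily pick `batch_size` unlabeled points, each time taking the point
    farthest from its nearest already-selected (or labeled) center."""
    if not labeled_idx:
        raise ValueError("need at least one labeled point as an initial center")
    # Distance from every point to its nearest current center.
    min_dist = [min(math.dist(p, points[c]) for c in labeled_idx) for p in points]
    selected = []
    for _ in range(batch_size):
        far = max(range(len(points)), key=lambda i: min_dist[i])
        if min_dist[far] == 0.0:
            break  # every remaining point coincides with a center
        selected.append(far)
        # The new center may now be the nearest one for some points.
        min_dist = [min(d, math.dist(p, points[far]))
                    for d, p in zip(min_dist, points)]
    return selected


def committee_disagreement(predictions):
    """Variance of the committee members' predictions for one molecule."""
    mean = sum(predictions) / len(predictions)
    return sum((p - mean) ** 2 for p in predictions) / len(predictions)


# Toy 2-D embeddings; index 0 is already labeled.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.0, 4.0), (0.0, 1.0)]
print(k_center_greedy(embeddings, labeled_idx=[0], batch_size=2))  # [3, 2]

# Four committee members split between two predictions: variance is 1.0.
print(committee_disagreement([1.0, 1.0, 3.0, 3.0]))  # 1.0
```

The k-center criterion favors diversity (covering the embedding space), whereas QBC and BALD favor uncertainty, which is why the quality of the learned representation matters more for the former.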
5.4.2. Results
We plot the results for HOMO (highest occupied molecular orbital) on both the QM9 and OPV datasets in Figure 3. "Full" denotes the MAE of a supervised MPGNN trained on all labeled data. We draw the following conclusions.
First, for all datasets and properties, when the number of labels is fixed, the MAE of our model is much lower than that of the baselines, which demonstrates its effectiveness. This shows that the active learning strategy is beneficial for model training. Additionally, the performance is better than that of a fully supervised model trained on all labeled data, which shows the benefit of using the semi-supervised loss as regularization.
Second, when we set a fixed error target, we found that our model is more label efficient than the baselines. This means that, if we only need a predictor with a given accuracy, our model requires only a fraction of the labels needed by the other methods. Specifically, we use 50% of the labeled data to reach full accuracy on QM9 and 40% on OPV.
Third, we found that some baseline methods that work well in deep learning for image classification, such as BALD and k-center, do not perform well on molecular data. Additionally, since BALD requires dropout, its performance is better when few labels are available but worse when we use all the labels.
5.5. Ablation Experiments
In this section, we conduct further experiments on ASGN, including an ablation study to show how each part of our model affects performance and a visualization experiment to support the interpretability of our model.
5.5.1. Necessity of Teacher-Student Framework
First, to show the effectiveness of the teacher-student framework in our model, we conduct an ablation study of ASGN without the teacher model or the student model. We denote ASGN with only the teacher model as ASGNT, which means that we jointly learn all tasks without handling the loss conflict. We list the results for HOMO on the QM9 and OPV datasets in Table 3. With the student network, the model achieves better performance on the property prediction task.
We also study the case without the teacher model, denoted ASGNS, in which no semi-supervised learning is used. Note that ASGNS is identical to a vanilla k-center active learning method (Sener2017). The results show that the teacher-student framework is necessary.
5.5.2. Necessity of Weight Transfer
The essential step connecting the student model and the teacher model in our method is transferring the weights of the teacher model to the student model in order to accelerate training. Here we use an ablation experiment to demonstrate the effect of the weight transfer. In Figure 4, we plot the MAE of ASGN with and without weight transfer on the QM9 test dataset for the LUMO (lowest unoccupied molecular orbital) property when 10,000 labeled data are available. The results show that both training and testing MAE converge faster and are more stable with weight transfer. The final performance is also better with weight transfer.
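The weight transfer step can be illustrated with a small sketch, assuming both models expose their parameters as a name-to-values dictionary. Which parameters the two networks actually share is an assumption here, and the parameter names (`encoder.w`, `cluster_head.w`, `property_head.w`) and the function `transfer_weights` are hypothetical.

```python
import copy


def transfer_weights(teacher_params, student_params):
    """Initialize the student from the teacher for every parameter they share.

    Parameters only present in the student (e.g. a task-specific head) are kept.
    """
    new_student = copy.deepcopy(student_params)
    for name, value in teacher_params.items():
        if name in new_student:
            # Copy so that later student updates do not modify the teacher.
            new_student[name] = copy.deepcopy(value)
    return new_student


# Toy parameters: only the encoder is common to both models.
teacher = {"encoder.w": [0.3, -0.2], "cluster_head.w": [1.0]}
student = {"encoder.w": [0.0, 0.0], "property_head.w": [0.5]}
student = transfer_weights(teacher, student)
print(student)  # {'encoder.w': [0.3, -0.2], 'property_head.w': [0.5]}
```

Starting the student from the teacher's weights, rather than from a random initialization, is what the ablation above compares against.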
Table 3. Results for HOMO on the QM9 and OPV datasets.

Dataset           HOMO (QM9)                   HOMO (OPV)
Unit              eV                           Hartree
Number of data    5k       10k      50k        5k      10k     50k
ASGNT             0.1668   0.1523   0.0682     0.080   0.053   0.020
ASGNS             0.1632   0.1252   0.0653     0.076   0.049   0.019
ASGN              0.1190   0.0951   0.0517     0.060   0.039   0.015

5.6. Visualization Experiments
Our representation learning considers the mutual relations between molecules within the chemical space, and we use this shared information to predict cluster assignments and thereby enhance the representation. To demonstrate that the distribution of molecules exhibits a clustered structure, we use the t-SNE method to visualize the graph-level representations of molecules learned by ASGN, shown in Figure 5. After applying t-SNE, the molecule embeddings form clusters with clear separation between them, which verifies that we have obtained discriminative graph-level embeddings. Additionally, similar molecules fall into the same cluster, which means that the embeddings capture structural information.
6. Conclusions
In this paper, we proposed a novel framework that improves molecular property prediction with limited labels by incorporating unlabeled molecules. We designed a teacher-student framework consisting of two graph neural networks that work iteratively. We then introduced the details of our semi-supervised representation learning method for molecular graphs, which considers both graph-level and node-level information. Weight transfer and pseudo labeling are used to optimize the two models and balance the loss functions. Furthermore, we used diversity-based active learning to select new molecules for labeling. ASGN achieves much better performance than the baselines when labels are limited. Additionally, we showed the necessity of the components of ASGN using ablation experiments. In future work, we will attempt to extend our model to more general molecular property prediction.
ACKNOWLEDGMENTS. This research was supported by grants from the National Natural Science Foundation of China (Grants No. 61922073, U1605251). Qi Liu gratefully acknowledges the support of the Youth Innovation Promotion Association of CAS (No. 2014299).