Uncertainty quantification of molecular property prediction with Bayesian neural networks

03/20/2019 ∙ by Seongok Ryu, et al. ∙ KAIST Department of Mathematical Sciences

Deep neural networks have outperformed existing machine learning models in various molecular applications. However, in practical applications it is still difficult to make confident decisions because of the uncertainty in predictions arising from insufficient quality and quantity of training data. Here, we show with three numerical experiments that Bayesian neural networks are useful for quantifying the uncertainty of molecular property prediction. In particular, they enable us to decompose the predictive variance into model- and data-driven uncertainties, which helps to elucidate the source of errors. In the logP predictions, we show that data noise affects the data-driven uncertainty more significantly than the model-driven one. Based on this analysis, we were able to find unexpected errors in the Harvard Clean Energy Project dataset. Lastly, by performing bio-activity and toxicity classification experiments, we show that the confidence of a prediction is closely related to its predictive uncertainty.


1 Introduction

Modern deep neural network (DNN) models have been used in various molecular applications, such as high-throughput screening for drug discovery 1, 2, 3, 4, de novo molecular design 5, 6, 7, 8, 9, 10, 11, 12 and chemical reaction planning 13, 14, 15. DNNs show comparable, and sometimes better, performance than traditional approaches grounded in quantum chemical theories when predicting some molecular properties 16, 17, 18, 19, 20, provided that a vast amount of high-quality data is secured. Despite the remarkable potential of DNN models, the direct use of their outputs is sometimes limited because most data in practical applications suffers from problems of both data quality and quantity.

Such data discourages reliable statistical analysis based on DNN models, since their accuracy critically depends on the training data. For example, Feinberg et al. noted that higher-quality data is needed to improve prediction accuracy on drug-target interactions, a key step in drug discovery 21. The number of ligand-protein complex samples in the PDB-bind database 22 is only about 15,000, limiting the development of reliable DNN models. Preparing higher-quality data requires expensive and time-consuming experiments. Synthetic data from computations can be used as an alternative, as in the Harvard Clean Energy Project set 23, but it often suffers from unintentional errors caused by the approximation methods employed. In addition, data-inherent bias and noise hurt data quality; the Tox21 3 and DUD-E 24 datasets are such examples. The Tox21 dataset contains fewer than 10,000 samples, with far more negative than positive samples: across the toxicity types, the percentage of positive samples ranges from 2.9% to 15.5%. The DUD-E dataset is highly imbalanced as well, with almost 50 times more decoy samples than active samples. All of these issues hinder the development of reliable models.

It has been stressed in deep learning research that uncertainty analysis is necessary to address so-called AI-safety problems 25, 26, 27. That is because even though DNNs push the bounds of data-driven approaches, they often make catastrophic decisions. Uncertainty analysis has been performed to analyze the decision-making processes of deep neural networks. Kendall and Gal studied quantitative uncertainty analysis on computer vision problems by using Bayesian neural networks (BNNs) 28. They separated model- and data-driven uncertainties, which helps to identify the sources of prediction errors. This is possible because Bayesian inference allows uncertainty assessments, giving probabilistic interpretations of model outputs.

In this paper, we propose to exploit BNNs to quantify the uncertainties involved in molecular property predictions. Previous studies on uncertainty quantification have regarded the predictive variance as the predictive uncertainty 28, 29. The predictive uncertainty can be decomposed into (i) an aleatoric uncertainty arising from data noise and (ii) an epistemic uncertainty arising from the incompleteness of the model 30. We adopt the same method in this study. As a DNN model for molecular applications, we use augmented graph convolutional networks (GCNs) 31, 32, 33. In what follows, we briefly introduce BNNs, the uncertainty quantification methods based on Bayesian inference, and the augmented GCN used in this work. Then, we show the results of uncertainty analysis on three experimental studies. The main results are summarized as follows.

  • We first applied the Bayesian GCN to a simple example, the logP prediction of molecules in the ZINC set 34, in order to demonstrate uncertainty quantification in molecular applications. As expected, the aleatoric uncertainty increases as the data noise increases, while the epistemic uncertainty depends only slightly on the quality of data.

  • Second, we evaluate the quality of synthetic data and find erroneous samples produced by poor approximations. The Harvard Clean Energy Project (CEP) set 23 contains synthetic power conversion efficiency (PCE) values of molecules. We noted that molecules whose PCE value is exactly zero have conspicuously large aleatoric uncertainties; these values were verified to be incorrect annotations.

  • In the last example, the binary classification of bio-activity and toxicity, we studied the relationship between predicted probability and uncertainties. Our analysis shows that predictions with lower uncertainty turned out to be more accurate, indicating that the uncertainty can be regarded as the confidence of a prediction.

2 Theoretical background

2.1 Bayesian neural network

For a given training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, let $p(\mathcal{D} \mid \mathbf{w})$ and $p(\mathbf{w})$ be a model likelihood and a prior distribution for a parameter $\mathbf{w}$, respectively. Under the Bayesian framework, the model parameter and output are considered as random variables. The posterior distribution is given by

$p(\mathbf{w} \mid \mathcal{D}) = \dfrac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})}$  (1)

and the predictive distribution is defined as

$p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid x^*, \mathbf{w})\, p(\mathbf{w} \mid \mathcal{D})\, d\mathbf{w}$  (2)

for a new input $x^*$ and an output $y^*$. These simple formulations make the following two tasks possible: (i) assessing uncertainty of the random variables in a conditional manner and (ii) predicting a distribution of the new output $y^*$ given both the new input $x^*$ and the training set $\mathcal{D}$.

However, direct computation of eq. (2) is often infeasible when deep neural network models are exploited because the integration over the whole parameter space entails heavy computational costs. Many practical approximation methods have been proposed to handle this computational cost. Variational inference, one of the most popular approximation methods, approximates the posterior distribution with a tractable distribution $q_\theta(\mathbf{w})$ parametrized by a variational parameter $\theta$ 35, 36. Minimizing the Kullback-Leibler divergence,

$\mathrm{KL}\big( q_\theta(\mathbf{w}) \,\|\, p(\mathbf{w} \mid \mathcal{D}) \big)$,  (3)

makes the two distributions similar to one another in principle. We can replace the intractable posterior distribution in (3) with $p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w}) / p(\mathcal{D})$ due to Bayes' theorem (1). Then, our minimization objective, called the negative evidence lower-bound, is

$\mathcal{L}(\theta) = -\int q_\theta(\mathbf{w}) \log p(\mathcal{D} \mid \mathbf{w})\, d\mathbf{w} + \mathrm{KL}\big( q_\theta(\mathbf{w}) \,\|\, p(\mathbf{w}) \big).$  (4)

In order to implement Bayesian models, we need to be cautious in choosing a variational distribution $q_\theta(\mathbf{w})$. Blundell et al. proposed to use a product of Gaussian distributions for the variational distribution. In addition, a multiplicative normalizing flow 37 can be applied to increase the expressive power of the variational distribution. However, these two approaches often require a large number of weight parameters. The Monte-Carlo dropout (MC-dropout), which uses a dropout 38 variational distribution, approximates the posterior distribution by a product of Bernoulli distributions 39. MC-dropout is practical in that it does not need extra learnable parameters to model the variational posterior distribution, and the integration over the whole parameter space can be easily approximated by averaging over models sampled with a Monte-Carlo estimator 25, 39. Thus, we adopted MC-dropout in this work.
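To make the procedure concrete, the following sketch illustrates MC-dropout inference; the model, layer sizes and dropout rate here are arbitrary placeholders rather than the architecture used in this work. The essential point is that dropout remains active at test time, so each forward pass draws a different set of weights from the variational posterior.

```python
import torch
import torch.nn as nn

# A toy regressor with dropout after the hidden layer.
model = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(128, 1),
)

def mc_dropout_predict(model, x, n_samples=30):
    model.train()  # keep dropout stochastic at inference time
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    # Monte-Carlo estimates of the predictive mean and variance
    return samples.mean(dim=0), samples.var(dim=0)

x = torch.randn(8, 64)  # a dummy batch of eight inputs
mean, var = mc_dropout_predict(model, x)
```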

2.2 Uncertainty quantification with Bayesian neural network

Variational inference, which approximates a posterior $p(\mathbf{w} \mid \mathcal{D})$ with a variational distribution $q_\theta(\mathbf{w})$, provides a variational predictive distribution of a new output $y^*$ given a new input $x^*$ as

$q_\theta(y^* \mid x^*) = \int p\big( y^* \mid f^{\mathbf{w}}(x^*) \big)\, q_\theta(\mathbf{w})\, d\mathbf{w},$  (5)

where $f^{\mathbf{w}}(x^*)$ is a model output with a given $\mathbf{w}$. For regression tasks, a predictive mean of this distribution with $T$ times of MC sampling is estimated by

$\hat{\mathbb{E}}[y^*] \approx \dfrac{1}{T} \sum_{t=1}^{T} f^{\hat{\mathbf{w}}_t}(x^*)$  (6)

and a predictive variance is estimated by

$\widehat{\mathrm{Var}}[y^*] \approx \sigma^2 + \dfrac{1}{T} \sum_{t=1}^{T} f^{\hat{\mathbf{w}}_t}(x^*)^2 - \Big( \dfrac{1}{T} \sum_{t=1}^{T} f^{\hat{\mathbf{w}}_t}(x^*) \Big)^2,$  (7)

with $\hat{\mathbf{w}}_t$ drawn from $q_\theta(\mathbf{w})$ at the $t$-th sampling step and the assumption $p(y^* \mid f^{\mathbf{w}}(x^*)) = \mathcal{N}\big( y^*; f^{\mathbf{w}}(x^*), \sigma^2 \big)$. Here, the model assumes homoscedasticity with a known quantity $\sigma^2$, meaning that every data point gives a distribution with the same variance. Further to this, obtaining the distributions with different variances allows deducing a heteroscedastic uncertainty. Assuming heteroscedasticity, the output given the $t$-th sample is

$\big[ \hat{y}^*_t, \hat{\sigma}^{*2}_t \big] = f^{\hat{\mathbf{w}}_t}(x^*).$  (8)

The heteroscedastic predictive uncertainty given by (9) can be partitioned into two different uncertainties: aleatoric and epistemic.

$\widehat{\mathrm{Var}}[y^*] \approx \underbrace{\dfrac{1}{T} \sum_{t=1}^{T} \hat{\sigma}^{*2}_t}_{\text{aleatoric}} + \underbrace{\dfrac{1}{T} \sum_{t=1}^{T} \hat{y}^{*2}_t - \Big( \dfrac{1}{T} \sum_{t=1}^{T} \hat{y}^*_t \Big)^2}_{\text{epistemic}}$  (9)

The aleatoric uncertainty arises from data-inherent noise, while the epistemic uncertainty is related to model incompleteness. Note that the latter can be reduced by increasing the amount of training data, because it comes from an insufficient amount of data as well as the use of an inappropriate model 30.
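As a minimal numerical sketch of the decomposition in (9), suppose $T$ stochastic forward passes of the heteroscedastic model in (8) have already produced sampled means and variances for each input; all array names below are illustrative.

```python
import numpy as np

# y_hat[t, i] and sig2[t, i]: predicted mean and variance from the
# t-th stochastic forward pass for the i-th input (eq. (8)).
T, n = 30, 5
rng = np.random.default_rng(0)
y_hat = rng.normal(size=(T, n))            # sampled predictive means
sig2 = rng.uniform(0.1, 0.5, size=(T, n))  # sampled predictive variances

aleatoric = sig2.mean(axis=0)              # (1/T) * sum_t sigma_t^2
epistemic = (y_hat ** 2).mean(axis=0) - y_hat.mean(axis=0) ** 2
total = aleatoric + epistemic              # eq. (9)
```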

In classification problems, Kwon et al. proposed a natural way to quantify aleatoric and epistemic uncertainties as follows 29:

$\widehat{\mathrm{Var}}[y^*] \approx \underbrace{\dfrac{1}{T} \sum_{t=1}^{T} \big( \mathrm{diag}(\hat{p}_t) - \hat{p}_t \hat{p}_t^{\top} \big)}_{\text{aleatoric}} + \underbrace{\dfrac{1}{T} \sum_{t=1}^{T} (\hat{p}_t - \bar{p})(\hat{p}_t - \bar{p})^{\top}}_{\text{epistemic}},$  (10)

where $\hat{p}_t = \mathrm{softmax}\big( f^{\hat{\mathbf{w}}_t}(x^*) \big)$ and $\bar{p} = \frac{1}{T} \sum_{t=1}^{T} \hat{p}_t$. While Kendall and Gal's method requires extra parameters at the last hidden layer and often causes unstable parameter updates in the training phase 28, the method of Kwon et al. has the advantage that models do not need the extra parameters 29. Equation (10) also utilizes a functional relationship between the mean and variance of multinomial random variables. We refer to Kwon et al. for more details.
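The classification decomposition (10) can likewise be sketched in a few lines, here for $T$ sampled probability vectors over $C$ classes (all names illustrative):

```python
import numpy as np

# p[t, i, :]: softmax output of the t-th stochastic forward pass
# for the i-th input (T passes, n inputs, C classes).
T, n, C = 30, 4, 2
rng = np.random.default_rng(1)
logits = rng.normal(size=(T, n, C))
p = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

p_bar = p.mean(axis=0)  # mean predictive probability, shape (n, C)

# Aleatoric part: mean over t of diag(p_t) - p_t p_t^T
aleatoric = (np.einsum('tic,cd->ticd', p, np.eye(C))
             - np.einsum('tic,tid->ticd', p, p)).mean(axis=0)

# Epistemic part: mean over t of (p_t - p_bar)(p_t - p_bar)^T
d = p - p_bar
epistemic = np.einsum('tic,tid->ticd', d, d).mean(axis=0)

total = aleatoric + epistemic  # (n, C, C) predictive covariance, eq. (10)
```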

2.3 Graph convolutional network for molecular property predictions

Molecules, social graphs, images and language sentences can be represented as graph structures 40. The GCN is one of the most popular graph neural networks and is widely adopted to process molecular graphs. The input to the GCN is $(A, H^{(0)})$, where $A \in \mathbb{R}^{N \times N}$ is an adjacency matrix for $N$ nodes and $H^{(0)} \in \mathbb{R}^{N \times d}$ is a set of initial node features whose dimensionality is $d$. The GCN gives new node features as follows:

$H^{(l+1)} = \sigma\big( A H^{(l)} W^{(l)} \big)$  (11)

where $H^{(l)}$ and $W^{(l)}$ are the node features and weight parameters of the $l$-th graph convolution layer for $l = 0, \dots, L-1$, respectively. The GCN updates node features with information from adjacent nodes only.
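A minimal sketch of the plain graph convolution (11) follows; it is our illustration, taking $\sigma$ to be ReLU and assuming self-loops are already included in $A$:

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """Plain graph convolution, eq. (11): H' = sigma(A H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, adj, h):
        # adj: (N, N) adjacency matrix; h: (N, in_dim) node features
        return torch.relu(adj @ self.linear(h))

# Toy example: 3 atoms with 8-dimensional initial features
adj = torch.eye(3) + torch.tensor([[0., 1, 0], [1, 0, 1], [0, 1, 0]])
h0 = torch.randn(3, 8)
h1 = GraphConv(8, 16)(adj, h0)  # (3, 16)
```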

Applying self-attention 41 enables the GCN to learn relations between node pairs by reflecting the importance of adjacent nodes 42. Updating node features with $K$-head self-attention is given by

$H^{(l+1)}_i = \sigma\Big( W^{(l)} \big\Vert_{k=1}^{K} \sum_{j \in \mathcal{N}(i)} \alpha^{(k)}_{ij} W^{(l,k)} H^{(l)}_j \Big)$  (12)

where $\mathcal{N}(i)$ denotes the adjacent nodes of the $i$-th node, $H^{(l)}_i$ is the $i$-th node feature updated at the $l$-th graph convolution, $W^{(l,k)}$ is a weight parameter for the $k$-th attention head, $W^{(l)}$ is a weight parameter to combine the node features from the $K$ different attention heads, and the attention coefficient $\alpha^{(k)}_{ij}$ is given by

$\alpha^{(k)}_{ij} = \tanh\Big( \big( W^{(l,k)} H^{(l)}_i \big)^{\top} C^{(l,k)} \big( W^{(l,k)} H^{(l)}_j \big) \Big)$  (13)

where $C^{(l,k)}$ is a weight parameter.
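The following sketch implements one attention head of (12)-(13); the bilinear form of the attention coefficient is our reading of (13), so the exact parametrization should be treated as an assumption:

```python
import torch
import torch.nn as nn

class AttentionHead(nn.Module):
    """One self-attention head over a molecular graph, eqs. (12)-(13)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)       # W^(l,k)
        self.C = nn.Parameter(torch.randn(out_dim, out_dim))  # C^(l,k)

    def forward(self, adj, h):
        hw = self.W(h)                                # (N, out_dim)
        # alpha_ij = tanh((W h_i)^T C (W h_j)), masked to adjacent pairs
        alpha = torch.tanh(hw @ self.C @ hw.T) * adj  # (N, N)
        return alpha @ hw                             # aggregate neighbors

adj = torch.eye(3) + torch.tensor([[0., 1, 0], [1, 0, 1], [0, 1, 0]])
h = torch.randn(3, 8)
out = AttentionHead(8, 16)(adj, h)  # (3, 16)
```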

In addition, the GCN has room for improvement because its accuracy gradually degrades as the number of graph convolution layers increases 32, 33. We used a gated skip-connection to prevent this problem as follows:

$H^{(l+1)}_i = z_i \odot \tilde{H}^{(l+1)}_i + (1 - z_i) \odot H^{(l)}_i, \qquad z_i = \mathrm{sigmoid}\Big( U \big[ \tilde{H}^{(l+1)}_i ; H^{(l)}_i \big] + b \Big)$  (14)

where $\tilde{H}^{(l+1)}_i$ is the node feature produced by (12), $U$ and $b$ are trainable parameters, and $\odot$ denotes the Hadamard product.
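A sketch of (14) follows; the gate parametrization (a single linear layer over the concatenated old and new features) is our assumption:

```python
import torch
import torch.nn as nn

class GatedSkip(nn.Module):
    """Gated skip-connection, eq. (14): mixes new and old node features."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)  # U and b in eq. (14)

    def forward(self, h_new, h_old):
        z = torch.sigmoid(self.gate(torch.cat([h_new, h_old], dim=-1)))
        return z * h_new + (1 - z) * h_old   # Hadamard products

h_old, h_new = torch.randn(3, 16), torch.randn(3, 16)
h = GatedSkip(16)(h_new, h_old)  # (3, 16)
```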

After updating the node features $L$ times by following eq. (14), a graph feature $z_G$ is aggregated as the summation of all node features in the set of nodes $V$,

$z_G = \mathrm{MLP}\Big( \sum_{i \in V} H^{(L)}_i \Big)$  (15)

where MLP denotes a multi-layer perceptron. The graph feature is invariant to permutations of the node states. A molecular property $\hat{y}$, which is the final output of the model, is a function of the graph feature:

$\hat{y} = f(z_G).$  (16)
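A sketch of the readout (15) and the property head (16), with illustrative dimensions:

```python
import torch
import torch.nn as nn

# Sum node features, pass through an MLP (eq. (15)), then map the
# graph feature to a scalar molecular property (eq. (16)).
node_feats = torch.randn(3, 16)            # H^(L) for a 3-node molecule
readout = nn.Sequential(nn.Linear(16, 256), nn.ReLU())
head = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1))

z_g = readout(node_feats.sum(dim=0))  # graph feature; order-invariant
y = head(z_g)                         # predicted molecular property
```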

3 Implementation details

3.1 Model architecture

Figure 1: The architecture of the Bayesian GCN used in this work. (a) The entire model is composed of three augmented graph convolutional layers, readout layers and three linear layers with non-linear activation. (b) Detailed description of the graph convolution layer augmented with attention and gate mechanisms. We added dropout layers in order for the model parameters to have stochasticity.

As illustrated in Figure 1, the graph convolutional MC-dropout network used in this work consists of the following three parts:

  • Three augmented graph convolution layers update node features according to (14). The number of self-attention heads in each layer is four.

  • A readout function produces a graph feature whose dimension is 256 by following (15).

  • A feed-forward MLP, which is composed of two fully-connected layers, outputs a molecular property. The hidden dimension of each fully-connected layer is 256.

In order for the model parameters to have stochasticity, we applied dropout at every hidden layer. Note that we did not use the standard dropout with a pre-defined dropout rate, but used Concrete dropout 43 to develop as accurate a Bayesian model as possible. With Concrete dropout, we can obtain an optimal dropout rate for each individual hidden layer by stochastic optimization. We used Gaussian priors for all model parameters. In the training phase, we used the Adam optimizer 44, and the learning rate is decayed by half every 10 epochs. We randomly split the datasets into training, validation and test sets. The code used for the experiments is available at https://github.com/seongokryu/uq-molecule.
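For reference, the core of Concrete dropout 43 can be sketched as follows: the Bernoulli dropout mask is replaced by a differentiable "Concrete" relaxation, so the dropout rate itself becomes a parameter optimized by gradient descent. The method's regularization terms are omitted here and the temperature value is illustrative.

```python
import torch

def concrete_dropout_mask(shape, logit_p, temperature=0.1):
    """Differentiable relaxation of a Bernoulli dropout mask.

    logit_p is trainable: p = sigmoid(logit_p) is the dropout rate,
    learned jointly with the network weights.
    """
    u = torch.rand(shape).clamp(1e-7, 1 - 1e-7)  # uniform noise
    drop_prob = torch.sigmoid(
        (logit_p + torch.log(u) - torch.log(1 - u)) / temperature
    )
    return 1.0 - drop_prob  # soft keep-mask in (0, 1)

logit_p = torch.nn.Parameter(torch.tensor(-2.0))  # p ~ 0.12 initially
h = torch.randn(8, 256)
p = torch.sigmoid(logit_p)
h = h * concrete_dropout_mask(h.shape, logit_p) / (1 - p)  # rescaled
```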

4 Experiments

4.1 Implication of data quality on aleatoric and epistemic uncertainties

Figure 2: Histograms of (a) aleatoric, (b) epistemic and (c) total uncertainties as the amount of additive noise increases.

In this experiment, we applied the uncertainty quantification method to a simple example: logP prediction. We chose this example because we can obtain the logP value of molecules from the analytic expression implemented in RDKit 45, which is free of data-inherent noise. To examine the effect of data quality on uncertainties, we adjusted the extent of noise in logP by adding random Gaussian noise, a construction sketched below. We trained the model with 97,287 samples and analyzed the uncertainties of the predicted logP for 27,023 samples. The samples were chosen randomly from the ZINC dataset.
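Such noisy labels can be constructed as in the following sketch; the noise scale of 0.5 and the molecules are arbitrary illustrations, not the settings of this experiment.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Crippen

rng = np.random.default_rng(42)
smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]  # toy molecules

mols = [Chem.MolFromSmiles(s) for s in smiles]
logp_clean = np.array([Crippen.MolLogP(m) for m in mols])  # noise-free
logp_noisy = logp_clean + rng.normal(0.0, 0.5, size=logp_clean.shape)
```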

Figure 2 shows the distributions of the three uncertainties as a function of the amount of additive noise. As the noise level increases, the aleatoric and total uncertainties increase, but the epistemic uncertainty changes only slightly. This result verifies that the aleatoric uncertainty arises from data-inherent noise, while the epistemic uncertainty does not depend on data quality. Theoretically, the epistemic uncertainty should not increase with the amount of data noise. We conjecture that its slight change arises from the stochastic numerical optimization of the model parameters.

4.2 Evaluating quality of synthetic data based on uncertainty analysis

Figure 3: (a) Aleatoric, (b) epistemic, (c) total uncertainties and (d) predicted PCE against the PCE value in the dataset. The samples colored in red have a total uncertainty greater than two.

Based on the analysis in the previous experiment, we attempted to evaluate the quality of synthetic data. The synthetic PCE values in the CEP dataset 23 were obtained from the Scharber model with statistical approximations 46. In this procedure, unintentional errors can be introduced into the resulting synthetic data. Since the aleatoric uncertainty arises from data quality, we evaluated the quality of the synthetic data by analyzing the uncertainties of the predicted PCE values. We used the same dataset as Duvenaud et al. (https://github.com/HIPS/neural-fingerprint) for training and test.

Figure 3 shows scatter plots of the three uncertainties in the PCE predictions for 5,995 molecules in the test set. Samples with a total uncertainty greater than two are highlighted in red. Some samples with large PCE values, above eight, had relatively large total uncertainties; their PCE values deviated considerably from the black line in Figure 3-(d). More interestingly, we found that most molecules with a zero PCE value had large total uncertainties as well. Those large uncertainties came from the aleatoric uncertainty, as depicted in Figure 3-(a), indicating that the data quality of those particular samples is relatively poor. Hence, we speculated that data-inherent noise might cause large prediction errors.

To elucidate the origin of such errors, we investigated the procedure used to obtain the PCE values. The Harvard Organic Photovoltaic Dataset 47 contains both experimental and synthetic PCE values of 350 organic photovoltaic materials. The synthetic PCE values were computed according to (17), which is the result of the Scharber model 46:

$\mathrm{PCE} = 100 \times \dfrac{V_{\mathrm{OC}} \cdot \mathrm{FF} \cdot J_{\mathrm{SC}}}{P_{\mathrm{in}}}$  (17)

where $V_{\mathrm{OC}}$ is the open-circuit potential, $\mathrm{FF}$ is the fill factor, and $J_{\mathrm{SC}}$ is the short-circuit current density. $\mathrm{FF}$ was set to 65%. $V_{\mathrm{OC}}$ and $J_{\mathrm{SC}}$ were obtained from electronic structure calculations of the molecules 23. We found that the computed $V_{\mathrm{OC}}$ of some molecules was zero or nearly zero, resulting in zero or almost-zero synthetic PCE values, in contrast to their non-zero experimental PCE values. In particular, the $V_{\mathrm{OC}}$ and PCE values computed using the M06-2X functional 48 were consistently almost zero. We suspect that those approximated values caused a significant drop in data quality, resulting in the large aleatoric uncertainties highlighted in Figure 3. Consequently, the data noise due to poorly fabricated data was identified through the large aleatoric uncertainties.
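A toy evaluation of (17) illustrates why a near-zero $V_{\mathrm{OC}}$ forces the synthetic PCE to zero regardless of the other factors; the numbers are made up.

```python
def scharber_pce(v_oc, j_sc, ff=0.65, p_in=1000.0):
    """Power conversion efficiency in percent, eq. (17).

    v_oc: open-circuit potential (V); j_sc: short-circuit current
    density (A/m^2); p_in: incident power density (W/m^2).
    """
    return 100.0 * v_oc * ff * j_sc / p_in

print(scharber_pce(v_oc=0.9, j_sc=150.0))  # a plausible nonzero PCE
print(scharber_pce(v_oc=0.0, j_sc=150.0))  # v_oc = 0 -> PCE = 0 exactly
```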

4.3 Uncertainty as confidence indicator: bio-activity and toxicity classification

Figure 4: (a) Aleatoric, (b) epistemic and (c) total uncertainty of predicted probabilities in the classification of bio-activity against the EGFR target.

In this experiment, we demonstrate that uncertainty analysis can lead to more reliable classification. In classification problems, the final outputs from a sigmoid or softmax activation tend to be interpreted as confidence, meaning that the higher the output probability, the higher the prediction accuracy. However, as Gal and Ghahramani pointed out, such an interpretation is erroneous 39. Thus, we applied uncertainty quantification to bio-activity and toxicity classification problems and show that the predictive uncertainty can be used as the confidence of outcomes.

Figure 5: Test accuracy for the classifications of (a) bio-activities against the five target proteins in the DUD-E set and (b) the five toxic effects in the Tox21 set.

We trained the Bayesian GCN using 25,627 molecules with labels for EGFR activity in the DUD-E dataset. Figure 4 shows the results for 7,118 molecules in the test set. In order for the predictive uncertainty to be interpreted as a confidence, its value should be minimal at output probabilities of zero or one and maximal at 0.5. Indeed, the total uncertainty predicted by our model shows exactly this behaviour; in other words, outcomes whose predicted probabilities lie near 0.5 are more uncertain. We also noted that the aleatoric uncertainty affected the total uncertainty more significantly than the epistemic uncertainty did.

To further investigate the relationship between accuracy and uncertainty, we trained the Bayesian GCN for various bio-activity labels in the DUD-E dataset and toxicity labels in the Tox21 dataset. Then, we sorted the molecules in order of increasing uncertainty and divided them into five groups, such that molecules in the $i$-th group have total uncertainties in the $i$-th quintile (see the sketch below). Figure 5 shows the classification accuracy of each group; (a) and (b) denote the classification results for bio-activities against the five different targets and for the five different toxicities of the Tox21 molecules, respectively. The groups with lower uncertainty show higher accuracy, which is evidence that the uncertainty can be used as a confidence indicator in binary classification problems.
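The grouping itself is straightforward, as in this sketch with dummy arrays:

```python
import numpy as np

def accuracy_by_uncertainty_quintile(y_true, y_pred, uncertainty):
    """Sort samples by total uncertainty, split into five equal groups,
    and report the accuracy of each group (lowest uncertainty first)."""
    order = np.argsort(uncertainty)
    groups = np.array_split(order, 5)
    return [float((y_true[g] == y_pred[g]).mean()) for g in groups]

rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, size=1000)
y_pred = rng.integers(0, 2, size=1000)  # dummy predictions
unc = rng.uniform(size=1000)            # dummy total uncertainties
print(accuracy_by_uncertainty_quintile(y_true, y_pred, unc))
```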

5 Conclusion

Deep neural network models show promising performances in the prediction of molecular properties. In practical applications, however, a lack of data quality and quantity discourages developing accurate models. To make reliable decisions in such a case, we have proposed to analyze uncertainties in the prediction results by using the Bayesian GCN.

Our first experiment on logP prediction showed that data-inherent noise can be identified through the aleatoric uncertainty: the aleatoric uncertainty of the predicted logP values increases as the amount of noise increases, while the epistemic uncertainty depends only slightly on the data noise, as expected. In the second experiment, we applied the uncertainty analysis to the Harvard Clean Energy Project dataset. We were able to identify erroneous data by noting the abnormally large aleatoric uncertainty of the poorly approximated synthetic values, which is helpful for finding the source of the errors. In the third experiment, on bio-activity and toxicity predictions, we showed that the uncertainty is closely related to the confidence of prediction in binary classification problems: when the molecules are grouped in order of increasing uncertainty, the groups with lower uncertainty show higher accuracy than those with higher uncertainty.

We have demonstrated the usefulness of uncertainty quantification in molecular applications. By using the Bayesian GCN, we can analyze the quality of data that is often noisy because of the stochastic nature of experimental results. From the relationship between output probability and confidence of prediction, it is possible to selectively extract more reliable results from the entire set of predictions, which is critical to making desirable decisions. Such analysis can be used to screen bio-active and toxic molecules, where reliable prediction is vital. We believe that our study on the uncertainty quantification of molecular properties offers insights for tackling AI-safety problems in molecular applications.

References