1 Introduction
Modern deep neural network (DNN) models have been used in various molecular applications, such as high-throughput screening for drug discovery ^{1, 2, 3, 4}, de novo molecular design ^{5, 6, 7, 8, 9, 10, 11, 12} and planning chemical reactions ^{13, 14, 15}. DNNs show comparable or sometimes better performance than traditional approaches grounded on quantum chemical theories in predicting some molecular properties ^{16, 17, 18, 19, 20}, provided that a vast amount of well-qualified data is secured. Despite the remarkable potential of DNN models, the direct use of their outputs is sometimes limited, because most data in practical applications suffers from shortcomings in both quality and quantity.
Such data discourages a reliable statistical analysis based on DNN models, since their accuracy critically depends on the training data. For example, Feinberg et al. noted that more qualified data should be provided to improve the prediction accuracy on drug-target interactions, a key step in drug discovery ^{21}. The number of ligand-protein complex samples in the PDBbind database ^{22} is only about 15,000, limiting the development of reliable DNN models. Preparing more qualified data requires expensive and time-consuming experiments. Synthetic data from computations can be used as an alternative, like the Harvard Clean Energy Project set ^{23}, but it often suffers from unintentional errors caused by the approximation methods employed. In addition, data-inherent bias and noise hurt the quality of data; the Tox21 ^{3} and DUDE ^{24} datasets are such examples. The Tox21 dataset contains fewer than 10,000 samples, with far more negative samples than positive ones: across the toxicity types, the percentage of positive samples ranges from 2.9% at the lowest to 15.5% at the highest. The DUDE dataset is also highly imbalanced, with almost 50 times more decoy samples than active samples. All of these situations hinder the development of reliable models.
It has been stressed in deep learning research that uncertainty analysis is necessary to address so-called AI-safety problems ^{25, 26, 27}, because even though DNNs push the bounds of data-driven approaches, they often make catastrophic decisions. Uncertainty analysis has been performed to analyze the decision-making processes of deep neural networks. Kendall and Gal studied quantitative uncertainty analysis on computer vision problems by using Bayesian neural networks (BNNs) ^{28}. They separated model-driven and data-driven uncertainties, which helps to identify the sources of prediction errors. This is possible because Bayesian inference allows uncertainty assessments, giving probabilistic interpretations of model outputs.
In this paper, we propose to exploit BNNs to quantify the uncertainties implied in molecular property predictions. Previous studies on uncertainty quantification have regarded a predictive variance as a predictive uncertainty ^{28, 29}. The predictive uncertainty can be decomposed into (i) an aleatoric uncertainty arising from data noise and (ii) an epistemic uncertainty arising from the incompleteness of the model ^{30}. We adopt the same method in this study. As a DNN model for molecular applications, we use augmented graph convolutional networks (GCNs) ^{31, 32, 33}. In what follows, we briefly introduce BNNs, the uncertainty quantification methods based on Bayesian inference, and the augmented GCN used in this work. Then, we show the results of uncertainty analysis on three experimental studies. The main results are summarized as follows.

We first applied the Bayesian GCN to a simple example, the logP prediction of molecules in the ZINC set ^{34}, in order to demonstrate uncertainty quantification in molecular applications. As expected, the aleatoric uncertainty increases as the data noise increases, while the epistemic uncertainty depends only slightly on the quality of data.

Second, we evaluated the quality of synthetic data and found erroneous samples fabricated by poor approximations. The Harvard Clean Energy Project (CEP) set ^{23} contains synthetic power conversion efficiency (PCE) values of molecules. We noted that molecules with exactly zero PCE values have a conspicuously large aleatoric uncertainty; these values were verified to be incorrect annotations.

In the last example, the binary classification of bioactivity and toxicity, we studied the relationship between the predicted probability and the uncertainties. Our analysis shows that predictions with a lower uncertainty turned out to be more accurate, indicating that the uncertainty can be regarded as the confidence of a prediction.
2 Theoretical background
2.1 Bayesian neural network
For a given training set $\mathcal{D} = (X, Y)$, let $p(Y \mid X, \mathbf{w})$ and $p(\mathbf{w})$ be a model likelihood and a prior distribution for a parameter $\mathbf{w}$, respectively. Under the Bayesian framework, the model parameter and output are considered as random variables. The posterior distribution is given by

$$p(\mathbf{w} \mid X, Y) = \frac{p(Y \mid X, \mathbf{w})\, p(\mathbf{w})}{p(Y \mid X)} \qquad (1)$$

and the predictive distribution is defined as

$$p(y^{*} \mid x^{*}, X, Y) = \int p(y^{*} \mid x^{*}, \mathbf{w})\, p(\mathbf{w} \mid X, Y)\, d\mathbf{w} \qquad (2)$$

for a new input $x^{*}$ and an output $y^{*}$. These simple formulations make the two following tasks possible: (i) assessing uncertainty of the random variables in a conditional manner and (ii) predicting a distribution of the new output $y^{*}$ given both the new input $x^{*}$ and the training set $(X, Y)$.
However, direct computation of eq. (2) is often infeasible when deep neural network models are exploited, because the integration over the whole parameter space entails heavy computational costs. Many practical approximation methods have been proposed to handle this computational cost. Variational inference, one of the most popular approximation methods, approximates the posterior distribution with a tractable distribution $q_{\theta}(\mathbf{w})$ parametrized by a variational parameter $\theta$ ^{35, 36}. Minimizing the Kullback-Leibler divergence,

$$\mathrm{KL}\left( q_{\theta}(\mathbf{w}) \,\|\, p(\mathbf{w} \mid X, Y) \right) = \int q_{\theta}(\mathbf{w}) \log \frac{q_{\theta}(\mathbf{w})}{p(\mathbf{w} \mid X, Y)}\, d\mathbf{w}, \qquad (3)$$

makes the two distributions similar to one another in principle. We can replace the intractable posterior distribution in (3) with $p(Y \mid X, \mathbf{w})\, p(\mathbf{w}) / p(Y \mid X)$ due to Bayes' theorem (1). Then, our minimization objective, called the negative evidence lower bound, is

$$\mathcal{L}(\theta) = -\int q_{\theta}(\mathbf{w}) \log p(Y \mid X, \mathbf{w})\, d\mathbf{w} + \mathrm{KL}\left( q_{\theta}(\mathbf{w}) \,\|\, p(\mathbf{w}) \right). \qquad (4)$$
In order to implement Bayesian models, we need to be cautious in choosing a variational distribution $q_{\theta}(\mathbf{w})$. Blundell et al. proposed to use a product of Gaussian distributions for the variational distribution ^{35}. In addition, a multiplicative normalizing flow ^{37} can be applied to increase the expressive power of the variational distribution. However, the two approaches often require a large number of weight parameters. The Monte Carlo dropout (MC-dropout), which uses a dropout ^{38} variational distribution, approximates the posterior distribution by a product of Bernoulli distributions ^{39}. The MC-dropout is practical in that it does not need extra learnable parameters to model the variational posterior distribution, and the integration over the whole parameter space can be easily approximated by a summation over models sampled with a Monte Carlo estimator ^{25, 39}. Thus, we adopted the MC-dropout in this work.

2.2 Uncertainty quantification with Bayesian neural network
A variational inference approximating the posterior with a variational distribution $q_{\theta}(\mathbf{w})$ provides a variational predictive distribution of a new output $y^{*}$ given a new input $x^{*}$ as

$$q_{\theta}(y^{*} \mid x^{*}) = \int p\!\left(y^{*} \mid f^{\mathbf{w}}(x^{*})\right) q_{\theta}(\mathbf{w})\, d\mathbf{w}, \qquad (5)$$

where $f^{\mathbf{w}}(x^{*})$ is a model output with a given $\mathbf{w}$. For regression tasks, a predictive mean of this distribution with $M$ times of MC sampling is estimated by

$$\widehat{\mathbb{E}}[y^{*}] = \frac{1}{M} \sum_{m=1}^{M} f^{\hat{\mathbf{w}}_{m}}(x^{*}) \qquad (6)$$

and a predictive variance is estimated by

$$\widehat{\mathrm{Var}}[y^{*}] = \sigma^{2} + \frac{1}{M} \sum_{m=1}^{M} f^{\hat{\mathbf{w}}_{m}}(x^{*})^{2} - \left( \frac{1}{M} \sum_{m=1}^{M} f^{\hat{\mathbf{w}}_{m}}(x^{*}) \right)^{\!2}, \qquad (7)$$

with $\hat{\mathbf{w}}_{m}$ drawn from $q_{\theta}(\mathbf{w})$ at the $m$-th sampling step and an assumption $y^{*} \sim \mathcal{N}\!\left(f^{\mathbf{w}}(x^{*}), \sigma^{2}\right)$. Here, the model assumes homoscedasticity with a known quantity $\sigma^{2}$, meaning that every data point gives a distribution with the same variance. Further to this, obtaining the distributions with different variances allows deducing a heteroscedastic uncertainty. Assuming heteroscedasticity, the output given the $m$-th sample is

$$\left[\hat{\mu}_{m}, \hat{\sigma}_{m}^{2}\right] = f^{\hat{\mathbf{w}}_{m}}(x^{*}). \qquad (8)$$

The heteroscedastic predictive uncertainty given by (9) can be partitioned into two different uncertainties, aleatoric and epistemic:

$$\widehat{\mathrm{Var}}[y^{*}] = \underbrace{\frac{1}{M} \sum_{m=1}^{M} \hat{\sigma}_{m}^{2}}_{\text{aleatoric}} + \underbrace{\frac{1}{M} \sum_{m=1}^{M} \hat{\mu}_{m}^{2} - \left( \frac{1}{M} \sum_{m=1}^{M} \hat{\mu}_{m} \right)^{\!2}}_{\text{epistemic}}. \qquad (9)$$

The aleatoric uncertainty arises from data-inherent noise, while the epistemic uncertainty is related to the model incompleteness. Note that the latter can be reduced by increasing the amount of training data, because it comes from an insufficient amount of data as well as the use of an inappropriate model ^{30}.
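As a concrete illustration, the aleatoric/epistemic split of the heteroscedastic predictive variance can be computed in a few lines once the M stochastic forward passes have been collected. The function below is a minimal sketch; the names `mus` and `sigma_sqs` are ours, not from the original implementation.

```python
def decompose_uncertainty(mus, sigma_sqs):
    """Split the heteroscedastic predictive variance into aleatoric and
    epistemic parts, given M MC-dropout samples (mu_m, sigma_m^2)."""
    M = len(mus)
    aleatoric = sum(sigma_sqs) / M  # average of the predicted noise variances
    mean_mu = sum(mus) / M
    # epistemic part: spread of the predicted means across MC samples
    epistemic = sum(mu * mu for mu in mus) / M - mean_mu ** 2
    return aleatoric, epistemic
```

The total predictive uncertainty is then simply the sum of the two returned terms.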
In classification problems, Kwon et al. proposed a natural way to quantify aleatoric and epistemic uncertainties as follows:

$$\widehat{\mathrm{Var}}[y^{*}] = \underbrace{\frac{1}{M} \sum_{m=1}^{M} \left[ \mathrm{diag}(\hat{p}_{m}) - \hat{p}_{m} \hat{p}_{m}^{\top} \right]}_{\text{aleatoric}} + \underbrace{\frac{1}{M} \sum_{m=1}^{M} \left( \hat{p}_{m} - \bar{p} \right) \left( \hat{p}_{m} - \bar{p} \right)^{\top}}_{\text{epistemic}}, \qquad (10)$$

where $\hat{p}_{m} = \mathrm{softmax}\!\left( f^{\hat{\mathbf{w}}_{m}}(x^{*}) \right)$ and $\bar{p} = \frac{1}{M} \sum_{m=1}^{M} \hat{p}_{m}$. While Kendall and Gal's method requires extra parameters at the last hidden layer and often causes unstable parameter updates in the training phase ^{28}, the method of Kwon et al. has the advantage that models do not need the extra parameters ^{29}. Equation (10) also utilizes the functional relationship between the mean and variance of multinomial random variables. We refer to Kwon et al. for more details.
2.3 Graph convolutional network for molecular property predictions
Molecules, social graphs, images and language sentences can be represented as graph structures ^{40}. The GCN is one of the most popular graph neural networks and is widely adopted to process molecular graphs. The input to the GCN is $(A, H^{(0)})$, where $A \in \mathbb{R}^{N \times N}$ is an adjacency matrix with the number of nodes $N$ and $H^{(0)} \in \mathbb{R}^{N \times F}$ is a set of initial node features whose dimensionality is $F$. The GCN gives new node features as follows:

$$H^{(l+1)} = \sigma\!\left( A H^{(l)} W^{(l)} \right), \qquad (11)$$

where $H^{(l)}$ and $W^{(l)}$ are the node features and weight parameters of the $l$-th graph convolution layer, respectively. The GCN updates node features with information of only adjacent nodes.
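A single graph convolution of this form is easy to sketch. The plain-Python helper below is illustrative only, with ReLU standing in for the nonlinearity, and makes the neighbour aggregation explicit:

```python
def matmul(A, B):
    """Naive matrix product for small illustrative examples."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def gcn_layer(A, H, W):
    """One graph convolution: aggregate each node's neighbours via the
    adjacency matrix A, transform with the layer weights W, apply ReLU."""
    return [[max(x, 0.0) for x in row] for row in matmul(matmul(A, H), W)]
```

With self-loops included in A, each node's new feature mixes its own feature with those of its neighbours; non-adjacent nodes contribute nothing.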
Applying a self-attention ^{41} enables the GCN to learn relations between node pairs by reflecting the importance of adjacent nodes ^{42}. Updating node features with the $K$-head self-attention is given by

$$H^{(l+1)}_{i} = \sigma\!\left( W_{O} \left[ \Big\Vert_{k=1}^{K} \sum_{j \in \mathcal{N}(i)} \alpha^{(k)}_{ij} W^{(k)} H^{(l)}_{j} \right] \right), \qquad (12)$$

where $\mathcal{N}(i)$ denotes the adjacent nodes of the $i$-th node, $H^{(l)}_{i}$ is the $i$-th node feature updated at the $l$-th graph convolution, $W^{(k)}$ is a weight parameter for the $k$-th attention head, $W_{O}$ is a weight parameter to combine the node features from different attention heads, and the attention coefficient $\alpha^{(k)}_{ij}$ is given by

$$\alpha^{(k)}_{ij} = \mathrm{softmax}_{j \in \mathcal{N}(i)}\!\left( \mathbf{a}^{\top} \left[ W^{(k)} H^{(l)}_{i} \,\Big\Vert\, W^{(k)} H^{(l)}_{j} \right] \right), \qquad (13)$$

where $\mathbf{a}$ is a weight parameter.
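The normalisation step shared by such attention formulations is a softmax over each node's neighbourhood. A minimal sketch, assuming the raw pairwise scores have already been computed:

```python
import math

def neighbourhood_softmax(scores):
    """Turn raw attention scores e_ij over a node's neighbours into
    normalised coefficients alpha_ij that sum to one."""
    m = max(scores)                          # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Two neighbours with identical scores receive identical coefficients, and the coefficients always sum to one over the neighbourhood.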
In addition, the GCN has room for improvement, because its accuracy gradually degrades as the number of graph convolution layers increases ^{32, 33}. We used a gated skip connection to prevent this problem as follows:

$$H^{(l+1)} = \mathbf{z}^{(l)} \odot H^{(l+1)}_{\mathrm{attn}} + \left(1 - \mathbf{z}^{(l)}\right) \odot H^{(l)}, \quad \mathbf{z}^{(l)} = \mathrm{sigmoid}\!\left( U^{(l)} H^{(l+1)}_{\mathrm{attn}} + V^{(l)} H^{(l)} + b^{(l)} \right), \qquad (14)$$

where $U^{(l)}$, $V^{(l)}$ and $b^{(l)}$ are trainable parameters and $\odot$ denotes the Hadamard product.
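A scalar-parameter sketch of the gate shows how the mixing works; `u`, `v`, and `b` below are stand-ins for the trainable parameters, which are matrices and vectors in the real model:

```python
import math

def gated_skip(h_new, h_old, u, v, b):
    """Gated skip connection: a sigmoid gate z in (0, 1), computed from both
    feature vectors, mixes them elementwise (Hadamard product)."""
    out = []
    for n, o in zip(h_new, h_old):
        z = 1.0 / (1.0 + math.exp(-(u * n + v * o + b)))  # gate value
        out.append(z * n + (1.0 - z) * o)
    return out
```

With u = v = b = 0 the gate is 0.5 and the output is the plain average of the two feature vectors; training moves the gate toward whichever representation is more useful.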
After computing the node features $L$ times by following eq. (14), a graph feature $z_{G}$ is aggregated as the summation of all node features in a set of nodes $\mathcal{V}$,

$$z_{G} = \sum_{i \in \mathcal{V}} \mathrm{MLP}\!\left( H^{(L)}_{i} \right), \qquad (15)$$

where MLP denotes a multilayer perceptron. The graph feature is invariant to permutations of the node states. A molecular property $\hat{y}$, which is the final output from the model, is a function of the graph feature:

$$\hat{y} = \mathrm{MLP}\!\left( z_{G} \right). \qquad (16)$$
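The key property of the readout is permutation invariance: summing over nodes makes the graph feature independent of node ordering. A toy sketch, with the identity map standing in for the per-node MLP:

```python
def readout(node_features):
    """Sum the (MLP-transformed) node features into one graph feature.
    The identity map replaces the MLP purely for illustration."""
    dim = len(node_features[0])
    return [sum(h[k] for h in node_features) for k in range(dim)]
```

Reordering the nodes leaves the result unchanged, e.g. `readout(H)` equals `readout(list(reversed(H)))` for any feature list `H`.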
3 Implementation details
3.1 Model architecture
As illustrated in Figure 1, the graph convolutional MC-dropout network used in this work consists of the following three parts:

Three augmented graph convolution layers update node features according to (14). The number of self-attention heads is four. The dimension of the output from each layer is () = ().

A readout function produces a graph feature whose dimension is 256 by following (15).

A feed-forward MLP, which is composed of two fully-connected layers, produces a molecular property. The hidden dimension of each fully-connected layer is 256.
In order for the model parameters to have stochasticity, we applied dropout at every hidden layer. Note that we did not use the standard dropout with a predefined dropout rate, but Concrete dropout ^{43}, in order to develop as accurate a Bayesian model as possible. With Concrete dropout, we can obtain an optimal dropout rate for each individual hidden layer by stochastic optimization. We used Gaussian priors with length scale for all model parameters. In the training phase, we used the Adam optimizer ^{44} with an initial learning rate , decayed by half every 10 epochs. The number of total training epochs is and the batch size is . We randomly split the datasets in the ratio of for training, validation and test. The code used for the experiments is available at https://github.com/seongokryu/uqmolecule.

4 Experiments
4.1 Implication of data quality on aleatoric and epistemic uncertainties
In this experiment, we applied the uncertainty quantification method to a simple example, logP prediction. We chose this example because we can obtain the logP value of molecules from the analytic expression implemented in RDKit ^{45} without data-inherent noise. To examine the effect of data quality on the uncertainties, we adjusted the extent of noise in logP by adding random Gaussian noise. We trained the model with 97,287 samples and analyzed the uncertainties of each predicted logP for 27,023 samples. The samples were chosen randomly from the ZINC dataset.
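The controlled-noise setup is simple to reproduce: starting from clean analytic logP labels, zero-mean Gaussian noise of a chosen standard deviation is added before training. A sketch (function name and seed are ours):

```python
import random

random.seed(0)  # fixed seed for reproducibility

def add_label_noise(logp_values, sigma):
    """Corrupt clean logP labels with zero-mean Gaussian noise of standard
    deviation sigma, as in the noise-level experiment."""
    return [y + random.gauss(0.0, sigma) for y in logp_values]
```

With sigma = 0 the labels are returned unchanged; increasing sigma degrades the data quality in a controlled way.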
Figure 2 shows the distribution of the three uncertainties as a function of the amount of additive noise. As the noise level increases, the aleatoric and total uncertainties increase, but the epistemic uncertainty changes only slightly. This result verifies that the aleatoric uncertainty arises from data-inherent noise, while the epistemic uncertainty does not depend on data quality. Theoretically, the epistemic uncertainty should not increase with the amount of data noise. We conjecture that the slight change of the epistemic uncertainty arises from the stochastic numerical optimization of the model parameters.
4.2 Evaluating quality of synthetic data based on uncertainty analysis
Based on the analysis of the previous experiment, we attempted to evaluate the quality of synthetic data. The synthetic PCE values in the CEP dataset ^{23} were obtained from the Scharber model with statistical approximations ^{46}. In this procedure, unintentional errors can be introduced into the resulting synthetic data. Since the aleatoric uncertainty arises from data quality, we evaluated the quality of the synthetic data by analyzing the uncertainties of the predicted PCE values. We used the same dataset as Duvenaud et al. (https://github.com/HIPS/neuralfingerprint) for training and test.
Figure 3 shows the scatter plot of the three uncertainties in the CEP predictions for 5,995 molecules in the test set. Samples with a total uncertainty greater than two are highlighted in red. Some samples with large PCE values, above eight, had relatively large total uncertainties; their PCE values deviated considerably from the black line in Figure 3(d). More interestingly, we found that most molecules with a zero PCE value had large total uncertainties as well. Those large uncertainties came from the aleatoric uncertainty, as depicted in Figure 3(a), indicating that the data quality of those particular samples is relatively poor. Hence, we speculated that data-inherent noise might cause large prediction errors.
To elucidate the origin of such errors, we investigated the procedure for obtaining the PCE values. The Harvard Organic Photovoltaic Dataset ^{47} contains both experimental and synthetic PCE values of 350 organic photovoltaic materials. The synthetic PCE values were computed according to (17), which is the result of the Scharber model ^{46}:

$$\mathrm{PCE} = 100 \times \frac{V_{oc} \cdot \mathrm{FF} \cdot J_{sc}}{P_{in}}, \qquad (17)$$

where $V_{oc}$ is an open circuit potential, $\mathrm{FF}$ is a fill factor, $J_{sc}$ is a short circuit current density, and $P_{in}$ is the incident light power density. $\mathrm{FF}$ was set to 65%. $V_{oc}$ and $J_{sc}$ were obtained from electronic structure calculations of the molecules ^{23}. We found that $J_{sc}$ of some molecules was zero or nearly zero, resulting in zero or almost zero synthetic PCE values, in contrast to their non-zero experimental PCE values. In particular, the $J_{sc}$ and PCE values computed using the M06-2X functional ^{48} were consistently almost zero. We suspect that those approximated values caused a significant drop in data quality, resulting in the large aleatoric uncertainties highlighted in Figure 3. Consequently, the data noise due to poorly fabricated data was identified through the large aleatoric uncertainties.
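For reference, the Scharber-model estimate of eq. (17) can be written as a one-liner; the default incident power density of 100 mW/cm^2 is our assumption (the standard AM1.5 reference value), not a number stated above:

```python
def scharber_pce(v_oc, ff, j_sc, p_in=100.0):
    """Power conversion efficiency (in %): open-circuit potential (V) times
    fill factor times short-circuit current density (mA/cm^2), divided by
    the incident power density p_in (mW/cm^2). p_in default is assumed."""
    return 100.0 * v_oc * ff * j_sc / p_in
```

A vanishing short-circuit current density forces the PCE to zero regardless of the other terms, which is exactly the failure mode observed in the flagged samples.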
4.3 Uncertainty as confidence indicator: bioactivity and toxicity classification
In this experiment, we demonstrate that uncertainty analysis can lead to reliable classification. In classification problems, one tends to interpret the final outputs from a sigmoid or softmax activation as confidence, meaning that the higher the output probability, the higher the prediction accuracy. However, as Gal and Ghahramani pointed out, such an interpretation is erroneous ^{39}. Thus, we applied the uncertainty quantification to bioactivity and toxicity classification problems and show that the predictive uncertainty can be used as the confidence of outcomes.
We trained the Bayesian GCN using 25,627 molecules with labels for EGFR activity in the DUDE dataset. Figure 4 shows the results for 7,118 molecules in the test set. In order for the predictive uncertainty to be interpreted as a confidence, its value should be minimal at an output probability of zero or one and maximal at 0.5. Indeed, the total uncertainty predicted by our model shows such behaviour. In other words, more uncertain outcomes have output probabilities closer to 0.5. We also noted that the aleatoric uncertainty affected the total uncertainty more significantly than the epistemic uncertainty did.
To further investigate the relationship between accuracy and uncertainty, we trained the Bayesian GCN for various bioactivity labels in the DUDE dataset and toxicity labels in the Tox21 dataset. Then, we sorted the molecules in order of increasing uncertainty and divided them into five groups, so that molecules in the $i$-th group have total uncertainties in the $i$-th quintile. Figure 5 shows the classification accuracy of each group; (a) and (b) denote the classification results of the bioactivities against the five different targets and the five different toxicities of the Tox21 set molecules, respectively. This result is evidence that the uncertainty can be used as a confidence indicator in binary classification problems.
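The grouping analysis itself is straightforward to sketch: sort by total uncertainty, split into equal-sized groups, and compute per-group accuracy (function and argument names are ours):

```python
def accuracy_by_uncertainty(uncertainties, correct, n_groups=5):
    """Sort predictions by total uncertainty, split into n_groups
    equal-sized groups, and return each group's accuracy. `correct` is a
    list of 0/1 flags aligned with `uncertainties`."""
    order = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i])
    size = len(order) // n_groups
    accs = []
    for g in range(n_groups):
        # the last group absorbs any remainder
        idx = order[g * size:(g + 1) * size] if g < n_groups - 1 else order[g * size:]
        accs.append(sum(correct[i] for i in idx) / len(idx))
    return accs
```

If uncertainty is a valid confidence indicator, the returned accuracies should decrease from the first (least uncertain) group to the last.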
5 Conclusion
Deep neural network models show promising performance in the prediction of molecular properties. In practical applications, however, a lack of data quality and quantity discourages developing accurate models. To make reliable decisions in such cases, we have proposed to analyze uncertainties in the prediction results by using a Bayesian GCN.
Our first experiment on the logP prediction showed that data-inherent noise can be identified by the aleatoric uncertainty. The aleatoric uncertainty in the predicted logP values increases as the amount of noise increases. In contrast, the epistemic uncertainty depends only slightly on the data noise, as expected. In the second experiment, we applied the uncertainty analysis to the Harvard Clean Energy Project dataset. We were able to identify erroneous data by noting the abnormally large aleatoric uncertainty of the poorly approximated synthetic data, which is helpful for finding the source of the errors. In the third experiment, on bioactivity and toxicity predictions, we showed that the uncertainty is closely related to the confidence of prediction for binary classification problems. When grouping the molecules in increasing order of uncertainty, the groups with lower uncertainty show higher accuracy than those with higher uncertainty.
We have demonstrated how useful uncertainty quantification is in molecular applications. By using the Bayesian GCN, we can analyze the quality of data that is often noisy because of the stochastic nature of experimental results. From the relationship between the output probability and the confidence of prediction, it is possible to selectively extract more reliable results from the entire set of predictions, which is critical to making desirable decisions. Such analysis can be used to screen bioactive and toxic molecules, where reliable prediction is vital. We believe that our study on the uncertainty quantification of molecular properties offers insights for tackling AI-safety problems in molecular applications.
References
 Gomes et al. 2017 Gomes, J.; Ramsundar, B.; Feinberg, E. N.; Pande, V. S. Atomic convolutional networks for predicting protein-ligand binding affinity. arXiv preprint arXiv:1703.10603 2017,
 Jiménez et al. 2018 Jiménez, J.; Skalic, M.; Martínez-Rosell, G.; De Fabritiis, G. KDEEP: Protein–Ligand Absolute Binding Affinity Prediction via 3D-Convolutional Neural Networks. Journal of chemical information and modeling 2018, 58, 287–296.
 Mayr et al. 2016 Mayr, A.; Klambauer, G.; Unterthiner, T.; Hochreiter, S. DeepTox: toxicity prediction using deep learning. Frontiers in Environmental Science 2016, 3, 80.
 Öztürk et al. 2018 Öztürk, H.; Özgür, A.; Ozkirimli, E. DeepDTA: deep drug–target binding affinity prediction. Bioinformatics 2018, 34, i821–i829.
 De Cao and Kipf 2018 De Cao, N.; Kipf, T. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973 2018,
 Gómez-Bombarelli et al. 2018 Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS central science 2018, 4, 268–276.
 Guimaraes et al. 2017 Guimaraes, G. L.; Sanchez-Lengeling, B.; Outeiral, C.; Farias, P. L. C.; Aspuru-Guzik, A. Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models. arXiv preprint arXiv:1705.10843 2017,
 Jin et al. 2018 Jin, W.; Barzilay, R.; Jaakkola, T. Junction Tree Variational Autoencoder for Molecular Graph Generation. arXiv preprint arXiv:1802.04364 2018,
 Kusner et al. 2017 Kusner, M. J.; Paige, B.; Hernández-Lobato, J. M. Grammar variational autoencoder. arXiv preprint arXiv:1703.01925 2017,
 Li et al. 2018 Li, Y.; Vinyals, O.; Dyer, C.; Pascanu, R.; Battaglia, P. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324 2018,
 Segler et al. 2017 Segler, M. H.; Kogej, T.; Tyrchan, C.; Waller, M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS central science 2017, 4, 120–131.
 You et al. 2018 You, J.; Liu, B.; Ying, R.; Pande, V.; Leskovec, J. Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation. arXiv preprint arXiv:1806.02473 2018,
 Segler et al. 2018 Segler, M. H.; Preuss, M.; Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 2018, 555, 604.
 Wei et al. 2016 Wei, J. N.; Duvenaud, D.; Aspuru-Guzik, A. Neural networks for the prediction of organic chemistry reactions. ACS central science 2016, 2, 725–732.
 Zhou et al. 2017 Zhou, Z.; Li, X.; Zare, R. N. Optimizing chemical reactions with deep reinforcement learning. ACS central science 2017, 3, 1337–1344.
 Faber et al. 2017 Faber, F. A.; Hutchison, L.; Huang, B.; Gilmer, J.; Schoenholz, S. S. et al. Prediction errors of molecular machine learning models lower than hybrid DFT error. Journal of chemical theory and computation 2017, 13, 5255–5264.
 Gilmer et al. 2017 Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; Dahl, G. E. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212 2017,
 Schütt et al. 2017 Schütt, K.; Kindermans, P.-J.; Felix, H. E. S.; Chmiela, S.; Tkatchenko, A. et al. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. Advances in Neural Information Processing Systems. 2017; pp 991–1001.
 Schütt et al. 2017 Schütt, K. T.; Arbabzadah, F.; Chmiela, S.; Müller, K. R.; Tkatchenko, A. Quantum-chemical insights from deep tensor neural networks. Nature communications 2017, 8, 13890.
 Smith et al. 2017 Smith, J. S.; Isayev, O.; Roitberg, A. E. ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chemical science 2017, 8, 3192–3203.
 Feinberg et al. 2018 Feinberg, E. N.; Sur, D.; Husic, B. E.; Mai, D.; Li, Y. et al. Spatial Graph Convolutions for Drug Discovery. arXiv preprint arXiv:1803.04465 2018,
 Liu et al. 2017 Liu, Z.; Su, M.; Han, L.; Liu, J.; Yang, Q. et al. Forging the basis for developing protein–ligand interaction scoring functions. Accounts of chemical research 2017, 50, 302–309.
 Hachmann et al. 2011 Hachmann, J.; Olivares-Amaya, R.; Atahan-Evrenk, S.; Amador-Bedolla, C.; Sánchez-Carrera, R. S. et al. The Harvard clean energy project: large-scale computational screening and design of organic photovoltaics on the world community grid. The Journal of Physical Chemistry Letters 2011, 2, 2241–2251.
 Mysinger et al. 2012 Mysinger, M. M.; Carchia, M.; Irwin, J. J.; Shoichet, B. K. Directory of useful decoys, enhanced (DUDE): better ligands and decoys for better benchmarking. Journal of medicinal chemistry 2012, 55, 6582–6594.
 Gal 2016 Gal, Y. Uncertainty in deep learning. University of Cambridge 2016,
 Begoli et al. 2019 Begoli, E.; Bhattacharya, T.; Kusnezov, D. The need for uncertainty quantification in machine-assisted medical decision making. Nature Machine Intelligence 2019, 1, 20.
 McAllister et al. 2017 McAllister, R.; Gal, Y.; Kendall, A.; Van Der Wilk, M.; Shah, A. et al. Concrete problems for autonomous vehicle safety: advantages of Bayesian deep learning. 2017.
 Kendall and Gal 2017 Kendall, A.; Gal, Y. What uncertainties do we need in bayesian deep learning for computer vision? Advances in neural information processing systems. 2017; pp 5574–5584.
 Kwon et al. 2018 Kwon, Y.; Won, J.-H.; Kim, B. J.; Paik, M. C. Uncertainty quantification using Bayesian neural networks in classification: Application to ischemic stroke lesion segmentation. International conference on medical imaging with deep learning. 2018.
 Der Kiureghian and Ditlevsen 2009 Der Kiureghian, A.; Ditlevsen, O. Aleatory or epistemic? Does it matter? Structural Safety 2009, 31, 105–112.
 Duvenaud et al. 2015 Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T. et al. Convolutional networks on graphs for learning molecular fingerprints. Advances in neural information processing systems. 2015; pp 2224–2232.
 Kipf and Welling 2016 Kipf, T. N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 2016,
 Ryu et al. 2018 Ryu, S.; Lim, J.; Kim, W. Y. Deeply learning molecular structure-property relationships using graph attention neural network. arXiv preprint arXiv:1805.10988 2018,
 Irwin and Shoichet 2005 Irwin, J. J.; Shoichet, B. K. ZINC: a free database of commercially available compounds for virtual screening. Journal of chemical information and modeling 2005, 45, 177–182.
 Blundell et al. 2015 Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; Wierstra, D. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424 2015,
 Graves 2011 Graves, A. Practical variational inference for neural networks. Advances in neural information processing systems. 2011; pp 2348–2356.
 Louizos and Welling 2017 Louizos, C.; Welling, M. Multiplicative normalizing flows for variational bayesian neural networks. arXiv preprint arXiv:1703.01961 2017,
 Srivastava et al. 2014 Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 2014, 15, 1929–1958.
 Gal and Ghahramani 2016 Gal, Y.; Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. international conference on machine learning. 2016; pp 1050–1059.
 Battaglia et al. 2018 Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; SanchezGonzalez, A.; Zambaldi, V. et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261 2018,
 Vaswani et al. 2017 Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L. et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017; pp 5998–6008.
 Velickovic et al. 2017 Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P. et al. Graph attention networks. arXiv preprint arXiv:1710.10903 2017,
 Gal et al. 2017 Gal, Y.; Hron, J.; Kendall, A. Concrete dropout. Advances in Neural Information Processing Systems. 2017; pp 3581–3590.
 Kingma and Ba 2014 Kingma, D. P.; Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 2014,
 Landrum 2006 Landrum, G. RDKit: Open-source cheminformatics. 2006.
 Scharber et al. 2006 Scharber, M. C.; Mühlbacher, D.; Koppe, M.; Denk, P.; Waldauf, C. et al. Design rules for donors in bulk-heterojunction solar cells—Towards 10% energy-conversion efficiency. Advanced materials 2006, 18, 789–794.
 Lopez et al. 2016 Lopez, S. A.; PyzerKnapp, E. O.; Simm, G. N.; Lutzow, T.; Li, K. et al. The Harvard organic photovoltaic dataset. Scientific data 2016, 3, 160086.
 Zhao and Truhlar 2008 Zhao, Y.; Truhlar, D. G. The M06 suite of density functionals for main group thermochemistry, thermochemical kinetics, noncovalent interactions, excited states, and transition elements: two new functionals and systematic testing of four M06-class functionals and 12 other functionals. Theoretical Chemistry Accounts 2008, 120, 215–241.