1 Introduction
Machine learning (ML) continues to emerge as a versatile strategy in the chemical sciences, with applications to drug discovery,^{1, 2, 3, 4, 5} materials design,^{6, 7, 8, 9, 5} and reaction prediction.^{10, 11, 12, 13, 14, 5} An increasing number of ML methods have focused on the prediction of molecular properties, including quantum mechanical electronic energies,^{15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31} densities,^{26, 32, 33, 34, 35, 36} and spectra.^{37, 38, 39, 40, 41} Most of this work has focused on ML in the representation of atom- or geometry-specific features, although more abstract representations are gaining increased attention.^{42, 43, 44, 45, 46, 47, 48}
We recently introduced a rigorous factorization of the post-Hartree-Fock (post-HF) correlation energy into contributions from pairs of occupied molecular orbitals and showed that these pair contributions could be compactly represented in the space of molecular-orbital-based (MOB) features to allow for straightforward ML regression.^{47, 48} This MOB-ML method was demonstrated to accurately predict second-order Møller-Plesset perturbation theory (MP2)^{49, 50} and coupled cluster with singles, doubles, and perturbative triples (CCSD(T))^{51, 52} energies for a range of benchmark systems, including the QM7b-T and GDB-13-T datasets of thermalized drug-like organic molecules. While providing good accuracy with a modest amount of training data, the accuracy of MOB-ML in these initial studies was limited by the high computational cost of applying Gaussian Process Regression (GPR) to the full set of training data, which scales cubically with the number of training points.^{48}
In this work, we combine MOBML with regression clustering (RC) to overcome this bottleneck in computational cost and accuracy. The training data are clustered via RC to discover locally linear structures. By independently regressing these subsets of the data, we obtain MOBML models with greatly reduced training costs while preserving prediction accuracy and transferability.
2 Theory
2.1 Molecular-orbital-based machine learning (MOB-ML)
The MOB-ML method is based on the observation that the correlation energy for any post-HF wavefunction theory can be exactly decomposed as a sum over pairs of occupied molecular orbitals (MOs) via Nesbet's theorem,^{53, 54}
$E_{\mathrm{c}} = \sum_{ij} \epsilon_{ij}$ (1)
where $E_{\mathrm{c}}$ is the correlation energy and $\epsilon_{ij}$ is the pair correlation energy corresponding to occupied MOs $i$ and $j$. The pair correlation energies can be expressed as a functional of the set of (occupied and unoccupied) MOs, appropriately indexed by $i$ and $j$, such that
$\epsilon_{ij} = \epsilon[\{\phi_p\}_{ij}]$ (2)
The functional $\epsilon$ maps the HF MOs to the pair correlation energy, regardless of the molecular composition or geometry, such that it is a universal functional for all chemical systems. To bypass the expensive post-HF evaluation procedure, MOB-ML approximates $\epsilon$ by machine learning two functionals, $\epsilon^{\mathrm{d}}$ and $\epsilon^{\mathrm{o}}$, which correspond to the diagonal and off-diagonal terms of the sum in Eq. 1:
$\epsilon_{ij} \approx \begin{cases} \epsilon^{\mathrm{d}}[\mathbf{f}_i], & i = j \\ \epsilon^{\mathrm{o}}[\mathbf{f}_{ij}], & i \neq j \end{cases}$ (3)
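To make the decomposition concrete, the assembly of the total correlation energy from the diagonal and off-diagonal functionals (Eqs. 1 and 3) can be sketched as follows; the functionals here are trivial placeholders standing in for trained regression models, and the feature values are synthetic:

```python
import numpy as np

# Placeholder stand-ins for the learned functionals of Eq. 3; in MOB-ML
# these would be trained regression models, not fixed linear maps.
def eps_diag(f_i):
    return -0.02 * f_i.sum()

def eps_offdiag(f_ij):
    return -0.001 * f_ij.sum()

def correlation_energy(diag_feats, offdiag_feats):
    """Eq. 1 with the approximation of Eq. 3:
    E_c = sum_i eps_d[f_i] + sum_{i != j} eps_o[f_ij]."""
    e_d = sum(eps_diag(f) for f in diag_feats)
    e_o = sum(eps_offdiag(f) for f in offdiag_feats)
    return e_d + e_o

# Two occupied LMOs: two diagonal feature vectors and one off-diagonal pair
diag_feats = [np.ones(5), 2.0 * np.ones(5)]
offdiag_feats = [np.ones(8)]
E_c = correlation_energy(diag_feats, offdiag_feats)
```

The point of the sketch is only the bookkeeping: once the two functionals are trained, evaluating the correlation energy of a molecule reduces to a sum of cheap per-pair model evaluations.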
The MOB-ML feature vectors $\mathbf{f}_i$ and $\mathbf{f}_{ij}$ are comprised of unique elements of the Fock, Coulomb, and exchange matrices between $i$, $j$, and the set of virtual orbitals. Without loss of generality, we perform MOB-ML using localized MOs (LMOs) to improve transferability across chemical systems.^{47} Detailed descriptions of feature design are provided in our previous work,^{47, 48} and the features employed here are unchanged from those detailed in Ref. 48.

2.2 Local linearity of MOB feature space
It has previously been emphasized that MOB-ML facilitates transferability across chemical systems, even allowing for predictions involving molecules with elements that do not appear in the training set,^{47} because MOB features provide a compact and highly abstracted representation of the electronic structure. However, it is worth additionally emphasizing that this transferability benefits from the smooth variation and local linearity of the pair correlation energies as a function of the MOB feature values associated with different molecular geometries and even different molecules.
Figure 1 illustrates these latter properties for a bonding orbital in a series of simple molecules. On the y-axis, we plot the diagonal contribution to the correlation energy associated with this orbital ($\epsilon_{ii}$), computed at the MP2/cc-pVTZ level of theory. On the x-axis, we plot the value of a particular MOB feature, the diagonal Fock matrix element for that localized orbital, $F_{ii}$. For each molecule, a range of geometries is sampled from the Boltzmann distribution, with each plotted point corresponding to a different sampled geometry.
It is immediately clear from the figure that the pair correlation energy varies smoothly and linearly as a function of the MOB feature value. Moreover, the slope of the linear curve is remarkably consistent across molecules. This illustration suggests that MOB features may enable accurate regression of correlation energies using simple machine learning models (even linear models), and it also indicates the basis for the robust transferability of MOB-ML across diverse chemical systems, including those with elements that do not appear in the training set.
2.3 Regression clustering with a greedy algorithm
To take advantage of the local linearity of the pair correlation energies as a function of MOB features, we propose a strategy to discover optimally linear clusters using regression clustering (RC).^{55} Consider the set of datapoints $\{(\mathbf{f}_i, \epsilon_i)\}_{i=1}^{N}$, where $\mathbf{f}_i \in \mathbb{R}^{d}$, $d$ is the length of the MOB feature vector, and each datapoint, indexed by $i$, corresponds to a MOB feature vector and the associated reference value (i.e., label) for the pair correlation energy. To separate these datapoints into locally linear clusters, $S_1, \ldots, S_K$, we seek a solution to the optimization problem
$\min_{S_1, \ldots, S_K} \sum_{k=1}^{K} \sum_{i \in S_k} \left\| \mathbf{a}_k^{\mathsf{T}} \mathbf{f}_i + b_k - \epsilon_i \right\|_2^2$ (4)
where $\mathbf{a}_k \in \mathbb{R}^{d}$ and $b_k \in \mathbb{R}$ are obtained via the ordinary least squares (OLS) solution,

$(\mathbf{a}_k, b_k) = \arg\min_{\mathbf{a}, b} \sum_{i \in S_k} \left\| \mathbf{a}^{\mathsf{T}} \mathbf{f}_i + b - \epsilon_i \right\|_2^2$ (5)
Each resulting $S_k$ is the set of indices assigned to cluster $k$, comprised of $|S_k|$ datapoints. To perform the optimization in Eq. 4, we employ a modified version of the greedy algorithm proposed in Ref. 56 (Algorithm 1). In general, solutions to Eq. 4 may overlap, such that $S_k \cap S_l \neq \emptyset$ for $k \neq l$; however, the proposed algorithm enforces that the clusters remain pairwise disjoint.
Algorithm 1 has a per-iteration runtime of $O(KNd^2)$, since we compute $K$ OLS solutions, each with runtime at most $O(Nd^2)$. However, the algorithm can be trivially parallelized over the clusters to reach a per-iteration runtime of $O(Nd^2)$. A key operational step in this algorithm is line 8, which can be explained in simple terms as follows: we assign each datapoint, indexed by $i$, to the cluster to which it is closest, as measured by the squared linear regression distance metric,
$d_k(i) = \left( \mathbf{a}_k^{\mathsf{T}} \mathbf{f}_i + b_k - \epsilon_i \right)^2$ (6)
where $d_k(i)$ is the distance of datapoint $i$ to cluster $k$. In principle, a datapoint could be equidistant to two or more clusters by this metric; in such cases, we randomly assign the datapoint to only one of the equidistant clusters to enforce the pairwise disjointness of the resulting clusters. Convergence of the greedy algorithm is measured by the decrease in the objective function of Eq. 4.
Figure 2 illustrates RC in a simple one-dimensional example for which unsupervised clustering approaches fail to reveal the underlying linear structure. To create two clusters of nearly linear data that overlap in feature space, an interval of feature values is uniformly discretized; a random subset of the resulting feature values is then chosen without replacement for cluster $S_1$, while the remainder are placed in $S_2$. The energy label associated with each feature value is generated from a different linear function for each cluster, plus an i.i.d. Gaussian noise sequence. The resulting dataset is shown in Fig. 2a.
Application of the RC method to this example is illustrated in Figs. 2(b-d). The greedy algorithm is initialized by randomly assigning each datapoint to either $S_1$ or $S_2$ (Fig. 2b). Then, within only a small number of iterations (Figs. 2c and d), the algorithm converges to clusters that reflect the underlying linear character. For comparison, Fig. 2e shows the clustering that is obtained upon convergence of the standard K-means algorithm,^{57} initialized with random cluster assignments. Unlike RC, the K-means algorithm prioritizes the compactness of clusters, resulting in a final clustering that is far less amenable to simple regression. While we recognize that the correct clustering could potentially be obtained using K-means when the dimensions of the features and labels are comparable, this is not the case for MOB-ML applications, since the feature vector is typically at least 10-dimensional while the energy label is a scalar; the RC approach does not suffer from this issue. Finally, we have confirmed that initializing RC from the clustering in Fig. 2e rapidly returns the results of Fig. 2d, requiring only a couple of iterations of the greedy algorithm.
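The greedy RC procedure and a one-dimensional example of the kind described above can be sketched as follows. The slopes, intercepts, noise level, and best-of-several-restarts strategy are illustrative choices for this sketch, not the exact settings used for Fig. 2 or in Algorithm 1:

```python
import numpy as np

def ols_fit(F, eps):
    # OLS with intercept (Eq. 5): returns (a, b) minimizing ||F a + b - eps||^2
    X = np.column_stack([F, np.ones(len(F))])
    coef, *_ = np.linalg.lstsq(X, eps, rcond=None)
    return coef[:-1], coef[-1]

def greedy_rc(F, eps, K, n_iter=100, n_restarts=5, seed=0):
    """Sketch of greedy regression clustering: alternate per-cluster OLS
    fits with reassignment of every point to the cluster with the smallest
    squared residual (Eq. 6); keep the best of several random inits."""
    rng = np.random.default_rng(seed)
    best_labels, best_obj = None, np.inf
    for _ in range(n_restarts):
        labels = rng.integers(K, size=len(eps))  # random initialization
        for _ in range(n_iter):
            dists = np.full((len(eps), K), np.inf)
            for k in range(K):
                idx = labels == k
                if idx.sum() < 2:  # skip (near-)empty clusters
                    continue
                a, b = ols_fit(F[idx], eps[idx])
                dists[:, k] = (F @ a + b - eps) ** 2
            new_labels = dists.argmin(axis=1)
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        obj = dists[np.arange(len(eps)), labels].sum()  # Eq. 4 objective
        if obj < best_obj:
            best_labels, best_obj = labels, obj
    return best_labels, best_obj

# Two noisy linear clusters that overlap near f = 1 (illustrative values)
rng = np.random.default_rng(1)
N = 400
f = np.linspace(0.0, 1.0, N)
member = rng.integers(2, size=N)
eps = np.where(member == 0, 1.0 + 0.5 * f, 2.0 - 0.5 * f)
eps = eps + 0.02 * rng.standard_normal(N)
labels, obj = greedy_rc(f[:, None], eps, K=2)
```

After convergence, each recovered cluster admits a linear fit with residuals on the order of the injected noise, whereas a compactness-based clustering of the same data would mix the two lines.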
3 Calculation Details
Results are presented for QM7b-T,^{48} a thermalized version of the QM7b set^{58} of 7211 molecules with up to seven C, O, N, S, and Cl heavy atoms, as well as for GDB-13-T,^{48} a thermalized version of the GDB-13 set^{59} of molecules with thirteen C, O, N, S, and Cl heavy atoms. The MOB-ML features employed in the current study are identical to those previously provided.^{48} Reference pair correlation energies are computed using second-order Møller-Plesset perturbation theory (MP2)^{49} and coupled cluster with singles, doubles, and perturbative triples (CCSD(T)).^{51, 52} The MP2 reference data were obtained with the cc-pVTZ basis set,^{60} whereas the CCSD(T) data were obtained using the cc-pVDZ basis set.^{60} All employed training and test datasets are provided in Ref. 48.
3.1 Regression Clustering (RC)
RC is performed using the ordinary least-squares linear regression implementation in the Scikit-learn package.^{61} Unless otherwise specified, we initialize the greedy algorithm from the results of K-means clustering, also implemented in Scikit-learn; K-means initialization was found to improve the subsequent training of the random forest classifier (RFC) in comparison to random initialization. Neither L1 nor L2 regularization is found to have a significant effect on the rate of convergence of the greedy algorithm, so neither is employed in the results presented here. The employed convergence threshold for the loss function of the greedy algorithm (Eq. 4) leads to no degradation in the final MOB-ML regression accuracy (Fig. S2); this threshold is used throughout.

3.2 Regression
Two different regression models are employed in the current work. The first is ordinary least-squares linear regression (LR), as implemented in Scikit-learn. The second is Gaussian Process Regression (GPR), as implemented in the GPy 1.9.6 software package.^{62} Regression is independently performed for the training data associated with each cluster, yielding a local regression model for each cluster. Also, as in our previous work,^{47, 48} regression is independently performed for the diagonal and off-diagonal pair correlation energies ($\epsilon_{ii}$ and $\epsilon_{ij}$), yielding independent regression models for each (Eq. 3).
GPR is performed using a negative log marginal likelihood objective. As in our previous work,^{48} the Matérn 5/2 kernel is used for regression of the diagonal pair correlation energies and the Matérn 3/2 kernel is used for the off-diagonal pair correlation energies; in both cases, white-noise regularization^{63} is employed, and the GPR is initialized with unit lengthscale and variance.
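A minimal per-cluster GPR sketch of these settings is given below. The paper uses GPy; for a self-contained illustration we use scikit-learn's Matérn and white-noise kernels as stand-ins, with the kernel orders taken from the text and all data synthetic. The hyperparameters start from unit length scale and are refined by maximizing the log marginal likelihood (scikit-learn minimizes its negative):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

def fit_cluster_gpr(F, eps, diagonal=True):
    """Fit a GPR model to one cluster's (feature, pair-energy) data.
    Matern 5/2 (nu=2.5) for diagonal pairs, Matern 3/2 (nu=1.5) for
    off-diagonal pairs, plus white-noise regularization."""
    nu = 2.5 if diagonal else 1.5
    kernel = Matern(length_scale=1.0, nu=nu) + WhiteKernel(noise_level=1e-4)
    model = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    return model.fit(F, eps)

# Toy usage: synthetic stand-in for one cluster of diagonal-pair data
rng = np.random.default_rng(0)
F = rng.random((50, 3))
eps = -0.1 * F.sum(axis=1) + 0.001 * rng.standard_normal(50)
model = fit_cluster_gpr(F, eps)
pred = model.predict(F[:5])
```

In the actual workflow one such model would be fit independently (and in parallel) for each cluster produced by RC.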
3.3 Classification
A random forest classifier (RFC) is trained on MOB-ML features and cluster labels for a training set and then used to predict the cluster assignment of test datapoints in MOB-ML feature space. We employ the RFC implementation in Scikit-learn, using 200 trees, the entropy split criterion,^{64} and balanced class weights.^{64} Alternative classifiers were also tested in this work, including K-means, linear SVM,^{65} and AdaBoost;^{66} however, these schemes were generally found to yield less accurate MOB-ML energy predictions than the RFC.
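The RFC settings named above map directly onto scikit-learn parameters; the feature and label arrays in this sketch are synthetic stand-ins for MOB feature vectors and RC cluster labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for MOB feature vectors and RC cluster labels
rng = np.random.default_rng(0)
F_train = rng.random((300, 10))
cluster_labels = (F_train[:, 0] > 0.5).astype(int)

# 200 trees, entropy split criterion, balanced class weights (as in the text)
clf = RandomForestClassifier(
    n_estimators=200,
    criterion="entropy",
    class_weight="balanced",
    random_state=0,
    n_jobs=-1,  # the 200 trees can be trained in parallel
)
clf.fit(F_train, cluster_labels)
predicted_clusters = clf.predict(rng.random((5, 10)))
```

At test time, `predict` supplies the cluster label that selects which cluster-specific regression model is applied to each feature vector.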
For comparison, a “perfect” classifier is obtained by simply including the test data within the RC training set. While useful for analyzing the prediction errors due to classification, this scheme is not generally practical because it assumes prior knowledge of the reference energy labels for the test molecules. Since the perfect classifier avoids misclassification of the test data by construction, it should be regarded as a best-case scenario for the performance of the clustering/regression/classification approach.
3.4 The clustering/regression/classification workflow
Fig. 3 summarizes the combined workflow for training and evaluating a MOB-ML model with clustering. Training involves three steps. First, the training dataset of MOB-ML feature vectors and energy labels is assigned to clusters using the RC method (panel a). Second, for each cluster of training data, a regression model (LR or GPR) is trained to enable the prediction of pair correlation energies from the MOB-ML feature vector. Third, a classifier is trained from the MOB-ML feature vectors and cluster labels of the training data to enable the prediction of the cluster assignment from a MOB-ML feature vector.
The resulting MOB-ML model is specified in terms of the method of clustering (RC, for all results presented here), the method of regression (either LR or GPR), and the method of classification (either RFC or the perfect classifier). In referring to a given MOB-ML model, we employ a notation that specifies these options (e.g., RC/LR/RFC or RC/GPR/perfect).
Evaluation of the trained MOB-ML model is illustrated in Fig. 3d. A given molecule is first decomposed into a set of test feature vectors associated with its pairs of occupied MOs. The classifier is then used to assign each feature vector to a cluster, and the cluster-specific regression model is used to predict the pair correlation energy from the MOB feature vector. Finally, the pair correlation energies are summed to yield the total correlation energy for the molecule.
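The evaluation steps above can be sketched end to end; the classifier and the per-cluster linear models below are illustrative placeholders, not trained components:

```python
import numpy as np

def predict_correlation_energy(feature_vectors, classify, cluster_models):
    """Classify each MOB feature vector into a cluster, apply that
    cluster's regression model, and sum the predicted pair energies."""
    e_total = 0.0
    for f in feature_vectors:
        k = classify(f)              # cluster assignment
        a, b = cluster_models[k]     # cluster-specific linear model (a, b)
        e_total += float(a @ f + b)  # predicted pair correlation energy
    return e_total                   # summed over all pairs

# Placeholder classifier and two placeholder linear cluster models
classify = lambda f: int(f[0] > 0.5)
models = {0: (np.array([-0.1, 0.0]), 0.0),
          1: (np.array([0.0, -0.2]), -0.01)}
feats = [np.array([0.2, 0.3]), np.array([0.8, 0.5])]
E_c = predict_correlation_energy(feats, classify, models)
```

Swapping in a trained RFC for `classify` and trained LR or GPR models for `cluster_models` yields the full evaluation pipeline of Fig. 3d.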
To improve the accuracy and reduce the uncertainty of the MOB-ML predictions, ensembles of 10 independent models are trained using the clustering/regression/classification workflow, and the predictive mean and the corresponding standard error of the mean (SEM) are computed by averaging over the 10 models; a comparison between the learning curves from a single run and from averaging over the 10 independent models is included in Supporting Information Fig. S1.
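The ensemble statistics are standard; a small sketch with illustrative numbers (not actual MOB-ML output) makes the mean/SEM computation explicit:

```python
import numpy as np

def ensemble_mean_sem(preds):
    """Predictive mean and standard error of the mean (SEM) over an
    ensemble of predictions with shape (n_models, n_targets)."""
    preds = np.asarray(preds, dtype=float)
    mean = preds.mean(axis=0)
    sem = preds.std(axis=0, ddof=1) / np.sqrt(preds.shape[0])
    return mean, sem

# 10 independent model predictions for 3 test molecules (illustrative
# values in kcal/mol)
rng = np.random.default_rng(0)
preds = -400.0 + 0.5 * rng.standard_normal((10, 3))
mean, sem = ensemble_mean_sem(preds)
```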
4 Results
4.1 Clustering and classification in MOB feature space
We begin by showing that the situation explored in Fig. 2, in which locally linear clusters overlap, also arises in realistic chemical applications of MOB-ML. We consider the QM7b-T set of drug-like molecules with thermalized geometries, using the diagonal pair correlation energies computed at the MP2/cc-pVTZ level. Randomly selecting 1000 molecules for training, we perform RC on the dataset comprised of these energy labels and feature vectors, using an optimized number of clusters, $K$; the sensitivity of RC to the choice of $K$ is examined later.
In many cases, the resulting clusters are well separated, such that the datapoints of a given cluster have small distances (as measured by the linear regression distance metric, Eq. 6) to the cluster to which they belong and large distances to all other clusters. However, the clusters can also overlap. Fig. 4a illustrates this overlap for two particular clusters (labeled 1 and 2) obtained from the QM7b-T diagonal-pair training data. Each datapoint assigned to cluster 1 (blue) is plotted according to its distance to both cluster 1 and cluster 2; likewise for the datapoints in cluster 2 (red). The datapoints for which the distances to both clusters approach zero correspond to regions of overlap between the clusters in the high-dimensional space of MOB-ML features, akin to the case shown in Fig. 2.
Finally, in Fig. 4b, we illustrate the classification of the feature vectors into clusters. An RFC is trained on the feature vectors and cluster labels for the diagonal pairs of the 1000 QM7b-T molecules in the training set, and the classifier is then used to predict the cluster assignments for the feature vectors associated with the diagonal pairs of the remaining 6211 molecules in QM7b-T. For clusters 1 and 2, we then analyze the accuracy of the RFC by plotting the linear regression distance of each datapoint to the two clusters, as well as indicating the RFC classification of the feature vector. Each red datapoint in Fig. 4b that lies above the diagonal line of reflection is misclassified into cluster 2, and similarly, each blue datapoint that lies below the line of reflection is misclassified into cluster 1. The figure illustrates that while the RFC is not a perfect means of classification, it is at least qualitatively correct. Later in the results section, we analyze the sources of MOB-ML prediction error due to misclassification by comparing energy predictions obtained with perfect classification versus the RFC.
4.2 Chemically intuitive clusters
A natural question is whether the clusters obtained via RC are consistent with chemical intuition, and whether the RFC can correctly classify MOB-ML feature vectors from test molecules in a way that is also consistent with chemical intuition. To address this, we employ a training set of 500 randomly selected molecules from QM7b-T, and we perform regression clustering of the diagonal pair correlation energies for a range of total cluster numbers, $K$. For each clustering, we then train an RFC. Finally, each trained RFC is independently applied to a set of test molecules with easily characterized valence molecular orbitals (listed in the caption of Fig. 5) to see how the feature vectors associated with the valence occupied LMOs are classified among the optimized clusters.
Figure 5 presents the results of this exercise, clearly indicating the agreement between chemical intuition and the predictions of the RFC. As the number of clusters increases, the feature vectors associated with different valence LMO types are resolved into different clusters; with a sufficiently large number of clusters (15 or 20), each cluster is dominated by a single type of LMO, while each LMO type is assigned to a small number of different clusters. The empty boxes in Fig. 5 reflect that the training set contains a greater diversity of LMO types than the 11 test molecules, which is expected. The observed consistency of the clustering/classification method with chemical intuition is promising for the accurate local regression of pair correlation energies, which is the focus of the current work; however, the results of Fig. 5 also suggest that clustering/classification in MOB-ML feature space provides a powerful and highly general way of mapping the structure of chemical space for other applications.
4.3 Sensitivity to the number of clusters
We now explore the sensitivity of the MOB-ML clustering/regression/classification implementation to the number of employed clusters. In particular, we investigate the mean absolute error (MAE) of the MOB-ML predictions for the diagonal ($\epsilon_{ii}$) and off-diagonal ($\epsilon_{ij}$, $i \neq j$) contributions to the total correlation energy as a function of the number of clusters, $K$, used in the RC. The MOB-ML models employ linear regression and RFC classification (i.e., the RC/LR/RFC protocol); the training set is comprised of 1000 randomly chosen molecules from QM7b-T, and the test set contains the remaining molecules in QM7b-T.
Figure 6 presents the results of this calibration study, plotting the prediction MAE as a function of the number of clusters. Not surprisingly, the prediction accuracy for both the diagonal and off-diagonal contributions improves with $K$, although it eventually plateaus in both cases. For the diagonal contributions, the accuracy improves most rapidly up to approximately 20 clusters, in accord with the observations in Fig. 5; for the off-diagonal contributions, a larger number of clusters is useful for reducing the MAE, which is sensible given the greater variety of feature vectors that can be created from pairs of LMOs rather than individual LMOs. Appealingly, there is no strong indication of MAE increases due to “overclustering.” While recognizing that the optimal number of clusters will in general depend somewhat on the application and the regression method (i.e., LR versus GPR), the results in Fig. 6 nonetheless provide useful guidance with regard to appropriate values of $K$. Throughout the remainder of the study, we employ fixed values of $K$ for the MOB-ML prediction of the diagonal and off-diagonal contributions to the correlation energy, with a larger value employed for the off-diagonal contributions; however, we recognize that these choices could be further optimized.
4.4 Performance and training cost of MOB-ML with RC
We now investigate the effect of clustering on the accuracy and training costs of MOB-ML for applications to sets of drug-like molecules. Figure 7a presents learning curves (on a linear-linear scale) for various implementations of MOB-ML applied to MP2/cc-pVTZ correlation energies, with the training and test sets corresponding to non-overlapping subsets of QM7b-T. In addition to the new results obtained using RC, we include the MOB-ML results from our previous work (GPR without clustering).^{48}
Figure 7a yields three clear observations. The first is that the use of RC with RFC (i.e., RC/GPR/RFC and RC/LR/RFC) leads to slightly less efficient learning curves than our previous implementation without clustering, at least when efficiency is measured in terms of the number of training molecules. Both the RC/GPR/RFC and RC/LR/RFC protocols require approximately 300 training molecules to reach the threshold of chemical accuracy employed here (1 kcal/mol per seven heavy atoms), whereas MOB-ML without clustering requires approximately half as many training molecules. The second observation is that the classifier is the dominant source of prediction error in these results. Comparison of the results obtained using the RFC versus the perfect classifier (which utilizes prior knowledge of the energy labels and is thus not generally practical) reveals a dramatic reduction in the prediction error, regardless of the regression method. This result indicates that there is potentially much to be gained from the development of improved classifiers for MOB-ML applications. The third observation is that with a perfect classifier, LR slightly outperforms GPR, given that the clusters are optimized to be locally linear; however, GPR slightly outperforms LR in combination with the RFC, indicating that GPR is less sensitive to classification error than LR.
Figure 7b presents the corresponding results at the CCSD(T)/cc-pVDZ level of theory. The same trends emerge as at the MP2/cc-pVTZ level. As seen in previous work, the training efficiency of MOB-ML with respect to the size of the reference dataset is largely insensitive to the level of electronic structure theory.^{47, 48}
Figure 8 explores the training costs and transferability of MOB-ML models that employ RC. In all cases, the models are trained on random subsets of molecules from QM7b-T with up to seven heavy atoms, and predictions are made either for the remaining molecules of QM7b-T (circles) or for the GDB-13-T set (diamonds); it has previously been shown that MOB-ML substantially outperforms the FCHL atom-based-feature method in terms of transferability from small to large molecules.^{48} The parallelization of the training steps is implemented as follows. Within the RC step, the LR for each cluster is performed independently on a different core of a 16-core Intel Skylake (2.1 GHz) CPU. Within the regression step, the LR or GPR for each cluster is likewise performed independently on a different core. For RFC training, we employ 200 cores using the parallel implementation in Scikit-learn, since there are 200 trees. The regression and RFC training are independent of each other and are thus also trivially parallelizable.
Focusing first on the predictions for seven-heavy-atom molecules (circles), it is clear from Fig. 8 that RC leads to large improvements in the efficiency of the MOB-ML wall-clock training costs. Although MOB-ML with clustering requires somewhat more training molecules than MOB-ML without clustering, it enables chemical accuracy to be reached with the training cost reduced by a factor of approximately 4500 for RC/GPR/RFC and 35,000 for RC/LR/RFC. Remarkably, for predictions within the QM7b-T set, chemical accuracy can be achieved using RC/LR/RFC with a wall-clock training time of only 7.7 s.
Figure 8 also demonstrates the transferability of the MOB-ML models for predictions on the GDB-13-T set of thirteen-heavy-atom molecules (diamonds). In general, the degradation in the MAE per atom is greater for RC/LR/RFC than for RC/GPR/RFC, due to the previously mentioned sensitivity of LR to classification error. However, we note that RC/GPR/RFC enables predictions on GDB-13-T (blue diamonds) that meet the per-atom threshold of chemical accuracy used here, whereas that threshold was not achievable without clustering (green diamonds) due to the prohibitive training costs involved.
The improved efficiency of MOB-ML training with the use of clustering arises from the cubic scaling of standard GPR in terms of training time ($O(N^3)$, where $N$ is the number of training pairs).^{63} Trivial parallelization over the independent regression of the clusters reduces this training-time cost to the cube of the size of the largest cluster. We note that other kernel-based ML methods with high training-time complexity, such as kernel ridge regression,^{67} would similarly benefit from clustering. For the RC/LR/RFC and RC/GPR/RFC results presented in Fig. 8, a breakdown of the training-time contributions from each step of the clustering/regression/classification workflow as a function of the size of the training dataset is shown in Fig. S3; this supporting information figure confirms that the GPR regression dominates the total training cost for the RC/GPR/RFC implementation, whereas training the RFC dominates the training cost for RC/LR/RFC. In addition to improved efficiency in terms of training time, clustering also brings benefits in terms of the memory costs for MOB-ML training, due to the quadratic scaling of GPR memory costs with the size of the dataset.

4.5 Capping the cluster size
Since the parallelized training time for RC/GPR/RFC is dominated by the GPR regression of the largest cluster (Fig. S3), a natural question is whether additional computational savings and adequate prediction accuracy could be achieved by simply capping the number of datapoints in the largest cluster. In doing so, we define the cap to be the number of datapoints in the largest cluster obtained when RC with the greedy algorithm is applied to a reference training dataset of a given number of molecules from QM7b-T. Upon specifying this reference number of molecules (and thus the cap), the RC/GPR/RFC implementation is modified as follows. For a given number of training molecules (which will typically exceed the reference number), the RC step is performed as normal. However, at the end of the RC step, datapoints belonging to clusters whose size exceeds the cap are discarded at random until all clusters contain no more datapoints than the cap allows. The GPR and RFC training steps are then performed as before, using this set of size-capped clusters. The precise value of the cap varies slightly with the random selection of training molecules and with the convergence of the greedy algorithm, but representative values are used for the numerical tests presented here.
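The capping step itself is a simple post-processing of the RC output; in this sketch the cluster index arrays and the cap value are illustrative, with clusters represented as arrays of datapoint indices:

```python
import numpy as np

def cap_clusters(clusters, n_max, seed=0):
    """Discard datapoints at random from any cluster larger than n_max,
    so that no cluster exceeds n_max datapoints (the capping procedure
    described in the text; n_max would come from RC on a smaller
    reference training set)."""
    rng = np.random.default_rng(seed)
    capped = []
    for idx in clusters:
        idx = np.asarray(idx)
        if len(idx) > n_max:
            idx = rng.choice(idx, size=n_max, replace=False)
        capped.append(np.sort(idx))
    return capped

# One small cluster and one oversized cluster of datapoint indices
clusters = [np.arange(0, 50), np.arange(50, 1050)]
capped = cap_clusters(clusters, n_max=300)
```

Because the parallelized GPR cost is set by the largest cluster, capping bounds that cost regardless of how many additional training molecules are supplied.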
Figure 9a demonstrates that capping the maximum cluster size allows for substantial improvements in accuracy when the number of training molecules exceeds the reference number used to define the cap. Specifically, the figure shows the effect of capping on the RC/GPR/RFC learning curves for MP2/cc-pVTZ correlation energies, with the training and test sets corresponding to non-overlapping subsets of QM7b-T. For example, if the maximum cluster size is capped at the value obtained with 100 training molecules while 300 training molecules are employed, then the prediction MAE drops well below the 100-molecule baseline, while the parallelized training cost for RC/GPR/RFC is unchanged so long as that cost remains dominated by the GPR regression of the largest cluster. As expected, Fig. 9a shows that the learning curves saturate at higher prediction MAE values when smaller caps are employed. Nonetheless, the figure demonstrates that if additional training data are available, then the prediction accuracy of MOB-ML with RC can be substantially improved while capping the size of the largest cluster.
Figure 9b demonstrates the actual effect of capping on the parallelized training time, plotting the prediction MAE versus the parallelized training time as a function of the number of training molecules. For reference, the results obtained using RC/LR/RFC and RC/GPR/RFC without capping are reproduced from Fig. 8. As is necessary, the RC/GPR/RFC results obtained with capping exactly overlap those obtained without capping when the number of training molecules does not exceed the reference number used to define the cap. However, for each value of the cap, a sharp drop in the prediction MAE is seen when the number of training molecules begins to exceed that reference number, demonstrating that the prediction accuracy can be greatly improved with minimal increase in the parallelized training time. For example, for RC/GPR/RFC with a capped cluster size, chemical accuracy can be reached with only 7.4 s of parallelized training, slightly less than even RC/LR/RFC. For small values of the cap, the prediction MAE eventually levels off versus the training time, since the RFC training step becomes the dominant contribution to the training time.
5 Conclusions
Molecular-orbital-based (MOB) features offer a complete representation for mapping chemical space and a compact representation for evaluating correlation energies. In the current work, we take advantage of the intrinsic structure of MOB feature space, which clusters according to types of localized molecular orbitals, as well as the fact that the orbital-pair contributions to the correlation energy vary linearly with the MOB features, to overcome a fundamental bottleneck in the efficiency of machine learning (ML) for correlation energies. Specifically, we introduce a regression clustering (RC) approach in which MOB features and pair correlation energies are clustered according to their local linearity; we then individually regress these clusters and train a classifier to predict cluster assignments on the basis of MOB features. This combined clustering/regression/classification approach is found to reduce MOB-ML training times by 3 to 4 orders of magnitude, while enabling prediction accuracies that are substantially improved over those possible with MOB-ML without clustering. The use of a random forest classifier for the cluster assignments, while better than the alternatives that were explored, is found to be the limiting factor for MOB-ML accuracy within this new approach, motivating future work on improved classifiers. This work provides a useful step toward the development of accurate, transferable, and scalable quantum ML methods to describe ever-broader swathes of chemical space.
This work emerged from a CMS 273 class project at Caltech that also involved Dmitry Burov, Jialin Song, Ying Shi Teh, and Dr. Tamara Husch, as well as Professors Kaushik Bhattacharya and Richard Murray; we thank these individuals for their ideas and contributions. This work is supported by the US Air Force Office of Scientific Research (AFOSR) grant FA9550-17-1-0102. M.W. acknowledges a postdoctoral fellowship from the Resnick Sustainability Institute. N.B.K. is supported, in part, by the US National Science Foundation (NSF) grant DMS-1818977, the US Office of Naval Research (ONR) grant N00014-17-1-2079, and the US Army Research Office (ARO) grant W911NF-12-2-0022. Computational resources were provided by the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility supported by the DOE Office of Science under contract DE-AC02-05CH11231.
Figures in the supporting information show the effect of averaging over independently trained MOB-ML models (Fig. S1), the sensitivity of the prediction accuracy to the RC convergence threshold (Fig. S2), and a detailed breakdown of the parallelized wall-clock timings (Fig. S3). Tables in the supporting information provide the numerical data for the plots appearing in the main text.
References
 Lavecchia 2015 Lavecchia, A. Machine-learning approaches in drug discovery: Methods and applications. Drug Discov. Today 2015, 20, 318–331.
 Gawehn et al. 2016 Gawehn, E.; Hiss, J. A.; Schneider, G. Deep learning in drug discovery. Mol. Inform. 2016, 35, 3–14.
 Popova et al. 2018 Popova, M.; Isayev, O.; Tropsha, A. Deep reinforcement learning for de novo drug design. Sci. Adv. 2018, 4, eaap7885.
 Kearnes et al. 2016 Kearnes, S.; McCloskey, K.; Berndl, M.; Pande, V.; Riley, P. Molecular graph convolutions: Moving beyond fingerprints. J. Comput. Aided Mol. Des. 2016, 30, 595.
 Mater, A. C.; Coote, M. L. Deep learning in chemistry. J. Chem. Inf. Model. 2019, 59, 2545–2559.
 Kim et al. 2017 Kim, E.; Huang, K.; Jegelka, S.; Olivetti, E. Virtual screening of inorganic materials synthesis parameters with deep learning. npj Comput. Mater. 2017, 3, 53.
 Ren et al. 2018 Ren, F.; Ward, L.; Williams, T.; Laws, K. J.; Wolverton, C.; Hattrick-Simpers, J.; Mehta, A. Accelerated discovery of metallic glasses through iteration of machine learning and high-throughput experiments. Sci. Adv. 2018, 4, eaaq1566.
 Butler et al. 2018 Butler, K. T.; Davies, D. W.; Cartwright, H.; Isayev, O.; Walsh, A. Machine learning for molecular and materials science. Nature 2018, 559, 547–555.
 Sanchez-Lengeling and Aspuru-Guzik 2018 Sanchez-Lengeling, B.; Aspuru-Guzik, A. Inverse molecular design using machine learning: Generative models for matter engineering. Science 2018, 361, 360–365.
 Wei et al. 2016 Wei, J. N.; Duvenaud, D.; Aspuru-Guzik, A. Neural networks for the prediction of organic chemistry reactions. ACS Cent. Sci. 2016, 2, 725–732.
 Raccuglia et al. 2016 Raccuglia, P.; Elbert, K. C.; Adler, P. D. F.; Falk, C.; Wenny, M. B.; Mollo, A.; Zeller, M.; Friedler, S. A.; Schrier, J.; Norquist, A. J. Machine-learning-assisted materials discovery using failed experiments. Nature 2016, 533, 73–76.
 Ulissi et al. 2017 Ulissi, Z. W.; Medford, A. J.; Bligaard, T.; Nørskov, J. K. To address surface reaction network complexity using scaling relations machine learning and DFT calculations. Nat. Commun. 2017, 8, 14621.
 Segler and Waller 2017 Segler, M. H. S.; Waller, M. P. Neural-symbolic machine learning for retrosynthesis and reaction prediction. Chem. Eur. J. 2017, 23, 5966–5971.
 Segler et al. 2018 Segler, M. H. S.; Preuss, M.; Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 2018, 555, 604–610.
 Smith et al. 2017 Smith, J. S.; Isayev, O.; Roitberg, A. E. ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chem. Sci. 2017, 8, 3192–3203.
 Smith et al. 2018 Smith, J. S.; Nebgen, B. T.; Zubatyuk, R.; Lubbers, N.; Devereux, C.; Barros, K.; Tretiak, S.; Isayev, O.; Roitberg, A. Outsmarting quantum chemistry through transfer learning. ChemRxiv e-prints 2018, 10.26434/chemrxiv.6744440.
 Lubbers et al. 2018 Lubbers, N.; Smith, J. S.; Barros, K. Hierarchical modeling of molecular energies using a deep neural network. J. Chem. Phys. 2018, 148, 241715.
 Bartók et al. 2010 Bartók, A. P.; Payne, M. C.; Kondor, R.; Csányi, G. Gaussian approximation potentials: the accuracy of quantum mechanics, without the electrons. Phys. Rev. Lett. 2010, 104, 136403.
 Rupp et al. 2012 Rupp, M.; Tkatchenko, A.; Müller, K.-R.; von Lilienfeld, O. A. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett. 2012, 108, 058301.
 Montavon et al. 2013 Montavon, G.; Rupp, M.; Gobre, V.; Vazquez-Mayagoitia, A.; Hansen, K.; Tkatchenko, A.; Müller, K.-R.; von Lilienfeld, O. A. Machine learning of molecular electronic properties in chemical compound space. New J. Phys. 2013, 15, 095003.
 Hansen et al. 2013 Hansen, K.; Montavon, G.; Biegler, F.; Fazli, S.; Rupp, M.; Scheffler, M.; von Lilienfeld, O. A.; Tkatchenko, A.; Müller, K.-R. Assessment and validation of machine learning methods for predicting molecular atomization energies. J. Chem. Theory Comput. 2013, 9, 3404.
 Ramakrishnan et al. 2015 Ramakrishnan, R.; Dral, P. O.; Rupp, M.; von Lilienfeld, O. A. Big data meets quantum chemistry approximations: the machine learning approach. J. Chem. Theory Comput. 2015, 11, 2087.
 Behler 2016 Behler, J. Perspective: Machine learning potentials for atomistic simulations. J. Chem. Phys. 2016, 145, 170901.
 Paesani 2016 Paesani, F. Getting the right answers for the right reasons: toward predictive molecular simulations of water with many-body potential energy functions. Acc. Chem. Res. 2016, 49, 1844.
 Schütt et al. 2017 Schütt, K. T.; Arbabzadah, F.; Chmiela, S.; Müller, K.-R.; Tkatchenko, A. Quantum-chemical insights from deep tensor neural networks. Nat. Commun. 2017, 8, 13890.
 Wu et al. 2018 Wu, Z.; Ramsundar, B.; Feinberg, E. N.; Gomes, J.; Geniesse, C.; Pappu, A. S.; Leswing, K.; Pande, V. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 2018, 9, 513.
 Nguyen et al. 2018 Nguyen, T. T.; Székely, E.; Imbalzano, G.; Behler, J.; Csányi, G.; Ceriotti, M.; Götz, A. W.; Paesani, F. Comparison of permutationally invariant polynomials, neural networks, and Gaussian approximation potentials in representing water interactions through many-body expansions. J. Chem. Phys. 2018, 148, 241725.
 Fujikake et al. 2018 Fujikake, S.; Deringer, V. L.; Lee, T. H.; Krynski, M.; Elliott, S. R.; Csányi, G. Gaussian approximation potential modeling of lithium intercalation in carbon nanostructures. J. Chem. Phys. 2018, 148, 241714.
 Li et al. 2018 Li, H.; Collins, C.; Tanha, M.; Gordon, G. J.; Yaron, D. J. A density functional tight binding layer for deep learning of chemical Hamiltonians. J. Chem. Theory Comput. 2018, 14, 5764–5776.
 Zhang et al. 2018 Zhang, L.; Han, J.; Wang, H.; Car, R.; E, W. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Phys. Rev. Lett. 2018, 120, 143001.
 Nandy et al. 2018 Nandy, A.; Duan, C.; Janet, J. P.; Gugler, S.; Kulik, H. J. Strategies and software for machine learning accelerated discovery in transition metal chemistry. Ind. Eng. Chem. Res. 2018, 57, 13973–13986.
 Grisafi et al. 2019 Grisafi, A.; Wilkins, D. M.; Willatt, M. J.; Ceriotti, M. Atomic-scale representation and statistical learning of tensorial properties. arXiv e-prints 2019, arXiv:1904.01623.
 Grisafi et al. 2019 Grisafi, A.; Fabrizio, A.; Meyer, B.; Wilkins, D. M.; Corminboeuf, C.; Ceriotti, M. Transferable machine-learning model of the electron density. ACS Cent. Sci. 2019, 5, 57–64.
 Bogojeski et al. 2018 Bogojeski, M.; Brockherde, F.; Vogt-Maranto, L.; Li, L.; Tuckerman, M. E.; Burke, K.; Müller, K.-R. Efficient prediction of 3D electron densities using machine learning. arXiv e-prints 2018, arXiv:1811.06255.
 Pereira and Aires-de-Sousa 2018 Pereira, F.; Aires-de-Sousa, J. Machine learning for the prediction of molecular dipole moments obtained by density functional theory. J. Cheminformatics 2018, 10, 43.
 Smith et al. 2019 Smith, J. S.; Nebgen, B. T.; Zubatyuk, R.; Lubbers, N.; Devereux, C.; Barros, K.; Tretiak, S.; Isayev, O.; Roitberg, A. E. Approaching coupled cluster accuracy with a general-purpose neural network potential through transfer learning. Nat. Commun. 2019, 10, 2903.
 Ramakrishnan et al. 2015 Ramakrishnan, R.; Hartmann, M.; Tapavicza, E.; von Lilienfeld, O. A. Electronic spectra from TD-DFT and machine learning in chemical space. J. Chem. Phys. 2015, 143, 084111.
 Gastegger et al. 2017 Gastegger, M.; Behler, J.; Marquetand, P. Machine learning molecular dynamics for the simulation of infrared spectra. Chem. Sci. 2017, 8, 6924–6935.
 Yao et al. 2018 Yao, K.; Herr, J. E.; Toth, D. W.; McKintyre, R.; Parkhill, J. The TensorMol-0.1 model chemistry: a neural network augmented with long-range physics. Chem. Sci. 2018, 9, 2261–2269.
 Christensen et al. 2019 Christensen, A. S.; Faber, F. A.; von Lilienfeld, O. A. Operators in quantum machine learning: Response properties in chemical space. J. Chem. Phys. 2019, 150, 064105.
 Ghosh et al. 2019 Ghosh, K.; Stuke, A.; Todorović, M.; Jørgensen, P. B.; Schmidt, M. N.; Vehtari, A.; Rinke, P. Deep learning spectroscopy: Neural networks for molecular excitation spectra. Adv. Sci. 2019, 6, 1801367.
 Brockherde et al. 2017 Brockherde, F.; Vogt, L.; Li, L.; Tuckerman, M. E.; Burke, K.; Müller, K.-R. Bypassing the Kohn-Sham equations with machine learning. Nat. Commun. 2017, 8, 872.
 McGibbon et al. 2017 McGibbon, R. T.; Taube, A. G.; Donchev, A. G.; Siva, K.; Hernández, F.; Hargus, C.; Law, K.-H.; Klepeis, J. L.; Shaw, D. E. Improving the accuracy of Møller-Plesset perturbation theory with neural networks. J. Chem. Phys. 2017, 147, 161725.
 Nudejima et al. 2019 Nudejima, T.; Ikabata, Y.; Seino, J.; Yoshikawa, T.; Nakai, H. Machine-learned electron correlation model based on correlation energy density at complete basis set limit. J. Chem. Phys. 2019, 151, 024104.
 Townsend and Vogiatzis 2019 Townsend, J.; Vogiatzis, K. D. Data-driven acceleration of the coupled-cluster singles and doubles iterative solver. J. Phys. Chem. Lett. 2019, 10, 4129–4135.
 Sinitskiy and Pande 2019 Sinitskiy, A. V.; Pande, V. S. Physical machine learning outperforms “human learning” in quantum chemistry. arXiv e-prints 2019, arXiv:1908.00971.
 Welborn et al. 2018 Welborn, M.; Cheng, L.; Miller, T. F., III Transferability in machine learning for electronic structure via the molecular orbital basis. J. Chem. Theory Comput. 2018, 14, 4772–4779.
 Cheng et al. 2019 Cheng, L.; Welborn, M.; Christensen, A. S.; Miller, T. F., III A universal density matrix functional from molecular orbitalbased machine learning: Transferability across organic molecules. J. Chem. Phys. 2019, 150, 131103.
 Møller and Plesset 1934 Møller, C.; Plesset, M. S. Note on an approximation treatment for many-electron systems. Phys. Rev. 1934, 46, 618.
 Saebo and Pulay 1993 Saebo, S.; Pulay, P. Local treatment of electron correlation. Annu. Rev. Phys. Chem. 1993, 44, 213–236.
 Bartlett et al. 1990 Bartlett, R. J.; Watts, J. D.; Kucharski, S. A.; Noga, J. Noniterative fifthorder triple and quadruple excitation energy corrections in correlated methods. Chem. Phys. Lett. 1990, 165, 513–522.
 Schütz 2000 Schütz, M. Low-order scaling local electron correlation methods. III. Linear scaling local perturbative triples correction (T). J. Chem. Phys. 2000, 113, 9986–10001.
 Nesbet 1958 Nesbet, R. K. Brueckner’s theory and the method of superposition of configurations. Phys. Rev. 1958, 109, 1632.
 Szabo and Ostlund 1996 Szabo, A.; Ostlund, N. S. Modern Quantum Chemistry; Dover: Mineola, 1996; pp 231–239.
 Späth 1979 Späth, H. Correction to algorithm 39: clusterwise linear regression. Computing 1979, 26, 275.
 Späth 1979 Späth, H. Algorithm 39: clusterwise linear regression. Computing 1979, 22, 367–373.
 Lloyd 1982 Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137.
 Montavon et al. 2013 Montavon, G.; Rupp, M.; Gobre, V.; Vazquez-Mayagoitia, A.; Hansen, K.; Tkatchenko, A.; Müller, K.-R.; von Lilienfeld, O. A. Machine learning of molecular electronic properties in chemical compound space. New J. Phys. 2013, 15, 095003.
 Blum and Reymond 2009 Blum, L. C.; Reymond, J.L. 970 Million druglike small molecules for virtual screening in the chemical universe database GDB13. J. Am. Chem. Soc. 2009, 131, 8732.
 Dunning 1989 Dunning, T. H. Gaussian basis sets for use in correlated molecular calculations. I. The atoms boron through neon and hydrogen. J. Chem. Phys. 1989, 90, 1007.
 Pedregosa et al. 2011 Pedregosa, F. et al. Scikit-learn: machine learning in Python (v0.21.2). J. Mach. Learn. Res. 2011, 12, 2825.
 GPy since 2012 GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy, since 2012.
 Rasmussen and Williams 2006 Rasmussen, C. E.; Williams, C. K. I. Gaussian processes for machine learning; MIT Press: Cambridge, MA, 2006.
 Criminisi et al. 2012 Criminisi, A.; Shotton, J.; Konukoglu, E. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Found. Trends Comput. Graph. Vis. 2012, 7, 81–227.
 Fan et al. 2008 Fan, R.-E.; Chang, K.-W.; Hsieh, C.-J.; Wang, X.-R.; Lin, C.-J. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res. 2008, 9, 1871–1874.
 Hastie et al. 2009 Hastie, T.; Rosset, S.; Zhu, J.; Zou, H. Multi-class AdaBoost. Stat. Its Interface 2009, 2, 349–360.
 Murphy 2012 Murphy, K. P. Machine learning: a probabilistic perspective; MIT Press: Cambridge, MA, 2012; pp 492–493.