Simple and efficient algorithms for training machine learning potentials to force data

06/09/2020 ∙ by Justin S. Smith, et al. ∙ Los Alamos National Laboratory 0

Abstract Machine learning models, trained on data from ab initio quantum simulations, are yielding molecular dynamics potentials with unprecedented accuracy. One limiting factor is the quantity of available training data, which can be expensive to obtain. A quantum simulation often provides all atomic forces, in addition to the total energy of the system. These forces provide much more information than the energy alone. It may appear that training a model to this large quantity of force data would introduce significant computational costs. Actually, training to all available force data should only be a few times more expensive than training to energies alone. Here, we present a new algorithm for efficient force training, and benchmark its accuracy by training to forces from real-world datasets for organic chemistry and bulk aluminum.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Machine learning (ML) is driving the development of next-generation interatomic potentials. By training the ML model to a large and diverse dataset of ab initio quantum simulations, one aims to build a low-cost, high-fidelity emulator, valid over a wide space of atomic configurations. For example, such ML potentials can be used as the basis for large scale molecular dynamics simulations with unprecedented accuracy Smith et al. (2020a); Lu et al. (2020).

The reference data is generated by approximate solution to the Schrödinger equation, typically using a tool such as density functional theory (DFT). Under the Born Oppenheimer approximation, nuclei are treated classically. Each reference calculation takes as input the atomic configuration (nuclei positions and species) and outputs total energy . Often, once the total energy has been computed, forces for all atoms can be produced, at minimal additional cost. If possible to acquire, these forces provide highly valuable training data for the ML model. For a system with atoms, the collection of force components comprise times more data than the energy scalar.

The ML model predicts a potential energy surface that is hopefully a good approximation to the true energy , even for configurations outside the training set. To maximize generality, it is natural to train the ML model such that its predicted energy and forces

agree with reference data. It might appear that incorporating a large quantity of force data into the training procedure would incur a large increase in computational cost. Here, we show otherwise. In the context of neural networks (or more generally, any method based on gradient-based optimization of a loss function) one can train on energy


force data at a cost comparable to training on the energy data alone. Readers familiar with ML frameworks (ML-F) such as PyTorch 

Paszke et al. (2019)

or TensorFlow 

Abadi et al. (2016)

may recognize the above statement as self-evident. The principle of reverse-mode automatic differentiation (backpropagation)

guarantees that the gradient of a scalar loss function can be efficiently calculated, independent of the number of gradient components Griewank (1989). The backpropagation procedure effectively requires tracing backward through all computational steps that were used to calculate the loss. An ML-F will automatically execute this procedure to produce the full gradient.

Prototyping new ML codes inside an ML-F is an obvious choice. However, there remain several reasons why certain production ML codes may wish to avoid use of an ML-F. An obvious one is that it can be difficult to port existing codes into the constrained context of an ML-F. Another reason may be memory constraints. By design, the ML-F needs to track every computational operation, recording all associated data, in order to backpropagate. It is often possible to design alternative algorithms to calculate a gradient, for which memory requirements are significantly reduced Wang et al. (2018a)

. Finally, there is a question of performance. Codes may wish to avoid an ML-F if they require types of calculations that are not easily expressable in terms of high-level tensor operations. Although next-generation ML-Fs such as JAX 

Bradbury et al. (2018); Schoenholz and Cubuk (2019) and Zygote Innes (2019) enable backproprogation through nearly arbitrary Python or Julia code, costs arising from automated tracing seem unavoidable.

Our main contribution is a simple algorithm to efficiently calculate the full gradient of a loss function that directly incorporates force data. This algorithm works with or without an ML-F, and so remains fully general. In particular, the method can be applied to any existing neural network code that was designed to train to energy data, including AENet Artrith and Urban (2016), N2P2 Singraber (2018); Singraber et al. (2019), ANI Smith et al. (2017), and PINN Pun et al. (2019).

In a practical implementation, the cost to evalute the loss gradient may be about 3 times the cost to predict all forces, independent of the number of atoms and number of model parameters.

Ii Evaluating the loss gradient

ii.1 Problem statement

Our context is as follows: We seek to optimize (i.e, train) model parameters such that the ML-predicted energy function minimizes a loss function,

The terms


constrain the model predictions and to match reference energy and force data. Angle brackets denote an average over the dataset. The final term is a placeholder for various possible model regularization terms. Coefficients and are fixed prior to training.

Optimization of model parameters

typically involves some variant of stochastic gradient descent, which requires the loss gradient

, or an approximation to it. A modern neural network will typically have or more scalar components in and, therefore, in . It is essential to calculate this full loss gradient efficiently. One can handle with ordinary backpropagation. Calculating , however, presents an interesting challenge.

To simplify notation, let us focus attention on a single data point, e.g. a single DFT calculation. For a system with atoms, we define


The target can be calculated by averaging over all data points.

Our focus, then, is efficient calculation of . This appears difficult, because already incorporates derivatives of the ML potential, via its dependence on


Naïve expansion indicates that involves all second derivatives . Fortunately, it is not necessary to evaluate all these components individually. Below, we demonstrate two methods to efficiently calculate at a cost comparable to the calculation of alone.

ii.2 Approach 1: Iterated backpropagation

As previously mentioned, ML frameworks (ML-F) such as PyTorch or TensorFlow offer an efficient algorithm to calculate the full gradient . This calculation happens as follows. Using primitives provided by the ML-F, the user writes a code to calculate energy and the loss in terms of and . The ML-F will first execute the code to calculate , tracing all dependencies on the atomic configuration and the model parameters . Operating backward on that trace, the ML-F then efficiently calculates all forces . Once these forces are known, the ML-F can calculate the loss . Throughout the entire calculation of (including the backpropagation phase to calculate ), tracing remains active. A second backpropagation step can then be performed, this time to calculate the full gradient . We emphasize that the calculation of involves backpropagating through the backpropagation step used to calculate . In other words, calculation of effectively requires four traversals of the computational graph to calculate . Remarkably, these steps are completely automated by the ML-F. Implementing iterated backpropagation without the help of an ML-F would be a daunting task.

Many popular ML potentials have been written in an ML-F, for which the iterated backpropagation strategy is a natural fit Lubbers et al. (2018); Schütt et al. (2018); Wang et al. (2018b); Yao et al. (2018); Zubatyuk et al. (2019); Unke and Meuwly (2019); Gao et al. (2020); Gilmer et al. (2020).

ii.3 Approach 2: Directional derivative of the energy gradient

Here we demonstrate that it is possible to efficiently calculate the full gradient , even without the aid of an ML-F. This algorithm should be applicable to any existing neural network code. Our only assumption is that subroutines are available to efficiently calculate the energy , as well as its two gradients, and .111For efficiency, these gradients should be evaluated using backpropagation. Implementing a first iteration of backpropagation manually is not too difficult.

The error in the force predictions are readily calculated,


The loss gradient may then be written as


In the second step we used the fact that is ground truth data, independent of model parameters . Applying the definition and commuting derivatives yields


Naïvely, one might consider evaluating by finite differencing on all positions individually. There is a better algorithm, however, which avoids introducing a factor of into the computational cost.

The idea is to interpret as a

-dimensional vector in the space of all atomic coordinates, and

as the gradient vector in this space. In this language, the loss gradient () may be viewed as a directional derivative of the energy gradient along the direction of force errors (). Central differencing gives,


where and denote new configurations in which each atom is slightly perturbed,


In Eq. (7), and are to be held fixed with respect to varations in (namely, we impose . Models are typically designed to be smooth, so Eq. (7) is valid to order The “small” parameter has units of length per force. Its selection will be discussed below.

Combining the above results, our method can be summarized as follows:

Steps for efficient evaluation of loss gradient For a given atomic configuration , calculate all predicted forces , and associated force errors, . Generate slightly perturbed atomic configurations Evaluate the full energy gradient at new configurations and . Use central differences, Eq. (7), to approximate .

In total, this recipe requires calculating forces and two additional energy gradients, , and . Compared to the method of Sec. II.2, less memory is required because here we avoid iterated backpropagation.

Equation (7) may be interpreted as an approximation to Pearlmutter’s algorithm for efficient multiplication by the Hessian Pearlmutter (1994). In Pearlmutter’s version, the limit is taken, avoiding numerical errors due to finite differencing. This can be achieved using the method of forward mode automatic differentiation Griewank (1989). Specifically, the code to calculate should be transformed into one that operates on so-called dual numbers, which are designed to track infinitesimal perturbations. Unlike reverse mode autodiff, the forward mode version requires no tracing.

Existing neural network codes are unlikely to support dual numbers, so we instead advocate the central difference approximation of Eq. (7). The next section will indicate that numerical errors can be quite small.

Iii Minimizing numerical error

iii.1 Scaling the finite differencing parameter

The finite differencing scheme of Eq. (7) requires selection of a sufficiently small parameter . Since actually carries dimensions, it is natural to factorize


where is a characteristic length scale, and is a characteristic scale associated with errors in the force predictions, . The small dimensionless parameter should be selected according to floating point round-off considerations, as will be discussed below.

For simplicity, we fix . The characteristic scale should vary according to the accuracy of the model’s force predictions, as applied to a particular system. A reasonable choice is


selected on a per system basis.

iii.2 Two measures of error

A direct error measure for the finite differencing scheme of Eq. (7) is,


The bars denote an norm, to be taken over all components, and all points in the dataset (e.g. all DFT calculations).

Ideally, one would like to know how floating point round-off errors contribute to . In applications, it may be infeasible to calculate to full precision, and we therefore will not know the true numerical error in . To circumvent this limitation, it will be useful to introduce a second error measure that can be used as a proxy for .

Removing the gradient operator from the right hand side of Eq. (7) yields a new finite difference approximation,


again valid to order . Using , we find that


The suggests a new error measure


which should reflect , insofar as the finite difference approximations Eqs. (7) and (13) have comparable round-off errors. Below we present evidence to this effect.

Because the reference loss is generally available, the error measure can be calculated to high precision with existing codes.

iii.3 Empirical error measurements

Here we demonstrate a numerical procedure for selecting the dimensionless parameter , which fixes via Eq. (9).

Our intention is that the approximate loss gradient will ultimately be used to enable a gradient descent training procedure, for which the ML model will have a highly nonlinear dependence on its model parameters

. In this subsection, however, we consider the simpler context of a linear regression model so that it becomes possible to precisely evaluate the effects of floating point round-off on

. Local energy contributions will be modeled as , where are fitting coefficients and serve as descriptors of each local atomic environment. For concreteness, we select a SNAP potential for tantalum, and use its corresponding dataset Trott et al. (2014); Thompson et al. (2015). In SNAP, the descriptors are bispectrum coefficients. The dataset consists of 362 different configurations, sampled from both crystal and liquid phases Thompson et al. (2015). System sizes in this dataset range from 2 to 100 tantalum atoms. Reference energy and force data were computed with DFT.

In the context of an ML training procedure, we must account for the fact that the parameters will be rapidly evolving. Ideally, should remain a good approximation to for arbitrary model parameters . Therefore, in addition to the trained SNAP potential, we also consider an untrained model, for which we randomize the model parameters according to the Kaiming initialization procedure He et al. (2015).

Figure 1:

Relative error in the finite difference estimates of

for trained and untrained SNAP potentials of tantalum. Circles denote the true error , and crosses denote its proxy . Central differencing errors formally scale as in the small parameter . Accounting for double-precision round-off errors, the choice yields the smallest errors for both model types (under both error measures).

Figure 1 shows empirical measurements of errors associated with the approximation , for various values of the dimensionless finite differencing parameter . Importantly, a single value, , is observed to minimize the error for both trained and untrained models. This optimal balances the central differencing error with the floating point round-off error. For this calculation, we used 64-bit (double-precision) floating point accuracy, for which the 53 bit significand corresponds to approximately digits of precision. Here, the proper selection of yields about 9 digits of accuracy in estimates of the loss gradient, , which is more than sufficient for neural network training.

Figure 1 actually reports our two measures of error, namely and . Recall that the former represents the true error in the approximation , and the latter is intended as a proxy for the true error. Our results indicate that, indeed, and are of comparable scale. When moving to real-world neural network codes, will be easy to directly measure. Our general recommendation is to select to minimize . The results of Fig. 1 indicate that this choice of will yield a quality approximation , even under significant variations to the model parameters .

The scaling relations of Eqs. (9) and (10) are crucial for ensuring that is roughly invariant to model quality. In particular, as the model improves, the typical force errors decrease, and the finite differencing parameter should increase, so that the characteristic atomic displacements of Eq. (8) have a roughly invariant scale. Appendix A demonstrates the importance of accounting for these scaling relationships.

Iv Benefits of force training

Our initial motivation for developing the force training scheme of Sec. II.3 was to support ANAKIN-ME (ANI) models Smith et al. (2017). ANI is a variant of the Behler Parrinello neural network architecture Behler and Parrinello (2007). The Neurochem implementation of ANI is written in highly optimized C++/CUDA code Smith (2020), and does not use an ML framework such as PyTorch or TensorFlow. Applied to a recently developed aluminum potential, NeuroChem can calculate 1,000 atomic forces in about 20 ms, running on a single modern GPU (Nvidia RTX 2080 Ti) Smith et al. (2020a). For comparison, TorchANI is a recent reimplemenation of ANI in PyTorch, designed for flexibility Gao et al. (2020). TorchANI makes prototyping new model variants much easier, but is up to an order of magnitude slower than NeuroChem. Whereas TorchANI can use the iterated backpropagation scheme of Sec. II.2, the optimized NeuroChem implementation cannot. Fortunately, the method presented in Sec. II.3 allows NeuroChem to also train to force data in a very efficient manner.

We demonstrate the value of training to force data by benchmarking on two real-world datasets. The first, ANI-1x, includes about 5M DFT calculations on single organic molecules (elements C, H, N, and O, with a mean molecule size of about 15 atoms), over a broad range of conformations Smith et al. (2020b). The second, ANI-Al, includes about 6,000 DFT calculations on bulk aluminum, in various melt and crystal configurations, each containing about 100 to 200 atoms Smith et al. (2020a)

. Both datasets were generated automatically using an active learning procedure, which aims to maximize the diversity of the atomic configurations 

Smith et al. (2018). For each of the two datasets, we trained two ML potentials. The first potential was trained to energy data only, and the second potential was trained to both energy and

force data. We employ ensemble averaging to reduce model variance; each ML potential actually consists of eight ANI neural networks, differing only in the random initialization of their weights prior to training. Model details and training procedures are described in previous work 

Smith et al. (2018, 2020a).

The force training scheme of II.3 requires selection of a finite differencing parameter via the dimensionless number . The NeuroChem implementation uses a careful mix of 32 bit and 64 bit floating point precision, and the optimal choice of would be difficult to guess a priori. We selected to approximately minimize , and found that this choice yields reasonable estimates of the loss gradient throughout the training procedure.

ANI-1x (chem.) ANI-Al (alum.)
Training on energy data only
Energy RMSE
Force RMSE
Training on energy and force data
Energy RMSE
Force RMSE
Table 1: Root-mean-squared-errors (RSME) for neural network energy and force predictions. Models were trained to the ANI-1x and ANI-Al datasets for organic chemistry and bulk aluminum, respectively.

Table 1 shows the resulting errors in energy and force predictions, as measured on held-out test data. Because the natural energy units vary according to domain, we use kcal/mol for the ANI-1x dataset (organic chemistry) and eV for the ANI-Al data (bulk aluminum).

For ANI-1x, including force data into the training procedure reduces error in the energy predictions by about 7%, and in the force predictions by about 33%. For ANI-Al, we see a much more dramatic improvement: energy and force errors are reduced by about 57% and 88%, respectively. In other words, using force data in the training procedure can reduce force prediction errors by almost a factor of 9.

The biggest difference between the ANI-1x and ANI-Al datasets is that the latter contains DFT calculations for bulk systems (100 to 200 aluminum atoms), whereas the former contains calculations for single molecules only (each with about 15 atoms on average). Consequently, in the ANI-Al dataset, far more information is carried by the force data than the energy data.

V Conclusions

Various works state or imply that training neural network potentials to force data is challenging or expensive. Some studies even opt to ignore forces, and train only to energies. Here, we have discussed two algorithms that make training to force data simple and efficient. With either algorithm, the computational cost of training to energy and force data is only a few times more expensive than the cost of training to energy data alone, independent of system size and model complexity. This is striking given that, for a bulk system, the collection of all forces contains vastly more information than does the energy alone.

In Sec. II.2 we discussed the method of iterative backprogation. Using an ML framework such as PyTorch or TensorFlow, force training can be handled almost automatically. One is free to place arbitrary force-dependent terms into the loss function, and gradients come “for free.” Under the hood, the ML framework traces all intermediate values in the computational graph for calculating the loss function, and will automatically backpropagate through this graph to calculate the gradient of the loss function. We use the term “iterated backpropagation” to refer to the fact backpropagation must happen twice (first to calculate forces and second to calculate the gradient of the loss).

In Sec. II.3 we presented a new method that involves taking an appropriate directional derivative of the energy gradient. A primary motivation for the new method is that it does not require the use of an ML framework; our method could be applied to any existing neural network code that was designed to train to energy data. Compared to iterated backpropagation, the new method requires only half the memory, because it avoids the second backpropagation step. The directional derivative may be estimated with single central difference approximation of Eq. (7). The numerical errors associated with finite differencing were found to be negligible in practice. Alternatively, one could in principle retain full numerical precision in calculating the loss gradient if the neural network code happens to support a generalization to dual numbers Griewank (1989); Pearlmutter (1994).

The benefits of force training have been extensively demonstrated in Ref. Cooper et al., 2020. Interestingly, that study treats the loss function of Eq. (1) in a more approximate way. Namely, the authors reframed the problem in terms of energy training only, by effectively augmenting their dataset with small, random perturbations to existing configurations. Here, in contrast, here we have shown it possible to directly calculate the full gradient at a cost only a few times greater than the cost to calculate itself, independent of the system size and the number of model parameters.

We have focused on ML models for which training involves some flavor of gradient descent optimization. Kernel methods, such as Gaussian process regression, are an alternative approach to ML potential development, for which the model parameters are calculated via solution to a linear system of equations Bartók and Csányi (2015); Rupp (2015). Force training is important for kernel models as well as for neural networks Chmiela et al. (2017); Christensen et al. (2019). One might ask: Could the algorithms presented here also be of use when training kernel models to large quantities of force data?

This work was partially supported supported by the Laboratory Directed Research and Development (LDRD) program at LANL. N. L. and A. T. acknowledge support from the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. K. B. acknowledges support from the center of Materials Theory as a part of the Computational Materials Science (CMS) program, funded by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Materials Sciences and Engineering Division. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.

Appendix A Importance of proper scaling

Figure 2: Errors in the finite difference estimates of , analogous to those of Fig. (1), but here naïvely fixing , rather than using the definition of Eq. (10). With this replacement, the optimal values of now differ significantly between trained and untrained models.

Figure 1 measured errors in the approximation for trained and untrained SNAP potentials. By scaling according to Eqs. (9) and (10), we achieved a good approximator , valid for both trained and untrained models, using a single dimensionless parameter . The invariance of is important because one expects model parameters to vary significantly over the course of an ML training procedure.

Figure 2 illustrates the danger of naïvely fixing constant, rather than using Eq. (10). We observe that, with fixed, the optimal value of can easily vary by multiple orders of magnitude between trained and untrained models.