Statistical learning method for predicting density-matrix based electron dynamics

07/31/2021 ∙ by Prachi Gupta, et al. ∙ University of California, Merced 6

We develop a statistical method to learn a molecular Hamiltonian matrix from a time-series of electron density matrices. We extend our previous method to larger molecular systems by incorporating physical properties to reduce dimensionality, while also exploiting regularization techniques like ridge regression for addressing multicollinearity. With the learned Hamiltonian we can solve the Time-Dependent Hartree-Fock (TDHF) equation to propagate the electron density in time, and predict its dynamics for field-free and field-on scenarios. We observe close quantitative agreement between the predicted dynamics and ground truth for both field-off trajectories similar to the training data, and field-on trajectories outside of the training data.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Predicting the dynamic electronic properties of a molecular system is essential to understanding phenomena such as charge transfer and response to an applied laser field. The time-dependent Schrödinger equation (TDSE) governs the time evolution of a quantum electronic system. Using the time-dependent density operator within a finite-dimensional basis yields the Liouville-von Neumann equation:


Here and denote the time-dependent electron density and Hamiltonian matrices in orthonormal bases, respectively, and the square brackets denote a commutator: for any square matrices and , the commutator is .

The many-body problem given by (1) can only be solved for simple systems, such as those with very few electrons within a small basis. Hartree-Fock (HF) theory is a simplified mean field approach in which the many-body wave function is approximated using an anti-symmetrized product of single particle orbitals. Applying this approximation to the Hamiltonian for (1) produces two-electron terms that are given by Coulomb and exchange operators, and thus a Hamiltonian that is now density dependent . Using this HF Hamiltonian, sometimes called the Fock matrix within HF theory, with (1) yields the time-dependent HF (TDHF) equation, which, along with time-dependent density functional theory (TDDFT), is often used for simulating electron dynamics,


Here, using TDHF training data, we address the problem of learning the field-free Hamiltonian matrix from time series observations of electron densities . For the field-free trajectory, i.e., when the Hamiltonian contains no explicit time-dependence, is a complex Hermitian matrix function of , which is also complex and Hermitian. Therefore, and

are completely determined by their upper triangular elements. Both matrices can be represented by vectors that contain the real and imaginary components of their upper triangular parts. Using these vector representations for

and , we develop a statistical model for . This model is linear in its parameters ; in the vector representation, the model Hamiltonian is also a linear function of electron density matrix elements.

To fit the model, we minimize a loss function that measures the squared Frobenius norm between the left- and right-hand sides of (

2), evaluated on training data. This data consists of time series of electron density matrices and their time-derivatives computed via centered differencing. The loss function depends on its parameters through the Hamiltonian. Since we use a linear model for the Hamiltonian, our loss function is quadratic in the model parameters . Therefore, to minimize the loss and fit the model, we must solve a least squares problem. Equipped with the Hessian and gradient of the loss function, the solution to this problem reduces to that of . For small systems, we can carry this out effectively, using automatic differentiation to compute and .

However, this approach does not scale well to larger molecular systems and results in prohibitively large training times, the majority of which is required for computation of the Hessian matrix. To address this, we develop a data science framework that scales to larger molecules and larger basis sets than in our previous work


. Here, we use dimensionality-reduction techniques based on degrees of freedom in the density matrix and properties of the HF Hamiltonian. Another challenge for large systems with symmetry is that the Hamiltonian model does not extrapolate well to the field-on case because the Hessian matrix has

eigenvalues, leading to multicollinearity. To resolve this challenge we use ridge regression. Ridge regression places a constraint on the model parameters by adding a penalty to the loss function.

To train the Hamiltonian model we use time series of density matrices generated with no external perturbations. Using the learned Hamiltonian, we propagate forward in time to obtain a field-free trajectory. To compute a field-on trajectory, we add a time-dependent external perturbation to the learned Hamiltonian and propagate forward in time. We find that the learned field-free Hamiltonian can be used to propagate electron dynamics in both field-free and field-on conditions, yielding results that closely match those obtained via ground truth Hamiltonians.

Our overarching goal is to learn a potential/Hamiltonian for TDDFT to simulate more accurate electron dynamics. Key to this theory is the introduction of a density dependent exchange correlation potential that accounts for quantum electron-electron many-body Coulombic interactions not captured from the classical (mean field) Coulomb contribution. However, the exact form of the exchange correlation portion of the Hamiltonian is unknown. Therefore, our goal is to first develop a method to learn a known, more approximate density-dependent Hamiltonian, like that used in time-dependent Hartree-Fock (TDHF) theory [9, 10, 8, 6]

. This work provides the methodological development for a framework that seeks to model the Hamiltonian and use it to predict the dynamics of the system. This work sets us on a pathway towards developing a novel statistical/machine learning method for more complex theories for predicting electron dynamics.

2 Methods

2.1 Generating Data

In this paper, we predict electron dynamics for six molecular systems: in the 6-31G basis set (two electrons in 4 basis functions), in the 6-31G and 6-311++G basis sets (two electrons in 4 and 14 basis functions, respectively), LiH in the 6-31G and 6-311++G basis sets (four electrons in 11 and 29 basis functions, respectively), and in the STO-3G basis set (16 electrons in 14 basis functions). Note that each molecular orbital created from a linear combination of these atomic orbital basis functions is doubly occupied. We build off of our previous work that developed models for the simpler systems, , and LiH in the STO-3G basis set (two electrons and two basis functions) [2].

For each molecular system, we apply standard electronic structure methods to compute the ground truth field-free Hamiltonian/Fock matrix and variationally determine the ground state electron density matrix . Our initial condition at is either field-free with determined from solving for the electron density in the presence of an applied electric field (a delta-kick perturbation at ), or is the ground state electron density and we apply the field during propagation (see below). We then numerically solve (2) to generate an electron dynamics trajectory , recording the data at temporal resolution a.u., propagating with the modified midpoint unitary transformation method [7, 11]. These steps were performed using a modified version of the Gaussian electronic structure code [4]. We generate two data sets for each molecular system:

  1. Field-free trajectory: The initial density matrix in the presence of an electric field is calculated. Using this initial condition as the delta kick perturbation and then without applying any external perturbation during propagation, a trajectory is produced. A part of this trajectory, i.e., density matrices where , is used for training and another part of this trajectory for is used as a validation set.

  2. Field-on trajectory: The initial ground state density matrix without any perturbation is calculated. An external forcing term is applied during propagation, where is the applied electric field in the z direction (along the main molecular bond axis), is the electric field frequency, and is the

    component of the molecular dipole moment. For this study, the electric field is turned on for one cycle (

    = ) at , with a.u (an off-resonant frequency corresponding to the neodymium-YAG laser) and a.u. We test our learned Hamiltonian against this field-on trajectory; field-on trajectories are never used during the training process.

2.2 Statistical Learning

Our aim is to learn the molecular Hamiltonian , which is a Hermitian matrix-valued function of the Hermitian density matrix as in (2). Since and are Hermitian, they are completely determined by their upper-triangular components. We split and into real and imaginary matrices and then flatten and combine the upper-triangular parts of each matrix into corresponding real vectors. Let , denote real column vectors that contain the real and imaginary parts of the upper-triangular portions of the complex matrices , . Let tildes denote statistical models—to be clear, is the model Hamiltonian, different from the true Hamiltonian . As in [2], we use a linear model and squared loss


where , , , and . The loss function quantifies the mismatch between the left- and right-hand sides of (2), with the time-derivative approximated by a centered-difference quotient. To train, we compute that minimizes on the training data:


This is a least squares problem. Let denote the Hessian of the loss with respect to . Let , the gradient of the loss with respect to , evaluated at . To solve (5), we can take the gradient of the loss function and set it to 0. This results in the normal equations, which we can write in terms of the Hessian and gradient of the loss (see Appendix A for details):


We briefly explain the meaning of the loss by asking the hypothetical question: if refers to ground truth electron density matrices in our training data, what does it mean for the loss function to vanish? Consider the following equation, which defines a one-step prediction of :


For to vanish, for each , we must be able to insert the true values of and into the right-hand side of (7) and obtain a predicted that perfectly matches the true . In short, the loss measures the deviation from perfect one-step or local prediction via (7), across the entire training time series. We use the loss as a proxy for the true metric of interest, which is long-term propagation error (12). Direct or adjoint-based minimization of (12) can in principle be used to solve for ; however, this will be much more computationally expensive than our approach.

As described above, the training data consists of field-free trajectories. For each molecular system, we train using time series of density matrices where obtained from the field-free trajectory to ensure that this learned Hamiltonian does not depend on an external field. We do not use the first two time steps of the trajectory since these time steps have large values of , a consequence of the delta-kick initial condition.

The solution to (6

) results in the statistical estimates

and the molecular Hamiltonian can then be determined using (3). We tested the model for small molecules in small basis sets (up to 6 6 in dimension for the complex Hamiltonian). When we sought to extend this approach to more complex molecular systems, we encountered two main problems: (i) training times were unacceptably large due to automatic differentiation, and (ii) propagation results were inaccurate. To solve (i), we coded the gradient and Hessian of the loss (4) ourselves, leveraging parallelization—see Appendix B for details. To solve (ii), we applied dimensionality reduction and ridge regression, which we now detail in turn.

2.2.1 Dimensionality Reduction

We consider diatomic molecules in the 6-31G and 6-311++G** bases and the larger molecule in the small STO-3G basis. Let denote the dimension of the density and Hamiltonian matrices for each molecule in a given basis set. For larger basis sets or larger molecules, increases dramatically; see Table 1. Our initial implementation leads to a naïve version of (3) in which is of size and hence is an matrix of regression coefficients. We employ two tactics to reduce the dimensionality of . First, we split (3) into two separate models, such that the parts of that correspond to real (respectively, imaginary) components of depend only on the real (respectively, imaginary) components of . This splitting, which can be justified based on physical properties of the Hartree-Fock Hamiltonian, was not present in our prior work [2]. At time , the true field-free Hamiltonian in the AO basis is,


Here is the kinetic energy matrix, is the electron-nuclear energy matrix, and is the density dependent combination of Coulomb and exchange matrices. Let , then for ,



is a four-index tensor in the Coulomb and exchange calculations. Because this tensor is real, the real elements of the Hamiltonian depend on the real elements of the density matrix and the imaginary elements of the Hamiltonian depend on the imaginary elements of the density matrix. The second tactic used to reduce dimensionality is that when forming the flattened vector representation

, we retain only those entries of where the corresponding entries of are not identically zero [2]. For these linear or flat molecular systems, elements are identically zero due to the molecular symmetry, e.g. if they are constructed from orthogonal basis functions. In this way, for the largest problem under consideration, we reduce from to , reducing the number of coefficients by a factor .

2.2.2 Ridge Regression

When we scale our method to molecular systems with large , we also notice multicollinearity, e.g., numerous zero eigenvalues in the Hessian of the loss . With multicollinear data, the least squares estimator predicts poorly. We eliminate this problem by using ridge regression, for which we can write the penalized loss function as ; note the use of the -norm, as opposed to the -norm in the penalty term for Lasso, i.e., [5]. In this work, we train our model by computing the ridge regression solution:


where is the Hessian of with respect to and is the gradient of with respect to computed at . For a grid of values, we compute on the training set, and then compute the loss on a validation set that is disjoint from but equal in size to the training set. Figure 1 shows the validation loss for different values for in the STO-3G basis and in the 6-311G** basis set. We choose that minimizes the validation set loss.

One might conclude incorrectly from Figure 1 that, as the range of values on the vertical axes is small (multiplied by in the left panel and on the right), choosing a non-optimal may not affect final results. In practice, we find that field-on propagation results improve considerably if we choose the optimal . We hypothesize that this occurs for two reasons. First, the loss function essentially measures local or one-step propagation error, as described in Section 2.2. Second, we note that (2

) is a nonlinear system of ordinary differential equations; the right-hand side is quadratic in the elements of

. Nonlinearity can magnify errors in the estimated Hamiltonian. Over thousands of time steps, these errors can accumulate and cause predicted trajectories to diverge substantially from reality. We also explored Lasso, but chose ridge regression due to its superior performance.

Figure 1: Loss computed for the validation set for two systems. Validation loss is plotted against the value for in the STO-3G basis (left) and for in 6-311++G** basis (right). Note the x axis is plotted on a log scale. The value that minimizes the validation loss is chosen. For , and for , .
Figure 2: Training set size vs. mean propagation error for in 6-311++G (left) and LiH in 6-311++G (right). As we increase the training set size the test error decreases but the computational time for training increases.

3 Results

Applying the training procedure described in Section 2 to the molecular systems listed in Table 1, we learn and determine . Here, for smaller molecular systems, we train using time series with points. For larger systems, we increase the training set size; we determine the number of points by computing a learning curve, plotting test set propagation error against the number of training points. Let us illustrate the effect of training set size for two of the larger systems studied here: in 6-311++G** and the largest molecular system, LiH in 6-311++G. Figure 2 shows that, as we increase the training set size, field-on propagation error decreases (blue) while computational time for training increases (gray). The training set size used for each system is given in 1.

For propagation, we use RK45 ([3]) to solve (2) numerically with the learned Hamiltonian for 2000 steps. We do this both for the case of a delta kick perturbation (the same as the training data, a field-off perturbation) and for the case of a sinusoidal electric field perturbation (a field-on perturbation). The field-on perturbation tests the learned Hamiltonian in a regime that is outside that of the training set.

Molecule Basis set Training Set size Training Loss field-free error field-on error
LiH 6-31G
LiH 6-311++G**
Table 1: Molecule, number of elements in the density matrix, training loss, field-free and field-on propagation error.

The training loss, field-free, and field-on propagation error for six molecular systems are presented in Table 1. Training loss reported here is calculated as using (4). This training loss measures the squared Frobenius norm of one step errors, i.e, the error in propagating to the next time step using the learned Hamiltonian via (7). The small values of the field-free error, for all molecules, indicate that the Hamiltonian learned by minimizing (4) can be used for long-term propagation. Even with an applied field, which is outside the training regime, we obtain propagation errors comparable to if not less than those in the field-free case, implying that the learned Hamiltonian generalizes well beyond the training regime.

Figure 3: Propagation error compares ground truth density matrices against those computed by numerically solving (2) using the learned Hamiltonian . The solid lines are for field-off propagation and the dashed lines are with the field on.

Let denote the prediction, i.e, density matrix obtained by propagating the learned Hamiltonian. We define the time-dependent propagation error as


where measures the error (at time ) between , the predicted trajectory obtained by propagating the learned Hamiltonian, and , the ground truth trajectory. We calculate the mean propagation error for the propagation interval as


where is the number of time steps for which we propagate the Hamiltonian. For this study, . In Fig. 3 we plot the time-dependent propagation errors for all molecular systems in both the field-free and field-on cases. We see that that errors for both cases remain reasonably small for all molecular systems even after propagating for a.u., which is equivalent to time steps.

In Fig. 4, we plot, as a function of time, selected nonzero elements of the density matrix obtained by propagating the learned Hamiltonian (red), and the ground truth (blue) obtained from a widely-used electronic structure code (see details in Section 2). We observe good agreement between predicted and ground truth trajectories.

Figure 4: Real parts of selected elements of ground truth density matrices (blue) and density matrices computed using the learned Hamiltonian (red) for in the 6-311++G** basis for the field-free (left) and field-on (right) cases. Note the close agreement between all curves.

4 Discusssion

In this work, we extended our prior methodology by incorporating dimensionality reduction (in the form of real-imaginary splitting) and ridge regression. Using these techniques, we addressed challenges in scaling our method to molecular systems with larger basis set size . Using the learned Hamiltonian, we can predict electron densities for not only the training set (field free) but also for the test set (field on). The loss function (4) measures the sum of squares of one-step propagation errors, a form of local error. By minimizing this loss over the training set, we obtain very good long-time propagation error. For some molecular systems, the agreement is to a degree that we cannot tell the two curves (propagation using learned Hamiltonian and the ground truth trajectory) apart.

We used two dimensionality reduction techniques to significantly reduce the number of model parameters: (i) splitting the Hamiltonian model based on properties of the HF Hamiltonian and (ii) modeling only non-zero elements of the Hamiltonian matrix. The effective number of degrees of freedom in the Hamiltonian is less than the number of non-zero elements due to the linear combinations of Hamiltonian elements that expressed through (2). This reduction can be easily observed for smaller molecular systems like in the STO-3G basis set. For larger molecular systems, these linear dependencies are much more prevalent and more difficult to verify directly. In such cases, regularization improves the prediction capability of a model by decreasing the number of degrees of freedom. Here, using ridge regression we successfully reduced the field-on propagation error.

We also coded the Hessian and gradient for the loss function instead of using automatic differentiation techniques, thus making it feasible to obtain the least squares solution for larger molecular systems. Although for most molecular systems we used one field-free trajectory with time steps for training, for larger systems such as LiH (), we increased the training set size and observed that the field-on propagation error decreases for a larger training set. However, as we increase the training set size, the computational time for training increases, eventually becoming prohibitively expensive for a training set with more than time steps. In the future, we hope to extend this model to even larger molecular systems, and also learn a density-dependent Hamiltonian based on more accurate wave function generated densities.


This work was supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences under Award Number DE-SC0020203. We acknowledge computational time on the MERCED cluster (funded by NSF ACI-1429783), and on the Nautilus cluster, which is supported by the Pacific Research Platform (NSF ACI-1541349), CHASE-CI (NSF CNS-1730158), and Towards a National Research Platform (NSF OAC-1826967). Additional funding for Nautilus has been supplied by the University of California Office of the President.

Data Availability Statement

All code required to reproduce all training and test results is available on GitHub at https: // [1]. Training data is available from the authors upon request.

Financial disclosure

None reported.

Conflict of interest

The authors declare no potential conflict of interests.

Appendix A Reduction to Least Squares

We begin by writing the loss (4) as . To minimize , we need to determine


We start by expanding the loss function:

To minimize the right-hand side, we take the gradient with respect to ,

Setting this gradient to , we obtain the normal equations:


Let denote the Hessian of . Since and , we can write (14) as


To estimate the ridge regression solution, we need to compute


Augmenting the loss with the ridge penalty yields

The gradient is then

Setting the gradient to , we get


The Hessian of is . With this, we can write (17) as


Appendix B Computation of the Gradient and Hessian

Here we describe the details behind our computation of the gradient and Hessian of the loss function (4). Let us introduce the notation to denote the -th row and -th column of the matrix . Similarly, let denote the -th row and -th column of the centered-difference time derivative . We let denote . Then, with denoting complex conjugate in this section, we can rewrite the loss (4) as


Whereas we previously wrote , here we give more details. The term refers to an intercept matrix. However, refers to two collections of matrices, and . All matrices here are of the same dimension as . To better understand the roles of these matrices, let us note that depends only on certain non-zero, upper-triangular entries of . We let denote the indices of the real part of upon which we allow to depend. Similarly, we let denote the indices of the imaginary part of upon which we allow to depend. Hence and are, respectively, the total numbers of real and imaginary parts of that are active in the model for .

With this notation, we can write our linear model for as follows—note that we begin with the upper-triangular part: for ,


For , because is Hermitian (or self-adjoint), we have


Here we have used the fact that and are both real—this is necessary for the real (respectively, imaginary) part of to depend only on the real (respectively, imaginary) part of . Note that in these expressions, we only use the upper-triangular parts of and .

We focus first on the gradient and Hessian of with respect to . From (20-21), we see that depends on only through . For any integers and , we define the Kronecker delta . Then

Observe that is of the form . Putting these pieces together, we obtain, with signifying real part,


In what follows, we use to denote the indicator function of the set , e.g., . We then compute


This implies that


This is the gradient of the loss with respect to each of the matrices. In our code, we parallelize this computation across the and indices. More specifically, we implement this calculation via a function that, for a given and , computes for all at once. We then evaluate this function in parallel across all indices ; as mentioned above, the lower-triangular parts of the , , and matrices play no role in our model for .

Examining the form of the model (20-21), we note that upon exchanging


the roles of and become reversed. Using this fact, we can extract from the above calculation an expression for the gradient : we simply apply the transformation (24) to (23). We have verified by hand that this yields precisely the same result as differentiating directly with respect to .

Further examining (20-21), we see that if we set and , then plays the same role as . Setting in (23) gives us the gradient ; again, we have verified this by hand. With this, we have described the full computation of the gradient of with respect to all model parameters.

To begin our calculation of the Hessian, we take a second derivative on both sides of (22) to obtain

The product here yields different terms inside the sum. Through algebra analogous to that used to derive (23), we can compute each of these terms and sum them over all and . The resulting terms give us a closed-form expression for . When we implement the Hessian in code, we first develop a function that takes as input fixed values of , , , and , returning as output the partial derivative for all values of and at once. We then evaluate this function in parallel over all possible values of and , again taking into account the fact that only the upper-triangular part of matters. In this way, we compute the central block in the overall Hessian:


The calculation of the block can be recycled and converted into calculations of all other blocks. For instance, applying the transformation (24) to in the final expression of the block gives us, by symmetry, both the and blocks. If we then go back and apply the transformation (24)to both and in the final expression of the block, we obtain the block. Similarly, setting either or both of to yields the blocks in the first row and first column of (25).

Through these strategies, we compute all entries of the gradient and Hessian of without recourse to automatic differentiation, which we relied upon in our earlier work [2]. While automatic differentiation yields perfectly accurate gradients and Hessians for small molecular systems, as the system size grows larger, we find that the computational cost of automatic differentiation increases considerably until it becomes unusable. Simultaneously, we find that the parallel computation of analytically derived gradients and Hessians, via the techniques described here, scales well to all molecular systems described in this paper.


  • [1] H. S. Bhat, P. Gupta, and K. Ranka (2021) Electron Dynamics. Note: Cited by: Data Availability Statement.
  • [2] H. S. Bhat, K. Ranka, and C. M. Isborn (2020) Machine learning a molecular Hamiltonian for predicting electron dynamics. International Journal of Dynamics and Control 8 (4), pp. 1089–1101. Cited by: Appendix B, §1, §2.1, §2.2.1, §2.2.
  • [3] J.R. Dormand and P.J. Prince (1980) A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics 6 (1), pp. 19–26. External Links: ISSN 0377-0427, Document Cited by: §3.
  • [4] M. J. Frisch, G. W. Trucks, and H. B. S. et. al (2018) Gaussian Development Version Revision I.14+. Note: Gaussian Inc. Wallingford CT Cited by: §2.1.
  • [5] T. Hastie, R. Tibshirani, and J. Friedman (2001) The Elements of Statistical Learning. Springer, New York, NY, USA. Cited by: §2.2.2.
  • [6] X. Li, N. Govind, C. Isborn, A. E. DePrince, and K. Lopata (2020) Real-time time-dependent electronic structure theory. Chemical Reviews 120 (18), pp. 9951–9993. Note: PMID: 32813506 External Links: Document Cited by: §1.
  • [7] X. Li, S. M. Smith, A. N. Markevitch, D. A. Romanov, R. J. Levis, and H. B. Schlegel (2005) A time-dependent Hartree-Fock approach for studying the electronic optical response of molecules in intense fields. Physical Chemistry Chemical Physics 7 (2), pp. 233–239. External Links: Document, ISSN 14639076 Cited by: §2.1.
  • [8] N. T. Maitra (2016) Perspective: Fundamental aspects of time-dependent density functional theory. JCP 144 (22), pp. 220901. External Links: Document Cited by: §1.
  • [9] M. A. L. Marques, N. T. Maitra, F. M. S. Nogueira, E. K. U. Gross, and A. Rubio (2012) Fundamentals of Time-Dependent Density Functional Theory. Springer-Verlag. Cited by: §1.
  • [10] M. R. Provorse and C. M. Isborn (2016) Electron dynamics with real-time time-dependent density functional theory. IJQC 116 (10), pp. 739–749. External Links: Document Cited by: §1.
  • [11] H. B. Schlegel, S. M. Smith, and X. Li (2007) Electronic optical response of molecules in intense fields: Comparison of TD-HF, TD-CIS, and TD-CIS(D) approaches. Journal of Chemical Physics 126 (24), pp. 1–13. External Links: Document, ISSN 00219606 Cited by: §2.1.