1 Introduction
Many fundamental problems in scientific computing can be reduced to the computation of eigenvalues and eigenfunctions of an operator. One primary example is electronic structure calculation, namely, computing the leading eigenvalue and eigenfunction of the Schrödinger operator. If the dimension of the state variable is low, one can use classical approaches, such as the finite difference method or the spectral method, to discretize the operator and solve the eigenvalue problem. However, these conventional, deterministic approaches suffer from the so-called curse of dimensionality when the underlying dimension becomes high, since the number of degrees of freedom grows exponentially with the dimension.
For high-dimensional problems, commonly arising from quantum mechanics, statistical mechanics, and finance applications, stochastic methods become more attractive and in many situations the only viable option. In the context of quantum mechanics, two widely used approaches for high-dimensional eigenvalue problems are the variational Monte Carlo (VMC) and diffusion Monte Carlo (DMC) methods [1, 2, 3, 4, 5, 6]. These two approaches deal with the high dimensionality via different strategies. VMC relies on leveraging chemical knowledge to propose an ansatz for the eigenfunction (the wavefunction in the context of quantum mechanics) with parameters to be optimized under the variational formulation of the eigenvalue problem. A Monte Carlo approach is used to approximate the gradient of the energy with respect to the parameters at each optimization step. DMC, on the other hand, represents the density of the eigenfunction with a collection of particles that follow the imaginary-time evolution given by the Schrödinger operator, via a Feynman–Kac representation of the semigroup. It can be understood as a generalization of the classical power method from finite-dimensional matrices to infinite-dimensional operators. In electronic structure calculations, DMC usually gives more accurate eigenvalues than VMC, whose accuracy relies on the quality of the proposed ansatz; however, unlike VMC, the particle representation of DMC often falls short in providing other information about the eigenfunction, such as its derivatives.
As discussed above, one key to solving high-dimensional eigenvalue problems is the choice of function approximation for the targeted eigenfunction, ranging from grid-based bases and spectral bases, to nonlinear parametrizations used in VMC, and to particle representations in DMC. Given the recent compelling success of neural networks in representing high-dimensional functions with remarkable accuracy and efficiency across computational disciplines, it is quite tempting to introduce neural networks to solve high-dimensional eigenvalue problems. This idea has recently been investigated under the variational formulation in [7, 8, 9, 10, 11, 12]. In particular, [9, 10, 11, 12] have shown the exciting potential of solving the many-electron Schrödinger equation with neural networks within the framework of VMC. On the other hand, how to apply neural networks in the formalism of DMC has not been explored in the literature, which leaves a natural open direction to investigate.
In this paper, we propose a new algorithm to solve high-dimensional eigenvalue problems for second-order differential operators, in a spirit similar to DMC but based on a neural network parametrization of the eigenfunction. The eigenvalue problem is reformulated as a parabolic equation, whose solution can be represented by the (nonlinear) Feynman–Kac formula in terms of forward-backward stochastic differential equations. We then leverage the recently proposed deep BSDE method [13, 14] to seek optimal eigenpairs. Specifically, two deep neural networks are constructed to represent the eigenfunction and its scaled gradient. The eigenfunction network is then propagated according to the semigroup generated by the operator, and the loss function is defined as the difference between the neural networks before and after the propagation. Compared to conventional DMC, the proposed algorithm provides a direct approximation to the target eigenfunction, which overcomes the shortcoming in providing gradient information. Moreover, since the BSDE formulation is valid for nonlinear operators, our approach can be extended to high-dimensional nonlinear eigenvalue problems, as also validated in our numerical examples.
The rest of this paper is organized as follows. In Section 2, we introduce the algorithm to solve the eigenvalue problem. In Section 3, numerical examples are presented. We conclude in Section 4 with an outlook for future work.
2 Numerical methods
2.1 Method for linear operator
We consider the eigenvalue problem
$$\mathcal{L}\Psi = \lambda \Psi \tag{1}$$
on $\Omega = [0, 2\pi]^d$ with periodic boundary condition, where $\mathcal{L}$ is a linear operator of the form
$$\mathcal{L}\Psi = -\tfrac{1}{2}\operatorname{Tr}\!\big(\sigma\sigma^{\top}\operatorname{Hess}_x\Psi\big) + b\cdot\nabla\Psi + f\,\Psi. \tag{2}$$
Here $\sigma: \Omega \to \mathbb{R}^{d\times d}$ is a matrix function such that $\sigma\sigma^{\top}$ is uniformly elliptic, $\nabla\Psi$ denotes the gradient of $\Psi$, $b$ is a $d$-dimensional vector field, $f$ is a scalar function, and $\operatorname{Hess}_x\Psi$ denotes the Hessian matrix of $\Psi$. To solve this eigenvalue problem, we augment a time variable and consider the following backward parabolic partial differential equation (PDE) in the time interval $[0, T]$:
$$\partial_t u(x, t) = \mathcal{L}\, u(x, t) - \lambda\, u(x, t), \qquad (x, t) \in \Omega \times [0, T], \tag{3}$$
with periodic boundary condition in $x$.
This is essentially a continuous-time analog of the power method for the matrix eigenvalue problem. Let us denote by $P_t$ the backward propagator of (3), which maps $u(\cdot, T)$ to $u(\cdot, T-t)$ (note that the backward propagator forms a semigroup, i.e., $P_{t_1+t_2} = P_{t_1}P_{t_2}$). According to the spectral theory of elliptic operators, if $u$ is a stationary solution of (3), i.e., $u(\cdot, t) \equiv \Psi$ for all $t$, then $(\lambda, \Psi)$ must be an eigenpair of $\mathcal{L}$. Therefore, we can minimize the "loss function" $\|P_T\Psi - \Psi\|^2$ with respect to the pair $(\lambda, \Psi)$ to solve the eigenvalue problem. While this is a nonconvex optimization problem, we expect local convergence to a valid eigenpair with appropriate initialization.
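To make the power-method analogy concrete in finite dimensions: repeatedly applying a matrix and renormalizing drives a vector to a fixed point of the propagation, which is exactly an eigenvector. A small numpy illustration on a symmetric 3-by-3 matrix chosen for the example (its largest eigenvalue is $3+\sqrt{3}$):

```python
import numpy as np

# symmetric test matrix with a clear spectral gap; eigenvalues are 3 -/+ sqrt(3) and 3
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])

v = np.ones(3)
for _ in range(200):
    w = A @ v
    v = w / np.linalg.norm(w)    # renormalize: seek a fixed point of the propagation
lam = v @ A @ v                  # Rayleigh quotient of the converged vector

print(lam)  # -> ~3 + sqrt(3), the largest eigenvalue
```

Our algorithm replaces the matrix-vector product with the stochastic propagation of the semigroup and the renormalization with the scheme of Section 2.2.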
The reformulation above turns the eigenvalue problem into solving a parabolic PDE in high dimensions. For the latter, we can leverage the recently developed deep BSDE method [13, 14, 15] (which is why the parabolic PDE (3) is written backward in time). Let $X_t$ solve the stochastic differential equation (SDE)
$$\mathrm{d}X_t = \sigma(X_t)\,\mathrm{d}W_t, \tag{4}$$
or in the integral form
$$X_t = X_0 + \int_0^t \sigma(X_s)\,\mathrm{d}W_s, \tag{5}$$
where $W_t$ is a $d$-dimensional Brownian motion, and $X_0$ is sampled from some initial distribution $\nu$. Then according to Itô's formula, the solution $u$ to (3) satisfies
$$u(X_t, t) = u(X_0, 0) + \int_0^t \Big[\big(f(X_s) - \lambda\big)\,u(X_s, s) + b(X_s)\cdot\nabla u(X_s, s)\Big]\,\mathrm{d}s + \int_0^t \big(\sigma^{\top}(X_s)\,\nabla u(X_s, s)\big)^{\top}\mathrm{d}W_s. \tag{6}$$
Note that simulating the two SDEs (5) and (6) is relatively simple even in high dimensions, while directly solving the PDE (3) is intractable. We remark that it is possible to add a drift term to the SDE (4) and modify (6) accordingly (see the discussion below).
Of course, a priori both $u$ and $\nabla u$ in (6) are unknown, while we know that if we set $u(\cdot, 0) = \Psi$, the eigenfunction we look for, and $\lambda$ the associated eigenvalue, the solution remains $u(\cdot, t) = \Psi$ for all $t$. The idea of our method is then to use two neural networks, $\hat\Psi_\theta$ and $\hat G_\eta$, as ansatz for the eigenfunction $\Psi$ and its scaled gradient $\sigma^{\top}\nabla\Psi$, respectively. Assigning $u(X_0, 0) = \hat\Psi_\theta(X_0)$ and $\sigma^{\top}(X_s)\nabla u(X_s, s) = \hat G_\eta(X_s)$ in (6), the discrepancy of the propagated solution, i.e.,
$$\mathbb{E}\Big[\alpha\,\big|u(X_T, T) - \hat\Psi_\theta(X_T)\big|^2 + \beta\,\big|\hat G_\eta(X_T) - \sigma^{\top}(X_T)\nabla\hat\Psi_\theta(X_T)\big|^2\Big], \tag{7}$$
then indicates the accuracy of the approximation. Here, we use $|\cdot|$ to denote the absolute value of a number or the Euclidean norm of a vector according to the context. Note that the second term above penalizes the discrepancy between the approximation $\hat G_\eta$ of $\sigma^{\top}\nabla\Psi$ and the gradient of $\hat\Psi_\theta$, where $\alpha$ and $\beta$ are two weight hyperparameters. Therefore, using the above discrepancy as a loss function to optimize the triple $(\lambda, \theta, \eta)$ gives us a scheme to solve the eigenvalue problem. The above procedure can be directly extended to the semilinear case where $f$ depends on $x$ and $u$, as we will discuss in Section 2.3.

To employ the above framework in practice, we numerically discretize the SDEs (5) and (6) using the Euler–Maruyama method with a given partition of the interval $[0, T]$: $0 = t_0 < t_1 < \cdots < t_N = T$:
$$X_{t_{k+1}} = X_{t_k} + \sigma(X_{t_k})\,\Delta W_k, \tag{8}$$
and
$$U_{k+1} = U_k + \Big[\big(f(X_{t_k}) - \lambda\big)\,U_k + \big(\sigma^{-1}b\big)(X_{t_k})\cdot\hat G_\eta(X_{t_k})\Big]\,\Delta t_k + \hat G_\eta(X_{t_k})\cdot\Delta W_k, \tag{9}$$
for $k = 0, 1, \ldots, N-1$, with $U_0 = \hat\Psi_\theta(X_{t_0})$. Here $\Delta t_k = t_{k+1} - t_k$, $\Delta W_k = W_{t_{k+1}} - W_{t_k}$, we have used $b\cdot\nabla u = (\sigma^{-1}b)\cdot(\sigma^{\top}\nabla u)$, and we use $u$ and $U$ to represent the continuous and discretized stochastic processes, respectively. The noise terms $\Delta W_k$ have the same realization in (8) and (9), as in the forward-backward SDEs (5) and (6).
The loss function (7) then corresponds to the discrete counterpart:
$$\mathrm{Loss} = \mathbb{E}\Big[\alpha\,\big|U_N - \hat\Psi_\theta(X_{t_N})\big|^2 + \beta\,\big|\hat G_\eta(X_{t_N}) - \sigma^{\top}(X_{t_N})\nabla\hat\Psi_\theta(X_{t_N})\big|^2\Big], \tag{10}$$
where $\nabla\hat\Psi_\theta$ is the gradient of the neural network $\hat\Psi_\theta$ with respect to its input. In practice, the expectation in (10) is further approximated by Monte Carlo sampling, similar to the empirical loss often used in the supervised learning context. For a given batch size $B$, we sample $B$ points of the initial state $X_0$ from the distribution $\nu$ at each training step and estimate the gradient of the loss with respect to the trainable parameters using the empirical Monte Carlo average of (10):
$$\widehat{\mathrm{Loss}} = \frac{1}{B}\sum_{j=1}^{B}\Big[\alpha\,\big|U_N^{(j)} - \hat\Psi_\theta\big(X_{t_N}^{(j)}\big)\big|^2 + \beta\,\big|\hat G_\eta\big(X_{t_N}^{(j)}\big) - \sigma^{\top}\big(X_{t_N}^{(j)}\big)\nabla\hat\Psi_\theta\big(X_{t_N}^{(j)}\big)\big|^2\Big]. \tag{11}$$
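To illustrate the discretized propagation (8)-(9) and the terminal matching term of the empirical loss, the following numpy sketch replaces the two networks by a hand-picked exact eigenpair of $-\Delta$ (here $\sigma = \sqrt{2}\,I$, $b = 0$, $f = 0$, $\Psi(x) = \cos(x_1)$, $\lambda = 1$, all toy assumptions for this check), so the loss is small up to time-discretization and Monte Carlo error:

```python
import numpy as np

rng = np.random.default_rng(1)
d, B, N, T = 2, 512, 50, 0.1           # dimension, batch size, time steps, horizon
dt = T / N
sqrt2 = np.sqrt(2.0)                    # sigma = sqrt(2) I, so L = -Laplacian (b = f = 0)

psi_hat = lambda x: np.cos(x[:, 0])     # stand-in for the eigenfunction network
G_hat = lambda x: np.stack([-sqrt2 * np.sin(x[:, 0]),
                            np.zeros(len(x))], axis=1)  # sigma^T grad psi
lam = 1.0                               # -Laplacian cos(x_1) = cos(x_1)

X = rng.uniform(0.0, 2.0 * np.pi, size=(B, d))   # X_0 ~ Unif(Omega), cf. (5)
U = psi_hat(X)                                   # U_0 = psi_hat(X_0)
for _ in range(N):
    dW = rng.normal(0.0, np.sqrt(dt), size=(B, d))
    # (9) with f = 0: U_{k+1} = U_k + (f - lam) U_k dt + G_hat(X_k) . dW_k
    U = U + (0.0 - lam) * U * dt + np.sum(G_hat(X) * dW, axis=1)
    X = X + sqrt2 * dW                           # (8), same noise realization

loss = np.mean((U - psi_hat(X)) ** 2)            # terminal matching term of (11)
print(loss)
```

For a trial pair that is not stationary under the propagation, the same quantity stays of order one, which is what the optimizer exploits.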
We remark that the definition of the dynamics (5) is not unique and implicitly affects the detailed computation of the loss functions (10) and (11). Specifically, in (5), the diffusion term is determined by the operator (2), while the choice of the initial distribution $\nu$ and the drift term has some flexibility. If the drift in (5) changes, one can change the associated drift in (6) according to Itô's formula and define the loss again as in (10) and (11). In this work, we choose the form of (5) without the drift, and $\nu$ to be the uniform distribution on $\Omega$, to ensure that the whole region is reasonably sampled for the optimization of the eigenfunction. Importance sampling can also be used if some prior knowledge of the eigenfunction is available, but we will not go into further details in this work.

At a high level, our algorithm is in a similar vein to the power method for solving the eigenvalue problem in linear algebra. Both algorithms seek solutions that are stationary under the propagation. However, one distinction is that our algorithm is not restricted to the first eigenvalue, but can target general eigenvalues, depending mainly on the initialization of $\lambda$. On the other hand, we find in numerical experiments that if $\lambda$ is initialized small enough, it will always converge to the first eigenvalue.
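Since the problem is posed on a periodic domain, the network ansatz itself should be periodic; this can be enforced by featurizing the input with trigonometric functions before the fully-connected layers. A minimal numpy sketch of such an ansatz (the sizes and the random, untrained weights below are placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
d, M, H = 3, 4, 16   # input dimension, trig-basis order, hidden width (assumed sizes)

# random weights standing in for trained parameters
W1 = rng.standard_normal((2 * M * d, H)); b1 = rng.standard_normal(H)
W2 = rng.standard_normal((H, 1));         b2 = rng.standard_normal(1)

def trig_features(x):
    # map each coordinate to (cos kx, sin kx), k = 1..M -> periodic by construction
    k = np.arange(1, M + 1)
    ang = x[:, :, None] * k[None, None, :]                  # (batch, d, M)
    return np.concatenate([np.cos(ang), np.sin(ang)], axis=2).reshape(len(x), -1)

def psi_hat(x):
    h = np.maximum(trig_features(x) @ W1 + b1, 0.0)          # ReLU hidden layer
    return (h @ W2 + b2)[:, 0]

x = rng.uniform(0.0, 2.0 * np.pi, size=(5, d))
print(np.max(np.abs(psi_hat(x) - psi_hat(x + 2.0 * np.pi))))  # ~0: periodic up to rounding
```

Any architecture applied on top of these features inherits the $2\pi$-periodicity exactly.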
In practice, we use fully-connected feed-forward neural networks for the approximation of $\hat\Psi_\theta$ and $\hat G_\eta$, respectively. To ensure periodicity of the neural network outputs, the input vector $x$ is first mapped into a fixed trigonometric basis of order $M$. Then the vector consisting of all basis components is fed into fully-connected neural networks with some hidden layers, each with several nodes. See Figure 1 for an illustration of the network structure. We use ReLU as the activation function and optimize the parameters with the Adam optimizer [16].

2.2 Normalization
The above loss has one caveat though, as the trivial solution $\Psi \equiv 0$ is a global minimizer. Therefore, normalization is required to exclude such a trivial case. We seek eigenfunctions such that $\frac{1}{|\Omega|}\int_\Omega \Psi^2\,\mathrm{d}x = 1$, i.e., $\|\Psi\|_{L^2(\Omega)}^2 = |\Omega|$.¹ To proceed, we define the normalization constant
$$Z_\theta = \operatorname{sgn}\!\Big(\int_\Omega \hat\Psi_\theta(x)\,\mathrm{d}x\Big)\Big(\frac{1}{|\Omega|}\int_\Omega \hat\Psi_\theta(x)^2\,\mathrm{d}x\Big)^{1/2}. \tag{12}$$
Thus, dividing by $Z_\theta$, we enforce that the parametrized function $\hat\Psi_\theta/Z_\theta$ satisfies the normalization condition. Note that the first factor on the right-hand side of (12) is introduced to fix the global sign ambiguity of the eigenfunction.

¹The reason we set $\|\Psi\|_{L^2}^2 = |\Omega|$ instead of $\|\Psi\|_{L^2} = 1$ is that we consider the problem in high dimensions on the domain $\Omega = [0, 2\pi]^d$. Consider the trivial case $\mathcal{L} = -\Delta$, whose smallest eigenvalue is $0$, and any constant function is a corresponding eigenfunction. If $\|\Psi\|_{L^2} = 1$, the constant function becomes $|\Omega|^{-1/2}$, which vanishes as $d \to \infty$; instead, the normalization $\|\Psi\|_{L^2}^2 = |\Omega|$ keeps the pointwise value of $\Psi$ of order $1$, which benefits the training process.
In computation, given the parameters $\theta$ of $\hat\Psi_\theta$, we do not have direct access to $Z_\theta$. Instead, at the $n$th step of training, we use our batch of data samples to approximate $Z_\theta$ via
$$\tilde Z_n = \operatorname{sgn}\!\Big(\frac{1}{B}\sum_{j=1}^{B}\hat\Psi_\theta\big(X_0^{(j,n)}\big)\Big)\Big(\frac{1}{B}\sum_{j=1}^{B}\hat\Psi_\theta\big(X_0^{(j,n)}\big)^2\Big)^{1/2}, \tag{13}$$
where the superscripts in $X_0^{(j,n)}$ serve as the index of the batch ($j$) and the index of the training step ($n$). The above is a Monte Carlo estimate of (12) if $X_0$ is sampled from the uniform distribution (which we assume in this work). Due to the normalization procedure, $\tilde Z_n$ enters the loss function, and thus the stochastic gradient based on the empirical average over the batch becomes biased (since, in general, the expectation of a nonlinear function of $\tilde Z_n$ differs from the same function evaluated at $Z_\theta$). To reduce the bias and make the training more stable, we introduce an exponential moving average scheme for the normalization constant, in order to reduce the dependence of the loss on the estimated normalization constant of the current batch. In our implementation, we use
$$Z_n = (1 - \zeta_n)\,\tilde Z_n + \zeta_n\, Z_{n-1}. \tag{14}$$
Here $\zeta_n$ is the moving average coefficient for decay. It is observed that a small $\zeta_n$ at the beginning makes training efficient, and later on its value is increased so that the gradient is less biased.
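A numpy sketch of the batch estimator (13) combined with the moving average (14); the batch values below are synthetic stand-ins for network outputs $\hat\Psi_\theta(X_0)$, with root mean square $\sqrt{4.01}\approx 2.0$ by construction:

```python
import numpy as np

rng = np.random.default_rng(3)

def batch_Z(psi_vals):
    # Monte Carlo estimate (13) of (12): the sign factor fixes the global
    # sign ambiguity, the second factor estimates the L2 scale
    return np.sign(np.mean(psi_vals)) * np.sqrt(np.mean(psi_vals ** 2))

Z = None
zeta = 0.5          # moving-average coefficient (kept smaller early in training)
for n in range(200):
    psi_vals = 2.0 + 0.1 * rng.standard_normal(256)  # stand-in batch of psi_hat(X_0)
    Z_tilde = batch_Z(psi_vals)
    # (14): Z_n = (1 - zeta) * Z_tilde_n + zeta * Z_{n-1}
    Z = Z_tilde if Z is None else (1.0 - zeta) * Z_tilde + zeta * Z
print(Z)  # -> close to 2.0
```

Averaging over past batches damps the batch-to-batch fluctuation of the normalization constant that enters the loss.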
Given the introduced normalization factor, the neural network approximation in the updating scheme (9) is replaced by (we suppress the training step index $n$ in $Z$)
$$U_0 = \frac{\hat\Psi_\theta(X_{t_0})}{Z}, \tag{15}$$
and we would hope to reduce, through training, the discrepancy between the solution of (9) at time $T$ and the normalized neural network approximation. The associated batch approximation of the loss function used for the computation of the stochastic gradient is as follows:
$$\widehat{\mathrm{Loss}} = \frac{1}{B}\sum_{j=1}^{B}\bigg[\alpha\,\Big|U_N^{(j)} - \frac{\hat\Psi_\theta\big(X_{t_N}^{(j)}\big)}{Z}\Big|^2 + \beta\,\Big|\hat G_\eta\big(X_{t_N}^{(j)}\big) - \frac{\sigma^{\top}\big(X_{t_N}^{(j)}\big)\nabla\hat\Psi_\theta\big(X_{t_N}^{(j)}\big)}{Z}\Big|^2\bigg] + \gamma\,\big(\max\{0,\, m - |\tilde Z_n|\}\big)^2. \tag{16}$$
In the last term above, $m$ is a hyperparameter and $\gamma$ is the associated weight. This term is introduced to prevent $|\tilde Z_n|$ from being too small; otherwise the normalization would become unstable. In each training step, we calculate the gradient of (16) with respect to all parameters to be optimized, including the eigenvalue $\lambda$ and the parameters in the neural network ansatz $\hat\Psi_\theta$ and $\hat G_\eta$. Note that in (16) we do not normalize $\hat G_\eta$, since its scale has been determined implicitly. When $B$ is reasonably large, and if we neglect the discretization error of simulating the SDEs, the empirical sum in (16) can be interpreted as a Monte Carlo approximation to the loss (ignoring the sign ambiguity and the regularization term)
$$\mathbb{E}\Big[\alpha\,\big|\tilde u(X_T, T) - \hat\Psi_\theta(X_T)/Z\big|^2 + \beta\,\big|\hat G_\eta(X_T) - \sigma^{\top}(X_T)\nabla\hat\Psi_\theta(X_T)/Z\big|^2\Big], \tag{17}$$
where $\tilde u$ is defined as in (6) except that $\tilde u(X_0, 0) = \hat\Psi_\theta(X_0)/Z$. We remark that the normalization procedure introduced here shares a similar spirit with batch normalization [17], which is widely used in the training of neural networks. We summarize our algorithm as pseudocode in Algorithm 1.
2.3 Method for semilinear operator
Our algorithm can be generalized to solve eigenvalue problems for semilinear operators of the form
$$\mathcal{L}\Psi = -\tfrac{1}{2}\operatorname{Tr}\!\big(\sigma\sigma^{\top}\operatorname{Hess}_x\Psi\big) + b\cdot\nabla\Psi + f(x, \Psi)\,\Psi. \tag{18}$$
The method for semilinear problems is almost the same as in the previous sections, except for a few modifications. The SDE for $X_t$ is the same as (5), while equation (6), which the solution of the PDE (3) satisfies, becomes
$$u(X_t, t) = u(X_0, 0) + \int_0^t \Big[\big(f\big(X_s, u(X_s, s)\big) - \lambda\big)\,u(X_s, s) + b(X_s)\cdot\nabla u(X_s, s)\Big]\,\mathrm{d}s + \int_0^t \big(\sigma^{\top}(X_s)\,\nabla u(X_s, s)\big)^{\top}\mathrm{d}W_s. \tag{19}$$
The discretization of $X_t$, equation (8), remains unchanged, while equation (9) needs modification according to (19):
$$U_{k+1} = U_k + \Big[\big(f\big(X_{t_k}, \mathrm{Clip}_c(U_k)\big) - \lambda\big)\,U_k + \big(\sigma^{-1}b\big)(X_{t_k})\cdot\hat G_\eta(X_{t_k})\Big]\,\Delta t_k + \hat G_\eta(X_{t_k})\cdot\Delta W_k, \tag{20}$$
where $\mathrm{Clip}_c$ is a clipping function given by
$$\mathrm{Clip}_c(v) = \min\big\{\max\{v, -c\},\, c\big\}. \tag{21}$$
Here we introduce the clipping function to prevent numerical instability caused by the nonlinearity of $f$ in (19), especially at the early stage of training. It checks $U_k$ and replaces those values whose absolute values are larger than $c$ with $\mathrm{sign}(U_k)\,c$, where $c$ is an upper bound of the absolute value of the true normalized eigenfunction. Given the modified forward dynamics (20), the loss function for the semilinear operators is defined the same as (16), and the training algorithm is the same too.
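The clipping (21) is elementary; a numpy sketch (the bound $c = 2$ is an arbitrary illustrative choice):

```python
import numpy as np

def clip(u, c):
    # replaces entries with |u| > c by sign(u) * c, cf. (21)
    return np.minimum(np.maximum(u, -c), c)

# the nonlinearity f(x, u) is then evaluated on clip(U_k, c) rather than U_k,
# which bounds, e.g., a cubic term early in training
u = np.array([-5.0, -1.0, 0.5, 3.0])
print(clip(u, 2.0))  # -> [-2.  -1.   0.5  2. ]
```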
3 Numerical results
In this section, we report the performance of the proposed eigensolver in three examples: the Fokker–Planck equation, the linear Schrödinger equation, and the nonlinear Schrödinger equation. The domain is always $\Omega = [0, 2\pi]^d$ with periodic boundary condition. In each example we consider two values of the dimension $d$. The hyperparameters are given in Appendix B. We examine the errors of the prescribed eigenvalue, the associated eigenfunction, and the gradient of the eigenfunction. The errors for the eigenfunctions and their gradients are computed in the relative $L^2$ sense, approximated through a set of validation points. Given a set of validation points $\{x^{(j)}\}_{j=1}^{N_v}$, we use the quantity
$$\mathrm{Err}_\Psi = \Bigg(\frac{\sum_{j=1}^{N_v}\big|\hat\Psi_\theta\big(x^{(j)}\big)/Z - \Psi\big(x^{(j)}\big)\big|^2}{\sum_{j=1}^{N_v}\big|\Psi\big(x^{(j)}\big)\big|^2}\Bigg)^{1/2} \tag{22}$$
to measure the error of the eigenfunction, where $Z$ is computed via equation (14) and $\Psi$ is a known reference eigenfunction. We use
$$\mathrm{Err}_{\nabla\Psi} = \Bigg(\frac{\sum_{j=1}^{N_v}\big|\hat G_\eta\big(x^{(j)}\big) - \sigma^{\top}\big(x^{(j)}\big)\nabla\Psi\big(x^{(j)}\big)\big|^2}{\sum_{j=1}^{N_v}\big|\sigma^{\top}\big(x^{(j)}\big)\nabla\Psi\big(x^{(j)}\big)\big|^2}\Bigg)^{1/2} \tag{23}$$
to quantify the error of the gradient approximation. We record and plot the errors at regular intervals in the training process, with a smoothed moving average, and the final reported error is based on the average over the last recorded steps. Besides the errors above, we also visualize and compare the density of the true eigenfunction and of its neural network approximation (since it is hard to visualize the high-dimensional eigenfunction directly). The density of a function $h$ is defined as the probability density function of the random variable $h(X)$, where $X$ is a uniformly distributed random variable on $\Omega$. In practice, the density is approximated by Monte Carlo sampling. As shown below, in all three examples, we find that the eigenpairs (with gradients) are solved accurately and the associated densities match well.

3.1 Fokker–Planck equation
In this subsection we consider the linear Fokker–Planck operator
$$\mathcal{L}\Psi = -\Delta\Psi - \nabla\cdot(\Psi\,\nabla V),$$
where $V$ is a potential function. The smallest eigenvalue of $\mathcal{L}$ is $0$ and the corresponding eigenfunction is $e^{-V}$, which can be used to compute the error. We consider an example $V(x) = \sum_{i=1}^{d} c_i \cos(x_i)$, where $x_i$ is the $i$th coordinate of $x$ and the coefficients $c_i$ take values in a fixed range. The function $e^{-V}$ is periodic by construction. Figure 2 shows the density and error curves for the Fokker–Planck equation in the two tested dimensions. In the lower-dimensional case, the final errors of the eigenvalue, eigenfunction, and scaled gradient are 3.46e-3, 2.406%, and 4.914%, respectively; in the higher-dimensional case, the final errors are 3.67e-3, 1.432%, and 4.701%.
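This eigenpair can be checked numerically in one dimension by applying the operator to $e^{-V}$ with spectral differentiation; the coefficient $0.5$ below is an arbitrary instance of this family of potentials:

```python
import numpy as np

N = 256
x = 2.0 * np.pi * np.arange(N) / N
k = np.fft.fftfreq(N, d=1.0 / N)           # integer wavenumbers on [0, 2*pi]

def ddx(g):
    # spectral derivative of a smooth periodic function
    return np.real(np.fft.ifft(1j * k * np.fft.fft(g)))

V = 0.5 * np.cos(x)                         # V(x) = c cos(x) with c = 0.5 (assumed)
psi = np.exp(-V)                            # candidate eigenfunction e^{-V}
# Fokker-Planck operator: L psi = -psi'' - (psi V')'
L_psi = -ddx(ddx(psi)) - ddx(psi * ddx(V))
print(np.max(np.abs(L_psi)))                # ~0: L e^{-V} = 0, eigenvalue 0
```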
3.2 Linear Schrödinger equation
In this subsection we consider the Schrödinger operator
$$\mathcal{L}\Psi = -\Delta\Psi + V\,\Psi,$$
where $V$ is a potential function. Here we choose $V(x) = \sum_{i=1}^{d} V_i(x_i)$, in which each $V_i$ is a one-dimensional periodic potential. With the potential function being of such a form, the problem essentially decouples. Therefore, we are able to compute the eigenvalues and eigenfunctions in each dimension through the spectral method and obtain the first eigenpair with high accuracy for comparison. The computation details are provided in Appendix A. Figure 3 shows the density and error curves for the Schrödinger equation in the two tested dimensions. In the lower-dimensional case, the final errors of the eigenvalue, eigenfunction, and scaled gradient are 8.50e-4, 0.974%, and 2.380%, respectively; in the higher-dimensional case, the final errors are 1.47e-3, 1.483%, and 3.634%.
3.3 Nonlinear Schrödinger equation
We finally consider a nonlinear Schrödinger operator with a cubic term, arising from the Gross–Pitaevskii equation [18, 19] for the single-particle wavefunction in a Bose–Einstein condensate:
$$\mathcal{L}\Psi = -\Delta\Psi + V\,\Psi + \Psi^3. \tag{24}$$
Here we prescribe a positive eigenfunction $\Psi$ and consider the specific external potential
$$V(x) = \lambda + \frac{\Delta\Psi(x)}{\Psi(x)} - \Psi(x)^2, \tag{25}$$
such that $(\lambda, \Psi)$ is an eigenpair of the operator (24). The prescribed $\Psi$ is scaled by a positive constant such that $\|\Psi\|_{L^2(\Omega)}^2 = |\Omega|$. Figure 4 shows the density and error curves for the nonlinear Schrödinger equation in the two tested dimensions. In the lower-dimensional case, the final errors of the eigenvalue, eigenfunction, and scaled gradient are 1.53e-3, 0.807%, and 4.367%, respectively; in the higher-dimensional case, the final errors are 2.6e-4, 0.460%, and 3.552%.
4 Conclusion and future works
In this paper, we propose a new method to solve eigenvalue problems in high dimensions using neural networks. Our method is able to compute both eigenvalues and corresponding eigenfunctions (with gradients) with high accuracy.
There are several natural directions for future work. First, to apply our methodology to quantum many-body systems, we need to respect the permutation symmetry in the ansatz for the wavefunctions. Previous works [9, 12, 11, 10] have proposed various flexible neural-network ansatzes that incorporate the symmetry, which can be combined with our approach. Moreover, in DMC, importance sampling techniques are often essential to improve the accuracy. In our context, this means choosing a better underlying diffusion process, guided by a trial wavefunction adapted to the problem. Last, the scalability of the method has to be tested on larger systems beyond the toy numerical examples in this work.
On the theoretical side, understanding the stability and convergence of the proposed method is a fascinating future direction. While a general analysis might be quite difficult, given the highly nonlinear approximation induced by the neural networks and the complicated optimization strategy, some perturbative analysis, especially in the linearized regime, might be possible. We leave these to future works.
Acknowledgement. The work of JL and MZ is supported in part by the National Science Foundation via grant DMS-1454939. The authors are grateful for computing time at the Terascale Infrastructure for Groundbreaking Research in Science and Engineering (TIGRESS) of Princeton University.
References
 [1] William Lauchlin McMillan. Ground state of liquid He 4. Physical Review, 138(2A):A442, 1965.
 [2] David Ceperley, Geoffrey V. Chester, and M. H. Kalos. Monte Carlo simulation of a many-fermion study. Physical Review B, 16(7):3081–3099, 1977.
 [3] Richard Blankenbecler, D. J. Scalapino, and R. L. Sugar. Monte Carlo calculations of coupled boson-fermion systems. I. Physical Review D, 24(8):2278, 1981.
 [4] Shiwei Zhang, Joseph Carlson, and James E. Gubernatis. Constrained path Monte Carlo method for fermion ground states. Physical Review B, 55(12):7464, 1997.
 [5] W. M. C. Foulkes, Lubos Mitas, Richard J. Needs, and Gunaretnam Rajagopal. Quantum Monte Carlo simulations of solids. Reviews of Modern Physics, 73(1):33, 2001.
 [6] Richard J. Needs, Michael D. Towler, Neil D. Drummond, and P. López Ríos. Continuum variational and diffusion quantum Monte Carlo calculations. Journal of Physics: Condensed Matter, 22(2):023201, 2009.

 [7] Weinan E and Bing Yu. The deep Ritz method: a deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1):1–12, 2018.
 [8] David Pfau, Stig Petersen, Ashish Agarwal, David G. T. Barrett, and Kimberly L. Stachenfeld. Spectral inference networks: Unifying deep and spectral learning. In International Conference on Learning Representations, 2019.
 [9] Jiequn Han, Linfeng Zhang, and Weinan E. Solving many-electron Schrödinger equation using deep neural networks. Journal of Computational Physics, 399:108929, 2019.
 [10] David Pfau, James S. Spencer, Alexander G. de G. Matthews, and W. M. C. Foulkes. Ab-initio solution of the many-electron Schrödinger equation with deep neural networks. arXiv preprint arXiv:1909.02487, 2019.
 [11] Jan Hermann, Zeno Schätzle, and Frank Noé. Deep neural network solution of the electronic Schrödinger equation. arXiv preprint arXiv:1909.08423, 2019.
 [12] Kenny Choo, Antonio Mezzacapo, and Giuseppe Carleo. Fermionic neural-network states for ab-initio electronic structure. arXiv preprint arXiv:1909.12852, 2019.
 [13] Weinan E, Jiequn Han, and Arnulf Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4):349–380, 2017.
 [14] Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
 [15] Jiequn Han and Jihao Long. Convergence of the deep BSDE method for coupled FBSDEs. arXiv preprint arXiv:1811.01165, 2018.
 [16] Diederik Kingma and Jimmy Ba. Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

 [17] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
 [18] Eugene P. Gross. Structure of a quantized vortex in boson systems. Il Nuovo Cimento, 20(3):454–477, 1961.
 [19] Lev P Pitaevskii. Vortex lines in an imperfect Bose gas. Soviet Physics—JETP, 13(2):451–454, 1961.
Appendix A Spectral method for the linear Schrödinger equation
Suppose that the potential function in the linear Schrödinger operator is decoupled, with the form $V(x) = \sum_{i=1}^{d} V_i(x_i)$; then we can solve the corresponding eigenvalue problem in a decoupled way. Specifically, assume we can solve the one-dimensional eigenvalue problems
$$-\psi''(x) + V_i(x)\,\psi(x) = \lambda^{(i)}\,\psi(x), \qquad x \in [0, 2\pi], \quad i = 1, \ldots, d, \tag{26}$$
with periodic boundary condition, with eigenpairs $\big(\lambda^{(i)}, \psi^{(i)}\big)$. Then one can easily verify that $\lambda = \sum_{i=1}^{d} \lambda^{(i)}$ and
$$\Psi(x) = \prod_{i=1}^{d} \psi^{(i)}(x_i) \tag{27}$$
together define an eigenpair of the original high-dimensional Schrödinger operator.
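A compact numpy sketch of this decoupling for cosine potentials $V_i(x_i) = c_i\cos(x_i)$ (the coefficients $c_i$ are arbitrary assumptions for illustration); each one-dimensional problem is solved by a Fourier–Galerkin discretization, whose matrix carries $k^2$ on the diagonal and the Fourier coefficients of $V_i$ off the diagonal:

```python
import numpy as np

def smallest_eig_1d(c, K=16):
    # Fourier-Galerkin matrix for -psi'' + c*cos(x)*psi = lambda*psi on
    # [0, 2*pi] with periodic BC, modes k = -K..K; the Fourier coefficients
    # of c*cos(x) are c/2 at wavenumbers +-1
    ks = np.arange(-K, K + 1)
    A = np.diag(ks.astype(float) ** 2)
    for i, kp in enumerate(ks):
        for j, kk in enumerate(ks):
            if abs(kp - kk) == 1:
                A[i, j] += c / 2.0
    return np.linalg.eigvalsh(A)[0]        # smallest 1D eigenvalue

# decoupled potential V(x) = sum_i c_i cos(x_i): the smallest eigenvalue of
# -Laplacian + V is the sum of the one-dimensional ground-state eigenvalues
cs = [0.5, 1.0, 1.5]
lam = sum(smallest_eig_1d(c) for c in cs)
print(lam)  # each 1D ground-state eigenvalue is slightly negative
```

The truncation order $K$ converges extremely fast for smooth potentials, so a small $K$ already yields a reference eigenpair far more accurate than the neural-network solver's error.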
To solve (26), we can employ the classical spectral method. For a fixed truncation order $K$, assume that
$$\psi(x) = \sum_{k=-K}^{K} c_k\, e^{\mathrm{i}kx}; \tag{28}$$
then
$$-\psi''(x) = \sum_{k=-K}^{K} k^2\, c_k\, e^{\mathrm{i}kx}. \tag{29}$$
Let $e^{\mathrm{i}k'x}$ ($-K \le k' \le K$) be the test functions. By (26) and periodicity, we have
$$\int_0^{2\pi}\big(-\psi''(x) + V_i(x)\,\psi(x)\big)\,e^{-\mathrm{i}k'x}\,\mathrm{d}x = \lambda^{(i)}\int_0^{2\pi}\psi(x)\,e^{-\mathrm{i}k'x}\,\mathrm{d}x. \tag{30}$$
Since $\int_0^{2\pi} e^{\mathrm{i}(k-k')x}\,\mathrm{d}x = 2\pi\,\delta_{kk'}$ and, denoting the Fourier coefficients of the potential by $\hat V_i(k) = \frac{1}{2\pi}\int_0^{2\pi}V_i(x)\,e^{-\mathrm{i}kx}\,\mathrm{d}x$, the left- and right-hand sides of (30) become
$$2\pi\Big((k')^2\,c_{k'} + \sum_{k=-K}^{K}\hat V_i(k'-k)\,c_k\Big) \tag{31}$$
(assuming $c_k = 0$ for $|k| > K$) and
$$2\pi\,\lambda^{(i)}\,c_{k'}, \tag{32}$$
respectively. Therefore, equation (30) can be rewritten in the matrix form
$$\big(D + \hat V_i\big)\,c = \lambda^{(i)}\,c, \qquad D = \operatorname{diag}\big(k^2\big)_{k=-K}^{K}, \quad \big(\hat V_i\big)_{k'k} = \hat V_i(k'-k), \quad c = (c_{-K}, \ldots, c_K)^{\top}. \tag{33}$$
This is a standard eigenvalue problem in numerical linear algebra. Suppose $\lambda^{(i)}$ and $c$ are an eigenvalue and associated eigenvector of the matrix in (33); then $\lambda^{(i)}$ is the approximated eigenvalue in (26), and (28) provides the associated eigenfunction, which is equivalent to a real eigenfunction up to a complex constant.

Appendix B Hyperparameters in the numerical examples
We first report the hyperparameters commonly used in all three numerical examples, and then list in Tables 1–3 those specific to each example. In all three examples, we fix the order of the trigonometric basis $M$, the constant $m$ regularizing the normalization factor, and the weight parameters $\alpha$, $\beta$, $\gamma$ in the loss. In the following tables, the structures of the neural networks are represented by vectors whose elements denote the number of nodes within each layer. The learning rate and moving average decay are both piecewise constant, whose values and boundaries are given separately. For example, in the lower-dimensional Fokker–Planck problem, the learning rate takes its first value up to the first boundary step, its second value between the two boundary steps, and 1e-5 afterward. The moving average decay is defined similarly, with the same boundaries.
Parameters  

terminal time  
number of time intervals  
structure of neural networks  [, , ]  [, , ] 
number of iterations  
learning rate  [, , ]  [, , ] 
moving average decay  [, , ]  [, , ] 
piecewise constant boundaries  [,]  [, ] 
batch size 
Parameters  

terminal time  
number of time intervals  
structure of neural networks  [, , ]  [, , ] 
number of iterations  
learning rate  [, , ]  [, , ] 
moving average decay  [, , ]  [, , ] 
piecewise constant boundaries  [, ]  [, ] 
batch size 
Parameters  

terminal time  
number of time intervals  
structure of neural networks  [, , ]  [, , ] 
number of iterations  
learning rate  [, , ]  [, , ] 
moving average decay  [, , ]  [, , ] 
piecewise constant boundaries  [, ]  [, ] 
batch size 