 # Solving high-dimensional eigenvalue problems using deep neural networks: A diffusion Monte Carlo like approach

We propose a new method to solve eigenvalue problems for linear and semilinear second order differential operators in high dimensions based on deep neural networks. The eigenvalue problem is reformulated as a fixed point problem of the semigroup flow induced by the operator, whose solution can be represented by Feynman-Kac formula in terms of forward-backward stochastic differential equations. The method shares a similar spirit with diffusion Monte Carlo but augments a direct approximation to the eigenfunction through neural-network ansatz. The criterion of fixed point provides a natural loss function to search for parameters via optimization. Our approach is able to provide accurate eigenvalue and eigenfunction approximations in several numerical examples, including Fokker-Planck operator, linear and nonlinear Schrödinger operators in high dimensions.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Many fundamental problems in scientific computing can be reduced to the computation of eigenvalues and eigenfunctions of an operator. One primary example is the electronic structure calculations, namely, computing the leading eigenvalue and eigenfunction of the Schrödinger operator. If the dimension of the state variable is low, one can use classical approaches, such as the finite difference method or spectral method, to discretize the operator and to solve the eigenvalue problem. However, these conventional, deterministic approaches suffer from the so-called curse of dimensionality, when the underlying dimension becomes high, since the degree of freedom grows exponentially as the dimension increases.

For high-dimensional problems, commonly arising from quantum mechanics, statistical mechanics, and finance applications, stochastic methods become more attractive and in many situations the only viable option. In the context of quantum mechanics, two widely used approaches for high-dimensional eigenvalue problems are the variational Monte Carlo (VMC) and diffusion Monte Carlo (DMC) methods [1, 2, 3, 4, 5, 6]. These two approaches deal with the high dimensionality via different strategies. VMC relies on leveraging chemical knowledge to propose an ansatz of eigenfunction (wavefunction in the context of quantum mechanics) with parameters to be optimized under the variational formulation of the eigenvalue problem. The Monte Carlo approach is used to approximate the gradient of the energy with respect to the parameters at each optimization iteration step. On the other hand, DMC represents the density of the eigenfunction with a collection of particles which follows the imaginary time evolution given by the Schrödinger operator, via a Feynman-Kac representation of the semigroup. It can be understood as a generalization of the classical power method from finite-dimensional matrices to infinite-dimensional operators. In electronic structure calculations, DMC usually can give more accurate eigenvalues compared with VMC, which relies on the quality of the proposed ansatz, while the particle representation of DMC often falls short to provide other information of the eigenfunction, such as its derivatives, unlike VMC.

As discussed above, one key to solving high-dimensional eigenvalue problems is the choice of function approximation to the targeted eigenfunction, ranging from the grid-based basis, spectral basis, to nonlinear parametrizations used in VMC, and to particle representations in DMC. Given the recent compelling success of neural networks in representing high-dimensional functions with remarkable accuracy and efficiency in various computational disciplines, it is fairly attempting to introduce neural networks to solve high-dimensional eigenvalue problems. This idea has been recently investigated under the variational formulation by [7, 8, 9, 10, 11, 12]. Particularly [9, 10, 11, 12] has shown the exciting potential of solving the many-electron Schrödinger equation with neural networks within the framework of VMC. On the other hand, how to apply neural networks in the formalism of DMC has not been explored in the literature, which leaves a natural open direction to investigate.

In this paper, we propose a new algorithm to solve high-dimensional eigenvalue problem for the second-order differential operators, in a similar spirit of DMC while based on the neural network parametrization of the eigenfunction. The eigenvalue problem is reformulated as a parabolic equation, whose solution can be represented by (nonlinear) Feynman-Kac formula in terms of forward-backward stochastic differential equations. Then we leverage the recently proposed deep BSDE method [13, 14] to seek optimal eigenpairs. Specifically, two deep neural networks are constructed to represent the eigenfunction and its scaled gradient. Then the neural network is propagated according to the semigroup generated by the operator. The loss function is defined as the difference between the neural networks before and after the propagation. Compared to conventional DMC, the proposed algorithm provides a direct approximation to the target eigenfunction, which overcomes the shortcoming in providing the gradient information. Moreover, since the BSDE formulation is valid for nonlinear operators, our approach can be extended to high-dimensional nonlinear eigenvalue problems, also validated in our numerical examples.

The rest of this paper is organized as follows. In Section 2, we introduce the algorithm to solve the eigenvalue problem. In Section 3, numerical examples are presented. We conclude in Section 4 with an outlook for future work.

## 2 Numerical methods

### 2.1 Method for linear operator

We consider the eigenvalue problem

 Lψ=λψ, (1)

on with periodic boundary condition where is a linear operator of the form

 Lψ(x)=−12Tr(σ(x)σ⊤(x)Hess(ψ)(x))−b(x)⋅∇ψ(x)+f(x)ψ(x). (2)

is a matrix function such that is uniformly elliptic, denotes the gradient of , is a

-dimensional vector field and

denotes the Hessian matrix of .

To solve this eigenvalue problem, we augment a time variable and consider the following backward parabolic partial differential equation (PDE) in the time interval

:

 {∂tu(t,x)−Lu(t,x)+λu(t,x)=0in [0,T]×Ω,u(T,x)=Ψ(x)on Ω. (3)

This is essentially a continuous time analog of the power method for matrix eigenvalue problem. Let us denote the solution of (3) as (note that the backward propagator forms a semigroup, i.e., ). According to the spectral theory of the elliptic operator, if is a stationary solution of (3), i.e., , then must be an eigenpair of . Therefore, we can minimize the “loss function” with respect to to solve the eigenvalue problem. While this is a non-convex optimization problem, we expect local convergence to a valid eigenpair with appropriate initialization.

The reformulation above turns the eigenvalue problem into solving a parabolic PDE in high dimensions. For the latter, we can leverage the recently developed deep BSDE method [13, 14, 15] (which is why the parabolic PDE (3) is written backward in time). Let solve the stochastic differential equation (SDE)

 dXt=σ(Xt)dWt, (4)

or in the integral form

 Xt=X0+∫t0σ(Xs)dWs, (5)

where is a -dimensional Brownian motion, and is sampled from some initial distribution . Then according to Itô’s formula, the solution to (3), satisfies

 u(t,Xt)=u(0,X0)+∫t0(f(Xs)u(s,Xs)−λu(s,Xs)−b(Xs)∇u(s,Xs))ds+∫t0σ(Xs)⊤∇u(s,Xs)dWs. (6)

Note that simulating the two SDEs (5) and (6) is relatively simple even in high dimensions, while directly solving the PDE (3) is intractable. We remark that it is possible to add a drift term to the SDE (4) and modify (6) accordingly (see the discussion below).

Of course, a priori in (6) for both and are unknown, while we know that if we set , the eigenfunction we look for, and , the solution remains for all . The idea of our method is then to use two neural networks, and as ansatz for the eigenfunction and its scaled gradient , respectively. Assigning and in (6), the discrepancy for the propagated solution, i.e.,

 (7)

then indicates the accuracy of the approximation. Here, we use to denote the absolute value of a number or the Euclidean norm of a vector according to the context. Note that the second term above penalizes the discrepancy between the approximation of and its gradient, where

are two weight hyperparameters. Therefore, using the above discrepancy as a loss function to optimize the triple

gives us a scheme to solve the eigenvalue problem. The above procedure can be directly extended to semilinear case where depends on and , as we will discuss in Section 2.3.

To employ the above framework in practice, we numerically discretize the SDEs (5) and (6) using Euler–Maruyama method with a given partition of interval : :

 X0=X0, Xtn+1=Xtn+σ(Xtn)ΔWn (8) and U0=NΨ(X0), (9)

for . Here , , and we use and to represent the continuous and discretized stochastic process, respectively. The noise terms have the same realization in (8) and (9), as in the forward-backward SDEs (5) and (6).

The loss function (7) then corresponds to the discrete counterpart:

 (10)

where is the gradient of neural network with respect to its input. In practice, the expectation in (10

) is further approximated by Monte Carlo sampling, which is similar to the empirical loss often used in the supervised learning context. For a given batch size

, we sample points of the initial state from the distribution

at each training step and estimate the gradient of the loss with respect to the trainable parameters using the empirical Monte Carlo average of (

10):

 (11)

We remark that the definition of dynamics (5) is not unique and implicitly affects the detailed computation of the loss function (10) and (11). Specifically, in (5), the diffusion term is determined by the operator (2) while the choice of initial distribution and the drift term has some flexibility. If the drift in (5) changes, one can change the associated drift in (6) according to Itô’s formula and define the loss again as (10) and (11). In this work, we choose the form of (5) without the drift and

being the uniform distribution on

to ensure that the whole region is reasonably sampled for the optimization of eigenfunction. Some importance sampling can be also used if some prior knowledge of the eigenfunction is available, which we will not go into further details in this work.

At a high level, our algorithm is in a similar vein as the power method for solving the eigenvalue problem in linear algebra. Both algorithms seek for solutions that are stationary under the propagation. However, one distinction is that our algorithm is not only for solving the first eigenvalue, but rather general eigenvalue, mainly depending on the initialization of . On the other hand, we find in numerical experiments that if is initialized small enough, it will always converge to the first eigenvalue.

In practice, we use fully-connected feed-forward neural networks for the approximation of

and , respectively. To ensure periodicity of the neural network outputs, the input vector is first mapped into a fixed trigonometric basis of order . Then the vector consisting of all basis components are fed into fully-connected neural networks with some hidden layers, each with several nodes. See Figure 1

for illustration of the involved network structure. We use ReLU as the activation function and optimize the parameters with the Adam optimizer

. Figure 1: Illustration of the neural network with 3 layers and several nodes in each layer. For x=(x1,⋯,xd), sin(kx)\coloneqq(sin(kx1),sin(kx2),⋯,sin(kxd)) is a d-dimensional vector, and cos(kx) is defined similarly, k=1,2,⋯,M. The output for NΨ is a scalar and the output for Nσ⊤∇Ψ is a d-dimensional vector.

### 2.2 Normalization

The above loss has one caveat though, as the trivial solution is a global minimizer. Therefore, normalization is required to exclude such a trivial case. We seek for eigenfunctions such that , i.e., 111The reason we set instead of is because we want to consider the problem in high dimensions in the domain . Consider the trivial case when , whose smallest eigenvalue is and any constant function is a corresponding eigenfunction. If , the constant function becomes , which vanishes as ; instead the normalization keeps the pointwise-value of as order , which benefits the training process. To proceed, we define the normalization constant

 ZΨ=sign(∫ΩNΨ(x)dx)(1|Ω|∫ΩNΨ(x)2dx)1/2. (12)

Thus dividing by , we enforce that the parametrized function ensures the normalization condition. Note that the first term on the right hand side of (12) is introduced to fix the global sign ambiguity of the eigenfunction.

In computation, given the parameters of , we do not have direct access to . Instead, at the -th step of training, we use our batch of data samples to approximate via

 ^ZℓΨ=sign(K∑k=1NΨ(Xk,ℓ0))(1KK∑k=1NΨ(Xk,ℓ0)2)1/2, (13)

where the superscripts in serves as the index of batch () and index of the training step (). The above is a Monte Carlo estimation of (12) if is sampled from uniform distribution (which we assume in this work). Due to the normalization procedure, will enter into the loss function, and thus the stochastic gradient based on the empirical average over the batch becomes biased (since in general). To reduce the bias and make the training more stable, we introduce an exponential moving average scheme to the normalization constant in order to reduce the dependence of the loss to the estimated normalization constant of the current batch. In our implementation, we use

 ZℓΨ=γℓZℓ−1Ψ+(1−γℓ)^ZℓΨ. (14)

Here is the moving average coefficient for decay. It is observed that small at the beginning makes training efficient, and later on its value is increased such that the gradient is less biased.

Given the introduced normalization factor, the neural network approximation in the updating scheme (9) is replaced by (we suppress the training step index in )

 Uk0=NΨ(Xk0)ZℓΨ, (15)

and we would hope to reduce the discrepancy between the solution of (9) at time and the normalized neural network approximation through training. The associated batch approximation of loss function used for the computation of stochastic gradient is as follows

 1KK∑k=1(η1∣∣∣NΨ(XkT)ZℓΨ−UkT∣∣∣2+η2∣∣∣Nσ⊤∇Ψ(XkT)−σ⊤(XT)∇NΨ(XkT)ZℓΨ∣∣∣2)+η3(Z0−ZℓΨ)+. (16)

In the last term above is a hyperparameter and is the associated weight. This term is introduce to prevent being too small; otherwise the normalization would become unstable. In each training step, we calculate the gradient of (16) with respect to all parameters to be optimized, including the eigenvalue and parameters in the neural network ansatz and . Note that in (16) we do not normalize since its scale has been determined implicitly. When is reasonably large and if we neglect the discretization error of simulating the SDEs, the empirical sum in (16) can be interpreted as a Monte Carlo approximation to the loss (ignoring the sign ambiguity)

 EX0∼ν⎡⎢⎣η1∣∣∣NΨ(XT)∥NΨ∥2/|Ω|12−u(T,XT)∣∣∣2+η2∣∣∣Nσ⊤∇Ψ(XT)−σ⊤(XT)∇NΨ(XT)∥NΨ∥2/|Ω|12∣∣∣2⎤⎥⎦, (17)

where is defined as (6) except that

. We remark that the normalization procedure introduced here shares a similar spirit with Batch Normalization

, which is widely used in the training of neural networks.

We summarize our algorithm as pseudocode in Algorithm 1.

### 2.3 Method for semilinear operator

Our algorithm can be generalized to solve eigenvalue problems for semilinear operator

 Lψ(x)=−12Tr(σ(x)σ⊤(x)Hess(ψ)(x))−b(x)⋅∇ψ(x)+f(x,ψ(x),σ⊤(x)∇ψ(x)). (18)

The method for semilinear problems is almost the same to previous sections, except for a few modifications. The SDE for is the same as (5) while equation (6) that the solution of the PDE (3) satisfies becomes

 u(t,Xt)= u(0,X0)+∫t0(f(Xs,u(s,Xs),σ⊤(Xs)∇u(s,Xs))−b(Xs)∇u(s,Xs)−λu(s,Xs))ds (19) +∫t0(∇u(s,Xs))⊤σ(Xs)dWs.

The discretization of , equation (8), remains unchanged while equation (9) needs modification according to (19):

 ˜Utn+1 =Utn+(f(Xtn,Utn,Nσ⊤∇Ψ)−λUtn−(bσ−⊤Nσ⊤∇Ψ)(Xtn))Δtn+Nσ⊤∇Ψ(Xtn)ΔWn, (20) Utn+1 =Clip(˜Utn+1,−Q,Q),

where Clip is a clipping function given by

 Clip(u,−Q,Q)=⎧⎨⎩−Q,if u<−Q;u,if −Q≤u≤Q;Q,if u>Q. (21)

Here we introduce the clipping function to prevent numerical instability caused by the nonlinearity of in (19), especially at the early stage of training. It checks and replaces those whose absolute values are larger than with , where is an upper bound of the absolute value of the true normalized eigenfunction. Given the modified forward dynamics (20), the loss function for the semilinear operators are defined the same as (16), and the training algorithm is the same too.

## 3 Numerical results

In this section, we report the performance of the proposed eigensolver in three examples: the Fokker-Planck equation, the linear Schrödinger equation, and the nonlinear Schrödinger equation. The domain is always with periodic boundary condition. In each example we consider the dimension and . The hyperparameters are given in Appendix B. We examine the errors of the prescribed eigenvalue, the associated eigenfunction, and the gradient of the eigenfunction. The errors for eigenfunctions and gradients of eigenfunctions are computed in sense, approximated through a set of validation points. Given a set of validation points , we use the quantity

 errΨ\coloneqq⎡⎢ ⎢⎣1KK∑k=1(NΨ(Xk)ZℓΨ−Ψ(Xk)(1K∑Km=1Ψ(Xm)2)12)2⎤⎥ ⎥⎦12 (22)

to measure the error for eigenfunction where is computed via equation (14), with a known reference eigenfunction . We use

 errσ⊤∇Ψ\coloneqq⎡⎢ ⎢ ⎢⎣1KdK∑k=1∣∣ ∣ ∣∣Nσ⊤∇Ψ(Xk)(1Kd∑Km=1|Nσ⊤∇Ψ(Xm)|2)12−σ(Xk)⊤∇Ψ(Xk)(1Kd∑Km=1|σ(Xm)⊤∇Ψ(Xm)|2)12∣∣ ∣ ∣∣2⎤⎥ ⎥ ⎥⎦12 (23)

to quantify the error for the gradient approximation. We record and plot the error every steps in the training process, with a smoothed moving average of window size . The final error reported is based on the average of last steps. Besides the errors above, we also visualize and compare the density of the true eigenfunction and its neural network approximation (since it is hard to visualize the high-dimensional eigenfunction directly). The density of a function

is defined as the probability density function of

where

is a uniformly distributed random variable on

. In practice, the density is approximated by Monte Carlo sampling. As shown below, in all three examples, we find that the eigenpairs (with gradients) are solved accurately and the associated densities match well.

### 3.1 Fokker-Planck equation

In this subsection we consider the linear Fokker-Planck operator

 Lψ=−Δψ−∇⋅(ψ∇V),

where is a potential function. The smallest eigenvalue of is and the corresponding eigenfunction is , which can be used to compute the error. We consider an example , where is the -th coordinate of , and takes values in . The function is periodic by construction. Figure 2 shows the density and error curves for the Fokker-Planck equation in and . For , the final errors of the eigenvalue, eigenfunction, and the scaled gradient are 3.46e-3, 2.406%, and 4.914%. For , the final errors are 3.67e-3, 1.432%, and 4.701%. Figure 2: Top: density of Ψ(x) for Fokker-Planck equation with d=5 (left) and d=10 (right). Bottom: associated error curves in the training process with d=5 (left) and d=10 (right).

### 3.2 Linear Schrödinger equation

In this subsection we consider the Schrödinger operator

 Lψ=−Δψ+Vψ,

where is a potential function. Here we choose , in which takes values in . With potential function being such a form, the problem is essentially decoupled. Therefore we are able to compute the eigenvalues and eigenfunctions in each dimension through the spectral method and obtain the final first eigenpair with high accuracy for comparison. The computation details are provided in Appendix A. Figure 3 shows the density and error curves for the Schrödinger equation in and . For , the final errors of the eigenvalue, eigenfunction, and the scaled gradient are 8.50e-4, 0.974%, and 2.380%. For , the final errors are 1.47e-3, 1.483%, and 3.634%. Figure 3: Top: density of Ψ(x) for linear Schrödinger equation with d=5 (left) and d=10 (right). Bottom: associated error curves in the training process with d=5 (left) and d=10 (right).

### 3.3 Nonlinear Schrödinger equation

We finally consider a nonlinear Schrödinger operator with a cubic term, arising from the Gross–Pitaevskii equation [18, 19] for the single-particle wavefunction in a Bose-Einstein condensate:

 Lψ=−Δψ+ϵψ3+Vψ. (24)

Here we assume and consider a specific external potential

 V(x)=−1c2exp(2dd∑i=1cosxi)+d∑i=1(sin2xid2−cosxid)−3, (25)

such that is an eigenpair of the operator (24). Here is a positive constant such that . Figure 4 shows the density and error curves for the nonlinear Schrödinger equation in and . For , the final errors of the eigenvalue, eigenfunction, and the scaled gradient are 1.53e-3, 0.807%, and 4.367%. For , the final errors are 2.6e-4, 0.460%, and 3.552%. Figure 4: Top: density of Ψ(x) for nonlinear Schrödinger equation with d=5 (left) and d=10 (right). Bottom: associated error curves in the training process with d=5 (left) and d=10 (right).

## 4 Conclusion and future works

In this paper, we propose a new method to solve eigenvalue problems in high dimensions using neural networks. Our method is able to compute both eigenvalues and corresponding eigenfunctions (with gradients) with high accuracy.

There are several natural directions for future work. First, to apply our methodology to quantum many-body systems, we need to respect the permutation symmetry in our ansatz for the wavefunctions. Previous works  [9, 12, 11, 10] have proposed various flexible neural-network ansatz to incorporate in the symmetry, which can be combined with our approach. Moreover, in DMC, importance sampling techniques are often essential to improve the accuracy. In our context, this means to choose a better underlying diffusion process guided by a trial wavefunction depending on the problem. Last, the scalability of the method has to be tested on larger systems beyond the toy numerical examples in this work.

On the theoretical aspects, the understanding of the stability and convergence of the proposed method is a fascinating future direction. While the general analysis might be quite difficult given the highly nonlinear approximation induced by the neural networks and also the complicated optimization strategy, some perturbative analysis, especially in the linearized regime, might be possible. We will leave these to future works.

Acknowledgement. The work of JL and MZ is supported in part by National Science Foundation via grant DMS-1454939. The authors are grateful for computing time at the Terascale Infrastructure for Groundbreaking Research in Science and Engineering (TIGRESS) of Princeton University.

## Appendix A Spectrum method for linear Schrödinger equation

Suppose that the potential function in the linear Schrödinger operator is decoupled with the form , then we can solve the corresponding eigenvalue problem in a decoupled way. Specifically, assume we can solve the one-dimensional eigenvalue problem

 −Ψ′′j(x)+cjcos(x)Ψj(x)=λjΨj(x),   x∈[0,2π]. (26)

Then one can easily verify that and

 Ψ(x)=d∏j=1Ψj(xj) (27)

together define an eigenpair of the original high-dimensional Schrödinger operator.

To solve (26), we can employ the classical spectrum method. For a fixed , assume that

 Ψj(x)=N∑m=−Najmeimx, (28)

then

 Ψ′j(x)=N∑m=−Nmajmeimxi. (29)

Let () be the test functions. By (26) and periodicity, we have

 ∫2π0(Ψ′j(x)φ′n(x)+cjcos(x)Ψj(x)φn(x))dx=λj∫2π0Ψj(x)φn(x)dx. (30)

Since and , the left- and right-hand sides of (30) become

 ∫2π0(Ψ′j(x)φ′n(x)+cjcos(x)Ψj(x)φn(x))dx (31) = N∑m=−Najm∫2π0(−mneimxeinx+cjcos(x)eimxeinx)dx = N∑m=−Najmπ(−2mnδm+n+cjδm+n+1+cjδm+n−1) = π(2n2a−n+cja−n−1+cja−n+1),

(assuming for ) and

 ∫2π0Ψj(x)φn(x)dx=2πa−n, (32)

respectively. Therefore, equation (30) can be rewritten in the matrix form:

 π⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣2N2cjcj⋱⋱⋱2⋱⋱0⋱⋱2⋱⋱⋱cjcj2N2⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣aj−N⋮⋮⋮ajN⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦=2πλj⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣aj−N⋮⋮⋮ajN⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦. (33)

This is a standard eigenvalue problem in numerical algebra. Suppose and

is the eigenvalue and associated eigenvector of the matrix in (

33), then is the approximated eigenvalue in (26), and (28) provides the associated eigenfunction, which is equivalent to a real eigenfunction up to a complex constant.

## Appendix B Hyperparameters in the numerical example

We first report hyperpameters commonly used in all three numerical examples and then list in Tables 1-3 those specific to the examples. In all three examples, the order of the trigonometric basis , the constant for regularizing the normalization factor, the weight parameters in the loss . In the following tables, the structures of the neural networks are represented by vectors, whose elements denote the number of nodes within each layer. The learning rate and moving average decay are both piecewise constant, whose values and boundaries are given separately. For example, in -dimensional Fokker-Planck problem, the learning rate is for the first steps, from the -st to the -th step and 1e-5 after the -th step. The moving average decay is defined similarly, with the same boundaries.