# Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations

We propose a new algorithm for solving parabolic partial differential equations (PDEs) and backward stochastic differential equations (BSDEs) in high dimension, by making an analogy between the BSDE and reinforcement learning with the gradient of the solution playing the role of the policy function, and the loss function given by the error between the prescribed terminal condition and the solution of the BSDE. The policy function is then approximated by a neural network, as is done in deep reinforcement learning. Numerical results using TensorFlow illustrate the efficiency and accuracy of the proposed algorithms for several 100-dimensional nonlinear PDEs from physics and finance such as the Allen-Cahn equation, the Hamilton-Jacobi-Bellman equation, and a nonlinear pricing model for financial derivatives.


## Code Repositories

### DeepBSDE

Deep BSDE solver in TensorFlow

### Delpde

Deep Learning-based Numerical Methods for PDEs


## 1 Introduction

Developing efficient numerical algorithms for high dimensional (say, hundreds of dimensions) partial differential equations (PDEs) has been one of the most challenging tasks in applied mathematics. As is well known, the difficulty lies in the “curse of dimensionality” [1]: as the dimensionality grows, the complexity of the algorithms grows exponentially. For this reason, there are only a limited number of cases where practical high dimensional algorithms have been developed. For linear parabolic PDEs, one can use the Feynman-Kac formula and Monte Carlo methods to develop efficient algorithms to evaluate solutions at any given space-time locations. For a class of inviscid Hamilton-Jacobi equations, Darbon & Osher have recently developed an algorithm which performs numerically well in high dimensions; see [9]. Darbon & Osher’s algorithm is based on results from compressed sensing and on the Hopf formulas for the Hamilton-Jacobi equations. A general algorithm for (nonlinear) parabolic PDEs based on the Feynman-Kac and Bismut-Elworthy-Li formulas and a multilevel decomposition of Picard iteration was developed in [11] and has been shown to be quite efficient on a number of examples in finance and physics. The complexity of that algorithm is shown to be $O(d\,\varepsilon^{-(4+\delta)})$ for semilinear heat equations, where $d$ is the dimensionality of the problem, $\varepsilon$ is the required accuracy, and $\delta \in (0,\infty)$ is arbitrarily small.

In recent years, a new class of techniques, called deep learning, has emerged in machine learning and has proven to be very effective in dealing with a large class of high dimensional problems in computer vision (cf., e.g., [23]), natural language processing (cf., e.g., [20]), time series analysis, etc. (cf., e.g., [15, 24]). This success has fueled speculation that deep learning might hold the key to solving the curse of dimensionality problem. It should be emphasized that, at the present time, there are no theoretical results that support such claims, although the practical success of deep learning has been astonishing. However, this should not prevent us from trying to apply deep learning to other problems where the curse of dimensionality has been the issue.

In this paper, we explore the use of deep learning for solving general high dimensional PDEs. To this end, it is necessary to formulate the PDEs as a learning problem. Motivated by ideas in [16] where deep learning-based algorithms were developed for high dimensional stochastic control problems, we explore a connection between (nonlinear) parabolic PDEs and backward stochastic differential equations (BSDEs) (see [26, 28, 25]) since BSDEs share a lot of common features with stochastic control problems.

## 2 Main ideas of the algorithm

We will consider a fairly general class of nonlinear parabolic PDEs (see (30) in Subsection 4.1 below). The proposed algorithm is based on the following set of ideas:

1. Through the so-called nonlinear Feynman-Kac formula, we can formulate the PDEs equivalently as BSDEs.

2. One can view the BSDE as a stochastic control problem with the gradient of the solution being the policy function. These stochastic control problems can then be viewed as model-based reinforcement learning problems.

3. The (high dimensional) policy function can then be approximated by a deep neural network, as has been done in deep reinforcement learning.

Instead of formulating initial value problems, as is commonly done in the PDE literature, we consider the setup with terminal conditions, since this facilitates making connections with BSDEs. Terminal value problems can obviously be transformed into initial value problems and vice versa.

In the remainder of this section we present a rough sketch of the derivation of the proposed algorithm, which we refer to as the deep BSDE solver. In this derivation we restrict ourselves to a specific class of nonlinear PDEs, that is, we restrict ourselves to semilinear heat equations (see (PDE) below) and refer to Subsections 3.2 and 4.1 below for the general introduction of the deep BSDE solver.

### 2.1 An example: a semilinear heat partial differential equation (PDE)

Let $T \in (0,\infty)$, $d \in \mathbb{N}$, $\xi \in \mathbb{R}^d$, let $f \colon \mathbb{R} \times \mathbb{R}^d \to \mathbb{R}$ and $g \colon \mathbb{R}^d \to \mathbb{R}$ be continuous functions, and let $u \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$ satisfy for all $t \in [0,T]$, $x \in \mathbb{R}^d$ that $u(T,x) = g(x)$ and

$$\frac{\partial u}{\partial t}(t,x) + \tfrac{1}{2}(\Delta_x u)(t,x) + f\big(u(t,x), (\nabla_x u)(t,x)\big) = 0. \tag{PDE}$$

A key idea of this work is to reformulate the PDE (PDE) as an appropriate stochastic control problem.

### 2.2 Formulation of the PDE as a suitable stochastic control problem

More specifically, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $W \colon [0,T] \times \Omega \to \mathbb{R}^d$ be a $d$-dimensional standard Brownian motion on $(\Omega, \mathcal{F}, \mathbb{P})$, let $\mathbb{F} = (\mathbb{F}_t)_{t \in [0,T]}$ be the normal filtration on $(\Omega, \mathcal{F}, \mathbb{P})$ generated by $W$, let $\mathcal{A}$ be the set of all $\mathbb{F}$-adapted $\mathbb{R}^d$-valued stochastic processes with continuous sample paths, and for every $y \in \mathbb{R}$ and every $Z = (Z_t)_{t \in [0,T]} \in \mathcal{A}$ let $Y^{y,Z} = (Y^{y,Z}_t)_{t \in [0,T]}$ be an $\mathbb{F}$-adapted stochastic process with continuous sample paths which satisfies that for all $t \in [0,T]$ it holds $\mathbb{P}$-a.s. that

$$Y^{y,Z}_t = y - \int_0^t f\big(Y^{y,Z}_s, Z_s\big)\,ds + \int_0^t \langle Z_s, dW_s \rangle_{\mathbb{R}^d}. \tag{1}$$

We now view the solution of (PDE) and its spatial derivative as the solution of a stochastic control problem associated to (1). More formally, under suitable regularity hypotheses on the nonlinearity $f$ it holds that the pair consisting of $u(0,\xi) \in \mathbb{R}$ and $\big((\nabla_x u)(t, \xi + W_t)\big)_{t \in [0,T]} \in \mathcal{A}$ is the (up to indistinguishability) unique global minimum of the function

$$\mathbb{R} \times \mathcal{A} \ni (y, Z) \mapsto \mathbb{E}\Big[\big|Y^{y,Z}_T - g(\xi + W_T)\big|^2\Big] \in [0,\infty]. \tag{2}$$

One can also view the stochastic control problem (1)–(2) (with $Z$ being the control) as a model-based reinforcement learning problem. In that analogy, we view $Z$ as the policy and we approximate $Z$ using feedforward neural networks (see (11) and Section 4 below for further details). The process $(Y^{y,Z}_t)_{t \in [0,T]}$ corresponds to the value function associated to the stochastic control problem and can be computed approximately by employing the policy $Z$ (see (9) below for details). The connection between the PDE (PDE) and the stochastic control problem (1)–(2) is based on the nonlinear Feynman-Kac formula, which links PDEs and BSDEs (see (BSDE) and (3) below).

### 2.3 The nonlinear Feynman-Kac formula

Let $Y = (Y_t)_{t \in [0,T]}$ and $Z = (Z_t)_{t \in [0,T]}$ be $\mathbb{F}$-adapted stochastic processes with continuous sample paths which satisfy that for all $t \in [0,T]$ it holds $\mathbb{P}$-a.s. that

$$Y_t = g(\xi + W_T) + \int_t^T f(Y_s, Z_s)\,ds - \int_t^T \langle Z_s, dW_s \rangle_{\mathbb{R}^d}. \tag{BSDE}$$

Under suitable additional regularity assumptions on the nonlinearity $f$ we have that the nonlinear parabolic PDE (PDE) is related to the BSDE (BSDE) in the sense that for all $t \in [0,T]$ it holds $\mathbb{P}$-a.s. that

$$Y_t = u(t, \xi + W_t) \in \mathbb{R} \qquad \text{and} \qquad Z_t = (\nabla_x u)(t, \xi + W_t) \in \mathbb{R}^d \tag{3}$$

(cf., e.g., [25, Section 3] and [27]). The first identity in (3) is sometimes referred to as the nonlinear Feynman-Kac formula in the literature.
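In the linear case $f \equiv 0$, the identity (3) reduces to the classical Feynman-Kac formula $u(t,x) = \mathbb{E}[g(x + W_{T-t})]$, which can be checked numerically. The following sketch is our own illustration, not part of the paper; the test function $g(x) = \|x\|^2$ is an assumption chosen because the heat equation then has the closed-form solution $u(t,x) = \|x\|^2 + d\,(T-t)$.

```python
import numpy as np

def u_monte_carlo(t, x, T, n_samples, rng):
    """Monte Carlo evaluation of u(t, x) = E[g(x + W_{T-t})] for the heat
    equation du/dt + (1/2)(Lap u) = 0 with g(x) = ||x||^2, whose exact
    solution is u(t, x) = ||x||^2 + d * (T - t)."""
    d = x.shape[0]
    w = rng.normal(0.0, np.sqrt(T - t), size=(n_samples, d))  # samples of W_{T-t}
    return float(np.mean(np.sum((x + w) ** 2, axis=1)))
```

For instance, at $t = 0$, $x = (1,\dots,1) \in \mathbb{R}^5$, $T = 1$ the exact value is $\|x\|^2 + d\,T = 10$, and the Monte Carlo estimate converges to it at rate $O(n^{-1/2})$.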

### 2.4 Forward discretization of the backward stochastic differential equation (BSDE)

To derive the deep BSDE solver, we first plug the second identity in (3) into (BSDE) to obtain that for all $t \in [0,T]$ it holds $\mathbb{P}$-a.s. that

$$Y_t = g(\xi + W_T) + \int_t^T f\big(Y_s, (\nabla_x u)(s, \xi + W_s)\big)\,ds - \int_t^T \big\langle (\nabla_x u)(s, \xi + W_s), dW_s \big\rangle_{\mathbb{R}^d}. \tag{4}$$

In particular, we obtain that for all $t_1, t_2 \in [0,T]$ with $t_1 \le t_2$ it holds $\mathbb{P}$-a.s. that

$$Y_{t_2} = Y_{t_1} - \int_{t_1}^{t_2} f\big(Y_s, (\nabla_x u)(s, \xi + W_s)\big)\,ds + \int_{t_1}^{t_2} \big\langle (\nabla_x u)(s, \xi + W_s), dW_s \big\rangle_{\mathbb{R}^d}. \tag{5}$$

Next we apply a time discretization to (5). More specifically, let $N \in \mathbb{N}$ and let $t_0, t_1, \dots, t_N \in [0,T]$ be real numbers which satisfy

$$0 = t_0 < t_1 < \dots < t_N = T \tag{6}$$

and observe that (5) suggests for sufficiently large $N \in \mathbb{N}$ that

$$Y_{t_{n+1}} \approx Y_{t_n} - f\big(Y_{t_n}, (\nabla_x u)(t_n, \xi + W_{t_n})\big)(t_{n+1} - t_n) + \big\langle (\nabla_x u)(t_n, \xi + W_{t_n}), W_{t_{n+1}} - W_{t_n} \big\rangle_{\mathbb{R}^d}. \tag{7}$$
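The forward scheme (7) can be sketched in a few lines. The following fragment is an illustration under the assumption that some approximation `grad_u` of $(\nabla_x u)$ is available (for instance, a trained network); it simulates one discretized path:

```python
import numpy as np

def simulate_forward_euler(y0, grad_u, f, t_grid, xi, rng):
    """Simulate one path of the forward Euler scheme (7) for the BSDE.

    y0      : initial guess for u(0, xi) (a float)
    grad_u  : callable (t, x) -> R^d approximating (nabla_x u)(t, x)
    f       : the nonlinearity f(y, z)
    t_grid  : increasing array 0 = t_0 < ... < t_N = T
    xi      : starting point in R^d
    rng     : numpy random Generator driving the Brownian increments
    """
    d = xi.shape[0]
    x = xi.copy()          # X_{t_n} = xi + W_{t_n}
    y = y0                 # Y_{t_n}
    for n in range(len(t_grid) - 1):
        dt = t_grid[n + 1] - t_grid[n]
        dw = rng.normal(0.0, np.sqrt(dt), size=d)   # W_{t_{n+1}} - W_{t_n}
        z = grad_u(t_grid[n], x)
        y = y - f(y, z) * dt + z @ dw               # recursion (7)
        x = x + dw
    return y, x
```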

### 2.5 Deep learning-based approximations

In the next step we employ a deep learning approximation for

$$(\nabla_x u)(t_n, x) \in \mathbb{R}^d, \qquad x \in \mathbb{R}^d, \; n \in \{0, 1, \dots, N\}, \tag{8}$$

but not for $u(t_n, x) \in \mathbb{R}$, $x \in \mathbb{R}^d$, $n \in \{0, 1, \dots, N\}$. Approximations for $Y_{t_n}$, $n \in \{0, 1, \dots, N\}$, in turn, can be computed recursively by using (7) together with deep learning approximations for (8). More specifically, let $\rho \in \mathbb{N}$, let $\mathcal{U}^\theta \in \mathbb{R}$, $\theta \in \mathbb{R}^\rho$, be real numbers, let $\mathcal{V}^\theta_n \colon \mathbb{R}^d \to \mathbb{R}^d$, $\theta \in \mathbb{R}^\rho$, $n \in \{0, 1, \dots, N-1\}$, be continuous functions, and let $\mathcal{Y}^\theta \colon \{0, 1, \dots, N\} \times \Omega \to \mathbb{R}$, $\theta \in \mathbb{R}^\rho$, be stochastic processes which satisfy for all $\theta \in \mathbb{R}^\rho$, $n \in \{0, 1, \dots, N-1\}$ that $\mathcal{Y}^\theta_0 = \mathcal{U}^\theta$ and

$$\mathcal{Y}^\theta_{n+1} = \mathcal{Y}^\theta_n - f\big(\mathcal{Y}^\theta_n, \mathcal{V}^\theta_n(\xi + W_{t_n})\big)(t_{n+1} - t_n) + \big\langle \mathcal{V}^\theta_n(\xi + W_{t_n}), W_{t_{n+1}} - W_{t_n} \big\rangle_{\mathbb{R}^d}. \tag{9}$$

We think of $\rho \in \mathbb{N}$ as the number of parameters in the neural network, for all appropriate $\theta \in \mathbb{R}^\rho$ we think of $\mathcal{U}^\theta \in \mathbb{R}$ as a suitable approximation

$$\mathcal{U}^\theta \approx u(0, \xi) \tag{10}$$

of $u(0, \xi)$, and for all appropriate $\theta \in \mathbb{R}^\rho$, $n \in \{0, 1, \dots, N-1\}$, $x \in \mathbb{R}^d$ we think of $\mathcal{V}^\theta_n(x) \in \mathbb{R}^d$ as a suitable approximation

$$\mathcal{V}^\theta_n(x) \approx (\nabla_x u)(t_n, x) \tag{11}$$

of $(\nabla_x u)(t_n, x)$.
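As a rough illustration of this parametrization, one can collect the scalar $\mathcal{U}^\theta$ and the parameters of the subnetworks $\mathcal{V}^\theta_n$ in a single container. In the toy sketch below each subnetwork is replaced by an affine map, purely for brevity; the actual solver uses the feedforward networks described in Section 4.

```python
import numpy as np

def make_theta(d, N, rng):
    """Collect all trainable quantities of the parametrization (9)-(11):
    the scalar U^theta approximating u(0, xi) and, for each time step,
    the parameters of a stand-in for the subnetwork V^theta_n.  Here each
    subnetwork is just an affine map x -> A_n x + b_n (a simplification;
    the actual solver uses feedforward neural networks)."""
    return {
        'u0': rng.normal(),                                  # U^theta, cf. (10)
        'A': [0.1 * rng.normal(size=(d, d)) for _ in range(N)],
        'b': [np.zeros(d) for _ in range(N)],
    }

def V(theta, n, x):
    """Evaluate the n-th stand-in subnetwork, cf. approximation (11)."""
    return theta['A'][n] @ x + theta['b'][n]
```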

### 2.6 Stochastic optimization algorithms

The “appropriate” $\theta \in \mathbb{R}^\rho$ can be obtained by minimizing the expected loss function through stochastic gradient descent-type algorithms. For the loss function we pick the squared approximation error associated to the terminal condition of the BSDE (BSDE). More precisely, assume that the function below has a unique global minimum and let $\lambda \in \mathbb{R}^\rho$ be the real vector for which the function

$$\mathbb{R}^\rho \ni \theta \mapsto \mathbb{E}\Big[\big|\mathcal{Y}^\theta_N - g(\xi + W_{t_N})\big|^2\Big] \in [0,\infty] \tag{12}$$

is minimal. Minimizing the function (12) is inspired by the fact that

$$\mathbb{E}\Big[\big|Y_T - g(\xi + W_T)\big|^2\Big] = 0 \tag{13}$$

according to (BSDE) above (cf. (2) above). Under suitable regularity assumptions, we approximate the vector $\lambda \in \mathbb{R}^\rho$ through stochastic gradient descent-type approximation methods and thereby obtain random approximations $\Theta_m \colon \Omega \to \mathbb{R}^\rho$, $m \in \mathbb{N}_0$, of $\lambda \in \mathbb{R}^\rho$. For sufficiently large $m \in \mathbb{N}$ we then employ the random variable $\mathcal{U}^{\Theta_m} \colon \Omega \to \mathbb{R}$ as a suitable implementable approximation

$$\mathcal{U}^{\Theta_m} \approx u(0, \xi) \tag{14}$$

of $u(0, \xi)$ (cf. (10) above), and for sufficiently large $m \in \mathbb{N}$ and all $n \in \{0, 1, \dots, N-1\}$, $x \in \mathbb{R}^d$ we use the random variable $\mathcal{V}^{\Theta_m}_n(x) \colon \Omega \to \mathbb{R}^d$ as a suitable implementable approximation

$$\mathcal{V}^{\Theta_m}_n(x) \approx (\nabla_x u)(t_n, x) \tag{15}$$

of $(\nabla_x u)(t_n, x)$ (cf. (11) above). In the next section the proposed approximation method is described in more detail.
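The effect of minimizing the terminal loss (12) can be seen in a deliberately simple toy problem: for $f = 0$ and a linear terminal condition $g(x) = \langle c, x \rangle$ (both hypothetical choices for illustration), the exact solution of the heat equation is $u(t,x) = \langle c, x \rangle$, so with the true gradient $Z = c$ plugged into the scheme (9), the loss is minimized exactly at $\mathcal{Y}_0 = \langle c, \xi \rangle$. Plain stochastic gradient descent on the scalar $\mathcal{Y}_0$ recovers this value:

```python
import numpy as np

# Toy illustration (our own, with hypothetical data): f = 0 and the linear
# terminal condition g(x) = <c, x>, for which the heat equation has the
# exact solution u(t, x) = <c, x>.  With the true gradient Z = c plugged
# into the scheme (9), the loss (12) is minimized exactly at y0 = <c, xi>.
rng = np.random.default_rng(1)
d, T, N = 4, 1.0, 10
c = np.arange(1.0, d + 1.0)           # hypothetical choice defining g
xi = np.ones(d)
dt = T / N
y0 = 0.0                              # trainable scalar U^theta
gamma = 0.1                           # learning rate
for m in range(500):
    dW = rng.normal(0.0, np.sqrt(dt), size=(N, d))
    w_T = dW.sum(axis=0)
    y_T = y0 + c @ w_T                # scheme (9) with Z_n = c and f = 0
    residual = y_T - c @ (xi + w_T)   # Y_N - g(xi + W_T)
    y0 -= gamma * 2.0 * residual      # SGD step on the gradient of residual^2
# y0 converges to u(0, xi) = <c, xi> = 10.0
```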

To simplify the presentation we have restricted ourselves in (PDE), (1), (2), (BSDE) above and in Subsection 3.1 below to semilinear heat equations. We refer to Subsection 3.2 and Section 4 below for the general description of the deep BSDE solver.

## 3 Details of the algorithm

### 3.1 Formulation of the proposed algorithm in the case of semilinear heat equations

In this subsection we describe the algorithm proposed in this article in the specific situation where (PDE) is the PDE under consideration, where batch normalization (see Ioffe & Szegedy [21]) is not employed, and where the plain-vanilla stochastic gradient descent approximation method with a constant learning rate and without mini-batches is the employed stochastic algorithm. The general framework, which includes the setting in this subsection as a special case, can be found in Subsection 3.2 below.

###### Framework 3.1 (Specific case).

Let $T \in (0,\infty)$, $N, d, \rho \in \mathbb{N}$, $\xi \in \mathbb{R}^d$, $\gamma \in (0,\infty)$, let $f \colon \mathbb{R} \times \mathbb{R}^d \to \mathbb{R}$ and $g \colon \mathbb{R}^d \to \mathbb{R}$ be functions, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $W^m \colon [0,T] \times \Omega \to \mathbb{R}^d$, $m \in \mathbb{N}_0$, be independent $d$-dimensional standard Brownian motions on $(\Omega, \mathcal{F}, \mathbb{P})$, let $t_0, t_1, \dots, t_N \in [0,T]$ be real numbers with

$$0 = t_0 < t_1 < \dots < t_N = T, \tag{16}$$

for every $\theta \in \mathbb{R}^\rho$ let $\mathcal{U}^\theta \in \mathbb{R}$, for every $\theta \in \mathbb{R}^\rho$, $n \in \{0, 1, \dots, N-1\}$ let $\mathcal{V}^\theta_n \colon \mathbb{R}^d \to \mathbb{R}^d$ be a function, for every $\theta \in \mathbb{R}^\rho$, $m \in \mathbb{N}_0$ let $\mathcal{Y}^{\theta,m} \colon \{0, 1, \dots, N\} \times \Omega \to \mathbb{R}$ be the stochastic process which satisfies for all $n \in \{0, 1, \dots, N-1\}$ that $\mathcal{Y}^{\theta,m}_0 = \mathcal{U}^\theta$ and

$$\mathcal{Y}^{\theta,m}_{n+1} = \mathcal{Y}^{\theta,m}_n - f\big(\mathcal{Y}^{\theta,m}_n, \mathcal{V}^\theta_n(\xi + W^m_{t_n})\big)(t_{n+1} - t_n) + \big\langle \mathcal{V}^\theta_n(\xi + W^m_{t_n}), W^m_{t_{n+1}} - W^m_{t_n} \big\rangle_{\mathbb{R}^d}, \tag{17}$$

for every $m \in \mathbb{N}_0$ let $\phi_m \colon \mathbb{R}^\rho \times \Omega \to \mathbb{R}$ be the function which satisfies for all $\theta \in \mathbb{R}^\rho$, $\omega \in \Omega$ that

$$\phi_m(\theta, \omega) = \big|\mathcal{Y}^{\theta,m}_N(\omega) - g\big(\xi + W^m_{t_N}(\omega)\big)\big|^2, \tag{18}$$

for every $m \in \mathbb{N}_0$ let $\Phi_m \colon \mathbb{R}^\rho \times \Omega \to \mathbb{R}^\rho$ be a function which satisfies for all $\theta \in \mathbb{R}^\rho$, $\omega \in \Omega$ that

$$\Phi_m(\theta, \omega) = (\nabla_\theta \phi_m)(\theta, \omega), \tag{19}$$

and let $\Theta \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^\rho$ be a stochastic process which satisfies for all $m \in \mathbb{N}$ that

$$\Theta_m = \Theta_{m-1} - \gamma \cdot \Phi_m(\Theta_{m-1}). \tag{20}$$

Under suitable further hypotheses (cf. Sections 4 and 5 below), we think in the case of sufficiently large $N, \rho, m \in \mathbb{N}$ and sufficiently small $\gamma \in (0,\infty)$ of the random variable $\mathcal{U}^{\Theta_m}$ as an appropriate approximation

$$u(0, \xi) \approx \mathcal{U}^{\Theta_m} \tag{21}$$

of the solution $u(t,x)$, $(t,x) \in [0,T] \times \mathbb{R}^d$, of the PDE

$$\frac{\partial u}{\partial t}(t,x) + \tfrac{1}{2}(\Delta_x u)(t,x) + f\big(u(t,x), (\nabla_x u)(t,x)\big) = 0 \tag{22}$$

for $(t,x) \in [0,T) \times \mathbb{R}^d$ with $u(T,x) = g(x)$ for $x \in \mathbb{R}^d$.
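Framework 3.1 can be sketched end to end as follows. In this illustration, an assumption-laden toy rather than the paper's implementation, each $\mathcal{V}^\theta_n$ is a constant vector packed into $\theta$, and the exact gradient $\Phi_m$ in (19) is replaced by a central finite difference; a real implementation would use automatic differentiation as provided by TensorFlow.

```python
import numpy as np

def phi(theta, f, g, t_grid, xi, dW):
    """One-path loss phi_m(theta) from (18).  theta packs the scalar
    U^theta followed by one constant d-vector per time step, standing in
    for the subnetworks V^theta_n (a simplification for illustration)."""
    N, d = dW.shape
    y = theta[0]
    V = theta[1:].reshape(N, d)
    x = xi.copy()
    for n in range(N):
        dt = t_grid[n + 1] - t_grid[n]
        y = y - f(y, V[n]) * dt + V[n] @ dW[n]   # recursion (17)
        x = x + dW[n]
    return (y - g(x)) ** 2

def numerical_grad(fun, theta, h=1e-6):
    """Central finite-difference stand-in for the exact gradient Phi_m in
    (19); a real implementation would use automatic differentiation."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        grad[i] = (fun(theta + e) - fun(theta - e)) / (2.0 * h)
    return grad

def sgd_step(theta, grad, gamma):
    """Plain-vanilla update (20): Theta_m = Theta_{m-1} - gamma * Phi_m."""
    return theta - gamma * grad
```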

### 3.2 Formulation of the proposed algorithm in the general case

###### Framework 3.2 (General case).

Let $T \in (0,\infty)$, $N, d, k, \rho, \varrho, \varsigma \in \mathbb{N}$, $\xi \in \mathbb{R}^d$, let $f \colon [0,T] \times \mathbb{R}^d \times \mathbb{R}^k \times \mathbb{R}^{k \times d} \to \mathbb{R}^k$, $g \colon \mathbb{R}^d \to \mathbb{R}^k$, and $\Upsilon \colon [0,T]^2 \times \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$ be functions, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $W^{m,j} \colon [0,T] \times \Omega \to \mathbb{R}^d$, $m, j \in \mathbb{N}_0$, be independent $d$-dimensional standard Brownian motions on $(\Omega, \mathcal{F}, \mathbb{P})$, let $t_0, t_1, \dots, t_N \in [0,T]$ be real numbers with

$$0 = t_0 < t_1 < \dots < t_N = T, \tag{23}$$

for every $\theta \in \mathbb{R}^\rho$ let $\mathcal{U}^\theta \in \mathbb{R}^k$, for every $\theta \in \mathbb{R}^\rho$, $s \in \mathbb{R}^\varsigma$, $n \in \{0, 1, \dots, N-1\}$, $j \in \mathbb{N}_0$ let $\mathcal{V}^{\theta,s}_{n,j} \colon (\mathbb{R}^d)^{\mathbb{N}} \to \mathbb{R}^{k \times d}$ be a function, for every $m, j \in \mathbb{N}_0$ let $\mathcal{X}^{m,j} \colon \{0, 1, \dots, N\} \times \Omega \to \mathbb{R}^d$ and $\mathcal{Y}^{\theta,s,m,j} \colon \{0, 1, \dots, N\} \times \Omega \to \mathbb{R}^k$, $\theta \in \mathbb{R}^\rho$, $s \in \mathbb{R}^\varsigma$, be stochastic processes which satisfy for all $\theta \in \mathbb{R}^\rho$, $s \in \mathbb{R}^\varsigma$, $n \in \{0, 1, \dots, N-1\}$ that

$$\mathcal{X}^{m,j}_0 = \xi, \qquad \mathcal{Y}^{\theta,s,m,j}_0 = \mathcal{U}^\theta, \qquad \mathcal{X}^{m,j}_{n+1} = \Upsilon\big(t_n, t_{n+1}, \mathcal{X}^{m,j}_n, W^{m,j}_{t_{n+1}} - W^{m,j}_{t_n}\big), \tag{24}$$

$$\mathcal{Y}^{\theta,s,m,j}_{n+1} = \mathcal{Y}^{\theta,s,m,j}_n - f\big(t_n, \mathcal{X}^{m,j}_n, \mathcal{Y}^{\theta,s,m,j}_n, \mathcal{V}^{\theta,s}_{n,j}(\{\mathcal{X}^{m,i}_n\}_{i \in \mathbb{N}})\big)(t_{n+1} - t_n) + \mathcal{V}^{\theta,s}_{n,j}(\{\mathcal{X}^{m,i}_n\}_{i \in \mathbb{N}})\,\big(W^{m,j}_{t_{n+1}} - W^{m,j}_{t_n}\big), \tag{25}$$

for every $m, j \in \mathbb{N}_0$, $s \in \mathbb{R}^\varsigma$ let $\phi^s_{m,j} \colon \mathbb{R}^\rho \times \Omega \to \mathbb{R}$ be the function which satisfies for all $\theta \in \mathbb{R}^\rho$, $\omega \in \Omega$ that

$$\phi^s_{m,j}(\theta, \omega) = \big\|\mathcal{Y}^{\theta,s,m,j}_N(\omega) - g\big(\mathcal{X}^{m,j}_N(\omega)\big)\big\|^2_{\mathbb{R}^k}, \tag{26}$$

for every $m, j \in \mathbb{N}_0$, $s \in \mathbb{R}^\varsigma$ let $\Phi^s_{m,j} \colon \mathbb{R}^\rho \times \Omega \to \mathbb{R}^\rho$ be a function which satisfies for all $\theta \in \mathbb{R}^\rho$, $\omega \in \Omega$ that

$$\Phi^s_{m,j}(\theta, \omega) = \big(\nabla_\theta \phi^s_{m,j}\big)(\theta, \omega), \tag{27}$$

let $\mathcal{S} \colon \mathbb{R}^\varsigma \times \mathbb{R}^\rho \times (\mathbb{R}^d)^{\{0,1,\dots,N-1\} \times \mathbb{N}} \to \mathbb{R}^\varsigma$ be a function, for every $m \in \mathbb{N}$ let $\Psi_m \colon \mathbb{R}^\varrho \times (\mathbb{R}^\rho)^{\mathbb{N}} \to \mathbb{R}^\varrho$ and $\psi_m \colon \mathbb{R}^\varrho \to \mathbb{R}^\rho$ be functions, and let $\mathbb{S} \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^\varsigma$, $\Xi \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^\varrho$, and $\Theta \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^\rho$ be stochastic processes which satisfy for all $m \in \mathbb{N}$ that

$$\mathbb{S}_m = \mathcal{S}\big(\mathbb{S}_{m-1}, \Theta_{m-1}, \{\mathcal{X}^{m-1,i}_n\}_{(n,i) \in \{0,1,\dots,N-1\} \times \mathbb{N}}\big), \tag{28}$$

$$\Xi_m = \Psi_m\big(\Xi_{m-1}, \{\Phi^{\mathbb{S}_m}_{m-1,j}(\Theta_{m-1})\}_{j \in \mathbb{N}}\big), \qquad \text{and} \qquad \Theta_m = \Theta_{m-1} - \psi_m(\Xi_m). \tag{29}$$

### 3.3 Comments on the proposed algorithm

The dynamics in (24) associated to the stochastic processes $\mathcal{X}^{m,j}$, $m, j \in \mathbb{N}_0$, allow us to incorporate different algorithms for the discretization of the considered forward stochastic differential equation (SDE) into the deep BSDE solver in Subsection 3.2. The dynamics in (29) associated to the stochastic processes $\Xi_m$, $m \in \mathbb{N}_0$, and $\Theta_m$, $m \in \mathbb{N}_0$, allow us to incorporate different stochastic approximation algorithms, such as

• stochastic gradient descent with or without mini-batches (see Subsection 5.1 below) as well as

• adaptive moment estimation (Adam) with mini-batches (see Kingma & Ba [22] and Subsection 5.2 below),

into the deep BSDE solver in Subsection 3.2.

The dynamics in (28) associated to the stochastic process $\mathbb{S}_m$, $m \in \mathbb{N}_0$, allow us to incorporate the standardization procedure of batch normalization (see Ioffe & Szegedy [21] and also Section 4 below) into the deep BSDE solver in Subsection 3.2. In that case we think of $\mathbb{S}_m$, $m \in \mathbb{N}_0$, as approximately calculated means and standard deviations.

## 4 Examples for nonlinear partial differential equations (PDEs) and nonlinear backward stochastic differential equations (BSDEs)

In this section we illustrate the algorithm proposed in Subsection 3.2 using several concrete example PDEs. In the examples below we employ the general approximation method in Subsection 3.2 in conjunction with the Adam optimizer (cf. Example 5.2 below and Kingma & Ba [22]) with mini-batches in each iteration step (see Subsection 4.1 for a detailed description).

In our implementation we employ fully-connected feedforward neural networks to represent the functions in (11) (cf. also Figure 1 below for a rough sketch of the architecture of the deep BSDE solver). Each of the neural networks consists of $4$ layers ($1$ input layer [$d$-dimensional], $2$ hidden layers [both $(d+10)$-dimensional], and $1$ output layer [$d$-dimensional]). The number of hidden units in each hidden layer is thus equal to $d + 10$. We also adopt batch normalization (BN) (see Ioffe & Szegedy [21]) right after each matrix multiplication and before activation. We employ the rectifier function $\mathbb{R} \ni x \mapsto \max\{0, x\} \in [0,\infty)$ as our activation function for the hidden variables. All the weights in the network are initialized using a normal or a uniform distribution without any pre-training. Each of the numerical experiments presented below is performed in Python using TensorFlow on a Macbook Pro with an Intel Core i5 microprocessor and 16 gigabytes (GB) of 1867 Megahertz (MHz) double data rate type three synchronous dynamic random-access memory (DDR3-SDRAM). We also refer to the Python code 1 in Subsection 6.1 below for an implementation of the deep BSDE solver in the case of the $100$-dimensional Allen-Cahn PDE (35).
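The layer structure just described (matrix multiplication, then batch normalization, then rectifier) can be sketched with plain NumPy. This is an illustrative forward pass only; training, the moving averages for the batch-norm statistics at inference time, and the TensorFlow specifics are omitted.

```python
import numpy as np

def batch_norm(h, gamma, beta, eps=1e-6):
    """Standardize over the batch, then apply the learned componentwise
    affine transformation (multiplication by a diagonal matrix and
    addition of a vector), as in Ioffe & Szegedy."""
    mu = h.mean(axis=0)
    sigma = h.std(axis=0)
    return gamma * (h - mu) / (sigma + eps) + beta

def subnetwork_forward(x, params):
    """Forward pass of one subnetwork: d -> (d+10) -> (d+10) -> d, with
    batch normalization right after each matrix multiplication and the
    rectifier max(0, .) on the hidden layers.  `x` has shape (batch, d)."""
    h = batch_norm(x @ params['W1'], params['g1'], params['b1'])
    h = np.maximum(h, 0.0)                       # rectifier activation
    h = batch_norm(h @ params['W2'], params['g2'], params['b2'])
    h = np.maximum(h, 0.0)
    h = batch_norm(h @ params['W3'], params['g3'], params['b3'])
    return h                                     # no activation on the output
```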

### 4.1 Setting

Assume the setting in Subsection 3.2, assume $k = 1$, identify $\mathbb{R}^{2\rho}$ with $\mathbb{R}^\rho \times \mathbb{R}^\rho$, let $\mu \colon [0,T] \times \mathbb{R}^d \to \mathbb{R}^d$ and $\sigma \colon [0,T] \times \mathbb{R}^d \to \mathbb{R}^{d \times d}$ be functions, let $u \colon [0,T] \times \mathbb{R}^d \to \mathbb{R}$ be a continuous and at most polynomially growing function which satisfies for all $t \in [0,T)$, $x \in \mathbb{R}^d$ that $u|_{[0,T) \times \mathbb{R}^d} \in C^{1,2}([0,T) \times \mathbb{R}^d, \mathbb{R})$, $u(T,x) = g(x)$, and

$$\frac{\partial u}{\partial t}(t,x) + \tfrac{1}{2}\,\mathrm{Trace}\big(\sigma(t,x)[\sigma(t,x)]^*(\mathrm{Hess}_x u)(t,x)\big) + \big\langle \mu(t,x), (\nabla_x u)(t,x) \big\rangle_{\mathbb{R}^d} + f\big(t, x, u(t,x), [\sigma(t,x)]^*(\nabla_x u)(t,x)\big) = 0, \tag{30}$$

let $\varepsilon \in (0,\infty)$, $\beta_1, \beta_2 \in [0,1)$, $(\gamma_m)_{m \in \mathbb{N}} \subseteq (0,\infty)$, $J \in \mathbb{N}$, let $\mathrm{Pow}_r \colon \mathbb{R}^\rho \to \mathbb{R}^\rho$, $r \in (0,\infty)$, be the functions which satisfy for all $r \in (0,\infty)$, $x = (x_1, \dots, x_\rho) \in \mathbb{R}^\rho$ that

$$\mathrm{Pow}_r(x) = \big(|x_1|^r, \dots, |x_\rho|^r\big), \tag{31}$$

and assume for all $m \in \mathbb{N}$, $x, y \in \mathbb{R}^\rho$, $(\varphi_j)_{j \in \mathbb{N}} \subseteq \mathbb{R}^\rho$ that

$$\Psi_m\big(x, y, (\varphi_j)_{j \in \mathbb{N}}\big) = \bigg(\beta_1 x + (1 - \beta_1)\Big[\tfrac{1}{J} \textstyle\sum_{j=1}^J \varphi_j\Big],\; \beta_2 y + (1 - \beta_2)\,\mathrm{Pow}_2\Big(\tfrac{1}{J} \textstyle\sum_{j=1}^J \varphi_j\Big)\bigg) \tag{32}$$

and

$$\psi_m(x, y) = \bigg[\varepsilon + \mathrm{Pow}_{1/2}\Big(\frac{y}{1 - (\beta_2)^m}\Big)\bigg]^{-1} \frac{\gamma_m\, x}{1 - (\beta_1)^m} \tag{33}$$

(cf. Example 5.2 below and Kingma & Ba [22]).
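The updates (32)–(33) are an Adam-type rule: (32) maintains exponential moving averages of the mini-batch gradient and of its componentwise square, and (33) applies the bias-corrected ratio. A NumPy sketch follows, with the conventional default values $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$ of Kingma & Ba assumed for concreteness:

```python
import numpy as np

def adam_update(theta, x, y, grads, m, gamma_m, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of the Adam-type rule (32)-(33): x and y are the running
    first and second moments, grads is the mini-batch of gradients phi_j,
    m the (1-based) step counter, and gamma_m the learning rate."""
    g = np.mean(grads, axis=0)                   # mini-batch average gradient
    x = beta1 * x + (1.0 - beta1) * g            # first moment, cf. (32)
    y = beta2 * y + (1.0 - beta2) * g ** 2       # second moment, cf. (32)
    x_hat = x / (1.0 - beta1 ** m)               # bias corrections, cf. (33)
    y_hat = y / (1.0 - beta2 ** m)
    theta = theta - gamma_m * x_hat / (eps + np.sqrt(y_hat))
    return theta, x, y
```

On the first step the bias corrections cancel the moving-average factors exactly, so the update is roughly a signed step of size `gamma_m` in each coordinate.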

###### Remark 4.1.

In this remark we illustrate the specific choice of the dimension of in the framework in Subsection 4.1 above.

1. The first component of is employed for approximating the real number .

2. The next -components of are employed for approximating the components of the -matrix .

3. In each of the employed neural networks we use components of

to describe the linear transformation from the

-dimensional first layer (input layer) to the -dimensional second layer (first hidden layer) (to uniquely describe a real -matrix).

4. In each of the employed neural networks we use components of to uniquely describe the linear transformation from the -dimensional second layer (first hidden layer) to the -dimensional third layer (second hidden layer) (to uniquely describe a real -matrix).

5. In each of the employed neural networks we use components of to describe the linear transformation from the -dimensional third layer (second hidden layer) to the -dimensional fourth layer (output layer) (to uniquely describe a real -matrix).

6. After each of the linear transformations in items (iii)–(v) above we employ a componentwise affine linear transformation (multiplication with a diagonal matrix and addition of a vector) within the batch normalization procedure, i.e., in each of the employed neural networks, we use components of for the componentwise affine linear transformation between the first linear transformation (see item (iii)) and the first application of the activation function, we use components of for the componentwise affine linear transformation between the second linear transformation (see item (iv)) and the second application of the activation function, and we use components of for the componentwise affine linear transformation after the third linear transformation (see item (v)).

Summing (i)–(vi) results in

 ρ=1+ditems~{}% (???)--(???)+(N−1)(d(d+10)+(d+10)2+d(d+10))items~{}(???)--(???)+(N−1)(2(d+10)+2(d+10)+2d)% item~{}(???)=d+1+(N−1)(2d(d+10)+(d+10)2+4(d+10)+2d). (34)
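The count (34) is easy to script; the helper below (our own convenience function, not part of the paper) reproduces the closed form:

```python
def num_parameters(d, N):
    """Total parameter count rho from (34): one scalar for u(0, xi), d
    components for its gradient, and N-1 subnetworks, each with three
    weight matrices and three batch-normalization affine maps."""
    per_network = (
        d * (d + 10)         # input layer  -> first hidden layer, item (iii)
        + (d + 10) ** 2      # first hidden -> second hidden layer, item (iv)
        + (d + 10) * d       # second hidden -> output layer, item (v)
        + 2 * (d + 10) + 2 * (d + 10) + 2 * d   # batch-norm affine maps, item (vi)
    )
    return 1 + d + (N - 1) * per_network
```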

### 4.2 Allen-Cahn equation

In this section we test the deep BSDE solver in the case of a $100$-dimensional Allen-Cahn PDE with a cubic nonlinearity (see (35) below).

More specifically, assume the setting in Subsection 4.1 with the functions and parameters chosen such that the solution $u$ of the PDE (30) satisfies for all $t \in [0,T)$, $x \in \mathbb{R}^d$ that $u(T,x) = g(x)$ and

$$\frac{\partial u}{\partial t}(t,x) + u(t,x) - [u(t,x)]^3 + (\Delta_x u)(t,x) = 0. \tag{35}$$

Note that (35) corresponds to (30) with $\mu = 0$, $\sigma(t,x) = \sqrt{2}\,\mathrm{Id}_{\mathbb{R}^d}$, and the cubic nonlinearity $f(t,x,y,z) = y - y^3$.

In Table 1 we approximately calculate the mean of $\mathcal{U}^{\Theta_m}$, the standard deviation of $\mathcal{U}^{\Theta_m}$, the relative $L^1$-approximation error associated to $\mathcal{U}^{\Theta_m}$, the standard deviation of this relative $L^1$-approximation error, and the runtime in seconds needed to calculate one realization of $\mathcal{U}^{\Theta_m}$, against $m$, based on independent realizations (independent runs) (see also the Python code 1 below). Table 1 also depicts the mean and the standard deviation of the loss function associated to $\Theta_m$ against $m$, based on Monte Carlo samples and independent realizations (independent runs). In addition, the relative $L^1$-approximation error associated to $\mathcal{U}^{\Theta_m}$ against $m$ is pictured on the left hand side of Figure 2 and the mean of the loss function associated to $\Theta_m$ against $m$ is pictured on the right hand side of Figure 2, in both cases based on independent realizations (independent runs). In the approximative computations of the relative $L^1$-approximation errors in Table 1 and Figure 2, the value $u(0,\xi)$ of the exact solution of the PDE (35) is replaced by a reference value which, in turn, is calculated by means of the Branching diffusion method (see the Matlab code 2 below and see, e.g., [17, 19, 18] for analytical and numerical results on the Branching diffusion method in the literature).
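For a small-scale illustration of this example, the sketch below evaluates the terminal loss for the Allen-Cahn dynamics under the crude ansatz $Z = 0$. The nonlinearity $f(y) = y - y^3$ is read off from (35), while the specific terminal condition $g$ is a hypothetical stand-in, not necessarily the one used in the experiments.

```python
import numpy as np

def f(y):
    """Cubic nonlinearity read off from (35)."""
    return y - y ** 3

def g(x):
    """Hypothetical terminal condition of Allen-Cahn type (an assumption
    made for illustration, not necessarily the one in the experiments)."""
    return 1.0 / (2.0 + 0.4 * np.sum(x ** 2))

def terminal_loss(y0, d, T, N, n_paths, rng):
    """Monte Carlo estimate of E|Y_T - g(X_T)|^2 under the crude ansatz
    Z = 0 (no gradient network), started from Y_0 = y0 and xi = 0.  Note
    that (35) carries the full Laplacian, so X_t = xi + sqrt(2) W_t."""
    dt = T / N
    losses = []
    for _ in range(n_paths):
        x = np.zeros(d)
        y = y0
        for _ in range(N):
            dW = rng.normal(0.0, np.sqrt(dt), size=d)
            y = y - f(y) * dt      # with Z = 0 the stochastic term vanishes
            x = x + np.sqrt(2.0) * dW
        losses.append((y - g(x)) ** 2)
    return float(np.mean(losses))
```

The deep BSDE solver replaces the crude ansatz $Z = 0$ by the trained subnetworks, which drives this terminal loss toward zero.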

### 4.3 A Hamilton-Jacobi-Bellman (HJB) equation

In this subsection we apply the deep BSDE solver in Subsection 3.2 to a Hamilton-Jacobi-Bellman (HJB) equation which admits an explicit solution that can be obtained through the Cole-Hopf transformation (cf., e.g., Chassagneux & Richou [7, Section 4.2] and Debnath [10, Section 8.4]).

Assume the setting in Subsection 4.1 with the functions and parameters chosen such that the solution $u$ of the PDE (30) satisfies for all $t \in [0,T)$, $x \in \mathbb{R}^d$ that $u(T,x) = g(x)$ and

$$\frac{\partial u}{\partial t}(t,x) + (\Delta_x u)(t,x) = \big\|(\nabla_x u)(t,x)\big\|^2_{\mathbb{R}^d}. \tag{36}$$

In Table 2 we approximately calculate the mean of $\mathcal{U}^{\Theta_m}$, the standard deviation of $\mathcal{U}^{\Theta_m}$, the relative $L^1$-approximation error associated to $\mathcal{U}^{\Theta_m}$, the standard deviation of this relative $L^1$-approximation error, and the runtime in seconds needed to calculate one realization of $\mathcal{U}^{\Theta_m}$, against $m$, based on independent realizations (independent runs). Table 2 also depicts the mean and the standard deviation of the loss function associated to $\Theta_m$ against $m$, based on Monte Carlo samples and independent realizations (independent runs). In addition, the relative $L^1$-approximation error associated to $\mathcal{U}^{\Theta_m}$ against $m$ is pictured on the left hand side of Figure 3 and the mean of the loss function associated to $\Theta_m$ against $m$ is pictured on the right hand side of Figure 3, in both cases based on independent realizations (independent runs). In the approximative computations of the relative $L^1$-approximation errors in Table 2 and Figure 3, the value $u(0,\xi)$ of the exact solution of the PDE (36) is replaced by a reference value which, in turn, is calculated by means of Lemma 4.2 below and a classical Monte Carlo method (see the Matlab code 3 below).
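The Cole-Hopf transformation mentioned above turns (36) into a heat equation for $e^{-u}$, which yields the representation $u(0,x) = -\ln \mathbb{E}\big[\exp\big(-g(x + \sqrt{2}\,W_T)\big)\big]$ (cf. Lemma 4.2). This is straightforward to evaluate by a classical Monte Carlo method, as sketched below:

```python
import numpy as np

def hjb_reference(g, x, T, d, n_samples, rng):
    """Monte Carlo evaluation of the Cole-Hopf representation
    u(0, x) = -ln E[exp(-g(x + sqrt(2) W_T))] for the HJB equation (36).
    `g` maps a batch of points of shape (n, d) to values of shape (n,)."""
    w = rng.normal(0.0, np.sqrt(T), size=(n_samples, d))   # samples of W_T
    vals = np.exp(-g(x + np.sqrt(2.0) * w))
    return float(-np.log(np.mean(vals)))
```

As a sanity check, for a constant terminal condition $g \equiv c$ the representation returns exactly $c$, independently of the samples.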