    # Some machine learning schemes for high-dimensional nonlinear PDEs

We propose new machine learning schemes for solving high dimensional nonlinear partial differential equations (PDEs). Relying on the classical backward stochastic differential equation (BSDE) representation of PDEs, our algorithms estimate simultaneously the solution and its gradient by deep neural networks. These approximations are performed at each time step from the minimization of loss functions defined recursively by backward induction. The methodology is extended to variational inequalities arising in optimal stopping problems. We analyze the convergence of the deep learning schemes and provide error estimates in terms of the universal approximation of neural networks. Numerical results show that our algorithms give very good results till dimension 50 (and certainly above), for both PDEs and variational inequalities problems. For the PDEs resolution, our results are very similar to those obtained by the recent method in weinan2017deep when the latter converges to the right solution or does not diverge. Numerical tests indicate that the proposed methods are not stuck in poor local minimaas it can be the case with the algorithm designed in weinan2017deep, and no divergence is experienced. The only limitation seems to be due to the inability of the considered deep neural networks to represent a solution with a too complex structure in high dimension.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

This paper is devoted to the resolution in high dimension of nonlinear parabolic partial differential equations (PDEs) of the form

 {∂tu+Lu+f(.,.,u,σ⊺Dxu)=0, on [0,T)×Rd,u(T,.)=g, on Rd, (1.1)

with a non-linearity in the solution and its gradient via the function defined on , a terminal condition , and a second-order generator defined by

 Lu :=12Tr(σσ⊺D2xu)+μ.Dxu. (1.2)

Here is a function defined on with values in , is a function defined on with values in the set of matrices, and is the generator associated to the forward diffusion process:

 Xt=x0+∫t0μ(s,Xs)ds+∫t0σ(s,Xs)dWs,0≤t≤T, (1.3)

with a

-dimensional Brownian motion on some probability space

equipped with a filtration satisfying the usual conditions.

Due to the so called “curse of dimensionality”, the resolution of nonlinear PDEs in high dimension has always been a challenge for scientists. Until recently, only the BSDE (Backward Stochastic Differential Equation) approach first developed in

[pardoux1990adapted] was available to tackle this problem: using the time discretization scheme proposed in [bouchard2004discrete], some effective algorithms based on regressions manage to solve non linear PDEs in dimension above 4 (see [gobet2005regression, lemor2006rate]). However this approach is still not implementable in dimension above 6 or 7 : the number of basis functions used for the regression still explodes with the dimension.

Quite recently some new methods have been developed for this problem, and three kind of methodologies have emerged:

• Some are based on the Feyman-Kac representation of the PDE. Branching techniques [henry2016branching] have been studied and shown to be convergent but only for small maturities and some small nonlinearities. Some effective techniques based on nesting Monte Carlo have been studied in [warin2018nestingsMC, warin2018monte]: the convergence is proved for semi-linear equations. Still based on this Feyman-Kac representation some machine learning techniques permitting to solve a fixed point problem have been used recently in [chan2018machine]: numerical results show that it is efficient and some partial demonstrations justify why it is effective.

• Another class of methods is based on the BSDE approach and the curse of dimensionality issue is partially avoided by using some machine learning techniques. [han2017overcoming, weinan2017deep] propose a deep learning based technique called Deep BSDE. Based on an Euler discretization of the forward underlying SDE , the idea is to view the BSDE as a forward SDE, and the algorithm tries to learn the values and at each time step of the Euler scheme by minimizing a global loss function between the forward simulation of till maturity and the target .

• At last, using some machine learning representation of the solution, [sirignano2018dgm] proposes to use the automatic numerical differentiation of the solution to solve the PDE on a finite domain.

Like the second methodology, our approach relies on BSDE representation of the PDE and deep learning approximations: we first discretize the BSDE associated to the PDE by an Euler scheme, but in contrast with [weinan2017deep], we adopt a classical backward resolution technique. On each time step, we propose to use some machine learning techniques to estimate simultaneously the solution and its gradient by minimizing a loss function defined recursively by backward induction, and solving this local problem by a stochastic gradient algorithm. Two different schemes are designed to deal with the local problems:

• The first one tries the estimate the solution and its gradient by a neural network.

• The second one tries only to approximate the solution by a neural network while its gradient is estimated directly with some numerical differentiation techniques.

The proposed methodology is then extended to solve some variational inequalities, i.e., free boundary problems related to optimal stopping problems.

Convergence analysis of the two schemes for PDEs and variational inequalities is provided and shows that the approximation error goes to zero as we increase the number of time steps and the number of neurons/layers whenever the gradient descent method used to solve the local problems is not trapped in a local minimum.

In the last part of the paper, we test our algorithms on different examples. When the solution is easy to represent by a neural network, we can solve the problem in quite high dimension (at least in our numerical tests). We show that the proposed methodology is superior to the algorithm proposed in [han2017overcoming] that often does not converge or is trapped in a local minimum far away from the true solution. We then show that when the solution has a very complex structure, we can still solve the problem but only in moderate dimension: the neural network used is not anymore able to represent the solution accurately in very high dimension. Finally, we illustrate numerically that the method is effective to solve some system of variational inequalities: we consider the problem of American options and show that it can be solved very accurately in high dimension (we tested until ).

The outline of the paper is organized as follows. In Section 2, we give a brief and useful reminder for neural networks. We describe in Section 3 our two numerical schemes and compare with the algorithm in [han2017overcoming]. Section 4 is devoted to the convergence analysis of our machine learning algorithms, and we present in Section 5 several numerical tests.

## 2 Neural networks as function approximators

Multilayer (also called deep) neural networks are designed to approximate unknown or large class of functions. In contrast to additive approximation theory with weighted sum over basis functions, e.g. polynomials, neural networks rely on the composition of simple functions, and appear to provide an efficient way to handle high-dimensional approximation problems, in particular thanks to the increase in computer power for finding the “optimal” parameters by (stochastic) gradient descent methods.

We shall consider feedforward (or artificial) neural networks, which represent the basic type of deep neural networks. Let us recall some notation and basic definitions that will be useful in our context. We fix the input dimension (here the dimension of the state variable ), the output dimension (here for approximating the real-valued solution to the PDE, or

for approximating the vector-valued gradient function), the global number

of layers with , , the number of neurons (units or nodes) on each layer: the first layer is the input layer with , the last layer is the output layer with , and the layers between are called hidden layers, where we choose for simplicity the same dimension , .

A feedforward neural network is a function from to defined as the composition

 x∈Rd ⟼AL∘ϱ∘AL−1∘…∘ϱ∘A1(x)∈Rd1. (2.1)

Here , are affine transformations: maps from to , map from to , and maps from to , represented by

 Aℓ(x) =Wℓx+βℓ, (2.2)

for a matrix called weight, and a vector called bias term,

is a nonlinear function, called activation function, and applied component-wise on the outputs of

, i.e.,

. Standard examples of activation functions are the sigmoid, the ReLu, the Elu,

.

All these matrices and vectors , , are the parameters of the neural network, and can be identified with an element , where is the number of parameters, where we fix , , , but allow growing number of hidden neurons. We denote by the set of possible parameters: in the sequel, we shall consider either the case when there are no constraints on parameters, i.e., , or when the total variation norm of the neural networks is smaller than , i.e.,

 Θm =Θγm:={θ=(Wℓ,βℓ)ℓ:|Wl|≤γm,ℓ=1,…,L}, withγm↗∞, as m→∞. (2.3)

We denote by the neural network function defined in (2.1), and by the set of all such neural networks for , and set

 NNϱd,d1,L = ⋃m∈NNNϱd,d1,L,m(Θm)=⋃m∈NNNϱd,d1,L,m(RNm),

as the class of all neural networks within a fixed structure given by , , and .

The fundamental result of Hornick et al. [horetal89] justifies the use of neural networks as function approximators:

Universal approximation theorem (I): is dense in for any finite measure on , whenever is continuous and non-constant.

Moreover, we have a universal approximation result for the derivatives in the case of a single hidden layer, i.e. , and when the activation function is a smooth function, see [hor90universal].

Universal approximation theorem (II): Assume that is a (non constant) function. Then, approximates any function and its derivatives up to order , arbitrary well on any compact set of .

## 3 Deep learning-based schemes for semi-linear PDEs

The starting point for our probabilistic numerical schemes to the PDE (1.1) is the well-known (see [pardoux1990adapted]) nonlinear Feynman-Kac formula via the pair of -adapted processes valued in , solution to the BSDE

 Yt =g(XT)+∫Ttf(s,Xs,Ys,Zs)ds−∫TtZ⊺sdWs,0≤t≤T, (3.1)

related to the solution of (1.1) via

 Yt =u(t,Xt),0≤t≤T,

and when is smooth:

 Zt =σ⊺(t,Xt)Dxu(t,Xt),0≤t≤T.

### 3.1 The deep BSDE scheme of [han2017overcoming]

The DBSDE algorithm proposed in [han2017overcoming, weinan2017deep] starts from the BSDE representation (3.1) of the solution to (1.1), but rewritten in forward form as:

 u(t,Xt)= u(0,x0)−∫t0f(s,Xs,u(s,Xs),σ⊺(s,Xs)Dxu(s,Xs))ds (3.2) +∫t0Dxu(s,Xs)⊺σ(s,Xs)dWs,0≤t≤T. (3.3)

The forward process in equation (1.3), when it is not simulatable, is numerically approximated by an Euler scheme on a time grid: , with modulus , , and defined as

 Xti+1=Xti+μ(ti,Xti)Δti+σ(ti,Xti)ΔWti,i=0,…,N−1,X0=x0, (3.4)

where we set . To alleviate notations, we omit the dependence of on the time grid as there is no ambiguity (recall that we use the notation for the forward diffusion process). The approximation of equation (1.1) is then given formally from the Euler scheme associated to the forward representation (3.2) by

 u(ti+1,Xti+1) ≈F(ti,Xti,u(ti,Xti),σ⊺(ti,Xti)Dxu(ti,Xti),Δti,ΔWti) (3.5)

with

 F(t,x,y,z,h,Δ) :=y−f(t,x,y,z)h+z⊺Δ. (3.6)

In [han2017overcoming, weinan2017deep], the numerical approximation of is designed as follows: starting from an estimation of , and then using at each time step , , a multilayer neural network with parameter for the approximation of :

 Zi(x;θi) ≈σ⊺(ti,x)Dxu(ti,x), (3.7)

one computes estimations of by forward induction via:

 Ui+1 =F(ti,Xti,Ui,Zi(Xti;θi),Δti,ΔWti),

for . This algorithm forms a global deep neural network composed of the neural networks (3.7) of each period, by taking as input data (in machine learning language) the paths of and , and giving as output , which is a function of the input and of the total set of parameters . The output aims to match the terminal condition of the BSDE, and one then optimizes over the parameter the expected square loss function:

 θ

This is obtained by stochastic gradient descent-type (SGD) algorithms relying on training input data.

### 3.2 New schemes: DBDP1 and DBDP2

The proposed scheme is defined from a backward dynamic programming type relation, and has two versions:

• First version:

• Initialize from an estimation of with

• For , given , use a pair of deep neural networks for the approximation of , and compute (by SGD) the minimizer of the expected quadratic loss function

 ⎧⎪ ⎪⎨⎪ ⎪⎩^L(1)i(θ):=E∣∣ˆU(1)i+1(Xti+1)−F(ti,Xti,Ui(Xti;θ),Zi(Xti;θ),Δti,ΔWti)∣∣2θ∗i∈argminθ∈RNm^L1i(θ). (3.8)

Then, update: , and set .

• Second version:

• Initialize with

• For , given , use a deep neural network , and compute (by SGD) the minimizer of the expected quadratic loss function

 ⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩^L(2)i(θ):=E∣∣ˆU(2)i+1(Xti+1)−F(ti,Xti,Ui(Xti;θ),σ⊺(ti,Xti)^DxUi(Xti;θ),Δti,ΔWti)∣∣2θ∗i∈argminθ∈Θm^L2i(θ), (3.9)

where is the numerical differentiation of . Then, update: , and set .

###### Remark 3.1.

For the first version of the scheme, one can use independent neural networks, respectively for the approximation of and for the approximation of . In other words, the parameters are divided into a pair and we consider neural networks and .

In the sequel, we refer to the first and second version of the new scheme above as DBDP1 and DBDP2, where the acronym DBDP stands for deep learning backward dynamic programming.

The intuition behind DBDP1 and DBDP2 is the following. For simplicity, take , so that . The solution to the PDE (1.1) should then approximately satisfy (see (3.5))

 u(ti+1,Xti+1) ≈u(ti,Xti)+Dxu(ti,Xti)⊺σ(ti,Xti)ΔWti.

Consider the first scheme DBDP1, and suppose that at time , is an estimation of . The quadratic loss function at time is then approximately equal to

 ^L(1)i(θ) ≈E∣∣u(ti+1,Xti+1)−Ui(Xti;θ)−Zi(Xti;θ)⊺ΔWti∣∣2 ≈E[∣∣u(ti,Xti)−Ui(Xti;θ)∣∣2+Δti∣∣σ⊺(ti,Xti)Dxu(ti,Xti)−Zi(Xti;θ)∣∣2].

Therefore, by minimizing over this quadratic loss function, via SGD based on simulations of (called training data in the machine learning language), one expects the neural networks and to learn/approximate better and better the functions and in view of the universal approximation theorem [hor90universal]. Similarly, the second scheme DPDP2, which uses only neural network on the value functions, learns by means of the neural network , and via . The rigorous arguments for the convergence of these schemes will be derived in the next section.

The advantages of our two schemes, compared to the Deep BSDE algorithm, are the following:

• we solve smaller problems that are less prone to be trapped in local mimimizer. The memory needed in [han2017overcoming] can be a problem when taking too many time steps.

• at each time step, we initialize the weights and bias of the neural network to the weights and bias of the previous time step treated : this methodology always used in iterative solvers in PDE methods permits to have a starting point close to the solution, and then to avoid local minima too far away from the true solution. Besides the number of gradient iterations to achieve is rather small after the first resolution step.

The small disadvantage is due to the Tensorflow structure. As it is done in python, the global graph creation takes much time as it is repeated for each time step and the global resolution is a little bit time consuming : as the dimension of the problem increases, the time difference decreases and it becomes hard to compare the computational time for a given accuracy when the dimension is above 5.

### 3.3 Extension to variational inequalities: scheme RDBDP

Let us consider a variational inequality in the form

 {min[−∂tu−Lu−f(t,x,u,σ⊺Dxu),u−g]=0,t∈[0,T),x∈Rd,u(T,x)=g(x),x∈Rd. (3.10)

which arises, e.g., in optimal stopping problem and American option pricing in finance. It is known, see e.g. [elk97], that such variational inequality is related to reflected BSDE of the form

 Yt =g(XT)+∫Ttf(s,Xs,Ys,Zs)ds−∫TtZ⊺sdWs+KT−Kt, (3.11) Yt ≥g(Xt),0≤t≤T, (3.12)

where is an adapted non-decreasing process satisfying

 ∫T0(Yt−g(Xt))dKt =0. (3.13)

The extension of our DBDP1 scheme for such variational inequality, and refereed to as RDBDP scheme, becomes

• Initialize

• For , given , use a pair of (multilayer) neural network , and compute (by SGD) the minimizer of the expected quadratic loss function

 (3.14)

Then, update: , and set .

## 4 Convergence analysis

The main goal of this section is to prove convergence of the DBDP schemes towards the solution to the BSDE (3.1) (or reflected BSDE (3.11) for variational inequalities), and to provide a rate of convergence that depends on the approximation errors by neural networks.

### 4.1 Convergence of DBDP1

We assume the standard Lipschitz conditions on and , which ensures the existence and uniqueness of an adapted solution to the forward SDE (1.3) satisfying for any ,

 E[sup0≤t≤T|Xt|p]

for some constant depending only on , , and . Moreover, we have the well-known error estimate with the Euler scheme defined in (3.4) with a time grid , with modulus s.t. is bounded by a constant depending only on (hence independent of ):

 maxi=0,…,N−1E[|Xti+1−Xti+1|2+supt∈[ti,ti+1]|Xt−Xti|2] =O(|π|). (4.2)

Here, the standard notation means that .

We shall make the standing usual assumptions on the driver and the terminal data .

(H1) (i) There exists a constant such that the driver satisfies:

 |f(t2,x2,y2,z2)−f(t1,x1,y1,z1)|≤[f]L(|t2−t1|1/2+|x2−x1|+|y2−y1|+|z2−z1|), (4.3)

for all and . Moreover,

 sup0≤t≤T|f(t,0,0,0)|<∞.

(ii) The function satisfies a linear growth condition.

Recall that Assumption (H1) ensures the existence and uniqueness of an adapted solution to (3.1) satisfying

 E[sup0≤t≤T|Yt|2+∫T0|Zt|2dt] <∞. (4.4)

From the linear growth condition on in (H1), and (4.1), we also see that

 E[∫T0|f(t,Xt,Yt,Zt)|2dt] <∞. (4.5)

Moreover, we have the standard -regularity result on :

 maxi=0,…,N−1E[supt∈[ti,ti+1]|Yt−Yti|2] =O(|π|). (4.6)

Let us also introduce the -regularity of :

 εZ(π) :=E[N−1∑i=0∫ti+1ti|Zt−¯Zti|2dt], with ¯Zti:=1ΔtiEi[∫ti+1tiZtdt], (4.7)

where denotes the conditional expectation given . Since is a -projection of , we know that converges to zero when goes to zero. Moreover, as shown in [zhang04numerical], when the terminal condition is also Lipschitz, we have

 εZ(π) =O(|π|). (4.8)

Let us first investigate the convergence of the scheme DBDP1 in (3.8), and define (implicitly)

 ⎧⎪ ⎪⎨⎪ ⎪⎩ˆVti:=Ei[ˆU(1)i+1(Xti+1)]+f(ti,Xti,ˆVti,¯¯¯¯¯¯¯ˆZti)Δti¯¯¯¯¯¯¯ˆZti:=1ΔtiEi[ˆU(1)i+1(Xti+1)ΔWti], (4.9)

for . Notice that is well-defined for small enough (recall that is Lipschitz) by a fixed point argument. By the Markov property of the discretized forward process , we note that there exists some deterministic functions and s.t.

 ˆVti=^vi(Xti), and ¯¯¯¯¯¯¯ˆZti=¯¯¯¯¯^zi(Xti),i=0,…,N−1. (4.10)

Moreover, by the martingale representation theorem, there exists an -valued square integrable process such that

 ˆU(1)i+1(Xti+1) =ˆVti−f(ti,Xti,ˆVti,¯¯¯¯¯¯¯ˆZti)Δti+∫ti+1tiˆZ⊺sdWs, (4.11)

and by Itô isometry, we have

 ¯¯¯¯¯¯¯ˆZti =1ΔtiEi[∫ti+1tiˆZsds],i=0,…,N−1. (4.12)

Let us now define a measure of the (squared) error for the DBDP1 scheme by

 E[(ˆU(1),ˆZ(1)),(Y,Z)] :=maxi=0,…,N−1E∣∣Yti−ˆU(1)i(Xti)∣∣2+E[N−1∑i=0∫ti+1ti∣∣Zt−ˆZ(1)i(Xti)∣∣2dt]. (4.13)

Our first main result gives an error estimate of the DBDP1 scheme in terms of the -approximation errors of and by neural networks and , , assumed to be independent (see Remark 3.1), and defined as

 εN,vi:=infξE∣∣^vi(Xti)−Ui(Xti;ξ)∣∣2,εN,zi:=infηE∣∣¯¯¯¯¯^zi(Xti)−Zi(Xti;η)∣∣2. (4.14)

Here, we fix the structure of the neural networks with input dimension , output dimension for , and for , number of layers , and neurons for the hidden layers, and the parameters vary in the whole set where is the number of parameters. From the universal approximation theorem (I) ([horetal89]), we know that and converge to zero as goes to infinity, hence can be made arbitrary small for sufficiently large number of neurons.

###### Theorem 4.1.

(Consistency of DBDP1) Under (H1), there exists a constant , independent of , such that

 E[(ˆU(1),ˆZ(1)),(Y,Z)] ≤C(E∣∣g(XT)−g(XT)∣∣2+|π|+εZ(π) (4.15) +N−1∑i=0(NεN,vi+εN,zi)). (4.16)
###### Remark 4.1.

The error contributions for the DBDP1 scheme in the r.h.s. of estimation (4.16) consists of four terms. The first three terms correspond to the time discretization of BSDE, similarly as in [bouchard2004discrete], [gobet2005regression], namely (i) the strong approximation of the terminal condition (depending on the forward scheme and the terminal data ), and converging to zero, as goes to zero, with a rate when is Lipschitz by (4.2) (see [avi09] for irregular ), (ii) the strong approximation of the forward Euler scheme, and the -regularity of , which gives a convergence of order , (iii) the -regularity of , which converges to zero, as goes to zero, with a rate when is Lipschitz. Finally, the better the neural networks are able to approximate/learn the functions and at each time , the smaller is the last term in the error estimation.

Proof of Theorem 4.1.

In the following, will denote a positive generic constant independent of , and that may take different values from line to line.

Step 1. Fix , and observe by (3.1), (4.9) that

 Yti−ˆVti =Ei[Yti+1−ˆU(1)i+1(Xti+1)]+Ei[∫ti+1tif(t,Xt,Yt,Zt)−f(ti,Xti,ˆVti,¯¯¯¯¯¯¯ˆZti)dt]. (4.17)

By using Young inequality: for some to be chosen later, Cauchy-Schwarz inequality, the Lipschitz condition on in (H1), and the estimation (4.2) on the forward process, we then have

 E∣∣Yti−ˆVti∣∣2 (4.18) +4[f]2LΔti(1+1γΔti){|Δti|2+E[∫ti+1ti∣∣Yt−ˆVti∣∣2dt] (4.19)