1 Introduction
This paper is devoted to the resolution in high dimension of fully nonlinear parabolic partial differential equations (PDEs) of the form
(1.1) 
with a nonlinearity in the solution, its gradient and its hessian via the function defined on (where is the set of symmetric matrices), and a terminal condition .
The numerical resolution of this class of PDEs is far more difficult than the one of classical semilinear PDEs where the nonlinear function does not depend on . In fact, rather few methods are available to solve fully nonlinear equations even in moderate dimension.

First based on the work of [cheridito2007second], an effective scheme developed in [fahim2011probabilistic] using some regression techniques has been shown to be convergent under some ellipticity conditions later removed by [tan2013splitting]. Due to the use of basis functions, this scheme does not permit to solve PDE in dimension greater than 5.

A scheme based on nesting Monte Carlo has been recently proposed in [warin2018monte]. It seems to be effective in very high dimension for maturities not too long and linearities not too important.

A numerical algorithm to solve fully nonlinear equations has been proposed by [beck2017machinela] based on the representation of [cheridito2007second] and global deep neural networks minimizing a terminal objective function, but no test on real fully nonlinear case is given.

The Deep Galerkin method proposed in [sirignano2018dgm]
based on some machine learning techniques and using some numerical differentiation of the solution seems to be effective on some cases. It has been tested in
[al2018solving] for example on the Merton problem.
In this article, we introduce a numerical method based on machine learning techniques and backward in time iterations, which extends the proposed scheme for semilinear PDEs in the recent work [hure2019some]. A first idea to extend this work consists in using the representation proposed in [cheridito2007second] and applied numerically in [fahim2011probabilistic] and [beck2017machinela] to solve the problem: more precisely, at each time step of an Euler scheme, at is approximated by a neural network minimizing some local criterion associated to a BSDE involving at date and . Then, the pair at date is approximated/learned with a second minimization similarly as in the method described by [hure2019some]. The first minimization can be implemented with different variations but numerical results show that the global scheme does not scale well with the dimension. Instability on the calculation rapidly propagates during the backward resolution. Besides, the methodology appears to be costly when using two optimizations at each time step. An alternative approach that we develop here, is to combine the ideas of [hure2019some] and the splitting method in [beck2019deep]
in order to derive a new deep learning scheme that requires only one local optimization during the backward resolution for learning the pair
and approximating .The outline of the paper is organized as follows. In Section 2, we briefly recall the mathematical description of the classical feedforward approximation, and then derive the proposed neural networksbased backward scheme. We test our method in Section 3 on several examples: first we illustrate our results with a PDE involving a non linearity of type , then we consider a stochastic linear quadratic problem with controlled volatility where an analytic solution is available, and we test the performance and accuracy of our algorithm up to dimension . The convergence analysis of our scheme is postponed to a forthcoming work.
2 The proposed deep backward scheme
Our aim is to numerically approximate the function , assumed to be the unique smooth solution to the fully nonlinear PDE (1.1) under suitable conditions. This will be achieved by means of neural networks approximations for and its gradient , relying on a backward scheme and training simulated data of some forward diffusion process.
2.1 Feedforward neural network to approximate functions
We denote by the dimension of the input variables, and the dimension of the output variable. A (deep) neural network is characterized by a number of layers with ,
, the number of neurons (units or nodes) on each layer: the first layer is the input layer with
, the last layer is the output layer with , and the layers between are called hidden layers, where we choose for simplicity the same dimension , .A feedforward neural network is a function from to defined as the composition
(2.1) 
Here , are affine transformations: maps from to , map from to , and maps from to , represented by
(2.2) 
for a matrix
called weight, and a vector
called bias term,is a nonlinear function, called activation function, and applied componentwise on the outputs of
, i.e.,. Standard examples of activation functions are the sigmoid, the ReLu, the Elu,
.All these matrices and vectors , , are the parameters of the neural network, and can be identified with an element , where is the number of parameters. We denote by the set of all functions generated by (2.1) for .
2.2 Forwardbackward representation
Let us introduce a forward diffusion process
(2.3) 
where is a function defined on with values in , is a function defined on with values in the set of matrices, and a
dimensional Brownian motion on some probability space
equipped with a filtration satisfying the usual conditions. The process will be used for the simulation of training data in our deep learning algorithm, and we shall discuss later the choice of the drift and diffusion coefficients and , see Remark 2.2.Let us next denote by the triple of adapted processes valued in , defined by
(2.4) 
By Itô’s formula applied to , and since is solution to (1.1), we see that satisfies the backward equation:
(2.5) 
2.3 Algorithm
We now provide a numerical approximation of the forward backward system (2.3)(2.5), and consequently of the solution (as well as its gradient ) to the PDE (1.1).
We start from a time grid of , with , and time steps , . The time discretization of the forward process on is then equal (typically when and are constants) or approximated by an Euler scheme:
where we set (by misuse of notation, we keep the same notation for the continuous time diffusion process and its Euler scheme). The backward SDE (2.5) is approximated by the time discretized scheme
that is written in forward form as
(2.6) 
with
(2.7)  
The idea of the proposed scheme is the following. Similarly as in [hure2019some], we approximate at each time , and its gradient , by neural networks with parameter that are learned optimally by backward induction: suppose that , is an approximation of and at time , then is computed from the minimization of the quadratic loss function:
(2.8) 
where is a truncation operator such that
is bounded for example by a quantile of the diffusion process and
stands for the numerical differentiation of . The truncation permits to avoid that the oscillations of the neural network fitted in zone where the simulations are scarce propagate to areas of importance.The intuition for the relevance of this scheme to the approximation of the PDE (1.1) is the following. From (2.4) and (2.6), the solution to (1.1) should approximately satisfy
Suppose that at time , is an estimation of . Recalling the expression of in (2.7), the quadratic loss function at time is then approximately equal to
By assuming that has small nonlinearities in its arguments , say Lipschitz, possibly with a suitable choice of , , the loss function is thus approximately equal to
Therefore, by minimizing over
this quadratic loss function, via stochastic gradient descent (SGD) based on simulations of
(called training data in the machine learning language), one expects the neural networks and to learn/approximate better and better the functions and in view of the universal approximation theorem for neural networks. The rigorous convergence of this algorithm is postponed to a future work.To sum up, the global algorithm is given in Algo 1 in the case where is Lipschitz and the derivative can be analytically calculated almost everywhere. If the derivative of is not available, it can be calculated by numerical differentiation.
(2.9) 
(2.10) 
Remark 2.1
A variation in the algorithm consists in using two neural networks for and instead of one.
Remark 2.2
The diffusion process is used for the training simulations in the stochastic gradient descent method for finding the minimizer of the quadratic loss function in (2.10). The choice of the drift coefficient is typically related to the underlying probabilistic problem associated to the PDE (for example a stochastic control problem), but does not really matter. The choice of the diffusion coefficient is more delicate: large induces a better exploration of the state space, but would require a lot of neurons. Moreover, for the applications in stochastic control, we might explore some region that are visited with very small probabilities by the optimal state process, hence representing few interest. On the other hand, small means a weak exploration, and we might lack information and precision on some region of the state space. In practice and for the numerical examples in the next section, we test the scheme for different and by varying the number of time steps, and if it converges to the same solution, one can consider that we have obtained the correct solution.
3 Numerical results
We first construct an example with different non linearities in the Hessian term and the solution. We graphically show that the solution is very well calculated in dimension one and then move to higher dimensions. We next use an example derived from a stochastic optimization problem with an analytic solution and show that we are able to accurately calculate the solution.
In the whole numerical part, we use a classical Feed Forward Network using layers with neurons each and a activation function, the output layer uses an identity activation function. At each time step the resolution of equation (2.10) is achieved using some minibatch with trajectories. Every 50 inner iterations the convergence rate is checked using trajectories and an adaptation of the learning rate is achieved using an Adams gradient descent. Notice that the adaptation of the learning rate is not common with the Adams method but in our case it appears to be crucial to have a steady diminution of the loss of the objective function.
During time resolution, it is far more effective to initialize the solution of equations (2.10) with the solution at the next time step and to use an initial learning rate . The number of outer iterations is fixed for each optimization. It is set to for the first optimization at date and then a value of outer iteration is used at the dates
. All experiments have achieved using Tensorflow
[2015tensorflow]. In the sequel, the PDE solutions on curves are calculated as the average of 10 runs. Each time we provide the curves giving the standard deviation observed for the given results. We also show the impact of the choice of the diffusion coefficient
, and the influence of the number of neurons on the accuracy of the results.3.1 A non linearity in
We consider the case where ,
and , so that an analytical solution is available:
We choose to evaluate the solution at and , for which while its derivative is equal to . This initial value is chosen such that independently of the dimension the solution is varying around this point and not in a region where the function is close to or .
The coefficients of the forward process used to solve the equation are
and the truncation operator indexed by a parameter is chosen equal to
where ,
is the CDF of a unit centered Gaussian random variable. In the numerical results we take
and neurons.We first begin in dimension one, and show in figure 1 how , and are well approximated by the resolution method.
On figure 2, we check the convergence, for different values of of both the value function and its derivative at point and date . Standard deviation of the function value is very low and the standard deviation of the derivative still being low.
As the dimension increases, we have to increase the value of of the forward process. In dimension 3, the value gives high standard deviation in the result obtained as shown on figure 3, while in dimension 10, see Figure 4, we see that the value is too low to give good results. We also clearly notice that in 10D, a smaller time step should be used but in our test cases we decided to consider a maximum number of time steps equal to 160.
On this simple test case, the dimension is not a problem and very good results are obtained in dimension 20 or above with only 20 neurons and 2 layers.
3.2 A linear quadratic stochastic test case.
In this example, we consider a controlled process with dynamics in according to
where is a real Brownian motion, the control process is valued in , and the constant coefficients , , . The quadratic cost functional to be minimized is
where , are non negative symmetric matrices and is strictly positive.
The Bellman equation associated to this stochastic control problem is:
which can be rewritten as a fully nonlinear equation in the form (1.1) with
An explicit solution to this PDE is given by
where is non negative symmetric matrix function solution to the Riccati equation:
We take . The coefficients of the forward process used to solve the equation are
In our numerical example we take the following parameters for the optimization problem:
and we want to estimate the solution at .
In this example, the truncation operator (indexed by between and and close to ) is as follows:
where , is a vector so that , , is a unit vector, and the square root is taken componentwise.
On figure 5 we give the solution of the PDE using obtained for two dates: at and at close to zero. We observe that we have a very good estimation of the function value and a correct one of the value at date . The precision remains good for close to and very good for and .
On figures 6, we give the results obtained in dimension one by varying . For a value of , the standard deviation of the result becomes far higher than with or
On figure 7, we take a quite low truncation factor and observe that the number of neurons to take has to be rather high. We have also checked that taking a number of hidden layers equal to 3 does not improve the results.
On figure 8, we give the same graphs for a truncation factor higher. As we take a higher truncation factor the number of neurons to use has to be increased to .
On figure 9, we observe in dimension 7 the influence of the number of neurons on the result for a high truncation factor . With a number of neurons equal to , we clearly have a bias disappearing with a number of neurons equal to . We had to take higher values of to get good results.
On figure 10, we check that influence of the truncation factor appears to be slow for higher dimensions.
Finally, we give results in dimension 10, 15 and 20 for on figures 11, 12, 13. We observe that the number a neurons with 2 hidden layers has to increase with the dimension but also that the increase is rather slow in contrast with the case of one hidden layer as theoretically shown in [pinkus1999approximation]. For we had to take 300 neurons to get very accurate results.