1 Introduction
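Let us consider the following discrete-time stochastic control problem over a finite horizon $N \in \mathbb{N} \setminus \{0\}$. The dynamics of the controlled state process $X^\alpha = (X_n^\alpha)_n$, valued in $\mathcal{X} \subset \mathbb{R}^d$, is given by
$$X_{n+1}^\alpha = F(X_n^\alpha, \alpha_n, \varepsilon_{n+1}), \quad n = 0, \ldots, N-1, \qquad X_0^\alpha = x_0 \in \mathcal{X}, \tag{1.1}$$
where $(\varepsilon_n)_n$ is a sequence of i.i.d. random variables valued in some Borel space $(E, \mathcal{B}(E))$ and defined on some probability space $(\Omega, \mathcal{F}, \mathbb{P})$ equipped with the filtration $\mathbb{F} = (\mathcal{F}_n)_n$ generated by the noise $(\varepsilon_n)_n$ ($\mathcal{F}_0$ is the trivial $\sigma$-algebra), the control $\alpha = (\alpha_n)_n$ is an $\mathbb{F}$-adapted process valued in $A \subset \mathbb{R}^q$, and $F$ is a measurable function from $\mathbb{R}^d \times \mathbb{R}^q \times E$ into $\mathbb{R}^d$.
Given a running cost function $f$ defined on $\mathbb{R}^d \times \mathbb{R}^q$ and a terminal cost function $g$ defined on $\mathbb{R}^d$, the cost functional associated with a control process $\alpha$ is
$$J(\alpha) = \mathbb{E}\Big[ \sum_{n=0}^{N-1} f(X_n^\alpha, \alpha_n) + g(X_N^\alpha) \Big].$$
The set $\mathcal{C}$ of admissible controls is the set of control processes $\alpha$ satisfying some integrability conditions ensuring that the cost functional $J(\alpha)$ is well-defined and finite. The control problem, also called Markov decision process (MDP), is formulated as
$$V_0(x_0) := \inf_{\alpha \in \mathcal{C}} J(\alpha), \tag{1.2}$$
and the goal is to find an optimal control $\alpha^* \in \mathcal{C}$, i.e., one attaining the optimal value: $V_0(x_0) = J(\alpha^*)$. Notice that problem (1.1)-(1.2) may also be viewed as the time discretization of a continuous-time stochastic control problem, in which case $F$ is typically the Euler scheme for a controlled diffusion process, and $V_0$ is the discrete-time approximation of a fully nonlinear Hamilton-Jacobi-Bellman equation.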
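As a rough illustration of the problem just formulated, the following sketch estimates the cost functional of a feedback control by plain Monte Carlo simulation of the controlled dynamics. The scalar linear model, the quadratic costs, the Gaussian noise and the two test policies are hypothetical choices made only for this example; they are not taken from the paper.

```python
import random

def simulate_cost(policy, F, f, g, x0, N, rng):
    """Total cost of one simulated path of the controlled dynamics under a feedback policy."""
    x, total = x0, 0.0
    for n in range(N):
        a = policy(n, x)                    # control decided from the current state
        total += f(x, a)                    # running cost f(X_n, alpha_n)
        x = F(x, a, rng.gauss(0.0, 1.0))    # X_{n+1} = F(X_n, alpha_n, eps_{n+1})
    return total + g(x)                     # terminal cost g(X_N)

def estimate_J(policy, F, f, g, x0, N, n_paths=4000, seed=0):
    """Monte Carlo estimate of the cost functional J(alpha)."""
    rng = random.Random(seed)
    return sum(simulate_cost(policy, F, f, g, x0, N, rng) for _ in range(n_paths)) / n_paths

# Hypothetical toy model (assumption): scalar linear dynamics, quadratic costs.
F = lambda x, a, e: x + a + 0.5 * e
f = lambda x, a: x * x + a * a
g = lambda x: x * x

J_zero = estimate_J(lambda n, x: 0.0, F, f, g, x0=1.0, N=3)
J_feedback = estimate_J(lambda n, x: -0.5 * x, F, f, g, x0=1.0, N=3)
```

For this toy model the exact costs can be computed by hand (5.5 for the zero control, 2.6875 for the linear feedback), which gives a simple sanity check of the simulation.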
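Problem (1.2) is tackled by the dynamic programming approach, and we introduce the standard notations for MDP: denote by $\{P^a(x, dx'),\ a \in A,\ x \in \mathcal{X}\}$ the family of transition probabilities associated with the controlled (homogeneous) Markov chain (1.1), given by
$$P^a(x, dx') := \mathbb{P}\big[F(x, a, \varepsilon_1) \in dx'\big],$$
and for any measurable function $\varphi$ on $\mathcal{X}$:
$$P^a \varphi(x) = \int \varphi(x') P^a(x, dx') = \mathbb{E}\big[\varphi(F(x, a, \varepsilon_1))\big].$$
With these notations, we have for any measurable function $\varphi$ on $\mathcal{X}$, for any $\alpha \in \mathcal{C}$,
$$\mathbb{E}\big[\varphi(X_{n+1}^\alpha) \,\big|\, \mathcal{F}_n\big] = P^{\alpha_n} \varphi(X_n^\alpha), \quad n = 0, \ldots, N-1.$$
The optimal value $V_0(x_0)$ is then determined in backward induction starting from the terminal condition
$$V_N(x) = g(x), \quad x \in \mathcal{X},$$
and by the dynamic programming (DP) formula, for $n = N-1, \ldots, 0$:
$$Q_n(x, a) = f(x, a) + P^a V_{n+1}(x), \qquad V_n(x) = \inf_{a \in A} Q_n(x, a), \quad x \in \mathcal{X},\ a \in A. \tag{1.3}$$
The function $Q_n$ is called the optimal state-action value function, and $V_n$ is the (optimal) value function. Moreover, when the infimum is attained in the DP formula at any time $n$ by $a_n^*(x)$, we get an optimal control in feedback form given by $\alpha^* = (a_n^*(X_n^*))_n$, where $X^* = X^{\alpha^*}$ is the Markov process defined by
$$X_{n+1}^* = F\big(X_n^*, a_n^*(X_n^*), \varepsilon_{n+1}\big), \quad n = 0, \ldots, N-1, \qquad X_0^* = x_0.$$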
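The backward induction above can be carried out exactly when the state, control and noise spaces are finite. The sketch below does so on a small hypothetical MDP (four states, two actions, a two-point noise); every modelling choice in it is an assumption made for illustration.

```python
STATES = [0, 1, 2, 3]
ACTIONS = [0, -1]                      # 0 = do nothing, -1 = push the state down
NOISE = [(0, 0.5), (1, 0.5)]           # (noise value, probability)
N = 2                                  # horizon

def F(x, a, e):                        # dynamics, clipped to the state space
    return min(max(x + a + e, 0), 3)

def f(x, a):                           # running cost: acting costs 0.5
    return 0.5 * abs(a)

def g(x):                              # terminal cost
    return float(x)

def backward_induction():
    """Return value functions V[n][x] and feedback controls policy[n][x]."""
    V = [dict() for _ in range(N + 1)]
    policy = [dict() for _ in range(N)]
    for x in STATES:
        V[N][x] = g(x)                 # terminal condition V_N = g
    for n in range(N - 1, -1, -1):
        for x in STATES:
            # state-action value Q_n(x, a) = f(x, a) + E[V_{n+1}(F(x, a, eps))]
            Q = {a: f(x, a) + sum(p * V[n + 1][F(x, a, e)] for e, p in NOISE)
                 for a in ACTIONS}
            a_star = min(Q, key=Q.get)  # minimizer of Q_n(x, .)
            V[n][x], policy[n][x] = Q[a_star], a_star
    return V, policy

V, policy = backward_induction()
```

In this example the exhaustive minimization over the action set is trivial; the two numerical difficulties discussed in the text arise when the expectation and the minimization can no longer be enumerated.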
The DP has a probabilistic formulation: it says that for any control $\alpha \in \mathcal{C}$, the value function process augmented with the cumulative costs, defined by
$$S_n^\alpha := V_n(X_n^\alpha) + \sum_{k=0}^{n-1} f(X_k^\alpha, \alpha_k), \quad n = 0, \ldots, N, \tag{1.4}$$
is a submartingale, and a martingale for the optimal control $\alpha^*$. This martingale property for the optimal control is a key observation for our algorithms described later.
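A small numerical check may help fix ideas. Writing $S_n$ for the value function evaluated along the controlled path plus the costs accumulated before time $n$, the sketch below computes $\mathbb{E}[S_n]$ by exact enumeration of the noise paths on a hypothetical finite MDP (all modelling choices are assumptions). It only verifies a consequence of the property stated above: $n \mapsto \mathbb{E}[S_n]$ is nondecreasing for an arbitrary control and constant for the optimal one.

```python
from itertools import product

STATES, ACTIONS, N = [0, 1, 2, 3], [0, -1], 2
NOISE = {0: 0.5, 1: 0.5}                       # noise value -> probability

F = lambda x, a, e: min(max(x + a + e, 0), 3)  # clipped dynamics
f = lambda x, a: 0.5 * abs(a)                  # running cost
g = lambda x: float(x)                         # terminal cost

def solve():
    """Exact dynamic programming on the finite MDP."""
    V = [dict() for _ in range(N + 1)]
    policy = [dict() for _ in range(N)]
    for x in STATES:
        V[N][x] = g(x)
    for n in range(N - 1, -1, -1):
        for x in STATES:
            Q = {a: f(x, a) + sum(p * V[n + 1][F(x, a, e)] for e, p in NOISE.items())
                 for a in ACTIONS}
            policy[n][x] = min(Q, key=Q.get)
            V[n][x] = Q[policy[n][x]]
    return V, policy

def expected_S(policy_fn, x0, V):
    """E[S_n], n = 0..N, by enumerating all noise paths."""
    ES = [0.0] * (N + 1)
    for path in product(NOISE, repeat=N):
        prob = 1.0
        for e in path:
            prob *= NOISE[e]
        x, cum = x0, 0.0
        ES[0] += prob * V[0][x]
        for n, e in enumerate(path):
            a = policy_fn(n, x)
            cum += f(x, a)                     # accumulate running costs
            x = F(x, a, e)
            ES[n + 1] += prob * (V[n + 1][x] + cum)
    return ES

V, policy = solve()
ES_opt = expected_S(lambda n, x: policy[n][x], 1, V)   # optimal feedback control
ES_zero = expected_S(lambda n, x: 0, 1, V)             # arbitrary (suboptimal) control
```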
Remark 1.1
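We can deal with state/control constraints at any time, which is useful for the applications:
$$(X_n^\alpha, \alpha_n) \in \mathcal{S} \quad \text{a.s.}, \quad n \in \mathbb{N},$$
where $\mathcal{S}$ is some given subset of $\mathbb{R}^d \times \mathbb{R}^q$. In this case, in order to ensure that the set of admissible controls is not empty, we assume that the sets
$$A(x) := \big\{ a \in \mathbb{R}^q : (F(x, a, \varepsilon_1), a) \in \mathcal{S} \ \text{a.s.} \big\}$$
are nonempty for all $x \in \mathcal{X}$, and the DP formula now reads
$$V_n(x) = \inf_{a \in A(x)} \big[ f(x, a) + P^a V_{n+1}(x) \big], \quad x \in \mathcal{X}.$$
From a computational point of view, it may be more convenient to work with unconstrained state/control variables, hence by relaxing the state/control constraint and introducing into the running cost a penalty function $L(x, a)$: $f(x, a) \leftarrow f(x, a) + L(x, a)$, and $g(x) \leftarrow g(x) + L(x, a)$. For example, if the constraint set $\mathcal{S}$ is in the form $\mathcal{S} = \{(x, a) \in \mathbb{R}^d \times \mathbb{R}^q : h_k(x, a) = 0,\ k = 1, \ldots, p,\ h_k(x, a) \geq 0,\ k = p+1, \ldots, q\}$, for some functions $h_k$, then one can take as penalty functions:
$$L(x, a) = \sum_{k=1}^{p} \mu_k |h_k(x, a)|^2 + \sum_{k=p+1}^{q} \mu_k \max\big(0, -h_k(x, a)\big),$$
where $\mu_k > 0$ are penalization coefficients (large in practice).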
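A penalty of this kind is straightforward to code. The sketch below assumes, purely for illustration, that the constraints are given as equalities $h(x,a) = 0$ and inequalities $h(x,a) \geq 0$, with a quadratic penalty on the former and a one-sided linear penalty on the latter; the helper `make_penalty`, the inventory-type constraints and the coefficient values are all hypothetical.

```python
def make_penalty(equalities, inequalities, mu_eq, mu_ineq):
    """Penalty L(x, a) for constraints h(x, a) = 0 (equalities) and h(x, a) >= 0 (inequalities)."""
    def L(x, a):
        pen = sum(mu * h(x, a) ** 2 for h, mu in zip(equalities, mu_eq))
        pen += sum(mu * max(0.0, -h(x, a)) for h, mu in zip(inequalities, mu_ineq))
        return pen
    return L

# Hypothetical example: keep x + a nonnegative and the control a below 1.
L = make_penalty(
    equalities=[],
    inequalities=[lambda x, a: x + a, lambda x, a: 1.0 - a],
    mu_eq=[],
    mu_ineq=[100.0, 100.0],            # large penalization coefficients
)
f = lambda x, a: x * x + a * a
f_penalized = lambda x, a: f(x, a) + L(x, a)   # relaxed running cost f + L
```

The penalty vanishes on the constraint set, so the penalized cost coincides with the original one for feasible state/control pairs.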
The implementation of the DP formula requires the knowledge and explicit computation of the transition probabilities $P^a(x, dx')$. In situations when they are unknown, this leads to the problem of reinforcement learning for computing the optimal control and value function by relying on simulations of the environment. The challenging tasks from a numerical point of view are then twofold:

Transition probability operator. Calculations, for any $x \in \mathcal{X}$, $a \in A$, of $P^a V_{n+1}(x)$, for $n = 0, \ldots, N-1$. This is a computational challenge in high dimension $d$ for the state space, with the “curse of dimensionality” due to the explosion of grid points in deterministic methods.

Optimal control. Computation of the infimum in $a \in A$ of $\big[ f(x, a) + P^a V_{n+1}(x) \big]$ for fixed $x$ and $n$, and of $\hat a_n(x)$ attaining the minimum if it exists. This is also a computational challenge, especially in high dimension $q$ for the control space.
The classical probabilistic numerical methods based on DP for solving the MDP are sometimes called approximate dynamic programming methods, see e.g. [4], [29], and consist basically of the two following steps:

Approximate at each time step $n$ the $Q_n$ value function defined as a conditional expectation. This can be performed by regression Monte-Carlo (RMC) techniques or quantization. RMC is typically done by least-squares linear regression on a set of basis functions, following the popular approach by Longstaff and Schwartz [24] initiated for the Bermudan option problem, where the suitable choice of basis functions might be delicate. The conditional expectation can also be approximated by regression on neural networks as in [19] for the American option problem, which appears as a promising and efficient alternative to linear regression in high dimension. The main issue in the controlled case concerns the simulation of the endogenous controlled MDP, and this can be overcome by control randomization as in [17]. Alternatively, the quantization method consists in approximating the noise $(\varepsilon_n)$ by a discrete random variable on a finite grid, in order to reduce the conditional expectation to a finite sum.

Control search: Once we get an approximation $(x, a) \mapsto \hat Q_n(x, a)$ of the $Q_n$ value function, the optimal control $\hat a_n(x)$ which achieves the minimum over $a \in A$ of $\hat Q_n(x, a)$ can be obtained either by an exhaustive search when $A$ is discrete (with relatively small cardinality), or by a (deterministic) gradient-based algorithm for a continuous control space (with relatively small dimension).
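The two steps above can be sketched on a single time step as follows. The example regresses the next-step value on a polynomial basis in (state, control), with states and controls drawn at random in the spirit of control randomization, and then searches the control by exhaustive enumeration of a finite grid. The toy model, the basis, the sampling ranges and the control grid are assumptions of this sketch, not choices made in the paper.

```python
import random

def basis(x, a):
    """Hypothetical polynomial basis in (state, control)."""
    return [1.0, x, a, x * x, x * a, a * a]

def solve_linear(A, b):
    """Solve A c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    c = [0.0] * n
    for i in range(n - 1, -1, -1):
        c[i] = (M[i][n] - sum(M[i][j] * c[j] for j in range(i + 1, n))) / M[i][i]
    return c

def least_squares(features, targets):
    """Least-squares coefficients via the normal equations."""
    k = len(features[0])
    AtA = [[sum(phi[i] * phi[j] for phi in features) for j in range(k)] for i in range(k)]
    Atb = [sum(phi[i] * y for phi, y in zip(features, targets)) for i in range(k)]
    return solve_linear(AtA, Atb)

rng = random.Random(0)
F = lambda x, a, e: x + a + 0.5 * e      # toy dynamics (assumption)
f = lambda x, a: x * x + a * a           # toy running cost (assumption)
V_next = lambda x: x * x                 # value function at the next time step
A_GRID = [-1.0, -0.5, 0.0, 0.5, 1.0]     # finite control set

# Step 1: regress V_next(X_{n+1}) on basis(X_n, a) with randomized states and controls.
feats, ys = [], []
for _ in range(2000):
    x, a = rng.uniform(-2, 2), rng.uniform(-1, 1)
    feats.append(basis(x, a))
    ys.append(V_next(F(x, a, rng.gauss(0.0, 1.0))))
coef = least_squares(feats, ys)

def Q_hat(x, a):
    """Approximate state-action value: running cost plus regressed continuation value."""
    return f(x, a) + sum(c * p for c, p in zip(coef, basis(x, a)))

# Step 2: exhaustive control search on the finite grid.
def greedy(x):
    return min(A_GRID, key=lambda a: Q_hat(x, a))
```

For this toy model the exact minimizer is $a = -x/2$, so `greedy(1.0)` is expected to select the grid point `-0.5`.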
Recently, numerical methods by direct approximation, without DP, have been developed and made implementable thanks to the power of computers: the basic idea is to focus directly on the control approximation by considering feedback control (policy) in a parametric form:
$$\alpha_n = A(X_n; \theta_n), \quad n = 0, \ldots, N-1,$$
for some given function $A(\cdot; \theta_n)$ with parameters $\theta = (\theta_0, \ldots, \theta_{N-1})$, and minimize over $\theta$ the parametric functional
$$\tilde J(\theta) = \mathbb{E}\Big[ \sum_{n=0}^{N-1} f\big(X_n^A, A(X_n^A; \theta_n)\big) + g(X_N^A) \Big],$$
where $(X_n^A)_n$ denotes the controlled process with feedback control $(A(\cdot; \theta_n))_n$. This approach was first adopted in [21], who used the EM algorithm for optimizing over the parameter $\theta$, and further investigated in [13], [6], [15], who considered deep neural networks (DNN) for the parametric feedback control, and stochastic gradient descent methods (SGD) for computing the optimal parameter $\theta$. The theoretical foundation of these DNN algorithms has been recently investigated in [14]. Deep learning has emerged recently in machine learning as a successful technique for dealing with high-dimensional problems in speech recognition, computer vision, etc. (see e.g. [22], [9]). Let us mention that DNN approximation in stochastic control has already been explored in the context of reinforcement learning (RL) (see [4] and [30]), and called deep reinforcement learning in the artificial intelligence community [26] (see also [23] for a recent survey), but usually for infinite horizon (stationary) control problems.
In this paper, we combine different ideas from the mathematics (numerical probability) and the computer science (reinforcement learning) communities to propose and compare several algorithms based on dynamic programming (DP) and deep neural networks (DNN) for the approximation/learning of (i) the optimal policy, and then of (ii) the value function. Notice that this differs from the classical approach in DP recalled above, where we first approximate the optimal state/control value function, and then approximate the optimal control. Our learning of the optimal policy is achieved in the spirit of [13] by DNN, but sequentially in time through DP instead of a global learning over the whole period $0, \ldots, N-1$. Once we get an approximation of the optimal policy, and recalling the martingale property (1.4), we approximate the value function by Monte-Carlo (MC) regression based on simulations of the forward process with the approximated optimal control. In particular, we avoid the issue of a priori endogenous simulation of the controlled process in the classical approach. The MC regressions for the approximation of the optimal policy and/or value function are performed according to different features leading to algorithmic variants: performance iteration (PI) or hybrid iteration (HI), and regress-now or regress-later/quantization in the spirit of [24] or [8]. Numerical results on several applications are presented in a companion paper [2].
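To make the direct (global) policy-search idea recalled above concrete, here is a minimal sketch in which the feedback control at each time step is a one-parameter linear map, and all parameters are optimized together on a fixed simulated sample. The linear parametrization, the toy linear-quadratic model, and the use of plain gradient descent with finite-difference gradients (instead of a DNN trained by SGD with backpropagation) are simplifying assumptions of this example.

```python
import random

N, SIGMA, M = 3, 0.5, 300
F = lambda x, a, e: x + a + SIGMA * e    # toy dynamics (assumption)
f = lambda x, a: x * x + a * a           # toy running cost (assumption)
g = lambda x: x * x                      # toy terminal cost (assumption)

rng = random.Random(0)
NOISE = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(M)]  # fixed training sample

def J(theta, x0=1.0):
    """Sample-average cost of the feedback policy a_n(x) = theta[n] * x."""
    total = 0.0
    for eps in NOISE:
        x, c = x0, 0.0
        for n in range(N):
            a = theta[n] * x
            c += f(x, a)
            x = F(x, a, eps[n])
        total += c + g(x)
    return total / M

def train(theta, lr=0.05, n_iter=150, h=1e-4):
    """Gradient descent on all theta_n at once, with central finite-difference gradients."""
    theta = list(theta)
    for _ in range(n_iter):
        grad = []
        for n in range(N):
            up, down = theta[:], theta[:]
            up[n] += h
            down[n] -= h
            grad.append((J(up) - J(down)) / (2 * h))
        theta = [t - lr * gr for t, gr in zip(theta, grad)]
    return theta

theta0 = [0.0] * N
theta_hat = train(theta0)
```

All time steps are trained jointly here, in contrast with the sequential, DP-based learning proposed in this paper.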
The theoretical contribution of the current paper is to provide a detailed convergence analysis of our three proposed algorithms: Theorem 4.1 for the NNContPI Algo based on control learning by performance iteration with DNN, Theorem 4.2 for the HybridNow Algo based on control learning by DNN and then value function learning by the regress-now method, and Theorem 4.3 for the HybridLaterQ Algo based on control learning by DNN and then value function learning by the regress-later method combined with quantization. We rely mainly on arguments from statistical learning and nonparametric regression, as developed notably in the book [12], to give estimates of the approximated control and value function in terms of the universal approximation error of the neural networks.
The paper is organized as follows. We recall in Section 2 some basic results about deep neural networks (DNN) and stochastic gradient descent methods used for their optimization. Section 3 is devoted to the description of our three algorithms. We analyze in detail in Section 4 the convergence of the three algorithms. Finally, the Appendix collects some lemmas used in the proofs of the convergence results.
2 Preliminaries on DNN and SGD
2.1 Neural network approximations
Deep neural networks (DNN) aim to approximate (complex nonlinear) functions defined on finite-dimensional spaces, and in contrast with the usual additive approximation theory built via basis functions, like polynomials, they rely on the composition of layers of simple functions. The relevance of neural networks comes from the universal approximation theorem and the Kolmogorov-Arnold representation theorem (see [20], [5] or [16]), and this approach has proved successful in numerous practical applications.
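We consider here feedforward artificial networks (also called multilayer perceptrons) for the approximation of the optimal policy (valued in $A \subset \mathbb{R}^q$) and the value function (valued in $\mathbb{R}$), both defined on the state space $\mathcal{X} \subset \mathbb{R}^d$. The architecture is depicted in Figure 1, and it is mathematically represented by functions
$$x \in \mathcal{X} \longmapsto \Phi(x; \theta) \in \mathbb{R}^o,$$
with $o = q$ or $1$ in our context, and where $\theta \in \Theta \subset \mathbb{R}^p$ are the weights (or parameters) of the neural network. The DNN function $\Phi = \Phi_L$ with input layer $\Phi_0 = x$ composed of $d$ units (or neurons), $L-1$ hidden layers (with layer $\ell$, $\Phi_\ell$, composed of $d_\ell$ units), and output layer composed of $d_L = o$ neurons is obtained by successive composition of linear combinations and activation functions $\sigma_\ell$ (that is, nonlinear monotone functions such as the sigmoid, the rectified linear unit ReLU, the exponential linear unit ELU, or the softmax):
$$\Phi_\ell = \sigma_\ell\big(w_\ell \Phi_{\ell-1} + \gamma_\ell\big) \in \mathbb{R}^{d_\ell}, \quad \ell = 1, \ldots, L,$$
for some matrix weights $(w_\ell)$ and vector weights $(\gamma_\ell)$, aggregating into $\theta = (w_\ell, \gamma_\ell)_{\ell = 1, \ldots, L}$. A key feature of neural networks is the computation of the gradient (with respect to the variable $x$ and the weights $\theta$) of the DNN function via a forward-backward propagation algorithm derived from chain rule composition. For example, for the sigmoid activation function $\sigma_\ell(y) = 1/(1 + e^{-y})$, and noting that $\sigma_\ell' = \sigma_\ell (1 - \sigma_\ell)$, we have
$$\frac{\partial \Phi_\ell}{\partial x} = \mathrm{diag}\big(\Phi_\ell \odot (1 - \Phi_\ell)\big)\, w_\ell\, \frac{\partial \Phi_{\ell-1}}{\partial x}, \quad \ell = 1, \ldots, L, \qquad \frac{\partial \Phi_0}{\partial x} = I_d,$$
while the gradient w.r.t. $\theta$ of $g(\Phi_L)$, for a real-valued differentiable function $g$, is given in backward induction by
$$z_L = \nabla g(\Phi_L) \odot \Phi_L \odot (1 - \Phi_L), \qquad z_\ell = \big(w_{\ell+1}^\top z_{\ell+1}\big) \odot \Phi_\ell \odot (1 - \Phi_\ell), \quad \ell = L-1, \ldots, 1,$$
$$\frac{\partial g(\Phi_L)}{\partial w_\ell} = z_\ell\, \Phi_{\ell-1}^\top, \qquad \frac{\partial g(\Phi_L)}{\partial \gamma_\ell} = z_\ell, \quad \ell = 1, \ldots, L.$$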
We refer to the online book [27] for a gentle introduction to neural networks and deep learning.
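The forward-backward propagation mentioned above can be written in a few lines for a small sigmoid network. The sketch below is a plain-Python illustration with hypothetical layer sizes and random weights; each layer applies the sigmoid to an affine map of the previous layer, and the backward pass applies the chain rule using the identity that the sigmoid derivative equals $\sigma(1 - \sigma)$.

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def forward(x, layers):
    """Return all activations [Phi_0, ..., Phi_L]; layers is a list of (w, gamma)."""
    acts = [x]
    for w, gamma in layers:
        prev = acts[-1]
        acts.append([sigmoid(sum(wij * pj for wij, pj in zip(row, prev)) + gi)
                     for row, gi in zip(w, gamma)])
    return acts

def backward(acts, layers, grad_out):
    """Gradients of g(Phi_L) w.r.t. each (w, gamma), given grad_out = dg/dPhi_L."""
    grads = [None] * len(layers)
    # sensitivity of g w.r.t. the pre-activation of the output layer
    delta = [go * p * (1.0 - p) for go, p in zip(grad_out, acts[-1])]
    for l in range(len(layers) - 1, -1, -1):
        prev = acts[l]                               # input of layer l
        grads[l] = ([[d * pj for pj in prev] for d in delta], delta[:])
        if l > 0:                                    # propagate to the previous layer
            w = layers[l][0]
            back = [sum(w[i][j] * delta[i] for i in range(len(delta)))
                    for j in range(len(prev))]
            delta = [b * p * (1.0 - p) for b, p in zip(back, prev)]
    return grads

rng = random.Random(0)

def init(d_in, d_out):
    """Random weights for one layer (hypothetical initialization)."""
    w = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
    gamma = [rng.uniform(-1, 1) for _ in range(d_out)]
    return w, gamma

layers = [init(2, 3), init(3, 1)]    # d = 2 inputs, one hidden layer of 3 units, 1 output
x = [0.3, -0.7]
acts = forward(x, layers)
grads = backward(acts, layers, [1.0])   # gradient of the (scalar) network output itself
```

A finite-difference comparison on a single weight is a convenient way to validate such a hand-written backward pass.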
2.2 Stochastic optimization in DNN
Approximation by means of DNN requires a stochastic optimization with respect to a set of parameters, which can be written in a generic form as
$$\inf_{\theta} \mathbb{E}\big[ L(Z; \theta) \big], \tag{2.1}$$
where $Z$ is a random variable from which the training samples $Z^{(m)}$, $m = 1, \ldots, M$, are drawn, and $L$ is a loss function involving a DNN with parameters $\theta \in \mathbb{R}^p$, typically differentiable w.r.t. $\theta$ with known gradient $D_\theta L(Z^{(m)}; \theta)$.
Several basic algorithms are already implemented in TensorFlow for the search of the infimum in (2.1). Given a training sample of size $M$, in all the following cases, the sequence $(\theta_n)_n$ tends to $\arg\min_\theta \frac{1}{M} \sum_{m=1}^{M} L(Z^{(m)}; \theta)$ under suitable assumptions on the learning rate sequence $(\gamma_n)_n$.

Batch gradient descent (compute the gradient over the full training set): fix an integer $M \geq 1$, and do
$$\theta_{n+1} = \theta_n - \gamma_n \frac{1}{M} \sum_{m=1}^{M} D_\theta L(Z^{(m)}; \theta_n).$$
The main problem with batch gradient descent is that the convergence is very slow and the computation of the sum can be costly for very large training sets. The method is therefore very stable, but too slow in most situations.

Stochastic gradient descent (SGD) (compute the gradient over one random instance in the training set):
$$\theta_{n+1} = \theta_n - \gamma_n D_\theta L(Z^{(n+1)}; \theta_n),$$
starting from $\theta_0 \in \mathbb{R}^p$, with a learning rate $\gamma_n$. The stochastic gradient algorithm computes the gradient based on a single random instance in the training set. It is then a fast but unstable algorithm.

Mini-batch gradient descent (compute the gradient over random small subsets of the training set, i.e. mini-batches): let $B$ be an integer that divides $M$. $B$ stands for the number of mini-batches and should be taken much smaller than $M$ in the applications.
For all $n$,
randomly draw a subset $\{Z^{(n,m)},\ 1 \leq m \leq M/B\}$ of size $M/B$ in the training set;

iterate: $\theta_{n+1} = \theta_n - \gamma_n \frac{B}{M} \sum_{m=1}^{M/B} D_\theta L(Z^{(n,m)}; \theta_n)$.
The mini-batch gradient descent is often considered to be the best trade-off between speed and stability.
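The three schemes listed above can be compared on a deliberately simple loss. In the sketch below the loss is the scalar quadratic $(\theta - z)^2/2$, whose empirical minimizer is the sample mean; the loss, the Gaussian training sample, the learning rates and the iteration counts are hypothetical choices made for illustration.

```python
import random

rng = random.Random(0)
Z = [rng.gauss(2.0, 1.0) for _ in range(1000)]       # training sample of size M = 1000
grad = lambda z, theta: theta - z                     # gradient of (theta - z)^2 / 2 in theta

def batch_gd(theta=0.0, lr=0.5, n_iter=100):
    """Gradient averaged over the full training set at every iteration."""
    for _ in range(n_iter):
        theta -= lr * sum(grad(z, theta) for z in Z) / len(Z)
    return theta

def sgd(theta=0.0, n_iter=2000):
    """Gradient computed on a single random instance, learning rate 1 / (n + 1)."""
    for n in range(n_iter):
        z = rng.choice(Z)
        theta -= grad(z, theta) / (n + 1)
    return theta

def minibatch_gd(theta=0.0, n_batches=50, n_iter=200):
    """Gradient averaged over a random subset of size M / n_batches."""
    size = len(Z) // n_batches
    for n in range(n_iter):
        batch = rng.sample(Z, size)
        theta -= sum(grad(z, theta) for z in batch) / size / (n + 1)
    return theta

target = sum(Z) / len(Z)                              # minimizer of the empirical loss
results = {"batch": batch_gd(), "sgd": sgd(), "minibatch": minibatch_gd()}
```

On this example the batch iteration reaches the empirical minimizer essentially exactly, while the two stochastic variants only approach it up to sampling noise.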

The three gradient descents that we just introduced are historically the first three algorithms that have been designed to learn optimal parameters. Other methods, such as the adaptive optimization methods AdaGrad, RMSProp, and finally Adam, are also available. Although not well understood and even questioned (see e.g. [31]), the latter are often chosen by practitioners to solve (2.1) and appear to provide the best results in most situations.
For the sake of simplicity, we only refer in the sequel to the stochastic gradient descent method when presenting our algorithms. However, we recommend testing different algorithms in order to find out which ones provide the best and fastest results for a given problem.
3 Description of the algorithms
We propose algorithms relying on a DNN approximation of the optimal policy that we compute sequentially in time through the dynamic programming formula, using performance or hybrid iteration. The value function is then computed by Monte-Carlo regression, either by a regress-now method or by a regress-later approach combined with quantization. These variants lead to three algorithms for MDP that we detail in this section.
Let us introduce a set $\mathcal{A}$ of neural networks for approximating optimal policies, that is, a set of parametric functions $x \in \mathcal{X} \mapsto A(x; \beta) \in A$, with parameters $\beta \in \mathbb{R}^l$, and a set $\mathcal{V}$ of neural network functions for approximating value functions, that is, a set of parametric functions $x \in \mathcal{X} \mapsto \Phi(x; \theta) \in \mathbb{R}$, with parameters $\theta \in \mathbb{R}^p$.
We are also given at each time $n$ a probability measure $\mu_n$ on the state space $\mathcal{X}$, which we refer to as a training distribution. Some comments about the choice of the training measure are discussed in Section 3.3.
3.1 Control learning by performance iteration
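This algorithm, referred to in short as the NNContPI Algo, is designed as follows:
For $n = N-1, \ldots, 0$, we keep track of the approximated optimal policies $\hat a_k$, $k = n+1, \ldots, N-1$, and approximate the optimal policy at time $n$ by $\hat a_n = A(\cdot; \hat\beta_n)$ with
$$\hat\beta_n \in \arg\min_{\beta} \mathbb{E}\Big[ f\big(X_n, A(X_n; \beta)\big) + \sum_{k=n+1}^{N-1} f\big(X_k^\beta, \hat a_k(X_k^\beta)\big) + g(X_N^\beta) \Big], \tag{3.1}$$
where $X_n \sim \mu_n$, $X_{n+1}^\beta = F(X_n, A(X_n; \beta), \varepsilon_{n+1})$, and for $k = n+1, \ldots, N-1$, $X_{k+1}^\beta = F(X_k^\beta, \hat a_k(X_k^\beta), \varepsilon_{k+1})$. Given estimates $\hat a_k^M$ of $\hat a_k$, $k = n+1, \ldots, N-1$, the approximated policy $\hat a_n$ is estimated by using a training sample $\big(X_n^{(m)}, (\varepsilon_{k+1}^{(m)})_{k=n}^{N-1}\big)$, $m = 1, \ldots, M$, of $\big(X_n, (\varepsilon_{k+1})_{k=n}^{N-1}\big)$ for simulating $\big(X_n, (X_{k+1}^\beta)_{k=n}^{N-1}\big)$, and optimizing over the parameters $\beta$ of the NN $A(\cdot; \beta)$ the expectation in (3.1) by the stochastic gradient descent method (or its variants) as described in Section 2.2.
We then get an estimate of the optimal policy at any time $n$ by:
$$\hat a_n^M = A(\cdot; \hat\beta_n^M),$$
where $\hat\beta_n^M$ is the “optimal” parameter resulting from the SGD in (3.1) with a training sample of size $M$. This leads to an estimated value function given at any time $n$ by
$$\hat V_n^M(x) = \mathbb{E}_M\Big[ \sum_{k=n}^{N-1} f\big(X_k^{n,x}, \hat a_k^M(X_k^{n,x})\big) + g(X_N^{n,x}) \Big], \tag{3.2}$$
where $\mathbb{E}_M$ is the expectation conditioned on the training set (used for computing $(\hat a_k^M)_k$), and $(X_k^{n,x})_{k=n,\ldots,N}$ is given by: $X_n^{n,x} = x$, $X_{k+1}^{n,x} = F\big(X_k^{n,x}, \hat a_k^M(X_k^{n,x}), \varepsilon_{k+1}\big)$, $k = n, \ldots, N-1$. The dependence of the estimated value function $\hat V_n^M$ upon the training samples $X_k^{(m)}$, for $m = 1, \ldots, M$, used at times $k = n, \ldots, N-1$, is emphasized through the exponent $M$ in the notations.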
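The backward-in-time structure of performance iteration can be sketched as follows. To keep the example short and checkable, the policy network is replaced by a one-parameter linear feedback (gain times state), the training distribution is a standard Gaussian, the model is a toy linear-quadratic one, and the SGD step is replaced by gradient descent on a fixed training sample with finite-difference gradients; all of these are assumptions of the sketch, not the implementation analyzed in the paper.

```python
import random

N, SIGMA, M = 3, 0.5, 500
F = lambda x, a, e: x + a + SIGMA * e    # toy dynamics (assumption)
f = lambda x, a: x * x + a * a           # toy running cost (assumption)
g = lambda x: x * x                      # toy terminal cost (assumption)
rng = random.Random(0)

def cost_to_go(beta, later, xs, noises):
    """Sample average of the cost from the current time: gain beta now, learned gains afterwards."""
    gains = [beta] + later
    total = 0.0
    for x, eps in zip(xs, noises):
        c = 0.0
        for k, gain in enumerate(gains):
            a = gain * x
            c += f(x, a)
            x = F(x, a, eps[k])
        total += c + g(x)
    return total / len(xs)

def learn_policies(lr=0.1, n_iter=60, h=1e-4):
    """Learn the gains backward in time, keeping track of the later (already learned) policies."""
    gains = [0.0] * N
    for n in range(N - 1, -1, -1):
        xs = [rng.gauss(0.0, 1.0) for _ in range(M)]                    # states drawn from the training measure
        noises = [[rng.gauss(0.0, 1.0) for _ in range(N - n)] for _ in range(M)]
        later = gains[n + 1:]
        beta = 0.0
        for _ in range(n_iter):
            grad = (cost_to_go(beta + h, later, xs, noises)
                    - cost_to_go(beta - h, later, xs, noises)) / (2 * h)
            beta -= lr * grad
        gains[n] = beta
    return gains

gains_hat = learn_policies()
```

For this linear-quadratic toy model the optimal feedback is linear with gains given by a Riccati-type recursion, which provides a reference to compare `gains_hat` against.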
Remark 3.1
The NNContPI Algo can be viewed as a combination of the DNN algorithm designed in [13] and dynamic programming. In the algorithm presented in [13], which totally ignores the dynamic programming principle, one learns all the optimal controls $A(\cdot; \beta_n)$, $n = 0, \ldots, N-1$, at the same time, by performing one unique stochastic gradient descent. This is efficient as all the parameters of all the NN are trained at the same time, using the same mini-batches. However, when the number of layers of the global neural network gathering all the NN $A(\cdot; \beta_n)$, $n = 0, \ldots, N-1$, is large (it grows with both $N$ and the number of layers in each $A(\cdot; \beta_n)$), then one is likely to observe vanishing or exploding gradient problems that will affect the training of the weights and biases of the first layers of the global NN (see [7] for more details). Therefore, it may be more reasonable to make use of the dynamic programming structure when $N$ is large, and learn the optimal policy sequentially as proposed in our NNContPI Algo. Notice that a similar idea was already used in [11] in the context of the uncertain volatility model, where the authors use a specific parametrization for the feedback control instead of the DNN adopted more generally here.
Remark 3.2
The NNContPI Algo does not require value function iteration, but instead is based on performance iteration by keeping track of the estimated optimal policies computed in backward recursion. The value function is then computed in (3.2) as the gain functional associated with the estimated optimal policies $(\hat a_n^M)_n$. Consequently, it usually provides a low-bias estimate, but possibly induces a high-variance estimate and large complexity, especially when $N$ is large.
3.2 Control learning by hybrid iteration
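Instead of keeping track of all the approximated optimal policies as in the NNContPI Algo, we use an approximation of the value function at time $n+1$ in order to compute the optimal policy at time $n$. The approximated value function is then updated at time $n$ by relying on the martingale property (1.4) under the optimal control. This leads to the following generic algorithm:
Generic Hybrid Algo

Initialization: $\hat V_N = g$.

For $n = N-1, \ldots, 0$,

Approximate the optimal policy at time $n$ by $\hat a_n = A(\cdot; \hat\beta_n)$ with
$$\hat\beta_n \in \arg\min_{\beta} \mathbb{E}\Big[ f\big(X_n, A(X_n; \beta)\big) + \hat V_{n+1}(X_{n+1}^\beta) \Big], \tag{3.3}$$
where $X_n \sim \mu_n$, $X_{n+1}^\beta = F(X_n, A(X_n; \beta), \varepsilon_{n+1})$.

Updating: approximate the value function by
$$\hat V_n(x) = \mathbb{E}\Big[ f\big(X_n, \hat a_n(X_n)\big) + \hat V_{n+1}\big(X_{n+1}^{\hat\beta_n}\big) \,\Big|\, X_n = x \Big]. \tag{3.4}$$

The approximated policy $\hat a_n$ is estimated by using a training sample $\big(X_n^{(m)}, \varepsilon_{n+1}^{(m)}\big)$, $m = 1, \ldots, M$, of $(X_n, \varepsilon_{n+1})$ to simulate $(X_n, X_{n+1}^\beta)$, and optimizing over the parameters $\beta$ of the NN $A(\cdot; \beta)$ the expectation in (3.3) by the stochastic gradient descent method (or its variants) as described in Section 2.2. We then get an estimate $\hat a_n^M = A(\cdot; \hat\beta_n^M)$. The approximated value function written as a conditional expectation in (3.4) is estimated by Monte-Carlo regression, either by a regress-now method (in the spirit of [19]) or by a regress-later approach (in the spirit of [8] and [3]) combined with quantization, and this leads to the algorithmic variants detailed in the next two paragraphs.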
3.2.1 HybridNow Algo
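Given an estimate $\hat a_n^M$ of the optimal policy at time $n$, and an estimate $\hat V_{n+1}^M$ of $\hat V_{n+1}$, we estimate $\hat V_n$ by neural network regression, i.e.,
$$\hat V_n^M \in \arg\min_{\Phi(\cdot; \theta) \in \mathcal{V}} \mathbb{E}\Big| f\big(X_n, \hat a_n^M(X_n)\big) + \hat V_{n+1}^M\big(X_{n+1}^{\hat\beta_n^M}\big) - \Phi(X_n; \theta) \Big|^2, \tag{3.5}$$
using samples $X_n^{(m)}$, $X_{n+1}^{(m), \hat\beta_n^M}$, $m = 1, \ldots, M$, of $X_n \sim \mu_n$, and of $X_{n+1}^{\hat\beta_n^M}$. In other words, we have
$$\hat V_n^M = \Phi(\cdot; \hat\theta_n^M),$$
where $\hat\theta_n^M$ is the “optimal” parameter resulting from the SGD in (3.5) with a training sample of size $M$.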
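The alternation between a policy step and a regress-now value step can be sketched on a toy linear-quadratic model. In the code below the policy network is replaced by a linear feedback with a single gain, the value network by a two-parameter quadratic $p x^2 + c$ fitted by least squares, and SGD by full-batch gradient descent; these substitutions, the model and the Gaussian training measure are assumptions made to keep the example short and verifiable.

```python
import random

N, SIGMA, M = 3, 0.5, 2000
rng = random.Random(1)

def fit_quadratic(xs, ys):
    """Least-squares fit y ~ p * x^2 + c (stand-in for the value network)."""
    u = [x * x for x in xs]
    mu, my = sum(u) / len(u), sum(ys) / len(ys)
    p = (sum((ui - mu) * (yi - my) for ui, yi in zip(u, ys))
         / sum((ui - mu) ** 2 for ui in u))
    return p, my - p * mu

def hybrid_now(lr=0.1, n_iter=60):
    p, c = 1.0, 0.0                           # terminal value: g(x) = x^2
    gains, values = [0.0] * N, [None] * N + [(p, c)]
    for n in range(N - 1, -1, -1):
        xs = [rng.gauss(0.0, 1.0) for _ in range(M)]   # states from the training measure
        es = [rng.gauss(0.0, 1.0) for _ in range(M)]   # one-step noises
        # Policy step: minimize the sample mean of f(x, beta x) + V_{n+1}(x + beta x + sigma e)
        beta = 0.0
        for _ in range(n_iter):
            grad = sum(2 * beta * x * x + 2 * p * ((1 + beta) * x + SIGMA * e) * x
                       for x, e in zip(xs, es)) / M
            beta -= lr * grad
        # Value step (regress now): regress f + V_{n+1}(X_{n+1}) on X_n
        ys = []
        for x, e in zip(xs, es):
            a = beta * x
            x_next = x + a + SIGMA * e
            ys.append(x * x + a * a + p * x_next ** 2 + c)
        p, c = fit_quadratic(xs, ys)
        gains[n], values[n] = beta, (p, c)
    return gains, values

gains_hat, values_hat = hybrid_now()
```

Only the one-step-ahead value approximation enters the policy step, so the cost of each time step does not grow with the remaining horizon.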
3.2.2 HybridLaterQ Algo
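Given an estimate $\hat a_n^M$ of the optimal policy at time $n$, and an estimate $\hat V_{n+1}^M$ of $\hat V_{n+1}$, the regress-later approach for estimating $\hat V_n$ is achieved in two stages: (a) we first regress/interpolate the estimated value $\hat V_{n+1}^M(X_{n+1})$ at time $n+1$ by a NN (or alternatively a Gaussian process) $\Phi(X_{n+1})$; (b) analytical formulae are applied to the conditional expectation of this NN of future values $\Phi(X_{n+1})$ with respect to the present value $X_n$, and this is obtained by quantization of the noise $(\varepsilon_n)$ driving the dynamics (1.1) of the state process.
The ingredients of the quantization approximation are described as follows:

We denote by $\hat\varepsilon$ a $K$-quantizer of the $E$-valued random variable $\varepsilon_{n+1} \sim \varepsilon_1$ (typically a Gaussian random variable), that is, a discrete random variable on a grid $\Gamma = \{e_1, \ldots, e_K\} \subset E$ defined by
$$\hat\varepsilon = \mathrm{Proj}_\Gamma(\varepsilon_1) := \sum_{\ell=1}^{K} e_\ell \mathbf{1}_{\varepsilon_1 \in C_\ell(\Gamma)},$$
where $C_1(\Gamma), \ldots, C_K(\Gamma)$ are Voronoi tessellations of $\Gamma$, i.e., Borel partitions of the Euclidean space $(E, |\cdot|)$ satisfying
$$C_\ell(\Gamma) \subset \Big\{ e \in E : |e - e_\ell| = \min_{j = 1, \ldots, K} |e - e_j| \Big\}.$$
The discrete law of $\hat\varepsilon$ is then characterized by
$$\hat p_\ell := \mathbb{P}[\hat\varepsilon = e_\ell] = \mathbb{P}[\varepsilon_1 \in C_\ell(\Gamma)], \quad \ell = 1, \ldots, K.$$
The grid points $(e_\ell)$ which minimize the $L^2$-quantization error $\|\varepsilon_1 - \hat\varepsilon\|_2$ lead to the so-called optimal $K$-quantizer, and can be obtained by a stochastic gradient descent method, known as the Kohonen algorithm or competitive learning vector quantization (CLVQ) algorithm, which also provides as a byproduct an estimation of the associated weights $(\hat p_\ell)$. We refer to [28] for a description of the algorithm, and mention that for the normal distribution, the optimal grids and the weights of the Voronoi tessellations are precomputed on the website
http://www.quantize.mathsfi.com

Recalling the dynamics (1.1), the conditional expectation operator is equal to
$$P^{\hat a_n^M(x)} W(x) = \mathbb{E}\big[ W\big(X_{n+1}^{\hat a_n^M}\big) \,\big|\, X_n = x \big] = \mathbb{E}\big[ W\big(F(x, \hat a_n^M(x), \varepsilon_1)\big) \big],$$
that we shall approximate analytically by quantization via:
$$\hat P^{\hat a_n^M(x)} W(x) := \mathbb{E}\big[ W\big(F(x, \hat a_n^M(x), \hat\varepsilon)\big) \big] = \sum_{\ell=1}^{K} \hat p_\ell\, W\big(F(x, \hat a_n^M(x), e_\ell)\big). \tag{3.6}$$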
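The quantization ingredients above can be illustrated in dimension one. The sketch below builds a small grid and its weights from a simulated Gaussian sample using Lloyd's fixed-point iterations, which are used here as a simple stand-in for the CLVQ/Kohonen procedure mentioned in the text, and then replaces the conditional expectation by the finite weighted sum over the grid. The grid size, sample size, dynamics and test function are hypothetical.

```python
import random

def lloyd_quantizer(sample, K, n_iter=20):
    """Approximate K-point quantizer of a 1D sample by Lloyd's fixed-point iterations."""
    s = sorted(sample)
    grid = [s[(2 * l + 1) * len(s) // (2 * K)] for l in range(K)]    # start at sample quantiles
    for _ in range(n_iter):
        sums, counts = [0.0] * K, [0] * K
        for e in s:
            l = min(range(K), key=lambda j: abs(e - grid[j]))        # Voronoi cell containing e
            sums[l] += e
            counts[l] += 1
        # move each grid point to the mean of its cell (keep it if the cell is empty)
        grid = [sums[l] / counts[l] if counts[l] else grid[l] for l in range(K)]
    counts = [0] * K
    for e in s:
        counts[min(range(K), key=lambda j: abs(e - grid[j]))] += 1
    weights = [cnt / len(s) for cnt in counts]                       # estimated cell probabilities
    return grid, weights

def quantized_expectation(W, F, x, a, grid, weights):
    """Finite-sum approximation of E[W(F(x, a, eps))] over the quantization grid."""
    return sum(p * W(F(x, a, e)) for e, p in zip(grid, weights))

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(2000)]
grid, weights = lloyd_quantizer(sample, K=8)

F = lambda x, a, e: x + a + 0.5 * e      # toy dynamics (assumption)
W = lambda x: x * x                      # toy function of the next state (assumption)
approx = quantized_expectation(W, F, 1.0, -0.5, grid, weights)
```

For this toy example the exact expectation is 0.5, and the quantized sum is expected to be slightly smaller because quantization underestimates the variance of the noise.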
The two stages of the regresslater are then detailed as follows:

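(Later) interpolation of the value function: Given a DNN $\Phi(\cdot; \theta)$ on $\mathbb{R}^d$ with parameters $\theta \in \mathbb{R}^p$, we interpolate $\hat V_{n+1}^M$ by
$$\tilde V_{n+1}^M(x) := \Phi(x; \theta_{n+1}^M),$$
where $\theta_{n+1}^M$ is obtained via SGD (as described in Section 2.2) from the regression of $\hat V_{n+1}^M(X_{n+1})$ against $\Phi(X_{n+1}; \theta)$, using training samples $X_n^{(m)}$, $X_{n+1}^{(m)}$, $m = 1, \ldots, M$, of $X_n \sim \mu_n$, and of $X_{n+1}$.

Updating/approximation of the value function: by using the hat operator in (3.6) for the approximation of the conditional expectation by quantization, we calculate analytically
$$\hat V_n^M(x) = f\big(x, \hat a_n^M(x)\big) + \hat P^{\hat a_n^M(x)} \tilde V_{n+1}^M(x) = f\big(x, \hat a_n^M(x)\big) + \sum_{\ell=1}^{K} \hat p_\ell\, \tilde V_{n+1}^M\big(F(x, \hat a_n^M(x), e_\ell)\big).$$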
Remark 3.3
Let us discuss and compare the HybridNow and HybridLaterQ Algos. When regressing later, one just has to learn a deterministic function through the interpolation step (a), as the noise is then approximated by quantization to get an analytical formula. Therefore, compared to HybridNow, the HybridLaterQ Algo reduces the variance of the estimate $\hat V_n^M$. Moreover, one has a wide choice of loss functions when regressing later, e.g., MSE loss, $L^1$ loss, relative error loss, etc., while the $L^2$ loss function is required to approximate a conditional expectation with the regress-now method. However, although quantization is quite easy and fast to implement in small dimension for the noise, it might not be efficient in high dimension compared to HybridNow.
Remark 3.4
Again, we point out that the estimated value function $\hat V_n^M$ in HybridNow or HybridLaterQ depends on the training samples $X_k^{(m)}$, $m = 1, \ldots, M$, used at times $k = n, \ldots, N-1$, for computing the estimated optimal policies $\hat a_k^M$, and this is emphasized through the exponent $M$ in the notations.
3.3 Training sets design
We discuss here the choice of the training measure used to generate the training sets on which the estimates are computed. Two cases are considered in this section. The first one is a knowledge-based selection, relevant when the controller knows with a certain degree of confidence where the process has to be driven in order to optimize her cost functional. The second case, on the other hand, is when the controller has no idea where or how to drive the process to optimize the cost functional.
Exploitation only strategy
In the knowledge-based setting, there is no need for an exhaustive and expensive (mainly in time) exploration of the state space, and the controller can directly choose training sets constructed from distributions that assign more points to the parts of the state space where the optimal process is likely to be driven.
In practice, at time $n$, assuming we know that the optimal process is likely to stay in the ball centered at the point $m_n$ with radius $r_n$, we choose a training measure $\mu_n$ centered around $m_n$, for example the Gaussian distribution $\mathcal{N}(m_n, r_n^2)$, and build the training set as a sample of the latter.
Explore first, exploit later

Explore first: If the agent has no idea of where to drive the process to receive large rewards, she can always proceed to an exploration step to discover favorable subsets of the state space. To do so, the training sets at each time $n$ can be built as uniform grids that cover a large part of the state space, or the training measure $\mu_n$ can be chosen uniform on such a domain. It is essential to explore far enough to have a good understanding of where to drive and where not to drive the process.

Exploit later: The estimates of the optimal controls at time $n$ that come out of the Explore first step are relatively good, in the sense that they manage to avoid the wrong areas of the state space when driving the process. However, the training sets that have been used to compute the estimated optimal control are too sparse to ensure accuracy of the estimation. In order to improve the accuracy, the natural idea is to build new training sets by simulating $M$ times the process using the estimates of the optimal strategy computed from the Explore first step, and then proceed to another estimation of the optimal strategies using the new training sets. This trick can be seen as a two-step algorithm that improves the estimate of the optimal control.
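The training-set constructions discussed in this section can be sketched as three small samplers. The function names, the use of a Gaussian with standard deviation equal to the radius for the exploitation-only case, the scalar state, and the toy dynamics and first-pass policies in the usage lines are all hypothetical.

```python
import random

def exploit_only(center, radius, M, rng):
    """Training set concentrated where the optimal process is believed to live."""
    return [rng.gauss(center, radius) for _ in range(M)]

def explore_first(low, high, M, rng):
    """Training set spread uniformly over a large part of the state space."""
    return [rng.uniform(low, high) for _ in range(M)]

def exploit_later(policies, F, x0, n, M, rng):
    """Simulate M paths with first-pass policies; the states reached at time n form the new training set."""
    out = []
    for _ in range(M):
        x = x0
        for k in range(n):
            x = F(x, policies[k](x), rng.gauss(0.0, 1.0))
        out.append(x)
    return out

rng = random.Random(0)
F = lambda x, a, e: x + a + 0.5 * e                    # toy dynamics (assumption)
first_pass = [lambda x: -0.5 * x, lambda x: -0.5 * x]  # stand-ins for policies from the exploration pass

set_known = exploit_only(center=1.0, radius=0.5, M=1000, rng=rng)
set_explore = explore_first(-5.0, 5.0, 1000, rng)
set_refined = exploit_later(first_pass, F, x0=4.0, n=2, M=1000, rng=rng)
```

The refined set is concentrated around the region actually visited under the first-pass policies, which is where accuracy of the second estimation matters most.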
3.4 Some remarks
We end this section with some comments about our proposed algorithms.
3.4.1 Case of finite control space: classification
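In the case where the control space $A$ is finite, i.e., $\mathrm{Card}(A) = L < \infty$ with $A = \{a_1, \ldots, a_L\}$, one can think of the optimal control searching task as a problem of classification. This means that we randomize the control in the sense that given a state value $x$, the controller chooses $a_\ell$ with a probability $p_\ell(x)$. We can then consider a neural network that takes the state $x$ as an input, and returns at each time $n$ a probability vector $p(x; \beta_n) = \big(p_\ell(x; \beta_n)\big)_{\ell = 1, \ldots, L}$ with softmax output layer:
$$z \longmapsto \Big( \frac{e^{z_\ell}}{\sum_{k=1}^{L} e^{z_k}} \Big)_{\ell = 1, \ldots, L},$$
after some hidden layers. Finally, in practice, we use pure strategies: given a state value $x$, choose $a_{\ell^*(x)}$ with
$$\ell^*(x) \in \arg\max_{\ell = 1, \ldots, L} p_\ell(x; \beta_n).$$
For example, the NNContPI Algo with classification reads as follows:

For $n = N-1, \ldots, 0$, keep track of the approximated optimal policies $\hat a_k$, $k = n+1, \ldots, N-1$, and compute
$$\hat\beta_n \in \arg\min_{\beta} \mathbb{E}\Big[ \sum_{\ell=1}^{L} p_\ell(X_n; \beta) \Big( f(X_n, a_\ell) + \sum_{k=n+1}^{N-1} f\big(X_k^\ell, \hat a_k(X_k^\ell)\big) + g(X_N^\ell) \Big) \Big],$$
where $X_n \sim \mu_n$, $X_{n+1}^\ell = F(X_n, a_\ell, \varepsilon_{n+1})$, $X_{k+1}^\ell = F\big(X_k^\ell, \hat a_k(X_k^\ell), \varepsilon_{k+1}\big)$, for $k = n+1, \ldots, N-1$, $\ell = 1, \ldots, L$.

Update the approximate optimal policy at time $n$ by
$$\hat a_n(x) = a_{\hat\ell_n(x)}$$
with
$$\hat\ell_n(x) \in \arg\max_{\ell = 1, \ldots, L} p_\ell(x; \hat\beta_n).$$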
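The classification viewpoint can be illustrated with a minimal softmax policy over a finite control set. In the sketch below the scoring network is reduced to a single hypothetical affine layer on a scalar state, and the action set and weights are arbitrary; the point is only to show the randomized policy (a probability vector) and the pure strategy obtained by taking the most likely action.

```python
import math

def softmax(z):
    """Numerically stable softmax: probabilities proportional to exp(z_l)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

def policy_probs(x, W, b):
    """Hypothetical one-layer scoring network followed by a softmax output layer."""
    return softmax([w * x + bi for w, bi in zip(W, b)])

def pure_strategy(probs, actions):
    """Pick the action with the largest probability."""
    return actions[max(range(len(actions)), key=lambda l: probs[l])]

ACTIONS = [-1.0, 0.0, 1.0]                     # finite control set (assumption)
W, b = [-2.0, 0.0, 2.0], [0.0, 0.5, 0.0]       # arbitrary weights (assumption)
probs = policy_probs(1.0, W, b)
chosen = pure_strategy(probs, ACTIONS)
```

During training the probabilities weight the costs of the individual actions, which keeps the objective differentiable in the network parameters; the argmax is only applied once the parameters are fixed.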
3.4.2 Comparison of the algorithms
We emphasize the pros (+) and cons (-) of the three proposed algorithms in terms of bias estimate for the value function, variance, complexity, dimension for the state space, and number of time steps.

| Algo | Bias estimate | Variance | Complexity | Dimension | Number of time steps |
|---|---|---|---|---|---|
| NNContPI | + | - | - | + | - |
| HybridNow | - | + | + | + | + |
| HybridLaterQ | - | ++ | + | - | + |

This table is the result of observations made when numerically solving various control problems, combined with a close look at the rates of convergence derived for the three algorithms in Theorems 4.1, 4.2 and 4.3. Note that the sensitivity of the NNContPI and the HybridLaterQ algorithms w.r.t. the number of time steps $N$ is clearly described in the studies of their rates of convergence achieved in Theorems 4.1 and 4.3. However, we could only provide a weak result on the rate of convergence of the HybridNow algorithm (see Theorem 4.2), which in particular does not explain why the latter does not suffer from large values of $N$, unless stronger assumptions are made on the loss of the neural network estimating the optimal controls.
4 Convergence analysis
This section is devoted to the convergence of the estimator $\hat V_n^M$ of the value function $V_n$ obtained from a training sample of size $M$ and using the DNN algorithms listed in Section 3.
Training samples rely on a given family of probability distributions $\mu_n$ on $\mathcal{X}$, for $n = 0, \ldots, N$, referred to as training distributions (see Section 3.3 for a discussion on the choice of $\mu_n$). For the sake of simplicity, we consider that $\mu_n$ does not depend on $n$, and then denote by $\mu$ the training distribution. We shall assume that the family of controlled transition probabilities has a density w.r.t. $\mu$, i.e.,