This paper is devoted to the numerical resolution of discrete-time stochastic control problem over a finite horizon. The dynamics of the controlled state process valued in is given by
is a sequence of i.i.d. random variables valued in some Borel space
, and defined on some probability spaceequipped with the filtration generated by the noise ( is the trivial -algebra), the control is an -adapted process valued in , and is a measurable function from into . Given a running cost function defined on and a terminal cost function defined on , the cost functional associated with a control process is
The set of admissible controls is the set of control processes satisfying some integrability conditions ensuring that the cost functional
is well-defined and finite. The control problem, also called Markov decision process (MDP), is formulated as
and the goal is to find an optimal control , i.e., attaining the optimal value: . Notice that problem (1.1)-(1.3) may also be viewed as the time discretization of a continuous time stochastic control problem, in which case, is typically the Euler scheme for a controlled diffusion process.
It is well-known that the global dynamic optimization problem (1.3) can be reduced to local optimization problems via the dynamic programming (DP) approach, which allows to determine the value function in a backward recursion by
Moreover, when the infimum is attained in the DP formula (1.4) at any time by , we get an optimal control in feedback form (policy) given by: where is the Markov process defined by
The practical implementation of the DP formula may suffer from the curse of dimensionality and large complexity when the state space dimensionand the control space dimension are high. In , we proposed algorithms relying on deep neural networks for approximating/learning the optimal policy and then eventually the value function by performance/policy iteration or hybrid iteration with Monte Carlo regressions now or later. This research led to three algorithms, namely algorithms NNcontPI, Hybrid-Now and Hybrid-LaterQ that are recalled in Section 2. In Section 3, we perform some numerical and comparative tests for illustrating the efficiency of our different algorithms, on -dimensional nonlinear PDEs examples as in  and quadratic Backward Stochastic Differential equations as in . We also provide numerical results for an option hedging problem in finance, and energy storage problems arising in the valuation of gas storage and in microgrid management. Finally, we conclude in Section 4 with some comments about possible extensions and improvements of our algorithms.
The proposed algorithms can deal with state and control constraints at any time, which is useful in several applications:
where is some given subset of . In this case, in order to ensure that the set of admissible controls is not empty, we assume that the sets
are non empty for all , and the DP formula now reads
From a computational point of view, it may be more convenient to work with unconstrained state/control variables, hence by relaxing the state/control constraint and introducing into the running cost a penalty function : , and . For example, if the constraint set is in the form: , for some functions , then one can take as penalty functions:
where are penalization coefficients (large in practice).
This section recalls the DNN-based algorithms we propose to solve the discrete-time stochastic control problem (1.1)-(1.2)-(1.3)-(1.4). These algorithms have been described and analyzed in detail in our companion paper . We also introduce a quantization and
-nearest-neighbor-based algorithm (knn) to be used as benchmark when testing our algorithms on low-dimensional control problems.
We are given a class of deep neural networks (DNN) for the control policy represented by the parametric functions , with parameters , and a class of DNN for the value function represented by the parametric functions: , with parameters . Recall that these DNN functions and
are compositions of linear combinations and nonlinear activation functions, see.
2.1 Control Learning by Performance Iteration (NNContPI & ClassifPI)
For , keep track of the approximated optimal policies , , and compute the approximated optimal policy at time by
where is distributed according to a training probability measure on , and with defined by induction, for , as:
We later refer to this algorithm as the NNContPI algorithm.
(Case of finite control space) In the case where the control space is finite, i.e., Card with , a classification method can be used: consider a DNN that takes state
as input and returns a probability vectorwith parameters ; the algorithm reads:
For , keep track of the approximated optimal policies , , and compute the approximated optimal policy at time by
where is distributed according to a training probability measure on , and , , for , .
In the numerical applications of the next section, we refer to this classification-based algorithm as the ClassifPI algorithm.
2.2 Double DNN with Regress Now (Hybrid-Now)
compute the approximated policy at time
where is distributed according to a training probability measure on , and .
estimate the value function at time
2.3 Double DNN with Regress Later and Quantization (Hybrid-LaterQ)
We are given in addition an -optimal quantizer of the noise
via a discrete random variablevalued in a grid of points in , and with weights .
compute the approximated policy at time
where is distributed according to a training probability measure on , and where .
approximate the value function at time
estimate analytically by quantization the value function at time :
we refer to section 3.3 of  for a discussion on the choice of the training measure.
2.4 Quantization with k-nearest-neighbors (Qknn-algorithm)
We now present a simple version of the Qknn algorithm, based on the quantization and -nearest neighbors methods, which will be the benchmark for all the low-dimensional control problems that will be considered in the next section. We refer to  for a detailed presentation of a more sophisticated version of this algorithm, and comparisons to other well-known algorithms on various control problems.
We are given an -optimal quantizer of the noise via a discrete random variable valued in a grid of points in , and with weights ; as well as grids , of points in , which are assumed to cover the region of that is likely to be visited by the optimally driven process at time .
compute the approximated -value at time
where Proj is the Euclidean projection over .
compute the optimal control at time
using classical algorithms for optimization of deterministic functions.
estimate analytically by quantization the value function:
3 Numerical applications
3.1 A semilinear PDE
We consider the following semilinear PDE with quadratic growth in the gradient:
By observing that for any , - , the PDE (3.1) can be written as a Hamilton-Jacobi-Bellman equation
hence associated with the stochastic control problem
where is the controlled process governed by
is a -dimensional Brownian motion, and the control process is valued in . The time discretization (with time step ) of the control problem (3.3) leads to the discrete-time control problem (1.1)-(1.2)-(1.3) with
where is a sequence of i.i.d. random variables of law , and a cost functional,
We choose to run tests on two different examples that have already been considered in the literature:
Some recent numerical results have been obtained in  (see Section 4.3 in ) when and in dimension (see Table 2 and Figure 3 in ). Their method is based on neural network regression to solve the BSDE representation of the control problem, and provide estimations of the value function at time 0 and state 0 for different values of a coefficient . We plotted the results of the Hybrid-Now algorithm in Figure 1. Hybrid-Now achieves a relative error of 0,13% in a bit less than 2hours using a 4-cores 3GHz intel Core i7 CPU, which is very close to the result found by , and reaches a relative error of 0,09% in a bit more than 4 hours. We want to highlight the fact that the algorithm presented in  only needs hundreds of seconds to provide a relative error of 0,17%, which is not comparable to the time required by Hybrid-Now to converge. However, we believe that the computation time can easily be alleviated; some ideas in that direction are discussed in section 4.
We also considered the same problem in dimension , for which we plotted the first component of w.r.t. time in Figure 2, for five different paths of the Brownian motion, where for each , the agent follows either the naive ( ) or the Hybrid-Now strategy. On can see that both strategies are very similar when the terminal time is far; but the Hybrid-Now strategy clearly forces to get closer to 0 when the terminal time gets closer, in order to reduce the terminal cost.
Tests of the algorithms have also been run in dimension 1 with the terminal cost and . This problem has already been considered in , where the author used the BSDE-based algorithm presented in . Their results for the value function estimation at time 0 and state 0, when , are available in , and have been reported in column of Table 1. Also, the exact values for the value function have been computed for these values of , using the closed-form formula and running a Monte Carlo, and are reported in the column Bench of Table 1. Tests of the Hybrid-Now and Hybrid-LaterQ algorithms have been run, and the estimations of the value function at time 0 and state are reported in the Hybrid-Now and Hybrid-LaterQ columns. We also tested the Qknn algorithm based on quantization of the exogenous noise and k-nearest neighbors method, and reported the results in column Qknn. Qknn does not regress on neural networks, but rather uses k-nearest neighbors (knn) estimates to approximate the -value. See  for a presentation, more details and several different tests of the Qknn algorithm. Note that Qknn is particularly well-suited to 1-dimensional control problems. In particular, it is not time-consuming since the dimension of the state space is 1. Actually, it provides the fastest results, which is not surprising since the other algorithms need time to learn the optimal strategies and value functions through gradient-descent methods at each time step . Moreover, Table 1 reveals that Qknn is the most accurate algorithm on this example, probably because it uses local methods in space to estimate the conditional expectation that appears in the expression of the -value.
We end this paragraph by giving some implementation details for the different algorithms as part of Test 2.
Implementation details of Y&R algorithm: Y&R algorithm requires to approximate the control problem by using a Lipschitz version of like the following:
Implementation details of Hybrid-Now algorithm: We use time steps for the discretization of the time interval . The value functions and optimal controls at time
are estimated using neural networks with 3 hidden layers and 10+5+5 neurons.
Implementation details of Hybrid-LaterQ algorithm: We use time steps for the discretization of the time interval . The value functions and optimal controls at time are estimated using neural networks with 3 hidden layers containing 10+5+5 neurons; and 51 points for the quantization of the exogenous noise.
Implementation details of Qknn algorithm: We use time steps for the discretization of the time interval . We take 51 points to quantize the exogenous noise, , for ; and 200 points for the space discretization. See  for more details on the Qknn algorithm.
The main conclusion regarding the numerical implementations and comparisons of this semilinear PDE is that the Hybrid-Now algorithm performs well in the control problem of dimension =100, and outperforms the Hybrid-LaterQ algorithm in dimension =2.
3.2 Option hedging
Our second example comes from a classical hedging problem in finance. We consider an investor who trades in stocks with (positive) price process , and we denote by valued in the amount held in these assets on the period . We assume for simplicity that the price of the riskless asset is constant equal to (zero interest rate). It is convenient to introduce the return process as: , , so that the self-financed wealth process of the investor with a portfolio strategy , and starting from some capital , is governed by
Given an option payoff , the objective of the agent is to minimize over her portfolio strategies her expected square replication error
where is a convex function on . Assuming that the returns , are i.i.d, we are in a -dimensional framework of Section 1 with with valued in , with the dynamics function
the running cost function and the terminal cost . We test our algorithm in the case of a square loss function, i.e. , and when there is no portfolio constraints , and compare our numerical results with the explicit solution derived in : denote by the distribution of , by its mean, and by assumed to be invertible; we then have
where the functions , and are given in backward induction, starting from the terminal condition
and for , by
so that , where is the initial stock price. Moreover, the optimal portfolio strategy is given in feedback form by , where is the function
and is the optimal wealth associated with , i.e., . Moreover, the initial capital that minimizes , and called (quadratic) hedging price is given by
Take , and consider one asset with returns modeled by a trinomial tree:
with , , , . Take , and consider the call option with . The price of this option is defined as the initial value of the portfolio that minimizes the terminal quadratic loss of the agent when the latter follows the optimal strategy associated with the initial value of the portfolio. In this test, we want to determine the price of the call and the associated optimal strategy using different algorithms.
In Figure 3, we plot the value function at time 0 w.r.t , the initial value of the portfolio, when the agent follows the theoretical optimal strategy (benchmark), and the optimal strategy estimated by the Hybrid-Now or Hybrid-LaterQ algorithms. We perform forward Monte Carlo using 10,000 samples to approximate the lower bound of the value function at time 0 (see  for details on how to get an approximation of the upper-bound of the value function via duality). One can observe that while all the algorithms give a call option price approximately equal to 4.5, Hybrid-LaterQ clearly provides a better strategy than Hybrid-Now to reduce the quadratic risk of the terminal loss.
We plot in Figure 4 three different paths of the value of the portfolio w.r.t the time , when the agent follows either the theoretical optimal strategy (red), or the estimated one using the Hybrid-Now algorithm (blue) or Hybrid-LaterQ algorithm (green). We set for these simulations. Note that for such a big value of , it is not obvious that Hybrid-LaterQ is better than Hybrid-Now.
Comments on the Hybrid-Now and Hybrid-LaterQ algorithms
The Option Hedging problem belongs to the class of the linear quadratic control problems for which we expect the optimal control to be affine in and the value function to be quadratic in . It is then natural to consider the following classes of controls and functions to properly approximate the optimal controls and the values functions at time =:
where describes the parameters (weights+ bias) associated with the neural network and describes those associated with the neural network . The notation stands for the transposition, and for the inner product. Note that there are 2 (resp. 3) neurons in the output layer of (resp. ), so that the inner product is well-defined in (3.5) and (3.4). It is then natural to use gradient-descent-based algorithms to find the optimal parameter (resp. ) for which (resp. ) coincides with the optimal control (resp. the value function) at time =.
The option hedging problem is linear quadratic, hence belongs to the class of problems where the agent has ansatzes on the optimal control and the value function. For these kind of problems, the algorithms presented in  can easily be adapted so that the expressions of the estimators satisfy the ansatzes, see e.g. (3.4) and (3.5).
3.3 Valuation of energy storage
We present a discrete-time version of the energy storage valuation problem studied in . We consider a commodity (gas) that has to be stored in a cave, e.g. salt domes or aquifers. The manager of such a cave aims to maximize the real options value by optimizing over a finite horizon the dynamic decisions to inject or withdraw gas as time and market conditions evolve. We denote by the gas price, which is an exogenous real-valued Markov process modeled by the following mean-reverting process:
where , and is the stationary value of the gas price. The current inventory in the gas storage is denoted by and depends on the manager’s decisions represented by a control process valued in : (resp. ) means that she injects (resp. withdraws) gas with an injection (resp. withdrawal) rate (resp. ) requiring (causing) a purchase (resp. sale) of (resp. ), and means that she is doing nothing. The difference between and (resp. and ) indicate gas loss during injection/withdrawal. The evolution of the inventory is then governed by
where we set
and we have the physical inventory constraint:
The running gain of the manager at time is given by
and represents the storage cost in each regime . The problem of the manager is then to maximize over the expected total profit
where a common choice for the terminal condition is
which penalizes for having less gas than originally, and makes this penalty proportional to the current price of gas ( ). We are then in the -dimensional framework of Section 1 with , and the set of admissible controls in the dynamic programming loop is given by:
As in , we consider the example
, , , , , with , and in the terminal penalty function, , , .
Figure 5 provides the value function estimates at time 0 w.r.t. using Qknn algorithm, compared to the benchmark (Bench) defined as the naive do-nothing strategy . As expected, the naive strategy performs well when is small, since, in this case, it takes time to fill the cave, so that the agent is likely to do nothing so as not to be penalized at terminal time. When is large, it is easy to fill up the cave, so the agent has more freedom to buy and sell gas in the market without worrying about the terminal cost. Observe that the value function is not monotone, due to the fact that the state space for the volume of gas in the cave is a bounded discrete set.
Table 2 provides the value function estimates obtained with the ClassifPI, Hybrid-Now and Hybrid-LaterQ algorithms. Observe first that the estimations provided by the Qknn algorithm are larger than those provided by the other algorithms, meaning that Qknn outperforms the other algorithms. The second best algorithm is ClassifPI, while the performance of Hybrid-Now is poor and clearly suffers from instability, due probably to the discontinuity of the running rewards w.r.t. the control variable.
Finally, Figures 6, 7, 8 provide the optimal decisions w.r.t. at times 5, 10, 15, 20, 25, 29 estimated respectively by the Qknn, ClassifPI and Hybrid-Now algorithms. As expected, one can observe on each plot that the optimal strategy is to inject gas when the price is low, to sell gas when the price is high, and to make sure to have a volume of gas greater than in the cave when the terminal time is getting closer to minimize the terminal cost. Let us now comment on the implementation of algorithms:
Comments on the Qknn algorithm: Table 2 shows that once again, due to the low-dimensionality of the problem, Qknn provides the best value function estimates. The estimated optimal strategies, shown on Figure 6, are very good estimations of the theoretical ones. The three decision regions on Figure 6 are natural and easy to interpret: basically it is optimal to sell when the price is high, and to buy when it is low. However, a closer look reveals that the waiting region (where it is optimal to do nothing) has an unusual triangular-based shape, which, while close to the theoretical one, can be expected to be very hard to reproduce with the DNN-based algorithms proposed in .
Comments on the ClassifPI algorithm: As shown on Figure 7, the ClassifPI algorithm manages to provide stable estimates for the optimal controls at time . However, the latter is not able to catch the particular triangular-based shape of the waiting region, which explains why Qknn performs better.
Comments on the Hybrid-Now algorithm: As shown on Figure 8, the Hybrid-Now algorithm only manages to provide a weak estimation of the three different regions at time . In particular, the regions suffer from instability.
We end this paragraph by providing some implementation details for the different algorithms we tested.
Implementation details for the Qknn algorithm: We recall that the Qknn algorithm is based on the quantization and -nearest neighbors methods to estimate the value functions at time . We take the closest neighbors for the estimation of the regression of the value functions, in order to insure continuity of the estimation w.r.t. the pair of state variables. The optimal control is computed at each point of the grid using deterministic optimizers such as the Golden-section search or the Brent algorithm, which are classical optimization routines available in many numerical libraries.
Implementation details for the neural network-based algorithms: We use neural networks with two hidden layers, ELU activation functionseeeThe Exponential Linear Unit (ELU) activation function is defined as . and neurons . The output layer contains 3 neurons with softmax activation function for the ClassifPI algorithm and no activation function for the Hybrid-Now one. We use a training set of size
at each time step. Note that given the expression of the terminal cost, the ReLU activation functions (Rectified Linear Units) could have been deemed a better choice to capture the shape of the value functions, but our tests revealed that ELU activation functions provide better results.
The main conclusion of our numerical comparisons on this energy storage example is that ClassifPI, the DNN-based classification algorithm designed for stochastic control problems with discrete control space, appears to be more accurate than the more general Hybrid-Now. Nevertheless, ClassifPI was not able to capture the unusual triangle-based shape of the optimal control as well as Qknn did.
3.4 Microgrid management
Finally, we consider a discrete-time model for power microgrid inspired by the continuous-time models developed in  and ; see also . The microgrid consists of a photovoltaic (PV) power plant, a diesel generator and a battery energy storage system (BES), hence using a mix of fuel and renewable energy sources. These generation units are decentralized, i.e., installed at a rather small scale (a few kW power), and physically close to electricity consumers. The PV produces electricity from solar panels with a generation pattern depending on the weather conditions. The diesel generator has two modes: on and off. Turning it on consumes fuel, and produces an amount of power . The BES can store energy for later use but has limited capacity and power. The aim of the microgrid management is to find the optimal planning that meets the power demand, denoted by , while minimizing the operational costs due to the diesel generator. We denote by
the residual demand of power: when , one should provide power through diesel or battery, and when , one can store the surplus power in the battery.
The optimal control problem over a fixed horizon is formulated as follows. At any time , the microgrid manager decides the power production of the diesel generator, either by turning it off: , or by turning it on, hence generating a power valued in with . There is a fixed cost associated with switching from the on/off mode to the other one off/on, and we denote by the mode valued in of the generator right before time , i.e.,