 # Deep neural networks algorithms for stochastic control problems on finite horizon, Part 2: numerical applications

This paper presents several numerical applications of deep learning-based algorithms that have been analyzed in . Numerical and comparative tests using TensorFlow illustrate the performance of our different algorithms, namely control learning by performance iteration (algorithms NNContPI and ClassifPI) and control learning by hybrid iteration (algorithms Hybrid-Now and Hybrid-LaterQ), on the 100-dimensional nonlinear PDE examples from  and on quadratic backward stochastic differential equations as in . We also provide numerical results for an option hedging problem in finance, and for energy storage problems arising in the valuation of gas storage and in microgrid management.


## 1 Introduction

This paper is devoted to the numerical resolution of discrete-time stochastic control problems over a finite horizon. The dynamics of the controlled state process, valued in $\mathbb{R}^d$, is given by

$$X_{n+1} = F(X_n, \alpha_n, \varepsilon_{n+1}), \quad n = 0, \dots, N-1, \qquad X_0 = x_0 \in \mathbb{R}^d, \tag{1.1}$$

where

$(\varepsilon_n)_n$ is a sequence of i.i.d. random variables valued in some Borel space $E$, and defined on some probability space $(\Omega, \mathcal{F}, \mathbb{P})$ equipped with the filtration generated by the noise ($\mathcal{F}_0$ is the trivial $\sigma$-algebra), the control $\alpha = (\alpha_n)_n$ is an adapted process valued in $A \subset \mathbb{R}^q$, and $F$ is a measurable function from $\mathbb{R}^d \times A \times E$ into $\mathbb{R}^d$. Given a running cost function $f$ defined on $\mathbb{R}^d \times A$ and a terminal cost function $g$ defined on $\mathbb{R}^d$, the cost functional associated with a control process $\alpha$ is

$$J(\alpha) = \mathbb{E}\Big[\sum_{n=0}^{N-1} f(X_n, \alpha_n) + g(X_N)\Big]. \tag{1.2}$$

The set of admissible controls $\mathcal{A}$ is the set of control processes $\alpha$ satisfying some integrability conditions ensuring that the cost functional $J(\alpha)$

is well-defined and finite. The control problem, also called a Markov decision process (MDP), is formulated as

$$V_0(x_0) := \inf_{\alpha \in \mathcal{A}} J(\alpha), \tag{1.3}$$

and the goal is to find an optimal control $\alpha^* \in \mathcal{A}$, i.e., one attaining the optimal value: $V_0(x_0) = J(\alpha^*)$. Notice that problem (1.1)-(1.3) may also be viewed as the time discretization of a continuous-time stochastic control problem, in which case $F$ is typically the Euler scheme for a controlled diffusion process.

It is well known that the global dynamic optimization problem (1.3) can be reduced to local optimization problems via the dynamic programming (DP) approach, which allows one to determine the value function in a backward recursion by

$$V_N(x) = g(x), \quad x \in \mathbb{R}^d, \qquad V_n(x) = \inf_{a \in A} Q_n(x,a), \tag{1.4}$$
$$\text{with} \quad Q_n(x,a) = f(x,a) + \mathbb{E}\big[V_{n+1}(X_{n+1}) \,\big|\, X_n = x, \alpha_n = a\big], \quad (x,a) \in \mathbb{R}^d \times A.$$

Moreover, when the infimum in the DP formula (1.4) is attained at any time $n$ by $a^*_n(x)$, we get an optimal control in feedback form (policy) given by $\alpha^* = (a^*_n(X^*_n))_n$, where $X^*$ is the Markov process defined by

$$X^*_{n+1} = F\big(X^*_n, a^*_n(X^*_n), \varepsilon_{n+1}\big), \quad n = 0, \dots, N-1, \qquad X^*_0 = x_0.$$
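Given the feedback maps $a^*_n$, the optimally controlled path $X^*$ of the last display can be simulated forward. A minimal sketch (all function and variable names here are ours, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_feedback(x0, policies, F, noise_sampler):
    """Roll the controlled state forward under feedback policies a*_n.

    policies[n] maps a state x to a control a, and F(x, a, eps) is the
    transition function of (1.1).
    """
    x = np.asarray(x0, dtype=float)
    path = [x]
    for policy in policies:
        x = F(x, policy(x), noise_sampler())
        path.append(x)
    return path

# Toy instance: additive dynamics with a stabilising linear feedback.
F = lambda x, a, eps: x + a + eps
policies = [lambda x: -0.5 * x] * 10
path = simulate_feedback(np.zeros(2), policies, F, lambda: rng.normal(size=2))
```

This is exactly the forward pass used later in the paper to evaluate a learned strategy by Monte Carlo.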

The practical implementation of the DP formula may suffer from the curse of dimensionality and large complexity when the state space dimension $d$

and the control space dimension $q$ are high. In , we proposed algorithms relying on deep neural networks for approximating/learning the optimal policy, and then eventually the value function, by performance/policy iteration or hybrid iteration with Monte Carlo regressions now or later. This research led to three algorithms, namely NNContPI, Hybrid-Now and Hybrid-LaterQ, which are recalled in Section 2. In Section 3, we perform numerical and comparative tests illustrating the efficiency of our different algorithms on 100-dimensional nonlinear PDE examples as in  and quadratic backward stochastic differential equations as in . We also provide numerical results for an option hedging problem in finance, and for energy storage problems arising in the valuation of gas storage and in microgrid management. Finally, we conclude in Section 4 with some comments about possible extensions and improvements of our algorithms.

###### Remark 1.1

The proposed algorithms can deal with state and control constraints at any time, which is useful in several applications:

$$(X^\alpha_n, \alpha_n) \in S \quad \text{a.s.}, \quad n = 0, \dots, N-1,$$

where $S$ is some given subset of $\mathbb{R}^d \times A$. In this case, in order to ensure that the set of admissible controls is not empty, we assume that the sets

$$A(x) := \big\{a \in \mathbb{R}^q : \big(F(x, a, \varepsilon_1), a\big) \in S \ \text{a.s.}\big\}$$

are nonempty for all $x$, and the DP formula now reads

$$V_n(x) = \inf_{a \in A(x)} \big[f(x,a) + P^a V_{n+1}(x)\big], \quad x \in S.$$

From a computational point of view, it may be more convenient to work with unconstrained state/control variables, hence by relaxing the state/control constraint and introducing into the running cost a penalty function $L(x,a)$. For example, if the constraint set is of the form $\{(x,a) : h_k(x,a) = 0,\ k = 1,\dots,p,\ h_k(x,a) \geq 0,\ k = p+1,\dots,q\}$, for some functions $h_k$, then one can take as penalty function:

$$L(x,a) = \sum_{k=1}^{p} \mu_k |h_k(x,a)|^2 + \sum_{k=p+1}^{q} \mu_k \max\big(0, -h_k(x,a)\big),$$

where the $\mu_k$ are penalization coefficients (large in practice).
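The penalty above is straightforward to code; a small sketch (function names are ours):

```python
def penalty(h_eq, h_ineq, mu_eq, mu_ineq, x, a):
    """Penalty L(x, a) relaxing the constraint set S.

    h_eq holds the equality-type constraints h_1..h_p (penalized
    quadratically); h_ineq holds the inequality-type constraints
    h_{p+1}..h_q with h_k(x, a) >= 0 (penalized linearly when violated).
    """
    eq = sum(mu * abs(h(x, a)) ** 2 for mu, h in zip(mu_eq, h_eq))
    ineq = sum(mu * max(0.0, -h(x, a)) for mu, h in zip(mu_ineq, h_ineq))
    return eq + ineq

# Example: penalize the control constraint a >= 0 with coefficient 100.
val = penalty([], [lambda x, a: a], [], [100.0], x=0.0, a=-0.2)
```

With the violated constraint `a = -0.2`, the penalty is `100 * 0.2 = 20`.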

## 2 Algorithms

This section recalls the DNN-based algorithms we propose to solve the discrete-time stochastic control problem (1.1)-(1.2)-(1.3)-(1.4). These algorithms have been described and analyzed in detail in our companion paper . We also introduce a quantization and k-nearest-neighbor-based algorithm (Qknn) to be used as a benchmark when testing our algorithms on low-dimensional control problems.

We are given a class of deep neural networks (DNN) for the control policy, represented by the parametric functions $x \mapsto A(x;\beta)$ with parameters $\beta$, and a class of DNN for the value function, represented by the parametric functions $x \mapsto \Phi(x;\theta)$ with parameters $\theta$. Recall that these DNN functions are compositions of linear combinations and nonlinear activation functions; see .

### 2.1 Control Learning by Performance Iteration (NNContPI & ClassifPI)

• For $n = N-1, \dots, 0$, keep track of the approximated optimal policies $\hat a_k$, $k = n+1, \dots, N-1$, and compute the approximated optimal policy at time $n$ by

$$\hat a_n = A(\cdot;\hat\beta_n) \;\text{ with }\; \hat\beta_n \in \operatorname*{argmin}_{\beta}\; \mathbb{E}\Big[f\big(X_n, A(X_n;\beta)\big) + \sum_{k=n+1}^{N-1} f\big(X^\beta_k, \hat a_k(X^\beta_k)\big) + g\big(X^\beta_N\big)\Big], \tag{2.1}$$

where $X_n$ is distributed according to a training probability measure $\mu$ on $\mathbb{R}^d$, and $(X^\beta_k)_k$ is defined by induction as $X^\beta_{n+1} = F(X_n, A(X_n;\beta), \varepsilon_{n+1})$ and $X^\beta_{k+1} = F(X^\beta_k, \hat a_k(X^\beta_k), \varepsilon_{k+1})$ for $k = n+1, \dots, N-1$.

We later refer to this algorithm as the NNContPI algorithm.

###### Remark 2.1

In practice, we use the Adam algorithm, implemented in TensorFlow, to compute $\hat\beta_n$ in (2.1), and refer to Section 3.3 of  for a discussion on the choice of the training measure.
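The expectation minimized in (2.1) can be estimated by Monte Carlo rollouts of the controlled dynamics. A small dependency-free sketch of this criterion (names are ours; the paper minimizes it over the policy parameters with Adam in TensorFlow):

```python
import numpy as np

def perf_iter_cost(beta, policy, later_policies, f, g, F, x_samples, noise):
    """Monte Carlo estimate of the NNContPI criterion at time n (sketch).

    policy(x, beta) plays the role of A(x; beta); later_policies are the
    already-learned feedback maps for times n+1, ..., N-1; x_samples is
    drawn from the training measure mu; noise[m, k] is the k-th noise on
    path m.
    """
    total = 0.0
    for m, x in enumerate(x_samples):
        a = policy(x, beta)
        cost = f(x, a)                      # running cost at time n
        x = F(x, a, noise[m, 0])            # one controlled transition
        for k, pol in enumerate(later_policies, start=1):
            a = pol(x)
            cost += f(x, a)
            x = F(x, a, noise[m, k])
        total += cost + g(x)                # terminal cost
    return total / len(x_samples)

# Toy check: one step to go (n = N-1), zero noise, zero control.
cost = perf_iter_cost(0.0, lambda x, b: b * x, [],
                      lambda x, a: a ** 2, lambda x: x ** 2,
                      lambda x, a, e: x + a + e,
                      [1.0], np.zeros((1, 1)))
```

In the toy check the control is zero, so the criterion reduces to the terminal cost $g(1) = 1$.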

###### Remark 2.2

(Case of finite control space) In the case where the control space $A$ is finite, i.e., $\mathrm{Card}(A) = L$ with $A = \{a_1, \dots, a_L\}$, a classification method can be used: consider a DNN that takes the state $x$

as input and returns a probability vector $p(x;\beta) = (p_\ell(x;\beta))_{\ell=1}^L$

with parameters $\beta$; the algorithm reads:

• For $n = N-1, \dots, 0$, keep track of the approximated optimal policies $\hat a_k$, $k = n+1, \dots, N-1$, and compute the approximated optimal policy at time $n$ by

$$\hat a_n(x) = a_{\hat\ell_n(x)} \;\text{ with }\; \hat\ell_n(x) \in \operatorname*{argmax}_{\ell=1,\dots,L}\, p_\ell(x;\hat\beta_n),$$
$$\hat\beta_n \in \operatorname*{argmin}_{\beta}\; \mathbb{E}\Big[\sum_{\ell=1}^{L} p_\ell(X_n;\beta)\Big(f(X_n, a_\ell) + \sum_{k=n+1}^{N-1} f\big(X^\ell_k, \hat a_k(X^\ell_k)\big) + g\big(X^\ell_N\big)\Big)\Big],$$

where $X_n$ is distributed according to a training probability measure $\mu$ on $\mathbb{R}^d$, and $X^\ell_{n+1} = F(X_n, a_\ell, \varepsilon_{n+1})$, $X^\ell_{k+1} = F(X^\ell_k, \hat a_k(X^\ell_k), \varepsilon_{k+1})$ for $k = n+1, \dots, N-1$.

In the numerical applications of the next section, we refer to this classification-based algorithm as the ClassifPI algorithm.
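A toy version of such a classification policy, with a one-layer softmax network standing in for the DNN of Remark 2.2 (all names ours):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classif_policy(x, W, b, actions):
    """Classification-style policy over a finite control space (sketch).

    A one-layer network (weights W, bias b) maps the state x to
    probabilities p_1..p_L over the L actions; the policy returns the
    action with the largest probability, as in the ClassifPI rule.
    """
    p = softmax(W @ np.atleast_1d(x) + b)
    return actions[int(np.argmax(p))], p

# Usage: two actions {+1, -1}; positive states favour the first action.
a, p = classif_policy(np.array([2.0]), np.array([[1.0], [-1.0]]),
                      np.zeros(2), [1, -1])
```

In the full algorithm the probabilities also weight the rolled-out costs in the training objective; only the argmax rule is sketched here.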

### 2.2 Double DNN with Regress Now (Hybrid-Now)

• Initialize $\hat V_N = g$.

• For $n = N-1, \dots, 0$,

• compute the approximated policy at time $n$:

$$\hat a_n = A(\cdot;\hat\beta_n) \;\text{ with }\; \hat\beta_n \in \operatorname*{argmin}_{\beta}\; \mathbb{E}\Big[f\big(X_n, A(X_n;\beta)\big) + \hat V_{n+1}\big(X^\beta_{n+1}\big)\Big], \tag{2.2}$$

where $X_n$ is distributed according to a training probability measure $\mu$ on $\mathbb{R}^d$, and $X^\beta_{n+1} = F(X_n, A(X_n;\beta), \varepsilon_{n+1})$.

• estimate the value function at time $n$:

$$\hat V_n = \Phi(\cdot;\hat\theta_n) \;\text{ with }\; \hat\theta_n \in \operatorname*{argmin}_{\theta}\; \mathbb{E}\Big[f\big(X_n, \hat a_n(X_n)\big) + \hat V_{n+1}\big(X^{\hat\beta_n}_{n+1}\big) - \Phi(X_n;\theta)\Big]^2. \tag{2.3}$$
###### Remark 2.3

We again use the Adam algorithm, natively implemented in TensorFlow, to compute $\hat\beta_n$ in (2.2) and $\hat\theta_n$ in (2.3), and we refer to Section 3.3 of  for a discussion on the choice of the training measure.
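The regression step (2.3) is an empirical least-squares fit of $\Phi(\cdot;\theta)$ to one-step targets. A minimal Monte Carlo sketch of that loss (all names ours; the paper minimizes it with Adam over a DNN's parameters):

```python
def hybrid_now_value_loss(theta, value_net, v_next, f, F, policy,
                          x_samples, eps_samples):
    """Empirical version of the regression loss (2.3) (sketch).

    value_net(x, theta) plays the role of Phi(x; theta); v_next is the
    previously-fitted estimate of V_{n+1}; policy is the control fitted
    in step (2.2); x_samples is drawn from the training measure mu.
    """
    loss = 0.0
    for x, eps in zip(x_samples, eps_samples):
        a = policy(x)
        target = f(x, a) + v_next(F(x, a, eps))   # one-step DP target
        loss += (target - value_net(x, theta)) ** 2
    return loss / len(x_samples)

# Toy check: constant target 1, constant network value theta.
loss = hybrid_now_value_loss(1.0, lambda x, th: th, lambda x: 1.0,
                             lambda x, a: 0.0, lambda x, a, e: x,
                             lambda x: 0.0, [0.0, 1.0], [0.0, 0.0])
```

When `theta` equals the constant target, the loss vanishes, which is the fixed point the gradient descent aims for.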

### 2.3 Double DNN with Regress Later and Quantization (Hybrid-LaterQ)

We are given, in addition, an $L$-optimal quantizer of the noise $\varepsilon$:

a discrete random variable $\hat\varepsilon$ valued in a grid $\{e_1, \dots, e_L\}$ of $L$ points in $E$, with weights $p_1, \dots, p_L$.

• Initialize $\hat V_N = g$.

• For $n = N-1, \dots, 0$,

• compute the approximated policy at time $n$:

$$\hat a_n = A(\cdot;\hat\beta_n) \;\text{ with }\; \hat\beta_n \in \operatorname*{argmin}_{\beta}\; \mathbb{E}\Big[f\big(X_n, A(X_n;\beta)\big) + \hat V_{n+1}\big(X^\beta_{n+1}\big)\Big], \tag{2.4}$$

where $X_n$ is distributed according to a training probability measure $\mu$ on $\mathbb{R}^d$, and where $X^\beta_{n+1} = F(X_n, A(X_n;\beta), \varepsilon_{n+1})$.

• approximate the value function at time $n+1$:

$$\tilde V_{n+1} = \Phi(\cdot;\hat\theta_{n+1}) \;\text{ with }\; \hat\theta_{n+1} \in \operatorname*{argmin}_{\theta}\; \mathbb{E}\Big[\hat V_{n+1}\big(X^{\hat\beta_n}_{n+1}\big) - \Phi\big(X^{\hat\beta_n}_{n+1};\theta\big)\Big]^2. \tag{2.5}$$

• estimate analytically by quantization the value function at time $n$:

$$\hat V_n(x) = f\big(x, \hat a_n(x)\big) + \sum_{\ell=1}^{L} p_\ell\, \tilde V_{n+1}\big(F(x, \hat a_n(x), e_\ell)\big).$$
###### Remark 2.4
• We use the Adam algorithm to compute $\hat\beta_n$ in (2.4) and $\hat\theta_{n+1}$ in (2.5).

• Observe that step (ii) is an interpolation step, which means that any kind of loss function can be chosen to compute $\hat\theta_{n+1}$.

In (2.5), we decided to take the quadratic loss, mainly because of its smoothness.

• We refer to Section 3.3 of  for a discussion on the choice of the training measure.
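In the final step of each iteration, the conditional expectation over the noise is replaced by a finite sum over the quantization grid, so the value estimate can be evaluated analytically at any state. A minimal sketch with illustrative names:

```python
def quantized_value(x, policy, f, F, v_tilde, grid, weights):
    """Analytic quantization step of Hybrid-LaterQ (sketch).

    v_tilde is the interpolated DNN value for time n+1; grid holds the
    quantization points e_1..e_L of the noise with weights p_1..p_L.
    """
    a = policy(x)
    # E[v_tilde(F(x, a, eps))] approximated by the quantized sum.
    return f(x, a) + sum(p * v_tilde(F(x, a, e))
                         for p, e in zip(weights, grid))

# Toy check: linear value, symmetric two-point noise, "do nothing" policy.
v = quantized_value(2.0, lambda x: 0.0, lambda x, a: 0.0,
                    lambda x, a, e: x + e, lambda y: y,
                    [-1.0, 1.0], [0.5, 0.5])
```

With a linear `v_tilde` and symmetric noise, the quantized expectation returns the state itself.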

### 2.4 Quantization with k-nearest-neighbors (Qknn-algorithm)

We now present a simple version of the Qknn algorithm, based on the quantization and k-nearest-neighbor methods, which will serve as the benchmark for all the low-dimensional control problems considered in the next section. We refer to  for a detailed presentation of a more sophisticated version of this algorithm, and for comparisons with other well-known algorithms on various control problems.

We are given an $L$-optimal quantizer of the noise via a discrete random variable $\hat\varepsilon$ valued in a grid $\{e_1, \dots, e_L\}$ of $L$ points in $E$, with weights $p_1, \dots, p_L$; as well as grids $\Gamma_n$, $n = 0, \dots, N$, of points in $\mathbb{R}^d$, which are assumed to cover the region of $\mathbb{R}^d$ that is likely to be visited by the optimally driven process at each time $n$.

• Initialize $\hat V_N = g$.

• For $n = N-1, \dots, 0$,

• compute the approximated $Q$-value at time $n$:

$$\hat Q_n(z,a) = f(z,a) + \sum_{\ell=1}^{L} p_\ell\, \hat V_{n+1}\big(\mathrm{Proj}_{\Gamma_{n+1}}(F(z,a,e_\ell))\big), \quad \forall (z,a) \in \Gamma_n \times A,$$

where $\mathrm{Proj}_{\Gamma_{n+1}}$ is the Euclidean projection onto $\Gamma_{n+1}$.

• compute the optimal control at time $n$:

$$\hat A_n(z) \in \operatorname*{argmin}_{a \in A}\, \hat Q_n(z,a), \quad \forall z \in \Gamma_n,$$

using classical algorithms for the optimization of deterministic functions.

• estimate analytically by quantization the value function:

$$\hat V_n(z) = \hat Q_n\big(z, \hat A_n(z)\big), \quad \forall z \in \Gamma_n.$$
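The three steps above combine into a compact backward pass; a one-dimensional sketch with a nearest-neighbour projection (grid sizes, dynamics and costs below are toy choices, not the paper's):

```python
import numpy as np

def qknn_step(grid_n, grid_np1, v_next, actions, f, F, e_grid, p_weights):
    """One backward step of the Qknn scheme (sketch, 1D state).

    v_next[j] stores the value estimate at grid_np1[j]; the projection
    onto the next grid is implemented as the Euclidean nearest neighbour.
    """
    def proj(y):
        return int(np.argmin((grid_np1 - y) ** 2))

    values, controls = [], []
    for z in grid_n:
        # Q-value for each candidate action, with the conditional
        # expectation replaced by the quantized sum over (e_l, p_l).
        q = [f(z, a) + sum(p * v_next[proj(F(z, a, e))]
                           for p, e in zip(p_weights, e_grid))
             for a in actions]
        k = int(np.argmin(q))
        values.append(q[k])
        controls.append(actions[k])
    return np.array(values), controls

# Toy check: quadratic next-step value, additive dynamics, one noise point.
grid = np.linspace(-1.0, 1.0, 5)
v, a_opt = qknn_step(grid, grid, grid ** 2, [-0.5, 0.0, 0.5],
                     lambda z, a: 0.1 * a ** 2, lambda z, a, e: z + a + e,
                     [0.0], [1.0])
```

At the grid point $z = 0$, doing nothing is optimal in this toy instance, so the step returns zero value and zero control there.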

## 3 Numerical applications

### 3.1 A semilinear PDE

We consider the following semilinear PDE with quadratic growth in the gradient:

$$\begin{cases} \dfrac{\partial v}{\partial t} + \Delta_x v - |D_x v|^2 = 0, & (t,x) \in [0,T) \times \mathbb{R}^d, \\ v(T,x) = g(x), & x \in \mathbb{R}^d. \end{cases} \tag{3.1}$$

By observing that $|z|^2 = -\inf_{a \in \mathbb{R}^d}\big[|a|^2 + 2a\cdot z\big]$ for any $z \in \mathbb{R}^d$, the PDE (3.1) can be written as a Hamilton-Jacobi-Bellman equation

$$\begin{cases} \dfrac{\partial v}{\partial t} + \Delta_x v + \inf_{a \in \mathbb{R}^d}\big[|a|^2 + 2a\cdot D_x v\big] = 0, & (t,x) \in [0,T) \times \mathbb{R}^d, \\ v(T,x) = g(x), & x \in \mathbb{R}^d, \end{cases} \tag{3.2}$$

hence associated with the stochastic control problem

$$v(t,x) = \inf_{\alpha \in \mathcal{A}}\, \mathbb{E}\Big[\int_t^T |\alpha_s|^2\, ds + g\big(X^{t,x,\alpha}_T\big)\Big], \tag{3.3}$$

where $X = X^{t,x,\alpha}$ is the controlled process governed by

$$dX_s = 2\alpha_s\, ds + \sqrt{2}\, dW_s, \quad t \leq s \leq T, \qquad X_t = x,$$

$W$ is a $d$-dimensional Brownian motion, and the control process $\alpha$ is valued in $\mathbb{R}^d$. The time discretization (with time step $h$) of the control problem (3.3) leads to the discrete-time control problem (1.1)-(1.2)-(1.3) with

$$X^\alpha_{n+1} = X^\alpha_n + 2\alpha_n h + \sqrt{2h}\,\varepsilon_{n+1} =: F(X^\alpha_n, \alpha_n, \varepsilon_{n+1}), \quad n = 0, \dots, N-1,$$

where $(\varepsilon_n)_n$ is a sequence of i.i.d. random variables of law $\mathcal{N}(0, I_d)$, and with cost functional

$$J(\alpha) = \mathbb{E}\Big[\sum_{n=0}^{N-1} h|\alpha_n|^2 + g\big(X^\alpha_N\big)\Big].$$

On the other hand, it is known that an explicit solution to (3.1) (or equivalently (3.2)) can be obtained via a Hopf-Cole transformation (see e.g. ), and is given by

$$v(t,x) = -\ln\Big(\mathbb{E}\big[\exp\big(-g(x + \sqrt{2}\, W_{T-t})\big)\big]\Big).$$
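This closed-form expression is straightforward to evaluate by plain Monte Carlo, which is how benchmark values can be produced. A sketch (in $d = 1$ with $g(x) = x$ the expectation is Gaussian and known in closed form, $v(t,x) = x - (T-t)$, so the estimate can be checked):

```python
import numpy as np

rng = np.random.default_rng(0)

def hopf_cole_value(x, t, T, g, n_samples=200_000):
    """Monte Carlo evaluation of v(t, x) = -ln E[exp(-g(x + sqrt(2) W_{T-t}))].

    W_{T-t} is simulated as a centred Gaussian with variance (T - t) in
    each of the d coordinates.
    """
    d = len(x)
    w = rng.normal(size=(n_samples, d)) * np.sqrt(T - t)
    return -np.log(np.mean(np.exp(-g(x + np.sqrt(2.0) * w))))

# Sanity check: d = 1, g(x) = x, so v(0, 1) = 1 - (T - t) = 0 here.
v = hopf_cole_value(np.array([1.0]), 0.0, 1.0, lambda y: y[..., 0])
```

Note the exponential inside the expectation: for large $d$ or strongly negative $g$ the estimator has heavy-tailed variance, which is one reason the direct Monte Carlo benchmark is only practical for moderate cases.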

We choose to run tests on two different examples that have already been considered in the literature:

#### Test 1

Some recent numerical results have been obtained in  (see Section 4.3 in ) in dimension $d = 100$ (see Table 2 and Figure 3 in ). Their method is based on neural network regression to solve the BSDE representation of the control problem, and provides estimations of the value function at time 0 and state 0 for different values of a coefficient. We plot the results of the Hybrid-Now algorithm in Figure 1. Hybrid-Now achieves a relative error of 0.13% in a bit less than 2 hours using a 4-core 3 GHz Intel Core i7 CPU, which is very close to the result found by , and reaches a relative error of 0.09% in a bit more than 4 hours. We want to highlight the fact that the algorithm presented in  only needs hundreds of seconds to provide a relative error of 0.17%, which is not comparable to the time required by Hybrid-Now to converge. However, we believe that the computation time can easily be reduced; some ideas in that direction are discussed in Section 4.

We also considered the same problem in dimension $d = 2$, for which we plot the first component of $X$ w.r.t. time in Figure 2, for five different paths of the Brownian motion, where for each path the agent follows either the naive strategy ($\alpha = 0$) or the Hybrid-Now strategy. One can see that both strategies are very similar when the terminal time is far; but the Hybrid-Now strategy clearly forces $X$ to get closer to 0 as the terminal time approaches, in order to reduce the terminal cost.

#### Test 2

Tests of the algorithms have also been run in dimension 1 with the terminal cost $g$. This problem has already been considered in , where the author used the BSDE-based algorithm presented in . Their results for the value function estimation at time 0 and state 0 are available in , and have been reported in the Y&R column of Table 1. Also, the exact values of the value function have been computed using the closed-form formula and running a Monte Carlo simulation, and are reported in the Bench column of Table 1. Tests of the Hybrid-Now and Hybrid-LaterQ algorithms have been run, and the estimations of the value function at time 0 and state 0 are reported in the Hybrid-Now and Hybrid-LaterQ columns. We also tested the Qknn algorithm, based on quantization of the exogenous noise and the k-nearest-neighbor method, and report the results in the Qknn column. Qknn does not regress on neural networks, but rather uses k-nearest-neighbor (knn) estimates to approximate the $Q$-value. See  for a presentation, more details and several different tests of the Qknn algorithm. Note that Qknn is particularly well-suited to one-dimensional control problems; in particular, it is not time-consuming since the dimension of the state space is 1. Actually, it provides the fastest results, which is not surprising since the other algorithms need time to learn the optimal strategies and value functions through gradient-descent methods at each time step $n$. Moreover, Table 1 reveals that Qknn is the most accurate algorithm on this example, probably because it uses local methods in space to estimate the conditional expectation that appears in the expression of the $Q$-value.

Figure 1: Relative error of the Hybrid-Now estimation of the value function at time 0 w.r.t. the number of mini-batches used to build the Hybrid-Now estimators of the optimal strategy. The value functions have been computed by running three times a forward Monte Carlo with a sample of size 10,000, following the optimal strategy estimated by the Hybrid-Now algorithm.

Figure 2: Pathwise comparison of the first component of X w.r.t. time when the agent follows the optimal strategy estimated by the Hybrid-Now algorithm (opt) and the naive strategy α = 0 (bench). The dimension of the semilinear control problem has been set to d = 2. Observe that, as expected, the strategy designed by the Hybrid-Now algorithm is not to influence the diffusion of X when the terminal time is far, in order to avoid any running cost, and to make X small when the terminal time gets close, in order to minimize the terminal cost.

We end this paragraph by giving some implementation details for the different algorithms as part of Test 2.

• Implementation details of the Y&R algorithm: the Y&R algorithm requires approximating the control problem by using a Lipschitz version of the terminal cost $g$, such as:

$$g_N(x) = \begin{cases} g(x) & \text{if } x \notin \big[0, N^{-\frac{1}{1-\gamma}}\big], \\ -Nx & \text{otherwise.} \end{cases}$$
• Implementation details of the Hybrid-Now algorithm: we use time steps for the discretization of the time interval. The value functions and optimal controls at each time $n$

are estimated using neural networks with 3 hidden layers of 10+5+5 neurons.

• Implementation details of the Hybrid-LaterQ algorithm: we use time steps for the discretization of the time interval. The value functions and optimal controls at each time $n$ are estimated using neural networks with 3 hidden layers containing 10+5+5 neurons, and 51 points for the quantization of the exogenous noise.

• Implementation details of the Qknn algorithm: we use time steps for the discretization of the time interval. We take 51 points to quantize the exogenous noise $\varepsilon_n$, and 200 points for the space discretization. See  for more details on the Qknn algorithm.

The main conclusion regarding the numerical implementations and comparisons on this semilinear PDE is that the Hybrid-Now algorithm performs well on the control problem in dimension $d = 100$, and outperforms the Hybrid-LaterQ algorithm in dimension $d = 2$.

### 3.2 Option hedging

Our second example comes from a classical hedging problem in finance. We consider an investor who trades in $d$ stocks with (positive) price process $P = (P_n)_n$, and we denote by $\alpha = (\alpha_n)_n$, valued in $A \subset \mathbb{R}^d$, the amounts held in these assets over each period. We assume for simplicity that the price of the riskless asset is constant equal to 1 (zero interest rate). It is convenient to introduce the return process $R_{n+1} := \mathrm{diag}(P_n)^{-1}(P_{n+1} - P_n)$, $n = 0, \dots, N-1$, so that the self-financed wealth process of the investor with a portfolio strategy $\alpha$, starting from some capital $w_0$, is governed by

$$W^\alpha_{n+1} = W^\alpha_n + \alpha_n \cdot R_{n+1}, \quad n = 0, \dots, N-1, \qquad W^\alpha_0 = w_0.$$

Given an option payoff $h(P_N)$, the objective of the agent is to minimize over her portfolio strategies her expected replication error

$$V_0 = \inf_{\alpha \in \mathcal{A}}\, \mathbb{E}\big[\ell\big(h(P_N) - W^\alpha_N\big)\big],$$

where $\ell$ is a convex function on $\mathbb{R}$. Assuming that the returns $R_n$, $n = 1, \dots, N$, are i.i.d., we are in the $(d+1)$-dimensional framework of Section 1 with state $X_n = (W^\alpha_n, P_n)$, noise $\varepsilon_{n+1} = R_{n+1}$ valued in $\mathbb{R}^d$, and the dynamics function

$$F(w, p, a, r) = \big(w + a \cdot r,\ \mathrm{diag}(p)(\mathbf{1}_d + r)\big),$$

the running cost function $f = 0$ and the terminal cost $g(w,p) = \ell(h(p) - w)$. We test our algorithm in the case of a square loss function, i.e., $\ell(w) = w^2$, and when there are no portfolio constraints ($A = \mathbb{R}^d$), and compare our numerical results with the explicit solution derived in : denote by $\nu$ the distribution of $R_1$, by $\bar\nu$ its mean, and by $\bar M_2$ its second-moment matrix, assumed to be invertible; we then have

$$V_n(w,p) = K_n w^2 - 2 Z_n(p)\, w + C_n(p),$$

where the functions $K_n$, $Z_n$ and $C_n$ are given by backward induction, starting from the terminal condition

$$K_N = 1, \qquad Z_N(p) = h(p), \qquad C_N(p) = h^2(p),$$

and for $n = N-1, \dots, 0$, by

$$K_n = K_{n+1}\big(1 - \bar\nu^\top \bar M_2^{-1} \bar\nu\big),$$
$$Z_n(p) = \int Z_{n+1}\big(p + \mathrm{diag}(p)r\big)\,\nu(dr) - \bar\nu^\top \bar M_2^{-1} \int Z_{n+1}\big(p + \mathrm{diag}(p)r\big)\, r\,\nu(dr),$$
$$C_n(p) = \int C_{n+1}\big(p + \mathrm{diag}(p)r\big)\,\nu(dr) - \frac{1}{K_{n+1}} \Big(\int Z_{n+1}\big(p + \mathrm{diag}(p)r\big)\, r\,\nu(dr)\Big)^{\!\top} \bar M_2^{-1} \Big(\int Z_{n+1}\big(p + \mathrm{diag}(p)r\big)\, r\,\nu(dr)\Big),$$

so that $V_0(w_0, p_0) = K_0 w_0^2 - 2Z_0(p_0)\, w_0 + C_0(p_0)$, where $p_0$ is the initial stock price. Moreover, the optimal portfolio strategy is given in feedback form by $\alpha^*_n = a^*_n(W^*_n, P_n)$, where $a^*_n$ is the function

$$a^*_n(w,p) = \bar M_2^{-1}\Big[\frac{\int Z_{n+1}\big(p + \mathrm{diag}(p)r\big)\, r\,\nu(dr)}{K_{n+1}} - \bar\nu\, w\Big],$$

and $W^*$ is the optimal wealth associated with $\alpha^*$, i.e., $W^*_{n+1} = W^*_n + a^*_n(W^*_n, P_n)\cdot R_{n+1}$. Moreover, the initial capital that minimizes $V_0$, called the (quadratic) hedging price, is given by

$$w^*_0 = \frac{Z_0(p_0)}{K_0}.$$

#### Test

Take one asset ($d = 1$) with returns modeled by a trinomial tree:

$$\nu(dr) = \pi_+\,\delta_{r_+} + \pi_0\,\delta_0 + \pi_-\,\delta_{r_-}, \qquad \pi_0 + \pi_+ + \pi_- = 1,$$

and consider a call option. The price of this option is defined as the initial value of the portfolio that minimizes the terminal quadratic loss of the agent when the latter follows the optimal strategy associated with this initial value. In this test, we want to determine the price of the call and the associated optimal strategy using the different algorithms.
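For a trinomial return distribution, the backward recursion for $K_n$ reduces to scalar products with the first two moments of $\nu$. A minimal sketch (the probabilities and return sizes below are placeholders of ours, since the paper's exact values did not survive extraction):

```python
import numpy as np

# Illustrative trinomial parameters: r+ = 0.05, r- = -0.05,
# probabilities (pi+, pi0, pi-) = (0.35, 0.40, 0.25).  Placeholders only.
probs = np.array([0.35, 0.40, 0.25])
rets = np.array([0.05, 0.00, -0.05])

nu_bar = probs @ rets        # mean return, the scalar version of nu_bar
m2_bar = probs @ rets ** 2   # second moment, the scalar version of M2

def K_path(N):
    """Backward recursion K_n = K_{n+1} (1 - nu_bar^2 / m2_bar), K_N = 1."""
    K = np.ones(N + 1)
    for n in range(N - 1, -1, -1):
        K[n] = K[n + 1] * (1.0 - nu_bar ** 2 / m2_bar)
    return K

K = K_path(6)
```

The same one-pass structure extends to $Z_n$ and $C_n$, with the integrals against $\nu$ becoming weighted sums over the three return values.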

#### Numerical results

In Figure 3, we plot the value function at time 0 w.r.t. $w_0$, the initial value of the portfolio, when the agent follows the theoretical optimal strategy (benchmark), and the optimal strategies estimated by the Hybrid-Now and Hybrid-LaterQ algorithms. We perform a forward Monte Carlo with 10,000 samples to approximate the lower bound of the value function at time 0 (see  for details on how to get an approximation of the upper bound of the value function via duality). One can observe that while all the algorithms give a call option price approximately equal to 4.5, Hybrid-LaterQ clearly provides a better strategy than Hybrid-Now to reduce the quadratic risk of the terminal loss.

We plot in Figure 4 three different paths of the portfolio value w.r.t. time $n$, when the agent follows either the theoretical optimal strategy (red), or the one estimated by the Hybrid-Now algorithm (blue) or the Hybrid-LaterQ algorithm (green). We set $w_0 = 100$ for these simulations. Note that for such a large value of $w_0$, it is not obvious that Hybrid-LaterQ is better than Hybrid-Now.

#### Comments on the Hybrid-Now and Hybrid-LaterQ algorithms

The option hedging problem belongs to the class of linear-quadratic control problems, for which we expect the optimal control to be affine in $w$ and the value function to be quadratic in $w$. It is then natural to consider the following classes of controls and functions to properly approximate the optimal controls and the value functions at each time $n$:

$$\mathcal{A}_M := \big\{(w,p) \mapsto A(x;\beta)\cdot(1,w)^\top;\ \beta \in \mathbb{R}^p\big\}, \tag{3.4}$$
$$\mathcal{F}_M := \big\{(w,p) \mapsto \Phi(x;\theta)\cdot(1,w,w^2)^\top;\ \theta \in \mathbb{R}^p\big\}, \tag{3.5}$$

where $\beta$ describes the parameters (weights + biases) associated with the neural network $A$, and $\theta$ describes those associated with the neural network $\Phi$. The notation $\top$ stands for transposition, and $\cdot$ for the inner product. Note that there are 2 (resp. 3) neurons in the output layer of $A$ (resp. $\Phi$), so that the inner product is well-defined in (3.4) (resp. (3.5)). It is then natural to use gradient-descent-based algorithms to find the optimal parameter $\beta$ (resp. $\theta$) for which $A(\cdot;\beta)$ (resp. $\Phi(\cdot;\theta)$) coincides with the optimal control (resp. the value function) at each time $n$.

###### Remark 3.1

The option hedging problem is linear-quadratic, hence belongs to the class of problems where the agent has ansatzes for the optimal control and the value function. For these kinds of problems, the algorithms presented in  can easily be adapted so that the expressions of the estimators satisfy the ansatzes, see e.g. (3.4) and (3.5).

Figure 3: Estimations of the value function at time 0 w.r.t. w0 using the Hybrid-Now algorithm (blue line) and the Hybrid-LaterQ algorithm (green dashes). We draw the theoretical value function in red for comparison. One can observe that the price of the call given by all the algorithms is approximately equal to 4.5, but Hybrid-LaterQ is better than Hybrid-Now at reducing the quadratic risk.

Figure 4: Three simulations of the agent's wealth w.r.t. time n when, for each ω, the latter follows the theoretical optimal strategy (red), and the one estimated by the Hybrid-Now (blue) or Hybrid-LaterQ algorithm (green). We took w0 = 100 for these simulations. Observe that for such a large value of w0, the optimal strategies estimated by the Hybrid-LaterQ and Hybrid-Now algorithms are similar to the theoretical optimal strategy.

### 3.3 Valuation of energy storage

We present a discrete-time version of the energy storage valuation problem studied in . We consider a commodity (gas) that has to be stored in a cave, e.g. salt domes or aquifers. The manager of such a cave aims to maximize the real option value by optimizing over a finite horizon the dynamic decisions to inject or withdraw gas as time and market conditions evolve. We denote by $P = (P_n)_n$ the gas price, which is an exogenous real-valued Markov process modeled by the following mean-reverting dynamics:

$$P_{n+1} = \bar p\,(1-\beta) + \beta P_n + \xi_{n+1}, \tag{3.6}$$

where $\beta$ is the mean-reversion parameter, $(\xi_n)_n$ a sequence of i.i.d. centered random variables, and $\bar p$ the stationary value of the gas price. The current inventory in the gas storage is denoted by $C^\alpha_n$ and depends on the manager's decisions, represented by a control process $\alpha$ valued in $\{-1, 0, 1\}$: $a = 1$ (resp. $a = -1$) means that she injects (resp. withdraws) gas at an injection (resp. withdrawal) rate $a_{in}(c)$ (resp. $a_{out}(c)$), requiring a purchase (resp. causing a sale) of $b_{in}(c)$ (resp. $b_{out}(c)$), and $a = 0$ means that she is doing nothing. The difference between $a_{in}$ and $b_{in}$ (resp. $a_{out}$ and $b_{out}$) accounts for gas loss during injection/withdrawal. The evolution of the inventory is then governed by

$$C^\alpha_{n+1} = C^\alpha_n + h(C^\alpha_n, \alpha_n), \quad n = 0, \dots, N-1, \qquad C^\alpha_0 = c_0, \tag{3.7}$$

where we set

$$h(c,a) = \begin{cases} a_{in}(c) & \text{for } a = 1, \\ 0 & \text{for } a = 0, \\ -a_{out}(c) & \text{for } a = -1, \end{cases}$$

and we have the physical inventory constraint:

$$C^\alpha_n \in [C_{min}, C_{max}], \quad n = 0, \dots, N.$$

The running gain of the manager at time $n$ is given by

$$f(p,c,a) = \begin{cases} -b_{in}(c)\,p - K_1(c) & \text{for } a = 1, \\ -K_0(c) & \text{for } a = 0, \\ b_{out}(c)\,p - K_{-1}(c) & \text{for } a = -1, \end{cases}$$

where $K_i(c)$ represents the storage cost in each regime $i \in \{-1, 0, 1\}$. The problem of the manager is then to maximize over $\alpha$ the expected total profit

$$J(\alpha) = \mathbb{E}\Big[\sum_{n=0}^{N-1} f(P_n, C^\alpha_n, \alpha_n) + g\big(P_N, C^\alpha_N\big)\Big],$$

where a common choice for the terminal condition is

$$g(p,c) = -\mu\, p\,(c_0 - c)_+,$$

which penalizes having less gas than originally, and makes this penalty proportional to the current price of gas ($\mu > 0$). We are then in the two-dimensional framework of Section 1 with $X_n = (P_n, C^\alpha_n)$, and the set of admissible controls in the dynamic programming loop is given by:

$$A_n(c) = \big\{a \in \{-1,0,1\} : c + h(c,a) \in [C_{min}, C_{max}]\big\}, \quad c \in [C_{min}, C_{max}],\ n = 0, \dots, N-1.$$
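The admissible-decision set $A_n(c)$ is cheap to compute by direct enumeration. A sketch using the constant rates of the test below; the inventory bounds are not recoverable from this extract, so the values for them here are placeholders:

```python
def h_inv(c, a, a_in=0.06, a_out=0.25):
    """Inventory change h(c, a) of (3.7), with constant injection and
    withdrawal rates as in the numerical test."""
    return {1: a_in, 0: 0.0, -1: -a_out}[a]

def admissible(c, c_min=0.0, c_max=1.0):
    """A_n(c): decisions keeping the next inventory in [Cmin, Cmax].
    The bounds c_min, c_max are illustrative placeholders."""
    return [a for a in (-1, 0, 1) if c_min <= c + h_inv(c, a) <= c_max]
```

For instance, at an empty cave only waiting or injecting is admissible, and at a full cave only waiting or withdrawing.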

#### Test

As in , we consider the example

$$a_{in}(c) = b_{in}(c) = 0.06, \qquad a_{out}(c) = b_{out}(c) = 0.25, \qquad K_i(c) = 0.01c,$$

for $i \in \{-1, 0, 1\}$, with the remaining price, inventory and penalty parameters chosen as in .

#### Numerical results

Figure 5 provides the value function estimates at time 0 w.r.t. $a_{in}$ using the Qknn algorithm, compared to the benchmark (Bench), defined as the naive do-nothing strategy $\alpha = 0$. As expected, the naive strategy performs well when $a_{in}$ is small since, in this case, it takes time to fill the cave, so the agent is likely to do nothing in order not to be penalized at terminal time. When $a_{in}$ is large, it is easy to fill up the cave, so the agent has more freedom to buy and sell gas in the market without worrying about the terminal cost. Observe that the value function is not monotone, due to the fact that the state space for the volume of gas in the cave is a bounded discrete set.

Table 2 provides the value function estimates obtained with the ClassifPI, Hybrid-Now and Hybrid-LaterQ algorithms. Observe first that the estimations provided by the Qknn algorithm are larger than those provided by the other algorithms, meaning that Qknn outperforms the other algorithms. The second best algorithm is ClassifPI, while the performance of Hybrid-Now is poor: it clearly suffers from instability, probably due to the discontinuity of the running reward w.r.t. the control variable.

Finally, Figures 6, 7 and 8 provide the optimal decisions at times 5, 10, 15, 20, 25 and 29, estimated respectively by the Qknn, ClassifPI and Hybrid-Now algorithms. As expected, one can observe on each plot that the optimal strategy is to inject gas when the price is low, to sell gas when the price is high, and to make sure to keep a volume of gas greater than $c_0$ in the cave as the terminal time gets closer, in order to minimize the terminal cost. Let us now comment on the implementation of the algorithms:

• Comments on the Qknn algorithm: Table 2 shows that, once again, due to the low dimensionality of the problem, Qknn provides the best value function estimates. The estimated optimal strategies, shown in Figure 6, are very good estimations of the theoretical ones. The three decision regions in Figure 6 are natural and easy to interpret: basically, it is optimal to sell when the price is high, and to buy when it is low. However, a closer look reveals that the waiting region (where it is optimal to do nothing) has an unusual triangular shape, which, while close to the theoretical one, can be expected to be very hard to reproduce with the DNN-based algorithms proposed in .

• Comments on the ClassifPI algorithm: as shown in Figure 7, the ClassifPI algorithm manages to provide stable estimates of the optimal controls at each time $n$. However, it is not able to capture the particular triangular shape of the waiting region, which explains why Qknn performs better.

• Comments on the Hybrid-Now algorithm: as shown in Figure 8, the Hybrid-Now algorithm only manages to provide a weak estimation of the three different regions at each time $n$; in particular, the regions suffer from instability.

Figure 5: Value function estimates at time 0 w.r.t. ain, when the agent follows the optimal strategy estimated by the Qknn algorithm, obtained by running a forward Monte Carlo with a sample of size 100,000 (blue). We also plot the cost functional associated with the naive passive strategy α = 0 (Bench). See that for small values of ain such as 0.06, doing nothing is a reasonable strategy; in this case, the naive strategy is a good benchmark to test the algorithms.

We end this paragraph by providing some implementation details for the different algorithms we tested.

• Implementation details for the Qknn algorithm: we recall that the Qknn algorithm is based on the quantization and k-nearest-neighbor methods to estimate the value functions at each time $n$. We take the $k$ closest neighbors for the estimation of the regression of the value functions, in order to ensure continuity of the estimates w.r.t. the pair of state variables. The optimal control is computed at each point of the grid using deterministic optimizers such as the golden-section search or Brent's algorithm, which are classical optimization routines available in many numerical libraries.

• Implementation details for the neural network-based algorithms: we use neural networks with two hidden layers and ELU activation functions (the Exponential Linear Unit is defined as $\mathrm{ELU}(x) = x$ for $x \geq 0$ and $e^x - 1$ otherwise). The output layer contains 3 neurons, with a softmax activation function for the ClassifPI algorithm and no activation function for Hybrid-Now. We use a new training set

at each time step. Note that, given the expression of the terminal cost, ReLU activation functions (Rectified Linear Units) could have been deemed a better choice to capture the shape of the value functions, but our tests revealed that ELU activation functions provide better results.

The main conclusion of our numerical comparisons on this energy storage example is that ClassifPI, the DNN-based classification algorithm designed for stochastic control problems with discrete control space, is more accurate than the more general Hybrid-Now. Nevertheless, ClassifPI was not able to capture the unusual triangular shape of the optimal control regions as well as Qknn did.