1 Introduction
Motivation
Optimal stopping problems are particularly important for risk management as they are involved in the pricing of American options. American-style options are used not only by traditional asset managers but also by energy companies, which hedge “optimised assets” by finding the decisions that optimise their P&L and determine their value. The P&L of a power plant unit is commonly modelled using swing options, which are American options that can be exercised at most a given number of times, possibly with a constraint on the delay between two exercise dates (see Carmona and Touzi (2008) or Warin (2012) for gas storage modelling).
Formally, we are given a stochastic process defined on a probability space, and one wants to find an increasing sequence of stopping times that maximises the expectation of some objective function.
Numerical methods to solve the optimal stopping problem when and is Markovian include:

Partial differential equation (PDE): a variational inequality derived from the Hamilton-Jacobi-Bellman equation is given by
where is the infinitesimal generator of (Shreve, 2004, Chapter 8, Section 3.3). A numerical scheme can be applied to solve this PDE and find the option value.
These approaches generalise well for , see Carmona and Touzi (2008) for the dynamic programming principle or Bernhart et al. (2012) for the BSDE method. The non-linear case where is of the form is studied by Trabelsi (2013). The method proposed in the latter paper can also be related to the parametric valuation of American options: a decision rule or the exercise region is represented by a vector of parameters to be optimised (Glasserman, 2013, Chapter 8, Section 2). As in reinforcement learning, the estimation of optimal parameters is based on Monte Carlo simulations. This list of methods is not exhaustive and the reader can refer to (Glasserman, 2013, Chapter 8) for more details on numerical methods for American option pricing. All these algorithms suffer from the curse of dimensionality: the number of underlyings can hardly exceed 5. However, energy company portfolios may contain derivatives involving more than 4 commodities at a time (e.g. swing options indexed on CO2, natural gas, electricity, volume, fuel), and traditional numerical methods hardly provide good solutions in a reasonable time.
Recently, neural network-based approaches have shown good results on stochastic control problems and on the numerical resolution of PDEs in high dimension (see Han et al. (2017b), Chan-Wai-Nam et al. (2019)). In Huré et al. (2018), Bachouch et al. (2018), the optimal policy is parameterised by a neural network whose weights and biases minimise at each time step the right-hand side of the dynamic programming equation, going backward. The value function to minimise at time is either computed using all the optimal policies computed after or using an approximation of the value function at time from a neural network regression. This method can be used both for continuous and discrete actions. Numerical tests are performed on a gas storage in Bachouch et al. (2018). In Huré et al. (2019) and Han et al. (2017a), neural networks are used to solve BSDEs. In Huré et al. (2019), the neural networks parameterising the solution and its gradient (or only the solution, the gradient then being computed by numerical differentiation) minimise the loss between the left-hand side and the right-hand side of the Euler discretisation of the BSDE, going backward from the terminal value. Bachouch et al. (2018) and Huré et al. (2019) need to maximise one criterion per time step. The approach of Han et al. (2017a) is quite different: the neural network parameterises the initial value of the BSDE and the gradient at each time step, and it minimises the distance between the terminal value obtained by the neural network and the terminal value of the BSDE, going forward. American put option prices are computed in Huré et al. (2019) up to dimension 40 with 160 time steps. Fecamp et al. (2019) use neural networks to parameterise the positions that need to be taken in order to hedge an option. One neural network is trained taking as inputs the time and the value of the underlying(s) in order to minimise a risk criterion ( loss, value at risk, …). This risk criterion is estimated by Monte Carlo simulations and the optimisation is done by gradient descent over the parameters of the neural network. This methodology shows very good performance, even in an incomplete market where there are (proportional) transaction costs and volume constraints. This last technique produces an optimal control in a reasonable time even in high dimension but does not apply when there is a discrete control (e.g. an optimal stopping time). Neural network approaches have also been used for swing option pricing in the gas market in Barrera-Esteve et al. (2006). Their definition of swing options differs slightly from ours as it considers a continuous control: the option owner buys a certain amount of gas between a minimum and a maximum quantity. It is however related to our problem as, in continuous time, this option is bang-bang: it is optimal to exercise at the minimum or the maximum level at each date, that is, to choose between two actions. Barrera-Esteve et al. (2006) directly model the policy by a neural network and optimise the objective function as in Fecamp et al. (2019).
Contrary to Huré et al. (2018), Bachouch et al. (2018), Han et al. (2017a), Huré et al. (2019), the goal of this paper is to propose a reinforcement learning algorithm that solves optimal multi-exercise (rather than single-exercise) stopping time problems with constraints on exercise times, without needing to derive a dynamic programming equation or to find an equivalent BSDE of the problem. The only information needed is the dynamics of the state process and the objective function. This kind of algorithm is called policy gradient and is well known in the area of reinforcement learning, see Sutton et al. (2000) for instance. Although continuous control with reinforcement learning shows good results, the case of optimal stopping times is more difficult as it involves controls taking values in a discrete set of actions.
The problem is similar to a combinatorial optimisation one: at each time step, an action belonging to a finite set needs to be taken. One way to solve this problem is to perform a relaxation assuming that the control belongs to a continuous space. For instance, if one needs to price an American option, a decision represented by a value in $\{0, 1\}$ and consisting in exercising or not must be taken. Relaxing the problem consists in searching for solutions in $[0, 1]$. Such a method has been studied in Becker et al. (2019b) and Becker et al. (2019a) for the pricing of Bermudan options (with only one exercise) with neural networks. They succeed in pricing Bermudan options in a high-dimensional setting (up to 1000) with good accuracy.
On a very different combinatorial problem (namely the travelling salesman problem), Bello et al. (2016) propose a neural network approach that randomises the discrete variables instead of relaxing the discrete setting. The probability for the action to take a discrete value is modelled by a neural network, but the function to optimise is computed by Monte Carlo sampling from the probability linked to the discrete actions. The difficulty then comes from the computation of the gradient: the trick, which is common in reinforcement learning, is to use the likelihood ratio method (Sutton et al., 2000).
Main results
Our approach follows the spirit of Fecamp et al. (2019): one directly parameterises the optimal policy by a neural network and maximises the objective function. We propose an algorithm using reinforcement learning as in Bello et al. (2016) in order to solve optimal stopping time problems as a combinatorial optimisation problem. While Bello et al. (2016) consider a deterministic optimisation problem, the framework of this article is stochastic and involves the dynamics in time of the state process, with discrete decisions at each time step. Compared to the papers referenced above, our approach presents several advantages, as it

can solve multiple optimal stopping time problems;

is independent from the dynamic of ;

allows any constraint on the stopping times to be added in a flexible way;

can be combined with the approach of Fecamp et al. (2019), which considers continuous actions, in order to solve stochastic impulse control problems, combining discrete and continuous controls;

is able to optimise any risk criterion even if it is not possible to derive a dynamic programming equation, see Fecamp et al. (2019) where hedging is done under an asymmetric risk criterion.
One of the proposed algorithms (Algorithm 1) solves stopping time problems without any knowledge of the dynamic programming equation or of an equivalent BSDE. Numerical tests covering Bermudan and swing options show good results for the pricing of a Bermudan option on 10 underlyings and of swing options on 5 underlyings having up to exercise dates. An extension to stochastic impulse problems combining both continuous and discrete controls is proposed in Algorithm 2. Those problems are classical in finance but usually hard to solve. The extension is tested on a well-known problem, namely hedging with fixed transaction costs. Those fixed transaction costs can be seen as a fee to enter the market or as the operational cost of hedging. Algorithm 2 is applied to the hedging of a spread option on 3 underlyings with fixed transaction costs, a difficult problem to solve using stochastic control approaches. To our knowledge, this paper is the first to propose a neural network approach that solves multiple optimal stopping time and impulse control problems by modelling the policy directly, without the use of the dynamic programming equation. The theoretical convergence study of our algorithm is out of the scope of this paper.
Organisation of the paper
The paper is organised as follows:

Section 2 deals with optimal stopping time problems. Section 2.1 and Section 2.2 describe the problem we want to solve with the algorithm proposed in Section 2.3. The neural network architecture and hyperparameters are discussed in Section 2.4 and Section 2.5. Numerical tests on Bermudan and swing options are presented in Section 2.6.

In Section 3, an extension to impulse control is proposed. The optimal hedging with fixed transaction costs problem is described in Section 3.3. In Section 3.3.1, we compare our algorithm against stochastic control and against Whalley and Wilmott (1993) methodology on a call option with fixed transaction costs. Numerical tests are done on a spread option involving 3 risk factors in Section 3.3.2.
2 Optimal stopping
2.1 Continuous time modelling
We are given a financial market operating in continuous time. Let a filtered probability space and a $d$-dimensional Brownian motion. One assumes that satisfies the usual conditions of right continuity and completeness. Let a finite time horizon and be the unique solution of the Stochastic Differential Equation (SDE):
(1) 
with and two measurable functions verifying and for and ( denotes the Euclidean distance in and for a matrix ) and . Under these hypotheses, has a unique strong solution which is Markovian. One could extend the modelling to more general Itô semimartingales but for the sake of simplicity, we restrict ourselves to continuous Markovian diffusions. Using the notations of Carmona and Touzi (2008) and with as defined in (1) for and for , an optimal stopping time problem consists in solving the problem
(2) 
where is the collection of all vectors of increasing stopping times such that for all , a.s. on the set of events and where is a measurable function. corresponds to the number of possible exercises and to the minimum delay between two exercise dates. The reader can refer to Ibáñez (2004), Carmona and Touzi (2008), Bernhart et al. (2012) for more information on swing options and methods to price them. The American option case corresponds to . One wants to find the optimal value (2) but also the optimal policy
(3) 
2.2 Discrete time modelling
In practice, one only considers optimal stopping on a discrete time grid (for instance, the valuation of a Bermudan option is used as a proxy of the American or swing option). Let us consider exercise dates belonging to a discrete set , . The problem consists in finding
(4) 
where is the set of stopping times belonging to such that on , for . This discretisation is needed for our algorithm as it is needed in classical methods such as Longstaff and Schwartz (2001). The solution of (4) can then be approximated by the solution of
(5) 
where is a sequence of measurable random variables taking values in such that
(6) 
and
(7) 
where . Given a solution of Equation (5) , a proxy for the optimal control (3) is given by, on the event with , for ,
with .
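To make the discrete formulation more concrete, the sketch below converts a sequence of binary exercise decisions on the time grid into exercise indices and checks the two constraints discussed above: a maximum number of exercises and a minimum delay between consecutive exercises. The names `exercise_indices`, `is_admissible`, `max_exercises` and `min_delay_steps` are hypothetical, and the delay is assumed to be expressed in grid steps; the precise statement of constraints (6) and (7) remains the one given by the equations.

```python
from typing import List, Sequence


def exercise_indices(decisions: Sequence[int]) -> List[int]:
    """Grid indices at which the binary decision equals 1 (exercise)."""
    return [i for i, y in enumerate(decisions) if y == 1]


def is_admissible(decisions: Sequence[int],
                  max_exercises: int,
                  min_delay_steps: int) -> bool:
    """Check the assumed constraints on a path of exercise decisions.

    - at most `max_exercises` decisions equal to 1;
    - at least `min_delay_steps` grid steps between two exercises.
    """
    idx = exercise_indices(decisions)
    if len(idx) > max_exercises:
        return False
    return all(b - a >= min_delay_steps for a, b in zip(idx, idx[1:]))
```

For instance, with two exercise rights and a delay of two steps, the decisions `[0, 1, 0, 1, 0]` are admissible under these assumptions whereas `[1, 1, 0]` are not.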
2.3 Algorithm description for optimal stopping times problems
As the ’s are discrete we cannot suppose that they are the output of a neural network whose weights are optimised by applying a stochastic gradient descent (SGD). One idea to overcome this difficulty is to suppose that at each time step , the discrete variable follows a Bernoulli distribution conditionally on . In a non-Markovian framework, one could consider that the probability for to be equal to 1 is a function of all the values of for . In this case, one could use a Recurrent Neural Network to parameterise this function. In order to consider the different constraints on the sequence , the law of depends also on the realisations of for . The parameter of the Bernoulli distribution is parameterised by a neural network defined on where represents the sets in which the bias and weights of the neural network lie.
Parameterisation without constraints
Without constraints, the parameterisation is the following
(8) 
with and . outputs the logit (the inverse function of ) of . The function is not necessary and one could only consider to parameterise the logit of the probability. To reduce the values taken by the , we bound the output of the neural network using and one chooses such that and . is given in Section 2.5.
Parameterisation with constraints (Eq. (6) and Eq. (7))
Now, let us consider the constraints (6) and (7). The parameterisation is the following
(9) 
where is a penalty term. We choose such that in order to have a probability of even in the case where the output of is 1. is given in Section 2.5. The methodology can be extended to any constraint on the policy.
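The role of the bound and of the penalty can be illustrated with a short sketch. It is only an assumption about the general shape of such a parameterisation, not a transcription of Equations (8) and (9): the bounding function is taken to be a scaled `tanh`, and the penalty is subtracted from the logit whenever exercising would violate a constraint. The names `exercise_probability`, `bound`, `penalty` and `forbidden` are hypothetical.

```python
import numpy as np


def expit(z):
    """Logistic sigmoid, the inverse of the logit function."""
    return 1.0 / (1.0 + np.exp(-z))


def exercise_probability(raw_output, bound, penalty, forbidden):
    """Bernoulli parameter built from a raw network output.

    raw_output: unbounded network output(s).
    bound:      assumed scale of the tanh bounding, so |logit| <= bound.
    penalty:    assumed amount subtracted from the logit when exercising
                is not allowed; it should dominate `bound`.
    forbidden:  boolean flag(s), True where a constraint forbids exercise.
    """
    raw_output = np.asarray(raw_output, dtype=float)
    forbidden = np.asarray(forbidden, dtype=bool)
    logit = bound * np.tanh(raw_output)
    logit = np.where(forbidden, logit - penalty, logit)
    return expit(logit)
```

With a penalty much larger than the bound, the probability of exercising is numerically negligible wherever exercise is forbidden, even when the network output saturates at its upper bound.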
The neural network architecture is described in Section 2.4 and Figure 1. From now on, is replaced by to indicate the dependence of the law of with . To approximate a solution to (5) we search for a verifying:
The gradient of is computed using the likelihood ratio method:
(10) 
(using the convention ) which can be easily computed using backpropagation. Once we have this expression, the algorithm consists in applying SGD with the gradient defined in Equation (10) (see Algorithm 1).
Finally, during the training phase the actions are sampled from the probability output by the network on the training set, whereas on the test and validation sets they are set to 1 if the probability is greater than 0.5 and to 0 otherwise.
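The likelihood ratio idea can be sketched on a toy problem. The code below is a generic score-function estimator for Bernoulli actions, not a transcription of Equation (10) or of Algorithm 1: the policy is a hypothetical linear-logit model instead of a neural network, and `likelihood_ratio_gradient`, `greedy_actions`, `train` and `reward_fn` are illustrative names. It also shows the sampling versus thresholding distinction between training and evaluation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def likelihood_ratio_gradient(theta, features, reward_fn, rng):
    """Monte Carlo estimate of the objective and of its gradient.

    features:  array of shape (batch, steps, k) fed to a linear-logit policy.
    reward_fn: maps sampled actions (batch, steps) to rewards (batch,).
    """
    probs = sigmoid(features @ theta)                       # P(action = 1)
    actions = (rng.random(probs.shape) < probs).astype(float)
    rewards = reward_fn(actions)
    # For a Bernoulli law with logit z, d/dz log p(action) = action - prob.
    score = ((actions - probs)[..., None] * features).sum(axis=1)
    gradient = (rewards[:, None] * score).mean(axis=0)
    return rewards.mean(), gradient


def greedy_actions(theta, features):
    """Deterministic decisions used outside training: 1 if prob > 0.5."""
    return (sigmoid(features @ theta) > 0.5).astype(int)


def train(theta, features, reward_fn, rng, learning_rate=0.5, n_steps=50):
    """Plain stochastic gradient ascent on the sampled objective."""
    for _ in range(n_steps):
        _, gradient = likelihood_ratio_gradient(theta, features, reward_fn, rng)
        theta = theta + learning_rate * gradient
    return theta
```

In this sketch the reward is multiplied by the sum of the log-probability gradients along each path, so no derivative of the reward with respect to the discrete actions is needed.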
Remark 2.1
If it is not possible to have access to exact simulations of , it is possible to consider an Euler SDE discretisation of (1).
2.4 Neural network architecture
The neural network architecture is inspired by Chan-Wai-Nam et al. (2019) and consists of one single feed-forward neural network whose features are the time step and the current realisation. Let and . The neural network is defined as follows
where for , , , for , , for and . corresponds to the number of layers and to the number of neurons per layer (that we assume to be the same for every layer). The correspond to the weights and to the bias. The function is the activation function and is chosen as the ReLU function, that is $x \mapsto \max(x, 0)$. is then equal to .
2.5 Hyper parameters

As in Chan-Wai-Nam et al. (2019), regularisation, which is classically used to avoid overfitting, is not relevant in our case and we do not use it: our data is not redundant and thus the network does not experience overfitting.

Since we use the same network at each time step, we use a mean-variance normalisation over all the time steps to center all the inputs for all ’s with the same coefficients. The scaling and recentering coefficients are estimated on presimulated data that is used only for this purpose.

We use Xavier initialisation (Glorot and Bengio, 2010) for the weights and a normal initialisation for the bias.

The number of layers is chosen equal to 3. The number of neurons per layer is constant (but can vary from a case to another).

Every 100 steps, the objective value is computed over a testing set. The parameters kept at the end are the ones minimising those evaluations. The objective function is finally evaluated on a validation set.

and the penalisation parameter is set to .

The library used is TensorFlow (Abadi et al., 2015) and the algorithm runs on a laptop with 8 cores at 2.50 GHz, 15.6 GB of RAM and no GPU acceleration.
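The architecture and initialisation choices above can be summarised by a minimal NumPy sketch (the paper itself relies on TensorFlow). Several points are assumptions made for illustration: "3 layers" is read as three hidden ReLU layers followed by a linear scalar output, the Xavier initialisation is taken in its uniform variant, the biases are drawn from a standard normal law, and the names `xavier_uniform`, `init_network` and `forward` are hypothetical.

```python
import numpy as np


def xavier_uniform(rng, fan_in, fan_out):
    """Glorot and Bengio (2010) uniform initialisation of a weight matrix."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def init_network(rng, input_dim, width, n_hidden=3):
    """List of (weights, bias) pairs; same width for every hidden layer."""
    sizes = [input_dim] + [width] * n_hidden + [1]
    return [(xavier_uniform(rng, m, n), rng.normal(size=n))  # assumed N(0, 1) bias
            for m, n in zip(sizes[:-1], sizes[1:])]


def forward(params, inputs, mean, std):
    """Apply the network to inputs of shape (batch, input_dim).

    `mean` and `std` are normalisation coefficients estimated once on
    presimulated data and shared by all time steps.
    """
    h = (inputs - mean) / std
    for weights, bias in params[:-1]:
        h = np.maximum(h @ weights + bias, 0.0)   # ReLU activation
    weights, bias = params[-1]
    return (h @ weights + bias)[:, 0]             # one scalar output per sample
```

Here the input would gather the time step and the current value of the state process, so `input_dim` equals the dimension of the state plus one under this reading.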
2.6 Numerical results
In this section, Algorithm 1 is applied to the valuation of Bermudan and swing options. The function is of the form where is the payoff of the option and is the risk-free rate. We place ourselves in the Black-Scholes framework: , with corresponding to the dividend rate and with a positive definite matrix. We choose to work with a regular time grid for . The probability measure corresponds to the risk-neutral probability and finding the value of the option consists in solving Problem (5).
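Training requires simulated paths of the underlyings on the time grid. The sketch below shows one way to simulate them exactly under a multi-dimensional Black-Scholes model, assuming constant scalar rates, per-asset volatilities and a correlation matrix; this parameterisation and the names `simulate_black_scholes`, `rate`, `dividend`, `vol` and `corr` are assumptions made for illustration and may differ from the notation of the model above.

```python
import numpy as np


def simulate_black_scholes(rng, x0, rate, dividend, vol, corr,
                           horizon, n_steps, n_paths):
    """Exact simulation of correlated geometric Brownian motions.

    Returns an array of shape (n_paths, n_steps + 1, d) on a regular grid,
    where d is the number of underlyings.
    """
    x0 = np.asarray(x0, dtype=float)
    vol = np.asarray(vol, dtype=float)
    dt = horizon / n_steps
    chol = np.linalg.cholesky(np.asarray(corr, dtype=float))
    # Correlated standard Gaussian increments for every path and step.
    z = rng.standard_normal((n_paths, n_steps, x0.size)) @ chol.T
    increments = (rate - dividend - 0.5 * vol ** 2) * dt + vol * np.sqrt(dt) * z
    log_paths = np.concatenate(
        [np.zeros((n_paths, 1, x0.size)), np.cumsum(increments, axis=1)],
        axis=1,
    )
    return x0 * np.exp(log_paths)
```

Because the log-increments are Gaussian, no Euler discretisation error is introduced; for dynamics without closed-form solution, an Euler scheme as in Remark 2.1 would be used instead.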
2.6.1 Bermudan options
In this section, we assume that (only one exercise) and we consider different options to price. For all the cases, we choose an initial learning rate , decaying with a rate of 0.98 every 100 steps (that is ), a test set of size 500,000 and a validation set of size 4,096,000 (500,000 is chosen high to have very accurate optimisation and 4,096,000 is chosen as in Becker et al. (2019a)).
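The learning rate schedule described above can be written as a small helper. Whether the decay is applied continuously or in a staircase fashion is not stated here, so the sketch exposes both through a hypothetical `staircase` flag; `learning_rate` and its argument names are illustrative, and the initial rate is left as a parameter.

```python
def learning_rate(step, initial_rate, decay=0.98, decay_steps=100,
                  staircase=False):
    """Exponentially decayed learning rate after `step` SGD iterations."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_rate * decay ** exponent
```

With `decay=0.98` and `decay_steps=100`, the rate is multiplied by 0.98 over each block of 100 iterations in both variants.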
Put option
with , payoff , , , , , , , . We consider a batch size equal to , a neural network with a depth of 3 layers having 10 neurons each and iterations. The reference value is given in Bouchard and Chassagneux (2008).
Max-call option
with , payoff , , , , , , ( is the identity matrix with size ), , . We consider a batch size equal to for and for , a neural network with 3 layers of size 30 for and 70 for and iterations. The reference values are given in Becker et al. (2019a).
Strangle spread option
with , payoff , , , , , , , , , . We consider a batch size equal to , a neural network with 3 layers of size 60 and iterations. The reference value is given in Becker et al. (2019a).
In Table 1, the values and computing times obtained with Algorithm 1 are given for each case, and the values are compared to the reference values (Bouchard and Chassagneux (2008) for the put option and Becker et al. (2019a) for the other options). The algorithm succeeds in pricing Bermudan options in relatively high dimension (up to 10) and also with a high number of time steps (up to 50). The computing time is more sensitive to the number of time steps than to the dimension: the number of neural network evaluations is equal to the number of time steps. The increase in computing time with the dimension is mostly caused by the need for a larger batch size and by a longer simulation time. Algorithm 1 thus solves problems that are usually hard and very expensive in terms of computation time, as they suffer from the curse of dimensionality.
| Use case / Method | Algorithm 1 | Reference | Difference | Time (s) |
|---|---|---|---|---|
| Bermudan put | 0.0603 | 0.0603 | 0.0% | 155.2 |
| Max-call, | 13.8934 | 13.8990 | 0.04% | 452.7 |
| Max-call, | 38.2115 | 38.2780 | 0.17% | 2948.4 |
| Strangle spread | 11.7830 | 11.7940 | 0.09% | 5211.7 |
2.6.2 Swing options
In the following cases, we use an initial learning rate , decaying with a rate of 0.96 every 100 steps (that is ), a test set of size 500,000 and a validation set of size 4,096,000.
We compare in Table 2 the results obtained by Algorithm 1 with the results of Ibáñez (2004) in the case of a put option with , , , , , , , , , (only one time step for delay) and . We consider a batch size equal to , a neural network with 3 layers of size 10 and iterations. Every case takes around 2 minutes to converge, see Table 3. The algorithm thus gives very accurate results for the valuation of swing options in a short period of time.
| / | 35 | 40 | 45 |
|---|---|---|---|
| 1 | (5.1, 5.114, 0.282%) | (1.775, 1.774, 0.045%) | (0.409, 0.411, 0.408%) |
| 2 | (10.165, 10.195, 0.291%) | (3.478, 3.48, 0.047%) | (0.769, 0.772, 0.388%) |
| 3 | (15.181, 15.23, 0.323%) | (5.09, 5.111, 0.414%) | (1.084, 1.089, 0.492%) |
| 4 | (20.151, 20.23, 0.39%) | (6.622, 6.661, 0.593%) | (1.361, 1.358, 0.187%) |
| 5 | (25.16, 25.2, 0.159%) | (8.075, 8.124, 0.602%) | (1.573, 1.582, 0.557%) |
| 6 | (30.085, 30.121, 0.12%) | (9.448, 9.502, 0.567%) | (1.756, 1.756, 0.019%) |
| / | 35 | 40 | 45 |
|---|---|---|---|
| 1 | 121.4 | 121.9 | 126.1 |
| 2 | 124.0 | 124.5 | 126.0 |
| 3 | 126.0 | 125.3 | 122.4 |
| 4 | 127.2 | 127.7 | 122.9 |
| 5 | 117.0 | 122.7 | 122.8 |
| 6 | 124.4 | 122.7 | 116.1 |
To assess the performance of Algorithm 1 in high dimension, let us consider the pricing of the geometric put option with payoff . Let , , , , , , , , and . Parameters are chosen in order to have an option value equal to that of the one-dimensional put option: the product of the components of follows a Black-Scholes dynamic with drift parameter equal to and volatility equal to . This provides a reference value (from Ibáñez (2004)) while considering a high-dimensional case. We consider a batch size equal to , a neural network with 3 layers of size 30 and iterations. Results are given in Table 4. The algorithm succeeds in pricing this option with 5 underlyings very quickly (around 10 minutes). The hyperparameters have been chosen in order to have very accurate results. To reduce the time, it is possible to change some of the hyperparameters. In order to show that Algorithm 1 can give good results in a very short period of time, let us consider a neural network with 3 layers of 20 neurons each (instead of 30), (instead of ), (instead of ), a validation set size of (instead of ) and let us omit the evaluation on the test set. Results are given in Table 5. In less than 2 minutes, it is possible to obtain results with a relative error of at most 2.2%.

| Use case / Method | Algorithm 1 | Reference | Difference | Time (s) |
|---|---|---|---|---|
| l = 1 | 1.7748 | 1.774 | 0.04% | 555.9 |
| l = 2 | 3.4770 | 3.480 | 0.09% | 694.7 |
| l = 3 | 5.0897 | 5.111 | 0.42% | 758.5 |
| l = 4 | 6.6229 | 6.661 | 0.57% | 725.8 |
| l = 5 | 8.0588 | 8.124 | 0.8% | 685.7 |
| l = 6 | 9.4311 | 9.502 | 0.75% | 812.8 |
| Use case / Method | Algorithm 1 | Reference | Difference | Time (s) |
|---|---|---|---|---|
| l = 1 | 1.7376 | 1.774 | 2.05% | 91.9 |
| l = 2 | 3.4441 | 3.480 | 1.03% | 89.6 |
| l = 3 | 5.0310 | 5.111 | 1.57% | 91.3 |
| l = 4 | 6.5903 | 6.661 | 1.06% | 92.8 |
| l = 5 | 8.0362 | 8.124 | 1.08% | 89.9 |
| l = 6 | 9.2934 | 9.502 | 2.2% | 90.3 |
Let us now consider the case of a put option with , , , , , , , , , and . A delay constraint is now present and a higher number of dates is considered. We consider a batch size equal to , a neural network with 3 layers of size 10 and iterations. We compare in Table 6 the results obtained with Algorithm 1 to the ones obtained by Carmona and Touzi (2008). The algorithm gives satisfactory results, but in this particular situation the relative error increases with the number of exercise dates.
| Use case / Method | Algorithm 1 | Reference | Difference | Time (s) |
|---|---|---|---|---|
| | 9.8483 | 9.85 | 0.02% | 2076.1 |
| | 19.0105 | 19.26 | 1.3% | 2069.4 |
| | 27.6498 | 28.80 | 3.99% | 2074.5 |
| | 35.7977 | 38.48 | 6.97% | 2326.0 |
| | 43.3674 | 48.32 | 10.25% | 2086.6 |
The different use cases show that Algorithm 1 is able to solve optimal stopping time problems in a reasonable time, even when the dimension is high and also in the multi-exercise case. The algorithm is simple and allows us to find an optimal policy without any knowledge of the dynamic programming equation. While the computing time increases only slightly with the dimension, it increases much more with the number of time steps, and the algorithm may then struggle to converge. To confirm all those results, one should study the convergence of the algorithm, which is out of the scope of this paper.
3 Extension to stochastic impulse control
In this section we present how the algorithm of Section 2.3 can be combined with the method described in Fecamp et al. (2019) for impulse control problems. Impulse control has many applications in economics and finance (real options valuation in energy markets, optimal order execution in illiquid markets, …). In Section 3.3, we apply the algorithm to option hedging under fixed transaction costs.
An impulse control is a sequence with a sequence of increasing stopping times and a sequence of measurable random variables taking values in . Let the set of impulse control. A stochastic impulse control problem is the optimisation problem
(11) 
with the solution of (1). As for optimal stopping time problems, impulse control problems can be associated with a BSDE or with a quasi-variational inequality, see Kharroubi et al. (2010), and can only be solved numerically most of the time.
We consider the discretisation of problem (11) and we search for
where contains the control with taking values in a discrete set . This problem is equivalent to
(12) 
where is a sequence of measurable random variables with taking values in and taking values in .
Remark 3.1
More generally, the process can depend on the control if jumps are present in its dynamic but this case is not considered here.
3.1 Algorithm extension to impulse control
is a continuous control and can be optimised as done in Fecamp et al. (2019). A neural network with vector of parameters outputs the optimal continuous control:
with the set in which the bias and weights of lie. is chosen as a Recurrent Neural Network, see Section 3.2, but for ease of notation, the dependence with the memory of the neural network is omitted.
is again supposed to be Bernoulli distributed with a probability parameterised by a neural network with vector of parameters . Omitting the constraints (which can be easily added as in Equation (9)), the probability is parameterised as in Equation (8):
(13) 
with the set in which the bias and weights of lie.
Algorithm 2 finds the optimal parameters of the neural networks. The algorithm is quite similar to Algorithm 1 except there is a gradient descent over the parameters . The computation of this gradient is easily done using backpropagation. In Algorithm 2, the two gradients are concatenated in order to have one gradient updated with one learning rate: it is possible (and it is used in some of the results hereafter) to consider two different learning rates and to update each gradient separately.
3.2 Architecture of the neural networks
We propose to follow Fecamp et al. (2019) but we use two recurrent neural networks, one for and another one for . The recurrent cells are LSTM cells (Hochreiter and Schmidhuber, 1997) among which a series of feed-forward layers are placed (see Figure 2). One could give as inputs of the outputs of but from our tests it appears that it does not improve the results.
3.3 Numerical results: application to fixed transaction costs
In this section we consider the problem of hedging a European-style option paying under fixed transaction costs: at each date and for each asset , the portfolio manager pays a constant fee that does not depend on the volume of underlying that is purchased or sold (given that the volume is not null). We search for an impulse control with if there is a hedge at time and corresponds to the quantity of asset to buy or sell at time . Let us consider the self-financing portfolio with value at time
(14) 
where corresponds to the amount of asset held at time , corresponds to the option premium and where
The last control does not appear in the replication portfolio nor in the optimisation problem. The transaction costs paid until time , , generated by the strategy are equal to
(15) 
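The fixed-fee mechanism can be illustrated by a short sketch that counts the non-zero trades along each path. It is an illustration of the description above rather than a transcription of Equation (15): the fee is assumed identical for all assets and dates, and `fixed_transaction_costs`, `trades` and `fee` are hypothetical names.

```python
import numpy as np


def fixed_transaction_costs(trades, fee):
    """Cumulative fixed costs paid along each path.

    trades: array of shape (n_paths, n_dates, n_assets) holding the
            quantity bought (positive) or sold (negative) at each date.
    fee:    constant fee paid per asset and per date with a non-zero trade.
    Returns an array of shape (n_paths, n_dates).
    """
    trades = np.asarray(trades, dtype=float)
    n_trades = (trades != 0.0).sum(axis=2)   # assets traded at each date
    return fee * np.cumsum(n_trades, axis=1)
```

The cost depends only on whether a trade occurs, not on its size, which is what makes the hedging decision a discrete (impulse) control.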
One searches for an impulse control and an initial premium that minimise the moment-based criterion:
(16) 
where
(17) 
is the replication error and quantifies the trade-off between the average costs and the loss . One could consider different risk criteria as in Fecamp et al. (2019) but the mean-variance choice is convenient for comparison to usual stochastic control methods as a dynamic stochastic programming equation can be derived. Practitioners prefer to use the homogeneous criterion which can be addressed by Algorithm 2 but not by dynamic programming approaches.
3.3.1 European call option
In this section, we hedge a European call option (dimension 1, payoff ) in the Black-Scholes framework ( is the solution of (1) with and ) with

Case 1: ,