Solving optimal stopping problems with Deep Q-Learning

01/24/2021 ∙ by John Ery, et al. ∙ ETH Zurich

We propose a reinforcement learning (RL) approach to model optimal exercise strategies for option-type products. We pursue the RL avenue in order to learn the optimal action-value function of the underlying stopping problem. In addition to retrieving the optimal Q-function at any time step, one can also price the contract at inception. We first discuss the standard setting with one exercise right, and later extend this framework to the case of multiple stopping opportunities in the presence of constraints. We propose to approximate the Q-function with a deep neural network, which does not require the specification of basis functions as in the least-squares Monte Carlo framework and is scalable to higher dimensions. We derive a lower bound on the option price obtained from the trained neural network and an upper bound from the dual formulation of the stopping problem, which can also be expressed in terms of the Q-function. Our methodology is illustrated with examples covering the pricing of swing options.




1 Introduction

Reinforcement learning (RL) in its most general form deals with agents living in some environment and aiming to maximize a given reward function. Alongside supervised and unsupervised learning, it is often considered the third family of models in the machine learning literature. It encompasses a wide class of algorithms that have gained popularity in the context of building intelligent machines that can outperform masters in classic board games such as Go or chess, see e.g. silver16; silver17. These models are remarkably effective at learning the rules of a game, starting from little or no prior knowledge, and progressively developing winning strategies. Recent research, see e.g. deepmind, doubleQ, duelingQ, has considered integrating deep learning techniques into the reinforcement learning framework in order to model complex unstructured environments. Deep reinforcement learning can hence leverage both the ability of deep neural networks to uncover hidden structure in very complex functionals and the power of reinforcement techniques to take complex actions.

Optimal stopping problems from mathematical finance naturally fit into the reinforcement learning framework. Our work is motivated by the pricing of swing options, which appear in energy markets (oil, natural gas, electricity) to hedge against futures price fluctuations, see e.g. meinshausen, bender15, and more recently daluiso20. Intuitively, when behaving optimally, investors holding these options maximize their reward by following an optimal sequence of decisions, which in the case of swing options consists of purchasing a certain amount of electricity or natural gas at multiple exercise times.

The stopping problems we will consider belong to the category of Markov decision processes (MDPs). We refer the reader to puterman or bertsekas for good textbook references on this topic. When the size of the MDP becomes large or when the MDP is not fully known (model-free learning), alternatives to standard dynamic programming techniques must be sought. Reinforcement learning can efficiently tackle these issues and can be transposed to our problem of determining optimal stopping strategies.

Previous work exists on the connections between optimal stopping problems in mathematical finance and reinforcement learning. For example, the common problem of learning optimal exercise policies for American options has been tackled in li using reinforcement learning techniques. They implement two algorithms, namely least-squares policy iteration (LSPI), see lagoudakis, and fitted Q-iteration (FQI), see vanroy, and compare their performance to a benchmark provided by the least-squares Monte Carlo (LSMC) approach of longstaff. It is shown empirically that the strategies uncovered by both algorithms provide larger payoffs than LSMC. kohler08 model the Snell envelope of the underlying optimal stopping problem with a neural network. More recently, dos derive optimal stopping rules from Monte Carlo samples at each time step using deep neural networks. An alternative approach developed in becker20 considers the approximation of the continuation values using deep neural networks. This method also produces a dynamic hedging strategy based on the approximated continuation values. A similar approach with different activation functions is presented in lapeyre alongside a convergence result for the pricing algorithm, whereas the method employed in chen20 is based on BSDEs.

Our work aims at casting the optimal stopping decision into a unifying reinforcement learning framework through the modeling of the action-value function of the problem. One can then leverage reinforcement learning algorithms involving neural networks that learn the optimal action-value function at any time step. We illustrate this methodology with examples from mathematical finance. In particular, we focus on high-dimensional swing options, where action-taking is more complex and where deep neural networks are particularly powerful due to their approximation capabilities.

The remainder of the paper is structured as follows. In Section 2, we introduce the necessary mathematical tools from reinforcement learning, present an estimation approach for the Q-function using neural networks, and discuss the derivation of the lower and upper bounds on the option price. In Section 3, we explain the multiple stopping problem with a waiting period constraint between two consecutive exercise times, again with the derivation of a lower bound and an upper bound on the option price at inception. We display numerical results for swing options in Section 4 and conclude in Section 5.

2 Theory and methodology

In this section we present the mathematical building blocks and the reinforcement learning machinery, leading to the formulation of the stopping problems under consideration.

2.1 Markov decision processes and action-value function

As discussed in the introduction, the problems we will consider in the sequel can be embedded into the framework of the well-studied Markov decision processes (MDPs), see sutton. A Markov decision process is defined as a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ where

  • $\mathcal{S}$ is the set of states;

  • $\mathcal{A}$ is the set of actions the agent can take;

  • $P$ is the transition probability kernel, where $P(s' \mid s, a)$ is the probability of the future state $s'$ given that the current state is $s$ and that action $a$ is taken;

  • $R$ is a reward function, where $R(s, a)$ denotes the reward obtained when moving from state $s$ under action $a$ (note here that different definitions exist in the literature);

  • $\gamma \in [0, 1]$ is a discount factor which expresses preference towards short-term rewards (in the present work $\gamma = 1$, as we consider already discounted rewards).
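For concreteness, the five components above can be bundled into a minimal container for experimentation. This is an illustrative sketch only; the class and field names are our own choice and do not appear in the paper:

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class MDP:
    """Minimal container for a Markov decision process (illustrative)."""
    states: Sequence        # S: the set of states
    actions: Sequence       # A: the set of actions the agent can take
    transition: Callable    # P(s, a) -> distribution over next states s'
    reward: Callable        # R(s, a): reward when moving from s under a
    gamma: float = 1.0      # discount factor (gamma = 1: rewards pre-discounted)
```

A stopping problem then corresponds to a particular choice of these five ingredients, as described in Section 2.2.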

A policy $\pi$ is then a rule for selecting actions based on the last visited state. More specifically, $\pi(a \mid s)$ denotes the probability of taking action $a$ in state $s$ under policy $\pi$. The conventional task is to maximize the total (discounted) expected reward over policies, which can be expressed as $\sup_\pi \mathbb{E}^\pi\big[\sum_{t \ge 0} \gamma^t R(S_t, A_t)\big]$. A policy $\pi^*$ which maximizes this quantity is called an optimal policy. Given a starting state $S_0 = s$ and an initial action $A_0 = a$, one can define the action-value function, also called Q-function:

$$Q^\pi(s, a) = \mathbb{E}^\pi\Big[\sum_{t \ge 0} \gamma^t R(S_t, A_t) \,\Big|\, S_0 = s,\, A_0 = a\Big],$$

where $(S_t, A_t)_{t \ge 0}$ denotes the sequence of state-action pairs generated under $\pi$. The optimal policy satisfies

$$\pi^*(s) = \operatorname{argmax}_{a \in \mathcal{A}} Q^*(s, a),$$

where we write $Q^* = Q^{\pi^*}$. In other words, the optimal Q-function measures how "good" or "rewarding" it is to choose action $a$ while in state $s$ and to follow optimal decisions thereafter. We will consider problems with finite time horizon and accordingly set $\gamma = 1$.

2.2 Single stopping problems as Markov decision processes

We consider the same stopping problem as in dos and becker20, namely an American-style option defined on a finite time grid $0 = t_0 < t_1 < \dots < t_N$. The discounted payoff process $(G_n)_{n=0}^N$ is assumed to be square-integrable and takes the form $G_n = g(n, X_n)$ for a measurable function $g$ and a $d$-dimensional $(\mathcal{F}_n)$-Markovian process $(X_n)_{n=0}^N$ defined on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^N, \mathbb{P})$. Let $\mathcal{X}$ denote the space in which the underlying process lives. We assume that $X_0$ is deterministic and that $\mathbb{P}$ is the risk-neutral probability measure. The value of the option at time 0 is given by

$$V = \sup_{\tau \in \mathcal{T}} \mathbb{E}\big[g(\tau, X_\tau)\big], \qquad (3)$$

where $\mathcal{T}$ denotes the set of all stopping times $\tau \colon \Omega \to \{0, 1, \dots, N\}$. This problem is essentially a Markov decision process with state space $\mathcal{S} = \{0, \dots, N\} \times \mathcal{X} \times \{0, 1\}$, action space $\mathcal{A} = \{0, 1\}$ (where we follow the convention $a = 0$ for continuing and $a = 1$ for stopping), reward function

$$R\big((n, x), a\big) = a\, g(n, x), \qquad a \in \{0, 1\}$$

(when exercising, i.e. taking action 1, we implicitly move to the absorbing state, i.e. the last component of the state space becomes 1), and transition kernel driven by the dynamics of the $(\mathcal{F}_n)$-Markovian process $(X_n)$. The state space includes time, the $d$-dimensional Markovian process and an additional (absorbing) state which at each time step captures the event of exercise or no exercise. More precisely, we jump to this absorbing state when we have exercised. In the multiple stopping case, which we discuss in Section 3, we jump to this absorbing state once we have used the last exercise right. In both single and multiple stopping frameworks, once this absorbing state has been reached, we set all subsequent rewards and Q-values to 0. The associated Snell envelope process $(H_n)_{n=0}^N$ of the stopping problem in (3) is defined recursively by

$$H_N = G_N, \qquad H_n = \max\big(G_n, \mathbb{E}[H_{n+1} \mid \mathcal{F}_n]\big), \quad n = N-1, \dots, 0. \qquad (4)$$

It is well known that the Snell envelope provides an optimal stopping time solving (3), as stated in the following result (note that in particular $V = H_0$). A standard proof for the latter can be found in shreve.

Proposition 2.1.

The stopping time $\tau^*$ defined by

$$\tau^* = \min\big\{n \in \{0, \dots, N\} : H_n = G_n\big\},$$

for the Snell envelope $(H_n)$ given in (4), is optimal for the problem (3).
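The recursion (4) and the stopping rule of Proposition 2.1 can be evaluated exactly in a toy model. The sketch below prices a Bermudan put on a binomial tree by backward induction; the model and its parameters are our illustration, not the paper's setting, and we take the probabilities and payoffs as already risk-neutral and discounted (matching the convention $\gamma = 1$):

```python
import numpy as np


def snell_envelope_binomial(S0, K, u, d, p, n_steps):
    """Value H_0 of the Snell envelope of a Bermudan put on a binomial tree.

    Backward induction: H_N = G_N, then H_n = max(G_n, E[H_{n+1} | F_n]).
    Illustrative only; assumes payoffs are already discounted.
    """
    # terminal stock prices (indexed by number of up-moves) and payoffs
    j = np.arange(n_steps + 1)
    S = S0 * u**j * d**(n_steps - j)
    H = np.maximum(K - S, 0.0)                       # H_N = G_N
    for n in range(n_steps - 1, -1, -1):
        j = np.arange(n + 1)
        S = S0 * u**j * d**(n - j)
        cont = p * H[1:] + (1 - p) * H[:-1]          # E[H_{n+1} | F_n]
        H = np.maximum(np.maximum(K - S, 0.0), cont) # H_n = max(G_n, cont)
    return float(H[0])
```

By Proposition 2.1, stopping the first time the immediate payoff reaches the envelope value is optimal, and `H[0]` equals the option value $V$.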

Various modeling approaches have been proposed to estimate the option value in (3). kohler08 propose to model the Snell envelope directly, while dos take the approach of modeling the optimal stopping times. More recently, becker20 model the continuation values of the stopping problem. In this work, we instead propose to model the optimal action-value function of the problem for all $0 \le n \le N$ and $a \in \{0, 1\}$ (where $a = 1$ represents the stopping decision), given by

$$Q_n^*(x, a) = a\, g(n, x) + (1 - a)\, \mathbb{E}\big[H_{n+1} \mid X_n = x\big]. \qquad (5)$$

According to Proposition 2.1, through the knowledge of the optimal action-value function we can recover the optimal stopping time $\tau^*$. Indeed, it turns out that the optimal decision functions in dos can be expressed in the action-value function framework through

$$f_n(x) = \mathbb{1}\big\{Q_n^*(x, 1) \ge Q_n^*(x, 0)\big\},$$

where $\mathbb{1}$ denotes the indicator function. Moreover, one can express the Snell envelope (estimated in kohler08) as $H_n = \max_{a \in \{0, 1\}} Q_n^*(X_n, a)$, and the continuation value modeled in becker20 can be reformulated in our setting as $Q_n^*(X_n, 0)$. As a by-product, one can price financial products such as swing options by considering

$$V = \max_{a \in \{0, 1\}} Q_0^*(X_0, a).$$

In this perspective, our modeling approach is very similar to previous studies but differs in the reinforcement learning machinery employed. Indeed, modeling the action-value function and optimizing it is a common and natural approach, known under the name of Q-learning in the reinforcement learning literature. We introduce it in the next section.

2.3 Q-learning as estimation method

In contrast to policy or value iteration, Q-learning methods, see e.g. watkins89 and watkins92, estimate directly the optimal action-value function. They are model-free and can learn optimal strategies with no prior knowledge of the state transitions and the rewards. In this paradigm, an agent interacts with the environment (exploration step) and learns from past actions (exploitation step) to derive the optimal strategy.
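For intuition, the classical tabular update of watkins89 can be sketched in a few lines. The dict-based table, default values and step size below are our own illustrative choices:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=1.0, done=False):
    """One tabular Q-learning update:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).

    Q is a dict mapping (state, action) pairs to values; unseen pairs
    default to 0. A minimal sketch, not the paper's implementation.
    """
    best_next = 0.0 if done else max(Q.get((s_next, b), 0.0) for b in actions)
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q[(s, a)]
```

Repeating this update along trajectories drives the table towards the optimal Q-function; deep Q-learning replaces the table with a neural network, as described next.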

One way to model the action-value function is by using deep neural networks. This approach is referred to as deep Q-learning in the reinforcement learning literature. In this setup, the optimal action-value function is modeled with a neural network $Q(s, a; \theta)$, often called a deep Q-network (DQN), where $\theta$ is a vector of parameters corresponding to the network architecture. However, reinforcement learning can be highly unstable or even diverge due to the introduction of neural networks in the approximation of the Q-function. To tackle these issues, a variant of the original Q-learning method has been developed in mnih15. It relies on two main concepts. The first is called experience replay and removes correlations in the sequence of observations. In practice this is done by generating a large sample of experiences, which we denote as vectors $e_t = (s_t, a_t, r_t, s_{t+1})$ at each time $t$, and which we store in a dataset $\mathcal{D}$. We note that once we have reached the absorbing state, we start a new episode, or sequence of observations, by resetting the MDP to the initial state. Furthermore, we allow the agent to explore new unseen states according to a so-called $\varepsilon$-greedy strategy, see sutton, meaning that with probability $\varepsilon$ we take a random action and with probability $1 - \varepsilon$ we take the action maximizing the Q-value. Typically one reduces the value of $\varepsilon$ according to a linear schedule as the training iterations increase.
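The two ingredients just described, a uniformly sampled replay buffer and an $\varepsilon$-greedy rule with a linear schedule, can be sketched as follows. The capacity and schedule endpoints are illustrative defaults, not values from the paper:

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size experience replay buffer (a sketch)."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # old experiences are evicted

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling breaks the correlation of consecutive transitions
        return random.sample(self.buffer, batch_size)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def linear_schedule(step, eps_start=1.0, eps_end=0.05, decay_steps=1000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

During training, each interaction with the environment pushes one experience into the buffer, and the agent acts via `epsilon_greedy` with `linear_schedule(step)`.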

During the training phase, we then perform updates to the Q-values by sampling mini-batches uniformly at random from this dataset and minimizing over $\theta$ the following loss function:

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\big)^2\Big].$$

However, there might still be some correlations between the Q-values and the so-called target values $r + \gamma \max_{a'} Q(s', a'; \theta)$. The second improvement brought forward in mnih15 consists in updating the network parameters for the target values only at a regular frequency and not after each iteration. This is called parameter freezing and translates into minimizing over $\theta$ the modified loss function

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\big)^2\Big],$$

where the target network parameters $\theta^-$ are only updated with the DQN parameters $\theta$ every $C$ steps, and are held constant between individual updates.
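The frozen-target loss can be sketched in NumPy as follows. The arrays stand in for a mini-batch; the function names and the network outputs are illustrative, with `next_q_target` assumed to come from the frozen parameters $\theta^-$ and `q_online` from the current parameters $\theta$:

```python
import numpy as np


def dqn_targets(rewards, next_q_target, dones, gamma=1.0):
    """Targets y = r + gamma * max_a' Q(s', a'; theta^-).

    next_q_target: (batch, n_actions) Q-values from the frozen network.
    dones: 1.0 for terminal transitions, where the bootstrap term is dropped.
    """
    return rewards + gamma * (1.0 - dones) * next_q_target.max(axis=1)


def dqn_loss(q_online, actions, targets):
    """Mean squared error between Q(s, a; theta) and the fixed targets."""
    q_sa = q_online[np.arange(len(actions)), actions]  # select chosen actions
    return float(np.mean((targets - q_sa) ** 2))
```

Gradients are taken only through `q_online`; the targets are treated as constants between the periodic copies of $\theta$ into $\theta^-$.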

An alternative network specification would be to take only the state as input and output the Q-values for each action, see the implementation in deepmind. Network architectures such as double deep Q-networks, see doubleQ, dueling deep Q-networks, see duelingQ, and combinations thereof, see rainbow, have been developed to improve the training performance even further. However, the implementation of these algorithms is beyond the scope of our presentation.

2.4 Inference and confidence intervals

In the same spirit as dos and becker20, we compute lower and upper bounds on the option price in (3), a confidence interval resulting from the central limit theorem, as well as a point estimate for the optimal value $V$. In the sequel, for ease of notation, we suppress the dependence of the learned Q-function on its arguments where no confusion arises.

2.4.1 Lower bound

We store the parameters learned through the training of the deep neural network on an experience replay dataset of simulated paths. We denote as $\theta^* \in \mathbb{R}^q$ the vector of network parameters, where $q$ denotes the dimension of the parameter space, and $Q(\cdot, \cdot; \theta^*)$ corresponds to the calibrated network. We then generate $K$ new simulations $(x_n^k)_{n=0}^N$, $k = 1, \dots, K$, of the state space process, independent from those used for training. The independence is necessary to achieve unbiasedness of the estimates. The Monte Carlo average

$$\hat{L} = \frac{1}{K} \sum_{k=1}^{K} g\big(\tau^k, x^k_{\tau^k}\big),$$

where $\tau^k = \min\big\{n : Q(n, x_n^k, 1; \theta^*) \ge Q(n, x_n^k, 0; \theta^*)\big\}$, yields a lower bound for the optimal value $V$. Since the optimal strategies are not unique, we follow the convention of taking the largest optimal stopping time, which yields a strict inequality.
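The lower-bound estimator is valid for any exercise rule, since every admissible stopping strategy yields an expected payoff below the optimum. A minimal sketch, where `q_stop` and `q_cont` are placeholders for the two learned Q-values and `payoff` for $g$:

```python
def lower_bound(paths, payoff, q_stop, q_cont):
    """Monte Carlo lower bound from a learned exercise rule (a sketch).

    paths : list of fresh simulated paths [x_0, ..., x_N], independent of
            the training data
    payoff: payoff(n, x) -> discounted payoff g(n, x)
    q_stop, q_cont: learned Q-values for stopping / continuing; even a
            suboptimal rule yields a valid lower bound in expectation.
    """
    total = 0.0
    for path in paths:
        N = len(path) - 1
        for n, x in enumerate(path):
            # stop the first time stopping looks at least as good as
            # continuing; at maturity, exercise is forced
            if n == N or q_stop(n, x) >= q_cont(n, x):
                total += payoff(n, x)
                break
    return total / len(paths)
```

Averaging the realized payoffs over the fresh paths gives the estimate $\hat{L}$.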

2.4.2 Upper bound

The derivation of the upper bound is based on the Doob-Meyer decomposition of the supermartingale given by the Snell envelope, see shreve. The Snell envelope $(H_n)$ of the discounted payoff process can be decomposed as

$$H_n = H_0 + M_n - A_n,$$

where $(M_n)$ is the $(\mathcal{F}_n)$-martingale given by

$$M_0 = 0, \qquad M_n - M_{n-1} = H_n - \mathbb{E}[H_n \mid \mathcal{F}_{n-1}],$$

and $(A_n)$ is the non-decreasing $(\mathcal{F}_n)$-predictable process given by

$$A_0 = 0, \qquad A_n - A_{n-1} = H_{n-1} - \mathbb{E}[H_n \mid \mathcal{F}_{n-1}].$$

From Proposition 7 in dos, given a sequence $(\varepsilon_n)_{n=0}^N$ of integrable random variables such that $\mathbb{E}[\varepsilon_n \mid \mathcal{F}_n] = 0$ for all $n$, one has

$$V \le \mathbb{E}\Big[\max_{0 \le n \le N} \big(G_n - \tilde{M}_n - \varepsilon_n\big)\Big]$$

for every $(\mathcal{F}_n)$-martingale $(\tilde{M}_n)$ starting from 0.

This upper bound is tight if $\tilde{M} = M$ and $\varepsilon \equiv 0$. We can then use the optimal action-value function learned via the deep neural network to construct a martingale close to $M$. We now adapt the approach presented in dos to the expression of the martingale component of the Snell envelope. Indeed, the martingale differences from Subsection 3.2 in dos can be written in terms of the optimal action-value function, since the continuation value at time $n-1$ is given by evaluating the optimal action-value function at action 0 (continuing). Given the definition of the optimal action-value function at (5), one can rewrite the martingale differences as

$$M_n - M_{n-1} = \max_{a \in \{0,1\}} Q_n^*(X_n, a) - Q_{n-1}^*(X_{n-1}, 0).$$
The empirical counterparts are given by generating realizations of the martingale differences based on a sample of $K_U$ simulations $(\tilde{x}_n^k)$. Again, we simulate realizations of the state space process independently from the simulations used for training. This gives us the following empirical differences:

$$\hat{M}^k_n - \hat{M}^k_{n-1} = \big(a^k_n\, g(n, \tilde{x}^k_n) + (1 - a^k_n)\, \hat{C}^k_n\big) - \hat{C}^k_{n-1},$$

where $a^k_n \in \{0, 1\}$ is the chosen action at time $n$ for simulation path $k$, and $\hat{C}^k_n$ are the Monte Carlo averages approximating the continuation values, for $n = 0, \dots, N-1$ and $k = 1, \dots, K_U$.

The continuation values appearing in the martingale increments are obtained through nested simulation, see the remark below:

$$\hat{C}^k_n = \frac{1}{J} \sum_{j=1}^{J} \max_{a \in \{0,1\}} Q\big(n+1, \tilde{x}^{k,j}_{n+1}, a; \theta^*\big),$$

where $J$ is the number of simulations in the inner step, and where, given each $\tilde{x}^k_n$, we simulate (conditional) continuation paths $\tilde{x}^{k,j}_{n+1}$, $j = 1, \dots, J$, that are conditionally independent of each other and of the outer path, and $\tilde{x}^{k,j}_{n+1}$ is the value of the state process at time $n+1$ along the $j$-th inner path.
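The inner step of the nested simulation can be sketched as follows. The function names are placeholders: `simulate_next` stands for one draw from the transition kernel, and `q_stop`/`q_cont` for the learned Q-values:

```python
def nested_continuation(x_n, simulate_next, q_stop, q_cont, n, J, rng):
    """Nested-simulation estimate of the continuation value at (n, x_n).

    simulate_next(x, rng) draws one inner realization x_{n+1} given x_n;
    J is the number of inner simulations. A sketch with assumed names.
    """
    vals = []
    for _ in range(J):
        x_next = simulate_next(x_n, rng)
        # the value at n+1 is the max over actions of the learned Q-function
        vals.append(max(q_stop(n + 1, x_next), q_cont(n + 1, x_next)))
    return sum(vals) / J
```

Averaging the inner values gives an unbiased estimate of the continuation value implied by the learned Q-function, conditionally on the outer state.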


It is not guaranteed that $Q(n, x, 0; \theta^*)$ equals the true continuation value $\mathbb{E}[H_{n+1} \mid X_n = x]$ for the Q-function learned via the neural network. To tackle this issue, we implement nested simulations as in dos and becker20 to estimate the continuation values. This gives unbiased estimates of the continuation values implied by the learned Q-function, which is crucial to obtain a valid upper bound. Moreover, the variance of the estimates decreases with the number of inner simulations, at the expense of increased computational time.

Finally, we can derive an unbiased estimate of the upper bound for the optimal value:

$$\hat{U} = \frac{1}{K_U} \sum_{k=1}^{K_U} \max_{0 \le n \le N}\big(g(n, \tilde{x}^k_n) - \hat{M}^k_n\big).$$
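Once the payoffs and the approximating martingale have been evaluated along each path, the dual estimate reduces to a pathwise maximum followed by an average. A minimal sketch with assumed array shapes:

```python
import numpy as np


def upper_bound(payoffs, martingale):
    """Monte Carlo estimate of the dual bound E[max_n (G_n - M_n)].

    payoffs   : array (K, N+1) of discounted payoffs G_n along each path
    martingale: array (K, N+1) of the approximating martingale, with
                martingale[:, 0] = 0.  A sketch, not the full pipeline.
    """
    return float(np.mean(np.max(payoffs - martingale, axis=1)))
```

The closer the martingale is to the Doob-Meyer martingale part of the Snell envelope, the smaller the gap between this estimate and the lower bound.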
2.4.3 Point estimate and confidence interval

The average between the lower and the upper bound is considered as a point estimate of $V$ in dos and becker20:

$$\hat{V} = \frac{\hat{L} + \hat{U}}{2}.$$

Assuming the discounted payoff process $(G_n)$ is square-integrable for all $n$, we also obtain that the upper bound is square-integrable. Let $z_{1-\alpha/2}$ denote the $(1 - \alpha/2)$-quantile of the standard normal distribution. Defining the empirical standard deviations for the lower and upper bounds as $\hat{\sigma}_L$ and $\hat{\sigma}_U$, respectively, one can leverage the central limit theorem to build the asymptotic two-sided $(1 - \alpha)$-confidence interval for the true optimal value:

$$\left[\hat{L} - z_{1-\alpha/2}\, \frac{\hat{\sigma}_L}{\sqrt{K}},\;\; \hat{U} + z_{1-\alpha/2}\, \frac{\hat{\sigma}_U}{\sqrt{K_U}}\right].$$
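The interval construction is a direct application of the central limit theorem to the two Monte Carlo samples. A self-contained sketch using only the standard library and NumPy:

```python
from math import sqrt
from statistics import NormalDist

import numpy as np


def confidence_interval(lower_samples, upper_samples, alpha=0.05):
    """Asymptotic two-sided (1 - alpha) CI from pathwise lower/upper samples.

    lower_samples: realized payoffs of the learned stopping rule
    upper_samples: realized pathwise maxima from the dual formulation
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)          # standard normal quantile
    L, U = float(np.mean(lower_samples)), float(np.mean(upper_samples))
    se_L = np.std(lower_samples, ddof=1) / sqrt(len(lower_samples))
    se_U = np.std(upper_samples, ddof=1) / sqrt(len(upper_samples))
    return (L - z * se_L, U + z * se_U)
```

Note that the interval covers the true value asymptotically because $\hat{L}$ and $\hat{U}$ bracket $V$ in expectation.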
We have presented in this section the unifying properties of Q-learning compared to other approaches used to study optimal stopping problems. On the one hand, we do not require any iterative procedure and do not have to solve a potentially complicated optimization problem at each time step; the calibrated deep neural network solves the optimal stopping problem on the whole time interval. On the other hand, we are able to accommodate any finite number of possible actions. Looking back at the direct approach of dos to model optimal stopping policies, the parametric form of the stopping times would explode if we allowed for more than two possible actions.

3 Multiple stopping with constraints

In this section we extend the previous problem to the more general framework of multiple-exercise options. Examples from this family include swing options, which are common in the electricity market. The holder of such an option is entitled to exercise a certain right, e.g. the delivery of a certain amount of energy, several times, until the maturity of the contract. The number of exercise rights and constraints on how they can be used are specified at inception. Typical constraints are a waiting period, i.e. a minimal waiting time between two exercise rights, and a volume constraint, which specifies how many units of the underlying asset can be purchased at each time.

Monte Carlo valuation of such products has been studied in meinshausen, producing lower and upper bounds for the price. Building on the dual formulation for option pricing, alternative methods additionally accounting for waiting time constraints have been considered in bender11, and for both volume and waiting time constraints in bender15. In all cases, the multiple stopping problem is decomposed into several single stopping problems using the so-called reduction principle. The dual formulation in meinshausen expresses the marginal excess value due to each additional exercise right as an infimum of an expectation over a certain space of martingales and a set of stopping times. A version of the dual problem in discrete time relying solely on martingales is presented in schoenmakers, and a dual for the continuous time problem with a non-trivial waiting time constraint is derived in bender11. In the latter case, the optimization is not only over a space of martingales, but also over adapted processes of bounded variation, which stem from the Doob-Meyer decomposition of the Snell envelope. The dual problem in the more general setting considering both volume and waiting time constraints is formulated in bender15.

We now express the multiple stopping extension of the problem defined at (3) for American-style options. Assume that the option holder has $p$ exercise rights over the lifetime of the contract. We consider the setting with no volume constraint and a waiting period $\delta$, which we assume to be a multiple of the time step resulting from the discretization of the time interval. The action space is still $\mathcal{A} = \{0, 1\}$. The state space now has an additional dimension corresponding to the number of remaining exercise opportunities. As in standard stopping, we assume an absorbing state to which we jump once the $p$-th (last) right has been exercised.

We note that due to the introduction of the waiting period, depending on the specification of $p$ and $\delta$, it may not be possible for the option holder to exercise all rights before maturity, see the discussion in bender15, where a "cemetery time" is defined. If the specification of these parameters allows the exercise of all rights, and if we assume that $g(n, x) \ge 0$ for all $n$ and $x$, then it will always be optimal to use all exercise rights. The value of this option with $p$ exercise possibilities at time 0 is given by

$$V^{(p)} = \sup_{(\tau_1, \dots, \tau_p) \in \mathcal{T}_p} \mathbb{E}\left[\sum_{i=1}^{p} g\big(\tau_i, X_{\tau_i}\big)\right],$$

where $\mathcal{T}_p$ is the set of $p$-tuples of stopping times $(\tau_1, \dots, \tau_p)$ in $\{0, \dots, N\}$ satisfying $\tau_{i+1} - \tau_i \ge \delta$ for $i = 1, \dots, p - 1$.

As in bender11, one can combine the dynamic programming principle with the reduction principle to rewrite the primal optimization problem. We introduce the continuation value functions defined in bender11 for each time step and each number of remaining rights, together with the associated value functions, with the boundary conventions that the value with 0 remaining rights is 0 at all times. In the sequel, we denote by $H^r$ the Snell envelope for the problem with $r$ remaining exercise rights, for $r = 1, \dots, p$. The reduction principle essentially states that the option with $r$ stopping times is as good as the single option paying the immediate cashflow plus the option with $r - 1$ stopping times starting with a temporal delay of $\delta$. This philosophy is also followed in meinshausen by looking at the marginal extra payoff obtained with an additional exercise right. One continuation value function corresponds to the continuation value in case of no exercise, and the other to the continuation value in case of exercise, which requires a waiting period of $\delta$.

As shown in bender11, one can derive the optimal policy from the continuation values. Indeed, the optimal stopping times $\tau_i^*$ for $i = 1, \dots, p$ are given recursively as the first admissible times at which exercising is at least as valuable as continuing, with starting value $\tau_0^* = -\delta$, which is a convention to make sure that the first exercise time is bounded from below by 0. The optimal price is then the value of the problem at inception with all $p$ rights, and as in the single stopping framework, one can express the Snell envelopes, the optimal stopping times and the continuation values in terms of the optimal Q-function. To remain consistent with the notation introduced above, we denote by $Q_n^{*,r}(x, a)$ the optimal Q-value in state $(n, x)$ when there are $r$ remaining exercise rights. The continuation value in case of no exercise can then be expressed as $Q_n^{*,r}(x, 0)$, the Snell envelope as $H_n^r = \max_{a \in \{0,1\}} Q_n^{*,r}(X_n, a)$, and the optimal policy as the rule that exercises a right as soon as $Q_n^{*,r}(X_n, 1) \ge Q_n^{*,r}(X_n, 0)$. Analogously to standard stopping with one exercise right, we can derive a lower bound from the primal problem and an upper bound from the dual problem. Moreover, we derive a confidence interval around the point estimate based on Monte Carlo simulations.

3.1 Lower bound

As in Section 2.4.1, we denote by $Q(\cdot, \cdot; \theta^*)$ the deep neural network calibrated through the training process using experience replay on a sample of simulated paths. We then generate a new set of $K$ simulations $(x_n^k)$, independent from the simulations used for training. Then, using the learned stopping times $\hat{\tau}_i^k$ for $i = 1, \dots, p$, with the convention $\hat{\tau}_0^k = -\delta$ for all $k$, the Monte Carlo average

$$\hat{L} = \frac{1}{K} \sum_{k=1}^{K} \sum_{i=1}^{p} g\big(\hat{\tau}_i^k, x^k_{\hat{\tau}_i^k}\big)$$

yields a lower bound for the optimal value $V^{(p)}$. In order not to overload the notation, we drop additional indices on the simulated state space above.
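As in the single stopping case, any admissible exercise rule respecting the waiting period yields a valid lower bound. A minimal sketch, where `q_stop`/`q_cont` are placeholders for the learned Q-values with the number of remaining rights as an extra argument, and where a right is exercised early whenever waiting would make it impossible to use all remaining rights:

```python
def multi_lower_bound(paths, payoff, q_stop, q_cont, n_rights, delta):
    """Monte Carlo lower bound for multiple stopping with waiting period delta.

    A sketch with assumed interfaces: q_stop(n, x, r) / q_cont(n, x, r) take
    the number r of remaining rights; any suboptimal admissible rule still
    gives a valid lower bound in expectation.
    """
    total = 0.0
    for path in paths:
        N = len(path) - 1
        rights, n, value = n_rights, 0, 0.0
        while n <= N and rights > 0:
            x = path[n]
            # forced exercise: waiting one more step would forfeit a right
            forced = (N - n) < 1 + (rights - 1) * delta
            if forced or q_stop(n, x, rights) >= q_cont(n, x, rights):
                value += payoff(n, x)
                rights -= 1
                n += delta          # waiting period before the next right
            else:
                n += 1
        total += value
    return total / len(paths)
```

The jump `n += delta` after each exercise enforces the constraint $\tau_{i+1} - \tau_i \ge \delta$ along every simulated path.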

3.2 Upper bound

By exploiting the dual as in bender11, one can also derive an upper bound on the optimal value $V^{(p)}$. In order to do so, we consider the Doob decompositions of the supermartingales given by the Snell envelopes:

$$H_n^r = H_0^r + M_n^r - A_n^r,$$

where $M^r$ is an $(\mathcal{F}_n)$-martingale with $M_0^r = 0$, and $A^r$ is a non-decreasing $(\mathcal{F}_n)$-predictable process with $A_0^r = 0$, for all $r = 1, \dots, p$. The corresponding approximated terms using the learned Q-function lead to the following decomposition:

$$\hat{H}_n^r = \hat{H}_0^r + \hat{M}_n^r - \hat{A}_n^r,$$

where $\hat{M}^r$ are martingales with $\hat{M}_0^r = 0$ for $r = 1, \dots, p$, and $\hat{A}^r$ are integrable adapted processes in discrete time with $\hat{A}_0^r = 0$ for $r = 1, \dots, p$.

Moreover, one can write the increments of both the martingale and adapted components in terms of the Q-function, exactly as in the single stopping case. Given the existence of the waiting period, one must also include a $\delta$-step increment term accounting for the delayed continuation after an exercise. We note that since the adapted process is predictable in the exact decomposition, this increment is equal to 0 for the optimal martingale, and we retrieve the dual formulation in schoenmakers.

As the dual formulation involves conditional expectations, we use nested simulation on a new set of independent simulations, with inner simulations for each outer simulation as explained in Section 2.4.2, to approximate the one-step-ahead continuation values and the $\delta$-step-ahead continuation values. We use the resulting Monte Carlo estimators of these conditional expectations to express the empirical counterparts of the adapted process increments for each number of remaining rights.

We can then rewrite the empirical counterparts of the Snell envelopes through the Q-function:

$$\hat{H}_n^r = \max_{a \in \{0,1\}} Q\big(n, x_n, a, r; \theta^*\big)$$

for $r = 1, \dots, p$, where we set $\hat{H}_n^0 = 0$ for all $n$ (no more exercises left). The theoretical upper bound stemming from the dual problem in bender11 is expressed as an expectation of a pathwise supremum over the martingales and adapted processes introduced above. We hence obtain $V^{(p)} \le U^{(p)}$, and this bound is sharp for the exact Doob-Meyer decomposition terms. We denote the sharp upper bound as $U^{(p)}$.

A Monte Carlo average over the simulated paths then yields an estimate $\hat{U}$ of the upper bound for the optimal price. The pathwise supremum appearing in the expression of the upper bound can be computed using the recursion formula from Proposition 3.8 in bender15. This recursion formula is implemented in our setting using the representation via the Q-function.

3.3 Point estimate and confidence interval

As in dos and becker20, we can construct a point estimate for the optimal value in the multiple stopping framework in the presence of a waiting time constraint by averaging the bounds:

$$\hat{V}^{(p)} = \frac{\hat{L} + \hat{U}}{2}.$$

By storing the empirical standard deviations for the lower and upper bounds, denoted $\hat{\sigma}_L$ and $\hat{\sigma}_U$ respectively, one can leverage the central limit theorem as in Section 2.4.3 to derive the asymptotic two-sided $(1 - \alpha)$-confidence interval for the true optimal value.


3.4 Bias on the dual upper bound

We now derive the extension of a result presented in meinshausen, on the bias resulting from the derivation of the upper bound, to the case of multiple stopping in the presence of a waiting period. The dual problem from meinshausen, being obtained from an optimization over a space of martingales and a set of stopping times, contains two bias terms: the bias coming from the martingale approximation, and the bias coming from the policy approximation. In the case with a waiting constraint, as exemplified in the dual of bender11, we show how one can again control the bias in the approximations to the Doob-Meyer decompositions of the Snell envelopes $H^r$ for $r = 1, \dots, p$. Indeed, in the dual problem, each martingale $M^r$ is approximated by a martingale $\hat{M}^r$, and each predictable non-decreasing process $A^r$ is approximated by an integrable adapted process in discrete time $\hat{A}^r$. We proceed in three steps and analyse separately the bias from each approximation employed:

  • Martingale terms:

  • Adapted terms:

  • Final term:

The error in the final term can be bounded using the methodology in meinshausen. Define

as the distance between the true Snell envelope and its approximation, and

as an upper bound on the Monte Carlo error from the 1-step ahead nested simulation to approximate the continuation values.

In order to study the bias coming from the martingale approximations, we define

as the distance between the optimal Snell envelope and its approximation over all remaining exercise times,