Reinforcement learning (RL) in its most general form deals with agents living in some environment and aiming to maximize a given reward function. Alongside supervised and unsupervised learning, it is often considered the third family of models in the machine learning literature. It encompasses a wide class of algorithms that have gained popularity in the context of building intelligent machines that can outperform masters in ancient board games such as Go or chess, see e.g. silver16; silver17. These models are very skilled at learning the rules of a given game, starting from little or no prior knowledge, and at progressively developing winning strategies. Recent research, see e.g. deepmind, doubleQ, duelingQ, has considered integrating deep learning techniques into the framework of reinforcement learning in order to model complex unstructured environments. Deep reinforcement learning can hence leverage both the ability of deep neural networks to uncover hidden structure from very complex functionals and the power of reinforcement techniques to take complex actions.
Optimal stopping problems from mathematical finance naturally fit into the reinforcement learning framework. Our work is motivated by the pricing of swing options which appear in energy markets (oil, natural gas, electricity) to hedge against futures price fluctuations, see e.g. meinshausen, bender15, and more recently daluiso20. Intuitively, when behaving optimally, investors holding these options are trying to maximize their reward by following some optimal sequence of decisions, which in the case of swing options consists in purchasing a certain amount of electricity or natural gas at multiple exercise times.
The stopping problems we will consider belong to the category of Markov decision processes (MDPs). We refer the reader to puterman or bertsekas for good textbook references on this topic. When the size of the MDP becomes large or when the MDP is not fully known (model-free learning), alternatives to standard dynamic programming techniques must be sought. Reinforcement learning can efficiently tackle these issues and can be transposed to our problem of determining optimal stopping strategies.
Previous work exists on the connections between optimal stopping problems in mathematical finance and reinforcement learning. For example, the common problem of learning optimal exercise policies for American options has been tackled in li using reinforcement learning techniques. They implement two algorithms, namely least-squares policy iteration (LSPI), see lagoudakis, and fitted Q-iteration (FQI), see vanroy, and compare their performance to a benchmark provided by the least-squares Monte Carlo (LSMC) approach of longstaff. It is shown empirically that strategies uncovered by both these algorithms provide larger payoffs than LSMC. kohler08 model the Snell envelope of the underlying optimal stopping problem with a neural network. More recently, dos derive optimal stopping rules from Monte Carlo samples at each time step using deep neural networks. An alternative approach developed in becker20 considers the approximation of the continuation values using deep neural networks. This method also produces a dynamic hedging strategy based on the approximated continuation values. A similar approach with different activation functions is presented in lapeyre alongside a convergence result for the pricing algorithm, whereas the method employed in chen20 is based on BSDEs.
Our work aims at casting the optimal stopping decision into a unifying reinforcement learning framework through the modeling of the action-value function of the problem. One can then leverage reinforcement learning algorithms involving neural networks that learn the optimal action-value function at any time step. We illustrate this methodology by presenting examples from mathematical finance. In particular, we will focus on high-dimensional swing options where the action taking is more complex, and where deep neural networks are particularly powerful due to their approximation capabilities.
The remainder of the paper is structured as follows. In Section 2, we introduce the necessary mathematical tools from reinforcement learning, present an estimation approach of the Q-function using neural networks, and discuss the derivation of the lower and upper bounds on the option price. In Section 3 we explain the multiple stopping problem with a waiting period constraint between two consecutive exercise times, again with the derivation of a lower bound and an upper bound on the option price at inception. We display numerical results for swing options in Section 4, and conclude in Section 5.
2 Theory and methodology
In this section we present the mathematical building blocks and the reinforcement learning machinery, leading to the formulation of the stopping problems under consideration.
2.1 Markov decision processes and action-value function
As discussed in the introduction, the problems we will consider in the sequel can be embedded into the framework of the well-studied Markov decision processes (MDPs), see sutton. A Markov decision process is defined as a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ where
$\mathcal{S}$ is the set of states;
$\mathcal{A}$ is the set of actions the agent can take;
$P$ is the transition probability kernel, where $P(\cdot \mid s, a)$ is the probability of future states given that the current state is $s$ and that action $a$ is taken;
$R$ is a reward function, where $R(s, a)$ denotes the reward obtained when moving from state $s$ under action $a$ (note here that different definitions exist in the literature);
$\gamma \in [0, 1]$ is a discount factor which expresses preference towards short-term rewards (in the present work $\gamma = 1$, as we consider already discounted rewards).
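A policy $\pi$ is then a rule for selecting actions based on the last visited state. More specifically, $\pi(a \mid s)$ denotes the probability of taking action $a$ in state $s$ under policy $\pi$. The conventional task is to maximize the total (discounted) expected reward over policies, which can be expressed as
$$\sup_{\pi} \mathbb{E}\Big[\sum_{t \ge 0} \gamma^t r_t \,\Big|\, \pi\Big].$$
A policy which maximizes this quantity is called an optimal policy. Given a starting state $s$ and an initial action $a$, one can define the action-value function, also called Q-function:
$$Q^{\pi}(s, a) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^t r_t \,\Big|\, s_0 = s,\ a_0 = a,\ \pi\Big],$$
where $r_t$ denotes the reward collected at step $t$ for a sequence of state-action pairs $(s_t, a_t)_{t \ge 0}$. The optimal policy $\pi^*$ satisfies
$$Q^{\pi^*}(s, a) = \sup_{\pi} Q^{\pi}(s, a) \quad \text{for all } (s, a),$$
where we write $Q^*$ for $Q^{\pi^*}$. In other words, the optimal Q-function measures how "good" or "rewarding" it is to choose action $a$ while in state $s$ and to follow optimal decisions thereafter. We will consider problems with finite time horizon $T$ and we accordingly set $r_t = 0$ for all $t > T$.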
2.2 Single stopping problems as Markov decision processes
We consider the same stopping problem as in dos and becker20, namely an American-style option defined on a finite time grid $0 = t_0 < t_1 < \dots < t_N = T$. The discounted payoff process $(G_n)_{n=0}^N$ is assumed to be square-integrable and takes the form $G_n = g(n, X_{t_n})$ for a measurable function $g$ and a $d$-dimensional $\mathbb{F}$-Markovian process $(X_{t_n})_{n=0}^N$ defined on a filtered probability space $(\Omega, \mathcal{F}, \mathbb{F} = (\mathcal{F}_n)_{n=0}^N, \mathbb{P})$. Let $\mathcal{X} \subseteq \mathbb{R}^d$ denote the space in which the underlying process lives. We assume that $X_{t_0}$ is deterministic and that $\mathbb{P}$ is the risk-neutral probability measure. The value of the option at time $0$ is given by
$$V = \sup_{\tau \in \mathcal{T}} \mathbb{E}[G_\tau], \tag{3}$$
where $\mathcal{T}$ denotes all stopping times $\tau : \Omega \to \{0, 1, \dots, N\}$. This problem is essentially a Markov decision process with state space $\mathcal{S} = \{0, \dots, N\} \times \mathcal{X} \times \{0, 1\}$, action space $\mathcal{A} = \{0, 1\}$ (where we follow the convention $a = 0$ for continuing and $a = 1$ for stopping), reward function
$$R(s, a) = a \, g(n, x)$$
for $s = (n, x, 0)$, and transition kernel driven by the dynamics of the $\mathbb{F}$-Markovian process $(X_{t_n})_{n=0}^N$. (Footnote 3: When exercising, i.e. taking action $a = 1$, we implicitly move to the absorbing state, i.e. the last component of the state becomes 1.) The state space includes time, the $d$-dimensional Markovian process and an additional (absorbing) state which at each time step captures the event of exercise or no exercise. More precisely, we jump to this absorbing state when we have exercised. In the multiple stopping case which we discuss in Section 3, we jump to this absorbing state once we have used the last exercise right. In both single and multiple stopping frameworks, once this absorbing state has been reached at a random time $\tau$, we set all rewards and Q-values to 0 for $n > \tau$. The associated Snell envelope process $(H_n)_{n=0}^N$ of the stopping problem in (3) is defined recursively by
$$H_N = G_N, \qquad H_n = \max\big\{G_n,\ \mathbb{E}[H_{n+1} \mid \mathcal{F}_n]\big\}, \quad n = N-1, \dots, 0.$$
It is well known that the Snell envelope provides an optimal stopping time solving (3) as stated in the following result. (Footnote 4: Note that in particular $H_0 = V$. A standard proof for the latter can be found in shreve.)
Various modeling approaches have been proposed to estimate the option value in (3). kohler08 propose to model the Snell envelope directly, whereas dos take the approach of modeling the optimal stopping times. More recently, becker20 model the continuation values of the stopping problem. In this work, we instead propose to model the optimal action-value function of the problem for all time steps $n \in \{0, \dots, N\}$ and actions $a \in \{0, 1\}$ (where $a = 1$ represents the stopping decision) given by
According to Proposition 2.1, through the knowledge of the optimal action-value function we can recover the optimal stopping time . Indeed, it turns out that the optimal decision functions in dos can be expressed in the action-value function framework through
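where $\mathbb{1}$ denotes the indicator function. Moreover, writing $s_n$ for the state at time step $n$ and $Q^*$ for the optimal action-value function, one can express the Snell envelope (estimated in kohler08) as $\max_{a \in \{0, 1\}} Q^*(s_n, a)$ and the continuation value modeled in becker20 can be reformulated in our setting as $Q^*(s_n, 0)$. As a by-product, one can price financial products such as swing options by considering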
In this perspective, our modeling approach is very similar to previous studies but differs in the reinforcement learning machinery employed. Indeed, modeling the action-value function and optimizing it is a common and natural approach known under the name of Q-learning in the reinforcement learning literature. We introduce it in the next section.
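To make the link between the action-value function, the Snell envelope and the continuation value concrete, the following is a minimal sketch on a toy model that is not part of the paper: a put option on a recombining binomial tree, where the Q-values can be computed exactly by backward induction. The function name `q_backward_induction`, the tree parameters and the put payoff are all assumptions chosen for illustration; action 1 stands for stopping and action 0 for continuing.

```python
import numpy as np

def q_backward_induction(s0=100.0, strike=100.0, r=0.05, sigma=0.2, T=1.0, N=50):
    """Exact Q-values for a put on a recombining binomial tree (toy model).

    Returns two lists indexed by the time step n. Entry n is an array over
    the n + 1 tree nodes: q_stop[n] holds Q*(s, 1) and q_cont[n] holds Q*(s, 0).
    """
    dt = T / N
    u = np.exp(sigma * np.sqrt(dt))
    d = 1.0 / u
    p = (np.exp(r * dt) - d) / (u - d)       # risk-neutral up probability

    def discounted_payoff(n):
        j = np.arange(n + 1)                  # number of up moves at step n
        prices = s0 * u**j * d**(n - j)
        return np.exp(-r * n * dt) * np.maximum(strike - prices, 0.0)

    # Stopping yields the (already discounted) payoff.
    q_stop = [discounted_payoff(n) for n in range(N + 1)]
    q_cont = [None] * (N + 1)
    q_cont[N] = np.zeros(N + 1)               # nothing is left after maturity
    snell = q_stop[N]                         # Snell envelope at maturity
    for n in range(N - 1, -1, -1):
        # Continuing is worth the conditional expectation of the next Snell
        # envelope; no extra discounting since payoffs are already discounted.
        q_cont[n] = p * snell[1:] + (1.0 - p) * snell[:-1]
        # Snell envelope = maximum of the Q-values over the two actions.
        snell = np.maximum(q_stop[n], q_cont[n])
    return q_stop, q_cont
```

With these arrays, the value at inception is `max(q_stop[0][0], q_cont[0][0])`, and comparing `q_stop[n]` with `q_cont[n]` node by node gives the stopping decision at step `n`. In the deep Q-learning setting of the paper, these exact arrays would be replaced by the output of a trained network.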
2.3 Q-learning as estimation method
In contrast to policy or value iteration, Q-learning methods, see e.g. watkins89 and watkins92, estimate directly the optimal action-value function. They are model-free and can learn optimal strategies with no prior knowledge of the state transitions and the rewards. In this paradigm, an agent interacts with the environment (exploration step) and learns from past actions (exploitation step) to derive the optimal strategy.
One way to model the action-value function is by using deep neural networks. This approach is known as deep Q-learning in the reinforcement learning literature. In this setup, the optimal action-value function $Q^*(s, a)$ is modeled with a neural network $Q(s, a; \theta)$, often called a deep Q-network (DQN), where $\theta$ is a vector of parameters corresponding to the network architecture. However, reinforcement learning can be highly unstable or even diverge when neural networks are used to approximate the Q-function. To tackle these issues, a variant of the original Q-learning method has been developed in mnih15. It relies on two main concepts. The first is called experience replay and removes correlations in the sequence of observations. In practice this is done by generating a large sample of experiences, which we denote as vectors $e_t = (s_t, a_t, r_t, s_{t+1})$ at each time $t$ and which we store in a dataset $\mathcal{D}$. We note that once we have reached the absorbing state, we start a new episode or sequence of observations by resetting the MDP to the initial state $s_0$. Furthermore, we allow the agent to explore new unseen states according to a so-called $\varepsilon$-greedy strategy, see sutton, meaning that with probability $\varepsilon$ we take a random action and with probability $1 - \varepsilon$ we take the action maximizing the Q-value. Typically one reduces the value of $\varepsilon$ according to a linear schedule as the training iterations increase.
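The exploration rule described above can be sketched in a few lines. This is only an illustration: the function names `linear_epsilon` and `epsilon_greedy`, as well as the default start value, end value and decay length, are assumptions and are not taken from the paper.

```python
import random

def linear_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal the exploration rate from eps_start to eps_end."""
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one.

    q_values: sequence with one Q-value per action for the current state.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))   # explore
    # Exploit: index of the largest Q-value (first one in case of ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

For the stopping problems considered here there are only two actions, so `q_values` would hold the Q-values for continuing and stopping in the current state.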
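During the training phase, we then perform updates to the Q-values by sampling mini-batches $(s, a, r, s')$ uniformly at random from this dataset $\mathcal{D}$ and minimizing over the network parameters $\theta$ the following loss function
$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{U}(\mathcal{D})}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\big)^2\Big].$$
However, there might still be some correlations between the Q-values $Q(s, a; \theta)$ and the so-called target values $r + \gamma \max_{a'} Q(s', a'; \theta)$. The second improvement brought forward in mnih15 consists in updating the network parameters for the target values only with a regular frequency and not after each iteration. This is called parameter freezing and translates into minimizing over $\theta$ the modified loss function
$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{U}(\mathcal{D})}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\big)^2\Big],$$
where the target network parameters $\theta^{-}$ are only updated with the DQN parameters $\theta$ every $C$ steps, and are held constant between individual updates.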
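As an illustration of parameter freezing, the following sketch evaluates the squared temporal-difference loss on one mini-batch, with the targets computed from a separate, frozen parameter set. It is a simplified stand-in rather than the authors' implementation: `dqn_loss`, the batch layout and the linear model `linear_q` (used here in place of a neural network) are assumptions, and the gradient step itself is omitted.

```python
import numpy as np

def dqn_loss(q_fn, theta, theta_target, batch, gamma=1.0):
    """Mean squared TD error with a frozen target network.

    q_fn(states, params) must return an array of shape (batch, n_actions).
    batch = (states, actions, rewards, next_states, done) as NumPy arrays,
    where done is 1.0 if the transition reached the absorbing state.
    """
    states, actions, rewards, next_states, done = batch
    # Targets are computed with the frozen parameters, not with theta.
    next_q = q_fn(next_states, theta_target).max(axis=1)
    # Q-values are set to 0 once the absorbing state has been reached.
    targets = rewards + gamma * (1.0 - done) * next_q
    q_taken = q_fn(states, theta)[np.arange(len(actions)), actions]
    return np.mean((targets - q_taken) ** 2)

def linear_q(states, params):
    """Hypothetical stand-in for a network: Q(s, .) = s @ params."""
    return states @ params
```

In a training loop one would minimize this loss over `theta` only and overwrite `theta_target` with a copy of `theta` at a fixed frequency, keeping it constant in between.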
An alternative network specification would be to take only the state as input and update the Q-values for each action, see the implementation in deepmind. Network architectures such as double deep Q-networks, see doubleQ, dueling deep Q-networks, see duelingQ, and combinations thereof, see rainbow, have been developed to improve the training performance even further. However, the implementation of these algorithms is beyond the scope of our presentation.
2.4 Inference and confidence intervals
In the same spirit as dos and becker20, we compute lower and upper bounds on the option price in (3). In the sequel, for ease of notation, we will use for
2.4.1 Lower bound
We store the parameters learned through the training of the deep neural network on an experience replay dataset with simulations for We denote as the vector of network parameters where denotes the dimension of the parameter space and corresponds to the calibrated network. We then generate new simulations of the state space process , independent from those used for training, for The independence is necessary to achieve unbiasedness of the estimates. The Monte Carlo average
where yields a lower bound for the optimal value Since the optimal strategies are not unique, we follow the convention of taking the largest optimal stopping rule which yields a strict inequality.
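The lower bound estimate can be sketched as follows, assuming the learned Q-values have already been evaluated along freshly simulated paths. The function name `lower_bound` and the array layout are assumptions made for illustration. Following the convention mentioned above, stopping is triggered only when the Q-value for stopping strictly exceeds the one for continuing, and exercise is forced at maturity.

```python
import numpy as np

def lower_bound(payoffs, q_stop, q_cont):
    """Monte Carlo lower bound from a learned stopping rule.

    payoffs, q_stop, q_cont: arrays of shape (K, N + 1) holding, per path
    and time step, the discounted payoff and the learned Q-values for
    stopping and continuing. Paths must be independent of the training data.
    Returns the sample mean, the sample standard deviation and the stopping
    step chosen on each path.
    """
    payoffs = np.asarray(payoffs, dtype=float)
    stop = np.asarray(q_stop) > np.asarray(q_cont)   # strict inequality
    stop[:, -1] = True                               # must stop at maturity
    tau = stop.argmax(axis=1)                        # first step with stop == True
    realized = payoffs[np.arange(len(payoffs)), tau]
    return realized.mean(), realized.std(ddof=1), tau
```

Because the stopping rule is evaluated on paths that were not used for training, the sample mean is an unbiased estimate of the expected payoff under that (possibly suboptimal) rule, which is why it bounds the optimal value from below.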
2.4.2 Upper bound
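The derivation of the upper bound is based on the Doob-Meyer decomposition of the supermartingale given by the Snell envelope, see shreve. The Snell envelope $(H_n)_{n=0}^N$ of the discounted payoff process $(G_n)_{n=0}^N$ can be decomposed as
$$H_n = H_0 + M^H_n - A^H_n,$$
where $M^H$ is the $(\mathcal{F}_n)$-martingale given by
$$M^H_0 = 0, \qquad M^H_n - M^H_{n-1} = H_n - \mathbb{E}[H_n \mid \mathcal{F}_{n-1}], \quad n = 1, \dots, N,$$
and $A^H$ is the non-decreasing $(\mathcal{F}_n)$-predictable process given by
$$A^H_0 = 0, \qquad A^H_n - A^H_{n-1} = H_{n-1} - \mathbb{E}[H_n \mid \mathcal{F}_{n-1}], \quad n = 1, \dots, N.$$
From Proposition 7 in dos, given a sequence $(\varepsilon_n)_{n=0}^N$ of integrable random variables on $(\Omega, \mathcal{F}, \mathbb{P})$ such that $\mathbb{E}[\varepsilon_n \mid \mathcal{F}_n] = 0$ for all $n$, one has for the option value $V$ in (3)
$$V \le \mathbb{E}\Big[\max_{0 \le n \le N} \big(G_n - M_n - \varepsilon_n\big)\Big]$$
for every $(\mathcal{F}_n)$-martingale $(M_n)_{n=0}^N$ starting from 0.
This upper bound is tight if $M = M^H$ and $\varepsilon \equiv 0$. We can then use the optimal action-value function learned via the deep neural network to construct a martingale close to $M^H$. We now adapt the approach presented in dos to the expression of the martingale component of the Snell envelope. Indeed, the martingale differences from Subsection 3.2 in dos can be written in terms of the optimal action-value function: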
since the continuation value at time is given by evaluating the optimal action-value function at action (continuing). Given the definition of the optimal action-value function at (5), one can rewrite the martingale differences as
The empirical counterparts are given by generating realizations of based on a sample of simulations for Again, we simulate realizations of the state space process independently from the simulations used for training. This gives us the following empirical differences:
where is the chosen action at time for simulation path and are the Monte Carlo averages approximating the continuation values for and
The continuation values appearing in the martingale increments are obtained through nested simulation, see the remark below:
where is the number of simulations in the inner step, and where, given each we simulate (conditional) continuation paths that are conditionally independent of each other and of and is the value of along the path
It is not guaranteed that for the Q-function learned via the neural network. To tackle this issue, we implement nested simulations as in dos and becker20 to estimate the continuation values. This gives unbiased estimates of which is crucial to obtain a valid upper bound. Moreover, the variance of the estimates decreases with the number of inner simulations, at the expense of increased computational time.
Finally we can derive an unbiased estimate for the upper bound of the optimal value
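Once martingale increments have been estimated along each outer path (for instance through nested simulation as described above), the dual estimate reduces to averaging a pathwise maximum. The sketch below assumes these increments are given as an input array; the function name `dual_upper_bound` and the array layout are illustrative assumptions, and the nested simulation itself is not shown.

```python
import numpy as np

def dual_upper_bound(payoffs, martingale_increments):
    """Monte Carlo estimate of E[max_n (G_n - M_n)] for a martingale with M_0 = 0.

    payoffs: shape (K, N + 1), discounted payoffs at steps 0..N per path.
    martingale_increments: shape (K, N), estimated increments M_n - M_{n-1}.
    Returns the sample mean and sample standard deviation of the pathwise maxima.
    """
    payoffs = np.asarray(payoffs, dtype=float)
    increments = np.asarray(martingale_increments, dtype=float)
    zeros = np.zeros((payoffs.shape[0], 1))
    # Rebuild the martingale path from its increments, starting at 0.
    martingale = np.hstack([zeros, np.cumsum(increments, axis=1)])
    pathwise_max = (payoffs - martingale).max(axis=1)
    return pathwise_max.mean(), pathwise_max.std(ddof=1)
```

The returned standard deviation can be stored alongside the mean for the construction of confidence intervals.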
2.4.3 Point estimate and confidence interval
The point estimate of the optimal value considered in dos and becker20 is the average of the lower and the upper bound:
Assuming the discounted payoff process is square-integrable for all we also obtain that the upper bound is square-integrable. Let denote the
respectively, one can leverage the central limit theorem to build the asymptotic two-sided -confidence interval for the true optimal value
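A minimal sketch of the point estimate and the asymptotic confidence interval is given below. It assumes the standard construction used in this line of work, where the interval runs from the lower bound minus a normal quantile times its standard error to the upper bound plus a normal quantile times its standard error. The function name and argument names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def point_estimate_and_ci(lower, upper, sd_lower, sd_upper,
                          k_lower, k_upper, alpha=0.05):
    """Point estimate and asymptotic two-sided (1 - alpha) confidence interval.

    lower, upper: Monte Carlo estimates of the lower and upper bounds.
    sd_lower, sd_upper: corresponding sample standard deviations.
    k_lower, k_upper: number of simulated paths behind each estimate.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # standard normal quantile
    point = 0.5 * (lower + upper)
    ci = (lower - z * sd_lower / sqrt(k_lower),
          upper + z * sd_upper / sqrt(k_upper))
    return point, ci
```

For example, `point_estimate_and_ci(10.0, 11.0, 2.0, 3.0, 10_000, 900)` combines a lower bound of 10 and an upper bound of 11 into a point estimate of 10.5 and widens the interval on each side by the respective sampling error.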
We have presented in this section the unifying properties of Q-learning compared to other approaches used to study optimal stopping problems. On the one hand, we do not require any iterative procedure and do not have to solve a potentially complicated optimization problem at each time step. Indeed, the calibrated deep neural network solves the optimal stopping problem on the whole time interval. On the other hand, we are able to accommodate any finite number of possible actions. Looking back at the direct approach of dos to model optimal stopping policies, the parametric form of the stopping times would explode if we allowed for more than two possible actions.
3 Multiple stopping with constraints
In this section we extend the previous problem to the more general framework of multiple-exercise options. Examples from this family include swing options, which are common in the electricity market. The holder of such an option is entitled to exercise a certain right, e.g. the delivery of a certain amount of energy, several times, until the maturity of the contract. The number of exercise rights and constraints on how they can be used are specified at inception. Typical constraints are a waiting period, i.e. a minimal waiting time between two exercise rights, and a volume constraint, which specifies how many units of the underlying asset can be purchased at each time.
Monte Carlo valuation of such products has been studied in meinshausen, producing lower and upper bounds for the price. Building on the dual formulation for option pricing, alternative methods additionally accounting for waiting time constraints have been considered in bender11, and for both volume and waiting time constraints in bender15. In all cases, the multiple stopping problem is decomposed into several single stopping problems using the so-called reduction principle. The dual formulation in meinshausen expresses the marginal excess value due to each additional exercise right as an infimum of an expectation over a certain space of martingales and a set of stopping times. A version of the dual problem in discrete time relying solely on martingales is presented in schoenmakers, and a dual for the continuous time problem with a non-trivial waiting time constraint is derived in bender11. In the latter case, the optimization is not only over a space of martingales, but also over adapted processes of bounded variation, which stem from the Doob-Meyer decomposition of the Snell envelope. The dual problem in the more general setting considering both volume and waiting time constraints is formulated in bender15.
We now express the multiple stopping extension of the problem defined at (3) for American-style options. Assume that the option holder has exercise rights over the lifetime of the contract. We consider the setting with no volume constraint and a waiting time which we assume to be a multiple of the time step resulting from the discretization of the interval The action space is still The state space now has an additional dimension corresponding to the number of remaining exercise opportunities. As in standard stopping, we assume an absorbing state to which we jump once the -th right has been exercised.
We note that due to the introduction of the waiting period, depending on the specification of and it may not be possible for the option holder to exercise all his rights before maturity, see the discussion in bender15, where a "cemetery time" is defined. If the specification of these parameters allows the exercise of all rights, and if we assume that for all then it will always be optimal to use all exercise rights. The value of this option with exercise possibilities at time is given by
where is the set of -tuples of stopping times in satisfying for
As in bender11, one can combine the dynamic programming principle with the reduction principle to rewrite the primal optimization problem. We introduce the following functions defined in bender11 for and
and we define the functions as
We set for all and all and for all In the sequel, we denote as the Snell envelope for the problem with remaining exercise rights, for The reduction principle essentially states that the option with stopping times is as good as the single option paying the immediate cashflow plus the option with stopping times starting with a temporal delay of This philosophy is also followed in meinshausen by looking at the marginal extra payoff obtained with an additional exercise right. The function corresponds to the continuation value in case of no exercise and the function to the continuation value in case of exercise, which requires a waiting period of
As shown in bender11, one can derive the optimal policy from the continuation values. Indeed, the optimal stopping times for are given by
for starting value , which is a convention to make sure that the first exercise time is bounded from below by 0. The optimal price is then
and as in the single stopping framework, one can express the Snell envelope, the optimal stopping times and the continuation values in terms of the optimal Q-function Indeed, the continuation values can be expressed as
the Snell envelope as
and the optimal policy as
To remain consistent with the notation introduced above for the functions and we denote by the optimal Q-value in state i.e. when there are remaining exercise rights. Analogously to standard stopping with one exercise right, we can derive a lower bound from the primal problem and an upper bound from the dual problem. Moreover, we derive a confidence interval around the pointwise estimate based on Monte Carlo simulations.
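The combination of dynamic programming and the reduction principle described above can be illustrated on a toy model where conditional expectations are exact. The sketch below assumes a finite-state Markov chain with a known transition matrix, which is not the setting of the paper; the function name `multiple_stopping_values` and its arguments are hypothetical. At each step, the value with `j` remaining rights is the maximum of the continuation value without exercise and the immediate payoff plus the value with `j - 1` rights after the waiting period.

```python
import numpy as np

def multiple_stopping_values(payoff, transition, n_rights, delay):
    """Values of a multiple stopping problem with a waiting period (toy model).

    payoff: array (N + 1, S), discounted payoff at step n in chain state x.
    transition: array (S, S), one-step transition matrix of the chain.
    n_rights: number of exercise rights; delay: waiting period in steps (>= 1).
    Returns an array value[j, n, x]: value with j rights left at step n, state x.
    """
    payoff = np.asarray(payoff, dtype=float)
    transition = np.asarray(transition, dtype=float)
    n_steps, n_states = payoff.shape
    p_delay = np.linalg.matrix_power(transition, delay)
    # Zero padding beyond maturity: all values vanish after the last step,
    # and value[0] = 0 since no rights are left.
    value = np.zeros((n_rights + 1, n_steps + delay, n_states))
    for j in range(1, n_rights + 1):
        for n in range(n_steps - 1, -1, -1):
            # Continuation value if no right is exercised at step n.
            cont_no_exercise = transition @ value[j, n + 1]
            # Continuation value after exercising: one right fewer,
            # available again only after the waiting period.
            cont_exercise = p_delay @ value[j - 1, n + delay]
            value[j, n] = np.maximum(payoff[n] + cont_exercise, cont_no_exercise)
    return value[:, :n_steps]
```

In the Q-learning formulation, the two arguments of the maximum play the role of the Q-values for stopping and continuing in the state with `j` remaining rights. In high dimensions the exact expectations used here are unavailable, which is where the neural network approximation comes in.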
3.1 Lower bound
As in Section 2.4.1, we denote by the deep neural network calibrated through the training process using experience replay on a sample of simulated paths for We then generate a new set of simulations , independent from the simulations used for training, for Then, using the learned stopping times
for and with the convention for all the Monte Carlo average
yields a lower bound for the optimal value In order to not overload the notation we consider in the subscript of the simulated state space above.
3.2 Upper bound
By exploiting the dual as in bender11, one can also derive an upper bound on the optimal value In order to do so, we consider the Doob decomposition of the supermartingales given by
where is a -martingale with and is a non-decreasing -predictable process with for all and The corresponding approximated terms using the learned Q-function lead to the following decomposition:
where are martingales with for and are integrable adapted processes in discrete time with for
Moreover, one can write the increments of both the martingale and adapted components as:
Given the existence of the waiting period, one must also include the -increment term
We note that for since is a predictable process, this increment is equal to 0 for the optimal martingale and we retrieve the dual formulation in schoenmakers.
As the dual formulation involves conditional expectations, we use nested simulation on a new set of independent simulations for with inner simulations for each outer simulation as explained in Section 2.4.2, to approximate the one-step ahead continuation values and the -steps ahead continuation values We denote the Monte Carlo estimators of these conditional expectations as and respectively. We use these quantities to express the empirical counterparts of the adapted process increments for
We can then rewrite the empirical counterparts of the Snell envelopes through the Q-function:
for and where we set for and (no more exercises left). The theoretical upper bound stemming from the dual problem in bender11 is given by:
We hence obtain and this bound is sharp for the exact Doob-Meyer decomposition terms and for We denote the sharp upper bound as
The following Monte Carlo average then yields an estimate of the upper bound for the optimal price
The pathwise supremum appearing in the expression of the upper bound can be computed using the recursion formula from Proposition 3.8 in bender15. This recursion formula is implemented in our setting using the representation via the Q-function.
3.3 Point estimate and confidence interval
As in dos and becker20, we can construct a point estimate for the optimal value in the multiple stopping framework in the presence of a waiting time constraint by averaging the lower and upper bounds:
By storing the empirical standard deviations for the lower and upper bounds that we denote as and respectively, one can leverage the central limit theorem as in Section 2.4.3 to derive the asymptotic two-sided -confidence interval for the true optimal value
3.4 Bias on the dual upper bound
We now derive the extension of a result presented in meinshausen on the bias resulting from the derivation of the upper bound, to the case of multiple stopping in presence of a waiting period. The dual problem from meinshausen, being obtained from an optimization over a space of martingales and a set of stopping times, contains two terms: the bias coming from the martingale approximation, and the bias coming from the policy approximation. In the case with waiting constraint, as exemplified in the dual of bender11, we show how one can again control the bias in the approximations to the Doob-Meyer decompositions of the Snell envelopes for Indeed, in the dual problem, each martingale is approximated by a martingale and each predictable non-decreasing process is approximated by an integrable adapted process in discrete time We proceed in three steps and analyse separately the bias from each approximation employed:
The error in the final term can be bounded using the methodology in meinshausen. Define
as the distance between the true Snell envelope and its approximation, and
as an upper bound on the Monte Carlo error from the 1-step ahead nested simulation to approximate the continuation values.
In order to study the bias coming from the martingale approximations, we define
as the distance between the optimal Snell envelope and its approximation over all remaining exercise times,