1 Introduction
We consider the problem of optimal stopping, which can be described as follows: a system evolves stochastically from one state to another in discrete time steps. At each decision epoch, a decision maker (DM) chooses whether to stop the system or to allow it to continue for one more time step. If the DM chooses to stop the system, she garners a reward that is dependent on the current state of the system; if she chooses to allow it to continue, she does not receive any reward in the current period, but can potentially stop it at a future time to obtain a higher reward. The DM must specify a policy, which prescribes the action to be taken (stop/continue) for each state the system may enter. The optimal stopping problem is to find the policy that achieves the highest possible reward in expectation.
The optimal stopping problem is a key problem in stochastic control and arises in many important applications; we name a few below:

Option pricing. One of the most important applications of optimal stopping is to the pricing of financial options that allow for early exercise, such as American and Bermudan options. The system is the collection of underlying securities (typically stocks) that the option is written on. The prices of the securities comprise the system state. The decision to stop the system corresponds to exercising the option and receiving the corresponding payoff; thus, the problem of obtaining the highest expected payoff from a given option corresponds to finding an optimal stopping policy. The highest expected payoff attained by such an optimal stopping (optimal exercise) policy is then the price that the option writer should charge for the option.

Healthcare. Consider an organ transplant patient waiting on a transplant list, who may be periodically offered an organ for transplantation. In this context, the system corresponds to the patient and the currently available organ, and the system state describes the patient’s health and the attributes of the organ. The decision to stop corresponds to the patient accepting the organ, where the reward is the estimated quality-adjusted life years (QALYs) that the patient will garner upon transplantation. The problem is then to find a policy that prescribes for a given patient and a given available organ whether the patient should accept the organ, or wait for the next available organ, so as to maximize the QALYs gained from the transplant.

Marketing. Consider a retailer selling a finite inventory of products over some finite time horizon. The system state describes the remaining inventory of the products, which evolves over time as customers buy the products from period to period. The action of stopping the system corresponds to starting a price promotion that will last until the end of the finite time horizon. The problem is to decide at each period, based on the remaining inventory, whether to commence the promotion, or to wait one more period, in order to maximize the total revenue garnered by the end of the horizon.
Large-scale optimal stopping problems that occur in practice are typically solved by approximate dynamic programming (ADP) methods. The goal in such ADP methods is to approximate the optimal value function that, for a given system state, specifies the best possible expected reward that can be attained when one starts in that state. With an approximate value function in hand, one can obtain a good policy by considering the greedy policy with respect to the approximate value function.
In ADP methods, depending on the approximation architecture, the approximate value function may provide some insight into which states of the system state space are more desirable. However, the policy that one obtains by being greedy with respect to the approximate value function need not have any readily identifiable structure. This is disadvantageous, because in many optimal stopping problems, we are not only interested in policies that attain high expected reward, but also policies that are interpretable. A policy that is interpretable is one where we can directly see how the state of the system maps to the recommended action, and the relation between state and action is sufficiently transparent.
Interpretability is desirable for three reasons. First, in modeling a real-world system, it is useful to obtain some insight about what aspects of the system state are important for controlling it optimally or near-optimally. Second, a complex policy, such as a policy that is obtained via an ADP method, may not be operationally feasible in many real-life contexts. Lastly, and most importantly, in many application domains where the decision maker is legally responsible or accountable for the action taken by a policy, interpretability is not merely a desirable feature, but a requirement: in such settings a decision maker will simply not adopt the policy without the ability to explain the policy’s mechanism of action. There is moreover a regulatory push to increase the transparency and interpretability of customer-facing data-driven algorithms; as an example, General Data Protection Regulation rules set by the EU (Doshi-Velez and Kim 2017) dictate that algorithms which can differentiate between users must provide explanations for their decisions if such queries arise.
In this paper, we consider the problem of constructing interpretable optimal stopping policies from data. Our approach to interpretability is to consider policies that are representable as a binary tree. In such a tree, each leaf node is an action (stop or go), and each non-leaf node (also called a split node) is a logical condition in terms of the state variables which determines whether we proceed to the left or the right child of the current node. To determine the action we should take, we take the current state of the system, run it down the tree until we reach a leaf, and take the action prescribed in that leaf. An example of such a tree-based policy is presented in Figure 0(a). Policies of this kind are simple, and allow the decision maker to directly see the link between the current system state and the action.
Before delving into our results, we comment on how one might compare such an interpretable policy with an optimal or, possibly, near-optimal heuristic policy. Our informal goal is to look for the best policy within the constraints of an interpretable policy architecture, which in our case is tree-based policies. However, without any a priori knowledge that a given optimal stopping problem is “simple enough”, one would expect that enforcing that the policy architecture be interpretable carries a performance price; in other words, interpretable policies should generally not perform as well as a state-of-the-art heuristic. On the other hand, one can hope that there exist stopping problems where the price of interpretability is low, in that interpretable policies, while suboptimal, do not carry a large optimality gap. The present paper is an attempt to (a) exhibit stopping problems of practical interest for which tree-based policies attain near-optimal performance, along with being interpretable, and (b) provide an algorithm to find such interpretable policies directly from data.
We make the following specific contributions:

Sample average approximation. We formulate the problem of learning an interpretable tree-based policy from a sample of trajectories of the system (sequences of states and payoffs) as a sample average approximation (SAA) problem. To the best of our knowledge, the problem of directly learning a policy in the form of a tree for optimal stopping problems (and Markov decision processes more generally) has not been studied before. We show that under mild conditions, the tree policy SAA problem defined using a finite sample of trajectories converges almost surely in objective value to the tree policy problem defined with the underlying stochastic process, as the number of trajectories available as data grows large.

Computational tractability. From a computational complexity standpoint, we establish that three natural simplifications of the SAA problem are NP-Hard and thus finding good solutions to the SAA problem is challenging.
In response, we present a computationally tractable methodology for solving the learning problem. Our method is a construction algorithm: starting from a degenerate tree consisting of a single leaf, the algorithm grows the tree in each iteration by splitting a single leaf into two new child leaves. The split is chosen greedily, and the algorithm stops when there is no longer a sufficient improvement in the sample-based reward. Key to the procedure is determining the optimal split point at a candidate split; we show that this problem can be solved in a computationally efficient manner. In this way, the overall algorithm is fully data-driven, in that the split points are directly chosen from the data and are not artificially restricted to a collection of split points chosen a priori. While the algorithm resembles top-down tree induction methods from classification/regression, several important differences arising from the temporal nature of the problem make this construction procedure algorithmically nontrivial.

Practical performance versus state-of-the-art ADP methods. We numerically demonstrate the value of our methodology by applying it to the problem of pricing a Bermudan option, which is a canonical optimal stopping problem in finance. We show that our tree policies outperform two state-of-the-art approaches, namely the simulation-regression approach of Longstaff and Schwartz (2001) and the martingale duality-based approach of Desai et al. (2012b). At the same time, we also show that the tree policies produced by our approach are remarkably simple and intuitive. We further investigate the performance of our approach by testing it on a stylized one-dimensional optimal stopping problem (not drawn from option pricing), where the exact optimal policy can be computed; in general, our policy is either optimal or very close to optimal.
The rest of this paper is organized as follows. In Section 2, we discuss the relevant literature in ADP, machine learning and interpretable decision making. In Section 3, we formally define our optimal stopping problem and its sample-based approximation, define the problem of finding a tree policy from sample data, and analyze its complexity. In Section 4, we present a heuristic procedure for greedily constructing a tree directly from data. In Section 5, we present an extensive computational study in option pricing comparing our algorithm to alternative approaches. In Section 6, we evaluate our algorithm on the aforementioned one-dimensional problem. Finally, we conclude in Section 7.
2 Literature review
Our paper relates to three different broad streams of research: the optimal stopping and ADP literature; the machine learning literature; and the multi-stage stochastic/robust optimization literature. We survey each of these below.
Approximate dynamic programming (ADP). ADP has been extensively studied in the operations research community since the mid-1990s as a solution technique for Markov decision processes (Powell 2007, Van Roy 2002). In the last fifteen years, there has been significant interest in solving MDPs by approximating the linear optimization (LO) model; at a high level, one formulates the MDP as an LO problem, reduces the number of variables and constraints in a tractable manner, and solves the more accessible problem to obtain a value function approximation. Some examples of this approach include De Farias and Van Roy (2003), Adelman and Mersereau (2008), Desai et al. (2012a) and Bertsimas and Mišić (2016).
A large subset of the ADP research literature, originating in both the operations research and finance communities, has specifically studied optimal stopping problems. The seminal papers of Carriere (1996), Longstaff and Schwartz (2001) and Tsitsiklis and Van Roy (2001) propose simulation-regression approaches, where one simulates trajectories of the system state and uses regression to compute an approximation to the optimal continuation value at each step. Later research has considered the use of martingale duality techniques. The idea in such approaches is to relax the non-anticipativity requirement of the policy by allowing the policy to use future information, but to then penalize policies that use this future information, in the spirit of Lagrangean duality. This approach yields upper bounds on the optimal value and can also be used to derive high-quality stopping policies. Examples include Rogers (2002), Andersen and Broadie (2004), Haugh and Kogan (2004), Chen and Glasserman (2007), Brown et al. (2010) and Desai et al. (2012b).
Our approach differs from these generic ADP approaches and optimal stopping-specific approaches in two key ways. First, general-purpose ADP approaches, as well as those specifically designed for optimal stopping, are focused on obtaining an approximate value function or an upper bound on the value function, which is then used in a greedy manner. In contrast, our approach involves optimizing over a policy directly, without computing/optimizing over a value function. Second, our approach is designed with interpretability in mind, and produces a policy that can be easily visualized as a binary tree. In contrast, previously proposed ADP methods are not guaranteed to result in policies that are interpretable.
Lastly, we note that some methods in finance for pricing options with early exercise involve tree representations; examples include the binomial lattice approach (Cox et al. 1979) and the random tree method (Broadie and Glasserman 1997). However, the trees found in these methods represent discretizations of the sample paths of the underlying asset prices, which provide a tractable way to perform scenario analysis. In contrast, the trees in our paper represent policies, not sample paths. As such, our approach is unrelated to this prior body of work.
Machine learning. Interpretability has been a goal of major interest in the machine learning community, starting with the development of decision trees in the 1980s (Breiman et al. 1984, Quinlan 1986, 1993). A stream of research has considered interpretable scoring rules for classification and risk prediction; recent examples include Ustun and Rudin (2015, 2016) and Zeng et al. (2017). Another stream of research considers the design of disjunctive rules and rule lists; recent examples include Wang et al. (2015), Wang and Rudin (2015), Wang et al. (2017), Letham et al. (2015), Angelino et al. (2017) and Lakkaraju et al. (2016).
The algorithm we will present is closest in spirit to classical tree algorithms like CART and ID3. Our algorithm differs from these prior methods in that it is concerned with optimal stopping, which is a stochastic control problem that involves making a decision over time, and is fundamentally different from classification and regression. In particular, a key part of estimating a classification or regression tree is determining the leaf labels, which in general is a computationally simple task. As we will see in Section 3.5, the analogous problem in the optimal stopping realm is NP-Hard. As a result, this leads to some important differences in how the tree construction must be done to account for the temporal nature of the problem; we comment on these in more detail in Section 4.3.
Interpretable decision making. In the operations research community, there is growing interest in interpretability as it pertains to dynamic decision making; we provide some recent examples here. With regard to dynamic problems, Bertsimas et al. (2013) consider the problem of designing a dynamic allocation policy for allocating deceased donor kidneys to patients requiring a kidney transplant that maximizes efficiency while respecting certain fairness constraints. To obtain a policy that is sufficiently interpretable to policy makers, Bertsimas et al. (2013) further propose using ordinary least squares regression to find a scoring rule that predicts the fairness-adjusted match quality as a function of patient and donor characteristics. In more recent work, Azizi et al. (2018) consider a related approach for dynamically allocating housing resources to homeless youth. The paper proposes a general mixed-integer optimization framework for selecting an interpretable scoring rule (specifically, linear scoring rules, decision tree rules with axis-aligned or oblique splits, or combinations of both linear and decision tree rules) for prioritizing youth on the waiting list. Lastly, the paper of Bravo and Shaposhnik (2017) considers the use of machine learning for analyzing optimal policies to MDPs. The key idea of the paper is to solve instances of a given MDP to optimality, and then to use the optimal policies as inputs to machine learning methods. The paper applies this methodology to classical problems such as inventory replenishment, admission control and multi-armed bandits. This approach differs from ours, in that we do not have access to the optimal policy; in fact, the goal of our method is to directly obtain a near-optimal policy.
3 Problem definition
We begin by defining the optimal stopping problem in Section 3.1, and its sample-based counterpart in Section 3.2. We then define the tree policy sample-average approximation (SAA) problem, where the class of policies is restricted to those that can be represented by a binary tree, in Section 3.3. We show that the tree policy SAA problem converges in Section 3.4. Finally, in Section 3.5, we show that the tree policy SAA problem is NP-Hard when one considers three specific simplifications.
3.1 Optimal stopping model
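Consider a system with state given by $x = (x_1, \dots, x_n) \in \mathcal{X} \subseteq \mathbb{R}^n$. We let $x(t)$ denote the state of the system at time $t$, which evolves according to some stochastic process. We let $\mathcal{A} = \{\mathbf{stop}, \mathbf{go}\}$ denote the action space of the problem; we may either stop the system ($\mathbf{stop}$) or allow it to continue for one more time step ($\mathbf{go}$). We assume a finite horizon problem with $T$ periods, starting at period $t = 1$. We let $g(t, x)$ denote the reward or payoff from stopping the system when the current period is $t$ and the current system state is $x$. We assume that all rewards are discounted by a factor of $\beta$ for each period.
We define a policy $\pi$ as a mapping from the state space $\mathcal{X}$ to the action space $\mathcal{A}$. We let $\Pi$ be the set of all possible policies, where we use the notation $[N] = \{1, \dots, N\}$ for any integer $N$. For a given realization of the process $\{x(t)\}_{t=1}^{T}$, we define the stopping time $\tau_{\pi}$ as the first time at which the policy prescribes the action $\mathbf{stop}$:
(1) $\tau_{\pi} = \min \{ t \in [T] \mid \pi(x(t)) = \mathbf{stop} \},$
where we take the minimum to be $+\infty$ if the set is empty. Our goal is to find the policy that maximizes the expected discounted reward over the finite horizon, which can be represented as the following optimization problem:
(2) $\max_{\pi \in \Pi} \ \mathbb{E}\left[ \beta^{\tau_{\pi} - 1} \, g(\tau_{\pi}, x(\tau_{\pi})) \mid x(1) = \bar{x} \right].$
For any policy $\pi$, we let $J^{\pi}(\bar{x}) = \mathbb{E}\left[ \beta^{\tau_{\pi} - 1} \, g(\tau_{\pi}, x(\tau_{\pi})) \mid x(1) = \bar{x} \right]$. For simplicity, we assume that there exists a starting state $\bar{x}$ at which the system is started, such that $x(1) = \bar{x}$ always.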
3.2 Sample average approximation
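In order to solve problem (2), we need to have a full specification of the stochastic process $\{x(t)\}_{t=1}^{T}$. In practice, we may not have this specification or it may be too difficult to work with directly. Instead, we may have data, that is, we may have access to specific realizations or trajectories of the process $\{x(t)\}_{t=1}^{T}$. In this section, we describe a sample-average approximation (SAA) formulation of the optimal stopping problem (2) that will allow us to design a policy directly from these trajectories, as opposed to a probabilistic definition of the stochastic process.
We will assume that we have a set of $\Omega$ trajectories, indexed by $\omega \in [\Omega]$. We denote the state of the system in trajectory $\omega$ at time $t$ by $x(\omega, t)$. Thus, each trajectory $\omega$ corresponds to a sequence of system states $x(\omega, 1), \dots, x(\omega, T)$, with $x(\omega, 1) = \bar{x}$.
We can now define a sample-average approximation (SAA) version of the optimal stopping problem (2). Let $\tau_{\pi,\omega}$ denote the time at which the given policy $\pi$ recommends that the system be stopped in the trajectory $\omega$; mathematically, it is defined as
(3) $\tau_{\pi,\omega} = \min \{ t \in [T] \mid \pi(x(\omega, t)) = \mathbf{stop} \},$
where we again take the minimum to be $+\infty$ if the set is empty. Our SAA optimal stopping problem can now be written as
(4) $\max_{\pi \in \Pi} \ \frac{1}{\Omega} \sum_{\omega=1}^{\Omega} \beta^{\tau_{\pi,\omega} - 1} \, g(\tau_{\pi,\omega}, x(\omega, \tau_{\pi,\omega})).$
Note that, in order to handle the case when $\pi$ does not stop on a given trajectory, we define $\beta^{\tau_{\pi,\omega} - 1} \, g(\tau_{\pi,\omega}, x(\omega, \tau_{\pi,\omega}))$ to be 0 if $\tau_{\pi,\omega} = +\infty$. Lastly, we introduce the following shorthand notation for the sample average value of a policy $\pi$:
$\hat{J}^{\pi}(\bar{x}) = \frac{1}{\Omega} \sum_{\omega=1}^{\Omega} \beta^{\tau_{\pi,\omega} - 1} \, g(\tau_{\pi,\omega}, x(\omega, \tau_{\pi,\omega})).$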
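As an illustration of the sample-average objective described above, the following minimal Python sketch evaluates a given policy on a handful of trajectories. It assumes periods are numbered from 1, that the reward in the first period is undiscounted, and that a trajectory on which the policy never stops contributes zero. The function names (`stopping_time`, `saa_value`), the threshold policy, and the toy data are hypothetical and chosen only for illustration.

```python
import math

def stopping_time(policy, trajectory):
    """First period (1-indexed) at which the policy says "stop", or math.inf if it never does."""
    for t, state in enumerate(trajectory, start=1):
        if policy(state) == "stop":
            return t
    return math.inf

def saa_value(policy, trajectories, reward, beta):
    """Sample-average discounted reward of a policy over a list of trajectories."""
    total = 0.0
    for trajectory in trajectories:
        tau = stopping_time(policy, trajectory)
        if tau != math.inf:  # a trajectory that is never stopped contributes 0
            total += beta ** (tau - 1) * reward(tau, trajectory[tau - 1])
    return total / len(trajectories)

# Hypothetical data: two trajectories of a one-dimensional state over three periods.
trajectories = [
    [(1.0,), (1.5,), (0.5,)],
    [(1.0,), (0.8,), (0.9,)],
]

def threshold_policy(x):
    """Hypothetical policy: stop as soon as the state reaches 1.2."""
    return "stop" if x[0] >= 1.2 else "go"

def reward(t, x):
    """Hypothetical payoff: the state value itself."""
    return x[0]

value = saa_value(threshold_policy, trajectories, reward, beta=0.9)
```

In this toy instance the first trajectory is stopped in period 2 and the second is never stopped, so only the first contributes to the average.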
3.3 Tree policies
Problem (4) defines an approximation of the original problem (2) that uses data, specifically a finite sample of trajectories of the stochastic process $\{x(t)\}_{t=1}^{T}$. However, despite this simplification that brings the problem closer to being solvable in practice, problem (4) is still difficult because it is an optimization problem over the set of all possible stopping policies $\Pi$. Moreover, as discussed in Section 1, we wish to restrict ourselves to policies that are sufficiently simple and interpretable.
In this section, we will define the class of tree policies. A tree policy is specified by a binary tree that corresponds to a recursive partitioning of the state space $\mathcal{X}$. Each tree consists of two types of nodes: split nodes and leaf nodes. Each split node is associated with a query of the form $x_v \le \theta$; we call the state variable $x_v$ that participates in the query the split variable, the index $v$ the split variable index and the constant value $\theta$ in the inequality the split point. If the query is true, we proceed to the left child of the current split node; otherwise, if it is false, we proceed to the right child.
We let $\mathcal{N}$ denote the set of all nodes (splits and leaves) in the tree. We use $\mathbf{splits}$ to denote the set of split nodes and $\mathbf{leaves}$ to denote the set of leaf nodes. We define the functions $\mathbf{leftchild} : \mathbf{splits} \to \mathcal{N}$ and $\mathbf{rightchild} : \mathbf{splits} \to \mathcal{N}$ to indicate the left and right child nodes of each split node, i.e., for a given split node $s$, $\mathbf{leftchild}(s)$ is its left child and $\mathbf{rightchild}(s)$ is its right child. We use $\mathcal{T}$ to denote the topology of the tree, which we define as the tuple $\mathcal{T} = (\mathcal{N}, \mathbf{leaves}, \mathbf{splits}, \mathbf{leftchild}, \mathbf{rightchild})$.
Given the topology $\mathcal{T}$, we use $\mathbf{v} = \{v(s)\}_{s \in \mathbf{splits}}$ and $\boldsymbol{\theta} = \{\theta(s)\}_{s \in \mathbf{splits}}$ to denote the collection of all split variable indices and split points, respectively, where $v(s)$ is the split variable index and $\theta(s)$ is the split point of split $s$. We let $\mathbf{a} = \{a(\ell)\}_{\ell \in \mathbf{leaves}}$ denote the collection of leaf actions, where $a(\ell) \in \mathcal{A}$ is the action we take if the current state is mapped to leaf $\ell$. A complete tree is therefore specified by the tuple $(\mathcal{T}, \mathbf{v}, \boldsymbol{\theta}, \mathbf{a})$, which specifies the tree topology, the split variable indices, the split points and the leaf actions.
Given a complete tree $(\mathcal{T}, \mathbf{v}, \boldsymbol{\theta}, \mathbf{a})$, we let $\ell(x; \mathcal{T}, \mathbf{v}, \boldsymbol{\theta})$ denote the leaf in $\mathbf{leaves}$ that the system state $x$ is mapped to, and define the stopping policy $\pi(\cdot\,; \mathcal{T}, \mathbf{v}, \boldsymbol{\theta}, \mathbf{a})$ by taking the action of the leaf to which $x$ is mapped:
$\pi(x; \mathcal{T}, \mathbf{v}, \boldsymbol{\theta}, \mathbf{a}) = a(\ell(x; \mathcal{T}, \mathbf{v}, \boldsymbol{\theta})).$
We provide an example of a tree policy below.
Consider a system where , for which the policy is the tree given in Figure 0(a). In this example, suppose that we number the nodes with the numbers 1 through 7, from top to bottom, left to right. Then, , and . The and mappings are:
The topology with the split and leaf labels is visualized in Figure 0(b).
The split variable indices and split points for the split nodes are
and the leaf actions for the leaf nodes are
As an example, suppose that the current state of the system is . To map this observation to an action, we start at the root and check the first query, . Since , the query is true, and we proceed to the left child of the root node. This new node is again a split, so we check its query, . Since , this query is false, so we proceed to its right child node, which is a leaf. The action of this leaf is , and thus our policy stops the system.
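The traversal just described can be sketched in a few lines of Python. This is a minimal sketch rather than the paper's implementation: it assumes each split query has the form "state variable at a given index is at most the split point", and the tree shape, split variables, split points and state used here are hypothetical, since the values from the figure are not reproduced in this text.

```python
def make_split(var, point, left, right):
    """A split node: go left if x[var] <= point, otherwise go right."""
    return {"var": var, "point": point, "left": left, "right": right}

def tree_policy(tree, x):
    """Run state x down the tree and return the action ("stop" or "go") of the leaf reached."""
    node = tree
    while isinstance(node, dict):  # split nodes are dicts, leaves are action strings
        node = node["left"] if x[node["var"]] <= node["point"] else node["right"]
    return node

# Hypothetical tree with three split nodes and four leaves (seven nodes in total).
tree = make_split(
    0, 1.0,
    make_split(1, 0.5, "go", "stop"),   # left child of the root
    make_split(2, 2.0, "go", "stop"),   # right child of the root
)

# Root query is true (0.8 <= 1.0), second query is false (0.7 > 0.5), so we reach a "stop" leaf.
action = tree_policy(tree, (0.8, 0.7, 5.0))
```

Representing leaves as plain action strings keeps the sketch short; a fuller implementation would likely store node identifiers so that the topology, split variable indices, split points and leaf actions can be manipulated separately.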
Letting $\Pi_{\text{tree}}$ denote the set of all tree policies specified as above, we wish to find the tree policy that optimizes the sample-average reward:
(5) $\max_{\pi \in \Pi_{\text{tree}}} \ \hat{J}^{\pi}(\bar{x}).$
We refer to this problem as the tree policy SAA problem. In addition, we also define the counterpart of problem (2) restricted to policies in $\Pi_{\text{tree}}$, which we refer to simply as the tree policy problem:
(6) $\max_{\pi \in \Pi_{\text{tree}}} \ J^{\pi}(\bar{x}),$
where $J^{\pi}(\bar{x})$ is defined as in Section 3.1.
We comment on two aspects of this modeling approach. First, tree policies, as defined above, are stationary: the policy’s behavior does not change with the period $t$. This turns out to not be a limitation, because it is always possible to augment the state space with an additional state variable to represent the current period $t$. By then allowing the tree to split on $t$, it is possible to obtain a time-dependent policy. We will follow this approach in our numerical experiments with option pricing in Section 5 and the stylized one-dimensional problem in Section 6.
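The state augmentation mentioned above amounts to appending the period index to each state before the policy sees it. The short Python sketch below illustrates this under the assumption that states are tuples and periods are numbered from 1; the helper name `augment_with_time` and the example policy thresholds are hypothetical.

```python
def augment_with_time(trajectory):
    """Append the (1-indexed) period t as an extra state variable to every state."""
    return [state + (t,) for t, state in enumerate(trajectory, start=1)]

def time_dependent_policy(z):
    """Hypothetical policy on the augmented state z = (price, t).

    It first splits on the period, then on the price, so the stopping
    threshold differs between early and late periods.
    """
    price, t = z
    if t <= 2:
        return "stop" if price >= 1.5 else "go"
    return "stop" if price >= 1.0 else "go"

augmented = augment_with_time([(1.2,), (1.3,), (1.1,)])
actions = [time_dependent_policy(z) for z in augmented]
```

Here the same price level of 1.1 to 1.3 leads to "go" in the first two periods but "stop" in the third, which a stationary policy on the price alone could not express.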
Second, our approach requires access to trajectories that are complete, that is, the trajectories are not terminated early/censored by the application of some prior policy, and the system state is known at every $t \in [T]$. In the case of censored trajectories, our methodology can potentially be applied by first fitting a stochastic model to the available trajectories and then simulating the model to fill in the missing data. The extension of our methodology to censored trajectories is beyond the scope of this paper, and left to future research.
3.4 Convergence of tree policy SAA problem
A natural expectation for the SAA problem (5) is that its objective approaches the optimal value of (6), as the number of trajectories $\Omega$ available as samples increases. In this section, we show that this is indeed a property of our SAA problem. More precisely, we show that in the limit of $\Omega \to \infty$ and if we restrict the policy class to only trees of arbitrary bounded depth, the optimal value of the SAA problem converges to the value of the optimal tree-based policy for the true problem almost surely.
We note that establishing such a convergence result in the case of tree-based policies is challenging due to two difficulties. First, a tree policy is defined by a tuple $(\mathcal{T}, \mathbf{v}, \boldsymbol{\theta}, \mathbf{a})$; the space of such tuples could in general be uncountable, and thus we cannot directly invoke the strong law of large numbers to guarantee that almost sure convergence holds simultaneously over all $\pi \in \Pi_{\text{tree}}$. Secondly, invoking the strong law of large numbers over all members of a finite cover of this space is also nontrivial. This is due to the fact that $\hat{J}^{\pi}(\bar{x})$ is not necessarily continuous in $\pi$, even as we change the split points $\boldsymbol{\theta}$ and keep all other tree parameters constant, and as such we cannot rely on Lipschitz continuity arguments.
In order to handle these challenges, we restrict our convergence analysis to the class of tree policies corresponding to trees of depth at most $d$ for some finite parameter $d$. We denote this class of tree policies by $\Pi_{\text{tree}}(d)$. This restriction limits the set of $(\mathcal{T}, \mathbf{v}, \mathbf{a})$ parameters specifying a tree-based policy to be finite, although the $\boldsymbol{\theta}$ parameters still lie in a potentially uncountable set. We remark that, because our focus is on interpretable policies, limiting the potential depth of allowable trees that our SAA approach can optimize over is reasonable.
We also make some relatively mild assumptions regarding the underlying structure of the optimal stopping problem, which facilitate the subsequent analysis. The first two are relatively standard, enforcing the compactness of the state space $\mathcal{X}$, as well as placing a universal upper bound on the magnitude of the reward function $g$:
Assumption 1. The state space $\mathcal{X}$ is compact.
Assumption 2. There exists a constant $G$ such that for any $t \in [T]$, $x \in \mathcal{X}$, $|g(t, x)| \le G$.
The third assumption essentially imposes that the distributions of the marginals of the state variable are sufficiently smooth. In particular, we assume that the probability that the $i$th component of the state variable, $x_i(t)$, lies in some interval $[a, b]$ can be bounded by a function $f(|b - a|)$ which only depends on $|b - a|$, but not on $a$ and $b$ specifically, and which can be controlled to go to zero as the width of the interval vanishes.
Assumption 3. There exists a function $f$ such that
1. $\Pr\left[ x_i(t) \in [a, b] \right] \le f(|b - a|)$, for every $i \in [n]$, $t \in [T]$, and any $a, b \in \mathbb{R}$ with $a \le b$.
2. $f(0) = 0$.
3. $f$ is strictly increasing and continuous.
We add that Assumption 3 is satisfied by distributions with probability density functions that can be uniformly upper bounded across their entire domain.
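Having set up these assumptions, the following theorem proves almost sure convergence of $\hat{J}^{\pi}(\bar{x})$ to $J^{\pi}(\bar{x})$ over all policies in $\Pi_{\text{tree}}(d)$.
Theorem 1. For any finite tree depth $d$, initial state $\bar{x}$ and arbitrary $\epsilon > 0$, almost surely, there exists a finite sample size $\Omega_0$ such that for all $\Omega \ge \Omega_0$ and all tree policies $\pi \in \Pi_{\text{tree}}(d)$,
$\left| \hat{J}^{\pi}(\bar{x}) - J^{\pi}(\bar{x}) \right| \le \epsilon.$
Theorem 1 is the main technical result of this section and enables us to prove our desired convergence of the SAA optimal objective. This is stated in the following corollary:
Corollary 1. For any finite tree depth $d$, initial state $\bar{x}$ and arbitrary $\epsilon > 0$, almost surely, there exists a finite sample size $\Omega_0$ such that for all $\Omega \ge \Omega_0$,
$\left| \sup_{\pi \in \Pi_{\text{tree}}(d)} \hat{J}^{\pi}(\bar{x}) - \sup_{\pi \in \Pi_{\text{tree}}(d)} J^{\pi}(\bar{x}) \right| \le \epsilon.$
The corollary above shows that the decision maker indeed obtains, by solving the sample average problem (5), a policy whose value approximates with arbitrarily high precision the value of the best tree-based policy for the true problem, as long as the decision maker has access to a sufficiently large sample of state trajectories.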
3.5 Complexity of tree policy SAA problem
Having established that the tree policy SAA problem converges to the exact tree policy optimization problem, we now turn our attention to the computational complexity of solving the tree policy SAA problem. In this section, we will consider three simplified versions of problem (5) and show that each of these is theoretically intractable.
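To motivate the first of our intractability results, observe that problem (5) allows for the tree policy to be optimized along all four dimensions: the topology $\mathcal{T}$, the split variable indices $\mathbf{v}$, the split points $\boldsymbol{\theta}$ and the leaf actions $\mathbf{a}$. Rather than optimizing over all four variables, let us consider instead a simplified version of problem (5), where the topology $\mathcal{T}$, split variable indices $\mathbf{v}$ and split points $\boldsymbol{\theta}$ are fixed, and the leaf actions $\mathbf{a}$ are the only decision variables. We use $\Pi(\mathcal{T}, \mathbf{v}, \boldsymbol{\theta})$ to denote the set of all policies with the given $\mathcal{T}$, $\mathbf{v}$ and $\boldsymbol{\theta}$, i.e.,
(7) $\Pi(\mathcal{T}, \mathbf{v}, \boldsymbol{\theta}) = \left\{ \pi(\cdot\,; \mathcal{T}, \mathbf{v}, \boldsymbol{\theta}, \mathbf{a}) \mid \mathbf{a} \in \{\mathbf{stop}, \mathbf{go}\}^{\mathbf{leaves}} \right\}.$
The leaf action SAA problem is to find the policy in $\Pi(\mathcal{T}, \mathbf{v}, \boldsymbol{\theta})$ that optimizes the sample average reward:
(8) $\max_{\pi \in \Pi(\mathcal{T}, \mathbf{v}, \boldsymbol{\theta})} \ \hat{J}^{\pi}(\bar{x}).$
While problem (5) is difficult to analyze in generality, the simpler problem (8) is more amenable to analysis. It turns out, perhaps surprisingly, that this simplified problem is already hard to solve:
Proposition 1. The leaf action SAA problem (8) is NP-Hard.
The proof of this result (see Section 8.2.1) follows by a reduction from the minimum vertex cover problem.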
This result is significant for two reasons. First, it establishes that even in this extremely simplified case, where we have already selected a tree topology, split variable indices and split points, the resulting problem is theoretically intractable.
Second, this result points to an important distinction between optimal stopping tree policies and classification trees that are used in machine learning. For binary classification, a classification tree consists of a tree topology, split variable indices, split points and leaf labels. Given $\mathcal{T}$, $\mathbf{v}$ and $\boldsymbol{\theta}$, determining the leaf labels that minimize the 0-1 classification error on a training sample is trivial: for each leaf, we simply predict the class that is most frequent among those observations that are mapped to that leaf. More importantly, each leaf’s label can be computed independently. The same holds true for the regression setting, where $a(\ell)$ is the continuous prediction of leaf $\ell$: to find the values of $\mathbf{a}$ that minimize the squared prediction error, we can set each $a(\ell)$ as the average of the dependent variable for all training observations that are mapped to that leaf.
For optimal stopping, the situation is strikingly different. Determining the leaf actions is much more difficult because a single trajectory has a time dimension: as time progresses, a tree policy may map the current state of that trajectory to many different leaves in the tree. As a result, the decision of whether to stop or not in one leaf (i.e., to set $a(\ell) = \mathbf{stop}$ for a leaf $\ell$) cannot be made independently of the decisions to stop or not in the other leaves: the leaf actions are coupled together. For example, if we set $a(\ell) = \mathbf{stop}$ for a given leaf $\ell$ and $a(\ell') = \mathbf{stop}$ for a different leaf $\ell'$, then a trajectory that reaches $\ell$ before it reaches $\ell'$ could never stop at $\ell'$: thus, whether we choose to stop at $\ell'$ depends on whether we choose to stop at $\ell$.
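The coupling between leaf actions can be made concrete with a tiny brute-force search. The Python sketch below is only an illustration under simplifying assumptions: the tree structure is taken as fixed, so each trajectory is represented directly as the sequence of (leaf, payoff) pairs it visits, and all leaf-action combinations are enumerated exhaustively. The leaf names, payoffs and function names are hypothetical. Exhaustive enumeration grows exponentially in the number of leaves, so it is viable only for very small trees.

```python
from itertools import product

def sample_reward(actions, trajectories, beta=1.0):
    """Sample-average discounted reward when each trajectory stops at the first "stop" leaf it visits."""
    total = 0.0
    for trajectory in trajectories:
        for t, (leaf, payoff) in enumerate(trajectory, start=1):
            if actions[leaf] == "stop":
                total += beta ** (t - 1) * payoff
                break  # later leaves on this trajectory are never reached
    return total / len(trajectories)

def best_leaf_actions(leaves, trajectories, beta=1.0):
    """Enumerate every assignment of stop/go to the leaves and keep the best one."""
    best_actions, best_value = None, float("-inf")
    for combo in product(["go", "stop"], repeat=len(leaves)):
        actions = dict(zip(leaves, combo))
        value = sample_reward(actions, trajectories, beta)
        if value > best_value:
            best_actions, best_value = actions, value
    return best_actions, best_value

# Hypothetical trajectories, each given as the (leaf, payoff) pairs visited over time.
trajectories = [
    [("A", 1.0), ("B", 10.0)],  # passes through leaf A before reaching leaf B
    [("A", 3.0), ("A", 0.0)],
]

best_actions, best_value = best_leaf_actions(["A", "B"], trajectories)
stop_at_b_only = sample_reward({"A": "go", "B": "stop"}, trajectories)
stop_at_both = sample_reward({"A": "stop", "B": "stop"}, trajectories)
```

In this toy instance, stopping at leaf B is worth 5.0 on average when leaf A continues, but adds nothing once leaf A is set to stop, because the first trajectory is then stopped before it ever reaches B.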
Proposition 8 establishes that when the topology, split variable indices and split points are fixed, optimizing over the leaf actions is an intractable problem. It turns out that optimizing individually over the split variable indices and the split points, with the rest of the tree policy parameters fixed, is also intractable. Let denote the set of all tree policies with the given , and (i.e., the split variable indices may be optimized over), and let denote the set of all tree policies with the given , and (i.e., the split points may be optimized over):
(9)  
(10) 
Let us define the split variable index SAA problem as the problem of finding a policy in the first of these sets, in which only the split variable indices may vary, that maximizes the sample-based reward:
(11) 
Similarly, let us define the split point SAA problem as the analogous problem of optimizing over policies in the second set, in which only the split points may vary:
(12) 
We then have the following two intractability results: the split variable index SAA problem (11) is NP-Hard, and the split point SAA problem (12) is NP-Hard. Like Proposition 8, the proofs of these two results, found in Sections 8.2.2 and 8.2.3 respectively, also follow by a reduction from the minimum vertex cover problem.
4 Construction algorithm
As we saw in Section 3.3, three major simplifications of our SAA problem are theoretically intractable. Expanding the scope of the problem to allow for joint optimization over the split points, split variable indices and leaf actions, as well as the topology, renders the problem even more difficult. Given the difficulty of this problem, we now present a practical heuristic algorithm for approximately solving the SAA problem. Our algorithm is a greedy procedure that grows/induces the tree from the top down. In Section 4.1, we provide a description of the overall algorithm. In Section 4.2, we provide a procedure for performing the key step of our construction algorithm, which is finding the optimal split point. Finally, in Section 4.3, we compare our construction algorithm and extant classification tree approaches such as CART (Breiman et al. 1984).
4.1 Algorithm description
Our algorithm to heuristically solve (5) is a construction procedure in which we greedily grow a tree until adding another split no longer yields an improvement. At a high level, our algorithm works as follows. We start from a degenerate tree, consisting of a single leaf with the action go as the root node. At this leaf, we consider placing a split. Such a split will result in a left child and a right child, with both child nodes being leaves. For each potential split variable, we find the best possible split point assuming that the left child leaf will be stop (and the right child will be go), and assuming that the right child leaf will be stop (and the left child will be go). We find the best possible combination of the split variable, split point and direction (left child is stop and right child is go, or right child is stop and left child is go), and add the split to the tree. We continue the procedure if the split resulted in a sufficient improvement on the current objective; otherwise, we terminate the procedure.
In the next iteration, we repeat the process, except that now we also optimize over the leaf that we select to split: we compute the best possible split at each leaf in the current tree, and take the best split at the best leaf. We continue in this way until there is no longer sufficient improvement in an iteration. At each iteration, we expand the tree at a single leaf and add two new leaves to the tree. Figure 2 provides a visualization of several iterations of the algorithm.
The notion of sufficient improvement that we use in our algorithm is that of relative improvement. Specifically, we compare the objective value attained with the best split against the current objective, and we continue running the algorithm if the relative improvement is at least a user-specified tolerance; otherwise, if the relative improvement is lower than this tolerance, the algorithm is terminated. Lower values of the tolerance correspond to trees that are deeper, with a larger number of nodes.
Our construction algorithm is defined formally as Algorithm 1. The tree policy is initialized to the degenerate policy described above, which prescribes the action go at all states. Each iteration of the loop first computes the objective attained by choosing the best split point at each leaf with each variable, assuming that either the left child leaf will be a stop action or the right child leaf will be a stop action. In the former case, we call the subtree rooted at the split node a left-stop subtree, and in the latter case, we call it a right-stop subtree; Figure 3 shows both of these subtrees. Then, if the best such split achieves an objective greater than the current objective value, we grow the tree in accordance with that split: we add two child nodes to the leaf using the GrowTree function, which is defined formally as Algorithm 2; we set the actions of those leaves according to the chosen direction; we set the split variable index of the new split to the chosen variable; and finally, we set the split point to the optimized value. The existsImprovement flag is used to terminate the algorithm when it is no longer possible to improve on the objective of the current tree by at least the required relative tolerance.
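The overall control flow can be sketched in Python as follows. This is only an illustrative skeleton, not Algorithm 1 itself: `propose_best_split` is a hypothetical callable standing in for the search over leaves, variables and directions, and the stopping test (strict improvement by at least a factor of one plus the tolerance `gamma`) is an assumed reading of the relative improvement criterion.

```python
def grow_greedily(initial_tree, initial_objective, propose_best_split, gamma):
    """Greedy top-down loop with a relative-improvement stopping rule.

    propose_best_split(tree) is a hypothetical callable that returns
    (new_tree, new_objective) for the best available split, or None if
    no split is available.
    """
    tree, objective = initial_tree, initial_objective
    while True:
        proposal = propose_best_split(tree)
        if proposal is None:
            break
        new_tree, new_objective = proposal
        # Assumed criterion: strict improvement by at least a factor (1 + gamma).
        improved = (new_objective > objective
                    and new_objective >= (1.0 + gamma) * objective)
        if not improved:
            break
        tree, objective = new_tree, new_objective
    return tree, objective


# Toy stand-in: a "tree" is just its number of splits, and the objective
# after each additional split follows a fixed, made-up sequence.
toy_objectives = [0.0, 10.0, 10.5, 10.51]


def toy_propose(num_splits):
    if num_splits + 1 >= len(toy_objectives):
        return None
    return num_splits + 1, toy_objectives[num_splits + 1]


final_tree, final_objective = grow_greedily(
    0, toy_objectives[0], toy_propose, gamma=0.005)
print(final_tree, final_objective)
```

In the toy run, the third split would raise the objective only from 10.5 to 10.51, which is below the 0.5% tolerance, so the loop keeps the two-split tree.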
We comment on three important aspects of Algorithm 1. The first aspect is the determination of the optimal split point for a given leaf, a given split variable index and the direction of the stop action (left/right). At present, this optimization is encapsulated in the function OptimizeSplitPoint to aid in the exposition of the overall algorithm; we defer the description of this procedure to the next section (Section 4.2).
The second aspect is the assumption of how the child actions are set when we split a leaf. At present, we consider two possibilities: the left child is stop and the right child is go (a left-stop subtree), or the left child is go and the right child is stop (a right-stop subtree). These are not the only two possibilities, as we could also set both child nodes to go or both to stop. We do not explicitly consider these, for two reasons. First, one of these will result in exactly the same behavior as the current tree (for example, if the leaf’s action is go in the current tree and we set both children to go in the new tree, the new tree policy will prescribe the same actions as the current tree), and thus cannot result in an improvement. Second, OptimizeSplitPoint can potentially return a value of −∞ or +∞, effectively resulting in a degenerate split where we always go to the left or to the right. It is straightforward to see that such a split is equivalent to setting both child actions to go or to stop.
The third aspect is complexity control. The algorithm in its current form stops building the tree when it is no longer possible to improve the objective value by at least the relative improvement tolerance. This tolerance plays an important role in controlling the complexity of the tree: if it is set to a very low value, the algorithm may produce deep trees with a large number of nodes, which is undesirable because (1) the resulting tree may lack interpretability and (2) the resulting tree may not generalize well to new data. In contrast, higher values of the tolerance will cause the algorithm to terminate earlier with smaller trees whose training set performance will be closer to their out-of-sample performance; however, these trees may not be sufficiently complex to lead to good performance. The appropriate value of the tolerance can be calibrated through standard techniques like k-fold cross-validation.
4.2 Finding the optimal split point
The key part of Algorithm 1 is the OptimizeSplitPoint function, which aims to answer the following question: what split point should we choose for the split at a given leaf on a given variable so as to maximize the sample-based reward? At first glance, this question appears challenging because the split point could be any real number, and it is not clear how we can easily optimize over it. Fortunately, it turns out that the sample average reward of the new tree is a piecewise-constant function of the split point, which we can optimize over easily. This piecewise-constant function is obtained by taking a weighted combination of a collection of trajectory-specific piecewise-constant functions. These trajectory-specific piecewise-constant functions can be computed by carefully analyzing how the split variable changes along each trajectory.
Assume that the split variable index, the leaf and the subtree direction are fixed. Our first step is to determine the behavior of the trajectories with respect to this leaf. For each trajectory, we first find the time at which the trajectory is stopped at a leaf different from the given leaf. We call this the no-stop time, and define it as
(13) 
where we use the notation $[n]$ to denote the set $\{1, \dots, n\}$ for an integer $n$, and we define the minimum to be +∞ if the set is empty. We can effectively think of this as the time at which the trajectory will stop if the new split that we place at the leaf does not result in the trajectory being stopped or, equivalently, if we were to set the action of the leaf to go. We define the no-stop value as the reward we garner if we allow the trajectory to stop at the no-stop time:
(14) 
Note that the value is zero if the trajectory is never stopped outside of the given leaf. Lastly, we also determine the set of periods when the policy maps the system state to this leaf. We call these the in-leaf periods, and define the set as
(15) 
The second step is to determine the permissible stop periods. Depending on whether we are optimizing for the left-stop subtree or for the right-stop subtree, these are the periods at which the subtree could stop. The right-stop permissible stop periods are defined as
(16) 
where the maximum is defined as −∞ if the corresponding set is empty. The left-stop permissible stop periods are similarly defined as
(17) 
where the minimum is defined as +∞ if the corresponding set is empty.
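As a rough illustration of this step, the Python sketch below computes permissible stop periods as the periods, among the in-leaf periods before the no-stop time, at which the split variable sets a new strict running maximum (right-stop) or running minimum (left-stop). This reading of the definitions, the 0-based period indexing and the function name `permissible_stop_periods` are assumptions made for the sketch.

```python
import math


def permissible_stop_periods(x, in_leaf, no_stop_time, direction):
    """Periods at which a new split on this leaf could stop the trajectory.

    x[t]         : value of the split variable at period t (0-based here)
    in_leaf      : periods at which the trajectory is mapped to the leaf
    no_stop_time : first period at which another leaf stops the trajectory
                   (math.inf if that never happens)
    direction    : "right" (stop when x > split point) or
                   "left"  (stop when x <= split point)
    """
    periods = []
    record = -math.inf if direction == "right" else math.inf
    for t in sorted(in_leaf):
        if t >= no_stop_time:
            break  # the trajectory has already been stopped elsewhere
        if direction == "right":
            is_new_record = x[t] > record
        else:
            is_new_record = x[t] < record
        if is_new_record:
            periods.append(t)
            record = x[t]
    return periods


# Hypothetical values of the split variable along one trajectory.
x = [1.0, 3.0, 2.0, 5.0]
right_all = permissible_stop_periods(x, range(4), math.inf, "right")
left_all = permissible_stop_periods(x, range(4), math.inf, "left")
right_late = permissible_stop_periods(x, [1, 2, 3], 3, "right")
left_late = permissible_stop_periods(x, [1, 2, 3], 3, "left")
print(right_all, left_all, right_late, left_late)
```

Period 2 is not permissible for the right-stop subtree in the first call: any split point low enough to stop there would already have stopped the trajectory at period 1.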
The third step is to construct the trajectory-specific piecewise-constant functions. Each such function tells us how the reward of a trajectory varies as a function of the split point. Let us order the permissible stop periods in increasing order. We use those periods to define the corresponding breakpoints of our piecewise-constant function: the i-th breakpoint is the value of the split variable at the i-th permissible stop period.
The function value of the i-th piece is the (discounted) reward garnered if we stopped at the i-th permissible stop period.
For a right-stop subtree, the corresponding piecewise-constant function is defined as follows:
(18) 
For a left-stop subtree, the piecewise-constant function is defined similarly, as
(19) 
The function value is exactly the reward that will be garnered from a given trajectory if we set the split point to a given value. By averaging these functions over all trajectories in the training set, we obtain a function that returns the average reward, over the whole training set of trajectories, that ensues from setting the split point to that value:
(20) 
We wish to find the split point that maximizes this average reward function.
Note that because each trajectory-specific function is a piecewise-constant function of the split point, the overall average will also be a piecewise-constant function of the split point, and as a result there is no unique maximizer. This is analogous to the situation faced in classification, where there is usually an interval of split points that lie between the values of the given coordinate of two points. Consider the interval on which the average reward is maximized. For convenience, we work with the interior of this interval, which will always be either a bounded open interval or an open interval that is unbounded below or above. We can then set the split point as follows:
(21) 
If the maximizing interval is bounded, we set the split point to the midpoint of the interval. If it is unbounded, then there is no well-defined midpoint, and we set the split point to be either −∞ or +∞.
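The averaging and maximization steps can be sketched in Python as follows, for the right-stop case. This is a minimal sketch rather than Algorithm 3: each trajectory is assumed to be summarized by its increasing breakpoints, the reward attached to each breakpoint and its no-stop value; the convention that a trajectory stops at the first permissible period whose breakpoint strictly exceeds the split point is an assumption; and ties between equally good pieces are broken in favor of the lowest one instead of merging adjacent maximizing pieces into a single interval. The function names are hypothetical.

```python
import math


def evaluate_right_stop(theta, breakpoints, values, no_stop_value):
    """Reward of one trajectory for split point theta (right-stop subtree).

    breakpoints are increasing; the trajectory stops at the first
    permissible period whose breakpoint strictly exceeds theta.
    """
    for b, f in zip(breakpoints, values):
        if theta < b:
            return f
    return no_stop_value


def optimize_split_point(trajectories):
    """Maximize the average of the per-trajectory piecewise-constant functions.

    trajectories: non-empty list of (breakpoints, values, no_stop_value).
    Returns (split_point, average_reward).
    """
    cuts = sorted({b for bps, _, _ in trajectories for b in bps})
    if not cuts:
        # No trajectory can be stopped by this split; every theta is equivalent.
        avg = sum(nsv for _, _, nsv in trajectories) / len(trajectories)
        return -math.inf, avg
    # One candidate interval per constant piece:
    # (-inf, c1), [c1, c2), ..., [cK, +inf).
    intervals = [(-math.inf, cuts[0])]
    intervals += list(zip(cuts[:-1], cuts[1:]))
    intervals.append((cuts[-1], math.inf))
    best_theta, best_avg = None, -math.inf
    for lo, hi in intervals:
        if math.isinf(lo):
            theta = -math.inf
        elif math.isinf(hi):
            theta = math.inf
        else:
            theta = (lo + hi) / 2.0  # midpoint of a bounded piece
        avg = sum(evaluate_right_stop(theta, b, f, nsv)
                  for b, f, nsv in trajectories) / len(trajectories)
        if avg > best_avg:
            best_theta, best_avg = theta, avg
    return best_theta, best_avg


# Two hypothetical trajectories: (breakpoints, values, no-stop value).
toy = [([1.0, 3.0], [1.0, 5.0], 0.0),
       ([2.0, 4.0], [2.0, 6.0], 0.0)]
split_point, avg_reward = optimize_split_point(toy)
print(split_point, avg_reward)
```

With this toy data, the average is highest on the piece between the breakpoints 2 and 3, where the first trajectory collects 5 and the second collects 6, so the sketch returns the midpoint of that piece.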
We summarize the procedure formally as Algorithm 3, and illustrate it with an example.
In this example, let us suppose that we are at the first iteration of the construction algorithm, where the tree policy is simply a single leaf with the action go. We want to find the optimal split point with respect to variable 1 for a split on this leaf, assuming that the subtree we place will be a right-stop subtree.
To illustrate, we fix a single trajectory . Figure 4 shows the relevant data for this trajectory, which are the values and the rewards . We assume that for all trajectories.
Since we are at the first iteration, the no-stop time for the root node is +∞, because there are no other leaves in the tree; thus, every period is a valid in-leaf period for the root node. Observe also that the no-stop value is zero, because there is no other leaf in which the trajectory is stopped.
Having determined the in-leaf periods, we determine the set of permissible stop periods. To do this, since we are placing a right-stop subtree, we follow the computation in equation (16). The left-hand side of Figure 5 shows the same data as in Figure 4, but with the permissible stop periods indicated with red squares.
To intuitively understand the computation in equation (16), one can imagine a horizontal line, superimposed on the top plot of Figure 4, that starts at and that slowly moves up towards . The vertical height of this horizontal line is the split point . As we move this line, we track the first period in time at which the trajectory exceeds this horizontal line: this is where the trajectory would stop, if we fixed the split point to the height of that line. Notice that when we start, the first period is . As soon as we exceed , we will no longer stop at , but at . As soon as our line goes above , the new period at which we stop will be . Once our line goes above , we will stop at , as this is the earliest that the trajectory passes above our line. We keep going in this way, until the line exceeds the value and we have determined all of the permissible stop periods.
To further elaborate on this process, notice that are permissible stop periods, but period is not. In order to stop at , we would have to set the threshold to a value lower than , and would have to be the first period at which the trajectory exceeds , i.e., . This is impossible with the trajectory of Figure 4. For this reason, the values of and at are not relevant in determining the reward garnered from the trajectory as is varied.
The permissible stop periods allow us to define the breakpoints and the values of our piecewise-constant function. The corresponding piecewise-constant function is shown on the right-hand side of Figure 5. We then repeat this construction for all of the trajectories in our training set, and average the resulting functions to obtain the overall function. Figure 6 visualizes this averaging process, assuming that there are three trajectories in total, including the one trajectory displayed in Figures 4 and 5.
After the trajectory-specific functions are averaged, we determine the interval on which the average is maximized. Since this is a bounded interval, we take the midpoint of this interval, which is 2.75, to be our final split point.
4.3 Comparison to top-down tree induction for classification
We note that our algorithm is inspired by classical top-down classification tree induction algorithms such as CART (Breiman et al. 1984), ID3 (Quinlan 1986) and C4.5 (Quinlan 1993). However, there are a number of important differences, in both the problem and the algorithm. We have already discussed in Section 3.3 that the problem of finding a stopping policy in the form of a tree is structurally different from that of finding a classifier in the form of a tree: in particular, deciding leaf labels in a tree classifier is easy to do, whereas deciding leaf actions in a tree policy for a stopping problem is an NP-Hard problem.
With regard to the algorithms themselves, there are several important differences. First, in our algorithm, we directly optimize the in-sample reward as we construct the tree. This is in contrast to how trees are typically built for classification, where it is not the classification error that is directly minimized, but rather an impurity metric such as the Gini impurity (as in CART) or the information gain/entropy (as in ID3 and C4.5). Second, in the classification tree setting, each leaf can be treated independently; once we determine that we should no longer split a leaf (e.g., because there is no more improvement in the impurity, or the number of points in the leaf is too low), we never need to consider that leaf again. In our algorithm, we must consider every leaf in each iteration, even if splitting that leaf did not result in an improvement in the previous iteration; this is because the actions we take in the leaves interact with each other. Lastly, as mentioned earlier, determining the optimal split point is much more involved than in the classification tree setting: in classification, the main step is to sort the observations in a leaf by the split variable. In our optimal stopping setting, determining the split point requires a calculation that takes into account when each trajectory is in a given leaf and how the cumulative maximum/minimum of the values of a given split variable changes over time.
5 Application to option pricing
In this section, we report on the performance of our method in an application drawn from option pricing. We define the option pricing problem in Section 5.1. We compare our policies against two benchmarks in terms of out-of-sample performance and computation time in Sections 5.2 and 5.3, respectively. In Section 5.4, we present our tree policies for this application and discuss their structure. Finally, in Section 5.5, we evaluate our policies on instances calibrated with real S&P 500 stock price data. Additional numerical results are provided in Section 9.
5.1 Problem definition
High-dimensional option pricing is one of the classical applications of optimal stopping, and there is a wide body of literature devoted to developing good heuristic policies (see Glasserman 2013 for a comprehensive survey). In this section, we illustrate our tree-based method on a standard family of option pricing problems from Desai et al. (2012b). We consider two benchmarks. Our first benchmark is the least-squares Monte Carlo method from Longstaff and Schwartz (2001), which is a commonly used method for option pricing. Our second benchmark is the pathwise optimization (PO) method from Desai et al. (2012b), which was shown in that paper to yield stronger exercise policies than the method of Longstaff and Schwartz. Both of these methods involve building regression models that estimate the continuation value at each time using a collection of basis functions of the underlying state, and using them within a greedy policy.
In each problem, the option is a Bermudan max-call option written on a collection of underlying assets. The stopping problem is defined over a period of 3 calendar years with equally spaced exercise opportunities. The price paths of the assets are generated as a geometric Brownian motion with drift equal to the annualized risk-free rate and a fixed annualized volatility, starting at a common initial price. The pairwise correlation between different assets is set to a common value. Each option has a fixed strike price. Each option also has a knock-out barrier, meaning that if any of the underlying stock prices exceeds the barrier at some time, the option is “knocked out” and the option value becomes zero from that time onward. Time is discounted continuously at the risk-free rate, which implies a fixed discrete discount factor per period.
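Our state variable is defined as $x(t) = (t, p_1(t), \dots, p_n(t), y(t), g(t))$, where $t$ is the index of the period; $p_j(t)$ is the price of asset $j$ at exercise time $t$; $y(t)$ is a binary variable that is 0 or 1 to indicate whether the option has not been knocked out by time $t$, defined as
(22) $y(t) = \mathbb{I}\{ \max_{1 \le j \le n,\ 1 \le t' \le t} p_j(t') \le B \}$
and $g(t)$ is the payoff at time $t$, defined as
(23) $g(t) = \max\{ 0, \max_{1 \le j \le n} p_j(t) - K \} \cdot y(t)$
Here $n$ denotes the number of assets, $K$ the strike price and $B$ the knock-out barrier.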
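To make the problem setup concrete, the following Python sketch simulates price paths and computes the knock-out indicator and payoff along a path. It is an illustration under stated assumptions, not the experimental code: the numerical parameter values in the usage lines are hypothetical placeholders, assets are assumed uncorrelated, the barrier is assumed to be checked only at exercise dates, and the option is assumed to be knocked out when a price strictly exceeds the barrier.

```python
import math
import random


def simulate_prices(n_assets, n_periods, p0, r, sigma, dt, rng):
    """Independent geometric Brownian motions sampled at the exercise dates.

    Returns prices[t][j] for t = 0..n_periods-1, where period t is at
    calendar time (t + 1) * dt. Zero correlation between assets is assumed.
    """
    prices, current = [], [p0] * n_assets
    drift = (r - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    for _ in range(n_periods):
        current = [p * math.exp(drift + vol * rng.gauss(0.0, 1.0))
                   for p in current]
        prices.append(current)
    return prices


def knockout_and_payoff(prices, strike, barrier):
    """Knock-out indicator (1 = still alive) and max-call payoff per period."""
    alive, indicators, payoffs = 1, [], []
    for period_prices in prices:
        # Assumed rule: knocked out once any price strictly exceeds the barrier.
        if max(period_prices) > barrier:
            alive = 0
        indicators.append(alive)
        payoffs.append(alive * max(0.0, max(period_prices) - strike))
    return indicators, payoffs


# Hand-made two-asset path to check the bookkeeping.
hand_path = [[100.0, 90.0], [120.0, 110.0], [180.0, 100.0], [130.0, 110.0]]
hand_ind, hand_pay = knockout_and_payoff(hand_path, strike=100.0, barrier=170.0)
print(hand_ind, hand_pay)

# Simulated path with placeholder parameters (all values are hypothetical).
rng = random.Random(0)
sim_path = simulate_prices(n_assets=8, n_periods=54, p0=100.0,
                           r=0.05, sigma=0.2, dt=3.0 / 54, rng=rng)
sim_ind, sim_pay = knockout_and_payoff(sim_path, strike=100.0, barrier=170.0)
```

In the hand-made path, the payoff of 20 at the second period is available, but the price of 180 at the third period knocks the option out, so the payoff is zero from then on even though a price is still above the strike.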
In our implementation of the tree optimization algorithm, we vary the subset of the state variables that the tree model is allowed to use. We use time to denote the period index, prices to denote the asset prices, payoff to denote the payoff, and KOind to denote the knock-out indicator. We set the relative improvement parameter to 0.005, requiring that we terminate the construction algorithm when the improvement in in-sample objective becomes lower than 0.5%. We report on the sensitivity of our algorithm to this parameter in Section 9.1.
In our implementation of the Longstaff-Schwartz (LS) algorithm, we vary the basis functions that are used in the regression. We follow the same notation as for the tree optimization algorithm in denoting different subsets of state variables; we additionally define the following sets of basis functions:

one: the constant function.
pricesKO: the knock-out (KO) adjusted prices, one for each asset.
maxpriceKO and max2priceKO: the largest and second largest KO adjusted prices.
prices2KO: the knock-out adjusted second-order price terms.
In our implementation of the PO algorithm, we also vary the basis functions, and follow the same notation as for LS. We use 500 inner samples, as in Desai et al. (2012b).
We vary the number of stocks and the common initial price of all assets. For each combination of the two, we consider ten replications. In each replication, we generate 20,000 trajectories for training the methods and 100,000 trajectories for out-of-sample testing. All replications were executed on the Amazon Elastic Compute Cloud (EC2) using a single instance of type r4.4xlarge (Intel Xeon E5-2686 v4 processor with 16 virtual CPUs and 122 GB memory). All methods were implemented in the Julia technical computing language, version 0.6.2 (Bezanson et al. 2017). All linear optimization problems for the pathwise optimization method were formulated using the JuMP package for Julia (Lubin and Dunning 2015, Dunning et al. 2017) and solved using Gurobi 8.0 (Gurobi Optimization, Inc. 2018).
5.2 Out-of-sample performance
Table 1 shows the out-of-sample reward garnered by the LS and PO policies for different basis function architectures and by the tree policies for different subsets of the state variables, for different values of the initial price. The rewards are averaged over the ten replications, with standard errors reported in parentheses. For ease of exposition, we focus only on the case of 8 assets, as the results for the other numbers of assets are qualitatively similar; for completeness, these results are provided in Section 9.2. In addition, Section 9.3 provides additional results for when the common correlation is not equal to zero.
Assets  Method  State variables / Basis functions  Initial Price (three columns)
8  LS  one  33.82 (0.021)  38.70 (0.023)  43.13 (0.015) 
8  LS  prices  33.88 (0.019)  38.59 (0.023)  43.03 (0.014) 
8  LS  pricesKO  41.45 (0.027)  49.33 (0.017)  53.08 (0.009) 
8  LS  pricesKO, KOind  41.86 (0.021)  49.36 (0.020)  53.43 (0.012) 
8  LS  pricesKO, KOind, payoff  43.79 (0.022)  49.86 (0.013)  53.07 (0.009) 
8  LS  pricesKO, KOind, payoff, maxpriceKO  43.83 (0.021)  49.86 (0.013)  53.07 (0.009)
8  LS  pricesKO, KOind, payoff, maxpriceKO, max2priceKO  43.85 (0.022)  49.87 (0.012)  53.06 (0.008)
8  LS  pricesKO, payoff  44.06 (0.013)  49.61 (0.010)  52.65 (0.008) 
8  LS  pricesKO, prices2KO, KOind, payoff  44.07 (0.013)  49.93 (0.010)  53.11 (0.010)
8  PO  prices  40.94 (0.012)  44.84 (0.016)  47.48 (0.014) 
8  PO  pricesKO, KOind, payoff  44.01 (0.019)  50.71 (0.011)  53.82 (0.009) 
8  PO  pricesKO, KOind, payoff, maxpriceKO, max2priceKO  44.07 (0.017)  50.67 (0.011)  53.81 (0.011)
8  PO  pricesKO, prices2KO, KOind, payoff  44.66 (0.018)  50.67 (0.010)  53.77 (0.008)
8  Tree  payoff, time  45.40 (0.018)  51.28 (0.016)  54.52 (0.006) 
8  Tree  prices  35.86 (0.170)  43.42 (0.118)  46.95 (0.112) 
8  Tree  prices, payoff  39.13 (0.018)  48.37 (0.014)  53.61 (0.010) 
8  Tree  prices, time  38.21 (0.262)  40.21 (0.469)  42.69 (0.133) 
8  Tree  prices, time, payoff  45.40 (0.017)  51.28 (0.016)  54.51 (0.006) 
8  Tree  prices, time, payoff, KOind  45.40 (0.017)  51.28 (0.016)  54.51 (0.006) 
From this table, we draw three key insights. First, for all three values of the initial price, the best tree policies (specifically, those that use the time and payoff state variables) outperform all of the LS and PO policies. Relative to the best LS policy for each initial price, these improvements range from 2.04% (third initial price column of Table 1) to 3.02% (first column), which is substantial in the context of this problem. Relative to the best PO policy for each initial price, the improvements range from 1.12% (second column) to 1.66% (first column), which is still a notable improvement.
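The quoted percentages can be reproduced from Table 1 with a short Python calculation. The lists below copy the best out-of-sample reward of each method in each of the three initial price columns of Table 1; the variable and function names are illustrative only.

```python
# Best out-of-sample reward per initial-price column, copied from Table 1.
best_tree = [45.40, 51.28, 54.52]
best_ls = [44.07, 49.93, 53.43]
best_po = [44.66, 50.71, 53.82]


def relative_improvement_pct(new, base):
    """Relative improvement of new over base, in percent."""
    return 100.0 * (new - base) / base


vs_ls = [round(relative_improvement_pct(t, b), 2)
         for t, b in zip(best_tree, best_ls)]
vs_po = [round(relative_improvement_pct(t, b), 2)
         for t, b in zip(best_tree, best_po)]
print("tree vs. best LS (%):", vs_ls)
print("tree vs. best PO (%):", vs_po)
```

The smallest and largest entries of each list match the ranges stated in the text: 2.04% to 3.02% against LS, and 1.12% to 1.66% against PO.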
Second, observe that this improvement is attained despite an experimental setup biased in favor of LS and PO. In this experiment, the tree optimization algorithm was only allowed to construct policies using subsets of the primitive state variables. In contrast, both the LS and the PO methods were tested with a richer set of basis function architectures that included knock-out adjusted prices, highest and second-highest prices and second-order price terms. From this perspective, it is significant that our tree policies could outperform the best LS policy and the best PO policy. This also highlights an advantage of our tree optimization algorithm, which is its nonparametric nature: if the boundary between stop and go in the optimal policy is highly nonlinear with respect to the state variables, then by estimating a tree policy one should (in theory) be able to closely approximate this structure with enough splits. In contrast, the performance of LS and PO is highly dependent on the basis functions used, and requires the DM to specify a basis function architecture.
Third, with regard to our tree policies specifically, we observe that policies that use time and payoff perform the best. The time state variable is critical to the success of the tree policies because the time horizon is finite: as such, a good policy should behave differently near the end of the horizon from how it behaves at the start of the horizon. The LS algorithm handles this automatically because it regresses the continuation value separately at each period, working backward in time, so the resulting policy is naturally time-dependent. The PO policy is obtained in a similar way, with the difference that one regresses an upper bound at each period, so the PO policy is also time-dependent. Without time as an explicit state variable, our tree policies will estimate stationary policies, which are unlikely to do well given the nature of the problem. Still, it is interesting to observe some instances in our results where our time-independent policies outperform time-dependent ones from LS (for example, compare tree policies with prices only to LS with prices or one). With regard to payoff, we note that because the payoff is a function of the prices, one should in theory be able to replicate splits on payoff using a collection of splits on price; including payoff explicitly helps the construction algorithm recognize such collections of splits through a single split on the payoff variable.
5.3 Computation time
Table 2 reports the computation time for the LS, PO and tree policies for 8 assets and each initial price, for the uncorrelated (zero correlation) case. The computation times are averaged over the ten replications for each combination. As with the performance results, we focus on 8 assets to simplify the exposition; additional timing results for the other numbers of assets are provided in Section 9.4. For LS, the computation time consists of only the time required to perform the regressions at each period. For PO, the computation time consists of the time required to formulate the linear optimization problem in JuMP, the solution time of this problem in Gurobi, and the time required to perform the regressions at each period (as in Longstaff-Schwartz). For the tree method, the computation time consists of the time required to run Algorithm 1.
Assets  Method  State variables / Basis functions  Initial Price (three columns)
8  LS  one  1.2 (0.0)  1.2 (0.0)  1.2 (0.0) 
8  LS  prices  1.4 (0.0)  1.4 (0.0)  1.4 (0.0) 
8  LS  pricesKO  1.4 (0.1)  1.5 (0.1)  1.5 (0.1) 
8  LS  pricesKO, KOind  1.6 (0.1)  1.4 (0.1)  1.5 (0.1) 
8  LS  pricesKO, KOind, payoff  1.9 (0.2)  1.9 (0.1)  1.8 (0.1) 
8  LS  pricesKO, KOind, payoff, maxpriceKO  2.5 (0.2)  2.4 (0.2)  2.4 (0.2)
8  LS  pricesKO, KOind, payoff, maxpriceKO, max2priceKO  2.7 (0.2)  2.6 (0.3)  2.4 (0.2)
8  LS  pricesKO, payoff  1.7 (0.2)  1.6 (0.1)  1.4 (0.1) 
8  LS  pricesKO, prices2KO, KOind, payoff  5.5 (0.4)  4.4 (0.2)  4.5 (0.2) 
8  PO  prices  33.3 (0.7)  35.5 (0.7)  32.8 (0.7) 
8  PO  pricesKO, KOind, payoff  76.8 (2.8)  73.8 (5.0)  56.9 (2.1) 
8  PO  pricesKO, KOind, payoff, maxpriceKO, max2priceKO  104.7 (5.8)  79.6 (3.3)  66.8 (3.8)
8  PO  pricesKO, prices2KO, KOind, payoff  221.3 (9.4)  180.7 (4.1)  142.0 (4.2) 
8  Tree  payoff, time  7.6 (0.3)  3.9 (0.3)  3.2 (0.1) 
8  Tree  prices  124.7 (8.6)  125.5 (4.8)  125.0 (5.8) 
8  Tree  prices, payoff  5.5 (0.1)  5.3 (0.1)  5.1 (0.1) 
8  Tree  prices, time  158.8 (12.5)  101.0 (12.8)  51.1 (2.7) 
8  Tree  prices, time, payoff  20.4 (1.0)  10.9 (0.6)  9.3 (0.1) 
8  Tree  prices, time, payoff, KOind  21.5 (1.5)  11.2 (0.6)  9.3 (0.2) 
From this table, we can see that although our method requires more computation time than LS, the times are in general quite modest: our method requires no more than 2.5 minutes on average in the largest case. (In the additional experiments reported in Section 9.4, we find that the method requires no more than 5 minutes on average in the largest case.) The computation times of our method also compare quite favorably to those of the PO method. We also remark that our computation times for the PO method do not include the time required to generate the inner paths and to preprocess them in order to formulate the PO linear optimization problem. Including this additional time increases the computation times by a large amount, ranging from 540 seconds (using prices only; approximately 9 minutes) to 2654 seconds (using pricesKO, prices2KO, KOind, payoff; approximately 44 minutes).
5.4 Policy structure
It is also interesting to examine the structure of the policies that emerge from our tree optimization algorithm. Figure 7 shows trees obtained using prices, time, payoff and KOind for one replication, for each of the three initial prices.
This figure presents a number of important qualitative insights about our algorithm. First, observe that the trees are extremely simple: there are no more than seven splits in any of the trees. The policies themselves are easy to understand and sensible. Taking one of the trees as an example, we see that if the payoff is lower than 51.06, the policy does not stop unless we are in the last period (period 54), because there is still a chance that the payoff will be higher by the end of the horizon. If the payoff is greater than 51.06 but less than or equal to 54.11, then the policy does not stop unless we are in the last four periods (51, 52, 53 or 54). If the payoff is greater than 54.11 but less than or equal to 55.13, then we continue. Otherwise, if the payoff is greater than 55.13, then we stop no matter what period we are in; this is likely because when such a payoff is observed, it is large enough, and far enough into the horizon, that it is unlikely a larger reward will be realized later. In general, as the payoff becomes larger, the policy will recommend stopping earlier in the horizon. Interestingly, the policies do not include any splits on the prices or the KO indicator, despite the construction algorithm being allowed to use these variables: this further underscores the ability of the construction algorithm to produce simple policies. It is also interesting to note that the tree structure is quite consistent across all three values of the initial price. To the best of our knowledge, there is no prior work suggesting that simple policies such as those in Figure 7 can perform well against mainstream ADP methods for high-dimensional option pricing.
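The example policy described above is simple enough to write out as a handful of nested rules. The Python sketch below transcribes the verbal description only; it is not read off Figure 7 itself. The last period is taken to be 54 based on the "last four periods" remark, and the handling of a payoff exactly equal to a threshold (treated as "less than or equal") is an assumption.

```python
LAST_PERIOD = 54  # inferred from "the last four periods (51, 52, 53 or 54)"


def described_policy(payoff, t):
    """Stop/go rule transcribed from the verbal description of one tree."""
    if payoff <= 51.06:
        # Low payoff: stop only in the very last period.
        return "stop" if t == LAST_PERIOD else "go"
    if payoff <= 54.11:
        # Moderate payoff: stop only in the last four periods.
        return "stop" if t >= LAST_PERIOD - 3 else "go"
    if payoff <= 55.13:
        # As described in the text, the policy continues in this band.
        return "go"
    # High payoff: stop regardless of the period.
    return "stop"


for payoff, t in [(50.0, 53), (50.0, 54), (52.0, 50), (52.0, 51), (56.0, 1)]:
    print(payoff, t, described_policy(payoff, t))
```

Written this way, the whole policy depends on only two quantities, the payoff and the period, which is what makes it easy to inspect by hand.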
We observe in the and trees that there is some redundancy in the splits. For example, for , observe that the left subtree of the split is identical to the left subtree of the split