Markov Decision Processes (MDPs) are a powerful framework for artificial intelligence systems that must perform optimally in the face of uncertainty. MDPs have been broadly applied in various domains, ranging from inventory control to communication to motion planning, and algorithms for finding optimal policies are mature [Puterman 2014]. However, in many circumstances reward is not the only consideration for a policy. For instance, when an autonomous agent operates in an uncertain environment, actions have a probability of collision with obstacles which jeopardizes the future of the mission. In such circumstances, it is undesirable to simply maximize a measure of reward because high reward policies can be tied to dangerous actions, leading to unacceptably large chances of failure.
Chance Constrained MDPs (CCMDPs) place a constraint on the allowed probability of failure in the policy, which we refer to as either the chance constraint or the risk bound [Rossman 1977]. The addition of the risk bound complicates solving the CCMDP since it is no longer optimal to select the highest cumulative reward action from each state, and known solution techniques do not scale well to very large problems. While Monte Carlo Tree Search (MCTS) based planning has shown remarkable recent successes in large MDPs, it has not been applied to CCMDPs because a chance constraint couples all actions and outcomes in a policy, and it cannot trivially be reasoned over in separate branches when the rest of the policy is unknown.
In addition, we find a traditional CCMDP that asserts a constant risk bound to be insufficiently expressive, as in reality the acceptable probability of failure is often contingent upon the reward that would be achieved. Intuitively, mission designers will accept a greater probability of failure if a riskier mission is likely to yield much more reward. But marginal risk tolerance also tends to shrink with increasing reward. This means a mission with a much larger probability of failure but a small increase in reward over an already high reward mission is unlikely to be preferred. Altogether this implies a risk bound that is a concave nondecreasing function of the mission reward, a generalization of static risk bounds.
In this paper we introduce Vulcan, an MCTS based algorithm for large CCMDPs with a concave nondecreasing risk bounding function. We derive a sufficient condition that can be applied during Monte Carlo Tree Search so that any policy returned by the algorithm is guaranteed to satisfy a bound on probability of failure, computed as a function of expected reward of the policy. This allows the algorithm to be run in an anytime manner without the need to explore all states in the policy, at the cost of converging to a slightly suboptimal policy.
Experiments with Vulcan on smaller problems where the optimal policy can be found suggest that the mean suboptimality is on the order of a few percent. Vulcan is observed to run between 50 and 600 times faster than methods that explicitly explore the state space, and for large problems it is observed to run over 10 times faster than heuristic forward search methods. Finally, we demonstrate the use of Vulcan to find a chance constrained policy in a CCMDP with approximately states in 3 minutes.
2 Motivating Scenario
To motivate the development of Vulcan, consider an autonomous vehicle exploring an unknown environment in search of high reward samples, for example, an underwater vehicle exploring the oceans of Europa. As each sample is taken, it updates its model of the environment around it. The position estimate of the vehicle is a probability distribution through space, and when the vehicle’s samples take it close to obstacles, there is a risk of collision that could damage the vehicle and end its mission.
A low constant risk bound means that the vehicle will stay far away from obstacles, even if the most interesting samples are near them. A high constant risk bound is similarly undesirable, as the optimal policy will move the vehicle close to obstacles even if the samples are worth only slightly more reward. Instead, desirable behavior would include an interplay between risk and reward, in the sense that additional risk should be taken only if the additional reward is deemed worthwhile. A natural expression for this balance is a function that specifies the maximum allowable probability of failure for every expected reward. Vulcan allows any concave nondecreasing risk bounding function to be specified, and finds a satisfactory policy accordingly.
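To make the idea of a reward-dependent risk bound concrete, the sketch below defines one possible concave nondecreasing risk bounding function and checks both properties numerically. The functional form (a capped square root) and the constants `base`, `gain`, and `cap` are assumptions chosen purely for illustration; they are not taken from the experiments.

```python
import math


def risk_bound(expected_reward: float) -> float:
    """Hypothetical risk bound: maximum allowed failure probability for a reward.

    The square-root shape and the constants are illustrative assumptions.
    """
    base, gain, cap = 0.01, 0.02, 0.2
    return min(cap, base + gain * math.sqrt(max(expected_reward, 0.0)))


def is_concave_nondecreasing(f, xs, tol=1e-12):
    """Numerically check both properties of f on an evenly spaced grid xs."""
    ys = [f(x) for x in xs]
    nondecreasing = all(b >= a - tol for a, b in zip(ys, ys[1:]))
    # On a uniform grid, concavity means second differences are never positive.
    concave = all(
        ys[i - 1] - 2.0 * ys[i] + ys[i + 1] <= tol for i in range(1, len(ys) - 1)
    )
    return nondecreasing and concave


grid = [0.5 * i for i in range(400)]
print(is_concave_nondecreasing(risk_bound, grid))
```

With this shape, the first units of reward buy the largest increase in tolerated risk, and the tolerance flattens out once the cap is reached, which is the qualitative behavior described above.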
In this case, the environment model that predicts the outcomes of future actions is a function of the locations and outcomes of all previous samples. The probability of failure depends on the locations the vehicle visits, but also the order in which they are visited because the uncertainty in vehicle state typically grows with time. Since the optimal action to take depends on the possible outcomes of actions as well as the total reward and probability of failure incurred up to a state, the rewards and transition probabilities from a state (and therefore the optimal policy from that state) will generally depend on the entire history of states and actions preceding it. For even moderately sized environments, the set of states becomes very large. Vulcan is able to handle these large state spaces by sampling from them in an intelligent manner, which guides the search towards the final policy without evaluating all possible states.
This exploration scenario has one more feature that was important for the development of Vulcan, which is the fact that rewards and risks are expensive to compute. Propagating a sample outcome through an environment model can be computationally difficult, especially for larger models, and reasoning over probability distributions to compute collisions can also be time consuming. The expense of evaluating states further justifies an MCTS approach that does not generate every state in the CCMDP. Vulcan also accounts for expensive states by storing results in memory where possible to avoid recomputation.
3 Problem Statement
We consider the problem of finding the optimal policy in a finite horizon CCMDP subject to a risk bounding function. Formally, a finite horizon CCMDP is a tuple , where:
is a set of states.
is a set of safe states, which satisfy mission constraints such as staying outside of obstacles, while states in are considered failure states.
is a set of actions.
is a stochastic state transition function indicating the probability of transitioning from one state to another when taking an action.
is a reward function giving the numeric reward from moving between states according to an action.
is a discount factor that prioritizes immediate reward, so that a reward received actions in the future is worth times its original value.
is an initial state, representing the state of the world before any actions are taken.
is the planning horizon, or number of actions to perform in the CCMDP.
is a concave nondecreasing risk bounding function which gives the maximum acceptable probability of entering a failure state as a function of reward.
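As a reading aid, the nine components listed above can be written in one conventional notation, sketched below. The symbol names are assumptions chosen for illustration and may differ from the notation used elsewhere in this paper.

```latex
% Assumed symbols, in the order of the list above:
% states, safe states, actions, transition function, reward function,
% discount factor, initial state, planning horizon, risk bounding function.
\langle S,\; S_{\text{safe}},\; A,\; T,\; R,\; \gamma,\; s_0,\; h,\; \Delta \rangle,
\qquad S_{\text{safe}} \subseteq S,
\qquad T : S \times A \times S \to [0, 1],
\qquad R : S \times A \times S \to \mathbb{R},
\qquad \gamma \in [0, 1],
\qquad s_0 \in S,
\qquad h \in \mathbb{N},
\qquad \Delta : \mathbb{R} \to [0, 1]
```

Under this notation the failure states are $S \setminus S_{\text{safe}}$, and a reward received $t$ actions in the future is worth $\gamma^{t}$ times its original value.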
Missions where the initial state is chosen from a discrete set (for example, deploying a vehicle in different locations) or is distributed according to a probability distribution may be modeled with a fixed starting point using an initial action from a dummy state to decide the initial state.
We denote a state at time step as and a state history from time step to as (a sequence of states and actions between those states). The set of all possible state histories is denoted as , and the set of all possible state histories that include only safe states, which we call safe state histories, is denoted as . We seek a policy to be followed such that the probability of entering a failure state is bounded according to the risk bounding function .
While the optimal policy may not be deterministic in general, Vulcan will produce an approximately optimal deterministic policy. It has been argued that randomized policies are difficult to reliably execute, particularly when multiple agents are involved [Dolgov and Durfee 2005; Paruchuri et al. 2004]. In addition, from a mission planning and tracking perspective, deterministic policies are easier to interpret and, if necessary, repeat. Regardless, even if a randomized policy is permitted, we will show that Vulcan still provides order of magnitude speedups over solution methods which explore the entire state space.
In the types of problems we will consider, a state depends strongly on its history, in the sense that a state will only be reachable by one or a small number of histories, like in our exploration scenario. Despite the fact that Vulcan uses a tree structure and considers each state history distinctly, we will show that it still performs well against techniques that do not repeat states when a state is reachable by several histories. However, in the extreme of problems with a small number of states with many loops and very long planning horizons, we would not expect the techniques introduced in this paper to perform well.
In this work, we consider failure to satisfy the constraints as disastrous in nature, such as damaging and losing the exploration vehicle or corrupting all data. As a result, entering any failure state is considered to be an end of the mission, as in [Geibel and Wysotzki 2005]. (Footnote 1: Strictly speaking, this is not a necessary assumption, but it obviates the need for the nuanced discussion of how an agent should act after entering a failure state when the mission is allowed to continue.)
The lifetime reward function is defined as
To simplify notation, we introduce a binary random variable which is true if and only if the (stochastically determined) state is a safe state,
Following the notation of [Santana et al. 2016], we define the execution risk of the policy following the state history as the probability that a failure state will be entered in the future after has occurred,
Unlike Santana et al., we do not consider partial observability of states, so in this work execution risk does not require consideration of the probability that any state in is a failure state, and only needs to be defined for safe state histories. When considering the state history , we will simply write .
We seek the optimal policy that satisfies
for a specified concave nondecreasing function .
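The four definitions in this section can be sketched in LaTeX as follows. This is a reconstruction from the surrounding prose under assumed notation ($\pi$ for a policy, $R$ for the reward function, $\gamma$ for the discount factor, $h$ for the horizon, $S_{\text{safe}}$ for the safe states, $Sa_t$ for the safety indicator, $er$ for execution risk, and $\Delta$ for the risk bounding function). It is numbered so that the final line matches the references to eq. 4 elsewhere in the text, and it should be read as an illustration rather than the exact original statement.

```latex
\begin{align}
R(s_{0:h}) &= \sum_{i=0}^{h-1} \gamma^{i}\, R(s_i, a_i, s_{i+1}) \tag{1} \\
Sa_t &=
  \begin{cases}
    1 & \text{if } s_t \in S_{\text{safe}} \\
    0 & \text{otherwise}
  \end{cases} \tag{2} \\
er(s_{0:k} \mid \pi) &= 1 - \Pr\Big( \bigwedge_{i=k+1}^{h} Sa_i \;\Big|\; s_{0:k},\, \pi \Big) \tag{3} \\
\pi^{*} &= \arg\max_{\pi}\; \mathbb{E}\big[ R(s_{0:h}) \mid \pi \big]
  \quad \text{s.t.} \quad
  er(s_0 \mid \pi) \le \Delta\big( \mathbb{E}\big[ R(s_{0:h}) \mid \pi \big] \big) \tag{4}
\end{align}
```

The constraint in the last line is what couples all branches of a policy: the bound on the total probability of failure depends on the expected reward of the whole policy, not on any single state history.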
4 Overview of Approach
In general, methods that solve for the exact solution to problems similar to eq. 4 do not scale well to large problems without heuristics, because the chance constraint couples all possible outcomes of the policy. An action on one branch may increase the expected reward, allowing more risk to be tolerated on a second branch with little relation to the first.
Our approach is instead to define a constraint which depends on a function and applies to state histories. If is satisfied by every safe state history reachable by a policy, then that policy is guaranteed to satisfy the risk bounding function. Implicitly, the constraint defines a set of acceptable safe state histories , and the set of policies that can be constructed from safe state histories in all satisfy the chance constraint, so it need not be considered explicitly. The final solution returned is the highest reward policy in , which we call . Satisfaction of the constraint by all safe state histories is a sufficient but not necessary condition, so that a suboptimal solution may be found, but optimality is traded off against an increased search speed. This idea is shown in figure 1 where is a subset of all policies . All policies in lie below the risk bounding function, and therefore satisfy the chance constraint, but there exist policies below the risk bounding function (potentially including the true optimal policy) that are not in .
The advantage of this technique is that each of the constraints is local, in the sense that the constraint can be verified for a state history without knowledge of the rest of the policy. The optimal policy in can therefore be found using forward search and Bellman backups, like in an unconstrained MDP, and each state history only needs to be considered at most once.
Once has been specified, search progresses by assuming that , meaning that every safe state sequence satisfies . Once a state history is explored up to the planning horizon, it is evaluated against . If is satisfied, search continues without modification, but if is violated or no actions remain from a state, then the preceding action is deleted. Search then continues as if the deleted action never existed. The end result is conceptually similar to search performed over , with additional state histories that are found and ignored when they are identified to not be part of .
We proceed by reviewing related work. We then define the constraint and introduce VulcanFS, which uses the constraint with forward search. VulcanFS is useful for smaller problems in that it still runs much faster than optimal explicit methods, and returns the policy that the MCTS algorithm Vulcan will converge to. Since it searches all states, it also provides an upper bound for the run time of an MCTS based approach. We then use insights from the application of UCT [Kocsis and Szepesvári 2006] to MDPs in order to develop Vulcan, our MCTS based planner for CCMDPs.
5 Related Work
5.1 Constrained MDPs and Chance Constrained MDPs
Under our assumption of terminal failure states, a CCMDP may be formulated as a constrained MDP (CMDP) [Altman 1999], which bounds the expected sum of costs gained from each action. In this case, a cost of 1 is received for entering a failure state, and 0 otherwise. All CMDP methods finding policies for states can also be made to return history-dependent policies by making a unique state for each state history. The major extensions of our problem over a traditional CMDP are that the chance constraint is a general concave nondecreasing function as opposed to a constant, and that the state space is too large and expensive to compute in full.
Commonly used methods for finding policies for CCMDPs and CMDPs are linear programming and dynamic programming, while heuristic search and penalty methods also exist. Both linear programming and dynamic programming methods can be extended to handle certain classes of risk bounding functions, but their major drawback for our purposes is that they are unsuitable for the large state spaces we consider. Stochastic policies for CMDPs may be found in a time polynomial in the size of the CMDP [Altman 1999], but it is known that computing deterministic policies for CMDPs is NP-complete [Feinberg 2000]. It follows that when the state spaces are too large to generate in full, neither linear programming nor dynamic programming methods are able to efficiently generate policies, and the problem is exacerbated when deterministic policies are desired.
In unconstrained MDPs, a common technique is online receding horizon planning: plan up to a reduced horizon, then repeat the process for outcomes of the previously found policy. The resulting policy is typically suboptimal, though it is possible to place guarantees on the suboptimality of the resulting policy [Alden and Smith 1992; Chang and Marcus 2003]. In CMDPs, receding horizon approaches can guarantee that risk bounds are satisfied over the planning horizon, but do not account for actions later in the policy without estimates for future risk, so receding horizon approaches give no guarantees on feasibility. Methods of using receding horizon planning for CMDPs such as [Undurti and How 2010] require additional planning to guarantee feasibility from states, which may itself require searching through large state spaces and is additionally complicated when the allowed cost depends on the reward.
Our MCTS approach solves the problem of having to evaluate every state and reward in larger problems by running in an anytime manner, with the reward function computed only as states are explored. Application of UCT ensures that search is directed towards promising solutions so that a high scoring policy is achieved even when only a small fraction of the state space has been explored. While certain outcomes are not explored up to the full horizon , exploration deep into the search tree is incorporated into the policy, and high probability outcomes are more likely to be considered. In other words, the returned policy is likely to satisfy the risk bounds up to the planning horizon for high probability outcomes, while in an incomplete policy the states that satisfy the risk bound only up to a reduced horizon are much less likely to actually occur.
5.1.1 Linear Programming Methods
Just as an MDP may be solved with a linear programming (LP) approach, there exists an LP formulation for CMDPs [Heyman and Sobel 1982]. Deterministic policies can be computed instead using a mixed integer linear program (MILP) formulation, which adds a binary variable for each state-action pair in the CMDP [Dolgov and Durfee 2005].
This method can be trivially extended to handle linear risk bounding functions, but general concave functions will require linear approximations. In theory the approximations can be made to reach an arbitrary degree of accuracy with increasing numbers of binary variables [Bisschop 2006], but the more pressing issue is that the MILP becomes exponentially more difficult to solve with increasing numbers of binary variables.
Furthermore, LP methods still require evaluation of every reward and cost function in the CMDP before a solution can be found. In our motivating example, when missions have a large number of measurements and measurement outcomes, performing this computation can be prohibitively expensive.
5.1.2 Dynamic Programming Methods
To use dynamic programming in a CMDP, a Lagrangian function is formulated which includes the CMDP reward and constraints. The problem of finding the optimal constrained policy then transforms into an unconstrained optimization over both the policy and the values of Lagrange multipliers in the Lagrangian function [Altman 1999]. For any given value of the Lagrange multipliers, the unconstrained optimal policy is found using standard dynamic programming. This approach has been applied for CCMDP problems with up to millions of states [Ono et al. 2012].
Again, it is clear how to handle linear risk bounding functions, but general concave risk bounding functions cannot directly be used with dynamic programming. In addition, applying dynamic programming requires evaluation of all rewards and costs in the CMDP for each iteration of the algorithm used to select the Lagrange multipliers, which has the same drawbacks for large problems as the LP approach. By using MCTS, we accept a small degree of suboptimality in order to find policies for much larger chance constrained problems.
5.1.3 Heuristic Forward Search Methods
An alternative approach to solving CCMDPs or CMDPs is the use of heuristic forward search similar to AO*. RAO* [Santana et al. 2016] was designed for partially observable CCMDPs and uses both a reward heuristic and a risk heuristic in order to guide exploration of the search space. RAO* explicitly computes expected lifetime reward and execution risk, and so may be extended to be used with an arbitrary risk bounding function. While optimal deterministic policies are computed rapidly for very large problems when effective heuristics are available, it can be difficult to develop strong heuristics when the environment is not well known. In the worst case, RAO* must re-compute the optimal policy whenever new states are added to its search tree, which can lead to enumeration of the entire policy space.
Vulcan does not require heuristics to identify high reward actions. When domain knowledge is present, it can be incorporated into the default policy used for expansion, while the degree of certainty in that default policy is controlled through UCT’s exploration parameter. The default policy only needs to express an estimate for the best action to take at any state, which is more intuitive to specify than a numerical prediction of upcoming reward or risk.
In the worst case, Vulcan will explore the entire state space before converging to its final policy, but it does not need to explore the entire policy space. In practice, the use of UCT causes rapid convergence towards high reward policies even without domain specific knowledge.
5.1.4 Risk Bounds as Penalties
It has been proposed that chance constraints in CCMDPs, or more generally constraints in CMDPs, may be handled without reasoning over the constraints explicitly by adding a penalty to the reward function
and solving an unconstrained MDP for one or many values of the penalty constant. [Geibel and Wysotzki 2005] apply this approach for CCMDPs and use reinforcement learning on the resulting unconstrained MDP. Solving for the optimal policy is performed repeatedly, and the penalty constant is reduced so long as computed policies are observed to obey a constant risk bound.
While solving an unconstrained MDP is easy compared to a CCMDP, it is not possible to guarantee that solutions to penalty methods satisfy risk bounds without explicitly computing the probability of failure of the policy found. Further, the best choice of the penalty constant depends on the values of the rewards, and there is no method to estimate a value that leads to a satisfactory solution, which explains why repeated solutions with different values are necessary. In fact, [Undurti and How 2010] showed that CCMDPs exist for which no value of the penalty constant gives the optimal policy. In contrast, our approach applies for all concave nondecreasing risk bounding functions, including constant functions, and even though our approach may also introduce some degree of suboptimality, the solution is guaranteed to satisfy the risk bounding function, and the CCMDP does not have to be repeatedly solved.
Even if we restrict our attention to certain classes of risk bounding functions such as linear functions, it is still not possible to guarantee that a value of that leads to the optimal solution exists, and so chance constraints may not be encoded as penalties in general. As an example, consider the very simple MDP with one action to be chosen in figure 2 (a), with the objective of maximizing reward with a risk bounding function . By inspection, action exceeds the allowed risk bound, while action is the maximum reward choice that satisfies the risk bound.
Formulating the risk as penalties on rewards leads to the MDP in figure 2 (b), but for the optimal action is and for the optimal action is , with a tie between the two at . Action is never selected, so a penalty method does not work here, while Vulcan will find the true optimal policy.
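The failure of penalty methods described above can be reproduced on a small numerical example. The sketch below uses hypothetical values (three actions named `A`, `B`, `C`, and an illustrative linear risk bound); these are assumptions for illustration and are not the values shown in figure 2. The penalty weight is written `penalty`, and the penalized objective is taken to be reward minus `penalty` times probability of failure.

```python
# Hypothetical one-step problem: action -> (expected reward, probability of failure).
ACTIONS = {"A": (10.0, 0.10), "B": (5.0, 0.08), "C": (1.0, 0.00)}


def risk_bound(reward):
    """Illustrative linear risk bounding function (an assumption)."""
    return 0.075 + 0.002 * reward


def feasible(name):
    """True if the action's failure probability is within the bound for its reward."""
    reward, risk = ACTIONS[name]
    return risk <= risk_bound(reward)


def best_constrained():
    """Highest reward action among those satisfying the risk bound."""
    return max((a for a in ACTIONS if feasible(a)), key=lambda a: ACTIONS[a][0])


def best_penalized(penalty):
    """Action maximizing reward minus penalty times failure probability."""
    return max(ACTIONS, key=lambda a: ACTIONS[a][0] - penalty * ACTIONS[a][1])


# Sweep the penalty weight from 0 to 1000 in steps of 0.1.
winners = {best_penalized(step / 10.0) for step in range(0, 10001)}
print(best_constrained(), sorted(winners))
```

In this example action `B` is the constrained optimum, yet no penalty weight in the sweep selects it: the penalized objective switches directly from `A` to `C`, because `B` lies below the line joining `A` and `C` in the risk-reward plane.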
5.2 Application of MCTS to MDPs
The application of Monte Carlo Tree Search to Markov Decision Processes has allowed approximately optimal policies to be found for previously intractably large MDPs, with perhaps the most visible application being the highly publicized success of AlphaGo [Silver et al. 2016]. MCTS methods produce a high reward policy without explicitly enumerating all states in the MDP. Instead, random samples are used to estimate the possible reward for an action from a state, and an action is only allowed additional samples so long as the results of previous samples compare favorably to other actions.
There are many variants of MCTS, based on different heuristics and selection rules [Browne et al. 2012], but many build off the UCT algorithm introduced by [Kocsis and Szepesvári 2006]. UCT balances exploration of actions that have not been frequently sampled against exploitation of actions that previous samples have implied return a large reward. To do so, UCT dictates that all actions should be sampled from a state once, and then subsequent sample actions should be selected according to
where is the total number of samples performed from state , is the total number of samples of action taken from state , is a tuned constant, and is a numerical estimate for the action value function (reward to go) from based on the previous samples taken during tree search. An action is more likely to be chosen by UCT if it has a high action value function, or the number of times it has been sampled is low compared to the total number of samples taken. After choosing a sample action, the specific outcome to be sampled next is selected randomly according to the probability distribution of outcomes.
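A minimal sketch of this selection rule is given below, assuming the common UCB1-style form in which the score of an action is its estimated value plus `c * sqrt(ln N(s) / N(s, a))`. The function name `uct_select`, the dictionary-based bookkeeping, and the exact scaling of the exploration term are assumptions for illustration; published variants differ by constant factors.

```python
import math


def uct_select(q_values, action_counts, exploration_c):
    """Pick the next action to sample from a state.

    q_values[a]      -- current estimate of the action value (reward to go)
    action_counts[a] -- number of times action a has been sampled from this state
    exploration_c    -- tuned exploration constant (assumed name)
    """
    # Every action is sampled once before the UCT score is used.
    for action, count in action_counts.items():
        if count == 0:
            return action

    total = sum(action_counts.values())  # samples performed from this state

    def score(action):
        bonus = exploration_c * math.sqrt(math.log(total) / action_counts[action])
        return q_values[action] + bonus

    return max(action_counts, key=score)


q = {"left": 1.0, "right": 0.9}
counts = {"left": 50, "right": 2}
print(uct_select(q, counts, exploration_c=0.0))  # pure exploitation
print(uct_select(q, counts, exploration_c=1.0))  # exploration bonus included
```

With no exploration bonus the higher-valued action is chosen; with the bonus, the rarely sampled action is preferred until its count catches up, which is the exploration and exploitation trade-off described above.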
The significance of UCT is that for certain choices of , the expected error in shrinks at a rate and the probability of finding a suboptimal policy can be shown to decay to zero at a polynomial rate [Kocsis and Szepesvári 2006]. In practice is often tuned empirically, though a theoretical value of can be derived for the bandit setting [Auer et al. 2002].
While UCT works well for large MDPs, it has not been applied to constrained or chance constrained MDPs because of coupling between different outcomes through the constraint, which means that the optimal policy does not maximize a value function at a single state. By decoupling the constraint, Vulcan allows UCT to be applied to a CCMDP, gaining fast convergence to an approximately optimal policy.
6 Local Constraints for Risk Bounding
In this section we introduce and prove our local constraints on safe state histories which guarantee that the risk bounding function is satisfied. These constraints form the basis of our forward search and MCTS based algorithms. The constraints involve a property of state histories that we call the sequence execution risk, based on the previously defined execution risk.
6.1 Definition of Sequence Execution Risk
The sequence execution risk of a state history is related to the combined probability of entering a failure state by executing from , from , and so forth until from , assuming all states in the history are safe. For state histories with failure states, the sequence execution risk is zero by definition. Formally,
where is the indicator function that is one if and zero otherwise.
We use the following shorthand for the immediate probability of failure of taking action from state , which may be thought of as the immediate risk of the action:
Then the sequence execution risk may be efficiently computed (or alternatively defined) for safe state histories as:
Intuitively, the numerator of the sequence execution risk of a state history may be interpreted as the execution risk of the only policy possible in a different MDP, where only the actions taken in the state history are available, and taking the action from state leads to a failure state with probability (the same probability as in the original policy) and state otherwise. This intuitive definition is visualized in figure 3. The denominator acts as a scaling factor to give sequence execution risk a desirable expectation.
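The description above suggests a simple way to compute the sequence execution risk from the immediate risks along a safe state history. The sketch below assumes the form `(1 - P) / P`, where `P` is the product of `(1 - immediate risk)` over the actions in the history: the numerator is the chance of failing somewhere along the chain, and the denominator is the chance of surviving it. This form is an assumption consistent with the prose here, not a verbatim copy of eq. 8, and the two-step policy used to check it is hypothetical.

```python
from math import prod


def sequence_execution_risk(immediate_risks):
    """Sequence execution risk of a safe history, from its immediate risks.

    Assumed form: (1 - P) / P with P = prod(1 - r_i) over actions taken so far.
    """
    survive = prod(1.0 - r for r in immediate_risks)
    if survive <= 0.0:
        raise ValueError("history cannot be reached without failure")
    return (1.0 - survive) / survive


# Hypothetical two-step policy: the first action fails with probability 0.1;
# otherwise one of two safe states is reached, each with its own second-step risk.
first_risk = 0.1
branches = [(0.6, 0.2), (0.3, 0.05)]  # (probability of reaching state, next risk)

# Execution risk of the policy: total probability of ever entering a failure state.
execution_risk = first_risk + sum(p * r for p, r in branches)

# Expectation of sequence execution risk over full histories; histories that
# contain a failure state contribute zero, so only safe histories are summed.
expected_ser = sum(
    p * (1.0 - r) * sequence_execution_risk([first_risk, r]) for p, r in branches
)
print(execution_risk, expected_ser)
```

In this example the two quantities agree up to floating point error, which illustrates the role of the denominator: it inflates the value on safe histories so that the expectation still equals the execution risk of the policy.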
6.2 Local Constraint and Proof
Using the sequence execution risk, we now introduce our constraint that ensures a policy satisfies the risk bounding function. Let be any function with expectation bounded by the expected cumulative reward in the policy:
The following theorem shows that a local constraint on safe state histories that ensures the risk bounding function is satisfied can be created for every such .
Theorem 1. For a concave nondecreasing function , asserting that for all state histories reachable under policy
Here, the constraint is our previously mentioned . Since is zero for unsafe state histories and maps to , the condition of theorem 1 is automatically satisfied for all unsafe state histories.
The implication of satisfying for all state histories is exactly the constraint on our problem statement in eq. 4, so a policy may be found by choosing an appropriate function , and ensuring that is satisfied for each safe state history. The basis of our algorithms is to search through feasible safe state histories, evaluate and , and construct the highest scoring policy from among those state histories that satisfy the local constraint.
To prove theorem 1, we begin with two lemmas.
where , are the elements of the state history and .
Our proof is by induction. For the case , we have
so the lemma holds.
Assuming the lemma is true for , the case of is proven as
where the expectation with the conditions is understood to be over all state histories of the form .
We note the following two recursive expressions for execution risk and sequence execution risk:
where , and for safe state histories,
The proof then follows from induction. The case where is trivially true since . Assuming the lemma is true for , the case of follows from
Since is defined to be zero for state histories involving failure states, the expectation needs only to sum over safe state histories. The additional terms from moving into the sum follow from lemma 1.
Proof of Theorem 1.
where the third line follows from Jensen’s inequality and the concavity of , while the fourth line follows from the facts that and is non-decreasing. ∎
Note that since we have defined the sequence execution risk to be zero for any state history involving a failure state, is always true for any failing state history. This was done so that policies which take high risks up front in order to reach large rewards at later time steps are not forbidden due to large probabilities of failure but low rewards on short state histories that end in early failures. The price that is paid is an inflation of for safe state histories so that it remains the expectation of the execution risk, which accounts for the denominator in eq. 8.
6.3 Selection of Function
We have defined a constraint which uses a function . A choice of function for which eq. 11 is satisfied with equality will introduce less conservatism to the solution, but there are a large number of functions with this property, and it remains to be shown why it would be advantageous to choose a function other than the obvious .
The downside of using is that unlikely and low reward outcomes of an action can lead to state histories with , as in figure 4. A very low reward outcome with a low probability does not heavily influence the expected lifetime reward of a policy taking an action, but the state history including that outcome will have a much lower lifetime reward, and may not satisfy the risk bound, which prevents the action from being taken. This situation occurs frequently in CCMDPs with low probability of failure and zero reward received upon failure.
Instead, we recommend a function that averages reward from the outcomes of each action,
This function satisfies eq. 11 with equality, and since it averages the outcomes of actions, it avoids the case where unlikely low reward outcomes prevent an action from being included in the policy. The downside is that the rewards to multiple outcome states must be evaluated, which can be computationally expensive. Nonetheless, when used in our MCTS algorithm, the advantage of not needing to enumerate all states and reward functions remains. Our experimental results all use this function.
7 VulcanFS: A Forward Search Based Algorithm
Using theorem 1, it is possible to derive a version of forward search [Sucar 2011] that ensures the chance constraint is satisfied, which we call VulcanFS. The algorithm is presented in algorithm 1. At the planning horizon, if the local constraint is violated, the reward is set to −∞ to ensure the action is never taken. Otherwise, the lifetime reward is maximized.
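The following is a minimal sketch of constrained forward search in the spirit of this description, not a transcription of algorithm 1. It makes several simplifying assumptions that are labeled in the comments: the model is supplied as a hypothetical `outcomes(state, action)` function, the local check uses the history's own cumulative reward as the function of reward (the "obvious" choice rather than the recommended averaged one), and the sequence execution risk is taken to be `(1 - P) / P` with `P` the survival probability along the history.

```python
import math


def forward_search(state, depth, horizon, actions, outcomes, risk_bound,
                   gamma=1.0, reward_so_far=0.0, survive=1.0):
    """Return (value, plan) for the best action tree passing the local check.

    outcomes(state, action) returns (prob, next_state, reward, is_safe) tuples
    (a hypothetical interface). survive is the product of (1 - immediate risk)
    over the actions taken so far along this history.
    """
    if depth == horizon:
        ser = (1.0 - survive) / survive  # assumed form of sequence execution risk
        ok = ser <= risk_bound(reward_so_far)  # f taken as the history's reward
        return (0.0 if ok else -math.inf), None

    best_value, best_plan = -math.inf, None
    for action in actions:
        results = [o for o in outcomes(state, action) if o[0] > 0.0]
        immediate_risk = sum(p for p, _, _, safe in results if not safe)
        if immediate_risk >= 1.0:
            continue  # certain failure: no safe history passes through here
        value, children = 0.0, {}
        for prob, nxt, reward, safe in results:
            if not safe:
                value += prob * reward  # failure ends the mission
                continue
            child_value, child_plan = forward_search(
                nxt, depth + 1, horizon, actions, outcomes, risk_bound, gamma,
                reward_so_far + gamma ** depth * reward,
                survive * (1.0 - immediate_risk))
            # A violated history below makes child_value -inf, deleting the action.
            value += prob * (reward + gamma * child_value)
            children[nxt] = child_plan
        if value > best_value:
            best_value, best_plan = value, (action, children)
    return best_value, best_plan


def toy_outcomes(state, action):
    """Hypothetical one-step model with a risky and a safe action."""
    if action == "risky":
        return [(0.8, "sample_site", 10.0, True), (0.2, "collision", 0.0, False)]
    return [(1.0, "sample_site", 1.0, True)]


ACTIONS = ["risky", "safe"]
strict = forward_search("start", 0, 1, ACTIONS, toy_outcomes, lambda r: 0.02 * r)
lenient = forward_search("start", 0, 1, ACTIONS, toy_outcomes, lambda r: 0.03 * r)
print(strict[1][0], lenient[1][0])
```

In the toy problem the risky action is rejected under the stricter linear bound and accepted under the more lenient one, because the allowed risk scales with the reward of the history. Each history is visited once, and a violation at the horizon propagates upward as a value of negative infinity, which removes the preceding action from consideration.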
8 Vulcan: An MCTS Based Algorithm
Algorithm 1 shows how a deterministic policy satisfying a risk-bounding function may be found by only considering each possible state history once. For problems where the rewards and probabilities of failure are functions of the entire state history and ordering, this corresponds to the state space of the problem. However, algorithm 1 is still not suitable for large problems where evaluating every reward is not feasible; its main value comes from how it may be extended to the MCTS implementation Vulcan, given in algorithm 2. The idea behind Vulcan is to use the insights of UCT in order to construct a policy that rapidly converges towards while only sampling a subset of state histories allowed by the MDP.
Theorem 1 is used to ensure that the policy does not violate the chance constraint. Naturally, if the algorithm is terminated before an entire policy has been evaluated down to the planning horizon, guarantees cannot be placed on the states that have not been examined. However, in this case theorem 1 asserts the chance constraint over the states that have been explored, bounding the probability of failure across those states as a function of the lifetime reward computed across them.
Vulcan uses UCT to guide sampling in algorithm 3, which balances exploration against exploitation based on the reward and number of samples of state histories. As in a traditional UCT algorithm, a default policy is used to incorporate domain-specific knowledge if it is available. However, our algorithm differs from standard UCT in three aspects.
First, since we are interested in domains where computation of reward is expensive, we avoid recomputing reward between states that have already been sampled. This is achieved by keeping all states visited in the search tree. Instead of expanding the search tree by a single node and performing a rollout, which evaluates many states but keeps only their cumulative reward, we add states to the search tree until reaching the planning horizon, saving the rollout in memory. This trades off memory usage against the slowdown of repeatedly evaluating an expensive state.
Second, we impose the constraint of theorem 1 on all safe state histories at the planning horizon. If the sequence execution risk does not satisfy , then the last action in the sampled state history is deleted so that it is not sampled again. Likewise, if all actions have been deleted from a non-terminal state history, then taking any action from the sampled state history has a non-zero probability of leading to a state that violates the conditions of theorem 1, so the last action is again removed. After an action is deleted, sampling continues from the parent node of that action until a satisfactory state history has been found. Sample counts are only incremented when a state history is found that satisfies the enforced condition, and they are decremented when states are deleted. This way, the regret bounds derived for UCT apply with regard to the number of samples taken from state histories in .
Third, we note that the constraint in theorem 1 is not satisfied when sampling stops prematurely and there are outcomes of an action in the returned policy that are not sampled, as in figure 5. In that case, there may be a non-zero probability that an action in the policy leads to a state history that violates . To fix this, we add a new step to our algorithm that occurs after sampling is completed, which we call cleanup. During cleanup, given in algorithm 4, all immediate outcomes of actions in the policy that have not yet been sampled are explored to ensure the lifetime reward up to those states satisfies . By doing so, theorem 1 applies to assert a risk bound over the actions in the policy that have been explored during tree search, as if the leaves of the search tree were terminal states.
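The selection rule and the deletion step from the second difference can be sketched as follows. This is an illustrative simplification, not algorithm 3: it uses the standard UCB1 score, collapses state and outcome nodes into a single hypothetical `Node` type indexed by action, and assumes that each node's counts include those of its descendants.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical search-tree node; children are indexed by action."""
    visits: int = 0
    total_reward: float = 0.0
    parent: Optional["Node"] = None
    incoming_action: Optional[str] = None
    children: Dict[str, "Node"] = field(default_factory=dict)


def add_child(parent: Node, action: str, visits: int = 0,
              total_reward: float = 0.0) -> Node:
    child = Node(visits, total_reward, parent, action)
    parent.children[action] = child
    return child


def ucb1_select(node: Node, c: float = math.sqrt(2.0)) -> str:
    """Pick the action with the highest mean reward plus UCB1 bonus."""
    def score(child: Node) -> float:
        if child.visits == 0:
            return math.inf  # unvisited actions are tried first
        mean = child.total_reward / child.visits
        bonus = c * math.sqrt(math.log(max(node.visits, 1)) / child.visits)
        return mean + bonus
    return max(node.children, key=lambda a: score(node.children[a]))


def delete_action(node: Node, action: str) -> Node:
    """Remove a violating action, roll its samples back out of every
    ancestor, and return the node from which sampling should resume."""
    removed = node.children.pop(action)
    ancestor: Optional[Node] = node
    while ancestor is not None:
        ancestor.visits -= removed.visits
        ancestor.total_reward -= removed.total_reward
        ancestor = ancestor.parent
    # If no action is left here, the action leading here is removed too.
    if not node.children and node.parent is not None:
        return delete_action(node.parent, node.incoming_action)
    return node


root = Node(visits=4, total_reward=5.0)
add_child(root, "a", visits=3, total_reward=3.0)
add_child(root, "b", visits=1, total_reward=2.0)
first = ucb1_select(root)
resume = delete_action(root, "b")  # suppose "b" led to a violating history
second = ucb1_select(root)
print(first, second, root.visits)
```

Rolling the deleted samples back out of the ancestors keeps the counts used in the UCB1 score consistent with the set of state histories that remain eligible.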
Cleanup is run by exploring all outcomes of the best policy found by UCT, and evaluating the lifetime reward up to unsampled states that follow immediately from actions in the policy. If is found not to hold, the action leading to the state is deleted from the policy. The policy to take from a state history is set to the next best action from that state history, and the change in expected reward is propagated up the search tree. However, actions in the policy are not changed during cleanup except when an action is deleted, in which case a different action from the same state history is chosen. This process is shown in figure 5, where an unexplored outcome of the current best policy is evaluated in (b), which results in an action being deleted. Instead of reevaluating the optimal action from the root state, the policy is only changed at the parent of the deleted action in (c), and unexplored outcomes of the new policy are evaluated. In this case, the alternative outcome also does not satisfy , so in (d), no valid actions remain from the parent of the deleted action, and the action that led to that state history is deleted. The immediate unexplored outcomes of the new policy are evaluated and found to be consistent with the risk bound.
Even though deleting an action may change the optimal action further up the tree, we do not reevaluate alternative actions during cleanup for two reasons. First, we wish to avoid laborious computation during cleanup so that Vulcan terminates shortly after sampling is finished. While propagating changes in reward up a search tree is not computationally expensive, the change in the policy may require states to be evaluated all throughout the large search tree as the policy updates. Secondly, when sampling has not yet identified not to hold, it is likely that many samples have been used to explore states that are close relatives to the violating state. Performing cleanup in the way we have detailed results in returned policies with more explored states deeper in the search tree.
Regardless, it is not necessarily true that updating the policy further up the tree is a better approximation for the true policy up to the planning horizon, since we do not know whether high reward actions may occur after the violating state. In addition, the convergence of Vulcan to the policy holds regardless of how cleanup is executed, since in the limit of a large number of samples, state histories violating will be identified over the course of sampling.
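The cleanup pass described above can be sketched as a short recursion. This is a simplified illustration, not algorithm 4: the `State` and `Action` types are hypothetical, the flag `satisfies_bound` stands in for actually evaluating the risk condition on a state history, and the propagation of changed expected reward up the tree is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class State:
    # Stand-in for evaluating the risk condition on this state history.
    satisfies_bound: bool = True
    actions: Dict[str, "Action"] = field(default_factory=dict)
    policy: Optional[str] = None


@dataclass
class Action:
    value: float            # expected reward estimate from sampling
    outcomes: List[State]   # every immediate outcome, sampled or not


def cleanup(state: State) -> bool:
    """Return True if a valid action remains at `state` (or it is a valid leaf).

    Leaves of the search tree, including outcomes that were never sampled,
    are treated as terminal and simply checked against the risk condition.
    """
    if not state.actions:
        return state.satisfies_bound
    while state.actions:
        best = max(state.actions, key=lambda a: state.actions[a].value)
        if all(cleanup(child) for child in state.actions[best].outcomes):
            state.policy = best
            return True
        # Some outcome violates the condition: delete the action and fall
        # back to the next best action from this same state history.
        del state.actions[best]
    state.policy = None
    return False  # the caller deletes the action that led here


# Scenario loosely following figure 5 (values are hypothetical).
s1 = State(actions={
    "a1": Action(4.0, [State(True), State(False)]),
    "a2": Action(2.0, [State(False)]),
})
root = State(actions={
    "A": Action(5.0, [s1]),
    "B": Action(3.0, [State(True)]),
})
ok = cleanup(root)
print(ok, root.policy, sorted(root.actions))
```

Here both actions below `A` are deleted in turn, so `A` itself is deleted and the policy at the root falls back to `B`, without reconsidering any choice above the parent of the deleted action.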
We examined the performance of Vulcan by testing it in two problem domains. The first may be viewed as a simplified multi-armed bandit problem with a limited number of actions [Gittins, Glazebrook, WeberGittins et al.2011], and we investigate this domain for small enough problem instances such that run time and optimality can be measured against existing methods. The second domain is illustrative of the exploration problems that Vulcan was built to solve, and describes a vehicle moving through a field described by a Gaussian Process [Williams RasmussenWilliams Rasmussen2006]. Our second experiment shows the capability of Vulcan to deal with large state spaces without the need for strong heuristics.
All experiments were performed on a computer powered by an Intel 8-core i7 CPU with 4.00 GHz clock speed and 12 GB RAM allocated to the process.
9.1 Simplified Multi-Armed Bandit Domain
In our multi-armed bandit domain, a player is faced with three machines, and must choose which machine to play at each action, subject to the following rules.
The reward for playing a machine is drawn from a two-point distribution with known outcomes, i.e. the machine returns reward with probability , and reward with probability , where and are known. The probability is not known, but is modeled in the Bayesian sense as also following a two-point distribution between two known values, meaning with probability and with probability . The player begins with knowledge of and and an estimate for .
After choosing a machine to play, the player immediately receives a randomly drawn reward from the distribution of the chosen machine. After receiving a reward, the player updates their belief state of according to the following Bayesian update rule. If the reward received after playing a given machine times is and the belief state after the measurements is characterized by , then the probability of a measurement is
while the update rule is
Finally, a played machine may fail with a known risk of failure . If any machine fails, zero reward is received for the action and the game immediately ends. If the player deems the risk of every machine to be too high, or instead all machines have been found to have low expected reward, the player may instead choose to end the game immediately as an action. Doing so has no risk, and grants a reward of 0.25 for every remaining action in the game ().
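The belief update in these rules is an ordinary application of Bayes' rule to a two-point prior. A minimal sketch follows, assuming hypothetical names: `belief` is the probability that the machine's high-reward probability equals `p_a` rather than `p_b`. The numbers in the example are illustrative and are not taken from table 1.

```python
def prob_high_reward(belief: float, p_a: float, p_b: float) -> float:
    """Predictive probability that the next play returns the high reward."""
    return belief * p_a + (1.0 - belief) * p_b


def update_belief(belief: float, p_a: float, p_b: float,
                  got_high: bool) -> float:
    """Posterior probability that the machine's parameter equals p_a."""
    like_a = p_a if got_high else 1.0 - p_a
    like_b = p_b if got_high else 1.0 - p_b
    evidence = belief * like_a + (1.0 - belief) * like_b
    if evidence == 0.0:
        raise ValueError("observation has zero probability under the belief")
    return belief * like_a / evidence


# Hypothetical machine: p is either 0.8 or 0.2, initially equally likely.
b0 = 0.5
b1 = update_belief(b0, 0.8, 0.2, got_high=True)
b2 = update_belief(b1, 0.8, 0.2, got_high=False)
print(prob_high_reward(b0, 0.8, 0.2), b1, b2)
```

Because the posterior depends only on how many high and low rewards have been seen, not on their order, histories with the same counts share a belief state, which is what allows the graph representation discussed for this domain.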
The parameters for the specific machines we tested on are given in table 1. The player is forced to choose between a balanced machine with the highest and lowest rewards and intermediate risk (1), a relatively safe machine with low rewards but a belief that it is more likely to give the higher of its reward options (2), and a risky higher rewarding machine that is believed to be biased towards its low reward options (3).
This domain shows some of the aspects which motivated the development of Vulcan, including rewards that depend on the state history of the CCMDP, as well as the need to update a model after each measurement, which in this case is a relatively inexpensive Bayesian update. However, this case differs from a vehicle exploration problem in that the current state does not depend on the order of the measurements made, only the choices and outcomes. Vulcan generates a tree structure, and by modeling states in this domain as a tree we quantified the speedup of Vulcan for problems where the entire history is relevant (for example, with rewards that change in time). In addition, by modeling states as only dependent on the measurements and outcomes, creating a graph structure, we found that Vulcan still provides significant speedups over established methods, even when identical states are repeatedly generated.
9.1.1 VulcanFS Suboptimality
Even without performing MCTS, VulcanFS finds solutions rapidly by enforcing the chance constraint through local conditions. The problem is then decoupled and Bellman updates can be performed at the expense of suboptimality in the solution. To characterize the suboptimality we solved the multi-armed bandit domain with various horizons (number of actions) using both VulcanFS and the MILP approach of [Dolgov DurfeeDolgov Durfee2005], which generates the optimum deterministic policy. The VulcanFS rewards indicate the reward of the policy that Vulcan would eventually converge to. The problems used a risk bounding function of , with results shown in table 2. The results suggest that the suboptimality is small, and remains small over a large number of problem horizons.
| Horizon | Optimal Reward | VulcanFS Reward | Suboptimality |
In order to verify that low suboptimality is not limited exclusively to the risk bounding function chosen in the example above, we ran VulcanFS against the MILP formulation for a continuum of risk bounding functions of the form , for in the range 0.0005 to 0.003, for a fixed horizon of 4 actions. Beyond these limits, there was no error in the VulcanFS solution. The form of the risk bounding function was limited to a proportional equation so that it could be encoded as a MILP.
The results are shown in figure 6, from which it appears that the suboptimality reduces as the risk bounding function becomes less restrictive. The regions in which the suboptimality exceeds 6% were limited to relatively small regions of the test domain, and the average suboptimality over problems with nonzero suboptimality was 4.45%. Run time was not found to be a strong function of the risk bounding function.
The spikes in suboptimality result from increasing expected reward in the true optimal policy with , while the solution found by VulcanFS remains static. This is made clear by figure 7, which shows the expected reward of the true optimal policy and the VulcanFS solution. While small changes in lead to increases in expected reward, VulcanFS continues to return the same solution over a larger range of functions until a better policy in is found. As a result, the solution returned by VulcanFS shows more dramatic jumps, leading to a larger suboptimality immediately before increases in reward. As the risk bounding function becomes more lax, changes in the true optimal reward are less dramatic, and suboptimality decreases as a result.
9.1.2 VulcanFS Run Time for Increasing Problem Size
Examining the run time of VulcanFS is instructive because Vulcan should converge to the policy found by VulcanFS at least as quickly. In figure 8 we show VulcanFS run time for increasing planning horizons in the multi-armed bandit domain. This was compared to the time taken to solve an equivalent CMDP allowing both stochastic policies (by LP) and deterministic policies (by MILP), and using a tree structure (repeated states) and a graph structure (without repeated states). In all tests, a risk bounding function of was used. Since VulcanFS performs model updates and evaluates reward during run time, the times measured for the LP approaches include the time taken both to construct the LP and then to solve it.
When attempting to find deterministic policies for problems where transitions depend on the entire state history (deterministic tree), VulcanFS provides the most significant speedup. At a planning horizon of 6 actions, VulcanFS provides a 600-fold decrease in run time, and enables problems with much larger planning horizons to be solved in realistic time frames. Even when stochastic policies are allowed and there are many ways to reach a state (stochastic graph), VulcanFS still reaches a solution over 50 times faster than an LP-based method, despite forming a tree structure and having no optimization for repeated states.
9.1.3 Vulcan Convergence
Since Vulcan uses MCTS to run faster than VulcanFS, we investigated its convergence to the policy found by VulcanFS for the bandit problem with a horizon of 9 actions. No LP-based method was able to solve this problem in under two hours, and VulcanFS required approximately 18 minutes. Vulcan was run 60 times at each run time from 5 to 60 seconds, in 5-second increments. The average absolute error from the VulcanFS solution is shown in figure 9. With 60 seconds of run time (less than 6% of the run time of VulcanFS) Vulcan was able to achieve a mean error of 0.08%. At that run time, the exact same policy as VulcanFS was found 90% of the time.
9.1.4 Vulcan Performance Against Heuristic Search
The previous sections have shown Vulcan’s speedup over methods that require an explicit representation of the entire state space. In heuristic search methods, heuristics guide the search, leading to an optimal policy without exploring the entire state space. The strength of Vulcan is in the fact that it tends to find policies quickly even if no strong heuristic is known. However, we now show that Vulcan produces policies an order of magnitude faster than the heuristic search method RAO* [Santana, Thiébaux, WilliamsSantana et al.2016] for large problems in our multi-armed bandit domain, even when it is possible to develop a strong heuristic.
RAO* requires a reward heuristic which overestimates the expected cumulative reward that can be gained from a state by taking a given action, and an execution risk heuristic which underestimates the execution risk from a state given an action. In the multi-armed bandit domain, a simple upper bound is the expected reward of the best possible machine multiplied by the number of actions. The best possible machine is machine 1 operating with , leading to an expected reward of 0.7,
An underestimate of the execution risk is the immediate risk of the action to be taken,
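These two heuristics are simple enough to state directly. The sketch below is illustrative: the function names are hypothetical, the value 0.7 is the best-case expected reward quoted above, and the risk check assumes independent per-action risks, which is an assumption of this example.

```python
from typing import List


def reward_heuristic(remaining_actions: int,
                     best_expected_reward: float = 0.7) -> float:
    """Optimistic reward: the best possible machine on every remaining play."""
    return best_expected_reward * remaining_actions


def risk_heuristic(action_risk: float) -> float:
    """Optimistic risk: only the immediate risk of the action to be taken."""
    return action_risk


def sequence_failure_probability(risks: List[float]) -> float:
    """Probability that at least one action in the sequence fails,
    ASSUMING independent failures (used here only to check the bound)."""
    p_safe = 1.0
    for r in risks:
        p_safe *= 1.0 - r
    return 1.0 - p_safe


risks = [0.02, 0.05, 0.01]  # hypothetical per-action risks
print(reward_heuristic(3), risk_heuristic(risks[0]),
      sequence_failure_probability(risks))
```

The risk heuristic never exceeds the failure probability of any sequence that starts with the same action, which is the underestimating property RAO* requires.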
For relatively small problems (up to a planning horizon of 7), RAO* finished in approximately half the time Vulcan took to converge to its final solution, a difference of up to one second. At these scales, it appeared that the practical computational overhead of progressing through the search tree by random sampling at every iteration and tracking value functions at each node was significant compared to the speedup gained by random sampling. At a planning horizon of 8, run times were comparable, and at a planning horizon of 9 Vulcan ran approximately twice as fast as RAO*.
At a planning horizon of 10, the problem was large enough to show a noticeable difference in run time. The experiment was run with a constant risk bound of , and RAO* took 3515.9 seconds to find the optimal solution. Vulcan was run 100 times at each run time from 10 to 310 seconds, in 10-second increments. The average rewards of the policies are shown in figure 10.
Figure 10 shows that, unlike in unconstrained MCTS, the average expected reward found by Vulcan does not always increase with run time. For short run times, the state space is not fully explored, so the returned policies satisfy the risk bound only up to the states explored. This can lead to overestimates of the reward, since unlikely states that violate have not been found. Further sampling tends to decrease the expected reward as infeasible policies are recognized and eliminated.
The policies found with short run times are typically incomplete, in the sense that they do not include actions for all reachable outcomes up to the planning horizon and only satisfy the risk bound up to the states that have been explored. Figure 11 shows the complete policy return rate as a function of run time. With 310 seconds of run time or greater, a complete policy was always returned. These complete policies satisfy the risk bound, are found 11.3 times faster than RAO*, and have a mean suboptimality of only 1.2%.
9.2 Exploration of a Gaussian Process
Our second test domain concerns a vehicle that is tasked with maximizing samples taken through an environment described by a Gaussian Process (GP) (for GP details, see for example [Williams RasmussenWilliams Rasmussen2006]).
In our domain, the vehicle’s state is described by a Gaussian distribution. The mean state is constrained to a rectangular grid, and each action corresponds to the choice of one of the 8 adjacent locations to which the mean location will move. Meanwhile, after each action, the covariance at time step grows according to
for known , as in [Ono, Williams, BlackmoreOno et al.2013]. In our experiment, we used
Known obstacles exist in the environment, and the risk considered in the domain is the probability of collision with those obstacles. Using the methods of [Ono, Williams, BlackmoreOno et al.2013], the probability of collision with a single obstacle is estimated as the probability that the Gaussian distribution passes one of the boundaries of the obstacle. At a given time step, the total probability of failure is conservatively estimated as the sum of the probabilities of collision with all obstacles.
After moving to a new location, a sample with no error is taken from a GP with known hyperparameters, but no prior measurements are taken before the mission begins. For simplicity, each sample is modeled as taken at the vehicle’s mean location. If the location has not been previously visited, a reward is received equal to the value of the sample. After a sample is taken, the Gaussian Process model from the previous state is updated to include the new measurement. The vehicle is tasked with moving through the environment to maximize the sum of its samples.
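A risk estimate of this kind can be sketched as follows. This is an illustration under assumptions, not the authors' code: each obstacle is modeled as a convex polygon given by half-planes, the single boundary used for each obstacle is taken to be the one giving the smallest probability (an assumption of this sketch), and all names and numbers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# A half-plane a.x <= b, stored as ((a0, a1), b). A convex obstacle is the
# intersection of its half-planes. This encoding is an assumption.
HalfPlane = Tuple[Tuple[float, float], float]


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_inside_half_plane(mean: Sequence[float],
                           cov: Sequence[Sequence[float]],
                           plane: HalfPlane) -> float:
    """P(a.x <= b) for x ~ N(mean, cov) in two dimensions."""
    (a0, a1), b = plane
    m = a0 * mean[0] + a1 * mean[1]
    var = (a0 * a0 * cov[0][0] + 2.0 * a0 * a1 * cov[0][1]
           + a1 * a1 * cov[1][1])
    if var <= 0.0:
        return 1.0 if m <= b else 0.0  # no uncertainty along this direction
    return normal_cdf((b - m) / math.sqrt(var))


def collision_risk(mean: Sequence[float], cov: Sequence[Sequence[float]],
                   obstacles: List[List[HalfPlane]]) -> float:
    """Sum over obstacles of a single-boundary collision estimate.

    Being inside an obstacle requires being inside every one of its
    half-planes, so the probability for any single half-plane is an upper
    bound; the smallest such bound is used for each obstacle.
    """
    return sum(min(prob_inside_half_plane(mean, cov, p) for p in obstacle)
               for obstacle in obstacles)


# Hypothetical box obstacle occupying 2 <= x <= 3, -1 <= y <= 1.
box = [((-1.0, 0.0), -2.0), ((1.0, 0.0), 3.0),
       ((0.0, 1.0), 1.0), ((0.0, -1.0), 1.0)]
risk = collision_risk((0.0, 0.0), ((1.0, 0.0), (0.0, 1.0)), [box])
print(risk)
```

Summing the per-obstacle estimates is a union bound, so the total is conservative in the same sense as the estimate described above.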
The Gaussian Process model is described by a mean function and squared exponential covariance kernel of the form
These hyperparameters encourage the vehicle to move to the top right of the environment in order to maximize its samples, but it must bypass obstacles in the way in order to do so. The GP describes a Gaussian distribution over measurements, so the discrete outcomes that were considered in the CCMDP were chosen by Gauss-Hermite quadrature with degree 4 [HildebrandHildebrand1987].
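The discretization step can be sketched in a few lines. The sketch assumes that "degree 4" means the 4-point rule, which matches the 4 outcomes per action used in this domain, and it uses the closed-form nodes and weights of that rule. The function names and the example mean and standard deviation are hypothetical.

```python
import math
from typing import List, Tuple


def gauss_hermite_4() -> List[Tuple[float, float]]:
    """Nodes of the 4-point Gauss-Hermite rule (weight exp(-x^2)) paired
    with weights normalized to sum to one, from the closed-form roots of
    the degree-4 Hermite polynomial: x^2 = (3 -/+ sqrt(6)) / 2."""
    s6 = math.sqrt(6.0)
    inner = math.sqrt((3.0 - s6) / 2.0)
    outer = math.sqrt((3.0 + s6) / 2.0)
    p_inner = 1.0 / (4.0 * (3.0 - s6))
    p_outer = 1.0 / (4.0 * (3.0 + s6))
    return [(-outer, p_outer), (-inner, p_inner),
            (inner, p_inner), (outer, p_outer)]


def discretize_gaussian(mean: float, std: float) -> List[Tuple[float, float]]:
    """Return four (value, probability) outcomes approximating N(mean, std^2)."""
    return [(mean + math.sqrt(2.0) * std * x, p) for x, p in gauss_hermite_4()]


outcomes = discretize_gaussian(mean=1.5, std=0.4)
total = sum(p for _, p in outcomes)
approx_mean = sum(v * p for v, p in outcomes)
approx_var = sum((v - approx_mean) ** 2 * p for v, p in outcomes)
print(outcomes, total, approx_mean, approx_var)
```

A 4-point rule integrates polynomials up to degree 7 exactly against the Gaussian weight, so the four outcomes reproduce the mean and variance of the predictive distribution while keeping the branching factor small.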
The exploration domain shows all of the features that motivated the design of Vulcan. The possible outcomes of future actions are a function of all observations that are taken during the mission, and since the risk of entering a state is also a function of time, the order those measurements are taken in affects the probability of failure. As a result, the risk and reward accumulated depends explicitly on the state history. Prediction of the outcomes of a measurement requires an inversion of a Gaussian Process at each state, which makes reward evaluation relatively expensive.
The problem we ran on this domain began with a fixed initial location and measurement, and allowed 9 additional actions to be taken. With 8 available actions and 4 possible outcomes of each action, the problem has approximately unique state histories.
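As a rough check on the scale of this problem, the number of distinct action and outcome sequences can be counted directly. The calculation assumes that all 8 actions are available at every step and that all 4 outcomes are distinct, so it is an upper-bound style estimate rather than a figure taken from the paper.

```python
actions_per_step = 8
outcomes_per_action = 4
steps = 9

# Each step multiplies the number of state histories by 8 * 4 = 32.
branching = actions_per_step * outcomes_per_action
histories = branching ** steps

print(histories)           # 35184372088832
print(f"{histories:.2e}")  # 3.52e+13
```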
Vulcan was run on the Gaussian Process exploration domain with 9 actions and allowed 180 seconds of run time. A risk bounding function of the form was used, which begins at the origin and approaches a straight line for large x. This function was chosen to show Vulcan’s capability to handle nonlinear concave risk bounding functions, and the specific parameters in the function, GP, and covariance evolution were chosen so that a high risk path between the two obstacles was permissible only for sufficiently high reward, which did not always occur. A small selection of the potential outcomes of the policy found are shown in figure 12, but they demonstrate the interplay between risk and reward that underlies the reasoning performed by Vulcan.
The trajectories shown in figure 12 (a) and (b) travel through a narrow gap between obstacles. In doing so, there is an increased probability of failure, but this is justified by high reward predictions from the Gaussian Process model. In the case of (a), the explored field is high near the gap, and it is reasoned that it will continue to be high nearby. In the case of (b) the additional risk is justified by large observations before the gap is reached, so traveling through the gap is permitted even though high reward is not found.
In (c) and (d) the vehicle moves towards the gap, encouraged by the increasing mean in that direction, but low measurements are found immediately before the gap. In the case of (c) higher reward is expected to be found downwards, and relatively high measurements justify staying close to the obstacle. In (d) the reward is low at the gap and it is no longer worth the risk to travel through the gap, even though the environment’s mean increases in that direction.
Finally, in (e) and (f), an immediate low measurement suggests that the risk of continuing to explore near the obstacles outweighs the expected reward, and the vehicle moves away from the obstacles. In both cases, the vehicle follows the trends observed in the data to move towards a region of high reward.
In this paper we presented Vulcan, which uses Monte Carlo Tree Search to produce approximately optimal policies for chance constrained MDPs subject to a concave nondecreasing risk bounding function. Whereas previous methods for CCMDPs have been limited to problems with millions of states when strong heuristics are unavailable because the entire policy is coupled through the chance constraint, in Vulcan we decompose the chance constraint into constraints that are placed on individual state histories. By doing so, standard approaches for unconstrained MDPs can be efficiently applied, with a small degree of suboptimality introduced into the final solution. The application of MCTS allows policies to be found for problems with state spaces that are too large and computationally expensive to generate in full. This is particularly important when considering CCMDPs where rewards depend on the entire state history up to a given state.
Using Vulcan, we found approximately optimal policies for CCMDPs that are orders of magnitude larger than those handled in the literature without heuristics. Experimentally, we showed that Vulcan finds solutions tens to hundreds of times faster than methods based on linear programming, and that convergence happens rapidly without the need to fully explore the state space. Additionally, Vulcan was used to find complete policies 11.3 times faster than heuristic forward search for a sufficiently large problem, and in all experiments the returned policies were found to have a mean expected reward that differed from the true optimal policy by a few percent. We then applied Vulcan to an exploration problem with approximately unique states, and found a policy that appropriately balanced risk against reward in 180 seconds.
The authors would like to acknowledge and thank the Exxon Mobil Corporation for their financial support (grant EM09079).
- [Alden SmithAlden Smith1992] Alden, J. M., & Smith, R. L. 1992. Rolling horizon procedures in nonhomogeneous Markov decision processes. Operations Research, 40(3-supplement-2), S183–S194.
- [AltmanAltman1999] Altman, E. 1999. Constrained Markov decision processes, 7. CRC Press.
- [Auer, Cesa-Bianchi, FischerAuer et al.2002] Auer, P., Cesa-Bianchi, N., & Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3), 235–256.
- [BisschopBisschop2006] Bisschop, J. 2006. AIMMS optimization modeling. Lulu.com.
- [Browne, Powley, Whitehouse, Lucas, Cowling, Rohlfshagen, Tavener, Perez, Samothrakis, ColtonBrowne et al.2012] Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., & Colton, S. 2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 1–43.
- [Chang MarcusChang Marcus2003] Chang, H. S., & Marcus, S. I. 2003. Approximate receding horizon approach for Markov decision processes: Average reward case. Journal of Mathematical Analysis and Applications, 286(2), 636–651.
- [Dolgov DurfeeDolgov Durfee2005] Dolgov, D., & Durfee, E. 2005. Stationary deterministic policies for constrained MDPs with multiple rewards, costs, and discount factors. Ann Arbor, 1001, 48109.
- [FeinbergFeinberg2000] Feinberg, E. A. 2000. Constrained discounted Markov decision processes and Hamiltonian cycles. Mathematics of Operations Research, 25(1), 130–140.
- [Geibel WysotzkiGeibel Wysotzki2005] Geibel, P., & Wysotzki, F. 2005. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research, 24, 81–108.
- [Gittins, Glazebrook, WeberGittins et al.2011] Gittins, J., Glazebrook, K., & Weber, R. 2011. Multi-armed bandit allocation indices. John Wiley & Sons.
- [Heyman SobelHeyman Sobel1982] Heyman, D. P., & Sobel, M. J. 1982. Stochastic models in operations research: stochastic optimization, 2. Courier Corporation.
- [HildebrandHildebrand1987] Hildebrand, F. B. 1987. Introduction to numerical analysis. Courier Corporation.
- [Kocsis SzepesváriKocsis Szepesvári2006] Kocsis, L., & Szepesvári, C. 2006. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, 282–293. Springer.
- [Ono, Kuwata, BalaramOno et al.2012] Ono, M., Kuwata, Y., & Balaram, J. 2012. Joint chance-constrained dynamic programming. In Decision and Control (CDC), 2012 IEEE 51st Annual Conference on, 1915–1922. IEEE.
- [Ono, Williams, BlackmoreOno et al.2013] Ono, M., Williams, B. C., & Blackmore, L. 2013. Probabilistic planning for continuous dynamic systems under bounded risk. Journal of Artificial Intelligence Research, 46, 511–577.
- [Paruchuri, Tambe, Ordonez, KrausParuchuri et al.2004] Paruchuri, P., Tambe, M., Ordonez, F., & Kraus, S. 2004. Towards a formalization of teamwork with resource constraints. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems-Volume 2, 596–603. IEEE Computer Society.
- [PutermanPuterman2014] Puterman, M. L. 2014. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
- [RossmanRossman1977] Rossman, L. A. 1977. Reliability-constrained dynamic programing and randomized release rules in reservoir management. Water Resources Research, 13(2), 247–255.
- [Santana, Thiébaux, WilliamsSantana et al.2016] Santana, P., Thiébaux, S., & Williams, B. 2016. RAO*: An algorithm for chance-constrained POMDPs. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI’16), 3308–3314.
- [Silver, Huang, Maddison, Guez, Sifre, Van Den Driessche, Schrittwieser, Antonoglou, Panneershelvam, Lanctot, et al.Silver et al.2016] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.
- [SucarSucar2011] Sucar, L. E. 2011. Decision Theory Models for Applications in Artificial Intelligence: Concepts and Solutions. IGI Global.
- [Undurti HowUndurti How2010] Undurti, A., & How, J. P. 2010. An online algorithm for constrained POMDPs. In Robotics and Automation (ICRA), 2010 IEEE International Conference on, 3966–3973. IEEE.
- [Williams RasmussenWilliams Rasmussen2006] Williams, C. K., & Rasmussen, C. E. 2006. Gaussian processes for machine learning. The MIT Press, 2(3), 4.