Convex Hull Monte-Carlo Tree Search

03/09/2020 ∙ by Michael Painter, et al. ∙ University of Oxford

This work investigates Monte-Carlo planning for agents in stochastic environments with multiple objectives. We propose the Convex Hull Monte-Carlo Tree-Search (CHMCTS) framework, which builds upon Trial Based Heuristic Tree Search and Convex Hull Value Iteration (CHVI), as a solution to multi-objective planning in large environments. Moreover, we consider how to pose the problem of approximating multi-objective planning solutions as a contextual multi-armed bandit problem, giving a principled motivation for how to select actions from the view of contextual regret. This leads us to the use of Contextual Zooming for action selection, yielding Zooming CHMCTS. We evaluate our algorithm using the Generalised Deep Sea Treasure environment, demonstrating that Zooming CHMCTS can achieve a sublinear contextual regret and scales better than CHVI on a given computational budget.




1 Introduction

In Multi-Objective Planning under Uncertainty (MOPU) an agent has to plan in a stochastic environment while trading off multiple objectives. We often assume that the agent’s environment is known and modelled as a Multi-Objective Markov Decision Process (MOMDP). This problem is useful to consider because agents often encounter environments with uncertainty in their dynamics. Additionally, it is often useful for agents to be able to balance different objectives, for which the priorities are not known a priori or may change over time. MOPU has been applied to tackle problems in several domains, such as having a robot trade off between battery consumption and performance in its primary task [12], or between timely achievement of a primary task and achieving as many soft goals as possible [11]; or when planning how to lay new electrical lines while trading installation and operational costs for network reliability [16]. The objective of MOPU is to compute a set of values, and associated policies, known as a Pareto Front, each of which is optimal for some prioritisation over the objectives. Unfortunately, because there is no single optimal value, as there is in single-objective planning, there is an additional dimension of complexity to handle. In single-objective planning under uncertainty, two techniques have recently led to a significant improvement in the scalability of algorithms: Monte-Carlo Tree-Search (MCTS) [10] and Value Function Approximation [19]. Note that these techniques do not have to be mutually exclusive, as demonstrated in a system such as AlphaGo [17]. However, there has been relatively little work in adapting either of these techniques to the multi-objective setting [14, 22, 6, 13, 1].

An additional gap that needs to be addressed in the multi-objective setting is the need for principled approaches to evaluate the online performance of trial based planning algorithms (such as MCTS). In this paper, we propose a regret based metric to do so. We consider a sequence of trials, each with a known priority over the different objectives, such that the performance of the algorithm can be mapped to a scalar value that we wish to optimise. This leads us to use the notion of Contextual Regret as a measure of the online performance of a MOPU algorithm.

The main contributions of this work are: (1) applying the notion of contextual regret to multi-objective planning, and justifying that exploration policies that achieve low contextual regret explore the trade-offs between objectives appropriately, unlike policies that optimise other metrics proposed in the literature; and (2) proposing Contextual Zooming for Trees, which outperforms prior work on this metric.

2 Related Work

In the Multi-Objective Multi-Armed Bandit (MOMAB) problem, an agent must pick one of $K$ arms round by round so as to optimise a multi-objective payoff vector. The MOMAB problem can be considered a special case of MOPU, as it can be mapped to a finite horizon MOMDP with a single state with $K$ actions, one for each arm. The authors of [drugan2013designing] address the MOMAB problem by defining a set of arms, the Pareto Front, that are all considered optimal, in the sense that the performance in one objective cannot be improved without degrading the performance for at least one of the other objectives. They extend the well known UCB1 [2] algorithm to the MOMAB problem, using a multi-dimensional confidence interval rather than a one-dimensional confidence interval. Because the algorithm was adapted from UCB1, they refer to it as ParetoUCB1.

Multi-objective sequential decision making (which includes multi-objective planning) extends ideas from single-objective sequential decision making algorithms to handle a vector of rewards. In such problems, the solution to be computed is a Pareto Front or a Convex Hull, which is generally accepted to represent every possible trade-off between objectives that could be made. For an in-depth introduction to the field, see [15].

Prior work in Multi-Objective Monte-Carlo Tree-Search (MOMCTS) can be divided into two categories: those that maintain a single scalar value estimate at each node, which we will call Point-Based MOMCTS, and those that maintain an estimate of the Pareto front at each node.

In terms of point-based MOMCTS, the authors of [wang2012multi] maintain a global Pareto front for the root node and, during trials, select successor nodes based on how “close” the value estimates in the child nodes are to the global Pareto front. In [6] the Pareto front for a node is formed from the value estimates of the child nodes, and actions are selected by running ParetoUCB1 over those points.

In terms of maintaining an estimate of the Pareto front at each node, the authors of [perez2013online, perez2015multiobjective, perez2016multi] define an MOMCTS algorithm for games with deterministic transitions. The deterministic assumption simplifies the operations for updating the Pareto fronts. Our algorithm uses generalised versions of these updates, allowing for stochastic settings too. Their rule for selecting successor nodes is adapted from the UCB1 algorithm [2], applied to the hypervolumes of child nodes. The hypervolume can be thought of as the area under the Pareto front. The authors of [xu2017chebyshev] use the same algorithm as [perez2013online], but replace the hypervolumes in UCB1 with the Chebychev scalarisation function.

All the works described above either assume a deterministic environment or do not maintain an estimated Pareto front at each search node. In this paper, we will show why maintaining an estimated Pareto front at each search node is required to fully explore the space of solutions, and introduce an approach that does so for stochastic models.

3 Preliminaries

3.1 DP-UCT

The authors of [keller2013trial] introduce the Trial-Based Heuristic Tree Search (THTS) framework, which generalises trial-based planning algorithms for (single-objective) Markov Decision Processes (MDPs). In particular, this framework can be specialised to give the Monte-Carlo Tree-Search (MCTS) algorithm [4] and the Upper Confidence Bounds applied to Trees (UCT) algorithm [10], the most used variant of MCTS, which uses UCB1 for action selection. THTS builds a search tree from decision nodes and chance nodes that correspond to states and state-action pairs in an MDP, respectively. Moreover, THTS is defined modularly, with different variants of algorithms being specified using seven functions, including selectAction, backupDecisionNode and backupChanceNode. For completeness we give an overview of THTS and pseudocode in Appendix A.

Due to the modularity of THTS, it is easy to arrive at new algorithms by altering one or a few of the seven functions. The authors of [keller2013trial] use this technique to arrive at DP-UCT, an adaptation of the standard UCT algorithm that replaces Monte-Carlo backups with dynamic programming backups, and is shown empirically to outperform UCT in many domains. Broadly speaking, the DP-UCT algorithm is split into trials. Each trial traverses the tree until a leaf node is found. Once a leaf node has been reached, all nodes that were visited during the trial are backed up to update their value estimates. selectAction runs UCB1 at each decision node and selectOutcome samples successor states for each chance node. The backupDecisionNode and backupChanceNode functions perform Bellman backups in each node, using its children as the successor states in the backup.
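As a rough illustration of the trial structure just described, the following Python sketch combines UCB1 action selection with Bellman backups on a tiny hand-made MDP. The `TOY_MDP` layout, the class names, the `bias` constant and the choice to run every trial down to the horizon are all assumptions made for this sketch; it is not the THTS pseudocode of Appendix A.

```python
import math
import random

# Hypothetical toy MDP: TOY_MDP[state][action] = list of (prob, next_state, reward).
TOY_MDP = {
    "s0": {"left": [(1.0, "s1", 0.0)],
           "right": [(0.5, "s1", 1.0), (0.5, "s2", 0.0)]},
    "s1": {"stay": [(1.0, "end", 2.0)]},
    "s2": {"stay": [(1.0, "end", 0.0)]},
}

class DecisionNode:
    def __init__(self, state, depth):
        self.state, self.depth = state, depth
        self.visits, self.value = 0, 0.0
        self.children = {}            # action -> ChanceNode

class ChanceNode:
    def __init__(self, outcomes):
        self.outcomes = outcomes      # list of (prob, next_state, reward)
        self.visits, self.value = 0, 0.0
        self.children = {}            # next_state -> DecisionNode

def select_action(node, bias):
    # UCB1: try every action once, then maximise value plus exploration bonus.
    for action, child in node.children.items():
        if child.visits == 0:
            return action
    return max(node.children, key=lambda a: node.children[a].value
               + bias * math.sqrt(math.log(node.visits) / node.children[a].visits))

def run_trial(node, mdp, horizon, bias, rng):
    node.visits += 1
    if node.depth == horizon or node.state not in mdp:
        return                        # leaf node: its value stays at zero
    if not node.children:
        node.children = {a: ChanceNode(o) for a, o in mdp[node.state].items()}
    chance = node.children[select_action(node, bias)]
    chance.visits += 1
    # selectOutcome: sample a successor state from the transition distribution.
    probs = [p for p, _, _ in chance.outcomes]
    _, nxt, _ = rng.choices(chance.outcomes, weights=probs)[0]
    if nxt not in chance.children:
        chance.children[nxt] = DecisionNode(nxt, node.depth + 1)
    run_trial(chance.children[nxt], mdp, horizon, bias, rng)
    # Bellman backups; successors that are not yet expanded contribute zero.
    chance.value = sum(
        p * (r + (chance.children[s].value if s in chance.children else 0.0))
        for p, s, r in chance.outcomes)
    node.value = max(c.value for c in node.children.values() if c.visits > 0)
```

Calling `run_trial(root, TOY_MDP, 2, 1.0, random.Random(0))` repeatedly on `root = DecisionNode("s0", 0)` backs the root value up towards the best action's expected return.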

In this work, we specify Convex Hull Monte-Carlo Tree-Search under the THTS framework by describing the adaptations made to DP-UCT, rather than re-describing the standard parts of the algorithm. In particular, Section 4 details how to replace the backupDecisionNode and backupChanceNode functions in DP-UCT and Section 5 describes how to alter the selectAction function.

3.2 Multi-Objective Planning under Uncertainty

We will model MOPU problems using a Multi-Objective Markov Decision Process:

Definition 1.

A finite horizon Multi-Objective Markov Decision Process (MOMDP) is a tuple $M = \langle S, A, \mathbf{R}, T, s_0, H \rangle$, where: $S$ is a finite set of states; $A$ is a finite set of actions; $\mathbf{R} : S \times A \to \mathbb{R}^D$ is a $D$-dimensional reward function, specifying immediate rewards for taking action $a \in A$ when in state $s \in S$; $T : S \times A \times S \to [0, 1]$ is a transition function specifying, for each state, action and next state, the probability that the next state occurs given the current state and action; $s_0 \in S$ is the initial state; and $H$ is a finite horizon.

In MOPU we are concerned with the optimisation of a vector $\sum_{i} \mathbf{r}_i$, the sum of each reward observed over the time-steps, by an agent following a policy $\pi$. A (stochastic) policy maps histories of visited states to distributions over actions: given a history of visited states $h$, $\pi(a \mid h)$ represents the probability of the policy choosing to execute $a$, given that the history of visited states is $h$. Given a MOMDP $M$, a policy $\pi$ and a state $s$, we define the random variable representing the cumulative reward attained in $t$ timesteps starting in $s$ and following $\pi$:

$$\mathbf{Z}^{\pi}_{t}(s) = \sum_{i=0}^{t-1} \mathbf{r}_i, \qquad (1)$$

where $\mathbf{r}_i$ is the reward observed at timestep $i$ according to $M$.

Definition 2.

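The value of a policy $\pi$ is a function $\mathbf{V}^{\pi}_{t} : S \to \mathbb{R}^D$ such that:

$$\mathbf{V}^{\pi}_{t}(s) = \mathbb{E}\left[\mathbf{Z}^{\pi}_{t}(s)\right]. \qquad (2)$$

In the remainder of this paper, when $t = H$ and $s = s_0$ we will omit them and simply write $\mathbf{Z}^{\pi}$ and $\mathbf{V}^{\pi}$. For a set of policies $\Pi$, we define the set of points $\mathrm{Vals}(\Pi)$ such that $\mathrm{Vals}(\Pi) = \{\mathbf{V}^{\pi} \mid \pi \in \Pi\}$. Often there will be no single best policy that can be chosen, as one may encounter a situation where we have $V^{\pi}_1 > V^{\pi'}_1$ but $V^{\pi}_2 < V^{\pi'}_2$, for some policies $\pi$ and $\pi'$. However, we can define a partial ordering over multi-objective values: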

Definition 3.

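We say that a point $\mathbf{u} \in \mathbb{R}^D$ weakly Pareto dominates another point $\mathbf{v} \in \mathbb{R}^D$ (denoted $\mathbf{u} \succeq \mathbf{v}$) if $u_i \ge v_i$ for all $i$; $\mathbf{u}$ Pareto dominates $\mathbf{v}$ (denoted $\mathbf{u} \succ \mathbf{v}$) if $\mathbf{u} \succeq \mathbf{v}$ and $\mathbf{u} \ne \mathbf{v}$; $\mathbf{u}$ is incomparable to $\mathbf{v}$ (denoted $\mathbf{u} \parallel \mathbf{v}$) if $\mathbf{u} \not\succeq \mathbf{v}$ and $\mathbf{v} \not\succeq \mathbf{u}$.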

Given Pareto domination as a partial ordering over multi-objective values, we can now define an ‘optimal set’ of policies, in which no policy Pareto dominates another.

Definition 4.

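Let $\Pi$ be a set of policies. The Pareto Front $PF(\Pi)$ is the set of policies in $\Pi$ that are not Pareto dominated by any other policy in $\Pi$:

$$PF(\Pi) = \{\pi \in \Pi \mid \nexists\, \pi' \in \Pi.\ \mathbf{V}^{\pi'} \succ \mathbf{V}^{\pi}\}. \qquad (3)$$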
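The dominance relation and the Pareto front can be checked mechanically. The sketch below is a minimal illustration that works directly on value vectors rather than on policies (a simplification made here); the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def weakly_dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    # u is at least as good as v in every objective.
    return all(ui >= vi for ui, vi in zip(u, v))

def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    # Weak dominance plus a difference in at least one objective.
    return weakly_dominates(u, v) and tuple(u) != tuple(v)

def pareto_front(points: List[Point]) -> List[Point]:
    # Drop duplicates (keeping order), then keep the non-dominated points.
    unique = list(dict.fromkeys(points))
    return [p for p in unique if not any(dominates(q, p) for q in unique)]
```

For instance, among the points (1, 0), (0, 1), (0.4, 0.4) and (0.3, 0.3), only the last is removed, because (0.4, 0.4) dominates it.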


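An alternative method to overcome the lack of a total order for multi-objective values is to project them onto some scalar value that can be compared. First, we define the set of normalised $D$-dimensional weights as:

$$\mathcal{W} = \Big\{\mathbf{w} \in \mathbb{R}^D \ \Big|\ w_i \ge 0,\ \sum_{i=1}^{D} w_i = 1\Big\}. \qquad (4)$$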

Definition 5.

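A scalarisation function $f_{\mathbf{w}} : \mathbb{R}^D \to \mathbb{R}$ maps a multi-objective value into a scalar value, where $\mathbf{w} \in \mathcal{W}$. The linear scalarisation is $f_{\mathbf{w}}(\mathbf{u}) = \mathbf{w}^\top \mathbf{u}$, which is (strictly) monotonic (i.e. if and only if ).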

A weight $\mathbf{w} \in \mathcal{W}$ and the linear scalarisation function give a total ordering over policies, as $\pi$ and $\pi'$ can be compared by comparing the scalar values $\mathbf{w}^\top \mathbf{V}^{\pi}$ to $\mathbf{w}^\top \mathbf{V}^{\pi'}$. Therefore, we can define another optimal set of policies as those that are optimal for some weight vector:

Definition 6.

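The Convex Hull $CH(\Pi)$ is the set of policies in $\Pi$ that are optimal for some linear scalarisation:

$$CH(\Pi) = \{\pi \in \Pi \mid \exists\, \mathbf{w} \in \mathcal{W}.\ \forall \pi' \in \Pi.\ \mathbf{w}^\top \mathbf{V}^{\pi} \ge \mathbf{w}^\top \mathbf{V}^{\pi'}\}. \qquad (5)$$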
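To see how the convex hull can differ from the Pareto front, the sketch below tests, for two objectives only, whether a value vector is optimal for some weight $(w_1, 1 - w_1)$ with $w_1 \in [0, 1]$. Each competitor imposes one linear constraint on $w_1$, so the test reduces to intersecting intervals. The function names and the restriction to two objectives are assumptions of this sketch.

```python
def scalarise(w, u):
    # Linear scalarisation: the weighted sum of the objectives.
    return sum(wi * ui for wi, ui in zip(w, u))

def optimal_for_some_weight(p, points):
    # Feasible interval for w1, where the weight vector is (w1, 1 - w1).
    lo, hi = 0.0, 1.0
    for q in points:
        # w.p - w.q = a * w1 + b must be non-negative.
        a = (p[0] - q[0]) - (p[1] - q[1])
        b = p[1] - q[1]
        if a > 0:
            lo = max(lo, -b / a)
        elif a < 0:
            hi = min(hi, -b / a)
        elif b < 0:
            return False
    return lo <= hi

def convex_hull_2d(points):
    return [p for p in points if optimal_for_some_weight(p, points)]
```

With the points (1, 0), (0, 1) and (0.4, 0.4), all three are Pareto optimal, but (0.4, 0.4) is not optimal for any weight and so is excluded from the convex hull.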


If is defined as the set of stochastic policies then . In order to represent convex hulls and Pareto fronts of more compactly, we consider the notion of Convex Coverage Set of .

Definition 7.

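A Convex Coverage Set $CCS(\Pi)$ is any set such that:

$$CCS(\Pi) \subseteq CH(\Pi) \ \wedge\ \forall \mathbf{w} \in \mathcal{W}.\ \exists\, \pi \in CCS(\Pi).\ \forall \pi' \in \Pi.\ \mathbf{w}^\top \mathbf{V}^{\pi} \ge \mathbf{w}^\top \mathbf{V}^{\pi'}. \qquad (6)$$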


For computational reasons, one typically wants to maintain a minimal set of policies that is still a $CCS(\Pi)$. We refer the reader to [15] for more details on the relation between $PF(\Pi)$, $CH(\Pi)$ and $CCS(\Pi)$.

To compare sets of points it is common to consider the hypervolume, due to its monotonicity with respect to Pareto domination [20].

Definition 8.

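The hypervolume of a set of points $P \subset \mathbb{R}^D$ can be defined with respect to a reference point $\mathbf{r} \in \mathbb{R}^D$ as follows:

$$HV(P; \mathbf{r}) = \Lambda\Big(\bigcup_{\mathbf{p} \in P} \{\mathbf{q} \in \mathbb{R}^D \mid \mathbf{p} \succeq \mathbf{q} \succeq \mathbf{r}\}\Big), \qquad (7)$$

where $\Lambda$ is the $D$-dimensional Lebesgue measure. For example, in two dimensions with $\mathbf{r} = \mathbf{0}$, this equates to the area between $P$ and the $x$ and $y$ axes.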
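In two dimensions the hypervolume is the area of a union of axis-aligned rectangles, which a single sweep can compute. The following sketch assumes two objectives that are both maximised; the function name and default reference point are choices made for this illustration.

```python
def hypervolume_2d(points, ref=(0.0, 0.0)):
    # Only points that improve on the reference in both objectives add area.
    pts = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Sweep from the largest first objective to the smallest.
    pts.sort(key=lambda p: (-p[0], -p[1]))
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:
            # New horizontal strip not covered by any earlier rectangle.
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area
```

The points (1, 2) and (2, 1) give two overlapping rectangles with union area 3; adding the dominated point (1, 1) leaves the hypervolume unchanged, while adding (2, 2) raises it to 4.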

3.3 Convex Hull Value Iteration

The authors of [barrett2008learning] proposed an algorithm that extends Value Iteration [3] to handle multiple objectives, named Convex Hull Value Iteration (CHVI). CHVI computes $CH(\Pi)$ for an infinite-horizon MOMDP by computing a finite set of deterministic policies that is a $CCS(\Pi)$, where $\Pi$ is the set of stochastic policies. The same algorithm can be adapted to compute a $CCS(\Pi)$ for a finite-horizon MOMDP, by computing the values for each timestep in the horizon.

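As the first step to arriving at CHVI, we define arithmetic rules over a set of points. For sets of points $U, V \subset \mathbb{R}^D$, point $\mathbf{b} \in \mathbb{R}^D$, and scalar $\lambda \in \mathbb{R}$, we define:

$$\mathbf{b} + \lambda U = \{\mathbf{b} + \lambda \mathbf{u} \mid \mathbf{u} \in U\}, \qquad (8)$$

$$U + V = \{\mathbf{u} + \mathbf{v} \mid \mathbf{u} \in U,\ \mathbf{v} \in V\}. \qquad (9)$$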


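Then, to arrive at the CHVI algorithm, we can replace the fixed point equations used in Value Iteration as follows:

$$\mathcal{V}(s) = \mathrm{prune}\Big(\bigcup_{a \in A} \mathcal{Q}(s, a)\Big), \qquad (10)$$

$$\mathcal{Q}(s, a) = \mathbb{E}_{s'}\left[\mathbf{R}(s, a) + \mathcal{V}(s')\right], \qquad (11)$$

where $\mathcal{V}(s)$ and $\mathcal{Q}(s, a)$ are sets of points, and $\mathrm{prune}$ is an operation that removes Pareto-dominated points from a set. For details on how to compute $\mathrm{prune}$ see the survey [roijers2013survey]. The expectation is taken with respect to the next state $s'$, and can be computed using Equations (8) and (9) in the standard definition of expectation for a discrete random variable.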
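The set arithmetic and the two backups can be sketched with Python sets of tuples. This is an illustrative reading of the description above, not the authors' implementation: the function names are hypothetical, pruning is done by a quadratic Pareto filter, and the expectation is expanded as a probability-weighted sum of the successors' point sets.

```python
def translate_scale(b, lam, U):
    # b + lam * U: scale every point of U by lam, then shift by b.
    return {tuple(bi + lam * ui for bi, ui in zip(b, u)) for u in U}

def set_sum(U, V):
    # U + V: all pairwise sums of a point from U and a point from V.
    return {tuple(ui + vi for ui, vi in zip(u, v)) for u in U for v in V}

def prune(P):
    # Remove every point that is Pareto dominated by another point of P.
    def dominates(u, v):
        return all(a >= b for a, b in zip(u, v)) and u != v
    return {p for p in P if not any(dominates(q, p) for q in P)}

def backup_chance(reward, outcomes):
    # outcomes: list of (probability, point set stored for the next state).
    zero = tuple(0.0 for _ in reward)
    expected = {zero}
    for prob, next_points in outcomes:
        expected = set_sum(expected, translate_scale(zero, prob, next_points))
    return prune(translate_scale(reward, 1.0, expected))

def backup_decision(action_sets):
    # Union over actions of the chance-node sets, followed by pruning.
    return prune(set().union(*action_sets))
```

Each point produced by `backup_chance` corresponds to one way of choosing a point per successor state, which is why the pairwise set sum appears inside the expectation.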

3.4 Contextual Regret

The Contextual Bandit framework extends the standard Multi-Armed Bandit (MAB) problem [2]. In this framework we are given a context $x_t \in X$ in round $t$ and have to pick an arm $y_t \in Y$. Subsequently a scalar reward $r_t \sim \mathcal{F}_{x_t, y_t}$ is received, where $\mathcal{F}_{x, y}$ is a reward distribution for arm $y$ in context $x$, with expected value $\mu(x, y)$. We denote the shared context-arm space as $\mathcal{P} \subseteq X \times Y$, and denote the optimal value of context $x$ as $\mu^*(x) = \max_{y} \mu(x, y)$.

Definition 9.

The contextual regret for $T$ rounds is defined as follows:

$$R(T) = \sum_{t=1}^{T} \left(\mu^*(x_t) - \mu(x_t, y_t)\right), \qquad (12)$$

where $y_t$ is the arm chosen at round $t$.

Typically in contextual MABs, the aim is to maximise the total payoff $\sum_{t=1}^{T} \mu(x_t, y_t)$, which corresponds to minimising the contextual regret. Algorithms usually aim to achieve sublinear regret, i.e. pick arms such that $R(T) = o(T)$, where $f(T) = o(g(T))$ if $f(T)/g(T) \to 0$ as $T \to \infty$.

As the problem is very general, works often make additional assumptions, or assume additional knowledge, to make the problem more tractable. In particular, one assumption made by [slivkins2014contextual] is having access to a metric space $(\mathcal{P}, \mathcal{D})$ called the similarity space, such that the following Lipschitz condition holds:

$$|\mu(x, y) - \mu(x', y')| \le \mathcal{D}\big((x, y), (x', y')\big). \qquad (13)$$


This allows us to reason about the expected values of contexts that have not been seen in the past.

4 Convex Hull Monte-Carlo Tree-Search

To motivate the design of our tree-search algorithm, we consider an example MOMDP with a horizon of length three. We will see that any Monte-Carlo tree-search algorithm that maintains a sample average at each node, i.e. the average reward obtained from each action, will perform suboptimally.

Example 1.

Consider state and actions , and in the MOMDP in Fig. 1 with a horizon of 2. It is clearly optimal to take action in for both objectives (and mixes thereof), and the true Pareto front is . The sample averages for , and are , and , for some (note that choosing action from will have a return of either or , and , represents a weighted average of the two possible returns). If one used these sample averages, the estimated Pareto front at for a given would be . Therefore, if we try to use the sample averages set to extract a policy, then it may be suboptimal. For example, if then and the policy extracted to maximise only the second objective would have , since , i.e. the sample average for the second objective is higher for than it is for .

Figure 1: A deterministic MOMDP with six states, two-dimensional rewards, and initial state . As it is deterministic, all transition probabilities are one, and so are omitted.

Given Example 1, it is clear that we need something more than just a single point in each node, so we will consider sets of points to approximate a Pareto front at each node. In the remainder of this section, we describe the backup functions that can be used as part of the THTS schema [8]. We refer to any Monte-Carlo Tree-Search algorithm that uses these backups as a Convex Hull Monte-Carlo Tree-Search (CHMCTS).

4.1 Backup Functions

For every THTS decision node, corresponding to some state $s$, we store a set of points approximating $\mathcal{V}(s)$, and for every chance node, corresponding to some state-action pair $(s, a)$, we store a set of points approximating $\mathcal{Q}(s, a)$. The backupDecisionNode function updates the set approximating $\mathcal{V}(s)$ using Equation (10), but replacing each $\mathcal{Q}(s, a)$ with its approximation stored in the corresponding child chance node. Similarly, backupChanceNode updates the approximation of $\mathcal{Q}(s, a)$ using Equation (11), replacing each $\mathcal{V}(s')$ with the approximation stored in the corresponding child decision node.

5 Action Selection

Now we consider how to select actions from decision nodes (recalling decision nodes correspond to states in an MOMDP). We present the problem of policy selection (i.e. selecting all of the actions for a trial) framed as a Contextual Multi-Armed Bandit problem (Section 3.4):

Definition 10.

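The (Linear) Contextual Policy Selection problem is a special case of the Contextual Bandit problem. Let $M$ be a MOMDP and $\Pi$ the corresponding set of policies. For $T$ rounds (or trials), we perform the following sequence of operations: (1) receive a context $\mathbf{w}_t \in \mathcal{W}$; (2) select a policy $\pi_t \in \Pi$ to follow for this trial; (3) receive a cumulative reward $\mathbf{w}_t^\top \mathbf{z}_t$, where $\mathbf{z}_t \sim \mathbf{Z}^{\pi_t}$ as defined in Equation (1); recall that $\mathbb{E}[\mathbf{Z}^{\pi_t}] = \mathbf{V}^{\pi_t}$. The objective over $T$ rounds is to select a sequence of policies $\pi_1, \ldots, \pi_T$ that maximises the expected cumulative payoff $\mathbb{E}\big[\sum_{t=1}^{T} \mathbf{w}_t^\top \mathbf{z}_t\big]$.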

5.1 Design by Regret

In THTS, designing action selection using regret metrics typically has two main benefits: it gives a direct measure of the online performance of the action selection; and it is a good way to balance the exploration-exploitation trade-off, as it selects arms proportionally to how good the performance of each arm has been in the past. Previous multi-objective works [7, 6] have considered the notion of Pareto regret:

Definition 11.

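The Pareto Suboptimality Gap (PSG) for selecting policy $\pi$ is defined as:

$$\Delta(\pi) = \inf\{\epsilon \ge 0 \mid \nexists\, \pi' \in \Pi.\ \mathbf{V}^{\pi'} \succ \mathbf{V}^{\pi} + \epsilon \mathbf{1}\},$$

where $\mathbf{1}$ is a vector of ones. Intuitively, we can think of $\Delta(\pi)$ as how much needs to be added to $\mathbf{V}^{\pi}$ so that it is Pareto optimal, i.e. it is not dominated by any other policy. Let $N_T(\pi)$ be the number of times that $\pi$ was selected for the $T$ trials. The Pareto regret is defined as:

$$PR(T) = \sum_{\pi \in \Pi} N_T(\pi)\, \Delta(\pi).$$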


To demonstrate why Pareto regret is not the most suitable regret metric, we consider Example 2, which demonstrates that optimising for the Pareto regret does not correspond to our objective of computing the CCS for a MOMDP.

Example 2.

Consider the algorithm that uses the (single-objective) UCT algorithm [10] to find $\pi^*_i$, the optimal policy for the $i$th objective, and then continues to follow $\pi^*_i$. UCT is known to have sublinear regret for the $i$th objective and, because $\pi^*_i$ is Pareto-optimal, it also has sublinear Pareto regret. However, this algorithm does not align well with the objective of computing a CCS, as it focuses only on one Pareto-optimal policy and does not explore the rest of the CCS.

Following from this, we introduce Linear Contextual Regret, a special case of contextual regret, that is well correlated with approximating the CCS.

Definition 12.

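Let $\mathbf{w}_1, \ldots, \mathbf{w}_T$ be a context weight vector, i.e. a sequence of weights sampled uniformly from $\mathcal{W}$. The Linear Contextual Regret (LCR) for the policy selection $\pi_1, \ldots, \pi_T$ is defined by:

$$LCR(T) = \sum_{t=1}^{T} \Big(\max_{\pi \in \Pi} \mathbf{w}_t^\top \mathbf{V}^{\pi} - \mathbf{w}_t^\top \mathbf{V}^{\pi_t}\Big).$$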


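We consider LCR because it will penalise any algorithm that cannot find a $CCS(\Pi)$: if there exists some $\mathbf{w} \in \mathcal{W}$ such that no policy $\pi$ with $\mathbf{w}^\top \mathbf{V}^{\pi} = \max_{\pi' \in \Pi} \mathbf{w}^\top \mathbf{V}^{\pi'}$ was found, then whenever a weight close to $\mathbf{w}$ is sampled, the algorithm will accumulate regret.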
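The penalty described above is easy to see numerically. The sketch below computes the regret of a sequence of choices against the best linearly scalarised value for each sampled weight. It takes expected value vectors directly rather than policies, and the function and argument names are hypothetical.

```python
def linear_contextual_regret(weights, chosen_values, all_values):
    # weights[t]: context weight vector for trial t.
    # chosen_values[t]: expected value vector of the policy followed in trial t.
    # all_values: expected value vectors of every available policy.
    regret = 0.0
    for w, v in zip(weights, chosen_values):
        best = max(sum(wi * ui for wi, ui in zip(w, u)) for u in all_values)
        regret += best - sum(wi * vi for wi, vi in zip(w, v))
    return regret
```

With values (1, 0) and (0, 1) available, always following the first policy incurs regret 1 on the weight (0, 1) and none on the weights (1, 0) and (0.5, 0.5), whereas choosing the best policy for each weight incurs none at all.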

Recall from Definition 10 that we aim to maximise the expected cumulative payoff, which is equivalent to minimising the expected LCR. Note that for any algorithm that achieves a sublinear LCR, the average regret will tend to zero, and in the limit of the number of trials the algorithm must almost surely act optimally for all weight vectors.

5.2 Exploration Policies

We now formally define exploration policies:

Definition 13.

An exploration policy for MOMDP is a function of the form .

In essence, an exploration policy maps a history and a context weight vector to an action, i.e. it extends a policy to consider the context. Algorithms for the contextual policy selection problem must specify a sequence of exploration policies , where the MOMDP policy followed on trial will be . The set of exploration policies for a MOMDP can be divided into two broad classes:

Definition 14.

A context-free exploration policy is one such that . An exploration policy is context-aware if it is not context-free.

Theorem 1 shows any sequence of context-free exploration policies can suffer linear LCR.

Theorem 1.

For some MOMDP $M$, for any sequence of context-free policies, the expected LCR over $T$ trials is $\Omega(T)$.

Proof Outline.
Figure 2: A deterministic MOMDP with three states, two-dimensional rewards and a horizon of one.

(See Appendix B for full proof.) Consider the MOMDP from Fig. 2. If we follow a context-free exploration policy on the $t$th trial, then the action selected at $s_0$ is independent of the weight context vector. Assume, without loss of generality, that for all . As $\mathbf{w}_t$ is sampled uniformly from $\mathcal{W}$, will be the suboptimal action with probability . If is suboptimal, the expected regret suffered is , because if the weight vector is varied from to , then the contextual regret suffered varies from to . On average, the context-free policies will suffer an expected regret of per trial, and thus the cumulative regret over $T$ trials is $\Omega(T)$. ∎
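The argument can be checked numerically on a stand-in problem. The sketch below assumes a one-step MOMDP with two actions whose rewards are (1, 0) and (0, 1); these rewards are an assumption of this illustration, since the rewards of Fig. 2 are not reproduced in this text. The weight is $(w_1, 1 - w_1)$ with $w_1$ uniform on $[0, 1]$, averaged with a midpoint rule.

```python
def expected_regret_per_trial(policy, n_grid=1000):
    # Assumed rewards for the two actions of a one-step, two-objective MOMDP.
    rewards = {"a1": (1.0, 0.0), "a2": (0.0, 1.0)}
    total = 0.0
    for k in range(n_grid):
        w1 = (k + 0.5) / n_grid            # midpoint rule over w1 in [0, 1]
        w = (w1, 1.0 - w1)
        scalar = {a: w[0] * r[0] + w[1] * r[1] for a, r in rewards.items()}
        total += max(scalar.values()) - scalar[policy(w)]
    return total / n_grid

def context_free(w):
    # Ignores the weight vector and always takes the same action.
    return "a1"

def context_aware(w):
    # Picks whichever action is better for the current weight vector.
    return "a1" if w[0] >= w[1] else "a2"
```

Under these assumed rewards the context-free policy pays a constant expected regret on every trial, so its cumulative regret grows linearly in the number of trials, while the context-aware policy pays none.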

5.3 Context-Aware Action Selection

Figure 3: Visualisation of a snapshot of the CZ algorithm, running over three actions with two objectives, where is the context weighting for one objective.

Extending UCB1 to handle a contextual MAB can be hard, as we have an uncountably infinite set of contexts, the weight vectors $\mathcal{W}$. If one has knowledge of a metric that satisfies Equation (13), the problem can be made more tractable. The metric allows contexts to be grouped (in sets with a fixed radius, i.e. balls), and an average value to be maintained over all contexts in each ball. Intuitively, smaller balls allow a more accurate value estimate to be maintained for each context vector. These ideas underlie the Contextual Zooming (CZ) algorithm [18], which modifies UCB1 to run over balls of contexts, and introduces balls of smaller radii as required.

We present CZ by defining the similarity space (i.e. a metric over the context-arm space satisfying Equation (13)); presenting CZ for policy selection; and then adapting CZ to be used in selectAction from THTS. A snapshot of CZ is visualised in Fig. 3, with three actions. The initial three balls of radius one are blue, and balls of radii are green. has been covered by green balls, and will only use value estimates from these green balls, whereas will use the value of the one green ball that it has, if it is relevant, and otherwise use the blue ball.

Similarity Space. Consider the linear contextual policy selection problem and let such that:


where and , i.e. is a value that overestimates the infinity norm of the expected vector reward. Furthermore, is an upper bound on the maximum scalarised reward that can be achieved in a single trial, i.e. . For example, if then these values can be set to . Additionally, observe that the Lipschitz property (Equation (13)) holds when because and it holds when because:


where in the first line we use the definition of and the fact that the modulus and infinity-norm operations are identical on scalar values. We also used the result that for any matrices .

Contextual Zooming. CZ is an algorithm that achieves a contextual regret of , where is the covering dimension. The covering dimension is related to how many balls of a fixed radius are needed to cover the similarity space. For a full explanation and derivation of the regret bound we refer the reader to [18].

Throughout the algorithm a finite set of balls called the set of active balls is maintained. Let be the set of active balls at the start of trial . For us, each ball corresponds to a set of context vectors and has an associated arm (i.e. policy). Whenever CZ needs to select an arm, it will find a set of relevant balls in , compute an upper confidence bound for each relevant ball and select the arm associated with the largest bound. The ball is relevant for the weight vector if there is some arm such that , where:


with the radius of . is used to decide which balls are relevant to consider when making a choice for some context vector. The upper confidence bound for each in the set of relevant balls during round is defined using the following equations:


where is the number of times that ball has been selected in the previous rounds, is the average (scalarised) reward for ball in the previous rounds, is the radius of the ball and is the distance between the centers of balls and .

The algorithm then proceeds by repeatedly applying the following two rules. Selection rule. On round select the ball , from those that are relevant, that has the maximum index . From that ball, select an arm arbitrarily such that . Activation rule. If the ball that was selected satisfies after this round, then a new ball is added to the active set, otherwise we set . When adding a new ball, if was the arm selected by the selection rule in round , then a new ball with center is introduced with radius , and we set . To initialise the set of active balls we add one ball per arm , with center and radius to , where .
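The selection and activation rules can be sketched for a finite set of arms and a one-dimensional context. Because the index equations are not legible in this text, the sketch assumes the form published for contextual zooming in [18]: a confidence radius $4\sqrt{\log T / (1 + n)}$, a pre-index equal to average reward plus radius plus confidence radius, and an index equal to the radius plus the minimum over active balls of pre-index plus distance between centers. The metric `dist`, the initial centers at 0.5, and all class and method names are further assumptions of this illustration; this is not CZT.

```python
import math

class Ball:
    def __init__(self, context, arm, radius):
        self.context, self.arm, self.radius = context, arm, radius
        self.count, self.total = 0, 0.0

def dist(p, q):
    # Assumed metric: context distance within an arm, maximal across arms.
    (x, a), (y, b) = p, q
    return min(abs(x - y), 1.0) if a == b else 1.0

class ContextualZooming:
    def __init__(self, arms, horizon):
        self.arms, self.horizon = list(arms), horizon
        self.balls = [Ball(0.5, a, 1.0) for a in self.arms]

    def conf(self, ball):
        return 4.0 * math.sqrt(math.log(self.horizon) / (1 + ball.count))

    def in_domain(self, ball, point):
        # Domain: the ball minus every active ball of strictly smaller radius.
        if dist((ball.context, ball.arm), point) > ball.radius:
            return False
        return not any(b.radius < ball.radius
                       and dist((b.context, b.arm), point) <= b.radius
                       for b in self.balls)

    def index(self, ball):
        def pre(b):
            mean = b.total / b.count if b.count else 0.0
            return mean + b.radius + self.conf(b)
        centre = (ball.context, ball.arm)
        return ball.radius + min(pre(b) + dist(centre, (b.context, b.arm))
                                 for b in self.balls)

    def select(self, context):
        # Selection rule: the relevant ball with the largest index.
        relevant = [(b, a) for b in self.balls for a in self.arms
                    if self.in_domain(b, (context, a))]
        return max(relevant, key=lambda pair: self.index(pair[0]))

    def update(self, ball, context, arm, reward):
        ball.count += 1
        ball.total += reward
        # Activation rule: zoom in once the ball is sampled often enough.
        if self.conf(ball) <= ball.radius:
            self.balls.append(Ball(context, arm, ball.radius / 2))
```

A round then consists of `ball, arm = cz.select(x)`, observing a scalarised reward for `arm`, and calling `cz.update(ball, x, arm, reward)`; balls of half the radius appear only in regions that have been sampled often.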

Contextual Zooming for Trees. Running CZ directly over policies is infeasible for MOMDPs, because the number of policies grows exponentially in the size of the state space, and additionally the distance metric does not allow two different policies to be close in the similarity space. Instead, we run CZ for the selectAction method in THTS at every decision node. So now we consider a new similarity space associated with the state of the decision node, where , and is defined as:


where . Given this similarity space, the CZ algorithm operates as before, using actions for the arms instead of policies in the contextual MAB problem. Note that for each decision node we, in fact, have a non-stationary contextual multi-armed bandits problem, similar to [9]. We refer to this action selection method as Contextual Zooming for Trees (CZT), and when we use CZT for action selection in CHMCTS we call the algorithm Zooming CHMCTS.

6 Experiments

To validate Zooming CHMCTS we will evaluate its performance on a variable-sized grid world problem, the Generalised Deep Sea Treasure (GDST) problem [20, 21]. Moreover, we will consider its performance in both online and offline planning, where in online planning we assume that the agent follows each policy selected, and we want to maximise the cumulative payoff over many trials. In contrast, in offline planning, we do not care about the performance during the planning phase, but only the quality of policies that can be extracted afterwards.

6.1 Experimental Setup

In the GDST($c$,$p$) problem we consider a two-dimensional grid world, consisting of $c$ columns and a transition noise of $p$. The submarine can move left, right, up or down each time step, with the submarine remaining stationary if it would otherwise leave the grid world. The submarine starts in the top left corner on each trial. On every timestep, the transition noise $p$ is the probability that the submarine is instead swept by a current, moving it in a random direction. The seafloor becomes increasingly deep from left to right (with depth increasing by a small amount between zero and three inclusive for each column), but also holds increasing amounts of treasure at greater depths, ranging in value from one to 1000. There are two objectives: one is to collect the maximal amount of treasure, while the other is to minimise the time cost of reaching the treasure, where a cost of one is incurred for each timestep. Each trial concludes after either steps or as soon as the submarine arrives at a piece of treasure. A visualisation of an environment for $c = 5$ is given in Fig. 4. During planning, we normalise rewards to the range to give equal priority over the different objectives.

Figure 4: A visualisation of a GDST(5,p) problem, gray squares represent the treasure that can be obtained.

6.2 Practical Considerations

Because the CHVI backups from Equations (10) and (11) can be computationally intensive, optimisations can be made to minimise the number of these backup operations performed. Where appropriate we use the following optimisations.

Firstly, the authors of [lizotte2010efficient, lizotte2012linear] demonstrate that performing the relevant computations in weight space, rather than value space, is computationally more efficient. We additionally follow the backup optimisation considered in [perez2015multiobjective]: we can identify when either a backupChanceNode or a backupDecisionNode does not change the value at a node, and subsequently prune the remaining backups for the trial. Also, it is possible to incorporate the ideas from UCD [saffidine2012ucd] that allow nodes to be re-used when a state is re-visited. Finally, for offline planning, we can use the idea of labelling nodes, where a node is labelled if its value has converged. Nodes can be labelled when they are backed up and all of their children are labelled. Leaf nodes are always labelled. By using labelling, we can avoid searching parts of the tree that have already converged.

6.3 Alternative Action Selection Methods

We now state some other action selection methods used in the literature. Let $s$ be the state of the decision node at which an action is being selected, which has been visited $N(s)$ times. For any action $a$, let $N(s, a)$ be the number of times that the corresponding child node has been visited and let $\mathcal{Q}(s, a)$ denote the Pareto front stored in the child node. Most prior works use an exploration policy of the form:

$$\arg\max_{a \in A} \left[ f(\mathcal{Q}(s, a)) + C \sqrt{\frac{\log N(s)}{N(s, a)}} \right],$$

where $C$ is some appropriate constant and $f$ is some scalarisation function that maps a Pareto front to a scalar value. The authors of [perez2013online, perez2015multiobjective, perez2016multi] set

$$f(\mathcal{Q}(s, a)) = HV(\mathcal{Q}(s, a)),$$

where $HV$ is the hypervolume function (Definition 8). This action selection is context-free and we refer to the algorithm that results from this exploration policy as Hypervolume CHMCTS. The authors of [xu2017chebyshev] use the Chebychev scalarisation function instead, setting:


where is called the utopian point, defined by . When we use this action selection we call the algorithm Chebychev CHMCTS. This action selection is actually context-aware, and the authors of [xu2017chebyshev] select the weight vector randomly each trial.
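The general form shared by these selection rules, a scalarised Pareto front plus a UCB1-style exploration bonus, can be sketched as follows for two objectives with the hypervolume as the scalarisation. The function names, the `bias` constant, and the use of the summed child visits as the parent visit count are assumptions of this illustration.

```python
import math

def hypervolume_2d(points, ref=(0.0, 0.0)):
    # Area dominated by the points, measured from the reference point.
    pts = sorted(((x, y) for x, y in points if x > ref[0] and y > ref[1]),
                 key=lambda p: (-p[0], -p[1]))
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area

def select_action(fronts, visits, bias, scalarise=hypervolume_2d):
    # fronts: action -> list of points stored in the child node.
    # visits: action -> number of visits of that child node.
    total = sum(visits.values())
    def score(action):
        if visits[action] == 0:
            return math.inf              # try unvisited actions first
        bonus = bias * math.sqrt(math.log(total) / visits[action])
        return scalarise(fronts[action]) + bonus
    return max(fronts, key=score)
```

Note that the weight vector of the current trial never enters `score`, which is what makes a rule of this form context-free.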

The ParetoUCB1 algorithm is used in [chen2019pareto] for action selection in a MOMAB problem. In our approach, we run ParetoUCB1 over the set , rather than using vectors from child nodes as in point-based MOMCTS. When using this action selection, we refer to our algorithm as Pareto CHMCTS.

6.4 Regret Comparison

Figure 5: The LCR with respect to the number of trials of different CHMCTS algorithms. Each algorithm was run five times.

In this section, we consider a GDST(7,0.01) instance, which is small enough for us to feasibly compute the true CCS, so that the regret is computable. Fig. 5 shows a plot of the cumulative LCR over 100,000 trials on this instance. As we can see in the figure, only Zooming CHMCTS manages to achieve a sublinear regret. These results are consistent with Theorem 1, with context-free action selection methods suffering a linear regret. Interestingly, Chebychev CHMCTS suffers a linear regret too, despite being context-aware. From these results, we hypothesise that a regret bound could be found for CZT. The results also indicate that Zooming CHMCTS outperforms all other variants for online planning (i.e. in the contextual policy selection problem).

6.5 Comparing Action Selections

Figure 6: The hypervolume with respect to the number of CHVI operations. We compare each variant of CHMCTS with the baseline CHVI. 95% Confidence intervals are plotted from 5 runs each.

To compare CHMCTS algorithms for offline planning, we compare the hypervolume of the CCS estimated in the root node, with respect to the number of backups required. Hypervolume is considered a suitable metric for the quality of a CCS [20]. We compare algorithms on a GDST(30,0.01) environment, using CHVI as a baseline in Fig. 6. We note that CHVI will compute the optimal CCS in $|S|H$ backups, where $S$ is the state space and $H$ the horizon. $|S|H$ is the least number of backups required to guarantee that the optimal value has been computed, so we expect CHVI to converge first. We see that Zooming CHMCTS continues to make steady progress towards the optimal hypervolume, while the other methods appear to plateau quickly. We can also see that the CHMCTS algorithms would outperform CHVI if given a small computational budget. In the next section, we will also see that the performance of CHVI deteriorates severely as the environment size is increased.
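For readers unfamiliar with the metric, the sketch below computes the hypervolume of a set of two-dimensional value vectors. It assumes both objectives are maximised and that a reference point is supplied by the caller. The function name `hypervolume_2d` and the two-objective restriction are choices made for this illustration, not details taken from the experiments.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def hypervolume_2d(points: Iterable[Point], reference: Point) -> float:
    """Area dominated by `points` and bounded below by `reference`."""
    ref_x, ref_y = reference
    # Only points that strictly improve on the reference in both objectives count.
    useful = [(x, y) for x, y in points if x > ref_x and y > ref_y]
    # Sweep from the largest first objective to the smallest.
    useful.sort(reverse=True)
    area = 0.0
    best_y = ref_y
    for x, y in useful:
        if y > best_y:
            # Add the new horizontal strip between best_y and y, of width x - ref_x.
            area += (x - ref_x) * (y - best_y)
            best_y = y
    return area
```

Dominated points add no area, so the value only grows when the estimated front moves closer to the true one.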

6.6 Scalability Analysis

Figure 7: The approximate ratio of the hypervolume found, , with respect to problem size, i.e. , with a budget of 25000 backups. 95% Confidence intervals are plotted from 3 runs each.

To understand how scalable CHMCTS is, we compare Zooming CHMCTS with CHVI on a range of different sized environments. In Fig. 7 we estimate how much of the true CCS is discovered given a budget of 25000 backups. We vary between three and 80, for both and .

Let denote the optimal hypervolume of a GDST(c,p) instance, and let be the estimated hypervolume resulting from some algorithm on the same GDST(c,p) instance. The ratio of indicates what proportion of the CCS was found by the algorithm. However, is infeasible to compute for . Therefore we instead plot the lower bound , which for GDST instances will always be in the range .

In Fig. 7 we see that CHMCTS outperforms CHVI on instances with . We only compare Zooming CHMCTS because of computational constraints. Our results suggest that in the presence of a budget on the number of backups, there will be some threshold in the sizes of MOMDPs from which CHMCTS outperforms CHVI.

7 Conclusion

In this work, we posed planning in Multi-Objective Markov Decision Processes as a contextual multi-armed bandits problem. We then discussed why one should maintain approximations of the Pareto front in each tree node, to produce the Convex Hull Monte-Carlo Tree-Search (CHMCTS) framework. By considering contextual regret, we introduced a novel action selection method and empirically verified that it can achieve sublinear regret and outperform other state-of-the-art approaches.

Future work includes proving regret bounds of Multi-Objective Monte-Carlo Tree-Search algorithms and applying these algorithms in larger, real-world settings.


This work is funded by Bossa Nova Robotics, Inc located at 610 22nd Street, Suite 250, San Francisco, CA 94107, USA, UK Research and Innovation and EPSRC through the Robotics and Artificial Intelligence for Nuclear (RAIN) research hub [EP/R026084/1], and the European Union Horizon 2020 research and innovation programme under grant agreement No 821988 (ADE).


  • [1] A. Abels, D. M. Roijers, T. Lenaerts, A. Nowé, and D. Steckelmacher (2018) Dynamic Weights in Multi-Objective Deep Reinforcement Learning. arXiv preprint arXiv:1809.07803.
  • [2] P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 (2-3), pp. 235–256.
  • [3] R. Bellman (1957) Dynamic Programming. 1st edition, Princeton University Press, Princeton, NJ, USA.
  • [4] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton (2012) A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4 (1), pp. 1–43.
  • [5] G. Chaslot, S. Bakkes, I. Szita, and P. Spronck (2008) Monte-Carlo Tree Search: A New Framework for Game AI. In AIIDE.
  • [6] W. Chen and L. Liu (2019) Pareto Monte Carlo Tree Search for Multi-Objective Informative Planning.
  • [7] M. M. Drugan and A. Nowé (2013) Designing multi-objective multi-armed bandits algorithms: A study. In The 2013 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
  • [8] T. Keller and M. Helmert (2013) Trial-Based Heuristic Tree Search for Finite Horizon MDPs. In ICAPS.
  • [9] L. Kocsis, C. Szepesvári, and J. Willemson (2006) Improved Monte-Carlo search. Univ. Tartu, Estonia, Tech. Rep. 1.
  • [10] L. Kocsis and C. Szepesvári (2006) Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pp. 282–293.
  • [11] B. Lacerda, D. Parker, and N. Hawes (2017) Multi-objective policy generation for mobile robots under probabilistic time-bounded guarantees. In Proc. of the 27th Int. Conf. on Automated Planning and Scheduling (ICAPS), Pittsburgh, PA, USA.
  • [12] M. Lahijanian, M. Svorenova, A. A. Morye, B. Yeomans, D. Rao, I. Posner, P. Newman, H. Kress-Gazit, and M. Kwiatkowska (2018) Resource-performance tradeoff analysis for mobile robots. IEEE Robotics and Automation Letters 3 (3), pp. 1840–1847.
  • [13] H. Mossalam, Y. M. Assael, D. M. Roijers, and S. Whiteson (2016) Multi-Objective Deep Reinforcement Learning. arXiv preprint arXiv:1610.02707.
  • [14] D. Perez, S. Mostaghim, S. Samothrakis, and S. M. Lucas (2015) Multiobjective Monte Carlo tree search for real-time games. IEEE Transactions on Computational Intelligence and AI in Games 7 (4), pp. 347–360.
  • [15] D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley (2013) A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research 48, pp. 67–113.
  • [16] N. Sahoo, S. Ganguly, and D. Das (2012) Multi-objective planning of electrical distribution systems incorporating sectionalizing switches and tie-lines using particle swarm optimization. Swarm and Evolutionary Computation 3, pp. 15–32.
  • [17] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. (2017) Mastering the game of Go without human knowledge. Nature 550 (7676), pp. 354.
  • [18] A. Slivkins (2014) Contextual bandits with similarity information. The Journal of Machine Learning Research 15 (1), pp. 2533–2568.
  • [19] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063.
  • [20] P. Vamplew, R. Dazeley, A. Berry, R. Issabekov, and E. Dekker (2011) Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning 84 (1-2), pp. 51–80.
  • [21] P. Vamplew, D. Webb, L. M. Zintgraf, D. M. Roijers, R. Dazeley, R. Issabekov, and E. Dekker (2017) MORL-Glue: A benchmark suite for multi-objective reinforcement learning. In 29th Benelux Conference on Artificial Intelligence, November 8–9, 2017, Groningen, pp. 389.
  • [22] X. Xu and G. Li (2017) Chebyshev metric based multi-objective Monte Carlo tree search for combat simulations. In 2017 21st International Conference on System Theory, Control and Computing (ICSTCC), pp. 607–612.

Appendix A Trial-Based Heuristic Tree Search

Figure 8: A visualization of the THTS schema.

In this appendix we give a brief overview of each of the subroutines used in THTS [8] and their purpose, with pseudocode (Algorithm 1) and a visualisation (Figure 8).

In Algorithm 1 we give pseudocode for the THTS algorithm, which is split into three subroutines and presented as recursive functions for conciseness. In the loop starting on line 3, we could easily add a function (say generateContext) that samples a context weight vector and passes it to the subroutines thtsDecisionNode and thtsChanceNode, to allow for contextual action selection.

The initialiseNode function provides an initial estimate of the value of a node. For example, it could simply return zero for all nodes, or it could incorporate a Monte-Carlo tree-search style rollout policy [5].

The visitDecisionNode and visitChanceNode functions will update state stored in the chance or decision nodes. For example, in an implementation of UCT, the counts for how many times the node has been visited will be updated.

The selectAction and selectOutcome functions select the next node to consider during a trial. selectAction is given a decision node and chooses an action, which identifies the chance node to visit next. Similarly, selectOutcome is given a chance node and chooses a successor state, which identifies the decision node to visit next.

Finally, the backupDecisionNode and backupChanceNode functions correspond to updating the value estimate at each node from its successor nodes. For example, in a single-objective setting the value will be a scalar and could be updated using a Bellman backup or Monte-Carlo backup. In the multi-objective setting, our value associated with each node can be either a convex hull or Pareto front.
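As an illustration of a multi-objective backup, the sketch below shows a decision-node backup for the Pareto-front case. It merges the fronts stored in the children and discards dominated vectors. The names `dominates`, `pareto_prune` and `backup_decision_node` are hypothetical. A convex hull variant would prune further, and the chance-node backup, which combines successor fronts according to transition probabilities, is not shown.

```python
from typing import Iterable, List, Tuple

Vector = Tuple[float, ...]


def dominates(u: Vector, v: Vector) -> bool:
    """True if u is at least as good as v everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_prune(points: Iterable[Vector]) -> List[Vector]:
    """Remove duplicates and every vector dominated by another vector."""
    unique = list(dict.fromkeys(tuple(p) for p in points))
    return [p for p in unique if not any(dominates(q, p) for q in unique)]


def backup_decision_node(child_fronts: Iterable[Iterable[Vector]]) -> List[Vector]:
    """Union the fronts of all child nodes, then keep only non-dominated vectors."""
    merged = [vector for front in child_fronts for vector in front]
    return pareto_prune(merged)
```

The quadratic pruning loop is chosen for readability; it is adequate for small fronts but not tuned for large ones.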

1:procedure THTS(MDP , Timeout )
3:     while  and time  do
5:     end while
6:end procedure
8:procedure thtsDecisionNode(Node )
9:     if  uninitialised then
11:     end if
14:     for  in  do
16:     end for
18:end procedure
20:procedure thtsChanceNode(Node )
21:     if  uninitialised then
23:     end if
26:     for  in  do
28:     end for
30:end procedure
Algorithm 1 THTS
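The recursive structure of the schema can be sketched in Python as below. This is a single-objective toy instantiation under several assumptions: zero initial values, least-visited action selection, a running-average chance-node backup and a max decision-node backup. The `Node` and `THTS` classes, the `step` callback and the `max_trials` cap are hypothetical additions for this illustration and are not part of the original algorithm.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Hashable, Optional


@dataclass
class Node:
    state: Hashable
    depth: int
    action: Optional[Hashable] = None  # None for decision nodes
    initialised: bool = False
    visits: int = 0
    value: float = 0.0
    children: Dict[Hashable, "Node"] = field(default_factory=dict)


class THTS:
    """Recursive trial loop; every hook below is a simple placeholder choice."""

    def __init__(self, actions, step, horizon):
        self.actions = actions  # state -> list of actions
        self.step = step        # (state, action) -> (next_state, reward)
        self.horizon = horizon

    def run(self, root_state, timeout_s=1.0, max_trials=1000):
        root = Node(root_state, depth=0)
        deadline = time.monotonic() + timeout_s
        trials = 0
        while trials < max_trials and time.monotonic() < deadline:
            self.thts_decision_node(root)
            trials += 1
        return root

    def thts_decision_node(self, node):
        if not node.initialised:
            self.initialise_node(node)
        node.visits += 1  # visitDecisionNode
        if node.depth == self.horizon:
            return
        action = self.select_action(node)
        child = node.children.setdefault(
            action, Node(node.state, node.depth, action=action))
        self.thts_chance_node(child)
        # backupDecisionNode: best value over the expanded actions.
        node.value = max(c.value for c in node.children.values())

    def thts_chance_node(self, node):
        if not node.initialised:
            self.initialise_node(node)
        node.visits += 1  # visitChanceNode
        next_state, reward = self.step(node.state, node.action)  # selectOutcome
        child = node.children.setdefault(
            next_state, Node(next_state, node.depth + 1))
        self.thts_decision_node(child)
        # backupChanceNode: running average of reward plus successor estimate.
        node.value += (reward + child.value - node.value) / node.visits

    def initialise_node(self, node):
        node.initialised = True
        node.value = 0.0

    def select_action(self, node):
        # Placeholder exploration: pick the least-visited action.
        def visits(action):
            child = node.children.get(action)
            return child.visits if child else 0
        return min(self.actions(node.state), key=visits)
```

Replacing `select_action` and the two backup steps is where a UCT-style or multi-objective variant would differ; the trial recursion itself stays the same.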

Appendix B Proof of Theorem 1

Theorem 1.

For any sequence of context-free policies , for some MOMDP , the expected ULCR over trials is .


Consider the MOMDP defined in Figure 9, with a horizon of one, and only two possible policies. Let with sampled uniformly from . Let be the policy selected on the th trial, and let be the optimal policy for the th trial.

Considering the two cases for what could be, we have: if , then , and similarly, if , then . Observing that


we must have . Finally, summing over all terms in the definition of ULCR gives the result:


Figure 9: A deterministic MOMDP with three states, two-dimensional rewards and a horizon of one. Terminal states are marked with a double ring and the initial state is . All actions are labelled with an action name, and the reward associated with the state-action pair. As all probabilities are one, probabilities are omitted from this figure.
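As a worked illustration of why a context-free policy incurs linear regret, suppose (as an assumption, since the rewards of Figure 9 are not reproduced here) that the two actions yield rewards $(1,0)$ and $(0,1)$, and that the context is $w = (w_1, 1 - w_1)$ with $w_1$ sampled uniformly from $[0,1]$. A context-free policy picks the first action with some probability $q$ that does not depend on $w$. Under these assumptions, a sketch of the per-trial calculation is:

```latex
\begin{align*}
\mathbb{E}[\text{value of the policy}]
  &= q\,\mathbb{E}[w_1] + (1-q)\,\mathbb{E}[1 - w_1] = \tfrac{1}{2}, \\
\mathbb{E}[\text{optimal value}]
  &= \mathbb{E}[\max(w_1, 1 - w_1)]
   = 2\int_{1/2}^{1} w_1 \, dw_1 = \tfrac{3}{4}, \\
\mathbb{E}[\text{regret per trial}]
  &= \tfrac{3}{4} - \tfrac{1}{2} = \tfrac{1}{4}.
\end{align*}
```

Summing over $T$ trials gives an expected regret of $T/4$ in this assumed instance, regardless of $q$, which grows linearly in $T$.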

Appendix C Proof that is a Metric

For completeness, we show in this section that the function is a metric on the similarity space . Recall that is defined in Equation (16) as follows:


where and for all .

Definition 15.

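A function $d : X \times X \to \mathbb{R}$ is called a metric on the set $X$ if the following properties hold for all $x, y, z \in X$:

  • (m1) $d(x, y) \geq 0$;
  • (m2) $d(x, y) = 0$ if and only if $x = y$;
  • (m3) $d(x, y) = d(y, x)$;
  • (m4) $d(x, z) \leq d(x, y) + d(y, z)$.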

Theorem 2.

The function defined by Equation (16) is a metric on .


Let . We will show that each of the properties (m1), (m2), (m3) and (m4) hold for .

m1: follows immediately from the assumptions , and because for any .

m2: it is logically equivalent to show that  (m2a) and  (m2b) both hold.

m2a: if we are given that then:


m2b: firstly, if , then we must have that , because by assumption. Now consider the case when and , there must be some index such that and . Therefore, we must have that . Combining this with the assumption that for all gives the result.

m3: if , then immediately we have . When , m3 follows from the fact that for any and using :


m4: we consider three cases for this property, when  (m4a),  (m4b) and  (m4c).

m4a: in this case we must have one of or , so assume that wlog. Given this, by the definition of (Equation (16)) and then by (m1), we have