Markov decision processes (MDPs) are an important mathematical formalism for modeling and solving sequential decision problems in stochastic environments. The importance of this model has triggered a large number of works in different research communities within computer science, most notably in formal verification, and in artificial intelligence and machine learning. The works done in these research communities have respective weaknesses and complementary strengths. On the one hand, algorithms developed in formal verification are generally complete and provide strong guarantees on the optimality of computed solutions, but they tend to be applicable to models of moderate size only. On the other hand, algorithms developed in artificial intelligence and machine learning usually scale to larger models but only provide weaker guarantees. Instead of opposing the two sets of algorithms, recent works [2, 14, 6, 12, 11, 19, 1] try to combine the strengths of the two approaches in order to offer new hybrid algorithms that scale better and provide stronger guarantees. The contributions described in this paper are part of this research agenda: we show how to integrate symbolic advice defined by formal specifications into Monte Carlo Tree Search algorithms using techniques such as SAT and QBF.
When an MDP is too large to be analyzed offline using verification algorithms, receding horizon analysis combined with simulation techniques is used online. Receding horizon techniques work as follows. In the current state $s$ of the MDP, for a fixed horizon $H$, the receding horizon algorithm searches for an action that is the first action of a plan to act (almost) optimally on the finite horizon $H$. When such an action is identified, it is played from $s$ and the state evolves stochastically to a new state $s'$ according to the dynamics specified by the MDP. The same process is repeated from $s'$. The optimization criterion over the next $H$ steps depends on the long run measure that needs to be optimised. The tree unfolding from $s$ that needs to be analyzed is often very large (e.g. it may be exponential in $H$). As a consequence, receding horizon techniques are often coupled with sampling techniques that avoid the systematic exploration of the entire tree unfolding at the expense of approximation. The Monte Carlo Tree Search (MCTS) algorithm is an increasingly popular tree search algorithm that implements these ideas. It is one of the core building blocks of the AlphaGo algorithm.
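The receding horizon loop described above can be sketched as follows. This is only an illustration under stated assumptions: the names `receding_horizon_run`, `plan_first_action` and `step` are hypothetical, and the planner is left abstract (it could be an exact finite-horizon search or an MCTS call).

```python
import random
from typing import Callable, Hashable, List, Tuple

State = Hashable
Action = Hashable


def receding_horizon_run(
    s0: State,
    horizon: int,
    n_steps: int,
    plan_first_action: Callable[[State, int], Action],
    step: Callable[[State, Action, random.Random], Tuple[State, float]],
    rng: random.Random,
) -> Tuple[List[State], float]:
    """Play n_steps actions, re-planning over `horizon` steps at every state."""
    state, total, visited = s0, 0.0, [s0]
    for _ in range(n_steps):
        # Only the first action of the (near-)optimal finite-horizon plan is played.
        action = plan_first_action(state, horizon)
        # The environment then resolves the stochastic transition.
        state, reward = step(state, action, rng)
        total += reward
        visited.append(state)
    return visited, total


# Toy usage (hypothetical): a counter that action "inc" increases with probability 0.5.
def toy_step(state, action, rng):
    moved = action == "inc" and rng.random() < 0.5
    return (state + 1 if moved else state), (1.0 if moved else 0.0)


def toy_planner(state, horizon):
    return "inc"


visited, total = receding_horizon_run(0, 3, 10, toy_planner, toy_step, random.Random(0))
```

The planner is called again from every new state, so the horizon "recedes" as the run progresses; the cost of each call is what sampling-based methods such as MCTS aim to reduce.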
While MCTS techniques may offer reasonable performance off the shelf, they usually need substantial application-specific adjustments to perform really well. One way to adapt MCTS to a particular application is to bias the search towards promising subspaces, taking into account properties of the application domain [16, 26]. This is usually done by directly coding handcrafted search and sampling strategies. We show in this paper how to use techniques from formal verification to offer a flexible and rigorous framework to bias the search performed by MCTS using symbolic advice. A symbolic advice is a formal specification, which can be expressed for example in your favorite linear temporal logic, and which constrains the search and the sampling phases of the MCTS algorithm using QBF and SAT solvers. Our framework offers in principle the ability to easily experiment with precisely formulated biases expressed declaratively using logic.
On the theoretical side, we study the impact of using symbolic advice on the guarantees offered by MCTS. We identify sufficient conditions for the symbolic advice to preserve the convergence guarantees of the MCTS algorithm (Theorem 4.1). Those results are partly based on an analysis of the incidence of sampling on those guarantees (Theorem 3) which can be of independent interest.
On a more practical side, we show how symbolic advice can be implemented using SAT and QBF techniques. More precisely, we use QBF to ensure that all the prefixes explored by the MCTS algorithm in the partial tree unfolding have the property suggested by the selection advice (whenever possible), and we use SAT-based sampling techniques to achieve uniform sampling among paths of the MDP that satisfy the sampling advice. The use of these symbolic exploration techniques is important as the underlying state space that we need to analyze is usually huge (e.g. exponential in the receding horizon $H$).
To demonstrate the practical interest of our techniques, we have applied our new MCTS with symbolic advice algorithm to play Pac-Man. Figure 1 shows a grid of the Pac-Man game. In this version of the classical game, the agent Pac-Man has to eat food pills as fast as possible while avoiding being pinched by ghosts. We have chosen this benchmark to evaluate our algorithm for several reasons. First, the state space of the underlying MDP is far too large for the state-of-the-art implementations of complete algorithms. Indeed, the reachable state space of the small grid shown here has approximately states, while the classical grid has approximately states. Our algorithm can handle both grids. Second, this application allows for a comparison not only between the performances obtained from several versions of the MCTS algorithm, but also with the performances that humans can attain in this game. In the Pac-Man benchmark, we show that advice that instructs Pac-Man on the one hand to avoid ghosts at all costs during the selection phase of the MCTS algorithm (enforced whenever possible by QBF), and on the other hand to bias the search to paths in which ghosts are avoided (using uniform sampling based on SAT), allows Pac-Man to attain or surpass human-level performance, while the standard MCTS algorithm performs much worse.
Our analysis of the convergence of the MCTS algorithm with appropriate symbolic advice is based on extensions of analysis results based on bias defined using UCT (bandit algorithms) [18, 3]. Those results are also related to sampling techniques for finite horizon objectives in MDPs.
Our concept of selection phase advice is related to the MCTS-minimax hybrid algorithm proposed in . There the selection phase advice is not specified declaratively using logic but encoded directly in the code of the search strategy. Neither QBF nor SAT is used there, and neither is sampling advice. In , the authors provide a general framework to add safety properties to reinforcement learning algorithms via shielding. These techniques statically analyse the full state space of the game in order to compute a set of unsafe actions to avoid. This fits our advice framework, so that such a shield could be used as an online selection advice in order to combine their safety guarantees with our formal results for MCTS. More recently, a variation of shielding called safe padding has been studied in . Both works are concerned with reinforcement learning and not with MCTS. Note that in general multiple ghosts may prevent the existence of a strategy to enforce safety, i.e. always avoid pincer moves.
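A probability distribution on a finite set $S$ is a function $d : S \to [0,1]$ such that $\sum_{s \in S} d(s) = 1$. We denote the set of all probability distributions on set $S$ by $\mathcal{D}(S)$. The support of a distribution $d \in \mathcal{D}(S)$ is $\mathsf{Supp}(d) = \{s \in S \mid d(s) > 0\}$.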
2.1 Markov decision process
[MDP] A Markov decision process is a tuple $M = (S, A, P, R, R_T)$, where $S$ is a finite set of states, $A$ is a finite set of actions, $P$ is a mapping from $S \times A$ to $\mathcal{D}(S)$ such that $P(s,a)(s')$ denotes the probability that action $a$ in state $s$ leads to state $s'$, $R : S \times A \to \mathbb{R}$ defines the reward obtained for taking a given action at a given state, and $R_T : S \to \mathbb{R}$ assigns a terminal reward to each state in $S$. (Footnote 1: We assume for convenience that every action in $A$ can be taken from every state. One may need to limit this choice to a subset of legal actions that depends on the current state. This concept can be encoded in our formalism by adding a sink state reached with probability $1$ when taking an illegal action.)
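As an illustration of this definition, an MDP can be represented with plain dictionaries. The sketch below is not taken from the paper: the class `MDP`, the helper `support`, and the toy instance are hypothetical names chosen for the example, and the sink-state encoding of illegal actions follows the footnote above.

```python
import random
from typing import Dict, Hashable

State = Hashable
Action = Hashable
Distribution = Dict[State, float]  # state -> probability


def support(d: Distribution) -> set:
    """States that receive non-zero probability."""
    return {s for s, prob in d.items() if prob > 0}


class MDP:
    """Finite MDP in which every action is available in every state."""

    def __init__(self, states, actions, P, R, R_T):
        self.states, self.actions = list(states), list(actions)
        self.P = P      # P[(s, a)]: Distribution over successor states
        self.R = R      # R[(s, a)]: reward for playing a in s
        self.R_T = R_T  # R_T[s]: terminal reward of s
        for s in self.states:
            for a in self.actions:
                if abs(sum(P[(s, a)].values()) - 1.0) > 1e-9:
                    raise ValueError(f"P[{(s, a)!r}] is not a distribution")

    def sample_next(self, s, a, rng):
        """Draw a successor of s under action a according to P[(s, a)]."""
        succ = list(self.P[(s, a)])
        weights = [self.P[(s, a)][t] for t in succ]
        return rng.choices(succ, weights=weights)[0]


# Toy instance (hypothetical): "jump" is illegal in state "wall" and leads to the sink.
_states, _actions = ["open", "wall", "sink"], ["walk", "jump"]
toy = MDP(
    states=_states,
    actions=_actions,
    P={
        ("open", "walk"): {"open": 0.5, "wall": 0.5},
        ("open", "jump"): {"open": 1.0},
        ("wall", "walk"): {"open": 1.0},
        ("wall", "jump"): {"sink": 1.0},  # illegal action: sink with probability 1
        ("sink", "walk"): {"sink": 1.0},
        ("sink", "jump"): {"sink": 1.0},
    },
    R={(s, a): 0.0 for s in _states for a in _actions},
    R_T={"open": 1.0, "wall": 0.0, "sink": -1.0},
)
```

An explicit table like this is only practical for small models; for state spaces of the size discussed in the introduction, the transition function would be given implicitly, for instance by a simulator.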
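For a Markov decision process $M$, a path of length $i > 0$ is a sequence of $i$ consecutive states and actions followed by a last state. We say that $p = s_0 a_0 s_1 \dots s_i$ is an $i$-length path in the MDP $M$ if for all $t \in [0, i-1]$, $a_t \in A$ and $s_{t+1} \in \mathsf{Supp}(P(s_t, a_t))$, and we denote $\mathsf{last}(p) = s_i$ and $\mathsf{first}(p) = s_0$. We also consider states to be paths of length $0$. An infinite path is an infinite sequence $p = s_0 a_0 s_1 \dots$ of states and actions such that for all $t \in \mathbb{N}$, $a_t \in A$ and $s_{t+1} \in \mathsf{Supp}(P(s_t, a_t))$. We denote the finite prefix of length $t$ of a finite or infinite path $p = s_0 a_0 s_1 \dots$ by $p_{|t} = s_0 a_0 \dots s_t$. Let $p = s_0 a_0 s_1 \dots s_i$ and $p' = s'_0 a'_0 s'_1 \dots s'_j$ be two paths such that $s_i = s'_0$, let $a$ be an action and $s$ be a state of $M$. Then, $p \cdot p'$ denotes $s_0 a_0 s_1 \dots s_i a'_0 s'_1 \dots s'_j$ and $p \cdot as$ denotes $s_0 a_0 s_1 \dots s_i a s$.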
For an MDP , the set of all finite paths of length is denoted by . Let denote the set of paths in such that . Similarly, if and , then let denote the set of paths in such that there exists with . We denote the set of all finite paths in by and the set of finite paths of length at most by .
The total reward of a finite path $p = s_0 a_0 \dots s_n$ in $M$ is defined as $\mathsf{Reward}_M(p) = \sum_{t=0}^{n-1} R(s_t, a_t) + R_T(s_n)$.
A (probabilistic) strategy is a function that maps a path to a probability distribution in . A strategy is deterministic if the support of the probability distributions has size , it is memoryless if depends only on , i.e. if satisfies that for all , . For a probabilistic strategy and , let denote the paths in such that for all , . For a finite path of length and some , let denote .
For a strategy and a path , let the probability of in according to be defined as . The mapping defines a probability distribution over .
The expected average reward of a probabilistic strategy in an MDP , starting from state , is defined as
is a random variable over following the distribution .
The optimal expected average reward starting from a state in an MDP is defined over all strategies in as .
One can restrict the supremum to deterministic memoryless strategies [23, Proposition 6.2.1]. A strategy is called -optimal for the expected average reward if for all .
The expected total reward of a probabilistic strategy in an MDP , starting from state and for a finite horizon , is defined as , where is a random variable over following the distribution .
The optimal expected total reward starting from a state in an MDP , with horizon , is defined over all strategies in as . One can restrict the supremum to deterministic strategies [23, Theorem 4.4.1.b].
Let denote a deterministic strategy that maximises , and refer to it as an optimal strategy for the expected total reward of horizon at state . For , let refer to a deterministic memoryless strategy that maps every state in to the first action of a corresponding optimal strategy for the expected total reward of horizon , so that . As there may exist several deterministic strategies that maximise , we denote by the set of actions such that there exists an optimal strategy that selects from . A strategy can be obtained by the value iteration algorithm:
[Value iteration [23, Section 5.4]] For a state in MDP , for all ,
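The recursion behind this proposition can be sketched in code. The version below assumes the standard finite-horizon Bellman backup (horizon 0 yields the terminal reward, and each additional step maximises the immediate reward plus the expected value of the successor); the function name `finite_horizon_values` and the toy model are hypothetical.

```python
def finite_horizon_values(states, actions, P, R, R_T, horizon):
    """Return (val, best): optimal expected total reward per state for the
    given horizon, and the set of optimal first actions per state."""
    val = {s: R_T[s] for s in states}  # horizon 0: only the terminal reward
    best = {}
    for _ in range(horizon):
        new_val, best = {}, {}
        for s in states:
            # Value of playing a first, then acting optimally for one step less.
            q = {
                a: R[(s, a)] + sum(prob * val[t] for t, prob in P[(s, a)].items())
                for a in actions
            }
            top = max(q.values())
            new_val[s] = top
            best[s] = {a for a in actions if abs(q[a] - top) < 1e-12}
        val = new_val
    return val, best


# Toy model (hypothetical): "safe" pays 1 now, "risky" may reach the lucrative state "t".
states, actions = ["s", "t"], ["safe", "risky"]
P = {
    ("s", "safe"): {"s": 1.0},
    ("s", "risky"): {"s": 0.5, "t": 0.5},
    ("t", "safe"): {"t": 1.0},
    ("t", "risky"): {"t": 1.0},
}
R = {("s", "safe"): 1.0, ("s", "risky"): 0.0, ("t", "safe"): 3.0, ("t", "risky"): 3.0}
R_T = {"s": 0.0, "t": 0.0}

val1, best1 = finite_horizon_values(states, actions, P, R, R_T, 1)
val3, best3 = finite_horizon_values(states, actions, P, R, R_T, 3)
```

In this toy model the optimal first action from `"s"` changes with the horizon: `"safe"` for horizon 1, `"risky"` for horizon 3. The computation visits every state at every step, which is what makes it impractical on large MDPs.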
Moreover, for a large class of MDPs and a large enough , the strategy is -optimal for the expected average reward: [23, Theorem 9.4.5] For a strongly aperiodic Markov decision process , it holds that . Moreover, for any there exists such that for all , for all . (Footnote 2: A Markov decision process is strongly aperiodic if $P(s,a)(s) > 0$ for all $s \in S$ and $a \in A$.)
A simple transformation can be used to make an MDP strongly aperiodic without changing the optimal expected average reward and the associated optimal strategies. Therefore, one can use an algorithm computing the strategy in order to optimise for the expected average reward, and obtain theoretical guarantees for a horizon big enough. This is known as the receding horizon approach.
Finally, we will use the notation to refer to an MDP obtained as a tree-shaped unfolding of from state and for a depth of . In particular, the states of correspond to paths in . Then, it holds that: is equal to , and is equal to .
The aperiodicity and unfolding transformations are detailed in Appendix A.
2.2 Bandit problems and UCB
In this section, we present bandit problems, whose study forms the basis of a theoretical analysis of Monte Carlo tree search algorithms.
Let denote a finite set of actions. For each , let be a sequence of random payoffs associated to . They correspond to successive plays of action , and for every action and every , let be drawn with respect to a probability distribution over . We denote by the random variable associated to this drawing. In a fixed distributions setting (the classical bandit problem), every action is associated to a fixed probability distribution , so that for all .
The bandit problem consists of a succession of steps where the player selects an action and observes the associated payoff, while trying to maximise the cumulative gains. For example, selecting action , then and then again would yield the respective payoffs , and for the first three steps, drawn from their respective distributions. Let the regret denote the difference, after steps, between the optimal expected payoff and the expected payoff associated to our action selection. The goal is to minimise the long-term regret when the number of steps increases.
The algorithm UCB1 of  offers a practical solution to this problem, together with theoretical guarantees. For an action $a$ and $n \geq 1$, let $\overline{x}_{a,n} = \frac{1}{n} \sum_{t=1}^{n} x_{a,t}$ denote the average payoff obtained from the first $n$ plays of $a$. Moreover, for a given step number $n$ let $t_a(n)$ denote how many times action $a$ was selected in the first $n$ steps. The algorithm UCB1 chooses, at step $n+1$, the action $a$ that maximises $\overline{x}_{a,t_a(n)} + c_{n,t_a(n)}$, where $c_{n,t}$ is defined as $\sqrt{\frac{2 \ln n}{t}}$. This procedure enjoys optimality guarantees detailed in , as it keeps the regret below $O(\log n)$.
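A minimal sketch of UCB1 on a stationary bandit is given below. It assumes payoffs in $[0,1]$ (Bernoulli arms here), the usual exploration bonus $\sqrt{2 \ln n / t}$, and the common convention of playing every action once before applying the bonus; the function names are hypothetical.

```python
import math
import random


def ucb1_choose(sums, counts):
    """Index of the action UCB1 plays next, given payoff sums and play counts."""
    for a, c in enumerate(counts):
        if c == 0:
            return a  # convention: try every action once before using the bonus
    n = sum(counts)
    return max(
        range(len(counts)),
        key=lambda a: sums[a] / counts[a] + math.sqrt(2 * math.log(n) / counts[a]),
    )


def run_ucb1(means, steps, rng):
    """Simulate UCB1 on Bernoulli arms with the given means; return play counts."""
    sums, counts = [0.0] * len(means), [0] * len(means)
    for _ in range(steps):
        a = ucb1_choose(sums, counts)
        payoff = 1.0 if rng.random() < means[a] else 0.0
        sums[a] += payoff
        counts[a] += 1
    return counts


counts = run_ucb1([0.2, 0.8], 2000, random.Random(1))
```

The bonus term shrinks as an action is played more often, so a rarely played action is eventually retried even if its current average is low.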
We will make use of an extension of these results to the general setting of non-stationary bandit problems, where the distributions are no longer fixed with respect to . This problem has been studied in , and results were obtained for a class of distributions that respect assumptions referred to as drift conditions.
For a fixed , let denote the random variable obtained as the average of the random variables associated with the first plays of . Let . We assume that these expected means eventually converge, and let .
[Drift conditions] For all , the sequence converges to some value . Moreover, there exists a constant and an integer such that for and any , if then the tail inequalities and hold.
We recall in Appendix B the results of , and provide an informal description of those results here. Consider using the algorithm UCB1 on a non-stationary bandit problem satisfying the drift conditions, with . First, one can bound logarithmically the number of times a suboptimal action is played. This is used to bound the difference between and by , where is an optimal action and where denotes the global average of payoffs received over the first steps. This is the main theoretical guarantee obtained for the optimality of UCB1. Also for any action , the authors state a lower bound for the number of times the action is played. The authors also prove a tail inequality similar to the one described in the drift conditions, but on the random variable instead of . This will be useful for inductive proofs later on, when the usage of UCB1 is nested so that the global sequence corresponds to a sequence of an action played from the next state of the MDP. Finally, it is shown that the probability of making the wrong decision (choosing a suboptimal action) converges to as the number of plays grows large enough.
3 Monte Carlo tree search with simulation
In a receding horizon approach, the objective is to compute and for some state and some horizon . Exact procedures such as the recursive computation of Proposition 2.1 cannot be used on large MDPs, which motivates heuristic approaches. We focus on the Monte Carlo Tree Search (MCTS) algorithm, which can be seen as computing approximations of and on the unfolding . Note that rewards in the MDP are bounded. (Footnote 3: There are finitely many paths of length at most , with rewards in . For the sake of simplicity we assume without loss of generality that for all paths of length at most the total reward belongs to .)
Given an initial state , MCTS is an iterative process that incrementally constructs a search tree rooted at describing paths of and their associated values. This process goes on until a specified budget (of number of iterations or time) is exhausted. An iteration constructs a path in by following a decision strategy to select a sequence of nodes in the search tree. When a node that is not part of the current search tree is reached, the tree is expanded with this new node, whose expected reward is approximated by simulation. This value is then used to update the knowledge of all selected nodes in backpropagation.
In the search tree, each node represents a path. For a node and an action , let be a list of nodes representing paths of the form where . For each node (resp. node-action pair) we store a value (resp. ) computed for node (resp. for playing from node ), meant to approximate (resp. ), and a counter (resp. ), that keeps track of the number of iterations that selected node (resp. that selected the action from ). We add subscripts to these notations to denote the number of previous iterations, so that is the value of obtained after iterations of MCTS, among which was selected times. We also define and as shorthand for respectively and . Each iteration consists of three phases. Let us describe these phases at iteration number .
Starting from the root node, MCTS descends through the existing search tree by choosing actions based on the current values and counters and by selecting next states stochastically according to the MDP. This continues until reaching a node , either outside of the search tree or at depth . In the former case, the simulation phase is called to obtain a value that will be backpropagated along the path . In the latter case, we use the exact value instead.
The action selection process needs to balance between the exploration of new paths and the exploitation of known, promising paths. A popular way to balance both is the upper confidence bound for trees (UCT) algorithm , that interprets the action selection problem of each node of the MCTS tree as a bandit problem, and selects an action in the set , for some constant .
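The UCT selection rule can be sketched as follows. The exact scoring formula is an assumption here (current action value plus a bonus of the form $C \sqrt{\ln(\text{node count}) / \text{action count}}$, with untried actions given infinite score); the function name `uct_candidates` and its parameters are hypothetical.

```python
import math


def uct_candidates(values, counts, node_count, c):
    """Set of actions with maximal UCT score at a node.

    values[a]  : current value estimate for playing a from this node
    counts[a]  : number of iterations that selected a from this node
    node_count : number of iterations that selected this node
    c          : exploration constant
    """
    def score(a):
        if counts[a] == 0:
            return math.inf  # untried actions are explored first
        return values[a] + c * math.sqrt(math.log(node_count) / counts[a])

    scores = {a: score(a) for a in values}
    top = max(scores.values())
    return {a for a, sc in scores.items() if sc == top}


# Exploitation: equal counts, so the better estimate wins.
exploit = uct_candidates({"l": 0.9, "r": 0.1}, {"l": 5, "r": 5}, 10, 1.0)
# Exploration: "r" has been tried once only, so its bonus dominates.
explore = uct_candidates({"l": 0.6, "r": 0.5}, {"l": 100, "r": 1}, 101, 1.0)
```

Returning a set rather than a single action mirrors the text, which selects an action in the set of maximisers; ties can then be broken arbitrarily or at random.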
In the simulation phase, the goal is to get an initial approximation for the value of a node , that will be refined in future iterations of MCTS. Classically, a sampling-based approach can be used, where one computes a fixed number of paths in . Then, one can compute , and fix to . Usually, the samples are derived by selecting actions uniformly at random in the MDP.
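The classical sampling-based simulation phase can be sketched as below: a fixed number of paths is sampled with uniformly random actions, and their total rewards are averaged. The dictionary encoding of the MDP and the name `rollout_value` are assumptions made for the example; the total reward of a sample is taken to be the sum of its action rewards plus the terminal reward of its last state.

```python
import random


def rollout_value(state, depth, n_samples, actions, P, R, R_T, rng):
    """Average total reward of n_samples random paths of `depth` steps from `state`."""
    total = 0.0
    for _ in range(n_samples):
        s, reward = state, 0.0
        for _ in range(depth):
            a = rng.choice(actions)                      # uniform action selection
            reward += R[(s, a)]
            succ, probs = zip(*P[(s, a)].items())
            s = rng.choices(succ, weights=probs)[0]      # stochastic transition
        total += reward + R_T[s]                         # add the terminal reward
    return total / n_samples


# Deterministic toy model (hypothetical): one state, one action, reward 1 per step.
P = {("s", "a"): {"s": 1.0}}
R = {("s", "a"): 1.0}
R_T = {"s": 2.0}
estimate = rollout_value("s", 3, 5, ["a"], P, R, R_T, random.Random(0))
```

Here `depth` would typically be the number of steps remaining until the horizon is reached from the node being simulated.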
In our theoretical analysis of MCTS, we take a more general approach to the simulation phase, defined by a finite domain and a function that maps every path to a probability distribution on . In this approach, the simulation phase simply draws a value at random according to the distribution , and sets .
From the value obtained at a leaf node at depth in the search tree, let denote the reward associated with the path from node to in the search tree. For from to we update the values according to The value is updated based on , and with the same formula.
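A possible implementation of the backpropagation phase is sketched below. It assumes that each stored value is the running average of the returns observed through that node, where a return is the reward accumulated from the node down to the leaf plus the value obtained at the leaf; this averaging rule and the name `backpropagate` are assumptions, not a transcription of the update formula.

```python
def backpropagate(values, counts, nodes, step_rewards, leaf_value):
    """Update running averages along one selected path.

    nodes[k]        : key of the selected node (or node-action pair) at depth k
    step_rewards[k] : reward collected when moving from depth k to depth k + 1
    leaf_value      : value obtained at the leaf (simulation or exact value)
    """
    target = leaf_value
    for k in range(len(nodes) - 1, -1, -1):
        target += step_rewards[k]  # return observed from depth k onwards
        key = nodes[k]
        counts[key] = counts.get(key, 0) + 1
        old = values.get(key, 0.0)
        values[key] = old + (target - old) / counts[key]  # incremental mean


values, counts = {}, {}
backpropagate(values, counts, ["root", "child"], [1.0, 2.0], 0.5)
backpropagate(values, counts, ["root", "child"], [1.0, 0.0], 0.5)
```

After the two iterations above, `"child"` has seen returns 2.5 and 0.5 and `"root"` has seen 3.5 and 1.5, so their stored averages are 1.5 and 2.5 respectively.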
In the remainder of this section, we prove Theorem 3, that provides theoretical properties of the MCTS algorithm with a general simulation phase (defined by some fixed and ). This theorem was proven in [18, Theorem 6] for a version of the algorithm that called MCTS recursively until leaves were reached, as opposed to the sampling-based approach that has become standard in practice. Note that sampling-based approaches are captured by our general description of the simulation phase. Indeed, if the number of samples is set to $1$, let be the set of rewards associated with paths of , and let be a probability distribution over , such that for every reward , is the probability of path being selected with uniform action selection in , starting from the node . Then, the value drawn at random according to the distribution corresponds to the reward of a random sample drawn in . If the number of samples is greater than $1$, one simply needs to extend to be the set of average rewards over paths, while becomes a distribution over average rewards.
Consider an MDP , a horizon and a state . Let be a random variable that represents the value at the root of the search tree after iterations of the MCTS algorithm on . Then, is bounded by . Moreover, the failure probability converges to zero.
Following the proof scheme of [18, Theorem 6], this theorem is obtained from the results mentioned in Section 2.2. To this end, every node of the search tree is considered to be an instance of a bandit problem with non-stationary distributions. Every time a node is selected, a step is processed in the corresponding bandit problem.
Let be a sequence of iteration numbers for the MCTS algorithm that describes when the node is selected, so that the simulation phase was used on at iteration number , and so that the -th selection of node happened on the iteration number . We define sequences similarly for node-action pairs.
For all paths and actions , a payoff sequence of associated random variables is defined by . Note that in the selection phase at iteration number , must have been selected and must be a prefix of length of the leaf node reached in this iteration, so that is computed as in the backpropagation phase. According to the notations of Section 2.2, for all we have , and .
Then, one can obtain Theorem 3 by applying inductively the UCB1 results recalled in Appendix B on the search tree in a bottom-up fashion. Indeed, as the root is selected at every iteration, and , while corresponds to recursively selecting optimal actions by Proposition 2.1.
The main difficulty, and the difference our simulation phase brings compared with the proof of [18, Theorem 6], lies in showing that our payoff sequences , defined with an initial simulation step, still satisfy the drift conditions of Definition 2.2. We argue that this is true for all simulation phases defined by any and : For any MDP , horizon and state , the sequences satisfy the drift conditions.
Although the long-term guarantees of Theorem 3 hold for any simulation phase independently of the MDP, in practice one would expect better results from a good simulation, that gives a value close to the real value of the current node. Domain-specific knowledge can be used to obtain such simulations, and also to guide the selection phase based on heuristics. Our goal will be to preserve the theoretical guarantees of MCTS in the process.
4 Symbolic advice for MCTS
In this section, we introduce a notion of advice meant to guide the construction of the Monte Carlo search tree. We argue that a symbolic approach is needed in order to handle large MDPs in practice. Let a symbolic advice be a logical formula over finite paths whose truth value can be tested with an operator .
A number of standard notions can fit this framework. For example, reachability and safety properties, LTL formulæ over finite traces or regular expressions could be used. We will use a safety property for Pac-Man as an example (see Figure 1), by assuming that the losing states of the MDP should be avoided. This advice is thus satisfied by every path such that Pac-Man does not make contact with a ghost.
We denote by the set of paths such that . For a path , we denote by the set of paths such that . (Footnote 4: In particular, for all , refers to the paths of length that start from and that satisfy .)
A nondeterministic strategy is a function that maps a finite path to a subset of . For a strategy and a nondeterministic strategy , if for all , . Similarly, a nondeterministic strategy for the environment is a function that maps a finite path and an action to a subset of . We extend the notations used for probabilistic strategies to nondeterministic strategies in a natural way, so that and denote the paths of length compatible with the strategy or , respectively.
For a symbolic advice and a horizon , we define a nondeterministic strategy and a nondeterministic strategy for the environment such that for all paths with ,
The strategies and can be defined arbitrarily on paths of length at least , for example with and for all actions . Note that by definition, for all states .
Let (resp. ) denote the universal advice (resp. the empty advice) satisfied by every finite path (resp. never satisfied), and let and (resp. and ) be the associated nondeterministic strategies. We define a class of advice that can be enforced against an adversarial environment by following a nondeterministic strategy, and that are minimal in the sense that paths that are not compatible with this strategy are not allowed.
[Strongly enforceable advice] A symbolic advice is called a strongly enforceable advice from a state and for a horizon if there exists a nondeterministic strategy such that , and such that for all paths .
Note that Definition 4 ensures that paths that follow can always be extended into longer paths that follow . This is a reasonable assumption to make for a nondeterministic strategy meant to enforce a property. In particular, is a path of length 0 in , so that and so that by induction for all .
Let be a strongly enforceable advice from with horizon . It holds that . Moreover, for all paths and all actions , either or . Finally, for all paths in , and if and only if .
We have for any advice . Let us prove that for a strongly enforceable advice of associated strategy . Let be a path in . By definition of , there exists such that , so that . Since , must also belong to .
Consider a path and an action such that . We want to prove that either all stochastic transitions starting from are allowed by , or none of them are. By contradiction, let us assume that there exists and in such that for all , , and such that there exists with . From , we obtain , so that is a path that follows . Then, is a path that follows as well. It follows that , and can be extended in to a path . This implies the contradiction .
Finally, consider a path in . By the definitions of and , if and only if , so that . Then, let us write . From we get , so that , and therefore . ∎
A strongly enforceable advice is encoding a notion of guarantee, as is a winning strategy for the reachability objective on defined by the set .
We say that the strongly enforceable advice is extracted from a symbolic advice for a horizon and a state if is the greatest part of that can be guaranteed for the horizon starting from , i.e. if is the greatest subset of such that is a winning strategy for the reachability objective on . This greatest subset always exists because if and are strongly enforceable advice in , then is strongly enforceable by union of the nondeterministic strategies associated with and . However, this greatest subset may be empty, and as is not a strongly enforceable advice we say that in this case cannot be enforced from with horizon .
Consider a symbolic advice described by the safety property for Pac-Man of Example 4. For a fixed horizon , the associated nondeterministic strategies and describe action choices and stochastic transitions compatible with this property. Notably, may not be a strongly enforceable advice, as there may be situations where some stochastic transitions lead to bad states and some do not. In the small grid of Figure 1, the path of length 1 that corresponds to Pac-Man going left and the red ghost going up is allowed by the advice , but not by any safe strategy for Pac-Man as there is a possibility of losing by playing left. If a strongly enforceable advice can be extracted from , it is a more restrictive safety property, where the set of bad states is obtained as the attractor [20, Section 2.3] for the environment towards the bad states defined in . In this setting, corresponds to playing according to a strategy for Pac-Man that ensures not being eaten by adversarial ghosts for the next steps.
[Pruned MDP] For an MDP a horizon , a state and an advice , let the pruned unfolding be defined as a sub-MDP of that contains exactly all paths in satisfying . It can be obtained by removing all action transitions that are not compatible with , and all stochastic transitions that are not compatible with . The distributions are then normalised over the stochastic transitions that are left.
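On a small explicit model, the pruning and normalisation steps of this definition can be illustrated by brute force. The sketch below is not the symbolic (SAT/QBF-based) approach of the paper: it enumerates path extensions, which is exponential in the horizon, and all names (`has_extension`, `pruned_successors`, the toy model) are hypothetical. Paths are encoded as tuples alternating states and actions, and the advice is a Python predicate on paths of full length.

```python
def has_extension(path, horizon, actions, P, advice):
    """True if `path` extends to a path of `horizon` steps that satisfies the advice."""
    steps = len(path) // 2
    if steps == horizon:
        return advice(path)
    return any(
        has_extension(path + (a, t), horizon, actions, P, advice)
        for a in actions
        for t, prob in P[(path[-1], a)].items()
        if prob > 0
    )


def pruned_successors(path, action, horizon, actions, P, advice):
    """Successor distribution after `action`, restricted to advice-compatible
    successors and renormalised. An empty dict means the action is pruned."""
    kept = {
        t: prob
        for t, prob in P[(path[-1], action)].items()
        if prob > 0 and has_extension(path + (action, t), horizon, actions, P, advice)
    }
    mass = sum(kept.values())
    return {t: prob / mass for t, prob in kept.items()}


# Toy safety advice (hypothetical): never visit the state "bad".
actions = ["a", "b"]
P = {
    ("ok", "a"): {"ok": 0.75, "bad": 0.25},
    ("ok", "b"): {"ok": 1.0},
    ("bad", "a"): {"bad": 1.0},
    ("bad", "b"): {"bad": 1.0},
}


def never_bad(path):
    return "bad" not in path[0::2]  # states sit at even positions of the tuple


after_a = pruned_successors(("ok",), "a", 2, actions, P, never_bad)
after_b = pruned_successors(("ok",), "b", 2, actions, P, never_bad)
```

In this toy model the advice is not strongly enforceable through action `"a"`: one of its stochastic transitions is removed, so the remaining probability mass has to be renormalised, whereas action `"b"` keeps its original distribution.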
Note that by Lemma 4, if is a strongly enforceable advice then for all paths in , so that the normalisation step for the distributions is not needed. It follows that for all nodes in and all actions , the distributions in are the same as in . Thus, for all strategies in , , so that