Introduction
Many real-world problems can be modeled as a Partially Observable Markov Decision Process (POMDP), where the true state is unknown to the agent due to limited and noisy sensors. The agent has to reason about the history of past observations and actions, and maintain a belief state as a distribution over possible states. POMDPs have been widely used to model decision making problems in the context of planning and reinforcement learning [Ross et al.2008].
Solving POMDPs exactly is computationally intractable for domains with enormous state spaces and long planning horizons. First, the space of possible belief states grows exponentially w.r.t. the number of states $|\mathcal{S}|$, since that space is $|\mathcal{S}|$-dimensional, which is known as the curse of dimensionality [Kaelbling, Littman, and Cassandra1998]. Second, the number of possible histories grows exponentially w.r.t. the horizon length, which is known as the curse of history [Pineau, Gordon, and Thrun2006].
In the last few years, Monte-Carlo planning has been proposed to break both curses with statistical sampling. These methods construct sparse trees over belief states and actions, representing the state-of-the-art for efficient planning in large POMDPs [Silver and Veness2010, Somani et al.2013, Bai et al.2014]. While these approaches avoid exhaustive search, the constructed closed-loop trees can still become arbitrarily large for highly complex domains, which could limit the performance due to restricted memory resources [Powley, Cowling, and Whitehouse2017]. In contrast, open-loop approaches only focus on searching action sequences and are independent of the history and belief state space. Open-loop approaches can achieve competitive performance compared to closed-loop planning when the problem is too large for the available computational and memory resources [Weinstein and Littman2013, Perez Liebana et al.2015, Lecarpentier et al.2018]. However, open-loop planning has been a less popular choice for decision making in POMDPs so far [Yu et al.2005].
In this paper, we propose Partially Observable Stacked Thompson Sampling (POSTS), a memory bounded approach to open-loop planning in large POMDPs, which optimizes a fixed size stack of Thompson Sampling bandits.
To evaluate the effectiveness of POSTS, we formulate a tree-based approach, called Partially Observable Open-Loop Thompson Sampling (POOLTS), and show that POOLTS is able to find optimal open-loop plans with sufficient computational and memory resources.
We empirically test POSTS in four large benchmark problems and compare its performance with POOLTS and other tree-based approaches like POMCP. We show that POSTS achieves competitive performance compared to tree-based open-loop planning and offers a performance-memory tradeoff, making it suitable for partially observable planning with highly restricted computational and memory resources.
Background
Partially Observable Markov Decision Processes
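A POMDP is defined by a tuple $M = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \mathcal{O}, \Omega, b_0 \rangle$, where $\mathcal{S}$ is a (finite) set of states, $\mathcal{A}$ is the (finite) set of actions, $\mathcal{P}(s_{t+1} | s_t, a_t)$ is the transition probability function, $\mathcal{R}(s_t, a_t)$ is the scalar reward function, $\mathcal{O}$ is a (finite) set of observations, $\Omega(o_{t+1} | s_{t+1}, a_t)$ is the observation probability function, and $b_0$ is a probability distribution over initial states $s_0 \in \mathcal{S}$. It is always assumed that $s_t, s_{t+1} \in \mathcal{S}$, $a_t \in \mathcal{A}$, and $o_t, o_{t+1} \in \mathcal{O}$ at time step $t$.
A history $h_t = [a_0, o_1, ..., a_{t-1}, o_t]$ is a sequence of actions and observations. A belief state $b_{h_t}(s_t)$ is a sufficient statistic for history $h_t$ and defines a probability distribution over states $s_t$ given $h_t$. $\mathcal{B}$ is the space of all possible belief states. $b_0$ represents the initial belief state $b_{h_0}$. The belief state can be updated by Bayes theorem:
$$b_{h_t}(s_t) = \eta \, \Omega(o_t | s_t, a_{t-1}) \sum_{s \in \mathcal{S}} \mathcal{P}(s_t | s, a_{t-1}) \, b_{h'}(s) \tag{1}$$
where $\eta$ is a normalizing constant, $a_{t-1}$ is the last action, and $h'$ is the history without $a_{t-1}$ and $o_t$.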
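As an illustration of the Bayes update in Eq. 1, the following minimal sketch performs one exact belief update for a small discrete POMDP. The function name `belief_update`, the nested-dictionary layout of `P` and `Omega`, and the two-state example model are assumptions made for this sketch and are not part of the paper.

```python
def belief_update(belief, action, observation, P, Omega):
    """One exact Bayes update of a discrete belief state (Eq. 1).

    belief:  dict mapping state -> probability (the previous belief)
    P:       P[s][a] is a dict mapping next_state -> transition probability
    Omega:   Omega[s_next][a] is a dict mapping observation -> probability
    (This data layout is a hypothetical choice for the sketch.)
    """
    # Prediction step: sum over predecessor states weighted by the old belief.
    predicted = {}
    for s, b in belief.items():
        for s_next, p in P[s][action].items():
            predicted[s_next] = predicted.get(s_next, 0.0) + p * b
    # Correction step: weight by the probability of the received observation.
    unnormalized = {
        s_next: Omega[s_next][action].get(observation, 0.0) * p
        for s_next, p in predicted.items()
    }
    norm = sum(unnormalized.values())  # equals 1 / eta
    if norm == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: v / norm for s, v in unnormalized.items()}


# Hypothetical two-state example: "listen" keeps the state and reports it
# correctly with probability 0.85.
P = {"left": {"listen": {"left": 1.0}}, "right": {"listen": {"right": 1.0}}}
Omega = {
    "left": {"listen": {"hear_left": 0.85, "hear_right": 0.15}},
    "right": {"listen": {"hear_left": 0.15, "hear_right": 0.85}},
}
uniform = {"left": 0.5, "right": 0.5}
posterior = belief_update(uniform, "listen", "hear_left", P, Omega)
```

The two nested loops touch every pair of states in the worst case, which is why exact updates of this kind become expensive for large state spaces.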
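The goal is to find a policy $\pi: \mathcal{B} \rightarrow \mathcal{A}$, which maximizes the return $G_t$ at state $s_t$ for a horizon $T$:
$$G_t = \sum_{k=0}^{T-1} \gamma^k \, \mathcal{R}(s_{t+k}, a_{t+k}) \tag{2}$$
where $\gamma \in [0, 1]$ is the discount factor. If $\gamma < 1$, then present rewards are weighted more than future rewards.
The value function $V^{\pi}(b_{h_t}) = \mathbb{E}_{\pi}[G_t | b_{h_t}]$ is the expected return conditioned on belief states given a policy $\pi$. An optimal policy $\pi^*$ has a value function $V^{\pi^*} = V^*$ with $V^*(b_{h_t}) \geq V^{\pi'}(b_{h_t})$ for all $b_{h_t} \in \mathcal{B}$ and all $\pi' \neq \pi^*$.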
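The return of Eq. 2 can be computed from a finite sequence of observed rewards with a single backward pass. The helper below is a minimal sketch. The name `discounted_return` is hypothetical, and the reward list stands for the rewards observed along one simulated trajectory.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma^k * rewards[k], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three unit rewards with gamma = 0.5 give 1 + 0.5 + 0.25.
example_return = discounted_return([1.0, 1.0, 1.0], 0.5)
```

With `gamma = 1.0` the same function yields the undiscounted sum of rewards.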
Multi-armed Bandits
Multi-armed Bandits (MABs or bandits) are fundamental decision making problems, where an agent has to repeatedly select an arm among a given set of arms in order to maximize its future payoff. MABs can be considered as problems with a single state $s$, a set of actions $\mathcal{A}$, and a stochastic reward function $\mathcal{R}(s, a) := X_a$, where $X_a$ is a random variable with an unknown distribution $f_{X_a}(x)$. To solve a MAB, one has to determine the action which maximizes the expected reward $\mathbb{E}[X_a]$. The agent has to balance between sufficiently trying out actions to accurately estimate their expected reward and exploiting its current knowledge on all arms by selecting the arm with the currently highest expected reward. This is known as the exploration-exploitation dilemma, where exploration can lead to actions with possibly higher rewards but requires time for trying them out, while exploitation can lead to fast convergence but possibly gets stuck in a local optimum. In this paper, we will cover UCB1 and Thompson Sampling as MAB algorithms.
UCB1
In UCB1, actions are selected by maximizing the upper confidence bound of action values $UCB1(a) = \overline{X}_a + c \sqrt{\frac{2 \log(n_{total})}{n_a}}$, where $\overline{X}_a$ is the current average reward when choosing $a$, $c$ is an exploration constant, $n_{total}$ is the total number of action selections, and $n_a$ is the number of times action $a$ was selected. The second term represents the exploration bonus, which becomes smaller with increasing $n_a$ [Auer, CesaBianchi, and Fischer2002].
UCB1 is a popular MAB algorithm and widely used in various challenging domains [Kocsis and Szepesvári2006, Bubeck and Munos2010, Silver et al.2016, Silver et al.2017].
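The UCB1 selection rule can be sketched in a few lines. The function name `ucb1_select`, the list-based statistics, and the convention of trying every untried arm first are assumptions of this sketch. The exploration bonus is assumed to have the form $c \sqrt{2 \log(n_{total}) / n_a}$.

```python
import math


def ucb1_select(mean_rewards, counts, c=1.0):
    """Select an arm index by maximizing mean reward plus exploration bonus."""
    # Arms that were never tried have an undefined bonus, so try them first.
    for a, n_a in enumerate(counts):
        if n_a == 0:
            return a
    n_total = sum(counts)

    def upper_confidence_bound(a):
        bonus = c * math.sqrt(2.0 * math.log(n_total) / counts[a])
        return mean_rewards[a] + bonus

    return max(range(len(counts)), key=upper_confidence_bound)
```

For equal average rewards, the rarely selected arm wins because its bonus is larger, which is the exploration effect described above.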
Thompson Sampling
Thompson Sampling is a Bayesian approach to balance between exploration and exploitation of actions [Thompson1933]. The unknown reward distribution of $X_a$ of each action $a \in \mathcal{A}$ is modeled by a parametrized likelihood function $P_a(x | \theta)$ with a parameter vector $\theta$. Given a prior distribution $P_a(\theta)$ and a set of past observed rewards $D_a = \{x_1, x_2, ..., x_{n_a}\}$, the posterior distribution $P_a(\theta | D_a)$ can be inferred by using Bayes rule $P_a(\theta | D_a) \propto \prod_i P_a(x_i | \theta) P_a(\theta)$. The expected reward of each action $a \in \mathcal{A}$ can be estimated by sampling $\theta \sim P_a(\theta | D_a)$ from the posterior. The action with the highest sampled expected reward is selected.
Thompson Sampling has been shown to be an effective and robust algorithm for making decisions under uncertainty [Chapelle and Li2011, Kaufmann, Korda, and Munos2012, Bai, Wu, and Chen2013, Bai et al.2014].
Planning in POMDPs
Planning searches for a (near-)optimal policy given a model $\hat{M}$ of the environment $M$, which usually consists of explicit probability distributions of the POMDP. Unlike offline planning, which searches the whole (belief) state space to find the optimal policy $\pi^*$, local planning only focuses on finding a policy $\pi_t$ for the current (belief) state by taking possible future (belief) states into account [Weinstein and Littman2013]. Thus, local planning can be applied online at every time step at the current state to recommend the next action for execution. Local planning is usually restricted to a time or computation budget due to strict real-time constraints [Bubeck and Munos2010, Weinstein and Littman2013, Perez Liebana et al.2015].
In this paper, we focus on local Monte-Carlo planning, where $\hat{M}$ is a generative model, which can be used as a black box simulator [Kocsis and Szepesvári2006, Silver and Veness2010, Weinstein and Littman2013, Bai et al.2014]. Given $s_t$ and $a_t$, the simulator provides a sample $\langle s_{t+1}, o_{t+1}, r_t \rangle \sim \hat{M}(s_t, a_t)$. Monte-Carlo planning algorithms can approximate $\pi^*$ and $V^*$ by iteratively simulating and evaluating action sequences without reasoning about explicit probability distributions of the POMDP.
Local planning can be closed-loop or open-loop. Closed-loop planning conditions the action selection on histories of actions and observations. Open-loop planning only conditions the action selection on previous sequences of actions (also called open-loop plans or simply plans) and summarized statistics about predecessor (belief) states [Bubeck and Munos2010, Weinstein and Littman2013, Perez Liebana et al.2015]. An example is shown in Fig. 1. A closed-loop tree for an example domain is shown in Fig. 1a, while Fig. 1b shows the corresponding open-loop tree, which summarizes the observation nodes of Fig. 1a within the blue dotted ellipses into history distribution nodes. Open-loop planning can be further simplified by only regarding statistics about the expected return of actions at specific time steps (Fig. 1c). In that case, only a stack of statistics is used to sample plans for simulation and evaluation [Weinstein and Littman2013].
Partially Observable Monte-Carlo Planning (POMCP) is a closed-loop approach based on Monte-Carlo Tree Search (MCTS) [Silver and Veness2010]. POMCP uses a search tree of histories with o-nodes representing observations and a-nodes representing actions (Fig. 1a). Each o-node has a visit count $N(h_t)$ and a value estimate $V(h_t)$ for history $h_t$ and belief state $b_{h_t}$. Each a-node has a visit count $N(h_t, a_t)$ and a value estimate $Q(h_t, a_t)$ for action $a_t$ and history $h_t$. A simulation starts at the current belief state $b_{h_t}$ and is divided into two stages: In the first stage, a tree policy $\pi_{tree}$ is used to traverse the tree until a leaf node is reached. Actions are selected via $\pi_{tree}$ and simulated in the generative model $\hat{M}$ to determine the next nodes to visit. $\pi_{tree}$ can be implemented with MABs, where each o-node represents a MAB. In the second stage, a rollout policy $\pi_{rollout}$ is used to sample action sequences until a terminal state or a maximum search depth $T$ is reached. The observed rewards are accumulated to returns $G_t$ (Eq. 2), propagated back to update the value estimate of every node in the simulated path, and a new leaf node is added to the search tree. $\pi_{rollout}$ can be used to integrate domain knowledge into the planning process to focus the search on promising states [Silver and Veness2010]. The original version of POMCP uses UCB1 for $\pi_{tree}$ and is shown to converge to the optimal best-first tree with sufficient computation [Silver and Veness2010].
[Lecarpentier et al.2018] formulates an open-loop variant of MCTS using UCB1 as $\pi_{tree}$, called Open-Loop Upper Confidence bound for Trees (OLUCT), which could be easily extended to POMDPs by constructing a tree that summarizes all o-nodes to history distribution nodes (Fig. 1b).
Open-loop planning generally converges to sub-optimal solutions in stochastic domains, since it ignores (belief) state values and optimizes the node values (Fig. 1b) instead [Lecarpentier et al.2018]. If the problem is too complex to provide sufficient computation budget or memory capacity, then open-loop approaches are competitive with closed-loop approaches, since they need to explore a much smaller search space to converge to an appropriate solution [Weinstein and Littman2013, Lecarpentier et al.2018].
Related Work
Tree-based approaches to open-loop planning condition the action selection on previous action sequences as shown in Fig. 1b [Bubeck and Munos2010, Perez Liebana et al.2015, Lecarpentier et al.2018]. Such approaches have been thoroughly evaluated for fully observable problems, but have been less popular for partially observable problems so far [Yu et al.2005]. POSTS is based on stacked open-loop planning, where a stack of distributions over actions is maintained to generate open-loop plans with high expected return [Weinstein and Littman2013, Belzner and Gabor2017]. Unlike previous approaches, POSTS is a memory-bounded open-loop approach to partially observable planning.
[Yu et al.2005] proposed an open-loop approach to planning in POMDPs by using hierarchical planning. An open-loop plan is constructed at an abstract level, where uncertainty w.r.t. particular actions is ignored. A low-level planner controls the actual execution by explicitly dealing with uncertainty. POSTS is more general, since it performs planning directly on the original problem and does not require the POMDP to be transformed for hierarchical planning.
[Powley, Cowling, and Whitehouse2017] proposed a memory bounded version of MCTS with a state pool to add, discard, or reuse states depending on their visitation frequency. However, this approach cannot be easily adapted to tree-based open-loop approaches, because it requires (belief) states to be identifiable. POSTS does not require a pool to reuse states or nodes, but only maintains a fixed size stack of Thompson Sampling bandits, which adapt according to the temporal dependencies between actions.
Open-Loop Search with Thompson Sampling
Generalized Thompson Sampling
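We use a variant of Thompson Sampling, which works for arbitrary reward distributions as proposed in [Bai, Wu, and Chen2013, Bai et al.2014], by assuming that $X_a$ follows a Normal distribution $\mathcal{N}(\mu, \frac{1}{\tau})$ with unknown mean $\mu$ and precision $\tau = \frac{1}{\sigma^2}$, where $\sigma^2$ is the variance. $\langle \mu, \tau \rangle$ follows a Normal Gamma distribution $\mathcal{NG}(\mu_0, \lambda, \alpha, \beta)$ with $\lambda > 0$, $\alpha \geq 1$, and $\beta \geq 0$. The distribution over $\tau$ is a Gamma distribution $\tau \sim Gamma(\alpha, \beta)$ and the conditional distribution over $\mu$ given $\tau$ is a Normal distribution $\mu \sim \mathcal{N}(\mu_0, \frac{1}{\lambda \tau})$.
Given a prior distribution $P(\theta) = \mathcal{NG}(\mu_0, \lambda_0, \alpha_0, \beta_0)$ and $n$ observations $D = \{x_1, ..., x_n\}$, the posterior distribution is defined by $P(\theta | D) = \mathcal{NG}(\mu_1, \lambda_1, \alpha_1, \beta_1)$, where $\mu_1 = \frac{\lambda_0 \mu_0 + n \overline{X}}{\lambda_0 + n}$, $\lambda_1 = \lambda_0 + n$, $\alpha_1 = \alpha_0 + \frac{n}{2}$, and $\beta_1 = \beta_0 + \frac{1}{2} \left( n \sigma^2 + \frac{\lambda_0 n (\overline{X} - \mu_0)^2}{\lambda_0 + n} \right)$. $\overline{X}$ is the mean of all values in $D$ and $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \overline{X})^2$ is the variance.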
The posterior is inferred for each action $a \in \mathcal{A}$ to sample an estimate $\mu_a$ for the expected return. The action with the highest estimate is selected. The complete formulation is given in Algorithm 1.
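The following sketch illustrates how a Normal Gamma posterior can be computed from observed returns and how an action can be selected by sampling from it. It is not the paper's Algorithm 1. The function names, the list-of-returns interface, and the default prior values (`lambda0=0.01`, `beta0=100.0`) are illustrative assumptions. The sketch uses the standard conjugate update for a Normal likelihood with unknown mean and precision and requires `beta0 > 0`.

```python
import math
import random


def posterior_params(rewards, mu0=0.0, lambda0=0.01, alpha0=1.0, beta0=100.0):
    """Normal Gamma posterior parameters (mu1, lambda1, alpha1, beta1)."""
    n = len(rewards)
    if n == 0:
        return mu0, lambda0, alpha0, beta0
    mean = sum(rewards) / n
    var = sum((x - mean) ** 2 for x in rewards) / n
    lambda1 = lambda0 + n
    mu1 = (lambda0 * mu0 + n * mean) / lambda1
    alpha1 = alpha0 + n / 2.0
    beta1 = beta0 + 0.5 * (n * var + lambda0 * n * (mean - mu0) ** 2 / lambda1)
    return mu1, lambda1, alpha1, beta1


def sample_mean_estimate(rewards, rng, **prior):
    """Sample tau from the Gamma posterior, then mu from the conditional Normal."""
    mu1, lambda1, alpha1, beta1 = posterior_params(rewards, **prior)
    # random.gammavariate takes a scale parameter, hence 1 / beta1 (beta1 is a rate).
    tau = max(rng.gammavariate(alpha1, 1.0 / beta1), 1e-12)
    return rng.gauss(mu1, math.sqrt(1.0 / (lambda1 * tau)))


def thompson_select(rewards_per_action, rng, **prior):
    """Return the index of the action with the highest sampled mean estimate."""
    estimates = [sample_mean_estimate(r, rng, **prior) for r in rewards_per_action]
    return max(range(len(estimates)), key=estimates.__getitem__)
```

For an action without observations, the estimate is drawn from the prior, whose large variance makes untried actions likely to be selected early on.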
The prior should ideally reflect knowledge about the underlying model, especially for initial turns, where only a small amount of data has been observed [Honda and Takemura2014]. If no knowledge is available, then uninformative priors should be chosen, where all possibilities can be sampled (almost) uniformly. This can be achieved by choosing the priors such that the variance of the resulting Normal distribution $\mathcal{N}(\mu_0, \frac{1}{\lambda_0 \tau})$ becomes infinite ($\frac{1}{\lambda_0 \tau} \rightarrow \infty$ and $\lambda_0 \tau \rightarrow 0$). Since $\tau$ follows a Gamma distribution $Gamma(\alpha_0, \beta_0)$ with expectation $\mathbb{E}(\tau) = \frac{\alpha_0}{\beta_0}$, $\alpha_0$ and $\beta_0$ should be chosen such that $\frac{\alpha_0}{\beta_0} \rightarrow 0$. Given the hyperparameter space $\lambda_0 > 0$, $\alpha_0 \geq 1$, and $\beta_0 \geq 0$, it is recommended to set $\alpha_0 = 1$ and $\mu_0 = 0$ to center the Normal distribution. $\lambda_0$ should be chosen small enough and $\beta_0$ should have a sufficiently large value [Bai et al.2014].
Monte Carlo Belief State Update
The belief state can be updated exactly according to Eq. 1. However, exact Bayes updates may be computationally infeasible in POMDPs with large state spaces due to the curse of dimensionality. For this reason, we approximate the belief state $b_{h_t}$ for history $h_t$ with a particle filter as described in [Silver and Veness2010]. The belief state is represented by a set $\hat{b}_{h_t}$ of sample states or particles. After execution of $a_t$ and observation of $o_{t+1}$, the particles are updated by Monte Carlo simulation. Sampled states $s_t \sim \hat{b}_{h_t}$ are simulated with the generative model $\hat{M}$ such that $\langle s_{t+1}, o', r_t \rangle \sim \hat{M}(s_t, a_t)$. If $o' = o_{t+1}$, then $s_{t+1}$ is added to $\hat{b}_{h_{t+1}}$.
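The particle update described above amounts to rejection sampling against the real observation. The sketch below assumes a hypothetical simulator interface `simulate(state, action, rng)` that returns a tuple of next state, observation, and reward. The function name, the `max_tries` safeguard, and the toy simulator are likewise assumptions of this sketch.

```python
import random


def update_particles(particles, action, observation, simulate, n_particles, rng,
                     max_tries=100000):
    """Build the next particle set by keeping simulated successors whose
    simulated observation matches the real observation."""
    new_particles = []
    tries = 0
    while len(new_particles) < n_particles and tries < max_tries:
        tries += 1
        state = rng.choice(particles)  # sample a state from the current belief
        next_state, simulated_obs, _reward = simulate(state, action, rng)
        if simulated_obs == observation:
            new_particles.append(next_state)
    return new_particles


def toy_simulate(state, action, rng):
    """Hypothetical toy model: the state is an integer and the observation
    is the parity of the successor state."""
    next_state = state + action
    return next_state, next_state % 2, 0.0


updated = update_particles([0, 1, 2, 3], 1, 0, toy_simulate, 10, random.Random(0))
```

The `max_tries` limit keeps the loop finite when the real observation is very unlikely under the current particles, in which case fewer than `n_particles` particles are returned.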
POOLTS
To evaluate the effectiveness of POSTS compared to other open-loop planners, we first define Partially Observable Open-Loop Thompson Sampling (POOLTS) and show that POOLTS is able to converge to an optimal open-loop plan, if sufficient computational and memory resources are provided. POOLTS is a tree-based approach based on OLUCT from [Lecarpentier et al.2018]. Each node represents a Thompson Sampling bandit and stores the statistics of the observed returns that are required for the posterior update (their number, mean, and variance) for each action $a \in \mathcal{A}$.
A simulation starts at a state $s_t$, which is sampled from the current belief state $b_{h_t}$. The belief state is approximated by a particle filter as described above. An open-loop tree (Fig. 1b) is iteratively constructed by traversing the current tree in a selection step by using Thompson Sampling to select actions. When a leaf node is reached, it is expanded by a child node and a rollout is performed by using a policy $\pi_{rollout}$ until a terminal state is reached or a maximum search depth $T$ is exceeded. The observed rewards are accumulated to returns $G_t$ (Eq. 2) and propagated back to update the corresponding bandit of every node in the simulated path. When the computation budget has run out, the action with the highest expected return is selected for execution. The complete formulation of POOLTS is given in Algorithm 2.
[Kocsis and Szepesvári2006, Bubeck and Munos2010, Lecarpentier et al.2018] have shown that tree search algorithms using UCB1 converge to the optimal closed-loop or open-loop plan respectively, if the computation budget is sufficiently large. This is because the expected state or node values in the leaf nodes become stationary, given a stationary rollout policy $\pi_{rollout}$. This enables the values in the preceding nodes to converge as well, leading to state- or node-wise optimal actions. By replacing UCB1 with Thompson Sampling, the tree search should still converge to the optimal closed-loop or open-loop plan, since Thompson Sampling also converges to the optimal action, if the return distribution of each action becomes stationary [Agrawal and Goyal2013]. [Chapelle and Li2011, Bai et al.2014] empirically demonstrated that Thompson Sampling converges faster than UCB1, when rewards are sparse and when the number of arms is large.
POSTS
Partially Observable Stacked Thompson Sampling (POSTS) is an open-loop approach, which optimizes a stack of $T$ Thompson Sampling bandits to search for high-quality open-loop plans (Fig. 1c). Each bandit stores the statistics of the observed returns that are required for the posterior update (their number, mean, and variance) for each action $a \in \mathcal{A}$.
Similarly to POOLTS, a simulation starts at a state $s_t$, which is sampled from a particle filter representing the current belief state $b_{h_t}$. Unlike POOLTS, a fixed size stack of $T$ bandits $\langle N_1, ..., N_T \rangle$ is used to sample plans $p = [a_1, ..., a_T]$. $p$ is evaluated with the generative model $\hat{M}$ to observe immediate rewards $r_1, ..., r_T$, which are accumulated to returns $G_1, ..., G_T$ (Eq. 2). Each bandit $N_t$ is then updated with the corresponding return $G_t$. When the computation budget has run out, the action with the highest expected return is selected for execution. The complete formulation of POSTS is given in Algorithm 3.
The idea of POSTS is to only regard the temporal dependencies between the actions of an open-loop plan. The bandit stack is used to learn these dependencies with the expected (discounted) return. When a bandit $N_t$ samples an action $a_t$ with a resulting reward of $r_t$, then all preceding bandits $N_k$ with $1 \leq k < t$ are updated with $r_t$, using a discounted value of $\gamma^{t-k} r_t$. This is because the actions sampled by all preceding bandits are possibly relevant for obtaining the reward $r_t$. By only regarding these temporal dependencies, POSTS is memory bounded, not requiring a search tree to model dependencies between histories or history distributions (Fig. 1a and 1b).
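The planning loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's Algorithm 3: the class `ThompsonBandit`, the function `posts_plan`, the simulator interface `simulate(state, action, rng)` returning next state, reward, and a termination flag, the prior values, and the toy simulator are all hypothetical. Masking of illegal actions is omitted.

```python
import math
import random


class ThompsonBandit:
    """Normal Gamma Thompson Sampling bandit with per-action return statistics."""

    def __init__(self, n_actions, mu0=0.0, lambda0=0.01, alpha0=1.0, beta0=100.0):
        self.prior = (mu0, lambda0, alpha0, beta0)
        self.n = [0] * n_actions        # number of observed returns per action
        self.mean = [0.0] * n_actions   # running mean of returns per action
        self.m2 = [0.0] * n_actions     # running sum of squared deviations

    def sample_estimate(self, a, rng):
        mu0, lambda0, alpha0, beta0 = self.prior
        n, mean = self.n[a], self.mean[a]
        var = self.m2[a] / n if n > 0 else 0.0
        lambda1 = lambda0 + n
        mu1 = (lambda0 * mu0 + n * mean) / lambda1
        alpha1 = alpha0 + n / 2.0
        beta1 = beta0 + 0.5 * (n * var + lambda0 * n * (mean - mu0) ** 2 / lambda1)
        tau = max(rng.gammavariate(alpha1, 1.0 / beta1), 1e-12)
        return rng.gauss(mu1, math.sqrt(1.0 / (lambda1 * tau)))

    def select(self, rng):
        return max(range(len(self.n)), key=lambda a: self.sample_estimate(a, rng))

    def update(self, a, x):
        # Welford's online update of mean and squared deviations.
        self.n[a] += 1
        delta = x - self.mean[a]
        self.mean[a] += delta / self.n[a]
        self.m2[a] += delta * (x - self.mean[a])


def posts_plan(particles, simulate, n_actions, horizon, budget, gamma, rng):
    """Optimize a fixed-size stack of bandits and return (best first action, stack)."""
    stack = [ThompsonBandit(n_actions) for _ in range(horizon)]
    for _ in range(budget):
        state = rng.choice(particles)  # start state sampled from the belief particles
        plan, rewards = [], []
        for bandit in stack:
            action = bandit.select(rng)
            state, reward, done = simulate(state, action, rng)
            plan.append(action)
            rewards.append(reward)
            if done:
                break
        g = 0.0
        for t in reversed(range(len(plan))):  # accumulate discounted returns backwards
            g = rewards[t] + gamma * g
            stack[t].update(plan[t], g)
    first = stack[0]
    best = max(range(n_actions),
               key=lambda a: first.mean[a] if first.n[a] > 0 else float("-inf"))
    return best, stack


def toy_simulate(state, action, rng):
    """Hypothetical toy model: action 1 always yields reward 1, action 0 yields 0."""
    return state + 1, (1.0 if action == 1 else 0.0), False


best_action, bandit_stack = posts_plan([0], toy_simulate, n_actions=2, horizon=3,
                                       budget=300, gamma=0.5, rng=random.Random(0))
```

The number of bandits equals the horizon and does not depend on the number of simulations, which reflects the memory bound discussed above.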
Experiments
Evaluation Environments
We tested POSTS in the RockSample, Battleship, and PocMan domains, which are well-known benchmark problems for decision making in POMDPs [Silver and Veness2010, Somani et al.2013, Bai et al.2014]. For each domain, we set the discount factor $\gamma$ as proposed in [Silver and Veness2010]. The results were compared with POMCP, POOLTS, and a partially observable version of OLUCT, which we call POOLUCT. The problem-size features of all domains are shown in Table 1.
Table 1: Problem-size features of all domains.
Feature  RockSample(11,11)  RockSample(15,15)  Battleship  PocMan
# States  
# Actions  
# Observations  
The RockSample(n,k) problem simulates an agent moving in an $n \times n$ grid containing $k$ rocks [Smith and Simmons2004]. Each rock can be good or bad, but the true state of each rock is unknown. The agent has to sample good rocks, while avoiding bad rocks. It has a noisy sensor, which produces an observation for a particular rock. The probability of sensing the correct state of the rock decreases exponentially with the agent's distance to that rock. Sampling a good rock gives a positive reward, while sampling a bad rock gives a negative reward. If a good rock was sampled, it becomes bad. Moving and sensing do not give any rewards. Moving past the east edge of the grid gives a positive reward and terminates the episode.
In Battleship, five ships of size 1, 2, 3, 4, and 5 respectively are randomly placed into a grid, where the agent has to sink all ships without knowing their actual positions [Silver and Veness2010]. Hitting a cell that is occupied by a ship gives a reward. In addition, there is a reward per time step and a terminal reward for hitting all ships.
PocMan is a partially observable version of PacMan [Silver and Veness2010]. The agent navigates in a maze and has to eat randomly distributed food pellets and power pills. There are four ghosts moving randomly in the maze. If the agent is within the visible range of a ghost, it is chased by the ghost. The agent dies if it touches the ghost, which terminates the episode with a negative reward. Eating a power pill enables the agent to eat ghosts for 15 time steps, during which the ghosts run away from the agent. A reward is given at each time step, for each eaten food pellet, and for each eaten ghost. The agent can only perceive ghosts if they are in its direct line of sight in each cardinal direction or within a hearing range. Also, the agent can only sense walls and food pellets which are adjacent to it.
Methods
POMCP
We use the POMCP implementation from [Silver and Veness2010]. The tree policy $\pi_{tree}$ selects actions from a set of legal actions with UCB1. The rollout policy $\pi_{rollout}$ randomly selects actions from the set of legal actions, depending on the currently simulated state $s_t$.
In each simulation step, there is at most one expansion step, where new nodes are added to the search tree. Thus, the tree size should increase linearly w.r.t. the computation budget in large POMDPs.
POOLTS and POOLUCT
POOLTS is implemented according to Algorithm 2, where actions are selected from a set of legal actions with Thompson Sampling (Algorithm 1) in the first stage. The rollout policy $\pi_{rollout}$ randomly selects actions from the set of legal actions, depending on the currently simulated state $s_t$. POOLUCT is similar to POOLTS but uses UCB1 as action selection strategy in the first stage. Since open-loop planning can encounter different states at the same node (Fig. 1), the set of legal actions may vary for each state $s_t$. We always mask out the statistics of currently illegal actions, regardless of whether they have high average action values, to avoid selecting them.
Similarly to POMCP, the search tree size should increase linearly w.r.t. the computation budget, but with fewer nodes, since open-loop trees store summarized information about history distributions.
POSTS
POSTS is implemented as a stack of $T$ Thompson Sampling bandits according to Algorithm 3. Starting at the first bandit, all bandits apply Thompson Sampling to a set of legal actions, depending on the currently simulated state $s_t$. Similarly to POOLTS and POOLUCT, we mask out the statistics of currently illegal actions and only regard the value statistics of legal actions for selection during planning.
Given a horizon of $T$, POSTS always maintains $T$ Thompson Sampling bandits, independently of the computation budget.
Results
We ran each approach on RockSample, Battleship, and PocMan with different settings, each for 100 runs or at most 12 hours of total computation. We evaluated the performance of each approach with the undiscounted return ($\gamma = 1$), because we focus on the actual effectiveness instead of the quality of optimization [Bai et al.2014]. For POMCP and POOLUCT we set the UCB1 exploration constant to the reward range of each domain as proposed in [Silver and Veness2010].
Prior Sensitivity
Since we assume no additional domain knowledge, we focus on uninformative priors with fixed $\mu_0$, $\alpha_0$, and $\lambda_0$ as proposed in [Bai et al.2014]. With this setting, $\beta_0$ controls the degree of initial exploration during the planning phase, thus its impact on the performance of POOLTS and POSTS is evaluated. The results for POOLTS and POSTS are shown in Fig. 2.
In RockSample, POSTS slightly outperforms POOLTS and keeps up in performance with POMCP. POOLTS slightly outperforms POSTS and POMCP in Battleship, with POSTS only being able to keep up in some settings. POMCP clearly outperforms all open-loop approaches in PocMan. POOLTS slightly outperforms POSTS in PocMan, with POSTS only being able to keep up in some settings. POOLUCT performs worst in all domains except Battleship, where it performs best for a particular computation budget. POSTS performs slightly better if $\beta_0$ is large, whereas POOLTS seems to be insensitive to the choice of $\beta_0$ except in PocMan, where it performs better if $\beta_0$ is large.
Horizon Sensitivity
We evaluated the sensitivity of all approaches w.r.t. different horizons $T$. The results are shown in Fig. 3. Footnote 1: Using computation budgets between 1024 and 16384 led to similar plots, thus we stick to a single budget, with all approaches requiring less than one second per action [Silver and Veness2010].
In RockSample(11,11), POMCP and POOLUCT reach their performance peak at a different horizon than POSTS and POOLTS. In all other domains, most approaches seem to reach their performance peak at the same horizon. For longer horizons, there is no significant improvement or even degrading performance for most approaches except for POMCP, which slightly improves in all domains but RockSample(11,11).
Performance-Memory Tradeoff
We evaluated the performance-memory tradeoff of all approaches by introducing a memory capacity, where the computation is interrupted when the number of nodes exceeds this capacity. For POMCP, we count the number of o-nodes and a-nodes (Fig. 1a). For POOLTS and POOLUCT, we count the number of history distribution nodes (Fig. 1b). For POSTS, we count the number of Thompson Sampling bandits, which is always $T$. The results are shown in Fig. 4. POSTS never uses more than 100 nodes in any setting.
In RockSample and Battleship, POMCP is outperformed by POSTS and POOLTS (and also POOLUCT in Battleship). POSTS always performs best in these domains when the memory capacity is highly restricted. POMCP performs best in PocMan, outperforming POSTS beyond a certain memory capacity, while POOLTS keeps up with the best POSTS setting beyond a certain memory capacity. POOLUCT performs worst except in Battleship, improving least and slowest with increasing memory capacity. It outperforms POMCP in RockSample(15,15) in some settings, though. In PocMan, POOLUCT creates fewer than 550 nodes, indicating that the search tree construction has converged and does not improve any further.
Discussion
The experiments show that partially observable open-loop planning can be a good alternative to closed-loop planning when the action space is large, stochasticity is low, and computational and memory resources are highly restricted. Approaches based on Thompson Sampling like POOLTS and POSTS in particular seem to be very effective and robust w.r.t. the hyperparameter choice. Setting a large value for $\beta_0$ seems to be beneficial for large problems (Fig. 2). This is because an enormous search space needs to be explored, while avoiding premature convergence to poor solutions. However, if $\beta_0$ is too large, POSTS and POOLTS might converge too slowly, thus requiring much more computation [Bai et al.2014]. If the horizon $T$ is too large, then the value estimates have very high variance, making bandit adaptation more difficult. This could explain the performance stagnation or degradation for most approaches in Fig. 3 for long horizons. The performance of POSTS scales similarly to POOLTS w.r.t. the parameters varied in Fig. 2 and 3. POSTS is also more robust than POOLUCT w.r.t. changes to these parameters, except in Battleship, where both approaches scale similarly in some settings.
POSTS is competitive with POOLTS and superior to POOLUCT in all settings except some settings in Battleship, even though POOLTS and POOLUCT theoretically converge to optimal open-loop plans, given sufficient computation budget and memory capacity. POSTS is shown to be superior to all other approaches in RockSample and Battleship when memory resources are highly restricted. It is only outperformed by the tree-based approaches in Battleship after thousands of nodes were created, which consumes much more memory than POSTS with its 100 nodes at most. This might be due to the relatively large action space of these domains (Table 1), where all tree-based planners construct enormous trees with high branching factors when exploring the effect of each action. RockSample and Battleship have low stochasticity, since state transitions are deterministic. In both domains the agent is primarily uncertain about the real state, thus the planning quality only depends on the belief state approximation and the uncertainty about observations (only in RockSample).
POMCP performs best in PocMan. This might be due to the small action space (Table 1) and high stochasticity (where all ghosts primarily move randomly), since open-loop planning is known to converge to sub-optimal solutions in such domains [Weinstein and Littman2013, Lecarpentier et al.2018]. However, POMCP has the highest memory consumption, since it constructs larger trees than open-loop approaches with the same computation budget. In PocMan, POSTS is able to keep up with POOLTS, while being much more memory-efficient (see the PocMan panels (d) of the result figures).
Conclusion and Future Work
In this paper, we proposed Partially Observable Stacked Thompson Sampling (POSTS), a memory bounded approach to open-loop planning in large POMDPs, which optimizes a fixed size stack of Thompson Sampling bandits.
To evaluate the effectiveness of POSTS, we formulated a tree-based approach, called POOLTS, and showed that POOLTS is able to find optimal open-loop plans with sufficient computational and memory resources.
We empirically tested POSTS in four large benchmark problems and showed that POSTS achieves competitive performance compared to tree-based open-loop planners like POOLTS and POOLUCT, if sufficient resources are provided. Unlike tree-based approaches, POSTS offers a performance-memory tradeoff by performing best if computational and memory resources are highly restricted, making it suitable for efficient partially observable planning.
For the future, we plan to apply POSTS to conformant planning problems [Hoffmanna and Brafmanb2006, Palacios and Geffner2009, Geffner and Bonet2013] and to extend it to multi-agent settings [Phan et al.2018].
References
 [Agrawal and Goyal2013] Agrawal, S., and Goyal, N. 2013. Further Optimal Regret Bounds for Thompson Sampling. In Artificial Intelligence and Statistics, 99–107.
 [Auer, CesaBianchi, and Fischer2002] Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-Time Analysis of the Multiarmed Bandit Problem. Machine Learning 47(2-3):235–256.
 [Bai et al.2014] Bai, A.; Wu, F.; Zhang, Z.; and Chen, X. 2014. Thompson Sampling based Monte-Carlo Planning in POMDPs. In Proceedings of the Twenty-Fourth International Conference on Automated Planning and Scheduling, 29–37. AAAI Press.
 [Bai, Wu, and Chen2013] Bai, A.; Wu, F.; and Chen, X. 2013. Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search. In Advances in Neural Information Processing Systems, 1646–1654.
 [Belzner and Gabor2017] Belzner, L., and Gabor, T. 2017. Stacked Thompson Bandits. In Proceedings of the 3rd International Workshop on Software Engineering for Smart Cyber-Physical Systems, 18–21. IEEE Press.
 [Bubeck and Munos2010] Bubeck, S., and Munos, R. 2010. Open Loop Optimistic Planning. In COLT, 477–489.
 [Chapelle and Li2011] Chapelle, O., and Li, L. 2011. An Empirical Evaluation of Thompson Sampling. In Advances in neural information processing systems, 2249–2257.
 [Geffner and Bonet2013] Geffner, H., and Bonet, B. 2013. A Concise Introduction to Models and Methods for Automated Planning. Synthesis Lectures on Artificial Intelligence and Machine Learning 8(1):1–141.

 [Hoffmanna and Brafmanb2006] Hoffmann, J., and Brafman, R. I. 2006. Conformant Planning via Heuristic Forward Search: A New Approach. Artificial Intelligence 170:507–541.
 [Honda and Takemura2014] Honda, J., and Takemura, A. 2014. Optimality of Thompson Sampling for Gaussian Bandits depends on Priors. In Artificial Intelligence and Statistics, 375–383.
 [Kaelbling, Littman, and Cassandra1998] Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and Acting in Partially Observable Stochastic Domains. Artificial intelligence 101(1):99–134.
 [Kaufmann, Korda, and Munos2012] Kaufmann, E.; Korda, N.; and Munos, R. 2012. Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis. In International Conference on Algorithmic Learning Theory, 199–213. Springer.
 [Kocsis and Szepesvári2006] Kocsis, L., and Szepesvári, C. 2006. Bandit based Monte-Carlo Planning. In ECML, volume 6, 282–293. Springer.
 [Lecarpentier et al.2018] Lecarpentier, E.; Infantes, G.; Lesire, C.; and Rachelson, E. 2018. Open Loop Execution of Tree-Search Algorithms. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2362–2368. IJCAI Organization.
 [Palacios and Geffner2009] Palacios, H., and Geffner, H. 2009. Compiling Uncertainty away in Conformant Planning Problems with Bounded Width. Journal of Artificial Intelligence Research 35:623–675.

 [Perez Liebana et al.2015] Perez Liebana, D.; Dieskau, J.; Hunermund, M.; Mostaghim, S.; and Lucas, S. 2015. Open Loop Search for General Video Game Playing. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, 337–344. ACM.
 [Phan et al.2018] Phan, T.; Belzner, L.; Gabor, T.; and Schmid, K. 2018. Leveraging Statistical Multi-Agent Online Planning with Emergent Value Function Approximation. In Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’18, 730–738. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems.
 [Pineau, Gordon, and Thrun2006] Pineau, J.; Gordon, G.; and Thrun, S. 2006. Anytime Point-based Approximations for Large POMDPs. Journal of Artificial Intelligence Research 27:335–380.
 [Powley, Cowling, and Whitehouse2017] Powley, E.; Cowling, P.; and Whitehouse, D. 2017. Memory Bounded Monte Carlo Tree Search. AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment.
 [Ross et al.2008] Ross, S.; Pineau, J.; Paquet, S.; and Chaib-Draa, B. 2008. Online Planning Algorithms for POMDPs. Journal of Artificial Intelligence Research 32:663–704.
 [Silver and Veness2010] Silver, D., and Veness, J. 2010. Monte-Carlo Planning in Large POMDPs. In Advances in Neural Information Processing Systems, 2164–2172.

 [Silver et al.2016] Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature 529(7587):484–489.
 [Silver et al.2017] Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017. Mastering the Game of Go without Human Knowledge. Nature 550(7676):354–359.
 [Smith and Simmons2004] Smith, T., and Simmons, R. 2004. Heuristic Search Value Iteration for POMDPs. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, 520–527. AUAI Press.
 [Somani et al.2013] Somani, A.; Ye, N.; Hsu, D.; and Lee, W. S. 2013. DESPOT: Online POMDP Planning with Regularization. In Advances in neural information processing systems, 1772–1780.
 [Thompson1933] Thompson, W. R. 1933. On the Likelihood that One Unknown Probability exceeds Another in View of the Evidence of Two Samples. Biometrika 25(3/4):285–294.
 [Weinstein and Littman2013] Weinstein, A., and Littman, M. L. 2013. Open-loop Planning in Large-Scale Stochastic Domains. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, 1436–1442. AAAI Press.
 [Yu et al.2005] Yu, C.; Chuang, J.; Gerkey, B.; Gordon, G.; and Ng, A. 2005. Open-Loop Plans in Multi-Robot POMDPs. Technical report, Stanford CS Dept.