Memory Bounded Open-Loop Planning in Large POMDPs using Thompson Sampling

05/10/2019
by Thomy Phan et al.
Universität München

State-of-the-art approaches to partially observable planning like POMCP are based on stochastic tree search. While these approaches are computationally efficient, they may still construct search trees of considerable size, which could limit the performance due to restricted memory resources. In this paper, we propose Partially Observable Stacked Thompson Sampling (POSTS), a memory bounded approach to open-loop planning in large POMDPs, which optimizes a fixed size stack of Thompson Sampling bandits. We empirically evaluate POSTS in four large benchmark problems and compare its performance with different tree-based approaches. We show that POSTS achieves competitive performance compared to tree-based open-loop planning and offers a performance-memory tradeoff, making it suitable for partially observable planning with highly restricted computational and memory resources.


Introduction

Many real-world problems can be modeled as a Partially Observable Markov Decision Process (POMDP), where the true state is unknown to the agent due to limited and noisy sensors. The agent has to reason about the history of past observations and actions and maintain a belief state as a distribution over possible states. POMDPs have been widely used to model decision making problems in the context of planning and reinforcement learning [Ross et al.2008].

Solving POMDPs exactly is computationally intractable for domains with enormous state spaces and long planning horizons. First, the space of possible belief states grows exponentially w.r.t. the number of states $|S|$, since that space is $(|S|-1)$-dimensional, which is known as the curse of dimensionality [Kaelbling, Littman, and Cassandra1998]. Second, the number of possible histories grows exponentially w.r.t. the horizon length, which is known as the curse of history [Pineau, Gordon, and Thrun2006].

In the last few years, Monte-Carlo planning has been proposed to break both curses with statistical sampling. These methods construct sparse trees over belief states and actions, representing the state-of-the-art for efficient planning in large POMDPs [Silver and Veness2010, Somani et al.2013, Bai et al.2014]. While these approaches avoid exhaustive search, the constructed closed-loop trees can still become arbitrarily large for highly complex domains, which could limit the performance due to restricted memory resources [Powley, Cowling, and Whitehouse2017]. In contrast, open-loop approaches only focus on searching action sequences and are independent of the history and belief state space. Open-loop approaches are able to achieve competitive performance compared to closed-loop planning, when the problem is too large to provide sufficient computational and memory resources [Weinstein and Littman2013, Perez Liebana et al.2015, Lecarpentier et al.2018]. However, open-loop planning has been a less popular choice for decision making in POMDPs so far [Yu et al.2005].

In this paper, we propose Partially Observable Stacked Thompson Sampling (POSTS), a memory bounded approach to open-loop planning in large POMDPs, which optimizes a fixed size stack of Thompson Sampling bandits.

To evaluate the effectiveness of POSTS, we formulate a tree-based approach, called Partially Observable Open-Loop Thompson Sampling (POOLTS), and show that POOLTS is able to find optimal open-loop plans with sufficient computational and memory resources.

We empirically test POSTS in four large benchmark problems and compare its performance with POOLTS and other tree-based approaches like POMCP. We show that POSTS achieves competitive performance compared to tree-based open-loop planning and offers a performance-memory tradeoff, making it suitable for partially observable planning with highly restricted computational and memory resources.

Background

Partially Observable Markov Decision Processes

A POMDP is defined by a tuple $\langle S, A, P, R, \Omega, O, b_0 \rangle$, where $S$ is a (finite) set of states, $A$ is the (finite) set of actions, $P(s_{t+1} | s_t, a_t)$ is the transition probability function, $R(s_t, a_t)$ is the scalar reward function, $\Omega$ is a (finite) set of observations, $O(o_{t+1} | s_{t+1}, a_t)$ is the observation probability function, and $b_0$ is a probability distribution over initial states $s_0 \in S$. It is always assumed that $s_t \in S$, $a_t \in A$, and $o_t \in \Omega$ at time step $t$.

A history $h_t = \langle a_0, o_1, ..., a_{t-1}, o_t \rangle$ is a sequence of actions and observations. A belief state $b_t(s_t)$ is a sufficient statistic for history $h_t$ and defines a probability distribution over states $s_t$ given $h_t$. $B$ is the space of all possible belief states. $b_0$ represents the initial belief state. The belief state can be updated by Bayes' theorem:

$b_t(s_t) = \eta \, O(o_t | s_t, a_{t-1}) \sum_{s_{t-1} \in S} P(s_t | s_{t-1}, a_{t-1}) \, b_{t-1}(s_{t-1})$ (1)

where $\eta$ is a normalizing constant, $a_{t-1}$ is the last action, and $h_{t-1}$ is the history without $a_{t-1}$ and $o_t$.
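
For illustration, a minimal Python sketch of this exact belief update, assuming the POMDP model is given as nested dictionaries; all names here are illustrative, not taken from the paper:

def belief_update(belief, action, observation, states, P, O):
    # belief: dict state -> probability, P[s][a][s2]: transition prob., O[s2][a][o]: observation prob.
    new_belief = {}
    for s2 in states:
        predicted = sum(P[s][action].get(s2, 0.0) * prob for s, prob in belief.items())
        weighted = O[s2][action].get(observation, 0.0) * predicted
        if weighted > 0.0:
            new_belief[s2] = weighted
    eta = sum(new_belief.values())  # normalizing constant of Eq. 1
    return {s2: p / eta for s2, p in new_belief.items()}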

The goal is to find a policy $\pi : B \rightarrow A$, which maximizes the return $G_t$ at state $s_t$ for a horizon $T$:

$G_t = \sum_{k=0}^{T-1} \gamma^{k} R(s_{t+k}, a_{t+k})$ (2)

where $\gamma \in [0, 1]$ is the discount factor. If $\gamma < 1$, then present rewards are weighted more than future rewards.
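
As a small worked example of Eq. 2, the discounted return of a simulated reward sequence can be accumulated as follows (an illustrative helper, not code from the paper):

def discounted_return(rewards, gamma):
    # rewards: [r_t, r_{t+1}, ..., r_{t+T-1}] observed along a simulated plan
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# e.g. discounted_return([1.0, 0.0, 10.0], gamma=0.95) == 1.0 + 0.95 ** 2 * 10.0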

The value function $V^{\pi}(b_t) = \mathbb{E}_{\pi}[G_t | b_t]$ is the expected return conditioned on belief states given a policy $\pi$. An optimal policy $\pi^{*}$ has a value function $V^{*}$ with $V^{*}(b_t) \geq V^{\pi}(b_t)$ for all $b_t \in B$ and all policies $\pi$.

Multi-armed Bandits

Multi-armed Bandits (MABs or bandits) are fundamental decision making problems, where an agent has to repeatedly select an arm among a given set of arms in order to maximize its future pay-off. MABs can be considered as problems with a single state $s$, a set of actions (arms) $A$, and a stochastic reward function $R(s, a) = X_a$, where $X_a$ is a random variable with an unknown distribution. To solve a MAB, one has to determine the action $a^{*}$ which maximizes the expected reward $\mathbb{E}[X_{a^{*}}]$. The agent has to balance between sufficiently trying out actions to accurately estimate their expected reward and exploiting its current knowledge on all arms by selecting the arm with the currently highest expected reward. This is known as the exploration-exploitation dilemma, where exploration can lead to actions with possibly higher rewards but requires time for trying them out, while exploitation can lead to fast convergence but possibly gets stuck in a local optimum. In this paper, we cover UCB1 and Thompson Sampling as MAB algorithms.

UCB1

In UCB1, actions are selected by maximizing the upper confidence bound of action values, $UCB1(a) = \bar{X}_a + c \sqrt{2 \log(n) / n_a}$, where $\bar{X}_a$ is the current average reward when choosing $a$, $c$ is an exploration constant, $n$ is the total number of action selections, and $n_a$ is the number of times action $a$ was selected. The second term represents the exploration bonus, which becomes smaller with increasing $n_a$ [Auer, Cesa-Bianchi, and Fischer2002].

UCB1 is a popular MAB algorithm and widely used in various challenging domains [Kocsis and Szepesvári2006, Bubeck and Munos2010, Silver et al.2016, Silver et al.2017].
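
A minimal Python sketch of UCB1 action selection as described above; the container names and the tie handling are illustrative assumptions:

import math

def ucb1_select(actions, mean_reward, counts, total_count, c):
    # mean_reward[a]: average reward of a so far, counts[a]: number of times a was selected
    def upper_bound(a):
        if counts[a] == 0:
            return float("inf")  # try every action at least once
        return mean_reward[a] + c * math.sqrt(2.0 * math.log(total_count) / counts[a])
    return max(actions, key=upper_bound)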

Thompson Sampling

Thompson Sampling is a Bayesian approach to balance between exploration and exploitation of actions [Thompson1933]. The unknown reward distribution of each action $a$ is modeled by a parametrized likelihood function $P(X_a | \theta_a)$ with a parameter vector $\theta_a$. Given a prior distribution $P(\theta_a)$ and a set of past observed rewards $D_a$, the posterior distribution can be inferred by using Bayes' rule $P(\theta_a | D_a) \propto P(D_a | \theta_a) P(\theta_a)$. The expected reward of each action can be estimated by sampling from the posterior. The action with the highest sampled expected reward is selected.

Thompson Sampling has been shown to be an effective and robust algorithm for making decisions under uncertainty [Chapelle and Li2011, Kaufmann, Korda, and Munos2012, Bai, Wu, and Chen2013, Bai et al.2014].

Planning in POMDPs

Planning searches for a (near-)optimal policy given a model $M$ of the environment, which usually consists of the explicit probability distributions of the POMDP. Unlike offline planning, which searches the whole (belief) state space to find the optimal policy $\pi^{*}$, local planning only focuses on finding a policy for the current (belief) state by taking possible future (belief) states into account [Weinstein and Littman2013]. Thus, local planning can be applied online at every time step at the current state to recommend the next action for execution. Local planning is usually restricted to a time or computation budget due to strict real-time constraints [Bubeck and Munos2010, Weinstein and Littman2013, Perez Liebana et al.2015].

In this paper, we focus on local Monte-Carlo planning, where $\hat{M}$ is a generative model of the POMDP, which can be used as a black box simulator [Kocsis and Szepesvári2006, Silver and Veness2010, Weinstein and Littman2013, Bai et al.2014]. Given a state $s_t$ and an action $a_t$, the simulator provides a sample $\langle s_{t+1}, o_{t+1}, r_t \rangle \sim \hat{M}(s_t, a_t)$. Monte-Carlo planning algorithms can approximate the belief state and the value function by iteratively simulating and evaluating action sequences without reasoning about the explicit probability distributions of the POMDP.

Local planning can be closed- or open-loop. Closed-loop planning conditions the action selection on histories of actions and observations. Open-loop planning only conditions the action selection on previous sequences of actions (also called open-loop plans or simply plans) and summarized statistics about predecessor (belief) states [Bubeck and Munos2010, Weinstein and Littman2013, Perez Liebana et al.2015]. An example is shown in Fig. 1. A closed-loop tree for a small example domain is shown in Fig. 1a, while Fig. 1b shows the corresponding open-loop tree, which summarizes the observation nodes of Fig. 1a within the blue dotted ellipses into history distribution nodes. Open-loop planning can be further simplified by only regarding statistics about the expected return of actions at specific time steps (Fig. 1c). In that case, only a stack of statistics is used to sample plans for simulation and evaluation [Weinstein and Littman2013].

(a) closed-loop tree
(b) open-loop tree
(c) stacked
Figure 1: Illustration of closed- and open-loop planning schemes. (a) Closed-loop tree with state observations (circular nodes) and actions (rectangular nodes). Red links correspond to stochastic observations made with a probability of 0.5. (b) Open-loop tree with links as actions and history distribution nodes according to the blue dotted ellipses in Fig. 1a. (c) Open-loop approach with a stack of action distributions according to the blue dotted ellipses in Fig. 1b.

Partially Observable Monte-Carlo Planning (POMCP) is a closed-loop approach based on Monte-Carlo Tree Search (MCTS) [Silver and Veness2010]. POMCP uses a search tree of histories with o-nodes representing observations and a-nodes representing actions (Fig. 1a). Each o-node has a visit count and a value estimate for its history $h$ and belief state $b(h)$. Each a-node has a visit count and a value estimate for its action $a$ and history $h$. A simulation starts at a state sampled from the current belief state and is divided into two stages: In the first stage, a tree policy $\pi_{tree}$ is used to traverse the tree until a leaf node is reached. Actions are selected via $\pi_{tree}$ and simulated in $\hat{M}$ to determine the next nodes to visit. $\pi_{tree}$ can be implemented with MABs, where each o-node represents a MAB. In the second stage, a rollout policy $\pi_{rollout}$ is used to sample action sequences until a terminal state or a maximum search depth is reached. The observed rewards are accumulated to returns (Eq. 2), propagated back to update the value estimate of every node in the simulated path, and a new leaf node is added to the search tree. $\pi_{rollout}$ can be used to integrate domain knowledge into the planning process to focus the search on promising states [Silver and Veness2010]. The original version of POMCP uses UCB1 for $\pi_{tree}$ and is shown to converge to the optimal best-first tree with sufficient computation [Silver and Veness2010].

[Lecarpentier et al.2018] formulate an open-loop variant of MCTS using UCB1 as tree policy, called Open-Loop Upper Confidence bound for Trees (OLUCT), which can easily be extended to POMDPs by constructing a tree which summarizes all o-nodes into history distribution nodes (Fig. 1b).

Open-loop planning generally converges to sub-optimal solutions in stochastic domains, since it ignores (belief) state values and optimizes the node values of the open-loop tree (Fig. 1b) instead [Lecarpentier et al.2018]. If the problem is too complex to provide sufficient computation budget or memory capacity, then open-loop approaches are competitive with closed-loop approaches, since they need to explore a much smaller search space to converge to an appropriate solution [Weinstein and Littman2013, Lecarpentier et al.2018].

Related Work

Tree-based approaches to open-loop planning condition the action selection on previous action sequences as shown in Fig. 1b [Bubeck and Munos2010, Perez Liebana et al.2015, Lecarpentier et al.2018]. Such approaches have been thoroughly evaluated for fully observable problems, but have been less popular for partially observable problems so far [Yu et al.2005]. POSTS is based on stacked open-loop planning, where a stack of distributions over actions is maintained to generate open-loop plans with high expected return [Weinstein and Littman2013, Belzner and Gabor2017]. Unlike previous approaches, POSTS is a memory-bounded open-loop approach to partially observable planning.

[Yu et al.2005] proposed an open-loop approach to planning in POMDPs by using hierarchical planning. An open-loop plan is constructed at an abstract level, where uncertainty w.r.t. particular actions is ignored. A low-level planner controls the actual execution by explicitly dealing with uncertainty. POSTS is more general, since it performs planning directly on the original problem and does not require the POMDP to be transformed for hierarchical planning.

[Powley, Cowling, and Whitehouse2017] proposed a memory bounded version of MCTS with a state pool to add, discard, or reuse states depending on their visitation frequency. However, this approach cannot be easily adapted to tree-based open-loop approaches, because it requires (belief) states to be identifiable. POSTS does not require a pool to reuse states or nodes, but only maintains a fixed size stack of Thompson Sampling bandits, which adapt according to the temporal dependencies between actions.

Open-Loop Search with Thompson Sampling

Generalized Thompson Sampling

We use a variant of Thompson Sampling, which works for arbitrary reward distributions, as proposed in [Bai, Wu, and Chen2013, Bai et al.2014], by assuming that $X_a$ follows a Normal distribution $\mathcal{N}(\mu_a, 1/\tau_a)$ with unknown mean $\mu_a$ and precision $\tau_a$, where $1/\tau_a$ is the variance. $\langle \mu_a, \tau_a \rangle$ follows a Normal-Gamma distribution $NormalGamma(\mu_0, \lambda, \alpha, \beta)$ with $\lambda > 0$, $\alpha \geq 1$, and $\beta \geq 0$. The distribution over $\tau_a$ is a Gamma distribution $Gamma(\alpha, \beta)$ and the conditional distribution over $\mu_a$ given $\tau_a$ is a Normal distribution $\mathcal{N}(\mu_0, 1/(\lambda \tau_a))$.

Given a prior distribution $NormalGamma(\mu_0, \lambda, \alpha, \beta)$ and $n$ observed rewards $D_a$, the posterior distribution is defined by $NormalGamma(\mu_0', \lambda', \alpha', \beta')$, where $\mu_0' = (\lambda \mu_0 + n \bar{X}) / (\lambda + n)$, $\lambda' = \lambda + n$, $\alpha' = \alpha + n/2$, and $\beta' = \beta + \frac{1}{2}(n \sigma^2 + \frac{\lambda n (\bar{X} - \mu_0)^2}{\lambda + n})$. $\bar{X}$ is the mean of all values in $D_a$ and $\sigma^2$ is the variance.

The posterior is inferred for each action to sample an estimate for the expected return. The action with the highest estimate is selected. The complete formulation is given in Algorithm 1.

procedure ThompsonSampling($\langle n_a, \bar{X}_a, \sigma^2_a \rangle_{a \in A}$)
     for $a \in A$ do
          Infer the posterior $NormalGamma(\mu_0', \lambda', \alpha', \beta')$ from the prior and $\langle n_a, \bar{X}_a, \sigma^2_a \rangle$
          Sample an estimate of $\mathbb{E}[X_a]$ from the posterior
     return the action $a$ with the highest sampled estimate
procedure Update($a$, $G$)
     $n_a \leftarrow n_a + 1$
     Update the mean $\bar{X}_a$ with $G$
     Update the variance $\sigma^2_a$ with $G$
Algorithm 1: Generalized Thompson Sampling

The prior should ideally reflect knowledge about the underlying model, especially for initial turns, where only a small amount of data has been observed [Honda and Takemura2014]. If no knowledge is available, then uninformative priors should be chosen, where all possibilities can be sampled (almost) uniformly. This can be achieved by choosing the priors such that the variance of the resulting Normal distribution becomes infinite ($\alpha \to 1$ and $\beta \to \infty$). Since $\tau_a$ follows a Gamma distribution with expectation $\alpha / \beta$, $\alpha$ and $\beta$ should be chosen such that $\alpha / \beta \to 0$. Given the hyperparameters $\mu_0$, $\lambda$, $\alpha$, and $\beta$, it is recommended to set $\alpha = 1$ and $\mu_0 = 0$ to center the Normal distribution. $\lambda$ should be chosen small enough and $\beta$ should have a sufficiently large value [Bai et al.2014].
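
A minimal Python sketch of this Generalized Thompson Sampling scheme (Algorithm 1), maintaining a Normal-Gamma posterior per action; the class name, the incremental statistics, and the default prior values are illustrative assumptions in the spirit of the recommendations above, not the paper's exact implementation:

import math
import random

class NormalGammaBandit:
    # One Thompson Sampling bandit over a set of actions with Normal-Gamma posteriors.
    def __init__(self, actions, mu0=0.0, lam=1e-2, alpha=1.0, beta=1000.0):
        self.prior = (mu0, lam, alpha, beta)
        # sufficient statistics per action: count n_a, mean of returns, sum of squared deviations
        self.stats = {a: [0, 0.0, 0.0] for a in actions}

    def update(self, action, value):
        n, mean, m2 = self.stats[action]
        n += 1
        delta = value - mean
        mean += delta / n
        m2 += delta * (value - mean)          # Welford update, m2 == n * variance
        self.stats[action] = [n, mean, m2]

    def posterior(self, action):
        mu0, lam, alpha, beta = self.prior
        n, mean, m2 = self.stats[action]
        if n == 0:
            return self.prior
        mu_n = (lam * mu0 + n * mean) / (lam + n)
        lam_n = lam + n
        alpha_n = alpha + n / 2.0
        beta_n = beta + 0.5 * (m2 + lam * n * (mean - mu0) ** 2 / (lam + n))
        return (mu_n, lam_n, alpha_n, beta_n)

    def sample_action(self, actions):
        best, best_value = None, -math.inf
        for a in actions:
            mu_n, lam_n, alpha_n, beta_n = self.posterior(a)
            tau = random.gammavariate(alpha_n, 1.0 / beta_n)       # precision ~ Gamma(alpha, beta)
            mu = random.gauss(mu_n, 1.0 / math.sqrt(lam_n * tau))  # mean | precision ~ Normal
            if mu > best_value:
                best, best_value = a, mu
        return best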

Monte Carlo Belief State Update

The belief state can be updated exactly according to Eq. 1. However, exact Bayes updates may be computationally infeasible in POMDPs with large state spaces due to the curse of dimensionality. For this reason, we approximate the belief state for history $h_t$ with a particle filter as described in [Silver and Veness2010]. The belief state $b_t$ is represented by a set of sample states or particles. After execution of $a_t$ and observation of $o_{t+1}$, the particles are updated by Monte Carlo simulation: sampled states $s \sim b_t$ are simulated with $\hat{M}$ such that $\langle s', o', r \rangle \sim \hat{M}(s, a_t)$. If $o' = o_{t+1}$, then $s'$ is added to the new belief state $b_{t+1}$.
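
A minimal sketch of this particle filter update, assuming a black box simulator step(s, a) -> (s', o', r) as generative model; the function names, the rejection loop, and the particle count handling are illustrative assumptions:

import random

def update_particles(particles, action, observation, step, num_particles=None, max_tries=100000):
    # Rejection-sampling particle update in the spirit of [Silver and Veness2010] (sketch).
    target = num_particles or len(particles)
    new_particles, tries = [], 0
    while len(new_particles) < target and tries < max_tries:
        s = random.choice(particles)          # sample a state from the current belief
        s2, o2, _ = step(s, action)           # simulate one step with the generative model
        if o2 == observation:                 # keep the successor if the observations match
            new_particles.append(s2)
        tries += 1
    return new_particles or particles         # fall back if no particle matched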

POOLTS

To evaluate the effectiveness of POSTS compared to other open-loop planners, we first define Partially Observable Open-Loop Thompson Sampling (POOLTS) and show that POOLTS is able to converge to an optimal open-loop plan, if sufficient computational and memory resources are provided. POOLTS is a tree-based approach based on OLUCT from [Lecarpentier et al.2018]. Each node represents a Thompson Sampling bandit and stores the count $n_a$, the mean $\bar{X}_a$, and the variance $\sigma^2_a$ of observed returns for each action $a$.

A simulation starts at a state $s \sim b_t$, which is sampled from the current belief state $b_t$. The belief state is approximated by a particle filter as described above. An open-loop tree (Fig. 1b) is iteratively constructed by traversing the current tree in a selection step by using Thompson Sampling to select actions. When a leaf node is reached, it is expanded by a child node and a rollout is performed by using a policy $\pi_{rollout}$ until a terminal state is reached or a maximum search depth is exceeded. The observed rewards are accumulated to returns (Eq. 2) and propagated back to update the corresponding bandit of every node in the simulated path. When the computation budget has run out, the action with the highest expected return is selected for execution. The complete formulation of POOLTS is given in Algorithm 2.

procedure Plan($b_t$)
     Create a root node $v_0$ with a Thompson Sampling bandit
     while the computation budget is not exceeded do
          $s \sim b_t$
          Simulate($v_0$, $s$, 0)
     return the action $a$ with the highest $\bar{X}_a$ at the root bandit
procedure Simulate($v$, $s$, $t$)
     if $t \geq T$ or $s$ is a terminal state then return 0
     if $v$ is a leaf node then
          Expand $v$
          Perform a rollout with $\pi_{rollout}$ to sample a return $G$
          return $G$
     $a \leftarrow$ ThompsonSampling on the bandit of $v$
     $\langle s', o, r \rangle \sim \hat{M}(s, a)$
     $v' \leftarrow$ child node of $v$ for action $a$
     $G \leftarrow r + \gamma \cdot$ Simulate($v'$, $s'$, $t + 1$)
     Update($a$, $G$) on the bandit of $v$
     return $G$
Algorithm 2: POOLTS Planning

[Kocsis and Szepesvári2006, Bubeck and Munos2010, Lecarpentier et al.2018] have shown that tree search algorithms using UCB1 converge to the optimal closed-loop or open-loop plan respectively, if the computation budget is sufficiently large. This is because the expected state or node values in the leaf nodes become stationary, given a stationary rollout policy $\pi_{rollout}$. This enables the values in the preceding nodes to converge as well, leading to state- or node-wise optimal actions. By replacing UCB1 with Thompson Sampling, the tree search should still converge to the optimal closed-loop or open-loop plan, since Thompson Sampling also converges to the optimal action, if the return distribution of each action becomes stationary [Agrawal and Goyal2013]. [Chapelle and Li2011, Bai et al.2014] empirically demonstrated that Thompson Sampling converges faster than UCB1, when rewards are sparse and when the number of arms is large.
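
For concreteness, a condensed Python sketch of the POOLTS simulation loop (Algorithm 2), reusing the NormalGammaBandit class from the earlier sketch; the node layout, the random rollout policy, the terminal-state convention (None), and the budget handling are simplifying assumptions rather than the authors' exact implementation:

import random

class OpenLoopNode:
    def __init__(self, actions):
        self.bandit = NormalGammaBandit(actions)   # one Thompson Sampling bandit per node
        self.children = {}                         # one child node per action

def rollout(state, depth, max_depth, step, actions, gamma):
    # random rollout policy accumulating the discounted return
    g, discount = 0.0, 1.0
    while depth < max_depth and state is not None:
        state, _, r = step(state, random.choice(actions))
        g += discount * r
        discount *= gamma
        depth += 1
    return g

def poolts_simulate(node, state, depth, max_depth, step, actions, gamma):
    # simulate one trajectory through the open-loop tree and back up its return
    if depth >= max_depth or state is None:        # None marks a terminal state
        return 0.0
    if not node.children:                          # leaf node: expand and roll out
        node.children = {a: OpenLoopNode(actions) for a in actions}
        return rollout(state, depth, max_depth, step, actions, gamma)
    a = node.bandit.sample_action(actions)         # Thompson Sampling selection
    next_state, _, r = step(state, a)
    g = r + gamma * poolts_simulate(node.children[a], next_state, depth + 1,
                                    max_depth, step, actions, gamma)
    node.bandit.update(a, g)                       # back up the return to this node's bandit
    return g

def poolts_plan(particles, budget, max_depth, step, actions, gamma):
    root = OpenLoopNode(actions)
    for _ in range(budget):                        # one simulation per budget unit
        poolts_simulate(root, random.choice(particles), 0, max_depth, step, actions, gamma)
    # recommend the action with the highest estimated mean return at the root
    return max(actions, key=lambda a: root.bandit.stats[a][1])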

POSTS

Partially Observable Stacked Thompson Sampling (POSTS) is an open-loop approach, which optimizes a stack of Thompson Sampling bandits to search for high-quality open-loop plans (Fig. 1c). Each bandit stores the count $n_a$, the mean $\bar{X}_a$, and the variance $\sigma^2_a$ of observed returns for each action $a$.

Similarly to POOLTS, a simulation starts at a state $s \sim b_t$, which is sampled from a particle filter representing the current belief state $b_t$. Unlike POOLTS, a fixed size stack of $T$ bandits is used to sample plans $\langle a_0, ..., a_{T-1} \rangle$. Each plan is evaluated with the generative model $\hat{M}$ to observe immediate rewards, which are accumulated to returns (Eq. 2). The bandit at each depth is then updated with the corresponding return. When the computation budget has run out, the action with the highest expected return is selected for execution. The complete formulation of POSTS is given in Algorithm 3.

procedure Plan($b_t$)
     while the computation budget is not exceeded do
          $s \sim b_t$
          Simulate($s$, 0)
     return the action $a$ with the highest $\bar{X}_a$ at the first bandit
procedure Simulate($s$, $t$)
     if $t \geq T$ or $s$ is a terminal state then return 0
     $a \leftarrow$ ThompsonSampling on the bandit at depth $t$
     $\langle s', o, r \rangle \sim \hat{M}(s, a)$
     $G \leftarrow r + \gamma \cdot$ Simulate($s'$, $t + 1$)
     Update($a$, $G$) on the bandit at depth $t$
     return $G$
Algorithm 3: POSTS Planning

The idea of POSTS is to only regard the temporal dependencies between the actions of an open-loop plan. The bandit stack is used to learn these dependencies with the expected (discounted) return. When the bandit at depth $t$ samples an action $a_t$ with a resulting reward of $r_t$, then all preceding bandits at depths $k \leq t$ are updated with their returns $G_k$, which contain $r_t$ discounted by $\gamma^{t-k}$. This is because the actions sampled by all preceding bandits are possibly relevant for obtaining the reward $r_t$. By only regarding these temporal dependencies, POSTS is memory bounded, not requiring a search tree to model dependencies between histories or history distributions (Fig. 1a and 1b).
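
A minimal Python sketch of POSTS planning (Algorithm 3), again reusing the NormalGammaBandit class from above; the iterative per-depth update mirrors the description in this paragraph, while the function names and the terminal-state convention (None) are illustrative assumptions:

import random

def posts_plan(particles, budget, horizon, step, actions, gamma):
    # Optimize a fixed stack of `horizon` Thompson Sampling bandits and return the root action.
    bandits = [NormalGammaBandit(actions) for _ in range(horizon)]
    for _ in range(budget):
        state = random.choice(particles)           # sample a start state from the belief
        plan, rewards = [], []
        for t in range(horizon):                   # sample and evaluate one open-loop plan
            if state is None:                      # terminal state reached
                break
            a = bandits[t].sample_action(actions)
            state, _, r = step(state, a)
            plan.append(a)
            rewards.append(r)
        # update every bandit with the discounted return from its own depth onwards (Eq. 2)
        for t in range(len(plan)):
            g_t = sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))
            bandits[t].update(plan[t], g_t)
    # recommend the first action with the highest estimated mean return
    return max(actions, key=lambda a: bandits[0].stats[a][1])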

Experiments

Evaluation Environments

We tested POSTS in the RockSample, Battleship, and PocMan domains, which are well-known benchmark problems for decision making in POMDPs [Silver and Veness2010, Somani et al.2013, Bai et al.2014]. For each domain, we set the discount factor $\gamma$ as proposed in [Silver and Veness2010]. The results were compared with POMCP, POOLTS, and a partially observable version of OLUCT, which we call POOLUCT. The problem-size features of all domains are shown in Table 1.

RockSample(11,11) RockSample(15,15) Battleship PocMan
# States
# Actions
# Observations
Table 1: The problem-size features of the benchmark domains RockSample, Battleship, and PocMan.

The RockSample(n,k) problem simulates an agent moving in an $n \times n$ grid containing $k$ rocks [Smith and Simmons2004]. Each rock can be good or bad, but the true state of each rock is unknown. The agent has to sample good rocks, while avoiding to sample bad rocks. It has a noisy sensor, which produces an observation for a particular rock. The probability of sensing the correct state of the rock decreases exponentially with the agent's distance to that rock. Sampling gives a positive reward, if the rock is good, and a negative reward otherwise. If a good rock was sampled, it becomes bad. Moving and sensing do not give any rewards. Moving past the east edge of the grid gives a positive reward and the episode terminates.

In Battleship, five ships of size 1, 2, 3, 4, and 5, respectively, are randomly placed into a grid, where the agent has to sink all ships without knowing their actual positions [Silver and Veness2010]. Each shot hitting a ship cell gives a positive reward. There is a small negative reward per time step and a positive terminal reward for hitting all ships.

PocMan is a partially observable version of PacMan [Silver and Veness2010]. The agent navigates in a maze and has to eat randomly distributed food pellets and power pills. There are four ghosts moving randomly in the maze. If the agent is within the visible range of a ghost, it is chased by the ghost and dies if it touches the ghost, terminating the episode with a large negative reward. Eating a power pill enables the agent to eat ghosts for 15 time steps; during that time, the ghosts run away from the agent. There is a small negative reward at each time step, while eating food pellets and eating ghosts gives positive rewards. The agent can only perceive ghosts, if they are in its direct line of sight in each cardinal direction or within a hearing range. Also, the agent can only sense walls and food pellets, which are adjacent to it.

Methods

POMCP

We use the POMCP implementation from [Silver and Veness2010]. The tree policy $\pi_{tree}$ selects actions from a set of legal actions with UCB1. The rollout policy $\pi_{rollout}$ randomly selects actions from the legal actions of the currently simulated state $s$.

In each simulation step, there is at most one expansion step, where new nodes are added to the search tree. Thus, the tree size should increase linearly w.r.t. the computation budget in large POMDPs.

POOLTS and POOLUCT

POOLTS is implemented according to Algorithm 2, where actions are selected from a set of legal actions with Thompson Sampling (Algorithm 1) in the first stage. The rollout policy $\pi_{rollout}$ randomly selects actions from the legal actions of the currently simulated state $s$. POOLUCT is similar to POOLTS but uses UCB1 as action selection strategy in the first stage. Since open-loop planning can encounter different states at the same node (Fig. 1), the set of legal actions may vary for each state $s$. We always mask out the statistics of currently illegal actions, regardless of whether they have high average action values, to avoid selecting them.
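
Such masking can be realized by handing only the currently legal actions to the bandit's sampling step, as in the following sketch using the NormalGammaBandit class from above; the helper for computing legal actions is an assumed, domain-specific function:

def masked_sample(bandit, legal_actions):
    # The bandit keeps statistics for all actions, but only the currently legal ones
    # are considered during sampling, so illegal actions cannot be selected
    # regardless of how high their stored average values are.
    return bandit.sample_action(legal_actions)

# usage inside a simulation step (sketch):
#   legal = legal_actions_of(state)          # assumed domain-specific helper
#   a = masked_sample(node.bandit, legal)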

Similarly to POMCP, the search tree size should increase linearly w.r.t. the computation budget, but with fewer nodes, since open-loop trees store summarized information about history distributions.

POSTS

POSTS is implemented as a stack of $T$ Thompson Sampling bandits according to Algorithm 3. Starting at depth $t = 0$, all bandits apply Thompson Sampling to a set of legal actions, depending on the currently simulated state $s$. Similarly to POOLTS and POOLUCT, we mask out the statistics of currently illegal actions and only regard the value statistics of legal actions for selection during planning.

Given a horizon of $T$, POSTS always maintains $T$ Thompson Sampling bandits, independently of the computation budget.

Results

We ran each approach on RockSample, Battleship, and PocMan with different settings, 100 times each or for at most 12 hours of total computation. We evaluated the performance of each approach with the undiscounted return ($\gamma = 1$), because we focus on the actual effectiveness instead of the quality of optimization [Bai et al.2014]. For POMCP and POOLUCT, we set the UCB1 exploration constant $c$ to the reward range of each domain, as proposed in [Silver and Veness2010].

Prior Sensitivity

Since we assume no additional domain knowledge, we focus on uninformative priors with $\mu_0 = 0$, $\alpha = 1$, and a small $\lambda$, as proposed in [Bai et al.2014]. With this setting, $\beta$ controls the degree of initial exploration during the planning phase, thus we evaluate its impact on the performance of POOLTS and POSTS. The results are shown in Fig. 2 for different values of $\beta$ for POOLTS and POSTS.

(a) RockSample(11,11)
(b) RockSample(15,15)
(c) Battleship
(d) PocMan
Figure 2: Average performance of POSTS, POOLTS, POOLUCT, and POMCP with different prior values for $\beta$, different computation budgets, and a fixed horizon.

In RockSample, POSTS slightly outperforms POOLTS and keeps up in performance with POMCP. POOLTS slightly outperforms POSTS and POMCP in Battleship, with POSTS only being able to keep up when $\beta$ or the computation budget is sufficiently large. POMCP clearly outperforms all open-loop approaches in PocMan. POOLTS slightly outperforms POSTS in PocMan, with POSTS only being able to keep up if $\beta$ is large. POOLUCT performs worst in all domains except Battleship, where it performs best given a sufficiently large computation budget. POSTS performs slightly better if $\beta$ is large, while POOLTS seems to be insensitive to the choice of $\beta$ except in PocMan, where it also performs better if $\beta$ is large.

Horizon Sensitivity

We evaluated the sensitivity of all approaches w.r.t. different horizons $T$. The results are shown in Fig. 3 for a fixed computation budget and a fixed $\beta$ for POOLTS and POSTS. (Using computation budgets between 1024 and 16384 led to similar plots, thus we stick to a budget with which all approaches require less than one second per action [Silver and Veness2010].)

(a) RockSample(11,11)
(b) RockSample(15,15)
(c) Battleship
(d) PocMan
Figure 3: Average performance of POSTS, POOLTS, POOLUCT, and POMCP with different planning horizons $T$ and a fixed computation budget.

In RockSample(11,11), the performance of POMCP and POOLUCT peaks at a different horizon than the performance of POSTS and POOLTS. In all other domains, most approaches show a performance peak at a similar horizon. For larger horizons, there is no significant improvement or even degrading performance for most approaches, except for POMCP, which slightly improves with larger horizons in all domains but RockSample(11,11).

Performance-Memory Tradeoff

We evaluated the performance-memory tradeoff of all approaches by introducing a memory capacity, where the computation is interrupted when the number of nodes exceeds this capacity. For POMCP, we count the number of o-nodes and a-nodes (Fig. 1a). For POOLTS and POOLUCT, we count the number of history distribution nodes (Fig. 1b). For POSTS, we count the number of Thompson Sampling bandits, which is always $T$. The results are shown in Fig. 4 for a fixed computation budget, horizon, and $\beta$ for POOLTS and POSTS. POSTS never uses more than $T$ nodes in each setting.

(a) RockSample(11,11)
(b) RockSample(15,15)
(c) Battleship
(d) PocMan
Figure 4: Average performance of POSTS, POOLTS, POOLUCT, and POMCP with memory bounds, a fixed computation budget, and a fixed horizon.

In RockSample and Battleship, POMCP is outperformed by POSTS and POOLTS (and also by POOLUCT in Battleship). POSTS always performs best in these domains when the memory capacity is small. POMCP performs best in PocMan, outperforming POSTS when the memory capacity is sufficiently large, while POOLTS keeps up with the best POSTS setting for large capacities. POOLUCT performs worst except in Battleship, improving least and slowest with increasing memory capacity; it outperforms POMCP in RockSample(15,15) for small capacities, though. In PocMan, POOLUCT creates less than 550 nodes even for larger capacities, indicating that its search tree construction has converged and does not improve any further.

Discussion

The experiments show that partially observable open-loop planning can be a good alternative to closed-loop planning, when the action space is large, stochasticity is low, and computational and memory resources are highly restricted. Especially approaches based on Thompson Sampling like POOLTS and POSTS seem to be very effective and robust w.r.t. the hyperparameter choice. Setting a large value for $\beta$ seems to be beneficial for large problems (Fig. 2). This is because an enormous search space needs to be explored, while avoiding premature convergence to poor solutions. However, if $\beta$ is too large, POSTS and POOLTS might converge too slowly, thus requiring much more computation [Bai et al.2014]. If the horizon $T$ is too large, then the value estimates have very high variance, making bandit adaptation more difficult. This could explain the performance stagnation or degradation of most approaches in Fig. 3 for large horizons. The performance of POSTS scales similarly to POOLTS w.r.t. the computation budget and the horizon (Fig. 2 and 3). POSTS is also more robust than POOLUCT w.r.t. changes to the computation budget and horizon, except in Battleship, where both approaches scale similarly when the budget is sufficiently large.

POSTS is competitive with POOLTS and superior to POOLUCT in all settings except in Battleship (when the computation budget is large), although POOLTS and POOLUCT were shown to theoretically converge to optimal open-loop plans given a sufficient computation budget and memory capacity. POSTS is superior to all other approaches in RockSample and Battleship, when memory resources are highly restricted, and is only outperformed by the tree-based approaches in Battleship after thousands of nodes have been created, consuming much more memory than POSTS, which only uses 100 nodes at most. This might be due to the relatively large action spaces of these domains (Table 1), where all tree-based planners construct enormous trees with high branching factors when exploring the effect of each action. RockSample and Battleship have low stochasticity, since state transitions are deterministic. In both domains, the agent is primarily uncertain about the real state, thus the planning quality only depends on the belief state approximation and the uncertainty about observations (only in RockSample).

POMCP performs best in PocMan. This might be due to the small action space (Table 1) and high stochasticity (all ghosts primarily move randomly), since open-loop planning is known to converge to sub-optimal solutions in such domains [Weinstein and Littman2013, Lecarpentier et al.2018]. However, POMCP has the highest memory consumption, since it constructs larger trees than the open-loop approaches with the same computation budget (Fig. 1a). In PocMan, POSTS is able to keep up with POOLTS, while being much more memory-efficient (Fig. 2d and Fig. 4d).

Conclusion and Future Work

In this paper, we proposed Partially Observable Stacked Thompson Sampling (POSTS), a memory bounded approach to open-loop planning in large POMDPs, which optimizes a fixed size stack of Thompson Sampling bandits.

To evaluate the effectiveness of POSTS, we formulated a tree-based approach, called POOLTS and showed that POOLTS is able to find optimal open-loop plans with sufficient computational and memory resources.

We empirically tested POSTS in four large benchmark problems and showed that POSTS achieves competitive performance compared to tree-based open-loop planners like POOLTS and POOLUCT, if sufficient resources are provided. Unlike tree-based approaches, POSTS offers a performance-memory tradeoff by performing best, if computational and memory resources are highly restricted, making it suitable for efficient partially observable planning.

For the future, we plan to apply POSTS to conformant planning problems [Hoffmanna and Brafmanb2006, Palacios and Geffner2009, Geffner and Bonet2013] and to extend it to multi-agent settings [Phan et al.2018].

References

  • [Agrawal and Goyal2013] Agrawal, S., and Goyal, N. 2013. Further Optimal Regret Bounds for Thompson Sampling. In Artificial Intelligence and Statistics, 99–107.
  • [Auer, Cesa-Bianchi, and Fischer2002] Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-Time Analysis of the Multiarmed Bandit Problem. Machine learning 47(2-3):235–256.
  • [Bai et al.2014] Bai, A.; Wu, F.; Zhang, Z.; and Chen, X. 2014. Thompson Sampling based Monte-Carlo Planning in POMDPs. In Proceedings of the Twenty-Fourth International Conference on Automated Planning and Scheduling, 29–37. AAAI Press.
  • [Bai, Wu, and Chen2013] Bai, A.; Wu, F.; and Chen, X. 2013. Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search. In Advances in Neural Information Processing Systems, 1646–1654.
  • [Belzner and Gabor2017] Belzner, L., and Gabor, T. 2017. Stacked Thompson Bandits. In Proceedings of the 3rd International Workshop on Software Engineering for Smart Cyber-Physical Systems, 18–21. IEEE Press.
  • [Bubeck and Munos2010] Bubeck, S., and Munos, R. 2010. Open Loop Optimistic Planning. In COLT, 477–489.
  • [Chapelle and Li2011] Chapelle, O., and Li, L. 2011. An Empirical Evaluation of Thompson Sampling. In Advances in neural information processing systems, 2249–2257.
  • [Geffner and Bonet2013] Geffner, H., and Bonet, B. 2013. A Concise Introduction to Models and Methods for Automated Planning. Synthesis Lectures on Artificial Intelligence and Machine Learning 8(1):1–141.
  • [Hoffmanna and Brafmanb2006] Hoffmann, J., and Brafman, R. I. 2006. Conformant Planning via Heuristic Forward Search: A New Approach. Artificial Intelligence 170:507–541.
  • [Honda and Takemura2014] Honda, J., and Takemura, A. 2014. Optimality of Thompson Sampling for Gaussian Bandits depends on Priors. In Artificial Intelligence and Statistics, 375–383.
  • [Kaelbling, Littman, and Cassandra1998] Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and Acting in Partially Observable Stochastic Domains. Artificial intelligence 101(1):99–134.
  • [Kaufmann, Korda, and Munos2012] Kaufmann, E.; Korda, N.; and Munos, R. 2012. Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis. In International Conference on Algorithmic Learning Theory, 199–213. Springer.
  • [Kocsis and Szepesvári2006] Kocsis, L., and Szepesvári, C. 2006. Bandit based Monte-Carlo Planning. In ECML, volume 6, 282–293. Springer.
  • [Lecarpentier et al.2018] Lecarpentier, E.; Infantes, G.; Lesire, C.; and Rachelson, E. 2018. Open Loop Execution of Tree-Search Algorithms. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2362–2368. IJCAI Organization.
  • [Palacios and Geffner2009] Palacios, H., and Geffner, H. 2009. Compiling Uncertainty away in Conformant Planning Problems with Bounded Width. Journal of Artificial Intelligence Research 35:623–675.
  • [Perez Liebana et al.2015] Perez Liebana, D.; Dieskau, J.; Hunermund, M.; Mostaghim, S.; and Lucas, S. 2015. Open Loop Search for General Video Game Playing. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, 337–344. ACM.
  • [Phan et al.2018] Phan, T.; Belzner, L.; Gabor, T.; and Schmid, K. 2018. Leveraging Statistical Multi-Agent Online Planning with Emergent Value Function Approximation. In Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’18, 730–738. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems.
  • [Pineau, Gordon, and Thrun2006] Pineau, J.; Gordon, G.; and Thrun, S. 2006. Anytime Point-based Approximations for Large POMDPs. Journal of Artificial Intelligence Research 27:335–380.
  • [Powley, Cowling, and Whitehouse2017] Powley, E.; Cowling, P.; and Whitehouse, D. 2017. Memory Bounded Monte Carlo Tree Search. AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment.
  • [Ross et al.2008] Ross, S.; Pineau, J.; Paquet, S.; and Chaib-Draa, B. 2008. Online Planning Algorithms for POMDPs. Journal of Artificial Intelligence Research 32:663–704.
  • [Silver and Veness2010] Silver, D., and Veness, J. 2010. Monte-Carlo Planning in Large POMDPs. In Advances in neural information processing systems, 2164–2172.
  • [Silver et al.2016] Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature 529(7587):484–489.
  • [Silver et al.2017] Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017. Mastering the Game of Go without Human Knowledge. Nature 550(7676):354–359.
  • [Smith and Simmons2004] Smith, T., and Simmons, R. 2004. Heuristic Search Value Iteration for POMDPs. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, 520–527. AUAI Press.
  • [Somani et al.2013] Somani, A.; Ye, N.; Hsu, D.; and Lee, W. S. 2013. DESPOT: Online POMDP Planning with Regularization. In Advances in neural information processing systems, 1772–1780.
  • [Thompson1933] Thompson, W. R. 1933. On the Likelihood that One Unknown Probability exceeds Another in View of the Evidence of Two Samples. Biometrika 25(3/4):285–294.
  • [Weinstein and Littman2013] Weinstein, A., and Littman, M. L. 2013. Open-loop Planning in Large-Scale Stochastic Domains. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, 1436–1442. AAAI Press.
  • [Yu et al.2005] Yu, C.; Chuang, J.; Gerkey, B.; Gordon, G.; and Ng, A. 2005. Open-Loop Plans in Multi-Robot POMDPs. Technical report, Stanford CS Dept.