Motion Planning under Partial Observability using Game-Based Abstraction

08/14/2017
by   Leonore Winterer, et al.

We study motion planning problems where agents move inside environments that are not fully observable and subject to uncertainties. The goal is to compute a strategy for an agent that is guaranteed to satisfy certain safety and performance specifications. Such problems are naturally modelled by partially observable Markov decision processes (POMDPs). Because of the potentially huge or even infinite belief space of POMDPs, verification and strategy synthesis is in general computationally intractable. We tackle this difficulty by exploiting typical structural properties of such scenarios; for instance, we assume that agents have the ability to observe their own positions inside an environment. Ambiguity in the state of the environment is abstracted into non-deterministic choices over the possible states of the environment. Technically, this abstraction transforms POMDPs into probabilistic two-player games (PGs). For these PGs, efficient verification tools are able to determine strategies that approximate certain measures on the POMDP. If an approximation is too coarse to provide guarantees, an abstraction refinement scheme further resolves the belief space of the POMDP. We demonstrate that our method improves the state of the art by orders of magnitude compared to a direct solution of the POMDP.

I Introduction

Offline motion planning for dynamical systems with uncertainties aims at finding a strategy for an agent that ensures certain desired behavior [1]. Planning scenarios that exhibit stochastic behavior are naturally modeled by Markov decision processes (MDPs). An MDP is a non-deterministic model in which the agent chooses to perform an action under full knowledge of the environment it is operating in. The outcome of the action is a distribution over the states. For many robotic applications, however, information about the current state of the environment is not observable [2, 3, 4]. In such scenarios, where the actual state of the environment is not exactly known, the model is a partially observable Markov decision process (POMDP). By tracking the observations made while it moves, an agent can infer the likelihood of the environment being in a certain state. This likelihood is called the belief state of the agent. Executing an action leads to an update of the belief state because new observations are made. The belief state together with an update function form a (possibly infinite) MDP, commonly referred to as the underlying belief MDP [5].

As an example, take a scenario where a controllable agent needs to traverse a room while avoiding static obstacles and randomly moving opponents whose positions cannot always be observed by the agent. The goal is to determine a strategy for the agent that (provably) ensures safe traversal with a certain high probability.

Quantitative verification techniques like probabilistic model checking [6] provide comprehensive guarantees on such a strategy. For finite MDPs, tools like PRISM [7] or Storm [8] employ efficient model checking algorithms to assess the probability of reaching a certain set of states. However, POMDP verification suffers from the large, potentially infinite belief space, and is intractable even for rather small instances.

Approach

Fig. 1: Schematic overview of the approach. A problem description (Sect. III-A) is encoded as a POMDP (Sect. III-B), which is abstracted into a PG (Sect. III-C). Model checking the PG using PRISM-games [9] yields an optimal strategy, whose induced probability carries over to the POMDP; if the result is too coarse, the abstraction is refined (Sect. III-D). The agent's own position is assumed to be fully observable.

We outline the approach and the structure of the paper in Fig. 1; the details in the figure are discussed in the respective sections. Starting from a problem description, we propose to use an encoding of the problem as a POMDP. We observe that motion planning scenarios as described above naturally induce certain structural properties in the POMDP. In particular, we assume that the agent can observe its own position, while the exact position of the opponents is observable only if they are nearby according to a given distance metric. We propose an abstraction method that, intuitively, lumps states inducing the same observations. Since it is not exactly known in which state of the environment a certain action is executed, a non-deterministic choice over these lumped states is introduced. Resolving this choice introduces an additional level of non-determinism into the system, on top of the choices of the agent: The POMDP abstraction results in a probabilistic two-player game (PG) [10]. The agent is the first player choosing an action, while the second player chooses in which of the possible (concrete) states the action is executed. Model checking computes an optimal strategy for the agent on this PG.

The automated abstraction procedure is inspired by game-based abstraction [10, 11] of potentially infinite MDPs, where states are lumped in a similar fashion. We show that our approach is sound in the sense that a strategy for the agent in the PG defines a strategy for the original POMDP. Guarantees for the strategy carry over to the POMDP, as the strategy induces bounds on the probabilities. As we target an undecidable problem [12], our approach is not complete in the sense that it does not always obtain a strategy which yields the required optimal probability. However, we define a scheme to refine the abstraction and extend the observability.

We implemented a Python tool-chain taking a graph formulation of the motion planning problem as input and applying the proposed abstraction-refinement procedure. The tool-chain uses PRISM-games [9] as a model checker for PGs. For the motion planning scenario considered, our preliminary results indicate an improvement of orders of magnitude over the state of the art in POMDP verification [13].

Related work

Sampling-based methods for motion planning in POMDP scenarios are considered in [14, 15, 16, 17]. An overview on point-based value iteration for POMDPs is given in [5]. Other methods employ control techniques to synthesize strategies with safety considerations under observation and dynamics noise [2, 18, 19]. Preprocessing of POMDPs in motion planning problems for robotics is suggested in [20].

General verification problems for POMDPs and their decidability have been studied in [21, 22]. A recent survey about decidability results and algorithms for ω-regular properties is given in [12, 23]. The probabilistic model checker PRISM has recently been extended to support POMDPs [13]. Partly based on the methods from [24], it produces lower and upper bounds for a variety of queries. Reachability can be analyzed for POMDPs with up to 30,000 states. In [25], an iterative refinement is proposed to solve POMDPs: Starting with total information, strategies that depend on unobservable information are excluded. In [26], a compositional framework for reasoning about POMDPs is introduced. Refinement based on counterexamples is considered in [27]. Partially observable probabilistic games have been considered in [28]. Finally, an overview of applications for PGs is given in [29].

II Formal Foundations

II-A Probabilistic games

For a finite or countably infinite set $X$, let $\mu \colon X \to [0,1]$ with $\sum_{x \in X} \mu(x) = 1$ denote a probability distribution over $X$ and $\mathit{Distr}(X)$ the set of all probability distributions over $X$. The Dirac distribution $\delta_x$ on $x \in X$ is given by $\delta_x(x) = 1$ and $\delta_x(y) = 0$ for $y \neq x$.

[Probabilistic game] A probabilistic game (PG) is a tuple $\mathcal{G} = (S_1, S_2, s_0, \mathit{Act}, \mathcal{P})$ where $S = S_1 \uplus S_2$ is a finite set of states, $S_1$ the states of Player 1, $S_2$ the states of Player 2, $s_0 \in S$ the initial state, $\mathit{Act}$ a finite set of actions, and $\mathcal{P} \colon S \times \mathit{Act} \rightharpoonup \mathit{Distr}(S)$ a (partial) probabilistic transition function. Let $\mathit{Act}(s) = \{a \in \mathit{Act} \mid \mathcal{P}(s,a) \text{ is defined}\}$ denote the available actions in $s$. We assume that the PG does not contain any deadlock states, i.e., $\mathit{Act}(s) \neq \emptyset$ for all $s \in S$. A Markov decision process (MDP) is a PG with $S_2 = \emptyset$; we write $\mathcal{M} = (S, s_0, \mathit{Act}, \mathcal{P})$. A discrete-time Markov chain (MC) is an MDP with $|\mathit{Act}(s)| = 1$ for all $s \in S$.

The game is played as follows: In each step, the game is in a unique state $s \in S$. If it is a Player 1 state ($s \in S_1$), then Player 1 chooses an available action $a \in \mathit{Act}(s)$ non-deterministically; otherwise Player 2 chooses. The successor state is determined probabilistically according to the probability distribution $\mathcal{P}(s,a)$: The probability of $s'$ being the next state is $\mathcal{P}(s,a)(s')$. The game then continues in state $s'$.

A path through $\mathcal{G}$ is a finite or infinite sequence $\pi = s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} \cdots$, where $s_0$ is the initial state, $s_i \in S$, $a_i \in \mathit{Act}(s_i)$, and $\mathcal{P}(s_i, a_i)(s_{i+1}) > 0$ for all $i \geq 0$. The $i$-th state on $\pi$ is $\pi_i$, and $\mathit{last}(\pi)$ denotes the last state of $\pi$ if $\pi$ is finite. The sets of finite and infinite paths are $\mathit{Paths}_{\mathit{fin}}^{\mathcal{G}}$ and $\mathit{Paths}^{\mathcal{G}}$, respectively.

To define a probability measure over the paths of a PG $\mathcal{G}$, the non-determinism needs to be resolved by strategies. [PG strategy] A strategy for $\mathcal{G}$ is a pair $\sigma = (\sigma_1, \sigma_2)$ of functions $\sigma_i \colon \{\pi \in \mathit{Paths}_{\mathit{fin}}^{\mathcal{G}} \mid \mathit{last}(\pi) \in S_i\} \to \mathit{Distr}(\mathit{Act})$ such that $\sigma_i(\pi)(a) > 0$ implies $a \in \mathit{Act}(\mathit{last}(\pi))$. $\Sigma^{\mathcal{G}}$ denotes the set of all strategies of $\mathcal{G}$ and $\Sigma_i^{\mathcal{G}}$ the set of all Player-$i$ strategies of $\mathcal{G}$. For MDPs, a strategy consists of a Player-1 strategy only. A Player-$i$ strategy $\sigma_i$ is memoryless if $\mathit{last}(\pi) = \mathit{last}(\pi')$ implies $\sigma_i(\pi) = \sigma_i(\pi')$ for all $\pi, \pi'$. It is deterministic if $\sigma_i(\pi)$ is a Dirac distribution for all $\pi$. A memoryless deterministic strategy is of the form $\sigma_i \colon S_i \to \mathit{Act}$.

A strategy $\sigma$ for a PG $\mathcal{G}$ resolves all non-deterministic choices, yielding an induced MC, for which a probability measure over the set of infinite paths is defined by the standard cylinder set construction [30]. These notions are analogous for MDPs.
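To make the notation above concrete, the following Python sketch encodes a PG as plain dictionaries. The type aliases and field names are illustrative only and are not taken from the paper's tool-chain.

```python
# Minimal sketch of the PG notation above (illustrative names, not the paper's implementation).
from dataclasses import dataclass
from typing import Dict, Tuple, Hashable, Set, List

State = Hashable
Action = Hashable
Distribution = Dict[State, float]        # state -> probability, summing to 1


def dirac(s: State) -> Distribution:
    """Dirac distribution: probability 1 on s, 0 elsewhere."""
    return {s: 1.0}


@dataclass
class PG:
    player1: Set[State]                                    # S1
    player2: Set[State]                                    # S2
    init: State                                            # s0
    trans: Dict[Tuple[State, Action], Distribution]        # partial transition function P

    def actions(self, s: State) -> List[Action]:
        """Available actions Act(s)."""
        return [a for (t, a) in self.trans if t == s]


# A memoryless deterministic Player-i strategy maps S_i directly to actions:
MDStrategy = Dict[State, Action]
```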

II-B Partial observability

For many applications, not all system states are observable [2]. For instance, an agent may only have an estimate of the current state of its environment. In that case, the underlying model is a partially observable Markov decision process. [POMDP] A partially observable Markov decision process (POMDP) is a tuple $\mathcal{M} = (M, Z, O)$ such that $M = (S, s_0, \mathit{Act}, \mathcal{P})$ is the underlying MDP of $\mathcal{M}$, $Z$ a finite set of observations, and $O \colon S \to Z$ the observation function. W. l. o. g. we require that states with the same observations have the same set of enabled actions, i.e., $O(s) = O(s')$ implies $\mathit{Act}(s) = \mathit{Act}(s')$ for all $s, s' \in S$. More general observation functions have been considered in the literature, taking into account the last action and providing a distribution over $Z$. There is a polynomial transformation of the general case to the POMDP definition used here [23].

The notions of paths and probability measures directly transfer from MDPs to POMDPs. We lift the observation function to paths: For a POMDP $\mathcal{M}$ and a path $\pi = s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} \cdots$, the associated observation sequence is $O(\pi) = O(s_0) \xrightarrow{a_0} O(s_1) \xrightarrow{a_1} \cdots$. Note that several paths in the underlying MDP can give rise to the same observation sequence. Strategies have to take this restricted observability into account. [Observation-based strategy] An observation-based strategy of POMDP $\mathcal{M}$ is a function $\sigma^z$ such that $\sigma^z$ is a strategy for the underlying MDP and for all paths $\pi, \pi'$ with $O(\pi) = O(\pi')$ we have $\sigma^z(\pi) = \sigma^z(\pi')$. $\Sigma_z^{\mathcal{M}}$ denotes the set of such strategies. That means an observation-based strategy selects actions based only on the observations and actions made along the current path.

The semantics of a POMDP can be described using a belief MDP with an uncountable state space. The idea is that each state of the belief MDP corresponds to a distribution over the states of the POMDP. This distribution captures the probability of being in a specific state given the observations made so far. Initially, the belief is a Dirac distribution on the initial state. A formal treatment of belief MDPs is beyond the scope of this paper; for details see [5].
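The belief MDP mentioned above is driven by the standard Bayesian belief update. The following sketch shows one way to compute it; the names `P` (transition function) and `obs` (observation labelling of states) are illustrative assumptions, not the paper's code.

```python
# Sketch of the belief update underlying the belief MDP (illustrative names).
def update_belief(belief, a, z, P, obs):
    """Bayes update of a belief (dict: state -> prob) after action a and observation z.

    P[s][a] is a dict mapping successor states to probabilities,
    obs[s] is the observation attached to state s.
    """
    new_belief = {}
    for s, b_s in belief.items():
        for s2, p in P[s][a].items():
            if obs[s2] == z:                       # keep only successors consistent with z
                new_belief[s2] = new_belief.get(s2, 0.0) + b_s * p
    total = sum(new_belief.values())
    if total == 0.0:
        raise ValueError("observation z has probability 0 under this belief and action")
    return {s: p / total for s, p in new_belief.items()}
```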

II-C Specifications

Given a set $\mathit{Goal}$ of goal states and a set $\mathit{Bad}$ of bad states, we consider quantitative reach-avoid properties of the form $\varphi = \mathbb{P}_{\geq \lambda}(\neg \mathit{Bad}\ \mathcal{U}\ \mathit{Goal})$. The specification $\varphi$ is satisfied by a PG if Player 1 has a strategy such that for all strategies of Player 2 the probability is at least $\lambda$ to reach a goal state without entering a bad state in between. For POMDPs, $\varphi$ is satisfied if the agent has an observation-based strategy which leads to a probability of at least $\lambda$.

For MDPs and PGs, memoryless deterministic strategies suffice to prove or disprove satisfaction of such specifications [31]. For POMDPs, observation-based strategies in their full generality are necessary [32].
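For intuition, the fixed point behind such reach-avoid probabilities on a fully observable MDP can be computed by standard value iteration. The sketch below is illustrative and is not the algorithm implemented in PRISM-games, which additionally alternates a minimizing choice for Player 2 in PGs.

```python
# Illustrative value iteration for the maximal reach-avoid probability
# P_max(not Bad U Goal) on a fully observable MDP (assumed helper names).
def max_reach_avoid(states, actions, P, goal, bad, eps=1e-8):
    """Iterate V(s) = max_a sum_{s'} P(s,a)(s') * V(s'), with V = 1 on goal and V = 0 on bad."""
    v = {s: (1.0 if s in goal else 0.0) for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in goal or s in bad:
                continue
            best = max(sum(p * v[s2] for s2, p in P[s][a].items()) for a in actions(s))
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < eps:
            return v
```

For a PG, the same iteration would take the minimum over Player 2 states instead of the maximum, reflecting the worst-case adversary.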

III Methodology

We first describe the problem intuitively and list the assumptions we make. After formalizing the setting, we give the formal problem statement. We then present the intuition behind game-based abstraction for MDPs, show how to apply it to POMDPs, and state the correctness of our method.

III-A Problem Statement

We consider a motion planning problem that involves moving agents inside a world such as a landscape or a room. One agent (Agent 0) is controllable; the other agents (also called opponents) move stochastically according to a fixed randomized strategy which is based on their own location and the location of Agent 0. We assume that all agents move in an alternating manner. A position of an agent defines the physical location inside the world as well as additional properties such as the agent's orientation. A graph models all possible movements of an agent between positions, referred to as the world graph of that agent. Therefore, nodes in the graph uniquely refer to positions, while multiple nodes may refer to the same physical location in the world. We require that the graph does not contain any deadlocks: For every position, there is at least one edge in the graph corresponding to a possible action the agent can execute.

A collision occurs if Agent 0 shares its location with another agent. The set of goal nodes (goal positions) in the graph is uniquely defined by a set of physical goal locations in the world. The target is to move towards a goal node without colliding with other agents. Technically, we need to synthesize a strategy for Agent 0 that maximizes the probability to achieve the target. Additionally, we assume:

  • The strategies of all opponents are known to Agent 0.

  • Agent 0 is able to observe its own position and knows the goal positions it has to reach.

  • The positions of opponents are observable for Agent 0 from its current position if they are visible with respect to a certain distance metric.

Generalizing the problem statement is discussed in Sect. VI.

III-B Formal setting

We first define an individual world graph for each Agent $i$ ($0 \leq i \leq k$) over a fixed set $\mathit{Loc}$ of (physical) locations. [World graph of Agent $i$] The world graph for Agent $i$ over $\mathit{Loc}$ is a tuple $G_i = (\mathit{Pos}_i, p_{0,i}, \mathit{Mov}_i, E_i, \mathit{loc}_i)$ such that $\mathit{Pos}_i$ is the set of positions and $p_{0,i} \in \mathit{Pos}_i$ is the initial position of Agent $i$. $\mathit{Mov}_i$ is the set of movements (we use the term movements to avoid confusion with actions in PGs); the edges $E_i \subseteq \mathit{Pos}_i \times \mathit{Mov}_i \times \mathit{Pos}_i$ are the movement effects. The function $\mathit{loc}_i \colon \mathit{Pos}_i \to \mathit{Loc}$ maps a position to the corresponding location. The enabled movements for Agent $i$ in position $p$ are $\mathit{Mov}_i(p) = \{m \in \mathit{Mov}_i \mid \exists p' \in \mathit{Pos}_i.\ (p, m, p') \in E_i\}$.

For Agent 0 we need the possibility to restrict its viewing range. This is done by a function $\mathit{vis} \colon \mathit{Pos}_0 \to 2^{\mathit{Loc}}$ which assigns to each position of Agent 0 the set of visible locations. According to our assumptions, for all $p \in \mathit{Pos}_0$ it holds that $\mathit{loc}_0(p) \in \mathit{vis}(p)$; moreover, Agent 0 knows the goal locations.
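A possible lightweight encoding of a world graph and the visibility function is sketched below; the field names are illustrative and the paper's actual data structures are not specified here.

```python
# Illustrative encoding of a world graph (Sect. III-B) and the visibility map.
from dataclasses import dataclass
from typing import Dict, Set, Tuple, Hashable

Pos, Mov, Loc = Hashable, Hashable, Hashable


@dataclass
class WorldGraph:
    positions: Set[Pos]
    init: Pos
    edges: Set[Tuple[Pos, Mov, Pos]]        # movement effects (p, m, p')
    loc: Dict[Pos, Loc]                     # position -> physical location

    def enabled(self, p: Pos) -> Set[Mov]:
        """Movements enabled in position p (non-empty by the no-deadlock assumption)."""
        return {m for (q, m, _) in self.edges if q == p}


# Viewing range of the controllable agent: vis[p] is the set of visible locations from p.
Vis = Dict[Pos, Set[Loc]]
```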

Each Agent $i$ with $i \geq 1$ has a randomized strategy $\sigma_i \colon \mathit{Pos}_0 \times \mathit{Pos}_i \to \mathit{Distr}(\mathit{Mov}_i)$, which maps positions of Agent 0 and Agent $i$ to a distribution over enabled movements of Agent $i$. The world graphs for all agents together with the randomized strategies for the opponents are subsumed by a single world POMDP. We first define the underlying world MDP modeling the possible behavior of all agents based on their associated world graphs. [World MDP] Given world graphs $G_0, \ldots, G_k$, the induced world MDP $M = (S, s_0, \mathit{Act}, \mathcal{P})$ is defined by $S = \mathit{Pos}_0 \times \cdots \times \mathit{Pos}_k \times \{0, \ldots, k\}$, where the last component indicates which agent moves next, $s_0 = (p_{0,0}, \ldots, p_{0,k}, 0)$, and $\mathit{Act} = \mathit{Mov}_0 \cup \{\alpha_1, \ldots, \alpha_k\}$, where $\alpha_i$ is the single action representing a move of Agent $i$. $\mathcal{P}$ is defined by:

  • For $m \in \mathit{Mov}_0(p_0)$ with $(p_0, m, p_0') \in E_0$, we have $\mathcal{P}((p_0, p_1, \ldots, p_k, 0), m) = \delta_{(p_0', p_1, \ldots, p_k, 1)}$.

  • $\mathcal{P}((p_0, \ldots, p_i, \ldots, p_k, i), \alpha_i)\big((p_0, \ldots, p_i', \ldots, p_k, j)\big) = \sum_{m \colon (p_i, m, p_i') \in E_i} \sigma_i(p_0, p_i)(m)$, with $i \geq 1$, $j = (i+1) \bmod (k+1)$, and all other positions unchanged.

  • $\mathcal{P}$ is undefined in all other cases.

The first item in the definition of $\mathcal{P}$ translates each movement in the world graph of Agent 0 into an action in the MDP that connects states with probability one, i.e., a Dirac distribution is attached to each action. Upon taking the action, the position of Agent 0 changes and Agent 1 has to move next.

The second item defines the movements of the opponents. In each state where Agent $i$ is moving next, the single action $\alpha_i$ reflecting this move is added. The outcome of $\alpha_i$ is determined by $\sigma_i$ and the fact that the next agent moves afterwards. [World POMDP] Let $M$ be a world MDP for world graphs $G_0, \ldots, G_k$. The world POMDP is $\mathcal{M} = (M, Z, O)$ with $Z = \mathit{Pos}_0 \times (\mathit{Pos}_1 \cup \{\bot\}) \times \cdots \times (\mathit{Pos}_k \cup \{\bot\}) \times \{0, \ldots, k\}$ and $O$ defined by
$O((p_0, p_1, \ldots, p_k, j)) = (p_0, o_1, \ldots, o_k, j)$ with $o_i = p_i$ if $\mathit{loc}_i(p_i) \in \mathit{vis}(p_0)$ and $o_i = \bot$ otherwise.

Thus, the position of Agent $i$ is observed iff the location of Agent $i$ is visible from the position of Agent 0; otherwise a dummy value $\bot$, which is referred to as far away, is observed.

Given a set $\mathit{Goal} \subseteq \mathit{Loc}$ of goal locations, the mappings $\mathit{loc}_i$ are used to define the states corresponding to collisions and goal locations. In particular, we have $\mathit{Collision} = \{(p_0, \ldots, p_k, j) \in S \mid \exists i \geq 1.\ \mathit{loc}_0(p_0) = \mathit{loc}_i(p_i)\}$ and $\mathit{Goals} = \{(p_0, \ldots, p_k, j) \in S \mid \mathit{loc}_0(p_0) \in \mathit{Goal}\}$.
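For a single opponent, the observation function of the world POMDP can be sketched as follows; `loc0`, `loc1`, and `vis` are illustrative names for the location and visibility maps defined above.

```python
# Sketch of the world POMDP observation function for one opponent:
# the opponent's position is revealed only if its location is visible
# from Agent 0's position, otherwise the dummy value FAR_AWAY is observed.
FAR_AWAY = None


def observe(state, loc0, loc1, vis):
    """state = (p0, p1, turn); loc0/loc1 map positions to locations; vis[p0] is the visible set."""
    p0, p1, turn = state
    o1 = p1 if loc1[p1] in vis[p0] else FAR_AWAY
    return (p0, o1, turn)
```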

Formal problem statement

Given a world POMDP $\mathcal{M}$ for a set of world graphs $G_0, \ldots, G_k$, a set of collision states $\mathit{Collision}$, and a set of goal states $\mathit{Goals}$, an observation-based strategy $\sigma^z$ for Agent 0 is $\lambda$-safe if $\Pr^{\sigma^z}_{\mathcal{M}}(\neg \mathit{Collision}\ \mathcal{U}\ \mathit{Goals}) \geq \lambda$ holds. We want to compute a $\lambda$-safe strategy for a given $\lambda \in [0,1]$.

III-C Abstraction

We propose an abstraction method for world POMDPs that builds on game-based abstraction (GBAR), originally defined for MDPs [10, 11].

GBAR for MDPs

For an MDP $M = (S, s_0, \mathit{Act}, \mathcal{P})$, we assume a partition $\Pi$ of $S$, i.e., a set of non-empty, pairwise disjoint subsets of $S$ (called blocks) with $\bigcup_{B \in \Pi} B = S$. GBAR takes the partition and turns each block $B \in \Pi$ into an abstract state; these blocks form the Player 1 states, $S_1 = \Pi$. To determine the outcome of selecting an action $a$ in a block $B$, we add intermediate selector states $(B, a)$ as Player 2 states. In the selector state $(B, a)$, the emanating actions reflect the choice of the actual state $s \in B$ the system is in. For taking action $a$ in $s \in B$, the distribution $\mathcal{P}(s, a)$ is lifted to a distribution over abstract states:

$\mathcal{P}^{\mathit{lift}}(s, a)(B') = \sum_{s' \in B'} \mathcal{P}(s, a)(s') \quad \text{for all } B' \in \Pi.$

The semantics of this PG is as follows: In an abstract state, Player 1 (controllable) selects an action to execute. In the selector states, Player 2 (adversarial) selects the worst-case concrete state from which the selected action is executed.
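Lifting a concrete distribution to the partition is a simple aggregation; a minimal sketch with illustrative names:

```python
# Lifting a concrete distribution P(s, a) to a distribution over blocks,
# as used by game-based abstraction (illustrative helper, not the paper's code).
def lift(dist, block_of):
    """dist: dict state -> prob; block_of: dict state -> block id. Returns dict block -> prob."""
    lifted = {}
    for s2, p in dist.items():
        b = block_of[s2]
        lifted[b] = lifted.get(b, 0.0) + p
    return lifted
```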

Applying GBAR to POMDPs

The key idea in GBAR for POMDPs is to merge states with equal observations. [Abstract PG] The abstract PG of POMDP $\mathcal{M} = (M, Z, O)$ is the PG $\mathcal{G} = (S_1, S_2, \bar{s}_0, \mathit{Act}, \mathcal{P}^{\mathit{abs}})$ with $S_1 = \{[s] \mid s \in S\}$ where $[s] = \{s' \in S \mid O(s') = O(s)\}$, $S_2 = \{(B, a) \mid B \in S_1,\ a \in \mathit{Act}(B)\}$, $\bar{s}_0 \in S_1$ s. t. $s_0 \in \bar{s}_0$, and the transition function $\mathcal{P}^{\mathit{abs}}$.

The transition probabilities are defined as follows:

  • $\mathcal{P}^{\mathit{abs}}(B, a) = \delta_{(B, a)}$ for $B \in S_1$ and $a \in \mathit{Act}(B)$,

  • $\mathcal{P}^{\mathit{abs}}((B, a), s)(B') = \sum_{s' \in B'} \mathcal{P}(s, a)(s')$ for $(B, a) \in S_2$, $s \in B$, and $B' \in S_1$,

  • and $\mathcal{P}^{\mathit{abs}}$ is undefined in all other cases.

By construction, Player 1 has to select the same action for all states in an abstract state. As the abstract states coincide with the observations, this means that we obtain an observation-based strategy for the POMDP. For the class of properties we consider, a memoryless deterministic strategy suffices for PGs to achieve the maximal probability of reaching a goal state without collision [31]. We thus obtain an optimal strategy $\sigma_1$ for Player 1 in the PG which maps every abstract state to an action. As abstract states are constructed such that they coincide with all possible observations in the POMDP (see Def. III-C), this means that $\sigma_1$ maps every observation to an action.
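Putting the pieces together, the following sketch builds the observation-based abstraction of a finite POMDP; `obs`, `actions`, and `P` are illustrative handles for the POMDP's observation function, enabled actions, and transition function, and the state encoding is an assumption for illustration.

```python
# Sketch of the POMDP-to-PG abstraction: Player-1 states are observation classes,
# Player-2 states are (class, action) selector states (illustrative, not the paper's tool).
def abstract_pg(states, obs, actions, P):
    """states: iterable of POMDP states; obs[s]: observation; actions(s): enabled actions;
    P[s][a]: dict successor -> probability. Returns the Player-1 blocks and the PG transitions."""
    blocks = {}                                    # observation -> set of states
    for s in states:
        blocks.setdefault(obs[s], set()).add(s)
    player1 = [frozenset(B) for B in blocks.values()]
    trans = {}
    for B in player1:
        some_state = next(iter(B))
        for a in actions(some_state):              # equal observations share enabled actions
            trans[(B, a)] = {(B, a): 1.0}          # Player 1 picks a and moves to selector (B, a)
            for s in B:                            # Player 2 resolves the concrete state s
                lifted = {}
                for s2, p in P[s][a].items():
                    B2 = frozenset(blocks[obs[s2]])
                    lifted[B2] = lifted.get(B2, 0.0) + p
                trans[((B, a), s)] = lifted
    return player1, trans
```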

Abstract world PG

We now connect the abstraction to our setting. For ease of presentation, we assume in the rest of this section that there is only one opponent agent, i.e., we only have Agent 0 and Agent 1. Therefore, if Agent 0 sees the opponent and moves, no additional agent can appear. Moreover, Agent 0 either knows the exact state or does not know where the opponent is.

First, we call the abstract PG of the world POMDP the abstract world PG. In particular, the abstract states in the world PG are either of the form $(p_0, p_1, j)$ or of the form $(p_0, \bot, j)$ with $p_0 \in \mathit{Pos}_0$ and $p_1 \in \mathit{Pos}_1$. In the former, the opponent is visible and the agent has full knowledge; in the latter, only its own position is known. Recall that $\bot$ is the dummy value referred to as far away. Furthermore, all states in an abstract state correspond to the same position of Agent 0. For abstract states with full knowledge, there is no non-determinism of Player 2 involved, as these states correspond to a single state in the world POMDP.

Correctness

We show that a safe strategy for Player 1 induces a safe strategy for Agent 0. Consider therefore a path in the PG and project it to the sequence of blocks it visits. The position of Agent 0 encoded in the blocks is independent of the choices made by Player 2. The sequence of actions thus yields a unique path of positions of Agent 0 in its world graph. Hence, if the path in the PG reaches a goal state, it induces a path in the POMDP which also reaches a goal state. Moreover, the worst-case behavior over-approximates the probability for the opponent to be in any location, and any collision is observable. Thus, if there is a collision in the POMDP, then there is a collision in the PG.

Formally, for a deterministic memoryless Player 1 strategy $\sigma_1$ in the abstract world PG, the corresponding strategy $\sigma^z$ in the POMDP is defined as $\sigma^z(\pi) = \sigma_1(B)$ for the abstract state $B$ with $\mathit{last}(\pi) \in B$.
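A sketch of this translation: the PG strategy is looked up on the observation class of the path's last state (all names are illustrative).

```python
# Translating a memoryless deterministic Player-1 strategy on the abstract world PG
# into an observation-based strategy for the POMDP (illustrative sketch).
def pomdp_strategy(pg_strategy, obs, blocks_by_obs):
    """pg_strategy: dict block -> action; blocks_by_obs: dict observation -> block."""
    def sigma(path):
        s = path[-1]                        # last state of the POMDP path
        B = blocks_by_obs[obs[s]]           # the abstract state = observation class of s
        return pg_strategy[B]               # same action for all states with this observation
    return sigma
```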

Theorem 1

Given a $\lambda$-safe strategy in an abstract world PG, the corresponding strategy in the world POMDP is $\lambda$-safe.

The assessment of the strategy is conservative: A $\lambda$-safe strategy in the abstract world PG may induce a corresponding strategy in the POMDP which is $\lambda'$-safe for some $\lambda' \geq \lambda$. In particular, applying the corresponding strategy to the original POMDP yields a discrete-time Markov chain (MC). This MC can be efficiently analyzed by probabilistic model checking to determine the value of $\lambda'$. Naturally, the optimal scheduler obtained for the PG need not be optimal for the POMDP.

All positions from which Agent 1 is visible yield Dirac distributions in the belief MDP; the successor states then depend solely on the action choice. These beliefs are represented as single states in the abstract world PG. The abstraction lumps, for each position of Agent 0, all (uncountably many) other belief states together.

III-D Refinement of the PG

In the GBAR approach described above, we remove information that may be relevant for an optimal strategy. In particular, the behavior of Agent 1 (the opponent) is strengthened (over-approximated):

  • We abstract probabilistic movements of Agent 1 outside of the visible area into non-determinism.

  • We allow jumps in Agent 1's movements, i.e., Agent 1 may jump to a different position in the PG. This is impossible in the POMDP; we call such movements spurious.

If, due to the lack of this information, no safe strategy can be found, the abstraction needs to be refined. In GBAR for MDPs [11], abstract states are split heuristically, yielding a finer over-approximation. In our construction, we cannot split abstract states arbitrarily: This would destroy the one-to-one correspondence between abstract states and observations. We would thus obtain a partially observable PG, or equivalently, for a strategy in the PG the corresponding strategy in the original POMDP would no longer be observation-based.

However, we can restrict the spurious movements of Agent 1 by taking the history of observations made along a path into account. We present three types of history-based refinements.

One-step history refinement

If Agent 0 moves to a state from which Agent 1 is no longer visible, the abstract state is of the form $(p_0, \bot, j)$. Upon the next move, Agent 1 could thus appear anywhere. However, until Agent 1 moves, the belief MDP is still in a Dirac distribution; the positions where Agent 1 can appear are thus restricted. Similarly, if Agent 1 disappears, then upon a turn of Agent 0 in the same direction, Agent 1 will be visible again. The (one-step history) refined world PG extends the original PG by additional states $(p_0, p_1, j)$ where $p_1$ is not visible from $p_0$. These "far away" states are only reached from states with full information. Intuitively, although Agent 1 is invisible, its position is remembered for one step.

Multi-step history refinement

Further refinement is possible by considering longer paths. If we first observe Agent 1 at position $p$, then lose visibility for one turn, and then observe Agent 1 again at position $p'$, then we know that either $p$ and $p'$ are at most two moves apart or such a movement is spurious. To encode the observational history into the states of the abstraction, we store the last known position of Agent 1, as well as the number $n$ of moves made since then. We then only allow Agent 1 to appear in positions which are at most $n$ moves away from the last known position. We can cap $n$ by the diameter of the graph.
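The position restriction amounts to a bounded breadth-first search in the opponent's world graph; the sketch below assumes the illustrative edge representation as (position, movement, position) triples introduced earlier.

```python
# Sketch of the multi-step history restriction: after n unobserved moves, the opponent
# can only be in positions at most n movement steps from its last known position.
from collections import deque


def reachable_within(graph_edges, last_known, n):
    """Positions reachable from last_known in at most n opponent moves."""
    seen, frontier = {last_known}, deque([(last_known, 0)])
    while frontier:
        p, d = frontier.popleft()
        if d == n:
            continue
        for (q, _m, q2) in graph_edges:     # follow all movement edges leaving p
            if q == p and q2 not in seen:
                seen.add(q2)
                frontier.append((q2, d + 1))
    return seen
```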

Region-based multi-step history refinement

As the refinement above blows up the state space drastically, we utilize a technique called magnifying-lens abstraction [33]. Instead of single locations, we consider regions of locations together with the information whether Agent 1 could be present. After each move, we extend the set of possible regions by all neighboring regions.

More formally, the (region-based multi-step history) refined world PG uses a refined far-away value: Given a partition of the positions of Agent 1, extracted from the graph structure, into regions $R_1, \ldots, R_m$ with $\bigcup_l R_l = \mathit{Pos}_1$ and $R_l \cap R_{l'} = \emptyset$ for $l \neq l'$, the single far-away value $\bot$ is replaced by the set of regions in which Agent 1 could currently be. Abstract states now are either of the form $(p_0, p_1, j)$ as before, or $(p_0, \mathcal{R}, j)$ with $\mathcal{R} \subseteq \{R_1, \ldots, R_m\}$. For singleton regions, this coincides with the method proposed above. Notice that this approach also offers some flexibility: If, for instance, two regions are connected only via the visible area, Agent 0 can determine whether Agent 1 enters the other region.

Correctness

First, a deterministic memoryless strategy on a refined abstract world PG needs to be translated to a strategy for the original POMDP while $\lambda$-safety is preserved. Intuitively, as the proposed refinement steps encode history into the abstract world PG, the corresponding POMDP strategy is no longer memoryless but requires finite memory, bounded by the maximum number of moves that are remembered.

Theorem 2

A $\lambda$-safe strategy in a refined abstract world PG has a corresponding $\lambda$-safe strategy in the world POMDP.

The proposed refinements eliminate spurious movements of Agent 1 from the original abstract world PG. Intuitively, the number of states in which Player 2 may select states with belief zero (in the underlying belief MDP) is reduced. We thus only prevent paths that have probability zero in the POMDP. Vice versa, the refinement does not restrict the movement of Agent 0, and any path leading to a goal state still leads to a goal state in the refinement. However, the behavior of Agent 1 is restricted; therefore, the probability of a collision drops. Intuitively, for the refined PG, strategies can be computed that are at least as good as those for the original PG.

Theorem 3

If an abstract world PG has a $\lambda$-safe strategy, then its refined abstract world PG has a $\lambda'$-safe strategy with $\lambda' \geq \lambda$.

III-E Refinement of the Graph

The proposed approach cannot solve every scenario, as the problem is undecidable [12]. Therefore, if the method fails to find a $\lambda$-safe scheduler, we do not know whether such a scheduler exists. With increased visibility, however, the maximal level of safety does not decrease, neither in the POMDP nor in the PG. To determine good spots for increased visibility, we can use the analysis results: Locations in which a collision occurs are most likely good candidates.

IV Case Study and Implementation

IV-A Description

For our experiments, we choose the following scenario: A (controllable) Robot R and a Vacuum Cleaner VC move around in a two-dimensional grid world with static opaque obstacles. Neither R nor VC may leave the grid or visit grid cells occupied by a static obstacle. The position of R consists of its cell (the location) and a wind direction. R can move one grid cell forward, or turn by 90° in either direction without changing its location. The position of VC is determined solely by its cell. In each step, VC can move one cell in any wind direction. We assume that VC moves to each of the available successor cells with equal probability.

The sensors of R only sense VC within a viewing range around R's cell. More precisely, VC is visible iff its cell lies within the viewing range and there is no grid cell with a static obstacle on the straight line from the center of R's cell to the center of VC's cell. That means, R can observe the position of VC if VC is within the viewing range and not hidden behind an obstacle. A refinement of the world is realized by adding cameras, which make cells visible independently of the location of R.
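A hedged sketch of such a visibility test is given below; the exact range metric and line-of-sight geometry of the tool-chain are not specified in the text, so both (a Chebyshev-style range check and a sampled line segment) are assumptions for illustration.

```python
# Illustrative grid visibility test: within the viewing range and no obstacle cell
# intersects the straight line between the two cell centers (sampled, approximate).
def visible(r_cell, vc_cell, obstacles, view_range, samples=100):
    """r_cell, vc_cell: (x, y) grid cells; obstacles: set of blocked cells."""
    (rx, ry), (vx, vy) = r_cell, vc_cell
    if max(abs(rx - vx), abs(ry - vy)) > view_range:   # assumed Chebyshev-style range check
        return False
    for i in range(samples + 1):                        # sample points along the segment of centers
        t = i / samples
        cell = (int(rx + 0.5 + t * (vx - rx)), int(ry + 0.5 + t * (vy - ry)))
        if cell in obstacles:
            return False
    return True
```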

IV-B Tool-Chain

To synthesize strategies for the scenario described above, we implemented a tool-chain in Python. The input consists of the grid with the locations of all obstacles, the locations of cameras, and the viewing range. As output, two files are created: a PG formulation of the abstraction including the one-step history refinement, to be analyzed using PRISM-games [9], and the original POMDP for PRISM [13]. For multi-step history refinement, additional regions can be defined.

The encoding of the PG contains a precomputed lookup table for the visibility relation. The PG is described by two parallel processes running interleaved: one for Player 1 and one for Player 2. As only R can make choices, these are listed as Player 1 actions, while VC's moves are stored as Player 2 actions. More precisely, the process for R contains its location, and the process for VC either contains its location or the far-away value. In each round, Player 1 makes its decision; afterwards, the outcome of the move and the outcome of the subsequent move of VC are compressed into one step of Player 2.

V Experiments

V-A Experimental Setup

All experiments were run on a machine with a 3.6 GHz Intel® Core™ i7-4790 CPU and 16 GB RAM, running Ubuntu Linux 16.04. We denote experiments taking over 5400 s of CPU time as a time-out and experiments taking over 10 GB of memory as a mem-out (MO). We considered several variants of the scenario described in Sect. IV-A. The Robot always started in the upper-left corner and had the lower-right corner as its target; the VC started in the lower-right corner. In all variants, the viewing range was fixed. We evaluated the following five scenarios:

  • SC1: Rooms of varying size without obstacles.

  • SC2: Differently sized rooms with a cross-shaped obstacle in the center, which scales with increasing grid size.

  • SC3: A room with up to 70 randomly placed obstacles.

  • SC4: Two rooms, as depicted in Fig. 2. The doorway connecting the two rooms is a potential point of failure, as R cannot see to the other side. To improve reachability, we added cameras to improve visibility.

  • SC5: Corridors: long, narrow grids that the Robot has to traverse from top to bottom, passing the VC on its way down.

Fig. 2: Grid for SC4. The cameras observe the shaded area.
Grid size | POMDP solution: States, Choices, Trans., Result, Model Time, Sol. Time | PG solution: States, Choices, Trans., Result, Model Time, Sol. Time | MDP: Result
299 515 739 0.8323 0.063 0.26 400 645 1053 0.8323 0.142 0.036 0.8323
983 1778 2705 0.9556 0.099 1.81 1348 2198 3897 0.9556 0.353 0.080 0.9556
2835 5207 8148 0.9882 0.144 175.94 6124 10700 19248 0.9740 0.188 0.649 0.9882
4390 8126 12890 0.9945 0.228 4215.056 8058 14383 26079 0.9785 0.242 0.518 0.9945
6705 20086 12501 ?? 0.377 – MO – 10592 19286 35226 0.9830 0.322 1.872 0.9970
24893 47413 78338 ?? 1.735 – MO – 23128 81090 43790 0.9897 0.527 6.349 0.9998
66297 127829 214094 ?? 9.086 – MO – 40464 145482 78054 0.9914 0.904 6.882 0.9999
– Time out during model construction – 199144 745362 395774 0.9921 8.580 122.835 0.9999
– Time out during model construction – 477824 1808442 957494 0.9921 41.766 303.250 0.9999
– Time out during model construction – 876504 3334722 1763214 0.9921 125.737 1480.907 0.9999
– Time out during model construction – 1395184 5324202 2812934 0.9921 280.079 3129.577 – MO –
TABLE I: Comparing the POMDP solution using PRISM [13] with the solution of the PG abstraction using PRISM-games [9] on SC1.

V-B Results

Table I shows the direct comparison between the POMDP formulation and the abstraction for SC1. The first column gives the grid size. Then, first for the POMDP and afterwards for the PG, the table lists the number of states, non-deterministic choices, and transitions of the model. The results include the safety probability induced by the optimal scheduler ("Result"), the run times (all in seconds) for constructing the state space from the symbolic description ("Model Time"), and finally the time to solve the POMDP / PG ("Sol. Time"). The last column shows the safety probability computed on the fully observable MDP; it is an upper bound on the probability achievable for each grid. Note that optimal schedulers for this MDP are in general not observation-based and therefore not admissible for the POMDP. The time for creating the input files was small in all cases.

Table II lists data for the PGs constructed for SC2 (first block of rows) and SC5 (without additional refinement in the second block, with region-based multi-step history refinement in the third block), analogous to Table I. Additionally, the run time for creating the symbolic description is given ("Run times / Create"). On the fully observable MDP, the resulting probability is 1.0 for all SC2 instances and 0.999 for all SC5 instances.

Table III shows the results for SC3. The first column ("#O") gives the number of obstacles, while the remaining columns are analogous to Table II. The data for SC4 is shown in Table IV. Its structure is identical to that of Table III, with the first column ("#C") giving the number of cameras added for the graph refinement as in Sect. III-E.

PG Run times
Grid States Choices Trans. Result Create Model Solve

SC2

36084 66942 120480 0.9920 0.08 3 24
173584 331482 618148 0.9972 1.19 41 103
431044 834242 1572948 0.9977 7.62 231 312
808504 1575402 2985348 0.9978 31.92 1220 805

SC5

50880 93734 170974 0.9228 0.01 1.4 17
77560 143254 261534 0.8923 0.01 2.8 64
104240 192774 352094 0.8628 0.01 5.2 110
130920 242294 442654 0.8343 0.02 6.9 157

SC5 + ref.

55300 120848 198088 0.9799 0.01 25.2 38
83820 182368 300648 0.9799 0.01 42.6 177
112340 243888 403208 0.9799 0.01 74.2 191
140860 305408 505768 0.9799 0.02 117.5 629
TABLE II: Results for the PG for differently sized models.
PG Run times MDP
#O States Choices Trans. Result Create Model Solve Result
10 297686 581135 1093201 0.9976 2.10 89.7 285.0 0.9999
40 234012 454652 823410 0.9706 2.74 87.3 179.1 0.9999
60 198927 385803 679321 0.6476 3.12 59.4 201.5 0.9999
70 187515 363401 633884 0.6210 3.30 59.4 116.1 0.9896
TABLE III: Results for SC3
PG Run times MDP
#C States Choices Trans. Result Create Model Solve Result
none 76768 145562 271152 0.5127 0.22 7.9 23.5 0.9999
2 152920 291866 546719 0.9978 0.24 16.9 68.1
TABLE IV: Results for SC4

V-C Evaluation

Consider SC1: While PRISM delivers results within reasonable time for very small examples, medium-sized grids already yield a mem-out. Our abstraction, on the other hand, handles much larger grids within minutes, while still providing schedulers with a solid performance. The safety probability is lower for small grids, as there is less room for R to avoid VC, and there are proportionally more situations in which R is trapped in a corner or against a wall. Notice that for the MDP, the state space of an $n \times n$ grid grows with the fourth power of $n$, since the positions of both agents are tracked, whereas the PG state space grows only quadratically in $n$ for a fixed viewing range. As a consequence, no upper bound could be computed for the largest grid, as constructing the MDP state space yielded a mem-out.

In Table II, for the SC5 benchmarks, we see that the safety probability goes down for grids with a longer corridor. This is because, in the abstraction, the Robot can meet the VC multiple times while traveling down the corridor. To avoid this unrealistic behavior, we used the region-based multi-step history refinement described in Sect. III-D. Although we only track one step of the VC's history, this is enough to keep the safety probability at a value much closer to the upper bound, regardless of the length of the corridor.

Table II, SC2, indicates that the pre-computation of the visibility lookup (see Sect. IV-B) for large grids with many obstacles eventually takes significant time, yet the model construction time increases at a faster pace. In comparison with SC1, we see that adding obstacles decreases the number of reachable states and thus also reduces the number of choices and transitions. Eventually, model construction takes longer than the actual model checking procedure.

Table III indicates that the model checking time is not significantly influenced by the number of obstacles. Furthermore, we observe that the first 40 obstacles are benign and only marginally influence the safety probability, while at 60 or more obstacles the probability dips significantly compared to the upper bound. This is because the added obstacles create blind spots in which the Robot can no longer observe the movement of the VC.

The same blind-spot behavior can also be observed in Table IV (SC4). Here, we add cameras to aid the Robot by providing improved visibility around the blind spot, resulting in a near-perfect safety probability. This doubles the state space size and increases the model checking time by about 45 seconds.

VI Discussion

Game-based abstraction successfully prunes the state space of MDPs by merging similar states. By adding an adversary that assumes the worst-case state, a PG is obtained. In general, this turns the POMDP at hand into a partially observable PG, which remains intractable. However, splitting according to observational equivalence leads to a fully observable PG. PGs can be analyzed by black-box algorithms as implemented, e.g., in PRISM-games [9], which also returns an optimal scheduler. The strategy from the PG can be applied to the POMDP, which yields the actual (higher) safety level.

In general, the abstraction can be too coarse; however, the examples above show that the game-based abstraction is not too coarse if one makes some assumptions about the POMDP. These assumptions are often naturally fulfilled by motion planning scenarios.

The assumptions from Sect. III-A can be relaxed in several respects: Our method naturally extends to multiple opponents. We restricted the method to a single controllable agent, but if information is shared among multiple agents, the method also applies to this setting. If information sharing is restricted, special care has to be taken to prevent information leakage. Richer classes of behavior for the opponents, including non-deterministic choices, are an important area for future research. This would lead to partially observable PGs, and game-based abstraction would yield three-player games. As both sources of non-determinism are uncontrollable, the opponents and the abstraction could jointly be controlled by Player 2, thus yielding a PG again.

Supporting a richer class of temporal specifications is another option: PRISM-games supports a probabilistic variant of alternating (linear-)time logic, extended by rewards and trade-off analysis. Using the same abstraction technique as presented here, a larger class of properties can thus be analyzed. However, care has to be taken when combining invariants and reachability criteria arbitrarily, as they involve under- and over-approximations.

Our method can be generalized to POMDPs from other settings. We use the original problem statement on the graph only to motivate correctness. The abstraction itself can be lifted (as indicated by Def. III-C); for the refinement, however, a more refined correctness argument is necessary.

The proposed construction of the PG is straightforward and currently realized without constructing the POMDP first. This simplifies the implementation of the refinement, but mapping the scheduler back onto the POMDP is currently not supported. Improved tool support should thus yield better results (cf. the induced probability in Fig. 1) without changing the method.

VII Conclusion

We utilized the successful technique of game-based abstraction to synthesize strategies for a class of POMDPs. Experiments show that this approach is promising. In future work, we will lift our approach to a broader class of POMDPs and improve the refinement steps, including an automatic refinement loop.

References

  • [1] R. A. Howard, Dynamic Programming and Markov Processes, 1st ed.   The MIT Press, 1960.
  • [2] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” vol. 101, no. 1, pp. 99–134, 1998.
  • [3] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics.   The MIT Press, 2005.
  • [4] T. Wongpiromsarn and E. Frazzoli, “Control of probabilistic systems under dynamic, partially known environments with temporal logic specifications.”   IEEE, 2012, pp. 7644–7651.
  • [5] G. Shani, J. Pineau, and R. Kaplow, “A survey of point-based POMDP solvers,” Autonomous Agents and Multi-Agent Systems, vol. 27, no. 1, pp. 1–51, 2013.
  • [6] J.-P. Katoen, “The probabilistic model checking landscape.”   ACM, 2016, pp. 31–45.
  • [7] M. Kwiatkowska, G. Norman, and D. Parker, “PRISM 4.0: Verification of probabilistic real-time systems,” vol. 6806, 2011, pp. 585–591.
  • [8] C. Dehnert, S. Junges, J.-P. Katoen, and M. Volk, “A storm is coming: A modern probabilistic model checker,” CoRR, vol. abs/1702.04311, 2017.
  • [9] T. Chen, V. Forejt, M. Z. Kwiatkowska, D. Parker, and A. Simaitis, “PRISM-games: A model checker for stochastic multi-player games,” vol. 7795, 2013, pp. 185–191.
  • [10] M. Kattenbelt and M. Huth, “Verification and refutation of probabilistic specifications via games,” ser. LIPIcs, vol. 4.   Schloss Dagstuhl, 2009, pp. 251–262.
  • [11] M. Kattenbelt, M. Kwiatkowska, G. Norman, and D. Parker, “A game-based abstraction-refinement framework for Markov decision processes,” vol. 36, no. 3, pp. 246–280, 2010.
  • [12] K. Chatterjee, M. Chmelík, and M. Tracol, “What is decidable about partially observable Markov decision processes with ω-regular objectives,” vol. 82, no. 5, pp. 878–911, 2016.
  • [13] G. Norman, D. Parker, and X. Zou, “Verification and control of partially observable probabilistic systems,” Real-Time Systems, vol. 53, no. 3, pp. 354–402, 2017.
  • [14] S. Patil, G. Kahn, M. Laskey, J. Schulman, K. Goldberg, and P. Abbeel, “Scaling up Gaussian belief space planning through covariance-free trajectory optimization and automatic differentiation,” in Algorithmic Foundations of Robotics XI, ser. Springer Tracts in Advanced Robotics, vol. 107, 2014, pp. 515–533.
  • [15] B. Burns and O. Brock, “Sampling-based motion planning with sensing uncertainty.”   IEEE, 2007, pp. 3313–3318.
  • [16] A. Bry and N. Roy, “Rapidly-exploring random belief trees for motion planning under uncertainty.”   IEEE, 2011, pp. 723–730.
  • [17] C.-I. Vasile, K. Leahy, E. Cristofalo, A. Jones, M. Schwager, and C. Belta, “Control in belief space with temporal logic specifications,” 2016, pp. 7419–7424.
  • [18] K. Hauser, “Randomized belief-space replanning in partially-observable continuous spaces,” in Algorithmic Foundations of Robotics IX, 2010, pp. 193–209.
  • [19] M. P. Vitus and C. J. Tomlin, “Closed-loop belief space planning for linear, Gaussian systems.”   IEEE, 2011, pp. 2152–2159.
  • [20] D. K. Grady, M. Moll, and L. E. Kavraki, “Extending the applicability of POMDP solutions to robotic tasks,” IEEE Trans. Robotics, vol. 31, no. 4, pp. 948–961, 2015.
  • [21] L. de Alfaro, “The verification of probabilistic systems under memoryless partial-information policies is hard,” DTIC Document, Tech. Rep., 1999.
  • [22] K. Chatterjee, M. Chmelík, R. Gupta, and A. Kanodia, “Qualitative analysis of POMDPs with temporal logic specifications for robotics applications,” 2015, pp. 325–330.
  • [23] ——, “Optimal cost almost-sure reachability in POMDPs,” vol. 234, pp. 26–48, 2016.
  • [24] H. Yu and D. P. Bertsekas, “Discretized approximations for POMDP with average cost,” in UAI.   AUAI Press, 2004, p. 519.
  • [25] S. Giro and M. N. Rabe, “Verification of partial-information probabilistic systems using counterexample-guided refinements,” vol. 7561, 2012, pp. 333–348.
  • [26] X. Zhang, B. Wu, and H. Lin, “Assume-guarantee reasoning framework for MDP-POMDP.”   IEEE, 2016, pp. 795–800.
  • [27] ——, “Counterexample-guided abstraction refinement for POMDPs,” CoRR, vol. abs/1701.06209, 2017.
  • [28] K. Chatterjee and L. Doyen, “Partial-observation stochastic games: How to win when belief fails,” vol. 15, no. 2, pp. 16:1–16:44, 2014.
  • [29] M. Svorenová and M. Kwiatkowska, “Quantitative verification and strategy synthesis for stochastic games,” Eur. J. Control, vol. 30, pp. 15–30, 2016.
  • [30] C. Baier and J.-P. Katoen, Principles of Model Checking.   MIT Press, 2008.
  • [31] A. Condon, “The complexity of stochastic games,” Inf. Comput., vol. 96, no. 2, pp. 203–224, 1992.
  • [32] S. M. Ross, Introduction to Stochastic Dynamic Programming.   Academic Press, Inc., 1983.
  • [33] L. de Alfaro and P. Roy, “Magnifying-lens abstraction for Markov decision processes,” vol. 4590.   Springer, 2007, pp. 325–338.