I. Introduction
Offline motion planning for dynamical systems with uncertainties aims at finding a strategy for an agent that ensures certain desired behavior [1]. Planning scenarios that exhibit stochastic behavior are naturally modeled by Markov decision processes (MDPs). An MDP is a nondeterministic model in which the agent chooses to perform an action under full knowledge of the environment it is operating in. The outcome of the action is a distribution over the states. For many robotic applications, however, information about the current state of the environment is not observable [2, 3, 4]. In such scenarios, where the actual state of the environment is not exactly known, the model is a partially observable Markov decision process (POMDP). By tracking the observations made while it moves, an agent can infer the likelihood of the environment being in a certain state. This likelihood is called the belief state of the agent. Executing an action leads to an update of the belief state because new observations are made. The belief state together with an update function form a (possibly infinite) MDP, commonly referred to as the underlying belief MDP [5].
As an example, take a scenario where a controllable agent needs to traverse a room while avoiding static obstacles and randomly moving opponents whose positions cannot always be observed by the agent. The goal is to determine a strategy for the agent that (provably) ensures safe traversal with a certain high probability.
Quantitative verification techniques like probabilistic model checking [6] provide comprehensive guarantees on such a strategy. For finite MDPs, tools like [7] or storm [8] employ efficient model checking algorithms to assess the probability of reaching a certain set of states. However, POMDP verification suffers from the large, potentially infinite belief space and is intractable even for rather small instances.
Approach
We outline the approach and the structure of the paper in Fig. 1; details of the figure are discussed in the respective sections. Starting from a problem description, we propose to use an encoding of the problem as a POMDP. We observe that motion planning scenarios as described above naturally induce certain structural properties in the POMDP. In particular, we assume that the agent can observe its own position, while the exact position of the opponents is observable only if they are nearby according to a given distance metric. We propose an abstraction method that, intuitively, lumps states inducing the same observations. Since it is not exactly known in which state of the environment a certain action is executed, a nondeterministic choice over these lumped states is introduced. Resolving this choice introduces a new level of nondeterminism into the system in addition to the choices of the agent: the POMDP abstraction results in a probabilistic two-player game (PG) [10]. The agent is the first player, choosing an action, while the second player chooses in which of the possible (concrete) states the action is executed. Model checking computes an optimal strategy for the agent on this PG.
The automated abstraction procedure is inspired by game-based abstraction [10, 11] of potentially infinite MDPs, where states are lumped in a similar fashion. We show that our approach is sound in the sense that a strategy for the agent in the PG defines a strategy for the original POMDP; guarantees for the strategy carry over to the POMDP, as the PG strategy induces bounds on the achievable probabilities. As we target an undecidable problem [12], our approach is not complete in the sense that it does not always obtain a strategy which yields the required optimal probability. However, we define a scheme to refine the abstraction and to extend the observability.
We implemented a Python toolchain taking a graph formulation of the motion planning problem as input and applying the proposed abstraction-refinement procedure. The toolchain uses [9] as a model checker for PGs. For the motion planning scenario considered, our preliminary results indicate an improvement of orders of magnitude over the state of the art in POMDP verification [13].
Related work
Sampling-based methods for motion planning in POMDP scenarios are considered in [14, 15, 16, 17]. An overview of point-based value iteration for POMDPs is given in [5]. Other methods employ control techniques to synthesize strategies with safety considerations under observation and dynamics noise [2, 18, 19]. Preprocessing of POMDPs in motion planning problems for robotics is suggested in [20].
General verification problems for POMDPs and their decidability have been studied in [21, 22]. A recent survey of decidability results and algorithms for regular properties is given in [12, 23]. A probabilistic model checker has recently been extended to support POMDPs [13]. Partly based on the methods from [24], it produces lower and upper bounds for a variety of queries; reachability can be analyzed for POMDPs with up to 30,000 states. In [25], an iterative refinement is proposed to solve POMDPs: starting with total information, strategies that depend on unobservable information are successively excluded. In [26], a compositional framework for reasoning about POMDPs is introduced. Refinement based on counterexamples is considered in [27]. Partially observable probabilistic games have been considered in [28]. Finally, an overview of applications for PGs is given in [29].
II. Formal Foundations
II-A Probabilistic games
For a finite or countably infinite set $X$, let $\mu\colon X \to [0,1]$ with $\sum_{x \in X} \mu(x) = 1$ denote a probability distribution over $X$ and $\mathit{Dist}(X)$ the set of all probability distributions over $X$. The Dirac distribution $\delta_x$ on $x \in X$ is given by $\delta_x(x) = 1$ and $\delta_x(y) = 0$ for $y \neq x$.

[Probabilistic game] A probabilistic game (PG) is a tuple $G = (S, S_1, S_2, s_I, \mathit{Act}, \mathcal{P})$ where $S = S_1 \uplus S_2$ is a finite set of states, $S_1$ the states of Player 1, $S_2$ the states of Player 2, $s_I \in S$ the initial state, $\mathit{Act}$ a finite set of actions, and $\mathcal{P}\colon S \times \mathit{Act} \rightharpoonup \mathit{Dist}(S)$ a (partial) probabilistic transition function. Let $\mathit{Act}(s) = \{a \in \mathit{Act} \mid \mathcal{P}(s,a) \neq \bot\}$ denote the available actions in $s \in S$. We assume that the PG does not contain any deadlock states, i.e., $\mathit{Act}(s) \neq \emptyset$ for all $s \in S$. A Markov decision process (MDP) is a PG with $S_2 = \emptyset$; we write $M = (S, s_I, \mathit{Act}, \mathcal{P})$. A discrete-time Markov chain (MC) is an MDP with $|\mathit{Act}(s)| = 1$ for all $s \in S$.

The game is played as follows: In each step, the game is in a unique state $s \in S$. If it is a Player 1 state ($s \in S_1$), then Player 1 nondeterministically chooses an available action $a \in \mathit{Act}(s)$; otherwise Player 2 chooses. The successor state of $s$ is then determined probabilistically according to the distribution $\mathcal{P}(s, a)$: the probability of $s'$ being the next state is $\mathcal{P}(s, a)(s')$. The game is then in state $s'$.
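As an illustration of these game semantics (not part of the paper's formal development; the dictionary encoding and all names are ours), one round of a PG can be simulated as follows:

```python
import random

def play_step(state, trans, player1_states, choose1, choose2, rng=random):
    """One round of a PG: the player owning `state` picks an enabled action,
    then the successor is sampled from the transition distribution.

    trans: dict (state, action) -> dict successor -> probability
    choose1 / choose2: strategy callbacks (state, enabled actions) -> action
    """
    enabled = sorted(a for (s, a) in trans if s == state)  # Act(state)
    assert enabled, "deadlock state encountered"
    owner = choose1 if state in player1_states else choose2
    action = owner(state, enabled)
    dist = trans[(state, action)]
    succs, probs = zip(*dist.items())
    return rng.choices(succs, weights=probs)[0]
```

A strategy here is just a callback; a memoryless deterministic one ignores everything but the current state.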
A path through $G$ is a finite or infinite sequence $\pi = s_0 a_0 s_1 a_1 \cdots$, where $s_0 = s_I$, $s_i \in S$, $a_i \in \mathit{Act}(s_i)$, and $\mathcal{P}(s_i, a_i)(s_{i+1}) > 0$ for all $i \geq 0$. The $i$-th state on $\pi$ is $\pi(i) = s_i$, and $\mathit{last}(\pi)$ denotes the last state of $\pi$ if $\pi$ is finite. The sets of finite and infinite paths are $\mathit{Paths}_{\mathit{fin}}^G$ and $\mathit{Paths}_{\mathit{inf}}^G$, respectively.
To define a probability measure over the paths of a PG $G$, the nondeterminism needs to be resolved by strategies. [PG strategy] A strategy for $G$ is a pair $\sigma = (\sigma_1, \sigma_2)$ of functions $\sigma_i\colon \{\pi \in \mathit{Paths}_{\mathit{fin}}^G \mid \mathit{last}(\pi) \in S_i\} \to \mathit{Dist}(\mathit{Act})$ such that $\sigma_i(\pi)(a) > 0$ implies $a \in \mathit{Act}(\mathit{last}(\pi))$ for all $\pi$. $\Sigma^G$ denotes the set of all strategies of $G$ and $\Sigma_i^G$ the set of all Player $i$ strategies of $G$. For MDPs, a strategy consists of a Player 1 strategy only. A Player $i$ strategy $\sigma_i$ is memoryless if $\mathit{last}(\pi) = \mathit{last}(\pi')$ implies $\sigma_i(\pi) = \sigma_i(\pi')$ for all $\pi, \pi'$ in its domain. It is deterministic if $\sigma_i(\pi)$ is a Dirac distribution for all $\pi$. A memoryless deterministic strategy is of the form $\sigma_i\colon S_i \to \mathit{Act}$.
A strategy for a PG resolves all nondeterministic choices, yielding an induced MC, for which a probability measure over the set of infinite paths is defined by the standard cylinder set construction [30]. These notions are analogous for MDPs.
II-B Partial observability
For many applications, not all system states are observable [2]. For instance, an agent may only have an estimate of the current state of its environment. In that case, the underlying model is a partially observable Markov decision process.

[POMDP] A partially observable Markov decision process (POMDP) is a tuple $\mathcal{M} = (M, Z, O)$ such that $M = (S, s_I, \mathit{Act}, \mathcal{P})$ is the underlying MDP of $\mathcal{M}$, $Z$ is a finite set of observations, and $O\colon S \to Z$ is the observation function. W.l.o.g. we require that states with the same observation have the same set of enabled actions, i.e., $O(s) = O(s')$ implies $\mathit{Act}(s) = \mathit{Act}(s')$ for all $s, s' \in S$. More general observation functions have been considered in the literature, taking into account the last action and providing a distribution over $Z$; there is a polynomial transformation of the general case to the POMDP definition used here [23].

The notions of paths and probability measures directly transfer from MDPs to POMDPs. We lift the observation function to paths: for a POMDP $\mathcal{M}$ and a path $\pi = s_0 a_0 s_1 a_1 \cdots$, the associated observation sequence is $O(\pi) = O(s_0)\, a_0\, O(s_1)\, a_1 \cdots$. Note that several paths in the underlying MDP can give rise to the same observation sequence. Strategies have to take this restricted observability into account. [Observation-based strategy] An observation-based strategy of POMDP $\mathcal{M}$ is a function $\sigma_z\colon \mathit{Paths}_{\mathit{fin}}^M \to \mathit{Dist}(\mathit{Act})$ such that $\sigma_z$ is a strategy for the underlying MDP and for all paths $\pi, \pi'$ with $O(\pi) = O(\pi')$ we have $\sigma_z(\pi) = \sigma_z(\pi')$. $\Sigma_z^{\mathcal{M}}$ denotes the set of such strategies. That means an observation-based strategy selects actions based only on the observations and actions made along the current path.
The semantics of a POMDP can be described using a belief MDP with an uncountable state space. The idea is that each state of the belief MDP corresponds to a distribution over the states of the POMDP. This distribution corresponds to the likelihood of being in a specific state given the observations made so far. Initially, the belief is the Dirac distribution on the initial state. A formal treatment of belief MDPs is beyond the scope of this paper; for details, see [5].
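The belief update performed by the belief MDP can be sketched in a few lines (an illustrative sketch, assuming the deterministic state-based observation function used here; the encodings `trans` and `obs_fn` are our own):

```python
from collections import defaultdict

def belief_update(belief, action, observation, trans, obs_fn):
    """One Bayes update of a POMDP belief state.

    belief: dict state -> probability (current belief b)
    action: the action just executed
    observation: the observation received afterwards
    trans: dict (state, action) -> dict successor -> probability
    obs_fn: dict state -> observation (deterministic observations)
    """
    new_belief = defaultdict(float)
    for s, p in belief.items():
        for s2, q in trans.get((s, action), {}).items():
            if obs_fn[s2] == observation:   # keep successors consistent with z
                new_belief[s2] += p * q
    total = sum(new_belief.values())        # normalization Pr(z | b, a)
    if total == 0:
        raise ValueError("observation impossible under current belief")
    return {s: p / total for s, p in new_belief.items()}
```

Starting from the Dirac belief on the initial state and iterating this update reproduces exactly the (possibly infinite) belief MDP the section describes.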
II-C Specifications
Given a set $\mathit{Goals} \subseteq S$ of goal states and a set $\mathit{Bad} \subseteq S$ of bad states, we consider quantitative reach-avoid properties of the form $\varphi = \mathbb{P}_{\geq \lambda}(\neg \mathit{Bad}\ \mathcal{U}\ \mathit{Goals})$. The specification $\varphi$ is satisfied by a PG if Player 1 has a strategy such that, for all strategies of Player 2, the probability of reaching a goal state without entering a bad state in between is at least $\lambda$. For POMDPs, $\varphi$ is satisfied if the agent has an observation-based strategy which leads to a probability of at least $\lambda$.
III. Methodology
We first describe the problem intuitively and list the assumptions we make. After formalizing the setting, we give a formal problem statement. We then present the intuition behind the concept of game-based abstraction for MDPs, show how to apply it to POMDPs, and state the correctness of our method.
III-A Problem Statement
We consider a motion planning problem that involves moving agents inside a world such as a landscape or a room. One agent ($\mathit{Ag}_0$) is controllable; the other agents (also called opponents) move stochastically according to fixed randomized strategies based on their own location and the location of $\mathit{Ag}_0$. We assume that all agents move in an alternating manner. A position of an agent defines the physical location inside the world as well as additional properties such as the agent's orientation. A graph models all possible movements of an agent between positions, referred to as the world graph of the agent. Nodes in the graph therefore uniquely refer to positions, while multiple nodes may refer to the same physical location in the world. We require that the graph does not contain any deadlocks: for every position, there is at least one edge in the graph corresponding to a possible action the agent can execute.
A collision occurs if $\mathit{Ag}_0$ shares its location with another agent. The set of goal nodes (goal positions) in the graph is uniquely defined by a set of physical goal locations in the world. The target is to move towards a goal node without colliding with other agents. Technically, we need to synthesize a strategy for $\mathit{Ag}_0$ that maximizes the probability to achieve the target. Additionally, we assume:

- The strategies of all opponents are known to $\mathit{Ag}_0$.
- $\mathit{Ag}_0$ is able to observe its own position and knows the goal positions it has to reach.
- The positions of opponents are observable for $\mathit{Ag}_0$ from its current position if they are visible with respect to a certain distance metric.
Generalizing the problem statement is discussed in Sect. VI.
III-B Formal setting
We first define an individual world graph for each agent $\mathit{Ag}_i$ with $0 \leq i \leq m$ over a fixed set $L$ of (physical) locations. [World graph of $\mathit{Ag}_i$] The world graph for $\mathit{Ag}_i$ over $L$ is a tuple $G_i = (V_i, v_i^I, \mathit{Mov}_i, T_i, \mathit{loc}_i)$ such that $V_i$ is the set of positions and $v_i^I \in V_i$ the initial position of $\mathit{Ag}_i$. $\mathit{Mov}_i$ is the set of movements (we use "movements" to avoid confusion with actions in PGs); the edges $T_i \subseteq V_i \times \mathit{Mov}_i \times V_i$ are the movement effects. The function $\mathit{loc}_i\colon V_i \to L$ maps each position to the corresponding location. The enabled movements of Agent $i$ in position $v$ are $\mathit{Mov}_i(v) = \{\mathit{mov} \in \mathit{Mov}_i \mid \exists v' \in V_i.\ (v, \mathit{mov}, v') \in T_i\}$.

For $\mathit{Ag}_0$ we need the possibility to restrict its viewing range. This is done by a function $\mathit{vis}\colon V_0 \to 2^L$ which assigns to each position of $\mathit{Ag}_0$ the set of visible locations. According to our assumptions, for all $v \in V_0$ it holds that $\mathit{loc}_0(v) \in \mathit{vis}(v)$.
Each $\mathit{Ag}_i$ with $1 \leq i \leq m$ has a randomized strategy $\sigma_i\colon V_0 \times V_i \to \mathit{Dist}(\mathit{Mov}_i)$, which maps the positions of $\mathit{Ag}_0$ and $\mathit{Ag}_i$ to a distribution over the enabled movements of $\mathit{Ag}_i$. The world graphs of all agents together with the randomized strategies of the opponents are subsumed by a single world POMDP. We first define the underlying world MDP modeling the possible behavior of all agents based on their associated world graphs.

[World MDP] Given world graphs $G_0, \dots, G_m$, the induced world MDP $M = (S, s_I, \mathit{Act}, \mathcal{P})$ is defined by $S = V_0 \times \cdots \times V_m \times \{0, \dots, m\}$, $s_I = (v_0^I, \dots, v_m^I, 0)$, and $\mathit{Act} = \mathit{Mov}_0 \uplus \{\alpha_1, \dots, \alpha_m\}$. $\mathcal{P}$ is defined by:

- For $\mathit{mov} \in \mathit{Mov}_0(v_0)$ with $(v_0, \mathit{mov}, v_0') \in T_0$, we have $\mathcal{P}((v_0, v_1, \dots, v_m, 0), \mathit{mov}) = \delta_{(v_0', v_1, \dots, v_m, 1)}$.
- $\mathcal{P}((v_0, \dots, v_i, \dots, v_m, i), \alpha_i)\big((v_0, \dots, v_i', \dots, v_m, (i+1) \bmod (m+1))\big) = \sigma_i(v_0, v_i)(\mathit{mov})$, with $(v_i, \mathit{mov}, v_i') \in T_i$ and $1 \leq i \leq m$.
- $\mathcal{P} = 0$ in all other cases.
The first item in the definition of $\mathcal{P}$ translates each movement in the world graph of $\mathit{Ag}_0$ into an action in the MDP that connects the corresponding states with probability one, i.e., a Dirac distribution is attached to each action. Upon taking the action, the position of $\mathit{Ag}_0$ changes and $\mathit{Ag}_1$ has to move next.
The second item defines the movements of the opponents. In each state where $\mathit{Ag}_i$ is moving next, a single action $\alpha_i$ reflecting this move is added. The outcome of $\alpha_i$ is determined by $\sigma_i$ and the fact that the next agent moves afterwards.

[World POMDP] Let $M$ be a world MDP for world graphs $G_0, \dots, G_m$. The world POMDP is $\mathcal{M} = (M, Z, O)$ with $Z = V_0 \times (L \cup \{\bot\})^m \times \{0, \dots, m\}$ and $O$ defined by $O((v_0, v_1, \dots, v_m, i)) = (v_0, o_1, \dots, o_m, i)$, where $o_j = \mathit{loc}_j(v_j)$ if $\mathit{loc}_j(v_j) \in \mathit{vis}(v_0)$ and $o_j = \bot$ otherwise.

Thus, the position of $\mathit{Ag}_j$ is observed iff the location of $\mathit{Ag}_j$ is visible from the position of $\mathit{Ag}_0$; otherwise a dummy value $\bot$, referred to as far away, is observed.
Given a set $L_{\mathit{Goal}} \subseteq L$ of goal locations, the mappings $\mathit{loc}_i$ are used to define the states corresponding to collisions and goal locations. In particular, we have $\mathit{Collision} = \{(v_0, \dots, v_m, i) \in S \mid \exists j \in \{1, \dots, m\}.\ \mathit{loc}_0(v_0) = \mathit{loc}_j(v_j)\}$ and $\mathit{Goals} = \{(v_0, \dots, v_m, i) \in S \mid \mathit{loc}_0(v_0) \in L_{\mathit{Goal}}\}$.
Formal problem statement
Given a world POMDP $\mathcal{M}$ for world graphs $G_0, \dots, G_m$, a set of collision states $\mathit{Collision}$, and a set of goal states $\mathit{Goals}$, an observation-based strategy $\sigma_z \in \Sigma_z^{\mathcal{M}}$ for $\mathit{Ag}_0$ is safe for $\lambda$ if $\mathbb{P}_{\geq \lambda}(\neg \mathit{Collision}\ \mathcal{U}\ \mathit{Goals})$ holds. We want to compute a safe strategy for a given $\lambda$.
III-C Abstraction
We propose an abstraction method for world POMDPs that builds on game-based abstraction (GBAR), originally defined for MDPs [10, 11].
GBAR for MDPs
For an MDP $M = (S, s_I, \mathit{Act}, \mathcal{P})$, we assume a partition $\mathcal{B}$ of $S$, i.e., a set of nonempty, pairwise disjoint subsets of $S$ (called blocks) with $\biguplus_{B \in \mathcal{B}} B = S$. GBAR takes the partition and turns each block $B \in \mathcal{B}$ into an abstract state; these blocks form the Player 1 states, i.e., $S_1 = \mathcal{B}$. To determine the outcome of selecting an action $a$ in a block $B$, we add intermediate selector states $(B, a)$ as Player 2 states. In the selector state $(B, a)$, the emanating actions reflect the choice of the actual state $s \in B$ the system is in when $a$ is executed. For taking action $a$ in $s$, the distribution $\mathcal{P}(s, a)$ is lifted to a distribution over abstract states: $\overline{\mathcal{P}}(s, a)(B') = \sum_{s' \in B'} \mathcal{P}(s, a)(s')$ for $B' \in \mathcal{B}$.
The semantics of this PG is as follows: In an abstract state $B$, Player 1 (the controllable player) selects an action $a$ to execute. In the selector state, Player 2 (the adversary) selects the worst-case concrete state from which the action is executed.
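A minimal sketch of this construction (our own dictionary-based encoding, not the paper's implementation; `block_of` maps each concrete state to its block):

```python
from collections import defaultdict

def lift(dist, block_of):
    """Lift a distribution over concrete states to one over blocks:
    the probability of a block is the sum over its member states."""
    out = defaultdict(float)
    for s, p in dist.items():
        out[block_of[s]] += p
    return dict(out)

def gbar(mdp_trans, block_of):
    """Game-based abstraction sketch: Player 1 picks an action in a block,
    Player 2 picks the concrete state the action is executed from.

    mdp_trans: dict (state, action) -> dict successor -> probability
    Returns (actions available per block, Player 2 transitions).
    """
    p1_actions, p2_trans = defaultdict(set), {}
    for (s, a), dist in mdp_trans.items():
        B = block_of[s]
        p1_actions[B].add(a)                          # action a available in B
        p2_trans[((B, a), s)] = lift(dist, block_of)  # adversary resolves s in B
    return dict(p1_actions), p2_trans
```

Player 2's "actions" in selector state `(B, a)` are exactly the concrete states `s` in `B`, mirroring the definition above.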
Applying GBAR to POMDPs
The key idea of GBAR for POMDPs is to merge states with equal observations. [Abstract PG] The abstract PG of POMDP $\mathcal{M} = (M, Z, O)$ is $G = (S', S_1, S_2, s_I', \mathit{Act}, \mathcal{P}')$ with $S_1 = \{\{s \in S \mid O(s) = z\} \mid z \in Z\}$, i.e., the blocks are the classes of states sharing the same observation, $S_2 = S_1 \times \mathit{Act}$, $S' = S_1 \uplus S_2$, and $s_I' = B_I$ such that $s_I \in B_I$.

The transition probabilities are defined as follows:

- $\mathcal{P}'(B, a) = \delta_{(B, a)}$ for $B \in S_1$ and $a \in \mathit{Act}(B)$,
- $\mathcal{P}'((B, a), s)(B') = \sum_{s' \in B'} \mathcal{P}(s, a)(s')$ for $(B, a) \in S_2$, $s \in B$, and $B' \in S_1$,
- $\mathcal{P}' = 0$ in all other cases.
By construction, Player 1 has to select the same action for all states in an abstract state. As the abstract states coincide with the observations, this means that we obtain an observation-based strategy for the POMDP. For the class of properties we consider, a memoryless deterministic strategy suffices for PGs to achieve the maximal probability of reaching a goal state without collision [31]. We thus obtain an optimal strategy $\sigma_1$ for Player 1 in the PG which maps every abstract state to an action. As abstract states are constructed such that they coincide with the possible observations in the POMDP (see Def. III-C), this means that $\sigma_1$ maps every observation to an action.
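Since abstract states coincide with observations, turning a memoryless deterministic Player 1 strategy into an observation-based POMDP strategy is a plain lookup on the last observation (illustrative sketch; all names are ours):

```python
def to_observation_based(pg_strategy, block_of_obs):
    """pg_strategy: dict abstract state (block) -> action.
    block_of_obs: dict observation -> abstract state (a bijection here).
    Returns a POMDP strategy that only consults the last observation
    of the observation sequence seen so far."""
    def sigma(observation_trace):
        return pg_strategy[block_of_obs[observation_trace[-1]]]
    return sigma
```

Because the lookup ignores everything but the final observation, any two paths with the same observation sequence get the same action, which is exactly the observation-based requirement from Sect. II-B.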
Abstract world PG
We now connect the abstraction to our setting. For ease of presentation, we assume in the rest of this section that there is only one opponent agent, i.e., we have $m = 1$ and the single opponent $\mathit{Ag}_1$. Therefore, if $\mathit{Ag}_0$ sees an agent and moves, no additional agent will appear. Moreover, $\mathit{Ag}_0$ either knows the exact state, or does not know where the opponent is.
First, we call the abstract PG of the world POMDP the abstract world PG. In particular, the abstract states in the world PG are either of the form $(v_0, \ell_1, i)$ with $\ell_1 \in L$ or of the form $(v_0, \bot, i)$. In the former, the opponent is visible and the agent has full knowledge; in the latter, only the own position is known. Recall that $\bot$ is a dummy value referred to as far away. Furthermore, all states in an abstract state correspond to the same position of $\mathit{Ag}_0$. For abstract states with full knowledge, there is no nondeterminism of Player 2 involved, as these states correspond to a single state in the world POMDP.
Correctness
We show that a safe strategy for Player 1 induces a safe strategy for $\mathit{Ag}_0$. Consider therefore a path in the PG; this path is projected to the sequence of blocks it visits. The location of $\mathit{Ag}_0$ encoded in the blocks is independent of the choices made by Player 2. The sequence of actions thus yields a unique path of positions of $\mathit{Ag}_0$ in its world graph. Hence, if the path in the PG reaches a goal state, it induces a path in the POMDP which also reaches a goal state. Moreover, the worst-case behavior overapproximates the probability for the opponent to be in any location, and any collision is observable. Thus, if there is a collision in the POMDP, then there is also a collision in the PG.
Formally, for a deterministic memoryless strategy $\sigma_1\colon S_1 \to \mathit{Act}$ in the abstract world PG, the corresponding strategy in the POMDP is defined as $\sigma_z(\pi) = \sigma_1(B)$ for $\mathit{last}(\pi) \in B$.
Theorem 1
Given a safe strategy in an abstract world PG, the corresponding strategy in the world POMDP is safe.
The assessment of the strategy is conservative: a safe strategy for $\lambda$ in the abstract world PG may induce a corresponding strategy in the POMDP which is safe for some $\lambda' \geq \lambda$. In particular, applying the corresponding strategy to the original POMDP yields a discrete-time Markov chain (MC). This MC can be efficiently analyzed by probabilistic model checking to determine the value of $\lambda'$. Naturally, the optimal scheduler obtained for the PG need not be optimal in the POMDP.
All positions where $\mathit{Ag}_1$ is visible yield Dirac distributions in the belief MDP; the successor states in the belief MDP depend solely on the action choice. These beliefs are represented as single states in the abstract world PG. The abstraction lumps, for each position of $\mathit{Ag}_0$, all (uncountably many) other belief states together.
III-D Refinement of the PG
In the GBAR approach described above, we remove information that is relevant for an optimal strategy. In particular, the behavior of $\mathit{Ag}_1$ (the opponent) is strengthened (overapproximated):

- We abstract probabilistic movements of $\mathit{Ag}_1$ outside of the visible area into nondeterminism.
- We allow jumps in $\mathit{Ag}_1$'s movements, i.e., $\mathit{Ag}_1$ may change position in the PG in a way that is impossible in the POMDP; such movements are called spurious.
If, due to the lack of this information, no safe strategy can be found, the abstraction needs to be refined. In GBAR for MDPs [11], abstract states are split heuristically, yielding a finer overapproximation. In our construction, we cannot split abstract states arbitrarily: this would destroy the one-to-one correspondence between abstract states and observations. We would thus obtain a partially observable PG; equivalently, for a strategy in the PG, the corresponding strategy in the original POMDP would no longer be observation-based.
However, we can restrict the spurious movements of $\mathit{Ag}_1$ by taking the history of observations made along a path into account. We present three types of history-based refinements.
One-step history refinement
If $\mathit{Ag}_0$ moves to a state from where $\mathit{Ag}_1$ is no longer visible, the far-away value $\bot$ is observed, so in the abstraction $\mathit{Ag}_1$ could subsequently appear anywhere. However, until $\mathit{Ag}_1$ moves, the belief MDP is still in a Dirac distribution; the positions where $\mathit{Ag}_1$ can appear are thus restricted. Similarly, if $\mathit{Ag}_1$ disappears upon a turn of $\mathit{Ag}_0$, then after turning back in the same direction, $\mathit{Ag}_1$ will be visible again. The (one-step history) refined world PG extends the original PG by additional states $(v_0, \ell_1, i)$ where $\ell_1$ is not visible from $v_0$. These "far away" states are only reached from states with full information. Intuitively, although $\mathit{Ag}_1$ is invisible, its position is remembered for one step.
Multi-step history refinement
Further refinement is possible by considering longer paths. If we first observe $\mathit{Ag}_1$ at location $\ell$, then lose visibility for one turn, and then observe $\mathit{Ag}_1$ again at location $\ell'$, then we know that either $\ell$ and $\ell'$ are at most two moves apart or that such a movement is spurious. To encode the observational history into the states of the abstraction, we store the last known position of $\mathit{Ag}_1$ as well as the number $k$ of moves made since then. We then only allow $\mathit{Ag}_1$ to appear in positions which are at most $k$ moves away from the last known position. We can cap $k$ by the diameter of the graph.
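The positions where $\mathit{Ag}_1$ may reappear after $k$ unobserved moves can be computed by a bounded breadth-first expansion over its world graph (a sketch under our own naming; `graph` maps a position to its movement successors):

```python
def reachable_within(graph, last_known, k):
    """Positions reachable from `last_known` in at most k moves.

    graph: dict position -> list of successor positions (movement effects)
    """
    seen, frontier = {last_known}, {last_known}
    for _ in range(k):
        # one opponent move: expand the frontier by all successors
        frontier = {t for s in frontier for t in graph[s]}
        seen |= frontier
    return seen
```

Capping `k` by the graph diameter, as the text suggests, is sound because after that many moves the set no longer grows.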
Region-based multi-step history refinement
As the refinement above blows up the state space drastically, we utilize a technique akin to magnifying-lens abstraction [33]. Instead of single locations, we define regions of locations together with the information whether $\mathit{Ag}_1$ could be present. After each move, we extend the set of possible regions by all neighboring regions.
More formally, the (multi-step history) refined world PG uses refined far-away values: given a partition of the positions of $\mathit{Ag}_1$, extracted from the graph structure, into regions $R_1, \dots, R_n$ with $\biguplus_j R_j = V_1$, abstract states now are either of the form $(v_0, \ell_1, i)$ as before, or of the form $(v_0, (b_1, \dots, b_n), i)$, where each flag $b_j$ records whether $\mathit{Ag}_1$ could currently be in region $R_j$. For singleton regions, this coincides with the method proposed above. Notice that this approach also offers some flexibility: if, for instance, two regions are connected only via the visible area, $\mathit{Ag}_0$ can be certain whether $\mathit{Ag}_1$ enters the other region.
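The region bookkeeping then reduces to one set operation per opponent move (an illustrative sketch; `region_neighbors` is an assumed adjacency map over the regions):

```python
def expand_regions(possible, region_neighbors):
    """Grow the set of regions that may contain the opponent by all
    neighboring regions, as done after each opponent move.

    possible: set of region ids that may currently contain the opponent
    region_neighbors: dict region id -> iterable of adjacent region ids
    """
    return set(possible) | {n for r in possible for n in region_neighbors[r]}
```

Tracking one boolean per region instead of one position per location is what keeps the refined state space manageable.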
Correctness
First, a deterministic memoryless strategy on a refined abstract world PG needs to be translated into a strategy for the original POMDP such that safety is preserved. Intuitively, as the proposed refinement steps encode history into the abstract world PG, the resulting strategy is no longer memoryless but has finite memory, bounded by the maximum number of moves that are tracked.
Theorem 2
A safe strategy in a refined abstract world PG has a corresponding safe strategy in the world POMDP.
The proposed refinements eliminate spurious movements of $\mathit{Ag}_1$ from the original abstract world PG. Intuitively, the number of states where Player 2 may select states with belief zero (in the underlying belief MDP) is reduced. We thus only prevent paths that have probability zero in the POMDP. Vice versa, the refinement does not restrict the movement of $\mathit{Ag}_0$, and any path leading to a goal state in the original abstraction still leads to one in the refinement. However, the behavior of $\mathit{Ag}_1$ is restricted; therefore, the probability of a collision drops. Intuitively, for the refined PG, strategies can be computed that are at least as good as for the original PG.
Theorem 3
If an abstract world PG has a safe strategy for $\lambda$, then its refined abstract world PG has a safe strategy for some $\lambda' \geq \lambda$.
III-E Refinement of the Graph
The proposed approach cannot solve every scenario, as the problem is undecidable [12]. Therefore, if the method fails to find a safe scheduler, we do not know whether such a scheduler exists. With increased visibility, however, the maximal level of safety does not decrease, neither in the POMDP nor in the PG. To determine good spots for increased visibility, we can use the analysis results: locations in which a collision occurs are likely good candidates.
IV. Case Study and Implementation
IV-A Description
For our experiments, we chose the following scenario: a (controllable) Robot R and a Vacuum Cleaner VC move around in a two-dimensional grid world with static opaque obstacles. Neither R nor VC may leave the grid or visit grid cells occupied by a static obstacle. The position of R consists of its cell $c$ (the location) and a wind direction. R can move one grid cell forward, or turn in either direction without changing its location. The position of VC is determined solely by its cell $c'$. In each step, VC can move one cell in any wind direction. We assume that VC moves to each available successor cell with equal probability.
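The enabled movements of R can be sketched as follows (illustrative only, not the toolchain's code; we assume the four wind directions and 90° in-place turns):

```python
DIRS = {'N': (0, 1), 'E': (1, 0), 'S': (0, -1), 'W': (-1, 0)}
LEFT = {'N': 'W', 'W': 'S', 'S': 'E', 'E': 'N'}   # 90° left turns
RIGHT = {v: k for k, v in LEFT.items()}            # inverse: 90° right turns

def robot_moves(pos, width, height, obstacles):
    """Successor positions of R from ((x, y), d): turn left or right in
    place, or move one cell forward if the target cell is inside the grid
    and not occupied by a static obstacle."""
    (x, y), d = pos
    moves = [((x, y), LEFT[d]), ((x, y), RIGHT[d])]
    dx, dy = DIRS[d]
    nx, ny = x + dx, y + dy
    if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in obstacles:
        moves.append(((nx, ny), d))
    return moves
```

Note that turning is always possible, so R's world graph is deadlock-free, as the problem statement requires.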
The sensors on R only sense VC within a viewing range around $c$. More precisely, VC is visible iff it is within this range and there is no grid cell with a static obstacle on the straight line from $c$'s center to $c'$'s center. That means R can observe the position of VC if VC is in the viewing range and not hidden behind an obstacle. A refinement of the world is realized by adding additional cameras, which make cells visible independently of the location of R.
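A possible implementation of the visibility test (a sketch under our own assumptions: Chebyshev distance for the range check and a sampled straight-line segment for the occlusion check; the actual toolchain precomputes this relation into a lookup table):

```python
def visible(c_r, c_vc, obstacles, view_range):
    """True iff cell c_vc is visible from cell c_r: within view_range
    (Chebyshev distance, an assumption) and with no static obstacle on
    the sampled straight line between the two cell centers."""
    (x0, y0), (x1, y1) = c_r, c_vc
    if max(abs(x1 - x0), abs(y1 - y0)) > view_range:
        return False
    steps = 2 * max(abs(x1 - x0), abs(y1 - y0), 1)
    for i in range(steps + 1):           # sample points along the segment
        t = i / steps
        cell = (round(x0 + t * (x1 - x0)), round(y0 + t * (y1 - y0)))
        if cell in obstacles and cell not in (c_r, c_vc):
            return False
    return True
```

Precomputing `visible` for all cell pairs once, as Sect. IV-B describes, avoids re-running the line test during model construction.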
IV-B Tool Chain
To synthesize strategies for the scenario described above, we implemented a toolchain in Python. The input consists of the grid with the locations of all obstacles, the locations of cameras, and the viewing range. As output, two files are created: a PG formulation of the abstraction including one-step history refinement, to be analyzed using [9], and the original POMDP for [13]. For multi-step history refinement, additional regions can be defined.
The encoding of the PG contains a precomputed lookup table for the visibility relation. The PG is described by two parallel processes running interleaved: one for Player 1 and one for Player 2. As only R makes controllable choices, its moves are listed as Player 1 actions, while VC's moves are stored as Player 2 actions. More precisely, the process for R contains its position, and the process for VC contains either its position or a far-away value. Player 1 first makes its decision; afterwards, the outcome of the move and the outcomes of the subsequent move of VC are compressed into one Player 2 step.
V. Experiments
V-A Experimental Setup
All experiments were run on a machine with a 3.6 GHz Intel Core i7-4790 CPU and 16 GB RAM, running Ubuntu Linux 16.04. We denote experiments taking over 5400 s CPU time as timeout and those taking over 10 GB memory as memout (MO). We considered several variants of the scenario described in Sect. IV-A. The Robot always starts in the upper-left corner and has the lower-right corner as target; the VC starts in the lower-right corner. The view range was identical in all variants. We evaluated the following five scenarios:

SC1: Rooms of varying size without obstacles.
SC2: Differently sized rooms with a cross-shaped obstacle in the center, which scales with increasing grid size.
SC3: A room with up to randomly placed obstacles.
SC4: Two rooms as depicted in Fig. 2. The doorway connecting the two rooms is a potential point of failure, as R cannot see to the other side. To improve reachability, we added cameras to improve visibility.
SC5: Corridors: long, narrow grids that the Robot has to traverse from top to bottom, passing the VC on its way down.
Table I (columns 2-7: POMDP solution; columns 8-13: PG solution; column 14: MDP):

| Grid size | States | Choices | Trans. | Result | Model Time | Sol. Time | States | Choices | Trans. | Result | Model Time | Sol. Time | Result |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | 299 | 515 | 739 | 0.8323 | 0.063 | 0.26 | 400 | 645 | 1053 | 0.8323 | 0.142 | 0.036 | 0.8323 |
| | 983 | 1778 | 2705 | 0.9556 | 0.099 | 1.81 | 1348 | 2198 | 3897 | 0.9556 | 0.353 | 0.080 | 0.9556 |
| | 2835 | 5207 | 8148 | 0.9882 | 0.144 | 175.94 | 6124 | 10700 | 19248 | 0.9740 | 0.188 | 0.649 | 0.9882 |
| | 4390 | 8126 | 12890 | 0.9945 | 0.228 | 4215.056 | 8058 | 14383 | 26079 | 0.9785 | 0.242 | 0.518 | 0.9945 |
| | 6705 | 20086 | 12501 | ?? | 0.377 | – MO – | 10592 | 19286 | 35226 | 0.9830 | 0.322 | 1.872 | 0.9970 |
| | 24893 | 47413 | 78338 | ?? | 1.735 | – MO – | 23128 | 81090 | 43790 | 0.9897 | 0.527 | 6.349 | 0.9998 |
| | 66297 | 127829 | 214094 | ?? | 9.086 | – MO – | 40464 | 145482 | 78054 | 0.9914 | 0.904 | 6.882 | 0.9999 |
| | – Time out during model construction – | | | | | | 199144 | 745362 | 395774 | 0.9921 | 8.580 | 122.835 | 0.9999 |
| | – Time out during model construction – | | | | | | 477824 | 1808442 | 957494 | 0.9921 | 41.766 | 303.250 | 0.9999 |
| | – Time out during model construction – | | | | | | 876504 | 3334722 | 1763214 | 0.9921 | 125.737 | 1480.907 | 0.9999 |
| | – Time out during model construction – | | | | | | 1395184 | 5324202 | 2812934 | 0.9921 | 280.079 | 3129.577 | – MO – |
V-B Results
Table I shows a direct comparison between the POMDP description and the abstraction for SC1. The first column gives the grid size. Then, first for the POMDP and afterwards for the PG, the table lists the number of states, nondeterministic choices, and transitions of the model. The results include the safety probability induced by the optimal scheduler ("Result"), the run time (all times in seconds) for constructing the state space from the symbolic description ("Model Time"), and the time to solve the POMDP / PG ("Sol. Time"). The last column shows the safety probability as computed on the fully observable MDP; it is an upper bound on the probability achievable for each grid. Note that optimal schedulers for this MDP are in general not observation-based and therefore not admissible for the POMDP. The time for creating the input files was negligible in all cases.
Table II lists data for the PG constructed from SC2 (first block of rows) and SC5 (without additional refinement in the second block, with region-based multi-step history refinement in the third block), analogous to Table I. Additionally, the run time for creating the symbolic description is given ("Create"). On the fully observable MDP, the resulting probability is 1.0 for all SC2 instances and 0.999 for all SC5 instances.
Table III shows the results for SC3. The first column ("#O") gives the number of obstacles; the remaining entries are analogous to Table II. The data for SC4 is shown in Table IV. Its structure is identical to that of Table III, with the first column ("#C") giving the number of cameras added for the graph refinement as in Sect. III-E.
Table II (PG sizes and run times, all times in seconds):

| Scenario | Grid | States | Choices | Trans. | Result | Create | Model | Solve |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SC2 | | 36084 | 66942 | 120480 | 0.9920 | 0.08 | 3 | 24 |
| SC2 | | 173584 | 331482 | 618148 | 0.9972 | 1.19 | 41 | 103 |
| SC2 | | 431044 | 834242 | 1572948 | 0.9977 | 7.62 | 231 | 312 |
| SC2 | | 808504 | 1575402 | 2985348 | 0.9978 | 31.92 | 1220 | 805 |
| SC5 | | 50880 | 93734 | 170974 | 0.9228 | 0.01 | 1.4 | 17 |
| SC5 | | 77560 | 143254 | 261534 | 0.8923 | 0.01 | 2.8 | 64 |
| SC5 | | 104240 | 192774 | 352094 | 0.8628 | 0.01 | 5.2 | 110 |
| SC5 | | 130920 | 242294 | 442654 | 0.8343 | 0.02 | 6.9 | 157 |
| SC5 + ref. | | 55300 | 120848 | 198088 | 0.9799 | 0.01 | 25.2 | 38 |
| SC5 + ref. | | 83820 | 182368 | 300648 | 0.9799 | 0.01 | 42.6 | 177 |
| SC5 + ref. | | 112340 | 243888 | 403208 | 0.9799 | 0.01 | 74.2 | 191 |
| SC5 + ref. | | 140860 | 305408 | 505768 | 0.9799 | 0.02 | 117.5 | 629 |
Table III (PG, run times in seconds, MDP upper bound):

| #O | States | Choices | Trans. | Result | Create | Model | Solve | MDP Result |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 297686 | 581135 | 1093201 | 0.9976 | 2.10 | 89.7 | 285.0 | 0.9999 |
| 40 | 234012 | 454652 | 823410 | 0.9706 | 2.74 | 87.3 | 179.1 | 0.9999 |
| 60 | 198927 | 385803 | 679321 | 0.6476 | 3.12 | 59.4 | 201.5 | 0.9999 |
| 70 | 187515 | 363401 | 633884 | 0.6210 | 3.30 | 59.4 | 116.1 | 0.9896 |
Table IV (PG, run times in seconds, MDP upper bound):

| #C | States | Choices | Trans. | Result | Create | Model | Solve | MDP Result |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| none | 76768 | 145562 | 271152 | 0.5127 | 0.22 | 7.9 | 23.5 | 0.9999 |
| 2 | 152920 | 291866 | 546719 | 0.9978 | 0.24 | 16.9 | 68.1 | – |
V-C Evaluation
Consider SC1: while for very small examples the POMDP model checker [13] delivers results within reasonable time, larger grids already yield a memout. Our abstraction, on the other hand, handles much larger grids within minutes, while still providing schedulers with a solid performance. The safety probability is lower for small grids, as there is less room for R to avoid VC, and there are proportionally more situations in which R is trapped in a corner or against a wall. Notice that the MDP state space grows with the fourth power of the grid dimension, whereas the PG state space grows only quadratically in the grid dimension, with a factor depending on the viewing range. As a consequence, no upper bound could be computed for the largest grid, as constructing the MDP state space yielded a memout.
In Table II, for the SC5 benchmarks, we see that the safety probability decreases for grids with a longer corridor. This is because, in the abstraction, the Robot can meet the VC multiple times while traveling down the corridor. To avoid this unrealistic behavior, we used the region-based multi-step history refinement as described in Sect. III-D. Although we only track histories of one VC step in length, this is enough to keep the safety probability at a value much closer to the upper bound, regardless of the length of the corridor.
The second part of Table II indicates that the precomputation of the visibility lookup (see Sect. IV) eventually takes significant time for large grids with many obstacles, yet the model construction time increases at an even faster pace. In comparison with the first part, we see that adding obstacles decreases the number of reachable states and thus also reduces the number of choices and transitions. Eventually, model construction takes longer than the actual model checking procedure.
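A possible shape of such a visibility precomputation, assuming line-of-sight visibility along discrete (Bresenham) lines and a Chebyshev-distance viewing range; the function names are ours, not those of the actual implementation:

```python
from itertools import product

def line_cells(a, b):
    """Cells on the discrete line from a to b (Bresenham)."""
    (x0, y0), (x1, y1) = a, b
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        yield (x0, y0)
        if (x0, y0) == (x1, y1):
            return
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy

def visibility_lookup(n, obstacles, v):
    """For every free cell of an n x n grid: the set of free cells within
    viewing range v whose connecting line is not blocked by an obstacle."""
    free = [c for c in product(range(n), repeat=2) if c not in obstacles]
    vis = {}
    for a in free:
        vis[a] = {b for b in free
                  if max(abs(a[0] - b[0]), abs(a[1] - b[1])) <= v
                  and not any(c in obstacles for c in line_cells(a, b))}
    return vis
```

The lookup is quadratic in the number of free cells, which matches the observation that the precomputation eventually becomes noticeable for large grids with many obstacles.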
Table III indicates that the model checking time is not significantly influenced by the number of obstacles. Furthermore, we observe that the first few obstacles behave benevolently and only marginally influence the safety probability, while beyond a certain number of obstacles the probability dips significantly compared to the upper bound. This is because the added obstacles create blind spots in which the robot can no longer observe the movement of the VC.
VI Discussion
Game-based abstraction successfully prunes the state space of MDPs by merging similar states. By adding an adversary that assumes the worst-case state, a PG is obtained. In general, this turns the POMDP at hand into a partially observable PG, which remains intractable. However, splitting according to observational equivalence leads to a fully observable PG. PGs can be analyzed by black-box algorithms as implemented, e.g., in PRISM-games [9], which also returns an optimal scheduler. The strategy obtained on the PG can then be applied to the POMDP, which yields the actual (higher) safety level.
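The core of the abstraction step can be sketched as follows. This is a minimal illustration of the lumping described above, not the actual implementation; the POMDP encoding (`trans[s][a]` as lists of probability/successor pairs) and all names are our assumptions:

```python
from collections import defaultdict

def game_abstraction(states, actions, trans, obs):
    """Lump observation-equivalent states.  In the resulting game,
    player 1 (the agent) picks an action in a lumped state, and player 2
    picks the concrete state the action is executed in; successors are
    mapped back to their lumped states."""
    lumps = defaultdict(list)
    for s in states:
        lumps[obs(s)].append(s)
    game = {}
    for o, members in lumps.items():
        game[o] = {
            a: [[(p, obs(t)) for p, t in trans[s][a]] for s in members]
            for a in actions
        }
    return game
```

Each inner list is one player-2 alternative: resolving it pessimistically is what makes the PG value a lower bound on the safety probability achievable in the POMDP.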
In general, the abstraction can be too coarse; however, the examples above demonstrate that the game-based abstraction is not too coarse if one makes certain assumptions about the POMDP. These assumptions are often naturally fulfilled by motion planning scenarios.
The assumptions from Sect. III-A can be relaxed in several respects. Our method naturally extends to multiple opponents. We restricted the method to a single controllable agent, but if information is shared among multiple agents, the method applies to that setting as well; if information sharing is restricted, special care has to be taken to prevent information leakage. Richer classes of opponent behavior, including nondeterministic choices, are an important area for future research. This would lead to partially observable PGs, and game-based abstraction would yield three-player games. As two of the sources of nondeterminism are uncontrollable, both the opponents and the abstraction could be controlled by player 2, thus yielding a PG again.
Supporting a richer class of temporal specifications is another option: PRISM-games [9] supports a probabilistic variant of alternating-time logic, extended by rewards and trade-off analysis. Using the same abstraction technique as presented here, a larger class of properties can thus be analyzed. However, care has to be taken when combining invariants and reachability criteria arbitrarily, as they involve under- and over-approximations.
Our method can be generalized to POMDPs from other settings. We use the original problem statement on the graph only to motivate correctness. The abstraction itself can be lifted (as indicated by Def. III-C); for the refinement, however, a more refined correctness argument is necessary.
The proposed construction of the PG is straightforward and is currently realized without constructing the POMDP first. This simplifies the implementation of the refinement, but mapping the scheduler onto the POMDP is currently not supported. Improved tool support should thus yield better results (cf. the safety levels in Fig. 1) without changing the method.
VII Conclusion
We utilized the successful technique of game-based abstraction to synthesize strategies for a class of POMDPs. Experiments show that this approach is promising. In future work, we will lift our approach to a broader class of POMDPs and improve the refinement steps, including an automatic refinement loop.
References
[1] R. A. Howard, Dynamic Programming and Markov Processes, 1st ed. The MIT Press, 1960.
[2] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” Artificial Intelligence, vol. 101, no. 1, pp. 99–134, 1998.
[3] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. The MIT Press, 2005.
[4] T. Wongpiromsarn and E. Frazzoli, “Control of probabilistic systems under dynamic, partially known environments with temporal logic specifications.” IEEE, 2012, pp. 7644–7651.
[5] G. Shani, J. Pineau, and R. Kaplow, “A survey of point-based POMDP solvers,” Autonomous Agents and Multi-Agent Systems, vol. 27, no. 1, pp. 1–51, 2013.
[6] J.-P. Katoen, “The probabilistic model checking landscape.” ACM, 2016, pp. 31–45.
[7] M. Kwiatkowska, G. Norman, and D. Parker, “PRISM 4.0: Verification of probabilistic real-time systems,” vol. 6806, 2011, pp. 585–591.
[8] C. Dehnert, S. Junges, J.-P. Katoen, and M. Volk, “A Storm is coming: A modern probabilistic model checker,” CoRR, vol. abs/1702.04311, 2017.
[9] T. Chen, V. Forejt, M. Z. Kwiatkowska, D. Parker, and A. Simaitis, “PRISM-games: A model checker for stochastic multi-player games,” vol. 7795, 2013, pp. 185–191.
[10] M. Kattenbelt and M. Huth, “Verification and refutation of probabilistic specifications via games,” ser. LIPIcs, vol. 4. Schloss Dagstuhl, 2009, pp. 251–262.
[11] M. Kattenbelt, M. Kwiatkowska, G. Norman, and D. Parker, “A game-based abstraction-refinement framework for Markov decision processes,” vol. 36, no. 3, pp. 246–280, 2010.
[12] K. Chatterjee, M. Chmelík, and M. Tracol, “What is decidable about partially observable Markov decision processes with regular objectives,” vol. 82, no. 5, pp. 878–911, 2016.
[13] G. Norman, D. Parker, and X. Zou, “Verification and control of partially observable probabilistic systems,” Real-Time Systems, vol. 53, no. 3, pp. 354–402, 2017.
[14] S. Patil, G. Kahn, M. Laskey, J. Schulman, K. Goldberg, and P. Abbeel, “Scaling up Gaussian belief space planning through covariance-free trajectory optimization and automatic differentiation,” in Algorithmic Foundations of Robotics XI, ser. Springer Tracts in Advanced Robotics, vol. 107, 2014, pp. 515–533.
[15] B. Burns and O. Brock, “Sampling-based motion planning with sensing uncertainty.” IEEE, 2007, pp. 3313–3318.
[16] A. Bry and N. Roy, “Rapidly-exploring random belief trees for motion planning under uncertainty.” IEEE, 2011, pp. 723–730.
[17] C.-I. Vasile, K. Leahy, E. Cristofalo, A. Jones, M. Schwager, and C. Belta, “Control in belief space with temporal logic specifications,” 2016, pp. 7419–7424.
[18] K. Hauser, “Randomized belief-space replanning in partially observable continuous spaces,” in Algorithmic Foundations of Robotics IX, 2010, pp. 193–209.
[19] M. P. Vitus and C. J. Tomlin, “Closed-loop belief space planning for linear, Gaussian systems.” IEEE, 2011, pp. 2152–2159.
[20] D. K. Grady, M. Moll, and L. E. Kavraki, “Extending the applicability of POMDP solutions to robotic tasks,” IEEE Trans. Robotics, vol. 31, no. 4, pp. 948–961, 2015.
[21] L. de Alfaro, “The verification of probabilistic systems under memoryless partial-information policies is hard,” DTIC Document, Tech. Rep., 1999.
[22] K. Chatterjee, M. Chmelík, R. Gupta, and A. Kanodia, “Qualitative analysis of POMDPs with temporal logic specifications for robotics applications,” 2015, pp. 325–330.
[23] ——, “Optimal cost almost-sure reachability in POMDPs,” vol. 234, pp. 26–48, 2016.
[24] H. Yu and D. P. Bertsekas, “Discretized approximations for POMDP with average cost,” in UAI. AUAI Press, 2004, p. 519.
[25] S. Giro and M. N. Rabe, “Verification of partial-information probabilistic systems using counterexample-guided refinements,” vol. 7561, 2012, pp. 333–348.
[26] X. Zhang, B. Wu, and H. Lin, “Assume-guarantee reasoning framework for MDP-POMDP.” IEEE, 2016, pp. 795–800.
[27] ——, “Counterexample-guided abstraction refinement for POMDPs,” CoRR, vol. abs/1701.06209, 2017.
[28] K. Chatterjee and L. Doyen, “Partial-observation stochastic games: How to win when belief fails,” vol. 15, no. 2, pp. 16:1–16:44, 2014.
[29] M. Svorenová and M. Kwiatkowska, “Quantitative verification and strategy synthesis for stochastic games,” Eur. J. Control, vol. 30, pp. 15–30, 2016.
[30] C. Baier and J.-P. Katoen, Principles of Model Checking. MIT Press, 2008.
[31] A. Condon, “The complexity of stochastic games,” Inf. Comput., vol. 96, no. 2, pp. 203–224, 1992.
[32] S. M. Ross, Introduction to Stochastic Dynamic Programming. Academic Press, Inc., 1983.
[33] L. de Alfaro and P. Roy, “Magnifying-lens abstraction for Markov decision processes,” vol. 4590. Springer, 2007, pp. 325–338.