Planning in Stochastic Environments with Goal Uncertainty

10/18/2018 ∙ by Sandhya Saisubramanian, et al. ∙ University of Massachusetts Amherst

We present the Goal Uncertain Stochastic Shortest Path (GUSSP) problem --- a general framework to model stochastic environments with goal uncertainty. The model is an extension of the stochastic shortest path (SSP) framework to environments in which it is impossible to determine the exact goal states ahead of plan execution. GUSSPs introduce flexibility in goal specification by allowing a belief over possible goal configurations. The partial observability is restricted to goals, facilitating the reduction to an SSP. We formally define a GUSSP and discuss its theoretical properties. We then propose an admissible heuristic that reduces the planning time of FLARES --- a state-of-the-art probabilistic planner. We also propose a determinization approach for solving this class of problems. Finally, we present empirical results using a mobile robot and three other problem domains.

1 Introduction

The Stochastic Shortest Path (SSP) problem is a rich framework to model goal-driven problems that require sequential decision making under uncertainty [Bertsekas and Tsitsiklis1991]. The objective in an SSP is to devise a sequence of actions such that the expected cost of reaching a goal state from the start state is minimized. This requires exact specification of the model parameters including the goal states.

The requirement of precise goal specification limits the applicability of SSPs, since the exact goal states may be hard to identify ahead of plan execution in some real-world settings. We target problems in which the agent is aware of the goal conditions, but may have uncertainty about the states that satisfy the goal conditions. For example, consider a search and rescue domain (Figure 1), where the agent has to rescue people from a building [Kitano et al.1999, Pineda et al.2015]. While the number of victims and the map of the building may be provided to the agent, it is non-trivial to identify the exact victim locations ahead of plan execution. However, it is relatively straightforward to establish a belief distribution over possible victim locations based on sensor or historical data. Search and rescue is an instance of optimal search for stationary targets [Hansen2007, Stone, Royset, and Washburn2016, Bourgault, Furukawa, and Durrant-Whyte2003]. In this class of problems, the target's exact location is unknown to the agent, but the agent can fully observe its current location and whether the target is in the current location. The objective is to minimize the expected cost of reaching the target. Since goal states are a critical parameter for planning, reasoning under goal uncertainty requires an efficient formulation that can leverage the fully observable components to provide tractable solutions.

Figure 1: Example of a goal uncertain search and rescue problem. Red cells indicate potential victim locations, and the values denote the agent's belief. G denotes the true goal.

The Partially Observable Markov Decision Process (POMDP) [Kaelbling, Littman, and Cassandra1998] is a rich framework that can capture various forms of partial observability. However, solving POMDPs is considerably harder than solving SSPs [Papadimitriou and Tsitsiklis1987]. Partially observable SSPs (POSSPs) extend the SSP framework to partially observable settings with imperfect state information, offering a class of indefinite-horizon, undiscounted POMDPs that rely on state-based termination [Patek2001]. Other relevant POMDP variants are the Mixed Observable MDPs (MOMDPs) [Ong et al.2010], which model problems with both fully observable and partially observable state factors, and the Goal POMDPs [Bonet and Geffner2009], which are goal-based with no discounting. These models are solved using POMDP solvers and are difficult to solve optimally. They also suffer from limited scalability due to their computational complexity [Papadimitriou and Tsitsiklis1987].

We present the goal uncertain stochastic shortest path (GUSSP), a framework to model problems with imperfect goal information by allowing a probability distribution over possible goals. GUSSPs fit well with many real-world settings where it is easier and more realistic to maintain a belief over goal configurations than to predict the exact goals. The observation function in a GUSSP facilitates the reduction to an SSP, enabling the computation of tractable and optimal solutions. We specifically address the setting where the goals do not change over time, and we assume the existence of a unique observation that allows the agent to accurately identify a goal when it reaches one. We define the property of an order-$k$ policy, which helps characterize the complexity of policy execution. This measure bounds the maximum number of unique visits to states that provide information about the goal, before the agent discovers a true goal.

Our primary contributions are: (i) a formal definition of GUSSPs and an analysis of their theoretical properties; (ii) a domain-independent, admissible heuristic that can accelerate probabilistic planners, along with a determinization approach for solving GUSSPs; and (iii) an empirical evaluation of the model on three realistic domains in simulation and on a robot.

2 Background

A Stochastic Shortest Path (SSP) MDP is defined by the tuple $\langle S, A, T, C, s_0, S_G \rangle$, where $S$ is a finite set of states; $A$ is a finite set of actions; $T(s, a, s')$ is the transition function denoting the probability of reaching a state $s'$ by executing an action $a$ in state $s$; $C(s, a)$ is the cost function denoting the cost of executing action $a$ in state $s$; $s_0 \in S$ is the initial state; and $S_G \subseteq S$ is the set of absorbing goal states. The cost of an action is positive in all states except absorbing goal states, where it is zero. The objective in an SSP is to minimize the expected cost of reaching a goal state from the start state. It is assumed that there exists at least one proper policy, one that reaches a goal state from any state with probability 1. The optimal policy, $\pi^*$, can be extracted using the value function $V^*$ defined over the states:

$V^*(s) = \min_{a \in A} Q^*(s, a), \qquad Q^*(s, a) = C(s, a) + \sum_{s' \in S} T(s, a, s')\, V^*(s'),$

with $Q^*(s, a)$ denoting the optimal Q-value of the action $a$ in state $s$ in the SSP. While SSPs can be solved in polynomial time in the number of states, many problems of interest have a state space whose size is exponential in the number of variables describing the problem [Littman1997]. This complexity has led to the use of approximate methods that either ignore stochasticity or use a short-sighted labeling approach.
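For concreteness, the Bellman backup above can be realized with a minimal tabular value-iteration sketch (the data structures below are our own illustration; the paper itself relies on heuristic-search solvers such as LAO* and FLARES):

```python
def value_iteration(S, A, T, C, goals, eps=1e-6):
    """Tabular value iteration for an SSP (illustrative sketch).
    T[s][a]: list of (s_next, prob) pairs; C[s][a]: cost of action a in s;
    goals: the set of absorbing goal states S_G."""
    V = {s: 0.0 for s in S}
    while True:
        residual = 0.0
        for s in S:
            if s in goals:
                continue  # goal states are absorbing and cost-free
            best = min(C[s][a] + sum(p * V[sn] for sn, p in T[s][a]) for a in A)
            residual = max(residual, abs(best - V[s]))
            V[s] = best
        if residual < eps:
            return V  # greedy action extraction then gives the optimal policy
```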

3 Goal Uncertain Stochastic Shortest Path

A goal uncertain stochastic shortest path (GUSSP) problem is a generalized framework to model problems with goal uncertainty. A GUSSP is an SSP in which the agent may not initially know the exact set of goal states ($S_G$, which does not change over time), and instead can obtain information about the goals via observations.

Definition 1.

A goal uncertain stochastic shortest path problem is a tuple $\langle S, A, T, C, s_0, P_G, \Omega, O \rangle$, where

  • $\langle S, A, T, C, s_0, S_G \rangle$ denotes an underlying SSP with $S_G$ unknown to the agent;

  • $P_G \subseteq S$ is the set of potential goals such that $S_G \subseteq P_G$;

  • $\bar{S} = S \times \mathcal{G}$ is the set of states in the GUSSP, with $\mathcal{G} = 2^{P_G}$ denoting the set of possible goal configurations;

  • $\Omega$ is a finite set of observations corresponding to the goal configurations, $\Omega = \mathcal{G}$; and

  • $O$ is the observation function denoting the probability of receiving an observation $\omega \in \Omega$, given that action $a$ led to state $\bar{s}'$, with probability $O(\omega \mid \bar{s}', a)$.

Each state $\bar{s} \in \bar{S}$ is represented by $\langle s, G \rangle$, with $s \in S$ and $G \in \mathcal{G}$. GUSSPs have mixed observable state components, as $s$ is fully observable. Each $G$ represents a goal configuration (a set of states), thus permitting multiple true goals in the model. Every action in each state produces an observation $\omega \in \Omega$, which is a goal configuration that provides information about the true goals. The agent's belief about its current state is denoted by $b = \langle s, b_G \rangle$, with $b_G$ denoting the belief about $G$; that is, the belief over the goal configurations. The initial belief is denoted by $b_0 = \langle s_0, b_{G_0} \rangle$, where $s_0$ is the start state. SSPs are therefore a special type of GUSSP with a collapsed initial belief over the goals. The process terminates when the agent reaches a state $\langle s, G \rangle$ such that $s \in G$ and $b_G(G) = 1$. Figure 2 shows a part of the network representation for a GUSSP.

Figure 2: A dynamic Bayesian network describing a GUSSP.

As in (PO)SSP, we assume in a GUSSP: (1) the existence of a proper policy with finite cost, (2) all improper policies have infinite cost, and (3) termination is perfectly recognized. In this paper, we consider GUSSPs with state-based termination. However, the model allows for action-based termination as well [Hansen2007].

Observation Function

In a GUSSP, an observation function is characterized by two properties. First, to perfectly recognize termination, every potential goal is characterized by a unique belief-collapsing observation (one under which the belief over a state being a goal is either 1 or 0). That is, at potential goal states, if $s' \in P_G$, then $\forall a \in A$:

(1)

Second, the observation function is myopic, providing information only about the current state. This is motivated by real-world settings with limited-range sensors and by exploration and navigation approaches that acknowledge the perceptual limitations of robots [Biswas and Veloso2013]. Therefore, the nonpotential goal states provide no information about the true goals. The landmark states are special nonpotential goal states that provide accurate information about certain potential goals. Each landmark state provides observations about a subset of the potential goals, with a corresponding subset of observations. Therefore, the observation function at nonpotential goal states is, $\forall a \in A$:

(2)

The potential goals along with the landmark states are called informative states, since they provide information about the true goals through deterministic observations. Thus, our observation function satisfies the minimum information required for state-based termination. In the next section, we discuss a more general setting where every state may produce a noisy observation regarding the true goals.

Belief Update

A belief $b$ is a probability distribution over $\bar{S}$, with $\sum_{\bar{s} \in \bar{S}} b(\bar{s}) = 1$. The set of all reachable beliefs forms the belief space $B \subseteq \Delta^{|\bar{S}|}$, where $\Delta^{|\bar{S}|}$ is the standard $|\bar{S}|$-simplex. The agent updates its belief to $b'$, given the action $a$, an observation $\omega$, and the current belief $b$. Using the multiplication rule, the updated belief for a goal configuration $G$ is:

$b'_G(G) = \dfrac{O(\omega \mid \langle s', G \rangle, a)\, b_G(G)}{\eta},$   (3)

with $\eta$ a normalization constant and $b_G$ the belief over the goal configuration. Therefore,

$b' = \langle s', b'_G \rangle.$   (4)
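For concreteness, a minimal Python sketch of the update in Equation 3 (the data structures and helper names are our own illustration, not the paper's code):

```python
def update_goal_belief(b_G, obs_prob, omega, a, s_next):
    """Belief update over goal configurations, following Eq. 3 (sketch).
    b_G: dict mapping a goal configuration G -> probability;
    obs_prob(omega, s_next, G, a): observation likelihood O(omega | <s', G>, a)."""
    unnormalized = {G: obs_prob(omega, s_next, G, a) * p for G, p in b_G.items()}
    eta = sum(unnormalized.values())
    if eta == 0.0:
        raise ValueError("observation inconsistent with the current belief")
    return {G: p / eta for G, p in unnormalized.items()}
```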

Policy and Value

The agent's objective in a GUSSP is to minimize the expected cost of reaching a goal, $\mathbb{E}\big[\sum_{t=0}^{h} C(s_t, a_t)\big]$, where $s_t$ and $a_t$ denote the agent's state and action at time $t$ respectively, and $h$ denotes the horizon. A policy $\pi: B \rightarrow A$ is a mapping from a belief $b$ to an action $a$. The value function $V^{\pi}(b)$ for a belief $b$ is the expected cost of following a fixed policy $\pi$ for a horizon $h$. The Bellman optimality equation for GUSSPs follows from POMDPs:

$V^*(b) = \min_{a \in A} \Big[ C(b, a) + \sum_{\omega \in \Omega} P(\omega \mid b, a)\, V^*(b') \Big],$

where $b'$ is the updated belief following Equation 4, $C(b, a) = \sum_{G \in \mathcal{G}} b_G(G)\, C(\langle s, G \rangle, a)$ is the expected cost of executing $a$ in belief $b$, and $P(\omega \mid b, a)$ is the probability of receiving observation $\omega$ after executing $a$ in belief $b$. A proper policy, $\pi$, in a GUSSP is a policy that guarantees termination in a finite expected number of steps.

The number of potential goals with non-zero belief values indicates the degree of uncertainty over goals. The problem setting and the optimal policy determine when the belief values collapse to the true goals. When deploying robots in real-world settings with goal uncertainty, it is useful to understand the problem complexity for policy execution. We measure this by the maximum number of unique visits to informative states that may be required before a true goal is discovered by the agent. We consider unique visits since no new information is obtained thereafter. For example, consider a search and rescue domain in which the agent searches for victims in a corridor, with the start state at one end followed by a series of potential goals. If the first potential goal location is a true goal, then the agent visits only one potential goal before the true goal is discovered, following the optimal policy. This property is especially beneficial in environments with landmark states that reveal the true goals, thus minimizing the need to visit the potential goals specifically to determine the true goals.

Definition 2.

A GUSSP policy $\pi$ is of order-$k$ if there are at most $k$ unique visits to informative states before a true goal is reached following $\pi$.

For state-based termination, $k \geq 1$, since the agent must visit at least one informative state (a true goal) to recognize termination. We illustrate this property in our experiments on a robot, using optimal policies corresponding to different initial beliefs.

4 Theoretical Analysis

In a GUSSP, the observation function critically affects the number of reachable beliefs. We begin by analyzing how the number of beliefs may grow in the more general (non-myopic observation) setting and then show that a GUSSP with myopic observations has finitely many reachable beliefs.

In a GUSSP with non-myopic observations, the nonpotential goal states may provide stochastic observations about the true goals, resulting in infinitely many reachable beliefs. While this observation is straightforward, it is useful for understanding how the complexity of the problem grows, and it provides an important link to POMDPs via the belief MDP. The following proposition formalizes this complexity.

Proposition 1.

For any horizon $h$, the belief MDP of a GUSSP with non-myopic observations may have $O((|A||\Omega|)^h)$ states.

Proof Sketch.

By construction, we map this GUSSP to a belief MDP with a horizon $h$ [Kaelbling, Littman, and Cassandra1998]. The set of states in the MDP is the set of reachable beliefs from $b_0$ in the GUSSP. The set of actions of the GUSSP is retained in the MDP. The cost function is $C(b, a) = \sum_{G \in \mathcal{G}} b_G(G)\, C(\langle s, G \rangle, a)$, where $C(\langle s, G \rangle, a)$ corresponds to the cost function of the GUSSP. The transition function for the belief MDP, denoted by $\tau(b, a, b')$, is the probability of executing action $a$ in belief state $b$ and reaching the belief $b'$:

$\tau(b, a, b') = \sum_{\omega \in \Omega} P(b' \mid b, a, \omega)\, P(\omega \mid b, a),$

with the Iversen bracket $P(b' \mid b, a, \omega) = [\, b' = b^{a,\omega} \,]$ and $b^{a,\omega}$ denoting the updated belief calculated using Equation 4, after executing action $a$ and receiving observation $\omega$. The probability of receiving $\omega$ is:

$P(\omega \mid b, a) = \sum_{\bar{s}' \in \bar{S}} O(\omega \mid \bar{s}', a) \sum_{\bar{s} \in \bar{S}} T(\bar{s}, a, \bar{s}')\, b(\bar{s}),$

with $\bar{s} = \langle s, G \rangle$ and $\bar{s}' = \langle s', G \rangle$. Since $\Omega$ in the GUSSP is finite, a finite set of reachable beliefs in the GUSSP results in a finite set of reachable states in the belief MDP. The reachable beliefs form a tree of depth $h$ with internal nodes for decisions and transitions, and the branching factor at each step of the horizon is $|A||\Omega|$ [Papadimitriou and Tsitsiklis1987]. Therefore, the total number of reachable beliefs in the GUSSP is $O((|A||\Omega|)^h)$, and thus the resulting belief MDP may have $O((|A||\Omega|)^h)$ distinct reachable states. ∎
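A small Python sketch of the $P(\omega \mid b, a)$ term above (the data structures are our own illustration):

```python
def observation_prob(belief, a, T, O, omega):
    """P(omega | b, a): probability of receiving observation omega after
    executing action a in belief b (sketch with illustrative structures).
    belief: dict mapping (s, G) -> probability; T[s][a]: list of (s_next, prob);
    O(omega, (s_next, G), a): observation likelihood."""
    total = 0.0
    for (s, G), p in belief.items():
        for s_next, pt in T[s][a]:
            total += p * pt * O(omega, (s_next, G), a)
    return total
```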

In the worst case, the observation function may be unconstrained and all the reachable beliefs may be unique. Since there is no discounting in a GUSSP and the horizon is unknown a priori, GUSSPs may have infinitely many beliefs, and the problem may be undecidable in the worst case [Madani, Hanks, and Condon1999]. Hence, optimally solving GUSSPs in the non-myopic observation setting is computationally intractable.

We now prove that a myopic observation function results in a finite number of reachable beliefs in a GUSSP.

Proposition 2.

A GUSSP with myopic observation function has a finite number of reachable beliefs.

Proof.

By definition, a myopic observation function produces either belief-collapsing observations or no information at all. For each case, we calculate the updated belief for the goal configurations using Equation 3, with $\eta$ the normalization constant.

Case 1: Belief-collapsing observation. Trivially, when $O(\omega \mid \langle s', G \rangle, a) = 0$, the updated belief is $b'_G(G) = 0$. When $O(\omega \mid \langle s', G \rangle, a) = 1$, the updated belief is $b'_G(G) = b_G(G) / \eta$, renormalized over the configurations consistent with $\omega$. Case 2: No information. When the observation provides no information, $O(\omega \mid \langle s', G \rangle, a)$ is identical for every configuration $G$. Then,

$b'_G(G) = \dfrac{O(\omega \mid \langle s', G \rangle, a)\, b_G(G)}{\sum_{G'} O(\omega \mid \langle s', G' \rangle, a)\, b_G(G')} = b_G(G).$

Thus, for every observation, a myopic observation function either produces a collapsed belief or retains the same belief, resulting in a finite number of reachable beliefs for a goal configuration. Since $S$ is finite, the belief update following Equation 4 results in a finite number of reachable beliefs for a GUSSP. ∎

Hence, a myopic observation function weakly monotonically collapses beliefs. This allows us to simplify the problem further. In the rest of the paper, we will refer to a GUSSP with myopic observations simply as GUSSP. The following proposition shows that a GUSSP reduces to an SSP, along the same lines as the mapping from a POMDP to belief-MDP [Kaelbling, Littman, and Cassandra1998].

Proposition 3.

A GUSSP reduces to an SSP.

Proof Sketch.

We map the GUSSP to a belief MDP with a horizon $h$ [Kaelbling, Littman, and Cassandra1998], as in Proposition 1. By Proposition 2, a GUSSP with a myopic observation function has a finite number of reachable beliefs and therefore a finite number of states in the belief MDP. By construction, this belief MDP is an SSP with start state $b_0$ and goal states given by the belief states $\langle s, b_G \rangle$ such that $s \in G$ for a configuration $G$ with $b_G(G) = 1$. Since there exists a proper policy in the GUSSP, a proper policy exists in this SSP by construction. Thus, a GUSSP with a myopic observation function reduces to an SSP. ∎

The reduction to an SSP facilitates solving GUSSPs using the existing rich suite of SSP algorithms. For ease of reference and clarity, we refer to the above-mentioned SSP as the compiled-SSP in the rest of this paper.

The order-$k$ of a policy $\pi$ for a GUSSP (compiled-SSP) can be calculated using a directed graph constructed from $\pi$. We now show that computing order-$k$ takes polynomial time.

Proposition 4.

The worst-case complexity of computing order-$k$ for a policy $\pi$ is $O(|P_G| (|V| + |E|))$, where $V$ and $E$ denote the vertices and edges of the corresponding directed graph.

Proof Sketch.

To calculate order-$k$ for a policy $\pi$, we construct a directed graph $\mathcal{D} = (V, E)$ from $\pi$ such that the vertices $V$ are the start state and the informative states, and the trajectories between them under $\pi$ are the edges $E$. We begin by setting each potential goal, in turn, to be the true goal. We introduce additional (artificial) edges from the true goal to the informative states. Then, we compute the strongly connected components using depth-first search, which takes $O(|V| + |E|)$, and condense them to form a directed acyclic graph $\mathcal{D}'$. We start from the true goal in $\mathcal{D}'$ and traverse backwards. The value of the true goal is initialized to 1 and propagated to its (unvisited) neighbors. At each vertex, the value is increased to the sum of the number of informative states in the condensed vertex and the incoming value from the neighbor. This continues until all vertices in $\mathcal{D}'$ have been visited, and the start state is updated with the maximum value. This process is repeated with every potential goal as the true goal, and the overall maximum is the order of the policy. Thus, the worst-case complexity is $O(|P_G| (|V| + |E|))$. ∎
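One possible realization of this graph computation, using networkx for the condensation step (an illustrative sketch under our reading of the proof, not the authors' implementation):

```python
import networkx as nx

def order_k(graph, start, potential_goals, informative):
    """Sketch of the order-k computation outlined in Proposition 4.
    graph: nx.DiGraph whose vertices are the start state and the informative
    states, and whose edges are policy trajectories between them."""
    informative = set(informative)
    overall = 0
    for true_goal in potential_goals:
        g = graph.copy()
        # artificial edges from the assumed true goal back to informative states
        g.add_edges_from((true_goal, x) for x in informative if x != true_goal)
        cond = nx.condensation(g)                  # SCCs condensed into a DAG
        comp_of = cond.graph["mapping"]
        # weight of a condensed vertex = number of informative states it contains
        weight = {c: len(set(cond.nodes[c]["members"]) & informative)
                  for c in cond.nodes}
        # informative-state-weighted longest path from the start to the true goal
        best = {c: float("-inf") for c in cond.nodes}
        best[comp_of[start]] = weight[comp_of[start]]
        for c in nx.topological_sort(cond):
            for d in cond.successors(c):
                best[d] = max(best[d], best[c] + weight[d])
        overall = max(overall, best[comp_of[true_goal]])
    return overall
```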

Relation to Goal-POMDPs  The Goal-POMDP [Bonet and Geffner2009] models a class of goal-based and shortest-path POMDPs with positive action costs and no discounting. Let $\langle S, A, T, C, \Omega, O \rangle$ be a Goal-POMDP with a finite set of states $S$ that are partially observable; $A$ is a finite set of actions; $T$ denotes the transition function; $C$ denotes the cost function; $\Omega$ is the set of observations; and $O$ denotes the observation function. The set of target (or goal) states, $S_G$, have unique belief-collapsing observations. The target beliefs or goals are beliefs $b$ such that $\sum_{s \in S_G} b(s) = 1$. Hence, a Goal-POMDP is a GUSSP when the partial observability is restricted to goals, the observation set is $\Omega = \mathcal{G}$, and the observation function $O$ is myopic.

Proposition 5.

GUSSP $\subset$ Goal-POMDP.

The observations in a Goal-POMDP are not constrained and may result in infinitely many reachable beliefs (Proposition 1). This makes it computationally challenging to compute optimal policies [Papadimitriou and Tsitsiklis1987].

GUSSP with Deterministic Transitions

A GUSSP with deterministic transitions presents an opportunity for further reduction in complexity. We show that the optimal policy in this case corresponds to the minimum spanning arborescence of a corresponding weighted, directed graph.

Proposition 6.

The optimal policy for a GUSSP with myopic observations and deterministic transitions corresponds to the arborescence of a weighted and directed graph $\mathcal{D} = (V, E)$.

Proof Sketch.

Consider a GUSSP with deterministic transitions and a dummy start state $v_0$ that transitions to the actual start state with probability 1 and zero cost. This can be represented as a directed and weighted graph $\mathcal{D} = (V, E)$ such that $V = \{v_0\} \cup P_G$; that is, the start state and the potential goals are the vertices. Each edge denotes a trajectory in the GUSSP between vertices. The existence of a proper policy in the GUSSP ensures that there is at least one edge between each pair of vertices. The weight $w_{ij}$ of an edge connecting $v_i$ and $v_j$ combines $c_{ij}$, the cost of the trajectory, and $b(v_j)$, the belief over $v_j$ being a goal. The arborescence (directed minimum spanning tree) of this graph, rooted at $v_0$, contains trajectories such that the total weight is minimized. By construction, this gives the optimal order of visiting the potential goals and hence the optimal policy for the GUSSP with deterministic transitions. ∎
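A sketch of this construction using Edmonds' algorithm in networkx; the exact edge-weight combination is not reproduced above, so it is left as a parameter of the sketch (an assumption, not the paper's definition):

```python
import networkx as nx

def visit_plan_arborescence(start, potential_goals, traj_cost, belief,
                            combine=lambda c, b: c * b):
    """Sketch of the construction in Proposition 6 (illustrative only).
    traj_cost(u, v): cost of the deterministic trajectory from u to v;
    belief[v]: belief that v is a true goal; `combine` is an assumed weighting."""
    g = nx.DiGraph()
    for u in [start] + list(potential_goals):
        for v in potential_goals:
            if u != v:
                g.add_edge(u, v, weight=combine(traj_cost(u, v), belief[v]))
    # Edmonds' algorithm; the result is rooted at `start` (the only source node)
    return nx.minimum_spanning_arborescence(g)
```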

5 Solving Compiled-SSPs

We propose an admissible heuristic for SSP solvers that accounts for the goal uncertainty as well as a determinization-based approach for solving the compiled-SSP.

5.1 Admissible Heuristic

In heuristic search-based SSP solvers, the heuristic function helps avoid visiting states that are provably irrelevant. To guide the search effectively in a compiled-SSP, the heuristic should account for the goal uncertainty. We therefore propose a heuristic for the compiled-SSP that is calculated as follows:

$h(\bar{s}) = \min_{G \in \mathcal{G}} d(s, G)\, (1 - b_G(G)),$

where $d(s, G)$ denotes the cost of the shortest trajectory to a goal configuration $G$ from state $s$ and $b_G(G)$ is the agent's belief of $G$ being a true goal. Multiplying by the probability of a state not being a goal, $(1 - b_G(G))$, breaks ties in favor of configurations with a higher probability of being a goal, which receive a lower heuristic value. However, calculating $h$ requires $O(|\mathcal{G}|) = O(2^{|P_G|})$ computations. Since this would significantly affect the computation costs, we propose a variant that calculates the distance to the potential goals, requiring only $O(|P_G|)$ computations. The proposed variant, denoted by $h_{pg}$, is calculated as:

$h_{pg}(\bar{s}) = \min_{g \in P_G} d(s, g)\, (1 - b(g)),$

where $d(s, g)$ denotes the cost of the shortest trajectory to the potential goal $g$ from state $s$ and $b(g)$ is the belief of $g$ being a true goal. The following proposition shows that the proposed heuristic is admissible.

Proposition 7.

$h_{pg}$ is an admissible heuristic.

Proof Sketch.

To show that $h_{pg}$ is admissible, we first show that $d(s, G)$ is an admissible estimate of the expected cost of reaching a goal configuration $G$ from state $s$. Let $V^*(s, G)$ be the expected cost of reaching $G$ from $s$. Since $d(s, G)$ is the cost of the shortest trajectory to $G$ from $s$, $d(s, G) \leq V^*(s, G)$. If paths exist from $s$ to all potential goal states $g \in P_G$, then by definition, the shortest trajectory to a goal configuration is the minimum distance to a potential goal in $G$. That is, $d(s, G) = \min_{g \in G} d(s, g)$, and therefore $\min_{g \in P_G} d(s, g) \leq d(s, G) \leq V^*(s, G)$. Multiplying this value by the belief term $(1 - b(g)) \leq 1$ and using the minimum value over all potential goals guarantees that $h_{pg}$ is an admissible estimate of the expected cost of reaching a true goal configuration. ∎
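A minimal Python sketch of $h_{pg}$ as reconstructed above (the helper `shortest_cost` is hypothetical, e.g., a precomputed shortest-trajectory cost):

```python
def h_pg(s, belief, potential_goals, shortest_cost):
    """Heuristic sketch: cheapest potential goal, scaled by the probability
    that it is not a true goal (admissible since the factor is <= 1)."""
    values = [shortest_cost(s, g) * (1.0 - belief[g])
              for g in potential_goals if belief[g] > 0.0]
    return min(values) if values else 0.0
```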

5.2 Determinization

Determinization is a popular approach for solving large SSPs as it simplifies the problem by ignoring the uncertainty about action outcomes [Yoon, Fern, and Givan2007, Saisubramanian, Zilberstein, and Shenoy2018]. We extend determinization to a GUSSP by ignoring the uncertainty about the goals. That is, the agent plans to reach one potential goal (determinized goal) at a time, simplifying the problem to a smaller SSP. During execution, if the determinized goal is not a true goal, the agent replans for another potential goal. This results in an approximation scheme that offers a considerable speedup over solving the compiled-SSP.

We consider two determinization approaches: (i) most-likely goal determinization (DET-MLG) and (ii) closest-goal determinization (DET-CG). In DET-MLG, the agent plans for the most likely goal according to its current belief. In DET-CG, the agent plans for the closest potential goal (with non-zero belief), based on the heuristic distance from its current state. In our experiments, we resolve ties randomly.
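The resulting plan-execute-replan loop can be sketched as follows; the `gussp` interface below is hypothetical and only illustrates the control flow of DET-MLG and DET-CG:

```python
def plan_with_goal_determinization(gussp, mode="mlg"):
    """Goal-determinization loop (illustrative sketch; not the paper's code)."""
    state = gussp.initial_state()
    belief = dict(gussp.initial_belief())          # belief over potential goals
    while True:
        candidates = [g for g in gussp.potential_goals() if belief[g] > 0.0]
        if mode == "mlg":                          # DET-MLG: most-likely goal
            goal = max(candidates, key=lambda g: belief[g])
        else:                                      # DET-CG: closest goal
            goal = min(candidates, key=lambda g: gussp.heuristic(state, g))
        policy = gussp.solve_ssp_for_goal(goal)    # smaller SSP, e.g., via LAO*
        state, belief = gussp.execute(policy, state, belief)
        if gussp.at_true_goal(state, belief):
            return state
        # the determinized goal was not a true goal: replan with updated belief
```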

Problem (size, $|P_G|$) | LAO* | FLARES(1), baseline $h$ | FLARES(1), $h_{pg}$ | DET-MLG | DET-CG
rover (20,6) | 28.25 | 35.35 ± 2.67 | 30.34 ± 2.37 | 36.71 ± 2.62 | 45.51 ± 3.22
rover (20,7) | 42.16 | 43.49 ± 1.62 | 45.07 ± 1.77 | 49.69 ± 1.91 | 48.36 ± 1.43
rover (30,8) | 36.96 | 38.21 ± 1.83 | 41.31 ± 1.97 | 38.54 ± 1.54 | 40.34 ± 1.82
rover (30,9) | 34.72 | 38.21 ± 2.54 | 43.32 ± 2.54 | 50.27 ± 2.58 | 49.49 ± 1.97
search (20,4) | 87.63 | 94.32 ± 0.58 | 93.32 ± 0.58 | 91.22 ± 0.67 | 90.42 ± 0.61
search (20,5) | 74.61 | 83.83 ± 0.56 | 81.91 ± 0.56 | 78.32 ± 0.56 | 79.74 ± 6.37
search (20,5) | 86.72 | 94.21 ± 0.79 | 91.18 ± 1.46 | 87.74 ± 0.65 | 89.98 ± 0.59
search (30,6) | 90.89 | 94.21 ± 1.35 | 103.77 ± 3.42 | 101.67 ± 1.61 | 92.94 ± 0.68
ev (-,5) | 2.34 | 3.29 ± 1.55 | 4.89 ± 1.36 | 5.15 ± 1.46 | 7.17 ± 1.43
ev (-,6) | 3.46 | 4.89 ± 1.96 | 5.96 ± 1.96 | 7.15 ± 2.46 | 8.17 ± 1.43
Table 1: Average cost results (± standard error) on various problems.

6 Experiments

We begin with a comparison of different approximate solution techniques for solving the compiled-SSP on three domains in simulation. We then test the model on a real robot with three different initial belief settings. We show the path taken by the robot along with the order-$k$ value of the optimal policy in each setting.

Problem (size, $|P_G|$) | LAO* | FLARES(1), baseline $h$ | FLARES(1), $h_{pg}$ | DET-MLG | DET-CG
rover (20,6) | 14.99 | 1.08 | 0.17 | 0.07 | 0.06
rover (20,7) | 30.19 | 1.17 | 0.83 | 0.02 | 0.03
rover (30,8) | 190.92 | 2.27 | 0.16 | 0.02 | 0.03
rover (30,9) | 832.56 | 7.56 | 1.73 | 0.88 | 0.45
search (20,4) | 15.78 | 1.45 | 0.98 | 1.05 | 0.86
search (20,5) | 14.42 | 2.99 | 1.93 | 1.98 | 0.98
search (20,5) | 63.71 | 6.21 | 1.93 | 0.66 | 1.68
search (30,6) | 267.35 | 117.63 | 21.07 | 12.68 | 19.50
ev (-,5) | 8.16 | 2.21 | 0.92 | 0.52 | 0.62
ev (-,6) | 10.79 | 2.25 | 1.14 | 0.88 | 0.79
Table 2: Average planning time (in seconds) on various problems.

6.1 Evaluation in Simulation

We evaluate the solution techniques on three domains in simulation: planetary rover, search and rescue, and an electric vehicle (EV) charging problem using real-world data. Our experiments illustrate the performance of the techniques in handling two types of goal uncertainty: location-based goal uncertainty (planetary rover, search and rescue) and temporal goal uncertainty (EV). The expected cost of reaching the goal and the runtime are used as evaluation metrics. A uniform initial belief is used for all the domains in these experiments. We solve the compiled-SSPs optimally using LAO* [Hansen and Zilberstein2001], and approximately using FLARES, a domain-independent, state-of-the-art algorithm for solving large SSPs, with horizon 1 [Pineda, Wray, and Zilberstein2017], as well as the two determinization methods. Since we evaluate the determinizations with respect to the goals, we solve the determinized SSPs optimally using LAO*. A heuristic computed using a labeled version of LRTA* [Bonet and Geffner2003] is used as a baseline for evaluating the proposed heuristic.

Unless otherwise specified: (i) all algorithms and heuristics used in the experiments were implemented by us and tested on an Intel Xeon 3.10 GHz computer with 16GB of RAM; (ii) we used a value of ; (iii) all results are averaged over 100 trials of planning and execution simulations, and the average times include the time spent on re-planning; and (iv) standard errors are reported for the expected cost.

       

Figure 3: Demonstration of the path taken by the robot with three different initial beliefs for the map in Figure 1. The start state and the true goal state are denoted by S and G, respectively. The other potential goals are denoted by the question mark symbol. Green, blue, and red show the paths taken by the robot with initial beliefs of 0.1, 0.25, and 0.9 for the true goal state, respectively, with equal probability assigned to the other potential goal states.

Planetary Rover  This domain models rover science exploration [Zilberstein et al.2002, Ong et al.2010], in which a rover explores an environment described by a known map to collect a mineral sample. There is a set of possible sample types and a set of potential goals (candidate sample locations). The rover always knows its own position (its coordinates on the map) exactly, as well as those of the samples, but does not know which samples are valuable. The process terminates when the rover collects a ‘good’ sample. The actions include moving in all four directions and a sample action. The sample action is deterministic, and the other actions are stochastic, succeeding with a fixed probability. The cost of the sample action depends on whether the mineral is good; all other actions have a fixed cost.

Search and Rescue  In this domain, an autonomous robot explores an environment described by a known map to find victims [Pineda et al.2015]. We modify the problem such that there is a set of potential goals (victim locations) and a known total number of victims. However, each location may or may not have victims. The state factors are the robot's current location and a counter indicating the number of victims saved so far. The observations indicate the presence of victims in each potential goal state. The actions include moving in all four directions and a SAVE action that saves all the victims in a state. The move actions are stochastic, succeeding with a fixed probability, and have a fixed cost. The SAVE action is deterministic and has a fixed cost. The objective is to minimize the expected cost of saving all victims.

Electric Vehicle Charging  We experimented with the electric vehicle (EV) charging domain, operating in a vehicle-to-grid setting [Saisubramanian, Zilberstein, and Shenoy2017], where the EV can charge and discharge energy from a smart grid. By planning when to buy or sell electricity, an EV can devise a robust policy that is consistent with the owner's preferences, while minimizing the operational cost of the vehicle. We modified the problem such that the parking duration of the EV is uncertain, with $h$ denoting the horizon. The potential goals in this problem are the possible departure times. The EV can fully observe the current charge level and the time step. Each time step is equivalent to 30 minutes in real time. The action costs and the peak hours are based on real data [Eversource2017]. The battery capacity and the charge speeds for the EV are based on the Nissan Leaf configuration. If the EV's exit charge level does not meet the owner's desired exit charge level, a penalty may be incurred.

The charge levels and entry time data are based on charging schedules of electric cars over a four-month period in 2017 at a university campus. The data is clustered based on the entry and exit charges, and we selected 25 representative problem instances across clusters for our experiments.

Discussion  Tables 1 and 2 show the results of the five techniques on various problem instances, in terms of cost and runtime respectively. The results for the EV domain are averaged over 25 problem instances. The grid size and the number of potential goals are shown for each problem instance. We experiment with no landmark states to demonstrate the performance in the worst-case setting, and hence the order-$k$ values equal the number of potential goals. In terms of expected cost, the performance of the approximate techniques is comparable. The runtime for solving the problems optimally, however, grows rapidly as the number of potential goals increases. The advantage of using FLARES with $h_{pg}$ and of the determinization techniques is most evident in the runtime savings. FLARES using our heuristic is significantly faster than using the baseline heuristic. The determinizations are faster than solving the problem using FLARES with either heuristic.

6.2 Evaluation on a Mobile Robot

The robot experiment aims to visually explain how the belief alters the robot's plan, beyond a comparison using an abstract notion of cost. Figure 3 shows the results in a ROS simulation and on a real robot for a simple search and rescue problem with one agent and four potential victim locations for the map shown in Figure 1. We test with three different initial beliefs: uniform, optimistic, and pessimistic. The corresponding belief of the true goal in each setting is 0.25, 0.9, and 0.1, respectively, with the other potential goals having equal probability. The order-$k$ of the optimal policy with respect to the true goal in each belief setting is 4. The order-$k$ values for the optimal policies of the GUSSP with deterministic transitions for this problem are 3, 1, and 4, corresponding to the three initial beliefs.

7 Conclusion and Future Work

The goal uncertain SSP (GUSSP) provides a natural model for real-world problems where it is non-trivial to identify the exact goals ahead of plan execution. While a general GUSSP could be intractable, we identify several tractable classes of GUSSPs and propose effective algorithms for solving them. Specifically, we show that a GUSSP with a myopic observation function can be reduced to an SSP, allowing us to solve it efficiently using existing SSP solvers. We also propose an admissible heuristic that accounts for goal uncertainty in its estimation, and a fast solver based on extending the notion of determinization to handle goal uncertainty. The simulation results show that solving the compiled-SSPs using FLARES with the proposed heuristic is faster than with the baseline heuristic. The determinization techniques are significantly faster than solving the compiled-SSP optimally. The results on a robot demonstrate the order-$k$ property and visualize the policy for different initial beliefs. These results show that GUSSPs can be solved efficiently using scalable algorithms that do not rely on POMDP solvers.

There are a number of improvements that could add value to our approach. First, we are exploring other conditions under which GUSSPs have a bounded set of beliefs. Second, we aim to devise an algorithm for solving GUSSPs by sampling beliefs, similar to approaches that work well for POMDPs, but one that exploits the fully observable components. This will further expand the class of GUSSPs that are tractable. Third, we aim to target additional real-world domains that can be effectively captured using a GUSSP. Finally, we intend to broaden the scope of this work by examining other decision models with uncertain aspects that are useful, yet tractable.

References

  • [Bertsekas and Tsitsiklis1991] Bertsekas, D. P., and Tsitsiklis, J. N. 1991. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16:580–595.
  • [Biswas and Veloso2013] Biswas, J., and Veloso, M. 2013. Multi-sensor mobile robot localization for diverse environments. In Robot Soccer World Cup, 468–479. Springer.
  • [Bonet and Geffner2003] Bonet, B., and Geffner, H. 2003. Labeled RTDP: Improving the convergence of real-time dynamic programming. In Proc. of the 13th International Conference on Automated Planning and Scheduling.
  • [Bonet and Geffner2009] Bonet, B., and Geffner, H. 2009. Solving POMDPs: RTDP-Bel vs. point-based algorithms. In Proc. of the 21st International Joint Conference on Artificial Intelligence.
  • [Bourgault, Furukawa, and Durrant-Whyte2003] Bourgault, F.; Furukawa, T.; and Durrant-Whyte, H. F. 2003. Coordinated decentralized search for a lost target in a Bayesian world. In Proc. of the 16th IEEE International Conference on Intelligent Robots and Systems.
  • [Eversource2017] Eversource. 2017. Time of use rates. https://www.eversource.com/clp/vpp/vpp.aspx.
  • [Hansen and Zilberstein2001] Hansen, E. A., and Zilberstein, S. 2001. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129:35–62.
  • [Hansen2007] Hansen, E. A. 2007. Indefinite-horizon POMDPs with action-based termination. In Proc. of the 22nd AAAI Conference on Artificial Intelligence.
  • [Kaelbling, Littman, and Cassandra1998] Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134.
  • [Kitano et al.1999] Kitano, H.; Tadokoro, S.; Noda, I.; Matsubara, H.; Takahashi, T.; Shinjou, A.; and Shimada, S. 1999. RoboCup rescue: Search and rescue in large-scale disasters as a domain for autonomous agents research. In IEEE Conference on Systems, Man, and Cybernetics.
  • [Littman1997] Littman, M. L. 1997. Probabilistic propositional planning: Representations and complexity. In Proc. of the 14th National Conference on Artificial Intelligence.
  • [Madani, Hanks, and Condon1999] Madani, O.; Hanks, S.; and Condon, A. 1999. On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems. In Proc. of the 16th AAAI Conference on Artificial Intelligence.
  • [Ong et al.2010] Ong, S. C.; Png, S. W.; Hsu, D.; and Lee, W. S. 2010. Planning under uncertainty for robotic tasks with mixed observability. The International Journal of Robotics Research, 29:1053–1068.
  • [Papadimitriou and Tsitsiklis1987] Papadimitriou, C. H., and Tsitsiklis, J. N. 1987. The complexity of Markov decision processes. Mathematics of Operations Research, 12:441–450.
  • [Patek2001] Patek, S. D. 2001. On partially observed stochastic shortest path problems. In Proc. of the 40th IEEE Conference on Decision and Control.
  • [Pineda et al.2015] Pineda, L.; Takahashi, T.; Jung, H.-T.; Zilberstein, S.; and Grupen, R. 2015. Continual planning for search and rescue robots. In Proc. of the 15th IEEE Conference on Humanoid Robots.
  • [Pineda, Wray, and Zilberstein2017] Pineda, L.; Wray, K.; and Zilberstein, S. 2017. Fast SSP solvers using short-sighted labeling. In Proc. of the 31st AAAI Conference on Artificial Intelligence.
  • [Saisubramanian, Zilberstein, and Shenoy2017] Saisubramanian, S.; Zilberstein, S.; and Shenoy, P. 2017. Optimizing electric vehicle charging through determinization. In Scheduling and Planning Applications Workshop (SPARK), ICAPS.
  • [Saisubramanian, Zilberstein, and Shenoy2018] Saisubramanian, S.; Zilberstein, S.; and Shenoy, P. 2018. Planning using a portfolio of reduced models. In Proc. of the 17th International Conference on Autonomous Agents and MultiAgent Systems.
  • [Stone, Royset, and Washburn2016] Stone, L. D.; Royset, J. O.; and Washburn, A. R. 2016. Search for a stationary target. In Optimal Search for Moving Targets. Springer. 9–48.
  • [Yoon, Fern, and Givan2007] Yoon, S.; Fern, A.; and Givan, R. 2007. FF-Replan: A baseline for probabilistic planning. In Proc. of the 17th International Conference on Automated Planning and Scheduling.
  • [Zilberstein et al.2002] Zilberstein, S.; Washington, R.; Bernstein, D. S.; and Mouaddib, A.-I. 2002. Decision-theoretic control of planetary rovers. In Revised Papers from the International Seminar on Advances in Plan-Based Control of Robotic Agents, 270–289. Springer-Verlag.