Data-driven Planning via Imitation Learning

11/17/2017 · by Sanjiban Choudhury, et al. · Microsoft, Carnegie Mellon University

Robot planning is the process of selecting a sequence of actions that optimize for a task specific objective. The optimal solutions to such tasks are heavily influenced by the implicit structure in the environment, i.e. the configuration of objects in the world. State-of-the-art planning approaches, however, do not exploit this structure, thereby expending valuable effort searching the action space instead of focusing on potentially good actions. In this paper, we address the problem of enabling planners to adapt their search strategies by inferring such good actions in an efficient manner using only the information uncovered by the search up until that time. We formulate this as a problem of sequential decision making under uncertainty where at a given iteration a planning policy must map the state of the search to a planning action. Unfortunately, the training process for such partial information based policies is slow to converge and susceptible to poor local minima. Our key insight is that if we could fully observe the underlying world map, we would easily be able to disambiguate between good and bad actions. We hence present a novel data-driven imitation learning framework to efficiently train planning policies by imitating a clairvoyant oracle - an oracle that at train time has full knowledge about the world map and can compute optimal decisions. We leverage the fact that for planning problems, such oracles can be efficiently computed and derive performance guarantees for the learnt policy. We examine two important domains that rely on partial information based policies - informative path planning and search based motion planning. We validate the approach on a spectrum of environments for both problem domains, including experiments on a real UAV, and show that the learnt policy consistently outperforms state-of-the-art algorithms.




I Introduction

Motion planning, the task of computing a sequence of collision-free motions for a robotic system from a start to a goal configuration, has a rich and varied history [71]. Up until now, the bulk of the prominent research has focused on the development of tractable planning algorithms with provable worst-case performance guarantees such as computational complexity [11], probabilistic completeness [72] or asymptotic optimality [58]. In contrast, analysis of the expected performance of these algorithms on real world planning problems a robot encounters has received considerably less attention, primarily due to the lack of standardized datasets or robotic platforms.

Informative path planning, the task of computing an optimal sequence of sensing locations to visit so as to maximize information gain, has also seen an extensive amount of prior work on algorithms with provable worst-case performance guarantees, such as the computational complexity [105] and probabilistic completeness [45] of information-theoretic planning. While these algorithms use heuristics to approximate information gain using variants of Shannon entropy, their expected performance on real-world planning problems is heavily influenced by the geometric distribution of objects encountered in the world.

Fig. 1: Sequential decision making in informative path planning and search based planning. The implicit structure of the environment affects the performance of policies in both tasks. (a) The effectiveness of a policy to gather information depends on the distribution of worlds. (left) When the distribution corresponds to a scene containing ladders, the learnt policy executes a helical motion around parts of the ladder already observed, as it is unlikely that there is information elsewhere. (right) When the distribution corresponds to a scene from a construction site, the learnt policy executes a large sweeping motion as information is likely to be dispersed. (b) A learnt heuristic policy adapts to different obstacle configurations to minimize search effort. All schematics show the evolution of a search algorithm as the expansion of a search wavefront (expanded (white), invalid (black), unexpanded (grey)) from start (green) to goal (blue). A commonly used inflated Euclidean heuristic cannot adapt to different environments, e.g. it gets stuck in bug traps. On the other hand, the learnt policy is able to infer the presence of a bug trap when trained on such a distribution and switch to greedy behaviour when trained on other distributions.

A unifying theme for both these problem domains is that as robots break out of contrived laboratory settings and operate in the real world, the scenarios they encounter vary widely and have a significant impact on performance. Hence, a key requirement for autonomous systems is a robust planning module that maintains consistent performance across the diverse range of scenarios it is likely to encounter. To do so, planning modules must possess the ability to leverage information about the implicit structure of the world in which the robot operates and adapt the planning strategy accordingly. Moreover, this must occur in a purely data-driven fashion without the need for human intervention. Fortunately, recent advances in affordable sensors and actuators have enabled mass deployment of robots that navigate, interact and collect real data. This motivates us to examine the following question:

How can we design planning algorithms that, subject to on-board computation and sensing constraints, maximize their expected performance on the actual distribution of problems that a robot encounters?

I-A Motivation

We look at two domains: informative path planning and search based planning. We briefly delve into the motivation for each and make the case for data-driven approaches in both.

I-A1 Informative Path Planning

We consider the following information gathering problem - given a hidden world map, sampled from a prior distribution, the goal is to successively visit sensing locations such that the amount of relevant information uncovered is maximized while not exceeding a specified fuel budget. This problem fundamentally recurs in mobile robot applications such as autonomous mapping of environments using ground and aerial robots [13, 43], monitoring of water bodies [45] and inspecting models for 3D reconstruction [50, 47].

The nature of “interesting” objects in an environment and their spatial distribution influence the optimal trajectory a robot might take to explore the environment. As a result, it is important that a robot learns about the type of environment it is exploring as it acquires more information and adapts its exploration trajectories accordingly.

To illustrate our point, we sketch out two extreme examples of environments for a particular mapping problem, shown in Fig. 1(a). Consider a robot equipped with a sensor (an RGBD camera) that needs to generate a map of an unknown environment. It is given a prior distribution over the geometry of the world, but has no other information. This prior can cover very diverse settings. On one extreme, it can describe a dense setting where a single object, such as a ladder, must be explored in detail. On the other, it can describe a sparse setting with spatially distributed objects, such as a construction site.

The important task for the robot is then to infer which type of environment it is in from the history of measurements, and plan an efficient trajectory accordingly. At every time step, the robot visits a sensing location and receives a sensor measurement (e.g. a depth image) that has some amount of information utility (e.g. surface coverage of objects with the point cloud). As opposed to naive lawnmower-coverage patterns, it is more efficient for the robot to use a policy that maps the history of locations visited and measurements received to the next location to visit, so as to maximize the amount of information gathered in the finite battery time available.

The ability of such a learnt policy to gather information efficiently depends on the prior distribution of worlds in which the robot has been shown how to navigate optimally. Fig. 1(a) (left) shows an efficient learnt policy for inspecting a ladder, which executes a helical motion around parts of the ladder already observed to efficiently uncover new parts without searching naively. This is efficient because given the prior distribution the robot learns that information is likely to be geometrically concentrated in a particular volume given its initial observations of parts of the ladder. Similarly Fig. 1(a) (right) shows an effective policy for exploring construction sites by executing large sweeping motions. Here again the robot learns from prior experience that wide, sweeping motions are efficient since it has learnt that information is likely to be dispersed in such scenarios. We wish to arrive at an efficient procedure for training such a policy.

I-A2 Search Based Planning

Search based motion planning offers a comprehensive framework for reasoning about a vast number of motion planning algorithms [71]. In this framework, an algorithm grows a search tree of feasible robot motions from a start configuration towards a goal [91]. This is done in an incremental fashion by first selecting a leaf node of the tree, expanding this node by computing outgoing edges, checking each edge for validity and finally updating the tree with potentially new leaf nodes. It is useful to visualize this search process as a wavefront of expanded nodes that grows from the start outwards till it finds the goal as illustrated in Fig. 1(b).

This paper addresses a class of robotic motion planning problems where edge evaluation dominates the search effort, such as for robots with complex geometries like robot arms [27] or for robots with limited onboard computation like UAVs [24]. In order to ensure real-time performance, algorithms must prioritize minimizing the search effort, i.e. keeping the volume of the search wavefront as small as possible while it grows towards the goal. This is typically achieved by heuristics, which guide the search towards promising areas by selecting which nodes to expand. As shown in Fig. 1, this acts as a force stretching the search wavefront towards the goal.

A good heuristic must balance the bi-objective criteria of finding a good solution and minimizing the search effort. The bulk of the prior work has focused on the former objective of guaranteeing that the search returns a near-optimal solution [91]. These approaches define a heuristic function as a distance metric that estimates the cost-to-go value of a node [96]. However, estimating this distance metric is difficult, as it is a complex function of robot geometry, dynamics and obstacle configuration. Commonly used heuristics such as the Euclidean distance do not adapt to different robot configurations or different environments. On the other hand, in trying to compute a more accurate distance, the heuristic should not end up doing more computation than the original search. While state-of-the-art methods propose relaxation-based [77, 29] and learning-based [89] approaches to estimate the distance metric, they run into a much more fundamental limitation: a small estimation error can lead to a large search wavefront. Minimizing the estimation error does not necessarily minimize search effort.

Instead, we focus on the latter objective of designing heuristics that explicitly reduce search effort in the interest of real-time performance. Our key insight is that heuristics should adapt during search - as the search progresses, they should actively infer the structure of the valid configuration space, and focus the search on potentially good areas. Moreover, we want to learn this behaviour from data - changing the data distribution should change the heuristic automatically. Consider the example shown in Fig. 1(b). When a heuristic is trained on a world with ‘bug traps’, it learns to recognize when the search is trapped and circumvent it. On the other hand, when it is trained on a world with narrow gaps, it learns a greedy behaviour that drives the search to the goal.

I-B Key Idea

It is natural to think of both these problems as a Partially Observable Markov Decision Process (POMDP). However, the POMDP is defined on a belief over possible world maps, which is very large, rendering even the most efficient online POMDP solvers impractical.

Our key insight is that if the policies could fully observe and process the world map during decision making, they could quite easily disambiguate good actions from bad ones. This motivates us to frame the problem of learning a planning policy as a novel data-driven imitation [99] of a clairvoyant oracle. During the training process, the oracle has full knowledge about the world map (hence clairvoyant) and selects actions that maximize cumulative rewards. The policy is then trained to imitate these actions as best as it can using partial knowledge from the current history of actions and observations. As a result of our novel formulation, we are able to sidestep a number of challenging issues in POMDPs like explicitly computing posterior distribution over worlds and planning in belief space.

We empirically show that training such policies using imitation learning of clairvoyant oracles leads to much faster convergence and robustness to poor local minima than training policies via model free policy improvement. We leverage the fact that such oracles can be efficiently computed for our domains once the source of uncertainty is removed. We show in our analysis that imitation of such clairvoyant oracles during training is equivalent to being competitive with a hallucinating oracle at test time, i.e. an oracle that implicitly maintains a posterior over world maps and selects the best action at every time step. This offers some valuable insight behind the success of this approach as well as instances where such an approach would lead to a near-optimal policy.
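The training procedure described above can be sketched in a few lines. The sketch below is a toy, DAgger-style illustration with entirely hypothetical names (sample_world, clairvoyant_oracle, a tabular policy): the learner rolls out its own policy, queries the oracle, which sees the full world map, for the best action at each visited state, aggregates the labelled pairs, and re-fits the policy on the aggregate.

```python
import random

N_STATES, N_ACTIONS, HORIZON = 5, 2, 4

def sample_world(rng):
    # Prior over world maps: here just a random reward table (state x action).
    return [[rng.random() for _ in range(N_ACTIONS)] for _ in range(N_STATES)]

def clairvoyant_oracle(state, world):
    # At train time the oracle sees the full world map, so telling good
    # actions from bad ones reduces to a simple lookup.
    return max(range(N_ACTIONS), key=lambda a: world[state][a])

def train_by_imitation(n_iters=50, seed=0):
    rng = random.Random(seed)
    data = []                                 # aggregated (state, oracle action) pairs
    policy = {s: 0 for s in range(N_STATES)}  # tabular learner
    for _ in range(n_iters):
        world = sample_world(rng)
        state = rng.randrange(N_STATES)
        for _ in range(HORIZON):
            # Roll in with the learner's own policy so training states
            # match the distribution the policy itself induces.
            data.append((state, clairvoyant_oracle(state, world)))
            state = (state + 1 + policy[state]) % N_STATES
        # "Supervised learning" step: majority-vote label per state.
        for s in range(N_STATES):
            labels = [a for (st, a) in data if st == s]
            if labels:
                policy[s] = max(set(labels), key=labels.count)
    return policy
```

The key point the sketch captures is that oracle labels are cheap to compute at train time because the world map is fully observed, while the learnt policy only ever conditions on information it can observe at test time.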

I-C Contributions

Our contributions are as follows:

  1. We motivate the need to learn a planning policy that adapts to the environment in which the robot operates. We examine two domains - informative path planning and search based planning. We examine both problems through the lens of sequential decision making under uncertainty (Section II).

  2. We present a novel mapping of both these problems to a common POMDP framework (Section III).

  3. We propose a novel framework for training such POMDP policies via imitation learning of a clairvoyant oracle. We analyze the implications of imitating such an oracle (Section IV).

  4. We present training procedures that deal with the non i.i.d distribution of states induced by the policy itself along with performance guarantees. We present concrete instances of the algorithm for both problem domains. We also show that for a certain class of informative path planning problems, policies trained in this fashion possess near-optimality properties (Section V).

  5. We extensively evaluate the approach on both problem domains. In each domain, we evaluate on a spectrum of environments and show that policies outperform state-of-the-art approaches by exhibiting adaptive behaviours. We also demonstrate the impact of this framework on real world problems by presenting flight test results from a UAV (Section VI and Section VII).

This paper is a unification of previous works on adaptive information gathering [21, 20] and learning heuristic search [8]. We present a unified framework for reasoning about both problems and compare and contrast the training procedures for the two domains. We present new results on learning heuristics for 4D planning problems and flight test results from a UAV, as well as new results comparing imitation learning with policy search and comparing the sample efficiency of AggreVaTe and ForwardTraining. We present more details on implementation and analysis of results, and provide comprehensive discussions on the shortcomings of this approach and directions for future work in Section VIII.

II Background

II-A Informative Path Planning

We now present a framework for informative path planning where the objective is to visit maximally informative sensing locations subject to time and travel constraints. We use this framework to pose the problem of computing an information gathering policy for a given distribution over worlds and briefly discuss prior work on this topic.

II-A1 Framework

We now introduce a framework and set of notations to express the IPP problems of interest. The specific implementation details of the problem are described in detail in Section VI-A.

We have a robot that is constrained to move on a graph G = (V, E), where V is the set of nodes corresponding to all sensing locations. The start node is v_s. Let ξ = (v_1, v_2, …, v_p) be a sequence of connected nodes (a path) such that v_1 = v_s. Let Ξ be the set of all such paths.

Let φ ∈ M be the world map in which the robot operates. The world map is usually represented in practice as a binary grid map where grid cells are either occupied or free. We assume that the world map is fixed during an episode.

Let y ∈ Y be a measurement received by the robot and H : V × M → Y be a measurement function. When the robot is at node v in a world map φ, the measurement received by the robot is y = H(v, φ). The measurement function is defined by a sensor model, e.g. a range-limited sensor. A measurement is obtained by projecting the sensor model from the sensing node and ray-casting to determine the surfaces of the underlying world that intersect with the sensor rays.

The objective of the robot is to move on the graph and maximize utility. Let F : Ξ × M → ℝ≥0 be a utility function. For a path ξ and a world map φ, F(ξ, φ) assigns a utility to executing the path on the world. The utility of a measurement from a node is usually the amount of surface of the world covered by it. In such an instance, F does not depend on the sequence of vertices in the path, i.e. F is a set function. For simplicity, we assume that the measurement and utility functions are deterministic. However, this assumption can easily be relaxed in our approach, as discussed in Section VIII-D.

As the robot moves on the graph, the travel cost is captured by the cost function C : Ξ × M → ℝ≥0. For a path ξ and a world map φ, C(ξ, φ) assigns a travel cost for executing the path on the world. In a practical setting, the total number of timesteps is bounded by T and the travel cost is bounded by a budget B. Fig. 2 shows an illustration of the framework.
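As a concrete illustration of the framework above, the ingredients of an IPP instance can be collected into a small container. All names here are illustrative, not from the paper's implementation; the utility is the set-cover variant described in the text, and the cost is a unit cost per edge traversed:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

# Hypothetical container for one IPP instance.
@dataclass
class IPPInstance:
    nodes: List[int]                     # sensing locations (the set V)
    edges: Dict[int, List[int]]          # adjacency: connectivity of the graph
    start: int                           # start node v_s
    measure: Callable[[int], Set[int]]   # y = H(v, phi): cells seen from node v
    budget: float                        # travel cost budget B

    def utility(self, path: List[int]) -> int:
        # Set-cover style utility: cardinality of the union of cells
        # uncovered along the path (a set function, per the text).
        covered: Set[int] = set()
        for v in path:
            covered |= self.measure(v)
        return len(covered)

    def cost(self, path: List[int]) -> float:
        # Unit travel cost per edge traversal, for illustration.
        return float(len(path) - 1)
```

Because the utility is a union of per-node coverage sets, it is insensitive to the order of vertices, which is exactly what makes it a set function.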

Fig. 2: The informative path planning problem. Given a world map φ, the robot plans a path ξ which visits a node v_i and receives measurement y_i, such that utility F(ξ, φ) (information gathered) is maximized. Here the utility is the cardinality of all the cells uncovered (green), which is a union of the cells uncovered at each location (and hence a set cover function).

We are now ready to define the informative path planning problems. There are two axes of variation:

  1. Constraint on the motion of the robot

  2. Observability of the world map

The first axis arises from whether the robot is subject to any travel constraints. For problems such as sensor placement, the agent is free to select any sequence of nodes and the travel cost between nodes is zero. For such situations, the graph is also fully connected to permit any sequence. For problems involving physical movement, the agent is constrained by a budget on the travel cost. Additionally, the graph may not be fully connected.

The second axis arises from different task specifications which result in the world map being observable or being hidden. We categorize the problems on this axis to aid future discussions on imitating clairvoyant oracles in Section V.

Ii-A2 Problems with Known World Maps

For the first two variants, the world map φ is known, so the utility of a candidate path ξ can be evaluated directly while computing it.

Problem 1 (Known-Unc: Known World Map; Unconstrained Travel Cost).

Given a world map φ, a fully connected graph G and a time horizon T, find a path ξ* that maximizes utility:

ξ* = argmax_{ξ ∈ Ξ, |ξ| ≤ T+1} F(ξ, φ)
In the case where the utility function is a set function, Problem 1 is a set function maximization problem, which in general is NP-Hard [64]. Such problems occur commonly in sensor placement [66]. However, in many instances the utility function can be shown to possess the powerful property of monotone submodularity. This property implies the following:

  1. Monotonic improvement: the value of the utility can only increase on adding nodes, i.e.

    F(A ∪ {v}, φ) ≥ F(A, φ) for all A ⊆ V and v ∈ V.

  2. Diminishing returns: the gain from adding a node diminishes as the set grows, i.e.

    F(A ∪ {v}, φ) − F(A, φ) ≥ F(B ∪ {v}, φ) − F(B, φ) for all A ⊆ B ⊆ V, where v ∈ V \ B.

For such functions, it has been shown that a greedy algorithm achieves near-optimality [66, 65].
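For such a monotone submodular set-cover utility, the near-optimal greedy algorithm simply picks, at each step, the node with the largest marginal gain in coverage. A minimal sketch with illustrative names:

```python
def greedy_max_cover(nodes, measure, horizon):
    """Greedy maximization of a monotone submodular set-cover utility.

    `measure(v)` returns the set of world cells sensed from node v
    (the set-function utility from the text); names are illustrative.
    """
    chosen, covered = [], set()
    for _ in range(horizon):
        # Pick the node with the largest marginal gain in coverage.
        best = max(nodes, key=lambda v: len(measure(v) - covered))
        if len(measure(best) - covered) == 0:
            break  # no remaining node adds new coverage
        chosen.append(best)
        covered |= measure(best)
    return chosen, covered
```

For monotone submodular functions this greedy rule achieves at least a (1 − 1/e) fraction of the optimal utility [66].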

Problem 2 (Known-Con: Known World Map; Constrained Travel Cost).

Given a world map φ, a time horizon T and a travel cost budget B, find a path ξ* that maximizes utility subject to the budget:

ξ* = argmax_{ξ ∈ Ξ, |ξ| ≤ T+1} F(ξ, φ)  s.t.  C(ξ, φ) ≤ B
Problem 2 introduces a routing constraint (due to ) for which greedy approaches can perform arbitrarily poorly. Such problems occur when a physical system has to travel between nodes. Chekuri and Pal [14], Singh et al. [105] propose a quasi-polynomial time recursive greedy approach to solving this problem. Iyer and Bilmes [51] solve a related problem (submodular knapsack constraints) using an iterative greedy approach which is generalized by Zhang and Vorobeychik [129]. Yu et al. [128] propose a mixed integer approach to solve a related correlated orienteering problem. Hollinger and Sukhatme [45] propose a sampling based approach. Arora and Scherer [5] use an efficient TSP with a random sampling approach.

II-A3 Problems with Hidden World Maps

We now consider the setting where the world map φ is hidden. Given a prior distribution P(φ), it can be inferred only via the measurements received as the robot visits nodes. Hence, instead of solving for a fixed path, we compute a policy π that maps the history of measurements received and nodes visited to the next node to visit.

Problem 3 (Hidden-Unc: Hidden World Map; Unconstrained Travel Cost).

Given a distribution over world maps P(φ), a fully connected graph G and a time horizon T, find a policy π that at time t maps the history of nodes visited and measurements received to the next node to visit at time t + 1, such that the expected utility E_φ [F(ξ, φ)] is maximized.

Such a problem occurs in sensor placement where sensors can optionally fail [36]. Due to the hidden world map φ, it is not straightforward to apply the approaches of Problem 1 (Known-Unc): we have to reason both about φ and about how the utility function will evolve. However, in some instances the utility function has the additional property of adaptive submodularity [36]. This is an extension of the submodularity property where the gain of the function is measured in expectation over the conditional distribution over world maps given the history. Under such conditions, applying greedy strategies to Problem 3 has near-optimality guarantees [37, 52, 53, 16, 17]. However, these strategies require explicitly sampling from the posterior distribution over φ, which makes them intractable to apply in our setting.

Problem 4 (Hidden-Con: Hidden World Map; Constrained Travel Cost).

Given a distribution over world maps P(φ), a time horizon T and a travel cost budget B, find a policy π that at time t maps the history of nodes visited and measurements received to the next node to visit at time t + 1, such that the expected utility E_φ [F(ξ, φ)] is maximized subject to C(ξ, φ) ≤ B.

Such problems arise in a wide range of areas, such as sensor planning for 3D surface reconstruction [50] and indoor mapping with UAVs [13, 87]. Problem 4 does not enjoy the adaptive submodularity property due to the introduction of travel constraints. Hollinger et al. [47, 46] propose a heuristic based approach to select a subset of informative nodes and perform minimum cost tours. Singh et al. [106] replan every step using a non-adaptive informative path planning algorithm. Inspired by adaptive TSP approaches by Gupta et al. [39], Lim et al. [79, 78] propose recursive coverage algorithms to learn policy trees. However, such methods cannot scale to large state and observation spaces. Heng et al. [43] make a modular approximation of the objective function. Isler et al. [50] survey a broad number of myopic information gain based heuristics that work well in practice but have no formal guarantees.

II-B Search Based Planning

We now present a framework for search based planning where the objective is to find a feasible path from start to goal while minimizing search effort. We use this framework to pose the problem of learning the optimal heuristic for a given distribution over worlds and briefly discuss prior work on this topic.

II-B1 Framework

We consider the problem of search on a graph G = (V, E), where vertices represent robot configurations and edges represent potentially valid movements of the robot between these configurations. Given a pair of start and goal vertices (v_s, v_g) ∈ V × V, the objective is to compute a path ξ, a connected sequence of valid edges. The implicit graph can be compactly represented by (v_s, v_g) and a successor function Succ(v) which returns a list of outgoing edges and child vertices for a vertex v. Hence a graph is constructed during search by repeatedly expanding vertices using Succ(v). Let φ be a representation of the world that is used to ascertain the validity of an edge. An edge e is checked for validity by invoking an evaluation function Eval(e, φ), which is an expensive operation and may require complex geometric intersection operations [26].

Alg. 1 defines a general search based planning algorithm which takes as input the tuple (v_s, v_g, Succ, Eval) and returns a valid path ξ. To ensure systematic search, the algorithm maintains the following lists: an open list O of candidate vertices to be expanded and a closed list C of vertices which have already been expanded. It also retains an additional invalid list I of edges found to be in collision. Together these lists represent the complete information available to the algorithm at any given point of time. At a given iteration, the algorithm uses this information to select a vertex v to expand by invoking Select(O). It then expands v by invoking Succ(v) and checking the validity of each edge using Eval to get a set of valid successor vertices as well as a set of invalid edges. The lists are then updated and the process repeated till the goal vertex is uncovered. Fig. 3 illustrates this framework.
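The loop of Alg. 1 can be sketched as follows. The names (succ, valid, select) are illustrative stand-ins for Succ, Eval and the selection function; the sketch returns the number of expansions, the quantity the heuristic policy tries to minimize:

```python
def search(start, goal, succ, valid, select):
    """General search skeleton in the style of Alg. 1 (illustrative names).

    `succ(v)` yields candidate edges (v, u); `valid(e)` is the expensive
    edge-evaluation call; `select(opened)` is the heuristic policy that
    picks which vertex to expand next.
    """
    opened, closed, invalid = {start}, set(), set()
    expansions = 0
    while opened:
        if goal in opened:            # feasible path to goal uncovered
            return expansions
        v = select(opened)            # heuristic policy: pick a vertex
        opened.discard(v)
        closed.add(v)
        expansions += 1
        for e in succ(v):
            _, u = e
            if not valid(e):
                invalid.add(e)        # remember edges found in collision
            elif u not in closed and u not in opened:
                opened.add(u)
    return None                       # open list exhausted: no feasible path
```

Passing a different select function changes the search behaviour without touching the rest of the loop, which is exactly the interface the learnt heuristic policy plugs into.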

Fig. 3: The search based planning problem. Given a world map , the agent has to guide a search tree from start to goal by expanding vertices. At any given iteration, the open list represents the set of candidate vertices that can be expanded. The closed list represents the set of vertices already expanded. The invalid list represents the set of edges that were found to be in collision with the world. The status of every other vertex is unknown. The search continues till the goal belongs to the open list, i.e. a feasible path to goal has been found.

II-B2 The Optimal Heuristic Problem

In this work, we focus on the feasible path problem and ignore the optimality of the path. Although this is a restrictive setting, quickly finding a feasible path is a very important problem in robotics. Efficient feasible path planners such as RRT-Connect [67] have proven highly effective in high dimensional motion planning applications such as robotic arm planning [71] and mobile robot planning [70]. Hence we ignore the traversal cost of an edge and deal with unweighted graphs. We defer discussion of how to relax this restriction to Section VIII-B.

We view a heuristic policy π as a selection function Select(O) (Alg. 1, Line 3) that selects a vertex v from the open list O. The objective of the policy is to minimize the number of expansions until the search terminates. Note that the evolution of the open list depends on the underlying world map φ, which is hidden. Given a prior distribution over world maps P(φ), it can be inferred only via the outcomes of the expansion operations. The history of outcomes is captured by the state of the search, i.e. the combination of the three lists (O, C, I).

Problem 5 (Opt-Heur).

Given a distribution over world maps P(φ), find a heuristic policy π that at time t maps the state of the search (O_t, C_t, I_t) to a vertex v ∈ O_t to expand, such that the expected number of expansions till termination is minimized.

Algorithm 1: Search(v_s, v_g, Succ, Eval)
1: O ← {v_s}; C ← ∅; I ← ∅
2: while v_g ∉ O do
3:   v ← Select(O)
4:   (V_succ, E_inv) ← Expand(v, Succ, Eval)
5:   O ← (O \ {v}) ∪ V_succ; C ← C ∪ {v}; I ← I ∪ E_inv
6: end while
7: return path in the search tree from v_s to v_g

The problem of heuristic design has a lot of historical significance. A common theme is “optimism under uncertainty”. A spectrum of techniques exists to manually design good heuristics by relaxing the problem to obtain guarantees with respect to optimality and search effort [91]. To get practical performance, these heuristics are inflated, as has been the case in mobile robot planning applications [77]. However, being optimistic under uncertainty is not a foolproof approach and can be disastrous in terms of search effort depending on the environment (see Fig. 2.5, LaValle [71]).

Learning heuristics falls under machine learning for general purpose planning [55]. Yoon et al. [127] propose using regression to learn residuals over the FF heuristic [44]. Xu et al. [124, 126, 125] improve upon this in a beam-search framework. Arfaee et al. [4] iteratively improve heuristics. Virseda et al. [118] learn combinations of heuristics to estimate the cost-to-go. The Kendall rank coefficient has been used to learn open-list rankings [123, 35]. Thayer et al. [114] learn heuristics online during search. Paden et al. [89] learn admissible heuristics as sum-of-squares (SOS) problems. However, these methods do not address minimization of search effort and also ignore the non-i.i.d. nature of the problem.

II-C Partially Observable Markov Decision Process

POMDPs [56] provide a rich framework for sequential decision making under uncertainty. However, solving a POMDP is often intractable - finite horizon POMDPs are PSPACE-complete [90] and infinite horizon POMDPs are undecidable [83]. Despite this challenge, the field has forged on and produced a vast amount of work by investigating effective approximations and analyzing the structure of the optimal solution. We refer the reader to [100] for a concise survey of modern approaches.

There are two main approaches to POMDP planning: offline policy computation and online search. In offline planning, the agent computes a policy beforehand by considering all possible scenarios, then executes the policy based on the observations received. Although offline methods have shown success in computing near-optimal policies in several domains [107, 68, 109], they are difficult to scale up due to the exponential number of future scenarios that must be considered.

Online methods interleave planning and execution. The agent plans with the current belief, executes the action and updates the belief. Monte Carlo sampling methods explicitly maintain a probability distribution over states and plan via Monte Carlo roll-outs [84, 7]. This limits scalability since belief updates can be expensive. In contrast, POMCP [103] maintains a set of particles to represent the belief and employs UCT-style search to plan with these particles. This allows the method to scale to larger state spaces.

However, the disadvantage of purely online methods is that they require a lot of search effort online and can perform poorly when only a small number of particles is used. The state-of-the-art algorithm DESPOT [108] combines the best aspects of many algorithms. First, it uses determinized sampling techniques to ensure that the branching factor of the tree is bounded [88, 60]. Second, it uses offline precomputed policies to roll out from a vertex, thus lower-bounding its value. Finally, it regularizes the search by weighing the utility of a node to be robust against the fact that a finite number of samples is being used.

The methods discussed so far explicitly model the belief, which can be an issue for large scale POMDPs. Model-free approaches and representation learning offer attractive alternatives. Model-free policy improvement has been successfully used to solve POMDPs [82, 74]. Predictive state representations [80, 9], which minimize the prediction loss of future observations, offer more compact representations than maintaining a belief. There has also been considerable success in employing deep learning to learn powerful representations [42, 59].

II-D Reinforcement Learning and Imitation Learning

Reinforcement Learning (RL) [112], especially deep RL, has dramatically advanced the capabilities of sequential decision making in high dimensional spaces such as control [30], video games [104] and strategy games [104]. Several conventional supervised learning tasks are now solved with deep RL to achieve higher performance [97, 75]. In sequential decision making, the prediction of a learner depends on the history of previous outcomes. Deep RL algorithms are able to train such predictors by reasoning about the future accumulated cost in a principled manner.

We refer the reader to [62] for a concise survey on RL and to [6] for a survey on deep RL. Approaches to training such policies fall into two classes: value function based methods, where a value function for an action is learnt, and policy search, where a policy is learnt directly. Value function methods can themselves be divided into two categories - model-free algorithms and model-based algorithms.

Model-free methods are computationally cheap but ignore the dynamics of the world, thus requiring a lot of samples. Q-learning [122] is a representative algorithm for estimating the long-term expected return of executing an action from a given state. When the number of state-action pairs is too large to track each one uniquely, a function approximator is required to estimate the value. Deep Q-learning [85, 121] addresses this need by employing a neural network as the function approximator and learning the network weights. However, using the same network to generate both the target values and the updated Q-values results in oscillations. Hence a number of remedies are required to maintain stability, such as a buffer of experience, a separate target network and an adaptive learning rate. These remedies are indicative of the underlying sample inefficiency of model-free approaches.
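As a concrete illustration, the interplay of these two remedies can be sketched in a toy setting; a tabular Q dictionary stands in for the network here, and the function names and one-step MDP are hypothetical:

```python
import random
from collections import defaultdict, deque

def train_q(transitions, actions, iters=300, sync_every=25, alpha=0.5, gamma=0.9):
    """Toy sketch of the two stabilizers: an experience replay buffer and a
    periodically synced target network (tabular Q stands in for the net)."""
    q = defaultdict(float)         # "online network"
    q_target = defaultdict(float)  # frozen copy used to compute targets
    buffer = deque(maxlen=100)     # experience replay buffer
    for it in range(iters):
        buffer.append(random.choice(transitions))  # gather experience
        # Learn from minibatches sampled out of the buffer (breaks the
        # correlation of consecutive transitions), bootstrapping from the
        # *target* network rather than the online one.
        for s, a, r, s2, done in random.sample(list(buffer), min(4, len(buffer))):
            boot = 0.0 if done else gamma * max(q_target[(s2, b)] for b in actions)
            q[(s, a)] += alpha * (r + boot - q[(s, a)])
        if it % sync_every == 0:
            q_target = defaultdict(float, q)       # periodic target sync
    return q
```

In a one-step MDP where action 'R' pays 1 and 'L' pays 0, the learnt values separate cleanly despite the off-policy minibatch updates.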

Model-based methods such as R-Max [10] learn a model of the world which is then used to plan actions. While such methods are sample efficient, they require a lot of exploration to learn the model. Even when the model of the environment is known, solving for the optimal policy may be computationally expensive for large spaces. Policy search approaches are commonly used where it is easier to parameterize a policy than to learn a value function [92]; however, such approaches are sensitive to initialization and can lead to poor local minima.

In contrast with RL methods, imitation learning (IL) algorithms [25, 120, 12, 99] reduce the sequential prediction problem to supervised learning by leveraging the fact that, for many tasks, at training time we usually have a (near) optimal cost-to-go oracle. This oracle can either come from a human expert guiding the robot [2] or from ground truth data as in natural language processing [12]. The existence of such oracles can be exploited to avoid learning by trial and error - imitation of an oracle can significantly speed up learning. A traditional approach is to learn a policy or value function from a pre-collected dataset of oracle demonstrations [98, 131, 34]. A problem with these methods is that they require training and test data to be sampled from the same distribution, which is difficult to ensure in practice. In contrast, interactive approaches to data collection and training have been shown to overcome stability issues and work well empirically [101, 99, 111]. Furthermore, these approaches achieve strong performance through a reduction to no-regret online learning.

Recent approaches have also employed imitation of clairvoyant oracles, which have access to more information than the learner during training, to improve reinforcement learning, as they offer better sample efficiency and safety. Zhang et al. [130] and Kahn et al. [57] train policies that map the current observation to an action by extending guided policy search [73] to imitation of model predictive control oracles. Tamar et al. [113] consider a cost-shaping approach for short horizon MPC by offline imitation of long horizon MPC, which is closest to our work. Gupta et al. [40] develop a holistic mapping and planning framework trained using feedback from optimal plans on a graph.


also theoretically analyze the question of why imitation learning aids reinforcement learning. They develop a comprehensive theoretical study of IL on discrete MDPs and construct scenarios showing that IL achieves better sample efficiency than any RL algorithm. Concretely, they conclude that one can expect at least a polynomial gap, and a possibly exponential gap, in regret between IL and RL when one has access to unbiased estimates of the optimal policy during training.

III Problem Formulation

III-A POMDPs

A discrete-time finite horizon POMDP is defined by the tuple (S, A, P, R, O, Z, T) where

  • S is a set of states

  • A is a set of actions

  • P(s'|s, a) is the set of state transition probabilities

  • R(s, a) is the reward function

  • O is the set of observations

  • Z(o|s', a) is the set of conditional observation probabilities

  • T is the time horizon

At each time step, the environment is in some state s_t ∈ S which cannot be directly observed. The initial state s_1 is sampled from a distribution P(s_1). The agent takes an action a_t ∈ A which causes the environment to transition to state s_{t+1} with probability P(s_{t+1}|s_t, a_t). The agent receives a reward R(s_t, a_t). On reaching the new state s_{t+1}, it receives an observation o_{t+1} ∈ O according to the probability Z(o_{t+1}|s_{t+1}, a_t).

A history h_t is the sequence of actions and observations (o_1, a_1, o_2, …, a_{t−1}, o_t). Note that the initial history h_1 is simply the observation at the initial timestep. The history captures all information required to express the belief over the state. The belief b_t(s) = P(s_t = s | h_t) can be computed by recursively applying Bayes' rule

b_{t+1}(s') = η Z(o_{t+1}|s', a_t) Σ_{s∈S} P(s'|s, a_t) b_t(s)

where η is a normalization constant.
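The recursive update can be sketched for a discrete state set; the `trans` and `obs` callables below are placeholder models standing in for P(s'|s, a) and Z(o|s', a):

```python
def belief_update(belief, a, o, trans, obs):
    """One step of the recursive Bayes filter over a discrete state set.

    belief:            dict state -> P(s_t = s | h_t)
    trans(s, a, s2):   P(s2 | s, a)
    obs(s2, a, o):     P(o | s2, a)
    """
    # Prediction step: push the belief through the transition model,
    # correction step: weight by the observation likelihood.
    unnorm = {s2: obs(s2, a, o) * sum(trans(s, a, s2) * p for s, p in belief.items())
              for s2 in belief}
    eta = sum(unnorm.values())            # normalization constant
    return {s2: p / eta for s2, p in unnorm.items()}
```

For a static two-state world with a noisy "hit" sensor, a single update sharpens a uniform belief toward the state that explains the observation.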

The history can then also be used to compute an update h_{t+1} = (h_t, a_t, o_{t+1}).

The agent's action selection behaviour can be explained by a policy π(h_t) that maps a history h_t to an action a_t.

Let the state and history distribution induced by a policy π after t timesteps be P(s_t, h_t | π). The value of a policy, J(π), is the expected cumulative reward for executing π for T timesteps on the induced state and history distribution

J(π) = Σ_{t=1}^{T} E_{(s_t, h_t) ∼ P(s_t, h_t | π)} [ R(s_t, π(h_t)) ]

The optimal policy π* maximizes the expected cumulative reward, i.e. π* = argmax_π J(π).

Given a starting history h_t, let P(s_i, h_i | h_t, π) be the induced state and history distribution after i − t timesteps. The value of executing a policy π for the remaining time steps from a history h_t is the expected cumulative reward:

V^π_t(h_t) = Σ_{i=t}^{T} E_{(s_i, h_i) ∼ P(s_i, h_i | h_t, π)} [ R(s_i, π(h_i)) ]
The state-action value function is defined as the expected sum of the one-step reward and the value-to-go:

Q^π_t(h_t, a) = E_{s_t ∼ P(s_t | h_t)} [ R(s_t, a) ] + E_{o_{t+1}} [ V^π_{t+1}((h_t, a, o_{t+1})) ]    (5)
III-B Mapping Informative Path Planning to POMDPs

We now map the IPP problems Hidden-Unc and Hidden-Con to a POMDP. The state is defined to contain all information required to define the reward, observation and transition functions. Let the state be the set of nodes visited V_t and the underlying world map φ, i.e. s_t = (V_t, φ). At the start of an episode, a world map φ is sampled from a prior distribution P(φ) along with a graph G = (V, E). The initial state is assigned by setting V_1 to the start node. Note that the state is partially observable due to the hidden world map φ.

We define the action a_t to be the next node v ∈ V to visit. We are now ready to map the utility and travel cost to the reward function definition. Given that the agent is in state s_t and has executed a_t, we can extract the visited nodes and the underlying world map φ. Hence we can compute the utility function F(V_t, φ). We can also compute the travel cost function C(V_t, φ).

Before we define the reward function, we note that for Problem Hidden-Con not all actions are feasible at all times, due to the connectivity of the graph and the constraint on travel cost. Hence we define a feasible set of actions for a state as

A_feas(s_t) = { v ∈ V | v is reachable from the current node in G and visiting v keeps the travel cost C within budget }

For Problem Hidden-Unc, let A_feas(s_t) = V, i.e. all nodes are feasible.

Since the objective is to maximize the cumulative reward function, we define the reward to be proportional to the marginal utility of visiting a node. Given a node v, a set of visited nodes V_t and a world map φ, the marginal gain of the utility function is ΔF(v | V_t, φ) = F(V_t ∪ {v}, φ) − F(V_t, φ). The one-step-reward function R(s_t, a_t) is defined as this marginal gain. Additionally, the reward is set to −∞ whenever an infeasible action is selected. Hence:

R(s_t, a_t) = ΔF(v | V_t, φ) if a_t ∈ A_feas(s_t), and −∞ otherwise.
The state transition function is defined as the deterministic function which sets V_{t+1} = V_t ∪ {v}. We define the observation o_t to be the measurement received at the visited node, and the observation model to be a deterministic function of the visited node and the world map.

Note that the history h_t, the sequence of actions and observations, is captured in the sequence of nodes visited and measurements received. In our implementation, we encode this information in an occupancy map as described later in Section VI-A. The information gathering policy π(h_t) maps this history to an action a_t, the sensing location to visit.
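A minimal sketch of this reward and transition mapping; `utility` and `feasible` are placeholder callables standing in for the utility function and the feasible action set:

```python
NEG_INF = float('-inf')

def ipp_reward(visited, v, world, utility, feasible):
    # One-step reward: marginal gain in utility for visiting node v,
    # or -inf if v is outside the feasible action set.
    if v not in feasible(visited, world):
        return NEG_INF
    return utility(visited | {v}, world) - utility(visited, world)

def ipp_step(visited, v):
    # Deterministic state transition: add the chosen node to the visited set.
    return visited | {v}
```

With a toy utility that counts how many "informative" nodes have been visited, revisits earn zero marginal reward and infeasible nodes earn −∞, mirroring the definition above.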

III-C Mapping Search Based Planning to POMDPs

We now map the problem of computing a heuristic policy to a POMDP setting. Let the state be the open list O_t and the underlying world map φ, i.e. s_t = (O_t, φ). At the start of an episode, a world map φ is sampled from a prior distribution P(φ) along with a start vertex. The initial state is assigned by initializing the open list with the start vertex. Note that the state is partially observable due to the hidden world map φ.

We define the action a_t as the vertex that is to be expanded by the search. The state transition function is defined as the deterministic function which computes O_{t+1} by expanding the chosen vertex, querying the world map φ for its valid successors. The one-step-reward function R(s_t, a_t) is defined as −1 for every expansion until the goal is added to the open list. Additionally, the reward is set to −∞ whenever an infeasible action is selected. Hence:

R(s_t, a_t) = −1 if a_t is a feasible expansion, and −∞ otherwise.
We define the observation o_t to be the successor nodes and invalid edges discovered during the expansion, and the observation model to be a deterministic function of the expanded vertex and the world map.

Note that the history, the sequence of actions and observations, is contained in the concatenation of the open list, the closed list and the list of invalid edges. The heuristic is a policy π(h_t) that maps this history to an action a_t, the vertex to expand.

Note that it is more natural to think of this problem as minimizing a one-step-cost rather than maximizing a reward. Hence, when we subsequently refer to this problem instance, we refer to the one-step-cost and the cost-to-go. This only results in a change from maximization to minimization.

III-D What makes these POMDPs intractable?

A natural question to ask is whether these problems can be solved by state-of-the-art POMDP solvers such as POMCP [103] or DESPOT [108]. While such solvers are very effective at scaling up to large POMDPs, there are a few reasons why they are not immediately applicable to our problem.

Firstly, these methods require a lot of online effort. In the case of search based planning, the effort required to plan in belief space defeats the purpose of a heuristic altogether. In the case of informative path planning, the observation space is very large and belief updates would be time consuming.

Secondly, since both methods employ a particle filter to track plausible world maps, they are susceptible to a realizability problem. It is unlikely that any single world map particle will explain all observations. That said, world maps can explain local correlations in observations. For example, when planning indoors, world maps can explain correlations between observations made at intersections of corridors. Hence, we would like to generalize across such local submaps.

IV Imitation of Clairvoyant Oracles

A possible approach is to employ model-free Q-learning [85] by featurizing the history and collecting on-policy data. However, given the size of the history space, this may require a large number of samples. Another strategy is to parameterize the policy class and employ policy improvement [92] techniques. However, such techniques, when applied to POMDP settings, may lead to poor local minima due to poor initialization. We discussed in Section II-D how imitation learning offers a more effective strategy than reinforcement learning in scenarios where there exist good policies for the original problem, but these policies cannot be executed online (e.g. due to computational complexity), hence requiring imitation via an offline training phase. In this section, we extend this principle and show how imitation of clairvoyant oracles enables efficient learning of POMDP policies.

IV-A Imitation Learning

We now formally define imitation learning as applied to our setting. Given a policy π, we define the distribution of histories induced by it (termed a roll-in) as P(h | π). Let ℓ(h, π) be a loss function that captures how well policy π imitates an oracle. Our goal is to find a policy π̂ which minimizes the expected loss as follows:

π̂ = argmin_{π∈Π} E_{t ∼ U(1:T), h_t ∼ P(h_t | π)} [ ℓ(h_t, π) ]    (9)
This is a non-i.i.d. supervised learning problem. Ross et al. [101] propose ForwardTraining to train a non-stationary policy (one policy for each timestep), where each policy is trained on the distribution induced by the previous policies. While this solves the problem exactly, it is impractical, since a different policy is needed for each timestep. For training a single policy, Ross et al. [101] show how such problems can be reduced to no-regret online learning using dataset aggregation (DAgger). The loss function they consider is a misclassification loss with respect to what the expert demonstrated. Ross and Bagnell [99] extend the approach to the reinforcement learning setting, where the loss is the reward-to-go of an oracle reference policy, by aggregating values to imitate (AggreVaTe).

IV-B Solving POMDP via Imitation of a Clairvoyant Oracle

To examine the applicability of imitation learning in the POMDP framework, we compare the loss function (9) to the action value function (5). We see that a good candidate loss function should incentivize maximization of the value. A suitable approximation of the optimal value function that can be computed at train time would suffice. However, we cannot resort to oracles that explicitly reason about the belief over states, let alone plan in this belief space, due to tractability issues.

In this work, we leverage the fact that for our problem domains, we have access to the true state at train time. This allows us to define oracles that are clairvoyant - that can observe the state at training time and plan actions using this information.

Definition 1 (Clairvoyant Oracle).

A clairvoyant oracle π_OR is a policy that maps the state s_t to an action a_t with the aim of maximizing the cumulative reward of the underlying MDP.

The oracle policy defines an equivalent action value function on the state as follows

Q^OR_t(s_t, a) = R(s_t, a) + E_{s_{t+1}} [ V^OR_{t+1}(s_{t+1}) ]
Our approach is to imitate the oracle during training. This implies that we train a policy π̂ by solving the following optimization problem

π̂ = argmax_{π∈Π} E_{t ∼ U(1:T), (s_t, h_t) ∼ P(s_t, h_t | π)} [ Q^OR_t(s_t, π(h_t)) ]    (11)
While we will define training procedures to concretely realize (11) later in Section V, we offer some intuition behind this approach. Since the oracle knows the state s_t, it has the appropriate information to assign a value to an action. The policy attempts to imitate this action from the partial information present in its history. Due to this realization error, the policy visits a different state, updates its history, and queries the oracle for the best action. Hence, while the learnt policy may make mistakes early in an episode, with time it becomes better at imitating the oracle.

IV-C Analysis using a Hallucinating Oracle

The learnt policy imitates a clairvoyant oracle that has access to more information (the state s_t compared to the history h_t). This results in a large realizability error, which is due to two terms: firstly, the information mismatch between the state and the history; secondly, the limited expressiveness of the feature space. This realizability error can be hard to bound, making it difficult to apply the performance guarantee analysis of [99]. It is also not desirable to obtain a performance bound with respect to the clairvoyant oracle itself.

To alleviate the information mismatch, we take an alternate approach to analyzing the learner by introducing a purely hypothetical construct - a hallucinating oracle.

Definition 2 (Hallucinating Oracle).

A hallucinating oracle π_HAL computes the instantaneous posterior distribution over states given the history and returns the expected clairvoyant oracle action value:

Q^HAL_t(h_t, a) = E_{s_t ∼ P(s_t | h_t)} [ Q^OR_t(s_t, a) ]    (12)
We show that by imitating a clairvoyant oracle, the learner effectively imitates the corresponding hallucinating oracle.

Lemma 1.

The offline imitation of the clairvoyant oracle (11) is equivalent to online imitation of a hallucinating oracle as shown

argmax_{π∈Π} E_{t, s_t, h_t} [ Q^OR_t(s_t, π(h_t)) ] = argmax_{π∈Π} E_{t, h_t} [ Q^HAL_t(h_t, π(h_t)) ]
Refer to Appendix A. ∎

Note that a hallucinating oracle uses the same information content as the learnt policy. Hence the realization error is purely due to the expressiveness of the feature space. The empirical risk of imitating the hallucinating oracle will be significantly lower than the risk of imitating the clairvoyant oracle.

Lemma 1 now allows us to express the performance of the learner with respect to a hallucinating oracle. This brings us to the key question - how good is a hallucinating oracle? Upon examining (12), we see that this oracle is equivalent to the well known QMDP policy first proposed by [81]. The QMDP policy ignores observations and computes the values of the underlying MDP. It then estimates the action value by taking an expectation under the current belief over states. This estimate amounts to assuming that any uncertainty in the agent's current belief state will be gone after the next action. Thus, the action whose long-term reward over all states (weighted by their probabilities) is largest will be the one chosen.
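A QMDP-style action selection can be sketched as follows; the belief dictionary and the `q_mdp` lookup are hypothetical stand-ins for the posterior over states and the action values of the underlying MDP:

```python
def qmdp_action(belief, actions, q_mdp):
    # Q_MDP(b, a) = sum_s b(s) * Q*(s, a): score each action by its fully
    # observable (MDP) value averaged under the current belief, then act
    # greedily. This assumes state uncertainty vanishes after the next action.
    score = lambda a: sum(p * q_mdp(s, a) for s, p in belief.items())
    return max(actions, key=score)
```

Note how the chosen action flips as the belief shifts mass between states, even though the underlying MDP values never change.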

[81] point out that policies based on this approach are remarkably effective. This has been verified by other works such as Koval et al. [63] and Javdani et al. [54]. This naturally leads to the question of why we cannot directly apply QMDP to our problem. The QMDP approach requires explicitly sampling from the posterior over states online - a step that we cannot tractably compute, as discussed in Section III-D. However, by imitating clairvoyant oracles, we implicitly obtain such behaviour.

Imitation of clairvoyant oracles has been shown to be effective in other domains such as receding horizon control via imitating MPC methods that have full information [57]. [111] show how the partially observable acrobot can be solved by imitation of oracles having full state. [59] introduce imitation of QMDP in a deep learning architecture to train POMDP policies end to end.

The connection with a hallucinating oracle also provides valuable insight into potential failure situations. [81] point out that policies based on this approach will not take actions to gain information. We discuss such situations in Section VIII-A.

V Approach

V-A Algorithms

Fig. 4: Overview of the two approaches for training policies. (a) ForwardTraining is used to train a non-stationary policy, i.e. a sequence of policies, one per time-step. To train the policy at a given time-step, a state is sampled from the initial distribution. The previously trained policies are used to roll in to that time-step. The oracle is queried for the action value, which is then used to update the dataset and train the policy for that time-step. (b) AggreVaTe is used to train a stationary policy. The training process is iterative, interleaving dataset collection with learning. At each iteration, a mixture policy of the learner and the oracle is used to roll in. The oracle is queried for the action value. The data is aggregated into the whole dataset, which is used to update the policy.

We introduced imitation learning and its applicability to POMDPs in Section IV. We now present a set of algorithms to concretely realize the training process. The overall idea is as follows: we train a policy π̂ that maps features extracted from the history h_t to an action a_t. The training objective is to imitate a clairvoyant oracle that has access to the corresponding full state s_t. In order to define concrete algorithms, we need to reason about two classes of policies - non-stationary and stationary.

V-A1 Non-stationary policy

For the non-stationary case, we have a separate policy for each timestep t. The motivation for adopting such a policy class is that the problems arising from the non-i.i.d. data distribution immediately disappear. Such a policy class can be trained using the ForwardTraining algorithm [101], which sequentially trains each policy on the distribution of features induced by the previous policies. Hence the training problem for each policy is reduced to supervised learning.

1:for t = 1 to T do
2:     Initialize D_t ← ∅.
3:     for j = 1 to m do
4:         Sample initial state s_1 from dataset P(s_1)
5:         Execute policies π̂_1, …, π̂_{t−1} to reach (s_t, h_t).
6:         Execute any action a_t ∈ A.
7:         Collect value-to-go Q^OR_t(s_t, a_t)
8:         D_t ← D_t ∪ {(h_t, a_t, Q^OR)}
9:     Train cost-sensitive classifier π̂_t on D_t
10:Return set of policies π̂_1, …, π̂_T for each time step.
Algorithm 2 ForwardTraining (Non-stationary policy)

Alg. 2 describes the ForwardTraining procedure to train the non-stationary policy. The policies are trained sequentially. At each time-step t, the previously trained policies are used to create a dataset by rolling in (Lines 1-5). For each such datapoint h_t, there is a corresponding state s_t. A random action a_t is sampled and the oracle is queried for the value-to-go (Line 7). This is then added to the dataset used to train the policy π̂_t. This is illustrated in Fig. 4.
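A minimal sketch of this procedure with generic placeholder callables for sampling, stepping, labeling and fitting; for brevity, the roll-in feature passed to each policy is simply the state reached:

```python
def forward_training(T, m, sample_s0, step, oracle_value, sample_action, fit):
    """Sketch of Alg. 2: one policy per timestep, each trained on the
    distribution induced by its predecessors (all names are placeholders)."""
    policies = []
    for t in range(T):
        data = []
        for _ in range(m):
            s = sample_s0()
            for pi in policies:              # roll in with pi_1 .. pi_{t-1}
                s = step(s, pi(s))
            a = sample_action()              # try a random action
            data.append((s, a, oracle_value(s, a)))  # clairvoyant label
        policies.append(fit(data))           # supervised learning at step t
    return policies
```

Instantiated on a toy chain where the oracle value is the negative distance to a goal at 0, the learnt per-timestep policies walk the agent to the goal.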

We can state the following property about the training process

Theorem 1.

ForwardTraining has the following guarantee

J(π̂) ≥ J(π_HAL) − T ε_class − T ε_or

where ε_class is the regression error of the learner and ε_or is the local oracle suboptimality.


Refer to Appendix B. ∎

However, there are several drawbacks to using a non-stationary policy. Firstly, it is impractical to maintain a different policy for each time-step, as the number of policies scales with T. While this might be a reasonable approach when T is small (e.g. sequence classification problems [23]), in our applications T can be fairly large. Secondly, and more importantly, each policy operates on data from only that time-step, thus preventing generalization across timesteps. Each policy sees only a 1/T fraction of the training data. This leads to high empirical risk.

V-A2 Stationary policy

A single stationary policy enjoys the benefit of learning on data across all timesteps. However, the non-i.i.d. data distribution implies that data collection and training cannot be decoupled - the learner must be involved in the data collection process. Ross and Bagnell [99] show that such policies can be trained by reducing the problem to a no-regret online learning setting. They present an algorithm, AggreVaTe, that trains the policy in an interactive fashion: data is collected by a mixture policy of the learner and the oracle, the data is then aggregated, and the learner is trained on this aggregated data. This process is repeated.

1:Initialize D ← ∅, π̂_1 to any policy in Π
2:for i = 1 to N do
3:     Initialize sub-dataset D_i ← ∅
4:     Let roll-in policy be π_mix = β_i π_OR + (1 − β_i) π̂_i
5:     Collect m data points as follows:
6:     for j = 1 to m do
7:         Sample initial state s_1 from dataset P(s_1)
8:         Sample uniformly t ∈ {1, …, T}
9:         Execute π_mix up to time t − 1 to reach (s_t, h_t)
10:        Execute any action a_t ∈ A
11:        Collect value-to-go Q^OR_t(s_t, a_t)
12:        D_i ← D_i ∪ {(h_t, a_t, Q^OR)}
13:     Aggregate datasets: D ← D ∪ D_i
14:     Train cost-sensitive classifier π̂_{i+1} on D
15:Return best π̂ on validation
Algorithm 3 AggreVaTe (Stationary policy)

Alg. 3 describes the AggreVaTe procedure to train the stationary policy. To overcome the non-i.i.d. distribution issue, the algorithm interleaves data collection with learning and iteratively trains a set of policies. Note that these iterations are not to be confused with time steps - they are simply learning iterations. A policy is valid for all timesteps. At iteration i, data is collected by rolling in with a mixture of the learner and the oracle policy (Lines 1-9). The mixing fraction is chosen to be β_i. Mixing implies flipping a coin with bias β_i and executing the oracle if heads comes up. A random action is sampled and the oracle is queried for the value-to-go (Line 11).

The key step is to ensure that data is aggregated. The motivation for doing so arises from the fact that we want the learner to do well on the distribution it induces. [99] show that this can be posed as a mixture of learners doing well on the induced loss sequences at every iteration. If we were to treat each iteration as a game in an online adversarial learning setting, this would be equivalent to having bounded regret with respect to the best policy in hindsight on the loss sequence. The strategy of dataset aggregation is an instance of follow-the-leader and hence has bounded regret. Hence, data is appended to the original dataset and used to train an updated learner (Lines 13-14).
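The mixture roll-in and dataset aggregation can be sketched as follows; all callables are hypothetical placeholders, and the learner is seeded with the oracle policy for simplicity:

```python
import random

def aggrevate(N, m, T, s0, oracle_pi, oracle_value, step, actions, fit, beta):
    """Sketch of Alg. 3: episodic mixture roll-in, random exploration action,
    clairvoyant value label, aggregation (follow-the-leader), refit."""
    data, policy = [], oracle_pi              # seed learner with the oracle
    for i in range(N):
        for _ in range(m):
            # Episodic mixing: with probability beta(i), roll in with the oracle.
            rollin = oracle_pi if random.random() < beta(i) else policy
            s = s0()
            for _ in range(random.randrange(T)):  # roll in to a random timestep
                s = step(s, rollin(s))
            a = random.choice(actions)            # explore a random action
            data.append((s, a, oracle_value(s, a)))  # aggregate, never discard
        policy = fit(data)                    # train on ALL data so far
    return policy
```

On a toy chain with goal 0 (β fixed to 1 in the test, so all roll-ins follow the oracle), the learner trained on the aggregated labels recovers the descent behaviour of the oracle.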

AggreVaTe can be shown to have the following guarantee

Theorem 2.

N iterations of AggreVaTe, collecting m regression examples per iteration, guarantee that with probability at least 1 − δ

J(π̂) ≥ J(π_HAL) − T ( ε_class + ε_reg + O( Q_max √( ln(1/δ) / (N m) ) ) ) − T ε_or

where ε_class is the empirical regression regret of the best regressor in the regression class on the aggregated dataset, ε_reg is the empirical online learning average regret on the sequence of training examples, Q_max is the range of the oracle action value and ε_or is the local oracle suboptimality.


Refer to Appendix C. ∎

V-B Application to Informative Path Planning

Fig. 5: An overview of QvalAgg in IPP, where a learner is trained to imitate a clairvoyant oracle. There are 4 key steps. Step 1: A world map φ is sampled from a database representing the prior P(φ). Step 2: A mixture policy of the learner and oracle is used to roll in on the map to a timestep t to get the history h_t. Step 3: A random action a_t is chosen and the pair (h_t, a_t) is featurized. Step 4: A clairvoyant oracle is given full access to the world map φ to compute the cumulative reward-to-go. The pair is added to the data to update the learner. This process is repeated to train a sequence of learners.

We now consider the applicability of Alg. 2 and Alg. 3 for learning a policy to plan informative paths. We refer to the mapping of the IPP problem to a POMDP defined in Section III-B. We first need to define a clairvoyant oracle in this context. Recall that the state is the set of nodes visited and the underlying world. A clairvoyant oracle takes a state action pair as input and computes a value. Depending on whether we are solving Problem Hidden-Unc or Hidden-Con, we explore two different kinds of oracles:

  1. Clairvoyant One-step-reward

  2. Clairvoyant Reward-to-go

V-B1 Solving Hidden-Unc by Imitating Clairvoyant One-step-reward

We first define a Clairvoyant One-step-reward oracle in the IPP framework.

Definition 3 (Clairvoyant One-step-reward).

A Clairvoyant One-step-reward oracle returns an action value that considers only the one-step reward. In the context of Hidden-Unc, it uses the world map φ, the current set of visited nodes and the next node to visit to compute the value as the marginal gain in utility, i.e. ΔF(v | V_t, φ).

To motivate the use of the Clairvoyant One-step-reward, we refer to the discussion of the structure of Problem Hidden-Unc in Section II-A3. We assume that the utility function is adaptive monotone submodular - it has the properties of monotonicity and diminishing returns under the belief over world maps. This property implies the following:

  1. Adaptive Monotonicity: The expected value of the utility can only increase on adding a node, i.e.

    E_φ [ F(V ∪ {v}, φ) | h ] ≥ E_φ [ F(V, φ) | h ]

    for all histories h with visited set V and all nodes v.

  2. Adaptive Submodularity: The expected gain of adding a node diminishes as more nodes are visited, i.e.

    E_φ [ ΔF(v | V_1, φ) | h_1 ] ≥ E_φ [ ΔF(v | V_2, φ) | h_2 ]

    for all h_1 ⊆ h_2 (history h_1 is contained in history h_2) and all nodes v.

For such functions, [36] show that greedily selecting vertices to visit is near-optimal. We use this property to show that the Clairvoyant One-step-reward induces a one-step oracle which is equivalent to the greedy policy and hence near-optimal. This leads to the following guarantee.

Theorem 3.

N iterations of AggreVaTe with the Clairvoyant One-step-reward, collecting m regression examples per iteration, guarantee that with probability at least 1 − δ

J(π̂) ≥ (1 − 1/e) J(π*) − T ( ε_class + ε_reg + O( ΔF_max √( ln(1/δ) / (N m) ) ) )

where ε_class is the empirical regression regret of the best regressor in the regression class on the aggregated dataset, ε_reg is the empirical online learning average regret on the sequence of training examples, and ΔF_max is the maximum range of the one-step reward.


Refer to Appendix D. ∎

We will show in Section VI that such policies are remarkably effective. An added benefit of imitating the Clairvoyant One-step-reward is that the empirical classification loss is lower, since only the expected one-step reward of an action needs to be learnt.
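The greedy selection underlying the Clairvoyant One-step-reward can be sketched in its non-adaptive form; `expected_gain` is a placeholder for the expected marginal utility of a candidate node:

```python
def greedy_select(candidates, k, expected_gain):
    # Greedy policy induced by the one-step-reward oracle: at each step pick
    # the node with the largest expected marginal gain in utility. For
    # (adaptive) monotone submodular utilities this is near-optimal.
    chosen, remaining = [], sorted(candidates)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda v: expected_gain(chosen, v))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

With a toy coverage utility (each node "sees" a set of cells), the greedy rule first picks the node covering the most cells, then the node adding the most new cells.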

V-B2 Solving Hidden-Con by Imitating Clairvoyant Reward-to-go

Unfortunately, Problem Hidden-Con does not possess the adaptive submodularity property of Hidden-Unc due to the introduction of the travel cost. Hence imitating the one-step reward is no longer appropriate. We define the Clairvoyant Reward-to-go oracle for this problem class.

Definition 4 (Clairvoyant Reward-to-go).

A Clairvoyant Reward-to-go oracle returns an action value that corresponds to the cumulative reward obtained by executing the action and then following the oracle policy. In the context of Hidden-Con, it uses the world map φ, the current path and the next node to visit to solve the problem Known-Con and compute a future sequence of nodes. This provides the value as the marginal gain in utility of the computed sequence.

The corresponding oracle policy is obtained by following the computed path.

Note that solving Known-Con is NP-Hard and even the best approximation algorithms require some computation time. Hence the calls to the oracle must be minimized.

V-B3 Training and Testing Procedure

We now present concrete algorithms to realize the training procedure. Given the two axes of variation - problem and policy type - we have four possible algorithms

  1. RewardFT: Imitate one-step-reward using non-stationary policy by ForwardTraining (Alg. 2)

  2. QvalFT: Imitate reward-to-go using non-stationary policy by ForwardTraining (Alg. 2)

  3. RewardAgg: Imitate one-step-reward using stationary policy by AggreVaTe (Alg. 3)

  4. QvalAgg: Imitate reward-to-go using stationary policy by AggreVaTe (Alg. 3)

Table I shows the algorithm mapping.

Policy \ Problem        Hidden-Unc    Hidden-Con
Non-stationary policy   RewardFT      QvalFT
Stationary policy       RewardAgg     QvalAgg
TABLE I: Mapping from Problem and Policy type to Algorithm
1:Initialize D ← ∅, π̂_1 to any policy in Π
2:for i = 1 to N do
3:     Initialize sub-dataset D_i ← ∅
4:     Let roll-in policy be π_mix = β_i π_OR + (1 − β_i) π̂_i
5:     Collect m data points as follows:
6:     for j = 1 to m do
7:         Sample world map φ from dataset P(φ)
8:         Sample start node for graph G
9:         Sample uniformly t ∈ {1, …, T}
10:        Execute π_mix up to time t − 1
11:            to get path and history h_t
12:        Sample a random action a_t
13:            as the next vertex to visit
14:        Invoke Clairvoyant Reward-to-go oracle
15:            to get Q^OR_t(s_t, a_t).
16:        D_i ← D_i ∪ {(h_t, a_t, Q^OR)}
17:     Aggregate datasets: D ← D ∪ D_i
18:     Train cost-sensitive classifier π̂_{i+1} on D
19:Return best π̂ on validation
Algorithm 4 QvalAgg

For completeness, we concretely define the training procedure for QvalAgg in Alg. 4; the procedures for the remaining three algorithms can be inferred from it. The algorithm iteratively trains a sequence of policies. At every iteration i, the algorithm conducts m episodes. In every episode, a different world map and start vertex are sampled from a database. The roll-in is conducted with a mixture policy which blends the learner's current policy and the oracle's policy using a blending parameter β_i. The blending is done in an episodic fashion: with probability β_i, the Clairvoyant Reward-to-go oracle is invoked to compute a path, which is then followed; with probability 1 − β_i, the learner is invoked for the whole episode. In a given episode, the roll-in is conducted to a timestep t which is uniformly sampled. At the end of the roll-in, we have a path and a history. A random action is sampled, which defines the next vertex to visit. The Clairvoyant Reward-to-go oracle is invoked with the world map and the path already travelled; it then invokes a solver for Known-Con to complete the path and return the reward-to-go. The history-action pair is projected to a feature space along with its label. The data is aggregated into the dataset, which is eventually used to train the next policy. Fig. 5 illustrates this approach.

V-C Application to Search Based Planning

Fig. 6: An overview of SaIL in search based planning, where a learner π is trained to imitate a clairvoyant oracle π_OR. There are 4 key steps. Step 1: A world map φ is sampled from the database along with a start-goal pair (v_s, v_g). Step 2: A mixture policy of the learner and the oracle is used to roll in on φ up to a timestep t to get the history, which is the combination of open list, closed list and invalid edges. Step 3: A random vertex from the open list is chosen and featurized. Step 4: A clairvoyant oracle with full access to the world map φ computes the cumulative cost-to-go Q_OR. The feature-label pair is added to the data to update the learner. This process is repeated to train a sequence of learners.

We now consider the applicability of Alg. 3 for heuristic learning in search based planning. Unlike the IPP problem domain, there is no incentive to use a non-stationary policy or imitate Clairvoyant One-step-rewards. Hence we only consider training a stationary policy imitating Clairvoyant Reward-to-go.

We first need to define a clairvoyant oracle for this problem. Given access to the world map φ, the oracle has to solve for the optimal number of expansions to reach the goal. This allows us to define a clairvoyant oracle planner that employs a backward Dijkstra’s algorithm: given a world φ and a goal vertex v_g, it plans the optimal path from every vertex using dynamic programming.

Definition 5 (Clairvoyant Oracle Planner).

Given full access to the state, which contains the open list and the world φ, and a goal v_g, the oracle planner encodes the cost-to-go from any vertex v as the function Q_OR(v), which implicitly defines an oracle policy.

The clairvoyant oracle planner provides a look-up table for the optimal cost-to-go from any vertex irrespective of the current state of the search.

A key distinction between this oracle and the one defined for the IPP problem in Section V-B is that we are able to efficiently get the cost-to-go value for all states by dynamic programming - we do not need to repeatedly invoke the oracle. We exploit this fact by extracting multiple labels from an episode even though the oracle is invoked only once. Additionally, this enables a better roll-in procedure in which the oracle and the learner are interleaved. We adapt the AggreVaTe framework to present an algorithm, Search as Imitation Learning (SaIL).
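The single-invocation property comes from the fact that one backward Dijkstra pass from the goal yields the optimal cost-to-go for every vertex at once. A minimal sketch, assuming an undirected graph given as an adjacency map (the representation is illustrative, not the paper’s):

```python
import heapq

def backward_dijkstra(adj, goal):
    """One backward Dijkstra pass from the goal computes the optimal
    cost-to-go for *every* vertex, so a single oracle invocation can
    label all states visited during an episode.
    adj: dict vertex -> list of (neighbor, edge_cost), undirected."""
    dist = {goal: 0.0}
    pq = [(0.0, goal)]
    while pq:
        d, v = heapq.heappop(pq)
        if d > dist.get(v, float('inf')):
            continue                          # stale priority-queue entry
        for u, c in adj[v]:
            nd = d + c
            if nd < dist.get(u, float('inf')):
                dist[u] = nd
                heapq.heappush(pq, (nd, u))
    return dist                               # lookup table: vertex -> cost-to-go
```

The returned dictionary acts as the oracle’s lookup table: the oracle policy at any point in the search simply expands the open-list vertex with the smallest cost-to-go, irrespective of the current state of the search.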

1: Initialize D ← ∅ and π_1 to any policy in Π
2: for i = 1 to N do
3:     Initialize sub-dataset D_i ← ∅
4:     Collect m data points as follows:
5:     for j = 1 to m do
6:         Sample world map φ from the dataset
7:         Sample start-goal pair (v_s, v_g)
8:         Invoke the clairvoyant oracle planner to compute Q_OR(v) for all v
9:         Sample k timesteps uniformly, each from {1, …, T}
10:        Roll out search with mixture π_mix = β_i π_OR + (1 − β_i) π_i
11:        At each sampled timestep t, pick a random feasible action a_t and featurize it
12:        Query the oracle lookup table for the label Q_OR
13:        Append each feature-label pair to sub-dataset D_i
14:    Aggregate datasets: D ← D ∪ D_i
15:    Train cost-sensitive classifier π_{i+1} on D
16: Return best π_i on validation
Algorithm 5: SaIL

Alg. 5 describes the SaIL framework, which iteratively trains a sequence of policies (π_1, …, π_N). For training the learner, we collect a dataset as follows. At every iteration i, the agent executes m different searches (Alg. 1). For every search, a different world map φ and start-goal pair (v_s, v_g) are sampled from a database. The agent then rolls out a search with a mixture policy π_mix, which blends the learner’s current policy π_i and the oracle’s policy π_OR using a blending parameter β_i. During the search execution, at every timestep in a set of k uniformly sampled timesteps, we select a random action from the set of feasible actions and collect a datapoint. The policy is rolled out till the end of the episode, and all the collected data is aggregated with the dataset D. At the end of N iterations, the algorithm returns the best-performing policy on a set of held-out validation environments or, alternatively, a mixture of the N learned policies.
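The SaIL loop differs from QvalAgg in two ways sketched below: the oracle is invoked once per episode (a single backward pass yields cost-to-go for all vertices), and the oracle and learner are interleaved step-by-step during the roll-out rather than chosen once per episode. This is a toy instantiation with illustrative names; the graph representation, the `fit` learner, and the `oracle_table` callable are stand-ins, not the paper’s implementation.

```python
import random

def fit(D):
    """Toy cost-sensitive learner: remember the lowest-cost-to-go action
    seen from each state; fall back to the first neighbor otherwise."""
    table = {}
    for s, a, q in D:
        if s not in table or q < table[s][1]:
            table[s] = (a, q)
    def policy(state, world):
        return table.get(state, (world[state][0], None))[0]
    return policy

def sail(episodes, oracle_table, N=3, m=4, T=5, k=2, beta0=0.7):
    """SaIL sketch: episodes is a list of (world, start, goal) tuples,
    where world is a dict vertex -> list of neighbor vertices."""
    D = []
    policy = fit(D)                                  # pi_1: any initial policy
    for i in range(N):
        beta = beta0 ** i                            # blending parameter
        D_i = []
        for _ in range(m):
            world, start, goal = random.choice(episodes)
            q_star = oracle_table(world, goal)       # ONE oracle call: cost-to-go for all v
            picks = set(random.sample(range(T), k))  # k label timesteps per episode
            state = start
            for t in range(T):
                if state == goal:
                    break
                if t in picks:
                    a = random.choice(world[state])       # random feasible action
                    D_i.append((state, a, q_star[a]))     # label from the lookup table
                if random.random() < beta:                # interleaved roll-in:
                    state = min(world[state], key=q_star.get)  # oracle step
                else:
                    state = policy(state, world)               # learner step
        D.extend(D_i)                                # aggregate datasets
        policy = fit(D)                              # retrain on the aggregate
    return policy
```

Note how a single `oracle_table` call serves every labelled timestep in the episode, which is precisely why SaIL can afford many labels per search.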