Learning to Gather Information via Imitation

11/13/2016 ∙ by Sanjiban Choudhury, et al. ∙ Microsoft ∙ Carnegie Mellon University

The budgeted information gathering problem, in which a robot with a fixed fuel budget must maximize the amount of information gathered from the world, appears in practice across a wide range of applications in autonomous exploration and inspection with mobile robots. Although there is extensive prior work on effective approximations of the problem, these methods do not address the fact that their performance depends heavily on the distribution of objects in the world. In this paper, we address this issue by proposing a novel data-driven imitation learning framework. We present an efficient algorithm, ExpLOre, that trains a policy on the target distribution to imitate a clairvoyant oracle: an oracle that has full information about the world and computes non-myopic solutions to maximize the information gathered. We validate the approach on a number of 2D and 3D exploration problems, with results that demonstrate the ability of ExpLOre to adapt to different object distributions. Additionally, our analysis provides theoretical insight into the behavior of ExpLOre. Our approach paves the way for efficiently applying data-driven methods to the domain of information gathering.


I Introduction

This paper considers the budgeted information gathering problem. Our aim is to maximally explore a world with a robot that has a budget on the total amount of movement due to battery constraints. This problem fundamentally recurs in mobile robot applications such as autonomous mapping of environments using ground and aerial robots [1, 9], monitoring of water bodies [12] and inspecting models for 3D reconstruction [13, 11].

The nature of “interesting” objects in an environment and their spatial distribution influence the optimal trajectory a robot might take to explore the environment. As a result, it is important that a robot learns about the type of environment it is exploring as it acquires more information and adapts its exploration trajectories accordingly. This adaptation must be done online, and we provide such an algorithm in this paper.

To illustrate our point, consider two extreme examples of environments for a particular mapping problem, shown in Fig. 1. Consider a robot equipped with a sensor (an RGBD camera) that needs to generate a map of an unknown environment. It is given a prior distribution over the geometry of the world, but has no other information. This geometry could include very diverse settings. First, it could include a dense setting: a world containing only a single ladder whose form must be explored (Fig. 1). Second, it could include a sparse setting with spatially distributed objects, such as a construction site (Fig. 1).

The robot's task is now to infer which type of environment it is in from the history of measurements, and to plan an efficient trajectory accordingly. At every time step, the robot visits a sensing location and receives a sensor measurement (e.g. a depth image) that has some amount of information utility (e.g. surface coverage of objects with a point cloud). Compared with naive lawnmower-coverage patterns, it is more efficient for the robot to use a policy that maps the history of locations visited and measurements received to the next location to visit, so as to maximize the amount of information gathered in the finite battery time available.

The ability of such a learnt policy to gather information efficiently depends on the prior distribution of worlds in which the robot has been shown how to navigate optimally. Fig. 1 shows an efficient learnt policy for inspecting a ladder, which executes a helical motion around parts of the ladder already observed to efficiently uncover new parts without searching naively. This is efficient because, given the prior distribution, the robot learns that information is likely to be geometrically concentrated in a particular volume given its initial observations of parts of the ladder. Similarly, Fig. 1 shows an effective policy for exploring construction sites by executing large sweeping motions. Here again the robot has learnt from prior experience that information is likely to be dispersed in such scenarios, so wide, sweeping motions are efficient.

Thus our requirements for an efficient information-gathering policy can be distilled to two points:

  1. Reasoning about posterior distribution over world maps: The robot should use the history of movements and measurements to infer a posterior distribution of worlds. This can be used to infer locations that are likely to contain information and to plan a trajectory efficiently. However, the space of world maps is very large, and it is intractable to compute this posterior online.

  2. Reasoning about non-myopic value of information: Even if the robot is able to compute the posterior and hence the value of information at a location, it has to be cognizant of the travel cost to get to that location. It needs to exhibit non-myopic behavior to achieve a trade-off that maximizes the overall information gathered. Performing such planning at every step is prohibitively expensive.

Even though it is natural to think of this problem setting as a POMDP, we frame this problem as a novel data-driven imitation learning problem [26]. We propose an algorithm ExpLOre (Exploration by Learning to Imitate an Oracle) that trains a policy on a dataset of worlds by imitating a clairvoyant oracle. During the training process, the oracle has full information about the world map (and is hence clairvoyant) and plans movements to maximize information. The policy is then trained to imitate these movements as best as it can using partial information from the current history of movements and measurements. As a result of our novel formulation, we are able to sidestep a number of challenging issues in POMDPs, such as explicitly computing the posterior distribution over worlds and planning in belief space.

Our contributions are as follows:

  1. We map the budgeted information gathering problem to a POMDP and present an approach to solve it using imitation learning.

  2. We present an approach to train a policy on the non-stationary distribution of event traces induced by the policy itself. We show that this implicitly results in the policy operating on the posterior distribution of world maps.

  3. We show that by imitating an oracle that has access to the world map and thus can plan optimal routes, the policy is able to learn non-myopic behavior. Since the oracle is executed only during train time, the computational burden does not affect online performance.

The remainder of this paper is organized as follows. Section II presents the formal problem, while Section III discusses related work. The algorithm is presented in Section IV and analyzed in Section V, and Section VI presents experimental results. Finally, we conclude in Section VII with discussion and thoughts on future work.

II Problem Statement

II-A Notation

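Let $\mathcal{V} = \{v_1, v_2, \dots, v_{|\mathcal{V}|}\}$ be a set of nodes corresponding to all sensing locations. The robot starts at node $v_s$. Let $\xi = (v_1, v_2, \dots, v_p)$ be a sequence of nodes (a path) such that $v_1 = v_s$. Let $\Xi$ be the set of all such paths. Let $\phi \in \mathcal{M}$ be the world map. Let $y \in \mathcal{Y}$ be a measurement received by the robot. Let $\mathcal{H} : \mathcal{V} \times \mathcal{M} \to \mathcal{Y}$ be a measurement function. When the robot is at node $v$ in a world map $\phi$, the measurement received by the robot is $y = \mathcal{H}(v, \phi)$. Let $\mathcal{F} : 2^{\mathcal{V}} \times \mathcal{M} \to \mathbb{R}_{\geq 0}$ be a utility function. For a path $\xi$ and a world map $\phi$, $\mathcal{F}(\xi, \phi)$ assigns a utility to executing the path on the world. Note that $\mathcal{F}$ is a set function. Given a node $v \in \mathcal{V}$, a set of nodes $V \subseteq \mathcal{V}$ and world $\phi$, the discrete derivative of the utility function is $\Delta_{\mathcal{F}}(v \mid V, \phi) = \mathcal{F}(V \cup \{v\}, \phi) - \mathcal{F}(V, \phi)$. Let $\mathcal{T} : \Xi \times \mathcal{M} \to \mathbb{R}_{\geq 0}$ be a travel cost function. For a path $\xi$ and a world map $\phi$, $\mathcal{T}(\xi, \phi)$ assigns a travel cost to executing the path on the world.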

II-B Problem Formulation

We first define the problem setting when the world map is fully known.

Problem 1 (Fully Observable World Map; Constrained Travel Cost).

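Given a world map $\phi$, a travel cost budget $B$ and a time horizon $T$, find a path $\xi \in \Xi$ that maximizes the utility $\mathcal{F}(\xi, \phi)$ subject to constraints on the travel cost $\mathcal{T}(\xi, \phi)$ and the cardinality $|\xi|$.

$$\arg\max_{\xi \in \Xi} \; \mathcal{F}(\xi, \phi) \quad \text{s.t.} \quad \mathcal{T}(\xi, \phi) \leq B, \quad |\xi| \leq T + 1 \tag{1}$$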
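To make Problem 1 concrete, the following Python sketch solves a tiny instance by brute force. Everything in it is an illustrative assumption rather than the paper's implementation: the node coordinates `COORDS`, the per-node visible items `VISIBLE`, the count-of-distinct-items utility, the Euclidean travel cost, and the restriction to paths that do not revisit nodes. The cardinality constraint is read here as at most `horizon` moves after the start node.

```python
import math
from itertools import permutations

# Hypothetical toy instance: node coordinates and the items visible from each node.
COORDS = {"s": (0, 0), "a": (1, 0), "b": (1, 1), "c": (3, 0)}
VISIBLE = {"s": set(), "a": {1, 2}, "b": {2, 3}, "c": {4, 5, 6}}


def utility(path):
    """Set-function utility: number of distinct items seen along the path."""
    seen = set()
    for v in path:
        seen |= VISIBLE[v]
    return len(seen)


def travel_cost(path):
    """Sum of Euclidean edge lengths along the path."""
    return sum(math.dist(COORDS[u], COORDS[v]) for u, v in zip(path, path[1:]))


def best_path(start, budget, horizon):
    """Enumerate every non-revisiting path from start and keep the best feasible one."""
    others = [v for v in COORDS if v != start]
    best, best_value = (start,), utility((start,))
    for k in range(1, min(horizon, len(others)) + 1):
        for tail in permutations(others, k):
            path = (start,) + tail
            if travel_cost(path) <= budget and utility(path) > best_value:
                best, best_value = path, utility(path)
    return best, best_value


print(best_path("s", budget=3.0, horizon=2))  # (('s', 'a', 'c'), 5)
```

Exhaustive enumeration is only viable for a handful of nodes, which is why approximation algorithms are used in practice.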

Now, consider the setting where the world map $\phi$ is unknown. Given a prior distribution $P(\phi)$, the map can be inferred only via the measurements $y_i$ received as the robot visits nodes $v_i$. Hence, instead of solving for a fixed path, we compute a policy that maps the history of measurements received and nodes visited to the next node to visit.

Problem 2 (Partially Observable World Map; Constrained Travel Cost).

Given a distribution of world maps, $P(\phi)$, a travel cost budget $B$ and a time horizon $T$, find a policy that at time $t$ maps the history of nodes visited $\{v_i\}_{i=1}^{t-1}$ and measurements received $\{y_i\}_{i=1}^{t-1}$ to the node $v_t$ to visit at time $t$, such that the expected utility is maximized subject to travel cost and cardinality constraints.

II-C Mapping to MDP and POMDP

II-C1 Mapping fully observable problems to MDP

The Markov Decision Process (MDP) is a tuple $(\mathcal{S}, \mathcal{M}, \mathcal{A}, \Omega, R, T)$ defined up to a fixed finite horizon $T$. It is defined over an augmented state space comprising the ego-motion state space $\mathcal{S}$ (which we will refer to as simply the state space) and the space of world maps $\mathcal{M}$.

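Let $s_t \in \mathcal{S}$ be the state of the robot at time $t$. It is defined as the set of nodes visited by the robot up to time $t$, $s_t = \{v_1, v_2, \dots, v_t\}$. This implies the dimension of the state space is exponentially large in the space of nodes, $|\mathcal{S}| = 2^{|\mathcal{V}|}$. The initial state $s_1 = v_s$ is the start node. Let $a_t \in \mathcal{A}$ be the action executed by the robot at time $t$. It is defined as the node visited at time $t+1$, $a_t = v_{t+1}$. The set of all actions is defined as $\mathcal{A} = \mathcal{V}$. Given a world map $\phi$, when the robot is at state $s$ the utility of executing an action $a$ is $\mathcal{F}(s \cup \{a\}, \phi)$. Let $\mathcal{A}_{\mathrm{feas}}(s, \phi) \subset \mathcal{A}$ be the set of feasible actions that the robot can execute when in state $s$ in a world map $\phi$. This is defined as follows

$$\mathcal{A}_{\mathrm{feas}}(s, \phi) = \{ a \mid a \in \mathcal{A}, \; \mathcal{T}(s \cup \{a\}, \phi) \leq B \} \tag{2}$$

Let $\Omega(s, a, s') = P(s' \mid s, a)$ be the state transition function. In our setting, this is the deterministic function $s' = s \cup \{a\}$. Let $R(s, \phi, a) \in [0, 1]$ be the one step reward function. It is defined as the normalized marginal gain of the utility function, $R(s, \phi, a) = \Delta_{\mathcal{F}}(a \mid s, \phi) / \mathcal{F}(\mathcal{A}, \phi)$. Let $\pi(s, \phi) \in \Pi$ be a policy that maps state $s$ and world map $\phi$ to a feasible action $a \in \mathcal{A}_{\mathrm{feas}}(s, \phi)$. The value of executing a policy $\pi$ for $t$ time steps on a world $\phi$ starting at state $s$ is

$$V_t^{\pi}(s, \phi) = \sum_{i=1}^{t} \mathbb{E}_{s_i \sim P(s' \mid s, \pi, i)} \left[ R(s_i, \phi, \pi(s_i, \phi)) \right] \tag{3}$$

where $P(s' \mid s, \pi, i)$ is the distribution of states at time $i$ starting from $s$ and following policy $\pi$. The state action value $Q_t^{\pi}(s, \phi, a)$ is the value of executing action $a$ in state $s$ in world $\phi$ and then following the policy $\pi$ for $t-1$ timesteps

$$Q_t^{\pi}(s, \phi, a) = R(s, \phi, a) + \mathbb{E}_{s' \sim P(s' \mid s, a)} \left[ V_{t-1}^{\pi}(s', \phi) \right] \tag{4}$$

The value of a policy $\pi$ for $T$ steps on a distribution of worlds $P(\phi)$ and starting states $P(s)$ is

$$J(\pi) = \mathbb{E}_{s \sim P(s),\, \phi \sim P(\phi)} \left[ V_T^{\pi}(s, \phi) \right] \tag{5}$$

The optimal MDP policy is $\pi^* = \arg\max_{\pi \in \Pi} J(\pi)$.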
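The MDP ingredients above (feasible actions, deterministic transition, normalized marginal-gain reward, and the value of a rollout) can be sketched in Python as follows. The toy world (`COORDS`, `VISIBLE`, `BUDGET`) and all function names are hypothetical. As simplifying assumptions, states are kept as ordered tuples so that a travel cost can be computed from them, and revisiting a node is disallowed.

```python
import math

# Hypothetical toy world: node coordinates, items visible per node, travel budget.
COORDS = {"s": (0, 0), "a": (1, 0), "b": (1, 1), "c": (3, 0)}
VISIBLE = {"s": set(), "a": {1, 2}, "b": {2, 3}, "c": {4, 5, 6}}
BUDGET = 3.0


def utility(nodes):
    """Number of distinct items seen from the given nodes."""
    seen = set()
    for v in nodes:
        seen |= VISIBLE[v]
    return len(seen)


def travel_cost(state):
    """Euclidean length of the visiting order stored in the state."""
    return sum(math.dist(COORDS[u], COORDS[v]) for u, v in zip(state, state[1:]))


def feasible_actions(state):
    """Unvisited nodes that can be appended without exceeding the budget."""
    return [a for a in COORDS
            if a not in state and travel_cost(state + (a,)) <= BUDGET]


def reward(state, action):
    """Marginal gain of the action, normalized by the utility of visiting all nodes."""
    gain = utility(state + (action,)) - utility(state)
    return gain / utility(COORDS)


def rollout_value(policy, state, steps):
    """Cumulative reward of following `policy` for at most `steps` steps."""
    total = 0.0
    for _ in range(steps):
        actions = feasible_actions(state)
        if not actions:
            break
        action = policy(state, actions)
        total += reward(state, action)
        state = state + (action,)  # deterministic transition
    return total


def greedy(state, actions):
    """Myopic policy: take the action with the largest one-step reward."""
    return max(actions, key=lambda a: reward(state, a))


print(rollout_value(greedy, ("s",), steps=2))  # 0.5
```

On this toy world the myopic policy spends the whole budget reaching `c` and collects a value of 0.5, whereas visiting `a` and then `c` collects 5/6, which illustrates why non-myopic reasoning matters under a travel budget.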

II-C2 Mapping partially observable problems to POMDP

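The Partially Observable Markov Decision Process (POMDP) is a tuple $(\mathcal{S}, \mathcal{M}, \mathcal{A}, \Omega, R, \mathcal{O}, Z, T)$. The first component of the augmented state space, the ego-motion state space $\mathcal{S}$, is fully observable. The second component, the space of world maps $\mathcal{M}$, is partially observable through observations received.

Let $o_t \in \mathcal{O}$ be the observation at time step $t$. This is defined as the measurement received by the robot, $o_t = y_t$. Let $Z(s, a, \phi, o) = P(o \mid s, a, \phi)$ be the probability of receiving an observation $o$ given the robot is at state $s$ and executes action $a$. In our setting, this is the deterministic function $o = \mathcal{H}(a, \phi)$.

Let the belief at time $t$, $\psi_t$, be the history of state, action, observation tuples received so far, i.e. $\psi_t = \{(s_i, a_i, o_i)\}_{i=1}^{t}$. Note that this differs from the conventional use of the word belief, which would usually imply a distribution. However, we use belief here to refer to the history of states, actions and observations conditioned on which one can infer the posterior distribution of world maps $P(\phi \mid \psi)$. Let the belief transition function be $P(\psi' \mid \psi, s, a)$. Let $\tilde{\pi}(s, \psi) \in \tilde{\Pi}$ be a policy that maps state $s$ and belief $\psi$ to a feasible action $a \in \mathcal{A}_{\mathrm{feas}}(s, \phi)$. The value of executing a policy $\tilde{\pi}$ for $t$ time steps starting at state $s$ and belief $\psi$ is

$$\tilde{V}_t^{\tilde{\pi}}(s, \psi) = \sum_{i=1}^{t} \mathbb{E}_{\psi_i \sim P(\psi' \mid \psi, \tilde{\pi}, i),\; s_i \sim P(s' \mid s, \tilde{\pi}, i),\; \phi \sim P(\phi \mid \psi_i)} \left[ R(s_i, \phi, \tilde{\pi}(s_i, \psi_i)) \right] \tag{6}$$

where $P(\psi' \mid \psi, \tilde{\pi}, i)$ is the distribution of beliefs at time $i$ starting from $\psi$ and following policy $\tilde{\pi}$. $P(\phi \mid \psi_i)$ is the posterior distribution on worlds given the belief $\psi_i$. Similarly the action value function $\tilde{Q}_t^{\tilde{\pi}}$ is defined as

$$\tilde{Q}_t^{\tilde{\pi}}(s, \psi, a) = \mathbb{E}_{\phi \sim P(\phi \mid \psi)} \left[ R(s, \phi, a) \right] + \mathbb{E}_{\psi' \sim P(\psi' \mid \psi, s, a),\; s' \sim P(s' \mid s, a)} \left[ \tilde{V}_{t-1}^{\tilde{\pi}}(s', \psi') \right] \tag{7}$$

The optimal POMDP policy can be expressed as

$$\pi^* = \arg\max_{\tilde{\pi} \in \tilde{\Pi}} \; \mathbb{E}_{t \sim U(1:T),\; s \sim P(s \mid \tilde{\pi}, t),\; \psi \sim P(\psi \mid \tilde{\pi}, t)} \left[ \tilde{Q}_{T-t+1}^{\tilde{\pi}}(s, \psi, \tilde{\pi}(s, \psi)) \right] \tag{8}$$

where $U(1:T)$ is a uniform distribution over the discrete interval $\{1, \dots, T\}$, $P(s \mid \tilde{\pi}, t)$ is the distribution of states following policy $\tilde{\pi}$ for $t$ steps, and $P(\psi \mid \tilde{\pi}, t)$ is the distribution of beliefs following policy $\tilde{\pi}$ for $t$ steps. The value of a policy $\tilde{\pi}$ for $T$ steps on a distribution of worlds $P(\phi)$, starting states $P(s)$ and starting belief $P(\psi)$ is

$$J(\tilde{\pi}) = \mathbb{E}_{s \sim P(s),\, \psi \sim P(\psi)} \left[ \tilde{V}_T^{\tilde{\pi}}(s, \psi) \right] \tag{9}$$

where the posterior world distribution $P(\phi \mid \psi)$ uses $P(\phi)$ as prior.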

III Related Work

Fig. 2: Overview of ExpLOre. The algorithm iteratively trains a learner $\hat{\pi}$ to imitate a clairvoyant oracle $\pi_{\mathrm{OR}}$. A world map $\phi$ is sampled from a database representing $P(\phi)$. A mixture policy $\pi_{\mathrm{mix}}$ of $\hat{\pi}$ and $\pi_{\mathrm{OR}}$ is used to roll-in on $\phi$ for a random timestep $t$ to get state $s_t$ and belief $\psi_t$. An exploratory action $a_t$ is chosen. The clairvoyant oracle $\pi_{\mathrm{OR}}$ is given full access to world map $\phi$ to compute the cumulative reward to go $Q^{\pi_{\mathrm{OR}}}_{T-t+1}(s_t, \phi, a_t)$. This datapoint, comprising the belief from roll-in $\psi_t$, the state $s_t$, the action $a_t$ and the value $Q^{\pi_{\mathrm{OR}}}$, is used to create a cost-sensitive classification problem that updates the learner $\hat{\pi}$.

Problem 1 is a submodular function optimization (due to the nature of the utility function $\mathcal{F}$) subject to routing constraints (due to the travel cost function $\mathcal{T}$). In the absence of this constraint, there is a large body of work on near-optimality of greedy strategies by Krause et al. [19, 21, 20]; however, naive greedy approaches can perform arbitrarily poorly. Chekuri and Pal [2] propose a quasi-polynomial time recursive greedy approach to solving this problem. Singh et al. [30] show how to scale up the approach to multiple robots. However, these methods are slow in practice. Iyer and Bilmes [14] solve a related problem of submodular maximization subject to submodular knapsack constraints using an iterative greedy approach. This inspires Zhang and Vorobeychik [35] to propose an elegant generalization of the cost benefit algorithm (GCB), which we use as an oracle in this paper. Yu et al. [34] frame the problem as a correlated orienteering problem and propose a mixed integer based approach; however, only correlations between neighboring nodes are considered. Hollinger and Sukhatme [12] use sampling based approaches, which require many evaluations of the utility function in practice.

Problem 2 in the absence of the travel cost constraint can be efficiently solved using the framework of adaptive submodularity developed by Golovin et al. [6, 7], as shown by Javdani et al. [16, 15] and Chen et al. [4, 3]. Hollinger et al. [10, 11] propose a heuristic based approach to select a subset of informative nodes and perform minimum cost tours. Singh et al. [31] replan every step using a non-adaptive information path planning algorithm. Such methods suffer when the adaptivity gap is large [10]. Inspired by adaptive TSP approaches by Gupta et al. [8], Lim et al. [25, 24] propose recursive coverage algorithms to learn policy trees. However, such methods cannot scale well to large state and observation spaces. Heng et al. [9] make a modular approximation of the objective function. Isler et al. [13] survey a broad number of myopic information gain based heuristics that work well in practice but have no formal guarantees.

Online POMDP planning also has a large body of work [17, 28, 22]. Although there exist fast solvers such as POMCP (Silver and Veness [29]) and DESPOT (Somani et al. [32]), the space of world maps is too large for online planning.

IV Approach

IV-A Overview

Fig. 2 shows an overview of our approach. The central idea is as follows: we train a policy to imitate an algorithm that has access to the world map at train time. The policy $\hat{\pi}$ maps features extracted from state $s$ and belief $\psi$ to an action $a$. The algorithm that is being imitated has access to the corresponding world map $\phi$.

IV-B Imitation Learning

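We now formally define imitation learning as applied to our setting. Let $\tilde{\pi}$ be a policy defined on a pair of state $s$ and belief $\psi$. Let roll-in be the process of executing a policy from the start up to a certain time horizon. Similarly, roll-out is the process of executing a policy from the current state and belief till the end. Let $P(s \mid \tilde{\pi})$ be the distribution of states induced by roll-in with policy $\tilde{\pi}$. Let $P(\psi \mid \tilde{\pi})$ be the distribution of beliefs induced by roll-in with policy $\tilde{\pi}$.

Let $\mathcal{L}(s, \psi, \tilde{\pi})$ be the loss of a policy $\tilde{\pi}$ when executed on state $s$ and belief $\psi$. This loss function implicitly captures how well policy $\tilde{\pi}$ imitates a reference policy (such as an oracle algorithm). Our goal is to find a policy $\hat{\pi}$ which minimizes the observed loss under its own induced distribution of states and beliefs.

$$\hat{\pi} = \arg\min_{\tilde{\pi} \in \tilde{\Pi}} \; \mathbb{E}_{s \sim P(s \mid \tilde{\pi}),\; \psi \sim P(\psi \mid \tilde{\pi})} \left[ \mathcal{L}(s, \psi, \tilde{\pi}) \right] \tag{10}$$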

This is a non-i.i.d. supervised learning problem. Ross and Bagnell [26] show how such problems can be reduced to no-regret online learning using dataset aggregation (DAgger). The loss function $\mathcal{L}$ they consider is a misclassification loss with respect to what the expert demonstrated. Ross and Bagnell [27] extend the approach to the reinforcement learning setting, where $\mathcal{L}$ is the cost-to-go of an oracle reference policy, by aggregating values to imitate (AggreVaTe).

IV-C Solving POMDP via Imitation of a Clairvoyant Oracle

When (8) is compared to the imitation learning framework in (10), we see that in addition to the induced state and belief distributions, the loss function analogue is the action value $\tilde{Q}_{T-t+1}^{\tilde{\pi}}(s, \psi, \tilde{\pi}(s, \psi))$. This implies rolling out with policy $\tilde{\pi}$. For poor policies $\tilde{\pi}$, the action value estimate $\tilde{Q}_{T-t+1}^{\tilde{\pi}}$ would be very different from the optimal values $\tilde{Q}_{T-t+1}^{\pi^*}$.

In our approach, we alleviate this problem by defining a surrogate value function to imitate: the cumulative reward gathered by a clairvoyant oracle.

Definition 1 (Clairvoyant Oracle).

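Given a distribution of world maps $P(\phi)$, a clairvoyant oracle $\pi_{\mathrm{OR}}(s, \phi)$ is a policy that maps state $s$ and world map $\phi$ to a feasible action $a \in \mathcal{A}_{\mathrm{feas}}(s, \phi)$ such that it approximates the optimal MDP policy, $\pi_{\mathrm{OR}} \approx \pi^*$.

The term clairvoyant is used because the oracle has full access to the world map $\phi$ at train time. The oracle can be used to compute the state action value as follows

$$Q_t^{\pi_{\mathrm{OR}}}(s, \phi, a) = R(s, \phi, a) + \mathbb{E}_{s' \sim P(s' \mid s, a)} \left[ V_{t-1}^{\pi_{\mathrm{OR}}}(s', \phi) \right] \tag{11}$$

Our approach is to imitate the oracle during training. This implies that we train a policy $\hat{\pi}$ by solving the following optimization problem

$$\hat{\pi} = \arg\max_{\tilde{\pi} \in \tilde{\Pi}} \; \mathbb{E}_{t \sim U(1:T),\; \phi \sim P(\phi),\; s \sim P(s \mid \tilde{\pi}, t),\; \psi \sim P(\psi \mid \phi, \tilde{\pi}, t)} \left[ Q_{T-t+1}^{\pi_{\mathrm{OR}}}(s, \phi, \tilde{\pi}(s, \psi)) \right] \tag{12}$$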

IV-D Algorithm

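1: Initialize $\mathcal{D} \leftarrow \emptyset$, $\hat{\pi}_1$ to any policy in $\tilde{\Pi}$
2: for $i = 1$ to $N$ do
3:     Initialize sub dataset $\mathcal{D}_i \leftarrow \emptyset$
4:     Let roll in policy be $\pi_{\mathrm{mix}} = \beta_i \pi_{\mathrm{OR}} + (1 - \beta_i) \hat{\pi}_i$
5:     Collect $m$ data points as follows:
6:     for $j = 1$ to $m$ do
7:         Sample world map $\phi$ from dataset $P(\phi)$
8:         Sample uniformly $t \in \{1, 2, \dots, T\}$
9:         Assign initial state $s_1 = v_s$
10:         Execute $\pi_{\mathrm{mix}}$ up to time $t-1$ to reach $(s_t, \psi_t)$
11:         Execute any action $a_t \in \mathcal{A}_{\mathrm{feas}}(s_t, \phi)$
12:         Execute oracle $\pi_{\mathrm{OR}}$ from $t+1$ to $T$ on $\phi$
13:         Collect value to go $Q_i^{\pi_{\mathrm{OR}}} = Q_{T-t+1}^{\pi_{\mathrm{OR}}}(s_t, \phi, a_t)$
14:         $\mathcal{D}_i \leftarrow \mathcal{D}_i \cup \{s_t, \psi_t, a_t, t, Q_i^{\pi_{\mathrm{OR}}}\}$
15:     Aggregate datasets: $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}_i$
16:     Train cost-sensitive classifier $\hat{\pi}_{i+1}$ on $\mathcal{D}$
17:     (Alternately: use any online learner $\hat{\pi}_{i+1}$ on $\mathcal{D}_i$)
18: Return best $\hat{\pi}_i$ on validation
Algorithm 1 ExpLOre: Imitation Learning of Oracle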

Alg. 1 describes the ExpLOre algorithm. The algorithm iteratively trains a sequence of learnt policies $(\hat{\pi}_1, \hat{\pi}_2, \dots, \hat{\pi}_N)$ by aggregating data for an online cost-sensitive classification problem.

$\hat{\pi}_1$ is initialized as a random policy (Line 1). At iteration $i$, the policy that is used to roll-in is a mixture policy $\pi_{\mathrm{mix}}$ of the learnt policy $\hat{\pi}_i$ and the oracle policy $\pi_{\mathrm{OR}}$ (Line 4) using mixture parameter $\beta_i$. A set of $m$ cost-sensitive classification datapoints is captured as follows: a world $\phi$ is sampled (Line 7). The policy $\pi_{\mathrm{mix}}$ is used to roll-in up to a random time $t-1$ from an initial state $s_1$ to reach $(s_t, \psi_t)$ (Lines 8-10). An exploratory action $a_t$ is selected (Line 11). The clairvoyant oracle is given full access to $\phi$ and asked to roll-out and provide an action value $Q^{\pi_{\mathrm{OR}}}$ (Lines 12-13). The tuple $\{s_t, \psi_t, a_t, t, Q^{\pi_{\mathrm{OR}}}\}$ is added to a dataset $\mathcal{D}_i$ of the cost-sensitive classification problem and the process is repeated (Line 14). $\mathcal{D}_i$ is appended to the original dataset $\mathcal{D}$ and used to train an updated learner $\hat{\pi}_{i+1}$ (Lines 15-17). The algorithm returns the best learner from the sequence based on performance on a held out validation set (or alternatively returns a mixture policy of all learnt policies). One can also try variants where all actions are executed, or an online learner is used to update $\hat{\pi}$ instead of dataset aggregation [27].
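The data collection and aggregation loop of Alg. 1 can be sketched in Python on a toy problem. This is a simplified illustration under stated assumptions, not the authors' implementation: worlds are hypothetical 1D maps whose information lies in one contiguous block, there is no travel budget and no fixed start node, the oracle's value to go is computed in closed form, the roll-in mixes oracle and learner per step with an assumed decay schedule for the mixing weight, and a lookup-table average (`TableRegressor`) stands in for the cost-sensitive learner.

```python
import random
from collections import defaultdict

N_NODES, HORIZON = 12, 3  # hypothetical toy sizes


def sample_world(rng):
    """Hypothetical world: information sits in one contiguous block of 3 nodes."""
    start = rng.randrange(N_NODES - 2)
    return {start, start + 1, start + 2}


def reward(state, world, action):
    """Normalized marginal gain of visiting `action`."""
    return (action in world and action not in state) / len(world)


def oracle_value(state, world, action, steps_left):
    """Value to go of `action` followed by an oracle that knows `world`."""
    remaining = len(world - set(state) - {action})
    return reward(state, world, action) + min(remaining, steps_left - 1) / len(world)


def oracle_policy(state, world, actions):
    return max(actions, key=lambda a: reward(state, world, a))


def features(state, belief, action):
    """Distance from `action` to the nearest node observed to hold information."""
    hits = [v for v, obs in belief if obs]
    return min((abs(action - v) for v in hits), default=-1)


class TableRegressor:
    """Averages the target value per feature key (stand-in for a real regressor)."""

    def __init__(self):
        self.mean = {}

    def fit(self, data):
        sums = defaultdict(lambda: [0.0, 0])
        for key, target in data:
            sums[key][0] += target
            sums[key][1] += 1
        self.mean = {k: s / n for k, (s, n) in sums.items()}

    def predict(self, key):
        return self.mean.get(key, 0.0)


def learner_policy(model, state, belief, actions):
    return max(actions, key=lambda a: model.predict(features(state, belief, a)))


def explore(n_iters=5, m=200, seed=0):
    rng = random.Random(seed)
    model, dataset = TableRegressor(), []
    for i in range(n_iters):
        beta = 1.0 if i == 0 else 0.5 ** i  # oracle mixing weight (assumed schedule)
        for _ in range(m):
            world = sample_world(rng)
            t = rng.randint(1, HORIZON)
            state, belief = [], []
            for _ in range(t - 1):  # roll in with the mixture policy
                actions = [v for v in range(N_NODES) if v not in state]
                if rng.random() < beta:
                    a = oracle_policy(state, world, actions)
                else:
                    a = learner_policy(model, state, belief, actions)
                state.append(a)
                belief.append((a, a in world))
            actions = [v for v in range(N_NODES) if v not in state]
            a = rng.choice(actions)  # exploratory action
            q = oracle_value(state, world, a, HORIZON - t + 1)
            dataset.append((features(state, belief, a), q))
        model.fit(dataset)  # train on the aggregated dataset
    return model
```

Calling `explore()` returns a fitted `TableRegressor`; acting greedily on its predictions via `learner_policy` gives a policy that uses only the history of observations, while the world map was needed only to label the training data.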

V Analysis

Following the analysis style of AggreVaTe [27], we first introduce a hallucinating oracle.

Definition 2 (Hallucinating Oracle).

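Given a prior distribution of world maps $P(\phi)$, a hallucinating oracle $\tilde{\pi}_{\mathrm{OR}}$ computes the instantaneous posterior distribution over world maps and takes the action with the highest expected value.

$$\tilde{\pi}_{\mathrm{OR}}(s, \psi) = \arg\max_{a \in \mathcal{A}} \; \mathbb{E}_{\phi' \sim P(\phi' \mid \psi)} \left[ Q_{T-t+1}^{\pi_{\mathrm{OR}}}(s, \phi', a) \right] \tag{13}$$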

While the hallucinating oracle is not the optimal POMDP policy (8), it is an effective policy for information gathering as alluded to in [18] and we now show that we effectively imitate it.

Lemma 1.

The policy optimization rule in (12) is equivalent to

by using the fact that

Consequently our learnt policy has the following guarantee

Theorem 1.

N iterations of ExpLOre, collecting regression examples per iteration guarantees that with probability at least

where is the empirical average online learning regret on the training regression examples collected over the iterations and is the empirical regression regret of the best regressor in the policy class.

For both proofs, refer to [5].

VI Experimental Results

Fig. 3: Comparison of ExpLOre with baseline heuristics on a 2D exploration problem on 2 different datasets - dataset 1 (concentrated information) and dataset 2 (distributed information). The problem details are: . Sample world maps from (a) dataset 1 and (b) dataset 2. Training dataset is created with world maps, each with random node sets to create a dataset size of . Test results on representative world map with random node sets are shown for (c) dataset 1 and (d) dataset 2. A sample test instance is shown along with a plot of cumulative reward with time steps for ExpLOre and other baseline heuristics. The error bars show confidence intervals. Snapshots of execution of ExpLOre, Rear Side Voxel and Average Entropy are shown for (e) dataset 1 and (f) dataset 2. The snapshots show the evidence grid at time steps and .
Fig. 4: Comparison of ExpLOre with baseline heuristics on a 3D exploration problem where training is done on simulated world maps and testing is done on a real dataset of an office workspace. The problem details are: , , . (a) Samples from simulated worlds resembling an office workspace created in Gazebo. (b) Real dataset collected by [33] using a RGBD camera. (c) Plot of cumulative reward with time steps for ExpLOre and baseline heuristics on the real dataset. (d) The 3D model of the real office workspace formed by cumulating measurements from all poses. (e) Snapshots of execution of Occlusion Aware heuristic at time steps . (f) Snapshots of execution of ExpLOre heuristic at time steps .
Fig. 5: Comparison of ExpLOre with baseline heuristics on a number of experiments both 2D and 3D. Each row corresponds to different datasets. The columns contain information about the dataset, representative pictures and performance results for all algorithms. The numbers are the lower and upper confidence (for 95% CI) of cumulative reward at the final time step. The algorithm with the highest median performance is emphasized in bold for each dataset.

VI-A Implementation Details

Our implementation is open source and available for MATLAB and C++ at goo.gl/HXNQwS.

VI-A1 Problem Details

The utility function $\mathcal{F}$ is selected to be a fractional coverage function (similar to [13]), defined as follows. The world map $\phi$ is represented as a voxel grid representing the surface of a 3D model. The sensor measurement $\mathcal{H}(v, \phi)$ at node $v$ is obtained by raycasting on this 3D model. A voxel of the model is said to be ‘covered’ by a measurement received at a node if a point of that measurement lies in the voxel. The coverage of a path $\xi$ is the fraction of voxels covered by the union of the measurements received when visiting each node of the path. The travel cost function $\mathcal{T}$ is chosen to be the Euclidean distance. The total number of time steps $T$ and the travel budget $B$ vary with problem instances and are specified along with the results.
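As an illustration of the fractional coverage utility and Euclidean travel cost described above, here is a small Python sketch. The voxel size, the function names, and the representation of a measurement as a list of 3D points already expressed in the model frame are assumptions made for the example; raycasting itself is not modeled.

```python
import math

VOXEL_SIZE = 1.0  # hypothetical grid resolution


def voxel_of(point):
    """Index of the voxel that contains a 3D point."""
    return tuple(math.floor(c / VOXEL_SIZE) for c in point)


def coverage(path, measurements, surface_voxels):
    """Fraction of surface voxels hit by at least one point measured along the path."""
    covered = set()
    for node in path:
        covered |= {voxel_of(p) for p in measurements[node]}
    return len(covered & surface_voxels) / len(surface_voxels)


def travel_cost(path, positions):
    """Euclidean length of the path through the node positions."""
    return sum(math.dist(positions[u], positions[v]) for u, v in zip(path, path[1:]))


# Hypothetical data: four surface voxels and two sensing nodes.
surface = {(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)}
measurements = {
    "n0": [(0.2, 0.5, 0.1), (1.7, 0.3, 0.9)],
    "n1": [(1.1, 0.1, 0.4), (2.5, 0.5, 0.5)],
}
positions = {"n0": (0.0, -2.0, 0.0), "n1": (3.0, -2.0, 0.0)}

print(coverage(["n0", "n1"], measurements, surface))  # 0.75
print(travel_cost(["n0", "n1"], positions))           # 3.0
```

Because the covered voxels are accumulated in a set, visiting a node whose measurement overlaps earlier ones adds only the newly covered voxels, which is the diminishing-returns behavior expected of a coverage utility.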

VI-A2 Oracle Algorithm

We use the generalized cost benefit algorithm (GCB) [35] as the oracle algorithm owing to its small run times and acceptable solution qualities.
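The flavor of a cost-benefit oracle can be illustrated with a much simpler greedy rule: repeatedly move to the feasible node with the largest marginal gain per unit of additional travel cost. The sketch below is a simplified stand-in written for illustration, not the GCB algorithm of [35], and the toy world (`positions`, `visible`) is hypothetical.

```python
import math


def cost_benefit_path(start, positions, visible, budget, horizon):
    """Greedily append the node with the best marginal gain per unit travel cost."""
    path, seen, spent = [start], set(visible[start]), 0.0
    for _ in range(horizon):
        best, best_ratio = None, 0.0
        for v in positions:
            if v in path:
                continue
            step = math.dist(positions[path[-1]], positions[v])
            gain = len(visible[v] - seen)
            if spent + step <= budget and step > 0 and gain / step > best_ratio:
                best, best_ratio = v, gain / step
        if best is None:
            break  # no feasible node adds information
        spent += math.dist(positions[path[-1]], positions[best])
        seen |= visible[best]
        path.append(best)
    return path, len(seen)


# Hypothetical toy world: node coordinates and items visible from each node.
positions = {"s": (0, 0), "a": (1, 0), "b": (1, 1), "c": (3, 0)}
visible = {"s": set(), "a": {1, 2}, "b": {2, 3}, "c": {4, 5, 6}}

print(cost_benefit_path("s", positions, visible, budget=3.0, horizon=2))
# (['s', 'a', 'c'], 5)
```

Weighing gain against travel cost lets the rule pick up the nearby node `a` before spending the rest of the budget on `c`, whereas a rule that looks only at gain would go straight to `c` and exhaust the budget.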

VI-A3 Learning Details

The tuple $(s, a, \psi)$ is mapped to a vector of features $f = [f_{\mathrm{IG}}^{T} \; f_{\mathrm{mot}}^{T}]^{T}$. The feature vector $f_{\mathrm{IG}}$ is a vector of information gain metrics as described in [13]. $f_{\mathrm{mot}}$ encodes the relative rotation and translation required to visit a node. We use a random forest [23] to regress to values of $Q^{\pi_{\mathrm{OR}}}$ from features $f$. The learning details are specified in Table I.

Problem | Train Dataset | Test Dataset | ExpLOre Iterations | Feature Dimension
2D
3D
TABLE I: Learning Details

VI-A4 Baseline

For baseline policies, we compare to the class of information gain heuristics discussed in [13]. The heuristics are remarkably effective; however, their performance depends on the distribution of objects in a world map. As ExpLOre uses these heuristic values as part of its feature vector, it will implicitly learn a data-driven trade-off between them.

VI-B 2D Exploration

We create a set of 2D exploration problems to gain a better understanding of the behavior of ExpLOre and the baseline heuristics. A dataset comprises 2D binary world maps, uniformly distributed nodes and a simulated laser. The training size is , , .

VI-B1 Dataset 1: Concentrated Information

Fig. 3 shows a dataset created by applying random affine transformations to a pair of parallel lines. This dataset is representative of information being concentrated in a particular fashion. Fig. 3 shows a comparison of ExpLOre with baseline heuristics. The heuristic Rear Side Voxel performs the best, while ExpLOre is able to match the heuristic. Fig. 3 shows progress of ExpLOre along with two relevant heuristics - Rear Side Voxel and Average Entropy. Rear Side Voxel takes small steps focusing on exploiting viewpoints along the already observed area. Average Entropy aggressively visits unexplored area which is mainly free space. ExpLOre initially explores the world but on seeing parts of the lines reverts to exploiting the area around it.

VI-B2 Dataset 2: Distributed Information

Fig. 3 shows a dataset created by randomly distributing rectangular blocks around the periphery of the map. This dataset is representative of information being distributed around. Fig. 3 shows that the heuristic Average Entropy performs the best, while ExpLOre is able to match the heuristic. Rear Side Voxel saturates early on and performs worse. Fig. 3 shows that Rear Side Voxel gets stuck exploiting an island of information. Average Entropy takes broader sweeps of the area thus gaining more information about the world. ExpLOre also shows a similar behavior of exploring the world map.

Thus we see that on changing the dataset the relative performance of the heuristics reverses, while our data-driven approach is able to adapt seamlessly.

VI-B3 Other Datasets

Fig. 5 shows results from other 2D datasets such as random disks and block worlds, where ExpLOre is able to outperform all heuristics.

VI-C 3D Exploration

We create a set of 3D exploration problems to test the algorithm on more realistic scenarios. The datasets comprise 3D worlds created in Gazebo and a simulated Kinect.

VI-C1 Train on Synthetic, Test on Real

To show the practical usage of our pipeline, we show a scenario where a policy is trained on synthetic data and tested on a real dataset.

Fig. 4 shows some sample worlds created in Gazebo to represent an office desk environment on which ExpLOre is trained. Fig. 4 shows a dataset of an office desk collected by the TUM Computer Vision Group [33]. The dataset is parsed to create pairs of pose and registered point cloud, which can then be used to evaluate different algorithms. Fig. 4 shows that ExpLOre outperforms all heuristics. Fig. 4 shows ExpLOre getting good coverage of the desk while the best heuristic, Occlusion Aware, misses out on the rear side of the desk.

VI-C2 Other Datasets

Fig. 5 shows more datasets where training and testing are done on synthetic worlds.

VII Conclusion

We presented a novel data-driven imitation learning framework to solve budgeted information gathering problems. Our approach, ExpLOre, trains a policy to imitate a clairvoyant oracle that has full information about the world and can compute non-myopic plans to maximize information. The effectiveness of ExpLOre can be attributed to two main reasons: Firstly, as the distribution of worlds varies, the clairvoyant oracle is able to adapt and consequently ExpLOre adapts as well. Secondly, as the oracle computes non-myopic solutions, imitating it allows ExpLOre to also learn non-myopic behaviors.

VIII Acknowledgement

The authors thank Sankalp Arora for insightful discussions and open source code for exploration in MATLAB.

References

  • [1] Benjamin Charrow, Gregory Kahn, Sachin Patil, Sikang Liu, Ken Goldberg, Pieter Abbeel, Nathan Michael, and Vijay Kumar. Information-theoretic planning with trajectory optimization for dense 3d mapping. In RSS, 2015.
  • [2] Chandra Chekuri and Martin Pal. A recursive greedy algorithm for walks in directed graphs. In FOCS, 2005.
  • [3] Yuxin Chen, S. Hamed Hassani, and Andreas Krause. Near-optimal bayesian active learning with correlated and noisy tests. CoRR, abs/1605.07334, 2016.
  • [4] Yuxin Chen, Shervin Javdani, Amin Karbasi, J. Bagnell, Siddhartha Srinivasa, and Andreas Krause. Submodular surrogates for value of information, 2015.
  • [5] Sanjiban Choudhury. Learning to gather information via imitation: Proofs. goo.gl/GJfg7r, 2016.
  • [6] Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. JAIR, 2011.
  • [7] Daniel Golovin, Andreas Krause, and Debajyoti Ray. Near-optimal bayesian active learning with noisy observations. In NIPS, 2010.
  • [8] Anupam Gupta, Viswanath Nagarajan, and R Ravi. Approximation algorithms for optimal decision trees and adaptive tsp problems. In International Colloquium on Automata, Languages, and Programming, 2010.
  • [9] Lionel Heng, Alkis Gotovos, Andreas Krause, and Marc Pollefeys. Efficient visual exploration and coverage with a micro aerial vehicle in unknown environments. In ICRA.
  • [10] Geoffrey A Hollinger, Brendan Englot, Franz S Hover, Urbashi Mitra, and Gaurav S Sukhatme. Active planning for underwater inspection and the benefit of adaptivity. IJRR, 2012.
  • [11] Geoffrey A Hollinger, Urbashi Mitra, and Gaurav S Sukhatme. Active classification: Theory and application to underwater inspection. arXiv preprint arXiv:1106.5829, 2011.
  • [12] Geoffrey A Hollinger and Gaurav S Sukhatme. Sampling-based motion planning for robotic information gathering. In RSS, 2013.
  • [13] Stefan Isler, Reza Sabzevari, Jeffrey Delmerico, and Davide Scaramuzza. An information gain formulation for active volumetric 3d reconstruction. In ICRA, 2016.
  • [14] Rishabh K Iyer and Jeff A Bilmes. Submodular optimization with submodular cover and submodular knapsack constraints. In NIPS, 2013.
  • [15] Shervin Javdani, Yuxin Chen, Amin Karbasi, Andreas Krause, J. Andrew (Drew) Bagnell, and Siddhartha Srinivasa. Near optimal bayesian active learning for decision making. In AISTATS, 2014.
  • [16] Shervin Javdani, Matthew Klingensmith, J. Andrew (Drew) Bagnell, Nancy Pollard, and Siddhartha Srinivasa. Efficient touch based localization through submodularity. In ICRA, 2013.
  • [17] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 1998.
  • [18] Michael Koval, Nancy Pollard, and Siddhartha Srinivasa. Pre- and post-contact policy decomposition for planar contact manipulation under uncertainty. In RSS, 2014.
  • [19] Andreas Krause and Daniel Golovin. Submodular function maximization. Tractability: Practical Approaches to Hard Problems, 2012.
  • [20] Andreas Krause and Carlos Guestrin. Near-optimal observation selection using submodular functions. In AAAI, 2007.
  • [21] Andreas Krause, Jure Leskovec, Carlos Guestrin, Jeanne VanBriesen, and Christos Faloutsos. Efficient sensor placement optimization for securing large water distribution networks. Journal of Water Resources Planning and Management, 2008.
  • [22] Hanna Kurniawati, David Hsu, and Wee Sun Lee. Sarsop: Efficient point-based pomdp planning by approximating optimally reachable belief spaces. In RSS, 2008.
  • [23] Andy Liaw and Matthew Wiener. Classification and regression by randomforest.
  • [24] Zhan Wei Lim, David Hsu, and Wee Sun Lee. Adaptive stochastic optimization: From sets to paths. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, NIPS. 2015.
  • [25] Zhan Wei Lim, David Hsu, and Wee Sun Lee. Adaptive informative path planning in metric spaces. IJRR, 2016.
  • [26] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In AISTATS, 2010.
  • [27] Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.
  • [28] Guy Shani, Joelle Pineau, and Robert Kaplow. A survey of point-based pomdp solvers. Autonomous Agents and Multi-Agent Systems, 2013.
  • [29] David Silver and Joel Veness. Monte-carlo planning in large pomdps. In NIPS, 2010.
  • [30] Amarjeet Singh, Andreas Krause, Carlos Guestrin, William Kaiser, and Maxim Batalin. Efficient planning of informative paths for multiple robots. In IJCAI, 2007.
  • [31] Amarjeet Singh, Andreas Krause, and William J. Kaiser. Nonmyopic adaptive informative path planning for multiple robots. In IJCAI, 2009.
  • [32] Adhiraj Somani, Nan Ye, David Hsu, and Wee Sun Lee. Despot: Online pomdp planning with regularization. In NIPS, 2013.
  • [33] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of rgb-d slam systems. In IROS, 2012.
  • [34] Jingjin Yu, Mac Schwager, and Daniela Rus. Correlated orienteering problem and its application to informative path planning for persistent monitoring tasks. In IROS, 2014.
  • [35] Haifeng Zhang and Yevgeniy Vorobeychik. Submodular optimization with routing constraints, 2016.