This paper considers the budgeted information gathering problem. Our aim is to maximally explore a world with a robot that has a budget on the total amount of movement due to battery constraints. This problem fundamentally recurs in mobile robot applications such as autonomous mapping of environments using ground and aerial robots [1, 9], monitoring of water bodies and inspecting models for 3D reconstruction [13, 11].
The nature of “interesting” objects in an environment and their spatial distribution influence the optimal trajectory a robot might take to explore the environment. As a result, it is important that a robot learns about the type of environment it is exploring as it acquires more information and adapts its exploration trajectories accordingly. This adaptation must be done online, and we provide such an algorithm in this paper.
To illustrate our point, consider two extreme examples of environments for a particular mapping problem, shown in Fig. 1. Consider a robot equipped with a sensor (RGBD camera) that needs to generate a map of an unknown environment. It is given a prior distribution about the geometry of the world, but has no other information. This geometry could include very diverse settings. First, it could include a world with only one ladder whose form must be explored (Fig. 1), which is a very dense setting. Second, it could include a sparse setting with spatially distributed objects, such as a construction site (Fig. 1).
The important task for the robot is to now try to infer which type of environment it is in based on the history of measurements, and thus plan an efficient trajectory. At every time step, the robot visits a sensing location and receives a sensor measurement (e.g. depth image) that has some amount of information utility (e.g. surface coverage of objects with point cloud). As opposed to naive lawnmower-coverage patterns, it will be more efficient if the robot could use a policy that maps the history of locations visited and measurements received to decide which location to visit next such that it maximizes the amount of information gathered in the finite amount of battery time available.
The ability of such a learnt policy to gather information efficiently depends on the prior distribution of worlds in which the robot has been shown how to navigate optimally. Fig. 1 shows an efficient learnt policy for inspecting a ladder, which executes a helical motion around parts of the ladder already observed to efficiently uncover new parts without searching naively. This is efficient because, given the prior distribution, the robot learns that information is likely to be geometrically concentrated in a particular volume given its initial observations of parts of the ladder. Similarly, Fig. 1 shows an effective policy for exploring construction sites by executing large sweeping motions. Here again the robot learns from prior experience that wide, sweeping motions are efficient since it has learnt that information is likely to be dispersed in such scenarios.
Thus our requirements for an efficient information-gathering policy can be distilled to two points:
Reasoning about posterior distribution over world maps: The robot should use the history of movements and measurements to infer a posterior distribution of worlds. This can be used to infer locations that are likely to contain information and to plan a trajectory efficiently. However, the space of world maps is very large, and it is intractable to compute this posterior online.
Reasoning about non-myopic value of information: Even if the robot is able to compute the posterior and hence the value of information at a location, it has to be cognizant of the travel cost to get to that location. It needs to exhibit non-myopic behavior to achieve a trade-off that maximizes the overall information gathered. Performing such planning at every step is prohibitively expensive.
Even though it is natural to think of this problem setting as a POMDP, we frame this problem as a novel data-driven imitation learning problem. We propose an algorithm ExpLOre (Exploration by Learning to Imitate an Oracle) that trains a policy on a dataset of worlds by imitating a clairvoyant oracle. During the training process, the oracle has full information about the world map (and is hence clairvoyant) and plans movements to maximize information. The policy is then trained to imitate these movements as best as it can using partial information from the current history of movements and measurements. As a result of our novel formulation, we are able to sidestep a number of challenging issues in POMDPs, such as explicitly computing the posterior distribution over worlds and planning in belief space.
Our contributions are as follows:
We map the budgeted information gathering problem to a POMDP and present an approach to solve it using imitation learning.
We present an approach to train a policy on the non-stationary distribution of event traces induced by the policy itself. We show that this implicitly results in the policy operating on the posterior distribution of world maps.
We show that by imitating an oracle that has access to the world map and thus can plan optimal routes, the policy is able to learn non-myopic behavior. Since the oracle is executed only during train time, the computational burden does not affect online performance.
The remainder of this paper is organized as follows. Section II presents the formal problem, while Section III contains relevant work. The algorithm is presented in Section IV and Section VI presents experimental results. Finally we conclude in Section VII with discussions and thoughts on future work.
II Problem Statement
Let be a set of nodes corresponding to all sensing locations. The robot starts at node . Let be a sequence of nodes (a path) such that . Let be the set of all such paths. Let be the world map. Let be a measurement received by the robot. Let be a measurement function. When the robot is at node in a world map , the measurement received by the robot is . Let be a utility function. For a path and a world map , assigns a utility to executing the path on the world. Note that is a set function. Given a node , a set of nodes and world , the discrete derivative of the utility function is Let be a travel cost function. For a path and a world map , assigns a travel cost to executing the path on the world.
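The set-function view of the utility and its discrete derivative can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in: the toy world maps each node to the set of cells it observes, `coverage_utility` is an assumed utility, and `path_cost` assumes Euclidean travel cost between node positions. The discrete derivative is the standard marginal gain, the utility of the visited set with the new node added minus the utility of the visited set alone.

```python
import math


def coverage_utility(nodes, world):
    """Toy set utility (assumption): fraction of all cells seen from `nodes`."""
    all_cells = set().union(*world.values())
    seen = set().union(*(world[v] for v in nodes))
    return len(seen) / len(all_cells)


def marginal_gain(utility, node, visited, world):
    """Discrete derivative: utility(visited + node) - utility(visited)."""
    visited = frozenset(visited)
    return utility(visited | {node}, world) - utility(visited, world)


def path_cost(path, positions):
    """Travel cost of a path, assumed Euclidean between consecutive nodes."""
    return sum(math.dist(positions[a], positions[b])
               for a, b in zip(path, path[1:]))


# Hypothetical toy world: node -> cells observed from that node.
world = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
positions = {"a": (0, 0), "b": (3, 4), "c": (3, 0)}

if __name__ == "__main__":
    print(marginal_gain(coverage_utility, "b", set(), world))   # gain from scratch
    print(marginal_gain(coverage_utility, "b", {"a"}, world))   # smaller gain after "a"
    print(path_cost(["a", "b", "c"], positions))
```

In this toy world the gain of visiting "b" shrinks once "a" has been visited, because the two nodes observe an overlapping cell.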
II-B Problem Formulation
We first define the problem setting when the world map is fully known.
Problem 1 (Fully Observable World Map; Constrained Travel Cost).
Given a world map , a travel cost budget and a time horizon , find a path that maximizes utility subject to travel cost and cardinality constraints.
Now, consider the setting where the world map is unknown. Given a prior distribution , it can be inferred only via the measurements received as the robot visits nodes . Hence, instead of solving for a fixed path, we compute a policy that maps history of measurements received and nodes visited to decide which node to visit.
Problem 2 (Partially Observable World Map; Constrained Travel Cost).
Given a distribution of world maps, , a travel cost budget and a time horizon , find a policy that at time , maps the history of nodes visited and measurements received to compute node to visit at time , such that the expected utility is maximized subject to travel cost and cardinality constraints.
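For concreteness, Problems 1 and 2 can be written compactly as below. The symbols are assumptions chosen for illustration: $\xi$ is a path, $\Xi$ the set of paths, $\phi$ the world map, $\mathcal{F}$ the utility, $\mathcal{T}$ the travel cost, $B$ the budget, $T$ the time horizon, $P(\phi)$ the prior, and $\xi(\pi, \phi)$ the path produced by running policy $\pi$ on world $\phi$. The exact form of the cardinality constraint is also an assumption.

```latex
% Problem 1 (world map known)
\xi^{*} = \arg\max_{\xi \in \Xi} \; \mathcal{F}(\xi, \phi)
\quad \text{s.t.} \quad \mathcal{T}(\xi, \phi) \le B, \quad |\xi| \le T + 1

% Problem 2 (world map drawn from a prior)
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\phi \sim P(\phi)}
  \left[ \mathcal{F}\big(\xi(\pi, \phi), \phi\big) \right]
\quad \text{s.t.} \quad \mathcal{T}\big(\xi(\pi, \phi), \phi\big) \le B, \quad
  |\xi(\pi, \phi)| \le T + 1
```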
II-C Mapping to MDP and POMDP
II-C1 Mapping fully observable problems to MDP
The Markov Decision Process (MDP) is a tuple defined up to a fixed finite horizon . It is defined over an augmented state space comprising the ego-motion state space (which we will refer to as simply the state space) and the space of world maps .
Let be the state of the robot at time . It is defined as the set of nodes visited by the robot up to time , . This implies the dimension of the state space is exponentially large in the space of nodes, . The initial state is the start node. Let be the action executed by the robot at time . It is defined as the node visited at time , . The set of all actions is defined as . Given a world map , when the robot is at state the utility of executing an action is . Let be the set of feasible actions that the robot can execute when in state in a world map . This is defined as follows
Let be the state transition function. In our setting, this is the deterministic function . Let be the one step reward function. It is defined as the normalized marginal gain of the utility function, . Let be a policy that maps state and world map to a feasible action . The value of executing a policy for time steps on a world starting at state .
where is the distribution of states at time starting from and following policy . The state action value is the value of executing action in state in world and then following the policy for timesteps
The value of a policy for steps on a distribution of worlds and starting states
The optimal MDP policy is .
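The MDP ingredients defined in this subsection (state, feasible actions, deterministic transition, normalized one-step reward) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the state is kept as an ordered tuple so that travel cost can be computed, revisits are excluded from the feasible set, travel cost is Euclidean, and the marginal gain is normalized by the utility of visiting every node. The helper `count_cells` and the toy data are hypothetical.

```python
import math


def travel_cost(path, positions):
    """Assumed Euclidean travel cost along an ordered path."""
    return sum(math.dist(positions[a], positions[b])
               for a, b in zip(path, path[1:]))


def feasible_actions(state, positions, budget):
    """Unvisited nodes that can be appended without exceeding the budget."""
    return [v for v in positions
            if v not in state
            and travel_cost(state + (v,), positions) <= budget]


def transition(state, action):
    """Deterministic transition: the visited node is appended to the state."""
    return state + (action,)


def reward(state, action, world, utility):
    """Marginal gain of `action`, normalized by the utility of all nodes (assumption)."""
    total = utility(frozenset(world), world)
    before = utility(frozenset(state), world)
    after = utility(frozenset(state) | {action}, world)
    return (after - before) / total if total > 0 else 0.0


def count_cells(nodes, world):
    """Hypothetical utility: number of distinct cells observed from `nodes`."""
    return len(set().union(*(world[v] for v in nodes)))


# Hypothetical toy instance.
positions = {"s": (0, 0), "a": (1, 0), "b": (5, 0)}
world = {"s": {0}, "a": {0, 1}, "b": {2, 3}}

if __name__ == "__main__":
    state = ("s",)
    print(feasible_actions(state, positions, budget=3.0))
    print(reward(state, "a", world, count_cells))
    print(transition(state, "a"))
```

With a budget of 3, node "b" is out of reach from the start node, so only "a" is feasible.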
II-C2 Mapping partially observable problems to POMDP
The Partially Observable Markov Decision Process (POMDP) is a tuple . The first component of the augmented state space, the ego motion state space , is fully observable. The second component, the space of world maps , is partially observable through observations received.
Let be the observation at time step . This is defined as the measurement received by the robot . Let be the probability of receiving an observation given the robot is at state and executes action . In our setting, this is the deterministic function .
Let the belief at time , , be the history of state, action, observation tuple received so far, i.e. . Note that this differs from the conventional use of the word belief which would usually imply a distribution. However, we use belief here to refer to the history of state, action, observations conditioned on which one can infer the posterior distribution of world maps . Let the belief transition function be . Let be a policy that maps state and belief to a feasible action . The value of executing a policy for time steps starting at state and belief is
where is the distribution of beliefs at time starting from and following policy . is the posterior distribution on worlds given the belief . Similarly the action value function is defined as
The optimal POMDP policy can be expressed as
is a uniform distribution over the discrete interval, is the distribution of states following policy for steps, is the distribution of belief following policy for steps. The value of a policy for steps on a distribution of worlds , starting states and starting belief
where the posterior world distribution uses as prior.
III Related Work
Problem 1 is a submodular function optimization (due to nature of ) subject to routing constraints (due to ). In absence of this constraint, there is a large body of work on near optimality of greedy strategies by Krause et al. [19, 21, 20]; however, naive greedy approaches can perform arbitrarily poorly under routing constraints. Chekuri and Pal [2] propose a quasi-polynomial time recursive greedy approach to solving this problem. Singh et al. [30] show how to scale up the approach to multiple robots. However, these methods are slow in practice. Iyer and Bilmes [14] solve a related problem of submodular maximization subject to submodular knapsack constraints using an iterative greedy approach. This inspires Zhang and Vorobeychik [35] to propose an elegant generalization of the cost benefit algorithm (GCB) which we use as an oracle in this paper. Yu et al. [34] frame the problem as a correlated orienteering problem and propose a mixed integer based approach; however, only correlations between neighboring nodes are considered. Hollinger and Sukhatme [12] use sampling based approaches which require a large number of evaluations of the utility function in practice.
Problem 2 in the absence of the travel cost constraint can be efficiently solved using the framework of adaptive submodularity developed by Golovin et al. [6, 7] as shown by Javdani et al. [16, 15] and Chen et al. [4, 3]. Hollinger et al. [10, 11] propose a heuristic based approach to select a subset of informative nodes and perform minimum cost tours. Singh et al. [31] replan every step using a non-adaptive information path planning algorithm. Such methods suffer when the adaptivity gap is large. Inspired by adaptive TSP approaches by Gupta et al. [8], Lim et al. [25, 24] propose recursive coverage algorithms to learn policy trees. However, such methods cannot scale well to large state and observation spaces. Heng et al. [9] make a modular approximation of the objective function. Isler et al. [13] survey a broad number of myopic information gain based heuristics that work well in practice but have no formal guarantees.
Fig. 2 shows an overview of our approach. The central idea is as follows: we train a policy to imitate an algorithm that has access to the world map at train time. The policy maps features extracted from state and belief to an action . The algorithm that is being imitated has access to the corresponding world map .
IV-B Imitation Learning
We now formally define imitation learning as applied to our setting. Let be a policy defined on a pair of state and belief . Let roll-in be the process of executing a policy from the start upto a certain time horizon. Similarly roll-out is the process of executing a policy from the current state and belief till the end. Let be the distribution of states induced by roll-in with policy . Let be the distribution of belief induced by roll-in with policy .
Let be the loss of a policy when executed on state and belief . This loss function implicitly captures how well policy imitates a reference policy (such as an oracle algorithm). Our goal is to find a policy which minimizes the observed loss under its own induced distribution of state and beliefs.
This is a non-i.i.d. supervised learning problem. Ross and Bagnell [26] show how such problems can be reduced to no-regret online learning using dataset aggregation (DAgger). The loss function they consider is a mis-classification loss with respect to what the expert demonstrated. Ross and Bagnell [27] extend the approach to the reinforcement learning setting where is the cost-to-go of an oracle reference policy by aggregating values to imitate (AggreVaTe).
IV-C Solving POMDP via Imitation of a Clairvoyant Oracle
When (8) is compared to the imitation learning framework in (10), we see that in addition to the induced state belief distributions, the loss function analogue is . This implies rolling out with policy . For poor policies , the action value estimate would be very different from optimal values .
In our approach, we alleviate this problem by defining a surrogate value function to imitate: the cumulative reward gathered by a clairvoyant oracle.
Definition 1 (Clairvoyant Oracle).
Given a distribution of world maps , a clairvoyant oracle is a policy that maps state and world map to a feasible action such that it approximates the optimal MDP policy, .
The term clairvoyant is used because the oracle has full access to the world map at train time. The oracle can be used to compute state action value as follows
Our approach is to imitate the oracle during training. This implies that we train a policy by solving the following optimization problem
Alg. 1 describes the ExpLOre algorithm. The algorithm iteratively trains a sequence of learnt policies by aggregating data for an online cost-sensitive classification problem.
The initial learnt policy is a random policy (Line 1). At iteration , the policy used to roll in is a mixture of the learnt policy and the oracle policy (Line 4) with mixture parameter . A set of cost-sensitive classification datapoints is collected as follows: a world is sampled (Line 7). The mixture policy is used to roll in up to a random time from an initial state to reach (Lines 8–10). An exploratory action is selected (Line 11). The clairvoyant oracle is given full access to and asked to roll out and provide an action value (Lines 12–13). The resulting datapoint is added to a cost-sensitive classification dataset and the process is repeated (Line 14). This dataset is appended to the aggregated dataset and used to train an updated learner (Lines 15–17). The algorithm returns the best learner from the sequence based on performance on a held-out validation set (or alternatively returns a mixture policy of all learnt policies). One can also try variants where all actions are executed, or where an online learner is used to update the policy instead of dataset aggregation.
Following the analysis style of AggreVaTe [27], we first introduce a hallucinating oracle.
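The training loop described above can be sketched on a toy problem. This is a minimal illustration under loudly stated assumptions, not the released implementation: nodes lie on a line, a world is a cluster of three informative nodes, there is a step horizon but no travel budget, the mixture parameter decays as `beta0 ** i`, the oracle is mixed in per step, and a per-feature mean (`MeanRegressor`) stands in for the regressor. All names are hypothetical, and the sketch returns the last learner rather than selecting one on a validation set.

```python
import random
from collections import defaultdict

N_NODES, HORIZON = 12, 3  # hypothetical toy sizes


def sample_world(rng):
    """A world is a set of informative nodes: a cluster of three neighbours."""
    c = rng.randrange(1, N_NODES - 1)
    return {c - 1, c, c + 1}


def utility(visited, world):
    return len(set(visited) & world)


def features(belief, action, steps_left):
    """Bucketed distance from `action` to the nearest node observed informative."""
    hits = [v for v, informative in belief if informative]
    if not hits:
        return ("no_hit", steps_left)
    return ("dist", min(3, min(abs(action - h) for h in hits)), steps_left)


def oracle_q(visited, action, world, steps_left):
    """Clairvoyant value: gain of `action` plus the best gain in the steps left."""
    after = set(visited) | {action}
    gain = utility(after, world) - utility(visited, world)
    return gain + min(steps_left - 1, len(world - after))


class MeanRegressor:
    """Stand-in regressor (assumption): mean target per feature key."""

    def __init__(self, data=()):
        sums = defaultdict(lambda: [0.0, 0])
        for x, q in data:
            sums[x][0] += q
            sums[x][1] += 1
        self.table = {x: s / n for x, (s, n) in sums.items()}

    def predict(self, x):
        return self.table.get(x, 0.0)


def unvisited(visited):
    return [a for a in range(N_NODES) if a not in visited]


def learner_action(model, visited, belief, steps_left, rng):
    options = unvisited(visited)
    rng.shuffle(options)  # random tie-breaking; an empty model acts randomly
    return max(options, key=lambda a: model.predict(features(belief, a, steps_left)))


def oracle_action(visited, world, steps_left):
    return max(unvisited(visited), key=lambda a: oracle_q(visited, a, world, steps_left))


def explore_train(n_iters=5, n_samples=200, beta0=0.5, seed=0):
    rng = random.Random(seed)
    dataset, model = [], MeanRegressor()
    for i in range(n_iters):
        beta = beta0 ** i  # assumed decay of the oracle mixing probability
        for _ in range(n_samples):
            world = sample_world(rng)
            t = rng.randrange(HORIZON)
            visited, belief = [], []
            for k in range(t):  # roll in with the mixture policy
                if rng.random() < beta:
                    a = oracle_action(visited, world, HORIZON - k)
                else:
                    a = learner_action(model, visited, belief, HORIZON - k, rng)
                visited.append(a)
                belief.append((a, a in world))
            a = rng.choice(unvisited(visited))  # exploratory action
            q = oracle_q(visited, a, world, HORIZON - t)  # oracle roll-out value
            dataset.append((features(belief, a, HORIZON - t), q))
        model = MeanRegressor(dataset)  # retrain on the aggregated dataset
    return model


def evaluate(model, n_worlds=2000, seed=1):
    rng = random.Random(seed)
    total = 0
    for _ in range(n_worlds):
        world, visited, belief = sample_world(rng), [], []
        for k in range(HORIZON):
            a = learner_action(model, visited, belief, HORIZON - k, rng)
            visited.append(a)
            belief.append((a, a in world))
        total += utility(visited, world)
    return total / n_worlds


if __name__ == "__main__":
    print("random policy:", evaluate(MeanRegressor()))
    print("learnt policy:", evaluate(explore_train()))
```

The learner only ever sees the history of visited nodes and measurements, while the regression targets come from an oracle that sees the full world, which is the division of information the algorithm relies on.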
Definition 2 (Hallucinating Oracle).
Given a prior distribution of world maps , a hallucinating oracle computes the instantaneous posterior distribution over world maps and takes the action with the highest expected value.
The policy optimization rule in (12) is equivalent to
by using the fact that
Consequently our learnt policy has the following guarantee
N iterations of ExpLOre, collecting regression examples per iteration guarantees that with probability at least
where is the empirical average online learning regret on the training regression examples collected over the iterations and is the empirical regression regret of the best regressor in the policy class.
For both proofs, refer to [5].
VI Experimental Results
VI-A Implementation Details
Our implementation is open source and available for MATLAB and C++ at goo.gl/HXNQwS.
VI-A1 Problem Details
The utility function is selected to be a fractional coverage function (similar to ) which is defined as follows. The world map is represented as a voxel grid representing the surface of a 3D model. The sensor measurement at node is obtained by raycasting on this 3D model. A voxel of the model is said to be ‘covered’ by a measurement received at a node if a point lies in that voxel. The coverage of a path is the fraction of voxels covered by the union of measurements received when visiting each node of the path. The travel cost function is chosen to be the Euclidean distance. The values of total time step and travel budget vary with problem instances and are specified along with the results.
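The fractional coverage utility described above can be sketched in a few lines. This is an illustrative sketch under assumptions: measurements are given as lists of 3D points per node, the surface is a set of integer voxel indices, and the names `voxel_of`, `fractional_coverage`, and the toy data are hypothetical.

```python
import math


def voxel_of(point, voxel_size):
    """Integer index of the voxel containing a 3D point."""
    return tuple(math.floor(c / voxel_size) for c in point)


def fractional_coverage(path, measurements, surface_voxels, voxel_size):
    """Fraction of surface voxels hit by at least one point measured along the path."""
    covered = set()
    for node in path:
        for point in measurements[node]:
            v = voxel_of(point, voxel_size)
            if v in surface_voxels:  # points off the model surface do not count
                covered.add(v)
    return len(covered) / len(surface_voxels)


# Hypothetical toy data: four surface voxels along the x axis, 0.5 m voxels.
surface_voxels = {(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)}
measurements = {
    "a": [(0.1, 0.1, 0.1), (0.6, 0.2, 0.3)],
    "b": [(0.7, 0.1, 0.1), (1.2, 0.4, 0.0), (5.0, 5.0, 5.0)],
}

if __name__ == "__main__":
    print(fractional_coverage(["a"], measurements, surface_voxels, 0.5))
    print(fractional_coverage(["a", "b"], measurements, surface_voxels, 0.5))
```

Because coverage is computed on the union of covered voxels, a second node that re-observes an already covered voxel adds nothing for that voxel.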
VI-A2 Oracle Algorithm
We use the generalized cost benefit algorithm (GCB) [35] as the oracle algorithm owing to its small run times and acceptable solution qualities.
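To give a feel for how a cost-benefit oracle trades information against travel, here is a simplified greedy sketch. It is not the GCB algorithm itself, only a plain benefit-to-cost greedy under assumptions: the world is fully known, the utility is the number of newly seen cells, and travel cost is Euclidean. The function name and toy data are hypothetical.

```python
import math


def greedy_cost_benefit(start, positions, views, budget):
    """Repeatedly append the feasible node with the best gain per unit travel cost."""
    path, seen, spent = [start], set(views[start]), 0.0
    while True:
        best, best_ratio, best_step = None, 0.0, 0.0
        for v in positions:
            if v in path:
                continue
            step = math.dist(positions[path[-1]], positions[v])
            gain = len(views[v] - seen)  # marginal gain with full world knowledge
            if gain == 0 or spent + step > budget:
                continue
            ratio = gain / step if step > 0 else math.inf
            if ratio > best_ratio:
                best, best_ratio, best_step = v, ratio, step
        if best is None:  # nothing informative is reachable within the budget
            return path
        spent += best_step
        seen |= views[best]
        path.append(best)


# Hypothetical toy instance: node -> position and node -> cells it observes.
positions = {"s": (0, 0), "a": (1, 0), "b": (4, 0), "c": (0, 2)}
views = {"s": {0}, "a": {1, 2}, "b": {3, 4, 5, 6}, "c": {7}}

if __name__ == "__main__":
    print(greedy_cost_benefit("s", positions, views, budget=6.0))
```

In this instance the greedy rule visits the cheap nearby node first, then the distant but information-rich one, and stops because the remaining node no longer fits in the budget.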
VI-A3 Learning Details
is mapped to a vector of features. The feature vector is a vector of information gain metrics as described in [13].
encode the relative rotation and translation required to visit a node. We use a random forest [23] to regress to values from features . The learning details are specified in Table I.
For baseline policies, we compare to the class of information gain heuristics discussed in [13]. The heuristics are remarkably effective; however, their performance depends on the distribution of objects in a world map. As ExpLOre uses these heuristic values as part of its feature vector, it will implicitly learn a data driven trade-off between them.
VI-B 2D Exploration
We create a set of 2D exploration problems to gain a better understanding of the behavior of ExpLOre and the baseline heuristics. A dataset comprises 2D binary world maps, uniformly distributed nodes and a simulated laser. The training size is , , .
VI-B1 Dataset 1: Concentrated Information
Fig. 3 shows a dataset created by applying random affine transformations to a pair of parallel lines. This dataset is representative of information being concentrated in a particular fashion. Fig. 3 shows a comparison of ExpLOre with baseline heuristics. The heuristic Rear Side Voxel performs the best, while ExpLOre is able to match the heuristic. Fig. 3 shows progress of ExpLOre along with two relevant heuristics - Rear Side Voxel and Average Entropy. Rear Side Voxel takes small steps focusing on exploiting viewpoints along the already observed area. Average Entropy aggressively visits unexplored area which is mainly free space. ExpLOre initially explores the world but on seeing parts of the lines reverts to exploiting the area around it.
VI-B2 Dataset 2: Distributed Information
Fig. 3 shows a dataset created by randomly distributing rectangular blocks around the periphery of the map. This dataset is representative of information being distributed around. Fig. 3 shows that the heuristic Average Entropy performs the best, while ExpLOre is able to match the heuristic. Rear Side Voxel saturates early on and performs worse. Fig. 3 shows that Rear Side Voxel gets stuck exploiting an island of information. Average Entropy takes broader sweeps of the area thus gaining more information about the world. ExpLOre also shows a similar behavior of exploring the world map.
Thus we see that on changing the datasets the relative performance of the heuristics reverses, while our data driven approach is able to adapt seamlessly.
VI-B3 Other Datasets
Fig. 5 shows results from other 2D datasets such as random disks and block worlds, where ExpLOre is able to outperform all heuristics.
VI-C 3D Exploration
We create a set of 3D exploration problems to test the algorithm on more realistic scenarios. The datasets comprise 3D worlds created in Gazebo and a simulated Kinect.
VI-C1 Train on Synthetic, Test on Real
To show the practical usage of our pipeline, we show a scenario where a policy is trained on synthetic data and tested on a real dataset.
The real dataset is of an office desk collected by TUM Computer Vision Group [33]. The dataset is parsed to create a pair of pose and registered point cloud which can then be used to evaluate different algorithms. Fig. 4 shows that ExpLOre outperforms all heuristics. Fig. 4 shows ExpLOre getting good coverage of the desk while the best heuristic, Occlusion Aware, misses out on the rear side of the desk.
VI-C2 Other Datasets
Fig. 5 shows more datasets where training and testing are done on synthetic worlds.
We presented a novel data-driven imitation learning framework to solve budgeted information gathering problems. Our approach, ExpLOre, trains a policy to imitate a clairvoyant oracle that has full information about the world and can compute non-myopic plans to maximize information. The effectiveness of ExpLOre can be attributed to two main reasons: Firstly, as the distribution of worlds varies, the clairvoyant oracle is able to adapt and consequently ExpLOre adapts as well. Secondly, as the oracle computes non-myopic solutions, imitating it allows ExpLOre to also learn non-myopic behaviors.
The authors thank Sankalp Arora for insightful discussions and open source code for exploration in MATLAB.
[1] Benjamin Charrow, Gregory Kahn, Sachin Patil, Sikang Liu, Ken Goldberg, Pieter Abbeel, Nathan Michael, and Vijay Kumar. Information-theoretic planning with trajectory optimization for dense 3D mapping. In RSS, 2015.
[2] Chandra Chekuri and Martin Pal. A recursive greedy algorithm for walks in directed graphs. In FOCS, 2005.
[3] Yuxin Chen, S. Hamed Hassani, and Andreas Krause. Near-optimal Bayesian active learning with correlated and noisy tests. CoRR, abs/1605.07334, 2016.
[4] Yuxin Chen, Shervin Javdani, Amin Karbasi, J. Bagnell, Siddhartha Srinivasa, and Andreas Krause. Submodular surrogates for value of information, 2015.
[5] Sanjiban Choudhury. Learning to gather information via imitation: Proofs. goo.gl/GJfg7r, 2016.
[6] Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. JAIR, 2011.
[7] Daniel Golovin, Andreas Krause, and Debajyoti Ray. Near-optimal Bayesian active learning with noisy observations. In NIPS, 2010.
[8] Anupam Gupta, Viswanath Nagarajan, and R Ravi. Approximation algorithms for optimal decision trees and adaptive TSP problems. In International Colloquium on Automata, Languages, and Programming, 2010.
[9] Lionel Heng, Alkis Gotovos, Andreas Krause, and Marc Pollefeys. Efficient visual exploration and coverage with a micro aerial vehicle in unknown environments. In ICRA.
[10] Geoffrey A Hollinger, Brendan Englot, Franz S Hover, Urbashi Mitra, and Gaurav S Sukhatme. Active planning for underwater inspection and the benefit of adaptivity. IJRR, 2012.
[11] Geoffrey A Hollinger, Urbashi Mitra, and Gaurav S Sukhatme. Active classification: Theory and application to underwater inspection. arXiv preprint arXiv:1106.5829, 2011.
[12] Geoffrey A Hollinger and Gaurav S Sukhatme. Sampling-based motion planning for robotic information gathering. In RSS, 2013.
[13] Stefan Isler, Reza Sabzevari, Jeffrey Delmerico, and Davide Scaramuzza. An information gain formulation for active volumetric 3D reconstruction. In ICRA, 2016.
[14] Rishabh K Iyer and Jeff A Bilmes. Submodular optimization with submodular cover and submodular knapsack constraints. In NIPS, 2013.
[15] Shervin Javdani, Yuxin Chen, Amin Karbasi, Andreas Krause, J. Andrew (Drew) Bagnell, and Siddhartha Srinivasa. Near optimal Bayesian active learning for decision making. In AISTATS, 2014.
[16] Shervin Javdani, Matthew Klingensmith, J. Andrew (Drew) Bagnell, Nancy Pollard, and Siddhartha Srinivasa. Efficient touch based localization through submodularity. In ICRA, 2013.
[17] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 1998.
[18] Michael Koval, Nancy Pollard, and Siddhartha Srinivasa. Pre- and post-contact policy decomposition for planar contact manipulation under uncertainty. In RSS, 2014.
[19] Andreas Krause and Daniel Golovin. Submodular function maximization. Tractability: Practical Approaches to Hard Problems, 2012.
[20] Andreas Krause and Carlos Guestrin. Near-optimal observation selection using submodular functions. In AAAI, 2007.
[21] Andreas Krause, Jure Leskovec, Carlos Guestrin, Jeanne VanBriesen, and Christos Faloutsos. Efficient sensor placement optimization for securing large water distribution networks. Journal of Water Resources Planning and Management, 2008.
[22] Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In RSS, 2008.
[23] Andy Liaw and Matthew Wiener. Classification and regression by randomForest.
[24] Zhan Wei Lim, David Hsu, and Wee Sun Lee. Adaptive stochastic optimization: From sets to paths. In NIPS, 2015.
[25] Zhan Wei Lim, David Hsu, and Wee Sun Lee. Adaptive informative path planning in metric spaces. IJRR, 2016.
[26] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In AISTATS, 2010.
[27] Stéphane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.
[28] Guy Shani, Joelle Pineau, and Robert Kaplow. A survey of point-based POMDP solvers. Autonomous Agents and Multi-Agent Systems, 2013.
[29] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In NIPS, 2010.
[30] Amarjeet Singh, Andreas Krause, Carlos Guestrin, William Kaiser, and Maxim Batalin. Efficient planning of informative paths for multiple robots. In IJCAI, 2007.
[31] Amarjeet Singh, Andreas Krause, and William J. Kaiser. Nonmyopic adaptive informative path planning for multiple robots. In IJCAI, 2009.
[32] Adhiraj Somani, Nan Ye, David Hsu, and Wee Sun Lee. DESPOT: Online POMDP planning with regularization. In NIPS, 2013.
[33] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In IROS, 2012.
[34] Jingjin Yu, Mac Schwager, and Daniela Rus. Correlated orienteering problem and its application to informative path planning for persistent monitoring tasks. In IROS, 2014.
[35] Haifeng Zhang and Yevgeniy Vorobeychik. Submodular optimization with routing constraints, 2016.