Information-Guided Robotic Maximum Seek-and-Sample in Partially Observable Continuous Environments

by Genevieve Flaspohler, et al.

We present PLUMES, a planner for localizing and collecting samples at the global maximum of an a priori unknown and partially observable continuous environment. The "maximum-seek-and-sample" (MSS) problem is pervasive in the environmental and earth sciences. Experts want to collect scientifically valuable samples at an environmental maximum (e.g., an oil-spill source), but do not have prior knowledge about the phenomenon's distribution. We formulate the MSS problem as a partially-observable Markov decision process (POMDP) with continuous state and observation spaces, and a sparse reward signal. To solve the MSS POMDP, PLUMES uses an information-theoretic reward heuristic with continuous-observation Monte Carlo Tree Search to efficiently localize and sample from the global maximum. In simulation and field experiments, PLUMES collects more scientifically valuable samples than state-of-the-art planners in a diverse set of environments, with various platforms, sensors, and challenging real-world conditions.




I Introduction

In many environmental and earth science applications, experts want to collect scientifically valuable samples of a maximum (e.g., an oil spill source), but the distribution of the phenomenon is initially unknown. This maximum seek-and-sample (MSS) problem is pervasive. Canonically, samples are collected at predetermined locations by a technician or by a mobile platform following a uniform coverage trajectory. These non-adaptive strategies result in sample sparsity at the maximum and may be infeasible when the geometric structure of the environment is unknown (e.g., boulder fields) or changing (e.g., tidal zones). Increasing the number of valuable samples at the maximum requires adaptive online planning and execution. We present PLUMES — Plume Localization under Uncertainty using Maximum-ValuE information and Search — an adaptive algorithm that enables a mobile robot to efficiently localize and densely sample an environmental maximum, subject to practical challenges including dynamic constraints, unknown geometric map and obstacles, and noisy sensors with limited field-of-view. Fig. 1 shows a motivating application: coral head localization.

Fig. 1: Coral head localization with an autonomous surface vehicle (ASV): The objective of the ASV is to find and sample at the most exposed (shallowest) coral head in a region of Bellairs Fringing Reef, Barbados. Overlaid on the aerial photo is the a priori unknown bathymetry of the region (yellow is shallow, blue is deep). Equipped with an acoustic point altimeter, the ASV must explore to infer the location of the maximum (marked with a star) and then sample at that coral colony.

Informative Path Planning: The MSS problem is closely related to informative path planning (IPP) problems. Canonical offline IPP techniques for pure information-gathering that optimize submodular coverage objectives can achieve near-optimal performance [Srinivas2012, binney2012branch]. However, in the MSS problem, the value of a sample depends on the unknown maximum location, requiring adaptive planning to enable the robot to select actions that explore to localize the maximum and then seamlessly transition to selecting actions that exploitatively collect valuable samples there. Even for adaptive IPP methods, the MSS problem presents considerable challenges. The target environmental phenomenon is partially observable and most directly modeled as a continuous scalar function. Additionally, efficient maximum sampling with a mobile robot requires consideration of vehicle dynamics, travel cost, and a potentially unknown obstacle map. Handling these challenges in combination excludes adaptive IPP algorithms that use discrete state spaces [Lim2016, Arora2017], known metric maps [singh2009nonmyopic, Jawaid2015], or unconstrained sensor placement [Krause2008].

The MSS POMDP: Partially-observable Markov decision processes (POMDPs) are general models for decision-making under uncertainty that allow the challenging aspects of the MSS problem to be encoded. We define the MSS POMDP, in which the partially observable state represents the continuous environmental phenomenon and a sparse reward function encodes the MSS scientific objective by giving reward only to samples sufficiently close to the global maximum. Solving a POMDP exactly is generally intractable, and the MSS POMDP is additionally complicated by both continuous state and observation spaces, and the sparse MSS reward function. This presents the two core challenges that PLUMES addresses: performing online search in a belief-space over continuous functions, and overcoming reward function sparsity.

Planning over Continuous Domains: In the MSS problem, the state of the environment can be modeled as a continuous function. PLUMES uses a Gaussian Process (GP) model to represent the belief over this continuous function, and must plan over the uncountable set of possible GP beliefs that arise from future continuous observations. To address planning in continuous spaces, state-of-the-art online POMDP solvers use deterministic discretization [ling2016gaussian] or a combination of sampling techniques and particle filter belief representations [somani2013despot, kurniawati2016online, Silver2010, sunberg2017value]. Efficiently discretizing or maintaining a sufficiently rich particle set to represent the underlying continuous function in MSS applications is itself a challenging problem, and can lead to inaccurate inference of the maximum [dallaire2009bayesian]. Other approaches have considered using the maximum-likelihood observation to make search tractable [Marchant2014a]. However, this assumption can compromise search and has optimality guarantees only in linear-Gaussian systems [platt2010belief]. Instead, PLUMES uses Monte Carlo Tree Search (MCTS) with progressive widening, which we call continuous-observation MCTS, to limit planning tree growth [couetoux2011continuous] and retain optimality [auger2013continuous] in continuous environments.

Rewards and Heuristics: In the MSS POMDP, the reward function is sparse and does not explicitly encode the value of exploration. Planning with sparse rewards requires long-horizon information gathering and is an open problem in robotics [smart2002effective]. To alleviate this difficulty, less sparse heuristic reward functions can be optimized in place of the true reward, but these heuristics need to be selected carefully to ensure the planner performs well with respect to the true objective. In IPP, heuristics based on the value of information have been applied successfully [Marchant2014a, Sun2017, Krause2008, Hitz2017], primarily using the GP-UCB criteria [Contal2013, Srinivas2012]. We demonstrate that within practical mission constraints, using UCB as the heuristic reward function for the MSS POMDP can lead to suboptimal convergence to local maxima due to a mismatch between the UCB heuristic and the true MSS reward. Instead, PLUMES takes advantage of a heuristic function from the Bayesian optimization (BO) community for state-of-the-art black-box optimization [wang2017max], which we call maximum-value information (MVI). MVI overcomes sparsity and encourages long-term information gathering, while still converging to the true reward of the MSS POMDP.

The contribution of this paper is the MSS POMDP formalism and the corresponding PLUMES planner, which by virtue of its belief model, information-theoretic reward heuristic, and search framework, enables efficient maximum seek and sample with asymptotic optimality guarantees in continuous environments. PLUMES extends the state-of-the-art in MSS planners by applying a BO heuristic reward function to MSS that alleviates the challenges of the true sparse MSS reward function, and integrating GP belief representations within continuous-observation MCTS. The utility of PLUMES for MSS applications is demonstrated in extensive simulation and field trials, showing a statistically significant performance improvement over state-of-the-art baselines.

II Maximum Seek-and-Sample POMDP

We formalize the MSS problem by considering a target environmental domain as a $d$-dimensional compact set $\mathbb{X}_w \subset \mathbb{R}^d$. We allow $\mathbb{X}_w$ to contain obstacles with arbitrary geometry and let $\mathbb{X} \subset \mathbb{X}_w$ be the set of reachable points with respect to the robot’s initial pose. We assume there is an unknown underlying continuous function $f : \mathbb{X}_w \to \mathbb{R}$ representing the value of a continuous phenomenon of interest. The objective is to find the unique global maximizer $x^* = \arg\max_{x \in \mathbb{X}} f(x)$ by safely navigating while receiving noisy observations of this function $f$. Because $f$ is unknown, we cannot access derivative information or any analytic form.

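We model the process of navigating and generating observations as the MSS POMDP: an 8-tuple $(S, A, Z, T, O, R, \gamma, b_0)$:

  • $S$: continuous state space of the robot and environment

  • $A$: discrete set of action primitives

  • $Z$: continuous space of possible observations

  • $T$: $S \times A \to \mathcal{P}(S)$, the transition function, i.e., $\Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$

  • $O$: $S \times A \to \mathcal{P}(Z)$, the observation model, i.e., $\Pr(Z_{t+1} = z \mid S_{t+1} = s', A_t = a)$

  • $R$: $S \times A \to \mathbb{R}$, the reward of taking action $a$ when robot’s state is $s$, i.e., $R(s, a)$

  • $\gamma$: discount factor, $0 \le \gamma \le 1$

  • $b_0$: initial belief state of the robot, $b_0 \in \mathcal{P}(S)$

where $\mathcal{P}(\cdot)$ denotes the space of probability distributions over the argument.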

The Bellman equation is used to recursively quantify the value of belief $b_t$ over a finite horizon $h$ under policy $\pi$ as:

$$V^{\pi}_{h}(b_t) = \mathbb{E}\left[R(s_t, \pi(b_t))\right] + \gamma \int_{z \in Z} V^{\pi}_{h-1}\left(b^{\pi(b_t), z}_{t+1}\right) \Pr(z \mid b_t, \pi(b_t)) \, dz \tag{1}$$

where the expectation is taken over the current belief and $b^{\pi(b_t), z}_{t+1}$ is the updated belief after taking action $\pi(b_t)$ and observing $z \in Z$. The optimal policy $\pi^*_h$ over horizon $h$ is the maximizer of the value function over the space of possible policies $\Pi$: $\pi^*_h = \arg\max_{\pi \in \Pi} V^{\pi}_h(b_t)$. However, Eq. 1 is intractable to compute in continuous state and observation spaces; the optimal policy must be approximated. PLUMES uses a receding-horizon, online POMDP planner and heuristic reward function to approximately solve the MSS POMDP in real-time on robotic systems.

III The PLUMES Algorithm

PLUMES is an online planning algorithm with a sequential decision-making structure:

  1. Conditioned on $b_t$, approximate the optimal policy $\pi^*_h$ for finite horizon $h$ and execute the action $a = \pi^*_h(b_t)$.

  2. Collect observations $z \in Z$, according to $O$.

  3. Update $b_t$ to incorporate this new observation; repeat.
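The three steps above can be sketched as a simple plan-act-observe-update loop. This is a minimal illustration, not the authors' implementation: `run_mission` and the `toy_*` callables are hypothetical names, and the toy 1-D world with a noiseless sensor is an assumption made only so the loop can run end to end.

```python
def run_mission(belief, plan, execute, update, budget):
    """Generic online planning loop; every callable is a hypothetical placeholder."""
    history = []
    for _ in range(budget):
        action = plan(belief)                         # step 1: choose an action from the belief
        observation = execute(belief, action)         # step 2: act and collect an observation
        belief = update(belief, action, observation)  # step 3: fold the observation into the belief
        history.append((action, observation))
    return belief, history


# Toy 1-D usage: the "belief" is only the pose and the best reading seen so far.
def toy_plan(belief):
    return 1  # always step right (stand-in for a real planner)


def toy_execute(belief, action):
    x = belief["pose"] + action
    return -(x - 3) ** 2  # noiseless stand-in for a sensor reading, peaked at x = 3


def toy_update(belief, action, observation):
    return {"pose": belief["pose"] + action,
            "best": max(belief["best"], observation)}


final_belief, history = run_mission({"pose": 0, "best": float("-inf")},
                                    toy_plan, toy_execute, toy_update, budget=5)
```

In a real system the `plan` callable would be the tree search described in the following sections, and `update` would condition the GP belief on the new observation.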

In the following sections, we define the specific choice of belief model, planning algorithm, and heuristic reward function that PLUMES uses to solve the MSS POMDP.

III-A Gaussian Process Belief Model

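We assume the robot’s pose at planning iteration $t$, $x_t$, is fully observable, and the unknown environmental phenomenon $f$ is partially observable. The full belief-state is represented as a tuple $b_t$ of robot state $x_t$ and environment belief $g_t$ at time $t$. Because $f$ is a continuous function, we cannot represent the belief as a distribution over discrete states, as is standard in POMDP literature [kaelbling1998planning], and must choose an alternate representation. PLUMES uses a Gaussian process (GP) [Rasmussen2004] to represent $g_t$ conditioned on a history of past observations. This GP is parameterized by mean $\mu(x)$ and covariance function $\kappa(x, x')$.

As the robot traverses a location $x$, it gathers observations $z \in Z$ of $f$ subject to sensor noise $\sigma_n^2$, such that $z = f(x) + \eta$ with $\eta \sim \mathcal{N}(0, \sigma_n^2)$. Given a history $D_t = \{(x_i, z_i)\}_{i=0}^{n-1}$ of $n$ observations and observation locations at planning iteration $t$, the posterior belief at a new location $x' \in \mathbb{X}$ is computed:

$$g_t(x') \mid D_t \sim \mathcal{N}\left(\mu_t(x'), \sigma_t^2(x')\right) \tag{2}$$

$$\mu_t(x') = \kappa_t(x')^{\top} (K_t + \sigma_n^2 I)^{-1} \mathbf{z}_t \tag{3}$$

$$\sigma_t^2(x') = \kappa(x', x') - \kappa_t(x')^{\top} (K_t + \sigma_n^2 I)^{-1} \kappa_t(x') \tag{4}$$

where $\mathbf{z}_t = [z_0, \ldots, z_{n-1}]^{\top}$, $K_t$ is the positive definite kernel matrix with $K_t[i, j] = \kappa(x_i, x_j)$ for all $x_i, x_j \in D_t$, and $\kappa_t(x') = [\kappa(x_0, x'), \ldots, \kappa(x_{n-1}, x')]^{\top}$.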
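As a concrete illustration of the GP posterior update, the sketch below computes the standard zero-mean GP posterior mean and variance at a query point in pure Python. The squared-exponential kernel, its hyperparameter values, the noise variance, and the helper names (`sq_exp_kernel`, `solve`, `gp_posterior`) are assumptions chosen for the example, not values from the paper.

```python
import math


def sq_exp_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel between two points given as tuples."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return signal_var * math.exp(-0.5 * d2 / length_scale ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def gp_posterior(X, z, x_new, noise_var=0.01, kernel=sq_exp_kernel):
    """Posterior mean and variance at x_new, assuming a zero prior mean."""
    # Kernel matrix with the sensor-noise variance added on the diagonal.
    K = [[kernel(xi, xj) + (noise_var if i == j else 0.0)
          for j, xj in enumerate(X)] for i, xi in enumerate(X)]
    k_star = [kernel(xi, x_new) for xi in X]
    alpha = solve(K, z)        # (K + noise_var * I)^{-1} z
    beta = solve(K, k_star)    # (K + noise_var * I)^{-1} k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = kernel(x_new, x_new) - sum(k * b for k, b in zip(k_star, beta))
    return mean, var


# Two made-up 1-D observations; query one observed point and one distant point.
X_obs = [(0.0,), (1.0,)]
z_obs = [1.0, 2.0]
mean_near, var_near = gp_posterior(X_obs, z_obs, (0.0,))
mean_far, var_far = gp_posterior(X_obs, z_obs, (10.0,))
```

Near an observed location the posterior mean tracks the observation and the variance collapses toward the noise level; far from all data the posterior reverts to the prior (mean 0, variance equal to the signal variance).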

III-B Planning with Continuous-Observation MCTS

PLUMES selects high-reward actions with receding-horizon search over possible belief states. This search requires a simulator that can sample observations and generate beliefs given a proposed action sequence. For PLUMES, this simulator is the GP model, which represents the belief over the continuous function $f$, and in turn simulates continuous observations from proposed action sequences by sampling from the Gaussian distribution defined by Eq. 3 & 4.

PLUMES uses continuous-observation MCTS to overcome the challenges of planning in continuous state and observation spaces. Continuous-observation MCTS has three stages: selection, forward simulation, and back-propagation. Each node in the tree can be represented as the tuple of robot pose and GP belief, $b_t = \{x_t, g_t\}$. Additionally, we will refer to two types of nodes: belief nodes and belief-action nodes. The root of the tree is always a belief node, which represents the entire history of actions and observations up through the current planning iteration. Through selection and simulation, belief and belief-action nodes are alternately added to the tree (Fig. 2).

From the root, a rollout begins with the selection stage, in which a belief-action child is selected according to the Polynomial Upper Confidence Tree (PUCT) policy [auger2013continuous]. The PUCT value $Q^*(b_t, a)$ is the sum of the average heuristic rewards (i.e., MVI) from all previous simulations and a term that favors less-simulated action sequences:

$$Q^*(b_t, a) = Q_R(b_t, a) + \sqrt{\frac{N(b_t)^{e_d}}{N(b_t, a)}} \tag{5}$$

where $Q_R(b_t, a)$ is the average heuristic reward of choosing action $a$ with belief $b_t$ in all previous rollouts, $N(b_t)$ is the number of times the node $b_t$ has been simulated, $N(b_t, a)$ is the number of times that particular action from node $b_t$ has been selected, and $e_d$ is a depth-dependent parameter (refer to Table 1 of Auger et al. [auger2013continuous] for parameter settings).
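The selection rule can be illustrated with a short sketch. It assumes the polynomial exploration bonus of Auger et al., i.e. the square root of the node visit count raised to a depth-dependent exponent, divided by the action visit count. The exponent value `e_d=0.5`, the function names, and the "try unvisited actions first" convention are illustrative assumptions.

```python
import math


def puct_score(avg_reward, n_node, n_action, e_d):
    """Average heuristic reward plus a bonus that shrinks as the action is tried more."""
    return avg_reward + math.sqrt(n_node ** e_d / n_action)


def select_action(stats, n_node, e_d=0.5):
    """stats maps action -> (average heuristic reward, visit count)."""
    # Untried actions are taken first so that every visit count is positive.
    for action, (_, count) in stats.items():
        if count == 0:
            return action
    return max(stats,
               key=lambda a: puct_score(stats[a][0], n_node, stats[a][1], e_d))


# "left" has a slightly higher average reward but has been simulated ten times;
# "right" has been simulated once, so its exploration bonus dominates.
stats = {"left": (1.0, 10), "right": (0.9, 1)}
chosen = select_action(stats, n_node=11)
first_try = select_action({"a": (0.0, 0), "b": (5.0, 3)}, n_node=3)
```

The example shows the intended trade-off: a rarely simulated action can be selected over one with a marginally better average reward.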

Fig. 2: Continuous-observation MCTS: Illustrated to horizon , the tree consists of alternating belief and belief-action nodes. Action decisions are made at belief nodes and random belief transitions according to the observation function occur at belief-action nodes. Note that belief-action nodes have a varying number of children due to progressive widening and unequal simulation (not visualized) due to PUCT policy.

Once a child belief-action node is selected, the action associated with the child is forward simulated using the generative observation model $O$, and a new belief node $b_{t+1}$ is generated as though the action were taken and samples observed. The simulated observations are drawn from the belief-action node’s GP model $g_t$, and the robot’s pose is updated deterministically based on the selected action. Since the observations in a GP are continuous, every sampled observation is unique with probability one. Progressive widening, with depth-dependent parameter $\alpha_d$, incrementally grows the tree by limiting the number of belief children of each belief-action node. When growing the tree, $b_{t+1}$ is either chosen to be the least visited node if $\lfloor N(b_t, a)^{\alpha_d} \rfloor = \lfloor (N(b_t, a) - 1)^{\alpha_d} \rfloor$, or otherwise is a new child with observations simulated from $b_t$. By limiting the width of the search tree and incrementally growing the number of explored children, progressive widening avoids search degeneracy in continuous environments.
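A minimal sketch of progressive widening follows. It assumes the usual rule from Auger et al.: a new belief child is created on a visit only when the floor of the visit count raised to the widening exponent increases, and otherwise the least-visited existing child is revisited. The exponent `0.5` and all names are illustrative assumptions.

```python
import itertools
import math


def should_add_child(n_visits, alpha_d):
    """True when floor(n^alpha) increases at this visit (n_visits >= 1)."""
    return math.floor(n_visits ** alpha_d) > math.floor((n_visits - 1) ** alpha_d)


def next_belief_child(children, n_visits, alpha_d, sample_child):
    """children maps child id -> visit count; sample_child() returns a new child id."""
    if not children or should_add_child(n_visits, alpha_d):
        child = sample_child()       # simulate a fresh observation / belief
        children[child] = 0
    else:
        child = min(children, key=children.get)  # revisit the least-visited child
    children[child] += 1
    return child


# Visit one belief-action node nine times with alpha_d = 0.5.
children = {}
ids = itertools.count()
for n in range(1, 10):
    next_belief_child(children, n, 0.5, lambda: next(ids))
```

With this exponent, new children appear only on visits 1, 4, and 9, so nine visits produce three belief children instead of nine singly-visited ones.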

Once a sequence of actions has been rolled out to a horizon $h$, the accumulated heuristic reward is propagated upward from the leaves to the tree root. The average accumulated heuristic reward and number of queries are updated for each node visited in the rollout. Rollouts continue until the computation budget is exhausted. The most visited belief-action child of the root node is executed.
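The back-propagation stage can be sketched as a running-average update along the rollout path. The node representation (dictionaries with `visits` and `avg_reward`), the optional discount `gamma`, and the function names are assumptions for illustration only.

```python
def backpropagate(path, step_rewards, gamma=1.0):
    """path[i] is the node visited at depth i; step_rewards[i] is the reward earned there."""
    reward_to_go = 0.0
    # Walk from the leaf back to the root, accumulating reward along the way.
    for node, reward in zip(reversed(path), reversed(step_rewards)):
        reward_to_go = reward + gamma * reward_to_go
        node["visits"] += 1
        # Incremental update of the running average.
        node["avg_reward"] += (reward_to_go - node["avg_reward"]) / node["visits"]


def most_visited_action(root_children):
    """root_children maps action -> node; returns the action with the most visits."""
    return max(root_children, key=lambda a: root_children[a]["visits"])


root = {"visits": 0, "avg_reward": 0.0}
child = {"visits": 0, "avg_reward": 0.0}
backpropagate([root, child], [1.0, 2.0])   # first rollout
backpropagate([root, child], [1.0, 0.0])   # second rollout along the same path
best = most_visited_action({"forward": child,
                            "turn": {"visits": 1, "avg_reward": 9.0}})
```

Note that the executed action is chosen by visit count rather than by average reward, so a single lucky rollout (the `"turn"` entry above) does not determine the decision.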

Fig. 3: Convergence of MVI vs UCB heuristic: The true environmental phenomenon with the global maximum marked by a star is shown in the center; high regions are colored yellow and low regions blue. In (A,C), the robot trajectory and corresponding reward functions are shown early (20 actions) and later (140 actions) in a mission. On the top row, snapshots of the robot belief state with planned trajectories are shown, with recent actions colored pink and earlier actions colored blue. Red stars mark maxima sampled by MVI. In the bottom row, the corresponding reward function is shown, with high-reward regions colored yellow and low reward regions colored purple. By the end of the mission, MVI clearly converges to placing reward only at the global maximum, which in turn leads to efficient convergence of the robot. By contrast, the reward landscape resulting from canonically used UCB converges to the underlying function, causing the UCB planner to uniformly tour high-valued regions of the environment.

Continuous-observation MCTS within PLUMES provides both practical and theoretical benefits. Practically, progressive-widening directly addresses search degeneracy by visiting belief nodes multiple times even in continuous observation spaces, allowing for a more representative estimate of their value. Theoretically, PLUMES can be shown to select asymptotically optimal actions. We briefly describe how analysis in Auger et al. [auger2013continuous] for PUCT-MCTS with progressive widening in MDPs can be extended to PLUMES.

Using standard methods [kaelbling1998planning], we can reduce the MSS POMDP to an equivalent belief-state MDP. This belief-state MDP has a state space equal to the set of all possible beliefs, and a transition distribution that captures the effect of both the dynamics and the observation model after each action. Planning in this representation is often intractable as the state space is continuous and infinite-dimensional. However, PLUMES plans directly in the belief-state MDP by using its GP belief state to compute the transition function efficiently.

Subsequently, Theorem 1 in Auger et al. [auger2013continuous] shows that for an MDP with a continuous state space, like the belief-state MDP representation suggested, the value function estimated by continuous-observation MCTS asymptotically converges to that of the optimal policy:

$$\left| \hat{V}_h(b_t) - V^*_h(b_t) \right| \le \frac{C}{N(b_t)^{\gamma_d}} \tag{6}$$

with high probability [auger2013continuous], for constants $C > 0$ and $\gamma_d$.

III-C Maximum-Value Information Reward

The true state-dependent reward function for the MSS POMDP would place value on collecting sample points $x$ within an $\epsilon$-ball of the true global maximum $x^*$:

$$R(f, x) = \mathbb{1}_{\lVert x - x^* \rVert < \epsilon} \tag{7}$$

where $\epsilon$ is determined by the scientific application. Optimizing this sparse reward function directly is challenging, so PLUMES approximates the true MSS reward by using the maximum-value information (MVI) heuristic reward [wang2017max]. MVI initially encourages exploration behavior, but ultimately rewards exploitative sampling near the inferred maximum.
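The sparse MSS reward is simple to state in code: each sample earns a reward of one only if it falls inside the ball around the true maximizer. In the sketch below the function names, the strict inequality at the ball boundary, and the sample coordinates are illustrative assumptions.

```python
import math


def mss_reward(sample_xy, maximizer_xy, epsilon):
    """1 if the sample lies strictly within epsilon of the true maximizer, else 0."""
    return 1 if math.dist(sample_xy, maximizer_xy) < epsilon else 0


def accumulated_reward(samples, maximizer_xy, epsilon):
    """Number of samples collected inside the epsilon-ball."""
    return sum(mss_reward(s, maximizer_xy, epsilon) for s in samples)


# Made-up sample locations: two fall inside a ball of radius 1.5, one does not.
samples = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
total = accumulated_reward(samples, maximizer_xy=(0.0, 0.0), epsilon=1.5)
```

Because the reward is zero everywhere outside the ball, it gives a planner no gradient toward the maximum during exploration, which is the sparsity problem that the heuristic reward is meant to address.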

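The belief-dependent MVI heuristic reward $\tilde{R}(b_t, x)$ quantifies the expected value of having belief $b_t$ and collecting a sample at location $x \in \mathbb{X}$. MVI reward quantifies the mutual information between the random variable $Z$, representing the observation at location $x$, and $Z^*$, the random variable representing the value of the function at the global maximum:

$$\tilde{R}(b_t, x) = I(\{x, Z\}; Z^* \mid b_t) \tag{8}$$

where $Z^* = \max_{x \in \mathbb{X}} f(x)$. To compute the reward of collecting a random observation $Z$ at location $x$ under belief $b_t$, we approximate the expectation over the unknown $Z^*$ by sampling from the posterior distribution $z^*_i \sim p(Z^* \mid b_t)$ and use Monte Carlo integration with $M$ samples [wang2017max]:

$$\tilde{R}(b_t, x) = H[\Pr(Z \mid x, b_t)] - \mathbb{E}_{z' \sim p(Z^* \mid b_t)}\left[ H[\Pr(Z \mid x, b_t, Z^* = z')] \right] \tag{9}$$

$$\approx H[\Pr(Z \mid x, b_t)] - \frac{1}{M} \sum_{i=1}^{M} H[\Pr(Z \mid x, b_t, Z^* = z^*_i)] \tag{10}$$

Each entropy expression $H[\cdot]$ can be respectively approximated as the entropy of a Gaussian random variable with mean $\mu_t(x)$ and variance $\sigma_t^2(x)$ given by the GP equations (Eq. 3 & 4), and the entropy of a truncated Gaussian, with upper limit $z^*_i$ and the same mean and variance.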

To draw samples $z^*_i$ from the posterior $p(Z^* \mid b_t)$, we employ spectral sampling [rahimi2008random]. Spectral sampling draws a function $\tilde{f}$, which has analytic form and is differentiable, from the posterior belief of a GP with stationary covariance function [wang2017max, hernandez2014predictive]. To complete the evaluation of Eq. 10, $z^*_i$ can be computed by applying standard efficient global optimization techniques (e.g., sequential least squares programming, quasi-Newton methods) to find the global maximum of the sampled $\tilde{f}$. This results in the following expression for MVI reward [wang2017max]:

$$\tilde{R}(b_t, x) \approx \frac{1}{M} \sum_{i=1}^{M} \left[ \frac{\gamma_{z^*_i}(x) \, \phi(\gamma_{z^*_i}(x))}{2 \, \Phi(\gamma_{z^*_i}(x))} - \log \Phi(\gamma_{z^*_i}(x)) \right] \tag{11}$$

where $\gamma_{z^*_i}(x) = \frac{z^*_i - \mu_t(x)}{\sigma_t(x)}$, $\mu_t(x)$ and $\sigma_t(x)$ are given by Eq. 3 & 4, and $\phi$ and $\Phi$ are the standard normal PDF and CDF. For actions that collect samples at more than one location, the reward of an action is the sum of rewards of the locations sampled by that action.
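The closed-form MVI estimate can be illustrated as follows. The sketch assumes the max-value entropy search expression of [wang2017max]: for each sampled maximum value, standardize it against the GP posterior at the query location, then average a term built from the standard normal PDF and CDF. The posterior means, standard deviations, and sampled maxima below are made-up numbers, and the small floor on the CDF is a numerical safeguard added for the example.

```python
import math


def norm_pdf(u):
    """Standard normal probability density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def norm_cdf(u):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def mvi_reward(mean, std, sampled_maxima):
    """Monte Carlo MVI estimate at one location from sampled maximum values."""
    total = 0.0
    for z_star in sampled_maxima:
        gamma = (z_star - mean) / std           # standardized gap to the sampled maximum
        cdf = max(norm_cdf(gamma), 1e-12)       # guard against log(0)
        total += gamma * norm_pdf(gamma) / (2.0 * cdf) - math.log(cdf)
    return total / len(sampled_maxima)


maxima = [2.0, 2.2]                                        # hypothetical sampled maxima
reward_near_max = mvi_reward(mean=1.9, std=0.5, sampled_maxima=maxima)
reward_uncertain = mvi_reward(mean=0.0, std=1.0, sampled_maxima=maxima)
reward_certain_low = mvi_reward(mean=0.0, std=0.1, sampled_maxima=maxima)
```

In this toy comparison the location whose posterior sits close to the sampled maxima earns the most reward, an uncertain location earns some, and a confidently low location earns essentially none, which matches the exploration-then-exploitation behavior described in the text.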

MVI initially favors collecting observations in areas that have high uncertainty due to sampling maxima from the initial uniform GP belief. As observations are collected and uncertainty diminishes in the GP, the sampled maxima converge to the true maximum and reward concentrates locally at this point, encouraging exploitative behavior. This contrasts with the Upper Confidence Bound (UCB) heuristic, which distributes reward proportional to predictive mean and weighted variance of the current GP belief model (Eq. 3 & 4): $R_{UCB}(b_t, x) = \mu_t(x) + \sqrt{\beta_t} \, \sigma_t(x)$. As the robot explores, UCB reward converges to the underlying phenomenon, $f$. The difference in convergence characteristics between MVI and UCB can be observed in Fig. 3.
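For comparison, the UCB heuristic can be sketched in a few lines, assuming the common GP-UCB form of predictive mean plus a weighted predictive standard deviation. The weight `beta` and the numeric inputs are illustrative assumptions.

```python
import math


def ucb_reward(mean, std, beta):
    """GP-UCB style reward: predictive mean plus sqrt(beta) times predictive std."""
    return mean + math.sqrt(beta) * std


# As the predictive uncertainty at a location shrinks, the reward approaches the mean.
early = ucb_reward(mean=2.0, std=0.5, beta=4.0)
late = ucb_reward(mean=2.0, std=0.0, beta=4.0)
```

Once the uncertainty term vanishes, every high-valued region keeps a reward equal to its predicted value, so reward does not concentrate at the single global maximum.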

IV Experiments and Results

Planner    | Convex Simulation Trials (50 trials)    | ASV Trial (1 trial) | Non-convex Simulation Trials (50 trials) | Dubins Car Trials (5 trials)
           | MSS Reward | RMSE       | $x^*$ Error | MSS Reward          | MSS Reward | RMSE      | $x^*$ Error  | MSS Reward
PLUMES     | 199 (89)   | 3.8 (9.2)  | 0.21 (0.23) | 524                 | 206 (100)  | 3.6 (2.1) | 0.25 (0.56)  | 159 (74)
UCB-MCTS   | 171 (179)* | 3.7 (9.6)  | 0.24 (0.29) | -                   | 115 (184)* | 3.6 (1.5) | 0.27 (1.18)  | 52 (17)
UCB-Myopic | 148 (199)* | 3.6 (9.2)  | 0.33 (3.25) | -                   | 86 (102)*  | 3.4 (1.0) | 0.23 (0.34)  | 42 (66)
Boustro.   | 27 (3)*    | 2.7 (10.4) | 0.26 (0.46) | 63                  | -          | -         | -            | -

TABLE I: Accumulated True MSS Reward (Eq. 7), RMSE, and $x^*$ Error, Reported as Median (Interquartile Range). Asterisks denote baselines whose difference in performance is statistically significant compared to PLUMES.

We analyze the empirical performance of PLUMES in a breadth of MSS scenarios that feature convex and non-convex environments. We compare against three baselines used in environmental surveying: non-adaptive lawnmower-coverage (Boustro., an abbreviation of boustrophedonic [choset1998coverage]), greedy myopic planning with UCB reward (UCB-Myopic) [Sun2017], and nonmyopic planning with traditional MCTS [browne2012survey] that uses the maximum-likelihood observation and UCB reward (UCB-MCTS) [Marchant2014a]. The performance of UCB planners has been shown to be sensitive with respect to the $\beta$ value [Marchant2014a]. In order to avoid subjective tuning, we select a time-varying $\beta_t$ that is known to enable no-regret UCB planning [Srinivas2012, Sun2017]. PLUMES uses continuous-observation MCTS with hyperparameters presented in Auger et al. [auger2013continuous].


To evaluate the mission performance of all planners, we report accumulated MSS reward (Eq. 7), which directly corresponds to the number of scientifically valuable samples collected within an $\epsilon$-ball of the true maximum. This metric is reported for all trial scenarios in Table I. We additionally report several metrics commonly used in IPP to evaluate posterior model quality: overall environmental posterior root mean-squared error (RMSE) and error in posterior prediction of $x^*$ at the end of a mission ($x^*$ error). We use a Mann-Whitney U non-parametric significance test [mann1947test] to report statistical significance (p = 0.05 level) in performance between PLUMES and baseline algorithms.
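To make the significance test concrete, the sketch below computes a two-sided Mann-Whitney U test in pure Python. It uses the large-sample normal approximation without a tie or continuity correction, which is a simplification assumed for this example; the reward values are made up and are not results from the paper.

```python
import math


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test via the normal approximation (no tie correction)."""
    pooled = sorted((v, g) for g, sample in enumerate((a, b)) for v in sample)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0        # tied values share their average rank
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    rank_sum_a = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    n1, n2 = len(a), len(b)
    u1 = rank_sum_a - n1 * (n1 + 1) / 2.0
    u = min(u1, n1 * n2 - u1)                 # smaller of the two U statistics
    z = (u - n1 * n2 / 2.0) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    p = 1.0 + math.erf(z / math.sqrt(2.0))    # equals 2 * Phi(z) because z <= 0
    return u, min(p, 1.0)


# Made-up accumulated rewards for two hypothetical planners.
u_sep, p_sep = mann_whitney_u([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
u_same, p_same = mann_whitney_u([1, 2, 3], [1, 2, 3])
```

A rank-based test of this kind is a natural fit when reward distributions are multimodal and far from Gaussian, since it makes no assumption about their shape.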

Fig. 4: Simulation Environments: The multimodal simulated m m environments. Yellow regions are high-valued; blue regions are low-valued. The global maximum is marked with a star. The left and center environments represent convex-worlds (Section IV-A), while the right environment is representative of a non-convex world (Section IV-B).

IV-A Bounded Convex Environments

Fig. 5: Distribution of accumulated MSS reward in 50 convex-world simulations: Accumulated MSS reward is calculated for each trial and the distribution for each planner is plotted as a kernel density estimate (solid line). The dashed lines represent the median accumulated reward for each planner (reported in Table I). The gray area of the plot indicates a low performance region where the planner collected <50 samples near the maximum. PLUMES has a single mode near 200, whereas both UCB-based methods are multi-modal, with modes in the low performance region.

In marine and atmospheric applications, MSS often occurs in a geographically bounded, obstacle-free environment. In 50 simulated trials, we applied PLUMES and our baseline planners to a point robot in a multimodal environment drawn randomly from a GP prior with a squared-exponential covariance function and zero mean (, , [1%]) (see Fig.4). The action set consisted of ten viable trajectories centered at the robot’s pose with path length , and samples were collected every of travel. Mission lengths were budgeted to be . Nonmyopic planners rolled out to a 5-action horizon and were allowed 250 rollouts per planning iteration. Summary simulation results are presented in Table I.

In these trials, PLUMES accumulated significantly (0.05-level) more reward than baselines. The distribution of accumulated reward (Fig. 5) shows that PLUMES has a single dominating mode near reward 200 and few low-performing missions (reward <50). In contrast, both UCB-based methods have distributions which are multimodal, with non-trivial modes in the low-performance region. Boustro. collected consistently few scientifically valuable samples. In addition to collecting many more samples at the maximum, PLUMES achieved statistically indistinguishable levels of posterior RMSE and $x^*$ error compared to baselines (Table I).

The corresponding field trial for convex-world maximum-search was performed in the Bellairs Fringing Reef, Barbados by a custom-built autonomous surface vehicle (ASV) with the objective of localizing the most exposed coral head. Coral head exposure is used to select vantage points for coral imaging [manjanna2016efficient] and in ultraviolet radiation studies on coral organisms [banaszak2009effects]. Due to time and resource constraints, only one trial of two planners was feasible on the physical reef; we elected to demonstrate PLUMES and Boustro., one of the most canonical surveying strategies in marine sciences.

The ASV ( ) had holonomic dynamics and a downward-facing acoustic point altimeter (Tritech Micron Echosounder) with returns at . Ten dynamically-feasible straight paths radiating from the location of the ASV were used in the action set. The environment was bounded by a by geofence. Localization and control was provided by a PixHawk Autopilot with GPS and internal IMU; the fused state estimate was empirically suitable for the desired maximum localization accuracy ( = ). The budget for each mission was , which took approx. 45 minutes to travel. The GP kernel was trained on altimeter data from a dense data collection deployment the day before (parameters , , [26%]). Note the high noise in the inferred GP model, as well as the relatively small length-scale in the field site. The reconstructed bathymetry and vehicle are shown in Fig. 6.

Fig. 6: Coral head map and ASV: (A) The ground truth bathymetric map inferred from all collected data, mean corrected in depth. Yellow represents shallower depths, and blue is deeper. The global maximum is marked with a black star. (B) The custom ASV used to traverse the region.
Fig. 7: Extending PLUMES for Spatiotemporal Monitoring: (A) The ground truth map at two planning iterations for a dynamic environment. The maximum is marked with a black star, and migrates from the top left to the top right of the world. (B) MVI reward is redistributed by using a spacetime kernel within PLUMES that captures the environment’s dynamics.

PLUMES successfully identified the same coral head to be maximal as that inferred from the GP trained on prior dense data collection, as indicated by accumulated reward in Table I, overcoming the challenges of moving in ocean waves, noisy altimeter measurements, and highly multimodal environment. Additionally, the posterior prediction of had an error of only while Boustro. reported error due to its non-adaptive sampling strategy.

In the Bellairs Fringing Reef trials, the environment was assumed to be static. However, in many marine domains the impact of sediment transport, waves, and tides could physically change the location of a maximum over the course of a mission. PLUMES can be extended to dynamic environments by employing a spatiotemporal kernel in the GP model, which allows for the predictive mean and variance to change temporally [singh2010modeling]. If the dynamics of an environment can be encoded in the kernel function, no other changes to PLUMES are necessary; MVI will be distributed according to the time dynamic. Fig. 7 demonstrates the properties of PLUMES with a squared-exponential kernel over space (, , ) and time (, , ). In this illustrative scenario, the global maximum moved between planning iteration and . PLUMES with a spatiotemporal kernel maintained multiple hypotheses about the maximum’s location given the random-walk dynamic of the environment, resulting in MVI reward being re-distributed between the two maxima over time.
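One simple way to build a spatiotemporal kernel is to multiply a squared-exponential kernel over space by one over time. The source states only that squared-exponential kernels over space and time were used; the product form, the hyperparameter values, and the function names below are assumptions for illustration.

```python
import math


def sq_exp(delta_sq, length_scale, variance=1.0):
    """Squared-exponential kernel evaluated on a squared distance."""
    return variance * math.exp(-0.5 * delta_sq / length_scale ** 2)


def spacetime_kernel(p, q, space_ls=1.0, time_ls=1.0, variance=1.0):
    """p and q are (x, y, t) tuples; product of a spatial and a temporal kernel."""
    d_space = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    d_time = (p[2] - q[2]) ** 2
    return sq_exp(d_space, space_ls, variance) * sq_exp(d_time, time_ls)


# Correlation of a location with itself now, and with itself five time steps later.
now = spacetime_kernel((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), time_ls=2.0)
later = spacetime_kernel((0.0, 0.0, 0.0), (0.0, 0.0, 5.0), time_ls=2.0)
```

Under such a kernel, old observations lose influence on the current prediction as time passes, so predictive variance grows again at previously visited locations and the planner is drawn back to check whether the maximum has moved.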

IV-B Non-Convex Environments

We next consider non-convex environments with potentially unknown obstacles, a situation that occurs frequently in practical MSS applications with geographical no-go zones for rover or ASV missions, and in indoor or urban settings. We evaluated PLUMES, UCB-Myopic, and UCB-MCTS planners in 50 simulated trials with the same environments, vehicle, and actions as described in Section IV-A, with the inclusion of 12 block obstacles placed uniformly around the world in known locations (see Fig.4). Boustro. was not used as a baseline because of non-generality of the offline approach to unknown obstacle maps.

As indicated in Table I, PLUMES accumulated significantly more MSS reward than UCB-MCTS and UCB-Myopic, at the 0.05-level. The distribution of reward across the trials is visualized in Fig. 8. As in the convex world, PLUMES has a primary mode between reward 200-250, while the UCB-based planners have a primary mode in the low-performance region (reward <50). There was no significant difference between planners with respect to RMSE or $x^*$ error. The fact that PLUMES maximized the true MSS reward while achieving statistically indistinguishable error highlights the difference in exploitation efficiency between PLUMES and UCB-based methods.

Fig. 8: Distribution of accumulated MSS reward in 50 non-convex mission simulations: Accumulated MSS reward distribution (solid line) and median (dashed line, reported in Table I) for each planner. The gray area of the plot indicates a low performance region (reward <50). PLUMES has few low-performing missions and a primary mode near reward 250. The primary mode of both UCB-based methods is in the low performance region due to convergence to suboptimal local maxima.
Fig. 9: Snapshot of unknown non-convex map scenario: (A) shows examples of how the action-primitives change based upon obstacle detection (black lines) and safety padding (grey lines). (B-D) show a planning iteration of PLUMES, starting with the current belief map and obstacle detections (B). The MVI heuristic is illustrated in (C) where lighter regions are higher value. (D) shows the rollout visibility of continuous-observation MCTS where darker regions are visited more often. Areas of high reward are generally visited more often by the search as the tree expands.

The preceding simulation experiments assume that a geometric map is known a priori. However, in practical applications such as indoor gas leak detection, access to a map may be limited or unavailable. We simulate the scenario in which a nonholonomic car equipped with a laser range-finder must build a map online as it seeks the maximum in a cluttered indoor environment (Fig. 9). We generate a simulated chemical phenomenon from a GP (, , [2%]), and simulate observations at . The action set for the vehicle consists of eleven Dubins curves projected in front of the vehicle, one straight path behind the vehicle, and a “stay in place” action. Results for five trials are shown in Table I and illustrate that PLUMES accumulates more MSS reward than the baselines, indicating robust performance.
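As an illustration of how such an action set might be constructed and pruned against detected obstacles (cf. Fig. 9A), the following Python sketch generates eleven forward primitives, one reverse straight path, and a "stay in place" action, then discards any primitive that passes within a safety padding of a detection. This is a minimal sketch under stated assumptions, not the authors' implementation: constant-curvature arcs stand in for the Dubins curves, and the function names, path length, curvature range, and padding value are all hypothetical.

```python
import math


def arc_primitive(x, y, heading, curvature, length, n_points=10):
    """Waypoints along a constant-curvature arc from pose (x, y, heading).

    A negative length traces the path backwards (used for the reverse action).
    """
    points = []
    for i in range(1, n_points + 1):
        s = length * i / n_points
        if abs(curvature) < 1e-9:
            # Straight-line limit of the arc equations.
            px = x + s * math.cos(heading)
            py = y + s * math.sin(heading)
        else:
            px = x + (math.sin(heading + curvature * s) - math.sin(heading)) / curvature
            py = y - (math.cos(heading + curvature * s) - math.cos(heading)) / curvature
        points.append((px, py))
    return points


def build_action_set(x, y, heading, length=1.0, max_curvature=1.0, n_forward=11):
    """Eleven forward arcs, one straight reverse path, and a stay action.

    The length and curvature values are hypothetical placeholders.
    """
    actions = {}
    for i in range(n_forward):
        k = -max_curvature + 2.0 * max_curvature * i / (n_forward - 1)
        actions[f"forward_{i}"] = arc_primitive(x, y, heading, k, length)
    actions["reverse"] = arc_primitive(x, y, heading, 0.0, -length)
    actions["stay"] = [(x, y)]
    return actions


def is_safe(path, detections, padding):
    """True if every waypoint is farther than `padding` from every detection."""
    return all(
        math.hypot(px - ox, py - oy) > padding
        for px, py in path
        for ox, oy in detections
    )


def feasible_actions(actions, detections, padding):
    """Keep only the primitives that respect the safety padding."""
    return {name: path for name, path in actions.items()
            if is_safe(path, detections, padding)}


if __name__ == "__main__":
    actions = build_action_set(0.0, 0.0, 0.0)
    # One hypothetical range-finder detection directly ahead of the vehicle.
    safe = feasible_actions(actions, detections=[(1.0, 0.0)], padding=0.2)
    print(sorted(safe))
```

Checking only sampled waypoints keeps the sketch short; a denser sampling or a segment-level distance test would be needed for thin obstacles.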

These simulation and robot trials demonstrate the utility of PLUMES compared to canonical and state-of-the-art baselines in a diverse set of environments with challenging practical conditions. For high-stakes scientific deployments, the consistent convergence and sampling performance of PLUMES is critical and beneficial.

V Discussion and Future Work

Online planning methods for robotic maximum seek-and-sample are critical in a variety of contexts, including general environmental monitoring (scientific inquiry, reconnaissance) and disaster response (oil spill, gas leak, radiation). For partially observable environments that can be modelled using a GP, PLUMES is a novel approach for global maximum seek-and-sample that provides several key insights.

This work presents MVI as an empirically suitable alternative to the canonical GP-UCB heuristic in MSS solvers. MVI is naturally adaptive and avoids the hand-tuned parameter that UCB requires to balance exploration and exploitation. MVI samples potential global maxima from the robot’s full belief state to manage exploration and exploitation. In contrast, heuristic functions like UCB place reward on all high-valued or highly uncertain regions, leading to unnecessary exploration and limiting the time available to exploit knowledge of the true maximum. Ultimately, the MVI heuristic allows PLUMES to collect exploitative samples, while still achieving the same overall level of posterior model accuracy (shown by RMSE) as UCB-based planners. Additionally, continuous-observation MCTS allows PLUMES to search over belief-spaces on continuous functions without discretization or maximum-likelihood assumptions.
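The contrast between the two heuristics can be sketched in a few lines of Python. The sketch below is illustrative only and is not the authors' implementation. It assumes the belief is a joint Gaussian over a small, hypothetical discretized set of candidate locations (PLUMES itself does not require discretization), and it assumes the closed-form approximation from max-value entropy search as a stand-in for the MVI reward. The function names and the `beta` parameter are likewise assumptions.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("covariance must be positive definite")
                low[i][j] = math.sqrt(d)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_maxima(mean, cov, n_samples, rng):
    """Draw joint samples from N(mean, cov); return (argmax index, max value) pairs."""
    low = cholesky(cov)
    n = len(mean)
    maxima = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        f = [mean[i] + sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
        best = max(range(n), key=f.__getitem__)
        maxima.append((best, f[best]))
    return maxima


def ucb_score(mu, sigma, beta):
    """UCB-style reward: high wherever the mean OR the uncertainty is high."""
    return mu + beta * sigma


def _pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def max_value_info(mu, sigma, sampled_max_values):
    """Max-value-entropy-search-style reward (assumed stand-in for MVI).

    Averages, over sampled maximum values y*, the approximate information
    that an observation at a location with marginal N(mu, sigma^2) gives
    about y*.
    """
    total = 0.0
    for y_star in sampled_max_values:
        gamma = (y_star - mu) / sigma
        cdf = max(_cdf(gamma), 1e-12)  # guard against log(0)
        total += gamma * _pdf(gamma) / (2.0 * cdf) - math.log(cdf)
    return total / len(sampled_max_values)


if __name__ == "__main__":
    rng = random.Random(0)
    mean = [0.0, 1.0, 2.0, 1.0]          # hypothetical belief mean
    cov = [[0.5 if i == j else 0.1 for j in range(4)] for i in range(4)]
    y_stars = [value for _, value in sample_maxima(mean, cov, 50, rng)]
    for i, mu in enumerate(mean):
        sigma = math.sqrt(cov[i][i])
        print(i, ucb_score(mu, sigma, beta=2.0), max_value_info(mu, sigma, y_stars))
```

In this sketch the information reward shrinks toward zero at locations whose marginal lies far below the sampled maxima, whereas the UCB score stays large wherever the uncertainty is large, regardless of whether the location could plausibly contain the maximum.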

One important area of future work for PLUMES is online GP kernel hyperparameter learning [ranganathan2011online], which is important when only one mission is possible and there is insufficient prior knowledge for hyperparameter selection. Another avenue of future work is to examine the properties of the maxima sampled by MVI, which could serve as a heuristic for meta-behavior transitions (e.g., action model switching, dynamic horizon setting) or mission termination. Finally, the performance of PLUMES in non-convex environments is affected by the chosen discrete action set. Extending PLUMES to continuous action spaces, in the spirit of, e.g., Morere et al. [morere2018continuous], would allow increased flexibility in these environments.

VI Conclusion

This paper formalizes the maximum-seek-and-sample POMDP and presents PLUMES, an adaptive planning algorithm that employs continuous-observation MCTS and maximum-value information reward to perform efficient maximum-seeking in partially observable, continuous environments. PLUMES outperforms canonical coverage and UCB-based state-of-the-art methods with statistical significance in challenging simulated and real-world conditions (e.g., multiple local maxima, unknown obstacles, sensor noise). Maximum seek-and-sample is a critical task in environmental monitoring, and PLUMES is well-suited to it, with theoretical convergence guarantees, strong empirical performance, and robustness under real-world conditions.


We would like to thank our reviewers for their feedback on this manuscript. Additionally, we thank the SLI group, RRG and WARPlab for their insight and support. This project was supported by an NSF-GRFP award (G.F.), NDSEG Fellowship award (V.P.), and NSF NRI Award 1734400.