I Introduction
In many environmental and earth science applications, experts want to collect scientifically valuable samples of a maximum (e.g., an oil spill source), but the distribution of the phenomenon is initially unknown. This maximum seek-and-sample (MSS) problem is pervasive. Canonically, samples are collected at predetermined locations by a technician or by a mobile platform following a uniform coverage trajectory. These non-adaptive strategies result in sample sparsity at the maximum and may be infeasible when the geometric structure of the environment is unknown (e.g., boulder fields) or changing (e.g., tidal zones). Increasing the number of valuable samples at the maximum requires adaptive online planning and execution. We present PLUMES — Plume Localization under Uncertainty using Maximum-ValuE information and Search — an adaptive algorithm that enables a mobile robot to efficiently localize and densely sample an environmental maximum, subject to practical challenges including dynamic constraints, an unknown geometric map and obstacles, and noisy sensors with limited field-of-view. Fig. 1 shows a motivating application: coral head localization.
Informative Path Planning: The MSS problem is closely related to informative path planning (IPP) problems. Canonical offline IPP techniques for pure information gathering that optimize submodular coverage objectives can achieve near-optimal performance [Srinivas2012, binney2012branch]. However, in the MSS problem, the value of a sample depends on the unknown maximum location, requiring adaptive planning that lets the robot select actions that explore to localize the maximum and then seamlessly transition to actions that exploitatively collect valuable samples there. Even for adaptive IPP methods, the MSS problem presents considerable challenges. The target environmental phenomenon is partially observable and most directly modeled as a continuous scalar function. Additionally, efficient maximum sampling with a mobile robot requires consideration of vehicle dynamics, travel cost, and a potentially unknown obstacle map. Handling these challenges in combination excludes adaptive IPP algorithms that use discrete state spaces [Lim2016, Arora2017], known metric maps [singh2009nonmyopic, Jawaid2015], or unconstrained sensor placement [Krause2008].
The MSS POMDP: Partially observable Markov decision processes (POMDPs) are general models for decision-making under uncertainty that allow the challenging aspects of the MSS problem to be encoded. We define the MSS POMDP, in which the partially observable state represents the continuous environmental phenomenon and a sparse reward function encodes the MSS scientific objective by giving reward only to samples sufficiently close to the global maximum. Solving a POMDP exactly is generally intractable, and the MSS POMDP is further complicated by continuous state and observation spaces and by the sparse MSS reward function. This presents the two core challenges that PLUMES addresses: performing online search in a belief space over continuous functions, and overcoming reward-function sparsity.
Planning over Continuous Domains: In the MSS problem, the state of the environment can be modeled as a continuous function. PLUMES uses a Gaussian process (GP) model to represent the belief over this continuous function, and must plan over the uncountable set of possible GP beliefs that arise from future continuous observations. To address planning in continuous spaces, state-of-the-art online POMDP solvers use deterministic discretization [ling2016gaussian] or a combination of sampling techniques and particle-filter belief representations [somani2013despot, kurniawati2016online, Silver2010, sunberg2017value]. Efficiently discretizing, or maintaining a particle set rich enough to represent the underlying continuous function in MSS applications, is itself a challenging problem and can lead to inaccurate inference of the maximum [dallaire2009bayesian]. Other approaches have considered using the maximum-likelihood observation to make search tractable [Marchant2014a]. However, this assumption can compromise search and has optimality guarantees only in linear-Gaussian systems [platt2010belief]. Instead, PLUMES uses Monte Carlo tree search (MCTS) with progressive widening, which we call continuous-observation MCTS, to limit planning-tree growth [couetoux2011continuous] and retain optimality [auger2013continuous] in continuous environments.
Rewards and Heuristics: In the MSS POMDP, the reward function is sparse and does not explicitly encode the value of exploration. Planning with sparse rewards requires long-horizon information gathering and is an open problem in robotics [smart2002effective]. To alleviate this difficulty, less sparse heuristic reward functions can be optimized in place of the true reward, but these heuristics must be selected carefully to ensure the planner performs well with respect to the true objective. In IPP, heuristics based on the value of information have been applied successfully [Marchant2014a, Sun2017, Krause2008, Hitz2017], primarily using the GP-UCB criterion [Contal2013, Srinivas2012]. We demonstrate that within practical mission constraints, using UCB as the heuristic reward function for the MSS POMDP can lead to suboptimal convergence to local maxima, due to a mismatch between the UCB heuristic and the true MSS reward. Instead, PLUMES takes advantage of a heuristic function from the Bayesian optimization (BO) community for state-of-the-art black-box optimization [wang2017max], which we call maximum-value information (MVI). MVI overcomes sparsity and encourages long-term information gathering, while still converging to the true reward of the MSS POMDP.
The contribution of this paper is the MSS POMDP formalism and the corresponding PLUMES planner, which, by virtue of its belief model, information-theoretic reward heuristic, and search framework, enables efficient maximum seek-and-sample with asymptotic optimality guarantees in continuous environments. PLUMES extends the state-of-the-art in MSS planners by applying a BO heuristic reward function that alleviates the challenges of the true sparse MSS reward, and by integrating GP belief representations within continuous-observation MCTS. The utility of PLUMES for MSS applications is demonstrated in extensive simulation and field trials, showing a statistically significant performance improvement over state-of-the-art baselines.
II Maximum Seek-and-Sample POMDP
We formalize the MSS problem by considering a target environmental domain as a $d$-dimensional compact set $\mathcal{X} \subset \mathbb{R}^d$. We allow $\mathcal{X}$ to contain obstacles with arbitrary geometry and let $\mathcal{X}' \subseteq \mathcal{X}$ be the set of points reachable from the robot's initial pose. We assume there is an unknown underlying continuous function $f : \mathcal{X}' \to \mathbb{R}$ representing the value of a continuous phenomenon of interest. The objective is to find the unique global maximizer $x^* = \operatorname{argmax}_{x \in \mathcal{X}'} f(x)$ by safely navigating while receiving noisy observations of this function $f$. Because $f$ is unknown, we cannot access derivative information or any analytic form.
We model the process of navigating and generating observations as the MSS POMDP, an 8-tuple $\langle S, A, Z, T, O, R, \gamma, b_0 \rangle$:

- $S$: continuous state space of the robot and environment
- $A$: discrete set of action primitives
- $Z$: continuous space of possible observations
- $T : S \times A \to \Delta(S)$, the transition function, i.e., $T(s, a, s') = \Pr(s' \mid s, a)$
- $O : S \times A \to \Delta(Z)$, the observation model, i.e., $O(s', a, z) = \Pr(z \mid s', a)$
- $R : S \times A \to \mathbb{R}$, the reward $R(s, a)$ of taking action $a$ when the robot's state is $s$
- $\gamma \in [0, 1]$: discount factor
- $b_0$: initial belief state of the robot

where $\Delta(\cdot)$ denotes the space of probability distributions over the argument.
The Bellman equation recursively quantifies the value of a belief $b_t$ over a finite horizon $h$ under policy $\pi$ as:

$V^\pi_h(b_t) = \mathbb{E}\left[R(s_t, \pi(b_t)) \mid b_t\right] + \gamma\, \mathbb{E}_{z \in Z}\left[V^\pi_{h-1}(b_{t+1})\right]$ (1)

where the first expectation is taken over the current belief and $b_{t+1}$ is the updated belief after taking action $\pi(b_t)$ and observing $z$. The optimal policy over horizon $h$ is the maximizer of the value function over the space of possible policies $\Pi$: $\pi^* = \operatorname{argmax}_{\pi \in \Pi} V^\pi_h(b_t)$. However, Eq. 1 is intractable to compute in continuous state and observation spaces; the optimal policy must be approximated. PLUMES uses a receding-horizon, online POMDP planner and a heuristic reward function to approximately solve the MSS POMDP in real time on robotic systems.
III The PLUMES Algorithm
PLUMES is an online planning algorithm with a sequential decision-making structure:

1. Conditioned on the current belief $b_t$, approximate the optimal policy for finite horizon $h$ and execute the first action $a_t$.

2. Collect observations $z_t$ according to the observation model $O$.

3. Update $b_t$ to incorporate the new observations; repeat.
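The three steps above can be sketched as a minimal receding-horizon loop. This is an illustrative sketch, not the authors' implementation: `plan` stands in for continuous-observation MCTS (Sec. III-B), `execute` for acting and sensing via $O$, and `belief.updated` for the GP posterior update (Sec. III-A); all names are ours.

```python
def run_mission(belief, pose, plan, execute, num_iterations):
    """Receding-horizon plan/act/sense loop: approximate the optimal
    policy, execute its first action, then fold the new observations
    back into the belief before replanning."""
    history = []
    for _ in range(num_iterations):
        action = plan(belief, pose)                 # step 1: approximate pi*, take a_t
        pose, observations = execute(pose, action)  # step 2: act and sense via O
        belief = belief.updated(observations)       # step 3: belief update; repeat
        history.append((pose, action, observations))
    return belief, history
```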
In the following sections, we define the specific choice of belief model, planning algorithm, and heuristic reward function that PLUMES uses to solve the MSS POMDP.
III-A Gaussian Process Belief Model
We assume the robot's pose $x_t$ at planning iteration $t$ is fully observable, and the unknown environmental phenomenon $f$ is partially observable. The full belief-state at time $t$ is represented as a tuple of the robot state and the environment belief. Because $f$ is a continuous function, we cannot represent the belief as a distribution over discrete states, as is standard in the POMDP literature [kaelbling1998planning], and must choose an alternate representation. PLUMES uses a Gaussian process (GP) [Rasmussen2004] to represent the belief over $f$, conditioned on a history of past observations. This GP is parameterized by a mean function $\mu(x)$ and covariance function $\kappa(x, x')$.
As the robot traverses the environment, at a location $x_i$ it gathers an observation of $f$ subject to sensor noise $\epsilon_i$, such that $z_i = f(x_i) + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$. Given a history of observations $\mathbf{z}_t = [z_1, \dots, z_t]^\top$ at observation locations $X_t = \{x_1, \dots, x_t\}$ up to planning iteration $t$, the posterior belief at a new location $x$ is computed:

$f(x) \mid \mathbf{z}_t \sim \mathcal{N}\left(\mu_t(x), \sigma_t^2(x)\right)$ (2)

$\mu_t(x) = \kappa_t(x)^\top \left(K_t + \sigma_n^2 I\right)^{-1} \mathbf{z}_t$ (3)

$\sigma_t^2(x) = \kappa(x, x) - \kappa_t(x)^\top \left(K_t + \sigma_n^2 I\right)^{-1} \kappa_t(x)$ (4)

where $\kappa_t(x) = [\kappa(x_1, x), \dots, \kappa(x_t, x)]^\top$, $K_t$ is the positive-definite kernel matrix with $[K_t]_{ij} = \kappa(x_i, x_j)$ for all $x_i, x_j \in X_t$, and $I$ is the identity matrix.
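Eq. 3 & 4 can be sketched in a few lines of numerical code. This is a minimal illustration, assuming 1-D inputs and a squared-exponential kernel; the kernel choice and parameter values here are ours, not the trial settings.

```python
import numpy as np

def sq_exp_kernel(a, b, lengthscale=1.0, variance=1.0):
    """kappa(x, x') for 1-D input arrays a (n,) and b (m,) -> (n, m) matrix."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(x_train, z_train, x_query, sigma_n=0.1):
    """Posterior mean (Eq. 3) and variance (Eq. 4) at query locations."""
    K = sq_exp_kernel(x_train, x_train)                 # K_t
    k_star = sq_exp_kernel(x_train, x_query)            # kappa_t(x)
    noisy = K + sigma_n ** 2 * np.eye(len(x_train))     # K_t + sigma_n^2 I
    mu = k_star.T @ np.linalg.solve(noisy, z_train)     # Eq. (3)
    v = np.linalg.solve(noisy, k_star)
    var = sq_exp_kernel(x_query, x_query).diagonal() - np.sum(k_star * v, axis=0)
    return mu, var                                      # Eq. (4)
```

Far from any observation, the posterior variance reverts to the prior variance, which is what drives MVI's early exploratory behavior.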
III-B Planning with Continuous-Observation MCTS
PLUMES selects high-reward actions with a receding-horizon search over possible belief states. This search requires a simulator that can sample observations and generate beliefs for a proposed action sequence. For PLUMES, this simulator is the GP model, which represents the belief over the continuous function $f$ and in turn simulates continuous observations from proposed action sequences by sampling from the Gaussian distribution defined by Eq. 3 & 4.

PLUMES uses continuous-observation MCTS to overcome the challenges of planning in continuous state and observation spaces. Continuous-observation MCTS has three stages: selection, forward simulation, and backpropagation. Each node in the tree is represented by the tuple of robot pose and GP belief. Additionally, we refer to two types of nodes: belief nodes and belief-action nodes. The root of the tree is always a belief node, which represents the entire history of actions and observations up through the current planning iteration. Through selection and simulation, belief and belief-action nodes are alternately added to the tree (Fig. 2).
From the root, a rollout begins with the selection stage, in which a belief-action child is selected according to the Polynomial Upper Confidence Tree (PUCT) policy [auger2013continuous]. The PUCT value is the sum of the average heuristic reward (i.e., MVI) from all previous simulations and a term that favors less-simulated action sequences:

$\mathrm{PUCT}(b_t, a) = \hat{Q}(b_t, a) + \sqrt{\dfrac{N(b_t)^{e_d}}{N(b_t, a)}}$ (5)

where $\hat{Q}(b_t, a)$ is the average heuristic reward of choosing action $a$ with belief $b_t$ in all previous rollouts, $N(b_t)$ is the number of times the node has been simulated, $N(b_t, a)$ is the number of times that particular action has been selected from the node, and $e_d$ is a depth-dependent parameter (refer to Table 1 of Auger et al. [auger2013continuous] for parameter settings).
Once a child belief-action node is selected, the associated action is forward simulated using the generative observation model $O$, and a new belief node is generated as though the action were taken and samples observed. The simulated observations are drawn from the belief-action node's GP model, and the robot's pose is updated deterministically based on the selected action. Since the observations of a GP are continuous, every sampled observation is unique with probability one. Progressive widening, with depth-dependent parameter $\alpha_d$, incrementally grows the tree by limiting the number of belief children of each belief-action node: the next belief node is chosen to be the least-visited existing child if the progressive-widening criterion $\lceil N(b_t, a)^{\alpha_d} \rceil$ does not exceed the current number of children, and is otherwise a new child with observations simulated from the GP. By limiting the width of the search tree and incrementally growing the number of explored children, progressive widening avoids search degeneracy in continuous environments.
Once a sequence of actions has been rolled out to horizon $h$, the accumulated heuristic reward is propagated upward from the leaves to the tree root. The average accumulated heuristic reward and visit counts are updated for each node visited in the rollout. Rollouts continue until the computation budget is exhausted, and the most-visited belief-action child of the root node is executed.
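The selection and expansion machinery described above can be sketched compactly. This is a simplified illustration under stated assumptions, not the paper's implementation: the depth-dependent parameters $e_d$ and $\alpha_d$ are fixed constants here, and `simulate` / `reward` stand in for GP observation sampling and the MVI heuristic.

```python
import math

class BeliefNode:
    def __init__(self, belief):
        self.belief, self.N = belief, 0
        self.children = {}                 # action -> BeliefActionNode

class BeliefActionNode:
    def __init__(self):
        self.N, self.Q = 0, 0.0
        self.children = []                 # simulated belief-node children

def puct_select(node, actions, e=0.5):
    """Eq. (5): average return plus a polynomial exploration bonus."""
    def value(a):
        child = node.children.setdefault(a, BeliefActionNode())
        if child.N == 0:
            return float("inf")            # try every action at least once
        return child.Q + math.sqrt(node.N ** e / child.N)
    return max(actions, key=value)

def rollout(node, actions, simulate, reward, depth, alpha=0.5, gamma=0.95):
    """One selection/simulation/backpropagation pass to the given depth."""
    if depth == 0:
        return 0.0
    a = puct_select(node, actions)
    ba = node.children[a]
    # Progressive widening: only add a new simulated belief child while
    # ceil(N^alpha) exceeds the number of existing children.
    if math.ceil((ba.N + 1) ** alpha) > len(ba.children):
        ba.children.append(BeliefNode(simulate(node.belief, a)))
        nxt = ba.children[-1]
    else:
        nxt = min(ba.children, key=lambda c: c.N)   # revisit least-visited child
    r = reward(node.belief, a) + gamma * rollout(
        nxt, actions, simulate, reward, depth - 1, alpha, gamma)
    node.N += 1; ba.N += 1; nxt.N += 1
    ba.Q += (r - ba.Q) / ba.N              # running average of returns
    return r
```

After a budget of rollouts, the planner would execute the most-visited belief-action child of the root, as in the text.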
Continuous-observation MCTS within PLUMES provides both practical and theoretical benefits. Practically, progressive widening directly addresses search degeneracy by visiting belief nodes multiple times even in continuous observation spaces, allowing a more representative estimate of their value. Theoretically, PLUMES can be shown to select asymptotically optimal actions. We briefly describe how the analysis of Auger et al. [auger2013continuous] for PUCT-MCTS with progressive widening in MDPs can be extended to PLUMES.

Using standard methods [kaelbling1998planning], we can reduce the MSS POMDP to an equivalent belief-state MDP. This belief-state MDP has a state space equal to the set of all possible beliefs, and a transition distribution that captures the effect of both the dynamics and the observation model after each action. Planning in this representation is often intractable because the state space is continuous and infinite-dimensional. However, PLUMES plans directly in the belief-state MDP by using its GP belief state to compute the transition function efficiently.
Subsequently, Theorem 1 of Auger et al. [auger2013continuous] shows that for an MDP with a continuous state space, like the belief-state MDP representation above, the value function estimated by continuous-observation MCTS asymptotically converges to that of the optimal policy:

$\left| \hat{V}_h(b_t) - V^{\pi^*}_h(b_t) \right| \le \dfrac{C}{N(b_t)^{\gamma_d}}$ (6)

with high probability [auger2013continuous], for positive constants $C$ and $\gamma_d$.
III-C Maximum-Value Information Reward
The true state-dependent reward function for the MSS POMDP places value only on sample points within an $\epsilon$-ball of the true global maximum $x^*$:

$R(x) = \begin{cases} 1 & \text{if } \lVert x - x^* \rVert \le \epsilon \\ 0 & \text{otherwise} \end{cases}$ (7)

where $\epsilon$ is determined by the scientific application. Optimizing this sparse reward function directly is challenging, so PLUMES approximates the true MSS reward using the maximum-value information (MVI) heuristic reward [wang2017max]. MVI initially encourages exploratory behavior, but ultimately rewards exploitative sampling near the inferred maximum.
The belief-dependent MVI heuristic reward quantifies the expected value of having belief $b_t$ and collecting a sample at location $x$. The MVI reward is the mutual information between the random variable $z_x$, representing the observation at location $x$, and $z^*$, the random variable representing the value of the function at the global maximum:

$R_{MVI}(b_t, x) = I(z_x; z^* \mid b_t)$ (8)

where $z^* = \max_{x' \in \mathcal{X}'} f(x')$. To compute the reward of collecting a random observation at location $x$ under belief $b_t$, we approximate the expectation over the unknown $z^*$ by sampling from the posterior distribution $p(z^* \mid b_t)$ and use Monte Carlo integration with $M$ samples [wang2017max]:

$I(z_x; z^* \mid b_t) = H(z_x \mid b_t) - \mathbb{E}_{z^*}\left[ H(z_x \mid z^*, b_t) \right]$ (9)

$\approx H(z_x \mid b_t) - \dfrac{1}{M} \sum_{i=1}^{M} H(z_x \mid z_i^*, b_t)$ (10)
The two entropy expressions can be approximated, respectively, as the entropy of a Gaussian random variable with mean and variance given by the GP equations (Eq. 3 & 4), and the entropy of a truncated Gaussian with upper limit $z_i^*$ and the same mean and variance. To draw samples $z_i^*$ from the posterior $p(z^* \mid b_t)$, we employ spectral sampling [rahimi2008random]. Spectral sampling draws a function $\hat{f}$, which has analytic form and is differentiable, from the posterior belief of a GP with stationary covariance function [wang2017max, hernandez2014predictive]. To complete the evaluation of Eq. 10, each $z_i^*$ can be computed by applying standard efficient global optimization techniques (e.g., sequential least squares programming, quasi-Newton methods) to find the global maximum of the sampled $\hat{f}$. This results in the following closed-form expression for the MVI reward [wang2017max]:

$R_{MVI}(b_t, x) = \dfrac{1}{M} \sum_{i=1}^{M} \left[ \dfrac{\gamma_{z_i^*}(x)\, \phi(\gamma_{z_i^*}(x))}{2\, \Phi(\gamma_{z_i^*}(x))} - \log \Phi(\gamma_{z_i^*}(x)) \right]$ (11)

where $\gamma_{z_i^*}(x) = \frac{z_i^* - \mu_t(x)}{\sigma_t(x)}$, $\mu_t(x)$ and $\sigma_t^2(x)$ are given by Eq. 3 & 4, and $\phi$ and $\Phi$ are the standard normal PDF and CDF. For actions that collect samples at more than one location, the reward of an action is the sum of the rewards of the locations sampled by that action.
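Given the posterior mean and standard deviation from Eq. 3 & 4 and a set of sampled maxima $z_i^*$ (obtained, per Eq. 10, via spectral sampling and global optimization), Eq. 11 reduces to a few lines. The sketch below is ours, with the standard normal PDF/CDF built from `math.erf` to stay dependency-light:

```python
import math
import numpy as np

def _phi(g):
    """Standard normal PDF."""
    return np.exp(-0.5 * g ** 2) / math.sqrt(2.0 * math.pi)

def _Phi(g):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(g / math.sqrt(2.0)))

def mvi_reward(mu, sigma, z_star_samples):
    """Eq. (11): average, over M sampled maxima z*, of the entropy reduction
    from truncating the predictive Gaussian at z*."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    total = np.zeros_like(mu)
    for z_star in z_star_samples:
        g = (z_star - mu) / sigma          # gamma_{z*}(x)
        total += g * _phi(g) / (2.0 * _Phi(g)) - np.log(_Phi(g))
    return total / len(z_star_samples)
```

Locations whose predictive distribution sits close to (or could exceed) the sampled maxima receive high reward; locations far below every sampled maximum receive reward near zero.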
MVI initially favors collecting observations in areas of high uncertainty, because the maxima sampled from the initially uniform GP belief are spread across the domain. As observations are collected and uncertainty in the GP diminishes, the sampled maxima converge to the true maximum and reward concentrates locally at this point, encouraging exploitative behavior. This contrasts with the Upper Confidence Bound (UCB) heuristic, which distributes reward in proportion to the predictive mean and weighted variance of the current GP belief model (Eq. 3 & 4): $R_{UCB}(b_t, x) = \mu_t(x) + \beta^{1/2} \sigma_t(x)$. As the robot explores, the UCB reward converges to the underlying phenomenon $f$. The difference in convergence characteristics between MVI and UCB can be observed in Fig. 3.
IV Experiments and Results
TABLE I: Accumulated MSS reward, posterior RMSE, and $x^*$ error for each planner (spread in parentheses); * indicates a statistically significant difference from PLUMES (p = 0.05 level).

|            | Convex Sim. (50 trials) |            |             | ASV (1 trial) | Nonconvex Sim. (50 trials) |           |             | Dubins Car (5 trials) |
|            | MSS Reward | RMSE       | $x^*$ Error | MSS Reward    | MSS Reward | RMSE      | $x^*$ Error | MSS Reward |
| PLUMES     | 199 (89)   | 3.8 (9.2)  | 0.21 (0.23) | 524           | 206 (100)  | 3.6 (2.1) | 0.25 (0.56) | 159 (74)  |
| UCB-MCTS   | 171 (179)* | 3.7 (9.6)  | 0.24 (0.29) | —             | 115 (184)* | 3.6 (1.5) | 0.27 (1.18) | 52 (17)   |
| UCB-Myopic | 148 (199)* | 3.6 (9.2)  | 0.33 (3.25) | —             | 86 (102)*  | 3.4 (1.0) | 0.23 (0.34) | 42 (66)   |
| Boustro.   | 27 (3)*    | 2.7 (10.4) | 0.26 (0.46) | 63            | —          | —         | —           | —         |
We analyze the empirical performance of PLUMES in a breadth of MSS scenarios featuring convex and nonconvex environments. We compare against three baselines used in environmental surveying: non-adaptive lawnmower coverage (Boustro., an abbreviation of boustrophedonic [choset1998coverage]), greedy myopic planning with UCB reward (UCB-Myopic) [Sun2017], and nonmyopic planning with traditional MCTS [browne2012survey] that uses the maximum-likelihood observation and UCB reward (UCB-MCTS) [Marchant2014a]. The performance of UCB planners has been shown to be sensitive to the choice of $\beta$ [Marchant2014a]. To avoid subjective tuning, we select a time-varying $\beta_t$ that is known to enable no-regret UCB planning [Srinivas2012, Sun2017]. PLUMES uses continuous-observation MCTS with the hyperparameters presented in Auger et al. [auger2013continuous].

To evaluate the mission performance of all planners, we report accumulated MSS reward (Eq. 7), which directly corresponds to the number of scientifically valuable samples collected within an $\epsilon$-ball of the true maximum. This metric is reported for all trial scenarios in Table I. We additionally report several metrics commonly used in IPP to evaluate posterior model quality: overall environmental posterior root mean-squared error (RMSE) and error in the posterior prediction of $x^*$ at the end of a mission ($x^*$ error). We use a Mann-Whitney U nonparametric significance test [mann1947test] to report statistical significance (p = 0.05 level) in performance between PLUMES and baseline algorithms.

IV-A Bounded Convex Environments
In marine and atmospheric applications, MSS often occurs in a geographically bounded, obstacle-free environment. In 50 simulated trials, we applied PLUMES and the baseline planners to a point robot in a multimodal environment drawn randomly from a GP prior with a squared-exponential covariance function and zero mean (see Fig. 4). The action set consisted of ten viable trajectories centered at the robot's pose, with samples collected at regular intervals along each path, and mission lengths were budgeted to a fixed travel distance. Nonmyopic planners rolled out to a 5-action horizon and were allowed 250 rollouts per planning iteration. Summary simulation results are presented in Table I.
In these trials, PLUMES accumulated significantly (0.05 level) more reward than the baselines. The distribution of accumulated reward (Fig. 5) shows that PLUMES has a single dominant mode near reward 200 and few low-performing missions (reward < 50). In contrast, both UCB-based methods have multimodal distributions with nontrivial modes in the low-performance region. Boustro. consistently collected few scientifically valuable samples. In addition to collecting many more samples at the maximum, PLUMES achieved statistically indistinguishable levels of posterior RMSE and $x^*$ error compared to the baselines (Table I).
The corresponding field trial for convex-world maximum search was performed at the Bellairs Fringing Reef, Barbados, by a custom-built autonomous surface vehicle (ASV), with the objective of localizing the most exposed coral head. Coral head exposure is used to select vantage points for coral imaging [manjanna2016efficient] and in ultraviolet-radiation studies on coral organisms [banaszak2009effects]. Due to time and resource constraints, only one trial of two planners was feasible on the physical reef; we elected to demonstrate PLUMES and Boustro., one of the most canonical surveying strategies in the marine sciences.
The ASV had holonomic dynamics and a downward-facing acoustic point altimeter (Tritech Micron Echosounder). Ten dynamically feasible straight paths radiating from the location of the ASV formed the action set, and the environment was bounded by a rectangular geofence. Localization and control were provided by a PixHawk Autopilot with GPS and internal IMU; the fused state estimate was empirically suitable for the desired maximum-localization accuracy. The travel budget for each mission took approximately 45 minutes to execute. The GP kernel was trained on altimeter data from a dense data-collection deployment the day before. Note the high noise in the inferred GP model, as well as the relatively small lengthscale of the field site. The reconstructed bathymetry and vehicle are shown in Fig. 6.
PLUMES successfully identified as maximal the same coral head inferred from the GP trained on the prior dense data collection, as indicated by the accumulated reward in Table I, overcoming the challenges of ocean waves, noisy altimeter measurements, and a highly multimodal environment. Additionally, PLUMES's posterior prediction of $x^*$ had substantially lower error than that of Boustro., whose non-adaptive sampling strategy under-sampled the maximum.
In the Bellairs Fringing Reef trials, the environment was assumed to be static. However, in many marine domains the impact of sediment transport, waves, and tides could physically change the location of the maximum over the course of a mission. PLUMES can be extended to dynamic environments by employing a spatiotemporal kernel in the GP model, which allows the predictive mean and variance to change temporally [singh2010modeling]. If the dynamics of an environment can be encoded in the kernel function, no other changes to PLUMES are necessary; MVI reward will be distributed according to the temporal dynamic. Fig. 7 demonstrates the properties of PLUMES with a squared-exponential kernel over both space and time. In this illustrative scenario, the global maximum moved between planning iterations. PLUMES with a spatiotemporal kernel maintained multiple hypotheses about the maximum's location given the random-walk dynamic of the environment, resulting in MVI reward being redistributed between the two maxima over time.
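The spatiotemporal extension amounts to defining the kernel over $(x, t)$ pairs. A minimal sketch of one separable choice, the product of squared exponentials over space and time (hyperparameter names and values are illustrative, not those used in Fig. 7):

```python
import numpy as np

def st_sq_exp_kernel(p, q, l_space=1.0, l_time=5.0, variance=1.0):
    """kappa((x, t), (x', t')): correlation decays with both spatial and
    temporal separation, so the posterior mean and variance drift in time
    and stale observations stop pinning down the maximum."""
    dx2 = float(np.sum((np.asarray(p[0]) - np.asarray(q[0])) ** 2))
    dt2 = (p[1] - q[1]) ** 2
    return (variance
            * np.exp(-0.5 * dx2 / l_space ** 2)
            * np.exp(-0.5 * dt2 / l_time ** 2))
```

Because old observations decorrelate from the present, posterior variance grows again around previously visited regions, which is exactly what lets MVI redistribute reward between competing maxima.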
IV-B Non-Convex Environments
We next consider nonconvex environments with potentially unknown obstacles, a situation that arises frequently in practical MSS applications with geographical no-go zones for rover or ASV missions, and in indoor or urban settings. We evaluated the PLUMES, UCB-Myopic, and UCB-MCTS planners in 50 simulated trials with the same environments, vehicle, and actions described in Section IV-A, with the inclusion of 12 block obstacles placed uniformly around the world at known locations (see Fig. 4). Boustro. was not used as a baseline because its offline approach does not generalize to unknown obstacle maps.
As indicated in Table I, PLUMES accumulated significantly more MSS reward than UCB-MCTS and UCB-Myopic at the 0.05 level. The distribution of reward across the trials is visualized in Fig. 8. As in the convex world, PLUMES has a primary mode between reward 200 and 250, while the UCB-based planners have a primary mode in the low-performance region (reward < 50). There was no significant difference between planners with respect to RMSE or $x^*$ error. The fact that PLUMES maximized the true MSS reward while achieving statistically indistinguishable error highlights the difference in exploitation efficiency between PLUMES and UCB-based methods.
The simulation experiments above assume that a geometric map is known a priori. However, in practical applications such as indoor gas-leak detection, access to a map may be limited or unavailable. We simulate a scenario in which a nonholonomic car equipped with a laser range-finder must build a map online as it seeks the maximum in a cluttered indoor environment (Fig. 9). We generate a simulated chemical phenomenon from a GP and simulate noisy observations along the vehicle's path. The action set for the vehicle consists of eleven Dubins curves projected in front of the vehicle, one straight path behind the vehicle, and a "stay in place" action. Results for five trials are shown in Table I and illustrate that PLUMES accumulates more MSS reward than the baselines, indicating robust performance.
These simulation and robot trials demonstrate the utility of PLUMES compared to canonical and state-of-the-art baselines in a diverse set of environments under challenging practical conditions. For high-stakes scientific deployments, the consistent convergence and sampling performance of PLUMES is critical.
V Discussion and Future Work
Online planning methods for robotic maximum seek-and-sample are critical in a variety of contexts, including general environmental monitoring (scientific inquiry, reconnaissance) and disaster response (oil spill, gas leak, radiation). For partially observable environments that can be modeled using a GP, PLUMES is a novel approach for global maximum seek-and-sample that provides several key insights.
This work presents MVI as an empirically suitable alternative to the canonical GP-UCB heuristic in MSS solvers: it is naturally adaptive and avoids a hand-tuned parameter for balancing exploration and exploitation. MVI samples potential global maxima from the robot's full belief state to manage exploration and exploitation. In contrast, heuristic functions like UCB place reward on all high-valued or highly uncertain regions, leading to unnecessary exploration and limiting the time available to exploit knowledge of the true maximum. Ultimately, the MVI heuristic allows PLUMES to collect exploitative samples while still achieving the same overall level of posterior model accuracy (shown by RMSE) as UCB-based planners. Additionally, continuous-observation MCTS allows PLUMES to search over belief spaces on continuous functions without discretization or maximum-likelihood assumptions.
One important area of future work for PLUMES is online GP kernel hyperparameter learning [ranganathan2011online], which is important when only one mission is possible and there is insufficient prior knowledge for hyperparameter selection. Another avenue of future work is to examine the properties of the maxima sampled by MVI, to be used as a heuristic for meta-behavior transitions (e.g., action model switching, dynamic horizon setting) or mission termination. Finally, the performance of PLUMES in nonconvex environments is affected by the chosen discrete action set. Extending PLUMES to continuous action spaces, in the spirit of, e.g., Morere et al. [morere2018continuous], would allow increased flexibility in these environments.
VI Conclusion
This paper formalizes the maximum seek-and-sample POMDP and presents PLUMES, an adaptive planning algorithm that employs continuous-observation MCTS and maximum-value information reward to perform efficient maximum seeking in partially observable, continuous environments. PLUMES outperforms canonical coverage and state-of-the-art UCB-based methods with statistical significance under challenging simulated and real-world conditions (e.g., multiple local maxima, unknown obstacles, sensor noise). Maximum seek-and-sample is a critical task in environmental monitoring, for which PLUMES — with theoretical convergence guarantees, strong empirical performance, and robustness under real-world conditions — is well-suited.
Acknowledgment
We would like to thank our reviewers for their feedback on this manuscript. Additionally, we thank the SLI group, RRG and WARPlab for their insight and support. This project was supported by an NSFGRFP award (G.F.), NDSEG Fellowship award (V.P.), and NSF NRI Award 1734400.