I Introduction
Recent years have seen an increasing usage of autonomous mobile robots in a variety of data collection applications, including environmental monitoring [1, 2, 3, 4, 5], exploration [6], and inspection [7]. In many tasks, these systems promise a more flexible, safe, and economical solution compared to traditional manual or static sampling methods [4, 8]. However, to fully exploit their potential, a key challenge is developing algorithms for active sensing, where the objective is to plan paths for efficient data gathering subject to finite computational and sensing resources, such as energy, time, or travel distance.
This paper examines the problem of active sensing using an unmanned aerial vehicle (UAV) in terrain monitoring scenarios. Our goal is to map a non-homogeneous 2D scalar field, e.g. of temperature or humidity, on the terrain using measurements taken by an onboard sensor. In similar setups, most practical systems rely on precomputed paths for data collection, e.g. coverage-based planning [7]. However, such approaches assume a uniform distribution of measurement information value in the environment and hence do not allow for adaptivity, i.e. more closely inspecting regions of interest, such as hotspots [2, 1] or anomalies [9], as they are discovered. Our motivation is to quickly find information-rich paths targeting these areas by performing efficient online adaptive replanning on computationally constrained platforms.

Recently, several informative path planning (IPP) approaches for active sensing have been proposed [1, 10, 2, 3, 8], which enable adjusting decision-making based on observed data. However, scaling these methods to large environments and action spaces remains an open challenge. The main computational bottleneck in IPP is the predictive replanning step, since multiple possible future measurements must be simulated when evaluating candidate next actions. Previous studies have tackled this by discretizing the action or state space, e.g. by using sparse graphs [10, 11], for replanning; however, such simplifications sacrifice the quality of predictive plans. An alternative paradigm is to use reinforcement learning (RL) to learn data gathering actions directly. Though emerging works in RL for IPP demonstrate promising results [12, 13], they have been limited to small 2D action spaces, and adaptive planning to map environments with spatial correlations and large 3D action spaces has not yet been investigated.
To address this, we propose a new RL-based IPP framework suitable for UAV-based active sensing. Inspired by recent advances in RL [14, 15], our method combines Monte Carlo tree search (MCTS) with a convolutional neural network (CNN) to learn information-rich actions in adaptive data gathering missions. Since active sensing tasks are typically expensive to simulate, our approach caters to training in low data regimes. By replacing the computational burden of predictive planning with a simple tree search, we achieve efficient online replanning, which is critical for deployment on mobile robots (Fig. 1).

The contributions of this work are:

A new deep RL algorithm for robotic planning applications that supports continuous high-dimensional state spaces, large action spaces, and data-efficient training.

The integration of our RL algorithm in an IPP framework for UAV-based terrain monitoring.

The validation of our approach in an ablation study and in evaluations against benchmarks using synthetic and real-world data, showcasing its performance.
Our framework will be open-sourced for use by the community.
II Related Work
Our work lies at the intersection of IPP for active sensing, MCTS planning methods, and recent advances in RL.
IPP methods are gaining rapid traction in many active sensing applications [2, 1, 3, 10]. In this area of study, our work focuses on strategies with adaptive online replanning capabilities, which allow the targeted monitoring of regions of interest, e.g. hotspots or abnormal areas [9]. Some methods focus on discrete action spaces defined by sparse graphs of permissible actions [10, 11]. However, these simplifications are not applicable when the distribution of target regions is a priori unknown. Our proposed algorithm reasons about a discrete action space orders of magnitude larger while ensuring online computability. In terms of planning strategy, adaptive IPP algorithms can be classified into combinatorial [16, 17], sampling-based [3, 10], and optimization-based approaches [1, 2]. Combinatorial methods exhaustively query the search space. Thus, they cannot plan online in large search spaces, which makes them impractical for adaptive replanning.

Continuous-space sampling-based planners generate informative robot trajectories by sampling candidate actions while guaranteeing probabilistic asymptotic optimality [3, 8]. However, their sample efficiency is typically low for planning with more complex objectives and larger action spaces, since many measurements need to be forward-simulated to find promising paths in the problem space [1, 18]. In our particular setup, considering spatial correlations in a terrain over many candidate regions leads to a complex and expensive-to-evaluate information criterion. In similar scenarios, several works have investigated global optimization, e.g. evolutionary algorithms [2, 1] or Bayesian optimization [8], to enhance planning efficiency. Although these approaches deliver high-quality paths [1, 11], using them for online decision-making is still computationally expensive when reasoning about many spatially correlated candidate future measurements.

In robotics, RL algorithms are increasingly being utilized for autonomous search and rescue [19], information gathering [12], and exploration of unknown environments [13]. Although emerging works show promising performance in these applications, RL-based approaches have not yet been investigated for online adaptive IPP, remaining restricted to environment exploration. To address this gap, we propose the first RL approach for planning informative paths online over spatially correlated terrains with large action spaces.
Another well-studied planning paradigm is MCTS [20, 21]. Recently, MCTS extensions were proposed for large and continuous action spaces [22] and partially observable environments [23]. Choudhury et al. [10] applied a variant of MCTS to obtain long-horizon, anytime solutions in adaptive IPP problems. However, these online methods are restricted to small action spaces and spatially uncorrelated environments.
Inspired by recent advances in RL [15, 14], our RL-based algorithm bypasses computationally expensive MCTS rollouts and sample-inefficient action selections with a learned value function and an action policy, respectively. We extend the AlphaZero algorithm by applying it to robotics tasks with limited computational budgets and low data regimes.
III Background
We begin by briefly describing the general active sensing problem and specifying the terrain monitoring scenario used to develop our RLbased IPP approach.
III-A Problem Formulation
The general active sensing problem is formulated as follows. The goal is to maximize an information-theoretic criterion over a set of action sequences $\Psi$, i.e. robot trajectories:

$\psi^* = \operatorname*{argmax}_{\psi \in \Psi} I(\psi), \ \text{s.t.} \ C(\psi) \leq B,$ (1)

where $C(\psi)$ maps an action sequence $\psi$ to its associated execution cost, $B$ is the robot’s budget limit, e.g. time or energy, and $I(\psi)$ is the information criterion, computed from the new sensor measurements obtained by executing $\psi$.
This work focuses on the scenario of monitoring a terrain using a UAV equipped with a camera. Specifically, in this case, the cost of an action sequence $\psi = (a_1, \ldots, a_n)$ of length $n$ is defined by the total flight time:

$C(\psi) = \sum_{i=1}^{n-1} t(a_i, a_{i+1}),$ (2)

where $a_i$ is a 3D measurement position above the terrain from which the $i$-th image is registered, and $t(a_i, a_{i+1})$ computes the flight time between consecutive measurement positions using a constant acceleration-deceleration model with maximum speed $v_{\max}$.
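As a concrete illustration, the per-leg flight time under a constant acceleration-deceleration model can be sketched as follows. This is a minimal sketch, not the paper's implementation; the function name and the default `a_max`/`v_max` values are illustrative assumptions:

```python
import numpy as np

def flight_time(p1, p2, a_max=2.0, v_max=2.0):
    """Flight time between two 3D measurement positions under a constant
    acceleration-deceleration model with maximum speed (Eq. 2 cost terms).
    a_max and v_max are placeholder values, not taken from the paper."""
    d = np.linalg.norm(np.asarray(p2, dtype=float) - np.asarray(p1, dtype=float))
    d_ramp = v_max**2 / a_max  # distance to accelerate to v_max and brake again
    if d < d_ramp:
        # triangular velocity profile: the UAV never reaches v_max
        return 2.0 * np.sqrt(d / a_max)
    # trapezoidal profile: accelerate, cruise at v_max, decelerate
    return v_max / a_max + d / v_max
```

Summing `flight_time` over consecutive waypoints of an action sequence yields the total cost $C(\psi)$.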
III-B Terrain Mapping
We leverage the method of Popović et al. [1] for efficient probabilistic mapping of the terrain. The terrain is discretized by a grid map $\mathcal{E}$, while the mapped target variable, e.g. temperature, is a scalar field over the grid cells. The prior map distribution is given by a Gaussian Process (GP) defined by a prior mean vector $\mu^-$ and covariance matrix $P^-$. At mission time, a Kalman filter is used to sequentially fuse data observed at a measurement location with the last iteration’s map belief in order to obtain the posterior map mean $\mu^+$ and covariance $P^+$. For further details, the reader is referred to [1].

III-C Utility Definition for Adaptive IPP
In Eq. 1, we define the A-optimal information criterion associated with an action sequence $\psi$ of measurements [24]:

$I(\psi) = \operatorname{Tr}(P^-) - \operatorname{Tr}(P^+),$ (3)

where $P^-$ and $P^+$ are the map covariances obtained before and after fusing the measurements observed by executing $\psi$, respectively.
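The A-optimal criterion above reduces to a difference of covariance traces, which can be sketched in a few lines (a minimal sketch; the function name is illustrative):

```python
import numpy as np

def a_optimal_gain(P_pre, P_post):
    """A-optimal information criterion (Eq. 3): the reduction in the
    trace of the map covariance after fusing new measurements."""
    return np.trace(P_pre) - np.trace(P_post)
```

A larger gain means the candidate measurements shrink the map uncertainty more.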
We study an active sensing task where the goal is to map terrain areas with higher values of the target variable, e.g. high temperature. This scenario requires online replanning to focus on mapping these areas of interest as they are discovered, and is thus a relevant problem setup for our new efficient RL-based IPP strategy. We utilize confidence-based level sets to define the regions of interest [25]:

$RoI = \{\, i \in \mathcal{E} \mid \mu_i + \beta \sigma_i \geq \mu_{th} \,\},$ (4)

where $\mu_i$ and $\sigma_i^2$ are the mean and variance of grid cell $i$, and $\beta$ and $\mu_{th}$ are a user-defined confidence interval width and threshold, respectively. Thus, we restrict $P^-$ and $P^+$ in Eq. 3 to the grid cells in $RoI$ as defined by Eq. 4. This way, we only consider higher-valued regions of interest over the current map state when computing the information gain.

IV Approach
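The confidence-based level set in Eq. 4 can be sketched as a simple thresholding of per-cell upper confidence bounds. This is a minimal sketch; the default `beta` and `mu_th` values are illustrative assumptions:

```python
import numpy as np

def regions_of_interest(mu, P, beta=1.0, mu_th=0.5):
    """Indices of grid cells in the confidence-based level set (Eq. 4):
    cells whose upper confidence bound mu_i + beta * sigma_i exceeds
    the threshold mu_th. beta and mu_th are placeholder values."""
    sigma = np.sqrt(np.diag(P))  # per-cell standard deviation from the map covariance
    return np.where(np.asarray(mu) + beta * sigma >= mu_th)[0]
```

The returned indices are the cells over which the covariance traces in Eq. 3 are evaluated.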
This section presents our new RL-based IPP approach for active sensing. As shown in Fig. 2, we iteratively train a CNN on diverse simulated terrain monitoring scenarios to learn the most informative data gathering actions. The trained CNN is then leveraged during a mission to achieve fast online replanning. The following subsections detail our CNN architecture and RL algorithm designed for robotic applications.
IV-A Connection between IPP & RL
We first cast the general IPP problem from Sec. III in an RL setting. The value of a state $s_t$ under policy $\pi$ is defined as $V(s_t) = r(s_t, a_{t+1}) + \gamma V(s_{t+1})$, where $\gamma \in [0, 1]$ is the discount factor, and $s_{t+1}$ is the successor state when choosing a next action $a_{t+1}$ according to the policy $\pi$. A state is defined by $s_t = (P_t, a_t)$, where $P_t$ is the current map state, and $a_t$ is the previously executed action, i.e. the current UAV position. Consequently, $s_{t+1}$ is defined by $(P_{t+1}, a_{t+1})$. In our work, the 3D action space $\mathcal{A}$ is a discrete set of measurement positions. The reward function is defined as:

$r(s_t, a_{t+1}) = \operatorname{Tr}(P_t) - \operatorname{Tr}(P_{t+1}),$ (5)

where $P_t$ and $P_{t+1}$ are restricted to the regions of interest given by Eq. 4. We set $\gamma = 1$, such that the value function restores the information criterion in Eq. 3.
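One environment step, combining the Kalman filter map fusion from Sec. III-B with the reward of Eq. 5, can be sketched as follows. This is a minimal sketch assuming a generic linear Gaussian observation model (matrix `H`, noise covariance `R`); the paper's altitude-dependent camera model would define these, and the function names are illustrative:

```python
import numpy as np

def kf_update(mu, P, z, H, R):
    """One Kalman filter update fusing measurements z (observation
    matrix H, noise covariance R) into the map belief (mu, P)."""
    S = H @ P @ H.T + R              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    mu_post = mu + K @ (z - H @ mu)  # posterior map mean
    P_post = P - K @ H @ P           # posterior map covariance
    return mu_post, P_post

def step_reward(P_pre, P_post, roi):
    """Reward (Eq. 5): covariance trace reduction restricted to the
    region-of-interest cell indices roi (from Eq. 4)."""
    return np.trace(P_pre[np.ix_(roi, roi)]) - np.trace(P_post[np.ix_(roi, roi)])
```

With $\gamma = 1$, summing `step_reward` over an episode recovers the total trace reduction, i.e. the information criterion.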
IV-B Algorithm Overview
Our goal is to learn the best policies for IPP offline to allow for fast online replanning at deployment. To achieve this, we bring recent advances in RL by Silver et al. [14, 15] into the robotics domain. In a similar spirit, our RL algorithm combines MCTS with a policy-value CNN (Fig. 2). At train time, the algorithm alternates between episode generation and CNN training. For terrain monitoring, episodes are generated by simulating diverse scenarios varying in map priors, target variables, and initial UAV positions, as explained in Sec. IV-C. Each episode step from a state $s_t$ is planned by a tree search producing a target value $v_t$ given by the simulator and a target policy $\pi_t$ proportional to the tree's root node action visit counts. $v_t$ and $\pi_t$ are stored in a replay buffer used to train the CNN. As described in Sec. IV-E, we introduce a CNN architecture suitable for inference on mobile robots. Further, Sec. IV-F proposes components for low data regimes in robotics tasks, which are often expensive to simulate.
IV-C Episode Generation at Train Time
The most recently trained CNN is used to simulate a fixed number of episodes. An episode terminates when the budget is spent or a maximum number of steps is reached. In each step from state $s_t$, a tree search is executed for a certain number of simulations as described in Sec. IV-D. The policy is derived from the root node’s action visit counts $n(s_t, a)$:

$\pi_t(a \mid s_t) = \dfrac{n(s_t, a)^{1/\tau}}{\sum_{a'} n(s_t, a')^{1/\tau}},$ (6)

where the sum runs over the set of next measurement positions reachable within the remaining budget, and $\tau$ is a hyperparameter smoothing the policy towards uniform as $\tau \to \infty$ and collapsing it to an argmax as $\tau \to 0$. The action $a_{t+1}$ is sampled from $\pi_t$, and the next map state $P_{t+1}$ is obtained by fusing the new measurement into the map. The reward $r_t$ and target value $v_t$ are given by Eq. 5 and the sum of future rewards, respectively. The tuple $(s_t, \pi_t, v_t)$ is stored in the replay buffer.
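The temperature-scaled policy of Eq. 6 can be sketched directly from the visit counts (a minimal sketch; the function name is illustrative):

```python
import numpy as np

def policy_from_visits(visit_counts, tau=1.0):
    """Target policy from root-node action visit counts (Eq. 6):
    pi(a) proportional to n(a)**(1/tau). Large tau flattens the policy
    towards uniform; tau -> 0 collapses it to the most-visited action."""
    n = np.asarray(visit_counts, dtype=float)
    weights = np.power(n, 1.0 / tau)
    return weights / weights.sum()
```

With `tau=1.0` this is simply the normalized visit distribution used as the CNN's training target.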
IV-D Tree Search with Neural Networks
As shown in Fig. 3, a fixed number of simulations is executed by traversing the tree. Each simulation terminates when the budget or a maximal depth is exceeded. The tree search queries the CNN for policy and value estimates at leaf nodes with state $s$ and stores the node’s prior probabilities $p(a \mid s)$. The probabilistic upper confidence tree (PUCT) bound is used to traverse the tree [26]:

$U(s, a) = Q(s, a) + c \cdot p(a \mid s) \, \dfrac{\sqrt{n(s)}}{1 + n(s, a)},$ (7)

where $Q(s, a)$ is the state-action value, $n(s)$ is the visit count of the parent node with state $s$, and $c$ is an exploration factor. We choose the next action $a^* = \operatorname*{argmax}_a U(s, a)$. At the root node, Dirichlet exploration noise is mixed into the priors to encourage exploration:

$p'(a \mid s) = (1 - \epsilon)\, p(a \mid s) + \epsilon\, \eta_a, \quad \eta \sim \operatorname{Dir}(\alpha).$ (8)
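PUCT action selection at a node can be sketched as follows. This is a minimal sketch; the function name and the default `c_puct` constant are illustrative assumptions:

```python
import numpy as np

def select_action(Q, N, priors, c_puct=1.5):
    """PUCT selection (Eq. 7): pick the action maximizing the state-action
    value plus an exploration bonus weighted by the CNN's prior probabilities.
    Q: per-action value estimates; N: per-action visit counts."""
    N = np.asarray(N, dtype=float)
    bonus = c_puct * np.asarray(priors, dtype=float) * np.sqrt(N.sum()) / (1.0 + N)
    return int(np.argmax(Q + bonus))
```

Note how rarely visited actions with high prior probability receive a large bonus, steering the search towards them.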
IV-E Network Architecture & Training
The CNN is parameterized by $\theta$, predicting a policy $\hat{\pi}$ and value $\hat{v}$. Input feature planes to the CNN are (a) the min-max normalized current map covariance restricted to the regions of interest; (b) the remaining budget normalized over the total budget; (c) the UAV position normalized over the bounds of the 3D action space; and (d) a cost feature map of the same shape as the grid map, subsequently min-max normalized. Note that the scalar inputs are expanded to feature maps of the same shape as the grid map. Additionally, we input a history of the previous two covariance, position, and budget input planes.
As visualized in Fig. 4, the CNN has a shared encoder for policy and value representations. We leverage Non-bottleneck-1D blocks with separable 2D convolutions proposed by Romera et al. [27] and SiLU activations to reduce inference time. The encoder is followed by two separate prediction heads for policy and value estimates. Both heads consist of three blocks with 2D convolutions, batch normalization, and SiLU activations. The last block's output feature maps in each head are flattened to fixed dimensions by global average and max pooling before applying a fully connected layer. This reduces the number of parameters and ensures an input size-agnostic architecture. The CNN parameters $\theta$ are trained with stochastic gradient descent (SGD) on mini-batches to minimize:

$\ell = \lambda_1 (v_t - \hat{v}_t)^2 - \lambda_2 \, \pi_t^\top \log \hat{\pi}_t + \lambda_3 \lVert \theta \rVert_2^2,$ (9)

where the loss coefficients $\lambda_1$, $\lambda_2$, $\lambda_3$ are hyperparameters. SGD uses a one-cycle learning rate schedule over three epochs [28].

Table I: Ablation study results. Map uncertainty Tr(P) and RMSE in the regions of interest after 33%, 67%, and 100% effective mission time, and average planning runtime.

Variant | Tr(P) 33% | Tr(P) 67% | Tr(P) 100% | RMSE 33% | RMSE 67% | RMSE 100% | Runtime [s]
Baseline as in Sec. IV | 73.61 | 31.83 | 12.44 | 0.15 | 0.09 | 0.05 | 0.64
(i) w/ fixed off-policy window | 83.25 | 50.17 | 24.68 | 0.16 | 0.12 | 0.09 | 0.68
(i) w/ fixed exploration constants | 95.27 | 39.46 | 21.86 | 0.20 | 0.11 | 0.08 | 0.65
(i) w/o forced playouts + policy pruning | 79.23 | 28.53 | 22.62 | 0.18 | 0.09 | 0.07 | 0.66
(ii) w/o global pooling bias blocks | 103.58 | 45.78 | 31.44 | 0.19 | 0.11 | 0.10 | 0.64
(ii) 5 residual blocks in encoder | 82.90 | 29.94 | 17.94 | 0.16 | 0.08 | 0.07 | 0.55
(ii) w/o input feature history | 102.40 | 40.48 | 31.33 | 0.20 | 0.10 | 0.09 | 0.66
IV-F AlphaZero in Low Data Regimes
Adaptive IPP with spatio-temporal correlations is expensive to simulate. As opposed to previously examined fast-to-simulate games such as Go or chess [15], real-world robotics tasks are often limited in the number of simulations and generated episodes at train time. Further, we cannot leverage massive computational resources [14, 15], but only use a single GPU.
A major shortcoming of the original AlphaZero algorithm [15] is that the policy targets in Eq. 6 merely reflect the tree search exploration dynamics. However, the raw action visit counts do not necessarily capture the gathered state-action value information for a finite number of simulations. Hence, with only a moderate number of simulations per episode step, AlphaZero tends to overemphasize the initially explored actions in subsequent training iterations, leading to biased training data and thus overestimated state-action values. As Eq. 7 is also guided by the learned priors, the overemphasis on initially explored actions leads to overfitting and low-quality policies. In the following, we introduce methods to solve these problems and increase the efficiency of our RL algorithm.
To avoid overemphasizing random actions in the node selection, a large exploration constant $c$ in Eq. 7 is desirable. However, in later training iterations, increasing exploitation of known good actions is required to ensure convergence. Thus, we propose an exponentially decaying exploration constant:

$c_t = \max(c_{\min},\, c_{\text{init}} \cdot \lambda_c^{\,t}),$ (10)

where $t$ is the training iteration number, $c_{\text{init}}$ is the initial constant, $\lambda_c$ is the exponential decay factor, and $c_{\min}$ is the minimal value. This can be seen as a dynamic exploration-exploitation trade-off in RL algorithms.
Similarly, for the Dirichlet exploration noise in Eq. 8 defined by $\operatorname{Dir}(\alpha)$, we introduce an exponentially decaying schedule:

$\alpha_t = \max(\alpha_{\min},\, \alpha_{\text{init}} \cdot \lambda_\alpha^{\,t}),$ (11)

where $\alpha_{\text{init}}$ is the initial value, $\lambda_\alpha$ is the exponential decay factor, and $\alpha_{\min}$ is the minimal value. A high $\alpha$ leads to a near-uniform noise distribution, avoiding overemphasis on random actions. However, $\alpha$ should decrease with increasing $t$ to exploit the learned priors.
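Both decay schedules (Eqs. 10 and 11) share the same clipped exponential form, which can be sketched generically (a minimal sketch; the function name is illustrative):

```python
def decayed(initial, decay, minimum, t):
    """Exponentially decaying schedule used for the PUCT exploration
    constant (Eq. 10) and the Dirichlet noise parameter (Eq. 11):
    value = max(minimum, initial * decay**t) at training iteration t."""
    return max(minimum, initial * decay ** t)
```

Early iterations thus explore broadly, while later iterations exploit the learned policy once the schedule reaches its floor.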
Next, we propose an increasing replay buffer size to accelerate training. Similar to our approach, adaptive replay buffers are known to improve performance in other RL domains [29]. On the one hand, a substantial amount of data is required to train the CNN on a variety of paths. On the other hand, the loss (Eq. 9) initially shows sudden drops when outdated training data leaves the replay buffer. Thus, in early training stages, a small window improves convergence speed, while larger windows in later training stages help regularize training and ensure training data diversity. Hence, the window size $w_t$ is adaptively set to:

$w_t = \min(w_{\max},\, w_{\text{init}} + \lfloor t / k \rfloor),$ (12)

where $w_{\text{init}}$ and $w_{\max}$ are the initial and maximum window sizes. The window size is increased by one every $k$ training iterations.
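The adaptive window of Eq. 12 is a one-liner; the sketch below uses placeholder default values for `w_init`, `w_max`, and `k`, which are not taken from the paper:

```python
def replay_window(t, w_init=4, w_max=20, k=2):
    """Adaptive replay-buffer window size (Eq. 12): start small for fast
    early convergence, then grow by one every k training iterations up
    to w_max. Default parameter values are placeholders."""
    return min(w_max, w_init + t // k)
```

At train time, only the episodes from the most recent `replay_window(t)` iterations would be kept in the buffer.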
Moreover, we adapt two techniques introduced by Wu [30] for the game of Go. First, forced playouts and policy pruning decouple exploration dynamics and policy targets. While traversing the search tree, underexplored root node actions are forced by setting their bound in Eq. 7 to infinity. In Eq. 6, the forced visits are subtracted from the action visit counts unless the action led to a high-value successor state. Second, at regular intervals in the encoder, and as the first layers of the value and policy heads, we use global pooling bias layers. This enables our CNN to focus on both local and global features required for IPP. Further details are discussed by Wu [30].
IV-G Planning at Mission Time
Replanning during a mission is similar to generating a training episode. We use the offline-learned CNN and perform a tree search from the current state $s_t$. An action is chosen from $\pi_t$ in Eq. 6 with $\tau \to 0$, i.e. the most-visited action $a^* = \operatorname*{argmax}_a n(s_t, a)$. Since the learned policy steers the exploration, the tree search is highly sample-efficient. This way, the number of tree search simulations is significantly reduced to allow fast replanning. Note that policy noise injection and forced playouts at the root are disabled to avoid a wrong exploration bias and improve performance.
Fig. 5: On average, our RL approach ensures the fastest uncertainty and RMSE reduction over mission time. Solid lines indicate means over 10 trials, and shaded regions indicate the standard deviations. Note that there is inherent variability due to the randomly generated hotspot locations. However, our method ensures stable performance (small standard deviations) over changing environments. Further, planning runtime is substantially reduced. The planned path (evolving over time from blue to red) validates the adaptive behavior of our approach, exploring the terrain in a mission with a focus on the high-value region (green).

V Experimental Results
This section presents our experimental results. We first validate our RL approach in an ablation study, then assess its IPP and runtime performance in terrain monitoring scenarios.
V-A Experimental Setup
Our simulation setup considers terrains with 2D discrete field maps with values between 0 and 1, randomly split into high- and low-value regions to create regions of interest as defined by Eq. 4. We model ground truth terrain maps of and resolution. The UAV action space of measurement locations is defined by a discrete 3D lattice above the terrain. The lattice mirrors the grid resolution on two altitude levels ( and ), resulting in actions. The missions are implemented in Python on a desktop with a 1.8 GHz Intel i7 processor and 16 GB RAM, without GPU acceleration to avoid an unfair advantage in the inference speed of our CNN. We repeat the missions times and report means and standard deviations. Our RL algorithm is trained offline on a single machine with a 2.2 GHz AMD Ryzen 9 3900X, 63 GB RAM, and an NVIDIA GeForce RTX 2080 Ti GPU.
We use the same inverse sensor model as Popović et al. [1] to simulate camera measurement noise, assuming a downwards-facing square camera footprint with FoV. The prior map mean is uniform with a value of . The GP is defined by an isotropic Matérn kernel with length scale , signal variance , and noise variance obtained by maximizing the log marginal likelihood over independent maps. The threshold defines the regions of interest.
We set the mission budget , the initial UAV position to , and the acceleration-deceleration model with maximum speed . At train time, each episode randomly generates a new ground truth map, map priors from a wide range of GP hyperparameters, and UAV start positions, such that our approach has no unfair overfitting advantage. We evaluate planning performance using the map uncertainty, measured by the covariance trace in the regions of interest, and the root mean squared error (RMSE) of the map mean in these regions. Lower values in these metrics indicate better performance. In contrast to earlier work [1, 10, 2], the remaining budget incorporates not only the path travel time, but also the planning (computation) runtime, as relevant for robotic platforms with limited onboard resources. We refer to the spent budget as the effective mission time.
V-B Ablation Study
This section validates the algorithm design and CNN architecture introduced in Sec. IV. We perform an ablation study comparing our approach to versions of itself (i) removing proposed training procedure components, and (ii) changing the CNN architecture. We assume a resolution of , resulting in a grid map and a discrete set of actions. Note that the results do not depend on the actual sizes of the grid map and action space. We generate a small number of episodes in each iteration and terminate training after iterations. Each tree search is executed as described in Sec. IV-G with simulations and exploration constants . Table I summarizes our results. We evaluate the map uncertainty and RMSE over the posterior map state after 33%, 67%, and 100% of the effective mission time, and the average planning runtime.
As proposed in Eq. 12, the training procedure considers a replay buffer with adaptive size. Convergence speed, and thus performance, is improved compared to a fixed-size buffer. Also, our proposed exploration constant scheduling scheme improves plan quality by stabilizing the exploration-exploitation trade-off compared to fixed constants. Further, the results show the benefits of including a history of the previous two map states and UAV positions in addition to their current values. Interestingly, reducing the encoder depth from to blocks and removing forced playouts both perform reasonably well, but still lead to worse results in later mission stages. This suggests that deeper CNNs and forced playouts facilitate learning in larger grid maps and longer missions. Similarly, global pooling bias blocks help learning global map features, which benefits information-gathering performance.
V-C Comparison Against Benchmarks
Next, our RL algorithm is evaluated against various benchmarks. We set a resolution of , hence the map is a grid, and the action space has actions. Our approach is compared against: (a) uniform random sampling over the action space; (b) a coverage path with equally-spaced measurement locations at a fixed altitude; (c) MCTS with progressive widening [22] for large action spaces and a generalized cost-benefit rollout policy proposed by Choudhury et al. [10] for adaptive IPP; and (d) a state-of-the-art IPP framework based on the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) proposed by Popović et al. [1]. All approaches reason about a -step planning horizon. We set the CMA-ES parameters to iterations, offspring, and coordinate-wise step size based on the trade-off between path quality and runtime. The permissible next actions of the MCTS were reduced to a radius of around the UAV position to be computable online, resulting in next actions per move. For a fair comparison, we trained our approach offline on this restricted action space, which is still much larger than those considered in previous work [10, 11].
Fig. 5 reports the results obtained using each approach. Our method substantially reduces runtime, achieving a speedup of compared to CMA-ES and MCTS. This result highlights the significantly improved sample efficiency of our tree search and confirms that the CNN can successfully learn informative actions from training in diverse simulated missions. Random sampling performs poorly as it reduces uncertainty and RMSE in high-value regions only by chance. The coverage path shows high variability since data-gathering efficiency greatly depends on the problem setup, i.e. hotspot locations relative to the preplanned path. Our approach outperforms this benchmark with much greater consistency.
V-D Temperature Mapping Scenario
We demonstrate our RL-based IPP approach in a photorealistic simulation using real-world surface temperature data. The data was collected in a crop field near Forschungszentrum Jülich, Germany, on July 20, 2021 with a DJI Matrice 600 UAV carrying a Vue Pro R 640 thermal sensor. The UAV executed a coverage path at altitude to collect images, which were then processed using Pix4D software to generate an orthomosaic representing the target terrain in our simulation, as depicted in Fig. 1 (left). The aim is to validate our method for adaptively mapping high-temperature areas in this realistic setting.
For fusing new data into the map, we assume altitude-dependent sensor noise as described in Sec. V-A. The terrain is discretized using a uniform resolution. We compare our RL-based online algorithm against a fixed-altitude lawnmower path as a traditional baseline. Our approach is trained only on synthetic simulated data, as shown in Fig. 5.
Fig. 1 (right) shows the planned 3D path above the terrain using our strategy. This confirms that our method collects targeted measurements in high-temperature areas of interest (red) by efficient online replanning. This is reflected quantitatively in Fig. 6 (right), where our approach ensures fast uncertainty reduction, while a coverage path performs worse as it cannot adapt its mapping behavior. These results verify the successful transfer of our model trained in simulation to real-world data and demonstrate its benefits over a traditional approach.
VI Conclusions and Future Work
This paper proposes a new RL-based approach for online adaptive IPP using resource-constrained mobile robots. The algorithm is designed for sample-efficient planning in large action spaces and high-dimensional state spaces, enabling fast information gathering in active sensing tasks. A key feature of our approach is its components for accelerated learning in low data regimes. We validate the approach in an ablation study and evaluate its performance against multiple benchmarks. Results show that our approach drastically reduces planning runtime, enabling efficient adaptive replanning on physical platforms.
Future work will investigate extending our algorithm to multi-robot teams and larger environments. We will also conduct field experiments to validate our method in real-world missions and to explore its sim2real capabilities.
Acknowledgement
We would like to thank Jordan Bates from Forschungszentrum Jülich for providing the realworld data and sensor information.
References
 Popović et al. [2020] M. Popović, T. Vidal-Calleja, G. Hitz, J. J. Chung, I. Sa, R. Siegwart, and J. Nieto, “An informative path planning framework for UAV-based terrain monitoring,” Autonomous Robots, vol. 44, no. 6, pp. 889–911, 2020.
 Hitz et al. [2017] G. Hitz, E. Galceran, M.È. Garneau, F. Pomerleau, and R. Siegwart, “Adaptive continuousspace informative path planning for online environmental monitoring,” Journal of Field Robotics, vol. 34, no. 8, pp. 1427–1449, 2017.
 Hollinger and Sukhatme [2014] G. A. Hollinger and G. S. Sukhatme, “Sampling-based robotic information gathering algorithms,” The International Journal of Robotics Research, vol. 33, no. 9, pp. 1271–1287, 2014.
 Dunbabin and Marques [2012] M. Dunbabin and L. Marques, “Robots for Environmental Monitoring: Significant Advancements and Applications,” IEEE Robotics & Automation Magazine, vol. 19, no. 1, pp. 24–39, 2012.
 Lelong et al. [2008] C. C. Lelong, P. Burger, G. Jubelin, B. Roux, S. Labbé, and F. Baret, “Assessment of Unmanned Aerial Vehicles Imagery for Quantitative Monitoring of Wheat Crop in Small Plots,” Sensors, vol. 8, no. 5, pp. 3557–3585, 2008.
 Doherty and Rudol [2007] P. Doherty and P. Rudol, “A UAV search and rescue scenario with human body detection and geolocalization,” in Australasian Joint Conference on Artificial Intelligence. Springer, 2007, pp. 1–13.
 Galceran and Carreras [2013] E. Galceran and M. Carreras, “A survey on coverage path planning for robotics,” Robotics and Autonomous Systems, vol. 61, no. 12, pp. 1258–1276, 2013.
 Vivaldini et al. [2019] K. C. T. Vivaldini, T. H. Martinelli, V. C. Guizilini, J. R. Souza, M. D. Oliveira, F. T. Ramos, and D. F. Wolf, “UAV route planning for active disease classification,” Autonomous robots, vol. 43, no. 5, pp. 1137–1153, 2019.
 Blanchard and Sapsis [2020] A. Blanchard and T. Sapsis, “Informative path planning for anomaly detection in environment exploration and monitoring,” arXiv preprint arXiv:2005.10040, 2020.
 Choudhury et al. [2020] S. Choudhury, N. Gruver, and M. J. Kochenderfer, “Adaptive Informative Path Planning with Multimodal Sensing,” in International Conference on Automated Planning and Scheduling, vol. 30. AAAI Press, 2020, pp. 57–65.
 Popović et al. [2020] M. Popović, T. Vidal-Calleja, J. J. Chung, J. Nieto, and R. Siegwart, “Informative Path Planning for Active Field Mapping under Localization Uncertainty,” in IEEE International Conference on Robotics and Automation. IEEE, 2020.
 Viseras and Garcia [2019] A. Viseras and R. Garcia, “DeepIG: Multirobot information gathering with deep reinforcement learning,” IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 3059–3066, 2019.
 Chen et al. [2020] F. Chen, J. D. Martin, Y. Huang, J. Wang, and B. Englot, “Autonomous Exploration Under Uncertainty via Deep Reinforcement Learning on Graphs,” in IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2020, pp. 6140–6147.
 Silver et al. [2017] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017.
 Silver et al. [2018] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel et al., “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play,” Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
 Ko et al. [1995] C.-W. Ko, J. Lee, and M. Queyranne, “An Exact Algorithm for Maximum Entropy Sampling,” Operations Research, vol. 43, no. 4, pp. 684–691, 1995.
 Binney and Sukhatme [2012] J. Binney and G. S. Sukhatme, “Branch and bound for informative path planning,” in IEEE International Conference on Robotics and Automation. IEEE, 2012, pp. 2147–2154.
 Omidvar and Li [2010] M. N. Omidvar and X. Li, “A Comparative Study of CMA-ES on Large Scale Global Optimisation,” in Australasian Joint Conference on Artificial Intelligence. Springer, 2010, pp. 303–312.
 Niroui et al. [2019] F. Niroui, K. Zhang, Z. Kashino, and G. Nejat, “Deep reinforcement learning robot for search and rescue applications: Exploration in unknown cluttered environments,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 610–617, 2019.
 Browne et al. [2012] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, “A Survey of Monte Carlo Tree Search Methods,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1–43, 2012.
 Chaslot et al. [2008] G. Chaslot, S. Bakkes, I. Szita, and P. Spronck, “Monte-Carlo Tree Search: A New Framework for Game AI,” AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, vol. 8, pp. 216–217, 2008.
 Sunberg and Kochenderfer [2018] Z. N. Sunberg and M. J. Kochenderfer, “Online algorithms for POMDPs with continuous state, action, and observation spaces,” in International Conference on Automated Planning and Scheduling. AAAI Press, 2018.
 Silver and Veness [2010] D. Silver and J. Veness, “Monte-Carlo planning in large POMDPs,” in Neural Information Processing Systems, 2010.
 Sim and Roy [2005] R. Sim and N. Roy, “Global Aoptimal Robot Exploration in SLAM,” in IEEE International Conference on Robotics and Automation. IEEE, 2005, pp. 661–666.
 Gotovos et al. [2013] A. Gotovos, N. Casati, G. Hitz, and A. Krause, “Active Learning for Level Set Estimation,” in International Joint Conference on Artificial Intelligence. AAAI Press, 2013, pp. 1344–1350.
 Schrittwieser et al. [2020] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel et al., “Mastering Atari, Go, chess and shogi by planning with a learned model,” Nature, vol. 588, no. 7839, pp. 604–609, 2020.
 Romera et al. [2017] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, “ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation,” IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 1, pp. 263–272, 2017.
 Smith [2018] L. N. Smith, “A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay,” arXiv, 2018.
 Liu and Zou [2018] R. Liu and J. Zou, “The Effects of Memory Replay in Reinforcement Learning,” in Allerton Conference on Communication, Control, and Computing. IEEE, 2018, pp. 478–485.
 Wu [2019] D. J. Wu, “Accelerating Self-Play Learning in Go,” arXiv, 2019.