Recent years have seen growing use of autonomous mobile robots in a variety of data collection applications, including environmental monitoring [1, 2, 3, 4, 5], exploration, and inspection. In many tasks, these systems promise a more flexible, safe, and economical solution compared to traditional manual or static sampling methods [4, 8]. However, to fully exploit their potential, a key challenge is developing algorithms for active sensing, where the objective is to plan paths for efficient data gathering subject to finite computational and sensing resources, such as energy, time, or travel distance.
This paper examines the problem of active sensing using an unmanned aerial vehicle (UAV) in terrain monitoring scenarios. Our goal is to map a nonhomogeneous 2D scalar field, e.g. of temperature or humidity, on the terrain using measurements taken by an on-board sensor. In similar setups, most practical systems rely on precomputed paths for data collection, e.g. coverage-based planning. However, such approaches assume that the information value of measurements is uniformly distributed over the environment and hence do not allow for adaptivity, i.e. more closely inspecting regions of interest, such as hotspots [2, 1] or anomalies, as they are discovered. Our motivation is to quickly find information-rich paths targeting these areas by performing efficient online adaptive replanning on computationally constrained platforms.
Recently, several informative path planning (IPP) approaches for active sensing have been proposed [1, 10, 2, 3, 8], which enable adjusting decision-making based on observed data. However, scaling these methods to large environments and action spaces remains an open challenge. The main computational bottleneck in IPP is the predictive replanning step, since multiple possible future measurements must be simulated when evaluating next candidate actions. Previous studies have tackled this by discretizing the action space for replanning, e.g. by using sparse graphs [10, 11]; however, such simplifications sacrifice the quality of predictive plans. A new alternative paradigm is to use reinforcement learning (RL) to learn data gathering actions directly. Though emerging works in RL for IPP demonstrate promising results [12, 13], they have been limited to small 2D action spaces, and adaptive planning to map environments with spatial correlations and large 3D action spaces has not yet been investigated.
Our method combines Monte Carlo tree search (MCTS) with a convolutional neural network (CNN) to learn information-rich actions in adaptive data gathering missions. Since active sensing tasks are typically expensive to simulate, our approach caters for training in low data regimes. By replacing the computational burden of predictive planning with simple tree search, we achieve efficient online replanning, which is critical for deployment on mobile robots (Fig. 1).
The contributions of this work are:
- A new deep RL algorithm for robotic planning applications that supports continuous high-dimensional state spaces, large action spaces, and data-efficient training.
- The integration of our RL algorithm in an IPP framework for UAV-based terrain monitoring.
- The validation of our approach in an ablation study and in evaluations against benchmarks using synthetic and real-world data.
Our framework will be open-sourced for usage by the community.
II Related Work
Our work lies at the intersection of IPP for active sensing, MCTS planning methods, and recent advances in RL.
IPP methods are gaining rapid traction in many active sensing applications [2, 1, 3, 10]. In this area of study, our work focuses on strategies with adaptive online replanning capabilities, which allow the targeted monitoring of regions of interest, e.g. hotspots or abnormal areas. Some methods focus on discrete action spaces defined by sparse graphs of permissible actions [10, 11]. However, these simplifications are not applicable since the distribution of target regions is a priori unknown. Our proposed algorithm reasons about a discrete action space that is orders of magnitude larger while ensuring online computability. In terms of planning strategy, adaptive IPP algorithms can be classified into combinatorial [16, 17], sampling-based [3, 10], and optimization-based approaches [1, 2]. Combinatorial methods exhaustively query the search space. Thus, they cannot plan online in large search spaces, which makes them impractical for adaptive replanning.
Continuous-space sampling-based planners generate informative robot trajectories by sampling candidate actions while guaranteeing probabilistic asymptotic optimality [3, 8]. However, their sample-efficiency is typically low for planning with more complex objectives and larger action spaces, since many measurements need to be forward-simulated to find promising paths in the problem space [1, 18]. In our particular setup, considering spatial correlations in a terrain over many candidate regions leads to a complex and expensive-to-evaluate information criterion. In similar scenarios, several works have investigated global optimization, e.g. evolutionary algorithms [2, 1] or Bayesian optimization, to enhance planning efficiency. Although these approaches deliver high-quality paths [1, 11], using them for online decision-making is still computationally expensive when reasoning about many spatially correlated candidate future measurements.
In robotics, RL algorithms are increasingly being utilized for autonomous search and rescue, information gathering, and exploration of unknown environments. Although emerging works show promising performance in these applications, RL-based approaches have not yet been investigated for online adaptive IPP and remain largely restricted to environment exploration. To address this gap, we propose the first RL approach for planning informative paths online over spatially correlated terrains with large action spaces.
Another well-studied planning paradigm is MCTS [20, 21]. Recently, MCTS extensions were proposed for large and continuous action spaces and partially observable environments. Choudhury et al. applied a variant of MCTS to obtain long-horizon, anytime solutions in adaptive IPP problems. However, these online methods are restricted to small action spaces and spatially uncorrelated environments.
Inspired by recent advances in RL [15, 14], our RL-based algorithm replaces computationally expensive MCTS rollouts and sample-inefficient action selection with a learned value function and a learned action policy, respectively. We extend the AlphaZero algorithm by applying it to robotics tasks with limited computational budget and low data regimes.
We begin by briefly describing the general active sensing problem and specifying the terrain monitoring scenario used to develop our RL-based IPP approach.
III-A Problem Formulation
The general active sensing problem is formulated as follows. The goal (Eq. 1) is to find, among a set of candidate action sequences, i.e. robot trajectories, the one that maximizes an information-theoretic criterion while its execution cost stays within the robot's budget. Here, the cost function maps an action sequence to its associated execution cost, the budget limit is expressed e.g. in time or energy, and the information criterion is computed from the new sensor measurements obtained by executing the action sequence.
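The constrained maximization described above can be written compactly. The symbols below are labels assumed here for illustration ($\psi$ for an action sequence, $\Psi$ for the set of candidate sequences, $I$ for the information criterion, $C$ for the cost function, and $B$ for the budget limit); they are not taken from the text:

```latex
\begin{equation*}
  \psi^{*} = \operatorname*{arg\,max}_{\psi \in \Psi} \; I(\psi)
  \quad \text{s.t.} \quad C(\psi) \leq B .
\end{equation*}
```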
This work focuses on the scenario of monitoring a terrain using a UAV equipped with a camera. In this case, the cost of an action sequence (Eq. 2) is its total flight time, i.e. the sum of the flight times between consecutive measurement positions. Each measurement position is a 3D position above the terrain from which an image is registered. The flight time between two measurement positions is computed by a constant acceleration-deceleration model with a maximum speed.
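As an illustration of such a cost model, the following minimal sketch computes flight times under a symmetric trapezoidal velocity profile (accelerate at a constant rate, cruise at maximum speed, decelerate at the same rate). This profile is one common reading of a constant acceleration-deceleration model and is an assumption here; the function names and the default values of `accel` and `v_max` are hypothetical.

```python
import math


def flight_time(p, q, accel=2.0, v_max=2.0):
    """Flight time between 3D positions p and q (assumed trapezoidal profile).

    accel and v_max are illustrative placeholder values, not taken from the paper.
    """
    d = math.dist(p, q)  # Euclidean distance between the two positions
    if d == 0.0:
        return 0.0
    # Distance needed to accelerate to v_max and brake back to rest.
    d_ramp = v_max ** 2 / accel
    if d >= d_ramp:
        # Trapezoidal profile: ramp up, cruise at v_max, ramp down.
        return d / v_max + v_max / accel
    # Triangular profile: v_max is never reached.
    return 2.0 * math.sqrt(d / accel)


def path_cost(positions, accel=2.0, v_max=2.0):
    """Total flight time of a sequence of measurement positions."""
    return sum(
        flight_time(p, q, accel, v_max)
        for p, q in zip(positions, positions[1:])
    )
```

A path is then feasible if `path_cost(positions)` does not exceed the mission budget.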
III-B Terrain Mapping
We leverage the method of Popović et al. for efficient probabilistic mapping of the terrain. The terrain is discretized by a grid map, while the mapped target variable, e.g. temperature, is a scalar field. The prior map distribution is given by a Gaussian Process (GP) defined by a prior mean vector and covariance matrix. At mission time, a Kalman Filter is used to sequentially fuse data observed at a measurement location with the last iteration's map belief in order to obtain the posterior map mean and covariance. For further details, the reader is referred to the work of Popović et al.
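To make the sequential fusion step concrete, here is a minimal sketch of a Kalman Filter update for the simplest possible case: a single noisy scalar measurement of one grid cell. This is a deliberate simplification and an assumption on our part; the camera footprint and altitude-dependent sensor model of the referenced mapping method are not reproduced, and the function name and arguments are hypothetical.

```python
def kalman_update(mean, cov, cell, z, noise_var):
    """Fuse one scalar measurement z of grid cell `cell` into the map belief.

    mean: list of per-cell means; cov: full covariance as a list of lists.
    Assumes the measurement observes exactly one cell with variance noise_var.
    """
    n = len(mean)
    innovation_var = cov[cell][cell] + noise_var
    # Kalman gain: column `cell` of the covariance, scaled by the innovation variance.
    gain = [cov[i][cell] / innovation_var for i in range(n)]
    innovation = z - mean[cell]
    new_mean = [mean[i] + gain[i] * innovation for i in range(n)]
    # Posterior covariance: P - K * H * P, where H selects row `cell`.
    new_cov = [
        [cov[i][j] - gain[i] * cov[cell][j] for j in range(n)]
        for i in range(n)
    ]
    return new_mean, new_cov
```

Because the cells are correlated through the covariance, a measurement of one cell also shifts the mean and reduces the variance of its correlated neighbours.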
III-C Utility Definition for Adaptive IPP
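The information criterion in Eq. 1 is computed from the prior and posterior map covariances, which are obtained before and after applying the Kalman Filter update to the measurements observed by executing the action sequence, respectively.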
We study an active sensing task where the goal is to map terrain areas with higher values of the target variable, e.g. high temperature. This scenario requires online replanning to focus on mapping these areas of interest as they are discovered, and is thus a relevant problem setup for our new efficient RL-based IPP strategy. We utilize confidence-based level sets to define the regions of interest (Eq. 4): whether a grid cell belongs to them is decided from its mean and variance in the current map, together with a user-defined confidence interval width and threshold. Thus, we restrict the prior and posterior covariances in Eq. 1 to the grid cells defined by Eq. 4. This way, we only consider higher-valued regions of interest over the current map state when computing the information gain.
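A minimal sketch of this restriction follows. It assumes an upper-confidence-bound rule (a cell is of interest if its mean plus a multiple of its standard deviation reaches the threshold) and measures information gain as the reduction of the summed variances, i.e. of the covariance trace, over those cells. Both choices, and all names (`beta`, `threshold`, the function names), are assumptions for illustration rather than the exact definitions used in the paper.

```python
import math


def region_of_interest(mean, var, beta, threshold):
    """Indices of grid cells whose upper confidence bound reaches the threshold."""
    return [
        i
        for i, (m, v) in enumerate(zip(mean, var))
        if m + beta * math.sqrt(v) >= threshold
    ]


def information_gain(var_prior, var_post, cells):
    """Reduction of the summed per-cell variances, restricted to `cells`."""
    return sum(var_prior[i] - var_post[i] for i in cells)
```

Uncertain cells with a moderate mean are kept in the region of interest until enough measurements rule them out, which is what drives the adaptive behaviour.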
This section presents our new RL-based IPP approach for active sensing. As shown in Fig. 2, we iteratively train a CNN on diverse simulated terrain monitoring scenarios to learn the most informative data gathering actions. The trained CNN is then leveraged during a mission to achieve fast online replanning. The following sub-sections detail our CNN architecture and RL algorithm designed for robotic applications.
IV-A Connection between IPP & RL
We first cast the general IPP problem from Sec. III in an RL setting. The value of a state is defined in terms of the rewards collected along the successor states reached when choosing each next action according to the policy. A state is defined by the current map state and the previously executed action, i.e. the current UAV position. Consequently, a successor state is defined by the updated map state and the newly executed action. In our work, the 3D action space is a discrete set of measurement positions. The reward function (Eq. 5) is defined over map quantities that are restricted to regions of interest given by Eq. 4. The discount factor is set such that the state value recovers the information criterion.
IV-B Algorithm Overview
Our goal is to learn the best policies for IPP offline to allow for fast online replanning at deployment. To achieve this, we bring recent advances in RL by Silver et al. [14, 15] into the robotics domain. In a similar spirit, our RL algorithm combines MCTS with a policy-value CNN (Fig. 2). At train time, the algorithm alternates between episode generation and CNN training. For terrain monitoring, episodes are generated by simulating diverse scenarios varying in map priors, target variables, and initial UAV positions as explained in Sec. IV-C. Each episode step is planned by a tree search from the current state, producing a target value given by the simulator and a target policy proportional to the action visit counts at the tree's root node. Both targets are stored in a replay buffer used to train the CNN. As described in Sec. IV-E, we introduce a CNN architecture suitable for inference on mobile robots. Further, Sec. IV-F proposes components for low data regimes in robotics tasks, which are often expensive to simulate.
IV-C Episode Generation at Train Time
The most recently trained CNN is used to simulate a fixed number of episodes. An episode terminates when the budget is spent or a maximum number of steps is reached. In each step, a tree search is executed from the current state for a certain number of simulations as described in Sec. IV-D. The policy (Eq. 6) is derived from the root node's action visit counts over the set of next measurement positions reachable within the remaining budget. A temperature hyper-parameter smooths the policy towards uniform as it grows and collapses it onto the most-visited action as it approaches zero. The next action is sampled from this policy, and the next map state is obtained by fusing the resulting measurement into the map. The reward is given by Eq. 5. The resulting tuple is stored in the replay buffer.
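The visit-count policy can be sketched as follows, assuming the usual AlphaZero-style form in which each root visit count is raised to the power of the inverse temperature and then normalised. The names `visits`, `tau`, and both function names are hypothetical; the feasibility filtering by remaining budget is assumed to have been applied before the counts are passed in.

```python
import random


def visit_count_policy(visits, tau=1.0):
    """Turn root visit counts {action: count} into a probability distribution.

    Large tau flattens the policy towards uniform; tau close to zero
    concentrates it on the most-visited action.
    """
    if tau <= 0.0:
        raise ValueError("tau must be positive")
    top = max(visits.values())
    if top == 0:
        # No action was visited: fall back to a uniform policy.
        return {a: 1.0 / len(visits) for a in visits}
    # Dividing by the largest count keeps the power numerically stable.
    powered = {a: (n / top) ** (1.0 / tau) for a, n in visits.items()}
    total = sum(powered.values())
    return {a: w / total for a, w in powered.items()}


def sample_action(policy, rng):
    """Sample one action from the policy using the given random.Random instance."""
    actions = list(policy)
    weights = [policy[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

At train time an action would be drawn with `sample_action(policy, random.Random(seed))`, which keeps episode generation reproducible.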
IV-D Tree Search with Neural Networks
As shown in Fig. 3, a fixed number of simulations is executed by traversing the tree. Each simulation terminates when the budget or a maximal depth is exceeded. The tree search queries the CNN for policy and value estimates at leaf nodes and stores the policy estimates as the node's prior probabilities. The probabilistic upper confidence tree (PUCT) bound (Eq. 7) is used to traverse the tree. It combines the state-action value with an exploration term that depends on the prior probability, the visit count of the parent node, and the exploration factors. We choose the next action that maximizes this bound.
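For illustration, the sketch below scores actions with a PUCT bound that has two exploration constants, in the form used by MuZero-style tree search. Whether the paper uses exactly this form is an assumption; the names `c1` and `c2`, their default values (the defaults reported for MuZero), and the `children` layout are likewise not taken from the text.

```python
import math


def puct_score(q, prior, parent_visits, child_visits, c1=1.25, c2=19652.0):
    """PUCT bound for one action (assumed two-constant, MuZero-style form)."""
    exploration = c1 + math.log((parent_visits + c2 + 1.0) / c2)
    return q + prior * math.sqrt(parent_visits) / (1.0 + child_visits) * exploration


def select_action(children, c1=1.25, c2=19652.0):
    """Pick the action with the highest PUCT bound.

    children maps each action to a tuple (q_value, prior, visit_count).
    """
    parent_visits = sum(visits for _, _, visits in children.values())
    return max(
        children,
        key=lambda a: puct_score(
            children[a][0], children[a][1], parent_visits, children[a][2], c1, c2
        ),
    )
```

An unvisited action with a high prior can outrank a well-explored action with a higher value estimate, which is how the learned prior steers the search.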
IV-E Network Architecture & Training
The CNN predicts a policy and a value for a given state. Input feature planes to the CNN are (a) the min-max normalized current map covariance restricted to the regions of interest; (b) the remaining budget normalized over the mission budget; (c) the UAV position normalized over the bounds of the 3D action space; and (d) a costs feature map of the same shape as the map, subsequently min-max normalized. Note that the scalar inputs are expanded to feature maps of the same shape as the map. Additionally, we input a history of the previous two covariance, position, and budget input planes.
The encoder is built from residual blocks and SiLU activations, chosen to reduce inference time. The encoder is followed by two separate prediction heads for policy and value estimates. Both heads consist of three blocks with 2D convolution, batch norm, and SiLU activations. The last block's output feature maps in each head are flattened to fixed dimensions by global average and max pooling before applying a fully connected layer. This reduces the number of parameters and ensures an input size-agnostic architecture. The CNN parameters are trained with stochastic gradient descent (SGD) on mini-batches to minimize the training loss (Eq. 9) with its loss coefficients [28].
Table I: Ablation study results. Map uncertainty and RMSE after 33%, 67%, and 100% effective mission time, and average planning runtime.

| Variant | Uncertainty 33% | Uncertainty 67% | Uncertainty 100% | RMSE 33% | RMSE 67% | RMSE 100% | Runtime [s] |
|---|---|---|---|---|---|---|---|
| Baseline as in Sec. IV | 73.61 | 31.83 | 12.44 | 0.15 | 0.09 | 0.05 | 0.64 |
| (i) w/ fixed off-policy window | 83.25 | 50.17 | 24.68 | 0.16 | 0.12 | 0.09 | 0.68 |
| (i) w/ fixed exploration constants | 95.27 | 39.46 | 21.86 | 0.20 | 0.11 | 0.08 | 0.65 |
| (i) w/o forced playouts + policy pruning | 79.23 | 28.53 | 22.62 | 0.18 | 0.09 | 0.07 | 0.66 |
| (ii) w/o global pooling bias blocks | 103.58 | 45.78 | 31.44 | 0.19 | 0.11 | 0.10 | 0.64 |
| (ii) 5 residual blocks in encoder | 82.90 | 29.94 | 17.94 | 0.16 | 0.08 | 0.07 | 0.55 |
| (ii) w/o input feature history | 102.40 | 40.48 | 31.33 | 0.20 | 0.10 | 0.09 | 0.66 |
IV-F AlphaZero in Low Data Regimes
Adaptive IPP with spatio-temporal correlations is expensive to simulate. In contrast to previously examined fast-to-simulate games such as Go or chess, real-world robotics tasks are often limited in the number of simulations and generated episodes at train time. Further, we cannot leverage massive computational resources [14, 15], but only use a single GPU.
A major shortcoming of the original AlphaZero algorithm is that the policy targets in Eq. 6 merely reflect the tree search exploration dynamics. However, the raw action visit counts do not necessarily capture the gathered state-action value information for a finite number of simulations. Hence, with only a moderate number of simulations per episode step, AlphaZero tends to overemphasize the initially explored actions in subsequent training iterations, leading to bias in training data and thus overestimated state-action values. As the tree traversal in Eq. 7 is also guided by the learned policy, the overemphasis on initially explored actions leads to overfitting and low-quality policies. In the following, we introduce methods to solve these problems and increase the efficiency of our RL algorithm.
To avoid overemphasizing random actions in the node selection, a large exploration constant in Eq. 7 is desirable. However, in later training iterations, increasing exploitation of known good actions is required to ensure convergence. Thus, we propose an exponentially decaying exploration constant (Eq. 10), parameterized by the training iteration number, the initial constant, the exponential decay factor, and the minimal value. This can be seen as a dynamic exploration-exploitation trade-off in RL algorithms.
Similarly, for the Dirichlet exploration noise in Eq. 8, we introduce an exponentially decaying schedule (Eq. 11) for its concentration parameter, defined by an initial value, an exponential decay factor, and a minimal value. A high value leads to a near-uniform noise distribution, avoiding overemphasis on random actions. However, the parameter should decrease as training progresses to exploit the learned policy.
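Both schedules share the same shape, which can be sketched as an exponential decay clipped from below. The clipped form `max(minimum, initial * decay ** iteration)` is an assumption consistent with the description (initial value, decay factor, minimal value); the function and argument names are hypothetical.

```python
def decayed_constant(iteration, initial, decay, minimum):
    """Exponentially decaying schedule with a lower bound.

    iteration: training iteration number (0-based)
    initial:   value used at iteration 0
    decay:     multiplicative factor in (0, 1) applied once per iteration
    minimum:   floor that the schedule never drops below
    """
    return max(minimum, initial * decay ** iteration)
```

The same helper could drive the exploration constant and the Dirichlet noise parameter, each with its own initial value, decay factor, and floor.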
Next, we propose an increasing replay buffer size to accelerate training. Similar to our approach, adaptive replay buffers are known to improve performance in other RL domains. On the one hand, a substantial amount of data is required to train the CNN on a variety of paths. On the other hand, the loss (Eq. 9) initially shows sudden drops when outdated train data is no longer in the replay buffer. Thus, in early training stages, a small replay buffer improves convergence speed. Larger replay buffers in later training stages help regularize training and ensure train data diversity. Hence, the replay buffer window size is adaptively set (Eq. 12) between an initial and a maximum window size. The window size is increased by one every fixed number of training iterations.
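A minimal sketch of such a growing window follows. It assumes the window is counted in training iterations, grows by one every `step` iterations, and is capped at the maximum; the names `step`, `replay_window_size`, and `trim_buffer` are hypothetical.

```python
def replay_window_size(iteration, initial, maximum, step):
    """Window size at a given training iteration (grows by one every `step` iterations)."""
    return min(maximum, initial + iteration // step)


def trim_buffer(data_per_iteration, window):
    """Keep only the training data from the most recent `window` iterations.

    data_per_iteration is a list with one entry (e.g. a list of samples)
    per training iteration, oldest first.
    """
    return data_per_iteration[-window:]
```

Early in training the small window discards outdated episodes quickly, while the cap keeps memory use bounded later on.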
Moreover, we adapt two techniques introduced by Wu for the game of Go. First, forced playouts and policy pruning decouple exploration dynamics and policy targets. While traversing the search tree, underexplored root node actions are forced to be chosen by overriding their bound in Eq. 7. In Eq. 6, the visits of such an action are subtracted unless the action led to a high-value successor state. Second, at regular intervals in the encoder, and as the first layers of the value and policy heads, we use global pooling bias layers. This enables our CNN to focus on both the local and global features required for IPP. Further details are discussed by Wu.
IV-G Planning at Mission Time
Replanning during a mission is similar to generating a training episode. We use the offline-learned CNN and perform tree search from the current state. An action is chosen greedily from the policy in Eq. 6, i.e. the most-visited root action is executed. Since the learned policy steers the exploration, the tree search is highly sample-efficient. This way, the number of tree search simulations is significantly reduced to allow fast replanning. Note that policy noise injection and forced playouts at the root are disabled to avoid misleading exploration bias and improve performance.
On average, our RL approach ensures the fastest uncertainty and RMSE reduction over mission time. Solid lines indicate means over 10 trials, and shaded regions indicate the standard deviations. Note that there is inherent variability due to the randomly generated hotspot locations. However, our method ensures stable performance (small standard deviation) over changing environments. Further, runtime is substantially reduced. The planned path (evolving over time from blue to red) validates the adaptive behavior of our approach, which explores the terrain with a focus on the high-value region (green).
V Experimental Results
This section presents our experimental results. We first validate our RL approach in an ablation study, then assess its IPP and runtime performance in terrain monitoring scenarios.
V-A Experimental Setup
Our simulation setup considers terrains with 2D discrete field maps with values between 0 and 1, randomly split into high- and low-value regions to create regions of interest as defined by Eq. 4. We model ground truth terrain maps of and resolution. The UAV action space of measurement locations is defined by a discrete 3D lattice above the terrain. The lattice mirrors the grid resolution on two altitude levels ( and ), resulting in actions. The missions are implemented in Python on a desktop with a 1.8 GHz Intel i7 processor and 16 GB RAM, without GPU acceleration, to avoid unfair advantages in inference speed of our CNN. We repeat the missions times and report means and standard deviations. Our RL algorithm is trained offline on a single machine with a 2.2 GHz AMD Ryzen 9 3900X, 63 GB RAM, and an NVIDIA GeForce RTX 2080 Ti GPU.
We use the same inverse sensor model as Popović et al. to simulate camera measurement noise, assuming a downwards-facing square camera footprint with FoV. The prior map mean is uniform with a value of . The GP is defined by an isotropic Matérn kernel whose length scale, signal variance, and noise variance are obtained by maximizing the log marginal likelihood over independent maps. The threshold defines regions of interest.
We set the mission budget , the UAV initial position to , the acceleration-deceleration with maximum speed . At train time, each episode randomly generates a new ground truth map, map priors from a wide range of GP hyperparameters, and UAV start positions, such that our approach has no unfair overfitting advantage. To assess planning performance, we evaluate map uncertainty with the covariance trace in the regions of interest and the root mean squared error (RMSE) of the map mean in these regions. Lower values in these metrics indicate better performance. In contrast to earlier work [1, 10, 2], the remaining budget incorporates not only the path travel time but also the planning (computation) runtime, as relevant for robotic platforms with limited on-board resources. We refer to the spent budget as the effective mission time.
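The evaluation quantities described above can be sketched as follows. The helper names are hypothetical, `cells` is assumed to hold the indices of the grid cells in the regions of interest, and the effective mission time is assumed to be the plain sum of travel and planning times.

```python
import math


def covariance_trace(cov, cells):
    """Map uncertainty: sum of the covariance diagonal over the given cells."""
    return sum(cov[i][i] for i in cells)


def rmse(mean, truth, cells):
    """Root mean squared error of the map mean over the given cells."""
    squared = [(mean[i] - truth[i]) ** 2 for i in cells]
    return math.sqrt(sum(squared) / len(squared))


def effective_mission_time(travel_times, planning_times):
    """Spent budget: flight time plus on-board planning (computation) time."""
    return sum(travel_times) + sum(planning_times)
```

Charging planning time against the budget means that a slow planner loses measurement opportunities even if its individual paths are good.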
V-B Ablation Study
This section validates the algorithm design and CNN architecture introduced in Sec. IV. We perform an ablation study comparing our approach to versions of itself (i) removing proposed training procedure components, and (ii) changing the CNN architecture. We assume a resolution of resulting in a grid map with of actions. Note that the results do not depend on the actual size of and . We generate a small number of episodes in each iteration and terminate training after iterations. Each tree search is executed as described in Sec. IV-G with simulations and exploration constants . Table I summarizes our results. We evaluate the map uncertainty and RMSE over the posterior map state after 33%, 67%, and 100% effective mission time, and average planning runtime.
As proposed in Eq. 12, the training procedure considers a replay buffer with adaptive size. Convergence speed, and thus performance, is improved compared to a fixed-size buffer. Also, our proposed exploration constants scheduling scheme improves plan quality by stabilizing the exploration-exploitation trade-off compared to fixed constants. Further, the results show the benefits of including a history of the previous two map states and UAV positions in addition to their current values. Interestingly, reducing the encoder depth to 5 residual blocks and removing forced playouts both perform reasonably well, but still lead to worse results in later mission stages. This suggests that deeper CNNs and forced playouts facilitate learning in larger grid maps and longer missions. Similarly, global pooling bias blocks help learning global map features, which benefits information-gathering performance.
V-C Comparison Against Benchmarks
Next, our RL algorithm is evaluated against various benchmarks. We set a resolution , hence is a grid, and has actions. Our approach is compared against: (a) uniform random sampling in ; (b) a coverage path with equally spaced measurement locations at a fixed altitude; (c) MCTS with progressive widening for large action spaces and a generalized cost-benefit rollout policy proposed by Choudhury et al. for adaptive IPP; (d) a state-of-the-art IPP framework based on the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) proposed by Popović et al. All approaches reason about a -step planning horizon. We set CMA-ES parameters to iterations, offspring, and coordinate-wise step size based on the trade-off between path quality and runtime. The permissible next actions of the MCTS were reduced to a radius of around the UAV position to be computable online, resulting in next actions per move. For a fair comparison, we trained our approach offline on this restricted action space, which is still much larger than considered in previous work [10, 11].
Fig. 5 reports the results obtained using each approach. Our method substantially reduces runtime compared to CMA-ES and MCTS. This result highlights the significantly improved sample-efficiency of our tree search and confirms that the CNN can successfully learn informative actions from training in diverse simulated missions. Random sampling performs poorly as it reduces uncertainty and RMSE in high-value regions only by chance. The coverage path shows high variability since data-gathering efficiency greatly depends on the problem setup, i.e. hotspot locations relative to the preplanned path. Our approach outperforms this benchmark with much greater consistency.
V-D Temperature Mapping Scenario
We demonstrate our RL-based IPP approach in a photorealistic simulation using real-world surface temperature data. The data was collected in a crop field near Forschungszentrum Jülich, Germany on July 20, 2021 with a DJI Matrice 600 UAV carrying a Vue Pro R 640 thermal sensor. The UAV executed a coverage path at altitude to collect images, which were then processed using Pix4D software to generate an orthomosaic representing the target terrain in our simulation as depicted in Fig. 1-left. The aim is to validate our method for adaptively mapping high-temperature areas in this realistic setting.
For fusing new data into the map, we assume altitude-dependent sensor noise as described in Sec. V-A. The terrain is discretized using a uniform resolution. We compare our RL-based online algorithm against a fixed-altitude lawnmower path as a traditional baseline. Our approach is trained only on synthetic simulated data as shown in Fig. 5.
Fig. 1-right shows the planned 3D path above the terrain using our strategy. This confirms that our method collects targeted measurements in high-temperature areas of interest (red) by efficient online replanning. This is reflected quantitatively in Fig. 6-right, where our approach ensures fast uncertainty reduction while a coverage path performs worse as it cannot adapt mapping behaviour. These results verify the successful transfer of our model trained in simulation to real-world data and demonstrate its benefits over a traditional approach.
VI Conclusions and Future Work
This paper proposes a new RL-based approach for online adaptive IPP using resource-constrained mobile robots. The algorithm is designed for sample-efficient planning in large action spaces and high-dimensional state spaces, enabling fast information gathering in active sensing tasks. A key feature of our approach is its set of components for accelerated learning in low data regimes. We validate the approach in an ablation study, and evaluate its performance compared to multiple benchmarks. Results show that our approach drastically reduces planning runtime, enabling efficient adaptive replanning on physical platforms.
Future work will investigate extending our algorithm to multi-robot teams and larger environments. We will also conduct field experiments to validate our method in real-world missions and to explore its sim2real capabilities.
We would like to thank Jordan Bates from Forschungszentrum Jülich for providing the real-world data and sensor information.
- Popović et al.  M. Popović, T. Vidal-Calleja, G. Hitz, J. J. Chung, I. Sa, R. Siegwart, and J. Nieto, “An informative path planning framework for UAV-based terrain monitoring,” Autonomous Robots, vol. 44, no. 6, pp. 889–911, 2020.
- Hitz et al.  G. Hitz, E. Galceran, M.-È. Garneau, F. Pomerleau, and R. Siegwart, “Adaptive continuous-space informative path planning for online environmental monitoring,” Journal of Field Robotics, vol. 34, no. 8, pp. 1427–1449, 2017.
- Hollinger and Sukhatme  G. A. Hollinger and G. S. Sukhatme, “Sampling-based robotic information gathering algorithms,” The International Journal of Robotics Research, vol. 33, no. 9, pp. 1271–1287, 2014.
- Dunbabin and Marques  M. Dunbabin and L. Marques, “Robots for Environmental Monitoring: Significant Advancements and Applications,” IEEE Robotics & Automation Magazine, vol. 19, no. 1, pp. 24–39, 2012.
- Lelong et al.  C. C. Lelong, P. Burger, G. Jubelin, B. Roux, S. Labbé, and F. Baret, “Assessment of Unmanned Aerial Vehicles Imagery for Quantitative Monitoring of Wheat Crop in Small Plots,” Sensors, vol. 8, no. 5, pp. 3557–3585, 2008.
- Doherty and Rudol  P. Doherty and P. Rudol, “A UAV search and rescue scenario with human body detection and geolocalization,” in Australasian Joint Conference on Artificial Intelligence. Springer, 2007, pp. 1–13.
- Galceran and Carreras  E. Galceran and M. Carreras, “A survey on coverage path planning for robotics,” Robotics and Autonomous Systems, vol. 61, no. 12, pp. 1258–1276, 2013.
- Vivaldini et al.  K. C. T. Vivaldini, T. H. Martinelli, V. C. Guizilini, J. R. Souza, M. D. Oliveira, F. T. Ramos, and D. F. Wolf, “UAV route planning for active disease classification,” Autonomous Robots, vol. 43, no. 5, pp. 1137–1153, 2019.
- Blanchard and Sapsis  A. Blanchard and T. Sapsis, “Informative path planning for anomaly detection in environment exploration and monitoring,” arXiv preprint arXiv:2005.10040, 2020.
- Choudhury et al.  S. Choudhury, N. Gruver, and M. J. Kochenderfer, “Adaptive Informative Path Planning with Multimodal Sensing,” in International Conference on Automated Planning and Scheduling, vol. 30. AAAI Press, 2020, pp. 57–65.
- Popović et al.  M. Popović, T. Vidal-Calleja, J. J. Chung, J. Nieto, and R. Siegwart, “Informative Path Planning for Active Field Mapping under Localization Uncertainty,” in IEEE International Conference on Robotics and Automation. IEEE, 2020.
- Viseras and Garcia  A. Viseras and R. Garcia, “DeepIG: Multi-robot information gathering with deep reinforcement learning,” IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 3059–3066, 2019.
- Chen et al.  F. Chen, J. D. Martin, Y. Huang, J. Wang, and B. Englot, “Autonomous Exploration Under Uncertainty via Deep Reinforcement Learning on Graphs,” in IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2020, pp. 6140–6147.
- Silver et al.  D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017.
- Silver et al.  D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel et al., “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play,” Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
- Ko et al.  C.-W. Ko, J. Lee, and M. Queyranne, “An Exact Algorithm for Maximum Entropy Sampling,” Operations Research, vol. 43, no. 4, pp. 684–691, 1995.
- Binney and Sukhatme  J. Binney and G. S. Sukhatme, “Branch and bound for informative path planning,” in IEEE International Conference on Robotics and Automation. IEEE, 2012, pp. 2147–2154.
- Omidvar and Li  M. N. Omidvar and X. Li, “A Comparative Study of CMA-ES on Large Scale Global Optimisation,” in Australasian Joint Conference on Artificial Intelligence. Springer, 2010, pp. 303–312.
- Niroui et al.  F. Niroui, K. Zhang, Z. Kashino, and G. Nejat, “Deep reinforcement learning robot for search and rescue applications: Exploration in unknown cluttered environments,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 610–617, 2019.
- Browne et al.  C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, “A Survey of Monte Carlo Tree Search Methods,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1–43, 2012.
- Chaslot et al.  G. Chaslot, S. Bakkes, I. Szita, and P. Spronck, “Monte-Carlo Tree Search: A New Framework for Game AI,” AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, vol. 8, pp. 216–217, 2008.
- Sunberg and Kochenderfer  Z. N. Sunberg and M. J. Kochenderfer, “Online algorithms for POMDPs with continuous state, action, and observation spaces,” in International Conference on Automated Planning and Scheduling. AAAI Press, 2018.
- Silver and Veness  D. Silver and J. Veness, “Monte-Carlo planning in large POMDPs,” in Neural Information Processing Systems, 2010.
- Sim and Roy  R. Sim and N. Roy, “Global A-optimal Robot Exploration in SLAM,” in IEEE International Conference on Robotics and Automation. IEEE, 2005, pp. 661–666.
- Gotovos et al.  A. Gotovos, N. Casati, G. Hitz, and A. Krause, “Active Learning for Level Set Estimation,” in International Joint Conference on Artificial Intelligence. AAAI Press, 2013, pp. 1344–1350.
- Schrittwieser et al.  J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel et al., “Mastering Atari, Go, chess and shogi by planning with a learned model,” Nature, vol. 588, no. 7839, pp. 604–609, 2020.
- Romera et al.  E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, “ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation,” IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 1, pp. 263–272, 2017.
- Smith  L. N. Smith, “A disciplined approach to neural network hyper-parameters: Part 1 – rate, batch size, momentum, and weight decay,” arXiv, 2018.
- Liu and Zou  R. Liu and J. Zou, “The Effects of Memory Replay in Reinforcement Learning,” in Allerton Conference on Communication, Control, and Computing. IEEE, 2018, pp. 478–485.
- Wu  D. J. Wu, “Accelerating Self-Play Learning in Go,” arXiv, 2019.