The robot exploration problem is defined as making a robot, or a team of robots, explore unknown cluttered environments (e.g., offices, forests, ruins) autonomously with specific goals. The goal can be classified as: (1) maximizing knowledge of the unknown environment, i.e., acquiring a map of the world; (2) searching for a static or moving object without prior information about the world. While solving exploration problems with the second goal can exploit prior information about the target object, such as semantic information, it also needs to fulfill the first goal.
Frontier-based methods have been widely used to solve the robot exploration problem. The classical approach adopts a greedy strategy, which may lead to an inefficient overall path. Many approaches have been proposed that consider additional performance metrics (e.g., information gain). However, these approaches are designed and evaluated in a limited number of environments, and may therefore fail to generalize to environments with different layouts.
Compared with these classical methods, machine learning-based methods exhibit the advantage of learning from various data. Deep Reinforcement Learning (DRL), which uses a neural network to approximate the agent's policy during its interactions with the environment, has gained increasing attention in games, robotics, and other applications. When applied to robot exploration problems, most research works design the state space as images and use Convolutional Neural Networks (CNNs). For example, in one such work, a CNN is utilized as a mapping from the robot's local observation to the next optimal action.
However, the size of a CNN's input images is fixed, which results in the following limitations: (1) if the input images represent global information, the size of the overall explored map must be pre-defined so that the input images can fully contain all the map information; (2) if the input images represent local information, which cannot convey the full state, recurrent neural networks (RNNs) or memory networks need to be adopted. Unfortunately, the robot exploration problem in this formulation requires relatively long-term planning, which remains a difficult and unsolved problem.
In this paper, to deal with the aforementioned problems, we present a novel state representation method that relies on 4D point-clouds-like information of variable size. This information has the same data structure as point clouds and consists of 2D point locations together with the corresponding 1D frontier and 1D distance information, as shown in Fig. 1. We also design the corresponding training framework, which is based on the deep Q-learning method with a variable action space. By replacing the image observation with 4D point-clouds-like information, our exploration model can deal with unknown maps of arbitrary size. Based on dynamic graph CNN (DGCNN), one of the typical neural network structures for processing point clouds, our proposed network takes the 4D point-clouds-like information as input and outputs the expected value of each frontier, which can guide the robot to the frontier with the highest value. The network is trained in a way similar to DQN in the HouseExpo environment, a fast exploration simulation platform built on a large dataset of 2D indoor layouts. Experiments show that our exploration model achieves relatively good performance compared with a learning-based baseline, a state-of-the-art method, two classical methods, and a random method.
I-A Original Contributions
The contributions of this paper are threefold. First, we propose a novel state representation method using 4D point-clouds-like information to solve the problems discussed above. Although point clouds have been utilized in motion planning and navigation, our work differs from those works in two main respects: (1) we use point clouds to represent global information, while they represent the local observation; (2) our action space selects a frontier point from a frontier set of variable size, while their action spaces contain control commands. Second, we design the corresponding state-action value network based on DGCNN and the training framework based on DQN. The novelty is that the size of our action space is variable, which makes the neural network converge faster. Third, we demonstrate the performance of the proposed method on a wide variety of environments that the model has never seen before, including maps much larger than those in the training set.
The remainder of this paper is organized as follows. We first review related work in Section II. Then we formulate the frontier-based robot exploration problem and the DRL exploration problem in Section III. After that, the framework of our proposed method is detailed in Section IV. In Section V, we demonstrate the performance of our method through a series of simulation experiments. Finally, we conclude and discuss directions for future work in Section VI.
II Related Work
In the work that introduced the classical frontier method, an occupancy map is utilized in which each cell is placed into one of three classes: open, unknown, and occupied. Frontiers are then defined as the boundaries between open areas and unknown areas. The robot can constantly gain new information about the world by moving to successive frontiers, but the problem of selecting which frontier to visit at a specific stage remains. Therefore, in a frontier-based setting, solving the exploration problem is equivalent to finding an efficient exploration strategy that determines the optimal frontier for the robot to explore. A greedy exploration strategy selects the nearest unvisited, accessible frontier. Experimental results show that the greedy strategy is short-sighted and can waste a lot of time, especially when it misses a nearby frontier that would disappear at once if selected (this case is illustrated in the experiment section).
Many DRL techniques have been applied to the robot exploration problem in previous works. In a typical DRL framework, the agent interacts with the environment by taking actions and receiving rewards. Through this trial-and-error process, the agent eventually learns an optimal policy. In one work, a CNN is trained under the DQN framework with RGB-D sensor images as input so that the robot learns obstacle avoidance during exploration; although avoiding obstacles is important, that work does not apply DRL to learn the exploration strategy itself. Another work combines frontier-based exploration with DRL to learn an exploration strategy directly: the state includes the global occupancy map, robot locations, and frontiers, while the action outputs the weights of a cost function, comprising distance and information-gain terms, that evaluates the goodness of each frontier. By adjusting the weights, the relative importance of each term can be changed. However, the terms of the cost function rely on human knowledge and may not be applicable in other situations. In yet another work, the state space is similar, while the action is to select a point from the global map. However, the map size can vary dramatically from one environment to the next, and generality is lost when the maximum map size must be set before exploring.
In two prior works, a local map centered at the robot's current location is extracted from the global map to represent the current state. Using the local map, one trains the robot to select among the actions "turn left, turn right, move forward", while the other learns to select points in the free space of the local map. Using the local map as the state space eliminates the limitation on global map size, but the local map fails to contain all the state information. In the former, the robot tends to get trapped in an explored room when there is no frontier in the local map, because the robot has no idea where the remaining frontiers are. The training process of the latter needs to drive the robot to the nearest frontier whenever the local map contains no frontier information, even though an RNN is integrated into the framework. This human intervention adopts a greedy strategy and cannot guarantee an optimal or near-optimal solution. When utilizing local observations, the robot exploration problem requires the DRL approach to have a long-term memory. Neural Map is proposed to tackle the simple memory architecture problem in DRL, and Neural SLAM embeds traditional SLAM into an attention-based external memory architecture. However, both memory architectures rely on a fixed global map size and are difficult to apply to unknown environments that may be quite large compared with the maps in the training set. Unlike the aforementioned methods, our method uses the 4D point-clouds-like information to represent the global state, which suffers from neither the map-size limitation nor the simple-memory problem. To the best of our knowledge, our method is the first to apply point clouds to the robot exploration problem. Therefore, we also design a corresponding DRL training framework that learns the exploration strategy directly from point clouds.
III Problem Formulation
Our work aims to develop and train a neural network that takes 4D point-clouds-like information as input and generates an efficient policy to guide the exploration of a robot equipped with a laser scanner. The network takes into account the information about explored areas, occupied areas, and unexplored frontiers. In this paper, the robot exploration problem is to make a robot equipped with a limited-range laser scanner explore unknown environments autonomously.
III-A Frontier-based Robot Exploration Problem
In the robot exploration problem, a 2D occupancy map is most frequently adopted to store the explored environment information. Define the explored 2D occupancy map at step t as m_t. Each grid cell in m_t is in one of three states: free, occupied, or unknown. Following the classical definition, frontiers are the boundaries between the free space and the unknown space. Many existing DRL exploration frameworks learn a mapping from m_t and the robot pose to robot movement commands that can avoid obstacles and navigate to specific locations. Although this end-to-end framework has a simple structure, it is difficult to train. Instead, our method learns a policy network that directly determines which frontier to explore, similar to two prior works. At step t, a target frontier is selected from the frontier set based on an exploration strategy and the current explored map m_t. Once the change of the explored map exceeds a threshold, the robot is stopped and the explored map at step t+1 is obtained. By moving to selected frontiers repeatedly, the robot explores more unknown area until no accessible frontier remains.
Because m_t can be represented as an image, it is commonly used directly as the state. As explained in Section II, we propose a novel state representation with 4D point-clouds-like information instead. The 4D point-clouds-like information at step t is defined as a 4-dimensional point set with n_t points. Each point has 4D coordinates (x, y, d, f), where (x, y) denotes the location of the point, d denotes the collision-free distance from the point to the robot location, and f denotes whether the point belongs to a frontier or not.
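As a concrete illustration, the 4D point-clouds-like state can be assembled into an (N, 4) array; the function below is a minimal sketch under our own naming, not code from the paper:

```python
import numpy as np

def build_point_cloud(xy, dist, is_frontier):
    """Stack 2D locations, collision-free distance to the robot, and a
    frontier flag into an (N, 4) array -- the '4D point-clouds-like' state.

    xy: (N, 2) point locations; dist: (N,) collision-free distances;
    is_frontier: (N,) 1.0 for frontier points, 0.0 for obstacle points.
    (All names are illustrative.)"""
    xy = np.asarray(xy, dtype=np.float64)
    return np.column_stack([xy, dist, is_frontier])
```

The frontier subset (rows with f = 1) then doubles as the variable-size action space described below.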
III-B DRL Exploration Formulation
The robot exploration problem can be formulated as a Markov Decision Process (MDP), modeled as a tuple of states, actions, transition probabilities, rewards, and a discount factor. The state at step t is the 4D point cloud, which can be divided into a frontier set and an obstacle set. The action space at step t is the frontier set itself, and an action selects a point from this set, which becomes the goal of the navigation module. When the robot takes an action, the state transitions to the next state according to the stochastic state transition probability, and the robot receives an immediate reward. The discount factor adjusts the relative importance of immediate and future rewards. The objective of DRL algorithms is to learn a policy that selects actions to maximize the expected return, defined as the accumulated discounted reward over time.
Because the action space varies with the size of the frontier set, it is difficult to design a neural network that maps the state directly to an action. Value-based RL is more suitable for this formulation. In value-based RL, a vector of action values, i.e., the expected returns after taking each action in the current state under policy π, can be estimated by a deep Q-network (DQN) parameterized by a multi-layered neural network. The optimal policy takes the action with the highest action value. DQN is a variant of Q-learning that utilizes two key ideas: experience replay and a target network. DQN tends to overestimate action values, which can be tackled by double DQN: double DQN selects the action in the same way as DQN, while estimating that action's value with the target network.
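A minimal sketch of the double DQN target computation described above, adapted to a variable-size frontier action set (function and argument names are ours, not the paper's):

```python
import numpy as np

def double_dqn_target(reward, next_q_online, next_q_target,
                      gamma=0.99, done=False):
    """Double DQN target: the online network selects the best next
    action; the target network evaluates it.

    next_q_online / next_q_target: 1D arrays of action values over the
    next state's (variable-size) frontier set. If the episode ended or
    no frontier remains, the target is just the immediate reward."""
    if done or len(next_q_online) == 0:
        return reward
    a_star = int(np.argmax(next_q_online))         # selection: online net
    return reward + gamma * next_q_target[a_star]  # evaluation: target net
```

Because the action set is the frontier set, the two value vectors naturally change length from one state to the next; only the argmax index must be shared between them.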
In this section, we present the framework of our method and illustrate its key components in detail.
IV-A Framework Overview
The typical DRL framework is adopted, in which the robot interacts with the environment step by step to learn an exploration strategy. The environment in this framework is based on the HouseExpo simulation platform. The state and action space of the original environment are the local observation and the robot movement commands, respectively. When incorporated into our framework, the environment receives a sequence of robot movement commands and outputs the global map once the change of the explored map exceeds the threshold, as detailed in Section III-A. As shown in Fig. 2, the 4D point-clouds-like information is obtained by a modified Dijkstra algorithm. After the policy network outputs the goal point, the A* algorithm is used to find a path connecting the robot location to the goal point, and a simple path-tracking algorithm generates the sequence of robot movement commands.
IV-B Frontier Detection and Distance Computation
Computing the distances from the robot location to the points in the frontier set without collision would be time-consuming if each distance were obtained by running a separate search. Instead, we modify the Dijkstra algorithm to detect frontiers and compute distances at the same time by sharing the search information. Denote the open list and close list as O and C, respectively: the open list contains points that still need to be searched, while the close list contains points that have already been searched. Define the contour list as the set containing the locations and costs of points that belong to a frontier or an obstacle. Only points in the free space of the map are walkable. The goal of this modified algorithm is to extract the contour of the free space and obtain the distance information simultaneously. As shown in Algorithm 1, the start point, determined by the robot location and given a cost of zero, is added to the open list. While the open list is not empty, the algorithm repeatedly examines the 8 points adjacent to the current point. The differences from the Dijkstra algorithm are: (1) if an adjacent point belongs to occupied or unknown space, it is added to the contour list, as shown in line 10 of Algorithm 1; (2) instead of stopping when a goal is found, the algorithm terminates only when the open list contains zero points. After the algorithm ends, the contour list contains the points that are frontiers or boundaries between free space and obstacle space; these points can then be classified into the frontier or obstacle set by their neighboring information, as shown in Fig. 3.
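The modified Dijkstra pass can be sketched as follows. This is an illustrative reimplementation of the idea behind Algorithm 1; the cell labels, names, and cost bookkeeping for contour points are our assumptions:

```python
import heapq
import math

FREE, OCC, UNK = 0, 1, 2  # illustrative cell labels

def contour_dijkstra(grid, start):
    """One Dijkstra pass over free space that simultaneously
    (a) computes collision-free distances to all reachable free cells and
    (b) collects the contour of free space, split into frontier points
    (free/unknown boundary) and obstacle points (free/occupied boundary)."""
    h, w = len(grid), len(grid[0])
    dist = {start: 0.0}
    frontiers, obstacles = {}, {}
    open_list = [(0.0, start)]   # priority queue of (cost, cell)
    closed = set()
    while open_list:
        d, p = heapq.heappop(open_list)
        if p in closed:
            continue
        closed.add(p)
        for dr in (-1, 0, 1):            # 8-connected neighbourhood
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue
                q = (p[0] + dr, p[1] + dc)
                if not (0 <= q[0] < h and 0 <= q[1] < w):
                    continue
                step = math.hypot(dr, dc)
                if grid[q[0]][q[1]] == FREE:     # walkable: normal relax
                    nd = d + step
                    if nd < dist.get(q, math.inf):
                        dist[q] = nd
                        heapq.heappush(open_list, (nd, q))
                else:                            # non-walkable: contour point
                    bucket = frontiers if grid[q[0]][q[1]] == UNK else obstacles
                    bucket[q] = min(bucket.get(q, math.inf), d + step)
    return dist, frontiers, obstacles
```

A single pass therefore yields the distance channel d and the frontier flag f of the 4D representation at once, instead of one path search per frontier point.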
IV-C Network with Point Clouds as Input
In this section, the architecture of the state-action value network with 4D point-clouds-like information as input is detailed. The architecture is modified from the DGCNN segmentation network, which proposes the edge convolution (EdgeConv) operation to extract local information from point clouds. The EdgeConv operation includes two steps for each point in the input set: (1) construct a local graph including the center point and its k-nearest neighbors; (2) apply convolution-like operations on the edges connecting each neighbor to the center point. An EdgeConv layer takes a set of n points with one feature dimension as input and outputs a feature set of another dimension for the same n points. Different from DGCNN and other typical point-cloud networks such as PointNet, which keep the same input and output point number, our network takes the frontier and obstacle sets as input but only outputs values for points from the frontier set. The reason for this special treatment is to decrease the size of the action space so that the network converges faster.
The network takes n_t points as input at time step t, comprising frontier points and obstacle points. The output is a vector of estimated values, one per action. The network architecture contains multiple EdgeConv layers and multi-layer perceptron (MLP) layers. At each EdgeConv layer, an edge feature set is extracted from the input set, and all features in this set are aggregated to compute the output EdgeConv feature of each point. At an MLP layer, the data of each point is processed independently, as in PointNet. After 4 EdgeConv layers, the outputs of all EdgeConv layers are aggregated and processed by an MLP layer to generate a 1D global descriptor, which encodes the global information of the input points. This global descriptor is then concatenated with the outputs of all EdgeConv layers. After that, points belonging to the obstacle set are ignored, and the points from the frontier set are processed by three MLP layers to generate a score for each frontier point.
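To make the EdgeConv step concrete, here is a minimal NumPy sketch of a single layer, with one fixed linear map plus ReLU standing in for the shared edge MLP (all names, and the use of a single linear map, are illustrative simplifications of the learned network):

```python
import numpy as np

def edge_conv(points, feats, k, weight):
    """Minimal sketch of one EdgeConv layer (after DGCNN).
    For each point: build a k-NN graph over `points`, form the edge
    feature [f_i, f_j - f_i] for each neighbour j, apply a shared linear
    map `weight` with ReLU, and max-aggregate over neighbours.

    points: (N, 2) locations used for the k-NN graph
    feats:  (N, Fin) per-point features
    weight: (2*Fin, Fout) shared linear map (stands in for the MLP)"""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                 # exclude self from k-NN
    knn = np.argsort(d2, axis=1)[:, :k]          # (N, k) neighbour indices
    center = np.repeat(feats[:, None, :], k, axis=1)   # f_i, tiled
    neigh = feats[knn]                                  # f_j
    edge = np.concatenate([center, neigh - center], axis=-1)  # (N, k, 2*Fin)
    out = np.maximum(edge @ weight, 0.0)         # shared linear map + ReLU
    return out.max(axis=1)                       # max over neighbours -> (N, Fout)
```

Because the layer is built per point from its k-nearest neighbors, it accepts an input set of any size, which is what allows the whole network to handle a variable number of frontier and obstacle points.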
IV-D Learning Framework Based on DQN
As described in Section III-B, the DQN is a neural network that, for a given state, outputs a vector of action values. The network architecture is detailed in Section IV-C, where the state in point-cloud format contains the frontier and obstacle sets. Under a given policy π, the true value of an action a in state s is the expected discounted return after taking a in s and following π thereafter. The target is to make the estimate from the frontier network converge to this true value. The parameters θ of the action-value function can be updated by gradient descent after taking action a_t in state s_t and observing the next state s_{t+1} and immediate reward r_{t+1}:

θ_{t+1} = θ_t + α (Y_t − Q(s_t, a_t; θ_t)) ∇_{θ_t} Q(s_t, a_t; θ_t),   (1)

where α is the learning rate and the target Y_t is defined as

Y_t = r_{t+1} + γ max_a Q(s_{t+1}, a; θ⁻),   (2)

where θ⁻ denotes the parameters of the target network, which are updated periodically by θ⁻ ← θ.
To make the estimate from our network converge to the true value, a typical DQN framework can be adopted. At each step t, the tuple (s_t, a_t, r_{t+1}, s_{t+1}) is saved in a replay buffer. The parameters can then be updated by equations 1 and 2 given tuples sampled from the replay buffer.
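A minimal replay buffer matching the per-step, batch-size-1 update described later in Section V-A might look like this (a sketch under our own naming, not the paper's code):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience-replay buffer for (s, a, r, s') tuples.
    Oldest transitions are evicted once capacity is reached."""
    def __init__(self, capacity=50000):
        self.buf = deque(maxlen=capacity)

    def push(self, transition):
        self.buf.append(transition)

    def sample(self):
        # Batch size 1, matching the per-step update scheme used for
        # variable-size point-cloud states.
        return random.choice(self.buf)

    def __len__(self):
        return len(self.buf)
```

Sampling single transitions sidesteps the need to pad variable-size point clouds into equal-size batches, at the cost of noisier gradient estimates.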
The reward signal in the DRL framework tells the robot whether an action taken in a certain state is appropriate. To make the robot explore unknown environments successfully, we define the reward function as the sum of three terms (equation 3): r_t = r_area,t + r_step,t + r_frontier,t. The term r_area,t equals the newly discovered area at time t, which encourages the robot to explore unknown areas. The term r_step,t provides a constant penalty each time a movement command is taken; this encourages the robot to explore the environment with a relatively short overall path. The term r_frontier,t is designed to guide the robot to reduce the number of frontier groups, as defined in equation 4, where N_t denotes the number of frontier groups at time t.
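The three-term reward can be sketched as below. The weights and the exact form of the frontier term are our illustrative assumptions, not the paper's values; here the frontier term rewards a decrease in the number of frontier groups:

```python
def exploration_reward(new_area, groups_prev, groups_now,
                       w_area=1.0, w_step=0.1, w_frontier=0.5):
    """Sketch of a three-term exploration reward: newly explored area,
    a constant per-step movement penalty, and a bonus for reducing the
    number of frontier groups. All weights are illustrative."""
    r_area = w_area * new_area                       # encourage discovery
    r_step = -w_step                                 # penalize each step
    r_frontier = w_frontier * (groups_prev - groups_now)  # close off frontiers
    return r_area + r_step + r_frontier
```

With these placeholder weights, discovering 2.0 units of area while closing one frontier group in a single step yields 2.0 − 0.1 + 0.5.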
V Training Details and Simulation Experiments
In this section, we first detail the training process of our DRL agent. Then we validate the feasibility of our proposed method on the robot exploration problem through two sets of experiments: (1) a comparison with five other exploration methods, and (2) a scalability test in which maps larger than those in the training set are explored.
V-A Training Details
To learn a general exploration strategy, the robot is trained in the HouseExpo environment, where 100 maps with different shapes and features are to be explored. The robot is equipped with a 2 m range laser scanner with a 180-degree field of view. The noise of the laser range measurement is simulated by a Gaussian distribution with a mean of 0 and a standard deviation of 0.02 m. When an episode starts, a map is randomly selected from the training set, and the robot's start pose, including location and orientation, is also set randomly. In the beginning, the circular area centered at the start location is explored by rotating the robot 360 degrees. Then, at each step, a goal frontier point is selected from the frontier set under the policy of our proposed method. The A* algorithm is applied to find a feasible path connecting the current robot location to the goal frontier, and a simple path-tracking algorithm generates the robot commands to follow the planned path: the robot moves toward the nearest unvisited path point, and the path is replanned if the distance between the robot and that point exceeds a fixed threshold. An episode ends when the explored ratio of the whole map exceeds 0.95.
|Laser range|2 m|Discount factor|0.99|
|Laser field of view|180 deg|Target network update frequency|4000 steps|
|Robot radius| |Learning rate|0.001|
|Linear step length| |Replay buffer size|50000|
|Angular step length| |Exploration rate|15000|
Some training map samples are shown in Fig. 5. The largest map in the training set is 256 by 256. Because the size of the state information changes at every step and batch updates require point clouds of the same size, it is currently not practical to train the model with batch updates. In typical point-cloud classification training, all point clouds are pre-processed to the same size; in our method, however, such operations would distort the original data's spatial information. Instead, at each step the network parameters are updated 32 times with a batch size of 1. The learning parameters are listed in Table 1. The training process is performed on a computer with an Intel i7-7820X CPU and a GeForce GTX 1080Ti GPU. Training starts after 2000 steps and ends after 90000 update steps, which takes 72 hours.
V-B Comparison Study
Besides our proposed method, we also test the performance of a weight-tuning method, a cost-based method, a greedy method, a method utilizing a random policy, and a learning-based baseline, which we denote as the weight method, cost method, greedy method, random method, and baseline, respectively. To compare the performance of the different methods, we use 4 maps as a testing set, as shown in Fig. 6. For each test map, we conduct 100 trials, setting the robot's initial location randomly for each trial.
The random method selects a frontier point from the frontier set at random. The greedy method chooses the nearest frontier point. The baseline utilizes a CNN that directly determines the robot movement commands from the current local observation.
The cost method evaluates a scalar value for each frontier with a cost function that combines distance and information gain (equation 5), where a weight adjusts the relative importance of the two terms and the distance and information gain of each frontier are normalized. At each step, after obtaining the frontier set as detailed in Section IV-B, the k-means method is adopted to cluster the points in the frontier set into frontier centers. To reduce runtime, the information gain is computed only for each frontier center; it is measured as the area of unknown space that would be explored if that frontier center were selected. The weight in the cost method is fixed to 0.5, which fails to be optimal in environments with different structures and features.
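A hedged sketch of such a cost function: we assume the cost is the weighted normalized distance minus the weighted normalized information gain, with the lowest-cost frontier selected (the exact form of equation 5 in the paper may differ):

```python
import numpy as np

def frontier_cost(dist, info_gain, alpha=0.5):
    """Score frontier centers by a weighted combination of normalized
    distance (to be minimized) and normalized information gain (to be
    maximized). alpha plays the role of the fixed weight (0.5 in the
    experiments); the sign convention here is our assumption."""
    d = np.asarray(dist, dtype=float)
    g = np.asarray(info_gain, dtype=float)
    d_n = d / d.max() if d.max() > 0 else d
    g_n = g / g.max() if g.max() > 0 else g
    return alpha * d_n - (1.0 - alpha) * g_n

# Example: a farther frontier can still win if its information gain
# is large enough.
best = int(np.argmin(frontier_cost([2.0, 4.0], [1.0, 8.0])))
```

With alpha fixed at 0.5, the trade-off between travel distance and expected new area is frozen, which is exactly the rigidity the weight method tries to remove by learning alpha.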
The weight method learns to adjust the value of the weight in equation 5 under the same training framework as our proposed method. The structure of the neural network in the weight method is presented in Fig. 4; it takes the 4D point-clouds-like information as input and outputs a scalar weight value.
We select the length of the overall path, recorded once 95% of the area has been explored, as the metric to evaluate the relative performance of the six exploration methods; the overall path length indicates the efficiency of an exploration method. The box plot in Fig. 7 visualizes the path-length data of each exploration method on the four test maps. The baseline is not considered here because it sometimes fails to explore the whole environment, as explained later. Three statistics of this metric are used to analyze the results: (1) the average, (2) the minimum, and (3) the variance. Our proposed method has the minimum average overall path length on all four test maps, and its minimum path length on each map is also smaller than that of the other five methods. This indicates that the exploration strategy of our proposed method is more effective and efficient than the other five. The random method has the largest variance and average on each map because it is more likely to sway between two or more frontier groups, which is why its overall path in Fig. 8 is the most disorganized. The weight method has a lower average and minimum than the cost method on all test maps, owing to the advantage of learning from rich data. However, the weight method only adjusts a single weight that balances distance and information gain; if the computation of the information gain is inaccurate, or another relevant cost exists, the weight method cannot fully demonstrate the advantages of learning. In contrast, our proposed method can learn useful information, including information gain, by learning to estimate the value of frontiers, which is why it outperforms the weight method.
Fig. 8 shows an example of the overall paths of the six methods when exploring map 1 from the same starting pose. Each episode ended once the explored ratio exceeded 0.95. The proposed method explored the environment with the shortest path. The explored ratio of the total map as a function of the current path length is shown in Fig. 9. The greedy method's curve has a horizontal segment when the explored ratio is near 0.95. This is because the greedy method missed a nearby frontier that would have disappeared at once if selected; it chose the nearest frontier instead, which forced the robot to travel back to the missed frontier later and resulted in a longer path. For example, in Fig. 8, the greedy method chose to explore point C instead of point A when the robot was at point B; this decision made the robot travel to point A later, lengthening the overall path. The baseline's curve also exhibits a horizontal segment in Fig. 9, and a rather long one. The baseline's local observation contained no frontier information once the surroundings were fully explored (in the left part of the environment shown in Fig. 8). In this situation, the baseline could only take "random" actions (i.e., travelling along the wall) to find the remaining frontiers, which wastes a lot of travel distance and may fail to explore the whole environment.
V-C Scalability Study
In this section, a map of size 531 by 201 is used to test the performance of our proposed method in environments larger than the maps in the training set. If the network's input were fixed-size images, the map would need to be padded to a 531 by 531 image, an inefficient state representation, and then downscaled to the input size required by the neural network, e.g., 256 by 256. Although downscaling lets the network process the state, the quality of the input data decreases, and the network fails to work once the scaled input contains much less of the necessary information than the original image. Fig. 10 presents the overall path generated by our method without any downscaling. Our proposed method, which takes point clouds as input, is more robust on large maps for two reasons. First, we incorporate the distance information into the point clouds, which helps the network learn which parts of the point cloud are walkable. Although an image representation could also carry a fourth channel of distance information, the scaling operation can make important obstacle or free points disappear, changing the structure of the map. Second, the number of pixels in an image grows quadratically with its side length, whereas the number of points in the point cloud equals only the number of pixels that represent an obstacle or frontier, which is far smaller unless nearly all pixels in the map are obstacles or frontiers.
VI Conclusions and Future Work
In this paper, we present a novel state representation method using 4D point-clouds-like information and design a framework to learn an efficient exploration strategy. Our proposed method solves the problems that come with using images as observations. The experiments demonstrate the effectiveness of our proposed method compared with five other commonly used methods. For future work, other network structures and RL algorithms can be adapted and applied to the robot exploration problem with point clouds as input. The convergence speed of training may also be improved by optimizing the training techniques. Besides, the multi-robot exploration problem may also use point clouds to represent global information.
- (2011) Exploration strategies based on multi-criteria decision making for searching environments in rescue operations. Autonomous Robots 31 (4), pp. 401–417.
- (2002) Information based adaptive robotic exploration. In IEEE/RSJ International Conference on Intelligent Robots and Systems, Vol. 1, pp. 540–545.
- (2019) Self-learning exploration and mapping for mobile robots via deep reinforcement learning. In AIAA Scitech 2019 Forum, pp. 0396.
- (2002) Navigation strategies for exploring indoor environments. The International Journal of Robotics Research 21 (10-11), pp. 829–848.
- Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pp. 2094–2100.
- (2020) RLBench: the robot learning benchmark & learning environment. IEEE Robotics and Automation Letters 5 (2), pp. 3019–3026.
- (2020) Robust RL-based map-less local planning: using 2D point clouds as observations. IEEE Robotics and Automation Letters 5 (4), pp. 5787–5794.
- (2019) Deep reinforcement learning-based automatic exploration for navigation in unknown environment. IEEE Transactions on Neural Networks and Learning Systems 31 (6), pp. 2064–2076.
- (2019) HouseExpo: a large-scale 2D indoor layout dataset for learning-based algorithms on mobile robots. arXiv preprint arXiv:1903.09845.
- (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
- (2019) Deep reinforcement learning robot for search and rescue applications: exploration in unknown cluttered environments. IEEE Robotics and Automation Letters 4 (2), pp. 610–617.
- (2016) Speeding-up robot exploration by exploiting background information. IEEE Robotics and Automation Letters 1 (2), pp. 716–723.
- (2018) Neural map: structured memory for deep reinforcement learning. In International Conference on Learning Representations.
- PointNet: deep learning on point sets for 3D classification and segmentation. pp. 652–660.
- A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics 4 (2), pp. 100–107.
- (2020) Learning obstacle representations for neural motion planning. In Proceedings of the Conference on Robot Learning (CoRL).
- (2018) Reinforcement learning: an introduction. MIT Press.
- (2016) A robot exploration strategy based on Q-learning network. In 2016 IEEE International Conference on Real-time Computing and Robotics (RCAR), pp. 57–62.
- (2002) Probabilistic robotics. Communications of the ACM 45 (3), pp. 52–57.
- (2019) Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG) 38 (5), pp. 1–12.
- (1997) A frontier-based approach for autonomous exploration. In Proceedings 1997 IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA'97), pp. 146–151.
- (2017) Neural SLAM: learning to explore with external memory. arXiv preprint arXiv:1706.09520.
- (2018) Deep reinforcement learning supervised autonomous exploration in office environments. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7548–7555.