Learning Robot Exploration Strategy with 4D Point-Clouds-like Information as Observations

06/17/2021 ∙ by Zhaoting Li, et al. ∙ Tencent

Being able to explore unknown environments is a requirement for fully autonomous robots. Many learning-based methods have been proposed to learn an exploration strategy. In frontier-based exploration, learning algorithms tend to learn the optimal or near-optimal frontier to explore. Most of these methods represent the environment as a fixed-size image and feed it to a neural network. However, the size of the environment is usually unknown in advance, which prevents these methods from generalizing to real-world scenarios. To address this issue, we present a novel state representation method based on 4D point-clouds-like information, including location, frontier, and distance information. We also design a neural network that can process this 4D point-clouds-like information and generate an estimated value for each frontier. This network is then trained using a typical reinforcement learning framework. We test the performance of our proposed method by comparing it with five other methods, and test its scalability on a map that is much larger than the maps in the training set. The experimental results demonstrate that our proposed method needs shorter average traveling distances to explore whole environments and can be adopted in maps of arbitrary size.




I Introduction

The robot exploration problem is defined as making a robot or multiple robots explore unknown cluttered environments (e.g., office environments, forests, ruins) autonomously with specific goals. The goal can be classified as: (1) maximizing the knowledge of the unknown environment, i.e., acquiring a map of the world [19], [12]; (2) searching for a static or moving object without prior information about the world. Solving the exploration problem with the second goal can exploit prior information about the target object, such as semantic information, but it also needs to fulfill the first goal.

The frontier-based methods [21], [4] have been widely used to solve the robot exploration problem. [21] adopts a greedy strategy, which may lead to an inefficient overall path. Many approaches have been proposed to consider more performance metrics (i.e. information gain, etc.) [4], [2], [1]. However, these approaches are designed and evaluated in a limited number of environments. Therefore, they may fail to generalize to other environments whose layouts are different.

Fig. 1: An example of the 4D point-clouds-like information. (a) shows a map example in HouseExpo [9], where the black, gray, and white areas denote unknown space, free space, and obstacle space, respectively. (b) shows the global map in our algorithm, where the map is obtained by homogeneous transformations. The map's center is the robot start location, and the map's x-coordinate is aligned with the robot start orientation. (c) shows the 4D point-clouds-like information generated from the global map. The x, y, and z coordinates denote the x location, y location, and distance information, respectively. The color denotes the frontier information, where red denotes obstacles and blue denotes frontiers.

Compared with these classical methods, machine learning-based methods exhibit the advantage of learning from various data. Deep Reinforcement Learning (DRL), which uses a neural network to approximate the agent's policy during interactions with the environment [17], has gained more and more attention in applications such as games [10] and robotics [6]. When applied to robot exploration problems, most research works [23], [3] design the state space in the form of an image and use Convolutional Neural Networks (CNN). For example, in [23], a CNN is utilized as a mapping from the robot's local observation to the next optimal action.

However, the size of a CNN's input images is fixed, which results in the following limitations: (1) If the input images represent global information, the size of the overall explored map needs to be pre-defined to prevent the input images from failing to fully contain all the map information. (2) If the input images represent local information, which fails to convey the full state, recurrent neural networks (RNN) or memory networks need to be adopted. Unfortunately, the robot exploration problem in this formulation requires relatively long-term planning, which is still a tough problem that has not been perfectly solved.

In this paper, to deal with the aforementioned problems, we present a novel state representation method that relies on 4D point-clouds-like information of variable size. This information has the same data structure as point clouds and consists of 2D point locations together with the corresponding 1D frontier and 1D distance information, as shown in Fig. 1. We also design the corresponding training framework, which is based on the deep Q-learning method with a variable action space. By replacing the image observation with 4D point-clouds-like information, our proposed exploration model can deal with unknown maps of arbitrary size. Based on dynamic graph CNN (DGCNN) [20], one of the typical neural network structures for processing point clouds, our proposed neural network takes the 4D point-clouds-like information as input and outputs the estimated value of each frontier, which can guide the robot to the frontier with the highest value. This neural network is trained in a way similar to DQN in the HouseExpo environment [9], a fast exploration simulation platform that includes data of many 2D indoor layouts. The experiments show that our exploration model achieves relatively good performance compared with the baseline in [9], the state of the art in [11], the classical methods in [21], [4], and a random method.

I-A Original Contributions

The contributions of this paper are threefold. First, we propose a novel state representation method using 4D point-clouds-like information to address the aforementioned problems. Although point clouds have been utilized in motion planning and navigation ([7], [16]), our work differs from these two papers in two main respects: (1) we use point clouds to represent global information, while they represent the local observation; (2) our action space is the selection of a frontier point from a frontier set of variable size, while their action spaces contain control commands. Second, we design the corresponding state-action value network based on DGCNN [20] and the training framework based on DQN [10]. The novelty is that our action space's size is variable, which makes our neural network converge faster. Third, we demonstrate the performance of the proposed method on a wide variety of environments that the model has not seen before, including maps that are much larger than the maps in the training set.

The remainder of this paper is organized as follows. We first introduce related work in Section II. Then we formulate the frontier-based robot exploration problem and the DRL exploration problem in Section III. After that, the framework of our proposed method is detailed in Section IV. In Section V, we demonstrate the performance of our proposed method through a series of simulation experiments. At last, we conclude the work of this paper and discuss directions for future work in Section VI.

II Related Work

In [21], the classical frontier method is defined, where an occupancy map is utilized in which each cell is placed into one of three classes: open, unknown, and occupied. Frontiers are then defined as the boundaries between open areas and unknown areas. The robot can constantly gain new information about the world by moving to successive frontiers, while the problem of which frontier to select at a specific stage remains to be solved. Therefore, in a frontier-based setting, solving the exploration problem is equivalent to finding an efficient exploration strategy that can determine the optimal frontier for the robot to explore. A greedy exploration strategy is utilized in [21] to select the nearest unvisited, accessible frontier. The experimental results in that paper show that the greedy strategy is short-sighted and can waste lots of time, especially when it misses a nearby frontier that would disappear at once if selected (this case is illustrated in the experiment part).

Many DRL techniques have been applied to the robot exploration problem in several previous works. In a typical DRL framework [17], the agent interacts with the environment by taking actions and receiving rewards. Through this trial-and-error manner, the agent can eventually learn an optimal policy. In [18], a CNN is trained under the DQN framework with RGB-D sensor images as input to make the robot learn obstacle avoidance during exploration. Although avoiding obstacles is important, that paper does not apply DRL to learn the exploration strategy itself. The work in [11] combines frontier-based exploration with DRL to learn an exploration strategy directly. The state information includes the global occupancy map, robot locations, and frontiers, while the action is to output the weight of a cost function that evaluates the goodness of each frontier. The cost function includes distance and information gain terms; by adjusting the weight, the relative importance of each term can be changed. However, the terms of the cost function rely on human knowledge and may not be applicable in other situations. In [8], the state space is similar to the one in [11], while the action is to select a point from the global map. However, the map size can vary dramatically from one environment to the next, and generality is lost when the maximum map size is set before exploring the environments.

In [23] and [3], a local map, centered at the robot's current location, is extracted from the global map to represent the current state information. Using a local map, [23] trains the robot to select among the actions "turn left, turn right, move forward", while [3] learns to select points in the free space of the local map. Using the local map as the state space eliminates the limitation on the global map size, but the current local map fails to contain all the information. In [23], the robot tends to get trapped in an explored room when there is no frontier in the local map, because the robot has no idea where the frontiers are. The training process in [3] needs to drive the robot to the nearest frontier when the local map contains no frontier information, even though an RNN is integrated into their framework. This human intervention adopts a greedy strategy and cannot guarantee an optimal or near-optimal solution. When utilizing local observations, the robot exploration problem requires the DRL approach to have a long-term memory. The neural map in [13] is proposed to tackle the simple memory architecture problem in DRL. Besides, Neural SLAM in [22] embeds traditional SLAM into an attention-based external memory architecture. However, the memory architectures in [13] and [22] are based on a fixed-size global map, which makes them difficult to apply to unknown environments whose size may be quite large compared with the maps in the training set. Unlike the aforementioned methods, our method uses the 4D point-clouds-like information to represent the global state information, which suffers from neither the map size limitation nor the simple memory problem. As far as we know, our method is the first to apply point clouds to the robot exploration problem. Therefore, we also design a corresponding DRL training framework that maps the exploration strategy directly from point clouds.

III Problem Formulation

Our work aims to develop and train a neural network that can take 4D point-clouds-like information as input and generate an efficient policy to guide the exploration process of a robot equipped with a laser scanner. The network should take into account the information about explored areas, occupied areas, and unexplored frontiers. In this paper, the robot exploration problem is to make a robot equipped with a limited-range laser scanner explore unknown environments autonomously.

III-A Frontier-based Robot Exploration Problem

In the robot exploration problem, a 2D occupancy map is most frequently adopted to store the explored environment information. Define the explored 2D occupancy map at step $t$ as $M_t$. Each grid cell in $M_t$ can be classified into one of the following three states: free $M_t^{free}$, occupied $M_t^{occ}$, and unknown $M_t^{unk}$. According to [21], frontiers $F_t$ are defined as the boundaries between the free space $M_t^{free}$ and the unknown space $M_t^{unk}$. Many existing DRL exploration frameworks learn a mapping from $M_t$ and the robot location to robot movement commands that can avoid obstacles and navigate to specific locations. Although this end-to-end framework has a simple structure, it is difficult to train. Instead, our method learns a policy network that can directly determine which frontier to explore, similar to [11] and [8]. At step $t$, a target frontier point is selected from $F_t$ based on an exploration strategy and the current explored map $M_t$. Once the change of the explored map is larger than a threshold, the robot is stopped, and the explored map $M_{t+1}$ at step $t+1$ is obtained. By moving to selected frontiers constantly, the robot explores more unknown areas until no accessible frontier exists.

Because $M_t$ can be represented as an image, it is commonly used to directly represent the state information. As explained in Section II, a novel state representation method with 4D point-clouds-like information is proposed instead. The 4D point-clouds-like information at step $t$ is defined as a 4-dimensional point set with $N_t$ points, denoted by $P_t = \{p_1, \dots, p_{N_t}\}$. Each point $p_i$ contains 4D coordinates $(x_i, y_i, d_i, f_i)$, where $(x_i, y_i)$ denotes the location of the point, $d_i$ denotes the collision-free distance from the point to the robot location, and $f_i$ denotes whether point $p_i$ belongs to a frontier or not.

Fig. 2: The framework of the proposed method, which consists of five components: (a) a simulator adapted from [9], which receives and executes a robot movement command and outputs the new map in the robot's start coordinate frame; (b) the modified Dijkstra algorithm that extracts the contour of the free space; (c) the state represented by 4D point-clouds-like information; (d) the policy network, which processes the state information and estimates the value of the frontiers; (e) the A* algorithm, which finds a path connecting the current robot location to the goal frontier, and a simple path tracking algorithm that generates the corresponding robot movement commands.

III-B DRL Exploration Formulation

The robot exploration problem can be formulated as a Markov Decision Process (MDP), modeled as a tuple $\langle S, A, P, R, \gamma \rangle$. The state at step $t$ is the 4D point cloud $P_t$, which can be divided into the frontier set $F_t$ and the obstacle set $O_t$. The action space at step $t$ is the frontier set $F_t$, and an action $a_t$ selects a point from $F_t$, which becomes the goal of the navigation module implemented by [15]. When the robot takes an action from the action space, the state transits to state $s_{t+1}$ according to the stochastic state transition probability $P(s_{t+1} \mid s_t, a_t)$. The robot then receives an immediate reward $r_t$. The discount factor $\gamma$ adjusts the relative importance of immediate and future rewards. The objective of DRL algorithms is to learn a policy that selects actions to maximize the expected return, defined as the accumulated discounted reward over time.

Because the action space varies with the size of the frontier set $F_t$, it is difficult to design a neural network that maps the state to the action directly. Value-based RL is more suitable for this formulation. In value-based RL, a vector of action values, which are the expected returns after taking actions in state $s$ under policy $\pi$, can be estimated by a deep Q network (DQN) $Q(s, a; \theta)$, where $\theta$ are the parameters of the multi-layered neural network. The optimal policy is to take the action with the highest action value: $a^* = \arg\max_a Q(s, a; \theta)$. DQN [10] is a novel variant of Q-learning that utilizes two key ideas: experience replay and a target network. DQN tends to overestimate action values, which can be tackled by double DQN [5]. Double DQN selects an action in the same way as DQN, but estimates that action's value with the target network.
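The distinction between the two targets can be sketched in a few lines; this is a schematic illustration of the update targets, not the paper's implementation, and the function names are ours:

```python
import numpy as np

def dqn_target(q_next_target, reward, gamma=0.99):
    """Standard DQN target: the target network both selects and
    evaluates the best frontier in the (variable-size) action set."""
    return reward + gamma * np.max(q_next_target)

def double_dqn_target(q_next_online, q_next_target, reward, gamma=0.99):
    """Double DQN target: the online network selects the frontier,
    while the target network evaluates it, reducing overestimation."""
    a_star = int(np.argmax(q_next_online))  # action DQN would select
    return reward + gamma * q_next_target[a_star]
```

Note that because the action set is the frontier set, the Q-value vectors above can have a different length at every step, which the max/argmax handles naturally.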

IV Algorithm

Fig. 3: The illustration of the 4D point-clouds-like information generation process. (a) presents the map, where the black, gray, white, and blue points denote obstacles, unknown space, free space, and the robot location, respectively. In (b), the contour of the free space, denoted by green points, is generated by the modified Dijkstra algorithm; the number on each point indicates the distance from this point to the robot location. In (c), the points in the contour set are divided into the frontier and obstacle sets, denoted in dark green and light green, respectively. In (d), the 4D point-clouds-like information is extracted from image (c); the point clouds include location, frontier flag, and distance information.

In this section, we present the framework of our method and illustrate its key components in detail.

IV-A Framework Overview

The typical DRL framework is adopted, where the robot interacts with the environment step by step to learn an exploration strategy. The environment in this framework is based on the HouseExpo simulator [9]. The state and action space of the original environment are the local observation and robot movement commands, respectively. When incorporated into our framework, the simulator receives a sequence of robot movement commands and outputs the global map once the change of the explored map is larger than the threshold, as detailed in Section III-A. As shown in Fig. 2, the 4D point-clouds-like information is obtained by a modified Dijkstra algorithm. After the policy network outputs the goal point, the A* algorithm is used to find a path connecting the robot location to the goal point. Then a simple path tracking algorithm is applied to generate the sequence of robot movement commands.

IV-B Frontier Detection and Distance Computation

Computing the collision-free distances from the robot location to the points in the frontier set would be time-consuming if each distance were obtained by running a search algorithm once per point. Instead, we modify the Dijkstra algorithm to detect frontiers and compute distances at the same time by sharing the search information. Denote the "open list" and "close list" as $O$ and $C$, respectively. The open list contains points that need to be searched, while the close list contains points that have already been searched. Define the contour list $L_c$, which contains the locations and costs of points that belong to frontiers or obstacles. Only points in the free space of map $M_t$ are walkable. The goal of this modified algorithm is to extract the contour of the free space and obtain the distance information simultaneously. As shown in Algorithm 1, the start point, which is determined by the robot location, is added to $O$ with a cost of zero. While the open list is not empty, the algorithm repeatedly searches the 8 points $n$ adjacent to the current point $p$. The differences from the standard Dijkstra algorithm are: (1) if $n$ belongs to occupied or unknown space, $n$ is added to the contour list $L_c$, as shown in line 10 of Algorithm 1; (2) instead of stopping when a goal is found, the algorithm terminates only when $O$ contains zero points. After the algorithm ends, the contour list contains the points that are frontiers or boundaries between free space and obstacle space. Points in the contour list can then be classified into the frontier or obstacle set by their neighboring information, as shown in Fig. 3.

1:  add the start point p_s to O with cost 0;
2:  while O is not empty do
3:     p ← the point in O with the lowest cost;
4:     move p from O to C;
5:     for each point n in the 8 points adjacent to p do
6:        c_n ← cost(p) + dist(p, n);
7:        if n ∈ C then
8:           continue;
9:        else if n is not walkable then
10:          add (n, cost(p)) to L_c;
11:       else if n ∉ O then
12:          add n to O;
13:          cost(n) ← c_n;
14:       else if n ∈ O and c_n < cost(n) then
15:          cost(n) ← c_n;
16:       end if
17:    end for
18: end while
Algorithm 1 Modified Dijkstra Algorithm
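The single-pass search above can be sketched in Python. The cell encoding (0 free, 1 occupied, -1 unknown) and the direct frontier/obstacle labelling of contour cells by their own type are assumptions for illustration, not the paper's exact implementation:

```python
import heapq
import math

FREE, OBST, UNK = 0, 1, -1  # assumed cell encoding

def modified_dijkstra(grid, start):
    """One Dijkstra pass over the free space that returns the contour:
    non-walkable cells touching free space, each with the travel cost of
    the free cell it was reached from and a frontier flag (True if the
    cell is unknown, i.e. a free/unknown boundary)."""
    h, w = len(grid), len(grid[0])
    dist = {start: 0.0}
    closed = set()
    contour = {}                      # (row, col) -> (distance, is_frontier)
    open_heap = [(0.0, start)]
    while open_heap:
        d, p = heapq.heappop(open_heap)       # lowest-cost point (line 3)
        if p in closed:
            continue
        closed.add(p)                         # move p to the close list (line 4)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue
                n = (p[0] + dr, p[1] + dc)
                if not (0 <= n[0] < h and 0 <= n[1] < w) or n in closed:
                    continue                  # already searched (lines 7-8)
                if grid[n[0]][n[1]] != FREE:  # not walkable (line 9)
                    is_front = grid[n[0]][n[1]] == UNK
                    prev = contour.get(n)
                    if prev is None or d < prev[0]:
                        contour[n] = (d, is_front)  # add to L_c (line 10)
                    continue
                nd = d + math.hypot(dr, dc)   # step cost (line 6)
                if nd < dist.get(n, float('inf')):  # lines 11-15
                    dist[n] = nd
                    heapq.heappush(open_heap, (nd, n))
    return contour
```

Because every frontier cell is discovered from its nearest free neighbour, the recorded costs are exactly the collision-free distances the state representation needs, with no per-frontier re-planning.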
Fig. 4: The architecture of our proposed neural network. The input point clouds are classified into two categories: the frontier set, denoted by dark blue, and the obstacle set, denoted by red. The edge convolution (EdgeConv) operation, denoted by a light blue block, is used to extract local features around each point in the input set. The feature sets of frontiers and obstacles, which are generated by EdgeConv operations, are denoted by light green and green. The MLP operation extracts one point's feature by considering only that point's information. After several EdgeConv operations and one MLP operation, a max-pooling operation is applied to generate the global information, shown in light yellow.

IV-C Network with Point Clouds as Input

In this section, the architecture of the state-action value network with 4D point-clouds-like information as input is detailed. The architecture is modified from the DGCNN segmentation network [20], which proposes the edge convolution (EdgeConv) operation to extract local information from point clouds. The EdgeConv operation includes two steps for each point in the input set: (1) construct a local graph including the center point and its k-nearest neighbors; (2) apply convolution-like operations on the edges connecting each neighbor to the center point. The EdgeConv operation takes a point set of size $N \times D_{in}$ as input and outputs a feature set of size $N \times D_{out}$, where $D_{in}$ and $D_{out}$ denote the dimensions of the input and output features, and $N$ denotes the number of points in the input set. Different from DGCNN and other typical point cloud networks such as PointNet [14], which have the same input and output point number, our network takes the frontier and obstacle sets as input and only outputs values for the points from the frontier set. The reason for this special treatment is to decrease the action space's size so that the network converges faster.

The network takes as input $N_t$ points at time step $t$, including $N_t^f$ frontier points and $N_t^o$ obstacle points, denoted as $F_t$ and $O_t$ respectively. The output is a vector of estimated values, one for each action in $F_t$. The network architecture contains multiple EdgeConv layers and multi-layer perceptron (MLP) layers. At each EdgeConv layer, the edge feature set is extracted from the input set, and all features in this edge feature set are aggregated to compute the output EdgeConv feature for each corresponding point. At an MLP layer, the data of each point is operated on independently to obtain per-point information, the same as the MLP in PointNet [14]. After 4 EdgeConv layers, the outputs of all EdgeConv layers are aggregated and processed by an MLP layer to generate a 1D global descriptor, which encodes the global information of the input points. This 1D global descriptor is then concatenated with the outputs of all EdgeConv layers. After that, the points that belong to the obstacle set are discarded, and the points from the frontier set are processed by three MLP layers to generate a score for each point.
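A single EdgeConv layer can be sketched in plain numpy to make the two steps concrete. The shapes, the use of xy locations for the k-NN graph, and the single shared linear map with ReLU are illustrative assumptions, not the paper's exact layer:

```python
import numpy as np

def edge_conv(points, feats, k, W, b):
    """Minimal EdgeConv sketch: for every point, build a k-NN graph,
    form edge features [x_i, x_j - x_i], apply a shared linear map +
    ReLU, and max-aggregate over the neighbours.
    points: (N, 2) xy locations used for the k-NN graph
    feats:  (N, D_in) per-point input features
    W, b:   (2*D_in, D_out) and (D_out,) shared edge-MLP parameters
    returns: (N, D_out) per-point output features"""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                 # exclude self from neighbours
    knn = np.argsort(d2, axis=1)[:, :k]          # (N, k) neighbour indices
    center = np.repeat(feats[:, None, :], k, axis=1)          # x_i
    neigh = feats[knn]                                        # x_j
    edge = np.concatenate([center, neigh - center], axis=-1)  # (N, k, 2*D_in)
    h = np.maximum(edge @ W + b, 0.0)            # shared MLP + ReLU per edge
    return h.max(axis=1)                         # max over the k neighbours
```

Stacking several such layers and max-pooling the concatenated outputs yields the global descriptor described above; the output point count always equals the input point count, which is why the obstacle points must be discarded explicitly before the final scoring MLPs.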

IV-D Learning Framework Based on DQN

As described in Section III-B, the DQN is a neural network that, for a given state, outputs a vector of action values. The network architecture of the DQN is detailed in Section IV-C, where the state in point cloud format contains the frontier and obstacle sets. Under a given policy $\pi$, the true value of an action $a$ in state $s$ is $Q^{\pi}(s, a) = \mathbb{E}[R_1 + \gamma R_2 + \dots \mid S_0 = s, A_0 = a, \pi]$. The target is to make the estimate from the frontier network converge to this true value. The parameters $\theta$ of the action-value function can be updated by gradient descent after taking action $a_t$ in state $s_t$ and observing the next state $s_{t+1}$ and immediate reward $r_{t+1}$:

$\theta_{t+1} = \theta_t + \alpha \left( Y_t - Q(s_t, a_t; \theta_t) \right) \nabla_{\theta_t} Q(s_t, a_t; \theta_t)$   (1)

where $\alpha$ is the learning rate and the target $Y_t$ is defined as

$Y_t = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \theta^{-})$   (2)

where $\theta^{-}$ denotes the parameters of the target network, which are updated periodically by $\theta^{-} \leftarrow \theta$.

To make the estimate from our network converge to the true value, a typical DQN framework [10] can be adopted. For each step $t$, the tuple $(s_t, a_t, r_{t+1}, s_{t+1})$ is saved in a replay buffer. The parameters can be updated by equations (1) and (2) given tuples sampled from the replay buffer.

The reward signal in the DRL framework tells the robot whether an action is appropriate to take in a certain state. To make the robot able to explore unknown environments successfully, we define the following reward function:

$r_t = r_t^{area} + r_t^{action} + r_t^{frontier}$   (3)

The term $r_t^{area}$ equals the newly discovered area at time $t$, which is designed to encourage the robot to explore unknown areas. The term $r_t^{action}$ provides a consistent penalization signal to the robot when a movement command is taken; this reward encourages the robot to explore unknown environments with a relatively short overall path. The term $r_t^{frontier}$ is designed to guide the robot to reduce the number of frontiers, as defined in equation (4):

$r_t^{frontier} = c \left( G_{t-1} - G_t \right)$   (4)

where $G_t$ denotes the number of frontier groups at time $t$.
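The three-term reward described above can be sketched as a small function. The coefficients are illustrative placeholders, not the paper's tuned values, and the boolean explored-cell maps are an assumed encoding:

```python
import numpy as np

def exploration_reward(prev_map, cur_map, prev_groups, cur_groups,
                       c_area=1.0, c_action=-0.02, c_frontier=0.1):
    """Sketch of the three-term exploration reward under assumed
    coefficients. prev_map/cur_map are boolean arrays marking explored
    cells; prev_groups/cur_groups are frontier-group counts."""
    r_area = c_area * (cur_map.sum() - prev_map.sum())    # newly discovered area
    r_action = c_action                                   # per-step penalty
    r_frontier = c_frontier * (prev_groups - cur_groups)  # fewer groups is better
    return r_area + r_action + r_frontier
```

The area term dominates early in an episode, while the frontier term discourages the robot from leaving small frontier groups behind, which is exactly the failure mode of the greedy strategy discussed in the experiments.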

V Training Details and Simulation Experiments

In this section, we first detail the training process of our DRL agent. Then we validate the feasibility of our proposed method on the robot exploration problem with two sets of experiments: (1) a comparison with five other exploration methods, and (2) a scalability test in which maps larger than those in the training set are to be explored.

V-A Training Details

Fig. 5: Some map samples from the training set. The black and white pixels represent obstacle and free space, respectively.

To learn a general exploration strategy, the robot is trained in the HouseExpo environment [9], where 100 maps with different shapes and features are to be explored. The robot is equipped with a 2 m range laser scanner with a 180 degree field of view. The noise of the laser range measurement is simulated by a Gaussian distribution with a mean of 0 and a standard deviation of 0.02 m. As an episode starts, a map is randomly selected from the training set, and the robot start pose, including location and orientation, is also set randomly. In the beginning, the circular area centered at the start location is explored by making the robot rotate 360 degrees. Then, at each step, a goal frontier point is selected from the frontier set under the policy of our proposed method. The A* algorithm is applied to find a feasible path connecting the current robot location to the goal frontier. A simple path tracking algorithm is used to find the robot commands to follow the planned path: the robot moves toward the nearest unvisited path point, and the path is replanned if the distance between the robot and this point is larger than a fixed threshold. An episode ends when the explored ratio of the whole map is over 95%.
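The path tracker described above can be sketched in a few lines; the function name, the set-based bookkeeping of visited waypoints, and the 0.5 threshold are illustrative assumptions:

```python
import math

def track_path(path, robot_xy, visited, replan_threshold=0.5):
    """Pick the nearest path point not yet visited as the next waypoint,
    and signal a replan when the robot has drifted farther from that
    waypoint than the fixed threshold."""
    candidates = [p for p in path if p not in visited]
    if not candidates:
        return None, False                      # path finished
    target = min(candidates, key=lambda p: math.dist(robot_xy, p))
    replan = math.dist(robot_xy, target) > replan_threshold
    return target, replan
```

A controller would call this once per control step, marking each waypoint as visited once reached, and trigger A* again whenever the replan flag is raised.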

Environment: HouseExpo
Laser range: 2 m
Laser field of view: 180 deg
Robot radius: -
Linear step length: -
Angular step length: -
Meter2pixel: -
Discount factor: 0.99
Target network update frequency: 4000 steps
Learning rate: 0.001
Replay buffer size: 50000
Exploration rate (decay steps): 15000
Learning starts: 3000 steps
TABLE I: Parameters in Training

Some training map samples are shown in Fig. 5. The largest map in the training set is 256 by 256. Because the size of the state information changes at each step and a batch update method requires point clouds of the same size, it is currently not practical to train the model in a batch-update way. In a typical point cloud classification training process, all point cloud data are pre-processed to the same size; however, such operations would change the original data's spatial information in our method. Instead, at each step, the network parameters are updated 32 times with a batch size of 1. The learning parameters are listed in Table I. The training process is performed on a computer with an Intel i7-7820X CPU and a GeForce GTX 1080Ti GPU. The training starts after 2000 steps and ends after 90000 update steps, which takes 72 hours.

V-B Comparison Study

Fig. 6: Four maps for testing. The sizes of map1, map2, map3, and map4 are (234, 191), (231, 200), (255, 174), and (235, 174), respectively.
Fig. 7: The path-length data of each method on the four test maps. For each map, each method is tested 100 times with the same set of randomly initialized robot start locations.

Besides our proposed method, we also test the performance of the weight tuning method in [11], a cost-based method in [4], the greedy strategy in [21], a method utilizing a random policy, and the baseline in [9], which we denote as the weight method, cost method, greedy method, random method, and baseline, respectively. To compare the performance of the different methods, we use 4 maps as the testing set, as shown in Fig. 6. For each test map, we conduct 100 trials, setting the robot's initial location randomly for each trial.

The random method selects a frontier point from the frontier set randomly. The greedy method chooses the nearest frontier point. The baseline utilizes a CNN that directly determines the robot movement commands from the current local observation.

The cost method evaluates the scalar value of each frontier by a cost function considering distance and information gain:

$V(f) = w \cdot I(f) - (1 - w) \cdot D(f)$   (5)

where $w$ is the weight that adjusts the relative importance of the costs, and $D(f)$ and $I(f)$ denote the normalized distance and information gain of a frontier. At each step, after obtaining the frontier set as detailed in Section IV-B, the k-means method is adopted to cluster the points in the frontier set into frontier centers. To reduce the runtime of computing information gains, we only compute the information gain for each frontier center. The information gain is computed as the area of the unknown space that would be explored if this frontier center were selected [4]. The weight in the cost method is fixed to 0.5, which fails to be optimal in environments with different structures and features.
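The clustering and scoring steps of the cost method can be sketched as follows. The tiny k-means and the particular weighted form of the frontier score are illustrative assumptions, not the exact formulation of [4] or [11]:

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means used to group frontier points into k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center, then recompute means
        labels = np.argmin(((points[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return centers

def frontier_score(norm_dist, norm_gain, w=0.5):
    """One plausible weighted trade-off for equation (5): higher is
    better, balancing normalized information gain against distance."""
    return w * norm_gain - (1.0 - w) * norm_dist
```

With the weight fixed at 0.5, both terms count equally regardless of the map's structure, which is the limitation the weight method tries to remove by learning $w$.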

The weight method learns to adjust the value of the weight in equation (5) under the same training framework as our proposed method. The structure of the neural network in the weight method is the one presented in Fig. 4, which takes the 4D point-clouds-like information as input and outputs a scalar value.

Fig. 8: Representative exploration trials of the six methods on map1 with the same robot start location and pose. Bright red denotes the start state and dark red the end state. The green points denote the locations where the robot decided which frontier to explore; as the step count increases, the green points become darker. The baseline's result has no green points because its action is to choose a robot movement command.

We select the length of the overall path, recorded after 95% of the area is explored, as the metric to evaluate the relative performance of the six exploration methods; it indicates the efficiency of each method. The box plot in Fig. 7 visualizes the path-length data of each exploration method on the four test maps. The baseline is not considered here because it sometimes fails to explore the whole environment, which will be explained later. Three statistics of this metric are used to analyze the experimental results: (1) the average, (2) the minimum, and (3) the variance. Our proposed method has the minimum average path length on all four test maps. The minimum path length of the proposed method on each map is also smaller than that of the other five methods. This indicates that the exploration strategy of our proposed method is more effective and efficient than the other five methods. The random method has the largest variance and average for each map because it has more chances to sway between two or more frontier groups; that is why the overall path of the random method in Fig. 8 is the most disorganized. The weight method has a lower average and minimum than the cost method on all test maps, owing to the advantages of learning from rich data. However, the weight method only adjusts a weight value to change the relative importance of distance and information gain. If the computation of the information gain is not accurate, or another relevant cost exists, the weight method fails to fully demonstrate the advantages of learning. Instead, our proposed method can learn useful information, including information gain, by learning to estimate the value of frontiers, which is why it outperforms the weight method.

Fig. 8 shows an example of the overall paths of the six methods when exploring map1 from the same starting pose. Each episode ended once the explored ratio exceeded 0.95. The proposed method explored the environment with the shortest path. The explored ratio of the total map as a function of the current path length is shown in Fig. 9. The greedy method's curve has a horizontal segment when the explored ratio nears 0.95. This is because the greedy method missed a nearby frontier that would have disappeared at once if selected; it chose the nearest frontier instead, which forced the robot to travel back to the missed frontier later and resulted in a longer path. For example, in Fig. 8, the greedy method chose to explore point C instead of point A when the robot was at point B. This decision made the robot travel to point A later, lengthening the overall path. The baseline's curve also exhibits a horizontal segment in Fig. 9, and a rather long one. The baseline's local observation contained no frontier information once the surroundings were fully explored (in the left part of the environment shown in Fig. 8). In this situation, the baseline could only take a "random" action (i.e., travelling along the wall) to find the remaining frontiers, which wasted a great deal of travelling distance and could fail to explore the whole environment.
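The failure mode of the greedy strategy discussed above follows from its selection rule: among the current frontiers, always travel to the one nearest the robot. A minimal sketch, with hypothetical coordinates standing in for points A, B, and C:

```python
import math

def nearest_frontier(robot_pos, frontiers):
    """Greedy rule: return the frontier closest to the robot (Euclidean)."""
    return min(frontiers, key=lambda f: math.dist(robot_pos, f))

# The robot at B (0, 0) picks the nearby point C (2, 1) over A (5, 5),
# even though skipping A forces a long detour back to it later.
robot = (0.0, 0.0)
frontiers = [(5.0, 5.0), (2.0, 1.0)]
print(nearest_frontier(robot, frontiers))  # (2.0, 1.0)
```

A value-based strategy, by contrast, can learn that visiting a soon-to-vanish frontier first shortens the total path even when it is not the nearest.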

Fig. 9: The ratio of the explored area to the area of the whole map with respect to the current path length. The test map is map1 and the start location is the same as in Fig. 8.

V-C Scalability Study

In this section, a map of size (531, 201) is used to test the performance of our proposed method in environments larger than the maps in the training set. If the network's input is a fixed-size image, the map must first be padded into a (531, 531) image, which is an inefficient state representation. A downscaling operation is then required to match the image to the input size the neural network expects, e.g. (256, 256). Although the network can process the downscaled state, the quality of the input data decreases, and the network fails to work once the scaled input contains much less of the necessary information than the original image. Fig. 10 presents the overall path generated by our method without downscaling the map. Our proposed method, which takes point clouds as input, is more robust on large-scale maps for two reasons. First, we incorporate distance information into the point clouds, which helps the neural network learn which parts of the point clouds are walkable. Although an image representation could also carry distance information as a fourth channel, the scaling operation can make some important obstacle or free points disappear, changing the structure of the map. Second, the number of pixels in an image grows quadratically with the image's side length, whereas the number of points in the point cloud equals the number of pixels representing an obstacle or frontier, which grows far more slowly unless all the pixels in a map are obstacles or frontiers.
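The size comparison above can be made concrete on a hypothetical occupancy grid (the cell encoding below is illustrative, not the paper's): the image input scales with the full pixel count, while the point-cloud input scales only with the number of obstacle and frontier cells.

```python
import numpy as np

# Hypothetical occupancy grid: -1 = unknown, 0 = free, 1 = obstacle, 2 = frontier.
grid = -np.ones((531, 201), dtype=np.int8)
grid[100:120, 50:150] = 0   # a band of explored free space
grid[99, 50:150] = 1        # obstacle boundary above it
grid[120, 50:150] = 2       # frontier boundary below it

# Fixed-size image input: the whole grid must be fed in,
# so the input size grows with the full pixel count.
n_pixels = grid.size  # 531 * 201 = 106731

# Point-cloud input: only obstacle and frontier cells become points,
# so the input size tracks the map structure, not the map area.
n_points = int(np.count_nonzero((grid == 1) | (grid == 2)))

print(n_pixels, n_points)  # 106731 vs 200
```

In this sparse example the point cloud is more than 500 times smaller than the image, and, unlike downscaling, it discards no obstacle or frontier cell.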

Fig. 10: The result of the scalability test. The map size is (531, 201). The meaning of the points' colors is the same as in Fig. 8.

VI Conclusions and Future Work

In this paper, we present a novel state representation method using 4D point-clouds-like information and design a framework to learn an efficient exploration strategy. Our proposed method avoids the problems that come with using fixed-size images as observations. The experiments demonstrate the effectiveness of our proposed method compared with five other commonly used methods. For future work, other network structures and RL algorithms can be adapted and applied to the robot exploration problem with point clouds as input. The convergence speed of training may also be improved by refining the training techniques. Besides, the multi-robot exploration problem may also use point clouds to represent global information.


  • [1] N. Basilico and F. Amigoni (2011) Exploration strategies based on multi-criteria decision making for searching environments in rescue operations. Autonomous Robots 31 (4), pp. 401–417. Cited by: §I.
  • [2] F. Bourgault, A. A. Makarenko, S. B. Williams, B. Grocholsky, and H. F. Durrant-Whyte (2002) Information based adaptive robotic exploration. In IEEE/RSJ international conference on intelligent robots and systems, Vol. 1, pp. 540–545. Cited by: §I.
  • [3] F. Chen, S. Bai, T. Shan, and B. Englot (2019) Self-learning exploration and mapping for mobile robots via deep reinforcement learning. In Aiaa scitech 2019 forum, pp. 0396. Cited by: §I, §II.
  • [4] H. H. González-Banos and J. Latombe (2002) Navigation strategies for exploring indoor environments. The International Journal of Robotics Research 21 (10-11), pp. 829–848. Cited by: §I, §I, §V-B, §V-B.
  • [5] H. v. Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pp. 2094–2100. Cited by: §III-B.
  • [6] S. James, Z. Ma, D. R. Arrojo, and A. J. Davison (2020) Rlbench: the robot learning benchmark & learning environment. IEEE Robotics and Automation Letters 5 (2), pp. 3019–3026. Cited by: §I.
  • [7] F. Leiva and J. Ruiz-del-Solar (2020) Robust rl-based map-less local planning: using 2d point clouds as observations. IEEE Robotics and Automation Letters 5 (4), pp. 5787–5794. External Links: Document Cited by: §I-A.
  • [8] H. Li, Q. Zhang, and D. Zhao (2019) Deep reinforcement learning-based automatic exploration for navigation in unknown environment. IEEE transactions on neural networks and learning systems 31 (6), pp. 2064–2076. Cited by: §II, §III-A.
  • [9] T. Li, D. Ho, C. Li, D. Zhu, C. Wang, and M. Q. Meng (2019) Houseexpo: a large-scale 2d indoor layout dataset for learning-based algorithms on mobile robots. arXiv preprint arXiv:1903.09845. Cited by: §I, Fig. 2, §IV-A, §V-B.
  • [10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. nature 518 (7540), pp. 529–533. Cited by: §I-A, §I, §III-B, §IV-D.
  • [11] F. Niroui, K. Zhang, Z. Kashino, and G. Nejat (2019) Deep reinforcement learning robot for search and rescue applications: exploration in unknown cluttered environments. IEEE Robotics and Automation Letters 4 (2), pp. 610–617. External Links: Document Cited by: §I, §II, §III-A, §V-B.
  • [12] S. Oßwald, M. Bennewitz, W. Burgard, and C. Stachniss (2016) Speeding-up robot exploration by exploiting background information. IEEE Robotics and Automation Letters 1 (2), pp. 716–723. External Links: Document Cited by: §I.
  • [13] E. Parisotto and R. Salakhutdinov (2018) Neural map: structured memory for deep reinforcement learning. In International Conference on Learning Representations, Cited by: §II.
  • [14] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660. Cited by: §IV-C, §IV-C.
  • [15] B. Raphael (1968) A formal basis for the heuristic determination of minimum cost paths. IEEE transactions on Systems Science and Cybernetics 4 (2), pp. 100–107. Cited by: §III-B.
  • [16] R. Strudel, R. Garcia, J. Carpentier, J.P. Laumond, I. Laptev, and C. Schmid (2020) Learning obstacle representations for neural motion planning. Proceedings of Conference on Robot Learning (CoRL). Cited by: §I-A.
  • [17] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §I, §II.
  • [18] L. Tai and M. Liu (2016) A robot exploration strategy based on q-learning network. In 2016 IEEE International Conference on Real-time Computing and Robotics (RCAR), Vol. , pp. 57–62. External Links: Document Cited by: §II.
  • [19] S. Thrun (2002) Probabilistic robotics. Communications of the ACM 45 (3), pp. 52–57. Cited by: §I.
  • [20] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog) 38 (5), pp. 1–12. Cited by: §I-A, §I, §IV-C.
  • [21] B. Yamauchi (1997) A frontier-based approach for autonomous exploration. In Proceedings 1997 IEEE International Symposium on Computational Intelligence in Robotics and Automation CIRA’97.’Towards New Computational Principles for Robotics and Automation’, pp. 146–151. Cited by: §I, §I, §II, §III-A, §V-B.
  • [22] J. Zhang, L. Tai, J. Boedecker, W. Burgard, and M. Liu (2017) Neural slam: learning to explore with external memory. arXiv preprint arXiv:1706.09520. Cited by: §II.
  • [23] D. Zhu, T. Li, D. Ho, C. Wang, and M. Q. -. Meng (2018) Deep reinforcement learning supervised autonomous exploration in office environments. In 2018 IEEE International Conference on Robotics and Automation (ICRA), Vol. , pp. 7548–7555. External Links: Document Cited by: §I, §II.