I. Introduction
Deep reinforcement learning (DRL) [1], which utilizes deep neural networks (DNNs) to approximate a navigation policy through reinforcement learning, has been widely applied in the robot navigation domain. DRL models show a clear advantage in mapless navigation: compared to fuzzy reactive control [2], they can directly handle range data without any hand-engineered navigation rules, and compared to the artificial potential field approach [4], they can drive the robot out of small local-minimum areas [3]. Moreover, compared to the local planners commonly used in SLAM [5], such as the dynamic window approach (DWA) [6], a DRL model requires neither a local map nor time-consuming forward simulation, and it can adjust its policy quickly when facing moving obstacles [7].
The most widely used sensors for DRL-based navigation are range sensors. With a range sensor such as lidar, a model trained in simulation can be directly deployed to a real robot without any fine-tuning. Tai et al. [8] trained their model in simulation scenarios, and the trained model can generate a navigation policy using only ten laser beams. Xie et al. [9] introduced a PID controller to accelerate the training of DRL in simulation; their model is built with convolutional neural networks (CNNs) and takes 512 laser beams as input. Pfeiffer et al. [10] used min pooling to downsample 1080 laser beams to 36 distance values. This operation can handle more laser points and enhance safety during navigation; however, it sacrifices much local information and may lead to a suboptimal policy.
Although these approaches work well in robot navigation applications, the trained DNN model requires fixed parameters of the employed sensor, i.e., sensor position, field of view (FOV), and angular resolution. These parameters may change due to different task requirements. Fig. 1 shows examples of potential parameter changes of the lidar sensor. Once these parameters change, the original model cannot directly handle the new sensor readings. Although the missing distance values in some directions can be estimated by interpolation, it is difficult to ensure the accuracy of the interpolated values, especially when the new sensor has a smaller FOV and lower angular resolution. Moreover, a falsely estimated short distance may cause a collision. In addition, if the input data of the model comes from a lidar with a low-end configuration, the navigation performance of the model cannot be improved by upgrading the lidar, because the input dimension is fixed and much useful local information has to be abandoned. Building a local grid map and feeding the map into a CNN model may allow the lidar configuration to change. However, map building requires additional computation, and the dimension of a high-resolution map is huge; training a network with high-dimensional, sparse inputs is much harder than with low-dimensional, dense inputs. Existing works only operate on low-resolution maps [12], [13], and such resolution is not enough for local obstacle avoidance in crowded scenarios.
In this paper, our model operates on each obstacle point and takes the accurate positional information of these points as input. It is inspired by PointNet [14], which handles semantic segmentation of point clouds. Different from [14], our model incorporates the goal-reaching objective into the pointwise feature extraction. Specifically, we introduce a gate mechanism to guide the network to extract obstacle features that are critical for goal-reaching. Subsequently, global features are selected from all the feature candidates. The number of global features determines the maximum number of selected points. As only a few points are used to support the final decision, we refer to these points as support points and to our approach as support-point based navigation (SPN). This operation is similar to human navigation, in which only a few features, such as sharp corners on the way to the destination, receive attention. As our model only operates on individual obstacle points, the LiDAR specifications and position can be changed, provided that the sensor returns obstacle point positions in the robot frame. Besides, the navigation performance improves as the LiDAR resolution increases, because there are more candidate points to choose from. Moreover, our model can be used to guide where to mount the LiDAR sensor such that better navigation performance is achieved. To sum up, the contributions of this paper are:

A human-like navigation algorithm is proposed, which automatically focuses on critical obstacle points and ignores the non-critical ones.

Our model can generate a reliable navigation policy for a robot with a new LiDAR setup without any retraining.

The performance of our model improves when the LiDAR sensor is upgraded.

The robot navigation performance can be improved by changing the position of the LiDAR under the guidance of the trained model.
The remainder of this paper is organized as follows. A brief introduction of our problem is given in Section II. The proposed SPN approach is described in Section III, followed by the implementation and results in Section IV. Last, we draw conclusions in Section V.
II. Background
II-A. Problem Formulation
The illustration of mapless robot navigation with varied LiDAR configurations is given in Fig. 2. As shown, a circular robot is required to reach its target without colliding with any obstacle. It carries a LiDAR sensor to perceive its surroundings. The LiDAR can be described by its pose and its specifications: in the robot frame (the y-axis is the robot heading direction), the pose of the LiDAR is given by its position and orientation, and its specifications comprise the angular resolution, the field of view (FOV), and the maximum detection range. We call all the parameters above the LiDAR configuration. In this paper, the robot is required to perform navigation tasks when its LiDAR configuration is changed.
Our problem can be modeled as a sequential decision-making process. During navigation, similar to most mapless navigation problems [8], [9] and [10], the target position in the robot frame is given or directly obtained by sensors [19]. At time t, the model input s_t contains the LiDAR readings, the current robot velocities and the goal position. The dimension of s_t may change with different LiDAR configurations. The action a_t of the robot comprises the commands of linear and angular velocities. Given s_t, the robot takes the action a_t produced by the current policy π. With new observations received at time t+1, the robot updates its input and obtains a reward r_t from the reward function. The objective of this paper is to find an optimal policy π* that maximizes the discounted total reward Σ_t γ^t r_t, where γ ∈ [0, 1) is a discount factor.
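The discounted objective above can be made concrete with a minimal sketch (plain Python; the episode rewards and discount value used below are illustrative, not from the paper):

```python
def discounted_return(rewards, gamma=0.99):
    # Discounted total reward sum_t gamma^t * r_t for one episode,
    # accumulated backwards so each reward is discounted exactly once.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, three unit rewards with a discount of 0.5 yield 1 + 0.5 + 0.25 = 1.75.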
III. Approach
We use DRL to learn a DNN model that approximates the optimal policy π*. The DRL algorithm used in this paper is soft actor-critic (SAC) [15]. It contains an actor network for approximating the policy and three critic networks (one value network and two Q networks) for approximating expected total returns. The structures of these networks are given in Fig. 3. As shown, the actor network takes pointwise information as input, while the critic networks share a similar structure with our previous work [16], taking 1D laser scans as input.
III-A. Actor Network
III-A.1 Input representation
To learn a DNN-based navigation controller from laser scans, previous works usually format the laser scans as a one-dimensional vector x = [x_1, ..., x_n], where x_i denotes the i-th scanned distance value and n is the number of laser beams. The direction of each laser beam is implied by its location in x. However, such a representation restricts the trained model from handling data obtained by LiDARs with different FOVs or angular resolutions. To address this problem, in this paper, the laser scans are represented by a point set P = {p_1, ..., p_n} with elements p_i = (1/d_i, θ_i), where d_i is the scanned distance and θ_i is the relative angle of the obstacle point in the robot frame. The same as our previous work [16], we use the reciprocal of the distance to highlight small distance values in the laser scans. Under such a representation, the position of each obstacle point is explicitly described, and hence there is no ordering requirement for the elements of P.

III-A.2 Goal-directed pointwise feature extraction
The shallow layers (labeled with orange in Fig. 3(a)) of the actor network operate on each point representation p_i individually. As shown, each obstacle point is fed into a dense layer and activated by LReLU (Leaky Rectified Linear Unit). Meanwhile, the relative position of the goal and the current robot velocity are fed into another dense layer and activated by a sigmoid function. Subsequently, we apply elementwise multiplication to the two features as follows,
f_i = LReLU(W_1 p_i + b_1) ⊙ σ(W_2 [g, v] + b_2),    (1)

where W_1, b_1 and W_2, b_2 are the network weights and biases of the aforementioned two layers, g and v are the relative goal position and the current robot velocity, ⊙ denotes elementwise multiplication, LReLU(·) denotes the LReLU function, and σ(·) denotes the sigmoid function. The sigmoid function σ(·) here serves as a gate that helps retain obstacle features important for goal-reaching and remove non-critical ones, given the goal position and current velocity.
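A minimal numerical sketch of the point-set input (Sec. III-A.1) and the gated extraction of Eq. (1) is given below. The weight shapes, FOV value, and function names are illustrative assumptions, not the paper's actual network sizes:

```python
import numpy as np

def lrelu(x, slope=0.01):
    # Leaky ReLU for the per-point obstacle features.
    return np.where(x > 0.0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scan_to_point_set(distances, fov_rad):
    # One (1/d, theta) element per beam; beam angles span the FOV
    # symmetrically around the robot heading.
    d = np.asarray(distances, dtype=float)
    theta = np.linspace(-fov_rad / 2.0, fov_rad / 2.0, len(d))
    return np.stack([1.0 / d, theta], axis=1)  # shape (n, 2)

def gated_point_features(points, goal_and_vel, W1, b1, W2, b2):
    # Eq. (1): per-point obstacle features, elementwise-gated by a signal
    # computed from the relative goal position and current robot velocity.
    obstacle_feat = lrelu(points @ W1 + b1)   # (n, k)
    gate = sigmoid(goal_and_vel @ W2 + b2)    # (k,), broadcast over points
    return obstacle_feat * gate
```

Because the gate depends only on the goal and velocity, the same LiDAR reading can yield different surviving features for different goals, which is what makes the extraction goal-directed.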
III-A.3 Global feature selection and policy calculation
The following network structure is similar to PointNet [14]. The filtered feature representation f_i is fed into another dense layer with m output neurons, followed by a max-pooling operation over all points. Through max pooling, the global features of all obstacle points are extracted. The number m of global features is a key factor: it decides the maximum number of obstacle points used for the final decision. The selected obstacle features, combined with the goal position and current velocity, are fed into a two-layer DNN for policy calculation. As a squashed Gaussian policy is used in this paper, a stochastic action sampled from the policy is used in training, and the deterministic action is used in testing.

III-B. Critic Networks
As the critic networks are not used in testing, their input representation and network structure need not resemble the actor network. In this paper, the adopted structure of the critic networks is similar to [8] and [16]. As shown in Fig. 3(b), all critic networks are fully-connected three-layer DNNs. To reduce the dimension of the laser scans and the number of network parameters, the original laser scans are compressed by 1D minimum downsampling. Similar to the actor network, we choose the reciprocal of each downsampled value as the input representation, which is computed as,
y_j = 1 / min(x_{(j-1)k+1}, ..., x_{jk}),    (2)

where j is the index of the downsampled laser scans and k is the length of the downsampling window. The downsampled representations of the laser scans, together with the relative goal position and robot velocity, are fed into the value network to approximate the value function. The double Q networks (Fig. 3(b), top) share the same structure as the value network (Fig. 3(b), bottom), except that they also take the agent's action as input.
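A short sketch of the downsampling in Eq. (2) (plain Python; the function name and the example scan are illustrative):

```python
def min_downsampled_reciprocal(scan, window):
    # 1-D minimum pooling over consecutive windows, then the reciprocal,
    # so the nearest obstacle in each angular sector dominates the
    # critic input (Eq. (2)).
    n = len(scan) // window
    return [1.0 / min(scan[j * window:(j + 1) * window]) for j in range(n)]
```

Taking the minimum (rather than the mean) before the reciprocal is a conservative choice: a single close obstacle inside a window is never averaged away.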
III-C. Training Approach
The agent's learning objective is to reach a given goal point in the fewest timesteps without colliding with the obstacles. To guide the learning process, the reward function contains a positive part for encouraging goal-reaching and a negative part for punishing collision. As these rewards are too sparse to be received frequently during training, similar to [8], [9] and [10], an additional dense part is adopted as follows,
r_dense(t) = c (d_{t-1} − d_t) + ε,    (3)

where c is a scaling constant, d_t is the distance between the robot and the goal at time t, and ε is a small negative constant for punishing staying still.
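The full per-step reward can be sketched as follows (plain Python; all numeric constants below are illustrative placeholders, not the paper's tuned values):

```python
def step_reward(prev_dist, dist, reached, crashed,
                c=1.0, eps=-0.01, r_goal=10.0, r_crash=-10.0):
    # Sparse terms for goal-reaching and collision, plus the dense
    # progress term c * (d_{t-1} - d_t) and the small negative constant
    # eps of Eq. (3) that punishes staying still.
    if reached:
        return r_goal
    if crashed:
        return r_crash
    return c * (prev_dist - dist) + eps
```

Note that a robot that makes no progress (prev_dist == dist) receives only eps, a negative reward, so idling is strictly worse than moving toward the goal.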
As the DRL algorithm used for training the agent is SAC [15], the expected total return is regularized by the policy entropy as follows:

J(π) = Σ_t E[ r_t + α H(π(·|s_t)) ],    (4)

where H(π(·|s_t)) is the policy entropy and α is a weighting factor. During training, the critic networks are optimized to approximate the entropy-regularized return using the Bellman equation, and the actor network learns a policy that maximizes the Q values approximated by the Q networks.
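The critic targets used in standard SAC with a separate value network (the three-critic layout described in Sec. III) can be sketched as follows; function names are illustrative:

```python
def soft_q_target(reward, next_value, done, gamma=0.99):
    # Bellman target for both Q networks; the value network's estimate
    # of the next state already folds in the entropy bonus.
    return reward + gamma * (1.0 - done) * next_value

def soft_value_target(q1, q2, log_prob, alpha=0.2):
    # Target for the value network: the minimum of the two Q estimates
    # minus alpha * log pi(a|s), i.e. the entropy-regularized value.
    return min(q1, q2) - alpha * log_prob
```

Taking the minimum over the two Q networks mitigates overestimation, which is the reason SAC maintains double Q networks.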
IV. Implementation and Tests
IV-A. Model Training
We train our model in four simulated scenarios using ROS Stage [17]. As shown in Fig. 4, the simulated robot is a circular robot with a radius of 0.2 meters. A LiDAR with a fixed FOV, angular resolution and scanning range is mounted at the robot center. The four training scenarios are ordered (left to right) by their training difficulty: Env(0) is the simplest, and Env(3) is the hardest. The training purpose is to obtain a model that can generate optimal navigation strategies in complex scenarios such as Env(2) and Env(3). To accelerate training, we adopt curriculum learning as follows. At the beginning, the four scenarios Env(0), Env(1), Env(2) and Env(3) are assigned initial selection probabilities. When the agent's success rate in Env(i) reaches 0.9 over the past 50 episodes, the selection probability of the next scenario Env(i+1) is exchanged with that of Env(i). When the selection probability of Env(3) reaches 0.7, we keep the selection probability of each scenario fixed until the end of training.

In each training episode, the robot is spawned in a scenario chosen according to the selection probabilities. It starts from a randomly chosen obstacle-free point and aims to reach a goal point (not rendered). In the first 100 episodes, the agent is controlled by a PID controller [9]. Afterwards, it takes stochastic actions sampled from the squashed Gaussian policy. We train each model for 500k steps and test it every 5000 training steps. During testing, in each scenario, the agent is required to reach four goals (labeled with red circles) from the original point with the deterministic policy. To investigate the effect of the number of global features, we train six models with m ∈ {2, 3, 5, 10, 20, 50}. To compare with the conventional input format, we also train a model with 1D LiDAR data as input; the input and structure of this model are the same as those of the value network.
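The curriculum rule above can be sketched as follows (plain Python; the initial selection probabilities in the usage example are illustrative, since the paper's initial values are not reproduced here):

```python
def update_curriculum(probs, success_rate, level, threshold=0.9):
    # Once the success rate in the current scenario Env(level) reaches
    # the threshold, swap its selection probability with Env(level + 1)'s
    # and advance to the next curriculum level.
    probs = list(probs)
    if level + 1 < len(probs) and success_rate >= threshold:
        probs[level], probs[level + 1] = probs[level + 1], probs[level]
        level += 1
    return probs, level
```

Repeated swaps gradually shift probability mass toward the harder scenarios while still occasionally sampling the easier ones, which prevents catastrophic forgetting of the simpler skills.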
Each model is trained five times using different random seeds. As the dense part of the reward function cannot represent the real navigation objective, the metric used to evaluate the model's performance in each task is similar to [9] and is represented by a score function as follows,
(5) 
where the score decreases with the total number of steps spent on navigation. As shown, the robot gets a negative score when its ending status is collision or time out. The learning curves are given in Fig. 5. The training curves in Env(0) and Env(1) are not plotted because all models converge quickly to high scores in these two scenarios. As shown, all models converge to high scores in Env(2) at the end of training. In Env(3), our SPN models with more than five global features achieve a convergence speed and final scores comparable to the model using the conventional 1D input. The averaged best scores in Env(3) and in all scenarios are given in Table I. As shown, the SPN (m = 20) model performs best among all models. Our model can achieve performance similar to the 1D-input model using only ten global features. Notably, the best score of SPN decreases when m increases from 20 to 50, which indicates that too many features may disturb the final decision and degrade the model performance. In the remainder of this paper, the model used for testing is SPN (m = 20).
             1D input   SPN model: no. of global features m
                          2        3        5        10       20       50
In Env(3)    14.397     14.326   14.352   14.39    14.396   14.404   14.389
In Env(0-3)   3.616      3.597    3.611    3.618    3.619    3.622    3.619
IV-B. Support Points Visualization
To better understand the effect of the support points, we visualize them in the simulated map shown in Fig. 6. This map is a square room, which is considered a complex scenario in [9]. Four SPN models with different numbers of global features are used for testing. As shown in Fig. 6, the robot is initialized at the center point (black) and faces right (black arrow). It is required to reach four goal points (G0–G3) from the initial state, respectively. We plot the support points selected by our model every 20 control steps for each task. As shown, the support points are marked with different shapes according to their goal, and their colors are consistent with the colors of the footprints of the robot. As several features may come from the same obstacle point, we amplify the size of a support point marker in proportion to the number of features extracted from that point. Notably, at the beginning of each task, the LiDAR readings are the same, but the support points are different (see the black markers). This observation indicates the effect of the goal-directed feature extraction operation, i.e., the model only retains obstacle features that are valuable for goal-reaching. Moreover, our model can accurately capture the corner points that may affect its navigation in the near future and ignore the nearby points that have already been passed or lie in the opposite direction of its goal.
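Identifying the support points and how many global features each supplies (the quantity used to scale the markers above) amounts to inspecting the argmax of the max pooling; a minimal sketch with an illustrative function name:

```python
import numpy as np

def support_point_counts(point_features):
    # point_features: (n_points, m) matrix of gated per-point features.
    # winners[j] is the index of the obstacle point that supplies global
    # feature j after max pooling; counting repeats gives the per-point
    # feature counts used to amplify the markers.
    winners = np.argmax(point_features, axis=0)
    idx, counts = np.unique(winners, return_counts=True)
    return dict(zip(idx.tolist(), counts.tolist()))
```

Since each of the m global features names exactly one winning point, at most m distinct obstacle points can ever become support points, matching the role of m described in Sec. III-A.3.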
IV-C. Performance Evaluation in Simulated Scenarios
To evaluate the performance of our model, robots with seven LiDAR configurations covering different angular resolutions, FOVs, and LiDAR positions are tested in simulated scenarios. For simplicity, we label each LiDAR setup with its specification (FOV, angular resolution, and maximum range) and its mounting position along the robot heading axis (see Fig. 2 for the parameter definitions); the other two position parameters are both zero. The first four LiDAR setups are used in real-world experiments in [8], [9], [10] and [11], respectively, and their detailed setups are given in Fig. 7 and Fig. 8. Note that none of these configurations is used for model training. The simulation scenarios Env5, Env6 and Env7 are all square rooms used in [10]. In each scenario, the robot performs 100 navigation tasks with randomly generated initial and goal positions. In Fig. 8, we plot the ratio of the different ending statuses (success, crash, or time out) of the robot using each LiDAR setup in the four scenarios. As shown, all robots with high-resolution LiDARs achieve a near-100% success rate in Env4 and Env5, which demonstrates the high generalization performance of our model. As we have not trained our model with the objective of escaping local-minimum areas, our model's performance degrades in scenarios containing many large concave areas, such as Env6 and Env7. Interestingly, when the LiDAR resolution is very low, moving the LiDAR forward 0.05 m from the robot center increases the robot's success rate in all four scenarios, most notably in Env5. This observation indicates that our model can also be used to guide the installation of the LiDAR to improve robot navigation performance.
IV-D. Real-world Performance Evaluation
We test our model using a TurtleBot2 robot in the real-world scenario shown in Fig. 9, in which a UTM-30LX LiDAR is used for perceiving the surroundings. As shown, it is mounted at the top center of the robot, and its configuration is different from the one used in training. As a target localization sensor is not available, the same as [9], we use Gmapping [18] to pre-build a map of this scenario for robot localization. With the robot and goal positions in the map, we can calculate the relative position of the goal in the robot frame. It should be noted that this map is only used to calculate the relative position of the goal and is not used by our SPN model. As shown in Fig. 9(b) and Fig. 10(a), the robot starts from the point "S" and is required to reach five targets (marked with red crosses on the floor) in succession. This testing scenario contains a very small obstacle (the cylinder) and many obstacles with sharp corners. The trajectories of the robot are recorded and plotted in Fig. 10(a). As shown, the robot reaches all five destinations successfully, and the chosen paths are near-optimal.
IV-D.1 Reaction to sudden changes
Last, we test the model's capability of handling a suddenly appearing obstacle. As shown in Fig. 10(b) and the attached video, the researcher, serving as a dynamic obstacle, suddenly blocks the way of the robot. The robot must then adjust its navigation strategy to avoid potential collisions and find another path. The trajectories and the video show that our robot adapts well to such abrupt changes and quickly switches to another path.
V. Conclusions
In this work, we present a learning-based approach for mapless navigation with varied LiDAR configurations. Our model selects support points from the obstacle points and can make near-optimal navigation decisions. It is trained purely in simulation and can be deployed to a real robot without any fine-tuning. Extensive simulation and real-world experiments have been conducted to evaluate our model. The experimental results demonstrate that the trained model can be used by a robot with a different LiDAR setup and can achieve high navigation performance.
References
 [1] V. Mnih et al., Human-level control through deep reinforcement learning, Nature, vol. 518, no. 7540, pp. 529–533, 2015.
 [2] W. Li, C. Ma and F. Wahl, A neuro-fuzzy system architecture for behavior-based control of a mobile robot in unknown environments, Fuzzy Sets and Systems, vol. 87, no. 2, pp. 133–140, 1997.
 [3] O. Zhelo, J. Zhang, L. Tai, M. Liu, and W. Burgard, Curiosity-driven exploration for mapless navigation with deep reinforcement learning, 2018. [Online]. Available: https://arxiv.org/abs/1804.00456.
 [4] Y. Hwang and N. Ahuja, A potential field approach to path planning, IEEE Trans. Robot. Autom., vol. 8, no. 1, pp. 23–32, 1992.
 [5] M. Dissanayake, P. Newman, S. Clark, H. Durrant-Whyte and M. Csorba, A solution to the simultaneous localization and map building (SLAM) problem, IEEE Trans. Robot. Autom., vol. 17, no. 3, pp. 229–241, 2001.
 [6] D. Fox, W. Burgard and S. Thrun, The dynamic window approach to collision avoidance, IEEE Robot. Autom. Mag., vol. 4, no. 1, pp. 23–33, 1997.
 [7] Y. Wang, H. He and C. Sun, Learning to Navigate Through Complex Dynamic Environment With Modular Deep Reinforcement Learning, IEEE Trans. Games, vol. 10, no. 4, pp. 400–412, 2018.
 [8] L. Tai, G. Paolo, and M. Liu, Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation, in IEEE Int. Conf. Intell. Robot. Syst. (IROS), pp. 31–36, Sept. 2017.
 [9] L. Xie, S. Wang, S. Rosa, A. Markham, and N. Trigoni, Learning with training wheels: Speeding up training with a simple controller for deep reinforcement learning, in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pp. 6276–6283, 2018.
 [10] M. Pfeiffer et al., Reinforced Imitation: Sample Efficient Deep Reinforcement Learning for Mapless Navigation by Leveraging Prior Demonstrations, IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 4423–4430, 2018.
 [11] J. Lim, S. Ha, and J. Choi, Prediction of Reward Functions for Deep Reinforcement Learning via Gaussian Process Regression, IEEE/ASME Trans. Mechatronics, vol. 25, no. 4, pp. 1739–1746, 2020.
 [12] G. Chen et al., Robot Navigation with Map-Based Deep Reinforcement Learning, 2020. [Online]. Available: https://arxiv.org/abs/2002.04349.
 [13] D. S. Chaplot, D. Gandhi, S. Gupta, A. Gupta, and R. Salakhutdinov, Learning To Explore Using Active Neural SLAM, in Int. Conf. Learn. Represent. (ICLR), 2019.

 [14] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 652–660, 2017.
 [15] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, in Int. Conf. Mach. Learn. (ICML), pp. 2976–2989, 2018.
 [16] W. Zhang, Y. Zhang, and N. Liu, Mapless Navigation: A Single DRL-based Controller for Robots with Varied Dimensions, 2020. [Online]. Available: https://arxiv.org/abs/2002.06320.
 [17] stage/Tutorials - ROS Wiki, Wiki.ros.org, 2020. [Online]. Available: http://wiki.ros.org/stage/Tutorials. [Accessed: 14-Oct-2020].
 [18] gmapping - ROS Wiki, Wiki.ros.org, 2020. [Online]. Available: http://wiki.ros.org/gmapping. [Accessed: 14-Oct-2020].
 [19] C. Chen, Y. Chen, Y. Han, H. Q. Lai, F. Zhang, and K. J. R. Liu, Achieving Centimeter-Accuracy Indoor Localization on WiFi Platforms: A Multi-Antenna Approach, IEEE Internet Things J., vol. 4, no. 1, pp. 122–134, 2017.