
Deterministic and Stochastic Analysis of Deep Reinforcement Learning for Low Dimensional Sensing-based Navigation of Mobile Robots

by   Ricardo B. Grando, et al.

Deterministic and stochastic techniques in Deep Reinforcement Learning (Deep-RL) have become a promising solution to improve motion control and decision-making tasks for a wide variety of robots. Previous works showed that these Deep-RL algorithms can be applied to mapless navigation of mobile robots in general. However, they tend to use simple sensing strategies, since it has been shown that they perform poorly with high-dimensional state spaces, such as the ones yielded by image-based sensing. This paper presents a comparative analysis of two Deep-RL techniques, Deep Deterministic Policy Gradients (DDPG) and Soft Actor-Critic (SAC), when performing mapless navigation tasks for mobile robots. We aim to contribute by showing how the neural network architecture influences the learning itself, presenting quantitative results based on the time and distance of navigation of aerial mobile robots for each approach. Overall, our analysis of six distinct architectures highlights that the stochastic approach (SAC) is better suited to deeper architectures, while the opposite happens with the deterministic approach (DDPG).



I Introduction

Many problems in robotics can be expressed as Reinforcement Learning (RL) problems. RL techniques allow a robot to learn progressively to excel in a distinct task, such as motion-based tasks. Through trial-and-error interactions with an environment, an agent gets feedback in terms of a scalar objective function that guides it step by step towards learning [kober2013reinforcement]. This can be approached through many learning policies, which can be framed into two groups: deterministic and stochastic.

More recently, RL techniques have been further improved by using deep neural networks. In this case, the agent of Deep Reinforcement Learning (Deep-RL) becomes a neural network, which scales up its ability to learn complex behaviors, such as the behavior needed to perform navigation tasks in complex environments. The techniques based on Deep-RL have been used extensively to improve navigation-related tasks for a range of mobile vehicles, including terrestrial mobile robots [ota2020efficient, jesus2021soft], aerial robots [tong2021uav, grando2022double] and underwater robots [carlucho2018]. These approaches diverge on the choice of Artificial Neural Network (ANN), ranging from Multi-Layer Perceptron (MLP) network structures to Convolutional Neural Networks (CNN). Most of them have achieved interesting results not only in mapless navigation-related tasks but also in obstacle avoidance and even medium transition for hybrid vehicles [bedin2021deep, de2022depth]. However, the learning itself can be affected not only by the selection of the Deep-RL technique but also by the agent’s ANN structure.

In this work, we further explore the use of distinct deep ANN architectures for two state-of-the-art Deep-RL algorithms: Deep Deterministic Policy Gradient (DDPG) [lillicrap2015continuous] and Soft Actor-Critic (SAC) [haarnoja2018soft]; a deterministic one and a stochastic one. We perform a study showing how the network structure and the depth of the network impact the agent’s learning. The evaluation uses an Unmanned Aerial Vehicle (UAV) in tasks related to goal-oriented mapless navigation using low-dimensional data. It is two-fold, taking into account both navigation time and distance navigated to enrich the results.

This work contains the following main contributions:

  • We present comparative analyses of how the agent’s deep neural network affects deterministic and stochastic algorithms’ performance for mapless navigation-related tasks of mobile robots.

  • We show that low-dimensional sensing is better suited to use in Deep-RL for continuous control tasks in general.

  • We also show that depth tends to decrease the performance of deterministic approaches in general, while the opposite tends to happen with stochastic approaches.

  • We provide a framework with a simulated environment and two approaches based on state-of-the-art actor-critic Deep-RL algorithms with a range of structures that can be successfully adapted to perform mapless navigation of mobile robots, using only range data readings and the vehicles’ relative localization data.

This paper is organized as follows: related works are presented in Sec. II. We show our methodology in Sec. III and the results in Sec. IV; both are discussed in Sec. V. Finally, we highlight our contributions and present future works in Sec. VI.

II Related Work

Several Deep-RL works in robotics have already been presented, discussing how efficiently these methods can be used in problems related to motion control with low-dimensional sensing information [tai2017virtual, bedin2021deep]. For a terrestrial mobile robot, Tai et al. [tai2017virtual] used ten samples of range findings and the relative distance of the vehicle to a target to perform navigation through obstacles. Their DDPG-based agent learned effectively to navigate to a target. Recently, De Jesus et al. [jesus2019deep, jesus2021soft] and others also accomplished mapless navigation-related tasks for terrestrial mobile robots using simple sensing information.

Singh and Thongam [singh2018mobile] show that a Multi-Layer Perceptron (MLP) can be used for mapless navigation of terrestrial mobile robots in dynamic environments. Their method used MLP and Recurrent Neural Networks to decide the robot’s speed for each motion. They concluded that the approach is efficient in guiding the robot to a target position.

For aerial mobile robots, the use of Deep-RL is still limited. Rodriguez et al. [rodriguez2018deep] used a DDPG-based approach to teach an agent to land on a moving platform. Their approach used information from images, but the agent was fed with a simplified version of it. It used Deep-RL in simulation with the RotorS framework [furrer2016rotors] and the Gazebo simulator. Grando et al. [grando2020deep] presented a DDPG and a SAC approach on Gazebo for 2D UAV navigation. Recently, double critic-based Deep-RL has also been used for UAVs [grando2022double]. The latter works use information from ranging sensors in a simple state information model for the agent.

Two works have recently tackled the navigation problem with medium transition of hybrid mobile robots [de2022depth, bedin2021deep]. Grando et al. [bedin2021deep] presented Deep-RL approaches with an MLP architecture, developed using distance sensing information for aerial and underwater navigation. De Jesus et al. [de2022depth] tackled the problem of motion for this kind of vehicle using image information, but with contrastive learning that relies on a decoder to simplify the image information fed to the agent.

Based on these works that used simple state information, we present a comparative analysis of how the agent’s deep neural network affects the performance of Deep-RL for continuous motion tasks in mobile robots. We aim to identify the best architecture for each kind of algorithm. The environment used for testing is the Gazebo simulator, with a model of a real-world aerial mobile robot.

III Methodology

In this section, we discuss the Deep-RL approaches used in this work and the mobile aerial robot used. We detail the structure of all networks used to perform the comparison for both deterministic and stochastic agents.

III-A Deep Deterministic Policy Gradient

Deep Deterministic Policy Gradient (DDPG) [lillicrap2015continuous] has two main deep neural networks: an actor network that outputs the real-valued action chosen by the policy and a critic network that estimates the value of that action. Slowly updated target copies of these networks give stability to the learning process [lillicrap2015continuous]. The observation of the current state is the input of the actor network, which provides a value in the continuous action space. At the same time, the critic network uses the current state and the agent’s action to provide the Q value for the agent.

This method provides good performance for continuous control but has a challenging problem related to exploration. Since it is deterministic, DDPG needs some exploration policy to avoid learning stagnation. This can be solved by adding a noise process to the actor policy, which can be defined as:

$$\mu'(s_t) = \mu(s_t \mid \theta^{\mu}_t) + \mathcal{N}$$

where $\mu(s_t \mid \theta^{\mu}_t)$ is the action given by the actor for the state $s_t$ and $\mathcal{N}$ is the chosen noise process. The Ornstein-Uhlenbeck process [uhlenbeck1930theory] is typically used and provides good exploration.
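The noise-adding step can be illustrated with a minimal sketch of an Ornstein-Uhlenbeck process added to a deterministic action. The class and function names, the parameters `theta`, `sigma` and `dt`, and the clipping range are assumptions made for illustration; they are not taken from this work.

```python
import math
import random


class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise.

    Discretised update: x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1).
    The default parameter values are illustrative assumptions.
    """

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=None):
        self.size = size
        self.mu = mu
        self.theta = theta
        self.sigma = sigma
        self.dt = dt
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Start every episode from the long-run mean.
        self.state = [self.mu] * self.size

    def sample(self):
        self.state = [
            x
            + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


def noisy_action(actor_output, noise, low=-1.0, high=1.0):
    """Add exploration noise to the actor output and clip to [low, high]."""
    perturbed = [a + n for a, n in zip(actor_output, noise.sample())]
    return [min(high, max(low, a)) for a in perturbed]
```

Because each sample depends on the previous one, consecutive perturbations push the vehicle in a similar direction for a while, which is one common argument for preferring this process over independent Gaussian noise in motion control.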

III-B Soft Actor Critic (SAC)

We also developed a stochastic approach for comparison, based on the Soft Actor-Critic algorithm [haarnoja2018soft]. It also consists of an actor-critic system, combining off-policy updates with a stochastic actor to learn continuous action space policies. It does so by using neural networks as function approximators to learn a policy. However, in the SAC algorithm, exploration comes from sampling the current stochastic policy instead of adding the noise used in DDPG. By acting without added noise, it tends to provide better stability and performance. The learning speed also tends to be higher since the algorithm encourages the agent to explore new states. It uses the Bellman equation with neural networks as function approximators to maximize the expected reward together with the entropy of the policy.
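As a rough illustration of how a stochastic actor replaces added noise, the sketch below samples a tanh-squashed Gaussian action and forms an entropy-regularised Bellman target. It is a simplified single-critic form written for this text; the function names and the `gamma` and `alpha` values are assumptions, not the settings used in this work.

```python
import math
import random


def sample_squashed_gaussian(mean, log_std, rng):
    """Sample a = tanh(u) with u ~ N(mean, std^2); return (action, log_prob)."""
    action = []
    log_prob = 0.0
    for m, ls in zip(mean, log_std):
        std = math.exp(ls)
        eps = rng.gauss(0.0, 1.0)
        u = m + std * eps
        a = math.tanh(u)
        # Log-density of the Gaussian sample u.
        log_prob += -0.5 * eps * eps - ls - 0.5 * math.log(2.0 * math.pi)
        # Change-of-variables correction for the tanh squashing.
        log_prob -= math.log(1.0 - a * a + 1e-6)
        action.append(a)
    return action, log_prob


def soft_q_target(reward, done, next_q, next_log_prob, gamma=0.99, alpha=0.2):
    """Target r + gamma * (Q(s', a') - alpha * log pi(a'|s')) for non-terminal steps."""
    return reward + gamma * (1.0 - done) * (next_q - alpha * next_log_prob)
```

The `alpha * next_log_prob` term is what rewards the agent for keeping its policy uncertain, so exploration is part of the objective rather than an external noise source.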

Figure 1: Network structures used in our comparison study: (a) MLP 2, (b) MLP 3, (c) MLP 4, (d) MLP 5, (e) LSTM.

III-C Simulated Environments

The algorithms were implemented in simulation together with ROS and the Gazebo simulator. We used the aerial mobile vehicle presented in [grando2022double]. This vehicle was described using the RotorS framework [furrer2016rotors] and was based on the real vehicle presented in [grando2020visual]. Its low-dimensional sensing was given by a simulated LIDAR based on the UST-10LX model. It provides a sensing distance of 10 meters with a 270° range and 0.25° of resolution.

We developed two environments with dimensions of 10 × 10 × 6 meters. The first one is a simple environment that has no obstacles, only walls that limit the scenario. Our idea for using this scenario was to guarantee that all the versions of the ANNs used were able to learn the task of navigating to a target point, so as to have a fair comparison. The second one has a few obstacles added to it, to make it more difficult to learn and to test the ability of the agent to avoid obstacles.

III-D Reward Function

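A simple binary reward function was used: a positive reward is given if the agent reaches the goal, and a negative reward is given if the robot collides with the walls or does not reach the goal within the 500-step limit. The function can be described as follows:

$$r(s_t, a_t) = \begin{cases} r_{arrive} & \text{if } d_t < c_d \\ r_{collide} & \text{if } min_x < c_o \text{ or } ep = 500 \end{cases}$$

where $d_t$ is the distance to the goal, $min_x$ is the smallest range reading and $ep$ is the step count of the episode. The positive reward $r_{arrive}$ is 100, while the negative reward $r_{collide}$ is -10. The $c_d$ and $c_o$ distances were both set to the same value in meters.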
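The reward rule described above can be sketched as a small function. The reward values and the step limit come from the text; the function name, the threshold arguments and the zero reward on ordinary steps are assumptions made for illustration, and the thresholds are left as parameters because their value is not given here.

```python
R_ARRIVE = 100.0   # positive reward stated in the text
R_COLLIDE = -10.0  # negative reward stated in the text
MAX_STEPS = 500    # episode step limit stated in the text


def step_reward(dist_to_goal, min_range, step, goal_threshold, collision_threshold):
    """Return (reward, event) for one step.

    event is "goal", "fail" or None. The two thresholds are hypothetical
    parameters, and the 0.0 reward on ordinary steps is an assumption.
    """
    if dist_to_goal < goal_threshold:
        return R_ARRIVE, "goal"
    if min_range < collision_threshold or step >= MAX_STEPS:
        return R_COLLIDE, "fail"
    return 0.0, None
```

Keeping the event label separate from the numeric reward lets the training loop decide what happens next, for example drawing a new goal after a success.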

III-E Networks Structure

The input for the networks has a total of 26 values: 20 samples from the distance sensor, the three previous actions and three values related to the target goal, which are the vehicle’s relative position to the target and the relative angles to the target in the x-y plane and in the z-distance plane. The only exception is the CNN architecture, where 270 samples from the sensor were used instead of 20. The network outputs are the linear velocity and the variation of the vehicle’s yaw ($\Delta$ yaw) that will be sent to the robot. Both actions are normalized to fixed ranges, one for the linear velocity and one for the $\Delta$ yaw. All network structures were inspired by related works that deal with low-dimensional sensing data as inputs.
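A minimal sketch of how such a 26-value input could be assembled is shown below. How the full scan is reduced to 20 samples is not stated in the text, so picking evenly spaced readings is an assumption, as are the function names.

```python
def downsample_ranges(ranges, n_samples=20):
    """Pick n_samples evenly spaced readings from a full LIDAR scan (assumed scheme)."""
    if n_samples < 2 or len(ranges) < n_samples:
        raise ValueError("need at least n_samples readings and n_samples >= 2")
    step = (len(ranges) - 1) / (n_samples - 1)
    return [ranges[round(i * step)] for i in range(n_samples)]


def build_state(ranges, previous_actions, target_info):
    """Concatenate 20 range samples, 3 previous actions and 3 target values."""
    if len(previous_actions) != 3 or len(target_info) != 3:
        raise ValueError("expected three previous actions and three target values")
    state = downsample_ranges(ranges, 20) + list(previous_actions) + list(target_info)
    assert len(state) == 26
    return state
```

For a 270° scan at 0.25° resolution, the raw scan holds roughly a thousand readings, so this reduction is what keeps the state low-dimensional.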

We used six distinct ANN architectures: four of them based on the MLP fully connected architecture, one based on the LSTM architecture and one based on a CNN architecture. For all actor-critic network structures, we fixed the critic network with a standard of three hidden layers with 512 neurons each, while the actor network varies, as shown in Figure 1. The main idea is to evaluate the impact of the network structure on the agent’s ability to provide the actions, not to evaluate if the agent is capable of learning, which is the main goal of the critic network. Finally, we used ReLU activation in the hidden layers and Tanh activation in the output layer.
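The varying actor depth can be sketched with a small pure-Python MLP that uses ReLU in the hidden layers and Tanh at the output. The hidden width of 64 units, the weight initialisation and the function names are illustrative assumptions; the text does not give the actor layer sizes.

```python
import math
import random


def init_mlp(layer_sizes, rng):
    """Create [(weights, biases), ...] with small random weights and zero biases."""
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        bound = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
                   for _ in range(n_out)]
        params.append((weights, [0.0] * n_out))
    return params


def build_actor(n_hidden_layers, hidden_units=64, state_dim=26, action_dim=2, seed=0):
    """Actor with a configurable number of hidden layers (width is hypothetical)."""
    sizes = [state_dim] + [hidden_units] * n_hidden_layers + [action_dim]
    return init_mlp(sizes, random.Random(seed))


def actor_forward(params, state):
    """Forward pass: ReLU on hidden layers, Tanh on the output layer."""
    x = list(state)
    last = len(params) - 1
    for index, (weights, biases) in enumerate(params):
        z = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        x = [math.tanh(v) if index == last else max(0.0, v) for v in z]
    return x
```

Calling `build_actor(2)` through `build_actor(5)` gives actors that differ only in depth, which mirrors the kind of comparison made between the MLP structures.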

IV Experimental Results

For training, we generated random target goals towards which the agent should navigate. Each episode was limited to five hundred steps and could end earlier if the agent collided with an obstacle or with the scenario border. If the agent reached the goal before the 500-step limit, a new goal was generated in the same episode. In this case, the total reward of an episode could exceed the value of 100 given for a single goal. The same learning rate was used for all approaches, with a minibatch of 256 samples and the Adam optimizer. In the first scenario, we limited training to 1000 episodes, while the agent was trained for 1500 episodes in the second. These limits were chosen based on the stagnation of the maximum average reward received.
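The episode logic described above, including why the total reward can exceed 100, can be sketched as follows. The callbacks `step_fn` and `new_goal_fn` are hypothetical stand-ins for the simulator interface, which is not described at this level of detail in the text.

```python
def run_episode(step_fn, new_goal_fn, max_steps=500):
    """Run one episode and return (total_reward, goals_reached, steps_taken).

    step_fn() -> (reward, event), with event in {"goal", "fail", None}.
    new_goal_fn() draws a new random target. Both are hypothetical callbacks.
    """
    total_reward = 0.0
    goals_reached = 0
    steps_taken = 0
    new_goal_fn()
    for _ in range(max_steps):
        reward, event = step_fn()
        total_reward += reward
        steps_taken += 1
        if event == "goal":
            # The episode continues with a fresh goal, so rewards accumulate.
            goals_reached += 1
            new_goal_fn()
        elif event == "fail":
            break
    return total_reward, goals_reached, steps_taken
```

An agent that reaches two goals before colliding would end such an episode with 100 + 100 - 10 = 190, which is why average rewards above 100 appear in the results.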

IV-A Results

In this section, the results obtained during our evaluation are shown. For each scenario and model, an extensive set of statistics was collected. The evaluation was two-fold, with a goal-oriented navigation task and a waypoint navigation task. Each task was performed for 100 trials and the total number of successful trials was recorded. The average navigation time and its standard deviation were also recorded.
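The recorded statistics can be computed with a few lines of standard-library Python. This is a sketch under the assumption that the time statistics are taken over successful trials only, which the text does not state; the function name and input layout are also illustrative.

```python
import statistics


def summarize_trials(trials):
    """Summarise (success, time_s) pairs into (rate_percent, mean_time, std_time).

    Time statistics use successful trials only (an assumption).
    """
    if not trials:
        raise ValueError("at least one trial is required")
    times = [time_s for success, time_s in trials if success]
    rate = 100.0 * len(times) / len(trials)
    mean_time = statistics.mean(times) if times else None
    std_time = statistics.stdev(times) if len(times) > 1 else None
    return rate, mean_time, std_time
```

With 100 trials per task, the success rate returned here corresponds directly to the percentage reported in the Rate column of the tables.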

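| Env | Algorithm | Rate | Average Time (s) |
|-----|-----------|------|------------------|
| 1 | DDPG 2 | 100% | |
| 1 | DDPG 3 | 100% | |
| 1 | DDPG 4 | 100% | |
| 1 | DDPG 5 | 0% | |
| 2 | DDPG 2 | 65% | |
| 2 | DDPG 3 | 35% | |
| 2 | DDPG 4 | 39% | |
| 2 | DDPG 5 | 0% | |
| 1 | SAC 2 | 100% | |
| 1 | SAC 3 | 93% | |
| 1 | SAC 4 | 100% | |
| 1 | SAC 5 | 97% | |
| 1 | SAC LSTM | 100% | |
| 1 | SAC CONV | 100% | |
| 2 | SAC 2 | 13% | |
| 2 | SAC 3 | 74% | |
| 2 | SAC 4 | 60% | |
| 2 | SAC 5 | 32% | |
| 2 | SAC LSTM | 35% | |

Table I: Statistics of goal-oriented navigation 2D
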
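| Env | Algorithm | Rate | Average Time (s) | Distance |
|-----|-----------|------|------------------|----------|
| 1 | DDPG 2 | 100% | | 100% |
| 1 | DDPG 3 | 100% | | 100% |
| 1 | DDPG 4 | 95% | | 97.376% |
| 1 | DDPG 5 | 0% | | 0% |
| 1 | DDPG LSTM | 84% | | 87.75% |
| 1 | DDPG CONV | 39% | | 67.89% |
| 2 | DDPG 2 | 1% | | 46.734% |
| 2 | DDPG 3 | 7% | | 20.816% |
| 2 | DDPG 4 | 0% | | 10% |
| 2 | DDPG 5 | 0% | | 0% |
| 2 | DDPG LSTM | 0% | | 7.142% |
| 2 | DDPG CONV | 0% | | 0.61% |
| 1 | SAC 2 | 2% | | 32.21% |
| 1 | SAC 3 | 54% | | 73.76% |
| 1 | SAC 4 | 68% | | 83.09% |
| 1 | SAC 5 | 48% | | 74.63% |
| 1 | SAC LSTM | 100% | | 100% |
| 1 | SAC CONV | 0% | | 50.43% |
| 2 | SAC 2 | 0% | | 11.224% |
| 2 | SAC 3 | 0% | | 12.65% |
| 2 | SAC 4 | 0% | | 14.285% |
| 2 | SAC 5 | 9% | | 28.775% |
| 2 | SAC LSTM | 1% | | 12.85% |
| 2 | SAC CONV | 0% | | 0.204% |

Table II: Statistics of waypoint goal-oriented navigation 2D
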
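| Env | Algorithm | Rate | Average Time (s) |
|-----|-----------|------|------------------|
| 1 | DDPG 2 | 100% | |
| 1 | DDPG 3 | 100% | |
| 1 | DDPG 4 | 100% | |
| 1 | DDPG 5 | 0% | |
| 1 | DDPG LSTM | 100% | |
| 2 | DDPG 2 | 8% | |
| 2 | DDPG 3 | 17% | |
| 2 | DDPG 4 | 24% | |
| 2 | DDPG 5 | 0% | |
| 1 | SAC 2 | 100% | |
| 1 | SAC 3 | 100% | |
| 1 | SAC 4 | 91% | |
| 1 | SAC 5 | 100% | |
| 1 | SAC LSTM | 100% | |
| 2 | SAC 2 | 5% | |
| 2 | SAC 3 | 19% | |
| 2 | SAC 4 | 29% | |
| 2 | SAC 5 | 8% | |

Table III: Statistics of goal-oriented navigation 3D
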
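| Env | Algorithm | Rate | Average Time (s) | Distance |
|-----|-----------|------|------------------|----------|
| 1 | DDPG 2 | 100% | | 100% |
| 1 | DDPG 3 | 77% | | 86.44% |
| 1 | DDPG 4 | 99% | | 98.396% |
| 1 | DDPG 5 | 0% | | 1.02% |
| 1 | DDPG LSTM | 99% | | 98.97% |
| 1 | DDPG CONV | 0% | | 0.62% |
| 2 | DDPG 2 | 0% | | 2.24% |
| 2 | DDPG 3 | 0% | | 2.24% |
| 2 | DDPG 4 | 0% | | 4.897% |
| 2 | DDPG 5 | 0% | | 0% |
| 2 | DDPG LSTM | 0% | | 8.36% |
| 2 | DDPG CONV | 0% | | 0% |
| 1 | SAC 2 | 0% | | 37.463% |
| 1 | SAC 3 | 10% | | 47.084% |
| 1 | SAC 4 | 52% | | 68.22% |
| 1 | SAC 5 | 26% | | 55.685% |
| 1 | SAC LSTM | 94% | | 95.34% |
| 1 | SAC CONV | 0% | | 1.603% |
| 2 | SAC 2 | 0% | | 0.204% |
| 2 | SAC 3 | 0% | | 3.469% |
| 2 | SAC 4 | 3% | | 16.938% |
| 2 | SAC 5 | 1% | | 3.06% |
| 2 | SAC LSTM | 0% | | 1.224% |
| 2 | SAC CONV | 0% | | 0% |

Table IV: Statistics of waypoint goal-oriented navigation 3D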

V Discussion

In general, the extensive validation of the various models created and tested shows that both agents are flexible regarding the type of ANN used. It can be concluded that the DDPG-based approach performs better in an obstacle-free scenario, while the opposite occurs with the SAC-based approaches. Figure 2 shows the final reward of each model in each context and scenario.

Figure 2: Comparative reward: (a) 2D navigation, scenario 1; (b) 2D navigation, scenario 2; (c) 3D navigation, scenario 1; (d) 3D navigation, scenario 2.

It can be observed that the larger and more complex the network, the greater the average reward tends to be, as is the case for the models with LSTM and CNN. It is important to note that this happens because a greater number of navigations are performed in each episode, not because the models are better or worse. All models with an average reward greater than 100 can be considered functional. The higher number of navigations is due to the slower step of more complex networks, which allows the task to be completed and optimized with a smaller number of actions. This further reinforces the importance of focusing on a simple reward system like the one proposed in this work.

Figure 3 compares the average time to perform the first task in the 2D context, the context where the approaches presented average results close to the maximum possible for all structures. It shows the characteristics of each approach in more detail. The SAC approach has a similar average time across the structures, while the DDPG-based approach varies with greater intensity. This is due to the greater generalization capability that a stochastic method such as SAC provides, while DDPG can be very good for specific structures. In general, with respect to time, it is possible to conclude that the approach based on SAC tends to be, on average, a little slower and more predictable, while the opposite occurs with approaches based on DDPG.

Figure 3: Comparison of average time for 2D navigation in the first scenario (more stable).

Figure 4 shows the average distance for the second task in the second scenario, also in the 2D context, as it is more generally stable between the structures. It also shows the characteristics of each approach in more detail. The DDPG-based approach performs better with two layers, and its performance drops with increasing network complexity. Meanwhile, the SAC approach presents better results with more complex network structures, with performance increasing as the number of layers increases, for example. This is due to the greater ability of the stochastic SAC-based method to generalize and to create greater gradients. In general, it can be concluded that the larger the network, the better the performance of agents based on SAC tends to be, while the opposite occurs with DDPG.

Figure 4: Comparison of average time for 2D waypoint navigation in the first scenario (more stable).

The limit for complexity, however, appears to be close to the proposed convolutional model. As can be seen in Figure 4 and also in the results for the 3D context, both approaches with CNN failed to learn to perform the tasks. A possible solution is the use of contrastive networks [de2022depth]. Contrastive networks with Deep-RL can be a way not only to solve this problem with CNNs, but also to optimize the problem of this work as a whole.

VI Conclusions

In this paper, we presented a comparative analysis of deterministic and stochastic algorithms for low-dimensional sensing-based mapless navigation-related tasks for mobile robots. We discussed how the agent’s deep neural network affects performance while executing the tasks. We can conclude that the depth of the neural network tends to reduce the efficiency of deterministic approaches in general, while the opposite tends to happen with stochastic approaches. We can also conclude that low-dimensional sensing is better suited to use in Deep-RL for continuous control tasks in general. Future work will also examine the effect of the critic’s neural network, to evaluate how it impacts the learning of the policy itself.


The authors would like to thank the VersusAI team. This work was partly supported by the CAPES, CNPq and PRH-ANP.