Supplementary Material
I Introduction
Many problems in robotics can be expressed as Reinforcement Learning (RL) problems. RL techniques allow a robot to progressively learn to excel at a distinct task, such as motion-based tasks. Through trial-and-error interactions with an environment, an agent receives feedback in terms of a scalar objective function that guides it step by step through the learning process [kober2013reinforcement]. This can be approached with many learning policies, which can be framed into two groups: deterministic and stochastic.
More recently, RL techniques have been further improved by using deep neural networks. In this case, the agent of Deep Reinforcement Learning (Deep-RL) becomes a neural network, which increases its ability to learn complex behaviors, such as those needed to perform navigation tasks in complex environments. Techniques based on Deep-RL have been used extensively to improve navigation-related tasks for a range of mobile vehicles, including terrestrial mobile robots [ota2020efficient, jesus2021soft], aerial robots [tong2021uav, grando2022double] and underwater robots [carlucho2018]. These approaches diverge in the choice of the ANN, ranging from Multi-Layer Perceptron (MLP) network structures to Convolutional Neural Networks (CNN). Most of them have achieved interesting results not only in performing mapless navigation-related tasks but also in obstacle avoidance and even medium transition for hybrid vehicles [bedin2021deep, de2022depth]. However, the learning performance can be affected not only by the choice of the Deep-RL technique but also by the agent's ANN structure.
In this work, we further explore the use of distinct deep ANN architectures for two state-of-the-art Deep-RL algorithms: Deep Deterministic Policy Gradient (DDPG) [lillicrap2015continuous] and Soft Actor-Critic (SAC) [haarnoja2018soft]; a deterministic one and a stochastic one, respectively. We perform a study showing how the deep network structure and the depth of the network impact the agent's learning. We perform an evaluation using an Unmanned Aerial Vehicle (UAV) in tasks related to goal-oriented mapless navigation using low-dimensional data. We perform a twofold evaluation, taking into account navigation time and distance navigated to enrich the results.
This work contains the following main contributions:

We present a comparative analysis of how the agent's deep neural network affects the performance of deterministic and stochastic algorithms in mapless navigation-related tasks for mobile robots.

We show that low-dimensional sensing is better suited for use in Deep-RL for continuous control tasks in general.

We also show that increasing network depth tends to degrade the performance of deterministic approaches in general, while stochastic approaches tend to benefit from deeper networks.

We provide a framework with a simulated environment and two approaches based on state-of-the-art actor-critic Deep-RL algorithms, with a range of structures that can be successfully adapted to perform mapless navigation of mobile robots using only range data readings and the vehicle's relative localization data.
II Related Work
Several Deep-RL works in robotics have already been presented, discussing how efficiently these methods can be used in problems related to motion control with low-dimensional sensing information [tai2017virtual, bedin2021deep]. For a terrestrial mobile robot, Tai et al. [tai2017virtual] used ten samples of range findings and the relative distance of the vehicle to a target to perform navigation through obstacles. The DDPG algorithm used learned to navigate effectively to a target. Recently, Deep-RL methods have also been successfully used in robotics by De Jesus et al. [jesus2019deep, jesus2021soft] and others, who accomplished mapless navigation-related tasks for terrestrial mobile robots using simple state information.
Singh and Thongam [singh2018mobile] show that a Multi-Layer Perceptron (MLP) can be used for mapless navigation of terrestrial mobile robots in dynamic environments. Their method used MLP and Recurrent Neural Networks to decide the robot's speed for each motion. They concluded that the approach is efficient in guiding the robot to a target position.
For aerial mobile robots, the use of Deep-RL is still limited. Rodriguez et al. [rodriguez2018deep] used a DDPG-based approach to teach an agent to land on a moving platform. Their approach drew on image data, but the agent was fed simplified information. It used Deep-RL in simulation with the RotorS framework [furrer2016rotors] and the Gazebo simulator. Grando et al. [grando2020deep] presented a DDPG and a SAC approach in Gazebo for 2D UAV navigation. Recently, double-critic-based Deep-RL has also been used for UAVs [grando2022double]. All of these use information from ranging sensors in a simple state-information model for the agent.
Two works have recently tackled the navigation problem with medium transition for hybrid mobile robots [de2022depth, bedin2021deep]. Grando et al. [bedin2021deep] presented Deep-RL approaches with an MLP architecture, developed using distance sensing information for aerial and underwater navigation. De Jesus et al. [de2022depth] tackled the motion problem for this kind of vehicle using image information, combined with contrastive learning that uses a decoder to simplify the image information fed to the agent.
Building on these works that used simple state information, we present a comparative analysis of how the agent's deep neural network affects the performance of Deep-RL in continuous motion tasks for mobile robots. We aim to identify the best architecture for each kind of algorithm. The environment used for testing is the Gazebo simulator with a description of a real-world aerial mobile robot.
III Methodology
In this section, we discuss the Deep-RL approaches used in this work and the aerial mobile robot employed. We detail the structure of all networks used to perform the comparison for both deterministic and stochastic agents.
III-A Deep Deterministic Policy Gradient
Deep Deterministic Policy Gradient (DDPG) [lillicrap2015continuous] has two main deep neural networks: an actor network that provides the action value chosen by the policy, and a critic network that estimates the value of that action. Each network also learns a target copy that gives stability to the learning process [lillicrap2015continuous]. The observation of the current state is the input of the actor network, which outputs a value in a continuous action space. At the same time, the critic network uses the current state and the agent's action to provide the Q-value to the agent.
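The target copies are typically updated by Polyak averaging of the online weights. A minimal sketch of this update, assuming a list-of-arrays weight representation; the value of `tau` is a common default, not taken from this work:

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.005):
    """Blend online weights into the slowly moving target copy.

    Because the target networks change slowly, the bootstrapped
    Q-value targets stay stable during learning.
    """
    return [tau * w + (1.0 - tau) * w_t
            for w_t, w in zip(target_params, online_params)]

# Toy example with a single weight matrix per network.
online = [np.ones((2, 2))]
target = [np.zeros((2, 2))]
target = soft_update(target, online, tau=0.1)
```

With `tau=0.1`, each entry of the target moves 10% of the way toward the online weights per update.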
This method provides good performance for continuous control but faces a challenging exploration problem. Since it is deterministic, DDPG needs an exploration policy to avoid learning stagnation. This can be solved by adding a noise process to the actor policy, defined as:
$a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t$ (1)
where $\mathcal{N}_t$ is the chosen noise process. The Ornstein-Uhlenbeck process [uhlenbeck1930theory] is typically used and provides good exploration.
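A minimal sketch of this exploration noise in code; the parameters `theta`, `sigma` and `dt` are common defaults, not values reported in this work, and the clipping bounds are illustrative:

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise added to the actor's action,
    as in Eq. (1)."""

    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.x = np.full(dim, mu, dtype=float)

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt)
              * self.rng.standard_normal(self.x.shape))
        self.x = self.x + dx
        return self.x

noise = OrnsteinUhlenbeckNoise(dim=2)
# Perturb a deterministic action, then clip to illustrative bounds.
noisy_action = np.clip(np.zeros(2) + noise.sample(), -1.0, 1.0)
```

Because consecutive samples are correlated, the perturbed actions drift smoothly rather than jittering, which suits velocity commands.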
III-B Soft Actor-Critic (SAC)
We also developed a stochastic approach for comparison, based on the Soft Actor-Critic (SAC) algorithm [haarnoja2018soft]. It also consists of an actor-critic system with off-policy updates, but uses a stochastic actor-critic method to learn continuous action-space policies, employing neural networks as function approximators. However, in SAC the current stochastic policy is used instead of the added noise of DDPG. By acting without noise, it tends to provide better stability and performance. The learning speed also tends to be higher, since the algorithm encourages the agent to explore new states. It uses the Bellman equation with neural networks as function approximators while maximizing the entropy of the policy.
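To make the contrast with DDPG's added noise concrete, a SAC actor head typically samples from a Gaussian and squashes it through tanh, so the stochasticity itself drives exploration. A minimal sketch under that standard formulation; function and variable names are ours, not from this work:

```python
import numpy as np

def sample_squashed_gaussian(mu, log_std, rng):
    """Sample a bounded stochastic action the way a SAC actor head does:
    draw from a Gaussian and squash through tanh so the action stays in
    (-1, 1). Returns the action and its log-probability, including the
    tanh change-of-variables correction used by SAC's entropy term."""
    std = np.exp(log_std)
    eps = rng.standard_normal(mu.shape)
    pre_tanh = mu + std * eps
    action = np.tanh(pre_tanh)
    # log N(pre_tanh; mu, std), summed over action dimensions...
    log_prob = (-0.5 * eps ** 2 - log_std - 0.5 * np.log(2 * np.pi)).sum()
    # ...minus the tanh Jacobian term (small epsilon avoids log(0)).
    log_prob -= np.log(1.0 - action ** 2 + 1e-6).sum()
    return action, log_prob

rng = np.random.default_rng(0)
a, logp = sample_squashed_gaussian(np.zeros(2), np.log(0.2) * np.ones(2), rng)
```

The entropy bonus rewards policies whose `log_prob` stays low on average, which is what keeps the agent exploring new states.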
III-C Simulated Environments
The algorithms were implemented in simulation with ROS and the Gazebo simulator. We used the aerial mobile vehicle presented in [grando2022double]. This vehicle was described using the RotorS framework [furrer2016rotors] and was based on the real vehicle presented in [grando2020visual]. Its low-dimensional sensing is given by a simulated LIDAR based on the UST 10LX model, which provides 10 meters of distance sensing over a 270° range with 0.25° of resolution.
We developed two environments with dimensions of 10×10×6 meters. The first is a simple environment with no obstacles, only walls that bound the scenario. Our idea in using this scenario was to guarantee that all versions of the ANNs were able to learn the task of navigating to a target point, ensuring a fair comparison. The second has a few obstacles added to make learning more difficult and to test the agent's ability to avoid obstacles.
III-D Reward Function
A simple binary reward function was used: a positive reward is given if the agent reaches the goal, and a negative reward is given if the robot collides with the walls or does not reach the goal within a 500-step limit. The function can be described as follows:
$r(s_t, a_t) = \begin{cases} r_{arrive} & \text{if } d_t < c_d \\ r_{collide} & \text{if } \min(x_t) < c_o \text{ or } ep \geq 500 \end{cases}$ (2)
The arrival reward $r_{arrive}$ is 100, while the collision reward $r_{collide}$ is $-10$. Both the goal distance threshold $c_d$ and the collision distance threshold $c_o$ were set in meters.
III-E Network Structures
The input for the networks has a total of 26 values: 20 samples from the distance sensor, the three previous actions, and three values related to the target goal, namely the vehicle's relative distance to the target and the relative angles to the target in the x-y plane and the z-distance plane. The only exception is the CNN architecture, where 270 samples from the sensor were used instead of 20. The network outputs are the linear velocity and the variation of the vehicle's yaw (Δyaw) that are sent to the robot. Both actions are normalized to fixed ranges. All network structures were inspired by related works that deal with low-dimensional sensing data as inputs.
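The state assembly described above can be sketched as follows, downsampling the 270 LIDAR readings to 20 evenly spaced samples; the function and argument names are illustrative, not from this work:

```python
import numpy as np

def build_state(laser_270, prev_actions, dist, angle_xy, angle_z):
    """Assemble the 26-value network input described in the text:
    20 evenly spaced range samples, the 3 previous actions, and
    3 target-related values (distance, x-y angle, z angle)."""
    idx = np.linspace(0, len(laser_270) - 1, 20).astype(int)
    ranges = np.asarray(laser_270, dtype=float)[idx]
    return np.concatenate([ranges, prev_actions, [dist, angle_xy, angle_z]])

# Example: max-range readings everywhere, zeroed previous actions.
state = build_state(np.full(270, 10.0), np.zeros(3), 2.0, 0.1, -0.05)
```

Even spacing keeps the downsampled readings representative of the full 270° field of view.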
We used six distinct ANN architectures: four based on the fully connected MLP architecture, one based on the LSTM architecture, and one based on a CNN architecture. For all actor-critic network structures, we fixed the critic network at a standard of three hidden layers with 512 neurons each, while the actor network varies, as shown in Figure 1. The main idea is to evaluate the impact of the network structure on the agent's ability to provide the actions, not to evaluate whether the agent is capable of learning, which is the main goal of the critic network. Finally, we used ReLU activation in the hidden layers and Tanh activation in the output layer.
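A minimal sketch of a varying-depth MLP actor with the 512-unit width and ReLU/Tanh activations described above. The weights here are randomly initialized for shape illustration only; a training framework would learn them:

```python
import numpy as np

def mlp_actor(state, depth=3, hidden=512, out_dim=2, seed=0):
    """Forward pass of an MLP actor: `depth` hidden layers of
    `hidden` ReLU units and a Tanh output head producing the
    linear velocity and yaw variation."""
    rng = np.random.default_rng(seed)
    x = np.asarray(state, dtype=float)
    for _ in range(depth):
        w = rng.normal(0.0, np.sqrt(2.0 / x.size), (x.size, hidden))
        x = np.maximum(0.0, x @ w)           # ReLU hidden layer
    w_out = rng.normal(0.0, np.sqrt(1.0 / x.size), (x.size, out_dim))
    return np.tanh(x @ w_out)                # Tanh-bounded actions

action = mlp_actor(np.ones(26), depth=2)
```

Varying `depth` from 2 to 5 reproduces the four MLP variants compared in the tables.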
IV Experimental Results
For training, we generated target goals randomly, towards which the agent should navigate. Five hundred steps was the limit defined for each episode, which could end earlier if the agent collided with an obstacle or with the scenario border. A new goal was generated in the same episode if the agent reached the goal before the 500-step limit; in this case, the total reward could exceed the maximum value of 100. A fixed learning rate was used, with a minibatch of 256 samples and the Adam optimizer, for all approaches. In the first scenario, we limited training to 1000 episodes, while the agent was trained for 1500 episodes in the second. These limits were chosen based on the stagnation of the maximum average reward received.
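The episode structure described above can be sketched as follows. `DummyEnv` is a stand-in of ours for the Gazebo/RotorS environment, with an arbitrary per-step termination probability modeling goal arrival; re-spawning goals within one episode is omitted for brevity:

```python
import numpy as np

class DummyEnv:
    """Stand-in for the simulated environment: 26-value observations,
    episodes that end early (modeled as a small random chance per
    step standing in for reaching the goal) or at the step limit."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.steps = 0
        return self.rng.uniform(-1.0, 1.0, 26)

    def step(self, action):
        self.steps += 1
        ended_early = self.rng.random() < 0.01   # "goal reached"
        done = ended_early or self.steps >= 500
        reward = 100.0 if ended_early else (-10.0 if done else 0.0)
        return self.rng.uniform(-1.0, 1.0, 26), reward, done

def run_episode(env, policy, step_limit=500):
    """One episode: at most 500 steps, ending early on goal arrival."""
    state, total = env.reset(), 0.0
    for _ in range(step_limit):
        state, r, done = env.step(policy(state))
        total += r
        if done:
            break
    return total

episode_return = run_episode(DummyEnv(), lambda s: np.zeros(2))
```

In the real setup, the policy argument would be the DDPG or SAC actor and the transitions would be stored in a replay buffer between steps.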
IV-A Results
In this section, we present the results obtained during our evaluation. For each scenario and model, an extensive set of statistics was collected. The evaluation was done in a twofold manner: goal-oriented navigation and waypoint navigation. Each task was performed for 100 trials, and the total number of successful trials was recorded, along with the average navigation time and its standard deviation.
Env  Algorithm  Rate  Average Time (s) 

1  DDPG 2  100%  
1  DDPG 3  100%  
1  DDPG 4  100%  
1  DDPG 5  0%  
1  DDPG LSTM  95%  
1  DDPG CONV  84%  
2  DDPG 2  65%  
2  DDPG 3  35%  
2  DDPG 4  39%  
2  DDPG 5  0%  
2  DDPG LSTM  21%  
2  DDPG CONV  0%  
1  SAC 2  100%  
1  SAC 3  93%  
1  SAC 4  100%  
1  SAC 5  97%  
1  SAC LSTM  100%  
1  SAC CONV  100%  
2  SAC 2  13%  
2  SAC 3  74%  
2  SAC 4  60%  
2  SAC 5  32%  
2  SAC LSTM  35%  
2  SAC CONV  0% 
Env  Algorithm  Rate  Average Time (s)  Distance 

1  DDPG 2  100%  100%  
1  DDPG 3  100%  100%  
1  DDPG 4  95%  97.376%  
1  DDPG 5  0%  0%  
1  DDPG LSTM  84%  87.75%  
1  DDPG CONV  39%  67.89%  
2  DDPG 2  1%  46.734%  
2  DDPG 3  7%  20.816%  
2  DDPG 4  0%  10%  
2  DDPG 5  0%  0%  
2  DDPG LSTM  0%  7.142%  
2  DDPG CONV  0%  0.61%  
1  SAC 2  2%  32.21%  
1  SAC 3  54%  73.76%  
1  SAC 4  68%  83.09%  
1  SAC 5  48%  74.63%  
1  SAC LSTM  100%  100%  
1  SAC CONV  0%  50.43%  
2  SAC 2  0%  11.224%  
2  SAC 3  0%  12.65%  
2  SAC 4  0%  14.285%  
2  SAC 5  9%  28.775%  
2  SAC LSTM  1%  12.85%  
2  SAC CONV  0%  0.204% 
Env  Algorithm  Rate  Average Time (s) 

1  DDPG 2  100%  
1  DDPG 3  100%  
1  DDPG 4  100%  
1  DDPG 5  0%  
1  DDPG LSTM  100%  
1  DDPG CONV  0%  
2  DDPG 2  8%  
2  DDPG 3  17%  
2  DDPG 4  24%  
2  DDPG 5  0%  
2  DDPG LSTM  61%  
2  DDPG CONV  0%  
1  SAC 2  100%  
1  SAC 3  100%  
1  SAC 4  91%  
1  SAC 5  100%  
1  SAC LSTM  100%  
1  SAC CONV  0%  
2  SAC 2  5%  
2  SAC 3  19%  
2  SAC 4  29%  
2  SAC 5  8%  
2  SAC LSTM  5%  
2  SAC CONV  0% 
Env  Algorithm  Rate  Average Time (s)  Distance 

1  DDPG 2  100%  100%  
1  DDPG 3  77%  86.44%  
1  DDPG 4  99%  98.396%  
1  DDPG 5  0%  1.02%  
1  DDPG LSTM  99%  98.97%  
1  DDPG CONV  0%  0.62%  
2  DDPG 2  0%  2.24%  
2  DDPG 3  0%  2.24%  
2  DDPG 4  0%  4.897%  
2  DDPG 5  0%  0%  
2  DDPG LSTM  0%  8.36%  
2  DDPG CONV  0%  0%  
1  SAC 2  0%  37.463%  
1  SAC 3  10%  47.084%  
1  SAC 4  52%  68.22%  
1  SAC 5  26%  55.685%  
1  SAC LSTM  94%  95.34%  
1  SAC CONV  0%  1.603%  
2  SAC 2  0%  0.204%  
2  SAC 3  0%  3.469%  
2  SAC 4  3%  16.938%  
2  SAC 5  1%  3.06%  
2  SAC LSTM  0%  1.224%  
2  SAC CONV  0%  0% 
V Discussion
In general, the extensive validation of the various models created and tested shows that both agents are flexible regarding the type of ANN used. It can be concluded that the DDPG-based approach performs better in an obstacle-free scenario, while the opposite occurs with the SAC-based approaches. Figure 2 shows the final reward of each model in each context and scenario.
It can be observed that the larger and more complex the network, the greater the average reward tends to be, as, for example, for the models with LSTM and CNN. It is important to note that this is because a greater number of navigations is performed in each episode, not because the models are better or worse. All models with an average reward greater than 100 can be considered functional. The higher number of navigations is due to the slower simulation step with more complex networks, allowing the task to be completed and optimized with a smaller number of actions. This further reinforces the importance of focusing on a simple reward system like the one proposed in this work.
In Figure 3 it is possible to observe the comparison of the average time to perform the first task in the 2D context, the context where the approaches presented average results close to the maximum possible for all structures. It is interesting to observe in Figure 3 the characteristics of each approach in more detail. The SAC-based approach has a similar average time across the structures, while the DDPG-based approach varies with greater intensity. This is due to the greater generalization capability that a stochastically biased method such as SAC provides, while DDPG can be very good for specific structures. In general, with respect to time, it is possible to conclude that the SAC-based approach tends to be, on average, a little slower but more predictable, while the opposite occurs with DDPG-based approaches.
In Figure 4 it is also possible to observe the average distance for the second task in the second scenario, also in the 2D context, as it is more generally stable across the structures. From this illustration, it is interesting to observe the characteristics of each approach in more detail. It can be seen how the DDPG-based approach performs better with two layers and how the performance drops as network complexity increases. Meanwhile, the SAC-based approach presents better results with more complex network structures, with performance increasing as the number of layers grows. This is due to the greater generalization capability that the stochastic bias of SAC provides. In general, it can be concluded that the larger the network, the better the performance of SAC-based agents tends to be, while the opposite occurs with DDPG.
The limit for complexity, however, appears to lie close to the proposed convolutional model. As can be seen in Figure 4, and also in the results for the 3D context, both approaches with CNN failed to learn the tasks. A possible solution is contrastive networks [de2022depth]. The use of contrastive networks with Deep-RL may not only solve this problem with CNNs but also optimize the problem addressed in this work as a whole.
VI Conclusions
In this paper, we presented a comparative analysis of deterministic and stochastic algorithms for low-dimensional sensing-based mapless navigation-related tasks for mobile robots. We discussed how the agent's deep neural network affects performance while executing the tasks. We can conclude that increasing the depth of the neural network decreases the efficiency of deterministic approaches in general, while the opposite tends to happen with stochastic approaches. We can also conclude that low-dimensional sensing is better suited for use in Deep-RL for continuous control tasks in general. In future work, the effect of the critic's neural network will also be evaluated to determine how it impacts the learning of the policy itself.
Acknowledgment
The authors would like to thank the VersusAI team. This work was partly supported by CAPES, CNPq and PRH-ANP.