I Introduction
It is well established that Deep Reinforcement Learning (DRL)based local planners [freezing2][crowdsteer][everett2019collision] can navigate a nonholonomic/differential drive robot through pedestrian rich environments safely. They are effective in capturing and reacting to the evolution of the environment’s state over time, resulting in excellent dynamic obstacle avoidance capabilities. In addition, these methods employ inexpensive perception sensors such as RGBD cameras and simple 2D lidars. However, it is not guaranteed that the instantaneous velocities computed by DRLbased methods will be dynamically feasible. That is, the computed velocities must obey the nonholonomic and acceleration constraints of the robot, which could otherwise lead to highly nonsmooth and jerky trajectories.
Learning to choose smooth, dynamically feasible velocities during navigation happens through a DRL method’s reward function, where such trajectories and other desirable behaviors are rewarded, and collisions and nondesirable behaviors are penalized. However, a fully trained policy could overprioritize collision avoidance behavior over dynamic feasibility if the penalty for collision is not appropriately balanced with the reward for computing feasible velocities. Therefore, acceleration limits and the nonholonomic constraints of the robot are largely ignored. The policy learning such fundamental constraints is crucial especially when the robot navigates among humans and in dense scenes.
On the other hand, the Dynamic Window Approach (DWA) [DWA], a classic navigation algorithm accounts for the robot’s dynamics constraints and guarantees that the velocities in a space known as the dynamic window are collisionfree and feasible/achievable for the robot within a time horizon . However, DWA’s formulation only considers the robot’s sensor data at the current time instant to make decisions and that limits its performance in highly dynamic environments leading to potentially many collisions.
Our work provides a novel solution to mitigate the limitations of both DRLbased methods and DWA by using the benefit of one to address the shortcoming of the other.
Main Results: We present a DRLbased collision avoidance policy that utilizes a novel observation space formulation and a novel reward function that leads to more spatially aware, collisionfree, dynamically feasible velocities for navigation. We show that our approach has a superior performance compared to DWA and DRL methods like Crowdmove [crowdmove] and DenseCAvoid [densecavoid] in terms of collision rate, smoothness and time to reach the goal satisfying the dynamics constraints of the robot. The main contributions of our work are:

A novel formulation for the observation space based on the concept of dynamic window is used to train our DRLbased policy. The observation space is constructed by calculating the robot’s feasible velocity set at that time instant and the costs corresponding to using those velocities in the past n time instants. This formulation leads to a significantly lower dimension observation space compared to other DRL methods [crowdsteer] [densecavoid] while also embedding the time evolution of the environment’s state in the form of costs. This in turn leads to significantly lower training times, and easier simtoreal transfer of the fully trained policy.

A novel reward function that is shaped such that the robot’s navigation is more spatially aware. That is, the robot must navigate to a space where the probability of colliding with an obstacle is drastically reduced. This leads to the robot taking swifter trajectories around dynamic obstacles unlike DWA which might directly go in the dynamic obstacle’s way or stops for the obstacles to pass first. This reduces the collision rate by 33%.
Ii Related Work
Iia Traditional Collision Avoidance in Dynamic Scenes
Global collision avoidance methods [dijkstra, Astar, RRT, RRTSTAR]
provide an optimal trajectory for the entire route, but is generally offline that is not suitable for dynamic obstacles. On the other hand, vectorbased local approaches
[DWA] such as DWA use limited sensory information and are computationally efficient to avoid multiple dynamic obstacles. However, such methods can not be globally optimal.For instance, CurvatureVelocity method [curvatureVM] solves for obstacle avoidance velocities by formulating it as a constrained optimization problem incorporating goal and vehicle dynamics. It however suffers from local minima issues.
Another instance of such methods is an extended version of DWA [DWA], which models mobile agents in the environment as cells of an occupancy grid and checks for collision using D* graph search. Collisionfree velocities which are then generated using DWA on this grid, work well for linear agents, but not for agents with nonlinear trajectories.
DWA* [vspaceApproach] uses lookahead verification and navigates with high speed and smoothness to avoid local minimas that could arise due to nonconvex obstacles. But with limited sensing it is not guaranteed to be effective in every scenario. DW4DOT [DW4DOT&DW4DO] extends beyond the local window and computes possible paths in the future using a tree approach and is effective for dynamic agents. But it’s working frequency reduces with an increasing number of agents in the environment. In order to deal with environments with multiple dynamic agents DRLbased techniques are a viable option.
IiB DRLbased Collision Avoidance
There have been numerous works on DRLbased collision avoidance in recent years. Methods such as [HybridCollisionAvoidance] have used a deep doubleQ network for depth prediction from RGB images for collision avoidance in static environments whereas more advanced methods [monocularVisionRL]
have exploited Convolutional Neural Networks to model endtoend visuomotor navigation capabilities.
Obstacle avoidance in previously unseen scenarios filled with pedestrians is demonstrated in [DLnavigation]. A decentralized, scalable, sensor level collision avoidance proposed in [RL_uestimate] performed better with lesser freezing issues in a new hybrid architecture implemented in [Multirobot_collison_avoidance]. With the assumption that pedestrians’ motions aid in collision avoidance, a cooperative model between robot and pedestrians was proposed in [DBLP:journals/corr/abs180803841]. However, its good performance is limited to sparse crowds. An extension using LSTMs [LSTMobstacletrackingRL] to capture temporal information enabled it to be operational among a larger number of pedestrians.
Another class of works have focused on training policies that make the robot behave in a socially acceptable manner and avoid the wellstudied freezing robot problem. For instance, socially acceptable navigation models were developed using simple depth camera images and GAIL (General Adversarial Imitation Learning) in
[gail]. DenseCAvoid [densecavoid] utilized pedestrian prediction models to avoid potential collision spaces and reduces the occurrence of the freezing robot problem. However, the model is still susceptible to oscillations and provides no guarantees on generating dynamically feasible robot velocities.Iii Background
In this section we briefly provide details and definitions on the different concepts and components that are used in our work.
Iiia Symbols and Notations
A list of symbols frequently used in this work is shown in table I in the order in which they appear.
Symbols  Definitions 
k  Number of linear or angular velocities robot can execute at a time instant 
n  Number of past time instants used in observation space formulation 
Robot’s linear velocity and acceleration  
Robot’s angular velocity and acceleration  
dist()  Function gives the distance to the nearest obstacle from the trajectory generated by velocity vector() 
Robot’s current linear and angular velocity  
Maximum linear and angular braking deceleration  
Policy Network  
Coordinates wrt the odom frame  
Distance between the pedestrian and robot  
r  Reward for a specific robot action during training 
Robot’s radius  
g  Goal location 
IiiB Dynamic Window Approach
The Dynamic Window Approach (DWA) [DWA] mainly uses the following two stages to search for a collisionfree, and reachable [v, ] velocity vector in a 2dimensional velocity space known as the dynamic window. The dynamic window is a discrete space with number of [v, ] velocity vectors, where k is the number of linear and angular velocities that the robot can execute at any time instant.
IiiB1 Search Space
The goal of the first stage is to generate a space of reachable velocities for the robot. This stage involves the following steps.
Velocity Vectors Generation: In this step, according to the maximum linear and angular velocities the robot can attain, a set V of [v, ] vectors is generated. Each velocity vector in the set corresponds to an arc of different radius that the robot can move along. The equations describing the trajectory of the robot for different [v, ] vectors can be found in [DWA].
Admissible Velocities: After forming set V, for each , the distance to the nearest obstacle from its corresponding arc is computed. The [v, ] vector is considered admissible only if the robot is able to stop before it collides with the obstacle. The admissible velocity set is given by,
(1) 
dist(v, ), is the distance to the nearest obstacle on the arc.
Dynamic Window: The next step is to further prune the set to remove the velocities that are not achievable within a considering the robot’s linear and angular acceleration limits. This final set is called the dynamic window and is formulated as.
(2) 
IiiB2 Objective Function Optimization
In the second stage, the which maximizes the objective function defined in equation 3 is searched for in .
(3) 
For a [v, ] executed by the robot, heading() measures the robot’s progress towards the goal (more progress higher value), dist() measures the robot’s distance from the nearest obstacles (more distance higher value), and the vel() function checks that . and denote weighing constants that can be tuned by the user.
Obstacle information embedded in the velocity space is utilized to select the optimal velocity pair. The [v, ] vector computed by DWA may be a local minima. However, this issue can be mitigated if the connectivity of free space to the goal is known.
IiiC DRL Policy Training
DRLbased collision avoidance policies are usually trained in simulated environments using a robot which uses the said policy to perform certain actions based on environmental observations to earn some rewards
. The robot’s actions are represented by the velocities that it can execute, and the set of all the robot’s velocities is called its action space. The robot’s observation consists of information regarding the robot’s environment (such as the poses of obstacles), and the set of all observations that the robot can make is called its observation space. The policy is encoded as a deep neural network whose weights are updated during training.
The robot’s primary objective is to maximize a reward function by performing the actions which are rewarded and avoiding actions or behaviors which are penalized. A typical reward function involves terms to provide positive reward for desirable behaviors such as the robot moving towards and reaching its goal, and to provide negative reward when the robot exhibits undesirable behaviors such as colliding with obstacles. We train our DRLbased collision avoidance policy based on the reward function using a wellknown method known as Proximal Policy Optimization (PPO) [PPO].
Post training, a collision free velocity at each time instant is sampled from a trained policy as:
(4) 
where and are the velocity and the observation spaces at time t, respectively.
Iv Our Approach
In this section, we explain the construction of our novel observation space, reward function and our network architecture.
Iva Observation Space Generation
The manner in which data in the observation space is represented is crucial for the policy training’s success, since an important component of policy training is correlating the observations with desired actions. An ideal observation space must balance sufficiently encoding data about the environment and maintaining a low dimensionality. A low dimensional observation space does not require a complicated network architecture to process it, which in turn leads to lower training times.
In our work, the observation space is constructed using the robot’s current feasible velocities (equation 2) and the environment’s state is embedded as a matrix of costs for using those velocities for the current and past n1 time instants. The action space is a single [v, ] vector that the robot must execute at the current time instant that is computed by our fully trained DRL policy considering the time evolution of the environment’s state for n time instants. The steps used in the observation space construction are detailed below.
IvA1 Dynamically Feasible Velocity Vectors
Unlike DWA, we do not first generate an admissible velocity set that contains collisionfree robot velocities. Instead, we first compute sets of feasible/reachable linear and angular velocities (lin = and ang = ) using equation 2. We discretize sets lin and ang into k intervals such that the total number of [v, ] vectors obtained from the intervals is . We then form the set of feasible velocities from these discretized sets as,
(5) 
The velocity vectors in do not account for the locations of the obstacles in the current time instant or the past n1 time instants. Therefore some velocities in could lead to collisions. The velocity vectors in are appended n1 times as column vectors in two matrices of size () and the generated linear and angular velocity matrices are shown in the Fig. 3(a)[top].
IvA2 Obstacle sets
We use the lidar scan to get the location of the obstacles around the robot. For each time instant, the obstacle locations are obtained relative to the fixed odometry frame of the robot and stored in a set. The odometry frame is attached to the ground at the location from which the robot started (see 4). Consider Fig. 4(a), the locations of two obstacles in the current as well as in the past n1 time steps are shown. We add the set of obstacle locations in a list of length n (see Fig. 4(b)), where each column shows the set of obstacle locations for a specific time instant. We use list to incorporate the information of the motion of various obstacles in the environment.
IvA3 Obstacle cost calculation
We next calculate the obstacle cost similar to the dist() function used in equation 3 for every velocity vector in . Each vector in is forward simulated for a time to check if it leads to a collision, given the obstacle positions in . The velocity vectors that lead to a collision are given higher costs and the velocities that are collisionfree are assigned cost inversely proportional to the least distance of the obstacle from the arc generated after executing the velocity vector.
(6) 
The Fig.3(a)[top] shows the obtained ()matrix of the obstacle costs.
IvA4 Goal alignment cost calculation
Each is forward simulated for a time and the distance from the robot and its goal is measured. The velocity vectors that reduce the distance between the robot and the goal location are given a low cost. The goal alignment cost is independent of the location of the obstacles around the robot, therefore the same cost for each pair is appended for n times to obtain a goal alignment cost matrix of shape () as seen in Fig. 3(a)[top].
(7) 
Here, EndPoint function gives the coordinates of the end point of the trajectory generated by executing the velocity vector and is a constant, we set = 2.5.
IvA5 Total cost calculation
The total cost for the robot using a vector [] for the current time instant is calculated as,
(8) 
and is shown in Fig.3(b)[top].
IvB PreProcessing
The linear, angular, obstacle cost and goal alignment cost matrices obtained previously are now preprocessed to better represent which velocities in had the least costs given the obstacle positions for the past n time instants.
IvB1 Sorting the velocity vectors
The velocity vectors are sorted in the descending order according to the total cost of the velocity vectors at the current time instant. The sorting step is shown in the Fig. 3(b). The elements in the linear, angular velocities, obstacle costs and goal alignment costs matrices are then rearranged in a similar manner (Fig.3(a)[bottom]).
IvB2 Observation space
Finally, the preprocessed linear and angular matrices along with their obstacle as well as the goal alignment cost matrices are stacked up to get a matrix of size (), which is then passed to the policy network (see Fig.2).
IvC Network Architecture
The policy network architecture used in this hybrid local planner is shown in Fig.5
. Five 2D convolutional layers, followed by 3 fully connected layers are used for processing the observation space. ReLU activation is applied to the output of the hidden layers before it is mapped to the action space vector
. This action space vector holds the index of the ordered velocity pairs of the current time frame. This architecture is much simpler and requires fewer layers for handling our observation space.IvD Reward Shaping
The primary aim of the policy being trained is to reach the goal while steering clear of static and dynamic obstacles while enabling smooth navigation. Rewards for the basic navigation task of reaching the target and collision against obstacles are provided with higher positive or negative rewards correspondingly.
In order to enable faster convergence, the difference of distance from the goal in previous time step and current time instant is utilized. This incentivizes the policy to move the robot closer to goal every time step or get penalized as,
(9) 
At any time instant t, the collision penalty is defined as,
(10) 
When a dynamic obstacle (located at ) is in the sweep radius of the robot (located at ) relative to the odometry frame, we define the following parameters,
(11) 
(12) 
Since the policy must be capable of operating even in dense human crowds, if it computes a trajectory in the same relative direction as a dynamic obstacle (say a pedestrian), it increases the chance of collision. Trajectories that steer the robot to be on opposite side of a dynamic obstacle’s heading direction (behind the obstacle) are therefore rewarded positively and all other trajectories are penalized based on the proximity to the obstacle (see Fig.6). This reward/penalty is denoted as,
(13) 
In case of a obstacle moving headon, is zero and total steering reward is zero. In the presence of multiple dynamic obstacles proximal to the robot, the union of the red zones as shown in the figure is negatively rewarded.
This is also supplemented by providing negative rewards inversely proportional to distance from obstacles near the robot. This reduces the danger of collision as negative reward is accumulated as the robot approaches the obstacle.
(14) 
We set .
V Results and Evaluations
Va Implementation
We train the deep reinforcement learning policy in a simulator and test the trained models in the simulation as well as on the real robot. We use ROS Melodic and Gazebo 9 to create the simulation environments on a workstation with an IntelXeon 3.6GHz processor and an Nvidia GeForce RTX 2080TiGPU. We implement the policy network using tensorflow and use the PPO2 implementation provided by stable baselines to train the policy. We simulate multiple copies of Turtlebot 2 model with a Hokuyo Lidar in parallel to train the models in the simulation. The different copies of Turtlebot 2 are deployed in different scenarios of the same simulation environment, this makes sure that the model does not over fit to a particular scenario.
VB Testing Scenarios
We test the performance of our approach and the prior approaches in different high fidelity simulation environments. The simulation environments that we use, characterize the pedestrian robot interactions in the different indoor and outdoor scenarios. We assume that the robot is taking the full responsibility of the collisions. We evaluate our method and the prior approaches in the following test environments,
Zigzag static: This scenario contains many turns with number of static obstacles. This scenario tests the performance of various approaches in densely cluttered static environment.
OccludedPed: Scenario with many turn and pedestrians that are occluded by the walls. Using this scenario, we test the how different approaches handle the obstacles when they are suddenly introduced in the sensor’s field of view.
Sparse dynamic environment: Scenario contains sparse dynamic pedestrians, the environment contains 4 pedestrians. The pedestrians are walking at 45 and 90 with the line connecting the robot’s start and goal locations. This scenario mainly evaluate the robot’s ability to handle dynamic obstacles and plan a path that doesn’t lead to collision.
Dense dynamic environment: This scenario has a total of 16 pedestrians. It contains both static and dynamic pedestrians. It tests the behaviour of various algorithm in the dense dynamic environments.
VC Evaluation Metrics
We compare our approach with two other classical and learning based collision avoidance techniques: (i) Dynamic Window Approach [DWA] (ii) Long et al. [Multirobot_collison_avoidance]. We also provide some ablation study results to show effects of various design choices that we made in our approach. We use the following metrics to compare the performance of various collision avoidance algorithms and the ablation study models. We use the following metrics to compare the performance of various collision avoidance algorithms.

Success Rate  The number of times the robot reached the goal without colliding with the obstacles over 50 randomly initialized episodes.

Average Trajectory Length  It is the total distance traversed by the robot, until the goal is reached, averaged over the number of successful attempts.

Average Velocity  The average velocity of the robot is computed as the trajectory length over the time taken to reach the goal in a successful trial.
VD Analysis and Comparison
The performance comparison of our method versus the prior approaches in various test environments is show in Table. II. The results of ablation study showing the benefits of adding positive reinforcement and 4 layer observation space are given in the Table. III and IV.
Metrics  Method  Sparse Dynamic Env.  Zigzag static  OccludedPed  Dense Dynamic Env. 
Success Rate  Hybrid RL (Our Method)  0.54  1.0  1.0  0.4 
DRL  0.3342  1.0  .76  .31  
DWA  0.42  1.0  0.9  0.3  
Avg Trajectory Length (m)  Hybrid RL (Our Method)  11.95  28.85  27.26  12.23 
DRL  12.1  26.89  25.63  11.57  
DWA  11.99  26.62  25.07  12.81  
Avg Velocity (m/s)  Hybrid RL (Our Method)  .42  .38  .46  .37 
DRL  .37  .42  .51  .503  
DWA  .6  .435  .568  .64 
We present the comparison of our method with prior collision avoidance methods in Table. II. We can observe that in terms of success rate all the approaches perform well in the Zigzag static scenario. But when the number of dynamic obstacles increases in the environment we can clearly see our method performs better compared all other methods. The DWA approach does not consider the temporal information while navigating through the dynamic obstacles and this lead to collisions as the density of the dynamic obstacles increases in the environment. Whereas our approach uses the information of the past n time steps to make navigation decisions also formulation of the reward function helps the policy to handle dynamic obstacles better compared to other learning based approaches. In the Occluded Ped test scenario, the pedestrians are suddenly introduced in the field of view of the lidar sensor. The DWA and the DRL approach can not handle the occluded pedestrians well. However, our approach with limited temporal information can still perform better.
Ablation Study for the Positive Reinforcement We show the benefits of adding positive reinforcement for in the reward function. For the ablation study, we compare two models trained with and without positive reinforcement in different test environments. From the Table III, we can clearly see that the model that is trained with positive reinforcement out performs the model that is trained without positive reinforcement in all the test environments.
Metrics  Method  Sparse Dynamic Env.  Zigzag Static  OccludedPed  Dense Dynamic Env. 
Success Rate  With positive reinforcement  0.54  1.0  1.0  .42 
Without positive reinforcement  .44  .86  .64  .4  
Avg Trajectory Length  With positive reinforcement  11.95  28.85  27.26  12.46 
Without positive reinforcement  11.6  28.12  27.86  12.78  
Avg Velocity  With positive reinforcement  .42  .37  .47  .38 
Without positive reinforcement  .41  .34  .41  .34 
Ablation Study for the Observation Space We use the observation space of 4 different matrices as show in Fig. 3. The observation space includes a linear velocity frame, angular velocity frame, obstacle cost frame and goal alignment cost. We compare this formulation of the observation space with a 3 frame observation space where we use a linear velocity frame, angular velocity frame and total cost frame. We hypothesis that the 4 frame observation space can better impart the information about the obstacles around the robot as the obstacle cost is provided separately whereas in the 3 layer observation space the obstacle cost is added in the goal alignment cost to obtain the total cost. For this ablation study, we train models with a 3 frame and 4 frame observation space, we keep all other parameters the same for both of the models. The results for both the models are shown in Table IV. We can observe that the 4 frame model performs better compared to the 3 frame model in all of the test scenarios.
Metrics  Method  Sparse Dynamic Env.  Zigzag static  OccludedPed  Dense Dynamic Env. 
Success Rate  3 layer  .36  .82  .94  .36 
4 layer  0.54  1.0  1.0  .42  
Avg Trajectory Length  3 layer  11.56  27.93  27.04  12.34 
4 layer  11.95  28.85  27.26  12.46  
Avg Velocity  3 layer  .46  .38  .44  .37 
4 layer  .42  .37  .47  .38 
Dynamics Constraints Violation We measure the number of dynamics constraints violations by our approach and the learning based collision avoidance techniques. The Fig. 7 shows the graph of linear velocities generated by the DRL approach in the Sparse Dynamic environment. The red circles marked on the figure shows the time steps at which the method violated the dynamics constraints. We observed when the robot starts from the initial location or when the robot stops to avoid the collision it most of the time chose the velocities that were not in the dynamically feasible velocity space. In the complete trajectory from the start to the goal location, the DRL method violated the dynamics constraints for 10 % of the total time steps.
Vi Conclusions, Limitations and Future Work
We present a novel formulation of the deep reinforcement learning policy that generates dynamically feasible and spatially aware smooth velocities. Our method rectifies the issues associated with the learning based approaches and the classical Dynamic Window Approach. We validate our approach in the simulation and in the real world environments. We compare our approach with the other collision avoidance techniques in terms of the collision rate, average trajectory length, average velocity and dynamically feasible velocity vector generation. We also implement our approach on two differential drive robots: Clearpath Turtlebot2 and Jackal, and show the performance improvements in the real world. Our work has several limitations as well. The model uses the information of past few seconds to make decision, if any obstacle is suddenly introduced in the field of view of the robot then the policy might not have enough information to make informed decisions and the robot might freeze. The current model uses convolutional neural network as the policy network, but in future we would like to train a model based on LSTM [LSTM] policy and we would also like to modify the reward function to get better navigation behavior.